|
Yes. The only requirement is that you properly acknowledge the use of
Carrot2 (on your project's website and in its documentation) and let
us know about your project. Please also remember to read the license.
Please put a statement equivalent to "This product includes software
developed by the Carrot2 Project" on your site and link it to
Carrot2's website (http://www.carrot2.org). Additionally,
you can use some of our powered-by logos
if you like.
No. Carrot2 only adds clustering of search results to an
existing search engine. You can use an open source project called Nutch to crawl your website. Nutch has
a Carrot2-based search clustering plugin, so you'll get
crawling, searching and clustering in one piece. If you need help
with any of these, please contact us.
-
In order to compile Carrot2 you will need the Java Software
Development Kit, version 1.4.2 or newer, and Apache Ant.
-
Check out Carrot2 source code from the SVN repository:
svn co https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk/ carrot2
-
To build all Carrot2 components, go to the
trunk/carrot2 directory and run:
ant build
-
To build only one application, go to its directory (e.g.
trunk/carrot2/applications/carrot2-demo-dcs) and run Ant with
the default target:
ant build
In the application's tmp/dist directory you'll find the
compiled application. Please refer to the application-specific
readme.txt file for instructions on how to run the
application.
Such an integration depends on what infrastructure is already available in your project. Carrot2
requires a feed of documents (search results), so typically you'll need a search engine
that crawls your site. Such an engine can be local to your Web site (proprietary intranet solutions,
search engines built on top of Nutch or ht://dig),
but it can equally be a global search engine with searches restricted to your domain (Google, Yahoo).
Once a search engine is available, the integration depends on the
technology your site or software uses for rendering the user interface
(or more accurately: for implementing application logic). Software
written in Java can use Carrot2 directly, in the way
shown in the end-to-end example
code (JavaDoc).
Sites written in Perl, PHP, .NET and other languages can use the
Carrot2 Document Clustering Server. For more details, see the dedicated FAQ.
Finally, in some cases you might want to re-use and
customize (through XSLT) some bits of Carrot2's web
application (located in the carrot2/applications/carrot2-demo-webapp
folder of the source repository) to e.g. visualize clusters.
Note that Carrot2 integration requires some Java development
skills and familiarity with Java development tools (Eclipse, Apache
Ant). The example code and JUnit tests available in the open source project demonstrate
various ways of using Carrot2. The project's mailing
list can also be of some help if you get stuck.
We have also prepared a step-by-step example of using the Carrot2 API directly. It is
available in the source code repository.
If you'd rather pay for having the integration done quickly and professionally, Carrot Search
provides consulting services (priced at approximately 60 EUR per hour).
Although Carrot2 does not have native ports to non-Java platforms, such
as .NET, PHP, Ruby, Perl etc., it can be easily integrated with them
using the Carrot2 Document Clustering Server (DCS). The DCS exposes Carrot2 clustering as an HTTP/REST service. Essentially, you make an
HTTP POST request with an XML document containing the documents you want to have
clustered, and the DCS responds with an XML document containing the clusters
created by Carrot2. For quick integration with Ruby, a JSON output
format is also available. Finally, for batch processing, a simple
command-line application is provided.
To get started with the Carrot2 DCS, download
the latest version and uncompress the archive to some local directory.
Run the DCS providing the port number it should bind to, e.g.:
dcs -port 9090
When the DCS initializes correctly, you should see the following
messages on the console:
[11:28:35,734 INFO] Initializing components.
[11:28:36,031 INFO] Loaded algorithm: haog-fi-en
[11:28:36,046 INFO] Loaded algorithm: haog-stc-en
[11:28:36,046 INFO] Setting the context-level default process id to: lingo-cla..
[11:28:36,234 INFO] Loaded algorithm: lingo-classic
[11:28:36,234 INFO] Loaded algorithm: rough-kmeans
[11:28:36,250 INFO] Loaded algorithm: stc-en
[11:28:36,250 INFO] Finished initializing components.
[11:28:36,250 INFO] Starting standalone DCS server.
[11:28:36,593 INFO] Console mode, skipping configuration in web.xml.
[11:28:36,609 INFO] Accepting HTTP requests on port: 9090
Point your browser to http://localhost:9090/, where you will
find further instructions. See the examples/queries
directory of the DCS distribution for some example document sets in
the DCS format.
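The request side of the exchange described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the element names (searchresult, query, document, title, url, snippet) are assumed here rather than taken from this FAQ, and so is sending the XML as the raw request body. The files in examples/queries and the instructions page served by the DCS are the authority on the actual format and request URL. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; element names and request layout are assumptions. */
class DcsClientSketch {

    /** Escapes the five XML special characters so arbitrary text can be embedded. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    /** Builds the request body; each row of documents is {title, url, snippet}. */
    static String buildRequestXml(String query, String[][] documents) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<searchresult>\n");
        xml.append("  <query>").append(escapeXml(query)).append("</query>\n");
        for (int i = 0; i < documents.length; i++) {
            xml.append("  <document id=\"").append(i).append("\">\n");
            xml.append("    <title>").append(escapeXml(documents[i][0])).append("</title>\n");
            xml.append("    <url>").append(escapeXml(documents[i][1])).append("</url>\n");
            xml.append("    <snippet>").append(escapeXml(documents[i][2])).append("</snippet>\n");
            xml.append("  </document>\n");
        }
        xml.append("</searchresult>\n");
        return xml.toString();
    }

    /** Sends the XML with HTTP POST and returns the HTTP status code. */
    static int post(String endpoint, String xml) throws IOException {
        HttpURLConnection connection =
            (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return connection.getResponseCode();
    }

    public static void main(String[] args) {
        // Only prints the request body; no server is contacted here.
        System.out.println(buildRequestXml("data mining", new String[][] {
            {"Data mining", "http://example.com/dm", "An example snippet."}
        }));
    }
}
```

To actually send the request, pass the request URL given on the DCS instructions page (on the port chosen at startup, 9090 in the example above) to post together with the generated XML.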
To run the Carrot2 Demo Browser directly from Eclipse, please follow these steps:
-
Check out Carrot2 source code from the SVN repository:
svn co https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk/ carrot2
-
Import all Carrot2 projects into your workspace:
-
From the Package Explorer's context menu choose Import...
-
In the first step of the Import wizard, choose General -> Existing projects into
Workspace and click Next.
-
In the next step of the wizard, in the Select root directory field provide
the path to your local Carrot2 checkout and click
Finish.
-
The Eclipse compile process will fail because of undefined
classpath variables: ANT_HOME and
CARROT2_CHECKOUT_BASE. To define these variables open the
Preferences window (Window ->
Preferences...) and then go to (Java
-> Build Path -> Classpath variables). Make the
ANT_HOME variable point to your local Ant installation and
CARROT2_CHECKOUT_BASE to your local Carrot2
repository checkout.
-
Clean all projects (Project ->
Clean...) and let Eclipse compile everything again, this
time without errors.
-
Run the Carrot2 Demo Browser using the Run... toolbar icon (Eclipse should have
automatically created the appropriate launch entry during project
import).
To feed the Carrot2 Demo Browser directly from a local Lucene index, follow these steps:
-
Run the Carrot2 Demo Browser.
-
In the Process combo box select
Lucene Index -- Lingo Classic
Clusterer.
-
Click the Settings button and then
the Edit button in the Lucene index location section.
-
In the file browser, provide the path to your Lucene
index directory. A dialog for configuring the index will appear.
-
In the Search fields section,
select the Lucene fields to be searched (hold down the Ctrl key for
multiple selections).
-
In the Results fields section,
select which Lucene fields should be mapped to the URL,
document title and document snippet.
-
Finally, in the Analyzer
section, choose the analyzer to be used.
-
Click the OK buttons in the
index configuration and process configuration dialogs.
The demo application running under a Web application container (such as Tomcat)
relies on proper decoding of Unicode characters from the request URI. This decoding
is done by the container and must be configured at the container level.
Unfortunately, the configuration differs slightly from container to container (it
is not part of the J2EE standard).
For Tomcat, you can enforce the character encoding used for URI decoding at the connector configuration
level. Locate the server.xml file inside Tomcat's conf folder
and add the following attribute to the Connector section:
URIEncoding="UTF-8"
An example connector configuration should look like this:
<Connector port="8080"
maxThreads="25" minSpareThreads="5" maxSpareThreads="10"
minProcessors="5" maxProcessors="25" enableLookups="false"
redirectPort="8443" acceptCount="10" debug="0" connectionTimeout="20000"
URIEncoding="UTF-8" />
-
Download the Carrot2 web application
and extract the WAR file from the archive.
-
Unpack the WAR file using your favourite ZIP unpacker.
-
In order to remove a search tab, delete the corresponding
*.bsh file from the inputs/ directory. For
example, to remove the Wiki tab, delete
05-input-yahooapi-wiki.bsh.
-
In order to add a new tab to the web application, assuming the
data source is already supported by
Carrot2, it's best to clone and modify one of the existing
*.bsh files from the inputs/ directory.
When cloning an existing file, please make sure to change the
component identifier:
LoadedComponentFactory loaded =
new LoadedComponentFactory("input-yahooapi-put-copy", factory);
Using the section shown below, you can customize:
- tab.name -- title of the tab
- tab.accel -- shortcut key for the tab (make sure the
letter is contained in the tab's name)
- tab.description -- description of the tab to be shown as a tool tip
- tab.description.startup -- description of the tab to
be shown on the startup screen
- tab.exampleQueries -- example queries to be shown on
the startup screen, separated by the | character
- tab.icon -- a 16 x 16 icon displayed on the tab (put the file in the
skins/fancy/inputs/ directory)
- tab.ignoreOnError -- set to true to hide
the tab if an error occurs when initializing it
- tab.default -- set to true to make the tab the default active tab
loaded.setProperties(new String [][] {
{"tab.name", "Web"},
{"tab.accel", "W"},
{"tab.description", "Search the Web with www.etools.ch"},
{"tab.description.startup", "Carrot Clustering Engine will ..."},
{"tab.exampleQueries", "data mining|london|clustering"},
{"tab.icon", "web.gif"},
{"tab.ignoreOnError", "false" },
{"tab.default", "true"}
});
The
following data sources are already available:
-
Using your favourite tool, ZIP all the files back into a WAR
file and install it in your servlet container.
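Before repackaging, it can be worth sanity-checking the tab properties of a cloned *.bsh file. The sketch below is a hypothetical stand-alone helper, not part of Carrot2: it takes property pairs in the same String[][] shape as the setProperties call above, splits tab.exampleQueries on the | character, and checks that the tab.accel letter occurs in the tab's name. Note that String.split takes a regular expression, so the pipe has to be escaped.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for checking tab properties; not a Carrot2 class. */
class TabPropertiesSketch {

    /** Turns {key, value} pairs into a map, keeping the declaration order. */
    static Map<String, String> toMap(String[][] properties) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String[] pair : properties) {
            map.put(pair[0], pair[1]);
        }
        return map;
    }

    /** Splits tab.exampleQueries on '|'; the pipe is escaped because split() takes a regex. */
    static List<String> exampleQueries(Map<String, String> properties) {
        String raw = properties.get("tab.exampleQueries");
        if (raw == null || raw.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(raw.split("\\|"));
    }

    /** Returns true if the shortcut letter occurs in the tab's name (case-insensitive). */
    static boolean accelMatchesName(Map<String, String> properties) {
        String name = properties.get("tab.name");
        String accel = properties.get("tab.accel");
        if (name == null || accel == null || accel.isEmpty()) {
            return false;
        }
        return name.toLowerCase().contains(accel.toLowerCase());
    }

    public static void main(String[] args) {
        Map<String, String> tab = toMap(new String[][] {
            {"tab.name", "Web"},
            {"tab.accel", "W"},
            {"tab.exampleQueries", "data mining|london|clustering"}
        });
        System.out.println(exampleQueries(tab));
        System.out.println(accelMatchesName(tab));
    }
}
```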
Yes. While the query is usually very helpful to get rid of the obvious
meanings related to the documents in the search results set, it is not
obligatory -- the clustering algorithms will cope without the query.
Lingo has 3 parameters that influence the number and contents of
clusters it creates:
- Cluster Assignment Threshold -- determines
how precise the assignment of documents to clusters should be. For
low values of this threshold, Lingo will assign more documents to
clusters, which will result in fewer documents ending up in "Other
topics", but also in some irrelevant documents making their way into the
clusters. For high values of this parameter, Lingo will assign fewer
documents to clusters, which will result in better assignment
precision, but also in more documents in "Other topics" and fewer
clusters being created.
- Candidate cluster threshold -- determines how
many clusters Lingo will try to create; higher values of the
parameter will give more clusters. However, it is not possible to
predict the exact number of clusters from the value of this
parameter before clustering is actually run.
- Preferred cluster count -- determines the
maximum number of clusters Lingo will create (excluding the "Other
topics" cluster).
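The trade-off behind the assignment threshold can be illustrated with a toy model. The sketch below is not Lingo's actual algorithm: it assumes each document already has a made-up similarity score for every cluster label, assigns a document to each cluster whose score reaches the threshold, sends unassigned documents to "Other topics", and drops clusters that end up empty. All names and numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of threshold-based assignment; not Lingo's implementation. */
class AssignmentThresholdSketch {
    static final String OTHER_TOPICS = "Other topics";

    /**
     * scores[d][c] is the (made-up) similarity of document d to label c.
     * A document joins every cluster whose score is at least the threshold.
     */
    static Map<String, List<Integer>> assign(String[] labels, double[][] scores, double threshold) {
        Map<String, List<Integer>> clusters = new LinkedHashMap<>();
        for (String label : labels) {
            clusters.put(label, new ArrayList<>());
        }
        clusters.put(OTHER_TOPICS, new ArrayList<>());

        for (int d = 0; d < scores.length; d++) {
            boolean assigned = false;
            for (int c = 0; c < labels.length; c++) {
                if (scores[d][c] >= threshold) {
                    clusters.get(labels[c]).add(d);
                    assigned = true;
                }
            }
            if (!assigned) {
                clusters.get(OTHER_TOPICS).add(d);
            }
        }
        // Clusters that attracted no documents are not reported at all.
        clusters.entrySet().removeIf(
            e -> e.getValue().isEmpty() && !e.getKey().equals(OTHER_TOPICS));
        return clusters;
    }

    public static void main(String[] args) {
        String[] labels = {"Data Mining", "London"};
        double[][] scores = {
            {0.9, 0.1},
            {0.6, 0.2},
            {0.3, 0.5},
            {0.2, 0.1}
        };
        System.out.println("low threshold:  " + assign(labels, scores, 0.25));
        System.out.println("high threshold: " + assign(labels, scores, 0.7));
    }
}
```

With these sample scores, the low threshold leaves only one document in "Other topics" but admits the weakly related document 2 into "Data Mining", while the high threshold keeps a single cluster with one document and moves the remaining three to "Other topics".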
|
|