
FAQ

Carrot2: open source framework for building search clustering engines
Yes. The only requirement is that you properly acknowledge the use of Carrot2 (on your project's website and in its documentation) and let us know about your project. Please also remember to read the license.
Please put a statement equivalent to "This product includes software developed by the Carrot2 Project" on your site and link it to Carrot2's website (http://www.carrot2.org). Additionally, you can use some of our powered-by logos if you like.
No. Carrot2 can add clustering of search results to an existing search engine. You can use an Open Source project called Nutch to crawl your website. Nutch has a Carrot2-based search clustering plugin, so you'll get all crawling, searching and clustering in one piece. If you need help with any of these, please contact us.
  1. To compile Carrot2 you will need the Java Software Development Kit (JDK) version 1.4.2 or newer and Apache Ant.
  2. Check out Carrot2 source code from the SVN repository:
    svn co https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk/ carrot2
  3. To build all Carrot2 components, go to the trunk/carrot2 directory and run:
    ant build
  4. To build only one application, go to its directory (e.g. trunk/carrot2/applications/carrot2-demo-dcs) and run Ant with the default target:
    ant build
    In the application's tmp/dist directory you'll find the compiled application. Please refer to the application's specific readme.txt files for instructions on how to run the application.

Such an integration depends on what infrastructure is already available in your project. Carrot2 requires a feed of documents (search results), so typically you'll need a search engine that crawls your site. Such an engine can indeed be local to your Web site (proprietary solutions in intranets, search engines built on top of Nutch or ht://dig), but it can equally be a global search engine with searches restricted to your domain (Google, Yahoo).

Once a search engine is available, the integration depends on the technology your site or software uses for rendering the user interface (or more accurately: for implementing application logic). Software written in Java can use Carrot2 directly, in the way shown in the end-to-end example code (JavaDoc). Sites written in Perl, PHP, .NET and other languages can use the Carrot2 Document Clustering Server; for more details, see the dedicated FAQ entry. Finally, in some cases you might want to re-use and customize (through XSLT) some parts of Carrot2's web application (located in the carrot2/applications/carrot2-demo-webapp folder of the source repository), e.g. to visualize clusters.

Note that Carrot2 integration requires some Java development skills and familiarity with Java development tools (Eclipse, Apache Ant). The example code and JUnit tests available in the open source project demonstrate various ways of using Carrot2. The project's mailing list can also help if you get stuck. We have also prepared a step-by-step example of using the Carrot2 API directly; it is available in the source code repository.

If you'd rather pay to have the integration done quickly and professionally, Carrot Search provides consulting services (billed at approximately 60 EUR per hour).

Although Carrot2 does not have native ports to non-Java platforms such as .NET, PHP, Ruby or Perl, it can be easily integrated with them using the Carrot2 Document Clustering Server (DCS). The DCS exposes Carrot2 clustering as an HTTP/REST service. Essentially, you make an HTTP POST request with an XML document containing the documents you want to have clustered, and the DCS responds with an XML document containing the clusters created by Carrot2. For quick integration with Ruby, a JSON output format is also available. Finally, for batch processing, a simple command-line application is provided.

To get started with the Carrot2 DCS, download the latest version and uncompress the archive to some local directory. Run the DCS providing the port number it should bind to, e.g.:

dcs -port 9090

When the DCS initializes correctly, you should see the following messages on the console:

[11:28:35,734 INFO] Initializing components.
[11:28:36,031 INFO] Loaded algorithm: haog-fi-en
[11:28:36,046 INFO] Loaded algorithm: haog-stc-en
[11:28:36,046 INFO] Setting the context-level default process id to: lingo-cla..
[11:28:36,234 INFO] Loaded algorithm: lingo-classic
[11:28:36,234 INFO] Loaded algorithm: rough-kmeans
[11:28:36,250 INFO] Loaded algorithm: stc-en
[11:28:36,250 INFO] Finished initializing components.
[11:28:36,250 INFO] Starting standalone DCS server.
[11:28:36,593 INFO] Console mode, skipping configuration in web.xml.
[11:28:36,609 INFO] Accepting HTTP requests on port: 9090

Point your browser to http://localhost:9090/, where you will find further instructions. See the examples/queries directory of the DCS distribution for some example document sets in the DCS format.
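As a rough illustration of the request side of the DCS exchange described above, the following Java sketch builds an XML document holding a query and a few search results. The element names used here (searchresult, query, document, title, url, snippet) are an assumption and should be checked against the files in the examples/queries directory of your DCS distribution. The class and method names are hypothetical and not part of Carrot2. The request URL and parameters are deliberately left out; they are described on the DCS start page.

```java
class DcsRequestSketch {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escapeXml(String text) {
        // The ampersand must be replaced first so that the entities
        // introduced by the later replacements are not escaped again.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /**
     * Builds a request document. Each row of documents is {title, url, snippet}.
     * Element names are assumed; compare with the examples/queries files.
     */
    static String buildRequestXml(String query, String[][] documents) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<searchresult>\n");
        xml.append("  <query>").append(escapeXml(query)).append("</query>\n");
        for (int i = 0; i < documents.length; i++) {
            String[] doc = documents[i];
            xml.append("  <document id=\"").append(i).append("\">\n");
            xml.append("    <title>").append(escapeXml(doc[0])).append("</title>\n");
            xml.append("    <url>").append(escapeXml(doc[1])).append("</url>\n");
            xml.append("    <snippet>").append(escapeXml(doc[2])).append("</snippet>\n");
            xml.append("  </document>\n");
        }
        xml.append("</searchresult>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        String[][] documents = {
            {"Data mining", "http://example.com/1", "An overview of data mining methods."},
            {"Clustering", "http://example.com/2", "Grouping similar documents together."}
        };
        // Print the payload that would be sent to the DCS in the POST request.
        System.out.print(buildRequestXml("data mining", documents));
    }
}
```

Because the DCS is driven purely by HTTP and XML, the same payload can be produced from PHP, Perl, Ruby or .NET with whatever XML facilities those platforms offer.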

You can use Carrot2's built-in Lucene input component as shown in this well-commented example code (JavaDoc).
To run Carrot2 Demo Browser directly from Eclipse, please follow these steps:
  1. Check out Carrot2 source code from the SVN repository:
    svn co https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk/ carrot2
  2. Import all Carrot2 projects into your workspace:
    1. From the Package Explorer's context menu choose Import...
    2. In the first step of the Import wizard, choose General -> Existing projects into Workspace and click Next.
    3. In the next step of the wizard, in the Select root directory field provide the path to your local Carrot2 checkout and click Finish.
  3. The Eclipse compile process will fail because of undefined classpath variables: ANT_HOME and CARROT2_CHECKOUT_BASE. To define these variables open the Preferences window (Window -> Preferences...) and then go to (Java -> Build Path -> Classpath variables). Make the ANT_HOME variable point to your local Ant installation and CARROT2_CHECKOUT_BASE to your local Carrot2 repository checkout.
  4. Clean all projects (Project -> Clean...) and let Eclipse compile everything again, this time without errors.
  5. Run the Carrot2 Demo Browser using the Run... toolbar icon (Eclipse should have automatically created the appropriate launch entry during project import).
To feed the Carrot2 Demo Browser directly from a local Lucene index follow these steps:
  1. Run the Carrot2 Demo Browser.
  2. In the Process combo box select Lucene Index -- Lingo Classic Clusterer.
  3. Click the Settings button and then the Edit button in the Lucene index location section.
    1. In the file browser, provide the path to your Lucene index directory. A dialog for configuring the index will appear.
    2. In the Search fields section, select the Lucene fields to be searched (hold down Ctrl key for multiple selections).
    3. In the Results fields section, select which Lucene fields should be mapped to the URL, document title and document snippet.
    4. Finally, in the Analyzer section, choose the analyzer to be used.
    5. Click the OK buttons in the index configuration and process configuration dialogs.

The demo application running under a Web application container (such as Tomcat) relies on proper decoding of Unicode characters from the request URI. This decoding is done by the container and must be properly configured at the container level.

Unfortunately, this configuration differs from container to container (it is not part of the J2EE standard).

For Tomcat, you can enforce the URI decoding codepage at the connector configuration level. Locate the server.xml file inside Tomcat's conf folder and add the following attribute to the Connector section:

URIEncoding="UTF-8"

An example connector configuration should look like this:

<Connector port="8080"
    maxThreads="25" minSpareThreads="5" maxSpareThreads="10"
    minProcessors="5" maxProcessors="25" enableLookups="false"
    redirectPort="8443" acceptCount="10" debug="0" connectionTimeout="20000" 
    URIEncoding="UTF-8" />
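To see why the decoding charset matters, the following self-contained Java sketch decodes the same percent-encoded query value twice: once as UTF-8 and once as ISO-8859-1, which was the default URI encoding in older Tomcat versions. It only illustrates the effect using the standard java.net.URLDecoder class; it is not the code Tomcat or Carrot2 runs, and the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UriDecodingDemo {
    /** Decodes a percent-encoded value using the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // U+0142 (Polish "l with stroke") percent-encoded as UTF-8,
        // which is how browsers typically send non-ASCII query characters.
        String encoded = "%C5%82";

        String asUtf8 = decode(encoded, "UTF-8");
        String asLatin1 = decode(encoded, "ISO-8859-1");

        // Decoded as UTF-8, the two bytes form the single intended character.
        System.out.println(asUtf8.equals("\u0142"));         // prints true
        // Decoded as ISO-8859-1, each byte becomes a separate, wrong character.
        System.out.println(asLatin1.equals("\u00C5\u0082")); // prints true
    }
}
```

With the wrong charset the query reaches the clustering engine already corrupted, which is why the setting has to be fixed at the container level rather than in the application.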
  1. Download the Carrot2 web application archive and extract the WAR file from it.
  2. Unpack the WAR file using your favourite ZIP unpacker.
  3. To remove a search tab, delete the corresponding *.bsh file from the inputs/ directory. For example, to remove the Wiki tab, delete 05-input-yahooapi-wiki.bsh.
  4. In order to add a new tab to the web application, assuming the data source is already supported by Carrot2, it's best to clone and modify one of the existing *.bsh files from the inputs/ directory. When cloning an existing file, please make sure to change the component identifier:

    LoadedComponentFactory loaded = 
      new LoadedComponentFactory("input-yahooapi-put-copy", factory);
    

    Using the section shown below, you can customize:

    • tab.title -- title of the tab
    • tab.accel -- shortcut key for the tab (make sure the letter is contained in the tab's name)
    • tab.description -- description of the tab to be shown as a tool tip
    • tab.description.startup -- description of the tab to be shown on the startup screen
    • tab.exampleQueries -- example queries to be shown on the startup screen, separated by the | character
    • tab.icon -- a 16 x 16 icon displayed on the tab (put the file in the skins/fancy/inputs/ directory)
    • tab.ignoreOnError -- set to true to hide the tab if an error occurs when initializing it
    • tab.default -- set to true to make the tab the default active tab

    loaded.setProperties(new String [][] {
      {"tab.name", "Web"},
      {"tab.accel", "W"},
      {"tab.description", "Search the Web with www.etools.ch"},
      {"tab.description.startup", "Carrot Clustering Engine will ..."},
      {"tab.exampleQueries", "data mining|london|clustering"},
      {"tab.icon", "web.gif"},
      {"tab.ignoreOnError", "false" },
      {"tab.default", "true"}
    });
    The following data sources are already available:
    • 00-input-etools.bsh -- eTools meta search engine.
    • 01-input-yahooapi.bsh -- YahooAPI web search data source in its default configuration.
    • 02-input-googleapi.bsh -- GoogleAPI web search data source in its default configuration. Please note that this data source is not supported by Google anymore.
    • 03-input-msnapi.bsh -- MSN API web search data source in its default configuration.
    • 05-input-yahooapi-wiki.bsh -- YahooAPI web search data source in a custom configuration. The options are specified by the 05-input-yahooapi-wiki.cfg file. For a description of the parameters, please see Yahoo Web Search API documentation.
    • 06-input-odp.bsh -- tab based on a locally available Lucene index. Specify the location of the index in the odp.index.location system property or hardcode it in the *.bsh file.
    • 07-input-jobs.bsh -- tab based on an OpenSearch data source. The URL for the OpenSearch data source can have the following form:
      http://www.indeed.com/opensearch?q={searchTerms}&start={startIndex}&limit={count}
      Elements marked with curly braces are replaced at request time as follows:
      • searchTerms -- the query the user provided
      • startIndex -- the index of the first search result to be fetched
      • count -- the total number of search results to be fetched
    • 08-input-pubmed.bsh -- tab based on the PubMed database.
  5. Using your favourite tool, ZIP all the files back into a WAR file and install it in your servlet container.
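The placeholder substitution described for the OpenSearch URL of the 07-input-jobs.bsh tab can be sketched in a few lines of Java. This is a minimal illustration of the idea, not the code Carrot2 itself uses; the class and method names are hypothetical, and URL-encoding the query as UTF-8 is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class OpenSearchTemplate {
    /** Replaces the {searchTerms}, {startIndex} and {count} placeholders. */
    static String expand(String template, String searchTerms, int startIndex, int count) {
        String encodedTerms;
        try {
            // The query must be URL-encoded before it is placed in the URL.
            encodedTerms = URLEncoder.encode(searchTerms, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
        return template
            .replace("{searchTerms}", encodedTerms)
            .replace("{startIndex}", String.valueOf(startIndex))
            .replace("{count}", String.valueOf(count));
    }

    public static void main(String[] args) {
        String template = "http://www.indeed.com/opensearch?"
            + "q={searchTerms}&start={startIndex}&limit={count}";
        // Prints: http://www.indeed.com/opensearch?q=data+mining&start=0&limit=100
        System.out.println(expand(template, "data mining", 0, 100));
    }
}
```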
Yes. While the query is usually very helpful to get rid of the obvious meanings related to the documents in the search results set, it is not obligatory -- the clustering algorithms will cope without the query.
Lingo has three parameters that influence the number and contents of the clusters it creates:
  • Cluster Assignment Threshold -- determines how precise the assignment of documents to clusters should be. For low values of this threshold, Lingo will assign more documents to clusters, which will result in fewer documents ending up in "Other topics", but also in some irrelevant documents making their way into the clusters. For high values of this parameter, Lingo will assign fewer documents to clusters, which will result in better assignment precision, but also in more documents in "Other topics" and fewer clusters being created.
  • Candidate cluster threshold -- determines how many clusters Lingo will try to create; higher values of the parameter will give more clusters. However, it is not possible to predict the exact number of clusters from the value of this parameter before clustering is actually run.
  • Preferred cluster count -- determines the maximum number of clusters Lingo will create (excluding the "Other topics" cluster).
...