
Architecture

Carrot2: open source framework for building search clustering engines

Introduction

Carrot2 is composed of components bound together in a processing chain. There are three fundamental components:

  • input: anything that produces snippets to be clustered. Each snippet consists of a unique URL, a title and a fragment of text from the content of a document. The input components available in Carrot2 include bridges to major search engines (Yahoo, Google, MSN Search) and open source search engines (Lucene), as well as adapters for XML formats (such as RSS or OpenSearch). As a last resort, you can write your own input component using the examples available in the project.

  • filters: typically a clustering component and the set of filters that it requires. Carrot2 comes with a number of clustering components; each one implements a different algorithm and has different requirements concerning configuration and the filters that precede it in the processing chain. The demo applications (the web application and the local application) contain full scripts configuring each clustering component. In this example we will use the Lingo clustering component and configure it directly from the source code.

  • output: a clustering component typically produces instances of the RawCluster interface. The role of an output component is to do something with the clusters once they are received from the clusterer. The easiest way is to save them in an array and wait until all the processing is finished (all the clusters are available). A more advanced application could use (for example, display) clusters as soon as they appear from the clustering component. In this example we will buffer the output clusters in an array.
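The three roles above can be sketched as a tiny push-style chain. This is only an illustration under stated assumptions, not Carrot2's API: every type below (Snippet, Cluster, FixedInput, ToyClusterer, BufferingOutput, ChainDemo) is hypothetical, and the "clustering" step is a stand-in that groups snippets by the first word of their title rather than a real algorithm such as Lingo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical types for illustration only; these are not Carrot2 interfaces.

/** A snippet as described above: unique URL, title and a text fragment. */
final class Snippet {
    final String url, title, text;
    Snippet(String url, String title, String text) {
        this.url = url; this.title = title; this.text = text;
    }
}

/** A labelled group of snippets produced by the toy clustering step. */
final class Cluster {
    final String label;
    final List<Snippet> members = new ArrayList<>();
    Cluster(String label) { this.label = label; }
}

/** Output: buffers clusters until all processing has finished. */
final class BufferingOutput {
    final List<Cluster> clusters = new ArrayList<>();
    void accept(Cluster c) { clusters.add(c); }
}

/** Filter: a stand-in "clusterer" grouping snippets by the first word of the title. */
final class ToyClusterer {
    private final Map<String, Cluster> byLabel = new LinkedHashMap<>();
    private final BufferingOutput next;
    ToyClusterer(BufferingOutput next) { this.next = next; }

    void add(Snippet s) {
        String label = s.title.split("\\s+")[0].toLowerCase(Locale.ROOT);
        byLabel.computeIfAbsent(label, Cluster::new).members.add(s);
    }

    /** Called once the input is exhausted; pushes finished clusters downstream. */
    void finish() {
        for (Cluster c : byLabel.values()) next.accept(c);
    }
}

/** Input: produces snippets for a query. Here it simply replays a fixed list. */
final class FixedInput {
    private final List<Snippet> snippets;
    FixedInput(List<Snippet> snippets) { this.snippets = snippets; }

    void run(String query, ToyClusterer next) {
        for (Snippet s : snippets) next.add(s); // pushed one at a time, not as one big result
        next.finish();
    }
}

final class ChainDemo {
    /** Wires input -> filter -> output and returns the buffered clusters. */
    static List<Cluster> cluster(List<Snippet> snippets) {
        BufferingOutput out = new BufferingOutput();
        ToyClusterer clusterer = new ToyClusterer(out);
        new FixedInput(snippets).run("ignored in this sketch", clusterer);
        return out.clusters;
    }
}
```

Calling ChainDemo.cluster(...) with a list of snippets returns the buffered clusters in the order their labels were first seen.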

We often talk about local components or a local architecture. This distinction exists for historical reasons: there was once a parallel design using remote communication between components, but it has been dropped.

Design goals and constraints

The initial requirements for Carrot2 design were as follows:

  • Performance

    1. local method calls

    2. memory/ object reuse

    3. direct typing (no interfaces)

    4. single-threaded process pipeline

    5. processes written in compiled code

  • Scalability

    1. Component-language independence.

    2. Incremental pipeline (partial processing).

    3. Distributed processing.

  • Flexibility (openness)

    1. The design must not restrict or limit the data types passed between components (i.e. components may push in the pipeline whatever they want to).

    2. Reuse of components and common code.

  • User-friendly design

    1. Controller handles some of the complexity of process verification (incompatible components linked in a chain).

    2. Components and processes easily scripted without code recompilation.

Many of the above goals are contradictory — for example, Component-language independence and Direct typing, or Processes in compiled code and Scripted components and processes. The suggested design emphasizes performance, but also attempts to preserve the flexibility that we thought was most valuable in the framework. From the above set of goals we have selected the following as driving factors for the project:

  • Local method calls

    Local method calls are the key to achieving high performance. Data must not be passed via bounded buffers, but directly from component to component. If at all possible, data should be reused and not copied/ duplicated.

  • Memory/ object reuse

    Intense memory allocation and garbage collection can slow down a Java application by an order of magnitude. The design must provide means to reuse intermediate component data from request to request.

  • Single-threaded pipelines

    It seems that the cost of interprocess (or inter-thread) communication and synchronization is usually higher than the gain from parallelization. One request should be processed by one thread entirely.

  • Processes in compiled code

    Process specification must be flexible at the time of development, but efficient for production use.

  • Incremental pipelines

    Components may not need all of their successors' data. Passing a single results object sequentially from component to component would be memory-inefficient.

  • Flexible data types

    This is the most difficult issue: how to specify local binding interfaces without knowing in advance what types of data can be passed between components. We think the proposed design handles this issue gracefully and at the same time allows efficient implementations. Components declare their needs and expectations from predecessors and successors in the processing chain. The controller verifies if the entire processing chain is pairwise-compatible and then components may simply cast successors to a required Java interface.

  • Code reuse

    Components will share a common memory space, so sharing common code should not be a problem (unlike Web applications, which were constrained by sandboxed class loaders). Reusing code limits the application's memory footprint and, in effect, the impact of memory swapping by the operating system.

  • Scripted components and processes

    BeanShell 2.0-series will be used as an alternative form of providing process definitions (because it allows subclassing and anonymous interface implementations).

The following goals we consider of lesser importance (or have dropped because of conflicts):

  • Direct typing

    Type-checks at runtime are quite costly, but we will use interfaces anyway, because they provide more flexibility in designing the controller.

  • Component-language independence

    Can be achieved by using JNI wrappers. This requires more effort, but will work.

  • Distributed processing

    Similarly to (Single-threaded pipelines): we think that distributing entire atomic single-threaded local processes will be more efficient than distributing components. If really needed, local stubs can simulate local interfaces and allow distributed processing.

  • Component compatibility verification

    This is a very nice feature to have, but it is in conflict with (Flexible data types) and there seems to be no elegant way to fulfill this goal. We suggest a method of component compatibility verification based on explicit capabilities (declared inside a component), but we do not make these capabilities an obligation for component/ process designers.
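Verification based on explicit, optional capabilities could look roughly like the sketch below. It is an assumption-laden illustration, not the Carrot2 implementation: the classes DeclaredComponent and ChainVerifier, their fields and the capability strings are all hypothetical. A component that declares no requirements imposes no obligation, which mirrors the point that capabilities are not mandatory.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; names and capability strings are illustrative assumptions.

/** A component that declares what it provides and what it expects of its neighbours. */
final class DeclaredComponent {
    final String name;
    final Set<String> provides;             // capabilities this component offers
    final Set<String> needsFromPredecessor; // required of the component before it
    final Set<String> needsFromSuccessor;   // required of the component after it

    DeclaredComponent(String name, Set<String> provides,
                      Set<String> needsFromPredecessor, Set<String> needsFromSuccessor) {
        this.name = name;
        this.provides = provides;
        this.needsFromPredecessor = needsFromPredecessor;
        this.needsFromSuccessor = needsFromSuccessor;
    }
}

final class ChainVerifier {
    /** Convenience factory for a capability set; an empty call means "no requirements". */
    static Set<String> caps(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /**
     * Checks every adjacent pair in the chain. Returns a description of the
     * first mismatch, or an empty Optional if the chain is pairwise-compatible.
     */
    static Optional<String> firstIncompatibility(List<DeclaredComponent> chain) {
        for (int i = 0; i + 1 < chain.size(); i++) {
            DeclaredComponent left = chain.get(i);
            DeclaredComponent right = chain.get(i + 1);
            if (!right.provides.containsAll(left.needsFromSuccessor)) {
                return Optional.of(left.name + " requires more of its successor " + right.name);
            }
            if (!left.provides.containsAll(right.needsFromPredecessor)) {
                return Optional.of(right.name + " requires more of its predecessor " + left.name);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch the check is purely declarative: it compares sets of strings and never inspects the data actually exchanged, so it can catch obviously mismatched chains without restricting the data types components pass to each other.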

Overview of the design

Components

A class diagram of the core classes in the local architecture is presented below.

Class diagram of the core classes in the local architecture

All components must implement the LocalComponent interface. This interface contains methods that allow component initialization at the moment of creation, verification of compatibility with other components and, finally, lifecycle methods that allow the component to be reused by a component container.

The LocalComponent interface is the super interface for all three types of local components:

  • LocalInputComponent: components accepting user queries and producing initial data.

  • LocalFilterComponent: components that somehow alter or enrich the data.

  • LocalOutputComponent: components gathering the result or doing something with the result. For example, a visual component displaying the data can implement this interface.

The lifecycle of a component is regulated by a contract in the LocalComponent interface. A state diagram of the component lifecycle is presented in the figure below. Transitions in the diagram represent callback method calls from a process object (an instance of the LocalProcess interface). All method names and the specifics of the contract are explained in the JavaDoc documentation and are omitted here.

State diagram of a lifecycle of a local component

This base functionality of a component is slightly extended depending on the role of the component in a processing chain; input, filter and output components are distinguished. Input components take an additional query argument, and output components provide a method for harvesting the result of a process. Filter and input components also specify a very important method, setNext(), that lets the local process combine components in a processing chain and thus allows components to interact directly with each other.

The above interfaces are stripped of any functional methods because these belong to a different area of concern and are independent of the local component container and the governing process. Methods specific to data processing should be provided by component designers as an extension to the base interfaces. Components must ensure at the time of processing chain linking that they can understand the successor's data-related methods, usually by attempting to cast it to a more specific type.
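The linking-time check described above might be sketched as follows. The interfaces here are simplified, hypothetical stand-ins (Component, Chainable, DocumentConsumer and the three classes are not Carrot2 types, and this setNext() signature is an assumption); the point is only that the data-related contract lives in a designer-provided interface and is checked when the chain is linked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-ins; not the actual Carrot2 interfaces.

/** Base marker with no data-related methods, mirroring the idea above. */
interface Component {}

/** Components that can be followed by another component in the chain. */
interface Chainable extends Component {
    void setNext(Component next);
}

/** Data-related contract defined by component designers, not by the framework. */
interface DocumentConsumer extends Component {
    void addDocument(String title);
}

/** A filter that upper-cases titles and pushes them to its successor. */
final class UppercaseFilter implements Chainable, DocumentConsumer {
    private DocumentConsumer next;

    @Override
    public void setNext(Component next) {
        // Verify at linking time that the successor understands our data.
        if (!(next instanceof DocumentConsumer)) {
            throw new IllegalArgumentException("Successor must implement DocumentConsumer");
        }
        this.next = (DocumentConsumer) next;
    }

    @Override
    public void addDocument(String title) {
        next.addDocument(title.toUpperCase(Locale.ROOT)); // direct local call, no buffering
    }
}

/** An output component that collects whatever it is given. */
final class CollectingOutput implements DocumentConsumer {
    final List<String> titles = new ArrayList<>();

    @Override
    public void addDocument(String title) { titles.add(title); }
}

/** A component that does not understand documents; linking to it should fail. */
final class IncompatibleOutput implements Component {}
```

Linking UppercaseFilter to CollectingOutput succeeds, while linking it to IncompatibleOutput fails immediately with an IllegalArgumentException, before any data has been processed.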

Component initialization

Before any processing begins, components and processes are added to a component container, also referred to as a controller component. The controller is not explicitly specified in the core framework, because it is mostly application-dependent and beyond the scope of the specification. However, a LocalControllerContext interface is specified and passed to each instantiated component. LocalControllerContext lets components and processes verify compatibility with each other and provides some insight into the availability of other component types.

Processes

A local process embodies the logic needed to process queries and assemble components in a processing chain. Implementations of this interface must be written by the user of the framework, because it is the process that decides which components are used to process a query and how they are connected to each other. The contract on the behavior of LocalProcess instances is quite complex, so developers are encouraged to subclass the reference implementation, LocalProcessBase, and override the hook methods specified there. Having said that, implementing the raw interface gives full control over the component assembly and query execution process.

See JavaDoc for up-to-date information about the details of LocalProcess implementation.

Query execution

When all processes and components are added to a controller, it is ready to process requests. The controller passes a query request to a local process instance for execution using the query() method:

public Object query(RequestContext context, String query) throws Exception;

A process can now request instances of components (this allows pooling of component instances per-request and ensures they are always returned to the pool), link them in a processing chain and perform the request processing.

The contract on the query() method is detailed in the JavaDoc documentation as well.
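As a rough sketch of the borrow, process and return pattern described above, a process could look like the code below. Everything except the shape of the query() signature is an assumption made for illustration: RequestContext is an empty stand-in, and Worker, WorkerPool and SimpleProcess are hypothetical classes, not part of Carrot2. The pool is deliberately not thread-safe, in line with the single-threaded pipeline goal.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Hypothetical sketch; RequestContext is an empty stand-in for the real type.
final class RequestContext {}

/** A reusable component; its internal buffer is reused from request to request. */
final class Worker {
    private final StringBuilder buffer = new StringBuilder();

    String process(String query) {
        buffer.append(query.trim().toLowerCase(Locale.ROOT));
        return buffer.toString();
    }

    /** Clears per-request state so the instance can serve another request. */
    void reset() { buffer.setLength(0); }
}

/** A minimal, non-thread-safe pool of workers. */
final class WorkerPool {
    private final Deque<Worker> idle = new ArrayDeque<>();
    int created = 0; // exposed only to make reuse observable in this sketch

    Worker borrow() {
        Worker w = idle.poll();
        if (w == null) {
            w = new Worker();
            created++;
        }
        return w;
    }

    void giveBack(Worker w) {
        w.reset();
        idle.push(w);
    }
}

/** A process that borrows a component for each request and always returns it. */
final class SimpleProcess {
    private final WorkerPool pool;

    SimpleProcess(WorkerPool pool) { this.pool = pool; }

    public Object query(RequestContext context, String query) throws Exception {
        Worker worker = pool.borrow();
        try {
            return worker.process(query);
        } finally {
            pool.giveBack(worker); // returned to the pool even if processing fails
        }
    }
}
```

Two consecutive calls to query() on the same SimpleProcess reuse a single Worker instance, so no new component is allocated for the second request.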

...