Introduction
Carrot2 is composed of components bound together
in a processing chain. There are three fundamental components:
input— anything that produces snippets to be
clustered. Each snippet consists of a unique URL, a title and
a fragment of text from the content of a document. Example input
components available in Carrot2 provide bridges to
existing major search engines (Yahoo, Google, MSN Search), open source
search engines (Lucene), but also adapt XMLs (such as RSS or OpenSearch).
As a last resort, you can write your own input component
using the examples available in the project.
filters— typically a clustering component and a set
of filters that it requires. Carrot2 comes with a number
of clustering components; each one implements a different algorithm
and has different requirements concerning configuration and previous
filters in the processing chain. Take a look at the
demo applications (web application and local application), which
contain full scripts configuring each clustering component. In this
example we will use the Lingo clustering component and configure
it directly from the source code.
output— a clustering component typically produces
instances of the RawCluster interface. The role of an output
component is to do something
with clusters once you receive them from the clusterer. The easiest
way is to save them in an array and wait until all the processing is
finished (all the clusters are available). A more advanced application
could use (for example, display) clusters as soon as they appear from the clustering
component. In this example we will buffer the output clusters in
an array.
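The three roles can be sketched in plain Java as follows. This is only an illustration, not the Carrot2 API: `Snippet`, `ToyCluster`, `FirstWordClusterer`, `BufferingOutput` and `ChainSketch` are hypothetical names, the input is a fixed list rather than a search engine bridge, and the "clustering" step merely groups snippets by the first word of the title in place of a real algorithm such as Lingo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A snippet as described above: unique URL, title, and a text fragment. */
final class Snippet {
    final String url, title, text;
    Snippet(String url, String title, String text) {
        this.url = url; this.title = title; this.text = text;
    }
}

/** Hypothetical stand-in for a cluster: a label plus its member snippets. */
final class ToyCluster {
    final String label;
    final List<Snippet> members = new ArrayList<>();
    ToyCluster(String label) { this.label = label; }
}

/** Toy "clustering" filter: groups snippets by the first word of the title. */
final class FirstWordClusterer {
    List<ToyCluster> cluster(List<Snippet> snippets) {
        Map<String, ToyCluster> byLabel = new LinkedHashMap<>();
        for (Snippet s : snippets) {
            String label = s.title.trim().split("\\s+")[0].toLowerCase();
            byLabel.computeIfAbsent(label, ToyCluster::new).members.add(s);
        }
        return new ArrayList<>(byLabel.values());
    }
}

/** Output role: buffer all clusters until processing has finished. */
final class BufferingOutput {
    private final List<ToyCluster> buffer = new ArrayList<>();
    void accept(ToyCluster c) { buffer.add(c); }
    List<ToyCluster> result() { return buffer; }
}

final class ChainSketch {
    /** Input role: a fixed list of snippets (assumed data for the example). */
    static List<Snippet> input() {
        List<Snippet> in = new ArrayList<>();
        in.add(new Snippet("http://example.com/1", "Lucene indexing basics", "How an index is built."));
        in.add(new Snippet("http://example.com/2", "Lucene query syntax", "How queries are written."));
        in.add(new Snippet("http://example.com/3", "RSS feed formats", "An overview of feed formats."));
        return in;
    }

    /** Runs input, then the filter, then buffers every cluster in the output. */
    static List<ToyCluster> run() {
        BufferingOutput output = new BufferingOutput();
        for (ToyCluster c : new FirstWordClusterer().cluster(input())) {
            output.accept(c);
        }
        return output.result();
    }

    public static void main(String[] args) {
        for (ToyCluster c : run()) {
            System.out.println(c.label + ": " + c.members.size());
        }
    }
}
```

A more advanced output component would act on each cluster inside `accept` instead of waiting for `result()`.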
We often talk about local components or local
architecture. This distinction exists for historical reasons: there was once a parallel design
using remote communication between components, but it has been dropped.
Design goals and constraints
The initial requirements for Carrot2 design were as follows:
Performance
local method calls
memory/ object reuse
direct typing (no interfaces)
single-threaded process pipeline
processes written in compiled code
Scalability
Component-language independence.
Incremental pipeline (partial processing).
Distributed processing.
Flexibility (openness)
The design must not restrict or limit the data types
passed between components (i.e. components may push in the
pipeline whatever they want to).
Reuse of components and common code.
User-friendly design
Controller handles some of the complexity of process verification
(incompatible components linked in a chain).
Components and processes easily scripted without code recompilation.
Many of the above goals are contradictory — for example, Component-language independence and Direct typing, or
Processes in compiled code and Scripted components and processes. The suggested design emphasizes performance, but also attempts
to preserve the flexibility that we thought was most valuable in
the framework. From the above set of goals we have selected the following as
driving factors for the project:
Local method calls
Local method calls are the key to achieving high performance. Data must not
be passed via bounded buffers, but directly from component to component. If
at all possible, data should be reused and not copied/ duplicated.
Memory/ object reuse
Intense memory allocation and garbage collection can slow down a Java application
by an order of magnitude. The design must provide means to reuse intermediate
component data from request to request.
Single-threaded pipelines
It seems that the cost of interprocess (or inter-thread) communication
and synchronization is usually higher than the gain from parallelization.
One request should be processed by one thread entirely.
Processes in compiled code
Process specification must be flexible at the time of development,
but efficient for production use.
Incremental pipelines
Components may not need all of their successors' data. Passing a single results object
sequentially from component to component would be memory-inefficient.
Flexible data types
This is the most difficult issue: how to specify local binding interfaces without
knowing in advance what types of data can be passed between components.
We think the proposed design handles this issue gracefully and at the same time
allows efficient implementations. Components declare their
needs and expectations from predecessors and successors in the processing chain.
The controller verifies if the entire processing chain is pairwise-compatible and
then components may simply cast successors to a required Java interface.
Code reuse
Components will share a common memory space, so common code sharing
should not be a problem (unlike Web applications, which were constrained by
sandboxed class loaders). Reusing code limits the application's memory footprint
and, in effect, the impact of memory swapping by the operating system.
Scripted components and processes
BeanShell 2.0-series will be used as an alternative form of providing process
definitions (because it allows subclassing and anonymous interface
implementations).
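As an illustration of the "Memory/ object reuse" factor listed above, a component can keep its working buffers as fields and reset them between requests instead of reallocating them. The sketch below is hypothetical: `TokenizingFilter`, `process` and `reset` are names invented for this example and are not part of Carrot2.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter that reuses its working buffers across requests. */
final class TokenizingFilter {
    // Allocated once per component instance, not once per request.
    private final List<String> tokens = new ArrayList<>(1024);
    private final StringBuilder scratch = new StringBuilder(256);

    /** Splits text on whitespace into lower-case tokens, reusing internal buffers. */
    List<String> process(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flushToken();
            } else {
                scratch.append(Character.toLowerCase(c));
            }
        }
        flushToken();
        return tokens; // contents are valid only until reset() is called
    }

    private void flushToken() {
        if (scratch.length() > 0) {
            tokens.add(scratch.toString()); // the token strings are still allocated
            scratch.setLength(0);           // keeps the backing array
        }
    }

    /** Called between requests: clears contents but keeps allocated capacity. */
    void reset() {
        tokens.clear();
        scratch.setLength(0);
    }
}

final class ReuseSketch {
    public static void main(String[] args) {
        TokenizingFilter filter = new TokenizingFilter();
        System.out.println(filter.process("Data Mining"));
        filter.reset();
        System.out.println(filter.process("Search results clustering"));
    }
}
```

The trade-off is that the returned list is owned by the component, so callers must finish with it before the next request begins.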
The goals that we consider of lesser importance (or have dropped because of conflicts):
Direct typing
Type-checks at runtime are quite costly, but we will use interfaces anyway, because
they provide more flexibility in designing the controller.
Component-language independence
Similarly to Single-threaded pipelines, we think that distributing entire atomic single-threaded
local processes will be more efficient than distributing components.
If really needed, local stubs can simulate local interfaces and allow distributed processing.
Distributed processing
Can be achieved by using JNI wrappers. This requires more effort, but will work.
Component compatibility verification
This is a very nice feature to have, but it conflicts with the Flexible data types goal, and there
seems to be no elegant way to fulfill this goal. We suggest a method of component
compatibility verification based on explicit capabilities (declared inside
a component), but we do not make these capabilities an obligation for
component/ process designers.
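One way to picture capability-based verification: each component declares a set of capability tokens it provides and a set it requires from its successor, and a controller checks adjacent pairs. The sketch assumes plain string tokens and the hypothetical names `Declared`, `ChainVerifier` and `verify`; the actual Carrot2 contract is described in its JavaDoc and may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical component descriptor carrying explicit capability declarations. */
final class Declared {
    final String name;
    final Set<String> provides;
    final Set<String> requiresFromSuccessor;

    Declared(String name, Set<String> provides, Set<String> requiresFromSuccessor) {
        this.name = name;
        this.provides = provides;
        this.requiresFromSuccessor = requiresFromSuccessor;
    }

    static Set<String> caps(String... c) {
        return new HashSet<>(Arrays.asList(c));
    }
}

final class ChainVerifier {
    /** Returns null if every adjacent pair is compatible, else a description of the first mismatch. */
    static String verify(List<Declared> chain) {
        for (int i = 0; i + 1 < chain.size(); i++) {
            Declared current = chain.get(i);
            Declared next = chain.get(i + 1);
            // An empty requirement set always passes: declaring capabilities is optional.
            if (!next.provides.containsAll(current.requiresFromSuccessor)) {
                return current.name + " requires " + current.requiresFromSuccessor
                        + " but " + next.name + " provides " + next.provides;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Declared input = new Declared("input",
                Declared.caps("snippets-producer"), Declared.caps("snippets-consumer"));
        Declared output = new Declared("output",
                Declared.caps("clusters-consumer"), Declared.caps());
        // Linking input directly to output skips the clusterer, so a mismatch is reported.
        System.out.println(verify(Arrays.asList(input, output)));
    }
}
```

Because an empty requirement set is accepted, a component that declares nothing is never rejected, which mirrors the decision not to make capabilities obligatory.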
Overview of the design
Components
A class diagram of the core classes in the local architecture is presented below.

All components must implement the LocalComponent interface.
This interface contains methods that allow component initialization at the moment of creation,
verification of compatibility with other components and finally,
lifecycle methods that allow the component to be reused by a component container.
The LocalComponent interface is the super interface for all three types
of local components:
LocalInputComponent—
components accepting user queries and producing initial data.
LocalFilterComponent—
components that somehow alter or enrich the data.
LocalOutputComponent—
components gathering the result or doing something with the result.
For example, a visual component displaying the data can implement this interface.
The lifecycle of a component is governed by the contract of the LocalComponent
interface. A state diagram of the component lifecycle is presented in
the figure below. Transitions in the diagram represent callback method
calls from a process object (an instance of LocalProcess
interface). All method names and the specifics of the contract are explained
in JavaDoc documentation and are omitted here.

This base functionality of a component is slightly extended depending on the role of a
component in a processing chain; input, filter and output components are distinguished. Input
components take an additional query argument, output components
provide a method for harvesting the result of a process. Filter and input components also
specify a very important method setNext()
that lets the local process combine components in a processing chain and thus allow components
to interact directly with each other.
The above interfaces are stripped of any functional methods because these are in a different
area of concern and are independent of the local component container and the governing process.
Methods specific to data processing should be provided by component designers as an
extension to the base interfaces.
Components must ensure at the time of processing chain linking that they can understand
the successor's data-related methods, usually by attempting to cast it to a more specific type.
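The linking step can be sketched as follows. `Component`, `SnippetConsumer` and the three component classes are hypothetical stand-ins for illustration; only the method name `setNext()` comes from the text above, and its real signature is defined in the Carrot2 JavaDoc.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical marker for anything that can be linked in a chain. */
interface Component {}

/** Data-related extension interface defined by the component designer. */
interface SnippetConsumer extends Component {
    void addSnippet(String url, String title);
}

/** Input-like component: verifies its successor at link time, then calls it directly. */
final class FixedInput implements Component {
    private SnippetConsumer next;

    void setNext(Component successor) {
        if (!(successor instanceof SnippetConsumer)) {
            throw new IllegalArgumentException(
                    "Successor must implement SnippetConsumer: " + successor);
        }
        this.next = (SnippetConsumer) successor; // cast once, at link time
    }

    void produce() {
        // Direct local method calls: no buffers or copies between components.
        next.addSnippet("http://example.com/a", "First result");
        next.addSnippet("http://example.com/b", "Second result");
    }
}

/** Output-like component that simply collects titles. */
final class TitleCollector implements SnippetConsumer {
    final List<String> titles = new ArrayList<>();
    public void addSnippet(String url, String title) { titles.add(title); }
}

/** Does not understand snippets, so linking to it must fail. */
final class Unrelated implements Component {}

final class LinkSketch {
    public static void main(String[] args) {
        FixedInput input = new FixedInput();
        TitleCollector collector = new TitleCollector();
        input.setNext(collector);
        input.produce();
        System.out.println(collector.titles);
    }
}
```

Failing at link time rather than during processing means an incompatible chain is rejected before any query is executed.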
Component initialization
Before any processing begins, components and processes are added to a
component container, also referred to as a controller component.
The controller is not explicitly specified in the core framework,
because it is mostly application-dependent and beyond the scope of
the specification. However, a LocalControllerContext interface
is specified and passed to each
instantiated component. LocalControllerContext lets
components and processes verify compatibility with each other and provides some
insight into the availability of other component types.
Processes
A local process embodies the logic needed to process queries and assemble
components in a processing chain.
Instances of this interface must be written by the user of the framework,
because it is the process that decides which components are used to process
a query and how they are connected to each other. The contract on
the behavior of LocalProcess instances
is quite complex and developers are encouraged to subclass
the reference implementation of LocalProcessBase and
override the hook methods specified in there. Having said that, implementing
the raw interface gives full control over the component assembly and query
execution process.
See JavaDoc for up-to-date information about the details of LocalProcess
implementation.
Query execution
When all processes and components are added to a controller, it is ready to
process requests. The controller passes a query request to a local process
instance for execution using the query() method:
public Object query(RequestContext context, String query) throws Exception;
A process can now request instances of components (this allows pooling of
component instances per-request and ensures they are always returned to the
pool), link them in a processing chain and perform the request processing.
The contract on the query() method is detailed in the JavaDoc documentation
as well.
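The borrow, process and return sequence described above might look roughly like this. `SimplePool`, `Stage` and `PooledProcess` are hypothetical; a single component stands in for a full processing chain, and the `query` method here takes only a string and declares no checked exceptions for brevity, whereas the real method also receives a `RequestContext`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

/** Minimal hypothetical pool: hands out idle instances, creates new ones on demand. */
final class SimplePool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;

    SimplePool(Supplier<T> factory) { this.factory = factory; }

    synchronized T borrow() { return idle.isEmpty() ? factory.get() : idle.pop(); }
    synchronized void giveBack(T instance) { idle.push(instance); }
    synchronized int idleCount() { return idle.size(); }
}

/** Hypothetical reusable component that upper-cases the query and records it. */
final class Stage {
    final List<String> seen = new ArrayList<>();

    String handle(String query) {
        seen.add(query);
        return query.toUpperCase(Locale.ROOT);
    }

    void reset() { seen.clear(); }
}

final class PooledProcess {
    private final SimplePool<Stage> pool = new SimplePool<>(Stage::new);

    /** One request, one thread: borrow, process, and always return the component. */
    Object query(String query) {
        Stage stage = pool.borrow();
        try {
            return stage.handle(query);
        } finally {
            stage.reset();        // make the instance reusable
            pool.giveBack(stage); // returned even if processing failed
        }
    }

    int idleComponents() { return pool.idleCount(); }

    public static void main(String[] args) {
        PooledProcess process = new PooledProcess();
        System.out.println(process.query("lucene"));
        System.out.println(process.idleComponents());
    }
}
```

The `finally` block is what guarantees that a component goes back to the pool even when a request fails.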