This project has retired. For details please refer to its Attic page.
Apache Stanbol - Graph Chain

Graph Chain

The GraphChain lets you directly configure the ExecutionPlan returned by the Chain.getExecutionPlan() method. On the one hand, this makes it possible to configure any kind of execution process; on the other hand, using it requires a good knowledge of the EnhancementEngines and the ExecutionPlan model.

It is typically good practice to start with another, simpler Chain implementation such as the Weighted Chain, and only afterwards convert that configuration to a GraphChain in order to optimize the enhancement process, for example by allowing more engines to be executed in parallel.

Configuration

The GraphChain supports two ways of configuring the ExecutionPlan.

GraphResource

A GraphResource is an RDF file made available via the DataFileProvider. The easiest way is to copy the RDF file defining the ExecutionPlan to the "/sling/datafile" directory within the Stanbol home directory. The GraphChain configuration then only needs to refer to that file, for example:

stanbol.enhancer.chain.graph.graphresource=myExecutionPlan.rdf

The RDF encoding is guessed from the file extension. If the extension is not recognized, the format can also be passed as an additional parameter:

stanbol.enhancer.chain.graph.graphresource=myExecutionPlan.something;format=application/rdf+xml

The GraphChain tracks that file and activates itself as soon as the file becomes available. Removing the file, waiting a few seconds, and then providing the new version should also work. Simply replacing the file will not work, because the DataFileProvider does not support updates. In such cases it may be necessary to deactivate and re-activate the GraphChain.

ChainList

The ChainList allows the ExecutionPlan to be configured directly as the value of the "stanbol.enhancer.chain.graph.chainlist" property. Both arrays and collections are supported.

Note: As soon as a GraphResource is configured, the ChainList is ignored. This holds even if the configured GraphResource is currently not available!

The syntax is defined as follows:

{engine-name};[optional];[dependsOn={engine-name1},{engine-name2}]

The following example shows how this syntax can be used to define an ExecutionPlan.

metaxa;optional
langId;dependsOn=metaxa
ner;dependsOn=langId
zemanta;optional
dbpedia-linking;dependsOn=ner
geonames;optional;dependsOn=ner
refactor;dependsOn=geonames,dbpedia-linking,zemanta

Note: The internal order of the list does not influence the resulting ExecutionPlan. Only the "dependsOn" properties are used to determine the execution order of the engines and whether engines can be executed in parallel.
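To make the entry syntax concrete, the following minimal Java sketch splits ChainList entries into an engine name, an optional flag and a list of dependencies. It is an illustration only and not Stanbol's own parser: the class ChainListSketch, the Node type and the parseEntry/parse methods are hypothetical names introduced here, and the strict rejection of unknown parameters is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical parser for the ChainList syntax,
// not the implementation used by Stanbol.
final class ChainListSketch {

    /** One parsed ChainList entry (hypothetical helper type). */
    static final class Node {
        final String engine;
        final boolean optional;
        final List<String> dependsOn;

        Node(String engine, boolean optional, List<String> dependsOn) {
            this.engine = engine;
            this.optional = optional;
            this.dependsOn = dependsOn;
        }
    }

    /** Parses "{engine-name};[optional];[dependsOn={name1},{name2}]". */
    static Node parseEntry(String entry) {
        String[] parts = entry.trim().split(";");
        String engine = parts[0].trim();
        if (engine.isEmpty()) {
            throw new IllegalArgumentException("Missing engine name: " + entry);
        }
        boolean optional = false;
        List<String> dependsOn = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.equals("optional")) {
                optional = true;
            } else if (part.startsWith("dependsOn=")) {
                String names = part.substring("dependsOn=".length());
                for (String dep : names.split(",")) {
                    if (!dep.trim().isEmpty()) {
                        dependsOn.add(dep.trim());
                    }
                }
            } else if (!part.isEmpty()) {
                // Assumption: unknown parameters are treated as errors.
                throw new IllegalArgumentException("Unknown parameter: " + part);
            }
        }
        return new Node(engine, optional, dependsOn);
    }

    /** Parses a whole list; the list order carries no meaning. */
    static Map<String, Node> parse(List<String> entries) {
        Map<String, Node> nodes = new LinkedHashMap<>();
        for (String entry : entries) {
            Node node = parseEntry(entry);
            nodes.put(node.engine, node);
        }
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Node> plan = parse(Arrays.asList(
                "metaxa;optional",
                "langId;dependsOn=metaxa",
                "ner;dependsOn=langId",
                "zemanta;optional",
                "dbpedia-linking;dependsOn=ner",
                "geonames;optional;dependsOn=ner",
                "refactor;dependsOn=geonames,dbpedia-linking,zemanta"));
        for (Node n : plan.values()) {
            System.out.println(n.engine + " optional=" + n.optional
                    + " dependsOn=" + n.dependsOn);
        }
    }
}
```

Because only the dependsOn values connect the entries, shuffling the input list yields the same set of nodes and dependencies.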

Within an OSGi configuration file (org.apache.stanbol.enhancer.chain.graph.impl.GraphChain-myGraphChain.config) this would look like:

stanbol.enhancer.chain.graph.chainlist=[
    "metaxa;optional","langId;dependsOn\=metaxa","ner;dependsOn\=langId",
    "zemanta;optional","dbpedia-linking;dependsOn\=ner",
    "geonames;optional;dependsOn\=ner",
    "refactor;dependsOn\=geonames,dbpedia-linking,zemanta"]

Note: The whole text MUST be on a single line within the .config file.

The following screenshot of the Apache Felix web console shows the dialog for the above configuration:

GraphChain configuration dialog with configured ChainList

Execution

In contrast to other chain implementations, the ExecutionPlan does not need to be calculated; it is provided directly by the user. This gives the greatest possible freedom in defining how the execution should take place.

Optional Engines

The execution of optional engines is not mandatory. The enhancement process continues even if they are not active or their execution fails. It is important to know that engines depending on an optional engine that was not executed will still be called.

Given the above example, this means that even if the 'metaxa' engine cannot be executed, the 'langId' engine will still be called by the EnhancementJobManager.
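The behaviour described above can be sketched as a small sequential simulation in Java. This is not the EnhancementJobManager: the class OptionalEngineSketch, its Node type and the run method are hypothetical, and the choice to abort with an exception when a required (non-optional) engine is unavailable is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a hypothetical, sequential simulation of how
// optional engines affect the enhancement process.
final class OptionalEngineSketch {

    /** Hypothetical plan node: engine name, optional flag, dependencies. */
    static final class Node {
        final String engine;
        final boolean optional;
        final List<String> dependsOn;

        Node(String engine, boolean optional, String... dependsOn) {
            this.engine = engine;
            this.optional = optional;
            this.dependsOn = Arrays.asList(dependsOn);
        }
    }

    /**
     * Returns the engines that actually executed, in execution order.
     * Engines named in 'unavailable' are inactive or fail when called.
     */
    static List<String> run(List<Node> plan, Set<String> unavailable) {
        List<String> executed = new ArrayList<>();
        Set<String> finished = new HashSet<>(); // executed, skipped or failed
        List<Node> pending = new ArrayList<>(plan);
        while (!pending.isEmpty()) {
            Node next = null;
            for (Node candidate : pending) {
                if (finished.containsAll(candidate.dependsOn)) {
                    next = candidate;
                    break;
                }
            }
            if (next == null) {
                throw new IllegalStateException("Cyclic or unknown dependency");
            }
            pending.remove(next);
            if (unavailable.contains(next.engine)) {
                if (!next.optional) {
                    // Assumption: a required engine that cannot run aborts the process.
                    throw new IllegalStateException(
                            "Required engine failed: " + next.engine);
                }
                // Optional engine: not executed, but dependents are still called.
            } else {
                executed.add(next.engine);
            }
            finished.add(next.engine);
        }
        return executed;
    }

    public static void main(String[] args) {
        List<Node> plan = Arrays.asList(
                new Node("metaxa", true),
                new Node("langId", false, "metaxa"),
                new Node("ner", false, "langId"));
        // 'metaxa' cannot be executed, yet 'langId' and 'ner' still run.
        System.out.println(run(plan, Collections.singleton("metaxa")));
        // prints [langId, ner]
    }
}
```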

Parallel Execution

Engines are executed as soon as all engines they depend on have completed. An optional engine that was skipped (because it is not active) or that failed also counts as completed for this purpose. This means that in most cases several EnhancementEngines can be executed in parallel.

Given the above example, both the 'zemanta' and the 'metaxa' engines are executed as soon as the enhancement process starts. When 'metaxa' has finished, the 'langId' engine is called. After 'langId' finishes its work, the EnhancementJobManager calls the 'ner' engine. After that, both the 'dbpedia-linking' and the 'geonames' engines are called. At this point three engines might run simultaneously, assuming that 'zemanta' has not finished yet. Before the 'refactor' engine can be executed, it needs to wait for all of these engines to complete.
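The ordering in this walkthrough can be reproduced with a short Java sketch that groups engines into "waves": each wave contains the engines whose dependencies are all satisfied by earlier waves. This is a simplification and an assumption of the sketch, since a real job manager starts each engine as soon as its own dependencies complete rather than waiting for a whole wave (which is why 'zemanta' may still be running alongside later engines). The class ParallelWavesSketch and the waves method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: groups engines by the earliest step at which
// they could start, based purely on their dependsOn relations.
final class ParallelWavesSketch {

    /** Maps engine name -> names of the engines it depends on. */
    static List<List<String>> waves(Map<String, List<String>> dependsOn) {
        List<List<String>> result = new ArrayList<>();
        Set<String> finished = new HashSet<>();
        // Sorted only to make the output deterministic.
        Set<String> remaining = new TreeSet<>(dependsOn.keySet());
        while (!remaining.isEmpty()) {
            List<String> wave = new ArrayList<>();
            for (String engine : remaining) {
                if (finished.containsAll(dependsOn.get(engine))) {
                    wave.add(engine);
                }
            }
            if (wave.isEmpty()) {
                throw new IllegalStateException("Cyclic or unknown dependency");
            }
            remaining.removeAll(wave);
            finished.addAll(wave);
            result.add(wave);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> plan = new HashMap<>();
        plan.put("metaxa", Collections.<String>emptyList());
        plan.put("langId", Arrays.asList("metaxa"));
        plan.put("ner", Arrays.asList("langId"));
        plan.put("zemanta", Collections.<String>emptyList());
        plan.put("dbpedia-linking", Arrays.asList("ner"));
        plan.put("geonames", Arrays.asList("ner"));
        plan.put("refactor",
                Arrays.asList("geonames", "dbpedia-linking", "zemanta"));
        System.out.println(waves(plan));
        // prints [[metaxa, zemanta], [langId], [ner], [dbpedia-linking, geonames], [refactor]]
    }
}
```

Every wave with more than one entry marks a point where engines can run in parallel, provided the job manager and the engines support asynchronous enhancement.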

Note that for parallel execution to be activated both the used EnhancementJobManager and the different engines must support asynchronous enhancement.