This project has retired. For details please refer to its Attic page.
Apache Stanbol - Graph Chain

Graph Chain

The GraphChain lets you directly configure the ExecutionPlan returned by the Chain.getExecutionPlan() method. On the one hand, this means that any kind of execution process can be configured; on the other hand, using it requires detailed knowledge of the EnhancementEngines and the ExecutionPlan model.

Typically it is good practice to start with another, simpler Chain implementation such as the Weighted Chain, and only afterwards convert that configuration to a GraphChain in order to optimize the enhancement process, for example by allowing more engines to be executed in parallel.

Configuration

The GraphChain supports two ways to configure the ExecutionPlan.

GraphResource

A GraphResource is an RDF file available via the DataFileProvider. The easiest way is to copy the RDF file defining the ExecutionPlan to the "/sling/datafile" directory within the Stanbol home directory. The GraphChain configuration then only needs to refer to that file, for example:

stanbol.enhancer.chain.graph.graphresource=myExecutionPlan.rdf

The RDF encoding is guessed from the file extension. If the extension is not recognized, the format can also be passed as an additional parameter:

stanbol.enhancer.chain.graph.graphresource=myExecutionPlan.something;format=application/rdf+xml
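The value above can be read as a file name followed by optional key=value parameters separated by semicolons. The following Python sketch illustrates that reading only; the helper name parse_graph_resource is hypothetical and not part of Stanbol, whose own parsing may differ in details.

```python
def parse_graph_resource(value):
    """Split 'file;key=value;...' into (file name, parameter dict).

    Hypothetical helper for illustration; not Stanbol's actual parser.
    """
    parts = [p.strip() for p in value.split(";")]
    file_name, params = parts[0], {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        key, _, val = part.partition("=")
        params[key.strip()] = val.strip()
    return file_name, params


print(parse_graph_resource("myExecutionPlan.something;format=application/rdf+xml"))
# ('myExecutionPlan.something', {'format': 'application/rdf+xml'})
```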

The GraphChain tracks that file and activates itself as soon as the file becomes available. Removing the file, waiting a few seconds and then providing the new version should also work. Simply replacing the file will not work, because the DataFileProvider does not support updates. In such cases it might be necessary to deactivate and reactivate the GraphChain.

ChainList

The ChainList allows the ExecutionPlan to be configured directly as the value of the "stanbol.enhancer.chain.graph.chainlist" property. Both arrays and collections are supported.

Note: As soon as a GraphResource is configured, the ChainList is ignored. This is true even if the configured GraphResource is currently not available!

The syntax is defined as follows:

{engine-name};[optional];[dependsOn={engine-name1},{engine-name2}]

The following example shows how this syntax can be used to define an ExecutionPlan.

metaxa;optional
langId;dependsOn=metaxa
ner;dependsOn=langId
zemanta;optional
dbpedia-linking;dependsOn=ner
geonames;optional;dependsOn=ner
refactor;dependsOn=geonames,dbpedia-linking,zemanta

Note: The internal order of the list does not influence the resulting ExecutionPlan. Only the "dependsOn" properties are used to determine the execution order of the engines and whether engines can be executed in parallel.
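To make the syntax and the note on list order concrete, here is a minimal Python sketch that parses such entries into a dependency map. The function name parse_chain_list and the returned dictionary layout are assumptions made for illustration; this is not Stanbol's parser.

```python
def parse_chain_list(entries):
    """Parse '{engine-name};[optional];[dependsOn=a,b]' entries.

    Returns {engine name: {"optional": bool, "depends_on": [names]}}.
    Illustrative only; not Stanbol's parser.
    """
    plan = {}
    for entry in entries:
        name, *params = [p.strip() for p in entry.split(";")]
        node = {"optional": False, "depends_on": []}
        for param in params:
            if param == "optional":
                node["optional"] = True
            elif param.startswith("dependsOn="):
                deps = param[len("dependsOn="):].split(",")
                node["depends_on"] = [d.strip() for d in deps if d.strip()]
        plan[name] = node
    return plan


EXAMPLE = [
    "metaxa;optional",
    "langId;dependsOn=metaxa",
    "ner;dependsOn=langId",
    "zemanta;optional",
    "dbpedia-linking;dependsOn=ner",
    "geonames;optional;dependsOn=ner",
    "refactor;dependsOn=geonames,dbpedia-linking,zemanta",
]

plan = parse_chain_list(EXAMPLE)
# Reversing the list yields the same plan: only dependsOn matters.
assert plan == parse_chain_list(list(reversed(EXAMPLE)))
print(plan["refactor"]["depends_on"])
# ['geonames', 'dbpedia-linking', 'zemanta']
```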

Within an OSGi configuration file (org.apache.stanbol.enhancer.chain.graph.impl.GraphChain-myGraphChain.config) this would look like:

stanbol.enhancer.chain.graph.chainlist=[
    "metaxa;optional","langId;dependsOn\=metaxa","ner;dependsOn\=langId",
    "zemanta;optional","dbpedia-linking;dependsOn\=ner",
    "geonames;optional;dependsOn\=ner",
    "refactor;dependsOn\=geonames,dbpedia-linking,zemanta"]

Note: The whole text MUST BE on a single line within the .config file. It is wrapped above only for readability.
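Because the value has to sit on one line, it can be convenient to generate it from a plain list of entries. The sketch below does this in Python and escapes "=" as "\=" to mirror the listing above; the function name to_config_line is hypothetical, and the escaping rule is taken from that listing rather than from a format specification.

```python
def to_config_line(prop, entries):
    """Render entries as a one-line array value for a .config file.

    Assumption: only '=' needs escaping (as '\\='), as in the listing above.
    """
    quoted = ['"' + entry.replace("=", "\\=") + '"' for entry in entries]
    return prop + "=[" + ",".join(quoted) + "]"


line = to_config_line(
    "stanbol.enhancer.chain.graph.chainlist",
    ["metaxa;optional", "langId;dependsOn=metaxa", "ner;dependsOn=langId"],
)
print(line)
```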

The following screenshot of the Apache Felix web console shows the dialog for the above configuration:

GraphChain configuration dialog with configured ChainList

Enhancement Properties Support

since 0.12.1

Starting with version 0.12.1, the Graph Chain supports the configuration of EnhancementProperties.

Chain List based Configuration

If the Chain List type of configuration is used, properties are handled as follows:

All EnhancementProperties configured with a Chain are written as RDF to the ExecutionPlan. Chain scoped properties are added directly to the ep:ExecutionPlan instance, while chain and engine scoped properties are added to the ep:ExecutionNode of the corresponding engine.

The following figure and listing provide an example:

GraphChain including some Enhancement Properties

The figure shows that the maximum number of suggestions is set to 5 as a chain scoped property. In addition, two chain and engine scoped properties are set: for the dbpedia-fst engine the minimum confidence is set to 0.85, and for the dbpedia-dereference engine the dereferenced languages are set to English, German and Spanish.

With the GraphChain, chain and engine scoped Enhancement Properties are typically mixed with parameters of the chain configuration itself. Because Enhancement Properties are required to start with "enhancer.", they can easily be separated from chain-specific parameters such as dependsOn.
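The prefix rule can be illustrated with a short Python sketch that splits one ChainList entry into its chain parameters and its Enhancement Properties. The function name split_entry and the return shape are assumptions chosen for this example, not Stanbol API.

```python
def split_entry(entry):
    """Separate chain parameters from enhancement properties in one entry.

    Returns (engine name, chain parameters, enhancement properties).
    Illustrative helper; not part of Stanbol.
    """
    name, *params = [p.strip() for p in entry.split(";")]
    chain_params, enhancement_props = {}, {}
    for param in params:
        if not param:
            continue
        key, _, value = param.partition("=")
        # Enhancement Properties are recognised by their 'enhancer.' prefix.
        target = enhancement_props if key.startswith("enhancer.") else chain_params
        target[key] = value
    return name, chain_params, enhancement_props


print(split_entry(
    "dbpedia-fst; dependsOn=opennlp-chunker,opennlp-pos;enhancer.min-confidence=0.85"
))
# ('dbpedia-fst', {'dependsOn': 'opennlp-chunker,opennlp-pos'}, {'enhancer.min-confidence': '0.85'})
```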

The following listing shows the exact same configuration in the .cfg format.

stanbol.enhancer.chain.name="graph-chain"
stanbol.enhancer.chain.chainproperties=["enhancer.max-suggestions\=5"]
stanbol.enhancer.chain.graph.chainlist=["tika;optional","langdetect;\ dependsOn\=tika",
    "opennlp-sentence;\ dependsOn\=langdetect","opennlp-token;\ dependsOn\=opennlp-sentence",
    "opennlp-pos;\ dependsOn\=opennlp-token","opennlp-chunker;\ optional;\ dependsOn\=opennlp-pos",
    "opennlp-ner;\ dependsOn\=opennlp-pos",
    "dbpedia-fst;\ dependsOn\=opennlp-chunker,opennlp-pos;enhancer.min-confidence\=0.85",
    "dbpedia-dereference;\ dependsOn\=dbpedia-fst;\ enhancer.engines.dereference.languages\=en,de,es"]

Graph Resource Configuration

If the ExecutionPlan is configured by an RDF file, the EnhancementProperties need to be encoded directly as RDF.

Chain scoped properties need to be attached to the ep:ExecutionPlan instance, while chain and engine scoped properties are added to the ep:ExecutionNode of the corresponding engine.

Single properties are represented by triples in which the execution plan or execution node instance is the subject, the URI of the enhancement property is the predicate, and the value is the object. Multiple values are represented by multiple triples with the same subject and predicate.

The following listing shows the same example as used in the section above, as RDF Turtle.

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ep: <http://stanbol.apache.org/ontology/enhancer/executionplan#> .
@prefix ehp: <http://stanbol.apache.org/ontology/enhancementproperties#> .
@prefix urn: <urn:> .  # lets urn:execPlan, urn:node1, ... parse as prefixed names

urn:execPlan a ep:ExecutionPlan ;
    ep:hasExecutionNode urn:node1, urn:node2, urn:node3, urn:node4, urn:node5,
        urn:node6, urn:node7, urn:node8;
    ep:chain "demoChain" ;
    ehp:enhancer.max-suggestions "5"^^xsd:int .

urn:node1 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:engine "langdetect" .

urn:node2 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:dependsOn urn:node1 ;
    ep:engine "opennlp-sentence" .

urn:node3 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:dependsOn urn:node2 ;
    ep:engine "opennlp-token" .

urn:node4 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:dependsOn urn:node3 ;
    ep:engine "opennlp-pos" .

urn:node5 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:dependsOn urn:node4 ;
    ep:engine "opennlp-chunker" .

urn:node6 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:dependsOn urn:node4 ;
    ep:engine "opennlp-ner" .

urn:node7 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:dependsOn urn:node5 ;
    ep:engine "dbpedia-fst" ;
    ehp:enhancer.min-confidence "0.85"^^xsd:float .

urn:node8 a ep:ExecutionNode ;
    ep:inExecutionPlan urn:execPlan ;
    ep:dependsOn urn:node7 ;
    ep:engine "dbpedia-dereference" ;
    ehp:enhancer.engines.dereference.languages "en", "de", "es" .
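The rule that each value becomes its own triple can be sketched in a few lines of Python. The namespace is the ehp prefix from the listing above; the function name property_triples and the use of plain tuples instead of an RDF library are simplifications made for this example.

```python
# Namespace of the 'ehp' prefix used in the Turtle listing above.
EHP = "http://stanbol.apache.org/ontology/enhancementproperties#"


def property_triples(subject, properties):
    """Return one (subject, predicate, object) tuple per property value.

    A list or tuple value yields several triples that share the same
    subject and predicate. Illustrative only; no RDF library is involved.
    """
    triples = []
    for name, value in properties.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for single in values:
            triples.append((subject, EHP + name, single))
    return triples


for triple in property_triples(
    "urn:node8", {"enhancer.engines.dereference.languages": ["en", "de", "es"]}
):
    print(triple)
```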

Execution

In contrast to other chain implementations, the ExecutionPlan does not need to be calculated; it is passed in directly by the user. This gives the greatest possible freedom in defining how the execution should take place.

Optional Engines

The execution of optional engines is not mandatory. The enhancement process will continue even if they are not active or their execution fails. It is important to know that engines depending on an optional engine that was not executed will still be called.

Given the above example, this means that even if the 'metaxa' engine cannot be executed, the 'langId' engine will be called by the EnhancementJobManager.
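The behaviour described here can be mimicked with a small sequential simulation in Python. It is a sketch only: the function run_plan and its data layout are hypothetical, and it assumes that a failing non-optional engine aborts the whole process, which this page does not state explicitly.

```python
def run_plan(plan, failing):
    """Simulate execution and return the engines that were called, in order.

    plan: {name: (optional, [dependencies])}
    failing: set of engine names whose execution fails.
    Assumption: failure of a non-optional engine aborts the process.
    """
    done, called = set(), []
    while len(done) < len(plan):
        ready = sorted(
            name for name, (_, deps) in plan.items()
            if name not in done and all(dep in done for dep in deps)
        )
        if not ready:
            raise ValueError("cyclic or unresolved dependencies")
        for name in ready:
            called.append(name)
            optional, _ = plan[name]
            if name in failing and not optional:
                raise RuntimeError(f"required engine '{name}' failed")
            # Completed, or optional and failed: dependents may proceed.
            done.add(name)
    return called


plan = {
    "metaxa": (True, []),
    "langId": (False, ["metaxa"]),
    "ner": (False, ["langId"]),
}
print(run_plan(plan, failing={"metaxa"}))
# ['metaxa', 'langId', 'ner']
```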

Parallel Execution

Engines are executed as soon as all engines they depend on have completed. This includes the case where optional engines were skipped (because they are not active) or failed. As a result, in most cases several EnhancementEngines can be executed in parallel.

Given the above example, both the 'zemanta' and the 'metaxa' engines are executed as soon as the enhancement process starts. When 'metaxa' has finished, the 'langId' engine is called. After 'langId' finishes its work, the EnhancementJobManager calls the 'ner' engine. After that, both the 'dbpedia-linking' and the 'geonames' engines are called. At this point three engines might run simultaneously, assuming that 'zemanta' has not finished yet. The 'refactor' engine has to wait for all three of these engines to complete before it can be executed.
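The walkthrough above can be approximated by grouping engines into "waves", where each engine in a wave depends only on engines in earlier waves. The Python sketch below is a simplification: a real EnhancementJobManager starts an engine as soon as its own dependencies are done, so a slow 'zemanta' may overlap with later waves. The function name execution_waves is hypothetical.

```python
def execution_waves(depends_on):
    """Group engines into waves of mutually independent engines.

    depends_on: {engine name: [names it depends on]}
    Every engine in a wave has all its dependencies in earlier waves.
    """
    remaining = dict(depends_on)
    finished, waves = set(), []
    while remaining:
        wave = sorted(
            name for name, deps in remaining.items() if set(deps) <= finished
        )
        if not wave:
            raise ValueError("cyclic or unknown dependency")
        waves.append(wave)
        finished.update(wave)
        for name in wave:
            del remaining[name]
    return waves


PLAN = {
    "metaxa": [],
    "langId": ["metaxa"],
    "ner": ["langId"],
    "zemanta": [],
    "dbpedia-linking": ["ner"],
    "geonames": ["ner"],
    "refactor": ["geonames", "dbpedia-linking", "zemanta"],
}

for number, wave in enumerate(execution_waves(PLAN), start=1):
    print(number, wave)
# 1 ['metaxa', 'zemanta']
# 2 ['langId']
# 3 ['ner']
# 4 ['dbpedia-linking', 'geonames']
# 5 ['refactor']
```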

Note that for parallel execution to be activated both the used EnhancementJobManager and the different engines must support asynchronous enhancement.