Graph Chain
The GraphChain allows to directly configure the ExecutionPlan returned by the Chain.getExecutionPlan() method. This means on the one hand that it allows to configure any kind of execution process on the other hand its usage also requires a lot of knowledge about the EnhancementEngines and the ExecutionPlan model.
Typically it is a good practice to start with other - more simple to use - Chain implementation such as the Weighted Chain and only afterwards convert this configuration to a GraphChain to configure optimizations to the enhancement process such as to allow more engines to be executed in parallel.
Configuration
The GraphChain supports two variants to configure the ExecutionPlan.
GraphResource
A GraphResource is an RDF file available via the DataFileProvider. The easiest way is to copy the RDF file defining the ExecutionPlan to the "/sling/datafile" directory within the Stanbol home directory. The configuration of the GraphChain needs then only to refer to that file such as:
stanbol.enhancer.chain.graph.graphresource=myExecutionPlan.rdf
The used RDF encoding is guessed by the file extension. If the extension is not recognized, the format can be also parsed as additional parameter
stanbol.enhancer.chain.graph.graphresource=myExecutionPlan.something;format=application/rdf+xml
The GraphCain will track for that file and activate itself as soon as the file gets available. Removing the file, waiting some seconds and providing the new version afterwards should also work. Just replacing the file will not work, because the DataFileProvider does not have support for updates. In such cases it might be needed to deactivate/activate the GraphChain.
ChainList
This allows to directly configure the ExecutionPlan as value of the "stanbol.enhancer.chain.graph.chainlist" property. Both arrays and collections are supported.
Note: As soon as a graph resource is configured the ChainList will be ignored. This is even true if the configured GraphResource is currently not available!
The syntax is defined as follows:
{engine-name};[optional];[dependsOn={engine-name1},{engine-name2}]
The following example shows how this syntax can be used to define an ExecutionPlan.
metaxa;optional langId;dependsOn=metaxa ner;dependsOn=langId zemanta;optional dbpedia-linking;dependsOn=ner geonames;optional;dependsOn=ner refactor;dependsOn=geonames,dbpedia-linking,zemanta
Note: The internal oder of the list does not influence the resulting ExecutionPlan. Only the "dependsOn" properties are used to determine the execution order of the engines and if engines can be executed in parallel.
Within an OSGI configuration file (org.apache.stanbol.enhancer.chain.graph.impl.GraphChain-myGraphChain.config) this would look like
stanbol.enhancer.chain.graph.chainlist=[ "metaxa;optional","langId;dependsOn\=metaxa","ner;dependsOn\=langId", "zemanta;optional","dbpedia-linking;dependsOn\=ner", "geonames;optional;dependsOn\=ner", "refactor;dependsOn\=geonames,dbpedia-linking,zemanta"]
Note: The whole test MUST BE in a single line within the .config file.
A better visual expression provides this screenshot of the Apache Felix web console showing the dialog for the above configuration
Enhancement Properties Support
since 0.12.1
Starting from 0.12.1
the Graph Chain allows to configure EnhancementProperties.
Chain List based Configuration
In case the Chain List type configuration is used properties are configured as follows:
-
chain and engine scoped properties are defined as parameters to the engines with the syntax
{engine-name}; {property-name-1}={value-1},{value-2}; {property-name-2}={value-1};
-
chain scoped properties can be configured by using the osgi property key
stanbol.enhancer.chain.chainproperties
by the syntax{property-name-1}={value-1},{value-2}
. NOTE that;
is NOT supported as separator for parsing multiple properties as OSGI configurations already define a way for parsing multiple values
All EnhancementProperties configured with a Chain are written as RDF to the ExecutionPlan. Chain scoped properties are directly added to the ep:ExecutionPlan
instance while chain and engine scoped properties are added to the ep:ExecutionNode
of the according engine.
The following figure and listing provide an example
The figure shows the maximum number of suggestions is set as a chain scoped property to 5
. In addition two chain and engine scoped properties are set. First for the dbpedia-fst
engine the minimum confidence is set to 0.85
and second for the dbpedia-dereference
engine the dereferenced languages are set to English, German and Spanish.
In case of the GraphChain it is typical that chain and engine scoped Enhancement Properties get mixed with parameters of the chain configuration itself. As Enhancement Properties are required to start with enhancer.
they can be easily separated with chain specific parameters such as dependsOn
.
The following listing shows the exact same configuration in the .cfg
format.
stanbol.enhancer.chain.name="graph-chain" stanbol.enhancer.chain.chainproperties=["enhancer.max-suggestions\=5"] stanbol.enhancer.chain.graph.chainlist=["tika;optional","langdetect;\ dependsOn\=tika", "opennlp-sentence;\ dependsOn\=langdetect","opennlp-token;\ dependsOn\=opennlp-sentence", "opennlp-pos;\ dependsOn\=opennlp-pos","opennlp-chunker;\ optional;\ dependsOn\=opennlp-chunker", "opennlp-ner;\ dependsOn\=opennlp-pos", "dbpedia-fst;\ dependsOn\=opennlp-chunker,opennlp-pos;enhancer.min-confidence\=0.85", "dbpedia-dereference;\ dependsOn\=dbpedia-fst;\ enhancer.engines.dereference.languages\=en,de,es"]
Graph Resource Configuration
In case the ExecutionPlan is configured by an RDF file the EnhancementProperties need to be directly encoded as RDF.
Chain scoped properties need to be attached to the ep:ExecutionPlan
instance while chain and engine scoped properties are added to the ep:ExecutionNode
of the according engine.
Single properties are represented by triples where the execution plan or execution mode instance is the subject. The URI or the enhancement property is the predicate and the value is the object. Multiple valus are represented by multiple triples with the same subject and predicate.
The following listing shows the same example as used in the above section as RDF turtle.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix ep: <http://stanbol.apache.org/ontology/enhancer/executionplan#> . @prefix ehp: <http://stanbol.apache.org/ontology/enhancementproperties#> . urn:execPlan a ep:ExecutionPlan ; ep:hasExecutionNode urn:node1, urn:node2, urn:node3, urn:node4, urn:node5 urn:node6, urn:node7, urn:node8; ep:chain "demoChain" ; ehp:enhancer.max-suggestions "5"^^xsd:int . urn:node1 a stanbol:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:engine "langdetect" . urn:node2 a ep:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:dependsOn urn:node1 ; ep:engine "opennlp-sentence" . urn:node3 a ep:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:dependsOn urn:node2 ; ep:engine "opennlp-token" . urn:node4 a ep:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:dependsOn urn:node3 ; ep:engine "opennlp-pos" . urn:node5 a ep:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:dependsOn urn:node4 ; ep:engine "opennlp-chunker" . urn:node6 a ep:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:dependsOn urn:node4 ; ep:engine "opennlp-ner" . urn:node7 a ep:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:dependsOn urn:node5 ; ep:engine "dbpedia-fst" ; ehp:enhancer.min-confidence "0.85"^^xsd:float . urn:node8 a ep:ExecutionNode ; ep:inExecutionPlan urn:execPlan ; ep:dependsOn urn:node7 ; ep:engine "dbpedia-dereference" ; ehp:enhancer.engines.dereference.languages "en", "de", "es" .
Execution
In contrast to other chain implementations the ExecutionPlan must not be calculated but is directly parsed by the user. This provides the most possible freedom in defining how the execution should take place.
Optional Engines
The execution of optional engines is not mandatory. The enhancement process will continue, even if they are not active or their execution fail. For users it is important to know, that even engines that depend on an optional engine that was not executed will be called.
Given the above example this means that even if the 'metaxa' engine can not be executed the 'langId' will be called by the EnhancementJobManager.
Parallel Execution
Engines are executed as soon as all engines they depend on have completed. This also includes if optional engines were skipped (because they are not active) or failed. This means that in most cases several EnhancementEngines can be executed in parallel.
Given the above example, both the 'zemanta' and the 'metaxa' engine are executed as soon as the enhancement process starts. When 'metaxa' is finished, the 'langid' engine is called. After the 'langid' finishes its work, the EnhancementJobManager calls the 'ner' engine. After that both the 'dbpedia-linking' and the 'geonames' engine are called. At this time three engines might run simultaneously assuming that 'zemanta' has not finished yet. Before the 'refactor' engine can be executed it need to wait for all these engines to complete.
Note that for parallel execution to be activated both the used EnhancementJobManager and the different engines must support asynchronous enhancement.