Enhancement Engines
Enhancement engines are the components responsible for enhancing content items. They are called by the Enhancement Job Manager. Enhancement engines have full access to the content items passed to them and are expected to modify the state of those items.
The RESTful interface of an enhancement engine can be accessed via
http://{host}:{port}/{stanbol-root}/enhancer/engine/{engine-name}
For example, an enhancement engine with the name "ner" running in an Apache Stanbol instance on localhost with the default configuration will be accessible at
http://localhost:8080/enhancer/engine/ner
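The URL template above can be filled in programmatically. The following is a minimal sketch only: the class and method names (EngineUris, engineUri) are hypothetical and not part of Apache Stanbol, and it assumes the engine name needs no URL escaping.

```java
import java.net.URI;

/** Hypothetical helper that fills in the engine URL template shown above. */
final class EngineUris {
    private EngineUris() {}

    /**
     * @param host        host name of the Stanbol instance
     * @param port        HTTP port of the Stanbol instance
     * @param stanbolRoot root path of Stanbol; empty or null for the default configuration
     * @param engineName  name of the engine (assumed to need no URL escaping)
     */
    static URI engineUri(String host, int port, String stanbolRoot, String engineName) {
        // Strip leading and trailing slashes so the root is joined with exactly one slash.
        String root = stanbolRoot == null ? "" : stanbolRoot.replaceAll("^/+|/+$", "");
        String prefix = root.isEmpty() ? "" : "/" + root;
        return URI.create("http://" + host + ":" + port + prefix + "/enhancer/engine/" + engineName);
    }
}
```

With an empty root, engineUri("localhost", 8080, "", "ner") yields the example URL given above.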
When using the Java API, enhancement engines can be looked up as OSGi services. The Enhancement Engine Manager service is designed to ease this by providing an API for accessing enhancement engines by their name.
Enhancement Engine Interface
The interface for enhancement engines contains the following three methods and four constants:
/** Getter for the value of the "stanbol.enhancer.engine.name" property */
+ getName() : String
/** Checks if this engine can enhance the passed content item */
+ canEnhance(ContentItem ci) : int
/** Enhances the passed content item */
+ computeEnhancements(ContentItem ci)

/** The property used for the name of an engine */
PROPERTY_NAME : String
/** Indicates that this engine cannot enhance a content item */
CANNOT_ENHANCE : int
/** Indicates support for synchronous enhancement */
ENHANCE_SYNCHRONOUS : int
/** Indicates support for asynchronous enhancement */
ENHANCE_ASYNC : int
Each enhancement engine has a name. This is typically provided by the engine configuration and MUST be set as the value of the property "stanbol.enhancer.engine.name" in the service registration of the enhancement engine. The getter for the name MUST return the same value as the one set for this property. Enhancement engine implementations will usually get the name by calling:
this.name = (String) ctx.getProperties().get(EnhancementEngine.PROPERTY_NAME); // ctx: the ComponentContext passed to the activate method
The canEnhance(ContentItem ci) method is used by the Enhancement Job Manager to check if an engine is able to process a content item. Calling this method MUST NOT change the state of the content item, and the method MUST NOT acquire a write lock on the content item.
The computeEnhancements(ContentItem ci) method starts the processing of the passed content item by the engine and is expected to change its state. Engines that support asynchronous processing must take care to correctly apply read/write locks when reading information from or writing it to the content item. Engines that return ENHANCE_SYNCHRONOUS on calls to canEnhance(..) do not need to use locks; they can rely on having exclusive read/write access to the content item.
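The locking discipline described above can be sketched as follows. This is an illustration under stated assumptions, not Stanbol code: SketchContentItem and WordCountEngineSketch are hypothetical stand-ins for the real ContentItem and EnhancementEngine types, the metadata is reduced to a list of strings, and the constant values are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a content item: text content, metadata and a lock. */
final class SketchContentItem {
    final String mimeType;
    final String text;
    final List<String> metadata = new ArrayList<>();
    final ReadWriteLock lock = new ReentrantReadWriteLock();

    SketchContentItem(String mimeType, String text) {
        this.mimeType = mimeType;
        this.text = text;
    }
}

/** Hypothetical asynchronous engine that adds the word count of plain text. */
final class WordCountEngineSketch {
    // Placeholder values; the constants correspond to those named in the interface above.
    static final int CANNOT_ENHANCE = 0;
    static final int ENHANCE_SYNCHRONOUS = 1;
    static final int ENHANCE_ASYNC = 2;

    /** Only inspects the item: no state change and no write lock. */
    int canEnhance(SketchContentItem ci) {
        return "text/plain".equals(ci.mimeType) ? ENHANCE_ASYNC : CANNOT_ENHANCE;
    }

    /** Reads under the read lock, computes without any lock, writes under the write lock. */
    void computeEnhancements(SketchContentItem ci) {
        String text;
        ci.lock.readLock().lock();
        try {
            text = ci.text;
        } finally {
            ci.lock.readLock().unlock();
        }
        // The read lock is released before the write lock is requested,
        // because ReentrantReadWriteLock does not support upgrading a read lock.
        String trimmed = text.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        ci.lock.writeLock().lock();
        try {
            ci.metadata.add("wordCount=" + words);
        } finally {
            ci.lock.writeLock().unlock();
        }
    }
}
```

An engine that returned ENHANCE_SYNCHRONOUS instead could drop both lock sections, since it would have exclusive access to the content item.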
Enhancement engines have full access to the content item. Theoretically, they would even be allowed to delete all metadata as well as all content parts from the passed content item. Typically, however, they only
- read existing content parts
- add new content parts
- add new enhancements to the metadata
- some engines might also need to update/delete existing metadata.
Both the canEnhance(..) and computeEnhancements(..) methods MUST be called by the Enhancement Job Manager only after the executions of all enhancement engines this one depends on have completed. These dependencies are defined by the Execution Plan used by the Enhancement Job Manager to enhance the content item. Implementors of enhancement engines can therefore rely on all metadata expected to be added by other enhancement engines already being present within the metadata of the passed content item when canEnhance(..) or computeEnhancements(..) is called.
Service Properties Interface
This interface is implemented by most of the current enhancement engines. It allows engines to expose additional properties to other components. This interface defines a single method
/** Getter for the ServiceProperties */ Map<String,Object> getServiceProperties();
but also predefines the property ENHANCEMENT_ENGINE_ORDERING = "org.apache.stanbol.enhancer.engine.order" that can be used by enhancement engine implementations to specify their typical ordering within the enhancement process.
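As an illustration, an engine might expose its ordering as sketched below. ServicePropertiesSketch is a hypothetical stand-in that mirrors the single-method interface described above, and LanguageDetectionSketch with its value of 150 is an invented example, not an existing engine.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical stand-in mirroring the single-method interface described above. */
interface ServicePropertiesSketch {
    String ENHANCEMENT_ENGINE_ORDERING = "org.apache.stanbol.enhancer.engine.order";

    /** Getter for the service properties */
    Map<String, Object> getServiceProperties();
}

/** Hypothetical engine that advertises an illustrative ordering value. */
final class LanguageDetectionSketch implements ServicePropertiesSketch {
    private static final Integer ORDER = 150; // illustrative value only

    @Override
    public Map<String, Object> getServiceProperties() {
        // An immutable single-entry map keeps other components from changing the value.
        return Collections.<String, Object>singletonMap(ENHANCEMENT_ENGINE_ORDERING, ORDER);
    }
}
```

Other components can then read the value with getServiceProperties().get(ENHANCEMENT_ENGINE_ORDERING) without knowing the concrete engine class.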
Engine Ordering Information
By implementing the ServiceProperties interface, enhancement engines can expose additional metadata to other components. The interface defines only a single method
/** Getter for the ServiceProperties */ Map<String,Object> getServiceProperties();
and is implemented by most of the current enhancement engines. Currently its only use is to provide information about the engine ordering within the enhancement process. This information is exposed under the key "org.apache.stanbol.enhancer.engine.order", which is the value of the constant ENHANCEMENT_ENGINE_ORDERING defined directly by the interface. Values are expected to be integers within the following ranges:
- ORDERING_PRE_PROCESSING: Values >= 200 are intended for engines that do some kind of preprocessing of the content. This includes the conversion of media formats, such as extracting plain text from HTML, keyframes from videos, or the waveform from MP3 files, as well as extracting metadata directly encoded within the parsed content, such as ID3 tags from MP3 files or RDFa and microdata provided by HTML content.
- ORDERING_CONTENT_EXTRACTION: This range includes values < 200 and >= 100 and shall be used by enhancement engines that need to analyze the parsed content to extract additional metadata. Examples would be language detection, natural language processing, named entity recognition, face detection in images, speech to text, ...
- ORDERING_EXTRACTION_ENHANCEMENT: This range includes values < 100 and >= 1 and shall be used by enhancement engines that provide semantic lifting of preexisting enhancements, such as linking named entities extracted by an NER engine with entities defined in a controlled vocabulary, or linking artist names, song titles, ... extracted from MP3 files with the corresponding entities defined in a music database.
- ORDERING_DEFAULT: This represents the value 0 and shall be used as the default value for all enhancement engines that do not provide ordering information or do not implement the ServiceProperties interface.
- ORDERING_POST_PROCESSING: This range includes values < 0 and >= -100 and is intended to be used by all enhancement engines that do post-processing of enhancement results, such as schema translation, filtering of enhancements, ...
The engine ordering information described here is used by the Default Chain and the Weighted Chain to calculate the Execution Plan.
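The ranges listed above can be turned into a simple ordering, sketched below. This is not the algorithm used by the Default Chain or the Weighted Chain; it only assumes that engines with higher ordering values run earlier, as the range names suggest. EngineOrderingSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that classifies and sorts engines by their ordering value. */
final class EngineOrderingSketch {
    static final String ORDERING_KEY = "org.apache.stanbol.enhancer.engine.order";
    static final int ORDERING_DEFAULT = 0;

    private EngineOrderingSketch() {}

    /** Reads the ordering value; falls back to the default if it is missing or not a number. */
    static int orderOf(Map<String, Object> serviceProperties) {
        Object value = serviceProperties == null ? null : serviceProperties.get(ORDERING_KEY);
        return value instanceof Number ? ((Number) value).intValue() : ORDERING_DEFAULT;
    }

    /** Names the range a value falls into, following the list above. */
    static String phaseOf(int order) {
        if (order >= 200) return "pre-processing";
        if (order >= 100) return "content extraction";
        if (order >= 1) return "extraction enhancement";
        if (order == 0) return "default";
        if (order >= -100) return "post-processing";
        return "unspecified"; // values below -100 are not covered by the list above
    }

    /** Returns the engine names sorted so that higher ordering values come first. */
    static List<String> executionOrder(Map<String, Map<String, Object>> engines) {
        List<String> names = new ArrayList<>(engines.keySet());
        names.sort(Comparator.comparingInt((String n) -> orderOf(engines.get(n))).reversed());
        return names;
    }
}
```

Engines without ordering information are treated as ORDERING_DEFAULT, so they end up between the extraction enhancement and post-processing engines.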
Basically, this feature allows the implementor of an enhancement engine to define the correct position of the engine within a typical enhancement chain. It thereby ensures that users who add the engine to an enhancer installation can immediately use it with the Default Chain.
However, the engine ordering is not the only way for users to control the execution order. Enhancement chain implementations such as the List Chain and the Graph Chain allow the order of execution to be defined directly. For these chains, the ordering information provided by enhancement engines is ignored.
Enhancement Engine Management
This section describes how enhancement engines are managed by the Apache Stanbol Enhancer and how they can be selected/accessed through the Enhancement Job Manager and executed in an Enhancement Chain.
Enhancement engines are registered as OSGi services and managed by using the following service properties:
- Name: Defined by the value of the property "stanbol.enhancer.engine.name". It is used to access engines on the Stanbol RESTful interface.
- Service Ranking: The service ranking property defined by OSGi is used to decide which engine to use in case several active enhancement engines use the same name. In such cases only the engine with the highest ranking will be used to enhance content items.
Other components, such as enhancement chains, refer to engines by their name. The actual enhancement engine instance is only looked up shortly before the execution.
Enhancement Engine Name Conflicts
As enhancement engines are identified by the value of the "stanbol.enhancer.engine.name" property (the name), there might be cases where multiple enhancement engines are registered with the same name. In such cases the normal OSGi procedure for selecting the default service instance among several possible matches is used. This means that, on requests for an enhancement engine with a given name,
- the enhancement engine with the highest "service.ranking" is selected, and
- among engines with the same ranking, the one with the lowest "service.id" is selected.
Requests on the RESTful service API are always answered by the enhancement engine selected as default. When using the Java API, there are also means to retrieve all enhancement engines for a given name via the Enhancement Engine Manager interface.
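The selection rule can be expressed as a comparator. The sketch below only illustrates the rule; EngineRegistration and DefaultEngineSelection are hypothetical types that hold the relevant OSGi properties as plain fields, whereas in a running system the OSGi framework performs this selection.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical value object holding the properties relevant for the selection. */
final class EngineRegistration {
    final String name;        // value of "stanbol.enhancer.engine.name"
    final int serviceRanking; // value of "service.ranking"
    final long serviceId;     // value of "service.id"

    EngineRegistration(String name, int serviceRanking, long serviceId) {
        this.name = name;
        this.serviceRanking = serviceRanking;
        this.serviceId = serviceId;
    }
}

/** Hypothetical helper that picks the default engine for a name. */
final class DefaultEngineSelection {
    /** Highest ranking first; equal rankings are resolved by the lowest service id. */
    static final Comparator<EngineRegistration> DEFAULT_FIRST =
            Comparator.comparingInt((EngineRegistration r) -> r.serviceRanking)
                    .reversed()
                    .thenComparingLong(r -> r.serviceId);

    private DefaultEngineSelection() {}

    /** Returns the default registration for the name, or empty if none is registered. */
    static Optional<EngineRegistration> select(List<EngineRegistration> all, String name) {
        return all.stream()
                .filter(r -> r.name.equals(name))
                .min(DEFAULT_FIRST);
    }
}
```

If the selected registration disappears from the list, calling select again returns the next-best match, which is the fallback behaviour that registering several engines under one name relies on.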
From a user perspective there is one major use case for configuring multiple enhancement engines with the same name: defining fallback engines in case the main one becomes unavailable. For example, assume that a user has a local cache of geonames.org loaded into the Entityhub and configures a Named Entity Linking engine to perform semantic lifting of extracted locations. Apache Stanbol also provides the geonames.org Engine, which offers similar functionality by directly accessing geonames.org. By configuring both engines with the same name, but specifying a higher service ranking for the one using the local cache, one can ensure that the local cache is used for the enhancement under normal circumstances. If the local cache becomes unavailable, the other engine, which uses the remote service, will be used for enhancement.
Enhancement Engine Manager Interface
The Enhancement Engine Manager is the management interface for enhancement engines. Components can use it to look up enhancement engines by their name. There is also an OSGi ServiceTracker-like implementation that can be used to track only enhancement engines registered for a specific set of names.
Enhancement Engine Implementations
A list of enhancement engine implementations maintained directly by the Apache Stanbol community can be found here. However, the enhancement engine interface is designed so that advanced Apache Stanbol users can implement their own enhancement engines to meet their specific needs.
The Apache Stanbol community would be very happy if users shared their thoughts about possible enhancement engines or contributed additional engines to the Apache Stanbol project.