This project has retired. For details please refer to its Attic page.
Apache Stanbol - Basic Content Enhancement

Basic Content Enhancement

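This usage scenario provides the information needed to get started with the Stanbol Enhancer. It covers using the RESTful enhancement service, configuring and using enhancement chains, using a local index of a linked open data source, and processing the enhancement results.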

Using the RESTful Enhancement service

To enhance content, simply post your content to the Stanbol Enhancer. The Enhancer uses a Chain of Enhancement Engines to process the submitted content and returns the extracted features as RDF, encoded using the Stanbol Enhancement Structure. The following figure provides an overview of that process.

Enhancing Content with the Stanbol Enhancer

If you have a local Stanbol instance, you can also test this via the web interface of the Apache Stanbol Enhancer - http://{host}:{port}/enhancer - or from the command line using the curl command.

curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
--data "The Stanbol enhancer can detect famous cities such as Paris \
and people such as Bob Marley." http://localhost:8080/enhancer

The following script sends the contents of the text-examples folder to the Stanbol Enhancer. It can also be used to process the contents of any other folder on the local file system. If you want to keep the enhancement results, you can redirect the output of the curl command (e.g. to files).

for file in enhancer/data/text-examples/*.*
do
    # Quote the file name so that paths containing spaces are handled correctly
    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
        -T "$file" http://localhost:8080/enhancer
done
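To keep the results, the loop above can redirect each response into its own file. The following is a minimal sketch, not part of the Stanbol distribution: the helper names `result_name` and `enhance_folder`, the `STANBOL_URL` variable and the `.ttl` suffix are assumptions chosen for illustration.

```shell
#!/bin/sh
# Assumed default; adjust to your installation.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Hypothetical helper: derive the name of the result file from an input file.
result_name() {
    printf '%s.ttl\n' "$(basename "$1")"
}

# Hypothetical helper: enhance every regular file in $1 and store the
# Turtle response for each of them in the folder $2.
enhance_folder() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for file in "$in_dir"/*; do
        # Skip sub-folders and the unexpanded pattern of an empty folder
        [ -f "$file" ] || continue
        curl -s -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
            -T "$file" "$STANBOL_URL/enhancer" \
            > "$out_dir/$(result_name "$file")"
    done
}
```

With these definitions loaded, `enhance_folder enhancer/data/text-examples results` would write one `.ttl` file per input file into a `results` folder.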

The Apache Stanbol Enhancer can also enhance non-plain-text files. In this case Apache Tika, via the Tika Engine, is used to extract the plain text from those files (see the Apache Tika documentation for supported file formats).
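As a rough sketch of such a request, the content type of the file can be sent in place of `text/plain`. The helper names `content_type_for` and `enhance_file`, the `STANBOL_URL` variable and the small extension table are assumptions made for this example. Whether a given format is actually processed depends on the Tika Engine being part of the chain that is used.

```shell
#!/bin/sh
# Assumed default; adjust to your installation.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Hypothetical helper: choose a Content-type header from the file extension.
# The table is deliberately small and only illustrative.
content_type_for() {
    case "$1" in
        *.pdf)        echo "application/pdf" ;;
        *.html|*.htm) echo "text/html" ;;
        *.txt)        echo "text/plain" ;;
        *)            echo "application/octet-stream" ;;
    esac
}

# Hypothetical helper: send one file, unmodified, with a matching header.
enhance_file() {
    curl -X POST -H "Accept: text/turtle" \
        -H "Content-type: $(content_type_for "$1")" \
        --data-binary "@$1" "$STANBOL_URL/enhancer"
}
```

For example, `enhance_file report.pdf` would post a (hypothetical) PDF file to the default chain.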

Configuring and Using Enhancement Chains

The Apache Stanbol Enhancer supports multiple enhancement chains. This feature makes it possible to configure and use multiple processing chains for submitted content within the same Apache Stanbol instance.

Chains are built from an execution plan that references one or more enhancement engines by their name. Users can create and modify enhancement chains by using the Configuration tab of the Apache Felix web console - http://{host}:{port}/system/console/configMgr. There are three different implementations:

  1. the self sorting weighted chain
  2. the list chain
  3. the graph chain, which allows direct configuration of the execution graph and thereby lets advanced users optimize chain execution.

In addition, the Stanbol Enhancer includes the so-called Default Chain, which includes all currently active enhancement engines. While this chain is enabled by default, most users might want to deactivate it as soon as they have configured their own chains.

To configure enhancement engines it is essential to understand the intention of the different enhancement engine implementations. The list of available enhancement engines managed by the Apache Stanbol community is available here. See the documentation of the listed engines for detailed information.

The list groups engines by category. Preprocessing engines typically perform operations on the content as a whole. This includes plain-text extraction, metadata extraction and language detection. They are followed by engines that analyze the submitted content. This category currently includes all Natural Language Processing (NLP) related engines, but would also include image, audio and video processing. The third category consists of engines that consume features extracted from the content and perform some kind of semantic lifting on them - e.g. linking extracted features with entities/concepts contained in controlled vocabularies. Finally, post-processing engines can be used to adjust rankings, filter out unwanted enhancements or apply other kinds of transformations to the enhancement results.

A typical text processing enhancement chain might look like this:

And here is another enhancement chain using an external service

Tips for configuring enhancement chains:

After configuring all the enhancement engines and combining them into enhancement chains, it is important to understand how to inspect and call the configured components via the RESTful API of the Apache Stanbol Enhancer.

Enhancement requests issued directly to the /enhancer endpoint (or the old, deprecated /engines endpoint) are processed by the Enhancement Chain with the name "default" or, if no chain with that name exists, by the one with the highest "service.ranking" (see here for details). To process content with a specific chain, requests need to be issued against /enhancer/chain/{chain-name}.

Note that it is also possible to enhance content by using a single enhancement engine. For that, requests can be sent to /enhancer/engine/{engine-name}. A typical example would be passing text directly to the Language Identification Engine in order to use the Apache Stanbol Enhancer to detect the language of the submitted content.
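The three kinds of enhancement requests differ only in the URL they are sent to. The following sketch illustrates this under stated assumptions: `enhancer_endpoint`, `enhance_text` and `STANBOL_URL` are hypothetical names introduced here, and the chain or engine name has to be one that is actually configured in your instance.

```shell
#!/bin/sh
# Assumed default; adjust to your installation.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Hypothetical helper: build the URL for an enhancement request.
#   enhancer_endpoint                -> default chain
#   enhancer_endpoint chain <name>   -> chain with the given name
#   enhancer_endpoint engine <name>  -> single engine with the given name
enhancer_endpoint() {
    case "${1:-}" in
        "")
            echo "$STANBOL_URL/enhancer" ;;
        chain|engine)
            # A name is required for chain and engine requests
            [ -n "${2:-}" ] || return 1
            echo "$STANBOL_URL/enhancer/$1/$2" ;;
        *)
            echo "unknown endpoint kind: $1" >&2
            return 1 ;;
    esac
}

# Hypothetical helper: post plain text ($1) to the endpoint selected by
# the remaining arguments.
enhance_text() {
    text=$1
    shift
    url=$(enhancer_endpoint "$@") || return 1
    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
        --data "$text" "$url"
}
```

For example, `enhance_text "Paris is a city." chain my-chain` would use a chain called `my-chain`, and `enhance_text "Paris is a city." engine langid` would call a single engine. Both `my-chain` and `langid` are placeholder names here; use the names shown in your own configuration.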

To sum up, the RESTful API of the Apache Stanbol Enhancer is structured as follows:

GET /enhancer - returns the configuration of the Stanbol Enhancer
GET /enhancer/chain - returns the configuration of all active Enhancement Chains
GET /enhancer/engine - returns the configuration of all active Enhancement Engines
POST /enhancer - enhances parsed content by using the default Enhancement Chain
POST /enhancer/chain/{chain-name} - enhances parsed content by using the Enhancement Chain with the given name
POST /enhancer/engine/{engine-name} - enhances parsed content by using only the referenced Enhancement Engine
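The GET requests listed above can be issued with curl as well, for instance to look up the names of the active chains and engines before sending content. This is only a sketch: `config_path`, `show_config` and `STANBOL_URL` are hypothetical names, and requesting the configuration as `text/turtle` is an assumption that may need adjusting for your instance.

```shell
#!/bin/sh
# Assumed default; adjust to your installation.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Hypothetical helper: map a keyword to one of the GET paths listed above.
config_path() {
    case "${1:-all}" in
        all)     echo "/enhancer" ;;
        chains)  echo "/enhancer/chain" ;;
        engines) echo "/enhancer/engine" ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: fetch the configuration selected by the keyword.
show_config() {
    path=$(config_path "${1:-all}") || return 1
    # The Accept header is an assumption; other RDF serializations may apply.
    curl -s -H "Accept: text/turtle" "$STANBOL_URL$path"
}
```

Calling `show_config engines` would then request the configuration of all active engines.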

See the documentation of the RESTful API for all services and parameters of the Apache Stanbol Enhancer.

Using a Local Index of a Linked Open Data Source

Both the Named Entity Tagging Engine and the Keyword Linking Engine need to be configured with a dataset containing the entities to link/extract for submitted content. As those engines typically need to make a lot of requests against those datasets, it is important to make the data locally available - a feature of the Apache Stanbol Entityhub.

Because of this, Apache Stanbol allows local indexes of datasets to be created and installed. How to create those indexes is described in detail in this user scenario. A set of pre-computed indexes can be downloaded from the IKS development server.

Indexes always consist of two parts:

To install the local index of a dataset, the following two steps need to be performed:

NOTE: In the case of "dbpedia" the OSGi bundle with the configuration does not need to be installed, as the default configuration of the Apache Stanbol launcher already includes the configuration of the necessary components.

Processing the Enhancement Results

The final step in using the Apache Stanbol Enhancer is processing the enhancement results. As this is a central part for developers of client applications, it is described in another usage scenario.