Basic Content Enhancement
This usage scenario provides the information needed to get started with the Stanbol Enhancer. This includes:
- Using the RESTful API of the Stanbol Enhancer
- An overview of the available Enhancement Engines
- Configuration of the Stanbol Enhancer
Using the RESTful Enhancement service
To enhance content, you simply post your content to the Stanbol Enhancer. The Enhancer uses a Chain of Enhancement Engines to process the parsed content and returns the extracted features as RDF encoded using the Stanbol Enhancement Structure. The following figure provides an overview of that process.
If you have a local Stanbol instance, you can also test this via the web interface of the Apache Stanbol Enhancer - http://{host}:{port}/enhancer - or from the command line using the curl command.
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
     --data "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley." \
     http://localhost:8080/enhancer
The following script sends the contents of the text-examples folder to the Stanbol Enhancer. It can equally be used to process the contents of any folder on the local file system. If you want to keep the enhancement results, you can redirect the output of the curl command (e.g. to files).
for file in enhancer/data/text-examples/*.*; do
    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
         -T "$file" http://localhost:8080/enhancer
done
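To keep the results, the loop above can be wrapped in a small function that writes each response to its own file. This is only a sketch: the function name enhance_folder, the STANBOL_URL variable, and the ".ttl" naming of the output files are assumptions made for illustration and are not part of Stanbol.

```shell
# Assumed base URL of a local Stanbol instance; adjust host and port as needed.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# enhance_folder SRC_DIR OUT_DIR
# Posts every regular file in SRC_DIR to the Enhancer and stores each
# response as OUT_DIR/<file name>.ttl (hypothetical naming convention).
enhance_folder() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for file in "$src_dir"/*; do
        # Skip sub-folders and unmatched glob patterns.
        [ -f "$file" ] || continue
        curl -s -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
             -T "$file" "$STANBOL_URL/enhancer" \
             > "$out_dir/$(basename "$file").ttl"
    done
}
```

Calling enhance_folder enhancer/data/text-examples results would then leave one Turtle file per input file in the results folder.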
The Apache Stanbol Enhancer can also enhance non-plain-text files. In this case Apache Tika, via the Tika Engine, is used to extract the plain text from those files (see the Apache Tika documentation for supported file formats).
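As a rough sketch of sending such a file, the request can declare the media type of the document in the Content-type header instead of text/plain. The helper names, the STANBOL_URL variable, and the extension-to-media-type mapping below are illustrative assumptions. The sketch also assumes that the chain handling the request contains the Tika Engine.

```shell
# Assumed base URL of a local Stanbol instance.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# content_type_for FILE
# Prints a media type guessed from the file extension (illustrative mapping only).
content_type_for() {
    case "$1" in
        *.pdf)  echo "application/pdf" ;;
        *.doc)  echo "application/msword" ;;
        *.html) echo "text/html" ;;
        *)      echo "text/plain" ;;
    esac
}

# enhance_file FILE
# Posts FILE to the Enhancer and declares its media type in the request.
enhance_file() {
    curl -s -X POST -H "Accept: text/turtle" \
         -H "Content-type: $(content_type_for "$1")" \
         -T "$1" "$STANBOL_URL/enhancer"
}
```

For example, enhance_file report.pdf would post the file with the Content-type application/pdf.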
Configuring and Using Enhancement Chains
The Apache Stanbol Enhancer supports multiple enhancement chains. This feature allows you to configure and use multiple processing chains for parsed content within the same Apache Stanbol instance.
Chains are built from an execution plan that references one or more enhancement engines by their name. Users can create and modify enhancement chains by using the Configuration Tab of the Apache Felix web console - http://{host}:{port}/system/console/configMgr. There are three different implementations:
- the self-sorting weighted chain
- the list chain
- the graph chain, which allows direct configuration of the execution graph and thereby lets advanced users optimize chain execution.
In addition, the Stanbol Enhancer includes the so-called Default Chain, which includes all currently active enhancement engines. While this chain is enabled by default, most users might want to deactivate it as soon as they have configured their own chains.
To configure enhancement engines it is essential to understand the intention of the different enhancement engine implementations. The list of available enhancement engines managed by the Apache Stanbol community is available here. See the documentation of the listed engines for detailed information.
The list groups engines by category. Preprocessing engines typically perform operations on a content scope. This includes plain-text extraction, metadata extraction and language detection. They are followed by engines that analyse the parsed content. This category currently includes all Natural Language Processing (NLP) related engines, but would also include image, audio and video processing. The third category consists of engines that consume extracted features from the content and perform some kind of semantic lifting on them, e.g. linking extracted features with entities/concepts contained in controlled vocabularies. Finally, post-processing engines can be used to adjust rankings, filter out unwanted enhancements or apply other kinds of transformations to the enhancement results.
A typical text processing enhancement chain might look like this:
- tika - to convert parsed content to "text/plain"
- langid - to detect the language of the parsed text
- ner - to extract named entities (persons, organizations, places) from the parsed text
- dbpediaLinking - link extracted named entities with entities defined by dbpedia.org
- myCustomVocExtraction - keyword extraction based on a custom built vocabulary - as described by this usage scenario.
And here is another enhancement chain, this time using an external service:
- tika - assuming we want to send MS Word documents to Zemanta
- zemanta - this wraps Zemanta.com as an Apache Stanbol Enhancement Engine
Tips for configuring enhancement chains:
- http://{host}:{port}/enhancer/chain provides a list of all configured Enhancement Chains. It also includes direct links to their configurations.
- Because the configuration of Enhancement Chains requires the names of active Enhancement Engines, it is very useful to open http://{host}:{port}/enhancer/engine in another browser window.
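The two URLs from these tips can also be queried from the command line. The following is a minimal sketch: the function name list_enhancer_config and the STANBOL_URL variable are hypothetical, and the Accept header is an assumption about which representation is wanted, since without it the service may answer with the HTML page shown in the browser.

```shell
# Assumed base URL of a local Stanbol instance.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# list_enhancer_config KIND
# KIND is "chain" or "engine"; requests GET /enhancer/{KIND} and prints
# the response describing the active components of that kind.
list_enhancer_config() {
    case "$1" in
        chain|engine) ;;
        *) echo "usage: list_enhancer_config chain|engine" >&2; return 2 ;;
    esac
    # Assumption: asking for Turtle instead of the default representation.
    curl -s -H "Accept: text/turtle" "$STANBOL_URL/enhancer/$1"
}
```

Running list_enhancer_config engine before editing a chain gives the engine names that the chain configuration has to reference.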
After configuring the enhancement engines and combining them into enhancement chains, it is important to understand how to inspect and call the configured components via the RESTful API of the Apache Stanbol Enhancer.
Enhancement requests issued directly to the /enhancer endpoint (or the old, deprecated /engines endpoint) are processed by the Enhancement Chain with the name "default" or, if no chain with that name exists, by the one with the highest "service.ranking" (see here for details). To process content with a specific chain, requests need to be issued against /enhancer/chain/{chain-name}.
Note that it is also possible to enhance content by using a single enhancement engine. For that, requests can be sent to /enhancer/engine/{engine-name}. A typical example would be sending text directly to the Language Identification Engine in order to use the Apache Stanbol Enhancer to detect the language of the parsed content.
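Requests to a specific chain or to a single engine differ from the default request only in the URL. The sketch below illustrates this with one hypothetical helper, enhance_via. The STANBOL_URL variable and the chain name "myChain" in the usage note are likewise assumptions, while "langid" is the engine name used in the example chain earlier in this scenario.

```shell
# Assumed base URL of a local Stanbol instance.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# enhance_via KIND NAME TEXT
# KIND is "chain" or "engine"; posts TEXT as text/plain to
# /enhancer/{KIND}/{NAME} and prints the Turtle response.
enhance_via() {
    case "$1" in
        chain|engine) ;;
        *) echo "usage: enhance_via chain|engine NAME TEXT" >&2; return 2 ;;
    esac
    curl -s -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
         --data "$3" "$STANBOL_URL/enhancer/$1/$2"
}
```

With this helper, enhance_via engine langid "Paris is a city in France." would ask only the language detection engine, and enhance_via chain myChain "..." would use a chain configured under that name.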
To sum up, the RESTful API of the Apache Stanbol Enhancer is structured as follows:
- GET /enhancer - returns the configuration of the Stanbol Enhancer
- GET /enhancer/chain - returns the configuration of all active [Enhancement Chains](components/enhancer/chains)
- GET /enhancer/engine - returns the configuration of all active [Enhancement Engines](components/enhancer/engines)
- POST /enhancer - enhances parsed content by using the default Enhancement Chain
- POST /enhancer/chain/{chain-name} - enhances parsed content by using the Enhancement Chain with the given name
- POST /enhancer/engine/{engine-name} - enhances parsed content by using only the referenced Enhancement Engine
See the documentation of the RESTful API for all services and parameters of the Apache Stanbol Enhancer.
Using a Local Index of a Linked Open Data Source
Both the Named Entity Tagging Engine and the Keyword Linking Engine need to be configured with a dataset containing the entities to link/extract for parsed content. As those engines typically make a lot of requests against those datasets, it is important to make the data available locally - a feature of the Apache Stanbol Entityhub.
For this reason Apache Stanbol allows you to create and install local indexes of datasets. A detailed description of how to create those indexes is given in this user scenario. A set of pre-computed indexes can be downloaded from the IKS development server.
Indexes always consist of two parts:
- org.apache.stanbol.data.site.{name}-{version}.jar - An OSGi bundle containing the configuration for
  - the Apache Entityhub "ReferencedSite" accessible at "http://{host}/{root}/entityhub/site/{name}"
  - the "Cache" used to connect the ReferencedSite with your data and
  - the "SolrYard" component managing the installed data.
- {name}.solrindex.zip - The index data of the dataset (basically a ZIP archive of a Solr core)
To install the local index of a dataset, the following two steps need to be performed:
- copying the zip archive into the "{stanbol-working-dir}/stanbol/datafiles" folder
- adding the OSGi bundle to the Stanbol environment (e.g. by using the Bundle Tab (http://localhost:8080/system/console/bundles) of the Apache Felix web console - http://{host}:{port}/system/console).
NOTE: In the case of "dbpedia" the OSGi bundle with the configuration does not need to be installed, as the default configuration of the Apache Stanbol launcher already includes the configuration of the necessary components.
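The two installation steps can be scripted roughly as follows. This is a sketch under several assumptions: the function name install_index and the STANBOL_URL and STANBOL_DIR variables are hypothetical, and the bundle upload relies on the form fields and the admin:admin credentials of a default Apache Felix web console setup, which may differ in your installation.

```shell
# Assumed locations; adjust to your installation.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"
STANBOL_DIR="${STANBOL_DIR:-.}"   # stands for {stanbol-working-dir}

# install_index INDEX_ZIP [BUNDLE_JAR]
install_index() {
    index_zip=$1
    bundle_jar=${2:-}
    datafiles="$STANBOL_DIR/stanbol/datafiles"
    mkdir -p "$datafiles" || return 1
    # Step 1: copy the Solr index archive into the datafiles folder.
    cp "$index_zip" "$datafiles/" || return 1
    # Step 2 (optional): upload the OSGi bundle via the Felix web console.
    # Form fields and credentials are assumptions (default console setup).
    if [ -n "$bundle_jar" ]; then
        curl -s -u admin:admin -F action=install -F bundlestart=start \
             -F bundlefile=@"$bundle_jar" "$STANBOL_URL/system/console/bundles"
    fi
}
```

For "dbpedia" the second argument can simply be omitted, because the bundle does not need to be installed in that case.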
Processing the Enhancement Results
The final step in using the Apache Stanbol Enhancer is processing the enhancement results. As this is a central concern for developers of client applications, it is described in another usage scenario.