Apache Stanbol Enhancer
The Apache Stanbol Enhancer provides both a RESTful and a Java API that allow a caller to extract features from passed content. In more detail, the passed content is processed by the Enhancement Engines defined by the called Enhancement Chain.
Note that this is the technical documentation of the Stanbol Enhancer, intended for developers. For a more practical, use-case-oriented introduction to the Stanbol Enhancer and other components, please have a look at the available Usage Scenarios.
Using the Stanbol Enhancer
The figure below provides an overview of the RESTful and the Java API provided by the Stanbol Enhancer.
RESTful API
The content to be analyzed should be sent in a POST request with its MIME type specified in the Content-type header. The passed content is then processed by the targeted Enhancement Chain. The response holds the RDF enhancements serialized in the format specified in the Accept header. The following figure visualizes this process.
You can test that easily from the command line using the curl command:
    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
         --data "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley." \
         http://localhost:8080/enhancer
The RESTful interface also provides parameters that can be used to pass or request additional information. The following example shows a request that returns the text/plain version extracted from the HTML content passed in the request.
    curl -v -X POST -H "Accept: text/plain" \
         -H "Content-type: text/html; charset=UTF-8" \
         --data "<html><body><p>The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley.</p></body></html>" \
         "http://localhost:8080/enhancer/chain/language?omitMetadata=true"
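The same kind of request can also be issued from plain Java. The following is a minimal sketch, assuming Java 11 or later and the JDK's built-in java.net.http client; the class name EnhancerClient and its two methods are hypothetical helpers and not part of Stanbol. It only mirrors the curl calls above: a POST with a Content-Type and an Accept header.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the Stanbol code base.
class EnhancerClient {

    /** Builds the POST request. Nothing is sent until enhance() is called. */
    static HttpRequest buildRequest(String endpoint, String contentType,
                                    String accept, String content) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", contentType) // MIME type of the passed content
                .header("Accept", accept)            // requested serialization of the result
                .POST(HttpRequest.BodyPublishers.ofString(content, StandardCharsets.UTF_8))
                .build();
    }

    /** Sends the request and returns the response body as a string. */
    static String enhance(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException(
                    "Enhancer returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumes Apache Stanbol runs on localhost:8080, as in the curl examples.
        HttpRequest request = buildRequest(
                "http://localhost:8080/enhancer",
                "text/plain", "text/turtle",
                "The Stanbol enhancer can detect famous cities such as "
                        + "Paris and people such as Bob Marley.");
        System.out.println(enhance(request));
    }
}
```

Building the request separately from sending it keeps the header and URL handling easy to inspect without a running server.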
For detailed information please see the documentation of the Stanbol Enhancer RESTful Services. A short version is also provided under the REST API link of the Stanbol Web UI (e.g. http://localhost:8080/enhancer assuming that Apache Stanbol runs on localhost:8080).
Java API
Using the Java API requires the following OSGi services:

    @Reference
    EnhancementJobManager jobManager;

    @Reference
    ChainManager chainManager;

This code snippet shows how to enhance an HTML document:

    InputStream content;  // the content (assuming an HTML document)
    String chainName;     // the name of the chain, or null to use the default

    ContentItem contentItem = new InMemoryContentItem(
            IOUtils.toByteArray(content), "text/html; charset=UTF-8");

    // get the enhancement chain
    Chain enhancementChain;
    if (chainName == null) {
        enhancementChain = chainManager.getDefault();
    } else {
        enhancementChain = chainManager.getChain(chainName);
    }

    try {
        // enhance the content
        jobManager.enhanceContent(contentItem, enhancementChain);
    } catch (EnhancementException e) {
        // do not swallow the failure silently; rethrow (or log) it
        throw new IllegalStateException("Enhancement failed", e);
    }

    // get the enhancement results
    MGraph enhancements = contentItem.getMetadata();
After the enhancement process, ContentItems contain not only the metadata but also other information, such as converted versions of the passed content. The following code snippet shows how to retrieve the text version of the passed HTML content, as created by the Metaxa Engine.

    Entry<UriRef,Blob> textContentPart = ContentItemHelper.getBlob(
            contentItem, Collections.singleton("text/plain"));
    Blob textBlob = textContentPart.getValue();
    // fall back to UTF-8 if the blob does not declare a charset
    String charset = textBlob.getParameter().get("charset");
    String plainText = IOUtils.toString(
            textBlob.getStream(), charset == null ? "UTF-8" : charset);
Main Interfaces and Utility Classes
- ContentItem: A content item is the unit of content the Stanbol Enhancer can deal with. It gives access to the binary content that was registered, and the graph that represents its metadata (provided by client and/or generated). ContentItems are created by using the ContentItemFactory.
- EnhancementEngine: The enhancement engine provides the interface to internal or external semantic enhancement engines. Typically content items will be processed by several enhancement engines.
- EnhancementChain: An enhancement chain represents a user-provided configuration that describes how content items passed to this chain should be processed by the Stanbol Enhancer. The chain defines the list of enhancement engines to use and their order of execution.
- EnhancementJobManager: The enhancement job manager performs the execution of the enhancement process as described in the execution plan provided by the enhancement chain. The enhancement job manager is also responsible for recording the execution metadata.
- ChainManager: The chain manager allows looking up all configured enhancement chains. It also provides a getter for the default chain.
- EnhancementEngineManager: The enhancement engine manager allows looking up active enhancement engines by name.
Note that the "org.apache.stanbol.enhancer.servicesapi" module also provides a set of "*Helper" utility classes (e.g. ContentItemHelper, EnhancementEngineHelper …). Users are strongly encouraged to use the functionality provided by these helpers when working with the corresponding classes of the Stanbol Enhancer.
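To make the roles listed above more concrete, here is a deliberately simplified, self-contained sketch of how a chain, its engines, and the job manager relate. All types in it (SketchContentItem, SketchEngine, SketchChain) are hypothetical stand-ins invented for illustration; the real interfaces in org.apache.stanbol.enhancer.servicesapi have richer signatures, and the sketch assumes a plain sequential execution order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration only;
// these are NOT the Stanbol interfaces.
class EnhancerModelSketch {

    /** Unit of content plus the metadata collected for it. */
    static class SketchContentItem {
        final String text;
        final List<String> metadata = new ArrayList<>();

        SketchContentItem(String text) {
            this.text = text;
        }
    }

    /** An engine inspects the content item and adds metadata to it. */
    interface SketchEngine {
        String name();

        void enhance(SketchContentItem item);
    }

    /** A chain is a named, ordered list of engines. */
    static class SketchChain {
        final String name;
        final List<SketchEngine> engines;

        SketchChain(String name, List<SketchEngine> engines) {
            this.name = name;
            this.engines = engines;
        }
    }

    /**
     * Plays the role of the job manager: runs the engines in chain order
     * and records which engines were executed.
     */
    static List<String> enhanceContent(SketchContentItem item, SketchChain chain) {
        List<String> executed = new ArrayList<>();
        for (SketchEngine engine : chain.engines) {
            engine.enhance(item);
            executed.add(engine.name());
        }
        return executed;
    }

    /** Creates a toy engine that appends a fixed note to the metadata. */
    static SketchEngine engine(String name, String note) {
        return new SketchEngine() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public void enhance(SketchContentItem item) {
                item.metadata.add(note);
            }
        };
    }
}
```

In this toy model the returned list stands in for the execution metadata, and the metadata list stands in for the RDF graph of the content item.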
Enhancement Structure
The enhancement structure for Apache Stanbol is described in full in its own documentation. It defines the types and properties used for the resulting metadata graph of the Stanbol Enhancer.
The enhancement structure defines three main types of Annotations:
- TextAnnotations - describing the occurrence of an extracted feature within the analyzed text.
- EntityAnnotations - suggesting an entity for a mention within the text (e.g. dbpedia:International_Monetary_Fund for the mention "IMF" in the analyzed text). In that case the mention would be represented by a TextAnnotation.
- TopicAnnotations - for assigning the analyzed document (or parts of the document) to topics and categories.
In addition, all annotations created by the Stanbol Enhancer provide further meta information defined by the Enhancement class.
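The relation between a mention and a suggested entity can be sketched with two small classes. This is an illustrative model only: the class and field names are hypothetical, and in Stanbol the annotations are RDF resources in the metadata graph rather than Java objects. The sketch reuses the "IMF" example from the list above.

```java
// Hypothetical, simplified model of the annotation types;
// Stanbol itself represents these as RDF resources.
class AnnotationSketch {

    /** A mention found in the analyzed text, located by character offsets. */
    static class TextAnnotation {
        final String selectedText;
        final int start; // inclusive offset of the mention
        final int end;   // exclusive offset of the mention

        TextAnnotation(String selectedText, int start, int end) {
            this.selectedText = selectedText;
            this.start = start;
            this.end = end;
        }
    }

    /** A suggested entity that points back to the mention it explains. */
    static class EntityAnnotation {
        final String entityUri;
        final TextAnnotation mention;

        EntityAnnotation(String entityUri, TextAnnotation mention) {
            this.entityUri = entityUri;
            this.mention = mention;
        }
    }

    /** Wraps the first occurrence of a mention as a TextAnnotation. */
    static TextAnnotation annotate(String text, String mention) {
        int start = text.indexOf(mention);
        if (start < 0) {
            throw new IllegalArgumentException("Mention not found: " + mention);
        }
        return new TextAnnotation(mention, start, start + mention.length());
    }

    public static void main(String[] args) {
        String text = "The IMF met in Paris.";
        TextAnnotation mention = annotate(text, "IMF");
        EntityAnnotation suggestion = new EntityAnnotation(
                "dbpedia:International_Monetary_Fund", mention);
        System.out.println(suggestion.entityUri + " suggested for \""
                + text.substring(mention.start, mention.end) + "\"");
    }
}
```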
Enhancement Properties
since 0.12.1
Enhancement Properties allow parameterizing the enhancement process of a ContentItem. In contrast to the configuration of Enhancement Engines, which is bound to the component life cycle, enhancement properties can be defined for an Enhancement Chain or passed with a single enhancement request as query parameters.
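Passing properties with a single request amounts to appending query parameters to the enhancer URL. The sketch below shows one way to do that safely in Java 11 or later. The helper class is hypothetical, and "some.property" is a placeholder key chosen for illustration; the actual property names are defined in the Enhancement Properties documentation and are not assumed here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; not part of the Stanbol code base.
class EnhancementPropertySketch {

    /** Appends the given properties as URL-encoded query parameters. */
    static URI withProperties(String endpoint, Map<String, String> properties) {
        if (properties.isEmpty()) {
            return URI.create(endpoint);
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            query.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        // Extend an existing query string instead of starting a second one.
        String separator = endpoint.contains("?") ? "&" : "?";
        return URI.create(endpoint + separator + query);
    }

    public static void main(String[] args) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("some.property", "a value"); // placeholder key and value
        System.out.println(withProperties(
                "http://localhost:8080/enhancer/chain/language", properties));
    }
}
```

A LinkedHashMap keeps the parameters in insertion order, which makes the resulting URL predictable.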
List of Available Enhancement Engines
Apache Stanbol comes with a list of enhancement engine implementations. These engines are supported by the Apache Stanbol community. If you would like to implement your own enhancement engine, continue reading this documentation.