Stanbol Enhancer
The Apache Stanbol Enhancer provides both a RESTful and a Java API that allow a caller to extract features from passed content. In more detail, the passed content is processed by Enhancement Engines as defined by the called Enhancement Chain.
Using the Stanbol Enhancer
The figure below provides an overview of the RESTful as well as the Java API provided by the Stanbol Enhancer.
RESTful service
The content to be analyzed should be sent in a POST request with its MIME type specified in the Content-Type header. The response holds the RDF enhancements, serialized in the format specified in the Accept header:
    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
         --data "The Stanbol enhancer can detect famous cities such as \
                 Paris and people such as Bob Marley." \
         http://localhost:8080/enhancer
The RESTful interface also provides parameters that can be used to pass or request additional information. The following example shows a request that is answered with the text/plain version of the passed HTML content.
    curl -v -X POST -H "Accept: text/plain" \
         -H "Content-type: text/html; charset=UTF-8" \
         --data "<html><body><p>The Stanbol enhancer can detect famous cities \
                 such as Paris and people such as Bob Marley.</p></body></html>" \
         "http://localhost:8080/enhancer/chain/language?omitMetadata=true"
For detailed information please see the documentation of the Stanbol Enhancer RESTful Services. A short version is also provided under the REST API link of the Stanbol Web UI (e.g. http://localhost:8080/enhancer assuming that Apache Stanbol runs on localhost:8080).
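For callers that prefer Java over curl, the first request above can be approximated with the JDK's built-in HTTP client (Java 11 or later). This is a minimal sketch, not part of Stanbol: the class name EnhancerRestClient and its methods are hypothetical, and the endpoint assumes Apache Stanbol runs on localhost:8080 as in the curl examples.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; mirrors the first curl example using only JDK classes. */
final class EnhancerRestClient {

    /** Builds a POST request that sends plain text and asks for Turtle in return. */
    static HttpRequest buildRequest(String endpoint, String text) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "text/turtle")
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    /** Sends the request and returns the response body (the serialized enhancements). */
    static String enhance(String endpoint, String text)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(endpoint, text), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling EnhancerRestClient.enhance("http://localhost:8080/enhancer", "...") would then correspond to the first curl command. Error handling and status code checks are left out of the sketch.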
Java API
Using the Java API requires the following OSGi services:
    @Reference
    EnhancementJobManager jobManager;

    @Reference
    ChainManager chainManager;
This code snippet shows how to enhance an HTML document:
    InputStream content; // the content (assuming an HTML document)
    String chainName;    // the name of the chain, or null to use the default

    ContentItem contentItem = new InMemoryContentItem(
        IOUtils.toByteArray(content), "text/html; charset=UTF-8");

    // get the enhancement chain
    Chain enhancementChain;
    if (chainName == null) {
        enhancementChain = chainManager.getDefault();
    } else {
        enhancementChain = chainManager.getChain(chainName);
    }

    try {
        // enhance the content
        jobManager.enhanceContent(contentItem, enhancementChain);
    } catch (EnhancementException e) {
        // left empty in this example; real code should log or rethrow
    }

    // get the enhancement results
    MGraph enhancements = contentItem.getMetadata();
After the enhancement process, ContentItems contain not only the metadata but also other information, such as converted versions of the passed content. The following code snippet shows how to retrieve the text version of the passed HTML content, as created by the Metaxa Engine.
    Entry<UriRef,Blob> textContentPart = ContentItemHelper.getBlob(
        contentItem, Collections.singleton("text/plain"));
    Blob textBlob = textContentPart.getValue();
    String charset = textBlob.getParameter().get("charset");
    String plainText = IOUtils.toString(
        textBlob.getStream(), charset == null ? "UTF-8" : charset);
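The charset fallback used when reading the text blob can be isolated into a small helper. The following is a sketch using only JDK classes; the class BlobCharset is hypothetical and not part of Stanbol. Falling back to UTF-8 for an unknown or malformed charset name is an assumption of this sketch; the snippet above falls back only when the parameter is missing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper; resolves the charset of a blob from its parameters. */
final class BlobCharset {

    /**
     * Returns the charset named by the "charset" parameter, or UTF-8 if the
     * parameter is missing, blank, malformed, or not supported by this JVM.
     */
    static Charset resolve(Map<String, String> parameters) {
        String name = parameters.get("charset");
        if (name == null || name.isBlank()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // covers both illegal and unsupported charset names (assumed fallback)
            return StandardCharsets.UTF_8;
        }
    }
}
```

With such a helper, the parameter map returned by the blob could be passed to BlobCharset.resolve(...) before decoding the stream.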
List of Available Enhancement Engines
Apache Stanbol comes with a list of enhancement engine implementations. These engines are supported by the Apache Stanbol community. If you would like to implement your own enhancement engine, continue reading this documentation.
Main Interfaces and Utilities
- ContentItem: A content item is the unit of content the Stanbol Enhancer can deal with. It gives access to the binary content that was registered, and the graph that represents its metadata (provided by client and/or generated).
- EnhancementEngine: The enhancement engine provides the interface to internal or external semantic enhancement engines. Typically content items will be processed by several enhancement engines.
- EnhancementChain: An enhancement chain represents a user-provided configuration which describes how content items passed to this chain should be processed by the Stanbol Enhancer. The chain defines a list of available enhancement engines and their order of execution.
- EnhancementJobManager: The enhancement job manager performs the execution of the enhancement process as described in the execution plan provided by the enhancement chain. The enhancement job manager is also responsible for recording the execution metadata.
- ChainManager: The chain manager allows looking up all configured enhancement chains. It also provides a getter for the default chain.
- EnhancementEngineManager: The enhancement engine manager allows looking up active enhancement engines by their name.
Note that the "org.apache.stanbol.enhancer.servicesapi" module also provides a set of "*Helper" utility classes (e.g. ContentItemHelper, EnhancementEngineHelper …). Users are strongly encouraged to use the functionality provided by these helpers when working with the corresponding classes of the Stanbol Enhancer.
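How these pieces relate can be illustrated with a toy model. The sketch below is deliberately not the Stanbol API: every type and method in it (ToyEnhancer, ToyContentItem, ToyEngine, keywordEngine, run) is hypothetical, metadata is a plain list of strings instead of an RDF graph, and the engines are simply run one after another in list order.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these types are Stanbol classes. */
final class ToyEnhancer {

    /** Stand-in for a content item: the content plus the metadata collected so far. */
    static final class ToyContentItem {
        final String content;
        final List<String> metadata = new ArrayList<>();

        ToyContentItem(String content) {
            this.content = content;
        }
    }

    /** Stand-in for an enhancement engine. */
    interface ToyEngine {
        String name();

        void enhance(ToyContentItem item);
    }

    /** Creates an engine that records a note if the content contains the keyword. */
    static ToyEngine keywordEngine(String name, String keyword) {
        return new ToyEngine() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public void enhance(ToyContentItem item) {
                if (item.content.contains(keyword)) {
                    item.metadata.add(name + " found " + keyword);
                }
            }
        };
    }

    /**
     * Stand-in for the job manager: runs the engines of a chain in list order
     * and returns the names of the executed engines as toy execution metadata.
     */
    static List<String> run(List<ToyEngine> chain, ToyContentItem item) {
        List<String> executed = new ArrayList<>();
        for (ToyEngine engine : chain) {
            engine.enhance(item);
            executed.add(engine.name());
        }
        return executed;
    }
}
```

In this model the list passed to run plays the role of the chain, and the returned list of names is a much simplified stand-in for the execution metadata recorded by the job manager.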
Enhancement Structure
The enhancement structure for Apache Stanbol is described here in full. It defines the types and properties used for the resulting metadata graph of Apache Stanbol.
Note: The currently used enhancement structure was defined before the incubation to Apache. There is a proposal and ongoing discussion to update this structure in the future; however, the decision was to keep the current structure until a first release.
Each enhancement type description contains the following important properties:
- creator: the specific enhancement engine creating this enhancement
- creation time: the local system time, when the annotation was created
- extracted-from: the content item for the enhancement. This links to the ID of the content item as assigned by Apache Stanbol.
- type: the type of the enhancement (e.g. Location, Person, Concept ...).
- confidence: The level of confidence in the range from 0 to 1
A text annotation type provides metadata for the selected text. This is intended to be used in addition to the enhancement type if an enhancement is based on a part of the content.
- start: the character position of the start of the selection. If start is not defined, the selection is assumed to start at the beginning of the document.
- end: the character position of the end of the selection. If end is not defined, the selection is assumed to end at the end of the document.
- selected-text: The text selected by the enhancement (optional).
- selection-context: The context of the selected text. This adds the possibility to specify the context used to extract entities such as persons, organizations, locations ... from natural language documents.
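The defaults for start and end can be sketched in a few lines of Java. The class TextSelection is hypothetical and not part of Stanbol. The sketch treats end as exclusive, following the convention of String.substring; this is an assumption, since the description above does not say whether end is inclusive.

```java
/** Hypothetical helper; applies the start/end defaults of a text annotation. */
final class TextSelection {

    /**
     * Returns the selected part of the document. A null start means the
     * beginning of the document; a null end means the end of the document.
     * The end position is treated as exclusive (assumption of this sketch).
     */
    static String select(String document, Integer start, Integer end) {
        int from = (start == null) ? 0 : start;
        int to = (end == null) ? document.length() : end;
        if (from < 0 || to > document.length() || from > to) {
            throw new IllegalArgumentException("invalid selection: " + from + ".." + to);
        }
        return document.substring(from, to);
    }
}
```

For example, TextSelection.select(text, null, null) returns the whole document, which matches the case where neither start nor end is defined.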
The entity annotation type refers to named entities which have been recognized within the content. This type is intended to be used together with the FISE enhancement type.
- entity-reference: This refers to the URI identifying the Entity
- entity-label: The label(s) of the referred entity
- entity-type: This property can be used to specify the type of the entity (optional)
- The occurrences of the entity within the content (the exact positions within the text where this entity is referred) are determined by outgoing dc:relation links.
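The link between entity annotations and their occurrences can be sketched with plain Java objects. This is a toy model, not the Stanbol or RDF API: the classes below are hypothetical, identifiers are plain strings, and the relations list stands in for the outgoing dc:relation links. Treating end as exclusive is again an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the annotation types; not the Stanbol or RDF API. */
final class EntityOccurrences {

    /** A text annotation: an identifier plus the selected character range. */
    static final class TextAnnotation {
        final String id;
        final int start;
        final int end; // exclusive in this sketch (assumption)

        TextAnnotation(String id, int start, int end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    /** An entity annotation with its outgoing relation links. */
    static final class EntityAnnotation {
        final String entityReference;
        final String entityLabel;
        final List<String> relations; // ids of the linked text annotations

        EntityAnnotation(String entityReference, String entityLabel, List<String> relations) {
            this.entityReference = entityReference;
            this.entityLabel = entityLabel;
            this.relations = relations;
        }
    }

    /** Returns the text annotations that the entity annotation links to. */
    static List<TextAnnotation> occurrences(EntityAnnotation entity, List<TextAnnotation> all) {
        List<TextAnnotation> result = new ArrayList<>();
        for (TextAnnotation text : all) {
            if (entity.relations.contains(text.id)) {
                result.add(text);
            }
        }
        return result;
    }
}
```

Given an entity annotation for Paris whose relations list contains the identifier of one text annotation, occurrences returns that annotation, and its start and end give the position of the mention in the text.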