This project has retired. For details please refer to its Attic page.
Apache Stanbol - Stanbol Enhancement Structure

Stanbol Enhancement Structure

This document specifies the structure used by the Stanbol Enhancer to encode features extracted from the parsed ContentItem. The Enhancement Structure is based on RDF technology and defined as an OWL ontology.

Its two main purposes are to facilitate the:

  1. Interoperability between EnhancementEngines: The design of the Stanbol Enhancer is based on the processing of a ContentItem by multiple EnhancementEngines in an EnhancementChain. Together with the ContentItem API, the Enhancement Structure is the key enabler for cooperation between the different engines. It ensures that enhancements created by one engine can be consumed by the following engines. For example, the first engine detects the language of the parsed text; the second uses the language to select the correct NER (named entity recognition) model and creates enhancements describing the Named Entities contained in the text; the third consumes those Named Entity annotations and creates suggestions for Entities that are part of a controlled vocabulary.
  2. Consumption of extracted Features: The knowledge structure standardized by this Ontology aims to allow users to consume/process the features extracted from the parsed content. This includes things like:
    • list all suggested Entities (accept/reject Tags)
    • list all suggested Topics (content classification)
    • group Entity suggestions based on detected "Named Entities" (disambiguation support)
    • show the occurrence of detected Entities within the analyzed text (similar to spell checker UIs)
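
Grouping suggestions by the detected mention can be sketched as follows. This is a minimal illustration in Python over plain tuples, not the Stanbol API. The property names dc:relation, fise:entity-reference, fise:selected-text and fise:confidence, the urn: identifiers and the confidence values are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative (subject, predicate, object) triples. The property names,
# urn: identifiers and confidence values are assumptions for this sketch.
TRIPLES = [
    ("urn:ta-1", "rdf:type", "fise:TextAnnotation"),
    ("urn:ta-1", "fise:selected-text", "Bob Marley"),
    ("urn:ea-1", "rdf:type", "fise:EntityAnnotation"),
    ("urn:ea-1", "dc:relation", "urn:ta-1"),
    ("urn:ea-1", "fise:entity-reference", "dbpedia:Bob_Marley"),
    ("urn:ea-1", "fise:confidence", 0.9),
    ("urn:ea-2", "rdf:type", "fise:EntityAnnotation"),
    ("urn:ea-2", "dc:relation", "urn:ta-1"),
    ("urn:ea-2", "fise:entity-reference", "dbpedia:Bob_Marley_(comedian)"),
    ("urn:ea-2", "fise:confidence", 0.4),
]


def value(triples, subject, predicate):
    """Return the first object for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def suggestions_by_mention(triples):
    """Group suggested entities by the text they were suggested for."""
    groups = defaultdict(list)
    for s, p, o in triples:
        if p == "rdf:type" and o == "fise:EntityAnnotation":
            text_annotation = value(triples, s, "dc:relation")
            mention = value(triples, text_annotation, "fise:selected-text")
            entity = value(triples, s, "fise:entity-reference")
            confidence = value(triples, s, "fise:confidence")
            groups[mention].append((entity, confidence))
    # Highest confidence first, so a UI can show the best candidate on top.
    return {
        mention: sorted(found, key=lambda item: item[1], reverse=True)
        for mention, found in groups.items()
    }


grouped = suggestions_by_mention(TRIPLES)
```

A user interface could render each key of `grouped` as one mention and offer the sorted candidates for acceptance or rejection.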

While this document focuses on the first purpose and describes how the Stanbol Enhancement Structure forms an integral part of the Stanbol Enhancer, there is also a Usage Scenario that focuses on how the Enhancements can be consumed by Stanbol Enhancer users.

Overview on the Stanbol Enhancement Structure

The Stanbol Enhancement Structure is a central part of the Stanbol Enhancer architecture, as it is the binding element between the ContentItem and the EnhancementEngines that analyze it, as configured by an EnhancementChain. Together with the ContentParts, it represents the state that is constantly updated during the enhancement process.

The following graphic provides an overview on how the EnhancementStructure is used by the Stanbol Enhancer to formally represent the enhancement results.

EnhancementStructure Overview

The above figure shows

The bold relations within the figure are central, as they show how the Enhancement Structure is used to formally specify that the mention "Bob Marley" within the analyzed text is believed to represent the Entity dbpedia:Bob_Marley. However, it is also stated that the mention is ambiguous: it might instead refer to another person, dbpedia:Bob_Marley_(comedian).

The dashed relations are also important, as they formally describe the extraction context: which EnhancementEngine extracted a feature from which ContentItem. If more contextual information is needed, users can combine this information with the ExecutionMetadata collected during the enhancement process.
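
The extraction context can be read back from the metadata with a simple lookup. The following is a minimal sketch in Python, assuming that dc:creator names the engine and fise:extracted-from points at the ContentItem; the engine names and urn: identifiers are hypothetical.

```python
# Illustrative triples; engine names and urn: identifiers are hypothetical,
# and the property names are assumptions for this sketch.
TRIPLES = [
    ("urn:ta-1", "rdf:type", "fise:Enhancement"),
    ("urn:ta-1", "dc:creator", "example.NamedEntityEngine"),
    ("urn:ta-1", "fise:extracted-from", "urn:content-item-1"),
    ("urn:ea-1", "rdf:type", "fise:Enhancement"),
    ("urn:ea-1", "dc:creator", "example.EntityLinkingEngine"),
    ("urn:ea-1", "fise:extracted-from", "urn:content-item-1"),
]


def extraction_context(triples):
    """Map each enhancement to (creating engine, source ContentItem)."""
    context = {}
    # First pass: collect every resource typed as fise:Enhancement.
    for s, p, o in triples:
        if p == "rdf:type" and o == "fise:Enhancement":
            context[s] = [None, None]
    # Second pass: fill in the creator and the source for those resources.
    for s, p, o in triples:
        if s in context and p == "dc:creator":
            context[s][0] = o
        elif s in context and p == "fise:extracted-from":
            context[s][1] = o
    return {s: tuple(pair) for s, pair in context.items()}


context = extraction_context(TRIPLES)
```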

General Information

Used Namespaces

This provides the list of namespaces used/referenced by the Enhancement Structure

(*) Historical side note: FISE was the name of the Stanbol Enhancer before its incubation at Apache. The Enhancement Structure still uses the original namespace for compatibility reasons.

About Expressiveness:

All Stanbol ontologies are encoded in OWL but restrict themselves to basic features. Users need to be aware that not all rules defined in this documentation are formally expressed within the ontology. However, all the stated rules are validated by the EnhancementStructureHelper unit test utility, which is part of the "org.apache.stanbol.enhancer.test" module. This ensures that EnhancementEngine implementations that validate their enhancements using this utility comply with this specification.

About Reasoning:

Apache Stanbol assumes the users will have no reasoning support. Because of that EnhancementEngines are required to materialize information that would be otherwise only available by reasoning (e.g. it is required that they add both "fise:TextAnnotation" and "fise:Enhancement" as "rdf:type"s when writing a TextAnnotation).
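
Because no reasoner is assumed, the supertype has to be written out explicitly. The following sketch shows the idea in Python on plain tuples. It is not Stanbol code, and the subtype table is an assumption based on the annotation types described in this document.

```python
# Annotation types treated as specializations of fise:Enhancement
# (assumed from the types described in this document).
ENHANCEMENT_SUBTYPES = {
    "fise:TextAnnotation",
    "fise:EntityAnnotation",
    "fise:TopicAnnotation",
}


def materialize_types(triples):
    """Return the triples plus an explicit fise:Enhancement type for every
    resource typed with one of the specialized annotation types."""
    result = set(triples)
    for s, p, o in triples:
        if p == "rdf:type" and o in ENHANCEMENT_SUBTYPES:
            result.add((s, "rdf:type", "fise:Enhancement"))
    return result


written = materialize_types({("urn:ta-1", "rdf:type", "fise:TextAnnotation")})
```

A consumer without reasoning support can then select all enhancements with a plain match on rdf:type fise:Enhancement.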

Core Concepts

The main concept of the Stanbol Enhancement Structure is the "fise:Enhancement". It is used as the base concept for all annotation types and defines the generic properties every enhancement MUST provide (e.g. creator, creation date, extracted-from, confidence). On top of the "fise:Enhancement", three specific annotation types are defined: "fise:TextAnnotation", "fise:EntityAnnotation" and "fise:TopicAnnotation".

fise:Enhancement

Every feature extracted by an EnhancementEngine that is expressed using the Stanbol Enhancement Structure needs to be represented as an RDF resource with the "rdf:type" "fise:Enhancement".

Enhancements use Dublin Core terms to provide metadata about their creation:

The following properties provide information about the enhancement

fise:TextAnnotation

TextAnnotations are used to select portions of the parsed textual content by using the following properties:

As hinted by the description of the above properties their usage depends on the size of the selected part of the text.

The following figure shows a fise:TextAnnotation used to mark the occurrence of the Named Entity "Bob Marley" from character 59 to 69 in the given Content.

'fise:TextAnnotation'
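
How character offsets select a mention can be sketched in a few lines of Python. The property names fise:start, fise:end and fise:selected-text, the exclusive end offset and the sample sentence are assumptions for illustration, so the offsets computed here differ from the 59 to 69 of the figure.

```python
def text_annotation(content, mention):
    """Describe the first occurrence of `mention` in `content` as a dict."""
    start = content.find(mention)
    if start < 0:
        raise ValueError("mention not found in content")
    return {
        # Both types are written out, since no reasoning is assumed.
        "rdf:type": ["fise:Enhancement", "fise:TextAnnotation"],
        "fise:start": start,
        "fise:end": start + len(mention),  # exclusive end offset (assumed)
        "fise:selected-text": mention,
    }


content = "The singer Bob Marley was born in Jamaica."
annotation = text_annotation(content, "Bob Marley")

# Slicing the content with the stored offsets yields the mention again.
selected = content[annotation["fise:start"]:annotation["fise:end"]]
```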

NOTE: In future version TextAnnotations might switch to a Model that uses

fise:EntityAnnotation

EntityAnnotations are used to suggest/link entities recognized within the Text. While fise:TextAnnotations are used for representing the recognition(s) (occurrence(s) within the content) the EntityAnnotation provides information about the referenced Entity.

The following figure shows an fise:EntityAnnotation for the Entity 'dbpedia:Bob_Marley'.

'fise:EntityAnnotation' example

fise:TopicAnnotation

TopicAnnotations are used to categorize/classify the parsed content according to some categorization system. This is done by suggesting/linking Topics of that categorization system for the parsed content, or possibly for parts of it. A "fise:TextAnnotation" is used to select the part of the content to which the linked topics apply.

The following figure shows a fise:TopicAnnotation suggesting the skos:Concept "Boxing" from the IPTC Subject Codes. The figure also shows that the Boxing category has Sport as its broader category.

'fise:TopicAnnotation' example
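
Consumers often want the broader categories of a suggested topic as well. The following is a minimal sketch in Python, assuming the hierarchy is expressed with skos:broader; the concept identifiers are placeholders, not actual IPTC Subject Codes.

```python
# skos:broader relations as a child -> parent mapping.
# The identifiers are placeholders, not actual IPTC Subject Codes.
BROADER = {
    "iptc:Boxing": "iptc:Sport",
}


def with_broader(topic, broader):
    """Return the topic followed by all of its broader categories."""
    chain = [topic]
    while chain[-1] in broader:
        parent = broader[chain[-1]]
        if parent in chain:  # guard against cycles in the hierarchy
            break
        chain.append(parent)
    return chain


topics = with_broader("iptc:Boxing", BROADER)
```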