Stanbol Enhancement Structure
This document specifies the structure used by the Stanbol Enhancer to encode features extracted from the parsed ContentItem. The Enhancement Structure is based on RDF technology and defined as an OWL ontology.
Its two main purposes are to facilitate the:
- Interoperability between EnhancementEngines: The design of the Stanbol Enhancer is based on the processing of a ContentItem by multiple EnhancementEngines in an EnhancementChain. Together with the ContentItem API, the EnhancementStructure is the key enabler for the cooperation of the different engines. It ensures that enhancements created by one engine can be consumed by the following engines. For example, the first engine detects the language of the parsed text; the second consumes the language to select the correct NER (named entity recognition) model and creates enhancements describing Named Entities contained in the text; the third engine consumes those Named Entity annotations and creates suggestions for Entities that are part of a controlled vocabulary.
- Consumption of extracted Features: The knowledge structure standardized by this Ontology aims to allow users to consume/process the features extracted from the parsed content. This includes things like:
  - list all suggested Entities (accept/reject Tags)
  - list all suggested Topics (content classification)
  - group Entity suggestions based on detected "Named Entities" (disambiguation support)
  - show the occurrence of detected Entities within the analyzed text (similar to spell checker UIs)
While this document focuses on the first of these purposes and provides details on how the Stanbol Enhancement Structure is an integral part of the Stanbol Enhancer, there is also a Usage Scenario available that focuses on how the Enhancements can be consumed by Stanbol Enhancer users.
Overview on the Stanbol Enhancement Structure
The Stanbol Enhancement Structure is a central part of the Stanbol Enhancer architecture as it represents the binding element between the ContentItem and the EnhancementEngines that analyze it, as configured by an EnhancementChain. Together with the ContentParts it represents the state that is constantly updated during the enhancement process.
The following graphic provides an overview on how the EnhancementStructure is used by the Stanbol Enhancer to formally represent the enhancement results.
The above figure shows:
- A ContentItem with a single plain text ContentPart containing the text "Apache Stanbol can detect famous entities such as Paris or Bob Marley!"
- Three Enhancements: One TextAnnotation describing "Bob Marley" as a Named Entity as extracted by the NER (NamedEntityRecognition) engine and two EntityAnnotations that suggest different Entities from DBpedia.org.
- Two referenced Entities: Both dbpedia:Bob_Marley and dbpedia:Bob_Marley_(comedian) are part of DBpedia.org and referenced by fise:EntityAnnotations created by an instance of the NamedEntityLinking engine configured to link with DBpedia.org.
- An EnhancementChain with four EnhancementEngines. However, only the enhancements of the latter two are shown in the figure.
The bold relations within the figure are central as they show how the EnhancementStructure is used to formally specify that the mention "Bob Marley" within the analyzed text is believed to represent the Entity dbpedia:Bob_Marley. However, it is also stated that there is an ambiguity with another person, dbpedia:Bob_Marley_(comedian).
The dashed relations are also important as they are used to formally describe the extraction context: which EnhancementEngine has extracted a feature from what ContentItem. If even more contextual information is needed, users can combine this information with the ExecutionMetadata collected during the enhancement process.
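As a rough illustration of these relations, the following Python sketch encodes the example as plain (subject, predicate, object) tuples and groups the suggested Entities by the mention they relate to. The `urn:enhancement:*` identifiers and the helper name `group_suggestions` are assumptions made for this illustration; real enhancement URIs are generated by the engines.

```python
from collections import defaultdict

# Hypothetical enhancement identifiers, used only for this sketch.
TEXT_ANNO = "urn:enhancement:text-1"
ENTITY_ANNO_1 = "urn:enhancement:entity-1"
ENTITY_ANNO_2 = "urn:enhancement:entity-2"

triples = [
    (TEXT_ANNO, "rdf:type", "fise:TextAnnotation"),
    (TEXT_ANNO, "fise:selected-text", "Bob Marley"),
    (ENTITY_ANNO_1, "rdf:type", "fise:EntityAnnotation"),
    (ENTITY_ANNO_1, "dc:relation", TEXT_ANNO),
    (ENTITY_ANNO_1, "fise:entity-reference", "dbpedia:Bob_Marley"),
    (ENTITY_ANNO_2, "rdf:type", "fise:EntityAnnotation"),
    (ENTITY_ANNO_2, "dc:relation", TEXT_ANNO),
    (ENTITY_ANNO_2, "fise:entity-reference", "dbpedia:Bob_Marley_(comedian)"),
]


def group_suggestions(triples):
    """Map each selected text to the entities suggested for it."""
    selected = {s: o for s, p, o in triples if p == "fise:selected-text"}
    entity = {s: o for s, p, o in triples if p == "fise:entity-reference"}
    groups = defaultdict(list)
    for s, p, o in triples:
        # dc:relation links an EntityAnnotation to the TextAnnotation it explains.
        if p == "dc:relation" and s in entity and o in selected:
            groups[selected[o]].append(entity[s])
    return dict(groups)


suggestions = group_suggestions(triples)
print(suggestions)
```

Grouping by the related TextAnnotation is what lets a user interface present both DBpedia candidates side by side for the single mention "Bob Marley".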
General Information
Used Namespaces
This section lists the namespaces used/referenced by the Enhancement Structure:
- fise (http://fise.iks-project.eu/ontology/): This is the main namespace of the currently used Enhancement Structure. All custom concepts and properties are defined using this namespace. (*)
- enhancer (http://stanbol.apache.org/ontology/enhancer/enhancer#): This is the main namespace of the Stanbol Enhancer defining concepts such as ContentItem, EnhancementEngine, EnhancementChain …
- entityhub (http://stanbol.apache.org/ontology/entityhub/entityhub#): This is the main namespace of the Stanbol Entityhub component.
- dc (http://purl.org/dc/terms/): The Dublin Core terms standard is also heavily used by the Stanbol Enhancement Structure, especially to encode metadata, but also to encode relations between extracted information (fise:Enhancements).
- dbpedia-ont (http://dbpedia.org/ontology/): Concepts of this Ontology are used to describe the types of "Named Entities" detected in parsed content.
- skos (http://www.w3.org/2004/02/skos/core#): The SKOS standard is preferably used to describe entries of Thesauri or more generally any type of controlled vocabularies.
- rdf (http://www.w3.org/1999/02/22-rdf-syntax-ns#)
- In addition, EnhancementEngines are free to add/use properties of any additional Ontology (e.g. when adding the rdf:types of suggested Entities).
(*) Historical side note: FISE was the name of the Stanbol Enhancer before its incubation to Apache. The Enhancement Structure does still use the original namespace for compatibility reasons.
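When processing enhancement results, prefixed names such as fise:TextAnnotation usually need to be expanded to full URIs. The following Python sketch shows one way to do this with the namespaces listed above; the function name `expand` is an assumption made for this illustration and is not part of any Stanbol API.

```python
# Prefix-to-namespace map taken from the namespace list in this document.
NAMESPACES = {
    "fise": "http://fise.iks-project.eu/ontology/",
    "enhancer": "http://stanbol.apache.org/ontology/enhancer/enhancer#",
    "entityhub": "http://stanbol.apache.org/ontology/entityhub/entityhub#",
    "dc": "http://purl.org/dc/terms/",
    "dbpedia-ont": "http://dbpedia.org/ontology/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(name):
    """Expand a prefixed name such as 'fise:TextAnnotation' to a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in NAMESPACES:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return NAMESPACES[prefix] + local


print(expand("fise:TextAnnotation"))
```

Note that the fise prefix still resolves to the historical namespace, as explained in the side note above.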
About Expressiveness:
All Stanbol Ontologies are encoded using OWL but restrict themselves to basic features. Users need to be aware that not all rules defined in this documentation are formally expressed within the Ontology. However, all the stated rules are validated by the EnhancementStructureHelper UnitTest utility, which is part of the "org.apache.stanbol.enhancer.test" module. This ensures that EnhancementEngine implementations that validate their enhancements using this utility comply with this specification.
About Reasoning:
Apache Stanbol assumes the users will have no reasoning support. Because of that EnhancementEngines are required to materialize information that would be otherwise only available by reasoning (e.g. it is required that they add both "fise:TextAnnotation" and "fise:Enhancement" as "rdf:type"s when writing a TextAnnotation).
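The materialization requirement can be sketched as follows. This is a minimal Python illustration, assuming enhancements are handled as (subject, predicate, object) tuples; the helper names and the `urn:enhancement:example` identifier are hypothetical, and the super-type table only covers the annotation types defined in this document.

```python
# Each specific annotation type defined in this document is also a
# fise:Enhancement. A reasoner could infer this; here it is written explicitly.
SUPER_TYPES = {
    "fise:TextAnnotation": "fise:Enhancement",
    "fise:EntityAnnotation": "fise:Enhancement",
    "fise:TopicAnnotation": "fise:Enhancement",
}


def materialized_types(rdf_type):
    """Return the given type followed by all of its super types."""
    types = [rdf_type]
    while types[-1] in SUPER_TYPES:
        types.append(SUPER_TYPES[types[-1]])
    return types


def type_triples(subject, rdf_type):
    """Create one rdf:type triple per materialized type."""
    return [(subject, "rdf:type", t) for t in materialized_types(rdf_type)]


for triple in type_triples("urn:enhancement:example", "fise:TextAnnotation"):
    print(triple)
```

Consumers without reasoning support can then select all enhancements with a single lookup on the rdf:type fise:Enhancement.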
Core Concepts
The main concept of the Stanbol Enhancement Structure is the "fise:Enhancement". It is used as base concept for all annotation types and defines the generic properties every enhancement MUST provide (e.g. creator, creation date, extracted-from, confidence). On top of the "fise:Enhancement" three specific annotation types are defined:
- TextAnnotation: To describe features with their occurrence within the parsed Text
- EntityAnnotation: To suggest (linked) Entities with features detected within the content
- TopicAnnotation: To classify (link) the parsed content along topics
fise:Enhancement
Every feature extracted by an EnhancementEngine that is expressed using the Stanbol Enhancement Structure needs to be represented as an RDF resource with the "rdf:type" "fise:Enhancement".
Enhancements use Dublin Core terms to provide metadata about their creation:
- dc:creator (required, single): The EnhancementEngine that created the Enhancement. Currently the fully qualified name of the Java Class implementing the engine is used as String value. In a future version this will change to the relative URL of the EnhancementEngine (e.g. "/enhancer/engine/{engine-name}").
- dc:created (required, single): The UTC date/time when the enhancement was created by the EnhancementEngine.
- dc:contributor (optional, multiple): Additional EnhancementEngines that contributed to the Enhancement.
- dc:modified (optional, single): The last change to a given enhancement.
The following properties provide information about the enhancement:
- fise:extracted-from (required, single): The URI of the "enhancer:ContentItem" the feature was extracted from. EnhancementEngines need to use the UriRef returned by ContentItem#getUri() as value.
- fise:confidence (optional, single, range: 0 <= confidence <= 1): The confidence of the enhancement as floating point number. NOTE that while this uses a floating point number as value, users should not treat values as being on a ratio scale: an enhancement with a confidence of 0.4 is NOT half as good as one with 0.8!
- dc:relation (optional, multiple): Specifies that the current fise:Enhancement has a relation to another fise:Enhancement. Values need to be resources of the "rdf:type" "fise:Enhancement".
- dc:requires (optional, multiple): Specifies that the current fise:Enhancement depends on another fise:Enhancement. This is a stronger version of using "dc:relation" and should indicate that if one of the required enhancements is declined/removed this also affects this one. Values need to be resources of the "rdf:type" "fise:Enhancement". NOTE also that Dublin Core terms defines dc:requires as a sub-property of dc:relation.
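The cardinality and range rules above can be checked mechanically. The following Python sketch is one possible check, assuming an enhancement is represented as a dict that maps property names to lists of values. It is not the EnhancementStructureHelper utility, and the engine class name, timestamp and ContentItem URI in the example are made up.

```python
# (required, single-valued) flags for the generic fise:Enhancement properties.
RULES = {
    "dc:creator": (True, True),
    "dc:created": (True, True),
    "dc:contributor": (False, False),
    "dc:modified": (False, True),
    "fise:extracted-from": (True, True),
    "fise:confidence": (False, True),
    "dc:relation": (False, False),
    "dc:requires": (False, False),
}


def validate_enhancement(props):
    """Return a list of rule violations (empty if the enhancement is valid)."""
    problems = []
    for name, (required, single) in RULES.items():
        values = props.get(name, [])
        if required and not values:
            problems.append(f"{name} is required")
        if single and len(values) > 1:
            problems.append(f"{name} must have a single value")
    for value in props.get("fise:confidence", []):
        if not 0.0 <= value <= 1.0:
            problems.append("fise:confidence must be within [0, 1]")
    return problems


enhancement = {
    "dc:creator": ["org.example.HypotheticalEngine"],     # made-up class name
    "dc:created": ["2012-01-01T00:00:00Z"],               # made-up timestamp
    "fise:extracted-from": ["urn:content-item:example"],  # made-up URI
    "fise:confidence": [0.8],
}
print(validate_enhancement(enhancement))  # → []
```

The sketch only checks presence, cardinality and the confidence range; it does not verify that dc:relation or dc:requires values are themselves fise:Enhancement resources.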
fise:TextAnnotation
TextAnnotations are used to select portions of the parsed textual content by using the following properties:
- fise:start (optional, single): The start character position within the plain text version of the parsed content. Note that the plain text version can be retrieved by using the multi-part content item support of the Stanbol Enhancer RESTful API.
- fise:end (required if fise:start is present, single): The end character position. This MUST only be present if "fise:start" is also defined.
- fise:selected-text (optional, single): The text selected by the TextAnnotation. This MUST be the same as the text from index "fise:start" to "fise:end" within the plain text version of the parsed content.
- fise:selection-context (required if fise:selected-text is present, single): The selection context such as the current sentence or a fixed number of characters/words before and after the selected text. This MUST be present if "fise:selected-text" is defined.
- dc:type (optional, single): The nature of the selected part of the text (e.g. dbpedia-ont:Person, dbpedia-ont:Organisation, dbpedia-ont:Place for Named Entities; dc:LinguisticSystem for language annotations; skos:Concept for abstract things incl. categorizations). Note that dc:type values are just recommendations. Users are free to use types other than the recommended ones. As an example, the KeywordLinkingEngine allows users to configure dc:type mappings.
As hinted by the description of the above properties, their usage depends on the size of the selected part of the text.
- Selection of the whole Document: This is the default and MUST BE assumed if none of the start/end/selected-text/selection-context properties is present.
- Selection of a part (e.g. chapter, sentence): The preferred way is to define start/end positions. selected-text and selection-context are inefficient for bigger sections as they would duplicate those sections of the content within the RDF graph as literals.
- Selection of words, word-phrases: In this case it is highly recommended to define start/end as well as selected-text/selection-context. Especially the selected-text and selection-context are important to calculate the exact position of an enhancement in non-plain-text content (e.g. HTML fragments).
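The three cases can be told apart by looking at which properties are present. The following Python sketch classifies a TextAnnotation given as a dict of property names to values; the function name and the returned labels ("document", "section", "phrase", "unspecified") are assumptions made for this illustration, not terms defined by the ontology.

```python
def selection_kind(anno):
    """Classify what a TextAnnotation selects, based on the rules above."""
    has_start = "fise:start" in anno
    has_end = "fise:end" in anno
    has_text = "fise:selected-text" in anno
    has_context = "fise:selection-context" in anno

    # fise:end is required if fise:start is present, and vice versa.
    if has_start != has_end:
        raise ValueError("fise:start and fise:end must be used together")
    # fise:selection-context is required if fise:selected-text is present.
    if has_text and not has_context:
        raise ValueError("fise:selection-context is required with fise:selected-text")

    if has_text:
        return "phrase"       # words or word-phrases, text duplicated as literal
    if has_start:
        return "section"      # positions only, no duplicated literals
    if has_context:
        return "unspecified"  # context alone is not covered by the three cases
    return "document"         # default: the whole document is selected


print(selection_kind({}))
print(selection_kind({"fise:start": 0, "fise:end": 120}))
```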
The following figure shows a fise:TextAnnotation used to mark the occurrence of the Named Entity "Bob Marley" from character 59 to 69 in the given Content.
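The positions 59 and 69 can be reproduced from the example text. The following Python sketch derives the TextAnnotation properties for a mention, assuming zero-based character positions with an exclusive end (which is what the 59 to 69 range of the ten-character mention implies). The function name and the fixed context window of 20 characters are choices made for this illustration only.

```python
CONTENT = "Apache Stanbol can detect famous entities such as Paris or Bob Marley!"


def text_annotation(content, mention, context_chars=20):
    """Build the selection properties for the first occurrence of a mention."""
    start = content.index(mention)  # raises ValueError if the mention is absent
    end = start + len(mention)      # exclusive end position
    # Slicing clamps at the end of the string, so only the lower bound is guarded.
    context = content[max(0, start - context_chars):end + context_chars]
    return {
        "fise:start": start,
        "fise:end": end,
        "fise:selected-text": content[start:end],
        "fise:selection-context": context,
    }


anno = text_annotation(CONTENT, "Bob Marley")
print(anno["fise:start"], anno["fise:end"])  # → 59 69
```

By construction, fise:selected-text equals the text from index fise:start to fise:end, as the specification requires.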
NOTE: In a future version TextAnnotations might switch to a Model that uses:
- fise:selection-prefix: some words/characters before the selected section.
- fise:selection-head: the first few words/characters of the selected section within the text. Alternative to fise:selected-text in case bigger sections of the parsed content need to be selected.
- fise:selection-tail: the last few words/characters of a selected section. To be used together with fise:selection-head.
- fise:selection-suffix: some words/characters after the selected section.
fise:EntityAnnotation
EntityAnnotations are used to suggest/link entities recognized within the Text. While fise:TextAnnotations are used for representing the recognition(s) (occurrence(s) within the content) the EntityAnnotation provides information about the referenced Entity.
- fise:entity-reference (required, single): The URI of the referenced entity. In case several URIs are defined as equal (e.g. by "owl:sameAs") EnhancementEngines need to choose one of the URIs and include the corresponding "owl:sameAs" in the enhancement results.
- fise:entity-label (required, single): The label of the linked entity. While entities may define multiple labels (e.g. for different languages, alternate/preferred …) EnhancementEngines are required to only include a single - the best fitting - label.
- fise:entity-type (optional, multiple): The types of the linked entity. Usually this is the list of rdf:types. However there might be situations where other Resources are used as types.
- dc:relation (required, multiple): The dc:relation property is required for entity annotations. Typically values are "fise:TextAnnotation"s this EntityAnnotation is a suggestion for.
- entityhub:site (optional, single): The name of the Entityhub ReferencedSite managing the suggested Entity. If this property is present, users can dereference the suggested Entity with a GET request to "{stanbol}/entityhub/site/{site-name}/entity?id={entity}" where {site-name} is the value of this property and {entity} is the value of the "fise:entity-reference" property. NOTE: the values "local" and "entityhub" need to be treated separately. In those cases the GET request needs to use "{stanbol}/entityhub/entity?id={entity}".
The following figure shows a fise:EntityAnnotation for the Entity 'dbpedia:Bob_Marley'.
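The dereferencing rule for entityhub:site can be sketched as a small URL builder. In the following Python illustration the function name, the base URL `http://localhost:8080` and the site name `dbpedia` are hypothetical, and percent-encoding the entity URI in the query string is an assumption of this sketch rather than something the specification states.

```python
from urllib.parse import quote


def entity_lookup_url(stanbol_base, entity_uri, site=None):
    """Build the GET URL for a suggested entity, or None without entityhub:site."""
    if site is None:
        return None  # the property is optional; without it no lookup is defined
    if site in ("local", "entityhub"):
        # Special values: the entity is looked up without a site path segment.
        path = "/entityhub/entity"
    else:
        path = f"/entityhub/site/{site}/entity"
    return f"{stanbol_base.rstrip('/')}{path}?id={quote(entity_uri, safe='')}"


print(entity_lookup_url(
    "http://localhost:8080",                   # hypothetical Stanbol base URL
    "http://dbpedia.org/resource/Bob_Marley",
    site="dbpedia",                            # hypothetical site name
))
```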
fise:TopicAnnotation
TopicAnnotations are used to categorize/classify the parsed content along some categorization system. This is done by suggesting/linking Topics of that categorization system for the parsed content (or possibly parts of it). A "fise:TextAnnotation" is used to select the part of the content where the linked topics apply.
- fise:entity-reference (required, single): The URI of the topic.
- fise:entity-label (required, single): The human readable label of the topic. While topics may define multiple labels (e.g. for different languages) EnhancementEngines are required to only include a single - the best fitting - label.
- fise:entity-type (optional, multiple): It is best practice to use SKOS for modeling hierarchical classification systems. If this recommendation is followed, then the value of fise:entity-type will be "skos:Concept". However, users are free to also use different types with "fise:TopicAnnotation"s.
- dc:relation (required, multiple): The dc:relation property is required for topic annotations. It refers to the fise:TextAnnotation specifying the part of the text this topic is applied to.
- entityhub:site (optional, single): The name of the Entityhub ReferencedSite managing the suggested Entity. If this property is present, users can dereference the suggested Entity with a GET request to "{stanbol}/entityhub/site/{site-name}/entity?id={entity}" where {site-name} is the value of this property and {entity} is the value of the "fise:entity-reference" property. NOTE: the values "local" and "entityhub" need to be treated separately. In those cases the GET request needs to use "{stanbol}/entityhub/entity?id={entity}".
The following figure shows a fise:TopicAnnotation suggesting the skos:Concept "Boxing" from the IPTC Subject Codes. The figure also shows that the Boxing category has Sport as a broader one.