Making use of Apache Stanbol Enhancements
This document describes how to implement client-side components, i.e. user interface components, using the enhancement results returned by the Apache Stanbol Enhancer. It does so by means of three scenarios:
- Entity Tagging - replacing text-based tags such as "Bob Marley" with entities - dbpedia:Bob_Marley - to improve content search and categorization. As added value this can also be used for mashups with already available information about linked entities and for search engine optimization by including metadata of tagged entities within the content.
- Entity Disambiguation - enhance the entity tagging experience with explicit support for disambiguation between different suggested entities. This allows users to explicitly link to Paris (Texas), Bob Marley (Comedian) or any other entity that shares its label with other entities.
- Entity Checker - interact with extracted entities in a similar way as with today's spellcheckers: show extracted/suggested entities directly within the content; allow users to interact with suggestions and to disambiguate between different matches if necessary; support search for additional/other entities.
This usage scenario assumes that you already know how to enhance content via the Enhancer's RESTful API. If not, you might want to read about content enhancement.
Entity Tagging: Use tags to relate your content to persons, places, events …
Entity tagging is about suggesting entities, instead of plain strings, that users can tag their documents with. The difference is easy to explain. Assume a blogger uses the tag "Bob Marley" to tag a blog entry. Tagging is all about structuring content: by tagging an entry with "Bob Marley" she can easily find all documents that use that tag. However, she would most likely also want to create a category of documents about reggae music, and she would like documents tagged with "Bob Marley" to be part of that group.
While the knowledge that "Bob Marley" is related to reggae music might be obvious for the blogger as a person it can not be known by the blogging tool she uses. Typically the only way to solve this is that the blogger tags the document with both tags.
Entity tagging tries to work around that by linking documents with entities defined by a knowledge base. The fact that Bob Marley is related to reggae music is nothing novel. DBpedia, a knowledge base extracted from Wikipedia, knows that and a lot more about the entity dbpedia:Bob_Marley. If the blogger tags her document with "dbpedia:Bob_Marley", she does not only tag it with "Bob Marley" but also with all the other contextual information provided by DBpedia - including the fact that Bob Marley was a reggae artist.
But this does not only work with famous people, big cities, etc. Nowadays the Web links data of different domains. However, this is not only about the Web - it works even better if you use entities relevant to yourself and/or your working environment (products, articles, customers, etc).
Suggest entities with the Apache Stanbol Enhancer
To have the Apache Stanbol Enhancer analyze a text, send a POST request as defined by the RESTful API.
curl -X POST -H "Accept: application/rdf+xml" -H "Content-type: text/plain" \
     --data "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley." \
     http://{host}:{port}/enhancer
As response you will receive the enhancement results formatted as an RDF graph in a serialization format specified by the "Accept" header ('application/rdf+xml' in the above example request). This RDF graph contains the information about the entities extracted from the parsed content. See the documentation of the Apache Stanbol enhancement structure for details.
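The same request can be sketched in Python using only the standard library. This is a minimal illustration, not part of the Stanbol distribution: the address `http://localhost:8080` and the helper name `build_enhancer_request` are assumptions, and the request is only built here, not sent.

```python
import urllib.request

# Assumed address of a local Stanbol instance; replace with your own host and port.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, url=ENHANCER_URL, accept="application/rdf+xml"):
    """Build (but do not send) the POST request for the Enhancer (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": accept, "Content-Type": "text/plain"},
        method="POST",
    )


request = build_enhancer_request(
    "The Stanbol enhancer can detect famous cities such as "
    "Paris and people such as Bob Marley."
)

# Sending the request requires a running Stanbol instance:
# with urllib.request.urlopen(request) as response:
#     enhancements = response.read().decode("utf-8")
```

Changing the `accept` argument selects a different RDF serialization for the response.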
The following figure shows how extracted entities are described in the enhancement results.
In principle there are two resources that are of interest for the entity tagging use case:
- EntityAnnotations: Resources with the 'rdf:type' 'fise:EntityAnnotation' represent the entity suggestions made by the Apache Stanbol Enhancer. These resources provide the label, the type and, most importantly, the URI of the extracted entity. In addition, the value of 'fise:confidence' [0..1] can be used as an indication of how certain the Apache Stanbol Enhancer is about this entity.
- Entities: This refers to all resources with an incoming 'fise:entity-reference' relation (such as 'dbpedia:Bob_Marley' in the above example). Enhancement engines can be configured to "dereference" suggested entities - meaning to use the URI of the entity to retrieve additional information. In this case, additional information about suggested entities will be available in the enhancement results. If this is not the case, users will need to dereference suggested entities themselves.
Process Suggested Entities
The following steps are typically needed to acquire the information needed to implement an entity tagging user interface:
- Iterate over all suggested entities: These are all resources such as "{entity-annotation} rdf:type fise:EntityAnnotation"
- Basic information: This information is available directly via the {entity-annotation} to ensure its availability even if the {entity} itself is not included - dereferenced - in the enhancement results.
- URI of the suggested entity: {entity-annotation} fise:entity-reference {entity}
- Label: The value of 'fise:entity-label' is typically the label by which the entity was recognized in the analyzed content. Additional labels are typically available via the {entity}.
- Types: The values of the 'fise:entity-type' property of the {entity-annotation} are the same as the 'rdf:type' values of the {entity}.
- Confidence: The 'fise:confidence' value represents how confident the Apache Stanbol Enhancer is about this suggestion. Values are in the range [0..1] where 0 means very uncertain and 1 represents a high certainty.
- Dereferenced {entity}: Some enhancement engines support adding information about suggested entities to the enhancement results - in other words: dereferencing suggested entities. In this case, additional information about the entity can be retrieved directly from the enhancement results. Note that this information includes all available labels (in all languages) of the entity.
- Dereferencing suggested entities: If the suggested entity is available via the Apache Stanbol Entityhub, the {entity-annotation} has the 'entityhub:site' property. The value of this property is the name of the referenced site of the Entityhub. To dereference the entity, a GET request to "{stanbol-root-URL}/entityhub/site/{site-name}/entity?id={entity}" needs to be sent. The "Accept" header of the request needs to be set to the desired RDF serialization (e.g. "application/rdf+json").
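The steps above can be sketched as follows. This is an illustration under stated assumptions: the enhancement results are assumed to be already parsed into a plain list of (subject, predicate, object) tuples that use prefixed names such as 'fise:confidence' instead of full URIs; a real client would typically obtain them from an RDF library. The helper names, the sample values and the root URL `http://localhost:8080` are hypothetical.

```python
from urllib.parse import quote


def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def first(values, default=None):
    return values[0] if values else default


def collect_suggestions(triples, annotation_type="fise:EntityAnnotation"):
    """Collect the basic information of every annotation of the given type."""
    suggestions = []
    for s, p, o in triples:
        if p == "rdf:type" and o == annotation_type:
            suggestions.append({
                "annotation": s,
                "entity": first(objects(triples, s, "fise:entity-reference")),
                "label": first(objects(triples, s, "fise:entity-label")),
                "types": objects(triples, s, "fise:entity-type"),
                "confidence": first(objects(triples, s, "fise:confidence")),
                "site": first(objects(triples, s, "entityhub:site")),
            })
    return suggestions


def dereference_url(stanbol_root, suggestion):
    """Entityhub URL for dereferencing, or None without an 'entityhub:site' value."""
    if suggestion["site"] is None or suggestion["entity"] is None:
        return None
    return "{}/entityhub/site/{}/entity?id={}".format(
        stanbol_root.rstrip("/"), suggestion["site"],
        quote(suggestion["entity"], safe=""))


# Made-up sample data for illustration only.
triples = [
    ("urn:ea:1", "rdf:type", "fise:EntityAnnotation"),
    ("urn:ea:1", "fise:entity-reference", "http://dbpedia.org/resource/Bob_Marley"),
    ("urn:ea:1", "fise:entity-label", "Bob Marley"),
    ("urn:ea:1", "fise:entity-type", "dbpedia-owl:Person"),
    ("urn:ea:1", "fise:confidence", 0.92),
    ("urn:ea:1", "entityhub:site", "dbpedia"),
]
suggestions = collect_suggestions(triples)
url = dereference_url("http://localhost:8080", suggestions[0])
```

The entity URI is percent-encoded before it is placed in the 'id' query parameter, since it contains characters that are reserved in URLs.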
Process Content Categorizations
fise:TopicAnnotation instances are used to formally represent categories assigned to the parsed Content. The main difference between extracted entities and assigned categories is that extracted entities do have one or more explicit mentions within the text while assigned categories are suggested based on the document as a whole - typically they are not explicitly mentioned in the text.
Typically, an entity tagging UI will want to distinguish between categories and entities because:
- categories are used to group content (e.g. blog posts about work and private things)
- entities are used to search/suggest blog posts about specific topics (e.g. a blog about some feature implemented for "Apache Solr", a nice event in the "Sternbräu" in "Salzburg")
The usage of 'fise:TopicAnnotation' is similar to that of a 'fise:EntityAnnotation'. Both annotation types use the exact same properties ('fise:entity-reference', 'fise:entity-label', 'fise:entity-type', 'fise:confidence', 'entityhub:site'). The only difference is that one needs to iterate over "{topic-annotation} rdf:type fise:TopicAnnotation". So typically clients will want to use the exact same code to process {entity-annotation} and {topic-annotation} instances.
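Because both annotation types share their properties, a client can process them in a single pass and only branch on the 'rdf:type'. The sketch below assumes, for illustration, that the enhancement results are available as a list of (subject, predicate, object) tuples using prefixed names; the function name and the sample data are hypothetical.

```python
# Maps the annotation type to the bucket the UI shows it in.
ANNOTATION_TYPES = {
    "fise:EntityAnnotation": "entities",
    "fise:TopicAnnotation": "categories",
}


def split_annotations(triples):
    """Split annotations into entities and categories, reading labels the same way."""
    result = {"entities": [], "categories": []}
    for s, p, o in triples:
        if p == "rdf:type" and o in ANNOTATION_TYPES:
            labels = [obj for subj, pred, obj in triples
                      if subj == s and pred == "fise:entity-label"]
            result[ANNOTATION_TYPES[o]].append((s, labels[0] if labels else None))
    return result


# Made-up sample data for illustration only.
triples = [
    ("urn:ea:1", "rdf:type", "fise:EntityAnnotation"),
    ("urn:ea:1", "fise:entity-label", "Apache Solr"),
    ("urn:topic:1", "rdf:type", "fise:TopicAnnotation"),
    ("urn:topic:1", "fise:entity-label", "Work"),
]
result = split_annotations(triples)
```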
The next section describes an improved version of entity tagging that allows users to (1) accept/decline a spotted entity and then (2) select one of several suggested entities.
Entity tagging with disambiguation support
Entity disambiguation is required if an entity detected in the analyzed text can refer to different entities. The following figure shows an example where "Bob Marley" is detected as a person in the text; however, there are two possible matches within the controlled vocabulary.
The fact that one entity detected in the text - represented by a 'fise:TextAnnotation' - may have multiple suggested entities - represented by the two 'fise:EntityAnnotation's - has a negative impact on entity tagging interfaces that suggest tags based on 'fise:EntityAnnotation's. This is because such an interface would show two suggestions in the above case: (1) for 'dbpedia:Bob_Marley' and (2) for 'dbpedia:Bob_Marley_(comedian)'. So even if the user wants to tag this content with "Bob Marley", she will need to reject at least one of the two suggestions.
Adding explicit support for entity disambiguation to an entity tagging user interface can solve this problem by grouping suggested entities by the 'fise:TextAnnotation's they are suggested for.
Grouping suggested Entities
The goal of an entity tagging UI with disambiguation support is to show only a single tag suggestion for all entities suggested for the same section in the analyzed text. To solve this, we need to follow the link between 'fise:EntityAnnotation' and 'fise:TextAnnotation'.
There are several options on how to achieve this. We present a solution that iterates over the 'fise:EntityAnnotation's.
- Iterate over all 'fise:EntityAnnotation' instances. This refers to all resources such as "{entity-annotation} rdf:type fise:EntityAnnotation".
- For more information on how to collect information for extracted entities, see the corresponding section on the entity tagging interface.
- Retrieve the 'fise:TextAnnotation' referenced by processed 'fise:EntityAnnotation's. For this, we retrieve the value(s) of the 'dc:relation' property.
- While iterating over the 'fise:EntityAnnotation's, establish a mapping 'fise:TextAnnotation' -> ['fise:EntityAnnotation', 'fise:EntityAnnotation', ...]
- The list of 'fise:EntityAnnotation's for each 'fise:TextAnnotation' needs to be sorted by the value of the 'fise:confidence' property of the EntityAnnotation, so that the EntityAnnotation with the highest confidence comes first. 'fise:confidence' values are in the range 0..1 where higher numbers represent a higher certainty.
- Suggest tags based on 'fise:TextAnnotation's - the keys of the mapping from 'fise:TextAnnotation' to 'fise:EntityAnnotation's established above.
- Allow users to easily accept the entity with the highest rank - 'dbpedia:Bob_Marley' in the above example. This is especially useful if the confidence of the first suggestion is high (e.g. >= 0.8) and considerably higher than the confidence values of the other options.
- Provide users with the possibility to inspect further suggested options - to disambiguate between different options.
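The grouping and sorting described above can be sketched as follows. The sketch assumes that the enhancement results are available as a list of (subject, predicate, object) tuples using prefixed names; the function name and the sample data are hypothetical, and treating a missing 'fise:confidence' as 0.0 is a choice made for this illustration.

```python
from collections import defaultdict


def group_by_text_annotation(triples):
    """Map each TextAnnotation to its EntityAnnotations, highest confidence first."""
    types = defaultdict(set)
    confidence = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
        elif p == "fise:confidence":
            confidence[s] = float(o)

    groups = defaultdict(list)
    for s, p, o in triples:
        # Follow dc:relation from an EntityAnnotation to a TextAnnotation.
        if (p == "dc:relation"
                and "fise:EntityAnnotation" in types[s]
                and "fise:TextAnnotation" in types[o]):
            groups[o].append(s)

    for annotations in groups.values():
        # Assumption: annotations without a confidence value are ranked last.
        annotations.sort(key=lambda ea: confidence.get(ea, 0.0), reverse=True)
    return dict(groups)


# Made-up sample data for illustration only.
triples = [
    ("urn:ta:1", "rdf:type", "fise:TextAnnotation"),
    ("urn:ea:comedian", "rdf:type", "fise:EntityAnnotation"),
    ("urn:ea:comedian", "dc:relation", "urn:ta:1"),
    ("urn:ea:comedian", "fise:confidence", 0.31),
    ("urn:ea:musician", "rdf:type", "fise:EntityAnnotation"),
    ("urn:ea:musician", "dc:relation", "urn:ta:1"),
    ("urn:ea:musician", "fise:confidence", 0.92),
]
groups = group_by_text_annotation(triples)
```

A tagging UI would then show one tag suggestion per key of `groups` and offer the remaining list entries as alternatives.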
Showing the extraction context
To allow users to disambiguate more easily between the suggested entities, it is important to provide them with information about the extraction context of the suggested entities. This is of special importance if the content is not completely visible to the user (e.g. because it is too long to fit on the screen or the content is of a type that cannot be rendered within the browser).
Assuming the suggested entities are grouped by 'fise:TextAnnotation' - as explained in the above section - one can use the information provided by the TextAnnotation to visualize the context and thereby help the user perform the disambiguation task.
The following information of the TextAnnotation can be used for this task:
- 'fise:selection-context': This is the text surrounding the extracted entity. The exact size of this context depends on the configuration and the enhancement engine. Typically it is the current sentence or about 50 characters before and after the selection.
- 'fise:selected-text': This is the text representing the extracted entity - the section of the text the entity was suggested for. The 'fise:selected-text' MUST BE contained within the 'fise:selection-context', so user interfaces that want to highlight the selected part of the context can search the selection context for the selected text. In case of multiple matches it is typically sufficient to highlight all occurrences.
- 'fise:start' and 'fise:end' values could also be used to determine the location. However, because those offsets are relative to the start of the content, it is typically easier to use the occurrences of the selected text within the selection context.
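A minimal sketch of the highlighting approach described above: all occurrences of the selected text within the selection context are wrapped in markers. The function name and the bracket markers are assumptions made for this illustration; a web UI would more likely insert HTML elements and escape the text first.

```python
def highlight(context, selected, before="[", after="]"):
    """Wrap every occurrence of the selected text within the context in markers."""
    if not selected:
        return context
    return context.replace(selected, before + selected + after)


# Values as they might appear in 'fise:selection-context' and 'fise:selected-text'.
marked = highlight("Paris and people such as Bob Marley.", "Bob Marley")
```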
Entity checker - inline editing of content enhancements
This describes a user interface similar to that of a spell/grammar checker. Instead of marking misspelled words, entities recognized within the text are suggested to the user. The following figure shows such an interface as implemented by hallo.js combined with the annotate.js plugin (see the demo, last accessed 2012-05-30: click in the text and press the "annotate" button).
To implement user interfaces like that one needs to (1) show occurrences of extracted features within the text and (2) let the user interact with suggested entities.
Visualise occurrences of extracted features
The occurrences of extracted features are represented by instances of the concept 'fise:TextAnnotation'. The next figure shows how TextAnnotations describe the occurrence of a recognized feature in the parsed text.
Applications that want to show extracted features correctly within the content typically need to implement the following steps:
- Query for/iterate over 'fise:TextAnnotation's of the enhancement results.
- it is important to only use TextAnnotations that define a 'fise:selected-text' property. TextAnnotations that do not define this property usually select whole sections or even the document as a whole. While such TextAnnotations are important (e.g. for annotating the language of the Text) they are of no interest for this use case and need therefore to be ignored.
- Determine the exact occurrence of the TextAnnotations
- in case of plain text content this can be easily done by using the values of 'fise:start' and 'fise:end'
- in case the content includes additional markup, the char indexes of 'fise:start'/'fise:end' will not match. In such cases the preferred way is to first search for the occurrence of 'fise:selection-context' and then for the occurrence of 'fise:selected-text' within it.
- Retrieve the suggestions ('fise:EntityAnnotation' instances) for a given TextAnnotation. For that one needs to search for "?suggestion dc:relation {text-annotation}" where '{text-annotation}' refers to the URI of the current TextAnnotation. Note that:
- Not every TextAnnotation will have suggestions
- One and the same suggestion might be linked with several TextAnnotations.
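For content with markup, the two-step search described above (first the selection context, then the selected text within it) can be sketched as follows. The function name and the sample content are hypothetical. Note the limitation of this simple version: if markup occurs inside the selection context itself, the context is not found and the function returns None, so a client would need a fallback for that case.

```python
def locate(content, context, selected):
    """Return (start, end) of the selected text in content, or None if not found."""
    context_start = content.find(context)
    if context_start < 0:
        return None
    offset = context.find(selected)
    if offset < 0:
        return None
    start = context_start + offset
    return start, start + len(selected)


# Made-up sample content with markup; 'fise:start'/'fise:end' would not match here.
content = "<p>Intro.</p><p>People such as Bob Marley.</p>"
position = locate(content, "People such as Bob Marley.", "Bob Marley")
```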
The following SPARQL query could be used to select all the required information. However, the use of SPARQL is optional as the required information can also be easily retrieved by other means (e.g. by filtered iterators as typically provided by RDF frameworks).
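PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX fise: <http://fise.iks-project.eu/ontology/>
SELECT * WHERE {
    ?textAnnotation rdf:type fise:TextAnnotation .
    ?textAnnotation fise:selected-text ?selected .
    ?textAnnotation fise:selection-context ?context .
    ?textAnnotation fise:start ?startIndex .
    ?textAnnotation fise:end ?endIndex .
    ?textAnnotation dc:type ?nature .
    OPTIONAL { ?suggestions dc:relation ?textAnnotation }
}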
Tips and Tricks:
- Applications that want to differentiate between different types of extracted entities (e.g. applying different stylesheets for persons, organizations and places) can use the value of the 'dc:type' for that purpose. See the section for fise:TextAnnotation for detailed information.
Interact with suggested entities
This section explains how users might want to interact with extracted/suggested entities. Extracted entities are represented by 'fise:EntityAnnotation's. Those EntityAnnotations are linked with the TextAnnotations (occurrences) and with the entity of the used knowledge base. The following figure shows an example of an EntityAnnotation that suggests the entity 'dbpedia:Bob_Marley' for the TextAnnotation used in the example of the previous section.
The main purpose of EntityAnnotations is to suggest entities (e.g. 'dbpedia:Bob_Marley') for mentions within natural language texts. While the above example (to keep it simple) shows only a single suggestion, in practice one needs to distinguish between three different cases that also imply different interaction needs for users:
- No suggestion: This indicates that a named entity was recognized during natural language processing, but no matching entity was found within the knowledge base. In this case users might want to
- manually search the knowledge base for an entity. The Apache Stanbol Entityhub sites endpoint can be used to implement this feature by sending a "GET http://{host}:{port}/entityhub/sites/find?name={name}" (see the WebUI of your Stanbol instance for the detailed documentation).
- Create a new entity based on the current TextAnnotation. In this case the 'fise:selected-text' should be suggested as 'rdfs:label' and the 'dc:type' value could be used for the 'rdf:type'. New entities can be added to the knowledge base by sending a "POST http://{host}:{port}/entityhub/entity" with the RDF data of the Entity as content (see the WebUI of your Apache Stanbol instance for the detailed documentation).
- Distinct suggestion: This means that there is only a single suggestion with a high 'fise:confidence'. Multiple suggestions where the first one has a high confidence and the additional suggestions come with low confidence values may also fit this description. In such situations
- the UI might want to automatically accept the suggestion
- allow users to show additional suggestions on request.
- undo automatic acceptance of the suggestion.
- Ambiguous suggestions: This is the case if multiple entities are suggested with a medium to high 'fise:confidence'. It also applies to situations where there is no suggestion with a high 'fise:confidence' value. In those cases the user typically must provide additional input by
- selecting the correct entity
- rejecting all suggestions
- manually searching for and/or creating a new entity, as described for the "No suggestion" case, would also be a possible interaction
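The three cases can be told apart from the 'fise:confidence' values of the suggestions for one TextAnnotation. The sketch below is only one possible reading of the description above: the threshold 0.8 for "high" follows the example value used earlier in this document, while the threshold 0.5 for "low" and the function name are assumptions that an application would need to tune.

```python
def classify(confidences, high=0.8, low=0.5):
    """Classify the suggestions of one TextAnnotation by their confidence values."""
    ranked = sorted(confidences, reverse=True)
    if not ranked:
        return "none"
    # Distinct: the best suggestion is confident and all others are clearly weaker.
    if ranked[0] >= high and all(c < low for c in ranked[1:]):
        return "distinct"
    # Everything else needs user input, including "no confident suggestion at all".
    return "ambiguous"


examples = {
    "no suggestion": classify([]),
    "single confident suggestion": classify([0.92]),
    "confident plus weak alternative": classify([0.92, 0.2]),
    "two strong candidates": classify([0.9, 0.85]),
    "only a weak candidate": classify([0.4]),
}
```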
The data required for the described interaction patterns is available within the enhancement results as follows:
The following assumes {text-annotation} - the URI of the current 'fise:TextAnnotation' - as context
- Query for/iterate over all entity suggestions: The suggestions for {text-annotation} can be acquired by using "?entityAnnotation dc:relation {text-annotation}"
- only results with the 'rdf:type' 'fise:EntityAnnotation' should be processed. However, typically all results will be of that type anyway.
- the 'fise:confidence' property represents the confidence of the suggestion in the range from 0 (very uncertain) to 1 (very certain). Note that the 'fise:confidence' value is optional - so there might be EntityAnnotations without confidence information. However, all enhancement engines managed by the Apache Stanbol community do provide confidence information.
- Visualize suggestions: EntityAnnotations provide some basic information about the suggested entity that can be used for visualization. Most important is the URI of the suggested entity, available as the value of 'fise:entity-reference'. Additionally, the label and the types of the entity are included.
- Retrieving additional information about referenced entities: While the EntityAnnotation includes some basic information some users might want to retrieve all available information of referenced entities - to dereference the entity:
- As this is a rather common use case, the EntityLinkingEngine and KeywordLinkingEngine are by default configured to include information about entities within the enhancement results. So users of those enhancement engines will not need to dereference entities, as this information is already available within the enhancement results.
- If a 'fise:EntityAnnotation' has the 'entityhub:site' property, entities can be dereferenced by using the Apache Stanbol Entityhub (see the section for fise:EntityAnnotation for details)
- In all other cases the URI of the suggested entity needs to be used for dereferencing. If the referenced entity is part of the linked data cloud, this is often possible by following the Cool URIs convention - basically sending a GET request with the header "Accept: application/rdf+json" to {entity-uri}.
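The choice between the last two dereferencing options can be sketched as follows. The function name and the default root URL `http://localhost:8080` are assumptions; the Entityhub path follows the pattern "{stanbol-root-URL}/entityhub/site/{site-name}/entity?id={entity}" given earlier in this document. The requests are only built here, not sent.

```python
import urllib.request
from urllib.parse import quote


def dereference_request(entity_uri, site=None,
                        stanbol_root="http://localhost:8080",
                        accept="application/rdf+json"):
    """Build a GET request that dereferences an entity (hypothetical helper).

    With a site name (the 'entityhub:site' value) the Entityhub is used,
    otherwise the entity URI itself is requested.
    """
    if site is not None:
        url = "{}/entityhub/site/{}/entity?id={}".format(
            stanbol_root.rstrip("/"), quote(site, safe=""),
            quote(entity_uri, safe=""))
    else:
        url = entity_uri
    return urllib.request.Request(url, headers={"Accept": accept})


uri = "http://dbpedia.org/resource/Bob_Marley"
via_entityhub = dereference_request(uri, site="dbpedia")
direct = dereference_request(uri)
```

Whether a remote server honours the requested serialization depends on that server, so a client should check the "Content-Type" of the response before parsing it.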