This project has retired. For details please refer to its Attic page.
Apache Stanbol - The Keyword Linking Engine: custom vocabularies and multiple languages


WARNING: This engine is deprecated. Users are encouraged to use the EntityhubLinkingEngine instead.


The KeywordLinkingEngine extracts occurrences of entities that are part of a controlled vocabulary from content passed to the Stanbol Enhancer. To do this, words appearing in the text are compared with the labels of entities. The Stanbol Entityhub is used to look up entities based on their labels.

This documentation first describes the configuration options of this engine. That section is mainly intended for users of the engine. The remainder of the document is more technical and intended for developers who want to extend the engine or understand its internals.

Configuration

The KeywordLinkingEngine provides many configuration possibilities. This section describes the different options based on the configuration dialog as shown by the Apache Felix Webconsole.

KeywordLinkingEngine configuration

The example in the screenshot shows a configuration that is used to extract drugs based on various IDs (e.g. the ATC code and the InChI key) that are all stored as values of the skos:notation property. This example is used to emphasize newer features like case sensitive mapping, the keyword tokenizer and customized type mappings. Similar configurations would also be needed to extract product IDs, ISBN numbers or, more generally, concepts of a thesaurus based on their notation.

Configuration Parameter

Additionally the following properties can be configured via a configuration file:

Type Mappings Syntax

The Type Mappings are used to determine the "dc:type" of the TextAnnotation based on the types of the suggested Entity. The field "Type Mappings" (property: org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings) can be used to customize such mappings.

This field uses the following syntax:

{uri}
{source} > {target}
{source1}; {source2}; ... {sourceN} > {target}

The first variant is a shorthand for {uri} > {uri} and therefore specifies that the {uri} should be used as 'dc:type' for TextAnnotations if the matched entity is of type {uri}. The second variant maps a {source} URI to a {target}. The third variant shows how to map multiple URIs to the same target in a single configuration line.

Both 'ns:localName' and fully qualified URIs are supported. For supported namespaces see the NamespaceEnum. Information about accepted (INFO) and ignored (WARN) type mappings is available in the logs.

Some examples of additional mappings for the e-health domain:

drugbank:drugs; dbp-ont:Drug; dailymed:drugs; sider:drugs; tcm:Medicine > drugbank:drugs
diseasome:diseases; linkedct:condition; tcm:Disease > diseasome:diseases 
sider:side_effects
dailymed:ingredients
dailymed:organization > dbp-ont:Organisation

The first two lines map some well-known classes that represent drugs and diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth lines define 1:1 mappings for side effects and ingredients, and the last line adds 'dailymed:organization' as an additional mapping to the DBpedia Ontology Organisation.
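The three syntax variants can be illustrated with a small parser. This is only a sketch in Python, not the engine's own (Java) implementation: the function names `parse_type_mapping` and `parse_type_mappings` are hypothetical, namespace prefixes are kept as written rather than expanded via the NamespaceEnum, and it assumes that neither ';' nor '>' occurs inside a URI.

```python
def parse_type_mapping(line):
    """Parse one type-mapping line into a {source: target} dict (hypothetical helper)."""
    line = line.strip()
    if ">" in line:
        # Variants two and three: "{source1}; ...; {sourceN} > {target}"
        sources_part, target = line.split(">", 1)
        target = target.strip()
        sources = [s.strip() for s in sources_part.split(";") if s.strip()]
    else:
        # Variant one: "{uri}" is shorthand for "{uri} > {uri}"
        target = line
        sources = [line]
    if not target or not sources:
        raise ValueError(f"invalid type mapping: {line!r}")
    return {source: target for source in sources}


def parse_type_mappings(lines):
    """Parse several configuration lines; blank lines are skipped."""
    mappings = {}
    for line in lines:
        if line.strip():
            mappings.update(parse_type_mapping(line))
    return mappings


if __name__ == "__main__":
    config = [
        "drugbank:drugs; dbp-ont:Drug > drugbank:drugs",
        "sider:side_effects",
        "dailymed:organization > dbp-ont:Organisation",
    ]
    for source, target in parse_type_mappings(config).items():
        print(source, "->", target)
```

A line without '>' maps the URI to itself, so the shorthand and the explicit form end up in the same lookup table.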

The following mappings are predefined by the KeywordLinkingEngine.

dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > dbp-ont:Organisation
dbp-ont:Place; schema:Place; gml:_Feature > dbp-ont:Place
skos:Concept

Multiple Language Support

The KeywordLinkingEngine supports the extraction of keywords in multiple languages. However, the performance and to some extent also the quality of the enhancements depend on how well a language is supported by the NLP framework in use (currently OpenNLP). The following list provides a short overview of the different language-specific components and configurations:

Keyword extraction and linking workflow

Basically the text is parsed from the beginning to the end and words are looked up in the configured controlled vocabulary.

Text Processing

The AnalysedContent interface is used to access natural language text that was already processed by an NLP framework. Currently there is only a single implementation, based on the commons.opennlp TextAnalyzer utility. In general this part is still very focused on OpenNLP. Making it usable with other NLP frameworks would probably require some refactoring.

The current state of the processing is represented by the ProcessingState. Based on the capabilities of the NLP framework for the current language, it provides the following set of information:

Processing is based on tokens (words). The ProcessingState provides means to navigate to the next token. If chunks are present, tokens outside of chunks are ignored. Only 'processable' tokens are considered when looking up entities (see the next section for details). Whether a token is processable is determined as follows:

This algorithm was introduced by STANBOL-658.

Entity Lookup

An "OR" query with [1..MAX_SEARCH_TOKENS] processable tokens is used to look up entities via the EntitySearcher interface. If the actual implementation cuts off results, then it must ensure that entities matching both tokens are ranked first. Currently there are two implementations of this interface: (1) for the Entityhub (EntityhubSearcher) and (2) for ReferencedSites (ReferencedSiteSearcher). There is also an implementation that holds entities in memory; however, it is currently only used for unit tests.

Queries use the configured EntityLinkerConfig.getNameField(), and the language of labels is restricted to the current language or to labels that do not define any language.

Only "processable" tokens are used to look up entities. Whether a token is processable is determined as follows:

Typically the next MAX_SEARCH_TOKENS processable tokens are used for a lookup. However the current Chunk/Sentence is never left in the search for processable tokens.
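The token selection described above can be sketched as follows. This is an illustrative Python sketch, not the engine's Java code: the `Token` class, the value chosen for `MAX_SEARCH_TOKENS`, and the textual form of the "OR" query are all assumptions, and whether a token is processable is simply given as a flag here.

```python
from dataclasses import dataclass

MAX_SEARCH_TOKENS = 2  # hypothetical value chosen for this sketch


@dataclass
class Token:
    text: str
    processable: bool  # assumed to be decided earlier by the text processing


def search_tokens(chunk, start, max_search_tokens=MAX_SEARCH_TOKENS):
    """Collect up to max_search_tokens processable tokens, beginning at index
    `start` and never looking beyond the end of the current chunk/sentence."""
    selected = []
    for token in chunk[start:]:
        if token.processable:
            selected.append(token.text)
            if len(selected) == max_search_tokens:
                break
    return selected


def or_query(tokens):
    """Render the selected tokens as a textual OR query (illustrative format only)."""
    return " OR ".join(f'"{t}"' for t in tokens)


if __name__ == "__main__":
    chunk = [
        Token("the", False),
        Token("International", True),
        Token("Monetary", True),
        Token("Fund", True),
    ]
    print(or_query(search_tokens(chunk, 0)))
```

Because only the tokens of the current chunk are passed in, a lookup near the end of a chunk may use fewer than `MAX_SEARCH_TOKENS` tokens.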

Matching of found Entities:

All labels (values of the EntityLinkerConfig.getNameField() field) in the language of the content or without any defined language are candidates for matches.
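As a minimal sketch of this selection rule, labels can be modelled as (text, language) pairs where a missing language tag is represented by `None`. The function name `candidate_labels` and this data representation are assumptions made for illustration only.

```python
def candidate_labels(labels, content_language):
    """Keep labels in the content language and labels without a language tag.

    `labels` is an iterable of (text, language) pairs; language is None when
    the label does not define a language.
    """
    return [
        text
        for text, language in labels
        if language is None or language == content_language
    ]


if __name__ == "__main__":
    labels = [("Germany", "en"), ("Deutschland", "de"), ("DE", None)]
    print(candidate_labels(labels, "en"))
```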

For each label that fulfills the above criteria the following steps are performed. The best result is used as the result of the whole matching process:

Whether two tokens match is determined by dividing the length of the longest matching part, counted from the beginning of the tokens, by the length of the longer of the two tokens. For example, 'German' matches 'Germany' with 6/7=0.86. The result of this comparison is the token similarity. If this similarity is greater than or equal to the configured minimum token similarity factor (org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor), then the tokens are considered to match. The token similarity is also used for calculating the confidence.
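The token similarity can be sketched as a common-prefix ratio. The Python below is an illustration under stated assumptions, not the engine's implementation: the function names are hypothetical, tokens are compared exactly as given (no case folding), and the threshold used in the usage example is an arbitrary value rather than the engine's default.

```python
def token_similarity(a, b):
    """Length of the common prefix divided by the length of the longer token."""
    if not a or not b:
        return 0.0
    prefix = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        prefix += 1
    return prefix / max(len(a), len(b))


def tokens_match(a, b, min_token_match_factor):
    """Tokens match when their similarity reaches the configured minimum."""
    return token_similarity(a, b) >= min_token_match_factor


if __name__ == "__main__":
    # 'Bern' and 'Berlin' share the prefix 'Ber': 3 characters out of 6.
    print(token_similarity("Bern", "Berlin"))
    print(tokens_match("Bern", "Berlin", 0.7))
```

Measuring from the beginning of the token means that differing suffixes (for example inflections) lower the similarity only slightly, while tokens that differ early do not match at all.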

Entities are suggested if:

The described matching process is currently implemented directly in the EntityLinker. To support different matching strategies it would need to be externalized into a separate "EntityLabelMatcher" interface.

Processing of Entity Suggestions

If there are one or more entity suggestions for the current position within the text, a LinkedEntity instance is created.

LinkedEntity is an object model representing the Stanbol Enhancement Structure. After the processing of the parsed content is completed, the LinkedEntities are "serialized" as RDF triples to the metadata of the ContentItem.

TextAnnotations as defined in the Stanbol Enhancement Structure use the dc:type property to provide the general type of the extracted entity. However, suggested entities might have very specific types. Therefore the EntityLinkerConfig provides the possibility to map the specific types of the entity to types used for the dc:type property of TextAnnotations. EntityLinkerConfig.DEFAULT_ENTITY_TYPE_MAPPINGS contains some predefined mappings. Note that the field used to retrieve the types of a suggested entity can be configured via the EntityLinkerConfig. The default value for the type field is "rdf:type".
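How such a mapping might be applied can be sketched as a simple table lookup. The table below reproduces the predefined mappings listed earlier in this document; the function name `dc_types_for` is hypothetical, and the choice to silently skip entity types without a mapping is an assumption of this sketch, not confirmed engine behaviour.

```python
# Predefined mappings as listed in this document, written as source -> target.
DEFAULT_TYPE_MAPPINGS = {
    "dbp-ont:Person": "dbp-ont:Person",
    "foaf:Person": "dbp-ont:Person",
    "schema:Person": "dbp-ont:Person",
    "dbp-ont:Organisation": "dbp-ont:Organisation",
    "dbp-ont:Newspaper": "dbp-ont:Organisation",
    "schema:Organization": "dbp-ont:Organisation",
    "dbp-ont:Place": "dbp-ont:Place",
    "schema:Place": "dbp-ont:Place",
    "gml:_Feature": "dbp-ont:Place",
    "skos:Concept": "skos:Concept",
}


def dc_types_for(entity_types, mappings=DEFAULT_TYPE_MAPPINGS):
    """Map the specific types of an entity to dc:type values.

    Order is preserved and duplicates are dropped; types without a mapping
    are skipped (an assumption of this sketch).
    """
    result = []
    for entity_type in entity_types:
        target = mappings.get(entity_type)
        if target is not None and target not in result:
            result.append(target)
    return result


if __name__ == "__main__":
    print(dc_types_for(["foaf:Person", "schema:Person", "ex:Politician"]))
```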

In some cases suggested entities might redirect to others. In the case of Wikipedia/DBpedia this is often used to link from acronyms like IMF to the real entity International Monetary Fund. Some thesauri also define labels as separate entities with their own URI, and users might want to use the URI of the concept rather than that of the label. To support such use cases the KeywordLinkingEngine supports redirects. Users can configure the redirect mode (ignore, copy values, follow) and the field used to search for redirects (default=rdfs:seeAlso). If the redirect mode is not "ignore", the entities referenced by the configured redirect field are retrieved for each suggestion. In "copy values" mode the values of the name and type fields are copied. In "follow" mode the suggested entity is replaced with the first redirected entity.
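The three redirect modes can be sketched as follows. This Python sketch is illustrative only: the `Entity` class, the `lookup` callback and the function name `process_redirects` are hypothetical, and it assumes that "copy values" copies names and types from the redirected entities into the suggested entity.

```python
from dataclasses import dataclass, field
from enum import Enum


class RedirectMode(Enum):
    IGNORE = "ignore"
    COPY_VALUES = "copy values"
    FOLLOW = "follow"


@dataclass
class Entity:
    uri: str
    names: list = field(default_factory=list)
    types: list = field(default_factory=list)
    redirects: list = field(default_factory=list)  # URIs from the redirect field


def process_redirects(suggestion, mode, lookup):
    """Apply the redirect mode; `lookup` maps a URI to an Entity or None."""
    if mode is RedirectMode.IGNORE:
        return suggestion
    targets = [lookup(uri) for uri in suggestion.redirects]
    targets = [target for target in targets if target is not None]
    if not targets:
        return suggestion
    if mode is RedirectMode.FOLLOW:
        # Replace the suggestion with the first redirected entity.
        return targets[0]
    # COPY_VALUES: keep the suggested entity, add names and types of the targets.
    for target in targets:
        for name in target.names:
            if name not in suggestion.names:
                suggestion.names.append(name)
        for entity_type in target.types:
            if entity_type not in suggestion.types:
                suggestion.types.append(entity_type)
    return suggestion


if __name__ == "__main__":
    fund = Entity("ex:International_Monetary_Fund",
                  names=["International Monetary Fund"],
                  types=["dbp-ont:Organisation"])
    imf = Entity("ex:IMF", names=["IMF"], redirects=[fund.uri])
    store = {fund.uri: fund}
    print(process_redirects(imf, RedirectMode.FOLLOW, store.get).uri)
```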

Confidence for Suggestions

The confidence for suggestions is calculated based on the following algorithm:

Input Parameters

The confidence is calculated as follows:

confidence = (matched/max_matched)^2 * (matched/span) * (matched/label_tokens)
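The formula can be written as a small function for experimentation. This is a sketch, not the engine's code: it assumes the numerator of the squared term is the same `matched` count used in the other two factors, and the parameter meanings given in the comments are inferred from the parameter names rather than taken from this document.

```python
def confidence(matched, max_matched, span, label_tokens):
    """Compute the confidence of a suggestion from token counts.

    Assumed meanings (inferred from the names):
      matched      -- tokens matched by this suggestion
      max_matched  -- highest number of matched tokens over all suggestions
      span         -- tokens of the text covered by the match
      label_tokens -- tokens in the matched label
    """
    if matched < 0 or max_matched <= 0 or span <= 0 or label_tokens <= 0:
        raise ValueError("counts must be positive (matched may be zero)")
    return (matched / max_matched) ** 2 * (matched / span) * (matched / label_tokens)


if __name__ == "__main__":
    # Two of two tokens matched, but the label has a third, unmatched token.
    print(confidence(matched=2, max_matched=2, span=2, label_tokens=3))
```

The squared first factor penalizes suggestions that match fewer tokens than the best suggestion more strongly than the other two factors penalize partial coverage.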

Some Examples:

The calculation of the confidence is currently implemented directly in the EntityLinker. To support different matching strategies it would need to be externalized into a separate interface.

Notes about the TaxonomyLinkingEngine

The KeywordLinkingEngine is a re-implementation of the TaxonomyLinkingEngine that is more modular and therefore better suited for future improvements and extensions as requested by STANBOL-303. As of STANBOL-506, the TaxonomyLinkingEngine is deprecated and will be deleted from SVN.