
EntityLinkingEngine

The EntityLinkingEngine is an Engine that consumes NLP processing results from the AnalyzedText content part and uses that information to link (search and match) entities from a configured vocabulary.

For doing so it uses the following configurations and components:

  1. a Text Processing Configuration
  2. an Entity Linking Configuration
  3. an EntitySearcher used to query the configured vocabulary
  4. a LabelTokenizer used to tokenize the labels of Entities

The EntityLinkingEngine cannot be used directly, as the four things listed above need to be passed to its constructor. It is instead intended to be configured/extended by other components. The EntityhubLinkingEngine is one of them; it configures the EntityLinkingEngine with an EntitySearcher for the Stanbol Entityhub.

This documentation first describes the implemented entity linking process, then provides information about the supported configuration parameters of the Text Processing Configuration and the Entity Linking Configuration. The last part describes how to extend the EntityLinkingEngine by implementing/providing custom EntitySearcher and LabelTokenizer implementations.

Linking Process

The Linking Process consists of three major steps: first, the results of the NLP processing are consumed to determine the tokens - words - that need to be linked with the configured vocabulary; second, entities are linked by matching their labels against the current section of the text; and third, the enhancement results are written.

Token Types

The EntityLinkingEngine operates based on tokens (words). Those tokens are divided into the following categories:

  • Linkable Tokens: tokens that are linked with the configured vocabulary - queries are only issued for those tokens (typically proper nouns).
  • Matchable Tokens: tokens that do not trigger queries themselves but are used as secondary query terms and during matching (typically common nouns).
  • Other Tokens: all remaining tokens; they are only considered during the matching process.

"University of Salzburg" is a good example as 'University' - a common noun - can be considered a matchable token, 'of' an other- and 'Salzburg' as proper noun is a typical linkable token. As the engine only queries for linkable token a single query for 'Salzburg' would be issued against the vocabulary. However this query would also use the matchable token 'University' as a secondary query term. The token 'of' would only be considered during matching.

In addition to the token type, the engine also determines further parameters for each token, such as whether it is written in upper case (compare the Upper Case feature and the Upper Case Token Mode mentioned below).

Consumed NLP Processing Results

The EntityLinkingEngine consumes NLP processing results from the AnalyzedText ContentPart of the processed ContentItem. The following list describes the consumed information and its usage in the linking process:

  1. Language (required): The language of the text is acquired from the metadata of the ContentItem. It is required to search for labels in the correct language and also to correctly apply language specific configurations of the engine.
  2. Sentences (optional): Sentence annotations are used as segments for the matching process. In addition, for the first word of a Sentence the Upper Case feature is NOT set. In case no Sentence annotations are present, the whole text is treated as a single Sentence.
  3. Tokens (required): As this engine is based on the processing of Tokens, this information is absolutely required.
  4. POS Annotations (optional): Part of Speech (POS) tags are used to determine the Token Type. The NLP processing module provides two enumerations that define POS types: the high level LexicalCategory enumeration (16 members, including "Noun", "Verb", "Adjective", "Adposition" ...) and the Pos enumeration with ~150 very detailed POS definitions (e.g. "ProperNoun", "CommonNoun", "Infinitive", "Gerund", "PresentParticiple" ...). In addition the engine can also be configured to use the string tag as used by the POS tagger. The mapping of POS annotations to Token Types is provided by the engine configuration and can be language specific.
  5. Phrase Annotations (optional): Phrase annotations of Chunks present in the AnalyzedText are checked against the configured processable phrase categories. The linking of Tokens is NOT limited to Tokens within processable phrases; phrases are only used as additional context to improve the matching process. Both the LexicalCategory and the string tags used by the Chunker can be used to configure the processable phrase categories.
  6. Lemma (optional): The lemma provided by the MorphoAnalysis annotation can be used for linking instead of the token as it appears in the text.
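
The following sketch (not part of the engine, and much simplified compared to the actual token type classification) illustrates how these results are consumed via the AnalyzedText API of the NLP processing module. The 0.75 threshold corresponds to the 'prob' parameter described in the configuration section below.

void classifyTokens(AnalysedText at) {
    Iterator<Token> tokens = at.getTokens();
    while (tokens.hasNext()) {
        Token token = tokens.next();
        //POS annotations are optional
        Value<PosTag> pos = token.getAnnotation(NlpAnnotations.POS_ANNOTATION);
        //simplified: tokens without a POS annotation (or with an unknown
        //probability) would be classified by other means
        if (pos != null && pos.probability() >= 0.75) {
            boolean noun = pos.value().getCategories()
                    .contains(LexicalCategory.Noun);
            //... map the PosTag to the Token Type
            //    (linkable, matchable or other)
        }
    }
}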

Entity Linking

The linking process is based on matching the labels of entities returned as results of searches in the configured controlled vocabulary. In addition, the engine can be configured to consider redirects for entities returned by searches.

Searches are issued only for Linkable Tokens and may include up to Max Search Tokens additional Linkable or Matchable Tokens. If the Linkable Token is within a Phrase, only other tokens within the same phrase are considered. Otherwise any Linkable or Matchable Tokens within the configured Max Search Token Distance are considered for the search.

Searches against the controlled vocabulary are issued using the EntitySearcher interface and built as follows:

{lt}@{lang} || {lt}@{dl} || [{at}@{lang} || {at}@{dl} ... ]

where:

  • {lt} ... the Linkable Token the search is issued for
  • {at} ... an additional Linkable or Matchable Token included in the search
  • {lang} ... the language of the processed text
  • {dl} ... the configured default language
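
For the "University of Salzburg" example from above, processed in an English text, the query issued for the Linkable Token 'Salzburg' would therefore look like:

Salzburg@en || Salzburg@{dl} || [ University@en || University@{dl} ]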

For the results of those queries, the labels in {lang} and {dl} are matched against the text. However, {dl} labels are only considered if no match was found for labels in the language of the text. For matching labels with the Tokens of the text the engine needs to tokenize the labels. This is done by using the LabelTokenizer interface.

The matching process distinguishes between matchable and non-matchable Tokens as well as non-alphanumeric Tokens, which are completely ignored. Matching starts at the position of the Linkable Token for which the search in the configured vocabulary was issued. From this position, Tokens in the label are matched with Tokens in the text until the first matchable or the 2nd non-matchable token is not found. In a second round the same is done in the backward direction. The configured Min Token Match Factor determines how exactly tokens in the text must correspond to tokens in the label for a match to be considered. This is repeated for all labels of an Entity. The label match that covers the most tokens is then considered the match for that Entity.
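
To continue the example: when matching the label "University of Salzburg"@en against the text "... at the University of Salzburg ...", matching starts at 'Salzburg' - the Linkable Token the search was issued for - and then proceeds in the backward direction:

Salzburg   ~ Salzburg     (Linkable Token; start position)
of         ~ of           (non-matchable Token)
University ~ University   (Matchable Token)

=> the label match covers all three tokens of the label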

There are various parameters that can be used to fine tune the matching process. The most important decision is whether to include suggestions where labels with two tokens match only a single Matchable Token in the text (e.g. "Barack Obama" matching "Obama", but also 1000+ "Tom {something}" matching "Tom"). The default configuration of the engine excludes those, but depending on the use case and the linked vocabulary users might want to change this. See the documentation of Min Matched Tokens and Min Label Score for details and examples.

Writing Enhancement Results

This step covers the writing of the TextAnnotations for the linked mentions in the text as well as of the EntityAnnotations for the suggested Entities (including the 'dc:type' information determined by the Type Mappings described below).

Configurations

The configuration of the EntityLinkingEngine is done by passing a TextProcessingConfig and an EntityLinkingConfig to its constructor. Both configuration classes provide an API based configuration (via getters and setters) as well as an OSGI Dictionary based configuration (via a static method that configures a new instance based on a parsed configuration).

The following two sections describe the "key, value" based configuration, as the API based version is already described by the JavaDoc.

Text Processing Configuration

Proper Noun Linking (enhancer.engines.linking.properNounsState)

This is a high level configuration option allowing users to easily specify whether they want to do EntityLinking based on any Nouns ("Noun Linking") or only Proper Nouns ("Proper Noun Linking"). Configuration-wise this pre-sets the defaults for the linkable LexicalCategories and Pos types.

"Noun linking" is equivalent to the behavior of the KeywordLinkingEngine while "Proper Noun Linking" is similar to using NER (Named Entity Recognition) with the NamedEntityLinking engine.

When activating "Proper Noun Linking" users need to ensure that:

  1. the POS tagging for the given languages supports Pos#ProperNoun. If this is not the case for some languages, then language specific configurations need to be used to manually adjust the configuration for those languages. The next section provides examples for that.
  2. the Entities in the vocabulary linked against typically need to be mentioned as proper nouns in the text. Users that need to link vocabularies with Entities that use common nouns as their labels (e.g. House, Mountain, Summer, ...) can typically not use "Proper Noun Linking", with the following exceptions:
    • Entities with labels comprised of multiple common nouns (e.g. White House) can be detected in cases where Chunks are supported and the Link Multiple Matchable Tokens in Phrases option is enabled (see the next sub-section for details).
    • In case Entities mentioned in the text are written as upper case tokens, the Upper Case Token Mode can be set to "LINK" (see the next sub-section for details).

If suitable, it is strongly recommended to activate "Proper Noun Linking", as it greatly increases performance: in typical texts only around 1/10 of the nouns are marked as proper nouns, so the number of vocabulary lookups decreases by the same factor.

Language Processing configuration (enhancer.engines.linking.processedLanguages)

This parameter is used for two things: (1) to specify which languages are processed and (2) to provide specific configurations for how languages are processed. For the 2nd aspect there is also a default configuration that can be extended with language specific settings.

1. Processed Languages Configuration:

For the configuration of the processed languages the following syntax is used:

de
en

This would configure the engine to only process German and English texts. It is also possible to explicitly exclude languages:

!fr
!it
*

This specifies that all languages other than French and Italian are processed by this EntityLinkingEngine instance.

Values MUST BE provided as an Array or Vector. This is done by using the ["elem1","elem2",...] syntax as defined for OSGI ".config" files. The following example shows the two above examples combined into a single configuration:

enhancer.engines.linking.processedLanguages=["!fr","!it","de","en","*"]

2. Language specific Parameter Configuration

In addition to specifying the processed languages, this configuration can also be used to provide language specific parameters. The syntax for parameters is as follows:

{language};{param-name}={param-value};{param-name}={param-value}
*;{param-name}={param-value};{param-name}={param-value}
;{param-name}={param-value};{param-name}={param-value}

The first line sets the parameters for {language}. The 2nd and 3rd lines show that either the wildcard language '*' or the empty language '' can be used to configure parameters that serve as defaults for all languages.

The following param-names are supported by the EntityLinkingEngine:

Phrase level Parameters:

  • pprob ... the minimum probability of Phrase annotations so that they are processed
  • lmmtip ... enables the Link Multiple Matchable Tokens in Phrases option

Token level Parameters:

  • lc ... the LexicalCategory of linkable tokens (e.g. lc=Noun)
  • pos ... the Pos type of linkable tokens (e.g. pos=ProperNoun)
  • tag ... the string tag (as used by the POS tagger) of linkable tokens
  • prob ... the minimum probability of POS annotations so that they are processed
  • uc ... the Upper Case Token mode (e.g. uc=LINK or uc=MATCH)

NOTE that tokens are linked if any of "lc", "pos" or "tag" match the configuration. This means that adding "lc=Noun" will render "pos=ProperNoun" useless, as the Pos type ProperNoun is already included in the LexicalCategory Noun.

Examples:

The default configuration of the EntityLinkingEngine uses the following settings:

*;lmmtip;uc=LINK;prob=0.75;pprob=0.75
de;uc=MATCH
es;lc=Noun
nl;lc=Noun

The first line enables Link Multiple Matchable Tokens in Phrases and linking of upper case tokens for all languages. In addition it sets the minimum probabilities for Pos and Phrase annotations to 0.75 (which is also the default). The following three lines provide additional language specific defaults: for German the upper case mode is reset to MATCH, as in German all nouns are written in upper case. For Spanish and Dutch, linking for the LexicalCategory Noun is enabled. This is because the OpenNLP POS taggers for those languages do not support ProperNouns, and the engine would therefore not link any tokens if only ProperNoun linking was enabled. The same configuration in the OSGI '.config' file syntax looks as follows (NOTE: the line break is only used here for better formatting and must not be included):

enhancer.engines.linking.processedLanguages=
    ["*;lmmtip;uc\=LINK;prop\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]

The 2nd example shows how to define default settings without using the wildcard '*', which would enable processing of all languages. The following example shows a configuration that only enables English and German (with a language specific setting for German) and ignores text in all other languages.

;lmmtip;uc=LINK;prob=0.75;pprob=0.75
en
de;uc=MATCH

Entity Linker Configuration

This configuration allows configuring the linking process against the controlled vocabulary. This includes searching and matching as well as writing the Enhancements for suggestions. NOTE that all parameters support String values regardless of their data type, e.g. passing "true" for a boolean or "1.5" for a floating point value.

The following properties define how Linkable and Matchable Tokens are linked against the Entities of the linked vocabulary

The parameters below are used to configure the matching process.

Type Mappings Syntax

The Type Mappings are used to determine the "dc:type" of the TextAnnotation based on the types of the suggested Entity. The field "Type Mappings" (property: enhancer.engines.linking.typeMappings) can be used to customize such mappings.

This field uses the following syntax

{uri}
{source} > {target}
{source1}; {source2}; ... {sourceN} > {target}

The first variant is a shorthand for {uri} > {uri} and therefore specifies that the {uri} should be used as 'dc:type' for TextAnnotations if the matched entity is of type {uri}. The second variant maps a {source} URI to a {target}. The third variant shows the possibility to map multiple source URIs to the same target in a single configuration line.

Both 'ns:localName' and fully qualified URIs are supported. For supported namespaces see the NamespaceEnum. Information about accepted (INFO) and ignored (WARN) type mappings is available in the logs.

Some Examples of additional Mappings for the e-health domain:

drugbank:drugs; dbp-ont:Drug; dailymed:drugs; sider:drugs; tcm:Medicine > drugbank:drugs
diseasome:diseases; linkedct:condition; tcm:Disease > diseasome:diseases 
sider:side_effects
dailymed:ingredients
dailymed:organization > dbp-ont:Organisation

The first two lines map some well known classes that represent drugs and diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth lines define 1:1 mappings for side effects and ingredients, and the last line adds 'dailymed:organization' as an additional mapping to the DBpedia Ontology Organisation class.

The following mappings are predefined by the EntityLinkingEngine.

dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > dbp-ont:Organisation
dbp-ont:Place; schema:Place; gml:_Feature > dbp-ont:Place
skos:Concept

Extension Points

This section describes the interfaces that are used as extension points by the EntityLinkingEngine.

EntitySearcher

The EntitySearcher interface is used by the EntityLinkingEngine to search for Entities in the linked vocabulary. An EntitySearcher instance is passed to the constructor of the EntityLinkingEngine.

This interface supports the two main functionalities search and dereference, and also provides some additional metadata. The following list gives a short overview of the methods:

  • get(id, includeFields): This method is called with the 'id' of an Entity and needs to return the data of the Entity as a Representation. The returned Representation needs to include at least the passed 'includeFields'. If 'includeFields' is empty or NULL, then all information for the Entity should be included in the returned Representation.

  • lookup(field, includeFields, search, languages, limit): This method is used for searching Entities in the controlled vocabulary. The configured Label Field is passed in the 'field' parameter. The 'includeFields' contain all fields required for the linking process; Representations returned as results need to include values for those fields. The 'search' parameter includes the tokens used for the search. The values should be considered optional; however, results are expected to rank Entities that match more search tokens first. The array of 'languages' specifies the languages that need to be considered for the search. If 'languages' contains NULL or '', labels without a language tag also need to be included in the search (NOTE that this DOES NOT mean to include labels of any language!). Finally, the 'limit' parameter specifies the maximum number of results. If NULL, the implementation can choose a meaningful default.
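
As a rough sketch, a custom implementation could look like the following. The signatures are indicative only (they follow the descriptions above; consult the JavaDoc of the EntitySearcher interface for the authoritative version) and the additional metadata methods are omitted:

public class MyEntitySearcher implements EntitySearcher {

    @Override
    public Representation get(String id, Set<String> includeFields)
            throws EntitySearcherException {
        //dereference the Entity with the given id and return a
        //Representation containing at least the includeFields
        //(all fields if includeFields is NULL or empty)
        return null; //placeholder
    }

    @Override
    public Collection<? extends Representation> lookup(String field,
            Set<String> includeFields, List<String> search,
            String[] languages, Integer limit) throws EntitySearcherException {
        //query the vocabulary: match the 'search' tokens against the labels
        //stored in 'field' for the given 'languages'; rank Entities that
        //match more search tokens first and return at most 'limit' results
        return null; //placeholder
    }

    //... additional metadata methods omitted
}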

The EntityhubLinkingEngine includes EntitySearcher implementations based on the FieldQuery search interface implemented by the Stanbol Entityhub.

Currently the Stanbol Entityhub based implementations are instantiated based on the value of the 'enhancer.engines.linking.entityhub.siteId' property. Users that want a different implementation of this interface to be used for linking need to extend the EntityLinkingEngine and override the #activateEntitySearcher(ComponentContext context, Dictionary configuration) and #deactivateEntitySearcher() methods. Those methods are called during activation/deactivation of the EntityLinkingEngine and are expected to set/unset the #entitySearcher field.
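
A sketch of such an extension, reusing the hypothetical MyEntitySearcher from above (constructors and the remaining component wiring are omitted for brevity):

public class MyLinkingEngine extends EntityLinkingEngine {

    @Override
    protected void activateEntitySearcher(ComponentContext context,
            Dictionary<String,Object> configuration) {
        //called during activation; expected to set the entitySearcher field
        entitySearcher = new MyEntitySearcher(); //hypothetical implementation
    }

    @Override
    protected void deactivateEntitySearcher() {
        //called during deactivation; expected to unset the field
        entitySearcher = null;
    }
}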

LabelTokenizer

The LabelTokenizer interface is used to tokenize the labels of Entity suggestions as returned by the EntitySearcher. As the matching process of the EntityLinkingEngine is based on Tokens (words), multi-word labels (e.g. "University of Munich") need to be tokenized before they can be matched against the current context in the text.

The LabelTokenizer interface defines only the single method String[] tokenize(String label, String language), which gets the label and the language as parameters and returns the tokens as a String array. If the tokenizer is not able to tokenize the label (e.g. because it does not support the language) it MUST return NULL. In this case the engine will try to match the label as a single token.
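
For illustration, a minimal (hypothetical) LabelTokenizer that naively splits English labels on whitespace could look like this:

public class WhitespaceLabelTokenizer implements LabelTokenizer {

    @Override
    public String[] tokenize(String label, String language) {
        if (language != null && language.startsWith("en")) {
            return label.split("\\s+"); //naive whitespace tokenization
        }
        return null; //MUST return NULL for unsupported languages
    }
}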

MainLabelTokenizer

As it is very likely that users will want to use multiple LabelTokenizers for different languages, the EntityLinkingEngine comes with a MainLabelTokenizer implementation. It registers itself as LabelTokenizer with the highest possible OSGI 'service.ranking' and tracks all other registered LabelTokenizers.

So if custom LabelTokenizers register themselves as OSGI services, the MainLabelTokenizer can forward requests to them. It will do so in the order of their 'service.ranking'. In addition, LabelTokenizers can use the 'enhancer.engines.entitylinking.labeltokenizer.languages' property to formally specify the languages they support. This property uses the language configuration syntax (e.g. "en,de" would include English and German; "!it,!fr,*" would specify all languages except Italian and French). If no configuration is provided, all languages are assumed - which is fine as a default as long as LabelTokenizers correctly return NULL for languages they do not support.
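
For example, using the Felix SCR annotations common in the Stanbol code base, the hypothetical tokenizer from above could register itself for English and German like this:

@Component
@Service(LabelTokenizer.class)
@Property(name = "enhancer.engines.entitylinking.labeltokenizer.languages",
          value = {"en", "de"})
public class WhitespaceLabelTokenizer implements LabelTokenizer {
    //... tokenize(..) as shown above
}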

The MainLabelTokenizer forwards tokenize requests to all available LabelTokenizer implementations that support the given language, sorted by their 'service.ranking', until the first one does NOT return NULL. If no LabelTokenizer was found, or all returned NULL, the MainLabelTokenizer also returns NULL.

The following code snippet shows how to use the MainLabelTokenizer as LabelTokenizer for the EntityLinkingEngine

@Reference
LabelTokenizer labelTokenizer;

This will inject the MainLabelTokenizer as it uses Integer.MAX_VALUE as service.ranking.

@Activate
protected void activate(ComponentContext ctx){
    //within the activate method it can then be used
    //to initialize the NamedEntityLinkingEngine
    NamedEntityLinkingEngine engine = new NamedEntityLinkingEngine(
        engineName,
        entitySearcher, //the searcher might not be available
        textProcessingConfig, linkerConfig, //config
        labelTokenizer); //the MainLabelTokenizer
}

Configuring the NamedEntityLinkingEngine like this ensures that all registered LabelTokenizers are considered for tokenizing.

Simple LabelTokenizer

This is the default implementation of a LabelTokenizer and does not have any external dependencies. It behaves exactly like the OpenNLP SimpleTokenizer. It is active by default and configured to process all languages. It uses a 'service.ranking' of '-1000' and will therefore typically be overridden by custom registered implementations.

The main intention of this implementation is to be a reasonable default, ensuring LabelTokenizer support for all languages.

OpenNLP LabelTokenizer

The EntityLinkingEngine also contains an implementation based on the OpenNLP tokenizer API. As the dependencies to OpenNLP and the Stanbol Commons OpenNLP module are optional, this implementation will only be active if the org.apache.stanbol:org.apache.stanbol.commons.opennlp bundle with a version of 0.10.0 or higher is active.

This LabelTokenizer supports the configuration of custom OpenNLP tokenizer models for specific languages e.g. "de;model=my-de-tokenizermodel.zip;*" would use a custom model for German and the default models for all other languages.

Internally the OpenNLP service is used to load the tokenizer models for the languages. That means that tokenizer models are loaded via the DataFileProvider infrastructure. For users this means that custom tokenizer models can be placed in the Stanbol Datafiles directory ({stanbol-working-dir}/stanbol/datafiles).

LinkingStateAware

Added with STANBOL-1070, this interface allows receiving callbacks about the processing state of the entity linking process. The interface defines methods for the start/end of a section as well as the start/end of a token. Both the start and the end methods are passed the active Span as parameter. An instance of this interface can be passed to the constructor of the EntityLinker implementation.

The typical usage of this extension point is as follows:

@Reference 
protected LabelTokenizer labelTokenizer;

private TextProcessingConfig textProcessingConfig;
private EntityLinkerConfig linkerConfig;

private EntitySearcher entitySearcher;

@Activate
@SuppressWarnings("unchecked")
protected void activate(ComponentContext ctx) throws ConfigurationException {
    super.activate(ctx);
    Dictionary<String,Object> properties = ctx.getProperties();
    //extract the TextProcessing and EntityLinking config from the provided properties
    textProcessingConfig = TextProcessingConfig.createInstance(properties);
    linkerConfig = EntityLinkerConfig.createInstance(properties,prefixService);

    //create/init the entitySearcher
    entitySearcher = new MyEntitySearcher();

    //parse additional properties
}

public void computeEnhancements(ContentItem ci) throws EngineException {
    AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
    String language = NlpEngineHelper.getLanguage(this, ci, true);
    //get the language specific configuration (used as 'languageConfig' below)
    LanguageProcessingConfig languageConfig = textProcessingConfig.getConfiguration(language);

    //create an instance of your LinkingStateAware implementation
    LinkingStateAware linkingStateAware = new YourImpl(/*..*/);

    //create one EntityLinker instance per enhancement request
    EntityLinker entityLinker = new EntityLinker(at,language, 
        languageConfig, entitySearcher, linkerConfig, 
        labelTokenizer, linkingStateAware);

    //during processing we will receive callbacks to the 
    //linkingStateAware instance
    try {
        entityLinker.process();
    } catch (EntitySearcherException e) {
        log.error("Unable to link Entities with "+entityLinker,e);
        throw new EngineException(this, ci, "Unable to link Entities with "+entityLinker, e);
    }
}

Note that it is also possible to use a single EntityLinker/LinkingStateAware pair to process multiple ContentItems. However, in this case the received callbacks need to be filtered based on the AnalysedText that is the context of the Span instances passed to the callback methods.

@Override
public void startToken(Token token) {
    //process based on the context
    AnalysedText at = token.getContext();
    // …
}

In addition, such a usage requires the LinkingStateAware implementation to be thread safe.