EntityLinkingEngine
The EntityLinkingEngine is an Engine that consumes NLP processing results from the AnalyzedText content part and uses that information to link (search and match) entities from a configured vocabulary.
To do so it uses the following configurations and components:
- Text Processing Configuration: This configures how the EntityLinkingEngine consumes NLP processing results. Such configurations can be language specific.
- Entity Linking Configuration: This configures various properties that are used for the linking process with the vocabulary
- EntitySearcher: This interface is used to search and dereference Entities. It needs to be implemented to use a data source for linking with the EntityLinkingEngine. Stanbol provides implementations for the Stanbol Entityhub (see EntityhubLinkingEngine).
- LabelTokenizer: While the processed text is already tokenized, the Entity labels are not. To match labels against the text the EntityLinkingEngine therefore needs to tokenize those labels. Apache Stanbol provides a default implementation of this interface based on the OpenNLP tokenizer API.
The EntityLinkingEngine can not be used directly, as the four things listed above need to be passed to its constructor. It is instead intended to be configured/extended by other components. The EntityhubLinkingEngine is one of them, configuring the EntityLinkingEngine with an EntitySearcher for the Stanbol Entityhub.
This documentation first describes the implemented entity linking process, then provides information about the supported configuration parameters of the Text Processing Configuration and the Entity Linking Configuration. The last part describes how to extend the EntityLinkingEngine by implementing/providing a custom EntitySearcher and LabelTokenizer.
Linking Process:
The Linking Process consists of three major steps: first, consuming the results of the NLP processing to determine the tokens - words - that need to be linked with the configured vocabulary; second, linking entities based on their labels with the current section of the text; and third, writing the enhancement results.
Token Types
The EntityLinkingEngine operates based on tokens (words). Those tokens are divided into the following categories:
- Linkable Tokens: These are words that are linked with the vocabulary. This means that the engine will issue queries in the controlled vocabulary for those tokens.
- Matchable Tokens: Matchable tokens are used to refine queries. For the matching of entity labels with the text those words are treated in the same way as linkable words. The main difference is that matchable words alone will not cause the engine to query for Entities in the controlled vocabulary.
- Other Tokens: All other tokens in the text are not used for searches in the configured vocabulary. However, during the matching of labels with the text they are considered, as they might also be present in labels of entities.
"University of Salzburg" is a good example: 'University' - a common noun - can be considered a matchable token, 'of' an other token, and 'Salzburg' - a proper noun - a typical linkable token. As the engine only queries for linkable tokens, a single query for 'Salzburg' would be issued against the vocabulary. However this query would also use the matchable token 'University' as a secondary query term. The token 'of' would only be considered during matching.
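The categorization above can be sketched as follows. This is an illustrative Python sketch, not the engine's actual (Java) API; the POS-to-category mapping is a simplified assumption - the real engine derives it from the configurable mappings described in the Text Processing Configuration section.

```python
# Illustrative sketch: classifying tokens into the three categories
# described above, based on a (hypothetical) POS tag per token.
def classify_token(token, pos_tag):
    """Return 'linkable', 'matchable' or 'other' for a token."""
    if pos_tag == "ProperNoun":
        return "linkable"          # triggers a vocabulary query
    if pos_tag in ("Noun", "CommonNoun"):
        return "matchable"         # refines queries, matched within labels
    return "other"                 # only considered during label matching

tokens = [("University", "CommonNoun"), ("of", "Adposition"),
          ("Salzburg", "ProperNoun")]
types = [classify_token(t, p) for t, p in tokens]
# "University of Salzburg": one query (for 'Salzburg'), with 'University'
# as a secondary term; 'of' is only considered while matching labels.
```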
In addition to the token type the engine also determines the following properties:
- Token Length: The number of characters of a word. This is especially important for languages where no POS tagger is available.
- Alpha-Numeric: Whether a Token contains an alphabetic or numeric character. This is mainly used to skip processing of tokens that represent punctuation.
- Upper Case: Upper case tokens often represent named entities. Because of that the engine keeps track of upper case words.
- Token Phrase: Whether a Token is a member of a processable Phrase. Phrases are groups of Tokens that can be detected by a Chunker. Typical examples are Noun Phrases.
Consumed NLP Processing Results:
The EntityLinkingEngine consumes NLP processing results from the AnalyzedText ContentPart of the processed ContentItem. The following list describes the consumed information and its usage in the linking process:
- Language (required): The language of the text is acquired from the metadata of the ContentItem. It is required to search for labels in the correct language and also to correctly apply language specific configurations of the engine.
- Sentences (optional): Sentence annotations are used as segments for the matching process. In addition, for the first word of a Sentence the Upper Case feature is NOT set. If no Sentence annotations are present the whole text is treated as a single Sentence.
- Tokens (required): As this engine is based on the processing of Tokens, this information is absolutely required.
- POS Annotations (optional): Part of Speech (POS) tags are used to determine the Token Type. The NLP processing module provides two enumerations that define POS types: the high level LexicalCategory enumeration (16 members including "Noun", "Verb", "Adjective", "Adposition", ...) and the Pos enumeration with ~150 very detailed POS definitions (e.g. "ProperNoun", "CommonNoun", "Infinitive", "Gerund", "PresentParticiple", ...). In addition the engine can also be configured to use the string tag as used by the POS tagger. The mapping of the POS annotation to the Token Type is provided by the engine configuration and can be language specific.
- Phrase Annotation (optional): Phrase Annotations of Chunks present in the AnalyzedText are checked against the configured processable phrase categories. The linking of Tokens is NOT limited to Tokens within processable phrases. Phrases are only used as additional context to improve the matching process. The Lexical Category and the string tags used by the Chunker can be used to configure the processable Phrase categories.
- Lemma (optional): The lemma provided by the MorphoAnalysis annotation can be used for linking instead of the token as it appears in the text.
Entity Linking:
The linking process is based on matching the labels of entities returned as results of searches in the configured controlled vocabulary. In addition the engine can be configured to consider redirects for entities returned by searches.
Searches are issued only for Linkable Tokens and may include up to Max Search Tokens additional Linkable or Matchable Tokens. If the Linkable Token is within a Phrase then only other tokens within the same phrase are considered. Otherwise any Linkable or Matchable Tokens within the configured Max Search Token Distance are considered for the search.
Searches in the controlled vocabulary are issued using the EntitySearcher interface and are built as follows:
{lt}@{lang} || {lt}@{dl} || [{at}@{lang} || {at}@{dl} ... ]
where:
- {lt} ... the Linkable Token for that the search is issued
- {at} ... additional Linkable- or Matchable Tokens included in the search
- {lang} ... the language of the text
- {dl} ... the configured Default Matching Language. If '{dl} == {lang}' then the OR term(s) for the {dl} are omitted
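The query structure above can be sketched as follows. The function and variable names are assumptions for this example only; the real engine builds queries via the EntitySearcher interface rather than as strings.

```python
# Illustrative sketch of the OR-ed label terms of one vocabulary search,
# following the {lt}@{lang} || {lt}@{dl} || [{at}@{lang} || ...] scheme.
def build_label_terms(linkable, additional, lang, default_lang):
    """Build the OR-ed label terms for one vocabulary search."""
    terms = []
    for token in [linkable] + additional:
        terms.append(f"{token}@{lang}")
        if default_lang != lang:      # {dl} terms omitted when {dl} == {lang}
            terms.append(f"{token}@{default_lang}")
    return " || ".join(terms)

query = build_label_terms("Salzburg", ["University"], "de", "en")
# → "Salzburg@de || Salzburg@en || University@de || University@en"
```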
For results of those queries the labels in {lang} and {dl} are matched against the text. However {dl} labels are only considered if no match was found for labels in the language of the text. For matching labels with the Tokens of the text the engine needs to tokenize the labels. This is done by using the LabelTokenizer interface.
The matching process distinguishes between matchable and non-matchable Tokens as well as non-alpha-numeric Tokens, which are completely ignored. Matching starts at the position of the Linkable Token for which the search in the configured vocabulary was issued. From this position Tokens in the label are matched with Tokens in the text until the first matchable or the second non-matchable token fails to match. In a second round the same is done in the backward direction. The configured Min Token Match Factor determines how exactly tokens in the text must correspond to tokens in the label for a match to be counted. This is repeated for all labels of an Entity. The label match that covers the most tokens is then considered as the match for that Entity.
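The bidirectional matching loop can be sketched as below. This is a simplification under stated assumptions: real scoring (Min Token Match Factor, partial token matches) is omitted, and a token "matches" here only on case-insensitive equality.

```python
# Simplified sketch of the bidirectional label matching described above.
def matched_span(text_tokens, matchable, anchor, label_tokens):
    """Count text tokens (scanning outwards from the anchor) covered by the label."""
    label = {t.lower() for t in label_tokens}

    def scan(indices):
        covered, misses = 0, 0
        for i in indices:
            if text_tokens[i].lower() in label:
                covered += 1
            elif matchable[i]:
                break                # first unmatched matchable token stops
            else:
                misses += 1
                if misses >= 2:      # ... as does the 2nd unmatched other token
                    break
        return covered

    forward = scan(range(anchor, len(text_tokens)))
    backward = scan(range(anchor - 1, -1, -1))
    return forward + backward

text = ["the", "University", "of", "Salzburg"]
matchable = [False, True, False, True]
# anchor on 'Salzburg' (index 3); label "University of Salzburg" covers 3 tokens
covered = matched_span(text, matchable, 3, ["University", "of", "Salzburg"])
```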
There are various parameters that can be used to fine tune the matching process. But the most important decision is whether one wants to include suggestions where labels with two tokens only match a single Matchable Token in the text (e.g. "Barack Obama" matching "Obama", but also 1000+ "Tom {something}" matching "Tom"). The default configuration of the engine excludes those, but depending on the use case and the linked vocabulary users might want to change this. See the documentation of Min Matched Tokens and Min Label Score for details and examples.
Writing Enhancement Results
This step covers the following tasks:
- processing of redirects as configured by the Redirect Mode
- mapping of the Entity types to the dc:type values for fise:TextAnnotations as configured by the Type Mappings configuration
- if Dereference Entities is enabled, the information for all configured Dereferenced Fields needs to be obtained
- writing of the fise:TextAnnotations, fise:EntityAnnotations and dereferenced entities (if enabled) to the metadata of the processed ContentItem
Configurations
The configuration of the EntityLinkingEngine is done by passing a TextProcessingConfig and an EntityLinkingConfig to its constructor. Both configuration classes provide an API based configuration (via getters and setters) as well as an OSGi Dictionary based configuration (via a static method that configures a new instance from a parsed configuration).
The following two sections describe the "key, value" based configuration, as the API based version is covered by the JavaDoc.
Text Processing Configuration
Proper Noun Linking (enhancer.engines.linking.properNounsState)
This is a high level configuration option allowing users to easily specify if they want to do Entity Linking based on any Nouns ("Noun Linking") or only Proper Nouns ("Proper Noun Linking"). Configuration wise this pre-sets the defaults for the linkable LexicalCategories and Pos types.
"Noun linking" is equivalent to the behavior of the KeywordLinkingEngine while "Proper Noun Linking" is similar to using NER (Named Entity Recognition) with the NamedEntityLinking engine.
When activating "Proper Noun Linking" users need to ensure that:
- the POS tagging for the given languages supports Pos#ProperNoun. If this is not the case for some languages, then language specific configurations need to be used to manually adjust the configuration for such languages. The next section provides examples for that.
- the Entities in the vocabulary linked against are typically mentioned as proper nouns in the text. Users that need to link vocabularies with Entities that use common nouns as their labels (e.g. House, Mountain, Summer, ...) can typically not use "Proper Noun Linking", with the following exceptions:
- Entities with labels comprised of multiple common nouns (e.g. White House) can be detected in cases where Chunks are supported and the Link Multiple Matchable Tokens in Phrases option is enabled (see the next sub-section for details).
- In case Entities mentioned in the text are written as upper case tokens, the Upper Case Token Mode can be set to "LINK" (see the next sub-section for details).
If suitable, it is strongly recommended to activate "Proper Noun Linking" as it greatly improves performance: in typical text only around 1/10 of the nouns are marked as proper nouns, so the number of vocabulary lookups decreases by the same factor.
Language Processing configuration (enhancer.engines.linking.processedLanguages)
This parameter is used for two things: (1) to specify which languages are processed and (2) to provide specific configurations on how languages are processed. For the 2nd aspect there is also a default configuration that can be extended with language specific settings.
1. Processed Languages Configuration:
For the configuration of the processed languages the following syntax is used:
de en
This would configure the engine to only process German and English texts. It is also possible to explicitly exclude languages:
!fr !it *
This specifies that all languages other than French and Italian are processed by an EntityLinkingEngine instance.
Values MUST BE provided as Array or Vector. This is done by using the ["elem1","elem2",...] syntax as defined for OSGi ".config" files. The following example shows the two above examples combined into a single configuration:
enhancer.engines.linking.processedLanguages=["!fr","!it","de","en","*"]
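The include/exclude semantics of this configuration ('!' excludes a language, '*' is the wildcard) can be sketched as follows; this is an illustrative Python sketch, not the engine's actual implementation.

```python
# Sketch of the processed-languages semantics: explicit excludes win,
# then explicit includes, then the '*' wildcard.
def is_processed(lang, config):
    entries = [e.split(";")[0] for e in config]   # strip per-language params
    if "!" + lang in entries:
        return False
    return lang in entries or "*" in entries

config = ["!fr", "!it", "de", "en", "*"]
# French is excluded, German included explicitly, Japanese via the wildcard.
```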
2. Language specific Parameter Configuration
In addition to specifying the processed languages this configuration can also be used to pass language specific parameters. The syntax for parameters is as follows:
{language};{param-name}={param-value};{param-name}={param-value}
*;{param-name}={param-value};{param-name}={param-value}
;{param-name}={param-value};{param-name}={param-value}
The first line sets the parameter for {language}. The 2nd and 3rd line show that either the wildcard language '*' or the empty language '' can be used to configure parameters that are used as defaults for all languages.
The following param-names are supported by the EntityLinkingEngine:
Phrase level Parameters:
- pc {name}::LexicalCategory - The Phrase Categories processed by the engine. Valid values are the names of members of the LexicalCategory enumeration (e.g. "Noun", "Verb", "Adjective", "Adposition", ...)
- ptag {tag}::String - the Phrase Tags processed by the engine. This allows to configure the string tags as used by the Chunker of a language. This should only be used if the Chunk types of the Chunker are not mapped to members of the LexicalCategory enumeration.
- pprob [0..1)::double - the Min Phrase Tag Probability for Chunks to be accepted as processable ('value/2' is sufficient for rejecting).
- lmmtip [''/true/false]::boolean - the Link Multiple Matchable Tokens in Phrases parameter. As the name says, it allows to enable/disable the linking of multiple matchable tokens within the same Chunk. This is especially important if Proper Noun Linking is active, as it allows to detect 'named entities' that are constituted by two common nouns. NOTE that 'lmmtip' alone is short for 'lmmtip=true'.
Token level Parameters:
- lc {name}::LexicalCategory - The linked Token Categories. Valid values are the names of members of the LexicalCategory enumeration (e.g. "Noun", "Verb", "Adjective", "Adposition", ...). Typical configurations include "lc=Noun" or an empty list ("lc" or "lc=") to deactivate all categories and provide a more fine granular Pos or Tag level configuration.
- pos {name}::Pos - The linked Pos types. Valid values are the names of members of the Pos enumeration (e.g. "ProperNoun", "CommonNoun", "Infinitive", "Gerund", "PresentParticiple" and ~150 others). This parameter can be used to provide a very fine granular configuration. It is e.g. used by the Proper Noun Linking setting to define that only "pos=ProperNoun" tokens are linked.
- tag {tag}::String - The linked Pos Tags. This parameter allows to configure POS tags as used by the POS tagger. This is useful if those Tags are not mapped to LexicalCategories or Pos types.
- prob [0..1)::double - the Min PosTag Probability. This parameter replaces the formerly used Min POS tag probability (org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability) property. It defines the minimum confidence for a POS annotation to be accepted for linkable and matchable tokens ('value/2' is sufficient for rejecting non-linked/matched tokens).
- uc {NONE/MATCH/LINK}::string - the Upper Case Token Mode allows to configure how upper case words are treated. There are three possible modes: (1) NONE: they are not treated specially; (2) MATCH: they are considered as matchable tokens (independent of the POS tag or the token length); (3) LINK: they are in any case linked with the vocabulary. The default is "LINK" - as upper case words often represent named entities - with the exception of German ('de') where the mode is set to MATCH, as all nouns in German are written in upper case.
NOTE that tokens are linked if any of "lc", "pos" or "tag" match the configuration. This means that adding "lc=Noun" will render "pos=ProperNoun" redundant, as the Pos type ProperNoun is already included in the LexicalCategory Noun.
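The OR semantics of this NOTE can be sketched as follows; the annotation fields and configured values are illustrative assumptions, not the engine's actual data model.

```python
# Sketch of the OR semantics: a token is linked if ANY of the configured
# categories ('lc'), Pos types ('pos') or string tags ('tag') match.
def is_linked(annotation, lc=(), pos=(), tag=()):
    return (annotation["category"] in lc
            or annotation["pos"] in pos
            or annotation["tag"] in tag)

proper_noun = {"category": "Noun", "pos": "ProperNoun", "tag": "NNP"}
common_noun = {"category": "Noun", "pos": "CommonNoun", "tag": "NN"}
# With lc=("Noun",) configured, adding pos=("ProperNoun",) changes nothing,
# because ProperNoun is already covered by the LexicalCategory Noun.
```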
Examples:
The default configuration for the EntityLinkingEngine uses the following settings:
*;lmmtip;uc=LINK;prob=0.75;pprob=0.75
de;uc=MATCH
es;lc=Noun
nl;lc=Noun
The first line enables Link Multiple Matchable Tokens in Phrases and the linking of upper case tokens for all languages. In addition it sets the minimum probabilities for Pos and Phrase annotations to 0.75 (which is also the default). The following three lines provide additional language specific defaults. For German the upper case mode is reset to MATCH, as in German all nouns are written in upper case. For Spanish and Dutch, linking for the LexicalCategory Noun is enabled. This is because the OpenNLP POS taggers for those languages do not support ProperNouns, and therefore the engine would not link any tokens if Proper Noun Linking was enabled. The same configuration in the OSGi '.config' file syntax looks as follows:
enhancer.engines.linking.processedLanguages=["*;lmmtip;uc\=LINK;prob\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
The 2nd example shows how to define default settings without using the wildcard '*', which would enable processing of all languages. The following configuration only enables English and German and ignores text in all other languages.
;lmmtip;uc=LINK;prob=0.75;pprob=0.75
en
de;uc=MATCH
Entity Linker Configuration
This configuration covers the linking process with the controlled vocabulary. This includes all searching and matching as well as writing the Enhancements for suggestions. NOTE that all parameters support string values regardless of the data type, e.g. parsing "true" is supported for booleans and "1.5" for floating point values.
- Label Field (enhancer.engines.linking.labelField): The name of the field/property used to link (search and match) Entities. Only a single field is supported for performance reasons.
- Case Sensitivity (enhancer.engines.linking.caseSensitive): Boolean switch that allows to activate/deactivate case sensitive matching. It is important to understand that even with case sensitivity activated, an Entity with a label such as "Anaconda" will be suggested for the mention of "anaconda" in the text. The main difference will be the confidence value of such a suggestion, as with case sensitivity activated the starting letters "A" and "a" are NOT considered to be matching. See the second technical part for details about the matching process. Case Sensitivity is deactivated by default. It is recommended to activate it if controlled vocabularies contain abbreviations similar to commonly used words, e.g. CAN for Canada.
- Type Field (enhancer.engines.linking.typeField): Values of this field are used as values of the "fise:entity-types" property of created "fise:EntityAnnotation"s. The default is "rdf:type". NOTE that in contrast to the NamedEntityLinking the types are not used for the linking process. They are only used while writing the 'fise:EntityAnnotation's and to determine the 'dc:type' values of 'fise:TextAnnotation's.
- Type Mappings (enhancer.engines.linking.typeMappings): The FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes TextAnnotations and EntityAnnotations. The EntityLinkingEngine needs to create both types of annotations: TextAnnotations selecting the words that match some Entities in the controlled vocabulary, and EntityAnnotations that represent an Entity suggested for a TextAnnotation. The Type Mappings are used to determine the "dc:type" of the TextAnnotation based on the types of the suggested Entity. The default configuration comes with mappings for Persons, Organizations, Places and Concepts, but this field allows to define additional mappings. For details about the syntax see the sub-section "Type Mappings Syntax" below.
- Redirect Field (enhancer.engines.linking.redirectField) and Redirect Mode (enhancer.engines.linking.redirectMode): Redirects allow to follow links to other entities defined in the vocabulary linked against. This is useful in cases where matched Entities are not equal to the Entities that users want to suggest. A good example is DBpedia, where the Entity 'dbpedia:USA' only defines the label "USA" and a redirect to the Entity 'dbpedia:United_States' with all the information. The Redirect Mode defines how redirects are handled: "IGNORE" ignores them; "ADD_VALUES" causes information of the redirected entity ('dbpedia:United_States') to be added to the matched one ('dbpedia:USA'); "FOLLOW" will suggest the redirected Entity ('dbpedia:United_States') instead of the matched one ('dbpedia:USA'). The Redirect Field defines the field/property used for redirects.
- Suggestions (enhancer.engines.linking.suggestions): The maximum number of suggestions. The default value is '3'. If the engine is used in combination with a post processing engine (e.g. disambiguation), users might want to increase this value.
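The three Redirect Modes described above can be sketched as follows. Entity data is modeled here as plain Python dicts for illustration; the real engine works on Entityhub Representations.

```python
# Sketch of the Redirect Modes IGNORE / ADD_VALUES / FOLLOW.
def apply_redirect(mode, matched, redirected):
    if mode == "IGNORE" or redirected is None:
        return matched
    if mode == "ADD_VALUES":
        merged = dict(redirected)
        merged.update(matched)          # matched entity's own values win
        return merged
    if mode == "FOLLOW":
        return redirected               # suggest the redirect target instead
    raise ValueError(f"unknown redirect mode: {mode}")

usa = {"id": "dbpedia:USA", "label": "USA"}
united_states = {"id": "dbpedia:United_States", "label": "United States"}
```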
The following properties define how Linkable and Matchable Tokens are linked against the Entities of the linked vocabulary:
- Default Matching Language (enhancer.engines.linking.defaultMatchingLanguage): Linking is always done in the language of the processed text and in the Default Matching Language. By default this matches labels without a language tag, but this parameter allows to override this with a specific language. This is e.g. useful for DBpedia, where all labels are marked with the language of the source Wikipedia data, so it makes sense to configure the default matching language accordingly.
- Max Search Token Distance (enhancer.engines.linking.maxSearchTokenDistance): The maximum number of Tokens searched around a linked token to find additional matchable tokens to include in searches for Entities. The default value is '3'. As an example, in the text section "at the University of Munich a new procedure to" only "Munich" would be marked as a linkable token if Proper Noun Linking is activated. However for searching Entities it makes sense to also use the matchable term 'University', because otherwise a search would potentially return a huge number of candidate Entities mentioning 'Munich' in their labels. This parameter configures the maximum distance of tokens so that the EntityLinkingEngine may include them as additional optional constraints for queries via the EntitySearcher interface. NOTE that this parameter will not allow to include tokens outside of a processable chunk if the linked token is within such a chunk.
- Max Search Tokens (enhancer.engines.linking.maxSearchTokens): The maximum number of Tokens used for searches via the EntitySearcher interface. The default value is '2'. In case more matchable tokens are within the configured Max Search Token Distance, those closer to and trailing the linkable token are preferred. E.g. in the text "president Barack Obama", where 'Barack' is the currently active linkable token, this will result in a query with the tokens 'Barack' OR 'Obama' if Max Search Tokens=2 and Max Search Token Distance>=1, because both 'president' and 'Obama' have a distance of 1 but trailing tokens are preferred.
- Lemma based Matching (enhancer.engines.linking.lemmaMatching): If this feature is enabled then the MorphoFeatures#getLemma() values are used instead of the Token#getSpan() values, if present.
- Min Search Token Length (enhancer.engines.linking.minSearchTokenLength): This is used as a fallback if the Tokens in the AnalyzedText do not contain Part of Speech annotations or if the confidence of those annotations is too low. The default value is '3', meaning that in such cases all tokens with more than '3' characters are linked with the vocabulary. NOTE that this configuration might move to the Text Processing Configuration in future versions.
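The selection of additional search tokens described above (closer tokens win, trailing tokens preferred on ties) can be sketched as follows. The function name and index-based token model are assumptions for this example only.

```python
# Sketch of selecting additional search tokens around the linkable token:
# sorted by distance, with trailing tokens preferred over leading ones.
def select_search_tokens(matchable_idx, anchor, max_tokens, max_distance):
    candidates = [i for i in matchable_idx
                  if i != anchor and abs(i - anchor) <= max_distance]
    # sort by distance; trailing (i > anchor) before leading on ties
    candidates.sort(key=lambda i: (abs(i - anchor), i < anchor))
    return candidates[:max_tokens]

# "president Barack Obama": matchable tokens at indices 0, 1, 2;
# anchor is 'Barack' (index 1); one additional token may join the query.
chosen = select_search_tokens([0, 1, 2], anchor=1, max_tokens=1, max_distance=1)
# → [2]: 'Obama' is preferred over 'president'
```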
The parameters below are used to configure the matching process.
- Minimum Chunk Match Score (enhancer.engines.linking.minChunkMatchScore): If the mention of an Entity is within a Chunk (e.g. a Noun Phrase) this specifies the minimum percentage of Tokens the detected Entity must match to be accepted. Only matchable tokens of phrases are counted (e.g. for the phrase "lovely Julia Roberts" only "Julia Roberts" would count, as 'lovely' is an adjective). By default this is set to 0.51, so an Entity with the label "Julia" would not be accepted. NOTE: This only considers 'processable' chunks. Because of that it also depends on the pc parameter of the Language Processing configuration. This feature was introduced with STANBOL-1211.
- Minimum Token Match Score (enhancer.engines.linking.minTokenScore): This defines how well single tokens of the text need to match single tokens in the label to be considered as matching. This parameter configures the lower limit. However the actual token match score also influences the overall matching score of the label with the text, so non-exact matches decrease the matching score for the whole label.
- Min Label Score (enhancer.engines.linking.minLabelScore) [0..1]::double: The "Label Score" [0..1] represents how much of the label of an Entity matches the text. It compares the number of Tokens of the label with the number of Tokens matched in the text. Non-exact matches for Tokens, or Tokens within the label appearing in another order than in the text, also reduce this score. Entities are only considered if at least one of their labels scores higher than the minimum for all three of Min Label Score, Min Text Score and Min Match Score.
- Min Matched Tokens (enhancer.engines.linking.minFoundTokens) [1..*]::int: The minimum number of matching tokens. Only "matchable" tokens are counted. For full matches (where all tokens of the label match tokens in the text) this parameter is ignored. This parameter is strongly related to Min Label Score. Typical settings are:
- Min Matched Tokens=1 and Min Label Score > 0.5 (e.g. 0.75)
- Min Matched Tokens=2 and Min Label Score <= 0.5 (e.g. 0.5)
For labels consisting of one or two words both options have the same result, but for longer labels the first option is more restrictive than the second. The important thing is that both options ensure that labels with more than one token will not be considered if only a single token matches the text.
If used in combination with a disambiguation engine one might want to suggest Entities where only a single token of multi-token labels matches. In such cases a configuration like Min Matched Tokens=1 and Min Label Score <= 0.5 (e.g. 0.4) might be considered. With such scenarios users will also want to considerably increase the value for Max Suggestions (typically values > 10).
- Min Text Score (enhancer.engines.linking.minTextScore) [0..1]::double: The "Text Score" [0..1] represents how well the label of an Entity matches the selected span in the text. It compares the number of matched Tokens of the label with the number of Tokens enclosed by the span in the text the Entity is suggested for. Non-exact matches for Tokens, or Tokens within the label appearing in another order than in the text, also reduce this score. Entities are only considered if at least one of their labels scores higher than the minimum for all three of Min Label Score, Min Text Score and Min Match Score.
- Min Match Score (enhancer.engines.linking.minMatchScore) [0..1]::double: Defined as the product of the "Text Score" and the "Label Score" - meaning that this value represents both how well the label matches the text and how much of the label is matched. Entities are only considered if at least one of their labels scores higher than the minimum for all three of Min Label Score, Min Text Score and Min Match Score.
- Use EntityRankings (enhancer.engines.linking.useEntityRankings)::boolean (default=true): Entity rankings can be used to define the ranking (popularity, importance, connectivity, ...) of an entity relative to others within the knowledge base. While the fise:confidence values calculated by the EntityLinkingEngine only represent how well a label of the entity matches the given section of the processed text, for many use cases it makes sense to sort Entities with the same score based on their entity rankings (e.g. users would expect "Paris (France)" to be suggested before "Paris (Texas)" for 'Paris' appearing in a text). Enabling this feature will slightly (< 0.1) change the score of suggestions to ensure such an ordering.
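The relation between the three scores and the acceptance rule can be sketched as follows. All inputs here are simple token counts, and the threshold defaults are illustrative assumptions; the real scores also penalize inexact and out-of-order token matches.

```python
# Sketch of Label Score, Text Score and Match Score (their product),
# plus the "all three thresholds must hold" acceptance rule.
def scores(label_tokens, span_tokens, matched):
    label_score = matched / label_tokens   # how much of the label matched
    text_score = matched / span_tokens     # how well the span is covered
    return label_score, text_score, label_score * text_score

def accepted(label_tokens, span_tokens, matched,
             min_label=0.75, min_text=0.4, min_match=0.3):
    # NOTE: the threshold values above are illustrative, not the defaults
    ls, ts, ms = scores(label_tokens, span_tokens, matched)
    return ls >= min_label and ts >= min_text and ms >= min_match
```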
Type Mappings Syntax
The Type Mappings are used to determine the "dc:type" of the TextAnnotation based on the types of the suggested Entity. The field "Type Mappings" (property: enhancer.engines.linking.typeMappings) can be used to customize such mappings.
This field uses the following syntax:
{uri}
{source} > {target}
{source1}; {source2}; ... {sourceN} > {target}
The first variant is a shorthand for {uri} > {uri} and therefore specifies that the {uri} should be used as 'dc:type' for TextAnnotations if the matched entity is of type {uri}. The second variant maps a {source} URI to a {target}. The third variant shows the possibility to map multiple URIs to the same target in a single configuration line.
Both 'ns:localName' and fully qualified URIs are supported. For supported namespaces see the NamespaceEnum. Information about accepted (INFO) and ignored (WARN) type mappings is available in the logs.
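Parsing one such mapping line can be sketched as follows; the function name is an assumption for this example only, not the engine's actual parser.

```python
# Sketch of parsing one Type Mappings line into {source: dc:type target}
# pairs, covering all three syntax variants described above.
def parse_mapping(line):
    if ">" in line:
        sources, target = line.split(">")
        return {s.strip(): target.strip() for s in sources.split(";")}
    uri = line.strip()
    return {uri: uri}              # '{uri}' is short for '{uri} > {uri}'
```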
Some examples of additional mappings for the e-health domain:
drugbank:drugs; dbp-ont:Drug; dailymed:drugs; sider:drugs; tcm:Medicine > drugbank:drugs
diseasome:diseases; linkedct:condition; tcm:Disease > diseasome:diseases
sider:side_effects
dailymed:ingredients
dailymed:organization > dbp-ont:Organisation
The first two lines map some well known classes that represent drugs and diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth lines define 1:1 mappings for side effects and ingredients, and the last line adds 'dailymed:organization' as an additional mapping to the DBpedia Ontology Organisation class.
The following mappings are predefined by the EntityLinkingEngine:
dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > dbp-ont:Organisation
dbp-ont:Place; schema:Place; gml:_Feature > dbp-ont:Place
skos:Concept
Extension Points
This section describes the interfaces that are used as extension points by the EntityLinkingEngine.
EntitySearcher
The EntitySearcher interface is used by the EntityLinkingEngine to search for Entities in the linked vocabulary. An EntitySearcher instance is passed to the constructor of the EntityLinkingEngine.
This interface provides the two main functionalities search and dereference, but also some additional metadata. The following list provides a short overview of the methods:
- Dereference Entities get(String id,Set<String> includeFields)::Representation
This method is called with the 'id' of an Entity and needs to return the data of the Entity as a Representation. The returned Representation needs to include at least the parsed 'includeFields'. If 'includeFields' is empty or NULL then all information for the Entity should be included in the returned Representation.
- Entity Search lookup(String field, Set<String> includeFields, List<String> search, String[] languages,Integer limit)::Collection<Representation>
This method is used for searching entities in the controlled vocabulary. The configured Label Field is parsed in the 'field' parameter. 'includeFields' contains all fields required for the linking process; Representations returned as results need to include values for those fields. The 'search' parameter includes the tokens used for the search. Values should be considered optional, however results are expected to rank Entities that match more search tokens first. The array of 'languages' is used to parse the languages that need to be considered for the search. If 'languages' contains NULL or '' it means that labels without a language tag also need to be included in the search (NOTE that this DOES NOT mean to include labels of any language!). Finally the 'limit' parameter is used to specify the maximum number of results. If NULL, the implementation can choose a meaningful default.
- Offline Mode supportsOfflineMode()::boolean : indicates whether the EntitySearcher implementation needs to connect to a remote service. This is needed to deactivate the EntityLinkingEngine in cases where Apache Stanbol is started in OfflineMode.
- Search Result Limit getLimit()::Integer : the maximum number of search results supported by the EntitySearcher implementation. Can return NULL if not applicable or unknown.
- Origin Information getOriginInformation()::Map<UriRef,Collection<Resource>> : This method allows returning information about the origin that is added to every 'fise:EntityAnnotation' created by the EntityLinkingEngine. This is e.g. used by the Entityhub based implementation to provide the 'id' of the Entityhub Site from which the Entities were retrieved.
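The contract described above can be illustrated with a simplified in-memory sketch. Note that this is not the real Stanbol interface: the actual EntitySearcher works with Representation and UriRef types, while here entity data is modeled as plain maps (field name to list of values) purely for illustration, and all names are hypothetical.

```java
import java.util.*;

// Simplified in-memory sketch of the EntitySearcher contract (illustrative only;
// the real Stanbol interface uses Representation/UriRef instead of Maps).
public class InMemoryEntitySearcher {

    private final Map<String, Map<String, List<String>>> entities = new HashMap<>();

    public void add(String id, Map<String, List<String>> data) {
        entities.put(id, data);
    }

    // Dereference: return the data of the entity with the given id.
    // Empty/NULL includeFields means "all information".
    public Map<String, List<String>> get(String id, Set<String> includeFields) {
        Map<String, List<String>> data = entities.get(id);
        if (data == null) return null;
        if (includeFields == null || includeFields.isEmpty()) return data;
        Map<String, List<String>> result = new HashMap<>();
        for (String field : includeFields) {
            if (data.containsKey(field)) result.put(field, data.get(field));
        }
        return result;
    }

    // Search: entities matching more search tokens in the label field rank first.
    public List<String> lookup(String field, List<String> search, Integer limit) {
        int max = limit == null ? 10 : limit; // NULL limit: implementation-chosen default
        Map<String, Integer> scores = new HashMap<>();
        for (String id : entities.keySet()) {
            int score = 0;
            for (String label : entities.get(id).getOrDefault(field, List.of())) {
                for (String token : search) {
                    if (label.toLowerCase().contains(token.toLowerCase())) score++;
                }
            }
            scores.put(id, score);
        }
        List<String> ids = new ArrayList<>(entities.keySet());
        ids.removeIf(id -> scores.get(id) == 0); // search tokens are optional, but require some match
        ids.sort((a, b) -> scores.get(b) - scores.get(a));
        return ids.subList(0, Math.min(max, ids.size()));
    }
}
```

A real implementation would additionally honor the 'includeFields' and 'languages' parameters of lookup; they are omitted here to keep the sketch short.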
The EntityhubLinkingEngine includes EntitySearcher implementations based on the FieldQuery search interface implemented by the Stanbol Entityhub.
Currently the Stanbol Entityhub based implementations are instantiated based on the value of the 'enhancer.engines.linking.entityhub.siteId' property. Users that want to use a different implementation of this interface for linking will need to extend the EntityLinkingEngine and override the #activateEntitySearcher(ComponentContext context, Dictionary<String,Object> properties) method.
LabelTokenizer
The LabelTokenizer interface is used to tokenize labels of Entity suggestions as returned by the EntitySearcher. As the matching process of the EntityLinkingEngine is based on tokens (words), multi-word labels (e.g. "University of Munich") need to be tokenized before they can be matched against the current context in the text.
The LabelTokenizer interface defines only the single tokenize(String label, String language)::String[] method, which gets the label and the language as parameters and returns the tokens as a String array. If the tokenizer is not able to tokenize the label (e.g. because it does not support the language) it MUST return NULL. In this case the EntityLinkingEngine will try to match the label as a single token.
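The null-return contract can be sketched with a minimal whitespace-based tokenizer. The class name and the supported-language set below are illustrative assumptions, not part of the Stanbol API.

```java
// Minimal sketch of the LabelTokenizer contract: a whitespace tokenizer that
// only supports English and German and returns null for any other language.
public class SimpleWhitespaceLabelTokenizer {

    private static final java.util.Set<String> SUPPORTED =
            java.util.Set.of("en", "de"); // illustrative assumption

    public String[] tokenize(String label, String language) {
        if (language == null || !SUPPORTED.contains(language)) {
            return null; // contract: MUST return null for unsupported languages
        }
        return label.trim().split("\\s+");
    }
}
```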
MainLabelTokenizer
As it is very likely that users will want to use multiple LabelTokenizers for different languages, the EntityLinkingEngine comes with a MainLabelTokenizer implementation. It registers itself as LabelTokenizer with the highest possible OSGi 'service.ranking' and tracks all other registered LabelTokenizers.
So if custom LabelTokenizers register themselves as OSGi services, the MainLabelTokenizer can forward requests to them. It does so in the order of their 'service.ranking' values. In addition, LabelTokenizers can use the 'enhancer.engines.entitylinking.labeltokenizer.languages' property to formally specify the languages they support. This property uses the language configuration syntax (e.g. "en,de" would include English and German; "!it,!fr" would specify all languages except Italian and French). If no configuration is provided, "" (all languages) is assumed; this is fine as a default as long as LabelTokenizers correctly return NULL for languages they do not support.
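A possible reading of this language configuration syntax is sketched below: "" matches all languages, a plain entry is an include, and a "!"-prefixed entry is an exclude that switches the config into an "everything except" mode. This is a simplified interpretation inferred from the examples above, not Stanbol's actual LanguageConfiguration implementation.

```java
// Hedged sketch of the language configuration syntax:
// ""        -> all languages
// "en,de"   -> only English and German
// "!it,!fr" -> all languages except Italian and French
public class LanguageConfig {

    public static boolean supports(String config, String lang) {
        if (config == null || config.trim().isEmpty()) return true; // "" = all languages
        boolean hasExclusion = false;
        for (String entry : config.split(",")) {
            entry = entry.trim();
            if (entry.isEmpty()) continue;
            if (entry.equals("!" + lang)) return false; // explicitly excluded
            if (entry.equals(lang)) return true;        // explicitly included
            if (entry.startsWith("!") || entry.equals("*")) hasExclusion = true;
        }
        // exclusion-style configs include everything not explicitly excluded
        return hasExclusion;
    }
}
```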
The MainLabelTokenizer forwards tokenize requests to all available LabelTokenizer implementations that support the given language, sorted by their 'service.ranking', until the first one does NOT return NULL. If no LabelTokenizer was found, or all of them returned NULL, it also returns NULL.
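This delegation strategy can be sketched as follows. Real OSGi service tracking is replaced here by a simple ranked registry, and all names are illustrative; a real registry would also have to handle multiple services with the same ranking.

```java
import java.util.*;

// Sketch of the MainLabelTokenizer delegation: try tokenizers in descending
// 'service.ranking' order and return the first non-null result.
public class DelegatingLabelTokenizer {

    public interface Tokenizer {
        String[] tokenize(String label, String language);
    }

    // tokenizers keyed by ranking, iterated highest-ranking first
    private final TreeMap<Integer, Tokenizer> byRanking =
            new TreeMap<>(Comparator.reverseOrder());

    public void register(int serviceRanking, Tokenizer tokenizer) {
        byRanking.put(serviceRanking, tokenizer);
    }

    public String[] tokenize(String label, String language) {
        for (Tokenizer t : byRanking.values()) {
            String[] tokens = t.tokenize(label, language);
            if (tokens != null) return tokens; // first non-null result wins
        }
        return null; // no tokenizer found, or all returned null
    }
}
```

In this sketch a low-ranked fallback (like the Simple LabelTokenizer with its ranking of -1000) is only consulted when all higher-ranked tokenizers return null.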
The following code snippet shows how to use the MainLabelTokenizer as LabelTokenizer for the EntityLinkingEngine:

```java
@Reference
LabelTokenizer labelTokenizer;
```

This will inject the MainLabelTokenizer as it uses Integer.MAX_VALUE as 'service.ranking'.
```java
@Activate
protected void activate(ComponentContext ctx){
    //within the activate method it can then be used
    //to initialize the NamedEntityLinkingEngine
    NamedEntityLinkingEngine engine = new NamedEntityLinkingEngine(
        engineName,
        entitySearcher, //the searcher might not be available
        textProcessingConfig, linkerConfig, //config
        labelTokenizer); //the MainLabelTokenizer
}
```
Configuring the NamedEntityLinkingEngine like this ensures that all registered LabelTokenizers are considered for tokenizing.
Simple LabelTokenizer
This is the default implementation of a LabelTokenizer and has no external dependencies. It behaves exactly the same as the OpenNLP SimpleTokenizer. It is active by default and configured to process all languages. It uses a 'service.ranking' of '-1000' and will therefore typically be overridden by custom registered implementations.
The main intention of this implementation is to be a reasonable default, ensuring LabelTokenizer support for all languages.
OpenNLP LabelTokenizer
The EntityLinkingEngine also contains an implementation based on the OpenNLP tokenizer API. As the dependencies on OpenNLP and the Stanbol Commons OpenNLP module are optional, this implementation will only be active if the org.apache.stanbol:org.apache.stanbol.commons.opennlp bundle with a version of 0.10.0 or later is active.
This LabelTokenizer supports the configuration of custom OpenNLP tokenizer models for specific languages e.g. "de;model=my-de-tokenizermodel.zip;*" would use a custom model for German and the default models for all other languages.
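The per-language entry format can be sketched as a small parser. The format assumed here (a language code followed by ';'-separated key=value parameters) is inferred from the "de;model=my-de-tokenizermodel.zip" example above; this is not Stanbol's actual configuration parser, and the class name is hypothetical.

```java
// Hedged sketch: parse one per-language tokenizer configuration entry such as
// "de;model=my-de-tokenizermodel.zip" into a map of settings.
public class TokenizerModelConfig {

    public static java.util.Map<String, String> parseEntry(String entry) {
        java.util.Map<String, String> result = new java.util.LinkedHashMap<>();
        String[] parts = entry.split(";");
        result.put("language", parts[0].trim()); // first part: language code (or "*")
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2); // subsequent parts: key=value params
            if (kv.length == 2) {
                result.put(kv[0].trim(), kv[1].trim());
            }
        }
        return result;
    }
}
```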
Internally it uses the OpenNLP service to load tokenizer models for languages. That means tokenizer models are loaded via the DataFileProvider infrastructure. For users this means that custom tokenizer models are loaded from the Stanbol Datafiles directory ({stanbol-working-dir}/stanbol/datafiles).
LinkingStateAware
Added with STANBOL-1070, this interface allows receiving callbacks about the processing state of the entity linking process. The interface defines methods for the start/end of a section as well as the start/end of a token. Both the start and the end methods are passed the active Span as parameter. An instance of this interface can be passed to the constructor of the EntityLinker implementation.
The typical usage of this extension point is as follows:
```java
@Reference
protected LabelTokenizer labelTokenizer;

private TextProcessingConfig textProcessingConfig;
private EntityLinkerConfig linkerConfig;
private EntitySearcher entitySearcher;

@Activate
@SuppressWarnings("unchecked")
protected void activate(ComponentContext ctx) throws ConfigurationException {
    super.activate(ctx);
    Dictionary<String,Object> properties = ctx.getProperties();
    //extract TextProcessing and EntityLinking config from the provided properties
    textProcessingConfig = TextProcessingConfig.createInstance(properties);
    linkerConfig = EntityLinkerConfig.createInstance(properties, prefixService);
    //create/init the entitySearcher
    entitySearcher = new MyEntitySearcher();
    //parse additional properties
}

public void computeEnhancements(ContentItem ci) throws EngineException {
    AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
    String language = NlpEngineHelper.getLanguage(this, ci, true);
    //create an instance of your LinkingStateAware implementation
    LinkingStateAware linkingStateAware; //= new YourImpl(..);
    //create one EntityLinker instance per enhancement request
    EntityLinker entityLinker = new EntityLinker(at, language,
        languageConfig, entitySearcher, linkerConfig,
        labelTokenizer, linkingStateAware);
    //during processing we will receive callbacks to the
    //linkingStateAware instance
    try {
        entityLinker.process();
    } catch (EntitySearcherException e) {
        log.error("Unable to link Entities with " + entityLinker, e);
        throw new EngineException(this, ci,
            "Unable to link Entities with " + entityLinker, e);
    }
}
```
Note that it is also possible to use a single EntityLinker/LinkingStateAware pair to process multiple ContentItems. However, in this case received callbacks need to be filtered based on the AnalysedText that is the context of the Span instances passed to the callback methods.
```java
@Override
public void startToken(Token token) {
    //process based on the context
    AnalysedText at = token.getContext();
    // …
}
```
In addition, such a usage would require the LinkingStateAware implementation to be thread safe.