Configure Apache Stanbol to work with multiple languages
The following languages are supported:
- English
- German
- Danish
- Swedish
- Dutch
- Portuguese
Configuration steps
- Have language labels in your target data and install the index
- Add language models to your Stanbol instance
- Activate the LangIdEnhancementEngine and the KeywordLinkingEngine
- Configure the KeywordLinkingEngine
Install your index
In DBpedia, language labels exist for many entities. If you want to use an index of your own custom vocabulary, first create the index from it and add it to your Stanbol instance: copy {yourindex}.solr.zip into your {stanbol-root}/sling/datafiles directory and install the respective OSGi bundle via the OSGi admin console.
Make sure that this index contains labels in all languages you want to work with and that they are properly indexed.
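The copy step can be scripted. The following is a minimal sketch, not part of Stanbol itself: the function name install_index and both arguments are hypothetical placeholders for your own archive and your {stanbol-root} directory. Installing the OSGi bundle still has to be done in the admin console.

```shell
# Hypothetical helper: copy a Solr index archive into Stanbol's datafiles directory.
# Both arguments are placeholders, not values taken from this guide.
install_index() {
    index_zip="$1"       # path to {yourindex}.solr.zip
    stanbol_root="$2"    # your {stanbol-root} directory
    datafiles="$stanbol_root/sling/datafiles"

    # Stop early if the archive does not exist.
    if [ ! -f "$index_zip" ]; then
        echo "index archive not found: $index_zip" >&2
        return 1
    fi

    # Create the datafiles directory if it is missing, then copy the archive.
    mkdir -p "$datafiles" && cp "$index_zip" "$datafiles/"
}
```

A call would look like: install_index /path/to/myvocab.solr.zip /path/to/stanbol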
Build and add the necessary language bundles
To build the language bundles, go to {stanbol-root}/data/ and run:
mvn clean install -P opennlp
The -P opennlp option activates the Maven profile that builds the OpenNLP models for all languages.
Afterwards, the bundles are available in the folder
{stanbol-root}/data/opennlp/lang/{language}/target
The naming of the bundles is "org.apache.stanbol.data.opennlp.lang.{language}-*.jar".
Add the bundles via the OSGi admin console in the bundles tab. The language bundles will fetch and install the corresponding OpenNLP models for the languages you want to use.
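To check which language bundles the build produced before uploading them, you can list the jars that match the naming scheme above. This is a small sketch; the function name list_language_bundles is hypothetical and its argument stands for your {stanbol-root} directory.

```shell
# Hypothetical helper: print the language bundle jars found under {stanbol-root}.
list_language_bundles() {
    stanbol_root="$1"    # your {stanbol-root} directory
    for jar in "$stanbol_root"/data/opennlp/lang/*/target/org.apache.stanbol.data.opennlp.lang.*.jar; do
        # When nothing matches, the shell leaves the pattern unexpanded; skip it.
        if [ -e "$jar" ]; then
            echo "$jar"
        fi
    done
    return 0
}
```

If the function prints nothing, the build has most likely not been run with the opennlp profile.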
Activate LangID engine and KeywordLinkingEngine
Go to the admin console and deactivate some of the available engines. In particular, deactivate the standard NER engine and the Entity Linking Engines, as they do not support multiple languages. At least two engines need to be activated:
- The Language Identification Engine detects the language of the text you want to enhance and records it as a dc:terms language property.
- The Keyword Linking Engine provides you with TextAnnotations (which select potential parts of your text) as well as with EntityAnnotations (which provide suggestions for links). Be aware that the result (especially the recall) depends heavily on the number of entities you have specified in your target data source.
Configure the KeywordLinkingEngine
In the OSGi admin console, you can find the most relevant configuration options of the Keyword Linking Engine.
- Referenced Site: The ID of the Entityhub Referenced Site holding the Controlled Vocabulary (e.g. a taxonomy or just a set of named entities)
- Label Field: The field used to match Entities with mentions within the parsed text.
- Type Field: The field used to retrieve the types of matched Entities. Values of that field are expected to be URIs.
- Redirect Field: Entities may define redirects to other Entities (e.g. "USA" (http://dbpedia.org/resource/USA) -> "United States" (http://dbpedia.org/resource/United_States)). Values of this field are expected to link to other Entities that are part of the controlled vocabulary.
- Redirect Mode: Defines how to process redirects of Entities mentioned in the parsed content. Three modes to deal with such links are supported: ignore redirects; add values from redirected Entities to the extracted Entity; follow redirects and suggest the redirected Entity instead of the extracted one.
- Min Token Length: The minimum length of Tokens used to lookup Entities within the Controlled Vocabulary. This parameter is ignored in case a POS (Part of Speech) tagger is available for the language of the parsed content.
- Suggestions: The maximal number of suggestions returned for a single mention (org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions).
- Languages to process: An empty text indicates that all languages are processed. Use ',' as separator for languages (e.g. 'en,de' to enhance only English and German texts).
- Default Matching Language: The language used, in addition to the language detected for the analysed text, to search for Entities. Typically this configuration is an empty string, so that labels without any language defined are searched as well. For data sets that add a language to every label (such as DBpedia.org), it might improve results to change this configuration (e.g. to 'en' in the case of DBpedia.org).
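The meaning of the "Languages to process" value can be illustrated with a small shell function. This is only a sketch of the documented behaviour (empty means all languages, ',' separates entries), not the engine's actual implementation; the function name language_enabled is hypothetical, and ignoring spaces around the separator is an assumption of this sketch.

```shell
# Illustration only: mimics the documented meaning of "Languages to process".
# Prints "yes" if a text in the detected language would be processed, else "no".
language_enabled() {
    # Configured value, e.g. "en,de" or "" for all languages.
    # Spaces are dropped (assumption of this sketch, not documented behaviour).
    configured=$(printf '%s' "$1" | tr -d ' ')
    detected="$2"        # language detected for the text, e.g. "en"

    # An empty configuration means that every language is processed.
    if [ -z "$configured" ]; then
        echo yes
        return 0
    fi

    # Wrap both sides in commas so that only complete entries match.
    case ",$configured," in
        *",$detected,"*) echo yes ;;
        *) echo no ;;
    esac
}
```

For example, language_enabled "en,de" "da" reports that a Danish text would be skipped, while an empty first argument accepts any language.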
Read the technical description of this Enhancement Engine to learn about more configuration options.
Results
Depending on your linking target dataset, the engine provides you with enhancement suggestions using labels in your chosen language(s). Note: In the current version of the DBpedia index, the link points to the English version of the resource.
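To look at the results for a text in one of the configured languages, you can post it to the enhancer over HTTP. The sketch below wraps curl in a hypothetical function, enhance_text. The default URL http://localhost:8080/enhancer is an assumption: host, port and endpoint path depend on your Stanbol version and installation, so pass your own URL as the second argument if it differs.

```shell
# Sketch: send plain text to a Stanbol enhancer endpoint and print the RDF result.
# The default URL is an assumption; adjust host, port and path to your instance.
enhance_text() {
    text="$1"                                     # the text to enhance
    url="${2:-http://localhost:8080/enhancer}"    # assumed default endpoint
    curl -s -X POST \
        -H "Accept: text/turtle" \
        -H "Content-Type: text/plain" \
        --data "$text" \
        "$url"
}
```

A call such as enhance_text "Angela Merkel besucht Paris." should then return the enhancements, including the detected language and the suggested entities, provided the engines above are active.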
Examples
This article from October 2011 describes how to deal with multilingual texts.