This project has retired. For details please refer to its Attic page.
Apache Stanbol - Working with Multiple Languages

To understand multilingual support in Apache Stanbol, note that it supports two different workflows for extracting entities from parsed text:

  1. Named Entity Linking: This first uses Named Entity Recognition (NER) to spot Entities and then links the found Named Entities with Entities defined by the controlled vocabulary (e.g. DBpedia.org). For the NER step the NamedEntityExtraction engine, the CELI NER engine (which uses the linguagrid.org service) or the OpenCalais engine can be used. The linking functionality is implemented by the NamedEntityTaggingEngine. Multilingual support depends on the availability of NER models for a language. Note also that separate models are required for each entity type. Typically supported types are Persons, Organizations and Places.
  2. Keyword Linking: Entity label based spotting and linking of Entities as implemented by the KeywordLinkingEngine. Natural Language Processing (NLP) techniques such as Part-of-Speech (POS) processing are used to improve the performance and results of the extraction process but are not an absolute requirement. As extraction only requires a label, this method is also independent of the types of the Entities.

The following languages are supported for NER and can therefore be used for Named Entity Linking:

NOTE: The CELI and OpenCalais engines require users to create an account with the respective services. In addition, analyzed content will be sent to those services!

For the following languages NLP support is available to improve results when using the Keyword Extraction Engine:

Configuration steps

This section describes the typical configuration steps required for multilingual text processing with Apache Stanbol.

  1. Ensure that labels for the {language(s)} are available in the controlled vocabulary: By default, labels with the given language and labels with no defined language will be used for linking.
  2. Add language models to your Stanbol instance: This includes general NLP models, NER models and possibly the configuration of external services such as CELI or OpenCalais.
  3. Configure the Named Entity Linking / Keyword Linking chain(s)
    • ensure language detection support (e.g. by using the Language Identification Engine)
    • decide whether to use (1) Named Entity Linking or (2) Keyword Linking based on the supported/required languages and the supported/present types of Entities in the controlled vocabulary
    • configure the required Enhancement Engines and one or more Enhancement Chains for processing parsed content.
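
The decision between the two workflows can be sketched as a small rule. This is only an illustration of the reasoning above, not code from Stanbol: the function name, the return values and the language sets passed in are hypothetical, and the type names simply mirror the "Persons, Organizations and Places" mentioned earlier.

```python
# Sketch only: a rule of thumb for choosing a workflow.
# The language sets passed in are placeholders supplied by the caller,
# not the actual list of languages supported by Stanbol.

# Entity types for which NER models are typically available (from the text above).
NER_TYPES = {"Person", "Organization", "Place"}


def choose_workflow(language, ner_languages, vocabulary_types):
    """Return "named-entity-linking" or "keyword-linking".

    language         -- language of the content, e.g. "en"
    ner_languages    -- languages for which NER models are installed
    vocabulary_types -- entity types present in the controlled vocabulary
    """
    has_ner_models = language in ner_languages
    types_covered = set(vocabulary_types) <= NER_TYPES
    if has_ner_models and types_covered:
        return "named-entity-linking"
    # Keyword Linking only needs labels, so it works for any language and type.
    return "keyword-linking"


if __name__ == "__main__":
    print(choose_workflow("en", {"en", "de"}, ["Person", "Place"]))
    print(choose_workflow("en", {"en", "de"}, ["Person", "Drug"]))
```

In practice both workflows can be combined in one chain, so the rule is best read as a starting point rather than a strict either/or.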

Install your multilingual controlled vocabulary

If you want to link Entities in a given language you MUST ensure that labels in that language are present in the controlled vocabulary you want to link against. It is also possible to tell Stanbol that labels are valid regardless of the language by adding labels without a language tag.
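
The default label selection described here (labels in the content language plus labels without a language tag) can be illustrated with a few lines of Python. This is a simplified sketch, not the matching code used by Stanbol; the function name and the (text, language) pair representation are assumptions made for the example.

```python
# Sketch only: which labels of an Entity are candidates for linking
# when processing content in a given language.

def usable_labels(labels, language):
    """Select labels usable for linking content in `language`.

    labels   -- iterable of (text, lang) pairs; lang is None or "" when
                the label has no language tag
    language -- language of the processed content, e.g. "it"
    """
    wanted = language.lower()
    return [
        text
        for text, lang in labels
        # Keep untagged labels and labels tagged with the content language.
        if not lang or lang.lower() == wanted
    ]


if __name__ == "__main__":
    labels = [("Paris", "fr"), ("Paris", None), ("Parigi", "it")]
    print(usable_labels(labels, "it"))
```

With the sample data, Italian content can be linked via the untagged label and the Italian label, while the French label is ignored.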

In case you want to link against your own vocabulary you will need to create your own index at this point. If you want to use an already indexed dataset you will need to install it in your Stanbol environment by:

NOTES:

Build and add the necessary language bundles

Users of the full-war or full launcher can skip this step, as all available language bundles are included by default. If you use the stable launcher or a custom-built launcher you will need to provide the required language models manually.

In principle there are two possibilities to add language processing and NER models to your Stanbol instance:

  1. you can use the OSGi bundles: These use artifactIds like org.apache.stanbol.data.opennlp.lang.{language}-.jar and org.apache.stanbol.data.opennlp.ner.{language}-.jar and can be found under {stanbol-root}/data/opennlp/[ner|lang]/{language} in the Apache Stanbol source
  2. you can obtain the OpenNLP language models yourself and copy them to the {stanbol-working-dir}/stanbol/datafiles folder.

While the latter provides more flexibility, it also requires a basic understanding of the OpenNLP models and of the processing workflow of the KeywordLinkingEngine.
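
For the second option, copying models into the datafiles folder can be scripted. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, and which model files a given engine actually expects is not specified here, so check the engine documentation before relying on particular file names.

```python
# Sketch only: copy OpenNLP model files you obtained yourself into
# {stanbol-working-dir}/stanbol/datafiles.
import shutil
from pathlib import Path


def install_models(model_files, stanbol_working_dir):
    """Copy each model file into the datafiles folder and return the targets."""
    datafiles = Path(stanbol_working_dir) / "stanbol" / "datafiles"
    # Create the folder if the instance has not created it yet.
    datafiles.mkdir(parents=True, exist_ok=True)
    installed = []
    for model in map(Path, model_files):
        target = datafiles / model.name
        shutil.copy2(model, target)  # overwrites an existing file of the same name
        installed.append(target)
    return installed
```

A call such as `install_models(["downloads/en-ner-person.bin"], "/opt/stanbol")` would place the file under `/opt/stanbol/stanbol/datafiles`; both the path and the file name are placeholders.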

Configuring Language Identification Support

By default Apache Stanbol uses the Language Identification Engine that is based on the language identification functionality provided by Apache Tika. As an alternative there is also a language identification engine that uses linguagrid.org.

If you configure your own Enhancement Chain it is important to use one of those Engines and to ensure that it processes the content before the other engines referenced in this document.

Configure Named Entity Linking

To use Named Entity Linking, users need to add at least two Enhancement Engines to the current Enhancement Chain:

  1. NER Engine: possibilities include
    • NamedEntityExtractionEnhancementEngine - default name "ner"
    • CeliNamedEntityExtractionEnhancementEngine - default name "celiNer": To use this Engine you need to configure a "License Key" or to activate the usage of the Test Account. After providing this configuration you will need to manually disable/enable this engine to bring it from the "unsatisfied" to the "active" state.
    • OpenCalais - default name "opencalais": To use this Engine you need to configure your OpenCalais license key. You should also activate the NER only mode if you use it for this purpose. After providing this configuration you will need to manually disable/enable this engine to bring it from the "unsatisfied" to the "active" state.
  2. Entity Linking: possibilities include
    • Named Entity Tagging Engine: This engine allows you to create multiple instances for different controlled vocabularies. The default configuration of the Stanbol Launchers includes an instance that is configured to link Entities from DBpedia.org. To link to your own datasets you will need to create/configure your own instances of this engine by using the Configuration Tab of the Apache Felix WebConsole - http://{host}:{port}/system/console/configMgr.
    • Geonames Enhancement Engine: Uses the web services provided by geonames.org to link extracted Places. To use this Engine you need to configure your geonames "License Key" or to activate the anonymous geonames.org service. After providing this configuration you will need to manually disable/enable this engine to bring it from "unsatisfied" to the "active" state.

It is important to note that one can include multiple NER and Entity Linking Engines in a single Enhancement Chain. A typical example would be:

Configure Keyword Linking

To use Keyword Linking one needs only to create/configure an instance of the KeywordLinkingEngine and add it to the current Enhancement Chain.

The following describes the different options provided by the KeywordLinkingEngine when configured via the Configuration Tab of the Apache Felix WebConsole - http://{host}:{port}/system/console/configMgr.

Read the technical description of this Enhancement Engine to learn about more configuration options.

Note that an Enhancement Chain may also contain multiple instances of the KeywordLinkingEngine. It is also possible to mix Named Entity Linking and Keyword Linking in a single chain, e.g. to link Persons, Organizations and Places of DBpedia as well as any kind of Entities defined in your custom vocabulary. Such an Enhancement Chain could look like:

Configure the Enhancement Chain

The Apache Stanbol Enhancer supports multiple Enhancement Chains. These chains allow users to configure which Enhancement Engines are used, and in which order, by the Stanbol Enhancer to process content posted to http://{stanbol-instance}/enhancer/chain/{chain-name}.
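
As an illustration of posting content to a chain, the sketch below builds (but does not send) an HTTP POST request for the URL pattern given above using Python's standard library. The helper name, the host, the chain name and the chosen `Accept` media type are assumptions for the example; adjust them to your instance.

```python
# Sketch only: build a POST request for
# http://{stanbol-instance}/enhancer/chain/{chain-name}
import urllib.parse
import urllib.request


def build_enhancer_request(stanbol_instance, chain_name, text,
                           accept="application/rdf+xml"):
    """Return a urllib Request that posts plain text to the given chain."""
    url = "http://{}/enhancer/chain/{}".format(
        stanbol_instance, urllib.parse.quote(chain_name, safe="")
    )
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": accept,  # assumed RDF serialization for the results
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_enhancer_request(
        "localhost:8080", "my-chain", "Paris is the capital of France."
    )
    print(request.get_method(), request.full_url)
    # To actually send it against a running instance:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the example runnable without a Stanbol instance.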

Enhancement Chains are created/configured by using the Configuration Tab of the Apache Felix WebConsole - http://{host}:{port}/system/console/configMgr. Users can choose one of the following three Chain implementations: "Weighted Chain", "List Chain" or "Graph Chain". While all three can be used for the chains referenced by this usage scenario, the "Weighted Chain" is typically the easiest to use as it automatically sorts the Engines regardless of the configuration order provided by the user.
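
The idea of automatic sorting can be sketched as follows. This is a conceptual illustration only, not the Weighted Chain implementation: the phase names and numbers are hypothetical and do not correspond to Stanbol's actual ordering values, and the engine names "langid" and "dbpediaLinking" are placeholders ("ner" and "celiNer" are the default names mentioned above).

```python
# Sketch only: order engines by processing phase, independent of the
# order in which the user listed them.

# Hypothetical phases; a lower number runs earlier.
PHASE = {
    "language-identification": 0,
    "ner": 1,
    "entity-linking": 2,
}


def sort_chain(engines):
    """Return engine names sorted by phase.

    engines -- list of (engine_name, phase_name) pairs in any order.
    sorted() is stable, so engines of the same phase keep their
    configured relative order.
    """
    return [name for name, phase in sorted(engines, key=lambda e: PHASE[e[1]])]


if __name__ == "__main__":
    configured = [
        ("dbpediaLinking", "entity-linking"),
        ("ner", "ner"),
        ("langid", "language-identification"),
        ("celiNer", "ner"),
    ]
    print(sort_chain(configured))
```

With a "List Chain", by contrast, the user would be responsible for writing the engines in a working order, for example with language identification first.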

An Enhancement Chain configuration consists of three parameters:

See also the documentation for details on enhancement chains.

Results

Extracted Entities will be formally described in the RDF enhancement results of the Stanbol Enhancer by

The following figure provides an overview of the knowledge structure.

Linked Entity Representation

Examples

This article from October 2011 describes how to deal with multilingual texts.