Apache Stanbol OpenNLP integration

OpenNLP is fully integrated with Apache Stanbol and is included in the default launcher configuration. The Full launcher includes all available language models, while the Stable launcher includes only the models for English.

Configuration and Customization

OpenNLP uses model files to provide the statistical models for different languages. Apache Stanbol loads such models via the DataFileProvider infrastructure. Models can therefore be provided either by

Installing one of the org.apache.stanbol:org.apache.stanbol.data.opennlp** bundles or by
Copying OpenNLP model files to the Stanbol Datafiles directory (by default {working-dir}/stanbol/datafiles)
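The second option can be scripted. The following is a minimal sketch, assuming the default Datafiles location {working-dir}/stanbol/datafiles; the helper name install_model and its parameters are hypothetical and not part of Stanbol.

```python
import shutil
from pathlib import Path


def install_model(model_file, working_dir):
    """Copy an OpenNLP model file into {working-dir}/stanbol/datafiles.

    Hypothetical helper: it only performs the file copy described above
    and assumes the default Datafiles directory is in use.
    """
    source = Path(model_file)
    if not source.is_file():
        raise FileNotFoundError(f"model file not found: {source}")

    datafiles = Path(working_dir) / "stanbol" / "datafiles"
    # Create the directory if the launcher has not created it yet.
    datafiles.mkdir(parents=True, exist_ok=True)

    target = datafiles / source.name
    shutil.copy2(source, target)
    return target
```

For example, install_model("de-token.bin", "/opt/stanbol") would place the file at /opt/stanbol/stanbol/datafiles/de-token.bin (both paths are placeholders). If the Datafiles directory has been configured to a non-default location, the target path needs to be adjusted accordingly.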

Stanbol expects model files to follow these naming schemes:

{lang}-sent.bin for sentence detection models
{lang}-token.bin for tokenizer models. If no tokenizer model is present for a language, the SimpleTokenizer is used as a fallback.
{lang}-pos-perceptron.bin or {lang}-pos-maxent.bin for POS tagging models. Perceptron models are preferred if present.
{lang}-chunker.bin for chunker models

If models use different names, the model parameter of the corresponding OpenNLP EnhancementEngine must be used to configure the correct model name. See the documentation of the individual engines for details.
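The naming schemes and the preference for perceptron models can be illustrated with a small lookup. This is only a sketch of the rules as described here, not Stanbol's actual lookup code; the function names expected_model_names and resolve_models are hypothetical.

```python
def expected_model_names(lang):
    """Return the candidate model file names per task, most preferred first."""
    return {
        "sentence": [f"{lang}-sent.bin"],
        "token": [f"{lang}-token.bin"],
        # Perceptron models are preferred over maxent models if present.
        "pos": [f"{lang}-pos-perceptron.bin", f"{lang}-pos-maxent.bin"],
        "chunker": [f"{lang}-chunker.bin"],
    }


def resolve_models(lang, available_files):
    """Pick the first available candidate for each task.

    A value of None means no model following the default naming scheme
    was found. For "token" this corresponds to the SimpleTokenizer
    fallback; for other tasks a custom name would have to be set via
    the engine's model parameter.
    """
    available = set(available_files)
    resolved = {}
    for task, candidates in expected_model_names(lang).items():
        resolved[task] = next((name for name in candidates if name in available), None)
    return resolved
```

Given the files en-sent.bin, en-pos-maxent.bin and en-pos-perceptron.bin, resolve_models("en", ...) selects en-pos-perceptron.bin for POS tagging and reports no tokenizer or chunker model.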

Stanbol Enhancer configuration

OpenNLP based NLP Enhancement Engines

OpenNLP Sentence Detection
OpenNLP Tokenizer
OpenNLP POS Tagger
OpenNLP Chunker
OpenNLP NER, as well as a Custom NER Engine that allows the use of NER models for entity types other than Person, Organization and Place.

Enhancement Chain configurations

OpenNLP supports both the NER based Named Entity Linking and the POS tagging based Entity Linking processing chains.

Users who want to process texts using Named Entity Recognition will end up with Enhancement Chain configurations similar to

tika;optional
langdetect
opennlp-token
opennlp-sentence
opennlp-ner
{your-named-entity-linking}

where {your-named-entity-linking} refers to an instance of the NamedEntityLinkingEngine configured for the user's controlled vocabulary. Multiple NamedEntityLinkingEngine configurations can be used in the same chain. Users who want to use NER models for types other than Persons, Organizations or Places need to use the CustomNerModelEngine instead of the opennlp-ner engine.

Note that the opennlp-token and opennlp-sentence engines are optional, as the opennlp-ner engine performs tokenization and sentence detection itself if tokens and sentences are not yet available. Including those engines explicitly in the chain is only required when custom configurations for tokenization or sentence detection (e.g. custom OpenNLP models) need to be applied.
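To make the structure of such a chain listing explicit, the sketch below parses entries like tika;optional into an engine name and a flag. It assumes that parameters follow the engine name separated by semicolons and that optional marks an engine the chain can run without; parse_chain is a hypothetical helper, not a Stanbol API.

```python
def parse_chain(entries):
    """Parse chain entries such as "tika;optional" into (name, optional) pairs.

    Assumption: everything after the first ';' is a parameter, and the
    parameter "optional" marks an engine that the chain does not require.
    """
    engines = []
    for raw in entries:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines in the listing
        name, *params = [part.strip() for part in entry.split(";")]
        engines.append((name, "optional" in params))
    return engines


# The NER based chain from above; the last entry is a placeholder name.
NER_CHAIN = [
    "tika;optional",
    "langdetect",
    "opennlp-token",
    "opennlp-sentence",
    "opennlp-ner",
    "my-named-entity-linking",
]
```

Calling parse_chain(NER_CHAIN) yields ("tika", True) for the first entry and a False flag for all others.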

A typical Entity Linking enhancement chain based on OpenNLP includes the following engines

tika;optional
langdetect
opennlp-token
opennlp-sentence
opennlp-pos
opennlp-chunker
{your-entitylinking}

where {your-entitylinking} will typically be an EntityhubLinkingEngine configured for the user's controlled vocabulary. Users who need to link against multiple controlled vocabularies can add multiple EntityhubLinkingEngines to the enhancement chain.

Note that the opennlp-token and opennlp-sentence engines are optional, as the opennlp-pos engine performs tokenization and sentence detection itself if tokens and sentences are not yet available. Including those engines explicitly in the chain is only required when custom configurations for tokenization or sentence detection (e.g. custom OpenNLP models) need to be applied.
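The rules for the POS tagging based chain (explicit tokenization only when needed, one or more linking engines at the end) can be sketched as a small builder. The function build_pos_linking_chain and the linking engine names in the example are hypothetical; only the opennlp-*, tika and langdetect names come from the listing above.

```python
def build_pos_linking_chain(linking_engines, explicit_tokenization=False):
    """Assemble the engine list for a POS tagging based Entity Linking chain.

    linking_engines: names of the configured linking engines
        (e.g. one EntityhubLinkingEngine per controlled vocabulary).
    explicit_tokenization: include opennlp-token and opennlp-sentence,
        which is only needed for custom tokenizer or sentence models.
    """
    linking_engines = list(linking_engines)
    if not linking_engines:
        raise ValueError("at least one linking engine is required")

    chain = ["tika;optional", "langdetect"]
    if explicit_tokenization:
        chain += ["opennlp-token", "opennlp-sentence"]
    chain += ["opennlp-pos", "opennlp-chunker"]
    chain += linking_engines
    return chain
```

For instance, build_pos_linking_chain(["vocab-a-linking", "vocab-b-linking"]) produces a chain without the explicit tokenizer and sentence engines and with both linking engines appended after opennlp-chunker.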
