Apache Stanbol OpenNLP integration
OpenNLP is fully integrated with Apache Stanbol and is included in the default launcher configurations. The Full launcher includes all available language models, while the Stable launcher includes only the models for English.
Configuration and Customization
OpenNLP uses model files to provide the statistical models for different languages. Apache Stanbol loads such models via the DataFileProvider infrastructure. Models can be provided either by
- Installing one of the org.apache.stanbol:org.apache.stanbol.data.opennlp** bundles, or by
- Copying OpenNLP model files to the Stanbol Datafiles directory (by default {working-dir}/stanbol/datafiles)
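The second option, copying a model file into the Datafiles directory, can be sketched as follows. This is only an illustration: the helper name install_model is hypothetical, and the directory layout is the default named in the text.

```python
import shutil
from pathlib import Path


def install_model(model_file, working_dir):
    """Copy an OpenNLP model file into the Stanbol Datafiles directory.

    Hypothetical helper; assumes the default layout
    {working-dir}/stanbol/datafiles described in the text.
    """
    datafiles = Path(working_dir) / "stanbol" / "datafiles"
    # Create the directory if the launcher has not created it yet.
    datafiles.mkdir(parents=True, exist_ok=True)
    target = datafiles / Path(model_file).name
    # The file name is preserved so that it still matches the naming scheme.
    shutil.copy2(model_file, target)
    return target
```

A call such as install_model("de-token.bin", "/opt/stanbol") would place the file at /opt/stanbol/stanbol/datafiles/de-token.bin; both paths are placeholders.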
Stanbol assumes that models follow these naming schemes:
- {lang}-sent.bin for sentence detection models
- {lang}-token.bin for tokenizer models. If no tokenizer model is present for a language, then the SimpleTokenizer is used as a fallback.
- {lang}-pos-perceptron.bin or {lang}-pos-maxent.bin for POS tagging models. Perceptron models are preferred if present.
- {lang}-chunker.bin for chunker models
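The naming scheme can be illustrated with a small lookup function. This is a sketch of the convention only, not Stanbol's actual implementation; the function name resolve_models and the returned dictionary keys are assumptions made for the example.

```python
from typing import Dict, Iterable, Optional


def resolve_models(lang: str, available: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each processing step to the model file expected for `lang`.

    Illustrative only: mirrors the naming scheme described in the text.
    """
    files = set(available)

    def first_present(*candidates: str) -> Optional[str]:
        # Return the first candidate that exists, or None if none does.
        return next((name for name in candidates if name in files), None)

    return {
        "sentence": first_present(f"{lang}-sent.bin"),
        # None here means the SimpleTokenizer fallback would apply.
        "tokenizer": first_present(f"{lang}-token.bin"),
        # The perceptron model is listed first because it is preferred.
        "pos": first_present(f"{lang}-pos-perceptron.bin", f"{lang}-pos-maxent.bin"),
        "chunker": first_present(f"{lang}-chunker.bin"),
    }


models = resolve_models("en", ["en-sent.bin", "en-pos-maxent.bin", "en-pos-perceptron.bin"])
print(models["pos"])        # en-pos-perceptron.bin
print(models["tokenizer"])  # None
```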
If models use different names, the model parameter of the corresponding OpenNLP EnhancementEngine must be used to configure the correct model name. See the documentation of the individual engines for details.
Stanbol Enhancer configuration
OpenNLP based NLP Enhancement Engines
- OpenNLP Sentence Detection
- OpenNLP Tokenizer
- OpenNLP POS Tagger
- OpenNLP Chunker
- OpenNLP NER, as well as a Custom NER Engine that allows the use of NER models for entity types other than Person, Organization and Places.
Enhancement Chain configurations
OpenNLP supports both the NER-based Named Entity Linking and the POS-tagging-based Entity Linking processing chains.
Users that want to process texts by using Named Entity Recognition will end up using Enhancement Chain configurations similar to
tika;optional langdetect opennlp-token opennlp-sentence opennlp-ner {your-named-entity-linking}
where {your-named-entity-linking} refers to an instance of the NamedEntityLinkingEngine configured for the user's controlled vocabulary. Users can also use multiple NamedEntityLinkingEngine configurations in the same chain. Users who want to use NER models for types other than Persons, Organizations or Places need to use the CustomNerModelEngine instead of the opennlp-ner engine.
Note that the use of the opennlp-token and opennlp-sentence engines is optional, as the opennlp-ner engine will perform those steps itself if tokens and sentences are not yet available. Including those engines explicitly in the chain is only required when custom configurations for the tokenizer and sentence detection engines (e.g. custom OpenNLP models) need to be applied.
A typical Entity Linking enhancement chain based on OpenNLP includes the following engines
tika;optional langdetect opennlp-token opennlp-sentence opennlp-pos opennlp-chunker {your-entitylinking}
where {your-entitylinking} will typically be an EntityhubLinkingEngine configured for the user's controlled vocabulary. Users who need to link against multiple controlled vocabularies can add multiple EntityhubLinkingEngines to the enhancement chain.
Note that the use of the opennlp-token and opennlp-sentence engines is optional, as the opennlp-pos engine will perform those steps itself if tokens and sentences are not yet available. Including those engines explicitly in the chain is only required when custom configurations for the tokenizer and sentence detection engines (e.g. custom OpenNLP models) need to be applied.
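The two chain layouts described in this section can be summarized with a small helper that assembles the engine list. It is a sketch for illustration only: build_chain and its parameters are hypothetical, and the linking engine names passed in stand for whatever engines have been configured for the user's vocabulary.

```python
def build_chain(linking_engines, mode="pos", explicit_preprocessing=True):
    """Assemble the engine names of an OpenNLP based enhancement chain.

    Hypothetical helper. mode="ner" yields the Named Entity Linking
    layout, mode="pos" the POS tagging based Entity Linking layout.
    """
    chain = ["tika;optional", "langdetect"]
    if explicit_preprocessing:
        # Only needed when custom tokenizer or sentence models are configured.
        chain += ["opennlp-token", "opennlp-sentence"]
    if mode == "ner":
        chain.append("opennlp-ner")
    elif mode == "pos":
        chain += ["opennlp-pos", "opennlp-chunker"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # One or more linking engines may follow the NLP engines.
    chain += list(linking_engines)
    return chain


print(" ".join(build_chain(["my-linking"], mode="ner")))
# tika;optional langdetect opennlp-token opennlp-sentence opennlp-ner my-linking
```

Passing explicit_preprocessing=False drops opennlp-token and opennlp-sentence, which matches the note that the NER and POS engines perform those steps themselves when needed.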