OpenNLP Tokenizer Engine
The OpenNLP Tokenizer Engine adds Tokens to the AnalyzedText content part. If this content part is not yet present, it is added to the ContentItem.
Consumed information
- Language (required): The language of the text needs to be available. It is read from the metadata of the ContentItem as specified by STANBOL-613. Effectively this means that a Stanbol Language Detection engine needs to be executed before the OpenNLP Tokenizer Engine.
- Sentences (optional): In case Sentences are available in the AnalyzedText content part, the tokenization of the text is done sentence by sentence (see the sketch after this list). Otherwise the whole text is tokenized at once.
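The following is a minimal sketch of what sentence-wise tokenization amounts to, using only the standard OpenNLP API rather than the engine's actual code. The model file names 'de-sent.bin' and 'de-token.bin' are assumed to be available on the local file system; within Stanbol, models are resolved via the DataFile provider (see the model parameter section below).

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class SentenceWiseTokenization {

        public static void main(String[] args) throws Exception {
            String text = "Das ist ein Satz. Das ist ein zweiter Satz.";
            // hypothetical model locations; the engine resolves models via the
            // Stanbol DataFile provider instead of reading them directly
            try (InputStream sentIn = new FileInputStream("de-sent.bin");
                 InputStream tokIn = new FileInputStream("de-token.bin")) {
                SentenceDetectorME sentences = new SentenceDetectorME(new SentenceModel(sentIn));
                Tokenizer tokenizer = new TokenizerME(new TokenizerModel(tokIn));
                // Sentences available: tokenize the text sentence by sentence
                for (Span sentence : sentences.sentPosDetect(text)) {
                    String sentText = sentence.getCoveredText(text).toString();
                    for (Span token : tokenizer.tokenizePos(sentText)) {
                        // token offsets are relative to the sentence; absolute
                        // offsets are obtained by adding sentence.getStart()
                        System.out.println((sentence.getStart() + token.getStart())
                                + ".." + (sentence.getStart() + token.getEnd())
                                + ": " + token.getCoveredText(sentText));
                    }
                }
                // No sentences available: tokenize the whole text at once
                // Span[] tokens = tokenizer.tokenizePos(text);
            }
        }
    }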
Configuration
The OpenNLP Tokenizer engine provides a default service instance (configuration policy is optional). This instance processes all languages. Language-specific tokenizer models are used if available; for other languages the OpenNLP SimpleTokenizer is used. This Engine instance uses the name 'opennlp-token' and has a service ranking of '-100'.
In addition to the default configuration properties, the engine name (stanbol.enhancer.engine.name) and the ranking (service.ranking), the engine allows users to configure the processed languages (org.apache.stanbol.enhancer.token.languages) and a parameter to specify the name of the tokenizer model used for a language.
1. Processed Language Configuration:
For the configuration of the processed languages the following syntax is used:
de en
This would configure the Engine to only process German and English texts. It is also possible to explicitly exclude languages:
!fr !it *
This specifies that all languages other than French and Italian are tokenized.
Values can be provided as an Array or Vector using the ["elem1","elem2",...] syntax defined by OSGi ".config" files. As a fallback, ',' separated Strings are also supported.
The following example combines the two examples above into a single configuration:
org.apache.stanbol.enhancer.token.languages=["!fr","!it","de","en","*"]
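As an illustration of how such a value can be combined with the other configuration properties, the following is a hypothetical OSGi ".config" file snippet for the engine. Only the property names documented above are used; the concrete values are examples.

    stanbol.enhancer.engine.name="opennlp-token"
    org.apache.stanbol.enhancer.token.languages=["!fr","!it","de","en","*"]

The service.ranking property could be set in the same way if a ranking other than the default of '-100' is needed.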
2. Tokenizer model parameter
The OpenNLP Tokenizer engine supports the 'model' parameter to explicitly specify the name of the Tokenizer model used for a language. Tokenizer models are loaded via the Stanbol DataFile provider infrastructure. That means that models can be loaded from the {stanbol-working-dir}/stanbol/datafiles folder.
The syntax for parameters is as follows:
{language};{param-name}={param-value}
So to use "my-de-tokenizer-model.zip" for tokenizing German texts, one can use a configuration like the following:
de;model=my-de-tokenizer-model.zip *
To configure that the SimpleTokenizer should be used for a given language, the 'model' parameter needs to be set to 'SIMPLE', as shown in the following example:
de;model=SIMPLE *
By default, OpenNLP Tokenizer models are loaded using the name '{lang}-token.bin'. To use models with other names, users need to set the 'model' parameter as described above.
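To illustrate the model lookup described above, here is a rough, hypothetical sketch in plain Java of how a tokenizer could be selected for a language: an explicitly configured model name, the default '{lang}-token.bin' name, or the SimpleTokenizer as fallback. The real engine resolves models via the DataFile provider infrastructure rather than reading files directly; only standard OpenNLP classes are used here.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TokenizerSelection {

        // Selects a tokenizer for a language: an explicitly configured model
        // name, the default '{lang}-token.bin' name, or the SimpleTokenizer.
        // The datafiles directory stands in for the DataFile provider.
        static Tokenizer getTokenizer(File datafilesDir, String lang, String modelParam) throws Exception {
            if ("SIMPLE".equals(modelParam)) {      // model=SIMPLE forces the SimpleTokenizer
                return SimpleTokenizer.INSTANCE;
            }
            String modelName = modelParam != null ? modelParam : lang + "-token.bin";
            File modelFile = new File(datafilesDir, modelName);
            if (!modelFile.isFile()) {              // no model available for this language
                return SimpleTokenizer.INSTANCE;
            }
            try (InputStream in = new FileInputStream(modelFile)) {
                return new TokenizerME(new TokenizerModel(in));
            }
        }

        public static void main(String[] args) throws Exception {
            File datafiles = new File("stanbol/datafiles"); // {stanbol-working-dir}/stanbol/datafiles
            Tokenizer german = getTokenizer(datafiles, "de", "my-de-tokenizer-model.zip");
            for (String token : german.tokenize("Das ist ein Testsatz.")) {
                System.out.println(token);
            }
        }
    }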