Stanbol Enhancer Natural Language Processing Support
since version 0.10.0 with STANBOL-733
Overview:
This section covers the following topics:
- Stanbol Natural Language Processing: Short introduction to NLP techniques used by the Stanbol Enhancer
- The NLP processing API: Information about the Java API of the NLP processing framework, including information on
  - how to implement an NLP Enhancement Engine and
  - how to integrate third-party NLP frameworks as a RESTful NLP Analysis Service and RESTful Language Identification Service
- Lists of supported NLP frameworks and languages
Additional information can be found in the usage scenario about working with multiple languages.
Stanbol Natural Language Processing
The natural language processing module of the Stanbol Enhancer supports the usage of the following NLP processing techniques:
- Language Detection: As all the following NLP processing techniques are highly specific to the language of the text, it is very important to correctly detect the language of the analyzed text.
- Sentence Detection: Any Stanbol Enhancer chain that uses NLP requires the detection and extraction of sentences from the analyzed text. Sentences are typically used as 'processing units' in Stanbol. If no sentence detection is available for a language, Stanbol will typically process the text as if it were a single sentence.
- Word Tokenization: The detection of single words is required by the Stanbol Enhancer to process text. While this is trivial for most languages, it is a rather complex task for some East Asian languages, e.g. Chinese, Japanese and Korean. If not otherwise configured, Stanbol will use whitespace to tokenize words.
- Part of Speech (POS) Tagging: This refers to the annotation of words with their lexical category. For entity extraction and linking, words with the category noun and the sub-category proper noun are of special interest. For POS tagging, Stanbol supports both string tags and ontological concepts as defined by the OLIA ontology.
- Chunking: This refers to the ability to detect groups of words that belong together. Often tools assign a type to such groups. For example, a noun phrase detection refers to the extraction of chunks around a noun. This functionality helps in the detection of multi-word entities (e.g. the White House), but it is also interesting for users that want to collect information about adjectives used in combination with nouns (e.g. nice holiday, beautiful city, ...)
- Named Entity Recognition (NER): The detection of entities in an analyzed text. Such entities can consist of multiple words and typically have an assigned type. Typical detectable types include persons, organizations, and places. However, most frameworks allow users to train models for additional domain-specific types.
- Lemmatization: Words in a text are often not in the form in which they appear in controlled vocabularies (incl. dictionaries). This might result in situations where entities are not correctly recognized in the text, because the word found does not match the label in the vocabulary. Lemmatization helps with that, as it provides the base form of a word, known as the lemma.
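The whitespace fallback mentioned for word tokenization can be illustrated with a minimal sketch. It is not the tokenizer used by Stanbol; the class `WhitespaceTokenizerSketch` and the `Token` record are hypothetical names chosen for this example. It only shows the idea of splitting at whitespace while keeping the character offsets that later processing steps need.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenizerSketch {

    /** A token with its character offsets in the analysed text (hypothetical type). */
    public record Token(int start, int end, String text) {}

    /** Splits the text at whitespace and records start/end offsets for every token. */
    public static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "currently not inside a token"
        for (int i = 0; i <= text.length(); i++) {
            // The end of the text is treated like a whitespace boundary.
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (boundary && start >= 0) {
                tokens.add(new Token(start, i, text.substring(start, i)));
                start = -1;
            } else if (!boundary && start < 0) {
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("Stanbol enhances  content")) {
            System.out.println(token);
        }
    }
}
```

Such a tokenizer produces no usable tokens for text written without spaces between words, which is why dedicated tokenizers are needed for languages such as Chinese or Japanese.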
Based on those techniques, Stanbol supports two text enhancement processes, described in the following two subsections.
Named Entity Linking
This chain is based on named entity recognition (NER) and links the recognized entities with controlled vocabularies. A typical enhancement chain contains the following types of engines:
- Language Detection (required): The language of the text is needed to select the correct NLP components for the following processing steps.
- Sentence Detection (optional): If sentences are detected, the processing of the later steps is done sentence by sentence instead of the whole text at once. This improves performance and might also improve results.
- Word Tokenization (required): The detection of named entities is based on processed tokens.
- Named Entity Recognition (required): The recognition of entities mentioned in the text.
- Named Entity Linking (optional): Links entities recognized in the text with entities defined in a controlled vocabulary.
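The last two steps of this chain can be sketched as follows. This is an illustration under stated assumptions, not Stanbol code: `NamedEntity`, the toy `VOCABULARY` and its `example.org` URIs are all made up, and the case-insensitive exact label match is a simplification. Because linking is optional, an entity that has no match in the vocabulary is still kept as a recognized entity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class NamedEntityLinkingSketch {

    /** A named entity as produced by an NER step (hypothetical type). */
    public record NamedEntity(String mention, String type) {}

    /** Toy controlled vocabulary: lower-cased label -> entity URI (made-up data). */
    static final Map<String, String> VOCABULARY = Map.of(
            "paris", "http://example.org/entity/Paris",
            "bob marley", "http://example.org/entity/Bob_Marley");

    /** Looks up the mention of a recognized entity in the vocabulary. */
    public static Optional<String> link(NamedEntity entity) {
        String key = entity.mention().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(VOCABULARY.get(key));
    }

    /** Keeps every recognized entity; the link is empty if nothing matched. */
    public static Map<NamedEntity, Optional<String>> linkAll(List<NamedEntity> entities) {
        Map<NamedEntity, Optional<String>> result = new LinkedHashMap<>();
        for (NamedEntity entity : entities) {
            result.put(entity, link(entity));
        }
        return result;
    }

    public static void main(String[] args) {
        List<NamedEntity> recognized = List.of(
                new NamedEntity("Bob Marley", "Person"),
                new NamedEntity("Berlin", "Place"));
        linkAll(recognized).forEach((entity, uri) ->
                System.out.println(entity.mention() + " -> " + uri.orElse("(not linked)")));
    }
}
```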
Entity Linking
This chain is based on part of speech, chunking and lemmatization analysis. It uses those results to look up words in a configured controlled vocabulary. A typical enhancement chain contains the following types of engines:
- Language Detection (required): The language of the text is needed to select the correct NLP components for the following processing steps.
- Sentence Detection (optional): If sentences are detected, the processing of the later steps is done sentence by sentence instead of the whole text at once. This improves performance and might also improve results.
- Word Tokenization (required): Entity linking is based on processed tokens.
- Part of Speech (optional): The POS tag of a word is used to decide whether it should be linked with the vocabulary. The linked lexical categories are configurable, but typically only proper nouns or all nouns are linked.
- Noun Phrase Detection (optional): If chunking of nouns is supported, this information is used to improve the linking of multi-word entities. For example, two common nouns within the same noun phrase are treated as a proper noun.
- Lemmatization (optional): If configured, the lemma can be used instead of the word as mentioned in the text for linking against the controlled vocabulary.
- Entity Linking (required): Entity linking consumes all the above NLP processing results and uses them to link entities contained in the configured controlled vocabulary with words in the text. This process requires (as a minimum) a correct tokenization of the text. It is considerably improved by POS annotations of proper nouns and nouns. Chunking and lemmatization may further improve results, but their influence on the quality of results is not as large as that of POS tagging.
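How POS tags and lemmas influence the lookup can be sketched in a few lines. The following is a simplified illustration, not the Stanbol entity linking engine: the `Token` record, the `Pos` enum, the toy `VOCABULARY` and the exact-match lookup are assumptions made for this example, and noun phrase handling is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class EntityLinkingSketch {

    /** Lexical categories used by this sketch (a strong simplification). */
    public enum Pos { NOUN, PROPER_NOUN, VERB, OTHER }

    /** A token with optional NLP results; the lemma may be null (hypothetical type). */
    public record Token(String text, Pos pos, String lemma) {}

    /** Toy vocabulary: lower-cased label -> entity URI (made-up data). */
    static final Map<String, String> VOCABULARY = Map.of(
            "city", "http://example.org/concept/City",
            "vienna", "http://example.org/entity/Vienna");

    /** Links tokens whose category is in linkedCategories, preferring the lemma. */
    public static List<String> link(List<Token> tokens, Set<Pos> linkedCategories) {
        List<String> links = new ArrayList<>();
        for (Token token : tokens) {
            if (!linkedCategories.contains(token.pos())) {
                continue; // categories that are not configured are never looked up
            }
            // Use the lemma if one is available, otherwise the word as written.
            String key = token.lemma() != null ? token.lemma() : token.text();
            String uri = VOCABULARY.get(key.toLowerCase(Locale.ROOT));
            if (uri != null) {
                links.add(token.text() + " -> " + uri);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("Vienna", Pos.PROPER_NOUN, null),
                new Token("has", Pos.VERB, "have"),
                new Token("cities", Pos.NOUN, "city"));
        System.out.println("Proper nouns only:");
        link(tokens, Set.of(Pos.PROPER_NOUN)).forEach(System.out::println);
        System.out.println("All nouns:");
        link(tokens, Set.of(Pos.PROPER_NOUN, Pos.NOUN)).forEach(System.out::println);
    }
}
```

In this sketch the plural "cities" only matches the vocabulary label "city" because the lemma is used for the lookup, which mirrors the motivation for lemmatization given above.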
Additional information on how to configure Stanbol in multilingual environments is given in the usage scenario on working with multiple languages.
NLP processing API
The intention of the Stanbol NLP processing API is to efficiently handle word-level NLP processing annotations, which was not possible using the RDF metadata of the ContentItem. Instead of RDF, the NLP processing API defines a Java API that consists of the following two main parts:
- Analysed Text: A data structure that represents parts of the analyzed text such as tokens, chunks, sentences and the analysed text itself. All such spans represent parts of the text and are sorted by their natural order in a NavigableMap. The AnalysedText instance is added to the ContentItem as a ContentPart and is therefore passed between enhancement engines. Every span of the AnalysedText can be annotated with Annotations.
- NLP Annotations: The Stanbol NLP processing module defines ontology-aligned annotation models for typical NLP processing results such as part of speech tagging, phrase detection, named entity recognition, full morphological analysis, and sentiment tags. Those annotations can be used to annotate a Span contained in the AnalysedText.
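The idea of keeping spans sorted in a NavigableMap can be illustrated with plain Java collections. The sketch below does not use the Stanbol API: the `Span` record, the `SpanType` enum, the string-based annotations and the ordering (start offset first, then longer spans before shorter ones, then span type) are assumptions chosen so that an enclosing span always sorts before the spans it contains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SpanOrderSketch {

    public enum SpanType { TEXT, SENTENCE, CHUNK, TOKEN }

    /** A span of the text (hypothetical type, not the Stanbol Span interface). */
    public record Span(SpanType type, int start, int end) implements Comparable<Span> {
        @Override
        public int compareTo(Span other) {
            int c = Integer.compare(start, other.start);
            if (c == 0) c = Integer.compare(other.end, end); // longer span first
            if (c == 0) c = type.compareTo(other.type);
            return c;
        }
    }

    /** Returns the tokens enclosed by the given span, in text order. */
    public static List<Span> tokensIn(NavigableMap<Span, List<String>> spans, Span enclosing) {
        List<Span> tokens = new ArrayList<>();
        // Everything that sorts after the enclosing span and starts before its end.
        for (Span span : spans.tailMap(enclosing, false).keySet()) {
            if (span.start() >= enclosing.end()) break;
            if (span.type() == SpanType.TOKEN) tokens.add(span);
        }
        return tokens;
    }

    /** Builds the spans for the text "Hello world. Bye." with one annotation. */
    public static NavigableMap<Span, List<String>> example() {
        NavigableMap<Span, List<String>> spans = new TreeMap<>();
        int[][] tokens = {{0, 5}, {6, 11}, {11, 12}, {13, 16}, {16, 17}};
        for (int[] t : tokens) {
            spans.put(new Span(SpanType.TOKEN, t[0], t[1]), new ArrayList<>());
        }
        spans.put(new Span(SpanType.SENTENCE, 0, 12), new ArrayList<>());
        spans.put(new Span(SpanType.SENTENCE, 13, 17), new ArrayList<>());
        spans.put(new Span(SpanType.TEXT, 0, 17), new ArrayList<>());
        // Annotations are attached to a span; here simply as strings.
        spans.get(new Span(SpanType.TOKEN, 6, 11)).add("pos=NOUN");
        return spans;
    }

    public static void main(String[] args) {
        NavigableMap<Span, List<String>> spans = example();
        Span firstSentence = new Span(SpanType.SENTENCE, 0, 12);
        for (Span token : tokensIn(spans, firstSentence)) {
            System.out.println(token + " " + spans.get(token));
        }
    }
}
```

With a sorted map, iterating over the tokens of one sentence is a range scan and needs no separate parent/child bookkeeping, which is one plausible reason for choosing this kind of structure for word-level annotations.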
The NLP processing module also provides a default in-memory implementation of all defined interfaces. This implementation is used as default by the Stanbol Enhancer.
Additionally, the NLP processing module provides:
- Utilities for implementing NLP processing enhancement engines.
- JSON serialization and parsing support for analysed text including NLP annotations. Together with the RESTful NLP analysis engine this can be used to integrate NLP frameworks as RESTful services.
- RESTful service definition for a language identification service as well as the RESTful language identification engine. This allows the language identification features of an NLP framework to be integrated in a similar way to the NLP analysis described above (see STANBOL-894 for the service specification).
Stanbol Enhancer NLP Support
This section provides an overview of the currently integrated NLP frameworks and their supported languages.
Integrated NLP frameworks
- OpenNLP: Apache OpenNLP is the default NLP processing framework used by Stanbol. OpenNLP supports Sentence Detection, Tokenization, Part of Speech tagging, Chunking and Named Entity Recognition for several languages. Users can extend support to additional languages by providing their own statistical models.
- Smartcn: The Lucene Smartcn Analyzer integration provides basic language support for Chinese by providing Sentence Detection and Tokenization engines.
- Paoding: The Paoding Analyzer is an alternative to Smartcn for basic Chinese language support. Paoding only supports Tokenization and is therefore best used in combination with the Smartcn Sentence Detection engine.
- CELI / linguagrid.org: CELI contributed Stanbol EnhancementEngines based on their NLP processing framework. It supports Named Entity Recognition for French and Italian as well as Lemmatization and lexical analysis for Italian, Danish, Russian, Romanian and Swedish. In addition, CELI also provides a Language Identification service.
  NOTE: This engine sends the processed text to the CELI server. Users are required to create an account for the CELI service.
- Gosen: Lucene-Gosen is an LGPL licensed Analyzer for Japanese. The Apache Stanbol integration supports Sentence Detection, Tokenization, Part of Speech tagging as well as Named Entity Recognition.
  NOTE: As the license of Lucene-Gosen is not compatible with the ASL, this project is hosted on https://github.com/westei/stanbol-gosen and is NOT a part of Apache Stanbol. Users that want to use it will need to download it themselves.
- Freeling: Freeling is a GPL licensed NLP processing framework implemented in C++. It supports Sentence Detection, Tokenization, Part of Speech tagging, Chunking and Named Entity Recognition for several languages including English, Spanish, Italian, Russian and Portuguese.
  The integration is based on the RESTful NLP analysis service specification. That means that users will need to install and configure Freeling and then run the Stanbol Freeling Server. After that they can use this server by configuring the RESTful NLP Analysis Engine with the /analysis endpoint as well as the RESTful NLP Language Identification Engine with the /langident endpoint of their Stanbol Freeling Server.
  NOTE: As the license of Freeling is not compatible with the ASL, this project is hosted on https://github.com/insideout10/stanbol-freeling and is NOT a part of Apache Stanbol. Users that want to use it will need to download and install it themselves.
- Talismane: Talismane is an AGPL licensed NLP processing framework implemented in Java. It supports Sentence Detection, Tokenization and Part of Speech tagging for French.
  The integration is based on the RESTful NLP analysis service specification. That means that users will need to download and build the Stanbol-Talismane project and then run the Stanbol Talismane Server. After that they can use this server by configuring the RESTful NLP Analysis Engine with the /analysis endpoint of their Stanbol-Talismane server.
  NOTE: As the license of Talismane is not compatible with the ASL, this project is hosted on https://github.com/westei/stanbol-talismane and is NOT a part of Apache Stanbol. Users that want to use it will need to download and install it themselves.
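For the server-based integrations, the engine configuration boils down to pointing each RESTful engine at the right endpoint of the server. The following sketch only derives the two endpoint URLs named above from a server base URL. The base URL `http://localhost:8080` and the class `NlpServerEndpoints` are hypothetical; the address of an actual Stanbol Freeling or Talismane server depends on how it was started.

```java
import java.net.URI;

public class NlpServerEndpoints {

    /** Appends an endpoint path such as "/analysis" to the server base URL. */
    public static URI endpoint(String serverBaseUrl, String path) {
        // Drop a trailing slash so that the result never contains "//".
        String base = serverBaseUrl.endsWith("/")
                ? serverBaseUrl.substring(0, serverBaseUrl.length() - 1)
                : serverBaseUrl;
        return URI.create(base + path);
    }

    public static void main(String[] args) {
        String server = "http://localhost:8080"; // hypothetical server address
        // Value for the RESTful NLP Analysis Engine
        System.out.println("analysis  = " + endpoint(server, "/analysis"));
        // Value for the RESTful NLP Language Identification Engine
        System.out.println("langident = " + endpoint(server, "/langident"));
    }
}
```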
Supported Languages
- Catalan (ca)
  - Freeling: Sentence Detection, Tokenization, POS tagging, Chunking and basic NER without classification
- Chinese (zh)
- Danish (da)
- Dutch (nl)
  - OpenNLP: Sentence Detection, Tokenization, POS tagging and full NER for Persons, Organizations and Places
- English (en)
  - OpenNLP: Sentence Detection, Tokenization, POS tagging, Chunking and full NER for Persons, Organizations and Places
  - Freeling: Sentence Detection, Tokenization, POS tagging, Chunking and full NER for Persons, Organizations and Places
  - OpenCalais: NER
- French (fr)
  - Talismane: Sentence Detection, Tokenization, POS tagging
  - CELI: NER
  - OpenCalais: NER
- Galician (gl)
  - Freeling: Sentence Detection, Tokenization, POS tagging, Chunking and basic NER without classification
- German (de)
- Italian (it)
- Japanese (ja)
- Portuguese (pt)
- Romanian (ro)
  - CELI: Lemmatization and lexical analysis
- Russian (ru)
- Spanish (es)
  - OpenNLP: Sentence Detection, Tokenization, POS tagging (no Proper Noun support) and NER for Persons, Organizations and Places
  - Freeling: Sentence Detection, Tokenization, POS tagging, Chunking and full NER for Persons, Organizations and Places
  - OpenCalais: NER
- Swedish (sv)
- Welsh (cy)
  - Freeling: Sentence Detection, Tokenization, POS tagging, Chunking and basic NER without classification