Enhancement Engines and their main features
This page provides an overview of all Enhancement Engine implementations managed by the Apache Stanbol community.
Preprocessing
-
Tika Engine: (based on Apache Tika)
- content type detection
- text extraction from various document formats
- extraction of metadata from document formats
-
- text extraction from various document formats
- extraction of metadata from document formats
- NOTE: this engine is not included in the default Stanbol Launchers
Natural Language Processing (NLP)
This category contains Engines that process textual content sent to the Stanbol Enhancer
Language Detection
Language detection engines add Language annotations as defined by STANBOL-613 to the metadata of the ContentItem
-
Language Identification Engine:
- language detection for textual content utilizing Apache Tika
-
- language detection for textual content utilizing language-detection Project
-
RESTful Language Identification Engine:
- Client for the RESTful Language Identification Service as specified by STANBOL-894
-
CELI language detection Engine: This engine is part of the CELI enhancement engines (see STANBOL-583)
- Language detected based on a linguagrid.org server hosted by CELI
Sentence Detection
Sentence detection engines add Sentences to the AnalyzedText content part
-
OpenNLP Sentence Detection Engine:
- Sentence Detection based on OpenNLP
-
Smartcn Sentence Detection Engine:
- Adds Sentence detection support for Chinese.
- Part of the Lucene Smartcn Analyzer Integration
Tokenizer Engines
The responsibility of Tokenizer Engines is to add Tokens to the AnalyzedText content part
-
OpenNLP Tokenizer Detection Engine:
- Tokenizer implementation based on OpenNLP
-
Smartcn Tokenizer Engine:
- Adds tokenization support for Chinese.
- Part of the Lucene Smartcn Analyzer Integration
-
Paoding Tokenizer Engine:
- Adds tokenization support for Chinese.
- Part of the Paoding Analyzer Integration
Part of Speech (POS) Tagging
POS tagging engines add Part-of-Speech annotations to Tokens present in the AnalyzedText content part
- OpenNLP POS Tagging Engine:
- POS tagger implementation based on OpenNLP
Chunk/Phrase detection
Chunker (or Phrase Detection) Engines add detected Chunks to the AnalyzedText content part. They also annotate added Chunks with the type of the detected phrase
- OpenNLP Chunker Engine:
- Chunker implementation based on OpenNLP
Named Entity Recognition (NER) Engines
NER engines need to write detected Named Entities as 'fise:TextAnnotation's to the metadata of the ContentItem. In addition they may also add NER annotations to Chunks in the AnalyzedText content part
-
- NLP processing using OpenNLP NER
- detects occurrences of persons, places and organizations only
- supports NER annotations
-
OpenNLP Custom NER Model Engine:
- NLP processing using OpenNLP NER
- uses custom NameFinder models (user configured)
- supports custom Named Entity types (other than persons, places and organizations)
-
CELI NER engine: This engine is part of the CELI enhancement engines (see STANBOL-583)
- NER based on a linguagrid.org server hosted by CELI
- detects occurrences of persons, places and organizations and some other types
-
OpenCalais Enhancement Engine:
- integrates the Open Calais service. (Note: You need to provide a key in order to use this engine)
- can be configured to do only NER and no EntityLinking
Morphological Analysis
This includes Engines that perform some sort of morphological analysis (e.g. lemmatization)
- CELI AnalyzedText Lemmatizer Engine: This engine is part of the CELI enhancement engines (see STANBOL-583 and STANBOL-739)
- lemmatization support for "it", "da", "de", "ru", "ro"
General NLP processing Engines
-
- client for the RESTful NLP Analysis Service as specified by STANBOL-892
-
- Supports Sentence Detection, Tokenizing, Part of Speech tagging and Named Entity Recognition for Japanese
-
Gosen NLP Analyses Engine:
- Supports Sentence Detection, Tokenizing, Part of Speech tagging and Named Entity Recognition for Japanese
- Provided by the Stanbol Gosen integration
- NOTE: This Engine is not part of Apache Stanbol and needs to be downloaded separately from https://github.com/westei/stanbol-gosen
Linking / Suggestions
This category covers enhancement engines that suggest Entities for features present in the parsed content. An Entity is a uniquely identified resource. Typically it provides (or links to) further information such as the type, a description (text, pictures, videos, ...), spatial and/or temporal context, and links to other entities.
-
- suggest links to several Linked Data Sources (e.g. DBpedia)
-
- EntityLinkingEngine configuration for the Stanbol Entityhub
- consumes NLP processing results from the AnalyzedText content part
- Links Entities managed by the Entityhub, ReferencedSites or ManagedSites
- Supports any language; however, quality and performance depend on the available NLP processing support
-
- Entity Linking Engine based on Lucene FST (Finite State Transducer) technology
- Links Entities indexed in a Solr index (e.g. an Entityhub Site backed by a SolrYard)
- Provides better linking performance than the Entityhub Linking Engine
- Requires a lot of CPU after changes of the vocabulary to re-create the FST models.
-
- Uses initial mentions of an Entity (e.g. 'Barack Obama' in 'Barack Obama attended the UN security council ...') to detect co-mentions at a later position in the same document (e.g. 'Obama' in '... Obama indicated consent …')
-
DBpedia Spotlight Annotation Engine: Integration of the DBpedia Spotlight with the Stanbol Enhancer (see STANBOL-706)
- includes NLP, Entity Linking and Disambiguation of Entities using DBpedia as knowledge base
- accesses a remote service
-
- suggests links to geonames.org
- provides hierarchical links for locations
- accesses a remote service, requires a user account
-
OpenCalais Enhancement Engine:
- integrates the Open Calais service. (Note: You need to provide a key in order to use this engine)
- provides both NER and Entity Linking
- accesses a remote service, requires a user account
-
- integrates the Zemanta services. (Note: You need to provide a key in order to use this engine)
- provides both NLP and Entity Linking
- accesses a remote service, requires a user account
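The co-mention idea listed above (an initial full mention such as 'Barack Obama', followed later by just 'Obama') can be illustrated with a small sketch. This is not the Stanbol implementation: the function name `find_co_mentions` and the simple rule "match the last token of the initial mention" are assumptions chosen only to show the principle.

```python
import re

def find_co_mentions(text, initial_mention):
    """Return (start, end) offsets of later, shortened mentions of an entity.

    Assumption for this sketch: a co-mention is any occurrence of the last
    token of the initial mention (e.g. 'Obama') after the first full mention.
    """
    first = text.find(initial_mention)
    if first < 0:
        return []  # the entity was never mentioned in full
    search_from = first + len(initial_mention)
    short_form = initial_mention.split()[-1]
    pattern = re.compile(r"\b" + re.escape(short_form) + r"\b")
    # Only look at text that follows the initial mention.
    return [(m.start(), m.end()) for m in pattern.finditer(text, search_from)]

text = "Barack Obama attended the UN security council. Later, Obama indicated consent."
spans = find_co_mentions(text, "Barack Obama")
```

Each returned span could then be linked to the same Entity as the initial mention.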
Sentiment Analysis
This includes Engines that perform word/chunk level sentiment classifications on the AnalyzedText content part as well as Engines that summarize those lower level annotations to Sentiments for sentences, sections or the whole text. Sentiment summarizations are represented as 'fise:SentimentAnnotation's (TODO: not yet fully specified, see STANBOL-760).
-
Sentiment WordClassifier Engine: This engine annotates Tokens of the AnalyzedText content part with sentiment annotations (a double value in the range [-1..1])
- supports de and en
- can be extended to support additional languages by implementing the SentimentClassifier interface
-
Sentiment Summarization Engine: under development (see STANBOL-760)
- summarizes sentiments on word level to chunks, sentences and the whole text
- creates 'fise:SentimentAnnotation's
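As a rough illustration of how word-level sentiment values in the range [-1..1] might be summarized for sentences and for the whole text, consider the sketch below. The use of a plain arithmetic mean and the name `summarize_sentiment` are assumptions; the source does not specify how the Sentiment Summarization Engine aggregates values.

```python
from statistics import mean

def summarize_sentiment(sentences):
    """Summarize word-level sentiments per sentence and for the whole text.

    `sentences` is a list of lists of word-level values in [-1, 1].
    Averaging is an assumption made for this sketch.
    """
    per_sentence = [mean(values) if values else 0.0 for values in sentences]
    all_values = [v for values in sentences for v in values]
    overall = mean(all_values) if all_values else 0.0
    return per_sentence, overall

# Two sentences: one clearly positive, one clearly negative.
word_sentiments = [[0.5, 1.0], [-1.0, 0.0, -0.5]]
per_sentence, overall = summarize_sentiment(word_sentiments)
```

The example also shows why sentence-level summaries can be useful: the two sentences have opposite polarity, which a single whole-text value would hide.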
Disambiguation
Enhancement Engines in this category can disambiguate Entities based on contextual information (e.g. whether "Apple" in a sentence refers to the fruit or the company). Based on that, such engines can adjust existing Entity suggestions or create new ones.
-
DBpedia Spotlight Disambiguation Engine: (see STANBOL-706)
- consumes existing fise:TextAnnotations and disambiguates them by using DBpedia Spotlight
- creates Entity suggestions (fise:EntityAnnotations) for the processed fise:TextAnnotations
- accesses a remote service
-
Solr More-like-This Disambiguation Engine: (see STANBOL-723)
- disambiguates Entities managed by the Stanbol Entityhub by using Solr MLT queries
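To make the "Apple" example concrete, the following sketch re-ranks existing Entity suggestions by how many context words they share with the sentence. It is a toy stand-in, not how DBpedia Spotlight or Solr MLT queries work; the keyword sets and the function name `disambiguate` are hypothetical.

```python
def disambiguate(context_words, suggestions):
    """Re-rank entity suggestions by overlap with the sentence context.

    `suggestions` maps an entity label to a (hypothetical) set of keywords
    describing it. More shared words means a higher rank.
    """
    context = {w.lower() for w in context_words}
    scored = [(len(context & keywords), entity)
              for entity, keywords in suggestions.items()]
    # Highest overlap first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entity for _, entity in scored]

suggestions = {
    "Apple (company)": {"iphone", "company", "shares"},
    "Apple (fruit)": {"tree", "eat", "pie"},
}
ranked = disambiguate("Apple shares rose after the iPhone launch".split(), suggestions)
```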
Postprocessing / Other
Post-Processing engines are executed after the Semantic Analysis is done. Typical examples of post-processing tasks are to dereference information about linked entities, re-write enhancements, filter annotations (e.g. based on the confidence ...).
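Filtering annotations by confidence, one of the post-processing tasks mentioned above, can be sketched as follows. Enhancements are represented here as plain dictionaries rather than RDF, and the key names and the default threshold of 0.5 are assumptions made for illustration.

```python
def filter_by_confidence(annotations, threshold=0.5):
    """Keep only annotations whose confidence reaches the threshold.

    Annotations without a confidence value are treated as 0.0 and dropped
    for any positive threshold (an assumption of this sketch).
    """
    return [a for a in annotations if a.get("confidence", 0.0) >= threshold]

annotations = [
    {"entity": "Paris", "confidence": 0.92},
    {"entity": "Paris Hilton", "confidence": 0.31},
    {"entity": "Paris (Texas)"},  # no confidence reported
]
kept = filter_by_confidence(annotations, threshold=0.5)
```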
Dereference Entities
Enhancement Engines of this kind are responsible for retrieving additional information about linked Entities. They first query the enhancement results for referenced Entities, then check whether an entity can be dereferenced, and in a third step dereference the entity and add that information to the enhancement results.
Apache Stanbol provides a core implementation of an Entity Dereference Engine that can be extended for different information sources.
- Entityhub Dereference Engine allows dereferencing Entities available through the Stanbol Entityhub
- Allows configuring the dereferenced languages and fields
- Supports LD Path
- Uses a thread pool to dereference Entities
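The three dereferencing steps described above, combined with a thread pool and a configurable set of fields, might look roughly like the sketch below. The in-memory `ENTITY_SOURCE`, the `urn:example:` URIs and all function names are hypothetical stand-ins; the real engine works against the Stanbol Entityhub and its Java API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an entity source such as the Entityhub.
ENTITY_SOURCE = {
    "urn:example:paris": {"label": "Paris", "type": "Place"},
    "urn:example:obama": {"label": "Barack Obama", "type": "Person"},
}

def dereference(uri):
    """Look up one entity; returns (uri, copy of its data)."""
    return uri, dict(ENTITY_SOURCE[uri])

def dereference_entities(enhancements, fields=("label",), workers=4):
    # Step 1: query the enhancement results for referenced entities.
    referenced = {e["entity"] for e in enhancements if "entity" in e}
    # Step 2: check which of them can be dereferenced.
    known = sorted(uri for uri in referenced if uri in ENTITY_SOURCE)
    # Step 3: dereference them using a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(dereference, known))
    # Keep only the configured fields.
    return {uri: {f: data[f] for f in fields if f in data}
            for uri, data in results.items()}

enhancements = [
    {"entity": "urn:example:paris"},
    {"entity": "urn:example:unknown"},  # cannot be dereferenced
    {"text": "no entity reference"},
]
dereferenced = dereference_entities(enhancements)
```

A thread pool pays off mainly when each lookup involves I/O, as it would with a remote or disk-backed entity source.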
Refactor Engines
-
TextAnnotation new Model Converter Engine
- This engine converts fise:TextAnnotation to include fise:selection-prefix and fise:selection-suffix properties.
-
- transforms enhancements according to a target ontology, requires KRES launcher.
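For the TextAnnotation conversion listed above, the following sketch shows how a selection prefix and suffix could be derived from the text and the offsets of a selection. The context length, the helper name `add_selection_context` and the use of a plain dictionary instead of RDF are assumptions made for illustration only.

```python
def add_selection_context(text, start, end, context_len=10):
    """Describe a selection together with the text before and after it.

    `context_len` (number of characters kept on each side) is an
    assumption of this sketch, not a documented engine setting.
    """
    return {
        "fise:selected-text": text[start:end],
        "fise:selection-prefix": text[max(0, start - context_len):start],
        "fise:selection-suffix": text[end:end + context_len],
    }

text = "Barack Obama attended the UN security council"
annotation = add_selection_context(text, 7, 12, context_len=5)
```

A prefix and suffix of this kind let a consumer locate the selection again even when character offsets have shifted.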
Others
- NIF 2.0 Transformation Engine allows serializing low level NLP results as RDF
- NIF (NLP Interchange Format) 2.0 defines an RDF schema for describing Sentences, Phrases, Words and their NLP annotations.
- This engine allows retrieving detailed information about NLP results that is typically only available through the Java API of the AnalyzedText content part.
Deprecated
Enhancement Engines listed below are no longer supported or were replaced by others
-
KeywordLinkingEngine: deprecated, use EntityhubLinkingEngine instead!
- NLP processing using OpenNLP
- supports multiple languages
- detects occurrences of untyped entities as concepts, takes local taxonomies as linking target
-
NLP 2 RDF Engine: under development (see STANBOL-741)
- replaced by the NIF 2.0 Transformation Engine, which supports version 2.0 of the NIF standard, while this engine is based on NIF 1.0
- converts NLP processing results stored in the AnalyzedText content part to RDF and adds them to the metadata of the ContentItem
- generated RDF uses the NIF (NLP Interchange Format)
-
CachingDereferencerEngine: deprecated (see dereferencing support of individual engines as well as STANBOL-336)
- retrieves additional content for presenting the enhancement results.