Enhancement Engines and their main features

This page provides an overview of all Enhancement Engine implementations managed by the Apache Stanbol community.

Preprocessing

Tika Engine: (based on Apache Tika)
- content type detection
- text extraction from various document formats
- extraction of metadata from document formats
Metaxa Engine:
- text extraction from various document formats
- extraction of metadata from document formats
- NOTE: this engine is not included in the default Stanbol Launchers

Natural Language Processing (NLP)

This category contains Engines that process textual content sent to the Stanbol Enhancer.

Language Detection

Language detection engines add Language annotations as defined by STANBOL-613 to the metadata of the ContentItem

Language Identification Engine:
- language detection for textual content utilizing Apache Tika
Language Detection Engine:
- language detection for textual content utilizing the language-detection project
RESTful Language Identification Engine:
- Client for the RESTful Language Identification Service as specified by STANBOL-894
CELI language detection Engine: This engine is part of the CELI enhancement engines (see STANBOL-583)
- Language detected based on a linguagrid.org server hosted by CELI

Sentence Detection

Sentence detection engines add Sentences to the AnalyzedText content part

OpenNLP Sentence Detection Engine:
- Sentence Detection based on OpenNLP
Smartcn Sentence Detection Engine:
- Adds Sentence detection support for Chinese.
- Part of the Lucene Smartcn Analyzer Integration

Tokenizer Engines

The responsibility of Tokenizer Engines is to add Tokens to the AnalyzedText content part

OpenNLP Tokenizer Detection Engine:
- Tokenizer implementation based on OpenNLP
Smartcn Tokenizer Engine:
- Adds tokenization support for Chinese.
- Part of the Lucene Smartcn Analyzer Integration
Paoding Tokenizer Engine:
- Adds tokenization support for Chinese.
- Part of the Paoding Analyzer Integration
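
To make the notions of Sentences and Tokens more concrete, the following minimal sketch splits a text into sentences and word tokens. It deliberately uses the JDK's `java.text.BreakIterator` as a stand-in and not OpenNLP or the Smartcn/Paoding analyzers. The class `SpanSketch` and its methods are hypothetical and are not part of the Stanbol AnalyzedText API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: not the Stanbol AnalyzedText API.
class SpanSketch {

    // Returns the trimmed sentences found by the JDK sentence BreakIterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Returns word tokens; whitespace and punctuation segments are skipped.
    static List<String> tokens(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Stanbol enhances text. It adds annotations.";
        System.out.println(sentences(text, Locale.ENGLISH));
        System.out.println(tokens(text, Locale.ENGLISH));
    }
}
```

A real engine would store the start and end offsets of each sentence and token rather than the strings themselves, so that later engines can attach annotations to them.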

Part of Speech (POS) Tagging

POS tagging engines add Part-of-Speech annotations to Tokens present in the AnalyzedText content part

OpenNLP POS Tagging Engine:
- POS tagger implementation based on OpenNLP

Chunk/Phrase detection

Chunker (or Phrase Detection) Engines add detected Chunks to the AnalyzedText content part. They also annotate the added Chunks with the type of the detected phrase.

OpenNLP Chunker Engine:
- Chunker implementation based on OpenNLP

Named Entity Recognition (NER) Engines

NER engines need to write detected Named Entities as 'fise:TextAnnotation's to the metadata of the ContentItem. In addition they may also add NER annotations to Chunks in the AnalyzedText content part

OpenNLP NER Engine:
- NLP processing using OpenNLP NER
- detects occurrences of persons, places and organizations only
- supports NER annotations
OpenNLP Custom NER Model Engine:
- NLP processing using OpenNLP NER
- uses custom NameFinder models (user configured)
- supports custom Named Entity types (other than persons, places and organizations)
CELI NER engine: This engine is part of the CELI enhancement engines (see STANBOL-583)
- NER based on a linguagrid.org server hosted by CELI
- detects occurrences of persons, places and organizations and some other types
OpenCalais Enhancement Engine:
- integrates service from Open Calais. (Note: You need to provide a key in order to use this engine)
- can be configured to do only NER and no EntityLinking

Morphological Analysis

This includes Engines that perform some sort of morphological analysis (e.g. lemmatization)

CELI AnalyzedText Lemmatizer Engine: This engine is part of the CELI enhancement engines (see STANBOL-583 and STANBOL-739)
- lemmatization support for "it", "da", "de", "ru", "ro"

General NLP processing Engines

RESTful NLP Analysis Engine:
- client for the RESTful NLP Analysis Service as specified by STANBOL-892
Kuromoji NLP Engine:
- Supports Sentence Detection, Tokenizing, Part of Speech tagging and Named Entity Recognition for Japanese
Gosen NLP Analyses Engine:
- Supports Sentence Detection, Tokenizing, Part of Speech tagging and Named Entity Recognition for Japanese
- Provided by the Stanbol Gosen integration
- NOTE: This Engine is not part of Apache Stanbol and needs to be downloaded separately from https://github.com/westei/stanbol-gosen

Linking / Suggestions

This category covers enhancement engines that suggest Entities for features present in the parsed content. An Entity is a uniquely identified resource. Typically it provides (or links to) further information such as the type, a description (text, pictures, videos …), spatial and/or temporal context, and links to other entities.

Named Entity Linking Engine:
- suggest links to several Linked Data Sources (e.g. DBpedia)
Entityhub Linking Engine:
- EntityLinkingEngine configuration for the Stanbol Entityhub
- consumes NLP processing results from the AnalyzedText content part
- Links Entities managed by the Entityhub, ReferencedSites or ManagedSites
- Supports any language; however, quality and performance depend on the available NLP processing support
FST Linking Engine:
- Entity Linking Engine based on Lucene FST (Finite State Transducer) technology
- Links Entities indexed in a Solr index (e.g. an Entityhub Site backed by a SolrYard)
- Provides better linking performance than the Entityhub Linking Engine
- Requires a lot of CPU after changes of the vocabulary to re-create the FST models.
Entity Co-Mention Engine:
- Uses initial mentions of an Entity (e.g. 'Barack Obama' in 'Barack Obama attended the UN security council ...') to detect co-mentions at a later position in the same document (e.g. 'Obama' in '... Obama indicated consent …')
DBpedia Spotlight Annotation Engine: Integration of the DBpedia Spotlight with the Stanbol Enhancer (see STANBOL-706)
- includes NLP, Entity Linking and Disambiguation of Entities using DBpedia as knowledge base
- accesses a remote service
Geonames Enhancement Engine:
- suggests links to geonames.org
- provides hierarchical links for locations
- accesses a remote service, requires a user account
OpenCalais Enhancement Engine:
- integrates service from Open Calais. (Note: You need to provide a key in order to use this engine)
- provides both NER and Entity Linking
- accesses a remote service, requires a user account
Zemanta Enhancement Engine:
- integrates the Zemanta services. (Note: You need to provide a key in order to use this engine)
- provides both NLP and Entity Linking
- accesses a remote service, requires a user account
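
The co-mention idea described for the Entity Co-Mention Engine above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the engine's actual algorithm: it treats the last word of the initial mention as the short form and looks for whole-word occurrences after the first full mention. The class `CoMentionSketch` and the method `coMentions` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not the Stanbol Entity Co-Mention Engine.
class CoMentionSketch {

    // Returns the start offsets of later occurrences of the short form
    // (assumed here to be the last word of the full mention).
    static List<Integer> coMentions(String text, String fullMention) {
        List<Integer> offsets = new ArrayList<>();
        int first = text.indexOf(fullMention);
        if (first < 0) {
            return offsets; // no initial mention, so nothing to co-reference
        }
        String[] parts = fullMention.trim().split("\\s+");
        String shortForm = parts[parts.length - 1];
        int from = first + fullMention.length();
        Matcher m = Pattern.compile("\\b" + Pattern.quote(shortForm) + "\\b").matcher(text);
        while (m.find()) {
            if (m.start() >= from) { // only positions after the initial mention
                offsets.add(m.start());
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "Barack Obama attended the UN security council. Later Obama indicated consent.";
        System.out.println(coMentions(text, "Barack Obama"));
    }
}
```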

Sentiment Analysis

This includes Engines that perform word/chunk level sentiment classifications on the AnalyzedText content part as well as Engines that summarize those lower level annotations to Sentiments for sentences, sections or the whole text. Sentiment summarizations are represented as 'fise:SentimentAnnotation's (TODO: not yet fully specified, see STANBOL-760).

Sentiment WordClassifier Engine: This engine annotates Tokens of the AnalyzedText content part with sentiment annotations (a double value in the range [-1..1])
- supports de and en
- can be extended to support additional languages by implementing the SentimentClassifier interface
Sentiment Summarization Engine: under development (see STANBOL-760)
- summarizes sentiments on word level to chunks, sentences and the whole text
- creates 'fise:SentimentAnnotation's
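
As a rough illustration of how word-level sentiments in the range [-1..1] could be summarized for a sentence, the sketch below averages the values of all tokens that have an entry in a small lexicon. Averaging is an assumption made for this example; the source does not describe how the Sentiment Summarization Engine aggregates values. The class `SentimentSummarySketch` and the lexicon map are hypothetical and do not reproduce the SentimentClassifier interface.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: not the Stanbol sentiment engines.
class SentimentSummarySketch {

    // Averages the sentiment of all tokens that have a lexicon entry.
    // Values are clamped to [-1..1]; a text without rated tokens yields 0.0.
    static double summarize(List<String> tokens, Map<String, Double> lexicon) {
        double sum = 0.0;
        int rated = 0;
        for (String token : tokens) {
            Double value = lexicon.get(token.toLowerCase(Locale.ROOT));
            if (value != null) {
                sum += Math.max(-1.0, Math.min(1.0, value));
                rated++;
            }
        }
        return rated == 0 ? 0.0 : sum / rated;
    }

    public static void main(String[] args) {
        Map<String, Double> lexicon = new HashMap<>();
        lexicon.put("good", 0.5);
        lexicon.put("bad", -1.0);
        System.out.println(summarize(Arrays.asList("A", "good", "day"), lexicon));
    }
}
```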

Disambiguation

Enhancement Engines in this category can disambiguate Entities based on contextual information (e.g. whether "Apple" in a sentence refers to the fruit or the company). Based on that, such engines can adjust existing Entity suggestions or create new ones.

DBpedia Spotlight Disambiguation Engine: (see STANBOL-706)
- consumes existing fise:TextAnnotations and disambiguates them by using DBpedia Spotlight
- creates Entity suggestions (fise:EntityAnnotations) for the processed fise:TextAnnotations
- accesses a remote service
Solr More-like-This Disambiguation Engine: (see STANBOL-723)
- disambiguates Entities managed by the Stanbol Entityhub by using Solr MLT queries

Postprocessing / Other

Post-Processing engines are executed after the Semantic Analysis is done. Typical examples of post-processing tasks are to dereference information about linked entities, re-write enhancements, filter annotations (e.g. based on the confidence ...).
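
One of the post-processing tasks named above, filtering annotations by confidence, can be sketched as follows. The `Suggestion` class, the `filter` method and the threshold value are hypothetical and only illustrate the idea; they are not part of the Stanbol enhancement structure API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not the Stanbol enhancement structure API.
class ConfidenceFilterSketch {

    // Minimal stand-in for an entity suggestion carrying a confidence value.
    static final class Suggestion {
        final String entityUri;
        final double confidence;

        Suggestion(String entityUri, double confidence) {
            this.entityUri = entityUri;
            this.confidence = confidence;
        }
    }

    // Keeps only suggestions whose confidence is at least minConfidence.
    static List<Suggestion> filter(List<Suggestion> suggestions, double minConfidence) {
        List<Suggestion> kept = new ArrayList<>();
        for (Suggestion s : suggestions) {
            if (s.confidence >= minConfidence) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Suggestion> all = Arrays.asList(
                new Suggestion("http://dbpedia.org/resource/Paris", 0.9),
                new Suggestion("http://dbpedia.org/resource/Paris,_Texas", 0.2));
        for (Suggestion s : filter(all, 0.5)) {
            System.out.println(s.entityUri);
        }
    }
}
```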

Dereference Entities

Enhancement Engines of this kind are responsible for retrieving additional information about linked Entities. They first query the enhancement results for referenced Entities, then check whether each Entity can be dereferenced, and in a third step dereference the Entity and add the retrieved information to the enhancement results.

Apache Stanbol provides a core implementation of an Entity Dereference Engine that can be extended for different information sources.

Entityhub Dereference Engine: dereferences Entities available through the Stanbol Entityhub
- Allows configuring the dereferenced languages and fields
- Supports LD Path
- Uses a thread pool to dereference Entities
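
The three dereferencing steps described in this section (collect referenced Entities, check whether each can be dereferenced, then dereference and collect the information), together with the use of a thread pool, can be sketched as follows. The `EntitySource` interface, the `dereferenceAll` method and the flat `Map<String, String>` representation of entity data are assumptions made for this example and do not reflect the actual Stanbol dereference API.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: not the Stanbol Entity Dereference Engine.
class DereferenceSketch {

    // Hypothetical information source, e.g. backed by an entity store.
    interface EntitySource {
        boolean canDereference(String uri);
        Map<String, String> dereference(String uri);
    }

    static Map<String, Map<String, String>> dereferenceAll(
            Collection<String> referencedUris, EntitySource source, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Map<String, String>>> pending = new LinkedHashMap<>();
            // Step 1: visit every referenced entity once, in encounter order.
            for (String uri : new LinkedHashSet<>(referencedUris)) {
                // Step 2: skip entities the source cannot dereference.
                if (source.canDereference(uri)) {
                    // Step 3: dereference in the thread pool.
                    pending.put(uri, pool.submit(() -> source.dereference(uri)));
                }
            }
            // Collect the retrieved data so it can be added to the results.
            Map<String, Map<String, String>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Map<String, String>>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while dereferencing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("dereferencing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Submitting each lookup as its own task keeps slow sources from blocking the others, while the insertion-ordered maps keep the output order deterministic.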

Refactor Engines

TextAnnotation new Model Converter Engine:
- This engine converts fise:TextAnnotation to include fise:selection-prefix and fise:selection-suffix properties.
Refactor Engine:
- transforms enhancements according to a target ontology, requires KRES launcher.
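
To illustrate what the fise:selection-prefix and fise:selection-suffix properties mentioned for the TextAnnotation new Model Converter Engine could contain, the following sketch extracts the text immediately before and after a selection. The class `SelectionContextSketch`, the method `context` and the context length of 10 characters are hypothetical; the source does not state how much context the engine actually stores.

```java
// Hypothetical illustration only: not the Stanbol converter engine.
class SelectionContextSketch {

    // Returns {prefix, selected text, suffix} for the selection [start, end).
    // Prefix and suffix are cut off at the text boundaries.
    static String[] context(String text, int start, int end, int contextLength) {
        if (start < 0 || end > text.length() || start > end || contextLength < 0) {
            throw new IllegalArgumentException("invalid selection or context length");
        }
        String prefix = text.substring(Math.max(0, start - contextLength), start);
        String suffix = text.substring(end, Math.min(text.length(), end + contextLength));
        return new String[] { prefix, text.substring(start, end), suffix };
    }

    public static void main(String[] args) {
        String text = "Paris is the capital of France.";
        int start = text.indexOf("France");
        String[] parts = context(text, start, start + "France".length(), 10);
        System.out.println("[" + parts[0] + "][" + parts[1] + "][" + parts[2] + "]");
    }
}
```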

Others

NIF 2.0 Transformation Engine: serializes low level NLP results as RDF
- NIF stands for NLP Interchange Format. It defines an RDF schema for describing Sentences, Phrases, Words and their NLP annotations.
- This engine allows retrieving detailed information about NLP results that is typically only available through the Java API of the AnalyzedText content part.

Deprecated

Enhancement Engines listed below are no longer supported or were replaced by others

KeywordLinkingEngine: deprecated, use EntityhubLinkingEngine instead!
- NLP processing using OpenNLP
- supports multiple languages
- detects occurrences of untyped entities as concepts, takes local taxonomies as linking target
NLP 2 RDF Engine: under development (see STANBOL-741)
- replaced by the NIF 2.0 Transformation Engine, which supports version 2.0 of the NIF standard, while this engine is based on NIF 1.0
- converts NLP processing results stored in the AnalyzedText content part to RDF and adds them to the metadata of the ContentItem
- generated RDF uses the NIF (NLP Interchange Format)
CachingDereferencerEngine: deprecated (see dereferencing support of individual engines as well as STANBOL-336)
- retrieves additional content for presenting the enhancement results.
