This project has retired. For details please refer to its Attic page.
Apache Stanbol - Tika Engine

Tika Engine

The Tika Engine is an Apache Stanbol Enhancement Engine based on Apache Tika. It has three main functions:

  1. To detect the content type of parsed content. This is only performed if no content type is provided or the content type is set to "application/octet-stream". The detected content type is added to the metadata of the ContentItem.
  2. To extract the plain text (and XHTML) from parsed content and add it to the ContentItem as content parts with the type Blob.
  3. To extract metadata from the parsed content and add it to the metadata of the ContentItem.

Supported Media Types

As this engine uses Apache Tika, the supported media types are the same as those listed on the Tika homepage.

Extracted Metadata

Tika provides metadata as 'key:values' pairs. To use them efficiently within Stanbol, they need to be converted to valid RDF and aligned with existing ontologies.

The TikaEngine supports alignments to several different Ontologies. Such alignment rules can be activated/deactivated within the configuration of the TikaEngine.

Supported Ontologies:

Note that the metadata extracted by the Tika engine are not covered by the Stanbol Enhancement Structure as they are outside of its scope.

ContentType:

The detected content type for the parsed ContentItem is added by using the following two properties:

Note that these properties will only be present if the related Ontology is activated in the TikaEngine configuration.

Sending Requests directly to the Tika Engine

The Stanbol Enhancer allows sending enhancement requests directly to a specific EnhancementEngine. This feature can be used in combination with the Tika Engine to request:

  1. the "text/plain" or "application/xhtml+xml" version of parsed content
  2. the extracted metadata as RDF aligned to the activated Ontologies

The first example requests the plain text version of a PDF file with the name "test.pdf".

curl -v -X POST -H "Accept: text/plain" -T test.pdf \
    "http://localhost:8080/enhancer/engine/tika?omitMetadata=true"

Note the

The second example returns the metadata extracted from the parsed file "song.mp3".

curl -v -X POST -H "Accept: application/rdf+xml" -T song.mp3 \
    "http://localhost:8080/enhancer/engine/tika"