Tika Engine
The Tika Engine is an Apache Stanbol Enhancement Engine based on Apache Tika. It has three main functions:
- To detect the content type of parsed content. This is only performed if no content type is parsed or the content type is set to "application/octet-stream". The detected content type is added to the metadata of the ContentItem.
- To extract the plain text (and XHTML) from parsed content and add it to the ContentItem as content parts with the type Blob.
- To extract metadata from the parsed content and add it to the metadata of the ContentItem.
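The condition under which content type detection runs (first item above) can be sketched as a small shell function. This is only an illustration of the rule as described here; the function name "needs_detection" is hypothetical and is not part of Stanbol.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the rule described above, not Stanbol's own code.
# Succeeds (exit status 0) when the content type would have to be detected,
# i.e. when none was supplied or it is the generic "application/octet-stream".
needs_detection() {
  case "${1:-}" in
    ""|application/octet-stream) return 0 ;;
    *) return 1 ;;
  esac
}

# Walk through the three cases: missing, generic, and specific content type.
for ct in "" "application/octet-stream" "application/pdf"; do
  if needs_detection "$ct"; then
    echo "'$ct': detect the content type"
  else
    echo "'$ct': keep the supplied content type"
  fi
done
```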
Supported Media Types
Because this engine uses Apache Tika, it supports the same media types as those listed on the Tika homepage.
Extracted Metadata
Tika provides metadata as 'key:values' pairs. To use them efficiently within Stanbol, they need to be converted to valid RDF and aligned with existing Ontologies.
The TikaEngine supports alignments to several different Ontologies. Such alignment rules can be activated/deactivated within the configuration of the TikaEngine.
Supported Ontologies:
- Ontology for Media Resources: This is the most complete mapping to a single Ontology. It includes mappings for all Dublin Core metadata, geo locations, some image-specific data and most of the audio and video related metadata.
- DC terms: Provides good mappings for text documents (HTML, Office, OpenOffice, PDF ...).
- Nepomuk EXIF ontology: Interesting for users who want to work with EXIF metadata extracted from images.
- Nepomuk Message Ontology: Used for sender and receiver information of mail messages.
- SKOS: Allows mapping of labels and notes to SKOS. This is deactivated by default.
- RDFS: Allows mapping of labels and comments to "rdfs:label" and "rdfs:comment".
Note that the metadata extracted by the Tika engine are not covered by the Stanbol Enhancement Structure as they are outside of its scope.
ContentType:
The detected content type of the parsed ContentItem is added using the following two properties:
- 'http://purl.org/dc/terms/format': Dublin Core terms 'format'
- 'http://www.w3.org/ns/ma-ont#hasFormat': Media Resource Ontology 'hasFormat'
Note that these properties will only be present if the related Ontology is activated in the TikaEngine configuration.
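To check which content type was recorded, the two properties above can be filtered out of the returned metadata. The following is a minimal sketch that assumes the metadata is available as N-Triples, one statement per line. The function name "extract_formats" and the sample triple (its subject and the literal form of the value) are made up for illustration; the statements Stanbol actually emits may look different.

```shell
#!/bin/sh
# Hypothetical helper: reads N-Triples from stdin and prints the object of
# every statement whose predicate is one of the two content type properties.
# Assumes whitespace-separated terms and an object without embedded spaces.
extract_formats() {
  awk '
    $2 == "<http://purl.org/dc/terms/format>" ||
    $2 == "<http://www.w3.org/ns/ma-ont#hasFormat>" {
      print $3
    }'
}

# Illustrative input only: the first statement matches, the second does not.
printf '%s\n' \
  '<urn:example:item> <http://purl.org/dc/terms/format> "application/pdf" .' \
  '<urn:example:item> <http://purl.org/dc/terms/title> "Example" .' |
  extract_formats
# prints "application/pdf" (including the quotes)
```

If neither Ontology is activated, the filter simply prints nothing.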
Sending Requests directly to the Tika Engine
The Stanbol Enhancer allows enhancement requests to be sent directly to a specific EnhancementEngine. This feature can be used in combination with the Tika Engine to request
- the "text/plain" or "application/xhtml+xml" version of parsed content
- the extracted metadata as RDF aligned to the activated Ontologies
The first example requests the plain text version of a PDF file with the name "test.pdf".
curl -v -X POST -H "Accept: text/plain" -T test.pdf \
    "http://localhost:8080/enhancer/engine/tika?omitMetadata=true"
Note that
- the 'Accept' header is set to the content type of the requested content, and
- the 'omitMetadata=true' parameter tells the Enhancer not to return the RDF metadata.
The second example returns the metadata extracted from the parsed file "song.mp3".
curl -v -X POST -H "Accept: application/rdf+xml" -T song.mp3 \
    "http://localhost:8080/enhancer/engine/tika"
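The two requests above differ only in the 'Accept' header and the 'omitMetadata' parameter, so they can be wrapped in two small shell functions. This is a sketch, not part of Stanbol: the function names and the STANBOL_URL and CURL variables are assumptions introduced here. Setting CURL=echo prints the arguments instead of sending a request, which is convenient for a dry run.

```shell
#!/bin/sh
# Hypothetical wrappers around the two curl calls shown above.
# STANBOL_URL and CURL are conventions of this sketch, not Stanbol settings.
STANBOL_URL=${STANBOL_URL:-http://localhost:8080}
CURL=${CURL:-curl}

# tika_content FILE ACCEPT
# Requests a converted version of FILE, for example "text/plain" or
# "application/xhtml+xml", and omits the RDF metadata from the response.
tika_content() {
  "$CURL" -X POST -H "Accept: $2" -T "$1" \
    "$STANBOL_URL/enhancer/engine/tika?omitMetadata=true"
}

# tika_metadata FILE
# Requests the metadata extracted from FILE as RDF/XML.
tika_metadata() {
  "$CURL" -X POST -H "Accept: application/rdf+xml" -T "$1" \
    "$STANBOL_URL/enhancer/engine/tika"
}

# Example calls (left commented out because they need a running Stanbol):
#   tika_content test.pdf application/xhtml+xml > test.xhtml
#   tika_metadata song.mp3 > song.rdf
```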