The Metaxa Enhancement Engine: extracting content and metadata from various formats
The Metaxa Enhancement Engine extracts embedded metadata and textual content from a large variety of document types and formats. The text extraction functionality also makes Metaxa suitable as a pre-processor for other components, especially NLP processors and indexing for search.
Technical description
The engine is based on the Aperture framework, extended to handle structured content embedded in HTML web pages, such as Microformats and RDFa. Some of Aperture's original extractors have also been replaced by engines built on different base libraries. Metaxa creates a single TextAnnotation instance that refers to the content item through its extracted-from property. The metadata extracted by Metaxa is attached directly to the content item (the document), because it describes document properties rather than text annotations. Several ontologies are used to describe the different kinds of metadata; an overview is given in the Vocabularies section below.
The general structure of the Metaxa annotations consists of three levels of annotations illustrated in the following example:
The top-level TextAnnotation instance:
    <urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b>
        a <http://fise.iks-project.eu/ontology/TextAnnotation> , <http://fise.iks-project.eu/ontology/Enhancement> ;
        <http://fise.iks-project.eu/ontology/confidence> "1.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
        <http://fise.iks-project.eu/ontology/extracted-from> <http://localhost:8080/store/content/mf_example.htm> ;
        <http://purl.org/dc/terms/created> "2010-09-22T09:06:53.056+02:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://purl.org/dc/terms/creator> "org.apache.enhancer.engines.metaxa.MetaxaEngine"^^<http://www.w3.org/2001/XMLSchema#string> .
The top-level document metadata, referenced from the TextAnnotation instance via the extracted-from property:
    <http://localhost:8080/store/content/mf_example.htm>
        a <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument> ;
        <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains> <urn:rnd:-9e25553:12b3843df43:-7ffe> ;
        <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#description> "Cheap Flights to Tenerife, Arrecife, Paphos, Mahon, Las Palmas, Malaga, Alicante, Faro, Heraklion, Palma and the rest of the World. Flightline searches over 100 Airlines and 30,000 Hotels. ABTA, IATA, ATOL Bonded." ;
        <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#keyword> "travel" , "bargain flights" , "late deals" , "hotels" , "air tickets" , "air fares" , "discount travel" , "last minute flights" , "cheap airlines" , "cheap holidays" , "cheap flights" , "flightline" , "hotel reservations" , "discount flights" , "air travel" , "package holidays" ;
        <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title> "Flightline | Cheap Flights, Package Holidays, Hotels, Travel Insurance & More" .
NOTE: The extracted plain text is no longer added to the metadata of the ContentItem; instead it is stored in its own ContentPart with the media type "text/plain". Both the RESTful service and the Java API allow this data to be requested. See their documentation for details.
Embedded hCard microformat data referenced via the nie:contains property:
    <urn:rnd:-9e25553:12b3843df43:-7ffe>
        a <http://www.w3.org/2006/vcard/ns#VCard> ;
        <http://www.w3.org/2006/vcard/ns#adr> <urn:rnd:-9e25553:12b3843df43:-7ffc> ;
        <http://www.w3.org/2006/vcard/ns#fn> "Flightgeoline Essex Limited" ;
        <http://www.w3.org/2006/vcard/ns#geo> <urn:rnd:-9e25553:12b3843df43:-7ffb> ;
        <http://www.w3.org/2006/vcard/ns#org> <urn:rnd:-9e25553:12b3843df43:-7ffd> ;
        <http://www.w3.org/2006/vcard/ns#photo> <https://www.flightline.co.uk/common/images/building_banner_sm.jpg> ;
        <http://www.w3.org/2006/vcard/ns#url> <http://www.flightline.co.uk> ;
        <http://www.w3.org/2006/vcard/ns#workTel> <tel:0800541541> .

    <urn:rnd:-9e25553:12b3843df43:-7ffd>
        a <http://www.w3.org/2006/vcard/ns#Organization> ;
        <http://www.w3.org/2006/vcard/ns#organization-name> "Flightline Essex Limited" .

    <urn:rnd:-9e25553:12b3843df43:-7ffc>
        a <http://www.w3.org/2006/vcard/ns#Address> ;
        <http://www.w3.org/2006/vcard/ns#countryName> "UK" ;
        <http://www.w3.org/2006/vcard/ns#extendedAddress> "Flightline House" ;
        <http://www.w3.org/2006/vcard/ns#locality> "Westcliff-on-Sea" ;
        <http://www.w3.org/2006/vcard/ns#postalCode> "SS0 7JE" ;
        <http://www.w3.org/2006/vcard/ns#region> "Essex" ;
        <http://www.w3.org/2006/vcard/ns#streetAddress> "32-38 Milton Road" .

    <urn:rnd:-9e25553:12b3843df43:-7ffb>
        a <http://www.w3.org/2006/vcard/ns#Location> ;
        <http://www.w3.org/2006/vcard/ns#latitude> "51.53894902845868" ;
        <http://www.w3.org/2006/vcard/ns#longitude> "0.700753927230835" .
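The three levels in the example above are linked by the extracted-from property and then by nie:contains. The following minimal Python sketch shows one way a client might walk those links. It assumes the triples have already been parsed into plain (subject, predicate, object) tuples; the TRIPLES subset and the helper names `objects` and `embedded_names` are illustrative only and are not part of Metaxa or Stanbol.

```python
# Predicate URIs copied from the example annotations.
EXTRACTED_FROM = "http://fise.iks-project.eu/ontology/extracted-from"
NIE_CONTAINS = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains"
VCARD_FN = "http://www.w3.org/2006/vcard/ns#fn"

# Hand-copied subset of the example, one triple per level (assumption:
# a real client would obtain these from an RDF parser instead).
TRIPLES = [
    ("urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b", EXTRACTED_FROM,
     "http://localhost:8080/store/content/mf_example.htm"),
    ("http://localhost:8080/store/content/mf_example.htm", NIE_CONTAINS,
     "urn:rnd:-9e25553:12b3843df43:-7ffe"),
    ("urn:rnd:-9e25553:12b3843df43:-7ffe", VCARD_FN,
     "Flightgeoline Essex Limited"),
]


def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def embedded_names(triples, annotation):
    """Follow annotation -> document -> embedded item and collect vcard:fn values."""
    names = []
    for document in objects(triples, annotation, EXTRACTED_FROM):
        for item in objects(triples, document, NIE_CONTAINS):
            names.extend(objects(triples, item, VCARD_FN))
    return names
```

Calling `embedded_names(TRIPLES, "urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b")` returns the vcard:fn literals of every hCard reachable from that annotation.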
Supported document types
The set of extraction engines for specific document types is defined by the resource extractionregistry.xml. Each engine specifies what MIME types it can handle. By default the extraction registry provides extractors for the following set of document formats:
- Office documents:
  - MS-Works
  - MS-Office
  - Excel
  - PowerPoint
  - Word
  - Visio
  - OpenDocument
  - OpenXml
  - Publisher
  - Corel-Presentations
  - QuattroPro
  - WordPerfect
- Multimedia documents:
  - JPG
  - MP3
- (X)HTML, including the following embedded structures and microformats, as defined by the default resource htmlextractors.xml:
  - RDFa
  - geo
  - hAtom
  - hCal
  - hCard
  - hReview
  - rel-license
  - rel-tag
  - xFolk
- Other:
  - RTF
  - Plain Text
  - XML
Textual Content
The extracted plain text is no longer added to the metadata of the ContentItem; it is stored in its own ContentPart with the media type "text/plain".
The following POST request to the Enhancer returns the plain text version of the parsed content directly:
    curl -v -X POST -H "Accept: text/plain" \
        -H "Content-type: text/html; charset=UTF-8" \
        --data "<html><body><p>The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley.</p></body></html>" \
        "http://localhost:8080/enhancer/chain/language?omitMetadata=true"
It is also possible to request both the extracted metadata and the plain text version. See the documentation of the RESTful API (http://localhost:8080/enhancer if Stanbol runs on localhost).
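For clients that do not use curl, the same plain-text request can be assembled with Python's standard library. This is only a sketch under the assumptions of the curl example: the endpoint path, the omitMetadata parameter and the headers are taken from it, while the function names `build_plain_text_request` and `fetch_plain_text` are hypothetical, and the response is assumed to be UTF-8 encoded.

```python
import urllib.parse
import urllib.request


def build_plain_text_request(html, base_url="http://localhost:8080"):
    """Build (but do not send) the POST request from the curl example."""
    query = urllib.parse.urlencode({"omitMetadata": "true"})
    url = base_url + "/enhancer/chain/language?" + query
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={
            "Accept": "text/plain",
            "Content-Type": "text/html; charset=UTF-8",
        },
        method="POST",
    )


def fetch_plain_text(html, base_url="http://localhost:8080"):
    """Send the request to a running Stanbol instance and return the text.

    Assumption: the server answers with UTF-8 encoded plain text.
    """
    request = build_plain_text_request(html, base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it possible to inspect the URL and headers without a running server.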
NOTE: Previous versions of this engine stored the plain text version directly in the metadata of the ContentItem, using the "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent" property. This is no longer supported.
Vocabularies
Metaxa uses a set of vocabularies ("ontologies") for structured data representation.
Aperture Core Ontologies
These ontologies belong to the underlying Aperture subsystem, contained in the package
org.semanticdesktop.aperture.vocabulary
The most important ones with respect to top-level document properties are
- NIE (Nepomuk Information Element): http://www.semanticdesktop.org/ontologies/2007/01/19/nie#
- NFO (Nepomuk File Object): http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#
Documentation of Aperture's core ontologies is provided in Aperture's Javadoc http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html for the packages in
org.semanticdesktop.aperture.vocabulary.
HTML Microformat Extractors
The following table describes which vocabularies are used for representing microformat data in Metaxa:
| Microformat | Vocabulary (Namespace) |
|---|---|
| geo | wgs84 (http://www.w3.org/2003/01/geo/wgs84_pos#) |
| hAtom | atom (http://www.w3.org/2005/Atom#) |
| | tagging (http://aperture.sourceforge.net/ontologies/tagging#) |
| hCal | ical (http://www.w3.org/2002/12/cal/icaltzd#) |
| | vcard (http://www.w3.org/2006/vcard/ns#) |
| hCard | vcard (http://www.w3.org/2006/vcard/ns#) |
| hReview | review (http://www.purl.org/stuff/rev#) |
| | wgs84 (http://www.w3.org/2003/01/geo/wgs84_pos#) |
| | dc (http://purl.org/dc/elements/1.1/) |
| | dcterms (http://purl.org/dc/dcmitype/) |
| | foaf (http://xmlns.com/foaf/0.1/) |
| | vcard (http://www.w3.org/2006/vcard/ns#) |
| | tag (http://www.holygoat.co.uk/owl/redwood/0.1/tags/) |
| rel-license | dc (http://purl.org/dc/elements/1.1/) |
| rel-tag | tagging (http://aperture.sourceforge.net/ontologies/tagging#) |
| xFolk | nfo (http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#) |
| | dc (http://purl.org/dc/elements/1.1/) |
| | tagging (http://aperture.sourceforge.net/ontologies/tagging#) |
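When reading Metaxa output, it can help to shorten full property URIs to the prefixes used in the table. The sketch below is one possible way to do that in Python; the namespace URIs are copied from the table as given, and the names `NAMESPACES` and `compact` are hypothetical helpers, not part of Metaxa.

```python
# Prefix -> namespace, copied from the microformat vocabulary table.
NAMESPACES = {
    "wgs84": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "atom": "http://www.w3.org/2005/Atom#",
    "tagging": "http://aperture.sourceforge.net/ontologies/tagging#",
    "ical": "http://www.w3.org/2002/12/cal/icaltzd#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "review": "http://www.purl.org/stuff/rev#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/dcmitype/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "tag": "http://www.holygoat.co.uk/owl/redwood/0.1/tags/",
    "nfo": "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#",
}


def compact(uri):
    """Return 'prefix:local' if the URI falls in a known namespace, else the URI."""
    for prefix, namespace in NAMESPACES.items():
        if uri.startswith(namespace):
            return prefix + ":" + uri[len(namespace):]
    return uri
```

For instance, `compact("http://www.w3.org/2006/vcard/ns#fn")` yields `vcard:fn`, while URIs outside the listed namespaces are returned unchanged.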
Configuration options
By default, Metaxa uses the extractors specified in the resource "extractionregistry.xml" and, for HTML pages, the resource "htmlextractors.xml". Alternative configurations and extractors can be attached to Metaxa as fragment bundles that declare the following host bundle:
Fragment-Host: org.apache.stanbol.enhancer.engines.metaxa
The alternative configuration files can then be set as values of the following properties:
- org.apache.stanbol.enhancer.engines.metaxa.extractionregistry
- org.apache.stanbol.enhancer.engines.metaxa.htmlextractors
Usage
Assuming that the Stanbol endpoint with the full launcher is running at
http://localhost:8080
and the engine is activated, commands like the following can be used from the command line to submit a file as a content item. The MIME type given in the Content-Type header must match the document type:
- stateless interface:
    curl -i -X POST -H "Content-Type:text/html" -T testpage.html http://localhost:8080/engines
- stateful interface:
    curl -i -X PUT -H "Content-Type:text/html" -T testpage.html http://localhost:8080/contenthub/content/someFileId
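Because the Content-Type header has to match the document type, a client script may want to derive it from the file name. The following Python sketch illustrates this for the stateless interface. It assumes the /engines endpoint from the curl command above; the function name `build_engines_request` is hypothetical, and guessing the type from the file extension is a simplification that will not cover every format Metaxa supports.

```python
import mimetypes
import urllib.request


def build_engines_request(path, content, base_url="http://localhost:8080"):
    """Build (but do not send) a POST to the stateless /engines endpoint.

    The MIME type is guessed from the file extension; an unknown
    extension raises ValueError rather than sending a wrong type.
    """
    content_type, _encoding = mimetypes.guess_type(path)
    if content_type is None:
        raise ValueError("cannot determine MIME type for " + path)
    return urllib.request.Request(
        base_url + "/engines",
        data=content,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

The resulting request can be passed to `urllib.request.urlopen` once a Stanbol instance is running at the given base URL.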
Alternatively, the Stanbol web interface can be used for submitting documents and viewing the metadata at
http://localhost:8080/contenthub