This project has retired. For details please refer to its Attic page.
Apache Stanbol - The Metaxa Enhancement Engine: extracting content and metadata from various formats

The Metaxa Enhancement Engine extracts embedded metadata and textual content from a large variety of document types and formats. The text extraction functionality also makes Metaxa suitable as a pre-processor for other components, especially NLP processors and indexing for search.

Technical description

The engine is based on the Aperture framework, with new extensions for handling structured content embedded in HTML web content, such as Microformats and RDFa. In addition, some of Aperture's original extractors were replaced by other engines that use different base libraries. Metaxa introduces a single TextAnnotation instance that refers to the content item through its extracted-from property. The metadata extracted by Metaxa are ascribed directly to the content item (the document), since they represent document properties rather than text annotations. Several ontologies are used to describe the different types of metadata; an overview is given in the Vocabularies section below.

The Metaxa annotations are structured in three levels, illustrated in the following example.

The top-level TextAnnotation instance:

<urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b>
     a       <http://fise.iks-project.eu/ontology/TextAnnotation> ,
             <http://fise.iks-project.eu/ontology/Enhancement> ;
     <http://fise.iks-project.eu/ontology/confidence>
             "1.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
     <http://fise.iks-project.eu/ontology/extracted-from>
             <http://localhost:8080/store/content/mf_example.htm> ;
     <http://purl.org/dc/terms/created>
             "2010-09-22T09:06:53.056+02:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
     <http://purl.org/dc/terms/creator>
              "org.apache.enhancer.engines.metaxa.MetaxaEngine"^^<http://www.w3.org/2001/XMLSchema#string> .

The top-level document metadata, referenced from the TextAnnotation instance via the extracted-from property:

<http://localhost:8080/store/content/mf_example.htm>
     a       <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument> ;
     <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains>
             <urn:rnd:-9e25553:12b3843df43:-7ffe> ;
     <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#description>
             "Cheap Flights to Tenerife, Arrecife, Paphos, Mahon, Las Palmas, Malaga, Alicante, Faro, Heraklion, Palma and the rest of the World. Flightline searches over 100 Airlines and 30,000 Hotels. ABTA, IATA, ATOL Bonded." ;
     <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#keyword>
             "travel" , "bargain flights" , "late deals" , "hotels" , "air tickets" , "air fares" , "discount travel" , "last minute flights" , "cheap airlines" , "cheap holidays" , "cheap flights" , "flightline" , "hotel reservations" , "discount flights" , "air travel" , "package holidays" ;
     <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title>
             "Flightline | Cheap Flights, Package Holidays, Hotels, Travel Insurance &amp; More" .

NOTE: The extracted plain text is no longer added to the metadata of the ContentItem; it is stored in a separate ContentPart with the media type "text/plain". Both the RESTful service and the Java API allow this data to be requested. See the corresponding documentation for details.

Embedded hCard microformat data referenced via the nie:contains property:

<urn:rnd:-9e25553:12b3843df43:-7ffe>
     a       <http://www.w3.org/2006/vcard/ns#VCard> ;
     <http://www.w3.org/2006/vcard/ns#adr>
           <urn:rnd:-9e25553:12b3843df43:-7ffc> ;
     <http://www.w3.org/2006/vcard/ns#fn>
           "Flightgeoline Essex Limited" ;
     <http://www.w3.org/2006/vcard/ns#geo>
           <urn:rnd:-9e25553:12b3843df43:-7ffb> ;
    <http://www.w3.org/2006/vcard/ns#org>
           <urn:rnd:-9e25553:12b3843df43:-7ffd> ;
    <http://www.w3.org/2006/vcard/ns#photo>
           <https://www.flightline.co.uk/common/images/building_banner_sm.jpg> ;
    <http://www.w3.org/2006/vcard/ns#url>
           <http://www.flightline.co.uk> ;
    <http://www.w3.org/2006/vcard/ns#workTel>
           <tel:0800541541> .

<urn:rnd:-9e25553:12b3843df43:-7ffd>
     a       <http://www.w3.org/2006/vcard/ns#Organization> ;
     <http://www.w3.org/2006/vcard/ns#organization-name>
           "Flightline Essex Limited" .

<urn:rnd:-9e25553:12b3843df43:-7ffc>
     a       <http://www.w3.org/2006/vcard/ns#Address> ;
     <http://www.w3.org/2006/vcard/ns#countryName>
           "UK" ;
     <http://www.w3.org/2006/vcard/ns#extendedAddress>
          "Flightline House" ;
     <http://www.w3.org/2006/vcard/ns#locality>
          "Westcliff-on-Sea" ;
     <http://www.w3.org/2006/vcard/ns#postalCode>
          "SS0 7JE" ;
     <http://www.w3.org/2006/vcard/ns#region>
          "Essex" ;
     <http://www.w3.org/2006/vcard/ns#streetAddress>
          "32-38 Milton Road" .

<urn:rnd:-9e25553:12b3843df43:-7ffb>
     a       <http://www.w3.org/2006/vcard/ns#Location> ;
     <http://www.w3.org/2006/vcard/ns#latitude>
          "51.53894902845868" ;
     <http://www.w3.org/2006/vcard/ns#longitude>
          "0.700753927230835" .
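
The three levels can be followed programmatically by starting at the TextAnnotation, following extracted-from to the document, and then following nie:contains to the embedded item. The Python sketch below illustrates this on a hand-copied subset of the triples above, held as plain tuples. The helper names (objects, subjects_of_type, embedded_items) are illustrative only and not part of Stanbol; a real client would more likely parse the enhancer's RDF response with an RDF library.

```python
# A hand-copied subset of the example triples as (subject, predicate, object).
# This is an illustration only: no RDF parsing happens here.
FISE = "http://fise.iks-project.eu/ontology/"
NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
VCARD = "http://www.w3.org/2006/vcard/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

ANNOTATION = "urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b"
DOCUMENT = "http://localhost:8080/store/content/mf_example.htm"
HCARD = "urn:rnd:-9e25553:12b3843df43:-7ffe"

TRIPLES = [
    (ANNOTATION, RDF_TYPE, FISE + "TextAnnotation"),
    (ANNOTATION, FISE + "extracted-from", DOCUMENT),
    (DOCUMENT, NIE + "contains", HCARD),
    (HCARD, RDF_TYPE, VCARD + "VCard"),
    (HCARD, VCARD + "url", "http://www.flightline.co.uk"),
]


def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects_of_type(rdf_class):
    """Return all subjects that have the given rdf:type."""
    return [s for s, p, o in TRIPLES if p == RDF_TYPE and o == rdf_class]


def embedded_items():
    """Yield (annotation, document, embedded item) for each three-level path."""
    for annotation in subjects_of_type(FISE + "TextAnnotation"):
        for document in objects(annotation, FISE + "extracted-from"):
            for item in objects(document, NIE + "contains"):
                yield annotation, document, item


for annotation, document, item in embedded_items():
    print(document, "->", item, objects(item, RDF_TYPE))
```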

Supported document types

The set of extraction engines for specific document types is defined by the resource extractionregistry.xml. Each engine specifies what MIME types it can handle. By default the extraction registry provides extractors for the following set of document formats:

Textual Content

The extracted plain text is no longer added to the metadata of the ContentItem; it is stored in a separate ContentPart with the media type "text/plain".

The following POST request to the Enhancer directly requests the plain text version of the parsed content:

curl -v -X POST -H "Accept: text/plain" \
    -H "Content-type: text/html; charset=UTF-8" \
    --data "<html><body><p>The Stanbol enhancer can detect \
      famous cities such as Paris and people such as Bob Marley.</p></body></html>" \
    "http://localhost:8080/enhancer/chain/language?omitMetadata=true"
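
When Metaxa is used as a pre-processor, the same request may be issued from code. The following is a minimal Python sketch using only the standard library. The function names are illustrative, the default chain "language" is simply the one from the curl example above, and the response is assumed to be UTF-8 encoded.

```python
import urllib.parse
import urllib.request


def build_plain_text_request(html, base_url="http://localhost:8080", chain="language"):
    """Build (but do not send) a POST asking the enhancer for plain text only."""
    query = urllib.parse.urlencode({"omitMetadata": "true"})
    url = f"{base_url}/enhancer/chain/{chain}?{query}"
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={
            "Accept": "text/plain",
            "Content-Type": "text/html; charset=UTF-8",
        },
        method="POST",
    )


def fetch_plain_text(html, **kwargs):
    """Send the request and return the body; needs a running Stanbol instance."""
    request = build_plain_text_request(html, **kwargs)
    with urllib.request.urlopen(request) as response:
        # Assumption: the plain text comes back UTF-8 encoded.
        return response.read().decode("utf-8")
```

Calling fetch_plain_text("<html><body><p>Some text.</p></body></html>") against a local instance should then return the extracted text, which can be handed on to an NLP or indexing step.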

It is also possible to request both the extracted metadata and the plain text version. See the documentation of the RESTful API (http://localhost:8080/enhancer if Stanbol runs on localhost).

NOTE: Previous versions of this engine stored the plain text version directly in the metadata of the ContentItem, using the "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent" property. This is no longer supported.

Vocabularies

Metaxa uses a set of vocabularies ("ontologies") for structured data representation.

Aperture Core Ontologies

These ontologies belong to the underlying Aperture subsystem, contained in the package

org.semanticdesktop.aperture.vocabulary

The most important ones with respect to top-level document properties are

Documentation of Aperture's core ontologies is provided in Aperture's Javadoc (http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html) for the packages in

org.semanticdesktop.aperture.vocabulary.

HTML Microformat Extractors

The following table describes which vocabularies are used for representing microformat data in Metaxa:

MF            Vocabulary (Namespace)
geo           wgs84 (http://www.w3.org/2003/01/geo/wgs84_pos#)
hAtom         atom (http://www.w3.org/2005/Atom#)
              tagging (http://aperture.sourceforge.net/ontologies/tagging#)
hCal          ical (http://www.w3.org/2002/12/cal/icaltzd#)
              vcard (http://www.w3.org/2006/vcard/ns#)
hCard         vcard (http://www.w3.org/2006/vcard/ns#)
hReview       review (http://www.purl.org/stuff/rev#)
              wgs84 (http://www.w3.org/2003/01/geo/wgs84_pos#)
              dc (http://purl.org/dc/elements/1.1/)
              dcterms (http://purl.org/dc/dcmitype/)
              foaf (http://xmlns.com/foaf/0.1/)
              vcard (http://www.w3.org/2006/vcard/ns#)
              tag (http://www.holygoat.co.uk/owl/redwood/0.1/tags/)
rel-license   dc (http://purl.org/dc/elements/1.1/)
rel-tag       tagging (http://aperture.sourceforge.net/ontologies/tagging#)
xFolk         nfo (http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#)
              dc (http://purl.org/dc/elements/1.1/)
              tagging (http://aperture.sourceforge.net/ontologies/tagging#)

Configuration options

By default, Metaxa uses the extractors specified in the resource "extractionregistry.xml" and, for HTML pages, those in the resource "htmlregistry.xml". Alternative configurations and extractors can be attached to Metaxa as fragment bundles that specify the Metaxa engine as their host bundle:

Fragment-Host: org.apache.stanbol.enhancer.engines.metaxa

The alternative configuration files then can be set as values of the properties

Usage

Assuming that the Stanbol endpoint with the full launcher is running at

http://localhost:8080

and the engine is activated, a file can be submitted as a content item from the command line with a command like the following, where the MIME type must match the document type:
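
As a hedged illustration of such a submission, the Python sketch below builds a POST request whose Content-Type is guessed from the file name, so that the MIME type matches the document type. It assumes the /enhancer endpoint mentioned earlier in this document. The function name build_submission and the file name testpage.html are hypothetical.

```python
import mimetypes
import urllib.request


def build_submission(filename, content, endpoint="http://localhost:8080/enhancer"):
    """Build (but do not send) a POST submitting `content` as a content item.

    The MIME type is guessed from the file extension; the engine can only
    pick a suitable extractor if this type matches the actual document type.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"cannot guess a MIME type for {filename!r}")
    return urllib.request.Request(
        endpoint,
        data=content,
        headers={"Content-Type": mime_type},
        method="POST",
    )


def submit(filename, content, **kwargs):
    """Send the request and return the raw response body (needs a running Stanbol)."""
    with urllib.request.urlopen(build_submission(filename, content, **kwargs)) as response:
        return response.read()
```

With a running instance, a call such as submit("testpage.html", open("testpage.html", "rb").read()) would post the file and return the enhancement results in the server's default format.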

Alternatively, the Stanbol web interface can be used for submitting documents and viewing the metadata at

http://localhost:8080/contenthub