Content Item
The ContentItem is the object which represents the content to be enhanced by Apache Stanbol. It is created based on the data provided by the enhancement request and used throughout the enhancement process to store results. Therefore, after the enhancement process has finished, the ContentItem represents the result of the Apache Stanbol enhancement process. ContentItem instances are created by using the ContentItemFactory service.
The following section describes the interface of the ContentItem in detail:
Content Parts
Content parts are used to represent the original content as well as transformations of the original content (typically created by pre-processing enhancement engines such as the Metaxa engine).
The ContentItem provides the following API to work with content parts:
    /** Getter for the ContentPart based on the index */
    getPart(int index, Class<T> type) : T
    /** Getter for the ContentPart based on its ID */
    getPart(UriRef uri, Class<T> type) : T
    /** Getter for the ID based on the index */
    getPartUri(int index) : UriRef
    /** Adds a new ContentPart to the content item */
    addPart(UriRef uri, Object part) : Object
Content parts are accessible by their index and by their URI-formatted ID. Re-adding a content part under an existing URI replaces the old one; its index is not changed by this operation.
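The replace-in-place behaviour can be illustrated with a small stand-alone model. This is only a sketch: `PartStore` is a hypothetical class built on `java.util.LinkedHashMap`, not the Stanbol implementation. URIs are plain strings instead of `UriRef` instances, and returning the previously registered part from `addPart` is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the content part storage of a ContentItem. */
class PartStore {
    // LinkedHashMap keeps insertion order; re-inserting an existing key
    // replaces the value without changing the position of the key.
    private final Map<String, Object> parts = new LinkedHashMap<>();

    /** Adds or replaces a part; returns the replaced part or null (assumption). */
    Object addPart(String uri, Object part) {
        return parts.put(uri, part);
    }

    /** Looks a part up by its URI and casts it to the requested type. */
    <T> T getPart(String uri, Class<T> type) {
        return type.cast(parts.get(uri));
    }

    /** Looks a part up by its index and casts it to the requested type. */
    <T> T getPart(int index, Class<T> type) {
        return getPart(getPartUri(index), type);
    }

    /** Returns the URI registered at the given index. */
    String getPartUri(int index) {
        return new ArrayList<>(parts.keySet()).get(index);
    }
}
```

Adding a second value under an already registered URI leaves `getPartUri(0)` unchanged while `getPart(0, ...)` returns the new value.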
There are two types of content parts:
- Content parts which have additional metadata provided within the metadata of the content item. Such content parts are typically used to store transformed versions of the original content. This allows e.g. engines which can only process plain text versions to query for the content part containing this version of the passed document.
- Content parts that are registered under a predefined URI. Such content parts are typically not mentioned within the metadata of the content item. This is used to share intermediate enhancement results between enhancement engines. An example would be tokens, sentences, POS tags and chunks that are extracted by some NLP engine. Engines which want to consume such data need to know the predefined URI of the content part holding this data. They will check within the canEnhance(..) method whether a content part with the expected URI is present and whether it has the correct type.
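A minimal sketch of such a presence-and-type check, written against a plain `Map` rather than the real ContentItem. The URI `urn:example:nlp.tokens` and the helper `hasPart` are made up for illustration; a real engine would perform the equivalent lookup through the content part API of the ContentItem.

```java
import java.util.Map;

class PartCheck {
    /** Hypothetical predefined URI under which an NLP engine publishes tokens. */
    static final String TOKENS_URI = "urn:example:nlp.tokens";

    /** True if a part is registered under the URI and has the expected type. */
    static boolean hasPart(Map<String, Object> parts, String uri, Class<?> type) {
        Object part = parts.get(uri);
        // isInstance(null) is false, so a missing part fails the check as well
        return type.isInstance(part);
    }
}
```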
Accessing the main content of the ContentItem
The main content of the ContentItem refers to the content passed by the enhancement request (or downloaded from the URL provided by a request). The following methods are available for accessing this content:
    /** Getter for the InputStream of the content as passed for the ContentItem */
    + getStream() : InputStream
    /** Getter for the mime type of the content */
    + getMimeType() : String
    /** Getter for the Content as Blob */
    + getBlob() : Blob
The getStream() and getMimeType() methods are shortcuts for the corresponding methods of the content item's blob object. Calling contentItem.getBlob().getStream() will return an InputStream over the exact same content as directly calling getStream() on the content item. Note that the blob interface also provides a getParameter() method, which allows retrieving mime-type parameters such as the charset of textual content.
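For textual content, the charset parameter determines how the bytes of the stream are decoded. The following sketch uses only JDK classes: the `Map` stands in for the mime-type parameters of the blob, and falling back to UTF-8 when no charset is declared is an assumption of this example, not documented behaviour.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class BlobText {
    /** Reads the whole stream and decodes it with the declared charset. */
    static String read(InputStream in, Map<String, String> mimeParams) {
        String name = mimeParams.get("charset");
        // Assumption of this sketch: default to UTF-8 if no charset is declared.
        Charset charset = name == null ? StandardCharsets.UTF_8 : Charset.forName(name);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
```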
The content passed by the user is stored, in the form of a blob, as the content part at index '0' with the URI of the content item. Therefore, each of the following calls returns the same blob instance:
    contentItem.getPart(0, Blob.class)
    contentItem.getPart(contentItem.getUri(), Blob.class)
    contentItem.getBlob()
Metadata of the ContentItem
The metadata of the ContentItem is managed by a lockable MGraph. This is basically a normal java.util.Collection of triples. The only RDF-specific feature is the support for filtered iterators, which accept wildcards for subjects, predicates and objects.
This graph is used to store all enhancement results as well as metadata about the content item (such as content parts) and the enhancement process (see execution metadata).
Read/Write locks
During the Apache Stanbol enhancement process, as executed by the enhancement job manager, components running in multiple threads need to access the state of the ContentItem. Because of that, the content item provides the possibility to acquire locks.
    /** Getter for the ReadWriteLock of a ContentItem */
    + getLock() : java.util.concurrent.locks.ReadWriteLock
Note also that
    contentItem.getLock()
    contentItem.getMetadata().getLock()
will return the same ReadWriteLock instance.
This lock can be used to request read/write locks on the content item. Implementations need to protect all methods of the content item, and also of the MGraph holding the metadata, by using this lock. This means that users who do not need to protect whole sections of code do not need to bother with locks. Typical examples are working with content parts, final classes like Blob, or adding/removing a triple from the metadata.
However, whenever components need to ensure that the data is not changed by other threads while they perform some calculation, read/write locks must be used. A typical example is iterating over data returned by the MGraph. In this case, the code iterating over the results should be protected against concurrent changes as follows:
    contentItem.getLock().readLock().lock();
    try {
        Iterator<Triple> it = contentItem.getMetadata()
            .filter(null, RDF.TYPE, TechnicalClasses.ENHANCER_TEXTANNOTATION);
        while (it.hasNext()) {
            log.debug("Process TextAnnotation: {}", it.next().getSubject());
            // read the needed information
        }
    } finally {
        // always release the lock, even if processing fails
        contentItem.getLock().readLock().unlock();
    }
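Compound modifications follow the same lock/try/finally pattern, but with the write lock. The sketch below is self-contained and uses a `ReentrantReadWriteLock` guarding a plain list in place of the content item and its metadata; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class GuardedList {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> values = new ArrayList<>();

    /** Check-then-add must be atomic, so it runs under the write lock. */
    boolean addIfAbsent(String value) {
        lock.writeLock().lock();
        try {
            if (values.contains(value)) {
                return false;
            }
            values.add(value);
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Iteration runs under the read lock to exclude concurrent writers. */
    int totalLength() {
        lock.readLock().lock();
        try {
            int sum = 0;
            for (String v : values) {
                sum += v.length();
            }
            return sum;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```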
There is one exception to this rule when accessing content items within an enhancement engine. If an engine declares that it only supports the SYNCHRONOUS enhancement mode, then the enhancement job manager needs to ensure that the engine has exclusive access to the ContentItem. In this case, implementors of enhancement engines need not care about using read/write locks.
ContentItemFactory
Since version 0.10.0, ContentItems and Blobs are created by using the ContentItemFactory. ContentItemFactory implementations register themselves as OSGi services. The StanbolEnhancer uses the implementation with the highest "service.ranking" to create instances. Two implementations are available by default, an in-memory one and a file-based one, with the in-memory implementation used as the default.
Most users will not need to change the default ContentItem implementation. However, if the Enhancer is used to extract metadata from big media files (such as EXIF metadata from big images, ID3 from MP3 files, etc.), then changing the default from the InMemoryContentItemFactory to the FileContentItemFactory might considerably reduce the memory footprint.
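Selection by "service.ranking" amounts to picking the registered implementation with the largest ranking value. The stand-alone sketch below models this with a plain map. The ranking numbers are invented for illustration, and the real lookup is performed by the OSGi service registry, not by code like this.

```java
import java.util.Map;
import java.util.Optional;

class RankingSelection {
    /** Returns the name of the entry with the highest ranking, if any. */
    static Optional<String> highestRanked(Map<String, Integer> rankings) {
        return rankings.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

With rankings such as InMemoryContentItemFactory=100 and FileContentItemFactory=50 (made-up values) the in-memory factory is selected; raising the ranking of the file-based factory above 100 would switch the selection.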
With the introduction of the ContentItemFactory, all ContentItem implementation-specific constructors for passing content were deprecated and replaced by the following three interfaces:
- ContentSource allows passing content that is available as a stream, byte array or string.
- ContentReference allows passing a reference (e.g. a URL) to the content of a ContentItem. The dereference() method of this interface is used by the ContentItemFactory to convert a ContentReference to a ContentSource.
- ContentSink allows obtaining an OutputStream to an initially empty Blob that can later be used to stream the content. This is intended to be used by EnhancementEngines that need to convert content from one format to another, because it avoids caching the converted content in memory.
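The benefit of writing to an OutputStream can be seen in a conversion that never holds the whole document in memory. The sketch below uses only JDK classes and re-encodes text in fixed-size chunks. In an engine, the `OutputStream` would be the one obtained from a ContentSink; the exact method names of that interface are not shown here, and `StreamingConversion` is a hypothetical helper.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class StreamingConversion {
    /** Re-encodes text from one charset to another, one chunk at a time. */
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to) {
        try (Reader reader = new InputStreamReader(in, from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] chunk = new char[4096]; // only this buffer is kept in memory
            int n;
            while ((n = reader.read(chunk)) != -1) {
                writer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```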
Multipart MIME serialization
Stanbol supports the serialization of content items as multipart MIME. This serialization is used by the RESTful API of the Stanbol Enhancer. This section provides details about how content items are represented using multipart MIME. For more information on how to send/receive multipart content items via the RESTful Services provided by the Stanbol Enhancer please see the documentation provided in the web interface (e.g. at http://localhost:8080/enhancer).
The following figure provides an overview on how ContentItems are represented using MultiPart MIME.
ContentItem Container
- ContentItems are contained within a "multipart/form-data" container
- Apache Stanbol uses "contentItem" as "boundary", but users may use any other value as long as the "boundary" parameter in the "Content-Type" header is set correctly.
- Stanbol uses UTF-8 as charset, but users might use any supported encoding as long as the "charset" parameter in the "Content-Type" header is set accordingly.
The default Content-Type for serialized ContentItems is therefore "multipart/form-data; boundary=contentItem; charset=UTF-8"
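Because both the boundary and the charset may differ from these defaults, a receiver has to read them from the "Content-Type" header. The following is a deliberately naive sketch of that step using only JDK string handling; `ContentTypeParams` is a hypothetical helper, and it does not handle quoted values that themselves contain a semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class ContentTypeParams {
    /** Splits a Content-Type value into lower-cased parameter names and their values. */
    static Map<String, String> parse(String contentType) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] tokens = contentType.split(";");
        for (int i = 1; i < tokens.length; i++) { // tokens[0] is the media type
            String[] kv = tokens[i].split("=", 2);
            if (kv.length != 2) {
                continue; // skip malformed parameters
            }
            String value = kv[1].trim();
            // strip optional surrounding double quotes
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            params.put(kv[0].trim().toLowerCase(Locale.ROOT), value);
        }
        return params;
    }
}
```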
Enhancement Metadata
- If present this MUST BE the first MIME part within the "multipart/form-data" container representing the ContentItem.
- The "name" parameter of the "Content-Disposition" header MUST BE "metadata"
- If the "fileName" parameter of the "Content-Disposition" header is present, it MUST BE the URI of the ContentItem. Users are typically required to set this parameter when they want to pass existing metadata with enhancement requests. This is because in such cases it is important that the URI of the ContentItem created by the Stanbol Enhancer is equal to the URI used to describe the content within the passed metadata. The Stanbol Enhancer MUST set the "fileName" parameter of the metadata to the URI of the processed ContentItem.
- The "Content-Type" of the metadata can be any RDF serialization supported by Apache Stanbol. UTF-8 is used as default charset.
- The RDF data serialized in this MIME part represent the enhancement results.
Content
- If present the MIME part representing the Content MUST directly follow the Metadata. If the Metadata are not present the Content MUST BE the first MIME part within the "multipart/form-data" container representing the ContentItem.
- Because multiple content variants can be included within a ContentItem a "multipart/alternate" container is used to represent the content.
- The "name" parameter of the "Content-Disposition" header MUST BE "content". The "fileName" parameter is not used and therefore not present/ignored. The Stanbol Enhancer uses "contentParts" as boundary but users may use any boundary as long as it is correctly set within the "Content-Type" header.
The various content elements are contained within the "multipart/form-data" container. The ordering is important. For serialized ContentItems it is assumed that the first element is the original document for the ContentItem. All further MIME parts are considered alternate - e.g. transcoded/transformed - versions. For serialized ContentItems provided as response to requests to the Stanbol Enhancer the ordering of the MIME parts is the same as the indexes of the ContentParts in the ContentItem.
- The "name" parameter of the "Content-Disposition" header is set to the URI of the ContentPart in the ContentItem.
- The "Content-Type" header must correspond to the media type of the content.
Note that users who want to send a single ContentPart AND Metadata to the Stanbol Enhancer can also directly add the content to the "multipart/form-data" container of the ContentItem. In this case the "name" parameter MUST still be set to "content", but the "Content-Type" header needs to be set directly to the media type of the passed ContentPart. The Stanbol Enhancer does NOT use this option when serializing ContentItems. It will ALWAYS use a "multipart/alternate" container for the "content", even when only a single ContentPart is included in a response.
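Putting the rules for the container, the metadata and the content together, a serialized ContentItem has roughly the following shape. The sketch assembles such a body as a string with plain JDK classes. The `MultipartSketch` class, the URIs, the RDF payload and the `application/rdf+xml` media type are placeholders, and details such as header ordering are assumptions rather than the exact output of the Stanbol Enhancer.

```java
class MultipartSketch {
    static final String CRLF = "\r\n";

    /** Builds a multipart/form-data body with a metadata part and one content part. */
    static String serialize(String itemUri, String rdf, String text) {
        StringBuilder sb = new StringBuilder();
        // 1. metadata: first part, name="metadata", fileName = URI of the ContentItem
        sb.append("--contentItem").append(CRLF)
          .append("Content-Disposition: form-data; name=\"metadata\"; fileName=\"")
          .append(itemUri).append('"').append(CRLF)
          .append("Content-Type: application/rdf+xml; charset=UTF-8").append(CRLF)
          .append(CRLF).append(rdf).append(CRLF);
        // 2. content: nested container named "content" with its own boundary
        sb.append("--contentItem").append(CRLF)
          .append("Content-Disposition: form-data; name=\"content\"").append(CRLF)
          .append("Content-Type: multipart/alternate; boundary=contentParts; charset=UTF-8")
          .append(CRLF).append(CRLF);
        // the first nested part is the original document, named by its part URI
        sb.append("--contentParts").append(CRLF)
          .append("Content-Disposition: form-data; name=\"").append(itemUri).append('"')
          .append(CRLF)
          .append("Content-Type: text/plain; charset=UTF-8").append(CRLF)
          .append(CRLF).append(text).append(CRLF)
          .append("--contentParts--").append(CRLF);
        // 3. closing delimiter of the outer container
        sb.append("--contentItem--").append(CRLF);
        return sb.toString();
    }
}
```

Such a body would be sent with the "Content-Type" header set to "multipart/form-data; boundary=contentItem; charset=UTF-8".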
Additional Metadata
The ContentPart API of the Stanbol ContentItem allows registering content parts of any type. The MultiPart MIME serialization of ContentItems supports the serialization of such additional parts as long as they are encoded as RDF graphs (compatible with the Clerezza TripleCollection class). Additional ContentParts which are not encoded as RDF data are currently not supported by the Multipart MIME serialization.
- MimeParts representing such ContentParts MUST BE added after the MIME parts for the "metadata" AND the "content"
- The "name" parameter of the "Content-Disposition" MUST BE set to the URI of the ContentPart in the ContentItem.
- The "Content-Type" header must correspond to the media type of the content. The Stanbol Enhancer will always use the same RDF serialization as for the "metadata" when serializing additional Metadata. Users are free to use any supported serialization as long as they set the "Content-Type" header accordingly.
- The ordering of parts representing additional Metadata is the same as the ordering (index) of the ContentParts in the ContentItem.