This project has retired. For details please refer to its Attic page.
Apache Stanbol - Content Item Factory

Content Item Factory

The ContentItemFactory is used by the Stanbol Enhancer to create ContentItem and Blob instances. ContentItemFactory implementations typically register themselves as OSGi services. The Stanbol Enhancer uses the factory implementation with the highest "service.ranking" to create ContentItems and Blobs for requests on the RESTful API. When using the Java API, any ContentItem implementation can be used.

ContentItemFactory interface

The interface of the ContentItemFactory defines the following methods to create ContentItems

+ createContentItem(ContentSource source) : ContentItem
+ createContentItem(String prefix, ContentSource source) : ContentItem
+ createContentItem(UriRef id, ContentSource source) : ContentItem
+ createContentItem(String prefix, ContentSource source, MGraph metadata) : ContentItem
+ createContentItem(UriRef id, ContentSource source, MGraph metadata) : ContentItem
+ createContentItem(ContentReference reference) : ContentItem
+ createContentItem(ContentReference reference, MGraph metadata) : ContentItem

The content for the created ContentItem can be passed using either a ContentSource or a ContentReference. The Stanbol Enhancer servicesapi module provides implementations for creating ContentSources for Java streams, byte arrays and String objects as well as ContentReferences for URLs. For details see the sections below.

The URI of the created ContentItem is determined as follows:

The ContentItemFactory also allows passing pre-existing metadata. All RDF triples in the passed MGraph are guaranteed to be added to the metadata of the created ContentItem. Note that implementations are free to either use the passed MGraph instance directly for the metadata or create a new MGraph instance and copy all triples of the passed instance into it.

The following methods of the ContentItemFactory can be used to create Blobs

+ createBlob(ContentSource source) : Blob
+ createBlob(ContentReference reference) : Blob
+ createContentSink(String mediaType) : ContentSink

The Blob interface is used by the Stanbol Enhancer to represent content. Blobs are added to ContentItems as content parts. Blobs can be created from the same ContentSource and ContentReference interfaces that are supported for the creation of ContentItems, and additionally by using a ContentSink. A ContentSink provides an OutputStream to an initially empty Blob, so that the content can be streamed to the Blob later. This is intended for EnhancementEngines that need to convert content from one format to another, because it allows them to avoid caching the converted content in memory.

ContentItem implementations

By default the Stanbol Enhancer provides two ContentItemFactory/ContentItem/Blob implementations. Users can control the implementation used by the Stanbol Enhancer by configuring the "service.ranking" property of the different ContentItemFactory implementations (e.g. via the configuration tab of the Apache Felix Web Console). The implementation with the highest "service.ranking" will be used by the Stanbol Enhancer to create ContentItems and Blobs.

In-memory ContentItem

This implementation manages content (Blobs) as byte arrays that are kept in memory. While this ensures fast access to the passed content, it might cause problems if the Stanbol Enhancer is used to process big media files. Nonetheless, this is currently the default, because in typical usage scenarios the content processed by the Stanbol Enhancer easily fits into memory.

The ContentItemFactory of this implementation registers itself with a "service.ranking" of 100 and is therefore used as default by the Stanbol Enhancer.

File-based ContentItem

This implementation differs from the in-memory one in that it stores content (Blobs) in temporary files on disk. All other information, such as the metadata or non-Blob content parts, is still kept in memory. It is intended for users that process big media files with the Stanbol Enhancer, such as TIFF images, MP3 files, rich text files including big graphics or even video files.
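The idea of keeping Blob content in a temporary file rather than in a byte array can be sketched with plain Java NIO. This is only an illustration under stated assumptions: the class TempFileBlobSketch and its methods are hypothetical and are not the classes shipped with the Stanbol Enhancer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical blob that spools its content to a temporary file (not Stanbol code). */
final class TempFileBlobSketch {
    private final Path file;

    TempFileBlobSketch(InputStream content) throws IOException {
        file = Files.createTempFile("blob-sketch", ".tmp");
        file.toFile().deleteOnExit();
        // createTempFile already created the file, so the copy has to replace it
        Files.copy(content, file, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Every call opens a fresh stream over the file; nothing is cached in memory. */
    InputStream getStream() throws IOException {
        return Files.newInputStream(file);
    }

    long getContentLength() throws IOException {
        return Files.size(file);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "some large media content".getBytes(StandardCharsets.UTF_8);
        TempFileBlobSketch blob = new TempFileBlobSketch(new ByteArrayInputStream(data));
        System.out.println(blob.getContentLength() == data.length);
    }
}
```

Only the file path is held on the heap, which is the property that makes this approach attractive for large media files.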

The ContentItemFactory of the file-based implementation is registered with a "service.ranking" of 50. To use it as the default, users need to ensure that the ranking of this implementation is higher than that of the in-memory implementation.
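The selection rule described above (the highest "service.ranking" wins) can be sketched in plain Java. The classes RankedFactory and RankingSketch are hypothetical stand-ins, not OSGi or Stanbol types, and the value 200 is an arbitrary example of a ranking above 100.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a registered factory; not a Stanbol or OSGi class. */
final class RankedFactory {
    final String name;
    final int serviceRanking;

    RankedFactory(String name, int serviceRanking) {
        this.name = name;
        this.serviceRanking = serviceRanking;
    }
}

final class RankingSketch {
    /** Returns the factory with the highest ranking. */
    static RankedFactory select(List<RankedFactory> factories) {
        return factories.stream()
                .max(Comparator.comparingInt((RankedFactory f) -> f.serviceRanking))
                .orElseThrow(() -> new IllegalStateException("no factory registered"));
    }

    public static void main(String[] args) {
        // Default rankings as stated in this document
        List<RankedFactory> defaults = Arrays.asList(
                new RankedFactory("in-memory", 100),
                new RankedFactory("file-based", 50));
        System.out.println(select(defaults).name);

        // After raising the file-based ranking above 100 (200 is an assumed value)
        List<RankedFactory> reconfigured = Arrays.asList(
                new RankedFactory("in-memory", 100),
                new RankedFactory("file-based", 200));
        System.out.println(select(reconfigured).name);
    }
}
```

With the default rankings the in-memory entry is selected; after the reconfiguration the file-based one is.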

ContentSource

This interface describes the source of some content. It defines the following API:

/** the content as stream */
+ getStream() : InputStream
/** the content as byte array */
+ getData() : byte[]
/** optionally the media type of the content */
+ getMediaType() : String
/** optionally the file name of the content */
+ getFileName() : String
/** optionally additional headers */
+ getHeaders() : Map<String,List<String>>
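To make the listing above more concrete, here is a minimal sketch of a source backed by a String. The ContentSource interface declared in the block is a stand-in that mirrors the listing so the example compiles on its own; the real interface may declare checked exceptions. The class Utf8TextSource is hypothetical and is not one of the implementations provided by the servicesapi module.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Stand-in mirroring the listing above (assumption: no checked exceptions). */
interface ContentSource {
    InputStream getStream();
    byte[] getData();
    String getMediaType();
    String getFileName();
    Map<String, List<String>> getHeaders();
}

/** Hypothetical source backed by a String that is encoded as UTF-8. */
final class Utf8TextSource implements ContentSource {
    private final byte[] data;

    Utf8TextSource(String text) {
        this.data = text.getBytes(StandardCharsets.UTF_8);
    }

    @Override public InputStream getStream() { return new ByteArrayInputStream(data); }
    // Hands out the internal array without copying it
    @Override public byte[] getData() { return data; }
    @Override public String getMediaType() { return "text/plain; charset=UTF-8"; }
    // The file name is optional and unknown for plain strings
    @Override public String getFileName() { return null; }
    @Override public Map<String, List<String>> getHeaders() { return Collections.emptyMap(); }
}

final class ContentSourceSketch {
    public static void main(String[] args) {
        ContentSource source = new Utf8TextSource("Paris is the capital of France.");
        System.out.println(source.getMediaType());
        System.out.println(source.getData().length + " bytes");
    }
}
```

Because the sketch already holds a byte array, getData() can return it without a further copy, whereas getStream() only wraps it.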

The ContentSource interface defines methods for obtaining the wrapped content as InputStream and byte[]. This is mainly to avoid unnecessary copying of content. Implementors of ContentItems SHOULD prefer to call

The following implementations of this interface are provided by the Stanbol Enhancer servicesapi module

ContentReference

This interface describes content that is not yet locally available. The Stanbol Enhancer will dereference the content automatically when needed.

/** the Reference to the content */
+ getReference() : String
/** dereferences the content */
+ dereference() : ContentSource

When the Stanbol Enhancer dereferences referenced content depends on many factors. At the earliest, it may be dereferenced by the createBlob/createContentItem methods of a ContentItemFactory implementation. At the latest, it will be dereferenced when the referenced content is first used by the Stanbol Enhancer (e.g. on a call to ContentItem#getStream() or ContentItem#getMimeType()).
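The "at the latest" case can be sketched as a blob that dereferences its content only on first access. Everything in the block is an assumption made for illustration: the ContentReference interface is a simplified stand-in whose dereference() returns bytes directly instead of a ContentSource, and LazyBlob is a hypothetical class, not a Stanbol implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified stand-in: dereference() yields bytes instead of a ContentSource. */
interface ContentReference {
    String getReference();
    byte[] dereference();
}

/** Hypothetical blob that defers dereferencing until the content is first requested. */
final class LazyBlob {
    private final ContentReference reference;
    private byte[] data; // stays null until first use

    LazyBlob(ContentReference reference) {
        this.reference = reference;
    }

    synchronized byte[] getData() {
        if (data == null) {
            data = reference.dereference();
        }
        return data;
    }
}

final class LazyDereferenceSketch {
    /** Builds a reference that counts how often it is dereferenced. */
    static ContentReference countingReference(final String uri, final AtomicInteger calls) {
        return new ContentReference() {
            @Override public String getReference() { return uri; }
            @Override public byte[] dereference() {
                calls.incrementAndGet();
                return "example content".getBytes(StandardCharsets.UTF_8);
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyBlob blob = new LazyBlob(countingReference("http://example.org/doc.txt", calls));
        System.out.println("before first use: " + calls.get());
        blob.getData();
        blob.getData();
        System.out.println("after two reads: " + calls.get());
    }
}
```

An eager implementation would instead call dereference() in the constructor, which corresponds to the "at the earliest" case.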

By default, a ContentReference implementation for Java URLs is provided by the Stanbol Enhancer servicesapi module. This implementation replaces the WebContentItem that was used for obtaining content from URLs until Stanbol version 0.9.0-incubating.

ContentSink

EnhancementEngines that convert passed content (e.g. the TikaEngine) are often capable of stream processing, meaning that they do not need to load the whole content into memory while analyzing it. To support this mode of operation within the Stanbol Enhancer as well, the ContentSink interface plays an important role: it allows creating an initially empty Blob and then "streaming" the content to it while processing the content.

The following method of the ContentItemFactory can be used to create a ContentSink

/** Creates a new ContentSink */
+ createContentSink(String mediaType) : ContentSink;

The ContentSink interface provides the OutputStream as well as the created Blob

/** Getter for the OutputStream */
+ getOutputStream() : OutputStream;
/** Getter for the Blob */
+ getBlob() : Blob;

Note: Users MUST NOT pass the Blob of a ContentSink to any other component until all the data has been written to the OutputStream, because otherwise other components may read partial data when calling Blob#getStream(). This feature is intended to reduce the memory footprint, not to support concurrent writing and reading of data as supported by pipes.
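The risk named in the note can be reproduced with a small in-memory stand-in. The class InMemorySinkSketch is hypothetical: it is not a Stanbol ContentSink, and its getBlobStream() method only plays the role of Blob#getStream() for the purpose of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical in-memory sink; a stand-in, not a Stanbol implementation. */
final class InMemorySinkSketch {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    OutputStream getOutputStream() {
        return buffer;
    }

    /** Plays the role of Blob#getStream(): exposes whatever has been written so far. */
    InputStream getBlobStream() {
        return new ByteArrayInputStream(buffer.toByteArray());
    }

    /** Counts the bytes that a reader would currently see. */
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        InMemorySinkSketch sink = new InMemorySinkSketch();
        Writer writer = new OutputStreamWriter(sink.getOutputStream(), StandardCharsets.UTF_8);
        writer.write("first half, ");
        writer.flush();
        // A component that received the Blob too early reads only part of the data
        int early = countBytes(sink.getBlobStream());
        writer.write("second half");
        writer.close();
        // Only after the writer is closed is the content complete
        int complete = countBytes(sink.getBlobStream());
        System.out.println(early + " bytes visible early, " + complete + " bytes when complete");
    }
}
```

The early reader silently gets a truncated view rather than an error, which is why the Blob should only be handed on after the stream has been closed.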

Intended Usage:

This example shows a typical usage of a ContentSink within the computeEnhancements(..) method of an EnhancementEngine that needs to transform some content.

// assumes a ContentItemFactory is available as contentItemFactory
ContentItem ci; // the content item to process
ContentSink plainTextSink = contentItemFactory.createContentSink("text/plain");
Writer writer = new OutputStreamWriter(plainTextSink.getOutputStream(), "UTF-8");
try {
    // pass the writer to the framework that extracts the text
} finally {
    // closing the writer completes the content of the Blob
    IOUtils.closeQuietly(writer);
}
// now that all data is written, add the Blob to the ContentItem
UriRef textBlobUri; // create a UriRef for the Blob
ci.addPart(textBlobUri, plainTextSink.getBlob());
plainTextSink = null;