Content Item Factory
The ContentItemFactory is used by the Stanbol Enhancer to create ContentItem and Blob instances. ContentItemFactory implementations typically register themselves as OSGi services. The Stanbol Enhancer uses the factory implementation with the highest "service.ranking" to create ContentItems and Blobs for requests on the RESTful API. When using the Java API, any ContentItem implementation can be used.
ContentItemFactory interface
The interface of the ContentItemFactory defines the following methods to create ContentItems:
    + createContentItem(ContentSource source) : ContentItem
    + createContentItem(String prefix, ContentSource source) : ContentItem
    + createContentItem(UriRef id, ContentSource source) : ContentItem
    + createContentItem(String prefix, ContentSource source, MGraph metadata) : ContentItem
    + createContentItem(UriRef id, ContentSource source, MGraph metadata) : ContentItem
    + createContentItem(ContentReference reference) : ContentItem
    + createContentItem(ContentReference reference, MGraph metadata) : ContentItem
The content of a created ContentItem can be passed as either a ContentSource or a ContentReference. The Stanbol Enhancer servicesapi module provides implementations for creating ContentSources for Java streams, byte arrays and String objects, as well as ContentReferences for URLs. For details see the sections below.
The URI of the created ContentItem is determined as follows:
- If no URI is passed, then it is calculated by using a default prefix plus a digest over the passed content. This ensures that if the same content is passed several times, the created ContentItems will use the same id.
- Methods that take a prefix also generate the URI by calculating a digest over the passed content. However, the passed prefix is used instead of the default one.
- If a UriRef id is passed, then that URI is used as the id of the content item.
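The digest-based naming rules above can be sketched in plain Java. This is a minimal illustration, not the Stanbol implementation: the class name `ContentUriSketch`, the default prefix `urn:content-item-` and the choice of SHA-1 as digest are all assumptions made for this example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not part of the Stanbol API. */
final class ContentUriSketch {

    /** Assumed default prefix, chosen for this example only. */
    static final String DEFAULT_PREFIX = "urn:content-item-";

    private ContentUriSketch() {}

    /** Neither URI nor prefix passed: default prefix plus content digest. */
    static String uriFor(byte[] content) {
        return uriFor(DEFAULT_PREFIX, content);
    }

    /** Prefix passed: the given prefix plus content digest. */
    static String uriFor(String prefix, byte[] content) {
        return prefix + hexDigest(content);
    }

    /** Hex-encoded SHA-1 digest of the content (the algorithm is an assumption). */
    static String hexDigest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }
}
```

Because the id depends only on the prefix and the bytes, calling `uriFor` twice with equal content yields the same URI, which is the property the first rule relies on.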
The ContentItemFactory also allows pre-existing metadata to be passed. All RDF triples in the passed MGraph are guaranteed to be added to the metadata of the created ContentItem. Note that implementations are free either to use the passed MGraph instance directly for the metadata or to create a new MGraph instance and copy all triples of the passed instance into it.
The following methods of the ContentItemFactory can be used to create Blobs:
    + createBlob(ContentSource source) : Blob
    + createBlob(ContentReference reference) : Blob
    + createContentSink(String mediaType) : ContentSink
The Blob interface is used by the Stanbol Enhancer to represent content. Blobs are added to ContentItems as content parts. Blobs can be created from a ContentSource or a ContentReference, as is the case for ContentItems, and additionally by using a ContentSink. A ContentSink provides an OutputStream to an initially empty Blob, so that the content can be streamed to the Blob later. This is intended for EnhancementEngines that need to convert content from one format to another, because it avoids caching the converted content in memory.
ContentItem implementations
By default the Stanbol Enhancer provides two ContentItemFactory/ContentItem/Blob implementations. Users can control the implementation used by the Stanbol Enhancer by configuring the "service.ranking" property of the different ContentItemFactory implementations (e.g. via the configuration tab of the Apache Felix Web Console). The implementation with the highest "service.ranking" will be used by the Stanbol Enhancer to create ContentItems and Blobs.
In-memory ContentItem
This implementation manages contents - Blobs - as byte arrays that are kept in-memory. While this ensures fast access to the passed content it also might cause problems if the Stanbol Enhancer is used to process big media files. Nonetheless this is currently used as default, because for typical usage scenarios content processed by the Stanbol Enhancer easily fits into memory.
The ContentItemFactory of this implementation registers itself with a "service.ranking" of 100 and is therefore used as default by the Stanbol Enhancer.
File-based ContentItem
This implementation differs from the in-memory one in that it stores content - Blobs - in temporary files on disk. All other information, such as the metadata or non-Blob content parts, is still kept in-memory. This implementation is intended for users that use the Stanbol Enhancer to process big media files such as TIFF images, MP3 files, rich text files including big graphics or even video files.
The ContentItemFactory of the file-based implementation is registered with a "service.ranking" of 50. To use it as default, users need to ensure that the ranking of this implementation is higher than the one of the in-memory implementation.
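The effect of the two rankings can be illustrated without an OSGi container. The following is a simplified model under stated assumptions: `RankedFactory` and `pickDefault` are hypothetical names, and in a real deployment the selection is done through the OSGi service registry rather than by code like this.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of a registered factory and its "service.ranking". */
final class RankedFactory {
    final String name;
    final int serviceRanking;

    RankedFactory(String name, int serviceRanking) {
        this.name = name;
        this.serviceRanking = serviceRanking;
    }

    /** Returns the registered factory with the highest ranking. */
    static RankedFactory pickDefault(List<RankedFactory> registered) {
        return registered.stream()
                .max(Comparator.comparingInt((RankedFactory f) -> f.serviceRanking))
                .orElseThrow(() -> new IllegalStateException("no ContentItemFactory registered"));
    }
}
```

With the default rankings described above (100 for in-memory, 50 for file-based) this model selects the in-memory factory. Raising the file-based ranking above 100, for example to 200, makes the file-based factory the one that is selected.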
ContentSource
This interface describes the source of content. It defines the following API:
    /** the content as stream */
    + getStream() : InputStream
    /** the content as byte array */
    + getData() : byte[]
    /** optionally the media type of the content */
    + getMediaType() : String
    /** optionally the file name of the content */
    + getFileName() : String
    /** optionally additional headers */
    + getHeaders() : Map<String,List<String>>
The ContentSource interface defines methods for obtaining the wrapped content as InputStream and byte[]. This is mainly to avoid unnecessary copying of content. Implementors of ContentItems SHOULD prefer to call
- ContentSource#getData() if the ContentItem/Blob implementation will store the content as byte[] in-memory
- ContentSource#getStream() if the content of a ContentSource is streamed to a file, database, CMS or any other target outside the JVM.
The following implementations of this interface are provided by the Stanbol Enhancer servicesapi module:
- StreamSource: A ContentSource wrapping an InputStream. Multiple calls to #getStream() are not supported and will cause IllegalStateExceptions. Calls to #getData() will load the contents of the stream into an in-memory byte array.
- ByteArraySource: A ContentSource implementation that uses a byte array to represent the content. All constructors take the byte array representing the content as parameter. Calls to #getData() MUST NOT copy the byte array, to avoid duplication.
- StringSource: A ContentSource implementation that allows a String instance to be passed directly. The constructors convert the passed String to a byte array by using the passed Charset. UTF-8 is used as default. This implementation is based on the ByteArraySource.
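The behaviour described for the stream-backed and array-backed sources can be sketched as follows. The types below are reduced, hypothetical stand-ins (`SimpleContentSource`, `ArrayBackedSource`, `SingleUseStreamSource`) and not the classes of the servicesapi module. In particular, the handling of a second `getData()` call on the stream-backed source is an assumption of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Reduced, hypothetical version of the ContentSource contract. */
interface SimpleContentSource {
    InputStream getStream();
    byte[] getData();
}

/** Wraps a byte array; getData() hands out the same array without copying. */
final class ArrayBackedSource implements SimpleContentSource {
    private final byte[] data;

    ArrayBackedSource(byte[] data) { this.data = data; }

    @Override public InputStream getStream() { return new ByteArrayInputStream(data); }

    @Override public byte[] getData() { return data; }
}

/** Wraps an InputStream that can be handed out only once. */
final class SingleUseStreamSource implements SimpleContentSource {
    private final InputStream in;
    private boolean consumed;

    SingleUseStreamSource(InputStream in) { this.in = in; }

    @Override public synchronized InputStream getStream() {
        if (consumed) {
            throw new IllegalStateException("stream was already requested");
        }
        consumed = true;
        return in;
    }

    /** Loads the whole stream into memory; this also consumes the stream. */
    @Override public byte[] getData() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try (InputStream stream = getStream()) {
            for (int n = stream.read(buffer); n != -1; n = stream.read(buffer)) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In this sketch an in-memory consumer would call `getData()` on the array-backed source and receive the original array, while a consumer that writes to a file would call `getStream()` once and copy from it.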
ContentReference
This interface describes content that is not yet locally available. The Stanbol Enhancer will dereference the content automatically when needed.
    /** the Reference to the content */
    + getReference() : String
    /** dereferences the content */
    + dereference() : ContentSource
When referenced content is dereferenced by the Stanbol Enhancer depends on many factors. At the earliest, it may be dereferenced by the createBlob/createContentItem methods of a ContentItemFactory implementation. At the latest, it will be dereferenced when the referenced content is first used by the Stanbol Enhancer (e.g. on a call to ContentItem#getStream() or ContentItem#getMimeType()).
By default, a ContentReference implementation for Java URLs is provided by the Stanbol Enhancer servicesapi module. This implementation replaces the WebContentItem that was used for obtaining content from URLs until Stanbol version 0.9.0-incubating.
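The "at the latest" case, dereferencing on first use, can be sketched with a small holder class. `LazyContent` is a hypothetical name, and the `Supplier` stands in for a call to ContentReference#dereference(). The counter exists only to make the behaviour observable in this example.

```java
import java.util.function.Supplier;

/** Hypothetical holder that dereferences content on first use only. */
final class LazyContent {
    private final Supplier<byte[]> dereferencer;
    private byte[] data;          // null until the content was dereferenced
    private int dereferenceCount; // for demonstration purposes only

    LazyContent(Supplier<byte[]> dereferencer) {
        this.dereferencer = dereferencer;
    }

    /** Dereferences on the first call and reuses the result afterwards. */
    synchronized byte[] getData() {
        if (data == null) {
            data = dereferencer.get();
            dereferenceCount++;
        }
        return data;
    }

    synchronized int getDereferenceCount() {
        return dereferenceCount;
    }
}
```

A factory that dereferences eagerly would instead call the supplier in its create method. Both strategies fall within the range described above.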
ContentSink
EnhancementEngines that convert passed content (e.g. the TikaEngine) are often capable of stream processing, meaning that they do not need to load the whole content into memory while analyzing it. To support this mode of operation within the Stanbol Enhancer as well, the ContentSink interface plays an important role: it allows an engine to create an initially empty Blob and then "stream" the content to it while processing.
The following method of the ContentItemFactory can be used to create a ContentSink:
    /** Creates a new ContentSink */
    + createContentSink(String mediaType) : ContentSink;
The ContentSink interface provides the OutputStream as well as the created Blob:
    /** Getter for the OutputStream */
    + getOutputStream() : OutputStream;
    /** Getter for the Blob */
    + getBlob() : Blob;
Note: Users MUST NOT pass the Blob of a ContentSink to any other component until all data has been written to the OutputStream, because other components might otherwise read partial data when calling Blob#getStream(). This feature is intended to reduce the memory footprint, not to support concurrent writing and reading of data as pipes do.
Intended Usage:
This example shows a typical usage of a ContentSink within the processEnhancement(..) method of an EnhancementEngine that needs to transform some content.
    ContentItem ci; //the content item to process
    ContentSink plainTextSink = contentItemFactory.createContentSink("text/plain");
    Writer writer = new OutputStreamWriter(plainTextSink.getOutputStream(), "UTF-8");
    try {
        // pass the writer to the framework that extracts the text
    } finally {
        IOUtils.closeQuietly(writer);
    }
    //now add the Blob to the ContentItem
    UriRef textBlobUri; //create a UriRef for the Blob
    ci.addPart(textBlobUri, plainTextSink.getBlob());
    plainTextSink = null;
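The reason for the rule that a Blob must not be handed on before writing has finished can be seen in a minimal, self-contained model of a sink. `InMemorySinkSketch` is a hypothetical class that merges the sink and Blob roles and buffers in memory purely for illustration. It makes no claim about how the Stanbol implementations store data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical in-memory sink; the Blob side is reduced to two getters. */
final class InMemorySinkSketch {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final String mediaType;

    InMemorySinkSketch(String mediaType) {
        this.mediaType = mediaType;
    }

    String getMimeType() {
        return mediaType;
    }

    /** The stream the producing engine writes the converted content to. */
    OutputStream getOutputStream() {
        return buffer;
    }

    /** Returns what has been written so far; incomplete while writing is ongoing. */
    InputStream getStream() {
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```

A reader that calls `getStream()` after only part of the content has been written receives just that part, with no indication that more will follow. This is why the Blob should be added to the ContentItem only after the writer has been closed.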