This project has retired. For details please refer to its Attic page.
Apache Stanbol - Contenthub (5 minutes tutorial)

The Apache Stanbol Contenthub is an Apache Solr-based document repository that stores text-based documents and provides customizable semantic search facilities. Contenthub exposes an efficient Java API together with the corresponding RESTful services.

Contenthub is basically a document repository. A document within Contenthub is referred to as a "Content Item". A content item consists of the text-based content of the document together with its metadata. Contenthub has two main subcomponents, namely Store and Search. As their names indicate, Store is responsible for the persistent storage of content items, while Search provides semantic search facilities on top of the content items.

Contenthub Store

Store is the part of Contenthub that persistently stores the documents and their metadata. In the current implementation, only text/plain documents are supported.

The Store provides basic methods such as create, put, get and delete. When a document is submitted, the Store delegates the textual content to Stanbol Enhancer to retrieve its enhancements. (In Contenthub terminology, the enhancements of a content item are called its metadata.) While submitting the document, it is also possible to specify external metadata, in addition to the enhancements retrieved from Enhancer, as field:value pairs.

The document itself and all of its metadata are indexed through an embedded Apache Solr core (index) that is created specifically for Contenthub. Each document is given a unique ID during indexing, and this ID can be used to retrieve or delete the document. Contenthub provides an HTML interface for its functionality under the following endpoint, which is available after running the full launcher of Apache Stanbol:

http://localhost:8080/contenthub

Apache Solr can manage several cores (indexes) within the same running instance, and Contenthub makes use of this facility to manage multiple cores. These cores are managed through LDPath programs.

LDPath is a simple path-based query language, similar to XPath or SPARQL Property Paths, that is particularly well suited for querying and retrieving resources from the Linked Data Cloud by following RDF links between resources and servers. For example, the following path query selects the names of the people known by the context resource (the resource on which the path is executed):

foaf:knows / foaf:name
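
To make the traversal concrete, the following is a minimal sketch of how such a path could be evaluated over a small in-memory graph. It is not the LDPath implementation: it only handles plain sequences of properties separated by "/", and the graph data (ex:alice, ex:bob, ex:carol) is hypothetical illustration data.

```python
# Toy RDF-like graph: resource -> property -> list of values.
# All resources and names below are hypothetical illustration data.
GRAPH = {
    "ex:alice": {"foaf:name": ["Alice"], "foaf:knows": ["ex:bob", "ex:carol"]},
    "ex:bob": {"foaf:name": ["Bob"], "foaf:knows": ["ex:alice"]},
    "ex:carol": {"foaf:name": ["Carol"]},
}


def evaluate_path(graph, context, path):
    """Follow a '/'-separated sequence of properties from the context resource."""
    current = [context]
    for step in (part.strip() for part in path.split("/")):
        next_nodes = []
        for node in current:
            # Literals and unknown resources have no outgoing properties.
            next_nodes.extend(graph.get(node, {}).get(step, []))
        current = next_nodes
    return current


print(evaluate_path(GRAPH, "ex:alice", "foaf:knows / foaf:name"))
# prints ['Bob', 'Carol']
```

Each step replaces the current set of nodes with the values of the next property, which is why the result contains the names of the people alice knows rather than alice's own name.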

An LDPath program is a collection of path queries. It can be executed on any semantic collection of resources to extract specific information. For example, the following LDPath program can be executed on the resources retrieved from Stanbol Enhancer as a result of the enhancement process:

@prefix rdf : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>;
@prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#>;
@prefix db-ont : <http://dbpedia.org/ontology/>;
title = rdfs:label :: xsd:string;
dbpediatype = rdf:type :: xsd:anyURI;
population = db-ont:populationTotal :: xsd:int;

Given an LDPath program, Contenthub can create a corresponding Solr core and index content items through that core. When you submit a document to Contenthub Store for an index created from an LDPath program, the content item (the document content and its metadata/enhancements) is indexed according to the fields defined by that program. For instance, the example LDPath program above leads to a Solr core that includes the following fields (in addition to the default configuration and several default fields):

<field name="title" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="dbpediatype" type="uri" stored="true" indexed="true" multiValued="true"/>
<field name="population" type="int" stored="true" indexed="true" multiValued="true"/>
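
The correspondence between the field rules of the program and the Solr fields can be sketched as follows. This is only an illustration of the mapping visible in the example above, not Contenthub's actual code: the regular expression handles just the simple "name = path :: type;" form, and the type table contains only the three XSD types that appear in this tutorial.

```python
import re

PROGRAM = """\
@prefix rdf : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>;
@prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#>;
@prefix db-ont : <http://dbpedia.org/ontology/>;
title = rdfs:label :: xsd:string;
dbpediatype = rdf:type :: xsd:anyURI;
population = db-ont:populationTotal :: xsd:int;
"""

# Type mapping read off the example above; other XSD types are not covered.
XSD_TO_SOLR = {"xsd:string": "string", "xsd:anyURI": "uri", "xsd:int": "int"}

# Matches rules of the form "name = path :: xsd:type;".
FIELD_RULE = re.compile(r"^\s*(\w+)\s*=\s*.+?::\s*(xsd:\w+)\s*;\s*$")


def solr_fields(program):
    """Return one Solr <field> definition per field rule in the program."""
    fields = []
    for line in program.splitlines():
        match = FIELD_RULE.match(line)
        if match is None:  # skips @prefix declarations and blank lines
            continue
        name, xsd_type = match.groups()
        fields.append(
            f'<field name="{name}" type="{XSD_TO_SOLR[xsd_type]}" '
            'stored="true" indexed="true" multiValued="true"/>'
        )
    return fields


for field in solr_fields(PROGRAM):
    print(field)
```

Running the sketch on the example program prints the three field definitions listed above.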

To submit an LDPath program, you can use the following command through the REST API of Contenthub:

curl -i -X POST -d \
    "name=myindex&program=\
    @prefix rdf : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>; \
    @prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#>; \
    @prefix db-ont : <http://dbpedia.org/ontology/>; \
    title = rdfs:label :: xsd:string; dbpediatype = rdf:type :: xsd:anyURI; \
    population = db-ont:populationTotal :: xsd:int;" \
    http://localhost:8080/contenthub/ldpath/program
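
The same request can be prepared from a script. The sketch below uses only the Python standard library and the endpoint and form parameters shown in the curl command. Unlike curl -d, it URL-encodes the form body, so it assumes the service decodes standard form encoding. The helper names build_submit_program_request and send are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8080/contenthub"

PROGRAM = (
    "@prefix rdf : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>; "
    "@prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#>; "
    "@prefix db-ont : <http://dbpedia.org/ontology/>; "
    "title = rdfs:label :: xsd:string; "
    "dbpediatype = rdf:type :: xsd:anyURI; "
    "population = db-ont:populationTotal :: xsd:int;"
)


def build_submit_program_request(name, program, base_url=BASE_URL):
    """Build (but do not send) the POST that registers an LDPath program."""
    body = urlencode({"name": name, "program": program}).encode("ascii")
    return Request(
        f"{base_url}/ldpath/program",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send(request):
    """Send a prepared request; needs a running Stanbol full launcher."""
    with urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


request = build_submit_program_request("myindex", PROGRAM)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it possible to inspect the encoded body before anything is submitted.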

You can retrieve the list of managed LDPath programs in JSON format with the following command. This is also the list of available Solr cores (except the default Solr core):

curl -i -X GET http://localhost:8080/contenthub/ldpath

LDPath-related management is performed through the SemanticIndexManager of Contenthub. To take advantage of semantic indexes while storing content items, specify the name of the index in the path of the URL when submitting the document. The default index of Contenthub is named "contenthub". Hence, the following command submits a document to the default index:

curl -i -X POST -H "Content-Type:application/x-www-form-urlencoded" \
    -d "title=about me&content=I live in Istanbul." \
    http://localhost:8080/contenthub/contenthub/store

The following command stores the content item in the Solr core named "myindex", so indexing is performed according to the fields defined by the LDPath program of the same name:

curl -i -X POST -H "Content-Type:application/x-www-form-urlencoded" \
    -d "title=about me&content=I live in Istanbul." \
    http://localhost:8080/contenthub/myindex/store
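
Since the two store commands differ only in the index name in the URL path, the request can be parameterized on it. The following is a minimal sketch using the Python standard library; build_store_request is a hypothetical helper, and the request is only built, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8080/contenthub"


def build_store_request(title, content, index="contenthub", base_url=BASE_URL):
    """Build the POST that stores a content item in the given index."""
    body = urlencode({"title": title, "content": content}).encode("utf-8")
    return Request(
        # The index name is a path segment, so it is escaped as one.
        f"{base_url}/{quote(index, safe='')}/store",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


default_request = build_store_request("about me", "I live in Istanbul.")
custom_request = build_store_request("about me", "I live in Istanbul.", index="myindex")
print(default_request.full_url)
print(custom_request.full_url)
print(custom_request.data.decode("utf-8"))
```

To actually submit the item, pass the request to urllib.request.urlopen while the Stanbol full launcher is running.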

Contenthub Search

Contenthub provides three search interfaces so that users can adopt the capabilities of Stanbol at different levels of complexity. These interfaces are SolrSearch, RelatedKeywordSearch and FeaturedSearch.

The following request retrieves all documents from the default Solr index (named "contenthub"):

http://localhost:8080/solr/default/contenthub/select?q=*:*

The following request retrieves all documents from the Solr index named "myindex":

http://localhost:8080/solr/default/myindex/select?q=*:*
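
The two select URLs follow one pattern, with the index name in the path and the Solr query in the q parameter. A small sketch of building such URLs is given below; solr_select_url is a hypothetical helper, and the host and path layout are taken from the requests above.

```python
from urllib.parse import quote, urlencode

SOLR_BASE = "http://localhost:8080/solr/default"


def solr_select_url(index="contenthub", query="*:*", solr_base=SOLR_BASE):
    """Build a Solr select URL for the given index and query string."""
    # '*' and ':' are kept unescaped so the URL reads like the ones above.
    params = urlencode({"q": query}, safe="*:")
    return f"{solr_base}/{quote(index, safe='')}/select?{params}"


print(solr_select_url())
print(solr_select_url("myindex"))
```

Any other Solr query string can be passed as the query argument; characters other than "*" and ":" are percent-encoded.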

RelatedKeywordSearch is performed by three independent search engines within the Stanbol system, namely ReferencedSiteSearch, WordnetSearch and OntologyResourceSearch.

The following command retrieves related keywords about "turkey" from referenced sites and WordNet (ReferencedSiteSearch and WordnetSearch). Since no ontology is specified, OntologyResourceSearch is not executed.

curl -i -X GET -H "Accept: application/json" \
    "http://localhost:8080/contenthub/contenthub/search/related?keyword=turkey"

If the URI of an ontology is specified together with the keyword, the result also includes related keywords found through that ontology, in addition to the referenced site and WordNet data. The following command adds the related keywords of "turkey" retrieved from the ontology identified by "uri-dummy" to the result of the related keyword service:

curl -i -X GET -H "Accept: application/json" \
    "http://localhost:8080/contenthub/contenthub/search/related?keyword=turkey&ontologyURI=uri-dummy"
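
When the ontology URI is a real URI rather than a placeholder, it contains characters such as ":", "/" and "#" that must be escaped inside a query string. The sketch below builds the related-keyword request with the Python standard library, which takes care of the escaping. related_keywords_request is a hypothetical helper, and http://example.org/onto# is a made-up ontology URI.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8080/contenthub"


def related_keywords_request(keyword, ontology_uri=None, index="contenthub",
                             base_url=BASE_URL):
    """Build the GET request for the related keyword service."""
    params = {"keyword": keyword}
    if ontology_uri is not None:
        # Only sent when given; without it no ontology search is requested.
        params["ontologyURI"] = ontology_uri
    query = urlencode(params)
    url = f"{base_url}/{quote(index, safe='')}/search/related?{query}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


print(related_keywords_request("turkey").full_url)
print(related_keywords_request("turkey", "uri-dummy").full_url)
print(related_keywords_request("turkey", "http://example.org/onto#").full_url)
```

The request can be sent with urllib.request.urlopen once the Stanbol full launcher is running.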

Lastly, Contenthub provides a featured search interface, FeaturedSearch, which combines the services of SolrSearch and RelatedKeywordSearch. Its results include the matching documents and the related keywords of the given query term. The following query retrieves the documents whose indexed fields include the term "turkey", together with related keywords about "turkey" from several sources:

curl -i -X GET -H "Accept: application/json" -H "Content-Type:text/plain" \
    "http://localhost:8080/contenthub/contenthub/search/featured?queryTerm=turkey"