Contenthub (5-minute tutorial)
The Apache Stanbol Contenthub is an Apache Solr-based document repository that stores text-based documents and offers customizable semantic search over them. Contenthub exposes a Java API together with corresponding RESTful services.
A document within Contenthub is referred to as a "content item". A content item consists of the text-based content of the document together with its metadata. Contenthub has two main subcomponents, Store and Search. As the names indicate, Store is responsible for the persistent storage of content items, and Search provides semantic search facilities on top of them.
Contenthub Store
The Store is the part of Contenthub that persists documents and their metadata. The current implementation supports only text/plain documents.
The Store provides basic operations such as create, put, get and delete. When a document is submitted, the Store passes its textual content to the Stanbol Enhancer to obtain enhancements. (In Contenthub terminology, the enhancements of a content item are called its metadata.) When submitting a document, you can also supply external metadata, in addition to the enhancements retrieved from the Enhancer, as field:value pairs.
The document and all of its metadata are indexed in an embedded Apache Solr core (index) created specifically for Contenthub. Each document is assigned a unique ID during indexing, and that ID can be used to retrieve or delete the document. Contenthub provides an HTML interface to this functionality at the following endpoint, which is available after starting the full launcher of Apache Stanbol:
http://localhost:8080/contenthub
Apache Solr can manage several cores (indexes) within the same running instance, and Contenthub uses this facility to maintain multiple cores. These cores are managed through LDPath programs.
LDPath is a simple path-based query language, similar to XPath or SPARQL Property Paths, that is particularly well suited to querying and retrieving resources from the Linked Data Cloud by following RDF links between resources and servers. For example, the following path query selects the names of the people who are known by the context resource (the resource on which the path is evaluated):
foaf:knows / foaf:name
An LDPath program is a collection of path queries. It can be executed on any semantic collection of resources to extract specific information. For example, the following LDPath program can be executed on the resources returned by the Stanbol Enhancer as the result of the enhancement process:
@prefix rdf : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>;
@prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#>;
@prefix db-ont : <http://dbpedia.org/ontology/>;
title = rdfs:label :: xsd:string;
dbpediatype = rdf:type :: xsd:anyURI;
population = db-ont:populationTotal :: xsd:int;
Given an LDPath program, Contenthub can create a corresponding Solr core and index content items through that core. When you submit a document to the Contenthub Store and name an LDPath program, the content item (the document content and its metadata/enhancements) is indexed according to the fields defined by that program. For instance, the example LDPath program above leads to a Solr core that includes the following fields (in addition to the default configuration and several default fields):
<field name="title" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="dbpediatype" type="uri" stored="true" indexed="true" multiValued="true"/>
<field name="population" type="int" stored="true" indexed="true" multiValued="true"/>
To submit an LDPath program, you can use the following command against the REST API of Contenthub:
curl -i -X POST -d \
"name=myindex&program=\
@prefix rdf : <http://www.w3.org/1999/02/22-rdf-syntax-ns#>; \
@prefix rdfs : <http://www.w3.org/2000/01/rdf-schema#>; \
@prefix db-ont : <http://dbpedia.org/ontology/>; \
title = rdfs:label :: xsd:string; dbpediatype = rdf:type :: xsd:anyURI; \
population = db-ont:populationTotal :: xsd:int;" \
http://localhost:8080/contenthub/ldpath/program
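An LDPath program contains characters such as '#', ';', '<' and spaces, so it may be safer to let curl percent-encode the form values. The following is a minimal sketch of that approach, assuming the program has been saved to a local file and that the endpoint accepts a program spanning several lines. The names STANBOL_URL, ldpath_program_url, submit_ldpath and myindex.ldpath are illustrative only and are not part of Stanbol.

```shell
#!/bin/sh
# Base URL of the running Stanbol launcher (hypothetical variable name).
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Endpoint for submitting LDPath programs, as given in this tutorial.
ldpath_program_url() {
  printf '%s/contenthub/ldpath/program' "$STANBOL_URL"
}

# $1 = index name, $2 = path to a file holding the LDPath program.
# "name=value" percent-encodes the value; "name@file" reads the value
# from the file and percent-encodes it before sending the POST body.
submit_ldpath() {
  curl -i -X POST \
    --data-urlencode "name=$1" \
    --data-urlencode "program@$2" \
    "$(ldpath_program_url)"
}

# Usage (requires a running Stanbol full launcher):
#   submit_ldpath myindex ./myindex.ldpath
```

Keeping the program in a file also avoids the nested quoting and line continuations needed when it is passed inline.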
You can retrieve the list of managed LDPath programs in JSON format with the following command. Apart from the default Solr core, this is also the list of available Solr cores:
curl -i -X GET http://localhost:8080/contenthub/ldpath
LDPath programs are managed by the SemanticIndexManager of Contenthub. To use a semantic index when storing content items, specify the name of the index in the path of the URL when submitting the document. The default index of Contenthub is named "contenthub". Hence, the following command submits a document to the default index:
curl -i -X POST -H "Content-Type:application/x-www-form-urlencoded" \
-d "title=about me&content=I live in Istanbul." \
http://localhost:8080/contenthub/contenthub/store
The following command stores the content item in the Solr core named "myindex". The content item is therefore indexed according to the fields defined by the LDPath program named "myindex".
curl -i -X POST -H "Content-Type:application/x-www-form-urlencoded" \
-d "title=about me&content=I live in Istanbul." \
http://localhost:8080/contenthub/myindex/store
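The two store commands above differ only in the index name in the URL path, so the pattern can be wrapped in a small shell function. This is a sketch rather than part of Stanbol: STANBOL_URL, store_url and store_document are hypothetical names, and the form fields title and content are simply the ones used in the examples above.

```shell
#!/bin/sh
# Base URL of the running Stanbol launcher (hypothetical variable name).
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Store endpoint of a given index ($1 = index name).
store_url() {
  printf '%s/contenthub/%s/store' "$STANBOL_URL" "$1"
}

# $1 = index name, $2 = title, $3 = document text.
# --data-urlencode percent-encodes spaces and punctuation in the values;
# curl sends the body as application/x-www-form-urlencoded by default.
store_document() {
  curl -i -X POST \
    --data-urlencode "title=$2" \
    --data-urlencode "content=$3" \
    "$(store_url "$1")"
}

# Usage (requires a running Stanbol full launcher):
#   store_document contenthub "about me" "I live in Istanbul."
#   store_document myindex    "about me" "I live in Istanbul."
```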
Contenthub Search
Contenthub provides three search interfaces so that users can access the capabilities of Stanbol at different levels of complexity. These interfaces are:
- SolrSearch: provides the native Solr interface to the outside world and retrieves the resulting content items (documents) from the Solr backend. SolrJ users can use this interface directly. The search is performed on the corresponding Solr index, and results are returned as "org.apache.solr.client.solrj.response.QueryResponse" objects.
- RelatedKeywordSearch: provides supporting functionality for search. Given a keyword, the services of this interface find related keywords from several sources: Wordnet, domain ontologies and referenced sites.
- FeaturedSearch: combines the services of SolrSearch and RelatedKeywordSearch for users who want all results for a query term through one interface. Given a query term, it returns the matching documents from the queried Solr core/index together with related keywords retrieved from various sources (if those sources are available within the running Stanbol instance).
The following request retrieves all documents from the default Solr index (named "contenthub"):
http://localhost:8080/solr/default/contenthub/select?q=*:*
The following request retrieves all documents from the Solr index named "myindex":
http://localhost:8080/solr/default/myindex/select?q=*:*
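The select URLs above can also be queried from the command line with standard Solr parameters. The sketch below uses the common Solr parameters q, rows and wt; the function and variable names are hypothetical, and the range query on population assumes an index built from the example LDPath program shown earlier. Whether any documents match depends on what has been stored.

```shell
#!/bin/sh
# Base URL of the running Stanbol launcher (hypothetical variable name).
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Select endpoint of the Solr core behind a Contenthub index ($1 = index name).
solr_select_url() {
  printf '%s/solr/default/%s/select' "$STANBOL_URL" "$1"
}

# $1 = index name, $2 = Solr query string, $3 = maximum number of rows.
# -G sends the --data-urlencode pairs as an encoded query string on a GET,
# so characters such as '*', ':', '[' and spaces need no manual escaping.
solr_query() {
  curl -G \
    --data-urlencode "q=$2" \
    --data-urlencode "rows=$3" \
    --data-urlencode "wt=json" \
    "$(solr_select_url "$1")"
}

# Usage (requires a running Stanbol full launcher):
#   solr_query contenthub '*:*' 10
#   solr_query myindex 'population:[1000000 TO *]' 5
```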
RelatedKeywordSearch is performed by three independent search engines within the Stanbol system, namely:
- OntologyResourceSearch: If an ontology is already registered with Stanbol (e.g. a domain ontology), it can be used to look for keywords similar to a given keyword. A SPARQL query based on a LARQ index is executed on the specified ontology to find individual and class resources related to the keyword.
- ReferencedSiteSearch: Referenced sites are used to retrieve the enhancements of a content item. Stanbol Enhancer handles all enhancement operations through the referenced sites. This interface uses the referenced sites to look for keywords similar to a given keyword.
- WordnetSearch: If a Wordnet database is registered with the system (through the OSGi console), this service is available. It looks for several relations among keywords (such as synonyms and hyponyms) and retrieves a list of related keywords from the Wordnet database.
The following command retrieves keywords related to "turkey" from referenced sites and Wordnet (ReferencedSiteSearch and WordnetSearch). Since no ontology is specified, OntologyResourceSearch is not executed.
curl -i -X GET -H "Accept: application/json" \
"http://localhost:8080/contenthub/contenthub/search/related?keyword=turkey"
If the URI of an ontology is specified together with the keyword, the result also includes related keywords found through that ontology, in addition to those from referenced sites and Wordnet. The following command adds the keywords related to "turkey" that are retrieved from the ontology identified by "uri-dummy" to the result of the related keyword service. The URL is quoted so that the shell does not interpret the "&" character.
curl -i -X GET -H "Accept: application/json" \
"http://localhost:8080/contenthub/contenthub/search/related?keyword=turkey&ontologyURI=uri-dummy"
Lastly, Contenthub provides a featured search interface that combines the services of SolrSearch and RelatedKeywordSearch. The results of the FeaturedSearch services include both the matching documents and the related keywords for the given query term. The following query retrieves the documents whose indexed fields include the term "turkey", together with keywords related to "turkey" from several sources.
curl -i -X GET -H "Accept: application/json" -H "Content-Type:text/plain" \
"http://localhost:8080/contenthub/contenthub/search/featured?queryTerm=turkey"
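The related-keyword and featured search requests above share the same URL layout, so they can be wrapped in small helper functions. This is only a sketch: the function and variable names are hypothetical, while the paths and the parameters keyword, ontologyURI and queryTerm are those used in this tutorial. Passing the parameters through curl's -G and --data-urlencode options keeps characters such as "&" away from the shell and encodes multi-word terms.

```shell
#!/bin/sh
# Base URL of the running Stanbol launcher (hypothetical variable name).
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# $1 = index name, $2 = search service ("related" or "featured").
search_url() {
  printf '%s/contenthub/%s/search/%s' "$STANBOL_URL" "$1" "$2"
}

# $1 = index name, $2 = keyword, $3 = ontology URI (optional).
# The ontologyURI parameter is sent only when a third argument is given.
related_keywords() {
  if [ -n "${3:-}" ]; then
    curl -G -H "Accept: application/json" \
      --data-urlencode "keyword=$2" \
      --data-urlencode "ontologyURI=$3" \
      "$(search_url "$1" related)"
  else
    curl -G -H "Accept: application/json" \
      --data-urlencode "keyword=$2" \
      "$(search_url "$1" related)"
  fi
}

# $1 = index name, $2 = query term.
featured_search() {
  curl -G -H "Accept: application/json" \
    --data-urlencode "queryTerm=$2" \
    "$(search_url "$1" featured)"
}

# Usage (requires a running Stanbol full launcher):
#   related_keywords contenthub turkey
#   related_keywords contenthub turkey uri-dummy
#   featured_search  contenthub turkey
```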