Using Apache Stanbol for enhancing textual content
To enhance content, post plain text to the enhancement engines; the response contains the enhancement data. The enhancement process is stateless, so neither your content item nor the enhancements are stored.
You can test this via the [web interface of the engines][stan-engines] or from the console via

    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
    --data "The Stanbol enhancer can detect famous cities such as Paris \
    and people such as Bob Marley." http://localhost:8080/engines
or by using the text examples delivered with Stanbol:

    for file in enhancer/data/text-examples/*.txt; do
        curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
             -T "$file" http://localhost:8080/engines
    done
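The curl calls above can also be issued from a script. The following is a minimal sketch that uses only the Python standard library. The function names `build_enhance_request` and `enhance` are illustrative and not part of Stanbol; the endpoint and headers are copied from the curl example.

```python
import urllib.request

# Endpoint taken from the curl example; adjust it to your installation.
ENGINES_URL = "http://localhost:8080/engines"


def build_enhance_request(text, url=ENGINES_URL, accept="text/turtle"):
    """Build (but do not send) a POST request that carries plain text."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": accept, "Content-type": "text/plain"},
        method="POST",
    )


def enhance(text, url=ENGINES_URL):
    """Send the text to the enhancer and return the response body as a string."""
    with urllib.request.urlopen(build_enhance_request(text, url)) as response:
        return response.read().decode("utf-8")
```

With a running server, `enhance("Paris is a famous city.")` would be expected to return the enhancement graph in Turtle. Because the process is stateless, each call is independent of the previous ones.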
Content items in formats other than plain text can be tested via the [web interface of the Contenthub][stan-contenthub] or from the console by attaching files (the Metaxa Engine must be activated).
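For non-plain-text content, a request would carry the file's bytes together with its media type. The sketch below assumes that the `/engines` endpoint used in the earlier examples accepts such uploads when the Metaxa Engine is active; this is an assumption, not something stated above. The helper name `build_file_request` and the sample file name are hypothetical.

```python
import mimetypes
import urllib.request


def build_file_request(filename, payload, url="http://localhost:8080/engines"):
    """Build (but do not send) a POST request for a binary document.

    The media type is guessed from the file name; unknown types fall back
    to application/octet-stream.
    """
    content_type, _ = mimetypes.guess_type(filename)
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "Accept": "text/turtle",
            "Content-type": content_type or "application/octet-stream",
        },
        method="POST",
    )


# Hypothetical usage: the bytes would normally be read from the file itself.
request = build_file_request("report.pdf", b"%PDF-1.4 ...")
```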
Using the enhancement engines
Apache Stanbol starts with a number of active enhancement engines by default. You can activate, deactivate and configure engines via the [OSGi administration console][stan-admin].
The enhancement process follows a workflow of five phases: pre-processing, content-extraction, extraction-enhancement, default and post-processing.
The following pre-processing engines are available:
- The Language Identification Engine detects the language of the content items you want to process; several European languages are supported.
- The Metaxa Engine extracts embedded metadata and textual content from a large variety of document types and formats.
For content extraction / natural language processing one engine is available:
- The Named Entity Extraction Enhancement Engine leverages the sentence detector and name finder tools of the OpenNLP project bundled with statistical models trained to detect occurrences of names of persons, places and organizations.
The extracted items will then be enhanced by a dedicated engine:
- The Named Entity Tagging Engine suggests matching entities from DBpedia (default) and other reference sites for the entities extracted by the NER engine.
Specific additional enhancement engines are:
- The Location Enhancement Engine takes its suggestions from geonames.org only.
- The OpenCalais Enhancement Engine uses services from Open Calais. (Note: You need to provide a key in order to use this engine)
- The Zemanta Enhancement Engine uses the Zemanta services. (Note: You need to provide a key in order to use this engine)
For post-processing the results of the enhancement engines, one engine is available:
- The CachingDereferencerEngine fetches files, such as images for locations, from external sites. The Web UI uses it to present the enhancement results.
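The phase ordering described above can be illustrated with a small sketch. It is not Stanbol's scheduling code; it only shows how engines sort when each one is tagged with a workflow phase. Placing the Location, OpenCalais and Zemanta engines in the `default` phase is an assumption made for this illustration, since the text does not name their phase.

```python
# Workflow phases in the order given in the text.
PHASES = [
    "pre-processing",
    "content-extraction",
    "extraction-enhancement",
    "default",
    "post-processing",
]

# Engine-to-phase mapping taken from the lists above.
ENGINE_PHASE = {
    "CachingDereferencerEngine": "post-processing",
    "Named Entity Tagging Engine": "extraction-enhancement",
    "Metaxa Engine": "pre-processing",
    "Named Entity Extraction Enhancement Engine": "content-extraction",
    "Language Identification Engine": "pre-processing",
    # Assumption: the additional engines run in the "default" phase.
    "Location Enhancement Engine": "default",
    "OpenCalais Enhancement Engine": "default",
    "Zemanta Enhancement Engine": "default",
}


def execution_order(engine_phase):
    """Sort engine names by workflow phase, then alphabetically within a phase."""
    return sorted(
        engine_phase,
        key=lambda name: (PHASES.index(engine_phase[name]), name),
    )


for engine in execution_order(ENGINE_PHASE):
    print(f"{ENGINE_PHASE[engine]:24} {engine}")
```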
Using an index of linked open data locally
To use the pre-configured indexes, download them from [here][stan-download]. Each index consists of two files:
- org.apache.stanbol.data.site.{name}-{version}.jar
- {name}.solrindex.zip
If you copy the ZIP archive into the "/sling/datafiles" folder before installing the bundle, the data will be used automatically during the installation of the bundle. If you provide the file after installing the bundle, you will need to restart the SolrYard installed by the bundle.
The jar can be installed in any OSGi environment running the Apache Stanbol Entityhub. When started, it creates and configures:
- a "ReferencedSite" accessible at "http://{host}/{root}/entityhub/site/{name}"
- a "Cache" used to connect the ReferencedSite with your data, and
- a "SolrYard" that manages the data indexed by this utility.
This bundle does not contain the indexed data but only the configuration for the Solr Index.
If you have not copied the archive beforehand, the Apache Stanbol Data File Provider will request the ZIP archive after the bundle is installed. To install the data, copy this file to the "/sling/datafiles" folder within the working directory of your Stanbol server.
Note: {name} denotes the value you configured for the "name" property within the "indexing.properties" file.
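The naming patterns in this section can be summarized in a short sketch. The helper names are hypothetical, and the values `dbpedia`, `1.0.0` and `/opt/stanbol` in the usage lines are placeholders chosen for illustration, not actual release names or paths.

```python
from pathlib import PurePosixPath


def bundle_jar(name, version):
    """File name of the OSGi bundle that holds the index configuration."""
    return f"org.apache.stanbol.data.site.{name}-{version}.jar"


def solr_archive(name):
    """File name of the ZIP archive that holds the indexed data."""
    return f"{name}.solrindex.zip"


def datafiles_target(working_dir, name):
    """Where the archive is expected, relative to the server's working directory."""
    return PurePosixPath(working_dir) / "sling" / "datafiles" / solr_archive(name)


def site_url(host, name, root=""):
    """URL of the ReferencedSite, following http://{host}/{root}/entityhub/site/{name}."""
    segments = [host]
    if root.strip("/"):
        segments.append(root.strip("/"))
    segments += ["entityhub", "site", name]
    return "http://" + "/".join(segments)


# Placeholder values for illustration only.
print(bundle_jar("dbpedia", "1.0.0"))
print(datafiles_target("/opt/stanbol", "dbpedia"))
print(site_url("localhost:8080", "dbpedia"))
```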
Enhancement Example
With the default configuration of enhancement engines and a local index of DBpedia entities, the text "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley." produces an output graph containing several Entity Annotations and Text Annotations.
Two of the relevant fragments for "Paris" are listed below in Turtle syntax:
Example for Text Annotation

    <urn:enhancement-4a2543d8-4d83-43ce-3a33-2924f457c872>
        a <http://fise.iks-project.eu/ontology/TextAnnotation> ,
          <http://fise.iks-project.eu/ontology/Enhancement> ;
        <http://fise.iks-project.eu/ontology/confidence>
            "0.9322403510215739"^^<http://www.w3.org/2001/XMLSchema#double> ;
        <http://fise.iks-project.eu/ontology/end>
            "59"^^<http://www.w3.org/2001/XMLSchema#int> ;
        <http://fise.iks-project.eu/ontology/extracted-from>
            <urn:content-item-sha1-37c8a8244041cf6113d4ee04b3a04d0a014f6e10> ;
        <http://fise.iks-project.eu/ontology/selected-text>
            "Paris"^^<http://www.w3.org/2001/XMLSchema#string> ;
        <http://fise.iks-project.eu/ontology/selection-context>
            "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley."^^<http://www.w3.org/2001/XMLSchema#string> ;
        <http://fise.iks-project.eu/ontology/start>
            "54"^^<http://www.w3.org/2001/XMLSchema#int> ;
        <http://purl.org/dc/terms/created>
            "2012-02-29T11:18:36.282Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://purl.org/dc/terms/creator>
            "org.apache.stanbol.enhancer.engines.opennlp.impl.NEREngineCore"^^<http://www.w3.org/2001/XMLSchema#string> ;
        <http://purl.org/dc/terms/type>
            <http://dbpedia.org/ontology/Place> .
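The `start` and `end` values in this Text Annotation can be checked against the submitted text. The sketch below assumes that they are zero-based character offsets with an exclusive end, which is consistent with the values 54 and 59 in this example; the helper name `selected_text` is illustrative.

```python
# The content that was posted to the enhancer in the example.
CONTENT = ("The Stanbol enhancer can detect famous cities such as Paris "
           "and people such as Bob Marley.")


def selected_text(content, start, end):
    """Return the characters from start (inclusive) to end (exclusive)."""
    return content[start:end]


print(selected_text(CONTENT, 54, 59))  # prints "Paris"
```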
Example for Entity Annotation

    <urn:enhancement-b5e71f70-4978-a70b-7111-8d6e31283a58>
        a <http://fise.iks-project.eu/ontology/EntityAnnotation> ,
          <http://fise.iks-project.eu/ontology/Enhancement> ;
        <http://fise.iks-project.eu/ontology/confidence>
            "1323049.5"^^<http://www.w3.org/2001/XMLSchema#double> ;
        <http://fise.iks-project.eu/ontology/entity-label>
            "Paris"@en ;
        <http://fise.iks-project.eu/ontology/entity-reference>
            <http://dbpedia.org/resource/Paris> ;
        <http://fise.iks-project.eu/ontology/entity-type>
            <http://www.w3.org/2002/07/owl#Thing> ,
            <http://www.opengis.net/gml/_Feature> ,
            <http://dbpedia.org/ontology/Place> ,
            <http://dbpedia.org/ontology/Settlement> ,
            <http://dbpedia.org/ontology/PopulatedPlace> ;
        <http://fise.iks-project.eu/ontology/extracted-from>
            <urn:content-item-sha1-37c8a8244041cf6113d4ee04b3a04d0a014f6e10> ;
        <http://purl.org/dc/terms/created>
            "2012-02-29T11:18:36.320Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://purl.org/dc/terms/creator>
            "org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine"^^<http://www.w3.org/2001/XMLSchema#string> ;
        <http://purl.org/dc/terms/relation>
            <urn:enhancement-4a2543d8-4d83-43ce-3a33-2924f457c872> .
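In the two fragments, the Entity Annotation refers back to the Text Annotation through the `http://purl.org/dc/terms/relation` property. Assuming this is how a suggested entity is tied to a text occurrence, the link can be followed as in the sketch below. It uses plain dictionaries with values copied from the fragments rather than an RDF library, and the function name `suggested_entities` is hypothetical.

```python
FISE = "http://fise.iks-project.eu/ontology/"
DCTERMS = "http://purl.org/dc/terms/"

# Simplified stand-ins for the two fragments; only the properties used here.
text_annotation = {
    "id": "urn:enhancement-4a2543d8-4d83-43ce-3a33-2924f457c872",
    FISE + "selected-text": "Paris",
}

entity_annotations = [
    {
        "id": "urn:enhancement-b5e71f70-4978-a70b-7111-8d6e31283a58",
        FISE + "entity-label": "Paris",
        FISE + "entity-reference": "http://dbpedia.org/resource/Paris",
        DCTERMS + "relation": "urn:enhancement-4a2543d8-4d83-43ce-3a33-2924f457c872",
    },
]


def suggested_entities(text_annotation, entity_annotations):
    """Return entity references whose dc:relation points at the text annotation."""
    return [
        annotation[FISE + "entity-reference"]
        for annotation in entity_annotations
        if annotation.get(DCTERMS + "relation") == text_annotation["id"]
    ]


print(suggested_entities(text_annotation, entity_annotations))
```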