This project has retired. For details please refer to its Attic page.

RESTful NLP Analysis Service

STANBOL-892 added a standard RESTful NLP Analyses service based on the JSON serialization support for the AnalysedText content part.

On the Stanbol Enhancer side the service is consumed by the RESTful NLP processing Engine, meaning that integrators of NLP frameworks only need to take care of implementing the RESTful service.

This option of integrating an NLP framework with the Stanbol Enhancer should be considered in the following scenarios:

The first NLP processing frameworks integrated by this method were Freeling and Talismane. Integrators are strongly encouraged to have a look at those two implementations.

The following subsections provide more information on how to implement a Stanbol compatible RESTful NLP Analyses service.

RESTful Interface

The RESTful Interface (as specified by STANBOL-892) defines two services:

  1. Supported Languages: The languages supported by the NLP Analyses server need to be returned as a JSON array on GET requests to the Analysis Endpoint. For example, the request curl -X GET -H "Accept: application/json" http://{analysis-endpoint} will return the supported languages like ["it","en"].
  2. NLP Analysis: The NLP analysis of a text is done by sending a POST request to the Analysis Endpoint. The language of the submitted text can be specified using the 'Content-Language' header. If the header is not present, the service might try to detect the language, or return 'HTTP Error 400 Bad request' if detection is not possible or not supported. The response is a JSON serialized AnalysedText content part (see below for more information). The response must also include the Content-Language header with the language of the processed text as its value.

For both services the Accept header is optional. However, if present, it must be set to application/json. Implementations might also support HTML compatible media types to serve the documentation if a browser sends a request to the Analysis Endpoint.
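The two services can be sketched with the HTTP server that ships with the JDK. This is only a minimal illustration, not the Freeling or Talismane implementation: the class name AnalysisEndpointSketch, the /analysis path, treating "*/*" as acceptable, the 406 and 405 status codes and the UTF-8 decoding of the request body are all assumptions that the specification above does not make. The sketch performs no real analysis and only returns the mandatory Text span.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of an Analysis Endpoint; class name and path are illustrative. */
final class AnalysisEndpointSketch {

    /** Languages this hypothetical service supports. */
    static final List<String> LANGUAGES = List.of("it", "en");

    /** Starts the endpoint on 127.0.0.1; port 0 lets the OS pick a free port. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/analysis", AnalysisEndpointSketch::handle);
        server.start();
        return server;
    }

    static void handle(HttpExchange ex) throws IOException {
        String accept = ex.getRequestHeaders().getFirst("Accept");
        boolean jsonOk = accept == null || accept.contains("application/json")
                || accept.contains("*/*"); // assumption: a wildcard counts as JSON
        if (!jsonOk) {
            respond(ex, 406, null, ""); // assumption: no status is specified for this case
        } else if ("GET".equals(ex.getRequestMethod())) {
            // Supported Languages service: a JSON array of language codes
            String json = LANGUAGES.stream().map(l -> "\"" + l + "\"")
                    .collect(Collectors.joining(",", "[", "]"));
            respond(ex, 200, null, json);
        } else if ("POST".equals(ex.getRequestMethod())) {
            String text = new String(ex.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            String lang = ex.getRequestHeaders().getFirst("Content-Language");
            if (lang == null || !LANGUAGES.contains(lang)) {
                respond(ex, 400, null, ""); // this sketch does no language detection
            } else {
                // A real service would add Sentence, Chunk and Token spans here.
                String json = "{\"spans\":[{\"type\":\"Text\",\"start\":0,\"end\":"
                        + text.length() + "}]}";
                respond(ex, 200, lang, json);
            }
        } else {
            respond(ex, 405, null, "");
        }
    }

    static void respond(HttpExchange ex, int status, String language, String body)
            throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        if (language != null) {
            ex.getResponseHeaders().set("Content-Language", language);
        }
        if (bytes.length == 0) {
            ex.sendResponseHeaders(status, -1); // -1 means: no response body
        } else {
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(status, bytes.length);
            ex.getResponseBody().write(bytes);
        }
        ex.close();
    }
}
```

After calling AnalysisEndpointSketch.start(8080), the curl request shown above can be sent to http://127.0.0.1:8080/analysis to try the Supported Languages service.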

Integration of NLP frameworks with the Stanbol NLP processing Module

When implementing the Stanbol RESTful NLP Analyses service, integrators will need to convert the results of the integrated NLP framework to the Stanbol NLP processing framework. This is required to

The documentation of the AnalysedText provides a good overview and several examples on how to use the API.

A special feature of the Stanbol NLP processing module is that it supports the alignment of TagSets (typically simple String codes) with ontological concepts as defined by the OLIA ontology. This alignment is important because it allows other EnhancementEngines to process NLP annotations without needing to know language specific tag sets.

To avoid the need for users to use the OLIA ontology directly, the Stanbol NLP processing module defines Java enumerations (LexicalCategory, Pos, Case, Definitness, Gender, NumberFeature, Person, Tense, VerbMood) for the concepts defined by the ontology. Server-side implementations written in Java should use those enumerations. Implementations in other programming languages can use the names of the enumeration entries (e.g. PossessiveAdjective as defined in the Pos enumeration).

Typically the mappings between the tag sets of NLP frameworks and the tags used by Stanbol are defined in TagSets. The following example shows such mappings for the Penn Treebank POS tag set for English:

/** Penn Treebank Stanbol NLP module mappings */
public static final TagSet<PosTag> PENN_TREEBANK = new TagSet<PosTag>(
    "Penn Treebank", "en");

static {
    PENN_TREEBANK.addTag(new PosTag("CC", Pos.CoordinatingConjunction));
    PENN_TREEBANK.addTag(new PosTag("CD", Pos.CardinalNumber));
    //[..]
    PENN_TREEBANK.addTag(new PosTag("NN", Pos.CommonNoun, Pos.SingularQuantifier));
    PENN_TREEBANK.addTag(new PosTag("NNP", Pos.ProperNoun, Pos.SingularQuantifier));
    PENN_TREEBANK.addTag(new PosTag("NNPS", Pos.ProperNoun, Pos.PluralQuantifier));
    PENN_TREEBANK.addTag(new PosTag("NNS", Pos.CommonNoun, Pos.PluralQuantifier));
    //[..]
}
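For services that cannot depend on the Stanbol NLP module, the same idea can be approximated with a plain lookup table from framework tags to the names of the Pos enumeration entries. The following is a rough sketch only: the class PennTreebankNames and its posNames method are hypothetical and not part of Stanbol, and the table covers just the tags shown in the Penn Treebank example above.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a TagSet, using only enumeration entry names. */
final class PennTreebankNames {

    /** Framework tag mapped to the names of the Pos entries it is aligned with. */
    static final Map<String, List<String>> POS_NAMES = Map.of(
            "CC", List.of("CoordinatingConjunction"),
            "CD", List.of("CardinalNumber"),
            "NN", List.of("CommonNoun", "SingularQuantifier"),
            "NNP", List.of("ProperNoun", "SingularQuantifier"),
            "NNPS", List.of("ProperNoun", "PluralQuantifier"),
            "NNS", List.of("CommonNoun", "PluralQuantifier"));

    /** Returns the Pos names for a tag, or an empty list if the tag is not mapped. */
    static List<String> posNames(String tag) {
        return POS_NAMES.getOrDefault(tag, List.of());
    }
}
```

Returning an empty list for unmapped tags keeps the string tag usable on its own, so such a tag can still be reported without any aligned concept.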

JSON serialization of the AnalysedText

For Java users it is strongly recommended to use the AnalyzedTextSerializer provided by the org.apache.stanbol:org.apache.stanbol.enhancer.nlp.json module. Users that use a JAX-RS framework can also use the AnalyzedTextWriter that implements the MessageBodyWriter interface for AnalysedText.

Non-Java users will need to generate the JSON themselves based on the following documentation:

  1. Root Element: The JSON representation of the AnalysedText uses a JSON object as root object. This root object has the spans attribute with an array as value. The array contains all JSON serialized Spans as values. The first entry in the array MUST be the AnalysedText itself, that is a span with "type" : "Text".
  2. Span: Each Span is serialized as a JSON object with the following attributes:
    • type: one of Text, Sentence, Chunk or Token
    • start: the absolute start index of the span
    • end: the absolute end index of the span
    • {annotation-key}: keys used by annotations of the span. Values can be either an array or a single value
  3. Annotations: Annotations are encoded as JSON objects. They are added directly to the Span as values of the {annotation-key}. The encoding of {annotation-value}(s) is specific to the value. However, the following keys are reserved:
    • class (required): the Java class of the annotation value (e.g. org.apache.stanbol.enhancer.nlp.pos.PosTag or java.lang.Double in case the value is a simple Double). This information is required by the deserializer to select the correct parser.
    • prob (optional): the probability for the annotation. Expected to be a floating point number in the range [0..1]
  4. PosTag: This annotation uses the annotation-key stanbol.enhancer.nlp.pos and defines the following attributes
    • class: org.apache.stanbol.enhancer.nlp.pos.PosTag
    • tag (required): The String tag as used by the TagSet
    • lc (optional): The name(s) of the LexicalCategories (e.g. Noun, Verb …). In case of a single value the use of a JSON Array is optional. NOTE that instead of the names it is also possible to use ordinal numbers.
    • pos (optional): The name(s) of the POS tags (e.g. ProperNoun, FiniteVerb …). In case of a single value the use of a JSON Array is optional. NOTE that instead of the names it is also possible to use ordinal numbers.
  5. PhraseTag: This annotation uses the annotation-key stanbol.enhancer.nlp.phrase and defines the following attributes
    • class: org.apache.stanbol.enhancer.nlp.phrase.PhraseTag
    • tag (required): The String tag as used by the TagSet
    • lc (optional): The name(s) of the LexicalCategories (e.g. Noun, Verb …). In case of a single value the use of a JSON Array is optional. NOTE that instead of the names it is also possible to use ordinal numbers.
  6. NerTag: This annotation uses the annotation-key stanbol.enhancer.nlp.ner and defines the following attributes
    • class: org.apache.stanbol.enhancer.nlp.ner.NerTag
    • tag (required): The String tag as used by the TagSet
    • uri (optional): the URI of the entity. Stanbol prefers the URIs http://dbpedia.org/ontology/Person, http://dbpedia.org/ontology/Organisation and http://dbpedia.org/ontology/Place for Persons, Organizations and Places.
  7. MorphoFeatures: This annotation uses the annotation-key stanbol.enhancer.nlp.morpho and defines the following attributes
    • class: org.apache.stanbol.enhancer.nlp.morpho.MorphoFeatures
    • lemma: The lemma for the annotated span. MUST BE a single value.
    • pos (optional): An array of PosTags, encoded as specified above. Integrators are free to add all possible morphological interpretations for a Span or just those that correspond with the detected POS tag of a word.
    • case (optional) : Array with CaseTag elements defining the following attributes
      • tag (required): The string tag as used by the NLP framework
      • type (optional): the Case name
    • definitness (optional): The Definitness value
    • gender (optional): Array with GenderTag elements defining the following attributes
      • tag (required): The string tag as used by the NLP framework
      • type (optional): the Gender name
    • number (optional): Array with the NumberTag elements defining the following attributes
      • tag (required): The string tag as used by the NLP framework
      • type (optional): the NumberFeature name
    • person: The Person name
    • tense (optional): Array with the TenseTag elements defining the following attributes
      • tag (required): The string tag as used by the NLP framework
      • type (optional): the Tense name
    • verb-mood (optional): Array with the MoodTag elements defining the following attributes
      • tag (required): The string tag as used by the NLP framework
      • type (optional): the VerbMood name
  8. Default Value Mappings: For annotations without a specific serializer/parser, Stanbol uses Jackson Data Binding. In those cases the annotation-key is still the string used by the annotation of the Span
    • class: The class of the value. Typically java.lang.String, any java.lang.Number, or a collection
    • value (required): Holding the JSON serialized value.
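For implementers who cannot use the AnalyzedTextSerializer, the rules above can be sketched with plain collections and a very small JSON writer. This is a minimal sketch under stated assumptions: the class SpanJsonSketch and its methods are hypothetical names, only the PosTag annotation is covered, and the string escaping handles quotes and backslashes only, so a real service should rely on a proper JSON library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that builds the span JSON described in the list above. */
final class SpanJsonSketch {

    /** A span with the three common attributes; annotations are added by the caller. */
    static Map<String, Object> span(String type, int start, int end) {
        Map<String, Object> span = new LinkedHashMap<>();
        span.put("type", type);
        span.put("start", start);
        span.put("end", end);
        return span;
    }

    /** Value for the annotation-key stanbol.enhancer.nlp.pos; prob may be null. */
    static Map<String, Object> posTag(String tag, List<String> posNames, Double prob) {
        Map<String, Object> value = new LinkedHashMap<>();
        value.put("tag", tag);
        if (posNames.size() == 1) {
            value.put("pos", posNames.get(0)); // the array is optional for a single value
        } else if (!posNames.isEmpty()) {
            value.put("pos", posNames);
        }
        value.put("class", "org.apache.stanbol.enhancer.nlp.pos.PosTag");
        if (prob != null) {
            value.put("prob", prob);
        }
        return value;
    }

    /** Root object; the Text span is always written as the first entry. */
    static String analysedText(int textLength, List<Map<String, Object>> spans) {
        List<Object> all = new ArrayList<>();
        all.add(span("Text", 0, textLength));
        all.addAll(spans);
        return toJson(Map.of("spans", all));
    }

    /** Minimal writer for maps, lists, numbers, booleans and strings (no nulls). */
    static String toJson(Object value) {
        if (value instanceof Map) {
            return ((Map<?, ?>) value).entrySet().stream()
                    .map(e -> toJson(String.valueOf(e.getKey())) + ":" + toJson(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
        }
        if (value instanceof List) {
            return ((List<?>) value).stream().map(SpanJsonSketch::toJson)
                    .collect(Collectors.joining(",", "[", "]"));
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        return "\"" + value.toString().replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

A Token span is built with span("Token", start, end), the result of posTag(...) is put under the key stanbol.enhancer.nlp.pos, and the list of spans is then passed to analysedText together with the length of the analysed text.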

The following example shows the JSON as serialized and parsed by the unit test. It contains at least one example for each of the elements described above.

{
    "spans" : [ {
        "type" : "Text",
        "start" : 0,
        "end" : 90
    }, {
        "type" : "Sentence",
        "start" : 0,
        "end" : 90
    }, {
        "type" : "Token",
        "start" : 0,
        "end" : 3,
        "stanbol.enhancer.nlp.pos" : {
            "tag" : "PREP",
            "pos" : 12,
            "class" : "org.apache.stanbol.enhancer.nlp.pos.PosTag",
            "prob" : 0.85
        }
    }, {
        "type" : "Chunk",
        "start" : 4,
        "end" : 20,
        "stanbol.enhancer.nlp.ner" : {
            "tag" : "organization",
            "uri" : "http://dbpedia.org/ontology/Organisation",
            "class" : "org.apache.stanbol.enhancer.nlp.ner.NerTag"
        },
        "stanbol.enhancer.nlp.phrase" : {
            "tag" : "NP",
            "lc" : 0,
            "class" : "org.apache.stanbol.enhancer.nlp.phrase.PhraseTag",
            "prob" : 0.98
        }
    }, {
        "type" : "Token",
        "start" : 4,
        "end" : 11,
        "stanbol.enhancer.nlp.pos" : {
            "tag" : "PN",
            "pos" : 53,
            "class" : "org.apache.stanbol.enhancer.nlp.pos.PosTag",
            "prob" : 0.95
        },
        "stanbol.enhancer.nlp.sentiment" : {
            "value" : 0.5,
            "class" : "java.lang.Double"
        }
    }, {
        "type" : "Token",
        "start" : 12,
        "end" : 20,
        "stanbol.enhancer.nlp.pos" : [ {
            "tag" : "PN",
            "pos" : 53,
            "class" : "org.apache.stanbol.enhancer.nlp.pos.PosTag",
            "prob" : 0.95
        }, {
            "tag" : "N",
            "lc" : 0,
            "class" : "org.apache.stanbol.enhancer.nlp.pos.PosTag",
            "prob" : 0.87
        } ],
        "stanbol.enhancer.nlp.morpho" : {
            "lemma" : "enhance",
            "case" : [ {
                "tag" : "test-case-1",
                "type" : "Comitative"
            }, {
                "tag" : "test-case-2",
                "type" : "Abessive"
            } ],
            "definitness" : "Definite",
            "gender" : [ {
                "tag" : "test-gender",
                "type" : "Masculine"
            } ],
            "number" : [ {
                "tag" : "test-number",
                "type" : "Plural"
            } ],
            "person" : "First",
            "pos" : [ {
                "tag" : "PN",
                "pos" : 53
            } ],
            "tense" : [ {
                "tag" : "test-tense",
                "type" : "Present"
            } ],
            "verb-mood" : [ {
                "tag" : "test-verb-mood",
                "type" : "ConditionalVerb"
            } ],
            "class" : "org.apache.stanbol.enhancer.nlp.morpho.MorphoFeatures"
        }
    } ]
}