This project has retired. For details please refer to its Attic page.
Apache Stanbol - NLP Annotations

NLP Annotations

While the Analyzed Text interface allows defining Sentences, Chunks and Tokens within the text and attaching annotations to them, this part of the Stanbol NLP processing module defines the Java domain model used for those annotations. This includes annotation models for Part of Speech (POS) tags, Chunks, recognized Named Entities (NER) as well as morphological analysis.

Part of Speech (POS) annotations

Part of Speech (POS) tagging is a token-level annotation. It assigns categories such as noun, verb, adjective or punctuation to tokens. These annotations are typically provided by a POS tagger that consumes Tokens and outputs one or more tags, each with a confidence. Tags are usually string values that are members of a tag set, a fixed list of tags used to annotate tokens. Tag sets are typically language specific and often even specific to the training corpus. This makes it hard to consume POS tags created by different POS taggers for different languages, as the consumer would need to know the meaning of every POS tag in every tag set.

The POS annotation model defined by the Stanbol NLP module tries to solve this issue by providing means to align POS tag sets with formal categories defined by the OLiA Ontology. The following sub-section will provide details and usage examples.

OLiA MorphosyntacticCategories

The 'OLiA Reference Model for Morphology and Morphosyntax, with experimental extension to Syntax' defines a set of ~150 formally defined, multi-lingual POS tags. These categories form an acyclic multi-hierarchy with 'olia:MorphosyntacticCategory' as the common root.

For example, the POS 'olia:Gerund' is defined as an 'olia:NonFiniteVerb', which itself is an 'olia:Verb'. An example of the multi-hierarchy is 'olia:NominalQuantifier', which is both an 'olia:Noun' and an 'olia:Quantifier'.
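The multi-hierarchy described above can be illustrated with a minimal, self-contained sketch. The `Category` enum below is a hypothetical stand-in written for this example only. It is not the Stanbol or OLiA API, and it contains just the six categories named in the text.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in, NOT the Stanbol or OLiA API: the categories named
// in the text, modelled as an enum whose members may have several parents.
enum Category {
    Verb,
    NonFiniteVerb(Verb),
    Gerund(NonFiniteVerb),
    Noun,
    Quantifier,
    NominalQuantifier(Noun, Quantifier); // two parents: a multi-hierarchy

    private final List<Category> parents;

    Category(Category... parents) {
        this.parents = Arrays.asList(parents);
    }

    /** Returns all direct and transitive super-categories. */
    Set<Category> ancestors() {
        Set<Category> result = EnumSet.noneOf(Category.class);
        for (Category parent : parents) {
            result.add(parent);
            result.addAll(parent.ancestors());
        }
        return result;
    }

    /** True if this category equals the other one or is a sub-category of it. */
    boolean isA(Category other) {
        return this == other || ancestors().contains(other);
    }
}

class OliaHierarchySketch {
    public static void main(String[] args) {
        // Gerund -> NonFiniteVerb -> Verb
        System.out.println(Category.Gerund.isA(Category.Verb)); // true
        // NominalQuantifier is both a Noun and a Quantifier
        System.out.println(Category.NominalQuantifier.ancestors()); // [Noun, Quantifier]
    }
}
```

A consumer that only cares about verbs can then test `isA(Category.Verb)` without knowing the more specific category a tagger assigned.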

To integrate the formal definitions of the OLiA ontology with the Stanbol NLP annotations there are two Java enumerations, LexicalCategory and Pos, both of which are used in the examples that follow.

PosTag and TagSet

The PosTag represents a POS tag as used by a POS tagger. PosTags support the following features:

An example of a PosTag representing an 'olia:ProperNoun' looks as follows:

PosTag tag = new PosTag("NP", Pos.ProperNoun);

The first parameter is the string POS tag used by the POS tagger and the second parameter represents the mapping to the OLiA MorphosyntacticCategories for this tag. The next example shows a more sophisticated mapping for the "PWAV" tag (adverbial interrogative or relative pronoun) as used by the STTS tag set for the German language:

new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun, Pos.InterrogativePronoun);

TagSet is the other important class, as it manages the set of PosTag instances. It has two main functions. First, it lets the integrator of a POS tagger define the mappings from the string POS tags used by the tagger to the LexicalCategory and Pos enumeration members that the Stanbol NLP chain prefers to work with. Second, it ensures that a single PosTag instance is used to annotate all Tokens with the same type.

TagSets are typically specified as static members of utility classes. The following code snippet shows an example:

//Tagset is generically typed. We need a TagSet for PosTag's
public static final TagSet<PosTag> STTS = new TagSet<PosTag>(
    "STTS", "de"); //define a name and the languages it supports

static {
    //you can set properties to a TagSet. While supported this
    //feature is currently not used by Stanbol
    STTS.getProperties().put("olia.annotationModel",
        new UriRef("http://purl.org/olia/stts.owl"));
    STTS.getProperties().put("olia.linkingModel",
        new UriRef("http://purl.org/olia/stts-link.rdf"));
    STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective));
    STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective));
    STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb));
    //[...]
}

The string tag (first parameter) of the PosTag is used as the unique key by the TagSet. Adding a second PosTag with the same string tag overrides the first one. PosTags that are added to a TagSet have their Tag#getAnnotationModel() property set to that TagSet.

The final example shows the core part of a POS tagging engine that uses the AnalyzedText API together with the PosTag and TagSet APIs.
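The keying behaviour described above can be sketched in isolation. `SketchTag` and `SketchTagSet` below are hypothetical, simplified stand-ins written for this example. They are not the Stanbol classes, and the string-valued `category` field is an assumption that replaces the LexicalCategory and Pos mappings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration only;
// these are NOT the Stanbol PosTag and TagSet classes.
class SketchTag {
    final String tag;      // string tag as emitted by the tagger
    final String category; // simplified mapping target (assumption)

    SketchTag(String tag, String category) {
        this.tag = tag;
        this.category = category;
    }
}

class SketchTagSet {
    // the string tag is the unique key
    private final Map<String, SketchTag> tags = new HashMap<>();

    /** Registers a tag and returns the tag it replaced, or null. */
    SketchTag addTag(SketchTag tag) {
        return tags.put(tag.tag, tag);
    }

    /** Returns the single registered instance for the string tag, or null. */
    SketchTag get(String tag) {
        return tags.get(tag);
    }

    int size() {
        return tags.size();
    }
}

class TagSetSketchDemo {
    public static void main(String[] args) {
        SketchTagSet set = new SketchTagSet();
        set.addTag(new SketchTag("ADV", "Adverb"));
        // same string tag: the second registration overrides the first
        set.addTag(new SketchTag("ADV", "Adjective"));
        System.out.println(set.size());              // 1
        System.out.println(set.get("ADV").category); // Adjective
        // repeated lookups return the very same instance
        System.out.println(set.get("ADV") == set.get("ADV")); // true
    }
}
```

Because lookups always return the registered instance, every Token tagged "ADV" shares one tag object instead of holding its own copy.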

TagSet<PosTag> tagSet; //the used TagSet
//holds PosTags for tags returned by the POS tagger that
//are missing in the TagSet
Map<String,PosTag> adhocTags = new HashMap<String,PosTag>();
List<Span> tokens = new ArrayList<Span>(64);

Iterator<Section> sentences; //Iterator over the sentences

while(sentences.hasNext()){
    Section sentence = sentences.next();
    //get the tokens of the current sentence
    tokens.clear();
    AnalysedTextUtils.appandToList(
        sentence.getEnclosed(SpanTypeEnum.Token),
        tokens);
    //typically one needs also to get the Strings
    //of the tokens for the pos tagger
    String[] tokenText = new String[tokens.size()];
    for(int i=0;i<tokens.size();i++){
        tokenText[i] = tokens.get(i).getSpan();
    }

    //now POS tag the sentence
    String[] posTags = posTagger.tag(tokenText);

    //finally apply the PosTags and save the annotation
    for(int i=0;i<tokens.size();i++){
        PosTag tag = tagSet.get(posTags[i]);
        if(tag == null) { //unmapped tag
            tag = adhocTags.get(posTags[i]);
        }
        if(tag == null) { //unknown tag
            tag = new PosTag(posTags[i]);
            adhocTags.put(posTags[i],tag);
        }
        //add the annotation to the i-th Token
        tokens.get(i).addAnnotation(
            NlpAnnotations.POS_ANNOTATION,
            Value.value(tag));
    }
}
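The lookup order used in the loop above (registered tag first, then a cached ad-hoc tag, then a newly created ad-hoc tag) can be isolated into a small helper. This is a sketch under assumptions: `AdhocTagSketch`, `resolve` and `AdhocTag` are hypothetical names, and plain maps stand in for the TagSet.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration; plain Maps stand in for the TagSet.
class AdhocTagSketch {

    /**
     * Returns the registered tag for the key if there is one. Otherwise
     * returns the cached ad-hoc tag, creating and caching it on first use.
     */
    static <T> T resolve(String key, Map<String, T> registered,
                         Map<String, T> adhoc, Function<String, T> factory) {
        T tag = registered.get(key);
        if (tag == null) { // unmapped tag: fall back to the ad-hoc cache
            tag = adhoc.computeIfAbsent(key, factory);
        }
        return tag;
    }

    // minimal tag type used only by this demo
    static final class AdhocTag {
        final String tag;
        AdhocTag(String tag) { this.tag = tag; }
    }

    public static void main(String[] args) {
        Map<String, AdhocTag> registered = new HashMap<>();
        registered.put("ADV", new AdhocTag("ADV"));
        Map<String, AdhocTag> adhoc = new HashMap<>();

        AdhocTag known = resolve("ADV", registered, adhoc, AdhocTag::new);
        AdhocTag first = resolve("XY", registered, adhoc, AdhocTag::new);
        AdhocTag again = resolve("XY", registered, adhoc, AdhocTag::new);

        System.out.println(known == registered.get("ADV")); // true
        System.out.println(first == again);                  // true
        System.out.println(adhoc.size());                    // 1
    }
}
```

Caching the ad-hoc tags keeps the single-instance property even for tags that were never registered, so tokens with the same unknown tag still share one tag object.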

Phrase annotations

Phrase annotations can be used to define the type of a Chunk. The PhraseTag class is used for phrase annotations. It defines a string tag and the phrase category. The LexicalCategory enumeration is used as the value for the category. As PhraseTag is a subclass of Tag, it can also be used in combination with the TagSet class as described in the "PosTag and TagSet" section.

The following code snippet shows how to create a PhraseTag for noun phrases:

PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun);

Named Entity (NER) annotations

Named Entity annotations are created by NER modules. Before the Stanbol NLP chain was introduced they were represented in Stanbol by 'fise:TextAnnotation's, and any Enhancement Engine that does NER should still support this. With the Stanbol NLP processing module it is now also possible to represent detected Named Entities as a Chunk with a NerTag added as Annotation.

A Named Entity represented as 'fise:TextAnnotation' includes the following information:

urn:namedEntity:1
    rdf:type fise:TextAnnotation, fise:Enhancement
    fise:selected-text {named-entity-text}
    fise:start {start-char-pos}
    fise:end {end-char-pos}
    dc:type {named-entity-type}

where:

The NerTag class extends Tag and can therefore also be used with the TagSet class. This means that users of the API can use a TagSet to manage the mappings from string tags to URIs for the supported Named Entity types.

The following code snippet shows how to add NER annotations to the AnalysedText:

AnalysedText at; //The AnalysedText
TagSet<NerTag> nerTags; //registered NER tags
Iterator<Section> sections; //sections to iterate over

List<String> tokenTexts = new ArrayList<String>(64);

while(sections.hasNext()){
    Section section = sections.next();
    //NER taggers typically need a String[] as input
    tokenTexts.clear();
    Iterator<Token> tokens = section.getTokens();
    while(tokens.hasNext()){
        tokenTexts.add(tokens.next().getSpan());
    }
    //Span -> #start #end #type #probability
    Span[] nerSpans = nerTagger.tag(
        tokenTexts.toArray(new String[tokenTexts.size()]));
    for(int i=0; i < nerSpans.length; i++){
        Chunk namedEntity = at.addChunk(
            nerSpans[i].start, nerSpans[i].end);
        NerTag tag = nerTags.get(nerSpans[i].type);
        if(tag == null){ //unmapped NER
            tag = new NerTag(nerSpans[i].type);
        }
        namedEntity.addAnnotation(
            NlpAnnotations.NER_ANNOTATION,
            Value.value(tag, nerSpans[i].probability));
    }
}

Note that the above code snippet only shows how to add the Named Entity to the AnalyzedText ContentPart. An actual NER engine implementation also needs to add this information to the metadata of the ContentItem.

ContentItem ci; //The processed ContentItem
Language language; //The Language of the processed Text
MGraph metadata = ci.getMetadata();
Section section; //the current Section
Chunk namedEntity; //the currently processed Named Entity

Value<NerTag> nerAnnotation = namedEntity.getAnnotation(
    NlpAnnotations.NER_ANNOTATION);

UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT,
    new PlainLiteralImpl(namedEntity.getSpan(), language)));
metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT,
    new PlainLiteralImpl(section.getSpan(), language)));
if(nerAnnotation.value().getType() != null){
    metadata.add(new TripleImpl(textAnnotation, DC_TYPE,
        nerAnnotation.value().getType()));
} //else do not add a dc:type for unmapped NamedEntities
metadata.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE,
    literalFactory.createTypedLiteral(nerAnnotation.probability())));
metadata.add(new TripleImpl(textAnnotation, ENHANCER_START,
    literalFactory.createTypedLiteral(namedEntity.getStart())));
metadata.add(new TripleImpl(textAnnotation, ENHANCER_END,
    literalFactory.createTypedLiteral(namedEntity.getEnd())));

Morphological Analyses

NOTE: This part of the Stanbol NLP annotations is still a work in progress, so this part of the API might undergo heavy changes even in minor releases.

The results of a morphological analysis are represented by the MorphoFeatures class and can be added to the analyzed word (Token) by using NlpAnnotations.MORPHO_ANNOTATION. The MorphoFeatures class provides the following features:

MorphoFeatures supports multi-valued annotations for all of the above features. Getters for a single value always return the first value that was added.
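The multi-valued behaviour with a "first added value" getter can be sketched as follows. `MultiValuedFeaturesSketch` is a hypothetical stand-in and not the MorphoFeatures API. The feature name "case" and its values are assumptions chosen only for the demonstration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT the Stanbol MorphoFeatures class.
class MultiValuedFeaturesSketch {
    // each feature keeps its values in insertion order
    private final Map<String, List<String>> values = new HashMap<>();

    /** Appends a value to the given feature. */
    void add(String feature, String value) {
        values.computeIfAbsent(feature, k -> new ArrayList<>()).add(value);
    }

    /** Single-value getter: the first value added, or null if there is none. */
    String getFirst(String feature) {
        List<String> list = values.get(feature);
        return list == null ? null : list.get(0);
    }

    /** Multi-value getter: all values in the order they were added. */
    List<String> getAll(String feature) {
        List<String> list = values.get(feature);
        return list == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(list);
    }

    public static void main(String[] args) {
        MultiValuedFeaturesSketch features = new MultiValuedFeaturesSketch();
        // an ambiguous word form may carry more than one value
        features.add("case", "nominative");
        features.add("case", "accusative");
        System.out.println(features.getFirst("case")); // nominative
        System.out.println(features.getAll("case"));   // [nominative, accusative]
    }
}
```

With this contract, a producer that can rank its results should add the most likely value first, since that is the one single-value consumers will see.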