NLP Annotations
While the AnalysedText interface allows defining Sentences, Chunks and Tokens within the text and attaching annotations to them, this part of the Stanbol NLP processing module defines the Java domain model used for those annotations. This includes annotation models for Part of Speech (POS) tags, Chunks, recognized Named Entities (NER) as well as morphological analysis.
Part of Speech (POS) annotations
Part of Speech (POS) tagging is a token-level annotation. It assigns categories such as noun, verb, adjective or punctuation to tokens. These annotations are typically provided by a POS tagger that consumes Tokens and provides tag(s) with confidence(s) as output. Tags are usually string values that are members of a TagSet, a fixed list of tags used to annotate tokens. Tag sets are typically language specific and often even specific to a training corpus. This makes it hard to consume POS tags created by different POS taggers for different languages, as the consumer would need to know the meaning of every POS tag in every tag set.
The POS annotation model defined by the Stanbol NLP module tries to solve this issue by providing means to align POS tag sets with formal categories defined by the OLiA Ontology. The following sub-section will provide details and usage examples.
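The benefit of such an alignment can be illustrated with a minimal, self-contained sketch. It does not use the Stanbol API: the class TagAlignmentSketch, its Category enum and the two maps are hypothetical names introduced only for this illustration. The tag strings themselves are taken from the STTS (German) and Penn Treebank (English) tag sets.

```java
import java.util.Map;

/** Illustration only: all names in this class are hypothetical, not Stanbol API. */
final class TagAlignmentSketch {

    /** Stand-in for a few shared, language-independent categories. */
    enum Category { Noun, Verb }

    /** A few STTS (German) tags mapped to the shared categories. */
    static final Map<String, Category> STTS_DE = Map.of(
            "NN", Category.Noun,
            "NE", Category.Noun,
            "VVFIN", Category.Verb);

    /** A few Penn Treebank (English) tags mapped to the same categories. */
    static final Map<String, Category> PENN_EN = Map.of(
            "NN", Category.Noun,
            "NNP", Category.Noun,
            "VBZ", Category.Verb);

    /** The consumer checks the shared category, never the raw tag string. */
    static boolean isNoun(Map<String, Category> tagSet, String tag) {
        return tagSet.get(tag) == Category.Noun;
    }

    public static void main(String[] args) {
        System.out.println(isNoun(STTS_DE, "NE"));
        System.out.println(isNoun(PENN_EN, "NNP"));
        System.out.println(isNoun(PENN_EN, "VBZ"));
    }
}
```

A consumer written this way needs no knowledge of "NE" or "NNP"; it only depends on the shared category, which is the idea behind mapping tag sets to the OLiA categories.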
OLiA MorphosyntacticCategories
The 'OLiA Reference Model for Morphology and Morphosyntax, with experimental extension to Syntax' defines a set of ~150 formally defined and multi-lingual POS tags. Those types are defined as a non-cyclic multi-hierarchy with 'olia:MorphosyntacticCategory' as common root.
To give an example, the POS 'olia:Gerund' is defined as an 'olia:NonFiniteVerb', which itself is an 'olia:Verb'. An example for a multi-hierarchy is 'olia:NominalQuantifier', which is both an 'olia:Noun' and an 'olia:Quantifier'.
To integrate the formal definitions of the OLiA ontology with the Stanbol NLP annotations there are two Java enumerations:
- LexicalCategory: This enumeration covers the 12 top level categories as defined by OLiA. This includes Noun, Verb, Adjective, Adposition, Adverb, Conjuction, Interjection, PronounOrDeterminer, Punctuation, Quantifier, Residual and Unique.
- Pos: This enumeration covers all OLiA MorphosyntacticCategories from the second level downwards. By using the Pos enum one can e.g. distinguish between ProperNoun and CommonNoun or between FiniteVerb and NonFiniteVerb. The Pos enumeration has full support for the multi-hierarchy as defined by OLiA. The Pos#categories() method returns the first level parents of a Pos member. The Pos#hierarchy() method returns all parents of a Pos member from the second level downwards.
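How an enumeration can support such a multi-hierarchy is sketched below. This is not the Stanbol Pos implementation: PosHierarchySketch and its nested enums are hypothetical stand-ins that only model the three OLiA examples mentioned above, and the method names categories() and hierarchy() are borrowed from the description for readability.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Illustration only: hypothetical stand-ins, not the Stanbol enumerations. */
final class PosHierarchySketch {

    /** A few top level categories. */
    enum Category { Noun, Verb, Quantifier }

    /** A few categories below the top level, each with one or more parents. */
    enum SubCategory {
        NonFiniteVerb(Category.Verb),
        Gerund(NonFiniteVerb),
        NominalQuantifier(Category.Noun, Category.Quantifier);

        private final List<Category> direct;     // top level parents
        private final List<SubCategory> parents; // lower level parents

        SubCategory(Category... direct) {
            this.direct = Arrays.asList(direct);
            this.parents = Collections.emptyList();
        }

        SubCategory(SubCategory... parents) {
            this.direct = Collections.emptyList();
            this.parents = Arrays.asList(parents);
        }

        /** Top level categories, collected through all parents. */
        Set<Category> categories() {
            Set<Category> result = EnumSet.noneOf(Category.class);
            result.addAll(direct);
            for (SubCategory parent : parents) {
                result.addAll(parent.categories());
            }
            return result;
        }

        /** All ancestors below the top level, collected transitively. */
        Set<SubCategory> hierarchy() {
            Set<SubCategory> result = EnumSet.noneOf(SubCategory.class);
            for (SubCategory parent : parents) {
                result.add(parent);
                result.addAll(parent.hierarchy());
            }
            return result;
        }
    }

    public static void main(String[] args) {
        System.out.println(SubCategory.Gerund.hierarchy());
        System.out.println(SubCategory.Gerund.categories());
        System.out.println(SubCategory.NominalQuantifier.categories());
    }
}
```

The sets are computed on demand rather than in the enum constructor, because an EnumSet of the enum's own type cannot be created safely while that enum is still being initialized.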
PosTag and TagSet
The PosTag represents a POS tag as used by a POS tagger. PosTags support the following features:
- tag [1..1]::String - This is the string tag as used by the POS tagger.
- category [0..*]::LexicalCategory - The assigned LexicalCategory enumeration members.
- pos [0..*]::Pos - The assigned Pos enumeration members.
An example of a PosTag representing an 'olia:ProperNoun' looks as follows:
PosTag tag = new PosTag("NP", Pos.ProperNoun);
The first parameter is the String POS tag used by the POS tagger and the second parameter represents the mapping to the OLiA MorphosyntacticCategories for this tag. The next example shows a more sophisticated mapping for the "PWAV" tag (adverbial interrogative or relative pronoun) as used by the STTS tag set for the German language:
new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun, Pos.InterrogativePronoun);
TagSet is the other important class as it manages the set of PosTag instances. TagSet has two main functions: First, it allows an integrator of a POS tagger with Stanbol to define the mappings from the string POS tags used by the POS tagger to the LexicalCategory and Pos enumeration members that are preferably used by the Stanbol NLP chain. Second, it ensures that only a single PosTag instance is used to annotate all Tokens with the same tag.
TagSets are typically specified as static members of utility classes. The following code snippet shows an example:
    //Tagset is generically typed. We need a TagSet for PosTag's
    public static final TagSet<PosTag> STTS = new TagSet<PosTag>(
        "STTS", "de"); //define a name and the languages it supports

    static {
        //you can set properties to a TagSet. While supported this
        //feature is currently not used by Stanbol
        STTS.getProperties().put("olia.annotationModel",
            new UriRef("http://purl.org/olia/stts.owl"));
        STTS.getProperties().put("olia.linkingModel",
            new UriRef("http://purl.org/olia/stts-link.rdf"));
        STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective));
        STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective));
        STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb));
        //[...]
    }
The string tag (first parameter) of the PosTag is used as unique key by the TagSet. Adding a second PosTag with the same tag will override the first one. PosTags that are added to a TagSet have the Tag#getAnnotationModel() property set to that TagSet.
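The key and override behaviour described here can be sketched with a small stand-alone model. The classes below are hypothetical simplifications written for this illustration and are not the Stanbol Tag and TagSet classes; in particular the field annotationModel only mimics what Tag#getAnnotationModel() is described to return.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: simplified, hypothetical model of a tag set. */
final class TagSetSketch {

    /** A tag string plus a back-reference to the tag set it was added to. */
    static final class SimpleTag {
        final String tag;
        SimpleTagSet annotationModel; // assigned when the tag is added to a tag set

        SimpleTag(String tag) {
            this.tag = tag;
        }
    }

    /** A tag set that uses the tag string as unique key. */
    static final class SimpleTagSet {
        final String name;
        private final Map<String, SimpleTag> tags = new HashMap<>();

        SimpleTagSet(String name) {
            this.name = name;
        }

        void addTag(SimpleTag tag) {
            tag.annotationModel = this;
            tags.put(tag.tag, tag); // an existing entry with the same key is replaced
        }

        /** Returns the registered instance, or null for an unmapped tag. */
        SimpleTag getTag(String tag) {
            return tags.get(tag);
        }
    }

    public static void main(String[] args) {
        SimpleTagSet stts = new SimpleTagSet("STTS");
        SimpleTag first = new SimpleTag("ADV");
        SimpleTag second = new SimpleTag("ADV");
        stts.addTag(first);
        stts.addTag(second);
        System.out.println(stts.getTag("ADV") == second);
        System.out.println(second.annotationModel == stts);
    }
}
```

Because every lookup for the same string returns the same object, all Tokens annotated with that tag share a single tag instance.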
The final example shows the core part of a POS tagging engine that uses both the AnalysedText and the PosTag and TagSet APIs.
    TagSet<PosTag> tagSet; //the used TagSet
    //holds PosTags for tags returned by the POS tagger that
    //are missing in the TagSet
    Map<String,PosTag> adhocTags = new HashMap<String,PosTag>();
    List<Span> tokens = new ArrayList<Span>(64);
    Iterator<Section> sentences; //Iterator over the sentences
    while(sentences.hasNext()){
        Section sentence = sentences.next();
        //get the tokens of the current sentence
        tokens.clear();
        AnalysedTextUtils.appandToList(
            sentence.getEnclosed(SpanTypeEnum.Token), tokens);
        //typically one needs also to get the Strings
        //of the tokens for the pos tagger
        String[] tokenText = new String[tokens.size()];
        for(int i=0;i<tokens.size();i++){
            tokenText[i] = tokens.get(i).getSpan();
        }
        //now POS tag the sentence
        String[] posTags = posTagger.tag(tokenText);
        //finally apply the PosTags and save the annotation
        for(int i=0;i<tokens.size();i++){
            PosTag tag = tagSet.get(posTags[i]);
            if(tag == null) { //unmapped tag
                tag = adhocTags.get(posTags[i]);
            }
            if(tag == null) { //unknown tag
                tag = new PosTag(posTags[i]);
                adhocTags.put(posTags[i],tag);
            }
            //add the annotation to the current Token
            tokens.get(i).addAnnotation(
                NlpAnnotations.POS_ANNOTATION, Value.value(tag));
        }
    }
Phrase annotations
Phrase annotations can be used to define the type of a Chunk. The PhraseTag class is used for phrase annotations. It defines a string tag and the phrase category. The LexicalCategory enumeration is used as value for the category. As PhraseTag is a subclass of Tag it can also be used in combination with the TagSet class as described in the PosTag and TagSet section.
The following code snippet shows how to create a PhraseTag for noun phrases:
PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun);
Named Entity (NER) annotations
Named Entity annotations are created by NER modules. Before the Stanbol NLP chain was introduced they were represented in Stanbol by 'fise:TextAnnotation's, and any Enhancement Engine that does NER should still support this. With the Stanbol NLP processing module it is now also possible to represent detected Named Entities as a Chunk with a NerTag added as Annotation.
A Named Entity represented as 'fise:TextAnnotation' includes the following information:
    urn:namedEntity:1
        rdf:type fise:TextAnnotation, fise:Enhancement
        fise:selected-text {named-entity-text}
        fise:start {start-char-pos}
        fise:end {end-char-pos}
        dc:type {named-entity-type}
where:
- {named-entity-text} is the text recognized as Named Entity. This is the same as returned by Chunk#getSpan()
- {start-char-pos} is the start character position of the Named Entity relative to the start of the text. This is the same as Chunk#getStart()
- {end-char-pos} is the end position and the same as Chunk#getEnd()
- {named-entity-type} is the type of the recognized Named Entity as URI. The NerTag allows defining both the string tag as used by the NER component and the URI this type is mapped to. In Stanbol it is preferred to use 'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the corresponding entity types.
The NerTag class extends Tag and can therefore be also used with the TagSet class. This means that users of the API can use TagSet to manage the string tag to URI mappings for the supported Named Entity types.
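The mapping from NER string tags to type URIs, and the relation between the character offsets and the selected text, can be sketched without the Stanbol API as follows. Everything here is an assumption made for illustration: the class NerMappingSketch, the tagger tags "PER", "ORG" and "LOC", the expansion of the dbpedia prefix to the DBpedia ontology namespace, and an exclusive end offset.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: hypothetical helper, not the Stanbol NerTag or TagSet. */
final class NerMappingSketch {

    /** Assumed expansion of the 'dbpedia:' prefix used in the text. */
    static final String DBPEDIA_ONT = "http://dbpedia.org/ontology/";

    private final Map<String, String> typeByTag = new HashMap<>();

    void map(String tag, String typeUri) {
        typeByTag.put(tag, typeUri);
    }

    /** Returns the mapped type URI, or null for an unmapped tag. */
    String typeOf(String tag) {
        return typeByTag.get(tag);
    }

    /** Selected text for the given offsets; the end offset is assumed exclusive. */
    static String selectedText(String text, int start, int end) {
        return text.substring(start, end);
    }

    /** Mappings for three hypothetical tagger tags. */
    static NerMappingSketch defaults() {
        NerMappingSketch mapping = new NerMappingSketch();
        mapping.map("PER", DBPEDIA_ONT + "Person");
        mapping.map("ORG", DBPEDIA_ONT + "Organisation");
        mapping.map("LOC", DBPEDIA_ONT + "Place");
        return mapping;
    }

    public static void main(String[] args) {
        String text = "Paris is in France.";
        NerMappingSketch mapping = defaults();
        System.out.println(selectedText(text, 12, 18) + " -> " + mapping.typeOf("LOC"));
        System.out.println(mapping.typeOf("MISC")); // unmapped tags yield null
    }
}
```

An unmapped tag returns null here, which corresponds to a Named Entity for which no dc:type would be written.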
The following code snippet shows how to add NER annotations to the AnalysedText:
    AnalysedText at; //The AnalysedText
    TagSet<NerTag> nerTags; //registered NER tags
    Iterator<Section> sections; //sections to iterate over
    List<String> tokenTexts = new ArrayList<String>(64);
    while(sections.hasNext()){
        Section section = sections.next();
        //NER taggers typically need String[] as input
        tokenTexts.clear();
        Iterator
Note that the above code snippet only shows how to add the Named Entity to the AnalysedText ContentPart. An actual NER engine implementation also needs to add this information to the metadata of the ContentItem.
    ContentItem ci; //The processed ContentItem
    Language lang; //The Language of the processed Text
    MGraph metadata = ci.getMetadata();
    Section section; //the current Section
    Chunk namedEntity; //the currently processed Named Entity
    Value<NerTag> nerAnnotation = namedEntity.getAnnotation(
        NlpAnnotations.NER_ANNOTATION);
    UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT,
        new PlainLiteralImpl(namedEntity.getSpan(), lang)));
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT,
        new PlainLiteralImpl(section.getSpan(), lang)));
    if(nerAnnotation.value().getType() != null){
        metadata.add(new TripleImpl(textAnnotation, DC_TYPE,
            nerAnnotation.value().getType()));
    } //else do not add a dc:type for unmapped Named Entities
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE,
        literalFactory.createTypedLiteral(nerAnnotation.probability())));
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_START,
        literalFactory.createTypedLiteral(namedEntity.getStart())));
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_END,
        literalFactory.createTypedLiteral(namedEntity.getEnd())));
Morphological Analyses
NOTE: This part of the Stanbol NLP annotations is still work in progress. So this part of the API might undergo heavy changes even in minor releases.
The results of a morphological analysis are represented by the MorphoFeatures class and can be added to the analyzed word (Token) by using NlpAnnotations.MORPHO_ANNOTATION. The MorphoFeatures class provides the following features:
- Lemma: A String value representing the lemmatization of the annotated Token.
- Case: The Case enumeration contains around 70 members defined based on concepts of the OLiA Ontology. The CaseTag allows to define cases and optionally map them to the cases defined by the enumeration.
- Definitness: The Definitness enumeration has the members Definite and Indefinite also defined by Concepts in the OLiA Ontology.
- Gender: The Gender enumeration contains the six genders defined by the OLiA Ontology. The GenderTag allows defining genders and optionally mapping them to the genders defined by the enumeration.
- Number: The NumberFeature enumeration defines the eight number features defined by OLiA. The NumberTag can be used to define number features and map them to the members of the enumeration.
- Person: The Person enumeration has the definitions for 'first', 'second' and 'third' with mappings to the corresponding concepts of the OLiA Ontology.
- Tense: The Tense enumeration represents the tense hierarchy as defined by the OLiA Ontology. Tense#getParent() gives access to the direct parent of a Tense, while Tense#getTenses() can be used to obtain the transitive closure (including the Tense object itself). TenseTag is used for Tense annotations. It allows both passing a string tag representing the tense and defining a mapping to the tenses defined by the Tense enumeration.
- Mood: The VerbMood enumeration currently defines members from different parts of the OLiA Ontology. OLiA does define the 'olia:MoodFeature' class, but its members were not a good match for the verb moods used by the CELI/linguagrid.org service. For now the decision was to define the VerbMood enumeration closer to the usage of CELI, but this clearly needs to be validated as soon as implementations for other NLP frameworks are added. There is also a VerbMoodTag that allows defining verb moods by a string tag and a mapping to the VerbMood enumeration.
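The parent and transitive-closure accessors described for Tense can be pictured with the following stand-alone sketch. It is not the Stanbol Tense enumeration: the enum SketchTense and its three members are hypothetical and do not claim to reproduce the OLiA tense hierarchy; only the method names getParent() and getTenses() are taken from the description above.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustration only: hypothetical single-parent hierarchy, not the Stanbol Tense enum. */
final class TenseClosureSketch {

    enum SketchTense {
        Past(null),
        RemotePast(Past), // illustrative member name only
        Present(null);

        private final SketchTense parent;

        SketchTense(SketchTense parent) {
            this.parent = parent;
        }

        /** Direct parent, or null for a root of the hierarchy. */
        SketchTense getParent() {
            return parent;
        }

        /** This tense plus all of its ancestors. */
        Set<SketchTense> getTenses() {
            Set<SketchTense> closure = EnumSet.noneOf(SketchTense.class);
            for (SketchTense t = this; t != null; t = t.parent) {
                closure.add(t);
            }
            return closure;
        }
    }

    public static void main(String[] args) {
        System.out.println(SketchTense.RemotePast.getParent());
        System.out.println(SketchTense.RemotePast.getTenses());
    }
}
```

With such a closure a consumer interested in any past tense can test whether Past is contained in the returned set instead of enumerating every specific past tense.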
The MorphoFeatures class supports multi-valued annotations for all of the above features. Getters for a single value always return the first added value.
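The "first added value wins" behaviour of the single-value getters can be sketched as follows. MultiValueSketch is a hypothetical helper written for this illustration; it is not part of MorphoFeatures, and returning null when no value was added is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustration only: hypothetical multi-valued feature holder. */
final class MultiValueSketch<T> {

    private final List<T> values = new ArrayList<>();

    void add(T value) {
        values.add(value);
    }

    /** Returns the first added value, or null if none was added (assumption). */
    T getFirst() {
        return values.isEmpty() ? null : values.get(0);
    }

    /** Returns all values in insertion order as a read-only view. */
    List<T> getAll() {
        return Collections.unmodifiableList(values);
    }

    public static void main(String[] args) {
        MultiValueSketch<String> gender = new MultiValueSketch<>();
        gender.add("Masculine");
        gender.add("Neuter");
        System.out.println(gender.getFirst());
        System.out.println(gender.getAll());
    }
}
```

Keeping insertion order means that a component adding its most likely analysis first also controls what consumers of the single-value getter see.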