AnalysedText
The AnalysedText is a Java domain model designed to describe NLP processing results. It consists of two major parts:
- Structure of the Text such as text-sections, sentences, chunks and tokens
- Annotations for the detected parts of the text.
AnalysedText as ContentPart
Within the Stanbol Enhancer the AnalysedText is used as a ContentPart registered with the URI urn:stanbol.enhancer:nlp.analysedText.
Because of that it can be retrieved with the following code:
    AnalysedText at;
    ci.getLock().readLock().lock();
    try {
        at = ci.getPart(AnalysedText.ANALYSED_TEXT_URI, AnalysedText.class);
    } catch (NoSuchPartException e) {
        //not present
        at = null;
    } finally {
        ci.getLock().readLock().unlock();
    }
Components that need to create an AnalysedText instance can do so by using the AnalysedTextFactory:
    @Reference
    AnalysedTextFactory atf;

    ContentItem ci; //the contentItem
    AnalysedText at;
    Entry<String,Blob> plainTextBlob = ContentItemHelper.getBlob(
        ci, Collections.singleton("text/plain"));
    if(plainTextBlob != null){
        //creates and adds the AnalysedText ContentPart to the ContentItem
        ci.getLock().writeLock().lock();
        try {
            at = atf.createAnalysedText(ci, plainTextBlob.getValue());
        } finally {
            ci.getLock().writeLock().unlock();
        }
    } else { //no NLP processing possible
        at = null;
    }
If used outside of OSGi, users can also call AnalysedTextFactory#getDefaultInstance() to obtain the AnalysedTextFactory instance of the in-memory implementation.
Structure of the Text
The basic building block of the AnalysedText is the Span. A Span defines a type, the [start,end) offsets and the spanText. The type is a member of an enumeration (SpanTypeEnum) with the members Text, TextSection, Sentence, Chunk and Token. [start,end) defines the character positions of the Span within the Text, where the start position is inclusive and the end position is exclusive.
Analogous to the type of the Span, there are also Java interfaces representing those types and providing additional convenience methods. An additional Section interface was introduced as the common parent for all types that may have enclosed Spans. AnalysedText is the interface representing SpanTypeEnum#Text. The main intention of those Java interfaces is to provide convenience methods that ease the use of the API.
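The half-open [start,end) convention can be illustrated with a minimal, self-contained sketch. SpanSketch is a hypothetical stand-in written only for this illustration and is not the Stanbol Span interface; it assumes that the span text is simply the substring of the analysed text.

```java
// Hypothetical stand-in for a Span (assumption, not the Stanbol implementation).
final class SpanSketch {
    final String text; // the whole analysed text
    final int start;   // inclusive character offset
    final int end;     // exclusive character offset

    SpanSketch(String text, int start, int end) {
        if (start < 0 || start > end || end > text.length()) {
            throw new IllegalArgumentException(
                "invalid span [" + start + "," + end + ")");
        }
        this.text = text;
        this.start = start;
        this.end = end;
    }

    /** Covered text; String#substring uses the same [start,end) convention. */
    String getSpan() {
        return text.substring(start, end);
    }

    /** With an exclusive end the length is simply end - start. */
    int length() {
        return end - start;
    }

    public static void main(String[] args) {
        String text = "Stanbol rocks. It really does.";
        System.out.println(new SpanSketch(text, 0, 14).getSpan()); // first sentence
        System.out.println(new SpanSketch(text, 8, 13).getSpan()); // second token
    }
}
```

Because the end is exclusive, adjacent Spans can share a boundary offset without overlapping.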
Uniqueness of Spans
A Span is considered equal to another Span if [start,end) and type are the same. The natural order of Spans is defined by:
- smaller start index first
- bigger end index first
- higher ordinal number of the SpanTypeEnum first
This order is used by all Iterators returned by the AnalysedText API.
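The three ordering rules can be sketched as a comparator. The types below (SpanTypeSketch, OrderedSpan, SpanOrderSketch) are hypothetical and exist only for this illustration; the enum merely assumes the member order listed for SpanTypeEnum. Used with a TreeSet, the same comparator also yields the uniqueness rule, because two Spans with identical [start,end) and type compare as equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical enum mirroring the member order described for SpanTypeEnum.
enum SpanTypeSketch { Text, TextSection, Sentence, Chunk, Token }

// Hypothetical value object, not the Stanbol Span.
final class OrderedSpan {
    final int start;
    final int end;
    final SpanTypeSketch type;

    OrderedSpan(SpanTypeSketch type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return type + "[" + start + "," + end + ")";
    }
}

final class SpanOrderSketch {

    static int compare(OrderedSpan a, OrderedSpan b) {
        if (a.start != b.start) {
            return Integer.compare(a.start, b.start); // smaller start first
        }
        if (a.end != b.end) {
            return Integer.compare(b.end, a.end);     // bigger end first
        }
        // higher ordinal of the type first
        return Integer.compare(b.type.ordinal(), a.type.ordinal());
    }

    static final Comparator<OrderedSpan> NATURAL_ORDER = SpanOrderSketch::compare;

    /** Returns the given spans in natural order, dropping duplicates. */
    static List<String> sorted(List<OrderedSpan> spans) {
        TreeSet<OrderedSpan> set = new TreeSet<>(NATURAL_ORDER);
        set.addAll(spans);
        List<String> result = new ArrayList<>();
        for (OrderedSpan span : set) {
            result.add(span.toString());
        }
        return result;
    }
}
```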
Concurrent Modifications and Iterators
Iterators returned by the AnalysedText API MUST NOT throw ConcurrentModificationExceptions but rather reflect changes to the underlying model. While this is not consistent with the default behavior of Iterators in Java, it is central to the effective usage of the AnalysedText API, e.g. when iterating over Sentences while adding Tokens.
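One possible way to obtain such behavior is an Iterator that looks up the successor of the last returned element on demand instead of keeping a cursor into the collection. The sketch below is an assumption about how this could be done with a plain java.util.TreeSet; LiveIteratorSketch is a hypothetical name and the code does not claim to show how Stanbol implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

final class LiveIteratorSketch {

    /** Iterator that resolves the next element lazily via TreeSet#higher. */
    static <T> Iterator<T> liveIterator(final TreeSet<T> set) {
        return new Iterator<T>() {
            private T last;          // last element returned
            private boolean started; // false until next() was called once

            private T peek() {
                if (!started) {
                    return set.isEmpty() ? null : set.first();
                }
                return set.higher(last); // least element greater than last, or null
            }

            @Override
            public boolean hasNext() {
                return peek() != null;
            }

            @Override
            public T next() {
                T next = peek();
                if (next == null) {
                    throw new NoSuchElementException();
                }
                started = true;
                last = next;
                return next;
            }
        };
    }

    /** Adds an element while iterating; the live iterator picks it up. */
    static List<Integer> iterateWhileAdding() {
        TreeSet<Integer> offsets = new TreeSet<>(Arrays.asList(10, 20));
        List<Integer> visited = new ArrayList<>();
        Iterator<Integer> it = liveIterator(offsets);
        while (it.hasNext()) {
            int current = it.next();
            visited.add(current);
            if (current == 10) {
                offsets.add(15); // modification during iteration
            }
        }
        return visited;
    }

    /** The same modification with the built-in fail-fast iterator. */
    static boolean builtInIteratorFails() {
        TreeSet<Integer> offsets = new TreeSet<>(Arrays.asList(10, 20));
        try {
            for (Integer current : offsets) {
                if (current == 10) {
                    offsets.add(15);
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

The price of this design is a lookup per step instead of a pointer move, which is the trade-off for being able to add Spans while iterating.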
Code Samples:
The following Code Snippet shows some typical usages of the API:
    AnalysedText at; //typically retrieved from the contentPart
    Iterator<Sentence> sentences = at.getSentences();
    while(sentences.hasNext()){
        Sentence sentence = sentences.next();
        String sentText = sentence.getSpan();
        Iterator<Token> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            Token token = tokens.next();
            String tokenText = token.getSpan();
            Value<PosTag> pos = token.getAnnotation(
                NlpAnnotations.POS_ANNOTATION);
            if(pos != null){ //the token may have no POS annotation
                String tag = pos.value().getTag();
                double confidence = pos.probability();
            }
        }
    }
Code that adds new Spans looks as follows:
    //Tokenize a Text
    Iterator<Sentence> sentences = at.getSentences();
    Iterator<? extends Section> sections;
    if(sentences.hasNext()){ //sentence annotations present
        sections = sentences;
    } else { //if no sentences tokenize the text at once
        sections = Collections.singleton(at).iterator();
    }
    //Tokenize the sections
    while(sections.hasNext()){
        Section section = sections.next();
        //assuming the Tokenizer returns tokens as 2dim int array
        int[][] tokenSpans = tokenizer.tokenize(section.getSpan());
        for(int ti = 0; ti < tokenSpans.length; ti++){
            Token token = section.addToken(
                tokenSpans[ti][0], tokenSpans[ti][1]);
        }
    }
For all #add**(start,end) methods in the API the passed start and end indexes are relative to the parent (the Span the #add**(..) method is called on). The [start,end) indexes returned by Spans are absolute values. If an #add**(..) method is called for a Span '[start,end):type' that already exists, then the existing instance is returned instead of a new one.
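The relation between relative and absolute indexes, and the rule that an existing Span is returned instead of a new one, can be sketched as follows. SectionSketch and TokenSketch are hypothetical classes written for this illustration only; the registry keyed by the absolute range is an assumption, not a description of the Stanbol implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class RelativeAddSketch {

    /** Hypothetical token that stores absolute offsets. */
    static final class TokenSketch {
        final int start;
        final int end;

        TokenSketch(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical section located at absolute [start,end) within the text. */
    static final class SectionSketch {
        final int start;
        final int end;
        // tokens already added, keyed by their absolute range
        private final Map<String, TokenSketch> tokens = new HashMap<>();

        SectionSketch(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** start/end are relative to this section; the token is absolute. */
        TokenSketch addToken(int relStart, int relEnd) {
            final int absStart = start + relStart;
            final int absEnd = start + relEnd;
            if (relStart < 0 || relStart > relEnd || absEnd > end) {
                throw new IllegalArgumentException("token outside of section");
            }
            // return the existing instance if the same span was added before
            return tokens.computeIfAbsent(absStart + "," + absEnd,
                key -> new TokenSketch(absStart, absEnd));
        }
    }

    public static void main(String[] args) {
        // second sentence of "Stanbol rocks. It really does."
        SectionSketch sentence = new SectionSketch(15, 30);
        TokenSketch it = sentence.addToken(0, 2);
        System.out.println("[" + it.start + "," + it.end + ")");
        System.out.println(it == sentence.addToken(0, 2));
    }
}
```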
Annotation Support
Annotation support is provided by two interfaces Annotated and Annotation and the Value class. Annotated provides an API for adding information to the annotated object. Those annotations are represented by key value mappings where Object is used as key and the Value class for values. The Value class provides the generically typed value as well as a double probability in the range [0..1] or -1 if not known. Finally the Annotation class is used to ensure type safety.
The following example shows the intended usage of the API:
- One needs to define the Annotations one would like to use. Annotations are typically defined as public static members of interfaces or classes. The following example uses the definition of the Part of Speech annotation.
    public interface NlpAnnotations {
        //a Part of Speech Annotation using a String key
        //and the PosTag class as value
        Annotation<String,PosTag> POS_ANNOTATION = new Annotation<String,PosTag>(
            "stanbol.enhancer.nlp.pos", PosTag.class);
        ...
    }
- Defined Annotations are used to add information to an Annotated instance (like a Span). To ensure type safety, annotations have to be added by using an Annotation. The following code snippet shows how to add a PosTag with the probability 0.95.
    PosTag tag = new PosTag("N"); //a simple POS tag
    Token token; //the Token we want to add the tag to
    token.addAnnotation(POS_ANNOTATION, Value.value(tag, 0.95));
- For consuming annotations there are two options: first, using the Annotation object and second, directly using the key. While the second option is less convenient (as it does not provide type safety), it allows consuming annotations without the need to have the used Annotation on the classpath. The following example shows both options:
    Iterator<Token> tokens = sentence.getTokens();
    while(tokens.hasNext()){
        Token token = tokens.next();
        //(1) use the POS_ANNOTATION to get the PosTag
        Value<PosTag> tag = token.getAnnotation(POS_ANNOTATION);
        if(tag != null){
            log.info("{} has PosTag {}", token, tag.value());
        } else {
            log.info("{} has no PosTag", token);
        }
        //(2) use the key to retrieve values
        String key = "urn:test-dummy";
        Value<?> value = token.getValue(key);
        //the programmer needs to know the type!
        if(value != null && value.probability() > 0.5){
            log.info("{}={}", key, value.value());
        }
    }
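How a typed key can provide type safety on top of an untyped key value store can be shown with a small sketch. Key and AnnotatedSketch are hypothetical classes invented for this illustration and do not reproduce the signatures of the Stanbol Annotation and Annotated types; the sketch only assumes that the key carries the class of its values.

```java
import java.util.HashMap;
import java.util.Map;

final class TypedKeySketch {

    /** Hypothetical typed key: a String key plus the expected value type. */
    static final class Key<V> {
        final String name;
        final Class<V> type;

        Key(String name, Class<V> type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Hypothetical annotated object storing values under plain String keys. */
    static final class AnnotatedSketch {
        private final Map<String, Object> values = new HashMap<>();

        /** The compiler only accepts values matching the type of the key. */
        <V> void set(Key<V> key, V value) {
            values.put(key.name, value);
        }

        /** Typed access: the class stored in the key performs a checked cast. */
        <V> V get(Key<V> key) {
            return key.type.cast(values.get(key.name));
        }

        /** Untyped access by key only: the caller needs to know the type. */
        Object get(String name) {
            return values.get(name);
        }
    }

    public static void main(String[] args) {
        Key<String> pos = new Key<>("stanbol.enhancer.nlp.pos", String.class);
        AnnotatedSketch token = new AnnotatedSketch();
        token.set(pos, "N");
        String typed = token.get(pos);                           // no cast needed
        Object untyped = token.get("stanbol.enhancer.nlp.pos"); // type unknown
        System.out.println(typed + " " + untyped);
    }
}
```

The untyped variant works without the Key class being available, which mirrors the trade-off between the two options described above.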
The Annotated interface supports multi-valued annotations. For that it defines methods for adding/setting and getting multiple values. Values are sorted first by probability (unknown probability last) and second by insertion order (first in, first out). So calling the single-valued getAnnotation() method on a multi-valued field returns the first item (the one with the highest probability, or the one added first if multiple items have the same or no probability).
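The described sort order can be sketched with a stable sort on descending probability. MultiValueSketch and ValueSketch are hypothetical classes for this illustration only. The sketch relies on two assumptions: an unknown probability is represented as -1, which is smaller than every value in [0..1] and therefore sorts last, and Collections.sort is stable, so values with equal probability keep their insertion order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class MultiValueSketch {

    static final double UNKNOWN = -1d; // probability not known

    /** Hypothetical value holder, not the Stanbol Value class. */
    static final class ValueSketch {
        final String value;
        final double probability;

        ValueSketch(String value, double probability) {
            this.value = value;
            this.probability = probability;
        }
    }

    private final List<ValueSketch> values = new ArrayList<>();

    /** Appends the value and re-sorts by descending probability. */
    void add(String value, double probability) {
        values.add(new ValueSketch(value, probability));
        // stable sort: equal probabilities keep their insertion order
        Collections.sort(values,
            (a, b) -> Double.compare(b.probability, a.probability));
    }

    /** Single value access: the first item of the sorted values. */
    String first() {
        return values.isEmpty() ? null : values.get(0).value;
    }

    /** All values in their sorted order. */
    List<String> all() {
        List<String> result = new ArrayList<>();
        for (ValueSketch v : values) {
            result.add(v.value);
        }
        return result;
    }
}
```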