This project has retired. For details please refer to its Attic page.
Apache Stanbol - AnalysedText

AnalysedText

The AnalysedText is a Java domain model designed to describe NLP processing results. It consists of two major parts:

  1. Structure of the Text such as text-sections, sentences, chunks and tokens
  2. Annotations for the detected parts of the text.

AnalysedText as ContentPart

Within the Stanbol Enhancer the AnalysedText is used as ContentPart registered with the URI urn:stanbol.enhancer:nlp.analysedText

Because of that, it can be retrieved with the following code:

AnalysedText at;
ci.getLock().readLock().lock();
try {
    at = ci.getPart(AnalysedText.ANALYSED_TEXT_URI, AnalysedText.class);
} catch (NoSuchPartException e) {
    //not present
    at = null;
} finally {
    ci.getLock().readLock().unlock();
}

Components that need to create an AnalysedText instance can do so by using the AnalysedTextFactory:

@Reference
AnalysedTextFactory atf;

ContentItem ci; //the contentItem
AnalysedText at;
Entry<String,Blob> plainTextBlob = ContentItemHelper.getBlob(
    ci, Collections.singleton("text/plain"));
if(plainTextBlob != null){
    //creates and adds the AnalysedText ContentPart to the ContentItem
    ci.getLock().writeLock().lock();
    try {
        at = atf.createAnalysedText(ci,plainTextBlob.getValue());
    } finally {
        ci.getLock().writeLock().unlock();
    }
} else { //no NLP processing possible
    at = null;
}

If used outside of OSGi, users can also call AnalysedTextFactory#getDefaultInstance() to obtain the AnalysedTextFactory instance of the in-memory implementation.

Structure of the Text

The basic building block of the AnalysedText is the Span. A Span defines a type, the [start,end) offsets and the spanText. The type is one of the members of the SpanTypeEnum enumeration: Text, TextSection, Sentence, Chunk and Token. [start,end) defines the character positions of the Span within the Text, where the start position is inclusive and the end position is exclusive.
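
The half-open [start,end) convention can be illustrated with a minimal, self-contained sketch. SimpleSpan below is a hypothetical stand-in and not the Stanbol Span interface; it only mirrors the offset semantics described above, which are the same as those of String#substring.

```java
// Hypothetical stand-in for a Span; NOT part of the Stanbol API.
final class SimpleSpan {
    enum Type { Text, TextSection, Sentence, Chunk, Token }

    final Type type;
    final int start; // inclusive character offset
    final int end;   // exclusive character offset
    private final String text; // the full text this span refers to

    SimpleSpan(String text, Type type, int start, int end) {
        if (start < 0 || end < start || end > text.length()) {
            throw new IllegalArgumentException(
                "invalid span [" + start + "," + end + ")");
        }
        this.text = text;
        this.type = type;
        this.start = start;
        this.end = end;
    }

    /** The covered text; String#substring uses the same [start,end) rule. */
    String getSpan() {
        return text.substring(start, end);
    }

    /** With an exclusive end the length is simply end - start. */
    int length() {
        return end - start;
    }
}

class SimpleSpanDemo {
    public static void main(String[] args) {
        String text = "Stanbol enhances content.";
        SimpleSpan token = new SimpleSpan(text, SimpleSpan.Type.Token, 0, 7);
        System.out.println(token.getSpan()); // prints "Stanbol"
    }
}
```

With an exclusive end offset, adjacent spans share a boundary value: a span ending at 7 and one starting at 7 do not overlap.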

Analogous to the type of the Span, there are also Java interfaces representing those types and providing additional convenience methods. An additional Section interface was introduced as common parent for all types that may have enclosed Spans. AnalysedText is the interface representing SpanTypeEnum#Text. The main intention of those Java interfaces is to provide convenience methods that ease the use of the API.

Uniqueness of Spans

A Span is considered equal to another Span if [start, end) and type are the same. The natural order of Spans is defined by

This order is used by all Iterators returned by the AnalysedText API.
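
The equality rule for Spans can be sketched as follows. SpanKey is a hypothetical class written for this illustration only; it is not the Stanbol implementation and it deliberately covers equality alone, not the natural order.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class; it only illustrates "equal if [start,end) and type match".
final class SpanKey {
    final String type;
    final int start;
    final int end;

    SpanKey(String type, int start, int end) {
        this.type = Objects.requireNonNull(type);
        this.start = start;
        this.end = end;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SpanKey)) return false;
        SpanKey other = (SpanKey) o;
        return start == other.start && end == other.end
            && type.equals(other.type);
    }

    @Override
    public int hashCode() {
        // must be derived from the same fields as equals
        return Objects.hash(type, start, end);
    }
}

class SpanKeyDemo {
    public static void main(String[] args) {
        Set<SpanKey> spans = new HashSet<>();
        spans.add(new SpanKey("Token", 0, 7));
        spans.add(new SpanKey("Token", 0, 7)); // equal to the first: ignored
        spans.add(new SpanKey("Chunk", 0, 7)); // same offsets, other type: kept
        System.out.println(spans.size()); // prints 2
    }
}
```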

Concurrent Modifications and Iterators

Iterators returned by the AnalysedText API MUST NOT throw ConcurrentModificationExceptions but rather reflect changes to the underlying model. While this is not consistent with the default behavior of Iterators in Java, it is central to the effective usage of the AnalysedText API, e.g. when iterating over Sentences while adding Tokens.
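
One possible way to obtain such behavior is an iterator that looks up its successor in a sorted set on every step instead of keeping a fail-fast cursor. The sketch below is an assumption about how this could be done, not a description of the Stanbol implementation; LiveIterator is a hypothetical name.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.TreeSet;

// Hypothetical iterator: re-queries the set on each step, so elements added
// during iteration are seen and no ConcurrentModificationException is thrown.
final class LiveIterator<T extends Comparable<T>> implements Iterator<T> {
    private final TreeSet<T> set;
    private T last; // last element returned, null before the first next()

    LiveIterator(TreeSet<T> set) {
        this.set = set;
    }

    // Looks up the element following 'last' in the current state of the set.
    private T peek() {
        if (last == null) {
            return set.isEmpty() ? null : set.first();
        }
        return set.higher(last);
    }

    @Override
    public boolean hasNext() {
        return peek() != null;
    }

    @Override
    public T next() {
        T next = peek();
        if (next == null) {
            throw new NoSuchElementException();
        }
        last = next;
        return next;
    }
}

class LiveIteratorDemo {
    public static void main(String[] args) {
        TreeSet<Integer> offsets = new TreeSet<>();
        offsets.add(0);
        offsets.add(10);
        Iterator<Integer> it = new LiveIterator<>(offsets);
        while (it.hasNext()) {
            int offset = it.next();
            if (offset == 0) {
                // A plain TreeSet iterator would throw a
                // ConcurrentModificationException on its next call to next().
                offsets.add(5);
            }
            System.out.println(offset); // prints 0, then 5, then 10
        }
    }
}
```

The trade-off of this sketch is an O(log n) lookup per step and no protection against concurrent access from several threads.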

Code Samples:

The following code snippet shows some typical usages of the API:

AnalysedText at; //typically retrieved from the contentPart
Iterator<Sentence> sentences = at.getSentences();
while(sentences.hasNext()){
    Sentence sentence = sentences.next();
    String sentText = sentence.getSpan();
    Iterator<Token> tokens = sentence.getTokens();
    while(tokens.hasNext()){
        Token token = tokens.next();
        String tokenText = token.getSpan();
        Value<PosTag> pos = token.getAnnotation(
            NlpAnnotations.POS_ANNOTATION);
        if(pos != null){ //the token may have no POS annotation
            String tag = pos.value().getTag();
            double confidence = pos.probability();
        }
    }
}

Code that adds new Spans looks as follows:

//Tokenize a Text
Iterator<Sentence> sentences = at.getSentences();
Iterator<? extends Section> sections;
if(sentences.hasNext()){ //sentence annotations present
    sections = sentences;
} else { //if no sentences tokenize the text at once
    sections = Collections.singleton(at).iterator();
}
//Tokenize the sections
while(sections.hasNext()){
    Section section = sections.next();
    //assuming the Tokenizer returns tokens as 2dim int array
    int[][] tokenSpans = tokenizer.tokenize(section.getSpan());
    for(int ti = 0; ti < tokenSpans.length; ti++){
        Token token = section.addToken(
            tokenSpans[ti][0],tokenSpans[ti][1]);
    }
}

For all #add(start,end) methods in the API, the passed start and end indexes are relative to the parent (the one the #add(..) method is called on). The [start,end) indexes returned by Spans are absolute values. If an #add**(..) method is called for a Span '[start,end):type' that already exists, then the already existing instance is returned instead of a new one.
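
The relation between relative and absolute indexes can be sketched with a simplified model. SketchSection and SketchToken are hypothetical classes written for this illustration; they are not the Stanbol types and only reproduce the two rules stated above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; NOT the Stanbol implementation.
final class SketchToken {
    final int start; // absolute, inclusive
    final int end;   // absolute, exclusive

    SketchToken(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class SketchSection {
    final int start; // absolute, inclusive
    final int end;   // absolute, exclusive
    // tokens of this section, keyed by their absolute "start,end" offsets
    private final Map<String, SketchToken> tokens = new HashMap<>();

    SketchSection(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /** relStart and relEnd are relative to this section's start offset. */
    SketchToken addToken(int relStart, int relEnd) {
        int absStart = start + relStart;
        int absEnd = start + relEnd;
        if (relStart < 0 || relEnd < relStart || absEnd > end) {
            throw new IllegalArgumentException("token outside of section");
        }
        // an already existing token with the same offsets is returned as is
        return tokens.computeIfAbsent(absStart + "," + absEnd,
            key -> new SketchToken(absStart, absEnd));
    }
}

class SketchSectionDemo {
    public static void main(String[] args) {
        SketchSection sentence = new SketchSection(20, 45);
        SketchToken token = sentence.addToken(0, 4); // relative to the sentence
        System.out.println(token.start + " " + token.end); // prints "20 24"
        System.out.println(sentence.addToken(0, 4) == token); // prints "true"
    }
}
```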

Annotation Support

Annotation support is provided by the Annotated interface together with the Annotation and Value classes. Annotated provides an API for adding information to the annotated object. Those annotations are represented by key-value mappings where Object is used as key and the Value class for values. The Value class provides the generically typed value as well as a double probability in the range [0..1], or -1 if not known. Finally, the Annotation class is used to ensure type safety.
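
How a typed key can provide type safety on top of an untyped key-value store can be sketched as below. All types in this block (SketchValue, SketchAnnotation, SketchAnnotated) are hypothetical stand-ins and their signatures are assumptions made for the illustration; they are not the Stanbol classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical value holder: typed value plus probability in [0..1] or -1.
final class SketchValue<T> {
    static final double UNKNOWN = -1d;
    private final T value;
    private final double probability;

    SketchValue(T value, double probability) {
        boolean inRange = probability >= 0 && probability <= 1;
        if (probability != UNKNOWN && !inRange) {
            throw new IllegalArgumentException(
                "probability must be in [0..1] or -1");
        }
        this.value = Objects.requireNonNull(value);
        this.probability = probability;
    }

    T value() { return value; }
    double probability() { return probability; }
}

// Hypothetical typed key: binds the value type V to a key.
final class SketchAnnotation<V> {
    final String key;
    final Class<V> type;

    SketchAnnotation(String key, Class<V> type) {
        this.key = key;
        this.type = type;
    }
}

// Hypothetical annotated object backed by an untyped map.
final class SketchAnnotated {
    private final Map<Object, SketchValue<?>> annotations = new HashMap<>();

    // the compiler rejects values whose type does not match the key
    <V> void setAnnotation(SketchAnnotation<V> annotation, SketchValue<V> value) {
        annotations.put(annotation.key, value);
    }

    <V> SketchValue<V> getAnnotation(SketchAnnotation<V> annotation) {
        SketchValue<?> stored = annotations.get(annotation.key);
        if (stored == null) {
            return null;
        }
        // runtime guard in case the key was reused with another type
        if (!annotation.type.isInstance(stored.value())) {
            throw new ClassCastException("unexpected type for " + annotation.key);
        }
        @SuppressWarnings("unchecked")
        SketchValue<V> typed = (SketchValue<V>) stored;
        return typed;
    }

    // untyped access by key: the caller needs to know the value type
    SketchValue<?> getValue(Object key) {
        return annotations.get(key);
    }
}

class SketchAnnotatedDemo {
    public static void main(String[] args) {
        SketchAnnotation<String> pos =
            new SketchAnnotation<>("example.pos", String.class);
        SketchAnnotated token = new SketchAnnotated();
        token.setAnnotation(pos, new SketchValue<>("N", 0.95));
        String tag = token.getAnnotation(pos).value(); // no cast needed
        Object untyped = token.getValue("example.pos").value();
        System.out.println(tag + " " + untyped); // prints "N N"
    }
}
```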

The following example shows the intended usage of the API:

  1. One needs to define the Annotations one would like to use. Annotations are typically defined as public static members of interfaces or classes. The following example uses the definition of the Part of Speech annotation.

    public interface NlpAnnotations {
        //a Part of Speech Annotation using a String key
        //and the PosTag class as value
        Annotation<String,PosTag> POS_ANNOTATION = new Annotation<String,PosTag>(
            "stanbol.enhancer.nlp.pos", PosTag.class);
        ...
    }
    
  2. Defined Annotations are used to add information to an Annotated instance (like a Span). For adding annotations the use of Annotations is required to ensure type safety. The following code snippet shows how to add a PosTag with the probability 0.95.

    PosTag tag = new PosTag("N"); //a simple POS tag
    Token token; //The Token we want to add the tag
    token.addAnnotation(POS_ANNOTATION,Value.value(tag,0.95));
    
  3. For consuming annotations there are two options: first, using the Annotation object and second, using the key directly. While the second option is less convenient (as it does not provide type safety), it allows consuming annotations without the need to have the used Annotation in the classpath. The following example shows both options.

    Iterator<Token> tokens = sentence.getTokens();
    while(tokens.hasNext()){
        Token token = tokens.next();
        //(1) use the POS_ANNOTATION to get the PosTag
        Value<PosTag> tag = token.getAnnotation(POS_ANNOTATION);
        if(tag != null){
            log.info("{} has PosTag {}",token,tag.value());
        } else {
            log.info("{} has no PosTag",token);
        }
        //(2) use the key to retrieve values
        String key = "urn:test-dummy";
        Value<?> value = token.getValue(key);
        //the programmer needs to know the type!
        if(value != null && value.probability() > 0.5){
            log.info("{}={}",key,value.value());
        }
    }
    

The Annotated interface supports multi-valued annotations. For that it defines methods for adding/setting and getting multiple values. Values are sorted first by probability (unknown probability last) and second by insertion order (first in, first out). Calling the single-value getAnnotation() method on a multi-valued field therefore returns the first item (the one with the highest probability, or the first one added when several items share the same or no probability).
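
The described ordering of multiple values can be reproduced with a stable sort, as in the following sketch. ScoredValue and ValueOrder are hypothetical names used only for this illustration; how Stanbol implements the ordering internally is not stated here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value with a probability in [0..1], or -1 if unknown.
final class ScoredValue {
    final String value;
    final double probability;

    ScoredValue(String value, double probability) {
        this.value = value;
        this.probability = probability;
    }
}

final class ValueOrder {
    // Higher probability first. The unknown marker -1 is smaller than any
    // known probability and therefore ends up last.
    static final Comparator<ScoredValue> BY_PROBABILITY =
        Comparator.comparingDouble((ScoredValue v) -> v.probability).reversed();

    /** Returns a sorted copy; the input list is expected in insertion order. */
    static List<ScoredValue> sorted(List<ScoredValue> insertionOrder) {
        List<ScoredValue> result = new ArrayList<>(insertionOrder);
        // the sort is stable, so equal probabilities keep insertion order
        result.sort(BY_PROBABILITY);
        return result;
    }
}

class ValueOrderDemo {
    public static void main(String[] args) {
        List<ScoredValue> added = new ArrayList<>();
        added.add(new ScoredValue("NN", -1));  // unknown probability
        added.add(new ScoredValue("VB", 0.4));
        added.add(new ScoredValue("JJ", 0.9));
        added.add(new ScoredValue("RB", 0.4)); // ties with VB, added later
        for (ScoredValue v : ValueOrder.sorted(added)) {
            System.out.println(v.value); // prints JJ, VB, RB, NN
        }
    }
}
```

In this model a single-value getter would simply return the first element of the sorted list.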