Apache Stanbol

Implementing an NLP Processing Engine

Enhancement Engines are the basic processing units of the Stanbol Enhancer. An NLP processing engine is an Enhancement Engine that processes the plain text (text/plain) version of the passed ContentItem and adds its results to the metadata of the ContentItem or to the AnalysedText ContentPart. Enhancement Engines run in the same Java VM as the Stanbol Enhancer; however, they may access remote services (e.g. an NLP processing web service).

The following subsections describe typical tasks for implementors of NLP EnhancementEngines.

Accessing the Plain Text

The plain text version of the content should not be obtained directly from the ContentItem passed to the #canEnhance(..) and #computeEnhancements(..) methods (e.g. by using ContentItem#getStream()). Such accessors return the content as it was sent with the request, which might just as well be a PDF, a Word document or even an audio or video file. In such cases users will most likely have configured an EnhancementEngine (such as the TikaEngine) to extract the plain text from those rich formats.

To retrieve the plain text version, the NlpEngineHelper provides the #getPlainText(..) method. It returns an Entry with the URI of the plain text version as key and the Blob as value. The Blob interface is used by the Stanbol Enhancer to handle content elements and provides access to the content, content length, content type and charset.

The following code snippets show typical usage examples:

@Override
public int canEnhance(ContentItem ci) throws EngineException {
    if(NlpEngineHelper.getPlainText(this, ci, false) == null){
        return EnhancementEngine.CANNOT_ENHANCE;
    }
    // add further tests to decide whether this engine can enhance the
    // passed ContentItem
    return EnhancementEngine.ENHANCE_ASYNC;
}

The #canEnhance(..) method needs to check whether the EnhancementEngine is able to process the passed ContentItem. The #computeEnhancements(..) method is called only if #canEnhance(..) returns ENHANCE_SYNCHRONOUS or ENHANCE_ASYNC.

@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    // if true is passed as the 3rd argument, this throws an exception
    // rather than returning null
    Entry<UriRef, Blob> plainText = NlpEngineHelper.getPlainText(this, ci, true);
    // now we can read the plain text from the Blob
    String charset = plainText.getValue().getParameter().get("charset");
    if(charset == null){
        charset = "UTF-8";
    }
    // Charset.forName(..) (java.nio.charset.Charset) avoids the checked
    // UnsupportedEncodingException of the String based constructor
    Reader reader = new InputStreamReader(
        plainText.getValue().getStream(), Charset.forName(charset));
}
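
The snippet above stops once the Reader has been created. As a minimal sketch of the remaining step, the following helper reads the whole text into a String and applies the same UTF-8 fallback. It uses only the JDK; the class name PlainTextReader and the method readAll are hypothetical and not part of Stanbol. In an engine, the stream and the parameter map would presumably come from Blob#getStream() and Blob#getParameter().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper (not part of Stanbol): reads a text stream fully. */
final class PlainTextReader {

    private PlainTextReader() {}

    /**
     * Decodes the stream with the charset named by the "charset" parameter,
     * falling back to UTF-8 when the parameter is missing.
     */
    static String readAll(InputStream in, Map<String, String> params) {
        String name = params.get("charset");
        Charset charset = name == null
                ? StandardCharsets.UTF_8
                : Charset.forName(name);
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            // an engine would more likely wrap this in its own exception type
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Reading the complete text into memory is the simplest option; for very large documents, processing the Reader incrementally may be preferable.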

The AnalysedText Content Part

The AnalysedText content part is used to store NLP processing results. The NlpEngineHelper provides the #getAnalysedText(..) and #initAnalysedText(..) methods for obtaining this content part from the processed ContentItem.

The #getAnalysedText(..) method is typically used by EnhancementEngines that need to consume the results of previous NLP processing engines.

@Override
public int canEnhance(ContentItem ci) throws EngineException {
    // possibly check other requirements (such as whether the language
    // detected for the passed content is supported)
    if(getAnalysedText(this,ci,false) == null) {
        return CANNOT_ENHANCE;
    }
    return ENHANCE_ASYNC;
}

@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    AnalysedText at = getAnalysedText(this, ci, true);
    // [..]
}

EnhancementEngines that do not depend on NLP processing results of other EnhancementEngines SHOULD use the #initAnalysedText(..) method. It creates a new AnalysedText content part only if none is present; otherwise it returns the existing one.

/** The AnalysedTextFactory is an OSGi service */
@Reference
private AnalysedTextFactory analysedTextFactory;


@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    AnalysedText at = initAnalysedText(this,analysedTextFactory,ci);
    // [..]
}

Note that NLP Enhancement Engines that do not consume NLP processing results of other EnhancementEngines need not check in the #canEnhance(..) method whether the AnalysedText content part is present.

The usage of the AnalysedText content part is not covered by this section. For more information on that please see the documentation of the AnalysedText.

Dealing with supported Languages and the Language of the Content

NLP processing EnhancementEngines typically support only a specific set of languages. This subsection provides best practice examples for language specific configurations as well as for retrieving the language of the processed ContentItem.

For language specific configurations of EnhancementEngines the Stanbol NLP processing module provides the LanguageConfiguration utility. This utility implements the following configuration syntax:

- List of supported languages: de,en processes German and English texts.
- List of excluded languages: !fr,!cn,* processes all languages other than French and Chinese.
- Parameters, using the syntax {language};{param-name}={param-value};{param-name}={param-value}
    - Default parameters for all languages can be defined by applying parameters to the wildcard language, as in *;{param-name}={param-value}; or, if the configuration should not include the wildcard, by adding parameters to the empty {language}, as in ;{param-name}={param-value};.
    - Language specific parameters override default parameters. This means that *;state=true, de;state=false results in state=false for German and state=true for all other languages.
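
To make these rules concrete, the following self-contained sketch interprets such a configuration. It is an illustration only and not the LanguageConfiguration implementation: the class LanguageConfigSketch and its methods are hypothetical, and details such as whitespace handling or malformed entries are simplified assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the syntax above (not Stanbol code). */
final class LanguageConfigSketch {

    private final Set<String> included = new HashSet<>();
    private final Set<String> excluded = new HashSet<>();
    private final Map<String, String> defaults = new HashMap<>();
    private final Map<String, Map<String, String>> specific = new HashMap<>();
    private boolean wildcard;

    /** Each entry is one configuration element, e.g. "de;state=false". */
    LanguageConfigSketch(String... entries) {
        for (String entry : entries) {
            String[] parts = entry.trim().split(";");
            String lang = parts[0].trim();
            if (lang.startsWith("!")) {           // explicitly excluded language
                excluded.add(lang.substring(1));
                continue;
            }
            Map<String, String> target;
            if (lang.isEmpty() || lang.equals("*")) {
                // "*" enables all languages; the empty language only
                // carries default parameters
                wildcard |= lang.equals("*");
                target = defaults;
            } else {
                included.add(lang);
                target = specific.computeIfAbsent(lang, k -> new HashMap<>());
            }
            for (int i = 1; i < parts.length; i++) {
                int eq = parts[i].indexOf('=');
                if (eq > 0) {
                    target.put(parts[i].substring(0, eq).trim(),
                               parts[i].substring(eq + 1).trim());
                }
            }
        }
    }

    /** True if the language is configured and not excluded. */
    boolean isLanguage(String lang) {
        return !excluded.contains(lang)
                && (wildcard || included.contains(lang));
    }

    /** Language specific values override the defaults. */
    String getParameter(String lang, String name) {
        Map<String, String> params = specific.get(lang);
        if (params != null && params.containsKey(name)) {
            return params.get(name);
        }
        return defaults.get(name);
    }
}
```

With the entries "*;state=true", "de;state=false" and "!fr", this sketch accepts English and German, rejects French, and resolves state to false for German and to true for English, matching the override rule described above.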

The following example shows a typical usage of the LanguageConfiguration utility in an NLP processing engine:

/** Key used for the language configuration */
public static final String CONFIG_LANGUAGES = "myNlpEngine.languageconfig";

/** Possible multiple parameters supported by this engine */
public static final String PARAM_EXAMPLE = "example";

/** The language Configuration instance used by the Engine */
private LanguageConfiguration languageConfig = new LanguageConfiguration(
    CONFIG_LANGUAGES, //the property key used for the configuration
    new String[]{"*"}); //the default configuration

@Activate
protected void activate(ComponentContext ctx) throws ConfigurationException {
    super.activate(ctx); // assuming this engine extends AbstractEnhancementEngine
    @SuppressWarnings("unchecked")
    Dictionary<String, Object> properties = ctx.getProperties();
    // The LanguageConfiguration utility parses the configuration directly
    // from the properties
    languageConfig.setConfiguration(properties);
}

@Deactivate
protected void deactivate(ComponentContext context) {
    languageConfig.setDefault(); //reset to the default configuration
    super.deactivate(context); // assuming this engine extends AbstractEnhancementEngine
}

Several utilities are available for getting the detected language(s) of the textual content. The default EnhancementEngineHelper provides two utility methods: EnhancementEngineHelper#getLanguage(..), which returns the language with the highest confidence, and EnhancementEngineHelper#getLanguageAnnotations(..), which returns an ordered list with the URIs of the fise:TextAnnotations.
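
The idea of picking the language with the highest confidence can be sketched independently of Stanbol. The example below assumes the detected languages are already available as a map from language code to confidence; the class LanguageSelectionSketch and this map representation are hypothetical and do not reflect how the helper reads the annotations from the metadata.

```java
import java.util.Map;

/** Hypothetical illustration (not Stanbol code). */
final class LanguageSelectionSketch {

    private LanguageSelectionSketch() {}

    /**
     * Returns the language code with the highest confidence,
     * or null if no language was detected.
     */
    static String bestLanguage(Map<String, Double> confidences) {
        String best = null;
        double bestConfidence = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : confidences.entrySet()) {
            if (entry.getValue() > bestConfidence) {
                bestConfidence = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Returning null when nothing was detected mirrors the null check that the canEnhance(..) examples in this section perform on the detected language.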

However, NLP processing EnhancementEngines will find it easier to use the higher level utilities provided by the NlpEngineHelper. Its #getLanguage(..) method does the same as the one of the EnhancementEngineHelper but in addition takes care of exception handling and logging. Its #isLangaugeConfigured(..) method compares the detected language with the LanguageConfiguration.

Typical usage looks as follows:

@Override
public int canEnhance(ContentItem ci) throws EngineException {
    String language = getLanguage(this, ci,false);
    if(language == null){
        return CANNOT_ENHANCE;
    }
    if(!isLangaugeConfigured(this,languageConfig,language,false)){
        return CANNOT_ENHANCE;
    }
    // possibly further checks
    return ENHANCE_ASYNC;
}

@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    //get the language
    String language = getLanguage(this, ci, true);
    //validate against the language configuration
    isLangaugeConfigured(this, languageConfig, language, true);
    // [..]
}
