Implementing an NLP Processing Engine
Enhancement Engines are the basic processing units of the Stanbol Enhancer. An NLP processing engine is an Enhancement Engine that processes the plain text (text/plain) version of the ContentItem being processed and adds the processing results to the metadata of the ContentItem or to the AnalysedText ContentPart. Enhancement Engines run in the same Java VM as the Stanbol Enhancer. However, they may access remote services (e.g. an NLP processing web service).
The following subsections provide information on typical tasks for implementors of NLP EnhancementEngines.
Accessing the Plain Text
The plain text version of the content should not be obtained directly from the ContentItem passed to the #canEnhance(..) and #computeEnhancements(..) methods (e.g. by using ContentItem#getStream). Such methods return the content as sent with the request, which might just as well be a PDF, a Word document or even an audio or video file. In such cases users will most likely have configured an EnhancementEngine (such as the TikaEngine) to extract the plain text from those rich formats.
To retrieve the plain text version, the NlpEngineHelper provides the #getPlainText(..) method. It returns an Entry with the URI of the plain text version as key and the Blob as value. The Blob interface is used by the Stanbol Enhancer to handle content elements and provides access to the content, content length, content type and charset.
The following code snippets show typical usage examples:
    @Override
    public int canEnhance(ContentItem ci) throws EngineException {
        if(NlpEngineHelper.getPlainText(this, ci, false) == null){
            return EnhancementEngine.CANNOT_ENHANCE;
        }
        // add further tests if this engine can enhance the
        // passed ContentItem
        return EnhancementEngine.ENHANCE_ASYNC;
    }
In the #canEnhance(..) method one needs to check whether the EnhancementEngine is able to process the passed ContentItem. The #computeEnhancements(..) method is only called if this method returns ENHANCE_SYNCHRONOUS or ENHANCE_ASYNC.
    @Override
    public void computeEnhancements(ContentItem ci) throws EngineException {
        // if true is passed as 3rd argument this throws an exception
        // rather than returning null
        Entry<UriRef, Blob> plainText = NlpEngineHelper.getPlainText(this, ci, true);
        // now we can read the plain text from the Blob
        String charset = plainText.getValue().getParameter().get("charset");
        if(charset == null){
            charset = "UTF-8";
        }
        // java.nio.charset.Charset is used here because the String based
        // constructor declares a checked UnsupportedEncodingException
        Reader reader = new InputStreamReader(
            plainText.getValue().getStream(), Charset.forName(charset));
    }
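The snippet above stops once the Reader is created. As a minimal sketch of the remaining step, the following standalone helper reads a stream into a String and falls back to UTF-8 when no charset is given. It uses only JDK classes. The class PlainTextReader and its readAll method are hypothetical names chosen for this illustration and are not part of Stanbol; in an engine the stream and charset name would come from the Blob as shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Stanbol. */
final class PlainTextReader {

    private PlainTextReader() {}

    /**
     * Reads the whole stream as text.
     * @param in the content stream (in an engine: the stream of the Blob)
     * @param charsetName the value of the "charset" parameter, may be null
     */
    static String readAll(InputStream in, String charsetName) {
        // fall back to UTF-8 if no charset parameter is present
        Charset charset = (charsetName == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(charsetName);
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader and the wrapped stream
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            // an engine would typically wrap this in an EngineException
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "Hello Stanbol".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes), null));
    }
}
```

Closing the reader inside the helper keeps the caller simple. An engine that streams large texts may prefer to process the Reader incrementally instead of building one String.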
The AnalysedText Content Part
The AnalysedText content part is used to store NLP processing results. The NlpEngineHelper provides the #getAnalysedText(..) and #initAnalysedText(..) methods for obtaining this content part from the processed ContentItem.
The #getAnalysedText(..) method is typically used by EnhancementEngines that need to consume the NLP processing results of previous NLP processing engines.
    @Override
    public int canEnhance(ContentItem ci) throws EngineException {
        // possibly check other requirements (such as whether the language
        // detected for the passed content is supported)
        if(getAnalysedText(this, ci, false) == null) {
            return CANNOT_ENHANCE;
        }
        return ENHANCE_ASYNC;
    }

    @Override
    public void computeEnhancements(ContentItem ci) throws EngineException {
        AnalysedText at = getAnalysedText(this, ci, true);
        // [..]
    }
EnhancementEngines that do not depend on the NLP processing results of other EnhancementEngines SHOULD use the #initAnalysedText(..) method, as it only creates a new AnalysedText content part if none is present yet. Otherwise it returns the existing one.
    /** The AnalysedTextFactory is an OSGi service */
    @Reference
    private AnalysedTextFactory analysedTextFactory;

    @Override
    public void computeEnhancements(ContentItem ci) throws EngineException {
        AnalysedText at = initAnalysedText(this, analysedTextFactory, ci);
        // [..]
    }
Note that NLP Enhancement Engines that do not consume NLP processing results of other EnhancementEngines need not check in the #canEnhance(..) method whether the AnalysedText content part is present.
The usage of the AnalysedText content part is not covered by this section. For more information on that please see the documentation of the AnalysedText.
Dealing with supported Languages and the Language of the Content
NLP processing EnhancementEngines typically support only a specific set of languages. This subsection provides best-practice examples for language-specific configurations as well as for retrieving the language of the processed ContentItem.
For language-specific configurations of EnhancementEngines the Stanbol NLP processing module provides the LanguageConfiguration utility. This utility implements the following configuration syntax:
- List of supported languages: to process German and English texts, configure de,en
- List of excluded languages: !fr,!cn,* would process all languages other than French and Chinese
- Support for parameters by using the following syntax: {language};{param-name}={param-value};{param-name}={param-value}
    - Default parameters for all languages can be defined by applying parameters to the wildcard language, *;{param-name}={param-value}; or, if no wildcard is desired by the configuration, by adding parameters to the empty {language}, such as ;{param-name}={param-value};
    - Language-specific parameters override default parameters. This means that *;state=true, de;state=false results in state=false for German and state=true for all other languages.
The following example shows typical usage of the LanguageConfiguration utility in an NLP processing engine:
    /** Key used for the language configuration */
    public static final String CONFIG_LANGUAGES = "myNlpEngine.languageconfig";
    /** One of possibly multiple parameters supported by this engine */
    public static final String PARAM_EXAMPLE = "example";
    /** The language configuration instance used by the engine */
    private LanguageConfiguration languageConfig = new LanguageConfiguration(
        CONFIG_LANGUAGES,    // the property key used for the configuration
        new String[]{"*"});  // the default configuration

    @Activate
    protected void activate(ComponentContext ctx) throws ConfigurationException {
        super.activate(ctx); // assuming this engine extends AbstractEnhancementEngine
        @SuppressWarnings("unchecked")
        Dictionary<String, Object> properties = ctx.getProperties();
        // the LanguageConfiguration utility directly parses the config
        // from the properties
        languageConfig.setConfiguration(properties);
    }

    @Deactivate
    protected void deactivate(ComponentContext context) {
        languageConfig.setDefault(); // reset to the default configuration
        super.deactivate(context);   // assuming this engine extends AbstractEnhancementEngine
    }
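To make the configuration syntax described above more concrete, the following standalone sketch parses such a configuration string and applies the rules for exclusions, the wildcard and parameter overrides. It is a simplified re-implementation for illustration and is not the code of the LanguageConfiguration utility. The class LanguageConfigSketch and its methods isSupported and parameter are hypothetical names, and details such as whitespace trimming and the absence of validation are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified illustration of the syntax; NOT Stanbol's LanguageConfiguration. */
final class LanguageConfigSketch {

    private final Set<String> excluded = new HashSet<>();
    private final Map<String, Map<String, String>> included = new HashMap<>();
    private final Map<String, String> defaults = new HashMap<>();
    private boolean wildcard = false;

    LanguageConfigSketch(String config) {
        // entries are separated by ',', language and parameters by ';'
        for (String entry : config.split(",")) {
            String[] parts = entry.trim().split(";");
            String lang = parts[0].trim();
            Map<String, String> params = new HashMap<>();
            for (int i = 1; i < parts.length; i++) {
                String[] kv = parts[i].split("=", 2);
                if (kv.length == 2) {
                    params.put(kv[0].trim(), kv[1].trim());
                }
            }
            if (lang.startsWith("!")) {
                excluded.add(lang.substring(1));   // explicitly excluded language
            } else if (lang.equals("*")) {
                wildcard = true;                   // all other languages are allowed
                defaults.putAll(params);           // wildcard parameters are defaults
            } else if (lang.isEmpty()) {
                defaults.putAll(params);           // defaults without a wildcard
            } else {
                included.put(lang, params);        // explicitly included language
            }
        }
    }

    /** Exclusions win; otherwise explicit inclusion or the wildcard decides. */
    boolean isSupported(String lang) {
        if (excluded.contains(lang)) {
            return false;
        }
        return wildcard || included.containsKey(lang);
    }

    /** Language-specific values override default values. */
    String parameter(String lang, String name) {
        Map<String, String> specific = included.get(lang);
        if (specific != null && specific.containsKey(name)) {
            return specific.get(name);
        }
        return defaults.get(name);
    }
}
```

With the configuration *;state=true, de;state=false this sketch returns false for the state parameter of de and true for any other language, which mirrors the override rule described in the list above.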
Several utilities are available for getting the detected language(s) of the textual content. The default EnhancementEngineHelper provides two utility methods: first, EnhancementEngineHelper#getLanguage(..), which returns the language with the highest confidence, and second, EnhancementEngineHelper#getLanguageAnnotations(..), which returns an ordered list with the URIs of the fise:TextAnnotations.
However, for NLP processing EnhancementEngines it is easier to use the higher-level utilities provided by the NlpEngineHelper. First, there is also a #getLanguage(..) method that does the same as the one of the EnhancementEngineHelper but in addition deals with exception handling and logging. Second, the #isLangaugeConfigured(..) method compares the detected language with the LanguageConfiguration.
Typical usage looks as follows:
    @Override
    public int canEnhance(ContentItem ci) throws EngineException {
        String language = getLanguage(this, ci, false);
        if(language == null){
            return CANNOT_ENHANCE;
        }
        if(!isLangaugeConfigured(this, languageConfig, language, false)){
            return CANNOT_ENHANCE;
        }
        // possible further checks
        return ENHANCE_ASYNC;
    }

    @Override
    public void computeEnhancements(ContentItem ci) throws EngineException {
        // get the language
        String language = getLanguage(this, ci, true);
        // validate against the language configuration
        isLangaugeConfigured(this, languageConfig, language, true);
        // [..]
    }
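The rule that the language with the highest confidence is selected can be illustrated with a small standalone sketch. It does not use the Stanbol API: DetectedLanguage and LanguageSelection are hypothetical stand-ins, and ordering by descending confidence is an assumption about what "ordered" means for the list of language annotations.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a language annotation; not a Stanbol type. */
final class DetectedLanguage {
    final String language;
    final double confidence;

    DetectedLanguage(String language, double confidence) {
        this.language = language;
        this.confidence = confidence;
    }
}

/** Illustration only: picks the most confident language detection. */
final class LanguageSelection {

    private LanguageSelection() {}

    /** Returns a copy of the list sorted by descending confidence. */
    static List<DetectedLanguage> ordered(List<DetectedLanguage> detected) {
        List<DetectedLanguage> sorted = new ArrayList<>(detected);
        sorted.sort(Comparator
                .comparingDouble((DetectedLanguage d) -> d.confidence)
                .reversed());
        return sorted;
    }

    /** Returns the most confident language, or null if none was detected. */
    static String best(List<DetectedLanguage> detected) {
        List<DetectedLanguage> sorted = ordered(detected);
        return sorted.isEmpty() ? null : sorted.get(0).language;
    }
}
```

Returning null when nothing was detected corresponds to the null check in the #canEnhance(..) example above, where the engine answers CANNOT_ENHANCE if no language is available.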