The Language Detection Engine
The LangDetect engine determines the language of text.
Technical Description
The provided engine is based on the language identifier of the language-detection project.
The plain text needed for the detection is retrieved from the processed ContentItem by searching for a Blob with the media type "text/plain".
The result of language identification is added to the content item's metadata as a fise:TextAnnotation. The detected language is stored as the string value of the property
http://purl.org/dc/terms/language
This RDF snippet illustrates the output:
    <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
        <dc:language>en</dc:language>
        <fise:confidence>0.99987</fise:confidence>
        <dc:type rdf:resource="http://purl.org/dc/terms/LinguisticSystem"/>
        <dc:creator>org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine</dc:creator>
    </fise:TextAnnotation>
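For readers who want to consume this annotation outside of Stanbol, the following is a minimal sketch of reading the language and confidence values with the JDK's built-in XML parser. It is not part of the engine. The class name, the method name, and the sample wrapper document are hypothetical, and the namespace URI used for the fise prefix is an assumption, since the snippet above omits its namespace declarations. A real client would more likely use an RDF library than a plain XML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper that reads dc:language and fise:confidence from RDF/XML. */
final class LanguageAnnotationReader {

    static final String DC = "http://purl.org/dc/terms/";
    // Assumption: namespace bound to the "fise" prefix in the enhancement output.
    static final String FISE = "http://fise.iks-project.eu/ontology/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** The snippet from the documentation, wrapped in rdf:RDF with namespace declarations. */
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns:dc=\"" + DC + "\" xmlns:fise=\"" + FISE + "\">"
        + "<fise:TextAnnotation rdf:about=\"urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49\">"
        + "<dc:language>en</dc:language>"
        + "<fise:confidence>0.99987</fise:confidence>"
        + "<dc:type rdf:resource=\"http://purl.org/dc/terms/LinguisticSystem\"/>"
        + "</fise:TextAnnotation>"
        + "</rdf:RDF>";

    /** Returns {language, confidence} of the first fise:TextAnnotation in the document. */
    static String[] readLanguage(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            Element annotation =
                (Element) doc.getElementsByTagNameNS(FISE, "TextAnnotation").item(0);
            String language = annotation.getElementsByTagNameNS(DC, "language")
                .item(0).getTextContent().trim();
            String confidence = annotation.getElementsByTagNameNS(FISE, "confidence")
                .item(0).getTextContent().trim();
            return new String[] {language, confidence};
        } catch (Exception e) {
            throw new RuntimeException("Could not read language annotation", e);
        }
    }

    public static void main(String[] args) {
        String[] result = readLanguage(SAMPLE);
        System.out.println(result[0] + " (confidence " + result[1] + ")");
    }
}
```

The sketch only looks at the first annotation. When the engine is configured to suggest several languages, a client would need to iterate over all matching elements instead.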
The list of supported languages is available here.
Configuration options
org.apache.stanbol.enhancer.engines.langdetect.probe-length
: An integer specifying how many characters are used for identification. A value of 0 or below means that the complete text is used. Otherwise, only a substring of the specified length, taken from the middle of the text, is used. NOTE: the underlying library already supports random selection of text parts, so the probe-length feature typically should not be activated.

org.apache.stanbol.enhancer.engines.langdetect.max-suggested
: The underlying language detection library can report multiple candidate languages. This option sets the maximum number of suggested languages.

stanbol.enhancer.engine.name
: As with any EnhancementEngine, this property can be used to change the name of the engine. The default is "langdetect".
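The effect of the probe-length and max-suggested options can be sketched in plain Java. This is only an illustration of the behaviour described above, not the engine's actual code. The class, the Suggestion type, and both method names are hypothetical, and the exact way the engine centres the substring or orders its suggestions is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of the probe-length and max-suggested options. */
final class LangDetectOptionsSketch {

    /** A candidate language with its confidence (hypothetical helper type). */
    static final class Suggestion {
        final String lang;
        final double confidence;

        Suggestion(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /**
     * probe-length: a value of 0 or below keeps the complete text; otherwise a
     * substring of that length is taken from the middle of the text.
     * Assumption: texts shorter than the probe length are used as-is.
     */
    static String selectProbe(String text, int probeLength) {
        if (probeLength <= 0 || text.length() <= probeLength) {
            return text;
        }
        int start = (text.length() - probeLength) / 2;
        return text.substring(start, start + probeLength);
    }

    /**
     * max-suggested: keep at most maxSuggested candidates.
     * Assumption: candidates are ranked by descending confidence first.
     */
    static List<Suggestion> limitSuggestions(List<Suggestion> candidates, int maxSuggested) {
        List<Suggestion> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Suggestion s) -> s.confidence).reversed());
        int limit = Math.max(0, Math.min(maxSuggested, sorted.size()));
        return new ArrayList<>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        // Ten characters, probe length 4: start index is (10 - 4) / 2 = 3.
        System.out.println(selectProbe("abcdefghij", 4)); // prints "defg"

        List<Suggestion> top = limitSuggestions(Arrays.asList(
            new Suggestion("fr", 0.2),
            new Suggestion("en", 0.7),
            new Suggestion("de", 0.1)), 2);
        for (Suggestion s : top) {
            System.out.println(s.lang + " " + s.confidence);
        }
    }
}
```

With a probe length of 0, selectProbe returns the text unchanged, which matches the recommendation above to leave the feature deactivated and let the library sample the text itself.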