The Language Identification Engine

The LangId engine determines the language of text.

NOTE: Users of this engine might want to consider using the LangDetect engine instead, because the language detection library used by LangDetect supports more languages and delivers better results.

Technical Description

The engine is based on the language identifier of Apache Tika. The text to be checked must be provided as plain text in one of two forms:

as a plain text content item

in the content item's metadata, as the string value of the property

http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent

The result of the language identification is added to the content item's metadata as a fise:TextAnnotation that carries the detected language as the string value of the property

http://purl.org/dc/terms/language

This RDF snippet illustrates the output:

<fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
</fise:TextAnnotation>
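
A client that receives this annotation as RDF/XML could read the detected language roughly as follows. This is a minimal sketch, not part of Stanbol: the function name `detected_languages`, the `SAMPLE` wrapper document, and the `fise` namespace URI are assumptions added for illustration, and the real serialization returned by the enhancer may be structured differently. Only the `dc` namespace follows from the property URI given above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/terms/"  # from the dc:language property URI above
FISE_NS = "http://fise.iks-project.eu/ontology/"  # assumption, not stated in this page

# The snippet from above, wrapped in an rdf:RDF root so that it parses on its own.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}" xmlns:fise="{FISE_NS}">
  <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
  </fise:TextAnnotation>
</rdf:RDF>"""


def detected_languages(rdf_xml):
    """Return the text of every dc:language element found in the RDF/XML."""
    root = ET.fromstring(rdf_xml)
    # Search by qualified name so the result does not depend on the parent element.
    return [el.text.strip() for el in root.iter(f"{{{DC_NS}}}language") if el.text]


print(detected_languages(SAMPLE))
```

Looking up `dc:language` by its qualified name, rather than walking down from `fise:TextAnnotation`, keeps the sketch independent of how the annotation element itself is serialized.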

By default, the language identifier distinguishes the languages listed below. The code after each colon is the language value that is written to the metadata.

German: de
English: en
Estonian: et
French: fr
Spanish: es
Italian: it
Swedish: sv
Polish: pl
Dutch: nl
Norwegian: no
Finnish: fi
Greek: el
Danish: da
Hungarian: hu
Icelandic: is
Lithuanian: lt
Portuguese: pt
Russian: ru
Thai: th

Additional language models can be created as Tika LanguageProfile instances.

Configuration options

org.apache.stanbol.enhancer.engines.langid.probe-length: An integer specifying how many characters are used for identification. A value of 0 or below means that the complete text is used. Otherwise, only a substring of the specified length, taken from the middle of the text, is used. The default value is 400 characters.
stanbol.enhancer.engine.name: As with any EnhancementEngine, this property can be used to change the name of the engine. The default is "langid".
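
The effect of the probe length can be pictured with a small sketch. This is not the engine's code: `select_probe` is a hypothetical helper, and both the exact offsets of the middle substring and the handling of texts shorter than the probe length are assumptions, since the description above does not specify them.

```python
def select_probe(text, probe_length=400):
    """Return the part of the text that would be passed to the language identifier."""
    # A value of 0 or below means the complete text is used (as described above).
    # Returning short texts unchanged is an assumption of this sketch.
    if probe_length <= 0 or len(text) <= probe_length:
        return text
    # Centre a window of probe_length characters on the middle of the text.
    start = (len(text) - probe_length) // 2
    return text[start:start + probe_length]


print(select_probe("abcdefghij", 4))  # window of 4 characters around the middle
print(len(select_probe("x" * 1000)))  # default probe length applies
```

Sampling from the middle presumably avoids headers or boilerplate at the start of a document, at the cost of ignoring the rest of the text when a document mixes languages.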

Usage

Assuming that the Stanbol endpoint with the full launcher is running at

http://localhost:8080

and the engine is activated, commands like the following can be used from the command line to submit a text file as a content item:

stateless interface

curl -i -X POST -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/engines

stateful interface

curl -i -X PUT -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/contenthub/content/someFileId
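
The stateless request can also be issued from a script. The following sketch mirrors the first curl command using only the Python standard library. The helper names `build_request` and `enhance` are hypothetical, and encoding the text as UTF-8 is an assumption, since the command above does not declare a charset.

```python
import urllib.request


def build_request(text, base_url="http://localhost:8080"):
    """Build the POST request for the stateless /engines endpoint."""
    return urllib.request.Request(
        url=base_url + "/engines",
        data=text.encode("utf-8"),  # assumption: the text is sent as UTF-8
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, base_url="http://localhost:8080"):
    """Send the text to a running Stanbol instance and return the response body."""
    with urllib.request.urlopen(build_request(text, base_url)) as response:
        return response.read().decode("utf-8")
```

With a Stanbol instance running at the address above, `enhance("Some English sentence to identify.")` would return the enhancement metadata in whatever serialization the server chooses by default.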

Alternatively, the Stanbol web interface can be used for submitting documents and viewing the metadata at

http://localhost:8080/contenthub
