The Language Identification Engine
The LangId engine determines the language of text.
NOTE: Users of this engine might want to consider using the LangDetect engine instead, because the language detection library used by LangDetect supports more languages and also delivers better results.
Technical Description
The provided engine is based on the language identifier of Apache Tika. The text to be checked must be provided in plain text format in one of two forms:
- as a plain text content item
- in the content item's metadata, as the string value of the property

        :::html
        http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent
The result of language identification is added to the content item's metadata as a fise:TextAnnotation that carries the detected language as the string value of the property
http://purl.org/dc/terms/language
This RDF snippet illustrates the output:
    <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
      <dc:language>en</dc:language>
      <dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
    </fise:TextAnnotation>
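A client that receives such output will usually want just the language code. The following is a minimal sketch of that step using Python's standard library, not part of Stanbol itself. It assumes the annotation arrives wrapped in an `rdf:RDF` root element with namespace declarations, which the snippet above omits. The `dc` namespace is taken from the property URI given above; the `fise` namespace URI and the helper name `extract_languages` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs. "dc" follows the property URI named in the text;
# the "fise" URI is an assumption and may differ in a real deployment.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/terms/",
    "fise": "http://fise.iks-project.eu/ontology/",
}

# The snippet from the text, wrapped in an assumed rdf:RDF root element.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{NS['rdf']}" xmlns:dc="{NS['dc']}" xmlns:fise="{NS['fise']}">
  <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
  </fise:TextAnnotation>
</rdf:RDF>"""


def extract_languages(rdf_xml):
    """Return the dc:language values of all fise:TextAnnotation elements."""
    root = ET.fromstring(rdf_xml)
    return [el.text for el in root.iterfind(".//fise:TextAnnotation/dc:language", NS)]
```

For the sample above, `extract_languages(SAMPLE)` returns `["en"]`. For anything beyond quick inspection, a full RDF parser is likely the safer choice, since RDF/XML allows several equivalent serializations of the same graph.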
By default the language identifier distinguishes the languages listed below. The code after each colon is the value used for the language in the metadata.
- German: de
- English: en
- Estonian: et
- French: fr
- Spanish: es
- Italian: it
- Swedish: sv
- Polish: pl
- Dutch: nl
- Norwegian: no
- Finnish: fi
- Greek: el
- Danish: da
- Hungarian: hu
- Icelandic: is
- Lithuanian: lt
- Portuguese: pt
- Russian: ru
- Thai: th
Additional language models can be created as Tika LanguageProfile instances.
Configuration options
org.apache.stanbol.enhancer.engines.langid.probe-length
: an integer specifying how many characters will be used for identification. A value of 0 or below means to use the complete text. Otherwise only a substring of the specified length taken from the middle of the text will be used. The default value is 400 characters.

stanbol.enhancer.engine.name
: As with any EnhancementEngine, this property can be used to change the name of the engine. The default is "langid".
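The effect of the probe length can be illustrated with a small Python sketch. It only mirrors the behaviour described above (whole text for values of 0 or below, otherwise a window from the middle) and is not the engine's actual code. The function name `select_probe`, the exact centring arithmetic, and the handling of texts shorter than the probe length are assumptions.

```python
def select_probe(text, probe_length=400):
    """Return the part of `text` that would be handed to the language identifier.

    Assumed behaviour: a probe_length of 0 or below, or a text no longer
    than probe_length, yields the complete text; otherwise a window of
    probe_length characters centred in the text is returned.
    """
    if probe_length <= 0 or len(text) <= probe_length:
        return text
    # Centre the window; integer division biases it towards the start
    # by at most one character when the remainder is odd.
    start = (len(text) - probe_length) // 2
    return text[start:start + probe_length]
```

Sampling from the middle plausibly avoids headers, titles, and footers, which are often less representative of a document's main language than its body.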
Usage
Assuming that the Stanbol endpoint with the full launcher is running at
http://localhost:8080
and the engine is activated, commands like the following can be run from the command line to submit a text file as a content item:
- stateless interface

        :::bash
        curl -i -X POST -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/engines

- stateful interface

        :::bash
        curl -i -X PUT -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/contenthub/content/someFileId
Alternatively, the Stanbol web interface can be used for submitting documents and viewing the metadata at
http://localhost:8080/contenthub
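The stateless request can also be issued from a script. The sketch below uses only Python's standard library and targets the same endpoint and `Content-Type` header as the stateless curl command. The function names `build_enhance_request` and `enhance`, the timeout value, and the assumption that the response body is UTF-8 encoded are illustrative choices, not something defined by Stanbol.

```python
import urllib.request

# Stateless endpoint as given in the text (full launcher on localhost).
STANBOL_URL = "http://localhost:8080/engines"


def build_enhance_request(text, url=STANBOL_URL):
    """Build a POST request that submits `text` as a plain text content item."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, url=STANBOL_URL, timeout=30.0):
    """Send the request and return the response body as a string.

    Assumes the server answers with UTF-8 encoded text.
    """
    request = build_enhance_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a running instance, `enhance("This is a short English sentence.")` should return the enhancement metadata, including the language annotation if the engine is active. Keeping request construction separate from sending makes the request easy to inspect without a server.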