This project has retired. For details please refer to its Attic page.
Apache Stanbol - The Language Identification Engine

The LangId engine determines the language of text.

NOTE: Users of this engine might want to consider using the LangDetect engine instead, because the language detection library used by that engine supports more languages and delivers better results.

Technical Description

The provided engine is based on the language identifier of Apache Tika. The text to be checked must be provided in plain text format in one of two forms:

The result of language identification is added to the content item's metadata as a fise:TextAnnotation. The identified language is stored as the string value of the property

http://purl.org/dc/terms/language

This RDF snippet illustrates the output:

<fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
</fise:TextAnnotation>
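
To read the identified language back out of such output, a client only needs to look up the dc:language value of each fise:TextAnnotation. The following is a minimal sketch in Python, not part of Stanbol itself. It assumes the enhancer returns RDF/XML wrapped in an rdf:RDF element with namespace declarations; the fise namespace URI and the helper name `detected_languages` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs. The dc URI follows from the property URI given above;
# the fise URI is an assumption about the Stanbol enhancer ontology.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/terms/",
    "fise": "http://fise.iks-project.eu/ontology/",
}


def detected_languages(rdf_xml):
    """Return the dc:language value of every fise:TextAnnotation in rdf_xml."""
    root = ET.fromstring(rdf_xml)
    tag = "{%s}TextAnnotation" % NS["fise"]
    languages = []
    # iter() walks the whole tree, including the root element itself.
    for annotation in root.iter(tag):
        lang = annotation.find("dc:language", NS)
        if lang is not None and lang.text:
            languages.append(lang.text.strip())
    return languages


# The snippet from above, wrapped in an rdf:RDF element so that it parses.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/terms/"
    xmlns:fise="http://fise.iks-project.eu/ontology/">
  <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <dc:creator>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine</dc:creator>
  </fise:TextAnnotation>
</rdf:RDF>"""

print(detected_languages(SAMPLE))  # prints ['en']
```

A full RDF library would be more robust against other serializations; the standard-library XML parser is used here only to keep the sketch self-contained.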

By default the language identifier distinguishes the languages listed below. After the colon the value of the language label in the metadata is given.

Additional language models can be created as Tika LanguageProfile instances.

Configuration options

Usage

Assuming that the Stanbol endpoint with the full launcher is running at

http://localhost:8080

and the engine is activated, a text file can be submitted as a content item from the command line with a command like this:
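
As a rough programmatic equivalent, the same submission can be sketched in Python. This is an illustration under stated assumptions, not the project's documented command: the endpoint path `/enhancer` is assumed (only the base URL is given here, and the path may differ between Stanbol versions), as is the `application/rdf+xml` response format. The helper names `build_enhance_request` and `enhance` are hypothetical.

```python
import urllib.request

# Base URL as given in this document. The endpoint path is an assumption
# and may differ between Stanbol versions.
STANBOL_BASE = "http://localhost:8080"
ENHANCER_PATH = "/enhancer"


def build_enhance_request(text, base=STANBOL_BASE, path=ENHANCER_PATH):
    """Build (but do not send) a POST request that carries plain text."""
    return urllib.request.Request(
        url=base.rstrip("/") + path,
        data=text.encode("utf-8"),
        headers={
            # The engine expects plain text content.
            "Content-Type": "text/plain; charset=utf-8",
            # Assumed response serialization.
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


def enhance(text):
    """Send text to a running Stanbol instance and return the response body."""
    with urllib.request.urlopen(build_enhance_request(text)) as response:
        return response.read().decode("utf-8")
```

With a running instance, calling `enhance()` on the contents of a text file should return the enhancement metadata, including the language annotation, provided the assumed endpoint path matches the installation.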

Alternatively, the Stanbol web interface can be used for submitting documents and viewing the metadata at

http://localhost:8080/contenthub