Kuromoji NLP Engine for Japanese
Kuromoji is an NLP framework contributed to Apache Lucene. It is available starting with versions 3.6.2 and 4.1 of Solr/Lucene. In Stanbol, it requires a version newer than revision 1458703, as it only works with the stanbol.commons.solr modules that are compatible with Solr 4.1.
- Language (required): The language of the text needs to be available. It is read from the metadata of the ContentItem as specified by STANBOL-613. Effectively, this means that a Stanbol Language Detection engine needs to be executed before the Kuromoji NLP Engine.
- Sentences: Kuromoji itself does not provide sentence detection. Because of that, sentences are detected by using the POS tagging results: the POS tag '記号-句点' is used for splitting sentences. Furthermore, it is assumed that each text starts and ends with a complete sentence.
- Tokens: Kuromoji is configured to provide tokens for all words and punctuation. This is done by configuring an empty stop tag list as well as setting the 'discardPunctuation' property to false.
- POS tagging: The POS tag set used by Kuromoji is mapped to the LexicalCategories and POS types defined by the Stanbol NLP processing module. The Japanese name is used as the String tag (e.g. '名詞-代名詞-縮約' := Pos.Pronoun, Pos.Participle; description: noun-pronoun-contraction, a spoken-language contraction made by combining a pronoun and the particle 'wa', e.g. ありゃ, こりゃ, こりゃあ, そりゃ, そりゃあ). POS tags are represented by adding NlpAnnotations#POS_ANNOTATION annotations to the Tokens of the AnalyzedText content part. Kuromoji provides only a single POS tag per Token.
- NER detection: The POS tag set used by Kuromoji defines POS tags describing named entities. Tokens with those POS tags are then combined into chunks and interpreted as named entities (e.g. '名詞-固有名詞-人名-姓' noun-proper-person-surname; '名詞-固有名詞-人名-名' noun-proper-person-given_name). Named Entities are represented by adding NlpAnnotations#NER_ANNOTATION annotations to the Tokens of the AnalyzedText content part. In addition, 'fise:TextAnnotations' are added to the metadata of the ContentItem.
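The sentence splitting and named-entity chunking described in the list above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the engine's actual Java implementation. The function names, the `(surface, tag)` token representation, and the prefix-based matching of person-name tags are assumptions made for this example, and the POS tags in the sample input are assigned by hand rather than produced by Kuromoji.

```python
from typing import List, Tuple

# Hypothetical token representation: (surface form, POS tag string).
Token = Tuple[str, str]

SENTENCE_END_TAG = "記号-句点"            # POS tag used for splitting sentences
PERSON_TAG_PREFIX = "名詞-固有名詞-人名"  # assumed common prefix of the person-name tags


def split_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens into sentences, closing a sentence after each 記号-句点 token."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for token in tokens:
        current.append(token)
        if token[1] == SENTENCE_END_TAG:
            sentences.append(current)
            current = []
    if current:
        # The text is assumed to end with a complete sentence, so trailing
        # tokens without a closing period still form a sentence.
        sentences.append(current)
    return sentences


def chunk_person_names(tokens: List[Token]) -> List[str]:
    """Merge runs of adjacent person-name tokens into a single chunk."""
    chunks: List[str] = []
    current: List[str] = []
    for surface, tag in tokens:
        if tag.startswith(PERSON_TAG_PREFIX):
            current.append(surface)
        elif current:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks


# Hand-tagged sample input (illustrative only, not real Kuromoji output).
tokens = [
    ("山田", "名詞-固有名詞-人名-姓"),
    ("太郎", "名詞-固有名詞-人名-名"),
    ("は", "助詞-係助詞"),
    ("学生", "名詞-一般"),
    ("です", "助動詞"),
    ("。", "記号-句点"),
    ("こりゃ", "名詞-代名詞-縮約"),
    ("驚い", "動詞-自立"),
    ("た", "助動詞"),
    ("。", "記号-句点"),
]
sentences = split_sentences(tokens)
names = chunk_person_names(tokens)
```

In this sketch, the surname and given-name tokens are adjacent, so they end up in one chunk, mirroring how the engine combines named-entity POS tags into a single named entity.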
Kuromoji does not provide confidence values for results.
The engine does not provide any custom configuration options. However, it does support configuring the engine name.