Basic Chinese language support based on Lucene Smartcn Analyzer

Because Chinese does not use whitespace characters to separate words, the default tokenizers used by Stanbol cannot properly process Chinese texts. Users who need to process Chinese texts must therefore add special modules, even for basic language support.
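
The problem can be illustrated with a minimal sketch. The function below is a deliberately naive whitespace splitter written for this example only; it is not Stanbol's actual tokenizer, and the sample sentences are made up.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace, as a naive tokenizer would."""
    return text.split()


english = "Stanbol extracts entities from text"
chinese = "我爱北京天安门"  # "I love Beijing Tiananmen", written without spaces

english_tokens = whitespace_tokenize(english)
chinese_tokens = whitespace_tokenize(chinese)

print(english_tokens)  # one token per English word
print(chinese_tokens)  # the whole Chinese sentence comes back as a single token
```

Since the Chinese sentence is returned as one token, no individual word in it could be matched against entity labels, which is why a dictionary- or model-based segmenter such as Smartcn is needed.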

The integration of the Stanbol NLP processing module with the Lucene Smartcn Analyzer provides this support by

Allowing the Stanbol Entityhub to correctly index controlled vocabularies with Chinese labels
Detecting sentences and tokenizing parsed Chinese text
Tokenizing the Chinese labels of entities in the controlled vocabulary

Installation

The Smartcn integration consists of three bundles, as referenced by the Smartcn Bundle List. Users can either include this bundle list in their custom launcher configuration by adding the dependency

<dependency>
    <groupId>org.apache.stanbol</groupId>
    <artifactId>org.apache.stanbol.launchers.bundlelists.languageextras.smartcn</artifactId>
    <version>0.10.0-SNAPSHOT</version>
    <packaging>partialbundlelist</packaging>
</dependency>

or install the three bundles referenced by the Smartcn Bundle List manually in the Stanbol environment (e.g. by copying them to the stanbol/fileinstall directory).

Stanbol Enhancer configuration

When the Smartcn Analyzer is installed in the Stanbol environment, two new EnhancementEngines become available that can be used to configure an EnhancementChain for Chinese texts. A typical EnhancementChain for Chinese text looks like this:

langdetect
smartcn-sentence;optional
smartcn-token
{your-entitylinking}

where '{your-entitylinking}' will typically be an EntityhubLinkingEngine configured for your vocabulary containing the entities with Chinese labels.
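
Once such a chain is configured, Chinese text can be sent to it over the Enhancer's RESTful interface. The following is a minimal sketch only: it assumes a Stanbol instance on localhost:8080, the "/enhancer/chain/{name}" endpoint layout, and a chain named "chinese-chain". The chain name and the helper function are hypothetical, so adapt them to your installation.

```python
from urllib.parse import quote
from urllib.request import Request


def build_enhancer_request(text, chain_name, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for an enhancement chain.

    base_url and the /enhancer/chain/ path are assumptions about a default
    local Stanbol launcher; chain_name must match your configured chain.
    """
    url = "{}/enhancer/chain/{}".format(base_url.rstrip("/"), quote(chain_name))
    return Request(
        url,
        data=text.encode("utf-8"),  # Chinese text must be sent as UTF-8 bytes
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


# "chinese-chain" is a hypothetical chain name used for illustration.
request = build_enhancer_request("我爱北京天安门", "chinese-chain")
print(request.get_method(), request.full_url)
```

To actually submit the text, pass the request to urllib.request.urlopen(request) and read the response body, which contains the enhancement results in the requested RDF serialization.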

Solr Configuration

When you plan to use the Smartcn Analyzer to process Chinese texts it is important to also properly configure the Solr schema.xml used by the Entityhub SolrYard.

For that you will need to add two things:

A fieldType specification for Chinese

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.PositionFilterFactory" />
    </analyzer>
</fieldType>

A dynamic field using this field type that matches against Chinese language literals

<!--
 Dynamic field for Chinese languages.
 -->
<dynamicField name="@zh*" type="text_zh" indexed="true" stored="true" multiValued="true" omitNorms="false"/>

The smartcn.solrindex.zip is identical to the default configuration, except that it uses the above fieldType and dynamicField specifications.

Usage with the EntityhubIndexing Tool

1. Extract the smartcn.solrindex.zip to the "indexing/config" directory
2. Rename the "indexing/config/smartcn" directory to the {site-name} (the value of the "name" property of the "indexing/config/indexing.properties" file).

As an alternative to step 2, you can keep the directory name and explicitly set the name of the Solr configuration by adding the "solrConf:smartcn" parameter to the SolrYardIndexingDestination:

indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:smartcn,boosts:fieldboosts
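
For users who prefer to script the extract-and-rename variant, the two steps can be sketched as follows. This is an illustrative helper, not part of the indexing tool: the function names are hypothetical, the properties file is parsed with a simple key=value reader, and the archive is assumed to unpack into a top-level "smartcn" directory as described above.

```python
import zipfile
from pathlib import Path


def read_site_name(properties_file):
    """Return the value of the "name" key from a simple key=value file."""
    for line in Path(properties_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments and malformed entries
        key, value = line.split("=", 1)
        if key.strip() == "name":
            return value.strip()
    raise ValueError("no 'name' property in {}".format(properties_file))


def install_smartcn_config(zip_path, config_dir):
    """Extract the Solr config and rename it to the configured site name."""
    config_dir = Path(config_dir)
    site_name = read_site_name(config_dir / "indexing.properties")

    # Step 1: extract the archive; assumed to create config_dir/smartcn.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(config_dir)

    # Step 2: rename the extracted directory to the site name.
    target = config_dir / site_name
    if target.exists():
        raise FileExistsError("{} already exists".format(target))
    (config_dir / "smartcn").rename(target)
    return target
```

A call such as install_smartcn_config("smartcn.solrindex.zip", "indexing/config") would then leave the configuration under "indexing/config/{site-name}".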

Usage with the Entityhub SolrYard

If you want to create an empty SolrYard instance using the smartcn.solrindex.zip configuration you will need to

copy the smartcn.solrindex.zip to the datafile directory of your Stanbol instance ({working-dir}/stanbol/datafiles)
rename it to the {name} of the SolrYard you want to create. The file name needs to be {name}.solrindex.zip
create the SolrYard instance and configure the "Solr Index/Core" (org.apache.stanbol.entityhub.yard.solr.solrUri) to {name}. Make sure the "Use default SolrCore configuration" (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) is disabled.
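
The copy-and-rename part of this procedure can be scripted. The sketch below is a hypothetical helper that assumes the "{working-dir}/stanbol/datafiles" layout named above; creating and configuring the SolrYard instance itself still has to be done through the Stanbol configuration.

```python
import shutil
from pathlib import Path


def stage_solr_config(zip_path, working_dir, yard_name):
    """Copy the Solr config archive to the datafiles directory as {name}.solrindex.zip."""
    datafiles = Path(working_dir) / "stanbol" / "datafiles"
    datafiles.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = datafiles / "{}.solrindex.zip".format(yard_name)
    shutil.copyfile(zip_path, target)
    return target
```

For example, stage_solr_config("smartcn.solrindex.zip", ".", "chinese") would place the archive at "./stanbol/datafiles/chinese.solrindex.zip", where "chinese" is a made-up SolrYard name.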

If you want to use the smartcn.solrindex.zip as the default, you can rename the file in the datafiles folder to "default.solrindex.zip" and then enable the "Use default SolrCore configuration" (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) option when you configure a SolrYard instance.

See also the documentation on how to configure a managed site.
