Basic Chinese language support based on Paoding Analyzer

As Chinese does not use whitespace characters to separate words, the default tokenizers used by Stanbol cannot properly process Chinese texts. Users who need to process Chinese texts therefore need to add special modules even for basic language support.

The integration of the Stanbol NLP processing module with the Paoding Analyzer provides this by

Allowing Controlled Vocabularies with Chinese labels to be correctly indexed with the Stanbol Entityhub
Tokenizing parsed Chinese text
Tokenizing Chinese labels of Entities in the controlled vocabulary

It is highly recommended to use the Paoding Analyzer in combination with the Smartcn Analyzer, as the Smartcn Analyzer provides sentence detection.

Installation

The Paoding Analyzer integration consists of three bundles as referenced by the Paoding Bundle List. Users can either include this bundle list in their custom launcher configuration by adding the following dependency

```
<dependency>
    <groupId>org.apache.stanbol</groupId>
    <artifactId>org.apache.stanbol.launchers.bundlelists.languageextras.paoding</artifactId>
    <version>0.10.0-SNAPSHOT</version>
    <packaging>partialbundlelist</packaging>
</dependency>
```

or alternatively install the three bundles referenced by the Paoding Bundle List manually in the Stanbol Environment (e.g. by copying them to the stanbol/fileinstall directory).

NOTE that for sentence detection users will also need to install the Smartcn Analyzer.
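
The manual installation route can be sketched as a small shell helper. This is only an illustration: the function name `install_bundles` and both of its arguments are assumptions, and the target path simply follows the stanbol/fileinstall example given above.

```shell
# Sketch: copy bundle jars into Stanbol's fileinstall directory.
# install_bundles and its arguments are hypothetical names used for illustration.
install_bundles() {
    bundle_dir="$1"     # directory holding the three bundle jars (assumed location)
    stanbol_home="$2"   # the {working-dir} from which Stanbol is started

    mkdir -p "$stanbol_home/stanbol/fileinstall" || return 1
    for jar in "$bundle_dir"/*.jar; do
        # If the glob matched nothing, "$jar" is the literal pattern and does not exist
        [ -e "$jar" ] || { echo "no jar files found in $bundle_dir" >&2; return 1; }
        cp "$jar" "$stanbol_home/stanbol/fileinstall/" || return 1
    done
}

# Example call (paths are placeholders):
# install_bundles ./paoding-bundles .
```

The helper fails early when the source directory contains no jar files, which avoids a silent no-op installation.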

Stanbol Enhancer configuration

When Paoding and Smartcn are installed in the Stanbol Environment, several EnhancementEngines will be available that can be used to configure an EnhancementChain for Chinese texts. A typical EnhancementChain for Chinese text will look like:

langdetect
smartcn-sentence;optional
paoding-token
{your-entitylinking}

where '{your-entitylinking}' will typically be an EntityhubLinkingEngine configured for your vocabulary containing the Entities with Chinese labels. Note that smartcn-sentence will only be available if the Smartcn Analyzer is also installed.
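
Once such a chain is configured, it can be exercised by posting Chinese text to the Enhancer's REST endpoint for that chain. The following is a minimal sketch: the host, the port and the chain name "chinese" are assumptions, as are the helper names `chain_endpoint` and `enhance_text`.

```shell
# Assumed defaults; adjust to your Stanbol instance and chain name.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"
CHAIN_NAME="${CHAIN_NAME:-chinese}"   # hypothetical name of the chain described above

# Print the REST endpoint of the configured chain.
chain_endpoint() {
    printf '%s/enhancer/chain/%s\n' "$STANBOL_URL" "$CHAIN_NAME"
}

# POST plain text to the chain and request the enhancement results as Turtle.
enhance_text() {
    curl -s -X POST \
         -H "Content-Type: text/plain; charset=UTF-8" \
         -H "Accept: text/turtle" \
         --data-binary "$1" \
         "$(chain_endpoint)"
}

# Example call (requires a running Stanbol instance):
# enhance_text "北京是中华人民共和国的首都。"
```

Declaring the charset explicitly helps ensure the Chinese characters reach the tokenizer engines unmangled.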

Solr Configuration

When you plan to use the Paoding Analyzer to process Chinese texts, it is important to also properly configure the Solr schema.xml used by the Entityhub SolrYard. The DZone article Indexing Chinese in Solr by Jason Hull provides useful background information on this.

When following those instructions, keep in mind that the {working-dir} of the Stanbol Entityhub IndexingTool is the directory from which you call 'java -jar …'. If you configure 'PAODING_DIC_HOME', the value will therefore be relative to the {working-dir}.

For the use of Paoding within Apache Stanbol, this directory is initialized automatically and is located in the persistent storage location of the org.apache.stanbol:org.apache.stanbol.commons.solr.extras.paoding:0.10.0-SNAPSHOT bundle.

Solr Field Configuration

To use the Paoding Analyzer for Chinese literals a FieldType and a DynamicField configuration need to be added to the Solr schema.xml.

The fieldType specification for Chinese

```
<fieldType name="text_zh" class="solr.TextField">
    <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
</fieldType>
```

A dynamic field using this field type that matches against Chinese language literals

```
<!--
 Dynamic field for Chinese languages.
 -->
<dynamicField name="@zh*" type="text_zh" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
```

The paoding.solrindex.zip is identical to the default configuration but uses the above fieldType and dynamicField specification.

Usage with the EntityhubIndexing Tool

Extract the paoding.solrindex.zip to the "indexing/config" directory.
Copy the Paoding Bundle (org.apache.stanbol:org.apache.stanbol.commons.solr.extras.paoding) into the lib directory of the Solr Core configuration "indexing/config/paoding/lib". Solr adds all jar files within this directory to the classpath, so it will find the Paoding analyzer implementation during indexing.
Rename the "indexing/config/paoding" directory to the {site-name} (the value of the "name" property of the "indexing/config/indexing.properties" file).

As an alternative to renaming the directory, you can also explicitly configure the name of the Solr config via the "solrConf" parameter (here "solrConf:paoding") of the SolrYardIndexingDestination.
```
indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:paoding,boosts:fieldboosts
```
Copy the Paoding dictionary to '{paoding-dic-dir}'. You can obtain the dictionary from the original Paoding project's SVN repository. A Zip archive with the dictionary is also included in the Paoding OSGI bundle that is part of Stanbol.
Pass -DPAODING_DIC_HOME={paoding-dic-dir} when calling the Entityhub indexing tool. As an alternative, you can also set 'PAODING_DIC_HOME' as a system environment variable.
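
The last step can be sketched as a small wrapper that checks the dictionary directory before starting the indexing tool. The function name `run_indexing` and the `JAVA` override variable are assumptions for illustration; the jar file and its arguments are passed through unchanged, because they depend on the indexing tool you use.

```shell
# Sketch: start the Entityhub indexing tool with PAODING_DIC_HOME set as a system property.
# run_indexing and the JAVA override are hypothetical helpers, not part of Stanbol.
run_indexing() {
    dic_home="$1"   # the {paoding-dic-dir}; a relative path is relative to the {working-dir}
    shift
    [ -d "$dic_home" ] || { echo "dictionary directory not found: $dic_home" >&2; return 1; }
    # Remaining arguments: the indexing tool jar followed by its own arguments
    "${JAVA:-java}" -DPAODING_DIC_HOME="$dic_home" -jar "$@"
}

# Alternative from the step above: export the variable instead of using -D
# export PAODING_DIC_HOME=/path/to/paoding-dic
```

Checking the directory first turns a misconfigured dictionary path into an immediate, readable error rather than a failure inside the tool.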

Usage with the Entityhub SolrYard

If you want to create an empty SolrYard instance using the paoding.solrindex.zip configuration you will need to

copy the paoding.solrindex.zip to the datafile directory of your Stanbol instance ({working-dir}/stanbol/datafiles)
rename it to the {name} of the SolrYard you want to create. The file name needs to be {name}.solrindex.zip
create the SolrYard instance and configure the "Solr Index/Core" (org.apache.stanbol.entityhub.yard.solr.solrUri) to {name}. Make sure the "Use default SolrCore configuration" (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) is disabled.

If you want to use the paoding.solrindex.zip as default, you can rename the file in the datafiles folder to "default.solrindex.zip" and then enable the "Use default SolrCore configuration" (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) when you configure a SolrYard instance.
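
The copy and rename steps can be combined into one helper. This is a sketch under stated assumptions: `install_solrindex` is a hypothetical function name, and the datafiles location follows the {working-dir}/stanbol/datafiles path given above.

```shell
# Sketch: place the Solr index archive in the datafiles directory under the required name.
# install_solrindex is a hypothetical helper, not part of Stanbol.
install_solrindex() {
    archive="$1"        # path to paoding.solrindex.zip
    yard_name="$2"      # the {name} of the SolrYard, or "default"
    stanbol_home="$3"   # the {working-dir} of the Stanbol instance

    datafiles="$stanbol_home/stanbol/datafiles"
    mkdir -p "$datafiles" || return 1
    # The file name needs to be {name}.solrindex.zip
    cp "$archive" "$datafiles/$yard_name.solrindex.zip"
}

# Example calls (paths and names are placeholders):
# install_solrindex ./paoding.solrindex.zip myyard .
# install_solrindex ./paoding.solrindex.zip default .
```

Passing "default" as the name produces default.solrindex.zip, which matches the default-configuration variant described above.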

See also the documentation on how to configure a managed site.
