Basic Chinese language support based on Paoding Analyzer
Because Chinese does not use whitespace characters to separate words, the default tokenizers used by Stanbol cannot properly process Chinese texts. Users who need to process Chinese texts therefore have to add special modules, even for basic language support.
The integration of the Stanbol NLP processing module with the Paoding Analyzer provides this support by
- correctly indexing controlled vocabularies with Chinese labels in the Stanbol Entityhub
- tokenizing the Chinese text passed to Stanbol
- tokenizing the Chinese labels of entities in the controlled vocabulary.
It is highly recommended to use the Paoding Analyzer in combination with the Smartcn Analyzer, as Smartcn provides sentence detection.
Installation
The Paoding Analyzer integration consists of three bundles, as referenced by the Paoding Bundle List. Users can either include this bundle list in their custom launcher configuration by adding the dependency
<dependency>
  <groupId>org.apache.stanbol</groupId>
  <artifactId>org.apache.stanbol.launchers.bundlelists.languageextras.paoding</artifactId>
  <version>0.10.0-SNAPSHOT</version>
  <packaging>partialbundlelist</packaging>
</dependency>
or manually install the three bundles referenced by the Paoding Bundle List in the Stanbol environment (e.g. by copying them to the stanbol/fileinstall directory).
NOTE: for sentence detection, users also need to install the Smartcn Analyzer.
Stanbol Enhancer configuration
When Paoding and Smartcn are installed in the Stanbol environment, several EnhancementEngines become available that can be used to configure an EnhancementChain for Chinese texts. A typical EnhancementChain for Chinese text looks like:
langdetect smartcn-sentence;optional paoding-token {your-entitylinking}
where '{your-entitylinking}' will typically be an EntityhubLinkingEngine configured for your vocabulary containing the entities with Chinese labels. Note that the smartcn-sentence engine is only available if the Smartcn Analyzer is also installed.
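As a rough sketch of how such a chain might be called once it is configured, the following Python snippet builds (but does not send) a POST request for the Enhancer's chain endpoint. The base URL http://localhost:8080 and the chain name "chinese" are assumptions made for illustration; replace them with the values of your own Stanbol instance.

```python
from urllib.request import Request


def build_enhancer_request(text, base_url="http://localhost:8080", chain="chinese"):
    """Build a POST request that submits plain text to an enhancement chain.

    base_url and chain are assumed placeholder values for this sketch.
    """
    url = f"{base_url.rstrip('/')}/enhancer/chain/{chain}"
    return Request(
        url,
        data=text.encode("utf-8"),  # Chinese text must be sent as UTF-8 bytes
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_enhancer_request("北京是中国的首都。")
print(request.get_method(), request.full_url)
```

Against a running instance, the request could then be sent with urllib.request.urlopen(request) and the response body read as the enhancement results.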
Solr Configuration
If you plan to use the Paoding Analyzer to process Chinese texts, it is important to also properly configure the Solr schema.xml used by the Entityhub SolrYard. The DZone article Indexing Chinese in Solr by Jason Hull provides useful background information on this.
When following those instructions, keep in mind that the {working-dir} of the Stanbol Entityhub IndexingTool is the directory from which you call 'java -jar …'. Therefore, if you configure 'PAODING_DIC_HOME', its value will be relative to the {working-dir}.
For the use of Paoding within Apache Stanbol, the directory is initialized automatically and is located in the persistent storage location of the org.apache.stanbol:org.apache.stanbol.commons.solr.extras.paoding:0.10.0-SNAPSHOT bundle.
Solr Field Configuration
To use the Paoding Analyzer for Chinese literals a FieldType and a DynamicField configuration need to be added to the Solr schema.xml.
- The fieldType specification for Chinese:
<fieldType name="text_zh" class="solr.TextField">
  <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
</fieldType>
- A dynamic field using this field type that matches Chinese language literals:
<!-- Dynamic field for Chinese languages. -->
<dynamicField name="@zh*" type="text_zh" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
The paoding.solrindex.zip is identical to the default configuration but uses the above fieldType and dynamicField specifications.
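To check that a customized schema.xml actually contains both entries, a small helper along the following lines can be used. This is only a sketch: the function name has_paoding_config and the minimal SAMPLE schema are made up for illustration, and the check looks for nothing beyond the field type name, the analyzer class and the dynamic field described above.

```python
import xml.etree.ElementTree as ET

PAODING_ANALYZER = "net.paoding.analysis.analyzer.PaodingAnalyzer"


def has_paoding_config(schema_xml):
    """Return True if the schema text defines 'text_zh' with the Paoding
    analyzer and a dynamic field '@zh*' that uses this field type."""
    root = ET.fromstring(schema_xml)
    field_type_ok = any(
        ft.get("name") == "text_zh"
        and any(a.get("class") == PAODING_ANALYZER for a in ft.iter("analyzer"))
        for ft in root.iter("fieldType")
    )
    dynamic_field_ok = any(
        df.get("name") == "@zh*" and df.get("type") == "text_zh"
        for df in root.iter("dynamicField")
    )
    return field_type_ok and dynamic_field_ok


# Minimal, hypothetical schema used only to exercise the check.
SAMPLE = """
<schema name="sample" version="1.5">
  <types>
    <fieldType name="text_zh" class="solr.TextField">
      <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
    </fieldType>
  </types>
  <fields>
    <dynamicField name="@zh*" type="text_zh" indexed="true" stored="true"
                  multiValued="true" omitNorms="false"/>
  </fields>
</schema>
"""

print(has_paoding_config(SAMPLE))
```

For a real configuration, pass the contents of the schema.xml file, e.g. has_paoding_config(open(path, encoding="utf-8").read()).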
Usage with the EntityhubIndexing Tool
- Extract the paoding.solrindex.zip to the "indexing/config" directory.
- Copy the Paoding bundle (org.apache.stanbol:org.apache.stanbol.commons.solr.extras.paoding) to the lib directory of the Solr core configuration, "indexing/config/paoding/lib". Solr adds all jar files in this directory to the classpath, so it will find the Paoding analyzer implementation during indexing.
- Rename the "indexing/config/paoding" directory to the {site-name} (the value of the "name" property in the "indexing/config/indexing.properties" file).
As an alternative to renaming the directory, you can explicitly configure the name of the Solr configuration with the "solrConf" parameter of the SolrYardIndexingDestination:
indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:paoding,boosts:fieldboosts
- Copy the Paoding dictionary to '{paoding-dic-dir}'. You can obtain the dictionary from the SVN repository of the original Paoding project. A ZIP archive with the dictionary is also included in the Paoding OSGi bundle that is part of Stanbol.
- Pass -DPAODING_DIC_HOME={paoding-dic-dir} when calling the Entityhub indexing tool. Alternatively, you can set 'PAODING_DIC_HOME' as a system environment variable.
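The file handling in the steps above can be scripted. The following Python sketch is one possible way to do so and makes a few assumptions: the function names are hypothetical, the archive is expected to unpack into a top-level "paoding" directory (as the rename step implies), and the jar of the indexing tool is passed in by the caller because its name is not given here.

```python
import shutil
import zipfile
from pathlib import Path


def prepare_solr_config(config_dir, solrindex_zip, bundle_jar, site_name):
    """Extract the archive, add the Paoding bundle to the core's lib
    directory and rename the core directory to the site name."""
    config_dir = Path(config_dir)
    with zipfile.ZipFile(solrindex_zip) as archive:
        archive.extractall(config_dir)  # assumed to create config_dir/paoding
    core_dir = config_dir / "paoding"
    lib_dir = core_dir / "lib"
    lib_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(bundle_jar, lib_dir)  # Solr loads every jar found in lib
    target = config_dir / site_name
    core_dir.rename(target)  # assumes no directory with this name exists yet
    return target


def indexing_command(tool_jar, dic_home, tool_args=()):
    """Build the java call; JVM options such as -D must precede -jar."""
    return ["java", f"-DPAODING_DIC_HOME={dic_home}", "-jar", str(tool_jar), *tool_args]
```

The list returned by indexing_command can be handed to subprocess.run, with the current working directory set to the {working-dir} of the indexing tool.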
Usage with the Entityhub SolrYard
If you want to create an empty SolrYard instance using the paoding.solrindex.zip configuration you will need to
- copy the paoding.solrindex.zip to the datafile directory of your Stanbol instance ({working-dir}/stanbol/datafiles)
- rename it to the {name} of the SolrYard you want to create. The file name needs to be {name}.solrindex.zip
- create the SolrYard instance and configure the "Solr Index/Core" (org.apache.stanbol.entityhub.yard.solr.solrUri) to {name}. Make sure the "Use default SolrCore configuration" (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) is disabled.
If you want to use the paoding.solrindex.zip as the default, you can rename the file in the datafiles folder to "default.solrindex.zip" and then enable the "Use default SolrCore configuration" option (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) when you configure a SolrYard instance.
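The copy and rename steps for both variants can be sketched in a few lines of Python. The helper name install_solrindex is hypothetical; the target path follows the {working-dir}/stanbol/datafiles layout described above, and passing "default" as the name corresponds to the default.solrindex.zip case.

```python
import shutil
from pathlib import Path


def install_solrindex(archive, working_dir, yard_name="default"):
    """Copy the archive to the datafiles directory as {yard_name}.solrindex.zip."""
    datafiles = Path(working_dir) / "stanbol" / "datafiles"
    datafiles.mkdir(parents=True, exist_ok=True)
    target = datafiles / f"{yard_name}.solrindex.zip"
    shutil.copyfile(archive, target)  # copy and rename in one step
    return target
```

Creating and configuring the SolrYard instance itself still has to be done in the Stanbol configuration as described in the list above.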
See also the documentation on how to configure a managed site.