This project has retired. For details please refer to its Attic page.
Apache Stanbol - The OpenNLP Custom NER Model Extraction Engine

The OpenNLP Custom NER Model Extraction Engine

This engine lets you configure custom Apache OpenNLP NameFinder models for named entity recognition (NER) on plain text content.

Example Result

This engine adds fise:TextAnnotation enhancements for the processed plain text to the metadata of the content item. The following listing shows a DNA-type named entity detected by an OpenNLP NameFinder model trained on the BioNLP2004 dataset:

{
    "@subject": "urn:enhancement-0e31eb01-23c5-82b5-1372-5c5606c09960",
    "@type": [
        "Enhancement",
        "TextAnnotation"
    ],
    "confidence": 0.40148407,
    "creator": "org.apache.stanbol.enhancer.engines.opennlp.impl.CustomNERModelEnhancementEngine",
    "start": 228,
    "end": 242,
    "extracted-from": "urn:content-item-sha1-84a30aeeb073be543f7c54266e232aae572efac0",
    "selected-text": {
        "@language": "en",
        "@literal": "HIV-2 enhancer"
    },
    "selection-context": {
        "@language": "en",
        "@literal": "activation of the HIV-2 enhancer in monocytes and T cells"
    },
    "type": "http://www.bootstrep.eu/ontology/GRO#DNA"
}
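As a rough illustration of how a client might read such an annotation, the following Python sketch extracts the selected text, entity type and confidence, and checks that the offsets agree with the selected text. It assumes that "start" and "end" are character offsets into the plain text, which the example values are consistent with. The function name summarize_annotation is hypothetical and not part of Stanbol.

```python
import json

# Trimmed copy of the annotation shown above (only the fields used here).
ANNOTATION_JSON = """
{
    "@type": ["Enhancement", "TextAnnotation"],
    "confidence": 0.40148407,
    "start": 228,
    "end": 242,
    "selected-text": {"@language": "en", "@literal": "HIV-2 enhancer"},
    "type": "http://www.bootstrep.eu/ontology/GRO#DNA"
}
"""


def summarize_annotation(annotation):
    """Return (text, entity_type, confidence, offsets_consistent).

    offsets_consistent is True when end - start equals the length of the
    selected text. This is an assumption about the offsets, not a
    documented guarantee.
    """
    text = annotation["selected-text"]["@literal"]
    span_length = annotation["end"] - annotation["start"]
    return (text, annotation["type"], annotation["confidence"],
            span_length == len(text))


annotation = json.loads(ANNOTATION_JSON)
summary = summarize_annotation(annotation)
print(summary)
```

A consumer could use the confidence value in the same way, for example to drop annotations below a threshold of its own choosing.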

Configuration

Using this engine requires creating a service configuration. Each configuration must specify at least one NameFinder model name.
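The minimum requirement can be checked before deploying a configuration. The sketch below is a hypothetical helper, not part of Stanbol. It assumes the configuration has already been loaded into a Python dict keyed by the property name 'stanbol.engines.opennlp-ner.nameFinderModels' used in the '.config' example on this page.

```python
MODELS_KEY = "stanbol.engines.opennlp-ner.nameFinderModels"


def missing_requirements(config):
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []
    models = config.get(MODELS_KEY, [])
    # Blank strings do not count as configured model names.
    if not any(name.strip() for name in models):
        problems.append("at least one NameFinder model name is required")
    return problems


print(missing_requirements({MODELS_KEY: ["bionlp2004-DNA-en.bin"]}))
print(missing_requirements({MODELS_KEY: []}))
```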

Parameters

The following figure shows an engine configuration that covers all named entity types supported by the BioNLP2004 dataset.

[Figure: CustomNerModelEngine Configuration]

The same configuration can also be provided as an OSGi configuration file named 'org.apache.stanbol.enhancer.engines.opennlp.impl.CustomNERModelEnhancementEngine-ehealthner.config' with the following contents:

stanbol.enhancer.engine.name="ehealth-ner"
stanbol.engines.opennlp-ner.nameFinderModels=["bionlp2004-DNA-en.bin","bionlp2004-protein-en.bin","bionlp2004-cell_type-en.bin","bionlp2004-cell_line-en.bin","bionlp2004-RNA-en.bin"]
stanbol.engines.opennlp-ner.typeMappings=["DNA\ >\ http://www.bootstrep.eu/ontology/GRO#DNA","RNA\ >\ http://www.bootstrep.eu/ontology/GRO#RNA","protein\ >\ http://www.bootstrep.eu/ontology/GRO#Protein","cell_type\ >\ http://purl.bioontology.org/ontology/CL","cell_line\ >\ http://purl.bioontology.org/ontology/MCCL"]

NOTE: the '.config' format requires spaces to be escaped with '\'.
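To avoid escaping mistakes when writing the type mappings by hand, the line can be generated. The Python sketch below builds a typeMappings line with escaped spaces and reads it back. It handles only the space escaping described in the note above. Any other escaping rules of the '.config' format are not covered, and the helper names are hypothetical.

```python
import re

TYPE_MAPPINGS_KEY = "stanbol.engines.opennlp-ner.typeMappings"


def escape_spaces(value):
    """Escape each space with a backslash."""
    return value.replace(" ", "\\ ")


def config_array(key, values):
    """Render key=["v1","v2",...] with spaces escaped inside each value."""
    quoted = ",".join('"{}"'.format(escape_spaces(v)) for v in values)
    return "{}=[{}]".format(key, quoted)


def parse_type_mappings(line):
    """Read the 'tag > URI' pairs back out of a typeMappings line."""
    _, _, array = line.partition("=")
    mappings = {}
    # Match each double-quoted value, allowing backslash escapes inside it.
    for raw in re.findall(r'"((?:[^"\\]|\\.)*)"', array):
        tag, _, uri = raw.replace("\\ ", " ").partition(" > ")
        mappings[tag] = uri
    return mappings


mappings = {
    "DNA": "http://www.bootstrep.eu/ontology/GRO#DNA",
    "cell_type": "http://purl.bioontology.org/ontology/CL",
}
line = config_array(
    TYPE_MAPPINGS_KEY,
    ["{} > {}".format(tag, uri) for tag, uri in mappings.items()],
)
print(line)
print(parse_type_mappings(line))
```

Reading the generated line back and comparing it with the input mapping is a cheap check that no space was left unescaped.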