Entity Dereference Engine
since version 0.12.0
with STANBOL-1222
The responsibility of the Dereference Engine is to retrieve information about Entities referenced by the Enhancement Results and to add it to the metadata of the Content Item.
Consumed information
The Entity Dereference Engine consumes the RDF enhancements generated by other Enhancement Engines. In particular, the fise:entity-reference properties used by fise:EntityAnnotation and fise:TopicAnnotation are processed by this engine, as they link to the Entities that need to be dereferenced.
- Language (optional): The language detected for the text may be used to determine the set of languages of literals to be dereferenced.
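As a rough illustration of how the fise:entity-reference values could be collected from the enhancement results, the following sketch gathers the unique URI objects of all triples that use a configured reference property. SimpleTriple and EntityReferenceCollector are hypothetical stand-ins written for this example. They are not Stanbol or Clerezza types, and the actual engine works on an RDF graph rather than a list.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for an RDF triple; not a Stanbol or Clerezza type. */
final class SimpleTriple {
    final String subject;
    final String predicate;
    final String object;
    final boolean objectIsUri; // false for literals

    SimpleTriple(String subject, String predicate, String object, boolean objectIsUri) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
        this.objectIsUri = objectIsUri;
    }
}

/** Hypothetical helper that mimics the selection of entities to dereference. */
final class EntityReferenceCollector {

    /** Full URI of fise:entity-reference */
    static final String ENTITY_REFERENCE =
            "http://fise.iks-project.eu/ontology/entity-reference";

    /** Returns the unique URI objects of all triples using one of the given properties. */
    static Set<String> collect(List<SimpleTriple> enhancements, Set<String> referenceProperties) {
        Set<String> entities = new LinkedHashSet<>(); // a Set drops duplicates
        for (SimpleTriple triple : enhancements) {
            if (triple.objectIsUri && referenceProperties.contains(triple.predicate)) {
                entities.add(triple.object);
            }
        }
        return entities;
    }
}
```

An entity referenced by several annotations ends up in the result only once, so it would be scheduled for dereferencing only once.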
Design
The Entity Dereference Engine cannot be used directly to dereference Entities. It provides the base functionality for implementing Dereference Engines for different technologies and services. One such implementation is the Entityhub Dereference Engine, which dereferences Entities via the Stanbol Entityhub.
The module providing this infrastructure is:
<dependency>
    <groupId>org.apache.stanbol</groupId>
    <artifactId>org.apache.stanbol.enhancer.dereference.core</artifactId>
    <version>${stanbol-version}</version>
</dependency>
This module provides the following main components:
- EnhancementEngine implementation that
  - processes the Enhancement results and schedules Entities to be dereferenced.
  - supports the use of a thread pool to dereference multiple entities concurrently.
  - supports EnhancementProperties for chain and request scoped configuration of the dereferenced information.
- Definition of the EntityDereferencer interface used to dereference scheduled entities. This interface needs to be implemented by Dereference Engines for different technologies/services (e.g. the Entityhub).
In addition the module also provides utilities for managing the enhancement engine configuration as well as parsed Enhancement Properties.
Configuration
The following configuration parameters are defined by the core Entity Dereference Engine. Actual Dereference Engine implementations might not support all of them.
- Name (stanbol.enhancer.engine.name): The name of the Enhancement engine
- URI Prefix (enhancer.engines.dereference.uriPrefix): Configures [0..*] prefixes of Entity URIs that can be dereferenced by this engine. If present, only Entities that match one of those prefixes are scheduled to be dereferenced by the EntityDereferencer.
- URI Pattern (enhancer.engines.dereference.uriPatter): Configures regex patterns for matching Entity URIs. If present, only Entities matching at least one of the configured patterns will be scheduled for dereferencing.
- Fallback Mode (enhancer.engines.dereference.fallback): In Fallback Mode, Entities are only scheduled for dereferencing if no data for them is present yet in the Enhancement results. If a Weighted Chain is used, this mode also makes sure that Dereference Engines in Fallback Mode are executed after those with this mode deactivated.
  This option is only useful in cases where multiple dereference engines are used in the same enhancement chain. It supports the following workflows:
  - Running Dereference Engines for fast/local data sources first, especially those for which a URI Prefix and/or a URI Pattern can be configured, by deactivating Fallback Mode; then running Dereference Engines for datasets that require remote service calls, or for which neither a URI Prefix nor a URI Pattern can be configured, by activating Fallback Mode. This can greatly improve performance and reduce the number of remote service calls, as already dereferenced Entities will not be scheduled for dereferencing via the remote service.
  - Settings where a partial local cache exists for an otherwise slow data source. In this case one can configure two Entity Dereference Engines for the same data source: a first one with Fallback Mode deactivated for the partial cache, and a second one with Fallback Mode enabled for the original but slower data source.
- Dereference Properties (enhancer.engines.dereference.references): The list of properties that reference Entities. By default fise:entity-reference is used. A triple pattern (null, {entity-reference}, null) is used for all configured property URIs. All unique objects of type URI are considered as entities to be dereferenced. NOTE that a configured URI Prefix and/or URI Pattern is also applied to the list of entity URIs.
- Dereference Languages (enhancer.engines.dereference.languages): A set of languages that are dereferenced. Even if 'Dereference only Content Language Literals' is active, explicitly configured languages will still get dereferenced. If not present and 'Dereference only Content Language Literals' is deactivated, literals of any language will get dereferenced.
- Dereference only Content Language Literals (enhancer.engine.dereference.filterContentlanguages): If enabled only Literals with the same language as the language detected for the Content will get dereferenced. Literals with no language tag will always get dereferenced.
- Dereferenced Fields (enhancer.engines.dereference.fields): The fields (in RDF terminology 'properties') to be dereferenced. Typically QNames (e.g. rdfs:label) can be used for the configuration. However, support for QNames is optional. Some implementations might also support wildcards and exclusions.
- Dereference LD Path (enhancer.engines.dereference.ldpath): The LD Path Language allows defining powerful selectors for dereferenced Entities. As an example, LDPath allows selecting different properties based on the type of the dereferenced entity.
- Service Ranking (service.ranking): The OSGI service ranking. It only has an effect if there are two engines with the same name. In such cases the one with the higher service ranking will get called.
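The scheduling rules described for URI Prefix, URI Pattern and Fallback Mode can be sketched as follows. UriFilter is a hypothetical helper written for this example and not a class of the dereference core module. The sketch assumes that a URI passes as soon as it matches any configured prefix or any configured pattern, and that a pattern has to match the whole URI. The actual engine may combine or apply the filters differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; it only mirrors the rules described for the configuration. */
final class UriFilter {
    private final List<String> prefixes;
    private final List<Pattern> patterns = new ArrayList<>();
    private final boolean fallbackMode;

    UriFilter(List<String> prefixes, List<String> regexes, boolean fallbackMode) {
        this.prefixes = prefixes;
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        this.fallbackMode = fallbackMode;
    }

    /** True if the URI passes the prefix/pattern filters (assumption: any match is enough). */
    boolean accepts(String uri) {
        if (prefixes.isEmpty() && patterns.isEmpty()) {
            return true; // no filter configured: every entity is accepted
        }
        for (String prefix : prefixes) {
            if (uri.startsWith(prefix)) {
                return true;
            }
        }
        for (Pattern pattern : patterns) {
            if (pattern.matcher(uri).matches()) { // assumption: full match
                return true;
            }
        }
        return false;
    }

    /** In Fallback Mode an entity is skipped if data for it is already present. */
    boolean shouldSchedule(String uri, boolean dataAlreadyPresent) {
        if (fallbackMode && dataAlreadyPresent) {
            return false;
        }
        return accepts(uri);
    }
}
```

With this reading, an engine in Fallback Mode never repeats work for an entity that an earlier engine in the chain has already dereferenced.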
NOTE that the configurations for Dereference Languages, Dereferenced Fields and Dereference LD Path are just managed by the Core Entity Dereference Engine implementation. Actual support for such properties will depend on the actual EntityDereferencer
implementation.
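How an EntityDereferencer might honour the two language options is sketched below. LanguageFilter is a hypothetical class for illustration only. The case where languages are configured while 'Dereference only Content Language Literals' is deactivated is not spelled out in the description above, so the sketch assumes that only the configured languages pass in that case.

```java
import java.util.Set;

/** Hypothetical helper; one possible reading of the language related options. */
final class LanguageFilter {
    private final Set<String> configuredLanguages;
    private final boolean filterContentLanguages;

    LanguageFilter(Set<String> configuredLanguages, boolean filterContentLanguages) {
        this.configuredLanguages = configuredLanguages;
        this.filterContentLanguages = filterContentLanguages;
    }

    /**
     * @param literalLanguage the language tag of the literal or null if it has none
     * @param contentLanguages the languages detected for the content
     */
    boolean accepts(String literalLanguage, Set<String> contentLanguages) {
        if (literalLanguage == null) {
            return true; // literals without a language tag are always dereferenced
        }
        if (configuredLanguages.contains(literalLanguage)) {
            return true; // explicitly configured languages are always dereferenced
        }
        if (filterContentLanguages) {
            return contentLanguages.contains(literalLanguage);
        }
        // assumption: without any configured language every language is accepted
        return configuredLanguages.isEmpty();
    }
}
```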
Building a Custom Entity Dereference Engine
This section describes the steps necessary for building a custom Entity Dereference Engine.
Entity Dereferencer implementation
The EntityDereferencer interface is used to dereference Entities. It also allows the EntityDereferenceEngine to check if OfflineMode is supported and to retrieve the ExecutorService.
The following listing shows the signature of the EntityDereferencer interface:
public interface EntityDereferencer {
    boolean supportsOfflineMode();
    ExecutorService getExecutor();
    boolean dereference(UriRef entity, MGraph graph, Lock writeLock,
            DereferenceContext dereferenceContext) throws DereferenceException;
}
supportsOfflineMode needs to return true if the implementation does not need to access a remote service for dereferencing entities and false if it requires remote services. If Apache Stanbol is started with Offline Mode enabled, EntityDereferencer implementations that do not support Offline Mode will not be called, meaning that no Entities will get dereferenced from services that require an internet connection.
The ExecutorService
is used by the EntityDereferenceEngine
to concurrently dereference entities. This means that the dereference(..)
method of the EntityDereferencer
implementations will be called in the contexts of threads provided by the returned ExecutorService
. Returning null
will deactivate this feature.
NOTE that all EntityDereferencer implementations MUST BE thread safe, as multiple threads will be used to call the dereference(..) method. Even if getExecutor() returns null, the EnhancementJobManager will still use multiple threads for calling the EntityDereferenceEngine, meaning that dereference(..) will be called from different thread contexts.
The dereference(..) method is used to dereference the Entity with the passed UriRef. Dereferenced information is expected to be written to the passed MGraph. While writing dereferenced information to the graph, the write lock MUST BE acquired. The DereferenceContext provides the configuration (see the following section for more information). If the entity was successfully dereferenced, this method is expected to return true, otherwise false.
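The locking contract of dereference(..) can be illustrated with a minimal, self-contained sketch. InMemoryDereferencerSketch is hypothetical: a String stands in for the UriRef, a List of statement strings for the MGraph, and a Map for the data source, so that the example runs without Stanbol or Clerezza on the classpath. It holds the write lock only while adding data to the shared graph and does the lookup outside of the lock.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.Lock;

/** Hypothetical stand-in for an EntityDereferencer backed by an in-memory map. */
final class InMemoryDereferencerSketch {

    /** entity URI -> statements known about that entity */
    private final Map<String, List<String>> source;

    InMemoryDereferencerSketch(Map<String, List<String>> source) {
        this.source = source;
    }

    /** Returns true if data for the entity was written to the graph. */
    boolean dereference(String entity, List<String> graph, Lock writeLock) {
        // the (possibly slow) lookup happens without holding the lock
        List<String> data = source.get(entity);
        if (data == null || data.isEmpty()) {
            return false; // nothing known about this entity
        }
        writeLock.lock();
        try {
            graph.addAll(data); // the shared graph is only modified while locked
        } finally {
            writeLock.unlock(); // always release, even if writing fails
        }
        return true;
    }
}
```

Keeping the locked section this short matters when several entities are dereferenced concurrently, as all of them write to the same graph.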
Configuration API
Configuration Parameters supported by the Core Entity Dereference Engine implementation are defined in the DereferenceConstants
class.
DereferenceEngineConfig
The DereferenceEngineConfig class provides easy, API based access to those configuration parameters. It is instantiated by using the Dictionary passed by OSGI as part of the ComponentContext.
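For implementation specific parameters that are not covered by DereferenceEngineConfig, values have to be read from the Dictionary directly. The following sketch shows a common way to normalise such values, which may arrive as a single String, a String array or a Collection. ConfigValues and its methods are hypothetical helpers written for this example. They are not part of the DereferenceEngineConfig API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Dictionary;
import java.util.List;

/** Hypothetical helper for reading values from an OSGI configuration Dictionary. */
final class ConfigValues {

    /** Reads a multi valued property; blank values are ignored. */
    static List<String> getStrings(Dictionary<String, Object> config, String key) {
        Object value = config.get(key);
        List<String> result = new ArrayList<>();
        if (value instanceof String[]) {
            for (String element : (String[]) value) {
                add(result, element);
            }
        } else if (value instanceof Collection<?>) {
            for (Object element : (Collection<?>) value) {
                add(result, element);
            }
        } else {
            add(result, value); // single value or null
        }
        return result;
    }

    /** Reads a boolean property that may be configured as Boolean or as String. */
    static boolean getBoolean(Dictionary<String, Object> config, String key,
            boolean defaultValue) {
        Object value = config.get(key);
        if (value instanceof Boolean) {
            return (Boolean) value;
        }
        return value == null ? defaultValue : Boolean.parseBoolean(value.toString());
    }

    private static void add(List<String> result, Object value) {
        if (value != null && !value.toString().trim().isEmpty()) {
            result.add(value.toString().trim());
        }
    }
}
```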
DereferenceContext
The DereferenceContext is used to pass request specific context to the EntityDereferencer implementation.
For that it is important to note that a single request to the Entity Dereference Engine can schedule multiple Entities to be dereferenced and therefore result in multiple calls to the EntityDereferencer#dereference(..) method. All such calls will use the same DereferenceContext instance.
Extending the DereferenceContextFactory allows dereference engine implementations to use a custom DereferenceContext. With that it is possible to parse request specific configuration (e.g. passed via Enhancement Properties) only once per request. The following code snippet shows how to use a custom DereferenceContext with the core EntityDereferenceEngine implementation.
entityDereferenceEngine = new EntityDereferenceEngine(entityDereferencer, engineConfig,
    new DereferenceContextFactory() { //we want to use our own DereferenceContext impl
        @Override
        public DereferenceContext createContext(EntityDereferenceEngine engine,
                Map<String,Object> enhancementProperties)
                throws DereferenceConfigurationException {
            //Instantiate custom DereferenceContext
            DereferenceContext dereferenceContext = null; //TODO
            return dereferenceContext;
        }
    });
For the initialization of the custom DereferenceContext one needs to use the initialise callback:
public class MyDereferenceContext extends DereferenceContext {

    protected MyDereferenceContext(MyDereferenceEngine engine,
            Map<String,Object> enhancementProps)
            throws DereferenceConfigurationException {
        super(engine, enhancementProps);
    }

    @Override
    protected void initialise() throws DereferenceConfigurationException {
        //do your custom initialisation here
    }
}
If you apply this code, all calls to EntityDereferencer#dereference(..) will be passed an instance of the custom DereferenceContext implementation.
The custom DereferenceContext implementation of the Entityhub Dereference Engine is a good example to start from.
OSGI Component
Finally each Dereference Engine implementation needs to provide an OSGI component. This component is required for parsing the configuration and for implementing the life cycle.
The following listing provides the pseudo code for such a component:
@Component(
    configurationFactory = true, //allow multiple instances
    policy = ConfigurationPolicy.REQUIRE, //a configuration is required
    metatype = true,
    immediate = true)
@Properties(value={
    @Property(name=PROPERTY_NAME), //the name of the engine
    //Properties supported by the Core Entity Dereference Engine
    @Property(name=EntityhubDereferenceEngine.SITE_ID),
    @Property(name=DereferenceConstants.FALLBACK_MODE,
        boolValue=DereferenceConstants.DEFAULT_FALLBACK_MODE),
    @Property(name=DereferenceConstants.URI_PREFIX, cardinality=Integer.MAX_VALUE),
    @Property(name=DereferenceConstants.URI_PATTERN, cardinality=Integer.MAX_VALUE),
    @Property(name=DereferenceConstants.FILTER_CONTENT_LANGUAGES,
        boolValue=DereferenceConstants.DEFAULT_FILTER_CONTENT_LANGUAGES),
    @Property(name=DEREFERENCE_ENTITIES_FIELDS, cardinality=Integer.MAX_VALUE,
        value={"rdfs:comment","geo:lat","geo:long","foaf:depiction","dbp-ont:thumbnail"}),
    @Property(name=DEREFERENCE_ENTITIES_LDPATH, cardinality=Integer.MAX_VALUE),
    /* add also implementation specific properties */
    @Property(name=SERVICE_RANKING, intValue=0)
})
public class YourDereferenceEngineComponent {

    /** support QName configurations */
    @Reference(cardinality=ReferenceCardinality.OPTIONAL_UNARY)
    protected NamespacePrefixService prefixService;

    /** The engine instance registered as OSGI service */
    protected EntityDereferenceEngine entityDereferenceEngine;

    /** The OSGI service registration */
    protected ServiceRegistration engineRegistration;

    @Activate
    protected void activate(ComponentContext ctx) throws ConfigurationException {
        Dictionary<String,Object> properties = ctx.getProperties();
        DereferenceEngineConfig engineConfig =
            new DereferenceEngineConfig(properties, prefixService);
        /* TODO: parse custom configuration properties */
        /* Initialise the custom EntityDereferencer implementation */
        EntityDereferencer entityDereferencer = null; //TODO
        //create the Entity Dereference Engine instance
        entityDereferenceEngine = new EntityDereferenceEngine(entityDereferencer, engineConfig);
        //register the engine as OSGI service
        engineRegistration = ctx.getBundleContext().registerService(
            new String[]{EnhancementEngine.class.getName(), ServiceProperties.class.getName()},
            entityDereferenceEngine, engineConfig.getDict());
    }

    @Deactivate
    protected void deactivate(ComponentContext context) {
        //Unregister the OSGI service
        if(engineRegistration != null){
            engineRegistration.unregister();
            engineRegistration = null;
        }
        entityDereferenceEngine = null;
        //TODO: close the dereferencer implementation (if required)
    }
}