Apache Solr UIMA Metadata Extraction Library

Introduction
------------
This module is intended to be used both as an UpdateRequestProcessor while indexing documents and as a set of tokenizers/filters
to be configured inside schema.xml for use during the analysis phase.
The purpose of UIMAUpdateRequestProcessor is to add automatically generated fields to the Solr index on the fly.
Such fields could be language, concepts, keywords, sentences, named entities, etc.
UIMA-based tokenizers/filters can be used either inside plain Lucene or as index/query analyzers defined
inside the schema.xml of a Solr core to create/filter tokens using specific UIMA annotations.
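
As an illustration of the analysis-time usage described above, a field type in schema.xml might look like the following minimal sketch. It assumes the solr.UIMAAnnotationsTokenizerFactory class with its descriptorPath and tokenType attributes. The field type name uima_sentences and the descriptor path /uima/AggregateSentenceAE.xml are placeholders chosen for illustration; replace them with a descriptor that is actually available on your classpath.

```xml
<!-- Sketch only: the fieldType name and descriptorPath are assumptions -->
<fieldType name="uima_sentences" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- creates tokens from the annotations of the given UIMA type -->
    <tokenizer class="solr.UIMAAnnotationsTokenizerFactory"
               descriptorPath="/uima/AggregateSentenceAE.xml"
               tokenType="org.apache.uima.SentenceAnnotation"/>
  </analyzer>
</fieldType>
```

A field declared with such a type would be tokenized by the UIMA pipeline during analysis, independently of the UpdateRequestProcessor configuration described in the next section.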

Getting Started
---------------
To start using the Solr UIMA Metadata Extraction Library, go through the following configuration steps:

1. copy the generated solr-uima jar and its libs (under contrib/uima/lib) into a Solr libraries directory,
   or set <lib/> tags in solrconfig.xml to point to those jar files:

   <lib dir="../../contrib/uima/lib" />
   <lib dir="../../contrib/uima/lucene-libs" />
   <lib dir="../../dist/" regex="solr-uima-\d.*\.jar" />

2. modify your schema.xml, adding the fields that will hold the metadata and specifying proper values for the type, indexed, stored and multiValued options.
   For example, you could specify the following:

   <field name="language" type="string" indexed="true" stored="true" required="false"/>
   <field name="concept" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
   <field name="sentence" type="text" indexed="true" stored="true" multiValued="true" required="false" />

3. modify your solrconfig.xml, adding the following snippet:

   <updateRequestProcessorChain name="uima">
     <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
       <lst name="uimaConfig">
         <lst name="runtimeParameters">
           <str name="keyword_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="concept_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="lang_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="cat_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="entities_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="oc_licenseID">VALID_OPENCALAIS_KEY</str>
         </lst>
         <str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str>
         <!-- Set to true if you want to continue indexing even if text processing fails.
              Default is false: Solr throws a RuntimeException and
              none of the documents in your session are indexed. -->
         <bool name="ignoreErrors">true</bool>
         <!-- This is optional. It is used for logging when text processing fails.
              If logField is not specified, uniqueKey will be used as logField.
         <str name="logField">id</str>
         -->
         <lst name="analyzeFields">
           <bool name="merge">false</bool>
           <arr name="fields">
             <str>text</str>
           </arr>
         </lst>
         <lst name="fieldMappings">
           <lst name="type">
             <str name="name">org.apache.uima.alchemy.ts.concept.ConceptFS</str>
             <lst name="mapping">
               <str name="feature">text</str>
               <str name="field">concept</str>
             </lst>
           </lst>
           <lst name="type">
             <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
             <lst name="mapping">
               <str name="feature">language</str>
               <str name="field">language</str>
             </lst>
           </lst>
           <lst name="type">
             <str name="name">org.apache.uima.SentenceAnnotation</str>
             <lst name="mapping">
               <str name="feature">coveredText</str>
               <str name="field">sentence</str>
             </lst>
           </lst>
         </lst>
       </lst>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

   where VALID_ALCHEMYAPI_KEY is your AlchemyAPI Access Key. You need to register for an AlchemyAPI Access
   Key to use the AlchemyAPI services: http://www.alchemyapi.com/api/register.html

   where VALID_OPENCALAIS_KEY is your Calais Service Key. You need to register for a Calais Service
   Key to use the Calais services: http://www.opencalais.com/apikey

   analysisEngine must point to an AE descriptor located at the specified path in the classpath

   analyzeFields must contain the input fields that need to be analyzed by UIMA;
   if merge=true, their content will be merged and analyzed only once

   fieldMappings describes which features of which types should go into which field

4. in your solrconfig.xml replace the existing default (<requestHandler name="/update"...) or create a new UpdateRequestHandler with the following:

   <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.processor">uima</str>
     </lst>
   </requestHandler>
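
To make the merge option of analyzeFields from step 3 more concrete, the following sketch shows a variant in which two input fields are merged and sent to UIMA as a single piece of text. The field name title is a hypothetical addition for illustration; only text appears in the configuration above.

```xml
<!-- Sketch only: "title" is an assumed field name, not part of the original configuration -->
<lst name="analyzeFields">
  <!-- true: the contents of all listed fields are merged and analyzed only once -->
  <bool name="merge">true</bool>
  <arr name="fields">
    <str>title</str>
    <str>text</str>
  </arr>
</lst>
```

With merge set to false, as in step 3, the listed fields are not merged before analysis. Since several of the configured annotators call remote services, merging may be worth considering when a document has several input fields.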

Once you're done with the configuration, you can index documents, which will be automatically enriched with the specified fields.
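
As a rough illustration, a document posted to the /update handler configured above could look like the following Solr XML update message. It assumes that id is the uniqueKey of your schema and that text is the field listed under analyzeFields; the field values are made up for the example.

```xml
<!-- Sketch only: field values are invented, and "id" is assumed to be the uniqueKey -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- content of this field is passed to the UIMA analysis engine -->
    <field name="text">Apache Solr is an open source search platform built on Apache Lucene.</field>
  </doc>
</add>
```

If the analysis succeeds, the indexed document should additionally carry whichever of the language, concept and sentence fields the configured annotators produced for this text.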