SOLR-12376: Add some links to other Ref Guide pages; minor format & typo cleanup

[[the-tagger-handler]]
= The Tagger Handler

The "Tagger" Request Handler, AKA the "SolrTextTagger" is a "text tagger".
|
||||
|
||||
Given a dictionary (a Solr index) with a name-like field,
|
||||
you post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired.
|
||||
you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired.
|
||||
It's used for named entity recognition (NER).

The tagger doesn't do any natural language processing (NLP), outside of Lucene text analysis, so it's considered a "naive tagger",
but it's definitely useful as-is, and a more complete NER or ERD (entity recognition and disambiguation)
system can be built with this as a key component.
The SolrTextTagger might be used on queries for query understanding, or on large documents as well.

To get a sense of how to use it, jump to the <<tutorial-with-geonames,tutorial>> below.

The tagger does not yet support a sharded index.
Tens, perhaps hundreds of millions of names (documents) are supported, mostly limited by memory.

[[tagger-configuration]]
== Tagger Configuration

To configure the tagger, your Solr schema needs 2 fields:

* A unique key field (see <<other-schema-elements.adoc#unique-key,Unique Key>> for how to define a unique key in your schema).
Recommended field settings: set `docValues=true`.
* A tag field, which must be a `TextField`, with `ConcatenateGraphFilterFactory` at the end of the index chain (not the query chain).
Set `preservePositionIncrements=false` on that filter.
Recommended field settings: `omitNorms=true`, `omitTermFreqAndPositions=true`, and `postingsFormat=FST50`.

The text field's _index analysis chain_, aside from needing `ConcatenateGraphFilterFactory` at the end,
can otherwise have whatever tokenizer and filters suit your matching preferences.
It can have multi-word synonyms and use `WordDelimiterGraphFilterFactory`, for example.
However, do _not_ use `FlattenGraphFilterFactory` as it will interfere with `ConcatenateGraphFilterFactory`.
Position gaps (e.g., stop words) get ignored; it's not (yet) supported for the gap to be significant.
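
For illustration, here is a minimal sketch of such a field type defined through the Schema API; the collection name `mycollection` and the particular tokenizer and filters are assumptions for this example, not requirements, and the <<tutorial-with-geonames,tutorial>> below shows a concrete, fuller version:

[source,bash]
----
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/mycollection/schema -d '{
  "add-field-type":{
    "name":"tag",
    "class":"solr.TextField",
    "postingsFormat":"FST50",
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "indexAnalyzer":{
      "tokenizer":{"class":"solr.StandardTokenizerFactory"},
      "filters":[
        {"class":"solr.LowerCaseFilterFactory"},
        {"class":"solr.ConcatenateGraphFilterFactory", "preservePositionIncrements":false}
      ]},
    "queryAnalyzer":{
      "tokenizer":{"class":"solr.StandardTokenizerFactory"},
      "filters":[
        {"class":"solr.LowerCaseFilterFactory"}
      ]}
  }
}'
----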

The text field's _query analysis chain_, on the other hand, is more limited.
There should not be tokens at the same position, thus no synonym expansion -- do that at index time instead.
Stop words (or any other filter introducing a position gap) are supported.
At runtime the tagger can be configured to either treat a position gap as a tag break or to ignore it.

Your `solrconfig.xml` needs the `solr.TaggerRequestHandler` defined, which supports `defaults`, `invariants`, and `appends`
sections just like the search handler.

For configuration examples, jump to the <<tutorial-with-geonames,tutorial>> below.

[[tagger-parameters]]
== Tagger Parameters

The tagger's execution is completely configurable with request parameters.
Only `field` is required.

`field`::
The tag field that serves as the dictionary.
This is required; you'll probably specify it in the request handler.

`fq`::
You can specify some number of _filter queries_ to limit the dictionary used for tagging.
This parameter is the same one used by the `solr.SearchHandler`.
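+
For example, with the geonames collection from the <<tutorial-with-geonames,tutorial>> below (an assumption for this sketch), a filter query could restrict tagging to United States places only:
+
[source,bash]
----
curl -X POST 'http://localhost:8983/solr/geonames/tag?fq=countrycode:US&fl=id,name,countrycode' \
  -H 'Content-Type:text/plain' -d 'Hello New York City'
----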

`rows`::
The maximum number of documents to return; it defaults to 10000 for a tag request.
This parameter is the same one used by the `solr.SearchHandler`.

`fl`::
Solr's standard parameter for listing the fields to return.
This parameter is the same one used by the `solr.SearchHandler`.

`overlaps`::
Choose the algorithm to determine which tags in an overlapping set should be retained, versus being pruned away.
Options are:
+
* `ALL`: Emit all tags.
* `NO_SUB`: Don't emit a tag that is completely within another tag (i.e., no subtag).
* `LONGEST_DOMINANT_RIGHT`: Given a cluster of overlapping tags, emit the longest one (by character length).
If there is a tie, pick the right-most.
Remove any tags overlapping with this tag, then repeat the algorithm to potentially find other tags that can be emitted in the cluster.
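+
As a sketch (again assuming the geonames collection from the tutorial below): if the dictionary contained both "New York" and "New York City", then with `overlaps=LONGEST_DOMINANT_RIGHT` only the longer "New York City" tag would be emitted for this input:
+
[source,bash]
----
curl -X POST 'http://localhost:8983/solr/geonames/tag?overlaps=LONGEST_DOMINANT_RIGHT&fl=id,name' \
  -H 'Content-Type:text/plain' -d 'Hello New York City'
----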

`matchText`::
A boolean indicating whether to return the matched text in the tag response.
This will trigger the tagger to fully buffer the input before tagging.
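+
For example (a sketch assuming the geonames collection from the tutorial below), with `matchText=true` each returned tag also carries the matched substring, "New York City" here:
+
[source,bash]
----
curl -X POST 'http://localhost:8983/solr/geonames/tag?matchText=true&fl=id,name' \
  -H 'Content-Type:text/plain' -d 'Hello New York City'
----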

`tagsLimit`::
The maximum number of tags to return in the response.
Tagging effectively stops after this point.
By default this is `1000`.

`skipAltTokens`::
A boolean flag used to suppress errors that can occur if, for example,
you enable synonym expansion at query time in the analyzer, which you normally shouldn't do.
Let this default to `false` unless you know that such tokens can't be avoided.

`ignoreStopwords`::
A boolean flag that causes stopwords (or any condition causing positions to skip, like >255 character words)
to be ignored as if they aren't there.
Otherwise, the behavior is to treat them as breaks in tagging, on the presumption your indexed text-analysis
configuration doesn't have a `StopWordFilter` defined.
By default the indexed analysis chain is checked for the presence of a `StopWordFilter`; if one is found, `ignoreStopwords` defaults to true when unspecified.
You probably shouldn't have a `StopWordFilter` configured and probably won't need to set this parameter either.

`xmlOffsetAdjust`::
A boolean indicating that the input is XML and, furthermore, that the offsets of returned tags should be adjusted as
necessary to allow for the client to insert an opening and closing element at the tag offset pair.
If it isn't possible to do so then the tag will be omitted.
You are expected to configure `HTMLStripCharFilterFactory` in the schema when using this option.
This will trigger the tagger to fully buffer the input before tagging.
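+
A sketch of such a request (assuming the geonames collection from the tutorial below, with the schema adjusted as described above):
+
[source,bash]
----
curl -X POST 'http://localhost:8983/solr/geonames/tag?xmlOffsetAdjust=true&fl=id,name' \
  -H 'Content-Type:text/xml' -d '<doc>Hello <b>New York City</b></doc>'
----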

Solr's parameters for controlling the response format are also supported, such as `echoParams`, `wt`, `indent`, etc.

[[tutorial-with-geonames]]
== Tutorial with Geonames

This is a tutorial that demonstrates how to configure and use the text
tagger with the popular http://www.geonames.org/[Geonames] data set.
It's more than a tutorial; it's a how-to with information that wasn't described above.

[[tagger-create-and-configure-a-solr-collection]]
=== Create and Configure a Solr Collection

Create a Solr collection named "geonames".
For the tutorial, we'll assume the default "data-driven" configuration.
It's good for experimentation and getting going fast, but not for production or being optimal.

[source,bash]
----
bin/solr create -c geonames
----

[[tagger-configuring]]
==== Configuring the Tagger

We need to configure the schema first.
The "data driven" mode we're using allows us to keep this step fairly minimal -- we just need to declare a field type, 2 fields, and a copy-field.

The critical part up-front is to define the "tag" field type.
There are many ways to configure text analysis, and we're not going to get into those choices here.
But an important bit is the `ConcatenateGraphFilterFactory` at the end of the index analyzer chain.
Another important bit for performance is `postingsFormat=FST50`, resulting in a compact FST-based in-memory data structure that is especially beneficial for the text tagger.

Schema configuration:

[source,bash]
----
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/geonames/schema -d '{
  "add-field-type":{
    "name":"tag",
...
    ]}
  },

  "add-field":{"name":"name", "type":"text_general"},

  "add-field":{"name":"name_tag", "type":"tag", "stored":false},

  "add-copy-field":{"source":"name", "dest":["name_tag"]}
}'
----

Configure a custom Solr Request Handler:

[source,bash]
----
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/geonames/config -d '{
  "add-requesthandler" : {
    "name": "/tag",
    "class":"solr.TaggerRequestHandler",
    "defaults":{"field":"name_tag"}
  }
}'
----

[[tagger-load-some-sample-data]]
=== Load Some Sample Data

... should be almost 7MB file expanding to a cities1000.txt file around ... population.

Using bin/post:

[source,bash]
----
bin/post -c geonames -type text/csv \
  -params 'optimize=true&separator=%09&encapsulator=%00&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate' \
  /tmp/cities1000.txt
----

Or using curl:

[source,bash]
----
curl -X POST --data-binary @/path/to/cities1000.txt -H 'Content-type:application/csv' \
  'http://localhost:8983/solr/geonames/update?commit=true&optimize=true&separator=%09&encapsulator=%00&fieldnames=id,name,,alternative_names,latitude,longitude,,,countrycode,,,,,,population,elevation,,timezone,lastupdate'
----

That might take around 35 seconds; it depends.
It can be a lot faster if the schema were tuned to only have what we truly need (no text search if not needed).

In that command we said `optimize=true` to put the index in a state that will make tagging faster.
The `encapsulator=%00` is a bit of a hack to disable the default double-quote.

[[tagger-tag-time]]
=== Tag Time!

This is a trivial example tagging a small piece of text.
For more options, see the earlier documentation.

[source,bash]
----
curl -X POST \
  'http://localhost:8983/solr/geonames/tag?overlaps=NO_SUB&tagsLimit=5000&fl=id,name,countrycode&wt=json&indent=on' \
  -H 'Content-Type:text/plain' -d 'Hello New York City'
----

The response should be this (the QTime may vary):

[source,json]
----
{
  "responseHeader":{
    "status":0,
...
        "name":["New York City"],
        "countrycode":["US"]}]
}}
----

[[tagger-tips]]
== Tagger Tips

Performance Tips: