docs: document a workaround for the percolator when query time text analysis is expensive.

Martijn van Groningen 2017-07-13 17:33:59 +02:00
parent 7c3735bdc4
commit ec7ac32772

@@ -224,6 +224,199 @@ now returns matches from the new index:
<1> Percolator query hit is now being presented from the new index.
[float]
==== Optimizing query time text analysis
When the percolator verifies a candidate match it parses the percolator query, performs query time text analysis and actually runs
the query against the document being percolated. This is done for each candidate match, every time the `percolate` query executes.
If query time text analysis is a relatively expensive part of query parsing, then text analysis can become the
dominating factor in the time spent percolating. This query parsing overhead can become noticeable when the
percolator ends up verifying many candidate percolator query matches.
To avoid the most expensive part of text analysis at percolate time, the expensive part can instead be done
when indexing the percolator query. This requires using two different analyzers. The first analyzer performs the
text analysis that actually needs to happen (the expensive part). The second analyzer (usually `whitespace`) just splits
the tokens that the first analyzer has produced. Then, before indexing a percolator query, the analyze API should be used to analyze the query
text with the more expensive analyzer. The result of the analyze API, the tokens, should replace the original query
text in the percolator query. It is important that the query is now configured to override the analyzer from the mapping and
use just the second analyzer. Most text based queries support an `analyzer` option (`match`, `query_string`, `simple_query_string`).
With this approach the expensive text analysis is performed once, at index time, instead of many times at percolate time.
Let's demonstrate this workflow with a simplified example.
Suppose we want to index the following percolator query:
[source,js]
--------------------------------------------------
{
"query" : {
"match" : {
"body" : {
"query" : "missing bicycles"
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
with these settings and mapping:
[source,js]
--------------------------------------------------
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer" : {
"tokenizer": "standard",
"filter" : ["lowercase", "porter_stem"]
}
}
}
},
"mappings": {
"doc" : {
"properties": {
"query" : {
"type": "percolator"
},
"body" : {
"type": "text",
"analyzer": "my_analyzer" <1>
}
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
<1> For the purpose of this example, this analyzer is considered expensive.
First we need to use the analyze API to perform the text analysis prior to indexing:
[source,js]
--------------------------------------------------
POST /test_index/_analyze
{
"analyzer" : "my_analyzer",
"text" : "missing bicycles"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
This results in the following response:
[source,js]
--------------------------------------------------
{
"tokens": [
{
"token": "miss",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "bicycl",
"start_offset": 8,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 1
}
]
}
--------------------------------------------------
// TESTRESPONSE
All the tokens, in the order returned, need to replace the query text in the percolator query:
[source,js]
--------------------------------------------------
PUT /test_index/doc/1?refresh
{
"query" : {
"match" : {
"body" : {
"query" : "miss bicycl",
"analyzer" : "whitespace" <1>
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
<1> It is important to select a whitespace analyzer here, otherwise the analyzer defined in the mapping will be used,
which defeats the point of using this workflow. Note that `whitespace` is a built-in analyzer; if a different analyzer
needs to be used, it needs to be configured first in the index's settings.
This analyze-then-index flow needs to be repeated for each percolator query; a small client-side helper can automate it,
as sketched below.
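For illustration only, here is a minimal sketch of such a helper, assuming the official Python client (`elasticsearch-py`)
and the `test_index` setup from this example; the `index_percolator_query` helper name is hypothetical and not part of any API.
[source,python]
--------------------------------------------------
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical helper: run the expensive analysis once, at index time,
# and index the percolator query with the pre-analyzed tokens.
def index_percolator_query(index, doc_id, field, query_text):
    # Analyze the query text with the expensive analyzer.
    analysis = es.indices.analyze(index=index, body={
        "analyzer": "my_analyzer",
        "text": query_text
    })
    # Join the tokens in the returned order; the `whitespace` analyzer
    # configured on the query will simply split them again.
    pre_analyzed = " ".join(token["token"] for token in analysis["tokens"])
    es.index(index=index, doc_type="doc", id=doc_id, refresh=True, body={
        "query": {
            "match": {
                field: {
                    "query": pre_analyzed,
                    "analyzer": "whitespace"
                }
            }
        }
    })

index_percolator_query("test_index", "1", "body", "missing bicycles")
--------------------------------------------------
Because only the output of `my_analyzer` is stored, the expensive analysis chain runs once per registered query instead of
on every candidate match.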
At percolate time nothing changes and the `percolate` query can be defined normally:
[source,js]
--------------------------------------------------
GET /test_index/_search
{
"query": {
"percolate" : {
"field" : "query",
"document" : {
"body" : "Bycicles are missing"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
This results in a response like this:
[source,js]
--------------------------------------------------
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"query": {
"match": {
"body": {
"query": "miss bicycl",
"analyzer": "whitespace"
}
}
}
}
}
]
}
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 6,/"took": "$body.took",/]
[float]
==== Dedicated Percolator Index