mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-02-23 05:15:04 +00:00
docs: document work around for the percolator if query time text analysis is expensive.
This commit is contained in:
parent
7c3735bdc4
commit
ec7ac32772
@ -224,6 +224,199 @@ now returns matches from the new index:
<1> Percolator query hit is now being presented from the new index.

[float]
==== Optimizing query time text analysis
When the percolator verifies a candidate match, it parses the percolator query, performs query time text analysis
and then runs the query against the document being percolated. This happens for every candidate match, each time the
`percolate` query executes. If query time text analysis is a relatively expensive part of query parsing, it can become
the dominating factor in the time spent percolating. This overhead becomes noticeable when the percolator ends up
verifying many candidate percolator query matches.

To avoid the most expensive part of text analysis at percolate time, one can choose to do it when indexing the
percolator query instead. This requires two different analyzers. The first analyzer performs the text analysis that
actually needs to be performed (the expensive part). The second analyzer (usually `whitespace`) just splits the tokens
that the first analyzer has produced. Before indexing a percolator query, use the analyze API to analyze the query
text with the more expensive analyzer. The resulting tokens then replace the original query text in the percolator
query. It is important that the query is also configured to override the analyzer from the mapping and use just the
second analyzer. Most text based queries support an `analyzer` option (`match`, `query_string`, `simple_query_string`).
With this approach the expensive text analysis is performed once, at index time, instead of on every verification.
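The substitution step described above can be sketched in a few lines of client-side code. This is a minimal illustration, not part of Elasticsearch or any client library: the function names `tokens_to_query_text` and `build_preanalyzed_match` are hypothetical, and the sketch assumes the analyze API response has the `tokens` / `token` / `position` shape and that no token contains whitespace (otherwise the `whitespace` analyzer would split it again).

```python
import json
from typing import Any, Dict


def tokens_to_query_text(analyze_response: Dict[str, Any]) -> str:
    """Join the tokens of an analyze API response into one query string.

    Hypothetical helper: assumes every token carries a "position" and
    that no token contains whitespace.
    """
    tokens = analyze_response["tokens"]
    # sorted() is stable, so tokens sharing a position keep their returned order.
    ordered = sorted(tokens, key=lambda t: t["position"])
    return " ".join(t["token"] for t in ordered)


def build_preanalyzed_match(field: str,
                            analyze_response: Dict[str, Any],
                            analyzer: str = "whitespace") -> Dict[str, Any]:
    """Build a percolator document whose match query uses pre-analyzed text.

    The explicit "analyzer" overrides the analyzer from the mapping, so the
    expensive analysis is not repeated at percolate time.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": tokens_to_query_text(analyze_response),
                    "analyzer": analyzer,
                }
            }
        }
    }


# Example input: a trimmed analyze response for the text "missing bicycles".
sample_response = {
    "tokens": [
        {"token": "miss", "position": 0},
        {"token": "bicycl", "position": 1},
    ]
}
print(json.dumps(build_preanalyzed_match("body", sample_response), indent=2))
```

The helpers only build the request body. Sending it to the index is left to whichever client is in use.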

Let's demonstrate this workflow via a simplified example.

Let's say we want to index the following percolator query:

[source,js]
--------------------------------------------------
{
  "query" : {
    "match" : {
      "body" : {
        "query" : "missing bicycles"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

with these settings and mapping:

[source,js]
--------------------------------------------------
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer" : {
          "tokenizer": "standard",
          "filter" : ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "doc" : {
      "properties": {
        "query" : {
          "type": "percolator"
        },
        "body" : {
          "type": "text",
          "analyzer": "my_analyzer" <1>
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

<1> For the purpose of this example, this analyzer is considered expensive.

First we need to use the analyze API to perform the text analysis prior to indexing:

[source,js]
--------------------------------------------------
POST /test_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "missing bicycles"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

This results in the following response:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "miss",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "bicycl",
      "start_offset": 8,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
--------------------------------------------------
// TESTRESPONSE

All the tokens, in the order they were returned, need to replace the query text in the percolator query:

[source,js]
--------------------------------------------------
PUT /test_index/doc/1?refresh
{
  "query" : {
    "match" : {
      "body" : {
        "query" : "miss bicycl",
        "analyzer" : "whitespace" <1>
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

<1> It is important to select the `whitespace` analyzer here, otherwise the analyzer defined in the mapping will be used,
which defeats the point of this workflow. Note that `whitespace` is a built-in analyzer. If a different analyzer
needs to be used, it must first be configured in the index's settings.

The analyze API call shown earlier should be made for each percolator query before that query is indexed.
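Because the analyze step has to be repeated per query, it is convenient to wrap it in a small loop. The following is a rough sketch under stated assumptions: `prepare_percolator_queries` and `fake_analyze` are hypothetical names, and `fake_analyze` is only a stand-in that lowercases and splits on whitespace so the example runs without a cluster. A real implementation would instead send each text to `POST /test_index/_analyze` with `my_analyzer` and return the parsed response.

```python
from typing import Any, Callable, Dict

# Any callable that maps query text to an analyze-API-shaped response.
AnalyzeFn = Callable[[str], Dict[str, Any]]


def prepare_percolator_queries(field: str,
                               query_texts: Dict[str, str],
                               analyze: AnalyzeFn,
                               analyzer: str = "whitespace") -> Dict[str, Dict[str, Any]]:
    """Return one pre-analyzed percolator document per input query text."""
    prepared = {}
    for doc_id, text in query_texts.items():
        response = analyze(text)  # one analyze call per percolator query
        tokens = sorted(response["tokens"], key=lambda t: t["position"])
        prepared[doc_id] = {
            "query": {
                "match": {
                    field: {
                        "query": " ".join(t["token"] for t in tokens),
                        "analyzer": analyzer,  # overrides the mapping's analyzer
                    }
                }
            }
        }
    return prepared


def fake_analyze(text: str) -> Dict[str, Any]:
    """Stand-in for the analyze API: lowercases and splits, no stemming."""
    return {
        "tokens": [
            {"token": word.lower(), "position": i}
            for i, word in enumerate(text.split())
        ]
    }


docs = prepare_percolator_queries(
    "body",
    {"1": "Missing Bicycles", "2": "Stolen Cars"},
    fake_analyze,
)
for doc_id, body in docs.items():
    print(doc_id, body["query"]["match"]["body"]["query"])
```

Passing the analyze step in as a callable keeps the loop independent of any particular client library and makes it easy to test offline.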

At percolate time nothing changes and the `percolate` query can be defined normally:

[source,js]
--------------------------------------------------
GET /test_index/_search
{
  "query": {
    "percolate" : {
      "field" : "query",
      "document" : {
        "body" : "Bycicles are missing"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

This results in a response like this:

[source,js]
--------------------------------------------------
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "query": {
            "match": {
              "body": {
                "query": "miss bicycl",
                "analyzer": "whitespace"
              }
            }
          }
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 6,/"took": "$body.took",/]

[float]
==== Dedicated Percolator Index