docs: document a workaround for the percolator when query time text analysis is expensive.

Martijn van Groningen 2017-07-13 17:33:59 +02:00
parent 7c3735bdc4
commit ec7ac32772

@@ -224,6 +224,199 @@ now returns matches from the new index:
<1> Percolator query hit is now being presented from the new index.
[float]
==== Optimizing query time text analysis
When the percolator verifies a candidate match it parses the percolator query, performs query time text analysis and actually runs
the query against the document being percolated. This is done for each candidate match, every time the `percolate` query executes.
If query time text analysis is a relatively expensive part of query parsing, then text analysis can become the
dominating factor in the time spent percolating. This query parsing overhead can become noticeable when the
percolator ends up verifying many candidate percolator query matches.
To avoid the most expensive part of text analysis at percolate time, the expensive part can instead be done
when indexing the percolator query. This requires using two different analyzers. The first analyzer performs the
text analysis that actually needs to happen (the expensive part). The second analyzer (usually `whitespace`) just splits
the tokens that the first analyzer has produced. Then, before indexing a percolator query, the analyze API should be used to analyze the query
text with the more expensive analyzer. The result of the analyze API, the tokens, should replace the original query
text in the percolator query. It is important that the query is now configured to override the analyzer from the mapping and
use just the second analyzer. Most text based queries support an `analyzer` option (`match`, `query_string`, `simple_query_string`).
With this approach the expensive text analysis is performed once, at index time, instead of many times at percolate time.
Let's demonstrate this workflow with a simplified example.
Suppose we want to index the following percolator query:
[source,js]
--------------------------------------------------
{
"query" : {
"match" : {
"body" : {
"query" : "missing bicycles"
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
with these settings and mapping:
[source,js]
--------------------------------------------------
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer" : {
"tokenizer": "standard",
"filter" : ["lowercase", "porter_stem"]
}
}
}
},
"mappings": {
"doc" : {
"properties": {
"query" : {
"type": "percolator"
},
"body" : {
"type": "text",
"analyzer": "my_analyzer" <1>
}
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
<1> For the purpose of this example, this analyzer is considered expensive.
First we need to use the analyze API to perform the text analysis prior to indexing:
[source,js]
--------------------------------------------------
POST /test_index/_analyze
{
"analyzer" : "my_analyzer",
"text" : "missing bicycles"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
This results in the following response:
[source,js]
--------------------------------------------------
{
"tokens": [
{
"token": "miss",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "bicycl",
"start_offset": 8,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 1
}
]
}
--------------------------------------------------
// TESTRESPONSE
All the tokens, in the order returned, need to replace the query text in the percolator query:
[source,js]
--------------------------------------------------
PUT /test_index/doc/1?refresh
{
"query" : {
"match" : {
"body" : {
"query" : "miss bicycl",
"analyzer" : "whitespace" <1>
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
<1> It is important to select a whitespace analyzer here, otherwise the analyzer defined in the mapping will be used,
which defeats the point of using this workflow. Note that `whitespace` is a built-in analyzer; if a different analyzer
needs to be used, it needs to be configured first in the index's settings.
This analyze-then-index flow needs to be repeated for each percolator query; a small client-side helper can automate it,
as sketched below.
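For illustration only, here is a minimal sketch of such a helper, assuming the official Python client (`elasticsearch-py`)
and the `test_index` setup from this example; the `index_percolator_query` helper name is hypothetical and not part of any API.
[source,python]
--------------------------------------------------
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical helper: run the expensive analysis once, at index time,
# and index the percolator query with the pre-analyzed tokens.
def index_percolator_query(index, doc_id, field, query_text):
    # Analyze the query text with the expensive analyzer.
    analysis = es.indices.analyze(index=index, body={
        "analyzer": "my_analyzer",
        "text": query_text
    })
    # Join the tokens in the returned order; the `whitespace` analyzer
    # configured on the query will simply split them again.
    pre_analyzed = " ".join(token["token"] for token in analysis["tokens"])
    es.index(index=index, doc_type="doc", id=doc_id, refresh=True, body={
        "query": {
            "match": {
                field: {
                    "query": pre_analyzed,
                    "analyzer": "whitespace"
                }
            }
        }
    })

index_percolator_query("test_index", "1", "body", "missing bicycles")
--------------------------------------------------
Because only the output of `my_analyzer` is stored, the expensive analysis chain runs once per registered query instead of
on every candidate match.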
At percolate time nothing changes and the `percolate` query can be defined normally:
[source,js]
--------------------------------------------------
GET /test_index/_search
{
"query": {
"percolate" : {
"field" : "query",
"document" : {
"body" : "Bycicles are missing"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
This results in a response like this:
[source,js]
--------------------------------------------------
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"query": {
"match": {
"body": {
"query": "miss bicycl",
"analyzer": "whitespace"
}
}
}
}
}
]
}
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 6,/"took": "$body.took",/]
[float]
==== Dedicated Percolator Index