Added initial documentation for the redesigned percolator.

2013-10-16 14:12:19 +02:00 · 2013-10-16 14:12:19 +02:00 · 1d0841e2b8
parent 18e12ef66c
commit 1d0841e2b8
1 changed files with 388 additions and 81 deletions
--- a/docs/reference/search/percolate.asciidoc
+++ b/docs/reference/search/percolate.asciidoc
@@ -1,135 +1,442 @@
 [[search-percolate]]
-== Percolate API
+== Percolator

-The percolator allows to register queries against an index, and then
-send `percolate` requests which include a doc, and getting back the
-queries that match on that doc out of the set of registered queries.
+Note: This page describes the redesigned percolator that is part of the 1.0 release.

-Think of it as the reverse operation of indexing and then searching.
-Instead of sending docs, indexing them, and then running queries. One
-sends queries, registers them, and then sends docs and finds out which
-queries match that doc.
+Traditionally you design documents based on your data, store them in an index, and then define queries via the search api
+in order to retrieve these documents. The percolator works in the opposite direction: first you store queries in an
+index, and then via the percolate api you define documents in order to retrieve these queries.

-As an example, a user can register an interest (a query) on all tweets
-that contain the word "elasticsearch". For every tweet, one can
-percolate the tweet against all registered user queries, and find out
-which ones matched.
+The reason that queries can be stored comes from the fact that in Elasticsearch both documents and queries are defined in
+JSON. This allows you to embed queries into documents via the index api. Elasticsearch can extract the query from a
+document and make it available to the percolate api. Since documents are also defined as JSON, you can define a document
+in a request to the percolate api.

-Here is a quick sample, first, lets create a `test` index:
+The percolator and most of its features work in realtime, so once a percolate query is indexed it can immediately be used
+in the percolate api.
+
+[float]
+=== Sample usage
+
+Adding a query to the percolator:

 [source,js]
 --------------------------------------------------
-curl -XPUT localhost:9200/test
--------------------------------------------------
-
-Next, we will register a percolator query with a specific name called
-`kuku` against the `test` index:
-
-[source,js]
--------------------------------------------------
-curl -XPUT localhost:9200/_percolator/test/kuku -d '{
+curl -XPUT 'localhost:9200/my-index/_percolator/1' -d '{
    "query" : {
-        "term" : {
-            "field1" : "value1"
+        "match" : {
+            "message" : "bonsai tree"
        }
    }
 }'
 --------------------------------------------------

-And now, we can percolate a document and see which queries match on it
-(note, its not really indexed!):
+Matching documents to the added queries:

 [source,js]
 --------------------------------------------------
-curl -XGET localhost:9200/test/type1/_percolate -d '{
+curl -XGET 'localhost:9200/my-index/message/_percolate' -d '{
    "doc" : {
-        "field1" : "value1"
+        "message" : "A new bonsai tree in the office"
    }
 }'
 --------------------------------------------------

-And the matches are part of the response:
+The above request will yield the following response:

 [source,js]
 --------------------------------------------------
-{"ok":true, "matches":["kuku"]}
+{
+    "took" : 19,
+    "_shards" : {
+        "total" : 5,
+        "successful" : 5,
+        "failed" : 0
+    },
+    "count" : 1,
+    "matches" : [
+    	{
+    	     "_index" : "my-index",
+    	     "_id" : "1"
+    	}
+    ]
+}
 --------------------------------------------------

-You can unregister the previous percolator query with the same API you
-use to delete any document in an index:
+Each entry in `matches` refers to a percolate query that matched the document defined in the percolate request.
+
+[float]
+=== Indexing percolator queries
+
+Percolate queries are stored as documents in a specific format and in an arbitrary index under a reserved type with the
+name `_percolator`. The query itself is placed as is in a JSON object under the top level field `query`.

 [source,js]
 --------------------------------------------------
-curl -XDELETE localhost:9200/_percolator/test/kuku
+{
+    "query" : {
+		"match" : {
+			"field" : "value"
+		}
+	}
+}
 --------------------------------------------------

+Since this is just an ordinary document, any field can be added to this document. This can be useful later on to only
+percolate documents by specific queries.
+
+[source,js]
+--------------------------------------------------
+{
+	"query" : {
+		"match" : {
+			"field" : "value"
+		}
+	},
+	"priority" : "high"
+}
+--------------------------------------------------
+
+On top of this, a mapping type can also be associated with the query. This allows you to control how certain queries,
+such as range queries, shape filters and other queries and filters that rely on mapping settings, get constructed. This is
+important because the percolate queries are indexed into the `_percolator` type, so queries and filters that rely on
+mapping settings could otherwise yield unexpected behaviour. Note that by default field names are resolved in a smart manner,
+but in certain cases with multiple types this can lead to unexpected behaviour, so being explicit about the type helps.
+
+[source,js]
+--------------------------------------------------
+{
+	"query" : {
+		"range" : {
+			"created_at" : {
+				"gte" : "2010-01-01T00:00:00",
+				"lte" : "2011-01-01T00:00:00"
+			}
+		}
+	},
+	"type" : "tweet",
+	"priority" : "high"
+}
+--------------------------------------------------
+
+In the above example the range query actually gets parsed into a Lucene numeric range query, based on the settings for
+the field `created_at` in the type `tweet`.
+
+Just as with any other type, the `_percolator` type has a mapping, which you can configure via the mapping apis.
+The default percolate mapping doesn't index the `query` field and only stores it.
+
+By default the following mapping is active:
+
+[source,js]
+--------------------------------------------------
+{
+	"_percolator" : {
+		"properties" : {
+			"query" : {
+				"type" : "object",
+				"enabled" : false
+			}
+		}
+	}
+}
+--------------------------------------------------
+
+If needed this mapping can be modified with the update mapping api.
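+
+As an illustration only, a modified mapping could keep the default `query` settings and add an explicit definition for
+a metadata field. The `priority` field below comes from the earlier example; mapping it as a non-analyzed string is an
+assumption made for this sketch, not a recommended default:
+
+[source,js]
+--------------------------------------------------
+{
+    "_percolator" : {
+        "properties" : {
+            "query" : {
+                "type" : "object",
+                "enabled" : false
+            },
+            "priority" : {
+                "type" : "string",
+                "index" : "not_analyzed"
+            }
+        }
+    }
+}
+--------------------------------------------------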
+
+In order to un-register a percolate query the delete api can be used. So if the previously added query needs to be deleted,
+the following delete request needs to be executed:
+
+[source,js]
+--------------------------------------------------
+curl -XDELETE localhost:9200/my-index/_percolator/1
+--------------------------------------------------
+
+[float]
+=== Percolate api
+
+The percolate api executes in a distributed manner, meaning it executes on all shards an index points to.
+
+.Required options
+* `index` - The index that contains the `_percolator` type. This can also be an alias.
+* `type` - The type of the document to be percolated. The mapping of that type is used to parse the document.
+* `doc` - The actual document to percolate. Unlike the other two options this needs to be specified in the request body. Note this isn't required when percolating an existing document.
+
+[source,js]
+--------------------------------------------------
+curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
+	"doc" : {
+		"created_at" : "2010-10-10T00:00:00",
+		"message" : "some text"
+	}
+}'
+--------------------------------------------------
+
+.Additional supported query string options
+* `routing` - In case the percolate queries are partitioned by a custom routing value, the routing option makes sure
+that the percolate request only gets executed on the shard that the routing value is partitioned to. This means that
+the percolate request only gets executed on one shard instead of all shards. Multiple values can be specified as a
+comma separated string, in which case the request can be executed on more than one shard.
+* `preference` - Controls which shard replicas are preferred to execute the request on. Works the same as in the search api.
+* `ignore_indices` - Controls if missing indices should silently be ignored. Works the same as in the search api.
+* `percolate_format` - If `ids` is specified then the matches array in the percolate response will contain a string
+array of the matching ids instead of an array of objects. This can be useful to reduce the amount of data being sent
+back to the client. Note that if there are two percolator queries with the same id from different indices, there is no way
+to find out which percolator query belongs to which index. Any other value for `percolate_format` will be ignored.
+
+.Additional request body options
+* `filter` - Reduces the number of queries to execute during percolating. Only the percolator queries that match the
+filter will be included in the percolate execution. The filter option works in near realtime, so a refresh needs to have
+occurred for the filter to include the latest percolate queries.
+* `query` - Same as the `filter` option, but also the score is computed. The computed scores can then be used by the
+`score` and `sort` options.
+* `size` - Defines the maximum number of matches (percolate queries) to be returned. Defaults to unlimited.
+* `score` - Whether the `_score` is included for each match. The score is based on the query and represents how the query matched
+the percolate query's metadata and *not* how the document being percolated matched the query. The `query` option
+is required for this option. Defaults to `false`.
+* `sort` - Whether the matches should be sorted by the `_score`. Similar to the `score` option, but also sorts the
+matches. The `size` and `query` options are required for this option. Defaults to `false`.
+* `facets` - Allows facet definitions to be included. The facets are based on the matching percolator queries. See the facet
+documentation on how to define facets.
+* `highlight` - Allows highlight definitions to be included. The document being percolated is highlighted for each
+matching query. This allows you to see how each matching query highlights the document being percolated. See the highlight
+documentation on how to define highlights.
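+
+As a sketch of how these options combine, the following request body percolates a document, restricts and scores the
+candidate queries with `query`, and returns at most ten matches sorted by `_score`. The `priority` field and the field
+values are assumptions carried over from the earlier examples:
+
+[source,js]
+--------------------------------------------------
+{
+    "doc" : {
+        "message" : "some text"
+    },
+    "query" : {
+        "term" : {
+            "priority" : "high"
+        }
+    },
+    "size" : 10,
+    "score" : true,
+    "sort" : true
+}
+--------------------------------------------------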
+
+[float]
+=== Dedicated percolator index
+
+Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in,
+these queries can also be added to a dedicated index. The advantage of this is that this dedicated percolator index
+can have its own index settings (for example the number of primary and replica shards). If you choose to have a dedicated
+percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index.
+Otherwise percolate queries can be parsed incorrectly.
+
 [float]
 === Filtering Executed Queries

-Since the registered percolator queries are just docs in an index, one
-can filter the queries that will be used to percolate a doc. For
-example, we can add a `color` field to the registered query:
-
-[source,js]
--------------------------------------------------
-curl -XPUT localhost:9200/_percolator/test/kuku -d '{
-    "color" : "blue",
-    "query" : {
-        "term" : {
-            "field1" : "value1"
-        }
-    }
-}'
--------------------------------------------------
-
-And then, we can percolate a doc that only matches on blue colors:
+Filtering allows you to reduce the number of queries that get executed. Any filter that the search api supports (except the ones mentioned in important notes)
+can also be used in the percolate api. The filter only works on the metadata fields, since the `query` field isn't indexed by
+default. Based on the query we indexed before, the following filter can be defined:

 [source,js]
 --------------------------------------------------
 curl -XGET localhost:9200/test/type1/_percolate -d '{
    "doc" : {
-        "field1" : "value1"
+        "field" : "value"
    },
-    "query" : {
+    "filter" : {
        "term" : {
-            "color" : "blue"
+            "priority" : "high"
        }
    }
 }'
 --------------------------------------------------

 [float]
-=== How it Works
+=== Percolator count api

-The `_percolator` which holds the repository of registered queries is
-just a another index. The query is registered under a concrete index
-that exists (or will exist). That index name is represented as the type
-in the `_percolator` index (a bit confusing, I know...).
+The count percolate api only keeps track of the number of matches and doesn't keep track of the actual matches.
+Example:

-The fact that the queries are stored as docs in another index
-(`_percolator`) gives us both the persistency nature of it, and the
-ability to filter out queries to execute using another query.
+[source,js]
+--------------------------------------------------
+curl -XGET 'localhost:9200/my-index/my-type/_percolate/count' -d '{
+   "doc" : {
+       "message" : "some message"
+   }
+}'
+--------------------------------------------------

-The `_percolator` index uses the `index.auto_expand_replica` setting to
-make sure that each data node will have access locally to the registered
-queries, allowing for fast query executing to filter out queries to run
-against a percolated doc.
+Response:

-The percolate API uses the whole number of shards as percolating
-processing "engines", both primaries and replicas. In our above case, if
-the `test` index has 2 shards with 1 replica, 4 shards will round-robin
-in handling percolate requests. Increasing (dynamically) the number of
-replicas will increase the number of percolating processing "engines"
-and thus the percolation power.
+[source,js]
+--------------------------------------------------
+{
+   ... // header
+   "total" : 3
+}
+--------------------------------------------------

-Note, percolate requests will prefer to be executed locally, and will
-not try and round-robin across shards if a shard exists locally on a
-node that received a request (for example, from HTTP). It's important to
-do some round-robin in the client code among nodes (in any case its
-recommended). If this behavior is not desired, the `prefer_local`
-parameter can be set to `false` to disable it.

-Because the percolator API is processing one document at a time, it
-doesn't support queries and filters that run against child and nested
-documents such as `has_child`, `has_parent`, `top_children`, and
-`nested`.
+[float]
+=== Percolating an existing document
+
+In order to percolate a newly indexed document, the percolate existing document feature can be used. Based on the response
+from an index request, the `_id` and other meta information can be used to immediately percolate the newly added
+document.
+
+.Supported options for percolating an existing document on top of existing percolator options:
+* `id` - The id of the document to retrieve the source for.
+* `percolate_index` - The index containing the percolate queries. Defaults to the `index` defined in the url.
+* `percolate_type` - The percolate type (used for parsing the document). Defaults to the `type` defined in the url.
+* `routing` - The routing value to use when retrieving the document to percolate.
+* `preference` - Which shard to prefer when retrieving the existing document.
+* `percolate_routing` - The routing value to use when percolating the existing document.
+* `percolate_preference` - Which shard to prefer when executing the percolate request.
+* `version` - Enables a version check. If the fetched document's version isn't equal to the specified version then the request fails with a version conflict and the percolation request is aborted.
+* `version_type` - Whether internal or external versioning is used. Defaults to internal versioning.
+
+Internally the percolate api will issue a get request for fetching the `_source` of the document to percolate.
+For this feature to work the `_source` for documents to be percolated needs to be stored.
+
+[float]
+==== Example
+
+Index response:
+
+[source,js]
+--------------------------------------------------
+{
+	"ok" : true,
+	"_index" : "my-index",
+	"_type" : "message",
+	"_id" : "1",
+	"_version" : 1
+}
+--------------------------------------------------
+
+Percolating an existing document:
+
+[source,js]
+--------------------------------------------------
+curl -XGET 'localhost:9200/my-index/message/1/_percolate'
+--------------------------------------------------
+
+The response is the same as with the regular percolate api.
+
+[float]
+=== Multi percolate api
+
+The multi percolate api allows you to bundle multiple percolate requests into a single request, similar to what the multi
+search api does to search requests. The request body format is line based. Each percolate request item takes two lines,
+the first line is the header and the second line is the body.
+
+The header can contain any parameter that normally would be set via the request path or query string parameters.
+There are several percolate actions, because there are multiple types of percolate requests.
+
+.Supported actions:
+* `percolate` - Action for defining a regular percolate request.
+* `count` - Action for defining a count percolate request.
+
+Depending on the percolate action different parameters can be specified. For example the percolate and percolate existing
+document actions support different parameters.
+
+.The following endpoints are supported
+* POST /[index]/[type]/_mpercolate
+* POST /[index]/_mpercolate
+* POST /_mpercolate
+
+The `index` and `type` defined in the url path are the default index and type.
+
+[float]
+==== Example
+
+Request:
+
+[source,js]
+--------------------------------------------------
+curl -XGET 'localhost:9200/twitter/tweet/_mpercolate' --data-binary @requests.txt; echo
+--------------------------------------------------
+
+The index twitter is the default index and the type tweet is the default type and will be used in the case a header
+doesn't specify an index or type.
+
+requests.txt:
+
+[source,js]
+--------------------------------------------------
+{"percolate" : {"index" : "twitter", "type" : "tweet"}}
+{"doc" : {"message" : "some text"}}
+{"percolate" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
+{}
+{"percolate" : {"index" : "users", "type" : "user", "id" : "3", "percolate_index" : "users_2012"}}
+{"size" : 10}
+{"count" : {"index" : "twitter", "type" : "tweet"}}
+{"doc" : {"message" : "some other text"}}
+{"count" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
+{}
+--------------------------------------------------
+
+For a percolate existing document item (headers with the `id` field), the body can be an empty JSON object.
+All the required options are set in the header.
+
+Response:
+
+[source,js]
+--------------------------------------------------
+{
+    "items" : [
+        {
+            "took" : 24,
+            "_shards" : {
+                "total" : 5,
+                "successful" : 5,
+                "failed" : 0
+            },
+            "total" : 3,
+            "matches" : ["1", "2", "3"]
+        },
+        {
+            "took" : 12,
+            "_shards" : {
+                "total" : 5,
+                "successful" : 5,
+                "failed" : 0
+            },
+            "total" : 3,
+            "matches" : ["4", "5", "6"]
+        },
+        {
+            "error" : "[user][3]document missing"
+        },
+        {
+            "took" : 12,
+            "_shards" : {
+                "total" : 5,
+                "successful" : 5,
+                "failed" : 0
+            },
+            "total" : 3
+        },
+        {
+            "took" : 14,
+            "_shards" : {
+                "total" : 5,
+                "successful" : 5,
+                "failed" : 0
+            },
+            "total" : 3
+        }
+    ]
+}
+--------------------------------------------------
+
+Each item represents a percolate response. The order of the items matches the order in which the percolate requests
+were specified. In case a percolate request failed, the item response is substituted with an error message.
+
+[float]
+=== How it works under the hood
+
+When a document that contains a query is indexed into the `_percolator` type of an index, the query part of the document gets
+parsed into a Lucene query and is kept in memory until that percolator document is removed or the index containing the
+`_percolator` type gets removed. So all the active percolator queries are kept in memory.
+
+At percolate time the document specified in the request gets parsed into a Lucene document and is stored in an in-memory
+Lucene index. This in-memory index can hold just this one document and is optimized for that. Then all the queries
+that are registered to the index that the percolate request targets are executed on this single-document
+in-memory index. This happens on each shard the percolate request needs to execute on.
+
+By using the `routing`, `filter` or `query` features the number of queries that need to be executed can be reduced and thus
+the time the percolate api needs to run can be decreased.
+
+[float]
+=== Important notes
+
+Because the percolator API is processing one document at a time, it doesn't support queries and filters that run
+against child and nested documents such as `has_child`, `has_parent`, `top_children`, and `nested`.
+
+The `wildcard` and `regexp` queries natively use a lot of memory, and because the percolator keeps the queries in memory
+this can easily take up the available memory in the heap space. If possible try to use a `prefix` query or ngramming to
+achieve the same result (with way less memory being used).
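+
+For example, a percolator document that matches on a term prefix rather than a wildcard pattern might look like the
+following sketch (the `message` field and the `bons` prefix are illustrative assumptions):
+
+[source,js]
+--------------------------------------------------
+{
+    "query" : {
+        "prefix" : {
+            "message" : "bons"
+        }
+    }
+}
+--------------------------------------------------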
+
+The delete-by-query api doesn't work to unregister a query; it only deletes the percolate documents from disk. In order
+to update the registered queries in memory the index needs to be closed and reopened.