Added initial documentation for the redesigned percolator.
This commit is contained in:
parent
18e12ef66c
commit
1d0841e2b8
|
@ -1,135 +1,442 @@
|
|||
[[search-percolate]]
|
||||
== Percolate API
|
||||
== Percolator
|
||||
|
||||
The percolator allows to register queries against an index, and then
|
||||
send `percolate` requests which include a doc, and getting back the
|
||||
queries that match on that doc out of the set of registered queries.
|
||||
Note: This page described the redesigned percolator that is part of the 1.0 release.
|
||||
|
||||
Think of it as the reverse operation of indexing and then searching.
|
||||
Instead of sending docs, indexing them, and then running queries. One
|
||||
sends queries, registers them, and then sends docs and finds out which
|
||||
queries match that doc.
|
||||
Tradionally you design documents based on your data and store them into an index and then define queries via the search api
|
||||
in order to retrieve these documents. The percolator works in the opposite direction, first you store queries into an
|
||||
index and then via the percolate api you define documents in order to retrieve these queries.
|
||||
|
||||
As an example, a user can register an interest (a query) on all tweets
|
||||
that contain the word "elasticsearch". For every tweet, one can
|
||||
percolate the tweet against all registered user queries, and find out
|
||||
which ones matched.
|
||||
The reason that queries can be stored comes from the fact that in Elasticsearch both documents and queries are defined in
|
||||
JSON. This allows you to embed queries into documents via the index api. Elasticsearch can extract the query from a
|
||||
document and make it available to the percolate api. Since documents are also defines as json, you can define a document
|
||||
in a request to the percolate api.
|
||||
|
||||
Here is a quick sample, first, lets create a `test` index:
|
||||
The percolator and most of its features work in realtime, so once a percolate query is indexed it can immediately be used
|
||||
in the percolate api.
|
||||
|
||||
[float]
|
||||
=== Sample usage
|
||||
|
||||
Adding a query to the percolator:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XPUT localhost:9200/test
|
||||
--------------------------------------------------
|
||||
|
||||
Next, we will register a percolator query with a specific name called
|
||||
`kuku` against the `test` index:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XPUT localhost:9200/_percolator/test/kuku -d '{
|
||||
curl -XPUT 'localhost:9200/my-index/_percolator/1' -d '{
|
||||
"query" : {
|
||||
"term" : {
|
||||
"field1" : "value1"
|
||||
"match" : {
|
||||
"message" : "bonsai tree"
|
||||
}
|
||||
}
|
||||
}'
|
||||
--------------------------------------------------
|
||||
|
||||
And now, we can percolate a document and see which queries match on it
|
||||
(note, its not really indexed!):
|
||||
Matching documents to the added queries:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XGET localhost:9200/test/type1/_percolate -d '{
|
||||
curl -XGET 'localhost:9200/my-index/message/_percolate' -d '{
|
||||
"doc" : {
|
||||
"field1" : "value1"
|
||||
"message" : "A new bonsai tree in the office"
|
||||
}
|
||||
}'
|
||||
--------------------------------------------------
|
||||
|
||||
And the matches are part of the response:
|
||||
The above request will yield the following response:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{"ok":true, "matches":["kuku"]}
|
||||
{
|
||||
"took" : 19,
|
||||
"_shards" : {
|
||||
"total" : 5,
|
||||
"successful" : 5,
|
||||
"failed" : 0
|
||||
},
|
||||
"count" : 1,
|
||||
"matches" : [
|
||||
{
|
||||
"_index" : "my-index",
|
||||
"_id" : "1"
|
||||
}
|
||||
]
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
You can unregister the previous percolator query with the same API you
|
||||
use to delete any document in an index:
|
||||
The percolate api returns matches that refer to percolate queries that have matched with the document defined in the percolate api.
|
||||
|
||||
[float]
|
||||
=== Indexing percolator queries
|
||||
|
||||
Percolate queries are stored as documents in a specific format and in an aribitrary index under a reserved type with the
|
||||
name `_percolator`. The query itself is placed as is in a json objects under the top level field `query`.
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XDELETE localhost:9200/_percolator/test/kuku
|
||||
{
|
||||
"query" : {
|
||||
"match" : {
|
||||
"field" : "value"
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
Since this is just an ordinary document, any field can be added to this document. This can be useful later on to only
|
||||
percolate documents by specific queries.
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
"query" : {
|
||||
"match" : {
|
||||
"field" : "value"
|
||||
}
|
||||
},
|
||||
"priority" : "high"
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
On top of this also a mapping type can be associated with the this query. This allows to control how certain queries
|
||||
like range queries, shape filters and other query & filters that rely on mapping settings get constructed. This is
|
||||
important since the percolate queries are indexed into the `_percolator` type, and the queries / filters that rely on
|
||||
mapping settings would yield unexpected behaviour. Note by default field names do get resolved in a smart manner,
|
||||
but in certain cases with multiple types this can lead to unexpected behaviour, so being explicit about it will help.
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
"query" : {
|
||||
"range" : {
|
||||
"created_at" : {
|
||||
"gte" : "2010-01-01T00:00:00",
|
||||
"lte" : "2011-01-01T00:00:00"
|
||||
}
|
||||
}
|
||||
},
|
||||
"type" : "tweet",
|
||||
"priority" : "high"
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
In the above example the range query gets really parsed into a Lucene numeric range query, based on the settings for
|
||||
the field `created_at` in the type `tweet`.
|
||||
|
||||
Just as with any other type, the `_percolator` type has a mapping, which you can configure via the mappings apis.
|
||||
The default percolate mapping doesn't index the query field and only stores it.
|
||||
|
||||
Because `_percolate` is a type it also has a mapping. By default the following mapping is active:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
"_percolator" : {
|
||||
"properties" : {
|
||||
"query" : {
|
||||
"type" : "object",
|
||||
"enabled" : false
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
If needed this mapping can be modified wit the update mapping api.
|
||||
|
||||
In order to un-register a percolate query the delete api can be used. So if the previous added query needs to be deleted
|
||||
the following delete requests needs to be executed:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XDELETE localhost:9200/my-index/_percolator/1
|
||||
--------------------------------------------------
|
||||
|
||||
[float]
|
||||
=== Percolate api
|
||||
|
||||
The percolate api executes in a distributed manner, meaning it executes on all shards an index points to.
|
||||
|
||||
.Required options
|
||||
* `index` - The index that contains the `_percolator` type. This can also be an alias.
|
||||
* `type` - The type of the document to be percolated. The mapping of that type is used to parse document.
|
||||
* `doc` - The actual document to percolate. Unlike the other two options this needs to be specified in the request body. Note this isn't required when percolating an existing document.
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
|
||||
"doc" : {
|
||||
"created_at" : "2010-10-10T00:00:00",
|
||||
"message" : "some text"
|
||||
}
|
||||
}'
|
||||
--------------------------------------------------
|
||||
|
||||
.Additional supported query string options
|
||||
* `routing` - In the case the percolate queries are partitioned by a custom routing value, that routing option make sure
|
||||
that the percolate request only gets executed on the shard where the routing value is partioned to. This means that
|
||||
the percolate request only gets executed on one shard instead of all shards. Multiple values can be specified as a
|
||||
comma seperated string, in that case the request can be be executed on more than one shard.
|
||||
* `preference` - Controls which shard replicas are preferred to execute the request on. Works the same as in the search api.
|
||||
* `ignore_indices` - Controls if missing indices should silently be ignored. Same as is in the search api.
|
||||
* `percolate_format` - If `ids` is specified then the matches array in the percolate response will contain a string
|
||||
array of the matching ids instead of an array of objects. This can be useful the reduce the amount of data being send
|
||||
back to the client. Obviously if there are to percolator queries with same id from different indices there is no way
|
||||
the find out which percolator query belongs to what index. Any other value to `percolate_format` will be ignored.
|
||||
|
||||
.Additional request body options
|
||||
* `filter` - Reduces the number queries to execute during percolating. Only the percolator queries that match with the
|
||||
filter will be included in the percolate execution. The filter option works in near realtime, so a refresh needs to have
|
||||
occurred for the filter to included the latest percolate queries.
|
||||
* `query` - Same as the `filter` option, but also the score is computed. The computed scores can then be used by the
|
||||
`score` and `sort` option.
|
||||
* `size` - Defines to maximum number of matches (percolate queries) to be returned. Defaults to unlimited.
|
||||
* `score` - Whether the `_score` is included for each match. The is based on the query and represents how the query matched
|
||||
to the percolate query's metadata and *not* how the document being percolated matched to the query. The `query` option
|
||||
is required for this option. Defaults to `false`.
|
||||
* `sort` - Whether the matches should be sorted by the `_score`. Similar to the `score` option, but also sorts the
|
||||
matches. The `size` and `query` option are required for this option. Defaults to `false`.
|
||||
* `facets` - Allows facet definitions to be included. The facets are based on the matching percolator queries. See facet
|
||||
documentation how to define facets.
|
||||
* `highlight` - Allows highlight definitions to be included. The document being percolated is being highlight for each
|
||||
matching query. This allows you to see how each match is highlighting the document being percolated. See highlight
|
||||
documentation on how to define highlights.
|
||||
|
||||
[float]
|
||||
=== Dedicated percolator index
|
||||
|
||||
Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in,
|
||||
these queries can also be added to an dedicated index. The advantage of this is that this dedicated percolator index
|
||||
can have its own index settings (For example the number of primary and replicas shards). If you choose to have a dedicated
|
||||
percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index.
|
||||
Otherwise percolate queries can be parsed incorrectly.
|
||||
|
||||
[float]
|
||||
=== Filtering Executed Queries
|
||||
|
||||
Since the registered percolator queries are just docs in an index, one
|
||||
can filter the queries that will be used to percolate a doc. For
|
||||
example, we can add a `color` field to the registered query:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XPUT localhost:9200/_percolator/test/kuku -d '{
|
||||
"color" : "blue",
|
||||
"query" : {
|
||||
"term" : {
|
||||
"field1" : "value1"
|
||||
}
|
||||
}
|
||||
}'
|
||||
--------------------------------------------------
|
||||
|
||||
And then, we can percolate a doc that only matches on blue colors:
|
||||
Filtering allows to reduce the number of queries, any filter that the search api supports, (expect the ones mentioned in important notes)
|
||||
can also be used in the percolate api. The filter only works on the metadata fields. The `query` field isn't indexed by
|
||||
default. Based on the query we indexed before the following filter can be defined:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XGET localhost:9200/test/type1/_percolate -d '{
|
||||
"doc" : {
|
||||
"field1" : "value1"
|
||||
"field" : "value"
|
||||
},
|
||||
"query" : {
|
||||
"filter" : {
|
||||
"term" : {
|
||||
"color" : "blue"
|
||||
"priority" : "high"
|
||||
}
|
||||
}
|
||||
}'
|
||||
--------------------------------------------------
|
||||
|
||||
[float]
|
||||
=== How it Works
|
||||
=== Percolator count api
|
||||
|
||||
The `_percolator` which holds the repository of registered queries is
|
||||
just a another index. The query is registered under a concrete index
|
||||
that exists (or will exist). That index name is represented as the type
|
||||
in the `_percolator` index (a bit confusing, I know...).
|
||||
The count percolate api, only keeps track of the number of matches and doesn't keep track of the actual matches
|
||||
Example:
|
||||
|
||||
The fact that the queries are stored as docs in another index
|
||||
(`_percolator`) gives us both the persistency nature of it, and the
|
||||
ability to filter out queries to execute using another query.
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XGET 'localhost:9200/my-index/my-type/_percolate/count' -d '{
|
||||
"doc" : {
|
||||
"message" : "some message"
|
||||
}
|
||||
}'
|
||||
--------------------------------------------------
|
||||
|
||||
The `_percolator` index uses the `index.auto_expand_replica` setting to
|
||||
make sure that each data node will have access locally to the registered
|
||||
queries, allowing for fast query executing to filter out queries to run
|
||||
against a percolated doc.
|
||||
Response:
|
||||
|
||||
The percolate API uses the whole number of shards as percolating
|
||||
processing "engines", both primaries and replicas. In our above case, if
|
||||
the `test` index has 2 shards with 1 replica, 4 shards will round-robin
|
||||
in handling percolate requests. Increasing (dynamically) the number of
|
||||
replicas will increase the number of percolating processing "engines"
|
||||
and thus the percolation power.
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
... // header
|
||||
"total" : 3
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
Note, percolate requests will prefer to be executed locally, and will
|
||||
not try and round-robin across shards if a shard exists locally on a
|
||||
node that received a request (for example, from HTTP). It's important to
|
||||
do some round-robin in the client code among nodes (in any case its
|
||||
recommended). If this behavior is not desired, the `prefer_local`
|
||||
parameter can be set to `false` to disable it.
|
||||
|
||||
Because the percolator API is processing one document at a time, it
|
||||
doesn't support queries and filters that run against child and nested
|
||||
documents such as `has_child`, `has_parent`, `top_children`, and
|
||||
`nested`.
|
||||
[float]
|
||||
=== Percolating an existing document
|
||||
|
||||
In order to percolate in newly indexed document, the percolate existing document can be used. Based on the response
|
||||
from an index request the `_id` and other meta information can be used to the immediately percolate the newly added
|
||||
document.
|
||||
|
||||
.Supported options for percolating an existing document on top of existing percolator options:
|
||||
* `id` - The id of the document to retrieve the source for.
|
||||
* `percolate_index` - The index containing the percolate queries. Defaults to the `index` defined in the url.
|
||||
* `percolate_type` - The percolate type (used for parsing the document). Default to `type` defined in the url.
|
||||
* `routing` - The routing value to use when retrieving the document to percolate.
|
||||
* `preference` - Which shard to prefer when retrieving the existing document.
|
||||
* `percolate_routing` - The routing value to use when percolating the existing document.
|
||||
* `percolate_preference` - Which shard to prefer when executing the percolate request.
|
||||
* `version` - Enables a version check. If the fetched document's version isn't equal to the specified version then the request fails with a version conflict and the percolation request is aborted.
|
||||
* `version_type` - Whether internal or external versioning is used. Defaults to internal versioning.
|
||||
|
||||
Internally the percolate api will issue a get request for fetching the`_source` of the document to percolate.
|
||||
For this feature to work the `_source` for documents to be percolated need to be stored.
|
||||
|
||||
[float]
|
||||
==== Example
|
||||
|
||||
Index response:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
"ok" : true,
|
||||
"_index" : "my-index",
|
||||
"_type" : "message",
|
||||
"_id" : "1",
|
||||
"_version" : 1
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
Percolating an existing document:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XGET 'localhost:9200/my-index1/message/1/_percolate'
|
||||
--------------------------------------------------
|
||||
|
||||
The response is the same as with the regular percolate api.
|
||||
|
||||
[float]
|
||||
=== Multi percolate api
|
||||
|
||||
The multi percolate api allows to bundle multiple percolate requests into a single request, similar to what the multi
|
||||
search api does to search requests. The request body format is line based. Each percolate request item takes two lines,
|
||||
the first line is the header and the second line is the body.
|
||||
|
||||
The header can contain any parameter that normally would be set via the request path or query string parameters. T
|
||||
here are several percolate actions, because there are multiple types of percolate requests.
|
||||
|
||||
.Supported actions:
|
||||
* `percolate` - Action for defining a regular percolate request.
|
||||
* `count` - Action for defining a count percolate request.
|
||||
|
||||
Depending on the percolate action different parameters can be specified. For example the percolate and percolate existing
|
||||
document actions support different parameters.
|
||||
|
||||
.The following endpoints are supported
|
||||
* POST /[index]/[type]/_mpercolate
|
||||
* POST /[index]/_mpercolate
|
||||
* POST /_mpercolate
|
||||
|
||||
The `index` and `type` defined in the url path are the default index and type.
|
||||
|
||||
[float]
|
||||
==== Example
|
||||
|
||||
Request:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
curl -XGET 'localhost:9200/twitter/tweet/_mpercolate' --data-binary @requests.txt; echo
|
||||
--------------------------------------------------
|
||||
|
||||
The index twitter is the default index and the type tweet is the default type and will be used in the case a header
|
||||
doesn't specify an index or type.
|
||||
|
||||
requests.txt:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{"percolate" : {"index" : twitter", "type" : "tweet"}}
|
||||
{"doc" : {"message" : "some text"}}
|
||||
{"percolate" : "index" : twitter", "type" : "tweet", "id" : "1"}
|
||||
{}
|
||||
{"percolate" : "index" : users", "type" : "user", "id" : "3", "percolate_index" : "users_2012" }
|
||||
{"size" : 10}
|
||||
{"count" : {"index" : twitter", "type" : "tweet"}}
|
||||
{"doc" : {"message" : "some other text"}}
|
||||
{"count" : "index" : twitter", "type" : "tweet", "id" : "1"}
|
||||
{}
|
||||
--------------------------------------------------
|
||||
|
||||
For a percolate existing document item (headers with the `id` field), the response can be an empty json object.
|
||||
All the required options are set in the header.
|
||||
|
||||
Response:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
"items" : [
|
||||
{
|
||||
"took" : 24,
|
||||
"_shards" : {
|
||||
"total" : 5,
|
||||
"successful" : 5,
|
||||
"failed" : 0,
|
||||
},
|
||||
"total" : 3,
|
||||
"matches" : ["1", "2", "3"]
|
||||
},
|
||||
{
|
||||
"took" : 12,
|
||||
"_shards" : {
|
||||
"total" : 5,
|
||||
"successful" : 5,
|
||||
"failed" : 0,
|
||||
},
|
||||
"total" : 3,
|
||||
"matches" : ["4", "5", "6"]
|
||||
},
|
||||
{
|
||||
"error" : "[user][3]document missing"
|
||||
},
|
||||
{
|
||||
"took" : 12,
|
||||
"_shards" : {
|
||||
"total" : 5,
|
||||
"successful" : 5,
|
||||
"failed" : 0,
|
||||
},
|
||||
"total" : 3
|
||||
},
|
||||
{
|
||||
"took" : 14,
|
||||
"_shards" : {
|
||||
"total" : 5,
|
||||
"successful" : 5,
|
||||
"failed" : 0,
|
||||
},
|
||||
"total" : 3
|
||||
}
|
||||
]
|
||||
}
|
||||
--------------------------------------------------
|
||||
|
||||
Each item represents a percolate response, the order of the items maps to the order in where the percolate requests
|
||||
were specified. In case a percolate request failed, the item response is substituted with an error message.
|
||||
|
||||
[float]
|
||||
=== How it works under the hood
|
||||
|
||||
When indexing a document that contains a query in an index and the `_percolator` type the query part of the documents gets
|
||||
parsed into a Lucene query and is kept in memory until that percolator document is removed or the index containing the
|
||||
`_percolator` type get removed. So all the active percolator queries are kept in memory.
|
||||
|
||||
At percolate time the document specified in the request gets parsed into a Lucene document and is stored in a in-memory
|
||||
Lucene index. This in-memory index can just hold this one document and it is optimized for that. Then all the queries
|
||||
that are registered to the index that the percolate request is targeted for are going to be executed on this single document
|
||||
in-memory index. This happens on each shard the percolate request need to execute.
|
||||
|
||||
By using `routing`, `filter` or `query` features the amount of queries that need to be executed can be reduced and thus
|
||||
the time the percolate api needs to run can be decreased.
|
||||
|
||||
[float]
|
||||
=== Important notes
|
||||
|
||||
Because the percolator API is processing one document at a time, it doesn't support queries and filters that run
|
||||
against child and nested documents such as `has_child`, `has_parent`, `top_children`, and `nested`.
|
||||
|
||||
The `wildcard` and `regexp` query natively use a lot of memory and because the percolator keeps the queries into memory
|
||||
this can easily take up the available memory in the heap space. If possible try to use a `prefix` query or ngramming to
|
||||
achieve the same result (with way less memory being used).
|
||||
|
||||
The delete-by-query api doesn't work to unregister a query, it only deletes the percolate documents from disk. In order
|
||||
to update the registered queries in memory the index needs be closed and opened.
|
Loading…
Reference in New Issue