448 lines
18 KiB
Plaintext
448 lines
18 KiB
Plaintext
[[search-percolate]]
|
|
== Percolator
|
|
|
|
added[1.0.0.Beta1]
|
|
|
|
Traditionally you design documents based on your data and store them into an index and then define queries via the search api
|
|
in order to retrieve these documents. The percolator works in the opposite direction, first you store queries into an
|
|
index and then via the percolate api you define documents in order to retrieve these queries.
|
|
|
|
The reason that queries can be stored comes from the fact that in Elasticsearch both documents and queries are defined in
|
|
JSON. This allows you to embed queries into documents via the index api. Elasticsearch can extract the query from a
|
|
document and make it available to the percolate api. Since documents are also defines as json, you can define a document
|
|
in a request to the percolate api.
|
|
|
|
The percolator and most of its features work in realtime, so once a percolate query is indexed it can immediately be used
|
|
in the percolate api.
|
|
|
|
[float]
|
|
=== Sample usage
|
|
|
|
Adding a query to the percolator:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
|
|
"query" : {
|
|
"match" : {
|
|
"message" : "bonsai tree"
|
|
}
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
Matching documents to the added queries:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'localhost:9200/my-index/message/_percolate' -d '{
|
|
"doc" : {
|
|
"message" : "A new bonsai tree in the office"
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
The above request will yield the following response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"took" : 19,
|
|
"_shards" : {
|
|
"total" : 5,
|
|
"successful" : 5,
|
|
"failed" : 0
|
|
},
|
|
"total" : 1,
|
|
"matches" : [
|
|
{
|
|
"_index" : "my-index",
|
|
"_id" : "1"
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
|
|
The percolate api returns matches that refer to percolate queries that have matched with the document defined in the percolate api.
|
|
|
|
[float]
|
|
=== Indexing percolator queries
|
|
|
|
Percolate queries are stored as documents in a specific format and in an arbitrary index under a reserved type with the
|
|
name `.percolator`. The query itself is placed as is in a json objects under the top level field `query`.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"query" : {
|
|
"match" : {
|
|
"field" : "value"
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Since this is just an ordinary document, any field can be added to this document. This can be useful later on to only
|
|
percolate documents by specific queries.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"query" : {
|
|
"match" : {
|
|
"field" : "value"
|
|
}
|
|
},
|
|
"priority" : "high"
|
|
}
|
|
--------------------------------------------------
|
|
|
|
On top of this also a mapping type can be associated with the this query. This allows to control how certain queries
|
|
like range queries, shape filters and other query & filters that rely on mapping settings get constructed. This is
|
|
important since the percolate queries are indexed into the `.percolator` type, and the queries / filters that rely on
|
|
mapping settings would yield unexpected behaviour. Note by default field names do get resolved in a smart manner,
|
|
but in certain cases with multiple types this can lead to unexpected behaviour, so being explicit about it will help.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"query" : {
|
|
"range" : {
|
|
"created_at" : {
|
|
"gte" : "2010-01-01T00:00:00",
|
|
"lte" : "2011-01-01T00:00:00"
|
|
}
|
|
}
|
|
},
|
|
"type" : "tweet",
|
|
"priority" : "high"
|
|
}
|
|
--------------------------------------------------
|
|
|
|
In the above example the range query gets really parsed into a Lucene numeric range query, based on the settings for
|
|
the field `created_at` in the type `tweet`.
|
|
|
|
Just as with any other type, the `.percolator` type has a mapping, which you can configure via the mappings apis.
|
|
The default percolate mapping doesn't index the query field and only stores it.
|
|
|
|
Because `.percolate` is a type it also has a mapping. By default the following mapping is active:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
".percolator" : {
|
|
"properties" : {
|
|
"query" : {
|
|
"type" : "object",
|
|
"enabled" : false
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
If needed this mapping can be modified wit the update mapping api.
|
|
|
|
In order to un-register a percolate query the delete api can be used. So if the previous added query needs to be deleted
|
|
the following delete requests needs to be executed:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XDELETE localhost:9200/my-index/.percolator/1
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
=== Percolate api
|
|
|
|
The percolate api executes in a distributed manner, meaning it executes on all shards an index points to.
|
|
|
|
.Required options
|
|
* `index` - The index that contains the `.percolator` type. This can also be an alias.
|
|
* `type` - The type of the document to be percolated. The mapping of that type is used to parse document.
|
|
* `doc` - The actual document to percolate. Unlike the other two options this needs to be specified in the request body. Note this isn't required when percolating an existing document.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
|
|
"doc" : {
|
|
"created_at" : "2010-10-10T00:00:00",
|
|
"message" : "some text"
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
.Additional supported query string options
|
|
* `routing` - In the case the percolate queries are partitioned by a custom routing value, that routing option make sure
|
|
that the percolate request only gets executed on the shard where the routing value is partitioned to. This means that
|
|
the percolate request only gets executed on one shard instead of all shards. Multiple values can be specified as a
|
|
comma separated string, in that case the request can be be executed on more than one shard.
|
|
* `preference` - Controls which shard replicas are preferred to execute the request on. Works the same as in the search api.
|
|
* `ignore_unavailable` - Controls if missing concrete indices should silently be ignored. Same as is in the search api.
|
|
* `percolate_format` - If `ids` is specified then the matches array in the percolate response will contain a string
|
|
array of the matching ids instead of an array of objects. This can be useful the reduce the amount of data being send
|
|
back to the client. Obviously if there are to percolator queries with same id from different indices there is no way
|
|
the find out which percolator query belongs to what index. Any other value to `percolate_format` will be ignored.
|
|
|
|
.Additional request body options
|
|
* `filter` - Reduces the number queries to execute during percolating. Only the percolator queries that match with the
|
|
filter will be included in the percolate execution. The filter option works in near realtime, so a refresh needs to have
|
|
occurred for the filter to included the latest percolate queries.
|
|
* `query` - Same as the `filter` option, but also the score is computed. The computed scores can then be used by the
|
|
`track_scores` and `sort` option.
|
|
* `size` - Defines to maximum number of matches (percolate queries) to be returned. Defaults to unlimited.
|
|
* `track_scores` - Whether the `_score` is included for each match. The is based on the query and represents how the query matched
|
|
to the percolate query's metadata and *not* how the document being percolated matched to the query. The `query` option
|
|
is required for this option. Defaults to `false`.
|
|
* `sort` - Define a sort specification like in the search api. Currently only sorting `_score` reverse (default relevancy)
|
|
is supported. Other sort fields will throw an exception. The `size` and `query` option are required for this setting. Like
|
|
`track_score` the score is based on the query and represents how the query matched to the percolate query's metadata
|
|
and *not* how the document being percolated matched to the query.
|
|
* `facets` - Allows facet definitions to be included. The facets are based on the matching percolator queries. See facet
|
|
documentation how to define facets.
|
|
* `aggs` - Allows aggregation definitions to be included. The aggregations are based on the matching percolator queries,
|
|
look at the aggregation documentation on how to define aggregations.
|
|
* `highlight` - Allows highlight definitions to be included. The document being percolated is being highlight for each
|
|
matching query. This allows you to see how each match is highlighting the document being percolated. See highlight
|
|
documentation on how to define highlights. The `size` option is required for highlighting, the performance of highlighting
|
|
in the percolate api depends of how many matches are being highlighted.
|
|
|
|
[float]
|
|
=== Dedicated percolator index
|
|
|
|
Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in,
|
|
these queries can also be added to an dedicated index. The advantage of this is that this dedicated percolator index
|
|
can have its own index settings (For example the number of primary and replicas shards). If you choose to have a dedicated
|
|
percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index.
|
|
Otherwise percolate queries can be parsed incorrectly.
|
|
|
|
[float]
|
|
=== Filtering Executed Queries
|
|
|
|
Filtering allows to reduce the number of queries, any filter that the search api supports, (expect the ones mentioned in important notes)
|
|
can also be used in the percolate api. The filter only works on the metadata fields. The `query` field isn't indexed by
|
|
default. Based on the query we indexed before the following filter can be defined:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET localhost:9200/test/type1/_percolate -d '{
|
|
"doc" : {
|
|
"field" : "value"
|
|
},
|
|
"filter" : {
|
|
"term" : {
|
|
"priority" : "high"
|
|
}
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
=== Percolator count api
|
|
|
|
The count percolate api, only keeps track of the number of matches and doesn't keep track of the actual matches
|
|
Example:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'localhost:9200/my-index/my-type/_percolate/count' -d '{
|
|
"doc" : {
|
|
"message" : "some message"
|
|
}
|
|
}'
|
|
--------------------------------------------------
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
... // header
|
|
"total" : 3
|
|
}
|
|
--------------------------------------------------
|
|
|
|
|
|
[float]
|
|
=== Percolating an existing document
|
|
|
|
In order to percolate in newly indexed document, the percolate existing document can be used. Based on the response
|
|
from an index request the `_id` and other meta information can be used to the immediately percolate the newly added
|
|
document.
|
|
|
|
.Supported options for percolating an existing document on top of existing percolator options:
|
|
* `id` - The id of the document to retrieve the source for.
|
|
* `percolate_index` - The index containing the percolate queries. Defaults to the `index` defined in the url.
|
|
* `percolate_type` - The percolate type (used for parsing the document). Default to `type` defined in the url.
|
|
* `routing` - The routing value to use when retrieving the document to percolate.
|
|
* `preference` - Which shard to prefer when retrieving the existing document.
|
|
* `percolate_routing` - The routing value to use when percolating the existing document.
|
|
* `percolate_preference` - Which shard to prefer when executing the percolate request.
|
|
* `version` - Enables a version check. If the fetched document's version isn't equal to the specified version then the request fails with a version conflict and the percolation request is aborted.
|
|
* `version_type` - Whether internal or external versioning is used. Defaults to internal versioning.
|
|
|
|
Internally the percolate api will issue a get request for fetching the`_source` of the document to percolate.
|
|
For this feature to work the `_source` for documents to be percolated need to be stored.
|
|
|
|
[float]
|
|
==== Example
|
|
|
|
Index response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"_index" : "my-index",
|
|
"_type" : "message",
|
|
"_id" : "1",
|
|
"_version" : 1,
|
|
"created" : true
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Percolating an existing document:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'localhost:9200/my-index1/message/1/_percolate'
|
|
--------------------------------------------------
|
|
|
|
The response is the same as with the regular percolate api.
|
|
|
|
[float]
|
|
=== Multi percolate api
|
|
|
|
The multi percolate api allows to bundle multiple percolate requests into a single request, similar to what the multi
|
|
search api does to search requests. The request body format is line based. Each percolate request item takes two lines,
|
|
the first line is the header and the second line is the body.
|
|
|
|
The header can contain any parameter that normally would be set via the request path or query string parameters. T
|
|
here are several percolate actions, because there are multiple types of percolate requests.
|
|
|
|
.Supported actions:
|
|
* `percolate` - Action for defining a regular percolate request.
|
|
* `count` - Action for defining a count percolate request.
|
|
|
|
Depending on the percolate action different parameters can be specified. For example the percolate and percolate existing
|
|
document actions support different parameters.
|
|
|
|
.The following endpoints are supported
|
|
* GET|POST /[index]/[type]/_mpercolate
|
|
* GET}POST /[index]/_mpercolate
|
|
* GET|POST /_mpercolate
|
|
|
|
The `index` and `type` defined in the url path are the default index and type.
|
|
|
|
[float]
|
|
==== Example
|
|
|
|
Request:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XGET 'localhost:9200/twitter/tweet/_mpercolate' --data-binary @requests.txt; echo
|
|
--------------------------------------------------
|
|
|
|
The index twitter is the default index and the type tweet is the default type and will be used in the case a header
|
|
doesn't specify an index or type.
|
|
|
|
requests.txt:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{"percolate" : {"index" : twitter", "type" : "tweet"}}
|
|
{"doc" : {"message" : "some text"}}
|
|
{"percolate" : {"index" : twitter", "type" : "tweet", "id" : "1"}}
|
|
{}
|
|
{"percolate" : {"index" : users", "type" : "user", "id" : "3", "percolate_index" : "users_2012" }}
|
|
{"size" : 10}
|
|
{"count" : {"index" : twitter", "type" : "tweet"}}
|
|
{"doc" : {"message" : "some other text"}}
|
|
{"count" : {"index" : twitter", "type" : "tweet", "id" : "1"}}
|
|
{}
|
|
--------------------------------------------------
|
|
|
|
For a percolate existing document item (headers with the `id` field), the response can be an empty json object.
|
|
All the required options are set in the header.
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"items" : [
|
|
{
|
|
"took" : 24,
|
|
"_shards" : {
|
|
"total" : 5,
|
|
"successful" : 5,
|
|
"failed" : 0,
|
|
},
|
|
"total" : 3,
|
|
"matches" : ["1", "2", "3"]
|
|
},
|
|
{
|
|
"took" : 12,
|
|
"_shards" : {
|
|
"total" : 5,
|
|
"successful" : 5,
|
|
"failed" : 0,
|
|
},
|
|
"total" : 3,
|
|
"matches" : ["4", "5", "6"]
|
|
},
|
|
{
|
|
"error" : "[user][3]document missing"
|
|
},
|
|
{
|
|
"took" : 12,
|
|
"_shards" : {
|
|
"total" : 5,
|
|
"successful" : 5,
|
|
"failed" : 0,
|
|
},
|
|
"total" : 3
|
|
},
|
|
{
|
|
"took" : 14,
|
|
"_shards" : {
|
|
"total" : 5,
|
|
"successful" : 5,
|
|
"failed" : 0,
|
|
},
|
|
"total" : 3
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Each item represents a percolate response, the order of the items maps to the order in where the percolate requests
|
|
were specified. In case a percolate request failed, the item response is substituted with an error message.
|
|
|
|
[float]
|
|
=== How it works under the hood
|
|
|
|
When indexing a document that contains a query in an index and the `.percolator` type the query part of the documents gets
|
|
parsed into a Lucene query and is kept in memory until that percolator document is removed or the index containing the
|
|
`.percolator` type get removed. So all the active percolator queries are kept in memory.
|
|
|
|
At percolate time the document specified in the request gets parsed into a Lucene document and is stored in a in-memory
|
|
Lucene index. This in-memory index can just hold this one document and it is optimized for that. Then all the queries
|
|
that are registered to the index that the percolate request is targeted for are going to be executed on this single document
|
|
in-memory index. This happens on each shard the percolate request need to execute.
|
|
|
|
By using `routing`, `filter` or `query` features the amount of queries that need to be executed can be reduced and thus
|
|
the time the percolate api needs to run can be decreased.
|
|
|
|
[float]
|
|
=== Important notes
|
|
|
|
Because the percolator API is processing one document at a time, it doesn't support queries and filters that run
|
|
against child documents such as `has_child`, `has_parent` and `top_children`.
|
|
|
|
The `wildcard` and `regexp` query natively use a lot of memory and because the percolator keeps the queries into memory
|
|
this can easily take up the available memory in the heap space. If possible try to use a `prefix` query or ngramming to
|
|
achieve the same result (with way less memory being used).
|
|
|
|
The delete-by-query api doesn't work to unregister a query, it only deletes the percolate documents from disk. In order
|
|
to update the registered queries in memory the index needs be closed and opened.
|