270 lines
8.7 KiB
Plaintext
270 lines
8.7 KiB
Plaintext
[[plugins-delete-by-query]]
|
|
=== Delete By Query Plugin
|
|
|
|
The delete-by-query plugin adds support for deleting all of the documents
|
|
(from one or more indices) which match the specified query. It is a
|
|
replacement for the problematic _delete-by-query_ functionality which has been
|
|
removed from Elasticsearch core.
|
|
|
|
Internally, it uses the {ref}/search-request-scroll.html#scroll-scan[Scan/Scroll]
|
|
and {ref}/docs-bulk.html[Bulk] APIs to delete documents in an efficient and
|
|
safe manner. It is slower than the old _delete-by-query_ functionality, but
|
|
fixes the problems with the previous implementation.
|
|
|
|
To understand more about why we removed delete-by-query from core and about
|
|
the semantics of the new implementation, see
|
|
<<delete-by-query-plugin-reason>>.
|
|
|
|
[TIP]
|
|
============================================
|
|
Queries which match large numbers of documents may run for a long time,
|
|
as every document has to be deleted individually. Don't use _delete-by-query_
|
|
to clean out all or most documents in an index. Rather create a new index and
|
|
perhaps reindex the documents you want to keep.
|
|
============================================
|
|
|
|
[float]
|
|
==== Installation
|
|
|
|
This plugin can be installed using the plugin manager:
|
|
|
|
[source,sh]
|
|
----------------------------------------------------------------
|
|
sudo bin/plugin install delete-by-query
|
|
----------------------------------------------------------------
|
|
|
|
The plugin must be installed on every node in the cluster, and each node must
|
|
be restarted after installation.
|
|
|
|
[float]
|
|
==== Removal
|
|
|
|
The plugin can be removed with the following command:
|
|
|
|
[source,sh]
|
|
----------------------------------------------------------------
|
|
sudo bin/plugin remove delete-by-query
|
|
----------------------------------------------------------------
|
|
|
|
The node must be stopped before removing the plugin.
|
|
|
|
[[delete-by-query-usage]]
|
|
==== Using Delete-by-Query
|
|
|
|
The query can either be provided using a simple query string as
|
|
a parameter:
|
|
|
|
[source,shell]
|
|
--------------------------------------------------
|
|
DELETE /twitter/tweet/_query?q=user:kimchy
|
|
--------------------------------------------------
|
|
// AUTOSENSE
|
|
|
|
or using the {ref}/query-dsl.html[Query DSL] defined within the request body:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
DELETE /twitter/tweet/_query
|
|
{
|
|
"query": { <1>
|
|
"term": {
|
|
"user": "kimchy"
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// AUTOSENSE
|
|
|
|
<1> The query must be passed as a value to the `query` key, in the same way as
|
|
the {ref}/search-search.html[search api].
|
|
|
|
Both of the above examples end up doing the same thing, which is to delete all
|
|
tweets from the twitter index for the user `kimchy`.
|
|
|
|
Delete-by-query supports deletion across
|
|
{ref}/search-search.html#search-multi-index-type[multiple indices and multiple types].
|
|
|
|
[float]
|
|
=== Query-string parameters
|
|
|
|
The following query string parameters are supported:
|
|
|
|
`q`::
|
|
|
|
Instead of using the {ref}/query-dsl.html[Query DSL] to pass a `query` in the request
|
|
body, you can use the `q` query string parameter to specify a query using
|
|
{ref}/query-dsl-query-string-query.html#query-string-syntax[`query_string` syntax].
|
|
In this case, the following additional parameters are supported: `df`,
|
|
`analyzer`, `default_operator`, `lowercase_expanded_terms`,
|
|
`analyze_wildcard` and `lenient`.
|
|
See {ref}/search-uri-request.html[URI search request] for details.
|
|
|
|
`size`::
|
|
|
|
The number of hits returned *per shard* by the {ref}/search-request-scroll.html#scroll-scan[scan]
|
|
request. Defaults to 10. May also be specified in the request body.
|
|
|
|
`timeout`::
|
|
|
|
The maximum execution time of the delete by query process. Once expired, no
|
|
more documents will be deleted.
|
|
|
|
`routing`::
|
|
|
|
A comma separated list of routing values to control which shards the delete by
|
|
query request should be executed on.
|
|
|
|
When using the `q` parameter, the following additional parameters are
|
|
supported (as explained in {ref}/search-uri-request.html[URI search request]): `df`, `analyzer`,
|
|
`default_operator`.
|
|
|
|
|
|
[float]
|
|
=== Response body
|
|
|
|
The JSON response looks like this:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"took" : 639,
|
|
"timed_out" : false,
|
|
"_indices" : {
|
|
"_all" : {
|
|
"found" : 5901,
|
|
"deleted" : 5901,
|
|
"missing" : 0,
|
|
"failed" : 0
|
|
},
|
|
"twitter" : {
|
|
"found" : 5901,
|
|
"deleted" : 5901,
|
|
"missing" : 0,
|
|
"failed" : 0
|
|
}
|
|
},
|
|
"failures" : [ ]
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Internally, the query is used to execute an initial
|
|
{ref}/search-request-scroll.html#scroll-scan[scroll/scan] request. As hits are
|
|
pulled from the scroll API, they are passed to the {ref}/docs-bulk.html[Bulk
|
|
API] for deletion.
|
|
|
|
IMPORTANT: Delete by query will only delete the version of the document that
|
|
was visible to search at the time the request was executed. Any documents
|
|
that have been reindexed or updated during execution will not be deleted.
|
|
|
|
Since documents can be updated or deleted by external operations during the
|
|
_scan-scroll-bulk_ process, the plugin keeps track of different counters for
|
|
each index, with the totals displayed under the `_all` index. The counters
|
|
are as follows:
|
|
|
|
`found`::
|
|
|
|
The number of documents matching the query for the given index.
|
|
|
|
`deleted`::
|
|
|
|
The number of documents successfully deleted for the given index.
|
|
|
|
`missing`::
|
|
|
|
The number of documents that were missing when the plugin tried to delete
|
|
them. Missing documents were present when the original query was run, but have
|
|
already been deleted by another process.
|
|
|
|
`failed`::
|
|
|
|
The number of documents that failed to be deleted for the given index. A
|
|
document may fail to be deleted if it has been updated to a new version by
|
|
another process, or if the shard containing the document has gone missing due
|
|
to hardware failure, for example.
|
|
|
|
[[delete-by-query-plugin-reason]]
|
|
==== Why Delete-By-Query is a plugin
|
|
|
|
The old delete-by-query API in Elasticsearch 1.x was fast but problematic. We
|
|
decided to remove the feature from Elasticsearch for these reasons:
|
|
|
|
Forward compatibility::
|
|
|
|
The old implementation wrote a delete-by-query request, including the
|
|
query, to the transaction log. This meant that, when upgrading to a new
|
|
version, old unsupported queries which cannot be executed might exist in
|
|
the translog, thus causing data corruption.
|
|
|
|
Consistency and correctness::
|
|
|
|
The old implementation executed the query and deleted all matching docs on
|
|
the primary first. It then repeated this procedure on each replica shard.
|
|
There was no guarantee that the queries on the primary and the replicas
|
|
matched the same document, so it was quite possible to end up with
|
|
different documents on each shard copy.
|
|
|
|
Resiliency::
|
|
|
|
The old implementation could cause out-of-memory exceptions, merge storms,
|
|
and dramatic slow downs if used incorrectly.
|
|
|
|
[float]
|
|
=== New delete-by-query implementation
|
|
|
|
The new implementation, provided by this plugin, is built internally
|
|
using {ref}/search-request-scroll.html#scroll-scan[scan and scroll] to return
|
|
the document IDs and versions of all the documents that need to be deleted.
|
|
It then uses the {ref}/docs-bulk.html[`bulk` API] to do the actual deletion.
|
|
|
|
This can have performance as well as visibility implications. Delete-by-query
|
|
now has the following semantics:
|
|
|
|
non-atomic::
|
|
|
|
A delete-by-query may fail at any time while some documents matching the
|
|
query have already been deleted.
|
|
|
|
try-once::
|
|
|
|
A delete-by-query may fail at any time and will not retry it's execution.
|
|
All retry logic is left to the user.
|
|
|
|
syntactic sugar::
|
|
|
|
A delete-by-query is equivalent to a scan/scroll search and corresponding
|
|
bulk-deletes by ID.
|
|
|
|
point-in-time::
|
|
|
|
A delete-by-query will only delete the documents that are visible at the
|
|
point in time the delete-by-query was started, equivalent to the
|
|
scan/scroll API.
|
|
|
|
consistent::
|
|
|
|
A delete-by-query will yield consistent results across all replicas of a
|
|
shard.
|
|
|
|
forward-compatible::
|
|
|
|
A delete-by-query will only send IDs to the shards as deletes such that no
|
|
queries are stored in the transaction logs that might not be supported in
|
|
the future.
|
|
|
|
visibility::
|
|
|
|
The effect of a delete-by-query request will not be visible to search
|
|
until the user refreshes the index, or the index is refreshed
|
|
automatically.
|
|
|
|
The new implementation suffers from two issues, which is why we decided to
|
|
move the functionality to a plugin instead of replacing the feautre in core:
|
|
|
|
* It is not as fast as the previous implementation. For most use cases, this
|
|
difference should not be noticeable but users running delete-by-query on
|
|
many matching documents may be affected.
|
|
|
|
* There is currently no way to monitor or cancel a running delete-by-query
|
|
request, except for the `timeout` parameter.
|
|
|
|
We have plans to solve both of these issues in a later version of Elasticsearch. |