From d33764583c59c36a6af40e24a445a6623c022a5f Mon Sep 17 00:00:00 2001 From: James Rodewig Date: Tue, 30 Jun 2020 08:35:13 -0400 Subject: [PATCH] [7.x] [DOCS] Document delete/update by query for data streams (#58679) (#58706) --- .../data-streams-overview.asciidoc | 28 ++--- .../set-up-a-data-stream.asciidoc | 9 +- .../data-streams/use-a-data-stream.asciidoc | 107 ++++++++++++++++-- docs/reference/docs/delete-by-query.asciidoc | 31 ++--- docs/reference/docs/update-by-query.asciidoc | 35 +++--- 5 files changed, 156 insertions(+), 54 deletions(-) diff --git a/docs/reference/data-streams/data-streams-overview.asciidoc b/docs/reference/data-streams/data-streams-overview.asciidoc index 7373b8ac8ee..f4cf8020cf0 100644 --- a/docs/reference/data-streams/data-streams-overview.asciidoc +++ b/docs/reference/data-streams/data-streams-overview.asciidoc @@ -119,28 +119,30 @@ manually perform a rollover. See <>. === Append-only For most time-series use cases, existing data is rarely, if ever, updated. -Because of this, data streams are designed to be append-only. This means you can -send indexing requests for new documents directly to a data stream. However, you -cannot send update or deletion requests for existing documents to a data stream. +Because of this, data streams are designed to be append-only. -To update or delete specific documents in a data stream, submit one of the -following requests to the backing index containing the document: +You can send <> directly to a data stream. However, you cannot send the following +requests for existing documents directly to a data stream: * An <> request with an - <> of `index`. - These requests must include valid <> arguments. + <> of `index`. The `op_type` parameter + defaults to `index` for existing documents. * A <> request using the `delete`, `index`, or `update` - action. If the action type is `index`, the action must include valid - <> - arguments. + action. * A <> request -See <>. 
+Instead, you can use the <> and +<> APIs to update or delete existing +documents in a data stream. See <>. + +Alternatively, you can update or delete a document by submitting requests to the +backing index containing the document. See +<>. TIP: If you frequently update or delete existing documents, we recommend using an <> and <> instead of a data stream. You can still -use <> to manage indices for the alias. +use <> to manage indices for the alias. \ No newline at end of file diff --git a/docs/reference/data-streams/set-up-a-data-stream.asciidoc b/docs/reference/data-streams/set-up-a-data-stream.asciidoc index f6afd36fb17..2d455d95d4e 100644 --- a/docs/reference/data-streams/set-up-a-data-stream.asciidoc +++ b/docs/reference/data-streams/set-up-a-data-stream.asciidoc @@ -26,11 +26,10 @@ TIP: Data streams work well with most common log formats. While no schema is required to use data streams, we recommend the {ecs-ref}[Elastic Common Schema (ECS)]. -* Data streams are designed to be <>. -While you can index new documents directly to a data stream, you cannot use a -data stream to directly update or delete individual documents. To update or -delete specific documents in a data stream, submit a <> or -<> API request to the backing index containing the document. +* Data streams are best suited for time-based, +<> use cases. If you frequently need to +update or delete existing documents, we recommend using an index alias and an +index template instead. 
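The append-only model described above has a practical consequence for client code: new documents reach a data stream only through `create`-style requests. The following Python sketch (not part of the patch; the helper name and the `logs` stream are illustrative) assembles a bulk NDJSON body that uses only the `create` action:

```python
import json

def bulk_create_body(stream, docs):
    """Build an NDJSON _bulk body that only uses the `create` action,
    the action type data streams accept for new documents."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": stream}}))
        lines.append(json.dumps(doc))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

body = bulk_create_body("logs", [
    {"@timestamp": "2020-12-07T11:06:07.000Z", "user": {"id": "8a4f500d"}},
])
```

Bulk `delete`, `index`, or `update` actions in such a body would have to target a backing index instead, as described above.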
[discrete] diff --git a/docs/reference/data-streams/use-a-data-stream.asciidoc b/docs/reference/data-streams/use-a-data-stream.asciidoc index c2802616d87..b0e0080113d 100644 --- a/docs/reference/data-streams/use-a-data-stream.asciidoc +++ b/docs/reference/data-streams/use-a-data-stream.asciidoc @@ -9,6 +9,7 @@ the following: * <> * <> * <> +* <> //// [source,console] @@ -66,6 +67,10 @@ POST /logs/_doc/ ---- // TEST[continued] ==== + +IMPORTANT: You cannot add new documents to a data stream using the index API's +`PUT //_doc/<_id>` request format. Use the `PUT //_create/<_id>` +format instead. -- * A <> request using the `create` action. Specify the data @@ -348,12 +353,96 @@ POST /_reindex [[update-delete-docs-in-a-data-stream]] === Update or delete documents in a data stream -Data streams are designed to be <>. This -means you cannot send update or deletion requests for existing documents to a -data stream. However, you can send update or deletion requests to the backing -index containing the document. +You can update or delete documents in a data stream using the following +requests: -To delete or update a document in a data stream, you first need to get: +* An <> request ++ +.*Example* +[%collapsible] +==== +The following update by query API request updates documents in the `logs` data +stream with a `user.id` of `i96BP1mA`. The request uses a +<> to assign matching documents a new `user.id` +value of `XgdX0NoX`. 
+ +//// +[source,console] +---- +PUT /logs/_create/2?refresh=wait_for +{ + "@timestamp": "2020-12-07T11:06:07.000Z", + "user": { + "id": "i96BP1mA" + } +} +---- +// TEST[continued] +//// + +[source,console] +---- +POST /logs/_update_by_query +{ + "query": { + "match": { + "user.id": "i96BP1mA" + } + }, + "script": { + "source": "ctx._source.user.id = params.new_id", + "params": { + "new_id": "XgdX0NoX" + } + } +} +---- +// TEST[continued] +==== + +* A <> request ++ +.*Example* +[%collapsible] +==== +The following delete by query API request deletes documents in the `logs` data +stream with a `user.id` of `zVZMamUM`. + +//// +[source,console] +---- +PUT /logs/_create/1?refresh=wait_for +{ + "@timestamp": "2020-12-07T11:06:07.000Z", + "user": { + "id": "zVZMamUM" + } +} +---- +// TEST[continued] +//// + +[source,console] +---- +POST /logs/_delete_by_query +{ + "query": { + "match": { + "user.id": "zVZMamUM" + } + } +} +---- +// TEST[continued] +==== + +[discrete] +[[update-delete-docs-in-a-backing-index]] +=== Update or delete documents in a backing index + +Alternatively, you can update or delete documents in a data stream by sending +the update or deletion request to the backing index containing the document. To +do this, you first need to get: * The <> * The name of the backing index that contains the document @@ -429,7 +518,7 @@ information for any documents matching the search. "_index": ".ds-logs-000002", <1> "_type": "_doc", "_id": "bfspvnIBr7VVZlfp2lqX", <2> - "_seq_no": 4, <3> + "_seq_no": 8, <3> "_primary_term": 1, <4> "_score": 0.2876821, "_source": { @@ -445,6 +534,8 @@ information for any documents matching the search. 
}
----
// TESTRESPONSE[s/"took": 20/"took": $body.took/]
+// TESTRESPONSE[s/"max_score": 0.2876821/"max_score": $body.hits.max_score/]
+// TESTRESPONSE[s/"_score": 0.2876821/"_score": $body.hits.hits.0._score/]
<1> Backing index containing the matching document
<2> Document ID for the document
@@ -469,7 +560,7 @@ contains a new JSON source for the document.
[source,console]
----
-PUT /.ds-logs-000002/_doc/bfspvnIBr7VVZlfp2lqX?if_seq_no=4&if_primary_term=1
+PUT /.ds-logs-000002/_doc/bfspvnIBr7VVZlfp2lqX?if_seq_no=8&if_primary_term=1
{
  "@timestamp": "2020-12-07T11:06:07.000Z",
  "user": {
@@ -534,7 +625,7 @@ parameters.
[source,console]
----
PUT /_bulk?refresh
-{ "index": { "_index": ".ds-logs-000002", "_id": "bfspvnIBr7VVZlfp2lqX", "if_seq_no": 4, "if_primary_term": 1 } }
+{ "index": { "_index": ".ds-logs-000002", "_id": "bfspvnIBr7VVZlfp2lqX", "if_seq_no": 8, "if_primary_term": 1 } }
{ "@timestamp": "2020-12-07T11:06:07.000Z", "user": { "id": "8a4f500d" }, "message": "Login successful" }
----
// TEST[continued]
diff --git a/docs/reference/docs/delete-by-query.asciidoc b/docs/reference/docs/delete-by-query.asciidoc
index 51f64faf7b9..c3c85290328 100644
--- a/docs/reference/docs/delete-by-query.asciidoc
+++ b/docs/reference/docs/delete-by-query.asciidoc
@@ -47,7 +47,7 @@ POST /twitter/_delete_by_query
[[docs-delete-by-query-api-request]]
==== {api-request-title}
-`POST /<index>/_delete_by_query`
+`POST /<target>/_delete_by_query`
[[docs-delete-by-query-api-desc]]
==== {api-description-title}
You can specify the query criteria in the request URI or the request body
using the same syntax as the <>.
-When you submit a delete by query request, {es} gets a snapshot of the index
+When you submit a delete by query request, {es} gets a snapshot of the data stream or index
when it begins processing the request and deletes matching documents using
`internal` versioning.
If a document changes between the time that the snapshot is taken and the
delete operation is processed, it results in a version conflict and the delete
operation fails.
@@ -134,12 +134,12 @@ Delete by query supports <> to parallelize the delete process.
This can improve efficiency and provide a
convenient way to break the request down into smaller parts.
-Setting `slices` to `auto` chooses a reasonable number for most indices.
+Setting `slices` to `auto` chooses a reasonable number for most data streams and indices.
If you're slicing manually or otherwise tuning automatic slicing, keep in mind
that:
* Query performance is most efficient when the number of `slices` is equal to
-the number of shards in the index. If that number is large (for example,
+the number of shards in the index or backing index. If that number is large (for example,
500), choose a lower number as too many `slices` hurts performance. Setting
`slices` higher than the number of shards generally does not improve efficiency
and adds overhead.
* Delete performance scales linearly across available resources with the
number of slices.
Whether query or delete performance dominates the runtime depends on the
documents being reindexed and cluster resources.
[[docs-delete-by-query-api-path-params]]
==== {api-path-parms-title}
-`<index>`::
-(Optional, string) A comma-separated list of index names to search. Use `_all`
-or omit to search all indices.
+`<target>`::
+(Optional, string)
+A comma-separated list of data streams, indices, and index aliases to search.
+Wildcard (`*`) expressions are supported. To search all data streams or indices
+in a cluster, omit this parameter or use `_all` or `*`.
[[docs-delete-by-query-api-query-params]]
==== {api-query-parms-title}
@@ -200,7 +202,10 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=requests_per_second]
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=routing]
-include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=scroll]
+`scroll`::
+(Optional, <>)
+Period to retain the <> for scrolling. See
+<>.
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=scroll_size]
@@ -343,7 +348,7 @@ version conflicts.
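To make the path and query parameters above concrete, here is a small Python sketch (illustrative only, not part of the patch) that assembles a delete by query request line from a comma-separated target expression and optional query parameters:

```python
from urllib.parse import urlencode

def delete_by_query_url(target, **params):
    """Build the request line for a delete by query call. `target` may be a
    comma-separated list of data streams, indices, and aliases, with wildcards."""
    query = urlencode(params)
    return f"/{target}/_delete_by_query" + (f"?{query}" if query else "")

url = delete_by_query_url("logs,my-index-*", conflicts="proceed", slices="auto")
# "/logs,my-index-*/_delete_by_query?conflicts=proceed&slices=auto"
```

`conflicts=proceed` and `slices=auto` are the query parameters discussed in this section; any of the other documented query parameters could be passed the same way.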
[[docs-delete-by-query-api-example]] ==== {api-examples-title} -Delete all tweets from the `twitter` index: +Delete all tweets from the `twitter` data stream or index: [source,console] -------------------------------------------------- @@ -356,7 +361,7 @@ POST twitter/_delete_by_query?conflicts=proceed -------------------------------------------------- // TEST[setup:twitter] -Delete documents from multiple indices: +Delete documents from multiple data streams or indices: [source,console] -------------------------------------------------- @@ -531,8 +536,8 @@ Which results in a sensible `total` like this one: Setting `slices` to `auto` will let {es} choose the number of slices to use. This setting will use one slice per shard, up to a certain limit. If -there are multiple source indices, it will choose the number of slices based -on the index with the smallest number of shards. +there are multiple source data streams or indices, it will choose the number of slices based +on the index or backing index with the smallest number of shards. Adding `slices` to `_delete_by_query` just automates the manual process used in the section above, creating sub-requests which means it has some quirks: @@ -555,7 +560,7 @@ slices` are distributed proportionally to each sub-request. Combine that with the point above about distribution being uneven and you should conclude that using `max_docs` with `slices` might not result in exactly `max_docs` documents being deleted. -* Each sub-request gets a slightly different snapshot of the source index +* Each sub-request gets a slightly different snapshot of the source data stream or index though these are all taken at approximately the same time. 
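The manual-slicing quirks listed above are easier to see with concrete sub-request bodies. A Python sketch (illustrative, not from the patch) of splitting one delete by query into two manually sliced sub-requests:

```python
def sliced_bodies(query, max_slices):
    """Produce one request body per slice. Each body carries the same query
    plus a `slice` object giving its `id` and the total `max`."""
    return [
        {"slice": {"id": slice_id, "max": max_slices}, "query": query}
        for slice_id in range(max_slices)
    ]

bodies = sliced_bodies({"match": {"user.id": "zVZMamUM"}}, 2)
```

Each body would be POSTed as its own `_delete_by_query` request; setting `slices` on a single request automates exactly this fan-out.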
[float]
diff --git a/docs/reference/docs/update-by-query.asciidoc b/docs/reference/docs/update-by-query.asciidoc
index 34e3538b7a7..2a670449d69 100644
--- a/docs/reference/docs/update-by-query.asciidoc
+++ b/docs/reference/docs/update-by-query.asciidoc
@@ -5,7 +5,7 @@
++++
Updates documents that match the specified query.
-If no query is specified, performs an update on every document in the index without
+If no query is specified, performs an update on every document in the data stream or index without
modifying the source, which is useful for picking up mapping changes.
[source,console]
--------------------------------------------------
POST twitter/_update_by_query?conflicts=proceed
@@ -44,7 +44,7 @@ POST twitter/_update_by_query?conflicts=proceed
[[docs-update-by-query-api-request]]
==== {api-request-title}
-`POST /<index>/_update_by_query`
+`POST /<target>/_update_by_query`
[[docs-update-by-query-api-desc]]
==== {api-description-title}
You can specify the query criteria in the request URI or the request body
using the same syntax as the <>.
-When you submit an update by query request, {es} gets a snapshot of the index
+When you submit an update by query request, {es} gets a snapshot of the data stream or index
when it begins processing the request and updates matching documents using
`internal` versioning.
When the versions match, the document is updated and the version number is incremented.
@@ -75,7 +75,7 @@ Any update requests that completed successfully still stick, they are not rolled
===== Refreshing shards
Specifying the `refresh` parameter refreshes all shards once the request completes.
-This is different than the update API&#8217;s `refresh` parameter, which causes just the shard
+This is different than the update API's `refresh` parameter, which causes just the shard
that received the request to be refreshed. Unlike the update API, it does not
support `wait_for`.
@@ -129,12 +129,12 @@ Update by query supports <> to parallelize the update process.
This can improve efficiency and provide a
convenient way to break the request down into smaller parts.
-Setting `slices` to `auto` chooses a reasonable number for most indices.
+Setting `slices` to `auto` chooses a reasonable number for most data streams and indices.
If you're slicing manually or otherwise tuning automatic slicing, keep in mind
that:
* Query performance is most efficient when the number of `slices` is equal to
-the number of shards in the index. If that number is large (for example,
+the number of shards in the index or backing index. If that number is large (for example,
500), choose a lower number as too many `slices` hurts performance. Setting
`slices` higher than the number of shards generally does not improve efficiency
and adds overhead.
* Update performance scales linearly across available resources with the
number of slices.
Whether query or update performance dominates the runtime depends on the
documents being reindexed and cluster resources.
[[docs-update-by-query-api-path-params]]
==== {api-path-parms-title}
-`<index>`::
-(Optional, string) A comma-separated list of index names to search. Use `_all`
-or omit to search all indices.
+`<target>`::
+(Optional, string)
+A comma-separated list of data streams, indices, and index aliases to search.
+Wildcard (`*`) expressions are supported. To search all data streams or indices
+in a cluster, omit this parameter or use `_all` or `*`.
[[docs-update-by-query-api-query-params]]
==== {api-query-parms-title}
@@ -197,7 +199,10 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=requests_per_second]
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=routing]
-include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=scroll]
+`scroll`::
+(Optional, <>)
+Period to retain the <> for scrolling. See
+<>.
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=scroll_size]
@@ -290,7 +295,7 @@ version conflicts.
==== {api-examples-title}
The simplest usage of `_update_by_query` just performs an update on every
-document in the index without changing the source. This is useful to
+document in the data stream or index without changing the source.
This is useful to <> or some other online mapping change. @@ -313,7 +318,7 @@ POST twitter/_update_by_query?conflicts=proceed way as the <>. You can also use the `q` parameter in the same way as the search API. -Update documents in multiple indices: +Update documents in multiple data streams or indices: [source,console] -------------------------------------------------- @@ -617,8 +622,8 @@ Which results in a sensible `total` like this one: Setting `slices` to `auto` will let Elasticsearch choose the number of slices to use. This setting will use one slice per shard, up to a certain limit. If -there are multiple source indices, it will choose the number of slices based -on the index with the smallest number of shards. +there are multiple source data streams or indices, it will choose the number of slices based +on the index or backing index with the smallest number of shards. Adding `slices` to `_update_by_query` just automates the manual process used in the section above, creating sub-requests which means it has some quirks: @@ -641,7 +646,7 @@ be larger than others. Expect larger slices to have a more even distribution. the point above about distribution being uneven and you should conclude that using `max_docs` with `slices` might not result in exactly `max_docs` documents being updated. -* Each sub-request gets a slightly different snapshot of the source index +* Each sub-request gets a slightly different snapshot of the source data stream or index though these are all taken at approximately the same time. [float]
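The `auto` heuristic described above, one slice per shard based on the source with the fewest shards, up to a limit, can be sketched as follows (the cap value here is illustrative, not the exact internal limit):

```python
def auto_slice_count(source_shard_counts, cap=20):
    """Approximate how `slices=auto` is described as picking a slice count:
    one slice per shard of the source with the fewest shards, capped.
    The default cap of 20 is an assumption for illustration."""
    return min(min(source_shard_counts), cap)

count = auto_slice_count([5, 3, 8])  # driven by the 3-shard source
```

This is why, with multiple source data streams or indices, the smallest source bounds the parallelism of the whole request.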