Document that sliced scroll works for reindex

Surprise! You can use sliced scroll to easily parallelize reindex
and friends. They support it because they use the same infrastructure
as a regular search to parse the search request. While we would like
to make an "automatic" option for parallelizing reindex, this manual
option works right now and is pretty convenient!
Nik Everett 2016-09-21 12:42:07 -04:00
parent 370afa371b
commit 560fba1b28
4 changed files with 196 additions and 12 deletions


@ -160,7 +160,7 @@ to keep or remove as you see fit. When you are done with it, delete it so
Elasticsearch can reclaim the space it uses.
`wait_for_active_shards` controls how many copies of a shard must be active
before proceeding with the request. See <<index-wait-for-active-shards,here>>
for details. `timeout` controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
<<docs-bulk,Bulk API>>.
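For example, a minimal sketch of a delete-by-query request that sets both parameters explicitly (the values shown here are illustrative, not recommendations):
[source,js]
----------------------------------------------------------------
POST twitter/_delete_by_query?wait_for_active_shards=2&timeout=1m
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------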
@ -339,3 +339,74 @@ like `1.7` or `12` to throttle to that level. Rethrottling that speeds up the
query takes effect immediately, but rethrottling that slows down the query will
take effect after completing the current batch. This prevents scroll
timeouts.
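For illustration, a rethrottle call that changes the throttle to `12` requests per second might look like the sketch below; the task id is a made-up placeholder, the real one comes from the Tasks API:
[source,js]
----------------------------------------------------------------
POST _delete_by_query/taskId:1/_rethrottle?requests_per_second=12
----------------------------------------------------------------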
[float]
=== Manually slicing
Delete-by-query supports <<sliced-scroll>>, allowing you to manually parallelize
the process relatively easily:
[source,js]
----------------------------------------------------------------
POST twitter/_delete_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
POST twitter/_delete_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
Which you can verify works with:
[source,js]
----------------------------------------------------------------
GET _refresh
POST twitter/_search?size=0&filter_path=hits.total
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[continued]
Which results in a sensible `total` like this one:
[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 0
  }
}
----------------------------------------------------------------
// TESTRESPONSE


@ -694,6 +694,65 @@ and it'll look like:
Or you can search by `tag` or whatever you want.
[float]
=== Manually slicing
Reindex supports <<sliced-scroll>>, allowing you to manually parallelize the
process relatively easily:
[source,js]
----------------------------------------------------------------
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
Which you can verify works with:
[source,js]
----------------------------------------------------------------
GET _refresh
POST new_twitter/_search?size=0&filter_path=hits.total
----------------------------------------------------------------
// CONSOLE
// TEST[continued]
Which results in a sensible `total` like this one:
[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 120
  }
}
----------------------------------------------------------------
// TESTRESPONSE
[float]
=== Reindex daily indices


@ -217,7 +217,7 @@ to keep or remove as you see fit. When you are done with it, delete it so
Elasticsearch can reclaim the space it uses.
`wait_for_active_shards` controls how many copies of a shard must be active
before proceeding with the request. See <<index-wait-for-active-shards,here>>
for details. `timeout` controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
<<docs-bulk,Bulk API>>.
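For example, the same two parameters can be passed to update-by-query as URL parameters; the values here are only illustrative:
[source,js]
----------------------------------------------------------------
POST twitter/_update_by_query?wait_for_active_shards=2&timeout=1m
----------------------------------------------------------------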
@ -405,6 +405,60 @@ query takes effect immediately, but rethrottling that slows down the query will
take effect after completing the current batch. This prevents scroll
timeouts.
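Rethrottling works the same way here; as an illustration only (the task id is a placeholder, the real one comes from the Tasks API):
[source,js]
----------------------------------------------------------------
POST _update_by_query/taskId:1/_rethrottle?requests_per_second=12
----------------------------------------------------------------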
[float]
=== Manually slicing
Update-by-query supports <<sliced-scroll>>, allowing you to manually parallelize
the process relatively easily:
[source,js]
----------------------------------------------------------------
POST twitter/_update_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "script": {
    "inline": "ctx._source['extra'] = 'test'"
  }
}
POST twitter/_update_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "script": {
    "inline": "ctx._source['extra'] = 'test'"
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
Which you can verify works with:
[source,js]
----------------------------------------------------------------
GET _refresh
POST twitter/_search?size=0&q=extra:test&filter_path=hits.total
----------------------------------------------------------------
// CONSOLE
// TEST[continued]
Which results in a sensible `total` like this one:
[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 120
  }
}
----------------------------------------------------------------
// TESTRESPONSE
[float]
[[picking-up-a-new-property]]
=== Pick up a new property


@ -175,7 +175,7 @@ curl -XDELETE localhost:9200/_search/scroll \
-d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1,aGVuRmV0Y2g7NTsxOnkxaDZ'
---------------------------------------
[[sliced-scroll]]
==== Sliced Scroll
For scroll queries that return a lot of documents, it is possible to split the scroll into multiple slices which
@ -183,7 +183,7 @@ can be consumed independently:
[source,js]
--------------------------------------------------
GET /twitter/tweet/_search?scroll=1m
{
"slice": {
"id": 0, <1>
@ -195,9 +195,7 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
}
}
}
GET /twitter/tweet/_search?scroll=1m
{
"slice": {
"id": 1,
@ -209,8 +207,9 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
<1> The id of the slice
<2> The maximum number of slices
@ -247,10 +246,10 @@ slice gets deterministic results.
[source,js]
--------------------------------------------------
GET /twitter/tweet/_search?scroll=1m
{
"slice": {
"field": "my_random_integer_field",
"field": "date",
"id": 0,
"max": 10
},
@ -260,10 +259,11 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
For append-only time-based indices, the `timestamp` field can be used safely.
NOTE: By default the maximum number of slices allowed per scroll is limited to 1024.
You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
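If you need more slices than the default allows, a minimal sketch of raising the limit on the `twitter` index used above (assuming the setting can be updated dynamically; the value is illustrative):
[source,js]
--------------------------------------------------
PUT /twitter/_settings
{
  "index.max_slices_per_scroll": 2048
}
--------------------------------------------------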