Document that sliced scroll works for reindex
Surprise! You can use sliced scroll to easily parallelize reindex and friends (update-by-query and delete-by-query). They support it because they use the same infrastructure as a regular search to parse the search request. While we would like to add an "automatic" option for parallelizing reindex, this manual option works right now and is pretty convenient!
This commit is contained in:
parent 370afa371b
commit 560fba1b28

@@ -160,7 +160,7 @@ to keep or remove as you see fit. When you are done with it, delete it so
Elasticsearch can reclaim the space it uses.

`wait_for_active_shards` controls how many copies of a shard must be active
before proceeding with the request. See <<index-wait-for-active-shards,here>>
for details. `timeout` controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
<<docs-bulk,Bulk API>>.
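For example, both settings can be passed as URL parameters. A minimal sketch, assuming the delete-by-query endpoint documented in this file (the index name and values shown are placeholders):

[source,js]
----------------------------------------------------------------
POST twitter/_delete_by_query?wait_for_active_shards=2&timeout=1m
{
  "query": {
    "match_all": {}
  }
}
----------------------------------------------------------------

Here the request waits up to one minute for two copies of each shard to be active before proceeding.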
@@ -339,3 +339,74 @@ like `1.7` or `12` to throttle to that level. Rethrottling that speeds up the
query takes effect immediately but rethrottling that slows down the query will
take effect after completing the current batch. This prevents scroll
timeouts.
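Rethrottling goes through the `_rethrottle` endpoint for a running task. A sketch, assuming a task id retrieved from the tasks API (the id below is a placeholder):

[source,js]
----------------------------------------------------------------
POST _delete_by_query/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1
----------------------------------------------------------------

A `requests_per_second` of `-1` disables throttling entirely.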
[float]
=== Manually slicing

Delete-by-query supports <<sliced-scroll>> allowing you to manually parallelize
the process relatively easily:

[source,js]
----------------------------------------------------------------
POST twitter/_delete_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
POST twitter/_delete_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]

Which you can verify works with:

[source,js]
----------------------------------------------------------------
GET _refresh
POST twitter/_search?size=0&filter_path=hits.total
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[continued]

Which results in a sensible `total` like this one:

[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 0
  }
}
----------------------------------------------------------------
// TESTRESPONSE

@@ -694,6 +694,65 @@ and it'll look like:
Or you can search by `tag` or whatever you want.

[float]
=== Manually slicing

Reindex supports <<sliced-scroll>> allowing you to manually parallelize the
process relatively easily:

[source,js]
----------------------------------------------------------------
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]

Which you can verify works with:

[source,js]
----------------------------------------------------------------
GET _refresh
POST new_twitter/_search?size=0&filter_path=hits.total
----------------------------------------------------------------
// CONSOLE
// TEST[continued]

Which results in a sensible `total` like this one:

[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 120
  }
}
----------------------------------------------------------------
// TESTRESPONSE

[float]
=== Reindex daily indices

@@ -217,7 +217,7 @@ to keep or remove as you see fit. When you are done with it, delete it so
Elasticsearch can reclaim the space it uses.

`wait_for_active_shards` controls how many copies of a shard must be active
before proceeding with the request. See <<index-wait-for-active-shards,here>>
for details. `timeout` controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
<<docs-bulk,Bulk API>>.
@@ -405,6 +405,60 @@ query takes effect immediately but rethrottling that slows down the query will
take effect after completing the current batch. This prevents scroll
timeouts.

[float]
=== Manually slicing

Update-by-query supports <<sliced-scroll>> allowing you to manually parallelize
the process relatively easily:

[source,js]
----------------------------------------------------------------
POST twitter/_update_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "script": {
    "inline": "ctx._source['extra'] = 'test'"
  }
}
POST twitter/_update_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "script": {
    "inline": "ctx._source['extra'] = 'test'"
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]

Which you can verify works with:

[source,js]
----------------------------------------------------------------
GET _refresh
POST twitter/_search?size=0&q=extra:test&filter_path=hits.total
----------------------------------------------------------------
// CONSOLE
// TEST[continued]

Which results in a sensible `total` like this one:

[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 120
  }
}
----------------------------------------------------------------
// TESTRESPONSE

[float]
[[picking-up-a-new-property]]
=== Pick up a new property

@@ -175,7 +175,7 @@ curl -XDELETE localhost:9200/_search/scroll \
-d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1,aGVuRmV0Y2g7NTsxOnkxaDZ'
---------------------------------------


[[sliced-scroll]]
==== Sliced Scroll

For scroll queries that return a lot of documents it is possible to split the scroll into multiple slices which
@@ -183,7 +183,7 @@ can be consumed independently:
[source,js]
--------------------------------------------------
-curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
+GET /twitter/tweet/_search?scroll=1m
{
  "slice": {
    "id": 0, <1>
@@ -195,9 +195,7 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
    }
  }
}
-'

-curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
+GET /twitter/tweet/_search?scroll=1m
{
  "slice": {
    "id": 1,
@@ -209,8 +207,9 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
    }
  }
}
-'
--------------------------------------------------
+// CONSOLE
+// TEST[setup:big_twitter]

<1> The id of the slice
<2> The maximum number of slices
@@ -247,10 +246,10 @@ slice gets deterministic results.
[source,js]
--------------------------------------------------
-curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
+GET /twitter/tweet/_search?scroll=1m
{
  "slice": {
-    "field": "my_random_integer_field",
+    "field": "date",
    "id": 0,
    "max": 10
  },
@@ -260,10 +259,11 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
    }
  }
}
-'
--------------------------------------------------
+// CONSOLE
+// TEST[setup:big_twitter]

For append-only time-based indices, the `timestamp` field can be used safely.

NOTE: By default the maximum number of slices allowed per scroll is limited to 1024.
You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
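A sketch of raising that limit on an index via the settings API (the value `2048` is arbitrary, and `twitter` is a placeholder index name):

[source,js]
----------------------------------------------------------------
PUT twitter/_settings
{
  "index.max_slices_per_scroll": 2048
}
----------------------------------------------------------------

Note that a very large number of slices mostly adds coordination overhead; the slice count usually should not exceed the number of shards in the index by much.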