Document that sliced scroll works for reindex

Surprise! You can use sliced scroll to easily parallelize reindex
and friends. They support it because they use the same infrastructure
as a regular search to parse the search request. While we would like
to make an "automatic" option for parallelizing reindex, this manual
option works right now and is pretty convenient!
Nik Everett 2016-09-21 12:42:07 -04:00
parent 370afa371b
commit 560fba1b28
4 changed files with 196 additions and 12 deletions


@ -160,7 +160,7 @@ to keep or remove as you see fit. When you are done with it, delete it so
Elasticsearch can reclaim the space it uses.
`wait_for_active_shards` controls how many copies of a shard must be active
before proceeding with the request. See <<index-wait-for-active-shards,here>>
for details. `timeout` controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
<<docs-bulk,Bulk API>>.
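For example, a minimal sketch of a delete-by-query request that sets both parameters explicitly (the values shown here are illustrative, not recommendations):
[source,js]
----------------------------------------------------------------
POST twitter/_delete_by_query?wait_for_active_shards=2&timeout=1m
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------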
@ -339,3 +339,74 @@ like `1.7` or `12` to throttle to that level. Rethrottling that speeds up the
query takes effect immediately, but rethrottling that slows down the query will
take effect after completing the current batch. This prevents scroll
timeouts.
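For illustration, a rethrottle call that changes the throttle to `12` requests per second might look like the sketch below; the task id is a made-up placeholder, the real one comes from the Tasks API:
[source,js]
----------------------------------------------------------------
POST _delete_by_query/taskId:1/_rethrottle?requests_per_second=12
----------------------------------------------------------------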
[float]
=== Manually slicing
Delete-by-query supports <<sliced-scroll>>, allowing you to manually parallelize
the process relatively easily:
[source,js]
----------------------------------------------------------------
POST twitter/_delete_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
POST twitter/_delete_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
Which you can verify works with:
[source,js]
----------------------------------------------------------------
GET _refresh
POST twitter/_search?size=0&filter_path=hits.total
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[continued]
Which results in a sensible `total` like this one:
[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 0
  }
}
----------------------------------------------------------------
// TESTRESPONSE


@ -694,6 +694,65 @@ and it'll look like:
Or you can search by `tag` or whatever you want.
[float]
=== Manually slicing
Reindex supports <<sliced-scroll>>, allowing you to manually parallelize the
process relatively easily:
[source,js]
----------------------------------------------------------------
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
Which you can verify works with:
[source,js]
----------------------------------------------------------------
GET _refresh
POST new_twitter/_search?size=0&filter_path=hits.total
----------------------------------------------------------------
// CONSOLE
// TEST[continued]
Which results in a sensible `total` like this one:
[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 120
  }
}
----------------------------------------------------------------
// TESTRESPONSE
[float]
=== Reindex daily indices


@ -217,7 +217,7 @@ to keep or remove as you see fit. When you are done with it, delete it so
Elasticsearch can reclaim the space it uses.
`wait_for_active_shards` controls how many copies of a shard must be active
before proceeding with the request. See <<index-wait-for-active-shards,here>>
for details. `timeout` controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
<<docs-bulk,Bulk API>>.
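For example, the same two parameters can be passed to update-by-query as URL parameters; the values here are only illustrative:
[source,js]
----------------------------------------------------------------
POST twitter/_update_by_query?wait_for_active_shards=2&timeout=1m
----------------------------------------------------------------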
@ -405,6 +405,60 @@ query takes effect immediately, but rethrottling that slows down the query will
take effect after completing the current batch. This prevents scroll
timeouts.
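Rethrottling works the same way here; as an illustration only (the task id is a placeholder, the real one comes from the Tasks API):
[source,js]
----------------------------------------------------------------
POST _update_by_query/taskId:1/_rethrottle?requests_per_second=12
----------------------------------------------------------------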
[float]
=== Manually slicing
Update-by-query supports <<sliced-scroll>>, allowing you to manually parallelize
the process relatively easily:
[source,js]
----------------------------------------------------------------
POST twitter/_update_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "script": {
    "inline": "ctx._source['extra'] = 'test'"
  }
}
POST twitter/_update_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "script": {
    "inline": "ctx._source['extra'] = 'test'"
  }
}
----------------------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
Which you can verify works with:
[source,js]
----------------------------------------------------------------
GET _refresh
POST twitter/_search?size=0&q=extra:test&filter_path=hits.total
----------------------------------------------------------------
// CONSOLE
// TEST[continued]
Which results in a sensible `total` like this one:
[source,js]
----------------------------------------------------------------
{
  "hits": {
    "total": 120
  }
}
----------------------------------------------------------------
// TESTRESPONSE
[float]
[[picking-up-a-new-property]]
=== Pick up a new property


@ -175,7 +175,7 @@ curl -XDELETE localhost:9200/_search/scroll \
-d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1,aGVuRmV0Y2g7NTsxOnkxaDZ'
---------------------------------------
[[sliced-scroll]]
==== Sliced Scroll
For scroll queries that return a lot of documents, it is possible to split the scroll into multiple slices which
@ -183,7 +183,7 @@ can be consumed independently:
[source,js]
--------------------------------------------------
GET /twitter/tweet/_search?scroll=1m
{
"slice": {
"id": 0, <1>
@ -195,9 +195,7 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
}
}
}
GET /twitter/tweet/_search?scroll=1m
{
"slice": {
"id": 1,
@ -209,8 +207,9 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
<1> The id of the slice
<2> The maximum number of slices
@ -247,10 +246,10 @@ slice gets deterministic results.
[source,js]
--------------------------------------------------
GET /twitter/tweet/_search?scroll=1m
{
"slice": {
"field": "my_random_integer_field",
"field": "date",
"id": 0,
"max": 10
},
@ -260,10 +259,11 @@ curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:big_twitter]
For append-only time-based indices, the `timestamp` field can be used safely.
NOTE: By default the maximum number of slices allowed per scroll is limited to 1024.
You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
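If you need more slices than the default allows, a minimal sketch of raising the limit on the `twitter` index used above (assuming the setting can be updated dynamically; the value is illustrative):
[source,js]
--------------------------------------------------
PUT /twitter/_settings
{
  "index.max_slices_per_scroll": 2048
}
--------------------------------------------------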