1162 lines
32 KiB
Plaintext
1162 lines
32 KiB
Plaintext
[[docs-reindex]]
|
|
== Reindex API
|
|
|
|
IMPORTANT: Reindex requires <<mapping-source-field,`_source`>> to be enabled for
|
|
all documents in the source index.
|
|
|
|
IMPORTANT: Reindex does not attempt to set up the destination index. It does
|
|
not copy the settings of the source index. You should set up the destination
|
|
index prior to running a `_reindex` action, including setting up mappings, shard
|
|
counts, replicas, etc.
|
|
|
|
The most basic form of `_reindex` just copies documents from one index to another.
|
|
This will copy documents from the `twitter` index into the `new_twitter` index:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:big_twitter]
|
|
|
|
That will return something like this:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"took" : 147,
|
|
"timed_out": false,
|
|
"created": 120,
|
|
"updated": 0,
|
|
"deleted": 0,
|
|
"batches": 1,
|
|
"version_conflicts": 0,
|
|
"noops": 0,
|
|
"retries": {
|
|
"bulk": 0,
|
|
"search": 0
|
|
},
|
|
"throttled_millis": 0,
|
|
"requests_per_second": -1.0,
|
|
"throttled_until_millis": 0,
|
|
"total": 120,
|
|
"failures" : [ ]
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/"took" : 147/"took" : "$body.took"/]
|
|
|
|
Just like <<docs-update-by-query,`_update_by_query`>>, `_reindex` gets a
|
|
snapshot of the source index but its target must be a **different** index so
|
|
version conflicts are unlikely. The `dest` element can be configured like the
|
|
index API to control optimistic concurrency control. Just leaving out
|
|
`version_type` (as above) or setting it to `internal` will cause Elasticsearch
|
|
to blindly dump documents into the target, overwriting any that happen to have
|
|
the same type and id:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter",
|
|
"version_type": "internal"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
Setting `version_type` to `external` will cause Elasticsearch to preserve the
|
|
`version` from the source, create any documents that are missing, and update
|
|
any documents that have an older version in the destination index than they do
|
|
in the source index:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter",
|
|
"version_type": "external"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
Settings `op_type` to `create` will cause `_reindex` to only create missing
|
|
documents in the target index. All existing documents will cause a version
|
|
conflict:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter",
|
|
"op_type": "create"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
By default version conflicts abort the `_reindex` process but you can just
|
|
count them by settings `"conflicts": "proceed"` in the request body:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"conflicts": "proceed",
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter",
|
|
"op_type": "create"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
You can limit the documents by adding a type to the `source` or by adding a
|
|
query. This will only copy tweets made by `kimchy` into `new_twitter`:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter",
|
|
"type": "_doc",
|
|
"query": {
|
|
"term": {
|
|
"user": "kimchy"
|
|
}
|
|
}
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
`index` and `type` in `source` can both be lists, allowing you to copy from
|
|
lots of sources in one request. This will copy documents from the `_doc` and
|
|
`post` types in the `twitter` and `blog` indices.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": ["twitter", "blog"],
|
|
"type": ["_doc", "post"]
|
|
},
|
|
"dest": {
|
|
"index": "all_together",
|
|
"type": "_doc"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
// TEST[s/^/PUT blog\/post\/post1?refresh\n{"test": "foo"}\n/]
|
|
|
|
NOTE: The Reindex API makes no effort to handle ID collisions so the last
|
|
document written will "win" but the order isn't usually predictable so it is
|
|
not a good idea to rely on this behavior. Instead, make sure that IDs are unique
|
|
using a script.
|
|
|
|
It's also possible to limit the number of processed documents by setting
|
|
`size`. This will only copy a single document from `twitter` to
|
|
`new_twitter`:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"size": 1,
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
If you want a particular set of documents from the `twitter` index you'll
|
|
need to use `sort`. Sorting makes the scroll less efficient but in some contexts
|
|
it's worth it. If possible, prefer a more selective query to `size` and `sort`.
|
|
This will copy 10000 documents from `twitter` into `new_twitter`:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"size": 10000,
|
|
"source": {
|
|
"index": "twitter",
|
|
"sort": { "date": "desc" }
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
The `source` section supports all the elements that are supported in a
|
|
<<search-request-body,search request>>. For instance, only a subset of the
|
|
fields from the original documents can be reindexed using `source` filtering
|
|
as follows:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter",
|
|
"_source": ["user", "_doc"]
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
|
|
Like `_update_by_query`, `_reindex` supports a script that modifies the
|
|
document. Unlike `_update_by_query`, the script is allowed to modify the
|
|
document's metadata. This example bumps the version of the source document:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter",
|
|
"version_type": "external"
|
|
},
|
|
"script": {
|
|
"source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
|
|
"lang": "painless"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
Just as in `_update_by_query`, you can set `ctx.op` to change the
|
|
operation that is executed on the destination index:
|
|
|
|
`noop`::
|
|
|
|
Set `ctx.op = "noop"` if your script decides that the document doesn't have
|
|
to be indexed in the destination index. This no operation will be reported
|
|
in the `noop` counter in the <<docs-reindex-response-body, response body>>.
|
|
|
|
`delete`::
|
|
|
|
Set `ctx.op = "delete"` if your script decides that the document must be
|
|
deleted from the destination index. The deletion will be reported in the
|
|
`deleted` counter in the <<docs-reindex-response-body, response body>>.
|
|
|
|
Setting `ctx.op` to anything else will return an error, as will setting any
|
|
other field in `ctx`.
|
|
|
|
Think of the possibilities! Just be careful; you are able to
|
|
change:
|
|
|
|
* `_id`
|
|
* `_type`
|
|
* `_index`
|
|
* `_version`
|
|
* `_routing`
|
|
|
|
Setting `_version` to `null` or clearing it from the `ctx` map is just like not
|
|
sending the version in an indexing request; it will cause the document to be
|
|
overwritten in the target index regardless of the version on the target or the
|
|
version type you use in the `_reindex` request.
|
|
|
|
By default if `_reindex` sees a document with routing then the routing is
|
|
preserved unless it's changed by the script. You can set `routing` on the
|
|
`dest` request to change this:
|
|
|
|
`keep`::
|
|
|
|
Sets the routing on the bulk request sent for each match to the routing on
|
|
the match. This is the default value.
|
|
|
|
`discard`::
|
|
|
|
Sets the routing on the bulk request sent for each match to `null`.
|
|
|
|
`=<some text>`::
|
|
|
|
Sets the routing on the bulk request sent for each match to all text after
|
|
the `=`.
|
|
|
|
For example, you can use the following request to copy all documents from
|
|
the `source` index with the company name `cat` into the `dest` index with
|
|
routing set to `cat`.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "source",
|
|
"query": {
|
|
"match": {
|
|
"company": "cat"
|
|
}
|
|
}
|
|
},
|
|
"dest": {
|
|
"index": "dest",
|
|
"routing": "=cat"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[s/^/PUT source\n/]
|
|
|
|
By default `_reindex` uses scroll batches of 1000. You can change the
|
|
batch size with the `size` field in the `source` element:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "source",
|
|
"size": 100
|
|
},
|
|
"dest": {
|
|
"index": "dest",
|
|
"routing": "=cat"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[s/^/PUT source\n/]
|
|
|
|
Reindex can also use the <<ingest>> feature by specifying a
|
|
`pipeline` like this:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "source"
|
|
},
|
|
"dest": {
|
|
"index": "dest",
|
|
"pipeline": "some_ingest_pipeline"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[s/^/PUT source\n/]
|
|
|
|
[float]
|
|
[[reindex-from-remote]]
|
|
=== Reindex from Remote
|
|
|
|
Reindex supports reindexing from a remote Elasticsearch cluster:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"remote": {
|
|
"host": "http://otherhost:9200",
|
|
"username": "user",
|
|
"password": "pass"
|
|
},
|
|
"index": "source",
|
|
"query": {
|
|
"match": {
|
|
"test": "data"
|
|
}
|
|
}
|
|
},
|
|
"dest": {
|
|
"index": "dest"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:host]
|
|
// TEST[s/^/PUT source\n/]
|
|
// TEST[s/otherhost:9200",/\${host}"/]
|
|
// TEST[s/"username": "user",//]
|
|
// TEST[s/"password": "pass"//]
|
|
|
|
The `host` parameter must contain a scheme, host, port (e.g.
|
|
`https://otherhost:9200`) and optional path (e.g. `https://otherhost:9200/proxy`).
|
|
The `username` and `password` parameters are optional, and when they are present `_reindex`
|
|
will connect to the remote Elasticsearch node using basic auth. Be sure to use `https` when
|
|
using basic auth or the password will be sent in plain text.
|
|
|
|
Remote hosts have to be explicitly whitelisted in elasticsearch.yml using the
|
|
`reindex.remote.whitelist` property. It can be set to a comma delimited list
|
|
of allowed remote `host` and `port` combinations (e.g.
|
|
`otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*`). Scheme is
|
|
ignored by the whitelist - only host and port are used, for example:
|
|
|
|
|
|
[source,yaml]
|
|
--------------------------------------------------
|
|
reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"
|
|
--------------------------------------------------
|
|
|
|
The whitelist must be configured on any nodes that will coordinate the reindex.
|
|
|
|
This feature should work with remote clusters of any version of Elasticsearch
|
|
you are likely to find. This should allow you to upgrade from any version of
|
|
Elasticsearch to the current version by reindexing from a cluster of the old
|
|
version.
|
|
|
|
To enable queries sent to older versions of Elasticsearch the `query` parameter
|
|
is sent directly to the remote host without validation or modification.
|
|
|
|
NOTE: Reindexing from remote clusters does not support
|
|
<<docs-reindex-manual-slice, manual>> or
|
|
<<docs-reindex-automatic-slice, automatic slicing>>.
|
|
|
|
Reindexing from a remote server uses an on-heap buffer that defaults to a
|
|
maximum size of 100mb. If the remote index includes very large documents you'll
|
|
need to use a smaller batch size. The example below sets the batch size to `10`
|
|
which is very, very small.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"remote": {
|
|
"host": "http://otherhost:9200"
|
|
},
|
|
"index": "source",
|
|
"size": 10,
|
|
"query": {
|
|
"match": {
|
|
"test": "data"
|
|
}
|
|
}
|
|
},
|
|
"dest": {
|
|
"index": "dest"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:host]
|
|
// TEST[s/^/PUT source\n/]
|
|
// TEST[s/otherhost:9200/\${host}/]
|
|
|
|
It is also possible to set the socket read timeout on the remote connection
|
|
with the `socket_timeout` field and the connection timeout with the
|
|
`connect_timeout` field. Both default to 30 seconds. This example
|
|
sets the socket read timeout to one minute and the connection timeout to 10
|
|
seconds:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"remote": {
|
|
"host": "http://otherhost:9200",
|
|
"socket_timeout": "1m",
|
|
"connect_timeout": "10s"
|
|
},
|
|
"index": "source",
|
|
"query": {
|
|
"match": {
|
|
"test": "data"
|
|
}
|
|
}
|
|
},
|
|
"dest": {
|
|
"index": "dest"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:host]
|
|
// TEST[s/^/PUT source\n/]
|
|
// TEST[s/otherhost:9200/\${host}/]
|
|
|
|
[float]
|
|
=== URL Parameters
|
|
|
|
In addition to the standard parameters like `pretty`, the Reindex API also
|
|
supports `refresh`, `wait_for_completion`, `wait_for_active_shards`, `timeout`,
|
|
`scroll` and `requests_per_second`.
|
|
|
|
Sending the `refresh` url parameter will cause all indexes to which the request
|
|
wrote to be refreshed. This is different than the Index API's `refresh`
|
|
parameter which causes just the shard that received the new data to be
|
|
refreshed. Also unlike the Index API it does not support `wait_for`.
|
|
|
|
If the request contains `wait_for_completion=false` then Elasticsearch will
|
|
perform some preflight checks, launch the request, and then return a `task`
|
|
which can be used with <<docs-reindex-task-api,Tasks APIs>>
|
|
to cancel or get the status of the task. Elasticsearch will also create a
|
|
record of this task as a document at `.tasks/task/${taskId}`. This is yours
|
|
to keep or remove as you see fit. When you are done with it, delete it so
|
|
Elasticsearch can reclaim the space it uses.
|
|
|
|
`wait_for_active_shards` controls how many copies of a shard must be active
|
|
before proceeding with the reindexing. See <<index-wait-for-active-shards,here>>
|
|
for details. `timeout` controls how long each write request waits for unavailable
|
|
shards to become available. Both work exactly how they work in the
|
|
<<docs-bulk,Bulk API>>. As `_reindex` uses scroll search, you can also specify
|
|
the `scroll` parameter to control how long it keeps the "search context" alive,
|
|
(e.g. `?scroll=10m`). The default value is 5 minutes.
|
|
|
|
`requests_per_second` can be set to any positive decimal number (`1.4`, `6`,
|
|
`1000`, etc) and throttles the rate at which `_reindex` issues batches of index
|
|
operations by padding each batch with a wait time. The throttling can be
|
|
disabled by setting `requests_per_second` to `-1`.
|
|
|
|
The throttling is done by waiting between batches so that the `scroll` which `_reindex`
|
|
uses internally can be given a timeout that takes into account the padding.
|
|
The padding time is the difference between the batch size divided by the
|
|
`requests_per_second` and the time spent writing. By default the batch size is
|
|
`1000`, so if the `requests_per_second` is set to `500`:
|
|
|
|
[source,txt]
|
|
--------------------------------------------------
|
|
target_time = 1000 / 500 per second = 2 seconds
|
|
wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds
|
|
--------------------------------------------------
|
|
|
|
Since the batch is issued as a single `_bulk` request, large batch sizes will
|
|
cause Elasticsearch to create many requests and then wait for a while before
|
|
starting the next set. This is "bursty" instead of "smooth". The default value is `-1`.
|
|
|
|
[float]
|
|
[[docs-reindex-response-body]]
|
|
=== Response body
|
|
|
|
//////////////////////////
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_reindex?wait_for_completion
|
|
{
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:twitter]
|
|
|
|
//////////////////////////
|
|
|
|
The JSON response looks like this:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"took": 639,
|
|
"timed_out": false,
|
|
"total": 5,
|
|
"updated": 0,
|
|
"created": 5,
|
|
"deleted": 0,
|
|
"batches": 1,
|
|
"noops": 0,
|
|
"version_conflicts": 2,
|
|
"retries": {
|
|
"bulk": 0,
|
|
"search": 0
|
|
},
|
|
"throttled_millis": 0,
|
|
"requests_per_second": 1,
|
|
"throttled_until_millis": 0,
|
|
"failures": [ ]
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/: [0-9]+/: $body.$_path/]
|
|
|
|
`took`::
|
|
|
|
The total milliseconds the entire operation took.
|
|
|
|
`timed_out`::
|
|
|
|
This flag is set to `true` if any of the requests executed during the
|
|
reindex timed out.
|
|
|
|
`total`::
|
|
|
|
The number of documents that were successfully processed.
|
|
|
|
`updated`::
|
|
|
|
The number of documents that were successfully updated.
|
|
|
|
`created`::
|
|
|
|
The number of documents that were successfully created.
|
|
|
|
`deleted`::
|
|
|
|
The number of documents that were successfully deleted.
|
|
|
|
`batches`::
|
|
|
|
The number of scroll responses pulled back by the reindex.
|
|
|
|
`noops`::
|
|
|
|
The number of documents that were ignored because the script used for
|
|
the reindex returned a `noop` value for `ctx.op`.
|
|
|
|
`version_conflicts`::
|
|
|
|
The number of version conflicts that reindex hit.
|
|
|
|
`retries`::
|
|
|
|
The number of retries attempted by reindex. `bulk` is the number of bulk
|
|
actions retried and `search` is the number of search actions retried.
|
|
|
|
`throttled_millis`::
|
|
|
|
Number of milliseconds the request slept to conform to `requests_per_second`.
|
|
|
|
`requests_per_second`::
|
|
|
|
The number of requests per second effectively executed during the reindex.
|
|
|
|
`throttled_until_millis`::
|
|
|
|
This field should always be equal to zero in a `_reindex` response. It only
|
|
has meaning when using the <<docs-reindex-task-api, Task API>>, where it
|
|
indicates the next time (in milliseconds since epoch) a throttled request will be
|
|
executed again in order to conform to `requests_per_second`.
|
|
|
|
`failures`::
|
|
|
|
Array of failures if there were any unrecoverable errors during the process. If
|
|
this is non-empty then the request aborted because of those failures. Reindex
|
|
is implemented using batches and any failure causes the entire process to abort
|
|
but all failures in the current batch are collected into the array. You can use
|
|
the `conflicts` option to prevent reindex from aborting on version conflicts.
|
|
|
|
[float]
|
|
[[docs-reindex-task-api]]
|
|
=== Works with the Task API
|
|
|
|
You can fetch the status of all running reindex requests with the
|
|
<<tasks,Task API>>:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET _tasks?detailed=true&actions=*reindex
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[skip:No tasks to retrieve]
|
|
|
|
The response looks like:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"nodes" : {
|
|
"r1A2WoRbTwKZ516z6NEs5A" : {
|
|
"name" : "r1A2WoR",
|
|
"transport_address" : "127.0.0.1:9300",
|
|
"host" : "127.0.0.1",
|
|
"ip" : "127.0.0.1:9300",
|
|
"attributes" : {
|
|
"testattr" : "test",
|
|
"portsfile" : "true"
|
|
},
|
|
"tasks" : {
|
|
"r1A2WoRbTwKZ516z6NEs5A:36619" : {
|
|
"node" : "r1A2WoRbTwKZ516z6NEs5A",
|
|
"id" : 36619,
|
|
"type" : "transport",
|
|
"action" : "indices:data/write/reindex",
|
|
"status" : { <1>
|
|
"total" : 6154,
|
|
"updated" : 3500,
|
|
"created" : 0,
|
|
"deleted" : 0,
|
|
"batches" : 4,
|
|
"version_conflicts" : 0,
|
|
"noops" : 0,
|
|
"retries": {
|
|
"bulk": 0,
|
|
"search": 0
|
|
},
|
|
"throttled_millis": 0,
|
|
"requests_per_second": -1,
|
|
"throttled_until_millis": 0
|
|
},
|
|
"description" : "",
|
|
"start_time_in_millis": 1535149899665,
|
|
"running_time_in_nanos": 5926916792,
|
|
"cancellable": true,
|
|
"headers": {}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE
|
|
<1> this object contains the actual status. It is identical to the response JSON
|
|
except for the important addition of the `total` field. `total` is the total number
|
|
of operations that the `_reindex` expects to perform. You can estimate the
|
|
progress by adding the `updated`, `created`, and `deleted` fields. The request
|
|
will finish when their sum is equal to the `total` field.
|
|
|
|
With the task id you can look up the task directly. The following example
|
|
retrieves information about the task `r1A2WoRbTwKZ516z6NEs5A:36619`:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[catch:missing]
|
|
|
|
The advantage of this API is that it integrates with `wait_for_completion=false`
|
|
to transparently return the status of completed tasks. If the task is completed
|
|
and `wait_for_completion=false` was set, it will return a
|
|
`results` or an `error` field. The cost of this feature is the document that
|
|
`wait_for_completion=false` creates at `.tasks/task/${taskId}`. It is up to
|
|
you to delete that document.
|
|
|
|
|
|
[float]
|
|
[[docs-reindex-cancel-task-api]]
|
|
=== Works with the Cancel Task API
|
|
|
|
Any Reindex can be canceled using the <<task-cancellation,Task Cancel API>>. For
|
|
example:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
The task ID can be found using the <<tasks,Tasks API>>.
|
|
|
|
Cancelation should happen quickly but might take a few seconds. The Tasks
|
|
API will continue to list the task until it wakes to cancel itself.
|
|
|
|
|
|
[float]
|
|
[[docs-reindex-rethrottle]]
|
|
=== Rethrottling
|
|
|
|
The value of `requests_per_second` can be changed on a running reindex using
|
|
the `_rethrottle` API:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
The task ID can be found using the <<tasks,tasks API>>.
|
|
|
|
Just like when setting it on the Reindex API, `requests_per_second`
|
|
can be either `-1` to disable throttling or any decimal number
|
|
like `1.7` or `12` to throttle to that level. Rethrottling that speeds up the
|
|
query takes effect immediately but rethrotting that slows down the query will
|
|
take effect on after completing the current batch. This prevents scroll
|
|
timeouts.
|
|
|
|
[float]
|
|
[[docs-reindex-change-name]]
|
|
=== Reindex to change the name of a field
|
|
|
|
`_reindex` can be used to build a copy of an index with renamed fields. Say you
|
|
create an index containing documents that look like this:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST test/_doc/1?refresh
|
|
{
|
|
"text": "words words",
|
|
"flag": "foo"
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
but you don't like the name `flag` and want to replace it with `tag`.
|
|
`_reindex` can create the other index for you:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "test"
|
|
},
|
|
"dest": {
|
|
"index": "test2"
|
|
},
|
|
"script": {
|
|
"source": "ctx._source.tag = ctx._source.remove(\"flag\")"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
Now you can get the new document:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET test2/_doc/1
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
which will return:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"found": true,
|
|
"_id": "1",
|
|
"_index": "test2",
|
|
"_type": "_doc",
|
|
"_version": 1,
|
|
"_seq_no": 44,
|
|
"_primary_term": 1,
|
|
"_source": {
|
|
"text": "words words",
|
|
"tag": "foo"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term": 1/"_primary_term" : $body._primary_term/]
|
|
|
|
[float]
|
|
[[docs-reindex-slice]]
|
|
=== Slicing
|
|
|
|
Reindex supports <<sliced-scroll>> to parallelize the reindexing process.
|
|
This parallelization can improve efficiency and provide a convenient way to
|
|
break the request down into smaller parts.
|
|
|
|
[float]
|
|
[[docs-reindex-manual-slice]]
|
|
==== Manual slicing
|
|
Slice a reindex request manually by providing a slice id and total number of
|
|
slices to each request:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter",
|
|
"slice": {
|
|
"id": 0,
|
|
"max": 2
|
|
}
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "twitter",
|
|
"slice": {
|
|
"id": 1,
|
|
"max": 2
|
|
}
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:big_twitter]
|
|
|
|
You can verify this works by:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
GET _refresh
|
|
POST new_twitter/_search?size=0&filter_path=hits.total
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
which results in a sensible `total` like this one:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
{
|
|
"hits": {
|
|
"total" : {
|
|
"value": 120,
|
|
"relation": "eq"
|
|
}
|
|
}
|
|
}
|
|
----------------------------------------------------------------
|
|
// TESTRESPONSE
|
|
|
|
[float]
|
|
[[docs-reindex-automatic-slice]]
|
|
==== Automatic slicing
|
|
|
|
You can also let `_reindex` automatically parallelize using <<sliced-scroll>> to
|
|
slice on `_uid`. Use `slices` to specify the number of slices to use:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
POST _reindex?slices=5&refresh
|
|
{
|
|
"source": {
|
|
"index": "twitter"
|
|
},
|
|
"dest": {
|
|
"index": "new_twitter"
|
|
}
|
|
}
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:big_twitter]
|
|
|
|
You can also this verify works by:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
POST new_twitter/_search?size=0&filter_path=hits.total
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
which results in a sensible `total` like this one:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
{
|
|
"hits": {
|
|
"total" : {
|
|
"value": 120,
|
|
"relation": "eq"
|
|
}
|
|
}
|
|
}
|
|
----------------------------------------------------------------
|
|
// TESTRESPONSE
|
|
|
|
Setting `slices` to `auto` will let Elasticsearch choose the number of slices
|
|
to use. This setting will use one slice per shard, up to a certain limit. If
|
|
there are multiple source indices, it will choose the number of slices based
|
|
on the index with the smallest number of shards.
|
|
|
|
Adding `slices` to `_reindex` just automates the manual process used in the
|
|
section above, creating sub-requests which means it has some quirks:
|
|
|
|
* You can see these requests in the <<docs-reindex-task-api,Tasks APIs>>. These
|
|
sub-requests are "child" tasks of the task for the request with `slices`.
|
|
* Fetching the status of the task for the request with `slices` only contains
|
|
the status of completed slices.
|
|
* These sub-requests are individually addressable for things like cancelation
|
|
and rethrottling.
|
|
* Rethrottling the request with `slices` will rethrottle the unfinished
|
|
sub-request proportionally.
|
|
* Canceling the request with `slices` will cancel each sub-request.
|
|
* Due to the nature of `slices` each sub-request won't get a perfectly even
|
|
portion of the documents. All documents will be addressed, but some slices may
|
|
be larger than others. Expect larger slices to have a more even distribution.
|
|
* Parameters like `requests_per_second` and `size` on a request with `slices`
|
|
are distributed proportionally to each sub-request. Combine that with the point
|
|
above about distribution being uneven and you should conclude that the using
|
|
`size` with `slices` might not result in exactly `size` documents being
|
|
`_reindex`ed.
|
|
* Each sub-request gets a slightly different snapshot of the source index,
|
|
though these are all taken at approximately the same time.
|
|
|
|
[float]
|
|
[[docs-reindex-picking-slices]]
|
|
===== Picking the number of slices
|
|
|
|
If slicing automatically, setting `slices` to `auto` will choose a reasonable
|
|
number for most indices. If slicing manually or otherwise tuning
|
|
automatic slicing, use these guidelines.
|
|
|
|
Query performance is most efficient when the number of `slices` is equal to the
|
|
number of shards in the index. If that number is large (e.g. 500),
|
|
choose a lower number as too many `slices` will hurt performance. Setting
|
|
`slices` higher than the number of shards generally does not improve efficiency
|
|
and adds overhead.
|
|
|
|
Indexing performance scales linearly across available resources with the
|
|
number of slices.
|
|
|
|
Whether query or indexing performance dominates the runtime depends on the
|
|
documents being reindexed and cluster resources.
|
|
|
|
[float]
|
|
=== Reindexing many indices
|
|
If you have many indices to reindex it is generally better to reindex them
|
|
one at a time rather than using a glob pattern to pick up many indices. That
|
|
way you can resume the process if there are any errors by removing the
|
|
partially completed index and starting over at that index. It also makes
|
|
parallelizing the process fairly simple: split the list of indices to reindex
|
|
and run each list in parallel.
|
|
|
|
One off bash scripts seem to work nicely for this:
|
|
|
|
[source,bash]
|
|
----------------------------------------------------------------
|
|
for index in i1 i2 i3 i4 i5; do
|
|
curl -HContent-Type:application/json -XPOST localhost:9200/_reindex?pretty -d'{
|
|
"source": {
|
|
"index": "'$index'"
|
|
},
|
|
"dest": {
|
|
"index": "'$index'-reindexed"
|
|
}
|
|
}'
|
|
done
|
|
----------------------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
[float]
|
|
=== Reindex daily indices
|
|
|
|
Notwithstanding the above advice, you can use `_reindex` in combination with
|
|
<<modules-scripting-painless, Painless>> to reindex daily indices to apply
|
|
a new template to the existing documents.
|
|
|
|
Assuming you have indices consisting of documents as follows:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
PUT metricbeat-2016.05.30/_doc/1?refresh
|
|
{"system.cpu.idle.pct": 0.908}
|
|
PUT metricbeat-2016.05.31/_doc/1?refresh
|
|
{"system.cpu.idle.pct": 0.105}
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
|
|
The new template for the `metricbeat-*` indices is already loaded into Elasticsearch,
|
|
but it applies only to the newly created indices. Painless can be used to reindex
|
|
the existing documents and apply the new template.
|
|
|
|
The script below extracts the date from the index name and creates a new index
|
|
with `-1` appended. All data from `metricbeat-2016.05.31` will be reindexed
|
|
into `metricbeat-2016.05.31-1`.
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "metricbeat-*"
|
|
},
|
|
"dest": {
|
|
"index": "metricbeat"
|
|
},
|
|
"script": {
|
|
"lang": "painless",
|
|
"source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'"
|
|
}
|
|
}
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
All documents from the previous metricbeat indices can now be found in the `*-1` indices.
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
GET metricbeat-2016.05.30-1/_doc/1
|
|
GET metricbeat-2016.05.31-1/_doc/1
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[continued]
|
|
|
|
The previous method can also be used in conjunction with <<docs-reindex-change-name, change the name of a field>>
|
|
to load only the existing data into the new index and rename any fields if needed.
|
|
|
|
[float]
|
|
=== Extracting a random subset of an index
|
|
|
|
`_reindex` can be used to extract a random subset of an index for testing:
|
|
|
|
[source,js]
|
|
----------------------------------------------------------------
|
|
POST _reindex
|
|
{
|
|
"size": 10,
|
|
"source": {
|
|
"index": "twitter",
|
|
"query": {
|
|
"function_score" : {
|
|
"query" : { "match_all": {} },
|
|
"random_score" : {}
|
|
}
|
|
},
|
|
"sort": "_score" <1>
|
|
},
|
|
"dest": {
|
|
"index": "random_twitter"
|
|
}
|
|
}
|
|
----------------------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:big_twitter]
|
|
|
|
<1> `_reindex` defaults to sorting by `_doc` so `random_score` will not have any
|
|
effect unless you override the sort to `_score`.
|