262 lines
7.3 KiB
Markdown
262 lines
7.3 KiB
Markdown
---
|
|
layout: default
|
|
title: Reindex data
|
|
nav_order: 16
|
|
---
|
|
|
|
# Reindex data
|
|
|
|
After creating an index, you might need to make an extensive change such as adding a new field to every document or combining multiple indices to form a new one. Rather than deleting your index, making the change offline, and then indexing your data all over again, you can use the `reindex` operation.
|
|
|
|
With the `reindex` operation, you can copy all or a subset of documents that you select through a query to another index. Reindex is a `POST` operation. In its most basic form, you specify a source index and a destination index.
|
|
|
|
Reindexing can be an expensive operation depending on the size of your source index. We recommend you disable replicas in your destination index by setting `number_of_replicas` to `0` and re-enable them once the reindex process is complete.
|
|
{: .note }
|
|
|
|
---
|
|
|
|
#### Table of contents
|
|
1. TOC
|
|
{:toc}
|
|
|
|
|
|
---
|
|
|
|
## Reindex all documents
|
|
|
|
You can copy all documents from one index to another.
|
|
|
|
You first need to create a destination index with your desired field mappings and settings or you can copy the ones from your source index:
|
|
|
|
```json
|
|
PUT destination
|
|
{
|
|
"mappings":{
|
|
"Add in your desired mappings"
|
|
},
|
|
"settings":{
|
|
"Add in your desired settings"
|
|
}
|
|
}
|
|
```
|
|
|
|
This `reindex` command copies all the documents from a source index to a destination index:
|
|
|
|
```json
|
|
POST _reindex
|
|
{
|
|
"source":{
|
|
"index":"source"
|
|
},
|
|
"dest":{
|
|
"index":"destination"
|
|
}
|
|
}
|
|
```
|
|
|
|
If the destination index is not already created, the `reindex` operation creates a new destination index with default configurations.
|
|
|
|
## Reindex from a remote cluster
|
|
|
|
You can copy documents from an index in a remote cluster. Use the `remote` option to specify the remote hostname and the required login credentials.
|
|
|
|
This command reaches out to a remote cluster, logs in with the username and password, and copies all the documents from the source index in that remote cluster to the destination index in your local cluster:
|
|
|
|
```json
|
|
POST _reindex
|
|
{
|
|
"source":{
|
|
"remote":{
|
|
"host":"https://<REST_endpoint_of_remote_cluster>:9200",
|
|
"username":"YOUR_USERNAME",
|
|
"password":"YOUR_PASSWORD"
|
|
}
|
|
},
|
|
"dest":{
|
|
"index":"destination"
|
|
}
|
|
}
|
|
```
|
|
|
|
You can specify the following options:
|
|
|
|
Options | Valid values | Description | Required
|
|
:--- | :--- | :---
|
|
`host` | String | The REST endpoint of the remote cluster. | Yes
|
|
`username` | String | The username to log into the remote cluster. | No
|
|
`password` | String | The password to log into the remote cluster. | No
|
|
`socket_timeout` | Time Unit | The wait time for socket reads (default 30s). | No
|
|
`connect_timeout` | Time Unit | The wait time for remote connection timeouts (default 30s). | No
|
|
|
|
|
|
## Reindex a subset of documents
|
|
|
|
You can copy a specific set of documents that match a search query.
|
|
|
|
This command copies only a subset of documents matched by a query operation to the destination index:
|
|
|
|
```json
|
|
POST _reindex
|
|
{
|
|
"source":{
|
|
"index":"source",
|
|
"query": {
|
|
"match": {
|
|
"field_name": "text"
|
|
}
|
|
}
|
|
},
|
|
"dest":{
|
|
"index":"destination"
|
|
}
|
|
}
|
|
```
|
|
|
|
For a list of all query operations, see [Full-text queries]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/full-text/).
|
|
|
|
## Combine one or more indices
|
|
|
|
You can combine documents from one or more indices by adding the source indices as a list.
|
|
|
|
This command copies all documents from two source indices to one destination index:
|
|
|
|
```json
|
|
POST _reindex
|
|
{
|
|
"source":{
|
|
"index":[
|
|
"source_1",
|
|
"source_2"
|
|
]
|
|
},
|
|
"dest":{
|
|
"index":"destination"
|
|
}
|
|
}
|
|
```
|
|
Make sure the number of shards for your source and destination indices are the same.
|
|
|
|
## Reindex only unique documents
|
|
|
|
You can copy only documents missing from a destination index by setting the `op_type` option to `create`.
|
|
In this case, if a document with the same ID already exists, the operation ignores the one from the source index.
|
|
To ignore all version conflicts of documents, set the `conflicts` option to `proceed`.
|
|
|
|
```json
|
|
POST _reindex
|
|
{
|
|
"conflicts":"proceed",
|
|
"source":{
|
|
"index":"source"
|
|
},
|
|
"dest":{
|
|
"index":"destination",
|
|
"op_type":"create"
|
|
}
|
|
}
|
|
```
|
|
|
|
## Transform documents during reindexing
|
|
|
|
You can transform your data during the reindexing process using the `script` option.
|
|
We recommend Painless for scripting in OpenSearch.
|
|
|
|
This command runs the source index through a Painless script that increments a `number` field inside an `account` object before copying it to the destination index:
|
|
|
|
```json
|
|
POST _reindex
|
|
{
|
|
"source":{
|
|
"index":"source"
|
|
},
|
|
"dest":{
|
|
"index":"destination"
|
|
},
|
|
"script":{
|
|
"lang":"painless",
|
|
"source":"ctx._account.number++"
|
|
}
|
|
}
|
|
```
|
|
|
|
You can also specify an ingest pipeline to transform your data during the reindexing process.
|
|
|
|
You would first have to create a pipeline with `processors` defined. You have a number of different `processors` available to use in your ingest pipeline.
|
|
|
|
Here's a sample ingest pipeline that defines a `split` processor that splits a `text` field based on a `space` separator and stores it in a new `word` field. The `script` processor is a Painless script that finds the length of the `word` field and stores it in a new `word_count` field. The `remove` processor removes the `test` field.
|
|
|
|
```json
|
|
PUT _ingest/pipeline/pipeline-test
|
|
{
|
|
"description": "Splits the text field into a list. Computes the length of the 'word' field and stores it in a new 'word_count' field. Removes the 'test' field.",
|
|
"processors": [
|
|
{
|
|
"split": {
|
|
"field": "text",
|
|
"separator": "\\s+",
|
|
"target_field": "word"
|
|
},
|
|
}
|
|
{
|
|
"script": {
|
|
"lang": "painless",
|
|
"source": "ctx.word_count = ctx.word.length"
|
|
}
|
|
},
|
|
{
|
|
"remove": {
|
|
"field": "test"
|
|
}
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
After creating a pipeline, you can use the `reindex` operation:
|
|
|
|
```json
|
|
POST _reindex
|
|
{
|
|
"source": {
|
|
"index": "source",
|
|
},
|
|
"dest": {
|
|
"index": "destination",
|
|
"pipeline": "pipeline-test"
|
|
}
|
|
}
|
|
```
|
|
|
|
## Update documents in the current index
|
|
|
|
To update the data in your current index itself without copying it to a different index, use the `update_by_query` operation.
|
|
|
|
The `update_by_query` operation is `POST` operation that you can perform on a single index at a time.
|
|
|
|
```json
|
|
POST <index_name>/_update_by_query
|
|
```
|
|
|
|
If you run this command with no parameters, it increments the version number for all documents in the index.
|
|
|
|
## Source index options
|
|
|
|
You can specify the following options for your source index:
|
|
|
|
Option | Valid values | Description | Required
|
|
:--- | :--- | :---
|
|
`index` | String | The name of the source index. You can provide multiple source indices as a list. | Yes
|
|
`max_docs` | Integer | The maximum number of documents to reindex. | No
|
|
`query` | Object | The search query to use for the reindex operation. | No
|
|
`size` | Integer | The number of documents to reindex. | No
|
|
`slice` | String | Specify manual or automatic slicing to parallelize reindexing. | No
|
|
|
|
## Destination index options
|
|
|
|
You can specify the following options for your destination index:
|
|
|
|
Option | Valid values | Description | Required
|
|
:--- | :--- | :---
|
|
`index` | String | The name of the destination index. | Yes
|
|
`version_type` | Enum | The version type for the indexing operation. Valid values: internal, external, external_gt, external_gte. | No
|