205 lines
7.7 KiB
Plaintext
205 lines
7.7 KiB
Plaintext
[[docs-bulk]]
|
|
== Bulk API
|
|
|
|
The bulk API makes it possible to perform many index/delete operations
|
|
in a single API call. This can greatly increase the indexing speed.
|
|
|
|
.Client support for bulk requests
|
|
*********************************************
|
|
|
|
Some of the officially supported clients provide helpers to assist with
|
|
bulk requests and reindexing of documents from one index to another:
|
|
|
|
Perl::
|
|
|
|
See https://metacpan.org/pod/Search::Elasticsearch::Bulk[Search::Elasticsearch::Bulk]
|
|
and https://metacpan.org/pod/Search::Elasticsearch::Scroll[Search::Elasticsearch::Scroll]
|
|
|
|
Python::
|
|
|
|
See http://elasticsearch-py.readthedocs.org/en/master/helpers.html[elasticsearch.helpers.*]
|
|
|
|
*********************************************
|
|
|
|
The REST API endpoint is `/_bulk`, and it expects the following JSON
|
|
structure:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
action_and_meta_data\n
|
|
optional_source\n
|
|
action_and_meta_data\n
|
|
optional_source\n
|
|
....
|
|
action_and_meta_data\n
|
|
optional_source\n
|
|
--------------------------------------------------
|
|
|
|
*NOTE*: the final line of data must end with a newline character `\n`.
|
|
|
|
The possible actions are `index`, `create`, `delete` and `update`.
|
|
`index` and `create` expect a source on the next
|
|
line, and have the same semantics as the `op_type` parameter to the
|
|
standard index API (i.e. create will fail if a document with the same
|
|
index and type exists already, whereas index will add or replace a
|
|
document as necessary). `delete` does not expect a source on the
|
|
following line, and has the same semantics as the standard delete API.
|
|
`update` expects that the partial doc, upsert and script and its options
|
|
are specified on the next line.
|
|
|
|
If you're providing text file input to `curl`, you *must* use the
|
|
`--data-binary` flag instead of plain `-d`. The latter doesn't preserve
|
|
newlines. Example:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
$ cat requests
|
|
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
|
|
{ "field1" : "value1" }
|
|
$ curl -s -XPOST localhost:9200/_bulk --data-binary "@requests"; echo
|
|
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1}}]}
|
|
--------------------------------------------------
|
|
|
|
Because this format uses literal `\n`'s as delimiters, please be sure
|
|
that the JSON actions and sources are not pretty printed. Here is an
|
|
example of a correct sequence of bulk commands:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
|
|
{ "field1" : "value1" }
|
|
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
|
|
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
|
|
{ "field1" : "value3" }
|
|
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
|
|
{ "doc" : {"field2" : "value2"} }
|
|
--------------------------------------------------
|
|
|
|
In the above example `doc` for the `update` action is a partial
|
|
document, that will be merged with the already stored document.
|
|
|
|
The endpoints are `/_bulk`, `/{index}/_bulk`, and `{index}/{type}/_bulk`.
|
|
When the index or the index/type are provided, they will be used by
|
|
default on bulk items that don't provide them explicitly.
|
|
|
|
A note on the format. The idea here is to make processing of this as
|
|
fast as possible. As some of the actions will be redirected to other
|
|
shards on other nodes, only `action_meta_data` is parsed on the
|
|
receiving node side.
|
|
|
|
Client libraries using this protocol should try and strive to do
|
|
something similar on the client side, and reduce buffering as much as
|
|
possible.
|
|
|
|
The response to a bulk action is a large JSON structure with the
|
|
individual results of each action that was performed. The failure of a
|
|
single action does not affect the remaining actions.
|
|
|
|
There is no "correct" number of actions to perform in a single bulk
|
|
call. You should experiment with different settings to find the optimum
|
|
size for your particular workload.
|
|
|
|
If using the HTTP API, make sure that the client does not send HTTP
|
|
chunks, as this will slow things down.
|
|
|
|
[float]
|
|
[[bulk-versioning]]
|
|
=== Versioning
|
|
|
|
Each bulk item can include the version value using the
|
|
`_version`/`version` field. It automatically follows the behavior of the
|
|
index / delete operation based on the `_version` mapping. It also
|
|
support the `version_type`/`_version_type` (see <<index-versioning, versioning>>)
|
|
|
|
[float]
|
|
[[bulk-routing]]
|
|
=== Routing
|
|
|
|
Each bulk item can include the routing value using the
|
|
`_routing`/`routing` field. It automatically follows the behavior of the
|
|
index / delete operation based on the `_routing` mapping.
|
|
|
|
[float]
|
|
[[bulk-parent]]
|
|
=== Parent
|
|
|
|
Each bulk item can include the parent value using the `_parent`/`parent`
|
|
field. It automatically follows the behavior of the index / delete
|
|
operation based on the `_parent` / `_routing` mapping.
|
|
|
|
[float]
|
|
[[bulk-timestamp]]
|
|
=== Timestamp
|
|
|
|
Each bulk item can include the timestamp value using the
|
|
`_timestamp`/`timestamp` field. It automatically follows the behavior of
|
|
the index operation based on the `_timestamp` mapping.
|
|
|
|
[float]
|
|
[[bulk-ttl]]
|
|
=== TTL
|
|
|
|
Each bulk item can include the ttl value using the `_ttl`/`ttl` field.
|
|
It automatically follows the behavior of the index operation based on
|
|
the `_ttl` mapping.
|
|
|
|
[float]
|
|
[[bulk-consistency]]
|
|
=== Write Consistency
|
|
|
|
When making bulk calls, you can require a minimum number of active
|
|
shards in the partition through the `consistency` parameter. The values
|
|
allowed are `one`, `quorum`, and `all`. It defaults to the node level
|
|
setting of `action.write_consistency`, which in turn defaults to
|
|
`quorum`.
|
|
|
|
For example, in a N shards with 2 replicas index, there will have to be
|
|
at least 2 active shards within the relevant partition (`quorum`) for
|
|
the operation to succeed. In a N shards with 1 replica scenario, there
|
|
will need to be a single shard active (in this case, `one` and `quorum`
|
|
is the same).
|
|
|
|
[float]
|
|
[[bulk-refresh]]
|
|
=== Refresh
|
|
|
|
The `refresh` parameter can be set to `true` in order to refresh the relevant
|
|
primary and replica shards immediately after the bulk operation has occurred
|
|
and make it searchable, instead of waiting for the normal refresh interval to
|
|
expire. Setting it to `true` can trigger additional load, and may slow down
|
|
indexing. Due to its costly nature, the `refresh` parameter is set on the bulk request level
|
|
and is not supported on each individual bulk item.
|
|
|
|
[float]
|
|
[[bulk-update]]
|
|
=== Update
|
|
|
|
When using `update` action `_retry_on_conflict` can be used as field in
|
|
the action itself (not in the extra payload line), to specify how many
|
|
times an update should be retried in the case of a version conflict.
|
|
|
|
The `update` action payload, supports the following options: `doc`
|
|
(partial document), `upsert`, `doc_as_upsert`, `script`, `params` (for
|
|
script), `lang` (for script) and `fields`. See update documentation for details on
|
|
the options. Curl example with update actions:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
|
|
{ "doc" : {"field" : "value"} }
|
|
{ "update" : { "_id" : "0", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
|
|
{ "script" : { "inline": "ctx._source.counter += param1", "lang" : "js", "params" : {"param1" : 1}}, "upsert" : {"counter" : 1}}
|
|
{ "update" : {"_id" : "2", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
|
|
{ "doc" : {"field" : "value"}, "doc_as_upsert" : true }
|
|
{ "update" : {"_id" : "3", "_type" : "type1", "_index" : "index1", "fields" : ["_source"]} }
|
|
{ "doc" : {"field" : "value"} }
|
|
{ "update" : {"_id" : "4", "_type" : "type1", "_index" : "index1"} }
|
|
{ "doc" : {"field" : "value"}, "fields": ["_source"]}
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
[[bulk-security]]
|
|
=== Security
|
|
|
|
See <<url-access-control>>
|