OpenSearch/docs/reference/docs/index_.asciidoc

408 lines
15 KiB
Plaintext

[[docs-index_]]
== Index API
IMPORTANT: See <<removal-of-types>>.
The index API adds or updates a JSON document in a specific index,
making it searchable. The following example inserts the JSON document
into the "twitter" index with an id of 1:
[source,js]
--------------------------------------------------
PUT twitter/_doc/1
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
--------------------------------------------------
// CONSOLE
The result of the above index operation is:
[source,js]
--------------------------------------------------
{
"_shards" : {
"total" : 2,
"failed" : 0,
"successful" : 2
},
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"result" : "created"
}
--------------------------------------------------
// TESTRESPONSE[s/"successful" : 2/"successful" : 1/]
The `_shards` header provides information about the replication process of the index operation:
`total`:: Indicates how many shard copies (primary and replica shards) the index operation should be executed on.
`successful`:: Indicates the number of shard copies the index operation succeeded on.
`failed`:: An array that contains replication-related errors in the case an index operation failed on a replica shard.
The index operation is successful in the case `successful` is at least 1.
NOTE: Replica shards may not all be started when an indexing operation successfully returns (by default, only the
primary is required, but this behavior can be <<index-wait-for-active-shards,changed>>). In that case,
`total` will be equal to the total shards based on the `number_of_replicas` setting and `successful` will be
equal to the number of shards started (primary plus replicas). If there were no failures, the `failed` will be 0.
[float]
[[index-creation]]
=== Automatic Index Creation
The index operation automatically creates an index if it does not already
exist, and applies any <<indices-templates,index templates>> that are
configured. The index operation also creates a dynamic mapping if one does not
already exist. By default, new fields and objects will automatically be added
to the mapping definition if needed. Check out the <<mapping,mapping>> section
for more information on mapping definitions, and the
<<indices-put-mapping,put mapping>> API for information about updating mappings
manually.
Automatic index creation is controlled by the `action.auto_create_index`
setting. This setting defaults to `true`, meaning that indices are always
automatically created. Automatic index creation can be permitted only for
indices matching certain patterns by changing the value of this setting to a
comma-separated list of these patterns. It can also be explicitly permitted and
forbidden by prefixing patterns in the list with a `+` or `-`. Finally it can
be completely disabled by changing this setting to `false`.
[source,js]
--------------------------------------------------
PUT _cluster/settings
{
"persistent": {
"action.auto_create_index": "twitter,index10,-index1*,+ind*" <1>
}
}
PUT _cluster/settings
{
"persistent": {
"action.auto_create_index": "false" <2>
}
}
PUT _cluster/settings
{
"persistent": {
"action.auto_create_index": "true" <3>
}
}
--------------------------------------------------
// CONSOLE
<1> Permit only the auto-creation of indices called `twitter`, `index10`, no
other index matching `index1*`, and any other index matching `ind*`. The
patterns are matched in the order in which they are given.
<2> Completely disable the auto-creation of indices.
<3> Permit the auto-creation of indices with any name. This is the default.
[float]
[[operation-type]]
=== Operation Type
The index operation also accepts an `op_type` that can be used to force
a `create` operation, allowing for "put-if-absent" behavior. When
`create` is used, the index operation will fail if a document by that id
already exists in the index.
Here is an example of using the `op_type` parameter:
[source,js]
--------------------------------------------------
PUT twitter/_doc/1?op_type=create
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
--------------------------------------------------
// CONSOLE
Another option to specify `create` is to use the following uri:
[source,js]
--------------------------------------------------
PUT twitter/_create/1
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
--------------------------------------------------
// CONSOLE
[float]
=== Automatic ID Generation
The index operation can be executed without specifying the id. In such a
case, an id will be generated automatically. In addition, the `op_type`
will automatically be set to `create`. Here is an example (note the
*POST* used instead of *PUT*):
[source,js]
--------------------------------------------------
POST twitter/_doc/
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
--------------------------------------------------
// CONSOLE
The result of the above index operation is:
[source,js]
--------------------------------------------------
{
"_shards" : {
"total" : 2,
"failed" : 0,
"successful" : 2
},
"_index" : "twitter",
"_type" : "_doc",
"_id" : "W0tpsmIBdwcYyG50zbta",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"result": "created"
}
--------------------------------------------------
// TESTRESPONSE[s/W0tpsmIBdwcYyG50zbta/$body._id/ s/"successful" : 2/"successful" : 1/]
[float]
[[optimistic-concurrency-control-index]]
=== Optimistic concurrency control
Index operations can be made conditional and only be performed if the last
modification to the document was assigned the sequence number and primary
term specified by the `if_seq_no` and `if_primary_term` parameters. If a
mismatch is detected, the operation will result in a `VersionConflictException`
and a status code of 409. See <<optimistic-concurrency-control>> for more details.
[float]
[[index-routing]]
=== Routing
By default, shard placement ? or `routing` ? is controlled by using a
hash of the document's id value. For more explicit control, the value
fed into the hash function used by the router can be directly specified
on a per-operation basis using the `routing` parameter. For example:
[source,js]
--------------------------------------------------
POST twitter/_doc?routing=kimchy
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
--------------------------------------------------
// CONSOLE
In the example above, the "_doc" document is routed to a shard based on
the `routing` parameter provided: "kimchy".
When setting up explicit mapping, the `_routing` field can be optionally
used to direct the index operation to extract the routing value from the
document itself. This does come at the (very minimal) cost of an
additional document parsing pass. If the `_routing` mapping is defined
and set to be `required`, the index operation will fail if no routing
value is provided or extracted.
[float]
[[index-distributed]]
=== Distributed
The index operation is directed to the primary shard based on its route
(see the Routing section above) and performed on the actual node
containing this shard. After the primary shard completes the operation,
if needed, the update is distributed to applicable replicas.
[float]
[[index-wait-for-active-shards]]
=== Wait For Active Shards
To improve the resiliency of writes to the system, indexing operations
can be configured to wait for a certain number of active shard copies
before proceeding with the operation. If the requisite number of active
shard copies are not available, then the write operation must wait and
retry, until either the requisite shard copies have started or a timeout
occurs. By default, write operations only wait for the primary shards
to be active before proceeding (i.e. `wait_for_active_shards=1`).
This default can be overridden in the index settings dynamically
by setting `index.write.wait_for_active_shards`. To alter this behavior
per operation, the `wait_for_active_shards` request parameter can be used.
Valid values are `all` or any positive integer up to the total number
of configured copies per shard in the index (which is `number_of_replicas+1`).
Specifying a negative value or a number greater than the number of
shard copies will throw an error.
For example, suppose we have a cluster of three nodes, `A`, `B`, and `C` and
we create an index `index` with the number of replicas set to 3 (resulting in
4 shard copies, one more copy than there are nodes). If we
attempt an indexing operation, by default the operation will only ensure
the primary copy of each shard is available before proceeding. This means
that even if `B` and `C` went down, and `A` hosted the primary shard copies,
the indexing operation would still proceed with only one copy of the data.
If `wait_for_active_shards` is set on the request to `3` (and all 3 nodes
are up), then the indexing operation will require 3 active shard copies
before proceeding, a requirement which should be met because there are 3
active nodes in the cluster, each one holding a copy of the shard. However,
if we set `wait_for_active_shards` to `all` (or to `4`, which is the same),
the indexing operation will not proceed as we do not have all 4 copies of
each shard active in the index. The operation will timeout
unless a new node is brought up in the cluster to host the fourth copy of
the shard.
It is important to note that this setting greatly reduces the chances of
the write operation not writing to the requisite number of shard copies,
but it does not completely eliminate the possibility, because this check
occurs before the write operation commences. Once the write operation
is underway, it is still possible for replication to fail on any number of
shard copies but still succeed on the primary. The `_shards` section of the
write operation's response reveals the number of shard copies on which
replication succeeded/failed.
[source,js]
--------------------------------------------------
{
"_shards" : {
"total" : 2,
"failed" : 0,
"successful" : 2
}
}
--------------------------------------------------
// NOTCONSOLE
[float]
[[index-refresh]]
=== Refresh
Control when the changes made by this request are visible to search. See
<<docs-refresh,refresh>>.
[float]
[[index-noop]]
=== Noop Updates
When updating a document using the index API a new version of the document is
always created even if the document hasn't changed. If this isn't acceptable
use the `_update` API with `detect_noop` set to true. This option isn't
available on the index API because the index API doesn't fetch the old source
and isn't able to compare it against the new source.
There isn't a hard and fast rule about when noop updates aren't acceptable.
It's a combination of lots of factors like how frequently your data source
sends updates that are actually noops and how many queries per second
Elasticsearch runs on the shard receiving the updates.
[float]
[[timeout]]
=== Timeout
The primary shard assigned to perform the index operation might not be
available when the index operation is executed. Some reasons for this
might be that the primary shard is currently recovering from a gateway
or undergoing relocation. By default, the index operation will wait on
the primary shard to become available for up to 1 minute before failing
and responding with an error. The `timeout` parameter can be used to
explicitly specify how long it waits. Here is an example of setting it
to 5 minutes:
[source,js]
--------------------------------------------------
PUT twitter/_doc/1?timeout=5m
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
--------------------------------------------------
// CONSOLE
[float]
[[index-versioning]]
=== Versioning
Each indexed document is given a version number. By default,
internal versioning is used that starts at 1 and increments
with each update, deletes included. Optionally, the version number can be
set to an external value (for example, if maintained in a
database). To enable this functionality, `version_type` should be set to
`external`. The value provided must be a numeric, long value greater than or equal to 0,
and less than around 9.2e+18.
When using the external version type, the system checks to see if
the version number passed to the index request is greater than the
version of the currently stored document. If true, the document will be
indexed and the new version number used. If the value provided is less
than or equal to the stored document's version number, a version
conflict will occur and the index operation will fail. For example:
[source,js]
--------------------------------------------------
PUT twitter/_doc/1?version=2&version_type=external
{
"message" : "elasticsearch now has versioning support, double cool!"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
*NOTE:* Versioning is completely real time, and is not affected by the
near real time aspects of search operations. If no version is provided,
then the operation is executed without any version checks.
The above will succeed since the the supplied version of 2 is higher than
the current document version of 1. If the document was already updated
and its version was set to 2 or higher, the indexing command will fail
and result in a conflict (409 http status code).
A nice side effect is that there is no need to maintain strict ordering
of async indexing operations executed as a result of changes to a source
database, as long as version numbers from the source database are used.
Even the simple case of updating the Elasticsearch index using data from
a database is simplified if external versioning is used, as only the
latest version will be used if the index operations arrive out of order for
whatever reason.
[float]
==== Version types
Next to the `external` version type explained above, Elasticsearch
also supports other types for specific use cases. Here is an overview of
the different version types and their semantics.
`internal`:: Only index the document if the given version is identical to the version
of the stored document.
`external` or `external_gt`:: Only index the document if the given version is strictly higher
than the version of the stored document *or* if there is no existing document. The given
version will be used as the new version and will be stored with the new document. The supplied
version must be a non-negative long number.
`external_gte`:: Only index the document if the given version is *equal* or higher
than the version of the stored document. If there is no existing document
the operation will succeed as well. The given version will be used as the new version
and will be stored with the new document. The supplied version must be a non-negative long number.
*NOTE*: The `external_gte` version type is meant for special use cases and
should be used with care. If used incorrectly, it can result in loss of data.
There is another option, `force`, which is deprecated because it can cause
primary and replica shards to diverge.