[DOCS] Streamlined GS indexing topic. (#45714)

* Streamlined GS indexing topic.

* Incorporated review feedback

* Applied formatting per the style guidelines.
debadair 2019-08-20 09:14:49 -07:00 committed by Deb Adair
parent cff09bea00
commit 9b180314e3
1 changed file with 26 additions and 54 deletions


The additional nodes are assigned unique IDs. Because you're running all three
nodes locally, they automatically join the cluster with the first node.
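+
To see the individual nodes, you can also try the cat nodes API (an optional
check, not part of the original steps; the output columns vary by version):
+
[source,js]
--------------------------------------------------
GET /_cat/nodes?v
--------------------------------------------------
// CONSOLE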
. Use the cat health API to verify that your three-node cluster is up and
running. The cat APIs return information about your cluster and indices in a
format that's easier to read than raw JSON.
+
You can interact directly with your cluster by submitting HTTP requests to
the {es} REST API:
+
[source,js]
--------------------------------------------------
GET /_cat/health?v
--------------------------------------------------
// CONSOLE
+
The response should indicate that the status of the `elasticsearch` cluster
is `green` and that it has three nodes.
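+
For reference, cat health output is a single header line plus one line per
cluster, along these lines (a sketch with illustrative values, not captured
from a live cluster):
+
[source,txt]
--------------------------------------------------
epoch      timestamp cluster       status node.total node.data ...
1565052807 00:53:27  elasticsearch green           3         3 ...
--------------------------------------------------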
[[getting-started-index]]
== Index some documents

Once you have a cluster up and running, you're ready to index some data.
There are a variety of ingest options for {es}, but in the end they all
do the same thing: put JSON documents into an {es} index.
You can do this directly with a simple PUT request that specifies
the index you want to add the document to, a unique document ID, and one or more
`"field": "value"` pairs in the request body:
[source,js]
--------------------------------------------------
PUT /customer/_doc/1
{
  "name": "John Doe"
}
--------------------------------------------------
// CONSOLE
This request automatically creates the `customer` index if it doesn't already
exist, adds a new document that has an ID of `1`, and stores and
indexes the `name` field.
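If you're working from a shell rather than the Kibana Console, the equivalent
curl command looks like this (a sketch; adjust the host and port for your
setup):

[source,sh]
--------------------------------------------------
curl -X PUT "localhost:9200/customer/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "name": "John Doe"
}
'
--------------------------------------------------
// NOTCONSOLE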
Since this is a new document, the response shows that the operation created
version 1 of the document. You can then retrieve it with a GET request, whose
response shows the original source fields that were indexed.
[float]
[[getting-started-batch-processing]]
=== Indexing documents in bulk

If you have a lot of documents to index, you can submit them in batches with
the {ref}/docs-bulk.html[bulk API]. Using bulk to batch document operations is
significantly faster than submitting requests individually as it minimizes
network roundtrips.

The optimal batch size depends on a number of factors: the document size and
complexity, the indexing and search load, and the resources available to your
cluster. A good place to start is with batches of 1,000 to 5,000 documents and
a total payload between 5MB and 15MB. From there, you can experiment to find
the sweet spot.

As a quick example, the following call indexes two documents (ID 1 - John Doe
and ID 2 - Jane Doe) in one bulk operation:
[source,js]
--------------------------------------------------
POST /customer/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
--------------------------------------------------
// CONSOLE
This example updates the first document (ID of 1) and then deletes the second document (ID of 2) in one bulk operation:
[source,js]
--------------------------------------------------
POST /customer/_bulk
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
--------------------------------------------------
// CONSOLE
// TEST[continued]
Note that the delete action has no corresponding source document after it;
deletes only require the ID of the document to be deleted.

The bulk API does not fail when one of the actions fails. If a single action
fails for whatever reason, the remaining actions are still processed. The
response reports a status for each action, in the order the actions were
submitted, so you can check whether a specific action failed.
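For illustration, a bulk response has this general shape; the exact fields
vary by version and action type (a sketch with illustrative values):

[source,js]
--------------------------------------------------
{
  "took": 30,
  "errors": false,
  "items": [
    { "update": { "_index": "customer", "_id": "1", "status": 200 } },
    { "delete": { "_index": "customer", "_id": "2", "status": 200 } }
  ]
}
--------------------------------------------------
// NOTCONSOLE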
To get some data into {es} that you can start searching and analyzing:

. Download the https://github.com/elastic/elasticsearch/blob/master/docs/src/test/resources/accounts.json?raw=true[`accounts.json`] sample data set. The documents in this randomly generated data set represent user accounts with the following information:
+
[source,js]
--------------------------------------------------
{
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email": "bradshawmckenzie@euron.com",
    "city": "Hobucken",
    "state": "CO"
}
--------------------------------------------------
// NOTCONSOLE
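+
The file is already in bulk-request format: each account document is preceded
by an action line that specifies its ID, which is why it can be posted to the
`_bulk` endpoint as-is. You can confirm this by inspecting the first two lines
(assuming a Unix shell):
+
[source,sh]
--------------------------------------------------
head -n 2 accounts.json
--------------------------------------------------
// NOTCONSOLE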
. Index the account data into the `bank` index with the following `_bulk` request:
+
[source,sh]
--------------------------------------------------
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
--------------------------------------------------
// NOTCONSOLE
+
////
This replicates the above in a document-testing friendly way but isn't visible
in the docs:
+
[source,js]
--------------------------------------------------
GET /_cat/indices?v
--------------------------------------------------
// CONSOLE
// TEST[setup:bank]
////
And the response:
+
The response indicates that 1,000 documents were indexed successfully.
+
[source,txt]
--------------------------------------------------
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open bank l7sSYV2cQXmu6_4rJWVIww 5 1 1000 0 128.6kb 128.6kb
--------------------------------------------------
// TESTRESPONSE[s/128.6kb/\\d+(\\.\\d+)?[mk]?b/]
// TESTRESPONSE[s/l7sSYV2cQXmu6_4rJWVIww/.+/ non_json]
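As an optional extra check, the cat count API reports the document count for
an index (not part of the original steps):

[source,js]
--------------------------------------------------
GET /_cat/count/bank?v
--------------------------------------------------
// CONSOLE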
[[getting-started-search]]
== Start searching