OpenSearch/docs/reference/getting-started.asciidoc

709 lines
23 KiB
Plaintext
Raw Normal View History

2014-04-18 13:51:59 -04:00
[[getting-started]]
= Getting started with {es}
2014-04-18 13:51:59 -04:00
[partintro]
--
Ready to take {es} for a test drive and see for yourself how you can use the
REST APIs to store, search, and analyze data?
2014-04-18 13:51:59 -04:00
Step through this getting started tutorial to:
2014-04-18 13:51:59 -04:00
. Get an {es} cluster up and running
. Index some sample documents
. Search for documents using the {es} query language
. Analyze the results using bucket and metrics aggregations
2014-04-18 13:51:59 -04:00
Need more context?
Check out the <<elasticsearch-intro,
{es} Introduction>> to learn the lingo and understand the basics of
how {es} works. If you're already familiar with {es} and want to see how it works
with the rest of the stack, you might want to jump to the
{stack-gs}/get-started-elastic-stack.html[Elastic Stack
Tutorial] to see how to set up a system monitoring solution with {es}, {kib},
{beats}, and {ls}.
2014-04-18 13:51:59 -04:00
TIP: The fastest way to get started with {es} is to
https://www.elastic.co/cloud/elasticsearch-service/signup[start a free 14-day
trial of {ess}] in the cloud.
--
2014-04-18 13:51:59 -04:00
2018-12-06 13:14:37 -05:00
[[getting-started-install]]
== Get {es} up and running
To take {es} for a test drive, you can create a
https://www.elastic.co/cloud/elasticsearch-service/signup[hosted deployment] on
the {ess} or set up a multi-node {es} cluster on your own
Linux, macOS, or Windows machine.
[float]
[[run-elasticsearch-hosted]]
=== Run {es} on Elastic Cloud
When you create a deployment on the {es} Service, the service provisions
a three-node {es} cluster along with Kibana and APM.
To create a deployment:
. Sign up for a https://www.elastic.co/cloud/elasticsearch-service/signup[free trial]
and verify your email address.
. Set a password for your account.
. Click **Create Deployment**.
Once you've created a deployment, you're ready to <<getting-started-index>>.
2014-04-18 13:51:59 -04:00
[float]
[[run-elasticsearch-local]]
=== Run {es} locally on Linux, macOS, or Windows
When you create a deployment on the {ess}, a master node and
two data nodes are provisioned automatically. By installing from the tar or zip
archive, you can start multiple instances of {es} locally to see how a multi-node
cluster behaves.
To run a three-node {es} cluster locally:
2014-04-18 13:51:59 -04:00
. Download the {es} archive for your OS:
+
Linux: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{version}-linux-x86_64.tar.gz[elasticsearch-{version}-linux-x86_64.tar.gz]
+
["source","sh",subs="attributes,callouts"]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{version}-linux-x86_64.tar.gz
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// NOTCONSOLE
+
macOS: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{version}-darwin-x86_64.tar.gz[elasticsearch-{version}-darwin-x86_64.tar.gz]
+
["source","sh",subs="attributes,callouts"]
--------------------------------------------------
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{version}-darwin-x86_64.tar.gz
--------------------------------------------------
// NOTCONSOLE
+
Windows:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{version}-windows-x86_64.zip[elasticsearch-{version}-windows-x86_64.zip]
. Extract the archive:
+
Linux:
+
["source","sh",subs="attributes,callouts"]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
tar -xvf elasticsearch-{version}-linux-x86_64.tar.gz
2014-04-18 13:51:59 -04:00
--------------------------------------------------
+
macOS:
+
["source","sh",subs="attributes,callouts"]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
tar -xvf elasticsearch-{version}-darwin-x86_64.tar.gz
--------------------------------------------------
+
Windows PowerShell:
+
["source","powershell",subs="attributes,callouts"]
--------------------------------------------------
Expand-Archive elasticsearch-{version}-windows-x86_64.zip
2014-04-18 13:51:59 -04:00
--------------------------------------------------
. Start {es} from the `bin` directory:
+
Linux and macOS:
+
["source","sh",subs="attributes,callouts"]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
cd elasticsearch-{version}/bin
2014-04-18 13:51:59 -04:00
./elasticsearch
--------------------------------------------------
+
Windows:
+
["source","powershell",subs="attributes,callouts"]
--------------------------------------------------
cd elasticsearch-{version}\bin
.\elasticsearch.bat
--------------------------------------------------
+
You now have a single-node {es} cluster up and running!
. Start two more instances of {es} so you can see how a typical multi-node
cluster behaves. You need to specify unique data and log paths
for each node.
+
Linux and macOS:
+
["source","sh",subs="attributes,callouts"]
--------------------------------------------------
./elasticsearch -Epath.data=data2 -Epath.logs=log2
./elasticsearch -Epath.data=data3 -Epath.logs=log3
--------------------------------------------------
+
Windows:
+
["source","powershell",subs="attributes,callouts"]
--------------------------------------------------
.\elasticsearch.bat -E path.data=data2 -E path.logs=log2
.\elasticsearch.bat -E path.data=data3 -E path.logs=log3
--------------------------------------------------
+
The additional nodes are assigned unique IDs. Because you're running all three
nodes locally, they automatically join the cluster with the first node.
. Use the cat health API to verify that your three-node cluster is up running.
The cat APIs return information about your cluster and indices in a
format that's easier to read than raw JSON.
+
You can interact directly with your cluster by submitting HTTP requests to
the {es} REST API. Most of the examples in this guide enable you to copy the
appropriate cURL command and submit the request to your local {es} instance from
the command line. If you have Kibana installed and running, you can also
open Kibana and submit requests through the Dev Console.
+
TIP: You'll want to check out the
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} language
clients] when you're ready to start using {es} in your own applications.
+
[source,console]
--------------------------------------------------
GET /_cat/health?v
--------------------------------------------------
+
The response should indicate that the status of the `elasticsearch` cluster
is `green` and it has three nodes:
+
[source,txt]
--------------------------------------------------
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1565052807 00:53:27 elasticsearch green 3 3 6 3 0 0 0 0 - 100.0%
--------------------------------------------------
// TESTRESPONSE[s/1565052807 00:53:27 elasticsearch/\\d+ \\d+:\\d+:\\d+ integTest/]
// TESTRESPONSE[s/3 3 6 3/\\d+ \\d+ \\d+ \\d+/]
// TESTRESPONSE[s/0 0 -/0 \\d+ (-|\\d+(\.\\d+)?(micros|ms|s))/]
// TESTRESPONSE[non_json]
+
NOTE: The cluster status will remain yellow if you are only running a single
instance of {es}. A single node cluster is fully functional, but data
cannot be replicated to another node to provide resiliency. Replica shards must
be available for the cluster status to be green. If the cluster status is red,
some data is unavailable.
[float]
[[gs-other-install]]
=== Other installation options
Installing {es} from an archive file enables you to easily install and run
multiple instances locally so you can try things out. To run a single instance,
you can run {es} in a Docker container, install {es} using the DEB or RPM
packages on Linux, install using Homebrew on macOS, or install using the MSI
package installer on Windows. See <<install-elasticsearch>> for more information.
2014-04-18 13:51:59 -04:00
[[getting-started-index]]
== Index some documents
2014-04-18 13:51:59 -04:00
Once you have a cluster up and running, you're ready to index some data.
There are a variety of ingest options for {es}, but in the end they all
do the same thing: put JSON documents into an {es} index.
2014-04-18 13:51:59 -04:00
You can do this directly with a simple PUT request that specifies
the index you want to add the document, a unique document ID, and one or more
`"field": "value"` pairs in the request body:
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
PUT /customer/_doc/1
{
"name": "John Doe"
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
This request automatically creates the `customer` index if it doesn't already
exist, adds a new document that has an ID of `1`, and stores and
indexes the `name` field.
Since this is a new document, the response shows that the result of the
operation was that version 1 of the document was created:
2014-04-18 13:51:59 -04:00
[source,console-result]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
{
"_index" : "customer",
"_type" : "_doc",
2014-04-18 13:51:59 -04:00
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 26,
"_primary_term" : 4
2014-04-18 13:51:59 -04:00
}
--------------------------------------------------
// TESTRESPONSE[s/"_seq_no" : \d+/"_seq_no" : $body._seq_no/]
// TESTRESPONSE[s/"successful" : \d+/"successful" : $body._shards.successful/]
// TESTRESPONSE[s/"_primary_term" : \d+/"_primary_term" : $body._primary_term/]
2014-04-18 13:51:59 -04:00
The new document is available immediately from any node in the cluster.
You can retrieve it with a GET request that specifies its document ID:
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /customer/_doc/1
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
The response indicates that a document with the specified ID was found
and shows the original source fields that were indexed.
2014-04-18 13:51:59 -04:00
[source,console-result]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
{
"_index" : "customer",
"_type" : "_doc",
2014-04-18 13:51:59 -04:00
"_id" : "1",
"_version" : 1,
"_seq_no" : 26,
"_primary_term" : 4,
"found" : true,
"_source" : {
"name": "John Doe"
}
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TESTRESPONSE[s/"_seq_no" : \d+/"_seq_no" : $body._seq_no/ ]
// TESTRESPONSE[s/"_primary_term" : \d+/"_primary_term" : $body._primary_term/]
2014-04-18 13:51:59 -04:00
[float]
2018-12-06 13:14:37 -05:00
[[getting-started-batch-processing]]
=== Indexing documents in bulk
2014-04-18 13:51:59 -04:00
If you have a lot of documents to index, you can submit them in batches with
the {ref}/docs-bulk.html[bulk API]. Using bulk to batch document
operations is significantly faster than submitting requests individually as it minimizes network roundtrips.
2014-04-18 13:51:59 -04:00
The optimal batch size depends on a number of factors: the document size and complexity, the indexing and search load, and the resources available to your cluster. A good place to start is with batches of 1,000 to 5,000 documents
and a total payload between 5MB and 15MB. From there, you can experiment
to find the sweet spot.
2014-04-18 13:51:59 -04:00
To get some data into {es} that you can start searching and analyzing:
2014-04-18 13:51:59 -04:00
. Download the https://github.com/elastic/elasticsearch/blob/master/docs/src/test/resources/accounts.json?raw=true[`accounts.json`] sample data set. The documents in this randomly-generated data set represent user accounts with the following information:
+
[source,js]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
{
"account_number": 0,
"balance": 16623,
"firstname": "Bradshaw",
"lastname": "Mckenzie",
"age": 29,
"gender": "F",
"address": "244 Columbus Place",
"employer": "Euron",
"email": "bradshawmckenzie@euron.com",
"city": "Hobucken",
"state": "CO"
}
--------------------------------------------------
// NOTCONSOLE
2014-04-18 13:51:59 -04:00
. Index the account data into the `bank` index with the following `_bulk` request:
+
2014-04-18 13:51:59 -04:00
[source,sh]
--------------------------------------------------
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// NOTCONSOLE
+
////
This replicates the above in a document-testing friendly way but isn't visible
in the docs:
+
[source,console]
--------------------------------------------------
GET /_cat/indices?v
--------------------------------------------------
// TEST[setup:bank]
////
+
The response indicates that 1,000 documents were indexed successfully.
+
Enforce that responses in docs are valid json (#26249) All of the snippets in our docs marked with `// TESTRESPONSE` are checked against the response from Elasticsearch but, due to the way they are implemented they are actually parsed as YAML instead of JSON. Luckilly, all valid JSON is valid YAML! Unfurtunately that means that invalid JSON has snuck into the exmples! This adds a step during the build to parse them as JSON and fail the build if they don't parse. But no! It isn't quite that simple. The displayed text of some of these responses looks like: ``` { ... "aggregations": { "range": { "buckets": [ { "to": 1.4436576E12, "to_as_string": "10-2015", "doc_count": 7, "key": "*-10-2015" }, { "from": 1.4436576E12, "from_as_string": "10-2015", "doc_count": 0, "key": "10-2015-*" } ] } } } ``` Note the `...` which isn't valid json but we like it anyway and want it in the output. We use substitution rules to convert the `...` into the response we expect. That yields a response that looks like: ``` { "took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits, "aggregations": { "range": { "buckets": [ { "to": 1.4436576E12, "to_as_string": "10-2015", "doc_count": 7, "key": "*-10-2015" }, { "from": 1.4436576E12, "from_as_string": "10-2015", "doc_count": 0, "key": "10-2015-*" } ] } } } ``` That is what the tests consume but it isn't valid JSON! Oh no! We don't want to go update all the substitution rules because that'd be huge and, ultimately, wouldn't buy much. So we quote the `$body.took` bits before parsing the JSON. Note the responses that we use for the `_cat` APIs are all converted into regexes and there is no expectation that they are valid JSON. Closes #26233
2017-08-17 09:02:10 -04:00
[source,txt]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open bank l7sSYV2cQXmu6_4rJWVIww 5 1 1000 0 128.6kb 128.6kb
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TESTRESPONSE[s/128.6kb/\\d+(\\.\\d+)?[mk]?b/]
// TESTRESPONSE[s/l7sSYV2cQXmu6_4rJWVIww/.+/ non_json]
2014-04-18 13:51:59 -04:00
[[getting-started-search]]
== Start searching
2014-04-18 13:51:59 -04:00
Once you have ingested some data into an {es} index, you can search it
by sending requests to the `_search` endpoint. To access the full suite of
search capabilities, you use the {es} Query DSL to specify the
search criteria in the request body. You specify the name of the index you
want to search in the request URI.
2014-04-18 13:51:59 -04:00
For example, the following request retrieves all documents in the `bank`
index sorted by account number:
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
]
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
By default, the `hits` section of the response includes the first 10 documents
that match the search criteria:
2014-04-18 13:51:59 -04:00
[source,console-result]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
{
"took" : 63,
2014-04-18 13:51:59 -04:00
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
Add a shard filter search phase to pre-filter shards based on query rewriting (#25658) Today if we search across a large amount of shards we hit every shard. Yet, it's quite common to search across an index pattern for time based indices but filtering will exclude all results outside a certain time range ie. `now-3d`. While the search can potentially hit hundreds of shards the majority of the shards might yield 0 results since there is not document that is within this date range. Kibana for instance does this regularly but used `_field_stats` to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results. This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.
2017-07-12 16:19:20 -04:00
"skipped" : 0,
2014-04-18 13:51:59 -04:00
"failed" : 0
},
"hits" : {
"total" : {
"value": 1000,
"relation": "eq"
},
"max_score" : null,
2014-04-18 13:51:59 -04:00
"hits" : [ {
"_index" : "bank",
"_type" : "_doc",
"_id" : "0",
"sort": [0],
"_score" : null,
"_source" : {"account_number":0,"balance":16623,"firstname":"Bradshaw","lastname":"Mckenzie","age":29,"gender":"F","address":"244 Columbus Place","employer":"Euron","email":"bradshawmckenzie@euron.com","city":"Hobucken","state":"CO"}
2014-04-18 13:51:59 -04:00
}, {
"_index" : "bank",
"_type" : "_doc",
"_id" : "1",
"sort": [1],
"_score" : null,
"_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
}, ...
]
}
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TESTRESPONSE[s/"took" : 63/"took" : $body.took/]
// TESTRESPONSE[s/\.\.\./$body.hits.hits.2, $body.hits.hits.3, $body.hits.hits.4, $body.hits.hits.5, $body.hits.hits.6, $body.hits.hits.7, $body.hits.hits.8, $body.hits.hits.9/]
The response also provides the following information about the search request:
2014-04-18 13:51:59 -04:00
* `took` how long it took {es} to run the query, in milliseconds
* `timed_out` whether or not the search request timed out
* `_shards` how many shards were searched and a breakdown of how many shards
succeeded, failed, or were skipped.
* `max_score` the score of the most relevant document found
* `hits.total.value` - how many matching documents were found
* `hits.sort` - the document's sort position (when not sorting by relevance score)
* `hits._score` - the document's relevance score (not applicable when using `match_all`)
2014-04-18 13:51:59 -04:00
Each search request is self-contained: {es} does not maintain any
state information across requests. To page through the search hits, specify
the `from` and `size` parameters in your request.
2014-04-18 13:51:59 -04:00
For example, the following request gets hits 10 through 19:
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
],
2014-04-18 13:51:59 -04:00
"from": 10,
"size": 10
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
Now that you've seen how to submit a basic search request, you can start to
construct queries that are a bit more interesting than `match_all`.
2014-04-18 13:51:59 -04:00
To search for specific terms within a field, you can use a `match` query.
For example, the following request searches the `address` field to find
customers whose addresses contain `mill` or `lane`:
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"query": { "match": { "address": "mill lane" } }
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
To perform a phrase search rather than matching individual terms, you use
`match_phrase` instead of `match`. For example, the following request only
matches addresses that contain the phrase `mill lane`:
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"query": { "match_phrase": { "address": "mill lane" } }
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
To construct more complex queries, you can use a `bool` query to combine
multiple query criteria. You can designate criteria as required (must match),
desirable (should match), or undesirable (must not match).
2014-04-18 13:51:59 -04:00
For example, the following request searches the `bank` index for accounts that
belong to customers who are 40 years old, but excludes anyone who lives in
Idaho (ID):
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"query": {
"bool": {
"must": [
{ "match": { "age": "40" } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
Each `must`, `should`, and `must_not` element in a Boolean query is referred
to as a query clause. How well a document meets the criteria in each `must` or
`should` clause contributes to the document's _relevance score_. The higher the
score, the better the document matches your search criteria. By default, {es}
returns documents ranked by these relevance scores.
2014-04-18 13:51:59 -04:00
The criteria in a `must_not` clause is treated as a _filter_. It affects whether
or not the document is included in the results, but does not contribute to
how documents are scored. You can also explicitly specify arbitrary filters to
include or exclude documents based on structured data.
2014-04-18 13:51:59 -04:00
For example, the following request uses a range filter to limit the results to
accounts with a balance between $20,000 and $30,000 (inclusive).
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"query": {
"bool": {
"must": { "match_all": {} },
2014-04-18 13:51:59 -04:00
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
2018-12-06 13:14:37 -05:00
[[getting-started-aggregations]]
== Analyze results with aggregations
2014-04-18 13:51:59 -04:00
{es} aggregations enable you to get meta-information about your search results
and answer questions like, "How many account holders are in Texas?" or
"What's the average balance of accounts in Tennessee?" You can search
documents, filter hits, and use aggregations to analyze the results all in one
request.
2014-04-18 13:51:59 -04:00
For example, the following request uses a `terms` aggregation to group
all of the accounts in the `bank` index by state, and returns the ten states
with the most accounts in descending order:
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
2014-04-18 13:51:59 -04:00
}
}
}
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
The `buckets` in the response are the values of the `state` field. The
`doc_count` shows the number of accounts in each state. For example, you
can see that there are 27 accounts in `ID` (Idaho). Because the request
set `size=0`, the response only contains the aggregation results.
2014-04-18 13:51:59 -04:00
[source,console-result]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
Add a shard filter search phase to pre-filter shards based on query rewriting (#25658) Today if we search across a large amount of shards we hit every shard. Yet, it's quite common to search across an index pattern for time based indices but filtering will exclude all results outside a certain time range ie. `now-3d`. While the search can potentially hit hundreds of shards the majority of the shards might yield 0 results since there is not document that is within this date range. Kibana for instance does this regularly but used `_field_stats` to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results. This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.
2017-07-12 16:19:20 -04:00
"skipped" : 0,
"failed": 0
},
2014-04-18 13:51:59 -04:00
"hits" : {
"total" : {
"value": 1000,
"relation": "eq"
},
"max_score" : null,
2014-04-18 13:51:59 -04:00
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound": 20,
"sum_other_doc_count": 770,
2014-04-18 13:51:59 -04:00
"buckets" : [ {
"key" : "ID",
"doc_count" : 27
2014-04-18 13:51:59 -04:00
}, {
"key" : "TX",
"doc_count" : 27
2014-04-18 13:51:59 -04:00
}, {
"key" : "AL",
"doc_count" : 25
2014-04-18 13:51:59 -04:00
}, {
"key" : "MD",
"doc_count" : 25
2014-04-18 13:51:59 -04:00
}, {
"key" : "TN",
"doc_count" : 23
2014-04-18 13:51:59 -04:00
}, {
"key" : "MA",
"doc_count" : 21
2014-04-18 13:51:59 -04:00
}, {
"key" : "NC",
"doc_count" : 21
2014-04-18 13:51:59 -04:00
}, {
"key" : "ND",
"doc_count" : 21
2014-04-18 13:51:59 -04:00
}, {
"key" : "ME",
"doc_count" : 20
2014-04-18 13:51:59 -04:00
}, {
"key" : "MO",
"doc_count" : 20
2014-04-18 13:51:59 -04:00
} ]
}
}
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 29/"took": $body.took/]
2014-04-18 13:51:59 -04:00
You can combine aggregations to build more complex summaries of your data. For
example, the following request nests an `avg` aggregation within the previous
`group_by_state` aggregation to calculate the average account balances for
each state.
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
2014-04-18 13:51:59 -04:00
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
Instead of sorting the results by count, you could sort using the result of
the nested aggregation by specifying the order within the `terms` aggregation:
2014-04-18 13:51:59 -04:00
[source,console]
2014-04-18 13:51:59 -04:00
--------------------------------------------------
GET /bank/_search
2014-04-18 13:51:59 -04:00
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
2014-04-18 13:51:59 -04:00
"order": {
"average_balance": "desc"
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
2014-04-18 13:51:59 -04:00
--------------------------------------------------
// TEST[continued]
2014-04-18 13:51:59 -04:00
In addition to basic bucketing and metrics aggregations like these, {es}
provides specialized aggregations for operating on multiple fields and
analyzing particular types of data such as dates, IP addresses, and geo
data. You can also feed the results of individual aggregations into pipeline
aggregations for further analysis.
2014-04-18 13:51:59 -04:00
The core analysis capabilities provided by aggregations enable advanced
features such as using machine learning to detect anomalies.
2014-04-18 13:51:59 -04:00
[[getting-started-next-steps]]
== Where to go from here
Now that you've set up a cluster, indexed some documents, and run some
searches and aggregations, you might want to:
* {stack-gs}/get-started-elastic-stack.html#install-kibana[Dive in to the Elastic
Stack Tutorial] to install Kibana, Logstash, and Beats and
set up a basic system monitoring solution.
* {kibana-ref}/add-sample-data.html[Load one of the sample data sets into Kibana]
to see how you can use {es} and Kibana together to visualize your data.
2014-04-18 13:51:59 -04:00
* Try out one of the Elastic search solutions:
** https://swiftype.com/documentation/site-search/crawler-quick-start[Site Search]
** https://swiftype.com/documentation/app-search/getting-started[App Search]
** https://swiftype.com/documentation/enterprise-search/getting-started[Enterprise Search]