OpenSearch/docs/reference/search/search-your-data/search-your-data.asciidoc

460 lines
14 KiB
Plaintext

[[search-your-data]]
= Search your data
[[search-query]]
A _search query_, or _query_, is a request for information about data in
{es} data streams or indices.
You can think of a query as a question, written in a way {es} understands.
Depending on your data, you can use a query to get answers to questions like:
* What processes on my server take longer than 500 milliseconds to respond?
* What users on my network ran `regsvr32.exe` within the last week?
* What pages on my website contain a specific word or phrase?
A _search_ consists of one or more queries that are combined and sent to {es}.
Documents that match a search's queries are returned in the _hits_, or
_search results_, of the response.
A search may also contain additional information used to better process its
queries. For example, a search may be limited to a specific index or only return
a specific number of results.
[discrete]
[[run-an-es-search]]
== Run a search
You can use the <<search-search,search API>> to search and
<<search-aggregations,aggregate>> data stored in {es} data streams or indices.
The API's `query` request body parameter accepts queries written in
<<query-dsl,Query DSL>>.
The following request searches `my-index-000001` using a
<<query-dsl-match-query,`match`>> query. This query matches documents with a
`user.id` value of `kimchy`.
[source,console]
----
GET /my-index-000001/_search
{
"query": {
"match": {
"user.id": "kimchy"
}
}
}
----
// TEST[setup:my_index]
The API response returns the top 10 documents matching the query in the
`hits.hits` property.
[source,console-result]
----
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.3862942,
"hits": [
{
"_index": "my-index-000001",
"_type": "_doc",
"_id": "kxWFcnMByiguvud1Z8vC",
"_score": 1.3862942,
"_source": {
"@timestamp": "2099-11-15T14:12:12",
"http": {
"request": {
"method": "get"
},
"response": {
"bytes": 1070000,
"status_code": 200
},
"version": "1.1"
},
"message": "GET /search HTTP/1.1 200 1070000",
"source": {
"ip": "127.0.0.1"
},
"user": {
"id": "kimchy"
}
}
}
]
}
}
----
// TESTRESPONSE[s/"took": 5/"took": "$body.took"/]
// TESTRESPONSE[s/"_id": "kxWFcnMByiguvud1Z8vC"/"_id": "$body.hits.hits.0._id"/]
[discrete]
[[common-search-options]]
=== Common search options
You can use the following options to customize your searches.
*Query DSL* +
<<query-dsl,Query DSL>> supports a variety of query types you can mix and match
to get the results you want. Query types include:
* <<query-dsl-bool-query,Boolean>> and other <<compound-queries,compound
queries>>, which let you combine queries and match results based on multiple
criteria
* <<term-level-queries,Term-level queries>> for filtering and finding exact matches
* <<full-text-queries,Full text queries>>, which are commonly used in search
engines
* <<geo-queries,Geo>> and <<shape-queries,spatial queries>>
*Aggregations* +
You can use <<search-aggregations,search aggregations>> to get statistics and
other analytics for your search results. Aggregations help you answer questions
like:
* What's the average response time for my servers?
* What are the top IP addresses hit by users on my network?
* What is the total transaction revenue by customer?
*Search multiple data streams and indices* +
You can use comma-separated values and grep-like index patterns to search
several data streams and indices in the same request. You can even boost search
results from specific indices. See <<search-multiple-indices>>.
*Paginate search results* +
By default, searches return only the top 10 matching hits. To retrieve
more or fewer documents, see <<paginate-search-results>>.
*Retrieve selected fields* +
The search response's `hit.hits` property includes the full document
<<mapping-source-field,`_source`>> for each hit. To retrieve only a subset of
the `_source` or other fields, see <<search-fields>>.
*Sort search results* +
By default, search hits are sorted by `_score`, a <<relevance-scores,relevance
score>> that measures how well each document matches the query. To customize the
calculation of these scores, use the
<<query-dsl-script-score-query,`script_score`>> query. To sort search hits by
other field values, see <<sort-search-results>>.
*Run an async search* +
{es} searches are designed to run on large volumes of data quickly, often
returning results in milliseconds. For this reason, searches are
_synchronous_ by default. The search request waits for complete results before
returning a response.
However, complete results can take longer for searches across
<<frozen-indices,frozen indices>> or <<modules-cross-cluster-search,multiple
clusters>>.
To avoid long waits, you can run an _asynchronous_, or _async_, search
instead. An <<async-search-intro,async search>> lets you retrieve partial
results for a long-running search now and get complete results later.
[discrete]
[[search-timeout]]
=== Search timeout
By default, search requests don't time out. The request waits for complete
results before returning a response.
While <<async-search-intro,async search>> is designed for long-running
searches, you can also use the `timeout` parameter to specify a duration you'd
like to wait for a search to complete. If no response is received before this
period ends, the request fails and returns an error.
[source,console]
----
GET /my-index-000001/_search
{
"timeout": "2s",
"query": {
"match": {
"user.id": "kimchy"
}
}
}
----
// TEST[setup:my_index]
To set a cluster-wide default timeout for all search requests, configure
`search.default_search_timeout` using the <<cluster-update-settings,cluster
settings API>>. This global timeout duration is used if no `timeout` argument is
passed in the request. If the global search timeout expires before the search
request finishes, the request is cancelled using <<task-cancellation,task
cancellation>>. The `search.default_search_timeout` setting defaults to `-1` (no
timeout).
[discrete]
[[global-search-cancellation]]
=== Search cancellation
You can cancel a search request using the <<task-cancellation,task management
API>>. {es} also automatically cancels a search request when your client's HTTP
connection closes. We recommend you set up your client to close HTTP connections
when a search request is aborted or times out.
[discrete]
[[track-total-hits]]
=== Track total hits
Generally the total hit count can't be computed accurately without visiting all
matches, which is costly for queries that match lots of documents. The
`track_total_hits` parameter allows you to control how the total number of hits
should be tracked.
Given that it is often enough to have a lower bound of the number of hits,
such as "there are at least 10000 hits", the default is set to `10,000`.
This means that requests will count the total hit accurately up to `10,000` hits.
It's is a good trade off to speed up searches if you don't need the accurate number
of hits after a certain threshold.
When set to `true` the search response will always track the number of hits that
match the query accurately (e.g. `total.relation` will always be equal to `"eq"`
when `track_total_hits` is set to true). Otherwise the `"total.relation"` returned
in the `"total"` object in the search response determines how the `"total.value"`
should be interpreted. A value of `"gte"` means that the `"total.value"` is a
lower bound of the total hits that match the query and a value of `"eq"` indicates
that `"total.value"` is the accurate count.
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"track_total_hits": true,
"query": {
"match" : {
"user.id" : "elkbee"
}
}
}
--------------------------------------------------
// TEST[setup:my_index]
\... returns:
[source,console-result]
--------------------------------------------------
{
"_shards": ...
"timed_out": false,
"took": 100,
"hits": {
"max_score": 1.0,
"total" : {
"value": 2048, <1>
"relation": "eq" <2>
},
"hits": ...
}
}
--------------------------------------------------
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
// TESTRESPONSE[s/"took": 100/"took": $body.took/]
// TESTRESPONSE[s/"max_score": 1\.0/"max_score": $body.hits.max_score/]
// TESTRESPONSE[s/"value": 2048/"value": $body.hits.total.value/]
// TESTRESPONSE[s/"hits": \.\.\./"hits": "$body.hits.hits"/]
<1> The total number of hits that match the query.
<2> The count is accurate (e.g. `"eq"` means equals).
It is also possible to set `track_total_hits` to an integer.
For instance the following query will accurately track the total hit count that match
the query up to 100 documents:
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"track_total_hits": 100,
"query": {
"match": {
"user.id": "elkbee"
}
}
}
--------------------------------------------------
// TEST[continued]
The `hits.total.relation` in the response will indicate if the
value returned in `hits.total.value` is accurate (`"eq"`) or a lower
bound of the total (`"gte"`).
For instance the following response:
[source,console-result]
--------------------------------------------------
{
"_shards": ...
"timed_out": false,
"took": 30,
"hits": {
"max_score": 1.0,
"total": {
"value": 42, <1>
"relation": "eq" <2>
},
"hits": ...
}
}
--------------------------------------------------
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
// TESTRESPONSE[s/"took": 30/"took": $body.took/]
// TESTRESPONSE[s/"max_score": 1\.0/"max_score": $body.hits.max_score/]
// TESTRESPONSE[s/"value": 42/"value": $body.hits.total.value/]
// TESTRESPONSE[s/"hits": \.\.\./"hits": "$body.hits.hits"/]
<1> 42 documents match the query
<2> and the count is accurate (`"eq"`)
\... indicates that the number of hits returned in the `total`
is accurate.
If the total number of hits that match the query is greater than the
value set in `track_total_hits`, the total hits in the response
will indicate that the returned value is a lower bound:
[source,console-result]
--------------------------------------------------
{
"_shards": ...
"hits": {
"max_score": 1.0,
"total": {
"value": 100, <1>
"relation": "gte" <2>
},
"hits": ...
}
}
--------------------------------------------------
// TESTRESPONSE[skip:response is already tested in the previous snippet]
<1> There are at least 100 documents that match the query
<2> This is a lower bound (`"gte"`).
If you don't need to track the total number of hits at all you can improve query
times by setting this option to `false`:
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"track_total_hits": false,
"query": {
"match": {
"user.id": "elkbee"
}
}
}
--------------------------------------------------
// TEST[continued]
\... returns:
[source,console-result]
--------------------------------------------------
{
"_shards": ...
"timed_out": false,
"took": 10,
"hits": { <1>
"max_score": 1.0,
"hits": ...
}
}
--------------------------------------------------
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
// TESTRESPONSE[s/"took": 10/"took": $body.took/]
// TESTRESPONSE[s/"max_score": 1\.0/"max_score": $body.hits.max_score/]
// TESTRESPONSE[s/"hits": \.\.\./"hits": "$body.hits.hits"/]
<1> The total number of hits is unknown.
Finally you can force an accurate count by setting `"track_total_hits"`
to `true` in the request.
[discrete]
[[quickly-check-for-matching-docs]]
=== Quickly check for matching docs
If you only want to know if there are any documents matching a
specific query, you can set the `size` to `0` to indicate that we are not
interested in the search results. You can also set `terminate_after` to `1`
to indicate that the query execution can be terminated whenever the first
matching document was found (per shard).
[source,console]
--------------------------------------------------
GET /_search?q=user.id:elkbee&size=0&terminate_after=1
--------------------------------------------------
// TEST[setup:my_index]
NOTE: `terminate_after` is always applied **after** the
<<post-filter,`post_filter`>> and stops the query as well as the aggregation
executions when enough hits have been collected on the shard. Though the doc
count on aggregations may not reflect the `hits.total` in the response since
aggregations are applied **before** the post filtering.
The response will not contain any hits as the `size` was set to `0`. The
`hits.total` will be either equal to `0`, indicating that there were no
matching documents, or greater than `0` meaning that there were at least
as many documents matching the query when it was early terminated.
Also if the query was terminated early, the `terminated_early` flag will
be set to `true` in the response.
[source,console-result]
--------------------------------------------------
{
"took": 3,
"timed_out": false,
"terminated_early": true,
"_shards": {
"total": 1,
"successful": 1,
"skipped" : 0,
"failed": 0
},
"hits": {
"total" : {
"value": 1,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 3/"took": $body.took/]
The `took` time in the response contains the milliseconds that this request
took for processing, beginning quickly after the node received the query, up
until all search related work is done and before the above JSON is returned
to the client. This means it includes the time spent waiting in thread pools,
executing a distributed search across the whole cluster and gathering all the
results.
include::collapse-search-results.asciidoc[]
include::filter-search-results.asciidoc[]
include::highlighting.asciidoc[]
include::long-running-searches.asciidoc[]
include::near-real-time.asciidoc[]
include::paginate-search-results.asciidoc[]
include::retrieve-inner-hits.asciidoc[]
include::retrieve-selected-fields.asciidoc[]
include::search-across-clusters.asciidoc[]
include::search-multiple-indices.asciidoc[]
include::search-shard-routing.asciidoc[]
include::sort-search-results.asciidoc[]