26 KiB
layout | title | nav_order |
---|---|---|
default | Search experience | 55 |
Search experience
Expectations from search engines have evolved over the years. Just returning relevant results quickly is no longer enough for most users. OpenSearch includes many features that enhance the user’s search experience as follows:
Feature | Description |
---|---|
Autocomplete queries | Suggest phrases as the user types. |
Paginate results | Rather than a single, long list, break search results into pages. |
Scroll search | Return a large number of results in batches. |
Sort results | Allow sorting results by different criteria. |
Highlight query matches | Highlight the search term in the results. |
Autocomplete queries
Autocomplete shows suggestions to users while they type.
For example, if a user types "pop," OpenSearch provides suggestions like "popcorn" or "popsicles." These suggestions preempt your user's intention and lead them to a possible search term more quickly.
OpenSearch lets you design autocomplete that updates with each keystroke, provides a few relevant suggestions, and tolerates typos.
Implement autocomplete using one of three methods:
- Prefix matching
- Edge N-gram matching
- Completion suggesters
These methods are described in the following sections.
Prefix matching
Prefix matching finds documents that matches the last term in the query string.
For example, assume that the user types “qui” into a search UI. To autocomplete this phrase, use the match_phrase_prefix
query to search all text_entry
fields that begin with the prefix "qui."
To make the word order and relative positions flexible, specify a slop
value. To learn about the slop
option, see Other optional query fields.
Sample Request
GET shakespeare/_search
{
"query": {
"match_phrase_prefix": {
"text_entry": {
"query": "qui",
"slop": 3
}
}
}
}
Prefix matching doesn’t require any special mappings. It works with your data as-is.
However, it’s a fairly resource-intensive operation. A prefix of a
could match hundreds of thousands of terms and not be useful to your user.
To limit the impact of prefix expansion, set max_expansions
to a reasonable number. To learn about the max_expansions
option, see Optional query fields.
Sample Request
GET shakespeare/_search
{
"query": {
"match_phrase_prefix": {
"text_entry": {
"query": "qui",
"slop": 3,
"max_expansions": 10
}
}
}
}
The ease of implementing query-time autocomplete comes at the cost of performance. When implementing this feature on a large scale, we recommend an index-time solution. With an index-time solution, you might experience slower indexing, but it’s a price you pay only once and not for every query. The edge N-gram and completion suggester methods are index time.
Edge N-gram matching
During indexing, edge N-grams chop up a word into a sequence of N characters to support a faster lookup of partial search terms.
If you N-gram the word "quick," the results depend on the value of N.
N | Type | N-gram |
---|---|---|
1 | Unigram | [ q , u , i , c , k ] |
2 | Bigram | [ qu , ui , ic , ck ] |
3 | Trigram | [ qui , uic , ick ] |
4 | Four-gram | [ quic , uick ] |
5 | Five-gram | [ quick ] |
Autocomplete needs only the beginning N-grams of a search phrase, so OpenSearch uses a special type of N-gram called edge N-gram.
Edge N-gramming the word "quick" results in the following:
q
qu
qui
quic
quick
This follows the same sequence the user types.
To configure a field to use edge N-grams, create an autocomplete analyzer with an edge_ngram
filter:
Sample Request
PUT shakespeare
{
"mappings": {
"properties": {
"text_entry": {
"type": "text",
"analyzer": "autocomplete"
}
}
},
"settings": {
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"edge_ngram_filter"
]
}
}
}
}
}
This example creates the index and instantiates the edge N-gram filter and analyzer.
The edge_ngram_filter
produces edge N-grams with a minimum N-gram length of 1 (a single letter) and a maximum length of 20. So it offers suggestions for words of up to 20 letters.
The autocomplete
analyzer tokenizes a string into individual terms, lowercases the terms, and then produces edge N-grams for each term using the edge_ngram_filter
.
Use the analyze
operation to test this analyzer:
POST shakespeare/_analyze
{
"analyzer": "autocomplete",
"text": "quick"
}
It returns edge N-grams as tokens:
q
qu
qui
quic
quick
Use the standard
analyzer at search time. Otherwise, the search query splits into edge N-grams and you get results for everything that matches q
, u
, and i
.
This is one of the few occasions where you use a different analyzer on the index and query side.
Sample Request
GET shakespeare/_search
{
"query": {
"match": {
"text_entry": {
"query": "qui",
"analyzer": "standard"
}
}
}
}
Sample Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 533,
"relation": "eq"
},
"max_score": 9.712725,
"hits": [
{
"_index": "shakespeare",
"_id": "22006",
"_score": 9.712725,
"_source": {
"type": "line",
"line_id": 22007,
"play_name": "Antony and Cleopatra",
"speech_number": 12,
"line_number": "5.2.44",
"speaker": "CLEOPATRA",
"text_entry": "Quick, quick, good hands."
}
},
{
"_index": "shakespeare",
"_id": "54665",
"_score": 9.712725,
"_source": {
"type": "line",
"line_id": 54666,
"play_name": "Loves Labours Lost",
"speech_number": 21,
"line_number": "5.1.52",
"speaker": "HOLOFERNES",
"text_entry": "Quis, quis, thou consonant?"
}
}
]
}
}
Alternatively, specify the search_analyzer
in the mapping itself:
"mappings": {
"properties": {
"text_entry": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
Completion suggester
The completion suggester accepts a list of suggestions and builds them into a finite-state transducer (FST), an optimized data structure that’s essentially a graph. This data structure lives in memory and is optimized for fast prefix lookups. To learn more about FSTs, see Wikipedia.
As the user types, the completion suggester moves through the FST graph one character at a time along a matching path. After it runs out of user input, it examines the remaining endings to produce a list of suggestions.
The completion suggester makes your autocomplete solution as efficient as possible and lets you have explicit control over its suggestions.
Use a dedicated field type called completion
, which stores the FST-like data structures in the index:
PUT shakespeare
{
"mappings": {
"properties": {
"text_entry": {
"type": "completion"
}
}
}
}
To get back suggestions, use the search
endpoint with the suggest
parameter:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"prefix": "To be",
"completion": {
"field": "text_entry"
}
}
}
}
The phrase "to be" is prefix matched with the FST of the text_entry
field.
Sample Response
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"text_entry": [
{
"text": "To be",
"offset": 0,
"length": 5,
"options": [
{
"text": "To be a comrade with the wolf and owl,--",
"_index": "shakespeare",
"_id": "50652",
"_score": 1,
"_source": {
"type": "line",
"line_id": 50653,
"play_name": "King Lear",
"speech_number": 68,
"line_number": "2.4.230",
"speaker": "KING LEAR",
"text_entry": "To be a comrade with the wolf and owl,--"
}
},
{
"text": "To be a make-peace shall become my age:",
"_index": "shakespeare",
"_id": "78566",
"_score": 1,
"_source": {
"type": "line",
"line_id": 78567,
"play_name": "Richard II",
"speech_number": 20,
"line_number": "1.1.160",
"speaker": "JOHN OF GAUNT",
"text_entry": "To be a make-peace shall become my age:"
}
}
]
}
]
}
}
To specify the number of suggestions that you want to return, use the size
parameter:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"prefix": "To m",
"completion": {
"field": "text_entry",
"size": 3
}
}
}
}
Sample Response
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"text_entry": [
{
"text": "To m",
"offset": 0,
"length": 5,
"options": [
{
"text": "To make a bastard and a slave of me!",
"_index": "shakespeare",
"_id": "5369",
"_score": 4,
"_source": {
"type": "line",
"line_id": 5370,
"play_name": "Henry VI Part 1",
"speech_number": 2,
"line_number": "4.5.15",
"speaker": "JOHN TALBOT",
"text_entry": "To make a bastard and a slave of me!"
}
},
{
"text": "To make a bloody supper in the Tower.",
"_index": "shakespeare",
"_id": "12504",
"_score": 4,
"_source": {
"type": "line",
"line_id": 12505,
"play_name": "Henry VI Part 3",
"speech_number": 40,
"line_number": "5.5.85",
"speaker": "CLARENCE",
"text_entry": "To make a bloody supper in the Tower."
}
}
]
}
]
}
}
The suggest
parameter finds suggestions using only prefix matching.
For example, you don't get back "To be, or not to be," which you might want as a suggestion.
To work around this issue, manually add curated suggestions and add weights to prioritize your suggestions.
Index a document with an input suggestion and assign a weight:
PUT shakespeare/_doc/1
{
"text": "To m",
"text_entry": {
"input": [
"To be, or not to be: that is the question:"
],
"weight": 10
}
}
Perform the same search as before:
GET shakespeare/_search
{
"suggest": {
"autocomplete": {
"prefix": "To m",
"completion": {
"field": "text_entry",
"size": 3
}
}
}
}
You see the indexed document as the first result:
{
"took": 1021,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"autocomplete": [
{
"text": "To m",
"offset": 0,
"length": 5,
"options": [
{
"text": "To be, or not to be: that is the question:",
"_index": "shakespeare",
"_id": "1",
"_score": 30,
"_source": {
"text": "To me",
"text_entry": {
"input": [
"To be, or not to be: that is the question:"
],
"weight": 10
}
}
},
{
"text": "To make a bastard and a slave of me!",
"_index": "shakespeare",
"_id": "5369",
"_score": 4,
"_source": {
"type": "line",
"line_id": 5370,
"play_name": "Henry VI Part 1",
"speech_number": 2,
"line_number": "4.5.15",
"speaker": "JOHN TALBOT",
"text_entry": "To make a bastard and a slave of me!"
}
}
]
}
]
}
}
Use the term
suggester to suggest corrected spellings for individual words.
The term
suggester uses an edit distance to compute suggestions. Edit distance is the number of characters that need to be changed for a term to match.
In this example, the user misspells a search term:
GET shakespeare/_search
{
"suggest": {
"spell-check": {
"text": "lief",
"term": {
"field": "text_entry"
}
}
}
}
The term
suggester returns a list of corrections:
{
"took": 48,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"spell-check": [
{
"text": "lief",
"offset": 0,
"length": 4,
"options": [
{
"text": "lifes",
"score": 0.8,
"freq": 21
},
{
"text": "life",
"score": 0.75,
"freq": 805
},
{
"text": "lives",
"score": 0.6,
"freq": 187
},
{
"text": "liege",
"score": 0.6,
"freq": 138
},
{
"text": "lived",
"score": 0.6,
"freq": 80
}
]
}
]
}
}
The higher the score, the better the suggestion is. The frequency represents the number of times the term appears in the documents of that index.
To implement a "Did you mean suggestion
?" feature, use a phrase
suggester.
The phrase
suggester is similar to the term
suggester, except that it uses N-gram language models to suggest whole phrases instead of individual words.
Create a custom analyzer called trigram
that uses a shingle
filter. This filter is similar to the edge_ngram
filter, but it applies to words instead of letters:
PUT shakespeare
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"trigram": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle"
]
}
},
"filter": {
"shingle": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
}
}
}
},
"mappings": {
"properties": {
"text_entry": {
"type": "text",
"fields": {
"trigram": {
"type": "text",
"analyzer": "trigram"
}
}
}
}
}
}
This example includes as incorrect phrase:
POST shakespeare/_search
{
"suggest": {
"text": "That the qution",
"simple_phrase": {
"phrase": {
"field": "text_entry.trigram"
}
}
}
}
You get back the corrected phrase:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"simple_phrase": [
{
"text": "That the qution",
"offset": 0,
"length": 18,
"options": [
{
"text": "that is the question",
"score": 0.0015543294
}
]
}
]
}
}
Paginate results
The from
and size
parameters return results to your users one page at a time.
The from
parameter is the document number that you want to start showing the results from. The size
parameter is the number of results that you want to show. Together, they let you return a subset of the search results.
For example, if the value of size
is 10 and the value of from
is 0, you see the first 10 results. If you change the value of from
to 10, you see the next 10 results (because the results are zero-indexed). So, if you want to see results starting from result 11, from
must be 10.
GET shakespeare/_search
{
"from": 0,
"size": 10,
"query": {
"match": {
"play_name": "Hamlet"
}
}
}
To calculate the from
parameter relative to the page number:
from = size * (page_number - 1)
Each time the user chooses the next page of the results, your application needs to make the same search query with an incremented from
value.
You can also specify the from
and size
parameters in the search URI:
GET shakespeare/_search?from=0&size=10
If you only specify the size
parameter, the from
parameter defaults to 0.
Querying for pages deep in your results can have a significant performance impact, so OpenSearch limits this approach to 10,000 results.
The from
and size
parameters are stateless, so the results are based on the latest available data.
This can cause inconsistent pagination.
For example, assume a user stays on the first page of the results for a minute and then navigates to the second page; in that time, a new document is indexed in the background which is relevant enough to show up on the first page. In this scenario, the last result of the first page is pushed to the second page, so the user ends up seeing a result on the second page that they already saw on the first page.
Use the scroll
operation for consistent pagination. The scroll
operation keeps a search context open for a certain period of time. Any data changes do not affect the results during this time.
Scroll search
The from
and size
parameters allow you to paginate your search results, but with a limit of 10,000 results at a time.
If you need to request massive volumes of data from, for example, a machine learning job, use the scroll
operation instead. The scroll
operation allows you to request an unlimited number of results.
To use the scroll operation, add a scroll
parameter to the request header with a search context to tell OpenSearch how long you need to keep scrolling. This search context needs to be long enough to process a single batch of results.
To set the number of results that you want returned for each batch, use the size
parameter:
GET shakespeare/_search?scroll=10m
{
"size": 10000
}
OpenSearch caches the results and returns a scroll ID to access them in batches:
"_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAAUWdmpUZDhnRFBUcWFtV21nMmFwUGJEQQ=="
Pass this scroll ID to the scroll
operation to get back the next batch of results:
GET _search/scroll
{
"scroll": "10m",
"scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAAUWdmpUZDhnRFBUcWFtV21nMmFwUGJEQQ=="
}
Using this scroll ID, you get results in batches of 10,000 as long as the search context is still open. Typically, the scroll ID does not change between requests, but it can change, so make sure to always use the latest scroll ID. If you don't send the next scroll request within the set search context, the scroll
operation does not return any results.
If you expect billions of results, use a sliced scroll. Slicing allows you to perform multiple scroll operations for the same request, but in parallel. Set the ID and the maximum number of slices for the scroll:
GET shakespeare/_search?scroll=10m
{
"slice": {
"id": 0,
"max": 10
},
"query": {
"match_all": {}
}
}
With a single scroll ID, you get back 10 results. You can have up to 10 IDs. Perform the same command with ID equal to 1:
GET shakespeare/_search?scroll=10m
{
"slice": {
"id": 1,
"max": 10
},
"query": {
"match_all": {}
}
}
Close the search context when you’re done scrolling, because it continues to consume computing resources until the timeout:
DELETE _search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAAcWdmpUZDhnRFBUcWFtV21nMmFwUGJEQQ==
Sample Response
{
"succeeded": true,
"num_freed": 1
}
To close all open scroll contexts:
DELETE _search/scroll/_all
The scroll
operation corresponds to a specific timestamp. It doesn't consider documents added after that timestamp as potential results.
Because open search contexts consume a lot of memory, we suggest you don't use the scroll
operation for frequent user queries that don't need the search context open. Instead, use the sort
parameter with the search_after
parameter to scroll responses for user queries.
Sort results
Sorting allows your users to sort the results in a way that’s most meaningful to them.
By default, full-text queries sort results by the relevance score. You can choose to sort the results by any field value in either ascending or descending order.
For example, to sort results by descending order of a line_id
value:
GET shakespeare/_search
{
"query": {
"term": {
"play_name": {
"value": "Henry IV"
}
}
},
"sort": [
{
"line_id": {
"order": "desc"
}
}
]
}
The sort parameter is an array, so you can specify multiple field values in the order of their priority.
If you have two fields with the same value for line_id
, OpenSearch uses speech_number
, which is the second option for sorting.
GET shakespeare/_search
{
"query": {
"term": {
"play_name": {
"value": "Henry IV"
}
}
},
"sort": [
{
"line_id": {
"order": "desc"
}
},
{
"speech_number": {
"order": "desc"
}
}
]
}
You can continue to sort by any number of field values to get the results in just the right order. It doesn’t have to be a numerical value---you can also sort by date or timestamp fields:
"sort": [
{
"date": {
"order": "desc"
}
}
]
For numeric fields that contain an array of numbers, you can sort by avg
, sum
, and median
modes:
"sort": [
{
"price": {
"order": "asc",
"mode": "avg"
}
}
]
To sort by the minimum or maximum values, use the min
or max
modes. These modes work for both numeric and string data types.
A text field that’s analyzed cannot be used to sort documents, because the inverted index only contains the individual tokenized terms and not the entire string. So, you cannot sort by the play_name
, for example.
One workaround is map a raw version of the text field as a keyword type, so it won’t be analyzed and you have a copy of the full original version for sorting purposes.
GET shakespeare/_search
{
"query": {
"term": {
"play_name": {
"value": "Henry IV"
}
}
},
"sort": [
{
"play_name.keyword": {
"order": "desc"
}
}
]
}
You get back results sorted by the play_name
field in alphabetic order.
Use sort
with search_after
parameter for more efficient scrolling.
You get back results after the values you specify in the search_after
array.
Make sure you have the same number of values in the search_after
array as in the sort
array, also ordered in the same way.
In this case, you get back results after line_id = 3202
and speech_number = 8
.
GET shakespeare/_search
{
"query": {
"term": {
"play_name": {
"value": "Henry IV"
}
}
},
"sort": [
{
"line_id": {
"order": "desc"
}
},
{
"speech_number": {
"order": "desc"
}
}
],
"search_after": [
"3202",
"8"
]
}
Highlight query matches
Highlighting emphasizes the search term(s) in the results.
To highlight the search terms, add a highlight
parameter outside of the query block:
GET shakespeare/_search
{
"query": {
"match": {
"text_entry": "life"
}
},
"highlight": {
"fields": {
"text_entry": {}
}
}
}
For each document in the results, you get back a highlight
object that shows your search term wrapped in an em
tag:
"highlight": {
"text_entry": [
"my <em>life</em>, except my <em>life</em>."
]
}
Design your application code to parse the results from the highlight
object and perform some action on the search terms, such as changing their color, bolding, italicizing, and so on.
To change the default em
tags, use the pretag
and posttag
parameters:
GET shakespeare/_search?format=yaml
{
"query": {
"match": {
"play_name": "Henry IV"
}
},
"highlight": {
"pre_tags": [
"<strong>"
],
"post_tags": [
"</strong>"
],
"fields": {
"play_name": {}
}
}
}
The highlight
parameter highlights the original terms even when using synonyms or stemming for the search itself.