[[mixing-exact-search-with-stemming]]
=== Mixing exact search with stemming

When building a search application, stemming is often a must as it is desirable
for a query on `skiing` to match documents that contain `ski` or `skis`. But
what if a user wants to search for `skiing` specifically? The typical way to do
this would be to use a <<multi-fields,multi-field>> in order to have the same
content indexed in two different ways:

[source,js]
--------------------------------------------------
PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_exact": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "english",
          "fields": {
            "exact": {
              "type": "text",
              "analyzer": "english_exact"
            }
          }
        }
      }
    }
  }
}

PUT index/_doc/1
{
  "body": "Ski resort"
}

PUT index/_doc/2
{
  "body": "A pair of skis"
}

POST index/_refresh
--------------------------------------------------
// CONSOLE
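
To check how the two fields differ, the `_analyze` API can be pointed at each
of them. The following is a minimal sketch that assumes the index and mapping
created above. With the built-in `english` analyzer, `skis` should be reduced
to the token `ski` on `body`, whereas `body.exact` should keep it as `skis`:

[source,js]
--------------------------------------------------
GET index/_analyze
{
  "field": "body",
  "text": "skis"
}

GET index/_analyze
{
  "field": "body.exact",
  "text": "skis"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]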

With such a setup, searching for `ski` on `body` would return both documents:

[source,js]
--------------------------------------------------
GET index/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "query": "ski"
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[source,js]
--------------------------------------------------
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.18232156,
    "hits": [
      {
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.18232156,
        "_source": {
          "body": "Ski resort"
        }
      },
      {
        "_index": "index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.18232156,
        "_source": {
          "body": "A pair of skis"
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 2,/"took": "$body.took",/]

On the other hand, searching for `ski` on `body.exact` would only return
document `1` since the analysis chain of `body.exact` does not perform
stemming.

[source,js]
--------------------------------------------------
GET index/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "body.exact" ],
      "query": "ski"
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[source,js]
--------------------------------------------------
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.8025915,
    "hits": [
      {
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.8025915,
        "_source": {
          "body": "Ski resort"
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 1,/"took": "$body.took",/]

This is not easy to expose to end users, as we would need a way to figure out
whether they are looking for an exact match and redirect the query to the
appropriate field accordingly. And what should happen if only parts of the
query need to be matched exactly while other parts should still take stemming
into account?

Fortunately, the `query_string` and `simple_query_string` queries have a feature
that solves this exact problem: `quote_field_suffix`. This tells Elasticsearch
that words appearing between quotes are to be redirected to a different field,
whose name is the original field name with the suffix appended, see below:

[source,js]
--------------------------------------------------
GET index/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "quote_field_suffix": ".exact",
      "query": "\"ski\""
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[source,js]
--------------------------------------------------
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.8025915,
    "hits": [
      {
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.8025915,
        "_source": {
          "body": "Ski resort"
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 2,/"took": "$body.took",/]

In the above case, since `ski` was between quotes, it was searched on the
`body.exact` field due to the `quote_field_suffix` parameter, so only document
`1` matched. This allows users to mix exact search with stemmed search as they
like.
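
As an illustration of such a mix, the sketch below reuses the index above and
adds `"default_operator": "and"` (an assumption made for this example, not part
of the setup shown earlier) so that both parts of the query must match. The
quoted `ski` is looked up on `body.exact` while the unquoted `resorts` is still
stemmed on `body`, so this should match document `1` only:

[source,js]
--------------------------------------------------
GET index/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "quote_field_suffix": ".exact",
      "default_operator": "and",
      "query": "\"ski\" resorts"
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]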