[[analysis-index-search-time]]
=== Index and search analysis

Text analysis occurs at two times:

Index time::
When a document is indexed, any <<text,`text`>> field values are analyzed.

Search time::
When running a <<full-text-queries,full-text search>> on a `text` field,
the query string (the text the user is searching for) is analyzed.
+
Search time is also called _query time_.

The analyzer, or set of analysis rules, used at each time is called the _index
analyzer_ or _search analyzer_ respectively.
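
To inspect the tokens an analyzer produces at either time, you can run text
through the `_analyze` API. As an illustration, this request analyzes a string
with the built-in `standard` analyzer, which splits the text into lowercased
word tokens; any other built-in or custom analyzer name works the same way:

[source,console]
------
GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes jumped over the dog!"
}
------
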
[[analysis-same-index-search-analyzer]]
==== How the index and search analyzer work together

In most cases, the same analyzer should be used at index and search time. This
ensures the values and query strings for a field are changed into the same form
of tokens. In turn, this ensures the tokens match as expected during a search.

.**Example**
[%collapsible]
====

A document is indexed with the following value in a `text` field:

[source,text]
------
The QUICK brown foxes jumped over the dog!
------

The index analyzer for the field converts the value into tokens and normalizes
them. In this case, each of the tokens represents a word:

[source,text]
------
[ quick, brown, fox, jump, over, dog ]
------

These tokens are then indexed.
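
The token stream above matches what the built-in `english` analyzer produces
for this sentence: it lowercases each word, removes the stop word `the`, and
stems `foxes` and `jumped`. Assuming that analyzer, you can reproduce the
stream with the `_analyze` API:

[source,console]
------
GET /_analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the dog!"
}
------
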
Later, a user searches the same `text` field for:

[source,text]
------
"Quick fox"
------

The user expects this search to match the sentence indexed earlier,
`The QUICK brown foxes jumped over the dog!`.

However, the query string does not contain the exact words used in the
document's original text:

* `quick` vs `QUICK`
* `fox` vs `foxes`

To account for this, the query string is analyzed using the same analyzer. This
analyzer produces the following tokens:

[source,text]
------
[ quick, fox ]
------
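
Continuing the `english` analyzer assumption from above, analyzing the query
string the same way yields the same normalized forms:

[source,console]
------
GET /_analyze
{
  "analyzer": "english",
  "text": "Quick fox"
}
------
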
To execute the search, {es} compares these query string tokens to the tokens
indexed in the `text` field.

[options="header"]
|===
|Token | Query string | `text` field
|`quick` | X | X
|`brown` | | X
|`fox` | X | X
|`jump` | | X
|`over` | | X
|`dog` | | X
|===

Because the field value and query string were analyzed in the same way, they
created similar tokens. The tokens `quick` and `fox` are exact matches. This
means the search matches the document containing `"The QUICK brown foxes jumped
over the dog!"`, just as the user expects.
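
To run such a search (a sketch, assuming a hypothetical index `my-index` with
the field named `my-text`), you could use a `match` query, which analyzes its
query string with the field's search analyzer before matching:

[source,console]
------
GET /my-index/_search
{
  "query": {
    "match": {
      "my-text": "Quick fox"
    }
  }
}
------
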
====

[[different-analyzers]]
==== When to use a different search analyzer

While less common, it sometimes makes sense to use different analyzers at index
and search time. To enable this, {es} allows you to
<<specify-search-analyzer,specify a separate search analyzer>>.

Generally, a separate search analyzer should only be specified when using the
same form of tokens for field values and query strings would create unexpected
or irrelevant search matches.
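
The mechanism is the `search_analyzer` mapping parameter, which can be set
alongside `analyzer` on a `text` field. A minimal sketch, using hypothetical
index, field, and analyzer names (the named analyzers must be built-in or
defined in the index settings):

[source,console]
------
PUT /my-index
{
  "mappings": {
    "properties": {
      "my-text": {
        "type": "text",
        "analyzer": "my_index_analyzer",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}
------
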
[[different-analyzer-ex]]
.**Example**
[%collapsible]
====
{es} is used to create a search engine that matches only words that start with
a provided prefix. For instance, a search for `tr` should return `tram` or
`trope`, but never `taxi` or `bat`.

A document is added to the search engine's index; this document contains one
such word in a `text` field:

[source,text]
------
"Apple"
------

The index analyzer for the field converts the value into tokens and normalizes
them. In this case, each of the tokens represents a potential prefix for
the word:

[source,text]
------
[ a, ap, app, appl, apple ]
------

These tokens are then indexed.
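
One way to produce prefix tokens like these (a sketch; other designs exist) is
an `edge_ngram` tokenizer limited to prefixes of one to five characters,
combined with a `lowercase` filter. You can try the combination as a transient
analyzer in the `_analyze` API:

[source,console]
------
GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 5
  },
  "filter": [ "lowercase" ],
  "text": "Apple"
}
------
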
Later, a user searches the same `text` field for:

[source,text]
------
"appli"
------

The user expects this search to match only words that start with `appli`,
such as `appliance` or `application`. The search should not match `apple`.

However, if the index analyzer is used to analyze this query string, it would
produce the following tokens:

[source,text]
------
[ a, ap, app, appl, appli ]
------

When {es} compares these query string tokens to the ones indexed for `apple`,
it finds several matches.

[options="header"]
|===
|Token | `appli` | `apple`
|`a` | X | X
|`ap` | X | X
|`app` | X | X
|`appl` | X | X
|`appli` | X |
|===

This means the search would erroneously match `apple`. Not only that, it would
match any word starting with `a`.

To fix this, you can specify a different search analyzer for query strings used
on the `text` field.

In this case, you could specify a search analyzer that produces a single token
rather than a set of prefixes:

[source,text]
------
[ appli ]
------

This query string token would only match tokens for words that start with
`appli`, which better aligns with the user's search expectations.
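
Put together (a sketch reusing the hypothetical `edge_ngram` setup and index
and field names from the earlier sketches), the index analyzer keeps producing
prefixes while the built-in `standard` analyzer, which emits `appli` as a
single lowercased token, handles query strings:

[source,console]
------
PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "prefix_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "prefix_analyzer": {
          "type": "custom",
          "tokenizer": "prefix_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my-text": {
        "type": "text",
        "analyzer": "prefix_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
------

In practice, `max_gram` would be raised to cover the longest prefix users are
expected to type.
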
====