[[analysis]]
= Text analysis

[partintro]
--

_Text analysis_ is the process of converting text, such as the body of an email,
into _tokens_ or _terms_ that are added to the inverted index for searching.
Analysis is performed by an <<analysis-analyzers,_analyzer_>>, which can be
either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
defined per index.

[float]
== Index time analysis

For instance, at index time the built-in <<english-analyzer,`english`>> _analyzer_
will first convert the sentence:

[source,text]
------
"The QUICK brown foxes jumped over the lazy dog!"
------

into distinct tokens. It will then lowercase each token, remove frequent
stopwords ("the") and reduce the terms to their word stems (foxes -> fox,
jumped -> jump, lazy -> lazi). In the end, the following terms will be added
to the inverted index:

[source,text]
------
[ quick, brown, fox, jump, over, lazi, dog ]
------
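
One way to check this is to run the same sentence through the analyze API.
The request below is a minimal sketch that assumes a running cluster; it is
not part of the original example. The response should contain the terms listed
above, along with the offsets and position of each token.

[source,console]
----
# Illustrative request: analyze the example sentence with the built-in `english` analyzer
POST _analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the lazy dog!"
}
----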

[float]
[[specify-index-time-analyzer]]
=== Specifying an index time analyzer

{es} determines which index-time analyzer to use by
checking the following parameters in order:

. The <<analyzer,`analyzer`>> mapping parameter of the field
. The `default` analyzer parameter in the index settings

If none of these parameters are specified, the
<<analysis-standard-analyzer,`standard` analyzer>> is used.

[discrete]
[[specify-index-time-field-analyzer]]
==== Specify the index-time analyzer for a field

Each <<text,`text`>> field in a mapping can specify its own
<<analyzer,`analyzer`>>:

[source,console]
-------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
-------------------------

[discrete]
[[specify-index-time-default-analyzer]]
==== Specify a default index-time analyzer

When <<indices-create-index,creating an index>>, you can set a default
index-time analyzer using the `default` analyzer setting:

[source,console]
----
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "whitespace"
        }
      }
    }
  }
}
----

A default index-time analyzer is useful when mapping multiple `text` fields that
use the same analyzer. It's also used as a general fallback analyzer for both
index-time and search-time analysis.
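
One way to confirm which analyzer an index falls back to is to call the
analyze API on the index without naming an analyzer or a field. The request
below is a minimal sketch, assuming `my_index` was created with the
`whitespace` default shown above; the sample text is only illustrative.

[source,console]
----
# Illustrative request: no `analyzer` or `field` is given, so the index default applies
GET my_index/_analyze
{
  "text": "The QUICK brown foxes"
}
----

With `whitespace` as the default analyzer, the text is split on whitespace
only, so the returned tokens keep their original case.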

[float]
== Search time analysis

This same analysis process is applied to the query string at search time in
<<full-text-queries,full text queries>> like the
<<query-dsl-match-query,`match` query>>
to convert the text in the query string into terms of the same form as those
that are stored in the inverted index.

For instance, a user might search for:

[source,text]
------
"a quick fox"
------

which would be analyzed by the same `english` analyzer into the following terms:

[source,text]
------
[ quick, fox ]
------

The exact words used in the query string don't appear in the original text
(`quick` vs `QUICK`, `fox` vs `foxes`). However, because the same analyzer is
applied to both the text and the query string, the terms from the query string
exactly match the terms from the text in the inverted index, so this query
matches the example document.
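
The whole round trip can be sketched as follows. The index name
`my_english_index` and the document ID are hypothetical and chosen only for
illustration; the `title` field is assumed to use the `english` analyzer at
both index time and search time.

[source,console]
----
# Hypothetical index whose `title` field uses the `english` analyzer
PUT my_english_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

# Index the example sentence and refresh so it is searchable immediately
PUT my_english_index/_doc/1?refresh
{
  "title": "The QUICK brown foxes jumped over the lazy dog!"
}

# The query string is analyzed with the same analyzer before matching
GET my_english_index/_search
{
  "query": {
    "match": {
      "title": "a quick fox"
    }
  }
}
----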

[float]
=== Specifying a search time analyzer

Usually the same analyzer should be used both at
index time and at search time, and <<full-text-queries,full text queries>>
like the <<query-dsl-match-query,`match` query>> will use the mapping to look
up the analyzer to use for each field.

The analyzer used to search a particular field is determined by
checking the following, in order, and using the first one found:

. An `analyzer` specified in the query itself.
. The <<search-analyzer,`search_analyzer`>> mapping parameter.
. The <<analyzer,`analyzer`>> mapping parameter.
. An analyzer in the index settings called `default_search`.
. An analyzer in the index settings called `default`.
. The `standard` analyzer.
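
The sketch below shows where each of the settings above is configured. The
index name `my_search_index` and the particular analyzers chosen are
assumptions made for illustration, not recommendations.

[source,console]
----
# Hypothetical index: index-level defaults plus per-field analyzers
PUT my_search_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        },
        "default_search": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "search_analyzer": "standard"
      }
    }
  }
}

# An `analyzer` given in the query itself applies to this search only
GET my_search_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "a quick fox",
        "analyzer": "english"
      }
    }
  }
}
----

In this sketch, the `analyzer` in the `match` query is used for that search.
Without it, the `search_analyzer` of the `title` field would apply.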

--

include::analysis/overview.asciidoc[]

include::analysis/concepts.asciidoc[]

include::analysis/configure-text-analysis.asciidoc[]

include::analysis/analyzers.asciidoc[]

include::analysis/tokenizers.asciidoc[]

include::analysis/tokenfilters.asciidoc[]

include::analysis/charfilters.asciidoc[]

include::analysis/normalizers.asciidoc[]