[[analysis]]
= Analysis

[partintro]
--

_Analysis_ is the process of converting text, like the body of any email, into
_tokens_ or _terms_ which are added to the inverted index for searching.
Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
defined per index.

[float]
== Index time analysis

For instance, at index time the built-in <<english-analyzer,`english`>> _analyzer_
will first convert the sentence:

[source,text]
------
"The QUICK brown foxes jumped over the lazy dog!"
------

into distinct tokens. It will then lowercase each token, remove frequent
stopwords ("the") and reduce the terms to their word stems (foxes -> fox,
jumped -> jump, lazy -> lazi). In the end, the following terms will be added
to the inverted index:

[source,text]
------
[ quick, brown, fox, jump, over, lazi, dog ]
------

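One way to check this is the `_analyze` API, which runs an analyzer over a
piece of text and returns the tokens it produces. The request below is a
minimal sketch, assuming a running cluster. No index is needed because
`english` is a built-in analyzer:

[source,js]
-------------------------
POST _analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the lazy dog!"
}
-------------------------
// CONSOLE

The `tokens` array in the response should contain the terms listed above,
together with the position and character offsets of each term.
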
[float]
=== Specifying an index time analyzer

Each <<text,`text`>> field in a mapping can specify its own
<<analyzer,`analyzer`>>:

[source,js]
-------------------------
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}
-------------------------
// CONSOLE

At index time, if no `analyzer` has been specified for a field, Elasticsearch
looks for an analyzer in the index settings called `default`. Failing that, it
defaults to using the <<analysis-standard-analyzer,`standard` analyzer>>.

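As a sketch of this fallback, the request below defines a `default` analyzer
in the index settings, so that `text` fields without an explicit `analyzer`
would be analyzed with the `english` analyzer instead of `standard`. The index
name `my_default_index` is hypothetical and chosen only for illustration:

[source,js]
-------------------------
PUT my_default_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "english"
        }
      }
    }
  }
}
-------------------------
// CONSOLE
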

[float]
== Search time analysis

This same analysis process is applied to the query string at search time in
<<full-text-queries,full text queries>> like the
<<query-dsl-match-query,`match` query>>
to convert the text in the query string into terms of the same form as those
that are stored in the inverted index.

For instance, a user might search for:

[source,text]
------
"a quick fox"
------

which would be analyzed by the same `english` analyzer into the following terms:

[source,text]
------
[ quick, fox ]
------

The exact words used in the query string don't appear in the original text
(`quick` vs `QUICK`, `fox` vs `foxes`). However, because the same analyzer has
been applied to both the text and the query string, the terms from the query
string exactly match the terms stored in the inverted index, so this query
would match our example document.

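The whole round trip can be sketched as follows, assuming a hypothetical index
`my_english_index` with a `body` field mapped to the `english` analyzer. Both
names are made up for this example. The document is indexed with `refresh` so
that it is searchable straight away:

[source,js]
-------------------------
PUT my_english_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

PUT my_english_index/_doc/1?refresh
{
  "body": "The QUICK brown foxes jumped over the lazy dog!"
}

GET my_english_index/_search
{
  "query": {
    "match": {
      "body": "a quick fox"
    }
  }
}
-------------------------
// CONSOLE

The `match` query should return the document, because the query terms `quick`
and `fox` are both present in the inverted index for the `body` field.
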
[float]
=== Specifying a search time analyzer

Usually, the same analyzer should be used at both index time and search time.
<<full-text-queries,Full text queries>> like the
<<query-dsl-match-query,`match` query>> use the mapping to look up the
analyzer to use for each field.

The analyzer used to search a particular field is determined by looking for
the following, in this order:

* An `analyzer` specified in the query itself.
* The <<search-analyzer,`search_analyzer`>> mapping parameter.
* The <<analyzer,`analyzer`>> mapping parameter.
* An analyzer in the index settings called `default_search`.
* An analyzer in the index settings called `default`.
* The `standard` analyzer.

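Several entries in this list can be sketched in a single example. The index
name `my_search_index` and the particular analyzers are assumptions, chosen
only to show where each setting goes and not as a recommended combination.
The index settings define a `default_search` analyzer for fields that specify
no analyzer of their own, the `title` field sets both `analyzer` and
`search_analyzer`, and the `match` query passes its own `analyzer`:

[source,js]
-------------------------
PUT my_search_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default_search": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

GET my_search_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "a quick fox",
        "analyzer": "simple"
      }
    }
  }
}
-------------------------
// CONSOLE

For this query the `title` field would be searched with the `simple` analyzer.
Without the `analyzer` parameter in the query, the field's `search_analyzer`
(`standard`) would be used instead.
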
--

include::analysis/anatomy.asciidoc[]

include::analysis/testing.asciidoc[]

include::analysis/analyzers.asciidoc[]

include::analysis/normalizers.asciidoc[]

include::analysis/tokenizers.asciidoc[]

include::analysis/tokenfilters.asciidoc[]

include::analysis/charfilters.asciidoc[]