[DOCS] Rewrite analysis intro (#51184)
* [DOCS] Rewrite analysis intro. Move index/search analysis content.
* Rewrites 'Text analysis' page intro as high-level definition. Adds guidance on when users should configure text analysis
* Rewrites and splits index/search analysis content:
  * Conceptual content -> 'Index and search analysis' under 'Concepts'
  * Task-based content -> 'Specify an analyzer' under 'Configure...'
* Adds detailed examples for when to use the same index/search analyzer and when not.
* Adds new example snippets for specifying search analyzers
* clarifications
* Add toc. Decrement headings.
* Reword 'When to configure' section
* Remove sentence from tip
This commit is contained in:
parent
f373020349
commit
4fcf5a9de4
|
@ -4,141 +4,40 @@
|
|||
[partintro]
|
||||
--
|
||||
|
||||
_Text analysis_ is the process of converting text, like the body of any email,
|
||||
into _tokens_ or _terms_ which are added to the inverted index for searching.
|
||||
Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
|
||||
either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
|
||||
defined per index.
|
||||
_Text analysis_ is the process of converting unstructured text, like
|
||||
the body of an email or a product description, into a structured format that's
|
||||
optimized for search.
|
||||
|
||||
[float]
|
||||
== Index time analysis
|
||||
[[when-to-configure-analysis]]
|
||||
=== When to configure text analysis
|
||||
|
||||
For instance, at index time the built-in <<english-analyzer,`english`>> _analyzer_
|
||||
will first convert the sentence:
|
||||
{es} performs text analysis when indexing or searching <<text,`text`>> fields.
|
||||
|
||||
[source,text]
|
||||
------
|
||||
"The QUICK brown foxes jumped over the lazy dog!"
|
||||
------
|
||||
If your index doesn't contain `text` fields, no further setup is needed; you can
|
||||
skip the pages in this section.
|
||||
|
||||
into distinct tokens. It will then lowercase each token, remove frequent
|
||||
stopwords ("the") and reduce the terms to their word stems (foxes -> fox,
|
||||
jumped -> jump, lazy -> lazi). In the end, the following terms will be added
|
||||
to the inverted index:
|
||||
However, if you use `text` fields or your text searches aren't returning results
|
||||
as expected, configuring text analysis can often help. You should also look into
|
||||
analysis configuration if you're using {es} to:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
[ quick, brown, fox, jump, over, lazi, dog ]
|
||||
------
|
||||
* Build a search engine
|
||||
* Mine unstructured data
|
||||
* Fine-tune search for a specific language
|
||||
* Perform lexicographic or linguistic research
|
||||
|
||||
[float]
|
||||
[[specify-index-time-analyzer]]
|
||||
=== Specifying an index time analyzer
|
||||
[[analysis-toc]]
|
||||
=== In this section
|
||||
|
||||
{es} determines which index-time analyzer to use by
|
||||
checking the following parameters in order:
|
||||
|
||||
. The <<analyzer,`analyzer`>> mapping parameter of the field
|
||||
. The `default` analyzer parameter in the index settings
|
||||
|
||||
If none of these parameters are specified, the
|
||||
<<analysis-standard-analyzer,`standard` analyzer>> is used.
|
||||
|
||||
[discrete]
|
||||
[[specify-index-time-field-analyzer]]
|
||||
==== Specify the index-time analyzer for a field
|
||||
|
||||
Each <<text,`text`>> field in a mapping can specify its own
|
||||
<<analyzer,`analyzer`>>:
|
||||
|
||||
[source,console]
|
||||
-------------------------
|
||||
PUT my_index
|
||||
{
|
||||
"mappings": {
|
||||
"properties": {
|
||||
"title": {
|
||||
"type": "text",
|
||||
"analyzer": "standard"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
-------------------------
|
||||
|
||||
[discrete]
|
||||
[[specify-index-time-default-analyzer]]
|
||||
==== Specify a default index-time analyzer
|
||||
|
||||
When <<indices-create-index,creating an index>>, you can set a default
|
||||
index-time analyzer using the `default` analyzer setting:
|
||||
|
||||
[source,console]
|
||||
----
|
||||
PUT my_index
|
||||
{
|
||||
"settings": {
|
||||
"analysis": {
|
||||
"analyzer": {
|
||||
"default": {
|
||||
"type": "whitespace"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
----
|
||||
|
||||
A default index-time analyzer is useful when mapping multiple `text` fields that
|
||||
use the same analyzer. It's also used as a general fallback analyzer for both
|
||||
index-time and search-time analysis.
|
||||
|
||||
[float]
|
||||
== Search time analysis
|
||||
|
||||
This same analysis process is applied to the query string at search time in
|
||||
<<full-text-queries,full text queries>> like the
|
||||
<<query-dsl-match-query,`match` query>>
|
||||
to convert the text in the query string into terms of the same form as those
|
||||
that are stored in the inverted index.
|
||||
|
||||
For instance, a user might search for:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
"a quick fox"
|
||||
------
|
||||
|
||||
which would be analysed by the same `english` analyzer into the following terms:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
[ quick, fox ]
|
||||
------
|
||||
|
||||
Even though the exact words used in the query string don't appear in the
|
||||
original text (`quick` vs `QUICK`, `fox` vs `foxes`), because we have applied
|
||||
the same analyzer to both the text and the query string, the terms from the
|
||||
query string exactly match the terms from the text in the inverted index,
|
||||
which means that this query would match our example document.
|
||||
|
||||
[float]
|
||||
=== Specifying a search time analyzer
|
||||
|
||||
Usually the same analyzer should be used both at
|
||||
index time and at search time, and <<full-text-queries,full text queries>>
|
||||
like the <<query-dsl-match-query,`match` query>> will use the mapping to look
|
||||
up the analyzer to use for each field.
|
||||
|
||||
The analyzer to use to search a particular field is determined by
|
||||
looking for:
|
||||
|
||||
* An `analyzer` specified in the query itself.
|
||||
* The <<search-analyzer,`search_analyzer`>> mapping parameter.
|
||||
* The <<analyzer,`analyzer`>> mapping parameter.
|
||||
* An analyzer in the index settings called `default_search`.
|
||||
* An analyzer in the index settings called `default`.
|
||||
* The `standard` analyzer.
|
||||
* <<analysis-overview>>
|
||||
* <<analysis-concepts>>
|
||||
* <<configure-text-analysis>>
|
||||
* <<analysis-analyzers>>
|
||||
* <<analysis-tokenizers>>
|
||||
* <<analysis-tokenfilters>>
|
||||
* <<analysis-charfilters>>
|
||||
* <<analysis-normalizers>>
|
||||
|
||||
--
|
||||
|
||||
|
@ -156,5 +55,4 @@ include::analysis/tokenfilters.asciidoc[]
|
|||
|
||||
include::analysis/charfilters.asciidoc[]
|
||||
|
||||
include::analysis/normalizers.asciidoc[]
|
||||
|
||||
include::analysis/normalizers.asciidoc[]
|
|
@ -7,5 +7,7 @@
|
|||
This section explains the fundamental concepts of text analysis in {es}.
|
||||
|
||||
* <<analyzer-anatomy>>
|
||||
* <<analysis-index-search-time>>
|
||||
|
||||
include::anatomy.asciidoc[]
|
||||
include::anatomy.asciidoc[]
|
||||
include::index-search-time.asciidoc[]
|
|
@ -20,10 +20,13 @@ the process.
|
|||
* <<test-analyzer>>
|
||||
* <<configuring-analyzers>>
|
||||
* <<analysis-custom-analyzer>>
|
||||
* <<specify-analyzer>>
|
||||
|
||||
|
||||
include::testing.asciidoc[]
|
||||
|
||||
include::analyzers/configuring.asciidoc[]
|
||||
|
||||
include::analyzers/custom-analyzer.asciidoc[]
|
||||
include::analyzers/custom-analyzer.asciidoc[]
|
||||
|
||||
include::specify-analyzer.asciidoc[]
|
|
@ -0,0 +1,175 @@
|
|||
[[analysis-index-search-time]]
|
||||
=== Index and search analysis
|
||||
|
||||
Text analysis occurs at two times:
|
||||
|
||||
Index time::
|
||||
When a document is indexed, any <<text,`text`>> field values are analyzed.
|
||||
|
||||
Search time::
|
||||
When running a <<full-text-queries,full-text search>> on a `text` field,
|
||||
the query string (the text the user is searching for) is analyzed.
|
||||
+
|
||||
Search time is also called _query time_.
|
||||
|
||||
The analyzer, or set of analysis rules, used at each time is called the _index
|
||||
analyzer_ or _search analyzer_ respectively.
|
||||
|
||||
[[analysis-same-index-search-analyzer]]
|
||||
==== How the index and search analyzer work together
|
||||
|
||||
In most cases, the same analyzer should be used at index and search time. This
|
||||
ensures the values and query strings for a field are changed into the same form
|
||||
of tokens. In turn, this ensures the tokens match as expected during a search.
|
||||
|
||||
.**Example**
|
||||
[%collapsible]
|
||||
====
|
||||
|
||||
A document is indexed with the following value in a `text` field:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
The QUICK brown foxes jumped over the dog!
|
||||
------
|
||||
|
||||
The index analyzer for the field converts the value into tokens and normalizes
|
||||
them. In this case, each of the tokens represents a word:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
[ quick, brown, fox, jump, over, dog ]
|
||||
------
|
||||
|
||||
These tokens are then indexed.
|
||||
|
||||
Later, a user searches the same `text` field for:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
"Quick fox"
|
||||
------
|
||||
|
||||
The user expects this search to match the sentence indexed earlier,
|
||||
`The QUICK brown foxes jumped over the dog!`.
|
||||
|
||||
However, the query string does not contain the exact words used in the
|
||||
document's original text:
|
||||
|
||||
* `quick` vs `QUICK`
|
||||
* `fox` vs `foxes`
|
||||
|
||||
To account for this, the query string is analyzed using the same analyzer. This
|
||||
analyzer produces the following tokens:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
[ quick, fox ]
|
||||
------
|
||||
|
||||
To execute the search, {es} compares these query string tokens to the tokens
|
||||
indexed in the `text` field.
|
||||
|
||||
[options="header"]
|
||||
|===
|
||||
|Token | Query string | `text` field
|
||||
|`quick` | X | X
|
||||
|`brown` | | X
|
||||
|`fox` | X | X
|
||||
|`jump` | | X
|
||||
|`over` | | X
|
||||
|`dog` | | X
|
||||
|===
|
||||
|
||||
Because the field value and query string were analyzed in the same way, they
|
||||
created similar tokens. The tokens `quick` and `fox` are exact matches. This
|
||||
means the search matches the document containing `"The QUICK brown foxes jumped
|
||||
over the dog!"`, just as the user expects.
|
||||
====
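
To check which tokens an analyzer produces for a given piece of text, you can
send the text to the analyze API, as described in <<test-analyzer>>. The
following request is only a sketch: the example above does not name its
analyzer, so the built-in `english` analyzer is assumed here because it
lowercases, removes stop words, and stems. It should return tokens close to the
ones listed in the example.

[source,console]
----
POST _analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the dog!"
}
----

Running the same request with the query string `Quick fox` lets you compare the
two token lists side by side.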
|
||||
|
||||
[[different-analyzers]]
|
||||
==== When to use a different search analyzer
|
||||
|
||||
While less common, it sometimes makes sense to use different analyzers at index
|
||||
and search time. To enable this, {es} allows you to
|
||||
<<specify-search-analyzer,specify a separate search analyzer>>.
|
||||
|
||||
Generally, a separate search analyzer should only be specified when using the
|
||||
same form of tokens for field values and query strings would create unexpected
|
||||
or irrelevant search matches.
|
||||
|
||||
[[different-analyzer-ex]]
|
||||
.*Example*
|
||||
[%collapsible]
|
||||
====
|
||||
{es} is used to create a search engine that matches only words that start with
|
||||
a provided prefix. For instance, a search for `tr` should return `tram` or
|
||||
`trope`—but never `taxi` or `bat`.
|
||||
|
||||
A document is added to the search engine's index; this document contains one
|
||||
such word in a `text` field:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
"Apple"
|
||||
------
|
||||
|
||||
The index analyzer for the field converts the value into tokens and normalizes
|
||||
them. In this case, each of the tokens represents a potential prefix for
|
||||
the word:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
[ a, ap, app, appl, apple ]
|
||||
------
|
||||
|
||||
These tokens are then indexed.
|
||||
|
||||
Later, a user searches the same `text` field for:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
"appli"
|
||||
------
|
||||
|
||||
The user expects this search to match only words that start with `appli`,
|
||||
such as `appliance` or `application`. The search should not match `apple`.
|
||||
|
||||
However, if the index analyzer is used to analyze this query string, it would
|
||||
produce the following tokens:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
[ a, ap, app, appl, appli ]
|
||||
------
|
||||
|
||||
When {es} compares these query string tokens to the ones indexed for `apple`,
|
||||
it finds several matches.
|
||||
|
||||
[options="header"]
|
||||
|===
|
||||
|Token | `appli` | `apple`
|
||||
|`a` | X | X
|
||||
|`ap` | X | X
|
||||
|`app` | X | X
|
||||
|`appl` | X | X
|
||||
|`appli` | X |
|`apple` | | X
|
||||
|===
|
||||
|
||||
This means the search would erroneously match `apple`. Not only that, it would
|
||||
match any word starting with `a`.
|
||||
|
||||
To fix this, you can specify a different search analyzer for query strings used
|
||||
on the `text` field.
|
||||
|
||||
In this case, you could specify a search analyzer that produces a single token
|
||||
rather than a set of prefixes:
|
||||
|
||||
[source,text]
|
||||
------
|
||||
[ appli ]
|
||||
------
|
||||
|
||||
This query string token would only match tokens for words that start with
|
||||
`appli`, which better aligns with the user's search expectations.
|
||||
====
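
One possible way to set up the prefix example is sketched below. It assumes a
custom index analyzer built on the `edge_ngram` token filter and the built-in
`standard` analyzer as the search analyzer. The index name `my_index`, the field
name `title`, the names `autocomplete` and `autocomplete_filter`, and the gram
sizes are illustrative assumptions rather than part of the example above.

[source,console]
----
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
----

With a mapping like this, field values are indexed as prefixes, while a query
string such as `appli` is kept as a single lowercase token at search time.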
|
|
@ -0,0 +1,202 @@
|
|||
[[specify-analyzer]]
|
||||
=== Specify an analyzer
|
||||
|
||||
{es} offers a variety of ways to specify built-in or custom analyzers:
|
||||
|
||||
* By `text` field, index, or query
|
||||
* For <<analysis-index-search-time,index or search time>>
|
||||
|
||||
[TIP]
|
||||
.Keep it simple
|
||||
====
|
||||
The flexibility to specify analyzers at different levels and for different times
|
||||
is great... _but only when it's needed_.
|
||||
|
||||
In most cases, a simple approach works best: Specify an analyzer for each
|
||||
`text` field, as outlined in <<specify-index-field-analyzer>>.
|
||||
|
||||
This approach works well with {es}'s default behavior, letting you use the same
|
||||
analyzer for indexing and search. It also lets you quickly see which analyzer
|
||||
applies to which field using the <<indices-get-mapping,get mapping API>>.
|
||||
|
||||
If you don't typically create mappings for your indices, you can use
|
||||
<<indices-templates,index templates>> to achieve a similar effect.
|
||||
====
|
||||
|
||||
[[specify-index-time-analyzer]]
|
||||
==== How {es} determines the index analyzer
|
||||
|
||||
{es} determines which index analyzer to use by checking the following parameters
|
||||
in order:
|
||||
|
||||
. The <<analyzer,`analyzer`>> mapping parameter for the field.
|
||||
See <<specify-index-field-analyzer>>.
|
||||
. The `analysis.analyzer.default` index setting.
|
||||
See <<specify-index-time-default-analyzer>>.
|
||||
|
||||
If none of these parameters are specified, the
|
||||
<<analysis-standard-analyzer,`standard` analyzer>> is used.
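
To confirm which index analyzer {es} resolves for a field, one option is to pass
the field name to the analyze API instead of naming an analyzer. The request
below is a sketch that assumes an index named `my_index` with a `text` field
named `title` already exists. The response shows the tokens produced by
whichever analyzer the rules above select for that field.

[source,console]
----
POST my_index/_analyze
{
  "field": "title",
  "text": "The QUICK brown foxes"
}
----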
|
||||
|
||||
[[specify-index-field-analyzer]]
|
||||
==== Specify the analyzer for a field
|
||||
|
||||
When mapping an index, you can use the <<analyzer,`analyzer`>> mapping parameter
|
||||
to specify an analyzer for each `text` field.
|
||||
|
||||
The following <<indices-create-index,create index API>> request sets the
|
||||
`whitespace` analyzer as the analyzer for the `title` field.
|
||||
|
||||
[source,console]
|
||||
----
|
||||
PUT my_index
|
||||
{
|
||||
"mappings": {
|
||||
"properties": {
|
||||
"title": {
|
||||
"type": "text",
|
||||
"analyzer": "whitespace"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
----
|
||||
|
||||
[[specify-index-time-default-analyzer]]
|
||||
==== Specify the default analyzer for an index
|
||||
|
||||
In addition to a field-level analyzer, you can set a fallback analyzer for an index
|
||||
using the `analysis.analyzer.default` setting.
|
||||
|
||||
The following <<indices-create-index,create index API>> request sets the
|
||||
`simple` analyzer as the fallback analyzer for `my_index`.
|
||||
|
||||
[source,console]
|
||||
----
|
||||
PUT my_index
|
||||
{
|
||||
"settings": {
|
||||
"analysis": {
|
||||
"analyzer": {
|
||||
"default": {
|
||||
"type": "simple"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
----
|
||||
|
||||
[[specify-search-analyzer]]
|
||||
==== How {es} determines the search analyzer
|
||||
|
||||
// tag::search-analyzer-warning[]
|
||||
[WARNING]
|
||||
====
|
||||
In most cases, specifying a different search analyzer is unnecessary. Doing so
|
||||
could negatively impact relevancy and result in unexpected search results.
|
||||
|
||||
If you choose to specify a separate search analyzer, we recommend you thoroughly
|
||||
<<test-analyzer,test your analysis configuration>> before deploying in
|
||||
production.
|
||||
====
|
||||
// end::search-analyzer-warning[]
|
||||
|
||||
At search time, {es} determines which analyzer to use by checking the following
|
||||
parameters in order:
|
||||
|
||||
. The <<analyzer,`analyzer`>> parameter in the search query.
|
||||
See <<specify-search-query-analyzer>>.
|
||||
. The <<search-analyzer,`search_analyzer`>> mapping parameter for the field.
|
||||
See <<specify-search-field-analyzer>>.
|
||||
. The `analysis.analyzer.default_search` index setting.
|
||||
See <<specify-search-default-analyzer>>.
|
||||
. The <<analyzer,`analyzer`>> mapping parameter for the field.
|
||||
See <<specify-index-field-analyzer>>.
|
||||
|
||||
If none of these parameters are specified, the
|
||||
<<analysis-standard-analyzer,`standard` analyzer>> is used.
|
||||
|
||||
[[specify-search-query-analyzer]]
|
||||
==== Specify the search analyzer for a query
|
||||
|
||||
When writing a <<full-text-queries,full-text query>>, you can use the `analyzer`
|
||||
parameter to specify a search analyzer. If provided, this overrides any other
|
||||
search analyzers.
|
||||
|
||||
The following <<search-search,search API>> request sets the `stop` analyzer as
|
||||
the search analyzer for a <<query-dsl-match-query,`match`>> query.
|
||||
|
||||
[source,console]
|
||||
----
|
||||
GET my_index/_search
|
||||
{
|
||||
"query": {
|
||||
"match": {
|
||||
"message": {
|
||||
"query": "Quick foxes",
|
||||
"analyzer": "stop"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
----
|
||||
// TEST[s/^/PUT my_index\n/]
|
||||
|
||||
[[specify-search-field-analyzer]]
|
||||
==== Specify the search analyzer for a field
|
||||
|
||||
When mapping an index, you can use the <<search-analyzer,`search_analyzer`>> mapping
|
||||
parameter to specify a search analyzer for each `text` field.
|
||||
|
||||
If a search analyzer is provided, the index analyzer must also be specified
|
||||
using the `analyzer` parameter.
|
||||
|
||||
The following <<indices-create-index,create index API>> request sets the
|
||||
`simple` analyzer as the search analyzer for the `title` field.
|
||||
|
||||
[source,console]
|
||||
----
|
||||
PUT my_index
|
||||
{
|
||||
"mappings": {
|
||||
"properties": {
|
||||
"title": {
|
||||
"type": "text",
|
||||
"analyzer": "whitespace",
|
||||
"search_analyzer": "simple"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
----
|
||||
|
||||
[[specify-search-default-analyzer]]
|
||||
==== Specify the default search analyzer for an index
|
||||
|
||||
When <<indices-create-index,creating an index>>, you can set a default search
|
||||
analyzer using the `analysis.analyzer.default_search` setting.
|
||||
|
||||
If a search analyzer is provided, a default index analyzer must also be
|
||||
specified using the `analysis.analyzer.default` setting.
|
||||
|
||||
The following <<indices-create-index,create index API>> request sets the
|
||||
`whitespace` analyzer as the default search analyzer for the `my_index` index.
|
||||
|
||||
[source,console]
|
||||
----
|
||||
PUT my_index
|
||||
{
|
||||
"settings": {
|
||||
"analysis": {
|
||||
"analyzer": {
|
||||
"default": {
|
||||
"type": "simple"
|
||||
},
|
||||
"default_search": {
|
||||
"type": "whitespace"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
----
|