173 lines
6.0 KiB
Plaintext
173 lines
6.0 KiB
Plaintext
[[analyzer]]
|
|
=== `analyzer`
|
|
|
|
The values of <<mapping-index,`analyzed`>> string fields are passed through an
|
|
<<analysis,analyzer>> to convert the string into a stream of _tokens_ or
|
|
_terms_. For instance, the string `"The quick Brown Foxes."` may, depending
|
|
on which analyzer is used, be analyzed to the tokens: `quick`, `brown`,
|
|
`fox`. These are the actual terms that are indexed for the field, which makes
|
|
it possible to search efficiently for individual words _within_ big blobs of
|
|
text.
|
|
|
|
This analysis process needs to happen not just at index time, but also at
|
|
query time: the query string needs to be passed through the same (or a
|
|
similar) analyzer so that the terms that it tries to find are in the same
|
|
format as those that exist in the index.
|
|
|
|
Elasticsearch ships with a number of <<analysis-analyzers,pre-defined analyzers>>,
|
|
which can be used without further configuration. It also ships with many
|
|
<<analysis-charfilters,character filters>>, <<analysis-tokenizers,tokenizers>>,
|
|
and <<analysis-tokenfilters>> which can be combined to configure
|
|
custom analyzers per index.
|
|
|
|
Analyzers can be specified per-query, per-field or per-index. At index time,
|
|
Elasticsearch will look for an analyzer in this order:
|
|
|
|
* The `analyzer` defined in the field mapping.
|
|
* An analyzer named `default` in the index settings.
|
|
* The <<analysis-standard-analyzer,`standard`>> analyzer.
|
|
|
|
At query time, there are a few more layers:
|
|
|
|
* The `analyzer` defined in a <<full-text-queries,full-text query>>.
|
|
* The `search_analyzer` defined in the field mapping.
|
|
* The `analyzer` defined in the field mapping.
|
|
* An analyzer named `default_search` in the index settings.
|
|
* An analyzer named `default` in the index settings.
|
|
* The <<analysis-standard-analyzer,`standard`>> analyzer.
|
|
|
|
The easiest way to specify an analyzer for a particular field is to define it
|
|
in the field mapping, as follows:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT /my_index
|
|
{
|
|
"mappings": {
|
|
"my_type": {
|
|
"properties": {
|
|
"text": { <1>
|
|
"type": "text",
|
|
"fields": {
|
|
"english": { <2>
|
|
"type": "text",
|
|
"analyzer": "english"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
GET _cluster/health?wait_for_status=yellow
|
|
|
|
GET my_index/_analyze?field=text <3>
|
|
{
|
|
"text": "The quick Brown Foxes."
|
|
}
|
|
|
|
GET my_index/_analyze?field=text.english <4>
|
|
{
|
|
"text": "The quick Brown Foxes."
|
|
}
|
|
--------------------------------------------------
|
|
// AUTOSENSE
|
|
<1> The `text` field uses the default `standard` analyzer`.
|
|
<2> The `text.english` <<multi-fields,multi-field>> uses the `english` analyzer, which removes stop words and applies stemming.
|
|
<3> This returns the tokens: [ `the`, `quick`, `brown`, `foxes` ].
|
|
<4> This returns the tokens: [ `quick`, `brown`, `fox` ].
|
|
|
|
|
|
[[search-quote-analyzer]]
|
|
==== `search_quote_analyzer`
|
|
|
|
The `search_quote_analyzer` setting allows you to specify an analyzer for phrases, this is particularly useful when dealing with disabling
|
|
stop words for phrase queries.
|
|
|
|
To disable stop words for phrases a field utilising three analyzer settings will be required:
|
|
|
|
1. An `analyzer` setting for indexing all terms including stop words
|
|
2. A `search_analyzer` setting for non-phrase queries that will remove stop words
|
|
3. A `search_quote_analyzer` setting for phrase queries that will not remove stop words
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT my_index
|
|
{
|
|
"settings":{
|
|
"analysis":{
|
|
"analyzer":{
|
|
"my_analyzer":{ <1>
|
|
"type":"custom",
|
|
"tokenizer":"standard",
|
|
"filter":[
|
|
"lowercase"
|
|
]
|
|
},
|
|
"my_stop_analyzer":{ <2>
|
|
"type":"custom",
|
|
"tokenizer":"standard",
|
|
"filter":[
|
|
"lowercase",
|
|
"english_stop"
|
|
]
|
|
}
|
|
},
|
|
"filter":{
|
|
"english_stop":{
|
|
"type":"stop",
|
|
"stopwords":"_english_"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"mappings":{
|
|
"my_type":{
|
|
"properties":{
|
|
"title": {
|
|
"type":"text",
|
|
"analyzer":"my_analyzer", <3>
|
|
"search_analyzer":"my_stop_analyzer", <4>
|
|
"search_quote_analyzer":"my_analyzer" <5>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// AUTOSENSE
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT my_index/my_type/1
|
|
{
|
|
"title":"The Quick Brown Fox"
|
|
}
|
|
|
|
PUT my_index/my_type/2
|
|
{
|
|
"title":"A Quick Brown Fox"
|
|
}
|
|
|
|
GET my_index/my_type/_search
|
|
{
|
|
"query":{
|
|
"query_string":{
|
|
"query":"\"the quick brown fox\"" <6>
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
<1> `my_analyzer` analyzer which tokens all terms including stop words
|
|
<2> `my_stop_analyzer` analyzer which removes stop words
|
|
<3> `analyzer` setting that points to the `my_analyzer` analyzer which will be used at index time
|
|
<4> `search_analyzer` setting that points to the `my_stop_analyzer` and removes stop words for non-phrase queries
|
|
<5> `search_quote_analyzer` setting that points to the `my_analyzer` analyzer and ensures that stop words are not removed from phrase queries
|
|
<6> Since the query is wrapped in quotes it is detected as a phrase query therefore the `search_quote_analyzer` kicks in and ensures the stop words
|
|
are not removed from the query. The `my_analyzer` analyzer will then return the following tokens [`the`, `quick`, `brown`, `fox`] which will match one
|
|
of the documents. Meanwhile term queries will be analyzed with the `my_stop_analyzer` analyzer which will filter out stop words. So a search for either
|
|
`The quick brown fox` or `A quick brown fox` will return both documents since both documents contain the following tokens [`quick`, `brown`, `fox`].
|
|
Without the `search_quote_analyzer` it would not be possible to do exact matches for phrase queries as the stop words from phrase queries would be
|
|
removed resulting in both documents matching.
|