
[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.
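
For example, the gap can be raised for an individual custom analyzer. A
minimal sketch (the analyzer name and the value `500` are illustrative):

[source,js]
--------------------------------
PUT my_index?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gap_analyzer": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "position_increment_gap": 500 <2>
        }
      }
    }
  }
}
--------------------------------

<1> An illustrative analyzer name.
<2> Raises the gap from the default of `100`.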

[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,js]
--------------------------------
PUT my_index?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", <1>
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE

<1> Setting `type` to `custom` tells Elasticsearch that we are defining a custom analyzer.
Compare this to how <<configuring-analyzers,built-in analyzers can be configured>>:
`type` will be set to the name of the built-in analyzer, like
<<analysis-standard-analyzer,`standard`>> or <<analysis-simple-analyzer,`simple`>>.
A short sketch of this contrast follows the term listing below.
/////////////////////
[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------
// TESTRESPONSE
/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------
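
For comparison, a built-in analyzer is configured by setting `type` to the
analyzer's own name rather than to `custom`. A minimal sketch (the index and
analyzer names are illustrative; `stopwords` is one of the `standard`
analyzer's configuration parameters):

[source,js]
--------------------------------
PUT my_index_2?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": { <1>
          "type": "standard", <2>
          "stopwords": "_english_"
        }
      }
    }
  }
}
--------------------------------

<1> An illustrative analyzer name.
<2> The name of a built-in analyzer, rather than `custom`.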

The custom analyzer example above used a tokenizer, token filters, and
character filters with their default configurations, but it is possible to
create configured versions of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on punctuation characters

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words

Here is an example:

[source,js]
--------------------------------------------------
PUT my_index?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" <1>
          ],
          "tokenizer": "punctuation", <1>
          "filter": [
            "lowercase",
            "english_stop" <1>
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <1>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <1>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <1>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------
// CONSOLE

<1> The `emoticons` character filter, `punctuation` tokenizer and
`english_stop` token filter are custom implementations which are defined
in the same index settings.
/////////////////////
[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE
/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
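
To put a custom analyzer to work, reference it by name in a field mapping.
A minimal sketch (the index name, `_doc` type, and `my_text` field are
illustrative):

[source,js]
--------------------------------
PUT my_index_3?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": { <1>
          "type": "text",
          "analyzer": "my_custom_analyzer" <2>
        }
      }
    }
  }
}
--------------------------------

<1> An illustrative field name.
<2> The custom analyzer, referenced by name in the field mapping.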