OpenSearch/docs/reference/indices/analyze.asciidoc

245 lines
6.4 KiB
Plaintext
Raw Normal View History

[[indices-analyze]]
== Analyze
Performs the analysis process on a text and return the tokens breakdown
of the text.
Can be used without specifying an index against one of the many built in
analyzers:
[source,js]
--------------------------------------------------
GET _analyze
{
"analyzer" : "standard",
"text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
If text parameter is provided as array of strings, it is analyzed as a multi-valued field.
[source,js]
--------------------------------------------------
GET _analyze
{
"analyzer" : "standard",
"text" : ["this is a test", "the second text"]
}
--------------------------------------------------
// CONSOLE
Or by building a custom transient analyzer out of tokenizers,
token filters and char filters. Token filters can use the shorter 'filter'
parameter name:
[source,js]
--------------------------------------------------
GET _analyze
{
"tokenizer" : "keyword",
"filter" : ["lowercase"],
"text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
[source,js]
--------------------------------------------------
GET _analyze
{
"tokenizer" : "keyword",
"filter" : ["lowercase"],
"char_filter" : ["html_strip"],
"text" : "this is a <b>test</b>"
}
--------------------------------------------------
// CONSOLE
deprecated[5.0.0, Use `filter`/`char_filter` instead of `filters`/`char_filters` and `token_filters` has been removed]
Custom tokenizers, token filters, and character filters can be specified in the request body as follows:
[source,js]
--------------------------------------------------
GET _analyze
{
"tokenizer" : "whitespace",
"filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
"text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
It can also run against a specific index:
[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
"text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]
The above will run an analysis on the "this is a test" text, using the
default index analyzer associated with the `analyze_sample` index. An `analyzer`
can also be provided to use a different analyzer:
[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
"analyzer" : "whitespace",
2016-07-11 09:49:39 -04:00
"text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]
Also, the analyzer can be derived based on a field mapping, for example:
[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
"field" : "obj1.field1",
"text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]
2014-03-07 08:21:45 -05:00
Will cause the analysis to happen based on the analyzer configured in the
mapping for `obj1.field1` (and if not, the default index analyzer).
A `normalizer` can be provided for keyword field with normalizer associated with the `analyze_sample` index.
[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
"normalizer" : "my_normalizer",
"text" : "BaR"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]
Or by building a custom transient normalizer out of token filters and char filters.
[source,js]
--------------------------------------------------
GET _analyze
{
"filter" : ["lowercase"],
"text" : "BaR"
}
--------------------------------------------------
// CONSOLE
=== Explain Analyze
If you want to get more advanced details, set `explain` to `true` (defaults to `false`). It will output all token attributes for each token.
You can filter token attributes you want to output by setting `attributes` option.
NOTE: The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.
[source,js]
--------------------------------------------------
GET _analyze
{
"tokenizer" : "standard",
"filter" : ["snowball"],
"text" : "detailed output",
"explain" : true,
"attributes" : ["keyword"] <1>
}
--------------------------------------------------
// CONSOLE
<1> Set "keyword" to output "keyword" attribute only
The request returns the following result:
[source,js]
--------------------------------------------------
{
"detail" : {
"custom_analyzer" : true,
"charfilters" : [ ],
"tokenizer" : {
"name" : "standard",
"tokens" : [ {
"token" : "detailed",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "output",
"start_offset" : 9,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
} ]
},
"tokenfilters" : [ {
"name" : "snowball",
"tokens" : [ {
"token" : "detail",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0,
"keyword" : false <1>
}, {
"token" : "output",
"start_offset" : 9,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1,
"keyword" : false <1>
} ]
} ]
}
}
--------------------------------------------------
// TESTRESPONSE
<1> Output only "keyword" attribute, since specify "attributes" in the request.
[[tokens-limit-settings]]
[float]
== Settings to prevent tokens explosion
Generating excessive amount of tokens may cause a node to run out of memory.
The following setting allows to limit the number of tokens that can be produced:
`index.analyze.max_token_count`::
The maximum number of tokens that can be produced using `_analyze` API.
The default value is `10000`. If more than this limit of tokens gets
generated, an error will be thrown. The `_analyze` endpoint without a specified
index will always use `10000` value as a limit. This setting allows you to control
the limit for a specific index:
[source,js]
--------------------------------------------------
Update the default for include_type_name to false. (#37285) * Default include_type_name to false for get and put mappings. * Default include_type_name to false for get field mappings. * Add a constant for the default include_type_name value. * Default include_type_name to false for get and put index templates. * Default include_type_name to false for create index. * Update create index calls in REST documentation to use include_type_name=true. * Some minor clean-ups around the get index API. * In REST tests, use include_type_name=true by default for index creation. * Make sure to use 'expression == false'. * Clarify the different IndexTemplateMetaData toXContent methods. * Fix FullClusterRestartIT#testSnapshotRestore. * Fix the ml_anomalies_default_mappings test. * Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests. We make sure to specify include_type_name=true during xContent parsing, so we continue to test the legacy typed responses. XContent generation for the typeless responses is currently only covered by REST tests, but we will be adding unit test coverage for these as we implement each typeless API in the Java HLRC. This commit also refactors GetMappingsResponse to follow the same appraoch as the other mappings-related responses, where we read include_type_name out of the xContent params, instead of creating a second toXContent method. This gives better consistency in the response parsing code. * Fix more REST tests. * Improve some wording in the create index documentation. * Add a note about types removal in the create index docs. * Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL. * Make sure to mention include_type_name in the REST docs for affected APIs. * Make sure to use 'expression == false' in FullClusterRestartIT. * Mention include_type_name in the REST templates docs.
2019-01-14 16:08:01 -05:00
PUT analyze_sample?include_type_name=true
{
"settings" : {
"index.analyze.max_token_count" : 20000
}
}
--------------------------------------------------
// CONSOLE
[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
"text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]