diff --git a/_opensearch/rest-api/analyze-apis/index.md b/_opensearch/rest-api/analyze-apis/index.md
new file mode 100644
index 00000000..8a714d39
--- /dev/null
+++ b/_opensearch/rest-api/analyze-apis/index.md
@@ -0,0 +1,13 @@
+---
+layout: default
+title: Analyze API
+parent: REST API reference
+has_children: true
+nav_order: 7
+redirect_from:
+  - /opensearch/rest-api/analyze-apis/
+---
+
+# Analyze API
+
+The analyze API allows you to perform text analysis, which is the process of converting unstructured text into individual tokens (usually words) that are optimized for search.
\ No newline at end of file
diff --git a/_opensearch/rest-api/analyze-apis/perform-text-analysis.md b/_opensearch/rest-api/analyze-apis/perform-text-analysis.md
new file mode 100644
index 00000000..e1fa00ec
--- /dev/null
+++ b/_opensearch/rest-api/analyze-apis/perform-text-analysis.md
@@ -0,0 +1,670 @@
+---
+layout: default
+title: Perform text analysis
+parent: Analyze API
+grand_parent: REST API reference
+nav_order: 2
+---
+
+# Perform text analysis
+
+The perform text analysis API analyzes a text string and returns the resulting tokens.
+
+If you use the security plugin, you must have the `manage index` privilege. If you simply want to analyze text, you must have the `manage cluster` privilege.
+{: .note}
+
+## Path and HTTP methods
+
+```
+GET /_analyze
+GET /{index}/_analyze
+POST /_analyze
+POST /{index}/_analyze
+```
+
+Although you can issue an analyze request with both `GET` and `POST` requests, the two have important distinctions. A `GET` request causes data to be cached in the index so that the next time the data is requested, it is retrieved faster. A `POST` request sends a string that does not already exist to the analyzer to be compared to data that is already in the index. `POST` requests are not cached.
+{: .note}
+
+## Path parameter
+
+You can include the following optional path parameter in your request.
+
+Parameter | Data Type | Description
+:--- | :--- | :---
+index | String | Index that is used to derive the analyzer.
+
+## Query parameters
+
+You can include the following optional query parameters in your request.
+
+Field | Data Type | Description
+:--- | :--- | :---
+analyzer | String | The name of the analyzer to apply to the `text` field. The analyzer can be built in or configured in the index. <br /> <br /> If `analyzer` is not specified, the analyze API uses the analyzer defined in the mapping of the `field` field. <br /> <br /> If the `field` field is not specified, the analyze API uses the default analyzer for the index. <br /> <br /> If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.
+attributes | Array of Strings | Array of token attributes for filtering the output of the `explain` field.
+char_filter | Array of Strings | Array of character filters for preprocessing characters before the `tokenizer` field.
+explain | Boolean | If `true`, causes the response to include token attributes and additional details. Defaults to `false`.
+field | String | Field for deriving the analyzer. <br /> <br /> If you specify `field`, you must also specify the `index` path parameter. <br /> <br /> If you specify the `analyzer` field, it overrides the value of `field`. <br /> <br /> If you do not specify `field`, the analyze API uses the default analyzer for the index. <br /> <br /> If you do not specify the `index` field, or the index does not have a default analyzer, the analyze API uses the standard analyzer.
+filter | Array of Strings | Array of token filters to apply after the `tokenizer` field.
+normalizer | String | Normalizer for converting text into a single token.
+tokenizer | String | Tokenizer for converting the `text` field into tokens.
+
+The following query parameter is required.
+
+Field | Data Type | Description
+:--- | :--- | :---
+text | String or Array of Strings | Text to analyze. If you provide an array of strings, the text is analyzed as a multi-value field.
+
+#### Sample requests
+
+[Analyze array of text strings](#analyze-array-of-text-strings)
+
+[Apply a built-in analyzer](#apply-a-built-in-analyzer)
+
+[Apply a custom analyzer](#apply-a-custom-analyzer)
+
+[Apply a custom transient analyzer](#apply-a-custom-transient-analyzer)
+
+[Specify an index](#specify-an-index)
+
+[Derive the analyzer from an index field](#derive-the-analyzer-from-an-index-field)
+
+[Specify a normalizer](#specify-a-normalizer)
+
+[Get token details](#get-token-details)
+
+[Set a token limit](#set-a-token-limit)
+
+#### Analyze array of text strings
+
+When you pass an array of strings to the `text` field, it is analyzed as a multi-value field.
+
+````json
+GET /_analyze
+{
+  "analyzer" : "standard",
+  "text" : ["first array element", "second array element"]
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "first",
+      "start_offset" : 0,
+      "end_offset" : 5,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "array",
+      "start_offset" : 6,
+      "end_offset" : 11,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "element",
+      "start_offset" : 12,
+      "end_offset" : 19,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    },
+    {
+      "token" : "second",
+      "start_offset" : 20,
+      "end_offset" : 26,
+      "type" : "<ALPHANUM>",
+      "position" : 3
+    },
+    {
+      "token" : "array",
+      "start_offset" : 27,
+      "end_offset" : 32,
+      "type" : "<ALPHANUM>",
+      "position" : 4
+    },
+    {
+      "token" : "element",
+      "start_offset" : 33,
+      "end_offset" : 40,
+      "type" : "<ALPHANUM>",
+      "position" : 5
+    }
+  ]
+}
+````
+
+#### Apply a built-in analyzer
+
+If you omit the `index` path parameter, you can apply any of the built-in analyzers to the text string.
+
+The following request analyzes text using the `standard` built-in analyzer:
+
+````json
+GET /_analyze
+{
+  "analyzer" : "standard",
+  "text" : "OpenSearch text analysis"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "opensearch",
+      "start_offset" : 0,
+      "end_offset" : 10,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "text",
+      "start_offset" : 11,
+      "end_offset" : 15,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "analysis",
+      "start_offset" : 16,
+      "end_offset" : 24,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    }
+  ]
+}
+````
+
+#### Apply a custom analyzer
+
+You can create your own analyzer and specify it in an analyze request.
+
+In this scenario, a custom analyzer `lowercase_ascii_folding` has been created and associated with the `books2` index. The analyzer converts text to lowercase and converts non-ASCII characters to ASCII.
+
+The following request applies the custom analyzer to the provided text:
+
+````json
+GET /books2/_analyze
+{
+  "analyzer": "lowercase_ascii_folding",
+  "text" : "Le garçon m'a SUIVI."
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "le",
+      "start_offset" : 0,
+      "end_offset" : 2,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "garcon",
+      "start_offset" : 3,
+      "end_offset" : 9,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "m'a",
+      "start_offset" : 10,
+      "end_offset" : 13,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    },
+    {
+      "token" : "suivi",
+      "start_offset" : 14,
+      "end_offset" : 19,
+      "type" : "<ALPHANUM>",
+      "position" : 3
+    }
+  ]
+}
+````
+
+#### Apply a custom transient analyzer
+
+You can build a custom transient analyzer from tokenizers, token filters, or character filters. Use the `filter` parameter to specify token filters.
+
+The following request uses the `uppercase` token filter to convert the text to uppercase:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "keyword",
+  "filter" : ["uppercase"],
+  "text" : "OpenSearch filter"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "OPENSEARCH FILTER",
+      "start_offset" : 0,
+      "end_offset" : 17,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+<br>
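+
+The transient analyzer in the preceding example exists only for the duration of the request. To reuse the same combination of components across requests, you can instead register it as a named custom analyzer in the index settings. The following is a minimal sketch of how an analyzer like `lowercase_ascii_folding`, used with the `books2` index earlier on this page, might be defined at index creation time; the specific tokenizer and token filters (`standard`, `lowercase`, and `asciifolding`) are assumptions rather than part of the original example:
+
+````json
+PUT /books2
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "lowercase_ascii_folding": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "asciifolding"]
+        }
+      }
+    }
+  }
+}
+````
+
+Once the index exists, requests such as `GET /books2/_analyze` can reference the analyzer by name, as shown in the custom analyzer example.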
+
+The following request uses the `html_strip` character filter to remove HTML markup from the text:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "keyword",
+  "filter" : ["lowercase"],
+  "char_filter" : ["html_strip"],
+  "text" : "<b>Leave</b> right now!"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "leave right now!",
+      "start_offset" : 3,
+      "end_offset" : 23,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+<br>
+ +You can combine filters using an array. + +The following request combines a `lowercase` translation with a `stop` filter that removes the words in the `stopwords` array: + +````json +GET /_analyze +{ + "tokenizer" : "whitespace", + "filter" : ["lowercase", {"type": "stop", "stopwords": [ "to", "in"]}], + "text" : "how to train your dog in five steps" +} +```` + +The previous request returns the following fields: + +````json +{ + "tokens" : [ + { + "token" : "how", + "start_offset" : 0, + "end_offset" : 3, + "type" : "word", + "position" : 0 + }, + { + "token" : "train", + "start_offset" : 7, + "end_offset" : 12, + "type" : "word", + "position" : 2 + }, + { + "token" : "your", + "start_offset" : 13, + "end_offset" : 17, + "type" : "word", + "position" : 3 + }, + { + "token" : "dog", + "start_offset" : 18, + "end_offset" : 21, + "type" : "word", + "position" : 4 + }, + { + "token" : "five", + "start_offset" : 25, + "end_offset" : 29, + "type" : "word", + "position" : 6 + }, + { + "token" : "steps", + "start_offset" : 30, + "end_offset" : 35, + "type" : "word", + "position" : 7 + } + ] +} +```` + +#### Specify an index + +You can analyze text using an index's default analyzer, or you can specify a different analyzer. + +The following request analyzes the provided text using the default analyzer associated with the `books` index: + +````json +GET /books/_analyze +{ + "text" : "OpenSearch analyze test" +} +```` + +The previous request returns the following fields: + +````json + + "tokens" : [ + { + "token" : "opensearch", + "start_offset" : 0, + "end_offset" : 10, + "type" : "", + "position" : 0 + }, + { + "token" : "analyze", + "start_offset" : 11, + "end_offset" : 18, + "type" : "", + "position" : 1 + }, + { + "token" : "test", + "start_offset" : 19, + "end_offset" : 23, + "type" : "", + "position" : 2 + } + ] +} +```` + +
+ +The following request analyzes the provided text using the `keyword` analyzer, which returns the entire text value as a single token: + +````json +GET /books/_analyze +{ + "analyzer" : "keyword", + "text" : "OpenSearch analyze test" +} +```` + +The previous request returns the following fields: + +````json +{ + "tokens" : [ + { + "token" : "OpenSearch analyze test", + "start_offset" : 0, + "end_offset" : 23, + "type" : "word", + "position" : 0 + } + ] +} +```` + +#### Derive the analyzer from an index field + +You can pass text and a field in the index. The API looks up the field's analyzer and uses it to analyze the text. + +If the mapping does not exist, the API uses the standard analyzer, which converts all text to lowercase and tokenizes based on white space. + +The following request causes the analysis to be based on the mapping for `name`: + +````json +GET /books2/_analyze +{ + "field" : "name", + "text" : "OpenSearch analyze test" +} +```` + +The previous request returns the following fields: + +````json +{ + "tokens" : [ + { + "token" : "opensearch", + "start_offset" : 0, + "end_offset" : 10, + "type" : "", + "position" : 0 + }, + { + "token" : "analyze", + "start_offset" : 11, + "end_offset" : 18, + "type" : "", + "position" : 1 + }, + { + "token" : "test", + "start_offset" : 19, + "end_offset" : 23, + "type" : "", + "position" : 2 + } + ] +} +```` + +#### Specify a normalizer + +Instead of using a keyword field, you can use the normalizer associated with the index. A normalizer causes the analysis change to produce a single token. + +In this example, the `books2` index includes a normalizer called `to_lower_fold_ascii` that converts text to lowercase and translates non-ASCII text to ASCII. + +The following request applies `to_lower_fold_ascii` to the text: + +````json +GET /books2/_analyze +{ + "normalizer" : "to_lower_fold_ascii", + "text" : "C'est le garçon qui m'a suivi." +} +```` + +The previous request returns the following fields: + +````json +{ + "tokens" : [ + { + "token" : "c'est le garcon qui m'a suivi.", + "start_offset" : 0, + "end_offset" : 30, + "type" : "word", + "position" : 0 + } + ] +} +```` + +
+ +You can create a custom transient normalizer with token and character filters. + +The following request uses the `uppercase` character filter to convert the given text to all uppercase: + +````json +GET /_analyze +{ + "filter" : ["uppercase"], + "text" : "That is the boy who followed me." +} +```` + +The previous request returns the following fields: + +````json +{ + "tokens" : [ + { + "token" : "THAT IS THE BOY WHO FOLLOWED ME.", + "start_offset" : 0, + "end_offset" : 32, + "type" : "word", + "position" : 0 + } + ] +} +```` + +#### Get token details + +You can obtain additional details for all tokens by setting the `explain` attribute to `true`. + +The following request provides detailed token information for the `reverse` filter used with the `standard` tokenizer: + +````json +GET /_analyze +{ + "tokenizer" : "standard", + "filter" : ["reverse"], + "text" : "OpenSearch analyze test", + "explain" : true, + "attributes" : ["keyword"] +} +```` + +The previous request returns the following fields: + +````json +{ + "detail" : { + "custom_analyzer" : true, + "charfilters" : [ ], + "tokenizer" : { + "name" : "standard", + "tokens" : [ + { + "token" : "OpenSearch", + "start_offset" : 0, + "end_offset" : 10, + "type" : "", + "position" : 0 + }, + { + "token" : "analyze", + "start_offset" : 11, + "end_offset" : 18, + "type" : "", + "position" : 1 + }, + { + "token" : "test", + "start_offset" : 19, + "end_offset" : 23, + "type" : "", + "position" : 2 + } + ] + }, + "tokenfilters" : [ + { + "name" : "reverse", + "tokens" : [ + { + "token" : "hcraeSnepO", + "start_offset" : 0, + "end_offset" : 10, + "type" : "", + "position" : 0 + }, + { + "token" : "ezylana", + "start_offset" : 11, + "end_offset" : 18, + "type" : "", + "position" : 1 + }, + { + "token" : "tset", + "start_offset" : 19, + "end_offset" : 23, + "type" : "", + "position" : 2 + } + ] + } + ] + } +} +```` + +#### Set a token limit + +You can set a limit to the number of tokens generated. Setting a lower value reduces a node's memory usage. The default value is 10000. + +The following request limits the tokens to four: + +````json +PUT /books2 +{ + "settings" : { + "index.analyze.max_token_count" : 4 + } +} +```` +The preceding request is an index API rather than an analyze API. See [DYNAMIC INDEX SETTINGS]({{site.url}}{{site.baseurl}}/opensearch/rest-api/index-apis/create-index/#dynamic-index-settings) for additional details. +{: .note} + +### Response fields + +The text analysis endpoints return the following response fields. + +Field | Data Type | Description +:--- | :--- | :--- +tokens | Array | Array of tokens derived from the `text`. See [token object](#token-object). +detail | Object | Details about the analysis and each token. Included only when you request token details. See [detail object](#detail-object). + +#### Token object + +Field | Data Type | Description +:--- | :--- | :--- +token | String | The token's text. +start_offset | Integer | The token's starting position within the original text string. Offsets are zero-based. +end_offset | Integer | The token's ending position within the original text string. +type | String | Classification of the token: ``, ``, and so on. The tokenizer usually sets the type, but some filters define their own types. For example, the synonym filter defines the `` type. +position | Integer | The token's position within the `tokens` array. 
+
+#### Detail object
+
+Field | Data Type | Description
+:--- | :--- | :---
+custom_analyzer | Boolean | Whether the analyzer applied to the text is custom or built in.
+charfilters | Array | List of character filters applied to the text.
+tokenizer | Object | Name of the tokenizer applied to the text and a list of tokens* with content before the token filters were applied.
+tokenfilters | Array | List of token filters applied to the text. Each token filter includes the filter's name and a list of tokens* with content after the filters were applied. Token filters are listed in the order they are specified in the request.
+
+See [token object](#token-object) for token field descriptions.
+{: .note}
\ No newline at end of file
diff --git a/_opensearch/rest-api/analyze-apis/terminology.md b/_opensearch/rest-api/analyze-apis/terminology.md
new file mode 100644
index 00000000..95b6f163
--- /dev/null
+++ b/_opensearch/rest-api/analyze-apis/terminology.md
@@ -0,0 +1,37 @@
+---
+layout: default
+title: Analysis API Terminology
+parent: Analyze API
+grand_parent: REST API reference
+nav_order: 1
+---
+
+# Terminology
+
+The following sections provide descriptions of important text analysis terms.
+
+## Analyzers
+
+Analyzers tell OpenSearch how to index and search text. An analyzer is composed of three components: a tokenizer, zero or more token filters, and zero or more character filters.
+
+OpenSearch provides *built-in* analyzers. For example, the `standard` built-in analyzer converts text to lowercase and breaks text into tokens based on word boundaries such as carriage returns and white space. The `standard` analyzer is also called the *default* analyzer and is used when no analyzer is specified in the text analysis request.
+
+If needed, you can combine tokenizers, token filters, and character filters to create a *custom* analyzer.
+
+#### Tokenizers
+
+Tokenizers break unstructured text into tokens and maintain metadata about the tokens, such as their starting and ending positions in the text.
+
+#### Character filters
+
+Character filters examine text and perform translations, such as changing, removing, and adding characters.
+
+#### Token filters
+
+Token filters modify tokens, performing operations such as converting a token's characters to uppercase and adding or removing tokens.
+
+## Normalizers
+
+Similar to analyzers, normalizers tokenize text but return a single token only. Normalizers do not employ tokenizers; they make limited use of character and token filters, such as those that operate on one character at a time.
+
+By default, OpenSearch does not apply normalizers. To apply a normalizer, you must include it in the index configuration when you create the index and reference it from a field mapping.
\ No newline at end of file
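+
+For example, the following is a minimal sketch of how a normalizer such as `to_lower_fold_ascii`, used in the perform text analysis examples, might be defined and attached to a field when the index is created. The `author` field name and the specific filters (`lowercase` and `asciifolding`) are illustrative assumptions:
+
+````json
+PUT /books2
+{
+  "settings": {
+    "analysis": {
+      "normalizer": {
+        "to_lower_fold_ascii": {
+          "type": "custom",
+          "char_filter": [],
+          "filter": ["lowercase", "asciifolding"]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "author": {
+        "type": "keyword",
+        "normalizer": "to_lower_fold_ascii"
+      }
+    }
+  }
+}
+````
+
+Because the normalizer is attached to a `keyword` field, indexed values and query terms for that field are both reduced to the same lowercase, ASCII-folded token.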