Steve-O-Murphy analyzer apis documentation (#1594)

* Add analyzer endpoints Signed-off-by: Steve Murphy <stevemurphy@BCSR-11413.local> * Post Add analyzer endpoints documentation Signed-off-by: ariamarble <armarble@amazon.com> * made suggested changes Signed-off-by: ariamarble <armarble@amazon.com> * made editorial changes Signed-off-by: ariamarble <armarble@amazon.com> Signed-off-by: Steve Murphy <stevemurphy@BCSR-11413.local> Signed-off-by: ariamarble <armarble@amazon.com> Co-authored-by: Steve Murphy <stevemurphy@BCSR-11413.local>
2022-10-19 08:54:55 -07:00 · 2022-10-19 08:54:55 -07:00 · e7f5ebb336
commit e7f5ebb336
parent 128f57abba
3 changed files with 720 additions and 0 deletions
--- a/_opensearch/rest-api/analyze-apis/index.md
+++ b/_opensearch/rest-api/analyze-apis/index.md
@@ -0,0 +1,13 @@
+---
+layout: default
+title: Analyze API
+parent: REST API reference
+has_children: true
+nav_order: 7
+redirect_from:
+  - /opensearch/rest-api/analyze-apis/
+---
+
+# Analyze API
+
+The analyze API allows you to perform text analysis, which is the process of converting unstructured text into individual tokens (usually words) that are optimized for search.
--- a/_opensearch/rest-api/analyze-apis/perform-text-analysis.md
+++ b/_opensearch/rest-api/analyze-apis/perform-text-analysis.md
@@ -0,0 +1,670 @@
+---
+layout: default
+title: Perform text analysis
+parent: Analyze API
+grand_parent: REST API reference
+nav_order: 2
+---
+
+# Perform text analysis
+
+The perform text analysis API analyzes a text string and returns the resulting tokens.
+
+If you use the security plugin, you must have the `manage index` privilege. If you simply want to analyze text, you must have the `manage cluster` privilege.
+{: .note}
+
+## Path and HTTP methods
+
+```
+GET /_analyze
+GET /{index}/_analyze
+POST /_analyze
+POST /{index}/_analyze
+```
+
+Although you can issue an analyze request using either `GET` or `POST`, the two have important distinctions. A `GET` request causes data to be cached in the index so that the next time the data is requested, it is retrieved faster. A `POST` request sends a string that does not already exist to the analyzer to be compared with data that is already in the index. `POST` requests are not cached.
+{: .note}
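+
+As a minimal sketch, the following request sends an analyze request body using `POST` instead of `GET`. The `standard` analyzer and the sample text are taken from the examples on this page:
+
+````json
+POST /_analyze
+{
+  "analyzer" : "standard",
+  "text" : "OpenSearch text analysis"
+}
+````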
+
+## Path parameter
+
+You can include the following optional path parameter in your request.
+
+Parameter | Data Type | Description
+:--- | :--- | :---
+index | String | Index that is used to derive the analyzer.
+
+## Query parameters
+
+You can include the following optional query parameters in your request.
+
+Field | Data Type | Description
+:--- | :--- | :---
+analyzer | String | The name of the analyzer to apply to the `text` field. The analyzer can be built in or configured in the index.<br /><br />If `analyzer` is not specified, the analyze API uses the analyzer defined in the mapping of the `field` field.<br /><br />If the `field` field is not specified, the analyze API uses the default analyzer for the index.<br /><br />If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.
+attributes | Array of Strings | Array of token attributes for filtering the output of the `explain` field.
+char_filter | Array of Strings | Array of character filters for preprocessing characters before the `tokenizer` field.
+explain | Boolean | If true, causes the response to include token attributes and additional details. Defaults to `false`.
+field | String | Field for deriving the analyzer. <br /><br />If you specify `field`, you must also specify the `index` path parameter. <br /><br />If you specify the `analyzer` field, it overrides the value of `field`. <br /><br />If you do not specify `field`, the analyze API uses the default analyzer for the index. <br /><br />If you do not specify the `index` path parameter, or the index does not have a default analyzer, the analyze API uses the standard analyzer.
+filter | Array of Strings | Array of token filters to apply after the `tokenizer` field.
+normalizer | String | Normalizer for converting text into a single token. 
+tokenizer | String | Tokenizer for converting the `text` field into tokens.
+
+The following query parameter is required.
+
+Field | Data Type | Description
+:--- | :--- | :---
+text | String or Array of Strings | Text to analyze. If you provide an array of strings, the text is analyzed as a multi-value field.
+
+#### Sample requests
+
+[Analyze array of text strings](#analyze-array-of-text-strings)
+
+[Apply a built-in analyzer](#apply-a-built-in-analyzer)
+
+[Apply a custom analyzer](#apply-a-custom-analyzer)
+
+[Apply a custom transient analyzer](#apply-a-custom-transient-analyzer)
+
+[Specify an index](#specify-an-index)
+
+[Derive the analyzer from an index field](#derive-the-analyzer-from-an-index-field)
+
+[Specify a normalizer](#specify-a-normalizer)
+
+[Get token details](#get-token-details)
+
+[Set a token limit](#set-a-token-limit)
+
+#### Analyze array of text strings
+
+When you pass an array of strings to the `text` field, it is analyzed as a multi-value field.
+
+````json
+GET /_analyze
+{
+  "analyzer" : "standard",
+  "text" : ["first array element", "second array element"]
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "first",
+      "start_offset" : 0,
+      "end_offset" : 5,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "array",
+      "start_offset" : 6,
+      "end_offset" : 11,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "element",
+      "start_offset" : 12,
+      "end_offset" : 19,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    },
+    {
+      "token" : "second",
+      "start_offset" : 20,
+      "end_offset" : 26,
+      "type" : "<ALPHANUM>",
+      "position" : 3
+    },
+    {
+      "token" : "array",
+      "start_offset" : 27,
+      "end_offset" : 32,
+      "type" : "<ALPHANUM>",
+      "position" : 4
+    },
+    {
+      "token" : "element",
+      "start_offset" : 33,
+      "end_offset" : 40,
+      "type" : "<ALPHANUM>",
+      "position" : 5
+    }
+  ]
+}
+````
+
+#### Apply a built-in analyzer
+
+If you omit the `index` path parameter, you can apply any of the built-in analyzers to the text string.
+
+The following request analyzes text using the `standard` built-in analyzer:
+
+````json
+GET /_analyze
+{
+  "analyzer" : "standard",
+  "text" : "OpenSearch text analysis"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "opensearch",
+      "start_offset" : 0,
+      "end_offset" : 10,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "text",
+      "start_offset" : 11,
+      "end_offset" : 15,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "analysis",
+      "start_offset" : 16,
+      "end_offset" : 24,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    }
+  ]
+}
+````
+
+#### Apply a custom analyzer
+
+You can create your own analyzer and specify it in an analyze request.
+
+In this scenario, a custom analyzer `lowercase_ascii_folding` has been created and associated with the `books2` index. The analyzer converts text to lowercase and converts non-ASCII characters to ASCII.
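+
+The definition of `lowercase_ascii_folding` is not shown on this page. As a sketch only, an analyzer with this behavior could be defined when the index is created. The `standard` tokenizer and the `lowercase` and `asciifolding` token filters in the following request are assumptions chosen to match the description above, not the actual definition. The request fails if the `books2` index already exists:
+
+````json
+PUT /books2
+{
+  "settings" : {
+    "analysis" : {
+      "analyzer" : {
+        "lowercase_ascii_folding" : {
+          "type" : "custom",
+          "tokenizer" : "standard",
+          "filter" : ["lowercase", "asciifolding"]
+        }
+      }
+    }
+  }
+}
+````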
+
+The following request applies the custom analyzer to the provided text:
+
+````json
+GET /books2/_analyze
+{
+  "analyzer": "lowercase_ascii_folding",
+  "text" : "Le garçon m'a SUIVI."
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "le",
+      "start_offset" : 0,
+      "end_offset" : 2,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "garcon",
+      "start_offset" : 3,
+      "end_offset" : 9,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "m'a",
+      "start_offset" : 10,
+      "end_offset" : 13,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    },
+    {
+      "token" : "suivi",
+      "start_offset" : 14,
+      "end_offset" : 19,
+      "type" : "<ALPHANUM>",
+      "position" : 3
+    }
+  ]
+}
+````
+
+#### Apply a custom transient analyzer
+
+You can build a custom transient analyzer from tokenizers, token filters, or character filters. Use the `filter` parameter to specify token filters.
+
+The following request uses the `uppercase` token filter to convert the text to uppercase:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "keyword",
+  "filter" : ["uppercase"],
+  "text" : "OpenSearch filter"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "OPENSEARCH FILTER",
+      "start_offset" : 0,
+      "end_offset" : 17,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+<hr />
+
+The following request uses the `html_strip` character filter to remove HTML elements from the text and the `lowercase` token filter to convert the result to lowercase:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "keyword",
+  "filter" : ["lowercase"],
+  "char_filter" : ["html_strip"],
+  "text" : "<b>Leave</b> right now!"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "leave right now!",
+      "start_offset" : 3,
+      "end_offset" : 23,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+<hr />
+
+You can combine filters using an array.
+
+The following request combines a `lowercase` token filter with a `stop` token filter that removes the words in the `stopwords` array:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "whitespace",
+  "filter" : ["lowercase", {"type": "stop", "stopwords": [ "to", "in"]}],
+  "text" : "how to train your dog in five steps"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "how",
+      "start_offset" : 0,
+      "end_offset" : 3,
+      "type" : "word",
+      "position" : 0
+    },
+    {
+      "token" : "train",
+      "start_offset" : 7,
+      "end_offset" : 12,
+      "type" : "word",
+      "position" : 2
+    },
+    {
+      "token" : "your",
+      "start_offset" : 13,
+      "end_offset" : 17,
+      "type" : "word",
+      "position" : 3
+    },
+    {
+      "token" : "dog",
+      "start_offset" : 18,
+      "end_offset" : 21,
+      "type" : "word",
+      "position" : 4
+    },
+    {
+      "token" : "five",
+      "start_offset" : 25,
+      "end_offset" : 29,
+      "type" : "word",
+      "position" : 6
+    },
+    {
+      "token" : "steps",
+      "start_offset" : 30,
+      "end_offset" : 35,
+      "type" : "word",
+      "position" : 7
+    }
+  ]
+}
+````
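+
+Character filters can be defined inline in the same way as the `stop` filter above. The following sketch assumes a `mapping` character filter with a hypothetical replacement rule (`&` to `and`). The rule and the sample text are illustrative only:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "whitespace",
+  "char_filter" : [{"type": "mapping", "mappings": ["& => and"]}],
+  "filter" : ["lowercase"],
+  "text" : "Salt & Pepper"
+}
+````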
+
+#### Specify an index
+
+You can analyze text using an index's default analyzer, or you can specify a different analyzer.
+
+The following request analyzes the provided text using the default analyzer associated with the `books` index:
+
+````json
+GET /books/_analyze
+{
+  "text" : "OpenSearch analyze test"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "opensearch",
+      "start_offset" : 0,
+      "end_offset" : 10,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "analyze",
+      "start_offset" : 11,
+      "end_offset" : 18,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "test",
+      "start_offset" : 19,
+      "end_offset" : 23,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    }
+  ]
+}
+````
+
+<hr />
+
+The following request analyzes the provided text using the `keyword` analyzer, which returns the entire text value as a single token:
+
+````json
+GET /books/_analyze
+{
+  "analyzer" : "keyword",
+  "text" : "OpenSearch analyze test"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "OpenSearch analyze test",
+      "start_offset" : 0,
+      "end_offset" : 23,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+#### Derive the analyzer from an index field
+
+You can pass text and a field in the index. The API looks up the field's analyzer and uses it to analyze the text.
+
+If the mapping does not exist, the API uses the standard analyzer, which converts all text to lowercase and tokenizes based on word boundaries such as white space.
+
+The following request causes the analysis to be based on the mapping for `name`:
+
+````json
+GET /books2/_analyze
+{
+  "field" : "name",
+  "text" : "OpenSearch analyze test"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "opensearch",
+      "start_offset" : 0,
+      "end_offset" : 10,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "analyze",
+      "start_offset" : 11,
+      "end_offset" : 18,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "test",
+      "start_offset" : 19,
+      "end_offset" : 23,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    }
+  ]
+}
+````
+
+#### Specify a normalizer
+
+Instead of using a keyword field, you can use the normalizer associated with the index. A normalizer causes the analysis to produce a single token.
+
+In this example, the `books2` index includes a normalizer called `to_lower_fold_ascii` that converts text to lowercase and translates non-ASCII text to ASCII.
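+
+The definition of `to_lower_fold_ascii` is not shown on this page. As a sketch only, a normalizer with this behavior could be declared in the index settings when the index is created. The `lowercase` and `asciifolding` filters in the following request are assumptions based on the description above. The request fails if the `books2` index already exists:
+
+````json
+PUT /books2
+{
+  "settings" : {
+    "analysis" : {
+      "normalizer" : {
+        "to_lower_fold_ascii" : {
+          "type" : "custom",
+          "filter" : ["lowercase", "asciifolding"]
+        }
+      }
+    }
+  }
+}
+````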
+
+The following request applies `to_lower_fold_ascii` to the text:
+
+````json
+GET /books2/_analyze
+{
+  "normalizer" : "to_lower_fold_ascii",
+  "text" : "C'est le garçon qui m'a suivi."
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "c'est le garcon qui m'a suivi.",
+      "start_offset" : 0,
+      "end_offset" : 30,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+<hr />
+
+You can create a custom transient normalizer with token and character filters.
+
+The following request uses the `uppercase` token filter to convert the given text to all uppercase:
+
+````json
+GET /_analyze
+{
+  "filter" : ["uppercase"],
+  "text" : "That is the boy who followed me."
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "THAT IS THE BOY WHO FOLLOWED ME.",
+      "start_offset" : 0,
+      "end_offset" : 32,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+#### Get token details
+
+You can obtain additional details for all tokens by setting the `explain` field to `true`. To limit the output to specific token attributes, use the `attributes` field.
+
+The following request provides detailed token information for the `reverse` filter used with the `standard` tokenizer:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "standard",
+  "filter" : ["reverse"],
+  "text" : "OpenSearch analyze test",
+  "explain" : true,
+  "attributes" : ["keyword"] 
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "detail" : {
+    "custom_analyzer" : true,
+    "charfilters" : [ ],
+    "tokenizer" : {
+      "name" : "standard",
+      "tokens" : [
+        {
+          "token" : "OpenSearch",
+          "start_offset" : 0,
+          "end_offset" : 10,
+          "type" : "<ALPHANUM>",
+          "position" : 0
+        },
+        {
+          "token" : "analyze",
+          "start_offset" : 11,
+          "end_offset" : 18,
+          "type" : "<ALPHANUM>",
+          "position" : 1
+        },
+        {
+          "token" : "test",
+          "start_offset" : 19,
+          "end_offset" : 23,
+          "type" : "<ALPHANUM>",
+          "position" : 2
+        }
+      ]
+    },
+    "tokenfilters" : [
+      {
+        "name" : "reverse",
+        "tokens" : [
+          {
+            "token" : "hcraeSnepO",
+            "start_offset" : 0,
+            "end_offset" : 10,
+            "type" : "<ALPHANUM>",
+            "position" : 0
+          },
+          {
+            "token" : "ezylana",
+            "start_offset" : 11,
+            "end_offset" : 18,
+            "type" : "<ALPHANUM>",
+            "position" : 1
+          },
+          {
+            "token" : "tset",
+            "start_offset" : 19,
+            "end_offset" : 23,
+            "type" : "<ALPHANUM>",
+            "position" : 2
+          }
+        ]
+      }
+    ]
+  }
+}
+````
+
+#### Set a token limit
+
+You can set a limit on the number of tokens generated. Setting a lower value reduces a node's memory usage. The default value is 10,000.
+
+The following request limits the tokens to four:
+
+````json
+PUT /books2
+{
+  "settings" : {
+    "index.analyze.max_token_count" : 4
+  }
+}
+````
+The preceding request uses the create index API rather than the analyze API. See [Dynamic index settings]({{site.url}}{{site.baseurl}}/opensearch/rest-api/index-apis/create-index/#dynamic-index-settings) for additional details.
+{: .note}
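+
+Because `index.analyze.max_token_count` is listed as a dynamic index setting, it should also be possible to change it on an index that already exists. The following sketch assumes that the `books2` index exists and that its default analyzer is the standard analyzer. The first request updates the setting, and the second request produces more than four tokens, so it is expected to be rejected rather than return tokens:
+
+````json
+PUT /books2/_settings
+{
+  "index.analyze.max_token_count" : 4
+}
+
+GET /books2/_analyze
+{
+  "text" : "how to train your dog in five steps"
+}
+````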
+
+### Response fields
+
+The text analysis endpoints return the following response fields.
+
+Field | Data Type | Description
+:--- | :--- | :---
+tokens | Array | Array of tokens derived from the `text`. See [token object](#token-object).
+detail | Object | Details about the analysis and each token. Included only when you request token details. See [detail object](#detail-object).
+
+#### Token object
+
+Field | Data Type | Description
+:--- | :--- | :---
+token  | String | The token's text.
+start_offset | Integer | The token's starting position within the original text string. Offsets are zero-based.
+end_offset | Integer | The token's ending position within the original text string. The end offset is exclusive: it is the offset of the character immediately following the token.
+type | String | Classification of the token: `<ALPHANUM>`, `<NUM>`, and so on. The tokenizer usually sets the type, but some filters define their own types. For example, the synonym filter defines the `<SYNONYM>` type.
+position | Integer | The token's position within the token stream. Positions are zero-based and can contain gaps when a token filter, such as `stop`, removes tokens.
+
+#### Detail object
+
+Field | Data Type | Description
+:--- | :--- | :---
+custom_analyzer | Boolean | Whether the analyzer applied to the text is custom or built in.
+charfilters | Array | List of character filters applied to the text.
+tokenizer | Object | Name of the tokenizer applied to the text and a list of tokens<sup>*</sup> with content before the token filters were applied.
+tokenfilters | Array | List of token filters applied to the text. Each token filter includes the filter's name and a list of tokens<sup>*</sup> with content after the filters were applied. Token filters are listed in the order they are specified in the request. 
+
+<sup>*</sup> See [token object](#token-object) for token field descriptions.
+{: .note}
--- a/_opensearch/rest-api/analyze-apis/terminology.md
+++ b/_opensearch/rest-api/analyze-apis/terminology.md
@@ -0,0 +1,37 @@
+---
+layout: default
+title: Analyze API terminology
+parent: Analyze API
+grand_parent: REST API reference
+nav_order: 1
+---
+
+# Terminology
+
+The following sections provide descriptions of important text analysis terms. 
+
+## Analyzers
+
+Analyzers tell OpenSearch how to index and search text. An analyzer is composed of three components: a tokenizer, zero or more token filters, and zero or more character filters. 
+
+OpenSearch provides *built-in* analyzers. For example, the `standard` built-in analyzer converts text to lowercase and breaks text into tokens based on word boundaries such as carriage returns and white space. The `standard` analyzer is also called the *default* analyzer and is used when no analyzer is specified in the text analysis request.
+
+If needed, you can combine tokenizers, token filters, and character filters to create a *custom* analyzer.
+
+#### Tokenizers
+
+Tokenizers break unstructured text into tokens and maintain metadata about tokens, such as their starting and ending positions in the text.
+
+#### Character filters
+
+Character filters examine text and perform translations, such as changing, removing, and adding characters. 
+
+#### Token filters
+
+Token filters modify tokens, performing operations such as converting a token's characters to uppercase and adding or removing tokens. 
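+
+To illustrate how the three components fit together, the following sketch builds a transient analyzer in a single analyze request. Character filters run first, then the tokenizer, and then the token filters. The sample text is hypothetical:
+
+````json
+GET /_analyze
+{
+  "char_filter" : ["html_strip"],
+  "tokenizer" : "standard",
+  "filter" : ["lowercase"],
+  "text" : "<b>OpenSearch</b> Text Analysis"
+}
+````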
+
+## Normalizers
+
+Similar to analyzers, normalizers process text, but they return only a single token. Normalizers do not employ tokenizers, and they can use only a limited set of character and token filters, such as those that operate on one character at a time.
+
+By default, OpenSearch does not apply normalizers. To apply normalizers, you must add them to your data before creating an index.
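+
+As a sketch of one way to do this, the following request declares a normalizer in the index settings and assigns it to a `keyword` field in the mappings at index creation. The index name `my-index`, the normalizer name `lowercase_normalizer`, and the field name `title` are hypothetical:
+
+````json
+PUT /my-index
+{
+  "settings" : {
+    "analysis" : {
+      "normalizer" : {
+        "lowercase_normalizer" : {
+          "type" : "custom",
+          "filter" : ["lowercase"]
+        }
+      }
+    }
+  },
+  "mappings" : {
+    "properties" : {
+      "title" : {
+        "type" : "keyword",
+        "normalizer" : "lowercase_normalizer"
+      }
+    }
+  }
+}
+````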