Add cluster awareness and decommission docs (#2438)

* Add cluster awareness and decommission docs
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>
* Edit technical feedback
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Add new cluster awareness examples
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Add technical feedback
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Alice Williams <88908598+alicejw-aws@users.noreply.github.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Alice Williams <88908598+alicejw-aws@users.noreply.github.com>
* Add Caroline's feedback
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Add one more tweak
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Update _ml-commons-plugin/cluster-settings.md
  Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>
* Update _ml-commons-plugin/cluster-settings.md
  Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _ml-commons-plugin/cluster-settings.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _ml-commons-plugin/cluster-settings.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-decommission.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-awareness.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Update _api-reference/cluster-decommission.md
  Co-authored-by: Nate Bower <nbower@amazon.com>
* Add editoiral feedback
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Fix typos
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Final editorial note
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>
Co-authored-by: Alice Williams <88908598+alicejw-aws@users.noreply.github.com>
Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>
Co-authored-by: Nate Bower <nbower@amazon.com>
2023-01-23 17:13:07 -06:00 · 2023-01-23 17:13:07 -06:00 · 1589201a9e
parent c2e423ff71
commit 1589201a9e
6 changed files with 245 additions and 9 deletions
--- a/_api-reference/cluster-awareness.md
+++ b/_api-reference/cluster-awareness.md
@ -0,0 +1,121 @@
+---
+layout: default
+title: Cluster routing and awareness
+nav_order: 16
+---
+
+# Cluster routing and awareness
+
+You can use weights per awareness attribute to control the distribution of search or HTTP traffic across zones. This is commonly used for zonal deployments, heterogeneous instances, and routing traffic away from zones during zonal failure.
+
+## HTTP and path methods
+
+```
+PUT /_cluster/routing/awareness/<attribute>/weights
+GET /_cluster/routing/awareness/<attribute>/weights?local
+GET /_cluster/routing/awareness/<attribute>/weights
+DELETE /_cluster/routing/awareness/<attribute>/weights
+```
+
+## Path parameters
+
+Parameter | Type | Description
+:--- | :--- | :---
+attribute | String | The name of the awareness attribute, usually `zone`. The attribute name must match the values listed in the request body when assigning weights to zones.
+
+## Request body parameters
+
+Parameter | Type | Description
+:--- | :--- | :---
+weights | JSON object | Assigns weights to attributes within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio with 3 zones, for every 100 requests sent to the cluster, each zone would receive either 20, 30, or 50 search requests in a random order. When assigned a weight of `0`, the zone does not receive any search traffic. 
+_version | String | Implements optimistic concurrency control (OCC) through versioning. The parameter uses simple versioning, such as `1`, and increments upward based on each subsequent modification. This allows any servers from which a request originates to validate whether or not a zone has been modified. 
+
+
+In the following example request body, `zone_1` and `zone_2` each receive 50 of every 100 requests, whereas `zone_3` is prevented from receiving requests:
+
+```
+{ 
+      "weights":
+      {
+        "zone_1": "5", 
+        "zone_2": "5", 
+        "zone_3": "0"
+      },
+      "_version" : 1
+}
+```
+
+## Example: Weighted round robin search
+
+The following example request creates a round robin shard allocation for search traffic by using a 1:1:0 weight ratio, so `zone_1` and `zone_2` share requests evenly and `zone_3` receives none:
+
+### Request
+
+PUT /_cluster/routing/awareness/zone/weights
+{ 
+      "weights":
+      {
+        "zone_1": "1", 
+        "zone_2": "1", 
+        "zone_3": "0"
+      },
+      "_version" : 1
+}
+
+### Response
+
+```
+{
+     "acknowledged": true
+}
+```
+
+
+## Example: Getting weights for all zones
+
+The following example request gets weights for all zones.
+
+### Request
+
+```
+GET /_cluster/routing/awareness/zone/weights
+```
+
+### Response
+
+OpenSearch responds with the weight of each zone:
+
+```json
+{
+      "weights":
+      {
+      
+        "zone_1": "1.0", 
+        "zone_2": "1.0", 
+        "zone_3": "0.0"
+      },
+      "_version":1
+}
+```
+
+## Example: Deleting weights
+
+You can remove your weight ratio for each zone using the `DELETE` method.
+
+### Request
+
+```
+DELETE /_cluster/routing/awareness/zone/weights
+```
+
+### Response
+
+```json
+{
+   "_version":1
+}
+```
+
+## Next steps
+
+- For more information about zone decommissioning, see [Cluster decommission]({{site.url}}{{site.baseurl}}/api-reference/cluster-decommission/).
+- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
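
A minimal Python sketch of the weight arithmetic described in `cluster-awareness.md`, assuming requests are split in direct proportion to the weights, as the 2:3:5 example suggests. The function names `expected_requests` and `build_weights_body` are hypothetical helpers for illustration only; the sketch does not contact a cluster and is not part of any OpenSearch client.

```python
import json


def expected_requests(weights, total=100):
    """Split `total` requests across zones in proportion to their weights."""
    # Weights appear as strings in the request body examples, so convert them first.
    numeric = {zone: float(weight) for zone, weight in weights.items()}
    weight_sum = sum(numeric.values())
    if weight_sum <= 0:
        raise ValueError("at least one zone needs a positive weight")
    return {zone: total * weight / weight_sum for zone, weight in numeric.items()}


def build_weights_body(weights, version):
    """Serialize a request body shaped like the documented PUT examples."""
    body = {
        "weights": {zone: str(weight) for zone, weight in weights.items()},
        "_version": version,
    }
    return json.dumps(body)


if __name__ == "__main__":
    print(expected_requests({"zone_1": "2", "zone_2": "3", "zone_3": "5"}))
    print(build_weights_body({"zone_1": 1, "zone_2": 1, "zone_3": 0}, 1))
```

Serializing the body with `json.dumps` also avoids hand-written JSON mistakes such as a missing comma between `weights` and `_version`.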
--- a/_api-reference/cluster-decommission.md
+++ b/_api-reference/cluster-decommission.md
@ -0,0 +1,80 @@
+---
+layout: default
+title: Cluster decommission 
+nav_order: 20
+---
+
+# Cluster decommission
+
+The cluster decommission operation adds support for decommissioning based on awareness. It greatly benefits multi-zone deployments, where awareness attributes, such as `zone`, can aid in applying new upgrades to a cluster in a controlled fashion. This is especially useful during outages, when you can decommission the unhealthy zone to prevent replication requests from stalling and your request backlog from becoming too large.
+
+For more information about allocation awareness, see [Shard allocation awareness]({{site.url}}{{site.baseurl}}/opensearch/cluster/#shard-allocation-awareness).
+
+
+## HTTP and path methods
+
+```
+PUT  /_cluster/decommission/awareness/{awareness_attribute_name}/{awareness_attribute_value}
+GET  /_cluster/decommission/awareness/{awareness_attribute_name}/_status
+DELETE /_cluster/decommission/awareness
+```
+
+## URL parameters
+
+Parameter | Type | Description
+:--- | :--- | :---
+awareness_attribute_name | String | The name of the awareness attribute, usually `zone`.
+awareness_attribute_value | String | The value of the awareness attribute. For example, if you have shards allocated in two different zones, you can give each zone a value of `zone-a` or `zone-b`. The cluster decommission operation decommissions the zone listed in the method.
+
+
+## Example: Decommissioning and recommissioning a zone
+
+You can use the following example requests to decommission and recommission a zone:
+
+### Request
+
+The following example request decommissions `zone-a`:
+
+```
+PUT /_cluster/decommission/awareness/zone/zone-a
+```
+
+If you want to recommission a decommissioned zone, you can use the `DELETE` method:
+
+```
+DELETE /_cluster/decommission/awareness
+```
+
+### Response
+
+
+```json
+{
+      "acknowledged": true
+}
+```
+
+## Example: Getting zone decommission status
+
+The following example request returns the decommission status of all zones.
+
+### Request
+
+```
+GET /_cluster/decommission/awareness/zone/_status
+```
+
+
+### Response
+
+```json
+{
+     "zone-1": "INIT | DRAINING | IN_PROGRESS | SUCCESSFUL | FAILED"
+}
+```
+
+
+## Next steps
+
+- For more information about zone awareness and weight, see [Cluster awareness]({{site.url}}{{site.baseurl}}/api-reference/cluster-awareness/).
+- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
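
A hedged Python sketch of how the decommission paths and status values in `cluster-decommission.md` fit together. The helper names are hypothetical, and treating `SUCCESSFUL` and `FAILED` as the only terminal states is an assumption drawn from the status names, not something the documentation states. The sketch only builds paths and inspects an already parsed response; it sends nothing to a cluster.

```python
from urllib.parse import quote

# Status values listed in the example response.
KNOWN_STATUSES = ("INIT", "DRAINING", "IN_PROGRESS", "SUCCESSFUL", "FAILED")
# Assumption: only these two statuses mean the operation has stopped.
TERMINAL_STATUSES = {"SUCCESSFUL", "FAILED"}


def decommission_path(attribute, value):
    """Path for the PUT request that decommissions one attribute value."""
    return "/_cluster/decommission/awareness/{}/{}".format(
        quote(attribute, safe=""), quote(value, safe="")
    )


def status_path(attribute):
    """Path for the GET request that returns the decommission status."""
    return "/_cluster/decommission/awareness/{}/_status".format(quote(attribute, safe=""))


def is_finished(status_response):
    """Return True when every zone in the parsed response has a terminal status."""
    for zone, status in status_response.items():
        if status not in KNOWN_STATUSES:
            raise ValueError("unexpected status {!r} for {}".format(status, zone))
    return bool(status_response) and all(
        status in TERMINAL_STATUSES for status in status_response.values()
    )


if __name__ == "__main__":
    print(decommission_path("zone", "zone-a"))
    print(status_path("zone"))
    print(is_finished({"zone-1": "DRAINING"}))
```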
--- a/_api-reference/cluster-health.md
+++ b/_api-reference/cluster-health.md
@ -1,7 +1,7 @@
 ---
 layout: default
 title: Cluster health
-nav_order: 16
+nav_order: 17
 ---

 # Cluster health
@ -47,6 +47,7 @@ wait_for_events | Enum | Wait until all currently queued events with the given p
 wait_for_no_relocating_shards | Boolean | Whether to wait until there are no relocating shards in the cluster. Default is false.
 wait_for_no_initializing_shards | Boolean | Whether to wait until there are no initializing shards in the cluster. Default is false.
 wait_for_status | Enum | Wait until the cluster health reaches the specified status or better. Supported values are `green`, `yellow`, and `red`.
+weights | JSON object | Assigns weights to attributes within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio with three zones, for every 100 requests sent to the cluster, each zone would receive either 20, 30, or 50 search requests in a random order. When assigned a weight of `0`, the zone does not receive any search traffic. 

 #### Sample request

--- a/_api-reference/cluster-settings.md
+++ b/_api-reference/cluster-settings.md
@ -1,7 +1,7 @@
 ---
 layout: default
 title: Cluster settings
-nav_order: 17
+nav_order: 18
 ---

 # Cluster settings
--- a/_api-reference/count.md
+++ b/_api-reference/count.md
@ -1,7 +1,7 @@
 ---
 layout: default
 title: Count
-nav_order: 20
+nav_order: 21
 ---

 # Count
--- a/_ml-commons-plugin/cluster-settings.md
+++ b/_ml-commons-plugin/cluster-settings.md
@ -12,7 +12,7 @@ To enhance and customize your OpenSearch cluster for machine learning (ML), you

 ## Run tasks and models on ML nodes only

-If `true`, ML Commons tasks and models run machine learning (ML) tasks on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. Don't set as `false` on a production cluster. 
+If `true`, ML Commons tasks and models run machine learning (ML) tasks on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. We recommend that you do not set this value to `false` on production clusters. 

 ### Setting

@ -27,7 +27,7 @@ plugins.ml_commons.only_run_on_ml_node: true

 ## Dispatch tasks to ML node 

-`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers all ML nodes' runtime information, such as JVM heap memory usage and running tasks, then dispatches tasks to the ML node with the least load.
+`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers runtime information from all ML nodes, like JVM heap memory usage and running tasks, and then dispatches the tasks to the ML node with the lowest load.


 ### Setting
@ -43,7 +43,9 @@ plugins.ml_commons.task_dispatch_policy: round_robin
 - Value range: `round_robin` or `least_load`


-## Set sync up job intervals 
+## Set sync job intervals 
+
+When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular job to sync newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync jobs.

 When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular sync up job to sync up newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync up jobs.

@ -60,7 +62,7 @@ plugins.ml_commons.sync_up_job_interval_in_seconds: 10

 ## Predict monitoring requests

-Controls how many predict requests are monitored on one node. If set to `0`, OpenSearch clears all monitoring predict requests in the node's cache, and does not monitor predict requests from that point forward.
+Controls how many upload model tasks can run in parallel on one node. If set to `0`, you cannot upload models to any node.

 ### Setting

@ -92,7 +94,7 @@ plugins.ml_commons.max_upload_model_tasks_per_node: 10

 ## Load model tasks per node

-Controls how many load model tasks can run in parallel on one node. If set to `0`, you cannot load models to any node.
+Controls how many load model tasks can run in parallel on one node. If set to 0, you cannot load models to any node.

 ### Setting

@ -107,7 +109,7 @@ plugins.ml_commons.max_load_model_tasks_per_node: 10

 ## Add trusted URL

-The default value allows uploading a model file from any `http`, `https`, `ftp`, or local file. You can change this value to restrict trusted model URL.
+The default value allows you to upload a model file from any `http`, `https`, `ftp`, or local file. You can change this value to restrict trusted model URLs.


 ### Setting
@ -120,3 +122,35 @@ plugins.ml_commons.trusted_url_regex: ^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=

 - Default value: `^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=~_\|!:,.;]*[-a-zA-Z0-9+&@#/%=~_\|]`
 - Value range: Java regular expression (regex) string
+
+## Assign task timeout
+
+Assigns how long, in seconds, an ML task will live. After the timeout, the task fails.
+
+### Setting
+
+```
+plugins.ml_commons.ml_task_timeout_in_seconds: 600
+```
+
+### Values
+
+- Default value: 600
+- Value range: [1, 86400]
+
+## Set native memory threshold 
+
+Sets a circuit breaker that checks all system memory usage before running an ML task. If the native memory exceeds the threshold, OpenSearch throws an exception and stops running any ML task. 
+
+Values are based on the percentage of memory available. When set to `0`, no ML tasks will run. When set to `100`, the circuit breaker closes and no threshold exists.
+
+### Setting
+
+```
+plugins.ml_commons.native_memory_threshold: 90
+```
+
+### Values
+
+- Default value: 90
+- Value range: [0, 100]
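
A small Python sketch that checks candidate values against the ranges documented for the ML Commons settings above. The setting names, ranges, and policy values come from this diff; the `validate_ml_settings` helper and its error messages are hypothetical, and the sketch does not apply anything to a cluster.

```python
# Inclusive integer ranges taken from the "Values" sections above.
RANGES = {
    "plugins.ml_commons.ml_task_timeout_in_seconds": (1, 86400),
    "plugins.ml_commons.native_memory_threshold": (0, 100),
}
# Allowed values for the task dispatch policy.
DISPATCH_POLICIES = {"round_robin", "least_load"}
DISPATCH_SETTING = "plugins.ml_commons.task_dispatch_policy"


def validate_ml_settings(settings):
    """Return a list of error messages; an empty list means no problems were found."""
    errors = []
    for name, value in settings.items():
        if name in RANGES:
            low, high = RANGES[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not low <= value <= high:
                errors.append(
                    "{} must be an integer in [{}, {}], got {!r}".format(name, low, high, value)
                )
        elif name == DISPATCH_SETTING:
            if not isinstance(value, str) or value not in DISPATCH_POLICIES:
                errors.append(
                    "{} must be round_robin or least_load, got {!r}".format(name, value)
                )
        # Settings this sketch does not know about are left unchecked.
    return errors


if __name__ == "__main__":
    print(validate_ml_settings({
        "plugins.ml_commons.ml_task_timeout_in_seconds": 600,
        "plugins.ml_commons.native_memory_threshold": 101,
    }))
```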