Revise search backpressure documentation for 2.6 (#2989)

* Revise search backpressure documentation for 2.6

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Second iteration of changes

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Changed wording from canceled to marked for cancellation

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Revise cancellation stats description

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Implemented editorial feedback

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
kolchfa-aws 2023-02-28 11:17:34 -05:00 committed by GitHub
parent 5303f8fa7c
commit 826b46b018
1 changed files with 51 additions and 13 deletions


@ -20,7 +20,7 @@ To decide whether to apply search backpressure, OpenSearch periodically measures
- Heap usage
- Elapsed time
An observer thread periodically measures the resource usage of the node. If OpenSearch determines that the node is under duress, OpenSearch examines the resource usage of each search shard task and compares it against configurable thresholds. OpenSearch considers CPU usage, heap usage, and elapsed time and assigns each task a cancellation score that is then used to cancel the most resource-intensive tasks.
An observer thread periodically measures the resource usage of the node. If OpenSearch determines that the node is under duress, OpenSearch examines the resource usage of each search task and search shard task and compares it against configurable thresholds. OpenSearch considers CPU usage, heap usage, and elapsed time and assigns each task a cancellation score that is then used to cancel the most resource-intensive tasks.
OpenSearch limits the number of cancellations to a fraction of successful task completions. Additionally, it limits the number of cancellations per unit time. OpenSearch continues to monitor and cancel tasks until the node is no longer under duress.
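The cancellation limits described above can be illustrated with a minimal sketch. This is a simplified model, not the OpenSearch implementation: the function name `allowed_cancellations` is hypothetical, and the default values are assumed to mirror the search shard task settings (`cancellation_ratio`, `cancellation_rate`, `cancellation_burst`) listed in the settings table.

```python
def allowed_cancellations(successful_completions, elapsed_millis,
                          cancellation_ratio=0.1,
                          cancellation_rate=0.003,
                          cancellation_burst=10):
    """Upper bound on cancellations for one observer iteration.

    Hypothetical helper: a simplified model of the limits, not OpenSearch code.
    """
    # Limit 1: a fraction of successful task completions.
    by_ratio = int(successful_completions * cancellation_ratio)
    # Limit 2: a number of cancellations per millisecond of elapsed time.
    by_rate = int(elapsed_millis * cancellation_rate)
    # Limit 3: a hard cap on cancellations in a single iteration.
    return max(0, min(by_ratio, by_rate, cancellation_burst))


# 55 completions allow 5 by ratio; 1,500 ms allow 4 by rate; the burst cap is 10.
print(allowed_cancellations(successful_completions=55, elapsed_millis=1500))
```

In this model the tightest of the three limits always wins, so raising only one setting may have no visible effect if another limit is already lower.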
@ -81,19 +81,30 @@ Search backpressure adds several settings to the standard OpenSearch cluster set
Setting | Default | Description
:--- | :--- | :---
search_backpressure.mode | `monitor_only` | The search backpressure [mode](#search-backpressure-modes). Valid values are `monitor_only`, `enforced`, or `disabled`.
search_backpressure.interval_millis | 1,000 ms | The interval at which the observer thread measures the resource usage and cancels tasks.
search_backpressure.cancellation_ratio | 10% | The maximum number of tasks to cancel, as a percentage of successful task completions.
search_backpressure.cancellation_rate | 0.003 | The maximum number of tasks to cancel per millisecond of elapsed time.
search_backpressure.cancellation_burst | 10 | The maximum number of tasks to cancel in a single iteration of the observer thread.
search_backpressure.node_duress.num_successive_breaches | 3 | The number of successive limit breaches after which the node is considered under duress.
search_backpressure.node_duress.cpu_threshold | 90% | The CPU usage threshold (as a percentage) required for a node to be considered under duress.
search_backpressure.node_duress.heap_threshold | 70% | The heap usage threshold (as a percentage) required for a node to be considered under duress.
search_backpressure.search_shard_task.total_heap_percent_threshold | 5% | The heap usage threshold (as a percentage) required for the sum of heap usages of all search shard tasks before cancellation is applied.
search_backpressure.cancellation_ratio<br> *Deprecated in 2.6. Replaced by search_backpressure.search_shard_task.cancellation_ratio* | 10% | The maximum number of tasks to cancel, as a percentage of successful task completions.
search_backpressure.cancellation_rate<br> *Deprecated in 2.6. Replaced by search_backpressure.search_shard_task.cancellation_rate* | 0.003 | The maximum number of tasks to cancel per millisecond of elapsed time.
search_backpressure.cancellation_burst<br> *Deprecated in 2.6. Replaced by search_backpressure.search_shard_task.cancellation_burst* | 10 | The maximum number of search shard tasks to cancel in a single iteration of the observer thread.
search_backpressure.node_duress.num_successive_breaches | 3 | The number of successive limit breaches after which the node is considered to be under duress.
search_backpressure.node_duress.cpu_threshold | 90% | The CPU usage threshold (as a percentage) required for a node to be considered to be under duress.
search_backpressure.node_duress.heap_threshold | 70% | The heap usage threshold (as a percentage) required for a node to be considered to be under duress.
search_backpressure.search_task.elapsed_time_millis_threshold | 45,000 | The elapsed time threshold (in milliseconds) required for an individual parent task before it is considered for cancellation.
search_backpressure.search_task.cancellation_ratio | 0.1 | The maximum number of search tasks to cancel, as a fraction of successful search task completions (0.1 corresponds to 10%).
search_backpressure.search_task.cancellation_rate | 0.003 | The maximum number of search tasks to cancel per millisecond of elapsed time.
search_backpressure.search_task.cancellation_burst | 5 | The maximum number of search tasks to cancel in a single iteration of the observer thread.
search_backpressure.search_task.heap_percent_threshold | 2% | The heap usage threshold (as a percentage) required for an individual parent task before it is considered for cancellation.
search_backpressure.search_task.total_heap_percent_threshold | 5% | The heap usage threshold (as a percentage) required for the sum of heap usages of all search tasks before cancellation is applied.
search_backpressure.search_task.heap_variance | 2.0 | The heap usage variance required for an individual parent task before it is considered for cancellation. A task is considered for cancellation when `taskHeapUsage` is greater than or equal to `heapUsageMovingAverage` * `variance`.
search_backpressure.search_task.heap_moving_average_window_size | 10 | The window size used to calculate the rolling average of the heap usage for the completed parent tasks.
search_backpressure.search_task.cpu_time_millis_threshold | 30,000 | The CPU time threshold (in milliseconds) required for an individual parent task before it is considered for cancellation.
search_backpressure.search_shard_task.elapsed_time_millis_threshold | 30,000 | The elapsed time threshold (in milliseconds) required for a single search shard task before it is considered for cancellation.
search_backpressure.search_shard_task.cancellation_ratio | 0.1 | The maximum number of search shard tasks to cancel, as a fraction of successful search shard task completions (0.1 corresponds to 10%).
search_backpressure.search_shard_task.cancellation_rate | 0.003 | The maximum number of search shard tasks to cancel per millisecond of elapsed time.
search_backpressure.search_shard_task.cancellation_burst | 10 | The maximum number of search shard tasks to cancel in a single iteration of the observer thread.
search_backpressure.search_shard_task.heap_percent_threshold | 0.5% | The heap usage threshold (as a percentage) required for a single search shard task before it is considered for cancellation.
search_backpressure.search_shard_task.total_heap_percent_threshold | 5% | The heap usage threshold (as a percentage) required for the sum of heap usages of all search shard tasks before cancellation is applied.
search_backpressure.search_shard_task.heap_variance | 2.0 | The minimum variance required for a single search shard task's heap usage compared to the rolling average of previously completed tasks before it is considered for cancellation.
search_backpressure.search_shard_task.heap_moving_average_window_size | 100 | The number of previously completed search shard tasks to consider when calculating the rolling average of heap usage.
search_backpressure.search_shard_task.cpu_time_millis_threshold | 15,000 ms | The CPU usage threshold (in milliseconds) required for a single search shard task before it is considered for cancellation.
search_backpressure.search_shard_task.elapsed_time_millis_threshold | 30,000 ms | The elapsed time threshold (in milliseconds) required for a single search shard task before it is considered for cancellation.
search_backpressure.search_shard_task.cpu_time_millis_threshold | 15,000 | The CPU time threshold (in milliseconds) required for a single search shard task before it is considered for cancellation.
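
The `heap_variance` and `heap_moving_average_window_size` settings work together: a task is considered for cancellation when `taskHeapUsage` is greater than or equal to `heapUsageMovingAverage` * `variance`. The following is a minimal sketch of that comparison only. The class `HeapUsageTracker` and its methods are hypothetical names, and returning `False` when no tasks have completed yet is an assumption made for the sketch, not documented behavior.

```python
from collections import deque


class HeapUsageTracker:
    """Hypothetical model of the rolling-average heap check."""

    def __init__(self, window_size=100, variance=2.0):
        # Keep only the heap usage of the most recent completed tasks.
        self.window = deque(maxlen=window_size)
        self.variance = variance

    def record_completed(self, heap_bytes):
        # Called when a task completes; the oldest sample drops out when full.
        self.window.append(heap_bytes)

    def moving_average(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def is_candidate(self, task_heap_bytes):
        # Assumption: with no completed tasks there is no baseline to compare to.
        if not self.window:
            return False
        return task_heap_bytes >= self.moving_average() * self.variance


tracker = HeapUsageTracker(window_size=3, variance=2.0)
for usage in (100, 200, 300):
    tracker.record_completed(usage)

print(tracker.moving_average())   # average of 100, 200, 300
print(tracker.is_candidate(400))  # 400 >= 200 * 2.0
```

A larger window makes the baseline slower to react to a burst of heavy queries, while a smaller window tracks recent load more closely.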
## Search Backpressure Stats API
Introduced 2.4
@ -136,6 +147,30 @@ The response contains server-side request cancellation statistics:
"shard_indexing_pressure_enabled": "true"
},
"search_backpressure": {
"search_task": {
"resource_tracker_stats": {
"heap_usage_tracker": {
"cancellation_count": 57,
"current_max_bytes": 5739204,
"current_avg_bytes": 962465,
"rolling_avg_bytes": 4009239
},
"elapsed_time_tracker": {
"cancellation_count": 97,
"current_max_millis": 15902,
"current_avg_millis": 9705
},
"cpu_usage_tracker": {
"cancellation_count": 64,
"current_max_millis": 8483,
"current_avg_millis": 7843
}
},
"cancellation_stats": {
"cancellation_count": 102,
"cancellation_limit_reached_count": 25
}
},
"search_shard_task": {
"resource_tracker_stats": {
"heap_usage_tracker": {
@ -174,9 +209,12 @@ The response contains the following fields.
Field Name | Data type | Description
:--- | :--- | :---
search_backpressure | Object | Statistics about search backpressure.
search_backpressure.search_task | Object | Statistics specific to the search task.
search_backpressure.search_task.[resource_tracker_stats](#resource_tracker_stats) | Object | Statistics about the current search tasks.
search_backpressure.search_task.[cancellation_stats](#cancellation_stats) | Object | Statistics about the search tasks canceled since the node last restarted.
search_backpressure.search_shard_task | Object | Statistics specific to the search shard task.
search_backpressure.search_shard_task.[resource_tracker_stats](#resource_tracker_stats) | Object | Statistics about the current tasks.
search_backpressure.search_shard_task.[cancellation_stats](#cancellation_stats) | Object | Statistics about the tasks canceled since the node last restarted.
search_backpressure.search_shard_task.[resource_tracker_stats](#resource_tracker_stats) | Object | Statistics about the current search shard tasks.
search_backpressure.search_shard_task.[cancellation_stats](#cancellation_stats) | Object | Statistics about the search shard tasks canceled since the node last restarted.
search_backpressure.mode | String | The [mode](#search-backpressure-modes) for search backpressure.
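
As an illustration of how these fields might be consumed, the following sketch extracts the cancellation statistics for both task types from an already parsed `search_backpressure` object. The function name `summarize_cancellations` is hypothetical, and treating missing fields as 0 is an assumption; the field names follow the response shown above.

```python
def summarize_cancellations(search_backpressure):
    """Collect cancellation_stats per task type from a parsed response object."""
    summary = {}
    for task_type in ("search_task", "search_shard_task"):
        stats = search_backpressure.get(task_type, {}).get("cancellation_stats", {})
        summary[task_type] = {
            # Assumption: a missing field is reported as 0.
            "cancellation_count": stats.get("cancellation_count", 0),
            "cancellation_limit_reached_count": stats.get(
                "cancellation_limit_reached_count", 0
            ),
        }
    return summary


# Values taken from the example response for search_task.
sample = {
    "search_task": {
        "cancellation_stats": {
            "cancellation_count": 102,
            "cancellation_limit_reached_count": 25,
        }
    },
    "mode": "monitor_only",
}
print(summarize_cancellations(sample))
```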
### `resource_tracker_stats`