Update docs for Remote Segment and Remote Translog store stats surfaced via nodes/indices/cluster stats APIs (#4995)

* Update docs for remote_store.moving_average_window_size

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>

* Update docs for Remote Segment and Remote Translog store stats surfaced via nodes/indices/cluster stats APIs

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _api-reference/nodes-apis/nodes-stats.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update nodes-stats.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update cluster-stats.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: Bhumika Saini <sabhumik@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Bhumika Saini 2023-09-13 23:24:39 +05:30 committed by GitHub
parent 60ee1fad77
commit 58789e75d2
5 changed files with 249 additions and 10 deletions

@ -107,6 +107,7 @@ The following request field parameters are compatible with the cluster API.
| cluster.max_shards_per_node | Integer | Limits the total number of primary and replica shards for the cluster. The limit is calculated as follows: `cluster.max_shards_per_node` multiplied by the number of non-frozen data nodes. Shards for closed indexes do not count toward this limit. Default is `1000`. |
| cluster.persistent_tasks.allocation.enable | String | Enables or disables allocation for persistent tasks: <br /> <br /> `all` Allows persistent tasks to be assigned to nodes. <br /> <br /> `none` No allocations are allowed for persistent tasks. This does not affect persistent tasks already running. <br /> <br /> Default is `all`. |
| cluster.persistent_tasks.allocation.recheck_interval | Time unit | The cluster manager automatically checks whether or not persistent tasks need to be assigned when the cluster state changes in a significant way. There are other factors, such as memory usage, that will affect whether or not persistent tasks are assigned to nodes but do not otherwise cause the cluster state to change. This setting defines how often assignment checks are performed in response to these factors. Default is `30 seconds`, with a minimum of `10 seconds` being required. |
| remote_store.moving_average_window_size | Integer | The moving average window size used to calculate the rolling statistic values exposed through the [Remote Store Stats API]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/). Default is `20`. Minimum enforced is `5`. |
#### Example request

@ -116,6 +116,29 @@ Parameter | Type | Description
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 1112,
"max_unsafe_auto_id_timestamp": 1644269449096,
"remote_store" : {
"upload" : {
"total_upload_size" : {
"started_bytes" : 152419,
"succeeded_bytes" : 152419,
"failed_bytes" : 0
},
"refresh_size_lag" : {
"total_bytes" : 0,
"max_bytes" : 0
},
"max_refresh_time_lag_in_millis" : 0,
"total_time_spent_in_millis" : 516
},
"download" : {
"total_download_size" : {
"started_bytes" : 0,
"succeeded_bytes" : 0,
"failed_bytes" : 0
},
"total_time_spent_in_millis" : 0
}
},
"file_sizes": {}
},
"mappings": {
@ -491,3 +514,4 @@ nodes.network_types | The transport and HTTP networks within the nodes.
nodes.discovery_type | The method the nodes use to find other nodes within the cluster.
nodes.packaging_types | Information about the nodes' OpenSearch distribution.
nodes.ingest | Information about the nodes' ingest pipelines/nodes, if there are any.
total_time_spent | The total amount of time spent, across all shards in the cluster, downloading from and uploading to the remote store.

@ -199,6 +199,29 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"remote_store" : {
"upload" : {
"total_upload_size" : {
"started_bytes" : 152419,
"succeeded_bytes" : 152419,
"failed_bytes" : 0
},
"refresh_size_lag" : {
"total_bytes" : 0,
"max_bytes" : 0
},
"max_refresh_time_lag_in_millis" : 0,
"total_time_spent_in_millis" : 516
},
"download" : {
"total_download_size" : {
"started_bytes" : 0,
"succeeded_bytes" : 0,
"failed_bytes" : 0
},
"total_time_spent_in_millis" : 0
}
},
"file_sizes": {}
},
"translog": {
@ -206,7 +229,21 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"size_in_bytes": 55,
"uncommitted_operations": 0,
"uncommitted_size_in_bytes": 55,
"earliest_last_modified_age": 142622215
"earliest_last_modified_age": 142622215,
"remote_store" : {
"upload" : {
"total_uploads" : {
"started" : 57,
"failed" : 0,
"succeeded" : 57
},
"total_upload_size" : {
"started_bytes" : 16830,
"failed_bytes" : 0,
"succeeded_bytes" : 16830
}
}
}
},
"request_cache": {
"memory_size_in_bytes": 0,
@ -326,6 +363,29 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"remote_store" : {
"upload" : {
"total_upload_size" : {
"started_bytes" : 152419,
"succeeded_bytes" : 152419,
"failed_bytes" : 0
},
"refresh_size_lag" : {
"total_bytes" : 0,
"max_bytes" : 0
},
"max_refresh_time_lag_in_millis" : 0,
"total_time_spent_in_millis" : 516
},
"download" : {
"total_download_size" : {
"started_bytes" : 0,
"succeeded_bytes" : 0,
"failed_bytes" : 0
},
"total_time_spent_in_millis" : 0
}
},
"file_sizes": {}
},
"translog": {
@ -333,7 +393,21 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"size_in_bytes": 55,
"uncommitted_operations": 0,
"uncommitted_size_in_bytes": 55,
"earliest_last_modified_age": 142622215
"earliest_last_modified_age": 142622215,
"remote_store" : {
"upload" : {
"total_uploads" : {
"started" : 57,
"failed" : 0,
"succeeded" : 57
},
"total_upload_size" : {
"started_bytes" : 16830,
"failed_bytes" : 0,
"succeeded_bytes" : 16830
}
}
}
},
"request_cache": {
"memory_size_in_bytes": 0,
@ -457,6 +531,29 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"remote_store" : {
"upload" : {
"total_upload_size" : {
"started_bytes" : 152419,
"succeeded_bytes" : 152419,
"failed_bytes" : 0
},
"refresh_size_lag" : {
"total_bytes" : 0,
"max_bytes" : 0
},
"max_refresh_time_lag_in_millis" : 0,
"total_time_spent_in_millis" : 516
},
"download" : {
"total_download_size" : {
"started_bytes" : 0,
"succeeded_bytes" : 0,
"failed_bytes" : 0
},
"total_time_spent_in_millis" : 0
}
},
"file_sizes": {}
},
"translog": {
@ -464,7 +561,21 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"size_in_bytes": 55,
"uncommitted_operations": 0,
"uncommitted_size_in_bytes": 55,
"earliest_last_modified_age": 142622215
"earliest_last_modified_age": 142622215,
"remote_store" : {
"upload" : {
"total_uploads" : {
"started" : 57,
"failed" : 0,
"succeeded" : 57
},
"total_upload_size" : {
"started_bytes" : 16830,
"failed_bytes" : 0,
"succeeded_bytes" : 16830
}
}
}
},
"request_cache": {
"memory_size_in_bytes": 0,
@ -584,6 +695,29 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"remote_store" : {
"upload" : {
"total_upload_size" : {
"started_bytes" : 152419,
"succeeded_bytes" : 152419,
"failed_bytes" : 0
},
"refresh_size_lag" : {
"total_bytes" : 0,
"max_bytes" : 0
},
"max_refresh_time_lag_in_millis" : 0,
"total_time_spent_in_millis" : 516
},
"download" : {
"total_download_size" : {
"started_bytes" : 0,
"succeeded_bytes" : 0,
"failed_bytes" : 0
},
"total_time_spent_in_millis" : 0
}
},
"file_sizes": {}
},
"translog": {
@ -591,7 +725,21 @@ By default, the returned statistics are aggregated in the `primaries` and `total
"size_in_bytes": 55,
"uncommitted_operations": 0,
"uncommitted_size_in_bytes": 55,
"earliest_last_modified_age": 142622215
"earliest_last_modified_age": 142622215,
"remote_store" : {
"upload" : {
"total_uploads" : {
"started" : 57,
"failed" : 0,
"succeeded" : 57
},
"total_upload_size" : {
"started_bytes" : 16830,
"failed_bytes" : 0,
"succeeded_bytes" : 16830
}
}
}
},
"request_cache": {
"memory_size_in_bytes": 0,

@ -245,6 +245,29 @@ Select the arrow to view the example response.
"version_map_memory_in_bytes" : 0,
"fixed_bit_set_memory_in_bytes" : 288,
"max_unsafe_auto_id_timestamp" : -1,
"remote_store" : {
"upload" : {
"total_upload_size" : {
"started_bytes" : 152419,
"succeeded_bytes" : 152419,
"failed_bytes" : 0
},
"refresh_size_lag" : {
"total_bytes" : 0,
"max_bytes" : 0
},
"max_refresh_time_lag_in_millis" : 0,
"total_time_spent_in_millis" : 516
},
"download" : {
"total_download_size" : {
"started_bytes" : 0,
"succeeded_bytes" : 0,
"failed_bytes" : 0
},
"total_time_spent_in_millis" : 0
}
},
"file_sizes" : { }
},
"translog" : {
@ -252,7 +275,21 @@ Select the arrow to view the example response.
"size_in_bytes" : 1452,
"uncommitted_operations" : 12,
"uncommitted_size_in_bytes" : 1452,
"earliest_last_modified_age" : 164160
"earliest_last_modified_age" : 164160,
"remote_store" : {
"upload" : {
"total_uploads" : {
"started" : 57,
"failed" : 0,
"succeeded" : 57
},
"total_upload_size" : {
"started_bytes" : 16830,
"failed_bytes" : 0,
"succeeded_bytes" : 16830
}
}
}
},
"request_cache" : {
"memory_size_in_bytes" : 1649,
@ -792,6 +829,27 @@ segments.index_writer_memory_in_bytes | Integer | The total amount of memory use
segments.version_map_memory_in_bytes | Integer | The total amount of memory used by all version maps, in bytes.
segments.fixed_bit_set_memory_in_bytes | Integer | The total amount of memory used by fixed bit sets, in bytes. Fixed bit sets are used for nested objects and join fields.
segments.max_unsafe_auto_id_timestamp | Integer | The timestamp for the most recently retired indexing request, in milliseconds since the epoch.
segments.segment_replication | Object | Segment replication statistics for all primary shards when segment replication is enabled on the node.
segments.segment_replication.maxBytesBehind | long | The maximum number of bytes by which a replica is behind its primary.
segments.segment_replication.totalBytesBehind | long | The total number of bytes by which replicas are behind their primaries.
segments.segment_replication.maxReplicationLag | long | The maximum amount of time, in milliseconds, taken by a replica to catch up to its primary.
segments.remote_store | Object | Statistics about remote segment store operations.
segments.remote_store.upload | Object | Statistics related to uploads to the remote segment store.
segments.remote_store.upload.total_upload_size | Object | The amount of data, in bytes, uploaded to the remote segment store.
segments.remote_store.upload.total_upload_size.started_bytes | Integer | The number of bytes to upload to the remote segment store after the upload has started.
segments.remote_store.upload.total_upload_size.succeeded_bytes | Integer | The number of bytes successfully uploaded to the remote segment store.
segments.remote_store.upload.total_upload_size.failed_bytes | Integer | The number of bytes that failed to upload to the remote segment store.
segments.remote_store.upload.refresh_size_lag | Object | The amount of lag during upload between the remote segment store and the local store.
segments.remote_store.upload.refresh_size_lag.total_bytes | Integer | The total number of bytes that lagged during the upload refresh between the remote segment store and the local store.
segments.remote_store.upload.refresh_size_lag.max_bytes | Integer | The maximum amount of lag, in bytes, during the upload refresh between the remote segment store and the local store.
segments.remote_store.upload.max_refresh_time_lag_in_millis | Integer | The maximum duration, in milliseconds, that the remote refresh is behind the local refresh.
segments.remote_store.upload.total_time_spent_in_millis | Integer | The total amount of time, in milliseconds, spent on uploads to the remote segment store.
segments.remote_store.download | Object | Statistics related to downloads from the remote segment store.
segments.remote_store.download.total_download_size | Object | The total amount of data downloaded from the remote segment store.
segments.remote_store.download.total_download_size.started_bytes | Integer | The number of bytes downloaded from the remote segment store after the download starts.
segments.remote_store.download.total_download_size.succeeded_bytes | Integer | The number of bytes successfully downloaded from the remote segment store.
segments.remote_store.download.total_download_size.failed_bytes | Integer | The number of bytes that failed to download from the remote segment store.
segments.remote_store.download.total_time_spent_in_millis | Integer | The total duration, in milliseconds, spent on downloads from the remote segment store.
segments.file_sizes | Integer | Statistics about the size of the segment files.
translog | Object | Statistics about transaction log operations for the node.
translog.operations | Integer | The number of translog operations.
@ -799,6 +857,16 @@ translog.size_in_bytes | Integer | The size of the translog, in bytes.
translog.uncommitted_operations | Integer | The number of uncommitted translog operations.
translog.uncommitted_size_in_bytes | Integer | The size of uncommitted translog operations, in bytes.
translog.earliest_last_modified_age | Integer | The earliest last modified age for the translog.
translog.remote_store | Object | Statistics related to operations from the remote translog store.
translog.remote_store.upload | Object | Statistics related to uploads to the remote translog store.
translog.remote_store.upload.total_uploads | Object | The number of syncs to the remote translog store.
translog.remote_store.upload.total_uploads.started | Integer | The number of upload syncs to the remote translog store that have started.
translog.remote_store.upload.total_uploads.failed | Integer | The number of failed upload syncs to the remote translog store.
translog.remote_store.upload.total_uploads.succeeded | Integer | The number of successful upload syncs to the remote translog store.
translog.remote_store.upload.total_upload_size | Object | The total amount of data uploaded to the remote translog store.
translog.remote_store.upload.total_upload_size.started_bytes | Integer | The number of bytes actively uploading to the remote translog store after the upload has started.
translog.remote_store.upload.total_upload_size.failed_bytes | Integer | The number of bytes that failed to upload to the remote translog store.
translog.remote_store.upload.total_upload_size.succeeded_bytes | Integer | The number of bytes successfully uploaded to the remote translog store.
request_cache | Object | Statistics about the request cache for the node.
request_cache.memory_size_in_bytes | Integer | The memory size used by the request cache, in bytes.
request_cache.evictions | Integer | The number of request cache evictions.
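
The `started_bytes`, `succeeded_bytes`, and `failed_bytes` counters described above can be combined into a quick health summary. The following Python sketch is illustrative only: the `summarize_transfer` helper is hypothetical, and treating `started - succeeded - failed` as the number of in-flight bytes is an assumption about how the counters relate, not something the field descriptions state. The sample values are taken from the example responses in this change.

```python
# Trimmed copy of the "remote_store" object shown under "segments"
# in the example responses.
segments_remote_store = {
    "upload": {
        "total_upload_size": {
            "started_bytes": 152419,
            "succeeded_bytes": 152419,
            "failed_bytes": 0,
        },
        "total_time_spent_in_millis": 516,
    },
    "download": {
        "total_download_size": {
            "started_bytes": 0,
            "succeeded_bytes": 0,
            "failed_bytes": 0,
        },
        "total_time_spent_in_millis": 0,
    },
}


def summarize_transfer(size_stats):
    """Hypothetical helper: derive simple ratios from a *_size stats object."""
    started = size_stats["started_bytes"]
    succeeded = size_stats["succeeded_bytes"]
    failed = size_stats["failed_bytes"]
    # Assumption: bytes that have started but neither succeeded nor failed
    # are still being transferred.
    in_flight = started - succeeded - failed
    # Avoid dividing by zero when no transfer has started yet.
    success_ratio = succeeded / started if started else None
    return {"in_flight_bytes": in_flight, "success_ratio": success_ratio}


upload_summary = summarize_transfer(
    segments_remote_store["upload"]["total_upload_size"]
)
download_summary = summarize_transfer(
    segments_remote_store["download"]["total_download_size"]
)
print(upload_summary)
print(download_summary)
```

The same helper can be applied to `translog.remote_store.upload.total_upload_size`, because it uses the same three field names.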

@ -18,8 +18,8 @@ Remote segment backpressure is a shard-level rejection mechanism that dynamicall
Remote segment backpressure is activated if any of the following thresholds is breached:
- **Consecutive failure**: The backpressure is activated if there are _N_ or more consecutive failures. The value of _N_ is configurable in `remote_store.segment.pressure.consecutive_failures.limit`.
- **Bytes lag**: The bytes lag is computed by adding the size of all files that are present in local committed segments but not in the remote store. The backpressure is activated if the bytes lag is greater than _K_ multiplied by the moving average of the size (in bytes) of files uploaded after each refresh. The variance factor _K_ is configurable in `remote_store.segment.pressure.bytes_lag.variance_factor`. The moving window size is configurable in `remote_store.segment.pressure.upload_bytes_moving_average_window_size`.
- **Time lag**: The time lag is computed by comparing the timestamps of the most recent local refresh and the most recent remote store segment upload. The backpressure is activated if the time lag is greater than _K_ multiplied by the moving average of the time taken to upload new segments and metadata file after each refresh. The variance factor _K_ is configurable in `remote_store.segment.pressure.time_lag.variance_factor`. The moving window size is configurable in `remote_store.segment.pressure.upload_time_moving_average_window_size`.
- **Bytes lag**: The bytes lag is calculated by adding the sizes of all the files that are present in local committed segments but not in the remote store. Backpressure is activated if the bytes lag is greater than _K_ multiplied by the moving average of the size, in bytes, of the files uploaded after each refresh. The variance factor _K_ is configurable through the `remote_store.segment.pressure.bytes_lag.variance_factor` setting. The moving window size is configurable through the `remote_store.moving_average_window_size` setting.
- **Time lag**: The time lag is calculated by comparing the timestamps of the most recent local refresh and the most recent remote store segment upload. Backpressure is activated if the time lag is greater than _K_ multiplied by the moving average of the time taken to upload new segments and metadata files after each refresh. The variance factor _K_ is configurable through the `remote_store.segment.pressure.time_lag.variance_factor` setting. The moving window size is configurable through the `remote_store.moving_average_window_size` setting.
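
The bytes lag and time lag checks share the same shape: a lag value is compared against _K_ multiplied by a moving average. The following Python sketch illustrates that comparison only. The `MovingAverage` class and `lag_breached` function are hypothetical names, the simple arithmetic mean over the last _N_ samples is an assumption, and the actual OpenSearch implementation may differ (for example, in how it treats a window that is not yet full). The defaults of `20` and `10` mirror the documented defaults for the window size and the variance factors.

```python
from collections import deque


class MovingAverage:
    """Hypothetical fixed-size window that averages the most recent samples."""

    def __init__(self, window_size=20):
        # deque with maxlen drops the oldest sample once the window is full.
        self._values = deque(maxlen=window_size)

    def record(self, value):
        self._values.append(value)

    def average(self):
        if not self._values:
            return 0.0
        return sum(self._values) / len(self._values)


def lag_breached(lag, moving_average, variance_factor=10.0):
    """Return True if the lag is greater than K times the moving average."""
    return lag > variance_factor * moving_average


# Example: sizes, in bytes, of the files uploaded after each refresh.
upload_sizes = MovingAverage(window_size=5)
for size in [100, 200, 300, 400, 500, 600]:
    upload_sizes.record(size)

# Only the last five samples remain in the window: (200 + ... + 600) / 5.
average_upload = upload_sizes.average()
print(average_upload)
print(lag_breached(4000, average_upload))
print(lag_breached(4001, average_upload))
```

Because the comparison is strict, a lag exactly equal to _K_ times the moving average does not activate backpressure in this sketch.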
## Handling segment merges
@ -36,13 +36,11 @@ The following table lists the settings used for activating backpressure. For thr
|`remote_store.segment.pressure.enabled` |Boolean | If `true`, enables remote segment backpressure. Default is `false`. |
|`remote_store.segment.pressure.consecutive_failures.limit` |Integer |The minimum consecutive failure count for activating remote segment backpressure. Default is `5`. |
|`remote_store.segment.pressure.bytes_lag.variance_factor` |Float | The variance factor that is used together with the moving average to calculate the dynamic bytes lag threshold for activating remote segment backpressure. Default is `10`. |
|`remote_store.segment.pressure.upload_bytes_moving_average_window_size` |Integer |The moving average window size that is used to calculate the dynamic bytes lag threshold for activating remote segment backpressure. The moving average is also exposed through the [Remote Store Stats API]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/). Default is `20`. |
|`remote_store.segment.pressure.time_lag.variance_factor` |Float |The variance factor that is used together with the moving average to calculate the dynamic time lag threshold for activating remote segment backpressure. Default is `10`. |
|`remote_store.segment.pressure.upload_time_moving_average_window_size` |Integer |The moving average window size that is used to calculate the dynamic time lag threshold for activating remote segment backpressure. The moving average is also exposed through the [Remote Store Stats API]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/). Default is `20`.|
The following table lists the settings used for statistics.
|Setting |Data type |Description |
|:--- |:--- |:--- |
|`remote_store.segment.pressure.upload_bytes_per_sec_moving_average_window_size` |Integer |The moving average window size, which is exposed through [Remote Store Stats API]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/). Default is `20`.|
| `remote_store.moving_average_window_size` | Integer | The moving average window size used to calculate the rolling statistic values exposed through the [Remote Store Stats API]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/). Default is `20`. Minimum enforced is `5`. |
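
As a rough illustration of how the consolidated setting might be applied, the following Python sketch builds a request body for the cluster settings API (`PUT _cluster/settings`). The `window_size_settings_body` helper is hypothetical, and applying the value under `persistent` is an assumption; the sketch only constructs and prints the JSON body and does not send a request. The client-side check mirrors the documented minimum of `5`.

```python
import json

MIN_WINDOW_SIZE = 5       # documented enforced minimum
DEFAULT_WINDOW_SIZE = 20  # documented default


def window_size_settings_body(window_size=DEFAULT_WINDOW_SIZE):
    """Hypothetical helper: build a cluster settings body for the window size."""
    if not isinstance(window_size, int) or window_size < MIN_WINDOW_SIZE:
        raise ValueError(
            f"window size must be an integer >= {MIN_WINDOW_SIZE}"
        )
    # Assumption: the setting is applied as a persistent cluster setting.
    return {
        "persistent": {
            "remote_store.moving_average_window_size": window_size
        }
    }


body = window_size_settings_body(30)
print(json.dumps(body, indent=2))
```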