From 5769c1618525eb8c0956b08241d30d7618934cab Mon Sep 17 00:00:00 2001 From: Heather Halter Date: Fri, 13 Oct 2023 15:12:19 -0700 Subject: [PATCH] Add new Performance Analyzer metrics and fix table formatting (#5182) * fixed problem with table and added new metrics Signed-off-by: Heather Halter * fix formatting Signed-off-by: Heather Halter * fixed table formatting Signed-off-by: Heather Halter * Update reference.md Signed-off-by: Heather Halter * Update reference.md Signed-off-by: Heather Halter * Update reference.md Signed-off-by: Heather Halter * Update _monitoring-your-cluster/pa/reference.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Heather Halter * Update _monitoring-your-cluster/pa/reference.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Heather Halter * Update _monitoring-your-cluster/pa/reference.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Heather Halter * Update _monitoring-your-cluster/pa/reference.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Heather Halter * Update _monitoring-your-cluster/pa/reference.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Heather Halter * Update _monitoring-your-cluster/pa/reference.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Heather Halter * Update reference.md Fixed the intro statement Signed-off-by: Heather Halter * Apply suggestions from code review Editorial corrections. Co-authored-by: Nathan Bower Signed-off-by: Heather Halter * Update reference.md Added a description under the Dimensions: N/A section and a couple editorial nits. 
Signed-off-by: Heather Halter * Update reference.md Signed-off-by: Heather Halter * Update reference.md Signed-off-by: Heather Halter * Update reference.md Signed-off-by: Heather Halter * Update reference.md Signed-off-by: Heather Halter * Update _monitoring-your-cluster/pa/reference.md Co-authored-by: Nathan Bower Signed-off-by: Heather Halter * Update reference.md Removed search backpressure info, as requested; removed info marked with TBD. Signed-off-by: Heather Halter --------- Signed-off-by: Heather Halter Signed-off-by: Heather Halter Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- _monitoring-your-cluster/pa/reference.md | 720 ++++++++++++++--------- 1 file changed, 444 insertions(+), 276 deletions(-) diff --git a/_monitoring-your-cluster/pa/reference.md b/_monitoring-your-cluster/pa/reference.md index 9fed7646..4d9e8532 100644 --- a/_monitoring-your-cluster/pa/reference.md +++ b/_monitoring-your-cluster/pa/reference.md @@ -9,27 +9,26 @@ redirect_from: # Metrics reference -This page contains all Performance Analyzer metrics. All metrics support the `avg`, `sum`, `min`, and `max` aggregations, although certain metrics measure only one thing, making the choice of aggregation irrelevant. +Performance Analyzer provides a number of metrics to help you evaluate performance. The following tables describe the available metrics, grouped by the dimensions that are most relevant for that metric. All metrics support the `avg`, `sum`, `min`, and `max` aggregations, although for certain metrics, the measured value is the same regardless of aggregation type. -For information on dimensions, see the [dimensions reference](#dimensions-reference). +For information about each of the dimensions, see [dimensions reference](#dimensions-reference) later in this topic. This list is extensive. We recommend using Ctrl/Cmd + F to find what you're looking for. 
{: .tip } +## Relevant dimensions: `ShardID`, `IndexName`, `Operation`, `ShardRole` + - + - - + - @@ -72,7 +71,7 @@ This list is extensive. We recommend using Ctrl/Cmd + F to find what you're look - @@ -114,25 +113,25 @@ This list is extensive. We recommend using Ctrl/Cmd + F to find what you're look - - - - @@ -146,12 +145,23 @@ This list is extensive. We recommend using Ctrl/Cmd + F to find what you're look - + + +
MetricDimensions Description
CPU_Utilization ShardID, IndexName, Operation, ShardRole - CPU usage ratio. CPU time (in milliseconds) used by the associated thread(s) in the past five seconds, divided by 5000 milliseconds.
Heap_AllocRate An approximation of the heap memory allocated, in bytes, per second in the past five seconds + An approximation, in bytes, of the heap memory allocated per second in the last 5 seconds.
Thread_Blocked_Time Average time (seconds) that the associated thread(s) blocked to enter or reenter a monitor. + The average amount of time, in seconds, that the associated thread has been blocked from entering or reentering a monitor.
Thread_Blocked_Event The total number of times that the associated thread(s) blocked to enter or reenter a monitor (i.e. the number of times a thread has been in the blocked state). + The total number of times that the associated thread has been blocked from entering or reentering a monitor (that is, the number of times a thread has been in the `blocked` state).
Thread_Waited_Time Average time (seconds) that the associated thread(s) waited to enter or reenter a monitor in WAITING or TIMED_WAITING state. + The average amount of time, in seconds, that the associated thread has waited to enter or reenter a monitor (that is, the amount of time a thread has been in the `WAITING` or `TIMED_WAITING` state).
Thread_Waited_Event The total number of times that the associated thread(s) waited to enter or reenter a monitor (i.e. the number of times a thread has been in the WAITING or TIMED_WAITING state). + The total number of times that the associated thread has waited to enter or reenter a monitor (that is, the number of times a thread has been in the WAITING or TIMED_WAITING state).
The total number of documents indexed in the past five seconds.
+ +## Relevant dimensions: `ShardID`, `IndexName` + + + + + + + + + - @@ -287,51 +297,84 @@ This list is extensive. We recommend using Ctrl/Cmd + F to find what you're look + +
MetricDescription
Indexing_ThrottleTime ShardID, IndexName - Time (milliseconds) that the index has been under merge throttling control in the past five seconds.
Estimated disk usage of the shard in bytes.
+ +## Relevant dimensions: `ShardID`, `IndexName`, `IndexingStage` + + + + + + + + + - - - - - - + +
MetricDescription
Indexing_Pressure_Current_Limits ShardID, IndexName, IndexingStage - Total heap size (in bytes) that is available for utilization by a shard of an index in a particular indexing stage (Coordinating, Primary or Replica). + The total heap size, in bytes, that is available for use by an index shard in a particular indexing stage (Coordinating, Primary, or Replica).
Indexing_Pressure_Current_Bytes Total heap size (in bytes) occupied by a shard of an index in a particular indexing stage (Coordinating, Primary or Replica). + The total heap size, in bytes, occupied by an index shard in a particular indexing stage (Coordinating, Primary, or Replica).
Indexing_Pressure_Last_Successful_Timestamp Timestamp of a request that was successful for a shard of an index in a particular indexing stage (Coordinating, Primary or Replica). + The timestamp of a successful request for an index shard in a particular indexing stage (Coordinating, Primary, or Replica).
Indexing_Pressure_Rejection_Count Total rejections performed by OpenSearch for a shard of an index in a particular indexing stage (Coordinating, Primary or Replica). + The total number of rejections performed by OpenSearch for an index shard in a particular indexing stage (Coordinating, Primary, or Replica).
Indexing_Pressure_Average_Window_Throughput Average throughput of the last n requests (The value of n is determined by `shard_indexing_pressure.secondary_parameter.throughput.request_size_window` setting) for a shard of an index in a particular indexing stage (Coordinating, Primary or Replica). + The average throughput of the last n requests (the value of n is determined by the `shard_indexing_pressure.secondary_parameter.throughput.request_size_window` setting) for an index shard in a particular indexing stage (Coordinating, Primary, or Replica).
+ +## Relevant dimensions: `Operation`, `Exception`, `Indices`, `HTTPRespCode`, `ShardID`, `IndexName`, `ShardRole` + + + - - + + + + + + + +
Latency - Operation, Exception, Indices, HTTPRespCode, ShardID, IndexName, ShardRole + MetricDescription
Latency Latency (milliseconds) of a request.
+ +## Relevant dimension: `MemType` + + + + + + + + + - @@ -365,11 +408,22 @@ This list is extensive. We recommend using Ctrl/Cmd + F to find what you're look + +
MetricDescription
GC_Collection_Event MemType - The number of garbage collections that have occurred in the past five seconds.
The amount of used memory in bytes.
+ +## Relevant dimension: `DiskName` + + + + + + + + + - @@ -385,51 +439,73 @@ This list is extensive. We recommend using Ctrl/Cmd + F to find what you're look + +
MetricDescription
Disk_Utilization DiskName - Disk utilization rate: percentage of disk time spent reading and writing by the OpenSearch process in the past five seconds.
Service rate: MB read or written per second in the past five seconds. This metric assumes that each disk sector stores 512 bytes.
+ +## Relevant dimension: `DestAddr` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
MetricDescription
Net_TCP_NumFlows + The number of samples collected. Performance Analyzer collects 1 sample every 5 seconds. +
Net_TCP_TxQ + The average number of TCP packets in the send buffer. +
Net_TCP_RxQ + The average number of TCP packets in the receive buffer. +
Net_TCP_Lost + The average number of unrecovered recurring timeouts. This number is reset when the recovery finishes or `SND.UNA` is advanced. `SND.UNA` is the sequence number of the first byte of data that has been sent but not yet acknowledged. +
Net_TCP_SendCWND + The average size, in bytes, of the sending congestion window. +
Net_TCP_SSThresh + The average size, in bytes, of the slow start size threshold. +
+ +## Relevant dimension: `Direction` + + + + + + + + + - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + +
MetricDescription
Net_TCP_NumFlows - DestAddr - Number of samples collected. Performance Analyzer collects one sample every five seconds. -
Net_TCP_TxQ - Average number of TCP packets in the send buffer. -
Net_TCP_RxQ - Average number of TCP packets in the receive buffer. -
Net_TCP_Lost - Average number of unrecovered recurring timeouts. This number is reset when the recovery finishes or `SND.UNA` is advanced. `SND.UNA` is the sequence number of the first byte of data that has been sent, but not yet acknowledged. -
Net_TCP_SendCWND - Average size (bytes) of the sending congestion window. -
Net_TCP_SSThresh - Average size (bytes) of the slow start size threshold. -
Net_PacketRate4 - Direction - The total number of IPv4 datagrams transmitted/received from/by interfaces per second, including those transmitted or received in error. - Net_PacketRate4 + The total number of IPv4 datagrams transmitted/received from/by interfaces per second, including those transmitted or received in error. +
Net_PacketDropRate4 @@ -455,225 +531,317 @@ This list is extensive. We recommend using Ctrl/Cmd + F to find what you're look The number of bits transmitted or received per second by all network interfaces.
+ + +## Relevant dimension: `ThreadPoolType` + + + - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ThreadPool_QueueSize - ThreadPoolType - The size of the task queue. - MetricDescription
ThreadPool_QueueSize + The size of the task queue. +
ThreadPool_RejectedReqs + The number of rejected executions. +
ThreadPool_TotalThreads + The current number of threads in the pool. +
ThreadPool_ActiveThreads + The approximate number of threads that are actively executing tasks. +
ThreadPool_QueueLatency + The latency of the task queue. +
ThreadPool_QueueCapacity + The current capacity of the task queue. +
+ +## Relevant dimension: `ClusterManager_PendingTaskType` + + + + + + + + + + + + + + +
MetricDescription
ClusterManager_PendingQueueSize + The current number of pending tasks in the cluster state update thread. Each node has a cluster state update thread that submits cluster state update tasks, such as create index, update mapping, allocate shard, and fail shard. +
+ +## Relevant dimensions: `Operation`, `Exception`, `Indices`, `HTTPRespCode` + + + + + + + + + + + + + + + + + + +
MetricDescription
HTTP_RequestDocs + The number of items in the request (only for the `_bulk` request type). +
HTTP_TotalRequests + The number of requests completed in the last 5 seconds. +
+ +## Relevant dimension: `CBType` + + + + + + + + + + + + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + - + + +
MetricDescription
CB_EstimatedSize + The current number of estimated bytes. +
ThreadPool_RejectedReqs - The number of rejected executions. -
ThreadPool_TotalThreads - The current number of threads in the pool. -
ThreadPool_ActiveThreads - The approximate number of threads that are actively executing tasks. -
ThreadPool_QueueLatency - The latency of the task queue. -
ThreadPool_QueueCapacity - The current capacity of the task queue. -
Master_PendingQueueSize - Master_PendingTaskType - The current number of pending tasks in the cluster state update thread. Each node has a cluster state update thread that submits cluster state update tasks (create index, update mapping, allocate shard, fail shard, etc.). -
HTTP_RequestDocs - Operation, Exception, Indices, HTTPRespCode - The number of items in the request (only for `_bulk` request type). -
HTTP_TotalRequests - The number of finished requests in the past five seconds. -
CB_EstimatedSize - CBType - The current number of estimated bytes. -
CB_TrippedEvents - The number of times the circuit breaker has tripped. - CB_TrippedEvents + The number of times that the circuit breaker has tripped. +
CB_ConfiguredSize The limit (bytes) for how much memory operations can use. + The limit, in bytes, on the amount of memory that operations can use. +
+ +## Relevant dimensions: `ClusterManagerTaskInsertOrder`, `ClusterManagerTaskPriority`, `ClusterManagerTaskType`, `ClusterManagerTaskMetadata` + + + + + + + + + + + + + + + + + + +
MetricDescription
ClusterManager_Task_Queue_Time + The amount of time, in milliseconds, that a cluster manager task spent in the queue. +
ClusterManager_Task_Run_Time + The amount of time, in milliseconds, that a cluster manager task has been running. +
+ +## Relevant dimension: `CacheType` + + + + + + + + + + + + + + +
MetricDescription
Cache_MaxSize + The maximum size of the cache, in bytes. +
+ +## Relevant dimension: `ControllerName` + + + + + + + + + + + + + + + + + + + + + +
MetricDescription
AdmissionControl_RejectionCount + The total number of rejections performed by a Controller of Admission Control. +
AdmissionControl_CurrentValue + The current value for a Controller of Admission Control. +
AdmissionControl_ThresholdValue + The threshold value for a Controller of Admission Control. +
+ +## Relevant dimension: `NodeID` + + + + + + + + + + + + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
MetricDescription
Data_RetryingPendingTasksCount + The number of throttled pending tasks on which the data node is actively performing retries. It is an absolute metric at that point in time.
Master_Task_Queue_Time + ClusterManager_ThrottledPendingTasksCount MasterTaskInsertOrder, MasterTaskPriority, MasterTaskType, MasterTaskMetadata - The time (milliseconds) that a master task spent in the queue. -
Master_Task_Run_Time - The time (milliseconds) that a master task has been executed. -
Cache_MaxSize - CacheType - The max size of the cache in bytes. -
AdmissionControl_RejectionCount (WIP) - ControllerName - Total rejections performed by a Controller of Admission Control. -
AdmissionControl_CurrentValue (WIP) - Current value for Controller of Admission Control. -
AdmissionControl_ThresholdValue (WIP) - Threshold value for Controller of Admission Control. -
Data_RetryingPendingTasksCount (WIP) - NodeID - Number of throttled pending tasks on which data node is actively performing retries. It will be an absolute metric at that point of time. -
Master_ThrottledPendingTasksCount (WIP) - Sum of total pending tasks which got throttled by node (master node). It is a cumulative metric so look at the max aggregation. -
Election_Term (WIP) - N/A - Monotonically increasing number with every master election. -
PublishClusterState_Latency (WIP) - The time taken by quorum of nodes to publish new cluster state. This metric is available for current master. -
PublishClusterState_Failure (WIP) - The number of times publish new cluster state action failed on master node. -
ClusterApplierService_Latency (WIP) - The time taken by each node to apply cluster state sent by master. -
ClusterApplierService_Failure (WIP) - The number of times apply cluster state action failed on each node. -
Shard_State (WIP) - IndexName, NodeName, ShardType, ShardID - The state of each shard - whether it is STARTED, UNASSIGNED, RELOCATING etc. -
LeaderCheck_Latency (WIP) - WIP - WIP -
FollowerCheck_Failure (WIP) -
LeaderCheck_Failure (WIP) -
FollowerCheck_Latency (WIP) + The sum of the total pending tasks that were throttled by the cluster manager node. This is a cumulative metric, so make sure to check the max aggregation.
+ +## Relevant dimensions: N/A +The following metrics are relevant to the cluster as a whole and do not require specific dimensions. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
MetricDescription
Election_Term + A number that increases monotonically with every cluster manager election. +
PublishClusterState_Latency + The amount of time taken by the quorum of nodes to publish the new cluster state. This metric is available for the current cluster manager. +
PublishClusterState_Failure + The number of times the new cluster state failed to publish on the cluster manager node. +
ClusterApplierService_Latency + The amount of time taken by each node to apply the cluster state sent by the cluster manager. +
ClusterApplierService_Failure + The number of times that the apply cluster state action failed on each node. +
+ +## Relevant dimensions: `IndexName`, `NodeName`, `ShardType`, `ShardID` + + + + + + + + + + + + + +
MetricDescription
Shard_State + The state of each shard, for example, `STARTED`, `UNASSIGNED`, or `RELOCATING`. +
## Dimensions reference -Dimension | Return values -:--- | :--- -ShardID | ID for the shard (e.g. `1`). -IndexName | Name of the index (e.g. `my-index`). -Operation | Type of operation (e.g. `shardbulk`). -ShardRole | `primary`, `replica` -Exception | OpenSearch exceptions (e.g. `org.opensearch.index_not_found_exception`). -Indices | The list of indices in the request URI. -HTTPRespCode | Response code from OpenSearch (e.g. `200`). -MemType | `totYoungGC`, `totFullGC`, `Survivor`, `PermGen`, `OldGen`, `Eden`, `NonHeap`, `Heap` -DiskName | Name of the disk (e.g. `sda1`). -DestAddr | Destination address (e.g. `010015AC`). -Direction | `in`, `out` -ThreadPoolType | The OpenSearch thread pools (e.g. `index`, `search`,`snapshot`). -CBType | `accounting`, `fielddata`, `in_flight_requests`, `parent`, `request` -MasterTaskInsertOrder | The order in which the task was inserted (e.g. `3691`). -MasterTaskPriority | Priority of the task (e.g. `URGENT`). OpenSearch executes higher priority tasks before lower priority ones, regardless of `insert_order`. -MasterTaskType | `shard-started`, `create-index`, `delete-index`, `refresh-mapping`, `put-mapping`, `CleanupSnapshotRestoreState`, `Update snapshot state` -MasterTaskMetadata | Metadata for the task (if any). -CacheType | `Field_Data_Cache`, `Shard_Request_Cache`, `Node_Query_Cache` +| Dimension | Return values | +|----------------------|-------------------------------------------------| +| ShardID | The ID of the shard, for example, `1`. | +| IndexName | The name of the index, for example, `my-index`. | +| Operation | The type of operation, for example, `shardbulk`. | +| ShardRole | The shard role, for example, `primary` or `replica`. | +| Exception | OpenSearch exceptions, for example, `org.opensearch.index_not_found_exception`. | +| Indices | The list of indexes in the request URL. | +| HTTPRespCode | The response code from OpenSearch, for example, `200`. 
| +| MemType | The memory type, for example, `totYoungGC`, `totFullGC`, `Survivor`, `PermGen`, `OldGen`, `Eden`, `NonHeap`, or `Heap`. | +| DiskName | The name of the disk, for example, `sda1`. | +| DestAddr | The destination address, for example, `010015AC`. | +| Direction | The direction, for example, `in` or `out`. | +| ThreadPoolType | The OpenSearch thread pools, for example, `index`, `search`, or `snapshot`. | +| CBType | The circuit breaker type, for example, `accounting`, `fielddata`, `in_flight_requests`, `parent`, or `request`. | +| ClusterManagerTaskInsertOrder| The order in which the task was inserted, for example, `3691`. | +| ClusterManagerTaskPriority | The priority of the task, for example, `URGENT`. OpenSearch executes higher-priority tasks before lower-priority ones, regardless of `insert_order`. | +| ClusterManagerTaskType | The task type, for example, `shard-started`, `create-index`, `delete-index`, `refresh-mapping`, `put-mapping`, `CleanupSnapshotRestoreState`, or `Update snapshot state`. | +| ClusterManagerTaskMetadata | The metadata for the task (if any). | +| CacheType | The cache type, for example, `Field_Data_Cache`, `Shard_Request_Cache`, or `Node_Query_Cache`. | +
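
As a rough illustration of how the metric and dimension groupings in this reference might be used, the following Python sketch checks that the requested dimensions are listed as relevant for a metric and then assembles a query string. The `RELEVANT_DIMENSIONS` mapping copies only a few rows from the tables. The helper name `build_metrics_query` and the query parameter names (`metrics`, `agg`, `dim`, `nodes`) are assumptions made for illustration and are not defined in this reference, so verify them against the Performance Analyzer API documentation before relying on them.

```python
from urllib.parse import urlencode

# A small subset of the metric-to-dimension groupings from the tables.
# Extend this mapping as needed; it is not an exhaustive list.
RELEVANT_DIMENSIONS = {
    "CPU_Utilization": {"ShardID", "IndexName", "Operation", "ShardRole"},
    "Latency": {
        "Operation", "Exception", "Indices", "HTTPRespCode",
        "ShardID", "IndexName", "ShardRole",
    },
    "Disk_Utilization": {"DiskName"},
    "ThreadPool_QueueSize": {"ThreadPoolType"},
    "Election_Term": set(),  # Relevant dimensions: N/A
}

# The four aggregations that the reference says all metrics support.
AGGREGATIONS = {"avg", "sum", "min", "max"}


def build_metrics_query(metric, agg, dims):
    """Return a query string for one metric (hypothetical helper).

    The parameter names used here are assumptions, not part of this page.
    """
    if metric not in RELEVANT_DIMENSIONS:
        raise ValueError(f"unknown metric: {metric}")
    if agg not in AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {agg}")

    # Reject dimensions that the tables do not list for this metric.
    unexpected = set(dims) - RELEVANT_DIMENSIONS[metric]
    if unexpected:
        raise ValueError(
            f"dimensions not listed for {metric}: {sorted(unexpected)}"
        )

    params = {"metrics": metric, "agg": agg}
    if dims:
        params["dim"] = ",".join(dims)
    params["nodes"] = "all"
    # Keep commas unescaped so that the dimension list stays readable.
    return urlencode(params, safe=",")


print(build_metrics_query("Latency", "avg", ["ShardID", "IndexName"]))
```

Validating dimensions before sending a request is a design choice in this sketch: a dimension that is not listed for a metric is rejected locally instead of being passed through to the server.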