[role="xpack"]
[[es-monitoring-collectors]]
=== Collectors

Collectors, as their name implies, collect things. In {monitoring} for {es},
collectors have a few rules that define them.

There is only one collector per data type gathered. In other words, for any
monitoring document that is created, it comes from a single collector rather
than being merged from multiple collectors. {monitoring} for {es} currently has
a few collectors because the goal is to minimize overlap between them for
optimal performance.

Each collector can create zero or more monitoring documents. For example,
the `index_stats` collector collects all index statistics at the same time to
avoid many unnecessary calls.

[options="header"]
|=======================
| Collector | Data Types | Description
| Cluster Stats | `cluster_stats`
| Gathers details about the cluster state, including parts of the actual
cluster state (for example, `GET /_cluster/state`) and statistics about it
(for example, `GET /_cluster/stats`). This produces a single document type.
In versions prior to X-Pack 5.5, this was actually three separate collectors
that resulted in three separate types: `cluster_stats`, `cluster_state`, and
`cluster_info`. In 5.5 and later, all three are combined into `cluster_stats`.
+
This only runs on the _elected_ master node and the data collected
(`cluster_stats`) largely controls the UI. When this data is not present, it
indicates either a misconfiguration on the elected master node, timeouts
related to the collection of the data, or issues with storing the data. Only a
single document is produced per collection.
| Index Stats | `indices_stats`, `index_stats`
| Gathers details about the indices in the cluster, both in summary and
individually. This creates many documents that represent parts of the index
statistics output (for example, `GET /_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of indices -- and therefore the time required to gather them --
resulting in timeouts. One summary `indices_stats` document is produced per
collection and one `index_stats` document is produced per index, per collection.
| Index Recovery | `index_recovery`
| Gathers details about index recovery in the cluster. Index recovery
represents the assignment of _shards_ at the cluster level. If an index is not
recovered, it is not usable. This also corresponds to shard restoration via
snapshots.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. The most common failure for this collector relates to an
extreme number of shards -- and therefore the time required to gather them --
resulting in timeouts. This creates a single document that contains all
recoveries by default, which can be quite large, but it gives the most accurate
picture of recovery in the production cluster.
| Shards | `shards`
| Gathers details about all _allocated_ shards for all indices, particularly
including the node to which each shard is allocated.
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. Unlike most other collectors, this one uses the local
cluster state to get the routing table, so it is not subject to network
timeout issues. Each shard is represented by a separate monitoring document.
| Jobs | `job_stats`
| Gathers details about all machine learning job statistics (for example,
`GET /_xpack/ml/anomaly_detectors/_stats`).
+
This information only needs to be collected once, so it is collected on the
_elected_ master node. However, for the master node to be able to perform the
collection, it must have `xpack.ml.enabled` set to `true` (the default) and a
license level that supports {ml}.
| Node Stats | `node_stats`
| Gathers details about the running node, such as memory utilization and CPU
usage (for example, `GET /_nodes/_local/stats`).
+
This runs on _every_ node with {monitoring} enabled. One common failure is a
timeout of the node stats request caused by too many segment files: the
collector spends so long waiting for the file system stats to be calculated
that the request eventually times out. A single `node_stats` document is
created per collection. The data is collected per node to help discover issues
with nodes that can communicate with each other but not with the monitoring
cluster (for example, intermittent network issues or memory pressure).
|=======================
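
When a collector repeatedly times out, it can help to run the API it wraps by
hand and see how long the cluster takes to respond. The requests below, taken
from the table above, are a sketch of those calls as you might issue them from
the {kib} Console (the `_xpack/ml` path assumes a pre-7.0 cluster):

[source,console]
----
# Cluster Stats collector (elected master only)
GET /_cluster/state
GET /_cluster/stats

# Index Stats collector (elected master only)
GET /_stats

# Jobs collector (elected master only)
GET /_xpack/ml/anomaly_detectors/_stats

# Node Stats collector (every node)
GET /_nodes/_local/stats
----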

Fundamentally, each collector works on the same principle. Per collection
interval, which defaults to 10 seconds (`10s`), each collector is checked to
see whether it should run and then the appropriate collectors run. The failure
of an individual collector does not impact any other collector.

Once collection has completed, all of the monitoring data is passed to the
exporters, which route the monitoring data to the monitoring clusters.
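
For example, a common setup routes the data to a dedicated monitoring cluster
over HTTP. The snippet below is a minimal sketch of such an exporter in
`elasticsearch.yml`; the exporter name `my_monitoring_cluster` and the host URL
are placeholders:

[source,yaml]
----
xpack.monitoring.exporters:
  my_monitoring_cluster:  # arbitrary exporter name
    type: http            # ship data to a remote (monitoring) cluster
    host: ["http://monitoring-cluster.example.com:9200"]
----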

The collection interval can be configured dynamically and you can also disable
data collection. This can be very useful when you are using a separate
monitoring cluster to automatically take advantage of the cleaner service.
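
As a sketch, assuming a release in which the collection settings are dynamic
(the `xpack.monitoring.collection.enabled` setting, for example, can be updated
through the cluster settings API in 6.3 and later), both changes look like
this:

[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.interval": "30s", <1>
    "xpack.monitoring.collection.enabled": false <2>
  }
}
----
<1> Lengthen the collection interval from the default `10s`.
<2> Disable data collection entirely.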

If gaps exist in the monitoring charts in {kib}, it is typically because either
a collector failed or the monitoring cluster did not receive the data (for
example, it was being restarted). In the event that a collector fails, a logged
error should exist on the node that attempted to perform the collection.
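
You can also confirm a gap from the data side by counting how many
`cluster_stats` documents arrived per collection interval; with the default
`10s` interval, empty buckets point to missed collections. This sketch assumes
the 6.x document layout, where monitoring documents live in `.monitoring-es-*`
indices and carry a `type` field:

[source,console]
----
GET /.monitoring-es-*/_search
{
  "size": 0,
  "query": {
    "term": { "type": "cluster_stats" }
  },
  "aggs": {
    "collections_per_interval": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "10s"
      }
    }
  }
}
----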

NOTE: Collection is currently done serially, rather than in parallel, to avoid
extra overhead on the elected master node. The downside to this approach
is that collectors might observe a different version of the cluster state
within the same collection period. In practice, this does not make a
significant difference and running the collectors in parallel would not
prevent such a possibility.

For more information about the configuration options for the collectors, see
<<monitoring-collection-settings>>.