144 lines
8.0 KiB
Plaintext
144 lines
8.0 KiB
Plaintext
[role="xpack"]
|
|
[testenv="basic"]
|
|
[[es-monitoring-collectors]]
|
|
== Collectors
|
|
|
|
Collectors, as their name implies, collect things. Each collector runs once for
|
|
each collection interval to obtain data from the public APIs in {es} and {xpack}
|
|
that it chooses to monitor. When the data collection is finished, the data is
|
|
handed in bulk to the <<es-monitoring-exporters,exporters>> to be sent to the
|
|
monitoring clusters. Regardless of the number of exporters, each collector only
|
|
runs once per collection interval.
|
|
|
|
There is only one collector per data type gathered. In other words, for any
|
|
monitoring document that is created, it comes from a single collector rather
|
|
than being merged from multiple collectors. {monitoring} for {es} currently has
|
|
a few collectors because the goal is to minimize overlap between them for
|
|
optimal performance.
|
|
|
|
Each collector can create zero or more monitoring documents. For example,
|
|
the `index_stats` collector collects all index statistics at the same time to
|
|
avoid many unnecessary calls.
|
|
|
|
[options="header"]
|
|
|=======================
|
|
| Collector | Data Types | Description
|
|
| Cluster Stats | `cluster_stats`
|
|
| Gathers details about the cluster state, including parts of the actual cluster
|
|
state (for example `GET /_cluster/state`) and statistics about it (for example,
|
|
`GET /_cluster/stats`). This produces a single document type. In versions prior
|
|
to X-Pack 5.5, this was actually three separate collectors that resulted in
|
|
three separate types: `cluster_stats`, `cluster_state`, and `cluster_info`. In
|
|
5.5 and later, all three are combined into `cluster_stats`. This only runs on
|
|
the _elected_ master node and the data collected (`cluster_stats`) largely
|
|
controls the UI. When this data is not present, it indicates either a
|
|
misconfiguration on the elected master node, timeouts related to the collection
|
|
of the data, or issues with storing the data. Only a single document is produced
|
|
per collection.
|
|
| Index Stats | `indices_stats`, `index_stats`
|
|
| Gathers details about the indices in the cluster, both in summary and
|
|
individually. This creates many documents that represent parts of the index
|
|
statistics output (for example, `GET /_stats`). This information only needs to
|
|
be collected once, so it is collected on the _elected_ master node. The most
|
|
common failure for this collector relates to an extreme number of indices -- and
|
|
therefore time to gather them -- resulting in timeouts. One summary
|
|
`indices_stats` document is produced per collection and one `index_stats`
|
|
document is produced per index, per collection.
|
|
| Index Recovery | `index_recovery`
|
|
| Gathers details about index recovery in the cluster. Index recovery represents
|
|
the assignment of _shards_ at the cluster level. If an index is not recovered,
|
|
it is not usable. This also corresponds to shard restoration via snapshots. This
|
|
information only needs to be collected once, so it is collected on the _elected_
|
|
master node. The most common failure for this collector relates to an extreme
|
|
number of shards -- and therefore time to gather them -- resulting in timeouts.
|
|
This creates a single document that contains all recoveries by default, which
|
|
can be quite large, but it gives the most accurate picture of recovery in the
|
|
production cluster.
|
|
| Shards | `shards`
|
|
| Gathers details about all _allocated_ shards for all indices, particularly
|
|
including what node the shard is allocated to. This information only needs to be
|
|
collected once, so it is collected on the _elected_ master node. The collector
|
|
uses the local cluster state to get the routing table without any network
|
|
timeout issues unlike most other collectors. Each shard is represented by a
|
|
separate monitoring document.
|
|
| Jobs | `job_stats`
|
|
| Gathers details about all machine learning job statistics (for example, `GET
|
|
/_ml/anomaly_detectors/_stats`). This information only needs to be collected
|
|
once, so it is collected on the _elected_ master node. However, for the master
|
|
node to be able to perform the collection, the master node must have
|
|
`xpack.ml.enabled` set to true (default) and a license level that supports {ml}.
|
|
| Node Stats | `node_stats`
|
|
| Gathers details about the running node, such as memory utilization and CPU
|
|
usage (for example, `GET /_nodes/_local/stats`). This runs on _every_ node with
|
|
{monitoring} enabled. One common failure results in the timeout of the node
|
|
stats request due to too many segment files. As a result, the collector spends
|
|
too much time waiting for the file system stats to be calculated until it
|
|
finally times out. A single `node_stats` document is created per collection.
|
|
This is collected per node to help to discover issues with nodes communicating
|
|
with each other, but not with the monitoring cluster (for example, intermittent
|
|
network issues or memory pressure).
|
|
|=======================
|
|
|
|
{monitoring} uses a single threaded scheduler to run the collection of {es}
|
|
monitoring data by all of the appropriate collectors on each node. This
|
|
scheduler is managed locally by each node and its interval is controlled by
|
|
specifying the `xpack.monitoring.collection.interval`, which defaults to 10
|
|
seconds (`10s`), at either the node or cluster level.
|
|
|
|
Fundamentally, each collector works on the same principle. Per collection
|
|
interval, each collector is checked to see whether it should run and then the
|
|
appropriate collectors run. The failure of an individual collector does not
|
|
impact any other collector.
|
|
|
|
Once collection has completed, all of the monitoring data is passed to the
|
|
exporters to route the monitoring data to the monitoring clusters.
|
|
|
|
If gaps exist in the monitoring charts in {kib}, it is typically because either
|
|
a collector failed or the monitoring cluster did not receive the data (for
|
|
example, it was being restarted). In the event that a collector fails, a logged
|
|
error should exist on the node that attempted to perform the collection.
|
|
|
|
NOTE: Collection is currently done serially, rather than in parallel, to avoid
|
|
extra overhead on the elected master node. The downside to this approach
|
|
is that collectors might observe a different version of the cluster state
|
|
within the same collection period. In practice, this does not make a
|
|
significant difference and running the collectors in parallel would not
|
|
prevent such a possibility.
|
|
|
|
For more information about the configuration options for the collectors, see
|
|
<<monitoring-collection-settings>>.
|
|
|
|
[float]
|
|
[[es-monitoring-stack]]
|
|
==== Collecting data from across the Elastic Stack
|
|
|
|
{monitoring} in {es} also receives monitoring data from other parts of the
|
|
Elastic Stack. In this way, it serves as an unscheduled monitoring data
|
|
collector for the stack.
|
|
|
|
By default, data collection is disabled. {es} monitoring data is not
|
|
collected and all monitoring data from other sources such as {kib}, Beats, and
|
|
Logstash is ignored. You must set `xpack.monitoring.collection.enabled` to `true`
|
|
to enable the collection of monitoring data. See <<monitoring-settings>>.
|
|
|
|
Once data is received, it is forwarded to the exporters
|
|
to be routed to the monitoring cluster like all monitoring data.
|
|
|
|
WARNING: Because this stack-level "collector" lives outside of the collection
|
|
interval of {monitoring} for {es}, it is not impacted by the
|
|
`xpack.monitoring.collection.interval` setting. Therefore, data is passed to the
|
|
exporters whenever it is received. This behavior can result in indices for {kib},
|
|
Logstash, or Beats being created somewhat unexpectedly.
|
|
|
|
While the monitoring data is collected and processed, some production cluster
|
|
metadata is added to incoming documents. This metadata enables {kib} to link the
|
|
monitoring data to the appropriate cluster. If this linkage is unimportant to
|
|
the infrastructure that you're monitoring, it might be simpler to configure
|
|
Logstash and Beats to report monitoring data directly to the monitoring cluster.
|
|
This scenario also prevents the production cluster from adding extra overhead
|
|
related to monitoring data, which can be very useful when there are a large
|
|
number of Logstash nodes or Beats.
|
|
|
|
For more information about typical monitoring architectures, see
|
|
{xpack-ref}/how-monitoring-works.html[How Monitoring Works].
|