[[ml-gs-multi-jobs]]
=== Creating Multi-metric Jobs

The multi-metric job wizard in {kib} provides a simple way to create more
complex jobs with multiple detectors. For example, in the single metric job, you
were tracking total requests versus time. You might also want to track other
metrics like average response time or the maximum number of denied requests.
Instead of creating jobs for each of those metrics, you can combine them in a
multi-metric job.

You can also use multi-metric jobs to split a single time series into multiple
time series based on a categorical field. For example, you can split the data
based on its hostnames, locations, or users. Each time series is modeled
independently. By looking at temporal patterns on a per entity basis, you might
spot things that would otherwise have been hidden in the lumped view.

Conceptually, you can think of this as running many independent single metric
jobs. By bundling them together in a multi-metric job, however, you can see an
overall score and shared influencers for all the metrics and all the entities in
the job. Multi-metric jobs therefore scale better than having many independent
single metric jobs and provide better results when you have influencers that are
shared across the detectors.

The sample data for this tutorial contains information about the requests that
are received by various applications and services in a system. Let's assume that
you want to monitor the requests received and the response time. In particular,
you might want to track those metrics on a per service basis to see if any
services have unusual patterns.

To create a multi-metric job in {kib}:

. Open {kib} in your web browser and log in. If you are running {kib} locally,
go to `http://localhost:5601/`.

. Click **Machine Learning** in the side navigation, then click **Create new job**. +
+
--
[role="screenshot"]
image::images/ml-kibana.jpg[Job Management]
--

. Click **Create multi metric job**. +
+
--
[role="screenshot"]
image::images/ml-create-job2.jpg["Create a multi metric job"]
--

. Click the `server-metrics` index. +
+
--
[role="screenshot"]
image::images/ml-gs-index.jpg["Select an index"]
--

. Configure the job by providing the following job settings: +
+
--
[role="screenshot"]
image::images/ml-gs-multi-job.jpg["Create a new job from the server-metrics index"]
--

.. For the **Fields**, select `high mean(response)` and `sum(total)`. This
creates two detectors and specifies the analysis function and field that each
detector uses. The first detector uses the high mean function to detect
unusually high average values for the `response` field in each bucket. The
second detector uses the sum function to detect when the sum of the `total`
field is anomalous in each bucket. For more information about the
analytical functions, see <<ml-functions>>.

.. For the **Bucket span**, enter `10m`. This value specifies the size of the
interval that the analysis is aggregated into. As was the case in the single
metric example, this value has a significant impact on the analysis. When you're
creating jobs for your own data, you might need to experiment with different
bucket spans depending on the frequency of the input data, the duration of
typical anomalies, and the frequency at which alerting is required.

.. For the **Split Data**, select `service`. When you specify this
option, the analysis is segmented such that you have completely independent
baselines for each distinct value of this field.
//TBD: What is the importance of having separate baselines?
There are seven unique service keyword values in the sample data. Thus for each
of the seven services, you will see the high mean response metrics and sum
total metrics. +
+
--
NOTE: If you are creating a job by using the {ml} APIs or the advanced job
wizard in {kib}, you can accomplish this split by using the
`partition_field_name` property. A sketch of an equivalent API request appears
after this procedure.

--

.. For the **Key Fields**, select `host`. Note that the `service` field
is also automatically selected because you used it to split the data. These key
fields are also known as _influencers_.
When you identify a field as an influencer, you are indicating that you think
it contains information about someone or something that influences or
contributes to anomalies.
+
--
[TIP]
========================
Picking an influencer is strongly recommended for the following reasons:

* It allows you to more easily assign blame for the anomaly
* It simplifies and aggregates the results

The best influencer is the person or thing that you want to blame for the
anomaly. In many cases, users or client IP addresses make excellent influencers.
Influencers can be any field in your data; they do not need to be fields that
are specified in your detectors, though they often are.

As a best practice, do not pick too many influencers. For example, you generally
do not need more than three. If you pick many influencers, the results can be
overwhelming and there is a small overhead to the analysis.

========================
//TBD: Is this something you can determine later from looking at results and
//update your job with if necessary? Is it all post-processing or does it affect
//the ongoing modeling?
--

. Click **Use full server-metrics* data**. Two graphs are generated for each
`service` value, which represent the high mean `response` values and
sum `total` values over time.
//TBD What is the use of the document count table?

. Provide a name for the job, for example `response_requests_by_app`. The job
name must be unique in your cluster. You can also optionally provide a
description of the job.

. Click **Create Job**. As the job is created, the graphs are updated to give a
visual representation of the progress of {ml} as the data is processed. For
example:
+
--
[role="screenshot"]
image::images/ml-gs-job2-results.jpg["Job results updating as data is processed"]
--

TIP: The `create_multi_metric.sh` script creates a similar job and {dfeed} by
using the {ml} APIs. You can download that script by clicking
here: https://download.elastic.co/demos/machine_learning/gettingstarted/create_multi_metric.sh[create_multi_metric.sh].
For API reference information, see {ref}/ml-apis.html[Machine Learning APIs].
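
The following Python sketch shows roughly what an equivalent job configuration
looks like when it is submitted through the anomaly detectors API. It is a
minimal illustration, not a replacement for the script above: the host,
credentials, endpoint prefix, and time field name are assumptions, and a {dfeed}
must still be created and started separately before any data is analyzed.

[source,python]
----
# Minimal sketch: create a job similar to the one built in this tutorial by
# calling the machine learning APIs directly. The host, credentials, endpoint
# prefix, and time field name are assumptions; see the downloaded
# create_multi_metric.sh script for the exact requests for your stack version.
import requests

ES_URL = "http://localhost:9200"         # assumed Elasticsearch address
AUTH = ("elastic", "changeme")           # assumed credentials
JOB_ID = "response_requests_by_app"      # same example name as in the wizard

job_config = {
    "description": "Requests and response times split by service",
    "analysis_config": {
        "bucket_span": "10m",            # same bucket span as in the wizard
        "detectors": [
            # Detector 1: unusually high average response time per service
            {"function": "high_mean", "field_name": "response",
             "partition_field_name": "service"},
            # Detector 2: anomalous sum of total requests per service
            {"function": "sum", "field_name": "total",
             "partition_field_name": "service"},
        ],
        "influencers": ["service", "host"],   # the key fields from the wizard
    },
    "data_description": {
        "time_field": "@timestamp"       # assumed; adjust to your index mapping
    },
}

# The `_xpack/ml` prefix is used here; newer stack versions expose `_ml` instead.
response = requests.put(
    f"{ES_URL}/_xpack/ml/anomaly_detectors/{JOB_ID}",
    json=job_config,
    auth=AUTH,
)
response.raise_for_status()
print(response.json())
----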

[[ml-gs-job2-analyze]]
=== Exploring Multi-metric Job Results

The {xpackml} features analyze the input stream of data, model its behavior, and
perform analysis based on the two detectors you defined in your job. When an
event occurs outside of the model, that event is identified as an anomaly.

You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-gs-job2-explorer.jpg["Job results in the Anomaly Explorer"]

You can explore the overall anomaly time line, which shows the maximum anomaly
score for each section in the specified time period. You can change the time
period by using the time picker in the {kib} toolbar. Note that the sections in
this time line do not necessarily correspond to the bucket span. If you change
the time period, the sections change size too. The smallest possible size for
these sections is a bucket. If you specify a large time period, the sections can
span many buckets.

On the left is a list of the top influencers for all of the detected anomalies
in that same time period. The list includes maximum anomaly scores, which in
this case are aggregated for each influencer, for each bucket, across all
detectors. There is also a total sum of the anomaly scores for each influencer.
You can use this list to help you narrow down the contributing factors and focus
on the most anomalous entities.

If your job contains influencers, you can also explore swim lanes that
correspond to the values of an influencer. In this example, the swim lanes
correspond to the values for the `service` field that you used to split the data.
Each lane represents a unique application or service name. Since you specified
the `host` field as an influencer, you can also optionally view the results in
swim lanes for each host name:

[role="screenshot"]
image::images/ml-gs-job2-explorer-host.jpg["Job results sorted by host"]

By default, the swim lanes are ordered by their maximum anomaly score values.
You can click on the sections in the swim lane to see details about the
anomalies that occurred in that time interval.

NOTE: The anomaly scores that you see in each section of the **Anomaly Explorer**
might differ slightly. This disparity occurs because for each job we generate
bucket results, influencer results, and record results. Anomaly scores are
generated for each type of result. The anomaly timeline uses the bucket-level
anomaly scores. The list of top influencers uses the influencer-level anomaly
scores. The list of anomalies uses the record-level anomaly scores. For more
information about these different result types, see
{ref}/ml-results-resource.html[Results Resources].
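
If you want to compare these result types outside of {kib}, the following Python
sketch shows one way to retrieve them through the results APIs. It is a minimal
illustration: the host, credentials, and endpoint prefix are assumptions, and the
exact score field names vary between versions, so check the results reference for
your release.

[source,python]
----
# Minimal sketch: fetch bucket, influencer, and record results for a job through
# the machine learning results APIs. The host, credentials, job ID, and endpoint
# prefix are assumptions; result field names can differ between stack versions.
import requests

ES_URL = "http://localhost:9200"        # assumed Elasticsearch address
AUTH = ("elastic", "changeme")          # assumed credentials
JOB_ID = "response_requests_by_app"

# The `_xpack/ml` prefix is used here; newer stack versions expose `_ml` instead.
base = f"{ES_URL}/_xpack/ml/anomaly_detectors/{JOB_ID}/results"

for result_type in ("buckets", "influencers", "records"):
    resp = requests.get(f"{base}/{result_type}", auth=AUTH)
    resp.raise_for_status()
    body = resp.json()
    print(f"--- {result_type}: {body.get('count', 0)} results ---")
    # Each result type carries its own anomaly score, which is why the numbers
    # shown in the different Anomaly Explorer views can differ slightly.
    for result in body.get(result_type, [])[:5]:
        print({k: v for k, v in result.items()
               if "score" in k or k == "timestamp"})
----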

Click on a section in the swim lanes to obtain more information about the
anomalies in that time period. For example, click on the red section in the swim
lane for `server_2`:

[role="screenshot"]
image::images/ml-gs-job2-explorer-anomaly.jpg["Job results for an anomaly"]

You can see exact times when anomalies occurred and which detectors or metrics
caught the anomaly. Also note that because you split the data by the `service`
field, you see separate charts for each applicable service. In particular, you
see charts for each service for which there is data on the specified host in the
specified time interval.

Below the charts, there is a table that provides more information, such as the
typical and actual values and the influencers that contributed to the anomaly.

[role="screenshot"]
image::images/ml-gs-job2-explorer-table.jpg["Job results table"]

Notice that there are anomalies for both detectors, that is, for both the
`high_mean(response)` and the `sum(total)` metrics in this time interval. By
investigating multiple metrics in a single job, you might see relationships
between events in your data that would otherwise be overlooked.