[[ml-gs-jobs]]
=== Creating Single Metric Jobs

At this point in the tutorial, the goal is to detect anomalies in the
total requests received by your applications and services. The sample data
contains a single key performance indicator (KPI) to track this, which is the
total requests over time. It is therefore logical to start by creating a single
metric job for this KPI.


TIP: If you are using aggregated data, you can create an advanced job
and configure it to use a `summary_count_field_name`. The {ml} algorithms will
make the best possible use of summarized data in this case. For simplicity, in
this tutorial we will not make use of that advanced functionality. For more
information, see <<ml-configuring-aggregation>>.


A single metric job contains a single _detector_. A detector defines the type of
analysis that will occur (for example, `max`, `mean`, or `rare` analytical
functions) and the fields that will be analyzed.


To create a single metric job in {kib}:

. Open {kib} in your web browser and log in. If you are running {kib} locally,
go to `http://localhost:5601/`.

. Click **Machine Learning** in the side navigation.

. Click **Create new job**.

. Select the index pattern that you created for the sample data. For example,
`server-metrics*`.

. In the **Use a wizard** section, click **Single metric**.

. Configure the job by providing the following information: +
+
--
[role="screenshot"]
image::images/ml-gs-single-job.jpg["Create a new job from the server-metrics index"]
--


.. For the **Aggregation**, select `Sum`. This value specifies the analysis
function that is used.
+
--
Some of the analytical functions look for single anomalous data points. For
example, `max` identifies the maximum value that is seen within a bucket.
Others perform some aggregation over the length of the bucket. For example,
`mean` calculates the mean of all the data points seen within the bucket.
Similarly, `count` calculates the total number of data points within the bucket.
In this tutorial, you are using the `sum` function, which calculates the sum of
the specified field's values within the bucket. For descriptions of all the
functions, see <<ml-functions>>.
--


.. For the **Field**, select `total`. This value specifies the field that
the detector uses in the function.
+
--
NOTE: Some functions such as `count` and `rare` do not require fields.
--


.. For the **Bucket span**, enter `10m`. This value specifies the size of the
interval that the analysis is aggregated into.
+
--
The {xpackml} features use the concept of a bucket to divide up the time series
into batches for processing. For example, if you are monitoring the total number
of requests in the system, a bucket span of 1 hour means that at the end of each
hour, the analysis calculates the sum of the requests for the last hour and
determines how anomalous that value is compared to previous hours.

The bucket span has two purposes: it dictates over what time span to look for
anomalous features in data, and also determines how quickly anomalies can be
detected. Choosing a shorter bucket span enables anomalies to be detected more
quickly. However, there is a risk of being too sensitive to natural variations
or noise in the input data. Choosing too long a bucket span can mean that
interesting anomalies are averaged away. There is also the possibility that the
aggregation might smooth out some anomalies based on when the bucket starts
in time.

The bucket span has a significant impact on the analysis. When you're trying to
determine what value to use, take into account the granularity at which you
want to perform the analysis, the frequency of the input data, the duration of
typical anomalies, and the frequency at which alerting is required.
--


. Determine whether you want to process all of the data or only part of it. If
you want to analyze all of the existing data, click
**Use full server-metrics* data**. If you want to see what happens when you
stop and start {dfeeds} and process additional data over time, click the time
picker in the {kib} toolbar. Since the sample data spans a period of time
between March 23, 2017 and April 22, 2017, click **Absolute**. Set the start
time to March 23, 2017 and the end time to April 1, 2017, for example. After
you set the time range, click the **Go** button. +
+
--
[role="screenshot"]
image::images/ml-gs-job1-time.jpg["Setting the time range for the {dfeed}"]
--
+
--
A graph is generated, which represents the total number of requests over time.

Note that the **Estimate bucket span** option is no longer greyed out in the
**Bucket span** field. This is an experimental feature that you can use to help
determine an appropriate bucket span for your data. For the purposes of this
tutorial, we will leave the bucket span at 10 minutes.
--


. Provide a name for the job, for example `total-requests`. The job name must
be unique in your cluster. You can also optionally provide a description of the
job and create a job group.

. Click **Create Job**. +
+
--
[role="screenshot"]
image::images/ml-gs-job1.jpg["A graph of the total number of requests over time"]
--


As the job is created, the graph is updated to give a visual representation of
the progress of {ml} as the data is processed. This view is only available while
the job is running.

After the job is created, you can choose to view the results, continue the job
in real time, and create a watch. In this tutorial, we will look at how to
manage jobs and {dfeeds} before we view the results.

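To build intuition for the bucket span setting, the following Python sketch
groups a few records into 10-minute buckets and sums each bucket, which mirrors
what the `sum` function is described as doing. The sample records are
hypothetical, and aligning buckets to the Unix epoch is an assumption made for
illustration; it is not a statement about how {ml} aligns buckets internally.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def sum_per_bucket(records, bucket_span=timedelta(minutes=10)):
    """Group (timestamp, value) pairs into fixed-width buckets and sum each one."""
    span = int(bucket_span.total_seconds())
    totals = defaultdict(int)
    for ts, value in records:
        epoch = int(ts.timestamp())
        # Assumption: buckets are aligned to multiples of the span since the epoch.
        bucket_start = epoch - (epoch % span)
        totals[datetime.fromtimestamp(bucket_start, tz=timezone.utc)] += value
    return dict(sorted(totals.items()))


# Hypothetical records: (timestamp, value of the `total` field).
records = [
    (datetime(2017, 3, 23, 13, 1, tzinfo=timezone.utc), 40),
    (datetime(2017, 3, 23, 13, 7, tzinfo=timezone.utc), 35),
    (datetime(2017, 3, 23, 13, 12, tzinfo=timezone.utc), 50),
]

buckets = sum_per_bucket(records)
for start, total in buckets.items():
    print(start.isoformat(), total)
```

Passing `bucket_span=timedelta(hours=1)` places all three records in one bucket,
which illustrates how a longer span can smooth out short-lived spikes.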

TIP: The `create_single_metric.sh` script creates a similar job and {dfeed} by
using the {ml} APIs. You can download that script here:
https://download.elastic.co/demos/machine_learning/gettingstarted/create_single_metric.sh[create_single_metric.sh].
For API reference information, see {ref}/ml-apis.html[Machine Learning APIs].

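As a rough illustration of what such an API request might contain, the
following Python sketch assembles a job configuration that matches the choices
made in the wizard (`sum` of `total` with a `10m` bucket span). The helper name
is hypothetical, the `@timestamp` time field is an assumption about the sample
data, and the contents of the downloadable script may differ. Check the
Machine Learning APIs reference for the endpoint and the fields that your
version supports.

```python
import json


def single_metric_job_body(function, field_name, bucket_span,
                           time_field="@timestamp"):
    """Build a job configuration with exactly one detector."""
    return {
        "description": "Total sum of requests",
        "analysis_config": {
            "bucket_span": bucket_span,
            # A single metric job has a single detector.
            "detectors": [{"function": function, "field_name": field_name}],
        },
        # Assumption: the sample data stores event time in `@timestamp`.
        "data_description": {"time_field": time_field},
    }


job_body = single_metric_job_body("sum", "total", "10m")
print(json.dumps(job_body, indent=2))
```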

[[ml-gs-job1-manage]]
=== Managing Jobs

After you create a job, you can see its status in the **Job Management** tab: +

[role="screenshot"]
image::images/ml-gs-job1-manage1.jpg["Status information for the total-requests job"]

The following information is provided for each job:


Job ID::
The unique identifier for the job.

Description::
The optional description of the job.

Processed records::
The number of records that have been processed by the job.


Memory status::
The status of the mathematical models. When you create jobs by using the APIs or
by using the advanced options in {kib}, you can specify a `model_memory_limit`.
That value is the maximum amount of memory resources that the mathematical
models can use. Once that limit is approached, data pruning becomes more
aggressive. Upon exceeding that limit, new entities are not modeled. For more
information about this setting, see
{ref}/ml-job-resource.html#ml-apilimits[Analysis Limits]. The memory status
field reflects whether you have reached or exceeded the model memory limit. It
can have one of the following values: +
`ok`::: The models stayed below the configured value.
`soft_limit`::: The models used more than 60% of the configured memory limit
and older unused models will be pruned to free up space.
`hard_limit`::: The models used more space than the configured memory limit.
As a result, not all incoming data was processed.


Job state::
The status of the job, which can be one of the following values: +
`opened`::: The job is available to receive and process data.
`closed`::: The job finished successfully with its model state persisted.
The job must be opened before it can accept further data.
`closing`::: The job close action is in progress and has not yet completed.
A closing job cannot accept further data.
`failed`::: The job did not finish successfully due to an error.
This situation can occur due to invalid input data.
If the job has irrevocably failed, it must be force closed and then deleted.
If the {dfeed} can be corrected, the job can be closed and then re-opened.


{dfeed-cap} state::
The status of the {dfeed}, which can be one of the following values: +
`started`::: The {dfeed} is actively receiving data.
`stopped`::: The {dfeed} is stopped and will not receive data until it is
re-started.

Latest timestamp::
The timestamp of the last processed record.

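The memory status values listed above can be read as a simple threshold rule.
The following Python sketch encodes that reading. The function and its
parameters are hypothetical, and the actual status is reported by {ml} itself,
so treat this only as an aid to interpreting the field.

```python
def memory_status(model_bytes, model_memory_limit_bytes, soft_percent=60):
    """Classify model memory usage the way the status descriptions read."""
    if model_bytes > model_memory_limit_bytes:
        # Over the limit: not all incoming data is processed.
        return "hard_limit"
    if model_bytes * 100 > soft_percent * model_memory_limit_bytes:
        # Above the 60% mark: older unused models are pruned.
        return "soft_limit"
    return "ok"


MIB = 1024 * 1024
for used_mib in (10, 40, 70):
    print(used_mib, memory_status(used_mib * MIB, 64 * MIB))
```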

If you click the arrow beside the name of the job, you can show or hide
additional information, such as the settings, configuration information, or
messages for the job.

You can also click one of the **Actions** buttons to start the {dfeed}, edit
the job or {dfeed}, and clone or delete the job, for example.


[float]
[[ml-gs-job1-datafeed]]
==== Managing {dfeeds-cap}

A {dfeed} can be started and stopped multiple times throughout its lifecycle.
If you want to retrieve more data from {es} and the {dfeed} is stopped, you must
restart it.

For example, if you did not use the full data when you created the job, you can
now process the remaining data by restarting the {dfeed}:


. In the **Machine Learning** / **Job Management** tab, click the following
button to start the {dfeed}: image:images/ml-start-feed.jpg["Start {dfeed}"]


. Choose a start time and end time. For example,
click **Continue from 2017-04-01 23:59:00** and select **2017-04-30** as the
search end time. Then click **Start**. The date picker defaults to the latest
timestamp of processed data. Be careful not to leave any gaps in the analysis,
otherwise you might miss anomalies. +
+
--
[role="screenshot"]
image::images/ml-gs-job1-datafeed.jpg["Restarting a {dfeed}"]
--


The {dfeed} state changes to `started`, the job state changes to `opened`,
and the number of processed records increases as the new data is analyzed. The
latest timestamp also advances.

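If you prefer to restart the {dfeed} through the {ml} APIs instead of {kib}, the
request carries the same start and end times. The following Python sketch
builds such a request body. The helper name is hypothetical and the timestamps
simply mirror the example above; consult the Machine Learning APIs reference
for the endpoint in your version. Leaving out `end` is intended to correspond
to a {dfeed} that keeps running.

```python
from datetime import datetime, timezone


def start_datafeed_body(start, end=None):
    """Build the body of a start-datafeed request from aware datetimes."""
    def fmt(moment):
        # ISO 8601 in UTC, for example 2017-04-01T23:59:00Z.
        return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    body = {"start": fmt(start)}
    if end is not None:
        body["end"] = fmt(end)
    return body


body = start_datafeed_body(
    datetime(2017, 4, 1, 23, 59, tzinfo=timezone.utc),
    datetime(2017, 4, 30, tzinfo=timezone.utc),
)
print(body)
```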

TIP: If your data is being loaded continuously, you can continue running the job
in real time. For this, start your {dfeed} and select **No end time**.

If you want to stop the {dfeed} at this point, you can click the following
button: image:images/ml-stop-feed.jpg["Stop {dfeed}"]

Now that you have processed all the data, let's start exploring the job results.


[[ml-gs-job1-analyze]]
=== Exploring Single Metric Job Results

The {xpackml} features analyze the input stream of data, model its behavior,
and perform analysis based on the detectors you defined in your job. When an
event falls outside the behavior that the model expects, that event is
identified as an anomaly.

Result records for each anomaly are stored in `.ml-anomalies-*` indices in {es}.
By default, the name of the index where {ml} results are stored is labelled
`shared`, which corresponds to the `.ml-anomalies-shared` index.

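Because the results live in ordinary {es} indices, you can also search them
directly. The following Python sketch builds a search body that could be sent
to the `.ml-anomalies-*` indices to list the highest scoring records for a job.
The field names `job_id`, `result_type`, and `record_score` are assumptions
based on the documented results format and might differ between versions, so
verify them against your own indices before relying on this.

```python
import json


def anomaly_records_query(job_id, min_score=0, size=10):
    """Build a search body for the top-scoring anomaly records of one job."""
    return {
        "size": size,
        "sort": [{"record_score": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    # Assumption: individual anomalies have result_type "record".
                    {"term": {"result_type": "record"}},
                    {"range": {"record_score": {"gte": min_score}}},
                ]
            }
        },
    }


query = anomaly_records_query("total-requests", min_score=75)
print(json.dumps(query, indent=2))
```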

You can use the **Anomaly Explorer** or the **Single Metric Viewer** in {kib} to
view the analysis results.

Anomaly Explorer::
This view contains swim lanes showing the maximum anomaly score over time.
There is an overall swim lane that shows the overall score for the job, and
also swim lanes for each influencer. By selecting a block in a swim lane, the
anomaly details are displayed alongside the original source data (where
applicable).

Single Metric Viewer::
This view contains a chart that represents the actual and expected values over
time. This is only available for jobs that analyze a single time series and
where `model_plot_config` is enabled. As in the **Anomaly Explorer**, anomalous
data points are shown in different colors depending on their score.


By default when you view the results for a single metric job, the
**Single Metric Viewer** opens:
[role="screenshot"]
image::images/ml-gs-job1-analysis.jpg["Single Metric Viewer for total-requests job"]


The blue line in the chart represents the actual data values. The shaded blue
area represents the bounds for the expected values. The area between the upper
and lower bounds contains the most likely values for the model. If a value is
outside of this area, it can be said to be anomalous.

If you slide the time selector from the beginning of the data to the end of the
data, you can see how the model improves as it processes more data. At the
beginning, the expected range of values is fairly broad and the model is not
capturing the periodicity in the data. But it quickly learns and begins to
reflect the daily variation.


Any data points outside the range that was predicted by the model are marked
as anomalies. When you have high volumes of real-life data, many anomalies
might be found. These vary in probability from very likely to highly unlikely,
that is to say, from not particularly anomalous to highly anomalous. There
can be none, one, two, tens, or sometimes hundreds of anomalies found within
each bucket. There can be many thousands found per job. In order to provide
a sensible view of the results, an _anomaly score_ is calculated for each bucket
time interval. The anomaly score is a value from 0 to 100, which indicates
the significance of the observed anomaly compared to previously seen anomalies.
The highly anomalous values are shown in red and the low scored values are
indicated in blue. An interval with a high anomaly score is significant and
requires investigation.


Slide the time selector to a section of the time series that contains a red
anomaly data point. If you hover over the point, you can see more information
about that data point. You can also see details in the **Anomalies** section
of the viewer. For example:
[role="screenshot"]
image::images/ml-gs-job1-anomalies.jpg["Single Metric Viewer Anomalies for total-requests job"]

For each anomaly you can see key details such as the time, the actual and
expected ("typical") values, and their probability.


By default, the table contains all anomalies that have a severity of "warning"
or higher in the selected section of the timeline. If you are only interested in
critical anomalies, for example, you can change the severity threshold for this
table.

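The effect of the severity threshold can be sketched as a filter over anomaly
scores. In the following Python example, the score boundaries for each severity
label (75, 50, 25, and 0) are assumptions chosen for illustration, since this
tutorial does not state them, and the sample anomalies are hypothetical.

```python
# Assumed lower bounds for each severity label (not stated in this tutorial).
SEVERITY_THRESHOLDS = [("critical", 75), ("major", 50), ("minor", 25), ("warning", 0)]


def severity(score):
    """Map an anomaly score from 0 to 100 to a severity label."""
    for label, lower_bound in SEVERITY_THRESHOLDS:
        if score >= lower_bound:
            return label
    raise ValueError("anomaly scores range from 0 to 100")


def filter_by_severity(anomalies, threshold="warning"):
    """Keep only the anomalies at or above the chosen severity threshold."""
    minimum = dict(SEVERITY_THRESHOLDS)[threshold]
    return [a for a in anomalies if a["score"] >= minimum]


# Hypothetical anomalies for illustration only.
anomalies = [
    {"time": "2017-03-26T10:00", "score": 12},
    {"time": "2017-03-28T09:00", "score": 58},
    {"time": "2017-03-30T16:30", "score": 92},
]
critical_only = filter_by_severity(anomalies, "critical")
print([severity(a["score"]) for a in anomalies], len(critical_only))
```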

The anomalies table also automatically calculates an interval for the data in
the table. If the time difference between the earliest and latest records in the
table is less than two days, the data is aggregated by hour to show the details
of the highest severity anomaly for each detector. Otherwise, it is
aggregated by day. You can change the interval for the table, for example, to
show all anomalies.

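The automatic interval rule described above amounts to a single comparison. The
following Python sketch expresses it; the function name is hypothetical and the
code only restates the rule, it is not taken from {kib}.

```python
from datetime import datetime, timedelta


def table_interval(earliest, latest):
    """Pick the aggregation interval for the anomalies table."""
    if latest - earliest < timedelta(days=2):
        return "hour"
    return "day"


print(table_interval(datetime(2017, 3, 23), datetime(2017, 3, 24, 12)))
print(table_interval(datetime(2017, 3, 23), datetime(2017, 4, 1)))
```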

You can see the same information in a different format by using the
**Anomaly Explorer**:
[role="screenshot"]
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]


Click one of the red sections in the swim lane to see details about the anomalies
that occurred in that time interval. For example:
[role="screenshot"]
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]

After you have identified anomalies, often the next step is to try to determine
the context of those situations. For example, are there other factors that are
contributing to the problem? Are the anomalies confined to particular
applications or servers? You can begin to troubleshoot these situations by
layering additional jobs or creating multi-metric jobs.