OpenSearch/docs/en/ml/getting-started-single.asci...

332 lines
15 KiB
Plaintext

[[ml-gs-jobs]]
=== Creating Single Metric Jobs
At this point in the tutorial, the goal is to detect anomalies in the
total requests received by your applications and services. The sample data
contains a single key performance indicator(KPI) to track this, which is the total
requests over time. It is therefore logical to start by creating a single metric
job for this KPI.
TIP: If you are using aggregated data, you can create an advanced job
and configure it to use a `summary_count_field_name`. The {ml} algorithms will
make the best possible use of summarized data in this case. For simplicity, in
this tutorial we will not make use of that advanced functionality. For more
information, see <<ml-configuring-aggregation>>.
A single metric job contains a single _detector_. A detector defines the type of
analysis that will occur (for example, `max`, `average`, or `rare` analytical
functions) and the fields that will be analyzed.
To create a single metric job in {kib}:
. Open {kib} in your web browser and log in. If you are running {kib} locally,
go to `http://localhost:5601/`.
. Click **Machine Learning** in the side navigation.
. Click **Create new job**.
. Select the index pattern that you created for the sample data. For example,
`server-metrics*`.
. In the **Use a wizard** section, click **Single metric**.
. Configure the job by providing the following information: +
+
--
[role="screenshot"]
image::images/ml-gs-single-job.jpg["Create a new job from the server-metrics index"]
--
.. For the **Aggregation**, select `Sum`. This value specifies the analysis
function that is used.
+
--
Some of the analytical functions look for single anomalous data points. For
example, `max` identifies the maximum value that is seen within a bucket.
Others perform some aggregation over the length of the bucket. For example,
`mean` calculates the mean of all the data points seen within the bucket.
Similarly, `count` calculates the total number of data points within the bucket.
In this tutorial, you are using the `sum` function, which calculates the sum of
the specified field's values within the bucket. For descriptions of all the
functions, see <<ml-functions>>.
--
.. For the **Field**, select `total`. This value specifies the field that
the detector uses in the function.
+
--
NOTE: Some functions such as `count` and `rare` do not require fields.
--
.. For the **Bucket span**, enter `10m`. This value specifies the size of the
interval that the analysis is aggregated into.
+
--
The {xpackml} features use the concept of a bucket to divide up the time series
into batches for processing. For example, if you are monitoring
the total number of requests in the system,
using a bucket span of 1 hour would mean that at the end of each hour, it
calculates the sum of the requests for the last hour and computes the
anomalousness of that value compared to previous hours.
The bucket span has two purposes: it dictates over what time span to look for
anomalous features in data, and also determines how quickly anomalies can be
detected. Choosing a shorter bucket span enables anomalies to be detected more
quickly. However, there is a risk of being too sensitive to natural variations
or noise in the input data. Choosing too long a bucket span can mean that
interesting anomalies are averaged away. There is also the possibility that the
aggregation might smooth out some anomalies based on when the bucket starts
in time.
The bucket span has a significant impact on the analysis. When you're trying to
determine what value to use, take into account the granularity at which you
want to perform the analysis, the frequency of the input data, the duration of
typical anomalies, and the frequency at which alerting is required.
--
. Determine whether you want to process all of the data or only part of it. If
you want to analyze all of the existing data, click
**Use full server-metrics* data**. If you want to see what happens when you
stop and start {dfeeds} and process additional data over time, click the time
picker in the {kib} toolbar. Since the sample data spans a period of time
between March 23, 2017 and April 22, 2017, click **Absolute**. Set the start
time to March 23, 2017 and the end time to April 1, 2017, for example. Once
you've got the time range set up, click the **Go** button. +
+
--
[role="screenshot"]
image::images/ml-gs-job1-time.jpg["Setting the time range for the {dfeed}"]
--
+
--
A graph is generated, which represents the total number of requests over time.
Note that the **Estimate bucket span** option is no longer greyed out in the
**Buck span** field. This is an experimental feature that you can use to help
determine an appropriate bucket span for your data. For the purposes of this
tutorial, we will leave the bucket span at 10 minutes.
--
. Provide a name for the job, for example `total-requests`. The job name must
be unique in your cluster. You can also optionally provide a description of the
job and create a job group.
. Click **Create Job**. +
+
--
[role="screenshot"]
image::images/ml-gs-job1.jpg["A graph of the total number of requests over time"]
--
As the job is created, the graph is updated to give a visual representation of
the progress of {ml} as the data is processed. This view is only available whilst the
job is running.
When the job is created, you can choose to view the results, continue the job
in real-time, and create a watch. In this tutorial, we will look at how to
manage jobs and {dfeeds} before we view the results.
TIP: The `create_single_metic.sh` script creates a similar job and {dfeed} by
using the {ml} APIs. You can download that script by clicking
here: https://download.elastic.co/demos/machine_learning/gettingstarted/create_single_metric.sh[create_single_metric.sh]
For API reference information, see {ref}/ml-apis.html[Machine Learning APIs].
[[ml-gs-job1-manage]]
=== Managing Jobs
After you create a job, you can see its status in the **Job Management** tab: +
[role="screenshot"]
image::images/ml-gs-job1-manage1.jpg["Status information for the total-requests job"]
The following information is provided for each job:
Job ID::
The unique identifier for the job.
Description::
The optional description of the job.
Processed records::
The number of records that have been processed by the job.
Memory status::
The status of the mathematical models. When you create jobs by using the APIs or
by using the advanced options in {kib}, you can specify a `model_memory_limit`.
That value is the maximum amount of memory resources that the mathematical
models can use. Once that limit is approached, data pruning becomes more
aggressive. Upon exceeding that limit, new entities are not modeled. For more
information about this setting, see
{ref}/ml-job-resource.html#ml-apilimits[Analysis Limits]. The memory status
field reflects whether you have reached or exceeded the model memory limit. It
can have one of the following values: +
`ok`::: The models stayed below the configured value.
`soft_limit`::: The models used more than 60% of the configured memory limit
and older unused models will be pruned to free up space.
`hard_limit`::: The models used more space than the configured memory limit.
As a result, not all incoming data was processed.
Job state::
The status of the job, which can be one of the following values: +
`opened`::: The job is available to receive and process data.
`closed`::: The job finished successfully with its model state persisted.
The job must be opened before it can accept further data.
`closing`::: The job close action is in progress and has not yet completed.
A closing job cannot accept further data.
`failed`::: The job did not finish successfully due to an error.
This situation can occur due to invalid input data.
If the job had irrevocably failed, it must be force closed and then deleted.
If the {dfeed} can be corrected, the job can be closed and then re-opened.
{dfeed-cap} state::
The status of the {dfeed}, which can be one of the following values: +
started::: The {dfeed} is actively receiving data.
stopped::: The {dfeed} is stopped and will not receive data until it is
re-started.
Latest timestamp::
The timestamp of the last processed record.
If you click the arrow beside the name of job, you can show or hide additional
information, such as the settings, configuration information, or messages for
the job.
You can also click one of the **Actions** buttons to start the {dfeed}, edit
the job or {dfeed}, and clone or delete the job, for example.
[float]
[[ml-gs-job1-datafeed]]
==== Managing {dfeeds-cap}
A {dfeed} can be started and stopped multiple times throughout its lifecycle.
If you want to retrieve more data from {es} and the {dfeed} is stopped, you must
restart it.
For example, if you did not use the full data when you created the job, you can
now process the remaining data by restarting the {dfeed}:
. In the **Machine Learning** / **Job Management** tab, click the following
button to start the {dfeed}: image:images/ml-start-feed.jpg["Start {dfeed}"]
. Choose a start time and end time. For example,
click **Continue from 2017-04-01 23:59:00** and select **2017-04-30** as the
search end time. Then click **Start**. The date picker defaults to the latest
timestamp of processed data. Be careful not to leave any gaps in the analysis,
otherwise you might miss anomalies. +
+
--
[role="screenshot"]
image::images/ml-gs-job1-datafeed.jpg["Restarting a {dfeed}"]
--
The {dfeed} state changes to `started`, the job state changes to `opened`,
and the number of processed records increases as the new data is analyzed. The
latest timestamp information also increases.
TIP: If your data is being loaded continuously, you can continue running the job
in real time. For this, start your {dfeed} and select **No end time**.
If you want to stop the {dfeed} at this point, you can click the following
button: image:images/ml-stop-feed.jpg["Stop {dfeed}"]
Now that you have processed all the data, let's start exploring the job results.
[[ml-gs-job1-analyze]]
=== Exploring Single Metric Job Results
The {xpackml} features analyze the input stream of data, model its behavior,
and perform analysis based on the detectors you defined in your job. When an
event occurs outside of the model, that event is identified as an anomaly.
Result records for each anomaly are stored in `.ml-anomalies-*` indices in {es}.
By default, the name of the index where {ml} results are stored is labelled
`shared`, which corresponds to the `.ml-anomalies-shared` index.
You can use the **Anomaly Explorer** or the **Single Metric Viewer** in {kib} to
view the analysis results.
Anomaly Explorer::
This view contains swim lanes showing the maximum anomaly score over time.
There is an overall swim lane that shows the overall score for the job, and
also swim lanes for each influencer. By selecting a block in a swim lane, the
anomaly details are displayed alongside the original source data (where
applicable).
Single Metric Viewer::
This view contains a chart that represents the actual and expected values over
time. This is only available for jobs that analyze a single time series and
where `model_plot_config` is enabled. As in the **Anomaly Explorer**, anomalous
data points are shown in different colors depending on their score.
By default when you view the results for a single metric job, the
**Single Metric Viewer** opens:
[role="screenshot"]
image::images/ml-gs-job1-analysis.jpg["Single Metric Viewer for total-requests job"]
The blue line in the chart represents the actual data values. The shaded blue
area represents the bounds for the expected values. The area between the upper
and lower bounds are the most likely values for the model. If a value is outside
of this area then it can be said to be anomalous.
If you slide the time selector from the beginning of the data to the end of the
data, you can see how the model improves as it processes more data. At the
beginning, the expected range of values is pretty broad and the model is not
capturing the periodicity in the data. But it quickly learns and begins to
reflect the daily variation.
Any data points outside the range that was predicted by the model are marked
as anomalies. When you have high volumes of real-life data, many anomalies
might be found. These vary in probability from very likely to highly unlikely,
that is to say, from not particularly anomalous to highly anomalous. There
can be none, one or two or tens, sometimes hundreds of anomalies found within
each bucket. There can be many thousands found per job. In order to provide
a sensible view of the results, an _anomaly score_ is calculated for each bucket
time interval. The anomaly score is a value from 0 to 100, which indicates
the significance of the observed anomaly compared to previously seen anomalies.
The highly anomalous values are shown in red and the low scored values are
indicated in blue. An interval with a high anomaly score is significant and
requires investigation.
Slide the time selector to a section of the time series that contains a red
anomaly data point. If you hover over the point, you can see more information
about that data point. You can also see details in the **Anomalies** section
of the viewer. For example:
[role="screenshot"]
image::images/ml-gs-job1-anomalies.jpg["Single Metric Viewer Anomalies for total-requests job"]
For each anomaly you can see key details such as the time, the actual and
expected ("typical") values, and their probability.
By default, the table contains all anomalies that have a severity of "warning"
or higher in the selected section of the timeline. If you are only interested in
critical anomalies, for example, you can change the severity threshold for this
table.
The anomalies table also automatically calculates an interval for the data in
the table. If the time difference between the earliest and latest records in the
table is less than two days, the data is aggregated by hour to show the details
of the highest severity anomaly for each detector. Otherwise, it is
aggregated by day. You can change the interval for the table, for example, to
show all anomalies.
You can see the same information in a different format by using the
**Anomaly Explorer**:
[role="screenshot"]
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]
Click one of the red sections in the swim lane to see details about the anomalies
that occurred in that time interval. For example:
[role="screenshot"]
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]
After you have identified anomalies, often the next step is to try to determine
the context of those situations. For example, are there other factors that are
contributing to the problem? Are the anomalies confined to particular
applications or servers? You can begin to troubleshoot these situations by
layering additional jobs or creating multi-metric jobs.