[[ml-getting-started]]
|
|
== Getting Started
|
|
|
|
////
|
|
{xpack} {ml} features automatically detect:
|
|
* Anomalies in single or multiple time series
|
|
* Outliers in a population (also known as _entity profiling_)
|
|
* Rare events (also known as _log categorization_)
|
|
|
|
This tutorial focuses on an anomaly detection scenario in a single time series.
|
|
////
|
|
Ready to get some hands-on experience with the {xpack} {ml} features? This
|
|
tutorial shows you how to:
|
|
|
|
* Load a sample data set into {es}
|
|
* Create a {ml} job
|
|
* Use the results to identify possible anomalies in the data
|
|
|
|
At the end of this tutorial, you should have a good idea of what {ml} is and
|
|
will hopefully be inspired to use it to detect anomalies in your own data.
|
|
|
|
You might also be interested in these video tutorials:
|
|
|
|
* Getting started with machine learning (single metric)
|
|
* Getting started with machine learning (multiple metric)
|
|
|
|
|
|
[float]
|
|
[[ml-gs-sysoverview]]
|
|
=== System Overview
|
|
|
|
To follow the steps in this tutorial, you will need the following
|
|
components of the Elastic Stack:
|
|
|
|
* {es} {version}, which stores the data and the analysis results
|
|
* {xpack} {version}, which includes the beta {ml} features for both {es} and {kib}
|
|
* {kib} {version}, which provides a helpful user interface for creating and
|
|
viewing jobs +
|
|
|
|
//All {ml} features are available to use as an API, however this tutorial
|
|
//will focus on using the {ml} tab in the {kib} UI.
|
|
|
|
WARNING: The {xpack} {ml} features are in beta and subject to change.
|
|
Beta features are not subject to the same support SLA as GA features,
|
|
and deployment in production is at your own risk.
|
|
|
|
See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
|
|
information about supported operating systems.
|
|
|
|
See {stack-ref}/installing-elastic-stack.html[Installing the Elastic Stack] for
|
|
information about installing each of the components.
|
|
|
|
NOTE: To get started, you can install {es} and {kib} on a
|
|
single VM or even on your laptop (requires 64-bit OS).
|
|
As you add more data and your traffic grows,
|
|
you'll want to replace the single {es} instance with a cluster.
|
|
|
|
When you install {xpack} into {es} and {kib}, the {ml} features are
|
|
enabled by default. If you have multiple nodes in your cluster, you can
|
|
optionally dedicate nodes to specific purposes. If you want to control which
|
|
nodes are _machine learning nodes_ or limit which nodes run resource-intensive
|
|
activity related to jobs, see <<ml-settings>>.
|
|
|
|
|
|
[float]
|
|
[[ml-gs-users]]
|
|
==== Users, Roles, and Privileges
|
|
|
|
The {xpack} {ml} features implement cluster privileges and built-in roles to
|
|
make it easier to control which users have authority to view and manage the jobs,
|
|
data feeds, and results.
|
|
|
|
By default, you can perform all of the steps in this tutorial by using the
|
|
built-in `elastic` super user. The default password for the `elastic` user is
|
|
`changeme`. For information about how to change that password, see
|
|
<<security-getting-started>>.
|
|
|
|
If you are performing these steps in a production environment, take extra care
|
|
because `elastic` has the `superuser` role and you could inadvertently make
|
|
significant changes to the system. You can alternatively assign the
|
|
`machine_learning_admin` and `kibana_user` roles to a user ID of your choice.
|
|
|
|
For more information, see <<built-in-roles>> and <<privileges-list-cluster>>.
|
|
|
|
[[ml-gs-data]]
|
|
=== Identifying Data for Analysis
|
|
|
|
For the purposes of this tutorial, we provide sample data that you can play with
|
|
and search in {es}. When you consider your own data, however, it's important to
|
|
take a moment and think about where the {xpack} {ml} features will be most
|
|
impactful.
|
|
|
|
The first consideration is that it must be time series data. The {ml} features
|
|
are designed to model and detect anomalies in time series data.
|
|
|
|
The second consideration, especially when you are first learning to use {ml},
|
|
is the importance of the data and how familiar you are with it. Ideally, it is
|
|
information that contains key performance indicators (KPIs) for the health,
|
|
security, or success of your business or system. It is information that you need
|
|
to monitor and act on when anomalous behavior occurs. You might even have {kib}
|
|
dashboards that you're already using to watch this data. The better you know the
|
|
data, the quicker you will be able to create {ml} jobs that generate useful
|
|
insights.
|
|
|
|
The final consideration is where the data is located. This tutorial assumes that
|
|
your data is stored in {es}. It guides you through the steps required to create
|
|
a _data feed_ that passes data to a job. If your own data is outside of {es},
|
|
analysis is still possible by using a post data API.
|
|
|
|
IMPORTANT: If you want to create {ml} jobs in {kib}, you must use data feeds.
|
|
That is to say, you must store your input data in {es}. When you create
|
|
a job, you select an existing index pattern and {kib} configures the data feed
|
|
for you under the covers.
|
|
|
|
|
|
[float]
|
|
[[ml-gs-sampledata]]
|
|
==== Obtaining a Sample Data Set
|
|
|
|
In this step we will upload some sample data to {es}. This is standard
|
|
{es} functionality, and is needed to set the stage for using {ml}.
|
|
|
|
The sample data for this tutorial contains information about the requests that
|
|
are received by various applications and services in a system. A system
|
|
administrator might use this type of information to track the total
|
|
number of requests across all of the infrastructure. If the number of requests
|
|
increases or decreases unexpectedly, for example, this might be an indication
|
|
that there is a problem or that resources need to be redistributed. By using
|
|
the {xpack} {ml} features to model the behavior of this data, it is easier to
|
|
identify anomalies and take appropriate action.
|
|
|
|
Download this sample data from: https://github.com/elastic/examples
|
|
//Download this data set by clicking here:
|
|
//See https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json[shakespeare.json].
|
|
|
|
Use the following commands to extract the files:
|
|
|
|
[source,shell]
|
|
----------------------------------
|
|
tar xvf server_metrics.tar.gz
|
|
----------------------------------
|
|
|
|
Each document in the server-metrics data set has the following schema:
|
|
|
|
[source,js]
|
|
----------------------------------
|
|
|
|
{
|
|
"index":
|
|
{
|
|
"_index":"server-metrics",
|
|
"_type":"metric",
|
|
"_id":"AVuQL1eekrHQ5a9V5qre"
|
|
}
|
|
}
|
|
{
|
|
"deny":1783,
|
|
"service":"app_0",
|
|
"@timestamp":"2017-03-26T06:47:28.684926",
|
|
"accept":24465,
|
|
"host":"server_1",
|
|
"total":26248,
|
|
"response":1.8242486553275024
|
|
}
|
|
----------------------------------
|
|
|
|
TIP: The sample data sets include summarized data. For example, the `total`
|
|
value is a sum of the requests that were received by a specific service at a
|
|
particular time. If your data is stored in {es}, you can generate
|
|
this type of sum or average by using aggregations. One of the benefits of
|
|
summarizing data this way is that {es} automatically distributes
|
|
these calculations across your cluster. You can then feed this summarized data
|
|
into {xpack} {ml} instead of raw results, which reduces the volume
|
|
of data that must be considered while detecting anomalies. For the purposes of
|
|
this tutorial, however, these summary values are stored in {es},
|
|
rather than created using the {ref}/search-aggregations.html[_aggregations framework_].
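
As a rough illustration of the aggregation approach mentioned in this tip, the
following sketch builds a search request that sums the `total` field in
10-minute intervals. The `requests_over_time` and `total_requests` aggregation
names are hypothetical, and the `interval` parameter is assumed to match the
`date_histogram` syntax of your {es} version, so check the aggregations
reference before relying on it.

[source,shell]
----------------------------------
# Search body: one date_histogram bucket per 10 minutes, each holding the sum
# of `total`. The aggregation names are illustrative assumptions.
AGG_BODY='{
  "size": 0,
  "aggs": {
    "requests_over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "10m" },
      "aggs": {
        "total_requests": { "sum": { "field": "total" } }
      }
    }
  }
}'

# Sends the request; replace the credentials with your own before calling it.
run_total_requests_agg() {
  curl -u elastic:changeme -X POST -H 'Content-Type: application/json' \
    'http://localhost:9200/server-metrics/_search' -d "$AGG_BODY"
}

# Print the body for inspection; call run_total_requests_agg to query a cluster.
printf '%s\n' "$AGG_BODY"
----------------------------------

Setting `size` to `0` asks {es} to return only the aggregation results rather
than the matching documents.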
|
|
|
|
//TBD link to working with aggregations page
|
|
|
|
Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
|
|
for the fields. Mappings divide the documents in the index into logical groups
|
|
and specify a field's characteristics, such as the field's searchability or
|
|
whether or not it's _tokenized_, or broken up into separate words.
|
|
|
|
The sample data includes an `upload_server-metrics.sh` script, which you can use
|
|
to create the mappings and load the data set. Before you run it, however, you
|
|
must edit the USERNAME and PASSWORD variables with your actual user ID and
|
|
password.
|
|
|
|
The script runs a command similar to the following example, which sets up a
|
|
mapping for the data set:
|
|
|
|
[source,shell]
|
|
----------------------------------
|
|
|
|
curl -u elastic:changeme -X PUT -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics -d '{
|
|
"settings": {
|
|
"number_of_shards": 1,
|
|
"number_of_replicas": 0
|
|
},
|
|
"mappings": {
|
|
"metric": {
|
|
"properties": {
|
|
"@timestamp": {
|
|
"type": "date"
|
|
},
|
|
"accept": {
|
|
"type": "long"
|
|
},
|
|
"deny": {
|
|
"type": "long"
|
|
},
|
|
"host": {
|
|
"type": "text",
|
|
"fields": {
|
|
"keyword": {
|
|
"type": "keyword",
|
|
"ignore_above": 256
|
|
}
|
|
}
|
|
},
|
|
"response": {
|
|
"type": "float"
|
|
},
|
|
"service": {
|
|
"type": "text",
|
|
"fields": {
|
|
"keyword": {
|
|
"type": "keyword",
|
|
"ignore_above": 256
|
|
}
|
|
}
|
|
},
|
|
"total": {
|
|
"type": "long"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}'
|
|
----------------------------------
|
|
|
|
NOTE: If you run this command, you must replace `changeme` with your
|
|
actual password.
|
|
|
|
////
|
|
This mapping specifies the following qualities for the data set:
|
|
|
|
* The _@timestamp_ field is a date.
|
|
//that uses the ISO format `epoch_second`,
|
|
//which is the number of seconds since the epoch.
|
|
* The _accept_, _deny_, and _total_ fields are long numbers.
|
|
* The _host
|
|
////
|
|
|
|
You can then use the {es} `bulk` API to load the data set. The
|
|
`upload_server-metrics.sh` script runs commands similar to the following
|
|
example, which loads the four JSON files:
|
|
|
|
[source,shell]
|
|
----------------------------------
|
|
|
|
curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
|
|
----------------------------------
|
|
|
|
TIP: These commands upload 200MB of data. The data is split into four files
because there is a maximum 100MB limit when using the `_bulk` API.
|
|
|
|
These commands might take some time to run, depending on the computing resources
|
|
available.
|
|
|
|
You can verify that the data was loaded successfully with the following command:
|
|
|
|
[source,shell]
|
|
----------------------------------
|
|
|
|
curl 'http://localhost:9200/_cat/indices?v' -u elastic:changeme
|
|
----------------------------------
|
|
|
|
You should see output similar to the following:
|
|
|
|
[source,shell]
|
|
----------------------------------
|
|
|
|
health status index ... pri rep docs.count docs.deleted store.size ...
|
|
green open server-metrics ... 1 0 905940 0 120.5mb ...
|
|
----------------------------------
|
|
|
|
Next, you must define an index pattern for this data set:
|
|
|
|
. Open {kib} in your web browser and log in. If you are running {kib}
|
|
locally, go to `http://localhost:5601/`.
|
|
|
|
. Click the **Management** tab, then **Index Patterns**.
|
|
|
|
. If you already have index patterns, click the plus sign (+) to define a new
|
|
one. Otherwise, the **Configure an index pattern** wizard is already open.
|
|
|
|
. For this tutorial, any pattern that matches the name of the index you've
|
|
loaded will work. For example, enter `server-metrics*` as the index pattern.
|
|
|
|
. Verify that **Index contains time-based events** is checked.
|
|
|
|
. Select the `@timestamp` field from the **Time-field name** list.
|
|
|
|
. Click **Create**.
|
|
|
|
This data set can now be analyzed in {ml} jobs in {kib}.
|
|
|
|
|
|
[[ml-gs-jobs]]
|
|
=== Creating Jobs
|
|
|
|
Machine learning jobs contain the configuration information and metadata
|
|
necessary to perform an analytical task. They also contain the results of the
|
|
analytical task.
|
|
|
|
NOTE: This tutorial uses {kib} to create jobs and view results, but you can
|
|
alternatively use APIs to accomplish these tasks.
|
|
For API reference information, see <<ml-apis>>.
|
|
|
|
To work with jobs in {kib}:
|
|
|
|
. Open {kib} in your web browser and log in. If you are running {kib} locally,
|
|
go to `http://localhost:5601/`.
|
|
|
|
. Click **Machine Learning** in the side navigation:
|
|
image::images/ml-kibana.jpg["Job Management"]
|
|
|
|
You can choose to create single metric, multi-metric, or advanced jobs in
|
|
{kib}. In this tutorial, the goal is to detect anomalies in the total requests
|
|
received by your applications and services. The sample data contains a single
|
|
key performance indicator to track this, which is the total requests over time.
|
|
It is therefore logical to start by creating a single metric job for this KPI.
|
|
|
|
TIP: If you are using aggregated data, you can create an advanced job
|
|
and configure it to use a `summary_count_field`. The {ml} algorithms will
|
|
make the best possible use of summarized data in this case. For simplicity in this tutorial
|
|
we will not make use of that advanced functionality.
|
|
|
|
|
|
[float]
|
|
[[ml-gs-job1-create]]
|
|
==== Creating a Single Metric Job
|
|
|
|
A single metric job contains a single _detector_. A detector defines the type of
|
|
analysis that will occur (for example, `max`, `average`, or `rare` analytical
|
|
functions) and the fields that will be analyzed.
|
|
|
|
To create a single metric job in {kib}:
|
|
|
|
. Click **Machine Learning** in the side navigation,
|
|
then click **Create new job**.
|
|
|
|
. Click **Create single metric job**.
|
|
image::images/ml-create-jobs.jpg["Create a new job"]
|
|
|
|
. Click the `server-metrics` index. +
|
|
+
|
|
--
|
|
image::images/ml-gs-index.jpg["Select an index"]
|
|
--
|
|
|
|
. Configure the job by providing the following information:
|
|
image::images/ml-gs-single-job.jpg["Create a new job from the server-metrics index"]
|
|
|
|
.. For the **Aggregation**, select `Sum`. This value specifies the analysis
|
|
function that is used.
|
|
+
|
|
--
|
|
Some of the analytical functions look for single anomalous data points. For
|
|
example, `max` identifies the maximum value that is seen within a bucket.
|
|
Others perform some aggregation over the length of the bucket. For example,
|
|
`mean` calculates the mean of all the data points seen within the bucket.
|
|
Similarly, `count` calculates the total number of data points within the bucket.
|
|
In this tutorial, you are using the `sum` function, which calculates the sum of
|
|
the specified field's values within the bucket.
|
|
--
|
|
|
|
.. For the **Field**, select `total`. This value specifies the field that
|
|
the detector uses in the function.
|
|
+
|
|
--
|
|
NOTE: Some functions such as `count` and `rare` do not require fields.
|
|
--
|
|
|
|
.. For the **Bucket span**, enter `10m`. This value specifies the size of the
|
|
interval that the analysis is aggregated into.
|
|
+
|
|
--
|
|
The {xpack} {ml} features use the concept of a bucket to divide up the time series
into batches for processing. For example, if you are monitoring
the total number of requests in the system,
//and receive a data point every 10 minutes
a bucket span of 1 hour means that at the end of each hour, the analysis
calculates the sum of the requests for the last hour and computes how
anomalous that value is compared to previous hours.
|
|
|
|
The bucket span has two purposes: it dictates over what time span to look for
|
|
anomalous features in data, and also determines how quickly anomalies can be
|
|
detected. Choosing a shorter bucket span enables anomalies to be detected more
|
|
quickly. However, there is a risk of being too sensitive to natural variations
|
|
or noise in the input data. Choosing too long a bucket span can mean that
|
|
interesting anomalies are averaged away. There is also the possibility that the
|
|
aggregation might smooth out some anomalies based on when the bucket starts
|
|
in time.
|
|
|
|
The bucket span has a significant impact on the analysis. When you're trying to
|
|
determine what value to use, take into account the granularity at which you
|
|
want to perform the analysis, the frequency of the input data, the duration of
|
|
typical anomalies and the frequency at which alerting is required.
|
|
--
|
|
|
|
. Determine whether you want to process all of the data or only part of it. If
|
|
you want to analyze all of the existing data, click
|
|
**Use full server-metrics* data**. If you want to see what happens when you
|
|
stop and start data feeds and process additional data over time, click the time
|
|
picker in the {kib} toolbar. Since the sample data spans a period of time
|
|
between March 23, 2017 and April 22, 2017, click **Absolute**. Set the start
|
|
time to March 23, 2017 and the end time to April 1, 2017, for example. Once
|
|
you've got the time range set up, click the **Go** button.
|
|
image:images/ml-gs-job1-time.jpg["Setting the time range for the data feed"]
|
|
+
|
|
--
|
|
A graph is generated, which represents the total number of requests over time.
|
|
--
|
|
|
|
. Provide a name for the job, for example `total-requests`. The job name must
|
|
be unique in your cluster. You can also optionally provide a description of the
|
|
job.
|
|
|
|
. Click **Create Job**.
|
|
image::images/ml-gs-job1.jpg["A graph of the total number of requests over time"]
|
|
|
|
As the job is created, the graph is updated to give a visual representation of
|
|
the progress of {ml} as the data is processed. This view is only available while the
|
|
job is running.
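
To make the bucket span concept more concrete, here is a minimal sketch that
groups data points into 10-minute buckets and sums them, which is roughly the
value that the `sum` function evaluates for each bucket. The `bucket_sums`
helper is hypothetical, the timestamps are epoch seconds, and the values are
made up. It illustrates only the bucketing arithmetic, not the modeling that
{ml} performs.

[source,shell]
----------------------------------
# Input lines: "<epoch_seconds> <total>"; output lines: "<bucket_start> <sum>".
# The first argument is the bucket span in seconds (10m = 600).
bucket_sums() {
  awk -v span="$1" '
    { start = $1 - ($1 % span); sum[start] += $2 }
    END { for (s in sum) print s, sum[s] }
  ' | sort -n
}

# Four made-up data points that fall into two 10-minute buckets.
printf '%s\n' \
  '1490510400 100' \
  '1490510700 150' \
  '1490511000 200' \
  '1490511599 50' | bucket_sums 600
# prints:
# 1490510400 250
# 1490511000 250
----------------------------------

With a longer span, such as `3600`, all four points would land in a single
bucket, which shows how a long bucket span can average away short-lived spikes.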
|
|
|
|
TIP: The `create_single_metric.sh` script creates a similar job and data feed by
|
|
using the {ml} APIs. For API reference information, see <<ml-apis>>.
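
For orientation, the following sketch shows roughly what an equivalent job
definition might look like when sent to the {ml} APIs with `curl`. The endpoint
path and the field names are assumptions based on the {xpack} {ml} job API and
are not taken from the sample script itself, so confirm them against
<<ml-apis>> before use. The description text is made up.

[source,shell]
----------------------------------
JOB_ID='total-requests'

# Job definition mirroring the choices made in Kibana:
# sum of `total`, 10-minute buckets, `@timestamp` as the time field.
# Field names are assumptions; verify them in the API reference.
JOB_BODY='{
  "description": "Total sum of requests",
  "analysis_config": {
    "bucket_span": "10m",
    "detectors": [
      { "function": "sum", "field_name": "total" }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}'

# Creates the job; replace the credentials with your own before calling it.
create_job() {
  curl -u elastic:changeme -X PUT -H 'Content-Type: application/json' \
    "http://localhost:9200/_xpack/ml/anomaly_detectors/${JOB_ID}" -d "$JOB_BODY"
}

# Print the body for inspection; call create_job to send it to a cluster.
printf '%s\n' "$JOB_BODY"
----------------------------------

A job created this way has no data feed yet. A data feed that reads from the
`server-metrics` index must be created separately before the job can retrieve
data from {es}.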
|
|
|
|
[[ml-gs-job1-manage]]
|
|
=== Managing Jobs
|
|
|
|
After you create a job, you can see its status in the **Job Management** tab:
|
|
|
|
image::images/ml-gs-job1-manage1.jpg["Status information for the total-requests job"]
|
|
|
|
The following information is provided for each job:
|
|
|
|
Job ID::
|
|
The unique identifier for the job.
|
|
|
|
Description::
|
|
The optional description of the job.
|
|
|
|
Processed records::
|
|
The number of records that have been processed by the job.
|
|
|
|
Memory status::
|
|
The status of the mathematical models. When you create jobs by using the APIs or
|
|
by using the advanced options in {kib}, you can specify a `model_memory_limit`.
|
|
That value is the maximum amount of memory, in MiB, that the mathematical models
|
|
can use. Once that limit is approached, data pruning becomes more aggressive.
|
|
Upon exceeding that limit, new entities are not modeled.
|
|
The default value is `4096`. The memory status field reflects whether you have
|
|
reached or exceeded the model memory limit. It can have one of the following
|
|
values: +
|
|
`ok`::: The models stayed below the configured value.
|
|
`soft_limit`::: The models used more than 60% of the configured memory limit
|
|
and older unused models will be pruned to free up space.
|
|
`hard_limit`::: The models used more space than the configured memory limit.
|
|
As a result, not all incoming data was processed.
|
|
|
|
Job state::
|
|
The status of the job, which can be one of the following values: +
|
|
`opened`::: The job is available to receive and process data.
|
|
`closed`::: The job finished successfully with its model state persisted.
|
|
The job must be opened before it can accept further data.
|
|
`closing`::: The job close action is in progress and has not yet completed.
|
|
A closing job cannot accept further data.
|
|
`failed`::: The job did not finish successfully due to an error.
|
|
This situation can occur due to invalid input data.
|
|
If the job has irrevocably failed, it must be force closed and then deleted.
|
|
If the data feed can be corrected, the job can be closed and then re-opened.
|
|
|
|
Datafeed state::
|
|
The status of the data feed, which can be one of the following values: +
|
|
`started`::: The data feed is actively receiving data.
`stopped`::: The data feed is stopped and will not receive data until it is
|
|
re-started.
|
|
|
|
Latest timestamp::
|
|
The timestamp of the last processed record.
|
|
|
|
|
|
If you click the arrow beside the name of the job, you can show or hide additional
|
|
information, such as the settings, configuration information, or messages for
|
|
the job.
|
|
|
|
You can also click one of the **Actions** buttons to start the data feed, edit
|
|
the job or data feed, and clone or delete the job, for example.
|
|
|
|
[float]
|
|
[[ml-gs-job1-datafeed]]
|
|
==== Managing Data Feeds
|
|
|
|
A data feed can be started and stopped multiple times throughout its lifecycle.
|
|
If you want to retrieve more data from {es} and the data feed is
|
|
stopped, you must restart it.
|
|
|
|
For example, if you did not use the full data when you created the job, you can
|
|
now process the remaining data by restarting the data feed:
|
|
|
|
. In the **Machine Learning** / **Job Management** tab, click the following
|
|
button to start the data feed:
|
|
image::images/ml-start-feed.jpg["Start data feed"]
|
|
|
|
. Choose a start time and end time. For example,
|
|
click **Continue from 2017-04-01 23:59:00** and select **2017-04-30** as the
|
|
search end time. Then click **Start**. The date picker defaults to the latest
|
|
timestamp of processed data. Be careful not to leave any gaps in the analysis,
|
|
otherwise you might miss anomalies.
|
|
image::images/ml-gs-job1-datafeed.jpg["Restarting a data feed"]
|
|
|
|
The data feed state changes to `started`, the job state changes to `opened`,
|
|
and the number of processed records increases as the new data is analyzed. The
|
|
latest timestamp information also increases. For example:
|
|
image::images/ml-gs-job1-manage2.jpg["Job opened and data feed started"]
|
|
|
|
TIP: If your data is being loaded continuously, you can continue running the job
|
|
in real time. For this, start your data feed and select **No end time**.
|
|
|
|
If you want to stop the data feed at this point, you can click the following
|
|
button:
|
|
image::images/ml-stop-feed.jpg["Stop data feed"]
|
|
|
|
Now that you have processed all the data, let's start exploring the job results.
|
|
|
|
|
|
[[ml-gs-jobresults]]
|
|
=== Exploring Job Results
|
|
|
|
The {xpack} {ml} features analyze the input stream of data, model its behavior,
|
|
and perform analysis based on the detectors you defined in your job. When an
|
|
event occurs outside of the model, that event is identified as an anomaly.
|
|
|
|
Result records for each anomaly are stored in `.ml-anomalies-*` indices in {es}.
|
|
By default, the name of the index where {ml} results are stored is labelled
|
|
`shared`, which corresponds to the `.ml-anomalies-shared` index.
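
Because the results are ordinary {es} documents, you can also query them
directly. The following sketch searches the results indices for the
highest-scoring buckets of a job named `total-requests`. The `job_id`,
`result_type`, and `anomaly_score` field names and the threshold of 75 are
assumptions made for illustration; verify them against the results
documentation for your version.

[source,shell]
----------------------------------
# Search body: the ten highest-scoring buckets for one job.
# Field names and the score threshold are illustrative assumptions.
RESULTS_QUERY='{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "total-requests" } },
        { "term": { "result_type": "bucket" } },
        { "range": { "anomaly_score": { "gte": 75 } } }
      ]
    }
  },
  "sort": [ { "anomaly_score": { "order": "desc" } } ]
}'

# Sends the search; replace the credentials with your own before calling it.
search_anomalous_buckets() {
  curl -u elastic:changeme -X POST -H 'Content-Type: application/json' \
    'http://localhost:9200/.ml-anomalies-*/_search' -d "$RESULTS_QUERY"
}

# Print the body for inspection; call search_anomalous_buckets to run it.
printf '%s\n' "$RESULTS_QUERY"
----------------------------------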
|
|
|
|
You can use the **Anomaly Explorer** or the **Single Metric Viewer** in {kib} to
|
|
view the analysis results.
|
|
|
|
Anomaly Explorer::
|
|
This view contains swim lanes showing the maximum anomaly score over time.
|
|
There is an overall swim lane that shows the overall score for the job, and
|
|
also swim lanes for each influencer. By selecting a block in a swim lane, the
|
|
anomaly details are displayed alongside the original source data (where
|
|
applicable).
|
|
//TBD: Are they swimlane blocks, tiles, segments or cards? hmmm
|
|
//TBD: Do the time periods in the heat map correspond to buckets? hmmm is it a heat map?
|
|
//As time is the x-axis, and the block sizes stay the same, it feels more intuitive call it a swimlane.
|
|
//The swimlane bucket intervals depends on the time range selected. Their smallest possible
|
|
//granularity is a bucket, but if you have a big time range selected, then they will span many buckets
|
|
|
|
Single Metric Viewer::
|
|
This view contains a chart that represents the actual and expected values over
|
|
time. This is only available for jobs that analyze a single time series and
|
|
where `model_plot_config` is enabled. As in the **Anomaly Explorer**, anomalous
|
|
data points are shown in different colors depending on their score.
|
|
|
|
[float]
|
|
[[ml-gs-job1-analyze]]
|
|
==== Exploring Single Metric Job Results
|
|
|
|
By default when you view the results for a single metric job, the
|
|
**Single Metric Viewer** opens:
|
|
image::images/ml-gs-job1-analysis.jpg["Single Metric Viewer for total-requests job"]
|
|
|
|
The blue line in the chart represents the actual data values. The shaded blue
|
|
area represents the bounds for the expected values. The area between the upper
and lower bounds contains the most likely values for the model. If a value is
outside of this area, it can be said to be anomalous.
|
|
|
|
If you slide the time selector from the beginning of the data to the end of the
|
|
data, you can see how the model improves as it processes more data. At the
|
|
beginning, the expected range of values is pretty broad and the model is not
|
|
capturing the periodicity in the data. But it quickly learns and begins to
|
|
reflect the daily variation.
|
|
|
|
Any data points outside the range that was predicted by the model are marked
|
|
as anomalies. When you have high volumes of real-life data, many anomalies
|
|
might be found. These vary in probability from very likely to highly unlikely,
|
|
that is to say, from not particularly anomalous to highly anomalous. A single
bucket can contain no anomalies, one or two, tens, or sometimes hundreds, and
a job can produce many thousands in total. In order to provide
|
|
a sensible view of the results, an _anomaly score_ is calculated for each bucket
|
|
time interval. The anomaly score is a value from 0 to 100, which indicates
|
|
the significance of the observed anomaly compared to previously seen anomalies.
|
|
The highly anomalous values are shown in red and the low scored values are
|
|
indicated in blue. An interval with a high anomaly score is significant and
|
|
requires investigation.
|
|
|
|
Slide the time selector to a section of the time series that contains a red
|
|
anomaly data point. If you hover over the point, you can see more information
|
|
about that data point. You can also see details in the **Anomalies** section
|
|
of the viewer. For example:
|
|
|
|
image::images/ml-gs-job1-anomalies.jpg["Single Metric Viewer Anomalies for total-requests job"]
|
|
|
|
For each anomaly you can see key details such as the time, the actual and
|
|
expected ("typical") values, and their probability.
|
|
|
|
You can see the same information in a different format by using the
|
|
**Anomaly Explorer**:
|
|
|
|
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]
|
|
|
|
Click one of the red blocks in the swim lane to see details about the anomalies
|
|
that occurred in that time interval. For example:
|
|
|
|
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]
|
|
|
|
|
|
After you have identified anomalies, often the next step is to try to determine
|
|
the context of those situations. For example, are there other factors that are
|
|
contributing to the problem? Are the anomalies confined to particular
|
|
applications or servers? You can begin to troubleshoot these situations by
|
|
layering additional jobs or creating multi-metric jobs.
|
|
|
|
////
|
|
The troubleshooting job would not create alarms of its own, but rather would
|
|
help explain the overall situation. It's usually a different job because it's
|
|
operating on different indices. Layering jobs is an important concept.
|
|
////
|
|
////
|
|
[float]
|
|
[[ml-gs-job2-create]]
|
|
==== Creating a Multi-Metric Job
|
|
|
|
TBD.
|
|
|
|
* Walk through creation of a simple multi-metric job.
|
|
* Provide overview of:
|
|
** partition fields,
|
|
** influencers
|
|
*** An influencer is someone or something that has influenced or contributed to the anomaly.
|
|
Results are aggregated for each influencer, for each bucket, across all detectors.
|
|
In this way, a combined anomaly score is calculated for each influencer,
|
|
which determines its relative anomalousness. You can specify one or many influencers.
|
|
Picking an influencer is strongly recommended for the following reasons:
|
|
**** It allows you to blame someone/something for the anomaly
|
|
**** It simplifies and aggregates results
|
|
*** The best influencer is the person or thing that you want to blame for the anomaly.
|
|
In many cases, users or client IP make excellent influencers.
|
|
*** By/over/partition fields are usually good candidates for influencers.
|
|
*** Influencers can be any field in the source data; they do not need to be fields
|
|
specified in detectors, although they often are.
|
|
** by/over fields,
|
|
*** detectors
|
|
**** You can have more than one detector in a job which is more efficient than
|
|
running multiple jobs against the same data stream.
|
|
|
|
//http://www.prelert.com/docs/behavioral_analytics/latest/concepts/multivariate.html
|
|
|
|
[float]
|
|
[[ml-gs-job2-analyze]]
|
|
===== Viewing Multi-Metric Job Results
|
|
|
|
TBD.
|
|
|
|
* Walk through exploration of job results.
|
|
* Describe how influencer detection accelerates root cause identification.
|
|
|
|
////
|
|
////
|
|
* Provide brief overview of statistical models and/or link to more info.
|
|
* Possibly discuss effect of altering bucket span.
|
|
|
|
The anomaly score is a sophisticated aggregation of the anomaly records in the
|
|
bucket. The calculation is optimized for high throughput, gracefully ages
|
|
historical data, and reduces the signal to noise levels. It adjusts for
|
|
variations in event rate, takes into account the frequency and the level of
|
|
anomalous activity and is adjusted relative to past anomalous behavior.
|
|
In addition, [the anomaly score] is boosted if anomalous activity occurs for related entities,
|
|
for example if disk IO and CPU are both behaving unusually for a given host.
|
|
** Once an anomalous time interval has been identified, it can be expanded to
|
|
view the detailed anomaly records which are the significant causal factors.
|
|
////
|
|
////
|
|
[[ml-gs-alerts]]
|
|
=== Creating Alerts for Job Results
|
|
|
|
TBD.
|
|
|
|
* Walk through creation of simple alert for anomalous data?
|
|
|
|
////
|