[[ml-getting-started]]
|
|
== Getting Started
|
|
|
|
////
|
|
{xpack} {ml} features automatically detect:
|
|
* Anomalies in single or multiple time series
|
|
* Outliers in a population (also known as _entity profiling_)
|
|
* Rare events (also known as _log categorization_)
|
|
|
|
This tutorial focuses on an anomaly detection scenario in a single time series.
|
|
////
|
|
Ready to get some hands-on experience with the {xpack} {ml} features? This
|
|
tutorial shows you how to:
|
|
|
|
* Load a sample data set into Elasticsearch
|
|
* Create a {ml} job
|
|
* Use the results to identify possible anomalies in the data
|
|
|
|
|
|
At the end of this tutorial, you should have a good idea of what {ml} is and
|
|
will hopefully be inspired to use it to detect anomalies in your own data.
|
|
|
|
You might also be interested in these video tutorials:
|
|
|
|
* Getting started with machine learning (single metric)
|
|
* Getting started with machine learning (multiple metric)
|
|
|
|
|
|
[float]
|
|
[[ml-gs-sysoverview]]
|
|
=== System Overview
|
|
|
|
To follow the steps in this tutorial, you will need the following
|
|
components of the Elastic Stack:
|
|
|
|
* Elasticsearch {version}, which stores the data and the analysis results
|
|
* {xpack} {version}, which provides the {ml} features
|
|
* Kibana {version}, which provides a helpful user interface for creating and
|
|
viewing jobs +
|
|
|
|
|
|
See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
|
|
information about supported operating systems.
|
|
|
|
See {stack-ref}/installing-elastic-stack.html[Installing the Elastic Stack] for
|
|
information about installing each of the components.
|
|
|
|
NOTE: To get started, you can install Elasticsearch and Kibana on a
|
|
single VM or even on your laptop. As you add more data and your traffic grows,
|
|
you'll want to replace the single Elasticsearch instance with a cluster.
|
|
|
|
When you install {xpack} into Elasticsearch and Kibana, the {ml} features are
|
|
enabled by default. If you have multiple nodes in your cluster, you can
|
|
optionally dedicate nodes to specific purposes. If you want to control which
|
|
nodes are _machine learning nodes_ or limit which nodes run resource-intensive
|
|
activity related to jobs, see <<ml-settings>>.
|
|
|
|
[float]
|
|
[[ml-gs-users]]
|
|
==== Users, Roles, and Privileges
|
|
|
|
The {xpack} {ml} features implement cluster privileges and built-in roles to
|
|
make it easier to control which users have authority to view and manage the jobs,
|
|
data feeds, and results.
|
|
|
|
By default, you can perform all of the steps in this tutorial by using the
|
|
built-in `elastic` user. If you are performing these steps in a production
|
|
environment, take extra care because that user has the `superuser` role and you
|
|
could inadvertently make significant changes to the system. You can
|
|
alternatively assign the `machine_learning_admin` and `kibana_user` roles to a
|
|
user ID of your choice.
|
|
|
|
For more information, see <<built-in-roles>> and <<privileges-list-cluster>>.
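
As a minimal sketch of the alternative described above, the following commands
create a dedicated user that has the `machine_learning_admin` and `kibana_user`
roles by using the {xpack} security user API. The user name `ml_tutorial_user`
and its password are hypothetical placeholders, and the example assumes that
Elasticsearch is listening on `localhost:9200` with the same `elastic`
credentials that are used elsewhere in this tutorial:

[source,shell]
----------------------------------
# Hypothetical user name and password; replace them with your own values.
ML_USER="ml_tutorial_user"
ML_USER_BODY='{
  "password": "replace-with-a-strong-password",
  "roles": ["machine_learning_admin", "kibana_user"]
}'

# Sends the request to the X-Pack security user API.
create_ml_user() {
  curl -u elastic:elasticpassword -X POST -H 'Content-Type: application/json' \
    "http://localhost:9200/_xpack/security/user/${ML_USER}" -d "${ML_USER_BODY}"
}
----------------------------------

Run `create_ml_user` after Elasticsearch has started, then log in to Kibana as
the new user instead of `elastic`.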
|
|
|
|
[[ml-gs-data]]
|
|
=== Identifying Data for Analysis
|
|
|
|
For the purposes of this tutorial, we provide sample data that you can play with.
|
|
When you consider your own data, however, it's important to take a moment
|
|
and consider where the {xpack} {ml} features will be most impactful.
|
|
|
|
The first consideration is that it must be time series data.
|
|
Generally, it's best to use data that is in chronological order. When the data
|
|
feed occurs in ascending time order, the statistical models and calculations are
|
|
very efficient and occur in real-time.
|
|
//TBD: Talk about handling out of sequence data?
|
|
|
|
The second consideration, especially when you are first learning to use {ml},
|
|
is the importance of the data and how familiar you are with it. Ideally, it is
|
|
information that contains key performance indicators (KPIs) for the health or
|
|
success of your business or system. It is information that you need to act on
|
|
when anomalous behavior occurs. You might even have Kibana dashboards that
|
|
you're already using to watch this data. The better you know the data,
|
|
the quicker you will be able to create {ml} jobs that generate useful insights.
|
|
|
|
//TBD: Talk about layering additional jobs?
|
|
////
|
|
You can then create additional jobs to troubleshoot the situation and put it
|
|
into context of what was going on in the system at the time.
|
|
The troubleshooting job would not create alarms of its own, but rather would
|
|
help explain the overall situation. It's usually a different job because it's
|
|
operating on different indices. Layering jobs is an important concept.
|
|
////
|
|
////
|
|
* Working with out of sequence data:
|
|
** In the typical case where data arrives in ascending time order,
|
|
each new record pushes the time forward. When a record is received that belongs
|
|
to a new bucket, the current bucket is considered to be completed.
|
|
At this point, the model is updated and final results are calculated for the
|
|
completed bucket and the new bucket is created.
|
|
** Expecting data to be in time sequence means that modeling and results
|
|
calculations can be performed very efficiently and in real-time.
|
|
As a direct consequence of this approach, out-of-sequence records are ignored.
|
|
** When data is expected to arrive out-of-sequence, a latency window can be
|
|
specified in the job configuration (does not apply to data feeds?). (If we're
|
|
using a data feed in the sample, perhaps this discussion can be deferred for
|
|
future more-advanced scenario.)
|
|
//See http://www.prelert.com/docs/behavioral_analytics/latest/concepts/outofsequence.html
|
|
////
|
|
|
|
The final consideration is where the data is located. If the data that you want
|
|
to analyze is stored in Elasticsearch, you can define a _data feed_ that
|
|
provides data to the job in real time. When you have both the input data and the
|
|
analytical results in Elasticsearch, this data gravity provides performance
|
|
benefits.
|
|
|
|
IMPORTANT: If you want to create {ml} jobs in Kibana, you must use data feeds.
|
|
That is to say, you must store your input data in Elasticsearch. When you create
|
|
a job, you select an existing index pattern and Kibana configures the data feed
|
|
for you under the covers.
|
|
|
|
If your data is not stored in Elasticsearch, you can create jobs by using
|
|
the <<ml-put-job,create job API>> and upload batches of data to the job by
|
|
using the <<ml-post-data,post data API>>. That scenario is not covered in
|
|
this tutorial, however.
|
|
|
|
//TBD: The data must be provided in JSON format?
|
|
|
|
[float]
|
|
[[ml-gs-sampledata]]
|
|
==== Obtaining a Sample Data Set
|
|
|
|
The sample data for this tutorial contains information about the requests that
|
|
are received by various applications and services in a system. A system
|
|
administrator might use this type of information to track the total
|
|
number of requests across all of the infrastructure. If the number of requests
|
|
increases or decreases unexpectedly, for example, this might be an indication
|
|
that there is a problem or that resources need to be redistributed. By using
|
|
the {xpack} {ml} features to model the behavior of this data, it is easier to
|
|
identify anomalies and take appropriate action.
|
|
|
|
* TBD: Provide instructions for downloading the sample data after it's made
|
|
available publicly on https://github.com/elastic/examples
|
|
//Download this data set by clicking here:
|
|
//See https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json[shakespeare.json].
|
|
|
|
Use the following command to extract the files:

[source,shell]
----------------------------------
tar xvf server_metrics.tar.gz
----------------------------------
|
|
|
|
Each document in the server-metrics data set has the following schema:
|
|
|
|
[source,js]
----------------------------------
{
  "index":
  {
    "_index":"server-metrics",
    "_type":"metric",
    "_id":"AVuQL1eekrHQ5a9V5qre"
  }
}
{
  "deny":1783,
  "service":"app_0",
  "@timestamp":"2017-03-26T06:47:28.684926",
  "accept":24465,
  "host":"server_1",
  "total":26248,
  "response":1.8242486553275024
}
----------------------------------
|
|
|
|
Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
|
|
for the fields. Mappings divide the documents in the index into logical groups
|
|
and specify a field's characteristics, such as the field's searchability or
|
|
whether or not it's _tokenized_, or broken up into separate words.
|
|
|
|
The sample data includes an `upload_server-metrics.sh` script, which you can use
|
|
to create the mappings and load the data set. Before you run it, however, you
|
|
must edit the USERNAME and PASSWORD variables with your actual user ID and
|
|
password. If you want to test adding data to an existing data feed, you must
|
|
also comment out the final two commands related to `server-metrics_4.json`.
|
|
|
|
The script runs a command similar
|
|
to the following example, which sets up a mapping for the data set:
|
|
|
|
[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X PUT -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "metric": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "accept": {
          "type": "long"
        },
        "deny": {
          "type": "long"
        },
        "host": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "response": {
          "type": "float"
        },
        "service": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "total": {
          "type": "long"
        }
      }
    }
  }
}'
----------------------------------
|
|
|
|
NOTE: If you run this command, you must replace `elasticpassword` with your
|
|
actual password.
|
|
|
|
////
|
|
This mapping specifies the following qualities for the data set:
|
|
|
|
* The _@timestamp_ field is a date.
|
|
//that uses the ISO format `epoch_second`,
|
|
//which is the number of seconds since the epoch.
|
|
* The _accept_, _deny_, and _total_ fields are long numbers.
|
|
* The _host
|
|
////
|
|
|
|
You can then use the Elasticsearch `bulk` API to load the data set. The
|
|
`upload_server-metrics.sh` script runs commands similar to the following
|
|
example, which loads three of the JSON files:
|
|
|
|
[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"

curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json"

curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"
----------------------------------
|
|
|
|
//curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
|
|
//http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
|
|
These commands might take some time to run, depending on the computing resources
|
|
available.
|
|
|
|
You can verify that the data was loaded successfully with the following command:
|
|
|
|
[source,shell]
----------------------------------
curl 'http://localhost:9200/_cat/indices?v' -u elastic:elasticpassword
----------------------------------
|
|
|
|
For three sample JSON files, you should see output similar to the following:
|
|
|
|
[source,shell]
----------------------------------
health status index ... pri rep docs.count docs.deleted store.size ...
green open server-metrics ... 1 0 680400 0 101.7mb ...
----------------------------------
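
If you prefer a single number over the full index listing, the count API offers
a quick cross-check. This is only a sketch that assumes the same host and
credentials as the commands above; the `count_server_metrics` helper name is
hypothetical. The `count` value in the response should match the `docs.count`
column:

[source,shell]
----------------------------------
# Returns a small JSON document whose "count" field is the number of
# documents in the server-metrics index.
count_server_metrics() {
  curl -u elastic:elasticpassword \
    'http://localhost:9200/server-metrics/_count?pretty'
}
----------------------------------

Call `count_server_metrics` after the bulk uploads finish.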
|
|
|
|
Next, you must define an index pattern for this data set:
|
|
|
|
. Open Kibana in your web browser and log in. If you are running Kibana
|
|
locally, go to `http://localhost:5601/`.
|
|
|
|
. Click the **Management** tab, then **Index Patterns**.
|
|
|
|
. Click the plus sign (+) to define a new index pattern.
|
|
|
|
. For this tutorial, any pattern that matches the name of the index you've
|
|
loaded will work. For example, enter `server-metrics*` as the index pattern.
|
|
|
|
. Verify that the **Index contains time-based events** is checked.
|
|
|
|
. Select the `@timestamp` field from the **Time-field name** list.
|
|
|
|
. Click **Create**.
|
|
|
|
This data set can now be analyzed in {ml} jobs in Kibana.
|
|
//Content based on https://www.elastic.co/guide/en/kibana/current/tutorial-load-dataset.html
|
|
|
|
[[ml-gs-jobs]]
|
|
=== Creating Jobs
|
|
|
|
Machine learning jobs contain the configuration information and metadata
|
|
necessary to perform an analytical task. They also contain the results of the
|
|
analytical task.
|
|
|
|
NOTE: This tutorial uses Kibana to create jobs and view results, but you can
|
|
alternatively use APIs to accomplish most tasks.
|
|
For API reference information, see <<ml-apis>>.
|
|
|
|
To work with jobs in Kibana:
|
|
|
|
. Open Kibana in your web browser and log in. If you are running Kibana
|
|
locally, go to `http://localhost:5601/`.
|
|
|
|
. Click **Machine Learning** in the side navigation:
|
|
image::images/ml-kibana.jpg["Job Management"]
|
|
|
|
You can choose to create single metric, multi-metric, or advanced jobs in
|
|
Kibana. In this tutorial, the goal is to detect anomalies in the total requests
|
|
received by your applications and services. The sample data contains a single
|
|
key performance indicator to track this, which is the total requests over time.
|
|
It is therefore logical to start by creating a single metric job for this KPI.
|
|
|
|
[float]
|
|
[[ml-gs-job1-create]]
|
|
==== Creating a Single Metric Job
|
|
|
|
A single metric job contains a single _detector_. A detector defines the type of
|
|
analysis that will occur (for example, `max`, `average`, or `rare` analytical
|
|
functions) and the fields that will be analyzed.
|
|
|
|
To create a single metric job in Kibana:
|
|
|
|
. Click **Machine Learning** in the side navigation,
|
|
then click **Create new job**.
|
|
|
|
. Click **Create single metric job**.
|
|
image::images/ml-create-jobs.jpg["Create a new job"]
|
|
|
|
. Click the `server-metrics` index. +
|
|
+
|
|
--
|
|
image::images/ml-gs-index.jpg["Select an index"]
|
|
--
|
|
|
|
. Configure the job by providing the following information:
|
|
image::images/ml-gs-single-job.jpg["Create a new job from the server-metrics index"]
|
|
|
|
.. For the **Aggregation**, select `Sum`. This value specifies the analysis
|
|
function that is used.
|
|
+
|
|
--
|
|
Some of the analytical functions look for single anomalous data points. For
|
|
example, `max` identifies the maximum value that is seen within a bucket.
|
|
Others perform some aggregation over the length of the bucket. For example,
|
|
`mean` calculates the mean of all the data points seen within the bucket.
|
|
Similarly, `count` calculates the total number of data points within the bucket.
|
|
In this tutorial, you are using the `sum` function, which calculates the sum of
|
|
the specified field's values within the bucket.
|
|
--
|
|
|
|
.. For the **Field**, select `total`. This value specifies the field that
|
|
the detector uses in the function.
|
|
+
|
|
--
|
|
NOTE: Some functions such as `count` and `rare` do not require fields.
|
|
--
|
|
|
|
.. For the **Bucket span**, enter `600s`. This value specifies the size of the
|
|
interval that the analysis is aggregated into.
|
|
+
|
|
--
|
|
The {xpack} {ml} features use the concept of a bucket to divide up a continuous
|
|
stream of data into batches for processing. For example, if you are monitoring
|
|
the total number of requests in the system,
|
|
//and receive a data point every 10 minutes
|
|
using a bucket span of 1 hour would mean that at the end of each hour, it
|
|
calculates the sum of the requests for the last hour and computes the
|
|
anomalousness of that value compared to previous hours.
|
|
|
|
The bucket span has two purposes: it dictates over what time span to look for
|
|
anomalous features in data, and also determines how quickly anomalies can be
|
|
detected. Choosing a shorter bucket span allows anomalies to be detected more
|
|
quickly. However, there is a risk of being too sensitive to natural variations
|
|
or noise in the input data. Choosing too long a bucket span can mean that
|
|
interesting anomalies are averaged away. There is also the possibility that the
|
|
aggregation might smooth out some anomalies based on when the bucket starts
|
|
in time.
|
|
|
|
The bucket span has a significant impact on the analysis. When you're trying to
|
|
determine what value to use, take into account the granularity at which you
|
|
want to perform the analysis, the frequency of the input data, and the frequency
|
|
at which alerting is required.
|
|
//TBD: Talk about overlapping buckets? "To avoid this, you can use overlapping
|
|
//buckets (how/where?). We analyze the data points in two buckets simultaneously,
|
|
//one starting half a bucket span later than the other. Overlapping buckets are
|
|
//only beneficial for aggregating functions, and should not be used for
|
|
//non-aggregating functions.
|
|
--
|
|
|
|
. Click **Use full server-metrics* data**. A graph is generated,
|
|
which represents the total number of requests over time.
|
|
|
|
. Provide a name for the job, for example `total-requests`. The job name must
|
|
be unique in your cluster. You can also optionally provide a description of the
|
|
job.
|
|
|
|
. Click **Create Job**.
|
|
image::images/ml-gs-job1.jpg["A graph of the total number of requests over time"]
|
|
|
|
As the job is created, the graph is updated to give a visual representation of
|
|
the {ml} that occurs as the data is processed.
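
The bucketing behavior described in the bucket span step can be illustrated
outside of {ml}. The following sketch is not how the {xpack} {ml} features are
implemented; it only groups made-up `<epoch seconds> <total>` pairs into
600-second buckets and sums them, which is the kind of per-bucket value that the
`sum` function produces before any modeling takes place. The `bucket_sums`
helper and the sample values are hypothetical:

[source,shell]
----------------------------------
# Sum a metric into fixed-width time buckets.
# Usage: bucket_sums <span_in_seconds>, reading "<epoch_seconds> <total>" lines.
bucket_sums() {
  awk -v span="$1" '
    {
      start = $1 - ($1 % span)   # start of the bucket this record falls into
      sum[start] += $2           # accumulate the metric for that bucket
    }
    END {
      for (start in sum) print start, sum[start]
    }
  ' | sort -n
}

# Made-up sample records: three in the first bucket, one in each of the next two.
printf '%s\n' '0 10' '300 20' '599 5' '600 7' '1250 3' | bucket_sums 600
# Prints:
# 0 35
# 600 7
# 1200 3
----------------------------------

The three records in the first ten minutes collapse into a single value, which
is why a longer bucket span can average away short-lived spikes.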
|
|
//To explore the results, click **View Results**.
|
|
//TBD: image::images/ml-gs-job1-results.jpg["The total-requests job is created"]
|
|
|
|
TIP: The `create_single_metic.sh` script creates a similar job and data feed by
|
|
using the {ml} APIs. For API reference information, see <<ml-apis>>.
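
The contents of that script are not reproduced here. As a rough sketch only, a
job with the same settings that were chosen in Kibana (the `sum` function on the
`total` field with a `600s` bucket span) might be created with the create job
API as follows. The description text and the helper name are assumptions, and a
data feed would still need to be created and started separately:

[source,shell]
----------------------------------
JOB_ID="total-requests"

# Job configuration that mirrors the values entered in the Kibana wizard.
JOB_BODY='{
  "description": "Sum of total requests (tutorial sketch)",
  "analysis_config": {
    "bucket_span": "600s",
    "detectors": [
      { "function": "sum", "field_name": "total" }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}'

# Sends the job configuration to the create job API.
create_total_requests_job() {
  curl -u elastic:elasticpassword -X PUT -H 'Content-Type: application/json' \
    "http://localhost:9200/_xpack/ml/anomaly_detectors/${JOB_ID}" -d "${JOB_BODY}"
}
----------------------------------

Because job names must be unique in the cluster, choose a different `JOB_ID` if
you already created `total-requests` in Kibana.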
|
|
|
|
[[ml-gs-job1-manage]]
|
|
=== Managing Jobs
|
|
|
|
After you create a job, you can see its status in the **Job Management** tab:
|
|
|
|
image::images/ml-gs-job1-manage.jpg["Status information for the total-requests job"]
|
|
|
|
The following information is provided for each job:
|
|
|
|
Job ID::
|
|
The unique identifier for the job.
|
|
|
|
Description::
|
|
The optional description of the job.
|
|
|
|
Processed records::
|
|
The number of records that have been processed by the job.
|
|
|
|
NOTE: Depending on how you send data to the job, the number of processed
|
|
records is not always equal to the number of input records. For more information,
|
|
see the `processed_record_count` description in <<ml-datacounts,Data Counts Objects>>.
|
|
|
|
Memory status::
|
|
The status of the mathematical models. When you create jobs by using the APIs or
|
|
by using the advanced options in Kibana, you can specify a `model_memory_limit`.
|
|
That value is the maximum amount of memory, in MiB, that the mathematical models
|
|
can use. Once that limit is approached, data pruning becomes more aggressive.
|
|
Upon exceeding that limit, new entities are not modeled.
|
|
The default value is `4096`. The memory status field reflects whether you have
|
|
reached or exceeded the model memory limit. It can have one of the following
|
|
values: +
|
|
`ok`::: The models stayed below the configured value.
|
|
`soft_limit`::: The models used more than 60% of the configured memory limit
|
|
and older unused models will be pruned to free up space.
|
|
`hard_limit`::: The models used more space than the configured memory limit.
|
|
As a result, not all incoming data was processed.
|
|
|
|
Job state::
|
|
The status of the job, which can be one of the following values: +
|
|
`open`::: The job is available to receive and process data.
|
|
`closed`::: The job finished successfully with its model state persisted.
|
|
The job must be opened before it can accept further data.
|
|
`closing`::: The job close action is in progress and has not yet completed.
|
|
A closing job cannot accept further data.
|
|
`failed`::: The job did not finish successfully due to an error.
|
|
This situation can occur due to invalid input data.
|
|
If the job has irrevocably failed, it must be force closed and then deleted.
|
|
If the data feed can be corrected, the job can be closed and then re-opened.
|
|
|
|
Datafeed state::
|
|
The status of the data feed, which can be one of the following values: +
|
|
`started`::: The data feed is actively receiving data.
`stopped`::: The data feed is stopped and will not receive data until it is re-started.
|
|
//TBD: How to restart data feeds in Kibana?
|
|
|
|
Latest timestamp::
|
|
The timestamp of the last processed record.
|
|
//TBD: Is that right?
|
|
|
|
If you click the arrow beside the name of the job, you can show or hide additional
|
|
information, such as the settings, configuration information, or messages for
|
|
the job.
|
|
|
|
You can also click one of the **Actions** buttons to start the data feed, edit
|
|
the job or data feed, and clone or delete the job, for example.
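
Much of the same status information is also available from the get job
statistics API. The following sketch assumes the host and credentials used
earlier in this tutorial; the `job_stats` helper name is hypothetical. In the
response, look for fields such as `state`, `data_counts.processed_record_count`,
and `model_size_stats.memory_status`:

[source,shell]
----------------------------------
# Usage: job_stats <job_id>
# Prints the usage statistics for a single job as formatted JSON.
job_stats() {
  curl -u elastic:elasticpassword \
    "http://localhost:9200/_xpack/ml/anomaly_detectors/$1/_stats?pretty"
}
----------------------------------

For the job in this tutorial, run `job_stats total-requests`.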
|
|
|
|
[float]
|
|
[[ml-gs-job1-datafeed]]
|
|
==== Managing Data Feeds
|
|
|
|
A data feed can be started and stopped multiple times throughout its lifecycle.
|
|
If you want to retrieve more data from Elasticsearch and the data feed is
|
|
stopped, you must restart it.
|
|
|
|
For example, if you only loaded three of the sample JSON files, you can now load
|
|
the fourth using the Elasticsearch `bulk` API as follows:
|
|
|
|
[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------
|
|
|
|
You can optionally verify that the data was loaded successfully with the
|
|
following command:
|
|
|
|
[source,shell]
----------------------------------
curl 'http://localhost:9200/_cat/indices?v' -u elastic:elasticpassword
----------------------------------
|
|
|
|
For the four sample JSON files, you should see output similar to the following:
|
|
|
|
[source,shell]
----------------------------------
health status index ... pri rep docs.count docs.deleted store.size ...
green open server-metrics ... 1 0 907200 0 136.2mb ...
----------------------------------
|
|
|
|
To use this new data in your job:
|
|
|
|
. In the **Machine Learning** / **Job Management** tab, click the following
button to start the data feed: image:images/ml-start-feed.jpg["Start data feed"].
|
|
|
|
. Choose a start time and end time. For example,
|
|
click **Continue from 2017-04-22** and **No end time**, then click **Start**.
|
|
image::images/ml-gs-job1-datafeed.jpg["Restarting a data feed"]
|
|
|
|
* TBD: Why do I not see increases in the job count stats after this occurs?
|
|
How can I determine that it has been successfully processed?
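
The same restart can be sketched with the start data feed API. The data feed ID
`datafeed-total-requests` is an assumption about the name that Kibana assigned;
check the **Job Management** tab for the actual ID. The start time mirrors the
**Continue from 2017-04-22** choice, and omitting an end time leaves the data
feed running:

[source,shell]
----------------------------------
# Assumed data feed ID; verify it in Kibana before running the command.
FEED_ID="datafeed-total-requests"
START_BODY='{
  "start": "2017-04-22T00:00:00Z"
}'

# Starts the data feed from the given time with no end time.
start_feed() {
  curl -u elastic:elasticpassword -X POST -H 'Content-Type: application/json' \
    "http://localhost:9200/_xpack/ml/datafeeds/${FEED_ID}/_start" -d "${START_BODY}"
}
----------------------------------

The job must be open before its data feed can be started.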
|
|
|
|
|
|
[[ml-gs-jobresults]]
|
|
=== Exploring Job Results
|
|
|
|
The {xpack} {ml} features analyze the input stream of data, model its behavior,
|
|
and perform analysis based on the detectors you defined in your job. When an
|
|
event occurs outside of the model, that event is identified as an anomaly.
|
|
|
|
Result records for each anomaly are stored in `.ml-notifications` and
|
|
`.ml-anomalies*` indices in Elasticsearch. By default, the name of the
|
|
index where {ml} results are stored is `shared`, which corresponds to
|
|
the `.ml-anomalies-shared` index.
|
|
//For example, these results include the probability of detecting that anomaly.
|
|
|
|
You can use the **Anomaly Explorer** or the
|
|
**Single Metric Viewer** in Kibana to view the analysis results.
|
|
|
|
Anomaly Explorer::
|
|
This view contains heatmap charts, where the color for each section of the
|
|
timeline is determined by the maximum anomaly score in that period.
|
|
//TBD: Do the time periods in the heat map correspond to buckets?
|
|
|
|
Single Metric Viewer::
|
|
This view contains a time series chart that represents the analysis.
|
|
As in the **Anomaly Explorer**, anomalous data points are shown in
|
|
different colors depending on their probability.
|
|
|
|
[float]
|
|
[[ml-gs-job1-analyze]]
|
|
==== Exploring Single Metric Job Results
|
|
|
|
By default, when you view the results for a single metric job,
|
|
the **Single Metric Viewer** opens:
|
|
image::images/ml-gs-job1-analysis.jpg["Single Metric Viewer for total-requests job"]
|
|
|
|
The blue line in the chart represents the actual data values. The shaded blue area
|
|
represents the expected behavior that was calculated by the model.
|
|
//TBD: What is meant by "95% prediction bounds"?
|
|
|
|
If you slide the time selector from the beginning of the data to the end of the
|
|
data, you can see how the model improves as it processes more data. At the
|
|
beginning, the expected range of values is pretty broad and the model is not
|
|
capturing the periodicity in the data. But it quickly learns and begins to
|
|
reflect the daily variation.
|
|
|
|
Any data points outside the range that was predicted by the model are marked
|
|
as anomalies. When you have high volumes of real-life data, many anomalies
|
|
might be found. These vary in probability from very likely to highly unlikely,
|
|
that is to say, from not particularly anomalous to highly anomalous. A single
bucket can contain no anomalies, a handful, or sometimes hundreds, and a job can
produce many thousands overall. To provide
|
|
a sensible view of the results, an _anomaly score_ is calculated for each bucket
|
|
time interval. The anomaly score is a value from 0 to 100, which indicates
|
|
the significance of the observed anomaly compared to previously seen anomalies.
|
|
The highly anomalous values are shown in red and the low scored values are
|
|
indicated in blue. An interval with a high anomaly score is significant and
|
|
requires investigation.
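
Bucket-level anomaly scores can also be retrieved outside of Kibana. As a
sketch, the get buckets API accepts an `anomaly_score` threshold, so the
following request lists only the buckets that scored above it, highest first.
The threshold of `75` is an arbitrary value chosen for illustration, and the
`high_score_buckets` helper name is hypothetical:

[source,shell]
----------------------------------
# Only return buckets whose anomaly score exceeds the threshold,
# sorted from the highest score to the lowest.
BUCKET_QUERY='{
  "anomaly_score": 75,
  "sort": "anomaly_score",
  "desc": true
}'

# Usage: high_score_buckets <job_id>
high_score_buckets() {
  curl -u elastic:elasticpassword -X POST -H 'Content-Type: application/json' \
    "http://localhost:9200/_xpack/ml/anomaly_detectors/$1/results/buckets?pretty" \
    -d "${BUCKET_QUERY}"
}
----------------------------------

For the job in this tutorial, run `high_score_buckets total-requests`.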
|
|
|
|
Slide the time selector to a section of the time series that contains a red data
|
|
point. If you hover over the point, you can see more information about that
|
|
data point. You can also see details in the **Anomalies** section of the viewer.
|
|
For example:
|
|
|
|
image::images/ml-gs-job1-anomalies.jpg["Single Metric Viewer Anomalies for total-requests job"]
|
|
|
|
For each anomaly you can see key details such as the time, the actual and
|
|
expected ("typical") values, and their probability.
|
|
|
|
You can see the same information in a different format by using the **Anomaly Explorer**:
|
|
|
|
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]
|
|
|
|
Click one of the red areas in the heatmap to see details about that anomaly. For
|
|
example:
|
|
|
|
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]
|
|
|
|
After you have identified anomalies, often the next step is to try to determine
|
|
the context of those situations. For example, are there other factors that are
|
|
contributing to the problem? Are the anomalies confined to particular
|
|
applications or servers? You can begin to troubleshoot these situations by
|
|
layering additional jobs or creating multi-metric jobs.
|
|
|
|
////
|
|
[float]
|
|
[[ml-gs-job2-create]]
|
|
==== Creating a Multi-Metric Job
|
|
|
|
TBD.
|
|
|
|
* Walk through creation of a simple multi-metric job.
|
|
* Provide overview of:
|
|
** partition fields,
|
|
** influencers
|
|
*** An influencer is someone or something that has influenced or contributed to the anomaly.
|
|
Results are aggregated for each influencer, for each bucket, across all detectors.
|
|
In this way, a combined anomaly score is calculated for each influencer,
|
|
which determines its relative anomalousness. You can specify one or many influencers.
|
|
Picking an influencer is strongly recommended for the following reasons:
|
|
**** It allows you to blame someone/something for the anomaly
|
|
**** It simplifies and aggregates results
|
|
*** The best influencer is the person or thing that you want to blame for the anomaly.
|
|
In many cases, users or client IP make excellent influencers.
|
|
*** By/over/partition fields are usually good candidates for influencers.
|
|
*** Influencers can be any field in the source data; they do not need to be fields
|
|
specified in detectors, although they often are.
|
|
** by/over fields,
|
|
*** detectors
|
|
**** You can have more than one detector in a job which is more efficient than
|
|
running multiple jobs against the same data stream.
|
|
|
|
//http://www.prelert.com/docs/behavioral_analytics/latest/concepts/multivariate.html
|
|
|
|
[float]
|
|
[[ml-gs-job2-analyze]]
|
|
===== Viewing Multi-Metric Job Results
|
|
|
|
TBD.
|
|
|
|
* Walk through exploration of job results.
|
|
* Describe how influencer detection accelerates root cause identification.
|
|
|
|
////
|
|
////
|
|
* Provide brief overview of statistical models and/or link to more info.
|
|
* Possibly discuss effect of altering bucket span.
|
|
|
|
The anomaly score is a sophisticated aggregation of the anomaly records in the
|
|
bucket. The calculation is optimized for high throughput, gracefully ages
|
|
historical data, and reduces the signal to noise levels. It adjusts for
|
|
variations in event rate, takes into account the frequency and the level of
|
|
anomalous activity and is adjusted relative to past anomalous behavior.
|
|
In addition, [the anomaly score] is boosted if anomalous activity occurs for related entities,
|
|
for example if disk IO and CPU are both behaving unusually for a given host.
|
|
** Once an anomalous time interval has been identified, it can be expanded to
|
|
view the detailed anomaly records which are the significant causal factors.
|
|
////
|
|
////
|
|
[[ml-gs-alerts]]
|
|
=== Creating Alerts for Job Results
|
|
|
|
TBD.
|
|
|
|
* Walk through creation of simple alert for anomalous data?
|
|
|
|
////
|