[[ml-getting-started]]
== Getting Started
TBD.
////
{xpack} {ml} features automatically detect:
* Anomalies in single or multiple time series
* Outliers in a population (also known as _entity profiling_)
* Rare events (also known as _log categorization_)
This tutorial focuses on an anomaly detection scenario in a single time series.
////
In this tutorial, you will explore the {xpack} {ml} features by using sample
data. You will create two simple jobs and use the results to identify possible
anomalies in the data. You can also optionally create an alert. At the end of
this tutorial, you should have a good idea of what {ml} is and will hopefully
be inspired to use it to detect anomalies in your own data.
[float]
[[ml-gs-sysoverview]]
=== System Overview
TBD.
To follow the steps in this tutorial, you will need the following
components of the Elastic Stack:
* Elasticsearch {version}, which stores the data and the analysis results
* {xpack} {version}, which provides the {ml} features
* Kibana {version}, which provides a helpful user interface for creating
and viewing jobs
See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
information about supported operating systems and product compatibility.
See {stack-ref}/installing-elastic-stack.html[Installing the Elastic Stack] for
information about installing each of the components.
NOTE: To get started, you can install Elasticsearch and Kibana on a
single VM or even on your laptop. As you add more data and your traffic grows,
you'll want to replace the single Elasticsearch instance with a cluster.
When you install {xpack} into Elasticsearch and Kibana, the {ml} features are
enabled by default. If you have multiple nodes in your cluster, you can
optionally dedicate nodes to specific purposes. If you want to control which
nodes are _machine learning nodes_ or limit which nodes run resource-intensive
activity related to jobs, see <<ml-settings>>.
NOTE: This tutorial uses Kibana to create jobs and view results, but you can
alternatively use APIs to accomplish most tasks.
For API reference information, see <<ml-apis>>.
[[ml-gs-data]]
=== Identifying Data for Analysis
TBD.
For the purposes of this tutorial, we provide sample data that you can play with.
When you consider your own data, however, it's important to take a moment
and consider where the {xpack} {ml} features will be most impactful.
The first consideration is that it must be time series data.
Generally, it's best to use data that is in chronological order. When the data
feed delivers records in ascending time order, the statistical models and
calculations are very efficient and run in real time.
//TBD: Talk about handling out of sequence data?
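As a minimal illustration of the chronological-order recommendation, the
following Python sketch checks whether a batch of records is in ascending time
order and sorts it if it is not. The record layout and the `timestamp` and
`responsetime` field names are assumptions made for this example; they are not
taken from the sample data.

```python
from datetime import datetime

def is_ascending(records, time_field="timestamp"):
    """Return True if each record's time is >= the previous record's time."""
    times = [datetime.fromisoformat(r[time_field]) for r in records]
    return all(earlier <= later for earlier, later in zip(times, times[1:]))

def sort_by_time(records, time_field="timestamp"):
    """Return a new list of records sorted into ascending time order."""
    return sorted(records, key=lambda r: datetime.fromisoformat(r[time_field]))

# Hypothetical records; the second and third are out of sequence.
records = [
    {"timestamp": "2017-03-01T00:00:00", "responsetime": 120.0},
    {"timestamp": "2017-03-01T00:20:00", "responsetime": 135.5},
    {"timestamp": "2017-03-01T00:10:00", "responsetime": 128.2},
]

ordered = sort_by_time(records)
```

Sorting a batch before it is sent is only practical for historical data. It is
shown here to make the ordering requirement concrete, not as a recommended
ingestion strategy.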
The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
information that contains key performance indicators (KPIs) for the health or
success of your business or system. It is information for which you want alarms
to ring when anomalous behavior occurs. You might even have Kibana dashboards
that you're already using to watch this data. The better you know the data,
the quicker you will be able to create jobs that generate useful insights from
{ml}.
//TBD: Talk about layering additional jobs?
////
You can then create additional jobs to troubleshoot the situation and put it
into context of what was going on in the system at the time.
The troubleshooting job would not create alarms of its own, but rather would
help explain the overall situation. It's usually a different job because it's
operating on different indices. Layering jobs is an important concept.
////
////
* Working with out of sequence data:
** In the typical case where data arrives in ascending time order,
each new record pushes the time forward. When a record is received that belongs
to a new bucket, the current bucket is considered to be completed.
At this point, the model is updated and final results are calculated for the
completed bucket and the new bucket is created.
** Expecting data to be in time sequence means that modeling and results
calculations can be performed very efficiently and in real-time.
As a direct consequence of this approach, out-of-sequence records are ignored.
** When data is expected to arrive out-of-sequence, a latency window can be
specified in the job configuration (does not apply to data feeds?). (If we're
using a data feed in the sample, perhaps this discussion can be deferred for
future more-advanced scenario.)
//See http://www.prelert.com/docs/behavioral_analytics/latest/concepts/outofsequence.html
////
The final consideration is where the data is located. If the data that you want
to analyze is stored in Elasticsearch, you can define a _data feed_ that
provides data to the job in real time. By having both the input data and the
analytical results in Elasticsearch, you get performance benefits? (TBD)
The alternative to data feeds is to upload batches of data to the job by
using the <<ml-post-data,post data API>>.
//TBD: The data must be provided in JSON format?
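The batch upload alternative can be sketched as follows. This Python example
only builds the HTTP request and does not send it. The endpoint path, the
`http://localhost:9200` address, the `example-job` job ID, and the record
fields are all assumptions made for illustration; check the post data API
reference for the exact URL and accepted formats before relying on them.

```python
import json
import urllib.request

def build_post_data_request(base_url, job_id, records):
    """Build (but do not send) a POST request that uploads records to a job."""
    # Serialize the batch as one JSON document per line.
    body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    # ASSUMPTION: endpoint path for the post data API; verify in the API reference.
    url = f"{base_url}/_xpack/ml/anomaly_detectors/{job_id}/_data"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# Hypothetical records in ascending time order.
records = [
    {"timestamp": "2017-03-01T00:00:00", "responsetime": 120.0},
    {"timestamp": "2017-03-01T00:10:00", "responsetime": 128.2},
]

request = build_post_data_request("http://localhost:9200", "example-job", records)
```

To actually send the request, you would pass it to `urllib.request.urlopen`
and supply whatever authentication your cluster requires.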
[float]
[[ml-gs-sampledata]]
==== Obtaining a sample dataset
TBD.
* Provide instructions for downloading the sample data from https://github.com/elastic/examples
* Provide overview/context of the sample data
[[ml-gs-jobs]]
=== Working with Jobs
TBD.
Machine learning jobs contain the configuration information and metadata
necessary to perform an analytical task. They also contain the results of the
analytical task. Each job ID must be unique in your cluster.
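To give a rough idea of what the configuration information looks like, the
following Python sketch assembles a job configuration as a dictionary and
prints it as JSON. The field values (`responsetime`, `timestamp`, the one-hour
bucket span) are hypothetical, and the set of properties shown is a small
subset chosen for illustration; see the API reference for the authoritative
list.

```python
import json

# Hypothetical single-metric job: model the mean of a "responsetime" field
# in one-hour buckets. Field names and values are illustrative only.
job_config = {
    "description": "Example single-metric job",
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [
            {"function": "mean", "field_name": "responsetime"}
        ],
    },
    "data_description": {"time_field": "timestamp"},
}

# Serialize the configuration the way it would appear in an API request body.
print(json.dumps(job_config, indent=2))
```

Kibana builds an equivalent configuration for you when you create a job in the
user interface.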
To work with jobs in Kibana:
. Open Kibana in your web browser and log in. If you are running Kibana
locally, go to `http://localhost:5601/`. To use the {ml} features,
you must log in as a user who has the `kibana_user`
and `monitor_ml` roles (TBD).
. Click **Machine Learning** in the side navigation.
//image::images/ml.jpg["Job Management"]
You can choose to create single-metric, multi-metric, or advanced jobs in Kibana.
[float]
[[ml-gs-job1-create]]
==== Creating a Single Metric Job
TBD.
* Walk through creation of a simple single metric job.
* Provide overview of:
** aggregations
** detectors (define which fields to analyze)
*** The detectors define what type of analysis needs to be done
(for example, max, average, or rare) and on which fields (for example, IP address, host name, or number of bytes).
** bucket spans (define time intervals to analyze across)
*** Take into account the granularity at which you want to analyze,
the frequency of the input data, and the frequency at which alerting is required.
*** When we analyze data, we use the concept of a bucket to divide up a continuous
stream of data into batches for processing. For example, if you were monitoring the
average response time of a system and received a data point every 10 minutes,
using a bucket span of 1 hour means that at the end of each hour we would calculate
the average (mean) value of the data for the last hour and compute the
anomalousness of that average value compared to previous hours.
*** The bucket span has two purposes: it dictates over what time span to look for
anomalous features in data, and also determines how quickly anomalies can be detected.
Choosing a shorter bucket span allows anomalies to be detected more quickly but
at the risk of being too sensitive to natural variations or noise in the input data.
Choosing too long a bucket span, however, can mean that interesting anomalies are averaged away.
** analysis functions
*** Some of the analytical functions look for single anomalous data points, e.g. max,
which identifies the maximum value seen within a bucket.
Others perform some aggregation over the length of the bucket, e.g. mean,
which calculates the mean of all the data points seen within the bucket,
or count, which calculates the total number of data points within the bucket.
There is the possibility that the aggregation might smooth out some anomalies
based on when the bucket starts in time.
**** To avoid this, you can use overlapping buckets (how/where?).
We analyze the data points in two buckets simultaneously, one starting half a bucket
span later than the other. Overlapping buckets are only beneficial for
aggregating functions, and should not be used for non-aggregating functions.
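The bucket span and overlapping bucket ideas above can be sketched in a few
lines of Python. This is a simplified illustration, not the product's
implementation: the `bucketize` and `summarize` helpers and the sample values
are hypothetical. The data has one point every 10 minutes and a short spike
that straddles the boundary between two one-hour buckets.

```python
from collections import defaultdict
from statistics import mean

def bucketize(points, bucket_span, offset=0):
    """Group (epoch_seconds, value) points into buckets of bucket_span seconds.

    offset shifts the bucket boundaries; an offset of half a bucket span
    gives the second set of buckets used for overlapping analysis.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        start = (timestamp - offset) // bucket_span * bucket_span + offset
        buckets[start].append(value)
    return dict(sorted(buckets.items()))

def summarize(buckets):
    """Compute the mean, max, and count functions for each bucket."""
    return {
        start: {"mean": mean(values), "max": max(values), "count": len(values)}
        for start, values in buckets.items()
    }

# Hypothetical response times: one data point every 10 minutes for two hours.
# A brief spike (400) straddles the boundary between the first and second hour.
values = [100, 100, 100, 100, 100, 400, 400, 100, 100, 100, 100, 100]
points = [(i * 600, v) for i, v in enumerate(values)]

hourly = summarize(bucketize(points, bucket_span=3600))
shifted = summarize(bucketize(points, bucket_span=3600, offset=1800))
```

In this sample, each aligned hourly bucket has a mean of 150 because the spike
is split between them, while the bucket that starts half a span later contains
the whole spike and has a mean of 200. The `max` value is 400 either way, which
is why overlapping buckets only help aggregating functions.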
[float]
[[ml-gs-job1-analyze]]
===== Viewing Single Metric Job Results
TBD.
* Walk through exploration of job results.
** Based on this job configuration, we analyze the input stream of data.
We model the behavior of the data and perform analysis based on the defined detectors
and the specified time interval. When an event falls outside our model,
we identify it as an anomaly. For each anomaly detected, we store the
result records of our analysis, which include the probability of
detecting that anomaly.
** With high volumes of real-life data, many anomalies may be found.
These vary in probability from very likely to highly unlikely, that is, from not
particularly anomalous to highly anomalous. A single bucket can contain no
anomalies, one or two, tens, or sometimes hundreds, and a job can produce
many thousands.
To provide a sensible view of the results, we calculate an anomaly score
for each time interval. An interval with a high anomaly score is significant
and requires investigation.
** The anomaly score is a sophisticated aggregation of the anomaly records.
The calculation is optimized for high throughput, gracefully ages historical data,
and improves the signal-to-noise ratio.
It adjusts for variations in event rate, takes into account the frequency
and the level of anomalous activity, and is adjusted relative to past anomalous behavior.
In addition, it is boosted if anomalous activity occurs for related entities,
for example, if disk I/O and CPU are both behaving unusually for a given host.
** Once an anomalous time interval has been identified, it can be expanded to
view the detailed anomaly records, which are the significant causal factors.
* Provide brief overview of statistical models and/or link to more info.
* Possibly discuss effect of altering bucket span.
* Provide general overview of management of jobs (when/why to start or
stop them).
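To make the relationship between record probabilities and a per-interval score
more concrete, here is a deliberately simplified Python sketch. It is a toy
and does not reproduce the anomaly score calculation described above: the
`-10 * log10(probability)` mapping, the cap of 100, and the use of the maximum
per bucket are all assumptions chosen only to show that less likely records
lead to higher scores.

```python
import math

def record_score(probability, cap=100.0):
    """Map a probability (> 0) to a score: lower probability, higher score."""
    # ASSUMPTION: toy mapping; every factor-of-10 drop in probability adds 10.
    return min(cap, -10.0 * math.log10(probability))

def bucket_scores(records):
    """Return the highest record score seen in each bucket."""
    scores = {}
    for bucket_start, probability in records:
        score = record_score(probability)
        scores[bucket_start] = max(scores.get(bucket_start, 0.0), score)
    return scores

# Hypothetical (bucket start in epoch seconds, record probability) pairs.
records = [
    (0, 0.2),
    (0, 0.05),
    (3600, 1e-6),
    (3600, 0.01),
    (7200, 1e-12),
]

scores = bucket_scores(records)
```

With this sample data, the first bucket gets a low score, the second a
moderate one, and the third reaches the cap, so the third interval is the one
you would expand first to view its anomaly records.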
[float]
[[ml-gs-job2-create]]
==== Creating a Multi-Metric Job
TBD.
* Walk through creation of a simple multi-metric job.
* Provide overview of:
** partition fields,
** influencers
*** An influencer is someone or something that has influenced or contributed to the anomaly.
Results are aggregated for each influencer, for each bucket, across all detectors.
In this way, a combined anomaly score is calculated for each influencer,
which determines its relative anomalousness. You can specify one or many influencers.
Picking an influencer is strongly recommended for the following reasons:
**** It allows you to blame someone or something for the anomaly
**** It simplifies and aggregates results
*** The best influencer is the person or thing that you want to blame for the anomaly.
In many cases, users or client IP make excellent influencers.
*** By/over/partition fields are usually good candidates for influencers.
*** Influencers can be any field in the source data; they do not need to be fields
specified in detectors, although they often are.
** by/over fields,
*** detectors
**** You can have more than one detector in a job, which is more efficient than
running multiple jobs against the same data stream.
//http://www.prelert.com/docs/behavioral_analytics/latest/concepts/multivariate.html
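The idea of aggregating results for each influencer, for each bucket, across
all detectors can be sketched as follows. This Python example is a toy under
stated assumptions: the record layout, the `client_ip` influencer field, the
scores, and the choice to combine scores by summing them are hypothetical and
do not describe how the combined anomaly score is actually calculated.

```python
from collections import defaultdict

def influencer_scores(records, influencer_field):
    """Sum record scores per (bucket, influencer value) across all detectors."""
    totals = defaultdict(float)
    for record in records:
        key = (record["bucket"], record[influencer_field])
        totals[key] += record["score"]
    return dict(totals)

def top_influencer(totals, bucket):
    """Return the influencer value with the highest combined score in a bucket."""
    in_bucket = {value: score for (b, value), score in totals.items() if b == bucket}
    return max(in_bucket, key=in_bucket.get)

# Hypothetical anomaly records produced by two detectors in the same job.
records = [
    {"bucket": 0, "detector": "mean(responsetime)", "client_ip": "10.0.0.1", "score": 5.0},
    {"bucket": 0, "detector": "count", "client_ip": "10.0.0.1", "score": 40.0},
    {"bucket": 0, "detector": "mean(responsetime)", "client_ip": "10.0.0.2", "score": 30.0},
    {"bucket": 3600, "detector": "count", "client_ip": "10.0.0.2", "score": 12.0},
]

totals = influencer_scores(records, "client_ip")
```

In this sample, no single record for `10.0.0.1` in the first bucket is
dramatically higher than the others, but its combined score across both
detectors is the highest, which is the kind of result that makes an influencer
useful for assigning blame.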
[float]
[[ml-gs-job2-analyze]]
===== Viewing Multi-Metric Job Results
TBD.
* Walk through exploration of job results.
* Describe how influencer detection accelerates root cause identification.
[[ml-gs-alerts]]
=== Creating Alerts for Job Results
TBD.
* Walk through creation of simple alert for anomalous data?
////
To start exploring anomalies in your data:
. Open Kibana in your web browser and log in. If you are running Kibana
locally, go to `http://localhost:5601/`.
. Click **ML** in the side navigation ...
////
//image::images/graph-open.jpg["Accessing Graph"]