diff --git a/docs/en/ml/getting-started.asciidoc b/docs/en/ml/getting-started.asciidoc
index b3b7404f8e7..41a325f8e28 100644
--- a/docs/en/ml/getting-started.asciidoc
+++ b/docs/en/ml/getting-started.asciidoc
@@ -1,11 +1,270 @@
 [[ml-getting-started]]
 == Getting Started
+TBD.
+////
+{xpack} {ml} features automatically detect:
+* Anomalies in single or multiple time series
+* Outliers in a population (also known as _entity profiling_)
+* Rare events (also known as _log categorization_)
+
+This tutorial focuses on an anomaly detection scenario in a single time series.
+////
+
+In this tutorial, you will explore the {xpack} {ml} features by using sample
+data. You will create two simple jobs and use the results to identify possible
+anomalies in the data. You can also optionally create an alert. At the end of
+this tutorial, you should have a good idea of what {ml} is and will hopefully
+be inspired to use it to detect anomalies in your own data.
+
+[float]
+[[ml-gs-sysoverview]]
+=== System Overview
+
+TBD.
+
+To follow the steps in this tutorial, you will need the following
+components of the Elastic Stack:
+
+* Elasticsearch {version}, which stores the data and the analysis results
+* {xpack} {version}, which provides the {ml} features
+* Kibana {version}, which provides a helpful user interface for creating
+and viewing jobs
+
+See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
+information about supported operating systems and product compatibility.
+
+See {stack-ref}/installing-elastic-stack.html[Installing the Elastic Stack]
+for information about installing each of the components.
+
+NOTE: To get started, you can install Elasticsearch and Kibana on a
+single VM or even on your laptop. As you add more data and your traffic grows,
+you'll want to replace the single Elasticsearch instance with a cluster.
+
+When you install {xpack} into Elasticsearch and Kibana, the {ml} features are
+enabled by default. If you have multiple nodes in your cluster, you can
+optionally dedicate nodes to specific purposes. If you want to control which
+nodes are _machine learning nodes_ or limit which nodes run resource-intensive
+activity related to jobs, see <>.
+
+NOTE: This tutorial uses Kibana to create jobs and view results, but you can
+alternatively use APIs to accomplish most tasks.
+For API reference information, see <>.
+
+[[ml-gs-data]]
+=== Identifying Data for Analysis
+
+TBD.
+
+For the purposes of this tutorial, we provide sample data that you can play
+with. When you consider your own data, however, it's important to take a
+moment and consider where the {xpack} {ml} features will be most impactful.
+
+The first consideration is that the data must be a time series.
+Generally, it's best to use data that is in chronological order. When the data
+arrives in ascending time order, the statistical models and calculations are
+very efficient and occur in real time.
+//TBD: Talk about handling out-of-sequence data?
+
+The second consideration, especially when you are first learning to use {ml},
+is the importance of the data and how familiar you are with it. Ideally, it is
+information that contains key performance indicators (KPIs) for the health or
+success of your business or system. It is information for which you want
+alarms to ring when anomalous behavior occurs. You might even have Kibana
+dashboards that you're already using to watch this data. The better you know
+the data, the quicker you will be able to create jobs that generate useful
+insights from {ml}.
+
+//TBD: Talk about layering additional jobs?
+////
+You can then create additional jobs to troubleshoot the situation and put it
+into the context of what was going on in the system at the time.
+The troubleshooting job would not create alarms of its own, but rather would
+help explain the overall situation. It's usually a different job because it's
+operating on different indices. Layering jobs is an important concept.
+////
+////
+* Working with out-of-sequence data:
+** In the typical case where data arrives in ascending time order,
+each new record pushes the time forward. When a record is received that
+belongs to a new bucket, the current bucket is considered to be completed.
+At this point, the model is updated, final results are calculated for the
+completed bucket, and the new bucket is created.
+** Expecting data to be in time sequence means that modeling and results
+calculations can be performed very efficiently and in real time.
+As a direct consequence of this approach, out-of-sequence records are ignored.
+** When data is expected to arrive out of sequence, a latency window can be
+specified in the job configuration (does not apply to data feeds?). (If we're
+using a data feed in the sample, perhaps this discussion can be deferred for a
+future, more advanced scenario.)
+//See http://www.prelert.com/docs/behavioral_analytics/latest/concepts/outofsequence.html
+////
+
+The final consideration is where the data is located. If the data that you
+want to analyze is stored in Elasticsearch, you can define a _data feed_ that
+provides data to the job in real time. By having both the input data and the
+analytical results in Elasticsearch, you get performance benefits? (TBD)
+The alternative to data feeds is to upload batches of data to the job by
+using the <>.
+//TBD: The data must be provided in JSON format?
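+
+For illustration only, here is a minimal sketch of what defining a data feed
+through the REST API might look like. The data feed ID, job ID, and index name
+are hypothetical placeholders, and some parameter names (for example, whether
+the index list is called `indexes` or `indices`) differ between versions, so
+check the API reference for your release:
+
+[source,js]
+--------------------------------------------------
+PUT _xpack/ml/datafeeds/datafeed-example
+{
+  "job_id": "example-kpi-job",
+  "indexes": ["server-metrics"],
+  "query": {
+    "match_all": {}
+  }
+}
+--------------------------------------------------
+
+The data feed then pulls matching documents from the index and sends them to
+the job in chronological order.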
+
+[float]
+[[ml-gs-sampledata]]
+==== Obtaining a Sample Dataset
+
+TBD.
+
+* Provide instructions for downloading the sample data from
+https://github.com/elastic/examples
+* Provide overview/context of the sample data
+
+[[ml-gs-jobs]]
+=== Working with Jobs
+
+TBD.
+
+Machine learning jobs contain the configuration information and metadata
+necessary to perform an analytical task. They also contain the results of the
+analytical task. Each job ID must be unique in your cluster.
+
+To work with jobs in Kibana:
+
+. Open Kibana in your web browser and log in. If you are running Kibana
+locally, go to `http://localhost:5601/`. To use the {ml} features,
+you must log in as a user who has the `kibana_user`
+and `monitor_ml` roles (TBD).
+
+. Click **Machine Learning** in the side navigation.
+
+//image::ml.jpg["Job Management"]
+
+You can choose to create single-metric, multi-metric, or advanced jobs in
+Kibana.
+
+[float]
+[[ml-gs-job1-create]]
+==== Creating a Single Metric Job
+
+TBD.
+
+* Walk through creation of a simple single metric job (see the API sketch
+after this list).
+* Provide overview of:
+** aggregations
+** detectors (define which fields to analyze)
+*** The detectors define what type of analysis needs to be done
+(e.g. max, average, rare) and upon which fields (e.g. IP address, host name,
+number of bytes).
+** bucket spans (define the time intervals to analyze across)
+*** Take into account the granularity at which you want to analyze,
+the frequency of the input data, and the frequency at which alerting is
+required.
+*** When we analyze data, we use the concept of a bucket to divide up a
+continuous stream of data into batches for processing. For example, if you
+were monitoring the average response time of a system and received a data
+point every 10 minutes, using a bucket span of 1 hour means that at the end
+of each hour we would calculate the average (mean) value of the data for the
+last hour and compute the anomalousness of that average value compared to
+previous hours.
+*** The bucket span has two purposes: it dictates over what time span to look
+for anomalous features in data, and it determines how quickly anomalies can
+be detected. Choosing a shorter bucket span allows anomalies to be detected
+more quickly, but at the risk of being too sensitive to natural variations or
+noise in the input data. Choosing too long a bucket span, however, can mean
+that interesting anomalies are averaged away.
+** analysis functions
+*** Some of the analytical functions look for single anomalous data points,
+e.g. max, which identifies the maximum value seen within a bucket.
+Others perform some aggregation over the length of the bucket, e.g. mean,
+which calculates the mean of all the data points seen within the bucket,
+or count, which calculates the total number of data points within the bucket.
+There is the possibility that the aggregation might smooth out some anomalies,
+depending on where the bucket boundaries fall in time.
+**** To avoid this, you can use overlapping buckets (how/where?).
+We analyze the data points in two buckets simultaneously, one starting half a
+bucket span later than the other. Overlapping buckets are only beneficial for
+aggregating functions, and should not be used for non-aggregating functions.
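+
+To make these concepts concrete, here is a minimal sketch of what a single
+metric job definition might look like through the REST API. The job ID, field
+names, and time format are hypothetical placeholders, and details such as the
+`bucket_span` format vary between versions:
+
+[source,js]
+--------------------------------------------------
+PUT _xpack/ml/anomaly_detectors/example-single-metric-job
+{
+  "description": "Max response time with 1h buckets",
+  "analysis_config": {
+    "bucket_span": "1h",
+    "detectors": [
+      {
+        "function": "max",
+        "field_name": "responsetime"
+      }
+    ]
+  },
+  "data_description": {
+    "time_field": "timestamp",
+    "time_format": "epoch_ms"
+  }
+}
+--------------------------------------------------
+
+The single detector uses the `max` function on the `responsetime` field, and
+the one-hour bucket span reflects the trade-off between detection speed and
+sensitivity to noise described above.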
+
+[float]
+[[ml-gs-job1-analyze]]
+===== Viewing Single Metric Job Results
+
+TBD.
+
+* Walk through exploration of job results.
+** Based on this job configuration, we analyze the input stream of data.
+We model the behavior of the data and perform analysis based upon the defined
+detectors for each time interval. When we see an event occurring outside of
+our model, we identify it as an anomaly. For each anomaly detected, we store
+the result records of our analysis, which include the probability of
+detecting that anomaly.
+** With high volumes of real-life data, many anomalies may be found.
+These vary in probability from very likely to highly unlikely, that is, from
+not particularly anomalous to highly anomalous. There can be none, one or two,
+tens, or sometimes hundreds of anomalies found within each bucket.
+There can be many thousands found per job.
+In order to provide a sensible view of the results, we calculate an anomaly
+score for each time interval. An interval with a high anomaly score is
+significant and requires investigation.
+** The anomaly score is a sophisticated aggregation of the anomaly records.
+The calculation is optimized for high throughput, gracefully ages historical
+data, and reduces noise levels.
+It adjusts for variations in event rate, takes into account the frequency and
+the level of anomalous activity, and is adjusted relative to past anomalous
+behavior. In addition, it is boosted if anomalous activity occurs for related
+entities, for example if disk IO and CPU are both behaving unusually for a
+given host.
+** Once an anomalous time interval has been identified, it can be expanded to
+view the detailed anomaly records, which are the significant causal factors.
+* Provide brief overview of statistical models and/or link to more info.
+* Possibly discuss effect of altering bucket span.
+* Provide general overview of management of jobs (when/why to start or
+stop them).
+
+[float]
+[[ml-gs-job2-create]]
+==== Creating a Multi-Metric Job
+
+TBD.
+
+* Walk through creation of a simple multi-metric job (see the API sketch
+after this list).
+* Provide overview of:
+** partition fields
+** influencers
+*** An influencer is someone or something that has influenced or contributed
+to the anomaly. Results are aggregated for each influencer, for each bucket,
+across all detectors. In this way, a combined anomaly score is calculated for
+each influencer, which determines its relative anomalousness. You can specify
+one or many influencers. Picking an influencer is strongly recommended for
+the following reasons:
+**** It allows you to blame someone or something for the anomaly
+**** It simplifies and aggregates results
+*** The best influencer is the person or thing that you want to blame for the
+anomaly. In many cases, users or client IP addresses make excellent
+influencers.
+*** By, over, and partition fields are usually good candidates for
+influencers.
+*** Influencers can be any field in the source data; they do not need to be
+fields specified in detectors, although they often are.
+** by/over fields
+*** detectors
+**** You can have more than one detector in a job, which is more efficient
+than running multiple jobs against the same data stream.
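+
+As with the single metric job, a minimal sketch of what a multi-metric job
+definition might look like through the REST API may help. The job ID, field
+names, and influencer choices are hypothetical placeholders rather than a
+recommended configuration:
+
+[source,js]
+--------------------------------------------------
+PUT _xpack/ml/anomaly_detectors/example-multi-metric-job
+{
+  "description": "Response times and event counts, partitioned by host",
+  "analysis_config": {
+    "bucket_span": "1h",
+    "detectors": [
+      {
+        "function": "mean",
+        "field_name": "responsetime",
+        "partition_field_name": "host"
+      },
+      {
+        "function": "count",
+        "by_field_name": "status"
+      }
+    ],
+    "influencers": ["host", "clientip"]
+  },
+  "data_description": {
+    "time_field": "timestamp",
+    "time_format": "epoch_ms"
+  }
+}
+--------------------------------------------------
+
+Here the two detectors share one data stream, and `host` doubles as a
+partition field and an influencer, which matches the guidance that by, over,
+and partition fields are usually good influencer candidates.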
+
+//http://www.prelert.com/docs/behavioral_analytics/latest/concepts/multivariate.html
+
+[float]
+[[ml-gs-job2-analyze]]
+===== Viewing Multi-Metric Job Results
+
+TBD.
+
+* Walk through exploration of job results.
+* Describe how influencer detection accelerates root cause identification.
+
+[[ml-gs-alerts]]
+=== Creating Alerts for Job Results
+
+TBD.
+
+* Walk through creation of a simple alert for anomalous data?
+
+////
 To start exploring anomalies in your data:
 
 . Open Kibana in your web browser and log in. If you are running Kibana
 locally, go to `http://localhost:5601/`.
 
 . Click **ML** in the side navigation ...
-
+////
 //image::graph-open.jpg["Accessing Graph"]
diff --git a/docs/en/ml/images/ml.jpg b/docs/en/ml/images/ml.jpg
new file mode 100644
index 00000000000..12f427675a1
Binary files /dev/null and b/docs/en/ml/images/ml.jpg differ
diff --git a/docs/en/ml/ml-scenarios.asciidoc b/docs/en/ml/ml-scenarios.asciidoc
index da47718108a..e8e0396bdaa 100644
--- a/docs/en/ml/ml-scenarios.asciidoc
+++ b/docs/en/ml/ml-scenarios.asciidoc
@@ -1,6 +1,9 @@
 [[ml-scenarios]]
 == Use Cases
+TBD
+
+////
 Enterprises, government organizations and cloud based service providers daily
 process volumes of machine data so massive as to make real-time human
 analysis impossible. Changing behaviors hidden in this data provide the
@@ -98,3 +101,4 @@ would be very different than the thresholds that would be
 effective during the d
 By using {ml}, time-related trends are automatically identified and smoothed,
 leaving the residual to be analyzed for anomalies.
+////