[DOCS] Add multi-metric job creation to ML getting started tutorial (elastic/x-pack-elasticsearch#1451)

* [DOCS] Getting started with multi-metric jobs

* [DOCS] More work on ML getting started with multi-metric jobs

* [DOCS] Add ML getting started with multi-metric jobs screenshot

* [DOCS] Add ML getting started information about influencers

* [DOCS] Getting started with multi-metric jobs

* [DOCS] Fix ML getting started build error

* [DOCS] Add ML getting started multi-metric snapshots

* [DOCS] Add screenshots and next steps to ML Getting Started tutorial

* [DOCS] Clarified anomaly scores in ML Getting Started pages

* [DOCS] Addressed ML getting started feedback

* [DOCS] Fix ML getting started links

Original commit: elastic/x-pack-elasticsearch@a7e80cfabf
This commit is contained in:
Lisa Cawley 2017-06-27 13:30:15 -07:00 committed by lcawley
parent b710f5906f
commit 6d4be0e5d3
10 changed files with 304 additions and 111 deletions

View File

@@ -0,0 +1,218 @@
[[ml-gs-multi-jobs]]
=== Creating Multi-metric Jobs
The multi-metric job wizard in {kib} provides a simple way to create more
complex jobs with multiple detectors. For example, in the single metric job, you
were tracking total requests versus time. You might also want to track other
metrics like average response time or the maximum number of denied requests.
Instead of creating jobs for each of those metrics, you can combine them in a
multi-metric job.
You can also use multi-metric jobs to split a single time series into multiple
time series based on a categorical field. For example, you can split the data
based on its hostnames, locations, or users. Each time series is modeled
independently. By looking at temporal patterns on a per-entity basis, you might
spot things that would otherwise have been hidden in the lumped view.
Conceptually, you can think of this as running many independent single metric
jobs. By bundling them together in a multi-metric job, however, you can see an
overall score and shared influencers for all the metrics and all the entities in
the job. Multi-metric jobs therefore scale better than having many independent
single metric jobs and provide better results when you have influencers that are
shared across the detectors.
The sample data for this tutorial contains information about the requests that
are received by various applications and services in a system. Let's assume that
you want to monitor the requests received and the response time. In particular,
you might want to track those metrics on a per service basis to see if any
services have unusual patterns.
To create a multi-metric job in {kib}:
. Open {kib} in your web browser and log in. If you are running {kib} locally,
go to `http://localhost:5601/`.
. Click **Machine Learning** in the side navigation, then click **Create new job**. +
+
--
[role="screenshot"]
image::images/ml-kibana.jpg[Job Management]
--
. Click **Create multi metric job**. +
+
--
[role="screenshot"]
image::images/ml-create-job2.jpg["Create a multi metric job"]
--
. Click the `server-metrics` index. +
+
--
[role="screenshot"]
image::images/ml-gs-index.jpg["Select an index"]
--
. Configure the job by providing the following job settings: +
+
--
[role="screenshot"]
image::images/ml-gs-multi-job.jpg["Create a new job from the server-metrics index"]
--
.. For the **Fields**, select `high mean(response)` and `sum(total)`. This
creates two detectors and specifies the analysis function and field that each
detector uses. The first detector uses the high mean function to detect
unusually high average values for the `response` field in each bucket. The
second detector uses the sum function to detect when the sum of the `total`
field is anomalous in each bucket. For more information about any of the
analytical functions, see <<ml-functions>>.
.. For the **Bucket span**, enter `10m`. This value specifies the size of the
interval that the analysis is aggregated into. As was the case in the single
metric example, this value has a significant impact on the analysis. When you're
creating jobs for your own data, you might need to experiment with different
bucket spans depending on the frequency of the input data, the duration of
typical anomalies, and the frequency at which alerting is required.
.. For the **Split Data**, select `service`. When you specify this
option, the analysis is segmented such that you have completely independent
baselines for each distinct value of this field.
//TBD: What is the importance of having separate baselines?
There are seven unique service keyword values in the sample data. Thus for each
of the seven services, you will see the high mean response metrics and sum
total metrics. +
+
--
NOTE: If you are creating a job by using the {ml} APIs or the advanced job
wizard in {kib}, you can accomplish this split by using the
`partition_field_name` property, as shown in the API sketch at the end of this
procedure.
--
.. For the **Key Fields**, select `host`. Note that the `service` field
is also automatically selected because you used it to split the data. These key
fields are also known as _influencers_.
When you identify a field as an influencer, you are indicating that you think
it contains information about someone or something that influences or
contributes to anomalies.
+
--
[TIP]
========================
Picking an influencer is strongly recommended for the following reasons:
* It allows you to more easily assign blame for the anomaly
* It simplifies and aggregates the results
The best influencer is the person or thing that you want to blame for the
anomaly. In many cases, users or client IP addresses make excellent influencers.
Influencers can be any field in your data; they do not need to be fields that
are specified in your detectors, though they often are.
As a best practice, do not pick too many influencers. For example, you generally
do not need more than three. If you pick many influencers, the results can be
overwhelming and there is a small overhead to the analysis.
========================
//TBD: Is this something you can determine later from looking at results and
//update your job with if necessary? Is it all post-processing or does it affect
//the ongoing modeling?
--
. Click **Use full server-metrics data**. Two graphs are generated for each
`service` value, which represent the high mean `response` values and
sum `total` values over time.
//TBD What is the use of the document count table?
. Provide a name for the job, for example `response_requests_by_app`. The job
name must be unique in your cluster. You can also optionally provide a
description of the job.
. Click **Create Job**. As the job is created, the graphs are updated to give a
visual representation of the progress of {ml} as the data is processed. For
example:
+
--
[role="screenshot"]
image::images/ml-gs-job2-results.jpg["Job results updating as data is processed"]
--
TIP: The `create_multi_metric.sh` script creates a similar job and {dfeed} by
using the {ml} APIs. You can download that script by clicking
here: https://download.elastic.co/demos/machine_learning/gettingstarted/create_multi_metric.sh[create_multi_metric.sh]
For API reference information, see {ref}/ml-apis.html[Machine Learning APIs].
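If you want a rough idea of what such a job and {dfeed} look like in API form,
the following requests are a simplified, illustrative sketch only. They are not
the exact output of the wizard or the script: details such as the `bucket_span`
format, the `indexes` property name, and the `@timestamp` time field are
assumptions that vary by version and data set, so treat the downloadable script
as the reference.
[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/response_requests_by_app
{
  "description": "Multi-metric job for the server-metrics sample data",
  "analysis_config": {
    "bucket_span": "10m",
    "detectors": [
      { "function": "high_mean", "field_name": "response", "partition_field_name": "service" },
      { "function": "sum", "field_name": "total", "partition_field_name": "service" }
    ],
    "influencers": [ "service", "host" ]
  },
  "data_description": {
    // the time field name is an assumption; use the date field from your index mapping
    "time_field": "@timestamp"
  }
}
PUT _xpack/ml/datafeeds/datafeed-response_requests_by_app
{
  "job_id": "response_requests_by_app",
  "indexes": [ "server-metrics" ]
}
POST _xpack/ml/anomaly_detectors/response_requests_by_app/_open
POST _xpack/ml/datafeeds/datafeed-response_requests_by_app/_start
--------------------------------------------------
Opening the job and starting the {dfeed} roughly correspond to what happens
when you click **Create Job** and choose to use the full `server-metrics` data
in the wizard.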
[[ml-gs-job2-analyze]]
=== Exploring Multi-metric Job Results
The {xpackml} features analyze the input stream of data, model its behavior, and
perform analysis based on the two detectors you defined in your job. When an
event occurs outside of the model, that event is identified as an anomaly.
You can use the **Anomaly Explorer** in {kib} to view the analysis results:
[role="screenshot"]
image::images/ml-gs-job2-explorer.jpg["Job results in the Anomaly Explorer"]
You can explore the overall anomaly time line, which shows the maximum anomaly
score for each section in the specified time period. You can change the time
period by using the time picker in the {kib} toolbar. Note that the sections in
this time line do not necessarily correspond to the bucket span. If you change
the time period, the sections change size too. The smallest possible size for
these sections is a bucket. If you specify a large time period, the sections can
span many buckets.
On the left is a list of the top influencers for all of the detected anomalies
in that same time period. The list includes maximum anomaly scores, which in
this case are aggregated for each influencer, for each bucket, across all
detectors. There is also a total sum of the anomaly scores for each influencer.
You can use this list to help you narrow down the contributing factors and focus
on the most anomalous entities.
If your job contains influencers, you can also explore swim lanes that
correspond to the values of an influencer. In this example, the swim lanes
correspond to the values for the `service` field that you used to split the data.
Each lane represents a unique application or service name. Since you specified
the `host` field as an influencer, you can also optionally view the results in
swim lanes for each host name:
[role="screenshot"]
image::images/ml-gs-job2-explorer-host.jpg["Job results sorted by host"]
By default, the swim lanes are ordered by their maximum anomaly score values.
You can click on the sections in the swim lane to see details about the
anomalies that occurred in that time interval.
NOTE: The anomaly scores that you see in each section of the **Anomaly Explorer**
might differ slightly. This disparity occurs because for each job we generate
bucket results, influencer results, and record results. Anomaly scores are
generated for each type of result. The anomaly timeline uses the bucket-level
anomaly scores. The list of top influencers uses the influencer-level anomaly
scores. The list of anomalies uses the record-level anomaly scores. For more
information about these different result types, see
{ref}/ml-results-resource.html[Results Resources].
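If you prefer to work with these result types programmatically, you can
retrieve each of them through the results APIs. The following requests are a
minimal sketch that assumes the job name used earlier in this tutorial and
omits optional parameters such as time ranges and score thresholds:
[source,js]
--------------------------------------------------
// Bucket-level results (used by the anomaly timeline)
GET _xpack/ml/anomaly_detectors/response_requests_by_app/results/buckets
// Influencer-level results (used by the top influencers list)
GET _xpack/ml/anomaly_detectors/response_requests_by_app/results/influencers
// Record-level results (used by the anomalies table)
GET _xpack/ml/anomaly_detectors/response_requests_by_app/results/records
--------------------------------------------------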
Click on a section in the swim lanes to obtain more information about the
anomalies in that time period. For example, click on the red section in the swim
lane for `server_2`:
[role="screenshot"]
image::images/ml-gs-job2-explorer-anomaly.jpg["Job results for an anomaly"]
You can see exact times when anomalies occurred and which detectors or metrics
caught the anomaly. Also note that because you split the data by the `service`
field, you see separate charts for each applicable service. In particular, you
see charts for each service for which there is data on the specified host in the
specified time interval.
Below the charts, there is a table that provides more information, such as the
typical and actual values and the influencers that contributed to the anomaly.
[role="screenshot"]
image::images/ml-gs-job2-explorer-table.jpg["Job results table"]
Notice that there are anomalies for both detectors, that is to say for both the
`high_mean(response)` and the `sum(total)` metrics in this time interval. By
investigating multiple metrics in a single job, you might see relationships
between events in your data that would otherwise be overlooked.

View File

@@ -0,0 +1,55 @@
[[ml-gs-next]]
=== Next Steps
By completing this tutorial, you've learned how you can detect anomalous
behavior in a simple set of sample data. You created single and multi-metric
jobs in {kib}, which creates and opens jobs and creates and starts {dfeeds} for
you under the covers. You examined the results of the {ml} analysis in the
**Single Metric Viewer** and **Anomaly Explorer** in {kib}.
If you want to learn about advanced job options, you might be interested in
the following video tutorial:
https://www.elastic.co/videos/machine-learning-lab-3-detect-outliers-in-a-population[Machine Learning Lab 3 - Detect Outliers in a Population].
If you intend to use {ml} APIs in your applications, a good next step might be
to learn about the APIs by retrieving information about these sample jobs.
For example, the following APIs retrieve information about the jobs and {dfeeds}.
[source,js]
--------------------------------------------------
GET _xpack/ml/anomaly_detectors
GET _xpack/ml/datafeeds
--------------------------------------------------
// CONSOLE
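You can also check the state and basic usage statistics of the sample jobs and
{dfeeds} with the stats APIs, for example:
[source,js]
--------------------------------------------------
GET _xpack/ml/anomaly_detectors/_stats
GET _xpack/ml/datafeeds/_stats
--------------------------------------------------
// CONSOLE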
For more information about the {ml} APIs, see <<ml-api-quickref>>.
Ultimately, the next step is to start applying {ml} to your own data.
As mentioned in <<ml-gs-data>>, there are three things to consider when you're
thinking about where {ml} will be most impactful:
. It must be time series data.
. It should be information that contains key performance indicators for the
health, security, or success of your business or system. The better you know the
data, the quicker you will be able to create jobs that generate useful
insights.
. Ideally, the data is located in {es} and you can therefore create a {dfeed}
that retrieves data in real time. If your data is outside of {es}, you
cannot use {kib} to create your jobs and you cannot use {dfeeds}. Machine
learning analysis is still possible, however, by using APIs to create and manage
jobs and to post data to them.
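For example, a minimal sketch of that API-only flow might look like the
following. The job name `my_custom_job` and the document fields are
hypothetical; they must match the job configuration you created:
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/my_custom_job/_open
// The post data API accepts newline-delimited JSON documents
POST _xpack/ml/anomaly_detectors/my_custom_job/_data
{"time":"2017-04-01T00:00:00Z","total":40}
{"time":"2017-04-01T00:10:00Z","total":53}
// Flush to force the analysis to process the buffered data and write results
POST _xpack/ml/anomaly_detectors/my_custom_job/_flush
--------------------------------------------------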
Once you have decided which data to analyze, you can start considering which
analysis functions you want to use. For more information, see <<ml-functions>>.
In general, it is a good idea to start with single metric jobs for your
key performance indicators. After you examine these simple analysis results,
you will have a better idea of what the influencers might be. You can create
multi-metric jobs and split the data or use more complex analysis functions
as necessary.
//TODO: Add link to configuration section: For examples of
//more complicated configuration options, see <<>>.
If you encounter problems, we're here to help. See <<xpack-help>> and
<<ml-troubleshooting>>.

View File

@@ -13,13 +13,14 @@ Ready to get some hands-on experience with the {xpackml} features? This
tutorial shows you how to:
* Load a sample data set into {es}
* Create a {ml} job
* Create single and multi-metric {ml} jobs in {kib}
* Use the results to identify possible anomalies in the data
At the end of this tutorial, you should have a good idea of what {ml} is and
will hopefully be inspired to use it to detect anomalies in your own data.
You might also be interested in these video tutorials:
You might also be interested in these video tutorials, which use the same sample
data:
* https://www.elastic.co/videos/machine-learning-tutorial-creating-a-single-metric-job[Machine Learning for the Elastic Stack: Creating a single metric job]
* https://www.elastic.co/videos/machine-learning-tutorial-creating-a-multi-metric-job[Machine Learning for the Elastic Stack: Creating a multi-metric job]
@@ -278,8 +279,8 @@ You should see output similar to the following:
[source,shell]
----------------------------------
health status index ... pri rep docs.count docs.deleted store.size ...
green open server-metrics ... 1 0 905940 0 120.5mb ...
health status index ... pri rep docs.count ...
green open server-metrics ... 1 0 905940 ...
----------------------------------
Next, you must define an index pattern for this data set:
@@ -305,7 +306,7 @@ This data set can now be analyzed in {ml} jobs in {kib}.
[[ml-gs-jobs]]
=== Creating Jobs
=== Creating Single Metric Jobs
Machine learning jobs contain the configuration information and metadata
necessary to perform an analytical task. They also contain the results of the
@@ -322,7 +323,25 @@ web browser so that it does not block pop-up windows or create an
exception for your Kibana URL.
--
To work with jobs in {kib}:
You can choose to create single metric, multi-metric, or advanced jobs in
{kib}. At this point in the tutorial, the goal is to detect anomalies in the
total requests received by your applications and services. The sample data
contains a single key performance indicator to track this, which is the total
requests over time. It is therefore logical to start by creating a single metric
job for this KPI.
TIP: If you are using aggregated data, you can create an advanced job
and configure it to use a `summary_count_field`. The {ml} algorithms will
make the best possible use of summarized data in this case. For simplicity, in
this tutorial we will not make use of that advanced functionality.
//TO-DO: Add link to aggregations.asciidoc: For more information, see <<>>.
A single metric job contains a single _detector_. A detector defines the type of
analysis that will occur (for example, `max`, `average`, or `rare` analytical
functions) and the fields that will be analyzed.
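For example, in API terms, a job with a single detector that sums the `total`
field might be defined like this. This is an illustrative sketch only; the job
name, bucket span, and `@timestamp` time field are assumptions based on this
tutorial's sample data:
[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/total-requests
{
  "analysis_config": {
    "bucket_span": "10m",
    "detectors": [
      { "function": "sum", "field_name": "total" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
--------------------------------------------------
In this tutorial, however, the wizard builds the configuration for you.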
To create a single metric job in {kib}:
. Open {kib} in your web browser and log in. If you are running {kib} locally,
go to `http://localhost:5601/`.
@@ -334,30 +353,7 @@ go to `http://localhost:5601/`.
image::images/ml-kibana.jpg[Job Management]
--
You can choose to create single metric, multi-metric, or advanced jobs in
{kib}. In this tutorial, the goal is to detect anomalies in the total requests
received by your applications and services. The sample data contains a single
key performance indicator to track this, which is the total requests over time.
It is therefore logical to start by creating a single metric job for this KPI.
TIP: If you are using aggregated data, you can create an advanced job
and configure it to use a `summary_count_field`. The {ml} algorithms will
make the best possible use of summarized data in this case. For simplicity in this tutorial
we will not make use of that advanced functionality.
[float]
[[ml-gs-job1-create]]
==== Creating a Single Metric Job
A single metric job contains a single _detector_. A detector defines the type of
analysis that will occur (for example, `max`, `average`, or `rare` analytical
functions) and the fields that will be analyzed.
To create a single metric job in {kib}:
. Click **Machine Learning** in the side navigation,
then click **Create new job**.
. Click **Create new job**.
. Click **Create single metric job**. +
+
@@ -568,9 +564,8 @@ button: image:images/ml-stop-feed.jpg["Stop {dfeed}"]
Now that you have processed all the data, let's start exploring the job results.
[[ml-gs-jobresults]]
=== Exploring Job Results
[[ml-gs-job1-analyze]]
=== Exploring Single Metric Job Results
The {xpackml} features analyze the input stream of data, model its behavior,
and perform analysis based on the detectors you defined in your job. When an
@@ -589,11 +584,6 @@ Anomaly Explorer::
also swim lanes for each influencer. By selecting a block in a swim lane, the
anomaly details are displayed alongside the original source data (where
applicable).
//TBD: Are they swimlane blocks, tiles, segments or cards? hmmm
//TBD: Do the time periods in the heat map correspond to buckets? hmmm is it a heat map?
//As time is the x-axis, and the block sizes stay the same, it feels more intuitive call it a swimlane.
//The swimlane bucket intervals depends on the time range selected. Their smallest possible
//granularity is a bucket, but if you have a big time range selected, then they will span many buckets
Single Metric Viewer::
This view contains a chart that represents the actual and expected values over
@@ -601,10 +591,6 @@ Single Metric Viewer::
where `model_plot_config` is enabled. As in the **Anomaly Explorer**, anomalous
data points are shown in different colors depending on their score.
[float]
[[ml-gs-job1-analyze]]
==== Exploring Single Metric Job Results
By default when you view the results for a single metric job, the
**Single Metric Viewer** opens:
[role="screenshot"]
@@ -652,7 +638,7 @@ You can see the same information in a different format by using the
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]
Click one of the red blocks in the swim lane to see details about the anomalies
Click one of the red sections in the swim lane to see details about the anomalies
that occurred in that time interval. For example:
[role="screenshot"]
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]
@@ -663,71 +649,5 @@ contributing to the problem? Are the anomalies confined to particular
applications or servers? You can begin to troubleshoot these situations by
layering additional jobs or creating multi-metric jobs.
////
The troubleshooting job would not create alarms of its own, but rather would
help explain the overall situation. It's usually a different job because it's
operating on different indices. Layering jobs is an important concept.
////
////
[float]
[[ml-gs-job2-create]]
==== Creating a Multi-Metric Job
TBD.
* Walk through creation of a simple multi-metric job.
* Provide overview of:
** partition fields,
** influencers
*** An influencer is someone or something that has influenced or contributed to the anomaly.
Results are aggregated for each influencer, for each bucket, across all detectors.
In this way, a combined anomaly score is calculated for each influencer,
which determines its relative anomalousness. You can specify one or many influencers.
Picking an influencer is strongly recommended for the following reasons:
**** It allow you to blame someone/something for the anomaly
**** It simplifies and aggregates results
*** The best influencer is the person or thing that you want to blame for the anomaly.
In many cases, users or client IP make excellent influencers.
*** By/over/partition fields are usually good candidates for influencers.
*** Influencers can be any field in the source data; they do not need to be fields
specified in detectors, although they often are.
** by/over fields,
*** detectors
**** You can have more than one detector in a job which is more efficient than
running multiple jobs against the same data stream.
//http://www.prelert.com/docs/behavioral_analytics/latest/concepts/multivariate.html
[float]
[[ml-gs-job2-analyze]]
===== Viewing Multi-Metric Job Results
TBD.
* Walk through exploration of job results.
* Describe how influencer detection accelerates root cause identification.
////
////
* Provide brief overview of statistical models and/or link to more info.
* Possibly discuss effect of altering bucket span.
The anomaly score is a sophisticated aggregation of the anomaly records in the
bucket. The calculation is optimized for high throughput, gracefully ages
historical data, and reduces the signal to noise levels. It adjusts for
variations in event rate, takes into account the frequency and the level of
anomalous activity and is adjusted relative to past anomalous behavior.
In addition, [the anomaly score] is boosted if anomalous activity occurs for related entities,
for example if disk IO and CPU are both behaving unusually for a given host.
** Once an anomalous time interval has been identified, it can be expanded to
view the detailed anomaly records which are the significant causal factors.
////
////
[[ml-gs-alerts]]
=== Creating Alerts for Job Results
TBD.
* Walk through creation of simple alert for anomalous data?
////
include::getting-started-multi.asciidoc[]
include::getting-started-next.asciidoc[]

7 binary image files added (tutorial screenshots), not shown. Sizes: 64 KiB, 226 KiB, 78 KiB, 179 KiB, 240 KiB, 371 KiB, 58 KiB.