[DOCS] Update ML getting started docs with server-metrics sample data (elastic/x-pack-elasticsearch#1166)

* [DOCS] Add role info to ML getting started docs

* [DOCS] Getting started with sample ML data

* [DOCS] Getting started with server-metrics sample data

Original commit: elastic/x-pack-elasticsearch@2f268f87b4
Lisa Cawley 2017-04-21 18:56:07 -07:00 committed by lcawley
parent c5b14197d4
commit d3a2e34f9d
15 changed files with 436 additions and 81 deletions

View File

@@ -1,7 +1,6 @@
[[ml-getting-started]]
== Getting Started
////
{xpack} {ml} features automatically detect:
* Anomalies in single or multiple time series
@@ -10,29 +9,39 @@ TBD.
This tutorial focuses on an anomaly detection scenario in a single time series.
////
Ready to get some hands-on experience with the {xpack} {ml} features? This
tutorial shows you how to:
* Load a sample data set into Elasticsearch
* Create a {ml} job
* Use the results to identify possible anomalies in the data
{nbsp}
At the end of this tutorial, you should have a good idea of what {ml} is and
will hopefully be inspired to use it to detect anomalies in your own data.
You might also be interested in these video tutorials:
* Getting started with machine learning (single metric)
* Getting started with machine learning (multiple metric)
In this tutorial, you will explore the {xpack} {ml} features by using sample
data. You will create two simple jobs and use the results to identify possible
anomalies in the data. You can also optionally create an alert. At the end of
this tutorial, you should have a good idea of what {ml} is and will hopefully
be inspired to use it to detect anomalies in your own data.
[float]
[[ml-gs-sysoverview]]
=== System Overview
To follow the steps in this tutorial, you will need the following
components of the Elastic Stack:
* Elasticsearch {version}, which stores the data and the analysis results
* {xpack} {version}, which provides the {ml} features
* Kibana {version}, which provides a helpful user interface for creating and
viewing jobs +
See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
information about supported operating systems.
See {stack-ref}/installing-elastic-stack.html[Installing the Elastic Stack] for
information about installing each of the components.
@@ -47,15 +56,26 @@ optionally dedicate nodes to specific purposes. If you want to control which
nodes are _machine learning nodes_ or limit which nodes run resource-intensive
activity related to jobs, see <<ml-settings>>.
[float]
[[ml-gs-users]]
==== Users, Roles, and Privileges
The {xpack} {ml} features implement cluster privileges and built-in roles to
make it easier to control which users have authority to view and manage the jobs,
data feeds, and results.
By default, you can perform all of the steps in this tutorial by using the
built-in `elastic` user. If you are performing these steps in a production
environment, take extra care because that user has the `superuser` role and you
could inadvertently make significant changes to the system. You can
alternatively assign the `machine_learning_admin` and `kibana_user` roles to a
user ID of your choice.
For more information, see <<built-in-roles>> and <<privileges-list-cluster>>.
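If you choose to set up a dedicated user rather than use `elastic`, you can do
so through the {xpack} security APIs. The following is a sketch only; the user
name and password are placeholders that you should replace with your own values:
[source,shell]
----------------------------------
# Sketch: create a user with the roles mentioned above.
# The user name and password here are placeholders.
curl -u elastic:elasticpassword -X POST -H 'Content-Type: application/json' \
http://localhost:9200/_xpack/security/user/ml_user -d '{
  "password" : "changeme",
  "roles" : [ "machine_learning_admin", "kibana_user" ],
  "full_name" : "Machine learning tutorial user"
}'
----------------------------------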
[[ml-gs-data]]
=== Identifying Data for Analysis
For the purposes of this tutorial, we provide sample data that you can play with.
When you consider your own data, however, it's important to take a moment
and consider where the {xpack} {ml} features will be most impactful.
@@ -69,11 +89,10 @@ very efficient and occur in real-time.
The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
information that contains key performance indicators (KPIs) for the health or
success of your business or system. It is information that you need to act on
when anomalous behavior occurs. You might even have Kibana dashboards that
you're already using to watch this data. The better you know the data,
the quicker you will be able to create {ml} jobs that generate useful insights.
//TBD: Talk about layering additional jobs?
////
@@ -102,84 +121,413 @@ future more-advanced scenario.)
The final consideration is where the data is located. If the data that you want
to analyze is stored in Elasticsearch, you can define a _data feed_ that
provides data to the job in real time. When you have both the input data and the
analytical results in Elasticsearch, this data gravity provides performance
benefits.
IMPORTANT: If you want to create {ml} jobs in Kibana, you must use data feeds.
That is to say, you must store your input data in Elasticsearch. When you create
a job, you select an existing index pattern and Kibana configures the data feed
for you under the covers.
If your data is not stored in Elasticsearch, you can create jobs by using
the <<ml-put-job,create job API>> and upload batches of data to the job by
using the <<ml-post-data,post data API>>. That scenario is not covered in
this tutorial, however.
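For illustration only, sending a batch of documents to an existing job with the
post data API looks roughly like the following sketch. The job ID and file name
are placeholders, and the job must first have been created with the create job API:
[source,shell]
----------------------------------
# Sketch only: the job "my_api_job" and the file name are placeholders.
curl -u elastic:elasticpassword -X POST -H 'Content-Type: application/json' \
http://localhost:9200/_xpack/ml/anomaly_detectors/my_api_job/_data --data-binary "@my_batch.json"
----------------------------------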
//TBD: The data must be provided in JSON format?
[float]
[[ml-gs-sampledata]]
==== Obtaining a Sample Data Set
The sample data for this tutorial contains information about the requests that
are received by various applications and services in a system. A system
administrator might use this type of information to track the total
number of requests across all of the infrastructure. If the number of requests
increases or decreases unexpectedly, for example, this might be an indication
that there is a problem or that resources need to be redistributed. By using
the {xpack} {ml} features to model the behavior of this data, it is easier to
identify anomalies and take appropriate action.
* TBD: Provide instructions for downloading the sample data after it's made
available publicly on https://github.com/elastic/examples
//Download this data set by clicking here:
//See https://download.elastic.co/demos/kibana/gettingstarted/shakespeare.json[shakespeare.json].
////
Use the following commands to extract the files:
[source,shell]
gzip -d transactions.ndjson.gz
////
Each document in the server-metrics data set has the following schema:
[source,json]
----------------------------------
{
  "index":
  {
    "_index":"server-metrics",
    "_type":"metric",
    "_id":"AVuQL1eekrHQ5a9V5qre"
  }
}
{
  "deny":1783,
  "service":"app_0",
  "@timestamp":"2017-03-26T06:47:28.684926",
  "accept":24465,
  "host":"server_1",
  "total":26248,
  "response":1.8242486553275024
}
----------------------------------
Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
for the fields. Mappings divide the documents in the index into logical groups
and specify a field's characteristics, such as the field's searchability or
whether or not it's _tokenized_, or broken up into separate words.
The sample data includes an `upload_server-metrics.sh` script, which you can use
to create the mappings and load the data set. The script runs a command similar
to the following example, which sets up a mapping for the data set:
[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X PUT -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "metric": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "accept": {
          "type": "long"
        },
        "deny": {
          "type": "long"
        },
        "host": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "response": {
          "type": "float"
        },
        "service": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "total": {
          "type": "long"
        }
      }
    }
  }
}'
----------------------------------
NOTE: If you run this command, you must replace `elasticpassword` with your
actual password. Likewise, if you use the `upload_server-metrics.sh` script,
you must edit the USERNAME and PASSWORD variables before you run it.
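After the index is created, you can optionally confirm that the mapping is in
place (again, substitute your own password):
[source,shell]
----------------------------------
curl -u elastic:elasticpassword 'http://localhost:9200/server-metrics/_mapping?pretty'
----------------------------------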
////
This mapping specifies the following qualities for the data set:
* The _@timestamp_ field is a date.
//that uses the ISO format `epoch_second`,
//which is the number of seconds since the epoch.
* The _accept_, _deny_, and _total_ fields are long numbers.
* The _host
////
You can then use the Elasticsearch `bulk` API to load the data set. The
`upload_server-metrics.sh` script runs commands similar to the following
example, which loads the four JSON files:
[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json"
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------
These commands might take some time to run, depending on the computing resources
available.
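While the bulk requests are running, you can optionally check how many documents
have been indexed so far (substitute your own password):
[source,shell]
----------------------------------
curl -u elastic:elasticpassword 'http://localhost:9200/server-metrics/_count?pretty'
----------------------------------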
You can verify that the data was loaded successfully with the following command:
[source,shell]
----------------------------------
curl 'http://localhost:9200/_cat/indices?v' -u elastic:elasticpassword
----------------------------------
You should see output similar to the following:
[source,shell]
----------------------------------
health status index ... pri rep docs.count docs.deleted store.size ...
green open server-metrics ... 1 0 907200 0 134.9mb ...
----------------------------------
Next, you must define an index pattern for this data set:
. Open Kibana in your web browser and log in. If you are running Kibana
locally, go to `http://localhost:5601/`.
. Click the **Management** tab, then **Index Patterns**.
. Click the plus sign (+) to define a new index pattern.
. For this tutorial, any pattern that matches the name of the index you've
loaded will work. For example, enter `server-metrics*` as the index pattern.
. Verify that the **Index contains time-based events** option is checked.
. Select the `@timestamp` field from the **Time-field name** list.
. Click **Create**.
This data set can now be analyzed in {ml} jobs in Kibana.
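It can also be useful to know the time span that the sample data covers before
you create jobs. For example, you can retrieve the earliest and latest timestamps
with a standard aggregation query (a sketch; substitute your own password):
[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X POST -H 'Content-Type: application/json' \
'http://localhost:9200/server-metrics/_search?size=0' -d '{
  "aggs": {
    "earliest": { "min": { "field": "@timestamp" } },
    "latest": { "max": { "field": "@timestamp" } }
  }
}'
----------------------------------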
//Content based on https://www.elastic.co/guide/en/kibana/current/tutorial-load-dataset.html
[[ml-gs-jobs]]
=== Creating Jobs
Machine learning jobs contain the configuration information and metadata
necessary to perform an analytical task. They also contain the results of the
analytical task.
NOTE: This tutorial uses Kibana to create jobs and view results, but you can
alternatively use APIs to accomplish most tasks.
For API reference information, see <<ml-apis>>.
To work with jobs in Kibana:
. Open Kibana in your web browser and log in. If you are running Kibana
locally, go to `http://localhost:5601/`.
. Click **Machine Learning** in the side navigation:
image::images/ml-kibana.jpg["Job Management"]
You can choose to create single metric, multi-metric, or advanced jobs in
Kibana. In this tutorial, the goal is to detect anomalies in the total requests
received by your applications and services. The sample data contains a single
key performance indicator to track this, which is the total requests over time.
It is therefore logical to start by creating a single metric job for this KPI.
[float]
[[ml-gs-job1-create]]
==== Creating a Single Metric Job
A single metric job contains a single _detector_. A detector defines the type of
analysis that will occur (for example, `max`, `average`, or `rare` analytical
functions) and the fields that will be analyzed.
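For reference, in the job configuration that Kibana builds for you, a detector
is represented roughly like this. You do not need to write it by hand in this
tutorial; it is shown only to illustrate the concept:
[source,json]
----------------------------------
{
  "function": "sum",
  "field_name": "total"
}
----------------------------------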
To create a single metric job in Kibana:
. Click **Machine Learning** in the side navigation,
then click **Create new job**.
. Click **Create single metric job**.
image::images/ml-create-jobs.jpg["Create a new job"]
. Click the `server-metrics` index. +
+
--
image::images/ml-gs-index.jpg["Select an index"]
--
. Configure the job by providing the following information:
image::images/ml-gs-single-job.jpg["Create a new job from the server-metrics index"]
.. For the **Aggregation**, select `Sum`. This value specifies the analysis
function that is used.
+
--
Some of the analytical functions look for single anomalous data points. For
example, `max` identifies the maximum value that is seen within a bucket.
Others perform some aggregation over the length of the bucket. For example,
`mean` calculates the mean of all the data points seen within the bucket.
Similarly, `count` calculates the total number of data points within the bucket.
In this tutorial, you are using the `sum` function, which calculates the sum of
the specified field's values within the bucket.
--
.. For the **Field**, select `total`. This value specifies the field that
the detector uses in the function.
+
--
NOTE: Some functions such as `count` and `rare` do not require fields.
--
.. For the **Bucket span**, enter `600s`. This value specifies the size of the
interval that the analysis is aggregated into.
+
--
The {xpack} {ml} features use the concept of a bucket to divide up a continuous
stream of data into batches for processing. For example, if you are monitoring
the total number of requests in the system,
//and receive a data point every 10 minutes
using a bucket span of 1 hour would mean that at the end of each hour, the job
calculates the sum of the requests for the last hour and computes the
anomalousness of that value compared to previous hours.
The bucket span has two purposes: it dictates over what time span to look for
anomalous features in data, and also determines how quickly anomalies can be
detected. Choosing a shorter bucket span allows anomalies to be detected more
quickly. However, there is a risk of being too sensitive to natural variations
or noise in the input data. Choosing too long a bucket span can mean that
interesting anomalies are averaged away. There is also the possibility that the
aggregation might smooth out some anomalies based on when the bucket starts
in time.
The bucket span has a significant impact on the analysis. When you're trying to
determine what value to use, take into account the granularity at which you
want to perform the analysis, the frequency of the input data, and the frequency
at which alerting is required.
//TBD: Talk about overlapping buckets? "To avoid this, you can use overlapping
//buckets (how/where?). We analyze the data points in two buckets simultaneously,
//one starting half a bucket span later than the other. Overlapping buckets are
//only beneficial for aggregating functions, and should not be used for
//non-aggregating functions.
--
. Click **Use full server-metrics* data**.
+
--
A graph is generated, which represents the total number of requests over time.
//TBD: What happens if you click the play button instead?
--
. Provide a name for the job, for example `total-requests`. The job name must
be unique in your cluster. You can also optionally provide a description of the
job.
. Click **Create Job**.
image::images/ml-gs-job1.jpg["A graph of the total number of requests over time"]
As the job is created, the graph is updated to give a visual representation of
the {ml} that occurs as the data is processed.
//To explore the results, click **View Results**.
//TBD: image::images/ml-gs-job1-results.jpg["The total-requests job is created"]
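For reference, a roughly equivalent job could also be created through the {ml}
APIs instead of Kibana. The following is a sketch only: value formats (for
example, `bucket_span`) can differ between releases, so check <<ml-apis>> before
adapting it, and note that Kibana additionally configures a data feed for the
job, which you would otherwise have to create and start yourself:
[source,shell]
----------------------------------
# Sketch only. If you already created the job in Kibana, this job ID exists.
# The bucket_span value format can vary by release.
curl -u elastic:elasticpassword -X PUT -H 'Content-Type: application/json' \
http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests -d '{
  "description": "Sum of total requests",
  "analysis_config": {
    "bucket_span": "600s",
    "detectors": [ { "function": "sum", "field_name": "total" } ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}'
----------------------------------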
[[ml-gs-job1-manage]]
=== Managing Jobs
After you create a job, you can see its status in the **Job Management** tab:
image::images/ml-gs-job1-manage.jpg["Status information for the total-requests job"]
The following information is provided for each job:
Job ID::
The unique identifier for the job.
Description::
The optional description of the job.
Processed records::
The number of records that have been processed by the job.
+
--
NOTE: Depending on how you send data to the job, the number of processed
records is not always equal to the number of input records. For more information,
see the `processed_record_count` description in <<ml-datacounts,Data Counts Objects>>.
--
Memory status::
The status of the mathematical models. When you create jobs by using the APIs or
by using the advanced options in Kibana, you can specify a `model_memory_limit`.
That value is the maximum amount of memory, in MiB, that the mathematical models
can use. Once that limit is approached, data pruning becomes more aggressive.
Upon exceeding that limit, new entities are not modeled.
The default value is `4096`. The memory status field reflects whether you have
reached or exceeded the model memory limit. It can have one of the following
values: +
`ok`::: The models stayed below the configured value.
`soft_limit`::: The models used more than 60% of the configured memory limit
and older unused models will be pruned to free up space.
`hard_limit`::: The models used more space than the configured memory limit.
As a result, not all incoming data was processed.
Job state::
The status of the job, which can be one of the following values: +
`open`::: The job is available to receive and process data.
`closed`::: The job finished successfully with its model state persisted.
The job must be opened before it can accept further data.
`closing`::: The job close action is in progress and has not yet completed.
A closing job cannot accept further data.
`failed`::: The job did not finish successfully due to an error.
This situation can occur due to invalid input data.
If the job has irrevocably failed, it must be force closed and then deleted.
If the data feed can be corrected, the job can be closed and then re-opened.
Datafeed state::
The status of the data feed, which can be one of the following values: +
`started`::: The data feed is actively receiving data.
`stopped`::: The data feed is stopped and will not receive data until it is re-started.
//TBD: How to restart data feeds in Kibana?
Latest timestamp::
The timestamp of the last processed record.
//TBD: Is that right?
If you click the arrow beside the name of the job, you can show or hide additional
information, such as the settings, configuration information, or messages for
the job.
You can also click one of the **Actions** buttons to start the data feed, edit
the job or data feed, and clone or delete the job, for example.
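The same status information is also available through the APIs. For example,
assuming you named the job `total-requests` as described above (a sketch;
substitute your own password):
[source,shell]
----------------------------------
# Job counts, model size information, and state
curl -u elastic:elasticpassword 'http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests/_stats?pretty'
# State of all data feeds
curl -u elastic:elasticpassword 'http://localhost:9200/_xpack/ml/datafeeds/_stats?pretty'
----------------------------------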
* TBD: Demonstrate how to re-open the data feed and add additional data
[[ml-gs-jobresults]]
=== Exploring Job Results
After you create a job, you can use the **Anomaly Explorer** or the
**Single Metric Viewer** in Kibana to view the analysis results.
Anomaly Explorer::
TBD
Single Metric Viewer::
TBD
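The results are also stored in Elasticsearch, so you can retrieve them through
the APIs as well as in Kibana. For example, to list the bucket results for the
job (a sketch, assuming the job is named `total-requests`; substitute your own
password):
[source,shell]
----------------------------------
curl -u elastic:elasticpassword 'http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests/results/buckets?pretty'
----------------------------------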
[float]
[[ml-gs-job1-analyze]]
==== Exploring Single Metric Job Results
TBD.
@@ -213,6 +561,21 @@ view the detailed anomaly records which are the significant causal factors.
* Provide general overview of management of jobs (when/why to start or
stop them).
Integrate the following images:
. Single Metric Viewer: All
image::images/ml-gs-job1-analysis.jpg["Single Metric Viewer for total-requests job"]
. Single Metric Viewer: Anomalies
image::images/ml-gs-job1-anomalies.jpg["Single Metric Viewer Anomalies for total-requests job"]
. Anomaly Explorer: All
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]
. Anomaly Explorer: Selected a red area from the heatmap
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]
////
[float]
[[ml-gs-job2-create]]
==== Creating a Multi-Metric Job
@@ -259,11 +622,3 @@ TBD.
* Walk through creation of simple alert for anomalous data?
////
//image::images/graph-open.jpg["Accessing Graph"]

13 binary image files added (content not shown).
View File

@@ -16,7 +16,7 @@ easily answer these types of questions.
include::introduction.asciidoc[]
include::getting-started.asciidoc[]
// include::ml-scenarios.asciidoc[]
include::api-quickref.asciidoc[]
//include::troubleshooting.asciidoc[] Referenced from x-pack/docs/public/xpack-troubleshooting.asciidoc