[DOCS] Stop and start data feeds in ML Getting Started (elastic/x-pack-elasticsearch#1206)
Original commit: elastic/x-pack-elasticsearch@b938c19695
@@ -34,10 +34,15 @@ To follow the steps in this tutorial, you will need the following
components of the Elastic Stack:

* Elasticsearch {version}, which stores the data and the analysis results
* {xpack} {version}, which provides the {ml} features
* {xpack} {version}, which provides the beta {ml} features
* Kibana {version}, which provides a helpful user interface for creating and
viewing jobs +

WARNING: The {xpack} {ml} features are in beta and subject to change.
The design and code are considered to be less mature than official GA features.
Elastic will take a best effort approach to fix any issues, but beta features
are not subject to the support SLA of official GA features. Exercise caution if
you use these features in production environments.

See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
information about supported operating systems.
@@ -55,6 +60,7 @@ optionally dedicate nodes to specific purposes. If you want to control which
nodes are _machine learning nodes_ or limit which nodes run resource-intensive
activity related to jobs, see <<ml-settings>>.


[float]
[[ml-gs-users]]
==== Users, Roles, and Privileges
@@ -77,7 +83,7 @@ For more information, see <<built-in-roles>> and <<privileges-list-cluster>>.

For the purposes of this tutorial, we provide sample data that you can play with.
When you consider your own data, however, it's important to take a moment
and consider where the {xpack} {ml} features will be most impactful.
and think about where the {xpack} {ml} features will be most impactful.

The first consideration is that it must be time series data.
Generally, it's best to use data that is in chronological order. When the data
@@ -87,20 +93,13 @@ very efficient and occur in real-time.

The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
information that contains key performance indicators (KPIs) for the health or
success of your business or system. It is information that you need to act on
when anomalous behavior occurs. You might even have Kibana dashboards that
you're already using to watch this data. The better you know the data,
the quicker you will be able to create {ml} jobs that generate useful insights.
information that contains key performance indicators (KPIs) for the health,
security, or success of your business or system. It is information that you need
to monitor and act on when anomalous behavior occurs. You might even have Kibana
dashboards that you're already using to watch this data. The better you know the
data, the quicker you will be able to create {ml} jobs that generate useful
insights.

//TBD: Talk about layering additional jobs?
////
You can then create additional jobs to troubleshoot the situation and put it
into context of what was going on in the system at the time.
The troubleshooting job would not create alarms of its own, but rather would
help explain the overall situation. It's usually a different job because it's
operating on different indices. Layering jobs is an important concept.
////
////
* Working with out of sequence data:
** In the typical case where data arrives in ascending time order,
@@ -185,6 +184,17 @@ Each document in the server-metrics data set has the following schema:
}
----------------------------------

TIP: The sample data sets include summarized data. For example, the `total`
value is a sum of the requests that were received by a specific service at a
particular time. If your data is stored in Elasticsearch, you can generate
this type of sum or average by using search queries. One of the benefits of
summarizing data this way is that Elasticsearch automatically distributes
these calculations across your cluster. You can then feed this summarized data
into the {xpack} {ml} features instead of raw results, which reduces the volume
of data that must be considered while detecting anomalies. For the purposes of
this tutorial, however, these summary values are provided directly in the JSON
source files. They are not generated by Elasticsearch queries.
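
As an illustration only, this kind of sum could be produced with an
Elasticsearch aggregation. The following sketch assumes the sample index name
(`server-metrics`), the `@timestamp` and `total` field names from the sample
schema, and the `elastic` user password used elsewhere in this tutorial; adjust
them to match your environment:

[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
"http://localhost:9200/server-metrics/_search" -d '{
  "size": 0,
  "aggs": {
    "requests_over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "10m" },
      "aggs": {
        "total_requests": { "sum": { "field": "total" } }
      }
    }
  }
}'
----------------------------------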

Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
for the fields. Mappings divide the documents in the index into logical groups
and specify a field's characteristics, such as the field's searchability or
@@ -193,11 +203,10 @@ whether or not it's _tokenized_, or broken up into separate words.
The sample data includes an `upload_server-metrics.sh` script, which you can use
to create the mappings and load the data set. Before you run it, however, you
must edit the USERNAME and PASSWORD variables with your actual user ID and
password. If you want to test adding data to an existing data feed, you must
also comment out the final two commands related to `server-metrics_4.json`.
password.
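
For example, the edited variables might look like the following; these values
are placeholders, so substitute your own credentials:

[source,shell]
----------------------------------
USERNAME=elastic
PASSWORD=elasticpassword
----------------------------------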

The script runs a command similar
to the following example, which sets up a mapping for the data set:
The script runs a command similar to the following example, which sets up a
mapping for the data set:

[source,shell]
----------------------------------
@@ -266,7 +275,7 @@ This mapping specifies the following qualities for the data set:

You can then use the Elasticsearch `bulk` API to load the data set. The
`upload_server-metrics.sh` script runs commands similar to the following
example, which loads three of the JSON files:
example, which loads the four JSON files:

[source,shell]
----------------------------------
@@ -280,10 +289,10 @@ http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"

curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------

//curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
//http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
These commands might take some time to run, depending on the computing resources
available.
@@ -295,13 +304,13 @@ You can verify that the data was loaded successfully with the following command:
curl 'http://localhost:9200/_cat/indices?v' -u elastic:elasticpassword
----------------------------------

For three sample JSON files, you should see output similar to the following:
You should see output similar to the following:

[source,shell]
----------------------------------

health status index ... pri rep docs.count docs.deleted store.size ...
green open server-metrics ... 1 0 680400 0 101.7mb ...
green open server-metrics ... 1 0 907200 0 136.2mb ...
----------------------------------

Next, you must define an index pattern for this data set:
@@ -323,7 +332,7 @@ loaded will work. For example, enter `server-metrics*` as the index pattern.
. Click **Create**.

This data set can now be analyzed in {ml} jobs in Kibana.
//Content based on https://www.elastic.co/guide/en/kibana/current/tutorial-load-dataset.html


[[ml-gs-jobs]]
=== Creating Jobs
@@ -350,6 +359,15 @@ received by your applications and services. The sample data contains a single
key performance indicator to track this, which is the total requests over time.
It is therefore logical to start by creating a single metric job for this KPI.

TIP: In general, if you are using summarized data that is generated from
Elasticsearch queries, you should create an advanced job. You can then identify
the fields that were summarized, the count of events that were summarized, and
in some cases, the associated function. The {ml} algorithms use those details
to make the best possible use of summarized data. Since we are not using
Elasticsearch queries to generate the summarized data in this tutorial, however,
we will not make use of that advanced functionality.
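
For reference, when summarized data does come from Elasticsearch queries, the
equivalent configuration can also be expressed through the {ml} APIs. The
following sketch is illustrative only: the job name `total-requests-summarized`,
the `@timestamp` and `total` fields, and the `events_per_bucket` count field are
assumptions, and the exact request format for your version is documented in
<<ml-apis>>:

[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X PUT -H "Content-Type: application/json" \
"http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests-summarized" -d '{
  "analysis_config": {
    "detectors": [
      { "function": "sum", "field_name": "total" }
    ],
    "summary_count_field_name": "events_per_bucket"
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}'
----------------------------------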


[float]
[[ml-gs-job1-create]]
==== Creating a Single Metric Job
@@ -427,8 +445,19 @@ at which alerting is required.
//non-aggregating functions.
--

. Click **Use full transaction_counts data**. A graph is generated,
which represents the total number of requests over time.
. Determine whether you want to process all of the data or only part of it. If
you want to analyze all of the existing data, click
**Use full transaction_counts data**. If you want to see what happens when you
stop and start data feeds and process additional data over time, click the time
picker in the Kibana toolbar. Since the sample data spans a period of time
between March 26, 2017 and April 22, 2017, click **Absolute**. Set the start
time to March 26, 2017 and the end time to April 1, 2017, for example. Once
you've got the time range set up, click the **Go** button.
image:images/ml-gs-job1-time.jpg["Setting the time range for the data feed"]
+
--
A graph is generated, which represents the total number of requests over time.
--

. Provide a name for the job, for example `total-requests`. The job name must
be unique in your cluster. You can also optionally provide a description of the
@@ -450,7 +479,7 @@ using the {ml} APIs. For API reference information, see <<ml-apis>>.

After you create a job, you can see its status in the **Job Management** tab:

image::images/ml-gs-job1-manage.jpg["Status information for the total-requests job"]
image::images/ml-gs-job1-manage1.jpg["Status information for the total-requests job"]

The following information is provided for each job:
@@ -519,46 +548,27 @@ A data feed can be started and stopped multiple times throughout its lifecycle.
If you want to retrieve more data from Elasticsearch and the data feed is
stopped, you must restart it.

For example, if you only loaded three of the sample JSON files, you can now load
the fourth using the Elasticsearch `bulk` API as follows:

[source,shell]
----------------------------------

curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------

You can optionally verify that the data was loaded successfully with the
following command:

[source,shell]
----------------------------------

curl 'http://localhost:9200/_cat/indices?v' -u elastic:elasticpassword
----------------------------------

For the four sample JSON files, you should see output similar to the following:

[source,shell]
----------------------------------

health status index ... pri rep docs.count docs.deleted store.size ...
green open server-metrics ... 1 0 907200 0 136.2mb ...
----------------------------------

To use this new data in your job:
For example, if you did not use the full data when you created the job, you can
now process the remaining data by restarting the data feed:

. In the **Machine Learning** / **Job Management** tab, click the following
button to start the data feed: image::images/ml-start-feed.jpg["Start data feed"].
button to start the data feed:
image::images/ml-start-feed.jpg["Start data feed"]

. Choose a start time and end time. For example,
click **Continue from 2017-04-22** and **No end time**, then click **Start**.
click **Continue from 2017-04-01** and **No end time**, then click **Start**.
image::images/ml-gs-job1-datafeed.jpg["Restarting a data feed"]

* TBD: Why do I not see increases in the job count stats after this occurs?
How can I determine that it has been successfully processed?
The data feed state changes to `started`, the job state changes to `opened`,
and the number of processed records increases as the new data is analyzed. The
latest timestamp information also increases. For example:
image::images/ml-gs-job1-manage2.jpg["Job opened and data feed started"]
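
You can also check this progress outside Kibana. As a rough sketch (the exact
response fields depend on your version), the job statistics API reports the
processed record count and latest timestamps for the `total-requests` job:

[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X GET \
"http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests/_stats"
----------------------------------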

If you want to stop the data feed at this point, you can click the following
button:
image::images/ml-stop-feed.jpg["Stop data feed"]
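
The same operations are available through the {ml} APIs. The following sketch
assumes that the data feed Kibana created for the `total-requests` job is named
`datafeed-total-requests`; adjust the identifier to match your setup:

[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X POST \
"http://localhost:9200/_xpack/ml/datafeeds/datafeed-total-requests/_stop"

curl -u elastic:elasticpassword -X POST \
"http://localhost:9200/_xpack/ml/datafeeds/datafeed-total-requests/_start"
----------------------------------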

Now that you have processed all the data, let's start exploring the job results.

[[ml-gs-jobresults]]
=== Exploring Job Results
@@ -571,10 +581,9 @@ Result records for each anomaly are stored in `.ml-notifications` and
`.ml-anomalies*` indices in Elasticsearch. By default, the name of the
index where {ml} results are stored is `shared`, which corresponds to
the `.ml-anomalies-shared` index.
//For example, these results include the probability of detecting that anomaly.
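
If you prefer to work outside Kibana, the result records can also be retrieved
with the {ml} APIs. A minimal sketch, assuming the `total-requests` job from
this tutorial (see <<ml-apis>> for the full request and response details):

[source,shell]
----------------------------------
curl -u elastic:elasticpassword -X GET \
"http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests/results/records"
----------------------------------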

You can use the **Anomaly Explorer** or the
**Single Metric Viewer** in Kibana to view the analysis results.
You can use the **Anomaly Explorer** or the **Single Metric Viewer** in Kibana
to view the analysis results.

Anomaly Explorer::
This view contains heatmap charts, where the color for each section of the
@@ -582,7 +591,8 @@ Anomaly Explorer::
//TBD: Do the time periods in the heat map correspond to buckets?

Single Metric Viewer::
This view contains a time series chart that represents the analysis.
This view contains a time series chart that represents the actual and expected
values over time.
As in the **Anomaly Explorer**, anomalous data points are shown in
different colors depending on their probability.
@@ -642,6 +652,11 @@ contributing to the problem? Are the anomalies confined to particular
applications or servers? You can begin to troubleshoot these situations by
layering additional jobs or creating multi-metric jobs.

////
The troubleshooting job would not create alarms of its own, but rather would
help explain the overall situation. It's usually a different job because it's
operating on different indices. Layering jobs is an important concept.
////
////
[float]
[[ml-gs-job2-create]]