[DOCS] Stop and start data feeds in ML Getting Started (elastic/x-pack-elasticsearch#1206)

Original commit: elastic/x-pack-elasticsearch@b938c19695
Lisa Cawley 2017-04-25 17:24:51 -07:00 committed by lcawley
parent 268f5a95af
commit aa421af2fc
9 changed files with 80 additions and 65 deletions


@ -34,10 +34,15 @@ To follow the steps in this tutorial, you will need the following
components of the Elastic Stack:
* Elasticsearch {version}, which stores the data and the analysis results
* {xpack} {version}, which provides the beta {ml} features
* Kibana {version}, which provides a helpful user interface for creating and
viewing jobs +
WARNING: The {xpack} {ml} features are in beta and subject to change.
The design and code are considered to be less mature than official GA features.
Elastic will take a best effort approach to fix any issues, but beta features
are not subject to the support SLA of official GA features. Exercise caution if
you use these features in production environments.
See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
information about supported operating systems.
@ -55,6 +60,7 @@ optionally dedicate nodes to specific purposes. If you want to control which
nodes are _machine learning nodes_ or limit which nodes run resource-intensive
activity related to jobs, see <<ml-settings>>.
[float]
[[ml-gs-users]]
==== Users, Roles, and Privileges
@ -77,7 +83,7 @@ For more information, see <<built-in-roles>> and <<privileges-list-cluster>>.
For the purposes of this tutorial, we provide sample data that you can play with.
When you consider your own data, however, it's important to take a moment
and think about where the {xpack} {ml} features will be most impactful.
The first consideration is that it must be time series data.
Generally, it's best to use data that is in chronological order. When the data
@ -87,20 +93,13 @@ very efficient and occur in real-time.
The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
information that contains key performance indicators (KPIs) for the health,
security, or success of your business or system. It is information that you need
to monitor and act on when anomalous behavior occurs. You might even have Kibana
dashboards that you're already using to watch this data. The better you know the
data, the quicker you will be able to create {ml} jobs that generate useful
insights.
////
* Working with out of sequence data:
** In the typical case where data arrives in ascending time order,
@ -185,6 +184,17 @@ Each document in the server-metrics data set has the following schema:
}
----------------------------------
TIP: The sample data sets include summarized data. For example, the `total`
value is a sum of the requests that were received by a specific service at a
particular time. If your data is stored in Elasticsearch, you can generate
this type of sum or average by using search queries. One of the benefits of
summarizing data this way is that Elasticsearch automatically distributes
these calculations across your cluster. You can then feed this summarized data
into the {xpack} {ml} features instead of raw results, which reduces the volume
of data that must be considered while detecting anomalies. For the purposes of
this tutorial, however, these summary values are provided directly in the JSON
source files. They are not generated by Elasticsearch queries.
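
For illustration only, a query along the following lines could produce that kind
of summary. It is not part of the tutorial scripts; the index name and the
`total` field come from the sample data set, but the `@timestamp` field name is
an assumption about the sample schema:

[source,shell]
----------------------------------
# Hypothetical sketch: compute per-interval request totals with a search
# aggregation instead of loading pre-summarized documents. The `@timestamp`
# field name is assumed; adjust it to match your mapping.
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
'http://localhost:9200/server-metrics/_search?size=0' -d '{
  "aggs": {
    "per_interval": {
      "date_histogram": { "field": "@timestamp", "interval": "10m" },
      "aggs": {
        "total_requests": { "sum": { "field": "total" } }
      }
    }
  }
}'
----------------------------------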
Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
for the fields. Mappings divide the documents in the index into logical groups
and specify a field's characteristics, such as the field's searchability or
@ -193,11 +203,10 @@ whether or not it's _tokenized_, or broken up into separate words.
The sample data includes an `upload_server-metrics.sh` script, which you can use
to create the mappings and load the data set. Before you run it, however, you
must edit the USERNAME and PASSWORD variables with your actual user ID and
password.

The script runs a command similar to the following example, which sets up a
mapping for the data set:
[source,shell]
----------------------------------
@ -266,7 +275,7 @@ This mapping specifies the following qualities for the data set:
You can then use the Elasticsearch `bulk` API to load the data set. The
`upload_server-metrics.sh` script runs commands similar to the following
example, which loads the four JSON files:
[source,shell]
----------------------------------
@ -280,10 +289,10 @@ http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------
These commands might take some time to run, depending on the computing resources
available.
@ -295,13 +304,13 @@ You can verify that the data was loaded successfully with the following command:
curl 'http://localhost:9200/_cat/indices?v' -u elastic:elasticpassword
----------------------------------
You should see output similar to the following:
[source,shell]
----------------------------------
health status index ... pri rep docs.count docs.deleted store.size ...
green open server-metrics ... 1 0 907200 0 136.2mb ...
----------------------------------
Next, you must define an index pattern for this data set:
@ -323,7 +332,7 @@ loaded will work. For example, enter `server-metrics*` as the index pattern.
. Click **Create**.
This data set can now be analyzed in {ml} jobs in Kibana.
//Content based on https://www.elastic.co/guide/en/kibana/current/tutorial-load-dataset.html
[[ml-gs-jobs]]
=== Creating Jobs
@ -350,6 +359,15 @@ received by your applications and services. The sample data contains a single
key performance indicator to track this, which is the total requests over time.
It is therefore logical to start by creating a single metric job for this KPI.
TIP: In general, if you are using summarized data that is generated from
Elasticsearch queries, you should create an advanced job. You can then identify
the fields that were summarized, the count of events that were summarized, and
in some cases, the associated function. The {ml} algorithms use those details
to make the best possible use of summarized data. Since we are not using
Elasticsearch queries to generate the summarized data in this tutorial, however,
we will not make use of that advanced functionality.
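
For reference, a sketch of what such an advanced job configuration might look
like is shown below. It is not used in this tutorial; the job name and the
`event_count` field are assumptions, and parameter formats can vary between
releases:

[source,shell]
----------------------------------
# Hypothetical sketch only; this job is not created in the tutorial.
# summary_count_field_name tells the analysis how many raw events each
# pre-summarized document represents. The event_count field and the job name
# are assumptions; total and @timestamp refer to the sample schema.
curl -u elastic:elasticpassword -X PUT -H "Content-Type: application/json" \
'http://localhost:9200/_xpack/ml/anomaly_detectors/summarized-requests' -d '{
  "analysis_config": {
    "bucket_span": "10m",
    "summary_count_field_name": "event_count",
    "detectors": [
      { "function": "sum", "field_name": "total" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}'
----------------------------------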
[float]
[[ml-gs-job1-create]]
==== Creating a Single Metric Job
@ -427,8 +445,19 @@ at which alerting is required.
//non-aggregating functions.
--
. Determine whether you want to process all of the data or only part of it. If
you want to analyze all of the existing data, click
**Use full transaction_counts data**. If you want to see what happens when you
stop and start data feeds and process additional data over time, click the time
picker in the Kibana toolbar. Since the sample data spans a period of time
between March 26, 2017 and April 22, 2017, click **Absolute**. Set the start
time to March 26, 2017 and the end time to April 1, 2017, for example. Once
you've got the time range set up, click the **Go** button.
image:images/ml-gs-job1-time.jpg["Setting the time range for the data feed"]
+
--
A graph is generated, which represents the total number of requests over time.
--
. Provide a name for the job, for example `total-requests`. The job name must
be unique in your cluster. You can also optionally provide a description of the
@ -450,7 +479,7 @@ using the {ml} APIs. For API reference information, see <<ml-apis>>.
After you create a job, you can see its status in the **Job Management** tab:
image::images/ml-gs-job1-manage1.jpg["Status information for the total-requests job"]
The following information is provided for each job:
@ -519,46 +548,27 @@ A data feed can be started and stopped multiple times throughout its lifecycle.
If you want to retrieve more data from Elasticsearch and the data feed is
stopped, you must restart it.
For example, if you did not use the full data when you created the job, you can
now process the remaining data by restarting the data feed:
. In the **Machine Learning** / **Job Management** tab, click the following
button to start the data feed:
image::images/ml-start-feed.jpg["Start data feed"]
. Choose a start time and end time. For example,
click **Continue from 2017-04-01** and **No end time**, then click **Start**.
image::images/ml-gs-job1-datafeed.jpg["Restarting a data feed"]
The data feed state changes to `started`, the job state changes to `opened`,
and the number of processed records increases as the new data is analyzed. The
latest timestamp information also increases. For example:
image::images/ml-gs-job1-manage2.jpg["Job opened and data feed started"]
If you want to stop the data feed at this point, you can click the following
button:
image::images/ml-stop-feed.jpg["Stop data feed"]
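
You can also start and stop data feeds with the {ml} APIs. The following sketch
assumes the datafeed ID is `datafeed-total-requests`; substitute the ID shown in
the **Job Management** tab:

[source,shell]
----------------------------------
# Hypothetical equivalents of the Kibana start and stop buttons.
# The datafeed ID is an assumption; use the ID that Kibana displays.
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
'http://localhost:9200/_xpack/ml/datafeeds/datafeed-total-requests/_start' -d '{
  "start": "2017-04-01T00:00:00Z"
}'

curl -u elastic:elasticpassword -X POST \
'http://localhost:9200/_xpack/ml/datafeeds/datafeed-total-requests/_stop'
----------------------------------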
Now that you have processed all the data, let's start exploring the job results.
[[ml-gs-jobresults]]
=== Exploring Job Results
@ -571,10 +581,9 @@ Result records for each anomaly are stored in `.ml-notifications` and
`.ml-anomalies*` indices in Elasticsearch. By default, the name of the
index where {ml} results are stored is `shared`, which corresponds to
the `.ml-anomalies-shared` index.
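
If you are curious, you can also query these indices directly. The following
search is only an illustration and is not part of the tutorial; the job ID
matches the `total-requests` job created earlier:

[source,shell]
----------------------------------
# Hypothetical sketch: fetch a few anomaly record results straight from the
# shared results index. Not part of the tutorial scripts.
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
'http://localhost:9200/.ml-anomalies-shared/_search?size=3' -d '{
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "total-requests" } },
        { "term": { "result_type": "record" } }
      ]
    }
  }
}'
----------------------------------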
You can use the **Anomaly Explorer** or the **Single Metric Viewer** in Kibana
to view the analysis results.
Anomaly Explorer::
This view contains heatmap charts, where the color for each section of the
@ -582,7 +591,8 @@ Anomaly Explorer::
//TBD: Do the time periods in the heat map correspond to buckets?
Single Metric Viewer::
This view contains a time series chart that represents the actual and expected
values over time.
As in the **Anomaly Explorer**, anomalous data points are shown in
different colors depending on their probability.
@ -642,6 +652,11 @@ contributing to the problem? Are the anomalies confined to particular
applications or servers? You can begin to troubleshoot these situations by
layering additional jobs or creating multi-metric jobs.
////
The troubleshooting job would not create alarms of its own, but rather would
help explain the overall situation. It's usually a different job because it's
operating on different indices. Layering jobs is an important concept.
////
////
[float]
[[ml-gs-job2-create]]

8 binary image files (documentation screenshots) were added or updated; their content is not shown.