[DOCS] Review getting started (elastic/x-pack-elasticsearch#1219)

* [DOCS] Initial review of getting started

* [DOCS] Completed review of getting started

Original commit: elastic/x-pack-elasticsearch@a4b800b59b
Sophie Chang 2017-04-27 16:04:46 +01:00 committed by lcawley
parent 779e8f6771
commit aa7d94ec44
1 changed file with 77 additions and 85 deletions


@ -16,7 +16,6 @@ tutorial shows you how to:
* Create a {ml} job
* Use the results to identify possible anomalies in the data
At the end of this tutorial, you should have a good idea of what {ml} is and
will hopefully be inspired to use it to detect anomalies in your own data.
@ -34,15 +33,17 @@ To follow the steps in this tutorial, you will need the following
components of the Elastic Stack:
* Elasticsearch {version}, which stores the data and the analysis results
* {xpack} {version}, which includes the beta {ml} features for both Elasticsearch and Kibana
* Kibana {version}, which provides a helpful user interface for creating and
viewing jobs +
All {ml} features are available through APIs; however, this tutorial
focuses on using the {ml} tab in the Kibana UI.
WARNING: The {xpack} {ml} features are in beta and subject to change.
The design and code are considered to be less mature than official GA features.
Beta features are not subject to the same support SLA as GA features,
and deployment in production is at your own risk.
// Warning was supplied by Steve K (deleteme)
See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
information about supported operating systems.
@ -51,7 +52,8 @@ See {stack-ref}/installing-elastic-stack.html[Installing the Elastic Stack] for
information about installing each of the components.
NOTE: To get started, you can install Elasticsearch and Kibana on a
single VM or even on your laptop (requires 64-bit OS).
As you add more data and your traffic grows,
you'll want to replace the single Elasticsearch instance with a cluster.
When you install {xpack} into Elasticsearch and Kibana, the {ml} features are
@ -70,7 +72,7 @@ make it easier to control which users have authority to view and manage the jobs
data feeds, and results.
By default, you can perform all of the steps in this tutorial by using the
built-in `elastic` superuser. If you are performing these steps in a production
environment, take extra care because that user has the `superuser` role and you
could inadvertently make significant changes to the system. You can
alternatively assign the `machine_learning_admin` and `kibana_user` roles to a
@ -82,14 +84,12 @@ For more information, see <<built-in-roles>> and <<privileges-list-cluster>>.
=== Identifying Data for Analysis
For the purposes of this tutorial, we provide sample data that you can play with.
This data will be available to search in Elasticsearch.
When you consider your own data, however, it's important to take a moment
and think about where the {xpack} {ml} features will be most impactful.
The first consideration is that it must be time series data, as
the {ml} features are designed to model and detect anomalies in time series data.
The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
@ -100,45 +100,24 @@ dashboards that you're already using to watch this data. The better you know the
data, the quicker you will be able to create {ml} jobs that generate useful
insights.
////
* Working with out of sequence data:
** In the typical case where data arrives in ascending time order,
each new record pushes the time forward. When a record is received that belongs
to a new bucket, the current bucket is considered to be completed.
At this point, the model is updated and final results are calculated for the
completed bucket and the new bucket is created.
** Expecting data to be in time sequence means that modeling and results
calculations can be performed very efficiently and in real-time.
As a direct consequence of this approach, out-of-sequence records are ignored.
** When data is expected to arrive out-of-sequence, a latency window can be
specified in the job configuration (does not apply to data feeds?). (If we're
using a data feed in the sample, perhaps this discussion can be deferred for
future more-advanced scenario.)
//See http://www.prelert.com/docs/behavioral_analytics/latest/concepts/outofsequence.html
////
The final consideration is where the data is located. This guide assumes that
your data is stored in Elasticsearch, and guides you through the steps
required to create a _data feed_ that will pass data to the job.
IMPORTANT: If you want to create {ml} jobs in Kibana, you must use data feeds.
That is to say, you must store your input data in Elasticsearch. When you create
a job, you select an existing index pattern and Kibana configures the data feed
for you under the covers.
If your data is not stored in Elasticsearch, you can create jobs by using
the <<ml-put-job,create job API>> and upload batches of data to the job by
using the <<ml-post-data,post data API>>. That scenario is not covered in
this tutorial, however.
//TBD: The data must be provided in JSON format?
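For illustration only (this scenario is not used in the rest of the tutorial), posting a
batch of documents to a job looks roughly like the following sketch. It assumes a job
named `total-requests` already exists and is open, and the field names and values shown
are hypothetical:

[source,shell]
----------------------------------
# Sketch only: send two JSON documents directly to an open job.
# The job name, field names, and values are assumptions, not part of this tutorial.
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
 'http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests/_data' \
 --data-binary '{"timestamp":"2017-04-01T00:00:00Z","total":40476}
{"timestamp":"2017-04-01T00:10:00Z","total":41524}'
----------------------------------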
[float]
[[ml-gs-sampledata]]
==== Obtaining a Sample Data Set
In this step we will upload some sample data to Elasticsearch. This is standard
Elasticsearch functionality, and is needed to set the stage for using {ml}.
The sample data for this tutorial contains information about the requests that
are received by various applications and services in a system. A system
administrator might use this type of information to track the total
@ -187,13 +166,14 @@ Each document in the server-metrics data set has the following schema:
TIP: The sample data sets include summarized data. For example, the `total`
value is a sum of the requests that were received by a specific service at a
particular time. If your data is stored in Elasticsearch, you can generate
this type of sum or average by using aggregations. One of the benefits of
summarizing data this way is that Elasticsearch automatically distributes
these calculations across your cluster. You can then feed this summarized data
into {xpack} {ml} instead of raw results, which reduces the volume
of data that must be considered while detecting anomalies. For the purposes of
this tutorial, however, these summary values are stored in Elasticsearch,
rather than created using the {ref}/search-aggregations.html[_aggregations framework_].
//TBD link to working with aggregations page
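For example, the kind of 10-minute totals used in this tutorial could be produced with a
date histogram and a sum aggregation. The following is only a sketch: it assumes the
`server-metrics` index that is loaded below and a time field named `timestamp`, so adjust
the field names to match your own data.

[source,shell]
----------------------------------
# Sketch: sum the `total` field in 10-minute intervals using aggregations.
# The `timestamp` field name is an assumption; match it to your own data.
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
 'http://localhost:9200/server-metrics/_search?size=0' -d '
{
  "aggs": {
    "requests_over_time": {
      "date_histogram": { "field": "timestamp", "interval": "10m" },
      "aggs": {
        "total_requests": { "sum": { "field": "total" } }
      }
    }
  }
}'
----------------------------------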
Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
for the fields. Mappings divide the documents in the index into logical groups
@ -293,6 +273,9 @@ curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------
TIP: These commands upload a total of 200MB of data. The data is split into
four files because there is a 100MB limit on requests to the `_bulk` API.
These commands might take some time to run, depending on the computing resources
available.
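When the commands complete, you can do a quick sanity check that the index exists and
contains the expected documents. For example:

[source,shell]
----------------------------------
# Confirm that the server-metrics index exists and see its document count and size.
curl -u elastic:elasticpassword 'http://localhost:9200/_cat/indices/server-metrics?v'
----------------------------------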
@ -342,7 +325,7 @@ necessary to perform an analytical task. They also contain the results of the
analytical task.
NOTE: This tutorial uses Kibana to create jobs and view results, but you can
alternatively use APIs to accomplish these tasks.
For API reference information, see <<ml-apis>>.
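For example, once jobs exist you could list their configurations from the command line;
a minimal sketch:

[source,shell]
----------------------------------
# List the configuration of all machine learning jobs.
curl -u elastic:elasticpassword 'http://localhost:9200/_xpack/ml/anomaly_detectors?pretty'
----------------------------------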
To work with jobs in Kibana:
@ -359,12 +342,9 @@ received by your applications and services. The sample data contains a single
key performance indicator to track this, which is the total requests over time.
It is therefore logical to start by creating a single metric job for this KPI.
TIP: If you are using aggregated data, you can create an advanced job
and configure it to use a `summary_count_field`. The {ml} algorithms will
make the best possible use of summarized data in this case. For simplicity, in this tutorial
we will not make use of that advanced functionality.
@ -413,12 +393,12 @@ the detector uses in the function.
NOTE: Some functions such as `count` and `rare` do not require fields.
--
.. For the **Bucket span**, enter `10m`. This value specifies the size of the
interval that the analysis is aggregated into.
+
--
The {xpack} {ml} features use the concept of a bucket to divide up the time series
into batches for processing. For example, if you are monitoring
the total number of requests in the system,
//and receive a data point every 10 minutes
using a bucket span of 1 hour would mean that at the end of each hour, it
@ -436,13 +416,8 @@ in time.
The bucket span has a significant impact on the analysis. When you're trying to
determine what value to use, take into account the granularity at which you
want to perform the analysis, the frequency of the input data, the duration of
typical anomalies, and the frequency at which alerting is required.
--
. Determine whether you want to process all of the data or only part of it. If
@ -467,7 +442,8 @@ job.
image::images/ml-gs-job1.jpg["A graph of the total number of requests over time"]
As the job is created, the graph is updated to give a visual representation of
the progress of {ml} as the data is processed. This view is only available while the
job is running.
//To explore the results, click **View Results**.
//TBD: image::images/ml-gs-job1-results.jpg["The total-requests job is created"]
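For comparison, a roughly equivalent job could be created with the
<<ml-put-job,create job API>>. The following is a sketch rather than the exact request
that Kibana sends: the time field name is an assumption about the sample data, and
depending on the version the `bucket_span` may need to be expressed differently (for
example, as a number of seconds).

[source,shell]
----------------------------------
# Sketch of a single metric job: sum of `total` in 10-minute buckets.
# The time_field name is an assumption; match it to your data.
curl -u elastic:elasticpassword -X PUT -H "Content-Type: application/json" \
 'http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests' -d '
{
  "description": "Total requests over time",
  "analysis_config": {
    "bucket_span": "10m",
    "detectors": [
      { "function": "sum", "field_name": "total" }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}'
----------------------------------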
@ -492,9 +468,11 @@ The optional description of the job.
Processed records::
The number of records that have been processed by the job.
// NOTE: Depending on how you send data to the job, the number of processed
// records is not always equal to the number of input records. For more information,
// see the `processed_record_count` description in <<ml-datacounts,Data Counts Objects>>.
// TBD delete for this getting started guide, but should be in the datacounts objects
Memory status::
The status of the mathematical models. When you create jobs by using the APIs or
@ -527,11 +505,11 @@ Datafeed state::
The status of the data feed, which can be one of the following values: +
started::: The data feed is actively receiving data.
stopped::: The data feed is stopped and will not receive data until it is re-started.
//TBD: How to restart data feeds in Kibana?
Latest timestamp::
The timestamp of the last processed record.
//TBD: Is that right?
If you click the arrow beside the name of job, you can show or hide additional
information, such as the settings, configuration information, or messages for
@ -556,7 +534,9 @@ button to start the data feed:
image::images/ml-start-feed.jpg["Start data feed"]
. Choose a start time and end time. For example,
click **Continue from 2017-04-01** and **2017-04-30**, then click **Start**.
The date picker will default to the latest timestamp of processed data.
Be careful not to leave any gaps in the analysis, otherwise you may miss anomalies.
image::images/ml-gs-job1-datafeed.jpg["Restarting a data feed"]
The data feed state changes to `started`, the job state changes to `opened`,
@ -570,6 +550,9 @@ image::images/ml-stop-feed.jpg["Stop data feed"]
Now that you have processed all the data, let's start exploring the job results.
TIP: If your data is being loaded continuously, you can continue running the job in real time.
For this, start your data feed and select **No end time**.
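The same operations are available through the data feed APIs. For example, a data feed
can be started from the command line; this sketch assumes that the data feed Kibana
created for this job is named `datafeed-total-requests` (check the actual ID in the job
management screen):

[source,shell]
----------------------------------
# Start the data feed at a given time with no end time (runs in real time).
# The data feed ID is an assumption about the Kibana naming convention.
curl -u elastic:elasticpassword -X POST -H "Content-Type: application/json" \
 'http://localhost:9200/_xpack/ml/datafeeds/datafeed-total-requests/_start' -d '
{ "start": "2017-04-01T00:00:00Z" }'
----------------------------------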
[[ml-gs-jobresults]]
=== Exploring Job Results
@ -577,24 +560,30 @@ The {xpack} {ml} features analyze the input stream of data, model its behavior,
and perform analysis based on the detectors you defined in your job. When an
event occurs outside of the model, that event is identified as an anomaly.
Result records for each anomaly are stored in `.ml-anomalies-*` indices in Elasticsearch.
By default, {ml} results are stored in an index labelled `shared`,
which corresponds to the `.ml-anomalies-shared` index.
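Because the results live in ordinary Elasticsearch indices, you can also retrieve them
with the {ml} results APIs. For example, this sketch lists the bucket results for the job
created in this tutorial, sorted by anomaly score:

[source,shell]
----------------------------------
# Retrieve bucket results for the total-requests job, highest anomaly scores first.
curl -u elastic:elasticpassword -X GET -H "Content-Type: application/json" \
 'http://localhost:9200/_xpack/ml/anomaly_detectors/total-requests/results/buckets' -d '
{ "sort": "anomaly_score", "desc": true }'
----------------------------------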
You can use the **Anomaly Explorer** or the **Single Metric Viewer** in Kibana
to view the analysis results.
Anomaly Explorer::
This view contains swimlanes showing the maximum anomaly score over time.
There is an overall swimlane that shows the overall score for the job, and
swimlanes for each influencer. By selecting a block in a swimlane, the
anomaly details are displayed alongside the original source data (where applicable).
//TBD: Are they swimlane blocks, tiles, segments or cards? hmmm
//TBD: Do the time periods in the heat map correspond to buckets? hmmm is it a heat map?
//As time is the x-axis, and the block sizes stay the same, it feels more intuitive call it a swimlane.
//The swimlane bucket intervals depends on the time range selected. Their smallest possible
//granularity is a bucket, but if you have a big time range selected, then they will span many buckets
Single Metric Viewer::
This view contains a chart that represents the actual and expected values over time.
This is only available for jobs that analyze a single time series
and where `model_plot_config` is enabled.
As in the **Anomaly Explorer**, anomalous data points are shown in
different colors depending on their score.
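For jobs created through the APIs, this corresponds to enabling model plot in the job
configuration; a hypothetical fragment:

[source,js]
----------------------------------
"model_plot_config": {
  "enabled": true
}
----------------------------------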
[float]
[[ml-gs-job1-analyze]]
@ -604,9 +593,12 @@ By default when you view the results for a single metric job,
the **Single Metric Viewer** opens:
image::images/ml-gs-job1-analysis.jpg["Single Metric Viewer for total-requests job"]
The blue line in the chart represents the actual data values.
The shaded blue area represents the bounds for the expected values.
The area between the upper and lower bounds contains the most likely values for the model.
If a value is outside of this area, it can be said to be anomalous.
//TBD: What is meant by "95% prediction bounds"? Because we are using probability
//to "predict" the values..
If you slide the time selector from the beginning of the data to the end of the
data, you can see how the model improves as it processes more data. At the
@ -627,7 +619,7 @@ The highly anomalous values are shown in red and the low scored values are
indicated in blue. An interval with a high anomaly score is significant and
requires investigation.
Slide the time selector to a section of the time series that contains a red anomaly data
point. If you hover over the point, you can see more information about that
data point. You can also see details in the **Anomalies** section of the viewer.
For example:
@ -641,8 +633,8 @@ You can see the same information in a different format by using the **Anomaly Ex
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]
Click one of the red blocks in the swimlane to see details about the anomalies that occurred in
that time interval. For example:
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]