[[ml-gs-data]]
=== Identifying Data for Analysis

For the purposes of this tutorial, we provide sample data that you can play with
and search in {es}. When you consider your own data, however, it's important to
take a moment and think about where the {xpackml} features will be most
impactful.

The first consideration is that your data must be time series data. The {ml}
features are designed to model and detect anomalies in time series data.

The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
information that contains key performance indicators (KPIs) for the health,
security, or success of your business or system. It is information that you need
to monitor and act on when anomalous behavior occurs. You might even have {kib}
dashboards that you're already using to watch this data. The better you know the
data, the quicker you will be able to create {ml} jobs that generate useful
insights.

The final consideration is where the data is located. This tutorial assumes that
your data is stored in {es}. It guides you through the steps required to create
a _{dfeed}_ that passes data to a job. If your own data is outside of {es},
analysis is still possible by using a post data API.

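For example, a minimal sketch of sending a batch of JSON documents to the post
data API might look like the following. The job name `my-job` and the file
`data.json` are hypothetical; the job must already exist and be open:

[source,shell]
----------------------------------
curl -u elastic:x-pack-test-password -X POST -H 'Content-Type: application/json' \
http://localhost:9200/_xpack/ml/anomaly_detectors/my-job/_data \
--data-binary "@data.json"
----------------------------------
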
IMPORTANT: If you want to create {ml} jobs in {kib}, you must use {dfeeds}.
That is to say, you must store your input data in {es}. When you create
a job, you select an existing index pattern and {kib} configures the {dfeed}
for you under the covers.


[float]
[[ml-gs-sampledata]]
==== Obtaining a Sample Data Set

In this step we will upload some sample data to {es}. This is standard
{es} functionality and is needed to set the stage for using {ml}.

The sample data for this tutorial contains information about the requests that
are received by various applications and services in a system. A system
administrator might use this type of information to track the total number of
requests across all of the infrastructure. If the number of requests increases
or decreases unexpectedly, for example, this might be an indication that there
is a problem or that resources need to be redistributed. By using the {xpack}
{ml} features to model the behavior of this data, it is easier to identify
anomalies and take appropriate action.

Download this sample data by clicking here:
https://download.elastic.co/demos/machine_learning/gettingstarted/server_metrics.tar.gz[server_metrics.tar.gz]

Use the following command to extract the files:

[source,shell]
----------------------------------
tar -zxvf server_metrics.tar.gz
----------------------------------

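The archive contains the four JSON data files that are loaded later in this
tutorial. A quick listing (the output shown is illustrative) confirms that the
extraction worked:

[source,shell]
----------------------------------
ls server-metrics_*.json
server-metrics_1.json  server-metrics_2.json  server-metrics_3.json  server-metrics_4.json
----------------------------------
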
Each document in the server-metrics data set has the following schema:

[source,js]
----------------------------------
{
   "index":
   {
      "_index":"server-metrics",
      "_type":"metric",
      "_id":"1177"
   }
}
{
   "@timestamp":"2017-03-23T13:00:00",
   "accept":36320,
   "deny":4156,
   "host":"server_2",
   "response":2.4558210155,
   "service":"app_3",
   "total":40476
}
----------------------------------

TIP: The sample data set includes summarized data. For example, the `total`
value is a sum of the requests that were received by a specific service at a
particular time. If your data is stored in {es}, you can generate
this type of sum or average by using aggregations. One of the benefits of
summarizing data this way is that {es} automatically distributes
these calculations across your cluster. You can then feed this summarized data
into {xpackml} instead of raw results, which reduces the volume
of data that must be considered while detecting anomalies. For the purposes of
this tutorial, however, these summary values are stored in {es}. For more
information, see <<ml-configuring-aggregation>>.

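For illustration only (the sample data is already summarized, so this step is
not required for the tutorial), an aggregation like the following sketch would
compute hourly sums of the `total` field across the index. The aggregation name
and the one-hour interval are arbitrary choices:

[source,shell]
----------------------------------
curl -u elastic:x-pack-test-password -X POST -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics/_search -d '{
  "size": 0,
  "aggs": {
    "requests_over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "1h" },
      "aggs": {
        "sum_total": { "sum": { "field": "total" } }
      }
    }
  }
}'
----------------------------------
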
Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
for the fields. Mappings divide the documents in the index into logical groups
and specify a field's characteristics, such as the field's searchability or
whether or not it's _tokenized_, or broken up into separate words.

The sample data includes an `upload_server-metrics.sh` script, which you can use
to create the mappings and load the data set. You can download it by clicking
here: https://download.elastic.co/demos/machine_learning/gettingstarted/upload_server-metrics.sh[upload_server-metrics.sh]
Before you run it, however, you must edit the `USERNAME` and `PASSWORD` variables
with your actual user ID and password.

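For example, the edited variables might look like the following sketch. The
variable names come from the script itself; the values shown are the test
credentials used elsewhere in this tutorial, so substitute your own:

[source,shell]
----------------------------------
USERNAME=elastic
PASSWORD=x-pack-test-password
----------------------------------
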
The script runs a command similar to the following example, which sets up a
mapping for the data set:

[source,shell]
----------------------------------
curl -u elastic:x-pack-test-password -X PUT -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics -d '{
   "settings":{
      "number_of_shards":1,
      "number_of_replicas":0
   },
   "mappings":{
      "metric":{
         "properties":{
            "@timestamp":{
               "type":"date"
            },
            "accept":{
               "type":"long"
            },
            "deny":{
               "type":"long"
            },
            "host":{
               "type":"keyword"
            },
            "response":{
               "type":"float"
            },
            "service":{
               "type":"keyword"
            },
            "total":{
               "type":"long"
            }
         }
      }
   }
}'
----------------------------------

NOTE: If you run this command, you must replace `x-pack-test-password` with your
actual password.

////
This mapping specifies the following qualities for the data set:

* The _@timestamp_ field is a date.
//that uses the ISO format `epoch_second`,
//which is the number of seconds since the epoch.
* The _accept_, _deny_, and _total_ fields are long numbers.
* The _host
////

You can then use the {es} `bulk` API to load the data set. The
`upload_server-metrics.sh` script runs commands similar to the following
example, which loads the four JSON files:

[source,shell]
----------------------------------
curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"

curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json"

curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"

curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------

TIP: These commands upload a total of 200MB of data. The data is split into
four files because the `_bulk` API has a maximum request size limit of 100MB.

These commands might take some time to run, depending on the computing resources
available.

You can verify that the data was loaded successfully with the following command:

[source,shell]
----------------------------------
curl 'http://localhost:9200/_cat/indices?v' -u elastic:x-pack-test-password
----------------------------------

You should see output similar to the following:

[source,shell]
----------------------------------
health status index          ... pri rep docs.count ...
green  open   server-metrics ...   1   0     905940 ...
----------------------------------

Next, you must define an index pattern for this data set:

. Open {kib} in your web browser and log in. If you are running {kib}
locally, go to `http://localhost:5601/`.

. Click the **Management** tab, then **Index Patterns**.

. If you already have index patterns, click the plus sign (+) to define a new
one. Otherwise, the **Configure an index pattern** wizard is already open.

. For this tutorial, any pattern that matches the name of the index you've
loaded will work. For example, enter `server-metrics*` as the index pattern.

. Verify that the **Index contains time-based events** option is checked.

. Select the `@timestamp` field from the **Time-field name** list.

. Click **Create**.

This data set can now be analyzed in {ml} jobs in {kib}.