OpenSearch/docs/en/ml/getting-started-data.asciidoc

221 lines
8.0 KiB
Plaintext

[[ml-gs-data]]
=== Identifying Data for Analysis
For the purposes of this tutorial, we provide sample data that you can play with
and search in {es}. When you consider your own data, however, it's important to
take a moment and think about where the {xpackml} features will be most
impactful.
The first consideration is that it must be time series data. The {ml} features
are designed to model and detect anomalies in time series data.
The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
information that contains key performance indicators (KPIs) for the health,
security, or success of your business or system. It is information that you need
to monitor and act on when anomalous behavior occurs. You might even have {kib}
dashboards that you're already using to watch this data. The better you know the
data, the quicker you will be able to create {ml} jobs that generate useful
insights.
The final consideration is where the data is located. This tutorial assumes that
your data is stored in {es}. It guides you through the steps required to create
a _{dfeed}_ that passes data to a job. If your own data is outside of {es},
analysis is still possible by using a post data API.
IMPORTANT: If you want to create {ml} jobs in {kib}, you must use {dfeeds}.
That is to say, you must store your input data in {es}. When you create
a job, you select an existing index pattern and {kib} configures the {dfeed}
for you under the covers.
[float]
[[ml-gs-sampledata]]
==== Obtaining a Sample Data Set
In this step we will upload some sample data to {es}. This is standard
{es} functionality, and is needed to set the stage for using {ml}.
The sample data for this tutorial contains information about the requests that
are received by various applications and services in a system. A system
administrator might use this type of information to track the total number of
requests across all of the infrastructure. If the number of requests increases
or decreases unexpectedly, for example, this might be an indication that there
is a problem or that resources need to be redistributed. By using the {xpack}
{ml} features to model the behavior of this data, it is easier to identify
anomalies and take appropriate action.
Download this sample data by clicking here:
https://download.elastic.co/demos/machine_learning/gettingstarted/server_metrics.tar.gz[server_metrics.tar.gz]
Use the following commands to extract the files:
[source,shell]
----------------------------------
tar -zxvf server_metrics.tar.gz
----------------------------------
Each document in the server-metrics data set has the following schema:
[source,js]
----------------------------------
{
"index":
{
"_index":"server-metrics",
"_type":"metric",
"_id":"1177"
}
}
{
"@timestamp":"2017-03-23T13:00:00",
"accept":36320,
"deny":4156,
"host":"server_2",
"response":2.4558210155,
"service":"app_3",
"total":40476
}
----------------------------------
TIP: The sample data sets include summarized data. For example, the `total`
value is a sum of the requests that were received by a specific service at a
particular time. If your data is stored in {es}, you can generate
this type of sum or average by using aggregations. One of the benefits of
summarizing data this way is that {es} automatically distributes
these calculations across your cluster. You can then feed this summarized data
into {xpackml} instead of raw results, which reduces the volume
of data that must be considered while detecting anomalies. For the purposes of
this tutorial, however, these summary values are stored in {es}. For more
information, see <<ml-configuring-aggregation>>.
Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
for the fields. Mappings divide the documents in the index into logical groups
and specify a field's characteristics, such as the field's searchability or
whether or not it's _tokenized_, or broken up into separate words.
The sample data includes an `upload_server-metrics.sh` script, which you can use
to create the mappings and load the data set. You can download it by clicking
here: https://download.elastic.co/demos/machine_learning/gettingstarted/upload_server-metrics.sh[upload_server-metrics.sh]
Before you run it, however, you must edit the USERNAME and PASSWORD variables
with your actual user ID and password.
The script runs a command similar to the following example, which sets up a
mapping for the data set:
[source,shell]
----------------------------------
curl -u elastic:x-pack-test-password -X PUT -H 'Content-Type: application/json'
http://localhost:9200/server-metrics -d '{
"settings":{
"number_of_shards":1,
"number_of_replicas":0
},
"mappings":{
"metric":{
"properties":{
"@timestamp":{
"type":"date"
},
"accept":{
"type":"long"
},
"deny":{
"type":"long"
},
"host":{
"type":"keyword"
},
"response":{
"type":"float"
},
"service":{
"type":"keyword"
},
"total":{
"type":"long"
}
}
}
}
}'
----------------------------------
NOTE: If you run this command, you must replace `x-pack-test-password` with your
actual password.
////
This mapping specifies the following qualities for the data set:
* The _@timestamp_ field is a date.
//that uses the ISO format `epoch_second`,
//which is the number of seconds since the epoch.
* The _accept_, _deny_, and _total_ fields are long numbers.
* The _host
////
You can then use the {es} `bulk` API to load the data set. The
`upload_server-metrics.sh` script runs commands similar to the following
example, which loads the four JSON files:
[source,shell]
----------------------------------
curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"
curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json"
curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"
curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json"
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------
TIP: This will upload 200MB of data. This is split into 4 files as there is a
maximum 100MB limit when using the `_bulk` API.
These commands might take some time to run, depending on the computing resources
available.
You can verify that the data was loaded successfully with the following command:
[source,shell]
----------------------------------
curl 'http://localhost:9200/_cat/indices?v' -u elastic:x-pack-test-password
----------------------------------
You should see output similar to the following:
[source,shell]
----------------------------------
health status index ... pri rep docs.count ...
green open server-metrics ... 1 0 905940 ...
----------------------------------
Next, you must define an index pattern for this data set:
. Open {kib} in your web browser and log in. If you are running {kib}
locally, go to `http://localhost:5601/`.
. Click the **Management** tab, then **Index Patterns**.
. If you already have index patterns, click the plus sign (+) to define a new
one. Otherwise, the **Configure an index pattern** wizard is already open.
. For this tutorial, any pattern that matches the name of the index you've
loaded will work. For example, enter `server-metrics*` as the index pattern.
. Verify that the **Index contains time-based events** is checked.
. Select the `@timestamp` field from the **Time-field name** list.
. Click **Create**.
This data set can now be analyzed in {ml} jobs in {kib}.