[[ml-gs-data]] === Identifying Data for Analysis For the purposes of this tutorial, we provide sample data that you can play with and search in {es}. When you consider your own data, however, it's important to take a moment and think about where the {xpackml} features will be most impactful. The first consideration is that it must be time series data. The {ml} features are designed to model and detect anomalies in time series data. The second consideration, especially when you are first learning to use {ml}, is the importance of the data and how familiar you are with it. Ideally, it is information that contains key performance indicators (KPIs) for the health, security, or success of your business or system. It is information that you need to monitor and act on when anomalous behavior occurs. You might even have {kib} dashboards that you're already using to watch this data. The better you know the data, the quicker you will be able to create {ml} jobs that generate useful insights. The final consideration is where the data is located. This tutorial assumes that your data is stored in {es}. It guides you through the steps required to create a _{dfeed}_ that passes data to a job. If your own data is outside of {es}, analysis is still possible by using a post data API. IMPORTANT: If you want to create {ml} jobs in {kib}, you must use {dfeeds}. That is to say, you must store your input data in {es}. When you create a job, you select an existing index pattern and {kib} configures the {dfeed} for you under the covers. [float] [[ml-gs-sampledata]] ==== Obtaining a Sample Data Set In this step we will upload some sample data to {es}. This is standard {es} functionality, and is needed to set the stage for using {ml}. The sample data for this tutorial contains information about the requests that are received by various applications and services in a system. A system administrator might use this type of information to track the total number of requests across all of the infrastructure. If the number of requests increases or decreases unexpectedly, for example, this might be an indication that there is a problem or that resources need to be redistributed. By using the {xpack} {ml} features to model the behavior of this data, it is easier to identify anomalies and take appropriate action. Download this sample data by clicking here: https://download.elastic.co/demos/machine_learning/gettingstarted/server_metrics.tar.gz[server_metrics.tar.gz] Use the following commands to extract the files: [source,sh] ---------------------------------- tar -zxvf server_metrics.tar.gz ---------------------------------- Each document in the server-metrics data set has the following schema: [source,js] ---------------------------------- { "index": { "_index":"server-metrics", "_type":"metric", "_id":"1177" } } { "@timestamp":"2017-03-23T13:00:00", "accept":36320, "deny":4156, "host":"server_2", "response":2.4558210155, "service":"app_3", "total":40476 } ---------------------------------- // NOTCONSOLE TIP: The sample data sets include summarized data. For example, the `total` value is a sum of the requests that were received by a specific service at a particular time. If your data is stored in {es}, you can generate this type of sum or average by using aggregations. One of the benefits of summarizing data this way is that {es} automatically distributes these calculations across your cluster. You can then feed this summarized data into {xpackml} instead of raw results, which reduces the volume of data that must be considered while detecting anomalies. For the purposes of this tutorial, however, these summary values are stored in {es}. For more information, see <>. Before you load the data set, you need to set up {ref}/mapping.html[_mappings_] for the fields. Mappings divide the documents in the index into logical groups and specify a field's characteristics, such as the field's searchability or whether or not it's _tokenized_, or broken up into separate words. The sample data includes an `upload_server-metrics.sh` script, which you can use to create the mappings and load the data set. You can download it by clicking here: https://download.elastic.co/demos/machine_learning/gettingstarted/upload_server-metrics.sh[upload_server-metrics.sh] Before you run it, however, you must edit the USERNAME and PASSWORD variables with your actual user ID and password. The script runs a command similar to the following example, which sets up a mapping for the data set: [source,sh] ---------------------------------- curl -u elastic:x-pack-test-password -X PUT -H 'Content-Type: application/json' http://localhost:9200/server-metrics -d '{ "settings":{ "number_of_shards":1, "number_of_replicas":0 }, "mappings":{ "metric":{ "properties":{ "@timestamp":{ "type":"date" }, "accept":{ "type":"long" }, "deny":{ "type":"long" }, "host":{ "type":"keyword" }, "response":{ "type":"float" }, "service":{ "type":"keyword" }, "total":{ "type":"long" } } } } }' ---------------------------------- // NOTCONSOLE NOTE: If you run this command, you must replace `x-pack-test-password` with your actual password. You can then use the {es} `bulk` API to load the data set. The `upload_server-metrics.sh` script runs commands similar to the following example, which loads the four JSON files: [source,sh] ---------------------------------- curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json" curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json" curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json" curl -u elastic:x-pack-test-password -X POST -H "Content-Type: application/json" http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json" ---------------------------------- // NOTCONSOLE TIP: This will upload 200MB of data. This is split into 4 files as there is a maximum 100MB limit when using the `_bulk` API. These commands might take some time to run, depending on the computing resources available. You can verify that the data was loaded successfully with the following command: [source,sh] ---------------------------------- curl 'http://localhost:9200/_cat/indices?v' -u elastic:x-pack-test-password ---------------------------------- // NOTCONSOLE You should see output similar to the following: [source,txt] ---------------------------------- health status index ... pri rep docs.count ... green open server-metrics ... 1 0 905940 ... ---------------------------------- // NOTCONSOLE Next, you must define an index pattern for this data set: . Open {kib} in your web browser and log in. If you are running {kib} locally, go to `http://localhost:5601/`. . Click the **Management** tab, then **{kib}** > **Index Patterns**. . If you already have index patterns, click **Create Index** to define a new one. Otherwise, the **Create index pattern** wizard is already open. . For this tutorial, any pattern that matches the name of the index you've loaded will work. For example, enter `server-metrics*` as the index pattern. . In the **Configure settings** step, select the `@timestamp` field in the **Time Filter field name** list. . Click **Create index pattern**. This data set can now be analyzed in {ml} jobs in {kib}.