[[ml-getting-started]]
== Getting Started

////
{xpackml} features automatically detect:
* Anomalies in single or multiple time series
* Outliers in a population (also known as _entity profiling_)
* Rare events (also known as _log categorization_)

This tutorial focuses on an anomaly detection scenario in single time series.
////

Ready to get some hands-on experience with the {xpackml} features? This
tutorial shows you how to:

* Load a sample data set into {es}
* Create a {ml} job
* Use the results to identify possible anomalies in the data

At the end of this tutorial, you should have a good idea of what {ml} is and
will hopefully be inspired to use it to detect anomalies in your own data.

You might also be interested in these video tutorials:

* https://www.elastic.co/videos/machine-learning-tutorial-creating-a-single-metric-job[Machine Learning for the Elastic Stack: Creating a single metric job]
* https://www.elastic.co/videos/machine-learning-tutorial-creating-a-multi-metric-job[Machine Learning for the Elastic Stack: Creating a multi-metric job]

[float]
[[ml-gs-sysoverview]]
=== System Overview

To follow the steps in this tutorial, you will need the following
components of the Elastic Stack:

* {es} {version}, which stores the data and the analysis results
* {xpack} {version}, which includes the beta {ml} features for both {es} and {kib}
* {kib} {version}, which provides a helpful user interface for creating and
viewing jobs +

//All {ml} features are available to use as an API, however this tutorial
//will focus on using the {ml} tab in the {kib} UI.

WARNING: The {xpackml} features are in beta and subject to change.
Beta features are not subject to the same support SLA as GA features,
and deployment in production is at your own risk.

See the https://www.elastic.co/support/matrix[Elastic Support Matrix] for
information about supported operating systems.
See {stack-ref}/installing-elastic-stack.html[Installing the Elastic Stack] for
information about installing each of the components.

NOTE: To get started, you can install {es} and {kib} on a
single VM or even on your laptop (requires 64-bit OS).
As you add more data and your traffic grows,
you'll want to replace the single {es} instance with a cluster.

When you install {xpack} into {es} and {kib}, the {ml} features are
enabled by default. If you have multiple nodes in your cluster, you can
optionally dedicate nodes to specific purposes. If you want to control which
nodes are _machine learning nodes_ or limit which nodes run resource-intensive
activity related to jobs, see <>.

[float]
[[ml-gs-users]]
==== Users, Roles, and Privileges

The {xpackml} features implement cluster privileges and built-in roles to
make it easier to control which users have authority to view and manage the
jobs, {dfeeds}, and results.

By default, you can perform all of the steps in this tutorial by using the
built-in `elastic` super user. The default password for the `elastic` user is
`changeme`. For information about how to change that password, see <>.
If you are performing these steps in a production environment, take extra care
because `elastic` has the `superuser` role and you could inadvertently make
significant changes to the system.

You can alternatively assign the `machine_learning_admin` and `kibana_user`
roles to a user ID of your choice.

For more information, see <> and <>.

[[ml-gs-data]]
=== Identifying Data for Analysis

For the purposes of this tutorial, we provide sample data that you can play with
and search in {es}. When you consider your own data, however, it's important to
take a moment and think about where the {xpackml} features will be most
impactful.

The first consideration is that it must be time series data. The {ml} features
are designed to model and detect anomalies in time series data.
The second consideration, especially when you are first learning to use {ml},
is the importance of the data and how familiar you are with it. Ideally, it is
information that contains key performance indicators (KPIs) for the health,
security, or success of your business or system. It is information that you need
to monitor and act on when anomalous behavior occurs. You might even have {kib}
dashboards that you're already using to watch this data. The better you know the
data, the quicker you will be able to create {ml} jobs that generate useful
insights.

The final consideration is where the data is located. This tutorial assumes that
your data is stored in {es}. It guides you through the steps required to create
a _{dfeed}_ that passes data to a job. If your own data is outside of {es},
analysis is still possible by using a post data API.

IMPORTANT: If you want to create {ml} jobs in {kib}, you must use {dfeeds}.
That is to say, you must store your input data in {es}. When you create
a job, you select an existing index pattern and {kib} configures the {dfeed}
for you under the covers.

[float]
[[ml-gs-sampledata]]
==== Obtaining a Sample Data Set

In this step we will upload some sample data to {es}. This is standard
{es} functionality, and is needed to set the stage for using {ml}.

The sample data for this tutorial contains information about the requests that
are received by various applications and services in a system. A system
administrator might use this type of information to track the total number of
requests across all of the infrastructure. If the number of requests increases
or decreases unexpectedly, for example, this might be an indication that there
is a problem or that resources need to be redistributed. By using the {xpack}
{ml} features to model the behavior of this data, it is easier to identify
anomalies and take appropriate action.
Download this sample data by clicking here:
https://download.elastic.co/demos/machine_learning/gettingstarted/server_metrics.tar.gz[server_metrics.tar.gz]

Use the following command to extract the files:

[source,shell]
----------------------------------
tar -zxvf server_metrics.tar.gz
----------------------------------

Each document in the server-metrics data set has the following schema:

[source,js]
----------------------------------
{
  "index":
  {
    "_index":"server-metrics",
    "_type":"metric",
    "_id":"1177"
  }
}
{
  "@timestamp":"2017-03-23T13:00:00",
  "accept":36320,
  "deny":4156,
  "host":"server_2",
  "response":2.4558210155,
  "service":"app_3",
  "total":40476
}
----------------------------------

TIP: The sample data sets include summarized data. For example, the `total`
value is a sum of the requests that were received by a specific service at a
particular time. If your data is stored in {es}, you can generate
this type of sum or average by using aggregations. One of the benefits of
summarizing data this way is that {es} automatically distributes
these calculations across your cluster. You can then feed this summarized data
into {xpackml} instead of raw results, which reduces the volume
of data that must be considered while detecting anomalies. For the purposes of
this tutorial, however, these summary values are stored in {es},
rather than created using the {ref}/search-aggregations.html[_aggregations framework_].
//TBD link to working with aggregations page

Before you load the data set, you need to set up {ref}/mapping.html[_mappings_]
for the fields. Mappings divide the documents in the index into logical groups
and specify a field's characteristics, such as the field's searchability or
whether or not it's _tokenized_, or broken up into separate words.

The sample data includes an `upload_server-metrics.sh` script, which you can use
to create the mappings and load the data set.
You can download it by clicking here: https://download.elastic.co/demos/machine_learning/gettingstarted/upload_server-metrics.sh[upload_server-metrics.sh]
Before you run it, however, you must edit the USERNAME and PASSWORD variables
with your actual user ID and password.

The script runs a command similar to the following example, which sets up a
mapping for the data set:

[source,shell]
----------------------------------
curl -u elastic:changeme -X PUT -H 'Content-Type: application/json' \
http://localhost:9200/server-metrics -d '{
   "settings":{
      "number_of_shards":1,
      "number_of_replicas":0
   },
   "mappings":{
      "metric":{
         "properties":{
            "@timestamp":{
               "type":"date"
            },
            "accept":{
               "type":"long"
            },
            "deny":{
               "type":"long"
            },
            "host":{
               "type":"keyword"
            },
            "response":{
               "type":"float"
            },
            "service":{
               "type":"keyword"
            },
            "total":{
               "type":"long"
            }
         }
      }
   }
}'
----------------------------------

NOTE: If you run this command, you must replace `changeme` with your
actual password.

////
This mapping specifies the following qualities for the data set:

* The _@timestamp_ field is a date.
//that uses the ISO format `epoch_second`,
//which is the number of seconds since the epoch.
* The _accept_, _deny_, and _total_ fields are long numbers.
* The _host
////

You can then use the {es} `bulk` API to load the data set.
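
Each document in a bulk request takes two lines: an action line that names the
target index, type, and ID, followed by the document source, as in the schema
shown earlier. The following is a minimal sketch of that layout and is not part
of the tutorial scripts. The `BULK_FILE` variable, the `sample_bulk.json` file
name, the `bulk_pair` and `bulk_doc_count` helpers, and the values of the second
document are all made up for illustration.

[source,shell]
----------------------------------
#!/usr/bin/env bash
# Illustrative sketch only: write a two-document bulk file by hand.
# BULK_FILE, bulk_pair, and bulk_doc_count are hypothetical names.

BULK_FILE="${BULK_FILE:-sample_bulk.json}"

# Print the action line and the source line for one document.
bulk_pair() {
  local id="$1" timestamp="$2" total="$3"
  printf '{"index":{"_index":"server-metrics","_type":"metric","_id":"%s"}}\n' "$id"
  printf '{"@timestamp":"%s","total":%s}\n' "$timestamp" "$total"
}

# Count the documents in a bulk file: each document occupies two lines.
bulk_doc_count() {
  echo $(( $(wc -l < "$1") / 2 ))
}

# Every line, including the last one, ends with a newline, which the
# bulk API requires.
{
  bulk_pair 1177 2017-03-23T13:00:00 40476   # values from the sample schema
  bulk_pair 1178 2017-03-23T13:10:00 40000   # made-up values
} > "$BULK_FILE"

bulk_doc_count "$BULK_FILE"
----------------------------------

The helpers only illustrate the file format. There is no need to load this file
into the `server-metrics` index.
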
The `upload_server-metrics.sh` script runs commands similar to the following
example, which loads the four JSON files:

[source,shell]
----------------------------------
curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_1.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_2.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_3.json"

curl -u elastic:changeme -X POST -H "Content-Type: application/json" \
http://localhost:9200/server-metrics/_bulk --data-binary "@server-metrics_4.json"
----------------------------------

TIP: These commands upload 200MB of data. The data is split into four files
because there is a maximum 100MB limit when using the `_bulk` API.

These commands might take some time to run, depending on the computing resources
available.

You can verify that the data was loaded successfully with the following command:

[source,shell]
----------------------------------
curl 'http://localhost:9200/_cat/indices?v' -u elastic:changeme
----------------------------------

You should see output similar to the following:

[source,shell]
----------------------------------
health status index          ... pri rep docs.count docs.deleted store.size ...
green  open   server-metrics ...   1   0     905940            0    120.5mb ...
----------------------------------

Next, you must define an index pattern for this data set:

. Open {kib} in your web browser and log in. If you are running {kib}
locally, go to `http://localhost:5601/`.

. Click the **Management** tab, then **Index Patterns**.

. If you already have index patterns, click the plus sign (+) to define a new
one. Otherwise, the **Configure an index pattern** wizard is already open.

. For this tutorial, any pattern that matches the name of the index you've
loaded will work.
For example, enter `server-metrics*` as the index pattern.

. Verify that **Index contains time-based events** is checked.

. Select the `@timestamp` field from the **Time-field name** list.

. Click **Create**.

This data set can now be analyzed in {ml} jobs in {kib}.

[[ml-gs-jobs]]
=== Creating Jobs

Machine learning jobs contain the configuration information and metadata
necessary to perform an analytical task. They also contain the results of the
analytical task.

[NOTE]
--
This tutorial uses {kib} to create jobs and view results, but you can
alternatively use APIs to accomplish most tasks.
For API reference information, see <>.

The {xpackml} features in {kib} use pop-ups. You must configure your
web browser so that it does not block pop-up windows or create an
exception for your {kib} URL.
--

To work with jobs in {kib}:

. Open {kib} in your web browser and log in. If you are running {kib}
locally, go to `http://localhost:5601/`.

. Click **Machine Learning** in the side navigation: +
+
--
[role="screenshot"]
image::images/ml-kibana.jpg[Job Management]
--

You can choose to create single metric, multi-metric, or advanced jobs in
{kib}. In this tutorial, the goal is to detect anomalies in the total requests
received by your applications and services. The sample data contains a single
key performance indicator to track this, which is the total requests over time.
It is therefore logical to start by creating a single metric job for this KPI.

TIP: If you are using aggregated data, you can create an advanced job
and configure it to use a `summary_count_field_name`. The {ml} algorithms will
make the best possible use of summarized data in this case. For simplicity, this
tutorial does not make use of that advanced functionality.

[float]
[[ml-gs-job1-create]]
==== Creating a Single Metric Job

A single metric job contains a single _detector_.
A detector defines the type of analysis that will occur (for example,
`max`, `average`, or `rare` analytical functions) and the fields that will be
analyzed.

To create a single metric job in {kib}:

. Click **Machine Learning** in the side navigation,
then click **Create new job**.

. Click **Create single metric job**. +
+
--
[role="screenshot"]
image::images/ml-create-jobs.jpg["Create a new job"]
--

. Click the `server-metrics` index. +
+
--
[role="screenshot"]
image::images/ml-gs-index.jpg["Select an index"]
--

. Configure the job by providing the following information: +
+
--
[role="screenshot"]
image::images/ml-gs-single-job.jpg["Create a new job from the server-metrics index"]
--

.. For the **Aggregation**, select `Sum`. This value specifies the analysis
function that is used.
+
--
Some of the analytical functions look for single anomalous data points. For
example, `max` identifies the maximum value that is seen within a bucket.
Others perform some aggregation over the length of the bucket. For example,
`mean` calculates the mean of all the data points seen within the bucket.
Similarly, `count` calculates the total number of data points within the bucket.
In this tutorial, you are using the `sum` function, which calculates the sum of
the specified field's values within the bucket.
--

.. For the **Field**, select `total`. This value specifies the field that
the detector uses in the function.
+
--
NOTE: Some functions such as `count` and `rare` do not require fields.
--

.. For the **Bucket span**, enter `10m`. This value specifies the size of the
interval that the analysis is aggregated into.
+
--
The {xpackml} features use the concept of a bucket to divide up the time series
into batches for processing.
For example, if you are monitoring
the total number of requests in the system,
//and receive a data point every 10 minutes
using a bucket span of 1 hour means that at the end of each hour, the analysis
calculates the sum of the requests for the last hour and computes the
anomalousness of that value compared to previous hours.

The bucket span has two purposes: it dictates over what time span to look for
anomalous features in data, and also determines how quickly anomalies can be
detected. Choosing a shorter bucket span enables anomalies to be detected more
quickly. However, there is a risk of being too sensitive to natural variations
or noise in the input data. Choosing too long a bucket span can mean that
interesting anomalies are averaged away. There is also the possibility that the
aggregation might smooth out some anomalies based on when the bucket starts
in time.

The bucket span has a significant impact on the analysis. When you're trying to
determine what value to use, take into account the granularity at which you
want to perform the analysis, the frequency of the input data, the duration of
typical anomalies, and the frequency at which alerting is required.
--

. Determine whether you want to process all of the data or only part of it. If
you want to analyze all of the existing data, click
**Use full server-metrics* data**. If you want to see what happens when you
stop and start {dfeeds} and process additional data over time, click the time
picker in the {kib} toolbar. Since the sample data spans a period of time
between March 23, 2017 and April 22, 2017, click **Absolute**. Set the start
time to March 23, 2017 and the end time to April 1, 2017, for example. Once
you've got the time range set up, click the **Go** button. +
+
--
[role="screenshot"]
image::images/ml-gs-job1-time.jpg["Setting the time range for the {dfeed}"]
--
+
--
A graph is generated, which represents the total number of requests over time.
--

. Provide a name for the job, for example `total-requests`. The job name must
be unique in your cluster. You can also optionally provide a description of the
job.

. Click **Create Job**. +
+
--
[role="screenshot"]
image::images/ml-gs-job1.jpg["A graph of the total number of requests over time"]
--

As the job is created, the graph is updated to give a visual representation of
the progress of {ml} as the data is processed. This view is only available while
the job is running.

TIP: The `create_single_metric.sh` script creates a similar job and {dfeed} by
using the {ml} APIs. You can download that script by clicking
here: https://download.elastic.co/demos/machine_learning/gettingstarted/create_single_metric.sh[create_single_metric.sh]
For API reference information, see <>.

[[ml-gs-job1-manage]]
=== Managing Jobs

After you create a job, you can see its status in the **Job Management** tab: +

[role="screenshot"]
image::images/ml-gs-job1-manage1.jpg["Status information for the total-requests job"]

The following information is provided for each job:

Job ID::
The unique identifier for the job.

Description::
The optional description of the job.

Processed records::
The number of records that have been processed by the job.

Memory status::
The status of the mathematical models. When you create jobs by using the APIs or
by using the advanced options in {kib}, you can specify a `model_memory_limit`.
That value is the maximum amount of memory resources, in MiB, that the
mathematical models can use. Once that limit is approached, data pruning becomes
more aggressive. Upon exceeding that limit, new entities are not modeled.
The default value is `4096`. The memory status field reflects whether you have
reached or exceeded the model memory limit. It can have one of the following
values: +
`ok`::: The models stayed below the configured value.
`soft_limit`::: The models used more than 60% of the configured memory limit
and older unused models will be pruned to free up space.
`hard_limit`::: The models used more space than the configured memory limit.
As a result, not all incoming data was processed.

Job state::
The status of the job, which can be one of the following values: +
`opened`::: The job is available to receive and process data.
`closed`::: The job finished successfully with its model state persisted.
The job must be opened before it can accept further data.
`closing`::: The job close action is in progress and has not yet completed.
A closing job cannot accept further data.
`failed`::: The job did not finish successfully due to an error.
This situation can occur due to invalid input data.
If the job has irrevocably failed, it must be force closed and then deleted.
If the {dfeed} can be corrected, the job can be closed and then re-opened.

{dfeed-cap} state::
The status of the {dfeed}, which can be one of the following values: +
`started`::: The {dfeed} is actively receiving data.
`stopped`::: The {dfeed} is stopped and will not receive data until it is
re-started.

Latest timestamp::
The timestamp of the last processed record.

If you click the arrow beside the name of the job, you can show or hide
additional information, such as the settings, configuration information, or
messages for the job.

You can also click one of the **Actions** buttons to start the {dfeed}, edit
the job or {dfeed}, and clone or delete the job, for example.

[float]
[[ml-gs-job1-datafeed]]
==== Managing {dfeeds-cap}

A {dfeed} can be started and stopped multiple times throughout its lifecycle.
If you want to retrieve more data from {es} and the {dfeed} is stopped, you must
restart it.

For example, if you did not use the full data when you created the job, you can
now process the remaining data by restarting the {dfeed}:

. In the **Machine Learning** / **Job Management** tab, click the following
button to start the {dfeed}: image:images/ml-start-feed.jpg["Start {dfeed}"]

. Choose a start time and end time.
For example, click **Continue from 2017-04-01 23:59:00** and select
**2017-04-30** as the search end time. Then click **Start**. The date picker
defaults to the latest timestamp of processed data. Be careful not to leave any
gaps in the analysis; otherwise, you might miss anomalies. +
+
--
[role="screenshot"]
image::images/ml-gs-job1-datafeed.jpg["Restarting a {dfeed}"]
--

The {dfeed} state changes to `started`, the job state changes to `opened`,
and the number of processed records increases as the new data is analyzed. The
latest timestamp also advances. For example:

[role="screenshot"]
image::images/ml-gs-job1-manage2.jpg["Job opened and {dfeed} started"]

TIP: If your data is being loaded continuously, you can continue running the job
in real time. For this, start your {dfeed} and select **No end time**.

If you want to stop the {dfeed} at this point, you can click the following
button: image:images/ml-stop-feed.jpg["Stop {dfeed}"]

Now that you have processed all the data, let's start exploring the job results.

[[ml-gs-jobresults]]
=== Exploring Job Results

The {xpackml} features analyze the input stream of data, model its behavior,
and perform analysis based on the detectors you defined in your job. When an
event occurs outside of the model, that event is identified as an anomaly.

Result records for each anomaly are stored in `.ml-anomalies-*` indices in {es}.
By default, the name of the index where {ml} results are stored is labelled
`shared`, which corresponds to the `.ml-anomalies-shared` index.

You can use the **Anomaly Explorer** or the **Single Metric Viewer** in {kib} to
view the analysis results.

Anomaly Explorer::
  This view contains swim lanes showing the maximum anomaly score over time.
  There is an overall swim lane that shows the overall score for the job, and
  also swim lanes for each influencer. By selecting a block in a swim lane, the
  anomaly details are displayed alongside the original source data (where
  applicable).
//TBD: Are they swimlane blocks, tiles, segments or cards? hmmm
//TBD: Do the time periods in the heat map correspond to buckets? hmmm is it a heat map?
//As time is the x-axis, and the block sizes stay the same, it feels more intuitive to call it a swimlane.
//The swimlane bucket intervals depend on the time range selected. Their smallest possible
//granularity is a bucket, but if you have a big time range selected, then they will span many buckets

Single Metric Viewer::
  This view contains a chart that represents the actual and expected values over
  time. This is only available for jobs that analyze a single time series and
  where `model_plot_config` is enabled. As in the **Anomaly Explorer**, anomalous
  data points are shown in different colors depending on their score.

[float]
[[ml-gs-job1-analyze]]
==== Exploring Single Metric Job Results

By default when you view the results for a single metric job, the
**Single Metric Viewer** opens:

[role="screenshot"]
image::images/ml-gs-job1-analysis.jpg["Single Metric Viewer for total-requests job"]

The blue line in the chart represents the actual data values. The shaded blue
area represents the bounds for the expected values. The area between the upper
and lower bounds contains the most likely values for the model. If a value is
outside of this area then it can be said to be anomalous.

If you slide the time selector from the beginning of the data to the end of the
data, you can see how the model improves as it processes more data. At the
beginning, the expected range of values is pretty broad and the model is not
capturing the periodicity in the data. But it quickly learns and begins to
reflect the daily variation.

Any data points outside the range that was predicted by the model are marked
as anomalies. When you have high volumes of real-life data, many anomalies
might be found. These vary in probability from very likely to highly unlikely,
that is to say, from not particularly anomalous to highly anomalous.
There can be none, one or two, tens, or sometimes hundreds of anomalies found
within each bucket. There can be many thousands found per job. In order to
provide a sensible view of the results, an _anomaly score_ is calculated for
each bucket time interval. The anomaly score is a value from 0 to 100, which
indicates the significance of the observed anomaly compared to previously seen
anomalies. The highly anomalous values are shown in red and the low scored
values are indicated in blue. An interval with a high anomaly score is
significant and requires investigation.

Slide the time selector to a section of the time series that contains a red
anomaly data point. If you hover over the point, you can see more information
about that data point. You can also see details in the **Anomalies** section
of the viewer. For example:

[role="screenshot"]
image::images/ml-gs-job1-anomalies.jpg["Single Metric Viewer Anomalies for total-requests job"]

For each anomaly you can see key details such as the time, the actual and
expected ("typical") values, and their probability.

You can see the same information in a different format by using the
**Anomaly Explorer**:

[role="screenshot"]
image::images/ml-gs-job1-explorer.jpg["Anomaly Explorer for total-requests job"]

Click one of the red blocks in the swim lane to see details about the anomalies
that occurred in that time interval. For example:

[role="screenshot"]
image::images/ml-gs-job1-explorer-anomaly.jpg["Anomaly Explorer details for total-requests job"]

After you have identified anomalies, often the next step is to try to determine
the context of those situations. For example, are there other factors that are
contributing to the problem? Are the anomalies confined to particular
applications or servers? You can begin to troubleshoot these situations by
layering additional jobs or creating multi-metric jobs.

////
The troubleshooting job would not create alarms of its own, but rather would
help explain the overall situation.
It's usually a different job because it's
operating on different indices. Layering jobs is an important concept.
////
////
[float]
[[ml-gs-job2-create]]
==== Creating a Multi-Metric Job

TBD.

* Walk through creation of a simple multi-metric job.
* Provide overview of:
** partition fields,
** influencers
*** An influencer is someone or something that has influenced or contributed to the anomaly.
Results are aggregated for each influencer, for each bucket, across all detectors.
In this way, a combined anomaly score is calculated for each influencer,
which determines its relative anomalousness. You can specify one or many influencers.
Picking an influencer is strongly recommended for the following reasons:
**** It allows you to blame someone/something for the anomaly
**** It simplifies and aggregates results
*** The best influencer is the person or thing that you want to blame for the anomaly.
In many cases, users or client IP make excellent influencers.
*** By/over/partition fields are usually good candidates for influencers.
*** Influencers can be any field in the source data; they do not need to be fields
specified in detectors, although they often are.
** by/over fields,
*** detectors
**** You can have more than one detector in a job which is more efficient than
running multiple jobs against the same data stream.

//http://www.prelert.com/docs/behavioral_analytics/latest/concepts/multivariate.html

[float]
[[ml-gs-job2-analyze]]
===== Viewing Multi-Metric Job Results

TBD.

* Walk through exploration of job results.
* Describe how influencer detection accelerates root cause identification.

////
////
* Provide brief overview of statistical models and/or link to more info.
* Possibly discuss effect of altering bucket span.

The anomaly score is a sophisticated aggregation of the anomaly records in the
bucket. The calculation is optimized for high throughput, gracefully ages
historical data, and reduces the signal to noise levels.
It adjusts for variations in event rate, takes into account the frequency and
the level of anomalous activity and is adjusted relative to past anomalous
behavior. In addition, [the anomaly score] is boosted if anomalous activity
occurs for related entities, for example if disk IO and CPU are both behaving
unusually for a given host.
** Once an anomalous time interval has been identified, it can be expanded to
view the detailed anomaly records which are the significant causal factors.
////
////
[[ml-gs-alerts]]
=== Creating Alerts for Job Results

TBD.

* Walk through creation of simple alert for anomalous data?

////
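
The anomaly records that the viewers display can also be retrieved from the
command line. The following is a minimal sketch, assuming that your version
exposes a get records endpoint at
`_xpack/ml/anomaly_detectors/<job_id>/results/records` and that it accepts
`sort`, `desc`, and `record_score` in the request body. Check the API reference
for your release before relying on it. The `ES_URL` and `ES_AUTH` variables and
the function names are hypothetical.

[source,shell]
----------------------------------
#!/usr/bin/env bash
# Sketch: fetch the highest-scoring anomaly records for a job.
# ASSUMPTIONS: the endpoint path and the "sort", "desc", and "record_score"
# body fields; verify them against the API reference for your version.

ES_URL="${ES_URL:-http://localhost:9200}"
ES_AUTH="${ES_AUTH:-elastic:changeme}"   # replace with your own credentials

# Print the URL of the assumed get records endpoint for a job ID.
records_url() {
  printf '%s/_xpack/ml/anomaly_detectors/%s/results/records\n' "$ES_URL" "$1"
}

# Print a request body that keeps records at or above a minimum score,
# with the highest scores first.
records_body() {
  printf '{"sort":"record_score","desc":true,"record_score":%s}\n' "$1"
}

# Send the request. This needs a running cluster; the minimum score
# defaults to 75 when no second argument is given.
get_records() {
  curl -s -u "$ES_AUTH" -X POST -H 'Content-Type: application/json' \
    "$(records_url "$1")" -d "$(records_body "${2:-75}")"
}
----------------------------------

For example, `get_records total-requests 90` asks for the records of the
`total-requests` job that have a score of 90 or higher.
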