[[ml-scenarios]]
== Use Cases TBD

////
Enterprises, government organizations, and cloud-based service providers process
such massive volumes of machine data every day that real-time human analysis is
impossible. Changing behaviors hidden in this data provide the information needed
to quickly resolve massive service outages, detect security breaches before they
result in the theft of millions of credit records, or identify the next big trend
in consumer patterns.

Current search and analysis, performance management, and cyber security tools are
unable to find these anomalies without significant human work in the form of
thresholds, rules, signatures, and data models.

Advanced anomaly detection techniques learn the normal behavior patterns
represented by the data, then identify and cross-correlate anomalies. By using
them, you can identify performance, security, and operational anomalies and their
causes as they develop, and act on them before they impact the business.

While anomaly detection is applicable to any type of data, we focus on machine data
scenarios. Enterprise application developers, cloud service providers, and
technology vendors need to harness the power of machine learning-based anomaly
detection analytics to better manage complex online services, detect the earliest
signs of advanced security threats, and gain insight into the business
opportunities and risks represented by changing behaviors hidden in their massive
data sets. Here are some real-world examples.

=== Eliminating noise generated by threshold-based alerts

Modern IT systems are highly instrumented and can generate terabytes of machine
data a day. Traditional methods for analyzing data involve alerting when metric
values exceed a known value (static thresholds) or looking for simple statistical
deviations (dynamic thresholds). Setting accurate thresholds for each metric at
different times of day is practically impossible.
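
The following toy Python sketch illustrates why a single static threshold breaks
down for a metric with a daily cycle, and how a per-hour-of-day baseline learned
from history avoids the problem. It is not how the {ml} features in {xpack} are
implemented: the hourly request counts, the normal-distribution assumption, the
probability cutoff, and all function names (`static_alerts`, `learn_baseline`,
`baseline_alerts`) are hypothetical and chosen only for illustration.

```python
import math
import statistics

# Hypothetical hourly request counts for a "typical" day (index = hour of day).
TYPICAL_DAY = [20, 15, 12, 10, 10, 15, 40, 120, 300, 420, 480, 500,
               510, 500, 480, 450, 400, 350, 250, 150, 90, 60, 40, 30]


def make_history(days):
    """Build `days` days of made-up history: the typical day plus small,
    deterministic jitter of -4, 0, or +4 per hour."""
    return [[TYPICAL_DAY[h] + ((d + h) % 3 - 1) * 4 for h in range(24)]
            for d in range(days)]


def static_alerts(values, threshold):
    """Hours whose value exceeds one fixed threshold (a static threshold)."""
    return [h for h, v in enumerate(values) if v > threshold]


def learn_baseline(history):
    """Learn a (mean, sample standard deviation) pair for each hour of day."""
    baseline = []
    for h in range(24):
        column = [day[h] for day in history]
        baseline.append((statistics.mean(column), statistics.stdev(column)))
    return baseline


def tail_probability(value, mean, stdev):
    """Two-sided probability of seeing a value at least this far from the mean,
    assuming (for illustration only) a normal distribution."""
    z = abs(value - mean) / stdev
    return math.erfc(z / math.sqrt(2))


def baseline_alerts(values, baseline, max_probability=1e-4):
    """Hours whose value is improbable given that hour's learned behavior."""
    return [h for h, v in enumerate(values)
            if tail_probability(v, *baseline[h]) < max_probability]


history = make_history(7)
today = list(TYPICAL_DAY)
today[2] = 300  # unusual spike at 2am; normal daytime peaks are higher

print(static_alerts(today, 550))  # → []
print(static_alerts(today, 250))  # → [2, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
print(baseline_alerts(today, learn_baseline(history)))  # → [2]
```

With these made-up numbers, no single static threshold works for every hour: 550
misses the 2am spike entirely, and 250 also fires on every normal daytime peak. The
per-hour baseline flags only hour 2, because 300 requests is typical at 8am but
highly improbable at 2am.
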

As a result, static thresholds generate large volumes of false positives (when the
threshold is set too low) and false negatives (when it is set too high). The {ml}
features in {xpack} automatically learn and calculate the probability of a value
being anomalous based on its historical behavior. This enables accurate alerting
and highlights only the subset of relevant metrics that have changed. These alerts
provide actionable insight into what is a growing mountain of data.

=== Reducing troubleshooting times and subject matter expert (SME) involvement

It is said that 75 percent of troubleshooting time is spent mining data to try to
identify the root cause of an incident. The {ml} features in {xpack} automatically
analyze data and boil down the massive volume of information to the few metrics or
log messages that have changed behavior. This allows subject matter experts (SMEs)
to focus on the subset of information that is relevant to an issue, which greatly
reduces triage time.

//In a major credit services provider, within a month of deployment, the company
//reported that its overall time to triage was reduced by 70 percent and the use of
//outside SMEs’ time to troubleshoot was decreased by 80 percent.

=== Finding and fixing issues before they impact the end user

Large-scale systems, such as online banking, typically require complex
infrastructures involving hundreds of different interdependent applications. Just
accessing an account summary page might involve dozens of different databases,
systems, and applications. Because of their importance to the business, these
systems are typically highly resilient, and a critical problem is not allowed to
recur. If a problem does happen, it is likely to be complicated and to be the
result of a causal sequence of events that span multiple interacting resources.
Troubleshooting would require the analysis of large volumes of data with a wide
range of characteristics and data types.
A variety of experts from multiple disciplines would need to participate in
time-consuming “war rooms” to mine the data for answers. By using {ml} in real
time, you can analyze large volumes of data to raise alerts on early indicators of
problems and highlight the events that likely contributed to the problem.

=== Finding rare events that may be symptomatic of a security issue

With several hundred servers under management, the presence of new processes
running might indicate a security breach. Using typical operational management
techniques, each server would require a period of baselining to identify which
processes are considered standard. Ideally, a baseline would be created for each
server (or server group) and periodically updated, which creates a large
management overhead. With the {ml} features in {xpack}, baselines are
automatically built from the normal behavior patterns of each host, and alerts
are generated when rare events occur.

=== Finding anomalies in periodic data

For data that has periodicity, it is difficult for standard monitoring tools to
accurately tell whether a change in the data is due to a service outage or is the
result of usual time schedules. Daily and weekly trends in the data, along with
peak and off-peak hours, make it difficult to identify anomalies by using standard
threshold-based methods. The minimum and maximum thresholds for SMS text activity
at 2am would be very different from the thresholds that would be effective during
the day. By using {ml}, time-related trends are automatically identified and
smoothed, leaving the residual to be analyzed for anomalies.
////