[[ml-scenarios]]
== Use Cases

Enterprises, government organizations, and cloud-based service providers
process volumes of machine data every day that are far too large for real-time
human analysis. Changing behaviors hidden in this data provide the information
needed to quickly resolve a massive service outage, detect a security breach
before it results in the theft of millions of credit records, or identify the
next big trend in consumer patterns. Current search and analysis, performance
management, and cyber security tools cannot find these anomalies without
significant human effort in the form of thresholds, rules, signatures, and
data models.

Advanced anomaly detection techniques learn the normal behavior patterns
represented by the data, then identify and cross-correlate anomalies.
Performance, security, and operational anomalies and their causes can therefore
be identified as they develop and acted on before they impact the business.

While anomaly detection is applicable to any type of data, we focus on machine
data scenarios. Enterprise application developers, cloud service providers, and
technology vendors need to harness the power of machine learning based anomaly
detection to better manage complex online services, detect the earliest signs
of advanced security threats, and gain insight into the business opportunities
and risks represented by changing behaviors hidden in their massive data sets.
Here are some real-world examples.

=== Eliminating noise generated by threshold-based alerts

Modern IT systems are highly instrumented and can generate terabytes of machine
data a day. Traditional methods for analyzing data involve alerting when metric
values exceed a known value (static thresholds) or looking for simple
statistical deviations (dynamic thresholds). Setting accurate thresholds for
each metric at different times of day is practically impossible; as a result,
static thresholds generate large volumes of false positives (threshold set too
low) and false negatives (threshold set too high).

The {ml} features in {xpack} automatically learn and calculate the probability
of a value being anomalous based on its historical behavior. This enables
accurate alerting and highlights only the subset of relevant metrics that have
changed. These alerts provide actionable insight into a growing mountain of
data.
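
For example, a minimal sketch of a single-metric job that models the mean of a
response time field, rather than applying a static threshold to it, might look
like the following. The job name, field names, and bucket span are illustrative
assumptions; adapt them to your own data.

[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/response-times
{
  "description": "Example only: baseline mean response time instead of a static threshold",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "mean(responsetime)",
        "function": "mean",
        "field_name": "responsetime"
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
--------------------------------------------------
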
=== Reducing troubleshooting times and subject matter expert (SME) involvement

It is said that 75 percent of troubleshooting time is spent mining data trying
to identify the root cause of an incident. The {ml} features in {xpack}
automatically analyze data and boil down the massive volume of information
to the few metrics or log messages that have changed behavior. This allows SMEs
to focus on the subset of information that is relevant to an issue, which
greatly reduces triage time.

//In a major credit services provider, within a month of deployment, the company
//reported that its overall time to triage was reduced by 70 percent and the use of
//outside SMEs time to troubleshoot was decreased by 80 percent.

=== Finding and fixing issues before they impact the end user

Large-scale systems, such as online banking, typically require complex
infrastructures involving hundreds of different interdependent applications.
Just accessing an account summary page might involve dozens of different
databases, systems, and applications.

Because of their importance to the business, these systems are typically highly
resilient and a critical problem will not be allowed to recur. If a problem
happens, it is likely to be complicated and the result of a causal sequence of
events that span multiple interacting resources. Troubleshooting would require
the analysis of large volumes of data with a wide range of characteristics and
data types. A variety of experts from multiple disciplines would need to
participate in time-consuming “war rooms” to mine the data for answers.

By using {ml} in real time, large volumes of data can be analyzed to provide
alerts on early indicators of problems and to highlight the events that likely
contributed to the problem.

=== Finding rare events that may be symptomatic of a security issue

With several hundred servers under management, the presence of new processes
running might indicate a security breach.

Using typical operational management techniques, each server would require a
period of baselining in order to identify which processes are considered
standard. Ideally, a baseline would be created for each server (or server
group) and periodically updated, making this a large management overhead.

By using the {ml} features in {xpack}, baselines are automatically built from
the normal behavior patterns of each host and alerts are generated when rare
events occur.
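
As a rough sketch, such a job could use the `rare` function on the process name
and partition the analysis by host, so that each host builds its own baseline.
The job name, field names (`process`, `host`), and bucket span below are
illustrative assumptions, not a prescribed configuration.

[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/rare-processes
{
  "description": "Example only: flag processes that are rare for a given host",
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "detector_description": "rare process per host",
        "function": "rare",
        "by_field_name": "process",
        "partition_field_name": "host"
      }
    ],
    "influencers": [ "host" ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
--------------------------------------------------
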
=== Finding anomalies in periodic data

For data that has periodicity, it is difficult for standard monitoring tools to
tell accurately whether a change in the data is due to a service outage or to a
usual time-based pattern. Daily and weekly trends in the data, along with peak
and off-peak hours, make it difficult to identify anomalies using standard
threshold-based methods. Minimum and maximum thresholds for SMS text activity
at 2am would be very different from the thresholds that would be effective
during the day.

By using {ml}, time-related trends are automatically identified and smoothed,
leaving the residual to be analyzed for anomalies.
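
For instance, a sketch of such a job might simply count SMS events in each
bucket and let the model learn the daily and weekly cycles, instead of
maintaining separate day and night thresholds. The job name, bucket span, and
time field here are assumptions for illustration only.

[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/sms-activity
{
  "description": "Example only: model periodic SMS volumes rather than fixed day/night thresholds",
  "analysis_config": {
    "bucket_span": "30m",
    "detectors": [
      {
        "detector_description": "count of SMS events",
        "function": "count"
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
--------------------------------------------------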