[[ml-scenarios]]
== Use Cases

Enterprises, government organizations and cloud-based service providers daily
process volumes of machine data so massive as to make real-time human
analysis impossible. Changing behaviors hidden in this data provide the
information needed to quickly resolve a massive service outage, detect security
breaches before they result in the theft of millions of credit records, or
identify the next big trend in consumer patterns. Current search and analysis,
performance management and cyber security tools are unable to find these
anomalies without significant human work in the form of thresholds, rules,
signatures and data models.

Advanced anomaly detection techniques learn the normal behavior patterns
represented by the data, then identify and cross-correlate anomalies. In this
way, performance, security and operational problems and their causes can be
identified as they develop and acted on before they impact the business.

Whilst anomaly detection is applicable to any type of data, we focus on machine
data scenarios. Enterprise application developers, cloud service providers and
technology vendors need to harness the power of machine learning based anomaly
detection analytics to better manage complex online services, detect the
earliest signs of advanced security threats, and gain insight into business
opportunities and risks represented by changing behaviors hidden in their
massive data sets. Here are some real-world examples.

=== Eliminating noise generated by threshold-based alerts

Modern IT systems are highly instrumented and can generate terabytes of machine
data a day. Traditional methods for analyzing this data involve alerting when
metric values exceed a known value (static thresholds) or looking for simple
statistical deviations (dynamic thresholds).

Setting accurate thresholds for each metric at different times of day is
practically impossible. As a result, static thresholds generate large volumes of
false positives (threshold set too low) and false negatives (threshold set too
high).

The {ml} features in {xpack} automatically learn and calculate the probability
of a value being anomalous based on its historical behavior.
This enables accurate alerting and highlights only the subset of relevant metrics
that have changed. These alerts provide actionable insight into what is a growing
mountain of data.

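As a minimal sketch of what this looks like in practice, the following example
creates a single-metric job through the {ml} job creation API, using the Python
`requests` library. The job name, field name, bucket span, endpoint and
credentials are illustrative assumptions only.

[source,python]
----
# Minimal sketch: create a single-metric anomaly detection job through the
# X-Pack machine learning REST API. Job name, field names, endpoint and
# credentials are hypothetical placeholders.
import requests

ES_URL = "http://localhost:9200"   # assumed Elasticsearch endpoint
AUTH = ("elastic", "changeme")     # assumed credentials

job = {
    "description": "Model response times and score anomalies by probability",
    "analysis_config": {
        "bucket_span": "15m",      # granularity at which behavior is modeled
        "detectors": [
            {"function": "mean", "field_name": "response_time"}
        ]
    },
    "data_description": {
        "time_field": "@timestamp"
    }
}

response = requests.put(
    ES_URL + "/_xpack/ml/anomaly_detectors/response-times",
    json=job,
    auth=AUTH,
)
response.raise_for_status()
print(response.json())  # echoes the created job configuration
----

Once data is sent to this job, each result bucket carries an anomaly score
derived from the modeled probability, which can drive alerting instead of a
fixed threshold.
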
=== Reducing troubleshooting times and subject matter expert (SME) involvement

As much as 75 percent of troubleshooting time is commonly said to be spent
mining data to identify the root cause of an incident. The {ml} features in
{xpack} automatically analyze data and boil down the massive volume of
information to the few metrics or log messages that have changed behavior.
This allows the subject matter experts (SMEs) to focus on the subset of
information that is relevant to an issue, which greatly reduces triage time.

//In a major credit services provider, within a month of deployment, the company
//reported that its overall time to triage was reduced by 70 percent and the use of
//outside SMEs’ time to troubleshoot was decreased by 80 percent.

=== Finding and fixing issues before they impact the end user

Large-scale systems, such as online banking, typically require complex
infrastructures involving hundreds of different interdependent applications.
Just accessing an account summary page might involve dozens of different
databases, systems and applications.

Because of their importance to the business, these systems are typically highly
resilient and a critical problem will not be allowed to recur.
If a problem happens, it is likely to be complicated and to be the result of a
causal sequence of events that spans multiple interacting resources.
Troubleshooting would require the analysis of large volumes of data with a wide
range of characteristics and data types. A variety of experts from multiple
disciplines would need to participate in time-consuming “war rooms” to mine
the data for answers.

By using {ml} in real time, large volumes of data can be analyzed to alert on
early indicators of problems and to highlight the events that were likely to
have contributed to the problem.

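As a rough sketch, assuming a job such as the hypothetical
`account-summary-services` has already been created, a datafeed can stream live
data from an index into the job so that the analysis runs in real time. The
index pattern, endpoint and credentials below are placeholder assumptions.

[source,python]
----
# Minimal sketch: stream live data into an existing anomaly detection job
# through a datafeed, so the analysis runs in real time. The job ID, index
# pattern, endpoint and credentials are hypothetical placeholders.
import requests

ES_URL = "http://localhost:9200"
AUTH = ("elastic", "changeme")
JOB_ID = "account-summary-services"   # assumed to have been created already

datafeed = {
    "job_id": JOB_ID,
    "indices": ["app-metrics-*"],     # assumed index pattern holding the metrics
    "query": {"match_all": {}}
}

# Create the datafeed, open the job, then start the feed. With no end time,
# the datafeed keeps running and analyzes new data as it arrives.
requests.put(ES_URL + "/_xpack/ml/datafeeds/datafeed-" + JOB_ID,
             json=datafeed, auth=AUTH).raise_for_status()
requests.post(ES_URL + "/_xpack/ml/anomaly_detectors/" + JOB_ID + "/_open",
              auth=AUTH).raise_for_status()
requests.post(ES_URL + "/_xpack/ml/datafeeds/datafeed-" + JOB_ID + "/_start",
              auth=AUTH).raise_for_status()
----
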
=== Finding rare events that may be symptomatic of a security issue

When several hundred servers are under management, new processes appearing on a
host might indicate a security breach.

Using typical operational management techniques, each server would require a
period of baselining in order to identify which processes are considered
standard. Ideally, a baseline would be created for each server (or server group)
and periodically updated, which makes this a large management overhead.

By using the {ml} features in {xpack}, baselines are automatically built from
the normal behavior patterns of each host and alerts are generated when rare
events occur.

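A minimal sketch of such a job, assuming the process and host names arrive in
fields called `process` and `host` (placeholder names, as are the job name,
endpoint and credentials), uses the `rare` function so that a process that is
rare for a particular host is flagged:

[source,python]
----
# Minimal sketch: flag processes that are rare for a given host, using the
# "rare" detector function. Field names, job name, endpoint and credentials
# are hypothetical placeholders.
import requests

ES_URL = "http://localhost:9200"
AUTH = ("elastic", "changeme")

job = {
    "description": "Alert when a process that is rare for a host appears",
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [
            {
                "function": "rare",              # models how often each value occurs
                "by_field_name": "process",      # the value being modeled
                "partition_field_name": "host"   # a separate baseline per host
            }
        ],
        "influencers": ["host", "process"]
    },
    "data_description": {"time_field": "@timestamp"}
}

requests.put(ES_URL + "/_xpack/ml/anomaly_detectors/rare-process-per-host",
             json=job, auth=AUTH).raise_for_status()
----
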
=== Finding anomalies in periodic data

For data that has periodicity, it is difficult for standard monitoring tools to
tell accurately whether a change in the data is due to a service outage or
simply reflects the usual time-of-day pattern. Daily and weekly trends in the
data, along with peak and off-peak hours, make it difficult to identify
anomalies using standard threshold-based methods. Minimum and maximum thresholds
for SMS text activity at 2am would be very different from the thresholds that
would be effective during the day.

By using {ml}, time-related trends are automatically identified and smoothed,
leaving the residual to be analyzed for anomalies.

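As a sketch, a job over SMS message volumes (all names and connection details
below are placeholder assumptions) only needs count-based detectors; the daily
and weekly cycles are learned from the history rather than configured by hand:

[source,python]
----
# Minimal sketch: model periodic SMS activity with count-based detectors.
# Daily and weekly cycles are learned from the data itself, so the same
# message count is judged differently at 2am and at midday. Names and
# connection details are hypothetical placeholders.
import requests

ES_URL = "http://localhost:9200"
AUTH = ("elastic", "changeme")

job = {
    "description": "SMS activity with automatically learned periodicity",
    "analysis_config": {
        "bucket_span": "30m",
        "detectors": [
            {"function": "low_count"},    # unusually quiet (possible outage)
            {"function": "high_count"}    # unusually busy (possible abuse)
        ]
    },
    "data_description": {"time_field": "@timestamp"}
}

requests.put(ES_URL + "/_xpack/ml/anomaly_detectors/sms-activity",
             json=job, auth=AUTH).raise_for_status()
----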