[[ml-scenarios]]
== Use Cases

Enterprises, government organizations and cloud-based service providers daily
process volumes of machine data so massive as to make real-time human
analysis impossible. Changing behaviors hidden in this data provide the
information needed to quickly resolve a massive service outage, detect security
breaches before they result in the theft of millions of credit records or
identify the next big trend in consumer patterns. Current search and analysis,
performance management and cyber security tools are unable to find these
anomalies without significant human work in the form of thresholds, rules,
signatures and data models.

By using advanced anomaly detection techniques that learn normal behavior
patterns represented by the data and identify and cross-correlate anomalies,
performance, security and operational anomalies and their causes can be
identified as they develop, so they can be acted on before they impact the business.

Whilst anomaly detection is applicable to any type of data, we focus on machine
data scenarios. Enterprise application developers, cloud service providers and
technology vendors need to harness the power of machine learning based anomaly
detection analytics to better manage complex online services, detect the
earliest signs of advanced security threats and gain insight into business
opportunities and risks represented by changing behaviors hidden in their
massive data sets. Here are some real-world examples.

=== Eliminating noise generated by threshold-based alerts

Modern IT systems are highly instrumented and can generate TBs of machine data
a day. Traditional methods for analyzing data involve alerting when metric
values exceed a known value (static thresholds), or looking for simple
statistical deviations (dynamic thresholds).

Setting accurate thresholds for each metric at different times of day is
practically impossible. As a result, static thresholds generate large volumes
of false positives (threshold set too low) and false negatives (threshold set
too high).

The {ml} features in {xpack} automatically learn and calculate the probability
of a value being anomalous based on its historical behavior.
This enables accurate alerting and highlights only the subset of relevant metrics
that have changed. These alerts provide actionable insight into what is a growing
mountain of data.
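
To make the contrast concrete, here is a minimal sketch that compares a single
static threshold with a baseline learned from history. It is illustrative only:
the sample data, the threshold value, and the per-hour Gaussian baseline are
hypothetical stand-ins for the much richer behavioral models that the {ml}
features build.

[source,python]
----
import math
from collections import defaultdict

# Hypothetical history for a single metric: (hour_of_day, value) samples with a
# clear daily cycle (busy at midday, quiet overnight) plus a little noise.
history = [(i % 24,
            130 + 40 * math.sin((i % 24 - 6) / 24 * 2 * math.pi) + (i % 7))
           for i in range(500)]

STATIC_THRESHOLD = 150  # one fixed threshold, regardless of time of day

def static_alert(value):
    """Traditional alerting: fire whenever the value exceeds a fixed threshold."""
    return value > STATIC_THRESHOLD

# Learn a per-hour baseline (mean and standard deviation) from the history.
buckets = defaultdict(list)
for hour, value in history:
    buckets[hour].append(value)

baseline = {}
for hour, values in buckets.items():
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    baseline[hour] = (mean, std)

def learned_alert(hour, value, max_sigma=4.0):
    """Alert only when the value is improbable for this time of day."""
    mean, std = baseline[hour]
    return abs(value - mean) > max_sigma * (std or 1.0)

# 140 at 03:00 never trips the static threshold even though it is highly
# unusual for that hour; 172 at 12:00 trips the static threshold even though
# it is perfectly normal for midday.
for hour, value in [(3, 140), (12, 172)]:
    print(f"{hour:02d}:00 value={value} "
          f"static={static_alert(value)} learned={learned_alert(hour, value)}")
----
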
=== Reducing troubleshooting times and subject matter expert (SME) involvement

It is said that 75 percent of troubleshooting time is spent mining data to try
to identify the root cause of an incident. The {ml} features in {xpack}
automatically analyze data and boil down the massive volume of information
to the few metrics or log messages that have changed behavior.
This allows the subject matter experts (SMEs) to focus on the subset of
information that is relevant to an issue, which greatly reduces triage time.
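
As a rough illustration of how a large set of metrics can be boiled down to the
few that changed, the sketch below scores hypothetical metric series by how far
their recent behavior deviates from an earlier baseline and reports only the
top few. The metric names, window sizes, and scoring are invented for the
example; the {ml} features rank anomalies with far more sophisticated
probabilistic models.

[source,python]
----
import random

random.seed(7)

# Hypothetical per-metric time series: most metrics behave as usual during an
# incident, only a couple change behavior.
metrics = {f"service_{i}.latency_ms": [random.gauss(50, 5) for _ in range(300)]
           for i in range(40)}
metrics["checkout.latency_ms"] = ([random.gauss(50, 5) for _ in range(240)] +
                                  [random.gauss(95, 5) for _ in range(60)])
metrics["db01.connections"] = ([random.gauss(200, 10) for _ in range(240)] +
                               [random.gauss(380, 15) for _ in range(60)])

def change_score(series, recent=60):
    """Score how far the recent window deviates from the earlier baseline."""
    base, current = series[:-recent], series[-recent:]
    mean = sum(base) / len(base)
    std = (sum((v - mean) ** 2 for v in base) / len(base)) ** 0.5 or 1.0
    recent_mean = sum(current) / len(current)
    return abs(recent_mean - mean) / std

# Boil dozens of series down to the handful that actually changed behavior.
ranked = sorted(metrics, key=lambda name: change_score(metrics[name]), reverse=True)
for name in ranked[:3]:
    print(f"{name}: {change_score(metrics[name]):.1f} standard deviations from baseline")
----
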
//In a major credit services provider, within a month of deployment, the company
//reported that its overall time to triage was reduced by 70 percent and the use of
//outside SMEs’ time to troubleshoot was decreased by 80 percent.

=== Finding and fixing issues before they impact the end user

Large-scale systems, such as online banking, typically require complex
infrastructures involving hundreds of different interdependent applications.
Just accessing an account summary page might involve dozens of different
databases, systems and applications.

Because of their importance to the business, these systems are typically highly
resilient, and a critical problem will not be allowed to recur.
If a problem happens, it is likely to be complicated and be the result of a
causal sequence of events that span multiple interacting resources.
Troubleshooting would require the analysis of large volumes of data with a wide
range of characteristics and data types. A variety of experts from multiple
disciplines would need to participate in time-consuming “war rooms” to mine
the data for answers.

By using {ml} in real time, large volumes of data can be analyzed to provide
alerts on early indicators of problems and to highlight the events that were likely
to have contributed to the problem.

=== Finding rare events that may be symptomatic of a security issue

With several hundred servers under management, the presence of new processes
running might indicate a security breach.

Using typical operational management techniques, each server would require a
period of baselining in order to identify which processes are considered standard.
Ideally a baseline would be created for each server (or server group)
and would be periodically updated, making this a large management overhead.

By using the {ml} features in {xpack}, baselines are automatically built based
upon normal behavior patterns for each host and alerts are generated when rare
events occur.
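
The idea can be sketched with a simple per-host frequency baseline. The hosts,
process names, and rarity cutoff below are hypothetical, and the hand-rolled
counting only hints at the probabilistic rare-event analysis that the {ml}
features perform automatically.

[source,python]
----
from collections import Counter, defaultdict

# Hypothetical process-start events harvested from a fleet of servers:
# (host, process_name) pairs. Real deployments would stream these from logs.
events = [("web-01", "nginx"), ("web-01", "filebeat"), ("web-02", "nginx"),
          ("web-02", "filebeat"), ("db-01", "mysqld"), ("db-01", "filebeat")] * 500
events.append(("web-02", "xmrig"))  # a process never seen before on this host

# Build a per-host baseline of how often each process is seen.
per_host = defaultdict(Counter)
for host, process in events:
    per_host[host][process] += 1

def rare_processes(host, max_fraction=0.001):
    """Return processes that account for a tiny fraction of a host's activity."""
    counts = per_host[host]
    total = sum(counts.values())
    return [p for p, n in counts.items() if n / total <= max_fraction]

for host in per_host:
    for process in rare_processes(host):
        print(f"rare process on {host}: {process}")
----
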
=== Finding anomalies in periodic data

For data that has periodicity, it is difficult for standard monitoring tools to
accurately tell whether a change in data is due to a service outage, or is a
result of usual time schedules. Daily and weekly trends in data, along with
peak and off-peak hours, make it difficult to identify anomalies using standard
threshold-based methods. A min and max threshold for SMS text activity at 2am
would be very different from the thresholds that would be effective during the day.

By using {ml}, time-related trends are automatically identified and smoothed,
leaving the residual to be analyzed for anomalies.
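
A minimal sketch of this approach, assuming hypothetical hourly SMS volumes
with a daily cycle: learn the periodic profile from history, subtract it, and
flag values whose residual falls far outside its usual spread. The {ml}
features model trends and seasonality far more robustly; this only illustrates
the residual-based idea.

[source,python]
----
import math
import statistics

# Hypothetical hourly SMS volumes with a strong daily cycle: quiet at night,
# busy during the day, plus a little noise.
def volume(hour_index):
    hour = hour_index % 24
    return 1000 + 800 * math.sin((hour - 6) / 24 * 2 * math.pi) + (hour_index % 5) * 10

history = [volume(i) for i in range(24 * 28)]  # four weeks of history

# Learn the daily profile (the "smoothed" periodic component).
profile = {h: statistics.mean(history[i] for i in range(len(history)) if i % 24 == h)
           for h in range(24)}
residual_std = statistics.pstdev(history[i] - profile[i % 24] for i in range(len(history)))

def is_anomalous(hour, value, max_sigma=4.0):
    """Compare the residual (value minus the expected daily profile) to its spread."""
    residual = value - profile[hour]
    return abs(residual) / residual_std > max_sigma

# 300 messages at 02:00 is close to the expected overnight level, while the
# same figure at 14:00 is a large negative residual and is flagged.
for hour, value in [(2, 300), (14, 300)]:
    print(f"{hour:02d}:00 volume={value} anomalous={is_anomalous(hour, value)}")
----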