[[ml-configuring-pop]]
=== Performing Population Analysis

Entities or events in your data can be considered anomalous when:

* Their behavior changes over time, relative to their own previous behavior, or
* Their behavior is different than other entities in a specified population.

The latter method of detecting outliers is known as _population analysis_. The
{ml} analytics build a profile of what a "typical" user, machine, or other entity
does over a specified time period and then identify when one is behaving
abnormally compared to the population.

This type of analysis is most useful when the behavior of the population as a
whole is mostly homogeneous and you want to identify outliers. In general,
population analysis is not useful when members of the population inherently
have vastly different behavior. You can, however, segment your data into groups
that behave similarly and run these as separate jobs. For example, you can use a
query filter in the {dfeed} to segment your data or you can use the
`partition_field_name` to split the analysis for the different groups.
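
A minimal sketch of the second approach is a detector that additionally splits
the analysis by device type. The `deviceType` field and the job name are
assumptions for illustration only:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/population-by-device-type
{
  "analysis_config" : {
    "bucket_span": "10m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytesSent",
        "over_field_name": "username",
        "partition_field_name": "deviceType" <1>
      }
    ]
  },
  "data_description" : {
    "time_field": "@timestamp"
  }
}
----------------------------------
<1> The hypothetical `deviceType` field splits the analysis so that each device
type is modeled as its own population.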

Population analysis scales well and has a lower resource footprint than
individual analysis of each series. For example, you can analyze populations
of hundreds of thousands or millions of entities.

To specify the population, use the `over_field_name` property. For example:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/population
{
  "description" : "Population analysis",
  "analysis_config" : {
    "bucket_span": "10m",
    "influencers": [
      "username"
    ],
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytesSent",
        "over_field_name": "username" <1>
      }
    ]
  },
  "data_description" : {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> This `over_field_name` property indicates that the metrics for each user
(as identified by their `username` value) are analyzed relative to other users
in each bucket.

//TO-DO: Per sophiec20 "Perhaps add the datafeed config and add a query filter to
//include only workstations as servers and printers would behave differently
//from the population
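
If you use a {dfeed}, you can add a query filter so that only entities with
similar behavior are included in the population. The following sketch restricts
the analysis to workstations; the `network-traffic` index and the `deviceType`
field are assumptions for illustration:

[source,js]
----------------------------------
PUT _xpack/ml/datafeeds/datafeed-population
{
  "job_id": "population",
  "indices": [ "network-traffic" ],
  "query": {
    "term": {
      "deviceType": "workstation" <1>
    }
  }
}
----------------------------------
<1> The query filter limits the analyzed documents to workstations, since
servers and printers would behave differently from the rest of the population.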

If your data is stored in {es}, you can create an advanced job with these same
properties. In particular, you specify the `over_field_name` property when you
add detectors:

[role="screenshot"]
image::images/ml-population-job.jpg["Create a detector for population analysis"]
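
If you are working with the APIs rather than the job wizards in {kib}, you can
then open the job and start its {dfeed}. A minimal sketch, assuming the {dfeed}
is named `datafeed-population`:

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/population/_open

POST _xpack/ml/datafeeds/datafeed-population/_start
----------------------------------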

After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:

[role="screenshot"]
image::images/ml-population-results.jpg["Population analysis results in the Anomaly Explorer"]

As in this case, the results are often quite sparse; there might be just a few
data points for the selected time period. Population analysis is particularly
useful when you have many entities and the data for specific entities is
sporadic or sparse.

If you click on a section in the timeline or swim lanes, you can see more
details about the anomalies:

[role="screenshot"]
image::images/ml-population-anomaly.jpg["Anomaly details for a specific user"]

In this example, the user identified as `antonette` sent a high volume of bytes
on the date and time shown. This event is anomalous because the mean value for
this user is two times higher than the expected behavior of the population.
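
You can also retrieve the underlying anomaly records with the results APIs
instead of {kib}. A minimal sketch that returns the records for the `population`
job, sorted by record score:

[source,js]
----------------------------------
GET _xpack/ml/anomaly_detectors/population/results/records
{
  "sort": "record_score",
  "desc": true
}
----------------------------------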