[role="xpack"]
[[ml-configuring-populations]]
= Performing population analysis

Entities or events in your data can be considered anomalous when:

* Their behavior changes over time, relative to their own previous behavior, or
* Their behavior is different from that of other entities in a specified
population.

The latter method of detecting outliers is known as _population analysis_. The
{ml} analytics build a profile of what a "typical" user, machine, or other
entity does over a specified time period and then identify when one is behaving
abnormally compared to the population.

This type of analysis is most useful when the behavior of the population as a
whole is mostly homogeneous and you want to identify outliers. In general,
population analysis is not useful when members of the population inherently
have vastly different behavior. You can, however, segment your data into groups
that behave similarly and run these as separate jobs. For example, you can use
a query filter in the {dfeed} to segment your data or you can use the
`partition_field_name` property to split the analysis for the different groups.
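
For example, here is a minimal sketch of a job that uses `partition_field_name`
to run a separate population analysis for each group; the `department` field
and the job ID are assumptions for illustration, not part of the sample data
set:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/population_per_department
{
  "analysis_config" : {
    "bucket_span":"15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytes",
        "over_field_name": "clientip",
        "partition_field_name": "department"
      }
    ]
  },
  "data_description" : {
    "time_field":"timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]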

Population analysis scales well and has a lower resource footprint than
individual analysis of each series. For example, you can analyze populations
of hundreds of thousands or millions of entities.

To specify the population, use the `over_field_name` property. For example:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/population
{
  "description" : "Population analysis",
  "analysis_config" : {
    "bucket_span":"15m",
    "influencers": [
      "clientip"
    ],
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytes",
        "over_field_name": "clientip" <1>
      }
    ]
  },
  "data_description" : {
    "time_field":"timestamp",
    "time_format": "epoch_ms"
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> This `over_field_name` property indicates that the metrics for each client
(as identified by their IP address) are analyzed relative to other clients in
each bucket.

If your data is stored in {es}, you can use the population job wizard in {kib}
to create an {anomaly-job} with these same properties. For example, if you add
the sample web logs in {kib}, you can use the following job settings in the
population job wizard:

[role="screenshot"]
image::images/ml-population-job.png["Job settings in the population job wizard"]
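
If you prefer to work with the APIs rather than the wizard, you can open the
job and start its {dfeed} with calls like the following. This is a minimal
sketch; the {dfeed} ID `datafeed-population` and the start time are assumptions
for illustration:

[source,console]
----------------------------------
POST _ml/anomaly_detectors/population/_open

POST _ml/datafeeds/datafeed-population/_start
{
  "start": "2020-01-01T00:00:00Z"
}
----------------------------------
// TEST[skip:needs-licence]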

After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}, for example in the **Anomaly Explorer**:

[role="screenshot"]
image::images/ml-population-results.png["Population analysis results in the Anomaly Explorer"]

As in this case, the results are often quite sparse; there might be just a few
data points for the selected time period. Population analysis is particularly
useful when you have many entities and the data for specific entities is
sporadic or sparse.

If you click on a section in the timeline or swim lanes, you can see more
details about the anomalies:

[role="screenshot"]
image::images/ml-population-anomaly.png["Anomaly details for a specific user"]

In this example, the client IP address `30.156.16.164` received a low volume
of bytes on the date and time shown. This event is anomalous because the mean
value for that client is three times lower than the expected behavior of the
population.
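
To examine such an anomaly programmatically, you can retrieve the
highest-scoring anomaly records for the job; in each record, the `actual` and
`typical` fields show the observed mean and the value that was expected for the
population. A minimal sketch:

[source,console]
----------------------------------
GET _ml/anomaly_detectors/population/results/records
{
  "sort": "record_score",
  "desc": true
}
----------------------------------
// TEST[skip:needs-licence]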