
[[ml-configuring-pop]]
=== Performing Population Analysis

Entities or events in your data can be considered anomalous when:

* Their behavior changes over time, relative to their own previous behavior, or
* Their behavior differs from that of other entities in a specified population.

The latter method of detecting outliers is known as _population analysis_. The
{ml} analytics build a profile of what a "typical" user, machine, or other entity
does over a specified time period and then identify when one is behaving
abnormally compared to the population.

This type of analysis is most useful when the behavior of the population as a
whole is mostly homogeneous and you want to identify outliers. In general,
population analysis is not useful when members of the population inherently
have vastly different behavior. You can, however, segment your data into groups
that behave similarly and run these as separate jobs. For example, you can use a
query filter in the {dfeed} to segment your data or you can use the
`partition_field_name` to split the analysis for the different groups.
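
As a rough sketch of the second option, a detector could combine
`over_field_name` (described later in this section) with
`partition_field_name`. The `department` field is hypothetical and stands in
for whichever field separates your groups:

[source,js]
----------------------------------
{
  "function": "mean",
  "field_name": "bytesSent",
  "over_field_name": "username",
  "partition_field_name": "department" <1>
}
----------------------------------
<1> Assumed field name. The analysis is split by `department` value, so users
  are compared only with other users in the same department.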

Population analysis scales well and has a lower resource footprint than
individual analysis of each series. For example, you can analyze populations
of hundreds of thousands or millions of entities.

To specify the population, use the `over_field_name` property. For example:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/population
{
  "description" : "Population analysis",
  "analysis_config" : {
    "bucket_span":"10m",
    "influencers": [
      "username"
    ],
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytesSent",
        "over_field_name": "username" <1>
      }
    ]
  },
  "data_description" : {
    "time_field":"@timestamp",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> This `over_field_name` property indicates that the metrics for each user
  (as identified by their `username` value) are analyzed relative to other users
  in each bucket.
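
If the job reads its data from {es}, one way to keep the population homogeneous
is to filter the documents in the {dfeed}. The following is a minimal sketch
rather than a tested configuration: the {dfeed} ID, the `network-logs` index,
and the `device_type` field are all hypothetical, and some releases spell the
`indices` property `indexes` or also require a `types` list.

[source,js]
----------------------------------
PUT _xpack/ml/datafeeds/datafeed-population
{
  "job_id": "population",
  "indices": ["network-logs"], <1>
  "query": {
    "term": {
      "device_type": "workstation" <2>
    }
  }
}
----------------------------------
<1> Assumed index name.
<2> Assumed field and value. Only matching documents reach the job, so devices
  that behave differently, such as servers or printers, are excluded from the
  population.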

//TO-DO: Per sophiec20 "Perhaps add the datafeed config and add a query filter to
//include only workstations as servers and printers would behave differently
//from the population

If your data is stored in {es}, you can create an advanced job with these same
properties. In particular, you specify the `over_field_name` property when you
add detectors:

[role="screenshot"]
image::images/ml-population-job.jpg["Create a detector for population analysis"]
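
However the job is created, it produces no results until it is opened and
receives data. As a sketch of the equivalent API requests, assuming the job is
named `population` as in the earlier example and that it has a {dfeed} with the
hypothetical ID `datafeed-population`:

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/population/_open

POST _xpack/ml/datafeeds/datafeed-population/_start <1>
----------------------------------
<1> Assumed {dfeed} ID. Substitute the ID of the {dfeed} that belongs to your
  job.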

After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:

[role="screenshot"]
image::images/ml-population-results.jpg["Population analysis results in the Anomaly Explorer"]

As in this case, the results are often quite sparse. There might be just a few
data points for the selected time period. Population analysis is particularly
useful when you have many entities and the data for specific entities is sporadic
or sparse.

If you click on a section in the timeline or swim lanes, you can see more
details about the anomalies:

[role="screenshot"]
image::images/ml-population-anomaly.jpg["Anomaly details for a specific user"]

In this example, the user identified as `antonette` sent a high volume of bytes
on the date and time shown. This event is anomalous because the mean number of
bytes sent by this user is two times higher than the value expected for the
population.
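
The same anomaly records are also available outside {kib} through the results
API. As a hedged sketch, assuming the job name `population` from the earlier
example and a release in which records carry a `record_score`, the following
request lists the records with the highest scores first:

[source,js]
----------------------------------
GET _xpack/ml/anomaly_detectors/population/results/records
{
  "sort": "record_score",
  "desc": true
}
----------------------------------

For a population job, each returned record typically names the anomalous entity
in its `over_field_value` property.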