//lcawley Verified example output 2017-04-11
[[ml-results-resource]]
==== Results Resources

Several types of results are created for each job.
Anomaly results for _buckets_, _influencers_, and _records_ can be queried using the results API.

These results are written for every `bucket_span`, with the timestamp being the start of the time interval.

As part of the results, scores are calculated for each anomaly result type and each bucket interval.
These are aggregated in order to reduce noise, and normalized in order to identify and rank the most mathematically significant anomalies.

Bucket results provide the top-level, overall view of the job and are ideal for alerting.
For example, at 16:05 the system was unusual.
This is a summary of all the anomalies, pinpointing when they occurred.

Influencer results show which entities were anomalous and when.
For example, at 16:05 `user_name: Bob` was unusual.
This is a summary of all anomalies for each entity, so there can be a lot of these results.
Once you have identified a notable bucket time, you can look to see which entities were significant.

Record results provide the detail showing what the individual anomaly was, when it occurred, and which entity was involved.
For example, at 16:05 Bob sent 837262434 bytes, when the typical value was 1067 bytes.
Once you have identified a bucket time and/or a significant entity, you can drill through to the record results
in order to investigate the anomalous behavior.

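
The drill-down from buckets to influencers to records can be sketched in a few lines of Python.
This is a minimal illustration, not a client for the results API: the `buckets`, `influencers`, and `records` lists are hypothetical, hand-written documents, the timestamps and scores are invented, and the threshold of 75 is an arbitrary assumption.
It also assumes that each record carries a `user_name` property that can be matched against the influencer value.

```python
# Hypothetical result documents. Field names follow the properties described
# in this section; all values are invented for illustration.
buckets = [
    {"timestamp": 1000, "anomaly_score": 2.0},
    {"timestamp": 2000, "anomaly_score": 91.5},
]
influencers = [
    {"timestamp": 2000, "influencer_field_name": "user_name",
     "influencer_field_value": "Bob", "influencer_score": 88.0},
    {"timestamp": 2000, "influencer_field_name": "user_name",
     "influencer_field_value": "Alice", "influencer_score": 3.0},
]
records = [
    {"timestamp": 2000, "user_name": "Bob", "record_score": 95.0,
     "actual": [837262434], "typical": [1067]},
    {"timestamp": 2000, "user_name": "Alice", "record_score": 1.0,
     "actual": [1100], "typical": [1067]},
]


def drill_down(buckets, influencers, records, threshold=75):
    """Walk from anomalous buckets to influencers to the records behind them."""
    findings = []
    for bucket in buckets:
        if bucket["anomaly_score"] < threshold:
            continue  # not a notable bucket
        ts = bucket["timestamp"]
        for inf in influencers:
            if inf["timestamp"] != ts or inf["influencer_score"] < threshold:
                continue  # different bucket, or not a significant entity
            name = inf["influencer_field_name"]
            value = inf["influencer_field_value"]
            # Assumption: the record exposes the influencer field as a property.
            matching = [r for r in records
                        if r["timestamp"] == ts and r.get(name) == value]
            findings.append((ts, value, matching))
    return findings


for ts, entity, recs in drill_down(buckets, influencers, records):
    for rec in recs:
        print(ts, entity, rec["actual"], rec["typical"])
```

With the sample data, only the bucket at timestamp 2000 and the entity `Bob` pass both thresholds, so a single record is reported.
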
//TBD Add links to categorization
Categorization results contain the definitions of _categories_ that have been identified.
These are only applicable for jobs that are configured to analyze unstructured log data using categorization.
These results do not contain a timestamp or any calculated scores.

* <<ml-results-buckets,Buckets>>
* <<ml-results-influencers,Influencers>>
* <<ml-results-records,Records>>
* <<ml-results-categories,Categories>>

[float]
[[ml-results-buckets]]
===== Buckets

Bucket results provide the top-level, overall view of the job and are best for alerting.

Each bucket has an `anomaly_score`, which is a statistically aggregated and
normalized view of the combined anomalousness of all record results within each bucket.

One bucket result is written for each `bucket_span` for each job, even if it is not considered to be anomalous
(in which case it has an `anomaly_score` of zero).

Upon identifying an anomalous bucket, you can investigate further by either
expanding the bucket resource to show the records as nested objects or by
accessing the records resource directly and filtering by date range.

A bucket resource has the following properties:

`anomaly_score`::
(number) The maximum anomaly score, between 0 and 100, for any of the bucket influencers.
This is an overall, rate-limited score for the job.
All the anomaly records in the bucket contribute to this score.
This value may be updated as new data is analyzed.

`bucket_influencers[]`::
(array) An array of bucket influencer objects.
For more information, see <<ml-results-bucket-influencers,Bucket Influencers>>.

`bucket_span`::
(time units) The length of the bucket.
This value matches the `bucket_span` that is specified in the job.

`event_count`::
(number) The number of input data records processed in this bucket.

`initial_anomaly_score`::
(number) The maximum `anomaly_score` for any of the bucket influencers.
This is the initial value calculated at the time the bucket was processed.

`is_interim`::
(boolean) If true, then this bucket result is an interim result.
In other words, it is calculated based on partial input data.

`job_id`::
(string) The unique identifier for the job that these results belong to.

`processing_time_ms`::
(number) The time in milliseconds taken to analyze the bucket contents and calculate results.

`record_count`::
(number) The number of anomaly records in this bucket.

`result_type`::
(string) Internal. This value is always set to "bucket".

`timestamp`::
(date) The start time of the bucket. This timestamp uniquely identifies the bucket. +

NOTE: Events that occur exactly at the timestamp of the bucket are included in
the results for the bucket.


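
Because bucket results are the natural place to alert from, a small filter over the bucket properties is often all that is needed.
The following Python sketch assumes the buckets have already been retrieved and parsed into dictionaries; the `sample_buckets` values, the `alertable_buckets` helper, and the default threshold of 75 are hypothetical.
Interim results are skipped by default, since they are based on partial input data.

```python
def alertable_buckets(buckets, min_score=75.0, include_interim=False):
    """Return buckets worth alerting on, highest anomaly_score first."""
    selected = [
        b for b in buckets
        if b["anomaly_score"] >= min_score
        and (include_interim or not b.get("is_interim", False))
    ]
    return sorted(selected, key=lambda b: b["anomaly_score"], reverse=True)


# Hypothetical bucket documents using the properties listed above.
sample_buckets = [
    {"timestamp": 1, "anomaly_score": 0.0, "is_interim": False, "record_count": 0},
    {"timestamp": 2, "anomaly_score": 80.2, "is_interim": False, "record_count": 3},
    {"timestamp": 3, "anomaly_score": 97.0, "is_interim": True, "record_count": 1},
    {"timestamp": 4, "anomaly_score": 93.4, "is_interim": False, "record_count": 2},
]

for b in alertable_buckets(sample_buckets):
    print(b["timestamp"], b["anomaly_score"], b["record_count"])
```

Sorting by `anomaly_score` puts the most significant bucket first; whether interim buckets should trigger alerts is a design choice that depends on how much delay is acceptable.
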
[float]
[[ml-results-bucket-influencers]]
===== Bucket Influencers

Bucket influencer results are available as nested objects contained within bucket results.
These results are an aggregation for each type of influencer.
For example, if both `client_ip` and `user_name` were specified as influencers,
then you would be able to find when the `client_ip` or `user_name` values were collectively anomalous.

There is a built-in bucket influencer called `bucket_time`, which is always available.
This is the aggregation of all records in the bucket, and is not limited to a type of influencer.

NOTE: A bucket influencer is a type of influencer. For example, `client_ip` or `user_name`
can be bucket influencers, whereas `192.168.88.2` and `Bob` are influencers.

A bucket influencer object has the following properties:

`anomaly_score`::
(number) A normalized score between 0 and 100, calculated for each bucket influencer.
This score may be updated as newer data is analyzed.

`bucket_span`::
(time units) The length of the bucket.
This value matches the `bucket_span` that is specified in the job.

`initial_anomaly_score`::
(number) The score between 0 and 100 for each bucket influencer.
This is the initial value calculated at the time the bucket was processed.

`influencer_field_name`::
(string) The field name of the influencer. For example, `client_ip` or `user_name`.

`influencer_field_value`::
(string) The field value of the influencer. For example, `192.168.88.2` or `Bob`.

`is_interim`::
(boolean) If true, then this is an interim result.
In other words, it is calculated based on partial input data.

`job_id`::
(string) The unique identifier for the job that these results belong to.

`probability`::
(number) The probability that the bucket has this behavior, in the range 0 to 1. For example, 0.0000109783.
This value can be held to a high precision of over 300 decimal places, so the `anomaly_score` is provided as a
human-readable and friendly interpretation of this.

`raw_anomaly_score`::
(number) Internal.

`result_type`::
(string) Internal. This value is always set to "bucket_influencer".

`sequence_num`::
(number) Internal.

`timestamp`::
(date) The start time of the bucket for which these results have been calculated.

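
To see which type of influencer contributed most to a bucket, the nested bucket influencer objects can be reduced to a mapping from field name to score.
The sketch below is only an illustration under assumptions: the `bucket` document is hypothetical, and `scores_by_influencer_type` is a helper invented for this example.
The built-in `bucket_time` entry is separated out because it aggregates all records rather than a single influencer type.

```python
def scores_by_influencer_type(bucket):
    """Map each bucket influencer's field name to its anomaly_score."""
    return {
        bi["influencer_field_name"]: bi["anomaly_score"]
        for bi in bucket.get("bucket_influencers", [])
    }


# Hypothetical bucket with nested bucket influencer objects.
bucket = {
    "timestamp": 2,
    "anomaly_score": 80.2,
    "bucket_influencers": [
        {"influencer_field_name": "bucket_time", "anomaly_score": 80.2},
        {"influencer_field_name": "client_ip", "anomaly_score": 12.5},
        {"influencer_field_name": "user_name", "anomaly_score": 64.0},
    ],
}

scores = scores_by_influencer_type(bucket)
# bucket_time covers every record in the bucket, so treat it separately.
overall = scores.pop("bucket_time", None)
most_anomalous_type = max(scores, key=scores.get) if scores else None
print(overall, most_anomalous_type)
```
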
[float]
[[ml-results-influencers]]
===== Influencers

Influencers are the entities that have contributed to, or are to blame for, the anomalies.
Influencer results will only be available if an `influencer_field_name` has been specified in the job configuration.

Influencers are given an `influencer_score`, which is calculated
based on the anomalies that have occurred in each bucket interval.
For jobs with more than one detector, this gives a powerful view of the most anomalous entities.

For example, if analyzing unusual bytes sent and unusual domains visited, with `user_name`
specified as the influencer, then an `influencer_score` for each anomalous `user_name` would be written per bucket.
In that case, if `user_name: Bob` had an `influencer_score` greater than 75,
then `Bob` would be considered very anomalous during this time interval in either or both of those attack vectors.

One `influencer` result is written per bucket for each influencer that is considered anomalous.

Upon identifying an influencer with a high score, you can investigate further
by accessing the records resource for that bucket and enumerating the anomaly
records that contain this influencer.

An influencer object has the following properties:

`bucket_span`::
(time units) The length of the bucket.
This value matches the `bucket_span` that is specified in the job.

`influencer_score`::
(number) A normalized score between 0 and 100, based on the probability of the influencer in this bucket,
aggregated across detectors.
Unlike `initial_influencer_score`, this value will be updated by a re-normalization process as new data is analyzed.

`initial_influencer_score`::
(number) A normalized score between 0 and 100, based on the probability of the influencer, aggregated across detectors.
This is the initial value calculated at the time the bucket was processed.

`influencer_field_name`::
(string) The field name of the influencer.

`influencer_field_value`::
(string) The entity that influenced, contributed to, or was to blame for the
anomaly.

`is_interim`::
(boolean) If true, then this is an interim result.
In other words, it is calculated based on partial input data.

`job_id`::
(string) The unique identifier for the job that these results belong to.

`probability`::
(number) The probability that the influencer has this behavior, in the range 0 to 1.
For example, 0.0000109783.
This value can be held to a high precision of over 300 decimal places,
so the `influencer_score` is provided as a human-readable and friendly interpretation of this.
// For example, 0.03 means 3%. This value is held to a high precision of over
//300 decimal places. In scientific notation, a value of 3.24E-300 is highly
//unlikely and therefore highly anomalous.

`result_type`::
(string) Internal. This value is always set to "influencer".

`sequence_num`::
(number) Internal.

`timestamp`::
(date) The start time of the bucket for which these results have been calculated.

NOTE: Additional influencer properties are added, depending on the fields being analyzed.
For example, if analyzing `user_name` as an influencer, then a field `user_name` would be added to the
result document. This allows easier filtering of the anomaly results.


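
The additional property named after the influencer field makes it straightforward to filter influencer results by entity.
The following Python sketch assumes the results are already available as dictionaries; `sample_influencers` and the `top_influencers` helper are hypothetical, and the cut-off of 75 simply mirrors the example score used earlier in this section.

```python
def top_influencers(influencers, field_name, min_score=75.0):
    """Return (value, score) pairs for one influencer field, highest score first."""
    hits = [
        (inf[field_name], inf["influencer_score"])
        for inf in influencers
        # The extra property (for example user_name) is only present on
        # results for that influencer field.
        if field_name in inf and inf["influencer_score"] > min_score
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical influencer documents for a single bucket.
sample_influencers = [
    {"influencer_field_name": "user_name", "influencer_field_value": "Bob",
     "user_name": "Bob", "influencer_score": 88.1},
    {"influencer_field_name": "user_name", "influencer_field_value": "Alice",
     "user_name": "Alice", "influencer_score": 12.0},
    {"influencer_field_name": "user_name", "influencer_field_value": "Carol",
     "user_name": "Carol", "influencer_score": 76.5},
    {"influencer_field_name": "client_ip", "influencer_field_value": "192.168.88.2",
     "client_ip": "192.168.88.2", "influencer_score": 91.0},
]

for value, score in top_influencers(sample_influencers, "user_name"):
    print(value, score)
```
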
[float]
[[ml-results-records]]
===== Records

Records contain the detailed analytical results. They describe the anomalous activity that
has been identified in the input data based upon the detector configuration.

For example, if you are looking for unusually large data transfers,
an anomaly record would identify the source IP address, the destination,
the time window during which it occurred, the expected and actual size of the
transfer, and the probability of this occurring.

There can be many anomaly records depending upon the characteristics and size
of the input data; in practice, there are often too many to process manually.
The {xpack} {ml} features therefore perform a sophisticated aggregation of
the anomaly records into buckets.

The number of record results depends on the number of anomalies found in each bucket,
which relates to the number of time series being modeled and the number of detectors.


A record object has the following properties:

`actual`::
(array) The actual value for the bucket.

`bucket_span`::
(time units) The length of the bucket.
This value matches the `bucket_span` that is specified in the job.

`by_field_name`::
(string) The name of the analyzed field. Only present if specified in the detector.
For example, `client_ip`.

`by_field_value`::
(string) The value of `by_field_name`. Only present if specified in the detector.
For example, `192.168.66.2`.

`causes`::
(array) For population analysis, an over field must be specified in the detector.
This property contains an array of anomaly records that are the causes for the anomaly
that has been identified for the over field.
If no over fields exist, this field will not be present.
This sub-resource contains the most anomalous records for the `over_field_name`.
For scalability reasons, a maximum of the 10 most significant causes of
the anomaly will be returned. As part of the core analytical modeling,
these low-level anomaly records are aggregated for their parent over field record.
The causes resource contains similar elements to the record resource,
namely `actual`, `typical`, `*_field_name` and `*_field_value`.
Probability and scores are not applicable to causes.

`detector_index`::
(number) A unique identifier for the detector.

`field_name`::
(string) Certain functions require a field to operate on, for example, `sum()`.
For those functions, this is the name of the field to be analyzed.

`function`::
(string) The function in which the anomaly occurs, as specified in the detector configuration.
For example, `max`.

`function_description`::
(string) The description of the function in which the anomaly occurs, as
specified in the detector configuration.

`influencers`::
(array) If `influencers` was specified in the detector configuration, then
this array contains influencers that contributed to or were to blame for an anomaly.

`initial_record_score`::
(number) A normalized score between 0 and 100, based on the probability of the anomalousness of this record.
This is the initial value calculated at the time the bucket was processed.

`is_interim`::
(boolean) If true, then this anomaly record is an interim result.
In other words, it is calculated based on partial input data.

`job_id`::
(string) The unique identifier for the job that these results belong to.

`over_field_name`::
(string) The name of the over field that was used in the analysis. Only present if specified in the detector.
Over fields are used in population analysis.
For example, `user`.

`over_field_value`::
(string) The value of `over_field_name`. Only present if specified in the detector.
For example, `Bob`.

`partition_field_name`::
(string) The name of the partition field that was used in the analysis. Only present if specified in the detector.
For example, `region`.

`partition_field_value`::
(string) The value of `partition_field_name`. Only present if specified in the detector.
For example, `us-east-1`.

`probability`::
(number) The probability of the individual anomaly occurring, in the range 0 to 1. For example, 0.0000772031.
This value can be held to a high precision of over 300 decimal places, so the `record_score` is provided as a
human-readable and friendly interpretation of this.
//In scientific notation, a value of 3.24E-300 is highly unlikely and therefore
//highly anomalous.

`record_score`::
(number) A normalized score between 0 and 100, based on the probability of the anomalousness of this record.
Unlike `initial_record_score`, this value will be updated by a re-normalization process as new data is analyzed.

`result_type`::
(string) Internal. This is always set to "record".

`sequence_num`::
(number) Internal.

`timestamp`::
(date) The start time of the bucket for which these results have been calculated.

`typical`::
(array) The typical value for the bucket, according to analytical modeling.

NOTE: Additional record properties are added, depending on the fields being analyzed.
For example, if analyzing `hostname` as a _by field_, then a field `hostname` would be added to the
result document. This allows easier filtering of the anomaly results.


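
Very small probabilities are hard to compare by eye, which is why the `record_score` exists.
As an illustration only, the Python sketch below summarizes a record by showing its actual and typical values alongside the negative base-10 logarithm of its probability.
The `record` document and the `describe_record` helper are hypothetical, and the sketch assumes that `actual` and `typical` each hold a single value.

```python
import math


def describe_record(record):
    """Summarize one record: actual vs. typical, and how unlikely it was."""
    # Assumption: actual and typical are single-element arrays.
    actual = record["actual"][0]
    typical = record["typical"][0]
    # -log10(probability) grows as the event becomes less likely, which is
    # easier to compare than a long run of leading zeros.
    unlikeliness = -math.log10(record["probability"])
    return (f"actual={actual} typical={typical} "
            f"record_score={record['record_score']} "
            f"-log10(p)={unlikeliness:.1f}")


# Hypothetical record using the properties listed above.
record = {
    "actual": [837262434],
    "typical": [1067],
    "probability": 0.0000772031,
    "record_score": 92.0,
}

print(describe_record(record))
```

The same transformation keeps extremely small probabilities manageable: a value such as 3.24E-300 still fits in a double-precision float and maps to a value just under 300.
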
[float]
[[ml-results-categories]]
===== Categories

When `categorization_field_name` is specified in the job configuration,
it is possible to view the definitions of the resulting categories.
A category definition describes the common terms matched and contains examples of matched values.

The anomaly results from a categorization analysis are available as _buckets_, _influencers_, and _records_ results.
For example, at 16:45 there was an unusual count of log message category 11.
These definitions can be used to describe and show examples of `category_id: 11`.

A category resource has the following properties:

`category_id`::
(unsigned integer) A unique identifier for the category.

`examples`::
(array) A list of examples of actual values that matched the category.

`job_id`::
(string) The unique identifier for the job that these results belong to.

`max_matching_length`::
(unsigned integer) The maximum length of the fields that matched the category.
The value is increased by 10% to enable matching for similar fields that have not been analyzed.

`regex`::
(string) A regular expression that is used to search for values that match the category.

`terms`::
(string) A space-separated list of the common tokens that are matched in values of the category.
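
One way to use a category definition is to test whether a new log message would fall into it.
The following Python sketch is an assumption-laden illustration: the `category` document is invented, the `matches_category` helper is hypothetical, and it assumes that the category's `regex` is compatible with Python's `re` module, which may not hold for every expression.

```python
import re

# Hypothetical category definition using the properties listed above.
category = {
    "category_id": 11,
    "terms": "Connection refused host",
    "regex": ".*?Connection.+?refused.+?host.*",
    "max_matching_length": 60,
    "examples": ["Connection refused by host 10.0.0.7"],
}


def matches_category(message, category):
    """Check a message against a category's length limit, regex, and terms."""
    # Messages longer than max_matching_length are treated as non-matching.
    if len(message) > category["max_matching_length"]:
        return False
    # Assumption: the regex can be evaluated by Python's re module.
    if re.match(category["regex"], message) is None:
        return False
    # Every common token must appear somewhere in the message.
    return all(term in message for term in category["terms"].split())


for example in category["examples"]:
    print(example, matches_category(example, category))
```
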