//lcawley Verified example output 2017-04-11
[[ml-results-resource]]
==== Results Resources
The results of a job are organized into _records_ and _buckets_.
The results are aggregated and normalized to identify the mathematically
significant anomalies.
When categorization is specified, the results also contain category definitions.
* <<ml-results-records,Records>>
* <<ml-results-influencers,Influencers>>
* <<ml-results-buckets,Buckets>>
* <<ml-results-categories,Categories>>
[float]
[[ml-results-records]]
===== Records
Records contain the analytic results. They detail the anomalous activity that
has been identified in the input data based upon the detector configuration.
For example, if you are looking for unusually large data transfers,
an anomaly record would identify the source IP address, the destination,
the time window during which it occurred, the expected and actual size of the
transfer and the probability of this occurring.
Something that is highly improbable is therefore highly anomalous.
There can be many anomaly records, depending on the characteristics and size
of the input data; in practice, there are often too many to process manually.
The {xpack} {ml} features therefore perform a sophisticated aggregation of
the anomaly records into buckets.
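
Because a highly improbable record is a highly anomalous one, a simple way to review records by hand is to sort them by ascending `probability`. The following Python sketch assumes the records have already been retrieved as a list of dictionaries; the sample values and the `most_anomalous` helper are illustrative, not part of the API.

```python
# Illustrative records; only the properties needed for ranking are shown.
records = [
    {"probability": 0.0000772031, "actual": 1200.0, "typical": 90.0},
    {"probability": 0.03, "actual": 110.0, "typical": 90.0},
    {"probability": 3.24e-300, "actual": 98000.0, "typical": 90.0},
]

def most_anomalous(recs, limit=10):
    """Return up to `limit` records, least probable (most anomalous) first."""
    return sorted(recs, key=lambda r: r["probability"])[:limit]

for rec in most_anomalous(records, limit=2):
    print(rec["probability"], rec["actual"], rec["typical"])
```

This kind of manual ranking only scales to small result sets, which is why the bucket-level aggregation exists.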
A record object has the following properties:
`actual`::
(+number+) The actual value for the bucket.
`bucket_span`::
(+number+) The length of the bucket in seconds.
This value matches the `bucket_span` that is specified in the job.
//`byFieldName`::
//TBD: This field did not appear in my results, but it might be a valid property.
// (+string+) The name of the analyzed field, if it was specified in the detector.
//`byFieldValue`::
//TBD: This field did not appear in my results, but it might be a valid property.
// (+string+) The value of `by_field_name`, if it was specified in the detector.
//`causes`
//TBD: This field did not appear in my results, but it might be a valid property.
// (+array+) If an over field was specified in the detector, this property
// contains an array of anomaly records that are the causes for the anomaly
// that has been identified for the over field.
// If no over fields exist, this field will not be present.
// This sub-resource contains the most anomalous records for the `over_field_name`.
// For scalability reasons, a maximum of the 10 most significant causes of
// the anomaly will be returned. As part of the core analytical modeling,
// these low-level anomaly records are aggregated for their parent over field record.
// The causes resource contains similar elements to the record resource,
// namely actual, typical, *FieldName and *FieldValue.
// Probability and scores are not applicable to causes.
`detector_index`::
(+number+) A unique identifier for the detector.
`field_name`::
(+string+) Certain functions require a field to operate on.
For those functions, this is the name of the field to be analyzed.
`function`::
(+string+) The function in which the anomaly occurs.
`function_description`::
(+string+) The description of the function in which the anomaly occurs, as
specified in the detector configuration information.
`influencers`::
(+array+) If `influencers` was specified in the detector configuration, then
this array contains influencers that contributed to or were to blame for an
anomaly.
`initial_record_score`::
(++) TBD. For example, 94.1386.
`is_interim`::
(+boolean+) If true, then this anomaly record is an interim result.
In other words, it is calculated based on partial input data.
`job_id`::
(+string+) A numerical character string that uniquely identifies the job.
//`kpi_indicator`::
// (++) TBD. For example, ["online_purchases"]
// I did not receive this in later tests. Is it still valid?
`partition_field_name`::
(+string+) The name of the partition field that was used in the analysis, if
such a field was specified in the detector.
//`overFieldName`::
// TBD: This field did not appear in my results, but it might be a valid property.
// (+string+) The name of the over field, if `over_field_name` was specified
// in the detector.
`partition_field_value`::
(+string+) The value of the partition field that was used in the analysis, if
`partition_field_name` was specified in the detector.
`probability`::
(+number+) The probability of the individual anomaly occurring.
This value is in the range 0 to 1. For example, 0.0000772031.
//This value is held to a high precision of over 300 decimal places.
//In scientific notation, a value of 3.24E-300 is highly unlikely and therefore
//highly anomalous.
`record_score`::
(+number+) An anomaly score for the bucket time interval.
The score is calculated based on a sophisticated aggregation of the anomalies
in the bucket.
//Use this score for rate-controlled alerting.
`result_type`::
(+string+) TBD. For example, "record".
`sequence_num`::
(++) TBD. For example, 1.
`timestamp`::
(+date+) The start time of the bucket that contains the record, expressed in
milliseconds since the epoch. For example, 1454020800000.
`typical`::
(+number+) The typical value for the bucket, according to analytical modeling.
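
As a rough illustration of how `timestamp` and `bucket_span` relate, the following Python sketch derives the time window of the bucket that contains a record. The record is hypothetical: the property names come from the list above, the values are illustrative, and the timestamp is assumed to be in milliseconds since the epoch, as the example value suggests. The `bucket_window` helper is not part of the API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical record; values are illustrative only.
record = {
    "result_type": "record",
    "timestamp": 1454020800000,   # assumed: milliseconds since the epoch
    "bucket_span": 300,           # seconds
    "probability": 0.0000772031,
    "is_interim": False,
}

def bucket_window(rec):
    """Return the (start, end) of the bucket containing the record, in UTC."""
    start = datetime.fromtimestamp(rec["timestamp"] / 1000, tz=timezone.utc)
    end = start + timedelta(seconds=rec["bucket_span"])
    return start, end

start, end = bucket_window(record)
print(start.isoformat(), end.isoformat())
# prints 2016-01-28T22:40:00+00:00 2016-01-28T22:45:00+00:00
```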
[float]
[[ml-results-influencers]]
===== Influencers
Influencers are the entities that have contributed to, or are to blame for,
the anomalies. Influencers are given an anomaly score, which is calculated
based on the anomalies that have occurred in each bucket interval.
For jobs with more than one detector, this gives a powerful view of the most
anomalous entities.
Upon identifying an influencer with a high score, you can investigate further
by accessing the records resource for that bucket and enumerating the anomaly
records that contain this influencer.
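
The investigation step described above can be sketched in Python as a filter over the records of a bucket. This is a minimal sketch under stated assumptions: this page does not specify the shape of the entries in a record's `influencers` array, so the `influencer_field_name` and `influencer_field_values` keys used here are assumptions, as are the sample data and the `records_for_influencer` helper.

```python
def records_for_influencer(records, field_name, field_value):
    """Return the records whose `influencers` array mentions the given entity."""
    matches = []
    for rec in records:
        for inf in rec.get("influencers", []):
            # Assumed entry shape: a field name plus a list of values.
            if (inf.get("influencer_field_name") == field_name
                    and field_value in inf.get("influencer_field_values", [])):
                matches.append(rec)
                break  # one match is enough; do not add the record twice
    return matches

# Illustrative records from a single bucket.
records = [
    {"timestamp": 1454943900000, "record_score": 94.1,
     "influencers": [{"influencer_field_name": "src_ip",
                      "influencer_field_values": ["10.0.0.1"]}]},
    {"timestamp": 1454943900000, "record_score": 12.5,
     "influencers": [{"influencer_field_name": "src_ip",
                      "influencer_field_values": ["10.0.0.2"]}]},
    {"timestamp": 1454943900000, "record_score": 40.0},  # no influencers
]

hits = records_for_influencer(records, "src_ip", "10.0.0.1")
print([r["record_score"] for r in hits])
```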
An influencer object has the following properties:
`bucket_span`::
(++) TBD. For example, 300.
// Same as for buckets? i.e. (+unsigned integer+) The length of the bucket in seconds.
// This value is equal to the `bucket_span` value in the job configuration.
`influencer_score`::
(+number+) An anomaly score for the influencer in this bucket time interval.
The score is calculated based upon a sophisticated aggregation of the anomalies
in the bucket for this entity. For example: 94.1386.
`initial_influencer_score`::
(++) TBD. For example, 83.3831.
`influencer_field_name`::
(+string+) The field name of the influencer.
`influencer_field_value`::
(+string+) The entity that influenced, contributed to, or was to blame for the
anomaly.
`is_interim`::
(+boolean+) If true, then this is an interim result.
In other words, it is calculated based on partial input data.
`job_id`::
(+string+) A numerical character string that uniquely identifies the job.
`kpi_indicator`::
(++) TBD. For example, "online_purchases".
`probability`::
(+number+) The probability that the influencer has this behavior.
This value is in the range 0 to 1. For example, 0.0000109783.
// For example, 0.03 means 3%. This value is held to a high precision of over
//300 decimal places. In scientific notation, a value of 3.24E-300 is highly
//unlikely and therefore highly anomalous.
`result_type`::
(++) TBD. For example, "influencer".
`sequence_num`::
(++) TBD. For example, 2.
`timestamp`::
(+date+) Influencers are produced in buckets. This value is the start time
of the bucket, expressed in milliseconds since the epoch. For example, 1454943900000.
A bucket influencer object has the following properties:
`anomaly_score`::
(+number+) TBD
//It is unclear how this differs from the influencer_score.
//An anomaly score for the influencer in this bucket time interval.
//The score is calculated based upon a sophisticated aggregation of the anomalies
//in the bucket for this entity. For example: 94.1386.
`bucket_span`::
(++) TBD. For example, 300.
////
// Same as for buckets? i.e. (+unsigned integer+) The length of the bucket in seconds.
// This value is equal to the `bucket_span` value in the job configuration.
////
`initial_anomaly_score`::
(++) TBD. For example, 83.3831.
`influencer_field_name`::
(+string+) The field name of the influencer.
`is_interim`::
(+boolean+) If true, then this is an interim result.
In other words, it is calculated based on partial input data.
`job_id`::
(+string+) A numerical character string that uniquely identifies the job.
`probability`::
(+number+) The probability that the influencer has this behavior.
This value is in the range 0 to 1. For example, 0.0000109783.
// For example, 0.03 means 3%. This value is held to a high precision of over
//300 decimal places. In scientific notation, a value of 3.24E-300 is highly
//unlikely and therefore highly anomalous.
`raw_anomaly_score`::
(++) TBD. For example, 2.32119.
`result_type`::
(++) TBD. For example, "bucket_influencer".
`sequence_num`::
(++) TBD. For example, 2.
`timestamp`::
(+date+) Influencers are produced in buckets. This value is the start time
of the bucket, expressed in milliseconds since the epoch. For example, 1454943900000.
[float]
[[ml-results-buckets]]
===== Buckets
Buckets are the grouped and time-ordered view of the job results.
A bucket time interval is defined by `bucket_span`, which is specified in the
job configuration.
Each bucket has an `anomaly_score`, which is a statistically aggregated and
normalized view of the combined anomalousness of the records. You can use this
score for rate-controlled alerting.
//TBD: Still correct?
//Each bucket also has a maxNormalizedProbability that is equal to the highest
//normalizedProbability of the records within the bucket. This gives an indication
// of the most anomalous event that has occurred within the time interval.
//Unlike anomalyScore this does not take into account the number of correlated
//anomalies that have happened.
Upon identifying an anomalous bucket, you can investigate further by either
expanding the bucket resource to show the records as nested objects or by
accessing the records resource directly and filtering on a date range.
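
The workflow above (flag a bucket by its `anomaly_score`, then look at the records in its time range) might look like the following Python sketch. It assumes buckets and records are already available as lists of dictionaries with millisecond timestamps; the threshold value, the choice to skip interim results, and both helper functions are illustrative assumptions rather than documented behavior.

```python
ALERT_THRESHOLD = 75.0  # hypothetical cut-off; tune for your own data

def anomalous_buckets(buckets, threshold=ALERT_THRESHOLD):
    """Return final (non-interim) buckets whose anomaly_score meets the threshold."""
    return [b for b in buckets
            if not b.get("is_interim", False) and b["anomaly_score"] >= threshold]

def records_in_bucket(records, bucket):
    """Return records whose timestamp lies in [start, start + bucket_span)."""
    start = bucket["timestamp"]
    end = start + bucket["bucket_span"] * 1000  # seconds -> milliseconds
    return [r for r in records if start <= r["timestamp"] < end]

# Illustrative data only.
buckets = [
    {"timestamp": 1454020800000, "bucket_span": 300,
     "anomaly_score": 94.1, "is_interim": False},
    {"timestamp": 1454021100000, "bucket_span": 300,
     "anomaly_score": 3.2, "is_interim": False},
]
records = [
    {"timestamp": 1454020800000, "record_score": 94.1},
    {"timestamp": 1454021100000, "record_score": 2.0},
]

for bucket in anomalous_buckets(buckets):
    print(bucket["timestamp"], records_in_bucket(records, bucket))
```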
A bucket resource has the following properties:
`anomaly_score`::
(+number+) The aggregated and normalized anomaly score.
All the anomaly records in the bucket contribute to this score.
`bucket_influencers`::
(+array+) An array of influencer objects.
For more information, see <<ml-results-influencers,Influencers>>.
`bucket_span`::
(+unsigned integer+) The length of the bucket in seconds. This value is
equal to the `bucket_span` value in the job configuration.
`event_count`::
(+unsigned integer+) The number of input data records processed in this bucket.
`initial_anomaly_score`::
(+number+) The value of `anomaly_score` at the time the bucket result was
created. This score is normalized based on the data that had been seen at that
time; it is not renormalized and therefore is not adjusted for more recent data.
//TBD. This description is unclear.
`is_interim`::
(+boolean+) If true, then this bucket result is an interim result.
In other words, it is calculated based on partial input data.
`job_id`::
(+string+) A numerical character string that uniquely identifies the job.
`partition_scores`::
(+TBD+) TBD. For example, [].
`processing_time_ms`::
(+unsigned integer+) The time in milliseconds taken to analyze the bucket
contents and produce results.
`record_count`::
(+unsigned integer+) The number of anomaly records in this bucket.
`result_type`::
(+string+) TBD. For example, "bucket".
`timestamp`::
(+date+) The start time of the bucket, expressed in milliseconds since the epoch.
For example, 1454020800000. This timestamp uniquely identifies the bucket.
NOTE: Events that occur exactly at the timestamp of the bucket are included in
the results for the bucket.
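
The note about bucket boundaries can be made concrete with a small Python sketch. It assumes that bucket start times are aligned to whole multiples of `bucket_span` counted from the epoch, which is consistent with the example timestamps on this page but is not stated here; the `bucket_start` helper is hypothetical.

```python
def bucket_start(event_ms, bucket_span_s):
    """Return the start (ms since the epoch) of the bucket containing an event.

    Assumes buckets are aligned to multiples of bucket_span from the epoch.
    """
    span_ms = bucket_span_s * 1000
    return event_ms - (event_ms % span_ms)

start = 1454020800000  # example bucket timestamp, with a 300 second span

print(bucket_start(start, 300))           # event exactly at the bucket start
print(bucket_start(start + 299999, 300))  # last millisecond of the same bucket
print(bucket_start(start + 300000, 300))  # first millisecond of the next bucket
```

An event that falls exactly on a bucket timestamp maps to that bucket, while an event one bucket span later maps to the next one, so each interval is inclusive at its start and exclusive at its end.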
[float]
[[ml-results-categories]]
===== Categories
When `categorization_field_name` is specified in the job configuration, it is
possible to view the definitions of the resulting categories. A category
definition describes the common terms matched and contains examples of matched
values.
A category resource has the following properties:
`category_id`::
(+unsigned integer+) A unique identifier for the category.
`examples`::
(+array+) A list of examples of actual values that matched the category.
`job_id`::
(+string+) A numerical character string that uniquely identifies the job.
`max_matching_length`::
(+unsigned integer+) The maximum length of the fields that matched the
category.
//TBD: Still true? "The value is increased by 10% to enable matching for
//similar fields that have not been analyzed"
`regex`::
(+string+) A regular expression that is used to search for values that match
the category.
`terms`::
(+string+) A space-separated list of the common tokens that are matched in
values of the category.
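
To show how the `regex` and `terms` properties might be used together, the following Python sketch checks messages against a category definition. The category shown is entirely hypothetical, and the sketch assumes that the stored regular expression is compatible with Python's `re` module, which may not hold for every pattern.

```python
import re

# Hypothetical category definition; values are illustrative only.
category = {
    "category_id": 1,
    "terms": "Node shut down",
    "regex": ".*?Node.+?shut.+?down.*",
    "examples": ["Node 1 shut down", "Node 2 shut down unexpectedly"],
}

def matches_category(cat, message):
    """Return True if the whole message matches the category's regex."""
    return re.fullmatch(cat["regex"], message) is not None

# `terms` is a space-separated string, so split it to get individual tokens.
tokens = category["terms"].split()
print(tokens)

for example in category["examples"] + ["Disk full"]:
    print(example, "->", matches_category(category, example))
```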