Merge pull request #198 from opensearch-project/ad_1.1

Added AD changes for 1.1
Andrew Etter 2021-10-05 15:21:59 -07:00 committed by GitHub
commit f1b01f6179
3 changed files with 2509 additions and 1525 deletions

File diff suppressed because it is too large.


Anomaly detection automatically detects anomalies in your OpenSearch data in near real-time.
You can pair the anomaly detection plugin with the [alerting plugin]({{site.url}}{{site.baseurl}}/monitoring-plugins/alerting/) to notify you as soon as an anomaly is detected.
To use the anomaly detection plugin, your computer needs to have more than one CPU core.
{: .note }
## Get started with Anomaly Detection
To get started, choose **Anomaly Detection** in OpenSearch Dashboards.
To first test with sample streaming data, you can try out one of the preconfigured detectors with one of the sample datasets.
### Step 1: Define a detector
A detector is an individual anomaly detection task. You can define multiple detectors, and all the detectors can run simultaneously, with each analyzing data from different sources.
1. Choose **Create detector**.
1. Enter a name and brief description. Make sure the name is unique and descriptive enough to help you identify the purpose of the detector.
1. For **Data source**, choose the index you want to use as the data source. You can optionally use index patterns to choose multiple indices.
1. Select the **Timestamp field** in your index.
1. (Optional) For **Data filter**, filter the index you chose as the data source. From the **Filter type** menu, choose **Visual filter**, and then design your filter query by selecting **Fields**, **Operator**, and **Value**, or choose **Custom Expression** and add your own JSON filter query (an equivalent API request is sketched after these steps).
1. For **Operation settings**, define the **Detector interval**, which is the time interval at which the detector collects data.
- The detector aggregates the data in this interval, then feeds the aggregated result into the anomaly detection model.
The shorter you set this interval, the fewer data points the detector aggregates.
The anomaly detection model uses a shingling process, a technique that uses consecutive data points to create a sample for the model. This process needs a certain number of aggregated data points from contiguous intervals.
Set the window delay to shift the detector interval to account for this delay.
- For example, say the detector interval is 10 minutes and data is ingested into your cluster with a general delay of 1 minute.
Assume the detector runs at 2:00. The detector attempts to get the last 10 minutes of data from 1:50 to 2:00, but because of the 1-minute delay, it only gets 9 minutes of data and misses the data from 1:59 to 2:00.
Setting the window delay to 1 minute shifts the interval window to 1:49 - 1:59, so the detector accounts for all 10 minutes of the detector interval time.
1. Choose **Next**.
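If you prefer to work outside of OpenSearch Dashboards, the same definition can be submitted through the create detector API. The following is a minimal sketch; the index pattern, field names, and filter values are illustrative, not part of this walkthrough:

```
POST _plugins/_anomaly_detection/detectors
{
  "name": "test-detector",
  "description": "Detect anomalies in completed orders",
  "time_field": "timestamp",
  "indices": ["order-logs-*"],
  "filter_query": {
    "term": { "status": { "value": "complete" } }
  },
  "detection_interval": {
    "period": { "interval": 10, "unit": "Minutes" }
  },
  "window_delay": {
    "period": { "interval": 1, "unit": "Minutes" }
  }
}
```

Here `detection_interval` and `window_delay` correspond to the **Detector interval** and **Window delay** settings described above.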
After you define the detector, the next step is to configure the model.
### Step 2: Configure the model
#### Add features to your detector
A feature is the field in your index that you want to check for anomalies. A detector can discover anomalies across one or more features. You must choose an aggregation method for each feature: `average()`, `count()`, `sum()`, `min()`, or `max()`. The aggregation method determines what constitutes an anomaly.
For example, if you choose `min()`, the detector focuses on finding anomalies based on the minimum values of your feature. If you choose `average()`, the detector finds anomalies based on the average values of your feature.
A multi-feature model correlates anomalies across all its features. The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) makes it less likely for multi-feature models to identify smaller anomalies as compared to a single-feature model. Adding more features might negatively impact the [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) of a model. A higher proportion of noise in your data might further amplify this negative impact. Selecting the optimal feature set is usually an iterative process. By default, the maximum number of features for a detector is 5. You can adjust this limit with the `plugins.anomaly_detection.max_anomaly_features` setting.
{: .note }
1. On the **Configure Model** page, enter the **Feature name** and check **Enable feature**.
1. For **Find anomalies based on**, choose the method to find anomalies. For **Field Value**, choose the **aggregation method**. Or choose **Custom expression**, and add your own JSON aggregation query (a sketch of the `feature_attributes` format follows this list).
1. Select a field.
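In the create detector API, each feature corresponds to an entry in `feature_attributes`, with the aggregation expressed as a standard aggregation query. A sketch of a single `sum()` feature; the feature and field names are illustrative:

```
"feature_attributes": [
  {
    "feature_name": "sum_of_bytes",
    "feature_enabled": true,
    "aggregation_query": {
      "sum_of_bytes": {
        "sum": { "field": "bytes" }
      }
    }
  }
]
```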
#### (Optional) Set category fields for high cardinality
You can categorize anomalies based on a keyword or IP field type.
The category field categorizes or slices the source time series with a dimension like IP addresses, product IDs, country codes, and so on. This gives you a granular view of anomalies within each entity of the category field so that you can isolate and debug issues.
To set a category field, choose **Enable a category field** and select a field. You can't change the category fields after you create the detector.
Only a certain number of unique entities are supported in the category field. Use the following equation to calculate the recommended total number of entities supported in a cluster:
```
(data nodes * heap size * anomaly detection maximum memory percentage) / (entity model size of a detector)
```
To get the entity model size of a detector, use the [profile detector API]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api/#profile-detector). You can adjust the maximum memory percentage with the `plugins.anomaly_detection.model_max_size_percent` setting.
This formula provides a good starting point, but make sure to test with a representative workload.
{: .note }
For example, for a cluster with three data nodes, each with 8 GB of JVM heap size, a maximum memory percentage of 10% (default), and the entity model size of the detector as 1MB: the total number of unique entities supported is (8.096 * 10^9 * 0.1 / 1 MB ) * 3 = 2429.
If the actual total number of unique entities is higher than the number you calculate (in this case, 2429), the anomaly detector makes its best effort to model the extra entities. The detector prioritizes entities that occur more often and are more recent.
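To check the entity model size used in this calculation, you can call the profile detector API with the `models` profile type, which reports a size in bytes for each model. A sketch, with the detector ID as a placeholder:

```
GET _plugins/_anomaly_detection/detectors/<detectorId>/_profile/models
```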
#### (Advanced settings) Set a shingle size
Set the number of aggregation intervals from your data stream to consider in a detection window. It's best to choose this value based on your actual data to see which one leads to the best results for your use case.
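If you define the detector through the API, the shingle size maps to the `shingle_size` field of the detector definition. A fragment, assuming the default value of 8 consecutive intervals:

```
"shingle_size": 8
```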
For sample previews, the anomaly detection plugin selects a small number of data samples and estimates the remaining data points by interpolation to approximate the actual feature data.
Examine the sample preview and use it to fine-tune your feature configurations (for example, enable or disable features) to get more accurate results.
1. Choose **Preview sample anomalies**.
- If you don't see any sample anomaly result, check the detector interval and make sure you have more than 400 data points for some entities during the preview date range.
1. Choose **Next**.
### Step 3: Set up detector jobs
To start a real-time detector to find anomalies in your data in near real-time, check **Start real-time detector automatically (recommended)**.
Alternatively, if you want to perform historical analysis and find patterns in long historical data windows (weeks or months), check **Run historical analysis detection** and select a date range (at least 128 detection intervals).
Analyzing historical data helps you get familiar with the anomaly detection plugin. You can also evaluate the performance of a detector with historical data to further fine-tune it.
We recommend experimenting with historical analysis with different feature sets and checking the precision before moving on to real-time detectors.
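Through the API, you start historical analysis by passing a date range (epoch milliseconds) to the start detector operation. A sketch; the detector ID and timestamps are placeholders:

```
POST _plugins/_anomaly_detection/detectors/<detectorId>/_start
{
  "start_time": 1588081000000,
  "end_time": 1633392000000
}
```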
### Step 4: Review and create
Review your model configuration and select **Create detector**.
### Step 5: Observe the results
Choose the **Real-time results** or **Historical analysis** tab. For real-time results, you need to wait for some time to see the anomaly results. If the detector interval is 10 minutes, the detector might take more than an hour to start, as it's waiting for sufficient data to generate anomalies.
A shorter interval means the model passes the shingle process more quickly and starts to generate the anomaly results sooner.
Use the [profile detector]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api#profile-detector) operation to make sure you have sufficient data points.
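For example, the `init_progress` profile type shows how far along model initialization is. A sketch, with the detector ID as a placeholder:

```
GET _plugins/_anomaly_detection/detectors/<detectorId>/_profile/init_progress
```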
If you see the detector pending in "initialization" for longer than a day, aggregate your existing data using the detector interval to check for any missing data points during this time.
![Anomaly detection results]({{site.url}}{{site.baseurl}}/images/ad.png)
Analyze anomalies with the following visualizations:
- **Live anomalies** - displays live anomaly results for the last 60 intervals. For example, if the interval is 10 minutes, it shows results for the last 600 minutes. The chart refreshes every 30 seconds.
- **Anomaly history** (for historical analysis) / **Anomaly overview** (for real-time results) - plots the anomaly grade with the corresponding measure of confidence.
- **Anomaly occurrence** - shows the `Start time`, `End time`, `Data confidence`, and `Anomaly grade` for each detected anomaly.
- **Feature breakdown** - plots the features based on the aggregation method. You can vary the date-time range of the detector.
`Anomaly grade` is a number between 0 and 1 that indicates how anomalous a data point is. An anomaly grade of 0 represents “not an anomaly,” and a non-zero value represents the relative severity of the anomaly.
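To analyze results outside of these visualizations, you can search the raw anomaly results with standard query DSL. A sketch that returns only data points flagged as anomalous (`anomaly_grade` greater than 0):

```
GET _plugins/_anomaly_detection/detectors/results/_search
{
  "query": {
    "range": {
      "anomaly_grade": { "gt": 0 }
    }
  }
}
```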
If you set the category field, you see an additional **Heat map** chart. The heat map correlates results for anomalous entities. This chart is empty until you select an anomalous entity. You also see the anomaly and feature line chart for the time period of the anomaly (`anomaly_grade` > 0).
Choose and drag over the anomaly line chart to zoom in and see a more detailed view of an anomaly.
{: .note }
### Step 6: Set up alerts
Under **Real-time results**, choose **Set up alerts** and configure a monitor to notify you when anomalies are detected. For steps to create a monitor and set up notifications based on your anomaly detector, see [Monitors]({{site.url}}{{site.baseurl}}/monitoring-plugins/alerting/monitors/).
If you stop or delete a detector, make sure to delete any monitors associated with it.
To see all the configuration settings for a detector, choose the **Detector configuration** tab.
1. To make any changes to the detector configuration, or fine-tune the time interval to minimize any false positives, go to the **Detector configuration** section and choose **Edit**.
- You need to stop both real-time and historical analysis to change the detector's configuration. Confirm that you want to stop the detector and proceed.
1. To enable or disable features, in the **Features** section, choose **Edit** and adjust the feature settings as needed. After you make your changes, choose **Save and start detector**.
- Choose between automatically starting the detector (recommended) or manually starting the detector at a later time.
### Step 8: Manage your detectors
To start, stop, or delete a detector, go to the **Detectors** page.
1. Choose the detector name.
2. Choose **Actions** and select **Start real-time detectors**, **Stop real-time detectors**, or **Delete detectors**. The equivalent API calls are sketched below.
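A sketch of the corresponding API operations, with the detector ID as a placeholder:

```
POST _plugins/_anomaly_detection/detectors/<detectorId>/_start
POST _plugins/_anomaly_detection/detectors/<detectorId>/_stop
DELETE _plugins/_anomaly_detection/detectors/<detectorId>
```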


Setting | Default | Description
:--- | :--- | :---
`plugins.anomaly_detection.max_anomaly_detectors` | 1,000 | The maximum number of non-high cardinality detectors (no category field) users can create.
`plugins.anomaly_detection.max_multi_entity_anomaly_detectors` | 10 | The maximum number of high cardinality detectors (with category field) in a cluster.
`plugins.anomaly_detection.max_anomaly_features` | 5 | The maximum number of features for a detector.
`plugins.anomaly_detection.ad_result_history_retention_period` | 30d | The maximum age of the result index. If its age exceeds the threshold, the plugin deletes the rolled over result index. If the cluster has only one result index, the plugin keeps the index even if it's older than its configured retention period.
`plugins.anomaly_detection.ad_result_history_rollover_period` | 12h | How often the rollover condition is checked. If the condition is met, the anomaly detection plugin rolls over the result index to a new index.
`plugins.anomaly_detection.ad_result_history_max_docs_per_shard` | 1,350,000,000 | The maximum number of documents in a single shard of the result index. The anomaly detection plugin only counts the refreshed documents in the primary shards.
`plugins.anomaly_detection.max_entities_per_query` | 1,000,000 | The maximum unique values per detection interval for high cardinality detectors. By default, if the category field(s) have more than the configured unique values in a detector interval, the anomaly detection plugin orders them by the natural ordering of categorical values (for example, entity `ab` comes before `bc`) and then selects the top values.
`plugins.anomaly_detection.max_entities_for_preview` | 5 | The maximum unique category field values displayed with the preview operation for high cardinality detectors. By default, if the category field(s) have more than the configured unique values in a detector interval, the anomaly detection plugin orders them by the natural ordering of categorical values (for example, entity `ab` comes before `bc`) and then selects the top values.
`plugins.anomaly_detection.max_primary_shards` | 10 | The maximum number of primary shards an anomaly detection index can have.
`plugins.anomaly_detection.filter_by_backend_roles` | False | When you enable the security plugin and set this to `true`, the anomaly detection plugin filters results based on the user's backend role(s).
`plugins.anomaly_detection.max_batch_task_per_node` | 10 | Starting a historical analysis triggers a batch task. This setting is the number of batch tasks that you can run per data node. You can tune this setting from 1 to 1,000. If the data nodes can't support all batch tasks and you're not sure if the data nodes are capable of running more historical analyses, add more data nodes instead of changing this setting to a higher value. Increasing this value might increase the load on each data node.
`plugins.anomaly_detection.max_old_ad_task_docs_per_detector` | 1 | You can run historical analysis for the same detector many times. For each run, the anomaly detection plugin creates a new task. This setting is the number of previous tasks the plugin keeps. Set this value to at least 1 to track its last run. You can keep a maximum of 1,000 old tasks to avoid overwhelming the cluster.
`plugins.anomaly_detection.batch_task_piece_size` | 1,000 | The date range for a historical task is split into smaller pieces, and the anomaly detection plugin runs the task piece by piece. Each piece contains 1,000 detection intervals by default. For example, if the detector interval is 1 minute and one piece is 1,000 minutes, the feature data is queried every 1,000 minutes. You can change this setting from 1 to 10,000.
`plugins.anomaly_detection.batch_task_piece_interval_seconds` | 5 | Add a time interval between two pieces of the same historical analysis task. This interval prevents the task from consuming too much of the available resources and starving other operations like search and bulk index. You can change this setting from 1 to 600 seconds.
`plugins.anomaly_detection.max_top_entities_for_historical_analysis` | 1,000 | The maximum number of top entities that you can run in a historical analysis for a high cardinality detector. The range is from 1 to 10,000.
`plugins.anomaly_detection.max_running_entities_per_detector_for_historical_analysis` | 10 | The number of entity tasks that you can run in parallel for a single high cardinality detector. The task slots available on your cluster also impact how many entities run in parallel. If a cluster has 3 data nodes, each data node has 10 task slots by default. Say you already have two high cardinality detectors and each of them runs 10 entities. If you start a single-entity detector that takes 1 task slot, the number of task slots available is 10 * 3 - 10 * 2 - 1 = 9. If you now start a new high cardinality detector, the detector can only run 9 entities in parallel and not 10. You can tune this value from 1 to 1,000 based on your cluster's capability. If you set a higher value, the anomaly detection plugin runs historical analysis faster but also consumes more resources.
`plugins.anomaly_detection.max_cached_deleted_tasks` | 1,000 | You can rerun historical analysis for a single detector as many times as you like. The anomaly detection plugin only keeps a limited number of old tasks, by default 1 old task. If you run historical analysis three times for a detector, the oldest task is deleted. Because historical analysis generates a number of anomaly results in a short span of time, it's necessary to clean up anomaly results for a deleted task. With this field, you can configure how many deleted tasks you can cache at most. The plugin cleans up a task's results when it's deleted. If the plugin fails to do this cleanup, it adds the task's results into a cache and an hourly cron job performs the cleanup. You can use this setting to limit how many old tasks are put into the cache to avoid a DDoS attack. After an hour, if you still find an old task result in the cache, use the [delete detector results API]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api/#delete-detector-results) to delete the task result manually. You can tune this setting from 1 to 10,000.
`plugins.anomaly_detection.delete_anomaly_result_when_delete_detector` | False | Whether the anomaly detection plugin deletes the anomaly results when you delete a detector. If you want to save some disk space, especially if you have high cardinality detectors generating a lot of results, set this field to `true`. Alternatively, you can use the [delete detector results API]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api/#delete-detector-results) to manually delete the results.
`plugins.anomaly_detection.dedicated_cache_size` | 10 | If the real-time analysis of a high cardinality detector starts successfully, the anomaly detection plugin guarantees keeping 10 (dynamically adjustable via this setting) entities' models in memory per node. If the number of entities exceeds this limit, the plugin puts the extra entities' models in a memory space shared by all detectors. The actual number of entities varies based on the memory you have available and the frequencies of the entities. If you'd like the plugin to guarantee keeping more entities' models in memory and your cluster has sufficient memory, you can increase this setting value.
`plugins.anomaly_detection.max_concurrent_preview` | 2 | The maximum number of concurrent previews. You can use this setting to limit resource usage.
`plugins.anomaly_detection.model_max_size_percent` | 0.1 | The upper bound of the memory percentage for a model.
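These settings are dynamic, so you can change them with the cluster settings API without restarting any nodes. A sketch that lowers the feature limit; the value is illustrative:

```
PUT _cluster/settings
{
  "persistent": {
    "plugins.anomaly_detection.max_anomaly_features": 3
  }
}
```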