Merge pull request #460 from opensearch-project/ml-commons

Add ML commons Plugin section
2022-03-22 12:53:11 -05:00 · 2022-03-22 12:53:11 -05:00 · c81c33fc75
parent 866a36f570 0417e971b8
commit c81c33fc75
5 changed files with 865 additions and 0 deletions
--- a/_config.yml
+++ b/_config.yml
@ -52,6 +52,9 @@ collections:
  observability-plugin:
    permalink: /:collection/:path/
    output: true
+  ml-commons-plugin:
+    permalink: /:collection/:path/
+    output: true
  monitoring-plugins:
    permalink: /:collection/:path/
    output: true
@ -94,6 +97,9 @@ just_the_docs:
    observability-plugin:
      name: Observability plugin
      nav_fold: true
+    ml-commons-plugin:
+      name: ML Commons plugin
+      nav_fold: true
    monitoring-plugins:
      name: Monitoring plugins
      nav_fold: true
--- a/_ml-commons-plugin/api.md
+++ b/_ml-commons-plugin/api.md
@ -0,0 +1,742 @@
+---
+layout: default
+title: API
+has_children: false
+nav_order: 99
+---
+
+# ML Commons API 
+
+---
+
+#### Table of contents
+- TOC
+{:toc}
+
+
+---
+
+The Machine Learning (ML) Commons API lets you train ML algorithms synchronously and asynchronously, make predictions with the trained model, and train and predict with the same data set.
+
+To train a model through the API, you need three inputs:
+
+- Algorithm name: Must be one of a [FunctionName](https://github.com/opensearch-project/ml-commons/blob/1.3/common/src/main/java/org/opensearch/ml/common/parameter/FunctionName.java). This determines what algorithm the ML Engine runs. To add a new function, see [How To Add a New Function](https://github.com/opensearch-project/ml-commons/blob/main/docs/how-to-add-new-function.md).
+- Model hyperparameters: Adjust these parameters to improve how the model trains.
+- Input data: The data that trains the ML model, or that the trained model makes predictions on. You can provide input data in two ways: query an index or supply a data frame.
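+
+As a rough illustration of the two input styles, the request body fragments below reuse field names from the examples later on this page. The index name `iris_data` and the column name `k1` are sample values only, and it is an assumption here that `_train` accepts `input_data` in the same shape as the Train and predict example.
+
+```json
+{
+    "input_query": {
+        "_source": ["petal_length_in_cm", "petal_width_in_cm"],
+        "size": 10000
+    },
+    "input_index": ["iris_data"]
+}
+```
+
+```json
+{
+    "input_data": {
+        "column_metas": [
+            { "name": "k1", "column_type": "DOUBLE" }
+        ],
+        "rows": [
+            { "values": [ { "column_type": "DOUBLE", "value": 1.00 } ] }
+        ]
+    }
+}
+```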
+
+## Train model
+
+Training can occur both synchronously and asynchronously.
+
+### Request 
+
+The following examples use the kmeans algorithm to train a model on indexed data.
+
+**Train with kmeans synchronously** 
+
+```json
+POST /_plugins/_ml/_train/kmeans
+{
+    "parameters": {
+        "centroids": 3,
+        "iterations": 10,
+        "distance_type": "COSINE"
+    },
+    "input_query": {
+        "_source": ["petal_length_in_cm", "petal_width_in_cm"],
+        "size": 10000
+    },
+    "input_index": [
+        "iris_data"
+    ]
+}
+```
+
+**Train with kmeans asynchronously**
+
+```json
+POST /_plugins/_ml/_train/kmeans?async=true
+{
+    "parameters": {
+        "centroids": 3,
+        "iterations": 10,
+        "distance_type": "COSINE"
+    },
+    "input_query": {
+        "_source": ["petal_length_in_cm", "petal_width_in_cm"],
+        "size": 10000
+    },
+    "input_index": [
+        "iris_data"
+    ]
+}
+```
+
+### Response
+
+**Synchronously**
+
+For synchronous responses, the API returns the model_id, which can be used to get or delete a model.
+
+```json
+{
+  "model_id" : "lblVmX8BO5w8y8RaYYvN",
+  "status" : "COMPLETED"
+}
+```
+
+**Asynchronously**
+
+For asynchronous responses, the API returns the task_id, which can be used to get or delete a task.
+
+```json
+{
+  "task_id" : "lrlamX8BO5w8y8Ra2otd",
+  "status" : "CREATED"
+}
+```
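+
+A likely follow-up to an asynchronous request, sketched here with the sample `task_id` from the response above, is to look up the task until its `state` is `COMPLETED` and then read the `model_id` from the task document. The endpoint is the one described under Get task information.
+
+```json
+GET /_plugins/_ml/tasks/lrlamX8BO5w8y8Ra2otd
+```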
+
+## Get model information
+
+You can retrieve information on your model using the model_id.
+
+```json
+GET /_plugins/_ml/models/<model_id>
+```
+
+The API returns information on the model, the algorithm used, and the content found within the model.
+
+```json
+{
+  "name" : "KMEANS",
+  "algorithm" : "KMEANS",
+  "version" : 1,
+  "content" : ""
+}
+```
+
+## Search model
+
+Use this command to search models you've already created.
+
+
+```json
+POST /_plugins/_ml/models/_search
+{query}
+```
+
+### Example: Query all models
+
+```json
+POST /_plugins/_ml/models/_search
+{
+  "query": {
+    "match_all": {}
+  },
+  "size": 1000
+}
+```
+
+### Example: Query models with algorithm "FIT_RCF"
+
+```json
+POST /_plugins/_ml/models/_search
+{
+  "query": {
+    "term": {
+      "algorithm": {
+        "value": "FIT_RCF"
+      }
+    }
+  }
+}
+```
+
+### Response
+
+```json
+{
+    "took" : 8,
+    "timed_out" : false,
+    "_shards" : {
+      "total" : 1,
+      "successful" : 1,
+      "skipped" : 0,
+      "failed" : 0
+    },
+    "hits" : {
+      "total" : {
+        "value" : 2,
+        "relation" : "eq"
+      },
+      "max_score" : 2.4159138,
+      "hits" : [
+        {
+          "_index" : ".plugins-ml-model",
+          "_type" : "_doc",
+          "_id" : "-QkKJX8BvytMh9aUeuLD",
+          "_version" : 1,
+          "_seq_no" : 12,
+          "_primary_term" : 15,
+          "_score" : 2.4159138,
+          "_source" : {
+            "name" : "FIT_RCF",
+            "version" : 1,
+            "content" : "xxx",
+            "algorithm" : "FIT_RCF"
+          }
+        },
+        {
+          "_index" : ".plugins-ml-model",
+          "_type" : "_doc",
+          "_id" : "OxkvHn8BNJ65KnIpck8x",
+          "_version" : 1,
+          "_seq_no" : 2,
+          "_primary_term" : 8,
+          "_score" : 2.4159138,
+          "_source" : {
+            "name" : "FIT_RCF",
+            "version" : 1,
+            "content" : "xxx",
+            "algorithm" : "FIT_RCF"
+          }
+        }
+      ]
+    }
+}
+```
+
+## Delete model
+
+Deletes a model based on the model_id.
+
+```json
+DELETE /_plugins/_ml/models/<model_id>
+```
+
+The API returns the following:
+
+```json
+{
+  "_index" : ".plugins-ml-model",
+  "_type" : "_doc",
+  "_id" : "MzcIJX8BA7mbufL6DOwl",
+  "_version" : 2,
+  "result" : "deleted",
+  "_shards" : {
+    "total" : 2,
+    "successful" : 2,
+    "failed" : 0
+  },
+  "_seq_no" : 27,
+  "_primary_term" : 18
+}
+```
+
+## Predict
+
+ML Commons can make predictions on new data with your trained model, using either indexed data or a data frame. The model_id is required to use the Predict API.
+
+```json
+POST /_plugins/_ml/_predict/<algorithm_name>/<model_id>
+```
+
+### Request
+
+```json
+POST /_plugins/_ml/_predict/kmeans/<model_id>
+{
+    "input_query": {
+        "_source": ["petal_length_in_cm", "petal_width_in_cm"],
+        "size": 10000
+    },
+    "input_index": [
+        "iris_data"
+    ]
+}
+```
+
+### Response
+
+```json
+{
+  "status" : "COMPLETED",
+  "prediction_result" : {
+    "column_metas" : [
+      {
+        "name" : "ClusterID",
+        "column_type" : "INTEGER"
+      }
+    ],
+    "rows" : [
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 1
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 1
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 0
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 0
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 0
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 0
+          }
+        ]
+      }
+    ]
+  }
+}
+```
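+
+The request above reads from an index. For the data frame case, the sketch below assumes the Predict API takes `input_data` in the same shape as the Train and predict example; the column names and values are illustrative only.
+
+```json
+POST /_plugins/_ml/_predict/kmeans/<model_id>
+{
+    "input_data": {
+        "column_metas": [
+            { "name": "petal_length_in_cm", "column_type": "DOUBLE" },
+            { "name": "petal_width_in_cm", "column_type": "DOUBLE" }
+        ],
+        "rows": [
+            {
+                "values": [
+                    { "column_type": "DOUBLE", "value": 1.4 },
+                    { "column_type": "DOUBLE", "value": 0.2 }
+                ]
+            }
+        ]
+    }
+}
+```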
+
+
+## Train and predict
+
+Use this API to train a model and then immediately predict against the same training data set. It can only be used with unsupervised learning models and the following algorithms:
+
+- BATCH_RCF
+- FIT_RCF
+- kmeans
+
+### Example: Train and predict with indexed data
+
+
+```json
+POST /_plugins/_ml/_train_predict/kmeans
+{
+    "parameters": {
+        "centroids": 2,
+        "iterations": 10,
+        "distance_type": "COSINE"
+    },
+    "input_query": {
+        "query": {
+            "bool": {
+                "filter": [
+                    {
+                        "range": {
+                            "k1": {
+                                "gte": 0
+                            }
+                        }
+                    }
+                ]
+            }
+        },
+        "size": 10
+    },
+    "input_index": [
+        "test_data"
+    ]
+}
+```
+
+### Example: Train and predict with data directly
+
+```json
+POST /_plugins/_ml/_train_predict/kmeans
+{
+    "parameters": {
+        "centroids": 2,
+        "iterations": 1,
+        "distance_type": "EUCLIDEAN"
+    },
+    "input_data": {
+        "column_metas": [
+            {
+                "name": "k1",
+                "column_type": "DOUBLE"
+            },
+            {
+                "name": "k2",
+                "column_type": "DOUBLE"
+            }
+        ],
+        "rows": [
+            {
+                "values": [
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 1.00
+                    },
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 2.00
+                    }
+                ]
+            },
+            {
+                "values": [
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 1.00
+                    },
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 4.00
+                    }
+                ]
+            },
+            {
+                "values": [
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 1.00
+                    },
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 0.00
+                    }
+                ]
+            },
+            {
+                "values": [
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 10.00
+                    },
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 2.00
+                    }
+                ]
+            },
+            {
+                "values": [
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 10.00
+                    },
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 4.00
+                    }
+                ]
+            },
+            {
+                "values": [
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 10.00
+                    },
+                    {
+                        "column_type": "DOUBLE",
+                        "value": 0.00
+                    }
+                ]
+            }
+        ]
+    }
+}
+```
+
+### Response
+
+```json
+{
+  "status" : "COMPLETED",
+  "prediction_result" : {
+    "column_metas" : [
+      {
+        "name" : "ClusterID",
+        "column_type" : "INTEGER"
+      }
+    ],
+    "rows" : [
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 1
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 1
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 1
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 0
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 0
+          }
+        ]
+      },
+      {
+        "values" : [
+          {
+            "column_type" : "INTEGER",
+            "value" : 0
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+## Get task information
+
+You can retrieve information about a task using the task_id.
+
+```json
+GET /_plugins/_ml/tasks/<task_id>
+```
+
+The response includes information about the task.
+
+```json
+{
+  "model_id" : "l7lamX8BO5w8y8Ra2oty",
+  "task_type" : "TRAINING",
+  "function_name" : "KMEANS",
+  "state" : "COMPLETED",
+  "input_type" : "SEARCH_QUERY",
+  "worker_node" : "54xOe0w8Qjyze00UuLDfdA",
+  "create_time" : 1647545342556,
+  "last_update_time" : 1647545342587,
+  "is_async" : true
+}
+```
+
+## Search task
+
+Search tasks based on parameters indicated in the request body.
+
+```json
+GET /_plugins/_ml/tasks/_search
+{query body}
+```
+
+
+### Example: Search tasks whose "function_name" is "KMEANS"
+
+```json
+GET /_plugins/_ml/tasks/_search
+{
+  "query": {
+    "bool": {
+      "filter": [
+        {
+          "term": {
+            "function_name": "KMEANS"
+          }
+        }
+      ]
+    }
+  }
+}
+```
+
+### Response
+
+```json
+{
+  "took" : 12,
+  "timed_out" : false,
+  "_shards" : {
+    "total" : 1,
+    "successful" : 1,
+    "skipped" : 0,
+    "failed" : 0
+  },
+  "hits" : {
+    "total" : {
+      "value" : 2,
+      "relation" : "eq"
+    },
+    "max_score" : 0.0,
+    "hits" : [
+      {
+        "_index" : ".plugins-ml-task",
+        "_type" : "_doc",
+        "_id" : "_wnLJ38BvytMh9aUi-Ia",
+        "_version" : 4,
+        "_seq_no" : 29,
+        "_primary_term" : 4,
+        "_score" : 0.0,
+        "_source" : {
+          "last_update_time" : 1645640125267,
+          "create_time" : 1645640125209,
+          "is_async" : true,
+          "function_name" : "KMEANS",
+          "input_type" : "SEARCH_QUERY",
+          "worker_node" : "jjqFrlW7QWmni1tRnb_7Dg",
+          "state" : "COMPLETED",
+          "model_id" : "AAnLJ38BvytMh9aUi-M2",
+          "task_type" : "TRAINING"
+        }
+      },
+      {
+        "_index" : ".plugins-ml-task",
+        "_type" : "_doc",
+        "_id" : "wwRRLX8BydmmU1x6I-AI",
+        "_version" : 3,
+        "_seq_no" : 38,
+        "_primary_term" : 7,
+        "_score" : 0.0,
+        "_source" : {
+          "last_update_time" : 1645732766656,
+          "create_time" : 1645732766472,
+          "is_async" : true,
+          "function_name" : "KMEANS",
+          "input_type" : "SEARCH_QUERY",
+          "worker_node" : "A_IiqoloTDK01uZvCjREaA",
+          "state" : "COMPLETED",
+          "model_id" : "xARRLX8BydmmU1x6I-CG",
+          "task_type" : "TRAINING"
+        }
+      }
+    ]
+  }
+}
+```
+
+## Delete task
+
+Delete a task based on the task_id.
+
+```json
+DELETE /_plugins/_ml/tasks/<task_id>
+```
+
+The API returns the following:
+
+```json
+{
+  "_index" : ".plugins-ml-task",
+  "_type" : "_doc",
+  "_id" : "xQRYLX8BydmmU1x6nuD3",
+  "_version" : 4,
+  "result" : "deleted",
+  "_shards" : {
+    "total" : 2,
+    "successful" : 2,
+    "failed" : 0
+  },
+  "_seq_no" : 42,
+  "_primary_term" : 7
+}
+```
+
+## Stats
+
+Get statistics related to the number of tasks. 
+
+To receive all stats, use:
+
+```json
+GET /_plugins/_ml/stats
+```
+
+To receive stats for a specific node, use:
+
+```json
+GET /_plugins/_ml/<nodeId>/stats/
+```
+
+To receive a specific stat from a specific node, use:
+
+```json
+GET /_plugins/_ml/<nodeId>/stats/<stat>
+```
+
+To receive information on a specific stat from all nodes, use:
+
+```json
+GET /_plugins/_ml/stats/<stat>
+```
+
+
+### Example: Get all stats
+
+```json
+GET /_plugins/_ml/stats
+```
+
+### Response
+
+```json
+{
+  "zbduvgCCSOeu6cfbQhTpnQ" : {
+    "ml_executing_task_count" : 0
+  },
+  "54xOe0w8Qjyze00UuLDfdA" : {
+    "ml_executing_task_count" : 0
+  },
+  "UJiykI7bTKiCpR-rqLYHyw" : {
+    "ml_executing_task_count" : 0
+  },
+  "zj2_NgIbTP-StNlGZJlxdg" : {
+    "ml_executing_task_count" : 0
+  },
+  "jjqFrlW7QWmni1tRnb_7Dg" : {
+    "ml_executing_task_count" : 0
+  },
+  "3pSSjl5PSVqzv5-hBdFqyA" : {
+    "ml_executing_task_count" : 0
+  },
+  "A_IiqoloTDK01uZvCjREaA" : {
+    "ml_executing_task_count" : 0
+  }
+}
+```
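+
+As a sketch of the narrower forms listed earlier, the requests below ask for only the `ml_executing_task_count` stat that appears in the response above, first from all nodes and then from a single node. The node ID is copied from the sample response and will differ on your cluster.
+
+```json
+GET /_plugins/_ml/stats/ml_executing_task_count
+```
+
+```json
+GET /_plugins/_ml/54xOe0w8Qjyze00UuLDfdA/stats/ml_executing_task_count
+```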
+
+
+
+
+
+
--- a/_ml-commons-plugin/index.md
+++ b/_ml-commons-plugin/index.md
@ -0,0 +1,34 @@
+---
+layout: default
+title: About ML Commons 
+nav_order: 1
+has_children: false
+has_toc: false
+---
+
+# ML Commons plugin
+
+ML Commons for OpenSearch eases the development of machine learning features by providing a set of common machine learning (ML) algorithms through transport and REST API calls. Those calls choose the right nodes and resources for each ML request and monitor ML tasks to ensure uptime. This allows you to leverage existing open-source ML algorithms and reduce the effort required to develop new ML features. 
+
+Interaction with the ML Commons plugin occurs through either the [REST API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api) or the [AD]({{site.url}}{{site.baseurl}}/observability-plugin/ppl/commands#ad) and [kmeans]({{site.url}}{{site.baseurl}}/observability-plugin/ppl/commands#kmeans) PPL commands.
+
+Models [trained]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#train-model) through the ML Commons plugin support model-based algorithms such as kmeans. After you've trained a model well enough to meet your precision requirements, you can apply the model to [predict]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#predict) new data.
+
+If you don't want to use a stored model, you can use the [Train and Predict]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#train-and-predict) API to train and predict in a single request, without having to evaluate the model's performance first.
+
+
+## Permissions
+
+Two reserved user roles can use the ML Commons plugin:
+
+- `ml_full_access`: Full access to all ML features, including starting new ML tasks and reading or deleting models.
+- `ml_readonly_access`: Can only read ML tasks, trained models, and statistics relevant to the model's cluster. Cannot start or delete ML tasks or models.
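+
+One way to grant a reserved role is through the Security plugin's role mapping API. The request below is a minimal sketch that assumes the Security plugin is enabled; `ml_user` is a hypothetical user name, and a `PUT` replaces any mapping that already exists for the role.
+
+```json
+PUT /_plugins/_security/api/rolesmapping/ml_full_access
+{
+    "users": ["ml_user"]
+}
+```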
+
+
+
+
+
+
+
+
+
--- a/_observability-plugin/ppl/commands.md
+++ b/_observability-plugin/ppl/commands.md
@ -831,3 +831,84 @@ PPL query:
 ```ppl
 search source=my_index | where match(message, "this is a test", operator=and, zero_terms_query=all)
 ```
+
+## ad
+
+The `ad` command applies the Random Cut Forest (RCF) algorithm in the ML Commons plugin to the search results returned by a PPL command. Based on the input, the plugin uses one of two RCF algorithms: fixed in time RCF for processing time-series data or batch RCF for processing non-time-series data.
+
+### Fixed In Time RCF For Time-series Data Command Syntax
+
+```sql
+ad <shingle_size> <time_decay> <time_field>
+```
+
+Field | Description | Required
+:--- | :--- |:---
+`shingle_size` | A consecutive sequence of the most recent records. The default value is 8. | No
+`time_decay` | Specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001. | No
+`time_field` | Specifies the time field for RCF to use as time-series data. Must be either a long value, such as the timestamp in milliseconds, or a string value in "yyyy-MM-dd HH:mm:ss". | Yes
+
+### Batch RCF for Non-time-series Data Command Syntax
+
+```sql
+ad <shingle_size> <time_decay>
+```
+
+Field | Description | Required
+:--- | :--- |:---
+`shingle_size` | A consecutive sequence of the most recent records. The default value is 8. | No
+`time_decay` | Specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001. | No
+
+*Example 1*: Detecting events in New York City from taxi ridership data with time-series data
+
+This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.
+
+PPL query:
+
+```sql
+os> source=nyc_taxi | fields value, timestamp | AD time_field='timestamp' | where value=10844.0
+```
+
+value | timestamp | score | anomaly_grade
+:--- | :--- |:--- | :---
+10844.0 | 1404172800000 | 0.0 | 0.0    
+
+*Example 2*: Detecting events in New York City from taxi ridership data with non-time-series data
+
+PPL query:
+
+```sql
+os> source=nyc_taxi | fields value | AD | where value=10844.0
+```
+
+value | score | anomalous
+:--- | :--- | :---
+10844.0 | 0.0 | false
+
+## kmeans
+
+The `kmeans` command applies the ML Commons plugin's kmeans algorithm to the provided PPL command's search results.
+
+### Syntax
+
+```sql
+kmeans <cluster-number>
+```
+
+For `cluster-number`, enter the number of clusters you want to group your data points into.
+
+*Example*
+
+The example shows how to classify three Iris species (Iris setosa, Iris virginica and Iris versicolor) based on the combination of four features measured from each sample: the length and the width of the sepals and petals.
+
+PPL query:
+
+```sql
+os> source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3
+```
+
+sepal_length_in_cm | sepal_width_in_cm | petal_length_in_cm | petal_width_in_cm | ClusterID
+:--- | :--- | :--- | :--- | :---
+5.1 | 3.5 | 1.4 | 0.2 | 1
+5.6 | 3.0 | 4.1 | 1.3 | 0
+6.7 | 2.5 | 5.8 | 1.8 | 2
--- a/index.md
+++ b/index.md
@ -35,9 +35,11 @@ Component | Purpose
 [KNN]({{site.url}}{{site.baseurl}}/search-plugins/knn/) | Find “nearest neighbors” in your vector data
 [Performance Analyzer]({{site.url}}{{site.baseurl}}/monitoring-plugins/pa/) | Monitor and optimize your cluster
 [Anomaly detection]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/) | Identify atypical data and receive automatic notifications
+[ML Commons plugin]({{site.url}}{{site.baseurl}}/ml-commons-plugin/index/) | Train and execute machine-learning models
 [Asynchronous search]({{site.url}}{{site.baseurl}}/search-plugins/async/) | Run search requests in the background
 [Cross-cluster replication]({{site.url}}{{site.baseurl}}/replication-plugin/index/) | Replicate your data across multiple OpenSearch clusters

+
 Most OpenSearch plugins have corresponding OpenSearch Dashboards plugins that provide a convenient, unified user interface.

 For specifics around the project, see the [FAQ](https://opensearch.org/faq/).