Add Logistic and RCF algorithms (#864)

* Add Logistic and RCF algorithms

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add RCF Summarize and copyedits

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add feedback. Add new parameter descriptions

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Fix typo

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Fix typo

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Change style of k-means

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add editorial feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Resolve one last typo

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add final editorial feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Naarcha-AWS 2022-08-11 15:05:59 -05:00 committed by GitHub
parent fb227f8081
commit 201e3fcb46
1 changed files with 317 additions and 4 deletions


@ -13,9 +13,9 @@ ML Commons supports various algorithms to help train and predict machine learnin
Except for the Localization algorithm, all of the following algorithms can only support retrieving 10,000 documents from an index as an input.
## K-Means
## K-means
K-Means is a simple and popular unsupervised clustering ML algorithm built on top of [Tribuo](https://tribuo.org/) library. K-Means will randomly choose centroids, then calculate iteratively to optimize the position of the centroids until each observation belongs to the cluster with the nearest mean.
K-means is a simple and popular unsupervised clustering ML algorithm built on top of the [Tribuo](https://tribuo.org/) library. K-means randomly chooses centroids, then iteratively optimizes their positions until each observation belongs to the cluster with the nearest mean.
### Parameters
@ -33,7 +33,7 @@ distance_type | enum, such as `EUCLIDEAN`, `COSINE`, or `L1` | The type of measu
### Example
The following example uses the Iris Data index to train K-Means synchronously.
The following example uses the Iris Data index to train k-means synchronously.
```json
POST /_plugins/_ml/_train/kmeans
@ -198,6 +198,93 @@ time_zone | string | The time zone for the time_field field | "UTC"
For FIT RCF, you can train the model with historical data and store the trained model in your index. The model will be deserialized and predict new data points when using the Predict API. However, the model in the index will not be refreshed with new data, because the model is fixed in time.
## RCFSummarize
RCFSummarize is a clustering algorithm based on the Clustering Using REpresentatives (CURE) algorithm. Compared to [k-means](#k-means), which uses random iterations to cluster, RCFSummarize uses a hierarchical clustering technique. The algorithm starts with a set of randomly selected centroids that is larger than the number of clusters in the ground truth distribution. During iteration, centroid pairs that are too close to each other automatically merge. Therefore, the number of centroids, which is capped by `max_k`, converges to a reasonable number of clusters that fits the ground truth, as opposed to a fixed `k` number of clusters.
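The merging step described above can be illustrated with a minimal Python sketch. This is not the RCFSummarize implementation: the function name `merge_close_centroids`, the `min_distance` threshold, and the weighted-average merge rule are assumptions made only to show how a starting set of centroids can shrink to a smaller number of clusters.

```python
import math


def merge_close_centroids(centroids, min_distance):
    """Merge the closest pair of centroids until all pairs are at least
    min_distance apart (hypothetical stopping rule, for illustration only).

    centroids: list of (point, weight) tuples, where point is a tuple of floats.
    """
    centroids = list(centroids)
    while len(centroids) > 1:
        # Find the closest pair using Euclidean (L2) distance.
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = math.dist(centroids[i][0], centroids[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= min_distance:
            break  # No pair is close enough to merge.
        (p, wp), (q, wq) = centroids[i], centroids[j]
        total = wp + wq
        # Replace the pair with its weighted average.
        merged = tuple((a * wp + b * wq) / total for a, b in zip(p, q))
        del centroids[j]  # j > i, so delete j first to keep index i valid.
        del centroids[i]
        centroids.append((merged, total))
    return centroids


start = [((0.0, 0.0), 1), ((0.1, 0.0), 1), ((5.0, 5.0), 1),
         ((5.0, 5.2), 1), ((10.0, 0.0), 1)]
result = merge_close_centroids(start, min_distance=1.0)
print(len(result))
```

Starting from five candidate centroids, the two tight pairs collapse and the isolated point survives, so the final count depends on the data rather than on a fixed `k`.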
### Parameters
| Parameter | Type | Description | Default Value |
|---|---|---|---|
| max_k | integer | The maximum allowed number of centroids | 2 |
| distance_type | enum, such as `EUCLIDEAN`, `L1`, `L2`, or `LInfinity` | The type of measurement used to measure the distance between centroids | EUCLIDEAN |
### APIs
* [Train]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/#train-model)
* [Predict]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/#predict)
* [Train and predict]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/#train-and-predict)
### Example: Train and predict
The following example estimates cluster centers and provides cluster labels for each sample in a given data frame.
```bash
POST _plugins/_ml/_train_predict/RCF_SUMMARIZE
{
"parameters": {
"centroids": 3,
"max_k": 15,
"distance_type": "L2"
},
"input_data": {
"column_metas": [
{
"name": "d0",
"column_type": "DOUBLE"
},
{
"name": "d1",
"column_type": "DOUBLE"
}
],
"rows": [
{
"values": [
{
"column_type": "DOUBLE",
"value": 6.2
},
{
"column_type": "DOUBLE",
"value": 3.4
}
]
}
]
}
}
```
**Response**
The `rows` parameter within the prediction result has been shortened for length. Expect more rows and columns in your actual response body.
```json
{
"status": "COMPLETED",
"prediction_result": {
"column_metas": [
{
"name": "ClusterID",
"column_type": "INTEGER"
}
],
"rows": [
{
"values": [
{
"column_type": "DOUBLE",
"value": 0
}
]
}
]
}
}
```
## Localization
The Localization algorithm finds subset-level information for aggregate data (for example, aggregated over time) that demonstrates the activity of interest, such as spikes, drops, changes, or anomalies. Localization can be applied in different scenarios, such as data exploration or root cause analysis, to expose the contributors driving the activity of interest in the aggregate data.
@ -316,4 +403,230 @@ The API responds with the sum of the contribution and base values per aggregatio
### Limitations
The Localization algorithm can only be executed directly. Therefore, it cannot be used with the ML Commons Train and Predict APIs.
The Localization algorithm can only be executed directly. Therefore, it cannot be used with the ML Commons Train and Predict APIs.
## Logistic regression
A classification algorithm, logistic regression models the probability of a discrete outcome given an input variable. In ML Commons, these classifications include both binary and multi-class. The most common is the binary classification, which takes two values, such as "true/false" or "yes/no", and predicts the outcome based on the values specified. Alternatively, a multi-class output can categorize different inputs based on type. This makes logistic regression most useful for situations where you are trying to determine how your inputs fit best into a specified category.
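As a rough illustration of how a linear score becomes a class probability, the following Python sketch shows the sigmoid function for the binary case and the softmax function for the multi-class case. The function names and the example weights are hypothetical and are not part of ML Commons or Tribuo.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map one score per class to probabilities that sum to 1."""
    m = max(scores)  # Subtract the maximum for numerical stability.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_binary(weights, bias, features, threshold=0.5):
    """Return (probability, label) for a binary logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, p >= threshold


# Hypothetical weights and inputs, chosen only for illustration.
probability, label = predict_binary([1.0, -1.0], 0.0, [2.0, 1.0])
class_probabilities = softmax([1.0, 2.0, 3.0])
```

In the multi-class case, the predicted category is the class with the highest probability.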
### Parameters
| Parameter | Type | Description | Default Value |
|---|---|---|---|
| learningRate | Double | The gradient descent step size at each iteration when moving toward a minimum of a loss function or optimal value | 1 |
| momentumFactor | Double | The extra weight factors that accelerate the rate at which the weight is adjusted. This helps move the minimization routine out of local minima. | 0 |
| epsilon | Double | The value for stabilizing gradient inversion | 0.1 |
| beta1 | Double | The exponential decay rate for the first moment estimates | 0.9 |
| beta2 | Double | The exponential decay rate for the second moment estimates | 0.99 |
| decayRate | Double | The decay rate used by Root Mean Squared Propagation (RMSProp) | 0.9 |
| momentumType | MomentumType | The Stochastic Gradient Descent (SGD) momentum that helps accelerate gradient vectors in the right direction, leading to faster convergence between vectors | STANDARD |
| optimizerType | OptimizerType | The optimizer used in the model | AdaGrad |
| target | String | The target field | null |
| objectiveType | ObjectiveType | The objective function type | LogMulticlass |
| epochs | Integer | The number of iterations | 5 |
| batchSize | Integer | The size of mini-batches | 1 |
| loggingInterval | Integer | The number of iterations between logged loss values. The interval is `1` if the algorithm contains no logs. | 1000 |
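To make the roles of `learningRate`, `epsilon`, `epochs`, and `batchSize` more concrete, the following Python sketch trains a binary logistic regression model with the textbook AdaGrad update. It is an assumption-laden illustration, not the Tribuo or ML Commons trainer: the function name, the snake_case arguments, and the toy data are hypothetical, and the defaults only mirror the table above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_adagrad(rows, labels, learning_rate=1.0, epsilon=0.1,
                           epochs=5, batch_size=1):
    """Train binary logistic regression with mini-batch AdaGrad updates."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    # AdaGrad state: running sum of squared gradients per parameter.
    sq_w = [0.0] * n_features
    sq_b = 0.0
    for _ in range(epochs):  # One epoch is one full pass over the data.
        for start in range(0, len(rows), batch_size):
            batch = list(zip(rows[start:start + batch_size],
                             labels[start:start + batch_size]))
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for x, y in batch:
                z = sum(w * xi for w, xi in zip(weights, x)) + bias
                error = sigmoid(z) - y  # Gradient of log loss w.r.t. z.
                for k, xi in enumerate(x):
                    grad_w[k] += error * xi / len(batch)
                grad_b += error / len(batch)
            # Step size shrinks as squared gradients accumulate;
            # epsilon keeps the division stable.
            for k in range(n_features):
                sq_w[k] += grad_w[k] ** 2
                weights[k] -= learning_rate * grad_w[k] / (math.sqrt(sq_w[k]) + epsilon)
            sq_b += grad_b ** 2
            bias -= learning_rate * grad_b / (math.sqrt(sq_b) + epsilon)
    return weights, bias


# Toy one-feature data set: negative inputs are class 0, positive are class 1.
weights, bias = train_logistic_adagrad([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
```

A larger `batch_size` averages the gradient over more rows before each update, and more `epochs` means more passes over the same data.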
### APIs
* [Train]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/#train-model)
* [Predict]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/#predict)
### Example: Train/Predict with Iris data
The following example creates an index in OpenSearch for the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris), then trains a logistic regression model on the indexed data. Lastly, it uses the trained model to predict the Iris type for each input row.
#### Create an Iris index
Before using this request, make sure that you have downloaded [Iris data](https://archive.ics.uci.edu/ml/datasets/iris).
```bash
PUT /iris_data
{
"mappings": {
"properties": {
"sepal_length_in_cm": {
"type": "double"
},
"sepal_width_in_cm": {
"type": "double"
},
"petal_length_in_cm": {
"type": "double"
},
"petal_width_in_cm": {
"type": "double"
},
"class": {
"type": "keyword"
}
}
}
}
```
#### Ingest data from IRIS_data.txt
```bash
POST _bulk
{ "index" : { "_index" : "iris_data" } }
{"sepal_length_in_cm":5.1,"sepal_width_in_cm":3.5,"petal_length_in_cm":1.4,"petal_width_in_cm":0.2,"class":"Iris-setosa"}
{ "index" : { "_index" : "iris_data" } }
{"sepal_length_in_cm":4.9,"sepal_width_in_cm":3.0,"petal_length_in_cm":1.4,"petal_width_in_cm":0.2,"class":"Iris-setosa"}
...
...
```
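Writing the bulk request body by hand is tedious for the full dataset. The following Python sketch converts comma-separated Iris rows into the newline-delimited format shown above. It assumes each input line has four measurements followed by the class label, in the same order as the index mapping. The helper name `iris_lines_to_bulk` is hypothetical.

```python
import json

FIELDS = [
    "sepal_length_in_cm",
    "sepal_width_in_cm",
    "petal_length_in_cm",
    "petal_width_in_cm",
]


def iris_lines_to_bulk(lines, index_name="iris_data"):
    """Build a _bulk request body from lines such as
    '5.1,3.5,1.4,0.2,Iris-setosa'. Blank lines are skipped."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        *measurements, label = line.split(",")
        doc = {name: float(value) for name, value in zip(FIELDS, measurements)}
        doc["class"] = label
        # Each document is preceded by an action line naming the target index.
        out.append(json.dumps({"index": {"_index": index_name}}))
        out.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(out) + "\n"


body = iris_lines_to_bulk(["5.1,3.5,1.4,0.2,Iris-setosa", ""])
print(body)
```

The resulting string can be sent as the body of the `POST _bulk` request.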
#### Train the logistic regression model
This example uses a multi-class logistic regression categorization methodology. Here, the sepal and petal length and width inputs are used to train the model to categorize each sample by its `class`, as indicated by the `target` parameter.
**Request**
```bash
POST _plugins/_ml/_train/logistic_regression
{
"parameters": {
"target": "class"
},
"input_query": {
"query": {
"match_all": {}
},
"_source": [
"sepal_length_in_cm",
"sepal_width_in_cm",
"petal_length_in_cm",
"petal_width_in_cm",
"class"
],
"size": 200
},
"input_index": [
"iris_data"
]
}
```
**Response**
The `model_id` will be used to predict the class of the Iris.
```json
{
"model_id" : "TOgsf4IByBqD7FK_FQGc",
"status" : "COMPLETED"
}
```
#### Predict results
Using the `model_id` of the model trained on the Iris dataset, logistic regression predicts the class of each Iris based on the input data.
```bash
POST _plugins/_ml/_predict/logistic_regression/TOgsf4IByBqD7FK_FQGc
{
"parameters": {
"target": "class"
},
"input_data": {
"column_metas": [
{
"name": "sepal_length_in_cm",
"column_type": "DOUBLE"
},
{
"name": "sepal_width_in_cm",
"column_type": "DOUBLE"
},
{
"name": "petal_length_in_cm",
"column_type": "DOUBLE"
},
{
"name": "petal_width_in_cm",
"column_type": "DOUBLE"
}
],
"rows": [
{
"values": [
{
"column_type": "DOUBLE",
"value": 6.2
},
{
"column_type": "DOUBLE",
"value": 3.4
},
{
"column_type": "DOUBLE",
"value": 5.4
},
{
"column_type": "DOUBLE",
"value": 2.3
}
]
},
{
"values": [
{
"column_type": "DOUBLE",
"value": 5.9
},
{
"column_type": "DOUBLE",
"value": 3.0
},
{
"column_type": "DOUBLE",
"value": 5.1
},
{
"column_type": "DOUBLE",
"value": 1.8
}
]
}
]
}
}
```
**Response**
```json
{
"status" : "COMPLETED",
"prediction_result" : {
"column_metas" : [
{
"name" : "result",
"column_type" : "STRING"
}
],
"rows" : [
{
"values" : [
{
"column_type" : "STRING",
"value" : "Iris-virginica"
}
]
},
{
"values" : [
{
"column_type" : "STRING",
"value" : "Iris-virginica"
}
]
}
]
}
}
```
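The request and response bodies above share the same data frame layout, with `column_metas` describing the columns and `rows` holding the values. The following Python sketch builds the `input_data` object from plain lists and flattens a prediction response back into one dictionary per row. Both helper names are hypothetical, and the sketch assumes every input column is of type `DOUBLE`.

```python
def build_input_data(column_names, rows):
    """Build the input_data object for a predict request (all DOUBLE columns)."""
    return {
        "column_metas": [
            {"name": name, "column_type": "DOUBLE"} for name in column_names
        ],
        "rows": [
            {"values": [{"column_type": "DOUBLE", "value": float(v)} for v in row]}
            for row in rows
        ],
    }


def extract_predictions(response):
    """Return one {column_name: value} dictionary per row of the response."""
    result = response["prediction_result"]
    names = [meta["name"] for meta in result["column_metas"]]
    return [
        {name: cell["value"] for name, cell in zip(names, row["values"])}
        for row in result["rows"]
    ]


input_data = build_input_data(
    ["sepal_length_in_cm", "sepal_width_in_cm",
     "petal_length_in_cm", "petal_width_in_cm"],
    [[6.2, 3.4, 5.4, 2.3], [5.9, 3.0, 5.1, 1.8]],
)

# The response shown above, reproduced as a Python dictionary.
response = {
    "status": "COMPLETED",
    "prediction_result": {
        "column_metas": [{"name": "result", "column_type": "STRING"}],
        "rows": [
            {"values": [{"column_type": "STRING", "value": "Iris-virginica"}]},
            {"values": [{"column_type": "STRING", "value": "Iris-virginica"}]},
        ],
    },
}
predictions = extract_predictions(response)
```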
### Limitations
Convergence metrics are not built into Tribuo's trainers. Therefore, ML Commons cannot indicate the convergence status through the ML Commons API.