ML Commons supports various algorithms that help you train machine learning (ML) models and run predictions, or test data-driven predictions without a model. This page outlines the algorithms supported by the ML Commons plugin and the API actions they support.
## K-Means
K-Means is a simple and popular unsupervised clustering ML algorithm built on top of the [Tribuo](https://tribuo.org/) library. K-Means randomly chooses centroids, then iteratively recalculates their positions until each observation belongs to the cluster with the nearest mean.
The training process supports multithreading, but the number of threads must be less than half the number of CPUs.
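The iteration described above can be sketched in a few lines of plain Python. This is a minimal, single-threaded illustration of the general K-Means procedure (Lloyd's algorithm), not the Tribuo implementation that ML Commons uses; the function names, the `seed` argument, and the sample points are all hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups."""
    rng = random.Random(seed)
    # Randomly choose k distinct observations as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assign each observation to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No observation changed cluster, so the centroids are stable.
        assignments = new_assignments
        # Move each centroid to the mean of the observations assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # Keep the old centroid if a cluster ends up empty.
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
```

The loop stops as soon as no observation changes cluster, which is the condition in which every observation already belongs to the cluster with the nearest mean.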
## Linear Regression
Linear Regression maps the linear relationship between inputs and outputs. In ML Commons, the linear regression algorithm is adopted from the public machine learning library [Tribuo](https://tribuo.org/), which offers multidimensional linear regression models. The model supports linear optimizers in training, including popular approaches like Linear Decay, SQRT_DECAY, [ADA](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf), [ADAM](https://tribuo.org/learn/4.1/javadoc/org/tribuo/math/optimisers/Adam.html), and [RMS_PROP](https://tribuo.org/learn/4.1/javadoc/org/tribuo/math/optimisers/RMSProp.html).
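As a rough illustration of what training a linear model with a gradient-based optimizer involves, the following sketch fits weights and a bias with plain stochastic gradient descent on squared error. It is written from scratch in Python and is not Tribuo's trainer; the function name, the `epochs` and `seed` arguments, and the sample data are assumptions made for the example.

```python
import random

def train_linear_regression(rows, targets, learning_rate=0.01, epochs=2000, seed=0):
    """Fit weights and a bias with plain stochastic gradient descent."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # Visit the examples in a new random order each epoch.
        for i in order:
            prediction = bias + sum(w * x for w, x in zip(weights, rows[i]))
            error = prediction - targets[i]
            # Gradient of 0.5 * error**2 with respect to each weight and the bias.
            weights = [w - learning_rate * error * x for w, x in zip(weights, rows[i])]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical, noise-free data generated from price = 2*A + 3*B + 1.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (0.0, 1.0), (2.0, 2.0)]
targets = [2 * a + 3 * b + 1 for a, b in rows]
weights, bias = train_linear_regression(rows, targets)
```

Because the sample data is exactly linear, the learned weights and bias approach the generating values 2, 3, and 1. The optimizers listed above differ mainly in how they scale or accumulate the gradient before applying it.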
### Parameters
Parameter | Type | Description | Default Value
:--- |:--- | :--- | :---
learningRate | Double | The step size used for each gradient descent update | 0.01
momentumFactor | Double | The medium-term from which the regressor rises or falls | 0
epsilon | Double | The criteria in which a linear model is identified | 1.00E-06
beta1 | Double | The exponential decay rate for the first-moment estimates | 0.9
beta2 | Double | The exponential decay rate for the second-moment estimates | 0.99
decayRate | Double | The rate at which the model decays exponentially | 0.9
momentumType | MomentumType | The momentum used with SGD to help accelerate gradient vectors in the right direction, leading to faster convergence | STANDARD
optimizerType | OptimizerType | The optimizer used in the model | SIMPLE_SGD
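To make the roles of `learningRate`, `beta1`, `beta2`, and `epsilon` more concrete, here is a sketch of a single Adam update for one scalar weight, following the standard published formulation of Adam. It assumes the table's parameters map onto that formulation in the usual way, which this page does not confirm, and it is not Tribuo's code. The `adam_step` function and the `state` dictionary are hypothetical; the default values are taken from the table.

```python
import math

def adam_step(weight, gradient, state, learning_rate=0.01,
              beta1=0.9, beta2=0.99, epsilon=1e-6):
    """Apply one Adam update to a single scalar weight and return the new weight."""
    state["t"] += 1
    # beta1 controls the decay of the running mean of gradients (first moment).
    state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
    # beta2 controls the decay of the running mean of squared gradients (second moment).
    state["v"] = beta2 * state["v"] + (1 - beta2) * gradient ** 2
    # Correct the bias toward zero that both running means have in early steps.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # epsilon keeps the denominator away from zero.
    return weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)

# Minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
state = {"t": 0, "m": 0.0, "v": 0.0}
w = 0.0
for _ in range(2000):
    w = adam_step(w, 2 * (w - 3), state)
```

In this formulation the very first update moves the weight by almost exactly `learning_rate`, regardless of the gradient's magnitude, because the gradient is divided by its own running scale.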

**Request**
```json
POST _plugins/_ml/_predict/LINEAR_REGRESSION/ROZs-38Br5eVE0lTsoD9
{
  "parameters": {
    "target": "price"
  },
  "input_data": {
    "column_metas": [
      {
        "name": "A",
        "column_type": "DOUBLE"
      },
      {
        "name": "B",
        "column_type": "DOUBLE"
      }
    ],
    "rows": [
      {
        "values": [
          {
            "column_type": "DOUBLE",
            "value": 3
          },
          {
            "column_type": "DOUBLE",
            "value": 5
          }
        ]
      }
    ]
  }
}
```
**Response**
```json
{
  "status": "COMPLETED",
  "prediction_result": {
    "column_metas": [
      {
        "name": "price",
        "column_type": "DOUBLE"
      }
    ],
    "rows": [
      {
        "values": [
          {
            "column_type": "DOUBLE",
            "value": 17.25701855310131
          }
        ]
      }
    ]
  }
}
```
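The request and response bodies above share the same `column_metas` and `rows` layout, so building and reading them can be scripted. The following Python sketch only constructs and parses the JSON shown in this example and does not send anything to a cluster. The helper names `build_input_data` and `extract_predictions` are hypothetical, and the sketch assumes every column has the same type.

```python
import json

def build_input_data(column_names, rows, column_type="DOUBLE"):
    """Build the "input_data" object from column names and rows of numbers."""
    return {
        "column_metas": [
            {"name": name, "column_type": column_type} for name in column_names
        ],
        "rows": [
            {"values": [{"column_type": column_type, "value": value} for value in row]}
            for row in rows
        ],
    }

def extract_predictions(response):
    """Return the first value of every row in "prediction_result"."""
    rows = response["prediction_result"]["rows"]
    return [row["values"][0]["value"] for row in rows]

# Rebuild the request body from the example: one row with A = 3 and B = 5.
request_body = {
    "parameters": {"target": "price"},
    "input_data": build_input_data(["A", "B"], [[3, 5]]),
}
print(json.dumps(request_body, indent=2))

# The response from the example, written compactly.
response = {
    "status": "COMPLETED",
    "prediction_result": {
        "column_metas": [{"name": "price", "column_type": "DOUBLE"}],
        "rows": [{"values": [{"column_type": "DOUBLE", "value": 17.25701855310131}]}],
    },
}
predictions = extract_predictions(response)
```

Keeping the column order in `column_names` aligned with the order of values in each row matters here, because the row values carry a type but no column name.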
### Limitations
ML Commons only supports the linear stochastic gradient trainer or optimizer, which cannot effectively map the nonlinear relationships in training data. When used with complicated datasets, the linear stochastic trainer might cause convergence problems and inaccurate results.
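A small worked example shows why a linear model struggles with nonlinear data. The sketch below uses a closed-form least-squares fit rather than the stochastic gradient trainer, since the limitation comes from the model's linear form and not from the optimizer. The data and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # A purely nonlinear (quadratic) relationship.
slope, intercept = fit_line(xs, ys)
# Differences between the true values and the fitted line's predictions.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

For this symmetric quadratic data, the best straight line is flat (slope 0, intercept 2), so the model predicts the same value for every input and misses the true values by as much as 2.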
## RCF
[Random Cut Forest](https://github.com/aws/random-cut-forest-by-aws) (RCF) is a probabilistic data structure used primarily for unsupervised anomaly detection. Its use also extends to density estimation and forecasting. OpenSearch leverages RCF for anomaly detection. ML Commons supports two new variants of RCF for different use cases:
For FIT RCF, you can train the model with historical data and store the trained model in your index. When you use the Predict API, the model is deserialized and used to predict new data points. However, the model in the index is not refreshed with new data, because the model is fixed in time (FIT).
## Localization
The Localization algorithm finds subset-level information for aggregate data (for example, aggregated over time) that demonstrates the activity of interest, such as spikes, drops, changes, or anomalies. Localization can be applied in different scenarios, such as data exploration or root cause analysis, to expose the contributors driving the activity of interest in the aggregate data.
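The idea of exposing contributors can be illustrated with a deliberately simplified sketch: compare an aggregate across two time buckets and rank the subsets by how much each one changed. This is not the algorithm ML Commons implements, only an illustration of what "subset-level information" means. The `rank_contributors` function, the host names, and the counts are hypothetical.

```python
def rank_contributors(before, after):
    """Rank subsets (keys) by how much they contribute to the change in the total."""
    keys = sorted(set(before) | set(after))
    # Change per subset; a subset missing from one bucket counts as 0 there.
    deltas = {key: after.get(key, 0.0) - before.get(key, 0.0) for key in keys}
    total_change = sum(deltas.values())
    # Largest absolute change first, whether it is a spike or a drop.
    ranked = sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)
    return total_change, ranked

# Hypothetical error counts per host, aggregated over two consecutive time buckets.
before = {"host-a": 10.0, "host-b": 12.0, "host-c": 11.0}
after = {"host-a": 11.0, "host-b": 95.0, "host-c": 10.0}
total_change, ranked = rank_contributors(before, after)
```

In this sample, the aggregate rises by 83 and almost all of that change comes from `host-b`, which is the kind of finding that helps in root cause analysis.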