Add ML fault tolerance (#3803)
* Add ML fault tolerance
* Rework Profile API sentence
* Fix link
* Add review feedback
* Add technical feedback for ML; change API names
* Add final ML node setting
* Add more technical feedback
* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Chris Moore <107723039+cwillum@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>

parent 1259454b73 · commit bce2e33444
@@ -24,7 +24,7 @@ In order to train tasks through the API, three inputs are required.

- Model hyperparameters: Adjust these parameters to improve model training.
- Input data: The data that trains the ML model or to which the model is applied for predictions. You can provide input data in two ways: by querying your index or by using a data frame.

## Training a model

Training can occur both synchronously and asynchronously.
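As a minimal sketch, a k-means training request can run asynchronously by appending `async=true` to the train endpoint; the index name `iris_data` and the parameter values below are illustrative, not part of this change:

```json
POST /_plugins/_ml/_train/kmeans?async=true
{
  "parameters": {
    "centroids": 3,
    "iterations": 10,
    "distance_type": "COSINE"
  },
  "input_query": {
    "query": { "match_all": {} },
    "size": 1000
  },
  "input_index": ["iris_data"]
}
```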
@@ -96,7 +96,7 @@ For asynchronous responses, the API returns the task_id, which can be used to get the task.

}
```

## Getting model information

You can retrieve information on your model using the `model_id`.
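A minimal request sketch, assuming the standard ML Commons model endpoint:

```json
GET /_plugins/_ml/models/<model_id>
```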
@@ -115,12 +115,12 @@ The API returns information on the model, the algorithm used, and the content found within the model.

}
```

## Registering a model

Use the register operation to register a custom model to a model index. ML Commons splits the model into smaller chunks and saves those chunks in the model's index.

```json
POST /_plugins/_ml/models/_register
```

### Request fields
@@ -137,10 +137,10 @@ Field | Data type | Description

### Example

The following example request registers version `1.0.0` of an NLP sentence transformation model named `all-MiniLM-L6-v2`.

```json
POST /_plugins/_ml/models/_register
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
@@ -166,14 +166,14 @@ OpenSearch responds with the `task_id` and task `status`.

}
```

To see the status of your model registration, enter the `task_id` in the [task API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#getting-task-information). Once the registration is complete, use the `model_id` from the task response. For example:

```json
{
  "model_id" : "WWQI44MBbzI2oUKAvNUt",
  "task_type" : "UPLOAD_MODEL",
  "function_name" : "TEXT_EMBEDDING",
  "state" : "REGISTERED",
  "worker_node" : "KzONM8c8T4Od-NoUANQNGg",
  "create_time" : 1665961344003,
  "last_update_time" : 1665961373047,
@@ -181,28 +181,28 @@ To see the status of your model upload, enter the `task_id` into the [task API](

}
```

## Deploying a model

The deploy model operation reads the model's chunks from the model index and then creates an instance of the model to cache in memory. This operation requires the `model_id`.

```json
POST /_plugins/_ml/models/<model_id>/_deploy
```

### Example: Deploying to all available ML nodes

In this example request, OpenSearch deploys the model to any available OpenSearch ML node:

```json
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_deploy
```

### Example: Deploying to a specific node

If you want to reserve the memory of other ML nodes within your cluster, you can deploy your model to one or more specific nodes by specifying the `node_ids` in the request body:

```json
POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_deploy
{
  "node_ids": ["4PLK7KJWReyX0oWKnBA8nA"]
}
@@ -213,40 +213,40 @@ POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_load

```json
{
  "task_id" : "hA8P44MBhyWuIwnfvTKP",
  "status" : "DEPLOYING"
}
```

## Undeploying a model

To undeploy a model from memory, use the undeploy operation:

```json
POST /_plugins/_ml/models/<model_id>/_undeploy
```

### Example: Undeploying a model from all ML nodes

```json
POST /_plugins/_ml/models/MGqJhYMBbbh0ushjm8p_/_undeploy
```

### Response: Undeploying a model from all ML nodes

```json
{
  "s5JwjZRqTY6nOT0EvFwVdA": {
    "stats": {
      "MGqJhYMBbbh0ushjm8p_": "UNDEPLOYED"
    }
  }
}
```

### Example: Undeploying specific models from specific nodes

```json
POST /_plugins/_ml/models/_undeploy
{
  "node_ids": ["sv7-3CbwQW-4PiIsDOfLxQ"],
  "model_ids": ["KDo2ZYQB-v9VEDwdjkZ4"]
@@ -254,32 +254,32 @@ POST /_plugins/_ml/models/_unload

```

### Response: Undeploying specific models from specific nodes

```json
{
  "sv7-3CbwQW-4PiIsDOfLxQ" : {
    "stats" : {
      "KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED"
    }
  }
}
```

### Response: Undeploying all models from specific nodes

```json
{
  "sv7-3CbwQW-4PiIsDOfLxQ" : {
    "stats" : {
      "KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED",
      "-8o8ZYQBvrLMaN0vtwzN" : "UNDEPLOYED"
    }
  }
}
```

### Example: Undeploying specific models from all nodes

```json
{
@@ -287,19 +287,19 @@ POST /_plugins/_ml/models/_unload

}
```

### Response: Undeploying specific models from all nodes

```json
{
  "sv7-3CbwQW-4PiIsDOfLxQ" : {
    "stats" : {
      "KDo2ZYQB-v9VEDwdjkZ4" : "UNDEPLOYED"
    }
  }
}
```

## Searching for a model

Use this command to search for models you've already created.
@@ -309,7 +309,7 @@ POST /_plugins/_ml/models/_search

{query}
```

### Example: Querying all models

```json
POST /_plugins/_ml/models/_search
@@ -321,7 +321,7 @@ POST /_plugins/_ml/models/_search

}
```

### Example: Querying models with algorithm "FIT_RCF"

```json
POST /_plugins/_ml/models/_search
@@ -388,9 +388,9 @@ POST /_plugins/_ml/models/_search

}
```

## Deleting a model

Deletes a model based on the `model_id`.

```json
DELETE /_plugins/_ml/models/<model_id>
@@ -414,9 +414,9 @@ The API returns the following:

}
```

## Returning model profile information

The profile operation returns runtime information about ML tasks and models. It can help debug model issues at runtime.

```json
@@ -444,7 +444,7 @@ task_ids | string | Returns runtime data for a specific task. You can string together multiple task IDs.

return_all_tasks | boolean | Determines whether or not a request returns all tasks. When set to `false`, task profiles are left out of the response.
return_all_models | boolean | Determines whether or not a profile request returns all models. When set to `false`, model profiles are left out of the response.

### Example: Returning all tasks and models on a specific node

```json
GET /_plugins/_ml/profile
@@ -455,7 +455,7 @@ GET /_plugins/_ml/profile

}
```

### Response: Returning all tasks and models on a specific node

```json
{
@@ -473,7 +473,7 @@ GET /_plugins/_ml/profile

"KzONM8c8T4Od-NoUANQNGg" : { # node id
  "models" : {
    "WWQI44MBbzI2oUKAvNUt" : { # model id
      "model_state" : "DEPLOYED", # model status
      "predictor" : "org.opensearch.ml.engine.algorithms.text_embedding.TextEmbeddingModel@592814c9",
      "worker_nodes" : [ # routing table
        "KzONM8c8T4Od-NoUANQNGg"
@@ -790,7 +790,7 @@ POST /_plugins/_ml/_train_predict/kmeans

}
```

## Getting task information

You can retrieve information about a task using the `task_id`.
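A minimal request sketch, assuming the standard tasks endpoint:

```json
GET /_plugins/_ml/tasks/<task_id>
```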
@@ -814,7 +814,7 @@ The response includes information about the task.

}
```

## Searching for a task

Search tasks based on parameters indicated in the request body.
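For example, a sketch that searches for tasks created by the k-means algorithm (the `function_name` value is illustrative):

```json
GET /_plugins/_ml/tasks/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "function_name": "KMEANS"
          }
        }
      ]
    }
  }
}
```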
@@ -905,7 +905,7 @@ GET /_plugins/_ml/tasks/_search

}
```

## Deleting a task

Delete a task based on the `task_id`.
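A minimal request sketch, assuming the standard tasks endpoint:

```json
DELETE /_plugins/_ml/tasks/<task_id>
```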
@@ -59,7 +59,7 @@ plugins.ml_commons.max_ml_task_per_node: 10

## Set number of ML models per node

Sets the number of ML models that can be deployed to each ML node. When set to `0`, no ML models can be deployed to any node.

### Setting
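For example:

```
plugins.ml_commons.max_model_on_node: 10
```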
@@ -74,7 +74,7 @@ plugins.ml_commons.max_model_on_node: 10

## Set sync job intervals

When returning runtime information with the [Profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons runs a regular job to sync newly deployed or undeployed models on each node. When set to `0`, ML Commons immediately stops sync-up jobs.

### Setting
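The setting block is elided from this diff; a sketch, assuming the documented ML Commons setting name and default:

```
plugins.ml_commons.sync_up_job_interval_in_seconds: 10
```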
@@ -186,3 +186,63 @@ plugins.ml_commons.native_memory_threshold: 90

- Default value: 90
- Value range: [0, 100]

## Allow custom deployment plans

When enabled, this setting grants users the ability to deploy models to specific ML nodes according to that user's permissions.

### Setting

```
plugins.ml_commons.allow_custom_deployment_plan: false
```

### Values

- Default value: false
- Value range: [false, true]

## Enable auto redeploy

This setting automatically redeploys deployed or partially deployed models upon cluster failure. If all ML nodes inside a cluster crash, the model switches to the `DEPLOYED_FAILED` state, and the model must be deployed manually.

### Setting

```
plugins.ml_commons.model_auto_redeploy.enable: false
```

### Values

- Default value: false
- Value range: [false, true]

## Set retries for auto redeploy

This setting sets the limit on the number of times a deployed or partially deployed model tries to redeploy when ML nodes in a cluster fail or new ML nodes join the cluster.

### Setting

```
plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 3
```

### Values

- Default value: 3
- Value range: [0, 100]

## Set auto redeploy success ratio

This setting sets the ratio of success for the auto-redeployment of a model based on the available ML nodes in a cluster. For example, if ML nodes crash inside a cluster, the auto-redeploy protocol adds another node or retires a crashed node. If the ratio is `0.7` and 70% of all ML nodes successfully redeploy the model when auto-redeploy is activated, the redeployment is a success. If the model redeploys on fewer than 70% of available ML nodes, auto-redeploy retries until the redeployment succeeds or OpenSearch reaches [the maximum number of retries](#set-retries-for-auto-redeploy).

### Setting

```
plugins.ml_commons.model_auto_redeploy_success_ratio: 0.8
```

### Values

- Default value: 0.8
- Value range: [0, 1]
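These fault tolerance settings can also be applied at runtime through the cluster settings API. A sketch, assuming the settings are dynamically updatable (otherwise, set them in `opensearch.yml`):

```json
PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.model_auto_redeploy.enable": true,
    "plugins.ml_commons.model_auto_redeploy.lifetime_retry_times": 3,
    "plugins.ml_commons.model_auto_redeploy_success_ratio": 0.8
  }
}
```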