Add workload section to Benchmarks (#4705)

* Add Benchmark workload section.
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Add workload reference intro. Add indices and corpora reference.
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Add technical feedback
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Fix typo
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Apply suggestions from code review
  Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>
  Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
* indices consistency.
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Add final piece of feedback.
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* One last comment.
  Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>
Co-authored-by: Nathan Bower <nbower@amazon.com>
2023-08-24 13:03:56 -05:00 · 2023-08-24 13:03:56 -05:00 · 18702840c6
parent 0ca5c0ce81
commit 18702840c6
4 changed files with 333 additions and 1 deletions
--- a/_benchmark/creating-custom-workloads.md
+++ b/_benchmark/creating-custom-workloads.md
@@ -75,7 +75,7 @@ As part of workload creation, OpenSearch Benchmark generates the following files

 By default, OpenSearch Benchmark does not contain a reference to generate queries. Because you have the best understanding of your data, we recommend adding a query to `workload.json` that matches your index's specifications. Use the following `match_all` query as an example of a query added to your workload: 

-```
+```json
 {
      "operation": {
        "name": "query-match-all",
--- a/_benchmark/workloads/corpora.md
+++ b/_benchmark/workloads/corpora.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: corpora
+parent: Workload reference
+nav_order: 70
+---
+
+The `corpora` element contains all the document corpora used by the workload. You can use document corpora across workloads by copying and pasting any corpora definitions. 
+
+## Example
+
+The following example defines a single corpus called `movies` with `11658903` documents and `1544799789` uncompressed bytes. Both values depend on your source file, so determine them for your own data, for example, from the command line:
+
+```json
+  "corpora": [
+    {
+      "name": "movies",
+      "documents": [
+        {
+          "source-file": "movies-documents.json",
+          "document-count": 11658903,
+          "uncompressed-bytes": 1544799789
+        }
+      ]
+    }
+  ]
+```
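The `document-count` and `uncompressed-bytes` values can be derived from the source file itself. The following is a minimal Python sketch, assuming the file holds one JSON document per line with no action-and-metadata lines. The `corpus_stats` helper is hypothetical and is not part of OpenSearch Benchmark:

```python
import os


def corpus_stats(path):
    """Return (document_count, uncompressed_bytes) for a newline-delimited JSON file.

    Assumption: one document per line; blank lines are not counted.
    """
    document_count = 0
    with open(path, "rb") as source:
        for line in source:
            if line.strip():
                document_count += 1
    # The size on disk of the uncompressed file, in bytes.
    return document_count, os.path.getsize(path)
```

Calling `corpus_stats("movies-documents.json")` would return the two numbers to copy into the `documents` entry, provided the file follows the one-document-per-line assumption.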
+
+## Configuration options
+
+Use the following options with `corpora`.
+
+Parameter | Required | Type | Description
+:--- | :--- | :--- | :---
+| `name` | Yes | String | The name of the document corpus. Because OpenSearch Benchmark uses this name in its directories, use only lowercase names without white spaces. |
+| `documents` | Yes | JSON array | An array of document files. |
+| `meta` | No | JSON object | A mapping of key-value pairs with additional metadata for a corpus. |
+
+
+Each entry in the `documents` array consists of the following options.
+
+Parameter | Required | Type | Description
+:--- | :--- | :--- | :---
+| `source-file` | Yes | String | The name of the file containing the corresponding documents for the workload. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a `base-url`, use a compressed file format: `.zip`, `.bz2`, `.gz`, `.tar`, `.tar.gz`, `.tgz`, or `.tar.bz2`. The compressed file must contain exactly one JSON file with the same name. |
+| `document-count` | Yes | Integer | The number of documents in the `source-file`. OpenSearch Benchmark uses this number to determine which client indexes which part of the document corpus: each of the N clients receives one Nth of the corpus. When using a source that contains documents with a parent-child relationship, specify the number of parent documents. |
+| `base-url` | No | String | An http(s), Amazon Simple Storage Service (Amazon S3), or Google Cloud Storage URL that points to the root path where OpenSearch Benchmark can obtain the corresponding source file. |
+| `source-format` | No | String | Defines the format OpenSearch Benchmark uses to interpret the data file specified in `source-file`. Only `bulk` is supported. |
+| `compressed-bytes` | No | Integer | The size, in bytes, of the compressed source file, indicating how much data OpenSearch Benchmark downloads. |
+| `uncompressed-bytes` | No | Integer | The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs. | 
+| `target-index` | No | String | Defines the name of the index that the `bulk` operation should target. OpenSearch Benchmark automatically derives this value when only one index is defined in the `indices` element. The value of `target-index` is ignored when the `includes-action-and-meta-data` setting is `true`. |
+| `target-type` | No | String | Defines the document type of the index targeted in bulk operations. OpenSearch Benchmark automatically derives this value when only one index is defined in the `indices` element and the index has only one type. The value of `target-type` is ignored when the `includes-action-and-meta-data` setting is `true`. |
+| `includes-action-and-meta-data` | No | Boolean | When set to `true`, indicates that the document file already contains an `action` line and a `meta-data` line. When `false`, indicates that the document file contains only documents. Default is `false`. |
+| `meta` | No | JSON object | A mapping of key-value pairs with additional metadata for a corpus. |
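To make the table above concrete, the following Python sketch checks one entry of the `documents` array against the listed options. It is only an illustration: `check_document_entry` is a hypothetical helper, it does not reproduce OpenSearch Benchmark's own validation, and it skips the `meta` option for brevity.

```python
# Option names and types are taken from the table above.
REQUIRED = {"source-file": str, "document-count": int}
OPTIONAL = {
    "base-url": str,
    "source-format": str,
    "compressed-bytes": int,
    "uncompressed-bytes": int,
    "target-index": str,
    "target-type": str,
    "includes-action-and-meta-data": bool,
}


def _has_type(value, expected):
    # bool is a subclass of int in Python, so reject booleans for integer options.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def check_document_entry(entry):
    """Return a list of problems found in one `documents` entry (empty if none)."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in entry:
            problems.append(f"missing required option: {key}")
        elif not _has_type(entry[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    for key, expected in OPTIONAL.items():
        if key in entry and not _has_type(entry[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    return problems
```

An entry such as the `movies` example passes without problems, while an entry that omits `document-count` is reported as incomplete.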
+
--- a/_benchmark/workloads/index.md
+++ b/_benchmark/workloads/index.md
@@ -0,0 +1,250 @@
+---
+layout: default
+title: Workload reference
+nav_order: 60
+has_children: true
+---
+
+# OpenSearch Benchmark workload reference
+
+A workload is a specification of one or more benchmarking scenarios. A workload typically includes the following:
+
+- One or more data streams that are ingested into indices
+- A set of queries and operations that are invoked as part of the benchmark
+
+## Anatomy of a workload
+
+The following example workload shows all of the essential elements needed to create a `workload.json` file. You can run this workload in your own benchmark configuration to understand how all of the elements work together:
+
+```json
+{
+  "description": "Tutorial benchmark for OpenSearch Benchmark",
+  "indices": [
+    {
+      "name": "movies",
+      "body": "index.json"
+    }
+  ],
+  "corpora": [
+    {
+      "name": "movies",
+      "documents": [
+        {
+          "source-file": "movies-documents.json",
+          "document-count": 11658903,
+          "uncompressed-bytes": 1544799789
+        }
+      ]
+    }
+  ],
+  "schedule": [
+    {
+      "operation": {
+        "operation-type": "create-index"
+      }
+    },
+    {
+      "operation": {
+        "operation-type": "cluster-health",
+        "request-params": {
+          "wait_for_status": "green"
+        },
+        "retry-until-success": true
+      }
+    },
+    {
+      "operation": {
+        "operation-type": "bulk",
+        "bulk-size": 5000
+      },
+      "warmup-time-period": 120,
+      "clients": 8
+    },
+    {
+      "operation": {
+        "name": "query-match-all",
+        "operation-type": "search",
+        "body": {
+          "query": {
+            "match_all": {}
+          }
+        }
+      },
+      "iterations": 1000,
+      "target-throughput": 100
+    }
+  ]
+}
+```
+
+A workload usually consists of the following elements:
+
+- [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/): Defines the relevant indices and index templates used for the workload.
+- [corpora]({{site.url}}{{site.baseurl}}/benchmark/workloads/corpora/): Defines all document corpora used for the workload.
+- `schedule`: Defines the operations, which can be specified inline, and the order in which they run. Alternatively, you can use `operations` to group operations and the `test_procedures` parameter to specify the order of operations.
+- `operations`: **Optional**. Describes which operations are available for the workload and how they are parameterized. 
+
+### Indices
+
+To create an index, specify its `name`. To add definitions to your index, use the `body` option and point it to the JSON file containing the index definitions. For more information, see [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/).
+
+### Corpora
+
+The `corpora` element requires the name of the index containing the document corpus, for example, `movies`, and a list of parameters that define the document corpora. This list includes the following parameters:
+
+- `source-file`: The name of the file that contains the workload's corresponding documents. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a `base-url`, use a compressed file format: `.zip`, `.bz2`, `.gz`, `.tar`, `.tar.gz`, `.tgz`, or `.tar.bz2`. The compressed file must contain exactly one JSON file with the same name.
+- `document-count`: The number of documents in the `source-file`. OpenSearch Benchmark uses this number to determine which client indexes which part of the document corpus: each of the N clients receives one Nth of the corpus. When using a source that contains documents with a parent-child relationship, specify the number of parent documents.
+- `uncompressed-bytes`: The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs. 
+- `compressed-bytes`: The size, in bytes, of the source file before decompression. This can help you assess the amount of time needed for the cluster to ingest documents.
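The way `document-count` divides a corpus among clients can be sketched in a few lines of Python. This is an illustration of the one-Nth split only; `split_corpus` is a hypothetical helper, and the way OpenSearch Benchmark partitions the file internally may differ.

```python
def split_corpus(document_count, clients):
    """Split document_count into one contiguous (start, end) range per client.

    `end` is exclusive. Any remainder is spread over the first clients so that
    range sizes differ by at most one document.
    """
    base, remainder = divmod(document_count, clients)
    ranges = []
    start = 0
    for client in range(clients):
        size = base + (1 if client < remainder else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For the `movies` corpus with `11658903` documents and 8 clients, each client would handle roughly 1.46 million documents.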
+
+### Operations
+
+The `operations` element lists the OpenSearch API operations performed by the workload. For example, you can set an operation to `create-index`, which creates an index in the test cluster that OpenSearch Benchmark can write documents into. Operations are usually listed inside of `schedule`.
+
+### Schedule
+
+The `schedule` element contains a list of actions and operations that are run by the workload. Operations run according to the order in which they appear in the `schedule`. The following example illustrates a `schedule` with multiple operations, each defined by its `operation-type`: 
+
+```json
+  "schedule": [
+    {
+      "operation": {
+        "operation-type": "create-index"
+      }
+    },
+    {
+      "operation": {
+        "operation-type": "cluster-health",
+        "request-params": {
+          "wait_for_status": "green"
+        },
+        "retry-until-success": true
+      }
+    },
+    {
+      "operation": {
+        "operation-type": "bulk",
+        "bulk-size": 5000
+      },
+      "warmup-time-period": 120,
+      "clients": 8
+    },
+    {
+      "operation": {
+        "name": "query-match-all",
+        "operation-type": "search",
+        "body": {
+          "query": {
+            "match_all": {}
+          }
+        }
+      },
+      "iterations": 1000,
+      "target-throughput": 100
+    }
+  ]
+```
+
+According to this schedule, the actions will run in the following order:
+
+1. The `create-index` operation creates an index. The index remains empty until the `bulk` operation adds documents with benchmarked data.
+2. The `cluster-health` operation assesses the health of the cluster before running the workload. In this example, the workload waits until the status of the cluster's health is `green`.
+3. The `bulk` operation runs the `bulk` API to index documents in batches of `5000`, as set by `bulk-size`.
+   - Before benchmarking, the workload waits until the specified `warmup-time-period` passes. In this example, the warmup period is `120` seconds.
+   - The `clients` option defines the number of clients that run the `bulk` operation concurrently. In this example, `8` clients are used.
+4. The `search` operation runs a `match_all` query to match all documents after they have been indexed by the `bulk` API. Because no `clients` option is set for this operation, a single client runs it.
+   - The `iterations` option indicates the number of times each client runs the `search` operation. The report generated by the benchmark automatically adjusts the percentile numbers based on this number. To generate a precise percentile, the benchmark needs to run at least 1,000 iterations.
+   - Lastly, the `target-throughput` option defines the total number of requests per second issued across all clients, which, when set, can help reduce the latency of the benchmark. For example, a `target-throughput` of 100 requests per second divided among 8 clients means that each client issues 12.5 requests per second.
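The order of operations and the per-client request rate can be read directly from the `schedule` array. The following Python sketch is an illustration under two assumptions: a task without a `clients` option uses one client, and `target-throughput` is shared evenly among a task's clients. `summarize_schedule` is a hypothetical helper, not an OpenSearch Benchmark function.

```python
def summarize_schedule(schedule):
    """Return one (operation_type, clients, per_client_rate) tuple per task, in order."""
    summary = []
    for task in schedule:
        operation = task["operation"]
        # An operation is either an inline object or a plain string naming it.
        if isinstance(operation, dict):
            op_type = operation["operation-type"]
        else:
            op_type = operation
        clients = task.get("clients", 1)  # assumption: one client when omitted
        target = task.get("target-throughput")
        per_client = None if target is None else target / clients
        summary.append((op_type, clients, per_client))
    return summary
```

With this split, a `target-throughput` of 100 shared by 8 clients works out to 100 / 8 = 12.5 requests per second for each client.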
+
+
+## More workload examples
+
+If you want to try certain workloads before creating your own, use the following examples.
+
+### Running unthrottled
+
+In the following example, OpenSearch Benchmark runs an unthrottled bulk index operation for 1 hour against the `movies` index:
+
+```json
+{
+  "description": "Tutorial benchmark for OpenSearch Benchmark",
+  "indices": [
+    {
+      "name": "movies",
+      "body": "index.json"
+    }
+  ],
+  "corpora": [
+    {
+      "name": "movies",
+      "documents": [
+        {
+          "source-file": "movies-documents.json",
+          "document-count": 11658903,
+          "uncompressed-bytes": 1544799789
+        }
+      ]
+    }
+  ],
+  "schedule": [
+    {
+      "operation": "bulk",
+      "warmup-time-period": 120,
+      "time-period": 3600,
+      "clients": 8
+    }
+  ]
+}
+```
+
+### Workload with a single task
+
+The following workload runs a benchmark with a single task: a `match_all` query. Because no `clients` are indicated, only one client is used. According to the `schedule`, the workload runs the `match_all` query at 10 operations per second with 1 client, uses 100 iterations to warm up, and uses the next 100 iterations to measure the benchmark:
+
+```json
+{
+  "description": "Tutorial benchmark for OpenSearch Benchmark",
+  "indices": [
+    {
+      "name": "movies",
+      "body": "index.json"
+    }
+  ],
+  "corpora": [
+    {
+      "name": "movies",
+      "documents": [
+        {
+          "source-file": "movies-documents.json",
+          "document-count": 11658903,
+          "uncompressed-bytes": 1544799789
+        }
+      ]
+    }
+  ],
+  "schedule": [
+    {
+      "operation": {
+        "operation-type": "search",
+        "index": "_all",
+        "body": {
+          "query": {
+            "match_all": {}
+          }
+        }
+      },
+      "warmup-iterations": 100,
+      "iterations": 100,
+      "target-throughput": 10
+    }
+  ]
+}
+```
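As a rough check on how long this task takes, the iteration counts can be divided by the target throughput. The following Python sketch assumes a single client and a cluster that keeps up with the requested rate. `task_seconds` is a hypothetical helper used only for this arithmetic.

```python
def task_seconds(task):
    """Return (warmup_seconds, measured_seconds) for a throttled, iteration-based task.

    Assumption: requests are issued at exactly `target-throughput` per second.
    """
    rate = task["target-throughput"]
    warmup = task.get("warmup-iterations", 0) / rate
    measured = task["iterations"] / rate
    return warmup, measured
```

For the task above, 100 warmup iterations and 100 measured iterations at 10 operations per second come to about 10 seconds each.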
+
+## Next steps
+
+- For more information about configuring OpenSearch Benchmark, see [Configuring OpenSearch Benchmark]({{site.url}}{{site.baseurl}}/benchmark/configuring-benchmark/). 
+- For a list of prepackaged workloads for OpenSearch Benchmark, see the [opensearch-benchmark-workloads](https://github.com/opensearch-project/opensearch-benchmark-workloads) repository. 
--- a/_benchmark/workloads/indices.md
+++ b/_benchmark/workloads/indices.md
@@ -0,0 +1,28 @@
+---
+layout: default
+title: indices
+parent: Workload reference
+nav_order: 65
+---
+
+The `indices` element contains a list of all indices used in the workload. 
+
+## Example
+
+```json
+"indices": [
+    {
+      "name": "geonames",
+      "body": "geonames-index.json"
+    }
+]
+```
+
+## Configuration options
+
+Use the following options with `indices`:
+
+Parameter | Required | Type | Description
+:--- | :--- | :--- | :---
+| `name` | Yes | String | The name of the index. |
+| `body` | No | String | The file name corresponding to the index definition used in the body of the Create Index API. |
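The file named in `body` holds the request body for the Create Index API, which typically contains `settings` and `mappings`. The following Python sketch generates such a file. The field names `title` and `year`, the shard and replica counts, and the `build_index_body` and `write_index_body` helpers are all illustrative assumptions and do not come from the `geonames` workload.

```python
import json


def build_index_body(shards=1, replicas=0):
    """Build a minimal Create Index request body with hypothetical fields."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "year": {"type": "integer"},
            }
        },
    }


def write_index_body(path, body):
    """Write the body as JSON so that the `body` option can point to the file."""
    with open(path, "w", encoding="utf-8") as target:
        json.dump(body, target, indent=2)
```

For example, `write_index_body("index.json", build_index_body())` produces a file that an `indices` entry could reference with `"body": "index.json"`.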