---
layout: default
title: Anatomy of a workload
nav_order: 15
grand_parent: User guide
parent: Understanding workloads
---

# Anatomy of a workload

All workloads contain the following files and directories:

- [workload.json](#workloadjson): Contains all of the workload settings.
- [index.json](#indexjson): Contains the document mappings and parameters as well as index settings.
- [files.txt](#filestxt): Contains the data corpora file names.
- [_test-procedures](#_operations-and-_test-procedures): Most workloads contain only one default test procedure, which is configured in `default.json`.
- [_operations](#_operations-and-_test-procedures): Contains all of the operations used in test procedures.
- workload.py: Adds more dynamic functionality to the test.

## workload.json

The following example workload shows all of the essential elements needed to create a `workload.json` file. You can run this workload in your own benchmark configuration to understand how all of the elements work together:

```json
{
  "description": "Tutorial benchmark for OpenSearch Benchmark",
  "indices": [
    {
      "name": "movies",
      "body": "index.json"
    }
  ],
  "corpora": [
    {
      "name": "movies",
      "documents": [
        {
          "source-file": "movies-documents.json",
          "document-count": 11658903,
          "uncompressed-bytes": 1544799789
        }
      ]
    }
  ],
  "schedule": [
    { "operation": { "operation-type": "create-index" } },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": { "wait_for_status": "green" },
        "retry-until-success": true
      }
    },
    {
      "operation": { "operation-type": "bulk", "bulk-size": 5000 },
      "warmup-time-period": 120,
      "clients": 8
    },
    {
      "operation": {
        "name": "query-match-all",
        "operation-type": "search",
        "body": { "query": { "match_all": {} } }
      },
      "iterations": 1000,
      "target-throughput": 100
    }
  ]
}
```

Set `document-count` and `uncompressed-bytes` to the values for your own corpus, both of which you can fetch from the command line.

A workload usually includes the following elements:

- [indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/): Defines
the relevant indexes and index templates used for the workload.
- [corpora]({{site.url}}{{site.baseurl}}/benchmark/workloads/corpora/): Defines all document corpora used for the workload.
- `schedule`: Defines operations inline and the order in which they run. Alternatively, you can use `operations` to group operations and the `test_procedures` parameter to specify the order of operations.
- `operations`: **Optional**. Describes which operations are available for the workload and how they are parameterized.

### Indices

To create an index, specify its `name`. To add definitions to your index, use the `body` option and point it to the JSON file containing the index definitions. For more information, see [Indices]({{site.url}}{{site.baseurl}}/benchmark/workloads/indices/).

### Corpora

The `corpora` element requires the name of the index containing the document corpus, for example, `movies`, and a list of parameters that define the document corpora. This list includes the following parameters:

- `source-file`: The file name that contains the workload's corresponding documents. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a `base_url`, use a compressed file format: `.zip`, `.bz2`, `.zst`, `.gz`, `.tar`, `.tar.gz`, `.tgz`, or `.tar.bz2`. The compressed file must contain exactly one JSON file with the same name.
- `document-count`: The number of documents in the `source-file`, which determines which client indexes which part of the document corpus. When N clients are used, each client is assigned one Nth of the document corpus to ingest into the test cluster. When using a source that contains documents with a parent-child relationship, specify the number of parent documents.
- `uncompressed-bytes`: The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs.
- `compressed-bytes`: The size, in bytes, of the source file before decompression.
This can help you assess the amount of time needed for the cluster to ingest documents.

### Operations

The `operations` element lists the OpenSearch API operations performed by the workload. For example, you can list an operation named `create-index` that creates an index in the benchmark cluster to which OpenSearch Benchmark can write documents. Operations are usually listed inside of the `schedule` element.

### Schedule

The `schedule` element contains a list of operations that are run in a specified order, as shown in the following JSON example:

```json
{
  "schedule": [
    { "operation": { "operation-type": "create-index" } },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": { "wait_for_status": "green" },
        "retry-until-success": true
      }
    },
    {
      "operation": { "operation-type": "bulk", "bulk-size": 5000 },
      "warmup-time-period": 120,
      "clients": 8
    },
    {
      "operation": {
        "name": "query-match-all",
        "operation-type": "search",
        "body": { "query": { "match_all": {} } }
      },
      "iterations": 1000,
      "target-throughput": 100
    }
  ]
}
```

According to this `schedule`, the actions will run in the following order:

1. The `create-index` operation creates an index. The index remains empty until the `bulk` operation adds documents with benchmarked data.
2. The `cluster-health` operation assesses the cluster's health before running the workload. In the JSON example, the workload waits until the cluster's health status is `green`.
3. The `bulk` operation runs the `bulk` API to index documents in batches of `5000` per request.
   - Before benchmarking, the workload waits until the specified `warmup-time-period` passes. In the JSON example, the warmup period is `120` seconds.
   - The `clients` field defines the number of clients, in this example, eight, that run the bulk indexing operation concurrently.
4. The `search` operation runs a `match_all` query to match all documents after they have been indexed by the `bulk` API using the specified clients.
   - The `iterations` field defines the number of times each client runs the `search` operation. The benchmark report automatically adjusts the percentile numbers based on this number. To generate a precise percentile, the benchmark needs to run at least 1,000 iterations.
   - The `target-throughput` field defines the total number of requests per second issued across all clients. When set, the setting can help reduce benchmark latency. For example, if an operation with a `target-throughput` of 100 requests per second runs with 8 clients, each client issues 12.5 requests per second. For more information about how target throughput is defined in OpenSearch Benchmark, see [Throughput and latency](https://opensearch.org/docs/latest/benchmark/user-guide/concepts/#throughput-and-latency).

## index.json

The `index.json` file defines the data mappings, indexing parameters, and index settings for workload documents during `create-index` operations.

When OpenSearch Benchmark creates an index for the workload, it uses the index settings and mappings template in the `index.json` file. Mappings in the `index.json` file are based on the mappings of a single document from the workload's corpus, which is listed in the `files.txt` file. The following is an example of the `index.json` file for the `nyc_taxis` workload. You can customize fields such as `number_of_shards`, `number_of_replicas`, `query_cache_enabled`, and `requests_cache_enabled`.
```json
{
  "settings": {
    "index.number_of_shards": {{number_of_shards | default(1)}},
    "index.number_of_replicas": {{number_of_replicas | default(0)}},
    "index.queries.cache.enabled": {{query_cache_enabled | default(false) | tojson}},
    "index.requests.cache.enable": {{requests_cache_enabled | default(false) | tojson}}
  },
  "mappings": {
    "_source": {
      "enabled": {{ source_enabled | default(true) | tojson }}
    },
    "properties": {
      "surcharge": { "scaling_factor": 100, "type": "scaled_float" },
      "dropoff_datetime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "trip_type": { "type": "keyword" },
      "mta_tax": { "scaling_factor": 100, "type": "scaled_float" },
      "rate_code_id": { "type": "keyword" },
      "passenger_count": { "type": "integer" },
      "pickup_datetime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "tolls_amount": { "scaling_factor": 100, "type": "scaled_float" },
      "tip_amount": { "type": "half_float" },
      "payment_type": { "type": "keyword" },
      "extra": { "scaling_factor": 100, "type": "scaled_float" },
      "vendor_id": { "type": "keyword" },
      "store_and_fwd_flag": { "type": "keyword" },
      "improvement_surcharge": { "scaling_factor": 100, "type": "scaled_float" },
      "fare_amount": { "scaling_factor": 100, "type": "scaled_float" },
      "ehail_fee": { "scaling_factor": 100, "type": "scaled_float" },
      "cab_color": { "type": "keyword" },
      "dropoff_location": { "type": "geo_point" },
      "vendor_name": { "type": "text" },
      "total_amount": { "scaling_factor": 100, "type": "scaled_float" },
      "trip_distance": { "scaling_factor": 100, "type": "scaled_float" },
      "pickup_location": { "type": "geo_point" }
    },
    "dynamic": "strict"
  }
}
```

## files.txt

The `files.txt` file lists the files that store the workload data, which are typically stored in a zipped JSON file.

## _operations and _test-procedures

To make the workload more human-readable, `_operations` and `_test-procedures` are separated into two directories.
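The two directories are connected by name: an operation defined with a `name` under `_operations` can be referenced as a plain string from a schedule under `_test-procedures`. The following Python sketch illustrates that lookup on a small, hand-written sample. It is not OpenSearch Benchmark's actual loader, which may resolve references differently. The function `unresolved_operations` and the procedure name `example-procedure` are hypothetical names used only for illustration.

```python
import json


def unresolved_operations(operations, test_procedure):
    """Return schedule entries that reference an operation name with no matching definition."""
    defined = {op["name"] for op in operations if "name" in op}
    unresolved = []
    for task in test_procedure.get("schedule", []):
        ref = task.get("operation")
        # Inline operations are dictionaries and need no lookup;
        # only string references are checked against the defined names.
        if isinstance(ref, str) and ref not in defined:
            unresolved.append(ref)
    return unresolved


# A shortened operations list in the style of the `_operations` directory.
operations = json.loads("""[
  {"name": "default", "operation-type": "search",
   "body": {"query": {"match_all": {}}}},
  {"name": "range", "operation-type": "search",
   "body": {"query": {"range": {"total_amount": {"gte": 5, "lt": 15}}}}}
]""")

# A hypothetical test procedure that mixes inline and named operations.
test_procedure = json.loads("""{
  "name": "example-procedure",
  "schedule": [
    {"operation": {"operation-type": "create-index"}},
    {"operation": "default", "warmup-iterations": 50, "iterations": 100},
    {"operation": "range", "warmup-iterations": 50, "iterations": 100},
    {"operation": "distance_amount_agg", "warmup-iterations": 50, "iterations": 50}
  ]
}""")

print(unresolved_operations(operations, test_procedure))
```

In this sample, only `distance_amount_agg` is reported because the shortened operations list does not define it. A check like this can be a quick way to spot a misspelled operation name before starting a long benchmark run.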
The `_operations` directory contains a `default.json` file that lists all of the supported operations that the test procedure can use. Some workloads, such as `nyc_taxis`, contain an additional `.json` file that lists feature-specific operations, such as `snapshot` operations. The following JSON example shows a list of operations from the `nyc_taxis` workload: ```json { "name": "index", "operation-type": "bulk", "bulk-size": {{bulk_size | default(10000)}}, "ingest-percentage": {{ingest_percentage | default(100)}} }, { "name": "update", "operation-type": "bulk", "bulk-size": {{bulk_size | default(10000)}}, "ingest-percentage": {{ingest_percentage | default(100)}}, "conflicts": "{{conflicts | default('random')}}", "on-conflict": "{{on_conflict | default('update')}}", "conflict-probability": {{conflict_probability | default(25)}}, "recency": {{recency | default(0)}} }, { "name": "wait-until-merges-finish", "operation-type": "index-stats", "index": "_all", "condition": { "path": "_all.total.merges.current", "expected-value": 0 }, "retry-until-success": true, "include-in-reporting": false }, { "name": "default", "operation-type": "search", "body": { "query": { "match_all": {} } } }, { "name": "range", "operation-type": "search", "body": { "query": { "range": { "total_amount": { "gte": 5, "lt": 15 } } } } }, { "name": "distance_amount_agg", "operation-type": "search", "body": { "size": 0, "query": { "bool": { "filter": { "range": { "trip_distance": { "lt": 50, "gte": 0 } } } } }, "aggs": { "distance_histo": { "histogram": { "field": "trip_distance", "interval": 1 }, "aggs": { "total_amount_stats": { "stats": { "field": "total_amount" } } } } } } }, { "name": "autohisto_agg", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "01/01/2015", "lte": "21/01/2015", "format": "dd/MM/yyyy" } } }, "aggs": { "dropoffs_over_time": { "auto_date_histogram": { "field": "dropoff_datetime", "buckets": 20 } } } } }, { "name": 
"date_histogram_agg", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "01/01/2015", "lte": "21/01/2015", "format": "dd/MM/yyyy" } } }, "aggs": { "dropoffs_over_time": { "date_histogram": { "field": "dropoff_datetime", "calendar_interval": "day" } } } } }, { "name": "date_histogram_calendar_interval", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "date_histogram": { "field": "dropoff_datetime", "calendar_interval": "month" } } } } }, { "name": "date_histogram_calendar_interval_with_tz", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "date_histogram": { "field": "dropoff_datetime", "calendar_interval": "month", "time_zone": "America/New_York" } } } } }, { "name": "date_histogram_fixed_interval", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "date_histogram": { "field": "dropoff_datetime", "fixed_interval": "60d" } } } } }, { "name": "date_histogram_fixed_interval_with_tz", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "date_histogram": { "field": "dropoff_datetime", "fixed_interval": "60d", "time_zone": "America/New_York" } } } } }, { "name": "date_histogram_fixed_interval_with_metrics", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "date_histogram": { "field": "dropoff_datetime", "fixed_interval": 
"60d" }, "aggs": { "total_amount": { "stats": { "field": "total_amount" } }, "tip_amount": { "stats": { "field": "tip_amount" } }, "trip_distance": { "stats": { "field": "trip_distance" } } } } } } }, { "name": "auto_date_histogram", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "auto_date_histogram": { "field": "dropoff_datetime", "buckets": "12" } } } } }, { "name": "auto_date_histogram_with_tz", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "auto_date_histogram": { "field": "dropoff_datetime", "buckets": "13", "time_zone": "America/New_York" } } } } }, { "name": "auto_date_histogram_with_metrics", "operation-type": "search", "body": { "size": 0, "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2016-01-01 00:00:00" } } }, "aggs": { "dropoffs_over_time": { "auto_date_histogram": { "field": "dropoff_datetime", "buckets": "12" }, "aggs": { "total_amount": { "stats": { "field": "total_amount" } }, "tip_amount": { "stats": { "field": "tip_amount" } }, "trip_distance": { "stats": { "field": "trip_distance" } } } } } } }, { "name": "desc_sort_tip_amount", "operation-type": "search", "index": "nyc_taxis", "body": { "query": { "match_all": {} }, "sort" : [ {"tip_amount" : "desc"} ] } }, { "name": "asc_sort_tip_amount", "operation-type": "search", "index": "nyc_taxis", "body": { "query": { "match_all": {} }, "sort" : [ {"tip_amount" : "asc"} ] } } ``` The `_test-procedures` directory contains a `default.json` file that sets the order of operations performed by the workload. Similar to the `_operations` directory, the `_test-procedures` directory can also contain feature-specific test procedures, such as `searchable_snapshots.json` for `nyc_taxis`. 
The following example shows the searchable snapshots test procedure for `nyc_taxis`:

```json
{
  "name": "searchable-snapshot",
  "description": "Measuring performance for Searchable Snapshot feature. Based on the default test procedure 'append-no-conflicts'.",
  "schedule": [
    { "operation": "delete-index" },
    {
      "operation": {
        "operation-type": "create-index",
        "settings": {
          "index.codec": "best_compression",
          "index.refresh_interval": "30s",
          "index.translog.flush_threshold_size": "4g"
        }
      }
    },
    {
      "name": "check-cluster-health",
      "operation": {
        "operation-type": "cluster-health",
        "index": "nyc_taxis",
        "request-params": {
          "wait_for_status": "{{ cluster_health | default('green') }}",
          "wait_for_no_relocating_shards": "true"
        },
        "retry-until-success": true
      }
    },
    {
      "operation": "index",
      "warmup-time-period": 240,
      "clients": {{ bulk_indexing_clients | default(8) }},
      "ignore-response-error-level": "{{ error_level | default('non-fatal') }}"
    },
    { "name": "refresh-after-index", "operation": "refresh" },
    { "operation": { "operation-type": "force-merge", "request-timeout": 7200 } },
    { "name": "refresh-after-force-merge", "operation": "refresh" },
    { "operation": "wait-until-merges-finish" },
    { "operation": "create-snapshot-repository" },
    { "operation": "delete-snapshot" },
    { "operation": "create-snapshot" },
    { "operation": "wait-for-snapshot-creation" },
    { "operation": { "name": "delete-local-index", "operation-type": "delete-index" } },
    { "operation": "restore-snapshot" },
    { "operation": "default", "warmup-iterations": 50, "iterations": 100 },
    { "operation": "range", "warmup-iterations": 50, "iterations": 100 },
    { "operation": "distance_amount_agg", "warmup-iterations": 50, "iterations": 50 },
    { "operation": "autohisto_agg", "warmup-iterations": 50, "iterations": 100 },
    { "operation": "date_histogram_agg", "warmup-iterations": 50, "iterations": 100 }
  ]
}
```

## Next steps

Now that you have familiarized yourself with the anatomy of a workload, see the criteria for [Choosing a
workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/choosing-a-workload/).