Update segment replication backpressure (#3839)

* Update segment replication backpressure
* Implemented editorial comments
* Added benchmarks
* Implemented editorial comments

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

Parent: aa6483903c
Commit: 1ff9a3ad16

@@ -119,6 +119,14 @@ code {

  box-shadow: 0 1px 2px rgba(0, 0, 0, 0.12), 0 3px 10px rgba(0, 0, 0, 0.08);
}

.td-custom {
  @extend td;

  &:first-of-type {
    border-left: $border $border-color;
  }
}

img {
  @extend .panel;
}

@@ -7,26 +7,26 @@ has_children: false

grand_parent: Availability and Recovery
---

## Segment replication back-pressure
## Segment replication backpressure

Segment replication back-pressure is a per-shard level rejection mechanism that dynamically rejects indexing requests when the number of replica shards in your cluster are falling behind the number of primary shards. With Segment replication back-pressure, indexing requests are rejected when more than half of the replication group is stale, which is defined by the `MAX_ALLOWED_STALE_SHARDS` field. A replica is considered stale if it is behind by more than the defined `MAX_INDEXING_CHECKPOINTS` field, and its current replication lag is over the defined `MAX_REPLICATION_TIME_SETTING` field.
Segment replication backpressure is a shard-level rejection mechanism that dynamically rejects indexing requests as replica shards in your cluster fall behind primary shards. With segment replication backpressure, indexing requests are rejected when the percentage of stale shards in the replication group exceeds `MAX_ALLOWED_STALE_SHARDS` (50% by default). A replica is considered stale if it is behind the primary shard by more checkpoints than the `MAX_INDEXING_CHECKPOINTS` setting allows and its current replication lag exceeds the `MAX_REPLICATION_TIME_SETTING` value.

Replica shards are also monitored to determine whether the shards are stuck or are lagging for an extended period of time. When replica shards are stuck or lagging for more than double the amount of time defined by the `MAX_REPLICATION_TIME_SETTING` field, the shards are removed and then replaced with new replica shards.
Replica shards are also monitored to determine whether the shards are stuck or lagging for an extended period of time. When replica shards are stuck or lagging for more than double the amount of time defined by the `MAX_REPLICATION_TIME_SETTING` field, the shards are removed and replaced with new replica shards.

## Request fields

Segment replication back-pressure is enabled by default. The following are dynamic cluster settings, and can be enabled or disabled using the [cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-api/cluster-settings/) API endpoint.
Segment replication backpressure is disabled by default. To enable it, set `SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED` to `true`. You can update the following dynamic cluster settings using the [cluster settings]({{site.url}}{{site.baseurl}}/api-reference/cluster-api/cluster-settings/) API endpoint.

Field | Data type | Description
:--- | :--- | :---
SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED | Boolean | Enables the segment replication back-pressure mechanism. Default is `true`.
MAX_REPLICATION_TIME_SETTING | Time unit | The maximum time that a replica shard can take to copy from primary. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication back-pressure mechanism is triggered. Default is `5 minutes`.
MAX_INDEXING_CHECKPOINTS | Integer | The maximum number of indexing checkpoints that a replica shard can fall behind when copying from primary. Once `MAX_INDEXING_CHECKPOINTS` is breached along with `MAX_REPLICATION_TIME_SETTING`, the segment replication back-pressure mechanism is triggered. Default is `4` checkpoints.
MAX_ALLOWED_STALE_SHARDS | Floating point | The maximum number of stale replica shards that can exist in a replication group. Once `MAX_ALLOWED_STALE_SHARDS` is breached, the segment replication back-pressure mechanism is triggered. Default is `.5`, which is 50% of a replication group.
SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED | Boolean | Enables the segment replication backpressure mechanism. Default is `false`.
MAX_REPLICATION_TIME_SETTING | Time unit | The maximum amount of time that a replica shard can take to copy from the primary shard. Once `MAX_REPLICATION_TIME_SETTING` is breached along with `MAX_INDEXING_CHECKPOINTS`, the segment replication backpressure mechanism is initiated. Default is `5 minutes`.
MAX_INDEXING_CHECKPOINTS | Integer | The maximum number of indexing checkpoints that a replica shard can fall behind when copying from the primary shard. Once `MAX_INDEXING_CHECKPOINTS` is breached along with `MAX_REPLICATION_TIME_SETTING`, the segment replication backpressure mechanism is initiated. Default is `4` checkpoints.
MAX_ALLOWED_STALE_SHARDS | Floating point | The maximum fraction of stale replica shards that can exist in a replication group. Once `MAX_ALLOWED_STALE_SHARDS` is breached, the segment replication backpressure mechanism is initiated. Default is `.5`, which is 50% of a replication group.
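
The fields in this table correspond to dynamic cluster settings. The following is a minimal sketch of an update request; the concrete REST setting keys (`segrep.pressure.*`) are assumptions here and may differ by OpenSearch version, so verify them against the cluster settings documentation for your release.

```bash
# Sketch only: the REST setting keys below are assumptions, not confirmed by this page.
curl -X PUT "$host/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "segrep.pressure.enabled": true,
    "segrep.pressure.time.limit": "5m",
    "segrep.pressure.checkpoint.limit": 4,
    "segrep.pressure.replica.stale.limit": 0.5
  }
}
'
```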

## Path and HTTP methods

You can use the segment replication API endpoint to retrieve segment replication back-pressure metrics.
You can use the segment replication API endpoint to retrieve segment replication backpressure metrics as follows:

```bash
GET _cat/segment_replication

@@ -40,5 +40,4 @@ shardId target_node target_host checkpoints_behind bytes_behind cur
[index-1][0] runTask-1 127.0.0.1 0 0b 0s 7ms 0
```

- `checkpoints_behind` and `current_lag` directly correlate with `MAX_INDEXING_CHECKPOINTS` and `MAX_REPLICATION_TIME_SETTING`.
- `checkpoints_behind` and `current_lag` metrics are taken into consideration when triggering segment replication back-pressure.

The `checkpoints_behind` and `current_lag` metrics are taken into consideration when initiating segment replication backpressure. They are checked against `MAX_INDEXING_CHECKPOINTS` and `MAX_REPLICATION_TIME_SETTING`, respectively.
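
To focus on those two metrics, the generic `_cat` query parameters can be used to select columns. This is a sketch only: `v` and `h` are the standard `_cat` options for verbose headers and column selection, and the column names are taken from the sample response above, so confirm they match your version.

```bash
# Sketch: verbose output restricted to the columns that feed the backpressure checks.
GET _cat/segment_replication?v&h=shardId,target_node,checkpoints_behind,current_lag
```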

@@ -4,6 +4,7 @@ title: Segment replication

nav_order: 70
has_children: true
parent: Availability and Recovery
datatable: true
redirect_from:
  - /opensearch/segment-replication/
  - /opensearch/segment-replication/index/

@@ -64,186 +65,302 @@ curl -X PUT "$host/_cluster/settings?pretty" -H 'Content-Type: application/json'

```
{% include copy-curl.html %}
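
Segment replication can also be enabled for an individual index when it is created. The following is a minimal sketch, not taken from this diff: the index name is hypothetical, and the `replication.type` index setting is assumed to be available in your OpenSearch version.

```bash
# Sketch: create a hypothetical index that uses segment replication.
curl -X PUT "$host/my-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "replication.type": "SEGMENT"
    }
  }
}
'
```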

## Comparing replication benchmarks

During initial benchmarks, segment replication users reported 40% higher throughput than when using document replication with the same cluster setup.

The following benchmarks were collected with [OpenSearch-benchmark](https://github.com/opensearch-project/opensearch-benchmark) using the [`nyc_taxi`](https://github.com/topics/nyc-taxi-dataset) dataset.

The test run was performed on a 10-node (m5.xlarge) cluster with 10 shards and 5 replicas. Each shard was about 3.2 GB in size.

The benchmarking results are listed in the following table.

Metric | Statistic | Document Replication | Segment Replication | Percent difference
:--- | :--- | :--- | :--- | :---
Test execution time (minutes) | | 118.00 | 98.00 | 27%
Index Throughput (number of requests per second) | p0 | 71917.20 | 105740.70 | 47.03%
Index Throughput (number of requests per second) | p50 | 77392.90 | 110988.70 | 43.41%
Index Throughput (number of requests per second) | p100 | 93032.90 | 123131.50 | 32.35%
Query Throughput (number of requests per second) | p0 | 1.748 | 1.744 | -0.23%
Query Throughput (number of requests per second) | p50 | 1.754 | 1.753 | 0%
Query Throughput (number of requests per second) | p100 | 1.769 | 1.768 | -0.06%
CPU (%) | p50 | 37.19 | 25.579 | -31.22%
CPU (%) | p90 | 94.00 | 88.00 | -6.38%
CPU (%) | p99 | 100 | 100 | 0%
CPU (%) | p100 | 100.00 | 100.00 | 0%
Memory (%) | p50 | 30 | 24.241 | -19.20%
Memory (%) | p90 | 61.00 | 55.00 | -9.84%
Memory (%) | p99 | 72 | 62 | -13.89%
Memory (%) | p100 | 80.00 | 67.00 | -16.25%
Index Latency (ms) | p50 | 803 | 647.90 | -19.32%
Index Latency (ms) | p90 | 1215.50 | 908.60 | -25.25%
Index Latency (ms) | p99 | 9738.70 | 1565 | -83.93%
Index Latency (ms) | p100 | 21559.60 | 2747.70 | -87.26%
Query Latency (ms) | p50 | 36.209 | 37.799 | 4.39%
Query Latency (ms) | p90 | 42.971 | 60.823 | 41.54%
Query Latency (ms) | p99 | 50.604 | 70.072 | 38.47%
Query Latency (ms) | p100 | 52.883 | 73.911 | 39.76%

Your results may vary based on the cluster topology, hardware used, shard count, and merge settings.
{: .note }

## Other considerations
## Considerations

When using segment replication, consider the following:

1. Enabling segment replication for an existing index requires [reindexing](https://github.com/opensearch-project/OpenSearch/issues/3685).
1. Rolling upgrades are not currently supported. Full cluster restarts are required when upgrading indexes using segment replication. See [Issue 3881](https://github.com/opensearch-project/OpenSearch/issues/3881).
1. [Cross-cluster replication](https://github.com/opensearch-project/OpenSearch/issues/4090) does not currently use segment replication to copy between clusters.
1. Increased network congestion on primary shards. See [Issue - Optimize network bandwidth on primary shards](https://github.com/opensearch-project/OpenSearch/issues/4245).
1. Integration with remote-backed storage as the source of replication is [currently unsupported](https://github.com/opensearch-project/OpenSearch/issues/4448).
1. Segment replication leads to increased network congestion on primary shards. See [Issue - Optimize network bandwidth on primary shards](https://github.com/opensearch-project/OpenSearch/issues/4245).
1. Integration with remote-backed storage as the source of replication is [currently not supported](https://github.com/opensearch-project/OpenSearch/issues/4448).
1. Read-after-write guarantees: The `wait_until` refresh policy is not compatible with segment replication. If you use the `wait_until` refresh policy while ingesting documents, you'll get a response only after the primary node has refreshed and made those documents searchable. Replica shards will respond only after having written to their local translog. We are exploring other mechanisms for providing read-after-write guarantees. For more information, see the corresponding [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/6046).
1. System indexes will continue to use document replication internally until read-after-write guarantees are available. In this case, document replication does not hinder the overall performance because there are few system indexes.
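
To make the read-after-write consideration above concrete, the following is a minimal sketch of an indexing request that asks for `wait_until`-style refresh behavior. The REST parameter value `refresh=wait_for` is assumed here to correspond to the `wait_until` refresh policy, and the index and document names are hypothetical.

```bash
# Sketch: `refresh=wait_for` is assumed to be the REST equivalent of the `wait_until` refresh policy,
# which is not compatible with segment replication.
curl -X PUT "$host/my-index/_doc/1?refresh=wait_for&pretty" -H 'Content-Type: application/json' -d'
{
  "field": "value"
}
'
```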
## Benchmarks

During initial benchmarks, segment replication users reported 40% higher throughput than when using document replication with the same cluster setup.

The following benchmarks were collected with [OpenSearch-benchmark](https://github.com/opensearch-project/opensearch-benchmark) using the [`stackoverflow`](https://www.kaggle.com/datasets/stackoverflow/stackoverflow) and [`nyc_taxi`](https://github.com/topics/nyc-taxi-dataset) datasets.
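
For reference, a benchmark run against an existing cluster is typically launched with a command along the following lines. This is a sketch only: the workload name, target host, and flags are assumptions, not the exact invocation used to produce the numbers below.

```bash
# Sketch: an OpenSearch Benchmark run against an already-provisioned cluster.
opensearch-benchmark execute-test \
  --workload=nyc_taxis \
  --target-hosts=https://localhost:9200 \
  --pipeline=benchmark-only
```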

The benchmarks demonstrate the effect of the following configurations on segment replication:

- [The workload size](#increasing-the-workload-size)
- [The number of primary shards](#increasing-the-number-of-primary-shards)
- [The number of replicas](#increasing-the-number-of-replicas)

Your results may vary based on the cluster topology, hardware used, shard count, and merge settings.
{: .note }

### Increasing the workload size

The following table lists benchmarking results for the `nyc_taxi` dataset with the following configuration:

- 10 m5.xlarge data nodes
- 40 primary shards, 1 replica each (80 shards total)
- 4 primary shards and 4 replica shards per node

Workload size | Metric | Statistic | Document Replication | Segment Replication | Percent difference
:--- | :--- | :--- | :--- | :--- | :---
40 GB primary shard, 80 GB total | Store size | | 85.2781 | 91.2268 | N/A
40 GB primary shard, 80 GB total | Index throughput (number of requests per second) | Minimum | 148,134 | 185,092 | 24.95%
40 GB primary shard, 80 GB total | Index throughput (number of requests per second) | Median | 160,110 | 189,799 | 18.54%
40 GB primary shard, 80 GB total | Index throughput (number of requests per second) | Maximum | 175,196 | 190,757 | 8.88%
40 GB primary shard, 80 GB total | Error rate | | 0.00% | 0.00% | 0.00%
240 GB primary shard, 480 GB total | Store size | | 515.726 | 558.039 | N/A
240 GB primary shard, 480 GB total | Index throughput (number of requests per second) | Minimum | 100,140 | 168,335 | 68.10%
240 GB primary shard, 480 GB total | Index throughput (number of requests per second) | Median | 106,642 | 170,573 | 59.95%
240 GB primary shard, 480 GB total | Index throughput (number of requests per second) | Maximum | 108,583 | 172,507 | 58.87%
240 GB primary shard, 480 GB total | Error rate | | 0.00% | 0.00% | 0.00%

As the size of the workload increases, the benefits of segment replication are amplified because the replicas are not required to index the larger dataset. In general, segment replication leads to higher throughput at lower resource cost than document replication in all cluster configurations, not accounting for replication lag.

### Increasing the number of primary shards

The following table lists benchmarking results for the `nyc_taxi` dataset for 40 and 100 primary shards.

Configuration | Metric | Statistic | Document Replication | Segment Replication | Percent difference
:--- | :--- | :--- | :--- | :--- | :---
40 primary shards, 1 replica | Index throughput (number of requests per second) | Minimum | 148,134 | 185,092 | 24.95%
40 primary shards, 1 replica | Index throughput (number of requests per second) | Median | 160,110 | 189,799 | 18.54%
40 primary shards, 1 replica | Index throughput (number of requests per second) | Maximum | 175,196 | 190,757 | 8.88%
40 primary shards, 1 replica | Error rate | | 0.00% | 0.00% | 0.00%
100 primary shards, 1 replica | Index throughput (number of requests per second) | Minimum | 151,404 | 167,391 | 9.55%
100 primary shards, 1 replica | Index throughput (number of requests per second) | Median | 154,796 | 172,995 | 10.52%
100 primary shards, 1 replica | Index throughput (number of requests per second) | Maximum | 166,173 | 174,655 | 4.86%
100 primary shards, 1 replica | Error rate | | 0.00% | 0.00% | 0.00%

As the number of primary shards increases, the benefits of segment replication over document replication decrease. While segment replication is still beneficial with a larger number of primary shards, the difference in performance becomes less pronounced because there are more primary shards per node that must copy segment files across the cluster.

### Increasing the number of replicas

The following table lists benchmarking results for the `stackoverflow` dataset for 1 and 9 replicas.

Configuration | Metric | Statistic | Document Replication | Segment Replication | Percent difference
:--- | :--- | :--- | :--- | :--- | :---
10 primary shards, 1 replica | Index throughput (number of requests per second) | Median | 72,598.10 | 90,776.10 | 25.04%
10 primary shards, 1 replica | Index throughput (number of requests per second) | Maximum | 86,130.80 | 96,471.00 | 12.01%
10 primary shards, 1 replica | CPU usage (%) | p50 | 17 | 18.857 | 10.92%
10 primary shards, 1 replica | CPU usage (%) | p90 | 76 | 82.133 | 8.07%
10 primary shards, 1 replica | CPU usage (%) | p99 | 100 | 100 | 0%
10 primary shards, 1 replica | CPU usage (%) | p100 | 100 | 100 | 0%
10 primary shards, 1 replica | Memory usage (%) | p50 | 35 | 23 | −34.29%
10 primary shards, 1 replica | Memory usage (%) | p90 | 59 | 57 | −3.39%
10 primary shards, 1 replica | Memory usage (%) | p99 | 69 | 61 | −11.59%
10 primary shards, 1 replica | Memory usage (%) | p100 | 72 | 62 | −13.89%
10 primary shards, 1 replica | Error rate | | 0.00% | 0.00% | 0.00%
10 primary shards, 9 replicas | Index throughput (number of requests per second) | Median | 16,537.00 | 14,429.80 | −12.74%
10 primary shards, 9 replicas | Index throughput (number of requests per second) | Maximum | 21,472.40 | 38,235.00 | 78.07%
10 primary shards, 9 replicas | CPU usage (%) | p50 | 69.857 | 8.833 | −87.36%
10 primary shards, 9 replicas | CPU usage (%) | p90 | 99 | 86.4 | −12.73%
10 primary shards, 9 replicas | CPU usage (%) | p99 | 100 | 100 | 0%
10 primary shards, 9 replicas | CPU usage (%) | p100 | 100 | 100 | 0%
10 primary shards, 9 replicas | Memory usage (%) | p50 | 42 | 40 | −4.76%
10 primary shards, 9 replicas | Memory usage (%) | p90 | 59 | 63 | 6.78%
10 primary shards, 9 replicas | Memory usage (%) | p99 | 66 | 70 | 6.06%
10 primary shards, 9 replicas | Memory usage (%) | p100 | 69 | 72 | 4.35%
10 primary shards, 9 replicas | Error rate | | 0.00% | 2.30% | 2.30%

As the number of replicas increases, the amount of time required for primary shards to keep replicas up to date (known as the _replication lag_) also increases. This is because segment replication copies the segment files directly from primary shards to replicas.

The benchmarking results show a non-zero error rate as the number of replicas increases. The error rate indicates that the [segment replication backpressure]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/segment-replication/backpressure/) mechanism is initiated when replicas cannot keep up with the primary shard. However, the error rate is offset by the significant CPU and memory gains that segment replication provides.

## Next steps

1. Track [future enhancements to segment replication](https://github.com/orgs/opensearch-project/projects/99).