mirror of https://github.com/apache/druid.git
Update retention rules doc (#13181)
* Update retention rules doc * Update rule-configuration.md * Updated * Updated * Updated * Updated * Update rule-configuration.md * Update rule-configuration.md
This commit is contained in:
parent
a9b39fc29d
commit
d1a4de022a

---
id: rule-configuration
title: "Using rules to drop and retain data"
---

<!--
~ under the License.
-->

Data retention rules allow you to configure Apache Druid to conform to your data retention policies. Your data retention policies specify which data to retain and which data to drop from the cluster.

Druid supports [load](#load-rules), [drop](#drop-rules), and [broadcast](#broadcast-rules) rules. Each rule is a JSON object. See the [rule definitions below](#load-rules) for details.

You can configure a default set of rules to apply to all datasources, and/or you can set specific rules for specific datasources. See [rule structure](#rule-structure) to see how rule order impacts the way the Coordinator applies retention rules.

You can specify the data to retain or drop in the following ways:

- Forever: all data in the segment.
- Period: segment data specified as an offset from the present time.
- Interval: a fixed time range.

Retention rules are persistent: they remain in effect until you change them. Druid stores retention rules in its [metadata store](../dependencies/metadata-storage.md).

## Set retention rules

You can use the Druid [web console](./web-console.md) or the [Coordinator API](./api-reference.md#coordinator) to create and manage retention rules.

### Use the web console

To set retention rules in the Druid web console:

1. On the console home page, click **Datasources**.
2. Click the name of your datasource to open the data window.
3. Select **Actions > Edit retention rules**.
4. Click **+New rule**.
5. Select a rule type and set properties according to the [rule definitions below](#load-rules).
6. Click **Next** and enter a description for the rule.
7. Click **Save** to save and apply the rule to the datasource.

### Use the Coordinator API

To set one or more default retention rules for all datasources, send a POST request containing a JSON object for each rule to `/druid/coordinator/v1/rules/_default`.

The following example request sets a default forever broadcast rule for all datasources:

```bash
curl --location --request POST 'http://localhost:8888/druid/coordinator/v1/rules/_default' \
--header 'Content-Type: application/json' \
--data-raw '[{
  "type": "broadcastForever"
}]'
```

To set one or more retention rules for a specific datasource, send a POST request containing a JSON object for each rule to `/druid/coordinator/v1/rules/{datasourceName}`.

The following example request sets a period drop rule and a period broadcast rule for the `wikipedia` datasource:

```bash
curl --location --request POST 'http://localhost:8888/druid/coordinator/v1/rules/wikipedia' \
--header 'Content-Type: application/json' \
--data-raw '[{
  "type": "dropByPeriod",
  "period": "P1M",
  "includeFuture": true
},
{
  "type": "broadcastByPeriod",
  "period": "P1M",
  "includeFuture": true
}]'
```

To retrieve all rules for all datasources, send a GET request to `/druid/coordinator/v1/rules`. For example:

```bash
curl --location --request GET 'http://localhost:8888/druid/coordinator/v1/rules'
```

### Rule structure

The rules API accepts an array of rules as JSON objects. The JSON object you send in the API request for each rule is specific to the rule types outlined below.

> You must pass the entire array of rules, in your desired order, with each API request. Each POST request to the rules API overwrites the existing rules for the specified datasource.

The order of rules is very important. The Coordinator reads rules in the order in which they appear in the rules list. For example, in the following screenshot the Coordinator evaluates data against rule 1, then rule 2, then rule 3:

![retention rules](../assets/retention-rules.png)

The Coordinator cycles through all used segments and matches each segment with the first rule that applies. Each segment can only match a single rule.

In the web console you can use the up and down arrows on the right side of the interface to change the order of the rules.

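For example, the following sketch of a three-rule chain relies on this ordering (the `hot` tier name and replica counts are illustrative): segments from the past month match rule 1 and load onto the `hot` tier, older segments from the past year fall through to rule 2 and load onto the default tier, and anything older matches rule 3 and is dropped:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "includeFuture": true,
    "tieredReplicants": {
      "hot": 1
    }
  },
  {
    "type": "loadByPeriod",
    "period": "P1Y",
    "includeFuture": true,
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "dropForever"
  }
]
```
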
## Load rules

Load rules define how Druid assigns segments to [historical process tiers](./mixed-workloads.md#historical-tiering), and how many replicas of a segment exist in each tier.

If you have a single tier, Druid automatically names the tier `_default` and loads all segments onto it. If you define an additional tier, you must define a load rule to specify which segments to load on that tier. Until you define a load rule, your new tier remains empty.

### Forever load rule

The forever load rule assigns all datasource segments to specified tiers. It is the default rule Druid applies to datasources. Forever load rules have type `loadForever`.

The following example places one replica of each segment on a custom tier named `hot`, and another single replica on the default tier.

```json
{
  "type": "loadForever",
  "tieredReplicants": {
    "hot": 1,
    "_default_tier": 1
  }
}
```

Set the following property:

- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier.

### Period load rule

You can use a period load rule to assign segment data in a specific period to a tier. Druid compares a segment's interval to the period you specify in the rule and loads the matching data.

Period load rules have type `loadByPeriod`. The following example places one replica of data in a one-month period on a custom tier named `hot`, and another single replica on the default tier.

```json
{
  "type": "loadByPeriod",
  "period": "P1M",
  "includeFuture": true,
  "tieredReplicants": {
    "hot": 1,
    "_default_tier": 1
  }
}
```

Set the following properties:

- `period`: a JSON object representing [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) periods. The period is from some time in the past to the present, or into the future if `includeFuture` is set to `true`.
- `includeFuture`: a boolean flag to instruct Druid to match a segment if:
  - the segment interval overlaps the rule interval, or
  - the segment interval starts any time after the rule interval starts.

  You can use this property to load segments with future start and end dates, where "future" is relative to the time when the Coordinator evaluates data against the rule. Defaults to `true`.
- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier.

### Interval load rule

You can use an interval load rule to assign a specific range of data to a tier. For example, analysts may typically work with the complete data set for all of last week, and less so with the data for the current week.

Interval load rules have type `loadByInterval`. The following example places one replica of data matching the specified interval on a custom tier named `hot`, and another single replica on the default tier.

```json
{
  "type": "loadByInterval",
  "interval": "2012-01-01/2013-01-01",
  "tieredReplicants": {
    "hot": 1,
    "_default_tier": 1
  }
}
```

Set the following properties:

- `interval`: the load interval, specified as an [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) range encoded as a string.
- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier.

## Drop rules

Drop rules define when Druid drops segments from the cluster. Druid keeps dropped data in deep storage. Note that if you enable segment autokill, or you run a kill task, Druid deletes the data from deep storage. See [Data deletion](../data-management/delete.md) for more information on deleting data.

If you want to use a [load rule](#load-rules) to retain only data from a defined period of time, you must also define a drop rule. If you don't define a drop rule, Druid retains data that doesn't lie within your defined period according to the default rule, `loadForever`.

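For example, a minimal sketch of this pairing retains only the most recent month of data on the default tier and drops everything older:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "includeFuture": true,
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "dropForever"
  }
]
```
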
### Forever drop rule

The forever drop rule drops all segment data from the cluster. If you configure a set of rules with a forever drop rule as the last rule, Druid drops any segment data that remains after it evaluates the higher priority rules.

Forever drop rules have type `dropForever`:

```json
{
  "type": "dropForever"
}
```

### Period drop rule

Druid compares a segment's interval to the period you specify in the rule and drops the matching data. The rule matches if the period contains the segment interval. This rule always drops recent data.

Period drop rules have type `dropByPeriod` and the following JSON structure:

```json
{
  "type": "dropByPeriod",
  "period": "P1M",
  "includeFuture": true
}
```

Set the following properties:

- `period`: a JSON object representing [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) periods. The period is from some time in the past to the future or to the current time, depending on the `includeFuture` flag.
- `includeFuture`: a boolean flag to instruct Druid to match a segment if:
  - the segment interval overlaps the rule interval, or
  - the segment interval starts any time after the rule interval starts.

  You can use this property to drop segments with future start and end dates, where "future" is relative to the time when the Coordinator evaluates data against the rule. Defaults to `true`.

### Period drop before rule

Druid compares a segment's interval to the period you specify in the rule and drops the matching data. The rule matches if the segment interval is before the specified period.

If you only want to retain recent data, you can use this rule to drop old data before a specified period, and add a `loadForever` rule to retain the data that follows it. Note that the rule combination `dropBeforeByPeriod` + `loadForever` is equivalent to `loadByPeriod(includeFuture = true)` + `dropForever`.

Period drop before rules have type `dropBeforeByPeriod` and the following JSON structure:

```json
{
  "type": "dropBeforeByPeriod",
  "period": "P1M"
}
```

Set the following property:

- `period`: a JSON object representing [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) periods.

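For example, the following sketch keeps only the most recent month of data on the default tier by dropping everything before it. It is equivalent to the `loadByPeriod` plus `dropForever` pairing shown in the [drop rules](#drop-rules) introduction:

```json
[
  {
    "type": "dropBeforeByPeriod",
    "period": "P1M"
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 1
    }
  }
]
```
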
### Interval drop rule

You can use a drop interval rule to prevent Druid from loading a specified range of data onto any tier. The range is typically your oldest data. The dropped data resides in deep storage, but is not queryable. If you need to query the data, update or remove the interval drop rule so that Druid reloads the data.

Interval drop rules have type `dropByInterval` and the following JSON structure:

```json
{
  "type": "dropByInterval",
  "interval": "2012-01-01/2013-01-01"
}
```

Set the following property:

- `interval`: the drop interval, specified as an [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) range encoded as a string.

## Broadcast rules

Druid extensions use broadcast rules to load segment data onto all brokers in the cluster. Apply broadcast rules in a test environment, not in production.

### Forever broadcast rule

The forever broadcast rule loads all segment data in your datasources onto all brokers in the cluster.

Forever broadcast rules have type `broadcastForever`:

```json
{
  "type": "broadcastForever"
}
```

### Period broadcast rule

Druid compares a segment's interval to the period you specify in the rule and loads the matching data onto the brokers in the cluster.

Period broadcast rules have type `broadcastByPeriod` and the following JSON structure:

```json
{
  "type": "broadcastByPeriod",
  "period": "P1M",
  "includeFuture": true
}
```

Set the following properties:

- `period`: a JSON object representing [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) periods. The period is from some time in the past to the future or to the current time, depending on the `includeFuture` flag.
- `includeFuture`: a boolean flag to instruct Druid to match a segment if:
  - the segment interval overlaps the rule interval, or
  - the segment interval starts any time after the rule interval starts.

  You can use this property to broadcast segments with future start and end dates, where "future" is relative to the time when the Coordinator evaluates data against the rule. Defaults to `true`.

### Interval broadcast rule

An interval broadcast rule loads a specific range of data onto the brokers in the cluster.

Interval broadcast rules have type `broadcastByInterval` and the following JSON structure:

```json
{
  "type": "broadcastByInterval",
  "interval": "2012-01-01/2013-01-01"
}
```

Set the following property:

- `interval`: the broadcast interval, specified as an [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) range encoded as a string.

## Permanently delete data

Druid can fully drop data from the cluster, wipe the metadata store entry, and remove the data from deep storage for any segments marked `unused`. Note that Druid always marks segments dropped from the cluster by rules as `unused`. You can submit a [kill task](../ingestion/tasks.md) to the [Overlord](../design/overlord.md) to do this.

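For example, the following sketch submits a kill task through the Router at `localhost:8888`, permanently deleting unused `wikipedia` segments in the given interval. The host, datasource, and interval are placeholders to adapt to your cluster:

```bash
curl --location --request POST 'http://localhost:8888/druid/indexer/v1/task' \
--header 'Content-Type: application/json' \
--data-raw '{
  "type": "kill",
  "dataSource": "wikipedia",
  "interval": "2012-01-01/2013-01-01"
}'
```
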
## Reload dropped data

You can't use a single rule to reload data Druid has dropped from a cluster.

To reload dropped data:

1. Set your retention period. For example, change the retention period from one month to two months.
2. Use the web console or the API to mark all segments belonging to the datasource as `used`.

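For step 2, one option is the Coordinator datasources endpoint. The following sketch marks all segments of a hypothetical `wikipedia` datasource as used, sent through the Router at `localhost:8888`:

```bash
curl --location --request POST 'http://localhost:8888/druid/coordinator/v1/datasources/wikipedia'
```
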
## Learn more

For more information about using retention rules in Druid, see the following topics:

- [Tutorial: Configuring data retention](../tutorials/tutorial-retention.md)
- [Configure Druid for mixed workloads](../operations/mixed-workloads.md)
- [Router process](../design/router.md)