---
id: tutorial-retention
title: Configure data retention
sidebar_label: Configure data retention
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

This tutorial demonstrates how to configure retention rules on a datasource to set the time intervals of data that will be retained or dropped.

For this tutorial, we'll assume you've already downloaded Apache Druid as described in
the [single-machine quickstart](index.md) and have it running on your local machine.

It will also be helpful to have finished the [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials.

## Load the example data

For this tutorial, we'll be using the Wikipedia edits sample data, with an ingestion task spec that will create a separate segment for each hour in the input data.

The ingestion spec can be found at `quickstart/tutorial/retention-index.json`. Let's submit that spec, which will create a datasource called `retention-tutorial`:

```bash
bin/post-index-task --file quickstart/tutorial/retention-index.json --url http://localhost:8081
```

After the ingestion completes, go to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to access the web console's datasource view.

This view shows the available datasources and a summary of the retention rules for each datasource:

![Summary](../assets/tutorial-retention-01.png "Summary")

Currently there are no rules set for the `retention-tutorial` datasource. Note that there are default rules for the cluster: load forever with 2 replicas in `_default_tier`.

This means that all data will be loaded regardless of timestamp, and each segment will be replicated to two Historical processes in the default tier.

In this tutorial, we will ignore the tiering and redundancy concepts for now.

Let's view the segments for the `retention-tutorial` datasource by clicking the "24 Segments" link next to "Fully Available".

The segments view ([http://localhost:8888/unified-console.html#segments](http://localhost:8888/unified-console.html#segments)) provides information about what segments a datasource contains. The page shows that there are 24 segments, each one containing data for a specific hour of 2015-09-12:

![Original segments](../assets/tutorial-retention-02.png "Original segments")
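
The segment count can also be checked from the command line. The following is a minimal sketch, assuming the quickstart Router is reachable at `localhost:8888` and that Druid SQL is enabled; it queries the `sys.segments` metadata table. The helper name `count_segments` is made up for this example.

```bash
# Sketch: count the available segments of the tutorial datasource via Druid SQL.
# Assumption: the quickstart Router listens on localhost:8888.
ROUTER_URL="http://localhost:8888"
DATASOURCE="retention-tutorial"

# JSON request body for the SQL endpoint; sys.segments holds segment metadata.
SEGMENT_COUNT_QUERY=$(cat <<EOF
{"query": "SELECT COUNT(*) AS segment_count FROM sys.segments WHERE datasource = '${DATASOURCE}' AND is_available = 1"}
EOF
)

# count_segments is a hypothetical helper; it only sends the query above.
count_segments() {
  curl -s -X POST "${ROUTER_URL}/druid/v2/sql" \
    -H 'Content-Type: application/json' \
    -d "${SEGMENT_COUNT_QUERY}"
}
```

Calling `count_segments` at this point should report the same 24 segments that the segments view shows, provided the ingestion completed as described.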
## Set retention rules

Suppose we want to drop data for the first 12 hours of 2015-09-12 and keep data for the later 12 hours of 2015-09-12.

Go to the [datasources view](http://localhost:8888/unified-console.html#datasources) and click the blue pencil icon next to `Cluster default: loadForever` for the `retention-tutorial` datasource.

A rule configuration window will appear:

![Rule configuration](../assets/tutorial-retention-03.png "Rule configuration")

Now click the `+ New rule` button twice.

In the upper rule box, select `Load` and `by interval`, and then enter `2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z` in the field next to `by interval`. Replicas can remain at 2 in the `_default_tier`.

In the lower rule box, select `Drop` and `forever`.

The rules should look like this:

![Set rules](../assets/tutorial-retention-04.png "Set rules")

Now click `Next`. The rule configuration process will ask for a user name and comment, for change logging purposes. You can enter `tutorial` for both.

Now click `Save`. You can see the new rules in the datasources view:

![New rules](../assets/tutorial-retention-05.png "New rules")
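
For reference, the two rules configured above can also be written as JSON and sent to the Coordinator's rules API instead of using the web console. This is a sketch under assumptions, not a required step: it assumes the Coordinator is reachable at `localhost:8081` (the address used for the ingestion task earlier), and `submit_rules` is a helper name made up for this example.

```bash
# Sketch: the tutorial's rule chain as a JSON array, evaluated top to bottom.
# Assumption: the Coordinator listens on localhost:8081.
COORDINATOR_URL="http://localhost:8081"
DATASOURCE="retention-tutorial"

RULES_JSON=$(cat <<'EOF'
[
  {
    "type": "loadByInterval",
    "interval": "2015-09-12T12:00:00.000Z/2015-09-13T00:00:00.000Z",
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
EOF
)

# submit_rules is a hypothetical helper. The author and comment headers
# play the same role as the user name and comment fields in the console.
submit_rules() {
  curl -s -X POST "${COORDINATOR_URL}/druid/coordinator/v1/rules/${DATASOURCE}" \
    -H 'Content-Type: application/json' \
    -H 'X-Druid-Author: tutorial' \
    -H 'X-Druid-Comment: tutorial' \
    -d "${RULES_JSON}"
}
```

Calling `submit_rules` replaces the datasource's rule list, so it should have the same effect as clicking `Save` in the console.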

Give the cluster a few minutes to apply the rule change, and go to the [segments view](http://localhost:8888/unified-console.html#segments) in the web console.

The segments for the first 12 hours of 2015-09-12 are now gone:

![New segments](../assets/tutorial-retention-06.png "New segments")

The resulting retention rule chain is the following:

1. loadByInterval 2015-09-12T12/2015-09-13 (12 hours)
2. dropForever
3. loadForever (default rule)

The rule chain is evaluated from top to bottom, with the default rule chain always added at the bottom.

The tutorial rule chain we just created loads data if it is within the specified 12-hour interval.

If data is not within the 12-hour interval, the rule chain evaluates `dropForever` next, which will drop any data.

The `dropForever` rule terminates the rule chain, effectively overriding the default `loadForever` rule, which will never be reached in this rule chain.
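
The first-match behavior described above can be illustrated with a small simulation. This is only a sketch of the evaluation order, not Druid's implementation: it treats an interval rule as matching when a segment's interval lies entirely inside the rule's interval, which is a simplification, and the `RULES` encoding and `first_match` helper are made up for this example.

```bash
# Rule chain from the tutorial, top to bottom, encoded as "action|start|end".
# Empty bounds mean "forever".
RULES=(
  "load|2015-09-12T12:00:00.000Z|2015-09-13T00:00:00.000Z"
  "drop||"
  "load||"   # cluster default loadForever; never reached
)

# first_match START END prints the action of the first rule that applies.
first_match() {
  local seg_start="$1" seg_end="$2" rule action start end
  for rule in "${RULES[@]}"; do
    IFS='|' read -r action start end <<< "$rule"
    # A "forever" rule matches every segment.
    if [[ -z "$start" ]]; then echo "$action"; return; fi
    # Same-format ISO-8601 UTC timestamps compare correctly as strings.
    if [[ ! "$seg_start" < "$start" && ! "$seg_end" > "$end" ]]; then
      echo "$action"; return
    fi
  done
}

# Walk the 24 hourly segments of 2015-09-12 and tally the outcome.
loaded=0
dropped=0
for hour in {0..23}; do
  seg_start=$(printf '2015-09-12T%02d:00:00.000Z' "$hour")
  if (( hour == 23 )); then
    seg_end="2015-09-13T00:00:00.000Z"
  else
    seg_end=$(printf '2015-09-12T%02d:00:00.000Z' $((hour + 1)))
  fi
  if [[ "$(first_match "$seg_start" "$seg_end")" == "load" ]]; then
    loaded=$((loaded + 1))
  else
    dropped=$((dropped + 1))
  fi
done
echo "loaded=${loaded} dropped=${dropped}"   # prints: loaded=12 dropped=12
```

The third entry never decides anything because the `drop` rule above it matches every segment that reaches it, which mirrors how `dropForever` shadows the default `loadForever` rule.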

Note that in this tutorial we defined a load rule on a specific interval.

If you want to retain data based on how old it is (for example, retain data that ranges from 3 months in the past to the present time), define a Period load rule instead.
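
As a rough illustration, a period-based rule chain for the 3-month example might look like the JSON below. This is a sketch under assumptions: the replica count and tier are carried over from this tutorial, and the trailing `dropForever` rule is added here on the assumption that older data should be dropped rather than fall through to the cluster default.

```bash
# Sketch: a period-based rule chain, printed as JSON.
# "P3M" is an ISO-8601 period meaning three months.
PERIOD_RULES_JSON=$(cat <<'EOF'
[
  {
    "type": "loadByPeriod",
    "period": "P3M",
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
EOF
)
echo "${PERIOD_RULES_JSON}"
```

Unlike the fixed interval used in this tutorial, the period is measured back from the current time, so the set of retained segments shifts as time passes.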
## Further reading

* [Load rules](../operations/rule-configuration.md)