---
id: delete
title: "Data deletion"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements. See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership. The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License. You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied. See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
## By time range, manually

Apache Druid stores data [partitioned by time chunk](../design/storage.md) and supports
deleting data for time chunks by dropping segments. This is a fast, metadata-only operation.

Deletion by time range happens in two steps:

1. Segments to be deleted must first be marked as ["unused"](../design/storage.md#segment-lifecycle). This can
   happen when a segment is dropped by a [drop rule](../operations/rule-configuration.md) or when you manually mark a
   segment unused through the Coordinator API or web console. This is a soft delete: the data is not available for
   querying, but the segment files remain in deep storage, and the segment records remain in the metadata store.
2. Once a segment is marked "unused", you can use a [`kill` task](#kill-task) to permanently delete the segment file from
   deep storage and remove its record from the metadata store. This is a hard delete: the data is unrecoverable unless
   you have a backup.

For documentation on disabling segments using the Coordinator API, see the
[Legacy metadata API reference](../api-reference/legacy-metadata-api.md#datasources).
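
For example, to soft-delete one month of a hypothetical datasource named `wikipedia`, you might `POST` a payload along the following lines to the Coordinator's mark-unused endpoint for that datasource. This is only a sketch; the exact endpoint and payload format are documented in the API reference above:

```json
{
  "interval": "2016-06-01/2016-07-01"
}
```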
A data deletion tutorial is available at [Tutorial: Deleting data](../tutorials/tutorial-delete-data.md).

## By time range, automatically

Druid supports [load and drop rules](../operations/rule-configuration.md), which are used to define intervals of time
where data should be preserved, and intervals where data should be discarded. Data that falls under a drop rule is
marked unused, in the same manner as if you [manually mark that time range unused](#by-time-range-manually). This is a
fast, metadata-only operation.

Data that is dropped in this way is marked unused, but remains in deep storage. To permanently delete it, use a
[`kill` task](#kill-task).
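
As a sketch of what such rules can look like, the following rule set (the tier name and periods are illustrative) keeps the most recent three months of data and drops everything older. Rules are evaluated from top to bottom, so any time chunk not matched by the load rule falls through to the drop rule:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P3M",
    "includeFuture": true,
    "tieredReplicants": {
      "_default_tier": 2
    }
  },
  {
    "type": "dropForever"
  }
]
```

See the rule configuration documentation linked above for the full set of rule types and how the Coordinator evaluates them.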
## Specific records

Druid supports deleting specific records using [reindexing](update.md#reindex) with a filter. The filter specifies which
data remains after reindexing, so it must be the inverse of the data you want to delete. Because segments must be
rewritten to delete data in this way, it can be a time-consuming operation.

For example, to delete records where `userName` is `'bob'` with native batch indexing, use a
[`transformSpec`](../ingestion/ingestion-spec.md#transformspec) with filter `{"type": "not", "field": {"type":
"selector", "dimension": "userName", "value": "bob"}}`.
To delete the same records using SQL, use [REPLACE](../multi-stage-query/concepts.md#replace) with `WHERE userName <> 'bob'`.
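
Written out in full, such a statement might look like the following sketch, which assumes a hypothetical datasource named `my_table` and day-granularity partitioning:

```sql
REPLACE INTO "my_table" OVERWRITE ALL
SELECT *
FROM "my_table"
WHERE "userName" <> 'bob'
PARTITIONED BY DAY
```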
To reindex using [native batch](../ingestion/native-batch.md), use the [`druid` input
source](../ingestion/input-sources.md#druid-input-source). If needed,
[`transformSpec`](../ingestion/ingestion-spec.md#transformspec) can be used to filter or modify data during the
reindexing job. To reindex with SQL, use [`REPLACE <table> OVERWRITE`](../multi-stage-query/reference.md#replace)
with `SELECT ... FROM <table>`. (Druid does not have `UPDATE` or `ALTER TABLE` statements.) Any SQL SELECT query can be
used to filter, modify, or enrich the data during the reindexing job.
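
For the native batch route, the reindexing task's `ioConfig` points the `druid` input source back at the existing datasource, roughly like this (a sketch showing only the relevant fragment, with a hypothetical datasource name and interval):

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "druid",
    "dataSource": "my_table",
    "interval": "2016-06-01/2016-07-01"
  }
}
```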
Data that is deleted in this way is marked unused, but remains in deep storage. To permanently delete it, use a [`kill`
task](#kill-task).

## Entire table

Deleting an entire table works the same way as [deleting part of a table by time range](#by-time-range-manually). First,
mark all segments unused using the Coordinator API or web console. Then, optionally, delete it permanently using a
[`kill` task](#kill-task).

<a name="kill-task"></a>

## Permanently (`kill` task)

Data that has been overwritten or soft-deleted still remains as segments that have been marked unused. You can use a
`kill` task to permanently delete this data.

The available grammar is:

```json
{
  "type": "kill",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "interval": <all_unused_segments_in_this_interval_will_die!>,
  "versions": <optional_list_of_segment_versions_to_delete_in_this_interval>,
  "context": <task_context>,
  "batchSize": <optional_batch_size>,
  "limit": <optional_maximum_number_of_segments_to_delete>,
  "maxUsedStatusLastUpdatedTime": <optional_maximum_timestamp_when_segments_were_marked_as_unused>
}
```
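
For example, a minimal `kill` task that permanently deletes all unused segments of a hypothetical datasource named `my_table` for the first half of 2016 could look like the following sketch (optional fields omitted; their defaults are described below):

```json
{
  "type": "kill",
  "dataSource": "my_table",
  "interval": "2016-01-01/2016-07-01"
}
```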
Some of the parameters used in the task payload are further explained below:

| Parameter | Default | Explanation |
|-----------|---------|-------------|
| `versions` | null (all versions) | List of segment versions within the specified `interval` for the kill task to delete. The default behavior is to delete all unused segment versions in the specified `interval`. |
| `batchSize` | 100 | Maximum number of segments that are deleted in one kill batch. Some operations on the Overlord may get stuck while a `kill` task is in progress due to concurrency constraints (such as in `TaskLockbox`). Thus, a `kill` task splits the list of unused segments to be deleted into smaller batches to yield the Overlord resources intermittently to other task operations. |
| `limit` | null (no limit) | Maximum number of segments for the kill task to delete. |
| `maxUsedStatusLastUpdatedTime` | null (no cutoff) | Maximum timestamp used as a cutoff to include unused segments. The kill task only considers segments that lie in the specified `interval` and were marked as unused no later than this time. The default behavior is to kill all unused segments in the `interval` regardless of when they were marked as unused. |

**WARNING:** The `kill` task permanently removes all information about the affected segments from the metadata store and
deep storage. This operation cannot be undone.