* Kill tasks should honor the buffer period of unused segments.
  - The coordinator duty KillUnusedSegments determines an umbrella interval for each datasource to determine the kill interval. There can be multiple unused segments in an umbrella interval with different used_status_last_updated timestamps. For example, consider an unused segment that is 30 days old and one that is 1 hour old. Currently the kill task after the 30-day mark would kill both the unused segments and not retain the 1-hour old one.
  - However, when a kill task is instantiated with this umbrella interval, it’d kill all the unused segments regardless of the last updated timestamp. We need kill tasks and RetrieveUnusedSegmentsAction to honor the bufferPeriod to avoid killing unused segments in the kill interval prematurely.
* Clarify default behavior in docs.
* test comments
* fix canDutyRun()
* small updates.
* checkstyle
* forbidden api fix
* doc fix, unused import, codeql scan error, and cleanup logs.
* Address review comments
* Rename maxUsedFlagLastUpdatedTime to maxUsedStatusLastUpdatedTime
  This is consistent with the column name `used_status_last_updated`.
* Apply suggestions from code review
  Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Make period Duration type
* Remove older variants of runKilLTask() in OverlordClient interface
* Test can now run without waiting for canDutyRun().
* Remove previous variants of retrieveUnusedSegments from internal metadata storage coordinator interface.
  Removes the following interface methods in favor of a new method added:
  - retrieveUnusedSegmentsForInterval(String, Interval)
  - retrieveUnusedSegmentsForInterval(String, Interval, Integer)
* Chain stream operations
* cleanup
* Pass in the lastUpdatedTime to markUnused test function and remove sleep.

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
id | title |
---|---|
delete | Data deletion |
By time range, manually
Apache Druid stores data partitioned by time chunk and supports deleting data for time chunks by dropping segments. This is a fast, metadata-only operation.
Deletion by time range happens in two steps:
- Segments to be deleted must first be marked as "unused". This can happen when a segment is dropped by a drop rule or when you manually mark a segment unused through the Coordinator API or web console. This is a soft delete: the data is not available for querying, but the segment files remain in deep storage, and the segment records remain in the metadata store.
- Once a segment is marked "unused", you can use a `kill` task to permanently delete the segment file from deep storage and remove its record from the metadata store. This is a hard delete: the data is unrecoverable unless you have a backup.
For documentation on disabling segments using the Coordinator API, see the Legacy metadata API reference.
A data deletion tutorial is available at Tutorial: Deleting data.
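The two steps can be sketched as a pair of HTTP requests. This is a minimal illustration, not a definitive client: the host and port values, the datasource name `wikipedia`, and the helper names are hypothetical, and the endpoint paths (`markUnused` on the Coordinator, task submission on the Overlord) are written from memory of the Druid API reference and should be checked against it. The requests are only built here, not sent.

```python
import json
from urllib.request import Request

# Assumed addresses; adjust to your cluster.
COORDINATOR = "http://localhost:8081"
OVERLORD = "http://localhost:8090"


def mark_unused_request(datasource: str, interval: str) -> Request:
    """Step 1 (soft delete): mark all segments in the interval as unused."""
    body = json.dumps({"interval": interval}).encode("utf-8")
    return Request(
        url=f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/markUnused",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def kill_task_request(datasource: str, interval: str) -> Request:
    """Step 2 (hard delete): submit a kill task for the same interval."""
    task = {"type": "kill", "dataSource": datasource, "interval": interval}
    return Request(
        url=f"{OVERLORD}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


soft = mark_unused_request("wikipedia", "2024-01-01/2024-02-01")
hard = kill_task_request("wikipedia", "2024-01-01/2024-02-01")
print(soft.get_method(), soft.full_url)
print(hard.get_method(), hard.full_url)
# To actually send a request: urllib.request.urlopen(soft)
```

Keeping the two requests separate mirrors the soft-delete and hard-delete split: nothing is removed from deep storage until the second request is sent.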
By time range, automatically
Druid supports load and drop rules, which are used to define intervals of time where data should be preserved, and intervals where data should be discarded. Data that falls under a drop rule is marked unused, in the same manner as if you manually mark that time range unused. This is a fast, metadata-only operation.
Data that is dropped in this way is marked unused, but remains in deep storage. To permanently delete it, use a `kill` task.
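As a rough illustration of automatic dropping, the sketch below builds a retention rule list that keeps recent data and drops everything older. The rule types `loadByPeriod` and `dropForever` and the `_default_tier` tier name are taken from my understanding of Druid's retention rules; the 30-day period, the replica count, and the helper name are assumptions chosen for the example.

```python
import json


def retention_rules(keep_period: str = "P30D", replicas: int = 2) -> list:
    """Rules are evaluated in order; the first rule that matches a segment wins."""
    return [
        # Keep (load) segments that fall within the most recent period.
        {
            "type": "loadByPeriod",
            "period": keep_period,
            "tieredReplicants": {"_default_tier": replicas},
        },
        # Everything that did not match above is dropped, i.e. marked unused.
        {"type": "dropForever"},
    ]


print(json.dumps(retention_rules(), indent=2))
```

Segments dropped by the final rule are only soft-deleted; their files stay in deep storage until a `kill` task removes them.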
Specific records
Druid supports deleting specific records using reindexing with a filter. The filter specifies which data remains after reindexing, so it must be the inverse of the data you want to delete. Because segments must be rewritten to delete data in this way, it can be a time-consuming operation.
For example, to delete records where `userName` is `'bob'` with native batch indexing, use a `transformSpec` with filter `{"type": "not", "field": {"type": "selector", "dimension": "userName", "value": "bob"}}`.
To delete the same records using SQL, use REPLACE with `WHERE userName <> 'bob'`.
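To make the "inverse filter" idea concrete, the sketch below evaluates the filter from the example against a few in-memory rows. It is a toy evaluator written for illustration only, not Druid's filter implementation: it handles just the `selector` and `not` types, ignores Druid's null-handling rules, and the sample rows are made up.

```python
def matches(filter_spec: dict, row: dict) -> bool:
    """Return True if the row passes the filter, i.e. survives reindexing."""
    kind = filter_spec["type"]
    if kind == "selector":
        return row.get(filter_spec["dimension"]) == filter_spec["value"]
    if kind == "not":
        return not matches(filter_spec["field"], row)
    raise ValueError(f"filter type not supported in this sketch: {kind}")


# The filter describes what to KEEP, so it negates the rows to delete.
keep_filter = {
    "type": "not",
    "field": {"type": "selector", "dimension": "userName", "value": "bob"},
}

rows = [
    {"userName": "alice", "clicks": 3},
    {"userName": "bob", "clicks": 5},
    {"userName": "carol", "clicks": 1},
]

remaining = [row for row in rows if matches(keep_filter, row)]
print([row["userName"] for row in remaining])  # ['alice', 'carol']
```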
To reindex using native batch, use the `druid` input source. If needed, `transformSpec` can be used to filter or modify data during the reindexing job. To reindex with SQL, use `REPLACE <table> OVERWRITE` with `SELECT ... FROM <table>`. (Druid does not have `UPDATE` or `ALTER TABLE` statements.) Any SQL SELECT query can be used to filter, modify, or enrich the data during the reindexing job.
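A SQL reindexing statement of this shape could be assembled as follows. This is a sketch under assumptions: the table name `events`, the `OVERWRITE ALL` scope, and the `PARTITIONED BY DAY` granularity are illustrative choices (the granularity should match how the table is actually partitioned), and the helper name is hypothetical. The JSON wrapper with a `query` field reflects my understanding of how SQL-based ingestion tasks are submitted; verify it against the SQL-based ingestion API reference.

```python
import json


def replace_keeping(table: str, keep_condition: str, partitioned_by: str = "DAY") -> str:
    """Build a REPLACE statement that rewrites the table with only the rows to keep."""
    return (
        f'REPLACE INTO "{table}" OVERWRITE ALL\n'
        f'SELECT * FROM "{table}"\n'
        f"WHERE {keep_condition}\n"
        f"PARTITIONED BY {partitioned_by}"
    )


sql = replace_keeping("events", "userName <> 'bob'")
payload = json.dumps({"query": sql})
print(sql)
```

As with the native batch filter, the `WHERE` clause states which rows remain, so it is the negation of the rows to delete.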
Data that is deleted in this way is marked unused, but remains in deep storage. To permanently delete it, use a `kill` task.
Entire table
Deleting an entire table works the same way as deleting part of a table by time range. First, mark all segments unused using the Coordinator API or web console. Then, optionally, delete it permanently using a `kill` task.
Permanently (`kill` task)
Data that has been overwritten or soft-deleted still remains as segments that have been marked unused. You can use a `kill` task to permanently delete this data.
The available grammar is:
{
"type": "kill",
"id": <task_id>,
"dataSource": <task_datasource>,
"interval" : <all_unused_segments_in_this_interval_will_die!>,
"context": <task_context>,
"batchSize": <optional_batch_size>,
"limit": <optional_maximum_number_of_segments_to_delete>,
"maxUsedStatusLastUpdatedTime": <optional_maximum_timestamp_when_segments_were_marked_as_unused>
}
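A concrete payload following this grammar might be built as below. This is a sketch, not a reference client: the helper name, the datasource `wikipedia`, and the idea of deriving `maxUsedStatusLastUpdatedTime` by subtracting a buffer period from the current time are assumptions for illustration, as is the exact ISO 8601 timestamp formatting.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def kill_task_payload(
    datasource: str,
    interval: str,
    buffer_period: Optional[timedelta] = None,
    batch_size: Optional[int] = None,
    limit: Optional[int] = None,
    now: Optional[datetime] = None,
) -> dict:
    """Build a kill task payload; optional fields are omitted when not set."""
    payload = {"type": "kill", "dataSource": datasource, "interval": interval}
    if batch_size is not None:
        payload["batchSize"] = batch_size
    if limit is not None:
        payload["limit"] = limit
    if buffer_period is not None:
        # Only segments marked unused at or before this cutoff are eligible.
        current = now or datetime.now(timezone.utc)
        cutoff = current - buffer_period
        payload["maxUsedStatusLastUpdatedTime"] = cutoff.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return payload


example = kill_task_payload(
    "wikipedia",
    "2023-01-01/2024-01-01",
    buffer_period=timedelta(days=30),
    limit=1000,
    now=datetime(2024, 1, 31, tzinfo=timezone.utc),
)
print(json.dumps(example, indent=2))
```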
Some of the parameters used in the task payload are further explained below:
Parameter | Default | Explanation |
---|---|---|
`batchSize` | 100 | Maximum number of segments that are deleted in one kill batch. Some operations on the Overlord may get stuck while a kill task is in progress due to concurrency constraints (such as in `TaskLockbox`). Thus, a kill task splits the list of unused segments to be deleted into smaller batches to yield the Overlord resources intermittently to other task operations. |
`limit` | null (no limit) | Maximum number of segments for the kill task to delete. |
`maxUsedStatusLastUpdatedTime` | null (no cutoff) | Maximum timestamp used as a cutoff to include unused segments. The kill task only considers segments which lie in the specified interval and were marked as unused no later than this time. The default behavior is to kill all unused segments in the interval regardless of when they were marked as unused. |
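The interaction of these three parameters can be illustrated with a small in-memory simulation. This is not Druid's implementation: the `UnusedSegment` class, the helper names, and the sample data are hypothetical, and "lie in the specified interval" is read here as full containment, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class UnusedSegment:
    segment_id: str
    start: datetime
    end: datetime
    used_status_last_updated: datetime  # when the segment was marked unused


def select_for_kill(segments, interval_start, interval_end,
                    max_updated: Optional[datetime] = None,
                    limit: Optional[int] = None) -> List[UnusedSegment]:
    """Pick segments inside the interval that were marked unused at or before the cutoff."""
    picked = [
        s for s in segments
        if s.start >= interval_start and s.end <= interval_end
        and (max_updated is None or s.used_status_last_updated <= max_updated)
    ]
    return picked if limit is None else picked[:limit]


def batches(items: list, batch_size: int = 100) -> List[list]:
    """Split the selected segments into batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


utc = timezone.utc
now = datetime(2024, 1, 31, 12, 0, tzinfo=utc)
segments = [
    UnusedSegment("old", datetime(2023, 6, 1, tzinfo=utc), datetime(2023, 6, 2, tzinfo=utc),
                  now - timedelta(days=30)),
    UnusedSegment("fresh", datetime(2023, 6, 2, tzinfo=utc), datetime(2023, 6, 3, tzinfo=utc),
                  now - timedelta(hours=1)),
]
start, end = datetime(2023, 6, 1, tzinfo=utc), datetime(2023, 7, 1, tzinfo=utc)

no_cutoff = select_for_kill(segments, start, end)
with_cutoff = select_for_kill(segments, start, end, max_updated=now - timedelta(days=1))
print([s.segment_id for s in no_cutoff])    # ['old', 'fresh']
print([s.segment_id for s in with_cutoff])  # ['old']
```

With no cutoff, both segments are eligible; with a one-day cutoff, the segment marked unused an hour ago is retained.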
WARNING: The `kill` task permanently removes all information about the affected segments from the metadata store and deep storage. This operation cannot be undone.