mirror of https://github.com/apache/druid.git
248 lines
11 KiB
Markdown
248 lines
11 KiB
Markdown
---
|
|
id: tutorial-update-data
|
|
title: Update data
|
|
sidebar_label: Update data
|
|
description: Learn how to update data in Apache Druid.
|
|
---
|
|
|
|
<!--
|
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
~ or more contributor license agreements. See the NOTICE file
|
|
~ distributed with this work for additional information
|
|
~ regarding copyright ownership. The ASF licenses this file
|
|
~ to you under the Apache License, Version 2.0 (the
|
|
~ "License"); you may not use this file except in compliance
|
|
~ with the License. You may obtain a copy of the License at
|
|
~
|
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
~
|
|
~ Unless required by applicable law or agreed to in writing,
|
|
~ software distributed under the License is distributed on an
|
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
~ KIND, either express or implied. See the License for the
|
|
~ specific language governing permissions and limitations
|
|
~ under the License.
|
|
-->
|
|
|
|
Apache Druid stores data and indexes in [segment files](../design/segments.md) partitioned by time.
|
|
After Druid creates a segment, its contents can't be modified.
|
|
You can either replace data for the whole segment, or, in some cases, overshadow a portion of the segment data.
|
|
|
|
In Druid, use time ranges to specify the data you want to update, as opposed to a primary key or dimensions often used in transactional databases. Data outside the specified replacement time range remains unaffected.
|
|
You can use this Druid functionality to perform data updates, inserts, and deletes, similar to UPSERT functionality for transactional databases.
|
|
|
|
This tutorial shows you how to use the Druid SQL [REPLACE](../multi-stage-query/reference.md#replace) function with the OVERWRITE clause to update existing data.
|
|
|
|
The tutorial walks you through the following use cases:
|
|
|
|
* [Overwrite all data](#overwrite-all-data)
|
|
* [Overwrite records for a specific time range](#overwrite-records-for-a-specific-time-range)
|
|
* [Update a row using partial segment overshadowing](#update-a-row-using-partial-segment-overshadowing)
|
|
|
|
All examples use the [multi-stage query (MSQ)](../multi-stage-query/index.md) task engine to executes SQL statements.
|
|
|
|
## Prerequisites
|
|
|
|
Before you follow the steps in this tutorial, download Druid as described in [Quickstart (local)](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster.
|
|
|
|
You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first.
|
|
|
|
## Load sample data
|
|
|
|
Load a sample dataset using [REPLACE](../multi-stage-query/reference.md#replace) and [EXTERN](../multi-stage-query/reference.md#extern-function) functions.
|
|
In Druid SQL, the REPLACE function can create a new [datasource](../design/storage.md) or update an existing datasource.
|
|
|
|
In the Druid [web console](../operations/web-console.md), go to the **Query** view and run the following query:
|
|
|
|
```sql
|
|
REPLACE INTO "update_tutorial" OVERWRITE ALL
|
|
WITH "ext" AS (
|
|
SELECT *
|
|
FROM TABLE(
|
|
EXTERN(
|
|
'{"type":"inline","data":"{\"timestamp\":\"2024-01-01T07:01:35Z\",\"animal\":\"octopus\", \"number\":115}\n{\"timestamp\":\"2024-01-01T05:01:35Z\",\"animal\":\"mongoose\", \"number\":737}\n{\"timestamp\":\"2024-01-01T06:01:35Z\",\"animal\":\"snake\", \"number\":1234}\n{\"timestamp\":\"2024-01-01T01:01:35Z\",\"animal\":\"lion\", \"number\":300}\n{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"seahorse\", \"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"skunk\", \"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"iguana\", \"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"opossum\", \"number\":300}"}',
|
|
'{"type":"json"}'
|
|
)
|
|
) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT)
|
|
)
|
|
SELECT
|
|
TIME_PARSE("timestamp") AS "__time",
|
|
"animal",
|
|
"number"
|
|
FROM "ext"
|
|
PARTITIONED BY DAY
|
|
|
|
```
|
|
|
|
In the resulting `update_tutorial` datasource, individual rows are uniquely identified by `__time`, `animal`, and `number`.
|
|
To view the results, open a new tab and run the following query:
|
|
|
|
```sql
|
|
SELECT * FROM "update_tutorial"
|
|
```
|
|
|
|
<details>
|
|
<summary> View the results</summary>
|
|
|
|
| `__time` | `animal` | `number`|
|
|
| -- | -- | -- |
|
|
| `2024-01-01T01:01:35.000Z`| `lion`| 300 |
|
|
| `2024-01-01T05:01:35.000Z`| `mongoose`| 737 |
|
|
| `2024-01-01T06:01:35.000Z`| `snake`| 1234 |
|
|
| `2024-01-01T07:01:35.000Z`| `octopus`| 115 |
|
|
| `2024-01-02T01:01:35.000Z`| `opossum`| 300 |
|
|
| `2024-01-02T05:01:35.000Z`| `skunk`| 737 |
|
|
| `2024-01-02T06:01:35.000Z`| `iguana`| 1234 |
|
|
| `2024-01-02T07:01:35.000Z`| `seahorse`| 115 |
|
|
|
|
</details>
|
|
|
|
The results contain records for eight animals over two days.
|
|
|
|
## Overwrite all data
|
|
|
|
You can use the REPLACE function with OVERWRITE ALL to replace the entire datasource with new data while dropping the old data.
|
|
|
|
In the web console, open a new tab and run the following query to overwrite timestamp data for the entire `update_tutorial` datasource:
|
|
|
|
```sql
|
|
REPLACE INTO "update_tutorial" OVERWRITE ALL
|
|
WITH "ext" AS (SELECT *
|
|
FROM TABLE(
|
|
EXTERN(
|
|
'{"type":"inline","data":"{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"octopus\", \"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"mongoose\", \"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"snake\", \"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"lion\", \"number\":300}\n{\"timestamp\":\"2024-01-03T07:01:35Z\",\"animal\":\"seahorse\", \"number\":115}\n{\"timestamp\":\"2024-01-03T05:01:35Z\",\"animal\":\"skunk\", \"number\":737}\n{\"timestamp\":\"2024-01-03T06:01:35Z\",\"animal\":\"iguana\", \"number\":1234}\n{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"opossum\", \"number\":300}"}',
|
|
'{"type":"json"}'
|
|
)
|
|
) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT))
|
|
SELECT
|
|
TIME_PARSE("timestamp") AS "__time",
|
|
"animal",
|
|
"number"
|
|
FROM "ext"
|
|
PARTITIONED BY DAY
|
|
```
|
|
|
|
<details>
|
|
<summary> View the results</summary>
|
|
|
|
| `__time` | `animal` | `number`|
|
|
| -- | -- | -- |
|
|
| `2024-01-02T01:01:35.000Z`| `lion`| 300 |
|
|
| `2024-01-02T05:01:35.000Z`| `mongoose`| 737 |
|
|
| `2024-01-02T06:01:35.000Z`| `snake`| 1234 |
|
|
| `2024-01-02T07:01:35.000Z`| `octopus`| 115 |
|
|
| `2024-01-03T01:01:35.000Z`| `opossum`| 300 |
|
|
| `2024-01-03T05:01:35.000Z`| `skunk`| 737 |
|
|
| `2024-01-03T06:01:35.000Z`| `iguana`| 1234 |
|
|
| `2024-01-03T07:01:35.000Z`| `seahorse`| 115 |
|
|
|
|
</details>
|
|
|
|
Note that the values in the `__time` column have changed to one day later.
|
|
|
|
## Overwrite records for a specific time range
|
|
|
|
You can use the REPLACE function to overwrite a specific time range of a datasource. When you overwrite a specific time range, that time range must align with the granularity specified in the PARTITIONED BY clause.
|
|
|
|
In the web console, open a new tab and run the following query to insert a new row and update specific rows. Note that the OVERWRITE WHERE clause tells the query to only update records for the date 2024-01-03.
|
|
|
|
```sql
|
|
REPLACE INTO "update_tutorial"
|
|
OVERWRITE WHERE "__time" >= TIMESTAMP'2024-01-03 00:00:00' AND "__time" < TIMESTAMP'2024-01-04 00:00:00'
|
|
WITH "ext" AS (SELECT *
|
|
FROM TABLE(
|
|
EXTERN(
|
|
'{"type":"inline","data":"{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"tiger\", \"number\":300}\n{\"timestamp\":\"2024-01-03T07:01:35Z\",\"animal\":\"seahorse\", \"number\":500}\n{\"timestamp\":\"2024-01-03T05:01:35Z\",\"animal\":\"polecat\", \"number\":626}\n{\"timestamp\":\"2024-01-03T06:01:35Z\",\"animal\":\"iguana\", \"number\":300}\n{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"flamingo\", \"number\":999}"}',
|
|
'{"type":"json"}'
|
|
)
|
|
) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT))
|
|
SELECT
|
|
TIME_PARSE("timestamp") AS "__time",
|
|
"animal",
|
|
"number"
|
|
FROM "ext"
|
|
PARTITIONED BY DAY
|
|
```
|
|
|
|
<details>
|
|
<summary> View the results</summary>
|
|
|
|
| `__time` | `animal` | `number`|
|
|
| -- | -- | -- |
|
|
| `2024-01-02T01:01:35.000Z`| `lion`| 300 |
|
|
| `2024-01-02T05:01:35.000Z`| `mongoose`| 737 |
|
|
| `2024-01-02T06:01:35.000Z`| `snake`| 1234 |
|
|
| `2024-01-02T07:01:35.000Z`| `octopus`| 115 |
|
|
| `2024-01-03T01:01:35.000Z`| `flamingo`| 999 |
|
|
| `2024-01-03T01:01:35.000Z`| `tiger`| 300 |
|
|
| `2024-01-03T05:01:35.000Z`| `polecat`| 626 |
|
|
| `2024-01-03T06:01:35.000Z`| `iguana`| 300 |
|
|
| `2024-01-03T07:01:35.000Z`| `seahorse`| 500 |
|
|
|
|
</details>
|
|
|
|
Note the changes in the resulting datasource:
|
|
|
|
* There is now a new row called `flamingo`.
|
|
* The `opossum` row has the value `tiger`.
|
|
* The `skunk` row has the value `polecat`.
|
|
* The `iguana` and `seahorse` rows have different numbers.
|
|
|
|
## Update a row using partial segment overshadowing
|
|
|
|
In Druid, you can overlay older data with newer data for the entire segment or portions of the segment within a particular partition.
|
|
This capability is called [overshadowing](../ingestion/tasks.md#overshadowing-between-segments).
|
|
|
|
You can use partial overshadowing to update a single row by adding a smaller time granularity segment on top of the existing data.
|
|
It's a less common variation on a more common approach where you replace the entire time chunk.
|
|
|
|
The following example demonstrates how update data using partial overshadowing with mixed segment granularity.
|
|
Note the following important points about the example:
|
|
|
|
* The query updates a single record for a specific `number` row.
|
|
* The original datasource uses DAY segment granularity.
|
|
* The new data segment is at HOUR granularity and represents a time range that's smaller than the existing data.
|
|
* The OVERWRITE WHERE and WHERE TIME_IN_INTERVAL clauses specify the destination where the update occurs and the source of the update, respectively.
|
|
* The query replaces everything within the specified interval. To update only a subset of data in that interval, you have to carry forward all records, changing only what you want to change. You can accomplish that by using the [CASE](../querying/sql-functions.md#case) function in the SELECT list.
|
|
|
|
```sql
|
|
REPLACE INTO "update_tutorial"
|
|
OVERWRITE
|
|
WHERE "__time" >= TIMESTAMP'2024-01-03 05:00:00' AND "__time" < TIMESTAMP'2024-01-03 06:00:00'
|
|
SELECT
|
|
"__time",
|
|
"animal",
|
|
CAST(486 AS BIGINT) AS "number"
|
|
FROM "update_tutorial"
|
|
WHERE TIME_IN_INTERVAL("__time", '2024-01-03T05:01:35Z/PT1S')
|
|
PARTITIONED BY FLOOR(__time TO HOUR)
|
|
```
|
|
|
|
<details>
|
|
<summary> View the results</summary>
|
|
|
|
| `__time` | `animal` | `number`|
|
|
| -- | -- | -- |
|
|
| `2024-01-02T01:01:35.000Z`| `lion`| 300 |
|
|
| `2024-01-02T05:01:35.000Z`| `mongoose`| 737 |
|
|
| `2024-01-02T06:01:35.000Z`| `snake`| 1234 |
|
|
| `2024-01-02T07:01:35.000Z`| `octopus`| 115 |
|
|
| `2024-01-03T01:01:35.000Z`| `flamingo`| 999 |
|
|
| `2024-01-03T01:01:35.000Z`| `tiger`| 300 |
|
|
| `2024-01-03T05:01:35.000Z`| `polecat`| 486 |
|
|
| `2024-01-03T06:01:35.000Z`| `iguana`| 300 |
|
|
| `2024-01-03T07:01:35.000Z`| `seahorse`| 500 |
|
|
|
|
</details>
|
|
|
|
Note that the `number` for `polecat` has changed from 626 to 486.
|
|
|
|
When you perform partial segment overshadowing multiple times, you can create segment fragmentation that could affect query performance. Use [compaction](../data-management/compaction.md) to correct any fragmentation.
|
|
|
|
## Learn more
|
|
|
|
See the following topics for more information:
|
|
|
|
* [Data updates](../data-management/update.md) for an overview of updating data in Druid.
|
|
* [Load files with SQL-based ingestion](../tutorials/tutorial-msq-extern.md) for generating a query that references externally hosted data.
|
|
* [Overwrite data with REPLACE](../multi-stage-query/concepts.md#overwrite-data-with-replace) for details on how the MSQ task engine executes SQL REPLACE queries. |