2022-09-06 13:36:09 -04:00
|
|
|
|
---
|
|
|
|
|
id: index
|
2022-09-17 00:58:11 -04:00
|
|
|
|
title: SQL-based ingestion
|
|
|
|
|
sidebar_label: Overview
|
2022-09-06 13:36:09 -04:00
|
|
|
|
description: Introduces multi-stage query architecture and its task engine
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
<!--
|
|
|
|
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
|
|
|
~ or more contributor license agreements. See the NOTICE file
|
|
|
|
|
~ distributed with this work for additional information
|
|
|
|
|
~ regarding copyright ownership. The ASF licenses this file
|
|
|
|
|
~ to you under the Apache License, Version 2.0 (the
|
|
|
|
|
~ "License"); you may not use this file except in compliance
|
|
|
|
|
~ with the License. You may obtain a copy of the License at
|
|
|
|
|
~
|
|
|
|
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
~
|
|
|
|
|
~ Unless required by applicable law or agreed to in writing,
|
|
|
|
|
~ software distributed under the License is distributed on an
|
|
|
|
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
|
|
|
~ KIND, either express or implied. See the License for the
|
|
|
|
|
~ specific language governing permissions and limitations
|
|
|
|
|
~ under the License.
|
|
|
|
|
-->
|
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
|
|
|
|
|
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
|
|
|
|
|
> ingestion method is right for you.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
Apache Druid supports SQL-based ingestion using the bundled [`druid-multi-stage-query` extension](#load-the-extension).
|
2022-09-21 21:46:06 -04:00
|
|
|
|
This extension adds a [multi-stage query task engine for SQL](concepts.md#multi-stage-query-task-engine) that allows running SQL
|
2022-09-17 13:47:57 -04:00
|
|
|
|
[INSERT](concepts.md#insert) and [REPLACE](concepts.md#replace) statements as batch tasks. As an experimental feature,
|
2023-01-17 11:41:57 -05:00
|
|
|
|
the task engine also supports running `SELECT` queries as batch tasks.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2023-01-17 11:41:57 -05:00
|
|
|
|
Nearly all `SELECT` capabilities are available in the multi-stage query (MSQ) task engine, with certain exceptions listed on the [Known
|
|
|
|
|
issues](./known-issues.md#select-statement) page. This allows great flexibility to apply transformations, filters, JOINs,
|
2022-09-17 13:47:57 -04:00
|
|
|
|
aggregations, and so on as part of `INSERT ... SELECT` and `REPLACE ... SELECT` statements. This also allows in-database
|
2022-09-17 00:58:11 -04:00
|
|
|
|
transformation: creating new tables based on queries of other tables.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
## Vocabulary
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
- **Controller**: An indexing service task of type `query_controller` that manages
|
|
|
|
|
the execution of a query. There is one controller task per query.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
- **Worker**: Indexing service tasks of type `query_worker` that execute a
|
|
|
|
|
query. There can be multiple worker tasks per query. Internally,
|
|
|
|
|
the tasks process items in parallel using their processing pools (up to `druid.processing.numThreads` of execution parallelism
|
|
|
|
|
within a worker task).
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
- **Stage**: A stage of query execution that is parallelized across
|
|
|
|
|
worker tasks. Workers exchange data with each other between stages.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
- **Partition**: A slice of data output by worker tasks. In INSERT or REPLACE
|
|
|
|
|
queries, the partitions of the final stage become Druid segments.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
- **Shuffle**: Workers exchange data between themselves on a per-partition basis in a process called
|
|
|
|
|
shuffling. During a shuffle, each output partition is sorted by a clustering key.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
|
|
|
|
## Load the extension
|
|
|
|
|
|
2022-09-17 00:58:11 -04:00
|
|
|
|
To add the extension to an existing cluster, add `druid-multi-stage-query` to `druid.extensions.loadlist` in your
|
|
|
|
|
`common.runtime.properties` file.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
|
|
|
|
For more information about how to load an extension, see [Loading extensions](../development/extensions.md#loading-extensions).
|
|
|
|
|
|
2023-01-17 11:41:57 -05:00
|
|
|
|
To use [EXTERN](reference.md#extern-function), you need READ permission on the resource named "EXTERNAL" of the resource type
|
|
|
|
|
"EXTERNAL". If you encounter a 403 error when trying to use `EXTERN`, verify that you have the correct permissions.
|
|
|
|
|
The same is true of any of the input-source specific table function such as `S3` or `LOCALFILES`.
|
2022-09-06 13:36:09 -04:00
|
|
|
|
|
|
|
|
|
## Next steps
|
|
|
|
|
|
2022-11-10 10:33:04 -05:00
|
|
|
|
- [Read about key concepts](./concepts.md) to learn more about how SQL-based ingestion and multi-stage queries work.
|
|
|
|
|
- [Check out the examples](./examples.md) to see SQL-based ingestion in action.
|
|
|
|
|
- [Explore the Query view](../operations/web-console.md) to get started in the web console.
|