---
id: reference
title: SQL-based ingestion reference
sidebar_label: Reference
---

:::info
This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
ingestion method is right for you.
:::

## SQL reference

This topic is a reference guide for the multi-stage query architecture in Apache Druid. For examples of real-world
usage, refer to the [Examples](examples.md) page.

`INSERT` and `REPLACE` load data into a Druid datasource from either an external input source or another
datasource. When loading from an external input source, you typically must provide the kind of input source,
the data format, and the schema (signature) of the input file. Druid provides *table functions* that let you
specify the external file. There are two kinds. `EXTERN` works with the JSON-serialized specs for the three
items, using the same JSON you would use in native ingestion. A set of other, input-source-specific functions
use SQL syntax to specify the format and the input schema. There is one function for each input source. The
input-source-specific functions allow you to use SQL query parameters to specify the set of files (or URIs),
making it easy to reuse the same SQL statement for each ingestion: just specify the set of files to use each time.

### `EXTERN` Function

Use the `EXTERN` function to read external data or write to an external location.

#### `EXTERN` as an input source

The function has two variations.

Variation 1, with the input schema expressed as JSON:

```sql
SELECT
  <column>
FROM TABLE(
  EXTERN(
    '<Druid input source>',
    '<Druid input format>',
    '<row signature>'
  )
)
```

`EXTERN` consists of the following parts:

1. Any [Druid input source](../ingestion/input-sources.md) as a JSON-encoded string.
2. Any [Druid input format](../ingestion/data-formats.md) as a JSON-encoded string.
3. A row signature, as a JSON-encoded array of column descriptors.
   Each column descriptor must have a `name` and a `type`. The type can be `string`, `long`, `double`, or `float`.
   This row signature is used to map the external data into the SQL layer.

Variation 2, with the input schema expressed in SQL using an `EXTEND` clause (see the next
section for more detail on `EXTEND`). This format also uses named arguments to make the
SQL a bit easier to read:

```sql
SELECT
  <column>
FROM TABLE(
  EXTERN(
    inputSource => '<Druid input source>',
    inputFormat => '<Druid input format>'
  )) (<columns>)
```

The input source and format are as above. The columns are expressed as in a SQL `CREATE TABLE`.
Example: `(timestamp VARCHAR, metricType VARCHAR, value BIGINT)`. The optional `EXTEND` keyword
can precede the column list: `EXTEND (timestamp VARCHAR...)`.

For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).

#### `EXTERN` to export to a destination

Use `EXTERN` to specify a destination to export data to.
This variation of `EXTERN` requires one argument, the details of the destination as specified below.
It also requires an `AS` clause to specify the format of the exported rows.

Keep the following in mind when using `EXTERN` to export rows:
- Only `INSERT` statements are supported.
- Only `CSV` format is supported as an export format.
- Partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) aren't supported with export statements.
- You can export to Amazon S3 or local storage.
- The destination provided should contain no other files or directories.

When you export data, use the `rowsPerPage` context parameter to restrict the size of exported files.
When the number of rows in the result set exceeds the value of the parameter, Druid splits the output into multiple files.

```sql
INSERT INTO
  EXTERN(<destination function>)
AS CSV
SELECT
  <column>
FROM <table>
```

##### S3

Export results to S3 by passing the function `S3()` as an argument to the `EXTERN` function. Note that this requires the `druid-s3-extensions` extension.
The `S3()` function is a Druid function that configures the connection. Pass arguments to `S3()` as named parameters with the value in single quotes, as in the following example:

```sql
INSERT INTO
  EXTERN(
    S3(bucket => 'your_bucket', prefix => 'prefix/to/files')
  )
AS CSV
SELECT
  <column>
FROM <table>
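
-- Illustrative sketch only: the template above filled in with hypothetical names.
-- The datasource "wikipedia" and the columns "channel" and "page" are assumptions
-- chosen for illustration; substitute your own datasource and columns.
-- Also assumes that s3://your_bucket/prefix/to/files matches an entry in
-- druid.export.storage.s3.allowedExportPaths and that the destination is empty.
INSERT INTO
  EXTERN(
    S3(bucket => 'your_bucket', prefix => 'prefix/to/files')
  )
AS CSV
SELECT
  "channel",
  "page"
FROM "wikipedia"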
```

Supported arguments for the function:

| Parameter | Required | Description | Default |
|-----------|----------|-------------|---------|
| `bucket` | Yes | The S3 bucket to which the files are exported. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a |
| `prefix` | Yes | Path where the exported files are created. The export query expects the destination to be empty. If the location includes other files, the query fails. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a |

The following runtime parameters must be configured to export into an S3 destination:

| Runtime Parameter | Required | Description | Default |
|-------------------|----------|-------------|---------|
| `druid.export.storage.s3.tempLocalDir` | Yes | Directory used on the local storage of the worker to store temporary files required while uploading the data. | n/a |
| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes that are whitelisted as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
| `druid.export.storage.s3.maxRetry` | No | Defines the maximum number of times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
| `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempLocalDir`. The chunk size must be between 5 MiB and 5 GiB. A larger chunk size reduces the number of API calls to S3 but requires more disk space to store the temporary chunks. | 100MiB |

##### LOCAL

You can export to local storage, which writes the results to the filesystem of the MSQ worker.
This is useful in a single-node setup or for testing but is not suitable for production use cases.

Export results to local storage by passing the function `LOCAL()` as an argument to the `EXTERN` function.
To use local storage as an export destination, the runtime property `druid.export.storage.baseDir` must be configured on the Indexer/Middle Manager.
This value must be set to an absolute path on the local machine. Exports are allowed only to paths that match the prefix set by this value.
Pass arguments to `LOCAL()` as named parameters with the value in single quotes, as in the following example:

```sql
INSERT INTO
  EXTERN(
    local(exportPath => 'exportLocation/query1')
  )
AS CSV
SELECT
  <column>
FROM <table>
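
-- Illustrative sketch only, not taken from the Druid documentation.
-- Assumes druid.export.storage.baseDir is set to the hypothetical path
-- '/tmp/druid-export', so exportPath is an absolute path to an empty
-- subdirectory under that prefix. The datasource "wikipedia" and the
-- columns "channel" and "page" are likewise hypothetical.
INSERT INTO
  EXTERN(
    local(exportPath => '/tmp/druid-export/query1')
  )
AS CSV
SELECT
  "channel",
  "page"
FROM "wikipedia"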
```

Supported arguments to the function:

| Parameter | Required | Description | Default |
|-----------|----------|-------------|---------|
| `exportPath` | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` used as the destination to export the results to. The export query expects the destination to be empty. If the location includes other files or directories, the query fails. | n/a |

For more information, see [Write to an external destination with `EXTERN`](concepts.md#write-to-an-external-destination-with-extern).

### `INSERT`

Use the `INSERT` statement to insert data.

Unlike standard SQL, `INSERT` loads data into the target table according to column name, not positionally. If necessary,
use `AS` in your `SELECT` column list to assign the correct names. Do not rely on their positions within the `SELECT`
clause.

Statement format:

```sql
INSERT INTO <table name>
< SELECT query >
PARTITIONED BY