docs: Refresh docs for SQL input source (#17031)

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Victoria Lim 2024-09-16 15:52:37 -07:00 committed by GitHub
parent 9696f0b37c
commit 2e2f3cf66a
5 changed files with 95 additions and 71 deletions


@@ -31,9 +31,9 @@ This module can be used side to side with other lookup module like the global ca

To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-lookups-cached-single` in the extensions load list.

:::info
To use JDBC, you must add your database client JAR files to the extension's directory.
For Postgres, the connector JAR is already included.
See the MySQL extension documentation for instructions to obtain [MySQL](./mysql.md#install-mysql-connectorj) or [MariaDB](./mysql.md#install-mariadb-connectorj) connector libraries.
Copy or symlink the downloaded file to `extensions/druid-lookups-cached-single` under the distribution root directory.
:::


@ -1,6 +1,6 @@
--- ---
id: mysql id: mysql
title: "MySQL Metadata Store" title: "MySQL metadata store"
--- ---
<!-- <!--
@@ -25,41 +25,58 @@ title: "MySQL Metadata Store"

To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `mysql-metadata-storage` in the extensions load list.

With the MySQL extension, you can use MySQL as a metadata store or ingest from a MySQL database.

The extension requires a connector library that's not included with Druid.
See the [Prerequisites](#prerequisites) for installation instructions.

## Prerequisites

To use the MySQL extension, you need to install one of the following libraries:
* [MySQL Connector/J](#install-mysql-connectorj)
* [MariaDB Connector/J](#install-mariadb-connectorj)

### Install MySQL Connector/J

The MySQL extension uses Oracle's MySQL JDBC driver.
The current version of Druid uses version 8.2.0.
Other versions may not work with this extension.

You can download the library from one of the following sources:

- [MySQL website](https://dev.mysql.com/downloads/connector/j/)
  Visit the archives page to access older product versions.
- [Maven Central (direct download)](https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.2.0/mysql-connector-j-8.2.0.jar)
- Your package manager. For example, `libmysql-java` on APT for a Debian-based OS.

The download includes the MySQL connector JAR file with a name like `mysql-connector-j-8.2.0.jar`.
Copy or create a symbolic link to this file inside the `lib` folder in the distribution root directory.

### Install MariaDB Connector/J

This extension also supports using the MariaDB connector jar.
The current version of Druid uses version 2.7.3.
Other versions may not work with this extension.

You can download the library from one of the following sources:

- [MariaDB website](https://mariadb.com/downloads/connectors/connectors-data-access/java8-connector)
  Click **Show All Files** to access older product versions.
- [Maven Central (direct download)](https://repo1.maven.org/maven2/org/mariadb/jdbc/mariadb-java-client/2.7.3/mariadb-java-client-2.7.3.jar)

The download includes the MariaDB connector JAR file with a name like `maria-java-client-2.7.3.jar`.
Copy or create a symbolic link to this file inside the `lib` folder in the distribution root directory.

To configure the `mysql-metadata-storage` extension to use the MariaDB connector library instead of MySQL, set `druid.metadata.mysql.driver.driverClassName=org.mariadb.jdbc.Driver`.

The protocol of the connection string is `jdbc:mysql:` or `jdbc:mariadb:`,
depending on your specific version of the MariaDB client library.
For more information on the parameters to configure a connection,
[see the MariaDB documentation](https://mariadb.com/kb/en/about-mariadb-connector-j/#connection-strings)
for your connector version.
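
For example, a minimal `common.runtime.properties` sketch for using the MariaDB connector with the `mysql-metadata-storage` extension; the host, port, database name, and credentials are illustrative:

```
# Use MySQL-compatible metadata storage with the MariaDB driver (illustrative values)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mariadb://localhost:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=password
druid.metadata.mysql.driver.driverClassName=org.mariadb.jdbc.Driver
```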
## Set up MySQL

To avoid issues with upgrades that require schema changes to a large metadata table, consider a MySQL version that supports instant ADD COLUMN semantics. For example, MySQL 8.

@@ -90,7 +107,7 @@ This extension also supports using MariaDB server, https://mariadb.org/download/

CREATE DATABASE druid DEFAULT CHARACTER SET utf8mb4;
-- create a druid user
CREATE USER 'druid'@'localhost' IDENTIFIED BY 'password';
-- grant the user all the permissions on the database we just created
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'localhost';

@@ -111,10 +128,11 @@ This extension also supports using MariaDB server, https://mariadb.org/download/

If using the MariaDB connector library, set `druid.metadata.mysql.driver.driverClassName=org.mariadb.jdbc.Driver`.

## Encrypt MySQL connections

This extension provides support for encrypting MySQL connections. To get more information about encrypting MySQL connections using TLS/SSL in general, please refer to this [guide](https://dev.mysql.com/doc/refman/5.7/en/using-encrypted-connections.html).

## Configuration properties

|Property|Description|Default|Required|
|--------|-----------|-------|--------|

@@ -129,7 +147,10 @@ If using the MariaDB connector library, set `druid.metadata.mysql.driver.driverC

|`druid.metadata.mysql.ssl.enabledSSLCipherSuites`|Overrides the existing cipher suites with these cipher suites.|none|no|
|`druid.metadata.mysql.ssl.enabledTLSProtocols`|Overrides the TLS protocols with these protocols.|none|no|
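
As an illustrative sketch only, assuming the extension's `druid.metadata.mysql.ssl.useSSL` property (the values shown are examples, not defaults), you might enable TLS in `common.runtime.properties` like this:

```
# Enable TLS for metadata store connections (illustrative values)
druid.metadata.mysql.ssl.useSSL=true
druid.metadata.mysql.ssl.enabledTLSProtocols=["TLSv1.2"]
```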
## MySQL input source

The MySQL extension provides an implementation of an SQL input source to ingest data into Druid from a MySQL database.
For more information on the input source parameters, see [SQL input source](../../ingestion/input-sources.md#sql-input-source).

```json
{


@ -1,6 +1,6 @@
--- ---
id: postgresql id: postgresql
title: "PostgreSQL Metadata Store" title: "PostgreSQL metadata store"
--- ---
<!-- <!--
@@ -25,7 +25,9 @@ title: "PostgreSQL Metadata Store"

To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `postgresql-metadata-storage` in the extensions load list.

With the PostgreSQL extension, you can use PostgreSQL as a metadata store or ingest from a PostgreSQL database.

## Set up PostgreSQL

To avoid issues with upgrades that require schema changes to a large metadata table, consider a PostgreSQL version that supports instant ADD COLUMN semantics.

@@ -69,7 +71,7 @@ To avoid issues with upgrades that require schema changes to a large metadata ta

druid.metadata.storage.connector.password=diurd
```

## Configuration properties

In most cases, the configuration options map directly to the [postgres JDBC connection options](https://jdbc.postgresql.org/documentation/use/#connecting-to-the-database).

@@ -87,9 +89,10 @@ In most cases, the configuration options map directly to the [postgres JDBC conn

| `druid.metadata.postgres.ssl.sslPasswordCallback` | The classname of the SSL password provider. | none | no |
| `druid.metadata.postgres.dbTableSchema` | druid meta table schema | `public` | no |

## PostgreSQL input source

The PostgreSQL extension provides an implementation of an SQL input source to ingest data into Druid from a PostgreSQL database.
For more information on the input source parameters, see [SQL input source](../../ingestion/input-sources.md#sql-input-source).

```json
{


@@ -29,10 +29,8 @@ For general information on native batch indexing and parallel task indexing, see

## S3 input source

:::info Required extension
To use the S3 input source, load the extension [`druid-s3-extensions`](../development/extensions-core/s3.md) in your `common.runtime.properties` file.
:::

The S3 input source reads objects directly from S3. You can specify either:

@@ -41,7 +39,7 @@ The S3 input source reads objects directly from S3. You can specify either:

* a list of S3 location prefixes that attempts to list the contents and ingest
  all objects contained within the locations.

The S3 input source is splittable. Therefore, you can use it with the [parallel task](./native-batch.md). Each worker task of `index_parallel` reads one or multiple objects.

Sample specs:
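
For instance, a minimal `inputSource` object using the URI form might look like this; the bucket and object names are illustrative:

```json
{
  "type": "s3",
  "uris": ["s3://example-bucket/path/file1.json", "s3://example-bucket/path/file2.json"]
}
```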
@@ -219,16 +217,14 @@ If `accessKeyId` and `secretAccessKey` are not given, the default [S3 credential

## Google Cloud Storage input source

:::info Required extension
To use the Google Cloud Storage input source, load the extension [`druid-google-extensions`](../development/extensions-core/google.md) in your `common.runtime.properties` file.
:::

The Google Cloud Storage input source reads objects directly from Google Cloud Storage.
You can specify objects as a list of Google Cloud Storage URI strings.
The Google Cloud Storage input source is splittable and can be used by the [parallel task](./native-batch.md), where each worker task of `index_parallel` reads one or multiple objects.

Sample specs:
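
A minimal `inputSource` sketch, with an illustrative bucket and object path:

```json
{
  "type": "google",
  "uris": ["gs://example-bucket/path/file1.json"]
}
```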
@@ -307,14 +303,12 @@ Google Cloud Storage object:

## Azure input source

:::info Required extension
To use the Azure input source, load the extension [`druid-azure-extensions`](../development/extensions-core/azure.md) in your `common.runtime.properties` file.
:::

The Azure input source (type `azureStorage`) reads objects directly from Azure Blob store or Azure Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the Azure input source for use with [parallel task](./native-batch.md) indexing, where each worker task reads one chunk of the split data.

The `azureStorage` input source is a new schema for Azure input sources that allows you to specify which storage account to ingest files from. We recommend that you update any specs that use the old `azure` schema to use the new `azureStorage` schema. The new schema provides more functionality than the older `azure` schema.
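
For instance, a minimal `azureStorage` sketch; the storage account, container, and object path are illustrative:

```json
{
  "type": "azureStorage",
  "uris": ["azureStorage://examplestorageaccount/container/path/file1.json"]
}
```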
@@ -491,15 +485,13 @@ The `objects` property is:

## HDFS input source

:::info Required extension
To use the HDFS input source, load the extension [`druid-hdfs-storage`](../development/extensions-core/hdfs.md) in your `common.runtime.properties` file.
:::

The HDFS input source reads files directly from HDFS storage. File paths can be specified as an HDFS URI string or a list
of HDFS URI strings. The HDFS input source is splittable and can be used by the [parallel task](./native-batch.md),
where each worker task of `index_parallel` reads one or multiple files.

Sample specs:
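
A minimal sketch, assuming an illustrative NameNode host and path:

```json
{
  "type": "hdfs",
  "paths": "hdfs://namenode_host:8020/example/path/*"
}
```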
@@ -593,7 +585,7 @@ The `http` input source is not limited to the HTTP or HTTPS protocols. It uses t

For more information about security best practices, see [Security overview](../operations/security-overview.md#best-practices).

The HTTP input source is _splittable_ and can be used by the [parallel task](./native-batch.md),
where each worker task of `index_parallel` will read only one file. This input source does not support Split Hint Spec.

Sample specs:
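
For example, a minimal `inputSource` object with an illustrative URI:

```json
{
  "type": "http",
  "uris": ["https://example.com/data/wikipedia.json.gz"]
}
```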
@@ -701,7 +693,7 @@ Sample spec:

The Local input source reads files directly from local storage,
and is mainly intended for proof-of-concept testing.
The Local input source is _splittable_ and can be used by the [parallel task](./native-batch.md),
where each worker task of `index_parallel` will read one or multiple files.

Sample spec:
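
A minimal sketch, with an illustrative base directory and filter:

```json
{
  "type": "local",
  "baseDir": "/data/example",
  "filter": "*.json"
}
```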
@@ -736,7 +728,7 @@ Sample spec:

The Druid input source reads data directly from existing Druid segments,
potentially using a new schema and changing the name, dimensions, metrics, rollup, etc. of the segment.
The Druid input source is _splittable_ and can be used by the [parallel task](./native-batch.md).
This input source has a fixed input format for reading from Druid segments;
no `inputFormat` field needs to be specified in the ingestion spec when using this input source.
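
For example, a minimal sketch that reads one day of an illustrative `wikipedia` datasource:

```json
{
  "type": "druid",
  "dataSource": "wikipedia",
  "interval": "2013-01-01/2013-01-02"
}
```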
@@ -833,17 +825,29 @@ For more information on the `maxNumConcurrentSubTasks` field, see [Implementatio

## SQL input source

:::info Required extension
To use the SQL input source, you must load the appropriate extension in your `common.runtime.properties` file.
* To connect to MySQL, load the extension [`mysql-metadata-storage`](../development/extensions-core/mysql.md).
* To connect to PostgreSQL, load the extension [`postgresql-metadata-storage`](../development/extensions-core/postgresql.md).

The MySQL extension requires a JDBC driver.
For more information, see [Install MySQL Connector/J](../development/extensions-core/mysql.md#install-mysql-connectorj).
:::

The SQL input source is used to read data directly from a relational database (RDBMS).
You can _split_ the ingestion tasks for a SQL input source. When you use the [parallel task](./native-batch.md) type, each worker task reads from one SQL query in the list of queries.
This input source does not support Split Hint Spec.

The SQL input source has a fixed input format for reading events.
Don't specify `inputFormat` when using this input source.
Refer to the [recommended practices](#recommended-practices) before using this input source.

|Property|Description|Required|
|--------|-----------|---------|
|type|Set the value to `sql`.|Yes|
|database|Specifies the database connection details. The database type corresponds to the extension that supplies the `connectorConfig` support.<br/><br/>You can selectively allow JDBC properties in `connectURI`. See [JDBC connections security config](../configuration/index.md#jdbc-connections-to-external-databases) for more details.|Yes|
|foldCase|Boolean to toggle case folding of database column names. For example, to ingest a database column named `Entry_Date` as `entry_date`, set `foldCase` to true and include `entry_date` in the [`dimensionsSpec`](ingestion-spec.md#dimensionsspec).|No|
|sqls|List of SQL queries where each SQL query retrieves the data to be indexed.|Yes|

The following is an example of an SQL input source spec:
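
A minimal sketch of such a spec, with illustrative connection details and query:

```json
{
  "type": "sql",
  "database": {
    "type": "mysql",
    "connectorConfig": {
      "connectURI": "jdbc:mysql://host:port/schema",
      "user": "user",
      "password": "password"
    }
  },
  "foldCase": true,
  "sqls": ["SELECT timestamp, page, added FROM wikipedia WHERE added > 0"]
}
```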
@@ -887,7 +891,7 @@ Compared to the other native batch input sources, SQL input source behaves diffe

The Combining input source lets you read data from multiple input sources.
It identifies the splits from delegate input sources and uses a worker task to process each split.
Each delegate input source must be splittable and compatible with the [parallel task type](./native-batch.md).

Similar to other input sources, the Combining input source supports a single `inputFormat`.
Delegate input sources that require an `inputFormat` must have the same format for input data.
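
For instance, a minimal sketch that combines illustrative `local` and `druid` delegates:

```json
{
  "type": "combining",
  "delegates": [
    {
      "type": "local",
      "baseDir": "/data/example",
      "filter": "*.json"
    },
    {
      "type": "druid",
      "dataSource": "wikipedia",
      "interval": "2013-01-01/2013-01-02"
    }
  ]
}
```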
@@ -931,10 +935,8 @@ The following is an example of a Combining input source spec:

## Iceberg input source

:::info Required extension
To use the Iceberg input source, load the extension [`druid-iceberg-extensions`](../development/extensions-contrib/iceberg.md) in your `common.runtime.properties` file.
:::

You use the Iceberg input source to read data stored in the Iceberg table format. For a given table, the input source scans up to the latest Iceberg snapshot from the configured Hive catalog. Druid ingests the underlying live data files using the existing input source formats.

@@ -1138,10 +1140,8 @@ This input source provides the following filters: `and`, `equals`, `interval`, a

## Delta Lake input source

:::info Required extension
To use the Delta Lake input source, load the extension [`druid-deltalake-extensions`](../development/extensions-contrib/delta-lake.md) in your `common.runtime.properties` file.
:::

You can use the Delta input source to read data stored in a Delta Lake table. For a given table, the input source scans


@@ -377,7 +377,7 @@ The JDBC lookups will poll a database to populate its local cache. If the `tsCol

:::info
If using JDBC, you will need to add your database's client JAR files to the extension's directory.
For Postgres, the connector JAR is already included.
See the MySQL extension documentation for instructions to obtain [MySQL](../development/extensions-core/mysql.md#install-mysql-connectorj) or [MariaDB](../development/extensions-core/mysql.md#install-mariadb-connectorj) connector libraries.
The connector JAR should reside in the classpath of Druid's main class loader.
To add the connector JAR to the classpath, you can copy the downloaded file to `lib/` under the distribution root directory. Alternatively, create a symbolic link to the connector in the `lib` directory.
:::