mirror of https://github.com/apache/druid.git
docs: Update Azure extension (#16585)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
parent b20c3dbadf
commit ae70e18bc8
@@ -22,7 +22,6 @@ title: "Extensions"
~ under the License.
-->

Druid implements an extension system that allows for adding functionality at runtime. Extensions
are commonly used to add support for deep storages (like HDFS and S3), metadata stores (like MySQL
and PostgreSQL), new aggregators, new input formats, and so on.
@@ -55,7 +54,7 @@ Core extensions are maintained by Druid committers.
|druid-parquet-extensions|Support for data in Apache Parquet data format. Requires druid-avro-extensions to be loaded.|[link](../development/extensions-core/parquet.md)|
|druid-protobuf-extensions| Support for data in Protobuf data format.|[link](../development/extensions-core/protobuf.md)|
|druid-ranger-security|Support for access control through Apache Ranger.|[link](../development/extensions-core/druid-ranger-security.md)|
-|druid-s3-extensions|Interfacing with data in AWS S3, and using S3 as deep storage.|[link](../development/extensions-core/s3.md)|
+|druid-s3-extensions|Interfacing with data in Amazon S3, and using S3 as deep storage.|[link](../development/extensions-core/s3.md)|
|druid-ec2-extensions|Interfacing with AWS EC2 for autoscaling middle managers|UNDOCUMENTED|
|druid-aws-rds-extensions|Support for AWS token based access to AWS RDS DB Cluster.|[link](../development/extensions-core/druid-aws-rds.md)|
|druid-stats|Statistics related module including variance and standard deviation.|[link](../development/extensions-core/stats.md)|
@@ -101,7 +100,7 @@ All of these community extensions can be downloaded using [pull-deps](../operati
|druid-momentsketch|Support for approximate quantile queries using the [momentsketch](https://github.com/stanford-futuredata/momentsketch) library|[link](../development/extensions-contrib/momentsketch-quantiles.md)|
|druid-tdigestsketch|Support for approximate sketch aggregators based on [T-Digest](https://github.com/tdunning/t-digest)|[link](../development/extensions-contrib/tdigestsketch-quantiles.md)|
|gce-extensions|GCE Extensions|[link](../development/extensions-contrib/gce-extensions.md)|
-|prometheus-emitter|Exposes [Druid metrics](../operations/metrics.md) for Prometheus server collection (https://prometheus.io/)|[link](../development/extensions-contrib/prometheus.md)|
+|prometheus-emitter|Exposes [Druid metrics](../operations/metrics.md) for Prometheus server collection (<https://prometheus.io/>)|[link](../development/extensions-contrib/prometheus.md)|
|druid-kubernetes-overlord-extensions|Support for launching tasks in k8s without Middle Managers|[link](../development/extensions-contrib/k8s-jobs.md)|
|druid-spectator-histogram|Support for efficient approximate percentile queries|[link](../development/extensions-contrib/spectator-histogram.md)|
|druid-rabbit-indexing-service|Support for creating and managing [RabbitMQ](https://www.rabbitmq.com/) indexing tasks|[link](../development/extensions-contrib/rabbit-stream-ingestion.md)|
@@ -111,7 +110,6 @@ All of these community extensions can be downloaded using [pull-deps](../operati
Please post on [dev@druid.apache.org](https://lists.apache.org/list.html?dev@druid.apache.org) if you'd like an extension to be promoted to core.
If we see a community extension actively supported by the community, we can promote it to core based on community feedback.

For information on how to create your own extension, see [here](../development/modules.md).

## Loading extensions
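For example, a load list in `common.runtime.properties` might look like the following sketch; the extension names are illustrative and must match extensions actually installed on the services:

```properties
# Illustrative load list; include only extensions present under the extensions directory
druid.extensions.loadList=["druid-s3-extensions", "druid-azure-extensions", "druid-hdfs-storage"]
```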
@@ -668,14 +668,12 @@ Store task logs in S3. Note that the `druid-s3-extensions` extension must be loa

##### Azure Blob Store task logs

-Store task logs in Azure Blob Store.
+Store task logs in Azure Blob Store. To enable this feature, load the `druid-azure-extensions` extension, and configure deep storage for Azure. Druid uses the same authentication method configured for deep storage and stores task logs in the same storage account (set in `druid.azure.account`).

-Note: The `druid-azure-extensions` extension must be loaded, and this uses the same storage account as the deep storage module for azure.

-|Property|Description|Default|
-|--------|-----------|-------|
-|`druid.indexer.logs.container`|The Azure Blob Store container to write logs to|none|
-|`druid.indexer.logs.prefix`|The path to prepend to logs|none|
+| Property | Description | Default |
+|---|---|---|
+| `druid.indexer.logs.container` | The Azure Blob Store container to write logs to. | Must be set. |
+| `druid.indexer.logs.prefix` | The path to prepend to logs. | Must be set. |
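For reference, a minimal sketch of the corresponding runtime properties; the container and prefix values are placeholders, and `druid-azure-extensions` plus Azure deep storage must already be configured:

```properties
druid.indexer.logs.type=azure
druid.indexer.logs.container=druid-task-logs
druid.indexer.logs.prefix=prod/task-logs
```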
##### Google Cloud Storage task logs
@@ -714,7 +712,7 @@ You can configure Druid API error responses to hide internal information like th
|`druid.server.http.showDetailedJettyErrors`|When set to true, any error from the Jetty layer / Jetty filter includes the following fields in the JSON response: `servlet`, `message`, `url`, `status`, and `cause`, if it exists. When set to false, the JSON response only includes `message`, `url`, and `status`. The field values remain unchanged.|true|
|`druid.server.http.errorResponseTransform.strategy`|Error response transform strategy. The strategy controls how Druid transforms error responses from Druid services. When unset or set to `none`, Druid leaves error responses unchanged.|`none`|

-##### Error response transform strategy
+#### Error response transform strategy

You can use an error response transform strategy to transform error responses from within Druid services to hide internal information.
When you specify an error response transform strategy other than `none`, Druid transforms the error responses from Druid services as follows:
@@ -723,12 +721,12 @@ When you specify an error response transform strategy other than `none`, Druid t
* For any SQL query API that fails, for example `POST /druid/v2/sql/...`, Druid sets the fields `errorClass` and `host` to null. Druid applies the transformation strategy to the `errorMessage` field.
* For any JDBC related exceptions, Druid turns all checked exceptions into `QueryInterruptedException`; otherwise, Druid attempts to keep the exception type unchanged. For example, if the original exception isn't owned by Druid, it becomes `QueryInterruptedException`. Druid applies the transformation strategy to the `errorMessage` field.

-###### No error response transform strategy
+##### No error response transform strategy

In this mode, Druid leaves error responses from underlying services unchanged and returns the unchanged errors to the API client.
This is the default Druid error response mode. To explicitly enable this strategy, set `druid.server.http.errorResponseTransform.strategy` to `none`.

-###### Allowed regular expression error response transform strategy
+##### Allowed regular expression error response transform strategy

In this mode, Druid validates the error responses from underlying services against a list of regular expressions. Only error messages that match a configured regular expression are returned. To enable this strategy, set `druid.server.http.errorResponseTransform.strategy` to `allowedRegex`.
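As a sketch, an `allowedRegex` setup could look like the following; the pattern list is illustrative, and it assumes the companion `druid.server.http.errorResponseTransform.allowedRegex` property documented alongside the strategy:

```properties
# Only error messages matching these patterns are returned to clients (illustrative patterns)
druid.server.http.errorResponseTransform.strategy=allowedRegex
druid.server.http.errorResponseTransform.allowedRegex=["(?s)Unknown column .*", "(?s)Cannot construct instance of .*"]
```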
@@ -774,7 +772,7 @@ This config is used to find the [Coordinator](../design/coordinator.md) using Cu

You can configure how to announce and unannounce Znodes in ZooKeeper (using Curator). For normal operations you do not need to override any of these configs.

-##### Batch data segment announcer
+#### Batch data segment announcer

In current Druid, multiple data segments may be announced under the same Znode.
@@ -2037,7 +2035,7 @@ A simple in-memory LRU cache. Local cache resides in JVM heap memory, so if you
|Property|Description|Default|
|--------|-----------|-------|
|`druid.cache.sizeInBytes`|Maximum cache size in bytes. Zero disables caching.|0|
-|`druid.cache.initialSize`|Initial size of the hashtable backing the cache.|500000|
+|`druid.cache.initialSize`|Initial size of the hash table backing the cache.|500000|
|`druid.cache.logEvictionCount`|If non-zero, log cache eviction every `logEvictionCount` items.|0|

#### Caffeine cache
@@ -22,25 +22,75 @@ title: "Microsoft Azure"
~ under the License.
-->

+## Azure extension
+
+This extension allows you to do the following:
+
+* [Ingest data](#ingest-data-from-azure) from objects stored in Azure Blob Storage.
+* [Write segments](#store-segments-in-azure) to Azure Blob Storage for deep storage.
+* [Persist task logs](#persist-task-logs-in-azure) to Azure Blob Storage for long-term storage.
+
+:::info
+
To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-azure-extensions` in the extensions load list.

-## Deep Storage
+:::

-[Microsoft Azure Storage](http://azure.microsoft.com/en-us/services/storage/) is another option for deep storage. This requires some additional Druid configuration.
+### Ingest data from Azure

-|Property|Description|Possible Values|Default|
-|--------|---------------|-----------|-------|
-|`druid.storage.type`|azure||Must be set.|
-|`druid.azure.account`||Azure Storage account name.|Must be set.|
-|`druid.azure.key`||Azure Storage account key.|Optional. Set one of key, sharedAccessStorageToken or useAzureCredentialsChain.|
-|`druid.azure.sharedAccessStorageToken`||Azure Shared Storage access token|Optional. Set one of key, sharedAccessStorageToken or useAzureCredentialsChain..|
-|`druid.azure.useAzureCredentialsChain`|Use [DefaultAzureCredential](https://learn.microsoft.com/en-us/java/api/overview/azure/identity-readme?view=azure-java-stable) for authentication|Optional. Set one of key, sharedAccessStorageToken or useAzureCredentialsChain.|False|
-|`druid.azure.managedIdentityClientId`|If you want to use managed identity authentication in the `DefaultAzureCredential`, `useAzureCredentialsChain` must be true.||Optional.|
-|`druid.azure.container`||Azure Storage container name.|Must be set.|
-|`druid.azure.prefix`|A prefix string that will be prepended to the blob names for the segments published to Azure deep storage| |""|
-|`druid.azure.protocol`|the protocol to use|http or https|https|
-|`druid.azure.maxTries`|Number of tries before canceling an Azure operation.| |3|
-|`druid.azure.maxListingLength`|maximum number of input files matching a given prefix to retrieve at a time| |1024|
-|`druid.azure.storageAccountEndpointSuffix`| The endpoint suffix to use. Use this config instead of `druid.azure.endpointSuffix`. Override the default value to connect to [Azure Government](https://learn.microsoft.com/en-us/azure/azure-government/documentation-government-get-started-connect-to-storage#getting-started-with-storage-api). This config supports storage accounts enabled for [AzureDNSZone](https://learn.microsoft.com/en-us/azure/dns/dns-getstarted-portal). Note: do not include the storage account name prefix in this config value. | Examples: `ABCD1234.blob.storage.azure.net`, `blob.core.usgovcloudapi.net`| `blob.core.windows.net`|
-See [Azure Services](http://azure.microsoft.com/en-us/pricing/free-trial/) for more information.
+Ingest data using either [MSQ](../../multi-stage-query/index.md) or a native batch [parallel task](../../ingestion/native-batch.md) with an [Azure input source](../../ingestion/input-sources.md#azure-input-source) (`azureStorage`) to read objects directly from Azure Blob Storage.
+
+### Store segments in Azure
+
+:::info
+
+To use Azure for deep storage, set `druid.storage.type=azure`.
+
+:::
+
+#### Configure location
+
+Configure where to store segments using the following properties:
+
+| Property | Description | Default |
+|---|---|---|
+| `druid.azure.account` | The Azure Storage account name. | Must be set. |
+| `druid.azure.container` | The Azure Storage container name. | Must be set. |
+| `druid.azure.prefix` | A prefix string that will be prepended to the blob names for the segments published. | "" |
+| `druid.azure.maxTries` | Number of tries before canceling an Azure operation. | 3 |
+| `druid.azure.protocol` | The protocol to use to connect to the Azure Storage account. Either `http` or `https`. | `https` |
+| `druid.azure.storageAccountEndpointSuffix` | The Storage account endpoint to use. Override the default value to connect to [Azure Government](https://learn.microsoft.com/en-us/azure/azure-government/documentation-government-get-started-connect-to-storage#getting-started-with-storage-api) or storage accounts with [Azure DNS zone endpoints](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview#azure-dns-zone-endpoints-preview).<br/><br/>Do _not_ include the storage account name prefix in this config value.<br/><br/>Examples: `ABCD1234.blob.storage.azure.net`, `blob.core.usgovcloudapi.net`. | `blob.core.windows.net` |
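For reference, a minimal sketch of the segment-location side of this configuration; the account, container, and prefix values are placeholders:

```properties
druid.storage.type=azure
druid.azure.account=mydruidaccount
druid.azure.container=druid-segments
druid.azure.prefix=prod/segments
```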
+#### Configure authentication

+Authenticate access to Azure Blob Storage using one of the following methods:

+* [SAS token](https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview)
+* [Shared Key](https://learn.microsoft.com/en-us/rest/api/storageservices/authorize-with-shared-key)
+* Default Azure credentials chain ([`DefaultAzureCredential`](https://learn.microsoft.com/en-us/java/api/overview/azure/identity-readme#defaultazurecredential)).

+Configure authentication using the following properties:

+| Property | Description | Default |
+|---|---|---|
+| `druid.azure.sharedAccessStorageToken` | The SAS (Shared Storage Access) token. | |
+| `druid.azure.key` | The Shared Key. | |
+| `druid.azure.useAzureCredentialsChain` | If `true`, use `DefaultAzureCredential` for authentication. | `false` |
+| `druid.azure.managedIdentityClientId` | To use managed identity authentication in the `DefaultAzureCredential`, set `useAzureCredentialsChain` to `true` and provide the client ID here. | |
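As a sketch of two of these options (the token and client ID are placeholders), pick exactly one method:

```properties
# Option 1: SAS token
druid.azure.sharedAccessStorageToken=<sas-token>

# Option 2: DefaultAzureCredential with a user-assigned managed identity
#druid.azure.useAzureCredentialsChain=true
#druid.azure.managedIdentityClientId=<client-id>
```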
+### Persist task logs in Azure

+:::info

+To persist task logs in Azure Blob Storage, set `druid.indexer.logs.type=azure`.

+:::

+Druid stores task logs using the storage account and authentication method configured for storing segments. Use the following configuration to set up where to store the task logs:

+| Property | Description | Default |
+|---|---|---|
+| `druid.indexer.logs.container` | The Azure Blob Store container to write logs to. | Must be set. |
+| `druid.indexer.logs.prefix` | The path to prepend to logs. | Must be set. |

+For general options regarding task retention, see [Log retention policy](../../configuration/index.md#log-retention-policy).
@@ -22,7 +22,6 @@ title: "HDFS"
~ under the License.
-->

To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-hdfs-storage` in the extensions load list and run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.

## Deep Storage

@@ -44,11 +43,11 @@ If you want to eagerly authenticate against a secured hadoop/hdfs cluster you mu

### Configuration for Cloud Storage

-You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS.
+You can also use the Amazon S3 or the Google Cloud Storage as the deep storage via HDFS.

-#### Configuration for AWS S3
+#### Configuration for Amazon S3

-To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
+To use the Amazon S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
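For illustration, a minimal sketch of such a configuration, assuming the Hadoop `s3a` connector and its dependencies are on the Druid classpath; the bucket and path are placeholders:

```properties
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://my-bucket/druid/segments
```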
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
@@ -25,6 +25,7 @@ title: "S3-compatible"
## S3 extension

This extension allows you to do 2 things:

* [Ingest data](#reading-data-from-s3) from files stored in S3.
* Write segments to [deep storage](#deep-storage) in S3.

@@ -41,7 +42,7 @@ To read objects from S3, you must supply [connection information](#configuration

### Deep Storage

-S3-compatible deep storage means either AWS S3 or a compatible service like Google Storage which exposes the same API as S3.
+S3-compatible deep storage means either Amazon S3 or a compatible service like Google Storage which exposes the same API as S3.

S3 deep storage needs to be explicitly enabled by setting `druid.storage.type=s3`. **Only after setting the storage type to S3 will any of the settings below take effect.**
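For reference, a minimal sketch of the deep storage side; the bucket and base key are placeholders, and credentials come from the authentication methods described in this topic:

```properties
druid.storage.type=s3
druid.storage.bucket=my-druid-bucket
druid.storage.baseKey=druid/segments
```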
@@ -97,19 +98,19 @@ Note that this setting only affects Druid's behavior. Changing S3 to use Object

If you're using ACLs, Druid needs the following permissions:

-- `s3:GetObject`
-- `s3:PutObject`
-- `s3:DeleteObject`
-- `s3:GetBucketAcl`
-- `s3:PutObjectAcl`
+* `s3:GetObject`
+* `s3:PutObject`
+* `s3:DeleteObject`
+* `s3:GetBucketAcl`
+* `s3:PutObjectAcl`

#### Object Ownership permissions

If you're using Object Ownership, Druid needs the following permissions:

-- `s3:GetObject`
-- `s3:PutObject`
-- `s3:DeleteObject`
+* `s3:GetObject`
+* `s3:PutObject`
+* `s3:DeleteObject`

### AWS region
@@ -117,8 +118,8 @@ The AWS SDK requires that a target region be specified. You can set these by us

For example, to set the region to 'us-east-1' through system properties:

-- Add `-Daws.region=us-east-1` to the `jvm.config` file for all Druid services.
-- Add `-Daws.region=us-east-1` to `druid.indexer.runner.javaOpts` in [Middle Manager configuration](../../configuration/index.md#middlemanager-configuration) so that the property will be passed to Peon (worker) processes.
+* Add `-Daws.region=us-east-1` to the `jvm.config` file for all Druid services.
+* Add `-Daws.region=us-east-1` to `druid.indexer.runner.javaOpts` in [Middle Manager configuration](../../configuration/index.md#middlemanager-configuration) so that the property will be passed to Peon (worker) processes.

### Connecting to S3 configuration
@@ -146,6 +147,6 @@ For example, to set the region to 'us-east-1' through system properties:
You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption) by setting
`druid.storage.sse.type` to a supported type of server-side encryption. The current supported types are:

-- s3: [Server-side encryption with S3-managed encryption keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption)
-- kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption)
-- custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys)
+* s3: [Server-side encryption with S3-managed encryption keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption)
+* kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption)
+* custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys)
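As a sketch, enabling SSE-KMS could look like the following; the key ID is a placeholder, and the `keyId` property name should be verified against the S3 extension reference for your Druid version:

```properties
druid.storage.sse.type=kms
druid.storage.sse.kms.keyId=<kms-key-id>
```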
@@ -148,7 +148,7 @@ For example, using the static input paths:
"paths" : "hdfs://path/to/data/is/here/data.gz,hdfs://path/to/data/is/here/moredata.gz,hdfs://path/to/data/is/here/evenmoredata.gz"
```

-You can also read from cloud storage such as AWS S3 or Google Cloud Storage.
+You can also read from cloud storage such as Amazon S3 or Google Cloud Storage.
To do so, you need to install the necessary library under Druid's classpath in _all MiddleManager or Indexer processes_.
For S3, you can run the below command to install the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/).

@@ -336,7 +336,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|logParseExceptions|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no(default = false)|
|maxParseExceptions|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overrides `ignoreInvalidRows` if `maxParseExceptions` is defined.|no(default = unlimited)|
|useYarnRMJobStatusFallback|Boolean|If the Hadoop jobs created by the indexing task are unable to retrieve their completion status from the JobHistory server, and this parameter is true, the indexing task will try to fetch the application status from `http://<yarn-rm-address>/ws/v1/cluster/apps/<application-id>`, where `<yarn-rm-address>` is the value of `yarn.resourcemanager.webapp.address` in your Hadoop configuration. This flag is intended as a fallback for cases where an indexing task's jobs succeed, but the JobHistory server is unavailable, causing the indexing task to fail because it cannot determine the job statuses.|no (default = true)|
|awaitSegmentAvailabilityTimeoutMillis|Long|Milliseconds to wait for the newly indexed segments to become available for query after ingestion completes. If `<= 0`, no wait will occur. If `> 0`, the task will wait for the Coordinator to indicate that the new segments are available for querying. If the timeout expires, the task will exit as successful, but the segments were not confirmed to have become available for query.|no (default = 0)|

### `jobProperties`
@@ -30,12 +30,15 @@ For general information on native batch indexing and parallel task indexing, see
## S3 input source

:::info

You need to include the [`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension to use the S3 input source.

:::

The S3 input source reads objects directly from S3. You can specify either:

-- a list of S3 URI strings
-- a list of S3 location prefixes that attempts to list the contents and ingest
+* a list of S3 URI strings
+* a list of S3 location prefixes that attempts to list the contents and ingest
all objects contained within the locations.

The S3 input source is splittable. Therefore, you can use it with the [Parallel task](./native-batch.md). Each worker task of `index_parallel` reads one or multiple objects.
@@ -76,7 +79,6 @@ Sample specs:
...
```

```json
...
"ioConfig": {
@@ -210,13 +212,17 @@ Properties Object:
|assumeRoleExternalId|A unique identifier that might be required when you assume a role in another account [see](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html)|None|no|

:::info
-**Note:** If `accessKeyId` and `secretAccessKey` are not given, the default [S3 credentials provider chain](../development/extensions-core/s3.md#s3-authentication-methods) is used.
+If `accessKeyId` and `secretAccessKey` are not given, the default [S3 credentials provider chain](../development/extensions-core/s3.md#s3-authentication-methods) is used.
:::

## Google Cloud Storage input source

:::info

You need to include the [`druid-google-extensions`](../development/extensions-core/google.md) as an extension to use the Google Cloud Storage input source.

:::

The Google Cloud Storage input source is to support reading objects directly
@@ -261,7 +267,6 @@ Sample specs:
...
```

```json
...
"ioConfig": {
@@ -300,16 +305,18 @@ Google Cloud Storage object:
|path|The path where data is located.|None|yes|
|systemFields|JSON array of system fields to return as part of input rows. Possible values: `__file_uri` (Google Cloud Storage URI starting with `gs://`), `__file_bucket` (GCS bucket), and `__file_path` (GCS key).|None|no|

## Azure input source

:::info

You need to include the [`druid-azure-extensions`](../development/extensions-core/azure.md) as an extension to use the Azure input source.

:::

The Azure input source (that uses the type `azureStorage`) reads objects directly from Azure Blob store or Azure Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the Azure input source for use with [Parallel task](./native-batch.md) indexing and each worker task reads one chunk of the split data.

The `azureStorage` input source is a new schema for Azure input sources that allows you to specify which storage account files should be ingested from. We recommend that you update any specs that use the old `azure` schema to use the new `azureStorage` schema. The new schema provides more functionality than the older `azure` schema.

Sample specs:
@@ -347,7 +354,6 @@ Sample specs:
...
```

```json
...
"ioConfig": {
@@ -379,7 +385,7 @@ Sample specs:
|objects|JSON array of Azure objects to ingest.|None|One of the following must be set:`uris`, `prefixes`, or `objects`.|
|objectGlob|A glob for the object part of the Azure URI. In the URI `azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br /><br />The glob must match the entire object part, not just the filename. For example, the glob `*.json` does not match `azureStorage://foo/bar/file.json` because the object part is `bar/file.json`, and the`*` does not match the slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br />For more information, refer to the documentation for [`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
|systemFields|JSON array of system fields to return as part of input rows. Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`), `__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
-|properties|Properties object for overriding the default Azure configuration. See below for more information.|None|No (defaults will be used if not given)
+|properties|Properties object for overriding the default Azure configuration. See below for more information.|None|No (defaults will be used if not given)|

Note that the Azure input source skips all empty objects only when `prefixes` is specified.
@@ -390,14 +396,12 @@ The `objects` property can one of the following:
|bucket|Name of the Azure Blob Storage or Azure Data Lake storage account|None|yes|
|path|The container and path where data is located.|None|yes|

The `properties` property can be one of the following:

-- `sharedAccessStorageToken`
-- `key`
-- `appRegistrationClientId`, `appRegistrationClientSecret`, and `tenantId`
-- empty
+* `sharedAccessStorageToken`
+* `key`
+* `appRegistrationClientId`, `appRegistrationClientSecret`, and `tenantId`
+* empty

|Property|Description|Default|Required|
|--------|-----------|-------|---------|
@@ -407,8 +411,7 @@ The `properties` property can be one of the following:
|appRegistrationClientSecret|The client secret of the Azure App registration to authenticate as|None|Yes if `appRegistrationClientId` is provided|
|tenantId|The tenant ID of the Azure App registration to authenticate as|None|Yes if `appRegistrationClientId` is provided|
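Tying these pieces together, the following is a hedged sketch of an `azureStorage` ioConfig that reads one object and overrides authentication with a SAS token; the account, container, path, and token values are placeholders:

```json
{
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "azureStorage",
      "uris": ["azureStorage://myaccount/mycontainer/path/to/file.json"],
      "properties": {
        "sharedAccessStorageToken": "<sas-token>"
      }
    },
    "inputFormat": {
      "type": "json"
    }
  }
}
```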
-#### `azure` input source
+### Legacy `azure` input source

The Azure input source that uses the type `azure` is an older version of the Azure input type and is not recommended. It doesn't support specifying which storage account to ingest from. We recommend using the [`azureStorage` input source schema](#azure-input-source) instead since it provides more functionality.
@@ -448,7 +451,6 @@ Sample specs:
...
```

```json
...
"ioConfig": {
@@ -487,11 +489,12 @@ The `objects` property is:
|bucket|Name of the Azure Blob Storage or Azure Data Lake container|None|yes|
|path|The path where data is located.|None|yes|

## HDFS input source

:::info

You need to include the [`druid-hdfs-storage`](../development/extensions-core/hdfs.md) as an extension to use the HDFS input source.

:::

The HDFS input source is to support reading files directly
@@ -580,10 +583,12 @@ in `druid.ingestion.hdfs.allowedProtocols`. See [HDFS input source security conf

The HTTP input source is to support reading files directly from remote sites via HTTP.

-:::info
-**Security notes:** Ingestion tasks run under the operating system account that runs the Druid processes, for example the Indexer, Middle Manager, and Peon. This means any user who can submit an ingestion task can specify an input source referring to any location that the Druid process can access. For example, using `http` input source, users may have access to internal network servers.
+:::info Security notes
+
+Ingestion tasks run under the operating system account that runs the Druid processes, for example the Indexer, Middle Manager, and Peon. This means any user who can submit an ingestion task can specify an input source referring to any location that the Druid process can access. For example, using `http` input source, users may have access to internal network servers.

The `http` input source is not limited to the HTTP or HTTPS protocols. It uses the Java URI class that supports HTTP, HTTPS, FTP, file, and jar protocols by default.

:::

For more information about security best practices, see [Security overview](../operations/security-overview.md#best-practices).
@@ -725,7 +730,7 @@ Sample spec:
|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter) for more information. Files matching the filter criteria are considered for ingestion. Files not matching the filter criteria are ignored.|yes if `baseDir` is specified|
|baseDir|Directory to search recursively for files to be ingested. Empty files under the `baseDir` will be skipped.|At least one of `baseDir` or `files` should be specified|
|files|File paths to ingest. Some files can be ignored to avoid ingesting duplicate files if they are located under the specified `baseDir`. Empty files will be skipped.|At least one of `baseDir` or `files` should be specified|
-|systemFields|JSON array of system fields to return as part of input rows. Possible values: `__file_uri` (File URI starting with `file:`) and `__file_path` (file path).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows. Possible values: `__file_uri` (File URI starting with `file:`) and `__file_path` (file path).|no|

## Druid input source
@@ -744,9 +749,9 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th

The Druid input source can be used for a variety of purposes, including:

-- Creating new datasources that are rolled-up copies of existing datasources.
-- Changing the [partitioning or sorting](./partitioning.md) of a datasource to improve performance.
-- Updating or removing rows using a [`transformSpec`](./ingestion-spec.md#transformspec).
+* Creating new datasources that are rolled-up copies of existing datasources.
+* Changing the [partitioning or sorting](./partitioning.md) of a datasource to improve performance.
+* Updating or removing rows using a [`transformSpec`](./ingestion-spec.md#transformspec).

When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
@@ -813,16 +818,16 @@ rolled-up datasource `wikipedia_rollup` by grouping on hour, "countryName", and
```

:::info
-Note: Older versions (0.19 and earlier) did not respect the timestampSpec when using the Druid input source. If you
-have ingestion specs that rely on this and cannot rewrite them, set
-[`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`](../configuration/index.md#indexer-general-configuration)
-to `true` to enable a compatibility mode where the timestampSpec is ignored.
+
+Older versions (0.19 and earlier) did not respect the timestampSpec when using the Druid input source. If you have ingestion specs that rely on this and cannot rewrite them, set [`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`](../configuration/index.md#indexer-general-configuration) to `true` to enable a compatibility mode where the timestampSpec is ignored.
+
:::

The [secondary partitioning method](native-batch.md#partitionsspec) determines the requisite number of concurrent worker tasks that run in parallel to complete ingestion with the Combining input source.
Set this value in `maxNumConcurrentSubTasks` in `tuningConfig` based on the secondary partitioning method:

-- `range` or `single_dim` partitioning: greater than or equal to 1
-- `hashed` or `dynamic` partitioning: greater than or equal to 2
+* `range` or `single_dim` partitioning: greater than or equal to 1
+* `hashed` or `dynamic` partitioning: greater than or equal to 2

For more information on the `maxNumConcurrentSubTasks` field, see [Implementation considerations](native-batch.md#implementation-considerations).
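For illustration, a sketch of a `tuningConfig` that follows this rule for `hashed` partitioning; the shard count is a placeholder:

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "hashed",
      "numShards": 4
    },
    "maxNumConcurrentSubTasks": 2
  }
}
```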
@@ -866,7 +871,7 @@ The following is an example of an SQL input source spec:
The spec above will read all events from two separate SQLs for the interval `2013-01-01/2013-01-02`.
Each of the SQL queries will be run in its own sub-task and thus for the above example, there would be two sub-tasks.

-**Recommended practices**
+### Recommended practices

Compared to the other native batch input sources, SQL input source behaves differently in terms of reading the input data. Therefore, consider the following points before using this input source in a production environment:
@@ -878,7 +883,6 @@ Compared to the other native batch input sources, SQL input source behaves diffe

* Similar to file-based input formats, any updates to existing data will replace the data in segments specific to the intervals specified in the `granularitySpec`.

## Combining input source

The Combining input source lets you read data from multiple input sources.
@@ -928,7 +932,9 @@ The following is an example of a Combining input source spec:
## Iceberg input source

:::info

To use the Iceberg input source, load the extension [`druid-iceberg-extensions`](../development/extensions-contrib/iceberg.md).

:::

You use the Iceberg input source to read data stored in the Iceberg table format. For a given table, the input source scans up to the latest Iceberg snapshot from the configured Hive catalog. Druid ingests the underlying live data files using the existing input source formats.
@@ -1133,13 +1139,15 @@ This input source provides the following filters: `and`, `equals`, `interval`, a
## Delta Lake input source

:::info

To use the Delta Lake input source, load the extension [`druid-deltalake-extensions`](../development/extensions-contrib/delta-lake.md).

:::

You can use the Delta input source to read data stored in a Delta Lake table. For a given table, the input source scans
the latest snapshot from the configured table. Druid ingests the underlying delta files from the table.

| Property|Description|Required|
|---------|-----------|--------|
| type|Set this value to `delta`.|yes|
| tablePath|The location of the Delta table.|yes|
@@ -1155,7 +1163,6 @@ on statistics collected when the non-partitioned table is created. In this scena
data that doesn't match the filter. To guarantee that the Delta Kernel prunes out unnecessary column values, only use
filters on partitioned columns.

`and` filter:

| Property | Description | Required |
@@ -1217,7 +1224,6 @@ filters on partitioned columns.
| column | The table column to apply the filter on. | yes |
| value | The value to use in the filter. | yes |

The following is a sample spec to read all records from the Delta table `/delta-table/foo`:

```json
@@ -28,12 +28,14 @@ sidebar_label: JSON-based batch
:::

Apache Druid supports the following types of JSON-based batch indexing tasks:

- Parallel task indexing (`index_parallel`) that can run multiple indexing tasks concurrently. Parallel task works well for production ingestion tasks.
- Simple task indexing (`index`) that runs a single indexing task at a time. Simple task indexing is suitable for development and test environments.

This topic covers the configuration for `index_parallel` ingestion specs.

For related information on batch indexing, see:

- [Batch ingestion method comparison table](./index.md#batch) for a comparison of batch ingestion methods.
- [Tutorial: Loading a file](../tutorials/tutorial-batch.md) for a tutorial on JSON-based batch ingestion.
- [Input sources](./input-sources.md) for possible input sources.
@@ -97,7 +99,6 @@ By default, JSON-based batch ingestion replaces all data in the intervals in you

You can also perform concurrent append and replace tasks. For more information, see [Concurrent append and replace](./concurrent-append-replace.md).

#### Fully replacing existing segments using tombstones

:::info
@@ -124,12 +125,12 @@ You want to re-ingest and overwrite with new data as follows:

Unless you set `dropExisting` to true, the result after ingestion with overwrite using the same `MONTH` `segmentGranularity` would be:

-* **January**: 1 record
-* **February**: 10 records
-* **March**: 9 records
+- **January**: 1 record
+- **February**: 10 records
+- **March**: 9 records

This may not be what is expected since the new data has 0 records for January. Set `dropExisting` to true to replace the unneeded January segment with a tombstone.

## Parallel indexing example

The following example illustrates the configuration for a parallel indexing task.
@@ -214,6 +215,7 @@ The following example illustrates the configuration for a parallel indexing task
}
}
```

</details>

## Parallel indexing configuration
@@ -305,7 +307,7 @@ The segments split hint spec is used only for [`DruidInputSource`](./input-sourc

### `partitionsSpec`

The primary partition for Druid is time. You can define a secondary partitioning method in the partitions spec. Use the `partitionsSpec` type that applies for your [rollup](rollup.md) method.

For perfect rollup, you can use:
@@ -366,7 +368,7 @@ In the `partial segment generation` phase, just like the Map phase in MapReduce,
the Parallel task splits the input data based on the split hint spec
and assigns each split to a worker task. Each worker task (type `partial_index_generate`) reads the assigned split, and partitions rows by the time chunk from `segmentGranularity` (primary partition key) in the `granularitySpec`
and then by the hash value of `partitionDimensions` (secondary partition key) in the `partitionsSpec`.
The partitioned data is stored in local storage of
the [middleManager](../design/middlemanager.md) or the [indexer](../design/indexer.md).

The `partial segment merge` phase is similar to the Reduce phase in MapReduce.
@@ -709,12 +711,14 @@ The returned result contains the worker task spec, a current task status if exis
"taskHistory": []
}
```

</details>

`http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}/history`
Returns the task attempt history of the worker task spec of the given id, or HTTP 404 Not Found error if the supervisor task is running in the sequential mode.

## Segment pushing modes

While ingesting data using the parallel task indexing, Druid creates segments from the input data and pushes them. For segment pushing,
the parallel task index supports the following segment pushing modes based upon your type of [rollup](./rollup.md):
@@ -743,10 +747,12 @@ This may help the higher priority tasks to finish earlier than lower priority ta
by assigning more task slots to them.

## Splittable input sources

Use the `inputSource` object to define the location where your index can read data. Only the native parallel task and simple task support the input source.

For details on available input sources see:

-- [S3 input source](./input-sources.md#s3-input-source) (`s3`) reads data from AWS S3 storage.
+- [S3 input source](./input-sources.md#s3-input-source) (`s3`) reads data from Amazon S3 storage.
- [Google Cloud Storage input source](./input-sources.md#google-cloud-storage-input-source) (`gs`) reads data from Google Cloud Storage.
- [Azure input source](./input-sources.md#azure-input-source) (`azure`) reads data from Azure Blob Storage and Azure Data Lake.
- [HDFS input source](./input-sources.md#hdfs-input-source) (`hdfs`) reads data from HDFS storage.
@@ -216,6 +216,7 @@ ROUTINE_TYPE
Rackspace
Redis
S3
+SAS
SDK
SIGAR
SPNEGO