Azure data lake input source (#11153)

* Mention Azure Data Lake

* Make consistent with other entries

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
sthetland 2021-06-25 15:54:34 -07:00 committed by GitHub
parent ed0a57e106
commit fd0931d35e
1 changed file with 9 additions and 11 deletions


@@ -58,7 +58,7 @@ The supported splittable input formats for now are:
 - [`s3`](#s3-input-source) reads data from AWS S3 storage.
 - [`gs`](#google-cloud-storage-input-source) reads data from Google Cloud Storage.
-- [`azure`](#azure-input-source) reads data from Azure Blob Storage.
+- [`azure`](#azure-input-source) reads data from Azure Blob Storage and Azure Data Lake.
 - [`hdfs`](#hdfs-input-source) reads data from HDFS storage.
 - [`http`](#http-input-source) reads data from HTTP servers.
 - [`local`](#local-input-source) reads data from local storage.
@@ -1046,10 +1046,8 @@ Google Cloud Storage object:
 > You need to include the [`druid-azure-extensions`](../development/extensions-core/azure.md) as an extension to use the Azure input source.
-The Azure input source is to support reading objects directly from Azure Blob store. Objects can be
-specified as list of Azure Blob store URI strings. The Azure input source is splittable and can be used
-by the [Parallel task](#parallel-task), where each worker task of `index_parallel` will read
-a single object.
+The Azure input source reads objects directly from Azure Blob store or Azure Data Lake sources. You can
+specify objects as a list of file URI strings or prefixes. You can split the Azure input source for use with [Parallel task](#parallel-task) indexing, where each worker task reads one chunk of the split data.
 Sample specs:
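The unchanged sample specs are collapsed in this view. As a rough sketch of what an Azure `ioConfig` looks like when the input source is given explicit `uris` (the container and file names here are hypothetical placeholders):

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "azure",
    "uris": [
      "azure://my-container/events/part-001.json",
      "azure://my-container/events/part-002.json"
    ]
  },
  "inputFormat": {
    "type": "json"
  }
}
```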
@@ -1108,17 +1106,17 @@ Sample specs:
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should be `azure`.|None|yes|
-|uris|JSON array of URIs where Azure Blob objects to be ingested are located. Should be in form "azure://\<container>/\<path-to-file\>"|None|`uris` or `prefixes` or `objects` must be set|
-|prefixes|JSON array of URI prefixes for the locations of Azure Blob objects to be ingested. Should be in the form "azure://\<container>/\<prefix\>". Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set|
-|objects|JSON array of Azure Blob objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, in the form "azure://\<container>/\<path-to-file\>"|None|`uris` or `prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to ingest, in the form "azure://\<container>/\<prefix\>". Empty objects starting with one of the given prefixes are skipped.|None|`uris` or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or `objects` must be set|
-Note that the Azure input source will skip all empty objects only when `prefixes` is specified.
+Note that the Azure input source skips all empty objects only when `prefixes` is specified.
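A variant of the same sketch using `prefixes` instead of `uris` (container and prefix again hypothetical); every non-empty object under the prefix is read:

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "azure",
    "prefixes": ["azure://my-container/events/"]
  },
  "inputFormat": {
    "type": "json"
  }
}
```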
-Azure Blob object:
+The `objects` property is:
 |property|description|default|required?|
 |--------|-----------|-------|---------|
-|bucket|Name of the Azure Blob Storage container|None|yes|
+|bucket|Name of the Azure Blob Storage or Azure Data Lake container|None|yes|
 |path|The path where data is located.|None|yes|
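A sketch of the `objects` form, where each entry pairs a container name with a path inside it (both hypothetical here):

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "azure",
    "objects": [
      { "bucket": "my-container", "path": "events/part-001.json" }
    ]
  },
  "inputFormat": {
    "type": "json"
  }
}
```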
 ### HDFS Input Source