Azure data lake input source (#11153)

* Mention Azure Data Lake

* Make consistent with other entries

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
sthetland 2021-06-25 15:54:34 -07:00 committed by GitHub
parent ed0a57e106
commit fd0931d35e
1 changed file with 9 additions and 11 deletions


@@ -58,7 +58,7 @@ The supported splittable input formats for now are:
 - [`s3`](#s3-input-source) reads data from AWS S3 storage.
 - [`gs`](#google-cloud-storage-input-source) reads data from Google Cloud Storage.
-- [`azure`](#azure-input-source) reads data from Azure Blob Storage.
+- [`azure`](#azure-input-source) reads data from Azure Blob Storage and Azure Data Lake.
 - [`hdfs`](#hdfs-input-source) reads data from HDFS storage.
 - [`http`](#http-input-source) reads data from HTTP servers.
 - [`local`](#local-input-source) reads data from local storage.
@@ -1046,10 +1046,8 @@ Google Cloud Storage object:
 > You need to include the [`druid-azure-extensions`](../development/extensions-core/azure.md) as an extension to use the Azure input source.
-The Azure input source is to support reading objects directly from Azure Blob store. Objects can be
-specified as list of Azure Blob store URI strings. The Azure input source is splittable and can be used
-by the [Parallel task](#parallel-task), where each worker task of `index_parallel` will read
-a single object.
+The Azure input source reads objects directly from Azure Blob store or Azure Data Lake sources. You can
+specify objects as a list of file URI strings or prefixes. You can split the Azure input source for use with [Parallel task](#parallel-task) indexing, where each worker task reads one chunk of the split data.
 Sample specs:
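The unchanged sample specs are collapsed in this view. As a rough sketch of what an Azure `ioConfig` looks like when the input source is given explicit `uris` (the container and file names here are hypothetical placeholders):

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "azure",
    "uris": [
      "azure://my-container/events/part-001.json",
      "azure://my-container/events/part-002.json"
    ]
  },
  "inputFormat": {
    "type": "json"
  }
}
```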
@@ -1108,17 +1106,17 @@ Sample specs:
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should be `azure`.|None|yes|
-|uris|JSON array of URIs where Azure Blob objects to be ingested are located. Should be in form "azure://\<container>/\<path-to-file\>"|None|`uris` or `prefixes` or `objects` must be set|
-|prefixes|JSON array of URI prefixes for the locations of Azure Blob objects to be ingested. Should be in the form "azure://\<container>/\<prefix\>". Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set|
-|objects|JSON array of Azure Blob objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, in the form "azure://\<container>/\<path-to-file\>"|None|`uris` or `prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to ingest, in the form "azure://\<container>/\<prefix\>". Empty objects starting with one of the given prefixes are skipped.|None|`uris` or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or `objects` must be set|
-Note that the Azure input source will skip all empty objects only when `prefixes` is specified.
+Note that the Azure input source skips all empty objects only when `prefixes` is specified.
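A variant of the same sketch using `prefixes` instead of `uris` (container and prefix again hypothetical); every non-empty object under the prefix is read:

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "azure",
    "prefixes": ["azure://my-container/events/"]
  },
  "inputFormat": {
    "type": "json"
  }
}
```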
-Azure Blob object:
+The `objects` property is:
 |property|description|default|required?|
 |--------|-----------|-------|---------|
-|bucket|Name of the Azure Blob Storage container|None|yes|
+|bucket|Name of the Azure Blob Storage or Azure Data Lake container|None|yes|
 |path|The path where data is located.|None|yes|
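A sketch of the `objects` form, where each entry pairs a container name with a path inside it (both hypothetical here):

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "azure",
    "objects": [
      { "bucket": "my-container", "path": "events/part-001.json" }
    ]
  },
  "inputFormat": {
    "type": "json"
  }
}
```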
 ### HDFS Input Source