HADOOP-16401. ABFS: port Azure doc to 3.2 branch.
Signed-off-by: Masatake Iwasaki <iwasakims@apache.org>
<!-- MACRO{toc|fromDepth=1|toDepth=3} -->

## <a name="introduction"></a> Introduction

The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2
storage layer through the "abfs" connector.

To make it part of Apache Hadoop's default classpath, make sure that
the `HADOOP_OPTIONAL_TOOLS` environment variable has `hadoop-azure` in the list,
*on every machine in the cluster*:

```bash
export HADOOP_OPTIONAL_TOOLS=hadoop-azure
```

You can set this locally in your `.profile`/`.bashrc`, but note it won't
propagate to jobs running in-cluster.

## <a name="features"></a> Features of the ABFS connector

* Supports reading and writing data stored in an Azure Blob Storage account.
* *Fully Consistent* view of the storage across all clients.
* Can read data written through the `wasb:` connector.
* Presents a hierarchical file system view by implementing the standard Hadoop
  [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
* Supports configuration of multiple Azure Blob Storage accounts.
* Can act as a source or destination of data in Hadoop MapReduce, Apache Hive, Apache Spark.
* Tested at scale on both Linux and Windows by Microsoft themselves.
* Can be used as a replacement for HDFS on Hadoop clusters deployed in Azure infrastructure.

For details on ABFS, consult the following documents:

* [A closer look at Azure Data Lake Storage Gen2](https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/);
  MSDN article from June 28, 2018.
* [Storage Tiers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers)

## Getting started

### Concepts

The Azure Storage data model presents 3 core concepts:

* **Storage Account**: All access is done through a storage account.
* **Container**: A container is a grouping of multiple blobs. A storage account
  may have multiple containers. In Hadoop, an entire file system hierarchy is
  stored in a single container.
* **Blob**: A file of any type and size, as stored by the existing `wasb:` connector.

The ABFS connector connects to classic containers, or those created
with Hierarchical Namespaces.

## <a name="namespaces"></a> Hierarchical Namespaces (and WASB Compatibility)

A key aspect of ADLS Gen 2 is its support for
[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace).
These are effectively directories, and offer high-performance rename and delete
operations, something which significantly improves the performance of query engines
writing data to the store, including MapReduce, Spark and Hive, as well as DistCp.

This feature is only available if the container was created with "namespace"
support.

You enable namespace support when creating a new Storage Account,
by checking the "Hierarchical Namespace" option in the Portal UI, or, when
creating through the command line, by using the option `--hierarchical-namespace true`.

_You cannot enable Hierarchical Namespaces on an existing storage account._

Containers in a storage account with Hierarchical Namespaces are
not (currently) readable through the `wasb:` connector.

Some of the `az storage` command line commands fail too, for example:

```bash
$ az storage container list --account-name abfswales1
Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts
```

### <a name="creating"></a> Creating an Azure Storage Account

The best documentation on getting started with Azure Datalake Gen2 with the
abfs connector is [Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdi-cluster)

It includes instructions to create one from [the Azure command line tool](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest),
which can be installed on Windows, macOS (via Homebrew) and Linux (apt or yum).

The [az storage](https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest) subcommand
handles all storage commands, and [`az storage account create`](https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create)
does the creation.

Until the ADLS Gen2 API support is finalized, you need to add an extension
to the `az` command:

```bash
az extension add --name storage-preview
```

Check that all is well by verifying that the usage command includes `--hierarchical-namespace`:

```
$ az storage account
usage: az storage account create [-h] [--verbose] [--debug]
          [--output {json,jsonc,table,tsv,yaml,none}]
          [--query JMESPATH] --resource-group
          RESOURCE_GROUP_NAME --name ACCOUNT_NAME
          [--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}]
          [--location LOCATION]
          [--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}]
          [--tags [TAGS [TAGS ...]]]
          [--custom-domain CUSTOM_DOMAIN]
          [--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]]
          [--access-tier {Hot,Cool}]
          [--https-only [{true,false}]]
          [--file-aad [{true,false}]]
          [--hierarchical-namespace [{true,false}]]
          [--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]]
          [--default-action {Allow,Deny}]
          [--assign-identity]
          [--subscription _SUBSCRIPTION]
```

You can list locations with `az account list-locations`, which lists the
name to use in the `--location` argument:

```
$ az account list-locations -o table

DisplayName          Latitude    Longitude    Name
-------------------  ----------  -----------  ------------------
East Asia            22.267      114.188      eastasia
Southeast Asia       1.283       103.833      southeastasia
Central US           41.5908     -93.6208     centralus
East US              37.3719     -79.8164     eastus
East US 2            36.6681     -78.3889     eastus2
West US              37.783      -122.417     westus
North Central US     41.8819     -87.6278     northcentralus
South Central US     29.4167     -98.5        southcentralus
North Europe         53.3478     -6.2597      northeurope
West Europe          52.3667     4.9          westeurope
Japan West           34.6939     135.5022     japanwest
Japan East           35.68       139.77       japaneast
Brazil South         -23.55      -46.633      brazilsouth
Australia East       -33.86      151.2094     australiaeast
Australia Southeast  -37.8136    144.9631     australiasoutheast
South India          12.9822     80.1636      southindia
Central India        18.5822     73.9197      centralindia
West India           19.088      72.868       westindia
Canada Central       43.653      -79.383      canadacentral
Canada East          46.817      -71.217      canadaeast
UK South             50.941      -0.799       uksouth
UK West              53.427      -3.084       ukwest
West Central US      40.890      -110.234     westcentralus
West US 2            47.233      -119.852     westus2
Korea Central        37.5665     126.9780     koreacentral
Korea South          35.1796     129.0756     koreasouth
France Central       46.3772     2.3730       francecentral
France South         43.8345     2.1972       francesouth
Australia Central    -35.3075    149.1244     australiacentral
Australia Central 2  -35.3075    149.1244     australiacentral2
```

Once a location has been chosen, create the account:

```bash
az storage account create --verbose \
    --name abfswales1 \
    --resource-group devteam2 \
    --kind StorageV2 \
    --hierarchical-namespace true \
    --location ukwest \
    --sku Standard_LRS \
    --https-only true \
    --encryption-services blob \
    --access-tier Hot \
    --tags owner=engineering \
    --assign-identity \
    --output jsonc
```

The output of the command is a JSON document whose `primaryEndpoints` section
includes the names of the store endpoints:

```json
{
  "primaryEndpoints": {
    "blob": "https://abfswales1.blob.core.windows.net/",
    "dfs": "https://abfswales1.dfs.core.windows.net/",
    "file": "https://abfswales1.file.core.windows.net/",
    "queue": "https://abfswales1.queue.core.windows.net/",
    "table": "https://abfswales1.table.core.windows.net/",
    "web": "https://abfswales1.z35.web.core.windows.net/"
  }
}
```

The `abfswales1.dfs.core.windows.net` account is the name by which the
storage account will be referred to.

Now ask for the connection string to the store, which contains the account key:

```bash
az storage account show-connection-string --name abfswales1
{
  "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA=="
}
```

You then need to add the access key to your `core-site.xml`, JCEKS file, or
use your cluster management tool to set the option `fs.azure.account.key.STORAGE-ACCOUNT`
to this value:

```xml
<property>
  <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
  <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
</property>
```

#### Creation through the Azure Portal

Creation through the portal is covered in [Quickstart: Create an Azure Data Lake Storage Gen2 storage account](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account)

Key steps:

1. Create a new Storage Account in a location which suits you.
1. "Basics" Tab: select "StorageV2".
1. "Advanced" Tab: enable "Hierarchical Namespace".

You have now created your storage account. Next, get the key for authentication
when using the default "Shared Key" authentication mechanism.

1. Go to the Azure Portal.
1. Select "Storage Accounts".
1. Select the newly created storage account.
1. In the list of settings, locate "Access Keys" and select that.
1. Copy one of the access keys to the clipboard, then set it in the XML option,
   in cluster management tools, in a Hadoop JCEKS file, or in a KMS store.
### <a name="new_container"></a> Creating a new container

An Azure storage account can have multiple containers, each with the container
name as the userinfo field of the URI used to reference it.

For example, the container "container1" in the storage account just created
will have the URL `abfs://container1@abfswales1.dfs.core.windows.net/`.

You can create a new container through the ABFS connector, by setting the option
`fs.azure.createRemoteFileSystemDuringInitialization` to `true`.

If the container does not exist, an attempt to list it with `hadoop fs -ls`
will fail:

```
$ hadoop fs -ls abfs://container1@abfswales1.dfs.core.windows.net/

ls: `abfs://container1@abfswales1.dfs.core.windows.net/': No such file or directory
```

Enable remote FS creation and the second attempt succeeds, creating the container as it does so:

```
$ hadoop fs -D fs.azure.createRemoteFileSystemDuringInitialization=true \
    -ls abfs://container1@abfswales1.dfs.core.windows.net/
```

This is useful for creating accounts on the command line, especially before
the `az storage` command supports hierarchical namespaces completely.
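
If you always want containers created on first use, the same option can be set
in `core-site.xml` rather than passed on each command; a minimal sketch using
the property named above:

```xml
<property>
  <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
  <value>true</value>
  <description>
    Create the container during client initialization if it does
    not already exist.
  </description>
</property>
```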

### Listing and examining containers of a Storage Account

You can use the [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/).

## <a name="configuring"></a> Configuring ABFS

Any configuration can be specified generally (or as the default when accessing all accounts)
or can be tied to a specific account.
For example, an OAuth identity can be configured for use regardless of which
account is accessed with the property `fs.azure.account.oauth2.client.id`,
or you can configure an identity to be used only for a specific storage account with
`fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net`.

This is shown in the Authentication section.
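
As a sketch of how general and account-specific settings combine (the account
name `abfswales1` is the one created above; the values here are only illustrative):

```xml
<!-- default authentication type for all storage accounts -->
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
</property>

<!-- override for one specific account -->
<property>
  <name>fs.azure.account.auth.type.abfswales1.dfs.core.windows.net</name>
  <value>SharedKey</value>
</property>
```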

## <a name="authentication"></a> Authentication

Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios).

The concepts covered there are beyond the scope of this document;
developers are expected to have read and understood them
to take advantage of the different authentication mechanisms.

What is covered here, briefly, is how to configure the ABFS client to authenticate
in different deployment situations.

The ABFS client can be deployed in different ways, with its authentication needs
driven by them.

1. With the storage account's authentication secret in the configuration:
   "Shared Key".
1. Using OAuth 2.0 tokens of one form or another.
1. Deployed in-Azure with the Azure VMs providing OAuth 2.0 tokens to the application,
   "Managed Instance".

What can be changed is what secrets/credentials are used to authenticate the caller.

The authentication mechanism is set in `fs.azure.account.auth.type` (or the account-specific variant)
and, for the various OAuth options, in `fs.azure.account.oauth.provider.type`.

All secrets can be stored in JCEKS files. These are encrypted and password
protected; use them or a compatible Hadoop Key Management Store wherever
possible.
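
For example, the account key from earlier can be stored in a JCEKS file with
the `hadoop credential` command and then picked up through the credential
provider path; a sketch (the path `/home/alice/abfs.jceks` is just an example,
and you will be prompted for the secret and the keystore password):

```bash
# store the account key, encrypted, in a local JCEKS file
hadoop credential create fs.azure.account.key.abfswales1.dfs.core.windows.net \
  -provider jceks://file/home/alice/abfs.jceks

# point the client at the credential file when running commands
hadoop fs \
  -D hadoop.security.credential.provider.path=jceks://file/home/alice/abfs.jceks \
  -ls abfs://container1@abfswales1.dfs.core.windows.net/
```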

### <a name="shared-key-auth"></a> Default: Shared Key

This is the simplest authentication mechanism of account + password.

The account name is inferred from the URL;
the password, "key", is retrieved from the XML/JCEKS configuration files.

```xml
<property>
  <name>fs.azure.account.auth.type.abfswales1.dfs.core.windows.net</name>
  <value>SharedKey</value>
</property>
<property>
  <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
  <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
  <description>
    The secret password. Never share these.
  </description>
</property>
```

*Note*: The source of the account key can be changed through a custom key provider;
one exists to execute a shell script to retrieve it.

### <a name="oauth-client-credentials"></a> OAuth 2.0 Client Credentials

OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file.

The specifics of this process are covered
in [hadoop-azure-datalake](../hadoop-azure-datalake/index.html#Configuring_Credentials_and_FileSystem);
the key names are slightly different here.

```xml
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
    Use OAuth authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  <description>
    Use client credentials
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.endpoint</name>
  <value></value>
  <description>
    URL of OAuth endpoint
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value></value>
  <description>
    Client ID
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.secret</name>
  <value></value>
  <description>
    Secret
  </description>
</property>
```
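
To tie these credentials to a single storage account rather than making them
the default, suffix the property names with the account's host name, as
described in [Configuring ABFS](#configuring); a sketch:

```xml
<property>
  <name>fs.azure.account.auth.type.abfswales1.dfs.core.windows.net</name>
  <value>OAuth</value>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type.abfswales1.dfs.core.windows.net</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
</property>
```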

### <a name="oauth-user-and-passwd"></a> OAuth 2.0: Username and Password

An OAuth 2.0 endpoint, username and password are provided in the configuration/JCEKS file.

```xml
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
    Use OAuth authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.UserPasswordTokenProvider</value>
  <description>
    Use user and password
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.endpoint</name>
  <value></value>
  <description>
    URL of OAuth 2.0 endpoint
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.user.name</name>
  <value></value>
  <description>
    username
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.user.password</name>
  <value></value>
  <description>
    password for account
  </description>
</property>
```

### <a name="oauth-refresh-token"></a> OAuth 2.0: Refresh Token

With an existing OAuth 2.0 token, make a request to the Active Directory endpoint
`https://login.microsoftonline.com/Common/oauth2/token` for this token to be refreshed.

```xml
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
    Use OAuth 2.0 authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.RefreshTokenBasedTokenProvider</value>
  <description>
    Use the Refresh Token Provider
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.refresh.token</name>
  <value></value>
  <description>
    Refresh token
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value></value>
  <description>
    Optional Client ID
  </description>
</property>
```

### <a name="managed-identity"></a> Azure Managed Identity

[Azure Managed Identities](https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview), formerly "Managed Service Identities".

OAuth 2.0 tokens are issued by a special endpoint only accessible
from the executing VM (`http://169.254.169.254/metadata/identity/oauth2/token`).
The issued credentials can be used to authenticate.
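
To check from inside the VM that the identity endpoint is reachable and issuing
tokens, you can query it directly; a sketch (the `api-version` and `resource`
values here are assumptions, so check the Azure documentation for the current
ones):

```bash
# the Metadata header is mandatory; this call only works from within the VM
curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://storage.azure.com/"
```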

The Azure Portal/CLI is used to create the service identity.

```xml
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
    Use OAuth authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
  <description>
    Use MSI for issuing OAuth tokens
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.msi.tenant</name>
  <value></value>
  <description>
    Optional MSI Tenant ID
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value></value>
  <description>
    Optional Client ID
  </description>
</property>
```

### Custom OAuth 2.0 Token Provider

A Custom OAuth 2.0 token provider supplies the ABFS connector with an OAuth 2.0
token when its `getAccessToken()` method is invoked.

```xml
<property>
  <name>fs.azure.account.auth.type</name>
  <value>Custom</value>
  <description>
    Custom Authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value></value>
  <description>
    classname of Custom Authentication Provider
  </description>
</property>
```

The declared class must implement `org.apache.hadoop.fs.azurebfs.extensions.CustomTokenProviderAdaptee`
and optionally `org.apache.hadoop.fs.azurebfs.extensions.BoundDTExtension`.

## <a name="technical"></a> Technical notes

### <a name="proxy"></a> Proxy setup

The connector uses the JVM proxy settings to control its proxy setup.

See the [Oracle Java documentation](https://docs.oracle.com/javase/8/docs/technotes/guides/net/proxies.html) for the options to set.

As the connector uses HTTPS by default, the `https.proxyHost` and `https.proxyPort`
options are those which must be configured.

In MapReduce jobs, including distcp, the proxy options must be set in both
`mapreduce.map.java.opts` and `mapreduce.reduce.java.opts`.

```bash
# this variable is only here to avoid typing the same values twice.
# Its name is not important.
export DISTCP_PROXY_OPTS="-Dhttps.proxyHost=web-proxy.example.com -Dhttps.proxyPort=80"

hadoop distcp \
  -D mapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
  -D mapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
  -update -skipcrccheck -numListstatusThreads 40 \
  hdfs://namenode:8020/users/alice abfs://backups@account.dfs.core.windows.net/users/alice
```

Without these settings, even though access to ADLS may work from the command line,
`distcp` access can fail with network errors.

### <a name="security"></a> Security

As with other object stores, login secrets are valuable pieces of information.
Organizations should have a process for safely sharing them.

### <a name="limitations"></a> Limitations of the ABFS connector

* File last access time is not tracked.
* Extended attributes are not supported.
* File checksums are not supported.
* The `Syncable` interface's `hsync()` and `hflush()` operations are supported if
  `fs.azure.enable.flush` is set to true (default=true). With the WASB connector,
  this limited the number of times either call could be made to 50,000
  ([HADOOP-15478](https://issues.apache.org/jira/browse/HADOOP-15478)).
  If ABFS has a similar limit, then excessive use of sync/flush may
  cause problems; the option can be used to disable flushing, as sketched below.
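
A sketch of turning the flush calls off, on the assumption that with the option
disabled `hsync()`/`hflush()` no longer force data to be persisted on every call:

```xml
<property>
  <name>fs.azure.enable.flush</name>
  <value>false</value>
  <description>
    Assumption: when false, hsync()/hflush() calls are downgraded
    to no-ops, avoiding any per-call limit, at the cost of the
    durability guarantees of Syncable.
  </description>
</property>
```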

### <a name="consistency"></a> Consistency and Concurrency

As with all Azure storage services, the Azure Datalake Gen 2 store offers
a fully consistent view of the store, with complete
Create, Read, Update, and Delete consistency for data and metadata.
(Compare and contrast with S3, which only offers Create consistency;
S3Guard adds CRUD consistency to metadata, but not to the underlying data.)

### <a name="performance"></a> Performance and Scalability

For containers with hierarchical namespaces,
the scalability numbers are, in Big-O-notation, as follows:

| Operation | Scalability |
|-----------|-------------|
| File Rename | `O(1)` |
| File Delete | `O(1)` |
| Directory Rename | `O(1)` |
| Directory Delete | `O(1)` |

For non-namespace stores, the scalability becomes:

| Operation | Scalability |
|-----------|-------------|
| File Rename | `O(1)` |
| File Delete | `O(1)` |
| Directory Rename | `O(files)` |
| Directory Delete | `O(files)` |

That is: the more files there are, the slower directory operations get.

Further reading: [Azure Storage Scalability Targets](https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets?toc=%2fazure%2fstorage%2fqueues%2ftoc.json)

### <a name="extensibility"></a> Extensibility

The ABFS connector supports a number of limited-private/unstable extension
points for third-parties to integrate their authentication and authorization
services into the ABFS client.

* `CustomDelegationTokenManager`: adds the ability to issue Hadoop Delegation Tokens.
* `AbfsAuthorizer`: permits client-side authorization of file operations.
* `CustomTokenProviderAdaptee`: allows for custom provision of
  Azure OAuth tokens.
* `KeyProvider`: allows for a custom source of account keys.

Consult the source in `org.apache.hadoop.fs.azurebfs.extensions`
and all associated tests to see how to make use of these extension points.

_Warning_: These extension points are unstable.

## <a name="options"></a> Other configuration options

Consult the javadocs for `org.apache.hadoop.fs.azurebfs.constants.ConfigurationKeys`,
`org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations` and
`org.apache.hadoop.fs.azurebfs.AbfsConfiguration` for the full list
of configuration options and their default values.

## <a name="troubleshooting"></a> Troubleshooting

The problems associated with the connector usually come down to, in order:

1. Classpath.
1. Network setup (proxy etc.).
1. Authentication and Authorization.
1. Anything else.

If you log `org.apache.hadoop.fs.azurebfs.services` at `DEBUG` then you will
see more details about any request which is failing.
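
With the default `log4j.properties` used by the Hadoop scripts, that means
adding a line such as:

```
log4j.logger.org.apache.hadoop.fs.azurebfs.services=DEBUG
```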

One useful tool for debugging connectivity is the [cloudstore storediag utility](https://github.com/steveloughran/cloudstore/releases).

This validates the classpath, the settings, then tries to work with the filesystem.

```bash
bin/hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag abfs://container@account.dfs.core.windows.net/
```

1. If the `storediag` command cannot work with an abfs store, nothing else is likely to.
1. If the `storediag` command does work, that does not guarantee that the classpath
   or configuration on the rest of the cluster is also going to work, especially
   in distributed applications. But it is at least a start.

### `ClassNotFoundException: org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem`

The `hadoop-azure` JAR is not on the classpath.

```
java.lang.RuntimeException: java.lang.ClassNotFoundException:
    Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2625)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3290)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3322)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:136)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3373)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3341)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:491)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
Caused by: java.lang.ClassNotFoundException:
    Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2529)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2623)
  ... 16 more
```

Tip: if this is happening on the command line, you can turn on debug logging
of the hadoop scripts:

```bash
export HADOOP_SHELL_SCRIPT_DEBUG=true
```

If this is happening in an application running within the cluster, it means
the cluster (somehow) needs to be configured so that the `hadoop-azure`
module and dependencies are on the classpath of deployed applications.
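
A quick way to see whether the azure artifacts are on a given host's classpath
is to expand the output of `hadoop classpath` (a sketch):

```bash
# print one classpath entry per line and filter for the azure JARs
hadoop classpath | tr ':' '\n' | grep -i azure
```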

### `ClassNotFoundException: com.microsoft.azure.storage.StorageErrorCode`

The `azure-storage` JAR is not on the classpath.

### `Server failed to authenticate the request`

The request wasn't authenticated while using the default shared-key
authentication mechanism.

```
Operation failed: "Server failed to authenticate the request.
Make sure the value of Authorization header is formed correctly including the signature.",
403, HEAD, https://account.dfs.core.windows.net/container2?resource=filesystem&timeout=90
  at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:135)
  at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getFilesystemProperties(AbfsClient.java:209)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFilesystemProperties(AzureBlobFileSystemStore.java:259)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.fileSystemExists(AzureBlobFileSystem.java:859)
  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:110)
```

Causes include:

* Your credentials are incorrect.
* Your shared secret has expired. In Azure, this happens automatically.
* Your shared secret has been revoked.
* Host/VM clock drift means that your client's clock is out of sync with the
  Azure servers, so the call is being rejected as either out of date (considered a replay)
  or from the future. Fix: check your clocks, etc.
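
One way to check for clock drift is to compare the local clock with the `Date`
header returned by an Azure endpoint (a sketch):

```bash
# local clock, in UTC
date -u
# server clock, from the HTTP response headers
curl -sI https://login.microsoftonline.com/ | grep -i '^date:'
```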

### `Configuration property _something_.dfs.core.windows.net not found`

There's no `fs.azure.account.key.` entry in your cluster configuration declaring the
access key for the specific account, or you are using the wrong URL.

```
$ hadoop fs -ls abfs://container@abfswales2.dfs.core.windows.net/

ls: Configuration property abfswales2.dfs.core.windows.net not found.
```

* Make sure that the URL is correct.
* Add the missing account key.

### `No such file or directory when trying to list a container`

There is no container of the given name. Either it has been mistyped
or the container needs to be created.

```
$ hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/

ls: `abfs://container@abfswales1.dfs.core.windows.net/': No such file or directory
```

* Make sure that the URL is correct.
* Create the container if needed.

### "HTTP connection to https://login.microsoftonline.com/_something_ failed for getting token from AzureAD. Http response: 200 OK"

This happens when the response has a content-type of `text/html`, `text/plain` or `application/xml`:
the OAuth authentication page didn't fail with an HTTP error code, but it didn't return JSON either.

```
$ bin/hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/

...

ls: HTTP Error 200;
url='https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize'
AADToken: HTTP connection to
https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize
failed for getting token from AzureAD.
Unexpected response.
Check configuration, URLs and proxy settings.
proxies=none;
requestId='dd9d526c-8b3d-4b3f-a193-0cf021938600';
contentType='text/html; charset=utf-8';
```

Likely causes are configuration and networking:

1. Authentication is failing: the caller is being served the Azure Active Directory
   sign-on page for humans, even though it is a machine calling.
1. The URL is wrong: it is pointing at a web page unrelated to OAuth 2.0.
1. There's a proxy server in the way trying to return helpful instructions.

## <a name="testing"></a> Testing ABFS

See the relevant section in [Testing Azure](testing_azure.html).
<!-- MACRO{toc|fromDepth=1|toDepth=3} -->

See also:

* [ABFS](./abfs.html)
* [Testing](./testing_azure.html)

## Introduction

The `hadoop-azure` module provides support for integration with
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
The built jar file, named `hadoop-azure.jar`, also declares transitive dependencies
on the additional artifacts it requires, notably the
[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).

To make it part of Apache Hadoop's default classpath, simply make sure that
`HADOOP_OPTIONAL_TOOLS` in `hadoop-env.sh` has `hadoop-azure` in the list.
Example:

```bash
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
```

## Features

* Read and write data stored in an Azure Blob Storage account.