From bdbd59cfa0904860fc4ce7a2afef1e84f35b8b82 Mon Sep 17 00:00:00 2001
From: bilaharith <52483117+bilaharith@users.noreply.github.com>
Date: Tue, 19 May 2020 09:15:54 +0530
Subject: [PATCH] HADOOP-17004. ABFS: Improve the ABFS driver documentation

Contributed by Bilahari T H.
---
 .../hadoop-azure/src/site/markdown/abfs.md | 133 +++++++++++++++++-
 1 file changed, 130 insertions(+), 3 deletions(-)

diff --git a/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md b/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md
index 89f52e7e84d..93141f1d1c5 100644
--- a/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md
+++ b/hadoop-tools/hadoop-azure/src/site/markdown/abfs.md
@@ -257,7 +257,8 @@ will have the URL `abfs://container1@abfswales1.dfs.core.windows.net/`


 You can create a new container through the ABFS connector, by setting the option
- `fs.azure.createRemoteFileSystemDuringInitialization` to `true`.
+ `fs.azure.createRemoteFileSystemDuringInitialization` to `true`. This option
+ is not supported when the auth type is SAS.

 If the container does not exist, an attempt to list it with `hadoop fs -ls`
 will fail
@@ -317,8 +318,13 @@ driven by them.

 What can be changed is what secrets/credentials are used to authenticate the caller.

-The authentication mechanism is set in `fs.azure.account.auth.type` (or the account specific variant),
-and, for the various OAuth options `fs.azure.account.oauth.provider.type`
+The authentication mechanism is set in `fs.azure.account.auth.type` (or the
+account specific variant). The possible values are SharedKey, OAuth, Custom
+and SAS. For the various OAuth options, use the config
+`fs.azure.account.oauth.provider.type`. The supported implementations are
+ClientCredsTokenProvider, UserPasswordTokenProvider, MsiTokenProvider and
+RefreshTokenBasedTokenProvider. An IllegalArgumentException is thrown if
+the specified provider type is not one of the supported implementations.

 All secrets can be stored in JCEKS files.
 These are encrypted and password protected —use them or a compatible Hadoop Key Management Store wherever
@@ -350,6 +356,15 @@ the password, "key", retrieved from the XML/JCECKs configuration files.
 *Note*: The source of the account key can be changed through a custom key provider;
 one exists to execute a shell script to retrieve it.

+A custom key provider class can be specified with the config
+`fs.azure.account.keyprovider`. If a key provider class is specified, it
+is used to get the account key. Otherwise the Simple key provider is used,
+which reads the key specified for the config `fs.azure.account.key`.
+
+To retrieve the key using a shell script, specify the path to the script in the
+config `fs.azure.shellkeyprovider.script`. The ShellDecryptionKeyProvider class
+uses the specified script to retrieve the key.
+
 ### OAuth 2.0 Client Credentials

 OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file.
@@ -465,6 +480,13 @@ With an existing Oauth 2.0 token, make a request of the Active Directory endpoin
   Refresh token
   </description>
 </property>
+<property>
+  <name>fs.azure.account.oauth2.refresh.endpoint</name>
+  <value></value>
+  <description>
+  Refresh token endpoint
+  </description>
+</property>
 <property>
   <name>fs.azure.account.oauth2.client.id</name>
   <value></value>
@@ -506,6 +528,13 @@ The Azure Portal/CLI is used to create the service identity.
   Optional MSI Tenant ID
   </description>
 </property>
+<property>
+  <name>fs.azure.account.oauth2.msi.endpoint</name>
+  <value></value>
+  <description>
+  MSI endpoint
+  </description>
+</property>
 <property>
   <name>fs.azure.account.oauth2.client.id</name>
   <value></value>
@@ -542,6 +571,26 @@ and optionally `org.apache.hadoop.fs.azurebfs.extensions.BoundDTExtension`.
 The declared class also holds responsibility to implement retry logic while fetching access tokens.

+### Delegation Token Provider
+
+A delegation token provider supplies the ABFS connector with delegation tokens
+and helps renew and cancel them by implementing the
+CustomDelegationTokenManager interface.
+
+```xml
+<property>
+  <name>fs.azure.enable.delegation.token</name>
+  <value>true</value>
+  <description>Make this true to use delegation token provider</description>
+</property>
+<property>
+  <name>fs.azure.delegation.token.provider.type</name>
+  <value>{fully-qualified-class-name-for-implementation-of-CustomDelegationTokenManager-interface}</value>
+</property>
+```
+If delegation tokens are enabled and `fs.azure.delegation.token.provider.type`
+is not provided, an IllegalArgumentException is thrown.
+
 ### Shared Access Signature (SAS) Token Provider

 A Shared Access Signature (SAS) token provider supplies the ABFS connector with SAS
@@ -691,6 +740,84 @@ Config `fs.azure.account.hns.enabled` provides an option to specify whether
 Config `fs.azure.enable.check.access` needs to be set true to enable
  the AzureBlobFileSystem.access().

+### Primary User Group Options
+The group name that is part of FileStatus and AclStatus is set to the
+username if the config
+`fs.azure.skipUserGroupMetadataDuringInitialization` is set to true.
+
+### IO Options
+The following configs are related to read and write operations.
+
+`fs.azure.io.retry.max.retries`: Sets the number of retries for IO operations.
+Currently this is used only for the server call retry logic, within the
+AbfsClient class as part of the ExponentialRetryPolicy. The value should be
+non-negative (>= 0).
+
+`fs.azure.write.request.size`: Sets the write buffer size. Specify the value
+in bytes. The value should be between 16384 and 104857600, both inclusive (16 KB
+to 100 MB). The default value is 8388608 (8 MB).
+
+`fs.azure.read.request.size`: Sets the read buffer size. Specify the value in
+bytes. The value should be between 16384 and 104857600, both inclusive (16 KB to
+100 MB). The default value is 4194304 (4 MB).
+
+`fs.azure.readaheadqueue.depth`: Sets the readahead queue depth in
+AbfsInputStream. If the set value is negative, the readahead queue depth
+is set to Runtime.getRuntime().availableProcessors(). The default value
+is -1.
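+
+As an illustration of the IO options above, here is a minimal sketch of a
+configuration fragment. It assumes the settings are kept in a Hadoop
+configuration file such as `core-site.xml`. The buffer sizes and queue depth
+are the defaults stated above; the retry count of 10 is an arbitrary
+illustrative value, not a documented default.
+
+```xml
+<property>
+  <name>fs.azure.io.retry.max.retries</name>
+  <!-- Illustrative value only; must be >= 0 -->
+  <value>10</value>
+</property>
+<property>
+  <name>fs.azure.write.request.size</name>
+  <!-- 8 MB, in bytes; allowed range is 16384 to 104857600 -->
+  <value>8388608</value>
+</property>
+<property>
+  <name>fs.azure.read.request.size</name>
+  <!-- 4 MB, in bytes; allowed range is 16384 to 104857600 -->
+  <value>4194304</value>
+</property>
+<property>
+  <name>fs.azure.readaheadqueue.depth</name>
+  <!-- Negative value: fall back to the number of available processors -->
+  <value>-1</value>
+</property>
+```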
+
+### Security Options
+`fs.azure.always.use.https`: Enforces the use of HTTPS instead of HTTP when the
+flag is set to true. Irrespective of the flag, AbfsClient will use HTTPS if the
+secure scheme (ABFSS) is used or OAuth is used for authentication. By default
+this is set to true.
+
+`fs.azure.ssl.channel.mode`: Initializes DelegatingSSLSocketFactory with the
+specified SSL channel mode. The value should be of the enum
+DelegatingSSLSocketFactory.SSLChannelMode. The default value is
+DelegatingSSLSocketFactory.SSLChannelMode.Default.
+
+### Server Options
+When the config `fs.azure.io.read.tolerate.concurrent.append` is set to true,
+the If-Match header sent to the server for read calls is set to `*`; otherwise
+it is set to the ETag. This is a mechanism to handle reads with optimistic
+concurrency.
+Please refer to the following links for further information.
+1. https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/read
+2. https://azure.microsoft.com/de-de/blog/managing-concurrency-in-microsoft-azure-storage-2/
+
+The listStatus API fetches the FileStatus information from the server page by
+page. The config `fs.azure.list.max.results` sets the maxResults URI
+parameter, which sets the page size (maximum results per call). The value should
+be > 0. The default is 500. The server caps this parameter at 5000, so
+even if the config is above 5000 the response will contain at most 5000
+entries. Please refer to the following link for further information.
+https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list
+
+### Throttling Options
+The ABFS driver can throttle read and write operations to achieve
+maximum throughput by minimizing errors. The errors occur when the account
+ingress or egress limits are exceeded and the server side throttles requests.
+Server-side throttling causes the retry policy to be used, but the retry policy
+sleeps for long periods of time, causing the total ingress or egress throughput
+to be as much as 35% lower than optimal. The retry policy is also after the
+fact, in that it applies after a request fails. On the other hand, the
+client-side throttling implemented here happens before requests are made and
+sleeps just enough to minimize errors, allowing optimal ingress and/or egress
+throughput. By default the throttling mechanism is enabled in the driver. It
+can be disabled by setting the config `fs.azure.enable.autothrottling`
+to false.
+
+### Rename Options
+`fs.azure.atomic.rename.key`: Directories for atomic rename support can be
+specified as comma-separated values in this config. The driver prints the
+following warning log if the source of the rename belongs to one of the
+configured directories: "The atomic rename feature is not supported by the ABFS
+scheme; however, rename, create and delete operations are atomic if Namespace
+is enabled for your Azure Storage account."
+By default the value is "/hbase".
+
 ### Perf Options

 #### 1. HTTP Request Tracking Options