HADOOP-17004. ABFS: Improve the ABFS driver documentation

Contributed by Bilahari T H.
bilaharith 2020-05-19 09:15:54 +05:30 committed by GitHub
parent 7bb902bc0d
commit bdbd59cfa0
1 changed files with 130 additions and 3 deletions

@ -257,7 +257,8 @@ will have the URL `abfs://container1@abfswales1.dfs.core.windows.net/`
You can create a new container through the ABFS connector, by setting the option
`fs.azure.createRemoteFileSystemDuringInitialization` to `true`.
`fs.azure.createRemoteFileSystemDuringInitialization` to `true`. This option is
not supported when the AuthType is SAS.
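As a minimal sketch, the option might be set in the Hadoop configuration as shown here. The description text is illustrative and not part of the original documentation.

```xml
<property>
  <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
  <value>true</value>
  <!-- Illustrative description; per the text above this is not supported with SAS -->
  <description>Create the container during initialization if it does not exist</description>
</property>
```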
If the container does not exist, an attempt to list it with `hadoop fs -ls`
will fail
@ -317,8 +318,13 @@ driven by them.
What can be changed is what secrets/credentials are used to authenticate the caller.
The authentication mechanism is set in `fs.azure.account.auth.type` (or the account specific variant),
and, for the various OAuth options `fs.azure.account.oauth.provider.type`
The authentication mechanism is set in `fs.azure.account.auth.type` (or the
account-specific variant). The possible values are SharedKey, OAuth, Custom
and SAS. For the various OAuth options, set the token provider in the config
`fs.azure.account.oauth.provider.type`. The supported implementations are
ClientCredsTokenProvider, UserPasswordTokenProvider, MsiTokenProvider and
RefreshTokenBasedTokenProvider. An IllegalArgumentException is thrown if
the specified provider type is not one of these.
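The following sketch shows how the two settings might be combined to select OAuth with the client credentials provider. The fully qualified package name is an assumption, since the text above names only the class.

```xml
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <!-- Assumption: the provider classes live in org.apache.hadoop.fs.azurebfs.oauth2 -->
  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
</property>
```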
All secrets can be stored in JCEKS files. These are encrypted and password
protected —use them or a compatible Hadoop Key Management Store wherever
@ -350,6 +356,15 @@ the password, "key", retrieved from the XML/JCECKs configuration files.
*Note*: The source of the account key can be changed through a custom key provider;
one exists to execute a shell script to retrieve it.
A custom key provider class can be specified with the config
`fs.azure.account.keyprovider`. If a key provider class is specified, it is
used to get the account key. Otherwise the Simple key provider is used, which
reads the key from the config `fs.azure.account.key`.
To retrieve the key using a shell script, specify the path to the script in the
config `fs.azure.shellkeyprovider.script`. The ShellDecryptionKeyProvider class
uses the specified script to retrieve the key.
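A shell-script-based setup could look roughly like the sketch below. The package of ShellDecryptionKeyProvider and the script path are assumptions made for illustration, and in practice the key provider config may need the account-specific suffix.

```xml
<property>
  <name>fs.azure.account.keyprovider</name>
  <!-- Assumption: package name is not given in the text above -->
  <value>org.apache.hadoop.fs.azurebfs.services.ShellDecryptionKeyProvider</value>
</property>
<property>
  <name>fs.azure.shellkeyprovider.script</name>
  <!-- Hypothetical script path -->
  <value>/path/to/decrypt-key.sh</value>
</property>
```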
### <a name="oauth-client-credentials"></a> OAuth 2.0 Client Credentials
OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file.
@ -465,6 +480,13 @@ With an existing Oauth 2.0 token, make a request of the Active Directory endpoin
Refresh token
</description>
</property>
<property>
<name>fs.azure.account.oauth2.refresh.endpoint</name>
<value></value>
<description>
Refresh token endpoint
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.id</name>
<value></value>
@ -506,6 +528,13 @@ The Azure Portal/CLI is used to create the service identity.
Optional MSI Tenant ID
</description>
</property>
<property>
<name>fs.azure.account.oauth2.msi.endpoint</name>
<value></value>
<description>
MSI endpoint
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.id</name>
<value></value>
@ -542,6 +571,26 @@ and optionally `org.apache.hadoop.fs.azurebfs.extensions.BoundDTExtension`.
The declared class also holds responsibility to implement retry logic while fetching access tokens.
### <a name="delegationtokensupportconfigoptions"></a> Delegation Token Provider
A delegation token provider supplies the ABFS connector with delegation tokens
and helps renew and cancel them by implementing the
CustomDelegationTokenManager interface.
```xml
<property>
<name>fs.azure.enable.delegation.token</name>
<value>true</value>
<description>Make this true to use delegation token provider</description>
</property>
<property>
<name>fs.azure.delegation.token.provider.type</name>
<value>{fully-qualified-class-name-for-implementation-of-CustomDelegationTokenManager-interface}</value>
</property>
```
If delegation tokens are enabled and the config
`fs.azure.delegation.token.provider.type` is not provided, an
IllegalArgumentException is thrown.
### Shared Access Signature (SAS) Token Provider
A Shared Access Signature (SAS) token provider supplies the ABFS connector with SAS
@ -691,6 +740,84 @@ Config `fs.azure.account.hns.enabled` provides an option to specify whether
Config `fs.azure.enable.check.access` needs to be set true to enable
the AzureBlobFileSystem.access().
### <a name="featureconfigoptions"></a> Primary User Group Options
The group name that is part of FileStatus and AclStatus is set to the username
if the config `fs.azure.skipUserGroupMetadataDuringInitialization` is set to
true.
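For illustration, enabling this behaviour might look like the following fragment. The description text is an assumption.

```xml
<property>
  <name>fs.azure.skipUserGroupMetadataDuringInitialization</name>
  <value>true</value>
  <!-- Illustrative description -->
  <description>Report the username as the group name in FileStatus and AclStatus</description>
</property>
```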
### <a name="ioconfigoptions"></a> IO Options
The following configs are related to read and write operations.
`fs.azure.io.retry.max.retries`: Sets the number of retries for IO operations.
Currently this is used only for the server call retry logic, within the
AbfsClient class as part of the ExponentialRetryPolicy. The value should be
greater than or equal to 0.
`fs.azure.write.request.size`: Sets the write buffer size in bytes. The value
should be between 16384 and 104857600, both inclusive (16 KB to 100 MB). The
default value is 8388608 (8 MB).
`fs.azure.read.request.size`: Sets the read buffer size in bytes. The value
should be between 16384 and 104857600, both inclusive (16 KB to 100 MB). The
default value is 4194304 (4 MB).
`fs.azure.readaheadqueue.depth`: Sets the readahead queue depth in
AbfsInputStream. If the configured value is negative, the readahead queue depth
is set to Runtime.getRuntime().availableProcessors(). The default value is -1.
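A sketch of these IO settings together is shown below. The buffer sizes and queue depth are the defaults stated above; the retry count is an illustrative value only, as the text above does not state its default.

```xml
<property>
  <name>fs.azure.io.retry.max.retries</name>
  <!-- Illustrative value; must be >= 0 -->
  <value>30</value>
</property>
<property>
  <name>fs.azure.write.request.size</name>
  <!-- 8 MB, within the allowed 16384..104857600 range -->
  <value>8388608</value>
</property>
<property>
  <name>fs.azure.read.request.size</name>
  <!-- 4 MB, within the allowed 16384..104857600 range -->
  <value>4194304</value>
</property>
<property>
  <name>fs.azure.readaheadqueue.depth</name>
  <!-- Negative: fall back to the number of available processors -->
  <value>-1</value>
</property>
```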
### <a name="securityconfigoptions"></a> Security Options
`fs.azure.always.use.https`: Enforces the use of HTTPS instead of HTTP when the
flag is set to true. Irrespective of the flag, AbfsClient uses HTTPS if the
secure scheme (ABFSS) is used or OAuth is used for authentication. By default
this is set to true.
`fs.azure.ssl.channel.mode`: Initializes DelegatingSSLSocketFactory with the
specified SSL channel mode. The value should be one of the values of the enum
DelegatingSSLSocketFactory.SSLChannelMode. The default value is
DelegatingSSLSocketFactory.SSLChannelMode.Default.
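As a hedged example, the two security options might be written as follows, using the default values given above. Writing the enum value as the bare constant name `Default` is an assumption about how the config is parsed.

```xml
<property>
  <name>fs.azure.always.use.https</name>
  <value>true</value>
</property>
<property>
  <name>fs.azure.ssl.channel.mode</name>
  <!-- Assumption: the value is the SSLChannelMode constant name -->
  <value>Default</value>
</property>
```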
### <a name="serverconfigoptions"></a> Server Options
When the config `fs.azure.io.read.tolerate.concurrent.append` is set to true,
the If-Match header sent to the server for read calls is set to `*`; otherwise
it is set to the ETag. This is a mechanism for handling reads with optimistic
concurrency.
Please refer to the following links for further information.
1. https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/read
2. https://azure.microsoft.com/de-de/blog/managing-concurrency-in-microsoft-azure-storage-2/
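A minimal sketch of enabling tolerant reads follows. The description text is illustrative.

```xml
<property>
  <name>fs.azure.io.read.tolerate.concurrent.append</name>
  <value>true</value>
  <!-- Illustrative description: If-Match is sent as * rather than the ETag -->
  <description>Tolerate concurrent appends while reading</description>
</property>
```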
The listStatus API fetches the FileStatus information from the server page by
page. The config `fs.azure.list.max.results` is used to set the maxResults URI
param, which sets the page size (maximum results per call). The value should
be greater than 0. By default this is 500. The server has a maximum value of
5000 for this parameter, so even if the config is above 5000 the response will
contain at most 5000 entries. Please refer to the following link for further
information.
https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list
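For example, raising the page size to the server-side maximum described above might look like this sketch:

```xml
<property>
  <name>fs.azure.list.max.results</name>
  <!-- Must be > 0; values above 5000 are capped by the server -->
  <value>5000</value>
</property>
```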
### <a name="throttlingconfigoptions"></a> Throttling Options
The ABFS driver has the capability to throttle read and write operations to
achieve maximum throughput by minimizing errors. The errors occur when the
account ingress or egress limits are exceeded and the server throttles requests.
Server-side throttling causes the retry policy to be used, but the retry policy
sleeps for long periods of time, causing the total ingress or egress throughput
to be as much as 35% lower than optimal. The retry policy is also after the
fact, in that it applies after a request fails. On the other hand, the
client-side throttling implemented here happens before requests are made and
sleeps just enough to minimize errors, allowing optimal ingress and/or egress
throughput. By default the throttling mechanism is enabled in the driver. It
can be disabled by setting the config `fs.azure.enable.autothrottling`
to false.
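Disabling client-side throttling, as described above, might be sketched as:

```xml
<property>
  <name>fs.azure.enable.autothrottling</name>
  <!-- Enabled by default; false turns client-side throttling off -->
  <value>false</value>
</property>
```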
### <a name="renameconfigoptions"></a> Rename Options
`fs.azure.atomic.rename.key`: Directories for atomic rename support can be
specified as comma-separated values in this config. The driver logs the
following warning if the source of the rename belongs to one of the configured
directories: "The atomic rename feature is not supported by the ABFS scheme;
however, rename, create and delete operations are atomic if Namespace is
enabled for your Azure Storage account."
By default the value is "/hbase".
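A sketch with two directories is shown below. `/hbase` is the default mentioned above, while `/data/atomic` is a hypothetical directory added only to show the comma-separated form.

```xml
<property>
  <name>fs.azure.atomic.rename.key</name>
  <!-- /data/atomic is a hypothetical example directory -->
  <value>/hbase,/data/atomic</value>
</property>
```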
### <a name="perfoptions"></a> Perf Options
#### <a name="abfstracklatencyoptions"></a> 1. HTTP Request Tracking Options