HADOOP-14491. Azure has messed doc structure. Contributed by Mingliang Liu

Mingliang Liu 2017-06-05 18:24:40 -07:00 committed by Xiaoyu Yao
parent 756ff412af
commit 974f33add2
1 changed files with 124 additions and 111 deletions

@ -14,20 +14,9 @@
# Hadoop Azure Support: Azure Blob Storage # Hadoop Azure Support: Azure Blob Storage
* [Introduction](#Introduction) <!-- MACRO{toc|fromDepth=1|toDepth=3} -->
* [Features](#Features)
* [Limitations](#Limitations)
* [Usage](#Usage)
* [Concepts](#Concepts)
* [Configuring Credentials](#Configuring_Credentials)
* [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
* [Atomic Folder Rename](#Atomic_Folder_Rename)
* [Accessing wasb URLs](#Accessing_wasb_URLs)
* [Append API Support and Configuration](#Append_API_Support_and_Configuration)
* [Multithread Support](#Multithread_Support)
* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)
## <a name="Introduction" />Introduction ## Introduction
The hadoop-azure module provides support for integration with The hadoop-azure module provides support for integration with
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/). [Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
@ -38,7 +27,7 @@ on the additional artifacts it requires, notably the
To make it part of Apache Hadoop's default classpath, simply make sure that To make it part of Apache Hadoop's default classpath, simply make sure that
HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has 'hadoop-azure' in the list. HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has 'hadoop-azure' in the list.
## <a name="Features" />Features ## Features
* Read and write data stored in an Azure Blob Storage account. * Read and write data stored in an Azure Blob Storage account.
* Present a hierarchical file system view by implementing the standard Hadoop * Present a hierarchical file system view by implementing the standard Hadoop
@ -54,15 +43,15 @@ HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has 'hadoop-azure' in the list.
* Tested on both Linux and Windows. * Tested on both Linux and Windows.
* Tested at scale. * Tested at scale.
## <a name="Limitations" />Limitations ## Limitations
* File owner and group are persisted, but the permissions model is not enforced. * File owner and group are persisted, but the permissions model is not enforced.
Authorization occurs at the level of the entire Azure Blob Storage account. Authorization occurs at the level of the entire Azure Blob Storage account.
* File last access time is not tracked. * File last access time is not tracked.
## <a name="Usage" />Usage ## Usage
### <a name="Concepts" />Concepts ### Concepts
The Azure Blob Storage data model presents 3 core concepts: The Azure Blob Storage data model presents 3 core concepts:
@ -76,7 +65,7 @@ The Azure Blob Storage data model presents 3 core concepts:
The internal implementation also uses blobs to persist the file system The internal implementation also uses blobs to persist the file system
hierarchy and other metadata. hierarchy and other metadata.
### <a name="Configuring_Credentials" />Configuring Credentials ### Configuring Credentials
Usage of Azure Blob Storage requires configuration of credentials. Typically Usage of Azure Blob Storage requires configuration of credentials. Typically
this is set in core-site.xml. The configuration property name is of the form this is set in core-site.xml. The configuration property name is of the form
@ -87,11 +76,12 @@ untrusted party.**
For example: For example:
<property> ```xml
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name> <property>
<value>YOUR ACCESS KEY</value> <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
</property> <value>YOUR ACCESS KEY</value>
</property>
```
In many Hadoop clusters, the core-site.xml file is world-readable. It is possible to In many Hadoop clusters, the core-site.xml file is world-readable. It is possible to
protect the access key within a credential provider as well. This provides an encrypted protect the access key within a credential provider as well. This provides an encrypted
file format along with protection with file permissions. file format along with protection with file permissions.
@ -110,14 +100,14 @@ For additional reading on the credential provider API see:
###### provision ###### provision
``` ```bash
% hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123 % hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123
-provider localjceks://file/home/lmccay/wasb.jceks -provider localjceks://file/home/lmccay/wasb.jceks
``` ```
###### configure core-site.xml or command line system property ###### configure core-site.xml or command line system property
``` ```xml
<property> <property>
<name>hadoop.security.credential.provider.path</name> <name>hadoop.security.credential.provider.path</name>
<value>localjceks://file/home/lmccay/wasb.jceks</value> <value>localjceks://file/home/lmccay/wasb.jceks</value>
@ -127,7 +117,7 @@ For additional reading on the credential provider API see:
###### distcp ###### distcp
``` ```bash
% hadoop distcp % hadoop distcp
[-D hadoop.security.credential.provider.path=localjceks://file/home/lmccay/wasb.jceks] [-D hadoop.security.credential.provider.path=localjceks://file/home/lmccay/wasb.jceks]
hdfs://hostname:9001/user/lmccay/007020615 wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/ hdfs://hostname:9001/user/lmccay/007020615 wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/
@ -145,22 +135,25 @@ specifies an external program to be invoked by Hadoop processes to decrypt the
key. The encrypted key value is passed to this external program as a command key. The encrypted key value is passed to this external program as a command
line argument: line argument:
<property> ```xml
<name>fs.azure.account.keyprovider.youraccount</name> <property>
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value> <name>fs.azure.account.keyprovider.youraccount</name>
</property> <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property> <property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name> <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ENCRYPTED ACCESS KEY</value> <value>YOUR ENCRYPTED ACCESS KEY</value>
</property> </property>
<property> <property>
<name>fs.azure.shellkeyprovider.script</name> <name>fs.azure.shellkeyprovider.script</name>
<value>PATH TO DECRYPTION PROGRAM</value> <value>PATH TO DECRYPTION PROGRAM</value>
</property> </property>
### <a name="Page_Blob_Support_and_Configuration" />Page Blob Support and Configuration ```
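The external decryption program referenced by `fs.azure.shellkeyprovider.script` can be any executable. The sketch below is a hypothetical stand-in written in Python: it takes the encrypted key as its single command line argument, as described above, and prints the result. The file name `decrypt_key.py`, the `decrypt` helper, and the use of standard output to return the key are assumptions made for illustration, and Base64 decoding is only a placeholder for a real decryption step.

```python
#!/usr/bin/env python3
"""Hypothetical decryption program (decrypt_key.py) for illustration only."""
import base64
import sys


def decrypt(encrypted_key: str) -> str:
    # Placeholder only: Base64 is an encoding, NOT encryption. Replace this
    # body with a call to your real key-management or decryption tooling.
    return base64.b64decode(encrypted_key).decode("utf-8")


# The encrypted key value arrives as the first command line argument.
# Assumption: the caller reads the decrypted key from standard output.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(decrypt(sys.argv[1]))
```

With this placeholder, running `python3 decrypt_key.py c2VjcmV0` prints `secret`.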
### Page Blob Support and Configuration
The Azure Blob Storage interface for Hadoop supports two kinds of blobs, The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx). [block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
@ -182,10 +175,12 @@ folder names.
For example: For example:
<property> ```xml
<name>fs.azure.page.blob.dir</name> <property>
<value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value> <name>fs.azure.page.blob.dir</name>
</property> <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
</property>
```
You can set this to simply / to make all files page blobs. You can set this to simply / to make all files page blobs.
@ -197,7 +192,7 @@ The configuration option `fs.azure.page.blob.extension.size` is the page blob
extension size. This defines the amount to extend a page blob if it starts to extension size. This defines the amount to extend a page blob if it starts to
get full. It must be 128MB or greater, specified as an integer number of bytes. get full. It must be 128MB or greater, specified as an integer number of bytes.
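Because the extension size is given in bytes, it can help to compute the value rather than type it by hand. The following is a minimal sketch, assuming that 1 MB means 1024 * 1024 bytes; the helper name `extension_size_bytes` is hypothetical and not part of hadoop-azure.

```python
# Assumption: 1 MB = 1024 * 1024 bytes, so the 128 MB minimum is 134217728.
MIN_EXTENSION_BYTES = 128 * 1024 * 1024


def extension_size_bytes(megabytes: int) -> int:
    """Convert a size in MB to the integer byte count used in the config."""
    size = megabytes * 1024 * 1024
    if size < MIN_EXTENSION_BYTES:
        raise ValueError(
            f"{megabytes} MB is below the 128 MB minimum for "
            "fs.azure.page.blob.extension.size"
        )
    return size


print(extension_size_bytes(128))  # 134217728
print(extension_size_bytes(256))  # 268435456
```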
### <a name="Atomic_Folder_Rename" />Atomic Folder Rename ### Atomic Folder Rename
Azure storage stores files as a flat key/value store without formal support Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top for folders. The hadoop-azure file system layer simulates folders on top
@ -216,12 +211,14 @@ the intention of the rename operation, to allow redo in event of a failure.
For example: For example:
<property> ```xml
<name>fs.azure.atomic.rename.dir</name> <property>
<value>/hbase,/data</value> <name>fs.azure.atomic.rename.dir</name>
</property> <value>/hbase,/data</value>
</property>
```
### <a name="Accessing_wasb_URLs" />Accessing wasb URLs ### Accessing wasb URLs
After credentials are configured in core-site.xml, any Hadoop component may After credentials are configured in core-site.xml, any Hadoop component may
reference files in that Azure Blob Storage account by using URLs of the following reference files in that Azure Blob Storage account by using URLs of the following
@ -238,28 +235,32 @@ For example, the following
commands demonstrate access to a storage account named `youraccount` and a commands demonstrate access to a storage account named `youraccount` and a
container named `yourcontainer`. container named `yourcontainer`.
> hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir ```bash
% hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
> hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile % hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
> hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile % hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
test file content test file content
```
It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL. It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
This causes all bare paths, such as `/testDir/testFile` to resolve automatically This causes all bare paths, such as `/testDir/testFile` to resolve automatically
to that file system. to that file system.
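To make the URL layout concrete, the Python sketch below assembles and takes apart URLs shaped like the ones in the commands above. The helper names `wasb_url` and `parse_wasb_url` are hypothetical; only the `container@account.blob.core.windows.net/path` shape and the `wasb` and `wasbs` schemes come from the examples.

```python
from urllib.parse import urlsplit


def wasb_url(container: str, account: str, path: str, use_wasbs: bool = False) -> str:
    """Build a URL in the form used by the example commands."""
    scheme = "wasbs" if use_wasbs else "wasb"
    host = f"{account}.blob.core.windows.net"
    return f"{scheme}://{container}@{host}/{path.lstrip('/')}"


def parse_wasb_url(url: str):
    """Return (container, host, path) for a wasb or wasbs URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a wasb/wasbs URL: {url}")
    # The container sits in the user-info position, before the '@'.
    return parts.username, parts.hostname, parts.path


url = wasb_url("yourcontainer", "youraccount", "/testDir/testFile")
print(url)
print(parse_wasb_url(url))
```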
### <a name="Append_API_Support_and_Configuration" />Append API Support and Configuration ### Append API Support and Configuration
The Azure Blob Storage interface for Hadoop has optional support for Append API for The Azure Blob Storage interface for Hadoop has optional support for Append API for
single writer by setting the configuration `fs.azure.enable.append.support` to true. single writer by setting the configuration `fs.azure.enable.append.support` to true.
For example: For example:
<property> ```xml
<name>fs.azure.enable.append.support</name> <property>
<value>true</value> <name>fs.azure.enable.append.support</name>
</property> <value>true</value>
</property>
```
It must be noted Append support in Azure Blob Storage interface DIFFERS FROM HDFS SEMANTICS. Append It must be noted Append support in Azure Blob Storage interface DIFFERS FROM HDFS SEMANTICS. Append
support does not enforce single writer internally but requires applications to guarantee this semantic. support does not enforce single writer internally but requires applications to guarantee this semantic.
@ -267,25 +268,29 @@ It becomes a responsibility of the application either to ensure single-threaded
file path, or rely on some external locking mechanism of its own. Failure to do so will result in file path, or rely on some external locking mechanism of its own. Failure to do so will result in
unexpected behavior. unexpected behavior.
### <a name="Multithread_Support" />Multithread Support ### Multithread Support
Rename and delete operations on directories with a large number of files and subdirectories are currently very slow, because they are performed serially, one blob at a time. These files and subfolders can be deleted or renamed in parallel. The following configurations enable threads for parallel processing. Rename and delete operations on directories with a large number of files and subdirectories are currently very slow, because they are performed serially, one blob at a time. These files and subfolders can be deleted or renamed in parallel. The following configurations enable threads for parallel processing.
To enable 10 threads for the delete operation, set the property below. Set the value to 0 or 1 to disable threads. By default, threads are disabled. To enable 10 threads for the delete operation, set the property below. Set the value to 0 or 1 to disable threads. By default, threads are disabled.
<property> ```xml
<name>fs.azure.delete.threads</name> <property>
<value>10</value> <name>fs.azure.delete.threads</name>
</property> <value>10</value>
</property>
```
To enable 20 threads for the rename operation, set the property below. Set the value to 0 or 1 to disable threads. By default, threads are disabled. To enable 20 threads for the rename operation, set the property below. Set the value to 0 or 1 to disable threads. By default, threads are disabled.
<property> ```xml
<name>fs.azure.rename.threads</name> <property>
<value>20</value> <name>fs.azure.rename.threads</name>
</property> <value>20</value>
</property>
```
### <a name="WASB_SECURE_MODE" />WASB Secure mode and configuration ### WASB Secure mode and configuration
WASB can operate in secure mode, where the storage access keys required to communicate with Azure storage do not have to WASB can operate in secure mode, where the storage access keys required to communicate with Azure storage do not have to
be in the same address space as the process using WASB. In this mode, all interactions with Azure storage are performed using be in the same address space as the process using WASB. In this mode, all interactions with Azure storage are performed using
@ -295,30 +300,32 @@ Remote mode, however for testing purposes the local mode can be enabled to gener
To enable secure mode, the following property needs to be set to true. To enable secure mode, the following property needs to be set to true.
``` ```xml
<property> <property>
<name>fs.azure.secure.mode</name> <name>fs.azure.secure.mode</name>
<value>true</value> <value>true</value>
</property> </property>
``` ```
To enable local SAS key generation, the following property needs to be set to true. To enable local SAS key generation, the following property needs to be set to true.
```xml
<property>
<name>fs.azure.local.sas.key.mode</name>
<value>true</value>
</property>
``` ```
<property>
<name>fs.azure.local.sas.key.mode</name>
<value>true</value>
</property>
```
To use the remote SAS key generation mode, an external REST service is expected to provide the required SAS keys. To use the remote SAS key generation mode, an external REST service is expected to provide the required SAS keys.
The following property can be used to provide the endpoint to use for remote SAS key generation: The following property can be used to provide the endpoint to use for remote SAS key generation:
```xml
<property>
<name>fs.azure.cred.service.url</name>
<value>{URL}</value>
</property>
``` ```
<property>
<name>fs.azure.cred.service.url</name>
<value>{URL}</value>
</property>
```
The remote service is expected to provide support for two REST calls ```{URL}/GET_CONTAINER_SAS``` and ```{URL}/GET_RELATIVE_BLOB_SAS```, for generating The remote service is expected to provide support for two REST calls ```{URL}/GET_CONTAINER_SAS``` and ```{URL}/GET_RELATIVE_BLOB_SAS```, for generating
container and relative blob SAS keys. An example request: container and relative blob SAS keys. An example request:
@ -326,7 +333,8 @@ container and relative blob SAS keys. An example request:
```{URL}/GET_CONTAINER_SAS?storage_account=<account_name>&container=<container>&relative_path=<relative path>&sas_expiry=<expiry period>&delegation_token=<delegation token>``` ```{URL}/GET_CONTAINER_SAS?storage_account=<account_name>&container=<container>&relative_path=<relative path>&sas_expiry=<expiry period>&delegation_token=<delegation token>```
The service is expected to return a response in JSON format: The service is expected to return a response in JSON format:
```
```json
{ {
"responseCode" : 0 or non-zero <int>, "responseCode" : 0 or non-zero <int>,
"responseMessage" : relevant message on failure <String>, "responseMessage" : relevant message on failure <String>,
@ -334,40 +342,42 @@ The service is expected to return a response in JSON format:
} }
``` ```
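A client of the remote SAS service needs to assemble the `GET_CONTAINER_SAS` request with properly escaped query parameters. The Python sketch below shows one way to do that using the parameter names from the example request; the function name `container_sas_request`, the base URL `https://sas.example.com`, and all argument values are hypothetical.

```python
from urllib.parse import urlencode


def container_sas_request(base_url: str, account: str, container: str,
                          relative_path: str, sas_expiry: str,
                          delegation_token: str) -> str:
    """Build a GET_CONTAINER_SAS request URL with escaped query parameters."""
    query = urlencode({
        "storage_account": account,
        "container": container,
        "relative_path": relative_path,
        "sas_expiry": sas_expiry,
        "delegation_token": delegation_token,
    })
    return f"{base_url.rstrip('/')}/GET_CONTAINER_SAS?{query}"


# All values below are made up for illustration.
print(container_sas_request("https://sas.example.com", "youraccount",
                            "yourcontainer", "testDir", "60", "token123"))
```

Using `urlencode` rather than string concatenation keeps characters such as `/` or `&` inside a value from corrupting the query string.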
## <a name="WASB Authorization" />Authorization Support in WASB. ### Authorization Support in WASB
Authorization support can be enabled in WASB using the following configuration: Authorization support can be enabled in WASB using the following configuration:
``` ```xml
<property> <property>
<name>fs.azure.authorization</name> <name>fs.azure.authorization</name>
<value>true</value> <value>true</value>
</property> </property>
```
The current implementation of authorization relies on the presence of an external service that can enforce
the authorization. The service is expected to be running on a URL provided by the following config.
```
<property>
<name>fs.azure.authorization.remote.service.url</name>
<value>{URL}</value>
</property>
``` ```
The remote service is expected to provide support for the following REST call: ```{URL}/CHECK_AUTHORIZATION``` The current implementation of authorization relies on the presence of an external service that can enforce
An example request: the authorization. The service is expected to be running on a URL provided by the following config.
```xml
<property>
<name>fs.azure.authorization.remote.service.url</name>
<value>{URL}</value>
</property>
```
The remote service is expected to provide support for the following REST call: ```{URL}/CHECK_AUTHORIZATION```
An example request:
```{URL}/CHECK_AUTHORIZATION?wasb_absolute_path=<absolute_path>&operation_type=<operation type>&delegation_token=<delegation token>``` ```{URL}/CHECK_AUTHORIZATION?wasb_absolute_path=<absolute_path>&operation_type=<operation type>&delegation_token=<delegation token>```
The service is expected to return a response in JSON format: The service is expected to return a response in JSON format:
```
{ ```json
{
"responseCode" : 0 or non-zero <int>, "responseCode" : 0 or non-zero <int>,
"responseMessage" : relevant message on failure <String>, "responseMessage" : relevant message on failure <String>,
"authorizationResult" : true/false <boolean> "authorizationResult" : true/false <boolean>
} }
``` ```
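On the client side, a response of this shape might be interpreted as in the Python sketch below. The function name `is_authorized` is hypothetical, and treating a `responseCode` of 0 as success is an assumption inferred from the "message on failure" wording rather than something the text states outright.

```python
import json


def is_authorized(response_body: str) -> bool:
    """Interpret a CHECK_AUTHORIZATION JSON response."""
    reply = json.loads(response_body)
    # Assumption: 0 signals success and any other code signals a failure.
    if reply["responseCode"] != 0:
        raise RuntimeError(
            f"authorization service failed: {reply.get('responseMessage')}"
        )
    return bool(reply["authorizationResult"])


sample = '{"responseCode": 0, "responseMessage": "", "authorizationResult": true}'
print(is_authorized(sample))  # True
```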
## <a name="Testing_the_hadoop-azure_Module" />Testing the hadoop-azure Module ## Testing the hadoop-azure Module
The hadoop-azure module includes a full suite of unit tests. Most of the tests The hadoop-azure module includes a full suite of unit tests. Most of the tests
will run without additional configuration by running `mvn test`. This includes will run without additional configuration by running `mvn test`. This includes
@ -382,10 +392,12 @@ that runs on a local machine.
To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then, To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
edit `src/test/resources/azure-test.xml` and add the following property: edit `src/test/resources/azure-test.xml` and add the following property:
<property> ```xml
<name>fs.azure.test.emulator</name> <property>
<value>true</value> <name>fs.azure.test.emulator</name>
</property> <value>true</value>
</property>
```
There is a known issue when running tests with the emulator. You may see the There is a known issue when running tests with the emulator. You may see the
following failure message: following failure message:
@ -399,6 +411,7 @@ file to `src/test/resources/azure-auth-keys.xml` and setting
the name of the storage account and its access key. the name of the storage account and its access key.
For example: For example:
```xml ```xml
<?xml version="1.0"?> <?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>