HADOOP-14491. Azure has messed doc structure. Contributed by Mingliang Liu

Mingliang Liu 2017-06-05 18:24:40 -07:00
parent 6b5285bbcb
commit 536f057158
1 changed files with 124 additions and 111 deletions


@ -14,20 +14,9 @@
# Hadoop Azure Support: Azure Blob Storage
* [Introduction](#Introduction)
* [Features](#Features)
* [Limitations](#Limitations)
* [Usage](#Usage)
* [Concepts](#Concepts)
* [Configuring Credentials](#Configuring_Credentials)
* [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
* [Atomic Folder Rename](#Atomic_Folder_Rename)
* [Accessing wasb URLs](#Accessing_wasb_URLs)
* [Append API Support and Configuration](#Append_API_Support_and_Configuration)
* [Multithread Support](#Multithread_Support)
* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)
<!-- MACRO{toc|fromDepth=1|toDepth=3} -->
## <a name="Introduction" />Introduction
## Introduction
The hadoop-azure module provides support for integration with
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
@ -38,7 +27,7 @@ on the additional artifacts it requires, notably the
To make it part of Apache Hadoop's default classpath, simply make sure that
HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has 'hadoop-azure' in the list.
## <a name="Features" />Features
## Features
* Read and write data stored in an Azure Blob Storage account.
* Present a hierarchical file system view by implementing the standard Hadoop
@ -54,15 +43,15 @@ HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has 'hadoop-azure' in the list.
* Tested on both Linux and Windows.
* Tested at scale.
## <a name="Limitations" />Limitations
## Limitations
* File owner and group are persisted, but the permissions model is not enforced.
Authorization occurs at the level of the entire Azure Blob Storage account.
* File last access time is not tracked.
## <a name="Usage" />Usage
## Usage
### <a name="Concepts" />Concepts
### Concepts
The Azure Blob Storage data model presents 3 core concepts:
@ -76,7 +65,7 @@ The Azure Blob Storage data model presents 3 core concepts:
The internal implementation also uses blobs to persist the file system
hierarchy and other metadata.
### <a name="Configuring_Credentials" />Configuring Credentials
### Configuring Credentials
Usage of Azure Blob Storage requires configuration of credentials. Typically
this is set in core-site.xml. The configuration property name is of the form
@ -87,11 +76,12 @@ untrusted party.**
For example:
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ACCESS KEY</value>
</property>
```xml
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ACCESS KEY</value>
</property>
```
In many Hadoop clusters, the core-site.xml file is world-readable. It is possible to
protect the access key within a credential provider as well. This provides an encrypted
file format along with protection with file permissions.
@ -110,14 +100,14 @@ For additional reading on the credential provider API see:
###### provision
```
```bash
% hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123
-provider localjceks://file/home/lmccay/wasb.jceks
```
###### configure core-site.xml or command line system property
```
```xml
<property>
<name>hadoop.security.credential.provider.path</name>
<value>localjceks://file/home/lmccay/wasb.jceks</value>
@ -127,7 +117,7 @@ For additional reading on the credential provider API see:
###### distcp
```
```bash
% hadoop distcp
[-D hadoop.security.credential.provider.path=localjceks://file/home/lmccay/wasb.jceks]
hdfs://hostname:9001/user/lmccay/007020615 wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/
@ -145,22 +135,25 @@ specifies an external program to be invoked by Hadoop processes to decrypt the
key. The encrypted key value is passed to this external program as a command
line argument:
<property>
<name>fs.azure.account.keyprovider.youraccount</name>
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
```xml
<property>
<name>fs.azure.account.keyprovider.youraccount</name>
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ENCRYPTED ACCESS KEY</value>
</property>
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ENCRYPTED ACCESS KEY</value>
</property>
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>PATH TO DECRYPTION PROGRAM</value>
</property>
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>PATH TO DECRYPTION PROGRAM</value>
</property>
### <a name="Page_Blob_Support_and_Configuration" />Page Blob Support and Configuration
```
### Page Blob Support and Configuration
The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
@ -182,10 +175,12 @@ folder names.
For example:
<property>
<name>fs.azure.page.blob.dir</name>
<value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
</property>
```xml
<property>
<name>fs.azure.page.blob.dir</name>
<value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
</property>
```
You can set this to simply / to make all files page blobs.
@ -197,7 +192,7 @@ The configuration option `fs.azure.page.blob.extension.size` is the page blob
extension size. This defines the amount to extend a page blob if it starts to
get full. It must be 128MB or greater, specified as an integer number of bytes.
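Because the value is given in bytes, it can help to compute it rather than type it by hand. The following is a minimal sketch, assuming "MB" here means 2^20 bytes; the helper name `extension_size_bytes` is hypothetical and not part of hadoop-azure.

```python
# Hypothetical helper (not part of hadoop-azure): convert a size in MB to the
# integer byte count expected by fs.azure.page.blob.extension.size.
# Assumption: 1 MB = 1024 * 1024 bytes.
MIN_EXTENSION_MB = 128  # minimum stated in the documentation text


def extension_size_bytes(size_mb: int) -> int:
    """Return size_mb expressed in bytes, rejecting values below the minimum."""
    if size_mb < MIN_EXTENSION_MB:
        raise ValueError(f"extension size must be at least {MIN_EXTENSION_MB}MB")
    return size_mb * 1024 * 1024


print(extension_size_bytes(128))  # 134217728
```

Under that assumption, the smallest accepted value for the property would be `134217728`.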
### <a name="Atomic_Folder_Rename" />Atomic Folder Rename
### Atomic Folder Rename
Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
@ -216,12 +211,14 @@ the intention of the rename operation, to allow redo in event of a failure.
For example:
<property>
<name>fs.azure.atomic.rename.dir</name>
<value>/hbase,/data</value>
</property>
```xml
<property>
<name>fs.azure.atomic.rename.dir</name>
<value>/hbase,/data</value>
</property>
```
### <a name="Accessing_wasb_URLs" />Accessing wasb URLs
### Accessing wasb URLs
After credentials are configured in core-site.xml, any Hadoop component may
reference files in that Azure Blob Storage account by using URLs of the following
@ -238,28 +235,32 @@ For example, the following
commands demonstrate access to a storage account named `youraccount` and a
container named `yourcontainer`.
> hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
```bash
% hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
> hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
% hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
> hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
test file content
% hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
test file content
```
It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
This causes all bare paths, such as `/testDir/testFile` to resolve automatically
to that file system.
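The URL shape used in the commands above can be illustrated with a small string-building sketch. This is illustrative only and does not reflect how Hadoop resolves paths internally; the function names `wasb_url` and `resolve_bare_path` are hypothetical.

```python
# Illustrative only: compose wasb/wasbs URLs in the shape shown in the
# example commands. Both function names are hypothetical.
def wasb_url(container: str, account: str, path: str, secure: bool = False) -> str:
    """Build a wasb:// (or wasbs://) URL for a path in a container."""
    scheme = "wasbs" if secure else "wasb"
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{container}@{account}.blob.core.windows.net{path}"


def resolve_bare_path(default_fs: str, bare_path: str) -> str:
    """Show what a bare path refers to when default_fs is a wasb URL."""
    return default_fs.rstrip("/") + bare_path


default_fs = wasb_url("yourcontainer", "youraccount", "/")
print(resolve_bare_path(default_fs, "/testDir/testFile"))
```

With `fs.defaultFS` set to the container root, the bare path `/testDir/testFile` refers to the same file as the fully qualified URL used in the `-put` example.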
### <a name="Append_API_Support_and_Configuration" />Append API Support and Configuration
### Append API Support and Configuration
The Azure Blob Storage interface for Hadoop optionally supports the Append API for a
single writer. To enable it, set the configuration `fs.azure.enable.append.support` to true.
For example:
<property>
<name>fs.azure.enable.append.support</name>
<value>true</value>
</property>
```xml
<property>
<name>fs.azure.enable.append.support</name>
<value>true</value>
</property>
```
Note that Append support in the Azure Blob Storage interface DIFFERS FROM HDFS SEMANTICS. Append
support does not enforce a single writer internally but requires applications to guarantee this semantic.
@ -267,25 +268,29 @@ It becomes a responsibility of the application either to ensure single-threaded
file path, or rely on some external locking mechanism of its own. Failure to do so will result in
unexpected behavior.
### <a name="Multithread_Support" />Multithread Support
### Multithread Support
Rename and Delete operations on directories with a large number of files and subdirectories are currently very slow, because these operations are performed serially, one blob at a time. These files and subfolders can instead be deleted or renamed in parallel. The following configurations enable threads for parallel processing.
To enable 10 threads for the Delete operation, set the property below. Set the value to 0 or 1 to disable threads. The default behavior is threads disabled.
<property>
<name>fs.azure.delete.threads</name>
<value>10</value>
</property>
```xml
<property>
<name>fs.azure.delete.threads</name>
<value>10</value>
</property>
```
To enable 20 threads for the Rename operation, set the property below. Set the value to 0 or 1 to disable threads. The default behavior is threads disabled.
<property>
<name>fs.azure.rename.threads</name>
<value>20</value>
</property>
```xml
<property>
<name>fs.azure.rename.threads</name>
<value>20</value>
</property>
```
### <a name="WASB_SECURE_MODE" />WASB Secure mode and configuration
### WASB Secure mode and configuration
WASB can operate in secure mode, where the storage access keys required to communicate with Azure storage do not have to
be in the same address space as the process using WASB. In this mode all interactions with Azure storage are performed using
@ -295,30 +300,32 @@ Romote mode, however for testing purposes the local mode can be enabled to gener
To enable secure mode, set the following property to true.
```
<property>
<name>fs.azure.secure.mode</name>
<value>true</value>
</property>
```xml
<property>
<name>fs.azure.secure.mode</name>
<value>true</value>
</property>
```
To enable local SAS key generation, set the following property to true.
```xml
<property>
<name>fs.azure.local.sas.key.mode</name>
<value>true</value>
</property>
```
<property>
<name>fs.azure.local.sas.key.mode</name>
<value>true</value>
</property>
```
To use the remote SAS key generation mode, an external REST service is expected to provide the required SAS keys.
The following property can be used to provide the endpoint to use for remote SAS key generation:
```xml
<property>
<name>fs.azure.cred.service.url</name>
<value>{URL}</value>
</property>
```
<property>
<name>fs.azure.cred.service.url</name>
<value>{URL}</value>
</property>
```
The remote service is expected to provide support for two REST calls, ```{URL}/GET_CONTAINER_SAS``` and ```{URL}/GET_RELATIVE_BLOB_SAS```, for generating
container and relative blob SAS keys. An example request:
@ -326,7 +333,8 @@ container and relative blob sas keys. An example requests
```{URL}/GET_CONTAINER_SAS?storage_account=<account_name>&container=<container>&relative_path=<relative path>&sas_expiry=<expiry period>&delegation_token=<delegation token>```
The service is expected to return a response in JSON format:
```
```json
{
"responseCode" : 0 or non-zero <int>,
"responseMessage" : relevant message on failure <String>,
@ -334,40 +342,42 @@ The service is expected to return a response in JSON format:
}
```
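As a rough client-side illustration of the request and response described above, the sketch below builds a `GET_CONTAINER_SAS` URL and checks the response code. The function names, the example endpoint, and the parameter values are hypothetical, and treating `responseCode` 0 as success is an assumption inferred from the "message on failure" wording. Only the two response fields visible above are read.

```python
import json
from urllib.parse import urlencode


# Hypothetical helper: build the GET_CONTAINER_SAS request URL using the
# query parameter names from the example request.
def container_sas_url(service_url, storage_account, container,
                      relative_path, sas_expiry, delegation_token):
    query = urlencode({
        "storage_account": storage_account,
        "container": container,
        "relative_path": relative_path,
        "sas_expiry": sas_expiry,
        "delegation_token": delegation_token,
    })
    return f"{service_url.rstrip('/')}/GET_CONTAINER_SAS?{query}"


# Hypothetical helper: parse the JSON response body and fail on a non-zero
# code. Assumption: responseCode == 0 means success.
def check_response(body: str) -> dict:
    payload = json.loads(body)
    if payload["responseCode"] != 0:
        raise RuntimeError(payload.get("responseMessage", "unknown failure"))
    return payload


# Placeholder values only; the endpoint and expiry format are not from the source.
print(container_sas_url("https://example.invalid/sas", "youraccount",
                        "yourcontainer", "testDir", "60", "token"))
```

`urlencode` percent-encodes the parameter values, which matters if a relative path or token contains characters such as `/` or `&`.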
## <a name="WASB Authorization" />Authorization Support in WASB.
### Authorization Support in WASB
Authorization support can be enabled in WASB using the following configuration:
```
<property>
<name>fs.azure.authorization</name>
<value>true</value>
</property>
```
The current implementation of authorization relies on the presence of an external service that can enforce
the authorization. The service is expected to be running on a URL provided by the following config.
```
<property>
<name>fs.azure.authorization.remote.service.url</name>
<value>{URL}</value>
</property>
```xml
<property>
<name>fs.azure.authorization</name>
<value>true</value>
</property>
```
The remote service is expected to provide support for the following REST call: ```{URL}/CHECK_AUTHORIZATION```
An example request:
The current implementation of authorization relies on the presence of an external service that can enforce
the authorization. The service is expected to be running on a URL provided by the following config.
```xml
<property>
<name>fs.azure.authorization.remote.service.url</name>
<value>{URL}</value>
</property>
```
The remote service is expected to provide support for the following REST call: ```{URL}/CHECK_AUTHORIZATION```
An example request:
```{URL}/CHECK_AUTHORIZATION?wasb_absolute_path=<absolute_path>&operation_type=<operation type>&delegation_token=<delegation token>```
The service is expected to return a response in JSON format:
```
{
The service is expected to return a response in JSON format:
```json
{
"responseCode" : 0 or non-zero <int>,
"responseMessage" : relevant message on failure <String>,
"authorizationResult" : true/false <boolean>
}
```
}
```
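A caller could interpret the authorization response along the following lines. This is a minimal sketch, not the WASB implementation: the function name `is_authorized` is hypothetical, and both the reading of `responseCode` 0 as success and the choice to deny access on any malformed or failed response are assumptions.

```python
import json


# Hypothetical helper: decide whether an operation is allowed based on the
# JSON body returned by CHECK_AUTHORIZATION.
# Assumptions: responseCode == 0 means the check itself succeeded, and any
# other outcome (failure code, missing field, non-boolean value) is a denial.
def is_authorized(body: str) -> bool:
    payload = json.loads(body)
    if payload.get("responseCode") != 0:
        return False
    return payload.get("authorizationResult") is True


sample = '{"responseCode": 0, "responseMessage": "", "authorizationResult": true}'
print(is_authorized(sample))  # True
```

Failing closed keeps a service error from being mistaken for a grant; whether that matches the behavior wanted in a given deployment is a design decision for the integrator.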
## <a name="Testing_the_hadoop-azure_Module" />Testing the hadoop-azure Module
## Testing the hadoop-azure Module
The hadoop-azure module includes a full suite of unit tests. Most of the tests
will run without additional configuration by running `mvn test`. This includes
@ -382,10 +392,12 @@ that runs on a local machine.
To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
edit `src/test/resources/azure-test.xml` and add the following property:
<property>
<name>fs.azure.test.emulator</name>
<value>true</value>
</property>
```xml
<property>
<name>fs.azure.test.emulator</name>
<value>true</value>
</property>
```
There is a known issue when running tests with the emulator. You may see the
following failure message:
@ -399,6 +411,7 @@ file to `src/test/resources/azure-auth-keys.xml` and setting
the name of the storage account and its access key.
For example:
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>