HADOOP-11395. Add site documentation for Azure Storage FileSystem integration. (Contributed by Chris Nauroth)
This commit is contained in:
parent
9180d11b3b
commit
0a5b28605f
|
@ -88,6 +88,9 @@ Release 2.7.0 - UNRELEASED
|
||||||
HADOOP-11213. Typos in html pages: SecureMode and EncryptedShuffle.
|
HADOOP-11213. Typos in html pages: SecureMode and EncryptedShuffle.
|
||||||
(Wei Yan via kasha)
|
(Wei Yan via kasha)
|
||||||
|
|
||||||
|
HADOOP-11395. Add site documentation for Azure Storage FileSystem
|
||||||
|
integration. (Chris Nauroth via Arpit Agarwal)
|
||||||
|
|
||||||
OPTIMIZATIONS
|
OPTIMIZATIONS
|
||||||
|
|
||||||
HADOOP-11323. WritableComparator#compare keeps reference to byte array.
|
HADOOP-11323. WritableComparator#compare keeps reference to byte array.
|
||||||
|
|
|
@ -137,6 +137,7 @@
|
||||||
|
|
||||||
<menu name="Hadoop Compatible File Systems" inherit="top">
|
<menu name="Hadoop Compatible File Systems" inherit="top">
|
||||||
<item name="Amazon S3" href="hadoop-aws/tools/hadoop-aws/index.html"/>
|
<item name="Amazon S3" href="hadoop-aws/tools/hadoop-aws/index.html"/>
|
||||||
|
<item name="Azure Blob Storage" href="hadoop-azure/index.html"/>
|
||||||
<item name="OpenStack Swift" href="hadoop-openstack/index.html"/>
|
<item name="OpenStack Swift" href="hadoop-openstack/index.html"/>
|
||||||
</menu>
|
</menu>
|
||||||
|
|
||||||
|
|
|
@ -1,166 +0,0 @@
|
||||||
=============
|
|
||||||
Building
|
|
||||||
=============
|
|
||||||
basic compilation:
|
|
||||||
> mvn clean compile test-compile
|
|
||||||
|
|
||||||
Compile, run tests and produce jar
|
|
||||||
> mvn clean package
|
|
||||||
|
|
||||||
=============
|
|
||||||
Unit tests
|
|
||||||
=============
|
|
||||||
Most of the tests will run without additional configuration.
|
|
||||||
For complete testing, configuration in src/test/resources is required:
|
|
||||||
|
|
||||||
src/test/resources/azure-test.xml -> Defines Azure storage dependencies, including account information
|
|
||||||
|
|
||||||
The other files in src/test/resources do not normally need alteration:
|
|
||||||
log4j.properties -> Test logging setup
|
|
||||||
hadoop-metrics2-azure-file-system.properties -> used to wire up instrumentation for testing
|
|
||||||
|
|
||||||
From command-line
|
|
||||||
------------------
|
|
||||||
Basic execution:
|
|
||||||
> mvn test
|
|
||||||
|
|
||||||
NOTES:
|
|
||||||
- The mvn pom.xml includes src/test/resources in the runtime classpath
|
|
||||||
- detailed output (such as log4j) appears in target\surefire-reports\TEST-{testName}.xml
|
|
||||||
including log4j messages.
|
|
||||||
|
|
||||||
Run the tests and generate report:
|
|
||||||
> mvn site (at least once to setup some basics including images for the report)
|
|
||||||
> mvn surefire-report:report (run and produce report)
|
|
||||||
> mvn mvn surefire-report:report-only (produce report from last run)
|
|
||||||
> mvn mvn surefire-report:report-only -DshowSuccess=false (produce report from last run, only show errors)
|
|
||||||
> .\target\site\surefire-report.html (view the report)
|
|
||||||
|
|
||||||
Via eclipse
|
|
||||||
-------------
|
|
||||||
Manually add src\test\resources to the classpath for test run configuration:
|
|
||||||
- run menu|run configurations|{configuration}|classpath|User Entries|advanced|add folder
|
|
||||||
|
|
||||||
Then run via junit test runner.
|
|
||||||
NOTE:
|
|
||||||
- if you change log4.properties, rebuild the project to refresh the eclipse cache.
|
|
||||||
|
|
||||||
Run Tests against Mocked storage.
|
|
||||||
---------------------------------
|
|
||||||
These run automatically and make use of an in-memory emulation of azure storage.
|
|
||||||
|
|
||||||
|
|
||||||
Running tests against the Azure storage emulator
|
|
||||||
---------------------------------------------------
|
|
||||||
A selection of tests can run against the Azure Storage Emulator which is
|
|
||||||
a high-fidelity emulation of live Azure Storage. The emulator is sufficient for high-confidence testing.
|
|
||||||
The emulator is a Windows executable that runs on a local machine.
|
|
||||||
|
|
||||||
To use the emulator, install Azure SDK 2.3 and start the storage emulator
|
|
||||||
See http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx
|
|
||||||
|
|
||||||
Enable the Azure emulator tests by setting
|
|
||||||
fs.azure.test.emulator -> true
|
|
||||||
in src\test\resources\azure-test.xml
|
|
||||||
|
|
||||||
Known issues:
|
|
||||||
Symptom: When running tests for emulator, you see the following failure message
|
|
||||||
com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
|
|
||||||
Issue: The emulator can get into a confused state.
|
|
||||||
Fix: Restart the Azure Emulator. Ensure it is v3.2 or later.
|
|
||||||
|
|
||||||
Running tests against live Azure storage
|
|
||||||
-------------------------------------------------------------------------
|
|
||||||
In order to run WASB unit tests against a live Azure Storage account, add credentials to
|
|
||||||
src\test\resources\azure-test.xml. These settings augment the hadoop configuration object.
|
|
||||||
|
|
||||||
For live tests, set the following in azure-test.xml:
|
|
||||||
1. "fs.azure.test.account.name -> {azureStorageAccountName}
|
|
||||||
2. "fs.azure.account.key.{AccountName} -> {fullStorageKey}"
|
|
||||||
|
|
||||||
===================================
|
|
||||||
Page Blob Support and Configuration
|
|
||||||
===================================
|
|
||||||
|
|
||||||
The Azure Blob Storage interface for Hadoop supports two kinds of blobs, block blobs
|
|
||||||
and page blobs. Block blobs are the default kind of blob and are good for most
|
|
||||||
big-data use cases, like input data for Hive, Pig, analytical map-reduce jobs etc.
|
|
||||||
Page blob handling in hadoop-azure was introduced to support HBase log files.
|
|
||||||
Page blobs can be written any number of times, whereas block blobs can only be
|
|
||||||
appended to 50,000 times before you run out of blocks and your writes will fail.
|
|
||||||
That won't work for HBase logs, so page blob support was introduced to overcome
|
|
||||||
this limitation.
|
|
||||||
|
|
||||||
Page blobs can be used for other purposes beyond just HBase log files though.
|
|
||||||
They support the Hadoop FileSystem interface. Page blobs can be up to 1TB in
|
|
||||||
size, larger than the maximum 200GB size for block blobs.
|
|
||||||
|
|
||||||
In order to have the files you create be page blobs, you must set the configuration
|
|
||||||
variable fs.azure.page.blob.dir to a comma-separated list of folder names.
|
|
||||||
E.g.
|
|
||||||
|
|
||||||
/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles
|
|
||||||
|
|
||||||
You can set this to simply / to make all files page blobs.
|
|
||||||
|
|
||||||
The configuration option fs.azure.page.blob.size is the default initial
|
|
||||||
size for a page blob. It must be 128MB or greater, and no more than 1TB,
|
|
||||||
specified as an integer number of bytes.
|
|
||||||
|
|
||||||
====================
|
|
||||||
Atomic Folder Rename
|
|
||||||
====================
|
|
||||||
|
|
||||||
Azure storage stores files as a flat key/value store without formal support
|
|
||||||
for folders. The hadoop-azure file system layer simulates folders on top
|
|
||||||
of Azure storage. By default, folder rename in the hadoop-azure file system
|
|
||||||
layer is not atomic. That means that a failure during a folder rename
|
|
||||||
could, for example, leave some folders in the original directory and
|
|
||||||
some in the new one.
|
|
||||||
|
|
||||||
HBase depends on atomic folder rename. Hence, a configuration setting was
|
|
||||||
introduced called fs.azure.atomic.rename.dir that allows you to specify a
|
|
||||||
comma-separated list of directories to receive special treatment so that
|
|
||||||
folder rename is made atomic. The default value of this setting is just /hbase.
|
|
||||||
Redo will be applied to finish a folder rename that fails. A file
|
|
||||||
<folderName>-renamePending.json may appear temporarily and is the record of
|
|
||||||
the intention of the rename operation, to allow redo in event of a failure.
|
|
||||||
|
|
||||||
=============
|
|
||||||
Findbugs
|
|
||||||
=============
|
|
||||||
Run findbugs and show interactive GUI for review of problems
|
|
||||||
> mvn findbugs:gui
|
|
||||||
|
|
||||||
Run findbugs and fail build if errors are found:
|
|
||||||
> mvn findbugs:check
|
|
||||||
|
|
||||||
For help with findbugs plugin.
|
|
||||||
> mvn findbugs:help
|
|
||||||
|
|
||||||
=============
|
|
||||||
Checkstyle
|
|
||||||
=============
|
|
||||||
Rules for checkstyle @ src\config\checkstyle.xml
|
|
||||||
- these are based on a core set of standards, with exclusions for non-serious issues
|
|
||||||
- as a general plan it would be good to turn on more rules over time.
|
|
||||||
- Occasionally, run checkstyle with the default Sun rules by editing pom.xml.
|
|
||||||
|
|
||||||
Command-line:
|
|
||||||
> mvn checkstyle:check --> just test & fail build if violations found
|
|
||||||
> mvn site checkstyle:checkstyle --> produce html report
|
|
||||||
> . target\site\checkstyle.html --> view report.
|
|
||||||
|
|
||||||
Eclipse:
|
|
||||||
- add the checkstyle plugin: Help|Install, site=http://eclipse-cs.sf.net/update
|
|
||||||
- window|preferences|checkstyle. Add src/config/checkstyle.xml. Set as default.
|
|
||||||
- project|properties|create configurations as required, eg src/main/java -> src/config/checkstyle.xml
|
|
||||||
|
|
||||||
NOTE:
|
|
||||||
- After any change to the checkstyle rules xml, use window|preferences|checkstyle|{refresh}|OK
|
|
||||||
|
|
||||||
=============
|
|
||||||
Javadoc
|
|
||||||
=============
|
|
||||||
Command-line
|
|
||||||
> mvn javadoc:javadoc
|
|
|
@ -0,0 +1,243 @@
|
||||||
|
<!---
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License. See accompanying LICENSE file.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Hadoop Azure Support: Azure Blob Storage
|
||||||
|
|
||||||
|
* [Introduction](#Introduction)
|
||||||
|
* [Features](#Features)
|
||||||
|
* [Limitations](#Limitations)
|
||||||
|
* [Usage](#Usage)
|
||||||
|
* [Concepts](#Concepts)
|
||||||
|
* [Configuring Credentials](#Configuring_Credentials)
|
||||||
|
* [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
|
||||||
|
* [Atomic Folder Rename](#Atomic_Folder_Rename)
|
||||||
|
* [Accessing wasb URLs](#Accessing_wasb_URLs)
|
||||||
|
* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)
|
||||||
|
|
||||||
|
## <a name="Introduction" />Introduction
|
||||||
|
|
||||||
|
The hadoop-azure module provides support for integration with
|
||||||
|
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
|
||||||
|
The built jar file, named hadoop-azure.jar, also declares transitive dependencies
|
||||||
|
on the additional artifacts it requires, notably the
|
||||||
|
[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).
|
||||||
|
|
||||||
|
## <a name="Features" />Features
|
||||||
|
|
||||||
|
* Read and write data stored in an Azure Blob Storage account.
|
||||||
|
* Present a hierarchical file system view by implementing the standard Hadoop
|
||||||
|
[`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
|
||||||
|
* Supports configuration of multiple Azure Blob Storage accounts.
|
||||||
|
* Supports both page blobs (suitable for most use cases, such as MapReduce) and
|
||||||
|
block blobs (suitable for continuous write use cases, such as an HBase
|
||||||
|
write-ahead log).
|
||||||
|
* Reference file system paths using URLs using the `wasb` scheme.
|
||||||
|
* Also reference file system paths using URLs with the `wasbs` scheme for SSL
|
||||||
|
encrypted access.
|
||||||
|
* Can act as a source of data in a MapReduce job, or a sink.
|
||||||
|
* Tested on both Linux and Windows.
|
||||||
|
* Tested at scale.
|
||||||
|
|
||||||
|
## <a name="Limitations" />Limitations
|
||||||
|
|
||||||
|
* The append operation is not implemented.
|
||||||
|
* File owner and group are persisted, but the permissions model is not enforced.
|
||||||
|
Authorization occurs at the level of the entire Azure Blob Storage account.
|
||||||
|
* File last access time is not tracked.
|
||||||
|
|
||||||
|
## <a name="Usage" />Usage
|
||||||
|
|
||||||
|
### <a name="Concepts" />Concepts
|
||||||
|
|
||||||
|
The Azure Blob Storage data model presents 3 core concepts:
|
||||||
|
|
||||||
|
* **Storage Account**: All access is done through a storage account.
|
||||||
|
* **Container**: A container is a grouping of multiple blobs. A storage account
|
||||||
|
may have multiple containers. In Hadoop, an entire file system hierarchy is
|
||||||
|
stored in a single container. It is also possible to configure multiple
|
||||||
|
containers, effectively presenting multiple file systems that can be referenced
|
||||||
|
using distinct URLs.
|
||||||
|
* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs.
|
||||||
|
The internal implementation also uses blobs to persist the file system
|
||||||
|
hierarchy and other metadata.
|
||||||
|
|
||||||
|
### <a name="Configuring_Credentials" />Configuring Credentials
|
||||||
|
|
||||||
|
Usage of Azure Blob Storage requires configuration of credentials. Typically
|
||||||
|
this is set in core-site.xml. The configuration property name is of the form
|
||||||
|
`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the
|
||||||
|
access key. **The access key is a secret that protects access to your storage
|
||||||
|
account. Do not share the access key (or the core-site.xml file) with an
|
||||||
|
untrusted party.**
|
||||||
|
|
||||||
|
For example:
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
|
||||||
|
<value>YOUR ACCESS KEY</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
In many Hadoop clusters, the core-site.xml file is world-readable. If it's
|
||||||
|
undesirable for the access key to be visible in core-site.xml, then it's also
|
||||||
|
possible to configure it in encrypted form. An additional configuration property
|
||||||
|
specifies an external program to be invoked by Hadoop processes to decrypt the
|
||||||
|
key. The encrypted key value is passed to this external program as a command
|
||||||
|
line argument:
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.account.keyprovider.youraccount</name>
|
||||||
|
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
|
||||||
|
<value>YOUR ENCRYPTED ACCESS KEY</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.shellkeyprovider.script</name>
|
||||||
|
<value>PATH TO DECRYPTION PROGRAM</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
### <a name="Page_Blob_Support_and_Configuration" />Page Blob Support and Configuration
|
||||||
|
|
||||||
|
The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
|
||||||
|
[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
|
||||||
|
Block blobs are the default kind of blob and are good for most big-data use
|
||||||
|
cases, like input data for Hive, Pig, analytical map-reduce jobs etc. Page blob
|
||||||
|
handling in hadoop-azure was introduced to support HBase log files. Page blobs
|
||||||
|
can be written any number of times, whereas block blobs can only be appended to
|
||||||
|
50,000 times before you run out of blocks and your writes will fail. That won't
|
||||||
|
work for HBase logs, so page blob support was introduced to overcome this
|
||||||
|
limitation.
|
||||||
|
|
||||||
|
Page blobs can be used for other purposes beyond just HBase log files though.
|
||||||
|
Page blobs can be up to 1TB in size, larger than the maximum 200GB size for block
|
||||||
|
blobs.
|
||||||
|
|
||||||
|
In order to have the files you create be page blobs, you must set the
|
||||||
|
configuration variable `fs.azure.page.blob.dir` to a comma-separated list of
|
||||||
|
folder names.
|
||||||
|
|
||||||
|
For example:
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.page.blob.dir</name>
|
||||||
|
<value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
You can set this to simply / to make all files page blobs.
|
||||||
|
|
||||||
|
The configuration option `fs.azure.page.blob.size` is the default initial
|
||||||
|
size for a page blob. It must be 128MB or greater, and no more than 1TB,
|
||||||
|
specified as an integer number of bytes.
|
||||||
|
|
||||||
|
The configuration option `fs.azure.page.blob.extension.size` is the page blob
|
||||||
|
extension size. This defines the amount to extend a page blob if it starts to
|
||||||
|
get full. It must be 128MB or greater, specified as an integer number of bytes.
|
||||||
|
|
||||||
|
### <a name="Atomic_Folder_Rename" />Atomic Folder Rename
|
||||||
|
|
||||||
|
Azure storage stores files as a flat key/value store without formal support
|
||||||
|
for folders. The hadoop-azure file system layer simulates folders on top
|
||||||
|
of Azure storage. By default, folder rename in the hadoop-azure file system
|
||||||
|
layer is not atomic. That means that a failure during a folder rename
|
||||||
|
could, for example, leave some folders in the original directory and
|
||||||
|
some in the new one.
|
||||||
|
|
||||||
|
HBase depends on atomic folder rename. Hence, a configuration setting was
|
||||||
|
introduced called `fs.azure.atomic.rename.dir` that allows you to specify a
|
||||||
|
comma-separated list of directories to receive special treatment so that
|
||||||
|
folder rename is made atomic. The default value of this setting is just
|
||||||
|
`/hbase`. Redo will be applied to finish a folder rename that fails. A file
|
||||||
|
`<folderName>-renamePending.json` may appear temporarily and is the record of
|
||||||
|
the intention of the rename operation, to allow redo in event of a failure.
|
||||||
|
|
||||||
|
For example:
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.atomic.rename.dir</name>
|
||||||
|
<value>/hbase,/data</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
### <a name="Accessing_wasb_URLs" />Accessing wasb URLs
|
||||||
|
|
||||||
|
After credentials are configured in core-site.xml, any Hadoop component may
|
||||||
|
reference files in that Azure Blob Storage account by using URLs of the following
|
||||||
|
format:
|
||||||
|
|
||||||
|
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
|
||||||
|
|
||||||
|
The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure
|
||||||
|
Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with
|
||||||
|
the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access.
|
||||||
|
|
||||||
|
For example, the following
|
||||||
|
[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html)
|
||||||
|
commands demonstrate access to a storage account named `youraccount` and a
|
||||||
|
container named `yourcontainer`.
|
||||||
|
|
||||||
|
> hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
|
||||||
|
|
||||||
|
> hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
|
||||||
|
|
||||||
|
> hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
|
||||||
|
test file content
|
||||||
|
|
||||||
|
It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
|
||||||
|
This causes all bare paths, such as `/testDir/testFile` to resolve automatically
|
||||||
|
to that file system.
|
||||||
|
|
||||||
|
## <a name="Testing_the_hadoop-azure_Module" />Testing the hadoop-azure Module
|
||||||
|
|
||||||
|
The hadoop-azure module includes a full suite of unit tests. Most of the tests
|
||||||
|
will run without additional configuration by running `mvn test`. This includes
|
||||||
|
tests against mocked storage, which is an in-memory emulation of Azure Storage.
|
||||||
|
|
||||||
|
A selection of tests can run against the
|
||||||
|
[Azure Storage Emulator](http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx)
|
||||||
|
which is a high-fidelity emulation of live Azure Storage. The emulator is
|
||||||
|
sufficient for high-confidence testing. The emulator is a Windows executable
|
||||||
|
that runs on a local machine.
|
||||||
|
|
||||||
|
To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
|
||||||
|
edit `src/test/resources/azure-test.xml` and add the following property:
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.test.emulator</name>
|
||||||
|
<value>true</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
There is a known issue when running tests with the emulator. You may see the
|
||||||
|
following failure message:
|
||||||
|
|
||||||
|
com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
|
||||||
|
|
||||||
|
To resolve this, restart the Azure Emulator. Ensure it v3.2 or later.
|
||||||
|
|
||||||
|
It's also possible to run tests against a live Azure Storage account by adding
|
||||||
|
credentials to `src/test/resources/azure-test.xml` and setting
|
||||||
|
`fs.azure.test.account.name` to the name of the storage account.
|
||||||
|
|
||||||
|
For example:
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
|
||||||
|
<value>YOUR ACCESS KEY</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>fs.azure.test.account.name</name>
|
||||||
|
<value>youraccount</value>
|
||||||
|
</property>
|
Loading…
Reference in New Issue