HADOOP-11395. Add site documentation for Azure Storage FileSystem integration. (Contributed by Chris Nauroth)

This commit is contained in:
arp 2014-12-19 18:54:22 -08:00
parent 9180d11b3b
commit 0a5b28605f
4 changed files with 247 additions and 166 deletions

View File

@@ -88,6 +88,9 @@ Release 2.7.0 - UNRELEASED
HADOOP-11213. Typos in html pages: SecureMode and EncryptedShuffle.
(Wei Yan via kasha)
HADOOP-11395. Add site documentation for Azure Storage FileSystem
integration. (Chris Nauroth via Arpit Agarwal)
OPTIMIZATIONS
HADOOP-11323. WritableComparator#compare keeps reference to byte array.

View File

@@ -137,6 +137,7 @@
<menu name="Hadoop Compatible File Systems" inherit="top">
<item name="Amazon S3" href="hadoop-aws/tools/hadoop-aws/index.html"/>
<item name="Azure Blob Storage" href="hadoop-azure/index.html"/>
<item name="OpenStack Swift" href="hadoop-openstack/index.html"/>
</menu>

View File

@@ -1,166 +0,0 @@
=============
Building
=============
basic compilation:
> mvn clean compile test-compile
Compile, run tests and produce jar
> mvn clean package
=============
Unit tests
=============
Most of the tests will run without additional configuration.
For complete testing, configuration in src/test/resources is required:
src/test/resources/azure-test.xml -> Defines Azure storage dependencies, including account information
The other files in src/test/resources do not normally need alteration:
log4j.properties -> Test logging setup
hadoop-metrics2-azure-file-system.properties -> used to wire up instrumentation for testing
From command-line
------------------
Basic execution:
> mvn test
NOTES:
- The mvn pom.xml includes src/test/resources in the runtime classpath
- detailed output, including log4j messages, appears in target\surefire-reports\TEST-{testName}.xml
Run the tests and generate report:
> mvn site (run at least once to set up some basics, including images for the report)
> mvn surefire-report:report (run and produce report)
> mvn surefire-report:report-only (produce report from last run)
> mvn surefire-report:report-only -DshowSuccess=false (produce report from last run, only show errors)
> .\target\site\surefire-report.html (view the report)
Via eclipse
-------------
Manually add src\test\resources to the classpath for test run configuration:
- run menu|run configurations|{configuration}|classpath|User Entries|advanced|add folder
Then run via junit test runner.
NOTE:
- if you change log4j.properties, rebuild the project to refresh the eclipse cache.
Running tests against mocked storage
---------------------------------
These run automatically and make use of an in-memory emulation of Azure storage.
Running tests against the Azure storage emulator
---------------------------------------------------
A selection of tests can run against the Azure Storage Emulator which is
a high-fidelity emulation of live Azure Storage. The emulator is sufficient for high-confidence testing.
The emulator is a Windows executable that runs on a local machine.
To use the emulator, install Azure SDK 2.3 and start the storage emulator
See http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx
Enable the Azure emulator tests by setting
fs.azure.test.emulator -> true
in src\test\resources\azure-test.xml
Known issues:
Symptom: When running tests against the emulator, you see the following failure message
com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
Issue: The emulator can get into a confused state.
Fix: Restart the Azure Emulator. Ensure it is v3.2 or later.
Running tests against live Azure storage
-------------------------------------------------------------------------
In order to run WASB unit tests against a live Azure Storage account, add credentials to
src\test\resources\azure-test.xml. These settings augment the hadoop configuration object.
For live tests, set the following in azure-test.xml:
1. "fs.azure.test.account.name -> {azureStorageAccountName}
2. "fs.azure.account.key.{AccountName} -> {fullStorageKey}"
===================================
Page Blob Support and Configuration
===================================
The Azure Blob Storage interface for Hadoop supports two kinds of blobs, block blobs
and page blobs. Block blobs are the default kind of blob and are good for most
big-data use cases, like input data for Hive, Pig, analytical map-reduce jobs etc.
Page blob handling in hadoop-azure was introduced to support HBase log files.
Page blobs can be written any number of times, whereas block blobs can only be
appended to 50,000 times before you run out of blocks and your writes will fail.
That won't work for HBase logs, so page blob support was introduced to overcome
this limitation.
Page blobs can be used for other purposes beyond just HBase log files though.
They support the Hadoop FileSystem interface. Page blobs can be up to 1TB in
size, larger than the maximum 200GB size for block blobs.
In order to have the files you create be page blobs, you must set the configuration
variable fs.azure.page.blob.dir to a comma-separated list of folder names.
E.g.
/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles
You can set this to simply / to make all files page blobs.
The configuration option fs.azure.page.blob.size is the default initial
size for a page blob. It must be 128MB or greater, and no more than 1TB,
specified as an integer number of bytes.
====================
Atomic Folder Rename
====================
Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
of Azure storage. By default, folder rename in the hadoop-azure file system
layer is not atomic. That means that a failure during a folder rename
could, for example, leave some folders in the original directory and
some in the new one.
HBase depends on atomic folder rename. Hence, a configuration setting was
introduced called fs.azure.atomic.rename.dir that allows you to specify a
comma-separated list of directories to receive special treatment so that
folder rename is made atomic. The default value of this setting is just /hbase.
Redo will be applied to finish a folder rename that fails. A file
<folderName>-renamePending.json may appear temporarily and is the record of
the intention of the rename operation, to allow redo in event of a failure.
=============
Findbugs
=============
Run findbugs and show interactive GUI for review of problems
> mvn findbugs:gui
Run findbugs and fail build if errors are found:
> mvn findbugs:check
For help with the findbugs plugin:
> mvn findbugs:help
=============
Checkstyle
=============
Rules for checkstyle @ src\config\checkstyle.xml
- these are based on a core set of standards, with exclusions for non-serious issues
- as a general plan it would be good to turn on more rules over time.
- Occasionally, run checkstyle with the default Sun rules by editing pom.xml.
Command-line:
> mvn checkstyle:check --> just test & fail build if violations found
> mvn site checkstyle:checkstyle --> produce html report
> . target\site\checkstyle.html --> view report.
Eclipse:
- add the checkstyle plugin: Help|Install, site=http://eclipse-cs.sf.net/update
- window|preferences|checkstyle. Add src/config/checkstyle.xml. Set as default.
- project|properties|create configurations as required, eg src/main/java -> src/config/checkstyle.xml
NOTE:
- After any change to the checkstyle rules xml, use window|preferences|checkstyle|{refresh}|OK
=============
Javadoc
=============
Command-line
> mvn javadoc:javadoc

View File

@@ -0,0 +1,243 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Hadoop Azure Support: Azure Blob Storage
* [Introduction](#Introduction)
* [Features](#Features)
* [Limitations](#Limitations)
* [Usage](#Usage)
* [Concepts](#Concepts)
* [Configuring Credentials](#Configuring_Credentials)
* [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
* [Atomic Folder Rename](#Atomic_Folder_Rename)
* [Accessing wasb URLs](#Accessing_wasb_URLs)
* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)
## <a name="Introduction" />Introduction
The hadoop-azure module provides support for integration with
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
The built jar file, named hadoop-azure.jar, also declares transitive dependencies
on the additional artifacts it requires, notably the
[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).
## <a name="Features" />Features
* Read and write data stored in an Azure Blob Storage account.
* Present a hierarchical file system view by implementing the standard Hadoop
[`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
* Supports configuration of multiple Azure Blob Storage accounts.
* Supports both block blobs (suitable for most use cases, such as MapReduce) and
page blobs (suitable for continuous write use cases, such as an HBase
write-ahead log).
* Reference file system paths using URLs with the `wasb` scheme.
* Also reference file system paths using URLs with the `wasbs` scheme for
SSL-encrypted access.
* Can act as a source of data in a MapReduce job, or a sink.
* Tested on both Linux and Windows.
* Tested at scale.
## <a name="Limitations" />Limitations
* The append operation is not implemented.
* File owner and group are persisted, but the permissions model is not enforced.
Authorization occurs at the level of the entire Azure Blob Storage account.
* File last access time is not tracked.
## <a name="Usage" />Usage
### <a name="Concepts" />Concepts
The Azure Blob Storage data model presents 3 core concepts:
* **Storage Account**: All access is done through a storage account.
* **Container**: A container is a grouping of multiple blobs. A storage account
may have multiple containers. In Hadoop, an entire file system hierarchy is
stored in a single container. It is also possible to configure multiple
containers, effectively presenting multiple file systems that can be referenced
using distinct URLs.
* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs.
The internal implementation also uses blobs to persist the file system
hierarchy and other metadata.
### <a name="Configuring_Credentials" />Configuring Credentials
Usage of Azure Blob Storage requires configuration of credentials. Typically
this is set in core-site.xml. The configuration property name is of the form
`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the
access key. **The access key is a secret that protects access to your storage
account. Do not share the access key (or the core-site.xml file) with an
untrusted party.**
For example:
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ACCESS KEY</value>
</property>
In many Hadoop clusters, the core-site.xml file is world-readable. If it's
undesirable for the access key to be visible in core-site.xml, then it's also
possible to configure it in encrypted form. An additional configuration property
specifies an external program to be invoked by Hadoop processes to decrypt the
key. The encrypted key value is passed to this external program as a command
line argument:
<property>
<name>fs.azure.account.keyprovider.youraccount</name>
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ENCRYPTED ACCESS KEY</value>
</property>
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>PATH TO DECRYPTION PROGRAM</value>
</property>
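The decryption program itself is not part of Hadoop, so its exact contract is an
assumption here: the sketch below supposes the encrypted key arrives as the
program's only command-line argument and that the decrypted key is written back
on standard output, with Base64 decoding standing in for whatever real
decryption mechanism a deployment uses. The class name `DecryptKey` is
hypothetical, and the configured script path would point to whatever wrapper
launches it.

import java.util.Base64;

// Hypothetical stand-in for the external decryption program.
// Assumptions: the encrypted key is args[0]; the decrypted key goes to stdout.
public class DecryptKey {
  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("usage: DecryptKey <encrypted key>");
      System.exit(1);
    }
    // Base64 decoding is only a placeholder for real decryption.
    String decrypted = new String(Base64.getDecoder().decode(args[0]));
    System.out.print(decrypted);
  }
}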
### <a name="Page_Blob_Support_and_Configuration" />Page Blob Support and Configuration
The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
Block blobs are the default kind of blob and are good for most big-data use
cases, like input data for Hive, Pig, analytical map-reduce jobs etc. Page blob
handling in hadoop-azure was introduced to support HBase log files. Page blobs
can be written any number of times, whereas block blobs can only be appended to
50,000 times before you run out of blocks and your writes will fail. That won't
work for HBase logs, so page blob support was introduced to overcome this
limitation.
Page blobs can be used for other purposes beyond just HBase log files though.
Page blobs can be up to 1TB in size, larger than the maximum 200GB size for block
blobs.
In order to have the files you create be page blobs, you must set the
configuration variable `fs.azure.page.blob.dir` to a comma-separated list of
folder names.
For example:
<property>
<name>fs.azure.page.blob.dir</name>
<value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
</property>
You can set this to simply `/` to make all files page blobs.
The configuration option `fs.azure.page.blob.size` is the default initial
size for a page blob. It must be 128MB or greater, and no more than 1TB,
specified as an integer number of bytes.
The configuration option `fs.azure.page.blob.extension.size` is the page blob
extension size. This defines the amount to extend a page blob if it starts to
get full. It must be 128MB or greater, specified as an integer number of bytes.
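No special API call is needed to create a page blob; the blob type is selected
purely by where the file is created. The following is a minimal sketch only,
with a hypothetical `PageBlobExample` class and placeholder account, container,
and path names, and it assumes credentials are configured as described above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PageBlobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Folders whose files should be stored as page blobs.
    conf.set("fs.azure.page.blob.dir", "/hbase/WALs,/hbase/oldWALs");
    FileSystem fs = FileSystem.get(
        URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"),
        conf);
    // Created under /hbase/WALs, so stored as a page blob.
    fs.create(new Path("/hbase/WALs/samplelog")).close();
    // Not under a configured folder, so stored as a block blob (the default).
    fs.create(new Path("/data/part-00000")).close();
  }
}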
### <a name="Atomic_Folder_Rename" />Atomic Folder Rename
Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
of Azure storage. By default, folder rename in the hadoop-azure file system
layer is not atomic. That means that a failure during a folder rename
could, for example, leave some folders in the original directory and
some in the new one.
HBase depends on atomic folder rename. Hence, a configuration setting was
introduced called `fs.azure.atomic.rename.dir` that allows you to specify a
comma-separated list of directories to receive special treatment so that
folder rename is made atomic. The default value of this setting is just
`/hbase`. Redo will be applied to finish a folder rename that fails. A file
`<folderName>-renamePending.json` may appear temporarily and is the record of
the intention of the rename operation, to allow redo in event of a failure.
For example:
<property>
<name>fs.azure.atomic.rename.dir</name>
<value>/hbase,/data</value>
</property>
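From the application's point of view nothing changes: a folder rename is still a
single `FileSystem.rename()` call, and the redo handling described above happens
inside the WASB layer. A brief sketch, again with a hypothetical class name and
placeholder account, container, and paths:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicRenameExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Folder renames under these directories receive the atomic treatment.
    conf.set("fs.azure.atomic.rename.dir", "/hbase,/data");
    FileSystem fs = FileSystem.get(
        URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"),
        conf);
    // An ordinary rename call; any crash recovery via the temporary
    // <folderName>-renamePending.json record is transparent to the caller.
    boolean renamed =
        fs.rename(new Path("/data/staging"), new Path("/data/published"));
    System.out.println("renamed: " + renamed);
  }
}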
### <a name="Accessing_wasb_URLs" />Accessing wasb URLs
After credentials are configured in core-site.xml, any Hadoop component may
reference files in that Azure Blob Storage account by using URLs of the following
format:
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure
Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with
the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access.
For example, the following
[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html)
commands demonstrate access to a storage account named `youraccount` and a
container named `yourcontainer`.
> hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
> hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
> hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
test file content
It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
This causes all bare paths, such as `/testDir/testFile`, to resolve automatically
to that file system.
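Programmatic access works through the same standard `FileSystem` API used for
HDFS. The sketch below is illustrative only: the `WasbExample` class name,
account, container, and key are placeholders, and it assumes the hadoop-azure
jar and its Azure Storage SDK dependency are on the classpath.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credential for the placeholder storage account; normally this lives in
    // core-site.xml rather than in code.
    conf.set("fs.azure.account.key.youraccount.blob.core.windows.net",
        "YOUR ACCESS KEY");
    // Setting fs.defaultFS to a wasb URL lets bare paths like /testDir
    // resolve to the Azure-backed file system.
    conf.set("fs.defaultFS",
        "wasb://yourcontainer@youraccount.blob.core.windows.net");
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/testDir/testFile");
    fs.mkdirs(file.getParent());
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("test file content\n");
    out.close();
    System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes");
    fs.close();
  }
}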
## <a name="Testing_the_hadoop-azure_Module" />Testing the hadoop-azure Module
The hadoop-azure module includes a full suite of unit tests. Most of the tests
will run without additional configuration by running `mvn test`. This includes
tests against mocked storage, which is an in-memory emulation of Azure Storage.
A selection of tests can run against the
[Azure Storage Emulator](http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx)
which is a high-fidelity emulation of live Azure Storage. The emulator is
sufficient for high-confidence testing. The emulator is a Windows executable
that runs on a local machine.
To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
edit `src/test/resources/azure-test.xml` and add the following property:
<property>
<name>fs.azure.test.emulator</name>
<value>true</value>
</property>
There is a known issue when running tests with the emulator. You may see the
following failure message:
com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
To resolve this, restart the Azure Emulator. Ensure it is v3.2 or later.
It's also possible to run tests against a live Azure Storage account by adding
credentials to `src/test/resources/azure-test.xml` and setting
`fs.azure.test.account.name` to the name of the storage account.
For example:
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ACCESS KEY</value>
</property>
<property>
<name>fs.azure.test.account.name</name>
<value>youraccount</value>
</property>