diff --git a/hadoop-common-project/hadoop-common/CHANGES.txt b/hadoop-common-project/hadoop-common/CHANGES.txt
index a02cb81f201..3000b65e850 100644
--- a/hadoop-common-project/hadoop-common/CHANGES.txt
+++ b/hadoop-common-project/hadoop-common/CHANGES.txt
@@ -441,6 +441,9 @@ Release 2.7.0 - UNRELEASED
     HADOOP-11213. Typos in html pages: SecureMode and EncryptedShuffle.
     (Wei Yan via kasha)
 
+    HADOOP-11395. Add site documentation for Azure Storage FileSystem
+    integration. (Chris Nauroth via Arpit Agarwal)
+
   OPTIMIZATIONS
 
     HADOOP-11323. WritableComparator#compare keeps reference to byte array.
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index 8cc2d656fb1..aec33d63f5a 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -136,6 +136,7 @@
+
diff --git a/hadoop-tools/hadoop-azure/README.txt b/hadoop-tools/hadoop-azure/README.txt
deleted file mode 100644
index a1d1a653ed8..00000000000
--- a/hadoop-tools/hadoop-azure/README.txt
+++ /dev/null
@@ -1,166 +0,0 @@
-=============
-Building
-=============
-basic compilation:
-> mvn clean compile test-compile
-
-Compile, run tests and produce jar
-> mvn clean package
-
-=============
-Unit tests
-=============
-Most of the tests will run without additional configuration.
-For complete testing, configuration in src/test/resources is required:
-
-  src/test/resources/azure-test.xml -> Defines Azure storage dependencies, including account information
-
-The other files in src/test/resources do not normally need alteration:
-  log4j.properties -> Test logging setup
-  hadoop-metrics2-azure-file-system.properties -> used to wire up instrumentation for testing
-
-From command-line
-------------------
-Basic execution:
-> mvn test
-
-NOTES:
- - The mvn pom.xml includes src/test/resources in the runtime classpath
 - detailed output (such as log4j) appears in target\surefire-reports\TEST-{testName}.xml
-   including log4j messages.
-
-Run the tests and generate report:
-> mvn site (at least once to setup some basics including images for the report)
-> mvn surefire-report:report (run and produce report)
-> mvn mvn surefire-report:report-only (produce report from last run)
-> mvn mvn surefire-report:report-only -DshowSuccess=false (produce report from last run, only show errors)
-> .\target\site\surefire-report.html (view the report)
-
-Via eclipse
--------------
-Manually add src\test\resources to the classpath for test run configuration:
- - run menu|run configurations|{configuration}|classpath|User Entries|advanced|add folder
-
-Then run via junit test runner.
-NOTE:
- - if you change log4.properties, rebuild the project to refresh the eclipse cache.
-
-Run Tests against Mocked storage.
----------------------------------
-These run automatically and make use of an in-memory emulation of azure storage.
-
-
-Running tests against the Azure storage emulator
----------------------------------------------------
-A selection of tests can run against the Azure Storage Emulator which is
-a high-fidelity emulation of live Azure Storage. The emulator is sufficient for high-confidence testing.
-The emulator is a Windows executable that runs on a local machine.
-
-To use the emulator, install Azure SDK 2.3 and start the storage emulator
-See http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx
-
-Enable the Azure emulator tests by setting
-  fs.azure.test.emulator -> true
-in src\test\resources\azure-test.xml
-
-Known issues:
-  Symptom: When running tests for emulator, you see the following failure message
-    com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
-  Issue: The emulator can get into a confused state.
-  Fix: Restart the Azure Emulator. Ensure it is v3.2 or later.
-
-Running tests against live Azure storage
-------------------------------------------------------------------------
-In order to run WASB unit tests against a live Azure Storage account, add credentials to
-src\test\resources\azure-test.xml. These settings augment the hadoop configuration object.
-
-For live tests, set the following in azure-test.xml:
- 1. "fs.azure.test.account.name -> {azureStorageAccountName}
- 2. "fs.azure.account.key.{AccountName} -> {fullStorageKey}"
-
-===================================
-Page Blob Support and Configuration
-===================================
-
-The Azure Blob Storage interface for Hadoop supports two kinds of blobs, block blobs
-and page blobs. Block blobs are the default kind of blob and are good for most
-big-data use cases, like input data for Hive, Pig, analytical map-reduce jobs etc.
-Page blob handling in hadoop-azure was introduced to support HBase log files.
-Page blobs can be written any number of times, whereas block blobs can only be
-appended to 50,000 times before you run out of blocks and your writes will fail.
-That won't work for HBase logs, so page blob support was introduced to overcome
-this limitation.
-
-Page blobs can be used for other purposes beyond just HBase log files though.
-They support the Hadoop FileSystem interface. Page blobs can be up to 1TB in
-size, larger than the maximum 200GB size for block blobs.
-
-In order to have the files you create be page blobs, you must set the configuration
-variable fs.azure.page.blob.dir to a comma-separated list of folder names.
-E.g.
-
-    /hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles
-
-You can set this to simply / to make all files page blobs.
-
-The configuration option fs.azure.page.blob.size is the default initial
-size for a page blob. It must be 128MB or greater, and no more than 1TB,
-specified as an integer number of bytes.
-
-====================
-Atomic Folder Rename
-====================
-
-Azure storage stores files as a flat key/value store without formal support
-for folders. The hadoop-azure file system layer simulates folders on top
-of Azure storage. By default, folder rename in the hadoop-azure file system
-layer is not atomic. That means that a failure during a folder rename
-could, for example, leave some folders in the original directory and
-some in the new one.
-
-HBase depends on atomic folder rename. Hence, a configuration setting was
-introduced called fs.azure.atomic.rename.dir that allows you to specify a
-comma-separated list of directories to receive special treatment so that
-folder rename is made atomic. The default value of this setting is just /hbase.
-Redo will be applied to finish a folder rename that fails. A file
--renamePending.json may appear temporarily and is the record of
-the intention of the rename operation, to allow redo in event of a failure.
-
-=============
-Findbugs
-=============
-Run findbugs and show interactive GUI for review of problems
-> mvn findbugs:gui
-
-Run findbugs and fail build if errors are found:
-> mvn findbugs:check
-
-For help with findbugs plugin.
-> mvn findbugs:help
-
-=============
-Checkstyle
-=============
-Rules for checkstyle @ src\config\checkstyle.xml
- - these are based on a core set of standards, with exclusions for non-serious issues
 - as a general plan it would be good to turn on more rules over time.
 - Occasionally, run checkstyle with the default Sun rules by editing pom.xml.
-
-Command-line:
-> mvn checkstyle:check --> just test & fail build if violations found
-> mvn site checkstyle:checkstyle --> produce html report
-> . target\site\checkstyle.html --> view report.
-
-Eclipse:
-- add the checkstyle plugin: Help|Install, site=http://eclipse-cs.sf.net/update
-- window|preferences|checkstyle. Add src/config/checkstyle.xml. Set as default.
-- project|properties|create configurations as required, eg src/main/java -> src/config/checkstyle.xml
-
-NOTE:
-- After any change to the checkstyle rules xml, use window|preferences|checkstyle|{refresh}|OK
-
-=============
-Javadoc
-=============
-Command-line
-> mvn javadoc:javadoc
\ No newline at end of file
diff --git a/hadoop-tools/hadoop-azure/src/site/markdown/index.md b/hadoop-tools/hadoop-azure/src/site/markdown/index.md
new file mode 100644
index 00000000000..0d69ccf7370
--- /dev/null
+++ b/hadoop-tools/hadoop-azure/src/site/markdown/index.md
@@ -0,0 +1,243 @@
+
+
+# Hadoop Azure Support: Azure Blob Storage
+
+* [Introduction](#Introduction)
+* [Features](#Features)
+* [Limitations](#Limitations)
+* [Usage](#Usage)
+    * [Concepts](#Concepts)
+    * [Configuring Credentials](#Configuring_Credentials)
+    * [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
+    * [Atomic Folder Rename](#Atomic_Folder_Rename)
+    * [Accessing wasb URLs](#Accessing_wasb_URLs)
+* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)
+
+## Introduction
+
+The hadoop-azure module provides support for integration with
+[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
+The built jar file, named hadoop-azure.jar, also declares transitive dependencies
+on the additional artifacts it requires, notably the
+[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).
+
+## Features
+
+* Read and write data stored in an Azure Blob Storage account.
+* Present a hierarchical file system view by implementing the standard Hadoop
+  [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
+* Supports configuration of multiple Azure Blob Storage accounts.
+* Supports both block blobs (suitable for most use cases, such as MapReduce) and
+  page blobs (suitable for continuous write use cases, such as an HBase
+  write-ahead log).
+* Reference file system paths using URLs with the `wasb` scheme.
+* Also reference file system paths using URLs with the `wasbs` scheme for SSL
+  encrypted access.
+* Can act as a source of data in a MapReduce job, or a sink.
+* Tested on both Linux and Windows.
+* Tested at scale.
+
+## Limitations
+
+* The append operation is not implemented.
+* File owner and group are persisted, but the permissions model is not enforced.
+  Authorization occurs at the level of the entire Azure Blob Storage account.
+* File last access time is not tracked.
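The feature list above notes support for multiple Azure Blob Storage accounts. As a rough, illustrative sketch (an editor's example, not part of the patch itself), the following Java snippet shows how per-account key properties might be set on a `Configuration`, with each `wasb` URL then resolving to its own `FileSystem` instance. The account and container names are placeholders, and the hadoop-azure jar and its dependencies are assumed to be on the classpath.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class MultipleAccountsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical account names; each storage account gets its own key
    // property, normally placed in core-site.xml rather than set in code.
    conf.set("fs.azure.account.key.firstaccount.blob.core.windows.net", "KEY FOR FIRST ACCOUNT");
    conf.set("fs.azure.account.key.secondaccount.blob.core.windows.net", "KEY FOR SECOND ACCOUNT");

    // Each wasb URL resolves to a FileSystem backed by the matching account.
    FileSystem first = FileSystem.get(
        URI.create("wasb://data@firstaccount.blob.core.windows.net/"), conf);
    FileSystem second = FileSystem.get(
        URI.create("wasb://logs@secondaccount.blob.core.windows.net/"), conf);

    System.out.println(first.getUri());
    System.out.println(second.getUri());
  }
}
```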
+
+## Usage
+
+### Concepts
+
+The Azure Blob Storage data model presents 3 core concepts:
+
+* **Storage Account**: All access is done through a storage account.
+* **Container**: A container is a grouping of multiple blobs. A storage account
+  may have multiple containers. In Hadoop, an entire file system hierarchy is
+  stored in a single container. It is also possible to configure multiple
+  containers, effectively presenting multiple file systems that can be referenced
+  using distinct URLs.
+* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs.
+  The internal implementation also uses blobs to persist the file system
+  hierarchy and other metadata.
+
+### Configuring Credentials
+
+Usage of Azure Blob Storage requires configuration of credentials. Typically
+this is set in core-site.xml. The configuration property name is of the form
+`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the
+access key. **The access key is a secret that protects access to your storage
+account. Do not share the access key (or the core-site.xml file) with an
+untrusted party.**
+
+For example:
+
+    <property>
+      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
+      <value>YOUR ACCESS KEY</value>
+    </property>
+
+In many Hadoop clusters, the core-site.xml file is world-readable. If it's
+undesirable for the access key to be visible in core-site.xml, then it's also
+possible to configure it in encrypted form. An additional configuration property
+specifies an external program to be invoked by Hadoop processes to decrypt the
+key. The encrypted key value is passed to this external program as a command
+line argument:
+
+    <property>
+      <name>fs.azure.account.keyprovider.youraccount</name>
+      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
+    </property>
+
+    <property>
+      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
+      <value>YOUR ENCRYPTED ACCESS KEY</value>
+    </property>
+
+    <property>
+      <name>fs.azure.shellkeyprovider.script</name>
+      <value>PATH TO DECRYPTION PROGRAM</value>
+    </property>
+
+### Page Blob Support and Configuration
+
+The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
+[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
+Block blobs are the default kind of blob and are good for most big-data use
+cases, like input data for Hive, Pig, analytical map-reduce jobs etc. Page blob
+handling in hadoop-azure was introduced to support HBase log files. Page blobs
+can be written any number of times, whereas block blobs can only be appended to
+50,000 times before you run out of blocks and your writes will fail. That won't
+work for HBase logs, so page blob support was introduced to overcome this
+limitation.
+
+Page blobs can be used for other purposes beyond just HBase log files though.
+Page blobs can be up to 1TB in size, larger than the maximum 200GB size for block
+blobs.
+
+In order to have the files you create be page blobs, you must set the
+configuration variable `fs.azure.page.blob.dir` to a comma-separated list of
+folder names.
+
+For example:
+
+    <property>
+      <name>fs.azure.page.blob.dir</name>
+      <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
+    </property>
+
+You can set this to simply / to make all files page blobs.
+
+The configuration option `fs.azure.page.blob.size` is the default initial
+size for a page blob. It must be 128MB or greater, and no more than 1TB,
+specified as an integer number of bytes.
+
+The configuration option `fs.azure.page.blob.extension.size` is the page blob
+extension size. This defines the amount to extend a page blob if it starts to
+get full. It must be 128MB or greater, specified as an integer number of bytes.
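To make the mapping from these properties to code concrete, here is a small illustrative Java sketch (an editor's example, not part of the patch) that sets the account key and `fs.azure.page.blob.dir` programmatically and writes a file under one of the configured folders so that it is stored as a page blob. The account, container, and folder names are placeholders; in a real deployment these properties would normally live in core-site.xml.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PageBlobWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder account key; keep real keys out of source code.
    conf.set("fs.azure.account.key.youraccount.blob.core.windows.net", "YOUR ACCESS KEY");
    // Files created under any of these folders are stored as page blobs.
    conf.set("fs.azure.page.blob.dir", "/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles");

    FileSystem fs = FileSystem.get(
        URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"), conf);

    // This path falls under a configured page blob directory, so the new file
    // is created as a page blob rather than a block blob.
    try (FSDataOutputStream out = fs.create(new Path("/data/mypageblobfiles/sample.log"))) {
      out.writeBytes("written to a page blob\n");
    }
  }
}
```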
+
+### Atomic Folder Rename
+
+Azure storage stores files as a flat key/value store without formal support
+for folders. The hadoop-azure file system layer simulates folders on top
+of Azure storage. By default, folder rename in the hadoop-azure file system
+layer is not atomic. That means that a failure during a folder rename
+could, for example, leave some folders in the original directory and
+some in the new one.
+
+HBase depends on atomic folder rename. Hence, a configuration setting was
+introduced called `fs.azure.atomic.rename.dir` that allows you to specify a
+comma-separated list of directories to receive special treatment so that
+folder rename is made atomic. The default value of this setting is just
+`/hbase`. Redo will be applied to finish a folder rename that fails. A file
+`<folderName>-renamePending.json` may appear temporarily and is the record of
+the intention of the rename operation, to allow redo in event of a failure.
+
+For example:
+
+    <property>
+      <name>fs.azure.atomic.rename.dir</name>
+      <value>/hbase,/data</value>
+    </property>
+
+### Accessing wasb URLs
+
+After credentials are configured in core-site.xml, any Hadoop component may
+reference files in that Azure Blob Storage account by using URLs of the following
+format:
+
+    wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
+
+The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure
+Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with
+the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access.
+
+For example, the following
+[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html)
+commands demonstrate access to a storage account named `youraccount` and a
+container named `yourcontainer`.
+
+    > hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
+
+    > hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
+
+    > hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
+    test file content
+
+It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
+This causes all bare paths, such as `/testDir/testFile`, to resolve automatically
+to that file system.
+
+## Testing the hadoop-azure Module
+
+The hadoop-azure module includes a full suite of unit tests. Most of the tests
+will run without additional configuration by running `mvn test`. This includes
+tests against mocked storage, which is an in-memory emulation of Azure Storage.
+
+A selection of tests can run against the
+[Azure Storage Emulator](http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx)
+which is a high-fidelity emulation of live Azure Storage. The emulator is
+sufficient for high-confidence testing. The emulator is a Windows executable
+that runs on a local machine.
+
+To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
+edit `src/test/resources/azure-test.xml` and add the following property:
+
+    <property>
+      <name>fs.azure.test.emulator</name>
+      <value>true</value>
+    </property>
+
+There is a known issue when running tests with the emulator. You may see the
+following failure message:
+
+    com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
+
+To resolve this, restart the Azure Emulator. Ensure it is v3.2 or later.
+
+It's also possible to run tests against a live Azure Storage account by adding
+credentials to `src/test/resources/azure-test.xml` and setting
+`fs.azure.test.account.name` to the name of the storage account.
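For comparison with the shell commands shown in the "Accessing wasb URLs" section, a hypothetical Java equivalent using the standard `FileSystem` API could look like the sketch below (again an editor's illustration, not part of the patch). It assumes the key for `youraccount` is already configured in core-site.xml and reuses the same placeholder account and container names.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbShellEquivalent {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(
        "wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile");

    FileSystem fs = FileSystem.get(file.toUri(), conf);

    fs.mkdirs(file.getParent());                      // hadoop fs -mkdir .../testDir
    try (FSDataOutputStream out = fs.create(file)) {  // hadoop fs -put testFile .../testFile
      out.writeBytes("test file content\n");
    }
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), "UTF-8"))) {  // hadoop fs -cat .../testFile
      System.out.println(in.readLine());
    }
  }
}
```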
+
+For example:
+
+    <property>
+      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
+      <value>YOUR ACCESS KEY</value>
+    </property>
+
+    <property>
+      <name>fs.azure.test.account.name</name>
+      <value>youraccount</value>
+    </property>
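As an illustrative sketch only (not one of the module's actual tests), a JUnit 4 test guarding itself against a missing live-account configuration might load `azure-test.xml`, skip when `fs.azure.test.account.name` is unset, and probe the root of a container. The container name `testcontainer` is a hypothetical assumption.

```java
import static org.junit.Assert.assertTrue;
import static org.junit.Assume.assumeNotNull;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class LiveAccountSmokeTest {
  @Test
  public void rootOfContainerIsReachable() throws Exception {
    Configuration conf = new Configuration();
    conf.addResource("azure-test.xml"); // loaded from src/test/resources on the test classpath

    // Skip, rather than fail, when no live account has been configured.
    String account = conf.get("fs.azure.test.account.name");
    assumeNotNull(account);

    // "testcontainer" is a hypothetical container assumed to exist in the account.
    FileSystem fs = FileSystem.get(
        URI.create("wasb://testcontainer@" + account + ".blob.core.windows.net/"), conf);
    assertTrue(fs.getFileStatus(new Path("/")).isDirectory());
  }
}
```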