HBASE-15646 Add some docs about exporting and importing snapshots using S3

Misty Stanley-Jones 2016-04-13 12:14:29 -07:00
parent a050e1d9f8
commit 92f5595e7e
2 changed files with 99 additions and 0 deletions

@@ -1111,6 +1111,37 @@ Only a subset of all configurations can currently be changed in the running serv
Here is an incomplete list: `hbase.regionserver.thread.compaction.large`, `hbase.regionserver.thread.compaction.small`, `hbase.regionserver.thread.split`, `hbase.regionserver.thread.merge`, as well as compaction policy and configurations and adjustment to offpeak hours.
For the full list consult the patch attached to link:https://issues.apache.org/jira/browse/HBASE-12147[HBASE-12147 Porting Online Config Change from 89-fb].
[[amazon_s3_configuration]]
== Using Amazon S3 Storage
HBase is designed to be tightly coupled with HDFS, and testing of other filesystems
has not been thorough.
The following limitations have been reported:
- RegionServers should be deployed in Amazon EC2 to mitigate latency and bandwidth
limitations when accessing the filesystem, and RegionServers must remain available
to preserve data locality.
- The S3 client buffers each inbound and outbound file on local disk, which adds
overhead to each operation.
- The best performance is achieved when all clients and servers are in the Amazon
cloud, rather than a heterogeneous architecture.
- You must be aware of the location of `hadoop.tmp.dir` so that the local `/tmp/`
directory is not filled to capacity (see the configuration sketch after this list).
- HBase has a different file usage pattern than MapReduce jobs and has been optimized for
HDFS, rather than distant networked storage.
- The `s3a://` protocol is strongly recommended. The `s3n://` and `s3://` protocols have serious
limitations and do not use the Amazon AWS SDK. The `s3a://` protocol is supported
for use with HBase if you use Hadoop 2.6.1 or higher with HBase 1.2 or higher. Hadoop
2.6.0 is not supported with HBase at all.
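
The following is a minimal, hypothetical `core-site.xml` sketch for enabling `s3a://`
access. The property names are standard Hadoop `s3a` settings, but all values are
placeholders to adapt to your environment; prefer a Hadoop credential provider over
plaintext keys where possible.

----
<!-- Hypothetical core-site.xml excerpt; all values are placeholders. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
<!-- Local directory where the s3a client buffers files before upload.
     Point it at a volume with enough space so /tmp is not filled. -->
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/mnt/s3a-buffer</value>
</property>
----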
Configuration details for Amazon S3 and associated Amazon services such as EMR are
out of the scope of the HBase documentation. See the
link:https://wiki.apache.org/hadoop/AmazonS3[Hadoop Wiki entry on Amazon S3 Storage]
and
link:http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hbase.html[Amazon's documentation for deploying HBase in EMR].
One use case that is well-suited for Amazon S3 is storing snapshots. See <<snapshots_s3>>.
ifdef::backend-docbook[]
[index]
== Index

@@ -2050,6 +2050,74 @@ The following example limits the above example to 200 MB/sec.
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
----
[[snapshots_s3]]
=== Storing Snapshots in an Amazon S3 Bucket
For general information and limitations of using Amazon S3 storage with HBase, see
<<amazon_s3_configuration>>. You can also store and retrieve snapshots from Amazon
S3 using the following procedure.
NOTE: You can also store snapshots in Microsoft Azure Blob Storage. See <<snapshots_azure>>.
.Prerequisites
- You must be using HBase 1.0 or higher and Hadoop 2.6.1 or higher. Hadoop 2.6.1 is
the first version whose S3 support uses the Amazon AWS SDK.
- You must use the `s3a://` protocol to connect to Amazon S3. The older `s3n://`
and `s3://` protocols have various limitations and do not use the Amazon AWS SDK.
- The `s3a://` URI must be configured and available on the server where you run
the commands to export and restore the snapshot (a quick check is shown after this list).
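
As a quick check that the `s3a://` URI is usable from that server, you can list the
destination path with the standard Hadoop shell. The bucket and namespace below are
placeholders:

----
$ hadoop fs -ls s3a://<bucket>/<namespace>/
----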
After you have fulfilled the prerequisites, take the snapshot like you normally would.
Afterward, you can export it using the `org.apache.hadoop.hbase.snapshot.ExportSnapshot`
command like the one below, substituting your own `s3a://` path in the `copy-from`
or `copy-to` directive and substituting or modifying other options as required:
----
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot MySnapshot \
    -copy-from hdfs://srv2:8082/hbase \
    -copy-to s3a://<bucket>/<namespace>/hbase \
    -chuser MyUser \
    -chgroup MyGroup \
    -chmod 700 \
    -mappers 16
----
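
To import the snapshot from Amazon S3 back into HDFS, reverse the `-copy-from` and
`-copy-to` directives: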
----
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot MySnapshot \
    -copy-from s3a://<bucket>/<namespace>/hbase \
    -copy-to hdfs://srv2:8082/hbase \
    -chuser MyUser \
    -chgroup MyGroup \
    -chmod 700 \
    -mappers 16
----
You can also use the `org.apache.hadoop.hbase.snapshot.SnapshotInfo` utility with the `s3a://` path by including the
`-remote-dir` option.
----
$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo \
    -remote-dir s3a://<bucket>/<namespace>/hbase \
    -list-snapshots
----
[[snapshots_azure]]
=== Storing Snapshots in Microsoft Azure Blob Storage
You can store snapshots in Microsoft Azure Blob Storage using the same techniques
as in <<snapshots_s3>>.
.Prerequisites
- You must be using HBase 1.2 or higher with Hadoop 2.7.1 or
higher. No version of HBase supports Hadoop 2.7.0.
- Your hosts must be configured to be aware of the Azure blob storage filesystem.
See link:http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html[Hadoop Azure Support].
After you meet the prerequisites, follow the instructions
in <<snapshots_s3>>, replacing the protocol specifier with `wasb://` or `wasbs://`.
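
For example, an export to Azure Blob Storage might look like the following sketch. The
container and storage account names are placeholders, and your `core-site.xml` must
already carry the credentials for the storage account, as described in the Hadoop
documentation linked above:

----
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot MySnapshot \
    -copy-from hdfs://srv2:8082/hbase \
    -copy-to wasb://<container>@<account>.blob.core.windows.net/hbase \
    -mappers 16
----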
[[ops.capacity]]
== Capacity Planning and Region Sizing