From 92f5595e7edd805f092f5e18352a012207d64fe2 Mon Sep 17 00:00:00 2001
From: Misty Stanley-Jones
Date: Wed, 13 Apr 2016 12:14:29 -0700
Subject: [PATCH] HBASE-15646 Add some docs about exporting and importing snapshots using S3

---
 .../asciidoc/_chapters/configuration.adoc | 31 +++++++++
 src/main/asciidoc/_chapters/ops_mgt.adoc  | 68 +++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/src/main/asciidoc/_chapters/configuration.adoc b/src/main/asciidoc/_chapters/configuration.adoc
index d705db97be1..4702bcb6dd4 100644
--- a/src/main/asciidoc/_chapters/configuration.adoc
+++ b/src/main/asciidoc/_chapters/configuration.adoc
@@ -1111,6 +1111,37 @@ Only a subset of all configurations can currently be changed in the running serv
Here is an incomplete list: `hbase.regionserver.thread.compaction.large`, `hbase.regionserver.thread.compaction.small`, `hbase.regionserver.thread.split`, `hbase.regionserver.thread.merge`, as well as compaction policy and configurations and adjustment to offpeak hours.
For the full list consult the patch attached to link:https://issues.apache.org/jira/browse/HBASE-12147[HBASE-12147 Porting Online Config Change from 89-fb].

[[amazon_s3_configuration]]
== Using Amazon S3 Storage

HBase is designed to be tightly coupled with HDFS, and testing of other filesystems
has not been thorough.

The following limitations have been reported:

- RegionServers should be deployed in Amazon EC2 to mitigate latency and bandwidth
limitations when accessing the filesystem, and RegionServers must remain available
to preserve data locality.
- The Hadoop S3 filesystem clients write each inbound and outbound file to local disk,
which adds overhead to each operation.
- The best performance is achieved when all clients and servers are in the Amazon
cloud, rather than in a heterogeneous architecture.
- You must be aware of the location of `hadoop.tmp.dir` so that the local `/tmp/`
directory is not filled to capacity.
- HBase has a different file usage pattern than MapReduce jobs do and has been optimized
for HDFS, rather than for distant networked storage.
- The `s3a://` protocol is strongly recommended. The `s3n://` and `s3://` protocols have
serious limitations and do not use the Amazon AWS SDK. The `s3a://` protocol is supported
for use with HBase if you use Hadoop 2.6.1 or higher with HBase 1.2 or higher. Hadoop
2.6.0 is not supported with HBase at all.

Configuration details for Amazon S3 and associated Amazon services such as EMR are
out of the scope of the HBase documentation. See the
link:https://wiki.apache.org/hadoop/AmazonS3[Hadoop Wiki entry on Amazon S3 Storage]
and
link:http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hbase.html[Amazon's documentation for deploying HBase in EMR].

One use case that is well-suited for Amazon S3 is storing snapshots. See <<snapshots_s3>>.

ifdef::backend-docbook[]
[index]
== Index
diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc b/src/main/asciidoc/_chapters/ops_mgt.adoc
index 583a8727630..bc7595121a4 100644
--- a/src/main/asciidoc/_chapters/ops_mgt.adoc
+++ b/src/main/asciidoc/_chapters/ops_mgt.adoc
@@ -2050,6 +2050,74 @@ The following example limits the above example to 200 MB/sec.
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
----

[[snapshots_s3]]
=== Storing Snapshots in an Amazon S3 Bucket

For general information and limitations of using Amazon S3 storage with HBase, see
<<amazon_s3_configuration>>.
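How you make the `s3a://` connector available is a Hadoop configuration concern rather
than an HBase one, and the details vary by Hadoop version and deployment. As a minimal
sketch only, assuming the standard `hadoop-aws` credential properties and placeholder
values for the keys, the client configuration could be supplied in `hbase-site.xml` or
`core-site.xml` on the host where you run the export and import commands:

----
<!-- Illustrative sketch only: standard hadoop-aws (s3a) credential properties. -->
<!-- Replace the placeholder values with your own credentials; for production,  -->
<!-- prefer a Hadoop credential provider over plain-text keys.                  -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
----

A quick way to confirm that the connector works is to list the bucket with the Hadoop
shell, for example `hadoop fs -ls s3a://<bucket>/` (where `<bucket>` is a placeholder
for your bucket name).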
You can store and retrieve snapshots from Amazon S3 using the following procedure.

NOTE: You can also store snapshots in Microsoft Azure Blob Storage. See <<snapshots_azure>>.

.Prerequisites
- You must be using HBase 1.0 or higher and Hadoop 2.6.1 or higher, which is the first
configuration that uses the Amazon AWS SDK.
- You must use the `s3a://` protocol to connect to Amazon S3. The older `s3n://`
and `s3://` protocols have various limitations and do not use the Amazon AWS SDK.
- The `s3a://` URI must be configured and available on the server where you run
the commands to export and restore the snapshot.

After you have fulfilled the prerequisites, take the snapshot like you normally would.
Afterward, you can export it using the `org.apache.hadoop.hbase.snapshot.ExportSnapshot`
command like the one below, substituting your own `s3a://` path in the `copy-from`
or `copy-to` directive and substituting or modifying other options as required:

----
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot MySnapshot \
    -copy-from hdfs://srv2:8082/hbase \
    -copy-to s3a://<bucket>/<namespace>/hbase \
    -chuser MyUser \
    -chgroup MyGroup \
    -chmod 700 \
    -mappers 16
----

To bring an exported snapshot back from Amazon S3 so that it can be restored, reverse
the direction of the copy, reading from the `s3a://` path and writing to HDFS:

----
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot MySnapshot \
    -copy-from s3a://<bucket>/<namespace>/hbase \
    -copy-to hdfs://srv2:8082/hbase \
    -chuser MyUser \
    -chgroup MyGroup \
    -chmod 700 \
    -mappers 16
----

You can also use the `org.apache.hadoop.hbase.snapshot.SnapshotInfo` utility with the
`s3a://` path by including the `-remote-dir` option.

----
$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo \
    -remote-dir s3a://<bucket>/<namespace>/hbase \
    -list-snapshots
----

[[snapshots_azure]]
=== Storing Snapshots in Microsoft Azure Blob Storage

You can store snapshots in Microsoft Azure Blob Storage using the same techniques
as in <<snapshots_s3>>.

.Prerequisites
- You must be using HBase 1.2 or higher with Hadoop 2.7.1 or
higher. No version of HBase supports Hadoop 2.7.0.
- Your hosts must be configured to be aware of the Azure blob storage filesystem.
See http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html.

After you meet the prerequisites, follow the instructions
in <<snapshots_s3>>, replacing the protocol specifier with `wasb://` or `wasbs://`.

[[ops.capacity]]
== Capacity Planning and Region Sizing