From 6f65994f51f0cd95a791310fbc4d013b82b199d6 Mon Sep 17 00:00:00 2001 From: Doug Meil Date: Sun, 16 Oct 2011 14:16:31 +0000 Subject: [PATCH] HBASE-4598 book update (book.xml, perf.xml, trouble.xml) git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1184830 13f79535-47bb-0310-9956-ffa450edef68 --- src/docbkx/book.xml | 34 ++++++++++++++++++++++++++------ src/docbkx/performance.xml | 36 ++++++++++++++++++++++++++-------- src/docbkx/troubleshooting.xml | 7 +++++++ 3 files changed, 63 insertions(+), 14 deletions(-) diff --git a/src/docbkx/book.xml b/src/docbkx/book.xml index 62a3514c27e..21348888c35 100644 --- a/src/docbkx/book.xml +++ b/src/docbkx/book.xml @@ -1316,7 +1316,7 @@ scan.setFilter(filter);
Master HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is - the interface for all metadata changes. + the interface for all metadata changes. In a distributed cluster, the Master typically runs on the .
Startup Behavior If run in a multi-Master environment, all Masters compete to run the cluster. If the active @@ -1352,7 +1352,8 @@ scan.setFilter(filter);
RegionServer - HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. + HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. + In a distributed cluster, a RegionServer runs on a .
Interface The methods exposed by HRegionRegionInterface contain both data-oriented and region-maintenance methods: @@ -1711,6 +1712,27 @@ scan.setFilter(filter);
+ +
HDFS + As HBase runs on HDFS (and each StoreFile is written as a file on HDFS), + it is important to have an understanding of the HDFS Architecture + especially in terms of how it stores files, handles failovers, and replicates blocks. + + See the Hadoop documentation on HDFS Architecture + for more information. + +
NameNode + The NameNode is responsible for maintaining the filesystem metadata. See the above HDFS Architecture link + for more information. + +
+
DataNode + The DataNodes are responsible for storing HDFS blocks. See the above HDFS Architecture link + for more information. + +
+
+ @@ -1889,15 +1911,15 @@ hbase> describe 't1' - EC2 + Amazon EC2 - Why doesn't my remote java connection into my ec2 cluster work? + I am running HBase on Amazon EC2 and... - See Andrew's answer here, up on the user list: Remote Java client connection into EC2 instance. - + See Troubleshooting and Performance sections. + diff --git a/src/docbkx/performance.xml b/src/docbkx/performance.xml index 6c0a0bbd33c..1dfd4db8a36 100644 --- a/src/docbkx/performance.xml +++ b/src/docbkx/performance.xml @@ -409,15 +409,35 @@ htable.close();
+ +
HDFS + Because HBase runs on it is important to understand how it works and how it affects + HBase. + +
Current Issues With Low-Latency Reads + The original use-case for HDFS was batch processing. As such, low-latency reads were historically not a priority. + With the increased adoption of HBase this is changing, and several improvements are already in development. + See the + Umbrella Jira Ticket for HDFS Improvements for HBase. + +
+
Performance Comparisons of HBase vs. HDFS + A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as + a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, + returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this + processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS + will always be faster in this use-case. + +
+
Amazon EC2 - Performance questions are common on Amazon EC2 environments because it is is a shared environment. You will - not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same - reason (i.e., it's a shared environment and you don't know what else is happening on the server). - - If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front that - because EC2 issues are practically a separate class of performance issues. - - + Performance questions are common on Amazon EC2 environments because it is a shared environment. You will + not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same + reason (i.e., it's a shared environment and you don't know what else is happening on the server). + + If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front + because EC2 issues are practically a separate class of performance issues. +
diff --git a/src/docbkx/troubleshooting.xml b/src/docbkx/troubleshooting.xml index 1ba03fe0045..fd757d489f2 100644 --- a/src/docbkx/troubleshooting.xml +++ b/src/docbkx/troubleshooting.xml @@ -793,6 +793,13 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi Questions on HBase and Amazon EC2 come up frequently on the HBase dist-list. Search for old threads using Search Hadoop +
+ Remote Java Connection into EC2 Cluster Not Working + + See Andrew's answer here, up on the user list: Remote Java client connection into EC2 instance. + +
+