Added note on hlog tool, that it can be used to look at files in recovered edits file

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1001907 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Michael Stack 2010-09-27 20:58:26 +00:00
parent 1339f395d6
commit 5efa0ba9c9
1 changed file with 42 additions and 47 deletions


@@ -66,54 +66,48 @@
<para>TODO: Review all of the below to ensure it matches what was
committed -- St.Ack 20100901</para>
</note>
<section>
<title>Region Size</title>
<para>Region size is one of those tricky things; there are a few factors
to consider:</para>
<itemizedlist>
<listitem>
<para>Regions are the basic element of availability and
distribution.</para>
</listitem>
<listitem>
<para>HBase scales by having regions across many servers. Thus if
you have only 2 regions for 16GB of data on a 20-node cluster, most of
your nodes are sitting idle.</para>
</listitem>
<listitem>
<para>High region count has been known to make things slow; this is
getting better, but it is probably better to have 700 regions than
3000 for the same amount of data.</para>
</listitem>
<listitem>
<para>Low region count prevents the parallel scalability described in
point #2. This really can't be stressed enough, since a common problem
is loading 200MB of data into HBase and then wondering why your awesome
10-node cluster is mostly idle.</para>
</listitem>
<listitem>
<para>There is not much memory-footprint difference between 1 region
and 10 in terms of indexes, etc., held by the regionserver.</para>
</listitem>
</itemizedlist>
<para>It's probably best to stick to the default, perhaps going smaller
for hot tables (or manually splitting hot regions to spread the load over
the cluster), or going with a 1GB region size if your cell sizes tend to
be largish (100k and up).</para>
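<para>As a sketch of the latter approach, the maximum region size can be
raised cluster-wide via the <varname>hbase.hregion.max.filesize</varname>
property in <filename>hbase-site.xml</filename>; the 1GB value below is
illustrative, so verify the property name and default against your
release:
<programlisting>
&lt;property&gt;
  &lt;name&gt;hbase.hregion.max.filesize&lt;/name&gt;
  &lt;!-- 1GB; a region splits once a store file grows past this size --&gt;
  &lt;value&gt;1073741824&lt;/value&gt;
&lt;/property&gt;
</programlisting></para>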
</section>
<section>
@@ -739,10 +733,11 @@ if your cell sizes tend to be largish (100k and up).
<title>WAL Tools</title>
<section>
<title><classname>HLog</classname> tool</title>
<para>The main method on <classname>HLog</classname> offers manual
split and dump facilities. Pass it WALs or the product of a split, the
content of the <filename>recovered.edits</filename> directory.</para>
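<para>For example, the same main method can force a split of a log file
directory; the <option>--split</option> flag and path below are a sketch
modeled on the dump invocation that follows, so check the usage string
printed by the class for your release:
<programlisting> <code>$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:9000/hbase/.logs/example.org,60020,1283516293161/</code> </programlisting></para>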
<para>You can get a textual dump of a WAL file's content by doing the
following:<programlisting> <code>$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:9000/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012</code> </programlisting>The