HBASE-11400 [docs] edit, consolidate, and update compression and data encoding docs (Misty Stanley-Jones)
@@ -4387,230 +4387,451 @@ This option should not normally be used, and it is not in <code>-fixAll</code>.
    </section>
  </appendix>
  <appendix xml:id="compression">
    <title>Compression and Data Block Encoding In
      HBase<indexterm><primary>Compression</primary><secondary>Data Block
          Encoding</secondary><seealso>codecs</seealso></indexterm></title>
    <note>
      <para>Codecs mentioned in this section are for encoding and decoding data blocks or row keys.
        For information about replication codecs, see <xref
          linkend="cluster.replication.preserving.tags" />.</para>
    </note>
    <para>HBase offers a number of compression options. Some codecs ship with Java -- for example,
      gzip -- and so require no additional installation. Others require native libraries. A native
      library may already be available in your Hadoop installation, as is the case with LZ4; then
      it is just a matter of making sure the Hadoop native .so is visible to HBase. Other codecs
      require extra work to make them accessible, for example when a codec carries a license that
      is incompatible with Apache's, so Hadoop cannot bundle the library.</para>
    <para>Below we discuss what is necessary for the common codecs. Whichever codec you use, be
      sure to test that it is installed properly and is available on all nodes that make up your
      cluster, and add an operational step that checks for the codec whenever you add new nodes to
      the cluster. The <xref linkend="compression.test" /> tool discussed below can help verify
      that a codec is properly installed.</para>
    <para>As to which codec to use, there is some helpful discussion in <link
        xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression
        and codecs</link>.</para>
|
||||
<para>Some of the information in this section is pulled from a <link
|
||||
xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1/v=threaded">discussion</link> on the
|
||||
HBase Development mailing list.</para>
|
||||
<para>HBase supports several different compression algorithms which can be enabled on a
|
||||
ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking
|
||||
advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys
|
||||
and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in
|
||||
cells, and can significantly reduce the storage space needed to store uncompressed
|
||||
data.</para>
|
||||
<para>Compressors and data block encoding can be used together on the same ColumnFamily.</para>
|
||||
|
||||
<formalpara>
|
||||
<title>Changes Take Effect Upon Compaction</title>
|
||||
<para>If you change compression or encoding for a ColumnFamily, the changes take effect during
|
||||
compaction.</para>
|
||||
</formalpara>
|
||||
|
||||
<section xml:id="compression.test">
|
||||
<title>CompressionTest Tool</title>
|
||||
<para>
|
||||
HBase includes a tool to test compression is set up properly.
|
||||
To run it, type <code>/bin/hbase org.apache.hadoop.hbase.util.CompressionTest</code>.
|
||||
This will emit usage on how to run the tool.
|
||||
</para>
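      <para>As a quick sketch of what a run looks like (the HDFS path below is a placeholder;
        substitute a path your cluster can reach, and any supported codec name can be given as the
        last argument):</para>
      <programlisting>$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase gz
$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy</programlisting>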
|
||||
<note><title>You need to restart regionserver for it to pick up changes!</title>
|
||||
<para>Be aware that the regionserver caches the result of the compression check it runs
|
||||
ahead of each region open. This means that you will have to restart the regionserver
|
||||
for it to notice that you have fixed any codec issues; e.g. changed symlinks or
|
||||
moved lib locations under HBase.</para>
|
||||
</note>
|
||||
<note xml:id="hbase.native.platform"><title>On the location of native libraries</title>
|
||||
<para>Hadoop looks in <filename>lib/native</filename> for .so files. HBase looks in
|
||||
<filename>lib/native/PLATFORM</filename>. See the <command>bin/hbase</command>.
|
||||
View the file and look for <varname>native</varname>. See how we
|
||||
do the work to find out what platform we are running on running a little java program
|
||||
<classname>org.apache.hadoop.util.PlatformName</classname> to figure it out.
|
||||
We'll then add <filename>./lib/native/PLATFORM</filename> to the
|
||||
<varname>LD_LIBRARY_PATH</varname> environment for when the JVM starts.
|
||||
The JVM will look in here (as well as in any other dirs specified on LD_LIBRARY_PATH)
|
||||
for codec native libs. If you are unable to figure your 'platform', do:
|
||||
<programlisting>$ ./bin/hbase org.apache.hadoop.util.PlatformName</programlisting>.
|
||||
An example platform would be <varname>Linux-amd64-64</varname>.
|
||||
</para>
|
||||
</note>
|
||||
<para>Some codecs take advantage of capabilities built into Java, such as GZip compression.
|
||||
Others rely on native libraries. Native libraries may be available as part of Hadoop, such as
|
||||
LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs,
|
||||
such as Google Snappy, need to be installed first. Some codecs are licensed in ways that
|
||||
conflict with HBase's license and cannot be shipped as part of HBase.</para>
|
||||
|
||||
<para>This section discusses common codecs that are used and tested with HBase. No matter what
|
||||
codec you use, be sure to test that it is installed correctly and is available on all nodes in
|
||||
your cluster. Extra operational steps may be necessary to be sure that codecs are available on
|
||||
newly-deployed nodes. You can use the <xref
|
||||
linkend="compression.test" /> utility to check that a given codec is correctly
|
||||
installed.</para>
|
||||
|
||||
<para>To configure HBase to use a compressor, see <xref
|
||||
linkend="compressor.install" />. To enable a compressor for a ColumnFamily, see <xref
|
||||
linkend="changing.compression" />. To enable data block encoding for a ColumnFamily, see
|
||||
<xref linkend="data.block.encoding.enable" />.</para>
|
||||
<itemizedlist>
|
||||
<title>Block Compressors</title>
|
||||
<listitem>
|
||||
<para>none</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Snappy</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>LZO</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>LZ4</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>GZ</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
|
||||
|
||||
<itemizedlist>
|
||||
<title>Data Block Encoding Types</title>
|
||||
<listitem>
|
||||
<para>Prefix - Often, keys are very similar. Specifically, keys often share a common prefix
|
||||
and only differ near the end. For instance, one key might be
|
||||
<literal>RowKey:Family:Qualifier0</literal> and the next key might be
|
||||
<literal>RowKey:Family:Qualifier1</literal>. In Prefix encoding, an extra column is
|
||||
added which holds the length of the prefix shared between the current key and the previous
|
||||
key. Assuming the first key here is totally different from the key before, its prefix
|
||||
length is 0. The second key's prefix length is <literal>23</literal>, since they have the
|
||||
first 23 characters in common.</para>
|
||||
<para>Obviously if the keys tend to have nothing in common, Prefix will not provide much
|
||||
benefit.</para>
|
||||
<para>The following image shows a hypothetical ColumnFamily with no data block encoding.</para>
|
||||
<figure>
|
||||
<title>ColumnFamily with No Encoding</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="data_block_no_encoding.png" width="800"/>
|
||||
</imageobject>
|
||||
<textobject><para></para>
|
||||
</textobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
<para>Here is the same data with prefix data encoding.</para>
|
||||
<figure>
|
||||
<title>ColumnFamily with Prefix Encoding</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="data_block_prefix_encoding.png" width="800"/>
|
||||
</imageobject>
|
||||
<textobject><para></para>
|
||||
</textobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
</listitem>
|
||||
<listitem>
|
||||
          <para>Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key
            sequentially as a monolithic series of bytes, each key field is split so that each part of
            the key can be compressed more efficiently. Two new fields are added: timestamp and type.
            If the ColumnFamily is the same as the previous row, it is omitted from the current row.
            If the key length, value length or type are the same as the previous row, the field is
            omitted. In addition, for increased compression, the timestamp is stored as a Diff from
            the previous row's timestamp, rather than being stored in full. Given the two row keys in
            the Prefix example, and given an exact match on timestamp and the same type, neither the
            value length nor the type needs to be stored for the second row, and the timestamp value for
            the second row is just 0, rather than a full timestamp.</para>
|
||||
<para>Diff encoding is disabled by default because writing and scanning are slower but more
|
||||
data is cached.</para>
|
||||
<para>This image shows the same ColumnFamily from the previous images, with Diff encoding.</para>
|
||||
<figure>
|
||||
<title>ColumnFamily with Diff Encoding</title>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="data_block_diff_encoding.png" width="800"/>
|
||||
</imageobject>
|
||||
<textobject><para></para>
|
||||
</textobject>
|
||||
</mediaobject>
|
||||
</figure>
|
||||
</listitem>
|
||||
<listitem>
|
||||
          <para>Fast Diff - Fast Diff works similarly to Diff, but uses a faster implementation. It also
            adds another field which stores a single bit to track whether the data itself is the same
            as the previous row. If it is, the data is not stored again. Fast Diff is the recommended
            codec to use if you have long keys or many columns. The data format is nearly identical to
            Diff encoding, so there is not an image to illustrate it.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It
|
||||
provides similar memory savings to the Prefix, Diff, and Fast Diff encoder, but provides
|
||||
faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate
|
||||
for applications that have high block cache hit ratios. It introduces new 'tree' fields
|
||||
for the row and column. The row tree field contains a list of offsets/references
|
||||
corresponding to the cells in that row. This allows for a good deal of compression. For
|
||||
more details about Prefix Tree encoding, see <link
|
||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-4676">HBASE-4676</link>. It is
|
||||
difficult to graphically illustrate a prefix tree, so no image is included. See the
|
||||
Wikipedia article for <link
|
||||
xlink:href="http://en.wikipedia.org/wiki/Trie">Trie</link> for more general information
|
||||
about this data structure.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
|
||||
    <section>
      <title>Which Compressor or Data Block Encoder To Use</title>
      <para>The compression or codec type to use depends on the characteristics of your data.
        Choosing the wrong type could cause your data to take more space rather than less, and can
        have performance implications. In general, you need to weigh your options between smaller
        size and faster compression/decompression. Following are some general guidelines, expanded
        from a discussion at <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting
        Guidance on compression and codecs</link>; a combined example follows the list.</para>
|
||||
      <itemizedlist>
        <listitem>
          <para>If you have long keys (compared to the values) or many columns, use a prefix
            encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree
            encoding.</para>
        </listitem>
        <listitem>
          <para>If the values are large (and not precompressed, such as images), use a data block
            compressor.</para>
        </listitem>
        <listitem>
          <para>Use GZIP for <firstterm>cold data</firstterm>, which is accessed infrequently. GZIP
            compression uses more CPU resources than Snappy or LZO, but provides a higher
            compression ratio.</para>
        </listitem>
        <listitem>
          <para>Use Snappy or LZO for <firstterm>hot data</firstterm>, which is accessed
            frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as
            high a compression ratio.</para>
        </listitem>
        <listitem>
          <para>In most cases, enabling Snappy or LZO by default is a good choice, because they have
            a low performance overhead and provide space savings.</para>
        </listitem>
        <listitem>
          <para>Before Snappy was made available by Google in 2011, LZO was the default. Snappy has
            similar qualities to LZO but has been shown to perform better.</para>
        </listitem>
      </itemizedlist>
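      <para>As a sketch of how these guidelines might be applied together (the table and column
        family names below are hypothetical), a frequently-read table with long keys could combine
        a block compressor with the FAST_DIFF encoder when it is created in HBase Shell:</para>
      <screen>
hbase> <userinput>create 'hypothetical_table', {NAME => 'cf', COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'}</userinput>
      </screen>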
|
||||
</section>
|
||||
|
||||
<section xml:id="hbase.regionserver.codecs">
|
||||
<title>
|
||||
<varname>
|
||||
hbase.regionserver.codecs
|
||||
</varname>
|
||||
</title>
|
||||
<para>
|
||||
To have a RegionServer test a set of codecs and fail-to-start if any
|
||||
code is missing or misinstalled, add the configuration
|
||||
<varname>
|
||||
hbase.regionserver.codecs
|
||||
</varname>
|
||||
to your <filename>hbase-site.xml</filename> with a value of
|
||||
codecs to test on startup. For example if the
|
||||
<varname>
|
||||
hbase.regionserver.codecs
|
||||
</varname> value is <code>lzo,gz</code> and if lzo is not present
|
||||
or improperly installed, the misconfigured RegionServer will fail
|
||||
to start.
|
||||
</para>
|
||||
<para>
|
||||
Administrators might make use of this facility to guard against
|
||||
the case where a new server is added to cluster but the cluster
|
||||
requires install of a particular coded.
|
||||
</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="gzip.compression">
|
||||
<title>
|
||||
GZIP
|
||||
</title>
|
||||
      <para>
        GZIP will generally compress better than LZO, though it runs more slowly.
        For some setups, better compression may be preferred ('cold' data).
        Java will use Java's built-in GZIP unless the native Hadoop libraries are
        available on the CLASSPATH, in which case it will use the native
        compressors instead. (If the native libraries are NOT present,
        you will see lots of <emphasis>Got brand-new compressor</emphasis>
        reports in your logs; see <xref linkend="brand.new.compressor" />.)
      </para>
|
||||
</section>
|
||||
|
||||
<section xml:id="lz4.compression">
|
||||
<title>
|
||||
LZ4
|
||||
</title>
|
||||
      <para>
        LZ4 is bundled with Hadoop. Make sure the hadoop .so is
        accessible when you start HBase. One means of doing this, after figuring out your
        platform (see <xref linkend="hbase.native.platform" />), is to make a symlink from HBase
        to the native Hadoop libraries, presuming the two software installs are colocated.
        For example, if my 'platform' is Linux-amd64-64:
        <programlisting>$ cd $HBASE_HOME
$ mkdir lib/native
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>
        Use the compression tool to check that LZ4 is installed on all nodes, then start (or
        restart) HBase.
      </para>
    </section>
<section>
|
||||
<title>Compressor Configuration, Installation, and Use</title>
|
||||
<section
|
||||
xml:id="compressor.install">
|
||||
<title>Configure HBase For Compressors</title>
|
||||
<para>Before HBase can use a given compressor, its libraries need to be available. Due to
|
||||
licensing issues, only GZ compression is available to HBase (via native Java libraries) in
|
||||
a default installation.</para>
|
||||
<section>
|
||||
<title>Compressor Support On the Master</title>
|
||||
        <para>A new configuration setting was introduced in HBase 0.95 that checks which data block
          encoders are installed and configured on the Master, and assumes that the entire cluster
          is configured the same. This option,
          <code>hbase.master.check.compression</code>, defaults to <literal>true</literal>. This
          prevents the situation described in <link
            xlink:href="https://issues.apache.org/jira/browse/HBASE-6370">HBASE-6370</link>, where
          a table is created or modified to support a codec that a region server does not support,
          leading to failures that take a long time to occur and are difficult to debug.</para>
        <para>If <code>hbase.master.check.compression</code> is enabled, libraries for all desired
          compressors need to be installed and configured on the Master, even if the Master does
          not run a region server.</para>
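        <para>Since the check is on by default, the only configuration normally needed is to turn
          it off. A minimal sketch of what that would look like in
          <filename>hbase-site.xml</filename> (only do this if you are sure every RegionServer has
          the required codec libraries):</para>
        <programlisting><![CDATA[
<property>
  <name>hbase.master.check.compression</name>
  <value>false</value>
</property>
]]></programlisting>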
|
||||
</section>
|
||||
<section>
|
||||
<title>Install GZ Support Via Native Libraries</title>
|
||||
          <para>HBase uses Java's built-in GZip support unless the native Hadoop libraries are
            available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to
            set the environment variable <envar>HBASE_LIBRARY_PATH</envar> for the user running
            HBase. If native libraries are not available and Java's GZIP is used, <literal>Got
              brand-new compressor</literal> reports will be present in the logs. See <xref
              linkend="brand.new.compressor" />.</para>
|
||||
</section>
|
||||
<section
|
||||
xml:id="lzo.compression">
|
||||
<title>Install LZO Support</title>
|
||||
          <para>HBase cannot ship with LZO because of incompatibility between HBase, which uses an
            Apache Software License (ASL), and LZO, which uses a GPL license. See the <link
              xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO
              Compression</link> wiki page for information on configuring LZO support for HBase.</para>
|
||||
<para>If you depend upon LZO compression, consider configuring your RegionServers to fail
|
||||
to start if LZO is not available. See <xref
|
||||
linkend="hbase.regionserver.codecs" />.</para>
|
||||
</section>
|
||||
<section
|
||||
xml:id="lz4.compression">
|
||||
<title>Configure LZ4 Support</title>
|
||||
<para>LZ4 support is bundled with Hadoop. Make sure the hadoop shared library
|
||||
(libhadoop.so) is accessible when you start
|
||||
HBase. After configuring your platform (see <xref
|
||||
linkend="hbase.native.platform" />), you can make a symbolic link from HBase to the native Hadoop
|
||||
libraries. This assumes the two software installs are colocated. For example, if my
|
||||
'platform' is Linux-amd64-64:
|
||||
<programlisting>$ cd $HBASE_HOME
|
||||
$ mkdir lib/native
|
||||
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>
|
||||
            Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart)
            HBase. Afterward, you can create and alter tables to enable LZ4 as a
            compression codec:
            <screen>
hbase(main):003:0> <userinput>alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</userinput>
            </screen>
          </para>
|
||||
</section>
|
||||
<section
|
||||
xml:id="snappy.compression.installation">
|
||||
<title>Install Snappy Support</title>
|
||||
<para>HBase does not ship with Snappy support because of licensing issues. You can install
|
||||
Snappy binaries (for instance, by using <command>yum install snappy</command> on CentOS)
|
||||
or build Snappy from source. After installing Snappy, search for the shared library,
|
||||
which will be called <filename>libsnappy.so.X</filename> where X is a number. If you
|
||||
built from source, copy the shared library to a known location on your system, such as
|
||||
<filename>/opt/snappy/lib/</filename>.</para>
|
||||
<para>In addition to the Snappy library, HBase also needs access to the Hadoop shared
|
||||
library, which will be called something like <filename>libhadoop.so.X.Y</filename>,
|
||||
where X and Y are both numbers. Make note of the location of the Hadoop library, or copy
|
||||
it to the same location as the Snappy library.</para>
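          <para>As a rough sketch of locating both shared libraries on a node (paths and version
            numbers will vary by system; the commands below simply search common install
            locations):</para>
          <programlisting>$ find /usr /opt -name 'libsnappy.so*' 2>/dev/null
$ find /usr /opt $HADOOP_HOME -name 'libhadoop.so*' 2>/dev/null</programlisting>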
|
||||
<note>
|
||||
<para>The Snappy and Hadoop libraries need to be available on each node of your cluster.
|
||||
See <xref
|
||||
linkend="compression.test" /> to find out how to test that this is the case.</para>
|
||||
<para>See <xref
|
||||
linkend="hbase.regionserver.codecs" /> to configure your RegionServers to fail to
|
||||
start if a given compressor is not available.</para>
|
||||
</note>
|
||||
          <para>Each of these library locations needs to be added to the environment variable
            <envar>HBASE_LIBRARY_PATH</envar> for the operating system user that runs HBase. You
            need to restart the RegionServer for the changes to take effect.</para>
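          <para>A minimal sketch, assuming you copied (or symlinked) both
            <filename>libsnappy.so</filename> and <filename>libhadoop.so</filename> into a single
            directory such as the hypothetical <filename>/opt/snappy/lib/</filename> mentioned
            above, is to export that directory in <filename>conf/hbase-env.sh</filename>:</para>
          <programlisting># Directory containing libsnappy.so and libhadoop.so (example path; adjust for your system)
export HBASE_LIBRARY_PATH=/opt/snappy/lib</programlisting>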
|
||||
</section>
|
||||
|
||||
|
||||
<section
|
||||
xml:id="compression.test">
|
||||
<title>CompressionTest</title>
|
||||
<para>You can use the CompressionTest tool to verify that your compressor is available to
|
||||
HBase:</para>
|
||||
<screen>
|
||||
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://<replaceable>host/path/to/hbase</replaceable> snappy
|
||||
</screen>
|
||||
</section>
|
||||
|
||||
|
||||
<section
|
||||
xml:id="hbase.regionserver.codecs">
|
||||
<title>Enforce Compression Settings On a RegionServer</title>
|
||||
        <para>You can configure a RegionServer so that it will fail to start if compression is
          configured incorrectly, by adding the option <code>hbase.regionserver.codecs</code> to the
          <filename>hbase-site.xml</filename>, and setting its value to a comma-separated list
          of codecs that need to be available. For example, if you set this property to
          <literal>lzo,gz</literal>, the RegionServer would fail to start if either compressor
          were not available. This would prevent a new server from being added to the cluster
          without having codecs configured properly.</para>
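        <para>A minimal sketch of that property in <filename>hbase-site.xml</filename> (the codec
          list here is just the example from above; list whichever codecs your cluster
          requires):</para>
        <programlisting><![CDATA[
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,gz</value>
</property>
]]></programlisting>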
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section
|
||||
xml:id="changing.compression">
|
||||
<title>Enable Compression On a ColumnFamily</title>
|
||||
<para>To enable compression for a ColumnFamily, use an <code>alter</code> command. You do
|
||||
not need to re-create the table or copy data. If you are changing codecs, be sure the old
|
||||
codec is still available until all the old StoreFiles have been compacted.</para>
|
||||
<example>
|
||||
<title>Enabling Compression on a ColumnFamily of an Existing Table using HBase
|
||||
Shell</title>
|
||||
<screen><![CDATA[
|
||||
hbase> disable 'test'
|
||||
hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
|
||||
hbase> enable 'test']]>
|
||||
</screen>
|
||||
</example>
|
||||
<example>
|
||||
<title>Creating a New Table with Compression On a ColumnFamily</title>
|
||||
<screen><![CDATA[
|
||||
hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }
|
||||
]]></screen>
|
||||
</example>
|
||||
<example>
|
||||
<title>Verifying a ColumnFamily's Compression Settings</title>
|
||||
<screen><![CDATA[
|
||||
hbase> describe 'test'
|
||||
DESCRIPTION ENABLED
|
||||
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
|
||||
', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
|
||||
VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS
|
||||
=> '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
|
||||
lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
|
||||
LOCKCACHE => 'true'}
|
||||
1 row(s) in 0.1070 seconds
|
||||
]]></screen>
|
||||
</example>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<title>Testing Compression Performance</title>
|
||||
      <para>HBase includes a tool called LoadTestTool which provides mechanisms to test your
        compression performance. You must specify either <literal>-write</literal>,
        <literal>-update</literal>, or <literal>-read</literal> as your first parameter, and if you
        do not specify another parameter, usage advice is printed for each option.</para>
|
||||
<example>
|
||||
<title><command>LoadTestTool</command> Usage</title>
|
||||
<screen><![CDATA[
|
||||
$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h
|
||||
usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
|
||||
Options:
|
||||
-batchupdate Whether to use batch as opposed to separate
|
||||
updates for every column in a row
|
||||
-bloom <arg> Bloom filter type, one of [NONE, ROW, ROWCOL]
|
||||
-compression <arg> Compression type, one of [LZO, GZ, NONE, SNAPPY,
|
||||
LZ4]
|
||||
-data_block_encoding <arg> Encoding algorithm (e.g. prefix compression) to
|
||||
use for data blocks in the test column family, one
|
||||
of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE].
|
||||
-encryption <arg> Enables transparent encryption on the test table,
|
||||
one of [AES]
|
||||
-generator <arg> The class which generates load for the tool. Any
|
||||
args for this class can be passed as colon
|
||||
separated after class name
|
||||
-h,--help Show usage
|
||||
-in_memory Tries to keep the HFiles of the CF inmemory as far
|
||||
as possible. Not guaranteed that reads are always
|
||||
served from inmemory
|
||||
-init_only Initialize the test table only, don't do any
|
||||
loading
|
||||
-key_window <arg> The 'key window' to maintain between reads and
|
||||
writes for concurrent write/read workload. The
|
||||
default is 0.
|
||||
-max_read_errors <arg> The maximum number of read errors to tolerate
|
||||
before terminating all reader threads. The default
|
||||
is 10.
|
||||
-multiput Whether to use multi-puts as opposed to separate
|
||||
puts for every column in a row
|
||||
-num_keys <arg> The number of keys to read/write
|
||||
-num_tables <arg> A positive integer number. When a number n is
|
||||
speicfied, load test tool will load n table
|
||||
parallely. -tn parameter value becomes table name
|
||||
prefix. Each table name is in format
|
||||
<tn>_1...<tn>_n
|
||||
-read <arg> <verify_percent>[:<#threads=20>]
|
||||
-regions_per_server <arg> A positive integer number. When a number n is
|
||||
specified, load test tool will create the test
|
||||
table with n regions per server
|
||||
-skip_init Skip the initialization; assume test table already
|
||||
exists
|
||||
-start_key <arg> The first key to read/write (a 0-based index). The
|
||||
default value is 0.
|
||||
-tn <arg> The name of the table to read or write
|
||||
-update <arg> <update_percent>[:<#threads=20>][:<#whether to
|
||||
ignore nonce collisions=0>]
|
||||
-write <arg> <avg_cols_per_key>:<avg_data_size>[:<#threads=20>]
|
||||
-zk <arg> ZK quorum as comma-separated host names without
|
||||
port numbers
|
||||
-zk_root <arg> name of parent znode in zookeeper
|
||||
]]></screen>
|
||||
</example>
|
||||
<example>
|
||||
<title>Example Usage of LoadTestTool</title>
|
||||
<screen>
|
||||
$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000
|
||||
-read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
|
||||
</screen>
|
||||
</example>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="lzo.compression">
|
||||
<title>
|
||||
LZO
|
||||
</title>
|
||||
      <para>Unfortunately, HBase cannot ship with LZO because of
        licensing issues; HBase is Apache-licensed, LZO is GPL.
        Therefore LZO must be installed separately, after HBase is installed.
        See the <link xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO Compression</link>
        wiki page for how to make LZO work with HBase.
      </para>
      <para>A common problem users run into when using LZO is that, while initial
        setup of the cluster runs smoothly, a month later some sysadmin adds
        a machine to the cluster and forgets to do the LZO
        fixup on the new machine. In versions since HBase 0.90.0, we should
        fail in a way that makes it plain what the problem is, but that is not guaranteed.</para>
      <para>See <xref linkend="hbase.regionserver.codecs" />
        for a feature to help protect against a failed LZO install.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="snappy.compression">
|
||||
<title>
|
||||
SNAPPY
|
||||
</title>
|
||||
<para>
|
||||
If snappy is installed, HBase can make use of it (courtesy of
|
||||
<link xlink:href="http://code.google.com/p/hadoop-snappy/">hadoop-snappy</link>
|
||||
<footnote><para>See <link xlink:href="http://search-hadoop.com/m/Ds8d51c263B1/%2522Hadoop-Snappy+in+synch+with+Hadoop+trunk%2522&subj=Hadoop+Snappy+in+synch+with+Hadoop+trunk">Alejandro's note</link> up on the list on difference between Snappy in Hadoop
|
||||
and Snappy in HBase</para></footnote>).
|
||||
|
||||
<orderedlist>
|
||||
<listitem>
|
||||
          <para>
            Build and install <link xlink:href="http://code.google.com/p/snappy/">snappy</link> on all nodes
            of your cluster (see below). Neither HBase nor Hadoop can include snappy because of licensing issues. (The
            hadoop libhadoop.so under its native dir does not include snappy; of note, the shipped .so
            may be for 32-bit architectures -- this fact has tripped up folks in the past who assumed
            it was 64-bit.) The notes below are about installing snappy for HBase use. You may also want snappy
            available in your hadoop context; that is not covered here.
            HBase and Hadoop currently find the snappy .so in different locations: Hadoop picks up those files in
            <filename>./lib</filename> while HBase finds the .so in <filename>./lib/[PLATFORM]</filename>.
          </para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster:
|
||||
<programlisting>$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy</programlisting>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
Create a column family with snappy compression and verify it in the hbase shell:
|
||||
<programlisting>$ hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
|
||||
hbase> describe 't1'</programlisting>
|
||||
In the output of the "describe" command, you need to ensure it lists "COMPRESSION => 'SNAPPY'"
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
</orderedlist>
|
||||
|
||||
</para>
|
||||
<section xml:id="snappy.compression.installation">
|
||||
<title>
|
||||
Installation
|
||||
</title>
|
||||
<para>Snappy is used by hbase to compress HFiles on flush and when compacting.
|
||||
</para>
|
||||
<para>
|
||||
You will find the snappy library file under the .libs directory from your Snappy build (For example
|
||||
/home/hbase/snappy-1.0.5/.libs/). The file is called libsnappy.so.1.x.x where 1.x.x is the version of the snappy
|
||||
code you are building. You can either copy this file into your hbase lib directory -- under lib/native/PLATFORM --
|
||||
naming the file as libsnappy.so,
|
||||
or simply create a symbolic link to it (See ./bin/hbase for how it does library path for native libs).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The second file you need is the hadoop native library. You will find this file in your hadoop installation directory
|
||||
under lib/native/Linux-amd64-64/ or lib/native/Linux-i386-32/. The file you are looking for is libhadoop.so.1.x.x.
|
||||
Again, you can simply copy this file or link to it from under hbase in lib/native/PLATFORM (e.g. Linux-amd64-64, etc.),
|
||||
using the name libhadoop.so.
|
||||
</para>
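      <para>A brief sketch of the linking approach described above (the version numbers and the
        <filename>Linux-amd64-64</filename> platform are examples; substitute your own build
        output and platform):</para>
      <programlisting>$ cd $HBASE_HOME
$ mkdir -p lib/native/Linux-amd64-64
$ ln -s /home/hbase/snappy-1.0.5/.libs/libsnappy.so.1.x.x lib/native/Linux-amd64-64/libsnappy.so
$ ln -s /pathtoyourhadoop/lib/native/Linux-amd64-64/libhadoop.so.1.x.x lib/native/Linux-amd64-64/libhadoop.so</programlisting>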
      <para>
        At the end of the installation, you should have both libsnappy.so and libhadoop.so links or files present in
        lib/native/Linux-amd64-64 or in lib/native/Linux-i386-32 (where the last part of the directory path is the
        PLATFORM you built and are running the native lib on).
      </para>
|
||||
<para>To point hbase at snappy support, in hbase-env.sh set
|
||||
<programlisting>export HBASE_LIBRARY_PATH=/pathtoyourhadoop/lib/native/Linux-amd64-64</programlisting>
|
||||
In <filename>/pathtoyourhadoop/lib/native/Linux-amd64-64</filename> you should have something like:
|
||||
<programlisting>
|
||||
libsnappy.a
|
||||
libsnappy.so
|
||||
libsnappy.so.1
|
||||
libsnappy.so.1.1.2
|
||||
</programlisting>
|
||||
</para>
|
||||
</section>
|
||||
</section>
|
||||
<section xml:id="changing.compression">
|
||||
<title>Changing Compression Schemes</title>
|
||||
      <para>A frequent question on the dist-list is how to change compression schemes for ColumnFamilies. This is actually quite simple,
        and can be done via an alter command. Because the compression scheme is encoded at the block-level in StoreFiles, the table does
        <emphasis>not</emphasis> need to be re-created and the data is <emphasis>not</emphasis> copied somewhere else. Just make sure
        the old codec is still available until you are sure that all of the old StoreFiles have been compacted.
      </para>
|
||||
<section xml:id="data.block.encoding.enable">
|
||||
<title>Enable Data Block Encoding</title>
|
||||
<para>Codecs are built into HBase so no extra configuration is needed. Codecs are enabled on a
|
||||
table by setting the <code>DATA_BLOCK_ENCODING</code> property. Disable the table before
|
||||
altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell:</para>
|
||||
<example>
|
||||
<title>Enable Data Block Encoding On a Table</title>
|
||||
<screen><![CDATA[
|
||||
hbase> disable 'test'
|
||||
hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
|
||||
Updating all regions with the new schema...
|
||||
0/1 regions updated.
|
||||
1/1 regions updated.
|
||||
Done.
|
||||
0 row(s) in 2.2820 seconds
|
||||
hbase> enable 'test'
|
||||
0 row(s) in 0.1580 seconds
|
||||
]]></screen>
|
||||
</example>
|
||||
<example>
|
||||
<title>Verifying a ColumnFamily's Data Block Encoding</title>
|
||||
<screen><![CDATA[
|
||||
hbase> describe 'test'
|
||||
DESCRIPTION ENABLED
|
||||
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true
|
||||
_DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
|
||||
'0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS
|
||||
IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =
|
||||
> 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
|
||||
e', BLOCKCACHE => 'true'}
|
||||
1 row(s) in 0.0650 seconds
|
||||
]]></screen>
|
||||
</example>
|
||||
</section>
|
||||
</appendix>
|
||||
|
||||
|
||||
<appendix>
|
||||
<title xml:id="ycsb"><link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The Yahoo! Cloud Serving Benchmark</link> and HBase</title>
|
||||
<para>TODO: Describe how YCSB is poor for putting up a decent cluster load.</para>
(Three binary image files added with this commit are not shown: 53 KiB, 46 KiB, and 34 KiB.)