HBASE-11400 [docs] edit, consolidate, and update compression and data encoding docs (Misty Stanley-Jones)

Jonathan M Hsieh 2014-07-18 13:45:57 -07:00
parent a030b17ba7
commit 209dd6dcfe
4 changed files with 431 additions and 210 deletions


@@ -4387,230 +4387,451 @@ This option should not normally be used, and it is not in <code>-fixAll</code>.
</section>
</appendix>
<appendix
xml:id="compression">
<title>Compression and Data Block Encoding In
HBase<indexterm><primary>Compression</primary><secondary>Data Block
Encoding</secondary><seealso>codecs</seealso></indexterm></title>
<note>
<para>Codecs mentioned in this section are for encoding and decoding data blocks or row keys.
For information about replication codecs, see <xref
linkend="cluster.replication.preserving.tags" />.</para>
</note>
<para>Some of the information in this section is pulled from a <link
xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1/v=threaded">discussion</link> on the
HBase Development mailing list.</para>
<para>HBase supports several different compression algorithms which can be enabled on a
ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking
advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys
and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in
cells, and can significantly reduce the storage space needed to store uncompressed
data.</para>
<para>Compressors and data block encoding can be used together on the same ColumnFamily.</para>
<formalpara>
<title>Changes Take Effect Upon Compaction</title>
<para>If you change compression or encoding for a ColumnFamily, the changes take effect during
compaction.</para>
</formalpara>
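<para>For example, after changing the compression setting on a ColumnFamily you can trigger a
major compaction so that existing StoreFiles are rewritten with the new setting, rather than
waiting for the next scheduled compaction. A minimal HBase Shell sketch; the table and
ColumnFamily names are illustrative:</para>
<screen>
hbase> <userinput>disable 'test'</userinput>
hbase> <userinput>alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}</userinput>
hbase> <userinput>enable 'test'</userinput>
hbase> <userinput>major_compact 'test'</userinput>
</screen>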
<section xml:id="compression.test">
<title>CompressionTest Tool</title>
<para>
HBase includes a tool to test compression is set up properly.
To run it, type <code>/bin/hbase org.apache.hadoop.hbase.util.CompressionTest</code>.
This will emit usage on how to run the tool.
</para>
<note><title>You must restart the RegionServer for it to pick up changes</title>
<para>Be aware that the RegionServer caches the result of the compression check it runs
ahead of each region open. You must restart the RegionServer for it to notice
that you have fixed any codec issues, for example changed symlinks or
moved library locations under HBase.</para>
</note>
<note xml:id="hbase.native.platform"><title>On the location of native libraries</title>
<para>Hadoop looks in <filename>lib/native</filename> for .so files. HBase looks in
<filename>lib/native/PLATFORM</filename>. See the <command>bin/hbase</command>.
View the file and look for <varname>native</varname>. See how we
do the work to find out what platform we are running on running a little java program
<classname>org.apache.hadoop.util.PlatformName</classname> to figure it out.
We'll then add <filename>./lib/native/PLATFORM</filename> to the
<varname>LD_LIBRARY_PATH</varname> environment for when the JVM starts.
The JVM will look in here (as well as in any other dirs specified on LD_LIBRARY_PATH)
for codec native libs. If you are unable to figure your 'platform', do:
<programlisting>$ ./bin/hbase org.apache.hadoop.util.PlatformName</programlisting>.
An example platform would be <varname>Linux-amd64-64</varname>.
</para>
</note>
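<para>If your Hadoop build includes the
<classname>org.apache.hadoop.util.NativeLibraryChecker</classname> utility (present in Hadoop
2.x), you can also run it through the HBase launcher script to see which native codecs are
loadable. This is a hedged example; whether the checker is available depends on your Hadoop
version:
<programlisting>$ ./bin/hbase org.apache.hadoop.util.NativeLibraryChecker</programlisting>
</para>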
<para>Some codecs take advantage of capabilities built into Java, such as GZip compression.
Others rely on native libraries. Native libraries may be available as part of Hadoop, such as
LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs,
such as Google Snappy, need to be installed first. Some codecs are licensed in ways that
conflict with HBase's license and cannot be shipped as part of HBase.</para>
<para>This section discusses common codecs that are used and tested with HBase. No matter what
codec you use, be sure to test that it is installed correctly and is available on all nodes in
your cluster. Extra operational steps may be necessary to be sure that codecs are available on
newly-deployed nodes. You can use the <xref
linkend="compression.test" /> utility to check that a given codec is correctly
installed.</para>
<para>To configure HBase to use a compressor, see <xref
linkend="compressor.install" />. To enable a compressor for a ColumnFamily, see <xref
linkend="changing.compression" />. To enable data block encoding for a ColumnFamily, see
<xref linkend="data.block.encoding.enable" />.</para>
<itemizedlist>
<title>Block Compressors</title>
<listitem>
<para>none</para>
</listitem>
<listitem>
<para>Snappy</para>
</listitem>
<listitem>
<para>LZO</para>
</listitem>
<listitem>
<para>LZ4</para>
</listitem>
<listitem>
<para>GZ</para>
</listitem>
</itemizedlist>
<itemizedlist>
<title>Data Block Encoding Types</title>
<listitem>
<para>Prefix - Often, keys are very similar. Specifically, keys often share a common prefix
and only differ near the end. For instance, one key might be
<literal>RowKey:Family:Qualifier0</literal> and the next key might be
<literal>RowKey:Family:Qualifier1</literal>. In Prefix encoding, an extra column is
added which holds the length of the prefix shared between the current key and the previous
key. Assuming the first key here is totally different from the key before, its prefix
length is 0. The second key's prefix length is <literal>23</literal>, since they have the
first 23 characters in common.</para>
<para>If the keys tend to have nothing in common, Prefix encoding will not provide much
benefit.</para>
<para>The following image shows a hypothetical ColumnFamily with no data block encoding.</para>
<figure>
<title>ColumnFamily with No Encoding</title>
<mediaobject>
<imageobject>
<imagedata fileref="data_block_no_encoding.png" width="800"/>
</imageobject>
<textobject><para></para>
</textobject>
</mediaobject>
</figure>
<para>Here is the same data with prefix data encoding.</para>
<figure>
<title>ColumnFamily with Prefix Encoding</title>
<mediaobject>
<imageobject>
<imagedata fileref="data_block_prefix_encoding.png" width="800"/>
</imageobject>
<textobject><para></para>
</textobject>
</mediaobject>
</figure>
</listitem>
<listitem>
<para>Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key
sequentially as a monolithic series of bytes, each key field is split so that each part of
the key can be compressed more efficiently. Two new fields are added: timestamp and type.
If the ColumnFamily is the same as the previous row, it is omitted from the current row.
If the key length, value length or type are the same as the previous row, the field is
omitted. In addition, for increased compression, the timestamp is stored as a Diff from
the previous row's timestamp, rather than being stored in full. Given the two row keys in
the Prefix example, and given an exact match on timestamp and the same type, neither the
value length nor the type needs to be stored for the second row, and the timestamp value for
the second row is just 0, rather than a full timestamp.</para>
<para>Diff encoding is disabled by default because writing and scanning are slower but more
data is cached.</para>
<para>This image shows the same ColumnFamily from the previous images, with Diff encoding.</para>
<figure>
<title>ColumnFamily with Diff Encoding</title>
<mediaobject>
<imageobject>
<imagedata fileref="data_block_diff_encoding.png" width="800"/>
</imageobject>
<textobject><para></para>
</textobject>
</mediaobject>
</figure>
</listitem>
<listitem>
<para>Fast Diff - Fast Diff works similarly to Diff, but uses a faster implementation. It also
adds another field which stores a single bit to track whether the data itself is the same
as the previous row. If it is, the data is not stored again. Fast Diff is the recommended
codec to use if you have long keys or many columns. The data format is nearly identical to
Diff encoding, so there is not an image to illustrate it.</para>
</listitem>
<listitem>
<para>Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It
provides similar memory savings to the Prefix, Diff, and Fast Diff encoder, but provides
faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate
for applications that have high block cache hit ratios. It introduces new 'tree' fields
for the row and column. The row tree field contains a list of offsets/references
corresponding to the cells in that row. This allows for a good deal of compression. For
more details about Prefix Tree encoding, see <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-4676">HBASE-4676</link>. It is
difficult to graphically illustrate a prefix tree, so no image is included. See the
Wikipedia article for <link
xlink:href="http://en.wikipedia.org/wiki/Trie">Trie</link> for more general information
about this data structure.</para>
</listitem>
</itemizedlist>
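<para>Data block encoding is configured per ColumnFamily. As an illustration, the following
minimal HBase Shell sketch creates a table whose single family uses Prefix encoding; the table
and family names are hypothetical, and <xref linkend="data.block.encoding.enable" /> covers
enabling an encoding on an existing table:</para>
<screen>
hbase> <userinput>create 'test_prefix', {NAME => 'cf', DATA_BLOCK_ENCODING => 'PREFIX'}</userinput>
</screen>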
<section>
<title>Which Compressor or Data Block Encoder To Use</title>
<para>The compression or codec type to use depends on the characteristics of your data.
Choosing the wrong type could cause your data to take more space rather than less, and can
have performance implications. In general, you need to weigh your options between smaller
size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression and codecs</link>. </para>
<itemizedlist>
<listitem>
<para>If you have long keys (compared to the values) or many columns, use a prefix
encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree
encoding.</para>
</listitem>
<listitem>
<para>If the values are large (and not precompressed, such as images), use a data block
compressor.</para>
</listitem>
<listitem>
<para>Use GZIP for <firstterm>cold data</firstterm>, which is accessed infrequently. GZIP
compression uses more CPU resources than Snappy or LZO, but provides a higher
compression ratio.</para>
</listitem>
<listitem>
<para>Use Snappy or LZO for <firstterm>hot data</firstterm>, which is accessed
frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high
of a compression ratio.</para>
</listitem>
<listitem>
<para>In most cases, enabling Snappy or LZO by default is a good choice, because they have
a low performance overhead and provide space savings.</para>
</listitem>
<listitem>
<para>Before Snappy was released by Google in 2011, LZO was the default. Snappy has
similar qualities as LZO but has been shown to perform better.</para>
</listitem>
</itemizedlist>
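<para>Putting these guidelines together, a ColumnFamily with long keys and frequently-read
values might combine Fast Diff encoding with Snappy compression. The following HBase Shell
sketch is illustrative only; the table and family names are hypothetical:</para>
<screen>
hbase> <userinput>create 'hot_table', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'}</userinput>
</screen>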
</section>
<section xml:id="hbase.regionserver.codecs">
<title>
<varname>
hbase.regionserver.codecs
</varname>
</title>
<para>
To have a RegionServer test a set of codecs and fail-to-start if any
code is missing or misinstalled, add the configuration
<varname>
hbase.regionserver.codecs
</varname>
to your <filename>hbase-site.xml</filename> with a value of
codecs to test on startup. For example if the
<varname>
hbase.regionserver.codecs
</varname> value is <code>lzo,gz</code> and if lzo is not present
or improperly installed, the misconfigured RegionServer will fail
to start.
</para>
<para>
Administrators might make use of this facility to guard against
the case where a new server is added to cluster but the cluster
requires install of a particular coded.
</para>
</section>
<section xml:id="gzip.compression">
<title>
GZIP
</title>
<para>
GZIP will generally compress better than LZO but it will run slower.
For some setups, better compression may be preferred ('cold' data).
Java will use java's GZIP unless the native Hadoop libs are
available on the CLASSPATH; in this case it will use native
compressors instead (If the native libs are NOT present,
you will see lots of <emphasis>Got brand-new compressor</emphasis>
reports in your logs; see <xref linkend="brand.new.compressor" />).
</para>
</section>
<section xml:id="lz4.compression">
<title>
LZ4
</title>
<para>
LZ4 is bundled with Hadoop. Make sure the hadoop .so is
accessible when you start HBase. One means of doing this is after figuring your
platform, see <xref linkend="hbase.native.platform" />, make a symlink from HBase
to the native Hadoop libraries presuming the two software installs are colocated.
For example, if my 'platform' is Linux-amd64-64:
<programlisting>$ cd $HBASE_HOME
<section>
<title>Compressor Configuration, Installation, and Use</title>
<section
xml:id="compressor.install">
<title>Configure HBase For Compressors</title>
<para>Before HBase can use a given compressor, its libraries need to be available. Due to
licensing issues, only GZ compression is available to HBase (via Java's built-in libraries) in
a default installation.</para>
<section>
<title>Compressor Support On the Master</title>
<para>A new configuration setting was introduced in HBase 0.95, to check the Master to
determine which data block encoders are installed and configured on it, and assume that
the entire cluster is configured the same. This option,
<code>hbase.master.check.compression</code>, defaults to <literal>true</literal>. This
prevents the situation described in <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-6370">HBASE-6370</link>, where
a table is created or modified to support a codec that a region server does not support,
leading to failures that take a long time to occur and are difficult to debug. </para>
<para>If <code>hbase.master.check.compression</code> is enabled, libraries for all desired
compressors need to be installed and configured on the Master, even if the Master does
not run a region server.</para>
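<para>A minimal sketch of setting the option explicitly in <filename>hbase-site.xml</filename>,
shown here with its default value of <literal>true</literal>:</para>
<programlisting><![CDATA[
<property>
  <name>hbase.master.check.compression</name>
  <value>true</value>
</property>
]]></programlisting>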
</section>
<section>
<title>Install GZ Support Via Native Libraries</title>
<para>HBase uses Java's built-in GZip support unless the native Hadoop libraries are
available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to
set the environment variable <envar>HBASE_LIBRARY_PATH</envar> for the user running
HBase. If native libraries are not available and Java's GZIP is used, <literal>Got
brand-new compressor</literal> reports will be present in the logs. See <xref
linkend="brand.new.compressor" />).</para>
</section>
<section
xml:id="lzo.compression">
<title>Install LZO Support</title>
<para>HBase cannot ship with LZO because of incompatibility between HBase, which uses an
Apache Software License (ASL), and LZO, which uses a GPL license. See the <link
xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO
Compression</link> wiki page for information on configuring LZO support for HBase. </para>
<para>If you depend upon LZO compression, consider configuring your RegionServers to fail
to start if LZO is not available. See <xref
linkend="hbase.regionserver.codecs" />.</para>
</section>
<section
xml:id="lz4.compression">
<title>Configure LZ4 Support</title>
<para>LZ4 support is bundled with Hadoop. Make sure the hadoop shared library
(libhadoop.so) is accessible when you start
HBase. After configuring your platform (see <xref
linkend="hbase.native.platform" />), you can make a symbolic link from HBase to the native Hadoop
libraries. This assumes the two software installs are colocated. For example, if your
platform is Linux-amd64-64:
<programlisting>$ cd $HBASE_HOME
$ mkdir lib/native
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>
Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart)
HBase. Afterward, you can create and alter tables to enable LZ4 as a
compression codec:
<screen>
hbase(main):003:0> <userinput>alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</userinput>
</screen>
</para>
</section>
<section
xml:id="snappy.compression.installation">
<title>Install Snappy Support</title>
<para>HBase does not ship with Snappy support because of licensing issues. You can install
Snappy binaries (for instance, by using <command>yum install snappy</command> on CentOS)
or build Snappy from source. After installing Snappy, search for the shared library,
which will be called <filename>libsnappy.so.X</filename> where X is a number. If you
built from source, copy the shared library to a known location on your system, such as
<filename>/opt/snappy/lib/</filename>.</para>
<para>In addition to the Snappy library, HBase also needs access to the Hadoop shared
library, which will be called something like <filename>libhadoop.so.X.Y</filename>,
where X and Y are both numbers. Make note of the location of the Hadoop library, or copy
it to the same location as the Snappy library.</para>
<note>
<para>The Snappy and Hadoop libraries need to be available on each node of your cluster.
See <xref
linkend="compression.test" /> to find out how to test that this is the case.</para>
<para>See <xref
linkend="hbase.regionserver.codecs" /> to configure your RegionServers to fail to
start if a given compressor is not available.</para>
</note>
<para>Each of these library locations need to be added to the environment variable
<envar>HBASE_LIBRARY_PATH</envar> for the operating system user that runs HBase. You
need to restart the RegionServer for the changes to take effect.</para>
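<para>For example, assuming the Snappy library was copied to <filename>/opt/snappy/lib/</filename>
and the Hadoop native library lives under a hypothetical <filename>/opt/hadoop/lib/native/</filename>,
the variable might be set as follows in <filename>hbase-env.sh</filename> or in the environment of
the user that runs HBase. The paths and the colon separator are illustrative; adjust them for your
installation:</para>
<programlisting>export HBASE_LIBRARY_PATH=/opt/snappy/lib:/opt/hadoop/lib/native</programlisting>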
</section>
<section
xml:id="compression.test">
<title>CompressionTest</title>
<para>You can use the CompressionTest tool to verify that your compressor is available to
HBase:</para>
<screen>
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://<replaceable>host/path/to/hbase</replaceable> snappy
</screen>
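<para>Other codec names can be substituted, and a local path also works. For example, a hedged
invocation that tests GZ support against a temporary local file (the path is illustrative):</para>
<screen>
$ hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/compressiontest.txt gz
</screen>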
</section>
<section
xml:id="hbase.regionserver.codecs">
<title>Enforce Compression Settings On a RegionServer</title>
<para>You can configure a RegionServer so that it will fail to restart if compression is
configured incorrectly, by adding the option <code>hbase.regionserver.codecs</code> to the
<filename>hbase-site.xml</filename>, and setting its value to a comma-separated list
of codecs that need to be available. For example, if you set this property to
<literal>lzo,gz</literal>, the RegionServer would fail to start if both compressors
were not available. This would prevent a new server from being added to the cluster
without having codecs configured properly.</para>
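<para>A minimal sketch of the corresponding <filename>hbase-site.xml</filename> entry; the codec
list shown is only an example:</para>
<programlisting><![CDATA[
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,gz</value>
</property>
]]></programlisting>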
</section>
</section>
<section
xml:id="changing.compression">
<title>Enable Compression On a ColumnFamily</title>
<para>To enable compression for a ColumnFamily, use an <code>alter</code> command. You do
not need to re-create the table or copy data. If you are changing codecs, be sure the old
codec is still available until all the old StoreFiles have been compacted.</para>
<example>
<title>Enabling Compression on a ColumnFamily of an Existing Table using HBase
Shell</title>
<screen><![CDATA[
hbase> disable 'test'
hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
hbase> enable 'test']]>
</screen>
</example>
<example>
<title>Creating a New Table with Compression On a ColumnFamily</title>
<screen><![CDATA[
hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }
]]></screen>
</example>
<example>
<title>Verifying a ColumnFamily's Compression Settings</title>
<screen><![CDATA[
hbase> describe 'test'
DESCRIPTION ENABLED
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS
=> '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
LOCKCACHE => 'true'}
1 row(s) in 0.1070 seconds
]]></screen>
</example>
</section>
<section>
<title>Testing Compression Performance</title>
<para>HBase includes a tool called LoadTestTool which provides mechanisms to test your
compression performance. You must specify either <literal>-write</literal> or
<literal>-update-read</literal> as your first parameter, and if you do not specify another
parameter, usage advice is printed for each option.</para>
<example>
<title><command>LoadTestTool</command> Usage</title>
<screen><![CDATA[
$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h
usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
Options:
-batchupdate Whether to use batch as opposed to separate
updates for every column in a row
-bloom <arg> Bloom filter type, one of [NONE, ROW, ROWCOL]
-compression <arg> Compression type, one of [LZO, GZ, NONE, SNAPPY,
LZ4]
-data_block_encoding <arg> Encoding algorithm (e.g. prefix compression) to
use for data blocks in the test column family, one
of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE].
-encryption <arg> Enables transparent encryption on the test table,
one of [AES]
-generator <arg> The class which generates load for the tool. Any
args for this class can be passed as colon
separated after class name
-h,--help Show usage
-in_memory Tries to keep the HFiles of the CF inmemory as far
as possible. Not guaranteed that reads are always
served from inmemory
-init_only Initialize the test table only, don't do any
loading
-key_window <arg> The 'key window' to maintain between reads and
writes for concurrent write/read workload. The
default is 0.
-max_read_errors <arg> The maximum number of read errors to tolerate
before terminating all reader threads. The default
is 10.
-multiput Whether to use multi-puts as opposed to separate
puts for every column in a row
-num_keys <arg> The number of keys to read/write
-num_tables <arg> A positive integer number. When a number n is
speicfied, load test tool will load n table
parallely. -tn parameter value becomes table name
prefix. Each table name is in format
<tn>_1...<tn>_n
-read <arg> <verify_percent>[:<#threads=20>]
-regions_per_server <arg> A positive integer number. When a number n is
specified, load test tool will create the test
table with n regions per server
-skip_init Skip the initialization; assume test table already
exists
-start_key <arg> The first key to read/write (a 0-based index). The
default value is 0.
-tn <arg> The name of the table to read or write
-update <arg> <update_percent>[:<#threads=20>][:<#whether to
ignore nonce collisions=0>]
-write <arg> <avg_cols_per_key>:<avg_data_size>[:<#threads=20>]
-zk <arg> ZK quorum as comma-separated host names without
port numbers
-zk_root <arg> name of parent znode in zookeeper
]]></screen>
</example>
<example>
<title>Example Usage of LoadTestTool</title>
<screen>
$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000
-read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
</screen>
</example>
</section>
</section>
<section xml:id="lzo.compression">
<title>
LZO
</title>
<para>Unfortunately, HBase cannot ship with LZO because of
the licensing issues; HBase is Apache-licensed, LZO is GPL.
Therefore LZO install is to be done post-HBase install.
See the <link xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO Compression</link>
wiki page for how to make LZO work with HBase.
</para>
<para>A common problem users run into when using LZO is that while initial
setup of the cluster runs smooth, a month goes by and some sysadmin goes to
add a machine to the cluster only they'll have forgotten to do the LZO
fixup on the new machine. In versions since HBase 0.90.0, we should
fail in a way that makes it plain what the problem is, but maybe not. </para>
<para>See <xref linkend="hbase.regionserver.codecs" />
for a feature to help protect against failed LZO install.</para>
</section>
<section xml:id="snappy.compression">
<title>
SNAPPY
</title>
<para>
If snappy is installed, HBase can make use of it (courtesy of
<link xlink:href="http://code.google.com/p/hadoop-snappy/">hadoop-snappy</link>
<footnote><para>See <link xlink:href="http://search-hadoop.com/m/Ds8d51c263B1/%2522Hadoop-Snappy+in+synch+with+Hadoop+trunk%2522&amp;subj=Hadoop+Snappy+in+synch+with+Hadoop+trunk">Alejandro's note</link> up on the list on difference between Snappy in Hadoop
and Snappy in HBase</para></footnote>).
<orderedlist>
<listitem>
<para>
Build and install <link xlink:href="http://code.google.com/p/snappy/">snappy</link> on all nodes
of your cluster (see below). HBase nor Hadoop cannot include snappy because of licensing issues (The
hadoop libhadoop.so under its native dir does not include snappy; of note, the shipped .so
may be for 32-bit architectures -- this fact has tripped up folks in the past with them thinking
it 64-bit). The notes below are about installing snappy for HBase use. You may want snappy
available in your hadoop context also. That is not covered here.
HBase and Hadoop find the snappy .so in different locations currently: Hadoop picks those files in
<filename>./lib</filename> while HBase finds the .so in <filename>./lib/[PLATFORM]</filename>.
</para>
</listitem>
<listitem>
<para>
Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster:
<programlisting>$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy</programlisting>
</para>
</listitem>
<listitem>
<para>
Create a column family with snappy compression and verify it in the hbase shell:
<programlisting>$ hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
hbase> describe 't1'</programlisting>
In the output of the "describe" command, you need to ensure it lists "COMPRESSION => 'SNAPPY'"
</para>
</listitem>
</orderedlist>
</para>
<section xml:id="snappy.compression.installation">
<title>
Installation
</title>
<para>Snappy is used by hbase to compress HFiles on flush and when compacting.
</para>
<para>
You will find the snappy library file under the .libs directory from your Snappy build (For example
/home/hbase/snappy-1.0.5/.libs/). The file is called libsnappy.so.1.x.x where 1.x.x is the version of the snappy
code you are building. You can either copy this file into your hbase lib directory -- under lib/native/PLATFORM --
naming the file as libsnappy.so,
or simply create a symbolic link to it (See ./bin/hbase for how it does library path for native libs).
</para>
<para>
The second file you need is the hadoop native library. You will find this file in your hadoop installation directory
under lib/native/Linux-amd64-64/ or lib/native/Linux-i386-32/. The file you are looking for is libhadoop.so.1.x.x.
Again, you can simply copy this file or link to it from under hbase in lib/native/PLATFORM (e.g. Linux-amd64-64, etc.),
using the name libhadoop.so.
</para>
<para>
At the end of the installation, you should have both libsnappy.so and libhadoop.so links or files present into
lib/native/Linux-amd64-64 or into lib/native/Linux-i386-32 (where the last part of the directory path is the
PLATFORM you built and rare running the native lib on)
</para>
<para>To point hbase at snappy support, in hbase-env.sh set
<programlisting>export HBASE_LIBRARY_PATH=/pathtoyourhadoop/lib/native/Linux-amd64-64</programlisting>
In <filename>/pathtoyourhadoop/lib/native/Linux-amd64-64</filename> you should have something like:
<programlisting>
libsnappy.a
libsnappy.so
libsnappy.so.1
libsnappy.so.1.1.2
</programlisting>
</para>
</section>
</section>
<section xml:id="changing.compression">
<title>Changing Compression Schemes</title>
<para>A frequent question on the dist-list is how to change compression schemes for ColumnFamilies. This is actually quite simple,
and can be done via an alter command. Because the compression scheme is encoded at the block-level in StoreFiles, the table does
<emphasis>not</emphasis> need to be re-created and the data does <emphasis>not</emphasis> copied somewhere else. Just make sure
the old codec is still available until you are sure that all of the old StoreFiles have been compacted.
</para>
<section xml:id="data.block.encoding.enable">
<title>Enable Data Block Encoding</title>
<para>Codecs are built into HBase so no extra configuration is needed. Codecs are enabled on a
table by setting the <code>DATA_BLOCK_ENCODING</code> property. Disable the table before
altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell:</para>
<example>
<title>Enable Data Block Encoding On a Table</title>
<screen><![CDATA[
hbase> disable 'test'
hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.2820 seconds
hbase> enable 'test'
0 row(s) in 0.1580 seconds
]]></screen>
</example>
<example>
<title>Verifying a ColumnFamily's Data Block Encoding</title>
<screen><![CDATA[
hbase> describe 'test'
DESCRIPTION ENABLED
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true
_DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
'0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS
IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =
> 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
e', BLOCKCACHE => 'true'}
1 row(s) in 0.0650 seconds
]]></screen>
</example>
</section>
</appendix>
<appendix>
<title xml:id="ycsb"><link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The Yahoo! Cloud Serving Benchmark</link> and HBase</title>
<para>TODO: Describe how YCSB is poor for putting up a decent cluster load.</para>
