HBASE-26330 Document new provided compression codecs (#4396)

Signed-off-by: Xiaolin Ha <haxiaolin@apache.org>
Signed-off-by: Viraj Jasani <virajjasani@apache.org>
Andrew Purtell 2022-05-07 11:25:43 -07:00
parent ffbdac12ca
commit ad74cd993f
1 changed file with 221 additions and 33 deletions


@ -40,12 +40,14 @@ Compressors and data block encoding can be used together on the same ColumnFamil
.Changes Take Effect Upon Compaction
If you change compression or encoding for a ColumnFamily, the changes take effect during compaction.
Some codecs take advantage of capabilities built into Java, such as GZip compression. Others rely on native libraries. Native libraries may be available as part of Hadoop, such as LZ4. In this case, HBase only needs access to the appropriate shared library.
Other codecs, such as Google Snappy, need to be installed first.
Some codecs are licensed in ways that conflict with HBase's license and cannot be shipped as part of HBase.
Some codecs take advantage of capabilities built into Java, such as GZip compression.
Others rely on native libraries. Native libraries may be available via codec dependencies installed into
HBase's library directory, or, if you are utilizing Hadoop codecs, as part of Hadoop. Hadoop codecs
typically have a native code component, so follow the instructions for installing Hadoop native binary
support at <<hadoop.native.lib>>.
This section discusses common codecs that are used and tested with HBase.
No matter what codec you use, be sure to test that it is installed correctly and is available on all nodes in your cluster.
Extra operational steps may be necessary to be sure that codecs are available on newly-deployed nodes.
You can use the <<compression.test,compression.test>> utility to check that a given codec is correctly installed.
@ -55,11 +57,69 @@ To enable a compressor for a ColumnFamily, see <<changing.compression,changing.c
To enable data block encoding for a ColumnFamily, see <<data.block.encoding.enable,data.block.encoding.enable>>.
.Block Compressors
* none
* Snappy
* LZO
* LZ4
* NONE
+
This compression type constant selects no compression, and is the default.
* BROTLI
+
https://en.wikipedia.org/wiki/Brotli[Brotli] is a general-purpose lossless compression algorithm
that compresses data using a combination of a modern variant of the LZ77 algorithm, Huffman
coding, and 2nd order context modeling, with a compression ratio comparable to the best currently
available general-purpose compression methods. It is similar in speed to GZ but offers denser
compression.
* BZIP2
+
https://en.wikipedia.org/wiki/Bzip2[Bzip2] compresses files using the Burrows-Wheeler block
sorting text compression algorithm and Huffman coding. Compression is generally considerably
better than that achieved by the dictionary- (LZ-) based compressors, but both compression and
decompression can be slow in comparison to other options.
* GZ
+
gzip is based on the https://en.wikipedia.org/wiki/Deflate[DEFLATE] algorithm, which is a
combination of LZ77 and Huffman coding. It is universally available in the Java Runtime
Environment, so it is a good lowest common denominator option. However, in comparison to more
modern algorithms like Zstandard, it is quite slow.
* LZ4
+
https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)[LZ4] is a lossless data compression
algorithm that is focused on compression and decompression speed. It belongs to the LZ77 family
of compression algorithms, like Brotli, DEFLATE, Zstandard, and others. In our microbenchmarks
LZ4 is the fastest option for both compression and decompression in that family, and is our
universally recommended option.
* LZMA
+
https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm[LZMA] is a
dictionary compression scheme somewhat similar to the LZ77 algorithm that achieves very high
compression ratios with a computationally expensive predictive model and variable size
compression dictionary, while still maintaining decompression speed similar to other commonly used
compression algorithms. LZMA is superior to all other options in general compression ratio but as
a compressor it can be extremely slow, especially when configured to operate at higher levels of
compression.
* LZO
+
https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Oberhumer[LZO] is another LZ-variant
data compression algorithm, with an implementation focused on decompression speed. It is almost
but not quite as fast as LZ4.
* SNAPPY
+
https://en.wikipedia.org/wiki/Snappy_(compression)[Snappy] is based on ideas from LZ77 but is
optimized for very high compression speed, achieving only "reasonable" compression in trade.
It is as fast as LZ4 but does not compress quite as well. We offer a pure Java Snappy codec
that can be used instead of GZ as the universally available option for any Java runtime on any
hardware architecture.
* ZSTD
+
https://en.wikipedia.org/wiki/Zstd[Zstandard] combines a dictionary-matching stage (LZ77) with
a large search window and a fast entropy coding stage, using both Finite State Entropy and
Huffman coding. Compression speed can vary by a factor of 20 or more between the fastest and
slowest levels, while decompression is uniformly fast, varying by less than 20% between the
fastest and slowest levels.
+
Zstandard is the most flexible of the available compression codec options, offering a compression
ratio similar to LZ4 at level 1 (but with slightly less performance), compression ratios
comparable to DEFLATE at mid levels (but with better performance), and LZMA-like dense
compression (and LZMA-like compression speeds) at high levels, while providing universally fast
decompression.
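
The ratio-versus-speed trade-offs described above can be explored on any machine. The following is a minimal sketch in Python, not HBase code: it uses only the DEFLATE, bzip2, and LZMA implementations that ship in the Python standard library, and the `sample` payload is a hypothetical stand-in for HFile block data. Python's `zlib` emits zlib-framed DEFLATE rather than the gzip container, and Brotli, LZ4, LZO, Snappy, and Zstandard are left out because they would need third-party packages.

```python
import bz2
import lzma
import zlib

# Hypothetical sample: repetitive, text-like rows standing in for block data.
sample = b"".join(
    b"row-%06d,family:qualifier,value-%d\n" % (i, i % 97) for i in range(20000)
)


def percent_saved(compressed: bytes) -> float:
    """Space saved relative to the uncompressed sample, as a percentage."""
    return 100.0 * (1 - len(compressed) / len(sample))


# Compress the same input with each stdlib codec.
results = {
    "deflate level 1": zlib.compress(sample, 1),
    "deflate level 9": zlib.compress(sample, 9),
    "bzip2": bz2.compress(sample),
    "lzma": lzma.compress(sample),
}

for name, blob in results.items():
    print(f"{name:16s} {len(blob):8d} bytes  {percent_saved(blob):6.2f}% saved")
```

The numbers printed depend entirely on the input, so treat them as illustrative only and measure with your own data before choosing a codec.
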
.Data Block Encoding Types
Prefix::
@ -122,16 +182,23 @@ The compression or codec type to use depends on the characteristics of your data
In general, you need to weigh your options between smaller size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at link:https://lists.apache.org/thread.html/481e67a61163efaaf4345510447a9244871a8d428244868345a155ff%401378926618%40%3Cdev.hbase.apache.org%3E[Documenting Guidance on compression and codecs].
* In most cases, enabling LZ4 or Snappy by default is a good choice, because they have a low
performance overhead and provide reasonable space savings. A fast compression algorithm almost
always improves overall system performance by trading some increased CPU usage for better I/O
efficiency.
* If the values are large (and not pre-compressed, such as images), use a data block compressor.
* For [firstterm]_cold data_, which is accessed infrequently, depending on your use case, it might
make sense to opt for Zstandard at its higher compression levels, or LZMA, especially for high
entropy binary data, or Brotli for data similar in characteristics to web data. Bzip2 might also
be a reasonable option but Zstandard is very likely to offer superior decompression speed.
* For [firstterm]_hot data_, which is accessed frequently, you almost certainly want only LZ4,
Snappy, LZO, or Zstandard at a low compression level. These options will not provide as high
a compression ratio but will in trade not unduly impact system performance.
* If you have long keys (compared to the values) or many columns, use a prefix encoder.
FAST_DIFF is recommended.
* If the values are large (and not precompressed, such as images), use a data block compressor.
* Use GZIP for [firstterm]_cold data_, which is accessed infrequently.
GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio.
* Use Snappy or LZO for [firstterm]_hot data_, which is accessed frequently.
Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high of a compression ratio.
* In most cases, enabling Snappy or LZO by default is a good choice, because they have a low performance overhead and provide space savings.
* Before Snappy became available by Google in 2011, LZO was the default.
Snappy has similar qualities as LZO but has been shown to perform better.
* If enabling WAL value compression, consider LZ4 or SNAPPY compression, or Zstandard at
level 1. Reading and writing the WAL is performance critical. That said, the I/O
savings of these compression options can improve overall system performance.
[[hadoop.native.lib]]
=== Making use of Hadoop Native Libraries in HBase
@ -235,11 +302,120 @@ Set in _hbase-env.sh_ the LD_LIBRARY_PATH environment variable when starting you
[[compressor.install]]
==== Configure HBase For Compressors
Before HBase can use a given compressor, its libraries need to be available.
Due to licensing issues, only GZ compression is available to HBase (via native Java libraries) in a default installation.
Other compression libraries are available via the shared library bundled with your hadoop.
The hadoop native library needs to be findable when HBase starts.
See
Compression codecs are provided either by HBase compressor modules or by Hadoop's native compression
support. As described above, you choose a compression type in table or column family schema or in
site configuration using its short label, e.g. _snappy_ for Snappy, or _zstd_ for Zstandard. Which
codec implementation is dynamically loaded for a given label is set in site configuration.
[options="header"]
|===
|Algorithm label|Codec implementation configuration key|Default value
//----------------------
|BROTLI|hbase.io.compress.brotli.codec|org.apache.hadoop.hbase.io.compress.brotli.BrotliCodec
|BZIP2|hbase.io.compress.bzip2.codec|org.apache.hadoop.io.compress.BZip2Codec
|GZ|hbase.io.compress.gz.codec|org.apache.hadoop.hbase.io.compress.ReusableStreamGzipCodec
|LZ4|hbase.io.compress.lz4.codec|org.apache.hadoop.io.compress.Lz4Codec
|LZMA|hbase.io.compress.lzma.codec|org.apache.hadoop.hbase.io.compress.xz.LzmaCodec
|LZO|hbase.io.compress.lzo.codec|com.hadoop.compression.lzo.LzoCodec
|SNAPPY|hbase.io.compress.snappy.codec|org.apache.hadoop.io.compress.SnappyCodec
|ZSTD|hbase.io.compress.zstd.codec|org.apache.hadoop.io.compress.ZStandardCodec
|===
The available codec implementation options are:
[options="header"]
|===
|Label|Codec implementation class|Notes
//----------------------
|BROTLI|org.apache.hadoop.hbase.io.compress.brotli.BrotliCodec|
Implemented with https://github.com/hyperxpro/Brotli4j[Brotli4j]
|BZIP2|org.apache.hadoop.io.compress.BZip2Codec|Hadoop native codec
|GZ|org.apache.hadoop.hbase.io.compress.ReusableStreamGzipCodec|
Requires the Hadoop native GZ codec
|LZ4|org.apache.hadoop.io.compress.Lz4Codec|Hadoop native codec
|LZ4|org.apache.hadoop.hbase.io.compress.aircompressor.Lz4Codec|
Pure Java implementation
|LZ4|org.apache.hadoop.hbase.io.compress.lz4.Lz4Codec|
Implemented with https://github.com/lz4/lz4-java[lz4-java]
|LZMA|org.apache.hadoop.hbase.io.compress.xz.LzmaCodec|
Implemented with https://tukaani.org/xz/java.html[XZ For Java]
|LZO|com.hadoop.compression.lzo.LzoCodec|Hadoop native codec,
requires GPL licensed native dependencies
|LZO|org.apache.hadoop.io.compress.LzoCodec|Hadoop native codec,
requires GPL licensed native dependencies
|LZO|org.apache.hadoop.hbase.io.compress.aircompressor.LzoCodec|
Pure Java implementation
|SNAPPY|org.apache.hadoop.io.compress.SnappyCodec|Hadoop native codec
|SNAPPY|org.apache.hadoop.hbase.io.compress.aircompressor.SnappyCodec|
Pure Java implementation
|SNAPPY|org.apache.hadoop.hbase.io.compress.xerial.SnappyCodec|
Implemented with https://github.com/xerial/snappy-java[snappy-java]
|ZSTD|org.apache.hadoop.io.compress.ZStandardCodec|Hadoop native codec
|ZSTD|org.apache.hadoop.hbase.io.compress.aircompressor.ZStdCodec|
Pure Java implementation, limited to a fixed compression level,
not data compatible with the Hadoop zstd codec
|ZSTD|org.apache.hadoop.hbase.io.compress.zstd.ZStdCodec|
Implemented with https://github.com/luben/zstd-jni[zstd-jni],
supports all compression levels, supports custom dictionaries
|===
Specify which codec implementation option you prefer for a given compression algorithm
in site configuration, like so:
[source]
----
...
<property>
<name>hbase.io.compress.lz4.codec</name>
<value>org.apache.hadoop.hbase.io.compress.lz4.Lz4Codec</value>
</property>
...
----
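
When several algorithms need a non-default implementation, the property fragments can be generated rather than typed by hand. The sketch below is a hypothetical helper, not part of HBase: it assumes the configuration key pattern `hbase.io.compress.<label>.codec` shown in the table above, with the label lowercased, and uses only Python's standard library. The function name `codec_property` is illustrative.

```python
import xml.etree.ElementTree as ET


def codec_property(label: str, codec_class: str) -> str:
    """Build one <property> element selecting a codec class for a label.

    Assumes the key pattern hbase.io.compress.<label>.codec from the table.
    """
    key = f"hbase.io.compress.{label.lower()}.codec"
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = key
    ET.SubElement(prop, "value").text = codec_class
    return ET.tostring(prop, encoding="unicode")


# Select the lz4-java based implementation for the LZ4 label.
print(codec_property("LZ4", "org.apache.hadoop.hbase.io.compress.lz4.Lz4Codec"))
```

The printed element is meant to be pasted inside the `<configuration>` element of the site configuration file. The class name must be one of the implementation options listed above.
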
.Compressor Microbenchmarks
See https://github.com/apurtell/jmh-compression-tests
256MB (258,126,022 bytes exactly) of block data was extracted from two HFiles containing Common
Crawl data ingested using IntegrationLoadTestCommonCrawl, 2,680 blocks in total. Each new codec
implementation processed this data as if the blocks were being compressed again for write into
an HFile, but without writing any data, so that only the CPU time and resource demand of the
codec itself are compared. Absolute performance numbers will vary depending on the hardware and
software particulars of your deployment; the relative differences are what matter. Measured time
is the average time in milliseconds required to compress all blocks of the 256MB file. This is
how long it would take to write the HFile containing these contents, minus the I/O overhead of
block encoding and actual persistence.
These are the results:
[options="header"]
|===
|Codec|Level|Time (milliseconds)|Result (bytes)|Improvement
//----------------------
|AirCompressor LZ4|-|349.989 ± 2.835|76,999,408|70.17%
|AirCompressor LZO|-|334.554 ± 3.243|79,369,805|69.25%
|AirCompressor Snappy|-|364.153 ± 19.718|80,201,763|68.93%
|AirCompressor Zstandard|3 (effective)|1108.267 ± 8.969|55,129,189|78.64%
|Brotli|1|593.107 ± 2.376|58,672,319|77.27%
|Brotli|3|1345.195 ± 27.327|53,917,438|79.11%
|Brotli|6|2812.411 ± 25.372|48,696,441|81.13%
|Brotli|10|74615.936 ± 224.854|44,970,710|82.58%
|LZ4 (lz4-java)|-|303.045 ± 0.783|76,974,364|70.18%
|LZMA|1|6410.428 ± 115.065|49,948,535|80.65%
|LZMA|3|8144.620 ± 152.119|49,109,363|80.97%
|LZMA|6|43802.576 ± 382.025|46,951,810|81.81%
|LZMA|9|49821.979 ± 580.110|46,951,810|81.81%
|Snappy (xerial)|-|360.225 ± 2.324|80,749,937|68.72%
|Zstd (zstd-jni)|1|654.699 ± 16.839|56,719,994|78.03%
|Zstd (zstd-jni)|3|839.160 ± 24.906|54,573,095|78.86%
|Zstd (zstd-jni)|5|1594.373 ± 22.384|52,025,485|79.84%
|Zstd (zstd-jni)|7|2308.705 ± 24.744|50,651,554|80.38%
|Zstd (zstd-jni)|9|3659.677 ± 58.018|50,208,425|80.55%
|Zstd (zstd-jni)|12|8705.294 ± 58.080|49,841,446|80.69%
|Zstd (zstd-jni)|15|19785.646 ± 278.080|48,499,508|81.21%
|Zstd (zstd-jni)|18|47702.097 ± 442.670|48,319,879|81.28%
|Zstd (zstd-jni)|22|97799.695 ± 1106.571|48,212,220|81.32%
|===
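
The Improvement column can be reproduced from the other columns. The sketch below assumes, since the text does not state the formula, that improvement is the percentage reduction from the 258,126,022-byte input; the table values are consistent with that reading. The throughput helper is an added illustration that uses decimal megabytes. Both function names are hypothetical.

```python
ORIGINAL_BYTES = 258_126_022  # uncompressed block data, from the text above


def improvement(result_bytes: int) -> float:
    """Percentage size reduction relative to the uncompressed input."""
    return round(100.0 * (1 - result_bytes / ORIGINAL_BYTES), 2)


def throughput_mb_per_s(time_ms: float) -> float:
    """Compression throughput in decimal megabytes per second."""
    return (ORIGINAL_BYTES / 1_000_000) / (time_ms / 1000.0)


# Rows taken from the results table: (codec, mean time in ms, result bytes).
rows = [
    ("LZ4 (lz4-java)", 303.045, 76_974_364),
    ("Zstd (zstd-jni) level 1", 654.699, 56_719_994),
    ("Zstd (zstd-jni) level 22", 97_799.695, 48_212_220),
]

for codec, time_ms, result_bytes in rows:
    print(
        f"{codec:26s} {improvement(result_bytes):6.2f}%  "
        f"{throughput_mb_per_s(time_ms):8.1f} MB/s"
    )
```

Viewing the results as throughput shows how quickly the cost of the higher levels grows in exchange for a few additional percentage points of space savings.
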
.Compressor Support On the Master
@ -257,22 +433,29 @@ If native libraries are not available and Java's GZIP is used, `Got brand-new co
See <<brand.new.compressor,brand.new.compressor>>).
[[lzo.compression]]
.Install LZO Support
.Install Hadoop Native LZO Support
HBase cannot ship with LZO because of incompatibility between HBase, which uses an Apache Software License (ASL) and LZO, which uses a GPL license.
HBase cannot ship with the Hadoop native LZO codec because of incompatibility between HBase, which uses an Apache Software License (ASL), and LZO, which uses a GPL license.
See the link:https://github.com/twitter/hadoop-lzo/blob/master/README.md[Hadoop-LZO at Twitter] for information on configuring LZO support for HBase.
If you depend upon LZO compression, consider configuring your RegionServers to fail to start if LZO is not available.
If you depend upon LZO compression, consider using the pure Java and ASL licensed
AirCompressor LZO codec option instead of the Hadoop native default, or configure your
RegionServers to fail to start if native LZO support is not available.
See <<hbase.regionserver.codecs,hbase.regionserver.codecs>>.
[[lz4.compression]]
.Configure LZ4 Support
.Configure Hadoop Native LZ4 Support
LZ4 support is bundled with Hadoop.
Make sure the hadoop shared library (libhadoop.so) is accessible when you start HBase.
After configuring your platform (see <<hadoop.native.lib,hadoop.native.lib>>), you can make a symbolic link from HBase to the native Hadoop libraries.
This assumes the two software installs are colocated.
For example, if my 'platform' is Linux-amd64-64:
LZ4 support is bundled with Hadoop and is the default LZ4 codec implementation.
It is not required that you make use of the Hadoop LZ4 codec. Our LZ4 codec implemented
with lz4-java offers superior performance, and the AirCompressor LZ4 codec offers a
pure Java option for use where native support is not available.
That said, if you prefer the Hadoop option, make sure the Hadoop shared library
(libhadoop.so) is accessible when you start HBase.
After configuring your platform (see <<hadoop.native.lib,hadoop.native.lib>>), you can
make a symbolic link from HBase to the native Hadoop libraries. This assumes the two
software installs are colocated. For example, if my 'platform' is Linux-amd64-64:
[source,bourne]
----
$ cd $HBASE_HOME
@ -287,10 +470,15 @@ hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}
----
[[snappy.compression.installation]]
.Install Snappy Support
.Install Hadoop Native Snappy Support
HBase does not ship with Snappy support because of licensing issues.
You can install Snappy binaries (for instance, by using +yum install snappy+ on CentOS) or build Snappy from source.
Snappy support is bundled with Hadoop and is the default Snappy codec implementation.
It is not required that you make use of the Hadoop Snappy codec. Our Snappy codec
implemented with Xerial Snappy offers superior performance, and the AirCompressor
Snappy codec offers a pure Java option for use where native support is not available.
That said, if you prefer the Hadoop codec option, you can install Snappy binaries (for
instance, by using +yum install snappy+ on CentOS) or build Snappy from source.
After installing Snappy, search for the shared library, which will be called _libsnappy.so.X_ where X is a number.
If you built from source, copy the shared library to a known location on your system, such as _/opt/snappy/lib/_.