HBASE-11338 Expand documentation on bloom filters (Misty Stanley-Jones)

This commit is contained in:
Michael Stack 2014-06-19 15:11:15 -07:00
parent 5764df2974
commit 9829bb9c24
1 changed files with 123 additions and 10 deletions


@ -295,18 +295,131 @@
<section
xml:id="schema.bloom">
<title>Bloom Filters</title>
<para>A Bloom filter, named for its creator, Burton Howard Bloom, is a data structure which is
designed to predict whether a given element is a member of a set of data. A positive result
from a Bloom filter is not always accurate, but a negative result is guaranteed to be
accurate. Bloom filters are designed to be "accurate enough" for sets of data which are so
large that conventional hashing mechanisms would be impractical. For more information about
Bloom filters in general, refer to <link
xlink:href="http://en.wikipedia.org/wiki/Bloom_filter" />.</para>
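<para>The guarantee described above follows directly from how a Bloom filter stores
membership: each element sets several bits, and a lookup only answers "definitely not
present" if one of those bits is unset. A minimal sketch in Java (the class and hash
scheme here are invented for illustration and are not HBase code):</para>

```java
import java.util.BitSet;

// Illustrative Bloom filter: k hash functions set k bits per added element.
public class SimpleBloomFilter {
  private final BitSet bits;
  private final int size;      // number of bits in the filter
  private final int hashCount; // number of hash functions (k)

  public SimpleBloomFilter(int size, int hashCount) {
    this.size = size;
    this.hashCount = hashCount;
    this.bits = new BitSet(size);
  }

  // Derive the i-th bit position for an element by mixing its hash code.
  private int index(String element, int i) {
    int h = element.hashCode() * 31 + i * 0x9E3779B9;
    return Math.floorMod(h, size); // always non-negative
  }

  public void add(String element) {
    for (int i = 0; i < hashCount; i++) {
      bits.set(index(element, i));
    }
  }

  // May return a false positive, but never a false negative.
  public boolean mightContain(String element) {
    for (int i = 0; i < hashCount; i++) {
      if (!bits.get(index(element, i))) {
        return false; // at least one bit unset: element was definitely never added
      }
    }
    return true; // all bits set: probably added, possibly a collision
  }
}
```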
<para>In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the
number of disk reads for a given Get operation (Bloom filters do not work with Scans) to only the StoreFiles likely to
contain the desired Row. The potential performance gain increases with the number of
parallel reads. </para>
<para>The Bloom filters themselves are stored in the metadata of each HFile and never need to
be updated. When an HFile is opened because a region is deployed to a RegionServer, the
Bloom filter is loaded into memory. </para>
<para>HBase includes some tuning mechanisms for folding the Bloom filter to reduce the size
and keep the false positive rate within a desired range.</para>
<para>Bloom filters were introduced in <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-1200">HBASE-1200</link>. Since
HBase 0.96, row-based Bloom filters are enabled by default. (<link
xlink:href="https://issues.apache.org/jira/browse/HBASE-8450">HBASE-8450</link>)</para>
<para>For more information on Bloom filters in relation to HBase, see <xref
linkend="blooms" />, or the following Quora discussion: <link
xlink:href="http://www.quora.com/How-are-bloom-filters-used-in-HBase">How are bloom
filters used in HBase?</link>. </para>
<section xml:id="bloom.filters.when">
<title>When To Use Bloom Filters</title>
<para>Since HBase 0.96, row-based Bloom filters are enabled by default. You may choose to
disable them or to change some tables to use row+column Bloom filters, depending on the
characteristics of your data and how it is loaded into HBase.</para>
<para>To determine whether Bloom filters could have a positive impact, check the value of
<code>blockCacheHitRatio</code> in the RegionServer metrics. If Bloom filters are enabled, the value of
<code>blockCacheHitRatio</code> should increase, because the Bloom filter is filtering out blocks that
are definitely not needed. </para>
<para>You can choose to enable Bloom filters for a row or for a row+column combination. If
you generally scan entire rows, the row+column combination will not provide any benefit. A
row-based Bloom filter can operate on a row+column Get, but not the other way around.
However, if you have a large number of column-level Puts, such that a row may be present
in every StoreFile, a row-based filter will always return a positive result and provide no
benefit. Because they must store more keys, row+column Bloom filters require more space
than row-based filters, unless each row contains only a single column. Bloom filters work
best when each data entry is at least a few kilobytes in size. </para>
<para>Overhead is reduced when your data is stored in a few larger StoreFiles, because
fewer filters need to be checked to find a specific row, avoiding extra disk IO during
low-level scans. </para>
<para>Bloom filters need to be rebuilt upon deletion, so they may not be appropriate in
environments with a large number of deletions.</para>
</section>
<section>
<title>Enabling Bloom Filters</title>
<para>Bloom filters are enabled on a Column Family. You can do this in the HBase Shell, or
by using the <code>setBloomFilterType</code> method of HColumnDescriptor in the HBase
API. Valid values are
<literal>NONE</literal> (the default), <literal>ROW</literal>, or
<literal>ROWCOL</literal>. See <xref
linkend="bloom.filters.when" /> for more information on <literal>ROW</literal> versus
<literal>ROWCOL</literal>. See also the API documentation for <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.</para>
<para>The following example creates a table and enables a ROWCOL Bloom filter on the
<literal>colfam1</literal> column family.</para>
<screen>
hbase> <userinput>create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}</userinput>
</screen>
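<para>The same setting can also be applied through the Java client API. The following is
a sketch only, assuming the HBase client libraries on the classpath and a reachable
cluster; the table and column family names are the same example names used above:</para>

```java
// Sketch: create a table with a ROWCOL Bloom filter on one column family.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("mytable"));
HColumnDescriptor colDesc = new HColumnDescriptor("colfam1");
colDesc.setBloomFilterType(BloomType.ROWCOL); // NONE, ROW, or ROWCOL
tableDesc.addFamily(colDesc);
admin.createTable(tableDesc);
admin.close();
```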
</section>
<section>
<title>Configuring Server-Wide Behavior of Bloom Filters</title>
<para>You can configure the following settings in the <filename>hbase-site.xml</filename>.
</para>
<informaltable>
<tgroup cols="3">
<thead>
<row>
<entry>Parameter</entry>
<entry>Default</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry><para><code>io.hfile.bloom.enabled</code></para></entry>
<entry><para><literal>yes</literal></para></entry>
<entry><para>Set to <literal>no</literal> to disable Bloom filters server-wide,
for instance if a problem occurs</para></entry>
</row>
<row>
<entry><para><code>io.hfile.bloom.error.rate</code></para></entry>
<entry><para><literal>.01</literal></para></entry>
<entry><para>The average false positive rate for Bloom filters, expressed as a
decimal. For example, <literal>.01</literal> represents a 1% false positive
rate. Folding is used to maintain this rate.</para></entry>
</row>
<row>
<entry><para><code>io.hfile.bloom.max.fold</code></para></entry>
<entry><para><literal>7</literal></para></entry>
<entry><para>The guaranteed maximum fold rate. Changing this setting should not be
necessary and is not recommended.</para></entry>
</row>
<row>
<entry><para><code>io.storefile.bloom.max.keys</code></para></entry>
<entry><para><literal>128000000</literal></para></entry>
<entry><para>For default (single-block) Bloom filters, this specifies the maximum
number of keys.</para></entry>
</row>
<row>
<entry><para><code>io.storefile.delete.family.bloom.enabled</code></para></entry>
<entry><para><literal>true</literal></para></entry>
<entry><para>Master switch to enable Delete Family Bloom filters and store them in
the StoreFile.</para></entry>
</row>
<row>
<entry><para><code>io.storefile.bloom.block.size</code></para></entry>
<entry><para><literal>65536</literal></para></entry>
<entry><para>Target Bloom block size. Bloom filter blocks of approximately this size
are interleaved with data blocks.</para></entry>
</row>
<row>
<entry><para><code>hfile.block.bloom.cacheonwrite</code></para></entry>
<entry><para><literal>false</literal></para></entry>
<entry><para>Enables cache-on-write for inline blocks of a compound Bloom filter.</para></entry>
</row>
</tbody>
</tgroup>
</informaltable>
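<para>The error-rate setting above determines how large each filter must be. For a target
false positive rate <varname>p</varname>, standard Bloom filter math gives roughly
-ln(p)/(ln 2)^2 bits per key and -ln(p)/ln 2 hash functions. A small self-contained
sketch of this arithmetic for the default error rate of <literal>.01</literal> (general
Bloom filter math, not HBase code):</para>

```java
public class BloomSizing {
  public static void main(String[] args) {
    double p = 0.01; // default io.hfile.bloom.error.rate
    // bits per key: m/n = -ln(p) / (ln 2)^2
    double bitsPerKey = -Math.log(p) / (Math.log(2) * Math.log(2));
    // optimal number of hash functions: k = -ln(p) / ln 2
    long hashCount = Math.round(-Math.log(p) / Math.log(2));
    System.out.printf("bits/key = %.2f, hash functions = %d%n", bitsPerKey, hashCount);
    // prints roughly: bits/key = 9.59, hash functions = 7
  }
}
```

<para>Halving the error rate therefore costs only about 1.4 additional bits per key,
which is why folding can shrink a filter substantially while keeping the false positive
rate within the configured bound.</para>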
</section>
</section>
<section
xml:id="schema.cf.blocksize">