HBASE-12738 Chunk Ref Guide into file-per-chapter
This commit is contained in:
parent d9f25e30a1
commit a1fe1e0964
File diff suppressed because it is too large
@ -0,0 +1,44 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="asf"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<title>HBase and the Apache Software Foundation</title>
|
||||||
|
<para>HBase is a project in the Apache Software Foundation and as such there are responsibilities to the ASF to ensure
|
||||||
|
a healthy project.</para>
|
||||||
|
<section xml:id="asf.devprocess"><title>ASF Development Process</title>
|
||||||
|
<para>See the <link xlink:href="http://www.apache.org/dev/#committers">Apache Development Process page</link>
|
||||||
|
for information on how the ASF is structured (e.g., PMC, committers, contributors), tips on contributing
and getting involved, and how open source works at the ASF.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="asf.reporting"><title>ASF Board Reporting</title>
|
||||||
|
<para>Once a quarter, each project in the ASF portfolio submits a report to the ASF board. This is done by the HBase project
|
||||||
|
lead and the committers. See <link xlink:href="http://www.apache.org/foundation/board/reporting">ASF board reporting</link> for more information.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
</appendix>
|
File diff suppressed because it is too large
@ -0,0 +1,535 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="compression"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
|
||||||
|
<title>Compression and Data Block Encoding In
|
||||||
|
HBase<indexterm><primary>Compression</primary><secondary>Data Block
|
||||||
|
Encoding</secondary><seealso>codecs</seealso></indexterm></title>
|
||||||
|
<note>
|
||||||
|
<para>Codecs mentioned in this section are for encoding and decoding data blocks or row keys.
|
||||||
|
For information about replication codecs, see <xref
|
||||||
|
linkend="cluster.replication.preserving.tags" />.</para>
|
||||||
|
</note>
|
||||||
|
<para>Some of the information in this section is pulled from a <link
|
||||||
|
xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1/v=threaded">discussion</link> on the
|
||||||
|
HBase Development mailing list.</para>
|
||||||
|
<para>HBase supports several different compression algorithms which can be enabled on a
|
||||||
|
ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking
|
||||||
|
advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys
|
||||||
|
and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in
|
||||||
|
cells, and can significantly reduce the storage space needed to store uncompressed
|
||||||
|
data.</para>
|
||||||
|
<para>Compressors and data block encoding can be used together on the same ColumnFamily.</para>
|
||||||
|
|
||||||
|
<formalpara>
|
||||||
|
<title>Changes Take Effect Upon Compaction</title>
|
||||||
|
<para>If you change compression or encoding for a ColumnFamily, the changes take effect during
|
||||||
|
compaction.</para>
|
||||||
|
</formalpara>
|
||||||
|
|
||||||
|
<para>Some codecs take advantage of capabilities built into Java, such as GZip compression.
|
||||||
|
Others rely on native libraries. Native libraries may be available as part of Hadoop, such as
|
||||||
|
LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs,
|
||||||
|
such as Google Snappy, need to be installed first. Some codecs are licensed in ways that
|
||||||
|
conflict with HBase's license and cannot be shipped as part of HBase.</para>
|
||||||
|
|
||||||
|
<para>This section discusses common codecs that are used and tested with HBase. No matter what
|
||||||
|
codec you use, be sure to test that it is installed correctly and is available on all nodes in
|
||||||
|
your cluster. Extra operational steps may be necessary to be sure that codecs are available on
|
||||||
|
newly-deployed nodes. You can use the <xref
|
||||||
|
linkend="compression.test" /> utility to check that a given codec is correctly
|
||||||
|
installed.</para>
|
||||||
|
|
||||||
|
<para>To configure HBase to use a compressor, see <xref
|
||||||
|
linkend="compressor.install" />. To enable a compressor for a ColumnFamily, see <xref
|
||||||
|
linkend="changing.compression" />. To enable data block encoding for a ColumnFamily, see
|
||||||
|
<xref linkend="data.block.encoding.enable" />.</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<title>Block Compressors</title>
|
||||||
|
<listitem>
|
||||||
|
<para>none</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Snappy</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>LZO</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>LZ4</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>GZ</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
|
||||||
|
|
||||||
|
<itemizedlist xml:id="data.block.encoding.types">
|
||||||
|
<title>Data Block Encoding Types</title>
|
||||||
|
<listitem>
|
||||||
|
<para>Prefix - Often, keys are very similar. Specifically, keys often share a common prefix
|
||||||
|
and only differ near the end. For instance, one key might be
|
||||||
|
<literal>RowKey:Family:Qualifier0</literal> and the next key might be
|
||||||
|
<literal>RowKey:Family:Qualifier1</literal>. In Prefix encoding, an extra column is
|
||||||
|
added which holds the length of the prefix shared between the current key and the previous
|
||||||
|
key. Assuming the first key here is totally different from the key before, its prefix
|
||||||
|
length is 0. The second key's prefix length is <literal>23</literal>, since they have the
|
||||||
|
first 23 characters in common.</para>
|
||||||
|
<para>Obviously if the keys tend to have nothing in common, Prefix will not provide much
|
||||||
|
benefit.</para>
|
||||||
|
<para>The following image shows a hypothetical ColumnFamily with no data block encoding.</para>
|
||||||
|
<figure>
|
||||||
|
<title>ColumnFamily with No Encoding</title>
|
||||||
|
<mediaobject>
|
||||||
|
<imageobject>
|
||||||
|
<imagedata fileref="data_block_no_encoding.png" width="800"/>
|
||||||
|
</imageobject>
|
||||||
|
<caption><para>A ColumnFamily with no encoding</para></caption>
|
||||||
|
</mediaobject>
|
||||||
|
</figure>
|
||||||
|
<para>Here is the same data with prefix data encoding.</para>
|
||||||
|
<figure>
|
||||||
|
<title>ColumnFamily with Prefix Encoding</title>
|
||||||
|
<mediaobject>
|
||||||
|
<imageobject>
|
||||||
|
<imagedata fileref="data_block_prefix_encoding.png" width="800"/>
|
||||||
|
</imageobject>
|
||||||
|
<caption><para>A ColumnFamily with prefix encoding</para></caption>
|
||||||
|
</mediaobject>
|
||||||
|
</figure>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key
|
||||||
|
sequentially as a monolithic series of bytes, each key field is split so that each part of
|
||||||
|
the key can be compressed more efficiently. Two new fields are added: timestamp and type.
|
||||||
|
If the ColumnFamily is the same as the previous row, it is omitted from the current row.
|
||||||
|
If the key length, value length or type are the same as the previous row, the field is
|
||||||
|
omitted. In addition, for increased compression, the timestamp is stored as a Diff from
|
||||||
|
the previous row's timestamp, rather than being stored in full. Given the two row keys in
|
||||||
|
the Prefix example, and given an exact match on timestamp and the same type, neither the
|
||||||
|
value length, or type needs to be stored for the second row, and the timestamp value for
|
||||||
|
the second row is just 0, rather than a full timestamp.</para>
|
||||||
|
<para>Diff encoding is disabled by default because writing and scanning are slower but more
|
||||||
|
data is cached.</para>
|
||||||
|
<para>This image shows the same ColumnFamily from the previous images, with Diff encoding.</para>
|
||||||
|
<figure>
|
||||||
|
<title>ColumnFamily with Diff Encoding</title>
|
||||||
|
<mediaobject>
|
||||||
|
<imageobject>
|
||||||
|
<imagedata fileref="data_block_diff_encoding.png" width="800"/>
|
||||||
|
</imageobject>
|
||||||
|
<caption><para>A ColumnFamily with diff encoding</para></caption>
|
||||||
|
</mediaobject>
|
||||||
|
</figure>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Fast Diff - Fast Diff works similarly to Diff, but uses a faster implementation. It also
|
||||||
|
adds another field which stores a single bit to track whether the data itself is the same
|
||||||
|
as the previous row. If it is, the data is not stored again. Fast Diff is the recommended
|
||||||
|
codec to use if you have long keys or many columns. The data format is nearly identical to
|
||||||
|
Diff encoding, so there is not an image to illustrate it.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It
|
||||||
|
provides similar memory savings to the Prefix, Diff, and Fast Diff encoder, but provides
|
||||||
|
faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate
|
||||||
|
for applications that have high block cache hit ratios. It introduces new 'tree' fields
|
||||||
|
for the row and column. The row tree field contains a list of offsets/references
|
||||||
|
corresponding to the cells in that row. This allows for a good deal of compression. For
|
||||||
|
more details about Prefix Tree encoding, see <link
|
||||||
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-4676">HBASE-4676</link>. It is
|
||||||
|
difficult to graphically illustrate a prefix tree, so no image is included. See the
|
||||||
|
Wikipedia article for <link
|
||||||
|
xlink:href="http://en.wikipedia.org/wiki/Trie">Trie</link> for more general information
|
||||||
|
about this data structure.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Which Compressor or Data Block Encoder To Use</title>
|
||||||
|
<para>The compression or codec type to use depends on the characteristics of your data.
|
||||||
|
Choosing the wrong type could cause your data to take more space rather than less, and can
|
||||||
|
have performance implications. In general, you need to weigh your options between smaller
|
||||||
|
size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression and codecs</link>; a combined shell sketch follows the list. </para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>If you have long keys (compared to the values) or many columns, use a prefix
|
||||||
|
encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree
|
||||||
|
encoding.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>If the values are large (and not precompressed, such as images), use a data block
|
||||||
|
compressor.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Use GZIP for <firstterm>cold data</firstterm>, which is accessed infrequently. GZIP
|
||||||
|
compression uses more CPU resources than Snappy or LZO, but provides a higher
|
||||||
|
compression ratio.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Use Snappy or LZO for <firstterm>hot data</firstterm>, which is accessed
|
||||||
|
frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high
|
||||||
|
of a compression ratio.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>In most cases, enabling Snappy or LZO by default is a good choice, because they have
|
||||||
|
a low performance overhead and provide space savings.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Before Google released Snappy in 2011, LZO was the default. Snappy has
similar qualities to LZO but has been shown to perform better.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
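<para>As a rough illustration of the guidelines above (the table and column family names here
  are hypothetical, chosen only for the sketch), a frequently-accessed table with long keys
  might combine FAST_DIFF encoding with Snappy compression, while a rarely-read archive table
  might use GZ:</para>
<screen><![CDATA[
hbase> create 'hot_table', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'}
hbase> create 'cold_table', {NAME => 'cf', COMPRESSION => 'GZ'}
]]></screen>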
|
||||||
|
</section>
|
||||||
|
<section xml:id="hadoop.native.lib">
|
||||||
|
<title>Making use of Hadoop Native Libraries in HBase</title>
|
||||||
|
<para>The Hadoop shared library provides a number of facilities, including
compression libraries and fast CRC checksumming. To make these facilities available
to HBase, do the following. HBase/Hadoop will fall back to
alternatives if it cannot find the native library versions, or
fail outright if you ask for an explicit compressor and there is
no alternative available.</para>
|
||||||
|
<para>If you see the following in your HBase logs, you know that HBase was unable
|
||||||
|
to locate the Hadoop native libraries:
|
||||||
|
<programlisting>2014-08-07 09:26:20,139 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable</programlisting>
|
||||||
|
If the libraries loaded successfully, the WARN message does not show.
|
||||||
|
</para>
|
||||||
|
<para>Let's presume your Hadoop shipped with a native library that
|
||||||
|
suits the platform you are running HBase on. To check if the Hadoop
|
||||||
|
native library is available to HBase, run the following tool (available in
|
||||||
|
Hadoop 2.1 and greater):
|
||||||
|
<programlisting>$ ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
|
||||||
|
2014-08-26 13:15:38,717 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
|
||||||
|
Native library checking:
|
||||||
|
hadoop: false
|
||||||
|
zlib: false
|
||||||
|
snappy: false
|
||||||
|
lz4: false
|
||||||
|
bzip2: false
|
||||||
|
2014-08-26 13:15:38,863 INFO [main] util.ExitUtil: Exiting with status 1</programlisting>
|
||||||
|
The above shows that the native Hadoop library is not available in the HBase context.
|
||||||
|
</para>
|
||||||
|
<para>To fix the above, either copy the Hadoop native libraries locally, or symlink to
them if the Hadoop and HBase installs are adjacent in the filesystem.
|
||||||
|
You could also point at their location by setting the <varname>LD_LIBRARY_PATH</varname> environment
|
||||||
|
variable.</para>
|
||||||
|
<para>Where the JVM looks to find native libraries is "system dependent"
(see <classname>java.lang.System#loadLibrary(name)</classname>). On Linux, by default,
it is going to look in <filename>lib/native/PLATFORM</filename> where <varname>PLATFORM</varname>
is the label for the platform your HBase is installed on.
On a local Linux machine, it seems to be the concatenation of the Java properties
<varname>os.name</varname> and <varname>os.arch</varname>, followed by whether 32 or 64 bit.
On startup, HBase prints out all of the Java system properties, so find os.name and os.arch
in the log. For example:
|
||||||
|
<programlisting>....
|
||||||
|
2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
|
||||||
|
2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
|
||||||
|
...
|
||||||
|
</programlisting>
|
||||||
|
So in this case, the PLATFORM string is <varname>Linux-amd64-64</varname>.
|
||||||
|
Copying the Hadoop native libraries or symlinking at <filename>lib/native/Linux-amd64-64</filename>
|
||||||
|
will ensure they are found. Check with the Hadoop <filename>NativeLibraryChecker</filename>.
|
||||||
|
</para>
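<para>As a sketch only, assuming colocated Hadoop and HBase installs and the
  <varname>Linux-amd64-64</varname> platform string from above, the symlink could be created
  like this (the same pattern appears again in the LZ4 section below):</para>
<programlisting language="bourne">$ cd $HBASE_HOME
$ mkdir -p lib/native
# point HBase's platform directory at the Hadoop native libraries
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>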
|
||||||
|
|
||||||
|
<para>Here is an example of how to point at the Hadoop libs with the <varname>LD_LIBRARY_PATH</varname>
environment variable:
|
||||||
|
<programlisting>$ LD_LIBRARY_PATH=~/hadoop-2.5.0-SNAPSHOT/lib/native ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
|
||||||
|
2014-08-26 13:42:49,332 INFO [main] bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
|
||||||
|
2014-08-26 13:42:49,337 INFO [main] zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
|
||||||
|
Native library checking:
|
||||||
|
hadoop: true /home/stack/hadoop-2.5.0-SNAPSHOT/lib/native/libhadoop.so.1.0.0
|
||||||
|
zlib: true /lib64/libz.so.1
|
||||||
|
snappy: true /usr/lib64/libsnappy.so.1
|
||||||
|
lz4: true revision:99
|
||||||
|
bzip2: true /lib64/libbz2.so.1</programlisting>
|
||||||
|
Set the <varname>LD_LIBRARY_PATH</varname> environment variable in <filename>hbase-env.sh</filename> when starting HBase.
|
||||||
|
</para>
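<para>A minimal sketch of such an <filename>hbase-env.sh</filename> entry follows; the Hadoop
  path is an assumption for illustration and should be adjusted to your install:</para>
<programlisting language="bourne"># in conf/hbase-env.sh -- point the dynamic linker at the Hadoop native libraries
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native:$LD_LIBRARY_PATH</programlisting>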
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Compressor Configuration, Installation, and Use</title>
|
||||||
|
<section
|
||||||
|
xml:id="compressor.install">
|
||||||
|
<title>Configure HBase For Compressors</title>
|
||||||
|
<para>Before HBase can use a given compressor, its libraries need to be available. Due to
|
||||||
|
licensing issues, only GZ compression is available to HBase (via native Java libraries) in
|
||||||
|
a default installation. Other compression libraries are available via the shared library
bundled with your Hadoop. The Hadoop native library needs to be findable when HBase
starts. See <xref linkend="hadoop.native.lib" />.</para>
|
||||||
|
<section>
|
||||||
|
<title>Compressor Support On the Master</title>
|
||||||
|
<para>A new configuration setting was introduced in HBase 0.95 to check the Master to
determine which data block encoders are installed and configured on it, and to assume that
|
||||||
|
the entire cluster is configured the same. This option,
|
||||||
|
<code>hbase.master.check.compression</code>, defaults to <literal>true</literal>. This
|
||||||
|
prevents the situation described in <link
|
||||||
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-6370">HBASE-6370</link>, where
|
||||||
|
a table is created or modified to support a codec that a region server does not support,
|
||||||
|
leading to failures that take a long time to occur and are difficult to debug. </para>
|
||||||
|
<para>If <code>hbase.master.check.compression</code> is enabled, libraries for all desired
|
||||||
|
compressors need to be installed and configured on the Master, even if the Master does
|
||||||
|
not run a region server.</para>
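<para>If you need to disable this check (a sketch only; the default of <literal>true</literal>
  is usually what you want), set the property in <filename>hbase-site.xml</filename> on the
  Master:</para>
<programlisting language="xml"><![CDATA[
<property>
  <!-- set to false to skip the Master-side codec availability check -->
  <name>hbase.master.check.compression</name>
  <value>false</value>
</property>
]]></programlisting>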
|
||||||
|
</section>
|
||||||
|
<section>
|
||||||
|
<title>Install GZ Support Via Native Libraries</title>
|
||||||
|
<para>HBase uses Java's built-in GZip support unless the native Hadoop libraries are
|
||||||
|
available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to
|
||||||
|
set the environment variable <envar>HBASE_LIBRARY_PATH</envar> for the user running
|
||||||
|
HBase. If native libraries are not available and Java's GZIP is used, <literal>Got
|
||||||
|
brand-new compressor</literal> reports will be present in the logs. See <xref
|
||||||
|
linkend="brand.new.compressor" />).</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="lzo.compression">
|
||||||
|
<title>Install LZO Support</title>
|
||||||
|
<para>HBase cannot ship with LZO because of incompatibility between HBase, which uses an
|
||||||
|
Apache Software License (ASL), and LZO, which uses a GPL license. See the <link
|
||||||
|
xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO
|
||||||
|
Compression</link> wiki page for information on configuring LZO support for HBase. </para>
|
||||||
|
<para>If you depend upon LZO compression, consider configuring your RegionServers to fail
|
||||||
|
to start if LZO is not available. See <xref
|
||||||
|
linkend="hbase.regionserver.codecs" />.</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="lz4.compression">
|
||||||
|
<title>Configure LZ4 Support</title>
|
||||||
|
<para>LZ4 support is bundled with Hadoop. Make sure the Hadoop shared library
|
||||||
|
(libhadoop.so) is accessible when you start
|
||||||
|
HBase. After configuring your platform (see <xref
|
||||||
|
linkend="hbase.native.platform" />), you can make a symbolic link from HBase to the native Hadoop
|
||||||
|
libraries. This assumes the two software installs are colocated. For example, if my
|
||||||
|
'platform' is Linux-amd64-64:
|
||||||
|
<programlisting language="bourne">$ cd $HBASE_HOME
|
||||||
|
$ mkdir lib/native
|
||||||
|
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>
|
||||||
|
Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart)
|
||||||
|
HBase. Afterward, you can create and alter tables to enable LZ4 as a
|
||||||
|
compression codec:
|
||||||
|
<screen>
|
||||||
|
hbase(main):003:0> <userinput>alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</userinput>
|
||||||
|
</screen>
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="snappy.compression.installation">
|
||||||
|
<title>Install Snappy Support</title>
|
||||||
|
<para>HBase does not ship with Snappy support because of licensing issues. You can install
|
||||||
|
Snappy binaries (for instance, by using <command>yum install snappy</command> on CentOS)
|
||||||
|
or build Snappy from source. After installing Snappy, search for the shared library,
|
||||||
|
which will be called <filename>libsnappy.so.X</filename> where X is a number. If you
|
||||||
|
built from source, copy the shared library to a known location on your system, such as
|
||||||
|
<filename>/opt/snappy/lib/</filename>.</para>
|
||||||
|
<para>In addition to the Snappy library, HBase also needs access to the Hadoop shared
|
||||||
|
library, which will be called something like <filename>libhadoop.so.X.Y</filename>,
|
||||||
|
where X and Y are both numbers. Make note of the location of the Hadoop library, or copy
|
||||||
|
it to the same location as the Snappy library.</para>
|
||||||
|
<note>
|
||||||
|
<para>The Snappy and Hadoop libraries need to be available on each node of your cluster.
|
||||||
|
See <xref
|
||||||
|
linkend="compression.test" /> to find out how to test that this is the case.</para>
|
||||||
|
<para>See <xref
|
||||||
|
linkend="hbase.regionserver.codecs" /> to configure your RegionServers to fail to
|
||||||
|
start if a given compressor is not available.</para>
|
||||||
|
</note>
|
||||||
|
<para>Each of these library locations needs to be added to the environment variable
|
||||||
|
<envar>HBASE_LIBRARY_PATH</envar> for the operating system user that runs HBase. You
|
||||||
|
need to restart the RegionServer for the changes to take effect.</para>
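<para>A minimal sketch of such a setting (the Hadoop native-library path is an assumption for
  illustration; substitute the locations you noted above):</para>
<programlisting language="bourne"># in conf/hbase-env.sh -- make the Snappy and Hadoop shared libraries visible to HBase
export HBASE_LIBRARY_PATH=/opt/snappy/lib:/usr/local/hadoop/lib/native:$HBASE_LIBRARY_PATH</programlisting>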
|
||||||
|
</section>
|
||||||
|
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="compression.test">
|
||||||
|
<title>CompressionTest</title>
|
||||||
|
<para>You can use the CompressionTest tool to verify that your compressor is available to
|
||||||
|
HBase:</para>
|
||||||
|
<screen language="bourne">
|
||||||
|
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://<replaceable>host/path/to/hbase</replaceable> snappy
|
||||||
|
</screen>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="hbase.regionserver.codecs">
|
||||||
|
<title>Enforce Compression Settings On a RegionServer</title>
|
||||||
|
<para>You can configure a RegionServer so that it will fail to start if compression is
configured incorrectly, by adding the option <code>hbase.regionserver.codecs</code> to
<filename>hbase-site.xml</filename> and setting its value to a comma-separated list
|
||||||
|
of codecs that need to be available. For example, if you set this property to
|
||||||
|
<literal>lzo,gz</literal>, the RegionServer would fail to start if both compressors
|
||||||
|
were not available. This would prevent a new server from being added to the cluster
|
||||||
|
without having codecs configured properly.</para>
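<para>A sketch of the corresponding <filename>hbase-site.xml</filename> entry, using the
  <literal>lzo,gz</literal> example above:</para>
<programlisting language="xml"><![CDATA[
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,gz</value>
</property>
]]></programlisting>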
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="changing.compression">
|
||||||
|
<title>Enable Compression On a ColumnFamily</title>
|
||||||
|
<para>To enable compression for a ColumnFamily, use an <code>alter</code> command. You do
|
||||||
|
not need to re-create the table or copy data. If you are changing codecs, be sure the old
|
||||||
|
codec is still available until all the old StoreFiles have been compacted.</para>
|
||||||
|
<example>
|
||||||
|
<title>Enabling Compression on a ColumnFamily of an Existing Table using HBase
|
||||||
|
Shell</title>
|
||||||
|
<screen><![CDATA[
|
||||||
|
hbase> disable 'test'
|
||||||
|
hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
|
||||||
|
hbase> enable 'test']]>
|
||||||
|
</screen>
|
||||||
|
</example>
|
||||||
|
<example>
|
||||||
|
<title>Creating a New Table with Compression On a ColumnFamily</title>
|
||||||
|
<screen><![CDATA[
|
||||||
|
hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }
|
||||||
|
]]></screen>
|
||||||
|
</example>
|
||||||
|
<example>
|
||||||
|
<title>Verifying a ColumnFamily's Compression Settings</title>
|
||||||
|
<screen><![CDATA[
|
||||||
|
hbase> describe 'test'
|
||||||
|
DESCRIPTION ENABLED
|
||||||
|
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
|
||||||
|
', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
|
||||||
|
VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS
|
||||||
|
=> '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
|
||||||
|
lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
|
||||||
|
LOCKCACHE => 'true'}
|
||||||
|
1 row(s) in 0.1070 seconds
|
||||||
|
]]></screen>
|
||||||
|
</example>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Testing Compression Performance</title>
|
||||||
|
<para>HBase includes a tool called LoadTestTool which provides mechanisms to test your
|
||||||
|
compression performance. You must specify either <literal>-write</literal> or
|
||||||
|
<literal>-update-read</literal> as your first parameter, and if you do not specify another
|
||||||
|
parameter, usage advice is printed for each option.</para>
|
||||||
|
<example>
|
||||||
|
<title><command>LoadTestTool</command> Usage</title>
|
||||||
|
<screen language="bourne"><![CDATA[
|
||||||
|
$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h
|
||||||
|
usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
|
||||||
|
Options:
|
||||||
|
-batchupdate Whether to use batch as opposed to separate
|
||||||
|
updates for every column in a row
|
||||||
|
-bloom <arg> Bloom filter type, one of [NONE, ROW, ROWCOL]
|
||||||
|
-compression <arg> Compression type, one of [LZO, GZ, NONE, SNAPPY,
|
||||||
|
LZ4]
|
||||||
|
-data_block_encoding <arg> Encoding algorithm (e.g. prefix compression) to
|
||||||
|
use for data blocks in the test column family, one
|
||||||
|
of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE].
|
||||||
|
-encryption <arg> Enables transparent encryption on the test table,
|
||||||
|
one of [AES]
|
||||||
|
-generator <arg> The class which generates load for the tool. Any
|
||||||
|
args for this class can be passed as colon
|
||||||
|
separated after class name
|
||||||
|
-h,--help Show usage
|
||||||
|
-in_memory Tries to keep the HFiles of the CF inmemory as far
|
||||||
|
as possible. Not guaranteed that reads are always
|
||||||
|
served from inmemory
|
||||||
|
-init_only Initialize the test table only, don't do any
|
||||||
|
loading
|
||||||
|
-key_window <arg> The 'key window' to maintain between reads and
|
||||||
|
writes for concurrent write/read workload. The
|
||||||
|
default is 0.
|
||||||
|
-max_read_errors <arg> The maximum number of read errors to tolerate
|
||||||
|
before terminating all reader threads. The default
|
||||||
|
is 10.
|
||||||
|
-multiput Whether to use multi-puts as opposed to separate
|
||||||
|
puts for every column in a row
|
||||||
|
-num_keys <arg> The number of keys to read/write
|
||||||
|
-num_tables <arg> A positive integer number. When a number n is
|
||||||
|
speicfied, load test tool will load n table
|
||||||
|
parallely. -tn parameter value becomes table name
|
||||||
|
prefix. Each table name is in format
|
||||||
|
<tn>_1...<tn>_n
|
||||||
|
-read <arg> <verify_percent>[:<#threads=20>]
|
||||||
|
-regions_per_server <arg> A positive integer number. When a number n is
|
||||||
|
specified, load test tool will create the test
|
||||||
|
table with n regions per server
|
||||||
|
-skip_init Skip the initialization; assume test table already
|
||||||
|
exists
|
||||||
|
-start_key <arg> The first key to read/write (a 0-based index). The
|
||||||
|
default value is 0.
|
||||||
|
-tn <arg> The name of the table to read or write
|
||||||
|
-update <arg> <update_percent>[:<#threads=20>][:<#whether to
|
||||||
|
ignore nonce collisions=0>]
|
||||||
|
-write <arg> <avg_cols_per_key>:<avg_data_size>[:<#threads=20>]
|
||||||
|
-zk <arg> ZK quorum as comma-separated host names without
|
||||||
|
port numbers
|
||||||
|
-zk_root <arg> name of parent znode in zookeeper
|
||||||
|
]]></screen>
|
||||||
|
</example>
|
||||||
|
<example>
|
||||||
|
<title>Example Usage of LoadTestTool</title>
|
||||||
|
<screen language="bourne">
|
||||||
|
$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000
|
||||||
|
-read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
|
||||||
|
</screen>
|
||||||
|
</example>
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section xml:id="data.block.encoding.enable">
|
||||||
|
<title>Enable Data Block Encoding</title>
|
||||||
|
<para>Codecs are built into HBase so no extra configuration is needed. Codecs are enabled on a
|
||||||
|
table by setting the <code>DATA_BLOCK_ENCODING</code> property. Disable the table before
|
||||||
|
altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell:</para>
|
||||||
|
<example>
|
||||||
|
<title>Enable Data Block Encoding On a Table</title>
|
||||||
|
<screen><![CDATA[
|
||||||
|
hbase> disable 'test'
|
||||||
|
hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
|
||||||
|
Updating all regions with the new schema...
|
||||||
|
0/1 regions updated.
|
||||||
|
1/1 regions updated.
|
||||||
|
Done.
|
||||||
|
0 row(s) in 2.2820 seconds
|
||||||
|
hbase> enable 'test'
|
||||||
|
0 row(s) in 0.1580 seconds
|
||||||
|
]]></screen>
|
||||||
|
</example>
|
||||||
|
<example>
|
||||||
|
<title>Verifying a ColumnFamily's Data Block Encoding</title>
|
||||||
|
<screen><![CDATA[
|
||||||
|
hbase> describe 'test'
|
||||||
|
DESCRIPTION ENABLED
|
||||||
|
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true
|
||||||
|
_DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
|
||||||
|
'0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS
|
||||||
|
IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =
|
||||||
|
> 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
|
||||||
|
e', BLOCKCACHE => 'true'}
|
||||||
|
1 row(s) in 0.0650 seconds
|
||||||
|
]]></screen>
|
||||||
|
</example>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
|
||||||
|
</appendix>
|
|
@ -925,8 +925,8 @@ stopping hbase...............</screen>
|
||||||
<!--presumes the pre-site target has put the hbase-default.xml at this location-->
|
<!--presumes the pre-site target has put the hbase-default.xml at this location-->
|
||||||
<xi:include
|
<xi:include
|
||||||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
href="../../../target/docbkx/hbase-default.xml">
|
href="hbase-default.xml">
|
||||||
<xi:fallback>
|
<!--<xi:fallback>
|
||||||
<section
|
<section
|
||||||
xml:id="hbase_default_configurations">
|
xml:id="hbase_default_configurations">
|
||||||
<title />
|
<title />
|
||||||
|
@ -1007,7 +1007,7 @@ stopping hbase...............</screen>
|
||||||
</section>
|
</section>
|
||||||
</section>
|
</section>
|
||||||
</section>
|
</section>
|
||||||
</xi:fallback>
|
</xi:fallback>-->
|
||||||
</xi:include>
|
</xi:include>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
|
|
|
@ -0,0 +1,129 @@
|
||||||
|
<?xml version="1.0"?>
|
||||||
|
<xsl:stylesheet
|
||||||
|
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
|
||||||
|
version="1.0">
|
||||||
|
<!--
|
||||||
|
/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<xsl:import href="urn:docbkx:stylesheet/docbook.xsl"/>
|
||||||
|
<xsl:import href="urn:docbkx:stylesheet/highlight.xsl"/>
|
||||||
|
|
||||||
|
|
||||||
|
<!--###################################################
|
||||||
|
Paper & Page Size
|
||||||
|
################################################### -->
|
||||||
|
|
||||||
|
<!-- Paper type, no headers on blank pages, no double sided printing -->
|
||||||
|
<xsl:param name="paper.type" select="'USletter'"/>
|
||||||
|
<xsl:param name="double.sided">0</xsl:param>
|
||||||
|
<xsl:param name="headers.on.blank.pages">0</xsl:param>
|
||||||
|
<xsl:param name="footers.on.blank.pages">0</xsl:param>
|
||||||
|
|
||||||
|
<!-- Space between paper border and content (chaotic stuff, don't touch) -->
|
||||||
|
<xsl:param name="page.margin.top">5mm</xsl:param>
|
||||||
|
<xsl:param name="region.before.extent">10mm</xsl:param>
|
||||||
|
<xsl:param name="body.margin.top">10mm</xsl:param>
|
||||||
|
|
||||||
|
<xsl:param name="body.margin.bottom">15mm</xsl:param>
|
||||||
|
<xsl:param name="region.after.extent">10mm</xsl:param>
|
||||||
|
<xsl:param name="page.margin.bottom">0mm</xsl:param>
|
||||||
|
|
||||||
|
<xsl:param name="page.margin.outer">18mm</xsl:param>
|
||||||
|
<xsl:param name="page.margin.inner">18mm</xsl:param>
|
||||||
|
|
||||||
|
<!-- No indentation of Titles -->
|
||||||
|
<xsl:param name="title.margin.left">0pc</xsl:param>
|
||||||
|
|
||||||
|
<!--###################################################
|
||||||
|
Fonts & Styles
|
||||||
|
################################################### -->
|
||||||
|
|
||||||
|
<!-- Justified text with hyphenation -->
|
||||||
|
<xsl:param name="alignment">justify</xsl:param>
|
||||||
|
<xsl:param name="hyphenate">true</xsl:param>
|
||||||
|
|
||||||
|
<!-- Default Font size -->
|
||||||
|
<xsl:param name="body.font.master">11</xsl:param>
|
||||||
|
<xsl:param name="body.font.small">8</xsl:param>
|
||||||
|
|
||||||
|
<!-- Line height in body text -->
|
||||||
|
<xsl:param name="line-height">1.4</xsl:param>
|
||||||
|
|
||||||
|
<!-- Force line break in long URLs -->
|
||||||
|
<xsl:param name="ulink.hyphenate.chars">/&?</xsl:param>
|
||||||
|
<xsl:param name="ulink.hyphenate">​</xsl:param>
|
||||||
|
|
||||||
|
<!-- Monospaced fonts are smaller than regular text -->
|
||||||
|
<xsl:attribute-set name="monospace.properties">
|
||||||
|
<xsl:attribute name="font-family">
|
||||||
|
<xsl:value-of select="$monospace.font.family"/>
|
||||||
|
</xsl:attribute>
|
||||||
|
<xsl:attribute name="font-size">0.8em</xsl:attribute>
|
||||||
|
<xsl:attribute name="wrap-option">wrap</xsl:attribute>
|
||||||
|
<xsl:attribute name="hyphenate">true</xsl:attribute>
|
||||||
|
</xsl:attribute-set>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- add page break after abstract block -->
|
||||||
|
<xsl:attribute-set name="abstract.properties">
|
||||||
|
<xsl:attribute name="break-after">page</xsl:attribute>
|
||||||
|
</xsl:attribute-set>
|
||||||
|
|
||||||
|
<!-- add page break after toc -->
|
||||||
|
<xsl:attribute-set name="toc.margin.properties">
|
||||||
|
<xsl:attribute name="break-after">page</xsl:attribute>
|
||||||
|
</xsl:attribute-set>
|
||||||
|
|
||||||
|
<!-- add page break after first level sections -->
|
||||||
|
<xsl:attribute-set name="section.level1.properties">
|
||||||
|
<xsl:attribute name="break-after">page</xsl:attribute>
|
||||||
|
</xsl:attribute-set>
|
||||||
|
|
||||||
|
<!-- Show only Sections up to level 3 in the TOCs -->
|
||||||
|
<xsl:param name="toc.section.depth">2</xsl:param>
|
||||||
|
|
||||||
|
<!-- Dot and Whitespace as separator in TOC between Label and Title-->
|
||||||
|
<xsl:param name="autotoc.label.separator" select="'. '"/>
|
||||||
|
|
||||||
|
<!-- program listings / examples formatting -->
|
||||||
|
<xsl:attribute-set name="monospace.verbatim.properties">
|
||||||
|
<xsl:attribute name="font-family">Courier</xsl:attribute>
|
||||||
|
<xsl:attribute name="font-size">8pt</xsl:attribute>
|
||||||
|
<xsl:attribute name="keep-together.within-column">always</xsl:attribute>
|
||||||
|
</xsl:attribute-set>
|
||||||
|
|
||||||
|
<xsl:param name="shade.verbatim" select="1" />
|
||||||
|
|
||||||
|
<xsl:attribute-set name="shade.verbatim.style">
|
||||||
|
<xsl:attribute name="background-color">#E8E8E8</xsl:attribute>
|
||||||
|
<xsl:attribute name="border-width">0.5pt</xsl:attribute>
|
||||||
|
<xsl:attribute name="border-style">solid</xsl:attribute>
|
||||||
|
<xsl:attribute name="border-color">#575757</xsl:attribute>
|
||||||
|
<xsl:attribute name="padding">3pt</xsl:attribute>
|
||||||
|
</xsl:attribute-set>
|
||||||
|
|
||||||
|
<!-- callouts customization -->
|
||||||
|
<xsl:param name="callout.unicode" select="1" />
|
||||||
|
<xsl:param name="callout.graphics" select="0" />
|
||||||
|
<xsl:param name="callout.defaultcolumn">90</xsl:param>
|
||||||
|
|
||||||
|
<!-- Syntax Highlighting -->
|
||||||
|
|
||||||
|
|
||||||
|
</xsl:stylesheet>
|
|
@ -0,0 +1,865 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<chapter
|
||||||
|
xml:id="datamodel"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
|
||||||
|
<title>Data Model</title>
|
||||||
|
<para>In HBase, data is stored in tables, which have rows and columns. This is a terminology
|
||||||
|
overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can
|
||||||
|
be helpful to think of an HBase table as a multi-dimensional map.</para>
|
||||||
|
<variablelist>
|
||||||
|
<title>HBase Data Model Terminology</title>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Table</term>
|
||||||
|
<listitem>
|
||||||
|
<para>An HBase table consists of multiple rows.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Row</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A row in HBase consists of a row key and one or more columns with values associated
|
||||||
|
with them. Rows are sorted alphabetically by the row key as they are stored. For this
|
||||||
|
reason, the design of the row key is very important. The goal is to store data in such a
|
||||||
|
way that related rows are near each other. A common row key pattern is a website domain.
|
||||||
|
If your row keys are domains, you should probably store them in reverse (org.apache.www,
|
||||||
|
org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each
|
||||||
|
other in the table, rather than being spread out based on the first letter of the
|
||||||
|
subdomain.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Column</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A column in HBase consists of a column family and a column qualifier, which are
|
||||||
|
delimited by a <literal>:</literal> (colon) character.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Column Family</term>
|
||||||
|
<listitem>
|
||||||
|
<para>Column families physically colocate a set of columns and their values, often for
|
||||||
|
performance reasons. Each column family has a set of storage properties, such as whether
|
||||||
|
its values should be cached in memory, how its data is compressed or its row keys are
|
||||||
|
encoded, and others. Each row in a table has the same column
|
||||||
|
families, though a given row might not store anything in a given column family.</para>
|
||||||
|
<para>Column families are specified when you create your table, and influence the way your
|
||||||
|
data is stored in the underlying filesystem. Therefore, the column families should be
|
||||||
|
considered carefully during schema design.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Column Qualifier</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A column qualifier is added to a column family to provide the index for a given
|
||||||
|
piece of data. Given a column family <literal>content</literal>, a column qualifier
|
||||||
|
might be <literal>content:html</literal>, and another might be
|
||||||
|
<literal>content:pdf</literal>. Though column families are fixed at table creation,
|
||||||
|
column qualifiers are mutable and may differ greatly between rows.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Cell</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A cell is a combination of row, column family, and column qualifier, and contains a
|
||||||
|
value and a timestamp, which represents the value's version.</para>
|
||||||
|
<para>A cell's value is an uninterpreted array of bytes.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Timestamp</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A timestamp is written alongside each value, and is the identifier for a given
|
||||||
|
version of a value. By default, the timestamp represents the time on the RegionServer
|
||||||
|
when the data was written, but you can specify a different timestamp value when you put
|
||||||
|
data into the cell.</para>
|
||||||
|
<caution>
|
||||||
|
<para>Direct manipulation of timestamps is an advanced feature which is only exposed for
|
||||||
|
special cases that are deeply integrated with HBase, and is discouraged in general.
|
||||||
|
Encoding a timestamp at the application level is the preferred pattern.</para>
|
||||||
|
</caution>
|
||||||
|
<para>You can specify the maximum number of versions of a value that HBase retains, per column
|
||||||
|
family. When the maximum number of versions is reached, the oldest versions are
|
||||||
|
eventually deleted. By default, only the newest version is kept. A brief shell sketch follows this list.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
</variablelist>
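<para>As a brief sketch of how versions and timestamps surface in practice (the table and
  column names here are hypothetical), the HBase Shell can set the number of retained versions
  on a column family and read back several versions of a cell:</para>
<screen><![CDATA[
hbase> alter 'my_table', NAME => 'cf', VERSIONS => 5
hbase> put 'my_table', 'row1', 'cf:q', 'value'
hbase> get 'my_table', 'row1', {COLUMN => 'cf:q', VERSIONS => 3}
]]></screen>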
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="conceptual.view">
|
||||||
|
<title>Conceptual View</title>
|
||||||
|
<para>You can read a very understandable explanation of the HBase data model in the blog post <link
|
||||||
|
xlink:href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding
|
||||||
|
HBase and BigTable</link> by Jim R. Wilson. Another good explanation is available in the
|
||||||
|
PDF <link
|
||||||
|
xlink:href="http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf">Introduction
|
||||||
|
to Basic Schema Design</link> by Amandeep Khurana. It may help to read different
|
||||||
|
perspectives to get a solid understanding of HBase schema design. The linked articles cover
|
||||||
|
the same ground as the information in this section.</para>
|
||||||
|
<para> The following example is a slightly modified form of the one on page 2 of the <link
|
||||||
|
xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
|
||||||
|
is a table called <varname>webtable</varname> that contains two rows
|
||||||
|
(<literal>com.cnn.www</literal>
|
||||||
|
and <literal>com.example.www</literal>) and three column families named
|
||||||
|
<varname>contents</varname>, <varname>anchor</varname>, and <varname>people</varname>. In
|
||||||
|
this example, for the first row (<literal>com.cnn.www</literal>),
|
||||||
|
<varname>anchor</varname> contains two columns (<varname>anchor:cnnsi.com</varname>,
|
||||||
|
<varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
|
||||||
|
(<varname>contents:html</varname>). This example contains 5 versions of the row with the
|
||||||
|
row key <literal>com.cnn.www</literal>, and one version of the row with the row key
|
||||||
|
<literal>com.example.www</literal>. The <varname>contents:html</varname> column qualifier contains the entire
|
||||||
|
HTML of a given website. Qualifiers of the <varname>anchor</varname> column family each
|
||||||
|
contain the external site which links to the site represented by the row, along with the
|
||||||
|
text it used in the anchor of its link. The <varname>people</varname> column family represents
|
||||||
|
people associated with the site.
|
||||||
|
</para>
|
||||||
|
<note>
|
||||||
|
<title>Column Names</title>
|
||||||
|
<para> By convention, a column name is made of its column family prefix and a
|
||||||
|
<emphasis>qualifier</emphasis>. For example, the column
|
||||||
|
<emphasis>contents:html</emphasis> is made up of the column family
|
||||||
|
<varname>contents</varname> and the <varname>html</varname> qualifier. The colon
|
||||||
|
character (<literal>:</literal>) delimits the column family from the column family
|
||||||
|
<emphasis>qualifier</emphasis>. </para>
|
||||||
|
</note>
|
||||||
|
<table
|
||||||
|
frame="all">
|
||||||
|
<title>Table <varname>webtable</varname></title>
|
||||||
|
<tgroup
|
||||||
|
cols="5"
|
||||||
|
align="left"
|
||||||
|
colsep="1"
|
||||||
|
rowsep="1">
|
||||||
|
<colspec
|
||||||
|
colname="c1" />
|
||||||
|
<colspec
|
||||||
|
colname="c2" />
|
||||||
|
<colspec
|
||||||
|
colname="c3" />
|
||||||
|
<colspec
|
||||||
|
colname="c4" />
|
||||||
|
<colspec
|
||||||
|
colname="c5" />
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>Row Key</entry>
|
||||||
|
<entry>Time Stamp</entry>
|
||||||
|
<entry>ColumnFamily <varname>contents</varname></entry>
|
||||||
|
<entry>ColumnFamily <varname>anchor</varname></entry>
|
||||||
|
<entry>ColumnFamily <varname>people</varname></entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t9</entry>
|
||||||
|
<entry />
|
||||||
|
<entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
|
||||||
|
<entry />
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t8</entry>
|
||||||
|
<entry />
|
||||||
|
<entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
|
||||||
|
<entry />
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t6</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
<entry />
|
||||||
|
<entry />
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t5</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
<entry />
|
||||||
|
<entry />
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t3</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
<entry />
|
||||||
|
<entry />
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.example.www"</entry>
|
||||||
|
<entry>t5</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
<entry></entry>
|
||||||
|
<entry>people:author = "John Doe"</entry>
|
||||||
|
</row>
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</table>
|
||||||
|
<para>Cells in this table that appear to be empty do not take space, or in fact exist, in
|
||||||
|
HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
|
||||||
|
look at data in HBase, or even the most accurate. The following represents the same
|
||||||
|
information as a multi-dimensional map. This is only a mock-up for illustrative
|
||||||
|
purposes and may not be strictly accurate.</para>
|
||||||
|
<programlisting><![CDATA[
|
||||||
|
{
|
||||||
|
"com.cnn.www": {
|
||||||
|
contents: {
|
||||||
|
t6: contents:html: "<html>..."
|
||||||
|
t5: contents:html: "<html>..."
|
||||||
|
t3: contents:html: "<html>..."
|
||||||
|
}
|
||||||
|
anchor: {
|
||||||
|
t9: anchor:cnnsi.com = "CNN"
|
||||||
|
t8: anchor:my.look.ca = "CNN.com"
|
||||||
|
}
|
||||||
|
people: {}
|
||||||
|
}
|
||||||
|
"com.example.www": {
|
||||||
|
contents: {
|
||||||
|
t5: contents:html: "<html>..."
|
||||||
|
}
|
||||||
|
anchor: {}
|
||||||
|
people: {
|
||||||
|
t5: people:author: "John Doe"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]]></programlisting>
|
||||||
|
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="physical.view">
|
||||||
|
<title>Physical View</title>
|
||||||
|
<para> Although at a conceptual level tables may be viewed as a sparse set of rows, they are
|
||||||
|
physically stored by column family. A new column qualifier (column_family:column_qualifier)
|
||||||
|
can be added to an existing column family at any time.</para>
|
||||||
|
<table
|
||||||
|
frame="all">
|
||||||
|
<title>ColumnFamily <varname>anchor</varname></title>
|
||||||
|
<tgroup
|
||||||
|
cols="3"
|
||||||
|
align="left"
|
||||||
|
colsep="1"
|
||||||
|
rowsep="1">
|
||||||
|
<colspec
|
||||||
|
colname="c1" />
|
||||||
|
<colspec
|
||||||
|
colname="c2" />
|
||||||
|
<colspec
|
||||||
|
colname="c3" />
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>Row Key</entry>
|
||||||
|
<entry>Time Stamp</entry>
|
||||||
|
<entry>Column Family <varname>anchor</varname></entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t9</entry>
|
||||||
|
<entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t8</entry>
|
||||||
|
<entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
|
||||||
|
</row>
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</table>
|
||||||
|
<table
|
||||||
|
frame="all">
|
||||||
|
<title>ColumnFamily <varname>contents</varname></title>
|
||||||
|
<tgroup
|
||||||
|
cols="3"
|
||||||
|
align="left"
|
||||||
|
colsep="1"
|
||||||
|
rowsep="1">
|
||||||
|
<colspec
|
||||||
|
colname="c1" />
|
||||||
|
<colspec
|
||||||
|
colname="c2" />
|
||||||
|
<colspec
|
||||||
|
colname="c3" />
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>Row Key</entry>
|
||||||
|
<entry>Time Stamp</entry>
|
||||||
|
<entry>ColumnFamily "contents:"</entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t6</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t5</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.cnn.www"</entry>
|
||||||
|
<entry>t3</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
</row>
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</table>
|
||||||
|
<para>The empty cells shown in the
|
||||||
|
conceptual view are not stored at all.
|
||||||
|
Thus a request for the value of the <varname>contents:html</varname> column at time stamp
|
||||||
|
<literal>t8</literal> would return no value. Similarly, a request for an
|
||||||
|
<varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would
|
||||||
|
return no value. However, if no timestamp is supplied, the most recent value for a
|
||||||
|
particular column would be returned. Given multiple versions, the most recent is also the
|
||||||
|
first one found, since timestamps
|
||||||
|
are stored in descending order. Thus a request for the values of all columns in the row
|
||||||
|
<varname>com.cnn.www</varname> if no timestamp is specified would be: the value of
|
||||||
|
<varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
|
||||||
|
<varname>anchor:cnnsi.com</varname> from timestamp <literal>t9</literal>, the value of
|
||||||
|
<varname>anchor:my.look.ca</varname> from timestamp <literal>t8</literal>. </para>
|
||||||
|
<para>For more information about the internals of how Apache HBase stores data, see <xref
|
||||||
|
linkend="regions.arch" />. </para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="namespace">
|
||||||
|
<title>Namespace</title>
|
||||||
|
<para> A namespace is a logical grouping of tables analogous to a database in relational
      database systems. This abstraction lays the groundwork for upcoming multi-tenancy related
|
||||||
|
features: <itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>Quota Management (HBASE-8410) - Restrict the amount of resources (i.e., regions
          and tables) a namespace can consume.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Namespace Security Administration (HBASE-9206) - provide another level of security
|
||||||
|
administration for tenants.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset
          of regionservers, thus guaranteeing a coarse level of isolation.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</para>
|
||||||
|
<section
|
||||||
|
xml:id="namespace_creation">
|
||||||
|
<title>Namespace management</title>
|
||||||
|
<para> A namespace can be created, removed or altered. Namespace membership is determined
|
||||||
|
during table creation by specifying a fully-qualified table name of the form:</para>
|
||||||
|
|
||||||
|
<programlisting language="xml"><![CDATA[<table namespace>:<table qualifier>]]></programlisting>
|
||||||
|
|
||||||
|
|
||||||
|
      <example>
        <title>Examples</title>
        <programlisting language="bourne">
#Create a namespace
create_namespace 'my_ns'
        </programlisting>
        <programlisting language="bourne">
#create my_table in my_ns namespace
create 'my_ns:my_table', 'fam'
        </programlisting>
        <programlisting language="bourne">
#drop namespace
drop_namespace 'my_ns'
        </programlisting>
        <programlisting language="bourne">
#alter namespace
alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
        </programlisting>
      </example>
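      <para>The same operations are also available from the Java client. Below is a minimal
        sketch (not the only way to do this) using the <code>Admin</code> API; it assumes an
        already-open <code>Connection</code> named <varname>connection</varname>, and the
        namespace, table, and family names are illustrative only.</para>
      <programlisting language="java">
// Assumes: Connection connection = ConnectionFactory.createConnection(conf);
Admin admin = connection.getAdmin();
try {
  // Create the namespace, then a table that is a member of it.
  admin.createNamespace(NamespaceDescriptor.create("my_ns").build());
  HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_ns:my_table"));
  desc.addFamily(new HColumnDescriptor("fam"));
  admin.createTable(desc);
} finally {
  admin.close();
}
</programlisting>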
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="namespace_special">
|
||||||
|
<title>Predefined namespaces</title>
|
||||||
|
<para> There are two predefined special namespaces: </para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>hbase - system namespace, used to contain hbase internal tables</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>default - tables with no explicitly specified namespace will automatically fall into
|
||||||
|
this namespace.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
      <example>
        <title>Examples</title>
        <programlisting language="bourne">
#namespace=foo and table qualifier=bar
create 'foo:bar', 'fam'

#namespace=default and table qualifier=bar
create 'bar', 'fam'
        </programlisting>
      </example>
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="table">
|
||||||
|
<title>Table</title>
|
||||||
|
<para> Tables are declared up front at schema definition time. </para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="row">
|
||||||
|
<title>Row</title>
|
||||||
|
<para>Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest
      order appearing first in a table. The empty byte array is used to denote both the start and
      end of a table's namespace.</para>
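    <para>Because the sort is a plain byte-by-byte lexicographic comparison, numeric components
      of string row keys should be fixed-width (zero-padded) if numeric ordering is desired. A
      small illustrative sketch; the key values are examples only:</para>
    <programlisting language="java">
// "row-10" sorts before "row-2", because the byte '1' sorts before the byte '2'.
byte[] unpadded = Bytes.toBytes("row-10");
// Zero-padding the numeric part makes lexicographic order match numeric order.
byte[] padded = Bytes.toBytes("row-0010");
</programlisting>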
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="columnfamily">
|
||||||
|
<title>Column Family<indexterm><primary>Column Family</primary></indexterm></title>
|
||||||
|
<para> Columns in Apache HBase are grouped into <emphasis>column families</emphasis>. All
|
||||||
|
column members of a column family have the same prefix. For example, the columns
|
||||||
|
<emphasis>courses:history</emphasis> and <emphasis>courses:math</emphasis> are both
|
||||||
|
members of the <emphasis>courses</emphasis> column family. The colon character
|
||||||
|
(<literal>:</literal>) delimits the column family from the <indexterm><primary>column
|
||||||
|
family qualifier</primary><secondary>Column Family Qualifier</secondary></indexterm>.
|
||||||
|
The column family prefix must be composed of <emphasis>printable</emphasis> characters. The
|
||||||
|
qualifying tail, the column family <emphasis>qualifier</emphasis>, can be made of any
|
||||||
|
arbitrary bytes. Column families must be declared up front at schema definition time whereas
|
||||||
|
columns do not need to be defined at schema time but can be conjured on the fly while the
|
||||||
|
table is up and running.</para>
|
||||||
|
<para>Physically, all column family members are stored together on the filesystem. Because
|
||||||
|
tunings and storage specifications are done at the column family level, it is advised that
|
||||||
|
all column family members have the same general access pattern and size
|
||||||
|
characteristics.</para>
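    <para>As noted above, only the family must exist at schema time; a new qualifier is created
      simply by writing to it. A minimal sketch, assuming an existing <code>Table</code>
      instance for a table that already declares the <varname>courses</varname> family (row key,
      qualifier, and value are illustrative):</para>
    <programlisting language="java">
// No schema change is needed for a brand-new qualifier within an existing family.
Put put = new Put(Bytes.toBytes("student-42"));
put.add(Bytes.toBytes("courses"), Bytes.toBytes("geography"), Bytes.toBytes("A"));
table.put(put);
</programlisting>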
|
||||||
|
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="cells">
|
||||||
|
<title>Cells<indexterm><primary>Cells</primary></indexterm></title>
|
||||||
|
<para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
|
||||||
|
<literal>cell</literal> in HBase. Cell content is uninterpreted bytes.</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="data_model_operations">
|
||||||
|
<title>Data Model Operations</title>
|
||||||
|
<para>The four primary data model operations are Get, Put, Scan, and Delete. Operations are
|
||||||
|
applied via <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link>
|
||||||
|
instances.
|
||||||
|
</para>
|
||||||
|
<section
|
||||||
|
xml:id="get">
|
||||||
|
<title>Get</title>
|
||||||
|
<para><link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
|
||||||
|
returns attributes for a specified row. Gets are executed via <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#get(org.apache.hadoop.hbase.client.Get)">
|
||||||
|
Table.get</link>. </para>
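      <para>A minimal sketch of a single-row read, in the same style as the other examples in
        this chapter; it assumes an existing <code>Table</code> instance, and the row key, family,
        and qualifier names are illustrative:</para>
      <programlisting language="java">
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // optionally narrow to one column
Result r = table.get(get);
byte[] value = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
</programlisting>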
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="put">
|
||||||
|
<title>Put</title>
|
||||||
|
<para><link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link>
|
||||||
|
either adds new rows to a table (if the key is new) or can update existing rows (if the
|
||||||
|
key already exists). Puts are executed via <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#put(org.apache.hadoop.hbase.client.Put)">
|
||||||
|
Table.put</link> (writeBuffer) or <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch(java.util.List, java.lang.Object[])">
|
||||||
|
Table.batch</link> (non-writeBuffer). </para>
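      <para>A minimal sketch of the batched form mentioned above, assuming an existing
        <code>Table</code> instance (row keys, family, qualifier, and values are
        illustrative):</para>
      <programlisting language="java"><![CDATA[
List<Put> puts = new ArrayList<Put>();
for (int i = 0; i < 3; i++) {
  Put put = new Put(Bytes.toBytes("row" + i));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value" + i));
  puts.add(put);
}
Object[] results = new Object[puts.size()];
table.batch(puts, results);  // the client groups the operations to reduce round trips
]]></programlisting>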
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="scan">
|
||||||
|
<title>Scans</title>
|
||||||
|
<para><link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
|
||||||
|
allows iteration over multiple rows for specified attributes. </para>
|
||||||
|
<para>The following is an example of a Scan on a Table instance. Assume that a table is
|
||||||
|
populated with rows with keys "row1", "row2", "row3", and then another set of rows with
|
||||||
|
the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan
|
||||||
|
instance to return the rows beginning with "row".</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static final byte[] CF = "cf".getBytes();
|
||||||
|
public static final byte[] ATTR = "attr".getBytes();
|
||||||
|
...
|
||||||
|
|
||||||
|
Table table = ... // instantiate a Table instance
|
||||||
|
|
||||||
|
Scan scan = new Scan();
|
||||||
|
scan.addColumn(CF, ATTR);
|
||||||
|
scan.setRowPrefixFilter(Bytes.toBytes("row"));
|
||||||
|
ResultScanner rs = table.getScanner(scan);
|
||||||
|
try {
|
||||||
|
for (Result r = rs.next(); r != null; r = rs.next()) {
|
||||||
|
// process result...
|
||||||
|
} finally {
|
||||||
|
rs.close(); // always close the ResultScanner!
|
||||||
|
</programlisting>
|
||||||
|
<para>Note that generally the easiest way to specify a specific stop point for a scan is by
|
||||||
|
using the <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/InclusiveStopFilter.html">InclusiveStopFilter</link>
|
||||||
|
class. </para>
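      <para>A minimal sketch of setting an explicit, inclusive stop point with that filter; it
        assumes an existing <code>Table</code> instance, and the row keys are
        illustrative:</para>
      <programlisting language="java">
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("row1"));
// InclusiveStopFilter includes the stop row itself in the results,
// whereas the stop row given to Scan.setStopRow() is exclusive.
scan.setFilter(new InclusiveStopFilter(Bytes.toBytes("row3")));
ResultScanner rs = table.getScanner(scan);
</programlisting>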
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="delete">
|
||||||
|
<title>Delete</title>
|
||||||
|
<para><link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html">Delete</link>
|
||||||
|
removes a row from a table. Deletes are executed via <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete(org.apache.hadoop.hbase.client.Delete)">
|
||||||
|
Table.delete</link>. </para>
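      <para>A minimal sketch of a whole-row delete, assuming an existing <code>Table</code>
        instance (the row key is illustrative); see below for how this is recorded
        internally:</para>
      <programlisting language="java">
// With no further qualification, the entire row is marked deleted.
Delete delete = new Delete(Bytes.toBytes("row1"));
table.delete(delete);
</programlisting>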
|
||||||
|
<para>HBase does not modify data in place, and so deletes are handled by creating new
|
||||||
|
markers called <emphasis>tombstones</emphasis>. These tombstones, along with the dead
|
||||||
|
values, are cleaned up on major compactions. </para>
|
||||||
|
<para>See <xref
|
||||||
|
linkend="version.delete" /> for more information on deleting versions of columns, and
|
||||||
|
see <xref
|
||||||
|
linkend="compaction" /> for more information on compactions. </para>
|
||||||
|
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</section>
|
||||||
|
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="versions">
|
||||||
|
<title>Versions<indexterm><primary>Versions</primary></indexterm></title>
|
||||||
|
|
||||||
|
<para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
|
||||||
|
<literal>cell</literal> in HBase. It's possible to have an unbounded number of cells where
|
||||||
|
the row and column are the same but the cell address differs only in its version
|
||||||
|
dimension.</para>
|
||||||
|
|
||||||
|
<para>While rows and column keys are expressed as bytes, the version is specified using a long
|
||||||
|
integer. Typically this long contains time instances such as those returned by
|
||||||
|
<code>java.util.Date.getTime()</code> or <code>System.currentTimeMillis()</code>, that is:
|
||||||
|
<quote>the difference, measured in milliseconds, between the current time and midnight,
|
||||||
|
January 1, 1970 UTC</quote>.</para>
|
||||||
|
|
||||||
|
<para>The HBase version dimension is stored in decreasing order, so that when reading from a
|
||||||
|
store file, the most recent values are found first.</para>
|
||||||
|
|
||||||
|
<para>There is a lot of confusion over the semantics of <literal>cell</literal> versions, in
|
||||||
|
HBase. In particular:</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>If multiple writes to a cell have the same version, only the last written is
|
||||||
|
fetchable.</para>
|
||||||
|
</listitem>
|
||||||
|
|
||||||
|
<listitem>
|
||||||
|
<para>It is OK to write cells in a non-increasing version order.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
|
||||||
|
<para>Below we describe how the version dimension in HBase currently works. See <link
|
||||||
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-2406">HBASE-2406</link> for
|
||||||
|
discussion of HBase versions. <link
|
||||||
|
xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in HBase</link>
|
||||||
|
makes for a good read on the version, or time, dimension in HBase. It has more detail on
|
||||||
|
versioning than is provided here. As of this writing, the limitation
|
||||||
|
<emphasis>Overwriting values at existing timestamps</emphasis> mentioned in the
|
||||||
|
article no longer holds in HBase. This section is basically a synopsis of this article
|
||||||
|
by Bruno Dumon.</para>
|
||||||
|
|
||||||
|
<section xml:id="specify.number.of.versions">
|
||||||
|
<title>Specifying the Number of Versions to Store</title>
|
||||||
|
<para>The maximum number of versions to store for a given column is part of the column
|
||||||
|
schema and is specified at table creation, or via an <command>alter</command> command, via
|
||||||
|
<code>HColumnDescriptor.DEFAULT_VERSIONS</code>. Prior to HBase 0.96, the default number
|
||||||
|
of versions kept was <literal>3</literal>, but in 0.96 and newer it has been changed to
|
||||||
|
<literal>1</literal>.</para>
|
||||||
|
<example>
|
||||||
|
<title>Modify the Maximum Number of Versions for a Column</title>
|
||||||
|
<para>This example uses HBase Shell to keep a maximum of 5 versions of column
|
||||||
|
<code>f1</code>. You could also use <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
|
||||||
|
>HColumnDescriptor</link>.</para>
|
||||||
|
<screen><![CDATA[hbase> alter 't1', NAME => 'f1', VERSIONS => 5]]></screen>
|
||||||
|
</example>
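      <para>The equivalent setting at table-creation time from the Java client, as a minimal
        sketch; it assumes an <code>Admin</code> instance, and the table and family names are
        illustrative:</para>
      <programlisting language="java">
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
HColumnDescriptor family = new HColumnDescriptor("f1");
family.setMaxVersions(5);  // keep up to 5 versions per cell in this family
desc.addFamily(family);
admin.createTable(desc);
</programlisting>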
|
||||||
|
<example>
|
||||||
|
<title>Modify the Minimum Number of Versions for a Column</title>
|
||||||
|
<para>You can also specify the minimum number of versions to store. By default, this is
|
||||||
|
set to 0, which means the feature is disabled. The following example sets the minimum
|
||||||
|
number of versions on field <code>f1</code> to <literal>2</literal>, via HBase Shell.
|
||||||
|
You could also use <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
|
||||||
|
>HColumnDescriptor</link>.</para>
|
||||||
|
<screen><![CDATA[hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2]]></screen>
|
||||||
|
</example>
|
||||||
|
<para>Starting with HBase 0.98.2, you can specify a global default for the maximum number of
|
||||||
|
versions kept for all newly-created columns, by setting
|
||||||
|
<option>hbase.column.max.version</option> in <filename>hbase-site.xml</filename>. See
|
||||||
|
<xref linkend="hbase.column.max.version"/>.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="versions.ops">
|
||||||
|
<title>Versions and HBase Operations</title>
|
||||||
|
|
||||||
|
<para>In this section we look at the behavior of the version dimension for each of the core
|
||||||
|
HBase operations.</para>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Get/Scan</title>
|
||||||
|
|
||||||
|
<para>Gets are implemented on top of Scans. The below discussion of <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
|
||||||
|
applies equally to <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scans</link>.</para>
|
||||||
|
|
||||||
|
<para>By default, i.e. if you specify no explicit version, when doing a
|
||||||
|
<literal>get</literal>, the cell whose version has the largest value is returned
|
||||||
|
(which may or may not be the latest one written, see later). The default behavior can be
|
||||||
|
modified in the following ways:</para>
|
||||||
|
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>to return more than one version, see <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions()">Get.setMaxVersions()</link></para>
|
||||||
|
</listitem>
|
||||||
|
|
||||||
|
<listitem>
|
||||||
|
<para>to return versions other than the latest, see <link
|
||||||
|
xlink:href="???">Get.setTimeRange()</link></para>
|
||||||
|
|
||||||
|
<para>To retrieve the latest version that is less than or equal to a given value, thus
              giving the 'latest' state of the record at a certain point in time, just use a range
              from 0 to the desired version and set the max versions to 1; see the sketch after
              this list.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
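        <para>A minimal sketch of the <quote>as of</quote> read described in the last item of the
          list above; it assumes an existing <code>Table</code> instance, and the timestamp value
          is illustrative:</para>
        <programlisting language="java">
long asOf = 1400000000000L;       // read the state as of this point in time
Get get = new Get(Bytes.toBytes("row1"));
get.setTimeRange(0, asOf + 1);    // the upper bound of the range is exclusive
get.setMaxVersions(1);            // only the newest version inside the range
Result r = table.get(get);
</programlisting>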
|
||||||
|
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="default_get_example">
|
||||||
|
<title>Default Get Example</title>
|
||||||
|
<para>The following Get will only retrieve the current version of the row</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static final byte[] CF = "cf".getBytes();
|
||||||
|
public static final byte[] ATTR = "attr".getBytes();
|
||||||
|
...
|
||||||
|
Get get = new Get(Bytes.toBytes("row1"));
|
||||||
|
Result r = table.get(get);
|
||||||
|
byte[] b = r.getValue(CF, ATTR); // returns current version of value
|
||||||
|
</programlisting>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="versioned_get_example">
|
||||||
|
<title>Versioned Get Example</title>
|
||||||
|
<para>The following Get will return the last 3 versions of the row.</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static final byte[] CF = "cf".getBytes();
|
||||||
|
public static final byte[] ATTR = "attr".getBytes();
|
||||||
|
...
|
||||||
|
Get get = new Get(Bytes.toBytes("row1"));
|
||||||
|
get.setMaxVersions(3); // will return last 3 versions of row
|
||||||
|
Result r = table.get(get);
|
||||||
|
byte[] b = r.getValue(CF, ATTR); // returns current version of value
|
||||||
|
List<KeyValue> kv = r.getColumn(CF, ATTR); // returns all versions of this column
|
||||||
|
</programlisting>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Put</title>
|
||||||
|
|
||||||
|
<para>Doing a put always creates a new version of a <literal>cell</literal>, at a certain
|
||||||
|
timestamp. By default the system uses the server's <literal>currentTimeMillis</literal>,
|
||||||
|
but you can specify the version (= the long integer) yourself, on a per-column level.
|
||||||
|
This means you could assign a time in the past or the future, or use the long value for
|
||||||
|
non-time purposes.</para>
|
||||||
|
|
||||||
|
<para>To overwrite an existing value, do a put at exactly the same row, column, and
|
||||||
|
version as that of the cell you would overshadow.</para>
|
||||||
|
<section
|
||||||
|
xml:id="implicit_version_example">
|
||||||
|
<title>Implicit Version Example</title>
|
||||||
|
<para>The following Put will be implicitly versioned by HBase with the current
|
||||||
|
time.</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static final byte[] CF = "cf".getBytes();
|
||||||
|
public static final byte[] ATTR = "attr".getBytes();
|
||||||
|
...
|
||||||
|
Put put = new Put(Bytes.toBytes(row));
|
||||||
|
put.add(CF, ATTR, Bytes.toBytes( data));
|
||||||
|
table.put(put);
|
||||||
|
</programlisting>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="explicit_version_example">
|
||||||
|
<title>Explicit Version Example</title>
|
||||||
|
<para>The following Put has the version timestamp explicitly set.</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static final byte[] CF = "cf".getBytes();
|
||||||
|
public static final byte[] ATTR = "attr".getBytes();
|
||||||
|
...
|
||||||
|
Put put = new Put( Bytes.toBytes(row));
|
||||||
|
long explicitTimeInMs = 555; // just an example
|
||||||
|
put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));
|
||||||
|
table.put(put);
|
||||||
|
</programlisting>
|
||||||
|
<para>Caution: the version timestamp is used internally by HBase for things like
            time-to-live calculations. It's usually best to avoid setting this timestamp yourself.
            Prefer using a separate timestamp attribute of the row, or have the timestamp as a
            part of the rowkey, or both. </para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="version.delete">
|
||||||
|
<title>Delete</title>
|
||||||
|
|
||||||
|
<para>There are three different types of internal delete markers. See Lars Hofhansl's blog
|
||||||
|
for discussion of his attempt at adding another, <link
|
||||||
|
xlink:href="http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html">Scanning
|
||||||
|
in HBase: Prefix Delete Marker</link>. </para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>Delete: for a specific version of a column.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Delete column: for all versions of a column.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Delete family: for all columns of a particular ColumnFamily</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>When deleting an entire row, HBase will internally create a tombstone for each
|
||||||
|
ColumnFamily (i.e., not each individual column). </para>
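        <para>A minimal sketch of how the three marker types above map onto the client API,
          assuming an existing <code>Table</code> instance (row key, family, and qualifier are
          illustrative; in practice you would normally issue only one of these per
          Delete):</para>
        <programlisting language="java">
Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));   // latest version of one column
delete.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // all versions of one column
delete.deleteFamily(Bytes.toBytes("cf"));                          // all columns of the family
table.delete(delete);
</programlisting>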
|
||||||
|
<para>Deletes work by creating <emphasis>tombstone</emphasis> markers. For example, let's
|
||||||
|
suppose we want to delete a row. For this you can specify a version, or else by default
|
||||||
|
the <literal>currentTimeMillis</literal> is used. What this means is <quote>delete all
|
||||||
|
cells where the version is less than or equal to this version</quote>. HBase never
|
||||||
|
modifies data in place, so for example a delete will not immediately delete (or mark as
|
||||||
|
deleted) the entries in the storage file that correspond to the delete condition.
|
||||||
|
Rather, a so-called <emphasis>tombstone</emphasis> is written, which will mask the
|
||||||
|
deleted values. When HBase does a major compaction, the tombstones are processed to
|
||||||
|
actually remove the dead values, together with the tombstones themselves. If the version
|
||||||
|
you specified when deleting a row is larger than the version of any value in the row,
|
||||||
|
then you can consider the complete row to be deleted.</para>
|
||||||
|
<para>For an informative discussion on how deletes and versioning interact, see the thread <link
|
||||||
|
xlink:href="http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/28421">Put w/
|
||||||
|
timestamp -> Deleteall -> Put w/ timestamp fails</link> up on the user mailing
|
||||||
|
list.</para>
|
||||||
|
<para>Also see <xref
|
||||||
|
linkend="keyvalue" /> for more information on the internal KeyValue format. </para>
|
||||||
|
<para>Delete markers are purged during the next major compaction of the store, unless the
|
||||||
|
<option>KEEP_DELETED_CELLS</option> option is set in the column family. To keep the
|
||||||
|
deletes for a configurable amount of time, you can set the delete TTL via the
|
||||||
|
<option>hbase.hstore.time.to.purge.deletes</option> property in
|
||||||
|
<filename>hbase-site.xml</filename>. If
|
||||||
|
<option>hbase.hstore.time.to.purge.deletes</option> is not set, or set to 0, all
|
||||||
|
delete markers, including those with timestamps in the future, are purged during the
|
||||||
|
next major compaction. Otherwise, a delete marker with a timestamp in the future is kept
|
||||||
|
until the major compaction which occurs after the time represented by the marker's
|
||||||
|
timestamp plus the value of <option>hbase.hstore.time.to.purge.deletes</option>, in
|
||||||
|
milliseconds. </para>
|
||||||
|
<note>
|
||||||
|
<para>This behavior represents a fix for an unexpected change that was introduced in
|
||||||
|
HBase 0.94, and was fixed in <link
|
||||||
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-10118">HBASE-10118</link>.
|
||||||
|
The change has been backported to HBase 0.94 and newer branches.</para>
|
||||||
|
</note>
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Current Limitations</title>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Deletes mask Puts</title>
|
||||||
|
|
||||||
|
<para>Deletes mask puts, even puts that happened after the delete
|
||||||
|
was entered. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-2256"
|
||||||
|
>HBASE-2256</link>. Remember that a delete writes a tombstone, which only
      disappears after the next major compaction has run. Suppose you do
      a delete of everything &lt;= T. After this you do a new put with a
      timestamp &lt;= T. This put, even if it happened after the delete,
      will be masked by the delete tombstone. Performing the put will not
      fail, but when you do a get you will notice the put had no
      effect. It will start working again after the major compaction has
|
||||||
|
run. These issues should not be a problem if you use
|
||||||
|
always-increasing versions for new puts to a row. But they can occur
|
||||||
|
even if you do not care about time: just do delete and put
|
||||||
|
immediately after each other, and there is some chance they happen
|
||||||
|
within the same millisecond.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="major.compactions.change.query.results">
|
||||||
|
<title>Major compactions change query results</title>
|
||||||
|
|
||||||
|
<para><quote>...create three cell versions at t1, t2 and t3, with a maximum-versions
|
||||||
|
setting of 2. So when getting all versions, only the values at t2 and t3 will be
|
||||||
|
returned. But if you delete the version at t2 or t3, the one at t1 will appear again.
|
||||||
|
Obviously, once a major compaction has run, such behavior will not be the case
|
||||||
|
anymore...</quote> (See <emphasis>Garbage Collection</emphasis> in <link
|
||||||
|
xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in
|
||||||
|
HBase</link>.)</para>
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
<section xml:id="dm.sort">
|
||||||
|
<title>Sort Order</title>
|
||||||
|
<para>All data model operations in HBase return data in sorted order. First by row,
|
||||||
|
then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted
|
||||||
|
in reverse, so newest records are returned first).
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="dm.column.metadata">
|
||||||
|
<title>Column Metadata</title>
|
||||||
|
<para>There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily.
|
||||||
|
Thus, while HBase can support not only a large number of columns per row, but a heterogeneous set of columns
     between rows as well, it is your responsibility to keep track of the column names.
|
||||||
|
</para>
|
||||||
|
<para>The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows.
|
||||||
|
For more information about how HBase stores data internally, see <xref linkend="keyvalue" />.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="joins"><title>Joins</title>
|
||||||
|
<para>Whether HBase supports joins is a common question on the dist-list, and there is a simple answer: it doesn't,
|
||||||
|
at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins in SQL). As has been illustrated
|
||||||
|
in this chapter, the read data model operations in HBase are Get and Scan.
|
||||||
|
</para>
|
||||||
|
<para>However, that doesn't mean that equivalent join functionality can't be supported in your application, but
|
||||||
|
you have to do it yourself. The two primary strategies are either denormalizing the data upon writing to HBase,
|
||||||
|
or to have lookup tables and do the join between HBase tables in your application or MapReduce code (and as RDBMSs
|
||||||
|
demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs.
|
||||||
|
hash-joins). So which is the best approach? It depends on what you are trying to do, and as such there isn't a single
|
||||||
|
answer that works for every use case.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="acid"><title>ACID</title>
|
||||||
|
<para>See <link xlink:href="http://hbase.apache.org/acid-semantics.html">ACID Semantics</link>.
|
||||||
|
Lars Hofhansl has also written a note on
|
||||||
|
<link xlink:href="http://hadoop-hbase.blogspot.com/2012/03/acid-in-hbase.html">ACID in HBase</link>.</para>
|
||||||
|
</section>
|
||||||
|
</chapter>
|
|
@ -0,0 +1,270 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="faq"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<title >FAQ</title>
|
||||||
|
<qandaset defaultlabel='qanda'>
|
||||||
|
<qandadiv><title>General</title>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>When should I use HBase?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>See the <xref linkend="arch.overview" /> in the Architecture chapter.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>Are there other HBase FAQs?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See the FAQ that is up on the wiki, <link xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ">HBase Wiki FAQ</link>.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry xml:id="faq.sql">
|
||||||
|
<question><para>Does HBase support SQL?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
Not really. SQL-ish support for HBase via <link xlink:href="http://hive.apache.org/">Hive</link> is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests.
|
||||||
|
See the <xref linkend="datamodel" /> section for examples on the HBase client.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>How can I find examples of NoSQL/HBase?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>See the link to the BigTable paper in <xref linkend="other.info" /> in the appendix, as
|
||||||
|
well as the other papers.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>What is the history of HBase?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>See <xref linkend="hbase.history"/>.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv>
|
||||||
|
<title>Upgrading</title>
|
||||||
|
<qandaentry>
|
||||||
|
<question>
|
||||||
|
<para>How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+?</para>
|
||||||
|
</question>
|
||||||
|
<answer>
|
||||||
|
<para>In HBase 0.96, the project moved to a modular structure. Adjust your project's
|
||||||
|
dependencies to rely upon the <filename>hbase-client</filename> module or another
|
||||||
|
module as appropriate, rather than a single JAR. You can model your Maven dependency
|
||||||
|
after one of the following, depending on your targeted version of HBase. See <xref
|
||||||
|
linkend="upgrade0.96"/> or <xref linkend="upgrade0.98"/> for more
|
||||||
|
information.</para>
|
||||||
|
<example>
|
||||||
|
<title>Maven Dependency for HBase 0.98</title>
|
||||||
|
<programlisting language="xml"><![CDATA[
|
||||||
|
<dependency>
|
||||||
|
<groupId>org.apache.hbase</groupId>
|
||||||
|
<artifactId>hbase-client</artifactId>
|
||||||
|
<version>0.98.5-hadoop2</version>
|
||||||
|
</dependency>
|
||||||
|
]]></programlisting>
|
||||||
|
</example>
|
||||||
|
<example>
|
||||||
|
<title>Maven Dependency for HBase 0.96</title>
|
||||||
|
<programlisting language="xml"><![CDATA[
|
||||||
|
<dependency>
|
||||||
|
<groupId>org.apache.hbase</groupId>
|
||||||
|
<artifactId>hbase-client</artifactId>
|
||||||
|
<version>0.96.2-hadoop2</version>
|
||||||
|
</dependency>
|
||||||
|
]]></programlisting>
|
||||||
|
</example>
|
||||||
|
<example>
|
||||||
|
<title>Maven Dependency for HBase 0.94</title>
|
||||||
|
<programlisting language="xml"><![CDATA[
|
||||||
|
<dependency>
|
||||||
|
<groupId>org.apache.hbase</groupId>
|
||||||
|
<artifactId>hbase</artifactId>
|
||||||
|
<version>0.94.3</version>
|
||||||
|
</dependency>
|
||||||
|
]]></programlisting>
|
||||||
|
</example>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv xml:id="faq.arch"><title>Architecture</title>
|
||||||
|
<qandaentry xml:id="faq.arch.regions">
|
||||||
|
<question><para>How does HBase handle Region-RegionServer assignment and locality?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="regions.arch" />.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv xml:id="faq.config"><title>Configuration</title>
|
||||||
|
<qandaentry xml:id="faq.config.started">
|
||||||
|
<question><para>How can I get started with my first cluster?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="quickstart" />.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry xml:id="faq.config.options">
|
||||||
|
<question><para>Where can I learn about the rest of the configuration options?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="configuration" />.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv xml:id="faq.design"><title>Schema Design / Data Access</title>
|
||||||
|
<qandaentry xml:id="faq.design.schema">
|
||||||
|
<question><para>How should I design my schema in HBase?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="datamodel" /> and <xref linkend="schema" />
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>
|
||||||
|
How can I store (fill in the blank) in HBase?
|
||||||
|
</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="supported.datatypes" />.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry xml:id="secondary.indices">
|
||||||
|
<question><para>
|
||||||
|
How can I handle secondary indexes in HBase?
|
||||||
|
</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="secondary.indexes" />
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry xml:id="faq.changing.rowkeys">
|
||||||
|
<question><para>Can I change a table's rowkeys?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para> This is a very common question. You can't. See <xref
|
||||||
|
linkend="changing.rowkeys"/>. </para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry xml:id="faq.apis">
|
||||||
|
<question><para>What APIs does HBase support?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="datamodel" />, <xref linkend="client" /> and <xref linkend="nonjava.jvm"/>.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv xml:id="faq.mapreduce"><title>MapReduce</title>
|
||||||
|
<qandaentry xml:id="faq.mapreduce.use">
|
||||||
|
<question><para>How can I use MapReduce with HBase?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="mapreduce" />
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv><title>Performance and Troubleshooting</title>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>
|
||||||
|
How can I improve HBase cluster performance?
|
||||||
|
</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="performance" />.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>
|
||||||
|
How can I troubleshoot my HBase cluster?
|
||||||
|
</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="trouble" />.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv xml:id="ec2"><title>Amazon EC2</title>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>
|
||||||
|
I am running HBase on Amazon EC2 and...
|
||||||
|
</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
EC2 issues are a special case. See Troubleshooting <xref linkend="trouble.ec2" /> and Performance <xref linkend="perf.ec2" /> sections.
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv><title xml:id="faq.operations">Operations</title>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>
|
||||||
|
How do I manage my HBase cluster?
|
||||||
|
</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="ops_mgt" />
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>
|
||||||
|
How do I back up my HBase cluster?
|
||||||
|
</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="ops.backup" />
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
<qandadiv><title>HBase in Action</title>
|
||||||
|
<qandaentry>
|
||||||
|
<question><para>Where can I find interesting videos and presentations on HBase?</para></question>
|
||||||
|
<answer>
|
||||||
|
<para>
|
||||||
|
See <xref linkend="other.info" />
|
||||||
|
</para>
|
||||||
|
</answer>
|
||||||
|
</qandaentry>
|
||||||
|
</qandadiv>
|
||||||
|
</qandaset>
|
||||||
|
|
||||||
|
</appendix>
|
|
@ -0,0 +1,538 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?><glossary xml:id="hbase_default_configurations" version="5.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:db="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:svg="http://www.w3.org/2000/svg" xmlns:html="http://www.w3.org/1999/xhtml" xmlns="http://docbook.org/ns/docbook"><title>HBase Default Configuration</title><para>
|
||||||
|
The documentation below is generated using the default hbase configuration file,
|
||||||
|
<filename>hbase-default.xml</filename>, as source.
|
||||||
|
</para><glossentry xml:id="hbase.tmp.dir"><glossterm><varname>hbase.tmp.dir</varname></glossterm><glossdef><para>Temporary directory on the local filesystem.
|
||||||
|
Change this setting to point to a location more permanent
|
||||||
|
than '/tmp', the usual resolve for java.io.tmpdir, as the
|
||||||
|
'/tmp' directory is cleared on machine restart.</para><formalpara><title>Default</title><para><varname>${java.io.tmpdir}/hbase-${user.name}</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rootdir"><glossterm><varname>hbase.rootdir</varname></glossterm><glossdef><para>The directory shared by region servers and into
|
||||||
|
which HBase persists. The URL should be 'fully-qualified'
|
||||||
|
to include the filesystem scheme. For example, to specify the
|
||||||
|
HDFS directory '/hbase' where the HDFS instance's namenode is
|
||||||
|
running at namenode.example.org on port 9000, set this value to:
|
||||||
|
hdfs://namenode.example.org:9000/hbase. By default, we write
|
||||||
|
to whatever ${hbase.tmp.dir} is set to -- usually /tmp --
|
||||||
|
so change this configuration or else all data will be lost on
|
||||||
|
machine restart.</para><formalpara><title>Default</title><para><varname>${hbase.tmp.dir}/hbase</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.cluster.distributed"><glossterm><varname>hbase.cluster.distributed</varname></glossterm><glossdef><para>The mode the cluster will be in. Possible values are
|
||||||
|
false for standalone mode and true for distributed mode. If
|
||||||
|
false, startup will run all HBase and ZooKeeper daemons together
|
||||||
|
in the one JVM.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.quorum"><glossterm><varname>hbase.zookeeper.quorum</varname></glossterm><glossdef><para>Comma separated list of servers in the ZooKeeper ensemble
|
||||||
|
(This config. should have been named hbase.zookeeper.ensemble).
|
||||||
|
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
|
||||||
|
By default this is set to localhost for local and pseudo-distributed modes
|
||||||
|
of operation. For a fully-distributed setup, this should be set to a full
|
||||||
|
list of ZooKeeper ensemble servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
|
||||||
|
this is the list of servers which hbase will start/stop ZooKeeper on as
|
||||||
|
part of cluster start/stop. Client-side, we will take this list of
|
||||||
|
ensemble members and put it together with the hbase.zookeeper.clientPort
|
||||||
|
config. and pass it into zookeeper constructor as the connectString
|
||||||
|
parameter.</para><formalpara><title>Default</title><para><varname>localhost</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.local.dir"><glossterm><varname>hbase.local.dir</varname></glossterm><glossdef><para>Directory on the local filesystem to be used
|
||||||
|
as a local storage.</para><formalpara><title>Default</title><para><varname>${hbase.tmp.dir}/local/</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.info.port"><glossterm><varname>hbase.master.info.port</varname></glossterm><glossdef><para>The port for the HBase Master web UI.
|
||||||
|
Set to -1 if you do not want a UI instance run.</para><formalpara><title>Default</title><para><varname>16010</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.info.bindAddress"><glossterm><varname>hbase.master.info.bindAddress</varname></glossterm><glossdef><para>The bind address for the HBase Master web UI
|
||||||
|
</para><formalpara><title>Default</title><para><varname>0.0.0.0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.logcleaner.plugins"><glossterm><varname>hbase.master.logcleaner.plugins</varname></glossterm><glossdef><para>A comma-separated list of BaseLogCleanerDelegate invoked by
|
||||||
|
the LogsCleaner service. These WAL cleaners are called in order,
|
||||||
|
so put the cleaner that prunes the most files in front. To
|
||||||
|
implement your own BaseLogCleanerDelegate, just put it in HBase's classpath
|
||||||
|
and add the fully qualified class name here. Always add the above
|
||||||
|
default log cleaners in the list.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.logcleaner.ttl"><glossterm><varname>hbase.master.logcleaner.ttl</varname></glossterm><glossdef><para>Maximum time a WAL can stay in the .oldlogdir directory,
|
||||||
|
after which it will be cleaned by a Master thread.</para><formalpara><title>Default</title><para><varname>600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.hfilecleaner.plugins"><glossterm><varname>hbase.master.hfilecleaner.plugins</varname></glossterm><glossdef><para>A comma-separated list of BaseHFileCleanerDelegate invoked by
|
||||||
|
the HFileCleaner service. These HFiles cleaners are called in order,
|
||||||
|
so put the cleaner that prunes the most files in front. To
|
||||||
|
implement your own BaseHFileCleanerDelegate, just put it in HBase's classpath
|
||||||
|
and add the fully qualified class name here. Always add the above
|
||||||
|
default log cleaners in the list as they will be overwritten in
|
||||||
|
hbase-site.xml.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.catalog.timeout"><glossterm><varname>hbase.master.catalog.timeout</varname></glossterm><glossdef><para>Timeout value for the Catalog Janitor from the master to
|
||||||
|
META.</para><formalpara><title>Default</title><para><varname>600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.infoserver.redirect"><glossterm><varname>hbase.master.infoserver.redirect</varname></glossterm><glossdef><para>Whether or not the Master listens to the Master web
|
||||||
|
UI port (hbase.master.info.port) and redirects requests to the web
|
||||||
|
UI server shared by the Master and RegionServer.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.port"><glossterm><varname>hbase.regionserver.port</varname></glossterm><glossdef><para>The port the HBase RegionServer binds to.</para><formalpara><title>Default</title><para><varname>16020</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.info.port"><glossterm><varname>hbase.regionserver.info.port</varname></glossterm><glossdef><para>The port for the HBase RegionServer web UI
|
||||||
|
Set to -1 if you do not want the RegionServer UI to run.</para><formalpara><title>Default</title><para><varname>16030</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.info.bindAddress"><glossterm><varname>hbase.regionserver.info.bindAddress</varname></glossterm><glossdef><para>The address for the HBase RegionServer web UI</para><formalpara><title>Default</title><para><varname>0.0.0.0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.info.port.auto"><glossterm><varname>hbase.regionserver.info.port.auto</varname></glossterm><glossdef><para>Whether or not the Master or RegionServer
|
||||||
|
UI should search for a port to bind to. Enables automatic port
|
||||||
|
search if hbase.regionserver.info.port is already in use.
|
||||||
|
Useful for testing, turned off by default.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.handler.count"><glossterm><varname>hbase.regionserver.handler.count</varname></glossterm><glossdef><para>Count of RPC Listener instances spun up on RegionServers.
|
||||||
|
Same property is used by the Master for count of master handlers.</para><formalpara><title>Default</title><para><varname>30</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.server.callqueue.handler.factor"><glossterm><varname>hbase.ipc.server.callqueue.handler.factor</varname></glossterm><glossdef><para>Factor to determine the number of call queues.
|
||||||
|
A value of 0 means a single queue shared between all the handlers.
|
||||||
|
A value of 1 means that each handler has its own queue.</para><formalpara><title>Default</title><para><varname>0.1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.server.callqueue.read.ratio"><glossterm><varname>hbase.ipc.server.callqueue.read.ratio</varname></glossterm><glossdef><para>Split the call queues into read and write queues.
|
||||||
|
The specified interval (which should be between 0.0 and 1.0)
|
||||||
|
will be multiplied by the number of call queues.
|
||||||
|
A value of 0 indicate to not split the call queues, meaning that both read and write
|
||||||
|
requests will be pushed to the same set of queues.
|
||||||
|
A value lower than 0.5 means that there will be less read queues than write queues.
|
||||||
|
A value of 0.5 means there will be the same number of read and write queues.
|
||||||
|
A value greater than 0.5 means that there will be more read queues than write queues.
|
||||||
|
A value of 1.0 means that all the queues except one are used to dispatch read requests.
|
||||||
|
|
||||||
|
Example: Given the total number of call queues being 10
|
||||||
|
a read.ratio of 0 means that: the 10 queues will contain both read/write requests.
|
||||||
|
a read.ratio of 0.3 means that: 3 queues will contain only read requests
|
||||||
|
and 7 queues will contain only write requests.
|
||||||
|
a read.ratio of 0.5 means that: 5 queues will contain only read requests
|
||||||
|
and 5 queues will contain only write requests.
|
||||||
|
a read.ratio of 0.8 means that: 8 queues will contain only read requests
|
||||||
|
and 2 queues will contain only write requests.
|
||||||
|
a read.ratio of 1 means that: 9 queues will contain only read requests
|
||||||
|
and 1 queues will contain only write requests.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.server.callqueue.scan.ratio"><glossterm><varname>hbase.ipc.server.callqueue.scan.ratio</varname></glossterm><glossdef><para>Given the number of read call queues, calculated from the total number
|
||||||
|
of call queues multiplied by the callqueue.read.ratio, the scan.ratio property
|
||||||
|
will split the read call queues into small-read and long-read queues.
|
||||||
|
A value lower than 0.5 means that there will be less long-read queues than short-read queues.
|
||||||
|
A value of 0.5 means that there will be the same number of short-read and long-read queues.
|
||||||
|
A value greater than 0.5 means that there will be more long-read queues than short-read queues
|
||||||
|
A value of 0 or 1 indicate to use the same set of queues for gets and scans.
|
||||||
|
|
||||||
|
Example: Given the total number of read call queues being 8
|
||||||
|
a scan.ratio of 0 or 1 means that: 8 queues will contain both long and short read requests.
|
||||||
|
a scan.ratio of 0.3 means that: 2 queues will contain only long-read requests
|
||||||
|
and 6 queues will contain only short-read requests.
|
||||||
|
a scan.ratio of 0.5 means that: 4 queues will contain only long-read requests
|
||||||
|
and 4 queues will contain only short-read requests.
|
||||||
|
a scan.ratio of 0.8 means that: 6 queues will contain only long-read requests
|
||||||
|
and 2 queues will contain only short-read requests.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.msginterval"><glossterm><varname>hbase.regionserver.msginterval</varname></glossterm><glossdef><para>Interval between messages from the RegionServer to Master
|
||||||
|
in milliseconds.</para><formalpara><title>Default</title><para><varname>3000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.regionSplitLimit"><glossterm><varname>hbase.regionserver.regionSplitLimit</varname></glossterm><glossdef><para>Limit for the number of regions after which no more region
|
||||||
|
splitting should take place. This is not a hard limit for the number of
|
||||||
|
regions but acts as a guideline for the regionserver to stop splitting after
|
||||||
|
a certain limit. Default is MAX_INT; i.e. do not block splitting.</para><formalpara><title>Default</title><para><varname>2147483647</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.logroll.period"><glossterm><varname>hbase.regionserver.logroll.period</varname></glossterm><glossdef><para>Period at which we will roll the commit log regardless
|
||||||
|
of how many edits it has.</para><formalpara><title>Default</title><para><varname>3600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.logroll.errors.tolerated"><glossterm><varname>hbase.regionserver.logroll.errors.tolerated</varname></glossterm><glossdef><para>The number of consecutive WAL close errors we will allow
|
||||||
|
before triggering a server abort. A setting of 0 will cause the
|
||||||
|
region server to abort if closing the current WAL writer fails during
|
||||||
|
log rolling. Even a small value (2 or 3) will allow a region server
|
||||||
|
to ride over transient HDFS errors.</para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.hlog.reader.impl"><glossterm><varname>hbase.regionserver.hlog.reader.impl</varname></glossterm><glossdef><para>The WAL file reader implementation.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.hlog.writer.impl"><glossterm><varname>hbase.regionserver.hlog.writer.impl</varname></glossterm><glossdef><para>The WAL file writer implementation.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.distributed.log.replay"><glossterm><varname>hbase.master.distributed.log.replay</varname></glossterm><glossdef><para>Enable 'distributed log replay' as default engine splitting
|
||||||
|
WAL files on server crash. This default is new in hbase 1.0. To fall
|
||||||
|
back to the old mode 'distributed log splitter', set the value to
|
||||||
|
'false'. 'Distributed log replay' improves MTTR because it does not
    write intermediate files. 'DLR' requires that 'hfile.format.version'
|
||||||
|
be set to version 3 or higher.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.global.memstore.size"><glossterm><varname>hbase.regionserver.global.memstore.size</varname></glossterm><glossdef><para>Maximum size of all memstores in a region server before new
|
||||||
|
updates are blocked and flushes are forced. Defaults to 40% of heap.
|
||||||
|
Updates are blocked and flushes are forced until size of all memstores
|
||||||
|
in a region server hits hbase.regionserver.global.memstore.size.lower.limit.</para><formalpara><title>Default</title><para><varname>0.4</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.global.memstore.size.lower.limit"><glossterm><varname>hbase.regionserver.global.memstore.size.lower.limit</varname></glossterm><glossdef><para>Maximum size of all memstores in a region server before flushes are forced.
|
||||||
|
Defaults to 95% of hbase.regionserver.global.memstore.size.
|
||||||
|
A 100% value for this value causes the minimum possible flushing to occur when updates are
|
||||||
|
blocked due to memstore limiting.</para><formalpara><title>Default</title><para><varname>0.95</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.optionalcacheflushinterval"><glossterm><varname>hbase.regionserver.optionalcacheflushinterval</varname></glossterm><glossdef><para>
|
||||||
|
Maximum amount of time an edit lives in memory before being automatically flushed.
|
||||||
|
Default 1 hour. Set it to 0 to disable automatic flushing.</para><formalpara><title>Default</title><para><varname>3600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.catalog.timeout"><glossterm><varname>hbase.regionserver.catalog.timeout</varname></glossterm><glossdef><para>Timeout value for the Catalog Janitor from the regionserver to META.</para><formalpara><title>Default</title><para><varname>600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.dns.interface"><glossterm><varname>hbase.regionserver.dns.interface</varname></glossterm><glossdef><para>The name of the Network Interface from which a region server
should report its IP address.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.dns.nameserver"><glossterm><varname>hbase.regionserver.dns.nameserver</varname></glossterm><glossdef><para>The host name or IP address of the name server (DNS)
which a region server should use to determine the host name used by the
master for communication and display purposes.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.region.split.policy"><glossterm><varname>hbase.regionserver.region.split.policy</varname></glossterm><glossdef><para>
A split policy determines when a region should be split. The various other split policies that
are available currently are ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy,
DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy etc.
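For example, to switch the cluster-wide policy to one of the classes named above
(ConstantSizeRegionSplitPolicy is shown; any of the listed policies is set the same way), a
sketch in <filename>hbase-site.xml</filename> would be:
<programlisting><![CDATA[
<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy</value>
</property>]]></programlisting>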
</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.session.timeout"><glossterm><varname>zookeeper.session.timeout</varname></glossterm><glossdef><para>ZooKeeper session timeout in milliseconds. It is used in two different ways.
First, this value is used in the ZK client that HBase uses to connect to the ensemble.
It is also used by HBase when it starts a ZK server and it is passed as the 'maxSessionTimeout'. See
http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions.
For example, if a HBase region server connects to a ZK ensemble that's also managed by HBase, then the
session timeout will be the one specified by this configuration. But, a region server that connects
to an ensemble managed with a different configuration will be subject to that ensemble's maxSessionTimeout. So,
even though HBase might propose using 90 seconds, the ensemble can have a max timeout lower than this and
it will take precedence. The current default that ZK ships with is 40 seconds, which is lower than HBase's.
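As an illustration only (the right value depends on your failure-detection needs), a site
asking for a shorter 60 second session timeout would set:
<programlisting><![CDATA[
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>]]></programlisting>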
</para><formalpara><title>Default</title><para><varname>90000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.znode.parent"><glossterm><varname>zookeeper.znode.parent</varname></glossterm><glossdef><para>Root ZNode for HBase in ZooKeeper. All of HBase's ZooKeeper
files that are configured with a relative path will go under this node.
By default, all of HBase's ZooKeeper file paths are configured with a
relative path, so they will all go under this directory unless changed.</para><formalpara><title>Default</title><para><varname>/hbase</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.znode.rootserver"><glossterm><varname>zookeeper.znode.rootserver</varname></glossterm><glossdef><para>Path to ZNode holding root region location. This is written by
the master and read by clients and region servers. If a relative path is
given, the parent folder will be ${zookeeper.znode.parent}. By default,
this means the root location is stored at /hbase/root-region-server.</para><formalpara><title>Default</title><para><varname>root-region-server</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.znode.acl.parent"><glossterm><varname>zookeeper.znode.acl.parent</varname></glossterm><glossdef><para>Root ZNode for access control lists.</para><formalpara><title>Default</title><para><varname>acl</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.dns.interface"><glossterm><varname>hbase.zookeeper.dns.interface</varname></glossterm><glossdef><para>The name of the Network Interface from which a ZooKeeper server
should report its IP address.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.dns.nameserver"><glossterm><varname>hbase.zookeeper.dns.nameserver</varname></glossterm><glossdef><para>The host name or IP address of the name server (DNS)
which a ZooKeeper server should use to determine the host name used by the
master for communication and display purposes.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.peerport"><glossterm><varname>hbase.zookeeper.peerport</varname></glossterm><glossdef><para>Port used by ZooKeeper peers to talk to each other.
See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
for more information.</para><formalpara><title>Default</title><para><varname>2888</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.leaderport"><glossterm><varname>hbase.zookeeper.leaderport</varname></glossterm><glossdef><para>Port used by ZooKeeper for leader election.
See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
for more information.</para><formalpara><title>Default</title><para><varname>3888</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.useMulti"><glossterm><varname>hbase.zookeeper.useMulti</varname></glossterm><glossdef><para>Instructs HBase to make use of ZooKeeper's multi-update functionality.
This allows certain ZooKeeper operations to complete more quickly and prevents some issues
with rare Replication failure scenarios (see the release note of HBASE-2611 for an example).
IMPORTANT: only set this to true if all ZooKeeper servers in the cluster are on version 3.4+
and will not be downgraded. ZooKeeper versions before 3.4 do not support multi-update and
will not fail gracefully if multi-update is invoked (see ZOOKEEPER-1495).</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.config.read.zookeeper.config"><glossterm><varname>hbase.config.read.zookeeper.config</varname></glossterm><glossdef><para>
Set to true to allow HBaseConfiguration to read the
zoo.cfg file for ZooKeeper properties. Switching this to true
is not recommended, since the functionality of reading ZK
properties from a zoo.cfg file has been deprecated.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.initLimit"><glossterm><varname>hbase.zookeeper.property.initLimit</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
The number of ticks that the initial synchronization phase can take.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.syncLimit"><glossterm><varname>hbase.zookeeper.property.syncLimit</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
The number of ticks that can pass between sending a request and getting an
acknowledgment.</para><formalpara><title>Default</title><para><varname>5</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.dataDir"><glossterm><varname>hbase.zookeeper.property.dataDir</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.</para><formalpara><title>Default</title><para><varname>${hbase.tmp.dir}/zookeeper</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.clientPort"><glossterm><varname>hbase.zookeeper.property.clientPort</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.</para><formalpara><title>Default</title><para><varname>2181</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.maxClientCnxns"><glossterm><varname>hbase.zookeeper.property.maxClientCnxns</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
Limit on number of concurrent connections (at the socket level) that a
single client, identified by IP address, may make to a single member of
the ZooKeeper ensemble. Set high to avoid zk connection issues running
standalone and pseudo-distributed.</para><formalpara><title>Default</title><para><varname>300</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.write.buffer"><glossterm><varname>hbase.client.write.buffer</varname></glossterm><glossdef><para>Default size of the HTable client write buffer in bytes.
A bigger buffer takes more memory -- on both the client and server
side since server instantiates the passed write buffer to process
it -- but a larger buffer size reduces the number of RPCs made.
For an estimate of server-side memory-used, evaluate
hbase.client.write.buffer * hbase.regionserver.handler.count</para><formalpara><title>Default</title><para><varname>2097152</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.pause"><glossterm><varname>hbase.client.pause</varname></glossterm><glossdef><para>General client pause value. Used mostly as value to wait
before running a retry of a failed get, region lookup, etc.
See hbase.client.retries.number for description of how we backoff from
this initial pause amount and how this pause works w/ retries.</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.retries.number"><glossterm><varname>hbase.client.retries.number</varname></glossterm><glossdef><para>Maximum retries. Used as maximum for all retryable
operations such as the getting of a cell's value, starting a row update,
etc. Retry interval is a rough function based on hbase.client.pause. At
first we retry at this interval but then with backoff, we pretty quickly reach
retrying every ten seconds. See HConstants#RETRY_BACKOFF for how the backoff
ramps up. Change this setting and hbase.client.pause to suit your workload.</para><formalpara><title>Default</title><para><varname>35</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.max.total.tasks"><glossterm><varname>hbase.client.max.total.tasks</varname></glossterm><glossdef><para>The maximum number of concurrent tasks a single HTable instance will
send to the cluster.</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.max.perserver.tasks"><glossterm><varname>hbase.client.max.perserver.tasks</varname></glossterm><glossdef><para>The maximum number of concurrent tasks a single HTable instance will
send to a single region server.</para><formalpara><title>Default</title><para><varname>5</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.max.perregion.tasks"><glossterm><varname>hbase.client.max.perregion.tasks</varname></glossterm><glossdef><para>The maximum number of concurrent connections the client will
maintain to a single Region. That is, if there is already
hbase.client.max.perregion.tasks writes in progress for this region, new puts
won't be sent to this region until some writes finishes.</para><formalpara><title>Default</title><para><varname>1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.scanner.caching"><glossterm><varname>hbase.client.scanner.caching</varname></glossterm><glossdef><para>Number of rows that will be fetched when calling next
on a scanner if it is not served from (local, client) memory. Higher
caching values will enable faster scanners but will eat up more memory
and some calls of next may take longer and longer times when the cache is empty.
Do not set this value such that the time between invocations is greater
than the scanner timeout; i.e. hbase.client.scanner.timeout.period</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.keyvalue.maxsize"><glossterm><varname>hbase.client.keyvalue.maxsize</varname></glossterm><glossdef><para>Specifies the combined maximum allowed size of a KeyValue
instance. This is to set an upper boundary for a single entry saved in a
storage file. Since these entries cannot be split, this helps avoid the case where a region
cannot be split any further because its data is too large. It seems wise
to set this to a fraction of the maximum region size. Setting it to zero
or less disables the check.</para><formalpara><title>Default</title><para><varname>10485760</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.scanner.timeout.period"><glossterm><varname>hbase.client.scanner.timeout.period</varname></glossterm><glossdef><para>Client scanner lease period in milliseconds.</para><formalpara><title>Default</title><para><varname>60000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.localityCheck.threadPoolSize"><glossterm><varname>hbase.client.localityCheck.threadPoolSize</varname></glossterm><glossdef><para/><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bulkload.retries.number"><glossterm><varname>hbase.bulkload.retries.number</varname></glossterm><glossdef><para>Maximum retries. This is maximum number of iterations
that atomic bulk loads are attempted in the face of splitting operations;
0 means never give up.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.balancer.period "><glossterm><varname>hbase.balancer.period
</varname></glossterm><glossdef><para>Period at which the region balancer runs in the Master.</para><formalpara><title>Default</title><para><varname>300000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regions.slop"><glossterm><varname>hbase.regions.slop</varname></glossterm><glossdef><para>Rebalance if any regionserver has average + (average * slop) regions.</para><formalpara><title>Default</title><para><varname>0.2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.server.thread.wakefrequency"><glossterm><varname>hbase.server.thread.wakefrequency</varname></glossterm><glossdef><para>Time to sleep in between searches for work (in milliseconds).
Used as sleep interval by service threads such as log roller.</para><formalpara><title>Default</title><para><varname>10000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.server.versionfile.writeattempts"><glossterm><varname>hbase.server.versionfile.writeattempts</varname></glossterm><glossdef><para>
How many times to retry attempting to write a version file
before just aborting. Each attempt is separated by the
hbase.server.thread.wakefrequency milliseconds.</para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.memstore.flush.size"><glossterm><varname>hbase.hregion.memstore.flush.size</varname></glossterm><glossdef><para>
Memstore will be flushed to disk if size of the memstore
exceeds this number of bytes. Value is checked by a thread that runs
every hbase.server.thread.wakefrequency.</para><formalpara><title>Default</title><para><varname>134217728</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.percolumnfamilyflush.size.lower.bound"><glossterm><varname>hbase.hregion.percolumnfamilyflush.size.lower.bound</varname></glossterm><glossdef><para>
If FlushLargeStoresPolicy is used, then every time that we hit the
total memstore limit, we find out all the column families whose memstores
exceed this value, and only flush them, while retaining the others whose
memstores are lower than this limit. If none of the families have their
memstore size more than this, all the memstores will be flushed
(just as usual). This value should be less than half of the total memstore
threshold (hbase.hregion.memstore.flush.size).
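For example, with the default 128 MB hbase.hregion.memstore.flush.size, an illustrative
32 MB lower bound (safely below half the flush size) would be:
<programlisting><![CDATA[
<property>
  <name>hbase.hregion.percolumnfamilyflush.size.lower.bound</name>
  <value>33554432</value>
</property>]]></programlisting>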
</para><formalpara><title>Default</title><para><varname>16777216</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.preclose.flush.size"><glossterm><varname>hbase.hregion.preclose.flush.size</varname></glossterm><glossdef><para>
If the memstores in a region are this size or larger when we go
to close, run a "pre-flush" to clear out memstores before we put up
the region closed flag and take the region offline. On close,
a flush is run under the close flag to empty memory. During
this time the region is offline and we are not taking on any writes.
If the memstore content is large, this flush could take a long time to
complete. The preflush is meant to clean out the bulk of the memstore
before putting up the close flag and taking the region offline so the
flush that runs under the close flag has little to do.</para><formalpara><title>Default</title><para><varname>5242880</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.memstore.block.multiplier"><glossterm><varname>hbase.hregion.memstore.block.multiplier</varname></glossterm><glossdef><para>
Block updates if memstore has hbase.hregion.memstore.block.multiplier
times hbase.hregion.memstore.flush.size bytes. Useful for preventing
runaway memstore during spikes in update traffic. Without an
upper-bound, memstore fills such that when it flushes the
resultant flush files take a long time to compact or split, or
worse, we OOME.</para><formalpara><title>Default</title><para><varname>4</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.memstore.mslab.enabled"><glossterm><varname>hbase.hregion.memstore.mslab.enabled</varname></glossterm><glossdef><para>
Enables the MemStore-Local Allocation Buffer,
a feature which works to prevent heap fragmentation under
heavy write loads. This can reduce the frequency of stop-the-world
GC pauses on large heaps.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.max.filesize"><glossterm><varname>hbase.hregion.max.filesize</varname></glossterm><glossdef><para>
Maximum HFile size. If the sum of the sizes of a region's HFiles has grown to exceed this
value, the region is split in two.</para><formalpara><title>Default</title><para><varname>10737418240</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.majorcompaction"><glossterm><varname>hbase.hregion.majorcompaction</varname></glossterm><glossdef><para>Time between major compactions, expressed in milliseconds. Set to 0 to disable
time-based automatic major compactions. User-requested and size-based major compactions will
still run. This value is multiplied by hbase.hregion.majorcompaction.jitter to cause
compaction to start at a somewhat-random time during a given window of time. The default value
is 7 days, expressed in milliseconds. If major compactions are causing disruption in your
environment, you can configure them to run at off-peak times for your deployment, or disable
time-based major compactions by setting this parameter to 0, and run major compactions in a
cron job or by another external mechanism.</para><formalpara><title>Default</title><para><varname>604800000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.majorcompaction.jitter"><glossterm><varname>hbase.hregion.majorcompaction.jitter</varname></glossterm><glossdef><para>A multiplier applied to hbase.hregion.majorcompaction to cause compaction to occur
a given amount of time either side of hbase.hregion.majorcompaction. The smaller the number,
the closer the compactions will happen to the hbase.hregion.majorcompaction
interval.</para><formalpara><title>Default</title><para><varname>0.50</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compactionThreshold"><glossterm><varname>hbase.hstore.compactionThreshold</varname></glossterm><glossdef><para> If more than this number of StoreFiles exist in any one Store
(one StoreFile is written per flush of MemStore), a compaction is run to rewrite all
StoreFiles into a single StoreFile. Larger values delay compaction, but when compaction does
occur, it takes longer to complete.</para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.flusher.count"><glossterm><varname>hbase.hstore.flusher.count</varname></glossterm><glossdef><para> The number of flush threads. With fewer threads, the MemStore flushes will be
queued. With more threads, the flushes will be executed in parallel, increasing the load on
HDFS, and potentially causing more compactions. </para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.blockingStoreFiles"><glossterm><varname>hbase.hstore.blockingStoreFiles</varname></glossterm><glossdef><para> If more than this number of StoreFiles exist in any one Store (one StoreFile
is written per flush of MemStore), updates are blocked for this region until a compaction is
completed, or until hbase.hstore.blockingWaitTime has been exceeded.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.blockingWaitTime"><glossterm><varname>hbase.hstore.blockingWaitTime</varname></glossterm><glossdef><para> The time for which a region will block updates after reaching the StoreFile limit
defined by hbase.hstore.blockingStoreFiles. After this time has elapsed, the region will stop
blocking updates even if a compaction has not been completed.</para><formalpara><title>Default</title><para><varname>90000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.min"><glossterm><varname>hbase.hstore.compaction.min</varname></glossterm><glossdef><para>The minimum number of StoreFiles which must be eligible for compaction before
compaction can run. The goal of tuning hbase.hstore.compaction.min is to avoid ending up with
too many tiny StoreFiles to compact. Setting this value to 2 would cause a minor compaction
each time you have two StoreFiles in a Store, and this is probably not appropriate. If you
set this value too high, all the other values will need to be adjusted accordingly. For most
cases, the default value is appropriate. In previous versions of HBase, the parameter
hbase.hstore.compaction.min was named hbase.hstore.compactionThreshold.</para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.max"><glossterm><varname>hbase.hstore.compaction.max</varname></glossterm><glossdef><para>The maximum number of StoreFiles which will be selected for a single minor
compaction, regardless of the number of eligible StoreFiles. Effectively, the value of
hbase.hstore.compaction.max controls the length of time it takes a single compaction to
complete. Setting it larger means that more StoreFiles are included in a compaction. For most
cases, the default value is appropriate.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.min.size"><glossterm><varname>hbase.hstore.compaction.min.size</varname></glossterm><glossdef><para>A StoreFile smaller than this size will always be eligible for minor compaction.
HFiles this size or larger are evaluated by hbase.hstore.compaction.ratio to determine if
they are eligible. Because this limit represents the "automatic include" limit for all
StoreFiles smaller than this value, this value may need to be reduced in write-heavy
environments where many StoreFiles in the 1-2 MB range are being flushed, because every
StoreFile will be targeted for compaction and the resulting StoreFiles may still be under the
minimum size and require further compaction. If this parameter is lowered, the ratio check is
triggered more quickly. This addressed some issues seen in earlier versions of HBase but
changing this parameter is no longer necessary in most situations. Default: 128 MB expressed
in bytes.</para><formalpara><title>Default</title><para><varname>134217728</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.max.size"><glossterm><varname>hbase.hstore.compaction.max.size</varname></glossterm><glossdef><para>A StoreFile larger than this size will be excluded from compaction. The effect of
raising hbase.hstore.compaction.max.size is fewer, larger StoreFiles that do not get
compacted often. If you feel that compaction is happening too often without much benefit, you
can try raising this value. Default: the value of LONG.MAX_VALUE, expressed in bytes.</para><formalpara><title>Default</title><para><varname>9223372036854775807</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.ratio"><glossterm><varname>hbase.hstore.compaction.ratio</varname></glossterm><glossdef><para>For minor compaction, this ratio is used to determine whether a given StoreFile
which is larger than hbase.hstore.compaction.min.size is eligible for compaction. Its
effect is to limit compaction of large StoreFiles. The value of hbase.hstore.compaction.ratio
is expressed as a floating-point decimal. A large ratio, such as 10, will produce a single
giant StoreFile. Conversely, a low value, such as .25, will produce behavior similar to the
BigTable compaction algorithm, producing four StoreFiles. A moderate value of between 1.0 and
1.4 is recommended. When tuning this value, you are balancing write costs with read costs.
Raising the value (to something like 1.4) will have more write costs, because you will
compact larger StoreFiles. However, during reads, HBase will need to seek through fewer
StoreFiles to accomplish the read. Consider this approach if you cannot take advantage of
Bloom filters. Otherwise, you can lower this value to something like 1.0 to reduce the
background cost of writes, and use Bloom filters to control the number of StoreFiles touched
during reads. For most cases, the default value is appropriate.</para><formalpara><title>Default</title><para><varname>1.2F</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.ratio.offpeak"><glossterm><varname>hbase.hstore.compaction.ratio.offpeak</varname></glossterm><glossdef><para>Allows you to set a different (by default, more aggressive) ratio for determining
whether larger StoreFiles are included in compactions during off-peak hours. Works in the
same way as hbase.hstore.compaction.ratio. Only applies if hbase.offpeak.start.hour and
hbase.offpeak.end.hour are also enabled.</para><formalpara><title>Default</title><para><varname>5.0F</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.time.to.purge.deletes"><glossterm><varname>hbase.hstore.time.to.purge.deletes</varname></glossterm><glossdef><para>The amount of time to delay purging of delete markers with future timestamps. If
unset, or set to 0, all delete markers, including those with future timestamps, are purged
during the next major compaction. Otherwise, a delete marker is kept until the major compaction
which occurs after the marker's timestamp plus the value of this setting, in milliseconds.
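As an illustrative sketch, delaying the purge of such markers by one day would look like:
<programlisting><![CDATA[
<property>
  <name>hbase.hstore.time.to.purge.deletes</name>
  <value>86400000</value>
</property>]]></programlisting>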
</para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.offpeak.start.hour"><glossterm><varname>hbase.offpeak.start.hour</varname></glossterm><glossdef><para>The start of off-peak hours, expressed as an integer between 0 and 23, inclusive.
Set to -1 to disable off-peak.</para><formalpara><title>Default</title><para><varname>-1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.offpeak.end.hour"><glossterm><varname>hbase.offpeak.end.hour</varname></glossterm><glossdef><para>The end of off-peak hours, expressed as an integer between 0 and 23, inclusive. Set
to -1 to disable off-peak.</para><formalpara><title>Default</title><para><varname>-1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thread.compaction.throttle"><glossterm><varname>hbase.regionserver.thread.compaction.throttle</varname></glossterm><glossdef><para>There are two different thread pools for compactions, one for large compactions and
the other for small compactions. This helps to keep compaction of lean tables (such as
hbase:meta) fast. If a compaction is larger than this threshold, it
goes into the large compaction pool. In most cases, the default value is appropriate. Default:
2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size (which defaults to 128MB).
The value field assumes that the value of hbase.hregion.memstore.flush.size is unchanged from
the default.</para><formalpara><title>Default</title><para><varname>2684354560</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.kv.max"><glossterm><varname>hbase.hstore.compaction.kv.max</varname></glossterm><glossdef><para>The maximum number of KeyValues to read and then write in a batch when flushing or
compacting. Set this lower if you have big KeyValues and problems with Out Of Memory
Exceptions Set this higher if you have wide, small rows. </para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.storescanner.parallel.seek.enable"><glossterm><varname>hbase.storescanner.parallel.seek.enable</varname></glossterm><glossdef><para>
Enables StoreFileScanner parallel-seeking in StoreScanner,
a feature which can reduce response latency under special conditions.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.storescanner.parallel.seek.threads"><glossterm><varname>hbase.storescanner.parallel.seek.threads</varname></glossterm><glossdef><para>
The default thread pool size if parallel-seeking feature enabled.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.block.cache.size"><glossterm><varname>hfile.block.cache.size</varname></glossterm><glossdef><para>Percentage of maximum heap (-Xmx setting) to allocate to block cache
used by a StoreFile. Default of 0.4 means allocate 40%.
Set to 0 to disable but it's not recommended; you need at least
enough cache to hold the storefile indices.</para><formalpara><title>Default</title><para><varname>0.4</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.block.index.cacheonwrite"><glossterm><varname>hfile.block.index.cacheonwrite</varname></glossterm><glossdef><para>This allows to put non-root multi-level index blocks into the block
cache at the time the index is being written.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.index.block.max.size"><glossterm><varname>hfile.index.block.max.size</varname></glossterm><glossdef><para>When the size of a leaf-level, intermediate-level, or root-level
index block in a multi-level block index grows to this size, the
block is written out and a new block is started.</para><formalpara><title>Default</title><para><varname>131072</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.ioengine"><glossterm><varname>hbase.bucketcache.ioengine</varname></glossterm><glossdef><para>Where to store the contents of the bucketcache. One of: onheap,
offheap, or file. If a file, set it to file:PATH_TO_FILE. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html for more information.
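For instance, a file-backed bucket cache could be sketched as follows (the path is a
placeholder to adapt to your deployment):
<programlisting><![CDATA[
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>file:/mnt/bucketcache.data</value>
</property>]]></programlisting>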
</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.combinedcache.enabled"><glossterm><varname>hbase.bucketcache.combinedcache.enabled</varname></glossterm><glossdef><para>Whether or not the bucketcache is used in league with the LRU
on-heap block cache. In this mode, indices and blooms are kept in the LRU
blockcache and the data blocks are kept in the bucketcache.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.size"><glossterm><varname>hbase.bucketcache.size</varname></glossterm><glossdef><para>The size of the buckets for the bucketcache if you only use a single size.
Defaults to the default blocksize, which is 64 * 1024.</para><formalpara><title>Default</title><para><varname>65536</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.sizes"><glossterm><varname>hbase.bucketcache.sizes</varname></glossterm><glossdef><para>A comma-separated list of sizes for buckets for the bucketcache
if you use multiple sizes. Should be a list of block sizes in order from smallest
to largest. The sizes you use will depend on your data access patterns.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.format.version"><glossterm><varname>hfile.format.version</varname></glossterm><glossdef><para>The HFile format version to use for new files.
Version 3 adds support for tags in hfiles (See http://hbase.apache.org/book.html#hbase.tags).
Distributed Log Replay requires that tags are enabled. Also see the configuration
'hbase.replication.rpc.codec'.
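For example, because Distributed Log Replay needs tags and therefore version 3, an explicit
setting (matching the default below) looks like:
<programlisting><![CDATA[
<property>
  <name>hfile.format.version</name>
  <value>3</value>
</property>]]></programlisting>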
</para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.block.bloom.cacheonwrite"><glossterm><varname>hfile.block.bloom.cacheonwrite</varname></glossterm><glossdef><para>Enables cache-on-write for inline blocks of a compound Bloom filter.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="io.storefile.bloom.block.size"><glossterm><varname>io.storefile.bloom.block.size</varname></glossterm><glossdef><para>The size in bytes of a single block ("chunk") of a compound Bloom
filter. This size is approximate, because Bloom blocks can only be
inserted at data block boundaries, and the number of keys per data
block varies.</para><formalpara><title>Default</title><para><varname>131072</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rs.cacheblocksonwrite"><glossterm><varname>hbase.rs.cacheblocksonwrite</varname></glossterm><glossdef><para>Whether an HFile block should be added to the block cache when the
block is finished.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rpc.timeout"><glossterm><varname>hbase.rpc.timeout</varname></glossterm><glossdef><para>This is for the RPC layer to define how long HBase client applications
take for a remote call to time out. It uses pings to check connections
but will eventually throw a TimeoutException.</para><formalpara><title>Default</title><para><varname>60000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rpc.shortoperation.timeout"><glossterm><varname>hbase.rpc.shortoperation.timeout</varname></glossterm><glossdef><para>This is another version of "hbase.rpc.timeout". For those RPC operation
within the cluster, we rely on this configuration to set a short timeout limit
for short operations. For example, a short RPC timeout for a region server trying
to report to active master can benefit quicker master failover process.</para><formalpara><title>Default</title><para><varname>10000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.client.tcpnodelay"><glossterm><varname>hbase.ipc.client.tcpnodelay</varname></glossterm><glossdef><para>Set no delay on rpc socket connections. See
http://docs.oracle.com/javase/1.5.0/docs/api/java/net/Socket.html#getTcpNoDelay()</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.keytab.file"><glossterm><varname>hbase.master.keytab.file</varname></glossterm><glossdef><para>Full path to the kerberos keytab file to use for logging in
the configured HMaster server principal.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.kerberos.principal"><glossterm><varname>hbase.master.kerberos.principal</varname></glossterm><glossdef><para>Ex. "hbase/_HOST@EXAMPLE.COM". The kerberos principal name
that should be used to run the HMaster process. The principal name should
be in the form: user/hostname@DOMAIN. If "_HOST" is used as the hostname
portion, it will be replaced with the actual hostname of the running
instance.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.keytab.file"><glossterm><varname>hbase.regionserver.keytab.file</varname></glossterm><glossdef><para>Full path to the kerberos keytab file to use for logging in
the configured HRegionServer server principal.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.kerberos.principal"><glossterm><varname>hbase.regionserver.kerberos.principal</varname></glossterm><glossdef><para>Ex. "hbase/_HOST@EXAMPLE.COM". The kerberos principal name
that should be used to run the HRegionServer process. The principal name
should be in the form: user/hostname@DOMAIN. If "_HOST" is used as the
hostname portion, it will be replaced with the actual hostname of the
running instance. An entry for this principal must exist in the file
specified in hbase.regionserver.keytab.file</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hadoop.policy.file"><glossterm><varname>hadoop.policy.file</varname></glossterm><glossdef><para>The policy configuration file used by RPC servers to make
authorization decisions on client requests. Only used when HBase
security is enabled.</para><formalpara><title>Default</title><para><varname>hbase-policy.xml</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.superuser"><glossterm><varname>hbase.superuser</varname></glossterm><glossdef><para>List of users or groups (comma-separated), who are allowed
full privileges, regardless of stored ACLs, across the cluster.
Only used when HBase security is enabled.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.auth.key.update.interval"><glossterm><varname>hbase.auth.key.update.interval</varname></glossterm><glossdef><para>The update interval for master key for authentication tokens
in servers in milliseconds. Only used when HBase security is enabled.</para><formalpara><title>Default</title><para><varname>86400000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.auth.token.max.lifetime"><glossterm><varname>hbase.auth.token.max.lifetime</varname></glossterm><glossdef><para>The maximum lifetime in milliseconds after which an
authentication token expires. Only used when HBase security is enabled.</para><formalpara><title>Default</title><para><varname>604800000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.client.fallback-to-simple-auth-allowed"><glossterm><varname>hbase.ipc.client.fallback-to-simple-auth-allowed</varname></glossterm><glossdef><para>When a client is configured to attempt a secure connection, but attempts to
connect to an insecure server, that server may instruct the client to
switch to SASL SIMPLE (unsecure) authentication. This setting controls
whether or not the client will accept this instruction from the server.
When false (the default), the client will not allow the fallback to SIMPLE
authentication, and will abort the connection.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.display.keys"><glossterm><varname>hbase.display.keys</varname></glossterm><glossdef><para>When this is set to true the webUI and such will display all start/end keys
as part of the table details, region names, etc. When this is set to false,
the keys are hidden.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coprocessor.region.classes"><glossterm><varname>hbase.coprocessor.region.classes</varname></glossterm><glossdef><para>A comma-separated list of Coprocessors that are loaded by
default on all tables. For any override coprocessor method, these classes
will be called in order. After implementing your own Coprocessor, just put
it in HBase's classpath and add the fully qualified class name here.
A coprocessor can also be loaded on demand by setting HTableDescriptor.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.port"><glossterm><varname>hbase.rest.port</varname></glossterm><glossdef><para>The port for the HBase REST server.</para><formalpara><title>Default</title><para><varname>8080</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.readonly"><glossterm><varname>hbase.rest.readonly</varname></glossterm><glossdef><para>Defines the mode the REST server will be started in. Possible values are:
false: All HTTP methods are permitted - GET/PUT/POST/DELETE.
true: Only the GET method is permitted.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.threads.max"><glossterm><varname>hbase.rest.threads.max</varname></glossterm><glossdef><para>The maximum number of threads of the REST server thread pool.
Threads in the pool are reused to process REST requests. This
controls the maximum number of requests processed concurrently.
It may help to control the memory used by the REST server to
avoid OOM issues. If the thread pool is full, incoming requests
will be queued up and wait for some free threads.</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.threads.min"><glossterm><varname>hbase.rest.threads.min</varname></glossterm><glossdef><para>The minimum number of threads of the REST server thread pool.
The thread pool always has at least this number of threads so
the REST server is ready to serve incoming requests.</para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.support.proxyuser"><glossterm><varname>hbase.rest.support.proxyuser</varname></glossterm><glossdef><para>Enables running the REST server to support proxy-user mode.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.defaults.for.version.skip"><glossterm><varname>hbase.defaults.for.version.skip</varname></glossterm><glossdef><para>Set to true to skip the 'hbase.defaults.for.version' check.
Setting this to true can be useful in contexts other than
the other side of a maven generation; i.e. running in an
IDE. You'll want to set this boolean to true to avoid
seeing the RuntimeException complaint: "hbase-default.xml file
seems to be for an old version of HBase (\${hbase.version}), this
version is X.X.X-SNAPSHOT"</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coprocessor.master.classes"><glossterm><varname>hbase.coprocessor.master.classes</varname></glossterm><glossdef><para>A comma-separated list of
org.apache.hadoop.hbase.coprocessor.MasterObserver coprocessors that are
loaded by default on the active HMaster process. For any implemented
coprocessor methods, the listed classes will be called in order. After
implementing your own MasterObserver, just put it in HBase's classpath
and add the fully qualified class name here.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coprocessor.abortonerror"><glossterm><varname>hbase.coprocessor.abortonerror</varname></glossterm><glossdef><para>Set to true to cause the hosting server (master or regionserver)
to abort if a coprocessor fails to load, fails to initialize, or throws an
unexpected Throwable object. Setting this to false will allow the server to
continue execution but the system wide state of the coprocessor in question
will become inconsistent as it will be properly executing in only a subset
of servers, so this is most useful for debugging only.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.online.schema.update.enable"><glossterm><varname>hbase.online.schema.update.enable</varname></glossterm><glossdef><para>Set true to enable online schema changes.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.table.lock.enable"><glossterm><varname>hbase.table.lock.enable</varname></glossterm><glossdef><para>Set to true to enable locking the table in zookeeper for schema change operations.
Table locking from the master prevents concurrent schema modifications from corrupting table
state.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.table.max.rowsize"><glossterm><varname>hbase.table.max.rowsize</varname></glossterm><glossdef><para>
Maximum size of a single row in bytes (default is 1 GB) when Get'ting
or Scan'ning without the in-row scan flag set. If the row size exceeds this limit, a
RowTooBigException is thrown to client.
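A hypothetical example that doubles the limit to 2 GB:
<programlisting><![CDATA[
<property>
  <name>hbase.table.max.rowsize</name>
  <value>2147483648</value>
</property>]]></programlisting>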
</para><formalpara><title>Default</title><para><varname>1073741824</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.minWorkerThreads"><glossterm><varname>hbase.thrift.minWorkerThreads</varname></glossterm><glossdef><para>The "core size" of the thread pool. New threads are created on every
connection until this many threads are created.</para><formalpara><title>Default</title><para><varname>16</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.maxWorkerThreads"><glossterm><varname>hbase.thrift.maxWorkerThreads</varname></glossterm><glossdef><para>The maximum size of the thread pool. When the pending request queue
overflows, new threads are created until their number reaches this number.
After that, the server starts dropping connections.</para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.maxQueuedRequests"><glossterm><varname>hbase.thrift.maxQueuedRequests</varname></glossterm><glossdef><para>The maximum number of pending Thrift connections waiting in the queue. If
there are no idle threads in the pool, the server queues requests. Only
when the queue overflows, new threads are added, up to
hbase.thrift.maxQueuedRequests threads.</para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.htablepool.size.max"><glossterm><varname>hbase.thrift.htablepool.size.max</varname></glossterm><glossdef><para>The upper bound for the table pool used in the Thrift gateways server.
Since this is per table name, we assume a single table and so with 1000 default
worker threads max this is set to a matching number. For other workloads this number
can be adjusted as needed.
</para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thrift.framed"><glossterm><varname>hbase.regionserver.thrift.framed</varname></glossterm><glossdef><para>Use Thrift TFramedTransport on the server side.
This is the recommended transport for thrift servers and requires a similar setting
on the client side. Changing this to false will select the default transport,
vulnerable to DoS when malformed requests are issued due to THRIFT-601.
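A sketch of enabling the recommended framed transport (remember that clients must use a
matching transport):
<programlisting><![CDATA[
<property>
  <name>hbase.regionserver.thrift.framed</name>
  <value>true</value>
</property>]]></programlisting>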
</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thrift.framed.max_frame_size_in_mb"><glossterm><varname>hbase.regionserver.thrift.framed.max_frame_size_in_mb</varname></glossterm><glossdef><para>Default frame size when using framed transport</para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thrift.compact"><glossterm><varname>hbase.regionserver.thrift.compact</varname></glossterm><glossdef><para>Use Thrift TCompactProtocol binary serialization protocol.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.data.umask.enable"><glossterm><varname>hbase.data.umask.enable</varname></glossterm><glossdef><para>Enable, if true, that file permissions should be assigned
to the files written by the regionserver</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.data.umask"><glossterm><varname>hbase.data.umask</varname></glossterm><glossdef><para>File permissions that should be used to write data
files when hbase.data.umask.enable is true</para><formalpara><title>Default</title><para><varname>000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.metrics.showTableName"><glossterm><varname>hbase.metrics.showTableName</varname></glossterm><glossdef><para>Whether to include the prefix "tbl.tablename" in per-column family metrics.
If true, for each metric M, per-cf metrics will be reported for tbl.T.cf.CF.M, if false,
per-cf metrics will be aggregated by column-family across tables, and reported for cf.CF.M.
In both cases, the aggregated metric M across tables and cfs will be reported.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.metrics.exposeOperationTimes"><glossterm><varname>hbase.metrics.exposeOperationTimes</varname></glossterm><glossdef><para>Whether to report metrics about time taken performing an
operation on the region server. Get, Put, Delete, Increment, and Append can all
have their times exposed through Hadoop metrics per CF and per region.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.snapshot.enabled"><glossterm><varname>hbase.snapshot.enabled</varname></glossterm><glossdef><para>Set to true to allow snapshots to be taken / restored / cloned.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.snapshot.restore.take.failsafe.snapshot"><glossterm><varname>hbase.snapshot.restore.take.failsafe.snapshot</varname></glossterm><glossdef><para>Set to true to take a snapshot before the restore operation.
The snapshot taken will be used in case of failure, to restore the previous state.
At the end of the restore operation this snapshot will be deleted</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.snapshot.restore.failsafe.name"><glossterm><varname>hbase.snapshot.restore.failsafe.name</varname></glossterm><glossdef><para>Name of the failsafe snapshot taken by the restore operation.
You can use the {snapshot.name}, {table.name} and {restore.timestamp} variables
to create a name based on what you are restoring.</para><formalpara><title>Default</title><para><varname>hbase-failsafe-{snapshot.name}-{restore.timestamp}</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.server.compactchecker.interval.multiplier"><glossterm><varname>hbase.server.compactchecker.interval.multiplier</varname></glossterm><glossdef><para>The number that determines how often we scan to see if compaction is necessary.
Normally, compactions are done after some events (such as memstore flush), but if
a region did not receive a lot of writes for some time, or due to different compaction
policies, it may be necessary to check it periodically. The interval between checks is
hbase.server.compactchecker.interval.multiplier multiplied by
hbase.server.thread.wakefrequency.</para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.lease.recovery.timeout"><glossterm><varname>hbase.lease.recovery.timeout</varname></glossterm><glossdef><para>How long we wait on dfs lease recovery in total before giving up.</para><formalpara><title>Default</title><para><varname>900000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.lease.recovery.dfs.timeout"><glossterm><varname>hbase.lease.recovery.dfs.timeout</varname></glossterm><glossdef><para>How long between dfs recover lease invocations. Should be larger than the sum of
|
||||||
|
the time it takes for the namenode to issue a block recovery command as part of
|
||||||
|
datanode; dfs.heartbeat.interval and the time it takes for the primary
|
||||||
|
datanode, performing block recovery to timeout on a dead datanode; usually
|
||||||
|
dfs.client.socket-timeout. See the end of HBASE-8389 for more.</para><formalpara><title>Default</title><para><varname>64000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.column.max.version"><glossterm><varname>hbase.column.max.version</varname></glossterm><glossdef><para>New column family descriptors will use this value as the default number of versions
|
||||||
|
to keep.</para><formalpara><title>Default</title><para><varname>1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.dfs.client.read.shortcircuit.buffer.size"><glossterm><varname>hbase.dfs.client.read.shortcircuit.buffer.size</varname></glossterm><glossdef><para>If the DFSClient configuration
|
||||||
|
dfs.client.read.shortcircuit.buffer.size is unset, we will
|
||||||
|
use what is configured here as the short circuit read default
|
||||||
|
direct byte buffer size. DFSClient native default is 1MB; HBase
|
||||||
|
keeps its HDFS files open so number of file blocks * 1MB soon
|
||||||
|
starts to add up and threaten OOME because of a shortage of
|
||||||
|
direct memory. So, we set it down from the default. Make
|
||||||
|
it > the default hbase block size set in the HColumnDescriptor
|
||||||
|
which is usually 64k.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>131072</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.checksum.verify"><glossterm><varname>hbase.regionserver.checksum.verify</varname></glossterm><glossdef><para>
|
||||||
|
If set to true (the default), HBase verifies the checksums for hfile
|
||||||
|
blocks. HBase writes checksums inline with the data when it writes out
|
||||||
|
hfiles. HDFS (as of this writing) writes checksums to a separate file
|
||||||
|
than the data file necessitating extra seeks. Setting this flag saves
|
||||||
|
some on i/o. Checksum verification by HDFS will be internally disabled
|
||||||
|
on hfile streams when this flag is set. If the hbase-checksum verification
|
||||||
|
fails, we will switch back to using HDFS checksums (so do not disable HDFS
|
||||||
|
checksums! And besides this feature applies to hfiles only, not to WALs).
|
||||||
|
If this parameter is set to false, then hbase will not verify any checksums,
|
||||||
|
instead it will depend on checksum verification being done in the HDFS client.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.bytes.per.checksum"><glossterm><varname>hbase.hstore.bytes.per.checksum</varname></glossterm><glossdef><para>
|
||||||
|
Number of bytes in a newly created checksum chunk for HBase-level
|
||||||
|
checksums in hfile blocks.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>16384</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.checksum.algorithm"><glossterm><varname>hbase.hstore.checksum.algorithm</varname></glossterm><glossdef><para>
|
||||||
|
Name of an algorithm that is used to compute checksums. Possible values
|
||||||
|
are NULL, CRC32, CRC32C.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>CRC32</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.published"><glossterm><varname>hbase.status.published</varname></glossterm><glossdef><para>
|
||||||
|
This setting activates the publication by the master of the status of the region server.
|
||||||
|
When a region server dies and its recovery starts, the master will push this information
|
||||||
|
to the client application, to let them cut the connection immediately instead of waiting
|
||||||
|
for a timeout.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.publisher.class"><glossterm><varname>hbase.status.publisher.class</varname></glossterm><glossdef><para>
|
||||||
|
Implementation of the status publication with a multicast message.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.listener.class"><glossterm><varname>hbase.status.listener.class</varname></glossterm><glossdef><para>
|
||||||
|
Implementation of the status listener with a multicast message.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.multicast.address.ip"><glossterm><varname>hbase.status.multicast.address.ip</varname></glossterm><glossdef><para>
|
||||||
|
Multicast address to use for the status publication by multicast.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>226.1.1.3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.multicast.address.port"><glossterm><varname>hbase.status.multicast.address.port</varname></glossterm><glossdef><para>
|
||||||
|
Multicast port to use for the status publication by multicast.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>16100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.dynamic.jars.dir"><glossterm><varname>hbase.dynamic.jars.dir</varname></glossterm><glossdef><para>
|
||||||
|
The directory from which the custom filter/co-processor jars can be loaded
|
||||||
|
dynamically by the region server without the need to restart. However,
|
||||||
|
an already loaded filter/co-processor class would not be un-loaded. See
|
||||||
|
HBASE-1936 for more details.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>${hbase.rootdir}/lib</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.security.authentication"><glossterm><varname>hbase.security.authentication</varname></glossterm><glossdef><para>
|
||||||
|
Controls whether or not secure authentication is enabled for HBase.
|
||||||
|
Possible values are 'simple' (no authentication), and 'kerberos'.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>simple</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.filter.classes"><glossterm><varname>hbase.rest.filter.classes</varname></glossterm><glossdef><para>
|
||||||
|
Servlet filters for REST service.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.rest.filter.GzipFilter</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.loadbalancer.class"><glossterm><varname>hbase.master.loadbalancer.class</varname></glossterm><glossdef><para>
|
||||||
|
Class used to execute the regions balancing when the period occurs.
|
||||||
|
See the class comment for more on how it works
|
||||||
|
http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html
|
||||||
|
It replaces the DefaultLoadBalancer as the default (since renamed
|
||||||
|
as the SimpleLoadBalancer).
|
||||||
|
</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.security.exec.permission.checks"><glossterm><varname>hbase.security.exec.permission.checks</varname></glossterm><glossdef><para>
|
||||||
|
If this setting is enabled and ACL based access control is active (the
|
||||||
|
AccessController coprocessor is installed either as a system coprocessor
|
||||||
|
or on a table as a table coprocessor) then you must grant all relevant
|
||||||
|
users EXEC privilege if they require the ability to execute coprocessor
|
||||||
|
endpoint calls. EXEC privilege, like any other permission, can be
|
||||||
|
granted globally to a user, or to a user on a per table or per namespace
|
||||||
|
basis. For more information on coprocessor endpoints, see the coprocessor
|
||||||
|
section of the HBase online manual. For more information on granting or
|
||||||
|
revoking permissions using the AccessController, see the security
|
||||||
|
section of the HBase online manual.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.procedure.regionserver.classes"><glossterm><varname>hbase.procedure.regionserver.classes</varname></glossterm><glossdef><para>A comma-separated list of
|
||||||
|
org.apache.hadoop.hbase.procedure.RegionServerProcedureManager procedure managers that are
|
||||||
|
loaded by default on the active HRegionServer process. The lifecycle methods (init/start/stop)
|
||||||
|
will be called by the active HRegionServer process to perform the specific globally barriered
|
||||||
|
procedure. After implementing your own RegionServerProcedureManager, just put it in
|
||||||
|
HBase's classpath and add the fully qualified class name here.
|
||||||
|
</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.procedure.master.classes"><glossterm><varname>hbase.procedure.master.classes</varname></glossterm><glossdef><para>A comma-separated list of
|
||||||
|
org.apache.hadoop.hbase.procedure.MasterProcedureManager procedure managers that are
|
||||||
|
loaded by default on the active HMaster process. A procedure is identified by its signature and
|
||||||
|
users can use the signature and an instant name to trigger an execution of a globally barriered
|
||||||
|
procedure. After implementing your own MasterProcedureManager, just put it in HBase's classpath
|
||||||
|
and add the fully qualified class name here.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coordinated.state.manager.class"><glossterm><varname>hbase.coordinated.state.manager.class</varname></glossterm><glossdef><para>Fully qualified name of class implementing coordinated state manager.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.coordination.ZkCoordinatedStateManager</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.storefile.refresh.period"><glossterm><varname>hbase.regionserver.storefile.refresh.period</varname></glossterm><glossdef><para>
|
||||||
|
The period (in milliseconds) for refreshing the store files for the secondary regions. 0
|
||||||
|
means this feature is disabled. Secondary regions see new files (from flushes and
|
||||||
|
compactions) from primary once the secondary region refreshes the list of files in the
|
||||||
|
region (there is no notification mechanism). But too frequent refreshes might cause
|
||||||
|
extra Namenode pressure. If the files cannot be refreshed for longer than HFile TTL
|
||||||
|
(hbase.master.hfilecleaner.ttl) the requests are rejected. Configuring HFile TTL to a larger
|
||||||
|
value is also recommended with this setting.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.region.replica.replication.enabled"><glossterm><varname>hbase.region.replica.replication.enabled</varname></glossterm><glossdef><para>
|
||||||
|
Whether asynchronous WAL replication to the secondary region replicas is enabled or not.
|
||||||
|
If this is enabled, a replication peer named "region_replica_replication" will be created
|
||||||
|
which will tail the logs and replicate the mutations to region replicas for tables that
|
||||||
|
have region replication > 1. If this is enabled once, disabling this replication also
|
||||||
|
requires disabling the replication peer using shell or ReplicationAdmin java class.
|
||||||
|
Replication to secondary region replicas works over standard inter-cluster replication.
|
||||||
|
So replication, if disabled explicitly, also has to be enabled by setting "hbase.replication"
|
||||||
|
to true for this feature to work.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.http.filter.initializers"><glossterm><varname>hbase.http.filter.initializers</varname></glossterm><glossdef><para>
|
||||||
|
A comma separated list of class names. Each class in the list must extend
|
||||||
|
org.apache.hadoop.hbase.http.FilterInitializer. The corresponding Filter will
|
||||||
|
be initialized. Then, the Filter will be applied to all user facing jsp
|
||||||
|
and servlet web pages.
|
||||||
|
The ordering of the list defines the ordering of the filters.
|
||||||
|
The default StaticUserWebFilter adds a user principal as defined by the
|
||||||
|
hbase.http.staticuser.user property.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.http.lib.StaticUserWebFilter</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.security.visibility.mutations.checkauths"><glossterm><varname>hbase.security.visibility.mutations.checkauths</varname></glossterm><glossdef><para>
|
||||||
|
If enabled, this property will check whether the labels in the visibility expression are associated
|
||||||
|
with the user issuing the mutation
|
||||||
|
</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.http.max.threads"><glossterm><varname>hbase.http.max.threads</varname></glossterm><glossdef><para>
|
||||||
|
The maximum number of threads that the HTTP Server will create in its
|
||||||
|
ThreadPool.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.replication.rpc.codec"><glossterm><varname>hbase.replication.rpc.codec</varname></glossterm><glossdef><para>
|
||||||
|
The codec that is to be used when replication is enabled so that
|
||||||
|
the tags are also replicated. This is used along with HFileV3 which
|
||||||
|
supports tags in them. If tags are not used or if the hfile version used
|
||||||
|
is HFileV2 then KeyValueCodec can be used as the replication codec. Note that
|
||||||
|
using KeyValueCodecWithTags for replication when there are no tags causes no harm.
|
||||||
|
</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.http.staticuser.user"><glossterm><varname>hbase.http.staticuser.user</varname></glossterm><glossdef><para>
|
||||||
|
The user name to filter as, on static web filters
|
||||||
|
while rendering content. An example use is the HDFS
|
||||||
|
web UI (user to be used for browsing files).
|
||||||
|
</para><formalpara><title>Default</title><para><varname>dr.stack</varname></para></formalpara></glossdef></glossentry></glossary>
|
|
@ -0,0 +1,41 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="hbase.history"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<title>HBase History</title>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para>2006: <link xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper published by Google.
|
||||||
|
</para></listitem>
|
||||||
|
<listitem><para>2006 (end of year): HBase development starts.
|
||||||
|
</para></listitem>
|
||||||
|
<listitem><para>2008: HBase becomes a Hadoop sub-project.
|
||||||
|
</para></listitem>
|
||||||
|
<listitem><para>2010: HBase becomes an Apache top-level project.
|
||||||
|
</para></listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</appendix>
|
|
@ -0,0 +1,237 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="hbck.in.depth"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
|
||||||
|
<title>hbck In Depth</title>
|
||||||
|
<para>HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems
|
||||||
|
and repairing a corrupted HBase. It works in two basic modes -- a read-only inconsistency
|
||||||
|
identifying mode and a multi-phase read-write repair mode.
|
||||||
|
</para>
|
||||||
|
<section>
|
||||||
|
<title>Running hbck to identify inconsistencies</title>
|
||||||
|
<para>To check to see if your HBase cluster has corruptions, run hbck against your HBase cluster:</para>
|
||||||
|
<programlisting language="bourne">
|
||||||
|
$ ./bin/hbase hbck
|
||||||
|
</programlisting>
|
||||||
|
<para>
|
||||||
|
At the end of the command's output it prints OK or tells you the number of INCONSISTENCIES
present. You may also want to run hbck a few times because some inconsistencies can be
transient (e.g. the cluster is starting up or a region is splitting). Operationally you may want to run
hbck regularly and set up an alert (e.g. via nagios) if it repeatedly reports inconsistencies.
A run of hbck will report a list of inconsistencies along with a brief description of the regions and
tables affected. Using the <code>-details</code> option will report more details, including a representative
listing of all the splits present in all the tables.
|
||||||
|
</para>
|
||||||
|
<programlisting language="bourne">
|
||||||
|
$ ./bin/hbase hbck -details
|
||||||
|
</programlisting>
|
||||||
|
<para>If you just want to know if some tables are corrupted, you can limit hbck to identify inconsistencies
|
||||||
|
in only specific tables. For example the following command would only attempt to check table
|
||||||
|
TableFoo and TableBar. The benefit is that hbck will run in less time.</para>
|
||||||
|
<programlisting language="bourne">
|
||||||
|
$ ./bin/hbase hbck TableFoo TableBar
|
||||||
|
</programlisting>
|
||||||
|
</section>
|
||||||
|
<section><title>Inconsistencies</title>
|
||||||
|
<para>
|
||||||
|
If after several runs, inconsistencies continue to be reported, you may have encountered a
|
||||||
|
corruption. These should be rare, but in the event they occur newer versions of HBase include
|
||||||
|
the hbck tool enabled with automatic repair options.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
There are two invariants that when violated create inconsistencies in HBase:
|
||||||
|
</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para>HBase’s region consistency invariant is satisfied if every region is assigned and
|
||||||
|
deployed on exactly one region server, and all places where this state is kept are in
accordance.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem><para>HBase’s table integrity invariant is satisfied if for each table, every possible row key
|
||||||
|
resolves to exactly one region.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>
|
||||||
|
Repairs generally work in three phases -- a read-only information gathering phase that identifies
|
||||||
|
inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then
|
||||||
|
finally a region consistency repair phase that restores the region consistency invariant.
|
||||||
|
Starting from version 0.90.0, hbck could detect region consistency problems and report on a subset
of possible table integrity problems. It also included the ability to automatically fix the most
common inconsistencies, namely region assignment and deployment consistency problems. This repair
could be done by using the <code>-fix</code> command line option. These repairs close regions if they are
open on the wrong server or on multiple region servers, and also assign regions to region
servers if they are not open.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options are
|
||||||
|
introduced to aid repairing a corrupted HBase. This hbck sometimes goes by the nickname
|
||||||
|
“uberhbck”. Each particular version of uberhbck is compatible with HBase releases of the same
major version (for example, the 0.90.7 uberhbck can repair a 0.90.4 cluster). However, versions <=0.90.6 and versions
<=0.92.1 may require restarting the master or failing over to a backup master.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section><title>Localized repairs</title>
|
||||||
|
<para>
|
||||||
|
When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first.
|
||||||
|
These are generally region consistency repairs -- localized single region repairs, that only modify
|
||||||
|
in-memory data, ephemeral zookeeper data, or patch holes in the META table.
|
||||||
|
Region consistency requires that the HBase instance has the state of the region’s data in HDFS
|
||||||
|
(.regioninfo files), the region’s row in the hbase:meta table, and the region’s deployment/assignments on
|
||||||
|
region servers and the master in accordance. Options for repairing region consistency include:
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para><code>-fixAssignments</code> (equivalent to the 0.90 <code>-fix</code> option) repairs unassigned, incorrectly
|
||||||
|
assigned or multiply assigned regions.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem><para><code>-fixMeta</code> which removes meta rows when corresponding regions are not present in
|
||||||
|
HDFS and adds new meta rows if the regions are present in HDFS but not in META.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
To fix deployment and assignment problems you can run this command:
|
||||||
|
</para>
|
||||||
|
<programlisting language="bourne">
|
||||||
|
$ ./bin/hbase hbck -fixAssignments
|
||||||
|
</programlisting>
|
||||||
|
<para>To fix deployment and assignment problems as well as repairing incorrect meta rows you can
|
||||||
|
run this command:</para>
|
||||||
|
<programlisting language="bourne">
|
||||||
|
$ ./bin/hbase hbck -fixAssignments -fixMeta
|
||||||
|
</programlisting>
|
||||||
|
<para>There are a few classes of table integrity problems that are low risk repairs. The first two are
|
||||||
|
degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are
|
||||||
|
automatically handled by sidelining the data to a temporary directory (/hbck/xxxx).
|
||||||
|
The third low-risk class is hdfs region holes. This can be repaired by using the:</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para><code>-fixHdfsHoles</code> option for fabricating new empty regions on the file system.
|
||||||
|
If holes are detected you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new region consistent.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<programlisting language="bourne">
|
||||||
|
$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
|
||||||
|
</programlisting>
|
||||||
|
<para>Since this is a common operation, we’ve added the <code>-repairHoles</code> flag that is equivalent to the
|
||||||
|
previous command:</para>
|
||||||
|
<programlisting language="bourne">
|
||||||
|
$ ./bin/hbase hbck -repairHoles
|
||||||
|
</programlisting>
|
||||||
|
<para>If inconsistencies still remain after these steps, you most likely have table integrity problems
|
||||||
|
related to orphaned or overlapping regions.</para>
|
||||||
|
</section>
|
||||||
|
<section><title>Region Overlap Repairs</title>
|
||||||
|
<para>Table integrity problems can require repairs that deal with overlaps. This is a riskier operation
|
||||||
|
because it requires modifications to the file system, requires some decision making, and may
|
||||||
|
require some manual steps. For these repairs it is best to analyze the output of a <code>hbck -details</code>
run so that you isolate repair attempts to only the problems the checks identify. Because this is
riskier, there are safeguards that should be used to limit the scope of the repairs.
WARNING: These repair features are relatively new and have only been tested on online but idle HBase instances
(no reads/writes). Use at your own risk in an active production environment!
|
||||||
|
The options for repairing table integrity violations include:</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para><code>-fixHdfsOrphans</code> option for “adopting” a region directory that is missing a region
|
||||||
|
metadata file (the .regioninfo file).</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem><para><code>-fixHdfsOverlaps</code> option for fixing overlapping regions</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>When repairing overlapping regions, a region’s data can be modified on the file system in two
|
||||||
|
ways: 1) by merging regions into a larger region or 2) by sidelining regions by moving data to
|
||||||
|
a “sideline” directory from which the data can be restored later. Merging a large number of regions is
|
||||||
|
technically correct but could result in an extremely large region that requires series of costly
|
||||||
|
compactions and splitting operations. In these cases, it is probably better to sideline the regions
|
||||||
|
that overlap with the most other regions (likely the largest ranges) so that merges can happen on
|
||||||
|
a more reasonable scale. Since these sidelined regions are already laid out in HBase’s native
|
||||||
|
directory and HFile format, they can be restored by using HBase’s bulk load mechanism.
|
||||||
|
The default safeguard thresholds are conservative. These options let you override the default
|
||||||
|
thresholds and enable the large region sidelining feature; an illustrative command combining them follows the list.</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para><code>-maxMerge <n></code> maximum number of overlapping regions to merge</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem><para><code>-sidelineBigOverlaps</code> if more than maxMerge regions are overlapping, attempt
|
||||||
|
to sideline the regions overlapping with the most other regions.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem><para><code>-maxOverlapsToSideline <n></code> if sidelining large overlapping regions, sideline at most n
|
||||||
|
regions.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
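<para>As an illustrative sketch only (the threshold values and table name here are assumptions,
not recommendations), a repair run that enables large-overlap sidelining with explicit thresholds
might look like:</para>
<screen language="bourne">
$ ./bin/hbase hbck -fixHdfsOverlaps -maxMerge 5 -sidelineBigOverlaps -maxOverlapsToSideline 2 TableFoo
</screen>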
|
||||||
|
|
||||||
|
<para>Since oftentimes you would just want to get the tables repaired, you can use this option to turn
|
||||||
|
on all repair options:</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para><code>-repair</code> includes all the region consistency options and only the hole repairing table
|
||||||
|
integrity options.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
<para>Finally, there are safeguards to limit repairs to only specific tables. For example the following
|
||||||
|
command would only attempt to check and repair the tables TableFoo and TableBar.</para>
|
||||||
|
<screen language="bourne">
|
||||||
|
$ ./bin/hbase hbck -repair TableFoo TableBar
|
||||||
|
</screen>
|
||||||
|
<section><title>Special cases: Meta is not properly assigned</title>
|
||||||
|
<para>There are a few special cases that hbck can handle as well.
|
||||||
|
Sometimes the meta table’s only region is inconsistently assigned or deployed. In this case
|
||||||
|
there is a special <code>-fixMetaOnly</code> option that can try to fix meta assignments.</para>
|
||||||
|
<screen language="bourne">
|
||||||
|
$ ./bin/hbase hbck -fixMetaOnly -fixAssignments
|
||||||
|
</screen>
|
||||||
|
</section>
|
||||||
|
<section><title>Special cases: HBase version file is missing</title>
|
||||||
|
<para>HBase’s data on the file system requires a version file in order to start. If this file is missing, you
|
||||||
|
can use the <code>-fixVersionFile</code> option to fabricate a new HBase version file. This assumes that
|
||||||
|
the version of hbck you are running is the appropriate version for the HBase cluster.</para>
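<para>A minimal illustrative invocation (assuming the standard hbck entry point shown earlier) is:</para>
<screen language="bourne">
$ ./bin/hbase hbck -fixVersionFile
</screen>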
|
||||||
|
</section>
|
||||||
|
<section><title>Special case: Root and META are corrupt.</title>
|
||||||
|
<para>The most drastic corruption scenario is the case where the ROOT or META is corrupted and
|
||||||
|
HBase will not start. In this case you can use the OfflineMetaRepair tool to create new ROOT
|
||||||
|
and META regions and tables.
|
||||||
|
This tool assumes that HBase is offline. It then marches through the existing HBase home
|
||||||
|
directory, loads as much information from region metadata files (.regioninfo files) as possible
|
||||||
|
from the file system. If the region metadata has proper table integrity, it sidelines the original root
|
||||||
|
and meta table directories, and builds new ones with pointers to the region directories and their
|
||||||
|
data.</para>
|
||||||
|
<screen language="bourne">
|
||||||
|
$ ./bin/hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
|
||||||
|
</screen>
|
||||||
|
<para>NOTE: This tool is not as clever as uberhbck but can be used to bootstrap repairs that uberhbck
|
||||||
|
can complete.
|
||||||
|
If the tool succeeds, you should be able to start HBase and run online repairs if necessary.</para>
|
||||||
|
</section>
|
||||||
|
<section><title>Special cases: Offline split parent</title>
|
||||||
|
<para>
|
||||||
|
Once a region is split, the offline parent will be cleaned up automatically. Sometimes, daughter regions
|
||||||
|
are split again before their parents are cleaned up. HBase can clean up parents in the right order. However,
|
||||||
|
there could still be some lingering offline split parents. They are in META, in HDFS, and not deployed.
|
||||||
|
But HBase can't clean them up. In this case, you can use the <code>-fixSplitParents</code> option to reset
|
||||||
|
them in META to be online and not split. Then hbck can merge them with other regions if the
option to fix overlapping regions is used.
|
||||||
|
</para>
|
||||||
|
<para>
|
||||||
|
This option should not normally be used, and it is not in <code>-fixAll</code>.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</appendix>
|
|
@ -0,0 +1,630 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<chapter
|
||||||
|
xml:id="mapreduce"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
|
||||||
|
<title>HBase and MapReduce</title>
|
||||||
|
<para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
|
||||||
|
the framework used most often with <link
|
||||||
|
xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
|
||||||
|
scope of this document. A good place to get started with MapReduce is <link
|
||||||
|
xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
|
||||||
|
2 (MR2) is now part of <link
|
||||||
|
xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
|
||||||
|
|
||||||
|
<para> This chapter discusses specific configuration steps you need to take to use MapReduce on
|
||||||
|
data within HBase. In addition, it discusses other interactions and issues between HBase and
|
||||||
|
MapReduce jobs.
|
||||||
|
<note>
|
||||||
|
<title>mapred and mapreduce</title>
|
||||||
|
<para>There are two mapreduce packages in HBase as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
|
||||||
|
and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former uses the old-style API and the latter
|
||||||
|
the new style. The latter has more facilities, though you can usually find an equivalent in the older
|
||||||
|
package. Pick the package that goes with your mapreduce deploy. When in doubt or starting over, pick the
|
||||||
|
<filename>org.apache.hadoop.hbase.mapreduce</filename>. In the notes below, we refer to
|
||||||
|
o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if that is what you are using.
|
||||||
|
</para>
|
||||||
|
</note>
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="hbase.mapreduce.classpath">
|
||||||
|
<title>HBase, MapReduce, and the CLASSPATH</title>
|
||||||
|
<para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
|
||||||
|
the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
|
||||||
|
<para>To give the MapReduce jobs the access they need, you could add
<filename>hbase-site.xml</filename> to the
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
directory, then copy these changes across your cluster. Alternatively, you could edit
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
the HBase dependencies to the <envar>HADOOP_CLASSPATH</envar> variable. However, neither approach is
recommended, because it will pollute your Hadoop install with HBase references. It also
requires you to restart the Hadoop cluster before Hadoop can use the HBase data.</para>
|
||||||
|
<para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
|
||||||
|
dependencies only need to be available on the local CLASSPATH. The following example runs
|
||||||
|
the bundled HBase <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
|
||||||
|
the environment variables expected in the command (the parts prefixed by a
|
||||||
|
<literal>$</literal> sign and curly braces), you can use the actual system paths instead.
|
||||||
|
Be sure to use the correct version of the HBase JAR for your system. The backticks
|
||||||
|
(<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
|
||||||
|
CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
|
||||||
|
<screen language="bourne">$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable</userinput></screen>
|
||||||
|
<para>When the command runs, internally, the HBase JAR finds the dependencies it needs for
|
||||||
|
zookeeper, guava, and its other dependencies on the passed <envar>HADOOP_CLASSPATH</envar>
|
||||||
|
and adds the JARs to the MapReduce job configuration. See the source at
|
||||||
|
TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
|
||||||
|
<note>
|
||||||
|
<para> The example may not work if you are running HBase from its build directory rather
|
||||||
|
than an installed location. You may see an error like the following:</para>
|
||||||
|
<screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
|
||||||
|
<para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
|
||||||
|
from the <filename>target/</filename> directory within the build environment.</para>
|
||||||
|
<screen language="bourne">$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable</userinput></screen>
|
||||||
|
</note>
|
||||||
|
<caution>
|
||||||
|
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
|
||||||
|
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
|
||||||
|
to the following:</para>
|
||||||
|
<screen>
|
||||||
|
Exception in thread "main" java.lang.IllegalAccessError: class
|
||||||
|
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
|
||||||
|
com.google.protobuf.LiteralByteString
|
||||||
|
at java.lang.ClassLoader.defineClass1(Native Method)
|
||||||
|
at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
|
||||||
|
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
|
||||||
|
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
|
||||||
|
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
|
||||||
|
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
|
||||||
|
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
|
||||||
|
at java.security.AccessController.doPrivileged(Native Method)
|
||||||
|
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
|
||||||
|
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
|
||||||
|
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
|
||||||
|
at
|
||||||
|
org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
|
||||||
|
at
|
||||||
|
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
|
||||||
|
at
|
||||||
|
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
|
||||||
|
at
|
||||||
|
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
|
||||||
|
at
|
||||||
|
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
|
||||||
|
at
|
||||||
|
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
|
||||||
|
...
|
||||||
|
</screen>
|
||||||
|
<para>This is caused by an optimization introduced in <link
|
||||||
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
|
||||||
|
inadvertently introduced a classloader dependency. </para>
|
||||||
|
<para>This affects both jobs using the <code>-libjars</code> option and "fat jar" jobs, those
|
||||||
|
which package their runtime dependencies in a nested <code>lib</code> folder.</para>
|
||||||
|
<para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
|
||||||
|
included in Hadoop's classpath. See <xref
|
||||||
|
linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
|
||||||
|
classpath errors. The following is included for historical purposes.</para>
|
||||||
|
<para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
|
||||||
|
hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
|
||||||
|
<para>This can also be achieved on a per-job launch basis by including it in the
|
||||||
|
<code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
|
||||||
|
launching jobs that package their dependencies, all three of the following job launching
|
||||||
|
commands satisfy this requirement:</para>
|
||||||
|
<screen language="bourne">
|
||||||
|
$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
|
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
|
$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
|
</screen>
|
||||||
|
<para>For jars that do not package their dependencies, the following command structure is
|
||||||
|
necessary:</para>
|
||||||
|
<screen language="bourne">
|
||||||
|
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
|
||||||
|
</screen>
|
||||||
|
<para>See also <link
|
||||||
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
|
||||||
|
further discussion of this issue.</para>
|
||||||
|
</caution>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>MapReduce Scan Caching</title>
|
||||||
|
<para>TableMapReduceUtil now restores the option to set scanner caching (the number of rows
|
||||||
|
which are cached before returning the result to the client) on the Scan object that is
|
||||||
|
passed in. This functionality was lost due to a bug in HBase 0.95 (<link
|
||||||
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-11558">HBASE-11558</link>), which
|
||||||
|
is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is
|
||||||
|
as follows:</para>
|
||||||
|
<orderedlist>
|
||||||
|
<listitem>
|
||||||
|
<para>Caching settings which are set on the scan object.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Caching settings which are specified via the configuration option
|
||||||
|
<option>hbase.client.scanner.caching</option>, which can either be set manually in
|
||||||
|
<filename>hbase-site.xml</filename> or via the helper method
|
||||||
|
<code>TableMapReduceUtil.setScannerCaching()</code>.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>The default value <code>HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING</code>, which is set to
|
||||||
|
<literal>100</literal>.</para>
|
||||||
|
</listitem>
|
||||||
|
</orderedlist>
|
||||||
|
<para>Optimizing the caching settings is a balance between the time the client waits for a
|
||||||
|
result and the number of sets of results the client needs to receive. If the caching setting
|
||||||
|
is too large, the client could end up waiting for a long time or the request could even time
|
||||||
|
out. If the setting is too small, the scan needs to return results in several pieces.
|
||||||
|
If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger
|
||||||
|
shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the
|
||||||
|
bucket.</para>
|
||||||
|
<para>The list of priorities mentioned above allows you to set a reasonable default, and
|
||||||
|
override it for specific operations.</para>
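<para>As a minimal illustrative sketch (the variable names and values here are assumptions, not
part of the guide), the first two priority levels above can be set from job setup code as
follows; the Scan-level value, if present, wins over the job-level helper:</para>
<programlisting language="java">
Scan scan = new Scan();
scan.setCaching(500);                            // priority 1: set on the Scan object itself
TableMapReduceUtil.setScannerCaching(job, 200);  // priority 2: job-level hbase.client.scanner.caching
</programlisting>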
|
||||||
|
<para>See the API documentation for <link
|
||||||
|
xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html"
|
||||||
|
>Scan</link> for more details.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Bundled HBase MapReduce Jobs</title>
|
||||||
|
<para>The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about
|
||||||
|
the bundled MapReduce jobs, run the following command.</para>
|
||||||
|
|
||||||
|
<screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar</userinput>
|
||||||
|
<computeroutput>An example program must be given as the first argument.
|
||||||
|
Valid program names are:
|
||||||
|
copytable: Export a table from local cluster to peer cluster
|
||||||
|
completebulkload: Complete a bulk data load.
|
||||||
|
export: Write table data to HDFS.
|
||||||
|
import: Import data written by Export.
|
||||||
|
importtsv: Import data in TSV format.
|
||||||
|
rowcounter: Count rows in HBase table</computeroutput>
|
||||||
|
</screen>
|
||||||
|
<para>Each of the valid program names is a bundled MapReduce job. To run one of the jobs,
|
||||||
|
model your command after the following example.</para>
|
||||||
|
<screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable</userinput></screen>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>HBase as a MapReduce Job Data Source and Data Sink</title>
|
||||||
|
<para>HBase can be used as a data source, <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
|
||||||
|
and data sink, <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
|
||||||
|
or <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
|
||||||
|
for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to
|
||||||
|
subclass <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
|
||||||
|
and/or <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
|
||||||
|
See the do-nothing pass-through classes <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
|
||||||
|
and <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
|
||||||
|
for basic usage. For a more involved example, see <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
|
||||||
|
<para>If you run MapReduce jobs that use HBase as a source or sink, you need to specify the source and
|
||||||
|
sink table and column names in your configuration.</para>
|
||||||
|
|
||||||
|
<para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
from HBase and makes a map task per region, or <code>mapreduce.job.maps</code> map tasks,
whichever is smaller. If your job only has two maps, raise <code>mapreduce.job.maps</code> to a
number greater than the number of regions. Maps will run on the adjacent TaskTracker if you are
running a TaskTracker and RegionServer per node. When writing to HBase, it may make sense to
avoid the Reduce step and write back into HBase from within your map. This approach works when
your job does not need the sort and collation that MapReduce does on the map-emitted data. On
insert, HBase 'sorts' so there is no point double-sorting (and shuffling data around your
MapReduce cluster) unless you need to. If you do not need the Reduce, your map might emit counts
of records processed for reporting at the end of the job, or you can set the number of Reduces to
zero and use TableOutputFormat. If running the Reduce step makes sense in your case, you should
typically use multiple reducers so that load is spread across the HBase cluster.</para>
|
||||||
|
|
||||||
|
<para>A new HBase partitioner, the <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
|
||||||
|
can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
|
||||||
|
when your table is large and your upload will not greatly alter the number of existing
|
||||||
|
regions upon completion. Otherwise use the default partitioner. </para>
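<para>As a minimal illustrative sketch (the table variable and reducer class names are
assumptions), the partitioner can be wired in through the <classname>TableMapReduceUtil</classname>
helper when setting up the reduce side of the job:</para>
<programlisting language="java">
TableMapReduceUtil.initTableReducerJob(
  targetTable,               // output table
  MyTableReducer.class,      // reducer class (assumed)
  job,
  HRegionPartitioner.class); // partition reduce output by target region
</programlisting>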
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Writing HFiles Directly During Bulk Import</title>
|
||||||
|
<para>If you are importing into a new table, you can bypass the HBase API and write your
|
||||||
|
content directly to the filesystem, formatted into HBase data files (HFiles). Your import
|
||||||
|
will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
|
||||||
|
see <xref
|
||||||
|
linkend="arch.bulk.load" />.</para>
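<para>As a rough sketch only, and assuming an HBase version in which
<code>HFileOutputFormat2.configureIncrementalLoad(Job, HTable)</code> is available (the class,
table, and path names below are illustrative), job setup for writing HFiles looks roughly like
this; see the bulk load section for the complete procedure, including loading the generated
files:</para>
<programlisting language="java">
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "BulkImportPrep");
job.setJarByClass(MyBulkImportJob.class);                 // assumed driver class containing the mapper
HTable table = new HTable(config, "myNewTable");          // the (ideally pre-split) target table
HFileOutputFormat2.configureIncrementalLoad(job, table);  // sets output format, reducer, and partitioner
FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles-out"));
</programlisting>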
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>RowCounter Example</title>
|
||||||
|
<para>The included <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
|
||||||
|
table. To run it, use the following command: </para>
|
||||||
|
<screen language="bourne">$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen>
|
||||||
|
<para>This will
|
||||||
|
invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
|
||||||
|
offered. This will print rowcounter usage advice to standard output. Specify the tablename,
|
||||||
|
column to count, and output
|
||||||
|
directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="splitter">
|
||||||
|
<title>Map-Task Splitting</title>
|
||||||
|
<section
|
||||||
|
xml:id="splitter.default">
|
||||||
|
<title>The Default HBase MapReduce Splitter</title>
|
||||||
|
<para>When <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
|
||||||
|
is used to source an HBase table in a MapReduce job, its splitter will make a map task for
|
||||||
|
each region of the table. Thus, if there are 100 regions in the table, there will be 100
|
||||||
|
map-tasks for the job - regardless of how many column families are selected in the
|
||||||
|
Scan.</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="splitter.custom">
|
||||||
|
<title>Custom Splitters</title>
|
||||||
|
<para>For those interested in implementing custom splitters, see the method
|
||||||
|
<code>getSplits</code> in <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
|
||||||
|
That is where the logic for map-task assignment resides. </para>
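<para>As a minimal illustrative sketch (the class name is an assumption), a custom splitter can
start from the default one-split-per-region behavior and then adjust the list before returning
it:</para>
<programlisting language="java">
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class MyTableInputFormat extends TableInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = super.getSplits(context); // default: one split per region
    // inspect, filter, or further subdivide the splits here
    return splits;
  }
}
</programlisting>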
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example">
|
||||||
|
<title>HBase MapReduce Examples</title>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.read">
|
||||||
|
<title>HBase MapReduce Read Example</title>
|
||||||
|
<para>The following is an example of using HBase as a MapReduce source in a read-only manner.
|
||||||
|
Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
|
||||||
|
the Mapper. The job would be defined as follows...</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
Configuration config = HBaseConfiguration.create();
|
||||||
|
Job job = new Job(config, "ExampleRead");
|
||||||
|
job.setJarByClass(MyReadJob.class); // class that contains mapper
|
||||||
|
|
||||||
|
Scan scan = new Scan();
|
||||||
|
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
|
||||||
|
scan.setCacheBlocks(false); // don't set to true for MR jobs
|
||||||
|
// set other scan attrs
|
||||||
|
...
|
||||||
|
|
||||||
|
TableMapReduceUtil.initTableMapperJob(
|
||||||
|
tableName, // input HBase table name
|
||||||
|
scan, // Scan instance to control CF and attribute selection
|
||||||
|
MyMapper.class, // mapper
|
||||||
|
null, // mapper output key
|
||||||
|
null, // mapper output value
|
||||||
|
job);
|
||||||
|
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper
|
||||||
|
|
||||||
|
boolean b = job.waitForCompletion(true);
|
||||||
|
if (!b) {
|
||||||
|
throw new IOException("error with job!");
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
<para>...and the mapper instance would extend <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static class MyMapper extends TableMapper<Text, Text> {
|
||||||
|
|
||||||
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
|
||||||
|
// process data for the row from the Result instance.
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.readwrite">
|
||||||
|
<title>HBase MapReduce Read/Write Example</title>
|
||||||
|
<para>The following is an example of using HBase both as a source and as a sink with
|
||||||
|
MapReduce. This example will simply copy data from one table to another.</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
Configuration config = HBaseConfiguration.create();
|
||||||
|
Job job = new Job(config,"ExampleReadWrite");
|
||||||
|
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
|
||||||
|
|
||||||
|
Scan scan = new Scan();
|
||||||
|
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
|
||||||
|
scan.setCacheBlocks(false); // don't set to true for MR jobs
|
||||||
|
// set other scan attrs
|
||||||
|
|
||||||
|
TableMapReduceUtil.initTableMapperJob(
|
||||||
|
sourceTable, // input table
|
||||||
|
scan, // Scan instance to control CF and attribute selection
|
||||||
|
MyMapper.class, // mapper class
|
||||||
|
null, // mapper output key
|
||||||
|
null, // mapper output value
|
||||||
|
job);
|
||||||
|
TableMapReduceUtil.initTableReducerJob(
|
||||||
|
targetTable, // output table
|
||||||
|
null, // reducer class
|
||||||
|
job);
|
||||||
|
job.setNumReduceTasks(0);
|
||||||
|
|
||||||
|
boolean b = job.waitForCompletion(true);
|
||||||
|
if (!b) {
|
||||||
|
throw new IOException("error with job!");
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
<para>It is worth explaining what <classname>TableMapReduceUtil</classname> is doing,
|
||||||
|
especially with the reducer. <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
|
||||||
|
is being used as the outputFormat class, and several parameters are being set on the
|
||||||
|
config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
|
||||||
|
to <classname>ImmutableBytesWritable</classname> and reducer value to
|
||||||
|
<classname>Writable</classname>. These could be set by the programmer on the job and
|
||||||
|
conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
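        <para>For reference, the following is roughly the manual equivalent of the
          <code>initTableReducerJob</code> call above. This is a simplified sketch; the real
          method also takes care of details such as adding HBase jars and configuration to the
          job, so prefer <classname>TableMapReduceUtil</classname> in practice.</para>
        <programlisting language="java">
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, targetTable);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Writable.class);
        </programlisting>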
|
||||||
|
<para>The following is the example mapper, which will create a <classname>Put</classname>
|
||||||
|
matching the input <classname>Result</classname> and emit it. Note: this is what the
|
||||||
|
CopyTable utility does. </para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
|
||||||
|
|
||||||
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
|
||||||
|
// this example is just copying the data from the source table...
|
||||||
|
context.write(row, resultToPut(row,value));
|
||||||
|
}
|
||||||
|
|
||||||
|
private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
|
||||||
|
Put put = new Put(key.get());
|
||||||
|
for (KeyValue kv : result.raw()) {
|
||||||
|
put.add(kv);
|
||||||
|
}
|
||||||
|
return put;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
|
||||||
|
care of sending the <classname>Put</classname> to the target table. </para>
|
||||||
|
<para>This is just an example; developers could choose not to use
|
||||||
|
<classname>TableOutputFormat</classname> and connect to the target table themselves.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.readwrite.multi">
|
||||||
|
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
|
||||||
|
<para>TODO: complete example for <classname>MultiTableOutputFormat</classname>; a rough sketch follows. </para>
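      <para>The following sketch shows the general shape of a job that writes to more than one
        table. The table names <code>tableA</code> and <code>tableB</code> and the routing rule
        are illustrative only. <classname>MultiTableOutputFormat</classname> is set as the output
        format, and the mapper (or reducer) selects the destination table for each write by
        emitting the table name as the output key.</para>
      <programlisting language="java">
// In the job setup, after initTableMapperJob(sourceTable, scan, MyMultiTableMapper.class,
// ImmutableBytesWritable.class, Put.class, job):
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setNumReduceTasks(0);

public static class MyMultiTableMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private static final ImmutableBytesWritable TABLE_A = new ImmutableBytesWritable("tableA".getBytes());
  private static final ImmutableBytesWritable TABLE_B = new ImmutableBytesWritable("tableB".getBytes());

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    Put put = new Put(row.get());
    for (KeyValue kv : value.raw()) {
      put.add(kv);
    }
    // The output key names the destination table; route rows however makes sense for the job.
    ImmutableBytesWritable target = (row.get()[0] % 2 == 0) ? TABLE_A : TABLE_B;
    context.write(target, put);
  }
}
      </programlisting>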
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.summary">
|
||||||
|
<title>HBase MapReduce Summary to HBase Example</title>
|
||||||
|
<para>The following example uses HBase as a MapReduce source and sink with a summarization
|
||||||
|
step. This example will count the occurrences of each distinct value in a table and
|
||||||
|
write those summarized counts to another table.
|
||||||
|
<programlisting language="java">
|
||||||
|
Configuration config = HBaseConfiguration.create();
|
||||||
|
Job job = new Job(config,"ExampleSummary");
|
||||||
|
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
|
||||||
|
|
||||||
|
Scan scan = new Scan();
|
||||||
|
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
|
||||||
|
scan.setCacheBlocks(false); // don't set to true for MR jobs
|
||||||
|
// set other scan attrs
|
||||||
|
|
||||||
|
TableMapReduceUtil.initTableMapperJob(
|
||||||
|
sourceTable, // input table
|
||||||
|
scan, // Scan instance to control CF and attribute selection
|
||||||
|
MyMapper.class, // mapper class
|
||||||
|
Text.class, // mapper output key
|
||||||
|
IntWritable.class, // mapper output value
|
||||||
|
job);
|
||||||
|
TableMapReduceUtil.initTableReducerJob(
|
||||||
|
targetTable, // output table
|
||||||
|
MyTableReducer.class, // reducer class
|
||||||
|
job);
|
||||||
|
job.setNumReduceTasks(1); // at least one, adjust as required
|
||||||
|
|
||||||
|
boolean b = job.waitForCompletion(true);
|
||||||
|
if (!b) {
|
||||||
|
throw new IOException("error with job!");
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
In this example mapper, a column with a String value is chosen as the value to summarize
|
||||||
|
upon. This value is used as the key to emit from the mapper, and an
|
||||||
|
<classname>IntWritable</classname> represents an instance counter.
|
||||||
|
<programlisting language="java">
|
||||||
|
public static class MyMapper extends TableMapper<Text, IntWritable> {
|
||||||
|
public static final byte[] CF = "cf".getBytes();
|
||||||
|
public static final byte[] ATTR1 = "attr1".getBytes();
|
||||||
|
|
||||||
|
private final IntWritable ONE = new IntWritable(1);
|
||||||
|
private Text text = new Text();
|
||||||
|
|
||||||
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
|
||||||
|
String val = new String(value.getValue(CF, ATTR1));
|
||||||
|
text.set(val); // we can only emit Writables...
|
||||||
|
|
||||||
|
context.write(text, ONE);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
In the reducer, the "ones" are counted (just like any other MR example that does this),
|
||||||
|
and a <classname>Put</classname> is then emitted.
|
||||||
|
<programlisting language="java">
|
||||||
|
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
|
||||||
|
public static final byte[] CF = "cf".getBytes();
|
||||||
|
public static final byte[] COUNT = "count".getBytes();
|
||||||
|
|
||||||
|
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
|
||||||
|
int i = 0;
|
||||||
|
for (IntWritable val : values) {
|
||||||
|
i += val.get();
|
||||||
|
}
|
||||||
|
Put put = new Put(Bytes.toBytes(key.toString()));
|
||||||
|
put.add(CF, COUNT, Bytes.toBytes(i));
|
||||||
|
|
||||||
|
context.write(null, put);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.summary.file">
|
||||||
|
<title>HBase MapReduce Summary to File Example</title>
|
||||||
|
<para>This is very similar to the summary example above, with the exception that it uses
|
||||||
|
HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
|
||||||
|
in the reducer. The mapper remains the same. </para>
|
||||||
|
<programlisting language="java">
|
||||||
|
Configuration config = HBaseConfiguration.create();
|
||||||
|
Job job = new Job(config,"ExampleSummaryToFile");
|
||||||
|
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
|
||||||
|
|
||||||
|
Scan scan = new Scan();
|
||||||
|
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
|
||||||
|
scan.setCacheBlocks(false); // don't set to true for MR jobs
|
||||||
|
// set other scan attrs
|
||||||
|
|
||||||
|
TableMapReduceUtil.initTableMapperJob(
|
||||||
|
sourceTable, // input table
|
||||||
|
scan, // Scan instance to control CF and attribute selection
|
||||||
|
MyMapper.class, // mapper class
|
||||||
|
Text.class, // mapper output key
|
||||||
|
IntWritable.class, // mapper output value
|
||||||
|
job);
|
||||||
|
job.setReducerClass(MyReducer.class); // reducer class
|
||||||
|
job.setNumReduceTasks(1); // at least one, adjust as required
|
||||||
|
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile")); // adjust directories as required
|
||||||
|
|
||||||
|
boolean b = job.waitForCompletion(true);
|
||||||
|
if (!b) {
|
||||||
|
throw new IOException("error with job!");
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
<para>As stated above, the previous Mapper can run unchanged with this example. As for the
|
||||||
|
Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting
|
||||||
|
Puts.</para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
||||||
|
|
||||||
|
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
|
||||||
|
int i = 0;
|
||||||
|
for (IntWritable val : values) {
|
||||||
|
i += val.get();
|
||||||
|
}
|
||||||
|
context.write(key, new IntWritable(i));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</programlisting>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.summary.noreducer">
|
||||||
|
<title>HBase MapReduce Summary to HBase Without Reducer</title>
|
||||||
|
<para>It is also possible to perform summaries without a reducer, by letting HBase itself do the
|
||||||
|
aggregation through atomic increments. </para>
|
||||||
|
<para>An HBase target table would need to exist for the job summary. The Table method
|
||||||
|
<code>incrementColumnValue</code> would be used to atomically increment values. From a
|
||||||
|
performance perspective, it might make sense to keep a Map of values with their counts to
|
||||||
|
be incremented for each map-task, and make one update per key during the <code>
|
||||||
|
cleanup</code> method of the mapper. However, your mileage may vary depending on the
|
||||||
|
number of rows to be processed and the number of unique keys. </para>
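      <para>A minimal sketch of such a mapper follows. The target table name
        <code>summary</code> and the column names are assumptions for illustration, nothing is
        emitted from the mapper itself, and error handling is omitted. The job would be set up
        like the read example above, with <classname>NullOutputFormat</classname> and zero
        reduce tasks.</para>
      <programlisting language="java">
public static class MySummaryDirectMapper extends TableMapper<Text, IntWritable> {

  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  private Connection connection;
  private Table summaryTable;                              // pre-existing target table
  private Map<String, Long> counts = new HashMap<String, Long>();

  public void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    summaryTable = connection.getTable(TableName.valueOf("summary"));
  }

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // Tally in memory; nothing is written per row.
    String val = new String(value.getValue(CF, ATTR1));
    Long current = counts.get(val);
    counts.put(val, current == null ? 1L : current + 1);
  }

  public void cleanup(Context context) throws IOException {
    // One atomic increment per distinct key, rather than one write per input row.
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(entry.getKey()), CF, COUNT, entry.getValue());
    }
    summaryTable.close();
    connection.close();
  }
}
      </programlisting>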
|
||||||
|
<para>In the end, the summary results are in HBase. </para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.summary.rdbms">
|
||||||
|
<title>HBase MapReduce Summary to RDBMS</title>
|
||||||
|
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
|
||||||
|
it is possible to generate summaries directly to an RDBMS via a custom reducer. The
|
||||||
|
<code>setup</code> method can connect to an RDBMS (the connection information can be
|
||||||
|
passed via custom parameters in the context) and the cleanup method can close the
|
||||||
|
connection. </para>
|
||||||
|
<para>It is critical to understand that the number of reducers for the job affects the
|
||||||
|
summarization implementation, and you'll have to design this into your reducer.
|
||||||
|
Specifically, decide whether it is designed to run as a singleton (one reducer) or as multiple
|
||||||
|
reducers. Neither is right or wrong; it depends on your use-case. Recognize that the more
|
||||||
|
reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
|
||||||
|
be created - this will scale, but only to a point. </para>
|
||||||
|
<programlisting language="java">
|
||||||
|
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
||||||
|
|
||||||
|
private Connection c = null;
|
||||||
|
|
||||||
|
public void setup(Context context) {
|
||||||
|
// create DB connection...
|
||||||
|
}
|
||||||
|
|
||||||
|
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
|
||||||
|
// do summarization
|
||||||
|
// in this example the keys are Text, but this is just an example
|
||||||
|
}
|
||||||
|
|
||||||
|
public void cleanup(Context context) {
|
||||||
|
// close db connection
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
||||||
|
</programlisting>
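      <para>For example, the <code>setup</code> method might read JDBC connection parameters from
        the job configuration. The <code>myjob.rdbms.*</code> property names below are
        hypothetical; they would be set on the configuration by the job driver.</para>
      <programlisting language="java">
  public void setup(Context context) {
    Configuration conf = context.getConfiguration();
    try {
      // e.g. the driver called conf.set("myjob.rdbms.url", "jdbc:...") before submitting the job
      c = DriverManager.getConnection(
          conf.get("myjob.rdbms.url"),
          conf.get("myjob.rdbms.user"),
          conf.get("myjob.rdbms.password"));
    } catch (SQLException e) {
      throw new RuntimeException("Could not connect to the RDBMS", e);
    }
  }
      </programlisting>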
|
||||||
|
<para>In the end, the summary results are written to your RDBMS table(s). </para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</section>
|
||||||
|
<!-- mr examples -->
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.htable.access">
|
||||||
|
<title>Accessing Other HBase Tables in a MapReduce Job</title>
|
||||||
|
<para>Although the framework currently allows one HBase table as input to a MapReduce job,
|
||||||
|
other HBase tables can be accessed as lookup tables, etc., in the same job by creating
|
||||||
|
a Table instance in the setup method of the Mapper.
|
||||||
|
<programlisting language="java">public class MyMapper extends TableMapper<Text, LongWritable> {
|
||||||
|
  private Connection connection;
  private Table myOtherTable;
|
||||||
|
|
||||||
|
  public void setup(Context context) throws IOException {
|
||||||
|
// In here create a Connection to the cluster and save it or use the Connection
|
||||||
|
// from the existing table
|
||||||
|
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
|
||||||
|
}
|
||||||
|
|
||||||
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
|
||||||
|
// process Result...
|
||||||
|
// use 'myOtherTable' for lookups
|
||||||
|
  }
}
|
||||||
|
|
||||||
|
</programlisting>
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.specex">
|
||||||
|
<title>Speculative Execution</title>
|
||||||
|
<para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
|
||||||
|
HBase as a source. This can either be done on a per-Job basis through properties, or on the
|
||||||
|
entire cluster. Especially for longer running jobs, speculative execution will create
|
||||||
|
duplicate map-tasks which will double-write your data to HBase; this is probably not what
|
||||||
|
you want. </para>
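    <para>For example, to disable speculative execution for a single job, the following can be
      set on its configuration (a sketch using the Hadoop 2 property names; older versions use
      the <code>mapred.map.tasks.speculative.execution</code> and
      <code>mapred.reduce.tasks.speculative.execution</code> properties):</para>
    <programlisting language="java">
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.speculative", false);
conf.setBoolean("mapreduce.reduce.speculative", false);
    </programlisting>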
|
||||||
|
<para>See <xref
|
||||||
|
linkend="spec.ex" /> for more information. </para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</chapter>
|
|
@ -0,0 +1,47 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="orca"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<title>Apache HBase Orca</title>
|
||||||
|
<figure>
|
||||||
|
<title>Apache HBase Orca</title>
|
||||||
|
<mediaobject>
|
||||||
|
<imageobject>
|
||||||
|
<imagedata align="center" valign="right"
|
||||||
|
fileref="jumping-orca_rotated_25percent.png"/>
|
||||||
|
</imageobject>
|
||||||
|
</mediaobject>
|
||||||
|
</figure>
|
||||||
|
<para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-4920">An Orca is the Apache
|
||||||
|
HBase mascot.</link>
|
||||||
|
See NOTICES.txt. The Orca logo comes from http://www.vectorfree.com/jumping-orca
|
||||||
|
It is licensed under Creative Commons Attribution 3.0. See https://creativecommons.org/licenses/by/3.0/us/
|
||||||
|
We changed the logo by stripping the colored background, inverting it,
|
||||||
|
and then rotating it slightly.
|
||||||
|
</para>
|
||||||
|
</appendix>
|
|
@ -0,0 +1,83 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="other.info"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<title>Other Information About HBase</title>
|
||||||
|
<section xml:id="other.info.videos"><title>HBase Videos</title>
|
||||||
|
<para>Introduction to HBase
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para><link xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/presentation/chicago_data_summit_apache_hbase_an_introduction_todd_lipcon.html">Introduction to HBase</link> by Todd Lipcon (Chicago Data Summit 2011).
|
||||||
|
</para></listitem>
|
||||||
|
<listitem><para><link xlink:href="http://www.cloudera.com/videos/intorduction-hbase-todd-lipcon">Introduction to HBase</link> by Todd Lipcon (2010).
|
||||||
|
</para></listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-building-realtime-big-data-services-at-facebook-with-hadoop-and-hbase">Building Real Time Services at Facebook with HBase</link> by Jonathan Gray (Hadoop World 2011).
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://www.cloudera.com/videos/hw10_video_how_stumbleupon_built_and_advertising_platform_using_hbase_and_hadoop">HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon</link> by JD Cryans (Hadoop World 2010).
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="other.info.pres"><title>HBase Presentations (Slides)</title>
|
||||||
|
<para><link xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-advanced-hbase-schema-design.html">Advanced HBase Schema Design</link> by Lars George (Hadoop World 2011).
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction">Introduction to HBase</link> by Todd Lipcon (Chicago Data Summit 2011).
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://www.slideshare.net/cloudera/hw09-practical-h-base-getting-the-most-from-your-h-base-install">Getting The Most From Your HBase Install</link> by Ryan Rawson, Jonathan Gray (Hadoop World 2009).
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="other.info.papers"><title>HBase Papers</title>
|
||||||
|
<para><link xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> by Google (2006).
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html">HBase and HDFS Locality</link> by Lars George (2010).
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link> by Ian Varley (2009).
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="other.info.sites"><title>HBase Sites</title>
|
||||||
|
<para><link xlink:href="http://www.cloudera.com/blog/category/hbase/">Cloudera's HBase Blog</link> has a lot of links to useful HBase information.
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para><link xlink:href="http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/">CAP Confusion</link> is a relevant entry for background information on
|
||||||
|
distributed storage systems.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://wiki.apache.org/hadoop/HBase/HBasePresentations">HBase Wiki</link> has a page with a number of presentations.
|
||||||
|
</para>
|
||||||
|
<para><link xlink:href="http://refcardz.dzone.com/refcardz/hbase">HBase RefCard</link> from DZone.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="other.info.books"><title>HBase Books</title>
|
||||||
|
<para><link xlink:href="http://shop.oreilly.com/product/0636920014348.do">HBase: The Definitive Guide</link> by Lars George.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="other.info.books.hadoop"><title>Hadoop Books</title>
|
||||||
|
<para><link xlink:href="http://shop.oreilly.com/product/9780596521981.do">Hadoop: The Definitive Guide</link> by Tom White.
|
||||||
|
</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</appendix>
|
|
@ -273,7 +273,7 @@ tableDesc.addFamily(cfDesc);
|
||||||
If there is enough RAM, increasing this can help.
|
If there is enough RAM, increasing this can help.
|
||||||
</para>
|
</para>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="hbase.regionserver.checksum.verify">
|
<section xml:id="hbase.regionserver.checksum.verify.performance">
|
||||||
<title><varname>hbase.regionserver.checksum.verify</varname></title>
|
<title><varname>hbase.regionserver.checksum.verify</varname></title>
|
||||||
<para>Have HBase write the checksum into the datablock and save
|
<para>Have HBase write the checksum into the datablock and save
|
||||||
having to do the checksum seek whenever you read.</para>
|
having to do the checksum seek whenever you read.</para>
|
||||||
|
|
|
@ -0,0 +1,40 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix
|
||||||
|
xml:id="sql"
|
||||||
|
version="5.0"
|
||||||
|
xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||||||
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg"
|
||||||
|
xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||||
|
xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<title>SQL over HBase</title>
|
||||||
|
<section xml:id="phoenix">
|
||||||
|
<title>Apache Phoenix</title>
|
||||||
|
<para><link xlink:href="http://phoenix.apache.org">Apache Phoenix</link></para>
|
||||||
|
</section>
|
||||||
|
<section xml:id="trafodion">
|
||||||
|
<title>Trafodion</title>
|
||||||
|
<para><link xlink:href="https://wiki.trafodion.org/">Trafodion: Transactional SQL-on-HBase</link></para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</appendix>
|
|
@ -240,7 +240,7 @@
|
||||||
</table>
|
</table>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
<section xml:id="hbase.client.api">
|
<section xml:id="hbase.client.api.surface">
|
||||||
<title>HBase API surface</title>
|
<title>HBase API surface</title>
|
||||||
<para> HBase has a lot of API points, but for the compatibility matrix above, we differentiate between Client API, Limited Private API, and Private API. HBase uses a version of
|
<para> HBase has a lot of API points, but for the compatibility matrix above, we differentiate between Client API, Limited Private API, and Private API. HBase uses a version of
|
||||||
<link xlink:href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html">Hadoop's Interface classification</link>. HBase's Interface classification classes can be found <link xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/classification/package-summary.html"> here</link>.
|
<link xlink:href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html">Hadoop's Interface classification</link>. HBase's Interface classification classes can be found <link xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/classification/package-summary.html"> here</link>.
|
||||||
|
|
|
@ -0,0 +1,36 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<appendix xml:id="ycsb" version="5.0" xmlns="http://docbook.org/ns/docbook"
|
||||||
|
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xi="http://www.w3.org/2001/XInclude"
|
||||||
|
xmlns:svg="http://www.w3.org/2000/svg" xmlns:m="http://www.w3.org/1998/Math/MathML"
|
||||||
|
xmlns:html="http://www.w3.org/1999/xhtml" xmlns:db="http://docbook.org/ns/docbook">
|
||||||
|
<!--/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
* or more contributor license agreements. See the NOTICE file
|
||||||
|
* distributed with this work for additional information
|
||||||
|
* regarding copyright ownership. The ASF licenses this file
|
||||||
|
* to you under the Apache License, Version 2.0 (the
|
||||||
|
* "License"); you may not use this file except in compliance
|
||||||
|
* with the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
-->
|
||||||
|
<title>YCSB</title>
|
||||||
|
<para><link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The
|
||||||
|
Yahoo! Cloud Serving Benchmark</link> and HBase</para>
|
||||||
|
<para>TODO: Describe how YCSB is poor for putting up a decent cluster load.</para>
|
||||||
|
<para>TODO: Describe setup of YCSB for HBase. In particular, presplit your tables before you
|
||||||
|
start a run. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-4163"
|
||||||
|
>HBASE-4163 Create Split Strategy for YCSB Benchmark</link> for why and a little shell
|
||||||
|
command for how to do it.</para>
|
||||||
|
<para>Ted Dunning redid YCSB so that it is mavenized, and added a facility for verifying workloads. See
|
||||||
|
<link xlink:href="https://github.com/tdunning/YCSB">Ted Dunning's YCSB</link>.</para>
|
||||||
|
|
||||||
|
|
||||||
|
</appendix>
|