HBASE-12902 Post-asciidoc conversion fix-ups

Misty Stanley-Jones 2015-01-22 13:29:21 +10:00
parent 38701ea8ec
commit 5fbf80ee5e
28 changed files with 1369 additions and 1554 deletions

View File

@ -68,11 +68,11 @@ Having Write permission does not imply Read permission.::
It is possible and sometimes desirable for a user to be able to write data that the same user cannot read. One such example is a log-writing process.
The [systemitem]+hbase:meta+ table is readable by every user, regardless of the user's other grants or restrictions.::
This is a requirement for HBase to function correctly.
[code]+CheckAndPut+ and [code]+CheckAndDelete+ operations will fail if the user does not have both Write and Read permission.::
[code]+Increment+ and [code]+Append+ operations do not require Read access.::
`CheckAndPut` and `CheckAndDelete` operations will fail if the user does not have both Write and Read permission.::
`Increment` and `Append` operations do not require Read access.::
The following table is sorted by the interface that provides each operation.
In case the table goes out of date, the unit tests which check for accuracy of permissions can be found in [path]_hbase-server/src/test/java/org/apache/hadoop/hbase/security/access/TestAccessController.java_, and the access controls themselves can be examined in [path]_hbase-server/src/main/java/org/apache/hadoop/hbase/security/access/AccessController.java_.
In case the table goes out of date, the unit tests which check for accuracy of permissions can be found in _hbase-server/src/test/java/org/apache/hadoop/hbase/security/access/TestAccessController.java_, and the access controls themselves can be examined in _hbase-server/src/main/java/org/apache/hadoop/hbase/security/access/AccessController.java_.
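For example, because `CheckAndPut` and `CheckAndDelete` require both Read and Write permission, a user of those operations is typically granted the two together. A hedged HBase shell sketch, where the user and table names are placeholders:

----
hbase> grant 'bobsmith', 'RW', 'mytable'
----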
.ACL Matrix
[cols="1,1,1,1", frame="all", options="header"]

View File

@ -59,7 +59,7 @@ Contact one of the HBase committers, who can either give you access or refer you
=== Contributing to Documentation or Other Strings
If you spot an error in a string in a UI, utility, script, log message, or elsewhere, or you think something could be made more clear, or you think text needs to be added where it doesn't currently exist, the first step is to file a JIRA.
Be sure to set the component to [literal]+Documentation+ in addition any other involved components.
Be sure to set the component to `Documentation` in addition to any other involved components.
Most components have one or more default owners, who monitor new issues which come into those queues.
Regardless of whether you feel able to fix the bug, you should still file bugs where you see them.
@ -73,7 +73,7 @@ This procedure goes into more detail than Git pros will need, but is included in
. If you have not already done so, clone the Git repository locally.
You only need to do this once.
. Fairly often, pull remote changes into your local repository by using the [code]+git pull+ command, while your master branch is checked out.
. Fairly often, pull remote changes into your local repository by using the `git pull` command, while your master branch is checked out.
. For each issue you work on, create a new branch.
One convention that works well for naming the branches is to name a given branch the same as the JIRA it relates to:
+
@ -84,9 +84,9 @@ $ git checkout -b HBASE-123456
. Make your suggested changes on your branch, committing your changes to your local repository often.
If you need to switch to working on a different issue, remember to check out the appropriate branch.
. When you are ready to submit your patch, first be sure that HBase builds cleanly and behaves as expected in your modified branch.
If you have made documentation changes, be sure the documentation and website builds.
If you have made documentation changes, be sure the documentation and website build by running `mvn clean site`.
+
NOTE: Before you use the [literal]+site+ target the very first time, be sure you have built HBase at least once, in order to fetch all the Maven dependencies you need.
NOTE: Before you use the `site` target the very first time, be sure you have built HBase at least once, in order to fetch all the Maven dependencies you need.
+
----
$ mvn clean install -DskipTests # Builds HBase
@ -107,10 +107,10 @@ $ git rebase origin/master
----
. Generate your patch against the remote master.
Run the following command from the top level of your git repository (usually called [literal]+hbase+):
Run the following command from the top level of your git repository (usually called `hbase`):
+
----
$ git diff --no-prefix origin/master > HBASE-123456.patch
$ git format-patch --stdout origin/master > HBASE-123456.patch
----
+
The name of the patch should contain the JIRA ID.
@ -120,12 +120,12 @@ A reviewer will review your patch.
If you need to submit a new version of the patch, leave the old one on the JIRA and add a version number to the name of the new patch.
. After a change has been committed, there is no need to keep your local branch around.
Instead you should run +git pull+ to get the new change into your master branch.
Instead you should run `git pull` to get the new change into your master branch.
=== Editing the HBase Website
The source for the HBase website is in the HBase source, in the [path]_src/main/site/_ directory.
Within this directory, source for the individual pages is in the [path]_xdocs/_ directory, and images referenced in those pages are in the [path]_images/_ directory.
The source for the HBase website is in the HBase source, in the _src/main/site/_ directory.
Within this directory, source for the individual pages is in the _xdocs/_ directory, and images referenced in those pages are in the _images/_ directory.
This directory also stores images used in the HBase Reference Guide.
The website's pages are written in an HTML-like XML dialect called xdoc, which has a reference guide at link:http://maven.apache.org/archives/maven-1.x/plugins/xdoc/reference/xdocs.html.
@ -133,12 +133,12 @@ You can edit these files in a plain-text editor, an IDE, or an XML editor such a
To preview your changes, build the website using the +mvn clean site
-DskipTests+ command.
The HTML output resides in the [path]_target/site/_ directory.
The HTML output resides in the _target/site/_ directory.
When you are satisfied with your changes, follow the procedure in <<submit_doc_patch_procedure,submit doc patch procedure>> to submit your patch.
=== HBase Reference Guide Style Guide and Cheat Sheet
We may be converting the HBase Reference Guide to use link:http://asciidoctor.org[AsciiDoctor]. In case that happens, the following cheat sheet is included for your reference. More nuanced and comprehensive documentation is available at link:http://asciidoctor.org/docs/user-manual/. To skip down to the Docbook stuff, see <<docbook.editing>>.
The HBase Reference Guide is written in Asciidoc and built using link:http://asciidoctor.org[AsciiDoctor]. The following cheat sheet is included for your reference. More nuanced and comprehensive documentation is available at link:http://asciidoctor.org/docs/user-manual/.
.AsciiDoc Cheat Sheet
[cols="1,1,a",options="header"]
@ -189,7 +189,7 @@ link:http://www.google.com[Google]
----
<<anchor_name,Anchor Text>>
----
| An block image | The image with alt text |
| A block image | The image with alt text |
----
image::sunset.jpg[Alt Text]
----
@ -310,9 +310,7 @@ include::[/path/to/file.adoc]
For plenty of examples, see _book.adoc_.
| A table | a table | See http://asciidoctor.org/docs/user-manual/#tables. Generally rows are separated by newlines and columns by pipes
| Comment out a single line | A line is skipped during rendering |
----
// This line won't show up
----
`+//+ This line won't show up`
| Comment out a block | A section of the file is skipped during rendering |
----
////
Text between #hash marks# is highlighted yellow.
----
|===
[[docbook.editing]]
=== Editing the HBase Reference Guide
The source for the HBase Reference Guide is in the HBase source, in the [path]_src/main/docbkx/_ directory.
It is written in link:http://www.docbook.org/[Docbook] XML.
Docbook can be intimidating, but you can typically follow the formatting of the surrounding file to get an idea of the mark-up.
You can edit Docbook XML files using a plain-text editor, an XML-aware IDE, or a specialized XML editor.
Docbook's syntax can be picky.
Before submitting a patch, be sure to build the output locally using the +mvn site+ command.
If you do not get any build errors, that means that the XML is well-formed, which means that each opening tag is balanced by a closing tag.
Well-formedness is not exactly the same as validity.
Check the output in [path]_target/docbkx/_ for any surprises before submitting a patch.
=== Auto-Generated Content
Some parts of the HBase Reference Guide, most notably <<config.files,config.files>>, are generated automatically, so that this area of the documentation stays in sync with the code.
This is done by means of an XSLT transform, which you can examine in the source at [path]_src/main/xslt/configuration_to_docbook_section.xsl_.
This transforms the [path]_hbase-common/src/main/resources/hbase-default.xml_ file into a Docbook output which can be included in the Reference Guide.
This is done by means of an XSLT transform, which you can examine in the source at _src/main/xslt/configuration_to_asciidoc_chapter.xsl_.
This transforms the _hbase-common/src/main/resources/hbase-default.xml_ file into an Asciidoc output which can be included in the Reference Guide.
Sometimes, it is necessary to add configuration parameters or modify their descriptions.
Make the modifications to the source file, and they will be included in the Reference Guide when it is rebuilt.
It is possible that other types of content can and will be automatically generated from HBase source files in the future.
=== Multi-Page and Single-Page Output
You can examine the [literal]+site+ target in the Maven [path]_pom.xml_ file included at the top level of the HBase source for details on the process of building the website and documentation.
The Reference Guide is built twice, once as a single-page output and once with one HTML file per chapter.
The single-page output is located in [path]_target/docbkx/book.html_, while the multi-page output's index page is at [path]_target/docbkx/book/book.html_.
Each of these outputs has its own [path]_images/_ and [path]_css/_ directories, which are created at build time.
=== Images in the HBase Reference Guide
You can include images in the HBase Reference Guide.
For accessibility reasons, it is recommended that you use a <figure> Docbook element for an image.
You can include images in the HBase Reference Guide. Always include alternate text for an image, and include an image title if possible.
This allows screen readers to navigate to the image and also provides alternative text for the image.
The following is an example of a <figure> element.
The following is an example of an image with a title and alternate text. Notice the double colon.
[source,xml]
[source,asciidoc]
----
<figure>
<title>HFile Version 1</title>
<mediaobject>
<imageobject>
<imagedata fileref="timeline_consistency.png" />
</imageobject>
<textobject>
<phrase>HFile Version 1</phrase>
</textobject>
</mediaobject>
</figure>
.My Image Title
image::sunset.jpg[Alt Text]
----
The <textobject> can contain a few sentences describing the image, rather than simply reiterating the title.
You can optionally specify alignment and size options in the <imagedata> element.
Here is an example of an inline image with alternate text. Notice the single colon. Inline images cannot have titles. They are generally small images like GUI buttons.
When doing a local build, save the image to the [path]_src/main/site/resources/images/_ directory.
In the <imagedata> element, refer to the image as above, with no directory component.
[source,asciidoc]
----
image:sunset.jpg[Alt Text]
----
When doing a local build, save the image to the _src/main/site/resources/images/_ directory.
When you link to the image, do not include the directory portion of the path.
The image will be copied to the appropriate target location during the build of the output.
When you submit a patch which includes adding an image to the HBase Reference Guide, attach the image to the JIRA.
@ -390,89 +364,32 @@ If the committer asks where the image should be committed, it should go into the
=== Adding a New Chapter to the HBase Reference Guide
If you want to add a new chapter to the HBase Reference Guide, the easiest way is to copy an existing chapter file, rename it, and change the ID and title elements near the top of the file.
If you want to add a new chapter to the HBase Reference Guide, the easiest way is to copy an existing chapter file, rename it, and change the ID (in double brackets) and title. Chapters are located in the _src/main/asciidoc/_chapters/_ directory.
Delete the existing content and create the new content.
Then open the [path]_book.xml_ file, which is the main file for the HBase Reference Guide, and use an <xi:include> element to include your new chapter in the appropriate location.
Then open the _src/main/asciidoc/book.adoc_ file, which is the main file for the HBase Reference Guide, and copy an existing `include` element to include your new chapter in the appropriate location.
Be sure to add your new file to your Git repository before creating your patch.
Note that the _book.adoc_ file currently contains many chapters.
You can only include a chapter at the same nesting levels as the other chapters in the file.
When in doubt, check to see how other files have been included.
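For instance, a hedged sketch of such an `include` line as it might appear in _book.adoc_; the chapter file name is a placeholder:

[source,asciidoc]
----
include::_chapters/my_new_chapter.adoc[]
----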
=== Docbook Common Issues
=== Common Documentation Issues
The following Docbook issues come up often.
The following documentation issues come up often.
Some of these are preferences, but others can create mysterious build errors or other problems.
[qanda]
What can go where?::
There is often confusion about which child elements are valid in a given context. When in doubt, Docbook: The Definitive Guide is the best resource. It has an appendix which is indexed by element and contains all valid child and parent elements of any given element. If you edit Docbook often, a schema-aware XML editor makes things easier.
Paragraphs and Admonitions::
It is a common pattern, and it is technically valid, to put an admonition such as a <note> inside a <para> element. Because admonitions render as block-level elements (they take the whole width of the page), it is better to mark them up as siblings to the paragraphs around them, like this:
+
[source,xml]
----
<para>This is the paragraph.</para>
<note>
<para>This is an admonition which occurs after the paragraph.</para>
</note>
----
Wrap textual <listitem> and <entry> contents in <para> elements.::
Because the contents of a <listitem> (an element in an itemized, ordered, or variable list) or an <entry> (a cell in a table) can consist of things other than plain text, they need to be wrapped in some element. If they are plain text, they need to be inclosed in <para> tags. This is tedious but necessary for validity.
+
[source,xml]
----
<itemizedlist>
<listitem>
<para>This is a paragraph.</para>
</listitem>
<listitem>
<screen>This is screen output.</screen>
</listitem>
</itemizedlist>
----
When to use <command>, <code>, <programlisting>, <screen>?::
The first two are in-line tags, which can occur within the flow of paragraphs or titles. The second two are block elements.
+
Use <command> to mention a command such as hbase shell in the flow of a sentence. Use <code> for other inline text referring to code. Incidentally, use <literal> to specify literal strings that should be typed or entered exactly as shown. Within a <screen> listing, it can be helpful to use the <userinput> and <computeroutput> elements to mark up the text further.
+
Use <screen> to display input and output as the user would see it on the screen, in a log file, etc. Use <programlisting> only for blocks of code that occur within a file, such as Java or XML code, or a Bash shell script.
How to escape XML elements so that they show up as XML?::
For one-off instances or short in-line mentions, use the `&lt;` and `&gt;` encoded characters. For longer mentions, or blocks of code, enclose it with `&lt;![CDATA[]]&gt;`, which is much easier to maintain and parse in the source files.
Tips and tricks for making screen output look good::
Text within <screen> and <programlisting> elements is shown exactly as it appears in the source, including indentation, tabs, and line wrap.
+
Indent the starting and closing XML elements, but do not indent the content. Also, to avoid having an extra blank line at the beginning of the programlisting output, do not put the CDATA element on its own line. For example:
+
[source,xml]
----
<programlisting>
case $1 in
--cleanZk|--cleanHdfs|--cleanAll)
matches="yes" ;;
*) ;;
esac
</programlisting>
----
+
After pasting code into a programlisting, fix the indentation manually, using two spaces per desired indentation. For screen output, be sure to include line breaks so that the text is no longer than 100 characters.
Isolate Changes for Easy Diff Review.::
Be careful with pretty-printing or re-formatting an entire XML file, even if the formatting has degraded over time. If you need to reformat a file, do that in a separate JIRA where you do not change any content. Be careful because some XML editors do a bulk-reformat when you open a new file, especially if you use GUI mode in the editor.
Syntax Highlighting::
The HBase Reference Guide uses the XSLT Syntax Highlighting Maven module for syntax highlighting. To enable syntax highlighting for a given <programlisting> or <screen> (or possibly other elements), add the attribute language=LANGUAGE_OF_CHOICE to the element, as in the following example:
+
[source,xml]
----
<programlisting language="xml">
<foo>bar</foo>
<bar>foo</bar>
</programlisting>
----
+
Several syntax types are supported. The most interesting ones for the HBase Reference Guide are java, xml, sql, and bourne (for BASH shell output or Linux command-line examples).
The HBase Reference Guide uses `coderay` for syntax highlighting. To enable syntax highlighting for a given code listing, use the following type of syntax:
+
........
[source,xml]
----
<name>My Name</name>
----
........
+
Several syntax types are supported. The most interesting ones for the HBase Reference Guide are `java`, `xml`, `sql`, and `bash`.

View File

@ -312,12 +312,12 @@ Version 3 added two additional pieces of information to the reserved keys in the
| hfile.TAGS_COMPRESSED | Does the block encoder for this hfile compress tags? (boolean). Should only be present if hfile.MAX_TAGS_LEN is also present.
|===
When reading a Version 3 HFile the presence of [class]+MAX_TAGS_LEN+ is used to determine how to deserialize the cells within a data block.
When reading a Version 3 HFile the presence of `MAX_TAGS_LEN` is used to determine how to deserialize the cells within a data block.
Therefore, consumers must read the file's info block prior to reading any data blocks.
When writing a Version 3 HFile, HBase will always include [class]+MAX_TAGS_LEN + when flushing the memstore to underlying filesystem and when using prefix tree encoding for data blocks, as described in <<compression,compression>>.
When writing a Version 3 HFile, HBase will always include `MAX_TAGS_LEN` when flushing the memstore to the underlying filesystem and when using prefix tree encoding for data blocks, as described in <<compression,compression>>.
When compacting extant files, the default writer will omit [class]+MAX_TAGS_LEN+ if all of the files selected do not themselves contain any cells with tags.
When compacting extant files, the default writer will omit `MAX_TAGS_LEN` if all of the files selected do not themselves contain any cells with tags.
See <<compaction,compaction>> for details on the compaction file selection algorithm.
@ -338,11 +338,11 @@ Within an HFile, HBase cells are stored in data blocks as a sequence of KeyValue
| | Tags bytes (variable)
|===
If the info block for a given HFile contains an entry for [class]+MAX_TAGS_LEN+ each cell will have the length of that cell's tags included, even if that length is zero.
If the info block for a given HFile contains an entry for `MAX_TAGS_LEN`, each cell will have the length of that cell's tags included, even if that length is zero.
The actual tags are stored as a sequence of tag length (2 bytes), tag type (1 byte), and tag bytes (variable). The format of an individual tag's bytes depends on the tag type.
Note that the dependence on the contents of the info block implies that prior to reading any data blocks you must first process a file's info block.
It also implies that prior to writing a data block you must know if the file's info block will include [class]+MAX_TAGS_LEN+.
It also implies that prior to writing a data block you must know if the file's info block will include `MAX_TAGS_LEN`.
[[hfilev3.fixedtrailer]]
==== Fixed File Trailer in Version 3

File diff suppressed because it is too large.

View File

@ -84,16 +84,16 @@ If system memory is heavily overcommitted, the Linux kernel may enter a vicious
Further, a failing hard disk will often retry reads and/or writes many times before giving up and returning an error.
This can manifest as high iowait, as running processes wait for reads and writes to complete.
Finally, a disk nearing the upper edge of its performance envelope will begin to cause iowait as it informs the kernel that it cannot accept any more data, and the kernel queues incoming data into the dirty write pool in memory.
However, using [code]+vmstat(1)+ and [code]+free(1)+, we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second.
However, using `vmstat(1)` and `free(1)`, we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second.
===== Slowness Due To High Processor Usage
Next, we checked to see whether the system was performing slowly simply due to very high computational load. [code]+top(1)+ showed that the system load was higher than normal, but [code]+vmstat(1)+ and [code]+mpstat(1)+ showed that the amount of processor being used for actual computation was low.
Next, we checked to see whether the system was performing slowly simply due to very high computational load. `top(1)` showed that the system load was higher than normal, but `vmstat(1)` and `mpstat(1)` showed that the amount of processor being used for actual computation was low.
===== Network Saturation (The Winner)
Since neither the disks nor the processors were being utilized heavily, we moved on to the performance of the network interfaces.
The datanode had two gigabit ethernet adapters, bonded to form an active-standby interface. [code]+ifconfig(8)+ showed some unusual anomalies, namely interface errors, overruns, framing errors.
The datanode had two gigabit ethernet adapters, bonded to form an active-standby interface. `ifconfig(8)` showed some unusual anomalies, namely interface errors, overruns, framing errors.
While not unheard of, these kinds of errors are exceedingly rare on modern hardware which is operating as it should:
----
@ -109,7 +109,7 @@ RX bytes:2416328868676 (2.4 TB) TX bytes:3464991094001 (3.4 TB)
----
These errors immediately led us to suspect that one or more of the ethernet interfaces might have negotiated the wrong line speed.
This was confirmed both by running an ICMP ping from an external host and observing round-trip-time in excess of 700ms, and by running [code]+ethtool(8)+ on the members of the bond interface and discovering that the active interface was operating at 100Mbs/, full duplex.
This was confirmed both by running an ICMP ping from an external host and observing round-trip-time in excess of 700ms, and by running `ethtool(8)` on the members of the bond interface and discovering that the active interface was operating at 100Mbps, full duplex.
----
@ -145,7 +145,7 @@ In normal operation, the ICMP ping round trip time should be around 20ms, and th
==== Resolution
After determining that the active ethernet adapter was at the incorrect speed, we used the [code]+ifenslave(8)+ command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance, and a 10 times improvement in network throughput:
After determining that the active ethernet adapter was at the incorrect speed, we used the `ifenslave(8)` command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance, and a 10 times improvement in network throughput:
On the next trip to the datacenter, we determined that the line speed issue was ultimately caused by a bad network cable, which was replaced.
@ -163,6 +163,6 @@ Although this research is on an older version of the codebase, this writeup is s
[[casestudies.max.transfer.threads]]
=== Case Study #4 (max.transfer.threads Config)
Case study of configuring [code]+max.transfer.threads+ (previously known as [code]+xcievers+) and diagnosing errors from misconfigurations. link:http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html
Case study of configuring `max.transfer.threads` (previously known as `xcievers`) and diagnosing errors from misconfigurations. link:http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html
See also <<dfs.datanode.max.transfer.threads,dfs.datanode.max.transfer.threads>>.

View File

@ -65,12 +65,12 @@ To enable data block encoding for a ColumnFamily, see <<data.block.encoding.enab
.Data Block Encoding Types
Prefix::
Often, keys are very similar. Specifically, keys often share a common prefix and only differ near the end. For instance, one key might be [literal]+RowKey:Family:Qualifier0+ and the next key might be [literal]+RowKey:Family:Qualifier1+.
Often, keys are very similar. Specifically, keys often share a common prefix and only differ near the end. For instance, one key might be `RowKey:Family:Qualifier0` and the next key might be `RowKey:Family:Qualifier1`.
+
In Prefix encoding, an extra column is added which holds the length of the prefix shared between the current key and the previous key.
Assuming the first key here is totally different from the key before, its prefix length is 0.
+
The second key's prefix length is [literal]+23+, since they have the first 23 characters in common.
The second key's prefix length is `23`, since they have the first 23 characters in common.
+
Obviously if the keys tend to have nothing in common, Prefix will not provide much benefit.
+
@ -168,10 +168,10 @@ bzip2: false
Above shows that the native hadoop library is not available in HBase context.
To fix the above, either copy the Hadoop native libraries local or symlink to them if the Hadoop and HBase installs are adjacent in the filesystem.
You could also point at their location by setting the [var]+LD_LIBRARY_PATH+ environment variable.
You could also point at their location by setting the `LD_LIBRARY_PATH` environment variable.
Where the JVM looks to find native librarys is "system dependent" (See [class]+java.lang.System#loadLibrary(name)+). On linux, by default, is going to look in [path]_lib/native/PLATFORM_ where [var]+PLATFORM+ is the label for the platform your HBase is installed on.
On a local linux machine, it seems to be the concatenation of the java properties [var]+os.name+ and [var]+os.arch+ followed by whether 32 or 64 bit.
Where the JVM looks to find native libraries is "system dependent" (See `java.lang.System#loadLibrary(name)`). On Linux, by default, it will look in _lib/native/PLATFORM_ where `PLATFORM` is the label for the platform your HBase is installed on.
On a local Linux machine, it seems to be the concatenation of the java properties `os.name` and `os.arch` followed by whether the JVM is 32- or 64-bit.
HBase on startup prints out all of the java system properties so find the os.name and os.arch in the log.
For example:
[source]
@ -181,12 +181,12 @@ For example:
2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
...
----
So in this case, the PLATFORM string is [var]+Linux-amd64-64+.
Copying the Hadoop native libraries or symlinking at [path]_lib/native/Linux-amd64-64_ will ensure they are found.
Check with the Hadoop [path]_NativeLibraryChecker_.
So in this case, the PLATFORM string is `Linux-amd64-64`.
Copying the Hadoop native libraries or symlinking at _lib/native/Linux-amd64-64_ will ensure they are found.
Check with the Hadoop _NativeLibraryChecker_.
Here is example of how to point at the Hadoop libs with [var]+LD_LIBRARY_PATH+ environment variable:
Here is an example of how to point at the Hadoop libs with the `LD_LIBRARY_PATH` environment variable:
[source]
----
$ LD_LIBRARY_PATH=~/hadoop-2.5.0-SNAPSHOT/lib/native ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
@ -199,7 +199,7 @@ snappy: true /usr/lib64/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
----
Set in [path]_hbase-env.sh_ the LD_LIBRARY_PATH environment variable when starting your HBase.
Set in _hbase-env.sh_ the LD_LIBRARY_PATH environment variable when starting your HBase.
=== Compressor Configuration, Installation, and Use
@ -215,17 +215,16 @@ See
.Compressor Support On the Master
A new configuration setting was introduced in HBase 0.95, to check the Master to determine which data block encoders are installed and configured on it, and assume that the entire cluster is configured the same.
This option, [code]+hbase.master.check.compression+, defaults to [literal]+true+.
This option, `hbase.master.check.compression`, defaults to `true`.
This prevents the situation described in link:https://issues.apache.org/jira/browse/HBASE-6370[HBASE-6370], where a table is created or modified to support a codec that a region server does not support, leading to failures that take a long time to occur and are difficult to debug.
If [code]+hbase.master.check.compression+ is enabled, libraries for all desired compressors need to be installed and configured on the Master, even if the Master does not run a region server.
If `hbase.master.check.compression` is enabled, libraries for all desired compressors need to be installed and configured on the Master, even if the Master does not run a region server.
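For reference, the property can be set explicitly in _hbase-site.xml_. A sketch showing the default value:

[source,xml]
----
<property>
  <name>hbase.master.check.compression</name>
  <!-- true is the default; disabling the check is not generally recommended -->
  <value>true</value>
</property>
----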
.Install GZ Support Via Native Libraries
HBase uses Java's built-in GZip support unless the native Hadoop libraries are available on the CLASSPATH.
The recommended way to add libraries to the CLASSPATH is to set the environment variable [var]+HBASE_LIBRARY_PATH+ for the user running HBase.
If native libraries are not available and Java's GZIP is used, [literal]+Got
brand-new compressor+ reports will be present in the logs.
The recommended way to add libraries to the CLASSPATH is to set the environment variable `HBASE_LIBRARY_PATH` for the user running HBase.
If native libraries are not available and Java's GZIP is used, `Got brand-new compressor` reports will be present in the logs.
See <<brand.new.compressor,brand.new.compressor>>.
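One way to do this is in the shell profile of the user running HBase, or in _conf/hbase-env.sh_. The path below is only an example and depends on where your Hadoop native libraries live:

[source]
----
# Example only -- point this at your own Hadoop native library directory
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native:$HBASE_LIBRARY_PATH
----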
[[lzo.compression]]
@ -264,10 +263,10 @@ hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}
HBase does not ship with Snappy support because of licensing issues.
You can install Snappy binaries (for instance, by using +yum install snappy+ on CentOS) or build Snappy from source.
After installing Snappy, search for the shared library, which will be called [path]_libsnappy.so.X_ where X is a number.
If you built from source, copy the shared library to a known location on your system, such as [path]_/opt/snappy/lib/_.
After installing Snappy, search for the shared library, which will be called _libsnappy.so.X_ where X is a number.
If you built from source, copy the shared library to a known location on your system, such as _/opt/snappy/lib/_.
In addition to the Snappy library, HBase also needs access to the Hadoop shared library, which will be called something like [path]_libhadoop.so.X.Y_, where X and Y are both numbers.
In addition to the Snappy library, HBase also needs access to the Hadoop shared library, which will be called something like _libhadoop.so.X.Y_, where X and Y are both numbers.
Make note of the location of the Hadoop library, or copy it to the same location as the Snappy library.
[NOTE]
@ -278,7 +277,7 @@ See <<compression.test,compression.test>> to find out how to test that this is t
See <<hbase.regionserver.codecs,hbase.regionserver.codecs>> to configure your RegionServers to fail to start if a given compressor is not available.
====
Each of these library locations need to be added to the environment variable [var]+HBASE_LIBRARY_PATH+ for the operating system user that runs HBase.
Each of these library locations need to be added to the environment variable `HBASE_LIBRARY_PATH` for the operating system user that runs HBase.
You need to restart the RegionServer for the changes to take effect.
[[compression.test]]
@ -294,14 +293,14 @@ You can use the CompressionTest tool to verify that your compressor is available
[[hbase.regionserver.codecs]]
.Enforce Compression Settings On a RegionServer
You can configure a RegionServer so that it will fail to restart if compression is configured incorrectly, by adding the option hbase.regionserver.codecs to the [path]_hbase-site.xml_, and setting its value to a comma-separated list of codecs that need to be available.
For example, if you set this property to [literal]+lzo,gz+, the RegionServer would fail to start if both compressors were not available.
You can configure a RegionServer so that it will fail to restart if compression is configured incorrectly, by adding the option hbase.regionserver.codecs to the _hbase-site.xml_, and setting its value to a comma-separated list of codecs that need to be available.
For example, if you set this property to `lzo,gz`, the RegionServer would fail to start if both compressors were not available.
This would prevent a new server from being added to the cluster without having codecs configured properly.
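A sketch of the corresponding _hbase-site.xml_ entry, using the `lzo,gz` value from the example above:

[source,xml]
----
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,gz</value>
</property>
----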
[[changing.compression]]
==== Enable Compression On a ColumnFamily
To enable compression for a ColumnFamily, use an [code]+alter+ command.
To enable compression for a ColumnFamily, use an `alter` command.
You do not need to re-create the table or copy data.
If you are changing codecs, be sure the old codec is still available until all the old StoreFiles have been compacted.
@ -342,7 +341,7 @@ DESCRIPTION ENABLED
==== Testing Compression Performance
HBase includes a tool called LoadTestTool which provides mechanisms to test your compression performance.
You must specify either [literal]+-write+ or [literal]+-update-read+ as your first parameter, and if you do not specify another parameter, usage advice is printed for each option.
You must specify either `-write` or `-update-read` as your first parameter, and if you do not specify another parameter, usage advice is printed for each option.
.+LoadTestTool+ Usage
====
@ -415,7 +414,7 @@ $ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000
== Enable Data Block Encoding
Codecs are built into HBase so no extra configuration is needed.
Codecs are enabled on a table by setting the [code]+DATA_BLOCK_ENCODING+ property.
Codecs are enabled on a table by setting the `DATA_BLOCK_ENCODING` property.
Disable the table before altering its DATA_BLOCK_ENCODING setting.
Following is an example using HBase Shell:
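A minimal sketch of such a session; the table name, column family, and chosen encoding are placeholders:

----
hbase> disable 'mytable'
hbase> alter 'mytable', { NAME => 'mycf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
hbase> enable 'mytable'
----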

View File

@ -30,41 +30,42 @@
This chapter expands upon the <<getting_started,getting started>> chapter to further explain configuration of Apache HBase.
Please read this chapter carefully, especially <<basic.prerequisites,basic.prerequisites>> to ensure that your HBase testing and deployment goes smoothly, and prevent data loss.
== Configuration Files
Apache HBase uses the same configuration system as Apache Hadoop.
All configuration files are located in the [path]_conf/_ directory, which needs to be kept in sync for each node on your cluster.
All configuration files are located in the _conf/_ directory, which needs to be kept in sync for each node on your cluster.
.HBase Configuration Files
[path]_backup-masters_::
.HBase Configuration File Descriptions
_backup-masters_::
Not present by default.
A plain-text file which lists hosts on which the Master should start a backup Master process, one host per line.
[path]_hadoop-metrics2-hbase.properties_::
_hadoop-metrics2-hbase.properties_::
Used to connect HBase to Hadoop's Metrics2 framework.
See the link:http://wiki.apache.org/hadoop/HADOOP-6728-MetricsV2[Hadoop Wiki
entry] for more information on Metrics2.
Contains only commented-out examples by default.
[path]_hbase-env.cmd_ and [path]_hbase-env.sh_::
_hbase-env.cmd_ and _hbase-env.sh_::
Script for Windows and Linux / Unix environments to set up the working environment for HBase, including the location of Java, Java options, and other environment variables.
The file contains many commented-out examples to provide guidance.
[path]_hbase-policy.xml_::
_hbase-policy.xml_::
The default policy configuration file used by RPC servers to make authorization decisions on client requests.
Only used if HBase security (<<security,security>>) is enabled.
[path]_hbase-site.xml_::
_hbase-site.xml_::
The main HBase configuration file.
This file specifies configuration options which override HBase's default configuration.
You can view (but do not edit) the default configuration file at [path]_docs/hbase-default.xml_.
You can view (but do not edit) the default configuration file at _docs/hbase-default.xml_.
You can also view the entire effective configuration for your cluster (defaults and overrides) in the [label]#HBase Configuration# tab of the HBase Web UI.
[path]_log4j.properties_::
Configuration file for HBase logging via [code]+log4j+.
_log4j.properties_::
Configuration file for HBase logging via `log4j`.
[path]_regionservers_::
_regionservers_::
A plain-text file containing a list of hosts which should run a RegionServer in your HBase cluster.
By default this file contains the single entry [literal]+localhost+.
It should contain a list of hostnames or IP addresses, one per line, and should only contain [literal]+localhost+ if each node in your cluster will run a RegionServer on its [literal]+localhost+ interface.
By default this file contains the single entry `localhost`.
It should contain a list of hostnames or IP addresses, one per line, and should only contain `localhost` if each node in your cluster will run a RegionServer on its `localhost` interface.
.Checking XML Validity
[TIP]
@ -75,11 +76,10 @@ By default, +xmllint+ re-flows and prints the XML to standard output.
To check for well-formedness and only print output if errors exist, use the command +xmllint -noout
filename.xml+.
====
.Keep Configuration In Sync Across the Cluster
[WARNING]
====
When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the content of the [path]_conf/_ directory to all nodes of the cluster.
When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the content of the _conf/_ directory to all nodes of the cluster.
HBase will not do this for you.
Use +rsync+, +scp+, or another secure mechanism for copying the configuration files to your nodes.
For most configurations, a restart is needed for servers to pick up changes. An exception is dynamic configuration.
@ -123,7 +123,7 @@ support.
|N/A
|===
NOTE: In HBase 0.98.5 and newer, you must set `JAVA_HOME` on each node of your cluster. [path]_hbase-env.sh_ provides a handy mechanism to do this.
NOTE: In HBase 0.98.5 and newer, you must set `JAVA_HOME` on each node of your cluster. _hbase-env.sh_ provides a handy mechanism to do this.
.Operating System Utilities
ssh::
@ -133,14 +133,14 @@ DNS::
HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolving must work in versions of HBase previous to 0.92.0. The link:https://github.com/sujee/hadoop-dns-checker[hadoop-dns-checker] tool can be used to verify DNS is working correctly on the cluster. The project README file provides detailed instructions on usage.
Loopback IP::
Prior to hbase-0.96.0, HBase only used the IP address [systemitem]+127.0.0.1+ to refer to [code]+localhost+, and this could not be configured.
Prior to hbase-0.96.0, HBase only used the IP address [systemitem]+127.0.0.1+ to refer to `localhost`, and this could not be configured.
See <<loopback.ip,loopback.ip>>.
NTP::
The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism, on your cluster, and that all nodes look to the same service for time synchronization. See the link:http://www.tldp.org/LDP/sag/html/basic-ntp-config.html[Basic NTP Configuration] at [citetitle]_The Linux Documentation Project (TLDP)_ to set up NTP.
Limits on Number of Files and Processes (ulimit)::
Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to [literal]+1024+ (or [literal]+256+ on older versions of OS X). You can check this limit on your servers by running the command +ulimit -n+ when logged in as the user which runs HBase. See <<trouble.rs.runtime.filehandles,trouble.rs.runtime.filehandles>> for some of the problems you may experience if the limit is too low. You may also notice errors such as the following:
Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to `1024` (or `256` on older versions of OS X). You can check this limit on your servers by running the command +ulimit -n+ when logged in as the user which runs HBase. See <<trouble.rs.runtime.filehandles,trouble.rs.runtime.filehandles>> for some of the problems you may experience if the limit is too low. You may also notice errors such as the following:
+
----
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception increateBlockOutputStream java.io.EOFException
@ -217,7 +217,7 @@ Use the following legend to interpret this table:
.Replace the Hadoop Bundled With HBase!
[NOTE]
====
Because HBase depends on Hadoop, it bundles an instance of the Hadoop jar under its [path]_lib_ directory.
Because HBase depends on Hadoop, it bundles an instance of the Hadoop jar under its _lib_ directory.
The bundled jar is ONLY for use in standalone mode.
In distributed mode, it is _critical_ that the version of Hadoop that is out on your cluster match what is under HBase.
Replace the hadoop jar found in the HBase lib directory with the hadoop jar you are running on your cluster to avoid version mismatch issues.
@ -228,7 +228,7 @@ Hadoop version mismatch issues have various manifestations but often all looks l
[[hadoop2.hbase_0.94]]
==== Apache HBase 0.94 with Hadoop 2
To get 0.94.x to run on hadoop 2.2.0, you need to change the hadoop 2 and protobuf versions in the [path]_pom.xml_: Here is a diff with pom.xml changes:
To get 0.94.x to run on hadoop 2.2.0, you need to change the hadoop 2 and protobuf versions in the _pom.xml_: Here is a diff with pom.xml changes:
[source]
----
@ -298,14 +298,14 @@ Do not move to Apache HBase 0.96.x if you cannot upgrade your Hadoop.. See link:
[[hadoop.older.versions]]
==== Hadoop versions 0.20.x - 1.x
HBase will lose data unless it is running on an HDFS that has a durable [code]+sync+ implementation.
HBase will lose data unless it is running on an HDFS that has a durable `sync` implementation.
DO NOT use Hadoop 0.20.2, Hadoop 0.20.203.0, and Hadoop 0.20.204.0 which DO NOT have this attribute.
Currently only Hadoop versions 0.20.205.x or any release in excess of this version -- this includes hadoop-1.0.0 -- have a working, durable sync.
The Cloudera blog post link:http://www.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/[An
update on Apache Hadoop 1.0] by Charles Zedlweski has a nice exposition on how all the Hadoop versions relate.
It's worth checking out if you are having trouble making sense of the Hadoop version morass.
Sync has to be explicitly enabled by setting [var]+dfs.support.append+ equal to true on both the client side -- in [path]_hbase-site.xml_ -- and on the serverside in [path]_hdfs-site.xml_ (The sync facility HBase needs is a subset of the append code path).
Sync has to be explicitly enabled by setting `dfs.support.append` equal to true on both the client side -- in _hbase-site.xml_ -- and on the server side in _hdfs-site.xml_ (the sync facility HBase needs is a subset of the append code path).
[source,xml]
----
@ -317,7 +317,7 @@ Sync has to be explicitly enabled by setting [var]+dfs.support.append+ equal to
----
You will have to restart your cluster after making this edit.
Ignore the chicken-little comment you'll find in the [path]_hdfs-default.xml_ in the description for the [var]+dfs.support.append+ configuration.
Ignore the chicken-little comment you'll find in the _hdfs-default.xml_ in the description for the `dfs.support.append` configuration.
[[hadoop.security]]
==== Apache HBase on Secure Hadoop
@ -325,12 +325,12 @@ Ignore the chicken-little comment you'll find in the [path]_hdfs-default.xml_ in
Apache HBase will run on any Hadoop 0.20.x that incorporates Hadoop security features as long as you do as suggested above and replace the Hadoop jar that ships with HBase with the secure version.
If you want to read more about how to setup Secure HBase, see <<hbase.secure.configuration,hbase.secure.configuration>>.
[var]+dfs.datanode.max.transfer.threads+
`dfs.datanode.max.transfer.threads`
[[dfs.datanode.max.transfer.threads]]
==== (((dfs.datanode.max.transfer.threads)))
An HDFS datanode has an upper bound on the number of files that it will serve at any one time.
Before doing any loading, make sure you have configured Hadoop's [path]_conf/hdfs-site.xml_, setting the [var]+dfs.datanode.max.transfer.threads+ value to at least the following:
Before doing any loading, make sure you have configured Hadoop's _conf/hdfs-site.xml_, setting the `dfs.datanode.max.transfer.threads` value to at least the following:
[source,xml]
----
@ -353,7 +353,7 @@ For example:
contain current block. Will get new block locations from namenode and retry...
----
See also <<casestudies.max.transfer.threads,casestudies.max.transfer.threads>> and note that this property was previously known as [var]+dfs.datanode.max.xcievers+ (e.g. link:http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html[
See also <<casestudies.max.transfer.threads,casestudies.max.transfer.threads>> and note that this property was previously known as `dfs.datanode.max.xcievers` (e.g. link:http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html[
Hadoop HDFS: Deceived by Xciever]).
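For reference, the _conf/hdfs-site.xml_ entry discussed above looks like the following sketch; `4096` is a commonly recommended minimum, and the right value depends on your workload:

[source,xml]
----
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
----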
[[zookeeper.requirements]]
@ -367,10 +367,10 @@ HBase makes use of the [method]+multi+ functionality that is only available sinc
HBase has two run modes: <<standalone,standalone>> and <<distributed,distributed>>.
Out of the box, HBase runs in standalone mode.
Whatever your mode, you will need to configure HBase by editing files in the HBase [path]_conf_ directory.
At a minimum, you must edit [code]+conf/hbase-env.sh+ to tell HBase which +java+ to use.
Whatever your mode, you will need to configure HBase by editing files in the HBase _conf_ directory.
At a minimum, you must edit `conf/hbase-env.sh` to tell HBase which +java+ to use.
In this file you set HBase environment variables such as the heapsize and other options for the +JVM+, the preferred location for log files, etc.
Set [var]+JAVA_HOME+ to point at the root of your +java+ install.
Set `JAVA_HOME` to point at the root of your +java+ install.
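A hedged sketch of setting `JAVA_HOME` in _conf/hbase-env.sh_; the JDK path is only an example and depends on your install:

[source]
----
# The java implementation to use -- example path only
export JAVA_HOME=/usr/java/jdk1.7.0/
----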
[[standalone]]
=== Standalone HBase
@ -417,15 +417,15 @@ Both standalone mode and pseudo-distributed mode are provided for the purposes o
For a production environment, distributed mode is appropriate.
In distributed mode, multiple instances of HBase daemons run on multiple servers in the cluster.
Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the [code]+hbase-cluster.distributed+ property to [literal]+true+.
Typically, the [code]+hbase.rootdir+ is configured to point to a highly-available HDFS filesystem.
Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the `hbase.cluster.distributed` property to `true`.
Typically, the `hbase.rootdir` is configured to point to a highly-available HDFS filesystem.
In addition, the cluster is configured so that multiple cluster nodes enlist as RegionServers, ZooKeeper QuorumPeers, and backup HMaster servers.
These configuration basics are all demonstrated in <<quickstart_fully_distributed,quickstart-fully-distributed>>.
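A minimal sketch of those two properties as they might appear in _conf/hbase-site.xml_; the NameNode host, port, and path are placeholders:

[source,xml]
----
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.org:8020/hbase</value>
</property>
----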
.Distributed RegionServers
Typically, your cluster will contain multiple RegionServers all running on different servers, as well as primary and backup Master and Zookeeper daemons.
The [path]_conf/regionservers_ file on the master server contains a list of hosts whose RegionServers are associated with this cluster.
The _conf/regionservers_ file on the master server contains a list of hosts whose RegionServers are associated with this cluster.
Each host is on a separate line.
All hosts listed in this file will have their RegionServer processes started and stopped when the master server starts or stops.
@ -434,9 +434,9 @@ See section <<zookeeper,zookeeper>> for ZooKeeper setup for HBase.
.Example Distributed HBase Cluster
====
This is a bare-bones [path]_conf/hbase-site.xml_ for a distributed HBase cluster.
This is a bare-bones _conf/hbase-site.xml_ for a distributed HBase cluster.
A cluster that is used for real-world work would contain more custom configuration parameters.
Most HBase configuration directives have default values, which are used unless the value is overridden in the [path]_hbase-site.xml_.
Most HBase configuration directives have default values, which are used unless the value is overridden in the _hbase-site.xml_.
See <<config.files,config.files>> for more information.
[source,xml]
@ -458,8 +458,8 @@ See <<config.files,config.files>> for more information.
</configuration>
----
This is an example [path]_conf/regionservers_ file, which contains a list of each node that should run a RegionServer in the cluster.
These nodes need HBase installed and they need to use the same contents of the [path]_conf/_ directory as the Master server..
This is an example _conf/regionservers_ file, which contains a list of each node that should run a RegionServer in the cluster.
These nodes need HBase installed and they need to use the same contents of the _conf/_ directory as the Master server.
[source]
----
@ -469,7 +469,7 @@ node-b.example.com
node-c.example.com
----
This is an example [path]_conf/backup-masters_ file, which contains a list of each node that should run a backup Master instance.
This is an example _conf/backup-masters_ file, which contains a list of each node that should run a backup Master instance.
The backup Master instances will sit idle unless the main Master becomes unavailable.
[source]
@ -486,19 +486,19 @@ See <<quickstart_fully_distributed,quickstart-fully-distributed>> for a walk-thr
.Procedure: HDFS Client Configuration
. Of note, if you have made HDFS client configuration on your Hadoop cluster, such as configuration directives for HDFS clients, as opposed to server-side configurations, you must use one of the following methods to enable HBase to see and use these configuration changes:
+
a. Add a pointer to your [var]+HADOOP_CONF_DIR+ to the [var]+HBASE_CLASSPATH+ environment variable in [path]_hbase-env.sh_.
b. Add a copy of [path]_hdfs-site.xml_ (or [path]_hadoop-site.xml_) or, better, symlinks, under [path]_${HBASE_HOME}/conf_, or
c. if only a small set of HDFS client configurations, add them to [path]_hbase-site.xml_.
a. Add a pointer to your `HADOOP_CONF_DIR` to the `HBASE_CLASSPATH` environment variable in _hbase-env.sh_.
b. Add a copy of _hdfs-site.xml_ (or _hadoop-site.xml_) or, better, symlinks, under _${HBASE_HOME}/conf_, or
c. if only a small set of HDFS client configurations, add them to _hbase-site.xml_.
An example of such an HDFS client configuration is [var]+dfs.replication+.
An example of such an HDFS client configuration is `dfs.replication`.
If, for example, you want to run with a replication factor of 5, HBase will create files with the default replication factor of 3 unless you do the above to make the configuration available to HBase.
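A sketch of option (c) above for that example, added to _hbase-site.xml_:

[source,xml]
----
<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>
----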
[[confirm]]
== Running and Confirming Your Installation
Make sure HDFS is running first.
Start and stop the Hadoop HDFS daemons by running [path]_bin/start-hdfs.sh_ over in the [var]+HADOOP_HOME+ directory.
Start and stop the Hadoop HDFS daemons by running _bin/start-hdfs.sh_ over in the `HADOOP_HOME` directory.
You can ensure it started properly by testing the +put+ and +get+ of files into the Hadoop filesystem.
HBase does not normally use the mapreduce daemons.
These do not need to be started.
@ -511,14 +511,14 @@ Start HBase with the following command:
bin/start-hbase.sh
----
Run the above from the [var]+HBASE_HOME+ directory.
Run the above from the `HBASE_HOME` directory.
You should now have a running HBase instance.
HBase logs can be found in the [path]_logs_ subdirectory.
HBase logs can be found in the _logs_ subdirectory.
Check them out especially if HBase had trouble starting.
HBase also puts up a UI listing vital attributes.
By default its deployed on the Master host at port 16010 (HBase RegionServers listen on port 16020 by default and put up an informational http server at 16030). If the Master were running on a host named [var]+master.example.org+ on the default port, to see the Master's homepage you'd point your browser at [path]_http://master.example.org:16010_.
By default it is deployed on the Master host at port 16010 (HBase RegionServers listen on port 16020 by default and put up an informational http server at 16030). If the Master were running on a host named `master.example.org` on the default port, to see the Master's homepage you'd point your browser at _http://master.example.org:16010_.
Prior to HBase 0.98, the master UI was deployed on port 60010, and the HBase RegionServers listened on port 60020 and put up an informational http server at port 60030.
@ -536,15 +536,15 @@ It can take longer if your cluster is comprised of many machines.
If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons.
[[config.files]]
== Configuration Files
== Default Configuration
[[hbase.site]]
=== [path]_hbase-site.xml_ and [path]_hbase-default.xml_
=== _hbase-site.xml_ and _hbase-default.xml_
Just as in Hadoop where you add site-specific HDFS configuration to the [path]_hdfs-site.xml_ file, for HBase, site specific customizations go into the file [path]_conf/hbase-site.xml_.
For the list of configurable properties, see <<hbase_default_configurations,hbase default configurations>> below or view the raw [path]_hbase-default.xml_ source file in the HBase source code at [path]_src/main/resources_.
Just as in Hadoop where you add site-specific HDFS configuration to the _hdfs-site.xml_ file, for HBase, site specific customizations go into the file _conf/hbase-site.xml_.
For the list of configurable properties, see <<hbase_default_configurations,hbase default configurations>> below or view the raw _hbase-default.xml_ source file in the HBase source code at _src/main/resources_.
Not all configuration options make it out to [path]_hbase-default.xml_.
Not all configuration options make it out to _hbase-default.xml_.
Configuration options that are rarely changed may exist only in code; the only way to discover such configurations is by reading the source code itself.
Currently, changes here will require a cluster restart for HBase to notice the change.
@ -554,19 +554,19 @@ include::../../../../target/asciidoc/hbase-default.adoc[]
[[hbase.env.sh]]
=== [path]_hbase-env.sh_
=== _hbase-env.sh_
Set HBase environment variables in this file.
Examples include options to pass the JVM on start of an HBase daemon such as heap size and garbage collector configs.
You can also set other HBase configuration, such as log directories, niceness, ssh options, where to locate process pid files, etc.
Open the file at [path]_conf/hbase-env.sh_ and peruse its content.
Open the file at _conf/hbase-env.sh_ and peruse its content.
Each option is fairly well documented.
Add your own environment variables here if you want them read by HBase daemons on startup.
Changes here will require a cluster restart for HBase to notice the change.
[[log4j]]
=== [path]_log4j.properties_
=== _log4j.properties_
Edit this file to change the rate at which HBase log files are rolled and to change the level at which HBase logs messages.
@ -580,11 +580,11 @@ If you are running HBase in standalone mode, you don't need to configure anythin
Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for current critical locations.
ZooKeeper is where all these values are kept.
Thus clients require the location of the ZooKeeper ensemble information before they can do anything else.
Usually this the ensemble location is kept out in the [path]_hbase-site.xml_ and is picked up by the client from the [var]+CLASSPATH+.
Usually the ensemble location is kept out in _hbase-site.xml_ and is picked up by the client from the `CLASSPATH`.
If you are configuring an IDE to run a HBase client, you should include the [path]_conf/_ directory on your classpath so [path]_hbase-site.xml_ settings can be found (or add [path]_src/test/resources_ to pick up the hbase-site.xml used by tests).
If you are configuring an IDE to run an HBase client, you should include the _conf/_ directory on your classpath so _hbase-site.xml_ settings can be found (or add _src/test/resources_ to pick up the hbase-site.xml used by tests).
Minimally, a client of HBase needs several libraries in its [var]+CLASSPATH+ when connecting to a cluster, including:
Minimally, a client of HBase needs several libraries in its `CLASSPATH` when connecting to a cluster, including:
[source]
----
@ -599,10 +599,9 @@ slf4j-log4j (slf4j-log4j12-1.5.8.jar)
zookeeper (zookeeper-3.4.2.jar)
----
An example basic [path]_hbase-site.xml_ for client only might look as follows:
An example basic _hbase-site.xml_ for client only might look as follows:
[source,xml]
----
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
@ -620,7 +619,7 @@ An example basic [path]_hbase-site.xml_ for client only might look as follows:
The configuration used by a Java client is kept in an link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration] instance.
The factory method on HBaseConfiguration, [code]+HBaseConfiguration.create();+, on invocation, will read in the content of the first [path]_hbase-site.xml_ found on the client's [var]+CLASSPATH+, if one is present (Invocation will also factor in any [path]_hbase-default.xml_ found; an hbase-default.xml ships inside the [path]_hbase.X.X.X.jar_). It is also possible to specify configuration directly without having to read from a [path]_hbase-site.xml_.
The factory method on HBaseConfiguration, `HBaseConfiguration.create();`, on invocation, will read in the content of the first _hbase-site.xml_ found on the client's `CLASSPATH`, if one is present (Invocation will also factor in any _hbase-default.xml_ found; an hbase-default.xml ships inside the _hbase.X.X.X.jar_). It is also possible to specify configuration directly without having to read from a _hbase-site.xml_.
For example, to set the ZooKeeper ensemble for the cluster programmatically do as follows:
[source,java]
@ -629,7 +628,7 @@ Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost"); // Here we are running zookeeper locally
----
If multiple ZooKeeper instances make up your ZooKeeper ensemble, they may be specified in a comma-separated list (just as in the [path]_hbase-site.xml_ file). This populated [class]+Configuration+ instance can then be passed to an link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable], and so on.
If multiple ZooKeeper instances make up your ZooKeeper ensemble, they may be specified in a comma-separated list (just as in the _hbase-site.xml_ file). This populated `Configuration` instance can then be passed to an link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable], and so on.
[[example_config]]
== Example Configurations
@ -637,20 +636,18 @@ If multiple ZooKeeper instances make up your ZooKeeper ensemble, they may be spe
=== Basic Distributed HBase Install
Here is an example basic configuration for a distributed ten node cluster.
The nodes are named [var]+example0+, [var]+example1+, etc., through node [var]+example9+ in this example.
The HBase Master and the HDFS namenode are running on the node [var]+example0+.
RegionServers run on nodes [var]+example1+-[var]+example9+.
A 3-node ZooKeeper ensemble runs on [var]+example1+, [var]+example2+, and [var]+example3+ on the default ports.
ZooKeeper data is persisted to the directory [path]_/export/zookeeper_.
Below we show what the main configuration files -- [path]_hbase-site.xml_, [path]_regionservers_, and [path]_hbase-env.sh_ -- found in the HBase [path]_conf_ directory might look like.
The nodes are named `example0`, `example1`, etc., through node `example9` in this example.
The HBase Master and the HDFS namenode are running on the node `example0`.
RegionServers run on nodes `example1`-`example9`.
A 3-node ZooKeeper ensemble runs on `example1`, `example2`, and `example3` on the default ports.
ZooKeeper data is persisted to the directory _/export/zookeeper_.
Below we show what the main configuration files -- _hbase-site.xml_, _regionservers_, and _hbase-env.sh_ -- found in the HBase _conf_ directory might look like.
[[hbase_site]]
==== [path]_hbase-site.xml_
==== _hbase-site.xml_
[source,bourne]
[source,xml]
----
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
@ -685,14 +682,13 @@ Below we show what the main configuration files -- [path]_hbase-site.xml_, [path
----
[[regionservers]]
==== [path]_regionservers_
==== _regionservers_
In this file you list the nodes that will run RegionServers.
In our case, these nodes are [var]+example1+-[var]+example9+.
In our case, these nodes are `example1`-`example9`.
[source]
----
example1
example2
example3
@ -705,12 +701,12 @@ example9
----
[[hbase_env]]
==== [path]_hbase-env.sh_
==== _hbase-env.sh_
The following lines in the [path]_hbase-env.sh_ file show how to set the [var]+JAVA_HOME+ environment variable (required for HBase 0.98.5 and newer) and set the heap to 4 GB (rather than the default value of 1 GB). If you copy and paste this example, be sure to adjust the [var]+JAVA_HOME+ to suit your environment.
The following lines in the _hbase-env.sh_ file show how to set the `JAVA_HOME` environment variable (required for HBase 0.98.5 and newer) and set the heap to 4 GB (rather than the default value of 1 GB). If you copy and paste this example, be sure to adjust the `JAVA_HOME` to suit your environment.
[source,bash]
----
# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.7.0/
@ -718,7 +714,7 @@ export JAVA_HOME=/usr/java/jdk1.7.0/
export HBASE_HEAPSIZE=4096
----
Use +rsync+ to copy the content of the [path]_conf_ directory to all nodes of the cluster.
Use +rsync+ to copy the content of the _conf_ directory to all nodes of the cluster.
[[important_configurations]]
== The Important Configurations
@ -736,7 +732,7 @@ Review the <<os,os>> and <<hadoop,hadoop>> sections.
On a cluster with a lot of regions, it is possible that if an eager-beaver regionserver checks in soon after master start while all the other regionservers are lagging, this first server to check in will be assigned all regions.
If there are many regions, this first server could buckle under the load.
To prevent the above scenario happening up the [var]+hbase.master.wait.on.regionservers.mintostart+ from its default value of 1.
To prevent the above scenario from happening, raise `hbase.master.wait.on.regionservers.mintostart` from its default value of 1.
See link:https://issues.apache.org/jira/browse/HBASE-6389[HBASE-6389 Modify the
conditions to ensure that Master waits for sufficient number of Region Servers before
starting region assignments] for more detail.
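
As an illustration (not a recommendation for any particular value), a sketch of raising the setting in _hbase-site.xml_ so the Master waits for, say, three RegionServers before assigning regions:

[source,xml]
----
<property>
  <name>hbase.master.wait.on.regionservers.mintostart</name>
  <!-- Hypothetical value; size this to your cluster. -->
  <value>3</value>
</property>
----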
@ -756,13 +752,13 @@ See the configuration <<fail.fast.expired.active.master,fail.fast.expired.active
==== ZooKeeper Configuration
[[sect.zookeeper.session.timeout]]
===== [var]+zookeeper.session.timeout+
===== `zookeeper.session.timeout`
The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and starts recovery.
You might like to tune the timeout down to a minute or even less, so the Master notices failures sooner.
Before changing this value, be sure you have your JVM garbage collection configuration under control; otherwise, a long garbage collection that lasts beyond the ZooKeeper session timeout will take out your RegionServer. (You might be fine with this -- you probably want recovery to start on the server if a RegionServer has been in GC for a long period of time.)
To change this configuration, edit [path]_hbase-site.xml_, copy the changed file around the cluster and restart.
To change this configuration, edit _hbase-site.xml_, copy the changed file around the cluster and restart.
We set this value high to save our having to field noob questions up on the mailing lists asking why a RegionServer went down during a massive import.
The usual cause is that their JVM is untuned and they are running into long GC pauses.
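
If you do decide to lower the timeout, a sketch of the entry in _hbase-site.xml_, with the value in milliseconds (one minute shown here, only advisable with GC tuned):

[source,xml]
----
<property>
  <name>zookeeper.session.timeout</name>
  <!-- Illustrative one-minute timeout, in milliseconds. -->
  <value>60000</value>
</property>
----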
@ -781,11 +777,11 @@ See <<zookeeper,zookeeper>>.
===== dfs.datanode.failed.volumes.tolerated
This is the "...number of volumes that are allowed to fail before a datanode stops offering service.
By default any volume failure will cause a datanode to shutdown" from the [path]_hdfs-default.xml_ description.
By default any volume failure will cause a datanode to shutdown" from the _hdfs-default.xml_ description.
If you have more than three or four disks, you might want to set this to 1; if you have many disks, set it to 2 or more.
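
This property is set on the HDFS side, in _hdfs-site.xml_ on the datanodes. A sketch with an illustrative value:

[source,xml]
----
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <!-- Hypothetical: tolerate one failed volume before the datanode stops offering service. -->
  <value>1</value>
</property>
----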
[[hbase.regionserver.handler.count_description]]
==== [var]+hbase.regionserver.handler.count+
==== `hbase.regionserver.handler.count`
This setting defines the number of threads that are kept open to answer incoming requests to user tables.
The rule of thumb is to keep this number low when the payload per request approaches the MB (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The total size of the queries in progress is limited by the setting "hbase.ipc.server.max.callqueue.size".
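
A sketch of raising the handler count in _hbase-site.xml_ for a small-payload, get-heavy workload; the value shown is illustrative, not a recommendation:

[source,xml]
----
<property>
  <name>hbase.regionserver.handler.count</name>
  <!-- Hypothetical value for many small requests. -->
  <value>60</value>
</property>
----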
@ -826,9 +822,9 @@ However, as all memstores are not expected to be full all the time, less WAL fil
[[disable.splitting]]
==== Managed Splitting
HBase generally handles splitting your regions, based upon the settings in your [path]_hbase-default.xml_ and [path]_hbase-site.xml_ configuration files.
Important settings include [var]+hbase.regionserver.region.split.policy+, [var]+hbase.hregion.max.filesize+, [var]+hbase.regionserver.regionSplitLimit+.
A simplistic view of splitting is that when a region grows to [var]+hbase.hregion.max.filesize+, it is split.
HBase generally handles splitting your regions, based upon the settings in your _hbase-default.xml_ and _hbase-site.xml_ configuration files.
Important settings include `hbase.regionserver.region.split.policy`, `hbase.hregion.max.filesize`, `hbase.regionserver.regionSplitLimit`.
A simplistic view of splitting is that when a region grows to `hbase.hregion.max.filesize`, it is split.
For most use patterns, most of the time, you should use automatic splitting.
See <<manual_region_splitting_decisions,manual region splitting decisions>> for more information about manual region splitting.
@ -839,7 +835,7 @@ Manual splitting can mitigate region creation and movement under load.
It also makes it so region boundaries are known and invariant (if you disable region splitting). If you use manual splits, it is easier to do staggered, time-based major compactions to spread out your network IO load.
.Disable Automatic Splitting
To disable automatic splitting, set [var]+hbase.hregion.max.filesize+ to a very large value, such as [literal]+100 GB+ It is not recommended to set it to its absolute maximum value of [literal]+Long.MAX_VALUE+.
To disable automatic splitting, set `hbase.hregion.max.filesize` to a very large value, such as `100 GB`. It is not recommended to set it to its absolute maximum value of `Long.MAX_VALUE`.
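
A sketch of what that looks like in _hbase-site.xml_, with 100 GB expressed in bytes (illustrative only; read the note that follows before doing this):

[source,xml]
----
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 100 GB, effectively disabling automatic splits. -->
  <value>107374182400</value>
</property>
----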
.Automatic Splitting Is Recommended
[NOTE]
@ -858,8 +854,8 @@ The goal is for the largest region to be just large enough that the compaction s
Otherwise, the cluster can be prone to compaction storms, where a large number of regions are under compaction at the same time.
It is important to understand that the data growth causes compaction storms, and not the manual split decision.
If the regions are split into too many large regions, you can increase the major compaction interval by configuring [var]+HConstants.MAJOR_COMPACTION_PERIOD+.
HBase 0.90 introduced [class]+org.apache.hadoop.hbase.util.RegionSplitter+, which provides a network-IO-safe rolling split of all regions.
If the regions are split into too many large regions, you can increase the major compaction interval by configuring `HConstants.MAJOR_COMPACTION_PERIOD`.
HBase 0.90 introduced `org.apache.hadoop.hbase.util.RegionSplitter`, which provides a network-IO-safe rolling split of all regions.
[[managed.compactions]]
==== Managed Compactions
@ -868,7 +864,7 @@ By default, major compactions are scheduled to run once in a 7-day period.
Prior to HBase 0.96.x, major compactions were scheduled to happen once per day by default.
If you need to control exactly when and how often major compaction runs, you can disable managed major compactions.
See the entry for [var]+hbase.hregion.majorcompaction+ in the <<compaction.parameters,compaction.parameters>> table for details.
See the entry for `hbase.hregion.majorcompaction` in the <<compaction.parameters,compaction.parameters>> table for details.
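
If you do decide to manage major compactions yourself, a sketch of turning off the timed schedule in _hbase-site.xml_ (a value of 0 disables time-based major compactions; see the warning that follows):

[source,xml]
----
<property>
  <name>hbase.hregion.majorcompaction</name>
  <!-- 0 disables time-based major compactions; you must then trigger them yourself. -->
  <value>0</value>
</property>
----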
.Do Not Disable Major Compactions
[WARNING]
@ -885,7 +881,7 @@ For more information about compactions and the compaction file selection process
==== Speculative Execution
Speculative Execution of MapReduce tasks is on by default, and for HBase clusters it is generally advised to turn off Speculative Execution at a system-level unless you need it for a specific case, where it can be configured per-job.
Set the properties [var]+mapreduce.map.speculative+ and [var]+mapreduce.reduce.speculative+ to false.
Set the properties `mapreduce.map.speculative` and `mapreduce.reduce.speculative` to false.
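
A sketch of the corresponding entries, which would typically go in _mapred-site.xml_ (or be set per-job in the job configuration):

[source,xml]
----
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
----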
[[other_configuration]]
=== Other Configurations
@ -894,14 +890,14 @@ Set the properties [var]+mapreduce.map.speculative+ and [var]+mapreduce.reduce.s
==== Balancer
The balancer is a periodic operation which is run on the master to redistribute regions on the cluster.
It is configured via [var]+hbase.balancer.period+ and defaults to 300000 (5 minutes).
It is configured via `hbase.balancer.period` and defaults to 300000 (5 minutes).
See <<master.processes.loadbalancer,master.processes.loadbalancer>> for more information on the LoadBalancer.
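
To run the balancer more or less often, the period can be overridden in _hbase-site.xml_; a sketch with an illustrative ten-minute period:

[source,xml]
----
<property>
  <name>hbase.balancer.period</name>
  <!-- Hypothetical ten-minute period, in milliseconds. -->
  <value>600000</value>
</property>
----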
[[disabling.blockcache]]
==== Disabling Blockcache
Do not turn off block cache (You'd do it by setting [var]+hbase.block.cache.size+ to zero). Currently we do not do well if you do this because the regionserver will spend all its time loading hfile indices over and over again.
Do not turn off block cache (You'd do it by setting `hbase.block.cache.size` to zero). Currently we do not do well if you do this because the regionserver will spend all its time loading hfile indices over and over again.
If your working set is such that the block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea of the size you need by surveying the regionserver UIs; you'll see the index block size accounted near the top of the webpage).
[[nagles]]
@ -925,8 +921,6 @@ HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and late hadoo
[source,xml]
----
<property>
<property>
<name>hbase.lease.recovery.dfs.timeout</name>
<value>23000</value>
@ -944,7 +938,6 @@ And on the namenode/datanode side, set the following to enable 'staleness' intro
[source,xml]
----
<property>
<name>dfs.client.socket-timeout</name>
<value>10000</value>
@ -991,7 +984,7 @@ See link:http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.
Historically, besides the port mentioned above, JMX opens two additional random TCP listening ports, which can lead to port conflicts. (See link:https://issues.apache.org/jira/browse/HBASE-10289[HBASE-10289] for details.)
As an alternative, you can use the coprocessor-based JMX implementation provided by HBase.
To enable it in 0.99 or above, add below property in [path]_hbase-site.xml_:
To enable it in 0.99 or above, add the following property to _hbase-site.xml_:
[source,xml]
----
@ -1009,7 +1002,6 @@ The reason why you only configure coprocessor for 'regionserver' is that, starti
[source,xml]
----
<property>
<name>regionserver.rmi.registry.port</name>
<value>61130</value>
@ -1024,7 +1016,8 @@ The registry port can be shared with connector port in most cases, so you only n
However, if you want to use SSL communication, the two ports must be configured to different values.
By default, password authentication and SSL communication are disabled.
To enable password authentication, you need to update [path]_hbase-env.sh_ like below:
To enable password authentication, you need to update _hbase-env.sh_ like below:
[source,bash]
----
export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.authenticate=true \
-Dcom.sun.management.jmxremote.password.file=your_password_file \
@ -1038,6 +1031,7 @@ See example password/access file under $JRE_HOME/lib/management.
To enable SSL communication with password authentication, follow the steps below:
[source,bash]
----
#1. generate a key pair, stored in myKeyStore
keytool -genkey -alias jconsole -keystore myKeyStore
@ -1049,10 +1043,10 @@ keytool -export -alias jconsole -keystore myKeyStore -file jconsole.cert
keytool -import -alias jconsole -keystore jconsoleKeyStore -file jconsole.cert
----
And then update [path]_hbase-env.sh_ like below:
And then update _hbase-env.sh_ like below:
[source,bash]
----
export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=true \
-Djavax.net.ssl.keyStore=/home/tianq/myKeyStore \
-Djavax.net.ssl.keyStorePassword=your_password_in_step_1 \
@ -1066,11 +1060,12 @@ export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE "
Finally start jconsole on client using the key store:
[source,bash]
----
jconsole -J-Djavax.net.ssl.trustStore=/home/tianq/jconsoleKeyStore
----
NOTE: for HBase 0.98, To enable the HBase JMX implementation on Master, you also need to add below property in [path]_hbase-site.xml_:
NOTE: For HBase 0.98, to enable the HBase JMX implementation on the Master, you also need to add the following property to _hbase-site.xml_:
[source,xml]
----
@ -69,7 +69,7 @@ Endpoints (HBase 0.94.x and earlier)::
== Examples
An example of an observer is included in [path]_hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestZooKeeperScanPolicyObserver.java_.
An example of an observer is included in _hbase-examples/src/test/java/org/apache/hadoop/hbase/coprocessor/example/TestZooKeeperScanPolicyObserver.java_.
Several endpoint examples are included in the same directory.
== Building A Coprocessor
@ -80,11 +80,11 @@ You can load the coprocessor from your HBase configuration, so that the coproces
=== Load from Configuration
To configure a coprocessor to be loaded when HBase starts, modify the RegionServer's [path]_hbase-site.xml_ and configure one of the following properties, based on the type of observer you are configuring:
To configure a coprocessor to be loaded when HBase starts, modify the RegionServer's _hbase-site.xml_ and configure one of the following properties, based on the type of observer you are configuring:
* [code]+hbase.coprocessor.region.classes+for RegionObservers and Endpoints
* [code]+hbase.coprocessor.wal.classes+for WALObservers
* [code]+hbase.coprocessor.master.classes+for MasterObservers
* `hbase.coprocessor.region.classes` for RegionObservers and Endpoints
* `hbase.coprocessor.wal.classes` for WALObservers
* `hbase.coprocessor.master.classes` for MasterObservers
.Example RegionObserver Configuration
====
@ -105,7 +105,7 @@ Therefore, the jar file must reside on the server-side HBase classpath.
Coprocessors which are loaded in this way will be active on all regions of all tables.
These are the system coprocessor introduced earlier.
The first listed coprocessors will be assigned the priority [literal]+Coprocessor.Priority.SYSTEM+.
The first listed coprocessors will be assigned the priority `Coprocessor.Priority.SYSTEM`.
Each subsequent coprocessor in the list will have its priority value incremented by one (which reduces its priority, because priorities have the natural sort order of Integers).
When calling out to registered observers, the framework executes their callback methods in the sorted order of their priority.
@ -145,7 +145,7 @@ DESCRIPTION ENABLED
====
The coprocessor framework will try to read the class information from the coprocessor table attribute value.
The value contains four pieces of information which are separated by the [literal]+|+ character.
The value contains four pieces of information which are separated by the `|` character.
* File path: The jar file containing the coprocessor implementation must be in a location where all region servers can read it.
You could copy the file onto the local disk on each region server, but it is recommended to store it in HDFS.
@ -31,7 +31,8 @@ In HBase, data is stored in tables, which have rows and columns.
This is a terminology overlap with relational databases (RDBMSs), but this is not a helpful analogy.
Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
.HBase Data Model TerminologyTable::
.HBase Data Model Terminology
Table::
An HBase table consists of multiple rows.
Row::
@ -43,7 +44,7 @@ Row::
If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain.
Column::
A column in HBase consists of a column family and a column qualifier, which are delimited by a [literal]+:+ (colon) character.
A column in HBase consists of a column family and a column qualifier, which are delimited by a `:` (colon) character.
Column Family::
Column families physically colocate a set of columns and their values, often for performance reasons.
@ -52,7 +53,7 @@ Column Family::
Column Qualifier::
A column qualifier is added to a column family to provide the index for a given piece of data.
Given a column family [literal]+content+, a column qualifier might be [literal]+content:html+, and another might be [literal]+content:pdf+.
Given a column family `content`, a column qualifier might be `content:html`, and another might be `content:pdf`.
Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows.
Cell::
@ -65,48 +66,39 @@ Timestamp::
[[conceptual.view]]
== Conceptual View
You can read a very understandable explanation of the HBase data model in the blog post link:http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable[Understanding
HBase and BigTable] by Jim R.
Wilson.
Another good explanation is available in the PDF link:http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf[Introduction
to Basic Schema Design] by Amandeep Khurana.
You can read a very understandable explanation of the HBase data model in the blog post link:http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable[Understanding HBase and BigTable] by Jim R. Wilson.
Another good explanation is available in the PDF link:http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf[Introduction
to Basic Schema Design] by Amandeep Khurana.
It may help to read different perspectives to get a solid understanding of HBase schema design.
The linked articles cover the same ground as the information in this section.
The following example is a slightly modified form of the one on page 2 of the link:http://research.google.com/archive/bigtable.html[BigTable] paper.
There is a table called [var]+webtable+ that contains two rows ([literal]+com.cnn.www+ and [literal]+com.example.www+), three column families named [var]+contents+, [var]+anchor+, and [var]+people+.
In this example, for the first row ([literal]+com.cnn.www+), [var]+anchor+ contains two columns ([var]+anchor:cssnsi.com+, [var]+anchor:my.look.ca+) and [var]+contents+ contains one column ([var]+contents:html+). This example contains 5 versions of the row with the row key [literal]+com.cnn.www+, and one version of the row with the row key [literal]+com.example.www+.
The [var]+contents:html+ column qualifier contains the entire HTML of a given website.
Qualifiers of the [var]+anchor+ column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link.
The [var]+people+ column family represents people associated with the site.
There is a table called `webtable` that contains two rows (`com.cnn.www` and `com.example.www`), three column families named `contents`, `anchor`, and `people`.
In this example, for the first row (`com.cnn.www`), `anchor` contains two columns (`anchor:cssnsi.com`, `anchor:my.look.ca`) and `contents` contains one column (`contents:html`). This example contains 5 versions of the row with the row key `com.cnn.www`, and one version of the row with the row key `com.example.www`.
The `contents:html` column qualifier contains the entire HTML of a given website.
Qualifiers of the `anchor` column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link.
The `people` column family represents people associated with the site.
.Column Names
[NOTE]
====
By convention, a column name is made of its column family prefix and a _qualifier_.
For example, the column _contents:html_ is made up of the column family [var]+contents+ and the [var]+html+ qualifier.
The colon character ([literal]+:+) delimits the column family from the column family _qualifier_.
For example, the column _contents:html_ is made up of the column family `contents` and the `html` qualifier.
The colon character (`:`) delimits the column family from the column family _qualifier_.
====
.Table [var]+webtable+
.Table `webtable`
[cols="1,1,1,1,1", frame="all", options="header"]
|===
| Row Key
| Time Stamp
| ColumnFamily contents
| ColumnFamily anchor
| ColumnFamily people
| anchor:cnnsi.com = "CNN"
| anchor:my.look.ca = "CNN.com"
| contents:html = "<html>..."
| contents:html = "<html>..."
| contents:html = "<html>..."
| contents:html = "<html>..."
|Row Key |Time Stamp |ColumnFamily `contents` |ColumnFamily `anchor`|ColumnFamily `people`
|"com.cnn.www" |t9 | |anchor:cnnsi.com = "CNN" |
|"com.cnn.www" |t8 | |anchor:my.look.ca = "CNN.com" |
|"com.cnn.www" |t6 | contents:html = "<html>..." | |
|"com.cnn.www" |t5 | contents:html = "<html>..." | |
|"com.cnn.www" |t3 | contents:html = "<html>..." | |
|"com.example.www"| t5 | contents:html = "<html>..." | people:author = "John Doe"
|===
Cells in this table that appear to be empty do not take space, or in fact exist, in HBase.
@ -114,9 +106,8 @@ This is what makes HBase "sparse." A tabular view is not the only possible way t
The following represents the same information as a multi-dimensional map.
This is only a mock-up for illustrative purposes and may not be strictly accurate.
[source]
[source,json]
----
{
"com.cnn.www": {
contents: {
@ -148,36 +139,31 @@ This is only a mock-up for illustrative purposes and may not be strictly accurat
Although at a conceptual level tables may be viewed as a sparse set of rows, they are physically stored by column family.
A new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.
.ColumnFamily [var]+anchor+
.ColumnFamily `anchor`
[cols="1,1,1", frame="all", options="header"]
|===
| Row Key
| Time Stamp
| Column Family anchor
| anchor:cnnsi.com = "CNN"
| anchor:my.look.ca = "CNN.com"
|Row Key | Time Stamp |Column Family `anchor`
|"com.cnn.www" |t9 |`anchor:cnnsi.com = "CNN"`
|"com.cnn.www" |t8 |`anchor:my.look.ca = "CNN.com"`
|===
.ColumnFamily [var]+contents+
.ColumnFamily `contents`
[cols="1,1,1", frame="all", options="header"]
|===
| Row Key
| Time Stamp
| ColumnFamily "contents:"
| contents:html = "<html>..."
| contents:html = "<html>..."
| contents:html = "<html>..."
|Row Key |Time Stamp |ColumnFamily `contents:`
|"com.cnn.www" |t6 |contents:html = "<html>..."
|"com.cnn.www" |t5 |contents:html = "<html>..."
|"com.cnn.www" |t3 |contents:html = "<html>..."
|===
The empty cells shown in the conceptual view are not stored at all.
Thus a request for the value of the [var]+contents:html+ column at time stamp [literal]+t8+ would return no value.
Similarly, a request for an [var]+anchor:my.look.ca+ value at time stamp [literal]+t9+ would return no value.
Thus a request for the value of the `contents:html` column at time stamp `t8` would return no value.
Similarly, a request for an `anchor:my.look.ca` value at time stamp `t9` would return no value.
However, if no timestamp is supplied, the most recent value for a particular column would be returned.
Given multiple versions, the most recent is also the first one found, since timestamps are stored in descending order.
Thus a request for the values of all columns in the row [var]+com.cnn.www+ if no timestamp is specified would be: the value of [var]+contents:html+ from timestamp [literal]+t6+, the value of [var]+anchor:cnnsi.com+ from timestamp [literal]+t9+, the value of [var]+anchor:my.look.ca+ from timestamp [literal]+t8+.
Thus a request for the values of all columns in the row `com.cnn.www` if no timestamp is specified would be: the value of `contents:html` from timestamp `t6`, the value of `anchor:cnnsi.com` from timestamp `t9`, the value of `anchor:my.look.ca` from timestamp `t8`.
For more information about the internals of how Apache HBase stores data, see <<regions.arch,regions.arch>>.
@ -269,7 +255,7 @@ The empty byte array is used to denote both the start and end of a tables' names
Columns in Apache HBase are grouped into _column families_.
All column members of a column family have the same prefix.
For example, the columns _courses:history_ and _courses:math_ are both members of the _courses_ column family.
The colon character ([literal]+:+) delimits the column family from the
The colon character (`:`) delimits the column family from the
column family qualifier.
The column family prefix must be composed of _printable_ characters.
The qualifying tail, the column family _qualifier_, can be made of any arbitrary bytes.
@ -280,7 +266,7 @@ Because tunings and storage specifications are done at the column family level,
== Cells
A _{row, column, version}_ tuple exactly specifies a [literal]+cell+ in HBase.
A _{row, column, version}_ tuple exactly specifies a `cell` in HBase.
Cell content is uninterpreted bytes.
== Data Model Operations
@ -344,15 +330,15 @@ See <<version.delete,version.delete>> for more information on deleting versions
[[versions]]
== Versions
A _{row, column, version}_ tuple exactly specifies a [literal]+cell+ in HBase.
A _{row, column, version}_ tuple exactly specifies a `cell` in HBase.
It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.
While rows and column keys are expressed as bytes, the version is specified using a long integer.
Typically this long contains time instances such as those returned by [code]+java.util.Date.getTime()+ or [code]+System.currentTimeMillis()+, that is: [quote]_the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC_.
Typically this long contains time instances such as those returned by `java.util.Date.getTime()` or `System.currentTimeMillis()`, that is: [quote]_the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC_.
The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.
There is a lot of confusion over the semantics of [literal]+cell+ versions, in HBase.
There is a lot of confusion over the semantics of `cell` versions in HBase.
In particular:
* If multiple writes to a cell have the same version, only the last written is fetchable.
@ -367,12 +353,12 @@ This section is basically a synopsis of this article by Bruno Dumon.
[[specify.number.of.versions]]
=== Specifying the Number of Versions to Store
The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an +alter+ command, via [code]+HColumnDescriptor.DEFAULT_VERSIONS+.
Prior to HBase 0.96, the default number of versions kept was [literal]+3+, but in 0.96 and newer has been changed to [literal]+1+.
The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an +alter+ command, via `HColumnDescriptor.DEFAULT_VERSIONS`.
Prior to HBase 0.96, the default number of versions kept was `3`, but in 0.96 and newer has been changed to `1`.
.Modify the Maximum Number of Versions for a Column
====
This example uses HBase Shell to keep a maximum of 5 versions of column [code]+f1+.
This example uses HBase Shell to keep a maximum of 5 versions of column `f1`.
You could also use link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
----
@ -384,7 +370,7 @@ hbase> alter t1, NAME => f1, VERSIONS => 5
====
You can also specify the minimum number of versions to store.
By default, this is set to 0, which means the feature is disabled.
The following example sets the minimum number of versions on field [code]+f1+ to [literal]+2+, via HBase Shell.
The following example sets the minimum number of versions on field `f1` to `2`, via HBase Shell.
You could also use link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
----
@ -392,7 +378,7 @@ hbase> alter t1, NAME => f1, MIN_VERSIONS => 2
----
====
Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly-created columns, by setting +hbase.column.max.version+ in [path]_hbase-site.xml_.
Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly-created columns, by setting +hbase.column.max.version+ in _hbase-site.xml_.
See <<hbase.column.max.version,hbase.column.max.version>>.
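
A sketch of setting that global default in _hbase-site.xml_; the value shown is illustrative:

[source,xml]
----
<property>
  <name>hbase.column.max.version</name>
  <!-- Hypothetical default of 3 versions for newly-created columns. -->
  <value>3</value>
</property>
----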
[[versions.ops]]
@ -406,7 +392,7 @@ Gets are implemented on top of Scans.
The below discussion of link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] applies equally to link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scans].
By default, i.e.
if you specify no explicit version, when doing a [literal]+get+, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
if you specify no explicit version, when doing a `get`, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
* to return more than one version, see link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions()[Get.setMaxVersions()]
* to return versions other than the latest, see link:???[Get.setTimeRange()]
@ -448,8 +434,8 @@ List<KeyValue> kv = r.getColumn(CF, ATTR); // returns all versions of this colu
==== Put
Doing a put always creates a new version of a [literal]+cell+, at a certain timestamp.
By default the system uses the server's [literal]+currentTimeMillis+, but you can specify the version (= the long integer) yourself, on a per-column level.
Doing a put always creates a new version of a `cell`, at a certain timestamp.
By default the system uses the server's `currentTimeMillis`, but you can specify the version (= the long integer) yourself, on a per-column level.
This means you could assign a time in the past or the future, or use the long value for non-time purposes.
To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you would overshadow.
@ -504,7 +490,7 @@ When deleting an entire row, HBase will internally create a tombstone for each C
Deletes work by creating _tombstone_ markers.
For example, let's suppose we want to delete a row.
For this you can specify a version, or else by default the [literal]+currentTimeMillis+ is used.
For this you can specify a version, or else by default the `currentTimeMillis` is used.
What this means is [quote]_delete all
cells where the version is less than or equal to this version_.
HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition.
@ -518,7 +504,7 @@ For an informative discussion on how deletes and versioning interact, see the th
Also see <<keyvalue,keyvalue>> for more information on the internal KeyValue format.
Delete markers are purged during the next major compaction of the store, unless the +KEEP_DELETED_CELLS+ option is set in the column family.
To keep the deletes for a configurable amount of time, you can set the delete TTL via the +hbase.hstore.time.to.purge.deletes+ property in [path]_hbase-site.xml_.
To keep the deletes for a configurable amount of time, you can set the delete TTL via the +hbase.hstore.time.to.purge.deletes+ property in _hbase-site.xml_.
If +hbase.hstore.time.to.purge.deletes+ is not set, or set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction.
Otherwise, a delete marker with a timestamp in the future is kept until the major compaction which occurs after the time represented by the marker's timestamp plus the value of +hbase.hstore.time.to.purge.deletes+, in milliseconds.
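
A sketch of setting the delete TTL in _hbase-site.xml_; the value is in milliseconds and is illustrative only:

[source,xml]
----
<property>
  <name>hbase.hstore.time.to.purge.deletes</name>
  <!-- Hypothetical: keep delete markers for one day (in milliseconds). -->
  <value>86400000</value>
</property>
----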
@ -52,7 +52,7 @@ Posing questions - and helping to answer other people's questions - is encourage
[[irc]]
=== Internet Relay Chat (IRC)
For real-time questions and discussions, use the [literal]+#hbase+ IRC channel on the link:https://freenode.net/[FreeNode] IRC network.
For real-time questions and discussions, use the `#hbase` IRC channel on the link:https://freenode.net/[FreeNode] IRC network.
FreeNode offers a web-based client, but most people prefer a native client, and several clients are available for each operating system.
=== Jira
@ -99,13 +99,13 @@ Updating hbase.apache.org still requires use of SVN (See <<hbase.org,hbase.org>>
[[eclipse.code.formatting]]
==== Code Formatting
Under the [path]_dev-support/_ folder, you will find [path]_hbase_eclipse_formatter.xml_.
Under the _dev-support/_ folder, you will find _hbase_eclipse_formatter.xml_.
We encourage you to have this formatter in place in eclipse when editing HBase code.
.Procedure: Load the HBase Formatter Into Eclipse
. Open the menu item.
. In Preferences, click the menu item.
. Click btn:[Import] and browse to the location of the [path]_hbase_eclipse_formatter.xml_ file, which is in the [path]_dev-support/_ directory.
. Click btn:[Import] and browse to the location of the _hbase_eclipse_formatter.xml_ file, which is in the _dev-support/_ directory.
Click btn:[Apply].
. Still in Preferences, click .
Be sure the following options are selected:
@ -120,7 +120,7 @@ Close all dialog boxes and return to the main window.
In addition to the automatic formatting, make sure you follow the style guidelines explained in <<common.patch.feedback,common.patch.feedback>>
Also, no [code]+@author+ tags - that's a rule.
Also, no `@author` tags - that's a rule.
Quality Javadoc comments are appreciated.
And include the Apache license.
@ -130,18 +130,18 @@ And include the Apache license.
If you cloned the project via git, download and install the Git plugin (EGit). Attach to your local git repo (via the [label]#Git Repositories# window) and you'll be able to see file revision history, generate patches, etc.
[[eclipse.maven.setup]]
==== HBase Project Setup in Eclipse using [code]+m2eclipse+
==== HBase Project Setup in Eclipse using `m2eclipse`
The easiest way is to use the +m2eclipse+ plugin for Eclipse.
Eclipse Indigo or newer includes +m2eclipse+, or you can download it from link:http://www.eclipse.org/m2e//. It provides Maven integration for Eclipse, and even lets you use the direct Maven commands from within Eclipse to compile and test your project.
To import the project, click and select the HBase root directory. [code]+m2eclipse+ locates all the hbase modules for you.
To import the project, click and select the HBase root directory. `m2eclipse` locates all the hbase modules for you.
If you install +m2eclipse+ and import HBase in your workspace, do the following to fix your eclipse Build Path.
. Remove [path]_target_ folder
. Add [path]_target/generated-jamon_ and [path]_target/generated-sources/java_ folders.
. Remove from your Build Path the exclusions on the [path]_src/main/resources_ and [path]_src/test/resources_ to avoid error message in the console, such as the following:
. Remove _target_ folder
. Add _target/generated-jamon_ and _target/generated-sources/java_ folders.
. Remove from your Build Path the exclusions on the _src/main/resources_ and _src/test/resources_ to avoid error message in the console, such as the following:
+
----
Failed to execute goal
@ -156,7 +156,7 @@ This will also reduce the eclipse build cycles and make your life easier when de
[[eclipse.commandline]]
==== HBase Project Setup in Eclipse Using the Command Line
Instead of using [code]+m2eclipse+, you can generate the Eclipse files from the command line.
Instead of using `m2eclipse`, you can generate the Eclipse files from the command line.
. First, run the following command, which builds HBase.
You only need to do this once.
@ -166,20 +166,20 @@ Instead of using [code]+m2eclipse+, you can generate the Eclipse files from the
mvn clean install -DskipTests
----
. Close Eclipse, and execute the following command from the terminal, in your local HBase project directory, to generate new [path]_.project_ and [path]_.classpath_ files.
. Close Eclipse, and execute the following command from the terminal, in your local HBase project directory, to generate new _.project_ and _.classpath_ files.
+
[source,bourne]
----
mvn eclipse:eclipse
----
. Reopen Eclipse and import the [path]_.project_ file in the HBase directory to a workspace.
. Reopen Eclipse and import the _.project_ file in the HBase directory to a workspace.
[[eclipse.maven.class]]
==== Maven Classpath Variable
The [var]+$M2_REPO+ classpath variable needs to be set up for the project.
This needs to be set to your local Maven repository, which is usually [path]_~/.m2/repository_
The `$M2_REPO` classpath variable needs to be set up for the project.
This needs to be set to your local Maven repository, which is usually _~/.m2/repository_
If this classpath variable is not configured, you will see compile errors in Eclipse like this:
@ -195,7 +195,7 @@ Unbound classpath variable: 'M2_REPO/com/google/protobuf/protobuf-java/2.3.0/pro
[[eclipse.issues]]
==== Eclipse Known Issues
Eclipse will currently complain about [path]_Bytes.java_.
Eclipse will currently complain about _Bytes.java_.
It is not possible to turn these errors off.
----
@ -254,7 +254,7 @@ All commands are executed from the local HBase project directory.
===== Package
The simplest command to compile HBase from its java source code is to use the [code]+package+ target, which builds JARs with the compiled files.
The simplest command to compile HBase from its java source code is to use the `package` target, which builds JARs with the compiled files.
[source,bourne]
----
@ -274,7 +274,7 @@ To create the full installable HBase package takes a little bit more work, so re
[[maven.build.commands.compile]]
===== Compile
The [code]+compile+ target does not create the JARs with the compiled files.
The `compile` target does not create the JARs with the compiled files.
[source,bourne]
----
@ -288,7 +288,7 @@ mvn clean compile
===== Install
To install the JARs in your [path]_~/.m2/_ directory, use the [code]+install+ target.
To install the JARs in your _~/.m2/_ directory, use the `install` target.
[source,bourne]
----
@ -323,8 +323,8 @@ To change the version to build against, add a hadoop.profile property when you i
mvn -Dhadoop.profile=1.0 ...
----
The above will build against whatever explicit hadoop 1.x version we have in our [path]_pom.xml_ as our '1.0' version.
Tests may not all pass so you may need to pass [code]+-DskipTests+ unless you are inclined to fix the failing tests.
The above will build against whatever explicit hadoop 1.x version we have in our _pom.xml_ as our '1.0' version.
Tests may not all pass so you may need to pass `-DskipTests` unless you are inclined to fix the failing tests.
.'dependencyManagement.dependencies.dependency.artifactId' for org.apache.hbase:${compat.module}:test-jar with value '${compat.module}' does not match a valid id pattern
[NOTE]
@ -348,18 +348,18 @@ mvn -Dhadoop.profile=22 ...
[[build.protobuf]]
==== Build Protobuf
You may need to change the protobuf definitions that reside in the [path]_hbase-protocol_ module or other modules.
You may need to change the protobuf definitions that reside in the _hbase-protocol_ module or other modules.
The protobuf files are located in [path]_hbase-protocol/src/main/protobuf_.
The protobuf files are located in _hbase-protocol/src/main/protobuf_.
For the change to be effective, you will need to regenerate the classes.
You can use maven profile [code]+compile-protobuf+ to do this.
You can use maven profile `compile-protobuf` to do this.
[source,bourne]
----
mvn compile -Pcompile-protobuf
----
You may also want to define [var]+protoc.path+ for the protoc binary, using the following command:
You may also want to define `protoc.path` for the protoc binary, using the following command:
[source,bourne]
----
@ -367,23 +367,23 @@ You may also want to define [var]+protoc.path+ for the protoc binary, using the
mvn compile -Pcompile-protobuf -Dprotoc.path=/opt/local/bin/protoc
----
Read the [path]_hbase-protocol/README.txt_ for more details.
Read the _hbase-protocol/README.txt_ for more details.
[[build.thrift]]
==== Build Thrift
You may need to change the thrift definitions that reside in the [path]_hbase-thrift_ module or other modules.
You may need to change the thrift definitions that reside in the _hbase-thrift_ module or other modules.
The thrift files are located in [path]_hbase-thrift/src/main/resources_.
The thrift files are located in _hbase-thrift/src/main/resources_.
For the change to be effective, you will need to regenerate the classes.
You can use maven profile [code]+compile-thrift+ to do this.
You can use maven profile `compile-thrift` to do this.
[source,bourne]
----
mvn compile -Pcompile-thrift
----
You may also want to define [var]+thrift.path+ for the thrift binary, using the following command:
You may also want to define `thrift.path` for the thrift binary, using the following command:
[source,bourne]
----
@ -399,12 +399,12 @@ You can build a tarball without going through the release process described in <
mvn -DskipTests clean install && mvn -DskipTests package assembly:single
----
The distribution tarball is built in [path]_hbase-assembly/target/hbase-<version>-bin.tar.gz_.
The distribution tarball is built in _hbase-assembly/target/hbase-<version>-bin.tar.gz_.
[[build.gotchas]]
==== Build Gotchas
If you see [code]+Unable to find resource 'VM_global_library.vm'+, ignore it.
If you see `Unable to find resource 'VM_global_library.vm'`, ignore it.
Its not an error.
It is link:http://jira.codehaus.org/browse/MSITE-286[officially
ugly] though.
@ -412,7 +412,7 @@ It is link:http://jira.codehaus.org/browse/MSITE-286[officially
[[build.snappy]]
==== Building in snappy compression support
Pass [code]+-Psnappy+ to trigger the [code]+hadoop-snappy+ maven profile for building Google Snappy native libraries into HBase.
Pass `-Psnappy` to trigger the `hadoop-snappy` maven profile for building Google Snappy native libraries into HBase.
See also <<snappy.compression.installation,snappy.compression.installation>>
[[releasing]]
@ -440,16 +440,16 @@ To determine which HBase you have, look at the HBase version.
The Hadoop version is embedded within it.
Maven, our build system, natively does not allow a single product to be built against different dependencies.
Also, Maven cannot change the set of included modules and write out the correct [path]_pom.xml_ files with appropriate dependencies, even using two build targets, one for Hadoop 1 and another for Hadoop 2.
A prerequisite step is required, which takes as input the current [path]_pom.xml_s and generates Hadoop 1 or Hadoop 2 versions using a script in the [path]_dev-tools/_ directory, called [path]_generate-hadoopX-poms.sh_ where [replaceable]_X_ is either [literal]+1+ or [literal]+2+.
Also, Maven cannot change the set of included modules and write out the correct _pom.xml_ files with appropriate dependencies, even using two build targets, one for Hadoop 1 and another for Hadoop 2.
A prerequisite step is required, which takes as input the current _pom.xml_s and generates Hadoop 1 or Hadoop 2 versions using a script in the _dev-tools/_ directory, called _generate-hadoopX-poms.sh_ where [replaceable]_X_ is either `1` or `2`.
You then reference these generated poms when you build.
For now, just be aware of the difference between HBase 1.x builds and those of HBase 0.96-0.98.
This difference is important to the build instructions.
.Example [path]_~/.m2/settings.xml_ File
.Example _~/.m2/settings.xml_ File
====
Publishing to maven requires you sign the artifacts you want to upload.
For the build to sign them for you, you a properly configured [path]_settings.xml_ in your local repository under [path]_.m2_, such as the following.
For the build to sign them for you, you need a properly configured _settings.xml_ in your local repository under _.m2_, such as the following.
[source,xml]
----
@ -505,7 +505,7 @@ I'll prefix those special steps with _Point Release Only_.
.Before You Begin
Before you make a release candidate, do a practice run by deploying a snapshot.
Before you start, check to be sure recent builds have been passing for the branch from where you are going to take your release.
You should also have tried recent branch tips out on a cluster under load, perhaps by running the [code]+hbase-it+ integration test suite for a few hours to 'burn in' the near-candidate bits.
You should also have tried recent branch tips out on a cluster under load, perhaps by running the `hbase-it` integration test suite for a few hours to 'burn in' the near-candidate bits.
.Point Release Only
[NOTE]
@ -520,7 +520,7 @@ The Hadoop link:http://wiki.apache.org/hadoop/HowToRelease[How To
.Specifying the Heap Space for Maven on OSX
[NOTE]
====
On OSX, you may need to specify the heap space for Maven commands, by setting the [var]+MAVEN_OPTS+ variable to [literal]+-Xmx3g+.
On OSX, you may need to specify the heap space for Maven commands, by setting the `MAVEN_OPTS` variable to `-Xmx3g`.
You can prefix the variable to the Maven command, as in the following example:
----
@ -531,19 +531,19 @@ You could also set this in an environment variable or alias in your shell.
====
NOTE: The script [path]_dev-support/make_rc.sh_ automates many of these steps.
It does not do the modification of the [path]_CHANGES.txt_ for the release, the close of the staging repository in Apache Maven (human intervention is needed here), the checking of the produced artifacts to ensure they are 'good' -- e.g.
NOTE: The script _dev-support/make_rc.sh_ automates many of these steps.
It does not do the modification of the _CHANGES.txt_ for the release, the close of the staging repository in Apache Maven (human intervention is needed here), the checking of the produced artifacts to ensure they are 'good' -- e.g.
extracting the produced tarballs, verifying that they look right, then starting HBase and checking that everything is running correctly, then the signing and pushing of the tarballs to link:http://people.apache.org[people.apache.org].
The script handles everything else, and comes in handy.
.Procedure: Release Procedure
. Update the [path]_CHANGES.txt_ file and the POM files.
. Update the _CHANGES.txt_ file and the POM files.
+
Update [path]_CHANGES.txt_ with the changes since the last release.
Update _CHANGES.txt_ with the changes since the last release.
Make sure the URL to the JIRA points to the proper location which lists fixes for this release.
Adjust the version in all the POM files appropriately.
If you are making a release candidate, you must remove the [literal]+-SNAPSHOT+ label from all versions.
If you are running this receipe to publish a snapshot, you must keep the [literal]+-SNAPSHOT+ suffix on the hbase version.
If you are making a release candidate, you must remove the `-SNAPSHOT` label from all versions.
If you are running this recipe to publish a snapshot, you must keep the `-SNAPSHOT` suffix on the hbase version.
The link:http://mojo.codehaus.org/versions-maven-plugin/[Versions
Maven Plugin] can be of use here.
To set a version in all the many poms of the hbase multi-module project, use a command like the following:
@ -554,11 +554,11 @@ To set a version in all the many poms of the hbase multi-module project, use a c
$ mvn clean org.codehaus.mojo:versions-maven-plugin:1.3.1:set -DnewVersion=0.96.0
----
+
Checkin the [path]_CHANGES.txt_ and any version changes.
Checkin the _CHANGES.txt_ and any version changes.
. Update the documentation.
+
Update the documentation under [path]_src/main/docbkx_.
Update the documentation under _src/main/docbkx_.
This usually involves copying the latest from trunk and making version-particular adjustments to suit this release candidate version.
. Build the source tarball.
@ -566,7 +566,7 @@ This usually involves copying the latest from trunk and making version-particula
Now, build the source tarball.
This tarball is Hadoop-version-independent.
It is just the pure source code and documentation without a particular hadoop taint, etc.
Add the [var]+-Prelease+ profile when building.
Add the `-Prelease` profile when building.
It checks files for licenses and will fail the build if unlicensed files are present.
+
[source,bourne]
@ -578,13 +578,13 @@ $ mvn clean install -DskipTests assembly:single -Dassembly.file=hbase-assembly/s
Extract the tarball and make sure it looks good.
A good test for the src tarball being 'complete' is to see if you can build new tarballs from this source bundle.
If the source tarball is good, save it off to a _version directory_, a directory somewhere where you are collecting all of the tarballs you will publish as part of the release candidate.
For example if you were building a hbase-0.96.0 release candidate, you might call the directory [path]_hbase-0.96.0RC0_.
For example if you were building a hbase-0.96.0 release candidate, you might call the directory _hbase-0.96.0RC0_.
Later you will publish this directory as our release candidate up on link:people.apache.org/~YOU[people.apache.org/~YOU/].
. Build the binary tarball.
+
Next, build the binary tarball.
Add the [var]+-Prelease+ profile when building.
Add the `-Prelease` profile when building.
It checks files for licenses and will fail the build if unlicensed files are present.
Do it in two steps.
+
@ -626,9 +626,9 @@ Release needs to be tagged for the next step.
. Deploy to the Maven Repository.
+
Next, deploy HBase to the Apache Maven repository, using the [var]+apache-release+ profile instead of the [var]+release+ profile when running the +mvn
Next, deploy HBase to the Apache Maven repository, using the `apache-release` profile instead of the `release` profile when running the +mvn
deploy+ command.
This profile invokes the Apache pom referenced by our pom files, and also signs your artifacts published to Maven, as long as the [path]_settings.xml_ is configured correctly, as described in <<mvn.settings.file,mvn.settings.file>>.
This profile invokes the Apache pom referenced by our pom files, and also signs your artifacts published to Maven, as long as the _settings.xml_ is configured correctly, as described in <<mvn.settings.file,mvn.settings.file>>.
+
[source,bourne]
----
@ -651,7 +651,7 @@ If it checks out, 'close' the repo.
This will make the artifacts publicly available.
You will receive an email with the URL to give out for the temporary staging repository for others to use trying out this new release candidate.
Include it in the email that announces the release candidate.
Folks will need to add this repo URL to their local poms or to their local [path]_settings.xml_ file to pull the published release candidate artifacts.
Folks will need to add this repo URL to their local poms or to their local _settings.xml_ file to pull the published release candidate artifacts.
If the published artifacts are incomplete or have problems, just delete the 'open' staged artifacts.
+
.hbase-downstreamer
@ -660,7 +660,7 @@ If the published artifacts are incomplete or have problems, just delete the 'ope
See the link:https://github.com/saintstack/hbase-downstreamer[hbase-downstreamer] test for a simple example of a project that is downstream of HBase and depends on it.
Check it out and run its simple test to make sure maven artifacts are properly deployed to the maven repository.
Be sure to edit the pom to point to the proper staging repository.
Make sure you are pulling from the repository when tests run and that you are not getting from your local repository, by either passing the [code]+-U+ flag or deleting your local repo content and check maven is pulling from remote out of the staging repository.
Make sure you are pulling from the staging repository when tests run and not from your local repository, either by passing the `-U` flag or by deleting your local repo content and checking that Maven is pulling from the remote staging repository.
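A minimal sketch, run from the hbase-downstreamer checkout after editing its pom to point at the staging repository:
[source,bourne]
----
$ mvn clean test -U   # -U forces Maven to re-check the remote repositories
----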
====
+
See link:http://www.apache.org/dev/publishing-maven-artifacts.html[Publishing Maven Artifacts] for some pointers on this maven staging process.
@ -670,16 +670,16 @@ Instead we do +mvn deploy+.
It seems to give us a backdoor to maven release publishing.
If there is no _-SNAPSHOT_ on the version string, then we are 'deployed' to the Apache Maven repository staging directory, from which we can publish URLs for candidates and later, if they pass, publish them as a release (if there is a _-SNAPSHOT_ on the version string, deploy puts the artifacts into the Apache snapshot repos).
+
If the HBase version ends in [var]+-SNAPSHOT+, the artifacts go elsewhere.
If the HBase version ends in `-SNAPSHOT`, the artifacts go elsewhere.
They are put into the Apache snapshots repository directly and are immediately available.
If you are making a SNAPSHOT release, this is what you want to happen.
. If you used the [path]_make_rc.sh_ script instead of doing
. If you used the _make_rc.sh_ script instead of doing
the above manually, do your sanity checks now.
+
At this stage, you have two tarballs in your 'version directory' and a set of artifacts in a staging area of the maven repository, in the 'closed' state.
These are publicly accessible in a temporary staging repository whose URL you should have gotten in an email.
The above mentioned script, [path]_make_rc.sh_ does all of the above for you minus the check of the artifacts built, the closing of the staging repository up in maven, and the tagging of the release.
The above-mentioned script, _make_rc.sh_, does all of the above for you except checking the built artifacts, closing the staging repository in Maven, and tagging the release.
If you run the script, do your checks at this stage verifying the src and bin tarballs and checking what is up in staging using the hbase-downstreamer project.
Tag before you start the build.
You can always delete it if the build goes haywire.
@ -709,8 +709,8 @@ Announce the release candidate on the mailing list and call a vote.
[[maven.snapshot]]
=== Publishing a SNAPSHOT to maven
Make sure your [path]_settings.xml_ is set up properly, as in <<mvn.settings.file,mvn.settings.file>>.
Make sure the hbase version includes [var]+-SNAPSHOT+ as a suffix.
Make sure your _settings.xml_ is set up properly, as in <<mvn.settings.file,mvn.settings.file>>.
Make sure the hbase version includes `-SNAPSHOT` as a suffix.
Following is an example of publishing SNAPSHOTS of a release that had an hbase version of 0.96.0 in its poms.
[source,bourne]
@ -720,8 +720,8 @@ Following is an example of publishing SNAPSHOTS of a release that had an hbase v
$ mvn -DskipTests deploy -Papache-release
----
The [path]_make_rc.sh_ script mentioned above (see <<maven.release,maven.release>>) can help you publish [var]+SNAPSHOTS+.
Make sure your [var]+hbase.version+ has a [var]+-SNAPSHOT+ suffix before running the script.
The _make_rc.sh_ script mentioned above (see <<maven.release,maven.release>>) can help you publish `SNAPSHOTS`.
Make sure your `hbase.version` has a `-SNAPSHOT` suffix before running the script.
It will put a snapshot up into the apache snapshot repository for you.
[[hbase.rc.voting]]
@ -742,11 +742,9 @@ for how we arrived at this process.
[[documentation]]
== Generating the HBase Reference Guide
The manual is marked up using link:http://www.docbook.org/[docbook].
We then use the link:http://code.google.com/p/docbkx-tools/[docbkx maven plugin] to transform the markup to html.
This plugin is run when you specify the +site+ goal as in when you run +mvn site+ or you can call the plugin explicitly to just generate the manual by doing +mvn
docbkx:generate-html+.
When you run +mvn site+, the documentation is generated twice, once to generate the multipage manual and then again for the single page manual, which is easier to search.
The manual is marked up using Asciidoc.
We then use the link:http://asciidoctor.org/docs/asciidoctor-maven-plugin/[Asciidoctor maven plugin] to transform the markup to html.
This plugin is run when you specify the +site+ goal as in when you run +mvn site+.
See <<appendix_contributing_to_documentation,appendix contributing to documentation>> for more information on building the documentation.
[[hbase.org]]
@ -760,8 +758,8 @@ See <<appendix_contributing_to_documentation,appendix contributing to documentat
[[hbase.org.site.publishing]]
=== Publishing link:http://hbase.apache.org[hbase.apache.org]
As of link:https://issues.apache.org/jira/browse/INFRA-5680[INFRA-5680 Migrate apache hbase website], to publish the website, build it using Maven, and then deploy it over a checkout of [path]_https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk_ and check in your changes.
The script [path]_dev-scripts/publish_hbase_website.sh_ is provided to automate this process and to be sure that stale files are removed from SVN.
As of link:https://issues.apache.org/jira/browse/INFRA-5680[INFRA-5680 Migrate apache hbase website], to publish the website, build it using Maven, and then deploy it over a checkout of _https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk_ and check in your changes.
The script _dev-scripts/publish_hbase_website.sh_ is provided to automate this process and to be sure that stale files are removed from SVN.
Review the script even if you decide to publish the website manually.
Use the script as follows:
@ -792,15 +790,15 @@ For developing unit tests for your HBase applications, see <<unit.tests,unit.tes
As of 0.96, Apache HBase is split into multiple modules.
This creates "interesting" rules for how and where tests are written.
If you are writing code for [class]+hbase-server+, see <<hbase.unittests,hbase.unittests>> for how to write your tests.
If you are writing code for `hbase-server`, see <<hbase.unittests,hbase.unittests>> for how to write your tests.
These tests can spin up a minicluster and will need to be categorized.
For any other module, for example [class]+hbase-common+, the tests must be strict unit tests and just test the class under test - no use of the HBaseTestingUtility or minicluster is allowed (or even possible given the dependency tree).
For any other module, for example `hbase-common`, the tests must be strict unit tests and just test the class under test - no use of the HBaseTestingUtility or minicluster is allowed (or even possible given the dependency tree).
[[hbase.moduletest.shell]]
==== Testing the HBase Shell
The HBase shell and its tests are predominantly written in jruby.
In order to make these tests run as a part of the standard build, there is a single JUnit test, [class]+TestShell+, that takes care of loading the jruby implemented tests and running them.
In order to make these tests run as a part of the standard build, there is a single JUnit test, `TestShell`, that takes care of loading the jruby implemented tests and running them.
You can run all of these tests from the top level with:
[source,bourne]
@ -809,9 +807,9 @@ You can run all of these tests from the top level with:
mvn clean test -Dtest=TestShell
----
Alternatively, you may limit the shell tests that run using the system variable [class]+shell.test+.
Alternatively, you may limit the shell tests that run using the system variable `shell.test`.
This value should specify the ruby literal equivalent of a particular test case by name.
For example, the tests that cover the shell commands for altering tables are contained in the test case [class]+AdminAlterTableTest+ and you can run them with:
For example, the tests that cover the shell commands for altering tables are contained in the test case `AdminAlterTableTest` and you can run them with:
[source,bourne]
----
@ -820,7 +818,7 @@ For example, the tests that cover the shell commands for altering tables are con
----
You may also use a link:http://docs.ruby-doc.com/docs/ProgrammingRuby/html/language.html#UJ[Ruby Regular Expression
literal] (in the [class]+/pattern/+ style) to select a set of test cases.
literal] (in the `/pattern/` style) to select a set of test cases.
You can run all of the HBase admin related tests, including both the normal administration and the security administration, with the command:
[source,bourne]
@ -859,19 +857,19 @@ mvn clean test -PskipServerTests
from the top level directory to run all the tests in modules other than hbase-server.
Note that you can specify to skip tests in multiple modules as well as just for a single module.
For example, to skip the tests in [class]+hbase-server+ and [class]+hbase-common+, you would run:
For example, to skip the tests in `hbase-server` and `hbase-common`, you would run:
[source,bourne]
----
mvn clean test -PskipServerTests -PskipCommonTests
----
Also, keep in mind that if you are running tests in the [class]+hbase-server+ module you will need to apply the maven profiles discussed in <<hbase.unittests.cmds,hbase.unittests.cmds>> to get the tests to run properly.
Also, keep in mind that if you are running tests in the `hbase-server` module you will need to apply the maven profiles discussed in <<hbase.unittests.cmds,hbase.unittests.cmds>> to get the tests to run properly.
[[hbase.unittests]]
=== Unit Tests
Apache HBase unit tests are subdivided into four categories: small, medium, large, and integration with corresponding JUnit link:http://www.junit.org/node/581[categories]: [class]+SmallTests+, [class]+MediumTests+, [class]+LargeTests+, [class]+IntegrationTests+.
Apache HBase unit tests are subdivided into four categories: small, medium, large, and integration with corresponding JUnit link:http://www.junit.org/node/581[categories]: `SmallTests`, `MediumTests`, `LargeTests`, `IntegrationTests`.
JUnit categories are denoted using java annotations and look like this in your unit test code.
[source,java]
@ -886,14 +884,13 @@ public class TestHRegionInfo {
}
----
The above example shows how to mark a unit test as belonging to the [literal]+small+ category.
The above example shows how to mark a unit test as belonging to the `small` category.
All unit tests in HBase have a categorization.
The first three categories, [literal]+small+, [literal]+medium+, and [literal]+large+, are for tests run when you type [code]+$ mvn
test+.
The first three categories, `small`, `medium`, and `large`, are for tests run when you type `$ mvn test`.
In other words, these three categorizations are for HBase unit tests.
The [literal]+integration+ category is not for unit tests, but for integration tests.
These are run when you invoke [code]+$ mvn verify+.
The `integration` category is not for unit tests, but for integration tests.
These are run when you invoke `$ mvn verify`.
Integration tests are described in <<integration.tests,integration.tests>>.
HBase uses a patched maven surefire plugin and maven profiles to implement its unit test characterizations.
@ -928,7 +925,7 @@ Integration Tests (((IntegrationTests)))::
[[hbase.unittests.cmds.test]]
==== Default: small and medium category tests
Running [code]+mvn test+ will execute all small tests in a single JVM (no fork) and then medium tests in a separate JVM for each test instance.
Running `mvn test` will execute all small tests in a single JVM (no fork) and then medium tests in a separate JVM for each test instance.
Medium tests are NOT executed if there is an error in a small test.
Large tests are NOT executed.
There is one report for small tests, and one report for medium tests if they are executed.
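For example:
[source,bourne]
----
$ mvn test
----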
@ -936,7 +933,7 @@ There is one report for small tests, and one report for medium tests if they are
[[hbase.unittests.cmds.test.runalltests]]
==== Running all tests
Running [code]+mvn test -P runAllTests+ will execute small tests in a single JVM then medium and large tests in a separate JVM for each test.
Running `mvn test -P runAllTests` will execute small tests in a single JVM then medium and large tests in a separate JVM for each test.
Medium and large tests are NOT executed if there is an error in a small test.
Large tests are NOT executed if there is an error in a small or medium test.
There is one report for small tests, and one report for medium and large tests if they are executed.
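For example:
[source,bourne]
----
$ mvn test -P runAllTests
----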
@ -944,16 +941,22 @@ There is one report for small tests, and one report for medium and large tests i
[[hbase.unittests.cmds.test.localtests.mytest]]
==== Running a single test or all tests in a package
To run an individual test, e.g. [class]+MyTest+, rum [code]+mvn test -Dtest=MyTest+ You can also pass multiple, individual tests as a comma-delimited list: [code]+mvn test
-Dtest=MyTest1,MyTest2,MyTest3+ You can also pass a package, which will run all tests under the package: [code]+mvn test
'-Dtest=org.apache.hadoop.hbase.client.*'+
To run an individual test, e.g. `MyTest`, run `mvn test -Dtest=MyTest`. You can also pass multiple individual tests as a comma-delimited list:
[source,bash]
----
mvn test -Dtest=MyTest1,MyTest2,MyTest3
----
You can also pass a package, which will run all tests under the package:
[source,bash]
----
mvn test '-Dtest=org.apache.hadoop.hbase.client.*'
----
When [code]+-Dtest+ is specified, the [code]+localTests+ profile will be used.
When `-Dtest` is specified, the `localTests` profile will be used.
It will use the official release of Maven Surefire, rather than our custom Surefire plugin, and the old connector (the HBase build uses a patched version of the Maven Surefire plugin). Each JUnit test is executed in a separate JVM (a fork per test class). There is no parallelization when tests are run in this mode.
You will see a new message at the end of the -report: [literal]+"[INFO] Tests are skipped"+.
You will see a new message at the end of the -report: `"[INFO] Tests are skipped"`.
It's harmless.
However, you need to make sure the sum of [code]+Tests run:+ in the [code]+Results
:+ section of test reports matching the number of tests you specified because no error will be reported when a non-existent test case is specified.
However, you need to make sure that the sum of `Tests run:` in the `Results:` section of the test reports matches the number of tests you specified, because no error will be reported when a non-existent test case is specified.
[[hbase.unittests.cmds.test.profiles]]
==== Other test invocation permutations
@ -969,7 +972,7 @@ For convenience, you can run `mvn test -P runDevTests` to execute both small and
[[hbase.unittests.test.faster]]
==== Running tests faster
By default, [code]+$ mvn test -P runAllTests+ runs 5 tests in parallel.
By default, `$ mvn test -P runAllTests` runs 5 tests in parallel.
This number can be increased on a developer's machine.
Allowing that you can have 2 tests in parallel per core, and that you need about 2GB of memory per test (at the extreme), if you have an 8-core, 24GB box you could run 16 tests in parallel, but the available memory limits it to 12 (24/2). To run all tests with 12 tests in parallel, do this: +mvn test -P runAllTests
@ -1008,7 +1011,7 @@ mvn test
It's also possible to use the script +hbasetests.sh+.
This script runs the medium and large tests in parallel with two maven instances, and provides a single report.
This script does not use the hbase version of surefire so no parallelization is being done other than the two maven instances the script sets up.
It must be executed from the directory which contains the [path]_pom.xml_.
It must be executed from the directory which contains the _pom.xml_.
For example running +./dev-support/hbasetests.sh+ will execute small and medium tests.
Running +./dev-support/hbasetests.sh
@ -1018,8 +1021,8 @@ Running +./dev-support/hbasetests.sh replayFailed+ will rerun the failed tests a
[[hbase.unittests.resource.checker]]
==== Test Resource Checker(((Test ResourceChecker)))
A custom Maven SureFire plugin listener checks a number of resources before and after each HBase unit test runs and logs its findings at the end of the test output files which can be found in [path]_target/surefire-reports_ per Maven module (Tests write test reports named for the test class into this directory.
Check the [path]_*-out.txt_ files). The resources counted are the number of threads, the number of file descriptors, etc.
A custom Maven SureFire plugin listener checks a number of resources before and after each HBase unit test runs and logs its findings at the end of the test output files which can be found in _target/surefire-reports_ per Maven module (Tests write test reports named for the test class into this directory.
Check the _*-out.txt_ files). The resources counted are the number of threads, the number of file descriptors, etc.
If the number has increased, it adds a _LEAK?_ comment in the logs.
As you can have an HBase instance running in the background, some threads can be deleted/created without any specific action in the test.
However, if the test does not work as expected, or if the test should not impact these resources, it's worth checking these log lines [computeroutput]+...hbase.ResourceChecker(157): before...+ and [computeroutput]+...hbase.ResourceChecker(157): after...+.
@ -1043,7 +1046,7 @@ ConnectionCount=1 (was 1)
* All tests must be written to support parallel execution on the same machine, hence they should not use shared resources as fixed ports or fixed file names.
* Tests should not over-log.
More than 100 lines/second makes the logs hard to read and uses I/O that is then not available to the other tests.
* Tests can be written with [class]+HBaseTestingUtility+.
* Tests can be written with `HBaseTestingUtility`.
This class offers helper functions to create a temp directory and do the cleanup, or to start a cluster.
[[hbase.tests.categories]]
@ -1087,27 +1090,35 @@ They are generally long-lasting, sizeable (the test can be asked to 1M rows or 1
Integration tests are what you would run when you need to more elaborate proofing of a release candidate beyond what unit tests can do.
They are not generally run on the Apache Continuous Integration build server, however, some sites opt to run integration tests as a part of their continuous testing on an actual cluster.
Integration tests currently live under the [path]_src/test_ directory in the hbase-it submodule and will match the regex: [path]_**/IntegrationTest*.java_.
All integration tests are also annotated with [code]+@Category(IntegrationTests.class)+.
Integration tests currently live under the _src/test_ directory in the hbase-it submodule and will match the regex: _**/IntegrationTest*.java_.
All integration tests are also annotated with `@Category(IntegrationTests.class)`.
Integration tests can be run in two modes: using a mini cluster, or against an actual distributed cluster.
Maven failsafe is used to run the tests using the mini cluster.
IntegrationTestsDriver class is used for executing the tests against a distributed cluster.
Integration tests SHOULD NOT assume that they are running against a mini cluster, and SHOULD NOT use private API's to access cluster state.
To interact with the distributed or mini cluster uniformly, [code]+IntegrationTestingUtility+, and [code]+HBaseCluster+ classes, and public client API's can be used.
To interact with the distributed or mini cluster uniformly, `IntegrationTestingUtility`, and `HBaseCluster` classes, and public client API's can be used.
On a distributed cluster, integration tests that use ChaosMonkey or otherwise manipulate services through the cluster manager (e.g.
restart regionservers) use SSH to do it.
To run these, test process should be able to run commands on remote end, so ssh should be configured accordingly (for example, if HBase runs under hbase user in your cluster, you can set up passwordless ssh for that user and run the test also under it). To facilitate that, [code]+hbase.it.clustermanager.ssh.user+, [code]+hbase.it.clustermanager.ssh.opts+ and [code]+hbase.it.clustermanager.ssh.cmd+ configuration settings can be used.
To run these, the test process should be able to run commands on the remote end, so SSH should be configured accordingly (for example, if HBase runs under the hbase user in your cluster, you can set up passwordless SSH for that user and run the test under it as well). To facilitate that, the `hbase.it.clustermanager.ssh.user`, `hbase.it.clustermanager.ssh.opts` and `hbase.it.clustermanager.ssh.cmd` configuration settings can be used.
"User" is the remote user that cluster manager should use to perform ssh commands.
"Opts" contains additional options that are passed to SSH (for example, "-i /tmp/my-key"). Finally, if you have some custom environment setup, "cmd" is the override format for the entire tunnel (ssh) command.
The default string is {[code]+/usr/bin/ssh %1$s %2$s%3$s%4$s "%5$s"+} and is a good starting point.
The default string is {`/usr/bin/ssh %1$s %2$s%3$s%4$s "%5$s"`} and is a good starting point.
This is a standard Java format string with 5 arguments that is used to execute the remote command.
The argument 1 (%1$s) is SSH options set the via opts setting or via environment variable, 2 is SSH user name, 3 is "@" if username is set or "" otherwise, 4 is the target host name, and 5 is the logical command to execute (that may include single quotes, so don't use them). For example, if you run the tests under non-hbase user and want to ssh as that user and change to hbase on remote machine, you can use {[code]+/usr/bin/ssh %1$s %2$s%3$s%4$s "su hbase - -c
\"%5$s\""+}. That way, to kill RS (for example) integration tests may run {[code]+/usr/bin/ssh some-hostname "su hbase - -c \"ps aux | ... | kill
...\""+}. The command is logged in the test logs, so you can verify it is correct for your environment.
Argument 1 (%1$s) is the SSH options set via the opts setting or via an environment variable, 2 is the SSH user name, 3 is "@" if a username is set or "" otherwise, 4 is the target host name, and 5 is the logical command to execute (which may include single quotes, so don't use them). For example, if you run the tests under a non-hbase user and want to ssh as that user and change to hbase on the remote machine, you can use:
[source,bash]
----
/usr/bin/ssh %1$s %2$s%3$s%4$s "su hbase - -c \"%5$s\""
----
That way, to kill RS (for example) integration tests may run:
[source,bash]
----
/usr/bin/ssh some-hostname "su hbase - -c \"ps aux | ... | kill ...\""
----
The command is logged in the test logs, so you can verify it is correct for your environment.
To disable the running of Integration Tests, pass the following profile on the command line [code]+-PskipIntegrationTests+.
To disable the running of Integration Tests, pass the following profile on the command line `-PskipIntegrationTests`.
For example,
[source]
----
@ -1117,8 +1128,8 @@ $ mvn clean install test -Dtest=TestZooKeeper -PskipIntegrationTests
[[maven.build.commands.integration.tests.mini]]
==== Running integration tests against mini cluster
HBase 0.92 added a [var]+verify+ maven target.
Invoking it, for example by doing [code]+mvn verify+, will run all the phases up to and including the verify phase via the maven link:http://maven.apache.org/plugins/maven-failsafe-plugin/[failsafe
HBase 0.92 added a `verify` maven target.
Invoking it, for example by doing `mvn verify`, will run all the phases up to and including the verify phase via the maven link:http://maven.apache.org/plugins/maven-failsafe-plugin/[failsafe
plugin], running all the above mentioned HBase unit tests as well as tests that are in the HBase integration test group.
After you have completed +mvn install -DskipTests+, you can run just the integration tests by invoking:
@ -1132,7 +1143,7 @@ mvn verify
If you just want to run the integration tests in top-level, you need to run two commands.
First: +mvn failsafe:integration-test+. This actually runs ALL the integration tests.
NOTE: This command will always output [code]+BUILD SUCCESS+ even if there are test failures.
NOTE: This command will always output `BUILD SUCCESS` even if there are test failures.
At this point, you could grep the output by hand looking for failed tests.
However, maven will do this for us; just use: +mvn
@ -1141,15 +1152,15 @@ However, maven will do this for us; just use: +mvn
[[maven.build.commands.integration.tests2]]
===== Running a subset of Integration tests
This is very similar to how you specify running a subset of unit tests (see above), but use the property [code]+it.test+ instead of [code]+test+.
To just run [class]+IntegrationTestClassXYZ.java+, use: +mvn
This is very similar to how you specify running a subset of unit tests (see above), but use the property `it.test` instead of `test`.
To just run `IntegrationTestClassXYZ.java`, use: +mvn
failsafe:integration-test -Dit.test=IntegrationTestClassXYZ+ The next thing you might want to do is run groups of integration tests, say all integration tests that are named IntegrationTestClassX*.java: +mvn failsafe:integration-test -Dit.test=*ClassX*+ This runs everything that is an integration test that matches *ClassX*. This means anything matching: "**/IntegrationTest*ClassX*". You can also run multiple groups of integration tests using comma-delimited lists (similar to unit tests). Using a list of matches still supports full regex matching for each of the groups. This would look something like: +mvn
failsafe:integration-test -Dit.test=*ClassX*, *ClassY+
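Collected as runnable commands, the examples above look like this (the patterns are quoted to keep the shell from expanding them):
[source,bourne]
----
$ mvn failsafe:integration-test -Dit.test=IntegrationTestClassXYZ
$ mvn failsafe:integration-test '-Dit.test=*ClassX*'
$ mvn failsafe:integration-test '-Dit.test=*ClassX*, *ClassY'
----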
[[maven.build.commands.integration.tests.distributed]]
==== Running integration tests against distributed cluster
If you have an already-setup HBase cluster, you can launch the integration tests by invoking the class [code]+IntegrationTestsDriver+.
If you have an already-setup HBase cluster, you can launch the integration tests by invoking the class `IntegrationTestsDriver`.
You may have to run test-compile first.
The configuration will be picked up by the bin/hbase script.
[source,bourne]
@ -1163,25 +1174,24 @@ Then launch the tests with:
bin/hbase [--config config_dir] org.apache.hadoop.hbase.IntegrationTestsDriver
----
Pass [code]+-h+ to get usage on this sweet tool.
Running the IntegrationTestsDriver without any argument will launch tests found under [code]+hbase-it/src/test+, having [code]+@Category(IntegrationTests.class)+ annotation, and a name starting with [code]+IntegrationTests+.
Pass `-h` to get usage on this sweet tool.
Running the IntegrationTestsDriver without any argument will launch tests found under `hbase-it/src/test`, having `@Category(IntegrationTests.class)` annotation, and a name starting with `IntegrationTests`.
See the usage, by passing -h, to see how to filter test classes.
You can pass a regex which is checked against the full class name; so, part of a class name can be used.
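For example, to see the usage and the available filtering options:
[source,bourne]
----
bin/hbase [--config config_dir] org.apache.hadoop.hbase.IntegrationTestsDriver -h
----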
IntegrationTestsDriver uses JUnit to run the tests.
Currently there is no support for running integration tests against a distributed cluster using maven (see link:https://issues.apache.org/jira/browse/HBASE-6201[HBASE-6201]).
The tests interact with the distributed cluster by using the methods in the [code]+DistributedHBaseCluster+ (implementing [code]+HBaseCluster+) class, which in turn uses a pluggable [code]+ClusterManager+.
Concrete implementations provide actual functionality for carrying out deployment-specific and environment-dependent tasks (SSH, etc). The default [code]+ClusterManager+ is [code]+HBaseClusterManager+, which uses SSH to remotely execute start/stop/kill/signal commands, and assumes some posix commands (ps, etc). Also assumes the user running the test has enough "power" to start/stop servers on the remote machines.
By default, it picks up [code]+HBASE_SSH_OPTS, HBASE_HOME,
HBASE_CONF_DIR+ from the env, and uses [code]+bin/hbase-daemon.sh+ to carry out the actions.
Currently tarball deployments, deployments which uses hbase-daemons.sh, and link:http://incubator.apache.org/ambari/[Apache Ambari] deployments are supported.
/etc/init.d/ scripts are not supported for now, but it can be easily added.
The tests interact with the distributed cluster by using the methods in the `DistributedHBaseCluster` (implementing `HBaseCluster`) class, which in turn uses a pluggable `ClusterManager`.
Concrete implementations provide actual functionality for carrying out deployment-specific and environment-dependent tasks (SSH, etc). The default `ClusterManager` is `HBaseClusterManager`, which uses SSH to remotely execute start/stop/kill/signal commands, and assumes some POSIX commands (ps, etc). It also assumes that the user running the test has enough "power" to start and stop servers on the remote machines.
By default, it picks up `HBASE_SSH_OPTS`, `HBASE_HOME`, `HBASE_CONF_DIR` from the env, and uses `bin/hbase-daemon.sh` to carry out the actions.
Currently tarball deployments, deployments which use _hbase-daemons.sh_, and link:http://incubator.apache.org/ambari/[Apache Ambari] deployments are supported.
_/etc/init.d/_ scripts are not supported for now, but support can be added easily.
For other deployment options, a ClusterManager can be implemented and plugged in.
[[maven.build.commands.integration.tests.destructive]]
==== Destructive integration / system tests
In 0.96, a tool named [code]+ChaosMonkey+ has been introduced.
In 0.96, a tool named `ChaosMonkey` has been introduced.
It is modeled after the link:http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html[same-named tool by Netflix].
Some of the tests use ChaosMonkey to simulate faults in the running cluster in the way of killing random servers, disconnecting servers, etc.
ChaosMonkey can also be used as a stand-alone tool to run a (misbehaving) policy while you are running other tests.
@ -1262,7 +1272,7 @@ ChaosMonkey tool, if run from command line, will keep on running until the proce
Since HBase version 1.0.0 (link:https://issues.apache.org/jira/browse/HBASE-11348[HBASE-11348]), the chaos monkey used to run integration tests can be configured per test run.
Users can create a Java properties file with timing configurations and pass it to the chaos monkey.
The properties file needs to be in the HBase classpath.
The various properties that can be configured and their default values can be found listed in the [class]+org.apache.hadoop.hbase.chaos.factories.MonkeyConstants+ class.
The various properties that can be configured and their default values can be found listed in the `org.apache.hadoop.hbase.chaos.factories.MonkeyConstants` class.
If any chaos monkey configuration is missing from the property file, then the default values are assumed.
For example:
@ -1272,7 +1282,7 @@ For example:
$bin/hbase org.apache.hadoop.hbase.IntegrationTestIngest -m slowDeterministic -monkeyProps monkey.properties
----
The above command will start the integration tests and chaos monkey passing the properties file [path]_monkey.properties_.
The above command will start the integration tests and the chaos monkey, passing the properties file _monkey.properties_.
Here is an example chaos monkey file:
[source]
@ -1291,8 +1301,8 @@ batch.restart.rs.ratio=0.4f
=== Codelines
Most development is done on the master branch, which is named [literal]+master+ in the Git repository.
Previously, HBase used Subversion, in which the master branch was called [literal]+TRUNK+.
Most development is done on the master branch, which is named `master` in the Git repository.
Previously, HBase used Subversion, in which the master branch was called `TRUNK`.
Branches exist for minor releases, and important features and bug fixes are often back-ported.
=== Release Managers
@ -1330,44 +1340,44 @@ The conventions followed by HBase are inherited by its parent project, Hadoop.
The following interface classifications are commonly used:
.InterfaceAudience
[code]+@InterfaceAudience.Public+::
`@InterfaceAudience.Public`::
APIs for users and HBase applications.
These APIs will be deprecated through major versions of HBase.
[code]+@InterfaceAudience.Private+::
`@InterfaceAudience.Private`::
APIs for HBase internals developers.
No guarantees on compatibility or availability in future versions.
Private interfaces do not need an [code]+@InterfaceStability+ classification.
Private interfaces do not need an `@InterfaceStability` classification.
[code]+@InterfaceAudience.LimitedPrivate(HBaseInterfaceAudience.COPROC)+::
`@InterfaceAudience.LimitedPrivate(HBaseInterfaceAudience.COPROC)`::
APIs for HBase coprocessor writers.
As of HBase 0.92/0.94/0.96/0.98 this api is still unstable.
No guarantees on compatibility with future versions.
No [code]+@InterfaceAudience+ Classification::
Packages without an [code]+@InterfaceAudience+ label are considered private.
No `@InterfaceAudience` Classification::
Packages without an `@InterfaceAudience` label are considered private.
Mark your new packages if publicly accessible.
.Excluding Non-Public Interfaces from API Documentation
[NOTE]
====
Only interfaces classified [code]+@InterfaceAudience.Public+ should be included in API documentation (Javadoc). Committers must add new package excludes [code]+ExcludePackageNames+ section of the [path]_pom.xml_ for new packages which do not contain public classes.
Only interfaces classified `@InterfaceAudience.Public` should be included in API documentation (Javadoc). Committers must add new package excludes to the `ExcludePackageNames` section of the _pom.xml_ for new packages which do not contain public classes.
====
.@InterfaceStability
[code]+@InterfaceStability+ is important for packages marked [code]+@InterfaceAudience.Public+.
`@InterfaceStability` is important for packages marked `@InterfaceAudience.Public`.
[code]+@InterfaceStability.Stable+::
`@InterfaceStability.Stable`::
Public packages marked as stable cannot be changed without a deprecation path or a very good reason.
[code]+@InterfaceStability.Unstable+::
`@InterfaceStability.Unstable`::
Public packages marked as unstable can be changed without a deprecation path.
[code]+@InterfaceStability.Evolving+::
`@InterfaceStability.Evolving`::
Public packages marked as evolving may be changed, but it is discouraged.
No [code]+@InterfaceStability+ Label::
Public classes with no [code]+@InterfaceStability+ label are discouraged, and should be considered implicitly unstable.
No `@InterfaceStability` Label::
Public classes with no `@InterfaceStability` label are discouraged, and should be considered implicitly unstable.
If you are unclear about how to mark packages, ask on the development list.
@ -1413,7 +1423,7 @@ foo = barArray[i];
[[common.patch.feedback.autogen]]
===== Auto Generated Code
Auto-generated code in Eclipse often uses bad variable names such as [literal]+arg0+.
Auto-generated code in Eclipse often uses bad variable names such as `arg0`.
Use more informative variable names.
Use code like the second example here.
@ -1477,13 +1487,13 @@ Your patch won't be committed if it adds such warnings.
[[common.patch.feedback.findbugs]]
===== Findbugs
[code]+Findbugs+ is used to detect common bugs pattern.
`Findbugs` is used to detect common bug patterns.
It is checked during the precommit build by Apache's Jenkins.
If errors are found, please fix them.
You can run findbugs locally with +mvn
findbugs:findbugs+, which will generate the [code]+findbugs+ files locally.
Sometimes, you may have to write code smarter than [code]+findbugs+.
You can annotate your code to tell [code]+findbugs+ you know what you're doing, by annotating your class with the following annotation:
findbugs:findbugs+, which will generate the `findbugs` files locally.
Sometimes, you may have to write code smarter than `findbugs`.
You can annotate your code to tell `findbugs` you know what you're doing, by annotating your class with the following annotation:
[source,java]
----
@ -1510,7 +1520,7 @@ Don't just leave the @param arguments the way your IDE generated them.:
public Foo getFoo(Bar bar);
----
Either add something descriptive to the @[code]+param+ and @[code]+return+ lines, or just remove them.
Either add something descriptive to the `@param` and `@return` lines, or just remove them.
The preference is to add something descriptive and useful.
[[common.patch.feedback.onething]]
@ -1534,7 +1544,7 @@ Make sure that you're clear about what you are testing in your unit tests and wh
In 0.96, HBase moved to protocol buffers (protobufs). The below section on Writables applies to 0.94.x and previous, not to 0.96 and beyond.
====
Every class returned by RegionServers must implement the [code]+Writable+ interface.
Every class returned by RegionServers must implement the `Writable` interface.
If you are creating a new class that needs to implement this interface, do not forget the default constructor.
[[design.invariants]]
@ -1551,9 +1561,9 @@ ZooKeeper state should transient (treat it like memory). If ZooKeeper state is d
* Exceptions: There are currently a few exceptions that we need to fix around whether a table is enabled or disabled.
* Replication data is currently stored only in ZooKeeper.
Deleting ZooKeeper data related to replication may cause replication to be disabled.
Do not delete the replication tree, [path]_/hbase/replication/_.
Do not delete the replication tree, _/hbase/replication/_.
+
WARNING: Replication may be disrupted and data loss may occur if you delete the replication tree ([path]_/hbase/replication/_) from ZooKeeper.
WARNING: Replication may be disrupted and data loss may occur if you delete the replication tree (_/hbase/replication/_) from ZooKeeper.
Follow progress on this issue at link:https://issues.apache.org/jira/browse/HBASE-10295[HBASE-10295].
@ -1609,25 +1619,24 @@ For this the MetricsAssertHelper is provided.
[[git.best.practices]]
=== Git Best Practices
* Use the correct method to create patches.
Use the correct method to create patches.::
See <<submitting.patches,submitting.patches>>.
* Avoid git merges.
Use [code]+git pull --rebase+ or [code]+git
fetch+ followed by [code]+git rebase+.
* Do not use [code]+git push --force+.
Avoid git merges.::
Use `git pull --rebase` or `git fetch` followed by `git rebase`, as sketched after this list.
Do not use `git push --force`.::
If the push does not work, fix the problem or ask for help.
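A minimal sketch of the merge-free update flow (assumes `origin` is the Apache remote and `master` is checked out):
[source,bourne]
----
$ git checkout master
$ git pull --rebase
# or, equivalently:
$ git fetch origin
$ git rebase origin/master
----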
Please contribute to this document if you think of other Git best practices.
==== [code]+rebase_all_git_branches.sh+
==== `rebase_all_git_branches.sh`
The [path]_dev-support/rebase_all_git_branches.sh_ script is provided to help keep your Git repository clean.
Use the [code]+-h+ parameter to get usage instructions.
The script automatically refreshes your tracking branches, attempts an automatic rebase of each local branch against its remote branch, and gives you the option to delete any branch which represents a closed [literal]+HBASE-+ JIRA.
The _dev-support/rebase_all_git_branches.sh_ script is provided to help keep your Git repository clean.
Use the `-h` parameter to get usage instructions.
The script automatically refreshes your tracking branches, attempts an automatic rebase of each local branch against its remote branch, and gives you the option to delete any branch which represents a closed `HBASE-` JIRA.
The script has one optional configuration option, the location of your Git directory.
You can set a default by editing the script.
Otherwise, you can pass the git directory manually by using the [code]+-d+ parameter, followed by an absolute or relative directory name, or even '.' for the current working directory.
The script checks the directory for sub-directory called [path]_.git/_, before proceeding.
Otherwise, you can pass the git directory manually by using the `-d` parameter, followed by an absolute or relative directory name, or even '.' for the current working directory.
The script checks the directory for a sub-directory called _.git/_ before proceeding.
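For example (the repository location is illustrative):
[source,bourne]
----
$ ./dev-support/rebase_all_git_branches.sh -d ~/src/hbase
----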
[[submitting.patches]]
=== Submitting Patches
@ -1645,16 +1654,16 @@ It provides a nice overview that applies equally to the Apache HBase Project.
[[submitting.patches.create]]
==== Create Patch
The script [path]_dev-support/make_patch.sh_ has been provided to help you adhere to patch-creation guidelines.
The script _dev-support/make_patch.sh_ has been provided to help you adhere to patch-creation guidelines.
The script has the following syntax:
----
$ make_patch.sh [-a] [-p <patch_dir>]
----
. If you do not pass a [code]+patch_dir+, the script defaults to [path]_~/patches/_.
If the [code]+patch_dir+ does not exist, it is created.
. By default, if an existing patch exists with the JIRA ID, the version of the new patch is incremented ([path]_HBASE-XXXX-v3.patch_). If the [code]+-a+ option is passed, the version is not incremented, but the suffix [literal]+-addendum+ is added ([path]_HBASE-XXXX-v2-addendum.patch_). A second addendum to a given version is not supported.
. If you do not pass a `patch_dir`, the script defaults to _~/patches/_.
If the `patch_dir` does not exist, it is created.
. By default, if an existing patch exists with the JIRA ID, the version of the new patch is incremented (_HBASE-XXXX-v3.patch_). If the `-a` option is passed, the version is not incremented, but the suffix `-addendum` is added (_HBASE-XXXX-v2-addendum.patch_). A second addendum to a given version is not supported.
. Detects whether you have more than one local commit on your branch.
If you do, the script offers you the chance to run +git rebase
-i+ to squash the changes into a single commit so that it can use +git format-patch+.
@ -1694,15 +1703,15 @@ Please understand that not every patch may get committed, and that feedback will
* If you need to revise your patch, leave the previous patch file(s) attached to the JIRA, and upload the new one, following the naming conventions in <<submitting.patches.create,submitting.patches.create>>.
Cancel the Patch Available flag and then re-trigger it, by toggling the btn:[Patch Available] button in JIRA.
JIRA sorts attached files by the time they were attached, and has no problem with multiple attachments with the same name.
However, at times it is easier to refer to different version of a patch if you add [literal]+-vX+, where the [replaceable]_X_ is the version (starting with 2).
However, at times it is easier to refer to different versions of a patch if you add `-vX`, where _X_ is the version (starting with 2).
* If you need to submit your patch against multiple branches, rather than just master, name each version of the patch with the branch it is for, following the naming conventions in <<submitting.patches.create,submitting.patches.create>>.
.Methods to Create Patches
Eclipse::
Select the menu item.
Git::
[code]+git format-patch+ is preferred because it preserves commit messages.
Use [code]+git rebase -i+ first, to combine (squash) smaller commits into a single larger one.
`git format-patch` is preferred because it preserves commit messages.
Use `git rebase -i` first, to combine (squash) smaller commits into a single larger one.
Subversion::
@ -1734,16 +1743,16 @@ Patches larger than one screen, or patches that will be tricky to review, should
It does not use the credentials from link:http://issues.apache.org[issues.apache.org].
Log in.
. Click [label]#New Review Request#.
. Choose the [literal]+hbase-git+ repository.
. Choose the `hbase-git` repository.
Click Choose File to select the diff and optionally a parent diff.
Click btn:[Create
Review Request].
. Fill in the fields as required.
At the minimum, fill in the [label]#Summary# and choose [literal]+hbase+ as the [label]#Review Group#.
At the minimum, fill in the [label]#Summary# and choose `hbase` as the [label]#Review Group#.
If you fill in the [label]#Bugs# field, the review board links back to the relevant JIRA.
The more fields you fill in, the better.
Click btn:[Publish] to make your review request public.
An email will be sent to everyone in the [literal]+hbase+ group, to review the patch.
An email will be sent to everyone in the `hbase` group, to review the patch.
. Back in your JIRA, click , and paste in the URL of your ReviewBoard request.
This attaches the ReviewBoard to the JIRA, for easy access.
. To cancel the request, click .
@ -1770,7 +1779,7 @@ The list of submitted patches is in the link:https://issues.apache.org/jira/secu
Committers should scan the list from top to bottom, looking for patches that they feel qualified to review and possibly commit.
For non-trivial changes, it is required to get another committer to review your own patches before commit.
Use the btn:[Submit Patch] button in JIRA, just like other contributors, and then wait for a [literal]`+1` response from another committer before committing.
Use the btn:[Submit Patch] button in JIRA, just like other contributors, and then wait for a `+1` response from another committer before committing.
===== Reject
@ -1812,16 +1821,16 @@ The instructions and preferences around the way to create patches have changed,
This is the preference, because you can reuse the submitter's commit message.
If the commit message is not appropriate, you can still use the commit, then run the command +git
rebase -i origin/master+, and squash and reword as appropriate.
* If the first line of the patch looks similar to the following, it was created using +git diff+ without [code]+--no-prefix+.
* If the first line of the patch looks similar to the following, it was created using +git diff+ without `--no-prefix`.
This is acceptable too.
Notice the [literal]+a+ and [literal]+b+ in front of the file names.
This is the indication that the patch was not created with [code]+--no-prefix+.
Notice the `a` and `b` in front of the file names.
This is the indication that the patch was not created with `--no-prefix`.
+
----
diff --git a/src/main/docbkx/developer.xml b/src/main/docbkx/developer.xml
----
* If the first line of the patch looks similar to the following (without the [literal]+a+ and [literal]+b+), the patch was created with +git diff --no-prefix+ and you need to add [code]+-p0+ to the +git apply+ command below.
* If the first line of the patch looks similar to the following (without the `a` and `b`), the patch was created with +git diff --no-prefix+ and you need to add `-p0` to the +git apply+ command below.
+
----
diff --git src/main/docbkx/developer.xml src/main/docbkx/developer.xml
@ -1835,9 +1844,9 @@ The only command that actually writes anything to the remote repository is +git
The extra +git
pull+ commands are usually redundant, but better safe than sorry.
The first example shows how to apply a patch that was generated with +git format-patch+ and apply it to the [code]+master+ and [code]+branch-1+ branches.
The first example shows how to apply a patch that was generated with +git format-patch+ and apply it to the `master` and `branch-1` branches.
The directive to use +git format-patch+ rather than +git diff+, and not to use [code]+--no-prefix+, is a new one.
The directive to use +git format-patch+ rather than +git diff+, and not to use `--no-prefix`, is a new one.
See the second example for how to apply a patch created with +git
diff+, and educate the person who created the patch.
@ -1859,8 +1868,8 @@ $ git push origin branch-1
$ git branch -D HBASE-XXXX
----
This example shows how to commit a patch that was created using +git diff+ without [code]+--no-prefix+.
If the patch was created with [code]+--no-prefix+, add [code]+-p0+ to the +git apply+ command.
This example shows how to commit a patch that was created using +git diff+ without `--no-prefix`.
If the patch was created with `--no-prefix`, add `-p0` to the +git apply+ command.
----
$ git apply ~/Downloads/HBASE-XXXX-v2.patch
@ -1905,8 +1914,8 @@ If the contributor used +git format-patch+ to generate the patch, their commit m
We've established the practice of committing to trunk and then cherry picking back to branches whenever possible.
When there is a minor conflict we can fix it up and just proceed with the commit.
The resulting commit retains the original author.
When the amending author is different from the original committer, add notice of this at the end of the commit message as: [var]+Amending-Author: Author
<committer&apache>+ See discussion at link:http://search-hadoop.com/m/DHED4wHGYS[HBase, mail # dev
When the amending author is different from the original committer, add notice of this at the end of the commit message as: `Amending-Author: Author
<committer&apache>` See discussion at link:http://search-hadoop.com/m/DHED4wHGYS[HBase, mail # dev
- [DISCUSSION] Best practice when amending commits cherry picked
from master to branch].
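A minimal sketch of the cherry-pick-back flow (`<sha>` is the commit already on master; the `-x` flag, which records the original commit id, is optional):
[source,bourne]
----
$ git checkout branch-1
$ git cherry-pick -x <sha>
# If you had to amend while resolving a conflict, note it in the commit message:
$ git commit --amend
# ...and append a line such as:  Amending-Author: Author <committer&apache>
----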


@ -31,7 +31,7 @@
<<quickstart,quickstart>> will get you up and running on a single-node, standalone instance of HBase, followed by a pseudo-distributed single-machine instance, and finally a fully-distributed cluster.
[[quickstart]]
== Quick Start - Standalone HBase
== Quick Start
This guide describes setup of a standalone HBase instance running against the local filesystem.
This is not an appropriate configuration for a production instance of HBase, but will allow you to experiment with HBase.
@ -56,7 +56,7 @@ Prior to HBase 0.94.x, HBase expected the loopback IP address to be 127.0.0.1. U
.Example /etc/hosts File for Ubuntu
====
The following [path]_/etc/hosts_ file works correctly for HBase 0.94.x and earlier, on Ubuntu. Use this as a template if you run into trouble.
The following _/etc/hosts_ file works correctly for HBase 0.94.x and earlier, on Ubuntu. Use this as a template if you run into trouble.
[listing]
----
127.0.0.1 localhost
@ -78,10 +78,10 @@ See <<java,java>> for information about supported JDK versions.
Click on the suggested top link.
This will take you to a mirror of _HBase
Releases_.
Click on the folder named [path]_stable_ and then download the binary file that ends in [path]_.tar.gz_ to your local filesystem.
Click on the folder named _stable_ and then download the binary file that ends in _.tar.gz_ to your local filesystem.
Be sure to choose the version that corresponds with the version of Hadoop you are likely to use later.
In most cases, you should choose the file for Hadoop 2, which will be called something like [path]_hbase-0.98.3-hadoop2-bin.tar.gz_.
Do not download the file ending in [path]_src.tar.gz_ for now.
In most cases, you should choose the file for Hadoop 2, which will be called something like _hbase-0.98.3-hadoop2-bin.tar.gz_.
Do not download the file ending in _src.tar.gz_ for now.
. Extract the downloaded file, and change to the newly-created directory.
+
----
@ -90,29 +90,29 @@ $ tar xzvf hbase-<?eval ${project.version}?>-hadoop2-bin.tar.gz
$ cd hbase-<?eval ${project.version}?>-hadoop2/
----
. For HBase 0.98.5 and later, you are required to set the [var]+JAVA_HOME+ environment variable before starting HBase.
. For HBase 0.98.5 and later, you are required to set the `JAVA_HOME` environment variable before starting HBase.
Prior to 0.98.5, HBase attempted to detect the location of Java if the variable was not set.
You can set the variable via your operating system's usual mechanism, but HBase provides a central mechanism, [path]_conf/hbase-env.sh_.
Edit this file, uncomment the line starting with [literal]+JAVA_HOME+, and set it to the appropriate location for your operating system.
The [var]+JAVA_HOME+ variable should be set to a directory which contains the executable file [path]_bin/java_.
You can set the variable via your operating system's usual mechanism, but HBase provides a central mechanism, _conf/hbase-env.sh_.
Edit this file, uncomment the line starting with `JAVA_HOME`, and set it to the appropriate location for your operating system.
The `JAVA_HOME` variable should be set to a directory which contains the executable file _bin/java_.
Most modern Linux operating systems provide a mechanism, such as /usr/bin/alternatives on RHEL or CentOS, for transparently switching between versions of executables such as Java.
In this case, you can set [var]+JAVA_HOME+ to the directory containing the symbolic link to [path]_bin/java_, which is usually [path]_/usr_.
In this case, you can set `JAVA_HOME` to the directory containing the symbolic link to _bin/java_, which is usually _/usr_.
+
----
JAVA_HOME=/usr
----
+
NOTE: These instructions assume that each node of your cluster uses the same configuration.
If this is not the case, you may need to set [var]+JAVA_HOME+ separately for each node.
If this is not the case, you may need to set `JAVA_HOME` separately for each node.
. Edit [path]_conf/hbase-site.xml_, which is the main HBase configuration file.
. Edit _conf/hbase-site.xml_, which is the main HBase configuration file.
At this time, you only need to specify the directory on the local filesystem where HBase and Zookeeper write data.
By default, a new directory is created under /tmp.
Many servers are configured to delete the contents of /tmp upon reboot, so you should store the data elsewhere.
The following configuration will store HBase's data in the [path]_hbase_ directory, in the home directory of the user called [systemitem]+testuser+.
The following configuration will store HBase's data in the _hbase_ directory, in the home directory of the user called `testuser`.
Paste the [markup]+<property>+ tags beneath the [markup]+<configuration>+ tags, which should be empty in a new HBase install.
+
.Example [path]_hbase-site.xml_ for Standalone HBase
.Example _hbase-site.xml_ for Standalone HBase
====
[source,xml]
----
@ -134,22 +134,22 @@ You do not need to create the HBase data directory.
HBase will do this for you.
If you create the directory, HBase will attempt to do a migration, which is not what you want.
. The [path]_bin/start-hbase.sh_ script is provided as a convenient way to start HBase.
. The _bin/start-hbase.sh_ script is provided as a convenient way to start HBase.
Issue the command, and if all goes well, a message is logged to standard output showing that HBase started successfully.
You can use the +jps+ command to verify that you have one running process called [literal]+HMaster+.
You can use the +jps+ command to verify that you have one running process called `HMaster`.
In standalone mode HBase runs all daemons within this single JVM, i.e.
the HMaster, a single HRegionServer, and the ZooKeeper daemon.
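+
For example (a minimal sketch; the process ids in the jps output will differ on your system):
+
[source,bourne]
----
$ bin/start-hbase.sh
$ jps
12345 HMaster
12346 Jps
----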
+
NOTE: Java needs to be installed and available.
If you get an error indicating that Java is not installed, but it is on your system, perhaps in a non-standard location, edit the [path]_conf/hbase-env.sh_ file and modify the [var]+JAVA_HOME+ setting to point to the directory that contains [path]_bin/java_ your system.
If you get an error indicating that Java is not installed, but it is on your system, perhaps in a non-standard location, edit the _conf/hbase-env.sh_ file and modify the `JAVA_HOME` setting to point to the directory that contains _bin/java_ on your system.
.Procedure: Use HBase For the First Time
. Connect to HBase.
+
Connect to your running instance of HBase using the +hbase shell+ command, located in the [path]_bin/_ directory of your HBase install.
Connect to your running instance of HBase using the +hbase shell+ command, located in the _bin/_ directory of your HBase install.
In this example, some usage and version information that is printed when you start HBase Shell has been omitted.
The HBase Shell prompt ends with a [literal]+>+ character.
The HBase Shell prompt ends with a `>` character.
+
----
@ -159,12 +159,12 @@ hbase(main):001:0>
. Display HBase Shell Help Text.
+
Type [literal]+help+ and press Enter, to display some basic usage information for HBase Shell, as well as several example commands.
Type `help` and press Enter, to display some basic usage information for HBase Shell, as well as several example commands.
Notice that table names, rows, and columns all must be enclosed in quote characters.
. Create a table.
+
Use the [code]+create+ command to create a new table.
Use the `create` command to create a new table.
You must specify the table name and the ColumnFamily name.
+
----
@ -175,7 +175,7 @@ hbase> create 'test', 'cf'
. List Information About your Table
+
Use the [code]+list+ command to
Use the `list` command to list information about your table.
+
----
@ -189,7 +189,7 @@ test
. Put data into your table.
+
To put data into your table, use the [code]+put+ command.
To put data into your table, use the `put` command.
+
----
@ -204,8 +204,8 @@ hbase> put 'test', 'row3', 'cf:c', 'value3'
----
+
Here, we insert three values, one at a time.
The first insert is at [literal]+row1+, column [literal]+cf:a+, with a value of [literal]+value1+.
Columns in HBase are comprised of a column family prefix, [literal]+cf+ in this example, followed by a colon and then a column qualifier suffix, [literal]+a+ in this case.
The first insert is at `row1`, column `cf:a`, with a value of `value1`.
Columns in HBase are comprised of a column family prefix, `cf` in this example, followed by a colon and then a column qualifier suffix, `a` in this case.
. Scan the table for all data at once.
+
@ -237,8 +237,8 @@ COLUMN CELL
. Disable a table.
+
If you want to delete a table or change its settings, as well as in some other situations, you need to disable the table first, using the [code]+disable+ command.
You can re-enable it using the [code]+enable+ command.
If you want to delete a table or change its settings, as well as in some other situations, you need to disable the table first, using the `disable` command.
You can re-enable it using the `enable` command.
+
----
@ -259,7 +259,7 @@ hbase> disable 'test'
. Drop the table.
+
To drop (delete) a table, use the [code]+drop+ command.
To drop (delete) a table, use the `drop` command.
+
----
@ -274,7 +274,7 @@ HBase is still running in the background.
.Procedure: Stop HBase
. In the same way that the [path]_bin/start-hbase.sh_ script is provided to conveniently start all HBase daemons, the [path]_bin/stop-hbase.sh_ script stops them.
. In the same way that the _bin/start-hbase.sh_ script is provided to conveniently start all HBase daemons, the _bin/stop-hbase.sh_ script stops them.
+
----
@ -291,7 +291,7 @@ $
After working your way through <<quickstart,quickstart>>, you can re-configure HBase to run in pseudo-distributed mode.
Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and Zookeeper) runs as a separate process.
By default, unless you configure the [code]+hbase.rootdir+ property as described in <<quickstart,quickstart>>, your data is still stored in [path]_/tmp/_.
By default, unless you configure the `hbase.rootdir` property as described in <<quickstart,quickstart>>, your data is still stored in _/tmp/_.
In this walk-through, we store your data in HDFS instead, assuming you have HDFS available.
You can skip the HDFS configuration to continue storing your data in the local filesystem.
@ -311,7 +311,7 @@ This procedure will create a totally new directory where HBase will store its da
. Configure HBase.
+
Edit the [path]_hbase-site.xml_ configuration.
Edit the _hbase-site.xml_ configuration.
First, add the following property, which directs HBase to run in distributed mode, with one JVM instance per daemon.
+
@ -324,7 +324,7 @@ which directs HBase to run in distributed mode, with one JVM instance per daemon
</property>
----
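+
For reference, a sketch of that property as it would appear in _hbase-site.xml_ (`hbase.cluster.distributed` is the standard key for this setting):
+
[source,xml]
----
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
----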
+
Next, change the [code]+hbase.rootdir+ from the local filesystem to the address of your HDFS instance, using the [code]+hdfs:////+ URI syntax.
Next, change the `hbase.rootdir` from the local filesystem to the address of your HDFS instance, using the `hdfs://` URI syntax.
In this example, HDFS is running on the localhost at port 8020.
+
[source,xml]
@ -342,14 +342,14 @@ If you create the directory, HBase will attempt to do a migration, which is not
. Start HBase.
+
Use the [path]_bin/start-hbase.sh_ command to start HBase.
Use the _bin/start-hbase.sh_ command to start HBase.
If your system is configured correctly, the +jps+ command should show the HMaster and HRegionServer processes running.
. Check the HBase directory in HDFS.
+
If everything worked correctly, HBase created its directory in HDFS.
In the configuration above, it is stored in [path]_/hbase/_ on HDFS.
You can use the +hadoop fs+ command in Hadoop's [path]_bin/_ directory to list this directory.
In the configuration above, it is stored in _/hbase/_ on HDFS.
You can use the +hadoop fs+ command in Hadoop's _bin/_ directory to list this directory.
+
----
@ -385,7 +385,7 @@ The following command starts 3 backup servers using ports 16012/16022/16032, 160
$ ./bin/local-master-backup.sh 2 3 5
----
+
To kill a backup master without killing the entire cluster, you need to find its process ID (PID). The PID is stored in a file with a name like [path]_/tmp/hbase-USER-X-master.pid_.
To kill a backup master without killing the entire cluster, you need to find its process ID (PID). The PID is stored in a file with a name like _/tmp/hbase-USER-X-master.pid_.
The only contents of the file are the PID.
You can use the +kill -9+ command to kill that PID.
The following command will kill the master with port offset 1, but leave the cluster running:
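+
A sketch of such a command, assuming the user running HBase is `testuser` (substitute your own username and offset):
+
----
$ cat /tmp/hbase-testuser-1-master.pid | xargs kill -9
----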
@ -413,7 +413,7 @@ The following command starts four additional RegionServers, running on sequentia
$ .bin/local-regionservers.sh start 2 3 4 5
----
+
To stop a RegionServer manually, use the +local-regionservers.sh+ command with the [literal]+stop+ parameter and the offset of the server to stop.
To stop a RegionServer manually, use the +local-regionservers.sh+ command with the `stop` parameter and the offset of the server to stop.
+
----
$ .bin/local-regionservers.sh stop 3
@ -421,7 +421,7 @@ $ .bin/local-regionservers.sh stop 3
. Stop HBase.
+
You can stop HBase the same way as in the <<quickstart,quickstart>> procedure, using the [path]_bin/stop-hbase.sh_ command.
You can stop HBase the same way as in the <<quickstart,quickstart>> procedure, using the _bin/stop-hbase.sh_ command.
[[quickstart_fully_distributed]]
@ -437,27 +437,25 @@ The architecture will be as follows:
.Distributed Cluster Demo Architecture
[cols="1,1,1,1", options="header"]
|===
| Node Name
| Master
| ZooKeeper
| RegionServer
| Node Name | Master | ZooKeeper | RegionServer
| node-a.example.com | yes | yes | no
| node-b.example.com | backup | yes | yes
| node-c.example.com | no | yes | yes
|===
This quickstart assumes that each node is a virtual machine and that they are all on the same network.
It builds upon the previous quickstart, <<quickstart_pseudo,quickstart-pseudo>>, assuming that the system you configured in that procedure is now [code]+node-a+.
Stop HBase on [code]+node-a+ before continuing.
It builds upon the previous quickstart, <<quickstart_pseudo,quickstart-pseudo>>, assuming that the system you configured in that procedure is now `node-a`.
Stop HBase on `node-a` before continuing.
NOTE: Be sure that all the nodes have full access to communicate, and that no firewall rules are in place which could prevent them from talking to each other.
If you see any errors like [literal]+no route to host+, check your firewall.
If you see any errors like `no route to host`, check your firewall.
.Procedure: Configure Password-Less SSH Access
[code]+node-a+ needs to be able to log into [code]+node-b+ and [code]+node-c+ (and to itself) in order to start the daemons.
The easiest way to accomplish this is to use the same username on all hosts, and configure password-less SSH login from [code]+node-a+ to each of the others.
`node-a` needs to be able to log into `node-b` and `node-c` (and to itself) in order to start the daemons.
The easiest way to accomplish this is to use the same username on all hosts, and configure password-less SSH login from `node-a` to each of the others.
. On [code]+node-a+, generate a key pair.
. On `node-a`, generate a key pair.
+
While logged in as the user who will run HBase, generate an SSH key pair, using the following command:
+
@ -467,19 +465,19 @@ $ ssh-keygen -t rsa
----
+
If the command succeeds, the location of the key pair is printed to standard output.
The default name of the public key is [path]_id_rsa.pub_.
The default name of the public key is _id_rsa.pub_.
. Create the directory that will hold the shared keys on the other nodes.
+
On [code]+node-b+ and [code]+node-c+, log in as the HBase user and create a [path]_.ssh/_ directory in the user's home directory, if it does not already exist.
On `node-b` and `node-c`, log in as the HBase user and create a _.ssh/_ directory in the user's home directory, if it does not already exist.
If it already exists, be aware that it may already contain other keys.
. Copy the public key to the other nodes.
+
Securely copy the public key from [code]+node-a+ to each of the nodes, by using the +scp+ or some other secure means.
On each of the other nodes, create a new file called [path]_.ssh/authorized_keys_ _if it does
not already exist_, and append the contents of the [path]_id_rsa.pub_ file to the end of it.
Note that you also need to do this for [code]+node-a+ itself.
Securely copy the public key from `node-a` to each of the nodes, by using +scp+ or some other secure means.
On each of the other nodes, create a new file called _.ssh/authorized_keys_ _if it does
not already exist_, and append the contents of the _id_rsa.pub_ file to the end of it.
Note that you also need to do this for `node-a` itself.
+
----
$ cat id_rsa.pub >> ~/.ssh/authorized_keys
@ -487,27 +485,27 @@ $ cat id_rsa.pub >> ~/.ssh/authorized_keys
. Test password-less login.
+
If you performed the procedure correctly, if you SSH from [code]+node-a+ to either of the other nodes, using the same username, you should not be prompted for a password.
If you performed the procedure correctly, you should not be prompted for a password when you SSH from `node-a` to either of the other nodes using the same username.
. Since [code]+node-b+ will run a backup Master, repeat the procedure above, substituting [code]+node-b+ everywhere you see [code]+node-a+.
Be sure not to overwrite your existing [path]_.ssh/authorized_keys_ files, but concatenate the new key onto the existing file using the [code]+>>+ operator rather than the [code]+>+ operator.
. Since `node-b` will run a backup Master, repeat the procedure above, substituting `node-b` everywhere you see `node-a`.
Be sure not to overwrite your existing _.ssh/authorized_keys_ files, but concatenate the new key onto the existing file using the `>>` operator rather than the `>` operator.
.Procedure: Prepare [code]+node-a+
.Procedure: Prepare `node-a`
`node-a` will run your primary master and ZooKeeper processes, but no RegionServers.
. Stop the RegionServer from starting on [code]+node-a+.
. Stop the RegionServer from starting on `node-a`.
. Edit [path]_conf/regionservers_ and remove the line which contains [literal]+localhost+. Add lines with the hostnames or IP addresses for [code]+node-b+ and [code]+node-c+.
. Edit _conf/regionservers_ and remove the line which contains `localhost`. Add lines with the hostnames or IP addresses for `node-b` and `node-c`.
+
Even if you did want to run a RegionServer on [code]+node-a+, you should refer to it by the hostname the other servers would use to communicate with it.
In this case, that would be [literal]+node-a.example.com+.
Even if you did want to run a RegionServer on `node-a`, you should refer to it by the hostname the other servers would use to communicate with it.
In this case, that would be `node-a.example.com`.
This enables you to distribute the configuration to each node of your cluster without hostname conflicts.
Save the file.
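+
For example, using the hostnames from this walk-through, the resulting _conf/regionservers_ file would contain only:
+
----
node-b.example.com
node-c.example.com
----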
. Configure HBase to use [code]+node-b+ as a backup master.
. Configure HBase to use `node-b` as a backup master.
+
Create a new file in [path]_conf/_ called [path]_backup-masters_, and add a new line to it with the hostname for [code]+node-b+.
In this demonstration, the hostname is [literal]+node-b.example.com+.
Create a new file in _conf/_ called _backup-masters_, and add a new line to it with the hostname for `node-b`.
In this demonstration, the hostname is `node-b.example.com`.
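+
As a sketch, the new _conf/backup-masters_ file would contain the single line:
+
----
node-b.example.com
----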
. Configure ZooKeeper
+
@ -515,7 +513,7 @@ In reality, you should carefully consider your ZooKeeper configuration.
You can find out more about configuring ZooKeeper in <<zookeeper,zookeeper>>.
This configuration will direct HBase to start and manage a ZooKeeper instance on each node of the cluster.
+
On [code]+node-a+, edit [path]_conf/hbase-site.xml_ and add the following properties.
On `node-a`, edit _conf/hbase-site.xml_ and add the following properties.
+
[source,bourne]
----
@ -529,22 +527,22 @@ On [code]+node-a+, edit [path]_conf/hbase-site.xml_ and add the following proper
</property>
----
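+
As a sketch, assuming the three hostnames used in this walk-through, those ZooKeeper-related properties might look like the following (the data directory path is an assumption):
+
[source,xml]
----
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/zookeeper</value>
</property>
----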
. Everywhere in your configuration that you have referred to [code]+node-a+ as [literal]+localhost+, change the reference to point to the hostname that the other nodes will use to refer to [code]+node-a+.
In these examples, the hostname is [literal]+node-a.example.com+.
. Everywhere in your configuration that you have referred to `node-a` as `localhost`, change the reference to point to the hostname that the other nodes will use to refer to `node-a`.
In these examples, the hostname is `node-a.example.com`.
.Procedure: Prepare [code]+node-b+ and [code]+node-c+
.Procedure: Prepare `node-b` and `node-c`
[code]+node-b+ will run a backup master server and a ZooKeeper instance.
`node-b` will run a backup master server and a ZooKeeper instance.
. Download and unpack HBase.
+
Download and unpack HBase to [code]+node-b+, just as you did for the standalone and pseudo-distributed quickstarts.
Download and unpack HBase to `node-b`, just as you did for the standalone and pseudo-distributed quickstarts.
. Copy the configuration files from [code]+node-a+ to [code]+node-b+.and
[code]+node-c+.
. Copy the configuration files from `node-a` to `node-b` and `node-c`.
+
Each node of your cluster needs to have the same configuration information.
Copy the contents of the [path]_conf/_ directory to the [path]_conf/_ directory on [code]+node-b+ and [code]+node-c+.
Copy the contents of the _conf/_ directory to the _conf/_ directory on `node-b` and `node-c`.
.Procedure: Start and Test Your Cluster
@ -552,12 +550,12 @@ Copy the contents of the [path]_conf/_ directory to the [path]_conf/_
+
If you forgot to stop HBase from previous testing, you will have errors.
Check to see whether HBase is running on any of your nodes by using the +jps+ command.
Look for the processes [literal]+HMaster+, [literal]+HRegionServer+, and [literal]+HQuorumPeer+.
Look for the processes `HMaster`, `HRegionServer`, and `HQuorumPeer`.
If they exist, kill them.
. Start the cluster.
+
On [code]+node-a+, issue the +start-hbase.sh+ command.
On `node-a`, issue the +start-hbase.sh+ command.
Your output will be similar to that below.
+
----
@ -614,9 +612,9 @@ $ jps
.ZooKeeper Process Name
[NOTE]
====
The [code]+HQuorumPeer+ process is a ZooKeeper instance which is controlled and started by HBase.
The `HQuorumPeer` process is a ZooKeeper instance which is controlled and started by HBase.
If you use ZooKeeper this way, it is limited to one instance per cluster node, and is appropriate for testing only.
If ZooKeeper is run outside of HBase, the process is called [code]+QuorumPeer+.
If ZooKeeper is run outside of HBase, the process is called `QuorumPeer`.
For more about ZooKeeper configuration, including using an external ZooKeeper instance with HBase, see <<zookeeper,zookeeper>>.
====
@ -628,9 +626,9 @@ NOTE: Web UI Port Changes
In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for the RegionServer.
+
If everything is set up correctly, you should be able to connect to the UI for the Master [literal]+http://node-a.example.com:60110/+ or the secondary master at [literal]+http://node-b.example.com:60110/+ for the secondary master, using a web browser.
If you can connect via [code]+localhost+ but not from another host, check your firewall rules.
You can see the web UI for each of the RegionServers at port 60130 of their IP addresses, or by clicking their links in the web UI for the Master.
If everything is set up correctly, you should be able to connect to the web UI for the Master at `http://node-a.example.com:16010/`, or to the UI for the secondary Master at `http://node-b.example.com:16010/`, using a web browser.
If you can connect via `localhost` but not from another host, check your firewall rules.
You can see the web UI for each of the RegionServers at port 16030 of their IP addresses, or by clicking their links in the web UI for the Master.
. Test what happens when nodes or services disappear.
+
@ -45,7 +45,7 @@ At the end of the commands output it prints OK or tells you the number of INCONS
You may also want to run hbck a few times because some inconsistencies can be transient (e.g.
cluster is starting up or a region is splitting). Operationally you may want to run hbck regularly and set up an alert (e.g.
via nagios) if it repeatedly reports inconsistencies. A run of hbck will report a list of inconsistencies along with a brief description of the regions and tables affected.
The using the [code]+-details+ option will report more details including a representative listing of all the splits present in all the tables.
Using the `-details` option will report more details, including a representative listing of all the splits present in all the tables.
[source,bourne]
----
@ -76,7 +76,7 @@ There are two invariants that when violated create inconsistencies in HBase:
Repairs generally work in three phases -- a read-only information gathering phase that identifies inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then finally a region consistency repair phase that restores the region consistency invariant.
Starting from version 0.90.0, hbck could detect region consistency problems and report on a subset of possible table integrity problems.
It also included the ability to automatically fix the most common inconsistencies: region assignment and deployment consistency problems.
This repair could be done by using the [code]+-fix+ command line option.
This repair could be done by using the `-fix` command line option.
This repair closes regions if they are open on the wrong server or on multiple region servers, and also assigns regions to region servers if they are not open.
Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options are introduced to aid repairing a corrupted HBase.
@ -89,8 +89,8 @@ These are generally region consistency repairs -- localized single region repair
Region consistency requires that the HBase instance has the state of the region's data in HDFS (.regioninfo files), the region's row in the hbase:meta table, and the region's deployment/assignments on region servers and the master all in accordance.
Options for repairing region consistency include:
* [code]+-fixAssignments+ (equivalent to the 0.90 [code]+-fix+ option) repairs unassigned, incorrectly assigned or multiply assigned regions.
* [code]+-fixMeta+ which removes meta rows when corresponding regions are not present in HDFS and adds new meta rows if they regions are present in HDFS while not in META. To fix deployment and assignment problems you can run this command:
* `-fixAssignments` (equivalent to the 0.90 `-fix` option) repairs unassigned, incorrectly assigned or multiply assigned regions.
* `-fixMeta` which removes meta rows when corresponding regions are not present in HDFS and adds new meta rows if the regions are present in HDFS while not in META. To fix deployment and assignment problems you can run this command:
[source,bourne]
----
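# A sketch of the command implied by the options described above (assumed, not verbatim):
$ ./bin/hbase hbck -fixAssignments -fixMeta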
@ -110,7 +110,7 @@ There are a few classes of table integrity problems that are low risk repairs.
The first two are degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are automatically handled by sidelining the data to a temporary directory (/hbck/xxxx). The third low-risk class is hdfs region holes.
This can be repaired by using the:
* [code]+-fixHdfsHoles+ option for fabricating new empty regions on the file system.
* `-fixHdfsHoles` option for fabricating new empty regions on the file system.
If holes are detected, you can use `-fixHdfsHoles`, and should include `-fixMeta` and `-fixAssignments` to make the new region consistent.
[source,bourne]
@ -119,7 +119,7 @@ This can be repaired by using the:
$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
----
Since this is a common operation, we've added a the [code]+-repairHoles+ flag that is equivalent to the previous command:
Since this is a common operation, we've added the `-repairHoles` flag that is equivalent to the previous command:
[source,bourne]
----
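# A sketch using the flag described above (assumed):
$ ./bin/hbase hbck -repairHoles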
@ -133,12 +133,12 @@ If inconsistencies still remain after these steps, you most likely have table in
Table integrity problems can require repairs that deal with overlaps.
This is a riskier operation because it requires modifications to the file system, requires some decision making, and may require some manual steps.
For these repairs it is best to analyze the output of a [code]+hbck -details+ run so that you isolate repairs attempts only upon problems the checks identify.
For these repairs it is best to analyze the output of an `hbck -details` run so that you isolate repair attempts to only the problems the checks identify.
Because this is riskier, there are safeguards that should be used to limit the scope of the repairs.
WARNING: These repairs are relatively new and have only been tested on online but idle HBase instances (no reads/writes). Use at your own risk in an active production environment! The options for repairing table integrity violations include:
* [code]+-fixHdfsOrphans+ option for ``adopting'' a region directory that is missing a region metadata file (the .regioninfo file).
* [code]+-fixHdfsOverlaps+ ability for fixing overlapping regions
* `-fixHdfsOrphans` option for ``adopting'' a region directory that is missing a region metadata file (the .regioninfo file).
* `-fixHdfsOverlaps` ability for fixing overlapping regions
When repairing overlapping regions, a region's data can be modified on the file system in two ways: 1) by merging regions into a larger region or 2) by sidelining regions by moving data to a ``sideline'' directory where data could be restored later.
Merging a large number of regions is technically correct but could result in an extremely large region that requires a series of costly compactions and splitting operations.
@ -147,13 +147,13 @@ Since these sidelined regions are already laid out in HBase's native directory a
The default safeguard thresholds are conservative.
These options let you override the default thresholds and enable the large region sidelining feature.
* [code]+-maxMerge <n>+ maximum number of overlapping regions to merge
* [code]+-sidelineBigOverlaps+ if more than maxMerge regions are overlapping, sideline attempt to sideline the regions overlapping with the most other regions.
* [code]+-maxOverlapsToSideline <n>+ if sidelining large overlapping regions, sideline at most n regions.
* `-maxMerge <n>` maximum number of overlapping regions to merge
* `-sidelineBigOverlaps` if more than maxMerge regions are overlapping, attempt to sideline the regions overlapping with the most other regions.
* `-maxOverlapsToSideline <n>` if sidelining large overlapping regions, sideline at most n regions.
Since oftentimes you just want to get the tables repaired, you can use this option to turn on all repair options:
* [code]+-repair+ includes all the region consistency options and only the hole repairing table integrity options.
* `-repair` includes all the region consistency options and only the hole repairing table integrity options.
Finally, there are safeguards to limit repairs to only specific tables.
For example the following command would only attempt to check and repair table TableFoo and TableBar.
@ -167,7 +167,7 @@ $ ./bin/hbase hbck -repair TableFoo TableBar
There are a few special cases that hbck can handle as well.
Sometimes the meta table's only region is inconsistently assigned or deployed.
In this case there is a special [code]+-fixMetaOnly+ option that can try to fix meta assignments.
In this case there is a special `-fixMetaOnly` option that can try to fix meta assignments.
----
@ -177,7 +177,7 @@ $ ./bin/hbase hbck -fixMetaOnly -fixAssignments
==== Special cases: HBase version file is missing
HBase's data on the file system requires a version file in order to start.
If this flie is missing, you can use the [code]+-fixVersionFile+ option to fabricating a new HBase version file.
If this file is missing, you can use the `-fixVersionFile` option to fabricate a new HBase version file.
This assumes that the version of hbck you are running is the appropriate version for the HBase cluster.
==== Special case: Root and META are corrupt.
@ -204,9 +204,9 @@ HBase can clean up parents in the right order.
However, there could be some lingering offline split parents sometimes.
They are in META, in HDFS, and not deployed.
But HBase can't clean them up.
In this case, you can use the [code]+-fixSplitParents+ option to reset them in META to be online and not split.
In this case, you can use the `-fixSplitParents` option to reset them in META to be online and not split.
Therefore, hbck can merge them with other regions if the option for fixing overlapping regions is used.
This option should not normally be used, and it is not in [code]+-fixAll+.
This option should not normally be used, and it is not in `-fixAll`.
:numbered:
@ -38,37 +38,38 @@ In addition, it discusses other interactions and issues between HBase and MapRed
.mapred and mapreduce
[NOTE]
====
There are two mapreduce packages in HBase as in MapReduce itself: [path]_org.apache.hadoop.hbase.mapred_ and [path]_org.apache.hadoop.hbase.mapreduce_.
There are two mapreduce packages in HBase as in MapReduce itself: _org.apache.hadoop.hbase.mapred_ and _org.apache.hadoop.hbase.mapreduce_.
The former does old-style API and the latter the new style.
The latter has more facilities, though you can usually find an equivalent in the older package.
Pick the package that goes with your mapreduce deploy.
When in doubt or starting over, pick the [path]_org.apache.hadoop.hbase.mapreduce_.
When in doubt or starting over, pick the _org.apache.hadoop.hbase.mapreduce_.
In the notes below, we refer to o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
====
[[hbase.mapreduce.classpath]]
== HBase, MapReduce, and the CLASSPATH
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under [var]+$HBASE_CONF_DIR+ or the HBase classes.
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under `$HBASE_CONF_DIR` or the HBase classes.
To give the MapReduce jobs the access they need, you could add [path]_hbase-site.xml_ to the [path]_$HADOOP_HOME/conf/_ directory and add the HBase JARs to the [path]_HADOOP_HOME/conf/_ directory, then copy these changes across your cluster.
You could add hbase-site.xml to $HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib.
You would then need to copy these changes across your cluster or edit [path]_$HADOOP_HOMEconf/hadoop-env.sh_ and add them to the [var]+HADOOP_CLASSPATH+ variable.
To give the MapReduce jobs the access they need, you could add _hbase-site.xml_ to the _$HADOOP_HOME/conf/_ directory and add the HBase JARs to the _$HADOOP_HOME/lib/_ directory, then copy these changes across your cluster.
You could instead edit _$HADOOP_HOME/conf/hadoop-env.sh_ and add them to the `HADOOP_CLASSPATH` variable.
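As a sketch, the (not recommended) _hadoop-env.sh_ approach might look like the following, reusing the +hbase classpath+ command that appears later in this section (this assumes the +hbase+ script is on the PATH):

[source,bash]
----
# Append HBase classes and configuration to Hadoop's classpath.
# Not recommended: this pollutes the Hadoop install with HBase references (see above).
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$(hbase classpath)"
----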
However, this approach is not recommended because it will pollute your Hadoop install with HBase references.
It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data.
Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself.
The dependencies only need to be available on the local CLASSPATH.
The following example runs the bundled HBase link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job against a table named [systemitem]+usertable+ If you have not set the environment variables expected in the command (the parts prefixed by a [literal]+$+ sign and curly braces), you can use the actual system paths instead.
The dependencies only need to be available on the local `CLASSPATH`.
The following example runs the bundled HBase link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job against a table named `usertable`. If you have not set the environment variables expected in the command (the parts prefixed by a `$` sign and curly braces), you can use the actual system paths instead.
Be sure to use the correct version of the HBase JAR for your system.
The backticks ([literal]+`+ symbols) cause ths shell to execute the sub-commands, setting the CLASSPATH as part of the command.
The backticks cause the shell to execute the sub-commands, setting the `CLASSPATH` as part of the command.
This example assumes you use a BASH-compatible shell.
[source,bash]
----
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable
----
When the command runs, internally, the HBase JAR finds the dependencies it needs for zookeeper, guava, and its other dependencies on the passed [var]+HADOOP_CLASSPATH+ and adds the JARs to the MapReduce job configuration.
When the command runs, internally, the HBase JAR finds the dependencies it needs for zookeeper, guava, and its other dependencies on the passed `HADOOP_CLASSPATH` and adds the JARs to the MapReduce job configuration.
See the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done.
[NOTE]
@ -80,8 +81,9 @@ You may see an error like the following:
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
----
If this occurs, try modifying the command as follows, so that it uses the HBase JARs from the [path]_target/_ directory within the build environment.
If this occurs, try modifying the command as follows, so that it uses the HBase JARs from the _target/_ directory within the build environment.
[source,bash]
----
$ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable
----
@ -94,7 +96,6 @@ Some mapreduce jobs that use HBase fail to launch.
The symptom is an exception similar to the following:
----
Exception in thread "main" java.lang.IllegalAccessError: class
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
com.google.protobuf.LiteralByteString
@ -126,7 +127,7 @@ Exception in thread "main" java.lang.IllegalAccessError: class
This is caused by an optimization introduced in link:https://issues.apache.org/jira/browse/HBASE-9867[HBASE-9867] that inadvertently introduced a classloader dependency.
This affects both jobs using the [code]+-libjars+ option and "fat jar," those which package their runtime dependencies in a nested [code]+lib+ folder.
This affects both jobs using the `-libjars` option and "fat jar" jobs, which package their runtime dependencies in a nested `lib` folder.
In order to satisfy the new classloader requirements, hbase-protocol.jar must be included in Hadoop's classpath.
See <<hbase.mapreduce.classpath,hbase.mapreduce.classpath>> for current recommendations for resolving classpath errors.
@ -134,11 +135,11 @@ The following is included for historical purposes.
This can be resolved system-wide by including a reference to the hbase-protocol.jar in hadoop's lib directory, via a symlink or by copying the jar into the new location.
This can also be achieved on a per-job launch basis by including it in the [code]+HADOOP_CLASSPATH+ environment variable at job submission time.
This can also be achieved on a per-job launch basis by including it in the `HADOOP_CLASSPATH` environment variable at job submission time.
When launching jobs that package their dependencies, all three of the following job launching commands satisfy this requirement:
[source,bash]
----
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
@ -146,8 +147,8 @@ $ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
For jars that do not package their dependencies, the following command structure is necessary:
[source,bash]
----
$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
----
@ -161,8 +162,8 @@ This functionality was lost due to a bug in HBase 0.95 (link:https://issues.apac
The priority order for choosing the scanner caching is as follows:
. Caching settings which are set on the scan object.
. Caching settings which are specified via the configuration option +hbase.client.scanner.caching+, which can either be set manually in [path]_hbase-site.xml_ or via the helper method [code]+TableMapReduceUtil.setScannerCaching()+.
. The default value [code]+HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING+, which is set to [literal]+100+.
. Caching settings which are specified via the configuration option +hbase.client.scanner.caching+, which can either be set manually in _hbase-site.xml_ or via the helper method `TableMapReduceUtil.setScannerCaching()`.
. The default value `HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING`, which is set to `100`.
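As a sketch of the first (highest-priority) option, the caching can be set directly on the `Scan` object before it is passed to `TableMapReduceUtil` when the job is set up (the value 500 is only an example):

[source,java]
----
Scan scan = new Scan();
scan.setCaching(500);        // number of rows fetched per RPC for this job's scanner
scan.setCacheBlocks(false);  // MR scans should not fill the block cache
----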
Optimizing the caching settings is a balance between the time the client waits for a result and the number of sets of results the client needs to receive.
If the caching setting is too large, the client could end up waiting for a long time or the request could even time out.
@ -178,6 +179,7 @@ See the API documentation for link:https://hbase.apache.org/apidocs/org/apache/h
The HBase JAR also serves as a Driver for some bundled mapreduce jobs.
To learn about the bundled MapReduce jobs, run the following command.
[source,bash]
----
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar
An example program must be given as the first argument.
@ -193,6 +195,7 @@ Valid program names are:
Each of the valid program names are bundled MapReduce jobs.
To run one of the jobs, model your command after the following example.
[source,bash]
----
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable
----
@ -202,12 +205,12 @@ $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounte
HBase can be used as a data source, link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat], and data sink, link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat] or link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html[MultiTableOutputFormat], for MapReduce jobs.
When writing MapReduce jobs that read or write HBase, it is advisable to subclass link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper] and/or link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html[TableReducer].
See the do-nothing pass-through classes link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html[IdentityTableMapper] and link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html[IdentityTableReducer] for basic usage.
For a more involved example, see link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] or review the [code]+org.apache.hadoop.hbase.mapreduce.TestTableMapReduce+ unit test.
For a more involved example, see link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] or review the `org.apache.hadoop.hbase.mapreduce.TestTableMapReduce` unit test.
If you run MapReduce jobs that use HBase as a source or sink, you need to specify the source and sink table and column names in your configuration.
When you read from HBase, the [code]+TableInputFormat+ requests the list of regions from HBase and makes a map, which is either a [code]+map-per-region+ or [code]+mapreduce.job.maps+ map, whichever is smaller.
If your job only has two maps, raise [code]+mapreduce.job.maps+ to a number greater than the number of regions.
When you read from HBase, the `TableInputFormat` requests the list of regions from HBase and makes a map task per region, or `mapreduce.job.maps` map tasks, whichever is smaller.
If your job only has two maps, raise `mapreduce.job.maps` to a number greater than the number of regions.
Maps will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per node.
When writing to HBase, it may make sense to avoid the Reduce step and write back into HBase from within your map.
This approach works when your job does not need the sort and collation that MapReduce does on the map-emitted data.
@ -226,15 +229,16 @@ For more on how this mechanism works, see <<arch.bulk.load,arch.bulk.load>>.
== RowCounter Example
The included link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job uses [code]+TableInputFormat+ and does a count of all rows in the specified table.
The included link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job uses `TableInputFormat` and does a count of all rows in the specified table.
To run it, use the following command:
[source,bash]
----
$ ./bin/hadoop jar hbase-X.X.X.jar
----
This will invoke the HBase MapReduce Driver class.
Select [literal]+rowcounter+ from the choice of jobs offered.
Select `rowcounter` from the choice of jobs offered.
This will print rowcounter usage advice to standard output.
Specify the tablename, column to count, and output directory.
If you have classpath errors, see <<hbase.mapreduce.classpath,hbase.mapreduce.classpath>>.
@ -251,7 +255,7 @@ Thus, if there are 100 regions in the table, there will be 100 map-tasks for the
[[splitter.custom]]
=== Custom Splitters
For those interested in implementing custom splitters, see the method [code]+getSplits+ in link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html[TableInputFormatBase].
For those interested in implementing custom splitters, see the method `getSplits` in link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html[TableInputFormatBase].
That is where the logic for map-task assignment resides.
[[mapreduce.example]]
@ -266,7 +270,6 @@ There job would be defined as follows...
[source,java]
----
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
@ -296,7 +299,6 @@ if (!b) {
[source,java]
----
public static class MyMapper extends TableMapper<Text, Text> {
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
@ -313,7 +315,6 @@ This example will simply copy data from one table to another.
[source,java]
----
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
@ -342,15 +343,14 @@ if (!b) {
}
----
An explanation is required of what [class]+TableMapReduceUtil+ is doing, especially with the reducer. link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat] is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to [class]+ImmutableBytesWritable+ and reducer value to [class]+Writable+.
These could be set by the programmer on the job and conf, but [class]+TableMapReduceUtil+ tries to make things easier.
An explanation is required of what `TableMapReduceUtil` is doing, especially with the reducer. link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat] is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to `ImmutableBytesWritable` and reducer value to `Writable`.
These could be set by the programmer on the job and conf, but `TableMapReduceUtil` tries to make things easier.
The following is the example mapper, which will create a [class]+Put+ and matching the input [class]+Result+ and emit it.
The following is the example mapper, which will create a `Put` matching the input `Result` and emit it.
Note: this is what the CopyTable utility does.
[source,java]
----
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
@ -368,14 +368,14 @@ public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
}
----
There isn't actually a reducer step, so [class]+TableOutputFormat+ takes care of sending the [class]+Put+ to the target table.
There isn't actually a reducer step, so `TableOutputFormat` takes care of sending the `Put` to the target table.
This is just an example, developers could choose not to use [class]+TableOutputFormat+ and connect to the target table themselves.
This is just an example, developers could choose not to use `TableOutputFormat` and connect to the target table themselves.
[[mapreduce.example.readwrite.multi]]
=== HBase MapReduce Read/Write Example With Multi-Table Output
TODO: example for [class]+MultiTableOutputFormat+.
TODO: example for `MultiTableOutputFormat`.
[[mapreduce.example.summary]]
=== HBase MapReduce Summary to HBase Example
@ -414,7 +414,7 @@ if (!b) {
----
In this example mapper a column with a String-value is chosen as the value to summarize upon.
This value is used as the key to emit from the mapper, and an [class]+IntWritable+ represents an instance counter.
This value is used as the key to emit from the mapper, and an `IntWritable` represents an instance counter.
[source,java]
----
@ -434,7 +434,7 @@ public static class MyMapper extends TableMapper<Text, IntWritable> {
}
----
In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a [class]+Put+.
In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a `Put`.
[source,java]
----
@ -513,9 +513,8 @@ public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritab
It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
An HBase target table would need to exist for the job summary.
The Table method [code]+incrementColumnValue+ would be used to atomically increment values.
From a performance perspective, it might make sense to keep a Map of values with their values to be incremeneted for each map-task, and make one update per key at during the [code]+
cleanup+ method of the mapper.
The Table method `incrementColumnValue` would be used to atomically increment values.
From a performance perspective, it might make sense to keep a Map of keys to the values to be incremented for each map-task, and make one update per key during the `cleanup` method of the mapper.
However, your mileage may vary depending on the number of rows to be processed and unique keys.
In the end, the summary results are in HBase.
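A minimal sketch of that pattern follows. The summary table name, column family, and qualifier are illustrative assumptions, not part of any HBase API, and the usual imports (from `org.apache.hadoop.hbase.client`, `org.apache.hadoop.hbase.mapreduce`, and `org.apache.hadoop.hbase.util`) are omitted for brevity, as in the other examples in this chapter.

[source,java]
----
public static class MyIncrementingMapper extends TableMapper<ImmutableBytesWritable, Put> {

  // Per-map-task buffer of key -> count, flushed once in cleanup()
  private final Map<String, Long> counts = new HashMap<>();

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    byte[] cell = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1"));
    if (cell != null) {
      counts.merge(Bytes.toString(cell), 1L, Long::sum);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // One atomic increment per distinct key, rather than one per input row
    try (Connection connection = ConnectionFactory.createConnection(context.getConfiguration());
         Table summary = connection.getTable(TableName.valueOf("summaryTable"))) {
      for (Map.Entry<String, Long> entry : counts.entrySet()) {
        summary.incrementColumnValue(Bytes.toBytes(entry.getKey()),
            Bytes.toBytes("cf"), Bytes.toBytes("count"), entry.getValue());
      }
    }
  }
}
----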
@ -525,7 +524,7 @@ In the end, the summary results are in HBase.
Sometimes it is more appropriate to generate summaries to an RDBMS.
For these cases, it is possible to generate summaries directly to an RDBMS via a custom reducer.
The [code]+setup+ method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection.
The `setup` method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection.
It is critical to understand that number of reducers for the job affects the summarization implementation, and you'll have to design this into your reducer.
Specifically, whether it is designed to run as a singleton (one reducer) or multiple reducers.
@ -534,7 +533,6 @@ Recognize that the more reducers that are assigned to the job, the more simultan
[source,java]
----
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private Connection c = null;
@ -34,9 +34,9 @@ The subject of operations is related to the topics of <<trouble,trouble>>, <<per
== HBase Tools and Utilities
HBase provides several tools for administration, analysis, and debugging of your cluster.
The entry-point to most of these tools is the [path]_bin/hbase_ command, though some tools are available in the [path]_dev-support/_ directory.
The entry-point to most of these tools is the _bin/hbase_ command, though some tools are available in the _dev-support/_ directory.
To see usage instructions for [path]_bin/hbase_ command, run it with no arguments, or with the +-h+ argument.
To see usage instructions for _bin/hbase_ command, run it with no arguments, or with the +-h+ argument.
These are the usage instructions for HBase 0.98.x.
Some commands, such as +version+, +pe+, +ltt+, +clean+, are not available in previous versions.
@ -70,14 +70,14 @@ Some commands take arguments. Pass no args or -h for usage.
CLASSNAME Run the class named CLASSNAME
----
Some of the tools and utilities below are Java classes which are passed directly to the [path]_bin/hbase_ command, as referred to in the last line of the usage instructions.
Some of the tools and utilities below are Java classes which are passed directly to the _bin/hbase_ command, as referred to in the last line of the usage instructions.
Others, such as +hbase shell+ (<<shell,shell>>), +hbase upgrade+ (<<upgrading,upgrading>>), and +hbase
thrift+ (<<thrift,thrift>>), are documented elsewhere in this guide.
=== Canary
The Canary class can help users canary-test the HBase cluster status, at the granularity of every column family of every region, or of every RegionServer.
To see the usage, use the [literal]+--help+ parameter.
To see the usage, use the `--help` parameter.
----
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.tool.Canary -help
@ -197,17 +197,17 @@ $ ${HBASE_HOME}/bin/hbase orghapache.hadoop.hbase.tool.Canary -t 600000
==== Running Canary in a Kerberos-enabled Cluster
To run Canary in a Kerberos-enabled cluster, configure the following two properties in [path]_hbase-site.xml_:
To run Canary in a Kerberos-enabled cluster, configure the following two properties in _hbase-site.xml_:
* [code]+hbase.client.keytab.file+
* [code]+hbase.client.kerberos.principal+
* `hbase.client.keytab.file`
* `hbase.client.kerberos.principal`
Kerberos credentials are refreshed every 30 seconds when Canary runs in daemon mode.
To configure the DNS interface for the client, configure the following optional properties in [path]_hbase-site.xml_.
To configure the DNS interface for the client, configure the following optional properties in _hbase-site.xml_.
* [code]+hbase.client.dns.interface+
* [code]+hbase.client.dns.nameserver+
* `hbase.client.dns.interface`
* `hbase.client.dns.nameserver`
.Canary in a Kerberos-Enabled Cluster
====
@ -244,10 +244,10 @@ See link:[HBASE-7351 Periodic health check script] for configurations and detail
=== Driver
Several frequently-accessed utilities are provided as [code]+Driver+ classes, and executed by the [path]_bin/hbase_ command.
Several frequently-accessed utilities are provided as `Driver` classes, and executed by the _bin/hbase_ command.
These utilities represent MapReduce jobs which run on your cluster.
They are run in the following way, replacing [replaceable]_UtilityName_ with the utility you want to run.
This command assumes you have set the environment variable [literal]+HBASE_HOME+ to the directory where HBase is unpacked on your server.
This command assumes you have set the environment variable `HBASE_HOME` to the directory where HBase is unpacked on your server.
----
@ -299,10 +299,10 @@ See <<hfile_tool,hfile tool>>.
=== WAL Tools
[[hlog_tool]]
==== [class]+FSHLog+ tool
==== `FSHLog` tool
The main method on [class]+FSHLog+ offers manual split and dump facilities.
Pass it WALs or the product of a split, the content of the [path]_recovered.edits_.
The main method on `FSHLog` offers manual split and dump facilities.
Pass it WALs or the product of a split, the content of the _recovered.edits_ directory.
You can get a textual dump of a WAL file content by doing the following:
@ -311,7 +311,7 @@ You can get a textual dump of a WAL file content by doing the following:
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.FSHLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
----
The return code will be non-zero if issues with the file so you can test wholesomeness of file by redirecting [var]+STDOUT+ to [code]+/dev/null+ and testing the program return.
The return code will be non-zero if there are issues with the file, so you can test the wholesomeness of the file by redirecting `STDOUT` to `/dev/null` and testing the program return code.
Similarly you can force a split of a log file directory by doing:
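A sketch of that invocation, following the same pattern (and example HDFS path) as the dump command above:

----
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.FSHLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/
----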
@ -332,7 +332,7 @@ You can invoke it via the hbase cli with the 'wal' command.
.WAL Printing in older versions of HBase
[NOTE]
====
Prior to version 2.0, the WAL Pretty Printer was called the [class]+HLogPrettyPrinter+, after an internal name for HBase's write ahead log.
Prior to version 2.0, the WAL Pretty Printer was called the `HLogPrettyPrinter`, after an internal name for HBase's write ahead log.
In those versions, you can print the contents of a WAL using the same configuration as above, but with the 'hlog' command.
----
@ -394,13 +394,13 @@ For performance consider the following general options:
.Scanner Caching
[NOTE]
====
Caching for the input Scan is configured via [code]+hbase.client.scanner.caching+ in the job configuration.
Caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
====
.Versions
[NOTE]
====
By default, CopyTable utility only copies the latest version of row cells unless [code]+--versions=n+ is explicitly specified in the command.
By default, the CopyTable utility only copies the latest version of row cells unless `--versions=n` is explicitly specified in the command.
====
See Jonathan Hsieh's link:http://www.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online
@ -415,7 +415,7 @@ Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
----
Note: caching for the input Scan is configured via [code]+hbase.client.scanner.caching+ in the job configuration.
Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
=== Import
@ -435,7 +435,7 @@ $ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import
=== ImportTsv
ImportTsv is a utility that will load data in TSV format into HBase.
It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the [code]+completebulkload+.
It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the `completebulkload`.
To load data via Puts (i.e., non-bulk loading):
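A sketch of that invocation; the column list, table name, and input directory are placeholders:

[source,bash]
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 <tablename> <hdfs-inputdir>
----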
@ -525,7 +525,7 @@ For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,
=== CompleteBulkLoad
The [code]+completebulkload+ utility will move generated StoreFiles into an HBase table.
The `completebulkload` utility will move generated StoreFiles into an HBase table.
This utility is often used in conjunction with output from <<importtsv,importtsv>>.
There are two ways to invoke this utility, with explicit classname and via the driver:
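A sketch of the two invocation styles (the JAR name and paths are placeholders):

[source,bash]
----
# With the explicit classname
$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>

# Via the driver
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
----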
@ -570,7 +570,7 @@ $ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,
----
WALPlayer, by default, runs as a mapreduce job.
To NOT run WALPlayer as a mapreduce job on your cluster, force it to run all in the local process by adding the flags [code]+-Dmapreduce.jobtracker.address=local+ on the command line.
To NOT run WALPlayer as a mapreduce job on your cluster, force it to run all in the local process by adding the flag `-Dmapreduce.jobtracker.address=local` on the command line.
[[rowcounter]]
=== RowCounter and CellCounter
@ -583,7 +583,7 @@ It will run the mapreduce all in a single process but it will run faster if you
$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename> [<column1> <column2>...]
----
Note: caching for the input Scan is configured via [code]+hbase.client.scanner.caching+ in the job configuration.
Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
HBase ships another diagnostic mapreduce job called link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter].
Like RowCounter, it gathers more fine-grained statistics about your table.
@ -598,13 +598,13 @@ The statistics gathered by RowCounter are more fine-grained and include:
The program allows you to limit the scope of the run.
Provide a row regex or prefix to limit the rows to analyze.
Use [code]+hbase.mapreduce.scan.column.family+ to specify scanning a single column family.
Use `hbase.mapreduce.scan.column.family` to specify scanning a single column family.
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CellCounter <tablename> <outputDir> [regex or prefix]
----
Note: just like RowCounter, caching for the input Scan is configured via [code]+hbase.client.scanner.caching+ in the job configuration.
Note: just like RowCounter, caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
=== mlockall
@ -639,7 +639,7 @@ Options:
=== +hbase pe+
The +hbase pe+ command is a shortcut provided to run the [code]+org.apache.hadoop.hbase.PerformanceEvaluation+ tool, which is used for testing.
The +hbase pe+ command is a shortcut provided to run the `org.apache.hadoop.hbase.PerformanceEvaluation` tool, which is used for testing.
The +hbase pe+ command was introduced in HBase 0.98.4.
The PerformanceEvaluation tool accepts many different options and commands.
@ -651,7 +651,7 @@ The PerformanceEvaluation tool has received many updates in recent HBase release
=== +hbase ltt+
The +hbase ltt+ command is a shortcut provided to run the [code]+org.apache.hadoop.hbase.util.LoadTestTool+ utility, which is used for testing.
The +hbase ltt+ command is a shortcut provided to run the `org.apache.hadoop.hbase.util.LoadTestTool` utility, which is used for testing.
The +hbase ltt+ command was introduced in HBase 0.98.4.
You must specify either +-write+ or +-update-read+ as the first option.
@ -721,8 +721,8 @@ See <<lb,lb>> below.
.Kill Node Tool
[NOTE]
====
In hbase-2.0, in the bin directory, we added a script named [path]_considerAsDead.sh_ that can be used to kill a regionserver.
Hardware issues could be detected by specialized monitoring tools before the zookeeper timeout has expired. [path]_considerAsDead.sh_ is a simple function to mark a RegionServer as dead.
In hbase-2.0, in the bin directory, we added a script named _considerAsDead.sh_ that can be used to kill a regionserver.
Hardware issues could be detected by specialized monitoring tools before the zookeeper timeout has expired. _considerAsDead.sh_ is a simple function to mark a RegionServer as dead.
It deletes all the znodes of the server, starting the recovery process.
Plug in the script into your monitoring/fault detection tools to initiate faster failover.
Be careful how you use this disruptive tool.
Regions are closed in order.
If there are many regions on the server, the first region to close may not be back online until all regions have closed and the master notices that the RegionServer's znode is gone.
In Apache HBase 0.90.2, we added a facility for having a node gradually shed its load and then shut itself down.
Apache HBase 0.90.2 added the _graceful_stop.sh_ script.
Here is its usage:
----
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift]
----
To decommission a loaded RegionServer, run the following: +$ ./bin/graceful_stop.sh HOSTNAME+ where `HOSTNAME` is the host carrying the RegionServer you would decommission.
.On `HOSTNAME`
[NOTE]
====
The `HOSTNAME` passed to _graceful_stop.sh_ must match the hostname that hbase is using to identify RegionServers.
Check the list of RegionServers in the master UI for how HBase is referring to servers.
It is usually the hostname, but it can also be an FQDN.
Whatever HBase is using, this is what you should pass to the _graceful_stop.sh_ decommission script.
If you pass an IP address, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks whether the server is currently running; the graceful unloading of regions will not run.
====
The _graceful_stop.sh_ script will move the regions off the decommissioned RegionServer one at a time to minimize region churn.
It verifies that the region has been deployed in the new location before it moves the next region, and so on, until the decommissioned server is carrying zero regions.
At this point, the _graceful_stop.sh_ script tells the RegionServer to +stop+.
The master will at this point notice that the RegionServer is gone, but all regions will have already been redeployed, and because the RegionServer went down cleanly, there will be no WAL logs to split.
.Load Balancer
If you have a large cluster, you may want to decommission more than one machine at a time by gracefully stopping multiple RegionServers concurrently.
To gracefully drain multiple regionservers at the same time, RegionServers can be put into a "draining" state.
This is done by marking a RegionServer as a draining node by creating an entry in ZooKeeper under the _hbase_root/draining_ znode.
This znode has format `name,port,startcode` just like the regionserver entries under _hbase_root/rs_ znode.
Without this facility, decommissioning multiple nodes may be non-optimal because regions that are being drained from one region server may be moved to other regionservers that are also draining.
Marking RegionServers to be in the draining state prevents this from happening.
It is good having <<dfs.datanode.failed.volumes.tolerated,dfs.datanode.failed.volumes.tolerated>> set if you have a decent number of disks per machine for the case where a disk plain dies.
But usually disks do the "John Wayne" -- i.e.
take a while to go down spewing errors in _dmesg_ -- or for some reason, run much slower than their companions.
In this case you want to decommission the disk.
You have two options.
You can link:http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F[decommission
==== Using the +rolling-restart.sh+ Script
HBase ships with a script, _bin/rolling-restart.sh_, that allows you to perform rolling restarts on the entire cluster, the master only, or the RegionServers only.
The script is provided as a template for your own script, and is not explicitly tested.
It requires password-less SSH login to be configured and assumes that you have deployed using a tarball.
The script requires you to set some environment variables before running it.
Examine the script and modify it to suit your needs.
._rolling-restart.sh_ General Usage
====
----
Usage: rolling-restart.sh [--config <hbase-confdir>] [--rs-only] [--master-only]
----
====
Rolling Restart on RegionServers Only::
To perform a rolling restart on the RegionServers only, use the `--rs-only` option.
This might be necessary if you need to reboot the individual RegionServer or if you make a configuration change that only affects RegionServers and not the other HBase processes.
Rolling Restart on Masters Only::
To perform a rolling restart on the active and backup Masters, use the `--master-only` option.
You might use this if you know that your configuration change only affects the Master and not the RegionServers, or if you need to restart the server where the active Master is running.
Graceful Restart::
If you specify the `--graceful` option, RegionServers are restarted using the _bin/graceful_stop.sh_ script, which moves regions off a RegionServer before restarting it.
This is safer, but can delay the restart.
Limiting the Number of Threads::
To limit the rolling restart to using only a specific number of threads, use the `--maxthreads` option.
[[rolling.restart.manual]]
==== Manual Rolling Restart
It disables the load balancer before moving the regions.
----
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &;
----
Monitor the output of the _/tmp/log.txt_ file to follow the progress of the script.
==== Logic for Crafting Your Own Rolling Restart Script
Adding a new regionserver in HBase is essentially free; you simply start it like this: +$ ./bin/hbase-daemon.sh start regionserver+ and it will register itself with the master.
Ideally you also started a DataNode on the same machine so that the RS can eventually start to have local files.
If you rely on ssh to start your daemons, don't forget to add the new hostname in _conf/regionservers_ on the master.
At this point the region server isn't serving data because no regions have moved to it yet.
If the balancer is enabled, it will start moving regions to the new RS.
For HBase 0.95 and newer, HBase ships with a default metrics configuration, or _sink_.
This includes a wide variety of individual metrics, and emits them every 10 seconds by default.
To configure metrics for a given region server, edit the _conf/hadoop-metrics2-hbase.properties_ file.
Restart the region server for the changes to take effect.
To change the sampling rate for the default sink, edit the line beginning with `*.period`.
To filter which metrics are emitted or to extend the metrics framework, see link:http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html[the Hadoop Metrics2 package documentation].
.HBase Metrics and Ganglia
=== Disabling Metrics
To disable metrics for a region server, edit the _conf/hadoop-metrics2-hbase.properties_ file and comment out any uncommented lines.
Restart the region server for the changes to take effect.
[[discovering.available.metrics]]
Different metrics are exposed for the Master process and each region server process.
.Procedure: Access a JSON Output of Available Metrics
. After starting HBase, access the region server's web UI, at `http://REGIONSERVER_HOSTNAME:60030` by default (or port 16030 in HBase 1.0+).
. Click the [label]#Metrics Dump# link near the top.
The metrics for the region server are presented as a dump of the JMX bean in JSON format.
This will dump out all metrics names and their values.
To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes `http://REGIONSERVER_HOSTNAME:60030/jmx?description=true`.
Not all beans and attributes have descriptions.
. To view metrics for the Master, connect to the Master's web UI instead (defaults to `http://localhost:60010` or port 16010 in HBase 1.0+) and click its [label]#Metrics
Dump# link.
To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes `http://localhost:60010/jmx?description=true`.
Not all beans and attributes have descriptions.
=== Units of Measure for Metrics
Different metrics are expressed in different units, as appropriate.
Often, the unit of measure is in the name (as in the metric `shippedKBs`). Otherwise, use the following guidelines.
When in doubt, you may need to examine the source for a given metric.
* Metrics that refer to a point in time are usually expressed as a timestamp.
* Metrics that refer to an age (such as `ageOfLastShippedOp`) are usually expressed in milliseconds.
* Metrics that refer to memory sizes are in bytes.
* Sizes of queues (such as `sizeOfLogQueue`) are expressed as the number of items in the queue.
Determine the size by multiplying by the block size (default is 64 MB in HDFS).
* Metrics that refer to things like the number of a given type of operations (such as `logEditsRead`) are expressed as an integer.
[[master_metrics]]
=== Most Important Master Metrics
There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.
* `hbase.ipc.warn.response.time` Maximum number of milliseconds that a query can be run without being logged.
Defaults to 10000, or 10 seconds.
Can be set to -1 to disable logging by time.
* `hbase.ipc.warn.response.size` Maximum byte size of response that a query can return without being logged.
Defaults to 100 megabytes.
Can be set to -1 to disable logging by size.
The slow query log exposes two metrics to JMX.
* `hadoop.regionserver_rpc_slowResponse` a global metric reflecting the durations of all responses that triggered logging.
* `hadoop.regionserver_rpc_methodName.aboveOneSec` A metric reflecting the durations of all responses that lasted for more than one second.
==== Output
. Configure and start the source and destination clusters.
Create tables with the same names and column families on both the source and destination clusters, so that the destination cluster knows where to store data it will receive.
All hosts in the source and destination clusters should be reachable to each other.
. On the source cluster, enable replication by setting `hbase.replication` to `true` in _hbase-site.xml_.
. On the source cluster, in HBase Shell, add the destination cluster as a peer, using the `add_peer` command.
The syntax is as follows:
+
----
hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
----
+
If both clusters use the same ZooKeeper cluster, you must use a different `zookeeper.znode.parent`, because they cannot write in the same folder.
. On the source cluster, configure each column family to be replicated by setting its REPLICATION_SCOPE to 1, using the HBase Shell `alter` command or the admin API, as sketched below.
+
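As an illustrative alternative to the HBase Shell `alter` command, roughly the same change can be made through the Java admin API; the table name `example_table`, the column family `cf`, and the disable/modify/enable cycle below are placeholders and a conservative choice, not the only way to do it.
+
[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class EnableFamilyReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    TableName table = TableName.valueOf("example_table");
    HColumnDescriptor cf = new HColumnDescriptor("cf");
    // REPLICATION_SCOPE 1 marks the family as globally scoped for replication.
    cf.setScope(1);
    admin.disableTable(table);
    admin.modifyColumn(table, cf);
    admin.enableTable(table);
    admin.close();
  }
}
----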
+
When replication begins, the source cluster's RegionServer logs contain messages such as the following:
+
----
Getting 1 rs from peer cluster # 0
Choosing peer 10.10.1.49:62020
----
. To verify the validity of replicated data, you can use the included `VerifyReplication` MapReduce job on the source cluster, providing it with the ID of the replication peer and table name to verify.
Other options are possible, such as a time range or specific families to verify.
+
The command has the following form:
+
----
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication [--starttime=timestamp1] [--stoptime=timestamp] [--families=comma separated list of families] <peerId> <tablename>
----
+
The `VerifyReplication` command prints out `GOODROWS` and `BADROWS` counters to indicate rows that did and did not replicate correctly.
=== Detailed Information About Cluster Replication
. If the changed cell corresponds to a column family that is scoped for replication, the edit is added to the queue for replication.
. In a separate thread, the edit is read from the log, as part of a batch process.
Only the KeyValues that are eligible for replication are kept.
Replicable KeyValues are part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as `hbase:meta`, did not originate from the target slave cluster, and have not already been consumed by the target slave cluster.
. The edit is tagged with the master's UUID and added to a buffer.
When the buffer is filled, or the reader reaches the end of the file, the buffer is sent to a random region server on the slave cluster.
. The region server reads the edits sequentially and separates them into buffers, one buffer per table.
This responsibility must be failed over like all other region server functions should a process or node crash.
The following configuration settings are recommended for maintaining an even distribution of replication activity over the remaining live servers in the source cluster:
* Set `replication.source.maxretriesmultiplier` to `300`.
* Set `replication.source.sleepforretries` to `1` (1 second). This value, combined with the value of `replication.source.maxretriesmultiplier`, causes the retry cycle to last about 5 minutes.
* Set `replication.sleep.before.failover` to `30000` (30 seconds) in the source cluster site configuration.
.Preserving Tags During Replication
By default, the codec used for replication between clusters strips tags, such as cell-level ACLs, from cells.
To prevent the tags from being stripped, you can use a different codec which does not strip them.
Configure `hbase.replication.rpc.codec` to use `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`, on both the source and sink RegionServers involved in the replication.
This option was introduced in link:https://issues.apache.org/jira/browse/HBASE-10322[HBASE-10322].
==== Replication Internals
Replication State in ZooKeeper::
HBase replication maintains its state in ZooKeeper.
By default, the state is contained in the base node _/hbase/replication_.
This node contains two child nodes, the `Peers` znode and the `RS` znode.
The `Peers` Znode::
The `peers` znode is stored in _/hbase/replication/peers_ by default.
It consists of a list of all peer replication clusters, along with the status of each of them.
The value of each peer is its cluster key, which is provided in the HBase Shell.
The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.
The `RS` Znode::
The `rs` znode contains a list of WAL logs which need to be replicated.
This list is divided into a set of queues organized by region server and the peer cluster the region server is shipping the logs to.
The rs znode has one child znode for each region server in the cluster.
The child znode name is the region server's hostname, client port, and start code.
==== Choosing Region Servers to Replicate To
When a master cluster region server initiates a replication source to a slave cluster, it first connects to the slave's ZooKeeper ensemble using the provided cluster key. It then scans the _rs/_ directory to discover all the available sinks (region servers that are accepting incoming streams of edits to replicate) and randomly chooses a subset of them using a configured ratio which has a default value of 10%. For example, if a slave cluster has 150 machines, 15 will be chosen as potential recipients for edits that this master cluster region server sends.
Because this selection is performed by each master region server, the probability that all slave region servers are used is very high, and this method works for clusters of any size.
For example, a master cluster of 10 machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the master cluster region servers to choose one machine each at random.
A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the slave cluster by each of the master cluster's region servers.
This watch is used to monitor changes in the composition of the slave cluster.
When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
The queue items are discarded when the replication thread cannot read more entries from a file (because it reached the end of the last block) and there are other files in the queue.
This means that if a source is up to date and replicates from the log that the region server writes to, reading up to the "end" of the current file will not delete the item in the queue.
A log can be archived if it is no longer used or if the number of logs exceeds `hbase.regionserver.maxlogs` because the insertion rate is faster than regions are flushed.
When a log is archived, the source threads are notified that the path for that log changed.
If a particular source has already finished with an archived log, it will just ignore the message.
If the log is in the queue, the path will be updated in memory.
When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
After queues are all transferred, they are deleted from the old location.
The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
The main difference is that those queues will never receive new data, since they do not belong to their new region server.
When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
The region servers' znodes all contain a `peers` znode which contains a single queue.
The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
The following metrics are exposed at the global region server level and (since HBase 0.95) at the peer level:
`source.sizeOfLogQueue`::
number of WALs to process (excludes the one which is being processed) at the Replication source
`source.shippedOps`::
number of mutations shipped
`source.logEditsRead`::
number of mutations read from WALs at the replication source
`source.ageOfLastShippedOp`::
age of last batch that was shipped by the replication source
=== Replication Configuration Options
[[ops.snapshots.configuration]]
=== Configuration
To turn on snapshot support, just set the `hbase.snapshot.enabled` property to `true`.
(Snapshots are enabled by default in 0.95+ and off by default in 0.94.6+)
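Snapshots are then taken and restored through the admin API or the equivalent HBase Shell commands; the sketch below is illustrative only, and the snapshot and table names are placeholders.

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Take a named snapshot of an existing table.
    admin.snapshot("MySnapshot", TableName.valueOf("mytable"));
    // A snapshot can later be materialized as a brand new table.
    admin.cloneSnapshot("MySnapshot", TableName.valueOf("mytable_restored"));
    admin.close();
  }
}
----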
.Limiting Bandwidth Consumption
You can limit the bandwidth consumption when exporting a snapshot, by specifying the `-bandwidth` parameter, which expects an integer representing megabytes per second.
For example, you could limit the export described above to 200 MB/sec by adding `-bandwidth 200` to the `ExportSnapshot` command.
The number of regions cannot be configured directly (unless you go for fully <<disable.splitting,disable.splitting>>); adjust the region size to achieve the target region size given table size.
When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor], as well as shell commands.
These settings will override the ones in `hbase-site.xml`.
That is useful if your tables have different workloads/use cases.
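As a rough sketch of such a per-table override (the table name, the 20 GB region size, and the compaction minimum below are arbitrary examples, and the `setConfiguration` call assumes a release that supports per-table configuration overrides):

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PerTableRegionSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    TableName name = TableName.valueOf("big_table");
    HTableDescriptor desc = admin.getTableDescriptor(name);
    // Aim for larger regions (here 20 GB) on this table only.
    desc.setMaxFileSize(20L * 1024 * 1024 * 1024);
    // Per-table override of a site-wide compaction setting.
    desc.setConfiguration("hbase.hstore.compaction.min", "5");
    admin.disableTable(name);
    admin.modifyTable(name, desc);
    admin.enableTable(name);
    admin.close();
  }
}
----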
Also note that in the discussion of region sizes here, HDFS replication factor is not (and should not be) taken into account.
When provisioning for large data sizes, however, it's good to keep in mind that compactions can affect write throughput.
Thus, for write-intensive workloads, you may opt for less frequent compactions and more store files per regions.
Minimum number of files for compactions (`hbase.hstore.compaction.min`) can be set to a higher value; <<hbase.hstore.blockingstorefiles,hbase.hstore.blockingStoreFiles>> should also be increased, as more files might accumulate in that case.
You may also consider manually managing compactions: <<managed.compactions,managed.compactions>>
[[ops.capacity.config.presplit]]
In his presentation, link:http://www.slideshare.net/cloudera/hbase-hug-presentation[Avoiding Full GCs
with MemStore-Local Allocation Buffers], Todd Lipcon describes two cases of stop-the-world garbage collections common in HBase, especially during loading: CMS failure modes and old generation heap fragmentation brought on by CMS.
To address the first, start the CMS earlier than default by adding `-XX:CMSInitiatingOccupancyFraction` and setting it down from defaults.
Start at 60 or 70 percent (the lower you bring down the threshold, the more GCing is done and the more CPU is used). To address the second fragmentation issue, Todd added an experimental facility,
(((MSLAB))), which must be explicitly enabled in Apache HBase 0.90.x (it is on by default in Apache HBase 0.92.x). Set `hbase.hregion.memstore.mslab.enabled` to `true` in your `Configuration`.
See the cited slides for background and detail.
The latest JVMs do better with regard to fragmentation, so make sure you are running a recent release.
Read down in the message, link:http://osdir.com/ml/hotspot-gc-use/2011-11/msg00002.html[Identifying
If you have a write-heavy workload, check out link:https://issues.apache.org/jira/browse/HBASE-8163[HBASE-8163
MemStoreChunkPool: An improvement for JAVA GC when using MSLAB].
It describes configurations to lower the amount of young GC during write-heavy loadings.
If you do not have HBASE-8163 installed, and you are trying to improve your young GC times, one trick to consider -- courtesy of our Liang Xie -- is to set the GC config `-XX:PretenureSizeThreshold` in _hbase-env.sh_ to be just smaller than the size of `hbase.hregion.memstore.mslab.chunksize` so MSLAB allocations happen in the tenured space directly rather than first in the young gen.
You'd do this because these MSLAB allocations are likely going to make it to the old gen anyway, and rather than pay the price of copies between s0 and s1 in eden space followed by the copy up from young to old gen after the MSLABs have achieved sufficient tenure, you save a bit of YGC churn and allocate in the old gen directly.
For more information about GC logs, see <<trouble.log.gc,trouble.log.gc>>.
For larger systems, managing compactions and splits may be something you want to consider.
[[perf.handlers]]
=== `hbase.regionserver.handler.count`
See <<hbase.regionserver.handler.count,hbase.regionserver.handler.count>>.
[[perf.hfile.block.cache.size]]
=== `hfile.block.cache.size`
See <<hfile.block.cache.size,hfile.block.cache.size>>.
A memory setting for the RegionServer process.
See the API documentation for link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig].
[[perf.rs.memstore.size]]
=== `hbase.regionserver.global.memstore.size`
See <<hbase.regionserver.global.memstore.size,hbase.regionserver.global.memstore.size>>.
This memory setting is often adjusted for the RegionServer process depending on needs.
[[perf.rs.memstore.size.lower.limit]]
=== `hbase.regionserver.global.memstore.size.lower.limit`
See <<hbase.regionserver.global.memstore.size.lower.limit,hbase.regionserver.global.memstore.size.lower.limit>>.
This memory setting is often adjusted for the RegionServer process depending on needs.
[[perf.hstore.blockingstorefiles]]
=== `hbase.hstore.blockingStoreFiles`
See <<hbase.hstore.blockingstorefiles,hbase.hstore.blockingStoreFiles>>.
If there is blocking in the RegionServer logs, increasing this can help.
[[perf.hregion.memstore.block.multiplier]]
=== `hbase.hregion.memstore.block.multiplier`
See <<hbase.hregion.memstore.block.multiplier,hbase.hregion.memstore.block.multiplier>>.
If there is enough RAM, increasing this can help.
[[hbase.regionserver.checksum.verify.performance]]
=== `hbase.regionserver.checksum.verify`
Have HBase write the checksum into the datablock and save having to do the checksum seek whenever you read.
See <<hbase.regionserver.checksum.verify,hbase.regionserver.checksum.verify>>, <<hbase.hstore.bytes.per.checksum,hbase.hstore.bytes.per.checksum>> and <<hbase.hstore.checksum.algorithm,hbase.hstore.checksum.algorithm>>. For more information, see the release note on link:https://issues.apache.org/jira/browse/HBASE-5074[HBASE-5074 support checksums in HBase block cache].
=== Tuning `callQueue` Options
link:https://issues.apache.org/jira/browse/HBASE-11355[HBASE-11355] introduces several callQueue tuning mechanisms which can increase performance.
See the JIRA for some benchmarking information.
* To increase the number of callqueues, set +hbase.ipc.server.num.callqueue+ to a value greater than `1`.
* To split the callqueue into separate read and write queues, set `hbase.ipc.server.callqueue.read.ratio` to a value between `0` and `1`.
This factor weights the queues toward writes (if below .5) or reads (if above .5). Another way to say this is that the factor determines what percentage of the split queues are used for reads.
The following examples illustrate some of the possibilities.
Note that you always have at least one write queue, no matter what setting you use.
+
* The default value of `0` does not split the queue.
* A value of `.3` uses 30% of the queues for reading and 70% for writing.
Given a value of `10` for +hbase.ipc.server.num.callqueue+, 3 queues would be used for reads and 7 for writes.
* A value of `.5` uses the same number of read queues and write queues.
Given a value of `10` for +hbase.ipc.server.num.callqueue+, 5 queues would be used for reads and 5 for writes.
* A value of `.6` uses 60% of the queues for reading and 40% for writing.
Given a value of `10` for +hbase.ipc.server.num.callqueue+, 7 queues would be used for reads and 3 for writes.
* A value of `1.0` uses one queue to process write requests, and all other queues process read requests.
A value higher than `1.0` has the same effect as a value of `1.0`.
Given a value of `10` for +hbase.ipc.server.num.callqueue+, 9 queues would be used for reads and 1 for writes.
* You can also split the read queues so that separate queues are used for short reads (from Get operations) and long reads (from Scan operations), by setting the +hbase.ipc.server.callqueue.scan.ratio+ option.
This option is a factor between 0 and 1, which determines the ratio of read queues used for Gets and Scans.
More queues are used for Gets if the value is below `.5` and more are used for scans if the value is above `.5`.
No matter what setting you use, at least one read queue is used for Get operations.
+
* A value of `0` does not split the read queue.
* A value of `.3` uses 70% of the read queues for Gets and 30% for Scans.
Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 7 would be used for Gets and 3 for Scans.
* A value of `.5` uses half the read queues for Gets and half for Scans.
Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 5 would be used for Gets and 5 for Scans.
* A value of `.6` uses 30% of the read queues for Gets and 70% for Scans.
Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 3 would be used for Gets and 7 for Scans.
* A value of `1.0` uses all but one of the read queues for Scans.
Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for reads, out of those 10, 1 would be used for Gets and 9 for Scans.
* You can use the new option `hbase.ipc.server.callqueue.handler.factor` to programmatically tune the number of queues:
+
* A value of `0` uses a single shared queue between all the handlers.
* A value of `1` uses a separate queue for each handler.
* A value between `0` and `1` tunes the number of queues against the number of handlers.
For instance, a value of `.5` shares one queue between each two handlers.
+
Having more queues, such as in a situation where you have one queue per handler, reduces contention when adding a task to a queue or selecting it from a queue.
The trade-off is that if you have some queues with long-running tasks, a handler may end up waiting to execute from that queue rather than processing another queue which has waiting tasks.
[[schema.regionsize]]
=== Table RegionSize
The regionsize can be set on a per-table basis via `setFileSize` on link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor] in the event where certain tables require different regionsizes than the configured default regionsize.
See <<ops.capacity.regions,ops.capacity.regions>> for more information.
Since HBase 0.96, row-based Bloom filters are enabled by default.
You may choose to disable them or to change some tables to use row+column Bloom filters, depending on the characteristics of your data and how it is loaded into HBase.
To determine whether Bloom filters could have a positive impact, check the value of `blockCacheHitRatio` in the RegionServer metrics.
If Bloom filters are enabled, the value of `blockCacheHitRatio` should increase, because the Bloom filter is filtering out blocks that are definitely not needed.
You can choose to enable Bloom filters for a row or for a row+column combination.
If you generally scan entire rows, the row+column combination will not provide any benefit.
Bloom filters are enabled on a Column Family.
You can do this by using the setBloomFilterType method of HColumnDescriptor or using the HBase API.
Valid values are `NONE` (the default), `ROW`, or `ROWCOL`.
See <<bloom.filters.when,bloom.filters.when>> for more information on `ROW` versus `ROWCOL`.
See also the API documentation for link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
The following example creates a table and enables a ROWCOL Bloom filter on the [literal]+colfam1+ column family.
The following example creates a table and enables a ROWCOL Bloom filter on the `colfam1` column family.
----
hbase> create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}
----
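The same column family setting can be applied from Java; this is only a sketch, with the table and family names taken from the shell example above.

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class CreateTableWithBloom {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HColumnDescriptor cf = new HColumnDescriptor("colfam1");
    // ROWCOL blooms help Gets that name explicit columns; ROW is the default.
    cf.setBloomFilterType(BloomType.ROWCOL);
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
    desc.addFamily(cf);
    admin.createTable(desc);
    admin.close();
  }
}
----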
==== Configuring Server-Wide Behavior of Bloom Filters
You can configure the following settings in the _hbase-site.xml_.
[cols="1,1,1", options="header"]
|===
@ -487,7 +483,7 @@ A useful pattern to speed up the bulk import process is to pre-create empty regi
Be somewhat conservative in this, because too many regions can actually degrade performance.
There are two different approaches to pre-creating splits.
The first approach is to rely on the default `HBaseAdmin` strategy (which is implemented in `Bytes.split`)...
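A minimal sketch of that default-strategy approach, assuming placeholder boundary keys and a target of ten regions (adjust both to your key space):

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("bulk_target"));
    desc.addFamily(new HColumnDescriptor("cf"));
    byte[] startKey = Bytes.toBytes("aaaaaaaa");  // lowest expected row key
    byte[] endKey   = Bytes.toBytes("zzzzzzzz");  // highest expected row key
    // Region boundaries are computed by splitting the key range evenly
    // (Bytes.split) into the requested number of regions.
    admin.createTable(desc, startKey, endKey, 10);
    admin.close();
  }
}
----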
[[def.log.flush]]
=== Table Creation: Deferred Log Flush
The default behavior for Puts using the Write Ahead Log (WAL) is that `WAL` edits will be written immediately.
If deferred log flush is used, WAL edits are kept in memory until the flush period.
The benefit is aggregated and asynchronous `WAL` writes, but the potential downside is that if the RegionServer goes down the yet-to-be-flushed edits are lost.
This is safer, however, than not using WAL at all with Puts.
Deferred log flush can be configured on tables via link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor].
The default value of `hbase.regionserver.optionallogflushinterval` is 1000ms.
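In HBase 0.96 and later, roughly the same behavior is selected by setting the table's durability to `ASYNC_WAL`; the sketch below assumes that API and uses a placeholder table name.

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DeferredWalFlushTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    TableName name = TableName.valueOf("low_latency_writes");
    HTableDescriptor desc = admin.getTableDescriptor(name);
    // ASYNC_WAL: WAL edits are written in the background at the flush
    // interval instead of being synced on every write.
    desc.setDurability(Durability.ASYNC_WAL);
    admin.disableTable(name);
    admin.modifyTable(name, desc);
    admin.enableTable(name);
    admin.close();
  }
}
----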
[[perf.hbase.client.autoflush]]
=== HBase Client: AutoFlush
When performing a lot of Puts, make sure that setAutoFlush is set to false on your link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable] instance.
Otherwise, the Puts will be sent one at a time to the RegionServer.
Puts added via `htable.put(Put)` and `htable.put(List<Put>)` wind up in the same write buffer.
If `autoFlush = false`, these messages are not sent until the write-buffer is filled.
To explicitly flush the messages, call `flushCommits`.
Calling `close` on the `HTable` instance will invoke `flushCommits`.
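A minimal sketch of buffered Puts with the `HTable`-based client of this era; the table name, family, and row count are placeholders.

[source,java]
----
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, TableName.valueOf("mytable"));
    table.setAutoFlush(false);           // buffer Puts on the client side
    List<Put> puts = new ArrayList<Put>();
    for (int i = 0; i < 10000; i++) {
      Put p = new Put(Bytes.toBytes("row-" + i));
      p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
      puts.add(p);
    }
    table.put(puts);                     // lands in the shared write buffer
    table.flushCommits();                // explicit flush; close() also flushes
    table.close();
  }
}
----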
[[perf.hbase.client.putwal]]
=== HBase Client: Turn off WAL on Puts
[[perf.hbase.client.regiongroup]]
=== HBase Client: Group Puts by RegionServer
In addition to using the writeBuffer, grouping `Put`s by RegionServer can reduce the number of client RPC calls per writeBuffer flush.
There is a utility `HTableUtil` currently on TRUNK that does this, but you can either copy that or implement your own version for those still on 0.90.x or earlier.
[[perf.hbase.write.mr.reducer]]
=== MapReduce: Skip The Reducer
=== Scan Attribute Selection
Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected.
If `scan.addFamily` is called then _all_ of the attributes in the specified ColumnFamily will be returned to the client.
If only a small number of the available attributes are to be processed, then only those attributes should be specified in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.
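For example, a sketch that names only the two columns it will actually process (family and qualifier names are placeholders):

[source,java]
----
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScan {
  public static Scan buildScan() {
    Scan scan = new Scan();
    // Select just the needed columns...
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("colA"));
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("colB"));
    // ...rather than scan.addFamily(Bytes.toBytes("cf")), which returns
    // every attribute in the family to the client.
    return scan;
  }
}
----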
[[perf.hbase.client.seek]]
=== Avoid scan seeks
When columns are selected explicitly with `scan.addColumn`, HBase will schedule seek operations to seek between the selected columns.
When rows have few columns and each column has only a few versions this can be inefficient.
A seek operation is generally slower if it does not seek at least past 5-10 columns/versions or 512-1024 bytes.
In order to opportunistically look ahead a few columns/versions to see if the next column/version can be found that way before a seek operation is scheduled, a new attribute `Scan.HINT_LOOKAHEAD` can be set on the Scan object.
The following code instructs the RegionServer to attempt two iterations of next before a seek is scheduled:
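A minimal sketch of that hint (the column family and qualifier are placeholders):

[source,java]
----
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LookaheadScan {
  public static Scan buildScan() {
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));
    // Try two iterations of next() before scheduling a seek.
    scan.setAttribute(Scan.HINT_LOOKAHEAD, Bytes.toBytes(2));
    return scan;
  }
}
----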
=== Block Cache
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be set to use the block cache in the RegionServer via the `setCacheBlocks` method.
For input Scans to MapReduce jobs, this should be `false`.
For frequently accessed rows, it is advisable to use the block cache.
Cache more data by moving your Block Cache offheap.
See <<offheap.blockcache,offheap.blockcache>>.
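A sketch of an input Scan for a MapReduce job that leaves the block cache alone (the caching value is only an example):

[source,java]
----
import org.apache.hadoop.hbase.client.Scan;

public class MapReduceInputScan {
  public static Scan buildScan() {
    Scan scan = new Scan();
    scan.setCaching(500);        // rows per RPC for the full-table pass
    scan.setCacheBlocks(false);  // avoid churning the RegionServer block cache
    return scan;
  }
}
----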
[[perf.hbase.client.rowkeyonly]]
=== Optimal Loading of Row Keys
When performing a table link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[scan] where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a `MUST_PASS_ALL` operator to the scanner using `setFilter`.
The filter list should include both a link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter] and a link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html[KeyOnlyFilter].
Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk and minimal network traffic to the client for a single row.
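A sketch of such a row-key-only scanner, using the two stock filters named above:

[source,java]
----
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

public class RowKeyOnlyScan {
  public static Scan buildScan() {
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    filters.addFilter(new FirstKeyOnlyFilter()); // at most one cell per row
    filters.addFilter(new KeyOnlyFilter());      // drop the value bytes
    Scan scan = new Scan();
    scan.setFilter(filters);
    return scan;
  }
}
----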
[[bloom_footprint]]
==== Bloom StoreFile footprint
Bloom filters add an entry to the `StoreFile` general `FileInfo` data structure and then two extra entries to the `StoreFile` metadata section.
===== BloomFilter in the `StoreFile` `FileInfo` data structure
`FileInfo` has a `BLOOM_FILTER_TYPE` entry which is set to `NONE`, `ROW` or `ROWCOL`.
===== BloomFilter entries in `StoreFile` metadata
`BLOOM_FILTER_META` holds Bloom Size, Hash Function used, etc.
It is small in size and is cached on `StoreFile.Reader` load.
`BLOOM_FILTER_DATA` is the actual bloomfilter data.
Obtained on-demand.
Stored in the LRU cache, if it is enabled (it is enabled by default).
[[config.bloom]]
==== Bloom Filter Configuration
===== `io.hfile.bloom.enabled` global kill switch
`io.hfile.bloom.enabled` in `Configuration` serves as the kill switch in case something goes wrong.
Default = `true`.
===== `io.hfile.bloom.error.rate`
`io.hfile.bloom.error.rate` = average false positive rate.
Default = 1%. Decrease rate by ½ (e.g.
to .5%) == +1 bit per bloom entry.
===== `io.hfile.bloom.max.fold`
`io.hfile.bloom.max.fold` = guaranteed minimum fold rate.
Most people should leave this alone.
Default = 7, or can collapse to at least 1/128th of original size.
See the _Development Process_ section of the document link:https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf[BloomFilters
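For illustration only, the three settings above could be overridden programmatically on a `Configuration`; in practice they normally live in hbase-site.xml, and the values here are arbitrary examples:

[source,java]
----
Configuration conf = HBaseConfiguration.create();
conf.setBoolean("io.hfile.bloom.enabled", true);    // global kill switch, default true
conf.setFloat("io.hfile.bloom.error.rate", 0.005f); // 0.5% average false positive rate
conf.setInt("io.hfile.bloom.max.fold", 7);          // guaranteed minimum fold rate
----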
@ -740,9 +736,9 @@ Hedged reads can be helpful for times where a rare slow read is caused by a tran
Because an HBase RegionServer is an HDFS client, you can enable hedged reads in HBase by adding the following properties to the RegionServer's hbase-site.xml and tuning the values to suit your environment.
* .Configuration for Hedged Reads[code]+dfs.client.hedged.read.threadpool.size+ - the number of threads dedicated to servicing hedged reads.
.Configuration for Hedged Reads
* `dfs.client.hedged.read.threadpool.size` - the number of threads dedicated to servicing hedged reads.
If this is set to 0 (the default), hedged reads are disabled.
* [code]+dfs.client.hedged.read.threshold.millis+ - the number of milliseconds to wait before spawning a second read thread.
* `dfs.client.hedged.read.threshold.millis` - the number of milliseconds to wait before spawning a second read thread.
.Hedged Reads Configuration Example
====
@ -782,9 +778,9 @@ See also <<compaction,compaction>> and link:http://hbase.apache.org/apidocs/org/
[[perf.deleting.rpc]]
=== Delete RPC Behavior
Be aware that [code]+htable.delete(Delete)+ doesn't use the writeBuffer.
Be aware that `htable.delete(Delete)` doesn't use the writeBuffer.
It will execute an RegionServer RPC with each invocation.
For a large number of deletes, consider [code]+htable.delete(List)+.
For a large number of deletes, consider `htable.delete(List)`.
See link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29
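A minimal sketch of batching deletes; the `conf` object, table name, and `rowsToRemove` collection are hypothetical:

[source,java]
----
HTable table = new HTable(conf, "myTable");
List<Delete> deletes = new ArrayList<Delete>();
for (byte[] row : rowsToRemove) {   // rowsToRemove is an illustrative collection of row keys
  deletes.add(new Delete(row));
}
table.delete(deletes);   // one client call for the whole batch instead of one RPC per row
----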
@ -818,7 +814,7 @@ See link:http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-
See link:http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html[Hadoop
shortcircuit reads configuration page] for how to enable the latter, better version of shortcircuit.
For example, here is a minimal config
enabling short-circuit reads added to [path]_hbase-site.xml_:
enabling short-circuit reads, added to _hbase-site.xml_:
[source,xml]
----
@ -845,9 +841,9 @@ Be careful about permissions for the directory that hosts the shared domain sock
If you are running on an old Hadoop, one that is without link:https://issues.apache.org/jira/browse/HDFS-347[HDFS-347] but that has link:https://issues.apache.org/jira/browse/HDFS-2246[HDFS-2246], you must set two configurations.
First, the hdfs-site.xml needs to be amended.
Set the property [var]+dfs.block.local-path-access.user+ to be the _only_ user that can use the shortcut.
Set the property `dfs.block.local-path-access.user` to be the _only_ user that can use the shortcut.
This has to be the user that started HBase.
Then in hbase-site.xml, set [var]+dfs.client.read.shortcircuit+ to be [var]+true+
Then, in hbase-site.xml, set `dfs.client.read.shortcircuit` to `true`.
Services -- at least the HBase RegionServers -- will need to be restarted in order to pick up the new configurations.
@ -32,7 +32,7 @@ This is the official reference guide for the link:http://hbase.apache.org/[HBase
Herein you will find either the definitive documentation on an HBase topic as of its standing when the referenced HBase version shipped, or it will point to the location in link:http://hbase.apache.org/apidocs/index.html[javadoc], link:https://issues.apache.org/jira/browse/HBASE[JIRA] or link:http://wiki.apache.org/hadoop/Hbase[wiki] where the pertinent information can be found.
.About This Guide
This reference guide is a work in progress. The source for this guide can be found in the [path]_src/main/docbkx_ directory of the HBase source. This reference guide is marked up using link:http://www.docbook.org/[DocBook] from which the the finished guide is generated as part of the 'site' build target. Run
This reference guide is a work in progress. The source for this guide can be found in the _src/main/asciidoc_ directory of the HBase source. This reference guide is marked up using Asciidoc, from which the finished guide is generated as part of the 'site' build target. Run
[source,bourne]
----
mvn site
@ -42,7 +42,7 @@ Amendments and improvements to the documentation are welcomed.
Click link:https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12310753&issuetype=1&components=12312132&summary=SHORT+DESCRIPTION[this link] to file a new documentation bug against Apache HBase with some values pre-selected.
.Contributing to the Documentation
For an overview of Docbook and suggestions to get started contributing to the documentation, see <<appendix_contributing_to_documentation,appendix contributing to documentation>>.
For an overview of Asciidoc and suggestions to get started contributing to the documentation, see <<appendix_contributing_to_documentation,appendix contributing to documentation>>.
.Providing Feedback
This guide allows you to leave comments or questions on any page, using Disqus.
@ -199,24 +199,24 @@ If later, fat request has clear advantage, can roll out a v2 later.
==== RPC Configurations
.CellBlock Codecs
To enable a codec other than the default [class]+KeyValueCodec+, set [var]+hbase.client.rpc.codec+ to the name of the Codec class to use.
Codec must implement hbase's [class]+Codec+ Interface.
To enable a codec other than the default `KeyValueCodec`, set `hbase.client.rpc.codec` to the name of the Codec class to use.
The codec must implement HBase's `Codec` interface.
After connection setup, all passed cellblocks will be sent with this codec.
The server will return cellblocks using this same codec as long as the codec is on the servers' CLASSPATH (else you will get [class]+UnsupportedCellCodecException+).
The server will return cellblocks using this same codec as long as the codec is on the servers' CLASSPATH (else you will get `UnsupportedCellCodecException`).
To change the default codec, set [var]+hbase.client.default.rpc.codec+.
To change the default codec, set `hbase.client.default.rpc.codec`.
To disable cellblocks completely and to go pure protobuf, set the default to the empty String and do not specify a codec in your Configuration.
So, set [var]+hbase.client.default.rpc.codec+ to the empty string and do not set [var]+hbase.client.rpc.codec+.
So, set `hbase.client.default.rpc.codec` to the empty string and do not set `hbase.client.rpc.codec`.
This will cause the client to connect to the server with no codec specified.
If a server sees no codec, it will return all responses in pure protobuf.
Running pure protobuf all the time will be slower than running with cellblocks.
.Compression
Uses Hadoop's compression codecs.
To enable compressing of passed CellBlocks, set [var]+hbase.client.rpc.compressor+ to the name of the Compressor to use.
To enable compressing of passed CellBlocks, set `hbase.client.rpc.compressor` to the name of the Compressor to use.
The compressor must implement Hadoop's `CompressionCodec` interface.
After connection setup, all passed cellblocks will be sent compressed.
The server will return cellblocks compressed using this same compressor as long as the compressor is on its CLASSPATH (else you will get [class]+UnsupportedCompressionCodecException+).
The server will return cellblocks compressed using this same compressor as long as the compressor is on its CLASSPATH (else you will get `UnsupportedCompressionCodecException`).
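A sketch of the client-side settings described above; the codec and compressor class names are examples only, not recommendations:

[source,java]
----
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.client.rpc.codec", "org.apache.hadoop.hbase.codec.KeyValueCodecWithTags");
conf.set("hbase.client.rpc.compressor", "org.apache.hadoop.io.compress.GzipCodec");
// To go pure protobuf instead, leave hbase.client.rpc.codec unset and set:
// conf.set("hbase.client.default.rpc.codec", "");
----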
:numbered:
@ -123,7 +123,7 @@ foo0004
----
Now, imagine that you would like to spread these across four different regions.
You decide to use four different salts: [literal]+a+, [literal]+b+, [literal]+c+, and [literal]+d+.
You decide to use four different salts: `a`, `b`, `c`, and `d`.
In this scenario, each of these letter prefixes will be on a different region.
After applying the salts, you have the following rowkeys instead.
Since you can now write to four separate regions, you theoretically have four times the throughput when writing that you would have if all the writes were going to the same region.
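A rough sketch of this salting scheme (the helper method and separator are illustrative; because the salt is chosen at random, a reader has to check all four possible prefixes to find a given row):

[source,java]
----
String[] salts = {"a", "b", "c", "d"};
Random random = new Random();

String saltedKey(String originalKey) {
  // Prepend a randomly chosen salt so writes spread across four regions.
  return salts[random.nextInt(salts.length)] + "-" + originalKey;
}
----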
@ -159,7 +159,7 @@ Using a deterministic hash allows the client to reconstruct the complete rowkey
.Hashing Example
[example]
Given the same situation in the salting example above, you could instead apply a one-way hash that would cause the row with key [literal]+foo0003+ to always, and predictably, receive the [literal]+a+ prefix.
Given the same situation in the salting example above, you could instead apply a one-way hash that would cause the row with key `foo0003` to always, and predictably, receive the `a` prefix.
Then, to retrieve that row, you would already know the key.
You could also optimize things so that certain pairs of keys were always in the same region, for instance.
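A rough sketch of such a deterministic prefix (the hash function and separator are illustrative only):

[source,java]
----
String hashedKey(String originalKey) {
  String[] prefixes = {"a", "b", "c", "d"};
  // Derive the prefix from the key itself, so the same key always gets the same prefix
  // and the client can reconstruct the full rowkey.
  int bucket = (originalKey.hashCode() & Integer.MAX_VALUE) % prefixes.length;
  return prefixes[bucket] + "-" + originalKey;
}
----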
@ -292,8 +292,8 @@ See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.ht
A common problem in database processing is quickly finding the most recent version of a value.
A technique using reverse timestamps as a part of the key can help greatly with a special case of this problem.
Also found in the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly), the technique involves appending ([code]+Long.MAX_VALUE -
timestamp+) to the end of any key, e.g., [key][reverse_timestamp].
Also found in the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly), the technique involves appending (`Long.MAX_VALUE -
timestamp`) to the end of any key, e.g., [key][reverse_timestamp].
The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record.
Since HBase keys are in sorted order, this key sorts before any older row-keys for [key] and thus is first.
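For example, a `[key][reverse_timestamp]` rowkey could be built like this (a sketch; the key is hypothetical):

[source,java]
----
long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();
byte[] rowkey = Bytes.add(Bytes.toBytes("mykey"), Bytes.toBytes(reverseTs));
// A Scan starting at Bytes.toBytes("mykey") returns the newest version first.
----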
@ -317,7 +317,7 @@ This is a fairly common question on the HBase dist-list so it pays to get the ro
=== Relationship Between RowKeys and Region Splits
If you pre-split your table, it is _critical_ to understand how your rowkey will be distributed across the region boundaries.
As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through [code]+Bytes.split+ (which is the split strategy used when creating regions in [code]+HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions)+ for 10 regions will generate the following splits...
As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through `Bytes.split` (which is the split strategy used when creating regions in `HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions)`) for 10 regions will generate the following splits...
----
@ -428,7 +428,7 @@ This applies to _all_ versions of a row - even the current one.
The TTL time encoded in HBase for the row is specified in UTC.
Store files which contain only expired rows are deleted on minor compaction.
Setting [var]+hbase.store.delete.expired.storefile+ to [code]+false+ disables this feature.
Setting `hbase.store.delete.expired.storefile` to `false` disables this feature.
Setting the minimum number of versions to a value other than 0 also disables this.
See link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor] for more information.
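A sketch of setting a TTL on a column family, alongside the property above; the family name and TTL value are illustrative:

[source,java]
----
HColumnDescriptor hcd = new HColumnDescriptor("f");
hcd.setTimeToLive(7 * 24 * 60 * 60);   // TTL in seconds, here one week
// conf.setBoolean("hbase.store.delete.expired.storefile", false);  // only if disabling the feature
----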
@ -455,14 +455,14 @@ This allows for point-in-time queries even in the presence of deletes.
Deleted cells are still subject to TTL and there will never be more than "maximum number of versions" deleted cells.
A new "raw" scan option returns all deleted rows and the delete markers.
.Change the Value of [code]+KEEP_DELETED_CELLS+ Using HBase Shell
.Change the Value of `KEEP_DELETED_CELLS` Using HBase Shell
====
----
hbase> alter 't1', NAME => 'f1', KEEP_DELETED_CELLS => true
----
====
.Change the Value of [code]+KEEP_DELETED_CELLS+ Using the API
.Change the Value of `KEEP_DELETED_CELLS` Using the API
====
[source,java]
----
@ -576,7 +576,7 @@ We can store them in an HBase table called LOG_DATA, but what will the rowkey be
[[schema.casestudies.log_timeseries.tslead]]
==== Timestamp In The Rowkey Lead Position
The rowkey [code]+[timestamp][hostname][log-event]+ suffers from the monotonically increasing rowkey problem described in <<timeseries,timeseries>>.
The rowkey `[timestamp][hostname][log-event]` suffers from the monotonically increasing rowkey problem described in <<timeseries,timeseries>>.
There is another pattern frequently mentioned in the dist-lists about ``bucketing'' timestamps, by performing a mod operation on the timestamp.
If time-oriented scans are important, this could be a useful approach.
@ -602,14 +602,14 @@ As stated above, to select data for a particular timerange, a Scan will need to
[[schema.casestudies.log_timeseries.hostlead]]
==== Host In The Rowkey Lead Position
The rowkey [code]+[hostname][log-event][timestamp]+ is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace.
The rowkey `[hostname][log-event][timestamp]` is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace.
This approach would be useful if scanning by hostname was a priority.
[[schema.casestudies.log_timeseries.revts]]
==== Timestamp, or Reverse Timestamp?
If the most important access path is to pull most recent events, then storing the timestamps as reverse-timestamps (e.g., [code]+timestamp = Long.MAX_VALUE
timestamp+) will create the property of being able to do a Scan on [code]+[hostname][log-event]+ to obtain the quickly obtain the most recently captured events.
If the most important access path is to pull the most recent events, then storing the timestamps as reverse-timestamps (e.g., `timestamp = Long.MAX_VALUE - timestamp`) will create the property of being able to do a Scan on `[hostname][log-event]` to quickly obtain the most recently captured events.
Neither approach is wrong; it just depends on what is most appropriate for the situation.
@ -32,19 +32,19 @@ HBase provides mechanisms to secure various components and aspects of HBase and
== Using Secure HTTP (HTTPS) for the Web UI
A default HBase install uses insecure HTTP connections for web UIs for the master and region servers.
To enable secure HTTP (HTTPS) connections instead, set [code]+hadoop.ssl.enabled+ to [literal]+true+ in [path]_hbase-site.xml_.
To enable secure HTTP (HTTPS) connections instead, set `hadoop.ssl.enabled` to `true` in _hbase-site.xml_.
This does not change the port used by the Web UI.
To change the port for the web UI for a given HBase component, configure that port's setting in hbase-site.xml.
These settings are:
* [code]+hbase.master.info.port+
* [code]+hbase.regionserver.info.port+
* `hbase.master.info.port`
* `hbase.regionserver.info.port`
.If you enable HTTPS, clients should avoid using the non-secure HTTP connection.
[NOTE]
====
If you enable secure HTTP, clients should connect to HBase using the [code]+https://+ URL.
Clients using the [code]+http://+ URL will receive an HTTP response of [literal]+200+, but will not receive any data.
If you enable secure HTTP, clients should connect to HBase using the `https://` URL.
Clients using the `http://` URL will receive an HTTP response of `200`, but will not receive any data.
The following exception is logged:
----
@ -72,8 +72,8 @@ This describes how to set up Apache HBase and clients for connection to secure H
=== Prerequisites
Hadoop Authentication Configuration::
To run HBase RPC with strong authentication, you must set [code]+hbase.security.authentication+ to [literal]+true+.
In this case, you must also set [code]+hadoop.security.authentication+ to [literal]+true+.
To run HBase RPC with strong authentication, you must set `hbase.security.authentication` to `true`.
In this case, you must also set `hadoop.security.authentication` to `true`.
Otherwise, you would be using strong authentication for HBase but not for the underlying HDFS, which would cancel out any benefit.
Kerberos KDC::
@ -83,11 +83,10 @@ Kerberos KDC::
First, refer to <<security.prerequisites,security.prerequisites>> and ensure that your underlying HDFS configuration is secure.
Add the following to the [code]+hbase-site.xml+ file on every server machine in the cluster:
Add the following to the `hbase-site.xml` file on every server machine in the cluster:
[source,xml]
----
<property>
<name>hbase.security.authentication</name>
<value>kerberos</value>
@ -108,27 +107,25 @@ A full shutdown and restart of HBase service is required when deploying these co
First, refer to <<security.prerequisites,security.prerequisites>> and ensure that your underlying HDFS configuration is secure.
Add the following to the [code]+hbase-site.xml+ file on every client:
Add the following to the `hbase-site.xml` file on every client:
[source,xml]
----
<property>
<name>hbase.security.authentication</name>
<value>kerberos</value>
</property>
----
The client environment must be logged in to Kerberos from KDC or keytab via the [code]+kinit+ command before communication with the HBase cluster will be possible.
The client environment must be logged in to Kerberos from KDC or keytab via the `kinit` command before communication with the HBase cluster will be possible.
Be advised that if the [code]+hbase.security.authentication+ in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.
Be advised that if the `hbase.security.authentication` in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.
Once HBase is configured for secure RPC it is possible to optionally configure encrypted communication.
To do so, add the following to the [code]+hbase-site.xml+ file on every client:
To do so, add the following to the `hbase-site.xml` file on every client:
[source,xml]
----
<property>
<name>hbase.rpc.protection</name>
<value>privacy</value>
@ -136,11 +133,10 @@ To do so, add the following to the [code]+hbase-site.xml+ file on every client:
----
This configuration property can also be set on a per connection basis.
Set it in the [code]+Configuration+ supplied to [code]+HTable+:
Set it in the `Configuration` supplied to `HTable`:
[source,java]
----
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.rpc.protection", "privacy");
HTable table = new HTable(conf, tablename);
@ -151,10 +147,9 @@ Expect a ~10% performance penalty for encrypted communication.
[[security.client.thrift]]
=== Client-side Configuration for Secure Operation - Thrift Gateway
Add the following to the [code]+hbase-site.xml+ file for every Thrift gateway:
Add the following to the `hbase-site.xml` file for every Thrift gateway:
[source,xml]
----
<property>
<name>hbase.thrift.keytab.file</name>
<value>/etc/hbase/conf/hbase.keytab</value>
@ -170,12 +165,11 @@ Add the following to the [code]+hbase-site.xml+ file for every Thrift gateway:
Substitute the appropriate credential and keytab for [replaceable]_$USER_ and [replaceable]_$KEYTAB_ respectively.
In order to use the Thrift API principal to interact with HBase, it is also necessary to add the [code]+hbase.thrift.kerberos.principal+ to the [code]+_acl_+ table.
For example, to give the Thrift API principal, [code]+thrift_server+, administrative access, a command such as this one will suffice:
In order to use the Thrift API principal to interact with HBase, it is also necessary to add the `hbase.thrift.kerberos.principal` to the `_acl_` table.
For example, to give the Thrift API principal, `thrift_server`, administrative access, a command such as this one will suffice:
[source,sql]
----
grant 'thrift_server', 'RWCA'
----
@ -203,14 +197,14 @@ To enable it, do the following.
. Be sure Thrift is running in secure mode, by following the procedure described in <<security.client.thrift,security.client.thrift>>.
. Be sure that HBase is configured to allow proxy users, as described in <<security.rest.gateway,security.rest.gateway>>.
. In [path]_hbase-site.xml_ for each cluster node running a Thrift gateway, set the property [code]+hbase.thrift.security.qop+ to one of the following three values:
. In _hbase-site.xml_ for each cluster node running a Thrift gateway, set the property `hbase.thrift.security.qop` to one of the following three values:
+
* [literal]+auth-conf+ - authentication, integrity, and confidentiality checking
* [literal]+auth-int+ - authentication and integrity checking
* [literal]+auth+ - authentication checking only
* `auth-conf` - authentication, integrity, and confidentiality checking
* `auth-int` - authentication and integrity checking
* `auth` - authentication checking only
. Restart the Thrift gateway processes for the changes to take effect.
If a node is running Thrift, the output of the +jps+ command will list a [code]+ThriftServer+ process.
If a node is running Thrift, the output of the +jps+ command will list a `ThriftServer` process.
To stop Thrift on a node, run the command +bin/hbase-daemon.sh stop thrift+.
To start Thrift on a node, run the command +bin/hbase-daemon.sh start thrift+.
@ -255,11 +249,10 @@ Take a look at the link:https://github.com/apache/hbase/blob/master/hbase-exampl
=== Client-side Configuration for Secure Operation - REST Gateway
Add the following to the [code]+hbase-site.xml+ file for every REST gateway:
Add the following to the `hbase-site.xml` file for every REST gateway:
[source,xml]
----
<property>
<name>hbase.rest.keytab.file</name>
<value>$KEYTAB</value>
@ -276,12 +269,11 @@ The REST gateway will authenticate with HBase using the supplied credential.
No authentication will be performed by the REST gateway itself.
All client access via the REST gateway will use the REST gateway's credential and have its privilege.
In order to use the REST API principal to interact with HBase, it is also necessary to add the [code]+hbase.rest.kerberos.principal+ to the [code]+_acl_+ table.
For example, to give the REST API principal, [code]+rest_server+, administrative access, a command such as this one will suffice:
In order to use the REST API principal to interact with HBase, it is also necessary to add the `hbase.rest.kerberos.principal` to the `_acl_` table.
For example, to give the REST API principal, `rest_server`, administrative access, a command such as this one will suffice:
[source,sql]
----
grant 'rest_server', 'RWCA'
----
@ -304,7 +296,7 @@ So it can apply proper authorizations.
To turn on REST gateway impersonation, we need to configure HBase servers (masters and region servers) to allow proxy users; configure REST gateway to enable impersonation.
To allow proxy users, add the following to the [code]+hbase-site.xml+ file for every HBase server:
To allow proxy users, add the following to the `hbase-site.xml` file for every HBase server:
[source,xml]
----
@ -324,7 +316,7 @@ To allow proxy users, add the following to the [code]+hbase-site.xml+ file for e
Substitute the REST gateway proxy user for $USER, and the allowed group list for $GROUPS.
To enable REST gateway impersonation, add the following to the [code]+hbase-site.xml+ file for every REST gateway.
To enable REST gateway impersonation, add the following to the `hbase-site.xml` file for every REST gateway.
[source,xml]
----
@ -370,7 +362,7 @@ None
=== Server-side Configuration for Simple User Access Operation
Add the following to the [code]+hbase-site.xml+ file on every server machine in the cluster:
Add the following to the `hbase-site.xml` file on every server machine in the cluster:
[source,xml]
----
@ -396,7 +388,7 @@ Add the following to the [code]+hbase-site.xml+ file on every server machine in
</property>
----
For 0.94, add the following to the [code]+hbase-site.xml+ file on every server machine in the cluster:
For 0.94, add the following to the `hbase-site.xml` file on every server machine in the cluster:
[source,xml]
----
@ -418,7 +410,7 @@ A full shutdown and restart of HBase service is required when deploying these co
=== Client-side Configuration for Simple User Access Operation
Add the following to the [code]+hbase-site.xml+ file on every client:
Add the following to the `hbase-site.xml` file on every client:
[source,xml]
----
@ -428,7 +420,7 @@ Add the following to the [code]+hbase-site.xml+ file on every client:
</property>
----
For 0.94, add the following to the [code]+hbase-site.xml+ file on every server machine in the cluster:
For 0.94, add the following to the `hbase-site.xml` file on every server machine in the cluster:
[source,xml]
----
@ -438,16 +430,15 @@ For 0.94, add the following to the [code]+hbase-site.xml+ file on every server m
</property>
----
Be advised that if the [code]+hbase.security.authentication+ in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.
Be advised that if the `hbase.security.authentication` in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.
==== Client-side Configuration for Simple User Access Operation - Thrift Gateway
The Thrift gateway user will need access.
For example, to give the Thrift API user, [code]+thrift_server+, administrative access, a command such as this one will suffice:
For example, to give the Thrift API user, `thrift_server`, administrative access, a command such as this one will suffice:
[source,sql]
----
grant 'thrift_server', 'RWCA'
----
@ -464,11 +455,10 @@ No authentication will be performed by the REST gateway itself.
All client access via the REST gateway will use the REST gateway's credential and have its privilege.
The REST gateway user will need access.
For example, to give the REST API user, [code]+rest_server+, administrative access, a command such as this one will suffice:
For example, to give the REST API user, `rest_server`, administrative access, a command such as this one will suffice:
[source,sql]
----
grant 'rest_server', 'RWCA'
----
@ -502,11 +492,11 @@ To take advantage of many of these features, you must be running HBase 0.98+ and
[WARNING]
====
Several procedures in this section require you to copy files between cluster nodes.
When copying keys, configuration files, or other files containing sensitive strings, use a secure method, such as [code]+ssh+, to avoid leaking sensitive data.
When copying keys, configuration files, or other files containing sensitive strings, use a secure method, such as `ssh`, to avoid leaking sensitive data.
====
.Procedure: Basic Server-Side Configuration
. Enable HFile v3, by setting +hfile.format.version +to 3 in [path]_hbase-site.xml_.
. Enable HFile v3, by setting `hfile.format.version` to 3 in _hbase-site.xml_.
This is the default for HBase 1.0 and newer. +
[source,xml]
----
@ -535,10 +525,10 @@ Every tag has a type and the actual tag byte array.
Just as row keys, column families, qualifiers and values can be encoded (see <<data.block.encoding.types,data.block.encoding.types>>), tags can also be encoded.
You can enable or disable tag encoding at the level of the column family, and it is enabled by default.
Use the [code]+HColumnDescriptor#setCompressionTags(boolean compressTags)+ method to manage encoding settings on a column family.
Use the `HColumnDescriptor#setCompressionTags(boolean compressTags)` method to manage encoding settings on a column family.
You also need to enable the DataBlockEncoder for the column family, for encoding of tags to take effect.
You can enable compression of each tag in the WAL, if WAL compression is also enabled, by setting the value of +hbase.regionserver.wal.tags.enablecompression+ to [literal]+true+ in [path]_hbase-site.xml_.
You can enable compression of each tag in the WAL, if WAL compression is also enabled, by setting the value of +hbase.regionserver.wal.tags.enablecompression+ to `true` in _hbase-site.xml_.
Tag compression uses dictionary encoding.
Tag compression is not supported when using WAL encryption.
@ -574,21 +564,21 @@ HBase access levels are granted independently of each other and allow for differ
The possible scopes are:
* +Superuser+ - superusers can perform any operation available in HBase, to any resource.
The user who runs HBase on your cluster is a superuser, as are any principals assigned to the configuration property [code]+hbase.superuser+ in [path]_hbase-site.xml_ on the HMaster.
* +Global+ - permissions granted at [path]_global_ scope allow the admin to operate on all tables of the cluster.
* +Namespace+ - permissions granted at [path]_namespace_ scope apply to all tables within a given namespace.
* +Table+ - permissions granted at [path]_table_ scope apply to data or metadata within a given table.
* +ColumnFamily+ - permissions granted at [path]_ColumnFamily_ scope apply to cells within that ColumnFamily.
* +Cell+ - permissions granted at [path]_cell_ scope apply to that exact cell coordinate (key, value, timestamp). This allows for policy evolution along with data.
The user who runs HBase on your cluster is a superuser, as are any principals assigned to the configuration property `hbase.superuser` in _hbase-site.xml_ on the HMaster.
* +Global+ - permissions granted at _global_ scope allow the admin to operate on all tables of the cluster.
* +Namespace+ - permissions granted at _namespace_ scope apply to all tables within a given namespace.
* +Table+ - permissions granted at _table_ scope apply to data or metadata within a given table.
* +ColumnFamily+ - permissions granted at _ColumnFamily_ scope apply to cells within that ColumnFamily.
* +Cell+ - permissions granted at _cell_ scope apply to that exact cell coordinate (key, value, timestamp). This allows for policy evolution along with data.
+
To change an ACL on a specific cell, write an updated cell with new ACL to the precise coordinates of the original.
+
If you have a multi-versioned schema and want to update ACLs on all visible versions, you need to write new cells for all visible versions.
The application has complete control over policy evolution.
+
The exception to the above rule is [code]+append+ and [code]+increment+ processing.
The exception to the above rule is `append` and `increment` processing.
Appends and increments can carry an ACL in the operation.
If one is included in the operation, then it will be applied to the result of the [code]+append+ or [code]+increment+.
If one is included in the operation, then it will be applied to the result of the `append` or `increment`.
Otherwise, the ACL of the existing cell you are appending to or incrementing is preserved.
@ -612,21 +602,21 @@ In a production environment, it is likely that different users will have only on
+
[WARNING]
====
In the current implementation, a Global Admin with [code]+Admin+ permission can grant himself [code]+Read+ and [code]+Write+ permissions on a table and gain access to that table's data.
For this reason, only grant [code]+Global Admin+ permissions to trusted user who actually need them.
In the current implementation, a Global Admin with `Admin` permission can grant himself `Read` and `Write` permissions on a table and gain access to that table's data.
For this reason, only grant `Global Admin` permissions to trusted users who actually need them.
Also be aware that a [code]+Global Admin+ with [code]+Create+ permission can perform a [code]+Put+ operation on the ACL table, simulating a [code]+grant+ or [code]+revoke+ and circumventing the authorization check for [code]+Global Admin+ permissions.
Also be aware that a `Global Admin` with `Create` permission can perform a `Put` operation on the ACL table, simulating a `grant` or `revoke` and circumventing the authorization check for `Global Admin` permissions.
Due to these issues, be cautious with granting [code]+Global Admin+ privileges.
Due to these issues, be cautious with granting `Global Admin` privileges.
====
* +Namespace Admins+ - a namespace admin with [code]+Create+ permissions can create or drop tables within that namespace, and take and restore snapshots.
A namespace admin with [code]+Admin+ permissions can perform operations such as splits or major compactions on tables within that namespace.
* +Namespace Admins+ - a namespace admin with `Create` permissions can create or drop tables within that namespace, and take and restore snapshots.
A namespace admin with `Admin` permissions can perform operations such as splits or major compactions on tables within that namespace.
* +Table Admins+ - A table admin can perform administrative operations only on that table.
A table admin with [code]+Create+ permissions can create snapshots from that table or restore that table from a snapshot.
A table admin with [code]+Admin+ permissions can perform operations such as splits or major compactions on that table.
A table admin with `Create` permissions can create snapshots from that table or restore that table from a snapshot.
A table admin with `Admin` permissions can perform operations such as splits or major compactions on that table.
* +Users+ - Users can read or write data, or both.
Users can also execute coprocessor endpoints, if given [code]+Executable+ permissions.
Users can also execute coprocessor endpoints, if given `Executable` permissions.
.Real-World Example of Access Levels
[cols="1,1,1,1", options="header"]
@ -682,7 +672,7 @@ Cell-level ACLs are implemented using tags (see <<hbase.tags,hbase.tags>>). In o
. As a prerequisite, perform the steps in <<security.data.basic.server.side,security.data.basic.server.side>>.
. Install and configure the AccessController coprocessor, by setting the following properties in [path]_hbase-site.xml_.
. Install and configure the AccessController coprocessor, by setting the following properties in _hbase-site.xml_.
These properties take a list of classes.
+
NOTE: If you use the AccessController along with the VisibilityController, the AccessController must come first in the list, because with both components active, the VisibilityController will delegate access control on its system tables to the AccessController.
@ -708,10 +698,10 @@ For an example of using both together, see <<security.example.config,security.ex
</property>
----
+
Optionally, you can enable transport security, by setting +hbase.rpc.protection+ to [literal]+auth-conf+.
Optionally, you can enable transport security, by setting +hbase.rpc.protection+ to `auth-conf`.
This requires HBase 0.98.4 or newer.
. Set up the Hadoop group mapper in the Hadoop namenode's [path]_core-site.xml_.
. Set up the Hadoop group mapper in the Hadoop namenode's _core-site.xml_.
This is a Hadoop file, not an HBase file.
Customize it to your site's needs.
Following is an example.
@ -766,11 +756,11 @@ This requires HBase 0.98.4 or newer.
. Optionally, enable the early-out evaluation strategy.
Prior to HBase 0.98.0, if a user was not granted access to a column family, or at least a column qualifier, an AccessDeniedException would be thrown.
HBase 0.98.0 removed this exception in order to allow cell-level exceptional grants.
To restore the old behavior in HBase 0.98.0-0.98.6, set +hbase.security.access.early_out+ to [literal]+true+ in [path]_hbase-site.xml_.
In HBase 0.98.6, the default has been returned to [literal]+true+.
To restore the old behavior in HBase 0.98.0-0.98.6, set +hbase.security.access.early_out+ to `true` in _hbase-site.xml_.
In HBase 0.98.6, the default has been returned to `true`.
. Distribute your configuration and restart your cluster for changes to take effect.
. To test your configuration, log into HBase Shell as a given user and use the +whoami+ command to report the groups your user is part of.
In this example, the user is reported as being a member of the [code]+services+ group.
In this example, the user is reported as being a member of the `services` group.
+
----
hbase> whoami
@ -786,7 +776,7 @@ Administration tasks can be performed from HBase Shell or via an API.
.API Examples
[CAUTION]
====
Many of the API examples below are taken from source files [path]_hbase-server/src/test/java/org/apache/hadoop/hbase/security/access/TestAccessController.java_ and [path]_hbase-server/src/test/java/org/apache/hadoop/hbase/security/access/SecureTestUtil.java_.
Many of the API examples below are taken from source files _hbase-server/src/test/java/org/apache/hadoop/hbase/security/access/TestAccessController.java_ and _hbase-server/src/test/java/org/apache/hadoop/hbase/security/access/SecureTestUtil.java_.
Neither the examples, nor the source files they are taken from, are part of the public HBase API, and are provided for illustration only.
Refer to the official API for usage instructions.
@ -802,12 +792,13 @@ Users and groups are maintained external to HBase, in your directory.
There are a few different types of syntax for grant statements.
The first, and most familiar, is as follows, with the table and column family being optional:
+
[source,sql]
----
grant 'user', 'RWXCA', 'TABLE', 'CF', 'CQ'
----
+
Groups and users are granted access in the same way, but groups are prefixed with an [literal]+@+ symbol.
In the same way, tables and namespaces are specified in the same way, but namespaces are prefixed with an [literal]+@+ symbol.
Groups and users are granted access in the same way, but groups are prefixed with an `@` symbol.
Tables and namespaces are specified in the same way, but namespaces are prefixed with an `@` symbol.
+
It is also possible to grant multiple permissions against the same resource in a single statement, as in this example.
The first sub-clause maps users to ACLs and the second sub-clause specifies the resource.
@ -862,7 +853,7 @@ grant <table>, \
{ <scanner-specification> }
----
+
* [replaceable]_<user-or-group>_ is the user or group name, prefixed with [literal]+@+ in the case of a group.
* [replaceable]_<user-or-group>_ is the user or group name, prefixed with `@` in the case of a group.
* [replaceable]_<permissions>_ is a string containing any or all of "RWXCA", though only R and W are meaningful at cell scope.
* [replaceable]_<scanner-specification>_ is the scanner specification syntax and conventions used by the 'scan' shell command.
For some examples of scanner specifications, issue the following HBase Shell command.
@ -911,7 +902,7 @@ public static void grantOnTable(final HBaseTestingUtility util, final String use
}
----
To grant permissions at the cell level, you can use the [code]+Mutation.setACL+ method:
To grant permissions at the cell level, you can use the `Mutation.setACL` method:
[source,java]
----
@ -919,7 +910,7 @@ Mutation.setACL(String user, Permission perms)
Mutation.setACL(Map<String, Permission> perms)
----
Specifically, this example provides read permission to a user called [literal]+user1+ on any cells contained in a particular Put operation:
Specifically, this example provides read permission to a user called `user1` on any cells contained in a particular Put operation:
[source,java]
----
@ -1000,7 +991,7 @@ public static void verifyAllowed(User user, AccessTestAction action, int count)
=== Visibility Labels
Visibility label control can be used to permit only users or principals associated with a given label to read or access cells with that label.
For instance, you might label a cell [literal]+top-secret+, and only grant access to that label to the [literal]+managers+ group.
For instance, you might label a cell `top-secret`, and only grant access to that label to the `managers` group.
Visibility labels are implemented using Tags, which are a feature of HFile v3, and allow you to store metadata on a per-cell basis.
A label is a string, and labels can be combined into expressions by using logical operators (&, |, or !), and using parentheses for grouping.
HBase does not do any kind of validation of expressions beyond basic well-formedness.
@ -1009,14 +1000,14 @@ Visibility labels have no meaning on their own, and may be used to denote sensit
If a user's labels do not match a cell's label or expression, the user is denied access to the cell.
In HBase 0.98.6 and newer, UTF-8 encoding is supported for visibility labels and expressions.
When creating labels using the [code]+addLabels(conf, labels)+ method provided by the [code]+org.apache.hadoop.hbase.security.visibility.VisibilityClient+ class and passing labels in Authorizations via Scan or Get, labels can contain UTF-8 characters, as well as the logical operators normally used in visibility labels, with normal Java notations, without needing any escaping method.
However, when you pass a CellVisibility expression via a Mutation, you must enclose the expression with the [code]+CellVisibility.quote()+ method if you use UTF-8 characters or logical operators.
See [code]+TestExpressionParser+ and the source file [path]_hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestScan.java_.
When creating labels using the `addLabels(conf, labels)` method provided by the `org.apache.hadoop.hbase.security.visibility.VisibilityClient` class and passing labels in Authorizations via Scan or Get, labels can contain UTF-8 characters, as well as the logical operators normally used in visibility labels, with normal Java notations, without needing any escaping method.
However, when you pass a CellVisibility expression via a Mutation, you must enclose the expression with the `CellVisibility.quote()` method if you use UTF-8 characters or logical operators.
See `TestExpressionParser` and the source file _hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestScan.java_.
A user adds visibility expressions to a cell during a Put operation.
In the default configuration, the user does not need access to a label in order to label cells with it.
This behavior is controlled by the configuration option +hbase.security.visibility.mutations.checkauths+.
If you set this option to [literal]+true+, the labels the user is modifying as part of the mutation must be associated with the user, or the mutation will fail.
If you set this option to `true`, the labels the user is modifying as part of the mutation must be associated with the user, or the mutation will fail.
Whether a user is authorized to read a labelled cell is determined during a Get or Scan, and results which the user is not allowed to read are filtered out.
This incurs the same I/O penalty as if the results were returned, but reduces load on the network.
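A sketch of labelling a cell during a Put; the table, row, family, and labels are illustrative:

[source,java]
----
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
// Use CellVisibility.quote(...) if the expression contains UTF-8 or operator characters.
put.setCellVisibility(new CellVisibility("secret|topsecret"));
table.put(put);
----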
@ -1027,11 +1018,11 @@ The user's effective label set is built in the RPC context when a request is fir
The way that users are associated with labels is pluggable.
The default plugin passes through labels specified in Authorizations added to the Get or Scan and checks those against the calling user's authenticated labels list.
When the client passes labels for which the user is not authenticated, the default plugin drops them.
You can pass a subset of user authenticated labels via the [code]+Get#setAuthorizations(Authorizations(String,...))+ and [code]+Scan#setAuthorizations(Authorizations(String,...));+ methods.
You can pass a subset of user authenticated labels via the `Get#setAuthorizations(Authorizations(String,...))` and `Scan#setAuthorizations(Authorizations(String,...));` methods.
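A sketch of passing authorizations with a Get or Scan; the labels are illustrative:

[source,java]
----
Get get = new Get(Bytes.toBytes("row1"));
get.setAuthorizations(new Authorizations("secret"));

Scan scan = new Scan();
scan.setAuthorizations(new Authorizations("secret", "topsecret"));
----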
Visibility label access checking is performed by the VisibilityController coprocessor.
You can use interface [code]+VisibilityLabelService+ to provide a custom implementation and/or control the way that visibility labels are stored with cells.
See the source file [path]_hbase-server/src/test/java/org/apache/hadoop/hbase/security/visibility/TestVisibilityLabelsWithCustomVisLabService.java_ for one example.
You can use interface `VisibilityLabelService` to provide a custom implementation and/or control the way that visibility labels are stored with cells.
See the source file _hbase-server/src/test/java/org/apache/hadoop/hbase/security/visibility/TestVisibilityLabelsWithCustomVisLabService.java_ for one example.
Visibility labels can be used in conjunction with ACLs.
@ -1058,12 +1049,11 @@ Visibility labels can be used in conjunction with ACLs.
. As a prerequisite, perform the steps in <<security.data.basic.server.side,security.data.basic.server.side>>.
. Install and configure the VisibilityController coprocessor by setting the following properties in [path]_hbase-site.xml_.
. Install and configure the VisibilityController coprocessor by setting the following properties in _hbase-site.xml_.
These properties take a list of class names.
+
[source,xml]
----
<property>
<name>hbase.coprocessor.region.classes</name>
<value>org.apache.hadoop.hbase.security.visibility.VisibilityController</value>
@ -1080,7 +1070,7 @@ NOTE: If you use the AccessController and VisibilityController coprocessors toge
+
By default, users can label cells with any label, including labels they are not associated with, which means that a user can Put data that he cannot read.
For example, a user could label a cell with the (hypothetical) 'topsecret' label even if the user is not associated with that label.
If you only want users to be able to label cells with labels they are associated with, set +hbase.security.visibility.mutations.checkauths+ to [literal]+true+.
If you only want users to be able to label cells with labels they are associated with, set +hbase.security.visibility.mutations.checkauths+ to `true`.
In that case, the mutation will fail if it makes use of labels the user is not associated with.
. Distribute your configuration and restart your cluster for changes to take effect.
@ -1093,7 +1083,7 @@ For defining the list of visibility labels and associating labels with users, th
.API Examples
[CAUTION]
====
Many of the Java API examples in this section are taken from the source file [path]_hbase-server/src/test/java/org/apache/hadoop/hbase/security/visibility/TestVisibilityLabels.java_.
Many of the Java API examples in this section are taken from the source file _hbase-server/src/test/java/org/apache/hadoop/hbase/security/visibility/TestVisibilityLabels.java_.
Refer to that file or the API documentation for more context.
Neither these examples, nor the source file they were taken from, are part of the public HBase API, and are provided for illustration only.
@ -1234,7 +1224,6 @@ The correct way to apply cell level labels is to do so in the application code w
====
[source,java]
----
static HTable createTableAndWriteDataWithLabels(TableName tableName, String... labelExps)
throws Exception {
HTable table = null;
@ -1262,9 +1251,9 @@ static HTable createTableAndWriteDataWithLabels(TableName tableName, String... l
==== Implementing Your Own Visibility Label Algorithm
Interpreting the labels authenticated for a given get/scan request is a pluggable algorithm.
You can specify a custom plugin by using the property [code]+hbase.regionserver.scan.visibility.label.generator.class+.
The default implementation class is [code]+org.apache.hadoop.hbase.security.visibility.DefaultScanLabelGenerator+.
You can also configure a set of [code]+ScanLabelGenerators+ to be used by the system, as a comma-separated list.
You can specify a custom plugin by using the property `hbase.regionserver.scan.visibility.label.generator.class`.
The default implementation class is `org.apache.hadoop.hbase.security.visibility.DefaultScanLabelGenerator`.
You can also configure a set of `ScanLabelGenerators` to be used by the system, as a comma-separated list.
==== Replicating Visibility Tags as Strings
@ -1292,7 +1281,7 @@ When it is read, it is decrypted on demand.
The administrator provisions a master key for the cluster, which is stored in a key provider accessible to every trusted HBase process, including the HMaster, RegionServers, and clients (such as HBase Shell) on administrative workstations.
The default key provider is integrated with the Java KeyStore API and any key management systems with support for it.
Other custom key provider implementations are possible.
The key retrieval mechanism is configured in the [path]_hbase-site.xml_ configuration file.
The key retrieval mechanism is configured in the _hbase-site.xml_ configuration file.
The master key may be stored on the cluster servers, protected by a secure KeyStore file, or on an external keyserver, or in a hardware security module.
This master key is resolved as needed by HBase processes through the configured key provider.
@ -1320,8 +1309,9 @@ If you are using a custom implementation, check its documentation and adjust acc
. Create a secret key of appropriate length for AES encryption, using the
[code]+keytool+ utility.
`keytool` utility.
+
[source,bash]
----
$ keytool -keystore /path/to/hbase/conf/hbase.jks \
-storetype jceks -storepass **** \
@ -1337,17 +1327,16 @@ Do not specify a separate password for the key, but press kbd:[Return] when prom
. Set appropriate permissions on the keyfile and distribute it to all the HBase
servers.
+
The previous command created a file called [path]_hbase.jks_ in the HBase [path]_conf/_ directory.
The previous command created a file called _hbase.jks_ in the HBase _conf/_ directory.
Set the permissions and ownership on this file such that only the HBase service account user can read the file, and securely distribute the key to all HBase servers.
. Configure the HBase daemons.
+
Set the following properties in [path]_hbase-site.xml_ on the region servers, to configure HBase daemons to use a key provider backed by the KeyStore file or retrieving the cluster master key.
Set the following properties in _hbase-site.xml_ on the region servers, to configure HBase daemons to use a key provider backed by the KeyStore file for retrieving the cluster master key.
In the example below, replace [replaceable]_****_ with the password.
+
[source,xml]
----
<property>
<name>hbase.crypto.keyprovider</name>
<value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
@ -1363,7 +1352,6 @@ However, you can store it with an arbitrary alias (in the +keytool+ command). In
+
[source,xml]
----
<property>
<name>hbase.crypto.master.key.name</name>
<value>my-alias</value>
@ -1372,11 +1360,10 @@ However, you can store it with an arbitrary alias (in the +keytool+ command). In
+
You also need to be sure your HFiles use HFile v3, in order to use transparent encryption.
This is the default configuration for HBase 1.0 onward.
For previous versions, set the following property in your [path]_hbase-site.xml_ file.
For previous versions, set the following property in your _hbase-site.xml_ file.
+
[source,xml]
----
<property>
<name>hfile.format.version</name>
<value>3</value>
@ -1388,41 +1375,40 @@ Optionally, you can use a different cipher provider, either a Java Cryptography
* JCE:
+
* Install a signed JCE provider (supporting ``AES/CTR/NoPadding'' mode with 128 bit keys)
* Add it with highest preference to the JCE site configuration file [path]_$JAVA_HOME/lib/security/java.security_.
* Update +hbase.crypto.algorithm.aes.provider+ and +hbase.crypto.algorithm.rng.provider+ options in [path]_hbase-site.xml_.
* Add it with highest preference to the JCE site configuration file _$JAVA_HOME/lib/security/java.security_.
* Update +hbase.crypto.algorithm.aes.provider+ and +hbase.crypto.algorithm.rng.provider+ options in _hbase-site.xml_.
* Custom HBase Cipher:
+
* Implement [code]+org.apache.hadoop.hbase.io.crypto.CipherProvider+.
* Implement `org.apache.hadoop.hbase.io.crypto.CipherProvider`.
* Add the implementation to the server classpath.
* Update +hbase.crypto.cipherprovider+ in [path]_hbase-site.xml_.
* Update +hbase.crypto.cipherprovider+ in _hbase-site.xml_.
. Configure WAL encryption.
+
Configure WAL encryption in every RegionServer's [path]_hbase-site.xml_, by setting the following properties.
You can include these in the HMaster's [path]_hbase-site.xml_ as well, but the HMaster does not have a WAL and will not use them.
Configure WAL encryption in every RegionServer's _hbase-site.xml_, by setting the following properties.
You can include these in the HMaster's _hbase-site.xml_ as well, but the HMaster does not have a WAL and will not use them.
+
[source,xml]
----
<property>
<name>hbase.regionserver.hlog.reader.impl</name>
<value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader</value>
<name>hbase.regionserver.hlog.reader.impl</name>
<value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader</value>
</property>
<property>
<name>hbase.regionserver.hlog.writer.impl</name>
<value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter</value>
<name>hbase.regionserver.hlog.writer.impl</name>
<value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter</value>
</property>
<property>
<name>hbase.regionserver.wal.encryption</name>
<value>true</value>
<name>hbase.regionserver.wal.encryption</name>
<value>true</value>
</property>
----
. Configure permissions on the [path]_hbase-site.xml_ file.
. Configure permissions on the _hbase-site.xml_ file.
+
Because the keystore password is stored in the hbase-site.xml, you need to ensure that only the HBase user can read the [path]_hbase-site.xml_ file, using file ownership and permissions.
Because the keystore password is stored in the hbase-site.xml, you need to ensure that only the HBase user can read the _hbase-site.xml_ file, using file ownership and permissions.
. Restart your cluster.
+
@ -1436,7 +1422,7 @@ Administrative tasks can be performed in HBase Shell or the Java API.
.Java API
[CAUTION]
====
Java API examples in this section are taken from the source file [path]_hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsckEncryption.java_.
Java API examples in this section are taken from the source file _hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsckEncryption.java_.
Neither these examples, nor the source files they are taken from, are part of the public HBase API, and are provided for illustration only.
@ -1454,12 +1440,12 @@ Rotate the Data Key::
Until the compaction completes, the old HFiles will still be readable using the old key.
Switching Between Using a Random Data Key and Specifying A Key::
If you configured a column family to use a specific key and you want to return to the default behavior of using a randomly-generated key for that column family, use the Java API to alter the [code]+HColumnDescriptor+ so that no value is sent with the key [literal]+ENCRYPTION_KEY+.
If you configured a column family to use a specific key and you want to return to the default behavior of using a randomly-generated key for that column family, use the Java API to alter the `HColumnDescriptor` so that no value is sent with the key `ENCRYPTION_KEY`.
Rotate the Master Key::
To rotate the master key, first generate and distribute the new key.
Then update the KeyStore to contain a new master key, and keep the old master key in the KeyStore using a different alias.
Next, configure fallback to the old master key in the [path]_hbase-site.xml_ file.
Next, configure fallback to the old master key in the _hbase-site.xml_ file.
::
@ -1467,29 +1453,29 @@ Rotate the Master Key::
=== Secure Bulk Load
Bulk loading in secure mode is a bit more involved than normal setup, since the client has to transfer the ownership of the files generated from the mapreduce job to HBase.
Secure bulk loading is implemented by a coprocessor, named link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/security/access/SecureBulkLoadEndpoint.html[SecureBulkLoadEndpoint], which uses a staging directory configured by the configuration property +hbase.bulkload.staging.dir+, which defaults to [path]_/tmp/hbase-staging/_.
Secure bulk loading is implemented by a coprocessor, named link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/security/access/SecureBulkLoadEndpoint.html[SecureBulkLoadEndpoint], which uses a staging directory configured by the configuration property +hbase.bulkload.staging.dir+, which defaults to _/tmp/hbase-staging/_.
* .Secure Bulk Load AlgorithmOne time only, create a staging directory which is world-traversable and owned by the user which runs HBase (mode 711, or [literal]+rwx--x--x+). A listing of this directory will look similar to the following:
.Secure Bulk Load Algorithm
* One time only, create a staging directory which is world-traversable and owned by the user which runs HBase (mode 711, or `rwx--x--x`). A listing of this directory will look similar to the following:
+
[source,bash]
----
$ ls -ld /tmp/hbase-staging
drwx--x--x 2 hbase hbase 68 3 Sep 14:54 /tmp/hbase-staging
----
* A user writes out data to a secure output directory owned by that user.
For example, [path]_/user/foo/data_.
* Internally, HBase creates a secret staging directory which is globally readable/writable ([code]+-rwxrwxrwx, 777+). For example, [path]_/tmp/hbase-staging/averylongandrandomdirectoryname_.
For example, _/user/foo/data_.
* Internally, HBase creates a secret staging directory which is globally readable/writable (`-rwxrwxrwx, 777`). For example, _/tmp/hbase-staging/averylongandrandomdirectoryname_.
The name and location of this directory is not exposed to the user.
HBase manages creation and deletion of this directory.
* The user makes the data world-readable and world-writable, moves it into the random staging directory, then calls the [code]+SecureBulkLoadClient#bulkLoadHFiles+ method.
* The user makes the data world-readable and world-writable, moves it into the random staging directory, then calls the `SecureBulkLoadClient#bulkLoadHFiles` method.
The strength of the security lies in the length and randomness of the secret directory.
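As a rough, illustrative sketch of the client-side portion of this flow (the path `/user/foo/data` and the table name `mytable` are placeholders, not part of an official procedure):

[source,bash]
----
# Make the HFiles generated by the MapReduce job world-readable and world-writable
$ hadoop fs -chmod -R 777 /user/foo/data

# Hand the HFiles to HBase; with security enabled, the bulk load tool performs
# the staging-directory handshake with SecureBulkLoadEndpoint described above
$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/foo/data mytable
----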
To enable secure bulk load, add the following properties to [path]_hbase-site.xml_.
To enable secure bulk load, add the following properties to _hbase-site.xml_.
[source,xml]
----
<property>
<name>hbase.bulkload.staging.dir</name>
<value>/tmp/hbase-staging</value>
@ -1507,11 +1493,10 @@ To enable secure bulk load, add the following properties to [path]_hbase-site.xm
This configuration example includes support for HFile v3, ACLs, Visibility Labels, and transparent encryption of data at rest and the WAL.
All options have been discussed separately in the sections above.
.Example Security Settings in [path]_hbase-site.xml_
.Example Security Settings in _hbase-site.xml_
====
[source,xml]
----
<!-- HFile v3 Support -->
<property>
<name>hfile.format.version</name>
@ -1598,13 +1583,12 @@ All options have been discussed separately in the sections above.
----
====
.Example Group Mapper in Hadoop [path]_core-site.xml_
.Example Group Mapper in Hadoop _core-site.xml_
====
Adjust these settings to suit your environment.
[source,xml]
----
<property>
<name>hadoop.security.group.mapping</name>
<value>org.apache.hadoop.security.LdapGroupsMapping</value>

View File

@ -33,7 +33,7 @@ Anything you can do in IRB, you should be able to do in the HBase Shell.
To run the HBase shell, do as follows:
[source]
[source,bash]
----
$ ./bin/hbase shell
----
@ -49,11 +49,11 @@ Here is a nicely formatted listing of link:http://learnhbase.wordpress.com/2013/
[[scripting]]
== Scripting with Ruby
For examples scripting Apache HBase, look in the HBase [path]_bin_ directory.
Look at the files that end in [path]_*.rb_.
For examples scripting Apache HBase, look in the HBase _bin_ directory.
Look at the files that end in _*.rb_.
To run one of these files, do as follows:
[source]
[source,bash]
----
$ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT
----
@ -62,7 +62,7 @@ $ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT
A new non-interactive mode has been added to the HBase Shell (link:https://issues.apache.org/jira/browse/HBASE-11658[HBASE-11658)].
Non-interactive mode captures the exit status (success or failure) of HBase Shell commands and passes that status back to the command interpreter.
If you use the normal interactive mode, the HBase Shell will only ever return its own exit status, which will nearly always be [literal]+0+ for success.
If you use the normal interactive mode, the HBase Shell will only ever return its own exit status, which will nearly always be `0` for success.
To invoke non-interactive mode, pass the +-n+ or +--non-interactive+ option to HBase Shell.
@ -77,10 +77,11 @@ NOTE: Spawning HBase Shell commands in this way is slow, so keep that in mind wh
.Passing Commands to the HBase Shell
====
You can pass commands to the HBase Shell in non-interactive mode (see <<hbasee.shell.noninteractive,hbasee.shell.noninteractive>>) using the +echo+ command and the [literal]+|+ (pipe) operator.
You can pass commands to the HBase Shell in non-interactive mode (see <<hbasee.shell.noninteractive,hbasee.shell.noninteractive>>) using the +echo+ command and the `|` (pipe) operator.
Be sure to escape characters in the HBase commands which would otherwise be interpreted by the shell.
Some debug-level output has been truncated from the example below.
[source,bash]
----
$ echo "describe 'test1'" | ./hbase shell -n
@ -98,8 +99,9 @@ DESCRIPTION ENABLED
1 row(s) in 3.2410 seconds
----
To suppress all output, echo it to [path]_/dev/null:_
To suppress all output, echo it to _/dev/null:_
[source,bash]
----
$ echo "describe 'test'" | ./hbase shell -n > /dev/null 2>&1
----
@ -108,15 +110,14 @@ $ echo "describe 'test'" | ./hbase shell -n > /dev/null 2>&1
.Checking the Result of a Scripted Command
====
Since scripts are not designed to be run interactively, you need a way to check whether your command failed or succeeded.
The HBase shell uses the standard convention of returning a value of [literal]+0+ for successful commands, and some non-zero value for failed commands.
Bash stores a command's return value in a special environment variable called [var]+$?+.
The HBase shell uses the standard convention of returning a value of `0` for successful commands, and some non-zero value for failed commands.
Bash stores a command's return value in a special environment variable called `$?`.
Because that variable is overwritten each time the shell runs any command, you should store the result in a different, script-defined variable.
This is a naive script that shows one way to store the return value and make a decision based upon it.
[source,bourne]
[source,bash]
----
#!/bin/bash
echo "describe 'test'" | ./hbase shell -n > /dev/null 2>&1
@ -147,7 +148,6 @@ You can enter HBase Shell commands into a text file, one command per line, and p
.Example Command File
====
----
create 'test', 'cf'
list 'test'
put 'test', 'row1', 'cf:a', 'value1'
@ -170,8 +170,8 @@ If you do not include the +exit+ command in your script, you are returned to the
There is no way to programmatically check each individual command for success or failure.
Also, though you see the output for each command, the commands themselves are not echoed to the screen so it can be difficult to line up the command with its output.
[source,bash]
----
$ ./hbase shell ./sample_commands.txt
0 row(s) in 3.4170 seconds
@ -206,13 +206,13 @@ COLUMN CELL
== Passing VM Options to the Shell
You can pass VM options to the HBase Shell using the [code]+HBASE_SHELL_OPTS+ environment variable.
You can set this in your environment, for instance by editing [path]_~/.bashrc_, or set it as part of the command to launch HBase Shell.
You can pass VM options to the HBase Shell using the `HBASE_SHELL_OPTS` environment variable.
You can set this in your environment, for instance by editing _~/.bashrc_, or set it as part of the command to launch HBase Shell.
The following example sets several garbage-collection-related variables, just for the lifetime of the VM running the HBase Shell.
The command should be run all on a single line, but is broken by the [literal]+\+ character, for readability.
The command should be run all on a single line, but is broken by the `\` character, for readability.
[source,bash]
----
$ HBASE_SHELL_OPTS="-verbose:gc -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps \
-XX:+PrintGCDetails -Xloggc:$HBASE_HOME/logs/gc-hbase.log" ./bin/hbase shell
----
@ -229,7 +229,6 @@ The table reference can be used to perform data read write operations such as pu
For example, previously you would always specify a table name:
----
hbase(main):000:0> create t, f
0 row(s) in 1.0970 seconds
hbase(main):001:0> put 't', 'rold', 'f', 'v'
@ -260,7 +259,6 @@ hbase(main):006:0>
Now you can assign the table to a variable and use the results in jruby shell code.
----
hbase(main):007 > t = create 't', 'f'
0 row(s) in 1.0970 seconds
@ -287,7 +285,6 @@ hbase(main):039:0> t.drop
If the table has already been created, you can assign a Table to a variable by using the get_table method:
----
hbase(main):011 > create 't','f'
0 row(s) in 1.2500 seconds
@ -310,7 +307,6 @@ You can then use jruby to script table operations based on these names.
The list_snapshots command also acts similarly.
----
hbase(main):016 > tables = list(t.*)
TABLE
t
@ -324,28 +320,26 @@ hbase(main):017:0> tables.map { |t| disable t ; drop t}
hbase(main):018:0>
----
=== [path]_irbrc_
=== _irbrc_
Create an [path]_.irbrc_ file for yourself in your home directory.
Create an _.irbrc_ file for yourself in your home directory.
Add customizations.
A useful one is command history, so commands are saved across Shell invocations:
[source,bash]
----
$ more .irbrc
require 'irb/ext/save-history'
IRB.conf[:SAVE_HISTORY] = 100
IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history"
----
See the +ruby+ documentation of [path]_.irbrc_ to learn about other possible configurations.
See the +ruby+ documentation of _.irbrc_ to learn about other possible configurations.
=== LOG data to timestamp
To convert the date '08/08/16 20:56:29' from an hbase log into a timestamp, do:
----
hbase(main):021:0> import java.text.SimpleDateFormat
hbase(main):022:0> import java.text.ParsePosition
hbase(main):023:0> SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("08/08/16 20:56:29", ParsePosition.new(0)).getTime() => 1218920189000
@ -354,7 +348,6 @@ hbase(main):023:0> SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("08/08/16 20:
To go the other direction:
----
hbase(main):021:0> import java.util.Date
hbase(main):022:0> Date.new(1218920189000).toString() => "Sat Aug 16 20:56:29 UTC 2008"
----
@ -377,7 +370,7 @@ hbase> debug <RETURN>
To enable DEBUG level logging in the shell, launch it with the +-d+ option.
[source]
[source,bash]
----
$ ./bin/hbase shell -d
----

View File

@ -42,7 +42,7 @@ The rest of this chapter discusses the filter language provided by the Thrift AP
Thrift Filter Language was introduced in Apache HBase 0.92.
It allows you to perform server-side filtering when accessing HBase over Thrift or in the HBase shell.
You can find out more about shell integration by using the [code]+scan help+ command in the shell.
You can find out more about shell integration by using the `scan help` command in the shell.
You specify a filter as a string, which is parsed on the server to construct the filter.
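For instance, a hedged sketch of passing such a filter string from the shell in non-interactive mode (the table name `test1` and the filter itself are only examples):

[source,bash]
----
$ echo "scan 'test1', {FILTER => \"PrefixFilter ('Row') AND PageFilter (1)\"}" | ./bin/hbase shell -n
----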
@ -58,7 +58,7 @@ A simple filter expression is expressed as a string:
Keep the following syntax guidelines in mind.
* Specify the name of the filter followed by the comma-separated argument list in parentheses.
* If the argument represents a string, it should be enclosed in single quotes ([literal]+'+).
* If the argument represents a string, it should be enclosed in single quotes (`'`).
* Arguments which represent a boolean, an integer, or a comparison operator (such as <, >, or !=) should not be enclosed in quotes.
* The filter name must be a single word.
All ASCII characters are allowed except for whitespace, single quotes and parentheses.
@ -68,17 +68,17 @@ Keep the following syntax guidelines in mind.
=== Compound Filters and Operators
.Binary Operators
[code]+AND+::
If the [code]+AND+ operator is used, the key-vallue must satisfy both the filters.
`AND`::
If the `AND` operator is used, the key-value must satisfy both filters.
[code]+OR+::
If the [code]+OR+ operator is used, the key-value must satisfy at least one of the filters.
`OR`::
If the `OR` operator is used, the key-value must satisfy at least one of the filters.
.Unary Operators
[code]+SKIP+::
`SKIP`::
For a particular row, if any of the key-values fail the filter condition, the entire row is skipped.
[code]+WHILE+::
`WHILE`::
For a particular row, key-values will be emitted until a key-value is reached that fails the filter condition.
.Compound Operators
@ -93,8 +93,8 @@ You can combine multiple operators to create a hierarchy of filters, such as the
=== Order of Evaluation
. Parentheses have the highest precedence.
. The unary operators [code]+SKIP+ and [code]+WHILE+ are next, and have the same precedence.
. The binary operators follow. [code]+AND+ has highest precedence, followed by [code]+OR+.
. The unary operators `SKIP` and `WHILE` are next, and have the same precedence.
. The binary operators follow. `AND` has highest precedence, followed by `OR`.
.Precedence Example
====
@ -142,8 +142,8 @@ A comparator can be any of the following:
The comparison is case insensitive.
Only EQUAL and NOT_EQUAL comparisons are valid with this comparator
The general syntax of a comparator is:[code]+
ComparatorType:ComparatorValue+
The general syntax of a comparator is: `ComparatorType:ComparatorValue`
The ComparatorType for the various comparators is as follows:
@ -184,8 +184,8 @@ The ComparatorValue can be any value.
=== Example Filter Strings
* [code]+“PrefixFilter (Row) AND PageFilter (1) AND FirstKeyOnlyFilter
()”+ will return all key-value pairs that match the following conditions:
* `“PrefixFilter (Row) AND PageFilter (1) AND FirstKeyOnlyFilter
()”` will return all key-value pairs that match the following conditions:
+
. The row containing the key-value should have prefix ``Row''
. The key-value must be located in the first row of the table
@ -193,9 +193,9 @@ The ComparatorValue can be any value.
* [code]+“(RowFilter (=, binary:Row 1) AND TimeStampsFilter (74689,
* `“(RowFilter (=, binary:Row 1) AND TimeStampsFilter (74689,
89734)) OR ColumnRangeFilter (abc, true, xyz,
false))”+ will return all key-value pairs that match both the following conditions:
false))”` will return all key-value pairs that match both the following conditions:
+
* The key-value is in a row having row key ``Row 1''
* The key-value must have a timestamp of either 74689 or 89734.
@ -206,7 +206,7 @@ The ComparatorValue can be any value.
* [code]+“SKIP ValueFilter (0)”+ will skip the entire row if any of the values in the row is not 0
* `“SKIP ValueFilter (0)”` will skip the entire row if any of the values in the row is not 0
[[individualfiltersyntax]]
=== Individual Filter Syntax
@ -226,12 +226,12 @@ PrefixFilter::
ColumnPrefixFilter::
This filter takes one argument, a column prefix.
It returns only those key-values present in a column that starts with the specified column prefix.
The column prefix must be of the form: [code]+“qualifier”+.
The column prefix must be of the form: `“qualifier”`.
MultipleColumnPrefixFilter::
This filter takes a list of column prefixes.
It returns key-values that are present in a column that starts with any of the specified column prefixes.
Each of the column prefixes must be of the form: [code]+“qualifier”+.
Each of the column prefixes must be of the form: `“qualifier”`.
ColumnCountGetFilter::
This filter takes one argument, a limit.

View File

@ -36,7 +36,7 @@ Setting up tracing is quite simple, however it currently requires some very mino
[[tracing.spanreceivers]]
=== SpanReceivers
The tracing system works by collecting information in structs called 'Spans'. It is up to you to choose how you want to receive this information by implementing the [class]+SpanReceiver+ interface, which defines one method:
The tracing system works by collecting information in structs called 'Spans'. It is up to you to choose how you want to receive this information by implementing the `SpanReceiver` interface, which defines one method:
[source]
----
@ -47,10 +47,10 @@ public void receiveSpan(Span span);
This method serves as a callback whenever a span is completed.
HTrace allows you to use as many SpanReceivers as you want so you can easily send trace information to multiple destinations.
Configure what SpanReceivers you'd like to us by putting a comma separated list of the fully-qualified class name of classes implementing [class]+SpanReceiver+ in [path]_hbase-site.xml_ property: [var]+hbase.trace.spanreceiver.classes+.
Configure which SpanReceivers you'd like to use by putting a comma-separated list of the fully-qualified class names of classes implementing `SpanReceiver` in the _hbase-site.xml_ property `hbase.trace.spanreceiver.classes`.
HTrace includes a [class]+LocalFileSpanReceiver+ that writes all span information to local files in a JSON-based format.
The [class]+LocalFileSpanReceiver+ looks in [path]_hbase-site.xml_ for a [var]+hbase.local-file-span-receiver.path+ property with a value describing the name of the file to which nodes should write their span information.
HTrace includes a `LocalFileSpanReceiver` that writes all span information to local files in a JSON-based format.
The `LocalFileSpanReceiver` looks in _hbase-site.xml_ for a `hbase.local-file-span-receiver.path` property with a value describing the name of the file to which nodes should write their span information.
[source]
----
@ -65,10 +65,10 @@ The [class]+LocalFileSpanReceiver+ looks in [path]_hbase-site.xml_ for a [v
</property>
----
HTrace also provides [class]+ZipkinSpanReceiver+ which converts spans to link:http://github.com/twitter/zipkin[Zipkin] span format and send them to Zipkin server.
HTrace also provides `ZipkinSpanReceiver`, which converts spans to the link:http://github.com/twitter/zipkin[Zipkin] span format and sends them to a Zipkin server.
In order to use this span receiver, you need to install the htrace-zipkin jar on HBase's classpath on all of the nodes in your cluster.
[path]_htrace-zipkin_ is published to the maven central repository.
_htrace-zipkin_ is published to the maven central repository.
You could get the latest version from there or just build it locally and then copy it out to all nodes, change your config to use zipkin receiver, distribute the new configuration and then (rolling) restart.
Here is an example of the manual setup procedure.
@ -82,7 +82,7 @@ $ cp target/htrace-zipkin-*-jar-with-dependencies.jar $HBASE_HOME/lib/
# copy jar to all nodes...
----
The [class]+ZipkinSpanReceiver+ looks in [path]_hbase-site.xml_ for a [var]+hbase.zipkin.collector-hostname+ and [var]+hbase.zipkin.collector-port+ property with a value describing the Zipkin collector server to which span information are sent.
The `ZipkinSpanReceiver` looks in _hbase-site.xml_ for the `hbase.zipkin.collector-hostname` and `hbase.zipkin.collector-port` properties, whose values describe the Zipkin collector server to which span information is sent.
[source,xml]
----
@ -101,7 +101,7 @@ The [class]+ZipkinSpanReceiver+ looks in [path]_hbase-site.xml_ for a [var]
</property>
----
If you do not want to use the included span receivers, you are encouraged to write your own receiver (take a look at [class]+LocalFileSpanReceiver+ for an example). If you think others would benefit from your receiver, file a JIRA or send a pull request to link:http://github.com/cloudera/htrace[HTrace].
If you do not want to use the included span receivers, you are encouraged to write your own receiver (take a look at `LocalFileSpanReceiver` for an example). If you think others would benefit from your receiver, file a JIRA or send a pull request to link:http://github.com/cloudera/htrace[HTrace].
[[tracing.client.modifications]]
== Client Modifications
@ -153,8 +153,8 @@ If you wanted to trace half of your 'get' operations, you would pass in:
new ProbabilitySampler(0.5)
----
in lieu of [var]+Sampler.ALWAYS+ to [class]+Trace.startSpan()+.
See the HTrace [path]_README_ for more information on Samplers.
in lieu of `Sampler.ALWAYS` to `Trace.startSpan()`.
See the HTrace _README_ for more information on Samplers.
[[tracing.client.shell]]
== Tracing from HBase Shell

View File

@ -49,19 +49,19 @@ For more information on GC pauses, see the link:http://www.cloudera.com/blog/201
The key process logs are as follows... (replace <user> with the user that started the service, and <hostname> with the machine name)
NameNode: [path]_$HADOOP_HOME/logs/hadoop-<user>-namenode-<hostname>.log_
NameNode: _$HADOOP_HOME/logs/hadoop-<user>-namenode-<hostname>.log_
DataNode: [path]_$HADOOP_HOME/logs/hadoop-<user>-datanode-<hostname>.log_
DataNode: _$HADOOP_HOME/logs/hadoop-<user>-datanode-<hostname>.log_
JobTracker: [path]_$HADOOP_HOME/logs/hadoop-<user>-jobtracker-<hostname>.log_
JobTracker: _$HADOOP_HOME/logs/hadoop-<user>-jobtracker-<hostname>.log_
TaskTracker: [path]_$HADOOP_HOME/logs/hadoop-<user>-tasktracker-<hostname>.log_
TaskTracker: _$HADOOP_HOME/logs/hadoop-<user>-tasktracker-<hostname>.log_
HMaster: [path]_$HBASE_HOME/logs/hbase-<user>-master-<hostname>.log_
HMaster: _$HBASE_HOME/logs/hbase-<user>-master-<hostname>.log_
RegionServer: [path]_$HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log_
RegionServer: _$HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log_
ZooKeeper: [path]_TODO_
ZooKeeper: _TODO_
[[trouble.log.locations]]
=== Log Locations
@ -94,17 +94,17 @@ Enabling the RPC-level logging on a RegionServer can often given insight on timi
Once enabled, the amount of log spewed is voluminous.
It is not recommended that you leave this logging on for more than short bursts of time.
To enable RPC-level logging, browse to the RegionServer UI and click on _Log Level_.
Set the log level to [var]+DEBUG+ for the package [class]+org.apache.hadoop.ipc+ (Thats right, for [class]+hadoop.ipc+, NOT, [class]+hbase.ipc+). Then tail the RegionServers log.
Set the log level to `DEBUG` for the package `org.apache.hadoop.ipc` (that's right, `hadoop.ipc`, NOT `hbase.ipc`). Then tail the RegionServer's log.
Analyze.
To disable, set the logging level back to [var]+INFO+ level.
To disable, set the logging level back to `INFO` level.
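If you prefer the command line to the UI, one possible sketch uses Hadoop's `daemonlog` utility against the RegionServer's web UI port; the hostname and port below are assumptions (the info port is 60030 on older releases and 16030 on 1.0+).

[source,bash]
----
# Raise the IPC logging level on one RegionServer
$ hadoop daemonlog -setlevel rs1.example.com:16030 org.apache.hadoop.ipc DEBUG

# Set it back when you are done
$ hadoop daemonlog -setlevel rs1.example.com:16030 org.apache.hadoop.ipc INFO
----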
[[trouble.log.gc]]
=== JVM Garbage Collection Logs
HBase is memory intensive, and using the default GC you can see long pauses in all threads, including the _Juliet Pause_ aka "GC of Death". To help debug this, or to confirm it is happening, GC logging can be turned on in the Java virtual machine.
To enable, in [path]_hbase-env.sh_, uncomment one of the below lines :
To enable, in _hbase-env.sh_, uncomment one of the below lines:
[source,bourne]
----
@ -188,14 +188,14 @@ CMS pauses are always low, but if your ParNew starts growing, you can see minor
This can be due to the size of the ParNew, which should be relatively small.
If your ParNew is very large after running HBase for a while (in one example a ParNew was about 150MB), then you might have to constrain the size of ParNew (the larger it is, the longer the collections take, but if it's too small, objects are promoted to old gen too quickly). In the below we constrain new gen size to 64m.
Add the below line in [path]_hbase-env.sh_:
Add the below line in _hbase-env.sh_:
[source,bourne]
----
export SERVER_GC_OPTS="$SERVER_GC_OPTS -XX:NewSize=64m -XX:MaxNewSize=64m"
----
Similarly, to enable GC logging for client processes, uncomment one of the below lines in [path]_hbase-env.sh_:
Similarly, to enable GC logging for client processes, uncomment one of the below lines in _hbase-env.sh_:
[source,bourne]
----
@ -273,7 +273,7 @@ See <<hbase_metrics,hbase metrics>> for more information in metric definitions.
[[trouble.tools.builtin.zkcli]]
==== zkcli
[code]+zkcli+ is a very useful tool for investigating ZooKeeper-related issues.
`zkcli` is a very useful tool for investigating ZooKeeper-related issues.
To invoke:
[source,bourne]
----
@ -312,14 +312,14 @@ The commands (and arguments) are:
[[trouble.tools.tail]]
==== tail
[code]+tail+ is the command line tool that lets you look at the end of a file.
`tail` is the command line tool that lets you look at the end of a file.
Add the ``-f'' option and it will refresh when new data is available.
It's useful when you are wondering what's happening, for example, when a cluster is taking a long time to shut down or start up, as you can just fire up a new terminal and tail the master log (and maybe a few RegionServers).
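For example, a minimal sketch (this assumes `$HBASE_HOME` is set, and that HBase was started by the `hbase` user; adjust the file name as described in the log listing above):

[source,bash]
----
$ tail -f $HBASE_HOME/logs/hbase-hbase-regionserver-$(hostname).log
----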
[[trouble.tools.top]]
==== top
[code]+top+ is probably one of the most important tool when first trying to see what's running on a machine and how the resources are consumed.
`top` is probably one of the most important tools when first trying to see what's running on a machine and how the resources are consumed.
Here's an example from a production system:
[source]
@ -351,7 +351,7 @@ Typing ``1'' will give you the detail of how each CPU is used instead of the ave
[[trouble.tools.jps]]
==== jps
[code]+jps+ is shipped with every JDK and gives the java process ids for the current user (if root, then it gives the ids for all users). Example:
`jps` is shipped with every JDK and gives the java process ids for the current user (if root, then it gives the ids for all users). Example:
[source,bourne]
----
@ -389,7 +389,7 @@ hadoop 17789 155 35.2 9067824 8604364 ? S&lt;l Mar04 9855:48 /usr/java/j
[[trouble.tools.jstack]]
==== jstack
[code]+jstack+ is one of the most important tools when trying to figure out what a java process is doing apart from looking at the logs.
`jstack` is one of the most important tools when trying to figure out what a java process is doing apart from looking at the logs.
It has to be used in conjunction with jps in order to give it a process id.
It shows a list of threads, each one has a name, and they appear in the order that they were created (so the top ones are the most recent threads). Here are a few examples:
@ -566,36 +566,36 @@ For more information on the HBase client, see <<client,client>>.
=== ScannerTimeoutException or UnknownScannerException
This is thrown if the time between RPC calls from the client to RegionServer exceeds the scan timeout.
For example, if [code]+Scan.setCaching+ is set to 500, then there will be an RPC call to fetch the next batch of rows every 500 [code]+.next()+ calls on the ResultScanner because data is being transferred in blocks of 500 rows to the client.
For example, if `Scan.setCaching` is set to 500, then there will be an RPC call to fetch the next batch of rows every 500 `.next()` calls on the ResultScanner because data is being transferred in blocks of 500 rows to the client.
Reducing the setCaching value may be an option, but setting this value too low makes for inefficient processing of rows.
See <<perf.hbase.client.caching,perf.hbase.client.caching>>.
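As an illustrative sketch, the caching for a scan can also be set from the shell; the table name and value below are placeholders.

[source,bash]
----
$ echo "scan 'test', {CACHING => 100}" | ./bin/hbase shell -n
----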
=== Performance Differences in Thrift and Java APIs
Poor performance, or even [code]+ScannerTimeoutExceptions+, can occur if [code]+Scan.setCaching+ is too high, as discussed in <<trouble.client.scantimeout,trouble.client.scantimeout>>.
Poor performance, or even `ScannerTimeoutExceptions`, can occur if `Scan.setCaching` is too high, as discussed in <<trouble.client.scantimeout,trouble.client.scantimeout>>.
If the Thrift client uses the wrong caching settings for a given workload, performance can suffer compared to the Java API.
To set caching for a given scan in the Thrift client, use the [code]+scannerGetList(scannerId,
numRows)+ method, where [code]+numRows+ is an integer representing the number of rows to cache.
To set caching for a given scan in the Thrift client, use the `scannerGetList(scannerId,
numRows)` method, where `numRows` is an integer representing the number of rows to cache.
In one case, it was found that reducing the cache for Thrift scans from 1000 to 100 increased performance to near parity with the Java API given the same queries.
See also Jesse Andersen's link:http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/[blog post] about using Scans with Thrift.
[[trouble.client.lease.exception]]
=== [class]+LeaseException+ when calling[class]+Scanner.next+
=== `LeaseException` when calling `Scanner.next`
In some situations clients that fetch data from a RegionServer get a LeaseException instead of the usual <<trouble.client.scantimeout,trouble.client.scantimeout>>.
Usually the source of the exception is [class]+org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)+ (line number may vary). It tends to happen in the context of a slow/freezing RegionServer#next call.
It can be prevented by having [var]+hbase.rpc.timeout+ > [var]+hbase.regionserver.lease.period+.
Usually the source of the exception is `org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)` (line number may vary). It tends to happen in the context of a slow/freezing RegionServer#next call.
It can be prevented by having `hbase.rpc.timeout` > `hbase.regionserver.lease.period`.
Harsh J investigated the issue as part of the mailing list thread link:http://mail-archives.apache.org/mod_mbox/hbase-user/201209.mbox/%3CCAOcnVr3R-LqtKhFsk8Bhrm-YW2i9O6J6Fhjz2h7q6_sxvwd2yw%40mail.gmail.com%3E[HBase,
mail # user - Lease does not exist exceptions]
[[trouble.client.scarylogs]]
=== Shell or client application throws lots of scary exceptions during normal operation
Since 0.20.0 the default log level for [code]+org.apache.hadoop.hbase.*+is DEBUG.
Since 0.20.0 the default log level for `org.apache.hadoop.hbase.*` is DEBUG.
On your clients, edit [path]_$HBASE_HOME/conf/log4j.properties_ and change this: [code]+log4j.logger.org.apache.hadoop.hbase=DEBUG+ to this: [code]+log4j.logger.org.apache.hadoop.hbase=INFO+, or even [code]+log4j.logger.org.apache.hadoop.hbase=WARN+.
On your clients, edit _$HBASE_HOME/conf/log4j.properties_ and change this: `log4j.logger.org.apache.hadoop.hbase=DEBUG` to this: `log4j.logger.org.apache.hadoop.hbase=INFO`, or even `log4j.logger.org.apache.hadoop.hbase=WARN`.
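A rough sketch of making that edit from the command line (assuming `$HBASE_HOME` points at your client install; you can of course edit the file by hand instead):

[source,bash]
----
$ sed -i 's/log4j.logger.org.apache.hadoop.hbase=DEBUG/log4j.logger.org.apache.hadoop.hbase=INFO/' \
    $HBASE_HOME/conf/log4j.properties
----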
[[trouble.client.longpauseswithcompression]]
=== Long Client Pauses With Compression
@ -606,7 +606,7 @@ Compression can exacerbate the pauses, although it is not the source of the prob
See <<precreate.regions,precreate.regions>> on the pattern for pre-creating regions and confirm that the table isn't starting with a single region.
See <<perf.configurations,perf.configurations>> for cluster configuration, particularly [code]+hbase.hstore.blockingStoreFiles+, [code]+hbase.hregion.memstore.block.multiplier+, [code]+MAX_FILESIZE+ (region size), and [code]+MEMSTORE_FLUSHSIZE.+
See <<perf.configurations,perf.configurations>> for cluster configuration, particularly `hbase.hstore.blockingStoreFiles`, `hbase.hregion.memstore.block.multiplier`, `MAX_FILESIZE` (region size), and `MEMSTORE_FLUSHSIZE`.
A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes blocked on the MemStores which are blocked by the flusher thread which is blocked because there are too many files to compact because the compactor is given too many small files to compact and has to compact the same data repeatedly.
This situation can occur even with minor compactions.
@ -631,7 +631,7 @@ Secure Client Connect ([Caused by GSSException: No valid credentials provided
This issue is caused by bugs in the MIT Kerberos replay_cache component, link:http://krbdev.mit.edu/rt/Ticket/Display.html?id=1201[#1201] and link:http://krbdev.mit.edu/rt/Ticket/Display.html?id=5924[#5924].
These bugs caused the old version of krb5-server to erroneously block subsequent requests sent from a Principal.
This caused krb5-server to block the connections sent from one Client (one HTable instance with multi-threading connection instances for each regionserver); Messages, such as [literal]+Request is a replay (34)+, are logged in the client log You can ignore the messages, because HTable will retry 5 * 10 (50) times for each failed connection by default.
This caused krb5-server to block the connections sent from one Client (one HTable instance with multi-threading connection instances for each regionserver). Messages such as `Request is a replay (34)` are logged in the client log. You can ignore the messages, because HTable will retry 5 * 10 (50) times for each failed connection by default.
HTable will throw IOException if any connection to the regionserver fails after the retries, so that the user client code for HTable instance can handle it further.
Alternatively, update krb5-server to a version which solves these issues, such as krb5-server-1.10.3.
@ -673,8 +673,8 @@ The utility <<trouble.tools.builtin.zkcli,trouble.tools.builtin.zkcli>> may help
You are likely running into the issue that is described and worked through in the mail thread link:http://search-hadoop.com/m/ubhrX8KvcH/Suspected+memory+leak&subj=Re+Suspected+memory+leak[HBase,
mail # user - Suspected memory leak] and continued over in link:http://search-hadoop.com/m/p2Agc1Zy7Va/MaxDirectMemorySize+Was%253A+Suspected+memory+leak&subj=Re+FeedbackRe+Suspected+memory+leak[HBase,
mail # dev - FeedbackRe: Suspected memory leak].
A workaround is passing your client-side JVM a reasonable value for [code]+-XX:MaxDirectMemorySize+.
By default, the [var]+MaxDirectMemorySize+ is equal to your [code]+-Xmx+ max heapsize setting (if [code]+-Xmx+ is set). Try seting it to something smaller (for example, one user had success setting it to [code]+1g+ when they had a client-side heap of [code]+12g+). If you set it too small, it will bring on [code]+FullGCs+ so keep it a bit hefty.
A workaround is passing your client-side JVM a reasonable value for `-XX:MaxDirectMemorySize`.
By default, the `MaxDirectMemorySize` is equal to your `-Xmx` max heapsize setting (if `-Xmx` is set). Try setting it to something smaller (for example, one user had success setting it to `1g` when they had a client-side heap of `12g`). If you set it too small, it will bring on `FullGCs` so keep it a bit hefty.
You want to make this setting client-side only, especially if you are running the new experimental server-side off-heap cache, since this feature depends on being able to use big direct buffers (You may have to keep separate client-side and server-side config dirs).
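For the HBase Shell specifically, one hedged way to pass the flag is through the `HBASE_SHELL_OPTS` environment variable; the `1g` value is only an example. Other client applications would add the flag to their own JVM command line.

[source,bash]
----
$ HBASE_SHELL_OPTS="-XX:MaxDirectMemorySize=1g" ./bin/hbase shell
----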
[[trouble.client.slowdown.admin]]
@ -715,7 +715,7 @@ Uncompress and extract the downloaded file, and install the policy jars into <ja
[[trouble.mapreduce.local]]
=== You Think You're On The Cluster, But You're Actually Local
This following stacktrace happened using [code]+ImportTsv+, but things like this can happen on any job with a mis-configuration.
The following stacktrace happened when using `ImportTsv`, but things like this can happen on any job with a misconfiguration.
[source]
----
@ -748,7 +748,7 @@ at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
LocalJobRunner means the job is running locally, not on the cluster.
To solve this problem, you should run your MR job with your [code]+HADOOP_CLASSPATH+ set to include the HBase dependencies.
To solve this problem, you should run your MR job with your `HADOOP_CLASSPATH` set to include the HBase dependencies.
The "hbase classpath" utility can be used to do this easily.
For example (substitute VERSION with your HBase version):
@ -776,7 +776,7 @@ For more information on the NameNode, see <<arch.hdfs,arch.hdfs>>.
[[trouble.namenode.disk]]
=== HDFS Utilization of Tables and Regions
To determine how much space HBase is using on HDFS use the [code]+hadoop+ shell commands from the NameNode.
To determine how much space HBase is using on HDFS use the `hadoop` shell commands from the NameNode.
For example...
@ -833,7 +833,7 @@ The HDFS directory structure of HBase WAL is..
----
See the link:http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html[HDFS User
Guide] for other non-shell diagnostic utilities like [code]+fsck+.
Guide] for other non-shell diagnostic utilities like `fsck`.
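For example, a minimal sketch of checking the health of the HBase directory tree, assuming the default root directory of _/hbase_:

[source,bash]
----
$ hadoop fsck /hbase -files
----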
[[trouble.namenode.0size.hlogs]]
==== Zero size WALs with data in them
@ -856,7 +856,7 @@ Additionally, after a major compaction if the resulting StoreFile is "small" it
[[trouble.network.spikes]]
=== Network Spikes
If you are seeing periodic network spikes you might want to check the [code]+compactionQueues+ to see if major compactions are happening.
If you are seeing periodic network spikes you might want to check the `compactionQueues` to see if major compactions are happening.
See <<managed.compactions,managed.compactions>> for more information on managing compactions.
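One hedged way to spot-check this is the JMX servlet on a RegionServer's web UI; the hostname, port, and metric name below are assumptions (the info port is 60030 on older releases and 16030 on 1.0+).

[source,bash]
----
# Look for compaction queue metrics in the RegionServer's JMX output
$ curl -s http://rs1.example.com:16030/jmx | grep -i compactionqueue
----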
@ -886,7 +886,7 @@ The Master believes the RegionServers have the IP of 127.0.0.1 - which is localh
The RegionServers are erroneously informing the Master that their IP addresses are 127.0.0.1.
Modify [path]_/etc/hosts_ on the region servers, from...
Modify _/etc/hosts_ on the region servers, from...
[source]
----
@ -933,7 +933,7 @@ See the Configuration section on link:[LZO compression configuration].
Are you running an old JVM (< 1.6.0_u21?)? When you look at a thread dump, does it look like threads are BLOCKED but no one holds the lock they are all blocked on? See link:https://issues.apache.org/jira/browse/HBASE-3622[HBASE 3622 Deadlock in
HBaseServer (JVM bug?)].
Adding [code]`-XX:+UseMembar` to the HBase [var]+HBASE_OPTS+ in [path]_conf/hbase-env.sh_ may fix it.
Adding `-XX:+UseMembar` to the HBase `HBASE_OPTS` in _conf/hbase-env.sh_ may fix it.
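For illustration, the line to add (or extend) in _conf/hbase-env.sh_ would look something like this; adjust to preserve any options you already set.

[source,bash]
----
# In conf/hbase-env.sh
export HBASE_OPTS="$HBASE_OPTS -XX:+UseMembar"
----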
[[trouble.rs.runtime.filehandles]]
==== java.io.IOException...(Too many open files)
@ -1013,13 +1013,13 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi
The JVM is doing a long running garbage collecting which is pausing every threads (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out.
By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere.
* Make sure you give plenty of RAM (in [path]_hbase-env.sh_), the default of 1GB won't be able to sustain long running imports.
* Make sure you give plenty of RAM (in _hbase-env.sh_), the default of 1GB won't be able to sustain long running imports.
* Make sure you don't swap, the JVM never behaves well under swapping.
* Make sure you are not CPU starving the RegionServer thread.
For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.
* Increase the ZooKeeper session timeout
If you wish to increase the session timeout, add the following to your [path]_hbase-site.xml_ to increase the timeout from the default of 60 seconds to 120 seconds.
If you wish to increase the session timeout, add the following to your _hbase-site.xml_ to increase the timeout from the default of 60 seconds to 120 seconds.
[source,xml]
----
@ -1138,10 +1138,10 @@ A ZooKeeper server wasn't able to start, throws that error.
xyz is the name of your server.
This is a name lookup problem.
HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the [var]+hbase.zookeeper.quorum+ configuration.
HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the `hbase.zookeeper.quorum` configuration.
Use the hostname presented in the error message instead of the value you used.
If you have a DNS server, you can set [var]+hbase.zookeeper.dns.interface+ and [var]+hbase.zookeeper.dns.nameserver+ in [path]_hbase-site.xml_ to make sure it resolves to the correct FQDN.
If you have a DNS server, you can set `hbase.zookeeper.dns.interface` and `hbase.zookeeper.dns.nameserver` in _hbase-site.xml_ to make sure it resolves to the correct FQDN.
[[trouble.zookeeper.general]]
=== ZooKeeper, The Cluster Canary
@ -1191,10 +1191,10 @@ See Andrew's answer here, up on the user list: link:http://search-hadoop.com/m/s
== HBase and Hadoop version issues
[[trouble.versions.205]]
=== [code]+NoClassDefFoundError+ when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)
=== `NoClassDefFoundError` when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)
Apache HBase 0.90.x does not ship with hadoop-0.20.205.x, etc.
To make it run, you need to replace the hadoop jars that Apache HBase shipped with in its [path]_lib_ directory with those of the Hadoop you want to run HBase on.
To make it run, you need to replace the hadoop jars that Apache HBase shipped with in its _lib_ directory with those of the Hadoop you want to run HBase on.
If even after replacing Hadoop jars you get the below exception:
[source]
@ -1212,7 +1212,7 @@ sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.initialize(Us
sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177)
----
you need to copy under [path]_hbase/lib_, the [path]_commons-configuration-X.jar_ you find in your Hadoop's [path]_lib_ directory.
you need to copy the _commons-configuration-X.jar_ you find in your Hadoop's _lib_ directory into _hbase/lib_.
That should fix the above complaint.
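A minimal sketch of that copy, assuming `$HADOOP_HOME` and `$HBASE_HOME` point at your installs:

[source,bash]
----
$ cp $HADOOP_HOME/lib/commons-configuration-*.jar $HBASE_HOME/lib/
----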
[[trouble.wrong.version]]
@ -1228,7 +1228,7 @@ If you see something like the following in your logs [computeroutput]+... 2012-0
If the Hadoop configuration is loaded after the HBase configuration, and you have configured custom IPC settings in both HBase and Hadoop, the Hadoop values may overwrite the HBase values.
There is normally no need to change these settings for HBase, so this problem is an edge case.
However, link:https://issues.apache.org/jira/browse/HBASE-11492[HBASE-11492] renames these settings for HBase to remove the chance of a conflict.
Each of the setting names have been prefixed with [literal]+hbase.+, as shown in the following table.
Each of the setting names has been prefixed with `hbase.`, as shown in the following table.
No action is required related to these changes unless you are already experiencing a conflict.
These changes were backported to HBase 0.98.x and apply to all newer versions.
@ -1297,7 +1297,7 @@ To operate with the most efficiency, HBase needs data to be available locally.
Therefore, it is a good practice to run an HDFS datanode on each RegionServer.
.Important Information and Guidelines for HBase and HDFS
HBase is a client of HDFS.::
HBase is an HDFS client, using the HDFS [code]+DFSClient+ class, and references to this class appear in HBase logs with other HDFS client log messages.
HBase is an HDFS client, using the HDFS `DFSClient` class, and references to this class appear in HBase logs with other HDFS client log messages.
Configuration is necessary in multiple places.::
Some HDFS configurations relating to HBase need to be done at the HDFS (server) side.
@ -1309,9 +1309,9 @@ Write errors which affect HBase may be logged in the HDFS logs rather than HBase
Communication problems between datanodes are logged in the HDFS logs, not the HBase logs.
HBase communicates with HDFS using two different ports.::
HBase communicates with datanodes using the [code]+ipc.Client+ interface and the [code]+DataNode+ class.
HBase communicates with datanodes using the `ipc.Client` interface and the `DataNode` class.
References to these will appear in HBase logs.
Each of these communication channels use a different port (50010 and 50020 by default). The ports are configured in the HDFS configuration, via the [code]+dfs.datanode.address+ and [code]+dfs.datanode.ipc.address+ parameters.
Each of these communication channels uses a different port (50010 and 50020 by default). The ports are configured in the HDFS configuration, via the `dfs.datanode.address` and `dfs.datanode.ipc.address` parameters.
Errors may be logged in HBase, HDFS, or both.::
When troubleshooting HDFS issues in HBase, check logs in both places for errors.
@ -1320,8 +1320,8 @@ HDFS takes a while to mark a node as dead. You can configure HDFS to avoid using
datanodes.::
By default, HDFS does not mark a node as dead until it is unreachable for 630 seconds.
In Hadoop 1.1 and Hadoop 2.x, this can be alleviated by enabling checks for stale datanodes, though this check is disabled by default.
You can enable the check for reads and writes separately, via [code]+dfs.namenode.avoid.read.stale.datanode+ and [code]+dfs.namenode.avoid.write.stale.datanode settings+.
A stale datanode is one that has not been reachable for [code]+dfs.namenode.stale.datanode.interval+ (default is 30 seconds). Stale datanodes are avoided, and marked as the last possible target for a read or write operation.
You can enable the check for reads and writes separately, via the `dfs.namenode.avoid.read.stale.datanode` and `dfs.namenode.avoid.write.stale.datanode` settings.
A stale datanode is one that has not been reachable for `dfs.namenode.stale.datanode.interval` (default is 30 seconds). Stale datanodes are avoided, and marked as the last possible target for a read or write operation.
For configuration details, see the HDFS documentation.
Settings for HDFS retries and timeouts are important to HBase.::
@ -1336,28 +1336,28 @@ Connection timeouts occur between the client (HBASE) and the HDFS datanode.
They may occur when establishing a connection, attempting to read, or attempting to write.
The two settings below are used in combination, and affect connections between the DFSClient and the datanode, the ipc.Client and the datanode, and communication between two datanodes.
[code]+dfs.client.socket-timeout+ (default: 60000)::
`dfs.client.socket-timeout` (default: 60000)::
The amount of time before a client connection times out when establishing a connection or reading.
The value is expressed in milliseconds, so the default is 60 seconds.
[code]+dfs.datanode.socket.write.timeout+ (default: 480000)::
`dfs.datanode.socket.write.timeout` (default: 480000)::
The amount of time before a write operation times out.
The default is 8 minutes, expressed as milliseconds.
.Typical Error Logs
The following types of errors are often seen in the logs.
[code]+INFO HDFS.DFSClient: Failed to connect to /xxx50010, add to deadNodes and
`INFO HDFS.DFSClient: Failed to connect to /xxx50010, add to deadNodes and
continue java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=/region-server-1:50010]+::
remote=/region-server-1:50010]`::
All datanodes for a block are dead, and recovery is not possible.
Here is the sequence of events that leads to this error:
[code]+INFO org.apache.hadoop.HDFS.DFSClient: Exception in createBlockOutputStream
`INFO org.apache.hadoop.HDFS.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be
ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/
xxx:50010]+::
xxx:50010]`::
This type of error indicates a write issue.
In this case, the master wants to split the log.
It does not have a local datanode so it tries to connect to a remote datanode, but the datanode is dead.
@ -1423,7 +1423,7 @@ This problem appears to affect some versions of OpenJDK 7 shipped by some Linux
NSS is configured as the default provider.
If the host has an x86_64 architecture, depending on whether the vendor packages contain the defect, the NSS provider will not function correctly.
To work around this problem, find the JRE home directory and edit the file [path]_lib/security/java.security_.
To work around this problem, find the JRE home directory and edit the file _lib/security/java.security_.
Edit the file to comment out the line:
[source]
@ -1446,7 +1446,7 @@ Some users have reported seeing the following error:
kernel: java: page allocation failure. order:4, mode:0x20
----
Raising the value of [code]+min_free_kbytes+ was reported to fix this problem.
Raising the value of `min_free_kbytes` was reported to fix this problem.
This parameter is set to a percentage of the amount of RAM on your system, and is described in more detail at link:http://www.centos.org/docs/5/html/5.1/Deployment_Guide/s3-proc-sys-vm.html.
To find the current value on your system, run the following command:
@ -1460,7 +1460,7 @@ Try doubling, then quadrupling the value.
Note that setting the value too low or too high could have detrimental effects on your system.
Consult your operating system vendor for specific recommendations.
Use the following command to modify the value of [code]+min_free_kbytes+, substituting [replaceable]_<value>_ with your intended value:
Use the following command to modify the value of `min_free_kbytes`, substituting _<value>_ with your intended value:
----
[user@host]# echo <value> > /proc/sys/vm/min_free_kbytes

View File

@ -73,7 +73,7 @@ The first step is to add JUnit dependencies to your Maven POM file:
----
Next, add some unit tests to your code.
Tests are annotated with [literal]+@Test+.
Tests are annotated with `@Test`.
Here, the unit tests are in bold.
[source,java]
@ -94,7 +94,7 @@ public class TestMyHbaseDAOData {
}
----
These tests ensure that your [code]+createPut+ method creates, populates, and returns a [code]+Put+ object with expected values.
These tests ensure that your `createPut` method creates, populates, and returns a `Put` object with expected values.
Of course, JUnit can do much more than this.
For an introduction to JUnit, see link:https://github.com/junit-team/junit/wiki/Getting-started.
@ -105,9 +105,9 @@ It goes further than JUnit by allowing you to test the interactions between obje
You can read more about Mockito at its project site, link:https://code.google.com/p/mockito/.
You can use Mockito to do unit testing on smaller units.
For instance, you can mock a [class]+org.apache.hadoop.hbase.Server+ instance or a [class]+org.apache.hadoop.hbase.master.MasterServices+ interface reference rather than a full-blown [class]+org.apache.hadoop.hbase.master.HMaster+.
For instance, you can mock a `org.apache.hadoop.hbase.Server` instance or a `org.apache.hadoop.hbase.master.MasterServices` interface reference rather than a full-blown `org.apache.hadoop.hbase.master.HMaster`.
This example builds upon the example code in <<unit.tests,unit.tests>>, to test the [code]+insertRecord+ method.
This example builds upon the example code in <<unit.tests,unit.tests>>, to test the `insertRecord` method.
First, add a dependency for Mockito to your Maven POM file.
@ -122,7 +122,7 @@ First, add a dependency for Mockito to your Maven POM file.
</dependency>
----
Next, add a [code]+@RunWith+ annotation to your test class, to direct it to use Mockito.
Next, add a `@RunWith` annotation to your test class, to direct it to use Mockito.
[source,java]
----
@ -158,7 +158,7 @@ public class TestMyHBaseDAO{
}
----
This code populates [code]+HBaseTestObj+ with ``ROWKEY-1'', ``DATA-1'', ``DATA-2'' as values.
This code populates `HBaseTestObj` with ``ROWKEY-1'', ``DATA-1'', ``DATA-2'' as values.
It then inserts the record into the mocked table.
The Put that the DAO would have inserted is captured, and values are tested to verify that they are what you expected them to be.
@ -171,7 +171,7 @@ Similarly, you can now expand into other operations such as Get, Scan, or Delete
link:http://mrunit.apache.org/[Apache MRUnit] is a library that allows you to unit-test MapReduce jobs.
You can use it to test HBase jobs in the same way as other MapReduce jobs.
Given a MapReduce job that writes to an HBase table called [literal]+MyTest+, which has one column family called [literal]+CF+, the reducer of such a job could look like the following:
Given a MapReduce job that writes to an HBase table called `MyTest`, which has one column family called `CF`, the reducer of such a job could look like the following:
[source,java]
----
@ -338,7 +338,7 @@ public class MyHBaseIntegrationTest {
----
This code creates an HBase mini-cluster and starts it.
Next, it creates a table called [literal]+MyTest+ with one column family, [literal]+CF+.
Next, it creates a table called `MyTest` with one column family, `CF`.
A record is inserted, a Get is performed from the same table, and the insertion is verified.
NOTE: Starting the mini-cluster takes about 20-30 seconds, but that should be appropriate for integration testing.

View File

@ -157,10 +157,12 @@ When we say two HBase versions are compatible, we mean that the versions are wir
A rolling upgrade is the process by which you update the servers in your cluster one server at a time. You can rolling upgrade across HBase versions if they are binary or wire compatible. See <<hbase.rolling.restart>> for more on what this means. Coarsely, a rolling upgrade is: gracefully stop each server, update the software, and then restart. You do this for each server in the cluster. Usually you upgrade the Master first and then the RegionServers. See <<rolling>> for tools that can help with the rolling upgrade process.
For example, in the below, hbase was symlinked to the actual hbase install. On upgrade, before running a rolling restart over the cluster, we changed the symlink to point at the new HBase software version and then ran
[source,bash]
----
$ HADOOP_HOME=~/hadoop-2.6.0-CRC-SNAPSHOT ~/hbase/bin/rolling-restart.sh --config ~/conf_hbase
----
The rolling-restart script will first gracefully stop and restart the master, and then each of the regionservers in turn. Because the symlink was changed, on restart the server will come up using the new hbase version. Check logs for errors as the rolling upgrade proceeds.
[[hbase.rolling.restart]]
@ -169,12 +171,14 @@ Unless otherwise specified, HBase point versions are binary compatible. You can
In the minor version-particular sections below, we call out where the versions are wire/protocol compatible and in this case, it is also possible to do a <<hbase.rolling.upgrade>>. For example, in <<upgrade1.0.rolling.upgrade>>, we state that it is possible to do a rolling upgrade between hbase-0.98.x and hbase-1.0.0.
== Upgrade Paths
[[upgrade1.0]]
== Upgrading from 0.98.x to 1.0.x
=== Upgrading from 0.98.x to 1.0.x
In this section we first note the significant changes that come in with 1.0.0 HBase and then we go over the upgrade process. Be sure to read the significant changes section with care so you avoid surprises.
=== Changes of Note!
==== Changes of Note!
Here we list important changes that are in 1.0.0 since 0.98.x, changes you should be aware of because they go into effect once you upgrade.
@ -184,7 +188,7 @@ See <<zookeeper.requirements>>.
[[default.ports.changed]]
.HBase Default Ports Changed
The ports used by HBase changed. The used to be in the 600XX range. In hbase-1.0.0 they have been moved up out of the ephemeral port range and are 160XX instead (Master web UI was 60010 and is now 16030; the RegionServer web UI was 60030 and is now 16030, etc). If you want to keep the old port locations, copy the port setting configs from [path]_hbase-default.xml_ into [path]_hbase-site.xml_, change them back to the old values from hbase-0.98.x era, and ensure you've distributed your configurations before you restart.
The ports used by HBase changed. They used to be in the 600XX range. In hbase-1.0.0 they have been moved up out of the ephemeral port range and are 160XX instead (Master web UI was 60010 and is now 16010; the RegionServer web UI was 60030 and is now 16030, etc). If you want to keep the old port locations, copy the port setting configs from _hbase-default.xml_ into _hbase-site.xml_, change them back to the old values from the hbase-0.98.x era, and ensure you've distributed your configurations before you restart.
[[upgrade1.0.hbase.bucketcache.percentage.in.combinedcache]]
.hbase.bucketcache.percentage.in.combinedcache configuration has been REMOVED
@ -199,31 +203,31 @@ See the release notes on the issue link:https://issues.apache.org/jira/browse/HB
<<distributed.log.replay>> is off by default in hbase-1.0. Enabling it can make a big difference improving HBase MTTR. Enable this feature if you are doing a clean stop/start when you are upgrading. You cannot rolling upgrade on to this feature (caveat if you are running on a version of hbase in excess of hbase-0.98.4 -- see link:https://issues.apache.org/jira/browse/HBASE-12577[HBASE-12577 Disable distributed log replay by default] for more).
[[upgrade1.0.rolling.upgrade]]
=== Rolling upgrade from 0.98.x to HBase 1.0.0
==== Rolling upgrade from 0.98.x to HBase 1.0.0
.From 0.96.x to 1.0.0
NOTE: You cannot do a <<hbase.rolling.upgrade,rolling upgrade>> from 0.96.x to 1.0.0 without first doing a rolling upgrade to 0.98.x. See comment in link:https://issues.apache.org/jira/browse/HBASE-11164?focusedCommentId=14182330&amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&#35;comment-14182330[HBASE-11164 Document and test rolling updates from 0.98 -> 1.0] for the why. Also because hbase-1.0.0 enables hfilev3 by default, link:https://issues.apache.org/jira/browse/HBASE-9801[HBASE-9801 Change the default HFile version to V3], and support for hfilev3 only arrives in 0.98, this is another reason you cannot rolling upgrade from hbase-0.96.x; if the rolling upgrade stalls, the 0.96.x servers cannot open files written by the servers running the newer hbase-1.0.0 hfilev3 writing servers.
There are no known issues running a <<hbase.rolling.upgrade,rolling upgrade>> from hbase-0.98.x to hbase-1.0.0.
[[upgrade1.0.from.0.94]]
=== Upgrading to 1.0 from 0.94
==== Upgrading to 1.0 from 0.94
You cannot rolling upgrade from 0.94.x to 1.x.x. You must stop your cluster, install the 1.x.x software, run the migration described at <<executing.the.0.96.upgrade>> (substituting 1.x.x. wherever we make mention of 0.96.x in the section below), and then restart. Be sure to upgrade your zookeeper if it is a version less than the required 3.4.x.
[[upgrade0.98]]
== Upgrading from 0.96.x to 0.98.x
=== Upgrading from 0.96.x to 0.98.x
A rolling upgrade from 0.96.x to 0.98.x works. The two versions are not binary compatible.
Additional steps are required to take advantage of some of the new features of 0.98.x, including cell visibility labels, cell ACLs, and transparent server side encryption. See <<security>> for more information. Significant performance improvements include a change to the write ahead log threading model that provides higher transaction throughput under high load, reverse scanners, MapReduce over snapshot files, and striped compaction.
Clients and servers can run with 0.98.x and 0.96.x versions. However, applications may need to be recompiled due to changes in the Java API.
== Upgrading from 0.94.x to 0.98.x
=== Upgrading from 0.94.x to 0.98.x
A rolling upgrade from 0.94.x directly to 0.98.x does not work. The upgrade path follows the same procedures as <<upgrade0.96>>. Additional steps are required to use some of the new features of 0.98.x. See <<upgrade0.98>> for an abbreviated list of these features.
[[upgrade0.96]]
== Upgrading from 0.94.x to 0.96.x
=== Upgrading from 0.94.x to 0.96.x
=== The "Singularity"
==== The "Singularity"
.HBase 0.96.x was EOL'd, September 1st, 2014
NOTE: Do not deploy 0.96.x. Deploy at least 0.98.x. See link:https://issues.apache.org/jira/browse/HBASE-11642[EOL 0.96].
@ -233,12 +237,14 @@ You will have to stop your old 0.94.x cluster completely to upgrade. If you are
The API has changed. You will need to recompile your code against 0.96 and you may need to adjust applications to go against new APIs (TODO: List of changes).
[[executing.the.0.96.upgrade]]
=== Executing the 0.96 Upgrade
==== Executing the 0.96 Upgrade
.HDFS and ZooKeeper must be up!
NOTE: HDFS and ZooKeeper should be up and running during the upgrade process.
hbase-0.96.0 comes with an upgrade script. Run
[source,bash]
----
$ bin/hbase upgrade
----
@ -250,9 +256,12 @@ The check step is run against a running 0.94 cluster. Run it from a downloaded 0
The check step prints stats at the end of its run (grep for `"Result:"` in the log), including the absolute paths of the tables it scanned, any HFileV1 files found, the regions containing those files (the regions that must be major compacted to purge the HFileV1s), and any corrupt files, if found. A corrupt file is unreadable, and so its format is undefined (it is neither HFileV1 nor HFileV2).
To run the check step, run
[source,bash]
----
$ bin/hbase upgrade -check
----
Here is sample output:
----
Tables Processed:
@ -280,6 +289,7 @@ There are some HFileV1, or corrupt files (files with incorrect major version)
In the above sample output, there are two HFileV1s in two regions, and one corrupt file. Corrupt files should probably be removed. The regions that have HFileV1s need to be major compacted. To major compact, start up the hbase shell and review how to compact an individual region. After the major compaction is done, rerun the check step and the HFileV1s should be gone, replaced by HFileV2 instances.
By default, the check step scans the hbase root directory (defined as `hbase.rootdir` in the configuration). To scan a specific directory only, pass the `-dir` option.
[source,bash]
----
$ bin/hbase upgrade -check -dir /myHBase/testTable
----
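For reference, `hbase.rootdir` above is the ordinary HBase root directory property from _hbase-site.xml_. A minimal sketch follows; the HDFS URI shown is only an example.
[source,xml]
----
<configuration>
  <!-- Example value only; use the root directory your cluster actually writes to -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
  </property>
</configuration>
----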
@ -293,6 +303,7 @@ After the _check_ step shows the cluster is free of HFileV1, it is safe to proce
[NOTE]
====
HDFS and ZooKeeper should be up and running during the upgrade process. If ZooKeeper is managed by HBase, you can start ZooKeeper so it is available to the upgrade by running
[source,bash]
----
$ ./hbase/bin/hbase-daemon.sh start zookeeper
----
@ -307,6 +318,7 @@ The execute upgrade step is made of three substeps.
* WAL Log Splitting: If the 0.94.x cluster shutdown was not clean, we'll split WAL logs as part of migration before we startup on 0.96.0. This WAL splitting runs slower than the native distributed WAL splitting because it is all inside the single upgrade process (so try and get a clean shutdown of the 0.94.0 cluster if you can).
To run the _execute_ step, make sure that you have first copied the hbase-0.96.0 binaries everywhere, under servers and under clients. Make sure the 0.94.0 cluster is down. Then do as follows:
[source,bash]
----
$ bin/hbase upgrade -execute
----
@ -329,6 +341,7 @@ Successfully completed Log splitting
----
If the output from the execute step looks good, stop the zookeeper instance you started to do the upgrade:
[source,bash]
----
$ ./hbase/bin/hbase-daemon.sh stop zookeeper
----
@ -355,19 +368,19 @@ It will fail with an exception like the below. Upgrade.
17:22:15 at Client_4_3_0.main(Client_4_3_0.java:63)
----
=== Upgrading `META` to use Protocol Buffers (Protobuf)
==== Upgrading `META` to use Protocol Buffers (Protobuf)
When you upgrade from versions prior to 0.96, `META` needs to be converted to use protocol buffers. This is controlled by the configuration option `hbase.MetaMigrationConvertingToPB`, which is set to `true` by default. Therefore, by default, no action is required on your part.
The migration is a one-time event. However, every time your cluster starts, `META` is scanned to ensure that it does not need to be converted. If you have a very large number of regions, this scan can take a long time. Starting in 0.98.5, you can set `hbase.MetaMigrationConvertingToPB` to `false` in _hbase-site.xml_, to disable this start-up scan. This should be considered an expert-level setting.
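A sketch of that expert-level override in _hbase-site.xml_ (0.98.5 or later only):
[source,xml]
----
<configuration>
  <!-- Expert-level: skip the start-up scan that checks whether META still needs conversion to protobuf -->
  <property>
    <name>hbase.MetaMigrationConvertingToPB</name>
    <value>false</value>
  </property>
</configuration>
----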
[[upgrade0.94]]
== Upgrading from 0.92.x to 0.94.x
=== Upgrading from 0.92.x to 0.94.x
We used to think that 0.92 and 0.94 were interface compatible and that you could do a rolling upgrade between these versions, but then we figured out that link:https://issues.apache.org/jira/browse/HBASE-5357[HBASE-5357 Use builder pattern in HColumnDescriptor] changed method signatures so that, rather than returning void, they instead return HColumnDescriptor. This throws `java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V`, so 0.92 and 0.94 are NOT compatible. You cannot do a rolling upgrade between them.
[[upgrade0.92]]
== Upgrading from 0.90.x to 0.92.x
=== Upgrade Guide
=== Upgrading from 0.90.x to 0.92.x
==== Upgrade Guide
You will find that 0.92.0 runs a little differently than 0.90.x releases. Here are a few things to watch out for when upgrading from 0.90.x to 0.92.0.
.tl;dr
@ -425,7 +438,7 @@ If an OOME, we now have the JVM kill -9 the regionserver process so it goes down
0.92.0 stores data in a new format, <<hfilev2>>. As HBase runs, it will move all your data from HFile v1 to HFile v2 format. This auto-migration will run in the background as flushes and compactions run. HFile v2 allows HBase to run with larger regions/files. In fact, we encourage all HBasers going forward to tend toward Facebook axiom #1: run with larger, fewer regions. If you have lots of regions now -- more than hundreds per host -- you should look into setting your region size up after you move to 0.92.0 (in 0.92.0, the default size is now 1G, up from 256M), and then running the online merge tool (see link:https://issues.apache.org/jira/browse/HBASE-1621[HBASE-1621 merge tool should work on online cluster, but disabled table]).
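As an illustration of raising the region size, a sketch for _hbase-site.xml_ follows. The 10 GB value is only an example; pick a size that suits your data.
[source,xml]
----
<configuration>
  <!-- Example only: let regions grow to 10 GB before splitting, so you end up with fewer, larger regions -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>10737418240</value>
  </property>
</configuration>
----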
[[upgrade0.90]]
== Upgrading to HBase 0.90.x from 0.20.x or 0.89.x
=== Upgrading to HBase 0.90.x from 0.20.x or 0.89.x
This version of 0.90.x HBase can be started on data written by HBase 0.20.x or HBase 0.89.x. There is no need for a migration step. However, HBase 0.89.x and 0.90.x write out the names of region directories differently -- they name them with an MD5 hash of the region name rather than a Jenkins hash -- so once started, there is no going back to HBase 0.20.x.
Be sure to remove the _hbase-default.xml_ from your _conf_ directory on upgrade. A 0.20.x version of this file will have sub-optimal configurations for 0.90.x HBase. The _hbase-default.xml_ file is now bundled into the HBase jar and read from there. If you would like to review the content of this file, see it in the src tree at _src/main/resources/hbase-default.xml_ or see <<hbase_default_configurations>>.

View File

@ -32,19 +32,19 @@ All participating nodes and clients need to be able to access the running ZooKee
Apache HBase by default manages a ZooKeeper "cluster" for you.
It will start and stop the ZooKeeper ensemble as part of the HBase start/stop process.
You can also manage the ZooKeeper ensemble independent of HBase and just point HBase at the cluster it should use.
To toggle HBase management of ZooKeeper, use the [var]+HBASE_MANAGES_ZK+ variable in [path]_conf/hbase-env.sh_.
This variable, which defaults to [var]+true+, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.
To toggle HBase management of ZooKeeper, use the `HBASE_MANAGES_ZK` variable in _conf/hbase-env.sh_.
This variable, which defaults to `true`, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.
When HBase manages the ZooKeeper ensemble, you can specify ZooKeeper configuration using its native [path]_zoo.cfg_ file, or, the easier option is to just specify ZooKeeper options directly in [path]_conf/hbase-site.xml_.
A ZooKeeper configuration option can be set as a property in the HBase [path]_hbase-site.xml_ XML configuration file by prefacing the ZooKeeper option name with [var]+hbase.zookeeper.property+.
For example, the [var]+clientPort+ setting in ZooKeeper can be changed by setting the [var]+hbase.zookeeper.property.clientPort+ property.
When HBase manages the ZooKeeper ensemble, you can specify ZooKeeper configuration using its native _zoo.cfg_ file, or, the easier option is to just specify ZooKeeper options directly in _conf/hbase-site.xml_.
A ZooKeeper configuration option can be set as a property in the HBase _hbase-site.xml_ XML configuration file by prefacing the ZooKeeper option name with `hbase.zookeeper.property`.
For example, the `clientPort` setting in ZooKeeper can be changed by setting the `hbase.zookeeper.property.clientPort` property.
For all default values used by HBase, including ZooKeeper configuration, see <<hbase_default_configurations,hbase default configurations>>.
Look for the [var]+hbase.zookeeper.property+ prefix.
For the full list of ZooKeeper configurations, see ZooKeeper's [path]_zoo.cfg_.
HBase does not ship with a [path]_zoo.cfg_ so you will need to browse the [path]_conf_ directory in an appropriate ZooKeeper download.
Look for the `hbase.zookeeper.property` prefix.
For the full list of ZooKeeper configurations, see ZooKeeper's _zoo.cfg_.
HBase does not ship with a _zoo.cfg_ so you will need to browse the _conf_ directory in an appropriate ZooKeeper download.
You must at least list the ensemble servers in [path]_hbase-site.xml_ using the [var]+hbase.zookeeper.quorum+ property.
This property defaults to a single ensemble member at [var]+localhost+ which is not suitable for a fully distributed HBase.
You must at least list the ensemble servers in _hbase-site.xml_ using the `hbase.zookeeper.quorum` property.
This property defaults to a single ensemble member at `localhost` which is not suitable for a fully distributed HBase.
(It binds to the local machine only and remote clients will not be able to connect).
.How many ZooKeepers should I run?
@ -59,9 +59,9 @@ Thus, an ensemble of 5 allows 2 peers to fail, and thus is more fault tolerant t
Give each ZooKeeper server around 1GB of RAM, and if possible, its own dedicated disk (A dedicated disk is the best thing you can do to ensure a performant ZooKeeper ensemble). For very heavily loaded clusters, run ZooKeeper servers on separate machines from RegionServers (DataNodes and TaskTrackers).
====
For example, to have HBase manage a ZooKeeper quorum on nodes _rs{1,2,3,4,5}.example.com_, bound to port 2222 (the default is 2181) ensure [var]+HBASE_MANAGE_ZK+ is commented out or set to [var]+true+ in [path]_conf/hbase-env.sh_ and then edit [path]_conf/hbase-site.xml_ and set [var]+hbase.zookeeper.property.clientPort+ and [var]+hbase.zookeeper.quorum+.
You should also set [var]+hbase.zookeeper.property.dataDir+ to other than the default as the default has ZooKeeper persist data under [path]_/tmp_ which is often cleared on system restart.
In the example below we have ZooKeeper persist to [path]_/user/local/zookeeper_.
For example, to have HBase manage a ZooKeeper quorum on nodes _rs{1,2,3,4,5}.example.com_, bound to port 2222 (the default is 2181), ensure `HBASE_MANAGES_ZK` is commented out or set to `true` in _conf/hbase-env.sh_ and then edit _conf/hbase-site.xml_ and set `hbase.zookeeper.property.clientPort` and `hbase.zookeeper.quorum`.
You should also set `hbase.zookeeper.property.dataDir` to other than the default as the default has ZooKeeper persist data under _/tmp_ which is often cleared on system restart.
In the example below we have ZooKeeper persist to _/user/local/zookeeper_.
[source,xml]
----
@ -102,7 +102,7 @@ In the example below we have ZooKeeper persist to [path]_/user/local/zookeeper_.
====
The newer version, the better.
For example, some folks have been bitten by link:https://issues.apache.org/jira/browse/ZOOKEEPER-1277[ZOOKEEPER-1277].
If running zookeeper 3.5+, you can ask hbase to make use of the new multi operation by enabling <<hbase.zookeeper.usemulti,hbase.zookeeper.useMulti>>" in your [path]_hbase-site.xml_.
If running ZooKeeper 3.5+, you can ask HBase to make use of the new multi operation by enabling <<hbase.zookeeper.usemulti,hbase.zookeeper.useMulti>> in your _hbase-site.xml_.
====
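To enable the multi operation mentioned in the note above, a sketch for _hbase-site.xml_ (only appropriate when every ensemble member runs a ZooKeeper release that supports multi):
[source,xml]
----
<configuration>
  <!-- Sketch: let HBase use ZooKeeper's multi operation -->
  <property>
    <name>hbase.zookeeper.useMulti</name>
    <value>true</value>
  </property>
</configuration>
----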
.ZooKeeper Maintenance
@ -115,7 +115,7 @@ zookeeper could start dropping sessions if it has to run through a directory of
== Using existing ZooKeeper ensemble
To point HBase at an existing ZooKeeper cluster, one that is not managed by HBase, set [var]+HBASE_MANAGES_ZK+ in [path]_conf/hbase-env.sh_ to false
To point HBase at an existing ZooKeeper cluster, one that is not managed by HBase, set `HBASE_MANAGES_ZK` in _conf/hbase-env.sh_ to `false`:
----
@ -124,8 +124,8 @@ To point HBase at an existing ZooKeeper cluster, one that is not managed by HBas
export HBASE_MANAGES_ZK=false
----
Next set ensemble locations and client port, if non-standard, in [path]_hbase-site.xml_, or add a suitably configured [path]_zoo.cfg_ to HBase's [path]_CLASSPATH_.
HBase will prefer the configuration found in [path]_zoo.cfg_ over any settings in [path]_hbase-site.xml_.
Next set ensemble locations and client port, if non-standard, in _hbase-site.xml_, or add a suitably configured _zoo.cfg_ to HBase's _CLASSPATH_.
HBase will prefer the configuration found in _zoo.cfg_ over any settings in _hbase-site.xml_.
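For instance, a sketch of pointing HBase at an existing three-node ensemble via _hbase-site.xml_ (the hostnames and port are examples only):
[source,xml]
----
<configuration>
  <!-- Example hostnames; list your own ensemble members -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <!-- Only needed if your ensemble uses a non-standard client port (2222 here as an example) -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2222</value>
  </property>
</configuration>
----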
When HBase manages ZooKeeper, it will start/stop the ZooKeeper servers as a part of the regular start/stop scripts.
If you would like to run ZooKeeper yourself, independent of HBase start/stop, you would do the following:
@ -136,7 +136,7 @@ ${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
----
Note that you can use HBase in this manner to spin up a ZooKeeper cluster, unrelated to HBase.
Just make sure to set [var]+HBASE_MANAGES_ZK+ to [var]+false+ if you want it to stay up across HBase restarts so that when HBase shuts down, it doesn't take ZooKeeper down with it.
Just make sure to set `HBASE_MANAGES_ZK` to `false` if you want it to stay up across HBase restarts so that when HBase shuts down, it doesn't take ZooKeeper down with it.
For more information about running a distinct ZooKeeper cluster, see the ZooKeeper link:http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html[Getting
Started Guide].
@ -154,21 +154,21 @@ ZooKeeper/HBase mutual authentication (link:https://issues.apache.org/jira/brows
=== Operating System Prerequisites
You need to have a working Kerberos KDC setup.
For each [code]+$HOST+ that will run a ZooKeeper server, you should have a principle [code]+zookeeper/$HOST+.
For each such host, add a service key (using the [code]+kadmin+ or [code]+kadmin.local+ tool's [code]+ktadd+ command) for [code]+zookeeper/$HOST+ and copy this file to [code]+$HOST+, and make it readable only to the user that will run zookeeper on [code]+$HOST+.
Note the location of this file, which we will use below as [path]_$PATH_TO_ZOOKEEPER_KEYTAB_.
For each `$HOST` that will run a ZooKeeper server, you should have a principal `zookeeper/$HOST`.
For each such host, add a service key (using the `kadmin` or `kadmin.local` tool's `ktadd` command) for `zookeeper/$HOST`, copy the resulting keytab file to `$HOST`, and make it readable only to the user that will run ZooKeeper on `$HOST`.
Note the location of this file, which we will use below as _$PATH_TO_ZOOKEEPER_KEYTAB_.
Similarly, for each [code]+$HOST+ that will run an HBase server (master or regionserver), you should have a principle: [code]+hbase/$HOST+.
For each host, add a keytab file called [path]_hbase.keytab_ containing a service key for [code]+hbase/$HOST+, copy this file to [code]+$HOST+, and make it readable only to the user that will run an HBase service on [code]+$HOST+.
Note the location of this file, which we will use below as [path]_$PATH_TO_HBASE_KEYTAB_.
Similarly, for each `$HOST` that will run an HBase server (master or regionserver), you should have a principal: `hbase/$HOST`.
For each host, add a keytab file called _hbase.keytab_ containing a service key for `hbase/$HOST`, copy this file to `$HOST`, and make it readable only to the user that will run an HBase service on `$HOST`.
Note the location of this file, which we will use below as _$PATH_TO_HBASE_KEYTAB_.
Each user who will be an HBase client should also be given a Kerberos principal.
This principal should usually have a password assigned to it, which only this user knows, as opposed to the keytab files used by the HBase servers.
The client's principal's [code]+maxrenewlife+ should be set so that it can be renewed enough so that the user can complete their HBase client processes.
For example, if a user runs a long-running HBase client process that takes at most 3 days, we might create this user's principal within [code]+kadmin+ with: [code]+addprinc -maxrenewlife 3days+.
The client principal's `maxrenewlife` should be set so that its ticket can be renewed for long enough that the user can complete their HBase client processes.
For example, if a user runs a long-running HBase client process that takes at most 3 days, we might create this user's principal within `kadmin` with: `addprinc -maxrenewlife 3days`.
The ZooKeeper client and server libraries manage their own ticket refreshment by running threads that wake up periodically to perform the refresh.
On each host that will run an HBase client (e.g. [code]+hbase shell+), add the following file to the HBase home directory's [path]_conf_ directory:
On each host that will run an HBase client (e.g. `hbase shell`), add the following file to the HBase home directory's _conf_ directory:
[source,java]
----
@ -180,11 +180,11 @@ Client {
};
----
We'll refer to this JAAS configuration file as [path]_$CLIENT_CONF_ below.
We'll refer to this JAAS configuration file as _$CLIENT_CONF_ below.
=== HBase-managed Zookeeper Configuration
On each node that will run a zookeeper, a master, or a regionserver, create a link:http://docs.oracle.com/javase/1.4.2/docs/guide/security/jgss/tutorials/LoginConfigFile.html[JAAS] configuration file in the conf directory of the node's [path]_HBASE_HOME_ directory that looks like the following:
On each node that will run a zookeeper, a master, or a regionserver, create a link:http://docs.oracle.com/javase/1.4.2/docs/guide/security/jgss/tutorials/LoginConfigFile.html[JAAS] configuration file in the conf directory of the node's _HBASE_HOME_ directory that looks like the following:
[source,java]
----
@ -206,14 +206,14 @@ Client {
};
----
where the [path]_$PATH_TO_HBASE_KEYTAB_ and [path]_$PATH_TO_ZOOKEEPER_KEYTAB_ files are what you created above, and [code]+$HOST+ is the hostname for that node.
where the _$PATH_TO_HBASE_KEYTAB_ and _$PATH_TO_ZOOKEEPER_KEYTAB_ files are what you created above, and `$HOST` is the hostname for that node.
The [code]+Server+ section will be used by the Zookeeper quorum server, while the [code]+Client+ section will be used by the HBase master and regionservers.
The path to this file should be substituted for the text [path]_$HBASE_SERVER_CONF_ in the [path]_hbase-env.sh_ listing below.
The `Server` section will be used by the Zookeeper quorum server, while the `Client` section will be used by the HBase master and regionservers.
The path to this file should be substituted for the text _$HBASE_SERVER_CONF_ in the _hbase-env.sh_ listing below.
The path to this file should be substituted for the text [path]_$CLIENT_CONF_ in the [path]_hbase-env.sh_ listing below.
The path to this file should be substituted for the text _$CLIENT_CONF_ in the _hbase-env.sh_ listing below.
Modify your [path]_hbase-env.sh_ to include the following:
Modify your _hbase-env.sh_ to include the following:
[source,bourne]
----
@ -225,9 +225,9 @@ export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
----
where [path]_$HBASE_SERVER_CONF_ and [path]_$CLIENT_CONF_ are the full paths to the JAAS configuration files created above.
where _$HBASE_SERVER_CONF_ and _$CLIENT_CONF_ are the full paths to the JAAS configuration files created above.
Modify your [path]_hbase-site.xml_ on each node that will run zookeeper, master or regionserver to contain:
Modify your _hbase-site.xml_ on each node that will run zookeeper, master or regionserver to contain:
[source,xml]
----
@ -256,7 +256,7 @@ Modify your [path]_hbase-site.xml_ on each node that will run zookeeper, master
</configuration>
----
where [code]+$ZK_NODES+ is the comma-separated list of hostnames of the Zookeeper Quorum hosts.
where `$ZK_NODES` is the comma-separated list of hostnames of the Zookeeper Quorum hosts.
Start your HBase cluster by running one or more of the following sets of commands on the appropriate hosts:
@ -283,9 +283,9 @@ Client {
};
----
where the [path]_$PATH_TO_HBASE_KEYTAB_ is the keytab created above for HBase services to run on this host, and [code]+$HOST+ is the hostname for that node.
where the _$PATH_TO_HBASE_KEYTAB_ is the keytab created above for HBase services to run on this host, and `$HOST` is the hostname for that node.
Put this in the HBase home's configuration directory.
We'll refer to this file's full pathname as [path]_$HBASE_SERVER_CONF_ below.
We'll refer to this file's full pathname as _$HBASE_SERVER_CONF_ below.
Modify your _hbase-env.sh_ to include the following:
@ -298,7 +298,7 @@ export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
----
Modify your [path]_hbase-site.xml_ on each node that will run a master or regionserver to contain:
Modify your _hbase-site.xml_ on each node that will run a master or regionserver to contain:
[source,xml]
----
@ -315,9 +315,9 @@ Modify your [path]_hbase-site.xml_ on each node that will run a master or region
</configuration>
----
where [code]+$ZK_NODES+ is the comma-separated list of hostnames of the Zookeeper Quorum hosts.
where `$ZK_NODES` is the comma-separated list of hostnames of the Zookeeper Quorum hosts.
Add a [path]_zoo.cfg_ for each Zookeeper Quorum host containing:
Add a _zoo.cfg_ for each Zookeeper Quorum host containing:
[source,java]
----
@ -342,8 +342,8 @@ Server {
};
----
where [code]+$HOST+ is the hostname of each Quorum host.
We will refer to the full pathname of this file as [path]_$ZK_SERVER_CONF_ below.
where `$HOST` is the hostname of each Quorum host.
We will refer to the full pathname of this file as _$ZK_SERVER_CONF_ below.
Start your Zookeepers on each Zookeeper Quorum host with:
@ -427,7 +427,7 @@ bin/hbase regionserver &
==== Fix target/cached_classpath.txt
You must override the standard hadoop-core jar file from the [code]+target/cached_classpath.txt+ file with the version containing the HADOOP-7070 fix.
You must override the standard hadoop-core jar file referenced in the `target/cached_classpath.txt` file with the version containing the HADOOP-7070 fix.
You can use the following script to do this:
----
@ -440,7 +440,7 @@ mv target/tmp.txt target/cached_classpath.txt
This would avoid the need for a separate Hadoop jar that fixes link:https://issues.apache.org/jira/browse/HADOOP-7070[HADOOP-7070].
==== Elimination of [code]+kerberos.removeHostFromPrincipal+ and[code]+kerberos.removeRealmFromPrincipal+
==== Elimination of `kerberos.removeHostFromPrincipal` and `kerberos.removeRealmFromPrincipal`

View File

@ -19,7 +19,7 @@
*/
////
= Apache HBase (TM) Reference Guide image:jumping-orca_rotated_25percent.png[]
= Apache HBase (TM) Reference Guide image:hbase_logo.png[] image:jumping-orca_rotated_25percent.png[]
:Author: Apache HBase Team
:Email: <hbase-dev@lists.apache.org>
:doctype: book

View File

@ -1,7 +1,7 @@
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="configuration">
<!--
/**
* Copyright 2010 The Apache Software Foundation
@ -26,6 +26,19 @@
This stylesheet is used in making an HTML version of hbase-default.adoc.
-->
<xsl:output method="text"/>
<!-- Normalize space -->
<xsl:template match="text()">
<xsl:if test="normalize-space(.)">
<xsl:value-of select="normalize-space(.)"/>
</xsl:if>
</xsl:template>
<!-- Grab nodes of the <configuration> element -->
<xsl:template match="configuration">
<!-- Print the license at the top of the file -->
////
/**
*
@ -46,7 +59,6 @@ This stylesheet is used making an html version of hbase-default.adoc.
* limitations under the License.
*/
////
:doctype: book
:numbered:
:toc: left
@ -58,18 +70,23 @@ This stylesheet is used making an html version of hbase-default.adoc.
The documentation below is generated using the default hbase configuration file, _hbase-default.xml_, as source.
<xsl:for-each select="property">
<xsl:if test="not(@skipInDoc)">
[[<xsl:value-of select="name" />]]
*`<xsl:value-of select="name"/>`*::
<xsl:for-each select="property">
<xsl:if test="not(@skipInDoc)">
[[<xsl:apply-templates select="name"/>]]
`<xsl:apply-templates select="name"/>`::
+
.Description
<xsl:value-of select="description"/>
<xsl:apply-templates select="description"/>
+
.Default
`<xsl:value-of select="value"/>`
<xsl:choose>
<xsl:when test="value != ''">`<xsl:apply-templates select="value"/>`
</xsl:when>
<xsl:otherwise>none</xsl:otherwise>
</xsl:choose>
</xsl:if>
</xsl:for-each>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>