HBASE-14823 HBase Ref Guide Refactoring

Some tables, links, and other elements do not render correctly in the output,
either because of Asciidoc code mistakes or poor formatting
choices. Make improvements.
Misty Stanley-Jones 2015-11-17 11:14:56 +10:00
parent 1b13bfcd43
commit 623dc1303e
21 changed files with 561 additions and 485 deletions


@ -65,7 +65,7 @@ Possible permissions include the following:
For the most part, permissions work in an expected way, with the following caveats:
Having Write permission does not imply Read permission.::
It is possible and sometimes desirable for a user to be able to write data that the same user cannot read. One such example is a log-writing process.
The [systemitem]+hbase:meta+ table is readable by every user, regardless of the user's other grants or restrictions.::
This is a requirement for HBase to function correctly.
`CheckAndPut` and `CheckAndDelete` operations will fail if the user does not have both Write and Read permission.::


@ -192,8 +192,11 @@ This format applies to intermediate-level and leaf index blocks of a version 2 m
Every non-root index block is structured as follows.
. numEntries: the number of entries (int).
. entryOffsets: the "secondary index" of offsets of entries in the block, to facilitate
a quick binary search on the key (`numEntries + 1` int values). The last value
is the total length of all entries in this index block. For example, in a non-root
index block with entry sizes 60, 80, 50 the "secondary index" will contain the
following int array: `{0, 60, 140, 190}`.
. Entries.
Each entry contains:
+


@ -140,7 +140,7 @@ If a region has both an empty start and an empty end key, it is the only region
In the (hopefully unlikely) event that programmatic processing of catalog metadata
is required, see the
+++<a href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29">Writables</a>+++
utility.
[[arch.catalog.startup]]
@ -931,7 +931,7 @@ To configure MultiWAL for a RegionServer, set the value of the property `hbase.w
</property>
----
Restart the RegionServer for the changes to take effect.
To disable MultiWAL for a RegionServer, unset the property and restart the RegionServer.
@ -1806,60 +1806,116 @@ This list is not exhaustive.
To tune these parameters from the defaults, edit the _hbase-default.xml_ file.
For a full list of all configuration parameters available, see <<config.files,config.files>>
`hbase.hstore.compaction.min`::
The minimum number of StoreFiles which must be eligible for compaction before compaction can run.
The goal of tuning `hbase.hstore.compaction.min` is to avoid ending up with too many tiny StoreFiles
to compact. Setting this value to 2 would cause a minor compaction each time you have two StoreFiles
in a Store, and this is probably not appropriate. If you set this value too high, all the other
values will need to be adjusted accordingly. For most cases, the default value is appropriate.
In previous versions of HBase, the parameter `hbase.hstore.compaction.min` was called
`hbase.hstore.compactionThreshold`.
+
*Default*: 3
`hbase.hstore.compaction.max`::
The maximum number of StoreFiles which will be selected for a single minor compaction,
regardless of the number of eligible StoreFiles. Effectively, the value of
`hbase.hstore.compaction.max` controls the length of time it takes a single
compaction to complete. Setting it larger means that more StoreFiles are included
in a compaction. For most cases, the default value is appropriate.
+
*Default*: 10
`hbase.hstore.compaction.min.size`::
A StoreFile smaller than this size will always be eligible for minor compaction.
StoreFiles this size or larger are evaluated by `hbase.hstore.compaction.ratio`
to determine if they are eligible. Because this limit represents the "automatic
include" limit for all StoreFiles smaller than this value, this value may need
to be reduced in write-heavy environments where many files in the 1-2 MB range
are being flushed, because every StoreFile will be targeted for compaction and
the resulting StoreFiles may still be under the minimum size and require further
compaction. If this parameter is lowered, the ratio check is triggered more quickly.
This addressed some issues seen in earlier versions of HBase but changing this
parameter is no longer necessary in most situations.
+
*Default*: 128 MB
`hbase.hstore.compaction.max.size`::
A StoreFile larger than this size will be excluded from compaction. The effect of
raising `hbase.hstore.compaction.max.size` is fewer, larger StoreFiles that do not
get compacted often. If you feel that compaction is happening too often without
much benefit, you can try raising this value.
+
*Default*: `Long.MAX_VALUE`
`hbase.hstore.compaction.ratio`::
For minor compaction, this ratio is used to determine whether a given StoreFile
which is larger than `hbase.hstore.compaction.min.size` is eligible for compaction.
Its effect is to limit compaction of large StoreFiles. The value of
`hbase.hstore.compaction.ratio` is expressed as a floating-point decimal.
+
* A large ratio, such as 10, will produce a single giant StoreFile. Conversely,
a value of .25 will produce behavior similar to the BigTable compaction algorithm,
producing four StoreFiles.
* A moderate value of between 1.0 and 1.4 is recommended. When tuning this value,
you are balancing write costs with read costs. Raising the value (to something like
1.4) will have more write costs, because you will compact larger StoreFiles.
However, during reads, HBase will need to seek through fewer StoreFiles to
accomplish the read. Consider this approach if you cannot take advantage of <<bloom>>.
* Alternatively, you can lower this value to something like 1.0 to reduce the
background cost of writes, and rely on Bloom filters (see <<bloom>>) to limit the
number of StoreFiles touched during reads. For most cases, the default value is appropriate.
+
*Default*: `1.2F`
`hbase.hstore.compaction.ratio.offpeak`::
The compaction ratio used during off-peak compactions, if off-peak hours are
also configured (see below). Expressed as a floating-point decimal. This allows
for more aggressive (or less aggressive, if you set it lower than
`hbase.hstore.compaction.ratio`) compaction during a set time period. Ignored
if off-peak is disabled (default). This works the same as
`hbase.hstore.compaction.ratio`.
+
*Default*: `5.0F`
`hbase.offpeak.start.hour`::
The start of off-peak hours, expressed as an integer between 0 and 23, inclusive.
Set to -1 to disable off-peak.
+
*Default*: `-1` (disabled)
`hbase.offpeak.end.hour`::
The end of off-peak hours, expressed as an integer between 0 and 23, inclusive.
Set to -1 to disable off-peak.
+
*Default*: `-1` (disabled)
`hbase.regionserver.thread.compaction.throttle`::
There are two different thread pools for compactions, one for large compactions
and the other for small compactions. This helps to keep compaction of lean tables
(such as `hbase:meta`) fast. If a compaction is larger than this threshold,
it goes into the large compaction pool. In most cases, the default value is
appropriate.
+
*Default*: `2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size`
(which defaults to `128`)
`hbase.hregion.majorcompaction`::
Time between major compactions, expressed in milliseconds. Set to 0 to disable
time-based automatic major compactions. User-requested and size-based major
compactions will still run. This value is multiplied by
`hbase.hregion.majorcompaction.jitter` to cause compaction to start at a
somewhat-random time during a given window of time.
+
*Default*: 7 days (`604800000` milliseconds)
`hbase.hregion.majorcompaction.jitter`::
A multiplier applied to `hbase.hregion.majorcompaction` to cause compaction to
occur a given amount of time either side of `hbase.hregion.majorcompaction`.
The smaller the number, the closer the compactions will happen to the
`hbase.hregion.majorcompaction` interval. Expressed as a floating-point decimal.
+
*Default*: `.50F`
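As a quick illustration of how these parameters are applied in practice, here is a minimal, hedged sketch of overriding a few of them in _hbase-site.xml_. The property names are the ones documented above; the values are illustrative examples only, not recommendations.

[source,xml]
----
<!-- Illustrative compaction tuning overrides; adjust values for your workload. -->
<property>
  <name>hbase.hstore.compaction.min</name>
  <!-- Require at least 4 eligible StoreFiles before a minor compaction runs (default is 3). -->
  <value>4</value>
</property>
<property>
  <name>hbase.hstore.compaction.max</name>
  <!-- Never include more than 10 StoreFiles in a single minor compaction (the default). -->
  <value>10</value>
</property>
<property>
  <name>hbase.hstore.compaction.ratio</name>
  <!-- Raise toward 1.4 to favor reads, lower toward 1.0 to favor writes; 1.2 is the default. -->
  <value>1.2</value>
</property>
----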
[[compaction.file.selection.old]]
===== Compaction File Selection
@ -2308,18 +2364,18 @@ To serve the region data from multiple replicas, HBase opens the regions in seco
The regions opened in secondary mode will share the same data files with the primary region replica; however, each secondary region replica will have its own MemStore to keep the unflushed data (only the primary region can do flushes). Also, to serve reads from secondary regions, the blocks of data files may also be cached in the block caches for the secondary regions.
=== Where is the code
This feature is delivered in two phases, Phase 1 and 2. The first phase was completed in time for the HBase-1.0.0 release, meaning that with HBase-1.0.x you can use all the features that are marked for Phase 1. Phase 2 was committed in HBase-1.1.0, meaning all HBase versions after 1.1.0 should contain Phase 2 items.
=== Propagating writes to region replicas
As discussed above, writes only go to the primary region replica. For propagating the writes from the primary region replica to the secondaries, there are two different mechanisms. For read-only tables, you do not need to use any of the following methods. Disabling and enabling the table should make the data available in all region replicas. For mutable tables, you have to use *only* one of the following mechanisms: storefile refresher, or async wal replication. The latter is recommended.
==== StoreFile Refresher
The first mechanism is the store file refresher, which was introduced in HBase-1.0+. The store file refresher is a thread per region server, which runs periodically and does a refresh operation for the store files of the primary region for the secondary region replicas. If enabled, the refresher will ensure that the secondary region replicas see the new flushed, compacted or bulk loaded files from the primary region in a timely manner. However, this means that only flushed data can be read back from the secondary region replicas after the refresher is run, making the secondaries lag behind the primary for a longer time.
To turn this feature on, you should configure `hbase.regionserver.storefile.refresh.period` to a non-zero value. See the Configuration section below.
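For example, a minimal sketch of enabling the refresher in _hbase-site.xml_ might look like the following; the period is in milliseconds, and the 30-second value shown here is only an illustration.

[source,xml]
----
<property>
  <name>hbase.regionserver.storefile.refresh.period</name>
  <!-- Any non-zero value (milliseconds) enables the store file refresher; 30000 is only an example. -->
  <value>30000</value>
</property>
----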
==== Async WAL replication
The second mechanism for propagation of writes to secondaries is done via the “Async WAL Replication” feature, and is only available in HBase-1.1+. This works similarly to HBase's multi-datacenter replication, but instead the data from a region is replicated to the secondary regions. Each secondary replica always receives and observes the writes in the same order that the primary region committed them. In some sense, this design can be thought of as “in-cluster replication”, where instead of replicating to a different datacenter, the data goes to secondary regions to keep the secondary regions' in-memory state up to date. The data files are shared between the primary region and the other replicas, so that there is no extra storage overhead. However, the secondary regions will have recent non-flushed data in their memstores, which increases the memory overhead. The primary region writes flush, compaction, and bulk load events to its WAL as well, which are also replicated through WAL replication to secondaries. When they observe the flush/compaction or bulk load event, the secondary regions replay the event to pick up the new files and drop the old ones.
Committing writes in the same order as in the primary ensures that the secondaries won't diverge from the primary region's data, but since the log replication is asynchronous, the data might still be stale in secondary regions. Since this feature works as a replication endpoint, the performance and latency characteristics are expected to be similar to inter-cluster replication.
@ -2332,18 +2388,18 @@ Asyn WAL Replication feature will add a new replication peer named `region_repli
hbase> disable_peer 'region_replica_replication'
----
=== Store File TTL
In both of the write propagation approaches mentioned above, store files of the primary will be opened in secondaries independent of the primary region. So for files that the primary compacted away, the secondaries might still be referring to these files for reading. Both features use HFileLinks to refer to files, but there is no protection (yet) for guaranteeing that the file will not be deleted prematurely. Thus, as a guard, you should set the configuration property `hbase.master.hfilecleaner.ttl` to a larger value, such as 1 hour, to guarantee that you will not receive IOExceptions for requests going to replicas.
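For instance, the one-hour guard mentioned above could be sketched in _hbase-site.xml_ as follows (the value is in milliseconds; choose a window that suits your compaction and read patterns):

[source,xml]
----
<property>
  <name>hbase.master.hfilecleaner.ttl</name>
  <!-- Keep compacted-away store files for 1 hour (3600000 ms) so replicas do not hit IOExceptions. -->
  <value>3600000</value>
</property>
----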
=== Region replication for the META table's region
Currently, Async WAL Replication is not done for the META table's WAL. The META table's secondary replicas still refresh themselves from the persistent store files. Hence the `hbase.regionserver.meta.storefile.refresh.period` needs to be set to a certain non-zero value for refreshing the META store files. Note that this configuration is configured differently than
`hbase.regionserver.storefile.refresh.period`.
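A hedged sketch of that setting in _hbase-site.xml_ follows; the value is in milliseconds and the 30-second figure is only illustrative.

[source,xml]
----
<property>
  <name>hbase.regionserver.meta.storefile.refresh.period</name>
  <!-- Non-zero value (milliseconds) so META secondary replicas pick up new store files. -->
  <value>30000</value>
</property>
----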
=== Memory accounting
The secondary region replicas refer to the data files of the primary region replica, but they have their own memstores (in HBase-1.1+) and use the block cache as well. However, one distinction is that the secondary region replicas cannot flush the data when there is memory pressure for their memstores. They can only free up memstore memory when the primary region does a flush and this flush is replicated to the secondary. Since a region server can host primary replicas for some regions and secondary replicas for others, the secondaries might cause extra flushes to the primary regions in the same host. In extreme situations, there can be no memory left for adding new writes coming from the primary via WAL replication. For unblocking this situation (and since the secondary cannot flush by itself), the secondary is allowed to do a “store file refresh” by doing a file system list operation to pick up new files from the primary, and possibly dropping its memstore. This refresh will only be performed if the memstore size of the biggest secondary region replica is at least `hbase.region.replica.storefile.refresh.memstore.multiplier` (default 4) times bigger than the biggest memstore of a primary replica. One caveat is that if this is performed, the secondary can observe partial row updates across column families (since column families are flushed independently). The default should be good enough that this operation is not done frequently. You can set this value to a large number to disable this feature if desired, but be warned that it might cause the replication to block forever.
=== Secondary replica failover
When a secondary region replica first comes online, or fails over, it may have served some edits from its memstore. Since the recovery is handled differently for secondary replicas, the secondary has to ensure that it does not go back in time before it starts serving requests after assignment. For doing that, the secondary waits until it observes a full flush cycle (start flush, commit flush) or a “region open event” replicated from the primary. Until this happens, the secondary region replica will reject all read requests by throwing an IOException with message “The region's reads are disabled”. However, the other replicas will probably still be available to read, thus not causing any impact for the rpc with TIMELINE consistency. To facilitate faster recovery, the secondary region will trigger a flush request from the primary when it is opened. The configuration property `hbase.region.replica.wait.for.primary.flush` (enabled by default) can be used to disable this feature if needed.
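If you do need to turn that behavior off, a minimal sketch of the override in _hbase-site.xml_ would look like this; it is enabled by default, so only set it when you have a concrete reason to.

[source,xml]
----
<property>
  <name>hbase.region.replica.wait.for.primary.flush</name>
  <!-- Allow a secondary replica to serve reads without waiting for a full flush cycle from the primary. -->
  <value>false</value>
</property>
----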
@ -2352,7 +2408,7 @@ When a secondary region replica first comes online, or fails over, it may have s
To use highly available reads, you should set the following properties in the `hbase-site.xml` file.
There is no specific configuration to enable or disable region replicas.
Instead, you can change the number of region replicas per table to increase or decrease it at table creation or with an alter table command. The following configuration is for using async WAL replication and 3 META replicas.
==== Server side properties
@ -2413,7 +2469,7 @@ Instead you can change the number of region replicas per table to increase or de
</property>
<property>
<name>hbase.region.replica.storefile.refresh.memstore.multiplier</name>
<value>4</value>
<description>
@ -2476,7 +2532,7 @@ Ensure to set the following for all clients (and servers) that will use region r
</property>
----
Note HBase-1.0.x users should use `hbase.ipc.client.allowsInterrupt` rather than `hbase.ipc.client.specificThreadForWriting`.
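As a hedged illustration, the client-side override might look like the snippet below. Use this property on HBase-1.1+; on HBase-1.0.x, set `hbase.ipc.client.allowsInterrupt` instead, as noted above.

[source,xml]
----
<property>
  <!-- On HBase-1.0.x, use hbase.ipc.client.allowsInterrupt instead of this property. -->
  <name>hbase.ipc.client.specificThreadForWriting</name>
  <value>true</value>
</property>
----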
=== User Interface


@ -35,13 +35,13 @@ HBase is a project in the Apache Software Foundation and as such there are respo
[[asf.devprocess]]
=== ASF Development Process
See the link:http://www.apache.org/dev/#committers[Apache Development Process page] for all sorts of information on how the ASF is structured (e.g., PMC, committers, contributors), tips on contributing and getting involved, and how open source works at the ASF.
[[asf.reporting]]
=== ASF Board Reporting
Once a quarter, each project in the ASF portfolio submits a report to the ASF board.
This is done by the HBase project lead and the committers.
See link:http://www.apache.org/foundation/board/reporting[ASF board reporting] for more information.
:numbered:


@ -45,18 +45,18 @@ See link:http://search-hadoop.com/m/asM982C5FkS1[HBase, mail # dev - Thoughts
The below policy is something we put in place 09/2012.
It is a suggested policy rather than a hard requirement.
We want to try it first to see if it works before we cast it in stone.
Apache HBase is made of link:https://issues.apache.org/jira/browse/HBASE#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel[components].
Components have one or more <<owner,OWNER>>s.
See the 'Description' field on the link:https://issues.apache.org/jira/browse/HBASE#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel[components] JIRA page for who the current owners are by component.
Patches that fit within the scope of a single Apache HBase component require, at least, a +1 by one of the component's owners before commit.
If owners are absent -- busy or otherwise -- two +1s by non-owners will suffice.
Patches that span components need at least two +1s before they can be committed, preferably +1s by owners of components touched by the x-component patch (TODO: This needs tightening up but I think fine for first pass).
Any -1 on a patch by anyone vetoes a patch; it cannot be committed until the justification for the -1 is addressed.
[[hbase.fix.version.in.jira]]
.How to set fix version in JIRA on issue resolve
@ -67,13 +67,13 @@ If master is going to be 0.98.0 then:
* Commit only to master: Mark with 0.98
* Commit to 0.95 and master: Mark with 0.98, and 0.95.x
* Commit to 0.94.x and 0.95, and master: Mark with 0.98, 0.95.x, and 0.94.x
* Commit to 89-fb: Mark with 89-fb.
* Commit site fixes: no version
[[hbase.when.to.close.jira]]
.Policy on when to set a RESOLVED JIRA as CLOSED
We link:http://search-hadoop.com/m/4cIKs1iwXMS1[agreed] that for issues that list multiple releases in their _Fix Version/s_ field, CLOSE the issue on the release of any of the versions listed; subsequent change to the issue must happen in a new JIRA.
[[no.permanent.state.in.zk]]
.Only transient state in ZooKeeper!
@ -81,7 +81,7 @@ We link:http://search-hadoop.com/m/4cIKs1iwXMS1[agreed] that for issues that lis
You should be able to kill the data in zookeeper and hbase should ride over it recreating the zk content as it goes.
This is an old adage around these parts.
We just made note of it now.
We also are currently in violation of this basic tenet -- replication at least keeps permanent state in zk -- but we are working to undo this breaking of a golden rule.
[[community.roles]]
== Community Roles
@ -90,22 +90,22 @@ We also are currently in violation of this basic tenet -- replication at least k
.Component Owner/Lieutenant
Component owners are listed in the description field on this Apache HBase JIRA link:https://issues.apache.org/jira/browse/HBASE#selectedTab=com.atlassian.jira.plugin.system.project%3Acomponents-panel[components] page.
The owners are listed in the 'Description' field rather than in the 'Component Lead' field because the latter only allows us to list one individual, whereas it is encouraged that components have multiple owners.
Owners or component lieutenants are volunteers who are (usually, but not necessarily) expert in their component domain and may have an agenda on how they think their Apache HBase component should evolve.
. Owners will try and review patches that land within their component's scope.
. If applicable, if an owner has an agenda, they will publish their goals or the design toward which they are driving their component.
If you would like to be volunteer as a component owner, just write the dev list and we'll sign you up.
Owners do not need to be committers.
[[hbase.commit.msg.format]]
== Commit Message format
We link:http://search-hadoop.com/m/Gwxwl10cFHa1[agreed] to the following Git commit message format:
[source]
----
HBASE-xxxxx <title>. (<contributor>)
----
If the person making the commit is the contributor, leave off the '(<contributor>)' element.


@ -144,15 +144,15 @@ In general, you need to weigh your options between smaller size and faster compr
The Hadoop shared library has a number of facilities, including compression libraries and fast CRC'ing. To make this facility available to HBase, do the following. HBase/Hadoop will fall back to use alternatives if it cannot find the native library versions -- or fail outright if you are asking for an explicit compressor and there is no alternative available.
If you see the following in your HBase logs, you know that HBase was unable to locate the Hadoop native libraries:
[source]
----
2014-08-07 09:26:20,139 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
----
If the libraries loaded successfully, the WARN message does not show.
Let's presume your Hadoop shipped with a native library that suits the platform you are running HBase on.
To check if the Hadoop native library is available to HBase, run the following tool (available in Hadoop 2.1 and greater):
[source]
----
$ ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
@ -165,7 +165,7 @@ lz4: false
bzip2: false
2014-08-26 13:15:38,863 INFO [main] util.ExitUtil: Exiting with status 1
----
The above shows that the native hadoop library is not available in the HBase context.
To fix the above, either copy the Hadoop native libraries local or symlink to them if the Hadoop and HBase installs are adjacent in the filesystem.
You could also point at their location by setting the `LD_LIBRARY_PATH` environment variable.
@ -173,20 +173,20 @@ You could also point at their location by setting the `LD_LIBRARY_PATH` environm
Where the JVM looks to find native libraries is "system dependent" (See `java.lang.System#loadLibrary(name)`). On Linux, by default, it is going to look in _lib/native/PLATFORM_ where `PLATFORM` is the label for the platform your HBase is installed on.
On a local linux machine, it seems to be the concatenation of the java properties `os.name` and `os.arch` followed by whether 32 or 64 bit.
HBase on startup prints out all of the java system properties so find the os.name and os.arch in the log.
For example:
[source]
----
...
2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
...
----
So in this case, the PLATFORM string is `Linux-amd64-64`.
Copying the Hadoop native libraries or symlinking at _lib/native/Linux-amd64-64_ will ensure they are found.
Check with the Hadoop _NativeLibraryChecker_.
Here is an example of how to point at the Hadoop libs with the `LD_LIBRARY_PATH` environment variable:
[source]
----
$ LD_LIBRARY_PATH=~/hadoop-2.5.0-SNAPSHOT/lib/native ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
@ -199,7 +199,7 @@ snappy: true /usr/lib64/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
----
Set the `LD_LIBRARY_PATH` environment variable in _hbase-env.sh_ when starting your HBase.
=== Compressor Configuration, Installation, and Use
@ -210,13 +210,13 @@ Before HBase can use a given compressor, its libraries need to be available.
Due to licensing issues, only GZ compression is available to HBase (via native Java libraries) in a default installation.
Other compression libraries are available via the shared library bundled with your hadoop.
The hadoop native library needs to be findable when HBase starts.
See
.Compressor Support On the Master
A new configuration setting was introduced in HBase 0.95, to check the Master to determine which data block encoders are installed and configured on it, and assume that the entire cluster is configured the same.
This option, `hbase.master.check.compression`, defaults to `true`.
This prevents the situation described in link:https://issues.apache.org/jira/browse/HBASE-6370[HBASE-6370], where a table is created or modified to support a codec that a region server does not support, leading to failures that take a long time to occur and are difficult to debug.
If `hbase.master.check.compression` is enabled, libraries for all desired compressors need to be installed and configured on the Master, even if the Master does not run a region server.
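If you need to opt out of this check (for example, on a Master where not every codec can be installed), a minimal sketch of the override in _hbase-site.xml_ is shown below; leaving the default of `true` is normally the right choice.

[source,xml]
----
<property>
  <name>hbase.master.check.compression</name>
  <!-- Defaults to true; set to false only if you accept the risk described in HBASE-6370. -->
  <value>false</value>
</property>
----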
@ -232,7 +232,7 @@ See <<brand.new.compressor,brand.new.compressor>>).
HBase cannot ship with LZO because of incompatibility between HBase, which uses an Apache Software License (ASL) and LZO, which uses a GPL license.
See the link:http://wiki.apache.org/hadoop/UsingLzoCompression[Using LZO
Compression] wiki page for information on configuring LZO support for HBase.
If you depend upon LZO compression, consider configuring your RegionServers to fail to start if LZO is not available.
See <<hbase.regionserver.codecs,hbase.regionserver.codecs>>.
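A sketch of such a guard in _hbase-site.xml_ follows; the codec list is an example, and a RegionServer will refuse to start if any codec named here cannot be loaded.

[source,xml]
----
<property>
  <name>hbase.regionserver.codecs</name>
  <!-- Example list: fail RegionServer startup unless LZO and Snappy are available. -->
  <value>lzo,snappy</value>
</property>
----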
@ -244,19 +244,19 @@ LZ4 support is bundled with Hadoop.
Make sure the hadoop shared library (libhadoop.so) is accessible when you start HBase.
After configuring your platform (see <<hbase.native.platform,hbase.native.platform>>), you can make a symbolic link from HBase to the native Hadoop libraries.
This assumes the two software installs are colocated.
For example, if my 'platform' is Linux-amd64-64:
[source,bourne]
----
$ cd $HBASE_HOME
$ mkdir lib/native
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64
----
Use the compression tool to check that LZ4 is installed on all nodes.
Start up (or restart) HBase.
Afterward, you can create and alter tables to enable LZ4 as a compression codec:
----
hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}
----
[[snappy.compression.installation]]
.Install Snappy Support
@ -347,7 +347,7 @@ You must specify either `-write` or `-update-read` as your first parameter, and
====
----
$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h
usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
Options:
-batchupdate Whether to use batch as opposed to separate


@ -564,7 +564,7 @@ If you are running a distributed operation, be sure to wait until HBase has shut
=== _hbase-site.xml_ and _hbase-default.xml_
Just as in Hadoop where you add site-specific HDFS configuration to the _hdfs-site.xml_ file, for HBase, site specific customizations go into the file _conf/hbase-site.xml_.
For the list of configurable properties, see <<hbase_default_configurations,hbase default configurations>> below or view the raw _hbase-default.xml_ source file in the HBase source code at _src/main/resources_.
Not all configuration options make it out to _hbase-default.xml_.
Configuration that is thought rare for anyone to change can exist only in code; the only way to turn up such configurations is by reading the source code itself.
@ -572,7 +572,7 @@ Configuration that it is thought rare anyone would change can exist only in code
Currently, changes here will require a cluster restart for HBase to notice the change.
// hbase/src/main/asciidoc
//
include::{docdir}/../../../target/asciidoc/hbase-default.adoc[]
[[hbase.env.sh]]
@ -604,7 +604,7 @@ ZooKeeper is where all these values are kept.
Thus clients require the location of the ZooKeeper ensemble before they can do anything else.
Usually the ensemble location is kept in the _hbase-site.xml_ file and is picked up by the client from the `CLASSPATH`.
If you are configuring an IDE to run a HBase client, you should include the _conf/_ directory on your classpath so _hbase-site.xml_ settings can be found (or add _src/test/resources_ to pick up the hbase-site.xml used by tests).
Minimally, a client of HBase needs several libraries in its `CLASSPATH` when connecting to a cluster, including:
[source]
@ -621,7 +621,7 @@ slf4j-log4j (slf4j-log4j12-1.5.8.jar)
zookeeper (zookeeper-3.4.2.jar)
----
An example basic _hbase-site.xml_ for client only might look as follows:
[source,xml]
----
<?xml version="1.0"?>
@ -1002,7 +1002,7 @@ See the link:http://docs.oracle.com/javase/6/docs/technotes/guides/management/ag
Historically, besides the port mentioned above, JMX opens two additional random TCP listening ports, which could lead to port conflict problems. (See link:https://issues.apache.org/jira/browse/HBASE-10289[HBASE-10289] for details)
As an alternative, you can use the coprocessor-based JMX implementation provided by HBase.
To enable it in 0.99 or above, add the following property in _hbase-site.xml_:
[source,xml]
----
@ -1033,7 +1033,7 @@ The registry port can be shared with connector port in most cases, so you only n
However if you want to use SSL communication, the 2 ports must be configured to different values.
By default the password authentication and SSL communication are disabled.
To enable password authentication, you need to update _hbase-env.sh_ as below:
[source,bash]
----
export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.authenticate=true \
@ -1060,7 +1060,7 @@ keytool -export -alias jconsole -keystore myKeyStore -file jconsole.cert
keytool -import -alias jconsole -keystore jconsoleKeyStore -file jconsole.cert
----
And then update _hbase-env.sh_ like below:
[source,bash]
----
@ -1082,7 +1082,7 @@ Finally start `jconsole` on the client using the key store:
jconsole -J-Djavax.net.ssl.trustStore=/home/tianq/jconsoleKeyStore
----
NOTE: To enable the HBase JMX implementation on the Master, you also need to add the following property in _hbase-site.xml_:
[source,xml]
----


@ -93,7 +93,7 @@ The colon character (`:`) delimits the column family from the column family _qua
|===
|Row Key |Time Stamp |ColumnFamily `contents` |ColumnFamily `anchor`|ColumnFamily `people`
|"com.cnn.www" |t9 | |anchor:cnnsi.com = "CNN" |
|"com.cnn.www" |t8 | |anchor:my.look.ca = "CNN.com" |
|"com.cnn.www" |t8 | |anchor:my.look.ca = "CNN.com" |
|"com.cnn.www" |t6 | contents:html = "<html>..." | |
|"com.cnn.www" |t5 | contents:html = "<html>..." | |
|"com.cnn.www" |t3 | contents:html = "<html>..." | |


@ -55,18 +55,18 @@ How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+?::
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.98.5-hadoop2</version>
</dependency>
----
+
.Maven Dependency for HBase 0.96
[source,xml]
----
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
----
+
.Maven Dependency for HBase 0.94
[source,xml]
@ -75,9 +75,9 @@ How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+?::
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>0.94.3</version>
</dependency>
----
=== Architecture
How does HBase handle Region-RegionServer assignment and locality?::
@ -91,7 +91,7 @@ Where can I learn about the rest of the configuration options?::
See <<configuration>>.
=== Schema Design / Data Access
How should I design my schema in HBase?::
See <<datamodel>> and <<schema>>.


@ -57,7 +57,7 @@ Prior to HBase 0.94.x, HBase expected the loopback IP address to be 127.0.0.1. U
.Example /etc/hosts File for Ubuntu
====
The following _/etc/hosts_ file works correctly for HBase 0.94.x and earlier, on Ubuntu. Use this as a template if you run into trouble.
[listing]
----
127.0.0.1 localhost

File diff suppressed because it is too large.


@ -29,9 +29,9 @@
:icons: font
:experimental:
* 2006: link:http://research.google.com/archive/bigtable.html[BigTable] paper published by Google.
* 2006 (end of year): HBase development starts.
* 2008: HBase becomes Hadoop sub-project.
* 2010: HBase becomes Apache top-level project.
:numbered:


@ -29,7 +29,7 @@
:experimental:
HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase.
It works in two basic modes -- a read-only inconsistency identifying mode and a multi-phase read-write repair mode.
=== Running hbck to identify inconsistencies
@ -45,7 +45,7 @@ At the end of the commands output it prints OK or tells you the number of INCONS
You may also want to run hbck a few times because some inconsistencies can be transient (e.g.
the cluster is starting up or a region is splitting). Operationally you may want to run hbck regularly and set up alerting (e.g.
via nagios) if it repeatedly reports inconsistencies. A run of hbck will report a list of inconsistencies along with a brief description of the regions and tables affected.
Using the `-details` option will report more details, including a representative listing of all the splits present in all the tables.
[source,bourne]
----
@ -66,9 +66,9 @@ $ ./bin/hbase hbck TableFoo TableBar
=== Inconsistencies
If after several runs, inconsistencies continue to be reported, you may have encountered a corruption.
These should be rare, but in the event they occur newer versions of HBase include the hbck tool enabled with automatic repair options.
There are two invariants that when violated create inconsistencies in HBase:
* HBase's region consistency invariant is satisfied if every region is assigned and deployed on exactly one region server, and all places where this state kept is in accordance.
* HBase's table integrity invariant is satisfied if for each table, every possible row key resolves to exactly one region.
@ -77,20 +77,20 @@ Repairs generally work in three phases -- a read-only information gathering phas
Starting from version 0.90.0, hbck could detect region consistency problems and report on a subset of possible table integrity problems.
It also included the ability to automatically fix the most common inconsistency, region assignment and deployment consistency problems.
This repair could be done by using the `-fix` command line option.
This repair closes regions if they are open on the wrong server or on multiple region servers, and also assigns regions to region servers if they are not open.
Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options are introduced to aid repairing a corrupted HBase.
This hbck sometimes goes by the nickname ``uberhbck''. Each particular version of uber hbck is compatible with HBase releases of the same major version (the 0.90.7 uberhbck can repair a 0.90.4). However, versions <=0.90.6 and versions <=0.92.1 may require restarting the master or failing over to a backup master.
=== Localized repairs
When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first.
These are generally region consistency repairs -- localized single region repairs, that only modify in-memory data, ephemeral zookeeper data, or patch holes in the META table.
Region consistency requires that the HBase instance has the state of the region's data in HDFS (.regioninfo files), the region's row in the `hbase:meta` table, and the region's deployment/assignments on region servers and the master in accordance.
Options for repairing region consistency include:
* `-fixAssignments` (equivalent to the 0.90 `-fix` option) repairs unassigned, incorrectly assigned or multiply assigned regions.
* `-fixMeta` which removes meta rows when corresponding regions are not present in HDFS and adds new meta rows if the regions are present in HDFS while not in META. To fix deployment and assignment problems you can run this command:
[source,bourne]
----
@ -205,8 +205,8 @@ However, there could be some lingering offline split parents sometimes.
They are in META, in HDFS, and not deployed.
But HBase can't clean them up.
In this case, you can use the `-fixSplitParents` option to reset them in META to be online and not split.
Therefore, hbck can merge them with other regions if fixing overlapping regions option is used.
This option should not normally be used, and it is not in `-fixAll`.
:numbered:


@ -31,50 +31,50 @@
[[other.info.videos]]
=== HBase Videos
.Introduction to HBase
* link:http://www.cloudera.com/content/cloudera/en/resources/library/presentation/chicago_data_summit_apache_hbase_an_introduction_todd_lipcon.html[Introduction to HBase] by Todd Lipcon (Chicago Data Summit 2011).
* link:http://www.cloudera.com/videos/intorduction-hbase-todd-lipcon[Introduction to HBase] by Todd Lipcon (2010).
link:http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-building-realtime-big-data-services-at-facebook-with-hadoop-and-hbase[Building Real Time Services at Facebook with HBase] by Jonathan Gray (Hadoop World 2011).
link:http://www.cloudera.com/videos/hw10_video_how_stumbleupon_built_and_advertising_platform_using_hbase_and_hadoop[HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon] by JD Cryans (Hadoop World 2010).
[[other.info.pres]]
=== HBase Presentations (Slides)
link:http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-advanced-hbase-schema-design.html[Advanced HBase Schema Design] by Lars George (Hadoop World 2011).
link:http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction[Introduction to HBase] by Todd Lipcon (Chicago Data Summit 2011).
link:http://www.slideshare.net/cloudera/hw09-practical-h-base-getting-the-most-from-your-h-base-install[Getting The Most From Your HBase Install] by Ryan Rawson, Jonathan Gray (Hadoop World 2009).
[[other.info.papers]]
=== HBase Papers
link:http://research.google.com/archive/bigtable.html[BigTable] by Google (2006).
link:http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html[HBase and HDFS Locality] by Lars George (2010).
link:http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf[No Relation: The Mixed Blessings of Non-Relational Databases] by Ian Varley (2009).
[[other.info.sites]]
=== HBase Sites
link:http://www.cloudera.com/blog/category/hbase/[Cloudera's HBase Blog] has a lot of links to useful HBase information.
* link:http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/[CAP Confusion] is a relevant entry for background information on distributed storage systems.
link:http://wiki.apache.org/hadoop/HBase/HBasePresentations[HBase Wiki] has a page with a number of presentations.
link:http://refcardz.dzone.com/refcardz/hbase[HBase RefCard] from DZone.
[[other.info.books]]
=== HBase Books
link:http://shop.oreilly.com/product/0636920014348.do[HBase: The Definitive Guide] by Lars George.
[[other.info.books.hadoop]]
=== Hadoop Books
link:http://shop.oreilly.com/product/9780596521981.do[Hadoop: The Definitive Guide] by Tom White.
:numbered:

View File

@ -102,12 +102,12 @@ Are all the network interfaces functioning correctly? Are you sure? See the Trou
[[perf.network.call_me_maybe]]
=== Network Consistency and Partition Tolerance
The link:http://en.wikipedia.org/wiki/CAP_theorem[CAP Theorem] states that a distributed system can maintain two out of the following three characteristics:
- *C*onsistency -- all nodes see the same data.
- *A*vailability -- every request receives a response about whether it succeeded or failed.
- *P*artition tolerance -- the system continues to operate even if some of its components become unavailable to the others.
Where a decision has to be made, HBase favors consistency and partition tolerance. Coda Hale explains why partition tolerance is so important in http://codahale.com/you-cant-sacrifice-partition-tolerance/.
Robert Yokota used an automated testing framework called link:https://aphyr.com/tags/jepsen[Jepsen] to test HBase's partition tolerance in the face of network partitions, using techniques modeled after Aphyr's link:https://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions[Call Me Maybe] series. The results, available as a link:https://rayokota.wordpress.com/2015/09/30/call-me-maybe-hbase/[blog post] and an link:https://rayokota.wordpress.com/2015/09/30/call-me-maybe-hbase-addendum/[addendum], show that HBase performs correctly.
@ -782,7 +782,8 @@ Be aware that `Table.delete(Delete)` doesn't use the writeBuffer.
It will execute a RegionServer RPC with each invocation.
For a large number of deletes, consider `Table.delete(List)`.
See
+++<a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete%28org.apache.hadoop.hbase.client.Delete%29">hbase.client.Delete</a>+++.
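A minimal, hedged sketch of the batched form follows; the table name, row keys, and `deleteRows` helper are illustrative and not part of any HBase API:

[source,java]
----
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchDeleteExample {
  // Assumes `connection` is an open Connection and the rows exist in "mytable".
  static void deleteRows(Connection connection) throws IOException {
    try (Table table = connection.getTable(TableName.valueOf("mytable"))) {
      List<Delete> deletes = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        deletes.add(new Delete(Bytes.toBytes("row-" + i)));
      }
      table.delete(deletes); // one batched call instead of 1000 per-Delete RPCs
    }
  }
}
----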
[[perf.hdfs]]
== HDFS

View File

@ -47,7 +47,7 @@ For more background on how we arrived at this spec., see link:https://docs.googl
. A wire-format we can evolve
. A format that does not require our rewriting server core or radically changing its current architecture (for later).
=== TODO
@ -58,7 +58,7 @@ For more background on how we arrived at this spec., see link:https://docs.googl
. Diagram on how it works
. A grammar that succinctly describes the wire-format.
Currently we have these words and the content of the rpc protobuf idl, but a grammar for the back and forth would help with grokking rpc.
Also, a little state machine on client/server interactions would help with understanding (and ensuring correct implementation).
=== RPC
@ -79,7 +79,7 @@ link:https://git-wip-us.apache.org/repos/asf?p=hbase.git;a=blob;f=hbase-protocol
Client initiates connection.
===== Client
On connection setup, client sends a preamble followed by a connection header.
.<preamble>
[source]
@ -191,7 +191,7 @@ Doing header+param rather than a single protobuf Message with both header and pa
. Is closer to what we currently have
. Having a single fat pb requires extra copying, putting the already-pb'd param into the body of the fat request pb (and the same when making the result)
. We can decide whether to accept the request or not before we read the param; for example, the request might be low priority.
As is, we read header+param in one go as the server is currently implemented, so this is a TODO.
The advantages are minor.
If later, fat request has clear advantage, can roll out a v2 later.
@ -205,13 +205,13 @@ Codec must implement hbase's `Codec` Interface.
After connection setup, all passed cellblocks will be sent with this codec.
The server will return cellblocks using this same codec as long as the codec is on the servers' CLASSPATH (else you will get `UnsupportedCellCodecException`).
To change the default codec, set `hbase.client.default.rpc.codec`.
To disable cellblocks completely and to go pure protobuf, set the default to the empty String and do not specify a codec in your Configuration.
So, set `hbase.client.default.rpc.codec` to the empty string and do not set `hbase.client.rpc.codec`.
This will cause the client to connect to the server with no codec specified.
If a server sees no codec, it will return all responses in pure protobuf.
Running pure protobuf all the time will be slower than running with cellblocks.
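As a hedged illustration, a client-side _hbase-site.xml_ fragment for the pure-protobuf setup described above might look like this (verify the property name and default against your release):

[source,xml]
----
<!-- Empty value disables cellblocks; do not set hbase.client.rpc.codec. -->
<property>
  <name>hbase.client.default.rpc.codec</name>
  <value></value>
</property>
----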
.Compression
Uses Hadoop's compression codecs.

View File

@ -733,10 +733,12 @@ Composite Rowkey With Numeric Substitution:
For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES.
The rowkey of LOG_TYPES would be:
* `[type]` (e.g., byte indicating hostname vs. event-type)
* `[bytes]` variable length bytes for raw hostname or event-type.
A column for this rowkey could be a long with an assigned number, which could be obtained
by using an
+++<a href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long%29">HBase counter</a>+++.
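A hedged sketch of that lookup-table maintenance is shown below; the table name comes from the example above, while the column family, qualifiers, and sequence row are invented for illustration:

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LogTypeAssigner {
  // Assumes LOG_TYPES exists with column family "d"; "__seq__" is an invented sequence row.
  static long assignNumber(Connection connection, byte[] typeRowkey) throws IOException {
    try (Table table = connection.getTable(TableName.valueOf("LOG_TYPES"))) {
      // Atomic counter supplies the next assigned number.
      long nextId = table.incrementColumnValue(Bytes.toBytes("__seq__"),
          Bytes.toBytes("d"), Bytes.toBytes("next"), 1);
      // Store the assigned number under the [type][bytes] rowkey described above.
      Put put = new Put(typeRowkey);
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("id"), Bytes.toBytes(nextId));
      table.put(put);
      return nextId;
    }
  }
}
----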
So the resulting composite rowkey would be:
@ -751,7 +753,9 @@ In either the Hash or Numeric substitution approach, the raw values for hostname
This effectively is the OpenTSDB approach.
What OpenTSDB does is re-write data and pack rows into columns for certain time-periods.
For a detailed explanation, see: link:http://opentsdb.net/schema.html, and
+++<a href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-lessons-learned-from-opentsdb.html">Lessons Learned from OpenTSDB</a>+++
from HBaseCon2012.
But this is how the general concept works: data is ingested, for example, in this manner...
@ -854,14 +858,14 @@ The ORDER table's rowkey was described above: <<schema.casestudies.custorder,sch
The SHIPPING_LOCATION's composite rowkey would be something like this:
* `[order-rowkey]`
* `[shipping location number]` (e.g., 1st location, 2nd, etc.)
The LINE_ITEM table's composite rowkey would be something like this:
* `[order-rowkey]`
* `[shipping location number]` (e.g., 1st location, 2nd, etc.)
* `[line item number]` (e.g., 1st lineitem, 2nd, etc.)
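A hedged sketch of composing these keys with the `Bytes` utility follows; the fixed-width `int` encoding for the location and line-item numbers is an assumption made for illustration:

[source,java]
----
import org.apache.hadoop.hbase.util.Bytes;

public class OrderKeys {
  // SHIPPING_LOCATION rowkey: [order-rowkey][shipping location number]
  static byte[] shippingLocationRowkey(byte[] orderRowkey, int locationNumber) {
    return Bytes.add(orderRowkey, Bytes.toBytes(locationNumber));
  }

  // LINE_ITEM rowkey: [order-rowkey][shipping location number][line item number]
  static byte[] lineItemRowkey(byte[] orderRowkey, int locationNumber, int lineItemNumber) {
    return Bytes.add(orderRowkey, Bytes.toBytes(locationNumber), Bytes.toBytes(lineItemNumber));
  }
}
----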
Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase.
The cons of such an approach is that to retrieve information about any Order, you will need:
@ -879,21 +883,21 @@ With this approach, there would exist a single table ORDER that would contain
The Order rowkey was described above: <<schema.casestudies.custorder,schema.casestudies.custorder>>
* `[order-rowkey]`
* `[ORDER record type]`
The ShippingLocation composite rowkey would be something like this:
* `[order-rowkey]`
* `[SHIPPING record type]`
* `[shipping location number]` (e.g., 1st location, 2nd, etc.)
The LineItem composite rowkey would be something like this:
* `[order-rowkey]`
* `[LINE record type]`
* `[shipping location number]` (e.g., 1st location, 2nd, etc.)
* `[line item number]` (e.g., 1st lineitem, 2nd, etc.)
[[schema.casestudies.custorder.obj.denorm]]
===== Denormalized
@ -902,9 +906,9 @@ A variant of the Single Table With Record Types approach is to denormalize and f
The LineItem composite rowkey would be something like this:
* `[order-rowkey]`
* `[LINE record type]`
* `[line item number]` (e.g., 1st lineitem, 2nd, etc.; care must be taken that these are unique across the entire order)
and the LineItem columns would be something like this:

View File

@ -1332,11 +1332,21 @@ static Table createTableAndWriteDataWithLabels(TableName tableName, String... la
----
====
[[reading_cells_with_labels]]
==== Reading Cells with Labels
When you issue a Scan or Get, HBase uses your default set of authorizations to
filter out cells that you do not have access to. A superuser can set the default
set of authorizations for a given user by using the `set_auths` HBase Shell command
or the
link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/security/visibility/VisibilityClient.html#setAuths(org.apache.hadoop.hbase.client.Connection,%20java.lang.String\[\],%20java.lang.String)[VisibilityClient.setAuths()] method.
You can specify a different authorization during the Scan or Get, by passing the
AUTHORIZATIONS option in HBase Shell, or the
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setAuthorizations%28org.apache.hadoop.hbase.security.visibility.Authorizations%29[setAuthorizations()]
method if you use the API. This authorization will be combined with your default
set as an additional filter. It will further filter your results, rather than
giving you additional authorization.
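A hedged Java sketch of passing extra authorizations on a Get is shown below; the table reference, row, and label names are illustrative:

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.visibility.Authorizations;
import org.apache.hadoop.hbase.util.Bytes;

public class LabeledReadExample {
  static Result readWithLabels(Table table) throws IOException {
    Get get = new Get(Bytes.toBytes("row1"));
    // Combined with the caller's default authorizations as an additional filter.
    get.setAuthorizations(new Authorizations("secret", "probationary"));
    return table.get(get);
  }
}
----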
.HBase Shell
====
@ -1582,8 +1592,10 @@ Rotate the Master Key::
=== Secure Bulk Load
Bulk loading in secure mode is a bit more involved than normal setup, since the client has to transfer the ownership of the files generated from the MapReduce job to HBase.
Secure bulk loading is implemented by a coprocessor, named
link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/security/access/SecureBulkLoadEndpoint.html[SecureBulkLoadEndpoint],
which uses a staging directory configured by the configuration property `hbase.bulkload.staging.dir`, which defaults to
_/tmp/hbase-staging/_.
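A hedged _hbase-site.xml_ sketch follows; the staging directory property is the one named above, while the coprocessor registration line is an assumption drawn from typical secure-cluster setups and should be checked against your release's security documentation:

[source,xml]
----
<property>
  <name>hbase.bulkload.staging.dir</name>
  <value>/tmp/hbase-staging</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint</value>
</property>
----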
.Secure Bulk Load Algorithm

View File

@ -31,12 +31,12 @@
:experimental:
link:https://issues.apache.org/jira/browse/HBASE-6449[HBASE-6449] added support for tracing requests through HBase, using the open source tracing library, link:http://htrace.incubator.apache.org/[HTrace].
Setting up tracing is quite simple; however, it currently requires some very minor changes to your client code (it would not be very difficult to remove this requirement).
[[tracing.spanreceivers]]
=== SpanReceivers
The tracing system works by collecting information in structures called 'Spans'. It is up to you to choose how you want to receive this information by implementing the `SpanReceiver` interface, which defines one method:
[source]
----
@ -45,12 +45,12 @@ public void receiveSpan(Span span);
----
This method serves as a callback whenever a span is completed.
HTrace allows you to use as many SpanReceivers as you want so you can easily send trace information to multiple destinations.
Configure which SpanReceivers you'd like to use by putting a comma-separated list of the fully-qualified class names of classes implementing `SpanReceiver` in the _hbase-site.xml_ property `hbase.trace.spanreceiver.classes`.
HTrace includes a `LocalFileSpanReceiver` that writes all span information to local files in a JSON-based format.
The `LocalFileSpanReceiver` looks in _hbase-site.xml_ for a `hbase.local-file-span-receiver.path` property with a value describing the name of the file to which nodes should write their span information.
[source]
----
@ -65,7 +65,7 @@ The `LocalFileSpanReceiver` looks in _hbase-site.xml_ for a `hbase.local-fi
</property>
----
HTrace also provides `ZipkinSpanReceiver`, which converts spans to link:http://github.com/twitter/zipkin[Zipkin] span format and sends them to a Zipkin server. In order to use this span receiver, you need to install the jar of htrace-zipkin to your HBase's classpath on all of the nodes in your cluster.
_htrace-zipkin_ is published to the link:http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.htrace%22%20AND%20a%3A%22htrace-zipkin%22[Maven central repository]. You could get the latest version from there or just build it locally (see the link:http://htrace.incubator.apache.org/[HTrace] homepage for information on how to do this) and then copy it out to all nodes.
@ -77,11 +77,11 @@ _htrace-zipkin_ is published to the link:http://search.maven.org/#search%7Cgav%7
<property>
<name>hbase.trace.spanreceiver.classes</name>
<value>org.apache.htrace.impl.ZipkinSpanReceiver</value>
</property>
<property>
<name>hbase.htrace.zipkin.collector-hostname</name>
<value>localhost</value>
</property>
<property>
<name>hbase.htrace.zipkin.collector-port</name>
<value>9410</value>
@ -93,7 +93,7 @@ If you do not want to use the included span receivers, you are encouraged to wri
[[tracing.client.modifications]]
== Client Modifications
In order to turn on tracing in your client code, you must initialize the module that sends spans to the receiver once per client process.
[source,java]
----
@ -107,7 +107,7 @@ private SpanReceiverHost spanReceiverHost;
----
Then you simply start tracing span before requests you think are interesting, and close it when the request is done.
For example, if you wanted to trace all of your get operations, you change this:
[source,java]
----
@ -118,7 +118,7 @@ Get get = new Get(Bytes.toBytes("r1"));
Result res = table.get(get);
----
into:
[source,java]
----
@ -133,7 +133,7 @@ try {
}
----
If you wanted to trace half of your 'get' operations, you would pass in:
[source,java]
----
@ -142,12 +142,12 @@ new ProbabilitySampler(0.5)
----
in lieu of `Sampler.ALWAYS` to `Trace.startSpan()`.
See the HTrace _README_ for more information on Samplers.
[[tracing.client.shell]]
== Tracing from HBase Shell
You can use the `trace` command for tracing requests from HBase Shell. The `trace 'start'` command turns on tracing and the `trace 'stop'` command turns off tracing.
[source]
----
@ -158,7 +158,7 @@ hbase(main):003:0> trace 'stop'
----
`trace 'start'` and `trace 'stop'` always return a boolean value representing whether or not there is ongoing tracing.
As a result, `trace 'stop'` returns false on success. `trace 'status'` simply returns whether or not tracing is turned on.
[source]
----

View File

@ -47,7 +47,7 @@ public class MyHBaseDAO {
Put put = createPut(obj);
table.put(put);
}
private static Put createPut(HBaseTestObj obj) {
Put put = new Put(Bytes.toBytes(obj.getRowKey()));
put.add(Bytes.toBytes("CF"), Bytes.toBytes("CQ-1"),
@ -96,7 +96,7 @@ public class TestMyHbaseDAOData {
These tests ensure that your `createPut` method creates, populates, and returns a `Put` object with expected values.
Of course, JUnit can do much more than this.
For an introduction to JUnit, see link:https://github.com/junit-team/junit/wiki/Getting-started.
== Mockito
@ -133,7 +133,7 @@ public class TestMyHBaseDAO{
Configuration config = HBaseConfiguration.create();
@Mock
Connection connection = ConnectionFactory.createConnection(config);
@Mock
private Table table;
@Captor
private ArgumentCaptor putCaptor;
@ -150,7 +150,7 @@ public class TestMyHBaseDAO{
MyHBaseDAO.insertRecord(table, obj);
verify(table).put(putCaptor.capture());
Put put = putCaptor.getValue();
assertEquals(Bytes.toString(put.getRow()), obj.getRowKey());
assert(put.has(Bytes.toBytes("CF"), Bytes.toBytes("CQ-1")));
assert(put.has(Bytes.toBytes("CF"), Bytes.toBytes("CQ-2")));
@ -197,7 +197,7 @@ public class MyReducer extends TableReducer<Text, Text, ImmutableBytesWritable>
}
----
To test this code, the first step is to add a dependency to MRUnit to your Maven POM file.
[source,xml]
----
@ -225,16 +225,16 @@ public class MyReducerTest {
MyReducer reducer = new MyReducer();
reduceDriver = ReduceDriver.newReduceDriver(reducer);
}
@Test
public void testHBaseInsert() throws IOException {
String strKey = "RowKey-1", strValue = "DATA", strValue1 = "DATA1",
strValue2 = "DATA2";
List<Text> list = new ArrayList<Text>();
list.add(new Text(strValue));
list.add(new Text(strValue1));
list.add(new Text(strValue2));
//since in our case all that the reducer is doing is appending the records that the mapper
//sends it, we should get the following back
String expectedOutput = strValue + strValue1 + strValue2;
//Setup Input, mimic what mapper would have passed
@ -242,10 +242,10 @@ strValue2 = "DATA2";
reduceDriver.withInput(new Text(strKey), list);
//run the reducer and get its output
List<Pair<ImmutableBytesWritable, Writable>> result = reduceDriver.run();
//extract key from result and verify
assertEquals(Bytes.toString(result.get(0).getFirst().get()), strKey);
//extract value for CF/QUALIFIER and verify
Put a = (Put)result.get(0).getSecond();
String c = Bytes.toString(a.get(CF, QUALIFIER).get(0).getValue());
@ -283,7 +283,7 @@ Check the versions to be sure they are appropriate.
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
@ -309,7 +309,7 @@ public class MyHBaseIntegrationTest {
private static HBaseTestingUtility utility;
byte[] CF = "CF".getBytes();
byte[] QUALIFIER = "CQ-1".getBytes();
@Before
public void setup() throws Exception {
utility = new HBaseTestingUtility();
@ -343,7 +343,7 @@ This code creates an HBase mini-cluster and starts it.
Next, it creates a table called `MyTest` with one column family, `CF`.
A record is inserted, a Get is performed from the same table, and the insertion is verified.
NOTE: Starting the mini-cluster takes about 20-30 seconds, but that should be appropriate for integration testing.
To use an HBase mini-cluster on Microsoft Windows, you need to use a Cygwin environment.

View File

@ -45,7 +45,7 @@ HBase does not ship with a _zoo.cfg_ so you will need to browse the _conf_ direc
You must at least list the ensemble servers in _hbase-site.xml_ using the `hbase.zookeeper.quorum` property.
This property defaults to a single ensemble member at `localhost` which is not suitable for a fully distributed HBase.
(It binds to the local machine only and remote clients will not be able to connect).
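A hedged _hbase-site.xml_ sketch listing a three-member ensemble follows; the hostnames are placeholders:

[source,xml]
----
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
----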
.How many ZooKeepers should I run?
[NOTE]
@ -54,7 +54,7 @@ You can run a ZooKeeper ensemble that comprises 1 node only but in production it
Also, run an odd number of machines.
In ZooKeeper, an even number of peers is supported, but it is normally not used because an even sized ensemble requires, proportionally, more peers to form a quorum than an odd sized ensemble requires.
For example, an ensemble with 4 peers requires 3 to form a quorum, while an ensemble with 5 also requires 3 to form a quorum.
Thus, an ensemble of 5 allows 2 peers to fail, and thus is more fault tolerant than the ensemble of 4, which allows only 1 down peer.
Give each ZooKeeper server around 1GB of RAM, and if possible, its own dedicated disk (A dedicated disk is the best thing you can do to ensure a performant ZooKeeper ensemble). For very heavily loaded clusters, run ZooKeeper servers on separate machines from RegionServers (DataNodes and TaskTrackers).
====
@ -102,7 +102,7 @@ In the example below we have ZooKeeper persist to _/user/local/zookeeper_.
====
The newer version, the better.
For example, some folks have been bitten by link:https://issues.apache.org/jira/browse/ZOOKEEPER-1277[ZOOKEEPER-1277].
If running zookeeper 3.5+, you can ask hbase to make use of the new multi operation by enabling <<hbase.zookeeper.usemulti,hbase.zookeeper.useMulti>> in your _hbase-site.xml_.
====
.ZooKeeper Maintenance
@ -140,7 +140,7 @@ Just make sure to set `HBASE_MANAGES_ZK` to `false` if you want it to stay
For more information about running a distinct ZooKeeper cluster, see the ZooKeeper link:http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html[Getting
Started Guide].
Additionally, see the link:http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7[ZooKeeper Wiki] or the link:http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_zkMulitServerSetup[ZooKeeper
documentation] for more information on ZooKeeper sizing.
documentation] for more information on ZooKeeper sizing.
[[zk.sasl.auth]]
== SASL Authentication with ZooKeeper
@ -148,24 +148,24 @@ Additionally, see the link:http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7[ZooKee
Newer releases of Apache HBase (>= 0.92) will support connecting to a ZooKeeper Quorum that supports SASL authentication (which is available in Zookeeper versions 3.4.0 or later).
This describes how to set up HBase to mutually authenticate with a ZooKeeper Quorum.
ZooKeeper/HBase mutual authentication (link:https://issues.apache.org/jira/browse/HBASE-2418[HBASE-2418]) is required as part of a complete secure HBase configuration (link:https://issues.apache.org/jira/browse/HBASE-3025[HBASE-3025]). For simplicity of explication, this section ignores additional configuration required (Secure HDFS and Coprocessor configuration). It's recommended to begin with an HBase-managed Zookeeper configuration (as opposed to a standalone Zookeeper quorum) for ease of learning.
=== Operating System Prerequisites
You need to have a working Kerberos KDC setup.
For each `$HOST` that will run a ZooKeeper server, you should have a principal `zookeeper/$HOST`.
For each such host, add a service key (using the `kadmin` or `kadmin.local` tool's `ktadd` command) for `zookeeper/$HOST` and copy this file to `$HOST`, and make it readable only to the user that will run zookeeper on `$HOST`.
Note the location of this file, which we will use below as _$PATH_TO_ZOOKEEPER_KEYTAB_.
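A hedged sketch of that sequence follows; `addprinc -randkey` and the keytab file name are assumptions about a typical MIT Kerberos setup, not requirements from this guide:

[source,bourne]
----
$ kadmin
kadmin: addprinc -randkey zookeeper/$HOST
kadmin: ktadd -k zookeeper.keytab zookeeper/$HOST
kadmin: quit
$ # copy zookeeper.keytab to $HOST, readable only by the user that runs zookeeper
----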
Similarly, for each `$HOST` that will run an HBase server (master or regionserver), you should have a principal: `hbase/$HOST`.
For each host, add a keytab file called _hbase.keytab_ containing a service key for `hbase/$HOST`, copy this file to `$HOST`, and make it readable only to the user that will run an HBase service on `$HOST`.
Note the location of this file, which we will use below as _$PATH_TO_HBASE_KEYTAB_.
Each user who will be an HBase client should also be given a Kerberos principal.
This principal should usually have a password assigned to it (as opposed to, as with the HBase servers, a keytab file) which only this user knows.
The client's principal's `maxrenewlife` should be set so that it can be renewed enough for the user to complete their HBase client processes.
For example, if a user runs a long-running HBase client process that takes at most 3 days, we might create this user's principal within `kadmin` with: `addprinc -maxrenewlife 3days`.
The Zookeeper client and server libraries manage their own ticket refreshment by running threads that wake up periodically to do the refreshment.
On each host that will run an HBase client (e.g. `hbase shell`), add the following file to the HBase home directory's _conf_ directory:
@ -210,7 +210,7 @@ where the _$PATH_TO_HBASE_KEYTAB_ and _$PATH_TO_ZOOKEEPER_KEYTAB_ files are what
The `Server` section will be used by the Zookeeper quorum server, while the `Client` section will be used by the HBase master and regionservers.
The path to this file should be substituted for the text _$HBASE_SERVER_CONF_ in the _hbase-env.sh_ listing below.
The path to this file should be substituted for the text _$CLIENT_CONF_ in the _hbase-env.sh_ listing below.
Modify your _hbase-env.sh_ to include the following:
@ -257,7 +257,7 @@ Modify your _hbase-site.xml_ on each node that will run zookeeper, master or reg
where `$ZK_NODES` is the comma-separated list of hostnames of the Zookeeper Quorum hosts.
Start your hbase cluster by running one or more of the following set of commands on the appropriate hosts:
----
@ -344,7 +344,7 @@ Server {
----
where `$HOST` is the hostname of each Quorum host.
We will refer to the full pathname of this file as _$ZK_SERVER_CONF_ below.
Start your Zookeepers on each Zookeeper Quorum host with:
@ -354,7 +354,7 @@ Start your Zookeepers on each Zookeeper Quorum host with:
SERVER_JVMFLAGS="-Djava.security.auth.login.config=$ZK_SERVER_CONF" bin/zkServer start
----
Start your HBase cluster by running one or more of the following set of commands on the appropriate nodes:
----
@ -415,7 +415,7 @@ mvn clean test -Dtest=TestZooKeeperACL
----
Then configure HBase as described above.
Manually edit target/cached_classpath.txt (see below):
----
@ -439,7 +439,7 @@ mv target/tmp.txt target/cached_classpath.txt
==== Set JAAS configuration programmatically
This would avoid the need for a separate Hadoop jar that fixes link:https://issues.apache.org/jira/browse/HADOOP-7070[HADOOP-7070].
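As a hedged illustration of what setting the JAAS configuration programmatically could mean, a client might install a `javax.security.auth.login.Configuration` subclass before any ZooKeeper connection is made; the class name, keytab path, and principal below are invented for the sketch:

[source,java]
----
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

public class ProgrammaticJaasConfig extends Configuration {
  @Override
  public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
    // ZooKeeper clients look up the "Client" section described earlier.
    Map<String, String> options = new HashMap<>();
    options.put("useKeyTab", "true");
    options.put("storeKey", "true");
    options.put("keyTab", "/etc/hbase/conf/hbase.keytab");  // illustrative path
    options.put("principal", "hbase/myhost@EXAMPLE.COM");   // illustrative principal
    return new AppConfigurationEntry[] {
      new AppConfigurationEntry("com.sun.security.auth.module.Krb5LoginModule",
          AppConfigurationEntry.LoginModuleControlFlag.REQUIRED, options)
    };
  }

  public static void install() {
    // Must run before the HBase client creates its ZooKeeper connection.
    Configuration.setConfiguration(new ProgrammaticJaasConfig());
  }
}
----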
==== Elimination of `kerberos.removeHostFromPrincipal` and`kerberos.removeRealmFromPrincipal`