From a1fe1e09642355aa8165c11da3f759d621da1421 Mon Sep 17 00:00:00 2001 From: Misty Stanley-Jones Date: Mon, 22 Dec 2014 15:26:59 +1000 Subject: [PATCH] HBASE-12738 Chunk Ref Guide into file-per-chapter --- src/main/docbkx/architecture.xml | 3489 ++++++++++++++ src/main/docbkx/asf.xml | 44 + src/main/docbkx/book.xml | 6021 +------------------------ src/main/docbkx/compression.xml | 535 +++ src/main/docbkx/configuration.xml | 6 +- src/main/docbkx/customization-pdf.xsl | 129 + src/main/docbkx/datamodel.xml | 865 ++++ src/main/docbkx/faq.xml | 270 ++ src/main/docbkx/hbase-default.xml | 538 +++ src/main/docbkx/hbase_history.xml | 41 + src/main/docbkx/hbck_in_depth.xml | 237 + src/main/docbkx/mapreduce.xml | 630 +++ src/main/docbkx/orca.xml | 47 + src/main/docbkx/other_info.xml | 83 + src/main/docbkx/performance.xml | 2 +- src/main/docbkx/sql.xml | 40 + src/main/docbkx/upgrading.xml | 2 +- src/main/docbkx/ycsb.xml | 36 + 18 files changed, 7008 insertions(+), 6007 deletions(-) create mode 100644 src/main/docbkx/architecture.xml create mode 100644 src/main/docbkx/asf.xml create mode 100644 src/main/docbkx/compression.xml create mode 100644 src/main/docbkx/customization-pdf.xsl create mode 100644 src/main/docbkx/datamodel.xml create mode 100644 src/main/docbkx/faq.xml create mode 100644 src/main/docbkx/hbase-default.xml create mode 100644 src/main/docbkx/hbase_history.xml create mode 100644 src/main/docbkx/hbck_in_depth.xml create mode 100644 src/main/docbkx/mapreduce.xml create mode 100644 src/main/docbkx/orca.xml create mode 100644 src/main/docbkx/other_info.xml create mode 100644 src/main/docbkx/sql.xml create mode 100644 src/main/docbkx/ycsb.xml diff --git a/src/main/docbkx/architecture.xml b/src/main/docbkx/architecture.xml new file mode 100644 index 00000000000..16b298acae3 --- /dev/null +++ b/src/main/docbkx/architecture.xml @@ -0,0 +1,3489 @@ + + + + + Architecture +
+ Overview +
+ NoSQL? + HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which + supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an + example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, + HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS, + such as typed columns, secondary indexes, triggers, and advanced query languages. + + However, HBase has many features which support both linear and modular scaling. HBase clusters expand + by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 + RegionServers, for example, it doubles in terms of both storage and processing capacity. + An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and the best + performance requires specialized hardware and storage devices. HBase features of note are: + + Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This + makes it very suitable for tasks such as high-speed counter aggregation. + Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are + automatically split and re-distributed as your data grows. + Automatic RegionServer failover + Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system. + MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both + source and sink. + Java Client API: HBase supports an easy-to-use Java API for programmatic access. + Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends. + Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume query optimization. + Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics. + + +
+ +
+ When Should I Use HBase? + HBase isn't suitable for every problem. + First, make sure you have enough data. If you have hundreds of millions or billions of rows, then + HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS + might be a better choice due to the fact that all of your data might wind up on a single node (or two) and + the rest of the cluster may be sitting idle. + + Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, + secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be + "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a + complete redesign as opposed to a port. + + Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than + 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode. + + HBase can run quite well stand-alone on a laptop - but this should be considered a development + configuration only. + +
+
+ What Is The Difference Between HBase and Hadoop/HDFS? + HDFS is a distributed file system that is well suited for the storage of large files. + Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. + HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. + This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist + on HDFS for high-speed lookups. See the and the rest of this chapter for more information on how HBase achieves its goals. + +
+
+ +
+ Catalog Tables + The catalog table hbase:meta exists as an HBase table and is filtered out of the HBase + shell's list command, but is in fact a table just like any other. +
+ -ROOT- + + The -ROOT- table was removed in HBase 0.96.0. Information here should + be considered historical. + + The -ROOT- table kept track of the location of the + .META table (the previous name for the table now called hbase:meta) prior to HBase + 0.96. The -ROOT- table structure was as follows: + + Key + + .META. region key (.META.,,1) + + + + + Values + + info:regioninfo (serialized HRegionInfo + instance of hbase:meta) + + + info:server (server:port of the RegionServer holding + hbase:meta) + + + info:serverstartcode (start-time of the RegionServer process holding + hbase:meta) + + +
+
+ hbase:meta + The hbase:meta table (previously called .META.) keeps a list + of all regions in the system. The location of hbase:meta was previously + tracked within the -ROOT- table, but is now stored in Zookeeper. + The hbase:meta table structure is as follows: + + Key + + Region key of the format ([table],[region start key],[region + id]) + + + + Values + + info:regioninfo (serialized + HRegionInfo instance for this region) + + + info:server (server:port of the RegionServer containing this + region) + + + info:serverstartcode (start-time of the RegionServer process + containing this region) + + + When a table is in the process of splitting, two other columns will be created, called + info:splitA and info:splitB. These columns represent the two + daughter regions. The values for these columns are also serialized HRegionInfo instances. + After the region has been split, eventually this row will be deleted. + + Note on HRegionInfo + The empty key is used to denote table start and table end. A region with an empty + start key is the first region in a table. If a region has both an empty start and an + empty end key, it is the only region in the table + + In the (hopefully unlikely) event that programmatic processing of catalog metadata is + required, see the Writables + utility. +
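+ In the (hopefully unlikely) event that you do need to read catalog metadata programmatically, hbase:meta can be
+ scanned like any other table. The following is a hedged sketch only, using the HBase 1.0 client API described later
+ in this chapter; a Configuration named conf is assumed, and the column names match the table structure above.
+try (Connection connection = ConnectionFactory.createConnection(conf);
+     Table meta = connection.getTable(TableName.META_TABLE_NAME);
+     ResultScanner rs = meta.getScanner(new Scan())) {
+  for (Result r : rs) {
+    HRegionInfo info = HRegionInfo.getHRegionInfo(r);   // deserialized info:regioninfo
+    byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
+    System.out.println(info.getRegionNameAsString() + " -> " + Bytes.toString(server));
+  }
+}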
+
+ Startup Sequencing + First, the location of hbase:meta is looked up in ZooKeeper. Next, + hbase:meta is updated with server and startcode values. + For information on region-RegionServer assignment, see . +
+
+ +
+ Client + The HBase client finds the RegionServers that are serving the particular row range of + interest. It does this by querying the hbase:meta table. See for details. After locating the required region(s), the + client contacts the RegionServer serving that region, rather than going through the master, + and issues the read or write request. This information is cached in the client so that + subsequent requests need not go through the lookup process. Should a region be reassigned + either by the master load balancer or because a RegionServer has died, the client will + requery the catalog tables to determine the new location of the user region. + + See for more information about the impact of the Master on HBase + Client communication. + Administrative functions are done via an instance of Admin + + +
+ Cluster Connections + The API changed in HBase 1.0. It has been cleaned up, and users are returned + Interfaces to work against rather than particular types. In HBase 1.0, + obtain a cluster Connection from ConnectionFactory and thereafter, get from it + instances of Table, Admin, and RegionLocator on an as-needed basis. When done, close the + obtained instances. Finally, be sure to clean up your Connection instance before + exiting. Connections are heavyweight objects. Create once and keep an instance around. + Table, Admin and RegionLocator instances are lightweight. Create as you go and then + let go as soon as you are done by closing them. See the + Client Package Javadoc Description for example usage of the new HBase 1.0 API. + + For connection configuration information, see . + + Table + instances are not thread-safe. Only one thread can use an instance of Table at + any given time. When creating Table instances, it is advisable to use the same HBaseConfiguration + instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers, + which is usually what you want. For example, this is preferred: + HBaseConfiguration conf = HBaseConfiguration.create(); +HTable table1 = new HTable(conf, "myTable"); +HTable table2 = new HTable(conf, "myTable"); + as opposed to this: + HBaseConfiguration conf1 = HBaseConfiguration.create(); +HTable table1 = new HTable(conf1, "myTable"); +HBaseConfiguration conf2 = HBaseConfiguration.create(); +HTable table2 = new HTable(conf2, "myTable"); + + For more information about how connections are handled in the HBase client, + see HConnectionManager. +
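+ The following is a minimal sketch of the HBase 1.0 connection lifecycle described above. The table name
+ "myTable" and row key "someRow" are assumptions for illustration only.
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.TableName;
+import org.apache.hadoop.hbase.client.*;
+import org.apache.hadoop.hbase.util.Bytes;
+
+Configuration conf = HBaseConfiguration.create();
+// The Connection is heavyweight: create it once and reuse it for the life of the application.
+try (Connection connection = ConnectionFactory.createConnection(conf)) {
+  // Table, Admin and RegionLocator are lightweight: create, use, and close them as needed.
+  try (Table table = connection.getTable(TableName.valueOf("myTable"));
+       Admin admin = connection.getAdmin();
+       RegionLocator locator = connection.getRegionLocator(TableName.valueOf("myTable"))) {
+    Result result = table.get(new Get(Bytes.toBytes("someRow")));
+    // ... use result, admin and locator as needed ...
+  }
+} // closing the Connection releases the shared ZooKeeper and socket resources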
Connection Pooling + For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads + in a single JVM), you can pre-create an HConnection, as shown in + the following example: + + Pre-Creating a <code>HConnection</code> + // Create a connection to the cluster. +Configuration conf = HBaseConfiguration.create(); +HConnection connection = HConnectionManager.createConnection(conf); +HTableInterface table = connection.getTable("myTable"); +// use table as needed, the table returned is lightweight +table.close(); +// use the connection for other access to the cluster +connection.close(); + + Constructing an HTableInterface implementation is very lightweight, and resources are + controlled. + + <code>HTablePool</code> is Deprecated + Previous versions of this guide discussed HTablePool, which was + deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by HBASE-6500. + Please use HConnection instead. +
+
+
WriteBuffer and Batch Methods + If autoflush is turned off on + HTable, + Puts are sent to the RegionServers when the writebuffer + is filled. The writebuffer is 2MB by default. Before an HTable instance is + discarded, either close() or + flushCommits() should be invoked so Puts + will not be lost. + + Note: htable.delete(Delete); does not go in the writebuffer! This only applies to Puts. + + For additional information on write durability, review the ACID semantics page. + + For fine-grained control of batching of + Puts or Deletes, + see the batch methods on HTable. +
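+ Below is a hedged sketch of the buffering behavior described above, using the HTable API shown elsewhere in this
+ section; the Configuration conf, table name, and column names are assumptions for illustration.
+HTable table = new HTable(conf, "myTable");
+table.setAutoFlush(false);                  // Puts now accumulate in the client-side writebuffer
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
+table.put(put);                             // may still be sitting in the writebuffer at this point
+table.flushCommits();                       // push any buffered Puts to the RegionServers
+table.close();                              // close() also flushes before releasing resources
+ For batching of arbitrary Puts and Deletes in a single round of RPCs, the batch methods take a list of actions,
+ for example table.batch(actions, results).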
+
External Clients + Information on non-Java clients and custom protocols is covered in + +
+
+ +
Client Request Filters + Get and Scan instances can be + optionally configured with filters which are applied on the RegionServer. + + Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups + of Filter functionality. + +
Structural + Structural Filters contain other Filters. +
FilterList + FilterList + represents a list of Filters with a relationship of FilterList.Operator.MUST_PASS_ALL or + FilterList.Operator.MUST_PASS_ONE between the Filters. The following example shows an 'or' between two + Filters (checking for either 'my value' or 'my other value' on the same attribute). + +FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE); +SingleColumnValueFilter filter1 = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + Bytes.toBytes("my value") + ); +list.add(filter1); +SingleColumnValueFilter filter2 = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + Bytes.toBytes("my other value") + ); +list.add(filter2); +scan.setFilter(list); + +
+
+
+ Column Value +
+ SingleColumnValueFilter + SingleColumnValueFilter + can be used to test column values for equivalence (CompareOp.EQUAL + ), inequality (CompareOp.NOT_EQUAL), or ranges (e.g., + CompareOp.GREATER). The following is an example of testing a + column for equivalence to the String value "my value"... + +SingleColumnValueFilter filter = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + Bytes.toBytes("my value") + ); +scan.setFilter(filter); +
+
+
+ Column Value Comparators + There are several Comparator classes in the Filter package that deserve special + mention. These Comparators are used in concert with other Filters, such as . +
+ RegexStringComparator + RegexStringComparator + supports regular expressions for value comparisons. + +RegexStringComparator comp = new RegexStringComparator("my."); // any value that starts with 'my' +SingleColumnValueFilter filter = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + comp + ); +scan.setFilter(filter); + + See the Oracle JavaDoc for supported + RegEx patterns in Java. +
+
+ SubstringComparator + SubstringComparator + can be used to determine if a given substring exists in a value. The comparison is + case-insensitive. + +SubstringComparator comp = new SubstringComparator("y val"); // looking for 'my value' +SingleColumnValueFilter filter = new SingleColumnValueFilter( + cf, + column, + CompareOp.EQUAL, + comp + ); +scan.setFilter(filter); + +
+
+ BinaryPrefixComparator + See BinaryPrefixComparator. +
+
+ BinaryComparator + See BinaryComparator. +
+
+
+ KeyValue Metadata + As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate + the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to + values, which were covered in the previous section. +
+ FamilyFilter + FamilyFilter + can be used to filter on the ColumnFamily. It is generally a better idea to select + ColumnFamilies in the Scan than to do it with a Filter. +
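+ As a hedged illustration of the point above (the family name cf is an assumed byte[]), restricting the family on
+ the Scan itself is usually preferable to the equivalent FamilyFilter:
+// Preferred: restrict the column family on the Scan.
+Scan scan = new Scan();
+scan.addFamily(cf);
+
+// FamilyFilter form; generally less efficient than Scan.addFamily.
+Filter f = new FamilyFilter(CompareOp.EQUAL, new BinaryComparator(cf));
+scan.setFilter(f);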
+
+ QualifierFilter + QualifierFilter + can be used to filter based on Column (aka Qualifier) name. +
+
+ ColumnPrefixFilter + ColumnPrefixFilter + can be used to filter based on the lead portion of Column (aka Qualifier) names. + A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row + and for each involved column family. It can be used to efficiently get a subset of the + columns in very wide rows. + Note: The same column qualifier can be used in different column families. This + filter returns all matching columns. + Example: Find all columns in a row and family that start with "abc" + +HTableInterface t = ...; +byte[] row = ...; +byte[] family = ...; +byte[] prefix = Bytes.toBytes("abc"); +Scan scan = new Scan(row, row); // (optional) limit to one row +scan.addFamily(family); // (optional) limit to one family +Filter f = new ColumnPrefixFilter(prefix); +scan.setFilter(f); +scan.setBatch(10); // set this if there could be many columns returned +ResultScanner rs = t.getScanner(scan); +for (Result r = rs.next(); r != null; r = rs.next()) { + for (KeyValue kv : r.raw()) { + // each kv represents a column + } +} +rs.close(); + +
+
+ MultipleColumnPrefixFilter + MultipleColumnPrefixFilter + behaves like ColumnPrefixFilter but allows specifying multiple prefixes. + Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the + first column matching the lowest prefix and also seeks past ranges of columns between + prefixes. It can be used to efficiently get discontinuous sets of columns from very wide + rows. + Example: Find all columns in a row and family that start with "abc" or "xyz" + +HTableInterface t = ...; +byte[] row = ...; +byte[] family = ...; +byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")}; +Scan scan = new Scan(row, row); // (optional) limit to one row +scan.addFamily(family); // (optional) limit to one family +Filter f = new MultipleColumnPrefixFilter(prefixes); +scan.setFilter(f); +scan.setBatch(10); // set this if there could be many columns returned +ResultScanner rs = t.getScanner(scan); +for (Result r = rs.next(); r != null; r = rs.next()) { + for (KeyValue kv : r.raw()) { + // each kv represents a column + } +} +rs.close(); + +
+
+ ColumnRangeFilter + A ColumnRangeFilter + allows efficient intra row scanning. + A ColumnRangeFilter can seek ahead to the first matching column for each involved + column family. It can be used to efficiently get a 'slice' of the columns of a very wide + row. i.e. you have a million columns in a row but you only want to look at columns + bbbb-bbdd. + Note: The same column qualifier can be used in different column families. This + filter returns all matching columns. + Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd" + (inclusive) + +HTableInterface t = ...; +byte[] row = ...; +byte[] family = ...; +byte[] startColumn = Bytes.toBytes("bbbb"); +byte[] endColumn = Bytes.toBytes("bbdd"); +Scan scan = new Scan(row, row); // (optional) limit to one row +scan.addFamily(family); // (optional) limit to one family +Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true); +scan.setFilter(f); +scan.setBatch(10); // set this if there could be many columns returned +ResultScanner rs = t.getScanner(scan); +for (Result r = rs.next(); r != null; r = rs.next()) { + for (KeyValue kv : r.raw()) { + // each kv represents a column + } +} +rs.close(); + + Note: Introduced in HBase 0.92 +
+
+
RowKey +
RowFilter + It is generally a better idea to use the startRow/stopRow methods on Scan for row selection, however + RowFilter can also be used. +
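+ A short hedged sketch of the two approaches (the row key values are assumptions for illustration):
+// Preferred: bound the Scan with start and stop rows.
+Scan scan = new Scan(Bytes.toBytes("row-0100"), Bytes.toBytes("row-0200"));
+
+// RowFilter form, e.g. matching row keys against a regular expression.
+Scan filtered = new Scan();
+filtered.setFilter(new RowFilter(CompareOp.EQUAL, new RegexStringComparator("row-01.*")));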
+
+
Utility +
FirstKeyOnlyFilter + This is primarily used for rowcount jobs. + See FirstKeyOnlyFilter. +
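+ A minimal hedged sketch of a client-side row count using FirstKeyOnlyFilter (the Table instance t is assumed to
+ exist); MapReduce-based row counters use the same scan setup:
+Scan scan = new Scan();
+scan.setFilter(new FirstKeyOnlyFilter());   // only the first KeyValue of each row is returned
+scan.setCaching(500);                       // fetch rows in larger RPC batches
+long rows = 0;
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  rows++;
+}
+rs.close();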
+
+
+ +
Master + HMaster is the implementation of the Master Server. The Master server is + responsible for monitoring all RegionServer instances in the cluster, and is the interface + for all metadata changes. In a distributed cluster, the Master typically runs on the . J Mohamed Zahoor goes into some more detail on the Master + Architecture in this blog posting, HBase HMaster + Architecture . +
Startup Behavior + If run in a multi-Master environment, all Masters compete to run the cluster. If the active + Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to + take over the Master role. + +
+
+ Runtime Impact + A common dist-list question involves what happens to an HBase cluster when the Master + goes down. Because the HBase client talks directly to the RegionServers, the cluster can + still function in a "steady state." Additionally, per , hbase:meta exists as an HBase table and is not + resident in the Master. However, the Master controls critical functions such as + RegionServer failover and completing region splits. So while the cluster can still run for + a short time without the Master, the Master should be restarted as soon as possible. + +
+
Interface + The methods exposed by HMasterInterface are primarily metadata-oriented methods: + + Table (createTable, modifyTable, removeTable, enable, disable) + + ColumnFamily (addColumn, modifyColumn, removeColumn) + + Region (move, assign, unassign) + + + For example, when the HBaseAdmin method disableTable is invoked, it is serviced by the Master server. + +
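+ As a hedged illustration of the metadata-oriented operations listed above (the Configuration conf, table name,
+ and column family name are assumptions), each of these Admin calls is serviced by the Master:
+try (Connection connection = ConnectionFactory.createConnection(conf);
+     Admin admin = connection.getAdmin()) {
+  TableName tn = TableName.valueOf("myTable");
+  admin.disableTable(tn);                                // Table operation, serviced by the Master
+  admin.addColumn(tn, new HColumnDescriptor("new_cf"));  // ColumnFamily operation
+  admin.enableTable(tn);
+}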
+
Processes + The Master runs several background threads: + +
LoadBalancer + Periodically, and when there are no regions in transition, + a load balancer will run and move regions around to balance the cluster's load. + See for configuring this property. + See for more information on region assignment. + +
+
CatalogJanitor + Periodically checks and cleans up the hbase:meta table. See for more information on META. +
+
+ +
+
+ RegionServer + HRegionServer is the RegionServer implementation. It is responsible for + serving and managing regions. In a distributed cluster, a RegionServer runs on a . +
+ Interface + The methods exposed by HRegionInterface contain both data-oriented + and region-maintenance methods: + + Data (get, put, delete, next, etc.) + + + Region (splitRegion, compactRegion, etc.) + + For example, when the HBaseAdmin method + majorCompact is invoked on a table, the client is actually iterating + through all regions for the specified table and requesting a major compaction directly on + each region. +
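+ For example, a client-side request for a major compaction of a whole table might look like the following sketch
+ (HBase 1.0 Admin API; the Configuration conf and table name are assumptions):
+try (Connection connection = ConnectionFactory.createConnection(conf);
+     Admin admin = connection.getAdmin()) {
+  admin.majorCompact(TableName.valueOf("myTable"));  // the request is issued region by region
+}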
+
+ Processes + The RegionServer runs a variety of background threads: +
+ CompactSplitThread + Checks for splits and handles minor compactions. +
+
+ MajorCompactionChecker + Checks for major compactions. +
+
+ MemStoreFlusher + Periodically flushes in-memory writes in the MemStore to StoreFiles. +
+
+ LogRoller + Periodically checks the RegionServer's WAL. +
+
+ +
+ Coprocessors + Coprocessors were added in 0.92. There is a thorough Blog Overview + of CoProcessors posted. Documentation will eventually move to this reference + guide, but the blog is the most current information available at this time. +
+ +
+ Block Cache + + HBase provides two different BlockCache implementations: the default onheap + LruBlockCache and BucketCache, which is (usually) offheap. This section + discusses benefits and drawbacks of each implementation, how to choose the appropriate + option, and configuration options for each. + + Block Cache Reporting: UI + See the RegionServer UI for details on the caching deploy. Since HBase 0.98.4, the + Block Cache detail has been significantly extended, showing configurations, + sizings, current usage, time in the cache, and even detail on block counts and types. + +
+ + Cache Choices + LruBlockCache is the original implementation, and is + entirely within the Java heap. BucketCache is mainly + intended for keeping blockcache data offheap, although BucketCache can also + keep data onheap and serve from a file-backed cache. + BucketCache is production ready as of hbase-0.98.6 + To run with BucketCache, you need HBASE-11678. This was included in + hbase-0.98.6. + + + + + Fetching from BucketCache will always be slower than + fetching from the native onheap LruBlockCache. However, latencies tend to be + less erratic across time, because there is less garbage collection when you use + BucketCache, since it is managing BlockCache allocations, not the GC. If the + BucketCache is deployed in offheap mode, this memory is not managed by the + GC at all. This is why you'd use BucketCache: your latencies are less erratic, and GC pauses + and heap fragmentation are mitigated. See Nick Dimiduk's BlockCache 101 for + comparisons running onheap vs offheap tests. Also see + Comparing BlockCache Deploys, + which finds that if your dataset fits inside your LruBlockCache deploy, you should use it; otherwise, + if you are experiencing cache churn (or you want your cache to exist beyond the + vagaries of Java GC), use BucketCache. + + + When you enable BucketCache, you are enabling a two-tier caching + system, an L1 cache which is implemented by an instance of LruBlockCache and + an offheap L2 cache which is implemented by BucketCache. Management of these + two tiers and the policy that dictates how blocks move between them is done by + CombinedBlockCache. It keeps all DATA blocks in the L2 + BucketCache and meta blocks -- INDEX and BLOOM blocks -- + onheap in the L1 LruBlockCache. + See for more detail on going offheap. +
+ +
+ General Cache Configurations + Apart from the cache implementation itself, you can set some general configuration + options to control how the cache performs. See . After setting any of these options, restart or rolling restart your cluster for the + configuration to take effect. Check logs for errors or unexpected behavior. + See also , which discusses a new option + introduced in HBASE-9857. +
+ +
+ LruBlockCache Design + The LruBlockCache is an LRU cache that contains three levels of block priority to + allow for scan-resistance and in-memory ColumnFamilies: + + + Single access priority: The first time a block is loaded from HDFS it normally + has this priority and it will be part of the first group to be considered during + evictions. The advantage is that scanned blocks are more likely to get evicted than + blocks that are getting more usage. + + + Multi access priority: If a block in the previous priority group is accessed + again, it upgrades to this priority. It is thus part of the second group considered + during evictions. + + + In-memory access priority: If the block's family was configured to be + "in-memory", it will be part of this priority regardless of the number of times it + was accessed. Catalog tables are configured like this. This group is the last one + considered during evictions. + To mark a column family as in-memory, call + HColumnDescriptor.setInMemory(true); if creating a table from Java, + or set IN_MEMORY => true when creating or altering a table in + the shell: e.g. hbase(main):003:0> create 't', {NAME => 'f', IN_MEMORY => 'true'} + A fuller Java sketch follows this list. + + + For more information, see the LruBlockCache + source. + +
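+ The following is a hedged Java sketch of creating a table with an in-memory column family, the programmatic
+ equivalent of the shell command above (the table name, family name, and Admin instance admin are assumptions):
+HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t"));
+HColumnDescriptor family = new HColumnDescriptor("f");
+family.setInMemory(true);      // blocks from this family get the in-memory access priority
+desc.addFamily(family);
+admin.createTable(desc);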
+
+ LruBlockCache Usage + Block caching is enabled by default for all the user tables, which means that any + read operation will load the LRU cache. This might be good for a large number of use + cases, but further tuning is usually required in order to achieve better performance. + An important concept is the working set size, or + WSS, which is: "the amount of memory needed to compute the answer to a problem". For a + website, this would be the data that's needed to answer the queries over a short amount + of time. + The way to calculate how much memory is available in HBase for caching is: + + number of region servers * heap size * hfile.block.cache.size * 0.99 + + The default value for the block cache is 0.25, which represents 25% of the available + heap. The last value (99%) is the default acceptable loading factor in the LRU cache + after which eviction is started. The reason it is included in this equation is that it + would be unrealistic to say that it is possible to use 100% of the available memory, + since this would make the process block at the point where it loads new blocks. + Here are some examples: + + + One region server with the default heap size (1 GB) and the default block cache + size will have 253 MB of block cache available. + + + 20 region servers with the heap size set to 8 GB and a default block cache size + will have 39.6 GB of block cache. + + + 100 region servers with the heap size set to 24 GB and a block cache size of 0.5 + will have about 1.16 TB of block cache. + + + Your data is not the only resident of the block cache. Here are others that you may have to take into account: + + + + Catalog Tables + + The -ROOT- (prior to HBase 0.96. See ) and hbase:meta tables are forced + into the block cache and have the in-memory priority, which means that they are + harder to evict. The former never uses more than a few hundred bytes while the + latter can occupy a few MBs (depending on the number of regions). + + + + HFile Indexes + + An hfile is the file format that HBase uses to store + data in HDFS. It contains a multi-layered index which allows HBase to seek to the + data without having to read the whole file. The size of those indexes is a factor + of the block size (64KB by default), the size of your keys, and the amount of data + you are storing. For big data sets it's not unusual to see numbers around 1GB per + region server, although not all of it will be in cache because the LRU will evict + indexes that aren't used. + + + + Keys + + The values that are stored are only half the picture, since each value is + stored along with its keys (row key, family, qualifier, and timestamp). See . + + + + Bloom Filters + + Just like the HFile indexes, those data structures (when enabled) are stored + in the LRU. + + + + Currently the recommended way to measure HFile index and bloom filter sizes is to + look at the region server web UI and check out the relevant metrics. For keys, sampling + can be done by using the HFile command line tool and looking for the average key size + metric. Since HBase 0.98.3, you can view details on BlockCache stats and metrics + in a special Block Cache section in the UI. + It's generally bad to use block caching when the WSS doesn't fit in memory. This is + the case when you have, for example, 40GB available across all your region servers' block + caches but you need to process 1TB of data. One of the reasons is that the churn + generated by the evictions will trigger more garbage collections unnecessarily.
Here are + two use cases: + + + Fully random reading pattern: This is a case where you almost never access the + same row twice within a short amount of time, such that the chance of hitting a + cached block is close to 0. Setting block caching on such a table is a waste of + memory and CPU cycles, all the more so because it will generate more garbage for the + JVM to pick up. For more information on monitoring GC, see . + + + Mapping a table: In a typical MapReduce job that takes a table as input, every + row will be read only once so there's no need to put them into the block cache. The + Scan object has the option of turning this off via the setCacheBlocks method (set it to + false), as shown in the sketch after this list. You can still keep block caching turned on for this table if you need fast + random read access. An example would be counting the number of rows in a table that + serves live traffic; caching every block of that table would create massive churn + and would surely evict data that's currently in use. + + +
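+ A minimal hedged sketch of the second use case, configuring a full-table Scan (for example a MapReduce source)
+ so it does not pollute the block cache:
+Scan scan = new Scan();
+scan.setCacheBlocks(false);   // blocks read by this scan are not added to the BlockCache
+scan.setCaching(500);         // unrelated to the BlockCache: fetch rows in larger RPC batches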
+ Caching META blocks only (DATA blocks in fscache) + An interesting setup is one where we cache META blocks only and we read DATA + blocks in on each access. If the DATA blocks fit inside fscache, this alternative + may make sense when access is completely random across a very large dataset. + To enable this setup, alter your table and for each column family + set BLOCKCACHE => 'false'. You are 'disabling' the + BlockCache for this column family only; you can never disable the caching of + META blocks. Since + HBASE-4683 Always cache index and bloom blocks, + we will cache META blocks even if the BlockCache is disabled. + +
+
+
+ Offheap Block Cache +
+ How to Enable BucketCache + The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 onheap cache + implemented by LruBlockCache and a second L2 cache implemented with BucketCache. The managing class is CombinedBlockCache by default. + The just-previous link describes the caching 'policy' implemented by CombinedBlockCache. In short, it works + by keeping meta blocks -- INDEX and BLOOM in the L1, onheap LruBlockCache tier -- and DATA + blocks are kept in the L2, BucketCache tier. It is possible to amend this behavior in + HBase since version 1.0 and ask that a column family have both its meta and DATA blocks hosted onheap in the L1 tier by + setting cacheDataInL1 via + (HColumnDescriptor.setCacheDataInL1(true) + or in the shell, creating or amending column families setting CACHE_DATA_IN_L1 + to true: e.g. hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}} + + The BucketCache Block Cache can be deployed onheap, offheap, or file based. + You set which via the + hbase.bucketcache.ioengine setting. Setting it to + heap will have BucketCache deployed inside the + allocated java heap. Setting it to offheap will have + BucketCache make its allocations offheap, + and an ioengine setting of file:PATH_TO_FILE will direct + BucketCache to use a file caching (Useful in particular if you have some fast i/o attached to the box such + as SSDs). + + It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache + policy and have BucketCache working as a strict L2 cache to the L1 + LruBlockCache. For such a setup, set CacheConfig.BUCKET_CACHE_COMBINED_KEY to + false. In this mode, on eviction from L1, blocks go to L2. + When a block is cached, it is cached first in L1. When we go to look for a cached block, + we look first in L1 and if none found, then search L2. Let us call this deploy format, + Raw L1+L2. + Other BucketCache configs include: specifying a location to persist cache to across + restarts, how many threads to use writing the cache, etc. See the + CacheConfig.html + class for configuration options and descriptions. + + + BucketCache Example Configuration + This sample provides a configuration for a 4 GB offheap BucketCache with a 1 GB + onheap cache. Configuration is performed on the RegionServer. Setting + hbase.bucketcache.ioengine and + hbase.bucketcache.size > 0 enables CombinedBlockCache. + Let us presume that the RegionServer has been set to run with a 5G heap: + i.e. HBASE_HEAPSIZE=5g. + + + First, edit the RegionServer's hbase-env.sh and set + HBASE_OFFHEAPSIZE to a value greater than the offheap size wanted, in + this case, 4 GB (expressed as 4G). Lets set it to 5G. That'll be 4G + for our offheap cache and 1G for any other uses of offheap memory (there are + other users of offheap memory other than BlockCache; e.g. DFSClient + in RegionServer can make use of offheap memory). See . + HBASE_OFFHEAPSIZE=5G + + + Next, add the following configuration to the RegionServer's + hbase-site.xml. + + + hbase.bucketcache.ioengine + offheap + + + hfile.block.cache.size + 0.2 + + + hbase.bucketcache.size + 4196 +]]> + + + + Restart or rolling restart your cluster, and check the logs for any + issues. + + + In the above, we set bucketcache to be 4G. The onheap lrublockcache we + configured to have 0.2 of the RegionServer's heap size (0.2 * 5G = 1G). + In other words, you configure the L1 LruBlockCache as you would normally, + as you would when there is no L2 BucketCache present. 
+ + HBASE-10641 introduced the ability to configure multiple sizes for the + buckets of the bucketcache, in HBase 0.98 and newer. To configure multiple bucket + sizes, configure the new property (instead of + ) to a comma-separated list of block sizes, + ordered from smallest to largest, with no spaces. The goal is to optimize the bucket + sizes based on your data access patterns. The following example configures buckets of + size 4096 and 8192. + + hfile.block.cache.sizes + 4096,8192 + + ]]> + + Direct Memory Usage In HBase + The default maximum direct memory varies by JVM. Traditionally it is 64M + or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently). + HBase servers use direct memory, in particular for short-circuit reading; the hosted DFSClient will + allocate direct memory buffers. If you do offheap block caching, you'll + be making use of direct memory. When starting your JVM, make sure + the -XX:MaxDirectMemorySize setting in + conf/hbase-env.sh is set to some value that is + higher than what you have allocated to your offheap blockcache + (hbase.bucketcache.size). It should be larger than your offheap block + cache and then some for DFSClient usage (how much the DFSClient uses is not + easy to quantify; it is the number of open hfiles * hbase.dfs.client.read.shortcircuit.buffer.size + where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- see hbase-default.xml + default configurations). + Direct memory, which is part of the Java process memory, is separate from the object + heap allocated by -Xmx. The value allocated by MaxDirectMemorySize must not exceed + physical RAM, and is likely to be less than the total available RAM due to other + memory requirements and system constraints. + + You can see how much memory -- onheap and offheap/direct -- a RegionServer is + configured to use and how much it is using at any one time by looking at the + Server Metrics: Memory tab in the UI. It can also be gotten + via JMX. In particular the direct memory currently used by the server can be found + on the java.nio.type=BufferPool,name=direct bean. Terracotta has + a good write-up on using offheap memory in Java. It is for their product + BigMemory but a lot of the issues noted apply in general to any attempt at going + offheap. Check it out. + + hbase.bucketcache.percentage.in.combinedcache + This is a pre-HBase 1.0 configuration removed because it + was confusing. It was a float that you would set to some value + between 0.0 and 1.0. Its default was 0.9. If the deploy was using + CombinedBlockCache, then the LruBlockCache L1 size was calculated to + be (1 - hbase.bucketcache.percentage.in.combinedcache) * size-of-bucketcache + and the BucketCache size was hbase.bucketcache.percentage.in.combinedcache * size-of-bucket-cache, + where size-of-bucket-cache itself is EITHER the value of the configuration hbase.bucketcache.size + IF it was specified as megabytes OR hbase.bucketcache.size * -XX:MaxDirectMemorySize if + hbase.bucketcache.size is between 0 and 1.0. + + In 1.0, it should be more straightforward. The L1 LruBlockCache size + is set as a fraction of the java heap using the hfile.block.cache.size setting + (not the best name) and L2 is set as above, either in absolute + megabytes or as a fraction of allocated maximum direct memory. + +
+
+
+ Compressed Blockcache + HBASE-11331 introduced lazy blockcache decompression, more simply referred to + as compressed blockcache. When compressed blockcache is enabled, data and encoded data + blocks are cached in the blockcache in their on-disk format, rather than being + decompressed and decrypted before caching. + For a RegionServer + hosting more data than can fit into cache, enabling this feature with SNAPPY compression + has been shown to result in a 50% increase in throughput and a 30% improvement in mean + latency, while increasing garbage collection by 80% and increasing overall CPU load by + 2%. See HBASE-11331 for more details about how performance was measured and achieved. + For a RegionServer hosting data that can comfortably fit into cache, or if your workload + is sensitive to extra CPU or garbage-collection load, you may receive less + benefit. + Compressed blockcache is disabled by default. To enable it, set + hbase.block.data.cachecompressed to true in + hbase-site.xml on all RegionServers. +
+
+ +
+ Write Ahead Log (WAL) + +
+ Purpose + The Write Ahead Log (WAL) records all changes to data in + HBase, to file-based storage. Under normal operations, the WAL is not needed because + data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or + becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to + the data can be replayed. If writing to the WAL fails, the entire operation to modify the + data fails. + + HBase uses an implementation of the WAL interface. Usually, there is only one instance of a WAL per RegionServer. + The RegionServer records Puts and Deletes to it, before recording them to the for the affected . + + + The HLog + + Prior to 2.0, the interface for WALs in HBase was named HLog. + In 0.94, HLog was the name of the implementation of the WAL. You will likely find + references to the HLog in documentation tailored to these older versions. + + + The WAL resides in HDFS in the /hbase/WALs/ directory (prior to + HBase 0.94, they were stored in /hbase/.logs/), with subdirectories per + region. + For more general information about the concept of write ahead logs, see the + Wikipedia Write-Ahead Log + article. +
+
+ WAL Flushing + TODO (describe). +
+ +
+ WAL Splitting + + A RegionServer serves many regions. All of the regions in a region server share the + same active WAL file. Each edit in the WAL file includes information about which region + it belongs to. When a region is opened, the edits in the WAL file which belong to that + region need to be replayed. Therefore, edits in the WAL file must be grouped by region + so that particular sets can be replayed to regenerate the data in a particular region. + The process of grouping the WAL edits by region is called log + splitting. It is a critical process for recovering data if a region server + fails. + Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler + as a region server shuts down. So that consistency is guaranteed, affected regions + are unavailable until data is restored. All WAL edits need to be recovered and replayed + before a given region can become available again. As a result, regions affected by + log splitting are unavailable until the process completes. + + Log Splitting, Step by Step + + The <filename>/hbase/WALs/<host>,<port>,<startcode></filename> directory is renamed. + Renaming the directory is important because a RegionServer may still be up and + accepting requests even if the HMaster thinks it is down. If the RegionServer does + not respond immediately and does not heartbeat its ZooKeeper session, the HMaster + may interpret this as a RegionServer failure. Renaming the logs directory ensures + that existing, valid WAL files which are still in use by an active but busy + RegionServer are not written to by accident. + The new directory is named according to the following pattern: + ,,-splitting]]> + An example of such a renamed directory might look like the following: + /hbase/WALs/srv.example.com,60020,1254173957298-splitting + + + Each log file is split, one at a time. + The log splitter reads the log file one edit entry at a time and puts each edit + entry into the buffer corresponding to the edit’s region. At the same time, the + splitter starts several writer threads. Writer threads pick up a corresponding + buffer and write the edit entries in the buffer to a temporary recovered edit + file. The temporary edit file is stored to disk with the following naming pattern: + //recovered.edits/.temp]]> + This file is used to store all the edits in the WAL log for this region. After + log splitting completes, the .temp file is renamed to the + sequence ID of the first log written to the file. + To determine whether all edits have been written, the sequence ID is compared to + the sequence of the last edit that was written to the HFile. If the sequence of the + last edit is greater than or equal to the sequence ID included in the file name, it + is clear that all writes from the edit file have been completed. + + + After log splitting is complete, each affected region is assigned to a + RegionServer. + When the region is opened, the recovered.edits folder is checked for recovered + edits files. If any such files are present, they are replayed by reading the edits + and saving them to the MemStore. After all edit files are replayed, the contents of + the MemStore are written to disk (HFile) and the edit files are deleted. + + + +
+ Handling of Errors During Log Splitting + + If you set the hbase.hlog.split.skip.errors option to + true, errors are treated as follows: + + + Any error encountered during splitting will be logged. + + + The problematic WAL log will be moved into the .corrupt + directory under the hbase rootdir. + + + Processing of the WAL will continue. + + + If the hbase.hlog.split.skip.errors option is set to + false, the default, the exception will be propagated and the + split will be logged as failed. See HBASE-2958 When + hbase.hlog.split.skip.errors is set to false, we fail the split but thats + it. We need to do more than just fail split if this flag is set. +
+ How EOFExceptions are treated when splitting a crashed RegionServer's + WALs + + If an EOFException occurs while splitting logs, the split proceeds even when + hbase.hlog.split.skip.errors is set to + false. An EOFException while reading the last log in the set of + files to split is likely, because the RegionServer is likely to be in the process of + writing a record at the time of a crash. For background, see HBASE-2643 + Figure how to deal with eof splitting logs. +
+
+ +
+ Performance Improvements during Log Splitting + + WAL splitting and recovery can be resource-intensive and take a long time, + depending on the number of RegionServers involved in the crash and the size of the + regions. and were developed to improve + performance during log splitting. +
+ Distributed Log Splitting + Distributed Log Splitting was added in HBase version 0.92 + (HBASE-1364) + by Prakash Khemani from Facebook. It reduces the time to complete log splitting + dramatically, improving the availability of regions and tables. For + example, recovering a crashed cluster took around 9 hours with single-threaded log + splitting, but only about six minutes with distributed log splitting. + The information in this section is sourced from Jimmy Xiang's blog post at . + + + Enabling or Disabling Distributed Log Splitting + Distributed log processing is enabled by default since HBase 0.92. The setting + is controlled by the hbase.master.distributed.log.splitting + property, which can be set to true or false, + but defaults to true. + + + Distributed Log Splitting, Step by Step + After configuring distributed log splitting, the HMaster controls the process. + The HMaster enrolls each RegionServer in the log splitting process, and the actual + work of splitting the logs is done by the RegionServers. The general process for + log splitting, as described in still applies here. + + If distributed log processing is enabled, the HMaster creates a + split log manager instance when the cluster is started. + The split log manager manages all log files which need + to be scanned and split. The split log manager places all the logs into the + ZooKeeper splitlog node (/hbase/splitlog) as tasks. You can + view the contents of the splitlog by issuing the following + zkcli command. Example output is shown. + ls /hbase/splitlog +[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900, +hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931, +hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946] + + The output contains some non-ASCII characters. When decoded, it looks much + more simple: + +[hdfs://host2.sample.com:56020/hbase/.logs +/host8.sample.com,57020,1340474893275-splitting +/host8.sample.com%3A57020.1340474893900, +hdfs://host2.sample.com:56020/hbase/.logs +/host3.sample.com,57020,1340474893299-splitting +/host3.sample.com%3A57020.1340474893931, +hdfs://host2.sample.com:56020/hbase/.logs +/host4.sample.com,57020,1340474893287-splitting +/host4.sample.com%3A57020.1340474893946] + + The listing represents WAL file names to be scanned and split, which is a + list of log splitting tasks. + + + The split log manager monitors the log-splitting tasks and workers. + The split log manager is responsible for the following ongoing tasks: + + + Once the split log manager publishes all the tasks to the splitlog + znode, it monitors these task nodes and waits for them to be + processed. + + + Checks to see if there are any dead split log + workers queued up. If it finds tasks claimed by unresponsive workers, it + will resubmit those tasks. If the resubmit fails due to some ZooKeeper + exception, the dead worker is queued up again for retry. + + + Checks to see if there are any unassigned + tasks. If it finds any, it create an ephemeral rescan node so that each + split log worker is notified to re-scan unassigned tasks via the + nodeChildrenChanged ZooKeeper event. + + + Checks for tasks which are assigned but expired. 
If any are found, they + are moved back to TASK_UNASSIGNED state again so that they can + be retried. It is possible that these tasks are assigned to slow workers, or + they may already be finished. This is not a problem, because log splitting + tasks have the property of idempotence. In other words, the same log + splitting task can be processed many times without causing any + problem. + + + The split log manager watches the HBase split log znodes constantly. If + any split log task node data is changed, the split log manager retrieves the + node data. The + node data contains the current state of the task. You can use the + zkcli get command to retrieve the + current state of a task. In the example output below, the first line of the + output shows that the task is currently unassigned. + +get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945 + +unassigned host2.sample.com:57000 +cZxid = 0×7115 +ctime = Sat Jun 23 11:13:40 PDT 2012 +... + + Based on the state of the task whose data is changed, the split log + manager does one of the following: + + + + Resubmit the task if it is unassigned + + + Heartbeat the task if it is assigned + + + Resubmit or fail the task if it is resigned (see ) + + + Resubmit or fail the task if it is completed with errors (see ) + + + Resubmit or fail the task if it could not complete due to + errors (see ) + + + Delete the task if it is successfully completed or failed + + + + Reasons a Task Will Fail + The task has been deleted. + The node no longer exists. + The log status manager failed to move the state of the task + to TASK_UNASSIGNED. + The number of resubmits is over the resubmit + threshold. + + + + + + Each RegionServer's split log worker performs the log-splitting tasks. + Each RegionServer runs a daemon thread called the split log + worker, which does the work to split the logs. The daemon thread + starts when the RegionServer starts, and registers itself to watch HBase znodes. + If any splitlog znode children change, it notifies a sleeping worker thread to + wake up and grab more tasks. If if a worker's current task’s node data is + changed, the worker checks to see if the task has been taken by another worker. + If so, the worker thread stops work on the current task. + The worker monitors + the splitlog znode constantly. When a new task appears, the split log worker + retrieves the task paths and checks each one until it finds an unclaimed task, + which it attempts to claim. If the claim was successful, it attempts to perform + the task and updates the task's state property based on the + splitting outcome. At this point, the split log worker scans for another + unclaimed task. + + How the Split Log Worker Approaches a Task + + + It queries the task state and only takes action if the task is in + TASK_UNASSIGNED state. + + + If the task is is in TASK_UNASSIGNED state, the + worker attempts to set the state to TASK_OWNED by itself. + If it fails to set the state, another worker will try to grab it. The split + log manager will also ask all workers to rescan later if the task remains + unassigned. + + + If the worker succeeds in taking ownership of the task, it tries to get + the task state again to make sure it really gets it asynchronously. In the + meantime, it starts a split task executor to do the actual work: + + + Get the HBase root folder, create a temp folder under the root, and + split the log file to the temp folder. 
+ + + If the split was successful, the task executor sets the task to + state TASK_DONE. + + + If the worker catches an unexpected IOException, the task is set to + state TASK_ERR. + + + If the worker is shutting down, set the the task to state + TASK_RESIGNED. + + + If the task is taken by another worker, just log it. + + + + + + + The split log manager monitors for uncompleted tasks. + The split log manager returns when all tasks are completed successfully. If + all tasks are completed with some failures, the split log manager throws an + exception so that the log splitting can be retried. Due to an asynchronous + implementation, in very rare cases, the split log manager loses track of some + completed tasks. For that reason, it periodically checks for remaining + uncompleted task in its task map or ZooKeeper. If none are found, it throws an + exception so that the log splitting can be retried right away instead of hanging + there waiting for something that won’t happen. + + +
+
+ Distributed Log Replay + After a RegionServer fails, its regions are assigned to another + RegionServer and marked as "recovering" in ZooKeeper. A split log worker directly + replays edits from the WAL of the failed region server to the regions at their new + location. When a region is in "recovering" state, it can accept writes but not reads + (including Append and Increment), region splits, or merges. + Distributed Log Replay extends the framework. It works by + directly replaying WAL edits to another RegionServer instead of creating + recovered.edits files. It provides the following advantages + over distributed log splitting alone: + + It eliminates the overhead of writing and reading a large number of + recovered.edits files. It is not unusual for thousands of + recovered.edits files to be created and written concurrently + during a RegionServer recovery. Many small random writes can degrade overall + system performance. + It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to accept writes again. + + + + Enabling Distributed Log Replay + To enable distributed log replay, set hbase.master.distributed.log.replay to + true. This will be the default for HBase 0.99 (HBASE-10888). + + You must also enable HFile version 3 (which is the default HFile format starting + in HBase 0.99. See HBASE-10855). + Distributed log replay is unsafe for rolling upgrades. +
+
+
+
+ Disabling the WAL + It is possible to disable the WAL, to improve performance in certain specific + situations. However, disabling the WAL puts your data at risk. The only situation where + this is recommended is during a bulk load. This is because, in the event of a problem, + the bulk load can be re-run with no risk of data loss. + The WAL is disabled on a per-mutation basis in the HBase client, by setting the + durability of the mutation. Use the + Mutation.setDurability(Durability.SKIP_WAL) and Mutation.getDurability() + methods to set and get this value. There is no way to disable the WAL for only a + specific table. + + If you disable the WAL for anything other than bulk loads, your data is at + risk. +
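+ A hedged example of skipping the WAL for a single Put (the row and column names are assumptions, and the Table
+ instance table is assumed to exist):
+Put put = new Put(Bytes.toBytes("someRow"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
+put.setDurability(Durability.SKIP_WAL);   // this edit is NOT recorded in the WAL; only acceptable for re-runnable loads
+table.put(put);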
+
+ +
+ +
+ Regions + Regions are the basic element of availability and + distribution for tables, and are composed of a Store per Column Family. The hierarchy of objects + is as follows: + +Table (HBase table) + Region (Regions for the table) + Store (Store per ColumnFamily for each Region for the table) + MemStore (MemStore for each Store for each Region for the table) + StoreFile (StoreFiles for each Store for each Region for the table) + Block (Blocks within a StoreFile within a Store for each Region for the table) + + For a description of what HBase files look like when written to HDFS, see . +
+ Considerations for Number of Regions + In general, HBase is designed to run with a small (20-200) number of relatively large (5-20 GB) regions per server. The considerations for this are as follows:
+ Why can't I have too many regions? + + Typically you want to keep your region count low on HBase for numerous reasons. + Usually right around 100 regions per RegionServer has yielded the best results. + Here are some of the reasons for keeping the region count low: + + + MSLAB requires 2MB per MemStore (that's 2MB per family per region). + 1000 regions that have 2 families each use 3.9GB of heap, and they're not even storing data yet. NB: the 2MB value is configurable. + + If you fill all the regions at somewhat the same rate, the global memory usage forces tiny + flushes when you have too many regions, which in turn generates compactions. + Rewriting the same data tens of times is the last thing you want. + As an example, consider filling 1000 regions (with one family) equally, with a lower bound for global memstore + usage of 5GB (the region server would have a big heap). + Once it reaches 5GB it will force flush the biggest region, + at that point they should almost all have about 5MB of data, so + it would flush that amount. With 5MB inserted later, it would flush another + region that will now have a bit over 5MB of data, and so on. + This is currently the main limiting factor for the number of regions; see + for a detailed formula. + + The master, as-is, is allergic to tons of regions, and will + take a lot of time assigning them and moving them around in batches. + The reason is that it's heavy on ZK usage, and it's not very async + at the moment (could really be improved -- and has been improved a bunch + in 0.96 hbase). + + + In older versions of HBase (pre-v2 hfile, 0.90 and previous), tons of regions + on a few RS can cause the store file index to rise, increasing heap usage and potentially + creating memory pressure or OOME on the RSs. + + + Another issue is the effect of the number of regions on MapReduce jobs; it is typical to have one mapper per HBase region. + Thus, hosting only 5 regions per RS may not be enough to get a sufficient number of tasks for a MapReduce job, while 1000 regions will generate far too many tasks. + + See for configuration guidelines. +
+ +
+ +
+ Region-RegionServer Assignment + This section describes how Regions are assigned to RegionServers. + + +
Startup
When HBase starts, regions are assigned as follows (short version):

The Master invokes the AssignmentManager upon startup.
The AssignmentManager looks at the existing region assignments in META.
If the region assignment is still valid (i.e., if the RegionServer is still online), the assignment is kept.
If the assignment is invalid, the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer.
META is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer.
+ +
+ Failover + When a RegionServer fails: + + The regions immediately become unavailable because the RegionServer is + down. + + + The Master will detect that the RegionServer has failed. + + + The region assignments will be considered invalid and will be re-assigned just + like the startup sequence. + + + In-flight queries are re-tried, and not lost. + + + Operations are switched to a new RegionServer within the following amount of + time: + ZooKeeper session timeout + split time + assignment/replay time + + + +
+ +
Region Load Balancing
Regions can be periodically moved by the load balancer.
+ +
+ Region State Transition + HBase maintains a state for each region and persists the state in META. The state + of the META region itself is persisted in ZooKeeper. You can see the states of regions + in transition in the Master web UI. Following is the list of possible region + states. + + + Possible Region States + + OFFLINE: the region is offline and not opening + + + OPENING: the region is in the process of being opened + + + OPEN: the region is open and the region server has notified the master + + + FAILED_OPEN: the region server failed to open the region + + + CLOSING: the region is in the process of being closed + + + CLOSED: the region server has closed the region and notified the master + + + FAILED_CLOSE: the region server failed to close the region + + + SPLITTING: the region server notified the master that the region is + splitting + + + SPLIT: the region server notified the master that the region has finished + splitting + + + SPLITTING_NEW: this region is being created by a split which is in + progress + + + MERGING: the region server notified the master that this region is being merged + with another region + + + MERGED: the region server notified the master that this region has been + merged + + + MERGING_NEW: this region is being created by a merge of two regions + + + +
+ Region State Transitions + + + + + + This graph shows all allowed transitions a region can undergo. In the graph, + each node is a state. A node has a color based on the state type, for readability. + A directed line in the graph is a possible state transition. + + +
Graph Legend

Brown: Offline state, a special state that can be transient (after closed before opening), terminal (regions of disabled tables), or initial (regions of newly created tables)
Palegreen: Online state, in which regions can serve requests
Lightblue: Transient states
Red: Failure states that need OPS attention
Gold: Terminal states of regions split/merged
Grey: Initial states of regions created through split/merge

Region State Transitions Explained

The master moves a region from OFFLINE to OPENING state and tries to assign the region to a region server. The region server may or may not have received the open region request. The master retries sending the open region request to the region server until the RPC goes through or the master runs out of retries. After the region server receives the open region request, the region server begins opening the region.

If the master runs out of retries, the master prevents the region server from opening the region by moving the region to CLOSING state and trying to close it, even if the region server has already started opening the region.

After the region server opens the region, it continues to try to notify the master until the master moves the region to OPEN state and notifies the region server. The region is now open.

If the region server cannot open the region, it notifies the master. The master moves the region to CLOSED state and tries to open the region on a different region server.

If the master cannot open the region on any of a certain number of region servers, it moves the region to FAILED_OPEN state, and takes no further action until an operator intervenes from the HBase shell, or the server is dead.

The master moves a region from OPEN to CLOSING state. The region server holding the region may or may not have received the close region request. The master retries sending the close request to the server until the RPC goes through or the master runs out of retries.

If the region server is not online, or throws NotServingRegionException, the master moves the region to OFFLINE state and re-assigns it to a different region server.

If the region server is online, but not reachable after the master runs out of retries, the master moves the region to FAILED_CLOSE state and takes no further action until an operator intervenes from the HBase shell, or the server is dead.

If the region server gets the close region request, it closes the region and notifies the master. The master moves the region to CLOSED state and re-assigns it to a different region server.

Before assigning a region, the master moves the region to OFFLINE state automatically if it is in CLOSED state.

When a region server is about to split a region, it notifies the master. The master moves the region to be split from OPEN to SPLITTING state and adds the two new regions to be created to the region server. These two regions are in SPLITTING_NEW state initially.

After notifying the master, the region server starts to split the region. Once past the point of no return, the region server notifies the master again so the master can update the META. However, the master does not update the region states until it is notified by the server that the split is done. If the split is successful, the splitting region is moved from SPLITTING to SPLIT state and the two new regions are moved from SPLITTING_NEW to OPEN state.

If the split fails, the splitting region is moved from SPLITTING back to OPEN state, and the two new regions which were created are moved from SPLITTING_NEW to OFFLINE state.

When a region server is about to merge two regions, it notifies the master first. The master moves the two regions to be merged from OPEN to MERGING state, and adds the new region which will hold the contents of the merged regions to the region server. The new region is in MERGING_NEW state initially.

After notifying the master, the region server starts to merge the two regions. Once past the point of no return, the region server notifies the master again so the master can update the META. However, the master does not update the region states until it is notified by the region server that the merge has completed. If the merge is successful, the two merging regions are moved from MERGING to MERGED state and the new region is moved from MERGING_NEW to OPEN state.

If the merge fails, the two merging regions are moved from MERGING back to OPEN state, and the new region which was created to hold the contents of the merged regions is moved from MERGING_NEW to OFFLINE state.

For regions in FAILED_OPEN or FAILED_CLOSE states, the master tries to close them again when they are reassigned by an operator via HBase Shell.
+ +
+ +
+ Region-RegionServer Locality + Over time, Region-RegionServer locality is achieved via HDFS block replication. + The HDFS client does the following by default when choosing locations to write replicas: + + First replica is written to local node + + Second replica is written to a random node on another rack + + Third replica is written on the same rack as the second, but on a different node chosen randomly + + Subsequent replicas are written on random nodes on the cluster. See Replica Placement: The First Baby Steps on this page: HDFS Architecture + + + Thus, HBase eventually achieves locality for a region after a flush or a compaction. + In a RegionServer failover situation a RegionServer may be assigned regions with non-local + StoreFiles (because none of the replicas are local), however as new data is written + in the region, or the table is compacted and StoreFiles are re-written, they will become "local" + to the RegionServer. + + For more information, see Replica Placement: The First Baby Steps on this page: HDFS Architecture + and also Lars George's blog on HBase and HDFS locality. + +
+ +
Region Splits
Regions split when they reach a configured threshold. Below we treat the topic in short. For a longer exposition, see Apache HBase Region Splitting and Merging by our Enis Soztutar.

Splits run unaided on the RegionServer; i.e., the Master does not participate. The RegionServer splits a region, offlines the split region, adds the daughter regions to META, opens the daughters on the parent's hosting RegionServer, and then reports the split to the Master. See for how to manually manage splits (and for why you might do this).
Custom Split Policies
The default split policy can be overwritten using a custom RegionSplitPolicy (HBase 0.94+). Typically a custom split policy should extend HBase's default split policy: ConstantSizeRegionSplitPolicy.
The policy can be set globally through the HBaseConfiguration used, or on a per-table basis:

HTableDescriptor myHtd = ...;
myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());
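For completeness, a rough sketch of creating a table with a custom split policy through the Java admin API follows. MyCustomSplitPolicy (here referenced as com.example.MyCustomSplitPolicy) is a stand-in for your own RegionSplitPolicy subclass, which must be on the region servers' classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableWithSplitPolicy {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("test_table"));
      htd.addFamily(new HColumnDescriptor("f1"));
      // Hypothetical policy class; HBase only stores the class name here.
      htd.setValue(HTableDescriptor.SPLIT_POLICY, "com.example.MyCustomSplitPolicy");
      admin.createTable(htd);
    } finally {
      admin.close();
    }
  }
}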
+
+ +
+ Manual Region Splitting + It is possible to manually split your table, either at table creation (pre-splitting), + or at a later time as an administrative action. You might choose to split your region for + one or more of the following reasons. There may be other valid reasons, but the need to + manually split your table might also point to problems with your schema design. + + Reasons to Manually Split Your Table + + Your data is sorted by timeseries or another similar algorithm that sorts new data + at the end of the table. This means that the Region Server holding the last region is + always under load, and the other Region Servers are idle, or mostly idle. See also + . + + + You have developed an unexpected hotspot in one region of your table. For + instance, an application which tracks web searches might be inundated by a lot of + searches for a celebrity in the event of news about that celebrity. See for more discussion about this particular + scenario. + + + After a big increase to the number of Region Servers in your cluster, to get the + load spread out quickly. + + + Before a bulk-load which is likely to cause unusual and uneven load across + regions. + + + See for a discussion about the dangers and + possible benefits of managing splitting completely manually. +
Determining Split Points
The goal of splitting your table manually is to improve the chances of balancing the load across the cluster in situations where good rowkey design alone won't get you there. Keeping that in mind, the way you split your regions is very dependent upon the characteristics of your data. It may be that you already know the best way to split your table. If not, the way you split your table depends on what your keys are like.

Alphanumeric Rowkeys
If your rowkeys start with a letter or number, you can split your table at letter or number boundaries. For instance, the following command creates a table with regions that split at each vowel: the first region holds keys that sort before 'a', and the remaining regions cover a-d, e-h, i-n, o-t, and u onward (assuming lowercase alphabetic keys).
hbase> create 'test_table', 'f1', SPLITS=> ['a', 'e', 'i', 'o', 'u']
The following command splits an existing table at split point '2'.
hbase> split 'test_table', '2'
You can also split a specific region by referring to its ID. You can find the region ID by looking at either the table or region in the Web UI. It will be a long identifier such as t2,1,1410227759524.829850c6eaba1acc689480acd8f081bd.. The format is table_name,start_key,region_id. To split that region into two, as close to equally as possible (at the nearest row boundary), issue the following command.
hbase> split 't2,1,1410227759524.829850c6eaba1acc689480acd8f081bd.'
The split key is optional. If it is omitted, the table or region is split in half.
The following example shows how to use the RegionSplitter to create 10 regions, split at hexadecimal values.
hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
The same pre-splitting and manual splitting can also be performed from the Java client API; see the sketch after this list.

Using a Custom Algorithm
The RegionSplitter tool is provided with HBase, and uses a SplitAlgorithm to determine split points for you. As parameters, you give it the algorithm, desired number of regions, and column families. It includes two split algorithms. The first is the HexStringSplit algorithm, which assumes the row keys are hexadecimal strings. The second, UniformSplit, assumes the row keys are random byte arrays. You will probably need to develop your own SplitAlgorithm, using the provided ones as models.
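A rough Java sketch of the operations above, assuming the 0.96/0.98-era HBaseAdmin client and the hypothetical test_table used in the shell examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Pre-split at table creation, mirroring the vowel shell example above.
      HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("test_table"));
      htd.addFamily(new HColumnDescriptor("f1"));
      byte[][] splitKeys = new byte[][] {
          Bytes.toBytes("a"), Bytes.toBytes("e"), Bytes.toBytes("i"),
          Bytes.toBytes("o"), Bytes.toBytes("u") };
      admin.createTable(htd, splitKeys);

      // Split an existing table (or a single region) at an explicit split point.
      admin.split("test_table", "2");
    } finally {
      admin.close();
    }
  }
}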
+
+
Online Region Merges
Both the Master and the RegionServer participate in online region merges. The client sends a merge RPC to the Master. The Master moves the regions to be merged to the same RegionServer, preferably the one where the more heavily loaded region resides. Finally, the Master sends the merge request to that RegionServer, which runs the merge. Similar to region splits, the merge runs as a local transaction on the RegionServer: it offlines the regions, merges the two regions on the file system, atomically deletes the merging regions from META and adds the merged region to META, opens the merged region on the RegionServer, and finally reports the merge to the Master.

An example of region merges in the hbase shell:
hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true

merge_region is an asynchronous operation; the call returns immediately, without waiting for the merge to complete. Passing 'true' as the optional third parameter forces the merge; otherwise the merge fails unless the regions are adjacent. The 'force' option is for expert use only.
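Programmatically, the same merge can be requested through the admin API. A rough sketch, assuming the HBaseAdmin.mergeRegions() call that shipped with online region merge (the encoded region names below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class MergeRegionsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Encoded region names are the hash-like suffixes of the full region names.
      byte[] regionA = Bytes.toBytes("829850c6eaba1acc689480acd8f081bd");  // placeholder
      byte[] regionB = Bytes.toBytes("d2f1b0d2a9f0e3a4c5b6a7d8e9f0a1b2");  // placeholder
      // false = do not force; the merge fails unless the regions are adjacent.
      admin.mergeRegions(regionA, regionB, false);
    } finally {
      admin.close();
    }
  }
}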
+ +
+ Store + A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region. + +
+ MemStore + The MemStore holds in-memory modifications to the Store. Modifications are + Cells/KeyValues. When a flush is requested, the current memstore is moved to a snapshot and is + cleared. HBase continues to serve edits from the new memstore and backing snapshot until + the flusher reports that the flush succeeded. At this point, the snapshot is discarded. + Note that when the flush happens, Memstores that belong to the same region will all be + flushed. +
+
MemStore Flush
A MemStore flush can be triggered under any of the conditions listed below. The minimum flush unit is per region, not the individual MemStore.

When a MemStore reaches the value specified by hbase.hregion.memstore.flush.size, all MemStores that belong to its region will be flushed out to disk.

When overall MemStore usage reaches the value specified by hbase.regionserver.global.memstore.upperLimit, MemStores from various regions will be flushed out to disk to reduce overall MemStore usage in a RegionServer. The flush order is based on the descending order of a region's MemStore usage. Regions will have their MemStores flushed until the overall MemStore usage drops to or slightly below hbase.regionserver.global.memstore.lowerLimit.

When the number of WAL files per region server reaches the value specified in hbase.regionserver.max.logs, MemStores from various regions will be flushed out to disk to reduce the WAL count. The flush order is based on time. Regions with the oldest MemStores are flushed first until the WAL count drops below hbase.regionserver.max.logs.
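In addition to these automatic triggers, a flush can be requested on demand (for example, before taking a backup). A minimal sketch, assuming the 0.96/0.98-era HBaseAdmin client and a hypothetical table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ManualFlushExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Flush all MemStores of the table's regions to StoreFiles on demand,
      // independently of the size- and WAL-based triggers listed above.
      admin.flush("test_table");
    } finally {
      admin.close();
    }
  }
}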
+
Scans

When a client issues a scan against a table, HBase generates RegionScanner objects, one per region, to serve the scan request.
The RegionScanner object contains a list of StoreScanner objects, one per column family.
Each StoreScanner object further contains a list of StoreFileScanner objects, corresponding to each StoreFile and HFile of the corresponding column family, and a list of KeyValueScanner objects for the MemStore.
The two lists are merged into one, which is sorted in ascending order with the scan object for the MemStore at the end of the list.
When a StoreFileScanner object is constructed, it is associated with a MultiVersionConsistencyControl read point, which is the current memstoreTS, filtering out any new updates beyond the read point.
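From the client's point of view, this machinery is hidden behind the Scan API. A minimal sketch of a client scan that exercises the scanners described above (the table, row range, and column family are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test_table");
    Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row9"));
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setCaching(100);   // rows fetched per RPC; tune for your workload
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        System.out.println(Bytes.toString(result.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}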
+
+ StoreFile (HFile) + StoreFiles are where your data lives. + +
HFile Format + The hfile file format is based on + the SSTable file described in the BigTable [2006] paper and on + Hadoop's tfile + (The unit test suite and the compression harness were taken directly from tfile). + Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a + helpful description, HBase I/O: HFile. + + For more information, see the HFile source code. + Also see for information about the HFile v2 format that was included in 0.92. + +
+
HFile Tool
To view a textualized version of HFile content, you can use the org.apache.hadoop.hbase.io.hfile.HFile tool. Type the following to see usage:
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
For example, to view the content of the file hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475, type the following:
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475
Leave off the -v option to see just a summary of the HFile. See the usage output for other things to do with the HFile tool.
+
+ StoreFile Directory Structure on HDFS + For more information of what StoreFiles look like on HDFS with respect to the directory structure, see . + +
+
+ +
+ Blocks + StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. + + Compression happens at the block level within StoreFiles. For more information on compression, see . + + For more information on blocks, see the HFileBlock source code. + +
+
KeyValue
The KeyValue class is the heart of data storage in HBase. KeyValue wraps a byte array along with offsets and lengths into the passed array, which specify where to start interpreting the content as a KeyValue.

The KeyValue format inside a byte array is:
keylength
valuelength
key
value

The Key is further decomposed as:
rowlength
row (i.e., the rowkey)
columnfamilylength
columnfamily
columnqualifier
timestamp
keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)

KeyValue instances are not split across blocks. For example, if there is an 8 MB KeyValue, even if the block-size is 64 KB this KeyValue will be read in as a coherent block. For more information, see the KeyValue source code.
Example + To emphasize the points above, examine what happens with two Puts for two different columns for the same row: + + Put #1: rowkey=row1, cf:attr1=value1 + Put #2: rowkey=row1, cf:attr2=value2 + + Even though these are for the same row, a KeyValue is created for each column: + Key portion for Put #1: + + rowlength ------------> 4 + row -----------------> row1 + columnfamilylength ---> 2 + columnfamily --------> cf + columnqualifier ------> attr1 + timestamp -----------> server time of Put + keytype -------------> Put + + + Key portion for Put #2: + + rowlength ------------> 4 + row -----------------> row1 + columnfamilylength ---> 2 + columnfamily --------> cf + columnqualifier ------> attr2 + timestamp -----------> server time of Put + keytype -------------> Put + + + + It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within + the KeyValue instance. The longer these identifiers are, the bigger the KeyValue is. +
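The two Puts from the example above could be issued as follows. This is a minimal sketch assuming the 0.96/0.98-era client API and a hypothetical table named test_table:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoPutsOneRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test_table");
    try {
      Put put1 = new Put(Bytes.toBytes("row1"));
      put1.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes("value1"));
      Put put2 = new Put(Bytes.toBytes("row1"));
      put2.add(Bytes.toBytes("cf"), Bytes.toBytes("attr2"), Bytes.toBytes("value2"));
      // Although both Puts target the same row, HBase stores one KeyValue per
      // column, each repeating the rowkey, column family, and qualifier.
      table.put(Arrays.asList(put1, put2));
    } finally {
      table.close();
    }
  }
}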
+ +
+
+ Compaction + + Ambiguous Terminology + A StoreFile is a facade of HFile. In terms of compaction, use of + StoreFile seems to have prevailed in the past. + A Store is the same thing as a ColumnFamily. + StoreFiles are related to a Store, or ColumnFamily. + + If you want to read more about StoreFiles versus HFiles and Stores versus + ColumnFamilies, see HBASE-11316. + + + When the MemStore reaches a given size + (hbase.hregion.memstore.flush.size), it flushes its contents to a + StoreFile. The number of StoreFiles in a Store increases over time. + Compaction is an operation which reduces the number of + StoreFiles in a Store, by merging them together, in order to increase performance on + read operations. Compactions can be resource-intensive to perform, and can either help + or hinder performance depending on many factors. + Compactions fall into two categories: minor and major. Minor and major compactions + differ in the following ways. + Minor compactions usually select a small number of small, + adjacent StoreFiles and rewrite them as a single StoreFile. Minor compactions do not + drop (filter out) deletes or expired versions, because of potential side effects. See and for information on how deletes and versions are + handled in relation to compactions. The end result of a minor compaction is fewer, + larger StoreFiles for a given Store. + The end result of a major compaction is a single StoreFile + per Store. Major compactions also process delete markers and max versions. See and for information on how deletes and versions are + handled in relation to compactions. + + + Compaction and Deletions + When an explicit deletion occurs in HBase, the data is not actually deleted. + Instead, a tombstone marker is written. The tombstone marker + prevents the data from being returned with queries. During a major compaction, the + data is actually deleted, and the tombstone marker is removed from the StoreFile. If + the deletion happens because of an expired TTL, no tombstone is created. Instead, the + expired data is filtered out and is not written back to the compacted + StoreFile. + + + + Compaction and Versions + When you create a Column Family, you can specify the maximum number of versions + to keep, by specifying HColumnDescriptor.setMaxVersions(int + versions). The default value is 3. If more versions + than the specified maximum exist, the excess versions are filtered out and not written + back to the compacted StoreFile. + + + + Major Compactions Can Impact Query Results + In some situations, older versions can be inadvertently resurrected if a newer + version is explicitly deleted. See for a more in-depth explanation. + This situation is only possible before the compaction finishes. + + + In theory, major compactions improve performance. However, on a highly loaded + system, major compactions can require an inappropriate number of resources and adversely + affect performance. In a default configuration, major compactions are scheduled + automatically to run once in a 7-day period. This is sometimes inappropriate for systems + in production. You can manage major compactions manually. See . + Compactions do not perform region merges. See for more information on region merging. +
Compaction Policy - HBase 0.96.x and newer
Compacting large StoreFiles, or too many StoreFiles at once, can cause more IO load than your cluster is able to handle without causing performance problems. The method by which HBase selects which StoreFiles to include in a compaction (and whether the compaction is a minor or major compaction) is called the compaction policy.
Prior to HBase 0.96.x, there was only one compaction policy. That original compaction policy is still available as RatioBasedCompactionPolicy. The new default compaction policy, called ExploringCompactionPolicy, was subsequently backported to HBase 0.94 and HBase 0.95, and is the default in HBase 0.96 and newer. It was implemented in HBASE-7842. In short, ExploringCompactionPolicy attempts to select the best possible set of StoreFiles to compact with the least amount of work, while the RatioBasedCompactionPolicy selects the first set that meets the criteria.
Regardless of the compaction policy used, file selection is controlled by several configurable parameters and happens in a multi-step approach. These parameters will be explained in context, and then will be given in a table which shows their descriptions, defaults, and the implications of changing them.
Being Stuck
When the MemStore gets too large, it needs to flush its contents to a StoreFile. However, a Store can only have hbase.hstore.blockingStoreFiles files, so the MemStore needs to wait for the number of StoreFiles to be reduced by one or more compactions. If the MemStore keeps growing past hbase.hregion.memstore.flush.size while it is unable to flush, and the number of StoreFiles is also too high, the algorithm is said to be "stuck". The compaction algorithm checks for this "stuck" situation and provides mechanisms to alleviate it.
+ +
+ The ExploringCompactionPolicy Algorithm + The ExploringCompactionPolicy algorithm considers each possible set of + adjacent StoreFiles before choosing the set where compaction will have the most + benefit. + One situation where the ExploringCompactionPolicy works especially well is when + you are bulk-loading data and the bulk loads create larger StoreFiles than the + StoreFiles which are holding data older than the bulk-loaded data. This can "trick" + HBase into choosing to perform a major compaction each time a compaction is needed, + and cause a lot of extra overhead. With the ExploringCompactionPolicy, major + compactions happen much less frequently because minor compactions are more + efficient. + In general, ExploringCompactionPolicy is the right choice for most situations, + and thus is the default compaction policy. You can also use + ExploringCompactionPolicy along with . + The logic of this policy can be examined in + hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.java. + The following is a walk-through of the logic of the + ExploringCompactionPolicy. + + + Make a list of all existing StoreFiles in the Store. The rest of the + algorithm filters this list to come up with the subset of HFiles which will be + chosen for compaction. + + + If this was a user-requested compaction, attempt to perform the requested + compaction type, regardless of what would normally be chosen. Note that even if + the user requests a major compaction, it may not be possible to perform a major + compaction. This may be because not all StoreFiles in the Column Family are + available to compact or because there are too many Stores in the Column + Family. + + + Some StoreFiles are automatically excluded from consideration. These + include: + + + StoreFiles that are larger than + hbase.hstore.compaction.max.size + + + StoreFiles that were created by a bulk-load operation which explicitly + excluded compaction. You may decide to exclude StoreFiles resulting from + bulk loads, from compaction. To do this, specify the + hbase.mapreduce.hfileoutputformat.compaction.exclude + parameter during the bulk load operation. + + + + + Iterate through the list from step 1, and make a list of all potential sets + of StoreFiles to compact together. A potential set is a grouping of + hbase.hstore.compaction.min contiguous StoreFiles in the + list. For each set, perform some sanity-checking and figure out whether this is + the best compaction that could be done: + + + If the number of StoreFiles in this set (not the size of the StoreFiles) + is fewer than hbase.hstore.compaction.min or more than + hbase.hstore.compaction.max, take it out of + consideration. + + + Compare the size of this set of StoreFiles with the size of the smallest + possible compaction that has been found in the list so far. If the size of + this set of StoreFiles represents the smallest compaction that could be + done, store it to be used as a fall-back if the algorithm is "stuck" and no + StoreFiles would otherwise be chosen. See . + + + Do size-based sanity checks against each StoreFile in this set of + StoreFiles. + + + If the size of this StoreFile is larger than + hbase.hstore.compaction.max.size, take it out of + consideration. + + + If the size is greater than or equal to + hbase.hstore.compaction.min.size, sanity-check it + against the file-based ratio to see whether it is too large to be + considered. 
The sanity-checking is successful if: + + + There is only one StoreFile in this set, or + + + For each StoreFile, its size multiplied by + hbase.hstore.compaction.ratio (or + hbase.hstore.compaction.ratio.offpeak if + off-peak hours are configured and it is during off-peak hours) is + less than the sum of the sizes of the other HFiles in the + set. + + + + + + + + + If this set of StoreFiles is still in consideration, compare it to the + previously-selected best compaction. If it is better, replace the + previously-selected best compaction with this one. + + + When the entire list of potential compactions has been processed, perform + the best compaction that was found. If no StoreFiles were selected for + compaction, but there are multiple StoreFiles, assume the algorithm is stuck + (see ) and if so, perform the smallest + compaction that was found in step 3. + + +
+ +
RatioBasedCompactionPolicy Algorithm
The RatioBasedCompactionPolicy was the only compaction policy prior to HBase 0.96, though ExploringCompactionPolicy has now been backported to HBase 0.94 and 0.95. To use the RatioBasedCompactionPolicy rather than the ExploringCompactionPolicy, set hbase.hstore.defaultengine.compactionpolicy.class to RatioBasedCompactionPolicy in the hbase-site.xml file. To switch back to the ExploringCompactionPolicy, remove the setting from the hbase-site.xml.
The following section walks you through the algorithm used to select StoreFiles for compaction in the RatioBasedCompactionPolicy.

The first phase is to create a list of all candidates for compaction. A list is created of all StoreFiles not already in the compaction queue, and all StoreFiles newer than the newest file that is currently being compacted. This list of StoreFiles is ordered by the sequence ID. The sequence ID is generated when a Put is appended to the write-ahead log (WAL), and is stored in the metadata of the HFile.

Check to see if the algorithm is stuck (see ), and if so, force a major compaction. This is a key area where the ExploringCompactionPolicy is often a better choice than the RatioBasedCompactionPolicy.

If the compaction was user-requested, try to perform the type of compaction that was requested. Note that a major compaction may not be possible if all HFiles are not available for compaction or if too many StoreFiles exist (more than hbase.hstore.compaction.max).

Some StoreFiles are automatically excluded from consideration. These include:
StoreFiles that are larger than hbase.hstore.compaction.max.size
StoreFiles that were created by a bulk-load operation which explicitly excluded compaction. You may decide to exclude StoreFiles resulting from bulk loads, from compaction. To do this, specify the hbase.mapreduce.hfileoutputformat.compaction.exclude parameter during the bulk load operation.

The maximum number of StoreFiles allowed in a major compaction is controlled by the hbase.hstore.compaction.max parameter. If the list contains more than this number of StoreFiles, a minor compaction is performed even if a major compaction would otherwise have been done. However, a user-requested major compaction still occurs even if there are more than hbase.hstore.compaction.max StoreFiles to compact.

If the list contains fewer than hbase.hstore.compaction.min StoreFiles to compact, a minor compaction is aborted. Note that a major compaction can be performed on a single HFile. Its function is to remove deletes and expired versions, and reset locality on the StoreFile.

The value of the hbase.hstore.compaction.ratio parameter is multiplied by the sum of StoreFiles smaller than a given file, to determine whether that StoreFile is selected for compaction during a minor compaction. For instance, if hbase.hstore.compaction.ratio is 1.2, FileX is 5 mb, FileY is 2 mb, and FileZ is 3 mb:
5 <= 1.2 x (2 + 3) or 5 <= 6
In this scenario, FileX is eligible for minor compaction. If FileX were 7 mb, it would not be eligible for minor compaction. This ratio favors smaller StoreFiles. You can configure a different ratio for use in off-peak hours, using the parameter hbase.hstore.compaction.ratio.offpeak, if you also configure hbase.offpeak.start.hour and hbase.offpeak.end.hour.
+ + + + If the last major compaction was too long ago and there is more than one + StoreFile to be compacted, a major compaction is run, even if it would otherwise + have been minor. By default, the maximum time between major compactions is 7 + days, plus or minus a 4.8 hour period, and determined randomly within those + parameters. Prior to HBase 0.96, the major compaction period was 24 hours. See + hbase.hregion.majorcompaction in the table below to tune or + disable time-based major compactions. + + +
+ +
Parameters Used by Compaction Algorithm
This table contains the main configuration parameters for compaction. This list is not exhaustive. To tune these parameters from the defaults, edit the hbase-site.xml file. For a full list of all configuration parameters available, see .

Parameter
Description
Default

hbase.hstore.compaction.min
The minimum number of StoreFiles which must be eligible for compaction before compaction can run. The goal of tuning hbase.hstore.compaction.min is to avoid ending up with too many tiny StoreFiles to compact. Setting this value to 2 would cause a minor compaction each time you have two StoreFiles in a Store, and this is probably not appropriate. If you set this value too high, all the other values will need to be adjusted accordingly. For most cases, the default value is appropriate. In previous versions of HBase, the parameter hbase.hstore.compaction.min was called hbase.hstore.compactionThreshold.
3

hbase.hstore.compaction.max
The maximum number of StoreFiles which will be selected for a single minor compaction, regardless of the number of eligible StoreFiles. Effectively, the value of hbase.hstore.compaction.max controls the length of time it takes a single compaction to complete. Setting it larger means that more StoreFiles are included in a compaction. For most cases, the default value is appropriate.
10

hbase.hstore.compaction.min.size
A StoreFile smaller than this size will always be eligible for minor compaction. StoreFiles this size or larger are evaluated by hbase.hstore.compaction.ratio to determine if they are eligible. Because this limit represents the "automatic include" limit for all StoreFiles smaller than this value, this value may need to be reduced in write-heavy environments where many files in the 1-2 MB range are being flushed, because every StoreFile will be targeted for compaction and the resulting StoreFiles may still be under the minimum size and require further compaction. If this parameter is lowered, the ratio check is triggered more quickly. This addressed some issues seen in earlier versions of HBase, but changing this parameter is no longer necessary in most situations.
128 MB

hbase.hstore.compaction.max.size
A StoreFile larger than this size will be excluded from compaction. The effect of raising hbase.hstore.compaction.max.size is fewer, larger StoreFiles that do not get compacted often. If you feel that compaction is happening too often without much benefit, you can try raising this value.
Long.MAX_VALUE

hbase.hstore.compaction.ratio
For minor compaction, this ratio is used to determine whether a given StoreFile which is larger than hbase.hstore.compaction.min.size is eligible for compaction. Its effect is to limit compaction of large StoreFiles. The value of hbase.hstore.compaction.ratio is expressed as a floating-point decimal. A large ratio, such as 10, will produce a single giant StoreFile. Conversely, a value of .25 will produce behavior similar to the BigTable compaction algorithm, producing four StoreFiles. A moderate value of between 1.0 and 1.4 is recommended. When tuning this value, you are balancing write costs with read costs. Raising the value (to something like 1.4) will have more write costs, because you will compact larger StoreFiles. However, during reads, HBase will need to seek through fewer StoreFiles to accomplish the read.
Consider this approach if you + cannot take advantage of . + Alternatively, you can lower this value to something like 1.0 to + reduce the background cost of writes, and use to limit the number of StoreFiles touched + during reads. + For most cases, the default value is appropriate. + + 1.2F + + + hbase.hstore.compaction.ratio.offpeak + The compaction ratio used during off-peak compactions, if off-peak + hours are also configured (see below). Expressed as a floating-point + decimal. This allows for more aggressive (or less aggressive, if you set it + lower than hbase.hstore.compaction.ratio) compaction + during a set time period. Ignored if off-peak is disabled (default). This + works the same as hbase.hstore.compaction.ratio. + 5.0F + + + hbase.offpeak.start.hour + The start of off-peak hours, expressed as an integer between 0 and 23, + inclusive. Set to -1 to disable off-peak. + -1 (disabled) + + + hbase.offpeak.end.hour + The end of off-peak hours, expressed as an integer between 0 and 23, + inclusive. Set to -1 to disable off-peak. + -1 (disabled) + + + hbase.regionserver.thread.compaction.throttle + There are two different thread pools for compactions, one for + large compactions and the other for small compactions. This helps to keep + compaction of lean tables (such as hbase:meta) + fast. If a compaction is larger than this threshold, it goes into the + large compaction pool. In most cases, the default value is + appropriate. + 2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size + (which defaults to 128) + + + hbase.hregion.majorcompaction + Time between major compactions, expressed in milliseconds. Set to + 0 to disable time-based automatic major compactions. User-requested and + size-based major compactions will still run. This value is multiplied by + hbase.hregion.majorcompaction.jitter to cause + compaction to start at a somewhat-random time during a given window of + time. + 7 days (604800000 milliseconds) + + + hbase.hregion.majorcompaction.jitter + A multiplier applied to + hbase.hregion.majorcompaction to cause compaction to + occur a given amount of time either side of + hbase.hregion.majorcompaction. The smaller the + number, the closer the compactions will happen to the + hbase.hregion.majorcompaction interval. Expressed as + a floating-point decimal. + .50F + + + + +
+
+
Compaction File Selection

Legacy Information
This section has been preserved for historical reasons and refers to the way compaction worked prior to HBase 0.96.x. You can still use this behavior if you enable the RatioBasedCompactionPolicy. For information on the way that compactions work in HBase 0.96.x and later, see .

To understand the core algorithm for StoreFile selection, there is some ASCII-art in the Store source code that serves as a useful reference. It has been copied below:

/* normal skew:
 *
 *         older ----> newer
 *     _
 *    | |   _
 *    | |  | |   _
 *  --|-|- |-|- |-|---_-------_-------  minCompactSize
 *    | |  | |  | |  | |  _  | |
 *    | |  | |  | |  | | | | | |
 *    | |  | |  | |  | | | | | |
 */

Important knobs:
hbase.hstore.compaction.ratio Ratio used in the compaction file selection algorithm (default 1.2f).
hbase.hstore.compaction.min (.90 hbase.hstore.compactionThreshold) (files) Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).
hbase.hstore.compaction.max (files) Maximum number of StoreFiles to compact per minor compaction (default 10).
hbase.hstore.compaction.min.size (bytes) Any StoreFile smaller than this setting will automatically be a candidate for compaction. Defaults to hbase.hregion.memstore.flush.size (128 mb).
hbase.hstore.compaction.max.size (.92) (bytes) Any StoreFile larger than this setting will automatically be excluded from compaction (default Long.MAX_VALUE).

The minor compaction StoreFile selection logic is size based, and selects a file for compaction when the file <= sum(smaller_files) * hbase.hstore.compaction.ratio.
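To make the rule concrete, the following toy sketch (not HBase source code) applies the size-based rule to a list of StoreFile sizes ordered oldest to newest. It ignores the min.size/max.size knobs and reproduces the worked examples that follow:

import java.util.Collections;
import java.util.List;

public class LegacySelectionSketch {
  // Walk from oldest to newest until a file satisfies
  // fileSize <= sum(sizes of the newer files) * ratio, then take that file and
  // the newer ones, capped at maxFiles. Fewer than minFiles means no compaction.
  static List<Long> select(List<Long> sizesOldestFirst, double ratio,
                           int minFiles, int maxFiles) {
    int start = 0;
    while (start < sizesOldestFirst.size()) {
      long sumNewer = 0;
      for (int i = start + 1; i < sizesOldestFirst.size(); i++) {
        sumNewer += sizesOldestFirst.get(i);
      }
      if (sizesOldestFirst.get(start) <= sumNewer * ratio) {
        break;
      }
      start++;
    }
    List<Long> picked = sizesOldestFirst.subList(start,
        Math.min(sizesOldestFirst.size(), start + maxFiles));
    return picked.size() >= minFiles ? picked : Collections.<Long>emptyList();
  }

  public static void main(String[] args) {
    // Reproduces Example #1 below: files 100, 50, 23, 12, 12 -> picks 23, 12, 12.
    System.out.println(select(java.util.Arrays.asList(100L, 50L, 23L, 12L, 12L),
        1.0, 3, 5));
  }
}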
Minor Compaction File Selection - Example #1 (Basic Example)
This example mirrors an example from the unit test TestCompactSelection.
hbase.hstore.compaction.ratio = 1.0f
hbase.hstore.compaction.min = 3 (files)
hbase.hstore.compaction.max = 5 (files)
hbase.hstore.compaction.min.size = 10 (bytes)
hbase.hstore.compaction.max.size = 1000 (bytes)
The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.
Why?
100 --> No, because sum(50, 23, 12, 12) * 1.0 = 97.
50 --> No, because sum(23, 12, 12) * 1.0 = 47.
23 --> Yes, because sum(12, 12) * 1.0 = 24.
12 --> Yes, because the previous file has been included, and because this does not exceed the max-file limit of 5.
12 --> Yes, because the previous file has been included, and because this does not exceed the max-file limit of 5.
+
+ Minor Compaction File Selection - Example #2 (Not Enough Files To + Compact) + This example mirrors an example from the unit test + TestCompactSelection. + + hbase.hstore.compaction.ratio = 1.0f + + + hbase.hstore.compaction.min = 3 (files) + + + hbase.hstore.compaction.max = 5 (files) + + + hbase.hstore.compaction.min.size = 10 (bytes) + + + hbase.hstore.compaction.max.size = 1000 (bytes) + + + + The following StoreFiles exist: 100, 25, 12, and 12 bytes apiece (oldest to + newest). With the above parameters, no compaction will be started. + Why? + + 100 --> No, because sum(25, 12, 12) * 1.0 = 47 + + + 25 --> No, because sum(12, 12) * 1.0 = 24 + + + 12 --> No. Candidate because sum(12) * 1.0 = 12, there are only 2 files + to compact and that is less than the threshold of 3 + + + 12 --> No. Candidate because the previous StoreFile was, but there are + not enough files to compact + + + +
+
+ Minor Compaction File Selection - Example #3 (Limiting Files To Compact) + This example mirrors an example from the unit test + TestCompactSelection. + + hbase.hstore.compaction.ratio = 1.0f + + + hbase.hstore.compaction.min = 3 (files) + + + hbase.hstore.compaction.max = 5 (files) + + + hbase.hstore.compaction.min.size = 10 (bytes) + + + hbase.hstore.compaction.max.size = 1000 (bytes) + + The following StoreFiles exist: 7, 6, 5, 4, 3, 2, and 1 bytes apiece + (oldest to newest). With the above parameters, the files that would be selected for + minor compaction are 7, 6, 5, 4, 3. + Why? + + 7 --> Yes, because sum(6, 5, 4, 3, 2, 1) * 1.0 = 21. Also, 7 is less than + the min-size + + + 6 --> Yes, because sum(5, 4, 3, 2, 1) * 1.0 = 15. Also, 6 is less than + the min-size. + + + 5 --> Yes, because sum(4, 3, 2, 1) * 1.0 = 10. Also, 5 is less than the + min-size. + + + 4 --> Yes, because sum(3, 2, 1) * 1.0 = 6. Also, 4 is less than the + min-size. + + + 3 --> Yes, because sum(2, 1) * 1.0 = 3. Also, 3 is less than the + min-size. + + + 2 --> No. Candidate because previous file was selected and 2 is less than + the min-size, but the max-number of files to compact has been reached. + + + 1 --> No. Candidate because previous file was selected and 1 is less than + the min-size, but max-number of files to compact has been reached. + + + +
+ Impact of Key Configuration Options + + This information is now included in the configuration parameter table in . + +
+
+
+
Experimental: Stripe Compactions
Stripe compaction is an experimental feature added in HBase 0.98 which aims to improve compactions for large regions or non-uniformly distributed row keys. In order to achieve smaller and/or more granular compactions, the StoreFiles within a region are maintained separately for several row-key sub-ranges, or "stripes", of the region. The stripes are transparent to the rest of HBase, so other operations on the HFiles or data work without modification.
Stripe compactions change the HFile layout, creating sub-regions within regions. These sub-regions are easier to compact, and should result in fewer major compactions. This approach alleviates some of the challenges of larger regions.
Stripe compaction is fully compatible with, and works in conjunction with, either the ExploringCompactionPolicy or the RatioBasedCompactionPolicy. It can be enabled for existing tables, and the table will continue to operate normally if it is disabled later.
+
When To Use Stripe Compactions
Consider using stripe compaction if you have either of the following:
Large regions. You can get the positive effects of smaller regions without the additional MemStore and region management overhead.
Non-uniform keys, such as a time dimension in the key. Only the stripes receiving the new keys will need to compact. Old data will not compact as often, if at all.

Performance Improvements
Performance testing has shown that read performance improves somewhat, and that the variability of read and write performance is greatly reduced. An overall long-term performance improvement is seen on large non-uniform-row-key regions, such as a hash-prefixed timestamp key. These performance gains are the most dramatic on a table which is already large. It is possible that the performance improvement might extend to region splits.
Enabling Stripe Compaction
You can enable stripe compaction for a table or a column family by setting its hbase.hstore.engine.class to org.apache.hadoop.hbase.regionserver.StripeStoreEngine. You also need to set hbase.hstore.blockingStoreFiles to a high number, such as 100 (rather than the default value of 10).

Enable Stripe Compaction
If the table already exists, disable the table.
Run one of the following commands in the HBase shell. Replace the table name orders_table with the name of your table.

alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}
alter 'orders_table', {NAME => 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}}
create 'orders_table', 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}

Configure other options if needed. See for more information.
Enable the table.

Disable Stripe Compaction
Disable the table.
Set the hbase.hstore.engine.class option to either nil or org.apache.hadoop.hbase.regionserver.DefaultStoreEngine. Either option has the same effect.

alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => ''}

Enable the table.

When you enable a large table after changing the store engine either way, a major compaction will likely be performed on most regions. This is not necessary on new tables.
+
Configuring Stripe Compaction
Each of the settings for stripe compaction should be configured at the table or column family level, after disabling the table. If you use the HBase shell, the general command pattern is as follows:

alter 'orders_table', CONFIGURATION => {'key' => 'value', ..., 'key' => 'value'}
Region and stripe sizing
You can configure your stripe sizing based upon your region sizing. By default, your new regions will start with one stripe. On the next compaction after the stripe has grown too large (16 x the MemStore flush size), it is split into two stripes. Stripe splitting continues as the region grows, until the region is large enough to split.
You can improve this pattern for your own data. A good rule is to aim for a stripe size of at least 1 GB, and about 8-12 stripes for uniform row keys. For example, if your regions are 30 GB, 12 x 2.5 GB stripes might be a good starting point.

Stripe Sizing Settings

Setting
Notes

hbase.store.stripe.initialStripeCount
The number of stripes to create when stripe compaction is enabled. You can use it as follows:
For relatively uniform row keys, if you know the approximate target number of stripes from the above, you can avoid some splitting overhead by starting with several stripes (2, 5, 10...). If the early data is not representative of the overall row key distribution, this will not be as efficient.
For existing tables with a large amount of data, this setting will effectively pre-split your stripes.
For keys such as hash-prefixed sequential keys, with more than one hash prefix per region, pre-splitting may make sense.

hbase.store.stripe.sizeToSplit
The maximum size a stripe grows to before splitting. Use this in conjunction with hbase.store.stripe.splitPartCount to control the target stripe size (sizeToSplit = splitPartCount * target stripe size), according to the above sizing considerations.

hbase.store.stripe.splitPartCount
The number of new stripes to create when splitting a stripe. The default is 2, which is appropriate for most cases. For non-uniform row keys, you can experiment with increasing the number to 3 or 4, to isolate the arriving updates into a narrower slice of the region without additional splits being required.
+
+
+ MemStore Size Settings + By default, the flush creates several files from one MemStore, according to + existing stripe boundaries and row keys to flush. This approach minimizes write + amplification, but can be undesirable if the MemStore is small and there are many + stripes, because the files will be too small. + In this type of situation, you can set + hbase.store.stripe.compaction.flushToL0 to + true. This will cause a MemStore flush to create a single + file instead. When at least + hbase.store.stripe.compaction.minFilesL0 such files (by + default, 4) accumulate, they will be compacted into striped files. +
+
+ Normal Compaction Configuration and Stripe Compaction + All the settings that apply to normal compactions (see ) apply to stripe compactions. + The exceptions are the minimum and maximum number of files, which are set to + higher values by default because the files in stripes are smaller. To control + these for stripe compactions, use + hbase.store.stripe.compaction.minFiles and + hbase.store.stripe.compaction.maxFiles, rather than + hbase.hstore.compaction.min and + hbase.hstore.compaction.max. +
+
+
+
+ +
+ +
+ +
Bulk Loading +
Overview
HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.
The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Using bulk load consumes less CPU and network resources than loading the same data through the HBase API.
+
Bulk Load Limitations
As bulk loading bypasses the write path, the WAL doesn't get written to as part of the process. Replication works by reading the WAL files, so it won't see the bulk loaded data; the same goes for edits that use Put.setWriteToWAL(false) to skip the WAL. One way to handle that is to ship the raw files or the HFiles to the other cluster and do the other processing there.
+
Bulk Load Architecture + + The HBase bulk load process consists of two main steps. + +
Preparing data via a MapReduce job + + The first step of a bulk load is to generate HBase data files (StoreFiles) from + a MapReduce job using HFileOutputFormat. This output format writes + out data in HBase's internal storage format so that they can be + later loaded very efficiently into the cluster. + + + In order to function efficiently, HFileOutputFormat must be + configured such that each output HFile fits within a single region. + In order to do this, jobs whose output will be bulk loaded into HBase + use Hadoop's TotalOrderPartitioner class to partition the map output + into disjoint ranges of the key space, corresponding to the key + ranges of the regions in the table. + + + HFileOutputFormat includes a convenience function, + configureIncrementalLoad(), which automatically sets up + a TotalOrderPartitioner based on the current region boundaries of a + table. + +
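A driver for such a job might look roughly like the following sketch. The driver and mapper class names, the column family, and the input/output paths are made up for illustration; configureIncrementalLoad() is the convenience call described above (newer HBase releases ship an HFileOutputFormat2 with the same method):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepareDriver {

  // Toy mapper: parses "rowkey,value" text lines into Puts for a single column.
  public static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepareDriver.class);
    job.setMapperClass(LineToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/user/todd/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/todd/myoutput"));
    HTable table = new HTable(conf, "mytable");
    // Configures the partitioner, reducer, and output format so that each
    // generated HFile fits within one of the table's current regions.
    HFileOutputFormat.configureIncrementalLoad(job, table);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}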
+
Completing the data load
After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
+
+
Importing the prepared data using the completebulkload tool + + After a data import has been prepared, either by using the + importtsv tool with the + "importtsv.bulk.output" option or by some other MapReduce + job using the HFileOutputFormat, the + completebulkload tool is used to import the data into the + running cluster. + + + The completebulkload tool simply takes the output path + where importtsv or your MapReduce job put its results, and + the table name to import into. For example: + + $ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable + + The -c config-file option can be used to specify a file + containing the appropriate hbase parameters (e.g., hbase-site.xml) if + not supplied already on the CLASSPATH (In addition, the CLASSPATH must + contain the directory that has the zookeeper configuration file if + zookeeper is NOT managed by HBase). + + + Note: If the target table does not already exist in HBase, this + tool will create the table automatically. + + This tool will run quickly, after which point the new data will be visible in + the cluster. + +
+
See Also + For more information about the referenced utilities, see and . + + + See How-to: Use HBase Bulk Loading, and Why + for a recent blog on current state of bulk loading. + +
+
Advanced Usage
Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.
The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.
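For instance, a programmatic import might look roughly like this sketch (the table name and output path are placeholders carried over from the examples above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class ProgrammaticBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    HTable table = new HTable(conf, "mytable");
    try {
      // Moves the prepared HFiles under the output directory into the
      // regions of the target table, splitting them first if needed.
      loader.doBulkLoad(new Path("/user/todd/myoutput"), table);
    } finally {
      table.close();
    }
  }
}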
+
+ +
HDFS + As HBase runs on HDFS (and each StoreFile is written as a file on HDFS), + it is important to have an understanding of the HDFS Architecture + especially in terms of how it stores files, handles failovers, and replicates blocks. + + See the Hadoop documentation on HDFS Architecture + for more information. + +
NameNode + The NameNode is responsible for maintaining the filesystem metadata. See the above HDFS Architecture link + for more information. + +
+
DataNode + The DataNodes are responsible for storing HDFS blocks. See the above HDFS Architecture link + for more information. + +
+
+ +
+ Timeline-consistent High Available Reads +
+ Introduction + + Architecturally, HBase has always had a strong consistency guarantee. All reads and writes are routed through a single region server, which guarantees that all writes happen in an order, and all reads see the most recent committed data. + However, because of this single homing of the reads to a single location, if the server becomes unavailable, the regions of the table that were hosted in the region server become unavailable for some time. There are three phases in the region recovery process - detection, assignment, and recovery. Of these, the detection is usually the longest and is presently in the order of 20-30 seconds depending on the zookeeper session timeout. During this time and before the recovery is complete, the clients will not be able to read the region data. + However, for some use cases, either the data may be read-only, or doing reads against some stale data is acceptable. With timeline-consistent highly available reads, HBase can be used for these kinds of latency-sensitive use cases where the application can expect to have a time bound on the read completion. + For achieving high availability for reads, HBase provides a feature called “region replication”. In this model, for each region of a table, there will be multiple replicas that are opened in different region servers. By default, the region replication is set to 1, so only a single region replica is deployed and there will not be any changes from the original model. If region replication is set to 2 or more, then the master will assign replicas of the regions of the table. The Load Balancer ensures that the region replicas are not co-hosted on the same region servers and, if possible, not in the same rack. + All of the replicas for a single region will have a unique replica_id, starting from 0. The region replica having replica_id==0 is called the primary region, and the others “secondary regions” or secondaries. Only the primary can accept writes from the client, and the primary will always contain the latest changes. Since all writes still have to go through the primary region, the writes are not highly-available (meaning they might block for some time if the region becomes unavailable). + The writes are asynchronously sent to the secondary region replicas using an “Async WAL replication” feature. This works similarly to HBase’s multi-datacenter replication, but instead the data from a region is replicated to the secondary regions. Each secondary replica always receives and observes the writes in the same order that the primary region committed them. This ensures that the secondaries won’t diverge from the primary region’s data, but since the log replication is asynchronous, the data might be stale in secondary regions. In some sense, this design can be thought of as “in-cluster replication”, where instead of replicating to a different datacenter, the data goes to a secondary region to keep the secondary region’s in-memory state up to date. The data files are shared between the primary region and the other replicas, so that there is no extra storage overhead. However, the secondary regions will have recent non-flushed data in their memstores, which increases the memory overhead. + The Async WAL replication feature is being implemented in Phase 2 of issue HBASE-10070. Before this, region replicas will only be updated with flushed data files from the primary (see hbase.regionserver.storefile.refresh.period below).
It is also possible to use this feature without setting storefile.refresh.period for read-only tables. + 
+
+ Timeline Consistency + + With this feature, HBase introduces a Consistency definition, which can be provided per read operation (get or scan). +public enum Consistency { + STRONG, + TIMELINE +} + Consistency.STRONG is the default consistency model provided by HBase. If the table has region replication = 1, or if reads in a table with region replicas are done with this consistency, the read is always performed by the primary regions. Nothing changes from the previous behaviour, and the client always observes the latest data. + If a read is performed with Consistency.TIMELINE, then the read RPC is sent to the primary region server first. After a short interval (hbase.client.primaryCallTimeout.get, 10ms by default), parallel RPCs to the secondary region replicas are also sent if the primary has not responded. After this, the result is returned from whichever RPC finishes first. If the response came back from the primary region replica, the data is known to be the latest. To allow the client to inspect staleness, the Result.isStale() API has been added. If the result is from a secondary region, then Result.isStale() will be set to true. The user can then inspect this field to reason about the data. + In terms of semantics, TIMELINE consistency as implemented by HBase differs from pure eventual + consistency in these respects: + + + Single homed and ordered updates: Region replication or not, on the write side, + there is still only 1 defined replica (primary) which can accept writes. This + replica is responsible for ordering the edits and preventing conflicts. This + guarantees that two different writes are not committed at the same time by different + replicas, so the data cannot diverge. With this, there is no need to do read-repair or + last-timestamp-wins kind of conflict resolution. + + + The secondaries also apply the edits in the order that the primary committed + them. This way the secondaries will contain a snapshot of the primary's data at any + point in time. This is similar to RDBMS replication and even HBase’s own + multi-datacenter replication, however in a single cluster. + + + On the read side, the client can detect whether the read is coming from + up-to-date data or is stale data. Also, the client can issue reads with different + consistency requirements on a per-operation basis to ensure its own semantic + guarantees. + + + The client can still observe edits out-of-order, and can go back in time, if it + observes reads from one secondary replica first, then another secondary replica. + There is no stickiness to region replicas or a transaction-id based guarantee. If + required, this could be implemented later, though. + + + 
+ Timeline Consistency + + + + + + Timeline Consistency + + 
+ + To better understand the TIMELINE semantics, let's look at the above diagram. Let's say that there are two clients, and the first one writes x=1 at first, then x=2 and x=3 later. As above, all writes are handled by the primary region replica. The writes are saved in the write ahead log (WAL), and replicated to the other replicas asynchronously. In the above diagram, notice that replica_id=1 received 2 updates, and its data shows that x=2, while replica_id=2 only received a single update, and its data shows that x=1. + + If client1 reads with STRONG consistency, it will only talk to replica_id=0, and thus is guaranteed to observe the latest value of x=3. In case of a client issuing TIMELINE consistency reads, the RPC will go to all replicas (after the primary timeout) and the result from the first response will be returned. Thus the client can see either 1, 2 or 3 as the value of x. Let's say that the primary region has failed and log replication cannot continue for some time. If the client does multiple reads with TIMELINE consistency, she can observe x=2 first, then x=1, and so on. + + 
+
+ Tradeoffs + Having secondary regions hosted for read availability comes with some tradeoffs which + should be carefully evaluated per use case. Following are advantages and + disadvantages. + + Advantages + + High availability for read-only tables. + + + High availability for stale reads + + + Ability to do very low latency reads with very high percentile (99.9%+) latencies + for stale reads + + + + + Disadvantages + + Double / Triple memstore usage (depending on region replication count) for tables + with region replication > 1 + + + Increased block cache usage + + + Extra network traffic for log replication + + + Extra backup RPCs for replicas + + + To serve the region data from multiple replicas, HBase opens the regions in secondary + mode in the region servers. The regions opened in secondary mode will share the same data + files with the primary region replica, however each secondary region replica will have its + own memstore to keep the unflushed data (only primary region can do flushes). Also to + serve reads from secondary regions, the blocks of data files may be also cached in the + block caches for the secondary regions. +
+
+ Configuration properties + + To use highly available reads, you should set the following properties in the hbase-site.xml file. There is no specific configuration to enable or disable region replicas. Instead, you can increase or decrease the number of region replicas per table at table creation time, or later with alter table. + 
+ Server side properties + + hbase.regionserver.storefile.refresh.period + 0 + + The period (in milliseconds) for refreshing the store files for the secondary regions. 0 means this feature is disabled. Secondary regions see new files (from flushes and compactions) from the primary once the secondary region refreshes the list of files in the region. But too frequent refreshes might cause extra Namenode pressure. If the files cannot be refreshed for longer than the HFile TTL (hbase.master.hfilecleaner.ttl), the requests are rejected. Configuring the HFile TTL to a larger value is also recommended with this setting. + + +]]> + + One thing to keep in mind is that the region replica placement policy is only + enforced by the StochasticLoadBalancer, which is the default balancer. If + you are using a custom load balancer property in hbase-site.xml + (hbase.master.loadbalancer.class), replicas of regions might end up being + hosted on the same server. 
+
+ Client side properties + Ensure to set the following for all clients (and servers) that will use region + replicas. + + hbase.ipc.client.allowsInterrupt + true + + Whether to enable interruption of RPC threads at the client side. This is required for region replicas with fallback RPC’s to secondary regions. + + + + hbase.client.primaryCallTimeout.get + 10000 + + The timeout (in microseconds), before secondary fallback RPC’s are submitted for get requests with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 10ms. Setting this lower will increase the number of RPC’s, but will lower the p99 latencies. + + + + hbase.client.primaryCallTimeout.multiget + 10000 + + The timeout (in microseconds), before secondary fallback RPC’s are submitted for multi-get requests (HTable.get(List)) with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 10ms. Setting this lower will increase the number of RPC’s, but will lower the p99 latencies. + + + + hbase.client.replicaCallTimeout.scan + 1000000 + + The timeout (in microseconds), before secondary fallback RPC’s are submitted for scan requests with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 1 sec. Setting this lower will increase the number of RPC’s, but will lower the p99 latencies. + + +]]> + +
+
+
+ Creating a table with region replication + + Region replication is a per-table property. All tables have REGION_REPLICATION = 1 by default, which means that there is only one replica per region. You can set and change the number of replicas per region of a table by supplying the REGION_REPLICATION property in the table descriptor. + +
Shell + create 't1', 'f1', {REGION_REPLICATION => 2} + +describe 't1' +for i in 1..100 +put 't1', "r#{i}", 'f1:c1', i +end +flush 't1' +]]> + 
+
Java + + + You can also use setRegionReplication() and alter table to increase or decrease the + region replication for a table. 
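 As a hedged sketch using the admin API (the table and column family names mirror the shell example above and are otherwise illustrative), a table with two replicas per region can be created roughly as follows:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("t1"));
htd.addFamily(new HColumnDescriptor("f1"));
htd.setRegionReplication(2);   // two replicas per region
admin.createTable(htd);
admin.close();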
+
+
+ Region splits and merges + Region splits and merges are not compatible with regions with replicas yet. So you + have to pre-split the table, and disable the region splits. Also you should not execute + region merges on tables with region replicas. To disable region splits you can use + DisabledRegionSplitPolicy as the split policy. +
+
+ User Interface + In the Master's user interface, the region replicas of a table are also shown together + with the primary regions. You will notice that the replicas of a region share the same + start and end keys and the same region name prefix. The only difference is the + appended replica_id (which is encoded as hex), and the region encoded name will be + different. You can also see the replica ids shown explicitly in the UI. 
+
+ API and Usage +
+ Shell + You can do reads in the shell using the Consistency.TIMELINE semantics as follows: + + get 't1','r6', {CONSISTENCY => "TIMELINE"} +]]> + You can simulate a region server pausing or becoming unavailable and do a read from + the secondary replica: + + +hbase(main):001:0> get 't1','r6', {CONSISTENCY => "TIMELINE"} +]]> + Using scans is similar: + scan 't1', {CONSISTENCY => 'TIMELINE'} +]]> +
+
+ Java + You can set the consistency for Gets and Scans and do requests as + follows. + + You can also pass multiple gets: + ArrayList<Get> gets = new ArrayList<Get>(); +gets.add(get1); +... +Result[] results = table.get(gets); +]]> + And Scans: + + You can inspect whether the results are coming from the primary region or not by calling + the Result.isStale() method: + +
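 As a hedged sketch of these calls (the row key, column names, and the table handle are illustrative assumptions, not taken from this guide):

Get get = new Get(Bytes.toBytes("r6"));
get.setConsistency(Consistency.TIMELINE);   // allow a possibly stale answer from a secondary replica
Result result = table.get(get);
if (result.isStale()) {
  // served by a secondary replica; may lag the primary
}

Scan scan = new Scan();
scan.setConsistency(Consistency.TIMELINE);
ResultScanner scanner = table.getScanner(scan);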
+
+ +
+ Resources + + + More information about the design and implementation can be found at the jira + issue: HBASE-10070 + + + + HBaseCon 2014 talk also contains some + details and slides. + + +
+
+ + +
diff --git a/src/main/docbkx/asf.xml b/src/main/docbkx/asf.xml new file mode 100644 index 00000000000..1455b4acb42 --- /dev/null +++ b/src/main/docbkx/asf.xml @@ -0,0 +1,44 @@ + + + + HBase and the Apache Software Foundation + HBase is a project in the Apache Software Foundation and as such there are responsibilities to the ASF to ensure + a healthy project. +
ASF Development Process + See the Apache Development Process page + for all sorts of information on how the ASF is structured (e.g., PMC, committers, contributors), to tips on contributing + and getting involved, and how open-source works at ASF. + +
+
ASF Board Reporting + Once a quarter, each project in the ASF portfolio submits a report to the ASF board. This is done by the HBase project + lead and the committers. See ASF board reporting for more information. + +
+
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml index ee2d7fbba87..30100551b51 100644 --- a/src/main/docbkx/book.xml +++ b/src/main/docbkx/book.xml @@ -1,4 +1,5 @@ + - - - - - - - - Data Model - In HBase, data is stored in tables, which have rows and columns. This is a terminology - overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can - be helpful to think of an HBase table as a multi-dimensional map. - - HBase Data Model Terminology - - Table - - An HBase table consists of multiple rows. - - - - Row - - A row in HBase consists of a row key and one or more columns with values associated - with them. Rows are sorted alphabetically by the row key as they are stored. For this - reason, the design of the row key is very important. The goal is to store data in such a - way that related rows are near each other. A common row key pattern is a website domain. - If your row keys are domains, you should probably store them in reverse (org.apache.www, - org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each - other in the table, rather than being spread out based on the first letter of the - subdomain. - - - - Column - - A column in HBase consists of a column family and a column qualifier, which are - delimited by a : (colon) character. - - - - Column Family - - Column families physically colocate a set of columns and their values, often for - performance reasons. Each column family has a set of storage properties, such as whether - its values should be cached in memory, how its data is compressed or its row keys are - encoded, and others. Each row in a table has the same column - families, though a given row might not store anything in a given column family. - Column families are specified when you create your table, and influence the way your - data is stored in the underlying filesystem. Therefore, the column families should be - considered carefully during schema design. - - - - Column Qualifier - - A column qualifier is added to a column family to provide the index for a given - piece of data. Given a column family content, a column qualifier - might be content:html, and another might be - content:pdf. Though column families are fixed at table creation, - column qualifiers are mutable and may differ greatly between rows. - - - - Cell - - A cell is a combination of row, column family, and column qualifier, and contains a - value and a timestamp, which represents the value's version. - A cell's value is an uninterpreted array of bytes. - - - - Timestamp - - A timestamp is written alongside each value, and is the identifier for a given - version of a value. By default, the timestamp represents the time on the RegionServer - when the data was written, but you can specify a different timestamp value when you put - data into the cell. - - Direct manipulation of timestamps is an advanced feature which is only exposed for - special cases that are deeply integrated with HBase, and is discouraged in general. - Encoding a timestamp at the application level is the preferred pattern. - - You can specify the maximum number of versions of a value that HBase retains, per column - family. When the maximum number of versions is reached, the oldest versions are - eventually deleted. By default, only the newest version is kept. - - - - -
- Conceptual View - You can read a very understandable explanation of the HBase data model in the blog post Understanding - HBase and BigTable by Jim R. Wilson. Another good explanation is available in the - PDF Introduction - to Basic Schema Design by Amandeep Khurana. It may help to read different - perspectives to get a solid understanding of HBase schema design. The linked articles cover - the same ground as the information in this section. - The following example is a slightly modified form of the one on page 2 of the BigTable paper. There - is a table called webtable that contains two rows - (com.cnn.www - and com.example.www), three column families named - contents, anchor, and people. In - this example, for the first row (com.cnn.www), - anchor contains two columns (anchor:cssnsi.com, - anchor:my.look.ca) and contents contains one column - (contents:html). This example contains 5 versions of the row with the - row key com.cnn.www, and one version of the row with the row key - com.example.www. The contents:html column qualifier contains the entire - HTML of a given website. Qualifiers of the anchor column family each - contain the external site which links to the site represented by the row, along with the - text it used in the anchor of its link. The people column family represents - people associated with the site. - - - Column Names - By convention, a column name is made of its column family prefix and a - qualifier. For example, the column - contents:html is made up of the column family - contents and the html qualifier. The colon - character (:) delimits the column family from the column family - qualifier. - - - Table <varname>webtable</varname> - - - - - - - - - Row Key - Time Stamp - ColumnFamily contents - ColumnFamily anchor - ColumnFamily people - - - - - "com.cnn.www" - t9 - - anchor:cnnsi.com = "CNN" - - - - "com.cnn.www" - t8 - - anchor:my.look.ca = "CNN.com" - - - - "com.cnn.www" - t6 - contents:html = "<html>..." - - - - - "com.cnn.www" - t5 - contents:html = "<html>..." - - - - - "com.cnn.www" - t3 - contents:html = "<html>..." - - - - - "com.example.www" - t5 - contents:html = "<html>..." - - people:author = "John Doe" - - - -
- Cells in this table that appear to be empty do not take space, or in fact exist, in - HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to - look at data in HBase, or even the most accurate. The following represents the same - information as a multi-dimensional map. This is only a mock-up for illustrative - purposes and may not be strictly accurate. - ..." - t5: contents:html: "..." - t3: contents:html: "..." - } - anchor: { - t9: anchor:cnnsi.com = "CNN" - t8: anchor:my.look.ca = "CNN.com" - } - people: {} - } - "com.example.www": { - contents: { - t5: contents:html: "..." - } - anchor: {} - people: { - t5: people:author: "John Doe" - } - } -} - ]]> - -
-
- Physical View - Although at a conceptual level tables may be viewed as a sparse set of rows, they are - physically stored by column family. A new column qualifier (column_family:column_qualifier) - can be added to an existing column family at any time. - - ColumnFamily <varname>anchor</varname> - - - - - - - Row Key - Time Stamp - Column Family anchor - - - - - "com.cnn.www" - t9 - anchor:cnnsi.com = "CNN" - - - "com.cnn.www" - t8 - anchor:my.look.ca = "CNN.com" - - - -
- - ColumnFamily <varname>contents</varname> - - - - - - - Row Key - Time Stamp - ColumnFamily "contents:" - - - - - "com.cnn.www" - t6 - contents:html = "<html>..." - - - "com.cnn.www" - t5 - contents:html = "<html>..." - - - "com.cnn.www" - t3 - contents:html = "<html>..." - - - -
- The empty cells shown in the - conceptual view are not stored at all. - Thus a request for the value of the contents:html column at time stamp - t8 would return no value. Similarly, a request for an - anchor:my.look.ca value at time stamp t9 would - return no value. However, if no timestamp is supplied, the most recent value for a - particular column would be returned. Given multiple versions, the most recent is also the - first one found, since timestamps - are stored in descending order. Thus a request for the values of all columns in the row - com.cnn.www if no timestamp is specified would be: the value of - contents:html from timestamp t6, the value of - anchor:cnnsi.com from timestamp t9, the value of - anchor:my.look.ca from timestamp t8. - For more information about the internals of how Apache HBase stores data, see . -
- -
- Namespace - A namespace is a logical grouping of tables analogous to a database in relational - database systems. This abstraction lays the groundwork for upcoming multi-tenancy related - features: - - Quota Management (HBASE-8410) - Restrict the amount of resources (i.e., regions, - tables) a namespace can consume. - - - Namespace Security Administration (HBASE-9206) - provide another level of security - administration for tenants. - - - Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset - of regionservers thus guaranteeing a coarse level of isolation. - - - 
- Namespace management - A namespace can be created, removed or altered. Namespace membership is determined - during table creation by specifying a fully-qualified table name of the form: - - :]]> - - - - Examples - - -#Create a namespace -create_namespace 'my_ns' - - -#create my_table in my_ns namespace -create 'my_ns:my_table', 'fam' - - -#drop namespace -drop_namespace 'my_ns' - - -#alter namespace -alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'} - - - -
- Predefined namespaces - There are two predefined special namespaces: - - - hbase - system namespace, used to contain hbase internal tables - - - default - tables with no explicit specified namespace will automatically fall into - this namespace. - - - - Examples - - -#namespace=foo and table qualifier=bar -create 'foo:bar', 'fam' - -#namespace=default and table qualifier=bar -create 'bar', 'fam' - - -
- - -
- Table - Tables are declared up front at schema definition time. -
- -
- Row - Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest - order appearing first in a table. The empty byte array is used to denote both the start and - end of a table's namespace. 
- -
- Column Family<indexterm><primary>Column Family</primary></indexterm> - Columns in Apache HBase are grouped into column families. All - column members of a column family have the same prefix. For example, the columns - courses:history and courses:math are both - members of the courses column family. The colon character - (:) delimits the column family from the column - family qualifierColumn Family Qualifier. - The column family prefix must be composed of printable characters. The - qualifying tail, the column family qualifier, can be made of any - arbitrary bytes. Column families must be declared up front at schema definition time whereas - columns do not need to be defined at schema time but can be conjured on the fly while the - table is up and running. - Physically, all column family members are stored together on the filesystem. Because - tunings and storage specifications are done at the column family level, it is advised that - all column family members have the same general access pattern and size - characteristics. - 
-
- Cells<indexterm><primary>Cells</primary></indexterm> - A {row, column, version} tuple exactly specifies a - cell in HBase. Cell content is uninterpreted bytes. 
-
- Data Model Operations - The four primary data model operations are Get, Put, Scan, and Delete. Operations are - applied via Table - instances. - -
- Get - Get - returns attributes for a specified row. Gets are executed via - Table.get. -
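 A minimal usage example (the row key, column family, and qualifier are illustrative, and table is an instantiated Table):

Get get = new Get(Bytes.toBytes("row1"));
Result r = table.get(get);
byte[] value = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));   // null if the cell does not exist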
-
- Put - Put - either adds new rows to a table (if the key is new) or can update existing rows (if the - key already exists). Puts are executed via - Table.put (writeBuffer) or - Table.batch (non-writeBuffer). -
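 A minimal usage example (row, family, qualifier, and value are illustrative, and table is an instantiated Table):

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));   // family, qualifier, value
table.put(put);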
-
- Scans - Scan - allows iteration over multiple rows for specified attributes. - The following is an example of a Scan on a Table instance. Assume that a table is - populated with rows with keys "row1", "row2", "row3", and then another set of rows with - the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan - instance to return the rows beginning with "row". - -public static final byte[] CF = "cf".getBytes(); -public static final byte[] ATTR = "attr".getBytes(); -... - -Table table = ... // instantiate a Table instance - -Scan scan = new Scan(); -scan.addColumn(CF, ATTR); -scan.setRowPrefixFilter(Bytes.toBytes("row")); -ResultScanner rs = table.getScanner(scan); -try { - for (Result r = rs.next(); r != null; r = rs.next()) { - // process result... - } -} finally { - rs.close(); // always close the ResultScanner! -} - - Note that generally the easiest way to specify a specific stop point for a scan is by - using the InclusiveStopFilter - class. 
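 For instance, a hedged sketch using InclusiveStopFilter (the start and stop rows are illustrative):

Scan scan = new Scan(Bytes.toBytes("row1"));                        // start row (inclusive)
scan.setFilter(new InclusiveStopFilter(Bytes.toBytes("row3")));     // stop row, included in the results
ResultScanner rs = table.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();
}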
-
- Delete - Delete - removes a row from a table. Deletes are executed via - HTable.delete. - HBase does not modify data in place, and so deletes are handled by creating new - markers called tombstones. These tombstones, along with the dead - values, are cleaned up on major compactions. - See for more information on deleting versions of columns, and - see for more information on compactions. - -
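 A minimal usage example (the row key is illustrative). Deleting an entire row writes a tombstone per column family rather than removing data in place:

Delete delete = new Delete(Bytes.toBytes("row1"));
table.delete(delete);   // data disappears from reads now; it is physically removed at the next major compaction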
- -
- - -
- Versions<indexterm><primary>Versions</primary></indexterm> - - A {row, column, version} tuple exactly specifies a - cell in HBase. It's possible to have an unbounded number of cells where - the row and column are the same but the cell address differs only in its version - dimension. - - While rows and column keys are expressed as bytes, the version is specified using a long - integer. Typically this long contains time instances such as those returned by - java.util.Date.getTime() or System.currentTimeMillis(), that is: - the difference, measured in milliseconds, between the current time and midnight, - January 1, 1970 UTC. - - The HBase version dimension is stored in decreasing order, so that when reading from a - store file, the most recent values are found first. - - There is a lot of confusion over the semantics of cell versions, in - HBase. In particular: - - - If multiple writes to a cell have the same version, only the last written is - fetchable. - - - - It is OK to write cells in a non-increasing version order. - - - - Below we describe how the version dimension in HBase currently works. See HBASE-2406 for - discussion of HBase versions. Bending time in HBase - makes for a good read on the version, or time, dimension in HBase. It has more detail on - versioning than is provided here. As of this writing, the limiitation - Overwriting values at existing timestamps mentioned in the - article no longer holds in HBase. This section is basically a synopsis of this article - by Bruno Dumon. - -
- Specifying the Number of Versions to Store - The maximum number of versions to store for a given column is part of the column - schema and is specified at table creation, or via an alter command, via - HColumnDescriptor.DEFAULT_VERSIONS. Prior to HBase 0.96, the default number - of versions kept was 3, but in 0.96 and newer it has been changed to - 1. - - Modify the Maximum Number of Versions for a Column - This example uses HBase Shell to keep a maximum of 5 versions of column - f1. You could also use HColumnDescriptor. - alter 't1', NAME => 'f1', VERSIONS => 5]]> - - - Modify the Minimum Number of Versions for a Column - You can also specify the minimum number of versions to store. By default, this is - set to 0, which means the feature is disabled. The following example sets the minimum - number of versions on field f1 to 2, via HBase Shell. - You could also use HColumnDescriptor. - alter 't1', NAME => 'f1', MIN_VERSIONS => 2]]> - - Starting with HBase 0.98.2, you can specify a global default for the maximum number of - versions kept for all newly-created columns, by setting - in hbase-site.xml. See - . 
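 Both examples note that HColumnDescriptor can be used instead of the shell; as a hedged sketch (the table and family names mirror the shell examples, and admin is an HBaseAdmin instance):

HColumnDescriptor hcd = new HColumnDescriptor("f1");
hcd.setMaxVersions(5);   // keep at most 5 versions per cell
hcd.setMinVersions(2);   // keep at least 2 versions, even if they are older than the TTL
admin.modifyColumn(TableName.valueOf("t1"), hcd);   // apply to an existing table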
- -
- Versions and HBase Operations - - In this section we look at the behavior of the version dimension for each of the core - HBase operations. - -
- Get/Scan - - Gets are implemented on top of Scans. The below discussion of Get - applies equally to Scans. - - By default, i.e. if you specify no explicit version, when doing a - get, the cell whose version has the largest value is returned - (which may or may not be the latest one written, see later). The default behavior can be - modified in the following ways: - - - - to return more than one version, see Get.setMaxVersions() - - - - to return versions other than the latest, see Get.setTimeRange() - - To retrieve the latest version that is less than or equal to a given value, thus - giving the 'latest' state of the record at a certain point in time, just use a range - from 0 to the desired version and set the max versions to 1. - - - -
-
- Default Get Example - The following Get will only retrieve the current version of the row - -public static final byte[] CF = "cf".getBytes(); -public static final byte[] ATTR = "attr".getBytes(); -... -Get get = new Get(Bytes.toBytes("row1")); -Result r = table.get(get); -byte[] b = r.getValue(CF, ATTR); // returns current version of value - -
-
- Versioned Get Example - The following Get will return the last 3 versions of the row. - -public static final byte[] CF = "cf".getBytes(); -public static final byte[] ATTR = "attr".getBytes(); -... -Get get = new Get(Bytes.toBytes("row1")); -get.setMaxVersions(3); // will return last 3 versions of row -Result r = table.get(get); -byte[] b = r.getValue(CF, ATTR); // returns current version of value -List<KeyValue> kv = r.getColumn(CF, ATTR); // returns all versions of this column - -
- -
- Put - - Doing a put always creates a new version of a cell, at a certain - timestamp. By default the system uses the server's currentTimeMillis, - but you can specify the version (= the long integer) yourself, on a per-column level. - This means you could assign a time in the past or the future, or use the long value for - non-time purposes. - - To overwrite an existing value, do a put at exactly the same row, column, and - version as that of the cell you would overshadow. -
- Implicit Version Example - The following Put will be implicitly versioned by HBase with the current - time. - -public static final byte[] CF = "cf".getBytes(); -public static final byte[] ATTR = "attr".getBytes(); -... -Put put = new Put(Bytes.toBytes(row)); -put.add(CF, ATTR, Bytes.toBytes( data)); -table.put(put); - -
-
- Explicit Version Example - The following Put has the version timestamp explicitly set. - -public static final byte[] CF = "cf".getBytes(); -public static final byte[] ATTR = "attr".getBytes(); -... -Put put = new Put(Bytes.toBytes(row)); -long explicitTimeInMs = 555; // just an example -put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data)); -table.put(put); - - Caution: the version timestamp is used internally by HBase for things like time-to-live - calculations. It's usually best to avoid setting this timestamp yourself. Prefer using - a separate timestamp attribute of the row, or have the timestamp as a part of the rowkey, - or both. 
- -
- -
- Delete - - There are three different types of internal delete markers. See Lars Hofhansl's blog - for discussion of his attempt adding another, Scanning - in HBase: Prefix Delete Marker. - - - Delete: for a specific version of a column. - - - Delete column: for all versions of a column. - - - Delete family: for all columns of a particular ColumnFamily - - - When deleting an entire row, HBase will internally create a tombstone for each - ColumnFamily (i.e., not each individual column). - Deletes work by creating tombstone markers. For example, let's - suppose we want to delete a row. For this you can specify a version, or else by default - the currentTimeMillis is used. What this means is delete all - cells where the version is less than or equal to this version. HBase never - modifies data in place, so for example a delete will not immediately delete (or mark as - deleted) the entries in the storage file that correspond to the delete condition. - Rather, a so-called tombstone is written, which will mask the - deleted values. When HBase does a major compaction, the tombstones are processed to - actually remove the dead values, together with the tombstones themselves. If the version - you specified when deleting a row is larger than the version of any value in the row, - then you can consider the complete row to be deleted. - For an informative discussion on how deletes and versioning interact, see the thread Put w/ - timestamp -> Deleteall -> Put w/ timestamp fails up on the user mailing - list. - Also see for more information on the internal KeyValue format. - Delete markers are purged during the next major compaction of the store, unless the - option is set in the column family. To keep the - deletes for a configurable amount of time, you can set the delete TTL via the - property in - hbase-site.xml. If - is not set, or set to 0, all - delete markers, including those with timestamps in the future, are purged during the - next major compaction. Otherwise, a delete marker with a timestamp in the future is kept - until the major compaction which occurs after the time represented by the marker's - timestamp plus the value of , in - milliseconds. - - This behavior represents a fix for an unexpected change that was introduced in - HBase 0.94, and was fixed in HBASE-10118. - The change has been backported to HBase 0.94 and newer branches. - -
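 As a hedged sketch of the three delete marker types listed above, using the client Delete API of this HBase generation (row, column, and timestamp values are illustrative):

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();

Delete d = new Delete(Bytes.toBytes("row1"));
d.deleteColumn(CF, ATTR, 555L);   // Delete: one specific version of a column
d.deleteColumns(CF, ATTR);        // Delete column: all versions of a column
d.deleteFamily(CF);               // Delete family: all columns of the column family
table.delete(d);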
-
- -
- Current Limitations - -
- Deletes mask Puts - - Deletes mask puts, even puts that happened after the delete - was entered. See HBASE-2256. Remember that a delete writes a tombstone, which only - disappears after the next major compaction has run. Suppose you do - a delete of everything <= T. After this you do a new put with a - timestamp <= T. This put, even if it happened after the delete, - will be masked by the delete tombstone. Performing the put will not - fail, but when you do a get you will notice the put had no - effect. It will start working again after the major compaction has - run. These issues should not be a problem if you use - always-increasing versions for new puts to a row. But they can occur - even if you do not care about time: just do a delete and a put - immediately after each other, and there is some chance they happen - within the same millisecond. 
- -
- Major compactions change query results - - ...create three cell versions at t1, t2 and t3, with a maximum-versions - setting of 2. So when getting all versions, only the values at t2 and t3 will be - returned. But if you delete the version at t2 or t3, the one at t1 will appear again. - Obviously, once a major compaction has run, such behavior will not be the case - anymore... (See Garbage Collection in Bending time in - HBase.) -
-
-
-
- Sort Order - All data model operations in HBase return data in sorted order. First by row, - then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted - in reverse, so newest records are returned first). - 
-
- Column Metadata - There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily. - Thus, while HBase can support not only a large number of columns per row, but a heterogeneous set of columns - between rows as well, it is your responsibility to keep track of the column names. - - The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows. - For more information about how HBase stores data internally, see . - 
-
- Joins - Whether HBase supports joins is a common question on the dist-list, and there is a simple answer: it doesn't, - at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins in SQL). As has been illustrated - in this chapter, the read data model operations in HBase are Get and Scan. - - However, that doesn't mean that equivalent join functionality can't be supported in your application, but - you have to do it yourself. The two primary strategies are either to denormalize the data upon writing to HBase, - or to have lookup tables and do the join between HBase tables in your application or MapReduce code (and as RDBMSs - demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs. - hash-joins). So which is the best approach? It depends on what you are trying to do, and as such there isn't a single - answer that works for every use case. - 
-
ACID - See ACID Semantics. - Lars Hofhansl has also written a note on - ACID in HBase. -
- - - + + + + + + - - - HBase and MapReduce - Apache MapReduce is a software framework used to analyze large amounts of data, and is - the framework used most often with Apache Hadoop. MapReduce itself is out of the - scope of this document. A good place to get started with MapReduce is . MapReduce version - 2 (MR2)is now part of YARN. - - This chapter discusses specific configuration steps you need to take to use MapReduce on - data within HBase. In addition, it discusses other interactions and issues between HBase and - MapReduce jobs. - - mapred and mapreduce - There are two mapreduce packages in HBase as in MapReduce itself: org.apache.hadoop.hbase.mapred - and org.apache.hadoop.hbase.mapreduce. The former does old-style API and the latter - the new style. The latter has more facility though you can usually find an equivalent in the older - package. Pick the package that goes with your mapreduce deploy. When in doubt or starting over, pick the - org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to - o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if that is what you are using. - - - - -
- HBase, MapReduce, and the CLASSPATH - By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either - the HBase configuration under $HBASE_CONF_DIR or the HBase classes. - To give the MapReduce jobs the access they need, you could add - hbase-site.xml to the - $HADOOP_HOME/conf/ directory and add the - HBase JARs to the HADOOP_HOME/conf/ - directory, then copy these changes across your cluster. You could add hbase-site.xml to - $HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib. You would then need to copy - these changes across your cluster or edit - $HADOOP_HOMEconf/hadoop-env.sh and add - them to the HADOOP_CLASSPATH variable. However, this approach is not - recommended because it will pollute your Hadoop install with HBase references. It also - requires you to restart the Hadoop cluster before Hadoop can use the HBase data. - Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The - dependencies only need to be available on the local CLASSPATH. The following example runs - the bundled HBase RowCounter - MapReduce job against a table named usertable If you have not set - the environment variables expected in the command (the parts prefixed by a - $ sign and curly braces), you can use the actual system paths instead. - Be sure to use the correct version of the HBase JAR for your system. The backticks - (` symbols) cause ths shell to execute the sub-commands, setting the - CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. - $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable - When the command runs, internally, the HBase JAR finds the dependencies it needs for - zookeeper, guava, and its other dependencies on the passed HADOOP_CLASSPATH - and adds the JARs to the MapReduce job configuration. See the source at - TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. - - The example may not work if you are running HBase from its build directory rather - than an installed location. You may see an error like the following: - java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper - If this occurs, try modifying the command as follows, so that it uses the HBase JARs - from the target/ directory within the build environment. - $ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable - - - Notice to Mapreduce users of HBase 0.96.1 and above - Some mapreduce jobs that use HBase fail to launch. 
The symptom is an exception similar - to the following: - -Exception in thread "main" java.lang.IllegalAccessError: class - com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass - com.google.protobuf.LiteralByteString - at java.lang.ClassLoader.defineClass1(Native Method) - at java.lang.ClassLoader.defineClass(ClassLoader.java:792) - at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) - at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) - at java.net.URLClassLoader.access$100(URLClassLoader.java:71) - at java.net.URLClassLoader$1.run(URLClassLoader.java:361) - at java.net.URLClassLoader$1.run(URLClassLoader.java:355) - at java.security.AccessController.doPrivileged(Native Method) - at java.net.URLClassLoader.findClass(URLClassLoader.java:354) - at java.lang.ClassLoader.loadClass(ClassLoader.java:424) - at java.lang.ClassLoader.loadClass(ClassLoader.java:357) - at - org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818) - at - org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433) - at - org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186) - at - org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147) - at - org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270) - at - org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100) -... - - This is caused by an optimization introduced in HBASE-9867 that - inadvertently introduced a classloader dependency. - This affects both jobs using the -libjars option and "fat jar," those - which package their runtime dependencies in a nested lib folder. - In order to satisfy the new classloader requirements, hbase-protocol.jar must be - included in Hadoop's classpath. See for current recommendations for resolving - classpath errors. The following is included for historical purposes. - This can be resolved system-wide by including a reference to the hbase-protocol.jar in - hadoop's lib directory, via a symlink or by copying the jar into the new location. - This can also be achieved on a per-job launch basis by including it in the - HADOOP_CLASSPATH environment variable at job submission time. When - launching jobs that package their dependencies, all three of the following job launching - commands satisfy this requirement: - -$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass -$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass -$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass - - For jars that do not package their dependencies, the following command structure is - necessary: - -$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ... - - See also HBASE-10304 for - further discussion of this issue. - -
- -
- MapReduce Scan Caching - TableMapReduceUtil now restores the option to set scanner caching (the number of rows - which are cached before returning the result to the client) on the Scan object that is - passed in. This functionality was lost due to a bug in HBase 0.95 (HBASE-11558), which - is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is - as follows: - - - Caching settings which are set on the scan object. - - - Caching settings which are specified via the configuration option - , which can either be set manually in - hbase-site.xml or via the helper method - TableMapReduceUtil.setScannerCaching(). - - - The default value HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING, which is set to - 100. - - - Optimizing the caching settings is a balance between the time the client waits for a - result and the number of sets of results the client needs to receive. If the caching setting - is too large, the client could end up waiting for a long time or the request could even time - out. If the setting is too small, the scan needs to return results in several pieces. - If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger - shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the - bucket. - The list of priorities mentioned above allows you to set a reasonable default, and - override it for specific operations. - See the API documentation for Scan for more details. -
- -
- Bundled HBase MapReduce Jobs - The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about - the bundled MapReduce jobs, run the following command. - - $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar -An example program must be given as the first argument. -Valid program names are: - copytable: Export a table from local cluster to peer cluster - completebulkload: Complete a bulk data load. - export: Write table data to HDFS. - import: Import data written by Export. - importtsv: Import data in TSV format. - rowcounter: Count rows in HBase table - - Each of the valid program names are bundled MapReduce jobs. To run one of the jobs, - model your command after the following example. - $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable -
- -
- HBase as a MapReduce Job Data Source and Data Sink - HBase can be used as a data source, TableInputFormat, - and data sink, TableOutputFormat - or MultiTableOutputFormat, - for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to - subclass TableMapper - and/or TableReducer. - See the do-nothing pass-through classes IdentityTableMapper - and IdentityTableReducer - for basic usage. For a more involved example, see RowCounter - or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test. - If you run MapReduce jobs that use HBase as a source or sink, you need to specify the source and - sink table and column names in your configuration. - - When you read from HBase, the TableInputFormat requests the list of regions - from HBase and creates either a map per region or - mapreduce.job.maps maps, whichever is smaller. If your job only has two maps, - raise mapreduce.job.maps to a number greater than the number of regions. Maps - will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per - node. When writing to HBase, it may make sense to avoid the Reduce step and write back into - HBase from within your map. This approach works when your job does not need the sort and - collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is - no point double-sorting (and shuffling data around your MapReduce cluster) unless you need - to. If you do not need the Reduce, your map might emit counts of records processed for - reporting at the end of the job, or set the number of Reduces to zero and use - TableOutputFormat. If running the Reduce step makes sense in your case, you should typically - use multiple reducers so that load is spread across the HBase cluster. - - A new HBase partitioner, the HRegionPartitioner, - can run as many reducers as there are existing regions. The HRegionPartitioner is suitable - when your table is large and your upload will not greatly alter the number of existing - regions upon completion. Otherwise use the default partitioner. 
- -
- Writing HFiles Directly During Bulk Import - If you are importing into a new table, you can bypass the HBase API and write your - content directly to the filesystem, formatted into HBase data files (HFiles). Your import - will run faster, perhaps an order of magnitude faster. For more on how this mechanism works, - see . -
- -
- RowCounter Example - The included RowCounter - MapReduce job uses TableInputFormat and does a count of all rows in the specified - table. To run it, use the following command: - $ ./bin/hadoop jar hbase-X.X.X.jar - This will - invoke the HBase MapReduce Driver class. Select rowcounter from the choice of jobs - offered. This will print rowcounter usage advice to standard output. Specify the tablename, - column to count, and output - directory. If you have classpath errors, see . 
- -
- Map-Task Splitting -
- The Default HBase MapReduce Splitter - When TableInputFormat - is used to source an HBase table in a MapReduce job, its splitter will make a map task for - each region of the table. Thus, if there are 100 regions in the table, there will be 100 - map-tasks for the job - regardless of how many column families are selected in the - Scan. -
-
- Custom Splitters - For those interested in implementing custom splitters, see the method - getSplits in TableInputFormatBase. - That is where the logic for map-task assignment resides. -
-
-
- HBase MapReduce Examples -
- HBase MapReduce Read Example - The following is an example of using HBase as a MapReduce source in read-only manner. - Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from - the Mapper. There job would be defined as follows... - -Configuration config = HBaseConfiguration.create(); -Job job = new Job(config, "ExampleRead"); -job.setJarByClass(MyReadJob.class); // class that contains mapper - -Scan scan = new Scan(); -scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs -scan.setCacheBlocks(false); // don't set to true for MR jobs -// set other scan attrs -... - -TableMapReduceUtil.initTableMapperJob( - tableName, // input HBase table name - scan, // Scan instance to control CF and attribute selection - MyMapper.class, // mapper - null, // mapper output key - null, // mapper output value - job); -job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper - -boolean b = job.waitForCompletion(true); -if (!b) { - throw new IOException("error with job!"); -} - - ...and the mapper instance would extend TableMapper... - -public static class MyMapper extends TableMapper<Text, Text> { - - public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException { - // process data for the row from the Result instance. - } -} - -
-
- HBase MapReduce Read/Write Example - The following is an example of using HBase both as a source and as a sink with - MapReduce. This example will simply copy data from one table to another. - -Configuration config = HBaseConfiguration.create(); -Job job = new Job(config,"ExampleReadWrite"); -job.setJarByClass(MyReadWriteJob.class); // class that contains mapper - -Scan scan = new Scan(); -scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs -scan.setCacheBlocks(false); // don't set to true for MR jobs -// set other scan attrs - -TableMapReduceUtil.initTableMapperJob( - sourceTable, // input table - scan, // Scan instance to control CF and attribute selection - MyMapper.class, // mapper class - null, // mapper output key - null, // mapper output value - job); -TableMapReduceUtil.initTableReducerJob( - targetTable, // output table - null, // reducer class - job); -job.setNumReduceTasks(0); - -boolean b = job.waitForCompletion(true); -if (!b) { - throw new IOException("error with job!"); -} - - An explanation is required of what TableMapReduceUtil is doing, - especially with the reducer. TableOutputFormat - is being used as the outputFormat class, and several parameters are being set on the - config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key - to ImmutableBytesWritable and reducer value to - Writable. These could be set by the programmer on the job and - conf, but TableMapReduceUtil tries to make things easier. - The following is the example mapper, which will create a Put - and matching the input Result and emit it. Note: this is what the - CopyTable utility does. - -public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> { - - public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException { - // this example is just copying the data from the source table... - context.write(row, resultToPut(row,value)); - } - - private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException { - Put put = new Put(key.get()); - for (KeyValue kv : result.raw()) { - put.add(kv); - } - return put; - } -} - - There isn't actually a reducer step, so TableOutputFormat takes - care of sending the Put to the target table. - This is just an example, developers could choose not to use - TableOutputFormat and connect to the target table themselves. - -
-
- HBase MapReduce Read/Write Example With Multi-Table Output - TODO: example for MultiTableOutputFormat. -
-
- HBase MapReduce Summary to HBase Example - The following example uses HBase as a MapReduce source and sink with a summarization - step. This example will count the number of distinct instances of a value in a table and - write those summarized counts in another table. - -Configuration config = HBaseConfiguration.create(); -Job job = new Job(config,"ExampleSummary"); -job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer - -Scan scan = new Scan(); -scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs -scan.setCacheBlocks(false); // don't set to true for MR jobs -// set other scan attrs - -TableMapReduceUtil.initTableMapperJob( - sourceTable, // input table - scan, // Scan instance to control CF and attribute selection - MyMapper.class, // mapper class - Text.class, // mapper output key - IntWritable.class, // mapper output value - job); -TableMapReduceUtil.initTableReducerJob( - targetTable, // output table - MyTableReducer.class, // reducer class - job); -job.setNumReduceTasks(1); // at least one, adjust as required - -boolean b = job.waitForCompletion(true); -if (!b) { - throw new IOException("error with job!"); -} - - In this example mapper a column with a String-value is chosen as the value to summarize - upon. This value is used as the key to emit from the mapper, and an - IntWritable represents an instance counter. - -public static class MyMapper extends TableMapper<Text, IntWritable> { - public static final byte[] CF = "cf".getBytes(); - public static final byte[] ATTR1 = "attr1".getBytes(); - - private final IntWritable ONE = new IntWritable(1); - private Text text = new Text(); - - public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException { - String val = new String(value.getValue(CF, ATTR1)); - text.set(val); // we can only emit Writables... - - context.write(text, ONE); - } -} - - In the reducer, the "ones" are counted (just like any other MR example that does this), - and then emits a Put. - -public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> { - public static final byte[] CF = "cf".getBytes(); - public static final byte[] COUNT = "count".getBytes(); - - public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { - int i = 0; - for (IntWritable val : values) { - i += val.get(); - } - Put put = new Put(Bytes.toBytes(key.toString())); - put.add(CF, COUNT, Bytes.toBytes(i)); - - context.write(null, put); - } -} - - -
-
- HBase MapReduce Summary to File Example - This very similar to the summary example above, with exception that this is using - HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and - in the reducer. The mapper remains the same. - -Configuration config = HBaseConfiguration.create(); -Job job = new Job(config,"ExampleSummaryToFile"); -job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer - -Scan scan = new Scan(); -scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs -scan.setCacheBlocks(false); // don't set to true for MR jobs -// set other scan attrs - -TableMapReduceUtil.initTableMapperJob( - sourceTable, // input table - scan, // Scan instance to control CF and attribute selection - MyMapper.class, // mapper class - Text.class, // mapper output key - IntWritable.class, // mapper output value - job); -job.setReducerClass(MyReducer.class); // reducer class -job.setNumReduceTasks(1); // at least one, adjust as required -FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile")); // adjust directories as required - -boolean b = job.waitForCompletion(true); -if (!b) { - throw new IOException("error with job!"); -} - - As stated above, the previous Mapper can run unchanged with this example. As for the - Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting - Puts. - - public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { - - public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { - int i = 0; - for (IntWritable val : values) { - i += val.get(); - } - context.write(key, new IntWritable(i)); - } -} - -
-
HBase MapReduce Summary to HBase Without Reducer
It is also possible to perform summaries without a reducer, if you use HBase as the
reducer.
An HBase target table would need to exist for the job summary. The Table method
incrementColumnValue would be used to atomically increment values. From a
performance perspective, it might make sense to keep a Map of values and their counts to
be incremented for each map-task, and make one update per key during the
cleanup method of the mapper, as sketched below. However, your mileage may vary depending on the
number of rows to be processed and the number of unique keys.
In the end, the summary results are in HBase.
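The following is a minimal sketch of that pattern, not taken from the codebase: the table name targetTable, the column family cf, and the qualifiers attr1 and count are placeholder names. The mapper accumulates counts in memory and applies one atomic increment per distinct key during cleanup.

public static class MySummaryDirectMapper extends TableMapper<Text, IntWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  private Connection connection;
  private Table targetTable;
  private Map<String, Long> counts = new HashMap<String, Long>();

  public void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    targetTable = connection.getTable(TableName.valueOf("targetTable"));
  }

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // accumulate in memory instead of issuing one RPC per input row
    String val = new String(value.getValue(CF, ATTR1));
    Long current = counts.get(val);
    counts.put(val, current == null ? 1L : current + 1);
  }

  public void cleanup(Context context) throws IOException {
    // one atomic increment per distinct key
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      targetTable.incrementColumnValue(Bytes.toBytes(entry.getKey()), CF, COUNT, entry.getValue());
    }
    targetTable.close();
    connection.close();
  }
}

Run the job with job.setNumReduceTasks(0), since there is no reduce step.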
-
HBase MapReduce Summary to RDBMS
Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
it is possible to generate summaries directly to an RDBMS via a custom reducer. The
setup method can connect to an RDBMS (the connection information can be
passed via custom parameters in the context) and the cleanup method can close the
connection.
It is critical to understand that the number of reducers for the job affects the
summarization implementation, and you'll have to design this into your reducer.
Specifically, whether it is designed to run as a singleton (one reducer) or as multiple
reducers. Neither is right or wrong; it depends on your use case. Recognize that the more
reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
be created - this will scale, but only to a point.

 public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private Connection c = null;

  public void setup(Context context) {
    // create DB connection...
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // do summarization
    // in this example the keys are Text, but this is just an example
  }

  public void cleanup(Context context) {
    // close db connection
  }

}

In the end, the summary results are written to your RDBMS table(s).
- -
- -
Accessing Other HBase Tables in a MapReduce Job
Although the framework currently allows one HBase table as input to a MapReduce job,
other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
a Table instance in the setup method of the Mapper.

public class MyMapper extends TableMapper<Text, LongWritable> {
  private Connection connection;
  private Table myOtherTable;

  public void setup(Context context) throws IOException {
    // Create a Connection to the cluster and keep it for the lifetime of the task
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
  }

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // process Result...
    // use 'myOtherTable' for lookups
  }

  public void cleanup(Context context) throws IOException {
    myOtherTable.close();
    connection.close();
  }
}
-
Speculative Execution
It is generally advisable to turn off speculative execution for MapReduce jobs that use
HBase as a source. This can be done on a per-job basis through properties, or for the
entire cluster. Especially for longer running jobs, speculative execution will create
duplicate map-tasks which will double-write your data to HBase; this is probably not what
you want.
See for more information.
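For reference, a small sketch of the per-job approach using the standard Hadoop MRv2 property names (older MRv1 deployments use mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution instead); the job name is arbitrary.

Configuration config = HBaseConfiguration.create();
// turn speculative execution off for this job only
config.setBoolean("mapreduce.map.speculative", false);
config.setBoolean("mapreduce.reduce.speculative", false);
Job job = new Job(config, "ExampleReadJob");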
-
- + - - - Architecture -
- Overview -
- -
When Should I Use HBase?
HBase isn't suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then
HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS
might be a better choice, because all of your data might wind up on a single node (or two) while
the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns,
secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be
"ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a
complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than
5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development
configuration only.
-
- What Is The Difference Between HBase and Hadoop/HDFS? - HDFS is a distributed file system that is well suited for the storage of large files. - Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. - HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. - This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist - on HDFS for high-speed lookups. See the and the rest of this chapter for more information on how HBase achieves its goals. - -
-
- -
- Catalog Tables - The catalog table hbase:meta exists as an HBase table and is filtered out of the HBase - shell's list command, but is in fact a table just like any other. -
- -ROOT- - - The -ROOT- table was removed in HBase 0.96.0. Information here should - be considered historical. - - The -ROOT- table kept track of the location of the - .META table (the previous name for the table now called hbase:meta) prior to HBase - 0.96. The -ROOT- table structure was as follows: - - Key - - .META. region key (.META.,,1) - - - - - Values - - info:regioninfo (serialized HRegionInfo - instance of hbase:meta) - - - info:server (server:port of the RegionServer holding - hbase:meta) - - - info:serverstartcode (start-time of the RegionServer process holding - hbase:meta) - - -
-
- hbase:meta - The hbase:meta table (previously called .META.) keeps a list - of all regions in the system. The location of hbase:meta was previously - tracked within the -ROOT- table, but is now stored in Zookeeper. - The hbase:meta table structure is as follows: - - Key - - Region key of the format ([table],[region start key],[region - id]) - - - - Values - - info:regioninfo (serialized - HRegionInfo instance for this region) - - - info:server (server:port of the RegionServer containing this - region) - - - info:serverstartcode (start-time of the RegionServer process - containing this region) - - - When a table is in the process of splitting, two other columns will be created, called - info:splitA and info:splitB. These columns represent the two - daughter regions. The values for these columns are also serialized HRegionInfo instances. - After the region has been split, eventually this row will be deleted. - - Note on HRegionInfo - The empty key is used to denote table start and table end. A region with an empty - start key is the first region in a table. If a region has both an empty start and an - empty end key, it is the only region in the table - - In the (hopefully unlikely) event that programmatic processing of catalog metadata is - required, see the Writables - utility. -
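Because hbase:meta is an ordinary table, it can be read with the normal client API. The following sketch (using the HBase 1.0 Connection API) scans the info family and prints each region's row key together with the RegionServer recorded in info:server; it is illustrative only.

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table meta = connection.getTable(TableName.valueOf("hbase:meta"))) {
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("info"));
  try (ResultScanner scanner = meta.getScanner(scan)) {
    for (Result r : scanner) {
      byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
      System.out.println(Bytes.toString(r.getRow()) + " -> "
          + (server == null ? "(unassigned)" : Bytes.toString(server)));
    }
  }
}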
-
- Startup Sequencing - First, the location of hbase:meta is looked up in Zookeeper. Next, - hbase:meta is updated with server and startcode values. - For information on region-RegionServer assignment, see . -
-
- -
- Client - The HBase client finds the RegionServers that are serving the particular row range of - interest. It does this by querying the hbase:meta table. See for details. After locating the required region(s), the - client contacts the RegionServer serving that region, rather than going through the master, - and issues the read or write request. This information is cached in the client so that - subsequent requests need not go through the lookup process. Should a region be reassigned - either by the master load balancer or because a RegionServer has died, the client will - requery the catalog tables to determine the new location of the user region. - - See for more information about the impact of the Master on HBase - Client communication. - Administrative functions are done via an instance of Admin - - -
Cluster Connections
The API changed in HBase 1.0. It has been cleaned up and users are returned
Interfaces to work against rather than particular types. In HBase 1.0,
obtain a cluster Connection from ConnectionFactory and thereafter, get from it
instances of Table, Admin, and RegionLocator on an as-needed basis. When done, close the
obtained instances. Finally, be sure to clean up your Connection instance before
exiting. Connections are heavyweight objects. Create one and keep an instance around.
Table, Admin and RegionLocator instances are lightweight. Create them as you go and
let them go as soon as you are done, by closing them. See the
Client Package Javadoc Description for example usage of the new HBase 1.0 API.
For connection configuration information, see .
Table
instances are not thread-safe. Only one thread can use an instance of Table at
any given time. When creating Table instances, it is advisable to use the same HBaseConfiguration
instance. This will ensure sharing of ZooKeeper and socket connections to the RegionServers,
which is usually what you want. For example, this is preferred:
HBaseConfiguration conf = HBaseConfiguration.create();
HTable table1 = new HTable(conf, "myTable");
HTable table2 = new HTable(conf, "myTable");
as opposed to this:
HBaseConfiguration conf1 = HBaseConfiguration.create();
HTable table1 = new HTable(conf1, "myTable");
HBaseConfiguration conf2 = HBaseConfiguration.create();
HTable table2 = new HTable(conf2, "myTable");
For more information about how connections are handled in the HBase client,
see HConnectionManager.
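A minimal sketch of the HBase 1.0 pattern described above; the table name "myTable" and the row key are placeholders. The single Connection is created once and can be shared across the application, while Table and Admin instances are created and closed as needed.

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf)) {
  // Table and Admin are lightweight: create, use, and close them as you go
  try (Table table = connection.getTable(TableName.valueOf("myTable"));
       Admin admin = connection.getAdmin()) {
    Result result = table.get(new Get(Bytes.toBytes("someRow")));
    // ... use result and admin as needed ...
  }
} // the heavyweight Connection is closed here, ideally only at application shutdown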
Connection Pooling
For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads
in a single JVM), you can pre-create an HConnection, as shown in
the following example:
Pre-Creating an <code>HConnection</code>
Configuration conf = HBaseConfiguration.create();
// Create a connection to the cluster.
HConnection connection = HConnectionManager.createConnection(conf);
HTableInterface table = connection.getTable("myTable");
// use table as needed, the table returned is lightweight
table.close();
// use the connection for other access to the cluster
connection.close();
Constructing an HTableInterface implementation is very lightweight, and resources are
controlled.
<code>HTablePool</code> is Deprecated
Previous versions of this guide discussed HTablePool, which was
deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by HBASE-6500.
Please use HConnection instead.
-
-
WriteBuffer and Batch Methods
If autoflush is turned off on
HTable,
Puts are sent to the RegionServers when the writebuffer
is filled. The writebuffer is 2MB by default. Before an HTable instance is
discarded, either close() or
flushCommits() should be invoked so Puts
will not be lost.
Note: htable.delete(Delete); does not go in the writebuffer! This only applies to Puts.
For additional information on write durability, review the ACID semantics page.
For fine-grained control of batching of
Puts or Deletes,
see the batch methods on HTable.
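For illustration, a sketch of both approaches against an already-created HTable named table, with placeholder row, family, and qualifier names:

// buffer Puts client-side and send them in bulk
table.setAutoFlush(false);
for (int i = 0; i < 1000; i++) {
  Put put = new Put(Bytes.toBytes("row-" + i));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes(i));
  table.put(put);            // queued in the client-side write buffer
}
table.flushCommits();        // push any remaining buffered Puts

// fine-grained control over a mixed set of operations
List<Row> actions = new ArrayList<Row>();
Put put = new Put(Bytes.toBytes("row-a"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
actions.add(put);
actions.add(new Delete(Bytes.toBytes("row-b")));
Object[] results = new Object[actions.size()];
table.batch(actions, results);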
-
External Clients - Information on non-Java clients and custom protocols is covered in - -
-
- -
Client Request Filters - Get and Scan instances can be - optionally configured with filters which are applied on the RegionServer. - - Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups - of Filter functionality. - -
Structural - Structural Filters contain other Filters. -
FilterList - FilterList - represents a list of Filters with a relationship of FilterList.Operator.MUST_PASS_ALL or - FilterList.Operator.MUST_PASS_ONE between the Filters. The following example shows an 'or' between two - Filters (checking for either 'my value' or 'my other value' on the same attribute). - -FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE); -SingleColumnValueFilter filter1 = new SingleColumnValueFilter( - cf, - column, - CompareOp.EQUAL, - Bytes.toBytes("my value") - ); -list.add(filter1); -SingleColumnValueFilter filter2 = new SingleColumnValueFilter( - cf, - column, - CompareOp.EQUAL, - Bytes.toBytes("my other value") - ); -list.add(filter2); -scan.setFilter(list); - -
-
-
- Column Value -
SingleColumnValueFilter
SingleColumnValueFilter
can be used to test column values for equivalence (CompareOp.EQUAL
), inequality (CompareOp.NOT_EQUAL), or ranges (e.g.,
CompareOp.GREATER). The following is an example of testing a
column for equivalence to the String value "my value"...

SingleColumnValueFilter filter = new SingleColumnValueFilter(
  cf,
  column,
  CompareOp.EQUAL,
  Bytes.toBytes("my value")
  );
scan.setFilter(filter);
-
-
- Column Value Comparators - There are several Comparator classes in the Filter package that deserve special - mention. These Comparators are used in concert with other Filters, such as . -
- RegexStringComparator - RegexStringComparator - supports regular expressions for value comparisons. - -RegexStringComparator comp = new RegexStringComparator("my."); // any value that starts with 'my' -SingleColumnValueFilter filter = new SingleColumnValueFilter( - cf, - column, - CompareOp.EQUAL, - comp - ); -scan.setFilter(filter); - - See the Oracle JavaDoc for supported - RegEx patterns in Java. -
-
- SubstringComparator - SubstringComparator - can be used to determine if a given substring exists in a value. The comparison is - case-insensitive. - -SubstringComparator comp = new SubstringComparator("y val"); // looking for 'my value' -SingleColumnValueFilter filter = new SingleColumnValueFilter( - cf, - column, - CompareOp.EQUAL, - comp - ); -scan.setFilter(filter); - -
-
- BinaryPrefixComparator - See BinaryPrefixComparator. -
-
- BinaryComparator - See BinaryComparator. -
-
-
KeyValue Metadata
As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate
the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to
the values covered in the previous section.
- FamilyFilter - FamilyFilter - can be used to filter on the ColumnFamily. It is generally a better idea to select - ColumnFamilies in the Scan than to do it with a Filter. -
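For comparison, a brief sketch of both approaches with a placeholder family name; restricting the Scan itself is usually the better choice:

// preferred: restrict the Scan to the family
Scan scan1 = new Scan();
scan1.addFamily(Bytes.toBytes("cf"));

// the same selection expressed as a FamilyFilter
Scan scan2 = new Scan();
scan2.setFilter(new FamilyFilter(CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes("cf"))));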
-
- QualifierFilter - QualifierFilter - can be used to filter based on Column (aka Qualifier) name. -
-
- ColumnPrefixFilter - ColumnPrefixFilter - can be used to filter based on the lead portion of Column (aka Qualifier) names. - A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row - and for each involved column family. It can be used to efficiently get a subset of the - columns in very wide rows. - Note: The same column qualifier can be used in different column families. This - filter returns all matching columns. - Example: Find all columns in a row and family that start with "abc" - -HTableInterface t = ...; -byte[] row = ...; -byte[] family = ...; -byte[] prefix = Bytes.toBytes("abc"); -Scan scan = new Scan(row, row); // (optional) limit to one row -scan.addFamily(family); // (optional) limit to one family -Filter f = new ColumnPrefixFilter(prefix); -scan.setFilter(f); -scan.setBatch(10); // set this if there could be many columns returned -ResultScanner rs = t.getScanner(scan); -for (Result r = rs.next(); r != null; r = rs.next()) { - for (KeyValue kv : r.raw()) { - // each kv represents a column - } -} -rs.close(); - -
-
- MultipleColumnPrefixFilter - MultipleColumnPrefixFilter - behaves like ColumnPrefixFilter but allows specifying multiple prefixes. - Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the - first column matching the lowest prefix and also seeks past ranges of columns between - prefixes. It can be used to efficiently get discontinuous sets of columns from very wide - rows. - Example: Find all columns in a row and family that start with "abc" or "xyz" - -HTableInterface t = ...; -byte[] row = ...; -byte[] family = ...; -byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")}; -Scan scan = new Scan(row, row); // (optional) limit to one row -scan.addFamily(family); // (optional) limit to one family -Filter f = new MultipleColumnPrefixFilter(prefixes); -scan.setFilter(f); -scan.setBatch(10); // set this if there could be many columns returned -ResultScanner rs = t.getScanner(scan); -for (Result r = rs.next(); r != null; r = rs.next()) { - for (KeyValue kv : r.raw()) { - // each kv represents a column - } -} -rs.close(); - -
-
- ColumnRangeFilter - A ColumnRangeFilter - allows efficient intra row scanning. - A ColumnRangeFilter can seek ahead to the first matching column for each involved - column family. It can be used to efficiently get a 'slice' of the columns of a very wide - row. i.e. you have a million columns in a row but you only want to look at columns - bbbb-bbdd. - Note: The same column qualifier can be used in different column families. This - filter returns all matching columns. - Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd" - (inclusive) - -HTableInterface t = ...; -byte[] row = ...; -byte[] family = ...; -byte[] startColumn = Bytes.toBytes("bbbb"); -byte[] endColumn = Bytes.toBytes("bbdd"); -Scan scan = new Scan(row, row); // (optional) limit to one row -scan.addFamily(family); // (optional) limit to one family -Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true); -scan.setFilter(f); -scan.setBatch(10); // set this if there could be many columns returned -ResultScanner rs = t.getScanner(scan); -for (Result r = rs.next(); r != null; r = rs.next()) { - for (KeyValue kv : r.raw()) { - // each kv represents a column - } -} -rs.close(); - - Note: Introduced in HBase 0.92 -
-
-
RowKey -
RowFilter - It is generally a better idea to use the startRow/stopRow methods on Scan for row selection, however - RowFilter can also be used. -
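A short sketch contrasting the two approaches, with placeholder row keys:

// preferred: restrict the scan to a key range via start/stop row
Scan scan1 = new Scan(Bytes.toBytes("row-aaa"), Bytes.toBytes("row-mmm"));

// a RowFilter instead scans the range and tests each row key
Scan scan2 = new Scan();
scan2.setFilter(new RowFilter(CompareOp.EQUAL,
    new RegexStringComparator("row-a.*")));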
-
-
Utility -
FirstKeyOnlyFilter - This is primarily used for rowcount jobs. - See FirstKeyOnlyFilter. -
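As an illustration of the rowcount use case, a client-side sketch assuming an existing Table named table (the RowCounter MapReduce job uses the same filter, but distributes the work across the cluster):

Scan scan = new Scan();
scan.setFilter(new FirstKeyOnlyFilter());  // only the first KeyValue of each row is returned
long rowCount = 0;
try (ResultScanner scanner = table.getScanner(scan)) {
  for (Result r : scanner) {
    rowCount++;
  }
}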
-
-
- -
Master - HMaster is the implementation of the Master Server. The Master server is - responsible for monitoring all RegionServer instances in the cluster, and is the interface - for all metadata changes. In a distributed cluster, the Master typically runs on the . J Mohamed Zahoor goes into some more detail on the Master - Architecture in this blog posting, HBase HMaster - Architecture . -
Startup Behavior
If run in a multi-Master environment, all Masters compete to run the cluster. If the active
Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to
take over the Master role.
-
- Runtime Impact - A common dist-list question involves what happens to an HBase cluster when the Master - goes down. Because the HBase client talks directly to the RegionServers, the cluster can - still function in a "steady state." Additionally, per , hbase:meta exists as an HBase table and is not - resident in the Master. However, the Master controls critical functions such as - RegionServer failover and completing region splits. So while the cluster can still run for - a short time without the Master, the Master should be restarted as soon as possible. - -
-
Interface - The methods exposed by HMasterInterface are primarily metadata-oriented methods: - - Table (createTable, modifyTable, removeTable, enable, disable) - - ColumnFamily (addColumn, modifyColumn, removeColumn) - - Region (move, assign, unassign) - - - For example, when the HBaseAdmin method disableTable is invoked, it is serviced by the Master server. - -
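For example, a sketch of such metadata operations through the client-side Admin API (the table and family names are placeholders, and an open Connection named connection is assumed):

try (Admin admin = connection.getAdmin()) {
  TableName tn = TableName.valueOf("myTable");
  admin.disableTable(tn);                                    // table operation, serviced by the Master
  admin.addColumn(tn, new HColumnDescriptor("new_family"));  // ColumnFamily operation
  admin.enableTable(tn);
}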
-
Processes - The Master runs several background threads: - -
LoadBalancer - Periodically, and when there are no regions in transition, - a load balancer will run and move regions around to balance the cluster's load. - See for configuring this property. - See for more information on region assignment. - -
-
CatalogJanitor - Periodically checks and cleans up the hbase:meta table. See for more information on META. -
-
- -
-
- RegionServer - HRegionServer is the RegionServer implementation. It is responsible for - serving and managing regions. In a distributed cluster, a RegionServer runs on a . -
Interface
The methods exposed by HRegionRegionInterface contain both data-oriented
and region-maintenance methods:
Data (get, put, delete, next, etc.)
Region (splitRegion, compactRegion, etc.)
For example, when the HBaseAdmin method
majorCompact is invoked on a table, the client actually iterates
through all regions for the specified table and requests a major compaction directly for
each region.
-
- Processes - The RegionServer runs a variety of background threads: -
CompactSplitThread
Checks for splits and handles minor compactions.
-
- MajorCompactionChecker - Checks for major compactions. -
-
- MemStoreFlusher - Periodically flushes in-memory writes in the MemStore to StoreFiles. -
-
- LogRoller - Periodically checks the RegionServer's WAL. -
-
- -
- Coprocessors - Coprocessors were added in 0.92. There is a thorough Blog Overview - of CoProcessors posted. Documentation will eventually move to this reference - guide, but the blog is the most current information available at this time. -
- -
- Block Cache - - HBase provides two different BlockCache implementations: the default onheap - LruBlockCache and BucketCache, which is (usually) offheap. This section - discusses benefits and drawbacks of each implementation, how to choose the appropriate - option, and configuration options for each. - - Block Cache Reporting: UI - See the RegionServer UI for detail on caching deploy. Since HBase-0.98.4, the - Block Cache detail has been significantly extended showing configurations, - sizings, current usage, time-in-the-cache, and even detail on block counts and types. - - -
Cache Choices
LruBlockCache is the original implementation, and is
entirely within the Java heap. BucketCache is mainly
intended for keeping blockcache data offheap, although BucketCache can also
keep data onheap and serve from a file-backed cache.
BucketCache is production ready as of hbase-0.98.6
To run with BucketCache, you need HBASE-11678. This was included in
hbase-0.98.6.
Fetching will always be slower when fetching from BucketCache,
as compared to the native onheap LruBlockCache. However, latencies tend to be
less erratic across time, because there is less garbage collection when you use
BucketCache since it is managing BlockCache allocations, not the GC. If the
BucketCache is deployed in offheap mode, this memory is not managed by the
GC at all. This is why you'd use BucketCache: your latencies are less erratic, and you mitigate GC pauses
and heap fragmentation. See Nick Dimiduk's BlockCache 101 for
comparisons running onheap vs offheap tests. Also see
Comparing BlockCache Deploys,
which finds that if your dataset fits inside your LruBlockCache deploy, use it; otherwise,
if you are experiencing cache churn (or you want your cache to exist beyond the
vagaries of java GC), use BucketCache.
When you enable BucketCache, you are enabling a two tier caching
system, an L1 cache which is implemented by an instance of LruBlockCache and
an offheap L2 cache which is implemented by BucketCache. Management of these
two tiers and the policy that dictates how blocks move between them is done by
CombinedBlockCache. It keeps all DATA blocks in the L2
BucketCache and meta blocks -- INDEX and BLOOM blocks --
onheap in the L1 LruBlockCache.
See for more detail on going offheap.
- -
- General Cache Configurations - Apart from the cache implementation itself, you can set some general configuration - options to control how the cache performs. See . After setting any of these options, restart or rolling restart your cluster for the - configuration to take effect. Check logs for errors or unexpected behavior. - See also , which discusses a new option - introduced in HBASE-9857. -
- -
LruBlockCache Design
The LruBlockCache is an LRU cache that contains three levels of block priority to
allow for scan-resistance and in-memory ColumnFamilies:
Single access priority: The first time a block is loaded from HDFS it normally
has this priority and it will be part of the first group to be considered during
evictions. The advantage is that scanned blocks are more likely to get evicted than
blocks that are getting more usage.
Multi access priority: If a block in the previous priority group is accessed
again, it upgrades to this priority. It is thus part of the second group considered
during evictions.
In-memory access priority: If the block's family was configured to be
"in-memory", it will be part of this priority disregarding the number of times it
was accessed. Catalog tables are configured like this. This group is the last one
considered during evictions.
To mark a column family as in-memory, call
HColumnDescriptor.setInMemory(true); if creating a table from Java,
or set IN_MEMORY => true when creating or altering a table in
the shell: e.g. hbase(main):003:0> create 't', {NAME => 'f', IN_MEMORY => 'true'}
For more information, see the LruBlockCache
source.
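As an illustration, a sketch of marking a family in-memory when creating a table from Java; the table and family names are placeholders and an Admin instance named admin is assumed.

HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t"));
HColumnDescriptor family = new HColumnDescriptor("f");
family.setInMemory(true);   // blocks from this family get the in-memory cache priority
desc.addFamily(family);
admin.createTable(desc);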
-
LruBlockCache Usage
Block caching is enabled by default for all user tables, which means that any
read operation will load the LRU cache. This might be good for a large number of use
cases, but further tunings are usually required in order to achieve better performance.
An important concept is the working set size, or
WSS, which is: "the amount of memory needed to compute the answer to a problem". For a
website, this would be the data that's needed to answer the queries over a short amount
of time.
The way to calculate how much memory is available in HBase for caching is:
number of region servers * heap size * hfile.block.cache.size * 0.99
The default value for the block cache is 0.25 which represents 25% of the available
heap. The last value (99%) is the default acceptable loading factor in the LRU cache
after which eviction is started. The reason it is included in this equation is that it
would be unrealistic to say that it is possible to use 100% of the available memory,
since that would cause the process to block at the point where it loads new blocks.
Here are some examples:
One region server with the default heap size (1 GB) and the default block cache
size will have 253 MB of block cache available.
20 region servers with the heap size set to 8 GB and a default block cache size
will have 39.6 GB of block cache.
100 region servers with the heap size set to 24 GB and a block cache size of 0.5
will have about 1.16 TB of block cache.
Your data is not the only resident of the block cache. Here are others that you may have to take into account:
Catalog Tables
The -ROOT- (prior to HBase 0.96. See ) and hbase:meta tables are forced
into the block cache and have the in-memory priority which means that they are
harder to evict. The former never uses more than a few hundred bytes while the
latter can occupy a few MBs (depending on the number of regions).
HFiles Indexes
An hfile is the file format that HBase uses to store
data in HDFS. It contains a multi-layered index which allows HBase to seek to the
data without having to read the whole file. The size of those indexes is a factor
of the block size (64KB by default), the size of your keys and the amount of data
you are storing. For big data sets it's not unusual to see numbers around 1GB per
region server, although not all of it will be in cache because the LRU will evict
indexes that aren't used.
Keys
The values that are stored are only half the picture, since each value is
stored along with its keys (row key, family qualifier, and timestamp). See .
Bloom Filters
Just like the HFile indexes, those data structures (when enabled) are stored
in the LRU.
Currently the recommended way to measure HFile index and bloom filter sizes is to
look at the region server web UI and check out the relevant metrics. For keys, sampling
can be done by using the HFile command line tool and looking for the average key size
metric. Since HBase 0.98.3, you can view detail on BlockCache stats and metrics
in a special Block Cache section in the UI.
It's generally bad to use block caching when the WSS doesn't fit in memory. This is
the case when you have, for example, 40 GB available across all your region servers' block
caches but need to process 1 TB of data. One of the reasons is that the churn
generated by the evictions will trigger more garbage collections unnecessarily.
Here are
two use cases:
Fully random reading pattern: This is a case where you almost never access the
same row twice within a short amount of time, such that the chance of hitting a
cached block is close to 0. Setting block caching on such a table is a waste of
memory and CPU cycles, even more so because it will generate more garbage for the
JVM to collect. For more information on monitoring GC, see .
Mapping a table: In a typical MapReduce job that takes a table as input, every
row will be read only once so there's no need to put them into the block cache. The
Scan object has the option of turning this off via the setCacheBlocks method (set it to
false). You can still keep block caching turned on for this table if you need fast
random read access. An example would be counting the number of rows in a table that
serves live traffic: caching every block of that table would create massive churn
and would surely evict data that's currently in use.
Caching META blocks only (DATA blocks in fscache)
An interesting setup is one where we cache META blocks only and we read DATA
blocks in on each access. If the DATA blocks fit inside fscache, this alternative
may make sense when access is completely random across a very large dataset.
To enable this setup, alter your table and for each column family
set BLOCKCACHE => 'false'. You are 'disabling' the
BlockCache for this column family only; you can never disable the caching of
META blocks. Since
HBASE-4683 Always cache index and bloom blocks,
we will cache META blocks even if the BlockCache is disabled.
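The same setting from the Java API, as a sketch: the table and family names are placeholders, an Admin instance named admin is assumed, and depending on your version you may need to disable the table before modifying it.

TableName tn = TableName.valueOf("t");
HTableDescriptor desc = admin.getTableDescriptor(tn);
HColumnDescriptor family = desc.getFamily(Bytes.toBytes("f"));
family.setBlockCacheEnabled(false);  // DATA blocks bypass the BlockCache; INDEX and BLOOM blocks are still cached
admin.modifyTable(tn, desc);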
-
-
- Offheap Block Cache -
- How to Enable BucketCache - The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 onheap cache - implemented by LruBlockCache and a second L2 cache implemented with BucketCache. The managing class is CombinedBlockCache by default. - The just-previous link describes the caching 'policy' implemented by CombinedBlockCache. In short, it works - by keeping meta blocks -- INDEX and BLOOM in the L1, onheap LruBlockCache tier -- and DATA - blocks are kept in the L2, BucketCache tier. It is possible to amend this behavior in - HBase since version 1.0 and ask that a column family have both its meta and DATA blocks hosted onheap in the L1 tier by - setting cacheDataInL1 via - (HColumnDescriptor.setCacheDataInL1(true) - or in the shell, creating or amending column families setting CACHE_DATA_IN_L1 - to true: e.g. hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}} - - The BucketCache Block Cache can be deployed onheap, offheap, or file based. - You set which via the - hbase.bucketcache.ioengine setting. Setting it to - heap will have BucketCache deployed inside the - allocated java heap. Setting it to offheap will have - BucketCache make its allocations offheap, - and an ioengine setting of file:PATH_TO_FILE will direct - BucketCache to use a file caching (Useful in particular if you have some fast i/o attached to the box such - as SSDs). - - It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache - policy and have BucketCache working as a strict L2 cache to the L1 - LruBlockCache. For such a setup, set CacheConfig.BUCKET_CACHE_COMBINED_KEY to - false. In this mode, on eviction from L1, blocks go to L2. - When a block is cached, it is cached first in L1. When we go to look for a cached block, - we look first in L1 and if none found, then search L2. Let us call this deploy format, - Raw L1+L2. - Other BucketCache configs include: specifying a location to persist cache to across - restarts, how many threads to use writing the cache, etc. See the - CacheConfig.html - class for configuration options and descriptions. - - - BucketCache Example Configuration - This sample provides a configuration for a 4 GB offheap BucketCache with a 1 GB - onheap cache. Configuration is performed on the RegionServer. Setting - hbase.bucketcache.ioengine and - hbase.bucketcache.size > 0 enables CombinedBlockCache. - Let us presume that the RegionServer has been set to run with a 5G heap: - i.e. HBASE_HEAPSIZE=5g. - - - First, edit the RegionServer's hbase-env.sh and set - HBASE_OFFHEAPSIZE to a value greater than the offheap size wanted, in - this case, 4 GB (expressed as 4G). Lets set it to 5G. That'll be 4G - for our offheap cache and 1G for any other uses of offheap memory (there are - other users of offheap memory other than BlockCache; e.g. DFSClient - in RegionServer can make use of offheap memory). See . - HBASE_OFFHEAPSIZE=5G - - - Next, add the following configuration to the RegionServer's - hbase-site.xml. - - - hbase.bucketcache.ioengine - offheap - - - hfile.block.cache.size - 0.2 - - - hbase.bucketcache.size - 4196 -]]> - - - - Restart or rolling restart your cluster, and check the logs for any - issues. - - - In the above, we set bucketcache to be 4G. The onheap lrublockcache we - configured to have 0.2 of the RegionServer's heap size (0.2 * 5G = 1G). - In other words, you configure the L1 LruBlockCache as you would normally, - as you would when there is no L2 BucketCache present. 
HBASE-10641 introduced the ability to configure multiple sizes for the
buckets of the bucketcache, in HBase 0.98 and newer. To configure multiple bucket
sizes, configure the new property (instead of
) to a comma-separated list of block sizes,
ordered from smallest to largest, with no spaces. The goal is to optimize the bucket
sizes based on your data access patterns. The following example configures buckets of
size 4096 and 8192.

 hfile.block.cache.sizes
 4096,8192

 ]]>

Direct Memory Usage In HBase
The default maximum direct memory varies by JVM. Traditionally it is 64M
or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently).
HBase servers use direct memory; in particular with short-circuit reading, the hosted DFSClient will
allocate direct memory buffers. If you do offheap block caching, you'll
be making use of direct memory. When starting your JVM, make sure
the -XX:MaxDirectMemorySize setting in
conf/hbase-env.sh is set to some value that is
higher than what you have allocated to your offheap blockcache
(hbase.bucketcache.size). It should be larger than your offheap block
cache and then some for DFSClient usage (How much the DFSClient uses is not
easy to quantify; it is the number of open hfiles * hbase.dfs.client.read.shortcircuit.buffer.size
where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- see hbase-default.xml
default configurations).
Direct memory, which is part of the Java process's memory footprint, is separate from the object
heap allocated by -Xmx. The value allocated by MaxDirectMemorySize must not exceed
physical RAM, and is likely to be less than the total available RAM due to other
memory requirements and system constraints.
You can see how much memory -- onheap and offheap/direct -- a RegionServer is
configured to use and how much it is using at any one time by looking at the
Server Metrics: Memory tab in the UI. It can also be gotten
via JMX. In particular the direct memory currently used by the server can be found
on the java.nio.type=BufferPool,name=direct bean. Terracotta has
a good write up on using offheap memory in java. It is for their product
BigMemory but a lot of the issues noted apply in general to any attempt at going
offheap. Check it out.
hbase.bucketcache.percentage.in.combinedcache
This is a pre-HBase 1.0 configuration removed because it
was confusing. It was a float that you would set to some value
between 0.0 and 1.0. Its default was 0.9. If the deploy was using
CombinedBlockCache, then the LruBlockCache L1 size was calculated to
be (1 - hbase.bucketcache.percentage.in.combinedcache) * size-of-bucketcache
and the BucketCache size was hbase.bucketcache.percentage.in.combinedcache * size-of-bucket-cache,
where size-of-bucket-cache itself is EITHER the value of the configuration hbase.bucketcache.size
IF it was specified as megabytes OR hbase.bucketcache.size * -XX:MaxDirectMemorySize if
hbase.bucketcache.size is between 0 and 1.0.
In 1.0, it should be more straightforward. L1 LruBlockCache size
is set as a fraction of java heap using hfile.block.cache.size setting
(not the best name) and L2 is set as above either in absolute
megabytes or as a fraction of allocated maximum direct memory.
-
-
Compressed Blockcache
HBASE-11331 introduced lazy blockcache decompression, more simply referred to
as compressed blockcache. When compressed blockcache is enabled, data and encoded data
blocks are cached in the blockcache in their on-disk format, rather than being
decompressed and decrypted before caching.
For a RegionServer
hosting more data than can fit into cache, enabling this feature with SNAPPY compression
has been shown to result in a 50% increase in throughput and a 30% improvement in mean
latency, while increasing garbage collection by 80% and overall CPU load by
2%. See HBASE-11331 for more details about how performance was measured and achieved.
For a RegionServer hosting data that can comfortably fit into cache, or if your workload
is sensitive to extra CPU or garbage-collection load, you may receive less
benefit.
Compressed blockcache is disabled by default. To enable it, set
hbase.block.data.cachecompressed to true in
hbase-site.xml on all RegionServers.
-
- -
- Write Ahead Log (WAL) - -
- Purpose - The Write Ahead Log (WAL) records all changes to data in - HBase, to file-based storage. Under normal operations, the WAL is not needed because - data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or - becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to - the data can be replayed. If writing to the WAL fails, the entire operation to modify the - data fails. - - HBase uses an implementation of the WAL interface. Usually, there is only one instance of a WAL per RegionServer. - The RegionServer records Puts and Deletes to it, before recording them to the for the affected . - - - The HLog - - Prior to 2.0, the interface for WALs in HBase was named HLog. - In 0.94, HLog was the name of the implementation of the WAL. You will likely find - references to the HLog in documentation tailored to these older versions. - - - The WAL resides in HDFS in the /hbase/WALs/ directory (prior to - HBase 0.94, they were stored in /hbase/.logs/), with subdirectories per - region. - For more general information about the concept of write ahead logs, see the - Wikipedia Write-Ahead Log - article. -
-
- WAL Flushing - TODO (describe). -
- -
- WAL Splitting - - A RegionServer serves many regions. All of the regions in a region server share the - same active WAL file. Each edit in the WAL file includes information about which region - it belongs to. When a region is opened, the edits in the WAL file which belong to that - region need to be replayed. Therefore, edits in the WAL file must be grouped by region - so that particular sets can be replayed to regenerate the data in a particular region. - The process of grouping the WAL edits by region is called log - splitting. It is a critical process for recovering data if a region server - fails. - Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler - as a region server shuts down. So that consistency is guaranteed, affected regions - are unavailable until data is restored. All WAL edits need to be recovered and replayed - before a given region can become available again. As a result, regions affected by - log splitting are unavailable until the process completes. - - Log Splitting, Step by Step - - The <filename>/hbase/WALs/<host>,<port>,<startcode></filename> directory is renamed. - Renaming the directory is important because a RegionServer may still be up and - accepting requests even if the HMaster thinks it is down. If the RegionServer does - not respond immediately and does not heartbeat its ZooKeeper session, the HMaster - may interpret this as a RegionServer failure. Renaming the logs directory ensures - that existing, valid WAL files which are still in use by an active but busy - RegionServer are not written to by accident. - The new directory is named according to the following pattern: - ,,-splitting]]> - An example of such a renamed directory might look like the following: - /hbase/WALs/srv.example.com,60020,1254173957298-splitting - - - Each log file is split, one at a time. - The log splitter reads the log file one edit entry at a time and puts each edit - entry into the buffer corresponding to the edit’s region. At the same time, the - splitter starts several writer threads. Writer threads pick up a corresponding - buffer and write the edit entries in the buffer to a temporary recovered edit - file. The temporary edit file is stored to disk with the following naming pattern: - //recovered.edits/.temp]]> - This file is used to store all the edits in the WAL log for this region. After - log splitting completes, the .temp file is renamed to the - sequence ID of the first log written to the file. - To determine whether all edits have been written, the sequence ID is compared to - the sequence of the last edit that was written to the HFile. If the sequence of the - last edit is greater than or equal to the sequence ID included in the file name, it - is clear that all writes from the edit file have been completed. - - - After log splitting is complete, each affected region is assigned to a - RegionServer. - When the region is opened, the recovered.edits folder is checked for recovered - edits files. If any such files are present, they are replayed by reading the edits - and saving them to the MemStore. After all edit files are replayed, the contents of - the MemStore are written to disk (HFile) and the edit files are deleted. - - - -
Handling of Errors During Log Splitting
If you set the hbase.hlog.split.skip.errors option to
true, errors are treated as follows:
Any error encountered during splitting will be logged.
The problematic WAL log will be moved into the .corrupt
directory under the hbase rootdir.
Processing of the WAL will continue.
If the hbase.hlog.split.skip.errors option is set to
false, the default, the exception will be propagated and the
split will be logged as failed. See HBASE-2958 When
hbase.hlog.split.skip.errors is set to false, we fail the split but that's
it. We need to do more than just fail split if this flag is set.
How EOFExceptions are treated when splitting a crashed RegionServer's
WALs
If an EOFException occurs while splitting logs, the split proceeds even when
hbase.hlog.split.skip.errors is set to
false. An EOFException while reading the last log in the set of
files to split is likely, because the RegionServer was probably in the process of
writing a record at the time of the crash. For background, see HBASE-2643
Figure how to deal with eof splitting logs.
-
- -
Performance Improvements during Log Splitting
WAL log splitting and recovery can be resource intensive and take a long time,
depending on the number of RegionServers involved in the crash and the size of the
regions. Distributed Log Splitting and Distributed Log Replay were developed to improve
performance during log splitting.
- Distributed Log Splitting - Distributed Log Splitting was added in HBase version 0.92 - (HBASE-1364) - by Prakash Khemani from Facebook. It reduces the time to complete log splitting - dramatically, improving the availability of regions and tables. For - example, recovering a crashed cluster took around 9 hours with single-threaded log - splitting, but only about six minutes with distributed log splitting. - The information in this section is sourced from Jimmy Xiang's blog post at . - - - Enabling or Disabling Distributed Log Splitting - Distributed log processing is enabled by default since HBase 0.92. The setting - is controlled by the hbase.master.distributed.log.splitting - property, which can be set to true or false, - but defaults to true. - - - Distributed Log Splitting, Step by Step - After configuring distributed log splitting, the HMaster controls the process. - The HMaster enrolls each RegionServer in the log splitting process, and the actual - work of splitting the logs is done by the RegionServers. The general process for - log splitting, as described in still applies here. - - If distributed log processing is enabled, the HMaster creates a - split log manager instance when the cluster is started. - The split log manager manages all log files which need - to be scanned and split. The split log manager places all the logs into the - ZooKeeper splitlog node (/hbase/splitlog) as tasks. You can - view the contents of the splitlog by issuing the following - zkcli command. Example output is shown. - ls /hbase/splitlog -[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900, -hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931, -hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946] - - The output contains some non-ASCII characters. When decoded, it looks much - more simple: - -[hdfs://host2.sample.com:56020/hbase/.logs -/host8.sample.com,57020,1340474893275-splitting -/host8.sample.com%3A57020.1340474893900, -hdfs://host2.sample.com:56020/hbase/.logs -/host3.sample.com,57020,1340474893299-splitting -/host3.sample.com%3A57020.1340474893931, -hdfs://host2.sample.com:56020/hbase/.logs -/host4.sample.com,57020,1340474893287-splitting -/host4.sample.com%3A57020.1340474893946] - - The listing represents WAL file names to be scanned and split, which is a - list of log splitting tasks. - - - The split log manager monitors the log-splitting tasks and workers. - The split log manager is responsible for the following ongoing tasks: - - - Once the split log manager publishes all the tasks to the splitlog - znode, it monitors these task nodes and waits for them to be - processed. - - - Checks to see if there are any dead split log - workers queued up. If it finds tasks claimed by unresponsive workers, it - will resubmit those tasks. If the resubmit fails due to some ZooKeeper - exception, the dead worker is queued up again for retry. - - - Checks to see if there are any unassigned - tasks. If it finds any, it create an ephemeral rescan node so that each - split log worker is notified to re-scan unassigned tasks via the - nodeChildrenChanged ZooKeeper event. - - - Checks for tasks which are assigned but expired. 
If any are found, they - are moved back to TASK_UNASSIGNED state again so that they can - be retried. It is possible that these tasks are assigned to slow workers, or - they may already be finished. This is not a problem, because log splitting - tasks have the property of idempotence. In other words, the same log - splitting task can be processed many times without causing any - problem. - - - The split log manager watches the HBase split log znodes constantly. If - any split log task node data is changed, the split log manager retrieves the - node data. The - node data contains the current state of the task. You can use the - zkcli get command to retrieve the - current state of a task. In the example output below, the first line of the - output shows that the task is currently unassigned. - -get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945 - -unassigned host2.sample.com:57000 -cZxid = 0×7115 -ctime = Sat Jun 23 11:13:40 PDT 2012 -... - - Based on the state of the task whose data is changed, the split log - manager does one of the following: - - - - Resubmit the task if it is unassigned - - - Heartbeat the task if it is assigned - - - Resubmit or fail the task if it is resigned (see ) - - - Resubmit or fail the task if it is completed with errors (see ) - - - Resubmit or fail the task if it could not complete due to - errors (see ) - - - Delete the task if it is successfully completed or failed - - - - Reasons a Task Will Fail - The task has been deleted. - The node no longer exists. - The log status manager failed to move the state of the task - to TASK_UNASSIGNED. - The number of resubmits is over the resubmit - threshold. - - - - - - Each RegionServer's split log worker performs the log-splitting tasks. - Each RegionServer runs a daemon thread called the split log - worker, which does the work to split the logs. The daemon thread - starts when the RegionServer starts, and registers itself to watch HBase znodes. - If any splitlog znode children change, it notifies a sleeping worker thread to - wake up and grab more tasks. If if a worker's current task’s node data is - changed, the worker checks to see if the task has been taken by another worker. - If so, the worker thread stops work on the current task. - The worker monitors - the splitlog znode constantly. When a new task appears, the split log worker - retrieves the task paths and checks each one until it finds an unclaimed task, - which it attempts to claim. If the claim was successful, it attempts to perform - the task and updates the task's state property based on the - splitting outcome. At this point, the split log worker scans for another - unclaimed task. - - How the Split Log Worker Approaches a Task - - - It queries the task state and only takes action if the task is in - TASK_UNASSIGNED state. - - - If the task is is in TASK_UNASSIGNED state, the - worker attempts to set the state to TASK_OWNED by itself. - If it fails to set the state, another worker will try to grab it. The split - log manager will also ask all workers to rescan later if the task remains - unassigned. - - - If the worker succeeds in taking ownership of the task, it tries to get - the task state again to make sure it really gets it asynchronously. In the - meantime, it starts a split task executor to do the actual work: - - - Get the HBase root folder, create a temp folder under the root, and - split the log file to the temp folder. 
- - - If the split was successful, the task executor sets the task to - state TASK_DONE. - - - If the worker catches an unexpected IOException, the task is set to - state TASK_ERR. - - - If the worker is shutting down, set the the task to state - TASK_RESIGNED. - - - If the task is taken by another worker, just log it. - - - - - - - The split log manager monitors for uncompleted tasks. - The split log manager returns when all tasks are completed successfully. If - all tasks are completed with some failures, the split log manager throws an - exception so that the log splitting can be retried. Due to an asynchronous - implementation, in very rare cases, the split log manager loses track of some - completed tasks. For that reason, it periodically checks for remaining - uncompleted task in its task map or ZooKeeper. If none are found, it throws an - exception so that the log splitting can be retried right away instead of hanging - there waiting for something that won’t happen. - - -
-
- Distributed Log Replay - After a RegionServer fails, its failed region is assigned to another - RegionServer, which is marked as "recovering" in ZooKeeper. A split log worker directly - replays edits from the WAL of the failed region server to the region at its new - location. When a region is in "recovering" state, it can accept writes but no reads - (including Append and Increment), region splits or merges. - Distributed Log Replay extends the framework. It works by - directly replaying WAL edits to another RegionServer instead of creating - recovered.edits files. It provides the following advantages - over distributed log splitting alone: - - It eliminates the overhead of writing and reading a large number of - recovered.edits files. It is not unusual for thousands of - recovered.edits files to be created and written concurrently - during a RegionServer recovery. Many small random writes can degrade overall - system performance. - It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to accept writes again. - - - - Enabling Distributed Log Replay - To enable distributed log replay, set hbase.master.distributed.log.replay to - true. This will be the default for HBase 0.99 (HBASE-10888). - - You must also enable HFile version 3 (which is the default HFile format starting - in HBase 0.99. See HBASE-10855). - Distributed log replay is unsafe for rolling upgrades. -
-
-
-
- Disabling the WAL
 It is possible to disable the WAL in order to improve performance in certain specific situations. However, disabling the WAL puts your data at risk. The only situation where this is recommended is during a bulk load, because in the event of a problem the bulk load can be re-run with no risk of data loss.
 The WAL is disabled per mutation on the HBase client, by calling Mutation.setDurability(Durability.SKIP_WAL). Use Mutation.getDurability() to check the current setting. (The older Put.setWriteToWAL(false) call had the same effect but is deprecated.) There is no way to disable the WAL for only a specific table.
 If you disable the WAL for anything other than bulk loads, your data is at risk.
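 As a minimal client-side sketch, assuming a table named t1 with a column family cf (both placeholders), skipping the WAL for a single Put looks like this; only the setDurability call differs from an ordinary write:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, TableName.valueOf("t1")); // placeholder table name
    try {
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes("value1"));
      // Skip the WAL for this mutation only; the data is at risk until it is flushed.
      put.setDurability(Durability.SKIP_WAL);
      table.put(put);
    } finally {
      table.close();
    }
  }
}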
-
- -
- -
- Regions
 Regions are the basic element of availability and distribution for tables, and consist of a Store per Column Family. The hierarchy of objects is as follows:

Table (HBase table)
    Region (Regions for the table)
        Store (Store per ColumnFamily for each Region for the table)
            MemStore (MemStore for each Store for each Region for the table)
            StoreFile (StoreFiles for each Store for each Region for the table)
                Block (Blocks within a StoreFile within a Store for each Region for the table)

 For a description of what HBase files look like when written to HDFS, see the discussion of the HBase directory structure on HDFS.
- Considerations for Number of Regions - In general, HBase is designed to run with a small (20-200) number of relatively large (5-20Gb) regions per server. The considerations for this are as follows: -
- Why can't I have too many regions?
 Typically you want to keep your region count low on HBase, for numerous reasons. Usually around 100 regions per RegionServer has yielded the best results. Here are some of the reasons for keeping the region count low:
 - MSLAB requires 2 MB per MemStore (that's 2 MB per family per region). 1000 regions with 2 families each use 3.9 GB of heap before they store any data. Note that the 2 MB value is configurable.
 - If you fill all the regions at roughly the same rate, the global memory usage forces tiny flushes when you have too many regions, which in turn generates compactions. Rewriting the same data tens of times is the last thing you want. For example, consider filling 1000 regions (with one family) equally, with a lower bound of 5 GB for global MemStore usage (the RegionServer would have a big heap). Once usage reaches 5 GB, the biggest region is force-flushed; at that point almost all regions hold about 5 MB of data, so only that amount is flushed. After another 5 MB is inserted, another region with a bit over 5 MB of data is flushed, and so on. This is currently the main limiting factor for the number of regions; see the region-count formula in the configuration chapter for details.
 - The Master, as currently implemented, is allergic to tons of regions, and takes a lot of time assigning them and moving them around in batches. The reason is that it is heavy on ZooKeeper usage and not very asynchronous at the moment (this has been improved a good deal in HBase 0.96).
 - In older versions of HBase (pre-v2 HFile, 0.90 and earlier), tons of regions on a few RegionServers can cause the store file index to rise, increasing heap usage and potentially creating memory pressure or OOMEs on the RegionServers.
 - Another issue is the effect of the number of regions on MapReduce jobs; it is typical to have one mapper per HBase region. Hosting only 5 regions per RegionServer may not be enough to generate a sufficient number of tasks for a MapReduce job, while 1000 regions will generate far too many tasks.
 See the configuration guidelines for recommended region counts and sizes.
- -
- -
- Region-RegionServer Assignment - This section describes how Regions are assigned to RegionServers. - - -
- Startup
 When HBase starts, regions are assigned as follows (short version):
 - The Master invokes the AssignmentManager upon startup.
 - The AssignmentManager looks at the existing region assignments in META.
 - If a region assignment is still valid (i.e., if the RegionServer is still online), the assignment is kept.
 - If an assignment is invalid, the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer.
 - META is updated with the RegionServer assignment (if needed) and the RegionServer start code (the start time of the RegionServer process) when the region is opened by the RegionServer.
- -
- Failover - When a RegionServer fails: - - The regions immediately become unavailable because the RegionServer is - down. - - - The Master will detect that the RegionServer has failed. - - - The region assignments will be considered invalid and will be re-assigned just - like the startup sequence. - - - In-flight queries are re-tried, and not lost. - - - Operations are switched to a new RegionServer within the following amount of - time: - ZooKeeper session timeout + split time + assignment/replay time - - - -
- -
- Region Load Balancing
 Regions can be periodically moved by the load balancer.
- -
- Region State Transition - HBase maintains a state for each region and persists the state in META. The state - of the META region itself is persisted in ZooKeeper. You can see the states of regions - in transition in the Master web UI. Following is the list of possible region - states. - - - Possible Region States - - OFFLINE: the region is offline and not opening - - - OPENING: the region is in the process of being opened - - - OPEN: the region is open and the region server has notified the master - - - FAILED_OPEN: the region server failed to open the region - - - CLOSING: the region is in the process of being closed - - - CLOSED: the region server has closed the region and notified the master - - - FAILED_CLOSE: the region server failed to close the region - - - SPLITTING: the region server notified the master that the region is - splitting - - - SPLIT: the region server notified the master that the region has finished - splitting - - - SPLITTING_NEW: this region is being created by a split which is in - progress - - - MERGING: the region server notified the master that this region is being merged - with another region - - - MERGED: the region server notified the master that this region has been - merged - - - MERGING_NEW: this region is being created by a merge of two regions - - - -
- Region State Transitions - - - - -
 Graph Legend
 - Brown: Offline state, a special state that can be transient (after closed, before opening), terminal (regions of disabled tables), or initial (regions of newly created tables)
 - Palegreen: Online state in which regions can serve requests
 - Lightblue: Transient states
 - Red: Failure states that need OPS attention
 - Gold: Terminal states of regions that have been split/merged
 - Grey: Initial states of regions created through split/merge

 Region State Transitions Explained
 - The master moves a region from OFFLINE to OPENING state and tries to assign the region to a region server. The region server may or may not have received the open region request. The master retries sending the open region request to the region server until the RPC goes through or the master runs out of retries. After the region server receives the open region request, the region server begins opening the region.
 - If the master runs out of retries, the master prevents the region server from opening the region by moving the region to CLOSING state and trying to close it, even if the region server is starting to open the region.
 - After the region server opens the region, it continues to try to notify the master until the master moves the region to OPEN state and notifies the region server. The region is now open.
 - If the region server cannot open the region, it notifies the master. The master moves the region to CLOSED state and tries to open the region on a different region server.
 - If the master cannot open the region on any of a certain number of region servers, it moves the region to FAILED_OPEN state, and takes no further action until an operator intervenes from the HBase shell, or the server is dead.
 - The master moves a region from OPEN to CLOSING state. The region server holding the region may or may not have received the close region request. The master retries sending the close request to the server until the RPC goes through or the master runs out of retries.
 - If the region server is not online, or throws NotServingRegionException, the master moves the region to OFFLINE state and re-assigns it to a different region server.
 - If the region server is online, but not reachable after the master runs out of retries, the master moves the region to FAILED_CLOSE state and takes no further action until an operator intervenes from the HBase shell, or the server is dead.
 - If the region server gets the close region request, it closes the region and notifies the master. The master moves the region to CLOSED state and re-assigns it to a different region server.
 - Before assigning a region, the master moves the region to OFFLINE state automatically if it is in CLOSED state.
 - When a region server is about to split a region, it notifies the master. The master moves the region to be split from OPEN to SPLITTING state and adds the two new regions to be created to the region server. These two regions are in SPLITTING_NEW state initially.
 - After notifying the master, the region server starts to split the region. Once past the point of no return, the region server notifies the master again so the master can update the META. However, the master does not update the region states until it is notified by the server that the split is done.
If the split is successful, the splitting region is moved from SPLITTING to SPLIT state and the two new regions are moved from SPLITTING_NEW to OPEN state.
 - If the split fails, the splitting region is moved from SPLITTING back to OPEN state, and the two new regions which were created are moved from SPLITTING_NEW to OFFLINE state.
 - When a region server is about to merge two regions, it notifies the master first. The master moves the two regions to be merged from OPEN to MERGING state, and adds the new region which will hold the contents of the merged regions to the region server. The new region is in MERGING_NEW state initially.
 - After notifying the master, the region server starts to merge the two regions. Once past the point of no return, the region server notifies the master again so the master can update the META. However, the master does not update the region states until it is notified by the region server that the merge has completed. If the merge is successful, the two merging regions are moved from MERGING to MERGED state and the new region is moved from MERGING_NEW to OPEN state.
 - If the merge fails, the two merging regions are moved from MERGING back to OPEN state, and the new region which was created to hold the contents of the merged regions is moved from MERGING_NEW to OFFLINE state.
 - For regions in FAILED_OPEN or FAILED_CLOSE states, the master tries to close them again when they are reassigned by an operator via the HBase shell.
- Region-RegionServer Locality - Over time, Region-RegionServer locality is achieved via HDFS block replication. - The HDFS client does the following by default when choosing locations to write replicas: - - First replica is written to local node - - Second replica is written to a random node on another rack - - Third replica is written on the same rack as the second, but on a different node chosen randomly - - Subsequent replicas are written on random nodes on the cluster. See Replica Placement: The First Baby Steps on this page: HDFS Architecture - - - Thus, HBase eventually achieves locality for a region after a flush or a compaction. - In a RegionServer failover situation a RegionServer may be assigned regions with non-local - StoreFiles (because none of the replicas are local), however as new data is written - in the region, or the table is compacted and StoreFiles are re-written, they will become "local" - to the RegionServer. - - For more information, see Replica Placement: The First Baby Steps on this page: HDFS Architecture - and also Lars George's blog on HBase and HDFS locality. - -
- -
- Region Splits
 Regions split when they reach a configured threshold. Below we treat the topic briefly. For a longer exposition, see Apache HBase Region Splitting and Merging by Enis Soztutar.
 Splits run unaided on the RegionServer; that is, the Master does not participate. The RegionServer splits a region, offlines the split region, adds the daughter regions to META, opens the daughters on the parent's hosting RegionServer, and then reports the split to the Master. See the section on managed splitting for how to manually manage splits (and for why you might do this).
- Custom Split Policies
 The default split policy can be overridden using a custom RegionSplitPolicy (HBase 0.94+). Typically a custom split policy should extend HBase's default split policy: ConstantSizeRegionSplitPolicy.
 The policy can be set globally through the HBaseConfiguration used, or on a per-table basis:

HTableDescriptor myHtd = ...;
myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());
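 A minimal sketch of such a custom policy is shown below. The class name MyCustomSplitPolicy and the decision logic are illustrative assumptions, not part of HBase; the override simply delegates to the parent policy where table-specific criteria could be added. Note that because split decisions are made on the server, the class must be on the RegionServer classpath.

import org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy;

// Hypothetical custom policy: same behavior as ConstantSizeRegionSplitPolicy,
// with a hook where table-specific split logic could be added.
public class MyCustomSplitPolicy extends ConstantSizeRegionSplitPolicy {
  @Override
  protected boolean shouldSplit() {
    // Custom criteria could be combined with the parent's size-based decision here.
    return super.shouldSplit();
  }
}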
-
- -
- Manual Region Splitting - It is possible to manually split your table, either at table creation (pre-splitting), - or at a later time as an administrative action. You might choose to split your region for - one or more of the following reasons. There may be other valid reasons, but the need to - manually split your table might also point to problems with your schema design. - - Reasons to Manually Split Your Table - - Your data is sorted by timeseries or another similar algorithm that sorts new data - at the end of the table. This means that the Region Server holding the last region is - always under load, and the other Region Servers are idle, or mostly idle. See also - . - - - You have developed an unexpected hotspot in one region of your table. For - instance, an application which tracks web searches might be inundated by a lot of - searches for a celebrity in the event of news about that celebrity. See for more discussion about this particular - scenario. - - - After a big increase to the number of Region Servers in your cluster, to get the - load spread out quickly. - - - Before a bulk-load which is likely to cause unusual and uneven load across - regions. - - - See for a discussion about the dangers and - possible benefits of managing splitting completely manually. -
- Determining Split Points
 The goal of splitting your table manually is to improve the chances of balancing the load across the cluster in situations where good rowkey design alone won't get you there. Keeping that in mind, the way you split your regions is very dependent upon the characteristics of your data. It may be that you already know the best way to split your table. If not, the way you split your table depends on what your keys are like.
 Alphanumeric Rowkeys
 If your rowkeys start with a letter or number, you can split your table at letter or number boundaries. For instance, the following command creates a table with regions that split at each vowel. The result is six regions: row keys sorting before 'a', then 'a' through 'd', 'e' through 'h', 'i' through 'n', 'o' through 't', and 'u' onward.
 hbase> create 'test_table', 'f1', SPLITS=> ['a', 'e', 'i', 'o', 'u']
 The following command splits an existing table at split point '2'.
 hbase> split 'test_table', '2'
 You can also split a specific region by referring to its ID. You can find the region ID by looking at either the table or the region in the Web UI. It will be a long name such as t2,1,1410227759524.829850c6eaba1acc689480acd8f081bd. The format is table_name,start_key,region_id. To split that region into two, as close to equally as possible (at the nearest row boundary), issue the following command.
 hbase> split 't2,1,1410227759524.829850c6eaba1acc689480acd8f081bd.'
 The split key is optional. If it is omitted, the table or region is split in half.
 The following example shows how to use the RegionSplitter to create 10 regions, split at hexadecimal values.
 hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
 Using a Custom Algorithm
 The RegionSplitter tool is provided with HBase, and uses a SplitAlgorithm to determine split points for you. As parameters, you give it the algorithm, the desired number of regions, and the column families. It includes two split algorithms. The first is the HexStringSplit algorithm, which assumes the row keys are hexadecimal strings. The second, UniformSplit, assumes the row keys are random byte arrays. You will probably need to develop your own SplitAlgorithm, using the provided ones as models.
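 The same pre-splitting can be done from the Java client API. The following sketch mirrors the shell example above, assuming a table named test_table with a single family f1 (placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("test_table"));
      desc.addFamily(new HColumnDescriptor("f1"));
      // Pre-split at each vowel, mirroring the shell example above.
      byte[][] splits = new byte[][] {
          Bytes.toBytes("a"), Bytes.toBytes("e"), Bytes.toBytes("i"),
          Bytes.toBytes("o"), Bytes.toBytes("u") };
      admin.createTable(desc, splits);
      // An existing table or region can later be split explicitly, e.g.:
      // admin.split(Bytes.toBytes("test_table"), Bytes.toBytes("2"));
    } finally {
      admin.close();
    }
  }
}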
-
-
- Online Region Merges
 Both the Master and the RegionServer participate in online region merges. The client sends a merge RPC to the Master. The Master then moves the regions to the RegionServer where the more heavily loaded of the two regions resides, and finally sends the merge request to that RegionServer, which runs the merge. As with region splits, the merge runs as a local transaction on the RegionServer: it offlines the regions, merges them on the file system, atomically deletes the merging regions from META and adds the merged region to META, opens the merged region on the RegionServer, and finally reports the merge to the Master.
 An example of region merges in the HBase shell:
hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true
 This is an asynchronous operation; the call returns immediately without waiting for the merge to complete. Passing 'true' as the optional third parameter forces the merge; without it, the merge fails unless the two regions are adjacent. The 'force' option is for experts only.
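 The same merge can be requested from the Java client API. In the following sketch the encoded region names are placeholders, and the call is asynchronous just like the shell command:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class MergeRegionsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Encoded region names, as shown in the Web UI or the shell; placeholders here.
      byte[] regionA = Bytes.toBytes("ENCODED_REGIONNAME_A");
      byte[] regionB = Bytes.toBytes("ENCODED_REGIONNAME_B");
      // false = only merge adjacent regions; the call returns before the merge completes.
      admin.mergeRegions(regionA, regionB, false);
    } finally {
      admin.close();
    }
  }
}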
- -
- Store - A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region. - -
- MemStore - The MemStore holds in-memory modifications to the Store. Modifications are - Cells/KeyValues. When a flush is requested, the current memstore is moved to a snapshot and is - cleared. HBase continues to serve edits from the new memstore and backing snapshot until - the flusher reports that the flush succeeded. At this point, the snapshot is discarded. - Note that when the flush happens, Memstores that belong to the same region will all be - flushed. -
-
- MemStoreFlush
 A MemStore flush can be triggered under any of the conditions listed below. The minimum flush unit is the region, not the individual MemStore.
 - When a MemStore reaches the size specified by hbase.hregion.memstore.flush.size, all MemStores that belong to its region are flushed out to disk.
 - When overall MemStore usage reaches the value specified by hbase.regionserver.global.memstore.upperLimit, MemStores from various regions are flushed out to disk to reduce overall MemStore usage on the RegionServer. Regions are flushed in descending order of their MemStore usage, until overall MemStore usage drops to or slightly below hbase.regionserver.global.memstore.lowerLimit.
 - When the number of WAL files per RegionServer reaches the value specified in hbase.regionserver.max.logs, MemStores from various regions are flushed out to disk to reduce the WAL count. The flush order is based on time: regions with the oldest MemStores are flushed first, until the WAL count drops below hbase.regionserver.max.logs.
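 As an illustrative sketch only, the flush-related properties above could be set programmatically as follows; the values shown are defaults or placeholders, not tuning recommendations, and these settings normally belong in hbase-site.xml on the RegionServers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushTuningSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024); // per-region trigger
    conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);  // fraction of heap
    conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.38f); // flush-until target
    conf.setInt("hbase.regionserver.max.logs", 32);                        // WAL-count trigger
    System.out.println("flush.size = " + conf.getLong("hbase.hregion.memstore.flush.size", -1));
  }
}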
-
- Scans
 - When a client issues a scan against a table, HBase generates RegionScanner objects, one per region, to serve the scan request.
 - The RegionScanner object contains a list of StoreScanner objects, one per column family.
 - Each StoreScanner object further contains a list of StoreFileScanner objects, corresponding to each StoreFile and HFile of the corresponding column family, and a list of KeyValueScanner objects for the MemStore.
 - The two lists are merged into one, which is sorted in ascending order with the scan object for the MemStore at the end of the list.
 - When a StoreFileScanner object is constructed, it is associated with a MultiVersionConsistencyControl read point, which is the current memstoreTS, filtering out any new updates beyond the read point.
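 For reference, the client-side view of the scan described above is simply a Scan submitted to a table; the table, family, and row-key names below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, TableName.valueOf("test_table"));
    try {
      Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row9"));
      scan.addFamily(Bytes.toBytes("cf"));
      scan.setCaching(100); // rows fetched per RPC; tune for your workload
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result result : scanner) {
          System.out.println(Bytes.toString(result.getRow()));
        }
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
  }
}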
-
- StoreFile (HFile) - StoreFiles are where your data lives. - -
HFile Format - The hfile file format is based on - the SSTable file described in the BigTable [2006] paper and on - Hadoop's tfile - (The unit test suite and the compression harness were taken directly from tfile). - Schubert Zhang's blog post on HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs makes for a thorough introduction to HBase's hfile. Matteo Bertozzi has also put up a - helpful description, HBase I/O: HFile. - - For more information, see the HFile source code. - Also see for information about the HFile v2 format that was included in 0.92. - -
-
- HFile Tool
 To view a textualized version of HFile content, you can use the org.apache.hadoop.hbase.io.hfile.HFile tool. Type the following to see usage:
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
 For example, to view the content of the file hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475, type the following:
$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -f hdfs://10.81.47.41:8020/hbase/TEST/1418428042/DSMP/4759508618286845475
 If you leave off the -v option, you see just a summary of the HFile. See the usage output for other things you can do with the HFile tool.
-
- StoreFile Directory Structure on HDFS - For more information of what StoreFiles look like on HDFS with respect to the directory structure, see . - -
-
- -
- Blocks - StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. - - Compression happens at the block level within StoreFiles. For more information on compression, see . - - For more information on blocks, see the HFileBlock source code. - -
-
- KeyValue
 The KeyValue class is the heart of data storage in HBase. A KeyValue wraps a byte array, plus an offset and length into that array which indicate where to start interpreting the content as a KeyValue.
 The KeyValue format inside a byte array is:
 - keylength
 - valuelength
 - key
 - value
 The Key is further decomposed as:
 - rowlength
 - row (i.e., the rowkey)
 - columnfamilylength
 - columnfamily
 - columnqualifier
 - timestamp
 - keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)
 KeyValue instances are not split across blocks. For example, if there is an 8 MB KeyValue, even if the block size is 64 KB this KeyValue will be read in as a coherent block. For more information, see the KeyValue source code.
Example - To emphasize the points above, examine what happens with two Puts for two different columns for the same row: - - Put #1: rowkey=row1, cf:attr1=value1 - Put #2: rowkey=row1, cf:attr2=value2 - - Even though these are for the same row, a KeyValue is created for each column: - Key portion for Put #1: - - rowlength ------------> 4 - row -----------------> row1 - columnfamilylength ---> 2 - columnfamily --------> cf - columnqualifier ------> attr1 - timestamp -----------> server time of Put - keytype -------------> Put - - - Key portion for Put #2: - - rowlength ------------> 4 - row -----------------> row1 - columnfamilylength ---> 2 - columnfamily --------> cf - columnqualifier ------> attr2 - timestamp -----------> server time of Put - keytype -------------> Put - - - - It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within - the KeyValue instance. The longer these identifiers are, the bigger the KeyValue is. -
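 The following small sketch constructs a KeyValue equivalent to Put #1 above and prints its key and value lengths, to make the size impact of the embedded identifiers concrete. (Client code normally builds Puts rather than KeyValues directly; this is for illustration only.)

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueSizeExample {
  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Mirror Put #1 from the example above: row1, cf:attr1=value1.
    KeyValue kv = new KeyValue(
        Bytes.toBytes("row1"),
        Bytes.toBytes("cf"),
        Bytes.toBytes("attr1"),
        now,
        KeyValue.Type.Put,
        Bytes.toBytes("value1"));
    // The rowkey, family, and qualifier are all embedded in the KeyValue,
    // so longer identifiers directly increase the stored size.
    System.out.println("key length   = " + kv.getKeyLength());
    System.out.println("value length = " + kv.getValueLength());
    System.out.println("total length = " + kv.getLength());
  }
}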
- -
-
- Compaction - - Ambiguous Terminology - A StoreFile is a facade of HFile. In terms of compaction, use of - StoreFile seems to have prevailed in the past. - A Store is the same thing as a ColumnFamily. - StoreFiles are related to a Store, or ColumnFamily. - - If you want to read more about StoreFiles versus HFiles and Stores versus - ColumnFamilies, see HBASE-11316. - - - When the MemStore reaches a given size - (hbase.hregion.memstore.flush.size), it flushes its contents to a - StoreFile. The number of StoreFiles in a Store increases over time. - Compaction is an operation which reduces the number of - StoreFiles in a Store, by merging them together, in order to increase performance on - read operations. Compactions can be resource-intensive to perform, and can either help - or hinder performance depending on many factors. - Compactions fall into two categories: minor and major. Minor and major compactions - differ in the following ways. - Minor compactions usually select a small number of small, - adjacent StoreFiles and rewrite them as a single StoreFile. Minor compactions do not - drop (filter out) deletes or expired versions, because of potential side effects. See and for information on how deletes and versions are - handled in relation to compactions. The end result of a minor compaction is fewer, - larger StoreFiles for a given Store. - The end result of a major compaction is a single StoreFile - per Store. Major compactions also process delete markers and max versions. See and for information on how deletes and versions are - handled in relation to compactions. - - - Compaction and Deletions - When an explicit deletion occurs in HBase, the data is not actually deleted. - Instead, a tombstone marker is written. The tombstone marker - prevents the data from being returned with queries. During a major compaction, the - data is actually deleted, and the tombstone marker is removed from the StoreFile. If - the deletion happens because of an expired TTL, no tombstone is created. Instead, the - expired data is filtered out and is not written back to the compacted - StoreFile. - - - - Compaction and Versions - When you create a Column Family, you can specify the maximum number of versions - to keep, by specifying HColumnDescriptor.setMaxVersions(int - versions). The default value is 3. If more versions - than the specified maximum exist, the excess versions are filtered out and not written - back to the compacted StoreFile. - - - - Major Compactions Can Impact Query Results - In some situations, older versions can be inadvertently resurrected if a newer - version is explicitly deleted. See for a more in-depth explanation. - This situation is only possible before the compaction finishes. - - - In theory, major compactions improve performance. However, on a highly loaded - system, major compactions can require an inappropriate number of resources and adversely - affect performance. In a default configuration, major compactions are scheduled - automatically to run once in a 7-day period. This is sometimes inappropriate for systems - in production. You can manage major compactions manually. See . - Compactions do not perform region merges. See for more information on region merging. -
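 For example, a manual major compaction can be requested through the client API as in the sketch below (the table and column family names are placeholders); this is equivalent to running major_compact 'test_table' in the HBase shell, and the request is asynchronous:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ManualMajorCompaction {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Request (asynchronously) a major compaction of every region of the table.
      admin.majorCompact("test_table");
      // A single column family can also be targeted:
      // admin.majorCompact("test_table", "cf");
    } finally {
      admin.close();
    }
  }
}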
- Compaction Policy - HBase 0.96.x and newer - Compacting large StoreFiles, or too many StoreFiles at once, can cause more IO - load than your cluster is able to handle without causing performance problems. The - method by which HBase selects which StoreFiles to include in a compaction (and whether - the compaction is a minor or major compaction) is called the compaction - policy. - Prior to HBase 0.96.x, there was only one compaction policy. That original - compaction policy is still available as - RatioBasedCompactionPolicy The new compaction default - policy, called ExploringCompactionPolicy, was subsequently - backported to HBase 0.94 and HBase 0.95, and is the default in HBase 0.96 and newer. - It was implemented in HBASE-7842. In - short, ExploringCompactionPolicy attempts to select the best - possible set of StoreFiles to compact with the least amount of work, while the - RatioBasedCompactionPolicy selects the first set that meets - the criteria. - Regardless of the compaction policy used, file selection is controlled by several - configurable parameters and happens in a multi-step approach. These parameters will be - explained in context, and then will be given in a table which shows their - descriptions, defaults, and implications of changing them. - -
- Being Stuck
 When the MemStore gets too large, it needs to flush its contents to a StoreFile. However, a Store can only have hbase.hstore.blockingStoreFiles files, so once that limit is reached the MemStore must wait for the number of StoreFiles to be reduced by one or more compactions before it can flush, even if it has already grown past hbase.hregion.memstore.flush.size. If the MemStore is too large and the number of StoreFiles is also too high, the algorithm is said to be "stuck". The compaction algorithm checks for this "stuck" situation and provides mechanisms to alleviate it.
- -
- The ExploringCompactionPolicy Algorithm - The ExploringCompactionPolicy algorithm considers each possible set of - adjacent StoreFiles before choosing the set where compaction will have the most - benefit. - One situation where the ExploringCompactionPolicy works especially well is when - you are bulk-loading data and the bulk loads create larger StoreFiles than the - StoreFiles which are holding data older than the bulk-loaded data. This can "trick" - HBase into choosing to perform a major compaction each time a compaction is needed, - and cause a lot of extra overhead. With the ExploringCompactionPolicy, major - compactions happen much less frequently because minor compactions are more - efficient. - In general, ExploringCompactionPolicy is the right choice for most situations, - and thus is the default compaction policy. You can also use - ExploringCompactionPolicy along with . - The logic of this policy can be examined in - hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.java. - The following is a walk-through of the logic of the - ExploringCompactionPolicy. - - - Make a list of all existing StoreFiles in the Store. The rest of the - algorithm filters this list to come up with the subset of HFiles which will be - chosen for compaction. - - - If this was a user-requested compaction, attempt to perform the requested - compaction type, regardless of what would normally be chosen. Note that even if - the user requests a major compaction, it may not be possible to perform a major - compaction. This may be because not all StoreFiles in the Column Family are - available to compact or because there are too many Stores in the Column - Family. - - - Some StoreFiles are automatically excluded from consideration. These - include: - - - StoreFiles that are larger than - hbase.hstore.compaction.max.size - - - StoreFiles that were created by a bulk-load operation which explicitly - excluded compaction. You may decide to exclude StoreFiles resulting from - bulk loads, from compaction. To do this, specify the - hbase.mapreduce.hfileoutputformat.compaction.exclude - parameter during the bulk load operation. - - - - - Iterate through the list from step 1, and make a list of all potential sets - of StoreFiles to compact together. A potential set is a grouping of - hbase.hstore.compaction.min contiguous StoreFiles in the - list. For each set, perform some sanity-checking and figure out whether this is - the best compaction that could be done: - - - If the number of StoreFiles in this set (not the size of the StoreFiles) - is fewer than hbase.hstore.compaction.min or more than - hbase.hstore.compaction.max, take it out of - consideration. - - - Compare the size of this set of StoreFiles with the size of the smallest - possible compaction that has been found in the list so far. If the size of - this set of StoreFiles represents the smallest compaction that could be - done, store it to be used as a fall-back if the algorithm is "stuck" and no - StoreFiles would otherwise be chosen. See . - - - Do size-based sanity checks against each StoreFile in this set of - StoreFiles. - - - If the size of this StoreFile is larger than - hbase.hstore.compaction.max.size, take it out of - consideration. - - - If the size is greater than or equal to - hbase.hstore.compaction.min.size, sanity-check it - against the file-based ratio to see whether it is too large to be - considered. 
The sanity-checking is successful if: - - - There is only one StoreFile in this set, or - - - For each StoreFile, its size multiplied by - hbase.hstore.compaction.ratio (or - hbase.hstore.compaction.ratio.offpeak if - off-peak hours are configured and it is during off-peak hours) is - less than the sum of the sizes of the other HFiles in the - set. - - - - - - - - - If this set of StoreFiles is still in consideration, compare it to the - previously-selected best compaction. If it is better, replace the - previously-selected best compaction with this one. - - - When the entire list of potential compactions has been processed, perform - the best compaction that was found. If no StoreFiles were selected for - compaction, but there are multiple StoreFiles, assume the algorithm is stuck - (see ) and if so, perform the smallest - compaction that was found in step 3. - - -
- -
- RatioBasedCompactionPolicy Algorithm - The RatioBasedCompactionPolicy was the only compaction policy prior to HBase - 0.96, though ExploringCompactionPolicy has now been backported to HBase 0.94 and - 0.95. To use the RatioBasedCompactionPolicy rather than the - ExploringCompactionPolicy, set - hbase.hstore.defaultengine.compactionpolicy.class to - RatioBasedCompactionPolicy in the - hbase-site.xml file. To switch back to the - ExploringCompactionPolicy, remove the setting from the - hbase-site.xml. - The following section walks you through the algorithm used to select StoreFiles - for compaction in the RatioBasedCompactionPolicy. - - - The first phase is to create a list of all candidates for compaction. A list - is created of all StoreFiles not already in the compaction queue, and all - StoreFiles newer than the newest file that is currently being compacted. This - list of StoreFiles is ordered by the sequence ID. The sequence ID is generated - when a Put is appended to the write-ahead log (WAL), and is stored in the - metadata of the HFile. - - - Check to see if the algorithm is stuck (see , and if so, a major compaction is forced. - This is a key area where is often a better choice than the - RatioBasedCompactionPolicy. - - - If the compaction was user-requested, try to perform the type of compaction - that was requested. Note that a major compaction may not be possible if all - HFiles are not available for compaction or if too may StoreFiles exist (more - than hbase.hstore.compaction.max). - - - Some StoreFiles are automatically excluded from consideration. These - include: - - - StoreFiles that are larger than - hbase.hstore.compaction.max.size - - - StoreFiles that were created by a bulk-load operation which explicitly - excluded compaction. You may decide to exclude StoreFiles resulting from - bulk loads, from compaction. To do this, specify the - hbase.mapreduce.hfileoutputformat.compaction.exclude - parameter during the bulk load operation. - - - - - The maximum number of StoreFiles allowed in a major compaction is controlled - by the hbase.hstore.compaction.max parameter. If the list - contains more than this number of StoreFiles, a minor compaction is performed - even if a major compaction would otherwise have been done. However, a - user-requested major compaction still occurs even if there are more than - hbase.hstore.compaction.max StoreFiles to compact. - - - If the list contains fewer than - hbase.hstore.compaction.min StoreFiles to compact, a minor - compaction is aborted. Note that a major compaction can be performed on a single - HFile. Its function is to remove deletes and expired versions, and reset - locality on the StoreFile. - - - The value of the hbase.hstore.compaction.ratio parameter - is multiplied by the sum of StoreFiles smaller than a given file, to determine - whether that StoreFile is selected for compaction during a minor compaction. For - instance, if hbase.hstore.compaction.ratio is 1.2, FileX is 5 mb, FileY is 2 mb, - and FileZ is 3 mb: - 5 <= 1.2 x (2 + 3) or 5 <= 6 - In this scenario, FileX is eligible for minor compaction. If FileX were 7 - mb, it would not be eligible for minor compaction. This ratio favors smaller - StoreFile. You can configure a different ratio for use in off-peak hours, using - the parameter hbase.hstore.compaction.ratio.offpeak, if you - also configure hbase.offpeak.start.hour and - hbase.offpeak.end.hour. 
- - - - If the last major compaction was too long ago and there is more than one - StoreFile to be compacted, a major compaction is run, even if it would otherwise - have been minor. By default, the maximum time between major compactions is 7 - days, plus or minus a 4.8 hour period, and determined randomly within those - parameters. Prior to HBase 0.96, the major compaction period was 24 hours. See - hbase.hregion.majorcompaction in the table below to tune or - disable time-based major compactions. - - -
- -
- - Parameters Used by Compaction Algorithm - This table contains the main configuration parameters for compaction. This list - is not exhaustive. To tune these parameters from the defaults, edit the - hbase-default.xml file. For a full list of all configuration - parameters available, see - - -
- - Parameter - Description - Default - - - - - hbase.hstore.compaction.min - The minimum number of StoreFiles which must be eligible for - compaction before compaction can run. - The goal of tuning hbase.hstore.compaction.min - is to avoid ending up with too many tiny StoreFiles to compact. Setting - this value to 2 would cause a minor compaction each - time you have two StoreFiles in a Store, and this is probably not - appropriate. If you set this value too high, all the other values will - need to be adjusted accordingly. For most cases, the default value is - appropriate. - In previous versions of HBase, the parameter - hbase.hstore.compaction.min was called - hbase.hstore.compactionThreshold. - - 3 - - - hbase.hstore.compaction.max - The maximum number of StoreFiles which will be selected for a - single minor compaction, regardless of the number of eligible - StoreFiles. - Effectively, the value of - hbase.hstore.compaction.max controls the length of - time it takes a single compaction to complete. Setting it larger means - that more StoreFiles are included in a compaction. For most cases, the - default value is appropriate. - - 10 - - - hbase.hstore.compaction.min.size - A StoreFile smaller than this size will always be eligible for - minor compaction. StoreFiles this size or larger are evaluated by - hbase.hstore.compaction.ratio to determine if they are - eligible. - Because this limit represents the "automatic include" limit for - all StoreFiles smaller than this value, this value may need to be reduced - in write-heavy environments where many files in the 1-2 MB range are being - flushed, because every StoreFile will be targeted for compaction and the - resulting StoreFiles may still be under the minimum size and require - further compaction. - If this parameter is lowered, the ratio check is triggered more - quickly. This addressed some issues seen in earlier versions of HBase but - changing this parameter is no longer necessary in most situations. - - 128 MB - - - hbase.hstore.compaction.max.size - An StoreFile larger than this size will be excluded from - compaction. The effect of raising - hbase.hstore.compaction.max.size is fewer, larger - StoreFiles that do not get compacted often. If you feel that compaction is - happening too often without much benefit, you can try raising this - value. - Long.MAX_VALUE - - - hbase.hstore.compaction.ratio - For minor compaction, this ratio is used to determine whether a - given StoreFile which is larger than - hbase.hstore.compaction.min.size is eligible for - compaction. Its effect is to limit compaction of large StoreFile. The - value of hbase.hstore.compaction.ratio is expressed as - a floating-point decimal. - A large ratio, such as 10, will produce a - single giant StoreFile. Conversely, a value of .25, - will produce behavior similar to the BigTable compaction algorithm, - producing four StoreFiles. - A moderate value of between 1.0 and 1.4 is recommended. When - tuning this value, you are balancing write costs with read costs. Raising - the value (to something like 1.4) will have more write costs, because you - will compact larger StoreFiles. However, during reads, HBase will need to seek - through fewer StpreFo;es to accomplish the read. Consider this approach if you - cannot take advantage of . - Alternatively, you can lower this value to something like 1.0 to - reduce the background cost of writes, and use to limit the number of StoreFiles touched - during reads. - For most cases, the default value is appropriate. 
- - 1.2F - - - hbase.hstore.compaction.ratio.offpeak - The compaction ratio used during off-peak compactions, if off-peak - hours are also configured (see below). Expressed as a floating-point - decimal. This allows for more aggressive (or less aggressive, if you set it - lower than hbase.hstore.compaction.ratio) compaction - during a set time period. Ignored if off-peak is disabled (default). This - works the same as hbase.hstore.compaction.ratio. - 5.0F - - - hbase.offpeak.start.hour - The start of off-peak hours, expressed as an integer between 0 and 23, - inclusive. Set to -1 to disable off-peak. - -1 (disabled) - - - hbase.offpeak.end.hour - The end of off-peak hours, expressed as an integer between 0 and 23, - inclusive. Set to -1 to disable off-peak. - -1 (disabled) - - - hbase.regionserver.thread.compaction.throttle - There are two different thread pools for compactions, one for - large compactions and the other for small compactions. This helps to keep - compaction of lean tables (such as hbase:meta) - fast. If a compaction is larger than this threshold, it goes into the - large compaction pool. In most cases, the default value is - appropriate. - 2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size - (which defaults to 128) - - - hbase.hregion.majorcompaction - Time between major compactions, expressed in milliseconds. Set to - 0 to disable time-based automatic major compactions. User-requested and - size-based major compactions will still run. This value is multiplied by - hbase.hregion.majorcompaction.jitter to cause - compaction to start at a somewhat-random time during a given window of - time. - 7 days (604800000 milliseconds) - - - hbase.hregion.majorcompaction.jitter - A multiplier applied to - hbase.hregion.majorcompaction to cause compaction to - occur a given amount of time either side of - hbase.hregion.majorcompaction. The smaller the - number, the closer the compactions will happen to the - hbase.hregion.majorcompaction interval. Expressed as - a floating-point decimal. - .50F - - - - - - -
- Compaction File Selection
 Legacy Information
 This section has been preserved for historical reasons and refers to the way compaction worked prior to HBase 0.96.x. You can still use this behavior if you enable the RatioBasedCompactionPolicy. For information on the way that compactions work in HBase 0.96.x and later, see the compaction policy discussion above.
 To understand the core algorithm for StoreFile selection, there is some ASCII-art in the Store source code that serves as a useful reference. It has been copied below:

/* normal skew:
 *
 *         older ----> newer
 *     _
 *    | |   _
 *    | |  | |   _
 *  --|-|- |-|- |-|---_-------_-------  minCompactSize
 *    | |  | |  | |  | |  _  | |
 *    | |  | |  | |  | | | | | |
 *    | |  | |  | |  | | | | | |
 */

 Important knobs:
 - hbase.hstore.compaction.ratio: Ratio used in the compaction file selection algorithm (default 1.2f).
 - hbase.hstore.compaction.min (in HBase 0.90 this was hbase.hstore.compactionThreshold) (files): Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).
 - hbase.hstore.compaction.max (files): Maximum number of StoreFiles to compact per minor compaction (default 10).
 - hbase.hstore.compaction.min.size (bytes): Any StoreFile smaller than this setting will automatically be a candidate for compaction. Defaults to hbase.hregion.memstore.flush.size (128 MB).
 - hbase.hstore.compaction.max.size (introduced in 0.92) (bytes): Any StoreFile larger than this setting will automatically be excluded from compaction (default Long.MAX_VALUE).
 The minor compaction StoreFile selection logic is size based, and selects a file for compaction when file <= sum(smaller_files) * hbase.hstore.compaction.ratio.
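 The following simplified sketch implements only the size-based rule above, ignoring the file-count and min/max size limits that the real selection logic also applies; it reproduces the arithmetic of Example #1 below:

import java.util.Arrays;
import java.util.List;

public class RatioCheckSketch {
  // Simplified illustration of the legacy size-based rule: a file is a candidate
  // when fileSize <= sum(sizes of the newer files) * ratio. This ignores the
  // min/max file-count and min/max size limits that the real selection also applies.
  static boolean eligible(long fileSize, List<Long> newerFiles, double ratio) {
    long sum = 0;
    for (long size : newerFiles) {
      sum += size;
    }
    return fileSize <= sum * ratio;
  }

  public static void main(String[] args) {
    // Mirrors Example #1 below: files of 100, 50, 23, 12, 12 bytes, ratio 1.0.
    System.out.println(eligible(100, Arrays.asList(50L, 23L, 12L, 12L), 1.0)); // false
    System.out.println(eligible(50, Arrays.asList(23L, 12L, 12L), 1.0));       // false
    System.out.println(eligible(23, Arrays.asList(12L, 12L), 1.0));            // true
  }
}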
- Minor Compaction File Selection - Example #1 (Basic Example)
 This example mirrors an example from the unit test TestCompactSelection.
 - hbase.hstore.compaction.ratio = 1.0f
 - hbase.hstore.compaction.min = 3 (files)
 - hbase.hstore.compaction.max = 5 (files)
 - hbase.hstore.compaction.min.size = 10 (bytes)
 - hbase.hstore.compaction.max.size = 1000 (bytes)
 The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.
 Why?
 - 100 --> No, because sum(50, 23, 12, 12) * 1.0 = 97.
 - 50 --> No, because sum(23, 12, 12) * 1.0 = 47.
 - 23 --> Yes, because sum(12, 12) * 1.0 = 24.
 - 12 --> Yes, because the previous file has been included, and because this does not exceed the max-file limit of 5.
 - 12 --> Yes, because the previous file had been included, and because this does not exceed the max-file limit of 5.
-
- Minor Compaction File Selection - Example #2 (Not Enough Files To - Compact) - This example mirrors an example from the unit test - TestCompactSelection. - - hbase.hstore.compaction.ratio = 1.0f - - - hbase.hstore.compaction.min = 3 (files) - - - hbase.hstore.compaction.max = 5 (files) - - - hbase.hstore.compaction.min.size = 10 (bytes) - - - hbase.hstore.compaction.max.size = 1000 (bytes) - - - - The following StoreFiles exist: 100, 25, 12, and 12 bytes apiece (oldest to - newest). With the above parameters, no compaction will be started. - Why? - - 100 --> No, because sum(25, 12, 12) * 1.0 = 47 - - - 25 --> No, because sum(12, 12) * 1.0 = 24 - - - 12 --> No. Candidate because sum(12) * 1.0 = 12, there are only 2 files - to compact and that is less than the threshold of 3 - - - 12 --> No. Candidate because the previous StoreFile was, but there are - not enough files to compact - - - -
-
- Minor Compaction File Selection - Example #3 (Limiting Files To Compact) - This example mirrors an example from the unit test - TestCompactSelection. - - hbase.hstore.compaction.ratio = 1.0f - - - hbase.hstore.compaction.min = 3 (files) - - - hbase.hstore.compaction.max = 5 (files) - - - hbase.hstore.compaction.min.size = 10 (bytes) - - - hbase.hstore.compaction.max.size = 1000 (bytes) - - The following StoreFiles exist: 7, 6, 5, 4, 3, 2, and 1 bytes apiece - (oldest to newest). With the above parameters, the files that would be selected for - minor compaction are 7, 6, 5, 4, 3. - Why? - - 7 --> Yes, because sum(6, 5, 4, 3, 2, 1) * 1.0 = 21. Also, 7 is less than - the min-size - - - 6 --> Yes, because sum(5, 4, 3, 2, 1) * 1.0 = 15. Also, 6 is less than - the min-size. - - - 5 --> Yes, because sum(4, 3, 2, 1) * 1.0 = 10. Also, 5 is less than the - min-size. - - - 4 --> Yes, because sum(3, 2, 1) * 1.0 = 6. Also, 4 is less than the - min-size. - - - 3 --> Yes, because sum(2, 1) * 1.0 = 3. Also, 3 is less than the - min-size. - - - 2 --> No. Candidate because previous file was selected and 2 is less than - the min-size, but the max-number of files to compact has been reached. - - - 1 --> No. Candidate because previous file was selected and 1 is less than - the min-size, but max-number of files to compact has been reached. - - - -
- Impact of Key Configuration Options - - This information is now included in the configuration parameter table in . - -
-
-
-
- Experimental: Stripe Compactions
 Stripe compaction is an experimental feature added in HBase 0.98 which aims to improve compactions for large regions or non-uniformly distributed row keys. In order to achieve smaller and/or more granular compactions, the StoreFiles within a region are maintained separately for several row-key sub-ranges, or "stripes", of the region. The stripes are transparent to the rest of HBase, so other operations on the HFiles or data work without modification.
 Stripe compaction changes the HFile layout, creating sub-regions within regions. These sub-regions are easier to compact, and should result in fewer major compactions. This approach alleviates some of the challenges of larger regions.
 Stripe compaction is fully compatible with, and works in conjunction with, either the ExploringCompactionPolicy or the RatioBasedCompactionPolicy. It can be enabled for existing tables, and the table will continue to operate normally if it is disabled later.
-
- When To Use Stripe Compactions - Consider using stripe compaction if you have either of the following: - - - Large regions. You can get the positive effects of smaller regions without - additional overhead for MemStore and region management overhead. - - - Non-uniform keys, such as time dimension in a key. Only the stripes receiving - the new keys will need to compact. Old data will not compact as often, if at - all - - - - Performance Improvements - Performance testing has shown that the performance of reads improves somewhat, - and variability of performance of reads and writes is greatly reduced. An overall - long-term performance improvement is seen on large non-uniform-row key regions, such - as a hash-prefixed timestamp key. These performance gains are the most dramatic on a - table which is already large. It is possible that the performance improvement might - extend to region splits. - -
- Enabling Stripe Compaction - You can enable stripe compaction for a table or a column family, by setting its - hbase.hstore.engine.class to - org.apache.hadoop.hbase.regionserver.StripeStoreEngine. You - also need to set the hbase.hstore.blockingStoreFiles to a high - number, such as 100 (rather than the default value of 10). - - Enable Stripe Compaction - - If the table already exists, disable the table. - - - Run one of following commands in the HBase shell. Replace the table name - orders_table with the name of your table. - -alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'} -alter 'orders_table', {NAME => 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}} -create 'orders_table', 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'} - - - - Configure other options if needed. See for more information. - - - Enable the table. - - - - - Disable Stripe Compaction - - Disable the table. - - - Set the hbase.hstore.engine.class option to either nil or - org.apache.hadoop.hbase.regionserver.DefaultStoreEngine. - Either option has the same effect. - -alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => ''} - - - - Enable the table. - - - When you enable a large table after changing the store engine either way, a - major compaction will likely be performed on most regions. This is not necessary on - new tables. -
-
- Configuring Stripe Compaction - Each of the settings for stripe compaction should be configured at the table or - column family, after disabling the table. If you use HBase shell, the general - command pattern is as follows: - - -alter 'orders_table', CONFIGURATION => {'key' => 'value', ..., 'key' => 'value'}} - -
- Region and stripe sizing
 You can configure your stripe sizing based upon your region sizing. By default, new regions start with one stripe. On the next compaction after the stripe has grown too large (16 times the MemStore flush size), it is split into two stripes. Stripe splitting continues as the region grows, until the region is large enough to split.
 You can improve this pattern for your own data. A good rule is to aim for a stripe size of at least 1 GB, and about 8-12 stripes for uniform row keys. For example, if your regions are 30 GB, 12 stripes of 2.5 GB each might be a good starting point.
- Stripe Sizing Settings - - - - - - Setting - Notes - - - - - - hbase.store.stripe.initialStripeCount - - - The number of stripes to create when stripe compaction is enabled. - You can use it as follows: - - For relatively uniform row keys, if you know the approximate - target number of stripes from the above, you can avoid some - splitting overhead by starting with several stripes (2, 5, 10...). - If the early data is not representative of overall row key - distribution, this will not be as efficient. - - - For existing tables with a large amount of data, this setting - will effectively pre-split your stripes. - - - For keys such as hash-prefixed sequential keys, with more than - one hash prefix per region, pre-splitting may make sense. - - - - - - - hbase.store.stripe.sizeToSplit - - The maximum size a stripe grows before splitting. Use this in - conjunction with hbase.store.stripe.splitPartCount to - control the target stripe size (sizeToSplit = splitPartsCount * target - stripe size), according to the above sizing considerations. - - - - hbase.store.stripe.splitPartCount - - The number of new stripes to create when splitting a stripe. The - default is 2, which is appropriate for most cases. For non-uniform row - keys, you can experiment with increasing the number to 3 or 4, to isolate - the arriving updates into narrower slice of the region without additional - splits being required. - - - -
-
-
- MemStore Size Settings - By default, the flush creates several files from one MemStore, according to - existing stripe boundaries and row keys to flush. This approach minimizes write - amplification, but can be undesirable if the MemStore is small and there are many - stripes, because the files will be too small. - In this type of situation, you can set - hbase.store.stripe.compaction.flushToL0 to - true. This will cause a MemStore flush to create a single - file instead. When at least - hbase.store.stripe.compaction.minFilesL0 such files (by - default, 4) accumulate, they will be compacted into striped files. -
-
- Normal Compaction Configuration and Stripe Compaction - All the settings that apply to normal compactions (see ) apply to stripe compactions. - The exceptions are the minimum and maximum number of files, which are set to - higher values by default because the files in stripes are smaller. To control - these for stripe compactions, use - hbase.store.stripe.compaction.minFiles and - hbase.store.stripe.compaction.maxFiles, rather than - hbase.hstore.compaction.min and - hbase.hstore.compaction.max. -
-
- - - - - - - -
Bulk Loading -
Overview - - HBase includes several methods of loading data into tables. - The most straightforward method is to either use the TableOutputFormat - class from a MapReduce job, or use the normal client APIs; however, - these are not always the most efficient methods. - - - The bulk load feature uses a MapReduce job to output table data in HBase's internal - data format, and then directly loads the generated StoreFiles into a running - cluster. Using bulk load will use less CPU and network resources than - simply using the HBase API. - -
-
Bulk Load Limitations
 Because bulk loading bypasses the write path, the WAL does not get written to as part of the process. Replication works by reading the WAL files, so it will not see the bulk-loaded data, and the same goes for edits that skip the WAL (for example, Puts configured with Durability.SKIP_WAL, formerly setWriteToWAL(false)). One way to handle this is to ship the raw files or the HFiles to the other cluster and do the other processing there.
-
Bulk Load Architecture - - The HBase bulk load process consists of two main steps. - -
Preparing data via a MapReduce job - - The first step of a bulk load is to generate HBase data files (StoreFiles) from - a MapReduce job using HFileOutputFormat. This output format writes - out data in HBase's internal storage format so that they can be - later loaded very efficiently into the cluster. - - - In order to function efficiently, HFileOutputFormat must be - configured such that each output HFile fits within a single region. - In order to do this, jobs whose output will be bulk loaded into HBase - use Hadoop's TotalOrderPartitioner class to partition the map output - into disjoint ranges of the key space, corresponding to the key - ranges of the regions in the table. - - - HFileOutputFormat includes a convenience function, - configureIncrementalLoad(), which automatically sets up - a TotalOrderPartitioner based on the current region boundaries of a - table. - -
-
Completing the data load - - After the data has been prepared using - HFileOutputFormat, it is loaded into the cluster using - completebulkload. This command line tool iterates - through the prepared data files, and for each one determines the - region the file belongs to. It then contacts the appropriate Region - Server which adopts the HFile, moving it into its storage directory - and making the data available to clients. - - - If the region boundaries have changed during the course of bulk load - preparation, or between the preparation and completion steps, the - completebulkloads utility will automatically split the - data files into pieces corresponding to the new boundaries. This - process is not optimally efficient, so users should take care to - minimize the delay between preparing a bulk load and importing it - into the cluster, especially if other clients are simultaneously - loading data through other means. - -
-
-
Importing the prepared data using the completebulkload tool - - After a data import has been prepared, either by using the - importtsv tool with the - "importtsv.bulk.output" option or by some other MapReduce - job using the HFileOutputFormat, the - completebulkload tool is used to import the data into the - running cluster. - - - The completebulkload tool simply takes the output path - where importtsv or your MapReduce job put its results, and - the table name to import into. For example: - - $ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable - - The -c config-file option can be used to specify a file - containing the appropriate hbase parameters (e.g., hbase-site.xml) if - not supplied already on the CLASSPATH (In addition, the CLASSPATH must - contain the directory that has the zookeeper configuration file if - zookeeper is NOT managed by HBase). - - - Note: If the target table does not already exist in HBase, this - tool will create the table automatically. - - This tool will run quickly, after which point the new data will be visible in - the cluster. - -
-
See Also - For more information about the referenced utilities, see and . - - - See How-to: Use HBase Bulk Loading, and Why - for a recent blog on current state of bulk loading. - -
-
Advanced Usage - - Although the importtsv tool is useful in many cases, advanced users may - want to generate data programatically, or import data from other formats. To get - started doing so, dig into ImportTsv.java and check the JavaDoc for - HFileOutputFormat. - - - The import step of the bulk load can also be done programatically. See the - LoadIncrementalHFiles class for more information. - -
-
- -
HDFS - As HBase runs on HDFS (and each StoreFile is written as a file on HDFS), - it is important to have an understanding of the HDFS Architecture - especially in terms of how it stores files, handles failovers, and replicates blocks. - - See the Hadoop documentation on HDFS Architecture - for more information. - -
NameNode - The NameNode is responsible for maintaining the filesystem metadata. See the above HDFS Architecture link - for more information. - -
-
DataNode - The DataNodes are responsible for storing HDFS blocks. See the above HDFS Architecture link - for more information. - -
-
- -
- Timeline-consistent High Available Reads -
- Introduction - - Architecturally, HBase has had a strong consistency guarantee from the start. All reads and writes are routed through a single region server, which guarantees that all writes happen in order, and all reads see the most recent committed data. - - However, because of this single homing of reads to a single location, if the server becomes unavailable, the regions of the table that were hosted in that region server become unavailable for some time. There are three phases in the region recovery process: detection, assignment, and recovery. Of these, detection is usually the longest, presently on the order of 20-30 seconds depending on the ZooKeeper session timeout. During this time, and before the recovery is complete, clients will not be able to read the region data. - - However, for some use cases, either the data may be read-only, or reads against somewhat stale data are acceptable. With timeline-consistent highly available reads, HBase can be used for these kinds of latency-sensitive use cases where the application can expect a time bound on read completion. - - To achieve high availability for reads, HBase provides a feature called “region replication”. In this model, for each region of a table, there are multiple replicas that are opened in different region servers. By default, the region replication is set to 1, so only a single region replica is deployed and there is no change from the original model. If region replication is set to 2 or more, then the master will assign replicas of the regions of the table. The Load Balancer ensures that the region replicas are not co-hosted in the same region servers, and not in the same rack either (if possible). - - All of the replicas for a single region have a unique replica_id, starting from 0. The region replica having replica_id==0 is called the primary region, and the others “secondary regions” or secondaries. Only the primary can accept writes from the client, and the primary always contains the latest changes. Since all writes still have to go through the primary region, the writes are not highly available (meaning they might block for some time if the region becomes unavailable). - - The writes are asynchronously sent to the secondary region replicas using an “Async WAL replication” feature. This works similarly to HBase’s multi-datacenter replication, but instead the data from a region is replicated to the secondary regions. Each secondary replica always receives and observes the writes in the same order that the primary region committed them. This ensures that the secondaries won’t diverge from the primary region's data, but since the log replication is async, the data might be stale in secondary regions. In some sense, this design can be thought of as “in-cluster replication”, where instead of replicating to a different datacenter, the data goes to a secondary region to keep the secondary regions' in-memory state up to date. The data files are shared between the primary region and the other replicas, so that there is no extra storage overhead. However, the secondary regions will have recent non-flushed data in their memstores, which increases the memory overhead. - - The Async WAL replication feature is being implemented in Phase 2 of issue HBASE-10070. Before this, region replicas will only be updated with flushed data files from the primary (see hbase.regionserver.storefile.refresh.period below).
It is also possible to use this feature for read-only tables without setting storefile.refresh.period. - -
-
- Timeline Consistency - - With this feature, HBase introduces a Consistency definition, which can be provided per read operation (get or scan). - -public enum Consistency { - STRONG, - TIMELINE -} - - Consistency.STRONG is the default consistency model provided by HBase. If the table has region replication = 1, or if reads in a table with region replicas are done with this consistency, the read is always performed by the primary regions, so there is no change from the previous behaviour, and the client always observes the latest data. - - If a read is performed with Consistency.TIMELINE, then the read RPC is sent to the primary region server first. After a short interval (hbase.client.primaryCallTimeout.get, 10ms by default), parallel RPCs to secondary region replicas are also sent if the primary does not respond back. After this, the result is returned from whichever RPC finishes first. If the response came back from the primary region replica, we always know that the data is the latest. For this purpose, the Result.isStale() API has been added to inspect the staleness. If the result is from a secondary region, Result.isStale() will be set to true. The user can then inspect this field to reason about the data. - - In terms of semantics, TIMELINE consistency as implemented by HBase differs from pure eventual consistency in these respects: - - - Single homed and ordered updates: Region replication or not, on the write side, there is still only 1 defined replica (primary) which can accept writes. This replica is responsible for ordering the edits and preventing conflicts. This guarantees that two different writes are not committed at the same time by different replicas, causing the data to diverge. With this, there is no need to do read-repair or last-timestamp-wins kinds of conflict resolution. - - - The secondaries also apply the edits in the order that the primary committed them. This way the secondaries will contain a snapshot of the primary's data at any point in time. This is similar to RDBMS replication and even HBase’s own multi-datacenter replication, but within a single cluster. - - - On the read side, the client can detect whether the read is coming from up-to-date data or is stale data. Also, the client can issue reads with different consistency requirements on a per-operation basis to ensure its own semantic guarantees. - - - The client can still observe edits out-of-order, and can go back in time, if it observes reads from one secondary replica first, then another secondary replica. There is no stickiness to region replicas or a transaction-id based guarantee. If required, this could be implemented later. - - -
- Timeline Consistency - - - - - - Timeline Consistency - - -
- - To better understand the TIMELINE semantics, let's look at the above diagram. Let's say that there are two clients, and the first one writes x=1, then x=2 and x=3 later. As above, all writes are handled by the primary region replica. The writes are saved in the write ahead log (WAL), and replicated to the other replicas asynchronously. In the above diagram, notice that replica_id=1 received 2 updates, and its data shows that x=2, while replica_id=2 only received a single update, and its data shows that x=1. - - If client1 reads with STRONG consistency, it will only talk to replica_id=0, and thus is guaranteed to observe the latest value of x=3. If a client issues TIMELINE consistency reads, the RPC will go to all replicas (after the primary timeout) and the first response will be returned. Thus the client can see either 1, 2 or 3 as the value of x. Let's say that the primary region has failed and log replication cannot continue for some time. If the client does multiple reads with TIMELINE consistency, it can observe x=2 first, then x=1, and so on. - -
-
- Tradeoffs - Having secondary regions hosted for read availability comes with some tradeoffs which - should be carefully evaluated per use case. Following are advantages and - disadvantages. - - Advantages - - High availability for read-only tables. - - - High availability for stale reads - - - Ability to do very low latency reads with very high percentile (99.9%+) latencies - for stale reads - - - - - Disadvantages - - Double / Triple memstore usage (depending on region replication count) for tables - with region replication > 1 - - - Increased block cache usage - - - Extra network traffic for log replication - - - Extra backup RPCs for replicas - - - To serve the region data from multiple replicas, HBase opens the regions in secondary - mode in the region servers. The regions opened in secondary mode will share the same data - files with the primary region replica, however each secondary region replica will have its - own memstore to keep the unflushed data (only primary region can do flushes). Also to - serve reads from secondary regions, the blocks of data files may be also cached in the - block caches for the secondary regions. -
-
- Configuration properties - - To use highly available reads, you should set the following properties in the hbase-site.xml file. There is no specific configuration to enable or disable region replicas. Instead, you increase or decrease the number of region replicas per table at table creation time or with an alter table operation. - -
- Server side properties - - hbase.regionserver.storefile.refresh.period - 0 - - The period (in milliseconds) for refreshing the store files for the secondary regions. 0 means this feature is disabled. Secondary regions see new files (from flushes and compactions) from the primary once the secondary region refreshes the list of files in the region. However, too-frequent refreshes might cause extra NameNode pressure. If the files cannot be refreshed for longer than the HFile TTL (hbase.master.hfilecleaner.ttl), the requests are rejected. Configuring the HFile TTL to a larger value is also recommended with this setting. - - -]]> - - Also keep in mind that the region replica placement policy is only enforced by the StochasticLoadBalancer, which is the default balancer. If you are using a custom load balancer (the hbase.master.loadbalancer.class property in hbase-site.xml), replicas of regions might end up being hosted on the same server. -
-
- Client side properties - Be sure to set the following for all clients (and servers) that will use region replicas. - - hbase.ipc.client.allowsInterrupt - true - - Whether to enable interruption of RPC threads at the client side. This is required for region replicas with fallback RPCs to secondary regions. - - - - hbase.client.primaryCallTimeout.get - 10000 - - The timeout (in microseconds) before secondary fallback RPCs are submitted for get requests with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 10ms. Setting this lower will increase the number of RPCs, but will lower the p99 latencies. - - - - hbase.client.primaryCallTimeout.multiget - 10000 - - The timeout (in microseconds) before secondary fallback RPCs are submitted for multi-get requests (HTable.get(List)) with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 10ms. Setting this lower will increase the number of RPCs, but will lower the p99 latencies. - - - - hbase.client.replicaCallTimeout.scan - 1000000 - - The timeout (in microseconds) before secondary fallback RPCs are submitted for scan requests with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 1 sec. Setting this lower will increase the number of RPCs, but will lower the p99 latencies. - - -]]> -
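The same client-side properties can also be set programmatically on the client's Configuration before any tables are opened. A minimal sketch, using the property names and default values listed above (the table name t1 matches the examples that follow):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ReplicaClientConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.ipc.client.allowsInterrupt", true);
    conf.setInt("hbase.client.primaryCallTimeout.get", 10000);      // 10 ms, in microseconds
    conf.setInt("hbase.client.primaryCallTimeout.multiget", 10000); // 10 ms, in microseconds
    conf.setInt("hbase.client.replicaCallTimeout.scan", 1000000);   // 1 s, in microseconds
    HTable table = new HTable(conf, "t1");
    // ... issue TIMELINE reads against 'table' ...
    table.close();
  }
}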
-
-
- Creating a table with region replication - - Region replication is a per-table property. All tables have REGION_REPLICATION = 1 by default, which means that there is only one replica per region. You can set and change the number of replicas per region of a table by supplying the REGION_REPLICATION property in the table descriptor. - -
Shell - 2} - -describe 't1' -for i in 1..100 -put 't1', "r#{i}", 'f1:c1', i -end -flush 't1' -]]> - -
-
Java - - - You can also use setRegionReplication() and alter the table to increase or decrease the region replication for a table, as in the sketch that follows. -
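A minimal sketch of the Java route, assuming the setRegionReplication() API mentioned above and the same table and column family names as the shell example ('t1', 'f1'):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateReplicatedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("t1"));
    htd.addFamily(new HColumnDescriptor("f1"));
    htd.setRegionReplication(2);   // one primary plus one secondary per region
    admin.createTable(htd);
    admin.close();
  }
}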
-
-
- Region splits and merges - Region splits and merges are not compatible with regions with replicas yet. So you - have to pre-split the table, and disable the region splits. Also you should not execute - region merges on tables with region replicas. To disable region splits you can use - DisabledRegionSplitPolicy as the split policy. -
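A hedged sketch of creating a pre-split table with automatic splits turned off; the SPLIT_POLICY table attribute name is an assumption based on the description above, and the split keys are placeholders you would choose for your own row key space:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitReplicatedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("t1"));
    htd.addFamily(new HColumnDescriptor("f1"));
    htd.setRegionReplication(2);
    // Assumed attribute name; points the table at a split policy that never splits.
    htd.setValue("SPLIT_POLICY",
        "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy");
    // Placeholder pre-split points.
    byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
    admin.createTable(htd, splitKeys);
    admin.close();
  }
}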
-
- User Interface - In the master's user interface, the region replicas of a table are shown together with the primary regions. Notice that the replicas of a region share the same start and end keys and the same region name prefix. The only differences are the appended replica_id (which is encoded as hex) and the different region encoded name. The replica ids are also shown explicitly in the UI. -
-
- API and Usage -
- Shell - You can do reads in the shell using the Consistency.TIMELINE semantics as follows: - - get 't1','r6', {CONSISTENCY => "TIMELINE"} -]]> - You can simulate a region server pausing or becoming unavailable and do a read from the secondary replica: - - -hbase(main):001:0> get 't1','r6', {CONSISTENCY => "TIMELINE"} -]]> - Using scans is similar: - scan 't1', {CONSISTENCY => 'TIMELINE'} -]]> -
-
- Java - You can set the consistency for Gets and Scans and do requests as follows. - - You can also pass multiple gets: - gets = new ArrayList(); -gets.add(get1); -... -Result[] results = table.get(gets); -]]> - And Scans: - - You can inspect whether the results are coming from the primary region or not by calling the Result.isStale() method: - - -
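A minimal sketch pulling the pieces together; the table and row names mirror the shell examples above, and Get.setConsistency, Scan.setConsistency and Result.isStale are the APIs described in this section:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");

    // Single get with TIMELINE consistency.
    Get get = new Get(Bytes.toBytes("r6"));
    get.setConsistency(Consistency.TIMELINE);
    Result result = table.get(get);
    if (result.isStale()) {
      // Served by a secondary replica; the value may lag the primary.
    }

    // Multiple gets.
    List<Get> gets = new ArrayList<Get>();
    Get get1 = new Get(Bytes.toBytes("r1"));
    get1.setConsistency(Consistency.TIMELINE);
    gets.add(get1);
    Result[] results = table.get(gets);

    // Scan with TIMELINE consistency; staleness can be checked per row.
    Scan scan = new Scan();
    scan.setConsistency(Consistency.TIMELINE);
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      boolean stale = r.isStale();
    }
    scanner.close();
    table.close();
  }
}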
-
- -
- Resources - - - More information about the design and implementation can be found at the jira - issue: HBASE-10070 - - - - HBaseCon 2014 talk also contains some - details and slides. - - -
-
- -
+ @@ -5013,1091 +104,17 @@ if (result.isStale()) { - - - FAQ - - General - - When should I use HBase? - - See the in the Architecture chapter. - - - - - Are there other HBase FAQs? - - - See the FAQ that is up on the wiki, HBase Wiki FAQ. - - - - - Does HBase support SQL? - - - Not really. SQL-ish support for HBase via Hive is in development, however Hive is based on MapReduce which is not generally suitable for low-latency requests. - See the section for examples on the HBase client. - - - - - How can I find examples of NoSQL/HBase? - - See the link to the BigTable paper in in the appendix, as - well as the other papers. - - - - - What is the history of HBase? - - See . - - - - - - Upgrading - - - How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+? - - - In HBase 0.96, the project moved to a modular structure. Adjust your project's - dependencies to rely upon the hbase-client module or another - module as appropriate, rather than a single JAR. You can model your Maven depency - after one of the following, depending on your targeted version of HBase. See or for more - information. - - Maven Dependency for HBase 0.98 - - org.apache.hbase - hbase-client - 0.98.5-hadoop2 - - ]]> - - - Maven Dependency for HBase 0.96 - - org.apache.hbase - hbase-client - 0.96.2-hadoop2 - - ]]> - - - Maven Dependency for HBase 0.94 - - org.apache.hbase - hbase - 0.94.3 - - ]]> - - - - - Architecture - - How does HBase handle Region-RegionServer assignment and locality? - - - See . - - - - - Configuration - - How can I get started with my first cluster? - - - See . - - - - - Where can I learn about the rest of the configuration options? - - - See . - - - - - Schema Design / Data Access - - How should I design my schema in HBase? - - - See and - - - - - - How can I store (fill in the blank) in HBase? - - - - See . - - - - - - How can I handle secondary indexes in HBase? - - - - See - - - - - Can I change a table's rowkeys? - - This is a very common question. You can't. See . - - - - What APIs does HBase support? - - - See , and . - - - - - MapReduce - - How can I use MapReduce with HBase? - - - See - - - - - Performance and Troubleshooting - - - How can I improve HBase cluster performance? - - - - See . - - - - - - How can I troubleshoot my HBase cluster? - - - - See . - - - - - Amazon EC2 - - - I am running HBase on Amazon EC2 and... - - - - EC2 issues are a special case. See Troubleshooting and Performance sections. - - - - - Operations - - - How do I manage my HBase cluster? - - - - See - - - - - - How do I back up my HBase cluster? - - - - See - - - - - HBase in Action - - Where can I find interesting videos and presentations on HBase? - - - See - - - - - - - - - hbck In Depth - HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems -and repairing a corrupted HBase. It works in two basic modes -- a read-only inconsistency -identifying mode and a multi-phase read-write repair mode. - -
- Running hbck to identify inconsistencies -To check whether your HBase cluster has corruptions, run hbck against your HBase cluster: - -$ ./bin/hbase hbck - - -At the end of the command's output it prints OK or tells you the number of INCONSISTENCIES present. You may also want to run hbck a few times because some inconsistencies can be transient (e.g. the cluster is starting up or a region is splitting). Operationally you may want to run hbck regularly and set up an alert (e.g. via Nagios) if it repeatedly reports inconsistencies. A run of hbck will report a list of inconsistencies along with a brief description of the regions and tables affected. Using the -details option will report more details, including a representative listing of all the splits present in all the tables. - - -$ ./bin/hbase hbck -details - -If you just want to know whether some tables are corrupted, you can limit hbck to identify inconsistencies in only specific tables. For example, the following command would only attempt to check tables TableFoo and TableBar. The benefit is that hbck will run in less time. - -$ ./bin/hbase hbck TableFoo TableBar -
-
Inconsistencies - - If after several runs, inconsistencies continue to be reported, you may have encountered a -corruption. These should be rare, but in the event they occur newer versions of HBase include -the hbck tool enabled with automatic repair options. - - - There are two invariants that when violated create inconsistencies in HBase: - - - HBase’s region consistency invariant is satisfied if every region is assigned and -deployed on exactly one region server, and all places where this state kept is in -accordance. - - HBase’s table integrity invariant is satisfied if for each table, every possible row key -resolves to exactly one region. - - - -Repairs generally work in three phases -- a read-only information gathering phase that identifies -inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then -finally a region consistency repair phase that restores the region consistency invariant. -Starting from version 0.90.0, hbck could detect region consistency problems report on a subset -of possible table integrity problems. It also included the ability to automatically fix the most -common inconsistency, region assignment and deployment consistency problems. This repair -could be done by using the -fix command line option. These problems close regions if they are -open on the wrong server or on multiple region servers and also assigns regions to region -servers if they are not open. - - -Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options are -introduced to aid repairing a corrupted HBase. This hbck sometimes goes by the nickname -“uberhbck”. Each particular version of uber hbck is compatible with the HBase’s of the same -major version (0.90.7 uberhbck can repair a 0.90.4). However, versions <=0.90.6 and versions -<=0.92.1 may require restarting the master or failing over to a backup master. - -
-
Localized repairs - - When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first. -These are generally region consistency repairs -- localized single region repairs, that only modify -in-memory data, ephemeral zookeeper data, or patch holes in the META table. -Region consistency requires that the HBase instance has the state of the region’s data in HDFS -(.regioninfo files), the region’s row in the hbase:meta table., and region’s deployment/assignments on -region servers and the master in accordance. Options for repairing region consistency include: - - -fixAssignments (equivalent to the 0.90 -fix option) repairs unassigned, incorrectly -assigned or multiply assigned regions. - - -fixMeta which removes meta rows when corresponding regions are not present in - HDFS and adds new meta rows if they regions are present in HDFS while not in META. - - - To fix deployment and assignment problems you can run this command: - - -$ ./bin/hbase hbck -fixAssignments - -To fix deployment and assignment problems as well as repairing incorrect meta rows you can -run this command: - -$ ./bin/hbase hbck -fixAssignments -fixMeta - -There are a few classes of table integrity problems that are low risk repairs. The first two are -degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are -automatically handled by sidelining the data to a temporary directory (/hbck/xxxx). -The third low-risk class is hdfs region holes. This can be repaired by using the: - - -fixHdfsHoles option for fabricating new empty regions on the file system. -If holes are detected you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new region consistent. - - - -$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles - -Since this is a common operation, we’ve added a the -repairHoles flag that is equivalent to the -previous command: - -$ ./bin/hbase hbck -repairHoles - -If inconsistencies still remain after these steps, you most likely have table integrity problems -related to orphaned or overlapping regions. -
-
Region Overlap Repairs -Table integrity problems can require repairs that deal with overlaps. This is a riskier operation -because it requires modifications to the file system, requires some decision making, and may -require some manual steps. For these repairs it is best to analyze the output of a hbck -details -run so that you isolate repairs attempts only upon problems the checks identify. Because this is -riskier, there are safeguard that should be used to limit the scope of the repairs. -WARNING: This is a relatively new and have only been tested on online but idle HBase instances -(no reads/writes). Use at your own risk in an active production environment! -The options for repairing table integrity violations include: - - -fixHdfsOrphans option for “adopting” a region directory that is missing a region -metadata file (the .regioninfo file). - - -fixHdfsOverlaps ability for fixing overlapping regions - - -When repairing overlapping regions, a region’s data can be modified on the file system in two -ways: 1) by merging regions into a larger region or 2) by sidelining regions by moving data to -“sideline” directory where data could be restored later. Merging a large number of regions is -technically correct but could result in an extremely large region that requires series of costly -compactions and splitting operations. In these cases, it is probably better to sideline the regions -that overlap with the most other regions (likely the largest ranges) so that merges can happen on -a more reasonable scale. Since these sidelined regions are already laid out in HBase’s native -directory and HFile format, they can be restored by using HBase’s bulk load mechanism. -The default safeguard thresholds are conservative. These options let you override the default -thresholds and to enable the large region sidelining feature. - - -maxMerge <n> maximum number of overlapping regions to merge - - -sidelineBigOverlaps if more than maxMerge regions are overlapping, sideline attempt -to sideline the regions overlapping with the most other regions. - - -maxOverlapsToSideline <n> if sidelining large overlapping regions, sideline at most n -regions. - - - -Since often times you would just want to get the tables repaired, you can use this option to turn -on all repair options: - - -repair includes all the region consistency options and only the hole repairing table -integrity options. - - -Finally, there are safeguards to limit repairs to only specific tables. For example the following -command would only attempt to check and repair table TableFoo and TableBar. - -$ ./bin/hbase hbck -repair TableFoo TableBar - -
Special cases: Meta is not properly assigned -There are a few special cases that hbck can handle as well. -Sometimes the meta table’s only region is inconsistently assigned or deployed. In this case -there is a special -fixMetaOnly option that can try to fix meta assignments. - -$ ./bin/hbase hbck -fixMetaOnly -fixAssignments - -
-
Special cases: HBase version file is missing -HBase’s data on the file system requires a version file in order to start. If this flie is missing, you -can use the -fixVersionFile option to fabricating a new HBase version file. This assumes that -the version of hbck you are running is the appropriate version for the HBase cluster. -
-
Special case: Root and META are corrupt. -The most drastic corruption scenario is the case where the ROOT or META is corrupted and -HBase will not start. In this case you can use the OfflineMetaRepair tool create new ROOT -and META regions and tables. -This tool assumes that HBase is offline. It then marches through the existing HBase home -directory, loads as much information from region metadata files (.regioninfo files) as possible -from the file system. If the region metadata has proper table integrity, it sidelines the original root -and meta table directories, and builds new ones with pointers to the region directories and their -data. - -$ ./bin/hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair - -NOTE: This tool is not as clever as uberhbck but can be used to bootstrap repairs that uberhbck -can complete. -If the tool succeeds you should be able to start hbase and run online repairs if necessary. -
-
Special cases: Offline split parent - -Once a region is split, the offline parent will be cleaned up automatically. Sometimes, daughter regions -are split again before their parents are cleaned up. HBase can clean up parents in the right order. However, -there could be some lingering offline split parents sometimes. They are in META, in HDFS, and not deployed. -But HBase can't clean them up. In this case, you can use the -fixSplitParents option to reset -them in META to be online and not split. Therefore, hbck can merge them with other regions if fixing -overlapping regions option is used. - - -This option should not normally be used, and it is not in -fixAll. - -
-
-
- + + - - - Compression and Data Block Encoding In - HBase<indexterm><primary>Compression</primary><secondary>Data Block - Encoding</secondary><seealso>codecs</seealso></indexterm> - - Codecs mentioned in this section are for encoding and decoding data blocks or row keys. - For information about replication codecs, see . - - Some of the information in this section is pulled from a discussion on the - HBase Development mailing list. - HBase supports several different compression algorithms which can be enabled on a - ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking - advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys - and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in - cells, and can significantly reduce the storage space needed to store uncompressed - data. - Compressors and data block encoding can be used together on the same ColumnFamily. - - - Changes Take Effect Upon Compaction - If you change compression or encoding for a ColumnFamily, the changes take effect during - compaction. - - - Some codecs take advantage of capabilities built into Java, such as GZip compression. - Others rely on native libraries. Native libraries may be available as part of Hadoop, such as - LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs, - such as Google Snappy, need to be installed first. Some codecs are licensed in ways that - conflict with HBase's license and cannot be shipped as part of HBase. - - This section discusses common codecs that are used and tested with HBase. No matter what - codec you use, be sure to test that it is installed correctly and is available on all nodes in - your cluster. Extra operational steps may be necessary to be sure that codecs are available on - newly-deployed nodes. You can use the utility to check that a given codec is correctly - installed. - - To configure HBase to use a compressor, see . To enable a compressor for a ColumnFamily, see . To enable data block encoding for a ColumnFamily, see - . - - Block Compressors - - none - - - Snappy - - - LZO - - - LZ4 - - - GZ - - - - - - Data Block Encoding Types - - Prefix - Often, keys are very similar. Specifically, keys often share a common prefix - and only differ near the end. For instance, one key might be - RowKey:Family:Qualifier0 and the next key might be - RowKey:Family:Qualifier1. In Prefix encoding, an extra column is - added which holds the length of the prefix shared between the current key and the previous - key. Assuming the first key here is totally different from the key before, its prefix - length is 0. The second key's prefix length is 23, since they have the - first 23 characters in common. - Obviously if the keys tend to have nothing in common, Prefix will not provide much - benefit. - The following image shows a hypothetical ColumnFamily with no data block encoding. -
- ColumnFamily with No Encoding - - - - - A ColumnFamily with no encoding> - -
- Here is the same data with prefix data encoding. -
- ColumnFamily with Prefix Encoding - - - - - A ColumnFamily with prefix encoding - -
-
- - Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key - sequentially as a monolithic series of bytes, each key field is split so that each part of - the key can be compressed more efficiently. Two new fields are added: timestamp and type. - If the ColumnFamily is the same as the previous row, it is omitted from the current row. - If the key length, value length or type are the same as the previous row, the field is - omitted. In addition, for increased compression, the timestamp is stored as a Diff from - the previous row's timestamp, rather than being stored in full. Given the two row keys in - the Prefix example, and given an exact match on timestamp and the same type, neither the - value length, or type needs to be stored for the second row, and the timestamp value for - the second row is just 0, rather than a full timestamp. - Diff encoding is disabled by default because writing and scanning are slower but more - data is cached. - This image shows the same ColumnFamily from the previous images, with Diff encoding. -
- ColumnFamily with Diff Encoding - - - - - A ColumnFamily with diff encoding - -
-
- - Fast Diff - Fast Diff works similar to Diff, but uses a faster implementation. It also - adds another field which stores a single bit to track whether the data itself is the same - as the previous row. If it is, the data is not stored again. Fast Diff is the recommended - codec to use if you have long keys or many columns. The data format is nearly identical to - Diff encoding, so there is not an image to illustrate it. - - - Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It - provides similar memory savings to the Prefix, Diff, and Fast Diff encoder, but provides - faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate - for applications that have high block cache hit ratios. It introduces new 'tree' fields - for the row and column. The row tree field contains a list of offsets/references - corresponding to the cells in that row. This allows for a good deal of compression. For - more details about Prefix Tree encoding, see HBASE-4676. It is - difficult to graphically illustrate a prefix tree, so no image is included. See the - Wikipedia article for Trie for more general information - about this data structure. - -
- -
- Which Compressor or Data Block Encoder To Use - The compression or codec type to use depends on the characteristics of your data. - Choosing the wrong type could cause your data to take more space rather than less, and can - have performance implications. In general, you need to weigh your options between smaller - size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at Documenting Guidance on compression and codecs. - - - If you have long keys (compared to the values) or many columns, use a prefix - encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree - encoding. - - - If the values are large (and not precompressed, such as images), use a data block - compressor. - - - Use GZIP for cold data, which is accessed infrequently. GZIP - compression uses more CPU resources than Snappy or LZO, but provides a higher - compression ratio. - - - Use Snappy or LZO for hot data, which is accessed - frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high - of a compression ratio. - - - In most cases, enabling Snappy or LZO by default is a good choice, because they have - a low performance overhead and provide space savings. - - - Before Snappy became available by Google in 2011, LZO was the default. Snappy has - similar qualities as LZO but has been shown to perform better. - - -
-
- Making use of Hadoop Native Libraries in HBase - The Hadoop shared library provides a number of facilities, including compression libraries and fast CRC checksumming. To make these facilities available to HBase, do the following. HBase/Hadoop will fall back to using alternatives if it cannot find the native library versions -- or fail outright if you are asking for an explicit compressor and there is no alternative available. - If you see the following in your HBase logs, you know that HBase was unable to locate the Hadoop native libraries: - 2014-08-07 09:26:20,139 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - If the libraries loaded successfully, the WARN message does not show. - - Let's presume your Hadoop shipped with a native library that suits the platform you are running HBase on. To check if the Hadoop native library is available to HBase, run the following tool (available in Hadoop 2.1 and greater): - $ ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker -2014-08-26 13:15:38,717 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable -Native library checking: -hadoop: false -zlib: false -snappy: false -lz4: false -bzip2: false -2014-08-26 13:15:38,863 INFO [main] util.ExitUtil: Exiting with status 1 -The above shows that the native Hadoop library is not available in the HBase context. - - To fix this, either copy the Hadoop native libraries locally, or symlink to them if the Hadoop and HBase installs are adjacent in the filesystem. You could also point at their location by setting the LD_LIBRARY_PATH environment variable. - Where the JVM looks to find native libraries is "system dependent" (see java.lang.System#loadLibrary(name)). On Linux, by default, it is going to look in lib/native/PLATFORM, where PLATFORM is the label for the platform your HBase is installed on. On a local Linux machine, it seems to be the concatenation of the java properties os.name and os.arch followed by whether 32 or 64 bit. HBase prints out all of the java system properties on startup, so find os.name and os.arch in the log. For example: - .... - 2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux - 2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64 - ... - - So in this case, the PLATFORM string is Linux-amd64-64. Copying the Hadoop native libraries or symlinking at lib/native/Linux-amd64-64 will ensure they are found. Check with the Hadoop NativeLibraryChecker. - - - Here is an example of how to point at the Hadoop libs with the LD_LIBRARY_PATH environment variable: - $ LD_LIBRARY_PATH=~/hadoop-2.5.0-SNAPSHOT/lib/native ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker -2014-08-26 13:42:49,332 INFO [main] bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native -2014-08-26 13:42:49,337 INFO [main] zlib.ZlibFactory: Successfully loaded & initialized native-zlib library -Native library checking: -hadoop: true /home/stack/hadoop-2.5.0-SNAPSHOT/lib/native/libhadoop.so.1.0.0 -zlib: true /lib64/libz.so.1 -snappy: true /usr/lib64/libsnappy.so.1 -lz4: true revision:99 -bzip2: true /lib64/libbz2.so.1 -Set the LD_LIBRARY_PATH environment variable in hbase-env.sh when starting your HBase. -
- -
- Compressor Configuration, Installation, and Use -
- Configure HBase For Compressors - Before HBase can use a given compressor, its libraries need to be available. Due to - licensing issues, only GZ compression is available to HBase (via native Java libraries) in - a default installation. Other compression libraries are available via the shared library - bundled with your hadoop. The hadoop native library needs to be findable when HBase - starts. See -
- Compressor Support On the Master - A new configuration setting was introduced in HBase 0.95, to check the Master to - determine which data block encoders are installed and configured on it, and assume that - the entire cluster is configured the same. This option, - hbase.master.check.compression, defaults to true. This - prevents the situation described in HBASE-6370, where - a table is created or modified to support a codec that a region server does not support, - leading to failures that take a long time to occur and are difficult to debug. - If hbase.master.check.compression is enabled, libraries for all desired - compressors need to be installed and configured on the Master, even if the Master does - not run a region server. -
-
- Install GZ Support Via Native Libraries - HBase uses Java's built-in GZip support unless the native Hadoop libraries are - available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to - set the environment variable HBASE_LIBRARY_PATH for the user running - HBase. If native libraries are not available and Java's GZIP is used, Got - brand-new compressor reports will be present in the logs. See ). -
-
- Install LZO Support - HBase cannot ship with LZO because of incompatibility between HBase, which uses an - Apache Software License (ASL) and LZO, which uses a GPL license. See the Using LZO - Compression wiki page for information on configuring LZO support for HBase. - If you depend upon LZO compression, consider configuring your RegionServers to fail - to start if LZO is not available. See . -
-
- Configure LZ4 Support - LZ4 support is bundled with Hadoop. Make sure the hadoop shared library - (libhadoop.so) is accessible when you start - HBase. After configuring your platform (see ), you can make a symbolic link from HBase to the native Hadoop - libraries. This assumes the two software installs are colocated. For example, if my - 'platform' is Linux-amd64-64: - $ cd $HBASE_HOME -$ mkdir lib/native -$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64 - Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart) - HBase. Afterward, you can create and alter tables to enable LZ4 as a - compression codec.: - -hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'} - - -
-
- Install Snappy Support - HBase does not ship with Snappy support because of licensing issues. You can install - Snappy binaries (for instance, by using yum install snappy on CentOS) - or build Snappy from source. After installing Snappy, search for the shared library, - which will be called libsnappy.so.X where X is a number. If you - built from source, copy the shared library to a known location on your system, such as - /opt/snappy/lib/. - In addition to the Snappy library, HBase also needs access to the Hadoop shared - library, which will be called something like libhadoop.so.X.Y, - where X and Y are both numbers. Make note of the location of the Hadoop library, or copy - it to the same location as the Snappy library. - - The Snappy and Hadoop libraries need to be available on each node of your cluster. - See to find out how to test that this is the case. - See to configure your RegionServers to fail to - start if a given compressor is not available. - - Each of these library locations need to be added to the environment variable - HBASE_LIBRARY_PATH for the operating system user that runs HBase. You - need to restart the RegionServer for the changes to take effect. -
- - -
- CompressionTest - You can use the CompressionTest tool to verify that your compressor is available to - HBase: - - $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy - -
- - -
- Enforce Compression Settings On a RegionServer - You can configure a RegionServer so that it will fail to restart if compression is - configured incorrectly, by adding the option hbase.regionserver.codecs to the - hbase-site.xml, and setting its value to a comma-separated list - of codecs that need to be available. For example, if you set this property to - lzo,gz, the RegionServer would fail to start if both compressors - were not available. This would prevent a new server from being added to the cluster - without having codecs configured properly. -
-
- -
- Enable Compression On a ColumnFamily - To enable compression for a ColumnFamily, use an alter command. You do - not need to re-create the table or copy data. If you are changing codecs, be sure the old - codec is still available until all the old StoreFiles have been compacted. - - Enabling Compression on a ColumnFamily of an Existing Table using HBase - Shell - disable 'test' -hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'} -hbase> enable 'test']]> - - - - Creating a New Table with Compression On a ColumnFamily - create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' } - ]]> - - - Verifying a ColumnFamily's Compression Settings - describe 'test' -DESCRIPTION ENABLED - 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false - ', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', - VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS - => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa - lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B - LOCKCACHE => 'true'} -1 row(s) in 0.1070 seconds - ]]> - -
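The same change can be made from the Java client. A rough sketch, assuming the 0.96+ admin API (HColumnDescriptor.setCompressionType plus HBaseAdmin.modifyColumn) and the 'test'/'cf' names from the shell example; fetching the existing descriptor keeps the family's other settings intact:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableCompressionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    TableName name = TableName.valueOf("test");
    HTableDescriptor htd = admin.getTableDescriptor(name);
    HColumnDescriptor cf = htd.getFamily(Bytes.toBytes("cf"));
    cf.setCompressionType(Compression.Algorithm.GZ);
    admin.disableTable(name);
    admin.modifyColumn(name, cf);   // takes effect as StoreFiles are rewritten by compaction
    admin.enableTable(name);
    admin.close();
  }
}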
- -
- Testing Compression Performance - HBase includes a tool called LoadTestTool which provides mechanisms to test your - compression performance. You must specify either -write or - -update-read as your first parameter, and if you do not specify another - parameter, usage advice is printed for each option. - - <command>LoadTestTool</command> Usage - -Options: - -batchupdate Whether to use batch as opposed to separate - updates for every column in a row - -bloom Bloom filter type, one of [NONE, ROW, ROWCOL] - -compression Compression type, one of [LZO, GZ, NONE, SNAPPY, - LZ4] - -data_block_encoding Encoding algorithm (e.g. prefix compression) to - use for data blocks in the test column family, one - of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE]. - -encryption Enables transparent encryption on the test table, - one of [AES] - -generator The class which generates load for the tool. Any - args for this class can be passed as colon - separated after class name - -h,--help Show usage - -in_memory Tries to keep the HFiles of the CF inmemory as far - as possible. Not guaranteed that reads are always - served from inmemory - -init_only Initialize the test table only, don't do any - loading - -key_window The 'key window' to maintain between reads and - writes for concurrent write/read workload. The - default is 0. - -max_read_errors The maximum number of read errors to tolerate - before terminating all reader threads. The default - is 10. - -multiput Whether to use multi-puts as opposed to separate - puts for every column in a row - -num_keys The number of keys to read/write - -num_tables A positive integer number. When a number n is - speicfied, load test tool will load n table - parallely. -tn parameter value becomes table name - prefix. Each table name is in format - _1..._n - -read [:<#threads=20>] - -regions_per_server A positive integer number. When a number n is - specified, load test tool will create the test - table with n regions per server - -skip_init Skip the initialization; assume test table already - exists - -start_key The first key to read/write (a 0-based index). The - default value is 0. - -tn The name of the table to read or write - -update [:<#threads=20>][:<#whether to - ignore nonce collisions=0>] - -write :[:<#threads=20>] - -zk ZK quorum as comma-separated host names without - port numbers - -zk_root name of parent znode in zookeeper - ]]> - - - Example Usage of LoadTestTool - -$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000 - -read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE - - -
-
- -
- Enable Data Block Encoding - Codecs are built into HBase so no extra configuration is needed. Codecs are enabled on a - table by setting the DATA_BLOCK_ENCODING property. Disable the table before - altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell: - - Enable Data Block Encoding On a Table - disable 'test' -hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' } -Updating all regions with the new schema... -0/1 regions updated. -1/1 regions updated. -Done. -0 row(s) in 2.2820 seconds -hbase> enable 'test' -0 row(s) in 0.1580 seconds - ]]> - - - Verifying a ColumnFamily's Data Block Encoding - describe 'test' -DESCRIPTION ENABLED - 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true - _DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => - '0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS - IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS = - > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals - e', BLOCKCACHE => 'true'} -1 row(s) in 0.0650 seconds - ]]> - -
-
- - - SQL over HBase -
- Apache Phoenix - Apache Phoenix -
-
- Trafodion - Trafodion: Transactional SQL-on-HBase -
-
- - - <link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The Yahoo! Cloud Serving Benchmark</link> and HBase - TODO: Describe how YCSB is poor for putting up a decent cluster load. - TODO: Describe setup of YCSB for HBase. In particular, presplit your tables before you start - a run. See HBASE-4163 Create Split Strategy for YCSB Benchmark - for why and a little shell command for how to do it. - Ted Dunning redid YCSB so it's mavenized and added facility for verifying workloads. See Ted Dunning's YCSB. - - - - - - - Other Information About HBase -
HBase Videos - Introduction to HBase - - Introduction to HBase by Todd Lipcon (Chicago Data Summit 2011). - - Introduction to HBase by Todd Lipcon (2010). - - - - Building Real Time Services at Facebook with HBase by Jonathan Gray (Hadoop World 2011). - - HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon by JD Cryans (Hadoop World 2010). - -
-
HBase Presentations (Slides) - Advanced HBase Schema Design by Lars George (Hadoop World 2011). - - Introduction to HBase by Todd Lipcon (Chicago Data Summit 2011). - - Getting The Most From Your HBase Install by Ryan Rawson, Jonathan Gray (Hadoop World 2009). - -
-
HBase Papers - BigTable by Google (2006). - - HBase and HDFS Locality by Lars George (2010). - - No Relation: The Mixed Blessings of Non-Relational Databases by Ian Varley (2009). - -
-
HBase Sites - Cloudera's HBase Blog has a lot of links to useful HBase information. - - CAP Confusion is a relevant entry for background information on - distributed storage systems. - - - - HBase Wiki has a page with a number of presentations. - - HBase RefCard from DZone. - -
-
HBase Books - HBase: The Definitive Guide by Lars George. - -
-
Hadoop Books - Hadoop: The Definitive Guide by Tom White. - -
- -
- - HBase History - - 2006: BigTable paper published by Google. - - 2006 (end of year): HBase development starts. - - 2008: HBase becomes Hadoop sub-project. - - 2010: HBase becomes Apache top-level project. - - - - - HBase and the Apache Software Foundation - HBase is a project in the Apache Software Foundation and as such there are responsibilities to the ASF to ensure - a healthy project. -
ASF Development Process - See the Apache Development Process page - for all sorts of information on how the ASF is structured (e.g., PMC, committers, contributors), to tips on contributing - and getting involved, and how open-source works at ASF. - -
-
ASF Board Reporting - Once a quarter, each project in the ASF portfolio submits a report to the ASF board. This is done by the HBase project - lead and the committers. See ASF board reporting for more information. - -
-
- - Apache HBase Orca - - - - - - An Orca is the Apache HBase mascot. - See NOTICES.txt. Our Orca logo we got here: http://www.vectorfree.com/jumping-orca - It is licensed Creative Commons Attribution 3.0. See https://creativecommons.org/licenses/by/3.0/us/ - We changed the logo by stripping the colored background, inverting - it and then rotating it some. - - - + + + + + + + + diff --git a/src/main/docbkx/compression.xml b/src/main/docbkx/compression.xml new file mode 100644 index 00000000000..d1971b1c34c --- /dev/null +++ b/src/main/docbkx/compression.xml @@ -0,0 +1,535 @@ + + + + + Compression and Data Block Encoding In + HBase<indexterm><primary>Compression</primary><secondary>Data Block + Encoding</secondary><seealso>codecs</seealso></indexterm> + + Codecs mentioned in this section are for encoding and decoding data blocks or row keys. + For information about replication codecs, see . + + Some of the information in this section is pulled from a discussion on the + HBase Development mailing list. + HBase supports several different compression algorithms which can be enabled on a + ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking + advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys + and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in + cells, and can significantly reduce the storage space needed to store uncompressed + data. + Compressors and data block encoding can be used together on the same ColumnFamily. + + + Changes Take Effect Upon Compaction + If you change compression or encoding for a ColumnFamily, the changes take effect during + compaction. + + + Some codecs take advantage of capabilities built into Java, such as GZip compression. + Others rely on native libraries. Native libraries may be available as part of Hadoop, such as + LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs, + such as Google Snappy, need to be installed first. Some codecs are licensed in ways that + conflict with HBase's license and cannot be shipped as part of HBase. + + This section discusses common codecs that are used and tested with HBase. No matter what + codec you use, be sure to test that it is installed correctly and is available on all nodes in + your cluster. Extra operational steps may be necessary to be sure that codecs are available on + newly-deployed nodes. You can use the utility to check that a given codec is correctly + installed. + + To configure HBase to use a compressor, see . To enable a compressor for a ColumnFamily, see . To enable data block encoding for a ColumnFamily, see + . + + Block Compressors + + none + + + Snappy + + + LZO + + + LZ4 + + + GZ + + + + + + Data Block Encoding Types + + Prefix - Often, keys are very similar. Specifically, keys often share a common prefix + and only differ near the end. For instance, one key might be + RowKey:Family:Qualifier0 and the next key might be + RowKey:Family:Qualifier1. In Prefix encoding, an extra column is + added which holds the length of the prefix shared between the current key and the previous + key. Assuming the first key here is totally different from the key before, its prefix + length is 0. The second key's prefix length is 23, since they have the + first 23 characters in common. + Obviously if the keys tend to have nothing in common, Prefix will not provide much + benefit. + The following image shows a hypothetical ColumnFamily with no data block encoding. +
+ ColumnFamily with No Encoding + + + + + A ColumnFamily with no encoding + +
+ Here is the same data with prefix data encoding. +
+ ColumnFamily with Prefix Encoding + + + + + A ColumnFamily with prefix encoding + +
+
+ + Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key
+ sequentially as a monolithic series of bytes, each key field is split so that each part of
+ the key can be compressed more efficiently. Two new fields are added: timestamp and type.
+ If the ColumnFamily is the same as the previous row, it is omitted from the current row.
+ If the key length, value length or type are the same as the previous row, the field is
+ omitted. In addition, for increased compression, the timestamp is stored as a Diff from
+ the previous row's timestamp, rather than being stored in full. Given the two row keys in
+ the Prefix example, and given an exact match on timestamp and the same type, neither the
+ value length nor the type needs to be stored for the second row, and the timestamp value for
+ the second row is just 0, rather than a full timestamp.
+ Diff encoding is disabled by default because writing and scanning are slower but more
+ data is cached.
+ This image shows the same ColumnFamily from the previous images, with Diff encoding. +
+ ColumnFamily with Diff Encoding + + + + + A ColumnFamily with diff encoding + +
+
+ + Fast Diff - Fast Diff works similarly to Diff, but uses a faster implementation. It also
+ adds another field which stores a single bit to track whether the data itself is the same
+ as the previous row. If it is, the data is not stored again. Fast Diff is the recommended
+ codec to use if you have long keys or many columns. The data format is nearly identical to
+ Diff encoding, so there is not an image to illustrate it.
+
+
+ Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It
+ provides similar memory savings to the Prefix, Diff, and Fast Diff encoders, but provides
+ faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate
+ for applications that have high block cache hit ratios. It introduces new 'tree' fields
+ for the row and column. The row tree field contains a list of offsets/references
+ corresponding to the cells in that row. This allows for a good deal of compression. For
+ more details about Prefix Tree encoding, see HBASE-4676. It is
+ difficult to graphically illustrate a prefix tree, so no image is included. See the
+ Wikipedia article for Trie for more general information
+ about this data structure.
+
+ +
+ Which Compressor or Data Block Encoder To Use
+ The compression or codec type to use depends on the characteristics of your data.
+ Choosing the wrong type could cause your data to take more space rather than less, and can
+ have performance implications. In general, you need to weigh your options between smaller
+ size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at Documenting Guidance on compression and codecs.
+
+
+ If you have long keys (compared to the values) or many columns, use a prefix
+ encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree
+ encoding.
+
+
+ If the values are large (and not precompressed, such as images), use a data block
+ compressor.
+
+
+ Use GZIP for cold data, which is accessed infrequently. GZIP
+ compression uses more CPU resources than Snappy or LZO, but provides a higher
+ compression ratio.
+
+
+ Use Snappy or LZO for hot data, which is accessed
+ frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high
+ a compression ratio.
+
+
+ In most cases, enabling Snappy or LZO by default is a good choice, because they have
+ a low performance overhead and provide space savings.
+
+
+ Before Google made Snappy available in 2011, LZO was the default. Snappy has
+ similar qualities to LZO but has been shown to perform better.
+
+
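To make the guidance above concrete, here is a minimal sketch of creating a table whose ColumnFamily combines a prefix-style encoder with a fast compressor. It is an illustration only: it assumes the HBase 1.0-style Java client API, and the table name "testtable" and family name "cf" are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class CreateEncodedTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      // "testtable" and "cf" are example names only
      HColumnDescriptor family = new HColumnDescriptor("cf");
      family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // long keys or many columns
      family.setCompressionType(Compression.Algorithm.SNAPPY);  // hot data, low CPU overhead
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("testtable"));
      table.addFamily(family);
      admin.createTable(table);
    }
  }
}

The HBase Shell equivalents, which set COMPRESSION and DATA_BLOCK_ENCODING on the ColumnFamily, are shown in the sections that follow.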
+
+ Making use of Hadoop Native Libraries in HBase
+ The Hadoop shared library includes a number of facilities, such as
+ compression libraries and fast CRC checksumming. To make these facilities available
+ to HBase, do the following. HBase/Hadoop will fall back to using
+ alternatives if it cannot find the native library versions -- or
+ fail outright if you are asking for an explicit compressor and there is
+ no alternative available.
+ If you see the following in your HBase logs, you know that HBase was unable
+ to locate the Hadoop native libraries:
+ 2014-08-07 09:26:20,139 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+ If the libraries loaded successfully, the WARN message does not show.
+
+ Let's presume your Hadoop shipped with a native library that
+ suits the platform you are running HBase on. To check if the Hadoop
+ native library is available to HBase, run the following tool (available in
+ Hadoop 2.1 and greater):
+ $ ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
+2014-08-26 13:15:38,717 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+Native library checking:
+hadoop: false
+zlib: false
+snappy: false
+lz4: false
+bzip2: false
+2014-08-26 13:15:38,863 INFO [main] util.ExitUtil: Exiting with status 1
+The above shows that the native Hadoop library is not available in the HBase context.
+
+ To fix the above, either copy the Hadoop native libraries locally, or symlink to
+ them if the Hadoop and HBase installs are adjacent in the filesystem.
+ You could also point at their location by setting the LD_LIBRARY_PATH environment
+ variable.
+ Where the JVM looks to find native libraries is "system dependent"
+ (see java.lang.System#loadLibrary(name)). On Linux, by default, it
+ is going to look in lib/native/PLATFORM where PLATFORM
+ is the label for the platform your HBase is installed on.
+ On a local Linux machine, it seems to be the concatenation of the Java properties
+ os.name and os.arch followed by whether 32 or 64 bit.
+ On startup, HBase prints out all of the Java system properties, so find the os.name and os.arch
+ in the log. For example:
+ ....
+ 2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
+ 2014-08-06 15:27:22,853 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
+ ...
+
+ So in this case, the PLATFORM string is Linux-amd64-64.
+ Copying the Hadoop native libraries or symlinking at lib/native/Linux-amd64-64
+ will ensure they are found. Check with the Hadoop NativeLibraryChecker.
+
+
+ Here is an example of how to point at the Hadoop libs with the LD_LIBRARY_PATH
+ environment variable:
+ $ LD_LIBRARY_PATH=~/hadoop-2.5.0-SNAPSHOT/lib/native ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
+2014-08-26 13:42:49,332 INFO [main] bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
+2014-08-26 13:42:49,337 INFO [main] zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
+Native library checking:
+hadoop: true /home/stack/hadoop-2.5.0-SNAPSHOT/lib/native/libhadoop.so.1.0.0
+zlib: true /lib64/libz.so.1
+snappy: true /usr/lib64/libsnappy.so.1
+lz4: true revision:99
+bzip2: true /lib64/libbz2.so.1
+Set the LD_LIBRARY_PATH environment variable in hbase-env.sh when starting your HBase.
+
+ +
+ Compressor Configuration, Installation, and Use +
+ Configure HBase For Compressors + Before HBase can use a given compressor, its libraries need to be available. Due to + licensing issues, only GZ compression is available to HBase (via native Java libraries) in + a default installation. Other compression libraries are available via the shared library + bundled with your hadoop. The hadoop native library needs to be findable when HBase + starts. See +
+ Compressor Support On the Master
+ A new configuration setting was introduced in HBase 0.95 which causes the Master to
+ check which data block encoders are installed and configured on it, and to assume that
+ the entire cluster is configured the same. This option,
+ hbase.master.check.compression, defaults to true. This
+ prevents the situation described in HBASE-6370, where
+ a table is created or modified to support a codec that a region server does not support,
+ leading to failures that take a long time to occur and are difficult to debug.
+ If hbase.master.check.compression is enabled, libraries for all desired
+ compressors need to be installed and configured on the Master, even if the Master does
+ not run a region server.
+
+ Install GZ Support Via Native Libraries + HBase uses Java's built-in GZip support unless the native Hadoop libraries are + available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to + set the environment variable HBASE_LIBRARY_PATH for the user running + HBase. If native libraries are not available and Java's GZIP is used, Got + brand-new compressor reports will be present in the logs. See ). +
+
+ Install LZO Support + HBase cannot ship with LZO because of incompatibility between HBase, which uses an + Apache Software License (ASL) and LZO, which uses a GPL license. See the Using LZO + Compression wiki page for information on configuring LZO support for HBase. + If you depend upon LZO compression, consider configuring your RegionServers to fail + to start if LZO is not available. See . +
+
+ Configure LZ4 Support
+ LZ4 support is bundled with Hadoop. Make sure the hadoop shared library
+ (libhadoop.so) is accessible when you start
+ HBase. After configuring your platform (see ), you can make a symbolic link from HBase to the native Hadoop
+ libraries. This assumes the two software installs are colocated. For example, if my
+ 'platform' is Linux-amd64-64:
+ $ cd $HBASE_HOME
+$ mkdir lib/native
+$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64
+ Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart)
+ HBase. Afterward, you can create and alter tables to enable LZ4 as a
+ compression codec:
+
+hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}
+
+
+
+ Install Snappy Support + HBase does not ship with Snappy support because of licensing issues. You can install + Snappy binaries (for instance, by using yum install snappy on CentOS) + or build Snappy from source. After installing Snappy, search for the shared library, + which will be called libsnappy.so.X where X is a number. If you + built from source, copy the shared library to a known location on your system, such as + /opt/snappy/lib/. + In addition to the Snappy library, HBase also needs access to the Hadoop shared + library, which will be called something like libhadoop.so.X.Y, + where X and Y are both numbers. Make note of the location of the Hadoop library, or copy + it to the same location as the Snappy library. + + The Snappy and Hadoop libraries need to be available on each node of your cluster. + See to find out how to test that this is the case. + See to configure your RegionServers to fail to + start if a given compressor is not available. + + Each of these library locations need to be added to the environment variable + HBASE_LIBRARY_PATH for the operating system user that runs HBase. You + need to restart the RegionServer for the changes to take effect. +
+ + +
+ CompressionTest + You can use the CompressionTest tool to verify that your compressor is available to + HBase: + + $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy + +
+ + +
+ Enforce Compression Settings On a RegionServer + You can configure a RegionServer so that it will fail to restart if compression is + configured incorrectly, by adding the option hbase.regionserver.codecs to the + hbase-site.xml, and setting its value to a comma-separated list + of codecs that need to be available. For example, if you set this property to + lzo,gz, the RegionServer would fail to start if both compressors + were not available. This would prevent a new server from being added to the cluster + without having codecs configured properly. +
+
+ +
+ Enable Compression On a ColumnFamily + To enable compression for a ColumnFamily, use an alter command. You do + not need to re-create the table or copy data. If you are changing codecs, be sure the old + codec is still available until all the old StoreFiles have been compacted. + + Enabling Compression on a ColumnFamily of an Existing Table using HBase + Shell + disable 'test' +hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'} +hbase> enable 'test']]> + + + + Creating a New Table with Compression On a ColumnFamily + create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' } + ]]> + + + Verifying a ColumnFamily's Compression Settings + describe 'test' +DESCRIPTION ENABLED + 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false + ', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', + VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS + => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa + lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B + LOCKCACHE => 'true'} +1 row(s) in 0.1070 seconds + ]]> + +
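The same change can be made programmatically. The following is a sketch only, assuming the HBase 1.0 Java Admin API and the example table 'test' and family 'cf' used in the shell listings above: the existing column descriptor is fetched, modified, and pushed back, and the change takes effect as regions compact.

// Assumes the usual client imports: org.apache.hadoop.hbase.*, org.apache.hadoop.hbase.client.*,
// org.apache.hadoop.hbase.util.Bytes, and org.apache.hadoop.hbase.io.compress.Compression.
public static void enableGzCompression(Connection connection) throws IOException {
  TableName tn = TableName.valueOf("test");                    // example table name
  try (Admin admin = connection.getAdmin()) {
    HColumnDescriptor cf = admin.getTableDescriptor(tn).getFamily(Bytes.toBytes("cf")); // example family
    cf.setCompressionType(Compression.Algorithm.GZ);
    admin.modifyColumn(tn, cf);                                // applied as StoreFiles are rewritten by compaction
  }
}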
+ +
+ Testing Compression Performance + HBase includes a tool called LoadTestTool which provides mechanisms to test your + compression performance. You must specify either -write or + -update-read as your first parameter, and if you do not specify another + parameter, usage advice is printed for each option. + + <command>LoadTestTool</command> Usage + +Options: + -batchupdate Whether to use batch as opposed to separate + updates for every column in a row + -bloom Bloom filter type, one of [NONE, ROW, ROWCOL] + -compression Compression type, one of [LZO, GZ, NONE, SNAPPY, + LZ4] + -data_block_encoding Encoding algorithm (e.g. prefix compression) to + use for data blocks in the test column family, one + of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE]. + -encryption Enables transparent encryption on the test table, + one of [AES] + -generator The class which generates load for the tool. Any + args for this class can be passed as colon + separated after class name + -h,--help Show usage + -in_memory Tries to keep the HFiles of the CF inmemory as far + as possible. Not guaranteed that reads are always + served from inmemory + -init_only Initialize the test table only, don't do any + loading + -key_window The 'key window' to maintain between reads and + writes for concurrent write/read workload. The + default is 0. + -max_read_errors The maximum number of read errors to tolerate + before terminating all reader threads. The default + is 10. + -multiput Whether to use multi-puts as opposed to separate + puts for every column in a row + -num_keys The number of keys to read/write + -num_tables A positive integer number. When a number n is + speicfied, load test tool will load n table + parallely. -tn parameter value becomes table name + prefix. Each table name is in format + _1..._n + -read [:<#threads=20>] + -regions_per_server A positive integer number. When a number n is + specified, load test tool will create the test + table with n regions per server + -skip_init Skip the initialization; assume test table already + exists + -start_key The first key to read/write (a 0-based index). The + default value is 0. + -tn The name of the table to read or write + -update [:<#threads=20>][:<#whether to + ignore nonce collisions=0>] + -write :[:<#threads=20>] + -zk ZK quorum as comma-separated host names without + port numbers + -zk_root name of parent znode in zookeeper + ]]> + + + Example Usage of LoadTestTool + +$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000 + -read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE + + +
+
+ +
+ Enable Data Block Encoding + Codecs are built into HBase so no extra configuration is needed. Codecs are enabled on a + table by setting the DATA_BLOCK_ENCODING property. Disable the table before + altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell: + + Enable Data Block Encoding On a Table + disable 'test' +hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' } +Updating all regions with the new schema... +0/1 regions updated. +1/1 regions updated. +Done. +0 row(s) in 2.2820 seconds +hbase> enable 'test' +0 row(s) in 0.1580 seconds + ]]> + + + Verifying a ColumnFamily's Data Block Encoding + describe 'test' +DESCRIPTION ENABLED + 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true + _DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => + '0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS + IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS = + > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals + e', BLOCKCACHE => 'true'} +1 row(s) in 0.0650 seconds + ]]> + +
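For completeness, the same setting can be applied through the Java Admin API. This is a sketch under the assumption of the HBase 1.0-era client; 'test' and 'cf' are the example names from the shell listing above.

// Assumes the standard client imports, including org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.
public static void enableFastDiff(Admin admin) throws IOException {
  TableName tn = TableName.valueOf("test");                    // example table
  HColumnDescriptor cf = admin.getTableDescriptor(tn).getFamily(Bytes.toBytes("cf"));
  cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
  admin.modifyColumn(tn, cf);  // existing data is re-encoded as compactions rewrite StoreFiles
}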
+ + +
diff --git a/src/main/docbkx/configuration.xml b/src/main/docbkx/configuration.xml index 74b8e526c60..a0b7d116dde 100644 --- a/src/main/docbkx/configuration.xml +++ b/src/main/docbkx/configuration.xml @@ -925,8 +925,8 @@ stopping hbase............... - + href="hbase-default.xml"> + diff --git a/src/main/docbkx/customization-pdf.xsl b/src/main/docbkx/customization-pdf.xsl new file mode 100644 index 00000000000..b21236fc3e3 --- /dev/null +++ b/src/main/docbkx/customization-pdf.xsl @@ -0,0 +1,129 @@ + + + + + + + + + + + + 0 + 0 + 0 + + + 5mm + 10mm + 10mm + + 15mm + 10mm + 0mm + + 18mm + 18mm + + + 0pc + + + + + justify + true + + + 11 + 8 + + + 1.4 + + + /&? + + + + + + + + 0.8em + wrap + true + + + + + + page + + + + + page + + + + + page + + + + 2 + + + + + + + Courier + 8pt + always + + + + + + #E8E8E8 + 0.5pt + solid + #575757 + 3pt + + + + + + 90 + + + + + diff --git a/src/main/docbkx/datamodel.xml b/src/main/docbkx/datamodel.xml new file mode 100644 index 00000000000..bdf697d6fde --- /dev/null +++ b/src/main/docbkx/datamodel.xml @@ -0,0 +1,865 @@ + + + + + Data Model + In HBase, data is stored in tables, which have rows and columns. This is a terminology + overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can + be helpful to think of an HBase table as a multi-dimensional map. + + HBase Data Model Terminology + + Table + + An HBase table consists of multiple rows. + + + + Row + + A row in HBase consists of a row key and one or more columns with values associated + with them. Rows are sorted alphabetically by the row key as they are stored. For this + reason, the design of the row key is very important. The goal is to store data in such a + way that related rows are near each other. A common row key pattern is a website domain. + If your row keys are domains, you should probably store them in reverse (org.apache.www, + org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each + other in the table, rather than being spread out based on the first letter of the + subdomain. + + + + Column + + A column in HBase consists of a column family and a column qualifier, which are + delimited by a : (colon) character. + + + + Column Family + + Column families physically colocate a set of columns and their values, often for + performance reasons. Each column family has a set of storage properties, such as whether + its values should be cached in memory, how its data is compressed or its row keys are + encoded, and others. Each row in a table has the same column + families, though a given row might not store anything in a given column family. + Column families are specified when you create your table, and influence the way your + data is stored in the underlying filesystem. Therefore, the column families should be + considered carefully during schema design. + + + + Column Qualifier + + A column qualifier is added to a column family to provide the index for a given + piece of data. Given a column family content, a column qualifier + might be content:html, and another might be + content:pdf. Though column families are fixed at table creation, + column qualifiers are mutable and may differ greatly between rows. + + + + Cell + + A cell is a combination of row, column family, and column qualifier, and contains a + value and a timestamp, which represents the value's version. + A cell's value is an uninterpreted array of bytes. 
+ + + + Timestamp + + A timestamp is written alongside each value, and is the identifier for a given + version of a value. By default, the timestamp represents the time on the RegionServer + when the data was written, but you can specify a different timestamp value when you put + data into the cell. + + Direct manipulation of timestamps is an advanced feature which is only exposed for + special cases that are deeply integrated with HBase, and is discouraged in general. + Encoding a timestamp at the application level is the preferred pattern. + + You can specify the maximum number of versions of a value that HBase retains, per column + family. When the maximum number of versions is reached, the oldest versions are + eventually deleted. By default, only the newest version is kept. + + + + +
+ Conceptual View + You can read a very understandable explanation of the HBase data model in the blog post Understanding + HBase and BigTable by Jim R. Wilson. Another good explanation is available in the + PDF Introduction + to Basic Schema Design by Amandeep Khurana. It may help to read different + perspectives to get a solid understanding of HBase schema design. The linked articles cover + the same ground as the information in this section. + The following example is a slightly modified form of the one on page 2 of the BigTable paper. There + is a table called webtable that contains two rows + (com.cnn.www + and com.example.www), three column families named + contents, anchor, and people. In + this example, for the first row (com.cnn.www), + anchor contains two columns (anchor:cssnsi.com, + anchor:my.look.ca) and contents contains one column + (contents:html). This example contains 5 versions of the row with the + row key com.cnn.www, and one version of the row with the row key + com.example.www. The contents:html column qualifier contains the entire + HTML of a given website. Qualifiers of the anchor column family each + contain the external site which links to the site represented by the row, along with the + text it used in the anchor of its link. The people column family represents + people associated with the site. + + + Column Names + By convention, a column name is made of its column family prefix and a + qualifier. For example, the column + contents:html is made up of the column family + contents and the html qualifier. The colon + character (:) delimits the column family from the column family + qualifier. + + + Table <varname>webtable</varname> + + + + + + + + + Row Key + Time Stamp + ColumnFamily contents + ColumnFamily anchor + ColumnFamily people + + + + + "com.cnn.www" + t9 + + anchor:cnnsi.com = "CNN" + + + + "com.cnn.www" + t8 + + anchor:my.look.ca = "CNN.com" + + + + "com.cnn.www" + t6 + contents:html = "<html>..." + + + + + "com.cnn.www" + t5 + contents:html = "<html>..." + + + + + "com.cnn.www" + t3 + contents:html = "<html>..." + + + + + "com.example.www" + t5 + contents:html = "<html>..." + + people:author = "John Doe" + + + +
+ Cells in this table that appear to be empty do not take space, or in fact exist, in + HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to + look at data in HBase, or even the most accurate. The following represents the same + information as a multi-dimensional map. This is only a mock-up for illustrative + purposes and may not be strictly accurate. + ..." + t5: contents:html: "..." + t3: contents:html: "..." + } + anchor: { + t9: anchor:cnnsi.com = "CNN" + t8: anchor:my.look.ca = "CNN.com" + } + people: {} + } + "com.example.www": { + contents: { + t5: contents:html: "..." + } + anchor: {} + people: { + t5: people:author: "John Doe" + } + } +} + ]]> + +
+
+ Physical View + Although at a conceptual level tables may be viewed as a sparse set of rows, they are + physically stored by column family. A new column qualifier (column_family:column_qualifier) + can be added to an existing column family at any time. + + ColumnFamily <varname>anchor</varname> + + + + + + + Row Key + Time Stamp + Column Family anchor + + + + + "com.cnn.www" + t9 + anchor:cnnsi.com = "CNN" + + + "com.cnn.www" + t8 + anchor:my.look.ca = "CNN.com" + + + +
+ + ColumnFamily <varname>contents</varname> + + + + + + + Row Key + Time Stamp + ColumnFamily "contents:" + + + + + "com.cnn.www" + t6 + contents:html = "<html>..." + + + "com.cnn.www" + t5 + contents:html = "<html>..." + + + "com.cnn.www" + t3 + contents:html = "<html>..." + + + +
+ The empty cells shown in the + conceptual view are not stored at all. + Thus a request for the value of the contents:html column at time stamp + t8 would return no value. Similarly, a request for an + anchor:my.look.ca value at time stamp t9 would + return no value. However, if no timestamp is supplied, the most recent value for a + particular column would be returned. Given multiple versions, the most recent is also the + first one found, since timestamps + are stored in descending order. Thus a request for the values of all columns in the row + com.cnn.www if no timestamp is specified would be: the value of + contents:html from timestamp t6, the value of + anchor:cnnsi.com from timestamp t9, the value of + anchor:my.look.ca from timestamp t8. + For more information about the internals of how Apache HBase stores data, see . +
+ +
+ Namespace
+ A namespace is a logical grouping of tables analogous to a database in relational
+ database systems. This abstraction lays the groundwork for upcoming multi-tenancy related
+ features:
+
+ Quota Management (HBASE-8410) - Restrict the amount of resources (i.e. regions,
+ tables) a namespace can consume.
+
+
+ Namespace Security Administration (HBASE-9206) - Provide another level of security
+ administration for tenants.
+
+
+ Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset
+ of regionservers thus guaranteeing a coarse level of isolation.
+
+
+
+ Namespace management
+ A namespace can be created, removed or altered. Namespace membership is determined
+ during table creation by specifying a fully-qualified table name of the form:
+
+ namespace:table_qualifier
+
+
+
+ Examples
+
+
+#Create a namespace
+create_namespace 'my_ns'
+
+
+#create my_table in my_ns namespace
+create 'my_ns:my_table', 'fam'
+
+
+#drop namespace
+drop_namespace 'my_ns'
+
+
+#alter namespace
+alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
+
+
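Namespaces can also be managed through the Java client. The following is a brief, illustrative sketch assuming the HBase 0.96+ Admin API; 'my_ns', 'my_table', and 'fam' match the shell examples above.

// Assumes the standard client imports, including org.apache.hadoop.hbase.NamespaceDescriptor.
public static void createNamespaceAndTable(Admin admin) throws IOException {
  admin.createNamespace(NamespaceDescriptor.create("my_ns").build());
  HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_ns:my_table"));
  desc.addFamily(new HColumnDescriptor("fam"));
  admin.createTable(desc); // the table is created inside the my_ns namespace
}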
+ Predefined namespaces
+ There are two predefined special namespaces:
+
+
+ hbase - system namespace, used to contain HBase internal tables
+
+
+ default - tables with no explicitly specified namespace will automatically fall into
+ this namespace.
+
+
+
+ Examples
+
+
+#namespace=foo and table qualifier=bar
+create 'foo:bar', 'fam'
+
+#namespace=default and table qualifier=bar
+create 'bar', 'fam'
+
+ + +
+ Table + Tables are declared up front at schema definition time. +
+ +
+ Row
+ Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest
+ order appearing first in a table. The empty byte array is used to denote both the start and
+ end of a table's namespace.
+
+ +
+ Column Family<indexterm><primary>Column Family</primary></indexterm>
+ Columns in Apache HBase are grouped into column families. All
+ column members of a column family have the same prefix. For example, the columns
+ courses:history and courses:math are both
+ members of the courses column family. The colon character
+ (:) delimits the column family from the column
+ family qualifier.
+ The column family prefix must be composed of printable characters. The
+ qualifying tail, the column family qualifier, can be made of any
+ arbitrary bytes. Column families must be declared up front at schema definition time whereas
+ columns do not need to be defined at schema time but can be conjured on the fly while the
+ table is up and running.
+ Physically, all column family members are stored together on the filesystem. Because
+ tunings and storage specifications are done at the column family level, it is advised that
+ all column family members have the same general access pattern and size
+ characteristics.
+
+
+
+ Cells<indexterm><primary>Cells</primary></indexterm>
+ A {row, column, version} tuple exactly specifies a
+ cell in HBase. Cell content is uninterpreted bytes.
+
+
+ Data Model Operations + The four primary data model operations are Get, Put, Scan, and Delete. Operations are + applied via Table + instances. + +
+ Get + Get + returns attributes for a specified row. Gets are executed via + Table.get. +
+
+ Put + Put + either adds new rows to a table (if the key is new) or can update existing rows (if the + key already exists). Puts are executed via + Table.put (writeBuffer) or + Table.batch (non-writeBuffer). +
+
+ Scans + Scan + allow iteration over multiple rows for specified attributes. + The following is an example of a Scan on a Table instance. Assume that a table is + populated with rows with keys "row1", "row2", "row3", and then another set of rows with + the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan + instance to return the rows beginning with "row". + +public static final byte[] CF = "cf".getBytes(); +public static final byte[] ATTR = "attr".getBytes(); +... + +Table table = ... // instantiate a Table instance + +Scan scan = new Scan(); +scan.addColumn(CF, ATTR); +scan.setRowPrefixFilter(Bytes.toBytes("row")); +ResultScanner rs = table.getScanner(scan); +try { + for (Result r = rs.next(); r != null; r = rs.next()) { + // process result... +} finally { + rs.close(); // always close the ResultScanner! + + Note that generally the easiest way to specify a specific stop point for a scan is by + using the InclusiveStopFilter + class. +
+
+ Delete + Delete + removes a row from a table. Deletes are executed via + HTable.delete. + HBase does not modify data in place, and so deletes are handled by creating new + markers called tombstones. These tombstones, along with the dead + values, are cleaned up on major compactions. + See for more information on deleting versions of columns, and + see for more information on compactions. + +
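As an illustration, in the same fragment style as the other examples in this chapter (the row key is a placeholder), deleting an entire row looks like this:

// assumes the standard client imports and a Table instance, as in the Scan example above
Table table = ... // instantiate a Table instance
Delete delete = new Delete(Bytes.toBytes("row1")); // "row1" is an example row key
table.delete(delete); // writes tombstones; the data is physically removed at the next major compaction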
+ +
+ + +
+ Versions<indexterm><primary>Versions</primary></indexterm>
+
+ A {row, column, version} tuple exactly specifies a
+ cell in HBase. It's possible to have an unbounded number of cells where
+ the row and column are the same but the cell address differs only in its version
+ dimension.
+
+ While rows and column keys are expressed as bytes, the version is specified using a long
+ integer. Typically this long contains time instances such as those returned by
+ java.util.Date.getTime() or System.currentTimeMillis(), that is:
+ the difference, measured in milliseconds, between the current time and midnight,
+ January 1, 1970 UTC.
+
+ The HBase version dimension is stored in decreasing order, so that when reading from a
+ store file, the most recent values are found first.
+
+ There is a lot of confusion over the semantics of cell versions in
+ HBase. In particular:
+
+
+ If multiple writes to a cell have the same version, only the last written is
+ fetchable.
+
+
+
+ It is OK to write cells in a non-increasing version order.
+
+
+
+ Below we describe how the version dimension in HBase currently works. See HBASE-2406 for
+ discussion of HBase versions. Bending time in HBase
+ makes for a good read on the version, or time, dimension in HBase. It has more detail on
+ versioning than is provided here. As of this writing, the limitation
+ Overwriting values at existing timestamps mentioned in the
+ article no longer holds in HBase. This section is basically a synopsis of this article
+ by Bruno Dumon.
+
+ Specifying the Number of Versions to Store
+ The maximum number of versions to store for a given column is part of the column
+ schema and is specified at table creation, or via an alter command, via
+ HColumnDescriptor.DEFAULT_VERSIONS. Prior to HBase 0.96, the default number
+ of versions kept was 3, but in 0.96 and newer it has been changed to
+ 1.
+
+ Modify the Maximum Number of Versions for a Column
+ This example uses HBase Shell to keep a maximum of 5 versions of column
+ f1. You could also use HColumnDescriptor.
+ alter 't1', NAME => 'f1', VERSIONS => 5]]>
+
+
+ Modify the Minimum Number of Versions for a Column
+ You can also specify the minimum number of versions to store. By default, this is
+ set to 0, which means the feature is disabled. The following example sets the minimum
+ number of versions on field f1 to 2, via HBase Shell.
+ You could also use HColumnDescriptor.
+ alter 't1', NAME => 'f1', MIN_VERSIONS => 2]]>
+
+ Starting with HBase 0.98.2, you can specify a global default for the maximum number of
+ versions kept for all newly-created columns, by setting
+ in hbase-site.xml. See
+ .
+
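Since the text above mentions HColumnDescriptor, here is a hedged sketch of the equivalent Java API calls (HBase 0.96+ Admin API assumed; 't1' and 'f1' match the shell examples):

// Assumes the standard client imports.
public static void setVersionBounds(Admin admin) throws IOException {
  TableName tn = TableName.valueOf("t1");
  HColumnDescriptor f1 = admin.getTableDescriptor(tn).getFamily(Bytes.toBytes("f1"));
  f1.setMaxVersions(5);  // keep at most 5 versions
  f1.setMinVersions(2);  // keep at least 2 versions, even once TTL has expired
  admin.modifyColumn(tn, f1);
}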
+ +
+ Versions and HBase Operations + + In this section we look at the behavior of the version dimension for each of the core + HBase operations. + +
+ Get/Scan + + Gets are implemented on top of Scans. The below discussion of Get + applies equally to Scans. + + By default, i.e. if you specify no explicit version, when doing a + get, the cell whose version has the largest value is returned + (which may or may not be the latest one written, see later). The default behavior can be + modified in the following ways: + + + + to return more than one version, see Get.setMaxVersions() + + + + to return versions other than the latest, see Get.setTimeRange() + + To retrieve the latest version that is less than or equal to a given value, thus + giving the 'latest' state of the record at a certain point in time, just use a range + from 0 to the desired version and set the max versions to 1. + + + +
+
+ Default Get Example + The following Get will only retrieve the current version of the row + +public static final byte[] CF = "cf".getBytes(); +public static final byte[] ATTR = "attr".getBytes(); +... +Get get = new Get(Bytes.toBytes("row1")); +Result r = table.get(get); +byte[] b = r.getValue(CF, ATTR); // returns current version of value + +
+
+ Versioned Get Example + The following Get will return the last 3 versions of the row. + +public static final byte[] CF = "cf".getBytes(); +public static final byte[] ATTR = "attr".getBytes(); +... +Get get = new Get(Bytes.toBytes("row1")); +get.setMaxVersions(3); // will return last 3 versions of row +Result r = table.get(get); +byte[] b = r.getValue(CF, ATTR); // returns current version of value +List<KeyValue> kv = r.getColumn(CF, ATTR); // returns all versions of this column + +
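The "latest version at or before a given timestamp" pattern described earlier can be expressed as follows. This is a sketch in the same fragment style as the previous examples; the timestamp value is a placeholder.

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
long asOf = 1400000000000L;    // example timestamp of interest
Get get = new Get(Bytes.toBytes("row1"));
get.setTimeRange(0, asOf + 1); // the upper bound is exclusive, so add 1 to include asOf
get.setMaxVersions(1);         // only the newest version inside the range
Result r = table.get(get);
byte[] b = r.getValue(CF, ATTR); // the value as of the given timestamp, if any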
+ +
+ Put + + Doing a put always creates a new version of a cell, at a certain + timestamp. By default the system uses the server's currentTimeMillis, + but you can specify the version (= the long integer) yourself, on a per-column level. + This means you could assign a time in the past or the future, or use the long value for + non-time purposes. + + To overwrite an existing value, do a put at exactly the same row, column, and + version as that of the cell you would overshadow. +
+ Implicit Version Example + The following Put will be implicitly versioned by HBase with the current + time. + +public static final byte[] CF = "cf".getBytes(); +public static final byte[] ATTR = "attr".getBytes(); +... +Put put = new Put(Bytes.toBytes(row)); +put.add(CF, ATTR, Bytes.toBytes( data)); +table.put(put); + +
+
+ Explicit Version Example + The following Put has the version timestamp explicitly set. + +public static final byte[] CF = "cf".getBytes(); +public static final byte[] ATTR = "attr".getBytes(); +... +Put put = new Put( Bytes.toBytes(row)); +long explicitTimeInMs = 555; // just an example +put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data)); +table.put(put); + + Caution: the version timestamp is internally by HBase for things like time-to-live + calculations. It's usually best to avoid setting this timestamp yourself. Prefer using + a separate timestamp attribute of the row, or have the timestamp a part of the rowkey, + or both. +
+ +
+ +
+ Delete + + There are three different types of internal delete markers. See Lars Hofhansl's blog + for discussion of his attempt adding another, Scanning + in HBase: Prefix Delete Marker. + + + Delete: for a specific version of a column. + + + Delete column: for all versions of a column. + + + Delete family: for all columns of a particular ColumnFamily + + + When deleting an entire row, HBase will internally create a tombstone for each + ColumnFamily (i.e., not each individual column). + Deletes work by creating tombstone markers. For example, let's + suppose we want to delete a row. For this you can specify a version, or else by default + the currentTimeMillis is used. What this means is delete all + cells where the version is less than or equal to this version. HBase never + modifies data in place, so for example a delete will not immediately delete (or mark as + deleted) the entries in the storage file that correspond to the delete condition. + Rather, a so-called tombstone is written, which will mask the + deleted values. When HBase does a major compaction, the tombstones are processed to + actually remove the dead values, together with the tombstones themselves. If the version + you specified when deleting a row is larger than the version of any value in the row, + then you can consider the complete row to be deleted. + For an informative discussion on how deletes and versioning interact, see the thread Put w/ + timestamp -> Deleteall -> Put w/ timestamp fails up on the user mailing + list. + Also see for more information on the internal KeyValue format. + Delete markers are purged during the next major compaction of the store, unless the + option is set in the column family. To keep the + deletes for a configurable amount of time, you can set the delete TTL via the + property in + hbase-site.xml. If + is not set, or set to 0, all + delete markers, including those with timestamps in the future, are purged during the + next major compaction. Otherwise, a delete marker with a timestamp in the future is kept + until the major compaction which occurs after the time represented by the marker's + timestamp plus the value of , in + milliseconds. + + This behavior represents a fix for an unexpected change that was introduced in + HBase 0.94, and was fixed in HBASE-10118. + The change has been backported to HBase 0.94 and newer branches. + +
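The three marker types map directly onto the Delete API. The following sketch uses the method names from HBase 0.98+/1.0 (earlier releases use deleteColumn, deleteColumns, and deleteFamily), in the same fragment style as the other examples; the timestamp is a placeholder.

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Delete d = new Delete(Bytes.toBytes("row1"));
d.addColumn(CF, ATTR, 555L); // Delete: one specific version of a column (555 is an example timestamp)
d.addColumns(CF, ATTR);      // Delete column: all versions of a column
d.addFamily(CF);             // Delete family: all columns of the ColumnFamily
table.delete(d);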
+
+ +
+ Current Limitations + +
+ Deletes mask Puts
+
+ Deletes mask puts, even puts that happened after the delete
+ was entered. See HBASE-2256. Remember that a delete writes a tombstone, which only
+ disappears after the next major compaction has run. Suppose you do
+ a delete of everything <= T. After this you do a new put with a
+ timestamp <= T. This put, even if it happened after the delete,
+ will be masked by the delete tombstone. Performing the put will not
+ fail, but when you do a get you will notice the put had no
+ effect. It will start working again after the major compaction has
+ run. These issues should not be a problem if you use
+ always-increasing versions for new puts to a row. But they can occur
+ even if you do not care about time: just do delete and put
+ immediately after each other, and there is some chance they happen
+ within the same millisecond.
+
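A compact sketch of the behavior described above, illustrative only and in the style of the earlier fragments:

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
long T = System.currentTimeMillis();
Delete d = new Delete(Bytes.toBytes("row1"));
d.addColumns(CF, ATTR, T);                     // tombstone masks every version <= T
table.delete(d);

Put p = new Put(Bytes.toBytes("row1"));
p.add(CF, ATTR, T - 1, Bytes.toBytes("data")); // the put succeeds, but its version is <= T ...
table.put(p);

Result r = table.get(new Get(Bytes.toBytes("row1")));
byte[] b = r.getValue(CF, ATTR);               // ... so this is null: the cell is masked by the tombstone
                                               // and is purged at the next major compaction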
+ +
+ Major compactions change query results + + ...create three cell versions at t1, t2 and t3, with a maximum-versions + setting of 2. So when getting all versions, only the values at t2 and t3 will be + returned. But if you delete the version at t2 or t3, the one at t1 will appear again. + Obviously, once a major compaction has run, such behavior will not be the case + anymore... (See Garbage Collection in Bending time in + HBase.) +
+
+
+
+ Sort Order
+ All data model operations in HBase return data in sorted order. First by row,
+ then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted
+ in reverse, so newest records are returned first).
+
+
+
+ Column Metadata
+ There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily.
+ Thus, while HBase can support not only a large number of columns per row, but a heterogeneous set of columns
+ between rows as well, it is your responsibility to keep track of the column names.
+
+ The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows.
+ For more information about how HBase stores data internally, see .
+
+
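For example, a client that needs the set of qualifiers used in a family has to derive it by scanning, roughly as in this sketch (HBase 0.96+ Cell API assumed; the family is whatever the caller passes in):

// Assumes the standard client imports, org.apache.hadoop.hbase.Cell, org.apache.hadoop.hbase.CellUtil,
// and java.util.Set / java.util.TreeSet.
public static Set<String> qualifiersOf(Table table, byte[] family) throws IOException {
  Set<String> qualifiers = new TreeSet<String>();
  Scan scan = new Scan();
  scan.addFamily(family);
  try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
      for (Cell cell : r.rawCells()) {
        qualifiers.add(Bytes.toString(CellUtil.cloneQualifier(cell)));
      }
    }
  }
  return qualifiers; // every qualifier seen in this family, at the cost of a full scan
}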
+
Joins
+ Whether HBase supports joins is a common question on the dist-list, and there is a simple answer: it doesn't,
+ at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins in SQL). As has been illustrated
+ in this chapter, the read data model operations in HBase are Get and Scan.
+
+ However, that doesn't mean that equivalent join functionality can't be supported in your application, but
+ you have to do it yourself. The two primary strategies are either to denormalize the data upon writing to HBase,
+ or to have lookup tables and do the join between HBase tables in your application or MapReduce code (and as RDBMSs
+ demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs.
+ hash-joins). So which is the best approach? It depends on what you are trying to do, and as such there isn't a single
+ answer that works for every use case.
+
+
ACID + See ACID Semantics. + Lars Hofhansl has also written a note on + ACID in HBase. +
+ diff --git a/src/main/docbkx/faq.xml b/src/main/docbkx/faq.xml new file mode 100644 index 00000000000..d7bcb0cfc4e --- /dev/null +++ b/src/main/docbkx/faq.xml @@ -0,0 +1,270 @@ + + + + FAQ + + General + + When should I use HBase? + + See the in the Architecture chapter. + + + + + Are there other HBase FAQs? + + + See the FAQ that is up on the wiki, HBase Wiki FAQ. + + + + + Does HBase support SQL? + + + Not really. SQL-ish support for HBase via Hive is in development, however Hive is based on MapReduce which is not generally suitable for low-latency requests. + See the section for examples on the HBase client. + + + + + How can I find examples of NoSQL/HBase? + + See the link to the BigTable paper in in the appendix, as + well as the other papers. + + + + + What is the history of HBase? + + See . + + + + + + Upgrading + + + How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+? + + + In HBase 0.96, the project moved to a modular structure. Adjust your project's + dependencies to rely upon the hbase-client module or another + module as appropriate, rather than a single JAR. You can model your Maven depency + after one of the following, depending on your targeted version of HBase. See or for more + information. + + Maven Dependency for HBase 0.98 + + org.apache.hbase + hbase-client + 0.98.5-hadoop2 + + ]]> + + + Maven Dependency for HBase 0.96 + + org.apache.hbase + hbase-client + 0.96.2-hadoop2 + + ]]> + + + Maven Dependency for HBase 0.94 + + org.apache.hbase + hbase + 0.94.3 + + ]]> + + + + + Architecture + + How does HBase handle Region-RegionServer assignment and locality? + + + See . + + + + + Configuration + + How can I get started with my first cluster? + + + See . + + + + + Where can I learn about the rest of the configuration options? + + + See . + + + + + Schema Design / Data Access + + How should I design my schema in HBase? + + + See and + + + + + + How can I store (fill in the blank) in HBase? + + + + See . + + + + + + How can I handle secondary indexes in HBase? + + + + See + + + + + Can I change a table's rowkeys? + + This is a very common question. You can't. See . + + + + What APIs does HBase support? + + + See , and . + + + + + MapReduce + + How can I use MapReduce with HBase? + + + See + + + + + Performance and Troubleshooting + + + How can I improve HBase cluster performance? + + + + See . + + + + + + How can I troubleshoot my HBase cluster? + + + + See . + + + + + Amazon EC2 + + + I am running HBase on Amazon EC2 and... + + + + EC2 issues are a special case. See Troubleshooting and Performance sections. + + + + + Operations + + + How do I manage my HBase cluster? + + + + See + + + + + + How do I back up my HBase cluster? + + + + See + + + + + HBase in Action + + Where can I find interesting videos and presentations on HBase? + + + See + + + + + + + diff --git a/src/main/docbkx/hbase-default.xml b/src/main/docbkx/hbase-default.xml new file mode 100644 index 00000000000..125e3d21c9f --- /dev/null +++ b/src/main/docbkx/hbase-default.xml @@ -0,0 +1,538 @@ +HBase Default Configuration +The documentation below is generated using the default hbase configuration file, +hbase-default.xml, as source. +hbase.tmp.dirTemporary directory on the local filesystem. + Change this setting to point to a location more permanent + than '/tmp', the usual resolve for java.io.tmpdir, as the + '/tmp' directory is cleared on machine restart.Default${java.io.tmpdir}/hbase-${user.name}hbase.rootdirThe directory shared by region servers and into + which HBase persists. 
The URL should be 'fully-qualified' + to include the filesystem scheme. For example, to specify the + HDFS directory '/hbase' where the HDFS instance's namenode is + running at namenode.example.org on port 9000, set this value to: + hdfs://namenode.example.org:9000/hbase. By default, we write + to whatever ${hbase.tmp.dir} is set too -- usually /tmp -- + so change this configuration or else all data will be lost on + machine restart.Default${hbase.tmp.dir}/hbasehbase.cluster.distributedThe mode the cluster will be in. Possible values are + false for standalone mode and true for distributed mode. If + false, startup will run all HBase and ZooKeeper daemons together + in the one JVM.Defaultfalsehbase.zookeeper.quorumComma separated list of servers in the ZooKeeper ensemble + (This config. should have been named hbase.zookeeper.ensemble). + For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". + By default this is set to localhost for local and pseudo-distributed modes + of operation. For a fully-distributed setup, this should be set to a full + list of ZooKeeper ensemble servers. If HBASE_MANAGES_ZK is set in hbase-env.sh + this is the list of servers which hbase will start/stop ZooKeeper on as + part of cluster start/stop. Client-side, we will take this list of + ensemble members and put it together with the hbase.zookeeper.clientPort + config. and pass it into zookeeper constructor as the connectString + parameter.Defaultlocalhosthbase.local.dirDirectory on the local filesystem to be used + as a local storage.Default${hbase.tmp.dir}/local/hbase.master.info.portThe port for the HBase Master web UI. + Set to -1 if you do not want a UI instance run.Default16010hbase.master.info.bindAddressThe bind address for the HBase Master web UI + Default0.0.0.0hbase.master.logcleaner.pluginsA comma-separated list of BaseLogCleanerDelegate invoked by + the LogsCleaner service. These WAL cleaners are called in order, + so put the cleaner that prunes the most files in front. To + implement your own BaseLogCleanerDelegate, just put it in HBase's classpath + and add the fully qualified class name here. Always add the above + default log cleaners in the list.Defaultorg.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleanerhbase.master.logcleaner.ttlMaximum time a WAL can stay in the .oldlogdir directory, + after which it will be cleaned by a Master thread.Default600000hbase.master.hfilecleaner.pluginsA comma-separated list of BaseHFileCleanerDelegate invoked by + the HFileCleaner service. These HFiles cleaners are called in order, + so put the cleaner that prunes the most files in front. To + implement your own BaseHFileCleanerDelegate, just put it in HBase's classpath + and add the fully qualified class name here. 
Always add the above + default log cleaners in the list as they will be overwritten in + hbase-site.xml.Defaultorg.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleanerhbase.master.catalog.timeoutTimeout value for the Catalog Janitor from the master to + META.Default600000hbase.master.infoserver.redirectWhether or not the Master listens to the Master web + UI port (hbase.master.info.port) and redirects requests to the web + UI server shared by the Master and RegionServer.Defaulttruehbase.regionserver.portThe port the HBase RegionServer binds to.Default16020hbase.regionserver.info.portThe port for the HBase RegionServer web UI + Set to -1 if you do not want the RegionServer UI to run.Default16030hbase.regionserver.info.bindAddressThe address for the HBase RegionServer web UIDefault0.0.0.0hbase.regionserver.info.port.autoWhether or not the Master or RegionServer + UI should search for a port to bind to. Enables automatic port + search if hbase.regionserver.info.port is already in use. + Useful for testing, turned off by default.Defaultfalsehbase.regionserver.handler.countCount of RPC Listener instances spun up on RegionServers. + Same property is used by the Master for count of master handlers.Default30hbase.ipc.server.callqueue.handler.factorFactor to determine the number of call queues. + A value of 0 means a single queue shared between all the handlers. + A value of 1 means that each handler has its own queue.Default0.1hbase.ipc.server.callqueue.read.ratioSplit the call queues into read and write queues. + The specified interval (which should be between 0.0 and 1.0) + will be multiplied by the number of call queues. + A value of 0 indicate to not split the call queues, meaning that both read and write + requests will be pushed to the same set of queues. + A value lower than 0.5 means that there will be less read queues than write queues. + A value of 0.5 means there will be the same number of read and write queues. + A value greater than 0.5 means that there will be more read queues than write queues. + A value of 1.0 means that all the queues except one are used to dispatch read requests. + + Example: Given the total number of call queues being 10 + a read.ratio of 0 means that: the 10 queues will contain both read/write requests. + a read.ratio of 0.3 means that: 3 queues will contain only read requests + and 7 queues will contain only write requests. + a read.ratio of 0.5 means that: 5 queues will contain only read requests + and 5 queues will contain only write requests. + a read.ratio of 0.8 means that: 8 queues will contain only read requests + and 2 queues will contain only write requests. + a read.ratio of 1 means that: 9 queues will contain only read requests + and 1 queues will contain only write requests. + Default0hbase.ipc.server.callqueue.scan.ratioGiven the number of read call queues, calculated from the total number + of call queues multiplied by the callqueue.read.ratio, the scan.ratio property + will split the read call queues into small-read and long-read queues. + A value lower than 0.5 means that there will be less long-read queues than short-read queues. + A value of 0.5 means that there will be the same number of short-read and long-read queues. + A value greater than 0.5 means that there will be more long-read queues than short-read queues + A value of 0 or 1 indicate to use the same set of queues for gets and scans. 
+ + Example: Given the total number of read call queues being 8 + a scan.ratio of 0 or 1 means that: 8 queues will contain both long and short read requests. + a scan.ratio of 0.3 means that: 2 queues will contain only long-read requests + and 6 queues will contain only short-read requests. + a scan.ratio of 0.5 means that: 4 queues will contain only long-read requests + and 4 queues will contain only short-read requests. + a scan.ratio of 0.8 means that: 6 queues will contain only long-read requests + and 2 queues will contain only short-read requests. + Default0hbase.regionserver.msgintervalInterval between messages from the RegionServer to Master + in milliseconds.Default3000hbase.regionserver.regionSplitLimitLimit for the number of regions after which no more region + splitting should take place. This is not a hard limit for the number of + regions but acts as a guideline for the regionserver to stop splitting after + a certain limit. Default is MAX_INT; i.e. do not block splitting.Default2147483647hbase.regionserver.logroll.periodPeriod at which we will roll the commit log regardless + of how many edits it has.Default3600000hbase.regionserver.logroll.errors.toleratedThe number of consecutive WAL close errors we will allow + before triggering a server abort. A setting of 0 will cause the + region server to abort if closing the current WAL writer fails during + log rolling. Even a small value (2 or 3) will allow a region server + to ride over transient HDFS errors.Default2hbase.regionserver.hlog.reader.implThe WAL file reader implementation.Defaultorg.apache.hadoop.hbase.regionserver.wal.ProtobufLogReaderhbase.regionserver.hlog.writer.implThe WAL file writer implementation.Defaultorg.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriterhbase.master.distributed.log.replayEnable 'distributed log replay' as default engine splitting + WAL files on server crash. This default is new in hbase 1.0. To fall + back to the old mode 'distributed log splitter', set the value to + 'false'. 'Disributed log replay' improves MTTR because it does not + write intermediate files. 'DLR' required that 'hfile.format.version' + be set to version 3 or higher. + Defaulttruehbase.regionserver.global.memstore.sizeMaximum size of all memstores in a region server before new + updates are blocked and flushes are forced. Defaults to 40% of heap. + Updates are blocked and flushes are forced until size of all memstores + in a region server hits hbase.regionserver.global.memstore.size.lower.limit.Default0.4hbase.regionserver.global.memstore.size.lower.limitMaximum size of all memstores in a region server before flushes are forced. + Defaults to 95% of hbase.regionserver.global.memstore.size. + A 100% value for this value causes the minimum possible flushing to occur when updates are + blocked due to memstore limiting.Default0.95hbase.regionserver.optionalcacheflushinterval + Maximum amount of time an edit lives in memory before being automatically flushed. + Default 1 hour. 
Set it to 0 to disable automatic flushing.Default3600000hbase.regionserver.catalog.timeoutTimeout value for the Catalog Janitor from the regionserver to META.Default600000hbase.regionserver.dns.interfaceThe name of the Network Interface from which a region server + should report its IP address.Defaultdefaulthbase.regionserver.dns.nameserverThe host name or IP address of the name server (DNS) + which a region server should use to determine the host name used by the + master for communication and display purposes.Defaultdefaulthbase.regionserver.region.split.policy + A split policy determines when a region should be split. The various other split policies that + are available currently are ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy, + DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy etc. + Defaultorg.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicyzookeeper.session.timeoutZooKeeper session timeout in milliseconds. It is used in two different ways. + First, this value is used in the ZK client that HBase uses to connect to the ensemble. + It is also used by HBase when it starts a ZK server and it is passed as the 'maxSessionTimeout'. See + http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions. + For example, if a HBase region server connects to a ZK ensemble that's also managed by HBase, then the + session timeout will be the one specified by this configuration. But, a region server that connects + to an ensemble managed with a different configuration will be subjected that ensemble's maxSessionTimeout. So, + even though HBase might propose using 90 seconds, the ensemble can have a max timeout lower than this and + it will take precedence. The current default that ZK ships with is 40 seconds, which is lower than HBase's. + Default90000zookeeper.znode.parentRoot ZNode for HBase in ZooKeeper. All of HBase's ZooKeeper + files that are configured with a relative path will go under this node. + By default, all of HBase's ZooKeeper file path are configured with a + relative path, so they will all go under this directory unless changed.Default/hbasezookeeper.znode.rootserverPath to ZNode holding root region location. This is written by + the master and read by clients and region servers. If a relative path is + given, the parent folder will be ${zookeeper.znode.parent}. By default, + this means the root location is stored at /hbase/root-region-server.Defaultroot-region-serverzookeeper.znode.acl.parentRoot ZNode for access control lists.Defaultaclhbase.zookeeper.dns.interfaceThe name of the Network Interface from which a ZooKeeper server + should report its IP address.Defaultdefaulthbase.zookeeper.dns.nameserverThe host name or IP address of the name server (DNS) + which a ZooKeeper server should use to determine the host name used by the + master for communication and display purposes.Defaultdefaulthbase.zookeeper.peerportPort used by ZooKeeper peers to talk to each other. + See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper + for more information.Default2888hbase.zookeeper.leaderportPort used by ZooKeeper for leader election. + See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper + for more information.Default3888hbase.zookeeper.useMultiInstructs HBase to make use of ZooKeeper's multi-update functionality. 
+ This allows certain ZooKeeper operations to complete more quickly and prevents some issues + with rare Replication failure scenarios (see the release note of HBASE-2611 for an example). + IMPORTANT: only set this to true if all ZooKeeper servers in the cluster are on version 3.4+ + and will not be downgraded. ZooKeeper versions before 3.4 do not support multi-update and + will not fail gracefully if multi-update is invoked (see ZOOKEEPER-1495).Defaulttruehbase.config.read.zookeeper.config + Set to true to allow HBaseConfiguration to read the + zoo.cfg file for ZooKeeper properties. Switching this to true + is not recommended, since the functionality of reading ZK + properties from a zoo.cfg file has been deprecated.Defaultfalsehbase.zookeeper.property.initLimitProperty from ZooKeeper's config zoo.cfg. + The number of ticks that the initial synchronization phase can take.Default10hbase.zookeeper.property.syncLimitProperty from ZooKeeper's config zoo.cfg. + The number of ticks that can pass between sending a request and getting an + acknowledgment.Default5hbase.zookeeper.property.dataDirProperty from ZooKeeper's config zoo.cfg. + The directory where the snapshot is stored.Default${hbase.tmp.dir}/zookeeperhbase.zookeeper.property.clientPortProperty from ZooKeeper's config zoo.cfg. + The port at which the clients will connect.Default2181hbase.zookeeper.property.maxClientCnxnsProperty from ZooKeeper's config zoo.cfg. + Limit on number of concurrent connections (at the socket level) that a + single client, identified by IP address, may make to a single member of + the ZooKeeper ensemble. Set high to avoid zk connection issues running + standalone and pseudo-distributed.Default300hbase.client.write.bufferDefault size of the HTable client write buffer in bytes. + A bigger buffer takes more memory -- on both the client and server + side since server instantiates the passed write buffer to process + it -- but a larger buffer size reduces the number of RPCs made. + For an estimate of server-side memory-used, evaluate + hbase.client.write.buffer * hbase.regionserver.handler.countDefault2097152hbase.client.pauseGeneral client pause value. Used mostly as value to wait + before running a retry of a failed get, region lookup, etc. + See hbase.client.retries.number for description of how we backoff from + this initial pause amount and how this pause works w/ retries.Default100hbase.client.retries.numberMaximum retries. Used as maximum for all retryable + operations such as the getting of a cell's value, starting a row update, + etc. Retry interval is a rough function based on hbase.client.pause. At + first we retry at this interval but then with backoff, we pretty quickly reach + retrying every ten seconds. See HConstants#RETRY_BACKOFF for how the backup + ramps up. Change this setting and hbase.client.pause to suit your workload.Default35hbase.client.max.total.tasksThe maximum number of concurrent tasks a single HTable instance will + send to the cluster.Default100hbase.client.max.perserver.tasksThe maximum number of concurrent tasks a single HTable instance will + send to a single region server.Default5hbase.client.max.perregion.tasksThe maximum number of concurrent connections the client will + maintain to a single Region. 
That is, if there is already + hbase.client.max.perregion.tasks writes in progress for this region, new puts + won't be sent to this region until some writes finishes.Default1hbase.client.scanner.cachingNumber of rows that will be fetched when calling next + on a scanner if it is not served from (local, client) memory. Higher + caching values will enable faster scanners but will eat up more memory + and some calls of next may take longer and longer times when the cache is empty. + Do not set this value such that the time between invocations is greater + than the scanner timeout; i.e. hbase.client.scanner.timeout.periodDefault100hbase.client.keyvalue.maxsizeSpecifies the combined maximum allowed size of a KeyValue + instance. This is to set an upper boundary for a single entry saved in a + storage file. Since they cannot be split it helps avoiding that a region + cannot be split any further because the data is too large. It seems wise + to set this to a fraction of the maximum region size. Setting it to zero + or less disables the check.Default10485760hbase.client.scanner.timeout.periodClient scanner lease period in milliseconds.Default60000hbase.client.localityCheck.threadPoolSizeDefault2hbase.bulkload.retries.numberMaximum retries. This is maximum number of iterations + to atomic bulk loads are attempted in the face of splitting operations + 0 means never give up.Default10hbase.balancer.period + Period at which the region balancer runs in the Master.Default300000hbase.regions.slopRebalance if any regionserver has average + (average * slop) regions.Default0.2hbase.server.thread.wakefrequencyTime to sleep in between searches for work (in milliseconds). + Used as sleep interval by service threads such as log roller.Default10000hbase.server.versionfile.writeattempts + How many time to retry attempting to write a version file + before just aborting. Each attempt is seperated by the + hbase.server.thread.wakefrequency milliseconds.Default3hbase.hregion.memstore.flush.size + Memstore will be flushed to disk if size of the memstore + exceeds this number of bytes. Value is checked by a thread that runs + every hbase.server.thread.wakefrequency.Default134217728hbase.hregion.percolumnfamilyflush.size.lower.bound + If FlushLargeStoresPolicy is used, then every time that we hit the + total memstore limit, we find out all the column families whose memstores + exceed this value, and only flush them, while retaining the others whose + memstores are lower than this limit. If none of the families have their + memstore size more than this, all the memstores will be flushed + (just as usual). This value should be less than half of the total memstore + threshold (hbase.hregion.memstore.flush.size). + Default16777216hbase.hregion.preclose.flush.size + If the memstores in a region are this size or larger when we go + to close, run a "pre-flush" to clear out memstores before we put up + the region closed flag and take the region offline. On close, + a flush is run under the close flag to empty memory. During + this time the region is offline and we are not taking on any writes. + If the memstore content is large, this flush could take a long time to + complete. 
The preflush is meant to clean out the bulk of the memstore + before putting up the close flag and taking the region offline so the + flush that runs under the close flag has little to do.Default5242880hbase.hregion.memstore.block.multiplier + Block updates if memstore has hbase.hregion.memstore.block.multiplier + times hbase.hregion.memstore.flush.size bytes. Useful preventing + runaway memstore during spikes in update traffic. Without an + upper-bound, memstore fills such that when it flushes the + resultant flush files take a long time to compact or split, or + worse, we OOME.Default4hbase.hregion.memstore.mslab.enabled + Enables the MemStore-Local Allocation Buffer, + a feature which works to prevent heap fragmentation under + heavy write loads. This can reduce the frequency of stop-the-world + GC pauses on large heaps.Defaulttruehbase.hregion.max.filesize + Maximum HFile size. If the sum of the sizes of a region's HFiles has grown to exceed this + value, the region is split in two.Default10737418240hbase.hregion.majorcompactionTime between major compactions, expressed in milliseconds. Set to 0 to disable + time-based automatic major compactions. User-requested and size-based major compactions will + still run. This value is multiplied by hbase.hregion.majorcompaction.jitter to cause + compaction to start at a somewhat-random time during a given window of time. The default value + is 7 days, expressed in milliseconds. If major compactions are causing disruption in your + environment, you can configure them to run at off-peak times for your deployment, or disable + time-based major compactions by setting this parameter to 0, and run major compactions in a + cron job or by another external mechanism.Default604800000hbase.hregion.majorcompaction.jitterA multiplier applied to hbase.hregion.majorcompaction to cause compaction to occur + a given amount of time either side of hbase.hregion.majorcompaction. The smaller the number, + the closer the compactions will happen to the hbase.hregion.majorcompaction + interval.Default0.50hbase.hstore.compactionThreshold If more than this number of StoreFiles exist in any one Store + (one StoreFile is written per flush of MemStore), a compaction is run to rewrite all + StoreFiles into a single StoreFile. Larger values delay compaction, but when compaction does + occur, it takes longer to complete.Default3hbase.hstore.flusher.count The number of flush threads. With fewer threads, the MemStore flushes will be + queued. With more threads, the flushes will be executed in parallel, increasing the load on + HDFS, and potentially causing more compactions. Default2hbase.hstore.blockingStoreFiles If more than this number of StoreFiles exist in any one Store (one StoreFile + is written per flush of MemStore), updates are blocked for this region until a compaction is + completed, or until hbase.hstore.blockingWaitTime has been exceeded.Default10hbase.hstore.blockingWaitTime The time for which a region will block updates after reaching the StoreFile limit + defined by hbase.hstore.blockingStoreFiles. After this time has elapsed, the region will stop + blocking updates even if a compaction has not been completed.Default90000hbase.hstore.compaction.minThe minimum number of StoreFiles which must be eligible for compaction before + compaction can run. The goal of tuning hbase.hstore.compaction.min is to avoid ending up with + too many tiny StoreFiles to compact. 
Setting this value to 2 would cause a minor compaction + each time you have two StoreFiles in a Store, and this is probably not appropriate. If you + set this value too high, all the other values will need to be adjusted accordingly. For most + cases, the default value is appropriate. In previous versions of HBase, the parameter + hbase.hstore.compaction.min was named hbase.hstore.compactionThreshold.Default3hbase.hstore.compaction.maxThe maximum number of StoreFiles which will be selected for a single minor + compaction, regardless of the number of eligible StoreFiles. Effectively, the value of + hbase.hstore.compaction.max controls the length of time it takes a single compaction to + complete. Setting it larger means that more StoreFiles are included in a compaction. For most + cases, the default value is appropriate.Default10hbase.hstore.compaction.min.sizeA StoreFile smaller than this size will always be eligible for minor compaction. + HFiles this size or larger are evaluated by hbase.hstore.compaction.ratio to determine if + they are eligible. Because this limit represents the "automatic include"limit for all + StoreFiles smaller than this value, this value may need to be reduced in write-heavy + environments where many StoreFiles in the 1-2 MB range are being flushed, because every + StoreFile will be targeted for compaction and the resulting StoreFiles may still be under the + minimum size and require further compaction. If this parameter is lowered, the ratio check is + triggered more quickly. This addressed some issues seen in earlier versions of HBase but + changing this parameter is no longer necessary in most situations. Default: 128 MB expressed + in bytes.Default134217728hbase.hstore.compaction.max.sizeA StoreFile larger than this size will be excluded from compaction. The effect of + raising hbase.hstore.compaction.max.size is fewer, larger StoreFiles that do not get + compacted often. If you feel that compaction is happening too often without much benefit, you + can try raising this value. Default: the value of LONG.MAX_VALUE, expressed in bytes.Default9223372036854775807hbase.hstore.compaction.ratioFor minor compaction, this ratio is used to determine whether a given StoreFile + which is larger than hbase.hstore.compaction.min.size is eligible for compaction. Its + effect is to limit compaction of large StoreFiles. The value of hbase.hstore.compaction.ratio + is expressed as a floating-point decimal. A large ratio, such as 10, will produce a single + giant StoreFile. Conversely, a low value, such as .25, will produce behavior similar to the + BigTable compaction algorithm, producing four StoreFiles. A moderate value of between 1.0 and + 1.4 is recommended. When tuning this value, you are balancing write costs with read costs. + Raising the value (to something like 1.4) will have more write costs, because you will + compact larger StoreFiles. However, during reads, HBase will need to seek through fewer + StoreFiles to accomplish the read. Consider this approach if you cannot take advantage of + Bloom filters. Otherwise, you can lower this value to something like 1.0 to reduce the + background cost of writes, and use Bloom filters to control the number of StoreFiles touched + during reads. For most cases, the default value is appropriate.Default1.2Fhbase.hstore.compaction.ratio.offpeakAllows you to set a different (by default, more aggressive) ratio for determining + whether larger StoreFiles are included in compactions during off-peak hours. 
Works in the + same way as hbase.hstore.compaction.ratio. Only applies if hbase.offpeak.start.hour and + hbase.offpeak.end.hour are also enabled.Default5.0Fhbase.hstore.time.to.purge.deletesThe amount of time to delay purging of delete markers with future timestamps. If + unset, or set to 0, all delete markers, including those with future timestamps, are purged + during the next major compaction. Otherwise, a delete marker is kept until the major compaction + which occurs after the marker's timestamp plus the value of this setting, in milliseconds. + Default0hbase.offpeak.start.hourThe start of off-peak hours, expressed as an integer between 0 and 23, inclusive. + Set to -1 to disable off-peak.Default-1hbase.offpeak.end.hourThe end of off-peak hours, expressed as an integer between 0 and 23, inclusive. Set + to -1 to disable off-peak.Default-1hbase.regionserver.thread.compaction.throttleThere are two different thread pools for compactions, one for large compactions and + the other for small compactions. This helps to keep compaction of lean tables (such as + hbase:meta) fast. If a compaction is larger than this threshold, it + goes into the large compaction pool. In most cases, the default value is appropriate. Default: + 2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size (which defaults to 128MB). + The value field assumes that the value of hbase.hregion.memstore.flush.size is unchanged from + the default.Default2684354560hbase.hstore.compaction.kv.maxThe maximum number of KeyValues to read and then write in a batch when flushing or + compacting. Set this lower if you have big KeyValues and problems with Out Of Memory + Exceptions Set this higher if you have wide, small rows. Default10hbase.storescanner.parallel.seek.enable + Enables StoreFileScanner parallel-seeking in StoreScanner, + a feature which can reduce response latency under special conditions.Defaultfalsehbase.storescanner.parallel.seek.threads + The default thread pool size if parallel-seeking feature enabled.Default10hfile.block.cache.sizePercentage of maximum heap (-Xmx setting) to allocate to block cache + used by a StoreFile. Default of 0.4 means allocate 40%. + Set to 0 to disable but it's not recommended; you need at least + enough cache to hold the storefile indices.Default0.4hfile.block.index.cacheonwriteThis allows to put non-root multi-level index blocks into the block + cache at the time the index is being written.Defaultfalsehfile.index.block.max.sizeWhen the size of a leaf-level, intermediate-level, or root-level + index block in a multi-level block index grows to this size, the + block is written out and a new block is started.Default131072hbase.bucketcache.ioengineWhere to store the contents of the bucketcache. One of: onheap, + offheap, or file. If a file, set it to file:PATH_TO_FILE. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html for more information. + Defaulthbase.bucketcache.combinedcache.enabledWhether or not the bucketcache is used in league with the LRU + on-heap block cache. In this mode, indices and blooms are kept in the LRU + blockcache and the data blocks are kept in the bucketcache.Defaulttruehbase.bucketcache.sizeThe size of the buckets for the bucketcache if you only use a single size. + Defaults to the default blocksize, which is 64 * 1024.Default65536hbase.bucketcache.sizesA comma-separated list of sizes for buckets for the bucketcache + if you use multiple sizes. Should be a list of block sizes in order from smallest + to largest. 
The sizes you use will depend on your data access patterns.Defaulthfile.format.versionThe HFile format version to use for new files. + Version 3 adds support for tags in hfiles (See http://hbase.apache.org/book.html#hbase.tags). + Distributed Log Replay requires that tags are enabled. Also see the configuration + 'hbase.replication.rpc.codec'. + Default3hfile.block.bloom.cacheonwriteEnables cache-on-write for inline blocks of a compound Bloom filter.Defaultfalseio.storefile.bloom.block.sizeThe size in bytes of a single block ("chunk") of a compound Bloom + filter. This size is approximate, because Bloom blocks can only be + inserted at data block boundaries, and the number of keys per data + block varies.Default131072hbase.rs.cacheblocksonwriteWhether an HFile block should be added to the block cache when the + block is finished.Defaultfalsehbase.rpc.timeoutThis is for the RPC layer to define how long HBase client applications + take for a remote call to time out. It uses pings to check connections + but will eventually throw a TimeoutException.Default60000hbase.rpc.shortoperation.timeoutThis is another version of "hbase.rpc.timeout". For those RPC operation + within cluster, we rely on this configuration to set a short timeout limitation + for short operation. For example, short rpc timeout for region server's trying + to report to active master can benefit quicker master failover process.Default10000hbase.ipc.client.tcpnodelaySet no delay on rpc socket connections. See + http://docs.oracle.com/javase/1.5.0/docs/api/java/net/Socket.html#getTcpNoDelay()Defaulttruehbase.master.keytab.fileFull path to the kerberos keytab file to use for logging in + the configured HMaster server principal.Defaulthbase.master.kerberos.principalEx. "hbase/_HOST@EXAMPLE.COM". The kerberos principal name + that should be used to run the HMaster process. The principal name should + be in the form: user/hostname@DOMAIN. If "_HOST" is used as the hostname + portion, it will be replaced with the actual hostname of the running + instance.Defaulthbase.regionserver.keytab.fileFull path to the kerberos keytab file to use for logging in + the configured HRegionServer server principal.Defaulthbase.regionserver.kerberos.principalEx. "hbase/_HOST@EXAMPLE.COM". The kerberos principal name + that should be used to run the HRegionServer process. The principal name + should be in the form: user/hostname@DOMAIN. If "_HOST" is used as the + hostname portion, it will be replaced with the actual hostname of the + running instance. An entry for this principal must exist in the file + specified in hbase.regionserver.keytab.fileDefaulthadoop.policy.fileThe policy configuration file used by RPC servers to make + authorization decisions on client requests. Only used when HBase + security is enabled.Defaulthbase-policy.xmlhbase.superuserList of users or groups (comma-separated), who are allowed + full privileges, regardless of stored ACLs, across the cluster. + Only used when HBase security is enabled.Defaulthbase.auth.key.update.intervalThe update interval for master key for authentication tokens + in servers in milliseconds. Only used when HBase security is enabled.Default86400000hbase.auth.token.max.lifetimeThe maximum lifetime in milliseconds after which an + authentication token expires. 
Only used when HBase security is enabled.Default604800000hbase.ipc.client.fallback-to-simple-auth-allowedWhen a client is configured to attempt a secure connection, but attempts to + connect to an insecure server, that server may instruct the client to + switch to SASL SIMPLE (unsecure) authentication. This setting controls + whether or not the client will accept this instruction from the server. + When false (the default), the client will not allow the fallback to SIMPLE + authentication, and will abort the connection.Defaultfalsehbase.display.keysWhen this is set to true the webUI and such will display all start/end keys + as part of the table details, region names, etc. When this is set to false, + the keys are hidden.Defaulttruehbase.coprocessor.region.classesA comma-separated list of Coprocessors that are loaded by + default on all tables. For any override coprocessor method, these classes + will be called in order. After implementing your own Coprocessor, just put + it in HBase's classpath and add the fully qualified class name here. + A coprocessor can also be loaded on demand by setting HTableDescriptor.Defaulthbase.rest.portThe port for the HBase REST server.Default8080hbase.rest.readonlyDefines the mode the REST server will be started in. Possible values are: + false: All HTTP methods are permitted - GET/PUT/POST/DELETE. + true: Only the GET method is permitted.Defaultfalsehbase.rest.threads.maxThe maximum number of threads of the REST server thread pool. + Threads in the pool are reused to process REST requests. This + controls the maximum number of requests processed concurrently. + It may help to control the memory used by the REST server to + avoid OOM issues. If the thread pool is full, incoming requests + will be queued up and wait for some free threads.Default100hbase.rest.threads.minThe minimum number of threads of the REST server thread pool. + The thread pool always has at least these number of threads so + the REST server is ready to serve incoming requests.Default2hbase.rest.support.proxyuserEnables running the REST server to support proxy-user mode.Defaultfalsehbase.defaults.for.version.skipSet to true to skip the 'hbase.defaults.for.version' check. + Setting this to true can be useful in contexts other than + the other side of a maven generation; i.e. running in an + ide. You'll want to set this boolean to true to avoid + seeing the RuntimException complaint: "hbase-default.xml file + seems to be for and old version of HBase (\${hbase.version}), this + version is X.X.X-SNAPSHOT"Defaultfalsehbase.coprocessor.master.classesA comma-separated list of + org.apache.hadoop.hbase.coprocessor.MasterObserver coprocessors that are + loaded by default on the active HMaster process. For any implemented + coprocessor methods, the listed classes will be called in order. After + implementing your own MasterObserver, just put it in HBase's classpath + and add the fully qualified class name here.Defaulthbase.coprocessor.abortonerrorSet to true to cause the hosting server (master or regionserver) + to abort if a coprocessor fails to load, fails to initialize, or throws an + unexpected Throwable object. 
Setting this to false will allow the server to + continue execution but the system wide state of the coprocessor in question + will become inconsistent as it will be properly executing in only a subset + of servers, so this is most useful for debugging only.Defaulttruehbase.online.schema.update.enableSet true to enable online schema changes.Defaulttruehbase.table.lock.enableSet to true to enable locking the table in zookeeper for schema change operations. + Table locking from master prevents concurrent schema modifications to corrupt table + state.Defaulttruehbase.table.max.rowsize + Maximum size of single row in bytes (default is 1 Gb) for Get'ting + or Scan'ning without in-row scan flag set. If row size exceeds this limit + RowTooBigException is thrown to client. + Default1073741824hbase.thrift.minWorkerThreadsThe "core size" of the thread pool. New threads are created on every + connection until this many threads are created.Default16hbase.thrift.maxWorkerThreadsThe maximum size of the thread pool. When the pending request queue + overflows, new threads are created until their number reaches this number. + After that, the server starts dropping connections.Default1000hbase.thrift.maxQueuedRequestsThe maximum number of pending Thrift connections waiting in the queue. If + there are no idle threads in the pool, the server queues requests. Only + when the queue overflows, new threads are added, up to + hbase.thrift.maxQueuedRequests threads.Default1000hbase.thrift.htablepool.size.maxThe upper bound for the table pool used in the Thrift gateways server. + Since this is per table name, we assume a single table and so with 1000 default + worker threads max this is set to a matching number. For other workloads this number + can be adjusted as needed. + Default1000hbase.regionserver.thrift.framedUse Thrift TFramedTransport on the server side. + This is the recommended transport for thrift servers and requires a similar setting + on the client side. Changing this to false will select the default transport, + vulnerable to DoS when malformed requests are issued due to THRIFT-601. + Defaultfalsehbase.regionserver.thrift.framed.max_frame_size_in_mbDefault frame size when using framed transportDefault2hbase.regionserver.thrift.compactUse Thrift TCompactProtocol binary serialization protocol.Defaultfalsehbase.data.umask.enableEnable, if true, that file permissions should be assigned + to the files written by the regionserverDefaultfalsehbase.data.umaskFile permissions that should be used to write data + files when hbase.data.umask.enable is trueDefault000hbase.metrics.showTableNameWhether to include the prefix "tbl.tablename" in per-column family metrics. + If true, for each metric M, per-cf metrics will be reported for tbl.T.cf.CF.M, if false, + per-cf metrics will be aggregated by column-family across tables, and reported for cf.CF.M. + In both cases, the aggregated metric M across tables and cfs will be reported.Defaulttruehbase.metrics.exposeOperationTimesWhether to report metrics about time taken performing an + operation on the region server. Get, Put, Delete, Increment, and Append can all + have their times exposed through Hadoop metrics per CF and per region.Defaulttruehbase.snapshot.enabledSet to true to allow snapshots to be taken / restored / cloned.Defaulttruehbase.snapshot.restore.take.failsafe.snapshotSet to true to take a snapshot before the restore operation. + The snapshot taken will be used in case of failure, to restore the previous state. 
+ At the end of the restore operation this snapshot will be deletedDefaulttruehbase.snapshot.restore.failsafe.nameName of the failsafe snapshot taken by the restore operation. + You can use the {snapshot.name}, {table.name} and {restore.timestamp} variables + to create a name based on what you are restoring.Defaulthbase-failsafe-{snapshot.name}-{restore.timestamp}hbase.server.compactchecker.interval.multiplierThe number that determines how often we scan to see if compaction is necessary. + Normally, compactions are done after some events (such as memstore flush), but if + region didn't receive a lot of writes for some time, or due to different compaction + policies, it may be necessary to check it periodically. The interval between checks is + hbase.server.compactchecker.interval.multiplier multiplied by + hbase.server.thread.wakefrequency.Default1000hbase.lease.recovery.timeoutHow long we wait on dfs lease recovery in total before giving up.Default900000hbase.lease.recovery.dfs.timeoutHow long between dfs recover lease invocations. Should be larger than the sum of + the time it takes for the namenode to issue a block recovery command as part of + datanode; dfs.heartbeat.interval and the time it takes for the primary + datanode, performing block recovery to timeout on a dead datanode; usually + dfs.client.socket-timeout. See the end of HBASE-8389 for more.Default64000hbase.column.max.versionNew column family descriptors will use this value as the default number of versions + to keep.Default1hbase.dfs.client.read.shortcircuit.buffer.sizeIf the DFSClient configuration + dfs.client.read.shortcircuit.buffer.size is unset, we will + use what is configured here as the short circuit read default + direct byte buffer size. DFSClient native default is 1MB; HBase + keeps its HDFS files open so number of file blocks * 1MB soon + starts to add up and threaten OOME because of a shortage of + direct memory. So, we set it down from the default. Make + it > the default hbase block size set in the HColumnDescriptor + which is usually 64k. + Default131072hbase.regionserver.checksum.verify + If set to true (the default), HBase verifies the checksums for hfile + blocks. HBase writes checksums inline with the data when it writes out + hfiles. HDFS (as of this writing) writes checksums to a separate file + than the data file necessitating extra seeks. Setting this flag saves + some on i/o. Checksum verification by HDFS will be internally disabled + on hfile streams when this flag is set. If the hbase-checksum verification + fails, we will switch back to using HDFS checksums (so do not disable HDFS + checksums! And besides this feature applies to hfiles only, not to WALs). + If this parameter is set to false, then hbase will not verify any checksums, + instead it will depend on checksum verification being done in the HDFS client. + Defaulttruehbase.hstore.bytes.per.checksum + Number of bytes in a newly created checksum chunk for HBase-level + checksums in hfile blocks. + Default16384hbase.hstore.checksum.algorithm + Name of an algorithm that is used to compute checksums. Possible values + are NULL, CRC32, CRC32C. + DefaultCRC32hbase.status.published + This setting activates the publication by the master of the status of the region server. + When a region server dies and its recovery starts, the master will push this information + to the client application, to let them cut the connection immediately instead of waiting + for a timeout. 
+ Defaultfalsehbase.status.publisher.class + Implementation of the status publication with a multicast message. + Defaultorg.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisherhbase.status.listener.class + Implementation of the status listener with a multicast message. + Defaultorg.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListenerhbase.status.multicast.address.ip + Multicast address to use for the status publication by multicast. + Default226.1.1.3hbase.status.multicast.address.port + Multicast port to use for the status publication by multicast. + Default16100hbase.dynamic.jars.dir + The directory from which the custom filter/co-processor jars can be loaded + dynamically by the region server without the need to restart. However, + an already loaded filter/co-processor class would not be un-loaded. See + HBASE-1936 for more details. + Default${hbase.rootdir}/libhbase.security.authentication + Controls whether or not secure authentication is enabled for HBase. + Possible values are 'simple' (no authentication), and 'kerberos'. + Defaultsimplehbase.rest.filter.classes + Servlet filters for REST service. + Defaultorg.apache.hadoop.hbase.rest.filter.GzipFilterhbase.master.loadbalancer.class + Class used to execute the regions balancing when the period occurs. + See the class comment for more on how it works + http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html + It replaces the DefaultLoadBalancer as the default (since renamed + as the SimpleLoadBalancer). + Defaultorg.apache.hadoop.hbase.master.balancer.StochasticLoadBalancerhbase.security.exec.permission.checks + If this setting is enabled and ACL based access control is active (the + AccessController coprocessor is installed either as a system coprocessor + or on a table as a table coprocessor) then you must grant all relevant + users EXEC privilege if they require the ability to execute coprocessor + endpoint calls. EXEC privilege, like any other permission, can be + granted globally to a user, or to a user on a per table or per namespace + basis. For more information on coprocessor endpoints, see the coprocessor + section of the HBase online manual. For more information on granting or + revoking permissions using the AccessController, see the security + section of the HBase online manual. + Defaultfalsehbase.procedure.regionserver.classesA comma-separated list of + org.apache.hadoop.hbase.procedure.RegionServerProcedureManager procedure managers that are + loaded by default on the active HRegionServer process. The lifecycle methods (init/start/stop) + will be called by the active HRegionServer process to perform the specific globally barriered + procedure. After implementing your own RegionServerProcedureManager, just put it in + HBase's classpath and add the fully qualified class name here. + Defaulthbase.procedure.master.classesA comma-separated list of + org.apache.hadoop.hbase.procedure.MasterProcedureManager procedure managers that are + loaded by default on the active HMaster process. A procedure is identified by its signature and + users can use the signature and an instant name to trigger an execution of a globally barriered + procedure. 
After implementing your own MasterProcedureManager, just put it in HBase's classpath + and add the fully qualified class name here.Defaulthbase.coordinated.state.manager.classFully qualified name of class implementing coordinated state manager.Defaultorg.apache.hadoop.hbase.coordination.ZkCoordinatedStateManagerhbase.regionserver.storefile.refresh.period + The period (in milliseconds) for refreshing the store files for the secondary regions. 0 + means this feature is disabled. Secondary regions sees new files (from flushes and + compactions) from primary once the secondary region refreshes the list of files in the + region (there is no notification mechanism). But too frequent refreshes might cause + extra Namenode pressure. If the files cannot be refreshed for longer than HFile TTL + (hbase.master.hfilecleaner.ttl) the requests are rejected. Configuring HFile TTL to a larger + value is also recommended with this setting. + Default0hbase.region.replica.replication.enabled + Whether asynchronous WAL replication to the secondary region replicas is enabled or not. + If this is enabled, a replication peer named "region_replica_replication" will be created + which will tail the logs and replicate the mutatations to region replicas for tables that + have region replication > 1. If this is enabled once, disabling this replication also + requires disabling the replication peer using shell or ReplicationAdmin java class. + Replication to secondary region replicas works over standard inter-cluster replication. + So replication, if disabled explicitly, also has to be enabled by setting "hbase.replication" + to true for this feature to work. + Defaultfalsehbase.http.filter.initializers + A comma separated list of class names. Each class in the list must extend + org.apache.hadoop.hbase.http.FilterInitializer. The corresponding Filter will + be initialized. Then, the Filter will be applied to all user facing jsp + and servlet web pages. + The ordering of the list defines the ordering of the filters. + The default StaticUserWebFilter add a user principal as defined by the + hbase.http.staticuser.user property. + Defaultorg.apache.hadoop.hbase.http.lib.StaticUserWebFilterhbase.security.visibility.mutations.checkauths + This property if enabled, will check whether the labels in the visibility expression are associated + with the user issuing the mutation + Defaultfalsehbase.http.max.threads + The maximum number of threads that the HTTP Server will create in its + ThreadPool. + Default10hbase.replication.rpc.codec + The codec that is to be used when replication is enabled so that + the tags are also replicated. This is used along with HFileV3 which + supports tags in them. If tags are not used or if the hfile version used + is HFileV2 then KeyValueCodec can be used as the replication codec. Note that + using KeyValueCodecWithTags for replication when there are no tags causes no harm. + Defaultorg.apache.hadoop.hbase.codec.KeyValueCodecWithTagshbase.http.staticuser.user + The user name to filter as, on static web filters + while rendering content. An example use is the HDFS + web UI (user to be used for browsing files). + Defaultdr.stack \ No newline at end of file diff --git a/src/main/docbkx/hbase_history.xml b/src/main/docbkx/hbase_history.xml new file mode 100644 index 00000000000..f7b90645af9 --- /dev/null +++ b/src/main/docbkx/hbase_history.xml @@ -0,0 +1,41 @@ + + + + HBase History + + 2006: BigTable paper published by Google. + + 2006 (end of year): HBase development starts. 
+ + 2008: HBase becomes Hadoop sub-project. + + 2010: HBase becomes Apache top-level project. + + + diff --git a/src/main/docbkx/hbck_in_depth.xml b/src/main/docbkx/hbck_in_depth.xml new file mode 100644 index 00000000000..e2ee34f3a46 --- /dev/null +++ b/src/main/docbkx/hbck_in_depth.xml @@ -0,0 +1,237 @@ + + + + + hbck In Depth + HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems + and repairing a corrupted HBase. It works in two basic modes -- a read-only inconsistency + identifying mode and a multi-phase read-write repair mode. + +
+ Running hbck to identify inconsistencies + To check to see if your HBase cluster has corruptions, run hbck against your HBase cluster: + +$ ./bin/hbase hbck + + + At the end of the command’s output it prints OK or tells you the number of INCONSISTENCIES + present. You may also want to run hbck a few times because some inconsistencies can be + transient (e.g. the cluster is starting up or a region is splitting). Operationally you may want to run + hbck regularly and set up an alert (e.g. via nagios) if it repeatedly reports inconsistencies. + A run of hbck will report a list of inconsistencies along with a brief description of the regions and + tables affected. Using the -details option will report more details, including a representative + listing of all the splits present in all the tables. + + +$ ./bin/hbase hbck -details + + If you just want to know whether some tables are corrupted, you can limit hbck to identify inconsistencies + in only specific tables. For example, the following command would only attempt to check tables + TableFoo and TableBar. The benefit is that hbck will run in less time. + +$ ./bin/hbase hbck TableFoo TableBar + +&#13;
+
Inconsistencies + + If, after several runs, inconsistencies continue to be reported, you may have encountered a + corruption. These should be rare, but in the event they occur, newer versions of HBase include + the hbck tool enabled with automatic repair options. + + + There are two invariants that, when violated, create inconsistencies in HBase: + + + HBase’s region consistency invariant is satisfied if every region is assigned and + deployed on exactly one region server, and all places where this state is kept are in + accordance. + + HBase’s table integrity invariant is satisfied if for each table, every possible row key + resolves to exactly one region. + + + + Repairs generally work in three phases -- a read-only information gathering phase that identifies + inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then + finally a region consistency repair phase that restores the region consistency invariant. + Starting from version 0.90.0, hbck could detect region consistency problems and report on a subset + of possible table integrity problems. It also included the ability to automatically fix the most + common inconsistencies: region assignment and deployment consistency problems. This repair + could be done by using the -fix command line option. These repairs close regions if they are + open on the wrong server or on multiple region servers and also assign regions to region + servers if they are not open. + + + Starting from HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options were + introduced to aid in repairing a corrupted HBase. This hbck sometimes goes by the nickname + “uberhbck”. Each particular version of uberhbck is compatible with HBase clusters of the same + major version (the 0.90.7 uberhbck can repair a 0.90.4 cluster). However, versions &lt;=0.90.6 and versions + &lt;=0.92.1 may require restarting the master or failing over to a backup master. + +&#13;
+
Localized repairs + + When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first. + These are generally region consistency repairs -- localized single-region repairs that only modify + in-memory data, ephemeral zookeeper data, or patch holes in the META table. + Region consistency requires that the HBase instance has the state of the region’s data in HDFS + (.regioninfo files), the region’s row in the hbase:meta table, and the region’s deployment/assignments on + region servers and the master all in accordance. Options for repairing region consistency include: + + -fixAssignments (equivalent to the 0.90 -fix option) repairs unassigned, incorrectly + assigned or multiply assigned regions. + + -fixMeta which removes meta rows when corresponding regions are not present in + HDFS and adds new meta rows if the regions are present in HDFS while not in META. + + + To fix deployment and assignment problems, you can run this command: + + +$ ./bin/hbase hbck -fixAssignments + + To fix deployment and assignment problems as well as repairing incorrect meta rows, you can + run this command: + +$ ./bin/hbase hbck -fixAssignments -fixMeta + + There are a few classes of table integrity problems that are low-risk repairs. The first two are + degenerate (startkey == endkey) regions and backwards regions (startkey &gt; endkey). These are + automatically handled by sidelining the data to a temporary directory (/hbck/xxxx). + The third low-risk class is HDFS region holes. These can be repaired by using the: + + -fixHdfsHoles option for fabricating new empty regions on the file system. + If holes are detected, you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new regions consistent. + + + +$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles + + Since this is a common operation, we’ve added the -repairHoles flag that is equivalent to the + previous command: + +$ ./bin/hbase hbck -repairHoles + + If inconsistencies still remain after these steps, you most likely have table integrity problems + related to orphaned or overlapping regions. +&#13;
+
Region Overlap Repairs + Table integrity problems can require repairs that deal with overlaps. This is a riskier operation + because it requires modifications to the file system, requires some decision making, and may + require some manual steps. For these repairs it is best to analyze the output of an hbck -details + run so that you isolate repair attempts to only the problems the checks identify. Because this is + riskier, there are safeguards that should be used to limit the scope of the repairs. + WARNING: These options are relatively new and have only been tested on online but idle HBase instances + (no reads/writes). Use at your own risk in an active production environment! + The options for repairing table integrity violations include: + + -fixHdfsOrphans option for “adopting” a region directory that is missing a region + metadata file (the .regioninfo file). + + -fixHdfsOverlaps option for fixing overlapping regions + + + When repairing overlapping regions, a region’s data can be modified on the file system in two + ways: 1) by merging regions into a larger region or 2) by sidelining regions by moving data to + a “sideline” directory where the data can be restored later. Merging a large number of regions is + technically correct but could result in an extremely large region that requires a series of costly + compactions and splitting operations. In these cases, it is probably better to sideline the regions + that overlap with the most other regions (likely the largest ranges) so that merges can happen on + a more reasonable scale. Since these sidelined regions are already laid out in HBase’s native + directory and HFile format, they can be restored by using HBase’s bulk load mechanism. + The default safeguard thresholds are conservative. These options let you override the default + thresholds and enable the large-region sidelining feature; an example command combining them is shown at the end of this section. + + -maxMerge &lt;n&gt; maximum number of overlapping regions to merge + + -sidelineBigOverlaps if more than maxMerge regions are overlapping, attempt + to sideline the regions overlapping with the most other regions. + + -maxOverlapsToSideline &lt;n&gt; if sidelining large overlapping regions, sideline at most n + regions. + + + + Since you often just want to get the tables repaired, you can use this option to turn + on all repair options: + + -repair includes all the region consistency options and only the hole-repairing table + integrity options. + + + Finally, there are safeguards to limit repairs to only specific tables. For example, the following + command would only attempt to check and repair tables TableFoo and TableBar. + +$ ./bin/hbase hbck -repair TableFoo TableBar + +&#13;
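The command below is an illustrative sketch rather than text from the guide: it combines the -repair shortcut with the safeguard overrides described in the option list above. The threshold values (at most 50 merges, at most 2 sidelined regions) and the table name TableFoo are made-up examples; pick values appropriate for your cluster.
$ ./bin/hbase hbck -repair -maxMerge 50 -sidelineBigOverlaps -maxOverlapsToSideline 2 TableFoo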
Special cases: Meta is not properly assigned + There are a few special cases that hbck can handle as well. + Sometimes the meta table’s only region is inconsistently assigned or deployed. In this case + there is a special -fixMetaOnly option that can try to fix meta assignments. + +$ ./bin/hbase hbck -fixMetaOnly -fixAssignments + +
+
Special cases: HBase version file is missing + HBase’s data on the file system requires a version file in order to start. If this file is missing, you + can use the -fixVersionFile option to fabricate a new HBase version file, as shown below. This assumes that + the version of hbck you are running is the appropriate version for the HBase cluster. +&#13;
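A minimal invocation might look like the following. This is an illustrative sketch rather than text from the guide, and it assumes hbck is run from the installation directory as in the earlier examples.
$ ./bin/hbase hbck -fixVersionFile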
+
Special case: Root and META are corrupt. + The most drastic corruption scenario is the case where the ROOT or META is corrupted and + HBase will not start. In this case you can use the OfflineMetaRepair tool to create new ROOT + and META regions and tables. + This tool assumes that HBase is offline. It then marches through the existing HBase home + directory and loads as much information from region metadata files (.regioninfo files) as possible + from the file system. If the region metadata has proper table integrity, it sidelines the original root + and meta table directories, and builds new ones with pointers to the region directories and their + data. + +$ ./bin/hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair + + NOTE: This tool is not as clever as uberhbck but can be used to bootstrap repairs that uberhbck + can complete. + If the tool succeeds, you should be able to start HBase and run online repairs if necessary. +&#13;
+
Special cases: Offline split parent + + Once a region is split, the offline parent will be cleaned up automatically. Sometimes, daughter regions + are split again before their parents are cleaned up. HBase can clean up parents in the right order. However, + there could still be some lingering offline split parents. They are in META, in HDFS, and not deployed. + But HBase can’t clean them up. In this case, you can use the -fixSplitParents option to reset + them in META to be online and not split. hbck can then merge them with other regions if the + fixing-overlapping-regions option is used. + + + This option should not normally be used, and it is not in -fixAll. + +&#13;
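If you do decide the option is warranted, the invocation follows the same pattern as the other hbck examples in this appendix; the line below is an illustrative sketch, not text from the guide.
$ ./bin/hbase hbck -fixSplitParents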
+
+ +
diff --git a/src/main/docbkx/mapreduce.xml b/src/main/docbkx/mapreduce.xml new file mode 100644 index 00000000000..9e9e47427e1 --- /dev/null +++ b/src/main/docbkx/mapreduce.xml @@ -0,0 +1,630 @@ + + + + + HBase and MapReduce + Apache MapReduce is a software framework used to analyze large amounts of data, and is + the framework used most often with Apache Hadoop. MapReduce itself is out of the + scope of this document. A good place to get started with MapReduce is . MapReduce version + 2 (MR2)is now part of YARN. + + This chapter discusses specific configuration steps you need to take to use MapReduce on + data within HBase. In addition, it discusses other interactions and issues between HBase and + MapReduce jobs. + + mapred and mapreduce + There are two mapreduce packages in HBase as in MapReduce itself: org.apache.hadoop.hbase.mapred + and org.apache.hadoop.hbase.mapreduce. The former does old-style API and the latter + the new style. The latter has more facility though you can usually find an equivalent in the older + package. Pick the package that goes with your mapreduce deploy. When in doubt or starting over, pick the + org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to + o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if that is what you are using. + + + + +
+ HBase, MapReduce, and the CLASSPATH + By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either + the HBase configuration under $HBASE_CONF_DIR or the HBase classes. + To give the MapReduce jobs the access they need, you could add + hbase-site.xml to + $HADOOP_HOME/conf and add the + HBase JARs to $HADOOP_HOME/lib, + then copy these changes across your cluster, or you could edit + $HADOOP_HOME/conf/hadoop-env.sh and add + them to the HADOOP_CLASSPATH variable. However, neither of these approaches is + recommended because it will pollute your Hadoop install with HBase references. It also + requires you to restart the Hadoop cluster before Hadoop can use the HBase data. + Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The + dependencies only need to be available on the local CLASSPATH. The following example runs + the bundled HBase RowCounter + MapReduce job against a table named usertable. If you have not set + the environment variables expected in the command (the parts prefixed by a + $ sign and curly braces), you can use the actual system paths instead. + Be sure to use the correct version of the HBase JAR for your system. The backticks + (` symbols) cause the shell to execute the sub-commands, setting the + CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. + $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable + When the command runs, internally, the HBase JAR finds the dependencies it needs for + zookeeper, guava, and its other dependencies on the passed HADOOP_CLASSPATH + and adds the JARs to the MapReduce job configuration. See the source at + TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. + + The example may not work if you are running HBase from its build directory rather + than an installed location. You may see an error like the following: + java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper + If this occurs, try modifying the command as follows, so that it uses the HBase JARs + from the target/ directory within the build environment. + $ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable + + + Notice to MapReduce users of HBase 0.96.1 and above + Some MapReduce jobs that use HBase fail to launch. &#13;
The symptom is an exception similar + to the following: + +Exception in thread "main" java.lang.IllegalAccessError: class + com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass + com.google.protobuf.LiteralByteString + at java.lang.ClassLoader.defineClass1(Native Method) + at java.lang.ClassLoader.defineClass(ClassLoader.java:792) + at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) + at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) + at java.net.URLClassLoader.access$100(URLClassLoader.java:71) + at java.net.URLClassLoader$1.run(URLClassLoader.java:361) + at java.net.URLClassLoader$1.run(URLClassLoader.java:355) + at java.security.AccessController.doPrivileged(Native Method) + at java.net.URLClassLoader.findClass(URLClassLoader.java:354) + at java.lang.ClassLoader.loadClass(ClassLoader.java:424) + at java.lang.ClassLoader.loadClass(ClassLoader.java:357) + at + org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818) + at + org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433) + at + org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186) + at + org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147) + at + org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270) + at + org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100) +... + + This is caused by an optimization introduced in HBASE-9867 that + inadvertently introduced a classloader dependency. + This affects both jobs using the -libjars option and "fat jar," those + which package their runtime dependencies in a nested lib folder. + In order to satisfy the new classloader requirements, hbase-protocol.jar must be + included in Hadoop's classpath. See for current recommendations for resolving + classpath errors. The following is included for historical purposes. + This can be resolved system-wide by including a reference to the hbase-protocol.jar in + hadoop's lib directory, via a symlink or by copying the jar into the new location. + This can also be achieved on a per-job launch basis by including it in the + HADOOP_CLASSPATH environment variable at job submission time. When + launching jobs that package their dependencies, all three of the following job launching + commands satisfy this requirement: + +$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass +$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass +$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass + + For jars that do not package their dependencies, the following command structure is + necessary: + +$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ... + + See also HBASE-10304 for + further discussion of this issue. + +
+ +
+ MapReduce Scan Caching + TableMapReduceUtil now restores the option to set scanner caching (the number of rows + which are cached before returning the result to the client) on the Scan object that is + passed in. This functionality was lost due to a bug in HBase 0.95 (HBASE-11558), which + is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is + as follows: + + + Caching settings which are set on the scan object. + + + Caching settings which are specified via the configuration option + , which can either be set manually in + hbase-site.xml or via the helper method + TableMapReduceUtil.setScannerCaching(). + + + The default value HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING, which is set to + 100. + + + Optimizing the caching settings is a balance between the time the client waits for a + result and the number of sets of results the client needs to receive. If the caching setting + is too large, the client could end up waiting for a long time or the request could even time + out. If the setting is too small, the scan needs to return results in several pieces. + If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger + shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the + bucket. + The list of priorities mentioned above allows you to set a reasonable default, and + override it for specific operations. + See the API documentation for Scan for more details. +
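As a brief sketch (not part of the original chapter) of the two higher-priority mechanisms described above, the fragment below follows the style of the Java examples later in this chapter. The table name "usertable", the MyMapper class, and the caching value of 500 are illustrative assumptions only.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleScanCaching");

// Highest priority: caching set directly on the Scan object.
Scan scan = new Scan();
scan.setCaching(500);        // rows buffered per RPC before results reach the mapper
scan.setCacheBlocks(false);  // recommended for MapReduce scans

// Next priority: caching set in the job configuration via the helper method.
TableMapReduceUtil.setScannerCaching(job, 500);

TableMapReduceUtil.initTableMapperJob("usertable", scan, MyMapper.class, null, null, job);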
+ +
+ Bundled HBase MapReduce Jobs + The HBase JAR also serves as a Driver for some bundled MapReduce jobs. To learn about + the bundled MapReduce jobs, run the following command. + + $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar +An example program must be given as the first argument. +Valid program names are: + copytable: Export a table from local cluster to peer cluster + completebulkload: Complete a bulk data load. + export: Write table data to HDFS. + import: Import data written by Export. + importtsv: Import data in TSV format. + rowcounter: Count rows in HBase table + + Each of the valid program names is a bundled MapReduce job. To run one of the jobs, + model your command after the following example. + $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable +&#13;
+ +
 HBase as a MapReduce Job Data Source and Data Sink
 HBase can be used as a data source, TableInputFormat,
 and as a data sink, TableOutputFormat
 or MultiTableOutputFormat,
 for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to
 subclass TableMapper
 and/or TableReducer.
 See the do-nothing pass-through classes IdentityTableMapper
 and IdentityTableReducer
 for basic usage. For a more involved example, see RowCounter
 or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test.
 If you run MapReduce jobs that use HBase as a source or a sink, you need to specify the
 source and sink table and column names in your configuration.

 When you read from HBase, the TableInputFormat requests the list of regions
 from HBase and creates one map task per region, or
 mapreduce.job.maps map tasks, whichever is smaller. If your job only has two maps,
 raise mapreduce.job.maps to a number greater than the number of regions. Maps
 will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
 node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
 HBase from within your map. This approach works when your job does not need the sort and
 collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
 no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
 to. If you do not need the Reduce, your map might emit counts of records processed for
 reporting at the end of the job, or set the number of Reduces to zero and use
 TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
 use multiple reducers so that load is spread across the HBase cluster.

 A new HBase partitioner, the HRegionPartitioner,
 can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
 when your table is large and your upload will not greatly alter the number of existing
 regions upon completion. Otherwise use the default partitioner. A sketch of configuring the
 HRegionPartitioner follows.
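 The following is a minimal sketch, not a complete job, of wiring in the HRegionPartitioner,
 assuming the initTableReducerJob overload that accepts a partitioner class; the mapper,
 reducer, and table names are placeholders.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRegionPartitionedWrite");
job.setJarByClass(MyWriteJob.class); // class that contains mapper and reducer

TableMapReduceUtil.initTableMapperJob(
  sourceTable,                  // input table
  new Scan(),                   // Scan instance to control CF and attribute selection
  MyMapper.class,               // mapper class
  ImmutableBytesWritable.class, // mapper output key: the row key in the output table
  Put.class,                    // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,               // output table
  MyTableReducer.class,      // reducer class
  job,
  HRegionPartitioner.class); // partition map output by the target table's regions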
+ +
+ Writing HFiles Directly During Bulk Import + If you are importing into a new table, you can bypass the HBase API and write your + content directly to the filesystem, formatted into HBase data files (HFiles). Your import + will run faster, perhaps an order of magnitude faster. For more on how this mechanism works, + see . +
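 As a rough illustration only (the authoritative walk-through is in the section referenced
 above), such a job typically emits Puts keyed by row and lets HFileOutputFormat2 arrange the
 output to match the target table's region boundaries. This sketch assumes an HBase version
 whose HFileOutputFormat2.configureIncrementalLoad accepts a Table and a RegionLocator; the
 table name, mapper class, and output path are placeholders.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleHFileWrite");
job.setJarByClass(MyHFileMapper.class); // mapper emitting ImmutableBytesWritable/Put pairs

Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("myNewTable"));
RegionLocator regionLocator = connection.getRegionLocator(TableName.valueOf("myNewTable"));

TableMapReduceUtil.initTableMapperJob(
  sourceTable,                  // input table
  new Scan(),                   // Scan instance to control CF and attribute selection
  MyHFileMapper.class,          // mapper class
  ImmutableBytesWritable.class, // mapper output key
  Put.class,                    // mapper output value
  job);

// Sets the reducer, output format, and total-order partitioning to match region boundaries.
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/bulkload-output"));

// After the job completes, the generated HFiles are handed to the cluster with the
// completebulkload tool (LoadIncrementalHFiles).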
+ +
 RowCounter Example
 The included RowCounter
 MapReduce job uses TableInputFormat and does a count of all rows in the specified
 table. To run it, use the following command:
 $ ./bin/hadoop jar hbase-X.X.X.jar
 This will
 invoke the HBase MapReduce Driver class. Select rowcounter from the choice of jobs
 offered. This will print rowcounter usage advice to standard output. Specify the tablename,
 column to count, and output
 directory. If you have classpath errors, see .
+ +
+ Map-Task Splitting +
+ The Default HBase MapReduce Splitter + When TableInputFormat + is used to source an HBase table in a MapReduce job, its splitter will make a map task for + each region of the table. Thus, if there are 100 regions in the table, there will be 100 + map-tasks for the job - regardless of how many column families are selected in the + Scan. +
+
+ Custom Splitters + For those interested in implementing custom splitters, see the method + getSplits in TableInputFormatBase. + That is where the logic for map-task assignment resides. +
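 The following is a bare-bones, hypothetical skeleton of a custom splitter: it extends
 TableInputFormat, lets the parent class compute the default one-split-per-region
 assignment, and then post-processes the list. The pass-through loop shown is purely
 illustrative.

public class MyCustomTableInputFormat extends TableInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    // Start from the default assignment: one split per region of the table.
    List<InputSplit> defaultSplits = super.getSplits(context);
    List<InputSplit> customSplits = new ArrayList<InputSplit>();
    for (InputSplit split : defaultSplits) {
      // Inspect, drop, merge, or subdivide the per-region splits here.
      customSplits.add(split);
    }
    return customSplits;
  }
}

 A job would then use this class instead of TableInputFormat, for example via
 job.setInputFormatClass(MyCustomTableInputFormat.class) after the usual
 TableMapReduceUtil setup.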
+
+
+ HBase MapReduce Examples +
 HBase MapReduce Read Example
 The following is an example of using HBase as a MapReduce source in a read-only manner.
 Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
 the Mapper. The job would be defined as follows...

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
  tableName, // input HBase table name
  scan, // Scan instance to control CF and attribute selection
  MyMapper.class, // mapper
  null, // mapper output key
  null, // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

 ...and the mapper instance would extend TableMapper...

public static class MyMapper extends TableMapper<Text, Text> {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}

+
 HBase MapReduce Read/Write Example
 The following is an example of using HBase both as a source and as a sink with
 MapReduce. This example will simply copy data from one table to another.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable, // input table
  scan, // Scan instance to control CF and attribute selection
  MyMapper.class, // mapper class
  null, // mapper output key
  null, // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable, // output table
  null, // reducer class
  job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

 It is worth explaining what TableMapReduceUtil is doing here,
 especially with the reducer. TableOutputFormat
 is being used as the outputFormat class, and several parameters are being set on the
 config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
 to ImmutableBytesWritable and the reducer value to
 Writable. These could be set by the programmer on the job and
 conf, but TableMapReduceUtil tries to make things easier.
 The following is the example mapper, which will create a Put
 matching the input Result and emit it. Note: this is what the
 CopyTable utility does.

public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
    context.write(row, resultToPut(row,value));
  }

  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
    Put put = new Put(key.get());
    for (KeyValue kv : result.raw()) {
      put.add(kv);
    }
    return put;
  }
}

 There isn't actually a reducer step, so TableOutputFormat takes
 care of sending the Put to the target table.
 This is just an example; developers could choose not to use
 TableOutputFormat and connect to the target table themselves.

+
+ HBase MapReduce Read/Write Example With Multi-Table Output + TODO: example for MultiTableOutputFormat. +
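 In the meantime, the following is a minimal, hypothetical sketch of the idea: the mapper emits
 each Put keyed by the name of the destination table, and MultiTableOutputFormat routes the
 write accordingly. The table names (tableA, tableB) and the even/odd routing rule are
 placeholders; both output tables must already exist.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleMultiTableWrite");
job.setJarByClass(MyMultiTableMapper.class); // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,                  // input table
  scan,                         // Scan instance to control CF and attribute selection
  MyMultiTableMapper.class,     // mapper class
  ImmutableBytesWritable.class, // mapper output key: the destination table name
  Put.class,                    // mapper output value
  job);
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

 ...and the mapper chooses a destination table for each row...

public static class MyMultiTableMapper extends TableMapper<ImmutableBytesWritable, Put> {
  private static final ImmutableBytesWritable TABLE_A = new ImmutableBytesWritable(Bytes.toBytes("tableA"));
  private static final ImmutableBytesWritable TABLE_B = new ImmutableBytesWritable(Bytes.toBytes("tableB"));

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    Put put = new Put(value.getRow());
    for (KeyValue kv : value.raw()) {
      put.add(kv);
    }
    // The output key names the table the Put should be written to.
    if ((value.getRow()[0] & 0x01) == 0) {
      context.write(TABLE_A, put); // "even" rows go to tableA
    } else {
      context.write(TABLE_B, put); // "odd" rows go to tableB
    }
  }
}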
+
 HBase MapReduce Summary to HBase Example
 The following example uses HBase as a MapReduce source and sink with a summarization
 step. This example will count the number of distinct instances of a value in a table and
 write those summarized counts to another table.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary");
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable, // input table
  scan, // Scan instance to control CF and attribute selection
  MyMapper.class, // mapper class
  Text.class, // mapper output key
  IntWritable.class, // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable, // output table
  MyTableReducer.class, // reducer class
  job);
job.setNumReduceTasks(1); // at least one, adjust as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

 In this example mapper, a column with a String value is chosen as the value to summarize
 upon. This value is used as the key to emit from the mapper, and an
 IntWritable represents an instance counter.

public static class MyMapper extends TableMapper<Text, IntWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();

  private final IntWritable ONE = new IntWritable(1);
  private Text text = new Text();

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String val = new String(value.getValue(CF, ATTR1));
    text.set(val); // we can only emit Writables...

    context.write(text, ONE);
  }
}

 In the reducer, the "ones" are counted (just like any other MR example that does this),
 and then a Put is emitted.

public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(CF, COUNT, Bytes.toBytes(i));

    context.write(null, put);
  }
}

+
 HBase MapReduce Summary to File Example
 This is very similar to the summary example above, with the exception that it uses
 HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
 in the reducer. The mapper remains the same.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable, // input table
  scan, // Scan instance to control CF and attribute selection
  MyMapper.class, // mapper class
  Text.class, // mapper output key
  IntWritable.class, // mapper output value
  job);
job.setReducerClass(MyReducer.class); // reducer class
job.setNumReduceTasks(1); // at least one, adjust as required
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile")); // adjust directories as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

 As stated above, the previous Mapper can run unchanged with this example. As for the
 Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting
 Puts.

 public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    context.write(key, new IntWritable(i));
  }
}

+
 HBase MapReduce Summary to HBase Without Reducer
 It is also possible to perform summaries without a reducer, if you use HBase as the
 reducer.
 An HBase target table would need to exist for the job summary. The Table method
 incrementColumnValue would be used to atomically increment values. From a
 performance perspective, it might make sense to keep a Map of values and their counts to
 be incremented for each map task, and make one update per key during the
 cleanup method of the mapper. However, your mileage may vary depending on the
 number of rows to be processed and the number of unique keys.
 In the end, the summary results are in HBase. A sketch of this pattern follows.
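 The following is a minimal sketch of that pattern, not a complete job. It assumes the
 Connection/Table style of client API, and the table and column names ("summaryTable",
 "cf", "attr1", "count") are illustrative placeholders. Each map task buffers counts in
 memory and issues one atomic increment per distinct value in cleanup().

public static class MyIncrementingMapper extends TableMapper<ImmutableBytesWritable, Put> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  private Connection connection;
  private Table summaryTable;
  private Map<String, Long> counts = new HashMap<String, Long>();

  public void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    summaryTable = connection.getTable(TableName.valueOf("summaryTable"));
  }

  public void map(ImmutableBytesWritable row, Result value, Context context) {
    // Aggregate in memory instead of emitting anything to the framework.
    String val = new String(value.getValue(CF, ATTR1));
    Long current = counts.get(val);
    counts.put(val, current == null ? 1L : current + 1);
  }

  public void cleanup(Context context) throws IOException {
    // One atomic increment per distinct value seen by this map task.
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(entry.getKey()), CF, COUNT, entry.getValue());
    }
    summaryTable.close();
    connection.close();
  }
}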
+
 HBase MapReduce Summary to RDBMS
 Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
 it is possible to generate summaries directly to an RDBMS via a custom reducer. The
 setup method can connect to an RDBMS (the connection information can be
 passed via custom parameters in the context) and the cleanup method can close the
 connection.
 It is critical to understand that the number of reducers for the job affects the
 summarization implementation, and you'll have to design this into your reducer.
 Specifically, whether it is designed to run as a singleton (one reducer) or as multiple
 reducers. Neither is right or wrong; it depends on your use case. Recognize that the more
 reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
 be created - this will scale, but only to a point.

 public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private Connection c = null; // a java.sql.Connection to the RDBMS, not an HBase Connection

  public void setup(Context context) {
    // create DB connection...
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // do summarization
    // in this example the keys are Text, but this is just an example
  }

  public void cleanup(Context context) {
    // close db connection
  }

}

 In the end, the summary results are written to your RDBMS table(s).
+ +
+ +
 Accessing Other HBase Tables in a MapReduce Job
 Although the framework currently allows one HBase table as input to a MapReduce job,
 other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
 a Table instance in the setup method of the Mapper.
 public class MyMapper extends TableMapper<Text, LongWritable> {
  private Connection connection;
  private Table myOtherTable;

  public void setup(Context context) throws IOException {
    // Create a Connection to the cluster and save it, or use the Connection
    // from the existing table.
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
  }

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // process Result...
    // use 'myOtherTable' for lookups
  }

  public void cleanup(Context context) throws IOException {
    myOtherTable.close();
    connection.close();
  }
}

+
 Speculative Execution
 It is generally advisable to turn off speculative execution for MapReduce jobs that use
 HBase as a source. This can either be done on a per-Job basis through properties, or on the
 entire cluster. Especially for longer-running jobs, speculative execution will create
 duplicate map-tasks which will double-write your data to HBase; this is probably not what
 you want.
 See for more information.
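 As a sketch, speculative execution can be disabled per job before submission. The property
 names below are the Hadoop 2 (MRv2) names; older MRv1 releases use
 mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution
 instead.

Configuration config = HBaseConfiguration.create();
config.setBoolean("mapreduce.map.speculative", false);    // don't launch duplicate map attempts
config.setBoolean("mapreduce.reduce.speculative", false); // don't launch duplicate reduce attempts
Job job = new Job(config, "ExampleNoSpeculation");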
+ +
diff --git a/src/main/docbkx/orca.xml b/src/main/docbkx/orca.xml new file mode 100644 index 00000000000..29d8727f0f1 --- /dev/null +++ b/src/main/docbkx/orca.xml @@ -0,0 +1,47 @@ + + + + Apache HBase Orca +
+ Apache HBase Orca + + + + + +
 An Orca is the Apache
 HBase mascot.
 See NOTICES.txt. We got our Orca logo here: http://www.vectorfree.com/jumping-orca
 It is licensed under Creative Commons Attribution 3.0. See https://creativecommons.org/licenses/by/3.0/us/
 We changed the logo by stripping the colored background, inverting
 it, and then rotating it slightly.

diff --git a/src/main/docbkx/other_info.xml b/src/main/docbkx/other_info.xml new file mode 100644 index 00000000000..72ff2745865 --- /dev/null +++ b/src/main/docbkx/other_info.xml @@ -0,0 +1,83 @@ + + + + Other Information About HBase +
HBase Videos + Introduction to HBase + + Introduction to HBase by Todd Lipcon (Chicago Data Summit 2011). + + Introduction to HBase by Todd Lipcon (2010). + + + + Building Real Time Services at Facebook with HBase by Jonathan Gray (Hadoop World 2011). + + HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon by JD Cryans (Hadoop World 2010). + +
+
HBase Presentations (Slides) + Advanced HBase Schema Design by Lars George (Hadoop World 2011). + + Introduction to HBase by Todd Lipcon (Chicago Data Summit 2011). + + Getting The Most From Your HBase Install by Ryan Rawson, Jonathan Gray (Hadoop World 2009). + +
+
HBase Papers + BigTable by Google (2006). + + HBase and HDFS Locality by Lars George (2010). + + No Relation: The Mixed Blessings of Non-Relational Databases by Ian Varley (2009). + +
+
HBase Sites + Cloudera's HBase Blog has a lot of links to useful HBase information. + + CAP Confusion is a relevant entry for background information on + distributed storage systems. + + + + HBase Wiki has a page with a number of presentations. + + HBase RefCard from DZone. + +
+
HBase Books + HBase: The Definitive Guide by Lars George. + +
+
Hadoop Books + Hadoop: The Definitive Guide by Tom White. + +
+ +
diff --git a/src/main/docbkx/performance.xml b/src/main/docbkx/performance.xml index 1757d3f43c5..42ed79bc935 100644 --- a/src/main/docbkx/performance.xml +++ b/src/main/docbkx/performance.xml @@ -273,7 +273,7 @@ tableDesc.addFamily(cfDesc); If there is enough RAM, increasing this can help. -
+
<varname>hbase.regionserver.checksum.verify</varname> Have HBase write the checksum into the datablock and save having to do the checksum seek whenever you read. diff --git a/src/main/docbkx/sql.xml b/src/main/docbkx/sql.xml new file mode 100644 index 00000000000..40f43d60f8c --- /dev/null +++ b/src/main/docbkx/sql.xml @@ -0,0 +1,40 @@ + + + + SQL over HBase +
+ Apache Phoenix + Apache Phoenix +
+
+ Trafodion + Trafodion: Transactional SQL-on-HBase +
+ +
diff --git a/src/main/docbkx/upgrading.xml b/src/main/docbkx/upgrading.xml index d5708a403e5..5d71e0f2f1b 100644 --- a/src/main/docbkx/upgrading.xml +++ b/src/main/docbkx/upgrading.xml @@ -240,7 +240,7 @@
-
+
HBase API surface HBase has a lot of API points, but for the compatibility matrix above, we differentiate between Client API, Limited Private API, and Private API. HBase uses a version of Hadoop's Interface classification. HBase's Interface classification classes can be found here. diff --git a/src/main/docbkx/ycsb.xml b/src/main/docbkx/ycsb.xml new file mode 100644 index 00000000000..695614c85ab --- /dev/null +++ b/src/main/docbkx/ycsb.xml @@ -0,0 +1,36 @@ + + + + YCSB + YCSB: The + Yahoo! Cloud Serving Benchmark and HBase + TODO: Describe how YCSB is poor for putting up a decent cluster load. + TODO: Describe setup of YCSB for HBase. In particular, presplit your tables before you + start a run. See HBASE-4163 Create Split Strategy for YCSB Benchmark for why and a little shell + command for how to do it. + Ted Dunning redid YCSB so it's mavenized and added facility for verifying workloads. See + Ted Dunning's YCSB. + + +