From ac2e1c33fd32a6b473ebbfdc32f5e631a69f2a6d Mon Sep 17 00:00:00 2001 From: Jonathan M Hsieh Date: Tue, 19 Aug 2014 16:10:37 -0700 Subject: [PATCH] HBASE-11682 Explain Hotspotting (Misty Stanley-Jones) --- src/main/docbkx/schema_design.xml | 96 +++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/src/main/docbkx/schema_design.xml b/src/main/docbkx/schema_design.xml index de05c140645..efbcb556a48 100644 --- a/src/main/docbkx/schema_design.xml +++ b/src/main/docbkx/schema_design.xml @@ -99,6 +99,102 @@ admin.enableTable(table);
Rowkey Design +
+ Hotspotting + Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, + allowing you to store related rows, or rows that will be read together, near each other. + However, poorly designed row keys are a common source of hotspotting. + Hotspotting occurs when a large amount of client traffic is directed at one node, or only a + few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The + traffic overwhelms the single machine responsible for hosting that region, causing + performance degradation and potentially leading to region unavailability. This can also have + adverse effects on other regions hosted by the same region server as that host is unable to + service the requested load. It is important to design data access patterns such that the + cluster is fully and evenly utilized. + To prevent hotspotting on writes, design your row keys such that rows that truly do need + to be in the same region are, but in the bigger picture, data is being written to multiple + regions across the cluster, rather than one at a time. Some common techniques for avoiding + hotspotting are described below, along with some of their advantages and drawbacks. + + Salting + Salting in this sense has nothing to do with cryptography, but refers to adding random + data to the start of a row key. In this case, salting refers to adding a randomly-assigned + prefix to the row key to cause it to sort differently than it otherwise would. The number + of possible prefixes correspond to the number of regions you want to spread the data + across. Salting can be helpful if you have a few "hot" row key patterns which come up over + and over amongst other more evenly-distributed rows. Consider the following example, which + shows that salting can spread write load across multiple regionservers, and illustrates + some of the negative implications for reads. + + + Salting Example + Suppose you have the following list of row keys, and your table is split such that + there is one region for each letter of the alphabet. Prefix 'a' is one region, prefix 'b' + is another. In this table, all rows starting with 'f' are in the same region. This example + focuses on rows with keys like the following: + +foo0001 +foo0002 +foo0003 +foo0004 + + Now, imagine that you would like to spread these across four different regions. You + decide to use four different salts: a, b, + c, and d. In this scenario, each of these letter + prefixes will be on a different region. After applying the salts, you have the following + rowkeys instead. Since you can now write to four separate regions, you theoretically have + four times the throughput when writing that you would have if all the writes were going to + the same region. + +a-foo0003 +b-foo0001 +c-foo0004 +d-foo0002 + + Then, if you add another row, it will randomly be assigned one of the four possible + salt values and end up near one of the existing rows. + +a-foo0003 +b-foo0001 +c-foo0003 +c-foo0004 +d-foo0002 + + Since this assignment will be random, you will need to do more work if you want to + retrieve the rows in lexicographic order. In this way, salting attempts to increase + throughput on writes, but has a cost during reads. + + + + Hashing + Instead of a random assignment, you could use a one-way hash + that would cause a given row to always be "salted" with the same prefix, in a way that + would spread the load across the regionservers, but allow for predictability during reads. + Using a deterministic hash allows the client to reconstruct the complete rowkey and use a + Get operation to retrieve that row as normal. + + + Hashing Example + Given the same situation in the salting example above, you could instead apply a + one-way hash that would cause the row with key foo0003 to always, and + predictably, receive the a prefix. Then, to retrieve that row, you + would already know the key. You could also optimize things so that certain pairs of keys + were always in the same region, for instance. + + + Reversing the Key + A third common trick for preventing hotspotting is to reverse a fixed-width or numeric + row key so that the part that changes the most often (the least significant digit) is first. + This effectively randomizes row keys, but sacrifices row ordering properties. + + See https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables, + and article on Salted Tables + from the Phoenix project, and the discussion in the comments of HBASE-11682 for more + information about avoiding hotspotting. +
Monotonically Increasing Row Keys/Timeseries Data