From 7dc169f6d336ddc4aea3c5a690e664aae04af0ea Mon Sep 17 00:00:00 2001
From: Doug Meil
Date: Tue, 27 Nov 2012 22:30:17 +0000
Subject: [PATCH] hbase-7223 book.xml. addition to RowKey design section about keyspace/region splits.

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1414444 13f79535-47bb-0310-9956-ffa450edef68
---
 src/docbkx/book.xml | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/src/docbkx/book.xml b/src/docbkx/book.xml
index e136273d6ec..259c75d26ec 100644
--- a/src/docbkx/book.xml
+++ b/src/docbkx/book.xml
@@ -739,6 +739,43 @@ System.out.println("md5 digest as string length: " + sbDigest.length); // ret
 inserted a lot of data).
+
Relationship Between RowKeys and Region Splits
+ If you pre-split your table, it is critical to understand how your rowkeys will be distributed across
+ the region boundaries. As an example of why this is important, consider using displayable hex characters as the
+ lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split
+ (which is the split strategy used when creating regions in HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions))
+ for 10 regions will generate the following splits...
+
+
+
+48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 // 0
+54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 // 6
+61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68 // =
+68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126 // D
+75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72 // K
+82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14 // R
+88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44 // X
+95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102 // _
+102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 // f
+
+ ... (note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f',
+ everything is great, right? Not so fast.
+
+ The problem is that all the data is going to pile up in the first 2 regions and the last region, thus creating a "lumpy" (and
+ possibly "hot") region problem. To understand why, refer to an ASCII table.
+ '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this
+ keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will
+ never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., one that does not rely on the
+ built-in split method) is required.
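The gap described above can be checked with a small standalone sketch. It does not use the HBase API. It copies the nine split keys from the listing, assumes that rowkeys are compared as unsigned bytes in lexicographic order, and assumes that a key belongs to the region whose start key is the largest split less than or equal to it. The class name, the helper methods, and the 256 sample keys are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: simulates region assignment for hex rowkeys against
// the split keys shown in the listing. Not part of the HBase API.
final class HexKeyRegionDemo {

    // Split keys copied from the listing (signed Java bytes):
    // lead byte, 14 fill bytes, last byte.
    static final byte[][] SPLITS = {
        row(48, 48, 48),       // 0
        row(54, -10, -10),     // 6
        row(61, -67, -68),     // =
        row(68, -124, -126),   // D
        row(75, 75, 72),       // K
        row(82, 18, 14),       // R
        row(88, -40, -44),     // X
        row(95, -97, -102),    // _
        row(102, 102, 102),    // f
    };

    static byte[] row(int lead, int fill, int last) {
        byte[] key = new byte[16];
        Arrays.fill(key, (byte) fill);
        key[0] = (byte) lead;
        key[15] = (byte) last;
        return key;
    }

    // Lexicographic comparison treating each byte as unsigned (assumed ordering).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Region index = number of split keys that are <= the rowkey.
    // Index 0 is the region before the first split key.
    static int regionFor(byte[] rowKey) {
        int region = 0;
        for (byte[] split : SPLITS) {
            if (compareUnsigned(split, rowKey) <= 0) {
                region++;
            }
        }
        return region;
    }

    // Counts sample keys per region: every two-character hex prefix
    // ("00" to "ff") followed by fourteen '0' characters.
    static int[] countKeysPerRegion() {
        int[] counts = new int[SPLITS.length + 1];
        for (int prefix = 0; prefix < 256; prefix++) {
            String key = String.format("%02x", prefix) + "00000000000000";
            counts[regionFor(key.getBytes(StandardCharsets.US_ASCII))]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(countKeysPerRegion()));
    }
}
```

Under these assumptions the sample keys land in only three of the ten regions, and the regions whose start keys begin with '=', 'D', 'K', 'R' and 'X' receive none.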
+
+ Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the
+ regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen
+ with any keyspace. Know your data.
+
+ Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split
+ tables as long as all the created regions are accessible in the keyspace.
+
+
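One possible way to build the custom split definition mentioned above is to compute the boundaries in the hex keyspace itself, so that every split key is a string that can actually occur as a rowkey. The following is a minimal sketch under that assumption. It treats each fixed-length lowercase hex key as a base-16 number and divides that range evenly. The class and method names are hypothetical, and it assumes the keys are spread roughly uniformly over the hex range.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: computes evenly spaced split points for a keyspace of
// fixed-length lowercase hex strings. Not part of the HBase API.
final class HexSplitPoints {

    // Returns numRegions - 1 split keys, each keyLength hex characters long.
    static List<String> splitPoints(int numRegions, int keyLength) {
        if (numRegions < 2 || keyLength < 1) {
            throw new IllegalArgumentException("need at least 2 regions and 1 key character");
        }
        // Size of the keyspace: 16^keyLength distinct hex strings.
        BigInteger keyspace = BigInteger.valueOf(16).pow(keyLength);
        BigInteger regions = BigInteger.valueOf(numRegions);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // i-th boundary = floor(keyspace * i / numRegions).
            BigInteger boundary = keyspace.multiply(BigInteger.valueOf(i)).divide(regions);
            String hex = boundary.toString(16); // lowercase hex digits
            // Left-pad with '0' so every split key has the full key length.
            StringBuilder padded = new StringBuilder();
            for (int p = hex.length(); p < keyLength; p++) {
                padded.append('0');
            }
            splits.add(padded.append(hex).toString());
        }
        return splits;
    }

    public static void main(String[] args) {
        for (String split : splitPoints(10, 16)) {
            System.out.println(split);
        }
    }
}
```

Converted to bytes, the resulting strings could be supplied as explicit split keys when creating the table instead of relying on the built-in split method. Because each boundary is itself a valid hex key, no region is confined to the unused byte range between '9' and 'a'.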