From a12549aa9e50390ced3990978f40421a91ee2e88 Mon Sep 17 00:00:00 2001 From: Doug Meil Date: Tue, 2 Apr 2013 18:18:50 +0000 Subject: [PATCH] hbase-8244. refguide. Moving list data schema design use-case to Schema Design chapter. git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1463657 13f79535-47bb-0310-9956-ffa450edef68 --- hbase-assembly/src/docbkx/case_studies.xml | 137 +---------------- hbase-assembly/src/docbkx/schema_design.xml | 159 ++++++++++++++++++-- 2 files changed, 151 insertions(+), 145 deletions(-) diff --git a/hbase-assembly/src/docbkx/case_studies.xml b/hbase-assembly/src/docbkx/case_studies.xml index 2e3bba0432f..00230bc6d38 100644 --- a/hbase-assembly/src/docbkx/case_studies.xml +++ b/hbase-assembly/src/docbkx/case_studies.xml @@ -37,141 +37,8 @@
Schema Design - -
- List Data - The following is an exchange from the user dist-list regarding a fairly common question: - how to handle per-user list data in Apache HBase. - - *** QUESTION *** - - We're looking at how to store a large amount of (per-user) list data in -HBase, and we were trying to figure out what kind of access pattern made -the most sense. One option is store the majority of the data in a key, so -we could have something like: - - - -<FixedWidthUserName><FixedWidthValueId1>:"" (no value) -<FixedWidthUserName><FixedWidthValueId2>:"" (no value) -<FixedWidthUserName><FixedWidthValueId3>:"" (no value) - - -The other option we had was to do this entirely using: - -<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>... -<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>... - - -where each row would contain multiple values. -So in one case reading the first thirty values would be: - - -scan { STARTROW => 'FixedWidthUsername' LIMIT => 30} - -And in the second case it would be - -get 'FixedWidthUserName\x00\x00\x00\x00' - - -The general usage pattern would be to read only the first 30 values of -these lists, with infrequent access reading deeper into the lists. Some -users would have <= 30 total values in these lists, and some users would -have millions (i.e. power-law distribution) - - - The single-value format seems like it would take up more space on HBase, -but would offer some improved retrieval / pagination flexibility. Would -there be any significant performance advantages to be able to paginate via -gets vs paginating with scans? - - - My initial understanding was that doing a scan should be faster if our -paging size is unknown (and caching is set appropriately), but that gets -should be faster if we'll always need the same page size. I've ended up -hearing different people tell me opposite things about performance. I -assume the page sizes would be relatively consistent, so for most use cases -we could guarantee that we only wanted one page of data in the -fixed-page-length case. I would also assume that we would have infrequent -updates, but may have inserts into the middle of these lists (meaning we'd -need to update all subsequent rows). - - -Thanks for help / suggestions / follow-up questions. - - *** ANSWER *** - -If I understand you correctly, you're ultimately trying to store -triples in the form "user, valueid, value", right? E.g., something -like: - - -"user123, firstname, Paul", -"user234, lastname, Smith" - - -(But the usernames are fixed width, and the valueids are fixed width). - - -And, your access pattern is along the lines of: "for user X, list the -next 30 values, starting with valueid Y". Is that right? And these -values should be returned sorted by valueid? - - -The tl;dr version is that you should probably go with one row per -user+value, and not build a complicated intra-row pagination scheme on -your own unless you're really sure it is needed. - - -Your two options mirror a common question people have when designing -HBase schemas: should I go "tall" or "wide"? Your first schema is -"tall": each row represents one value for one user, and so there are -many rows in the table for each user; the row key is user + valueid, -and there would be (presumably) a single column qualifier that means -"the value". This is great if you want to scan over rows in sorted -order by row key (thus my question above, about whether these ids are -sorted correctly). 
You can start a scan at any user+valueid, read the -next 30, and be done. What you're giving up is the ability to have -transactional guarantees around all the rows for one user, but it -doesn't sound like you need that. Doing it this way is generally -recommended (see -here http://hbase.apache.org/book.html#schema.smackdown). - - -Your second option is "wide": you store a bunch of values in one row, -using different qualifiers (where the qualifier is the valueid). The -simple way to do that would be to just store ALL values for one user -in a single row. I'm guessing you jumped to the "paginated" version -because you're assuming that storing millions of columns in a single -row would be bad for performance, which may or may not be true; as -long as you're not trying to do too much in a single request, or do -things like scanning over and returning all of the cells in the row, -it shouldn't be fundamentally worse. The client has methods that allow -you to get specific slices of columns. - - -Note that neither case fundamentally uses more disk space than the -other; you're just "shifting" part of the identifying information for -a value either to the left (into the row key, in option one) or to the -right (into the column qualifiers in option 2). Under the covers, -every key/value still stores the whole row key, and column family -name. (If this is a bit confusing, take an hour and watch Lars -George's excellent video about understanding HBase schema design: -http://www.youtube.com/watch?v=_HLoH_PgrLk). - - -A manually paginated version has lots more complexities, as you note, -like having to keep track of how many things are in each page, -re-shuffling if new values are inserted, etc. That seems significantly -more complex. It might have some slight speed advantages (or -disadvantages!) at extremely high throughput, and the only way to -really know that would be to try it out. If you don't have time to -build it both ways and compare, my advice would be to start with the -simplest option (one row per user+value). Start simple and iterate! :) - - -
- + See the schema design case studies here: +
diff --git a/hbase-assembly/src/docbkx/schema_design.xml b/hbase-assembly/src/docbkx/schema_design.xml index 593e8abd7fb..87e3f05c5ce 100644 --- a/hbase-assembly/src/docbkx/schema_design.xml +++ b/hbase-assembly/src/docbkx/schema_design.xml @@ -431,16 +431,20 @@ public static byte[][] getHexSplits(String startKey, String endKey, int numRegio can be approached. Note: this is just an illustration of potential approaches, not an exhaustive list. Know your data, and know your processing requirements. - There are 3 case studies described: + It is highly recommended that you read the rest of the Schema Design chapter before reading + these case studies. + + The following case studies are described: Log Data / Timeseries Data Log Data / Timeseries on Steroids Customer/Sales + Tall/Wide/Middle Schema Design + List Data - ... and then a brief section on "Tall/Wide/Middle" in terms of schema design approaches.
- Log Data and Timeseries Data Case Study + Case Study - Log Data and Timeseries Data Assume that the following data elements are being collected. Hostname @@ -524,9 +528,11 @@ long bucket = timestamp % numBuckets;
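To make the bucketing pattern concrete (the long bucket = timestamp % numBuckets; idea referenced above), here is a minimal Java sketch of composing such a rowkey with the client's Bytes utility. The helper name, the bucket count, and the [bucket][timestamp][hostname] layout are illustrative assumptions, not code from this chapter:

import org.apache.hadoop.hbase.util.Bytes;

public class LogRowKey {
  // Assumption for illustration: the bucket count is fixed up front;
  // changing it later changes where existing keys sort.
  private static final int NUM_BUCKETS = 16;

  // Compose [bucket][timestamp][hostname] so that writes spread across
  // NUM_BUCKETS key ranges instead of piling onto one hot region.
  public static byte[] of(long timestamp, String hostname) {
    long bucket = timestamp % NUM_BUCKETS;
    return Bytes.add(Bytes.toBytes(bucket),
                     Bytes.toBytes(timestamp),
                     Bytes.toBytes(hostname));
  }
}

The read-side tradeoff is that a time-range query must issue one scan per bucket and merge the results client-side.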
- Log Data and Timeseries Data on Steroids Case Study + Case Study - Log Data and Timeseries Data on Steroids This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for - certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html. + certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, + and Lessons Learned from OpenTSDB + from HBaseCon2012. But this is how the general concept works: data is ingested, for example, in this manner… @@ -544,7 +550,7 @@ long bucket = timestamp % numBuckets;
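As a rough sketch of that re-write/packing concept (not actual OpenTSDB code), a row can represent one hour of a metric, with each event stored as a column whose qualifier is the offset into that hour. The family name "t", the metric argument, and the method shape are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HourPacker {
  // Pack an event into an hour-wide row:
  // rowkey = [metric][hour base timestamp], qualifier = seconds into the hour.
  public static void packEvent(HTable table, String metric,
      long eventTsSeconds, double value) throws IOException {
    long hourBase = eventTsSeconds - (eventTsSeconds % 3600);
    int offset = (int) (eventTsSeconds - hourBase); // 0..3599 within the row
    Put put = new Put(Bytes.add(Bytes.toBytes(metric), Bytes.toBytes(hourBase)));
    put.add(Bytes.toBytes("t"), Bytes.toBytes(offset), Bytes.toBytes(value));
    table.put(put);
  }
}

One row then holds up to 3600 events, which is what makes this layout I/O efficient for time-range reads.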
- Customer / Sales Case Study + Case Study - Customer / Sales Assume that HBase is used to store customer and sales information. There are two core record types being ingested: a Customer record type and a Sales record type. @@ -612,7 +618,7 @@ reasonable spread in the keyspace, similar options appear:
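For illustration, a sketch of how a composite rowkey for the Sales records might be assembled in Java, assuming fixed-width, zero-padded numbers so that byte order matches numeric order; the widths and the helper name are assumptions, not the chapter's code:

import org.apache.hadoop.hbase.util.Bytes;

public class SalesRowKey {
  // [customer number][sales order number]: left-padding both to a fixed
  // width keeps a customer's sales contiguous, so a prefix scan on the
  // customer number returns all of that customer's sales records.
  public static byte[] of(long customerNumber, long orderNumber) {
    return Bytes.add(Bytes.toBytes(String.format("%012d", customerNumber)),
                     Bytes.toBytes(String.format("%012d", orderNumber)));
  }
}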
-
"Tall/Wide/Middle" Schema Design Smackdown +
Case Study - "Tall/Wide/Middle" Schema Design Smackdown This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs. @@ -638,11 +644,145 @@ reasonable spread in the keyspace, similar options appear: OpenTSDB is the best example of this case where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see - Lessons Learned from OpenTSDB - from HBaseCon2012. + .
+ +
+ Case Study - List Data
+ The following is an exchange from the user dist-list regarding a fairly common question:
+ how to handle per-user list data in Apache HBase.
+
+ *** QUESTION ***
+
+ We're looking at how to store a large amount of (per-user) list data in
+HBase, and we were trying to figure out what kind of access pattern made
+the most sense. One option is to store the majority of the data in a key, so
+we could have something like:
+
+
+<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
+<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
+<FixedWidthUserName><FixedWidthValueId3>:"" (no value)
+
+
+The other option we had was to do this entirely using:
+
+<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
+<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
+
+
+where each row would contain multiple values.
+So in one case reading the first thirty values would be:
+
+scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
+
+And in the second case it would be
+
+get 'FixedWidthUserName\x00\x00\x00\x00'
+
+
+The general usage pattern would be to read only the first 30 values of
+these lists, with infrequent access reading deeper into the lists. Some
+users would have <= 30 total values in these lists, and some users would
+have millions (i.e., a power-law distribution).
+
+
+ The single-value format seems like it would take up more space in HBase,
+but would offer some improved retrieval / pagination flexibility. Would
+there be any significant performance advantages to paginating via
+gets vs. paginating with scans?
+
+
+ My initial understanding was that doing a scan should be faster if our
+paging size is unknown (and caching is set appropriately), but that gets
+should be faster if we'll always need the same page size. I've ended up
+hearing different people tell me opposite things about performance. I
+assume the page sizes would be relatively consistent, so for most use cases
+we could guarantee that we only wanted one page of data in the
+fixed-page-length case. I would also assume that we would have infrequent
+updates, but may have inserts into the middle of these lists (meaning we'd
+need to update all subsequent rows).
+
+
+Thanks for help / suggestions / follow-up questions.
+
+ *** ANSWER ***
+
+If I understand you correctly, you're ultimately trying to store
+triples in the form "user, valueid, value", right? E.g., something
+like:
+
+
+"user123, firstname, Paul",
+"user234, lastname, Smith"
+
+
+(But the usernames are fixed width, and the valueids are fixed width).
+
+
+And your access pattern is along the lines of: "for user X, list the
+next 30 values, starting with valueid Y". Is that right? And these
+values should be returned sorted by valueid?
+
+
+The tl;dr version is that you should probably go with one row per
+user+value, and not build a complicated intra-row pagination scheme on
+your own unless you're really sure it is needed.
+
+
+Your two options mirror a common question people have when designing
+HBase schemas: should I go "tall" or "wide"? Your first schema is
+"tall": each row represents one value for one user, and so there are
+many rows in the table for each user; the row key is user + valueid,
+and there would be (presumably) a single column qualifier that means
+"the value". 
This is great if you want to scan over rows in sorted
+order by row key (thus my question above, about whether these ids are
+sorted correctly). You can start a scan at any user+valueid, read the
+next 30, and be done. What you're giving up is the ability to have
+transactional guarantees around all the rows for one user, but it
+doesn't sound like you need that. Doing it this way is generally
+recommended (see
+here http://hbase.apache.org/book.html#schema.smackdown).
+
+
+Your second option is "wide": you store a bunch of values in one row,
+using different qualifiers (where the qualifier is the valueid). The
+simple way to do that would be to just store ALL values for one user
+in a single row. I'm guessing you jumped to the "paginated" version
+because you're assuming that storing millions of columns in a single
+row would be bad for performance, which may or may not be true; as
+long as you're not trying to do too much in a single request, or do
+things like scanning over and returning all of the cells in the row,
+it shouldn't be fundamentally worse. The client has methods that allow
+you to get specific slices of columns.
+
+
+Note that neither case fundamentally uses more disk space than the
+other; you're just "shifting" part of the identifying information for
+a value either to the left (into the row key, in option one) or to the
+right (into the column qualifiers, in option two). Under the covers,
+every key/value still stores the whole row key and column family
+name. (If this is a bit confusing, take an hour and watch Lars
+George's excellent video about understanding HBase schema design:
+http://www.youtube.com/watch?v=_HLoH_PgrLk).
+
+
+A manually paginated version has lots more complexities, as you note,
+like having to keep track of how many things are in each page, and
+re-shuffling if new values are inserted. It might have some slight
+speed advantages (or disadvantages!) at extremely high throughput,
+and the only way to really know that would be to try it out. If you
+don't have time to build it both ways and compare, my advice would be
+to start with the simplest option (one row per user+value). Start
+simple and iterate! :)
+
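To ground the answer's advice, here is a minimal Java sketch of both access paths against the client API of this era. The page size of 30 and the fixed-width user key come from the question; the class and method names are made up, and note that PageFilter is evaluated independently on each region, so the client re-checks the limit:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
import org.apache.hadoop.hbase.filter.PageFilter;

public class ListDataAccess {
  private static final int PAGE = 30;

  // "Tall" option: one row per user+valueid. Start the scan at the user's
  // first possible key and take the first 30 rows. PageFilter is applied
  // per region, so the loop also enforces the limit client-side.
  public static void firstPageTall(HTable table, byte[] fixedWidthUser)
      throws IOException {
    Scan scan = new Scan(fixedWidthUser);
    scan.setFilter(new PageFilter(PAGE));
    scan.setCaching(PAGE);
    ResultScanner scanner = table.getScanner(scan);
    try {
      int seen = 0;
      for (Result r : scanner) {
        if (++seen > PAGE) break;
        // r.getRow() is [user][valueid]; the "value" lives in the key itself
      }
    } finally {
      scanner.close();
    }
  }

  // "Wide" option: one row per user, one qualifier per valueid.
  // ColumnPaginationFilter slices out 30 columns starting at an offset.
  public static Result pageWide(HTable table, byte[] fixedWidthUser, int offset)
      throws IOException {
    Get get = new Get(fixedWidthUser);
    get.setFilter(new ColumnPaginationFilter(PAGE, offset));
    return table.get(get);
  }
}

ColumnPaginationFilter is one of the "methods that allow you to get specific slices of columns" that the answer alludes to for the wide option.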
Operational and Performance Configuration Options @@ -652,4 +792,3 @@ reasonable spread in the keyspace, similar options appear:
-