hbase-5753.xml - adding schema design case study in Case Studies chapter
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1311411 13f79535-47bb-0310-9956-ffa450edef68
parent 66772ce043
commit 56d1587b7e
@ -34,6 +34,149 @@
<para>For more information on Performance and Troubleshooting, see <xref linkend="performance"/> and <xref linkend="trouble"/>.
</para>
</section>

<section xml:id="casestudies.schema">
<title>Schema Design</title>

<section xml:id="casestudies.schema.listdata">
<title>List Data</title>
<para>The following is an exchange from the user dist-list regarding a fairly common question:
how to handle per-user list data in HBase.
</para>
<para>*** QUESTION ***</para>
<para>
We're looking at how to store a large amount of (per-user) list data in
HBase, and we were trying to figure out what kind of access pattern made
the most sense. One option is to store the majority of the data in a key, so
we could have something like:
</para>

<programlisting>
&lt;FixedWidthUserName&gt;&lt;FixedWidthValueId1&gt;:"" (no value)
&lt;FixedWidthUserName&gt;&lt;FixedWidthValueId2&gt;:"" (no value)
&lt;FixedWidthUserName&gt;&lt;FixedWidthValueId3&gt;:"" (no value)
</programlisting>

<para>The other option we had was to do this entirely using:</para>
<programlisting>
&lt;FixedWidthUserName&gt;&lt;FixedWidthPageNum0&gt;:&lt;FixedWidthLength&gt;&lt;FixedIdNextPageNum&gt;&lt;ValueId1&gt;&lt;ValueId2&gt;&lt;ValueId3&gt;...
&lt;FixedWidthUserName&gt;&lt;FixedWidthPageNum1&gt;:&lt;FixedWidthLength&gt;&lt;FixedIdNextPageNum&gt;&lt;ValueId1&gt;&lt;ValueId2&gt;&lt;ValueId3&gt;...
</programlisting>
<para>
where each row would contain multiple values.
So in the first case, reading the first thirty values would be:
</para>
<programlisting>
scan { STARTROW => 'FixedWidthUserName', LIMIT => 30 }
</programlisting>
<para>And in the second case it would be:</para>
<programlisting>
get 'FixedWidthUserName\x00\x00\x00\x00'
</programlisting>
<para>
The general usage pattern would be to read only the first 30 values of
these lists, with infrequent access reading deeper into the lists. Some
users would have &lt;= 30 total values in these lists, and some users would
have millions (i.e. power-law distribution).
</para>
<para>
The single-value format seems like it would take up more space on HBase,
but would offer some improved retrieval / pagination flexibility. Would
there be any significant performance advantages to be able to paginate via
gets vs paginating with scans?
</para>
<para>
My initial understanding was that doing a scan should be faster if our
paging size is unknown (and caching is set appropriately), but that gets
should be faster if we'll always need the same page size. I've ended up
hearing different people tell me opposite things about performance. I
assume the page sizes would be relatively consistent, so for most use cases
we could guarantee that we only wanted one page of data in the
fixed-page-length case. I would also assume that we would have infrequent
updates, but may have inserts into the middle of these lists (meaning we'd
need to update all subsequent rows).
</para>
<para>
Thanks for help / suggestions / follow-up questions.
</para>
<para>*** ANSWER ***</para>
<para>
If I understand you correctly, you're ultimately trying to store
triples in the form "user, valueid, value", right? E.g., something
like:
</para>
<programlisting>
"user123, firstname, Paul",
"user234, lastname, Smith"
</programlisting>
<para>
(But the usernames are fixed width, and the valueids are fixed width).
</para>
<para>
And, your access pattern is along the lines of: "for user X, list the
next 30 values, starting with valueid Y". Is that right? And these
values should be returned sorted by valueid?
</para>
<para>
The tl;dr version is that you should probably go with one row per
user+value, and not build a complicated intra-row pagination scheme on
your own unless you're really sure it is needed.
</para>
<para>
Your two options mirror a common question people have when designing
HBase schemas: should I go "tall" or "wide"? Your first schema is
"tall": each row represents one value for one user, and so there are
many rows in the table for each user; the row key is user + valueid,
and there would be (presumably) a single column qualifier that means
"the value". This is great if you want to scan over rows in sorted
order by row key (thus my question above, about whether these ids are
sorted correctly). You can start a scan at any user+valueid, read the
next 30, and be done. What you're giving up is the ability to have
transactional guarantees around all the rows for one user, but it
doesn't sound like you need that. Doing it this way is generally
recommended (see
here <link xlink:href="http://hbase.apache.org/book.html#schema.smackdown">http://hbase.apache.org/book.html#schema.smackdown</link>).
</para>
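<para>As a rough illustration of the "tall" access pattern, the following Python sketch models the table as a sorted list of row keys and pages through one user's values. It does not use the HBase client; the field widths, the <literal>row_key</literal> and <literal>scan_user</literal> helpers, and the sample users are all assumptions made for this example.</para>

```python
from bisect import bisect_left

USER_WIDTH = 8        # assumed fixed width of the user name field
VALUE_ID_WIDTH = 10   # assumed fixed width of the value id field


def user_prefix(user: str) -> bytes:
    """Pad the user name to a fixed width so every key has the same layout."""
    if len(user) > USER_WIDTH:
        raise ValueError("user name longer than the fixed width")
    return user.ljust(USER_WIDTH, "\x00").encode("ascii")


def row_key(user: str, value_id: int) -> bytes:
    """Row key = fixed-width user name + zero-padded value id."""
    return user_prefix(user) + str(value_id).zfill(VALUE_ID_WIDTH).encode("ascii")


def scan_user(sorted_keys, user, start_value_id=0, limit=30):
    """Return up to `limit` keys for `user`, starting at `start_value_id`."""
    prefix = user_prefix(user)
    i = bisect_left(sorted_keys, row_key(user, start_value_id))
    page = []
    # Stop at the limit or as soon as the keys belong to another user.
    while i < len(sorted_keys) and len(page) < limit and sorted_keys[i].startswith(prefix):
        page.append(sorted_keys[i])
        i += 1
    return page


# Toy "table": 100 values for one user, 5 for another, kept in key order.
table = sorted(
    [row_key("alice", v) for v in range(100)]
    + [row_key("bob", v) for v in range(5)]
)

first_page = scan_user(table, "alice")
next_page = scan_user(table, "alice", start_value_id=30)
```

<para>Because the value ids are zero-padded to a fixed width, byte order matches numeric order, so the next page is simply another scan that starts where the previous one ended.</para>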
<para>
Your second option is "wide": you store a bunch of values in one row,
using different qualifiers (where the qualifier is the valueid). The
simple way to do that would be to just store ALL values for one user
in a single row. I'm guessing you jumped to the "paginated" version
because you're assuming that storing millions of columns in a single
row would be bad for performance, which may or may not be true; as
long as you're not trying to do too much in a single request, or do
things like scanning over and returning all of the cells in the row,
it shouldn't be fundamentally worse. The client has methods that allow
you to get specific slices of columns.
</para>
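<para>To make the idea of a column slice concrete, here is a small Python model of one "wide" row, where each qualifier is a fixed-width value id. It is only a sketch of the concept and not the HBase API; in the Java client, filters such as <literal>ColumnPaginationFilter</literal> and <literal>ColumnRangeFilter</literal> are the kind of tool that serves this purpose. The <literal>column_slice</literal> helper and the sample data are assumptions for illustration.</para>

```python
from bisect import bisect_left


def column_slice(row, start_qualifier, limit):
    """Return up to `limit` (qualifier, value) pairs from one wide row,
    in sorted qualifier order, starting at `start_qualifier`."""
    # A real store keeps qualifiers sorted; sorting per call keeps this toy simple.
    qualifiers = sorted(row)
    i = bisect_left(qualifiers, start_qualifier)
    return [(q, row[q]) for q in qualifiers[i:i + limit]]


# One wide row for a single user: qualifier = zero-padded value id, empty cell value.
wide_row = {str(v).zfill(10).encode("ascii"): b"" for v in range(1000)}

# Read 30 columns starting at value id 30, without touching the rest of the row.
page = column_slice(wide_row, b"0000000030", 30)
```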
<para>
Note that neither case fundamentally uses more disk space than the
other; you're just "shifting" part of the identifying information for
a value either to the left (into the row key, in option one) or to the
right (into the column qualifiers, in option two). Under the covers,
every key/value still stores the whole row key, column family
name, and column qualifier. (If this is a bit confusing, take an hour and watch Lars
George's excellent video about understanding HBase schema design:
<link xlink:href="http://www.youtube.com/watch?v=_HLoH_PgrLk">http://www.youtube.com/watch?v=_HLoH_PgrLk</link>.)
</para>
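<para>The disk-space argument can be checked with some back-of-the-envelope arithmetic. The sketch below estimates the serialized size of a single cell, assuming the commonly described KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type). It ignores compression and block encoding, and the family name <literal>f</literal>, the empty qualifier in the tall case, and the sample ids are assumptions for illustration.</para>

```python
def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate on-disk size of one cell, before compression or encoding."""
    key_len = (
        2 + len(row)        # row length prefix + row key
        + 1 + len(family)   # family length prefix + family name
        + len(qualifier)    # qualifier (its length is implied by the key length)
        + 8                 # timestamp
        + 1                 # key type
    )
    return 4 + 4 + key_len + len(value)  # key length + value length + key + value


user = b"user0001"
value_id = b"0000000042"

# Tall: the value id lives in the row key; the qualifier is assumed empty.
tall = keyvalue_size(user + value_id, b"f", b"", b"")

# Wide: the value id lives in the column qualifier instead.
wide = keyvalue_size(user, b"f", value_id, b"")
```

<para>Under these assumptions both layouts come to the same number of bytes per cell; a non-empty qualifier in the tall layout would add only its own length.</para>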
<para>
A manually paginated version adds a lot of complexity, as you note:
you have to keep track of how many things are in each page,
re-shuffle pages when new values are inserted, and so on.
It might have some slight speed advantages (or
disadvantages!) at extremely high throughput, and the only way to
really know that would be to try it out. If you don't have time to
build it both ways and compare, my advice would be to start with the
simplest option (one row per user+value). Start simple and iterate! :)
</para>
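<para>To give a feel for the re-shuffling cost, the following Python sketch models manually paginated rows as fixed-capacity sorted lists and counts how many page rows must be written when a value is inserted near the front. It leaves out the length and next-page fields from the question's layout; the page capacity and the <literal>insert_paged</literal> helper are assumptions for this example.</para>

```python
from bisect import insort

PAGE_SIZE = 30  # assumed fixed number of value ids per page row


def insert_paged(pages, value_id):
    """Insert value_id into sorted, fixed-capacity pages.

    Returns the number of page rows that had to be written.
    """
    if not pages:
        pages.append([])
    # Pick the first page whose largest id is >= value_id, else the last page.
    idx = len(pages) - 1
    for i, page in enumerate(pages):
        if page and value_id <= page[-1]:
            idx = i
            break
    insort(pages[idx], value_id)
    rewritten = 1
    # Push the overflowing largest id onto the front of the next page, repeatedly.
    while len(pages[idx]) > PAGE_SIZE:
        overflow = pages[idx].pop()
        if idx + 1 == len(pages):
            pages.append([])
        pages[idx + 1].insert(0, overflow)
        idx += 1
        rewritten += 1
    return rewritten


# Three full pages holding the even ids 0, 2, ..., 178.
ids = list(range(0, 180, 2))
pages = [ids[i:i + PAGE_SIZE] for i in range(0, len(ids), PAGE_SIZE)]

# Inserting id 1 overflows every full page after it.
rows_written = insert_paged(pages, 1)
```

<para>In this model a single insert near the front touches every following page, whereas the one-row-per-value layout would need just one new row.</para>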
</section> <!-- listdata -->


</section> <!-- schema design -->

<section xml:id="casestudies.perftroub">
<title>Performance/Troubleshooting</title>

<section xml:id="casestudies.slownode">
<title>Case Study #1 (Performance Issue On A Single Node)</title>

@ -175,5 +318,7 @@ Link detected: yes
<para>See also <xref linkend="dfs.datanode.max.xcievers"/>.
</para>
</section>

</section> <!-- performance/troubleshooting -->

</chapter>