HBASE-11476 Expand 'Conceptual View' section of Data Model chapter (Misty Stanley-Jones)
parent
2153a92fa9
commit
92c3b877c0
|
@@ -91,38 +91,129 @@
|
||||||
<chapter
|
<chapter
|
||||||
xml:id="datamodel">
|
xml:id="datamodel">
|
||||||
<title>Data Model</title>
|
<title>Data Model</title>
|
||||||
<para>In short, applications store data into an HBase table. Tables are made of rows and
|
<para>In HBase, data is stored in tables, which have rows and columns. This terminology
|
||||||
columns. All columns in HBase belong to a particular column family. Table cells -- the
|
overlaps with that of relational databases (RDBMSs), but the analogy is not helpful. Instead, it can
|
||||||
intersection of row and column coordinates -- are versioned. A cell’s content is an
|
be helpful to think of an HBase table as a multi-dimensional map.</para>
|
||||||
uninterpreted array of bytes. </para>
|
<variablelist>
|
||||||
<para>Table row keys are also byte arrays so almost anything can serve as a row key from strings
|
<title>HBase Data Model Terminology</title>
|
||||||
to binary representations of longs or even serialized data structures. Rows in HBase tables
|
<varlistentry>
|
||||||
are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key
|
<term>Table</term>
|
||||||
-- its primary key. </para>
|
<listitem>
|
||||||
|
<para>An HBase table consists of multiple rows.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Row</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A row in HBase consists of a row key and one or more columns with values associated
|
||||||
|
with them. Rows are stored sorted by row key, in lexicographic byte order. For this
|
||||||
|
reason, the design of the row key is very important. The goal is to store data in such a
|
||||||
|
way that related rows are near each other. A common row key pattern is a website domain.
|
||||||
|
If your row keys are domains, you should probably store them in reverse (org.apache.www,
|
||||||
|
org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each
|
||||||
|
other in the table, rather than being spread out based on the first letter of the
|
||||||
|
subdomain.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Column</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A column in HBase consists of a column family and a column qualifier, which are
|
||||||
|
delimited by a <literal>:</literal> (colon) character.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Column Family</term>
|
||||||
|
<listitem>
|
||||||
|
<para>Column families physically colocate a set of columns and their values, often for
|
||||||
|
performance reasons. Each column family has a set of storage properties, such as whether
|
||||||
|
its values should be cached in memory, how its data is compressed or its row keys are
|
||||||
|
encoded, and others. Each row in a table has the same column
|
||||||
|
families, though a given row might not store anything in a given column family.</para>
|
||||||
|
<para>Column families are specified when you create your table, and influence the way your
|
||||||
|
data is stored in the underlying filesystem. Therefore, the column families should be
|
||||||
|
considered carefully during schema design.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Column Qualifier</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A column qualifier is added to a column family to provide the index for a given
|
||||||
|
piece of data. Given a column family <literal>content</literal>, a column qualifier
|
||||||
|
might be <literal>content:html</literal>, and another might be
|
||||||
|
<literal>content:pdf</literal>. Though column families are fixed at table creation,
|
||||||
|
column qualifiers are mutable and may differ greatly between rows.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Cell</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A cell is a combination of row, column family, and column qualifier, and contains a
|
||||||
|
value and a timestamp, which represents the value's version.</para>
|
||||||
|
<para>A cell's value is an uninterpreted array of bytes.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Timestamp</term>
|
||||||
|
<listitem>
|
||||||
|
<para>A timestamp is written alongside each value, and is the identifier for a given
|
||||||
|
version of a value. By default, the timestamp represents the time on the RegionServer
|
||||||
|
when the data was written, but you can specify a different timestamp value when you put
|
||||||
|
data into the cell.</para>
|
||||||
|
<caution>
|
||||||
|
<para>Direct manipulation of timestamps is an advanced feature which is only exposed for
|
||||||
|
special cases that are deeply integrated with HBase, and is discouraged in general.
|
||||||
|
Encoding a timestamp at the application level is the preferred pattern.</para>
|
||||||
|
</caution>
|
||||||
|
<para>You can specify the maximum number of versions of a value that HBase retains, per column
|
||||||
|
family. When the maximum number of versions is reached, the oldest versions are
|
||||||
|
eventually deleted. By default, only the newest version is kept.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
</variablelist>
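<para>The row key advice in the Row entry above can be sketched in a few lines of Python. This is
only an illustration, not HBase code: <literal>reverse_domain</literal> and the sample domain list
are hypothetical names introduced here. The keys are compared as raw bytes to mimic the
byte-ordered sort of row keys.</para>

```python
def reverse_domain(domain: str) -> str:
    """Reverse the dot-separated labels, e.g. www.apache.org -> org.apache.www."""
    return ".".join(reversed(domain.split(".")))


# Hypothetical sample data, chosen so the two orderings differ visibly.
domains = [
    "www.apache.org",
    "mail.apache.org",
    "news.example.com",
    "jira.apache.org",
    "www.example.com",
]

# Encode to bytes before sorting so the comparison is byte-ordered.
plain_keys = sorted(d.encode("utf-8") for d in domains)
reversed_keys = sorted(reverse_domain(d).encode("utf-8") for d in domains)

print("plain:   ", plain_keys)
print("reversed:", reversed_keys)
```

<para>With the plain keys, <literal>news.example.com</literal> sorts between the Apache domains.
With the reversed keys, every <literal>org.apache</literal> row is adjacent.</para>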
|
||||||
|
|
||||||
<section
|
<section
|
||||||
xml:id="conceptual.view">
|
xml:id="conceptual.view">
|
||||||
<title>Conceptual View</title>
|
<title>Conceptual View</title>
|
||||||
|
<para>You can read a very understandable explanation of the HBase data model in the blog post <link
|
||||||
|
xlink:href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding
|
||||||
|
HBase and BigTable</link> by Jim R. Wilson. Another good explanation is available in the
|
||||||
|
PDF <link
|
||||||
|
xlink:href="http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf">Introduction
|
||||||
|
to Basic Schema Design</link> by Amandeep Khurana. It may help to read different
|
||||||
|
perspectives to get a solid understanding of HBase schema design. The linked articles cover
|
||||||
|
the same ground as the information in this section.</para>
|
||||||
<para> The following example is a slightly modified form of the one on page 2 of the <link
|
<para> The following example is a slightly modified form of the one on page 2 of the <link
|
||||||
xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
|
xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
|
||||||
is a table called <varname>webtable</varname> that contains two column families named
|
is a table called <varname>webtable</varname> that contains two rows
|
||||||
<varname>contents</varname> and <varname>anchor</varname>. In this example,
|
(<literal>com.cnn.www</literal>
|
||||||
|
and <literal>com.example.www</literal>) and three column families named
|
||||||
|
<varname>contents</varname>, <varname>anchor</varname>, and <varname>people</varname>. In
|
||||||
|
this example, for the first row (<literal>com.cnn.www</literal>),
|
||||||
<varname>anchor</varname> contains two columns (<varname>anchor:cnnsi.com</varname>,
|
<varname>anchor</varname> contains two columns (<varname>anchor:cnnsi.com</varname>,
|
||||||
<varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
|
<varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
|
||||||
(<varname>contents:html</varname>). <note>
|
(<varname>contents:html</varname>). This example contains 5 versions of the row with the
|
||||||
|
row key <literal>com.cnn.www</literal>, and one version of the row with the row key
|
||||||
|
<literal>com.example.www</literal>. The <varname>contents:html</varname> column qualifier contains the entire
|
||||||
|
HTML of a given website. Qualifiers of the <varname>anchor</varname> column family each
|
||||||
|
contain the external site which links to the site represented by the row, along with the
|
||||||
|
text it used in the anchor of its link. The <varname>people</varname> column family represents
|
||||||
|
people associated with the site.
|
||||||
|
</para>
|
||||||
|
<note>
|
||||||
<title>Column Names</title>
|
<title>Column Names</title>
|
||||||
<para> By convention, a column name is made of its column family prefix and a
|
<para> By convention, a column name is made of its column family prefix and a
|
||||||
<emphasis>qualifier</emphasis>. For example, the column
|
<emphasis>qualifier</emphasis>. For example, the column
|
||||||
<emphasis>contents:html</emphasis> is made up of the column family
|
<emphasis>contents:html</emphasis> is made up of the column family
|
||||||
<varname>contents</varname> and <varname>html</varname> qualifier. The colon character
|
<varname>contents</varname> and the <varname>html</varname> qualifier. The colon
|
||||||
(<literal>:</literal>) delimits the column family from the column family
|
character (<literal>:</literal>) delimits the column family from the column family
|
||||||
<emphasis>qualifier</emphasis>. </para>
|
<emphasis>qualifier</emphasis>. </para>
|
||||||
</note>
|
</note>
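<para>The naming convention in the note above can be illustrated with a small Python sketch.
<literal>split_column</literal> is a hypothetical helper, not an HBase API. Splitting at the first
colon is an assumption of this sketch, so that a qualifier may itself contain colons.</para>

```python
from typing import Tuple


def split_column(name: bytes) -> Tuple[bytes, bytes]:
    """Split a column name into (column family, qualifier) at the first colon."""
    family, delimiter, qualifier = name.partition(b":")
    if not delimiter:
        # No colon at all: this sketch treats that as an invalid column name.
        raise ValueError("column name has no ':' delimiter: %r" % (name,))
    return family, qualifier


family, qualifier = split_column(b"contents:html")
print(family, qualifier)
```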
|
||||||
<table
|
<table
|
||||||
frame="all">
|
frame="all">
|
||||||
<title>Table <varname>webtable</varname></title>
|
<title>Table <varname>webtable</varname></title>
|
||||||
<tgroup
|
<tgroup
|
||||||
cols="4"
|
cols="5"
|
||||||
align="left"
|
align="left"
|
||||||
colsep="1"
|
colsep="1"
|
||||||
rowsep="1">
|
rowsep="1">
|
||||||
|
@@ -134,12 +225,15 @@
|
||||||
colname="c3" />
|
colname="c3" />
|
||||||
<colspec
|
<colspec
|
||||||
colname="c4" />
|
colname="c4" />
|
||||||
|
<colspec
|
||||||
|
colname="c5" />
|
||||||
<thead>
|
<thead>
|
||||||
<row>
|
<row>
|
||||||
<entry>Row Key</entry>
|
<entry>Row Key</entry>
|
||||||
<entry>Time Stamp</entry>
|
<entry>Time Stamp</entry>
|
||||||
<entry>ColumnFamily <varname>contents</varname></entry>
|
<entry>ColumnFamily <varname>contents</varname></entry>
|
||||||
<entry>ColumnFamily <varname>anchor</varname></entry>
|
<entry>ColumnFamily <varname>anchor</varname></entry>
|
||||||
|
<entry>ColumnFamily <varname>people</varname></entry>
|
||||||
</row>
|
</row>
|
||||||
</thead>
|
</thead>
|
||||||
<tbody>
|
<tbody>
|
||||||
|
@@ -148,43 +242,85 @@
|
||||||
<entry>t9</entry>
|
<entry>t9</entry>
|
||||||
<entry />
|
<entry />
|
||||||
<entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
|
<entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
|
||||||
|
<entry />
|
||||||
</row>
|
</row>
|
||||||
<row>
|
<row>
|
||||||
<entry>"com.cnn.www"</entry>
|
<entry>"com.cnn.www"</entry>
|
||||||
<entry>t8</entry>
|
<entry>t8</entry>
|
||||||
<entry />
|
<entry />
|
||||||
<entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
|
<entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
|
||||||
|
<entry />
|
||||||
</row>
|
</row>
|
||||||
<row>
|
<row>
|
||||||
<entry>"com.cnn.www"</entry>
|
<entry>"com.cnn.www"</entry>
|
||||||
<entry>t6</entry>
|
<entry>t6</entry>
|
||||||
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
<entry />
|
<entry />
|
||||||
|
<entry />
|
||||||
</row>
|
</row>
|
||||||
<row>
|
<row>
|
||||||
<entry>"com.cnn.www"</entry>
|
<entry>"com.cnn.www"</entry>
|
||||||
<entry>t5</entry>
|
<entry>t5</entry>
|
||||||
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
<entry />
|
<entry />
|
||||||
|
<entry />
|
||||||
</row>
|
</row>
|
||||||
<row>
|
<row>
|
||||||
<entry>"com.cnn.www"</entry>
|
<entry>"com.cnn.www"</entry>
|
||||||
<entry>t3</entry>
|
<entry>t3</entry>
|
||||||
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
<entry />
|
<entry />
|
||||||
|
<entry />
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry>"com.example.www"</entry>
|
||||||
|
<entry>t5</entry>
|
||||||
|
<entry><varname>contents:html</varname> = "<html>..."</entry>
|
||||||
|
<entry></entry>
|
||||||
|
<entry><varname>people:author</varname> = "John Doe"</entry>
|
||||||
</row>
|
</row>
|
||||||
</tbody>
|
</tbody>
|
||||||
</tgroup>
|
</tgroup>
|
||||||
</table>
|
</table>
|
||||||
</para>
|
<para>Cells in this table that appear to be empty do not take space, or in fact exist, in
|
||||||
|
HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
|
||||||
|
look at data in HBase, or even the most accurate. The following represents the same
|
||||||
|
information as a multi-dimensional map. This is only a mock-up for illustrative
|
||||||
|
purposes and may not be strictly accurate.</para>
|
||||||
|
<programlisting><![CDATA[
|
||||||
|
{
|
||||||
|
"com.cnn.www": {
|
||||||
|
contents: {
|
||||||
|
t6: contents:html: "<html>..."
|
||||||
|
t5: contents:html: "<html>..."
|
||||||
|
t3: contents:html: "<html>..."
|
||||||
|
}
|
||||||
|
anchor: {
|
||||||
|
t9: anchor:cnnsi.com: "CNN"
|
||||||
|
t8: anchor:my.look.ca: "CNN.com"
|
||||||
|
}
|
||||||
|
people: {}
|
||||||
|
}
|
||||||
|
"com.example.www": {
|
||||||
|
contents: {
|
||||||
|
t5: contents:html: "<html>..."
|
||||||
|
}
|
||||||
|
anchor: {}
|
||||||
|
people: {
|
||||||
|
t5: people:author: "John Doe"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]]></programlisting>
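<para>The same multi-dimensional map can be written as a nested Python dictionary. This is a
minimal sketch for illustration, not how HBase stores data. The function
<literal>get_cell</literal> is hypothetical, and the integers 3 to 9 stand in for the timestamps
<literal>t3</literal> to <literal>t9</literal>. Cells that appear empty in the table are simply
absent from the map, which is the "sparse" property described above.</para>

```python
# row key -> column family -> qualifier -> timestamp -> value
webtable = {
    "com.cnn.www": {
        "contents": {"html": {6: "<html>...", 5: "<html>...", 3: "<html>..."}},
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
    },
    "com.example.www": {
        "contents": {"html": {5: "<html>..."}},
        "people": {"author": {5: "John Doe"}},
    },
}


def get_cell(table, row, family, qualifier, timestamp=None):
    """Return the value at an exact timestamp, or the newest value if none is given.

    Returns None when the cell (or that version of it) does not exist.
    """
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    if timestamp is None:
        # The newest version is the one with the largest timestamp.
        timestamp = max(versions)
    return versions.get(timestamp)


print(get_cell(webtable, "com.cnn.www", "anchor", "cnnsi.com"))
print(get_cell(webtable, "com.cnn.www", "contents", "html", timestamp=8))
```

<para>The first lookup finds the only stored version of <literal>anchor:cnnsi.com</literal>. The
second finds nothing, because no version of <literal>contents:html</literal> was written at
<literal>t8</literal>.</para>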
|
||||||
|
|
||||||
</section>
|
</section>
|
||||||
<section
|
<section
|
||||||
xml:id="physical.view">
|
xml:id="physical.view">
|
||||||
<title>Physical View</title>
|
<title>Physical View</title>
|
||||||
<para> Although at a conceptual level tables may be viewed as a sparse set of rows. Physically
|
<para> Although at a conceptual level tables may be viewed as a sparse set of rows, they are
|
||||||
they are stored on a per-column family basis. New columns (i.e.,
|
physically stored by column family. A new column qualifier (column_family:column_qualifier)
|
||||||
<varname>columnfamily:column</varname>) can be added to any column family without
|
can be added to an existing column family at any time.</para>
|
||||||
pre-announcing them. <table
|
<table
|
||||||
frame="all">
|
frame="all">
|
||||||
<title>ColumnFamily <varname>anchor</varname></title>
|
<title>ColumnFamily <varname>anchor</varname></title>
|
||||||
<tgroup
|
<tgroup
|
||||||
|
@@ -258,13 +394,15 @@
|
||||||
</row>
|
</row>
|
||||||
</tbody>
|
</tbody>
|
||||||
</tgroup>
|
</tgroup>
|
||||||
</table> It is important to note in the diagram above that the empty cells shown in the
|
</table>
|
||||||
conceptual view are not stored since they need not be in a column-oriented storage format.
|
<para>The empty cells shown in the
|
||||||
|
conceptual view are not stored at all.
|
||||||
Thus a request for the value of the <varname>contents:html</varname> column at time stamp
|
Thus a request for the value of the <varname>contents:html</varname> column at time stamp
|
||||||
<literal>t8</literal> would return no value. Similarly, a request for an
|
<literal>t8</literal> would return no value. Similarly, a request for an
|
||||||
<varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would
|
<varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would
|
||||||
return no value. However, if no timestamp is supplied, the most recent value for a
|
return no value. However, if no timestamp is supplied, the most recent value for a
|
||||||
particular column would be returned and would also be the first one found since timestamps
|
particular column would be returned. Given multiple versions, the most recent is also the
|
||||||
|
first one found, since timestamps
|
||||||
are stored in descending order. Thus a request for the values of all columns in the row
|
are stored in descending order. Thus a request for the values of all columns in the row
|
||||||
<varname>com.cnn.www</varname> if no timestamp is specified would be: the value of
|
<varname>com.cnn.www</varname> if no timestamp is specified would be: the value of
|
||||||
<varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
|
<varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
|
||||||
|
|