From 65375f8258095f0a0fac16d870ee134d43a7910b Mon Sep 17 00:00:00 2001 From: Jonathan M Hsieh Date: Wed, 13 Aug 2014 14:57:18 -0700 Subject: [PATCH] HBASE-11476 Expand 'Conceptual View' section of Data Model chapter (Misty Stanley-Jones) --- src/main/docbkx/book.xml | 342 +++++++++++++++++++++++++++------------ 1 file changed, 240 insertions(+), 102 deletions(-) diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml index 666e0eb58b8..0facd319c4b 100644 --- a/src/main/docbkx/book.xml +++ b/src/main/docbkx/book.xml @@ -91,38 +91,129 @@ Data Model - In short, applications store data into an HBase table. Tables are made of rows and - columns. All columns in HBase belong to a particular column family. Table cells -- the - intersection of row and column coordinates -- are versioned. A cell’s content is an - uninterpreted array of bytes. - Table row keys are also byte arrays so almost anything can serve as a row key from strings - to binary representations of longs or even serialized data structures. Rows in HBase tables - are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key - -- its primary key. + In HBase, data is stored in tables, which have rows and columns. This terminology + overlaps with relational databases (RDBMSs), but the analogy is not a helpful one. Instead, it can + be helpful to think of an HBase table as a multi-dimensional map. + + HBase Data Model Terminology + + Table + + An HBase table consists of multiple rows. + + + + Row + + A row in HBase consists of a row key and one or more columns with values associated + with them. Rows are stored in sorted order by row key. The sort is byte-ordered, which + is not the same as locale-aware alphabetical order. For this + reason, the design of the row key is very important. The goal is to store data in such a + way that related rows are near each other. A common row key pattern is a website domain. + If your row keys are domains, you should probably store them in reverse (org.apache.www, + org.apache.mail, org.apache.jira). 
This way, all of the Apache domains are near each + other in the table, rather than being spread out based on the first letter of the + subdomain. + + + + Column + + A column in HBase consists of a column family and a column qualifier, which are + delimited by a : (colon) character. + + + + Column Family + + Column families physically colocate a set of columns and their values, often for + performance reasons. Each column family has a set of storage properties, such as whether + its values should be cached in memory, how its data is compressed or its row keys are + encoded, and others. Each row in a table has the same column + families, though a given row might not store anything in a given column family. + Column families are specified when you create your table, and influence the way your + data is stored in the underlying filesystem. Therefore, the column families should be + considered carefully during schema design. + + + + Column Qualifier + + A column qualifier is added to a column family to provide the index for a given + piece of data. Given a column family content, a column qualifier + might be content:html, and another might be + content:pdf. Though column families are fixed at table creation, + column qualifiers are mutable and may differ greatly between rows. + + + + Cell + + A cell is a combination of row, column family, and column qualifier, and contains a + value and a timestamp, which represents the value's version. + A cell's value is an uninterpreted array of bytes. + + + + Timestamp + + A timestamp is written alongside each value, and is the identifier for a given + version of a value. By default, the timestamp represents the time on the RegionServer + when the data was written, but you can specify a different timestamp value when you put + data into the cell. + + Direct manipulation of timestamps is an advanced feature which is only exposed for + special cases that are deeply integrated with HBase, and is discouraged in general. 
+ Encoding a timestamp at the application level is the preferred pattern. + + You can specify the maximum number of versions of a value that HBase retains, per column + family. When the maximum number of versions is reached, the oldest versions are + eventually deleted. By default, only the newest version is kept. + + +
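The row key advice in the terminology list can be illustrated with a small sketch. It is not HBase client code: the function name `reverse_domain` and the sample domains are hypothetical, and sorting the UTF-8 encoded keys with Python's `sorted` is only assumed to approximate a byte-ordered row key sort.

```python
def reverse_domain(domain: str) -> str:
    """Reverse the dot-separated labels: www.apache.org -> org.apache.www."""
    return ".".join(reversed(domain.split(".")))


# Hypothetical sample domains, chosen so the effect of reversal is visible.
domains = [
    "www.apache.org",
    "mail.apache.org",
    "jira.apache.org",
    "www.cnn.com",
    "www.example.com",
    "mail.example.com",
]

# Python compares bytes objects byte by byte, which stands in for a
# byte-ordered row key sort in this sketch.
plain_keys = sorted(d.encode("utf-8") for d in domains)
reversed_keys = sorted(reverse_domain(d).encode("utf-8") for d in domains)

print("unreversed keys:")
for key in plain_keys:
    print("  ", key.decode("utf-8"))

print("reversed keys:")
for key in reversed_keys:
    print("  ", key.decode("utf-8"))
```

With the unreversed keys, mail.example.com sorts between the Apache hosts. With the reversed keys, the three org.apache entries are adjacent, so a scan over the org.apache prefix would touch one contiguous range.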
Conceptual View + You can read a very understandable explanation of the HBase data model in the blog post Understanding + HBase and BigTable by Jim R. Wilson. Another good explanation is available in the + PDF Introduction + to Basic Schema Design by Amandeep Khurana. It may help to read different + perspectives to get a solid understanding of HBase schema design. The linked articles cover + the same ground as the information in this section. The following example is a slightly modified form of the one on page 2 of the BigTable paper. There - is a table called webtable that contains two column families named - contents and anchor. In this example, + is a table called webtable that contains two rows + (com.cnn.www + and com.example.www), three column families named + contents, anchor, and people. In + this example, for the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column - (contents:html). + (contents:html). This example contains 5 versions of the row with the + row key com.cnn.www, and one version of the row with the row key + com.example.www. The contents:html column qualifier contains the entire + HTML of a given website. Qualifiers of the anchor column family each + contain the external site which links to the site represented by the row, along with the + text it used in the anchor of its link. The people column family represents + people associated with the site. + + Column Names - By convention, a column name is made of its column family prefix and a - qualifier. For example, the column - contents:html is made up of the column family - contents and html qualifier. The colon character - (:) delimits the column family from the column family - qualifier. + By convention, a column name is made of its column family prefix and a + qualifier. For example, the column + contents:html is made up of the column family + contents and the html qualifier. 
The colon + character (:) delimits the column family from the column family + qualifier. Table <varname>webtable</varname> @@ -134,12 +225,15 @@ colname="c3" /> + Row Key Time Stamp ColumnFamily contents ColumnFamily anchor + ColumnFamily people @@ -148,128 +242,172 @@ t9 anchor:cnnsi.com = "CNN" + "com.cnn.www" t8 anchor:my.look.ca = "CNN.com" + "com.cnn.www" t6 contents:html = "<html>..." + "com.cnn.www" t5 contents:html = "<html>..." + "com.cnn.www" t3 contents:html = "<html>..." + + + + "com.example.www" + t5 + contents:html = "<html>..." + + people:author = "John Doe"
-
+ Cells in this table that appear to be empty do not take space, or in fact exist, in
+ HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
+ look at data in HBase, or even the most accurate. The following represents the same
+ information as a multi-dimensional map. This is only a mock-up for illustrative
+ purposes and may not be strictly accurate.
+ <![CDATA[
+{
+  "com.cnn.www": {
+    contents: {
+      t6: contents:html: "<html>..."
+      t5: contents:html: "<html>..."
+      t3: contents:html: "<html>..."
+    }
+    anchor: {
+      t9: anchor:cnnsi.com = "CNN"
+      t8: anchor:my.look.ca = "CNN.com"
+    }
+    people: {}
+  }
+  "com.example.www": {
+    contents: {
+      t5: contents:html: "<html>..."
+    }
+    anchor: {}
+    people: {
+      t5: people:author: "John Doe"
+    }
+  }
+}
+ ]]>
+
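The multi-dimensional map above can also be written as a nested dictionary. The following is a minimal sketch of the conceptual view, not of how HBase stores data: the names `webtable` and `count_cells` are hypothetical, and the symbolic timestamps t3 through t9 are represented as the integers 3 through 9 purely for illustration.

```python
# row key -> column family -> column qualifier -> timestamp -> value
webtable = {
    "com.cnn.www": {
        "contents": {
            "html": {6: "<html>...", 5: "<html>...", 3: "<html>..."},
        },
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
        # No "people" entry: an empty family holds nothing for this row.
    },
    "com.example.www": {
        "contents": {
            "html": {5: "<html>..."},
        },
        "people": {
            "author": {5: "John Doe"},
        },
    },
}


def count_cells(table: dict) -> int:
    """Count the stored (row, family, qualifier, timestamp) values."""
    return sum(
        len(versions)
        for families in table.values()
        for qualifiers in families.values()
        for versions in qualifiers.values()
    )


print(count_cells(webtable))
```

Only values that were actually written appear in the structure. The blank positions in the tabular view have no counterpart here, which is the sense in which the table is sparse.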
Physical View - Although at a conceptual level tables may be viewed as a sparse set of rows. Physically - they are stored on a per-column family basis. New columns (i.e., - columnfamily:column) can be added to any column family without - pre-announcing them. - ColumnFamily <varname>anchor</varname> - - - - - - - Row Key - Time Stamp - Column Family anchor - - - - - "com.cnn.www" - t9 - anchor:cnnsi.com = "CNN" - - - "com.cnn.www" - t8 - anchor:my.look.ca = "CNN.com" - - - -
- - ColumnFamily <varname>contents</varname> - - - - - - - Row Key - Time Stamp - ColumnFamily "contents:" - - - - - "com.cnn.www" - t6 - contents:html = "<html>..." - - - "com.cnn.www" - t5 - contents:html = "<html>..." - - - "com.cnn.www" - t3 - contents:html = "<html>..." - - - -
It is important to note in the diagram above that the empty cells shown in the - conceptual view are not stored since they need not be in a column-oriented storage format. + Although at a conceptual level tables may be viewed as a sparse set of rows, they are + physically stored by column family. A new column qualifier (column_family:column_qualifier) + can be added to an existing column family at any time. + + ColumnFamily <varname>anchor</varname> + + + + + + + Row Key + Time Stamp + Column Family anchor + + + + + "com.cnn.www" + t9 + anchor:cnnsi.com = "CNN" + + + "com.cnn.www" + t8 + anchor:my.look.ca = "CNN.com" + + + +
+ + ColumnFamily <varname>contents</varname> + + + + + + + Row Key + Time Stamp + ColumnFamily "contents:" + + + + + "com.cnn.www" + t6 + contents:html = "<html>..." + + + "com.cnn.www" + t5 + contents:html = "<html>..." + + + "com.cnn.www" + t3 + contents:html = "<html>..." + + + +
+ The empty cells shown in the + conceptual view are not stored at all. Thus a request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a - particular column would be returned and would also be the first one found since timestamps + particular column would be returned. Given multiple versions, the most recent is also the + first one found, since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www if no timestamp is specified would be: the value of - contents:html from time stamp t6, the value of - anchor:cnnsi.com from time stamp t9, the value of - anchor:my.look.ca from time stamp t8. + contents:html from timestamp t6, the value of + anchor:cnnsi.com from timestamp t9, the value of + anchor:my.look.ca from timestamp t8.
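The read behaviour described above can be modelled in a few lines. This is a sketch under stated assumptions rather than the HBase client API: `read_cell` and `read_row` are hypothetical helpers, timestamps t3 through t9 are represented as integers, and a read with a timestamp is assumed to match that exact timestamp, as in the examples in the text.

```python
# Versions of each column in row com.cnn.www, newest first, mirroring the
# descending timestamp order described in the text.
cnn_row = {
    "contents:html": [(6, "<html>..."), (5, "<html>..."), (3, "<html>...")],
    "anchor:cnnsi.com": [(9, "CNN")],
    "anchor:my.look.ca": [(8, "CNN.com")],
}


def read_cell(row: dict, column: str, timestamp=None):
    """Return (timestamp, value) for a column, or None if nothing is stored.

    Without a timestamp, the newest version is returned; because versions
    are kept newest first, it is simply the first entry found.
    """
    versions = row.get(column, [])
    if timestamp is None:
        return versions[0] if versions else None
    for ts, value in versions:
        if ts == timestamp:
            return (ts, value)
    return None


def read_row(row: dict) -> dict:
    """Return the newest version of every column that has a stored value."""
    return {column: versions[0] for column, versions in row.items() if versions}


print(read_cell(cnn_row, "contents:html", timestamp=8))
print(read_cell(cnn_row, "contents:html"))
print(read_row(cnn_row))
```

The first call finds nothing because no contents:html value was written at t8, while the read without a timestamp falls through to the newest stored version.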
For more information about the internals of how Apache HBase stores data, see .