Go to file
Jim Kellerman 12a62a6333 HADOOP-1814 TestCleanRegionServerExit fails too often on Hudson
git-svn-id: https://svn.apache.org/repos/asf/lucene/hadoop/trunk/src/contrib/hbase@571332 13f79535-47bb-0310-9956-ffa450edef68
2007-08-30 22:09:13 +00:00
bin HADOOP-1802 Startup scripts should wait until hdfs as cleared 'safe mode' 2007-08-30 15:39:11 +00:00
conf HADOOP-1644 [hbase] Compactions should not block updates 2007-08-16 01:07:51 +00:00
lib HADOOP-1375 a simple parser for hbase 2007-07-10 23:30:11 +00:00
src HADOOP-1814 TestCleanRegionServerExit fails too often on Hudson 2007-08-30 22:09:13 +00:00
CHANGES.txt HADOOP-1814 TestCleanRegionServerExit fails too often on Hudson 2007-08-30 22:09:13 +00:00
NOTICE.txt HADOOP-1375 a simple parser for hbase 2007-07-11 21:54:15 +00:00
README.txt HADOOP-1045. Add contrib/hbase, a BigTable-like online database. 2007-04-03 20:34:28 +00:00
build.xml HADOOP-1731. Add Hadoop's version number to contrib jar file names. 2007-08-22 17:13:08 +00:00

README.txt

HBASE
Michael Cafarella


This document gives a quick overview of HBase, the Hadoop simple
database. It is extremely similar to Google's BigTable, with a just a
few differences. If you understand BigTable, great. If not, you should
still be able to understand this document.

---------------------------------------------------------------
I.

HBase uses a data model very similar to that of BigTable. Users store
data rows in labelled tables. A data row has a sortable key and an
arbitrary number of columns. The table is stored sparsely, so that
rows in the same table can have crazily-varying columns, if the user
likes.

A column name has the form "<group>:<label>" where <group> and <label>
can be any string you like. A single table enforces its set of
<group>s (called "column groups"). You can only adjust this set of
groups by performing administrative operations on the table. However,
you can use new <label> strings at any write without preannouncing
it. HBase stores column groups physically close on disk. So the items
in a given column group should have roughly the same write/read
behavior.

Writes are row-locked only. You cannot lock multiple rows at once. All
row-writes are atomic by default.

All updates to the database have an associated timestamp. The HBase
will store a configurable number of versions of a given cell. Clients
can get data by asking for the "most recent value as of a certain
time". Or, clients can fetch all available versions at once.

---------------------------------------------------------------
II.

To the user, a table seems like a list of data tuples, sorted by row
key. Physically, tables are broken into HRegions. An HRegion is
identified by its tablename plus a start/end-key pair. A given HRegion
with keys <start> and <end> will store all the rows from (<start>,
<end>]. A set of HRegions, sorted appropriately, forms an entire
table.

All data is physically stored using Hadoop's DFS. Data is served to
clients by a set of HRegionServers, usually one per machine. A given
HRegion is served by only one HRegionServer at a time.

When a client wants to make updates, it contacts the relevant
HRegionServer and commits the update to an HRegion. Upon commit, the
data is added to the HRegion's HMemcache and to the HRegionServer's
HLog. The HMemcache is a memory buffer that stores and serves the
most-recent updates. The HLog is an on-disk log file that tracks all
updates. The commit() call will not return to the client until the
update has been written to the HLog.

When serving data, the HRegion will first check its HMemcache. If not
available, it will then check its on-disk HStores. There is an HStore
for each column family in an HRegion. An HStore might consist of
multiple on-disk HStoreFiles. Each HStoreFile is a B-Tree-like
structure that allow for relatively fast access.

Periodically, we invoke HRegion.flushcache() to write the contents of
the HMemcache to an on-disk HStore's files. This adds a new HStoreFile
to each HStore. The HMemcache is then emptied, and we write a special
token to the HLog, indicating the HMemcache has been flushed.

On startup, each HRegion checks to see if there have been any writes
to the HLog since the most-recent invocation of flushcache(). If not,
then all relevant HRegion data is reflected in the on-disk HStores. If
yes, the HRegion reconstructs the updates from the HLog, writes them
to the HMemcache, and then calls flushcache(). Finally, it deletes the
HLog and is now available for serving data.

Thus, calling flushcache() infrequently will be less work, but
HMemcache will consume more memory and the HLog will take a longer
time to reconstruct upon restart. If flushcache() is called
frequently, the HMemcache will take less memory, and the HLog will be
faster to reconstruct, but each flushcache() call imposes some
overhead.

The HLog is periodically rolled, so it consists of multiple
time-sorted files. Whenever we roll the HLog, the HLog will delete all
old log files that contain only flushed data. Rolling the HLog takes
very little time and is generally a good idea to do from time to time.

Each call to flushcache() will add an additional HStoreFile to each
HStore. Fetching a file from an HStore can potentially access all of
its HStoreFiles. This is time-consuming, so we want to periodically
compact these HStoreFiles into a single larger one. This is done by
calling HStore.compact().

Compaction is a very expensive operation. It's done automatically at
startup, and should probably be done periodically during operation.

The Google BigTable paper has a slightly-confusing hierarchy of major
and minor compactions. We have just two things to keep in mind:

1) A "flushcache()" drives all updates out of the memory buffer into
   on-disk structures. Upon flushcache, the log-reconstruction time
   goes to zero. Each flushcache() will add a new HStoreFile to each
   HStore.

2) a "compact()" consolidates all the HStoreFiles into a single
   one. It's expensive, and is always done at startup.

Unlike BigTable, Hadoop's HBase allows no period where updates have
been "committed" but have not been written to the log. This is not
hard to add, if it's really wanted.

We can merge two HRegions into a single new HRegion by calling
HRegion.closeAndMerge(). We can split an HRegion into two smaller
HRegions by calling HRegion.closeAndSplit().

OK, to sum up so far:

1) Clients access data in tables.
2) tables are broken into HRegions.
3) HRegions are served by HRegionServers. Clients contact an
   HRegionServer to access the data within its row-range.
4) HRegions store data in:
  a) HMemcache, a memory buffer for recent writes
  b) HLog, a write-log for recent writes
  c) HStores, an efficient on-disk set of files. One per col-group.
     (HStores use HStoreFiles.)

---------------------------------------------------------------
III.

Each HRegionServer stays in contact with the single HBaseMaster. The
HBaseMaster is responsible for telling each HRegionServer what
HRegions it should load and make available.

The HBaseMaster keeps a constant tally of which HRegionServers are
alive at any time. If the connection between an HRegionServer and the
HBaseMaster times out, then:

 a) The HRegionServer kills itself and restarts in an empty state.

 b) The HBaseMaster assumes the HRegionServer has died and reallocates
    its HRegions to other HRegionServers

Note that this is unlike Google's BigTable, where a TabletServer can
still serve Tablets after its connection to the Master has died. We
tie them together, because we do not use an external lock-management
system like BigTable. With BigTable, there's a Master that allocates
tablets and a lock manager (Chubby) that guarantees atomic access by
TabletServers to tablets. HBase uses just a single central point for
all HRegionServers to access: the HBaseMaster.

(This is no more dangerous than what BigTable does. Each system is
reliant on a network structure (whether HBaseMaster or Chubby) that
must survive for the data system to survive. There may be some
Chubby-specific advantages, but that's outside HBase's goals right
now.)

As HRegionServers check in with a new HBaseMaster, the HBaseMaster
asks each HRegionServer to load in zero or more HRegions. When the
HRegionServer dies, the HBaseMaster marks those HRegions as
unallocated, and attempts to give them to different HRegionServers.

Recall that each HRegion is identified by its table name and its
key-range. Since key ranges are contiguous, and they always start and
end with NULL, it's enough to simply indicate the end-key.

Unfortunately, this is not quite enough. Because of merge() and
split(), we may (for just a moment) have two quite different HRegions
with the same name. If the system dies at an inopportune moment, both
HRegions may exist on disk simultaneously. The arbiter of which
HRegion is "correct" is the HBase meta-information (to be discussed
shortly). In order to distinguish between different versions of the
same HRegion, we also add a unique 'regionId' to the HRegion name.

Thus, we finally get to this identifier for an HRegion:

tablename + endkey + regionId.

You can see this identifier being constructed in
HRegion.buildRegionName().

We can also use this identifier as a row-label in a different
HRegion. Thus, the HRegion meta-info is itself stored in an
HRegion. We call this table, which maps from HRegion identifiers to
physical HRegionServer locations, the META table.

The META table itself can grow large, and may be broken into separate
HRegions. To locate all components of the META table, we list all META
HRegions in a ROOT table. The ROOT table is always contained in a
single HRegion.

Upon startup, the HRegionServer immediately attempts to scan the ROOT
table (because there is only one HRegion for the ROOT table, that
HRegion's name is hard-coded). It may have to wait for the ROOT table
to be allocated to an HRegionServer.

Once the ROOT table is available, the HBaseMaster can scan it and
learn of all the META HRegions. It then scans the META table. Again,
the HBaseMaster may have to wait for all the META HRegions to be
allocated to different HRegionServers.

Finally, when the HBaseMaster has scanned the META table, it knows the
entire set of HRegions. It can then allocate these HRegions to the set
of HRegionServers.

The HBaseMaster keeps the set of currently-available HRegionServers in
memory. Since the death of the HBaseMaster means the death of the
entire system, there's no reason to store this information on
disk. All information about the HRegion->HRegionServer mapping is
stored physically on different tables. Thus, a client does not need to
contact the HBaseMaster after it learns the location of the ROOT
HRegion. The load on HBaseMaster should be relatively small: it deals
with timing out HRegionServers, scanning the ROOT and META upon
startup, and serving the location of the ROOT HRegion.

The HClient is fairly complicated, and often needs to navigate the
ROOT and META HRegions when serving a user's request to scan a
specific user table. If an HRegionServer is unavailable or it does not
have an HRegion it should have, the HClient will wait and retry. At
startup or in case of a recent HRegionServer failure, the correct
mapping info from HRegion to HRegionServer may not always be
available.

In summary:

1) HRegionServers offer access to HRegions (an HRegion lives at one
   HRegionServer)
2) HRegionServers check in with the HBaseMaster
3) If the HBaseMaster dies, the whole system dies
4) The set of current HRegionServers is known only to the HBaseMaster
5) The mapping between HRegions and HRegionServers is stored in two
   special HRegions, which are allocated to HRegionServers like any
   other.
6) The ROOT HRegion is a special one, the location of which the
   HBaseMaster always knows.
7) It's the HClient's responsibility to navigate all this.


---------------------------------------------------------------
IV.

What's the current status of all this code?

As of this writing, there is just shy of 7000 lines of code in the
"hbase" directory.

All of the single-machine operations (safe-committing, merging,
splitting, versioning, flushing, compacting, log-recovery) are
complete, have been tested, and seem to work great.

The multi-machine stuff (the HBaseMaster, the HRegionServer, and the
HClient) have not been fully tested. The reason is that the HClient is
still incomplete, so the rest of the distributed code cannot be
fully-tested. I think it's good, but can't be sure until the HClient
is done. However, the code is now very clean and in a state where
other people can understand it and contribute.

Other related features and TODOs:

1) Single-machine log reconstruction works great, but distributed log
   recovery is not yet implemented. This is relatively easy, involving
   just a sort of the log entries, placing the shards into the right
   DFS directories

2) Data compression is not yet implemented, but there is an obvious
   place to do so in the HStore.

3) We need easy interfaces to MapReduce jobs, so they can scan tables

4) The HMemcache lookup structure is relatively inefficient

5) File compaction is relatively slow; we should have a more
   conservative algorithm for deciding when to apply compaction.

6) For the getFull() operation, use of Bloom filters would speed
   things up

7) We need stress-test and performance-number tools for the whole
   system

8) There's some HRegion-specific testing code that worked fine during
   development, but it has to be rewritten so it works against an
   HRegion while it's hosted by an HRegionServer, and connected to an
   HBaseMaster. This code is at the bottom of the HRegion.java file.