HBASE-23198 Update ref guide for distributed MOB compaction.

* add design doc for original MOB changes as they were when HBase 2.0 came out
* add design doc for distributed MOB compaction
* remove configuration and commands no longer relevant after distributed MOB compaction
* add in discussion of configuration options
* allow asciimath formulas since we use them in the discussion

closes #1232

Signed-off-by: Wellington Ramos Chevreuil <wchevreuil@apache.org>
Sean Busbey 2020-02-28 14:34:50 -06:00
parent 9bd39786df
commit aff0ff5d97
4 changed files with 321 additions and 115 deletions

Binary file not shown.


@ -36,22 +36,15 @@ read and write paths are optimized for values smaller than 100KB in size. When
HBase deals with large numbers of objects over this threshold, referred to here
as medium objects, or MOBs, performance is degraded due to write amplification
caused by splits and compactions. When using MOBs, ideally your objects will be between
100KB and 10MB (see the <<faq>>). HBase ***FIX_VERSION_NUMBER*** adds support
for better managing large numbers of MOBs while maintaining performance,
consistency, and low operational overhead. MOB support is provided by the work
done in link:https://issues.apache.org/jira/browse/HBASE-11339[HBASE-11339]. To
take advantage of MOB, you need to use <<hfilev3,HFile version 3>>. Optionally,
100KB and 10MB (see the <<faq>>). HBase 2 added special internal handling of MOBs
to maintain performance, consistency, and low operational overhead. MOB support is
provided by the work done in link:https://issues.apache.org/jira/browse/HBASE-11339[HBASE-11339].
To take advantage of MOB, you need to use <<hfilev3,HFile version 3>>. Optionally,
configure the MOB file reader's cache settings for each RegionServer (see
<<mob.cache.configure>>), then configure specific columns to hold MOB data.
Client code does not need to change to take advantage of HBase MOB support. The
feature is transparent to the client.
MOB compaction
MOB data is flushed into MOB files after MemStore flush. There will be lots of MOB files
after some time. To reduce MOB file count, there is a periodic task which compacts
small MOB files into a large one (MOB compaction).
=== Configuring Columns for MOB
You can configure columns to support MOB during table creation or alteration,
@ -79,41 +72,6 @@ hcd.setMobThreshold(102400L);
----
====
=== Configure MOB Compaction Policy
By default, MOB files for one specific day are compacted into one large MOB file.
To reduce MOB file count more, there are other MOB Compaction policies supported.
daily policy - compact MOB Files for one day into one large MOB file (default policy)
weekly policy - compact MOB Files for one week into one large MOB file
montly policy - compact MOB Files for one month into one large MOB File
.Configure MOB compaction policy Using HBase Shell
----
hbase> create 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400, MOB_COMPACT_PARTITION_POLICY => 'daily'}
hbase> create 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400, MOB_COMPACT_PARTITION_POLICY => 'weekly'}
hbase> create 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400, MOB_COMPACT_PARTITION_POLICY => 'monthly'}
hbase> alter 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400, MOB_COMPACT_PARTITION_POLICY => 'daily'}
hbase> alter 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400, MOB_COMPACT_PARTITION_POLICY => 'weekly'}
hbase> alter 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400, MOB_COMPACT_PARTITION_POLICY => 'monthly'}
----
=== Configure MOB Compaction mergeable threshold
If the size of a mob file is less than this value, it's regarded as a small file and needs to
be merged in mob compaction. The default value is 1280MB.
====
[source,xml]
----
<property>
<name>hbase.mob.compaction.mergeable.threshold</name>
<value>10000000000</value>
</property>
----
====
=== Testing MOB
The utility `org.apache.hadoop.hbase.IntegrationTestIngestWithMOB` is provided to assist with testing
@ -133,9 +91,219 @@ $ sudo -u hbase hbase org.apache.hadoop.hbase.IntegrationTestIngestWithMOB \
* `*maxMobDataSize*` is the maximum value for the size of MOB data.
The default is 5 kB, expressed in bytes.
=== MOB architecture
This section is derived from information found in
link:https://issues.apache.org/jira/browse/HBASE-11339[HBASE-11339], which covered the initial GA
implementation of MOB in HBase and
link:https://issues.apache.org/jira/browse/HBASE-22749[HBASE-22749], which improved things by
parallelizing MOB maintenance across the RegionServers. For more information see
the last version of the design doc created during the initial work,
"link:https://github.com/apache/hbase/blob/master/dev-support/design-docs/HBASE-11339%20MOB%20GA%20design.pdf[HBASE-11339 MOB GA design.pdf]",
and the design doc for the distributed mob compaction feature,
"link:https://github.com/apache/hbase/blob/master/dev-support/design-docs/HBASE-22749%20MOB%20distributed%20compaction.pdf[HBASE-22749 MOB distributed compaction.pdf]".
==== Overview
The MOB feature reduces the overall IO load for configured column families by storing values that
are larger than the configured threshold outside of the normal regions to avoid splits, merges, and
most importantly normal compactions.
When a cell is first written to a region it is stored in the WAL and memstore regardless of value
size. When memstores from a column family configured to use MOB are eventually flushed two hfiles
are written simultaneously. Cells with a value smaller than the threshold size are written to a
normal region hfile. Cells with a value larger than the threshold are written into a special MOB
hfile and also have a MOB reference cell written into the normal region HFile. As the Region Server
flushes a MOB enabled memstore and closes a given normal region HFile it appends metadata that lists
each of the special MOB hfiles referenced by the cells within.
MOB reference cells have the same key as the cell they are based on. The value of the reference cell
is made up of two pieces of metadata: the size of the actual value and the MOB hfile that contains
the original cell. In addition to any tags originally written to HBase, the reference cell prepends
two additional tags. The first is a marker tag that says the cell is a MOB reference. This can be
used later to scan specifically just for reference cells. The second stores the namespace and table
at the time the MOB hfile is written out. This tag is used to optimize how the MOB system finds
the underlying value in MOB hfiles after a series of HBase snapshot operations (ref HBASE-12332).
Note that tags are only available within HBase servers and by default are not sent over RPCs.
All MOB hfiles for a given table are managed within a logical region that does not directly serve
requests. When these MOB hfiles are created from a flush or MOB compaction they are placed in a
dedicated mob data area under the hbase root directory specific to the namespace, table, mob
logical region, and column family. In general that means a path structured like:
----
%HBase Root Dir%/mobdir/data/%namespace%/%table%/%logical region%/%column family%/
----
With default configs, for an example table named 'some_table' in the
default namespace with a MOB enabled column family named 'foo', this HDFS directory would be
----
/hbase/mobdir/data/default/some_table/372c1b27e3dc0b56c3a031926e5efbe9/foo/
----
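As a rough illustration of the layout above, the following sketch joins the path components in the same order. The class and method names are hypothetical and are not part of the HBase API; the sketch only does string handling and assumes the root directory is `/hbase`.

```java
// Illustrative only: MobPathSketch and mobFamilyDir are hypothetical names, not HBase APIs.
final class MobPathSketch {

    /** Joins the components of the MOB storage layout in the order described above. */
    static String mobFamilyDir(String rootDir, String namespace, String table,
                               String logicalRegion, String family) {
        return String.join("/", rootDir, "mobdir", "data",
            namespace, table, logicalRegion, family) + "/";
    }

    public static void main(String[] args) {
        // Reproduces the example directory for 'some_table' and column family 'foo'.
        System.out.println(mobFamilyDir("/hbase", "default", "some_table",
            "372c1b27e3dc0b56c3a031926e5efbe9", "foo"));
    }
}
```

The logical region name is passed in rather than computed because the text above does not describe how it is derived.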
These MOB hfiles are maintained by special chores in the HBase Master and across the individual
Region Servers. Specifically those chores take care of enforcing TTLs and compacting them. Note that
this compaction is primarily a matter of controlling the total number of files in HDFS because our
operational assumption for MOB data is that it will seldom be updated or deleted.
When a given MOB hfile is no longer needed as a result of our compaction process then a chore in
the Master will take care of moving it to the archive just
like any normal hfile. Because the table's mob region is independent of all the normal regions it
can coexist with them in the regular archive storage area:
----
/hbase/archive/data/default/some_table/372c1b27e3dc0b56c3a031926e5efbe9/foo/
----
The same hfile cleaning chores that take care of eventually deleting unneeded archived files from
normal regions thus also will take care of these MOB hfiles. As such, if there is a snapshot of a
MOB enabled table then the cleaning system will make sure those MOB files stick around in the
archive area as long as they are needed by a snapshot or a clone of a snapshot.
==== MOB compaction
Each time the memstore for a MOB enabled column family performs a flush HBase will write values over
the MOB threshold into MOB specific hfiles. When normal region compaction occurs the Region Server
rewrites the normal data files while maintaining references to these MOB files without rewriting
them. Normal client lookups for MOB values transparently will receive the original values because
the Region Server internals take care of using the reference data to then pull the value out of a
specific MOB file. This indirection means that building up a large number of MOB hfiles doesn't
impact the overall time to retrieve any specific MOB cell. Thus, we need not perform compactions of
the MOB hfiles nearly as often as normal hfiles. As a result, HBase saves IO by not rewriting MOB
hfiles as a part of the periodic compactions a Region Server does on its own.
However, if deletes and updates of MOB cells are frequent then this indirection will begin to waste
space. The only way to stop using the space of a particular MOB hfile is to ensure no cells still
hold references to it. To do that we need to ensure we have written the current values into a new
MOB hfile. If our backing filesystem has a limitation on the number of files that can be present, as
HDFS does, then even if we do not have deletes or updates of MOB cells eventually there will be a
sufficient number of MOB hfiles that we will need to coalesce them.
Periodically a chore in the master coordinates having the region servers
perform a special major compaction that also handles rewriting new MOB files. Like all compactions,
the Region Server will create updated hfiles that hold both the cells that are smaller than the MOB
threshold and cells that hold references to the newly rewritten MOB file. Because this rewriting has
the advantage of looking across all active cells for the region, our several small MOB files should
end up as a single MOB file per region. The chore defaults to running weekly and can be
configured by setting `hbase.mob.compaction.chore.period` to the desired period in seconds.
====
[source,xml]
----
<property>
<name>hbase.mob.compaction.chore.period</name>
<value>2592000</value>
<description>Example of changing the chore period from a week to a month.</description>
</property>
----
====
By default, the periodic MOB compaction coordination chore will attempt to keep every region
busy doing compactions in parallel in order to maximize the amount of work done on the cluster.
If you need to tune the amount of IO this compaction generates on the underlying filesystem, you
can control how many concurrent region-level compaction requests are allowed by setting
`hbase.mob.major.compaction.region.batch.size` to an integer number greater than zero. If you set
the configuration to 0 then you will get the default behavior of attempting to do all regions in
parallel.
====
[source,xml]
----
<property>
<name>hbase.mob.major.compaction.region.batch.size</name>
<value>1</value>
<description>Example of switching from "as parallel as possible" to "serially"</description>
</property>
----
====
==== MOB file archiving
Eventually we will have MOB hfiles that are no longer needed. Either clients will overwrite the
value or a MOB-rewriting compaction will store a reference to a newer larger MOB hfile. Because any
given MOB cell could have originally been written either in the current region or in a parent region
that existed at some prior point in time, individual Region Servers do not decide when it is time
to archive MOB hfiles. Instead a periodic chore in the Master evaluates MOB hfiles for archiving.
A MOB HFile will be subject to archiving under any of the following conditions:
* Any MOB HFile older than the column family's TTL
* Any MOB HFile older than a "too recent" threshold with no references to it from the regular hfiles
for all regions in a column family
To determine if a MOB HFile meets the second criterion the chore extracts metadata from the regular
HFiles for each MOB enabled column family for a given table. That metadata enumerates the complete
set of MOB HFiles needed to satisfy the references stored in the normal HFile area.
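The two archiving conditions can be read as a simple predicate. The sketch below is an illustration under stated assumptions, not the cleaner chore's actual code: the class, method, and parameter names are hypothetical, all ages use the same time unit, and the set of referenced file names stands in for the metadata gathered from the regular HFiles.

```java
import java.util.Set;

// Illustrative only: not the real cleaner chore. All names here are hypothetical.
final class MobArchiveSketch {

    /**
     * Returns true when a MOB hfile meets either archiving condition:
     * it is older than the column family's TTL, or it is older than the
     * "too recent" threshold and no regular hfile references it.
     */
    static boolean eligibleForArchive(String mobFileName, long fileAgeMillis,
                                      long ttlMillis, long minAgeMillis,
                                      Set<String> referencedMobFiles) {
        if (fileAgeMillis > ttlMillis) {
            return true; // condition 1: past the TTL
        }
        // condition 2: old enough to be considered, and unreferenced
        return fileAgeMillis > minAgeMillis && !referencedMobFiles.contains(mobFileName);
    }

    public static void main(String[] args) {
        Set<String> referenced = Set.of("mobfile-a");
        System.out.println(eligibleForArchive("mobfile-a", 200, 1000, 100, referenced));
        System.out.println(eligibleForArchive("mobfile-b", 200, 1000, 100, referenced));
    }
}
```

In this sketch a referenced file is kept until it passes the TTL, while an unreferenced file is kept only until it is older than the "too recent" threshold.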
The period of the cleaner chore can be configured by setting `hbase.master.mob.cleaner.period` to a
positive integer number of seconds. It defaults to running daily. You should not need to tune it
unless you have a very aggressive TTL or a very high rate of MOB updates with a correspondingly
high rate of non-MOB compactions.
=== MOB Optimization Tasks
==== Further limiting write amplification
If your MOB workload has few to no updates or deletes then you can opt in to MOB compactions that
optimize for limiting the amount of write amplification. This mode achieves that by setting a
size threshold above which MOB files are ignored during the compaction process. When a given region goes
through MOB compaction it will evaluate the size of the MOB file that currently holds the actual
value and skip rewriting the value if that file is over the threshold.
The bound of write amplification in this mode can be approximated as
stem:["Write Amplification" = log_K(M/S)] where *K* is the number of files in compaction
selection, *M* is the configurable threshold for MOB file size, and *S* is the minimum size of
memstore flushes that create MOB files in the first place. For example given 5 files picked up per
compaction, a threshold of 1 GB, and a flush size of 10MB the write amplification will be
stem:[log_5((1GB)/(10MB)) = log_5(100) = 2.86].
If we are using an underlying filesystem with a limitation on the number of files, such as HDFS,
and we know our expected data set size we can choose our maximum file size in order to approach
this limit but stay within it in order to minimize write amplification. For example, if we expect to
store a petabyte and we have a conservative limitation of a million files in our HDFS instance, then
stem:[(1PB)/(1M) = 1GB] gives us a target limitation of a gigabyte per MOB file.
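The two estimates above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical class and method names; it uses decimal units (1GB = 10^9 bytes, 1PB = 10^15 bytes), which is an assumption made to match the rounded figures in the text.

```java
// Illustrative only: MobSizingSketch is a hypothetical helper, not part of HBase.
final class MobSizingSketch {

    /** Approximate write amplification bound log_K(M/S). */
    static double writeAmplification(int filesPerCompaction, double maxMobFileBytes,
                                     double minFlushBytes) {
        if (filesPerCompaction < 2 || maxMobFileBytes <= 0 || minFlushBytes <= 0) {
            throw new IllegalArgumentException("need K >= 2 and positive sizes");
        }
        // Change of base: log_K(x) = ln(x) / ln(K)
        return Math.log(maxMobFileBytes / minFlushBytes) / Math.log(filesPerCompaction);
    }

    /** Target size per MOB file given an expected data set size and a file count budget. */
    static double targetMobFileBytes(double expectedDataBytes, long fileBudget) {
        if (expectedDataBytes <= 0 || fileBudget <= 0) {
            throw new IllegalArgumentException("need positive data size and file budget");
        }
        return expectedDataBytes / fileBudget;
    }

    public static void main(String[] args) {
        // K = 5 files, M = 1GB threshold, S = 10MB flushes
        System.out.printf("write amplification ~ %.2f%n", writeAmplification(5, 1e9, 10e6));
        // 1PB of data spread over a budget of one million files
        System.out.printf("target bytes per MOB file: %.0f%n",
            targetMobFileBytes(1e15, 1_000_000L));
    }
}
```

The resulting target size is the kind of value you would then consider for the maximum MOB file size in this mode.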
To opt-in to this compaction mode you must set `hbase.mob.compaction.type` to `optimized`. The
default MOB size threshold in this mode is set to 1GB. It can be changed by setting
`hbase.mob.compactions.max.file.size` to a positive integer number of bytes.
====
[source,xml]
----
<property>
<name>hbase.mob.compaction.type</name>
<value>optimized</value>
<description>opt-in to write amplification optimized mob compaction.</description>
</property>
<property>
<name>hbase.mob.compactions.max.file.size</name>
<value>10737418240</value>
<description>Example of tuning the max mob file size to 10GB</description>
</property>
----
====
Additionally, when operating in this mode the compaction process will seek to avoid writing MOB
files that are over the max file threshold. As it writes additional MOB values into a MOB
hfile it will check to see if the additional data causes the hfile to be over the max file size.
When the hfile of MOB values reaches the limit, the MOB hfile is committed to the MOB storage area and
a new one is created. The hfile with reference cells will track the complete set of MOB hfiles it
needs in its metadata.
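The rollover behavior described above can be pictured with a small simulation. This is a sketch under assumptions, not the compactor's real logic: the names are hypothetical, value sizes are plain numbers, and "commit" just records the size of the finished file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: simulates file rollover by size; not HBase's compactor code.
final class MobRolloverSketch {

    /** Returns the total size of each simulated MOB hfile, in write order. */
    static List<Long> packValues(long[] valueSizes, long maxFileSize) {
        List<Long> committedFiles = new ArrayList<>();
        long currentFileSize = 0;
        for (long size : valueSizes) {
            currentFileSize += size;              // append the value to the open file
            if (currentFileSize >= maxFileSize) { // limit reached: commit and start a new file
                committedFiles.add(currentFileSize);
                currentFileSize = 0;
            }
        }
        if (currentFileSize > 0) {
            committedFiles.add(currentFileSize);  // commit the final, partially filled file
        }
        return committedFiles;
    }

    public static void main(String[] args) {
        // Five values of size 4 with a limit of 10 end up in two files.
        System.out.println(packValues(new long[] {4, 4, 4, 4, 4}, 10));
    }
}
```

Because the check happens after a value is appended, a file in this sketch can end slightly over the limit; whether the real implementation behaves the same way at the boundary is not stated above.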
.Be mindful of total time to complete compaction of a region
[WARNING]
====
When using the write amplification optimized compaction mode you need to watch for the maximum time
to compact a single region. If it nears an hour you should read through the troubleshooting section
below <<mob.troubleshoot.cleaner.toonew>>. Failure to make the adjustments discussed there could
lead to data loss.
====
[[mob.cache.configure]]
=== Configuring the MOB Cache
==== Configuring the MOB Cache
Because there can be a large number of MOB files at any time, as compared to the number of HFiles,
@ -181,85 +349,61 @@ suit your environment, and restart or rolling restart the RegionServer.
----
====
=== MOB Optimization Tasks
==== Manually Compacting MOB Files
To manually compact MOB files, rather than waiting for the
<<mob.cache.configure,configuration>> to trigger compaction, use the
`compact` or `major_compact` HBase shell commands. These commands
periodic chore to trigger compaction, use the
`major_compact` HBase shell commands. These commands
require the first argument to be the table name, and take a column
family as the second argument. and take a compaction type as the third argument.
family as the second argument. If used with a column family that includes MOB data, then
these operator requests will result in the MOB data being compacted.
----
hbase> compact 't1', 'c1, MOB
hbase> major_compact 't1', 'c1, MOB
hbase> major_compact 't1'
hbase> major_compact 't2', 'c1'
----
These commands are also available via `Admin.compact` and
`Admin.majorCompact` methods.
=== MOB architecture
This section is derived from information found in
link:https://issues.apache.org/jira/browse/HBASE-11339[HBASE-11339]. For more information see
the attachment on that issue
"link:https://issues.apache.org/jira/secure/attachment/12724468/HBase%20MOB%20Design-v5.pdf[Base MOB Design-v5.pdf]".
==== Overview
The MOB feature reduces the overall IO load for configured column families by storing values that
are larger than the configured threshold outside of the normal regions to avoid splits, merges, and
most importantly normal compactions.
When a cell is first written to a region it is stored in the WAL and memstore regardless of value
size. When memstores from a column family configured to use MOB are eventually flushed two hfiles
are written simultaneously. Cells with a value smaller than the threshold size are written to a
normal region hfile. Cells with a value larger than the threshold are written into a special MOB
hfile and also have a MOB reference cell written into the normal region HFile.
MOB reference cells have the same key as the cell they are based on. The value of the reference cell
is made up of two pieces of metadata: the size of the actual value and the MOB hfile that contains
the original cell. In addition to any tags originally written to HBase, the reference cell prepends
two additional tags. The first is a marker tag that says the cell is a MOB reference. This can be
used later to scan specifically just for reference cells. The second stores the namespace and table
at the time the MOB hfile is written out. This tag is used to optimize how the MOB system finds
the underlying value in MOB hfiles after a series of HBase snapshot operations (ref HBASE-12332).
Note that tags are only available within HBase servers and by default are not sent over RPCs.
All MOB hfiles for a given table are managed within a logical region that does not directly serve
requests. When these MOB hfiles are created from a flush or MOB compaction they are placed in a
dedicated mob data area under the hbase root directory specific to the namespace, table, mob
logical region, and column family. In general that means a path structured like:
----
%HBase Root Dir%/mobdir/data/%namespace%/%table%/%logical region%/%column family%/
----
With default configs, an example table named 'some_table' in the
default namespace with a MOB enabled column family named 'foo' this HDFS directory would be
----
/hbase/mobdir/data/default/some_table/372c1b27e3dc0b56c3a031926e5efbe9/foo/
----
These MOB hfiles are maintained by special chores in the HBase Master rather than by any individual
Region Server. Specifically those chores take care of enforcing TTLs and compacting them. Note that
this compaction is primarily a matter of controlling the total number of files in HDFS because our
operational assumptions for MOB data is that it will seldom update or delete.
When a given MOB hfile is no longer needed as a result of our compaction process it is archived just
like any normal hfile. Because the table's mob region is independent of all the normal regions it
can coexist with them in the regular archive storage area:
----
/hbase/archive/data/default/some_table/372c1b27e3dc0b56c3a031926e5efbe9/foo/
----
The same hfile cleaning chores that take care of eventually deleting unneeded archived files from
normal regions thus also will take care of these MOB hfiles.
This same request can be made via the `Admin.majorCompact` Java API.
=== MOB Troubleshooting
[[mob.troubleshoot.cleaner.toonew]]
==== Adjusting the MOB cleaner's tolerance for new hfiles
The MOB cleaner chore ignores all MOB hfiles that were created more recently than an hour prior to
the start of the chore to ensure we don't miss the reference metadata from the corresponding regular
hfile. Without this safety check it would be possible for the cleaner chore to see a MOB hfile for
an in progress flush or compaction and prematurely archive the MOB data. This default buffer should
be sufficient for normal use.
You will need to adjust the tolerance if you use write amplification optimized MOB compaction and
the combination of your underlying filesystem performance and data shape is such that it could take
more than an hour to complete major compaction of a single region. For example, if your MOB data is
distributed such that your largest region adds 80GB of MOB data between compactions that include
rewriting MOB data and your HDFS cluster is only capable of writing 20MB/s for a single file then
when performing the optimized compaction the Region Server will take about a minute to write the
first 1GB MOB hfile and then another hour and seven minutes to write the remaining seventy-nine 1GB
MOB hfiles before finally committing the new reference hfile at the end of the compaction. Given
this example, you would need a larger tolerance window.
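The arithmetic in this example can be reproduced with a short sketch. The names are hypothetical, throughput is assumed to be steady, and 1GB is treated as 1024MB; the sketch uses the total write time, which is a slightly conservative stand-in for the time between the first MOB hfile and the final reference hfile.

```java
// Illustrative only: estimates compaction time against the cleaner's window.
final class MobCleanerWindowSketch {

    /** Seconds needed to write the given number of megabytes at a steady rate. */
    static double secondsToWrite(double megabytes, double megabytesPerSecond) {
        if (megabytes < 0 || megabytesPerSecond <= 0) {
            throw new IllegalArgumentException("need non-negative size and positive rate");
        }
        return megabytes / megabytesPerSecond;
    }

    /** True when the estimated compaction time is longer than the "too recent" window. */
    static boolean exceedsWindow(double compactionSeconds, long minAgeArchiveMillis) {
        return compactionSeconds * 1000.0 > minAgeArchiveMillis;
    }

    public static void main(String[] args) {
        double totalSeconds = secondsToWrite(80 * 1024.0, 20.0); // 80GB at 20MB/s
        long oneHourMillis = 60L * 60L * 1000L;                  // the default buffer of one hour
        System.out.printf("estimated compaction time: %.1f minutes%n", totalSeconds / 60.0);
        System.out.println("needs a larger window: " + exceedsWindow(totalSeconds, oneHourMillis));
    }
}
```

If the estimate is close to the window, it is safer to leave headroom than to size the window to the estimate exactly.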
You will also need to adjust the tolerance if Region Server flush operations take longer than an
hour for the two HDFS move operations needed to commit both the MOB hfile and the normal hfile that
references it. Such a delay should not happen with a normally configured and healthy HDFS and HBase.
The cleaner's window for "too recent" is controlled by setting `hbase.mob.min.age.archive` to a
positive integer number of milliseconds.
====
[source,xml]
----
<property>
<name>hbase.mob.min.age.archive</name>
<value>86400000</value>
<description>Example of tuning the cleaner to only archive files older than a day.</description>
</property>
----
====
==== Retrieving MOB metadata through the HBase Shell
While working on troubleshooting failures in the MOB system you can retrieve some of the internal
@ -468,3 +612,64 @@ $ hdfs dfs -count /hbase/mobdir/data/default/some_table
+
This data is spurious and may be reclaimed. You should sideline it, verify your application's view
of the table, and then delete it.
=== MOB Upgrade Considerations
Generally, data stored using the MOB feature should transparently continue to work correctly across
HBase upgrades.
==== Upgrading to a version with the "distributed MOB compaction" feature
Prior to the work in HBASE-22749, "Distributed MOB compactions", HBase had the Master coordinate all
compaction maintenance of the MOB hfiles. Centralizing management of the MOB data allowed for space
optimizations but safely coordinating that management with Region Servers resulted in edge cases that
caused data loss (ref link:https://issues.apache.org/jira/browse/HBASE-22075[HBASE-22075]).
Users of the MOB feature upgrading to a version of HBase that includes HBASE-22749 should be aware
of the following changes:
* The MOB system no longer allows setting "MOB Compaction Policies"
* The MOB system no longer attempts to group MOB values by the date of the original cell's timestamp
according to said compaction policies, daily or otherwise
* The MOB system no longer needs to track individual cell deletes through the use of special
files in the MOB storage area with the suffix `_del`. After upgrading you should sideline these
files.
* Under default configuration the MOB system should take much less time to perform a compaction of
MOB stored values. This is a direct consequence of the fact that HBase will place a much larger
load on the underlying filesystem when doing compactions of MOB stored values; the additional load
should be a multiple on the order of the number of region servers. For example, for a cluster
with three region servers and two masters, the default configuration should have HBase put three
times the load on HDFS during major compactions that rewrite MOB data when compared to Master
handled MOB compaction; it should also be approximately three times as fast.
* When the MOB system detects that a table has hfiles with references to MOB data but the reference
hfiles do not yet have the needed file level metadata (i.e. from use of the MOB feature prior to
HBASE-22749) then it will refuse to archive _any_ MOB hfiles from that table. The normal course of
periodic compactions done by Region Servers will update existing hfiles with MOB references, but
until a given table has been through the needed compactions operators should expect to see an
increased amount of storage used by the MOB feature.
* Performing a compaction with type "MOB" no longer has special handling to compact specifically the
MOB hfiles. Instead it will issue a warning and do a compaction of the table. For example using
the HBase shell as follows will result in a warning in the Master logs followed by a major
compaction of the 'example' table in its entirety or of the 'big' column family, respectively.
+
----
hbase> major_compact 'example', nil, 'MOB'
hbase> major_compact 'example', 'big', 'MOB'
----
+
The same is true for directly using the Java API for
`admin.majorCompact(TableName.valueOf("example"), CompactType.MOB)`.
* Similarly, manually performing a major compaction on a table or region will also handle compacting
the MOB stored values for that table or region respectively.
The following configuration setting has been deprecated and replaced:
* `hbase.master.mob.ttl.cleaner.period` has been replaced with `hbase.master.mob.cleaner.period`
The following configuration settings are no longer used:
* `hbase.mob.compaction.mergeable.threshold`
* `hbase.mob.delfile.max.count`
* `hbase.mob.compaction.batch.size`
* `hbase.mob.compactor.class`
* `hbase.mob.compaction.threads.max`


@ -38,6 +38,7 @@
:experimental:
:source-language: java
:leveloffset: 0
:stem:
// Logo for HTML -- doesn't render in PDF
ifdef::backend-html5[]