WAL storage can be expensive, especially if the cell values
represented in the edits are large, consisting of blobs or
significant lengths of text. Such WALs might need to be kept around
for a fairly long time to satisfy replication constraints on a
space-limited (or space-contended) filesystem.
We have a custom dictionary compression scheme for cell metadata that
is engaged when WAL compression is enabled in site configuration.
This is fine for that application, where we can expect the universe
of values and their lengths in the custom dictionaries to be
constrained. For arbitrary cell values it is better to use one of the
available compression codecs, which are suitable for arbitrary albeit
compressible data.
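For illustration, enabling this from Java configuration might look like the
sketch below. The value-compression property names and the codec value are
assumptions patterned on the existing WAL compression switch, not taken from
this change itself; check the release documentation for the exact keys.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalValueCompressionExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Existing switch: dictionary compression of WAL cell metadata.
    conf.setBoolean("hbase.regionserver.wal.enablecompression", true);
    // Assumed switch and codec for compressing cell values with a general-purpose
    // codec; these property names are illustrative only.
    conf.setBoolean("hbase.regionserver.wal.value.compression.enabled", true);
    conf.set("hbase.regionserver.wal.value.compression.type", "snappy");
    System.out.println("WAL value compression codec: "
        + conf.get("hbase.regionserver.wal.value.compression.type"));
  }
}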
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
* HBASE-25875 RegionServer failed to start with IllegalThreadStateException due to race condition in AuthenticationTokenSecretManager's start & retrievePassword method
Signed-off-by: stack <stack@apache.org>
* HBASE-25867 Extra doc around ITBLL
Minor edits to a few log messages.
Explain how the '-c' option works when passed to ChaosMonkeyRunner.
Some added notes on ITBLL.
Fix whacky 'R' and 'Not r' thing in Master (shows when you run ITBLL).
In HRS, report hostname and port when it checks in (was debugging an issue
where Master and HRS had different notions of its hostname).
Spare a dirty FNFException on startup if base dir not yet in place.
* Address Review by Sean
Signed-off-by: Sean Busbey <busbey@apache.org>
Revert "HBASE-25032 Wait for region server to become online before adding it to online servers in Master (#2769)"
This reverts commit 1e4639d2eb.
Conflicts:
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
In CatalogJanitor we schedule GCRegionProcedure to clean up both
filesystem and in-memory state after a split, and
GCMultipleMergedRegionsProcedure to do the same for merges. Both of these
procedures clean up in-memory state, but CatalogJanitor also does this
redundantly just after scheduling the procedures. The cleanup should be
done in only one place. Presumably we are using the procedures to do it in
a principled way. Remove the redundancy in CatalogJanitor and fix any
follow-on issues, like test failures.
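As a rough, self-contained sketch of the intended division of labor (the class
and method names below are simplified stand-ins, not the real CatalogJanitor
or procedure APIs):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model: the janitor only schedules the GC work; the scheduled task alone
// removes both the filesystem artifacts and the in-memory state, so the cleanup
// has a single owner.
public class SingleOwnerCleanupSketch {

  static final Set<String> inMemoryParents = ConcurrentHashMap.newKeySet();
  static final ExecutorService procedureExecutor = Executors.newSingleThreadExecutor();

  static void scheduleGcRegionProcedure(String parent) {
    procedureExecutor.submit(() -> {
      System.out.println("procedure: deleting filesystem artifacts for " + parent);
      inMemoryParents.remove(parent); // in-memory cleanup lives here, and only here
    });
  }

  static void catalogJanitorScan(String parent) {
    scheduleGcRegionProcedure(parent);
    // After the change there is no redundant inMemoryParents.remove(parent) here.
  }

  public static void main(String[] args) throws InterruptedException {
    inMemoryParents.add("split-parent");
    catalogJanitorScan("split-parent");
    procedureExecutor.shutdown();
    procedureExecutor.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("in-memory parents left: " + inMemoryParents); // []
  }
}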
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
Processing of the RS report happens asynchronously from other activities
which can mutate region state. For example, a split procedure may already
be running. A split procedure cannot succeed if the parent region is no
longer open, so we can ignore it in that case.
Note that submitting more than one split procedure for a given region is
harmless -- the split is fenced in the procedure handling -- but it would
be noisy in the logs. Only one procedure can succeed. The other
procedure(s) would abort during initialization and report failure with
WARN level logging.
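A self-contained toy sketch of the fencing described above; the enum, set, and
method names are hypothetical stand-ins for the master's view of region state,
not the actual assignment-manager code:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SplitRequestFencing {

  enum State { OPEN, SPLITTING, CLOSED }

  // Tracks regions with a split already submitted (hypothetical stand-in).
  static final Set<String> submitted = ConcurrentHashMap.newKeySet();

  static void maybeSubmitSplit(String region, State regionState) {
    if (regionState != State.OPEN) {
      // Parent is no longer open: the split cannot succeed, so drop the request.
      System.out.println("Ignoring split of " + region + " in state " + regionState);
      return;
    }
    if (!submitted.add(region)) {
      // A duplicate submission is fenced later anyway; it would only WARN and abort.
      System.out.println("Split already pending for " + region + " (harmless, but noisy)");
      return;
    }
    System.out.println("Submitting split procedure for " + region);
  }

  public static void main(String[] args) {
    maybeSubmitSplit("region-a", State.OPEN);      // submits
    maybeSubmitSplit("region-a", State.SPLITTING); // ignored: parent no longer open
    maybeSubmitSplit("region-b", State.CLOSED);    // ignored
  }
}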
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
Signed-off-by: Pankaj <pankajkumar@apache.org>
RegionStates#getAssignmentsForBalancer is used by the HMaster to
collect all regions of interest to the balancer for the next chore
iteration. We check if a table is in disabled state to exclude
regions that will not be of interest (because disabled regions are
or will be offline) or are in a state where they shouldn't be
mutated (like SPLITTING). The current checks are not actually
comprehensive.
Filter out regions not in OPEN or OPENING state when building the
set of interesting regions for the balancer to consider. Only
regions open (or opening) on the cluster are of interest to
balancing calculations for the current iteration. Regions in all
other states are not of interest until at least the next chore
iteration: they are already offline (OFFLINE, FAILED_*), are not
currently subject to balancer decisions (SPLITTING, SPLITTING_NEW,
MERGING, MERGING_NEW), or will be offline shortly (CLOSING).
Add TRACE level logging.
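A minimal sketch of the filtering rule, using hypothetical types rather than
the real RegionStates API:

import java.util.List;
import java.util.stream.Collectors;

public class BalancerAssignmentFilter {

  enum State { OPEN, OPENING, CLOSING, OFFLINE, FAILED_OPEN, FAILED_CLOSE,
               SPLITTING, SPLITTING_NEW, MERGING, MERGING_NEW }

  // Hypothetical stand-in for a region plus its current state.
  record RegionAndState(String region, State state) {}

  static List<String> assignmentsForBalancer(List<RegionAndState> all) {
    // Only OPEN or OPENING regions are of interest for this balancer iteration.
    return all.stream()
        .filter(rs -> rs.state() == State.OPEN || rs.state() == State.OPENING)
        .map(RegionAndState::region)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<RegionAndState> regions = List.of(
        new RegionAndState("r1", State.OPEN),
        new RegionAndState("r2", State.SPLITTING), // excluded: mid-split
        new RegionAndState("r3", State.CLOSING),   // excluded: going offline
        new RegionAndState("r4", State.OPENING));
    System.out.println(assignmentsForBalancer(regions)); // [r1, r4]
  }
}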
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
We claim in a WARN level log line to be "Playing-it-safe skipping merge/
split gc'ing of regions from hbase:meta while regions-in-transition (RIT)"
but do not actually skip because of a missing return. Remove the warning.
Signed-off-by: Duo Zhang <zhangduo@apache.org>
* HBASE-25824 IntegrationTestLoadCommonCrawl
This integration test loads successful resource retrieval records from
the Common Crawl (https://commoncrawl.org/) public dataset into an HBase
table and writes records that can be used to later verify the presence
and integrity of those records.
Run like:
./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
-Dfs.s3n.awsAccessKeyId=<AWS access key> \
-Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
/path/to/test-CC-MAIN-2021-10-warc.paths.gz \
/path/to/tmp/warc-loader-output
Access to the Common Crawl dataset in S3 is made available to anyone by
Amazon AWS, but Hadoop's S3N filesystem still requires valid access
credentials to initialize.
The input path can specify either a directory or a file. The file may
optionally be compressed with gzip. If a directory, the loader expects
the directory to contain one or more WARC files from the Common Crawl
dataset. If a file, the loader expects a list of Hadoop S3N URIs which
point to S3 locations for one or more WARC files from the Common Crawl
dataset, one URI per line. Lines should be terminated with the UNIX line
terminator.
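For illustration only, a line in such a paths file would look something like
the following; the bucket and key layout follow Common Crawl's published
conventions, and the segment and file names are placeholders:
s3n://commoncrawl/crawl-data/CC-MAIN-2021-10/segments/<segment-id>/warc/<warc-file-name>.warc.gz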
Included in hbase-it/src/test/resources/CC-MAIN-2021-10-warc.paths.gz
is a list of all WARC files comprising the Q1 2021 crawl archive. There
are 64,000 WARC files in this data set, each containing ~1GB of gzipped
data. The WARC files contain several record types, such as metadata,
request, and response, but we only load the response record types. If
the HBase table schema does not specify compression (the default), there
is roughly a 10x expansion, so loading the full crawl archive (64,000
files at ~1 GB each) results in a table approximately 640 TB in size.
The hadoop-aws jar will be needed at runtime to instantiate the S3N
filesystem. Use the -files ToolRunner argument to add it.
You can also split the Loader and Verify stages:
Load with:
./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Loader' \
-files /path/to/hadoop-aws.jar \
-Dfs.s3n.awsAccessKeyId=<AWS access key> \
-Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
/path/to/test-CC-MAIN-2021-10-warc.paths.gz \
/path/to/tmp/warc-loader-output
Verify with:
./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Verify' \
/path/to/tmp/warc-loader-output
Signed-off-by: Michael Stack <stack@apache.org>