* HBASE-27088 IntegrationTestLoadCommonCrawl async load improvements
- Use an async client and work stealing executor for parallelism during loads.
- Remove the verification read retries; they are not very effective during
replication lag anyway.
- Increase max task attempts because S3 might throttle.
- Implement a side task that exercises Increments by extracting URLs from
content and updating a column family that tracks referrer counts. These are not
validated at this time. It could be possible to log the increments, sum
them with a reducer, and then verify the total, but this is left as a
future exercise.
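The side task described above can be pictured with a minimal, JDK-only sketch. It is not the test's actual code: the class name, the method name, and the deliberately simple URL pattern are assumptions made for illustration, and an in-memory map stands in for the HBase Increments against the referrer-count column family.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from the HBase source.
final class ReferrerCountSketch {
  // Deliberately simple pattern (an assumption); real content needs a more careful URL parser.
  private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

  /** Adds one count per URL occurrence found in content to the running totals. */
  static void countReferences(String content, Map<String, Long> totals) {
    Matcher m = URL.matcher(content);
    while (m.find()) {
      // In the real test this step would be an Increment on the referrer-count column family.
      totals.merge(m.group(), 1L, Long::sum);
    }
  }

  public static void main(String[] args) {
    Map<String, Long> totals = new HashMap<>();
    countReferences("<a href=\"http://a.example/x\">x</a> http://b.example/", totals);
    countReferences("see http://a.example/x again", totals);
    System.out.println(totals);
  }
}
```

Summing logged increments per URL in this way is also roughly what the proposed reducer-based verification would have to do.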
Signed-off-by: Viraj Jasani <vjasani@apache.org>
* Sum RPC time for writes (loader) and reads (verifier) and mutation bytes submitted. Expose as job counters.
* Fix an issue with completion chaining
* Pause loading if too many operations are in flight
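As a rough illustration of the ideas in this commit (asynchronous submission on a work stealing executor, with loading paused while too many operations are in flight), here is a JDK-only sketch. It is an assumption about one way to do this, not the loader's code: the class name, the semaphore-based throttle, and the fake write are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; the real loader talks to an HBase async client.
final class BoundedAsyncLoadSketch {
  /** Submits total fake writes with at most maxInFlight outstanding; returns the peak observed. */
  static int run(int total, int maxInFlight) {
    ExecutorService pool = Executors.newWorkStealingPool();
    Semaphore permits = new Semaphore(maxInFlight);
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    try {
      for (int i = 0; i < total; i++) {
        // Blocks the submitting thread while too many operations are in flight.
        permits.acquireUninterruptibly();
        CompletableFuture<Void> f = CompletableFuture.runAsync(() -> {
          int now = inFlight.incrementAndGet();
          peak.accumulateAndGet(now, Math::max);
          // A real task would issue an asynchronous write here.
          inFlight.decrementAndGet();
        }, pool).whenComplete((v, t) -> permits.release()); // release on success or failure
        futures.add(f);
      }
      // Wait on the chained stages so every permit has been released before returning.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    } finally {
      pool.shutdown();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    System.out.println("peak in flight: " + run(1000, 8));
  }
}
```

Releasing the permit in the chained whenComplete stage, and waiting on that stage rather than the original future, keeps the in-flight count accurate even when a task fails.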
* HBASE-22749 Distributed MOB compactions
- MOB compaction is now handled in-line with per-region compaction on region
servers
- regions with mob data store per-hfile metadata about which mob hfiles are
referenced
- admin requested major compaction will also rewrite MOB files; periodic RS
initiated major compaction will not
- periodically a chore in the master will initiate a major compaction that
will rewrite MOB values, so that the rewrite still happens without an admin
request. controlled by 'hbase.mob.compaction.chore.period'. default is weekly
- control how many RS the chore requests major compaction on in parallel
with 'hbase.mob.major.compaction.region.batch.size'. default is as
parallel as possible.
- periodic chore in master will scan backing hfiles from regions to get the
set of referenced mob hfiles and archive those that are no longer
referenced. control period with 'hbase.master.mob.cleaner.period'
- Optionally, RS that are compacting mob files can limit write
amplification by not rewriting values from mob hfiles over a certain size
limit. opt-in by setting 'hbase.mob.compaction.type' to 'optimized'.
control threshold by 'hbase.mob.compactions.max.file.size'.
default is 1GiB
- Should smoothly integrate with existing MOB users via rolling upgrade.
will delay old MOB file cleanup until per-region compaction has managed
to compact each region at least once so that used mob hfile metadata can
be gathered.
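For orientation, the opt-in settings named in the list above could be collected as follows. This is a sketch using plain java.util.Properties rather than an HBase Configuration. The property names come from the text above; the units (bytes for the size limit, seconds for the chore period) are assumptions and should be checked against the HBase reference guide before use.

```java
import java.util.Properties;

// Hypothetical illustration only; a real deployment would set these in hbase-site.xml.
final class MobCompactionSettingsSketch {
  static Properties optimizedMobSettings() {
    Properties p = new Properties();
    // Opt in to limiting write amplification during MOB compaction.
    p.setProperty("hbase.mob.compaction.type", "optimized");
    // 1 GiB threshold; expressing it in bytes is an assumption.
    p.setProperty("hbase.mob.compactions.max.file.size", Long.toString(1024L * 1024 * 1024));
    // Weekly chore period; expressing it in seconds is an assumption.
    p.setProperty("hbase.mob.compaction.chore.period", Long.toString(7L * 24 * 60 * 60));
    return p;
  }

  public static void main(String[] args) {
    optimizedMobSettings().forEach((k, v) -> System.out.println(k + " = " + v));
  }
}
```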
* HBASE-22749 Distributed MOB compactions
fix RestrictedApi
Co-authored-by: Vladimir Rodionov <vrodionov@apache.org>
Signed-off-by: Wellington Chevreuil <wchevreuil@apache.org>
There are two cases here:
1. The Chaos Monkey thread died and there is no chaos after that.
2. Sometimes regions are moved back so quickly that the region server has not finished its initialization yet.
Wait some time to make sure that the region server finishes its initialization.
Signed-off-by: Wellington Chevreuil <wellington.chevreuil@gmail.com>
Previous cherry picks:
commit 6aaef89 HBASE-26064 Introduce a StoreFileTracker to abstract the store file tracking logic
commit 43b40e9 HBASE-25988 Store the store file list by a file (#3578)
commit 6e05376 HBASE-26079 Use StoreFileTracker when splitting and merging (#3617)
commit 090b2fe HBASE-26224 Introduce a MigrationStoreFileTracker to support migratin… (#3656)
commit 0ee1689 HBASE-26246 Persist the StoreFileTracker configurations to TableDescriptor when creating table (#3666)
commit 2052e80 HBASE-26248 Should find a suitable way to let users specify the store… (#3665)
commit 5ff0f98 HBASE-26264 Add more checks to prevent misconfiguration on store file… (#3681)
commit fc4f6d1 HBASE-26280 Use store file tracker when snapshoting (#3685)
commit 06db852 HBASE-26326 CreateTableProcedure fails when FileBasedStoreFileTracker… (#3721)
commit e4e7cf8 HBASE-26386 Refactor StoreFileTracker implementations to expose the s… (#3774)
commit 08d1171 HBASE-26328 Clone snapshot doesn't load reference files into FILE SFT impl (#3749)
commit 8bec26e HBASE-26263 [Rolling Upgrading] Persist the StoreFileTracker configur… (#3700)
commit a288365 HBASE-26271: Cleanup the broken store files under data directory (#3786)
commit d00b5fa HBASE-26454 CreateTableProcedure still relies on temp dir and renames… (#3845)
commit 771e552 HBASE-26286: Add support for specifying store file tracker when restoring or cloning snapshot
commit f16b7b1 HBASE-26265 Update ref guide to mention the new store file tracker im… (#3942)
commit 755b3b4 HBASE-26585 Add SFT configuration to META table descriptor when creating META (#3998)
commit 39c42c7 HBASE-26639 The implementation of TestMergesSplitsAddToTracker is pro… (#4010)
commit 6e1f5b7 HBASE-26586 Should not rely on the global config when setting SFT implementation for a table while upgrading (#4006)
commit f1dd865 HBASE-26654 ModifyTableDescriptorProcedure shoud load TableDescriptor… (#4034)
commit 8fbc9a2 HBASE-26674 Should modify filesCompacting under storeWriteLock (#4040)
commit 5aa0fd2 HBASE-26675 Data race on Compactor.writer (#4035)
commit 3021c58 HBASE-26700 The way we bypass broken track file is not enough in Stor… (#4055)
commit a8b68c9 HBASE-26690 Modify FSTableDescriptors to not rely on renaming when wr… (#4054)
commit dffeb8e HBASE-26587 Introduce a new Admin API to change SFT implementation (#… #4080)
commit b265fe5 HBASE-26673 Implement a shell command for change SFT implementation (#4113)
commit 4cdb380 HBASE-26640 Reimplement master local region initialization to better … (#4111)
commit 77bb153 HBASE-26707: Reduce number of renames during bulkload (#4066) (#4122)
commit a4b192e HBASE-26611 Changing SFT implementation on disabled table is dangerous (#4082)
commit d3629bb HBASE-26837 Set SFT config when creating TableDescriptor in TestClone… (#4226)
commit 541d748 HBASE-26881 Backport HBASE-25368 to branch-2 (#4267)
Fixups for precommit error prone, checkstyle, and javadoc warnings after applying cherry picks.
Signed-off-by: Josh Elser <elserj@apache.org>
Reviewed-by: Wellington Ramos Chevreuil <wchevreuil@apache.org>
This is no longer needed since we've transitioned to the shaded Jersey shipped in
hbase-thirdparty. Also drop supplemental models entry.
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Andrew Purtell <apurtell@apache.org>
Avoid the pattern where a Random object is allocated, used once or twice, and
then left for GC. This pattern triggers warnings from some static analysis tools
because this pattern leads to poor effective randomness. In a few cases we were
legitimately suffering from this issue; in others a change is still good to
reduce noise in analysis results.
Use ThreadLocalRandom where there is no requirement to set the seed to gain
good reuse.
Where useful relax use of SecureRandom to simply Random or ThreadLocalRandom,
which are unlikely to block if the system entropy pool is low, if we don't need
cryptographically strong randomness for the use case. The exception to this is
normalization of use of Bytes#random to fill byte arrays with randomness.
Because Bytes#random may be used to generate key material it must be backed by
SecureRandom.
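The distinction drawn above can be sketched with JDK classes alone. The class and method names below are hypothetical and are not taken from the HBase code base; the sketch only illustrates reusing ThreadLocalRandom for ordinary randomness while keeping a shared SecureRandom for anything that may become key material.

```java
import java.security.SecureRandom;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration only; not the Bytes#random implementation.
final class RandomUsageSketch {
  // One shared instance instead of allocating a new generator per call.
  private static final SecureRandom SECURE = new SecureRandom();

  /** Non-cryptographic choice with no seed requirement: reuse the per-thread generator. */
  static int pickIndex(int bound) {
    return ThreadLocalRandom.current().nextInt(bound);
  }

  /** Bytes that may be used as key material must come from SecureRandom. */
  static byte[] keyMaterial(int length) {
    byte[] bytes = new byte[length];
    SECURE.nextBytes(bytes);
    return bytes;
  }

  public static void main(String[] args) {
    System.out.println("index: " + pickIndex(10));
    System.out.println("key bytes: " + keyMaterial(16).length);
  }
}
```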
Signed-off-by: Duo Zhang <zhangduo@apache.org>
The code starting at `ZKUtil.dump(ZKWatcher)` is a small mess – it has cyclic dependencies woven
through itself, `ZKWatcher` and `RecoverableZooKeeper`. It also initializes a static variable in
`ZKUtil` through the factory for `RecoverableZooKeeper` instances. Let's decouple and clean it
up.
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Josh Elser <elserj@apache.org>
Use a hybrid logical clock for timestamping entries.
Using BufferedMutator without HLC was not good because we assign client timestamps.
The store loop is fast enough that, on rare occasions, two temporally adjacent URLs
in the set of WARCs are equivalent and the timestamp does not advance between them,
which later leads to a rare false positive CORRUPT finding.
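To make the timestamp problem concrete, here is a simplified sketch of a clock that never returns the same value twice, even when the wall clock has not advanced. It is a monotonic "max of wall time and last value plus one" clock in the spirit of a hybrid logical clock, not the test's actual HLC; the class name and the injectable wall clock are assumptions made so the behaviour is easy to check.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical illustration only; a full HLC also tracks a separate logical counter.
final class MonotonicClockSketch {
  private final AtomicLong last = new AtomicLong();
  private final LongSupplier wallClock;

  MonotonicClockSketch(LongSupplier wallClock) {
    this.wallClock = wallClock;
  }

  /** Returns a timestamp strictly greater than any value previously returned. */
  long next() {
    long now = wallClock.getAsLong();
    // If wall time has not moved past the last value, step forward by one instead.
    return last.updateAndGet(prev -> Math.max(prev + 1, now));
  }

  public static void main(String[] args) {
    MonotonicClockSketch clock = new MonotonicClockSketch(System::currentTimeMillis);
    long a = clock.next();
    long b = clock.next();
    System.out.println(b > a); // two writes in the same millisecond still get distinct timestamps
  }
}
```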
While making changes, support direct S3N paths as input paths on the command line.
Signed-off-by: Viraj Jasani <vjasani@apache.org>
10/17 commits of HBASE-22120, original commit f6ff519dd0
Co-authored-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Peter Somogyi <psomogyi@apache.org>
* HBASE-25867 Extra doc around ITBLL
Minor edits to a few log messages.
Explain how the '-c' option works when passed to ChaosMonkeyRunner.
Some added notes on ITBLL.
Fix whacky 'R' and 'Not r' thing in Master (shows when you run ITBLL).
In HRS, report hostname and port when it checks in (was debugging issue
where Master and HRS had different notions of its hostname).
Spare a dirty FNFException on startup if base dir not yet in place.
* Address Review by Sean
Signed-off-by: Sean Busbey <busbey@apache.org>
This integration test loads successful resource retrieval records from
the Common Crawl (https://commoncrawl.org/) public dataset into an HBase
table and writes records that can be used to later verify the presence
and integrity of those records.
Run like:
    ./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
      -Dfs.s3n.awsAccessKeyId=<AWS access key> \
      -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
      /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
      /path/to/tmp/warc-loader-output
Access to the Common Crawl dataset in S3 is made available to anyone by
Amazon AWS, but Hadoop's S3N filesystem still requires valid access
credentials to initialize.
The input path can either specify a directory or a file. The file may
optionally be compressed with gzip. If a directory, the loader expects
the directory to contain one or more WARC files from the Common Crawl
dataset. If a file, the loader expects a list of Hadoop S3N URIs which
point to S3 locations for one or more WARC files from the Common Crawl
dataset, one URI per line. Lines should be terminated with the UNIX line
terminator.
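The file form of the input described above (one S3N URI per line, optionally gzip-compressed) could be read along these lines. This is a JDK-only sketch, not the loader's code: the class and method names are hypothetical, the ".gz" suffix check in main is an assumed way to detect compression, and skipping blank lines is an added assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical illustration only; the real loader reads its input through Hadoop filesystems.
final class WarcPathListSketch {
  /** Reads one URI per line from the stream, decompressing first if gzipped is true. */
  static List<String> readUris(InputStream raw, boolean gzipped) {
    try (InputStream in = gzipped ? new GZIPInputStream(raw) : raw;
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      List<String> uris = new ArrayList<>();
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) { // skipping blank lines is an assumption
          uris.add(line);
        }
      }
      return uris;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path path = Paths.get(args[0]);
    try (InputStream raw = Files.newInputStream(path)) {
      readUris(raw, path.toString().endsWith(".gz")).forEach(System.out::println);
    }
  }
}
```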
Included in hbase-it/src/test/resources/CC-MAIN-2021-10-warc.paths.gz
is a list of all WARC files comprising the Q1 2021 crawl archive. There
are 64,000 WARC files in this data set, each containing ~1GB of gzipped
data. The WARC files contain several record types, such as metadata,
request, and response, but we only load the response record types. If
the HBase table schema does not specify compression (the default), the
stored data expands roughly 10x. Loading the full crawl archive results
in a table approximately 640 TB in size.
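The size estimate above follows from simple arithmetic: 64,000 files times roughly 1 GB each times a roughly 10x expansion. A small sketch, using decimal units and a hypothetical helper name, makes the calculation explicit:

```java
// Hypothetical illustration only; the inputs are the approximate figures quoted in the text.
final class CrawlSizeEstimate {
  /** Estimated table size in terabytes (decimal units, 1 TB = 1000 GB). */
  static long estimatedTableTerabytes(long warcFiles, long gigabytesPerFile, long expansion) {
    return warcFiles * gigabytesPerFile * expansion / 1000;
  }

  public static void main(String[] args) {
    // 64,000 files x ~1 GB x ~10x expansion
    System.out.println(estimatedTableTerabytes(64_000, 1, 10) + " TB");
  }
}
```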
The hadoop-aws jar will be needed at runtime to instantiate the S3N
filesystem. Use the -files ToolRunner argument to add it.
You can also split the Loader and Verify stages:
Load with:
    ./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Loader' \
      -files /path/to/hadoop-aws.jar \
      -Dfs.s3n.awsAccessKeyId=<AWS access key> \
      -Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
      /path/to/test-CC-MAIN-2021-10-warc.paths.gz \
      /path/to/tmp/warc-loader-output
Verify with:
    ./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Verify' \
      /path/to/tmp/warc-loader-output
Signed-off-by: Michael Stack <stack@apache.org>
Conflicts:
pom.xml