Andrew Purtell
6ad5b9e569
HBASE-25824 IntegrationTestLoadCommonCrawl ( #3208 )
...
* HBASE-25824 IntegrationTestLoadCommonCrawl
This integration test loads successful resource retrieval records from
the Common Crawl (https://commoncrawl.org/) public dataset into an HBase
table and writes records that can be used to later verify the presence
and integrity of those records.
Run like:
./bin/hbase org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl \
-Dfs.s3n.awsAccessKeyId=<AWS access key> \
-Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
/path/to/test-CC-MAIN-2021-10-warc.paths.gz \
/path/to/tmp/warc-loader-output
Access to the Common Crawl dataset in S3 is made available to anyone by
Amazon AWS, but Hadoop's S3N filesystem still requires valid access
credentials to initialize.
The input path can either specify a directory or a file. The file may
optionally be compressed with gzip. If a directory, the loader expects
the directory to contain one or more WARC files from the Common Crawl
dataset. If a file, the loader expects a list of Hadoop S3N URIs which
point to S3 locations for one or more WARC files from the Common Crawl
dataset, one URI per line. Lines should be terminated with the UNIX line
terminator.
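
The file form of the input described above can be sketched in a few lines of Python. This is an illustration only, not the loader's actual code: the function name `read_warc_paths` is hypothetical, and detecting gzip by its two magic bytes is an assumption (the loader may decide differently, for example by file extension).

```python
import gzip
from pathlib import Path

# First two bytes of every gzip stream (assumed detection method).
GZIP_MAGIC = b"\x1f\x8b"


def read_warc_paths(path):
    """Return the URIs listed in a paths file, one per line.

    The file may be plain text or gzip-compressed. Hypothetical helper,
    not part of IntegrationTestLoadCommonCrawl.
    """
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Split on the UNIX line terminator only and drop empty lines,
    # such as the one produced by a trailing newline.
    return [line for line in raw.decode("utf-8").split("\n") if line]
```

Splitting on "\n" alone means a file with Windows line endings would leave a stray carriage return on each URI, which is why the description asks for UNIX line terminators.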
Included in hbase-it/src/test/resources/CC-MAIN-2021-10-warc.paths.gz
is a list of all WARC files comprising the Q1 2021 crawl archive. There
are 64,000 WARC files in this data set, each containing ~1GB of gzipped
data. The WARC files contain several record types, such as metadata,
request, and response, but we only load the response record types. If
the HBase table schema does not specify compression (the default), there
is roughly a 10x expansion. Loading the full crawl archive results in a
table approximately 640 TB in size.
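
The 640 TB figure follows from the numbers quoted above. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB) and treating the ~1 GB per file and ~10x expansion as exact; the function name is hypothetical:

```python
def estimate_table_size_tb(num_files, gzipped_gb_per_file, expansion_factor):
    """Rough size in TB of an uncompressed table loaded from gzipped WARC files."""
    total_gb = num_files * gzipped_gb_per_file * expansion_factor
    return total_gb / 1000  # decimal units: 1 TB = 1000 GB


# 64,000 files, ~1 GB gzipped each, ~10x expansion without table compression.
full_crawl_tb = estimate_table_size_tb(64_000, 1, 10)
print(full_crawl_tb)
```

Loading only a subset of the paths list scales the estimate linearly, so a 1% sample of 640 files would come to roughly 6.4 TB under the same assumptions.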
The hadoop-aws jar will be needed at runtime to instantiate the S3N
filesystem. Use the -files ToolRunner argument to add it.
You can also split the Loader and Verify stages:
Load with:
./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Loader' \
-files /path/to/hadoop-aws.jar \
-Dfs.s3n.awsAccessKeyId=<AWS access key> \
-Dfs.s3n.awsSecretAccessKey=<AWS secret key> \
/path/to/test-CC-MAIN-2021-10-warc.paths.gz \
/path/to/tmp/warc-loader-output
Verify with:
./bin/hbase 'org.apache.hadoop.hbase.test.IntegrationTestLoadCommonCrawl$Verify' \
/path/to/tmp/warc-loader-output
Signed-off-by: Michael Stack <stack@apache.org>
2021-05-03 17:59:00 -07:00
Duo Zhang
7640134e3e
HBASE-25774 Added more detailed logs about the restarting of region servers ( #3213 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
2021-05-03 20:33:33 +08:00
GeorryHuang
00fec24c90
HBASE-25790 NamedQueue 'BalancerRejection' for recent history of balancer skipping ( #3182 )
...
Signed-off-by: Viraj Jasani <vjasani@apache.org>
2021-05-02 21:30:48 +05:30
Che Xun
accfcebd45
HBASE-25833 fix HBase Configuration File Descriptions ( #3216 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
2021-05-02 21:02:51 +08:00
Kota-SH
5d42f58ff6
HBASE-25816: Improve the documentation of Architecture section of reference guide ( #3211 )
...
Signed-off-by: Sakthi <sakthi@apache.org>
2021-04-30 13:42:06 -07:00
Duo Zhang
73a82bd7c6
HBASE-25825 RSGroupBasedLoadBalancer.onConfigurationChange should chain the request to internal balancer ( #3209 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-30 22:45:33 +08:00
Duo Zhang
6c65314cdf
HBASE-25819 Fix style issues for StochasticLoadBalancer ( #3207 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-29 11:03:55 +08:00
Nick Dimiduk
b061b0c4ed
HBASE-25779 HRegionServer#compactSplitThread should be private
...
Minor refactor. Make the `compactSplitThread` member field of `HRegionServer` private, and gate
all access through the getter method.
Signed-off-by: Yulin Niu <niuyulin@apache.org>
Signed-off-by: Pankaj Kumar <pankajkumar@apache.org>
2021-04-28 16:46:36 -07:00
Michael Stack
2382f68b23
HBASE-25792 Filter out o.a.hadoop.thirdparty building shaded jars ( #3184 )
...
Need to add to allowed-licenses list too....
Signed-off-by: Wei-Chiu Chuang <weichiu@apache.org>
Reviewed-by: Duo Zhang <zhangduo@apache.org>
Reviewed-by: Nick Dimiduk <ndimiduk@apache.org>
2021-04-27 08:37:25 -07:00
Duo Zhang
8856f61986
HBASE-25757 Addendum remove CandidateGenerator classes under hbase-server module
2021-04-27 23:25:51 +08:00
Duo Zhang
8d2a0efb7a
HBASE-25811 The client integration test is failing after HBASE-22120 merged ( #3201 )
...
move opentelemetry jars to client-facing-thirdparty
add opentelemetry jars when init map reduce job dependencies
Signed-off-by: Xin Sun <ddupgs@gmail.com>
2021-04-27 11:42:48 +08:00
Duo Zhang
a4d954e606
HBASE-25757 Move BaseLoadBalancer to hbase-balancer module ( #3191 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-26 12:03:25 +08:00
Duo Zhang
f36e153964
HBASE-25778 The tracing implementation for AsyncConnectionImpl.getHbck is incorrect ( #3165 )
...
Signed-off-by: meiyi <myimeiyi@gmail.com>
2021-04-25 09:23:23 +08:00
Duo Zhang
be4503d9f8
HBASE-23762 Add documentation on how to enable and view tracing with OpenTelemetry ( #3135 )
...
Signed-off-by: Michael Stack <stack@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
b714889989
HBASE-25733 Upgrade opentelemetry to 1.0.1 ( #3122 )
...
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
8df9bebdd3
HBASE-25732 Change the command line argument for tracing after upgrading opentelemetry to 1.0.0 ( #3123 )
...
Signed-off-by: Michael Stack <stack@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
7f90c2201f
HBASE-25723 Temporarily remove the trace support for RegionScanner.next ( #3119 )
...
Signed-off-by: Viraj Jasani <vjasani@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
8399293e21
HBASE-25616 Upgrade opentelemetry to 1.0.0 ( #3034 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
8d68f8cd1c
HBASE-25617 Revisit the span names ( #2998 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
f6ff519dd0
HBASE-25591 Upgrade opentelemetry to 0.17.1 ( #2971 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
bb8c4967f8
HBASE-25535 Set span kind to CLIENT in AbstractRpcClient ( #2907 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
2be2c63f0d
HBASE-25484 Add trace support for WAL sync ( #2892 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
03e12bfa4a
HBASE-25455 Add trace support for HRegion read/write operation ( #2861 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
ae2c62ffaa
HBASE-25481 Add host and port attribute when tracing rpc call at client side ( #2857 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
dcb78bd4bd
HBASE-25454 Add trace support for connection registry ( #2828 )
...
Signed-off-by: stack <stack@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
805b2ae2ad
HBASE-23898 Add trace support for simple apis in async client ( #2813 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
57960fa8fa
HBASE-25424 Find a way to config OpenTelemetry tracing without direct… ( #2808 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
2420286715
HBASE-25401 Add trace support for async call in rpc client ( #2790 )
...
Signed-off-by: Guanghao Zhang <zghao@apache.org>
2021-04-25 09:23:23 +08:00
Duo Zhang
302d9ea8b8
HBASE-25373 Remove HTrace completely in code base and try to make use of OpenTelemetry
...
Signed-off-by: stack <stack@apache.org>
2021-04-25 09:23:23 +08:00
Andrew Purtell
9895b2dfdf
HBASE-25756 Support alternate compression for major and minor compactions ( #3142 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
2021-04-23 15:45:26 -07:00
Duo Zhang
96fefce9c3
HBASE-25802 Miscellaneous style improvements for load balancer related classes ( #3192 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-23 15:20:27 +08:00
haxiaolin
996862c1cc
HBASE-25754 StripeCompactionPolicy should support compacting cold regions ( #3152 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
2021-04-23 14:58:53 +08:00
Toshihiro Suzuki
5f4e2e111b
HBASE-25766 Introduce RegionSplitRestriction that restricts the pattern of the split point ( #3150 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Michael Stack <stack@apache.org>
2021-04-22 13:53:36 +09:00
Duo Zhang
50920ee306
HBASE-25774 TestSyncReplicationStandbyKillRS#testStandbyKillRegionServer is flaky ( #3189 )
...
Wait for the restarter thread to finish before checking the state
Add more detailed logs
Signed-off-by: meiyi <myimeiyi@gmail.com>
2021-04-22 10:10:15 +08:00
Duo Zhang
d5c5e48839
HBASE-25793 Move BaseLoadBalancer.Cluster to a separated file ( #3185 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-22 09:59:49 +08:00
Baiqiang Zhao
72aa741273
HBASE-25798 typo in MetricsAssertHelper ( #3186 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
2021-04-21 21:40:39 +08:00
haxiaolin
0d257baf29
HBASE-25763 TestRSGroupsWithACL.setupBeforeClass is flaky ( #3158 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-21 14:41:51 +08:00
Duo Zhang
781da1899a
HBASE-25290 Remove table on master related code in balancer implementation ( #3162 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-20 21:31:09 +08:00
Peter Somogyi
33e886c6cc
HBASE-25780 Add 2.2.7 to download page [addendum] ( #3180 )
...
Correct release date and add EOL notice
Signed-off-by: Jan Hentschel <jan.hentschel@ultratendency.com>
2021-04-19 16:17:58 +02:00
niuyulin
e8ac1fbe97
HBASE-25777 Fix wrong initialization value in StressAssignmentManagerMonkeyFactory ( #3164 )
...
Signed-off-by: meiyi <myimeiyi@gmail.com>
2021-04-19 17:46:57 +08:00
Nick Dimiduk
b65890da1d
Revert "HBASE-25739 TableSkewCostFunction need to use aggregated deviation ( #3067 )"
...
This reverts commit 533c84d330.
2021-04-16 09:35:02 -07:00
Guanghao Zhang
94f4479e8f
HBASE-25780 Add 2.2.7 to download page ( #3175 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Jan Hentschel <jan.hentschel@ultratendency.com>
2021-04-16 15:27:33 +08:00
Duo Zhang
bf78246b4f
HBASE-25775 Use a special balancer to deal with maintenance mode ( #3161 )
...
Signed-off-by: Wellington Chevreuil <wchevreuil@apache.org>
2021-04-16 09:50:24 +08:00
clarax
533c84d330
HBASE-25739 TableSkewCostFunction need to use aggregated deviation ( #3067 )
...
Signed-off-by: Michael Stack <stack@apache.org>
Reviewed-by: David Manning <david.manning@salesforce.com>
2021-04-15 13:12:07 -07:00
xiaoyu
6cf4fdde61
HBASE-25776 Use Class.asSubclass to fix the warning in StochasticLoadBalancer.loadCustomCostFunctions ( #3163 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
2021-04-15 23:34:06 +05:30
Nick Dimiduk
bc52bca741
HBASE-25770 Http InfoServers should honor gzip encoding when requested ( #3159 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Josh Elser <elserj@apache.org>
2021-04-15 09:07:13 -07:00
ZhiChen
c5b0989d22
HBASE-25762 Improvement for some debug-logging guards ( #3145 )
...
Signed-off-by: Duo Zhang <zhangduo@apache.org>
2021-04-13 23:03:55 +08:00
Duo Zhang
5910e9e2d1
HBASE-25767 CandidateGenerator.getRandomIterationOrder is too slow on large cluster ( #3149 )
...
Signed-off-by: XinSun <ddupgs@gmail.com>
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-13 23:00:54 +08:00
Duo Zhang
de012d7d1f
HBASE-25759 The master services field in LocalityBasedCostFunction is never used ( #3144 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-12 22:27:01 +08:00
Duo Zhang
f9e928e5a7
HBASE-25184 Move RegionLocationFinder to hbase-balancer ( #2543 )
...
Signed-off-by: Yulin Niu <niuyulin@apache.org>
2021-04-10 21:10:53 +08:00