(Forward port from branch-2; simplified by the fact that there
is no hadoop-2.0 profile on master branch)
Make it so our published poms carry the minimum needed to run
an hbase; the published pom has no profiles -- the profiles
specified at build time are resolved, their dependencies inlined,
and then they are stripped -- and no build-time or plugin
dependencies, properties, etc. Resultant poms have explicit
hadoop lib versions baked in -- no more being able to choose
hbase with hadoop2 or hadoop3 at downstream build time by setting
a '-Dhadoop.profile=X.0'.
Pattern is to add profiles to sub-modules that have none when
the flatten plugin complains it can't resolve a hadoop
dependency's 'version' (e.g. hadoop-common, hadoop-hdfs).
Adding the profile in the sub-module makes it so the flatten
plugin can figure 'hadoop.version' definitively.
(In master there is only the hadoop-3.0 profile).
Another spin on the above happens when profiles already exist
in a submodule but the flatten plugin is complaining it can't
figure the version on a hadoop dependency NOT under
profiles. Below, we move the delinquent hadoop dependency under
existing profiles (minikdc was the usual dependency outside
profiles in sub-modules that flatten complained about).
Sometimes, when moving a hadoop dependency under a profile, there
would be excludes on the local dependency. If the parent pom
excludes section was missing the local excludes, we added them
to the parent module so all excluding is done up there in
the parent profile dependencyManagement section.
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Avoid the pattern where a Random object is allocated, used once or twice, and
then left for GC. Some static analysis tools flag this pattern because it
leads to poor effective randomness. In a few cases we were legitimately
suffering from this issue; in others a change is still good to reduce noise
in analysis results.
Use ThreadLocalRandom where there is no requirement to set the seed to gain
good reuse.
Where we don't need cryptographically strong randomness for the use case,
relax use of SecureRandom to simply Random or ThreadLocalRandom, which are
unlikely to block if the system entropy pool is low. The exception to this is
normalization of use of Bytes#random to fill byte arrays with randomness.
Because Bytes#random may be used to generate key material it must be backed by
SecureRandom.
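The distinction above can be sketched as follows. This is a minimal
illustration, not the code changed by this commit: the class name
RandomReuseSketch and its method names are hypothetical, and only JDK
classes are used.

```java
import java.security.SecureRandom;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch class; not part of HBase.
public class RandomReuseSketch {
  // Discouraged: a new Random is allocated, used once, and left for GC.
  static int pickIndexThrowaway(int bound) {
    return new Random().nextInt(bound);
  }

  // Preferred when no explicit seed is required: reuse the per-thread generator.
  static int pickIndex(int bound) {
    return ThreadLocalRandom.current().nextInt(bound);
  }

  // One shared SecureRandom for bytes that may end up as key material.
  private static final SecureRandom SECURE = new SecureRandom();

  // Fill a new array of the given length with cryptographically strong bytes.
  static byte[] secureBytes(int length) {
    byte[] out = new byte[length];
    SECURE.nextBytes(out);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(pickIndex(10));
    System.out.println(secureBytes(16).length);
  }
}
```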
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Add the following stats for a given table:
- 7. Total size of serialized cells of each CF.
- 8. Total size of serialized cells of each qualifier.
- 9. Total size of serialized cells across all rows.
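One way to picture how the three totals relate is the small accumulator
below. It is only a sketch under stated assumptions: the class
CellSizeStatsSketch, its fields, and the add() signature are hypothetical,
and the caller is assumed to supply each cell's serialized size.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical accumulator; not the tool's actual implementation.
public class CellSizeStatsSketch {
  final Map<String, Long> bytesPerFamily = new HashMap<>();
  final Map<String, Long> bytesPerQualifier = new HashMap<>();
  long totalBytes;

  // Record one cell; serializedSize is assumed to be supplied by the caller.
  void add(String family, String qualifier, long serializedSize) {
    bytesPerFamily.merge(family, serializedSize, Long::sum);
    // Qualifiers are keyed as family:qualifier so names do not collide across CFs.
    bytesPerQualifier.merge(family + ":" + qualifier, serializedSize, Long::sum);
    totalBytes += serializedSize;
  }

  public static void main(String[] args) {
    CellSizeStatsSketch stats = new CellSizeStatsSketch();
    stats.add("cf1", "a", 100L);
    stats.add("cf1", "b", 50L);
    stats.add("cf2", "a", 25L);
    System.out.println(stats.bytesPerFamily.get("cf1")); // prints 150
    System.out.println(stats.totalBytes);                // prints 175
  }
}
```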
Signed-off-by: Viraj Jasani <vjasani@apache.org>
We get and retain Compressor instances in HFileBlockDefaultEncodingContext,
and could in theory call Compressor#reinit when setting up the context,
to update compression parameters like level and buffer size, but we do
not plumb through the CompoundConfiguration from the Store into the
encoding context. As a consequence we can only update codec parameters
globally in system site conf files.
Fine grained configurability is important for algorithms like ZStandard
(ZSTD), which offers more than 20 compression levels. At level 1 it is
almost as fast as LZ4; at higher levels it uses computationally expensive
techniques to rival LZMA at compression ratio, at the cost of significantly
reduced compression throughput. The ZSTD level that should be set for a
given column family or table will vary by use case.
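The kind of layered lookup this implies can be sketched as below. This is
an illustration only, not HBase's CompoundConfiguration: the class
LayeredCodecConfSketch, the resolveInt helper, and the key name
"example.zstd.level" are all hypothetical, and plain maps stand in for the
family, table, and site configuration layers.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of most-specific-first configuration lookup.
public class LayeredCodecConfSketch {
  // Hypothetical key name, used only for this sketch.
  static final String ZSTD_LEVEL_KEY = "example.zstd.level";

  // Return the first value found, searching the layers in the order given.
  static int resolveInt(String key, int defaultValue, List<Map<String, String>> layers) {
    for (Map<String, String> layer : layers) {
      String value = layer.get(key);
      if (value != null) {
        return Integer.parseInt(value);
      }
    }
    return defaultValue;
  }

  public static void main(String[] args) {
    Map<String, String> site = Map.of(ZSTD_LEVEL_KEY, "3");
    Map<String, String> table = Map.of();
    Map<String, String> family = Map.of(ZSTD_LEVEL_KEY, "12");
    // The column family setting wins over the site-wide one.
    int level = resolveInt(ZSTD_LEVEL_KEY, 1, List.of(family, table, site));
    System.out.println(level); // prints 12
  }
}
```

With only the site layer populated the same call would return 3, which is
the "global only" behavior described above.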
Signed-off-by: Viraj Jasani <vjasani@apache.org>
This change introduces provided compression codecs to HBase as
new Maven modules. Each module provides compression codec support
that formerly required Hadoop native codecs, which in turn rely
on native code integration, which may or may not be available on
a given hardware platform or in an operational environment. We
now provide codecs in the HBase distribution for users who for
whatever reason cannot or do not wish to deploy the Hadoop native
codecs.
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
HBase 2 moved Scans over to use PREAD by default, instead of STREAM as in
HBase 1. In the context of a MapReduce job, we can generally expect that
clients using the InputFormat (batch job) would be reading most of the
data for a job. Cater to them, but still give users who want PREAD the
ability to use it.
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Tak Lon (Stephen) Wu <taklwu@apache.org>