HBASE-24638 Edit doc on (offheap) memory management (#1978)
This commit is contained in:
parent
c0461207ee
commit
0197438564
|
@ -1,4 +1,4 @@
|
||||||
/**
|
/*
|
||||||
* Licensed to the Apache Software Foundation (ASF) under one
|
* Licensed to the Apache Software Foundation (ASF) under one
|
||||||
* or more contributor license agreements. See the NOTICE file
|
* or more contributor license agreements. See the NOTICE file
|
||||||
* distributed with this work for additional information
|
* distributed with this work for additional information
|
||||||
|
@ -40,33 +40,29 @@ import org.apache.hbase.thirdparty.com.google.common.annotations.VisibleForTesti
|
||||||
import org.apache.hbase.thirdparty.com.google.common.collect.Sets;
|
import org.apache.hbase.thirdparty.com.google.common.collect.Sets;
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* ByteBuffAllocator is used for allocating/freeing the ByteBuffers from/to NIO ByteBuffer pool, and
|
* ByteBuffAllocator is a nio ByteBuffer pool.
|
||||||
* it provide high-level interfaces for upstream. when allocating desired memory size, it will
|
* It returns {@link ByteBuff}s which are wrappers of offheap {@link ByteBuffer} usually. If we are
|
||||||
* return {@link ByteBuff}, if we are sure that those ByteBuffers have reached the end of life
|
* sure that the returned ByteBuffs have reached the end of their life cycle, we must call
|
||||||
* cycle, we must do the {@link ByteBuff#release()} to return back the buffers to the pool,
|
* {@link ByteBuff#release()} to return buffers to the pool otherwise the pool will leak. If the
|
||||||
* otherwise ByteBuffers leak will happen, and the NIO ByteBuffer pool may be exhausted. there's
|
* desired memory size is larger than what the ByteBufferPool has available, we'll downgrade to
|
||||||
* possible that the desired memory size is large than ByteBufferPool has, we'll downgrade to
|
* allocate ByteBuffers from the heap. Increase the ByteBufferPool size if you detect this case.<br/>
|
||||||
* allocate ByteBuffers from heap which meaning the GC pressure may increase again. Of course, an
|
|
||||||
* better way is increasing the ByteBufferPool size if we detected this case. <br/>
|
|
||||||
* <br/>
|
* <br/>
|
||||||
* On the other hand, for better memory utilization, we have set an lower bound named
|
* For better memory/pool utilization, there is a lower bound named
|
||||||
* minSizeForReservoirUse in this allocator, and if the desired size is less than
|
* <code>minSizeForReservoirUse</code> in this allocator, and if the desired size is less than
|
||||||
* minSizeForReservoirUse, the allocator will just allocate the ByteBuffer from heap and let the JVM
|
* <code>minSizeForReservoirUse</code>, the allocator will just allocate the ByteBuffer from heap
|
||||||
* free its memory, because it's too wasting to allocate a single fixed-size ByteBuffer for some
|
* and let the JVM manage memory, because it is better not to waste pool slots allocating a single
|
||||||
* small objects. <br/>
|
* fixed-size ByteBuffer for a small object.<br/>
|
||||||
* <br/>
|
* <br/>
|
||||||
* We recommend to use this class to allocate/free {@link ByteBuff} in the RPC layer or the entire
|
* This pool can be used anywhere it makes sense managing memory. Currently used at least by RPC.
|
||||||
* read/write path, because it hide the details of memory management and its APIs are more friendly
|
|
||||||
* to the upper layer.
|
|
||||||
*/
|
*/
|
||||||
@InterfaceAudience.Private
|
@InterfaceAudience.Private
|
||||||
public class ByteBuffAllocator {
|
public class ByteBuffAllocator {
|
||||||
|
|
||||||
private static final Logger LOG = LoggerFactory.getLogger(ByteBuffAllocator.class);
|
private static final Logger LOG = LoggerFactory.getLogger(ByteBuffAllocator.class);
|
||||||
|
|
||||||
// The on-heap allocator is mostly used for testing, but also some non-test usage, such as
|
// The on-heap allocator is mostly used for testing but also has some non-test usage such as
|
||||||
// scanning snapshot, we won't have an RpcServer to initialize the allocator, so just use the
|
// for scanning snapshot. This implementation will just allocate ByteBuffers from heap but
|
||||||
// default heap allocator, it will just allocate ByteBuffers from heap but wrapped by an ByteBuff.
|
// wrapped by ByteBuff.
|
||||||
public static final ByteBuffAllocator HEAP = ByteBuffAllocator.createOnHeap();
|
public static final ByteBuffAllocator HEAP = ByteBuffAllocator.createOnHeap();
|
||||||
|
|
||||||
public static final String ALLOCATOR_POOL_ENABLED_KEY = "hbase.server.allocator.pool.enabled";
|
public static final String ALLOCATOR_POOL_ENABLED_KEY = "hbase.server.allocator.pool.enabled";
|
||||||
|
|
|
@ -30,76 +30,81 @@
|
||||||
[[regionserver.offheap.overview]]
|
[[regionserver.offheap.overview]]
|
||||||
== Overview
|
== Overview
|
||||||
|
|
||||||
For reducing the Java GC impact to P99/P999 RPC latency, HBase 2.x has made the offheap read and write path. The cells are
|
To help reduce P99/P999 RPC latencies, HBase 2.x has made the read and write path use a pool of offheap buffers. Cells are
|
||||||
allocated from JVM offheap memory area, which won’t be garbage collected by JVM and need to be deallocated explicitly by
|
allocated in offheap memory outside of the purview of the JVM garbage collector with attendant reduction in GC pressure.
|
||||||
upstream callers. In the write path, the request packet received from client will be allocated offheap and retained
|
In the write path, the request packet received from client will be read in on a pre-allocated offheap buffer and retained
|
||||||
until those cells are successfully written to the WAL and Memstore. The memory data structure in Memstore does
|
offheap until those cells are successfully persisted to the WAL and Memstore. The memory data structure in Memstore does
|
||||||
not directly store the cell memory, but reference to cells which are encoded in multiple chunks in MSLAB, this is easier
|
not directly store the cell memory, but references the cells encoded in the offheap buffers. Similarly for the read path.
|
||||||
to manage the offheap memory. Similarly, in the read path, we’ll try to read the cache firstly, if the cache
|
We’ll try to read the block cache first and if the cache misses, we'll go to the HFile and read the respective block. The
|
||||||
misses, go to the HFile and read the corresponding block. The workflow: from reading blocks to sending cells to
|
workflow from reading blocks to sending cells to client does its best to avoid on-heap memory allocations reducing the
|
||||||
client, it's basically not involved in on-heap memory allocations.
|
amount of work the GC has to do.
|
||||||
|
|
||||||
image::offheap-overview.png[]
|
image::offheap-overview.png[]
|
||||||
|
|
||||||
|
For how to address the single mention of onheap in the read section of the diagram above, see <<regionserver.read.hdfs.block.offheap>>.
|
||||||
|
|
||||||
[[regionserver.offheap.readpath]]
|
[[regionserver.offheap.readpath]]
|
||||||
== Offheap read-path
|
== Offheap read-path
|
||||||
In HBase-2.0.0, link:https://issues.apache.org/jira/browse/HBASE-11425[HBASE-11425] changed the HBase read path so it
|
In HBase-2.0.0, link:https://issues.apache.org/jira/browse/HBASE-11425[HBASE-11425] changed the HBase read path so it
|
||||||
could hold the read-data off-heap (from BucketCache) avoiding copying of cached data on to the java heap.
|
could hold the read-data off-heap avoiding copying of cached data (BlockCache) on to the java heap (for uncached data,
|
||||||
This reduces GC pauses given there is less garbage made and so less to clear. The off-heap read path can have a performance
|
see note under the diagram in the section above). This reduces GC pauses given there is less garbage made and so less
|
||||||
that is similar or better to that of the on-heap LRU cache. This feature is available since HBase 2.0.0.
|
to clear. The off-heap read path can have a performance that is similar or better to that of the on-heap LRU cache.
|
||||||
Refer to below blogs for more details and test results on off heaped read path
|
This feature has been available since HBase 2.0.0. Refer to the blogs below for more details and test results on the off-heaped read path
|
||||||
link:https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in[Offheaping the Read Path in Apache HBase: Part 1 of 2]
|
link:https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in[Offheaping the Read Path in Apache HBase: Part 1 of 2]
|
||||||
and link:https://blogs.apache.org/hbase/entry/offheap-read-path-in-production[Offheap Read-Path in Production - The Alibaba story]
|
and link:https://blogs.apache.org/hbase/entry/offheap-read-path-in-production[Offheap Read-Path in Production - The Alibaba story]
|
||||||
|
|
||||||
For an end-to-end off-heaped read-path, all you have to do is enable an off-heap backed <<offheap.blockcache>>(BC).
|
For an end-to-end off-heaped read-path, all you have to do is enable an off-heap backed <<offheap.blockcache>>(BC).
|
||||||
Configure _hbase.bucketcache.ioengine_ to be _offheap_ in _hbase-site.xml_ (See <<bc.deploy.modes>> to learn more about _hbase.bucketcache.ioengine_ options).
|
To do this, configure _hbase.bucketcache.ioengine_ to be _offheap_ in _hbase-site.xml_ (See <<bc.deploy.modes>> to learn
|
||||||
Also specify the total capacity of the BC using `hbase.bucketcache.size` config. Please remember to adjust value of 'HBASE_OFFHEAPSIZE' in
|
more about _hbase.bucketcache.ioengine_ options). Also specify the total capacity of the BC using `hbase.bucketcache.size`.
|
||||||
_hbase-env.sh_ (See <<bc.example>> for help sizing and an example enabling). This configuration is for specifying the maximum
|
Please remember to adjust value of 'HBASE_OFFHEAPSIZE' in _hbase-env.sh_ (See <<bc.example>> for help sizing and an example
|
||||||
possible off-heap memory allocation for the RegionServer java process. This should be bigger than the off-heap BC size
|
enabling). This configuration is for specifying the maximum possible off-heap memory allocation for the RegionServer java
|
||||||
to accommodate usage by other features making use of off-heap memory such as Server RPC buffer pool and short-circuit
|
process. This should be bigger than the off-heap BC size to accommodate usage by other features making use of off-heap memory
|
||||||
reads (See discussion in <<bc.example>>).
|
such as Server RPC buffer pool and short-circuit reads (See discussion in <<bc.example>>).
|
||||||
|
|
||||||
Please keep in mind that there is no default for `hbase.bucketcache.ioengine`
|
Please keep in mind that there is no default for `hbase.bucketcache.ioengine` which means the `BlockCache` is OFF by default
|
||||||
which means the BC is OFF by default (See <<direct.memory>>).
|
(See <<direct.memory>>).
|
||||||
|
|
||||||
This is all you need to do to enable off-heap read path. Most buffers in HBase are already off-heap. With BC off-heap,
|
This is all you need to do to enable off-heap read path. Most buffers in HBase are already off-heap. With BC off-heap,
|
||||||
the read pipeline will copy data between HDFS and the server socket send of the results back to the client.
|
the read pipeline will copy data between HDFS and the server socket -- caveat <<hbase.ipc.server.reservoir.initial.max>> --
|
||||||
|
sending results back to the client.
|
||||||
|
|
||||||
[[regionserver.offheap.rpc.bb.tuning]]
|
[[regionserver.offheap.rpc.bb.tuning]]
|
||||||
===== Tuning the RPC buffer pool
|
===== Tuning the RPC buffer pool
|
||||||
It is possible to tune the ByteBuffer pool on the RPC server side
|
It is possible to tune the ByteBuffer pool on the RPC server side used to accumulate the cell bytes and create result
|
||||||
used to accumulate the cell bytes and create result cell blocks to send back to the client side.
|
cell blocks to send back to the client side. Use `hbase.ipc.server.reservoir.enabled` to turn this pool ON or OFF. By
|
||||||
`hbase.ipc.server.reservoir.enabled` can be used to turn this pool ON or OFF. By default this pool is ON and available. HBase will create off-heap ByteBuffers
|
default this pool is ON and available. HBase will create off-heap ByteBuffers and pool them by default. Please
|
||||||
and pool them them by default. Please make sure not to turn this OFF if you want end-to-end off-heaping in read path.
|
make sure not to turn this OFF if you want end-to-end off-heaping in read path.
|
||||||
|
|
||||||
NOTE: the config keys which start with prefix `hbase.ipc.server.reservoir` are deprecated in HBase3.x. If you are still
|
If this pool is turned off, the server will create temp buffers onheap to accumulate the cell bytes and
|
||||||
in HBase2.x, then just use the old config keys. otherwise if in HBase3.x, please use the new config keys.
|
make a result cell block. This can impact the GC on a highly read loaded server.
|
||||||
|
|
||||||
|
NOTE: the config keys which start with prefix `hbase.ipc.server.reservoir` are deprecated in hbase-3.x (the
|
||||||
|
internal pool implementation changed). If you are still in hbase-2.2.x or older, then just use the old config
|
||||||
|
keys. Otherwise if in hbase-3.x or hbase-2.3.x+, please use the new config keys
|
||||||
(See <<regionserver.read.hdfs.block.offheap,deprecated and new configs in HBase3.x>>)
|
(See <<regionserver.read.hdfs.block.offheap,deprecated and new configs in HBase3.x>>)
|
||||||
|
|
||||||
If this pool is turned off, the server will create temp buffers on heap to accumulate the cell bytes and
|
Next thing to tune is the ByteBuffer pool on the RPC server side. The user can tune this pool with respect to how
|
||||||
make a result cell block. This can impact the GC on a highly read loaded server.
|
many buffers are in the pool and what should be the size of each ByteBuffer. Use the config
|
||||||
Next thing to tune is the ByteBuffer pool on the RPC server side:
|
`hbase.ipc.server.reservoir.initial.buffer.size` to tune each of the buffer sizes. Default is 64KB for hbase-2.2.x
|
||||||
|
and earlier; it changed to 65KB by default in hbase-2.3.x+
|
||||||
The user can tune this pool with respect to how many buffers are in the pool and what should be the size of each ByteBuffer.
|
|
||||||
Use the config `hbase.ipc.server.reservoir.initial.buffer.size` to tune each of the buffer sizes. Default is 64 KB for HBase2.x, while it will be changed to 65KB by default for HBase3.x
|
|
||||||
(see link:https://issues.apache.org/jira/browse/HBASE-22532[HBASE-22532])
|
(see link:https://issues.apache.org/jira/browse/HBASE-22532[HBASE-22532])
|
||||||
|
|
||||||
When the result size is larger than one ByteBuffer size, the server will try to grab more than one ByteBuffer and make a result cell block out of these.
|
When the result size is larger than one 64KB (Default) ByteBuffer size, the server will try to grab more than one
|
||||||
When the pool is running out of buffers, the server will end up creating temporary on-heap buffers.
|
ByteBuffer and make a result cell block out of a collection of fixed-sized ByteBuffers. When the pool is running
|
||||||
|
out of buffers, the server will skip the pool and create temporary on-heap buffers.
|
||||||
|
|
||||||
The maximum number of ByteBuffers in the pool can be tuned using the config `hbase.ipc.server.reservoir.initial.max`.
|
The maximum number of ByteBuffers in the pool can be tuned using the config `hbase.ipc.server.reservoir.initial.max`.
|
||||||
Its value defaults to 64 * region server handlers configured (See the config `hbase.regionserver.handler.count`). The
|
Its default is a factor of region server handlers count (See the config `hbase.regionserver.handler.count`). The
|
||||||
math is such that by default we consider 2 MB as the result cell block size per read result and each handler will be
|
math is such that by default we consider 2 MB as the result cell block size per read result and each handler will be
|
||||||
handling a read. For 2 MB size, we need 32 buffers each of size 64 KB (See default buffer size in pool). So per handler
|
handling a read. For 2 MB size, we need 32 buffers each of size 64 KB (See default buffer size in pool). So per handler
|
||||||
32 ByteBuffers(BB). We allocate twice this size as the max BBs count such that one handler can be creating the response
|
32 ByteBuffers(BB). We allocate twice this size as the max BBs count such that one handler can be creating the response
|
||||||
and handing it to the RPC Responder thread and then handling a new request creating a new response cell block (using
|
and handing it to the RPC Responder thread and then handling a new request creating a new response cell block (using
|
||||||
pooled buffers). Even if the responder could not send back the first TCP reply immediately, our count should allow that
|
pooled buffers). Even if the responder could not send back the first TCP reply immediately, our count should allow that
|
||||||
we should still have enough buffers in our pool without having to make temporary buffers on the heap. Again for smaller
|
we should still have enough buffers in our pool without having to make temporary buffers on the heap. Again for smaller
|
||||||
sized random row reads, tune this max count. There are lazily created buffers and the count is the max count to be pooled.
|
sized random row reads, tune this max count. These are lazily created buffers and the count is the max count to be pooled.
|
||||||
|
|
||||||
If you still see GC issues even after making end-to-end read path off-heap, look for issues in the appropriate buffer
|
If you still see GC issues even after making end-to-end read path off-heap, look for issues in the appropriate buffer
|
||||||
pool. Check the below RegionServer log with INFO level in HBase2.x:
|
pool. Check for the below RegionServer log line at INFO level in HBase2.x:
|
||||||
|
|
||||||
[source]
|
[source]
|
||||||
----
|
----
|
||||||
|
@ -113,105 +118,114 @@ Or the following log message in HBase3.x:
|
||||||
Pool already reached its max capacity : XXX and no free buffers now. Consider increasing the value for 'hbase.server.allocator.max.buffer.count' ?
|
Pool already reached its max capacity : XXX and no free buffers now. Consider increasing the value for 'hbase.server.allocator.max.buffer.count' ?
|
||||||
----
|
----
|
||||||
|
|
||||||
The setting for _HBASE_OFFHEAPSIZE_ in _hbase-env.sh_ should consider this off heap buffer pool at the RPC side also.
|
[[hbase.offheapsize]]
|
||||||
|
The setting for _HBASE_OFFHEAPSIZE_ in _hbase-env.sh_ should consider this off heap buffer pool on the server side also.
|
||||||
We need to config this max off heap size for the RegionServer as a bit higher than the sum of this max pool size and
|
We need to config this max off heap size for the RegionServer as a bit higher than the sum of this max pool size and
|
||||||
the off heap cache size. The TCP layer will also need to create direct bytebuffers for TCP communication. Also the DFS
|
the off heap cache size. The TCP layer will also need to create direct bytebuffers for TCP communication. Also the DFS
|
||||||
client will need some off-heap to do its workings especially if short-circuit reads are configured. Allocating an extra
|
client will need some off-heap to do its workings especially if short-circuit reads are configured. Allocating an extra
|
||||||
of 1 - 2 GB for the max direct memory size has worked in tests.
|
1 - 2 GB for the max direct memory size has worked in tests.
|
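The sizing advice above can be worked through as simple arithmetic. The following is a minimal sketch, not HBase code: the class and method names are hypothetical, the 8 GB BucketCache size is an assumed example value, and the pool figures use the hbase-2.2.x defaults described above (64 buffers per handler, 30 handlers, 64 KB per buffer).

```java
// Hypothetical helper: estimates a value for HBASE_OFFHEAPSIZE (in MB) as
// offheap cache size + max RPC buffer pool size + extra headroom.
final class OffheapSizeSketch {
    static final long MB = 1024L * 1024L;

    static long suggestedOffheapMb(long bucketCacheMb, int maxBufferCount,
                                   int bufferSizeBytes, long headroomMb) {
        long poolBytes = (long) maxBufferCount * bufferSizeBytes;
        long poolMb = (poolBytes + MB - 1) / MB; // round up to a whole MB
        return bucketCacheMb + poolMb + headroomMb;
    }

    public static void main(String[] args) {
        int handlers = 30;                 // hbase.regionserver.handler.count default
        int maxBuffers = 64 * handlers;    // reservoir default described in the text
        int bufferSize = 64 * 1024;        // 64 KB buffers (hbase-2.2.x and earlier)
        long bucketCacheMb = 8 * 1024;     // assumed example: 8 GB offheap BucketCache
        long headroomMb = 2 * 1024;        // the "1 - 2 GB" extra mentioned in the text
        System.out.println(
            suggestedOffheapMb(bucketCacheMb, maxBuffers, bufferSize, headroomMb) + " MB");
    }
}
```

With these assumed inputs the pool itself accounts for 120 MB; the cache size and the headroom dominate the total.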
||||||
|
|
||||||
If you are using co processors and refer the Cells in the read results, DO NOT store reference to these Cells out of
|
If you are using coprocessors and refer to the Cells in the read results, DO NOT store references to these Cells outside
|
||||||
the scope of the CP hook methods. Some times the CPs need store info about the cell (Like its row key) for considering
|
the scope of the CP hook methods. Sometimes the CPs want to store info about the cell (like its row key) for considering
|
||||||
in the next CP hook call etc. For such cases, pls clone the required fields of the entire Cell as per the use cases.
|
in the next CP hook call etc. For such cases, please clone the required fields or the entire Cell as per the use case.
|
||||||
[ See CellUtil#cloneXXX(Cell) APIs ]
|
[ See CellUtil#cloneXXX(Cell) APIs ]
|
||||||
|
|
||||||
[[regionserver.read.hdfs.block.offheap]]
|
[[regionserver.read.hdfs.block.offheap]]
|
||||||
== Read block from HDFS to offheap directly
|
== Read block from HDFS to offheap directly
|
||||||
|
|
||||||
In HBase-2.x, the RegionServer will still read block from HDFS to a temporary heap ByteBuffer and then flush to BucketCache's
|
In HBase-2.x, the RegionServer will read blocks from HDFS to a temporary onheap ByteBuffer and then flush to
|
||||||
IOEngine asynchronously, finally it will be an offheap one. We can still observe much GC pressure when cache hit ratio
|
the BucketCache. Even if the BucketCache is offheap, we will first pull the HDFS read onheap before writing
|
||||||
is not very high (such as cacheHitRatio ~ 60% ), so in link:https://issues.apache.org/jira/browse/HBASE-21879[HBASE-21879]
|
it out to the offheap BucketCache. We can observe much GC pressure when the cache hit ratio is low (e.g. a cacheHitRatio ~ 60% ).
|
||||||
we redesigned the read path and made the HDFS block reading be offheap now. This feature will be available in HBASE-3.0.0.
|
link:https://issues.apache.org/jira/browse/HBASE-21879[HBASE-21879] addresses this issue (Requires hbase-2.3.x/hbase-3.x).
|
||||||
|
It depends on a supporting HDFS being in place (hadoop-2.10.x or hadoop-3.3.x) and it may require patching
|
||||||
|
HBase itself (as of this writing); see
|
||||||
|
link:https://issues.apache.org/jira/browse/HBASE-21879[HBASE-21879 Read HFile's block to ByteBuffer directly instead of to byte for reducing young gc purpose].
|
||||||
|
When appropriately set up, reads from HDFS can go into offheap buffers that are passed offheap to the offheap BlockCache for caching.
|
||||||
|
|
||||||
For more details about the design and performance improvement, please see the link:https://docs.google.com/document/d/1xSy9axGxafoH-Qc17zbD2Bd--rWjjI00xTWQZ8ZwI_E/edit?usp=sharing[document].
|
For more details about the design and performance improvement, please see the
|
||||||
Here we will share some best practice about the performance tuning:
|
link:https://docs.google.com/document/d/1xSy9axGxafoH-Qc17zbD2Bd--rWjjI00xTWQZ8ZwI_E[Design Doc -Read HFile's block to Offheap].
|
||||||
|
|
||||||
Firstly, we introduced several configurations about the ByteBuffAllocator (which was abstracted to manage the memory application or release):
|
Here we will share some best practice about the performance tuning but first we introduce new (hbase-3.x/hbase-2.3.x) configuration names
|
||||||
|
that go with the new internal pool implementation (`ByteBuffAllocator` vs the old `ByteBufferPool`), some of which mimic now deprecated
|
||||||
|
hbase-2.2.x configurations discussed above in the <<regionserver.offheap.rpc.bb.tuning>>. Much of the advice here overlaps that given above
|
||||||
|
in the <<regionserver.offheap.rpc.bb.tuning>> since the implementations have similar configurations.
|
||||||
|
|
||||||
1. `hbase.server.allocator.pool.enabled`: means whether the region server will use the pooled offheap ByteBuffer allocator. Its default
|
1. `hbase.server.allocator.pool.enabled` is for whether the RegionServer will use the pooled offheap ByteBuffer allocator. Default
|
||||||
value is true. In HBase2.x, we still use the deprecated `hbase.ipc.server.reservoir.enabled` config while we'll use the new
|
value is true. In hbase-2.x, the deprecated `hbase.ipc.server.reservoir.enabled` did similar and is mapped to this config
|
||||||
one in HBase3.x.
|
until support for the old configuration is removed. This new name will be used in hbase-3.x and hbase-2.3.x+.
|
||||||
2. `hbase.server.allocator.minimal.allocate.size`: If the desired byte size is not less than this one, then it will
|
2. `hbase.server.allocator.minimal.allocate.size` is the threshold at which we start allocating from the pool. Otherwise the
|
||||||
be allocated as a pooled offheap ByteBuff, otherwise it will be allocated from heap directly because it
|
request will be allocated from onheap directly because it would be wasteful allocating small stuff from our pool of fixed-size
|
||||||
is too wasting to allocate from pool with fixed-size ByteBuffers, default value is `hbase.server.allocator.buffer.size/6`.
|
ByteBuffers. The default minimum is `hbase.server.allocator.buffer.size/6`.
|
||||||
3. `hbase.server.allocator.max.buffer.count`: The ByteBuffAllocator will have many fixed-size ByteBuffers inside which
|
3. `hbase.server.allocator.max.buffer.count`: The `ByteBuffAllocator`, the new pool/reservoir implementation, has fixed-size
|
||||||
are composited as a pool, this config indicate how many buffers are there in the pool. Its default value will be 2MB * 2 * hbase.regionserver.handler.count / 65KB,
|
ByteBuffers. This config is for how many buffers to pool. Its default value is 2MB * 2 * hbase.regionserver.handler.count / 65KB
|
||||||
the default hbase.regionserver.handler.count is 30, then its value will be 1890.
|
(similar to the discussion above in <<regionserver.offheap.rpc.bb.tuning>>). If the default `hbase.regionserver.handler.count` is 30, then the default will be 1890.
|
||||||
4. `hbase.server.allocator.buffer.size`: The byte size of each ByteBuffer, default value is 66560 (65KB), here we choose 65KB instead of 64KB
|
4. `hbase.server.allocator.buffer.size`: The byte size of each ByteBuffer. The default value is 66560 (65KB), here we choose 65KB instead of 64KB
|
||||||
because of link:https://issues.apache.org/jira/browse/HBASE-22532[HBASE-22532].
|
because of link:https://issues.apache.org/jira/browse/HBASE-22532[HBASE-22532].
|
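The default values quoted in the list above follow from two small formulas. The sketch below simply evaluates them as written in this text; the class and method names are hypothetical, and the real `ByteBuffAllocator` may round intermediate values differently.

```java
// Hypothetical helper that evaluates the default formulas as stated in the text.
final class AllocatorDefaultsSketch {
    static final long TWO_MB = 2L * 1024 * 1024;

    // hbase.server.allocator.max.buffer.count = 2MB * 2 * handler count / buffer size
    static long maxBufferCount(int handlerCount, int bufferSizeBytes) {
        return TWO_MB * 2 * handlerCount / bufferSizeBytes;
    }

    // hbase.server.allocator.minimal.allocate.size = buffer size / 6
    static int minimalAllocateSize(int bufferSizeBytes) {
        return bufferSizeBytes / 6;
    }

    public static void main(String[] args) {
        int handlers = 30;       // default hbase.regionserver.handler.count
        int bufferSize = 66560;  // default hbase.server.allocator.buffer.size (65KB)
        System.out.println("max.buffer.count      = " + maxBufferCount(handlers, bufferSize));
        System.out.println("minimal.allocate.size = " + minimalAllocateSize(bufferSize));
    }
}
```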
||||||
|
|
||||||
The three config keys: `hbase.ipc.server.reservoir.enabled`, `hbase.ipc.server.reservoir.initial.buffer.size` and `hbase.ipc.server.reservoir.initial.max` are introduced in HBase2.x. while in HBase3.x
|
The three config keys -- `hbase.ipc.server.reservoir.enabled`, `hbase.ipc.server.reservoir.initial.buffer.size` and `hbase.ipc.server.reservoir.initial.max` -- introduced in hbase-2.x
|
||||||
they are deprecated now, instead please use the new config keys: `hbase.server.allocator.pool.enabled`, `hbase.server.allocator.buffer.size` and `hbase.server.allocator.max.buffer.count`.
|
have been renamed and deprecated in hbase-3.x/hbase-2.3.x. Please use the new config keys instead:
|
||||||
|
`hbase.server.allocator.pool.enabled`, `hbase.server.allocator.buffer.size` and `hbase.server.allocator.max.buffer.count`.
|
||||||
If you still use the deprecated three config keys in HBase3.0.0, you will get a WARN log message like:
|
If you still use the deprecated three config keys in hbase-3.x, you will get a WARN log message like:
|
||||||
|
|
||||||
[source]
|
[source]
|
||||||
----
|
----
|
||||||
The config keys hbase.ipc.server.reservoir.initial.buffer.size and hbase.ipc.server.reservoir.initial.max are deprecated now, instead please use hbase.server.allocator.buffer.size and hbase.server.allocator.max.buffer.count. In future release we will remove the two deprecated configs.
|
The config keys hbase.ipc.server.reservoir.initial.buffer.size and hbase.ipc.server.reservoir.initial.max are deprecated now, instead please use hbase.server.allocator.buffer.size and hbase.server.allocator.max.buffer.count. In future release we will remove the two deprecated configs.
|
||||||
----
|
----
|
||||||
|
|
||||||
Second, we have some suggestions about the performance:
|
Next, we have some suggestions regarding performance.
|
||||||
|
|
||||||
.Please make sure that there are enough pooled DirectByteBuffer in your ByteBuffAllocator.
|
.Please make sure that there are enough pooled DirectByteBuffer in your ByteBuffAllocator.
|
||||||
|
|
||||||
The ByteBuffAllocator will allocate ByteBuffer from DirectByteBuffer pool firstly, if there’s no available ByteBuffer
|
The ByteBuffAllocator will allocate ByteBuffer from the DirectByteBuffer pool first. If
|
||||||
from the pool, then it will just allocate the ByteBuffers from heap, then the GC pressures will increase again.
|
there’s no available ByteBuffer in the pool, then we will allocate the ByteBuffers from onheap.
|
||||||
|
By default, we will pre-allocate 4MB for each RPC handler (The handler count is determined by the config:
|
||||||
By default, we will pre-allocate 4MB for each RPC handlers ( The handler count is determined by the config:
|
|
||||||
`hbase.regionserver.handler.count`, it has the default value 30) . That’s to say, if your `hbase.server.allocator.buffer.size`
|
`hbase.regionserver.handler.count`, it has the default value 30) . That’s to say, if your `hbase.server.allocator.buffer.size`
|
||||||
is 65KB, then your pool will have 2MB * 2 / 65KB * 30 = 945 DirectByteBuffer. If you have some large scan and have a big caching,
|
is 65KB, then your pool will have 2MB * 2 / 65KB * 30 = 1890 DirectByteBuffer. If you have a large scan and a big cache,
|
||||||
say you may have a rpc response whose bytes size is greater than 2MB (another 2MB for receiving rpc request), then it will
|
you may have an RPC response whose byte size is greater than 2MB (another 2MB for receiving the RPC request), then it will
|
||||||
be better to increase the `hbase.server.allocator.max.buffer.count`.
|
be better to increase the `hbase.server.allocator.max.buffer.count`.
|
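The pool-first, heap-fallback behaviour described above can be illustrated with a toy pool. This is a simplified sketch and not the `ByteBuffAllocator` implementation: the class `ToyBufferPool` and its fields are hypothetical, and unlike the real allocator it sends any request larger than one buffer straight to the heap instead of composing several pooled buffers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Toy illustration only: fixed-size direct buffers are pooled and created lazily;
// small requests and requests made while the pool is exhausted fall back to heap.
final class ToyBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int minSizeForPoolUse;
    private final int maxBufferCount;
    private int created;      // direct buffers created so far
    long poolAllocations;     // requests served from the pool
    long heapAllocations;     // requests served from the heap

    ToyBufferPool(int bufferSize, int minSizeForPoolUse, int maxBufferCount) {
        this.bufferSize = bufferSize;
        this.minSizeForPoolUse = minSizeForPoolUse;
        this.maxBufferCount = maxBufferCount;
    }

    synchronized ByteBuffer allocate(int size) {
        if (size >= minSizeForPoolUse && size <= bufferSize) {
            ByteBuffer b = free.poll();
            if (b == null && created < maxBufferCount) {
                b = ByteBuffer.allocateDirect(bufferSize); // lazily grow up to the max
                created++;
            }
            if (b != null) {
                b.clear();
                b.limit(size);
                poolAllocations++;
                return b;
            }
        }
        // Too small, too large for this toy, or pool exhausted: let the GC manage it.
        heapAllocations++;
        return ByteBuffer.allocate(size);
    }

    synchronized void release(ByteBuffer b) {
        if (b.isDirect()) {
            free.offer(b); // only pooled (direct) buffers go back to the pool
        }
    }
}
```

A pool created with a max count of 1 serves the first large request offheap and the second from heap until the first buffer is released, which is the situation a larger `hbase.server.allocator.max.buffer.count` is meant to avoid.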
||||||
|
|
||||||
The RegionServer web UI also has the statistic about ByteBuffAllocator:
|
The RegionServer web UI has statistics on ByteBuffAllocator:
|
||||||
|
|
||||||
image::bytebuff-allocator-stats.png[]
|
image::bytebuff-allocator-stats.png[]
|
||||||
|
|
||||||
If the following condition meet, you may need to increase your max buffer.count:
|
If the following condition is met, you may need to increase your max buffer.count:
|
||||||
|
|
||||||
heapAllocationRatio >= hbase.server.allocator.minimal.allocate.size / hbase.server.allocator.buffer.size * 100%
|
heapAllocationRatio >= hbase.server.allocator.minimal.allocate.size / hbase.server.allocator.buffer.size * 100%
|
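The condition above can be checked mechanically. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes `heapAllocationRatio` is read from the web UI as a percentage.

```java
// Hypothetical check for the condition stated above:
// heapAllocationRatio >= minimal.allocate.size / buffer.size * 100%
final class HeapRatioCheck {
    static double thresholdPercent(int minimalAllocateSize, int bufferSize) {
        return 100.0 * minimalAllocateSize / bufferSize;
    }

    static boolean shouldIncreaseMaxBufferCount(double heapAllocationRatioPercent,
                                                int minimalAllocateSize, int bufferSize) {
        return heapAllocationRatioPercent >= thresholdPercent(minimalAllocateSize, bufferSize);
    }

    public static void main(String[] args) {
        int bufferSize = 66560;            // default hbase.server.allocator.buffer.size
        int minimalSize = bufferSize / 6;  // default hbase.server.allocator.minimal.allocate.size
        double observed = 20.0;            // assumed example reading from the web UI
        System.out.println("threshold % = " + thresholdPercent(minimalSize, bufferSize));
        System.out.println("increase max.buffer.count? "
            + shouldIncreaseMaxBufferCount(observed, minimalSize, bufferSize));
    }
}
```

With the defaults the threshold is about one sixth (roughly 16.7%), so a sustained heap allocation ratio above that suggests that requests big enough for the pool are being served from heap.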
||||||
|
|
||||||
.Please make sure the buffer size is greater than your block size.
|
.Please make sure the buffer size is greater than your block size.
|
||||||
|
|
||||||
We have the default block size=64KB, so almost all of the data block have a block size: 64KB + delta, whose delta is
|
We have the default block size of 64KB, so almost all of the data blocks will be 64KB + a small delta, where the delta is
|
||||||
very small, depends on the size of last KeyValue. If we use the default `hbase.server.allocator.buffer.size`=64KB,
|
very small, depending on the size of the last Cell. If we set `hbase.server.allocator.buffer.size`=64KB,
|
||||||
then each block will be allocated as two ByteBuffers: one 64KB DirectByteBuffer and one HeapByteBuffer with delta bytes,
|
then each block will be allocated as two ByteBuffers: one 64KB DirectByteBuffer and one HeapByteBuffer for the delta bytes.
|
||||||
the HeapByteBuffer will increase the GC pressure. Ideally, we should let the data block to be allocated as one ByteBuffer,
|
Ideally, the data block should be allocated as one ByteBuffer; it has a simpler data structure, faster access speed,
|
||||||
it has simpler data structure, faster access speed, less heap usage. On the other hand, If the blocks are composited by multiple ByteBuffers,
|
and less heap usage. Also, if the blocks are a composite of multiple ByteBuffers, to validate the checksum
|
||||||
so we have to validate the checksum by an temporary heap copying (see link:https://issues.apache.org/jira/browse/HBASE-21917[HBASE-21917]), while if it’s a single ByteBuffer,
|
we have to perform a temporary heap copy (see link:https://issues.apache.org/jira/browse/HBASE-21917[HBASE-21917])
|
||||||
we can speed the checksum by calling the hadoop' checksum in native lib, it's more faster.
|
whereas if it’s a single ByteBuffer we can speed up the checksum by calling Hadoop's native checksum library, which is faster.
|
||||||
|
|
||||||
Please also see: link:https://issues.apache.org/jira/browse/HBASE-22483[HBASE-22483]
|
Please also see: link:https://issues.apache.org/jira/browse/HBASE-22483[HBASE-22483]
|
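To make the sizing advice concrete, here is a simplified model of how a block of "64KB + delta" bytes might be split between pooled buffers and a heap remainder. It is only a sketch under stated assumptions: the class and method are hypothetical, the pool is assumed never to run dry, and requests smaller than the minimal allocate size are assumed to go to the heap. It is not the actual allocator code.

```java
// Simplified, hypothetical model of the split described above; not HBase code.
class BlockAllocationModel {

  /**
   * Returns {number of pooled buffers, bytes left for a heap buffer}.
   * Assumes the pool always has a free buffer available.
   */
  static long[] split(long blockBytes, long bufferSize, long minimalAllocateSize) {
    if (bufferSize <= 0) {
      throw new IllegalArgumentException("bufferSize must be positive");
    }
    long pooled = 0;
    long remain = blockBytes;
    // Take pooled buffers while the remainder is still worth a pooled buffer.
    while (remain > 0 && remain >= minimalAllocateSize) {
      pooled++;
      remain -= bufferSize;
    }
    // Whatever is left (the small delta) falls back to the heap.
    return new long[] { pooled, Math.max(remain, 0) };
  }

  public static void main(String[] args) {
    long block = 64 * 1024 + 100; // 64KB block plus a 100-byte delta (example)
    long minAlloc = 16 * 1024;    // example minimal allocate size

    long[] tight = split(block, 64 * 1024, minAlloc); // buffer size equal to block size
    long[] roomy = split(block, 68 * 1024, minAlloc); // buffer size above block size

    System.out.println("64KB buffers: pooled=" + tight[0] + ", heap bytes=" + tight[1]);
    System.out.println("68KB buffers: pooled=" + roomy[0] + ", heap bytes=" + roomy[1]);
  }
}
```

With a buffer size equal to the block size, the model yields one pooled buffer plus a small heap buffer for the delta; with a slightly larger buffer size, the whole block fits in one pooled buffer.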
||||||
|
|
||||||
|
Don't forget to up your _HBASE_OFFHEAPSIZE_ accordingly. See <<hbase.offheapsize>>
|
||||||
|
|
||||||
[[regionserver.offheap.writepath]]
|
[[regionserver.offheap.writepath]]
|
||||||
== Offheap write-path
|
== Offheap write-path
|
||||||
|
|
||||||
In HBase 2.0.0, link:https://issues.apache.org/jira/browse/HBASE-15179[HBASE-15179] made the HBase write path to work off-heap. By default, the MemStores use
|
In hbase-2.x, link:https://issues.apache.org/jira/browse/HBASE-15179[HBASE-15179] made the HBase write path work off-heap. By default, the MemStores in
|
||||||
MSLAB to avoid memory fragmentation. It creates bigger fixed sized chunks and memstore cell's data will get copied into these chunks. These chunks can be pooled
|
HBase have always used MemStore Local Allocation Buffers (MSLABs) to avoid memory fragmentation; an MSLAB creates bigger fixed sized chunks and then the
|
||||||
also and from 2.0.0 the MSLAB (MemStore-Local Allocation Buffer) pool is by default ON. Write off-heaping makes use of the MSLAB pool. It creates MSLAB chunks
|
MemStore Cell data gets copied into these MSLAB chunks. These chunks can also be pooled, and from hbase-2.x on, the MSLAB pool is ON by default.
|
||||||
as Direct ByteBuffers and pools them. HBase defaults to using no off-heap memory for MSLAB which means that cells are copied to heap chunk in MSLAB by default
|
Write off-heaping makes use of the MSLAB pool. It creates MSLAB chunks as Direct ByteBuffers and pools them.
|
||||||
rather than off-heap chunk.
|
|
||||||
|
|
||||||
`hbase.regionserver.offheap.global.memstore.size` is the configuration key which controls the amount of off-heap data whose value is the number of megabytes
|
`hbase.regionserver.offheap.global.memstore.size` is the configuration key which controls the amount of off-heap data. Its value is the number of megabytes
|
||||||
of off-heap memory that should be by MSLAB (e.g. `25` would result in 25MB of off-heap). Be sure to increase `HBASE_OFFHEAPSIZE` which will set the JVM's
|
of off-heap memory that should be used by MSLAB (e.g. `25` would result in 25MB of off-heap). Be sure to increase _HBASE_OFFHEAPSIZE_ which will set the JVM's
|
||||||
MaxDirectMemorySize property. Its default value is 0, means MSLAB use heap chunks.
|
MaxDirectMemorySize property (see <<hbase.offheapsize>> for more on _HBASE_OFFHEAPSIZE_). The default value of
|
||||||
|
`hbase.regionserver.offheap.global.memstore.size` is 0 which means MSLAB uses onheap, not offheap, chunks by default.
|
||||||
|
|
||||||
`hbase.hregion.memstore.mslab.chunksize` controls the size of each off-heap chunk, defaulting to `2097152` (2MB).
|
`hbase.hregion.memstore.mslab.chunksize` controls the size of each off-heap chunk. Default is `2097152` (2MB).
|
||||||
|
|
||||||
When a Cell is added to a MemStore, the bytes for that Cell are copied into these off-heap buffers (if set the `hbase.regionserver.offheap.global.memstore.size` to non-zero)
|
When a Cell is added to a MemStore, the bytes for that Cell are copied into these off-heap buffers (if `hbase.regionserver.offheap.global.memstore.size` is non-zero)
|
||||||
and a Cell POJO will refer to this memory area. This can greatly reduce the on-heap occupancy of the MemStores and reduce the total heap utilization for RegionServers
|
and a Cell POJO will refer to this memory area. This can greatly reduce the on-heap occupancy of the MemStores and reduce the total heap utilization for RegionServers
|
||||||
in a write-heavy workload. On-heap and off-heap memory utiliazation are tracked at multiple levels to implement low level and high level memory management.
|
in a write-heavy workload. On-heap and off-heap memory utilization are tracked at multiple levels to implement low level and high level memory management.
|
||||||
The decision to flush a MemStore considers both the on-heap and off-heap usage of that MemStore. At the Region level, the sum of the on-heap and off-heap usages and
|
The decision to flush a MemStore considers both the on-heap and off-heap usage of that MemStore. At the Region level, we sum the on-heap and off-heap usages and
|
||||||
compares them against the region flush size (128MB, by default). Globally, on-heap size occupancy of all memstores are tracked as well as off-heap size. When any of
|
compare them against the region flush size (128MB, by default). Globally, the on-heap occupancy of all MemStores is tracked, as is the off-heap size. When any of
|
||||||
these sizes breaches the lower mark (`hbase.regionserver.global.memstore.size.lower.limit`) or the maximum size `hbase.regionserver.global.memstore.size`), all
|
these sizes breaches the lower mark (`hbase.regionserver.global.memstore.size.lower.limit`) or the maximum size (`hbase.regionserver.global.memstore.size`), all
|
||||||
regions are selected for forced flushes.
|
regions are selected for forced flushes.
|
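The flush decision described above can be sketched as follows. This is a minimal model, not HBase code: the class, methods, and the exact comparison operators are assumptions made for illustration, and only the 128MB default region flush size comes from the text.

```java
// Hypothetical model of the flush checks described above; not HBase code.
class MemStoreFlushModel {

  /** Default region flush size mentioned in the text: 128MB. */
  static final long DEFAULT_REGION_FLUSH_SIZE = 128L * 1024 * 1024;

  /** Region level: on-heap plus off-heap usage is compared to the flush size. */
  static boolean regionNeedsFlush(long onHeapBytes, long offHeapBytes, long flushSize) {
    return onHeapBytes + offHeapBytes > flushSize; // '>' is an assumption
  }

  /** Global level: on-heap and off-heap totals are tracked separately; either may trip a limit. */
  static boolean globalLimitBreached(long globalOnHeap, long onHeapLimit,
      long globalOffHeap, long offHeapLimit) {
    return globalOnHeap >= onHeapLimit || globalOffHeap >= offHeapLimit; // '>=' is an assumption
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    // 100MB on-heap plus 40MB off-heap exceeds the 128MB flush size.
    System.out.println(regionNeedsFlush(100 * mb, 40 * mb, DEFAULT_REGION_FLUSH_SIZE));
    // Off-heap total has reached its (example) limit even though on-heap has not.
    System.out.println(globalLimitBreached(500 * mb, 1024 * mb, 2048 * mb, 2048 * mb));
  }
}
```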
||||||
|
|
||||||
|
|