In some particular deployments, the Replication code believes it has
reached EOF for a WAL prior to succesfully parsing all bytes known to
exist in a cleanly closed file.
Consistently this failure happens due to an InvalidProtobufException
after some number of seeks during our attempts to tail the in-progress
RegionServer WAL. As a work-around, this patch treats cleanly closed
files differently than other execution paths. If an EOF is detected due
to parsing or other errors while there are still unparsed bytes before
the end-of-file trailer, we now reset the WAL to the very beginning and
attempt a clean read-through.
In current testing, a single such reset is sufficient to work around
observed dataloss. However, the above change will retry a given WAL file
indefinitely. On each such attempt, a log message like the below will
be emitted at the WARN level:
Processing end of WAL file '{}'. At position {}, which is too far away
from reported file length {}. Restarting WAL reading (see HBASE-15983
for details).
Additionally, this patch adds some additional log detail at the TRACE
level about file offsets seen while handling recoverable errors. It also
add metrics that measure the use of this recovery mechanism.
Changes how we do accounting of Connections to match how it is done in Hadoop.
Adds a ConnectionManager class. Adds new configurations for this new class.
"hbase.ipc.client.idlethreshold" 4000
"hbase.ipc.client.connection.idle-scan-interval.ms" 10000
"hbase.ipc.client.connection.maxidletime" 10000
"hbase.ipc.client.kill.max", 10
"hbase.ipc.server.handler.queue.size", 100
The new scheme does away with synchronization that purportedly would freeze out
reads while we were cleaning up stale connections (according to HADOOP-9955)
Also adds in new mechanism for accepting Connections by pulling in as many
as we can at a time adding them to a Queue instead of doing one at a time.
Can help when bursty traffic according to HADOOP-9956. Removes a blocking
while Reader is busy parsing a request. Adds configuration
"hbase.ipc.server.read.connection-queue.size" with default of 100 for
queue size.
Signed-off-by: stack <stack@apache.org>
Adds HADOOP-9955 RPC idle connection closing is extremely inefficient
Then removes queue added by HADOOP-9956 at Enis suggestion
Changes how we do accounting of Connections to match how it is done in Hadoop.
Adds a ConnectionManager class. Adds new configurations for this new class.
"hbase.ipc.client.idlethreshold" 4000
"hbase.ipc.client.connection.idle-scan-interval.ms" 10000
"hbase.ipc.client.connection.maxidletime" 10000
"hbase.ipc.client.kill.max", 10
"hbase.ipc.server.handler.queue.size", 100
The new scheme does away with synchronization that purportedly would freeze out
reads while we were cleaning up stale connections (according to HADOOP-9955)
Also adds in new mechanism for accepting Connections by pulling in as many
as we can at a time adding them to a Queue instead of doing one at a time.
Can help when bursty traffic according to HADOOP-9956. Removes a blocking
while Reader is busy parsing a request. Adds configuration
"hbase.ipc.server.read.connection-queue.size" with default of 100 for
queue size.
Summary: Missing a root index block is worse than missing a data block. We should know the difference
Test Plan: Tested on a local instance. All numbers looked reasonable.
Differential Revision: https://reviews.facebook.net/D55563
Summary:
Use less contended things for metrics.
For histogram which was the largest culprit we use FastLongHistogram
For atomic long where possible we now use counter.
Test Plan: unit tests
Reviewers:
Subscribers:
Differential Revision: https://reviews.facebook.net/D54381
Summary:
* Add VersionInfoUtil to determine if a client has a specified version or better
* Add an exception type to say that the response should be chunked
* Add on client knowledge of retry exceptions
* Add on metrics for how often this happens
Test Plan: Added a unit test
Differential Revision: https://reviews.facebook.net/D51771