76396714e1
In some deployments, the Replication code believes it has reached EOF for a WAL before it has successfully parsed all bytes known to exist in a cleanly closed file. The failure consistently follows an InvalidProtobufException thrown after some number of seeks while tailing the in-progress RegionServer WAL.

As a workaround, this patch treats cleanly closed files differently from other execution paths. If an EOF is detected due to parsing or other errors while there are still unparsed bytes before the end-of-file trailer, we now reset the WAL to the very beginning and attempt a clean read-through.

In current testing, a single such reset is sufficient to work around the observed data loss. However, the change will retry a given WAL file indefinitely. On each such attempt, a log message like the one below is emitted at the WARN level:

Processing end of WAL file '{}'. At position {}, which is too far away from reported file length {}. Restarting WAL reading (see HBASE-15983 for details).

This patch also adds log detail at the TRACE level about file offsets seen while handling recoverable errors, and adds metrics that measure the use of this recovery mechanism.
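
The EOF decision described above can be illustrated with a minimal, self-contained sketch. It is not the patch's actual code: the class `CleanCloseEofSketch`, the method `onEof`, the `Action` enum, the `trailerStart` parameter, and the `restartCount` counter are all hypothetical names chosen for illustration. The sketch assumes a cleanly closed file is one whose trailer offset is known, and it only models the choice between restarting from the beginning and moving on.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the EOF handling; names do not come from HBase. */
public class CleanCloseEofSketch {

    /** What the reader should do once EOF has been reported. */
    enum Action { ADVANCE_TO_NEXT_WAL, RESTART_FROM_BEGINNING }

    /** Hypothetical stand-in for the metric that counts uses of the recovery path. */
    static final AtomicLong restartCount = new AtomicLong();

    /**
     * Decide how to react to an EOF on a WAL file.
     *
     * @param walName      name of the WAL file, used only for logging
     * @param position     offset of the last successfully parsed byte
     * @param fileLength   reported length of the file
     * @param trailerStart offset where the end-of-file trailer begins,
     *                     or -1 if the file was not cleanly closed (assumption)
     */
    static Action onEof(String walName, long position, long fileLength, long trailerStart) {
        boolean cleanlyClosed = trailerStart >= 0;
        // Unparsed bytes remain before the trailer of a cleanly closed file:
        // reset to the start and attempt a clean read-through.
        if (cleanlyClosed && position < trailerStart) {
            restartCount.incrementAndGet();
            System.err.printf(
                "WARN Processing end of WAL file '%s'. At position %d, which is too far away "
                    + "from reported file length %d. Restarting WAL reading "
                    + "(see HBASE-15983 for details).%n",
                walName, position, fileLength);
            return Action.RESTART_FROM_BEGINNING;
        }
        // Either everything before the trailer was parsed, or the file is not
        // cleanly closed and follows the other execution paths.
        return Action.ADVANCE_TO_NEXT_WAL;
    }

    public static void main(String[] args) {
        // Stopped at offset 100 although entries extend to offset 980.
        System.out.println(onEof("wal.1", 100L, 1000L, 980L));
        // All entries before the trailer were parsed.
        System.out.println(onEof("wal.1", 980L, 1000L, 980L));
    }
}
```

Because nothing in this sketch bounds the number of restarts, a file that fails the same way on every pass would be retried indefinitely, which matches the caveat stated above.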