From 3ac99ad192fb0ab0db7c07e40da6d2d83a732e62 Mon Sep 17 00:00:00 2001
From: Wellington Ramos Chevreuil <wchevreuil@apache.org>
Date: Tue, 16 Jun 2020 10:02:29 +0100
Subject: [PATCH] HBASE-21405 [DOC] Add Details about Output of "status
 'replication'" (#1894)

    Signed-off-by: Jan Hentschel <jan.hentschel@ultratendency.com>
    Signed-off-by: Viraj Jasani <vjasani@apache.org>
---
 src/main/asciidoc/_chapters/ops_mgt.adoc | 85 ++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc b/src/main/asciidoc/_chapters/ops_mgt.adoc
index da718ec3ea9..85ad4e83371 100644
--- a/src/main/asciidoc/_chapters/ops_mgt.adoc
+++ b/src/main/asciidoc/_chapters/ops_mgt.adoc
@@ -2629,6 +2629,91 @@ You can use the HBase Shell command `status 'replication'` to monitor the replic
 * `status 'replication', 'source'` -- prints the status for each replication source, sorted by hostname.
 * `status 'replication', 'sink'` -- prints the status for each replication sink, sorted by hostname.
 
+==== Understanding the output
+
+The command output varies with the state of replication. For example, right after a restart,
+while the destination peer is not reachable, no replication source threads are running,
+so no metrics are displayed:
+
+----
+hbase01.home:
+SOURCE: PeerID=1
+Normal Queue: 1
+No Reader/Shipper threads runnning yet.
+SINK: TimeStampStarted=1591985197350, Waiting for OPs...
+----
+
+Under normal circumstances, a healthy, active-active replication deployment
+shows output similar to the following:
+
+----
+    hbase01.home:
+      SOURCE: PeerID=1
+         Normal Queue: 1
+           AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
+      SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020
+----
+
+Each of these metrics is defined below:
+
+[cols="1,1,1", options="header"]
+|===
+| Type
+| Metric Name
+| Description
+
+| Source
+| AgeOfLastShippedOp
+| How long the last successfully shipped edit took to be replicated on the target.
+
+| Source
+| TimeStampOfLastShippedOp
+| The date of the last successful edit shipment.
+
+| Source
+| SizeOfLogQueue
+| Number of WAL files in this queue.
+
+| Source
+| EditsReadFromLogQueue
+| How many edits have been read from this queue since this source thread started.
+
+| Source
+| OpsShippedToTarget
+| How many edits have been shipped to the target since this source thread started.
+
+| Source
+| TimeStampOfNextToReplicate
+| Date of the edit that the source is currently attempting to replicate.
+
+| Source
+| Replication Lag
+| The elapsed time (in milliseconds) since the last edit to replicate was read by this source
+thread and effectively replicated to the target.
+
+| Sink
+| TimeStampStarted
+| Timestamp (in milliseconds) at which this sink thread started.
+
+| Sink
+| AgeOfLastAppliedOp
+| How long it took to apply the last successfully shipped edit.
+
+| Sink
+| TimeStampsOfLastAppliedOp
+| Date of the last successfully applied edit.
+
+|===
+
+Growing values for `Source.AgeOfLastShippedOp` and/or
+`Source.Replication Lag` indicate replication delays. If those numbers keep going
+up while `Source.TimeStampOfLastShippedOp`, `Source.EditsReadFromLogQueue`,
+`Source.OpsShippedToTarget` or `Source.TimeStampOfNextToReplicate` do not change at all,
+then replication is failing to progress, and there might be problems with
+communication between the clusters. This can also happen if replication is manually paused
+(via the HBase shell `disable_peer` command, for example) while data keeps being ingested
+into the source cluster tables.
+
 == Running Multiple Workloads On a Single Cluster
 
 HBase provides the following mechanisms for managing the performance of a cluster