Add Riley's slow query doc from hbase-4117

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1235023 13f79535-47bb-0310-9956-ffa450edef68
2012-01-23 22:06:03 +00:00 · 2012-01-23 22:06:03 +00:00 · a56613fd71
commit a56613fd71
parent 59b1b68d12
1 changed files with 45 additions and 0 deletions
--- a/src/docbkx/ops_mgt.xml
+++ b/src/docbkx/ops_mgt.xml
@ -367,6 +367,51 @@ false
    <title >HBase Monitoring</title>
    <para>TODO
    </para>
+    <section xml:id="ops.slow.query">
+    <title>Slow Query Log</title>
+<para>The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output. The thresholds for "too long to run" and "too much output" are configurable, as described below. The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events. It is also prepended with identifying tags <constant>(responseTooSlow)</constant>, <constant>(responseTooLarge)</constant>, <constant>(operationTooSlow)</constant>, and <constant>(operationTooLarge)</constant> in order to enable easy filtering with grep, in case the user desires to see only slow queries.
+</para>
+
+<section><title>Configuration</title>
+<para>There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.
+</para>
+
+<itemizedlist>
+<listitem>
+<varname>hbase.ipc.warn.response.time</varname> Maximum number of milliseconds that a query can be run without being logged. Defaults to 10000, or 10 seconds. Can be set to -1 to disable logging by time.
+</listitem>
+<listitem><varname>hbase.ipc.warn.response.size</varname> Maximum byte size of response that a query can return without being logged. Defaults to 100 megabytes. Can be set to -1 to disable logging by size.
+</listitem>
+</itemizedlist>
+</section>
+
+<section><title>Metrics</title>
+<para>The slow query log exposes to metrics to JMX.
+<itemizedlist><listitem><varname>hadoop.regionserver_rpc_slowResponse</varname> a global metric reflecting the durations of all responses that triggered logging.</listitem>
+<listitem><varname>hadoop.regionserver_rpc_methodName.aboveOneSec</varname> A metric reflecting the durations of all responses that lasted for more than one second.</listitem>
+</itemizedlist>
+</para>
+</section>
+
+<section><title>Output</title>
+<para>The output is tagged with operation e.g. <constant>(operationTooSlow)</constant> if the call was a client operation, such as a Put, Get, or Delete, which we expose detailed fingerprint information for. If not, it is tagged <constant>(responseTooSlow)</constant> and still produces parseable JSON output, but with less verbose information solely regarding its duration and size in the RPC itself. <constant>TooLarge</constant> is substituted for <constant>TooSlow</constant> if the response size triggered the logging, with <constant>TooLarge</constant> appearing even in the case that both size and duration triggered logging.
+</para>
+</section>
+<section><title>Example</title>
+<para>
+<programlisting>2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}</programlisting>
+</para>
+
+<para>Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port. Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations. In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.
+</para>
+
+<para>This particular example, for example, would indicate that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as we can tell by the "vlen," or value length, fields of each put in the multiPut.
+</para>
+</section>
+</section>
+
+
+
  </section>
  
  <section xml:id="cluster_replication">