HBASE-11607 Document HBase metrics (Misty Stanley-Jones)

2014-08-19 13:51:17 -07:00 · 2014-08-19 13:51:17 -07:00 · 8a52d58a7b
parent 3b864842c7
commit 8a52d58a7b
1 changed files with 118 additions and 182 deletions
--- a/src/main/docbkx/ops_mgt.xml
+++ b/src/main/docbkx/ops_mgt.xml
@@ -951,196 +951,132 @@ $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --
  </section>
  <!--  node mgt -->
-  <section
-    xml:id="hbase_metrics">
+  <section xml:id="hbase_metrics">
    <title>HBase Metrics</title>
-    <section
+    <para>HBase emits metrics which adhere to the <link
-      xml:id="metric_setup">
+        xlink:href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/metrics/package-summary.html"
        >Hadoop metrics</link> API. Starting with HBase 0.95, HBase is configured to emit a default
      set of metrics with a default sampling period of every 10 seconds. You can use HBase
      metrics in conjunction with Ganglia. You can also filter which metrics are emitted and extend
      the metrics framework to capture custom metrics appropriate for your environment.</para>
    <section xml:id="metric_setup">
      <title>Metric Setup</title>
-      <para>See <link
+      <para>For HBase 0.95 and newer, HBase ships with a default metrics configuration, or
-          xlink:href="http://hbase.apache.org/metrics.html">Metrics</link> for an introduction and
+          <firstterm>sink</firstterm>. This includes a wide variety of individual metrics, and emits
-        how to enable Metrics emission. Still valid for HBase 0.94.x. </para>
+        them every 10 seconds by default. To configure metrics for a given region server, edit the
-      <para>For HBase 0.95.x and up, see <link
+          <filename>conf/hadoop-metrics2-hbase.properties</filename> file. Restart the region server
-          xlink:href="http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html" />
+        for the changes to take effect.</para>
      <para>To change the sampling rate for the default sink, edit the line beginning with
          <literal>*.period</literal>. To filter which metrics are emitted or to extend the metrics
        framework, see <link
          xlink:href="http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html"
        />.
      </para>
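      <para>As a minimal sketch, assuming the shipped file still contains a
        <literal>*.period</literal> line, lengthening the sampling period from the default 10
        seconds would look like the fragment below. The value 30 is an illustrative assumption,
        not a recommendation.</para>

```properties
# conf/hadoop-metrics2-hbase.properties
# Sampling period in seconds (the shipped default is 10).
# 30 is an example value chosen for illustration only.
*.period=30
```

      <para>Restart the region server after saving the file so that the new period takes
        effect.</para>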
      <note xml:id="rs_metrics_ganglia">
        <title>HBase Metrics and Ganglia</title>
        <para>By default, HBase emits a large number of metrics per region server. Ganglia may have
          difficulty processing all these metrics. Consider increasing the capacity of the Ganglia
          server or reducing the number of metrics emitted by HBase. See <link
            xlink:href="http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html#filtering"
            >Metrics Filtering</link>.</para>
      </note>
    </section>
-    <section
+    <section>
-      xml:id="rs_metrics_ganglia">
+      <title>Disabling Metrics</title>
-      <title>Warning To Ganglia Users</title>
+      <para>To disable metrics for a region server, edit the
-      <para>Warning to Ganglia Users: by default, HBase will emit a LOT of metrics per RegionServer
+          <filename>conf/hadoop-metrics2-hbase.properties</filename> file and comment out any
-        which may swamp your installation. Options include either increasing Ganglia server
+        uncommented lines. Restart the region server for the changes to take effect.</para>
        capacity, or configuring HBase to emit fewer metrics. </para>
    </section>
    <section
      xml:id="rs_metrics">
      <title>Most Important RegionServer Metrics</title>
      <section
        xml:id="hbase.regionserver.blockCacheHitCachingRatio">
        <title><varname>blockCacheExpressCachingRatio (formerly
          blockCacheHitCachingRatio)</varname></title>
        <para>Block cache hit caching ratio (0 to 100). The cache-hit ratio for reads configured to
          look in the cache (i.e., cacheBlocks=true). </para>
      </section>
      <section
        xml:id="hbase.regionserver.callQueueLength">
        <title><varname>callQueueLength</varname></title>
        <para>Point in time length of the RegionServer call queue. If requests arrive faster than
          the RegionServer handlers can process them they will back up in the callQueue.</para>
      </section>
      <section
        xml:id="hbase.regionserver.compactionQueueSize">
        <title><varname>compactionQueueLength (formerly compactionQueueSize)</varname></title>
        <para>Point in time length of the compaction queue. This is the number of Stores in the
          RegionServer that have been targeted for compaction.</para>
      </section>
      <section
        xml:id="hbase.regionserver.flushQueueSize">
        <title><varname>flushQueueSize</varname></title>
        <para>Point in time number of enqueued regions in the MemStore awaiting flush.</para>
      </section>
      <section
        xml:id="hbase.regionserver.hdfsBlocksLocalityIndex">
        <title><varname>hdfsBlocksLocalityIndex</varname></title>
        <para>Point in time percentage of HDFS blocks that are local to this RegionServer. The
          higher the better. </para>
      </section>
      <section
        xml:id="hbase.regionserver.memstoreSizeMB">
        <title><varname>memstoreSizeMB</varname></title>
        <para>Point in time sum of all the memstore sizes in this RegionServer (MB). Watch for this
          nearing or exceeding the configured high-watermark for MemStore memory in the
          RegionServer. </para>
      </section>
      <section
        xml:id="hbase.regionserver.regions">
        <title><varname>numberOfOnlineRegions</varname></title>
        <para>Point in time number of regions served by the RegionServer. This is an important
          metric to track for RegionServer-Region density. </para>
      </section>
      <section
        xml:id="hbase.regionserver.readRequestsCount">
        <title><varname>readRequestsCount</varname></title>
        <para>Number of read requests for this RegionServer since startup. Note: this is a 32-bit
          integer and can roll. </para>
      </section>
      <section
        xml:id="hbase.regionserver.slowHLogAppendCount">
        <title><varname>slowHLogAppendCount</varname></title>
        <para>Number of slow HLog append writes for this RegionServer since startup, where "slow" is
          > 1 second. This is a good "canary" metric for HDFS. </para>
      </section>
      <section
        xml:id="hbase.regionserver.usedHeapMB">
        <title><varname>usedHeapMB</varname></title>
        <para>Point in time amount of memory used by the RegionServer (MB).</para>
      </section>
      <section
        xml:id="hbase.regionserver.writeRequestsCount">
        <title><varname>writeRequestsCount</varname></title>
        <para>Number of write requests for this RegionServer since startup. Note: this is a 32-bit
          integer and can roll. </para>
      </section>
    <section>
      <title>Discovering Available Metrics</title>
      <para>Rather than listing each metric which HBase emits by default, you can browse through the
        available metrics, either as a JSON output or via JMX. At this time, the JSON output does
        not include the description field which is included in the JMX view. Different metrics are
        exposed for the Master process and each region server process.</para>
      <procedure>
        <title>Access a JSON Output of Available Metrics</title>
        <step>
          <para>After starting HBase, access the region server's web UI, at
              <literal>http://localhost:60030</literal> by default.</para>
        </step>
        <step>
          <para>Click the <guilabel>Metrics Dump</guilabel> link near the top. The metrics for the region server are
            presented as a dump of the JMX bean in JSON format.</para>
        </step>
        <step>
          <para>To view metrics for the Master, connect to the Master's web UI instead (defaults to
              <literal>http://localhost:60010</literal>) and click its <guilabel>Metrics
              Dump</guilabel> link.</para>
        </step>
      </procedure>
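      <para>The same dump can also be read by a script. The Python sketch below assumes that
        the <guilabel>Metrics Dump</guilabel> link resolves to a <literal>/jmx</literal> path on
        the web UI port, and that the payload is a JSON object with a top-level
        <literal>beans</literal> array whose entries carry a <literal>name</literal> field. The
        bean name, the metric names, and the sample payload are hypothetical and should be
        checked against the output of your own HBase version.</para>

```python
import json
from urllib.request import urlopen

# Assumed bean name, mirroring the Hadoop / HBase / RegionServer / Server path.
SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"


def find_bean(dump, bean_name):
    """Return the first bean in a parsed dump whose name matches, else None."""
    for bean in dump.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


def numeric_metrics(bean):
    """Keep only numeric attributes, dropping strings and booleans."""
    return {
        key: value
        for key, value in bean.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def fetch_dump(url="http://localhost:60030/jmx", timeout=10):
    """Download and parse the dump; the /jmx path is an assumption."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical, trimmed-down payload so the sketch runs without a cluster.
SAMPLE_DUMP = json.loads("""
{"beans": [
  {"name": "java.lang:type=Runtime", "Uptime": 1234},
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "tag.Context": "regionserver", "regionCount": 2, "storeCount": 3}
]}
""")

if __name__ == "__main__":
    bean = find_bean(SAMPLE_DUMP, SERVER_BEAN)
    print(numeric_metrics(bean))
```

      <para>To try the sketch against a running region server, pass the result of
        <literal>fetch_dump()</literal> to <literal>find_bean</literal> in place of the sample
        payload.</para>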
      <procedure>
        <title>Browse the JMX Output of Available Metrics</title>
        <para>You can use many different tools to view JMX content by browsing MBeans. This
          procedure uses <command>jvisualvm</command>, which is an application usually available in the JDK.
            </para>
        <step>
          <para>Start HBase, if it is not already running.</para>
        </step>
        <step>
          <para>Run the <command>jvisualvm</command> command on a host with a GUI display.
            You can launch it from the command line or another method appropriate for your operating
            system.</para>
        </step>
        <step>
          <para>Be sure the <guilabel>VisualVM-MBeans</guilabel> plugin is installed. Browse to <menuchoice>
              <guimenu>Tools</guimenu>
              <guimenuitem>Plugins</guimenuitem>
            </menuchoice>. Click <guilabel>Installed</guilabel> and check whether the plugin is
            listed. If not, click <guilabel>Available Plugins</guilabel>, select it, and click
              <guibutton>Install</guibutton>. When finished, click
            <guibutton>Close</guibutton>.</para>
        </step>
        <step>
          <para>To view details for a given HBase process, double-click the process in the
              <guilabel>Local</guilabel> sub-tree in the left-hand panel. A detailed view opens in
            the right-hand panel. Click the <guilabel>MBeans</guilabel> tab at the top of the
            right-hand panel.</para>
        </step>
        <step>
          <para>To access the HBase metrics, navigate to the appropriate sub-bean:</para>
          <itemizedlist>
            <listitem>
              <para>Master: <menuchoice>
                  <guimenu>Hadoop</guimenu>
                  <guisubmenu>HBase</guisubmenu>
                  <guisubmenu>Master</guisubmenu>
                  <guisubmenu>Server</guisubmenu>
                </menuchoice></para>
            </listitem>
            <listitem>
              <para>RegionServer: <menuchoice>
                  <guimenu>Hadoop</guimenu>
                  <guisubmenu>HBase</guisubmenu>
                  <guisubmenu>RegionServer</guisubmenu>
                  <guisubmenu>Server</guisubmenu>
                </menuchoice></para>
            </listitem>
          </itemizedlist>
        </step>
        <step>
          <para>The name of each metric and its current value are displayed in the
              <guilabel>Attributes</guilabel> tab. For a view which includes more details, including
            the description of each attribute, click the <guilabel>Metadata</guilabel> tab.</para>
        </step>
      </procedure>
    </section>
-    <section
+    <section xml:id="rs_metrics">
-      xml:id="rs_metrics_other">
+      <title>Most Important RegionServer Metrics</title>
-      <title>Other RegionServer Metrics</title>
+      <para>Previously, this section contained a list of the most important RegionServer metrics.
-      <section
+        However, the list was extremely out of date. In some cases, the name of a given metric has
-        xml:id="hbase.regionserver.blockCacheCount">
+        changed. In other cases, the metric seems to no longer be exposed. An effort is underway to
-        <title><varname>blockCacheCount</varname></title>
+        create automatic documentation for each metric based upon information pulled from its
-        <para>Point in time block cache item count in memory. This is the number of blocks of
+        implementation.</para>
          StoreFiles (HFiles) in the cache.</para>
      </section>
      <section
        xml:id="hbase.regionserver.blockCacheEvictedCount">
        <title><varname>blockCacheEvictedCount</varname></title>
        <para>Number of blocks that had to be evicted from the block cache due to heap size
          constraints by RegionServer since startup.</para>
      </section>
      <section
        xml:id="hbase.regionserver.blockCacheFree">
        <title><varname>blockCacheFreeMB</varname></title>
        <para>Point in time block cache memory available (MB).</para>
      </section>
      <section
        xml:id="hbase.regionserver.blockCacheHitCount">
        <title><varname>blockCacheHitCount</varname></title>
        <para>Number of blocks of StoreFiles (HFiles) read from the cache by RegionServer since
          startup.</para>
      </section>
      <section
        xml:id="hbase.regionserver.blockCacheHitRatio">
        <title><varname>blockCacheHitRatio</varname></title>
        <para>Block cache hit ratio (0 to 100) from RegionServer startup. Includes all read
          requests, although those with cacheBlocks=false will always read from disk and be counted
          as a "cache miss", which means that full-scan MapReduce jobs can affect this metric
          significantly.</para>
      </section>
      <section
        xml:id="hbase.regionserver.blockCacheMissCount">
        <title><varname>blockCacheMissCount</varname></title>
        <para>Number of blocks of StoreFiles (HFiles) requested but not read from the cache from
          RegionServer startup.</para>
      </section>
      <section
        xml:id="hbase.regionserver.blockCacheSize">
        <title><varname>blockCacheSizeMB</varname></title>
        <para>Point in time block cache size in memory (MB). i.e., memory in use by the
          BlockCache</para>
      </section>
      <section
        xml:id="hbase.regionserver.fsPreadLatency">
        <title><varname>fsPreadLatency*</varname></title>
        <para>There are several filesystem positional read latency (ms) metrics, all measured from
          RegionServer startup.</para>
      </section>
      <section
        xml:id="hbase.regionserver.fsReadLatency">
        <title><varname>fsReadLatency*</varname></title>
        <para>There are several filesystem read latency (ms) metrics, all measured from RegionServer
          startup. The issue with interpretation is that ALL reads go into this metric (e.g.,
          single-record Gets, full table Scans), including reads required for compactions. This
          metric is only interesting "over time" when comparing major releases of HBase or your own
          code.</para>
      </section>
      <section
        xml:id="hbase.regionserver.fsWriteLatency">
        <title><varname>fsWriteLatency*</varname></title>
        <para>There are several filesystem write latency (ms) metrics, all measured from
          RegionServer startup. The issue with interpretation is that ALL writes go into this metric
          (e.g., single-record Puts, full table re-writes due to compaction). This metric is only
          interesting "over time" when comparing major releases of HBase or your own code.</para>
      </section>
      <section
        xml:id="hbase.regionserver.stores">
        <title><varname>NumberOfStores</varname></title>
        <para>Point in time number of Stores open on the RegionServer. A Store corresponds to a
          ColumnFamily. For example, if a table (which contains the column family) has 3 regions on
          a RegionServer, there will be 3 stores open for that column family. </para>
      </section>
      <section
        xml:id="hbase.regionserver.storeFiles">
        <title><varname>NumberOfStorefiles</varname></title>
        <para>Point in time number of StoreFiles open on the RegionServer. A store may have more
          than one StoreFile (HFile).</para>
      </section>
      <section
        xml:id="hbase.regionserver.requests">
        <title><varname>requestsPerSecond</varname></title>
        <para>Point in time number of read and write requests. Requests correspond to RegionServer
          RPC calls, thus a single Get will result in 1 request, but a Scan with caching set to 1000
          will result in 1 request for each 'next' call (i.e., not each row). A bulk-load request
          will constitute 1 request per HFile. This metric is less interesting than
          readRequestsCount and writeRequestsCount in terms of measuring activity due to this metric
          being periodic. </para>
      </section>
      <section
        xml:id="hbase.regionserver.storeFileIndexSizeMB">
        <title><varname>storeFileIndexSizeMB</varname></title>
        <para>Point in time sum of all the StoreFile index sizes in this RegionServer (MB)</para>
      </section>
    </section>
-  </section>
+  </section>      
  <section
    xml:id="ops.monitoring">