HBASE-11781 Document new TableMapReduceUtil scanning options (Misty Stanley-Jones)

2014-09-03 16:24:59 -07:00
commit bcfc6d65af
parent 1a6eea335f
1 changed files with 36 additions and 0 deletions
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -1048,6 +1048,42 @@ $ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp
      </caution>
    </section>

+    <section>
+      <title>MapReduce Scan Caching</title>
+      <para>TableMapReduceUtil once again honors the scanner caching (the number of rows which
+        are cached before the result is returned to the client) that is set on the Scan object
+        passed in to it. This functionality was lost due to a bug in HBase 0.95 (<link
+          xlink:href="https://issues.apache.org/jira/browse/HBASE-11558">HBASE-11558</link>), which
+        is fixed in HBase 0.98.5 and 0.96.3. The scanner caching value is chosen in the following
+        order of priority:</para>
+      <orderedlist>
+        <listitem>
+          <para>Caching settings which are set on the Scan object.</para>
+        </listitem>
+        <listitem>
+          <para>Caching settings which are specified via the configuration option
+              <option>hbase.client.scanner.caching</option>, which can either be set manually in
+              <filename>hbase-site.xml</filename> or via the helper method
+              <code>TableMapReduceUtil.setScannerCaching()</code>.</para>
+        </listitem>
+        <listitem>
+          <para>The default value <code>HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING</code>, which is set to
+            <literal>100</literal>.</para>
+        </listitem>
+      </orderedlist>
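+      <para>As a minimal sketch of the second item in the list above, a site-wide default could be
+        set in <filename>hbase-site.xml</filename> as shown here. The value
+        <literal>500</literal> is an arbitrary illustration, not a recommendation. A caching value
+        set on an individual Scan object still takes precedence over it.</para>
+      <programlisting language="xml"><![CDATA[
+<!-- Illustrative value only; tune it for your own row sizes and timeouts. -->
+<property>
+  <name>hbase.client.scanner.caching</name>
+  <value>500</value>
+</property>
+]]></programlisting>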
+      <para>Optimizing the caching setting is a balance between the time the client waits for each
+        result and the number of sets of results the client needs to receive. If the caching
+        setting is too large, the client could wait a long time for each set of results, or the
+        request could even time out. If the setting is too small, the scan has to return its
+        results in many small pieces, each of which costs another trip to the server. If you
+        think of the scan as a shovel, a larger caching setting is a bigger shovel, and a smaller
+        caching setting means more shoveling to fill the bucket.</para>
+      <para>The priority order listed above allows you to set a reasonable default and override
+        it for specific operations.</para>
+      <para>See the API documentation for <link
+          xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html"
+          >Scan</link> for more details.</para>
+    </section>

    <section>
      <title>Bundled HBase MapReduce Jobs</title>