From ae6af9e6300665df3b09b882d534390902adc2a8 Mon Sep 17 00:00:00 2001
From: Doug Meil <dmeil@apache.org>
Date: Thu, 3 Nov 2011 21:22:57 +0000
Subject: [PATCH] HBASE-4743 book.xml, performance.xml, troubleshooting.xml
 scan info

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1197315 13f79535-47bb-0310-9956-ffa450edef68
---
 src/docbkx/book.xml            |  2 +-
 src/docbkx/performance.xml     | 12 ++++++++++++
 src/docbkx/troubleshooting.xml |  4 +++-
 3 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/src/docbkx/book.xml b/src/docbkx/book.xml
index d7697765fb3..208fbf5e5c2 100644
--- a/src/docbkx/book.xml
+++ b/src/docbkx/book.xml
@@ -567,7 +567,7 @@ admin.enableTable(table);
     </para>
     <section xml:id="number.of.cfs.card"><title>Cardinality of ColumnFamilies</title>
       <para>Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows).  
-      If ColumnFamilyA has 1000,000 rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be spread 
+      If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be spread 
       across many, many regions (and RegionServers).  This makes mass scans for ColumnFamilyA less efficient.  
       </para>
     </section>
diff --git a/src/docbkx/performance.xml b/src/docbkx/performance.xml
index 12b1d19d006..a8955ff21cc 100644
--- a/src/docbkx/performance.xml
+++ b/src/docbkx/performance.xml
@@ -353,6 +353,18 @@ Deferred log flush can be configured on tables via <link
       rows at a time to the client to be processed. There is a cost/benefit to
       have the cache value be large because it costs more in memory for both
       client and RegionServer, so bigger isn't always better.</para>
+      <section xml:id="perf.hbase.client.caching.mr">
+        <title>Scan Caching in MapReduce Jobs</title>
+        <para>Scan settings in MapReduce jobs deserve special attention.  Timeouts can result (e.g., UnknownScannerException)
+        in Map tasks if it takes too long to process a batch of records before the client goes back to the RegionServer for the
+        next set of data.  This problem can occur because there is non-trivial processing occurring per row.  If you process
+        rows quickly, set caching higher.  If you process rows more slowly (e.g., lots of transformations per row, writes), 
+        then set caching lower.
+        </para>
+        <para>Timeouts can also happen in a non-MapReduce use case (e.g., a single-threaded HBase client doing a Scan), but the
+        processing that is often performed in MapReduce jobs tends to exacerbate this issue.
+        </para>
+      </section>
     </section>
     <section xml:id="perf.hbase.client.selection">
       <title>Scan Attribute Selection</title>
diff --git a/src/docbkx/troubleshooting.xml b/src/docbkx/troubleshooting.xml
index 379728bbd7b..6a568233e0e 100644
--- a/src/docbkx/troubleshooting.xml
+++ b/src/docbkx/troubleshooting.xml
@@ -464,12 +464,14 @@ hadoop   17789  155 35.2 9067824 8604364 ?     S&lt;l  Mar04 9855:48 /usr/java/j
        <para>For more information on the HBase client, see <xref linkend="client"/>. 
        </para>
        <section xml:id="trouble.client.scantimeout">
-            <title>ScannerTimeoutException</title>
+            <title>ScannerTimeoutException or UnknownScannerException</title>
             <para>This is thrown if the time between RPC calls from the client to RegionServer exceeds the scan timeout.  
             For example, if <code>Scan.setCaching</code> is set to 500, then there will be an RPC call to fetch the next batch of rows every 500 <code>.next()</code> calls on the ResultScanner
             because data is being transferred in blocks of 500 rows to the client.  Reducing the setCaching value may be an option, but setting this value too low makes for inefficient
             processing on numbers of rows.
             </para>
+            <para>See <xref linkend="perf.hbase.client.caching"/>.
+            </para>
        </section>    
        <section xml:id="trouble.client.scarylogs">
             <title>Shell or client application throws lots of scary exceptions during normal operation</title>