diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml index 0fb3eba8fdc..ed26c1beb80 100644 --- a/src/main/docbkx/book.xml +++ b/src/main/docbkx/book.xml @@ -68,8 +68,8 @@ - - + + Data Model @@ -658,7 +658,7 @@ htable.put(put); - + HBase and MapReduce @@ -1319,11 +1319,16 @@ scan.setFilter(list);
Column Value
SingleColumnValueFilter - SingleColumnValueFilter - can be used to test column values for equivalence (CompareOp.EQUAL - ), inequality (CompareOp.NOT_EQUAL), or ranges - (e.g., CompareOp.GREATER). The folowing is example of testing equivalence a column to a String value "my value"... - + SingleColumnValueFilter can be used to test column values for equivalence + (CompareOp.EQUAL + ), inequality (CompareOp.NOT_EQUAL), or ranges (e.g., + CompareOp.GREATER). The following is an example of testing the equivalence of a + column to a String value "my value"... + SingleColumnValueFilter filter = new SingleColumnValueFilter( cf, column, @@ -1604,9 +1609,9 @@ rs.close(); Single access priority: The first time a block is loaded from HDFS it normally has this priority and it will be part of the first group to be considered during evictions. The advantage is that scanned blocks are more likely to get evicted than blocks that are getting more usage. - Mutli access priority: If a block in the previous priority group is accessed again, it upgrades to this priority. It is thus part of the second group - considered during evictions. - + Multi access priority: If a block in the previous priority group is accessed + again, it upgrades to this priority. It is thus part of the second group considered + during evictions. In-memory access priority: If the block's family was configured to be "in-memory", it will be part of this priority disregarding the number of times it was accessed. Catalog tables are configured like this. This group is the last one considered during evictions. @@ -2418,13 +2423,13 @@ All the settings that apply to normal compactions (file size limits, etc.) apply - + - - + + - - + + @@ -2533,9 +2538,8 @@ All the settings that apply to normal compactions (file size limits, etc.) apply Can I change a table's rowkeys? - - This is a very common quesiton. You can't. See . - + This is a very common question. You can't. See . 
diff --git a/src/main/docbkx/configuration.xml b/src/main/docbkx/configuration.xml index 8811d60a63b..45bc157742b 100644 --- a/src/main/docbkx/configuration.xml +++ b/src/main/docbkx/configuration.xml @@ -29,7 +29,7 @@ Apache HBase Configuration This chapter is the Not-So-Quick start guide to Apache HBase configuration. It goes over system requirements, Hadoop setup, the different Apache HBase run modes, and the - various configurations in HBase. Please read this chapter carefully. At a mimimum + various configurations in HBase. Please read this chapter carefully. At a minimum ensure that all have been satisfied. Failure to do so will cause you (and us) grief debugging strange errors and/or data loss. @@ -778,7 +778,7 @@ stopping hbase............... Shutdown can take a moment to <filename>hbase-env.sh</filename> Set HBase environment variables in this file. Examples include options to pass the JVM on start of - an HBase daemon such as heap size and garbarge collector configs. + an HBase daemon such as heap size and garbage collector configs. You can also set configurations for HBase configuration, log directories, niceness, ssh options, where to locate process pid files, etc. Open the file at diff --git a/src/main/docbkx/developer.xml b/src/main/docbkx/developer.xml index 3475c8dd421..dd469b65d60 100644 --- a/src/main/docbkx/developer.xml +++ b/src/main/docbkx/developer.xml @@ -492,12 +492,13 @@ HBase have a character not usually seen in other projects.
Apache HBase Modules As of 0.96, Apache HBase is split into multiple modules which creates "interesting" rules for -how and where tests are written. If you are writting code for hbase-server, see - for how to write your tests; these tests can spin -up a minicluster and will need to be categorized. For any other module, for example -hbase-common, the tests must be strict unit tests and just test the class -under test - no use of the HBaseTestingUtility or minicluster is allowed (or even possible -given the dependency tree). + how and where tests are written. If you are writing code for + hbase-server, see for + how to write your tests; these tests can spin up a minicluster and will need to be + categorized. For any other module, for example hbase-common, + the tests must be strict unit tests and just test the class under test - no use of + the HBaseTestingUtility or minicluster is allowed (or even possible given the + dependency tree).
Running Tests in other Modules If the module you are developing in has no other dependencies on other HBase modules, then @@ -643,22 +644,22 @@ error will be reported when a non-existent test case is specified.
Running tests faster - -By default, $ mvn test -P runAllTests runs 5 tests in parallel. -It can be increased on a developer's machine. Allowing that you can have 2 -tests in parallel per core, and you need about 2Gb of memory per test (at the -extreme), if you have an 8 core, 24Gb box, you can have 16 tests in parallel. -but the memory available limits it to 12 (24/2), To run all tests with 12 tests -in parallell, do this: -mvn test -P runAllTests -Dsurefire.secondPartThreadCount=12. -To increase the speed, you can as well use a ramdisk. You will need 2Gb of memory -to run all tests. You will also need to delete the files between two test run. -The typical way to configure a ramdisk on Linux is: -$ sudo mkdir /ram2G + By default, $ mvn test -P runAllTests runs 5 tests in parallel. This can be + increased on a developer's machine. Assuming that you can have 2 tests in + parallel per core, and you need about 2Gb of memory per test (at the extreme), + if you have an 8 core, 24Gb box, you can have 16 tests in parallel, but the + memory available limits it to 12 (24/2). To run all tests with 12 tests in + parallel, do this: mvn test -P runAllTests + -Dsurefire.secondPartThreadCount=12. To increase the speed, you + can also use a ramdisk. You will need 2Gb of memory to run all tests. You + will also need to delete the files between two test runs. The typical way to + configure a ramdisk on Linux is: + $ sudo mkdir /ram2G sudo mount -t tmpfs -o size=2048M tmpfs /ram2G -You can then use it to run all HBase tests with the command: -mvn test -P runAllTests -Dsurefire.secondPartThreadCount=12 -Dtest.build.data.basedirectory=/ram2G - + You can then use it to run all HBase tests with the command: mvn test + -P runAllTests -Dsurefire.secondPartThreadCount=12 + -Dtest.build.data.basedirectory=/ram2G +
@@ -818,7 +819,7 @@ This actually runs ALL the integration tests. mvn failsafe:verify The above command basically looks at all the test results (so don't remove the 'target' directory) for test failures and reports the results. -
+
Running a subset of Integration tests This is very similar to how you specify running a subset of unit tests (see above), but use the property it.test instead of test. @@ -970,9 +971,9 @@ pecularity that is probably fixable but we've not spent the time trying to figur Similarly, for 3.0, you would just replace the profile value. Note that Hadoop-3.0.0-SNAPSHOT does not currently have a deployed maven artificat - you will need to build and install your own in your local maven repository if you want to run against this profile. - - In earilier verions of Apache HBase, you can build against older versions of Apache Hadoop, notably, Hadoop 0.22.x and 0.23.x. - If you are running, for example HBase-0.94 and wanted to build against Hadoop 0.23.x, you would run with: + In earlier versions of Apache HBase, you can build against older versions of Apache + Hadoop, notably, Hadoop 0.22.x and 0.23.x. If you are running, for example, + HBase-0.94 and want to build against Hadoop 0.23.x, you would run with: mvn -Dhadoop.profile=22 ...
@@ -1420,8 +1421,7 @@ Bar bar = foo.getBar(); <--- imagine there's an extra space(s) after the Committers do this. See How To Commit in the Apache HBase wiki. - Commiters will also resolve the Jira, typically after the patch passes a build. - + Committers will also resolve the Jira, typically after the patch passes a build.
Committers are responsible for making sure commits do not break the build or tests diff --git a/src/main/docbkx/external_apis.xml b/src/main/docbkx/external_apis.xml index e4c9f23f59c..9e3ed372798 100644 --- a/src/main/docbkx/external_apis.xml +++ b/src/main/docbkx/external_apis.xml @@ -295,7 +295,8 @@ Description: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows - Syntax: ColumnPaginationFilter(‘<limit>’, ‘<offest>’) + Syntax: ColumnPaginationFilter(‘<limit>’, + ‘<offset>’) Example: "ColumnPaginationFilter (3, 5)" diff --git a/src/main/docbkx/ops_mgt.xml b/src/main/docbkx/ops_mgt.xml index aab8928e062..dbd6d17bf37 100644 --- a/src/main/docbkx/ops_mgt.xml +++ b/src/main/docbkx/ops_mgt.xml @@ -36,10 +36,11 @@ Here we list HBase tools for administration, analysis, fixup, and debugging.
Canary -There is a Canary class can help users to canary-test the HBase cluster status, with every column-family for every regions or regionservers granularity. To see the usage, -$ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.tool.Canary -help -Will output -Usage: bin/hbase org.apache.hadoop.hbase.tool.Canary [opts] [table1 [table2]...] | [regionserver1 [regionserver2]..] +There is a Canary class that can help users canary-test the HBase cluster status, at the + granularity of every column family of every region, or of every regionserver. To see the usage, + $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.tool.Canary -help + will output + Usage: bin/hbase org.apache.hadoop.hbase.tool.Canary [opts] [table1 [table2]...] | [regionserver1 [regionserver2]..] where [opts] are: -help Show this help and exit. -regionserver replace the table argument to regionserver, @@ -49,25 +50,46 @@ Will output -e Use region/regionserver as regular expression which means the region/regionserver is regular expression pattern -f <B> stop whole program if first error occurs, default is true - -t <N> timeout for a check, default is 600000 (milisecs) -This tool will return non zero error codes to user for collaborating with other monitoring tools, such as Nagios. -The error code definitions are... -private static final int USAGE_EXIT_CODE = 1; + -t <N> timeout for a check, default is 600000 (milliseconds) + This tool will return non-zero error codes to the user so that it can be used with other + monitoring tools, such as Nagios. The error code definitions are... + private static final int USAGE_EXIT_CODE = 1; private static final int INIT_ERROR_EXIT_CODE = 2; private static final int TIMEOUT_ERROR_EXIT_CODE = 3; private static final int ERROR_EXIT_CODE = 4; -Here are some examples based on the following given case. There are two HTable called test-01 and test-02, they have two column family cf1 and cf2 respectively, and deployed on the 3 regionservers. see following table. 
- - - - RegionServertest-01test-02 - - rs1r1 r2 - rs2r2 - rs3r2 r1 -
-Following are some examples based on the previous given case. -
+ Here are some examples based on the following case. There are two HTables called + test-01 and test-02; each has two column families, cf1 and cf2, and they are deployed on + the 3 regionservers. See the following table. + + + + + + + RegionServer + test-01 + test-02 + + + + + rs1 + r1 + r2 + + + rs2 + r2 + + + + rs3 + r2 + r1 + + + +
Following are some examples based on the previously given case.
Canary test for every column family (store) of every region of every table $ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.tool.Canary diff --git a/src/main/docbkx/performance.xml b/src/main/docbkx/performance.xml index 174baf2b2fb..854e7c9fa1d 100644 --- a/src/main/docbkx/performance.xml +++ b/src/main/docbkx/performance.xml @@ -45,10 +45,9 @@
Network - - Perhaps the most important factor in avoiding network issues degrading Hadoop and HBbase performance is the switching hardware - that is used, decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more). - + Perhaps the most important factor in avoiding network issues degrading Hadoop and HBase + performance is the switching hardware that is used; decisions made early in the scope of the + project can cause major problems when you double or triple the size of your cluster (or more). Important items to consider: @@ -400,7 +399,7 @@ Deferred log flush can be configured on tables via HBase Client: Group Puts by RegionServer In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. - There is a utility HTableUtil currently on TRUNK that does this, but you can either copy that or implement your own verison for + There is a utility HTableUtil currently on TRUNK that does this, but you can either copy that or implement your own version for those still on 0.90.x or earlier.
diff --git a/src/main/docbkx/schema_design.xml b/src/main/docbkx/schema_design.xml index 13ebc2fb40e..482d3d12a4c 100644 --- a/src/main/docbkx/schema_design.xml +++ b/src/main/docbkx/schema_design.xml @@ -141,7 +141,7 @@ admin.enableTable(table); See for more information on HBase stores data internally to see why this is important.
-
Attributes +
Attributes Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase. @@ -335,10 +335,11 @@ public static byte[][] getHexSplits(String startKey, String endKey, int numRegio Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images as long as they can rendered as bytes. - There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); - search the mailling list for conversations on this topic. All rows in HBase conform to the datamodel, and - that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily. - + There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase + would probably be too much to ask); search the mailing list for conversations on this topic. + All rows in HBase conform to the datamodel, and that includes + versioning. Take that into consideration when making your design, as well as block size for + the ColumnFamily.
Counters @@ -396,10 +397,11 @@ public static byte[][] getHexSplits(String startKey, String endKey, int numRegio ... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches. - It should not be a surprise that secondary indexes require additional cluster space and processing. - This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RBDMS products - are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off. - + It should not be a surprise that secondary indexes require additional cluster space and + processing. This is precisely what happens in an RDBMS because the act of creating an + alternate index requires both space and processing cycles to update. RDBMS products are more + advanced in this regard, handling alternative index management out of the box. However, HBase + scales better at larger data volumes, so this is a feature trade-off. Pay attention to when implementing any of these approaches. Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase @@ -765,10 +767,11 @@ reasonable spread in the keyspace, similar options appear: tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.
Rows vs. Versions - A common question is whether one should prefer rows or HBase's built-in-versioning. The context is typically where there are - "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max versions). The - rows-approach would require storing a timstamp in some portion of the rowkey so that they would not overwite with each successive update. - + A common question is whether one should prefer rows or HBase's built-in versioning. The + context is typically where there are "a lot" of versions of a row to be retained (e.g., + where it is significantly above the HBase default of 1 max versions). The rows-approach + would require storing a timestamp in some portion of the rowkey so that they would not + be overwritten with each successive update. Preference: Rows (generally speaking).
diff --git a/src/main/docbkx/shell.xml b/src/main/docbkx/shell.xml index 778545773ed..aaddc8d636e 100644 --- a/src/main/docbkx/shell.xml +++ b/src/main/docbkx/shell.xml @@ -175,18 +175,16 @@ hbase(main):018:0>
<filename>irbrc</filename> - Create an .irbrc file for yourself in your - home directory. Add customizations. A useful one is - command history so commands are save across Shell invocations: - + Create an .irbrc file for yourself in your home + directory. Add customizations. A useful one is command history so commands are saved + across Shell invocations: + $ more .irbrc require 'irb/ext/save-history' IRB.conf[:SAVE_HISTORY] = 100 IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history" - See the ruby documentation of - .irbrc to learn about other possible - confiurations. - + See the ruby documentation of .irbrc + to learn about other possible configurations.
LOG data to timestamp diff --git a/src/main/docbkx/troubleshooting.xml b/src/main/docbkx/troubleshooting.xml index d9b7009a402..40f23d124f5 100644 --- a/src/main/docbkx/troubleshooting.xml +++ b/src/main/docbkx/troubleshooting.xml @@ -620,18 +620,21 @@ Harsh J investigated the issue as part of the mailing list thread
Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing) - -You are likely running into the issue that is described and worked through in -the mail thread HBase, mail # user - Suspected memory leak -and continued over in HBase, mail # dev - FeedbackRe: Suspected memory leak. -A workaround is passing your client-side JVM a reasonable value for -XX:MaxDirectMemorySize. By default, -the MaxDirectMemorySize is equal to your -Xmx max heapsize setting (if -Xmx is set). -Try seting it to something smaller (for example, one user had success setting it to 1g when -they had a client-side heap of 12g). If you set it too small, it will bring on FullGCs so keep -it a bit hefty. You want to make this setting client-side only especially if you are running the new experiemental -server-side off-heap cache since this feature depends on being able to use big direct buffers (You may have to keep -separate client-side and server-side config dirs). - + You are likely running into the issue that is described and worked through in the + mail thread HBase, mail # user - Suspected memory leak and continued over in HBase, mail # dev - FeedbackRe: Suspected memory leak. A workaround is passing + your client-side JVM a reasonable value for -XX:MaxDirectMemorySize. By + default, the MaxDirectMemorySize is equal to your -Xmx max + heapsize setting (if -Xmx is set). Try setting it to something smaller (for + example, one user had success setting it to 1g when they had a client-side heap + of 12g). If you set it too small, it will bring on FullGCs, so keep + it a bit hefty. You want to make this setting client-side only, especially if you are running + the new experimental server-side off-heap cache since this feature depends on being able to + use big direct buffers (You may have to keep separate client-side and server-side config + dirs).
Client Slowdown When Calling Admin Methods (flush, compact, etc.) @@ -728,15 +731,17 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
Browsing HDFS for HBase Objects - Somtimes it will be necessary to explore the HBase objects that exist on HDFS. These objects could include the WALs (Write Ahead Logs), tables, regions, StoreFiles, etc. - The easiest way to do this is with the NameNode web application that runs on port 50070. The NameNode web application will provide links to the all the DataNodes in the cluster so that - they can be browsed seamlessly. + Sometimes it will be necessary to explore the HBase objects that exist on HDFS. + These objects could include the WALs (Write Ahead Logs), tables, regions, StoreFiles, etc. + The easiest way to do this is with the NameNode web application that runs on port 50070. The + NameNode web application will provide links to all the DataNodes in the cluster so that + they can be browsed seamlessly. The HDFS directory structure of HBase tables in the cluster is... /hbase /<Table> (Tables in the cluster) /<Region> (Regions for the table) - /<ColumnFamiy> (ColumnFamilies for the Region for the table) + /<ColumnFamily> (ColumnFamilies for the Region for the table) /<StoreFile> (StoreFiles for the ColumnFamily for the Regions for the table) diff --git a/src/main/docbkx/upgrading.xml b/src/main/docbkx/upgrading.xml index 22324a3491a..7711614d21b 100644 --- a/src/main/docbkx/upgrading.xml +++ b/src/main/docbkx/upgrading.xml @@ -27,9 +27,8 @@ */ --> Upgrading - You cannot skip major verisons upgrading. If you are upgrading from - version 0.90.x to 0.94.x, you must first go from 0.90.x to 0.92.x and then go - from 0.92.x to 0.94.x. + You cannot skip major versions when upgrading. If you are upgrading from version 0.90.x to + 0.94.x, you must first go from 0.90.x to 0.92.x and then go from 0.92.x to 0.94.x. It may be possible to skip across versions -- for example go from 0.92.2 straight to 0.98.0 just following the 0.96.x upgrade instructions -- but we have not tried it so cannot say whether it works or not.