Merge pull request #2815 from slaunay/documentation/hadoop-classpath-issue-fix-with-configuration

Doc for mapreduce.job.user.classpath.first=true
2016-04-12 10:51:51 -07:00 · 2016-04-12 10:51:51 -07:00 · 37d2ab623e
parent bf4aa965fb
commit 37d2ab623e
2 changed files with 35 additions and 2 deletions
--- a/docs/content/ingestion/batch-ingestion.md
+++ b/docs/content/ingestion/batch-ingestion.md
@@ -161,10 +161,31 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
 |ignoreInvalidRows|Boolean|Ignore rows found to have problems.|no (default == false)|
 |combineText|Boolean|Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files.|no (default == false)|
 |useCombiner|Boolean|Use Hadoop combiner to merge rows at mapper if possible.|no (default == false)|
-|jobProperties|Object|A map of properties to add to the Hadoop job configuration.|no (default == null)|
+|jobProperties|Object|A map of properties to add to the Hadoop job configuration; see below for details.|no (default == null)|
 |buildV9Directly|Boolean|Build v9 index directly instead of building v8 index and converting it to v9 format.|no (default = false)|
 |numBackgroundPersistThreads|Integer|The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and cpu usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1.|no (default == 0)|

+#### jobProperties field of TuningConfig
+
+```json
+   "tuningConfig" : {
+     "type": "hadoop",
+     "jobProperties": {
+       "<hadoop-property-a>": "<value-a>",
+       "<hadoop-property-b>": "<value-b>"
+     }
+   }
+```
+
+The following properties can be used to tune how the MapReduce job is configured by overriding the default Hadoop/YARN/MapReduce configuration:
+
+|Property name|Type|Description|
+|-------------|----|-----------|
+|mapreduce.job.user.classpath.first|String|When set to `"true"`, use the Druid classpath instead of the Hadoop classpath for common libraries like [Jackson](https://github.com/FasterXML/jackson). Required with the [Cloudera Hadoop distribution](../operations/other-hadoop.html).|
+|...|String|See [Mapred configuration](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) for more configuration parameters.|
+
+**Please note that `mapreduce.job.user.classpath.first` is an expert feature and should not be used without a deep understanding of Hadoop and the Java class loading mechanism.**
+
 ### Partitioning specification

 Segments are always partitioned based on timestamp (according to the granularitySpec) and may be further partitioned in
--- a/docs/content/operations/other-hadoop.md
+++ b/docs/content/operations/other-hadoop.md
@@ -91,7 +91,19 @@ If you are still having problems, include all relevant hadoop jars at the beginn

 #### CDH

-Members of the community have reported dependency conflicts between the version of Jackson used in CDH and Druid. Currently, our best workaround is to edit Druid's pom.xml dependencies to match the version of Jackson in your Hadoop version and recompile Druid.
+Members of the community have reported dependency conflicts between the version of Jackson used in CDH and Druid when running a MapReduce job, with errors such as:
+```
+java.lang.VerifyError: class com.fasterxml.jackson.datatype.guava.deser.HostAndPortDeserializer overrides final method deserialize.(Lcom/fasterxml/jackson/core/JsonParser;Lcom/fasterxml/jackson/databind/DeserializationContext;)Ljava/lang/Object;
+```
+
+To use the Cloudera distribution of Hadoop, you must configure MapReduce to
+[favor the Druid classpath over the Hadoop classpath](https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html)
+(i.e. use the Jackson version provided with Druid).
+
+This can be achieved in one of the following ways:
+- adding `"mapreduce.job.user.classpath.first": "true"` to the `jobProperties` property of the `tuningConfig` of your indexing task (see the Job properties section of the [Batch Ingestion using the HadoopDruidIndexer](../ingestion/batch-ingestion.html) page).
+- configuring the Druid Middle Manager to add the following property when creating a new Peon: `druid.indexer.runner.javaOpts=... -Dhadoop.mapreduce.job.user.classpath.first=true`
+- editing Druid's pom.xml dependencies to match the version of Jackson used by your Hadoop version and recompiling Druid.

 For more about building Druid, please see [Building Druid](../development/build.html).
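
For illustration only (this example is not part of the change itself), the first option listed for CDH could be sketched as the fragment below. It combines the `tuningConfig` template and the property name shown in the diff; the enclosing braces are added only to make the fragment valid JSON, and the rest of the indexing task spec is assumed to exist around it.

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.user.classpath.first": "true"
    }
  }
}
```

Note that the value is the string `"true"` rather than a JSON boolean, matching the `String` type given for the property in the table.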