Jackson conflict workaround in Hadoop ingestion & Parquet extension coordinate update (#2817)

This commit is contained in:
du00cs 2016-04-14 05:20:33 +08:00 committed by Fangjin Yang
parent 0c4a42bb6f
commit 639d0630b8
2 changed files with 105 additions and 2 deletions


@ -11,6 +11,7 @@ that contains Hadoop jars.
You can also use the `pull-deps` tool to download other Hadoop dependencies you want.
See [pull-deps](../operations/pull-deps.html) for a complete example.
Another way is to use a minimal `pom.xml` that contains only your version of `hadoop-client`, and then run `mvn dependency:copy-dependencies` to get the required libraries.
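Such a minimal `pom.xml` might look like the following sketch (the group/artifact coordinates and the Hadoop version are illustrative, not prescribed by Druid):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- Placeholder coordinates; this pom exists only to resolve Hadoop jars. -->
  <groupId>com.example</groupId>
  <artifactId>hadoop-deps</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <!-- Replace with the Hadoop version of your cluster. -->
      <version>2.7.3</version>
    </dependency>
  </dependencies>
</project>
```

Running `mvn dependency:copy-dependencies` against this pom places the resolved jars under `target/dependency` by default.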
## Load Hadoop dependencies
@ -105,8 +106,16 @@ This can be achieved by either:
- configuring the Druid Middle Manager to add the following property when creating a new Peon: `druid.indexer.runner.javaOpts=... -Dhadoop.mapreduce.job.user.classpath.first=true`
- editing Druid's pom.xml dependencies to match the version of Jackson in your Hadoop version and recompiling Druid
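The Middle Manager option above can be set in its `runtime.properties`; a minimal sketch (the memory settings and other JVM opts are illustrative):

```properties
# Make Hadoop MapReduce prefer the job's (i.e. Druid's) classes over its own.
druid.indexer.runner.javaOpts=-server -Xmx3g -Dhadoop.mapreduce.job.user.classpath.first=true
```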
Members of the community have reported dependency conflicts between the version of Jackson used in CDH and Druid.
**Workaround - 1**
Currently, our best workaround is to edit Druid's pom.xml dependencies to match the version of Jackson in your Hadoop version and recompile Druid.
For more about building Druid, please see [Building Druid](../development/build.html).
**Workaround - 2**
Another workaround is to build a custom fat jar of Druid using [sbt](http://www.scala-sbt.org/) that manually excludes all the conflicting Jackson dependencies, and then put this fat jar on the classpath of the command that starts the Overlord indexing service. To do this, follow these steps.
(1) Download and install sbt.
@ -134,6 +143,97 @@ addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
(10) Include the fat jar in the classpath when you start the indexing service. Make sure you've removed `lib/*` from your classpath, because the fat jar now includes everything you need.
**Workaround - 3**
If sbt is not your choice, you can also use the `maven-shade-plugin` to make a fat jar: relocating all Jackson packages resolves the conflict as well. This way, Druid will not be affected by the Jackson libraries bundled with Hadoop. Please follow the steps below:
(1) Add all the extensions you need to `services/pom.xml`, for example:
```xml
<dependency>
<groupId>io.druid.extensions</groupId>
<artifactId>druid-avro-extensions</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>io.druid.extensions.contrib</groupId>
<artifactId>druid-parquet-extensions</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>io.druid.extensions</groupId>
<artifactId>druid-hdfs-storage</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>io.druid.extensions</groupId>
<artifactId>mysql-metadata-storage</artifactId>
<version>${project.parent.version}</version>
</dependency>
```
(2) Shade the Jackson packages and assemble a fat jar:
```xml
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<outputFile>
${project.build.directory}/${project.artifactId}-${project.version}-selfcontained.jar
</outputFile>
<relocations>
<relocation>
<pattern>com.fasterxml.jackson</pattern>
<shadedPattern>shade.com.fasterxml.jackson</shadedPattern>
</relocation>
</relocations>
<artifactSet>
<includes>
<include>*:*</include>
</includes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
```
After running `mvn install` from the project root, copy out `services/target/xxxxx-selfcontained.jar` for further use.
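The build-and-copy step above might look like the following sketch (the exact jar name depends on the version you build, and the target directory is illustrative):

```bash
# From the Druid project root; skipping tests speeds up the build.
mvn install -DskipTests
# The shaded jar is written to services/target/ with a -selfcontained suffix.
cp services/target/*-selfcontained.jar /path/to/deploy/
```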
(3) Run the Hadoop indexer as shown below (posting an indexing task is not possible at this point). `lib` is no longer needed on the classpath. Because the Hadoop indexer is a standalone tool, you don't have to replace the jars of your running services:
```bash
java -Xmx32m \
-Dfile.encoding=UTF-8 -Duser.timezone=UTC \
-classpath config/hadoop:config/overlord:config/_common:$SELF_CONTAINED_JAR:$HADOOP_DISTRIBUTION/etc/hadoop \
-Djava.security.krb5.conf=$KRB5 \
io.druid.cli.Main index hadoop \
$config_path
```
## Working with Hadoop 1.x and older
We recommend recompiling Druid with your particular version of Hadoop by changing the dependencies in Druid's pom.xml files. Make sure to also either override the default `hadoopDependencyCoordinates` in the code or pass your Hadoop version in as part of indexing.
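A Hadoop index task can pin the version through `hadoopDependencyCoordinates`; a minimal sketch (the coordinate shown is illustrative, and the rest of the task `spec` is omitted):

```json
{
  "type": "index_hadoop",
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:1.2.1"]
}
```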


@ -2,6 +2,11 @@
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<groupId>io.druid.extensions.contrib</groupId>
<artifactId>druid-parquet-extensions</artifactId>
<name>druid-parquet-extensions</name>
<description>druid-parquet-extensions</description>
<parent>
<artifactId>druid</artifactId>
<groupId>io.druid</groupId>
@ -10,8 +15,6 @@
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>druid-parquet-extensions</artifactId>
<dependencies>
<dependency>
<groupId>io.druid.extensions</groupId>