---
id: tutorial-batch-hadoop
title: "Tutorial: Load batch data using Apache Hadoop"
sidebar_label: "Load from Apache Hadoop"
---

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

This tutorial shows you how to load data files into Apache Druid using a remote Hadoop cluster.

For this tutorial, we'll assume that you've already completed the previous
[batch ingestion tutorial](tutorial-batch.md) using Druid's native batch ingestion system and are using the
`auto` single-machine configuration as described in the [quickstart](../operations/single-server.md#druid-auto-start).

## Install Docker

This tutorial requires [Docker](https://docs.docker.com/install/) to be installed on the tutorial machine.

Once the Docker install is complete, please proceed to the next steps in the tutorial.

## Build the Hadoop docker image

For this tutorial, we've provided a Dockerfile for a Hadoop 2.8.5 cluster, which we'll use to run the batch indexing task.

This Dockerfile and related files are located at `quickstart/tutorial/hadoop/docker`.

From the apache-druid-{{DRUIDVERSION}} package root, run the following commands to build a Docker image named "druid-hadoop-demo" with version tag "2.8.5":

```bash
cd quickstart/tutorial/hadoop/docker
docker build -t druid-hadoop-demo:2.8.5 .
```

This will start building the Hadoop image. Once the image build is done, you should see the message `Successfully tagged druid-hadoop-demo:2.8.5` printed to the console.
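
If the build output has scrolled away, one way to confirm that the image is available is to ask Docker for it. The following is a minimal sketch rather than part of the tutorial files: `image_exists` is a hypothetical helper name, and it assumes that `docker image inspect` exits with a non-zero status for an image it does not know.

```bash
# image_exists is a hypothetical helper, not part of the tutorial files.
# It succeeds when the local Docker daemon knows the given image reference.
image_exists() {
  docker image inspect "$1" > /dev/null 2>&1
}

# Example:
#   image_exists druid-hadoop-demo:2.8.5 && echo "image is ready"
```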

## Setup the Hadoop docker cluster

### Create temporary shared directory

We'll need a shared folder between the host and the Hadoop container for transferring some files.

Let's create some folders under `/tmp`. We will use these later when starting the Hadoop container:

```bash
mkdir -p /tmp/shared
mkdir -p /tmp/shared/hadoop_xml
```

### Configure /etc/hosts

On the host machine, add the following entry to `/etc/hosts`:

```
127.0.0.1 druid-hadoop-demo
```
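
Before editing the file, it can help to check whether the entry is already there. The sketch below is one way to do that, assuming a standard whitespace-separated hosts file; `has_hosts_entry` is a hypothetical helper name introduced only for illustration.

```bash
# has_hosts_entry is a hypothetical helper, not part of Druid or Docker.
# It succeeds when the hosts file maps some address to the given hostname,
# ignoring anything after a '#' comment marker.
has_hosts_entry() {
  local hosts_file="$1" host="$2"
  awk -v h="$host" '
    { sub(/#.*/, "") }
    { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
    END { exit !found }
  ' "$hosts_file"
}

# Example:
#   has_hosts_entry /etc/hosts druid-hadoop-demo || echo "entry missing"
```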

### Start the Hadoop container

Once the `/tmp/shared` folder has been created and the `/etc/hosts` entry has been added, run the following command to start the Hadoop container.

```bash
docker run -it -h druid-hadoop-demo --name druid-hadoop-demo -p 2049:2049 -p 2122:2122 -p 8020:8020 -p 8021:8021 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8033:8033 -p 8040:8040 -p 8042:8042 -p 8088:8088 -p 8443:8443 -p 9000:9000 -p 10020:10020 -p 19888:19888 -p 34455:34455 -p 49707:49707 -p 50010:50010 -p 50020:50020 -p 50030:50030 -p 50060:50060 -p 50070:50070 -p 50075:50075 -p 50090:50090 -p 51111:51111 -v /tmp/shared:/shared druid-hadoop-demo:2.8.5 /etc/bootstrap.sh -bash
```

Once the container is started, your terminal will attach to a bash shell running inside the container:

```bash
Starting sshd: [ OK ]
18/07/26 17:27:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [druid-hadoop-demo]
druid-hadoop-demo: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-druid-hadoop-demo.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-druid-hadoop-demo.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-druid-hadoop-demo.out
18/07/26 17:27:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-druid-hadoop-demo.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-druid-hadoop-demo.out
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-druid-hadoop-demo.out
bash-4.1#
```

The `Unable to load native-hadoop library for your platform... using builtin-java classes where applicable` warning messages can be safely ignored.

#### Accessing the Hadoop container shell

To open another shell to the Hadoop container, run the following command:

```
docker exec -it druid-hadoop-demo bash
```

### Copy input data to the Hadoop container

From the apache-druid-{{DRUIDVERSION}} package root on the host, copy the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` sample data to the shared folder:

```bash
cp quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz /tmp/shared/wikiticker-2015-09-12-sampled.json.gz
```

### Setup HDFS directories

In the Hadoop container's shell, run the following commands to set up the HDFS directories needed by this tutorial and copy the input data to HDFS.

```bash
cd /usr/local/hadoop/bin
./hdfs dfs -mkdir /druid
./hdfs dfs -mkdir /druid/segments
./hdfs dfs -mkdir /quickstart
./hdfs dfs -chmod 777 /druid
./hdfs dfs -chmod 777 /druid/segments
./hdfs dfs -chmod 777 /quickstart
./hdfs dfs -chmod -R 777 /tmp
./hdfs dfs -chmod -R 777 /user
./hdfs dfs -put /shared/wikiticker-2015-09-12-sampled.json.gz /quickstart/wikiticker-2015-09-12-sampled.json.gz
```
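
To check that the directories and the input file ended up where the later steps expect them, something like the following sketch could be run in the Hadoop container's shell. It relies on `hdfs dfs -test -e`, which exits with status 0 when a path exists. `hdfs_cmd` and `hdfs_missing` are hypothetical helper names. If the namenode is unreachable, every path is reported as missing.

```bash
# hdfs_cmd wraps the hdfs CLI at the path used in this tutorial's container.
hdfs_cmd() {
  /usr/local/hadoop/bin/hdfs "$@"
}

# hdfs_missing is a hypothetical helper: print each given HDFS path that does not exist.
hdfs_missing() {
  local path
  for path in "$@"; do
    hdfs_cmd dfs -test -e "$path" || echo "$path"
  done
}

# Example (inside the Hadoop container):
#   hdfs_missing /druid/segments /quickstart/wikiticker-2015-09-12-sampled.json.gz
```

No output means every listed path was found.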

If you encounter namenode errors when running these commands, the Hadoop container may not have finished initializing. When this occurs, wait a couple of minutes and retry the commands.
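
The wait-and-retry advice can also be scripted. The sketch below is a generic retry loop, not something shipped with the tutorial: `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary. It uses `./hdfs dfs -ls /` as a cheap probe, on the assumption that the listing only succeeds once the namenode is up.

```bash
# retry is a hypothetical helper: re-run a command until it succeeds
# or the given number of attempts has been used up.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (inside the Hadoop container, from /usr/local/hadoop/bin):
#   retry 10 15 ./hdfs dfs -ls / && echo "HDFS is answering"
```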

## Configure Druid to use Hadoop

Some additional steps are needed to configure the Druid cluster for Hadoop batch indexing.

### Copy Hadoop configuration to Druid classpath

From the Hadoop container's shell, run the following command to copy the Hadoop .xml configuration files to the shared folder:

```bash
cp /usr/local/hadoop/etc/hadoop/*.xml /shared/hadoop_xml
```

From the host machine, run the following commands, replacing {PATH_TO_DRUID} with the path to the Druid package. The target directory sits under the same `auto` configuration that the rest of this tutorial uses.

```bash
mkdir -p {PATH_TO_DRUID}/conf/druid/auto/_common/hadoop-xml
cp /tmp/shared/hadoop_xml/*.xml {PATH_TO_DRUID}/conf/druid/auto/_common/hadoop-xml/
```
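
A quick way to confirm the copy worked is to look for the usual Hadoop configuration files in the target directory. This is only a sketch: `missing_files` is a hypothetical helper, and the file names in the example are the standard Hadoop ones, which we assume the container's configuration directory contains.

```bash
# missing_files is a hypothetical helper: given a directory and a list of
# file names, print the names that are not present in that directory.
missing_files() {
  local dir="$1"
  shift
  local name
  for name in "$@"; do
    [ -f "$dir/$name" ] || echo "$name"
  done
}

# Example, with xml_dir set to the hadoop-xml directory created above:
#   missing_files "$xml_dir" core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml
```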

### Update Druid segment and log storage

In your favorite text editor, open `conf/druid/auto/_common/common.runtime.properties`, and make the following edits:

#### Disable local deep storage and enable HDFS deep storage

```
#
# Deep storage
#

# For local disk (only viable in a cluster if this is a network mount):
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

# For HDFS:
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```

#### Disable local log storage and enable HDFS log storage

```
#
# Indexing service logs
#

# For local disk (only viable in a cluster if this is a network mount):
#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

# For HDFS:
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=/druid/indexing-logs
```
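
Because the same key appears both commented out and active, it is easy to leave the wrong line enabled. The following sketch prints the value that is actually in effect for a key. `active_prop` is a hypothetical helper, and it only understands simple `key=value` lines, not the full Java properties syntax.

```bash
# active_prop is a hypothetical helper: print the uncommented value of a key
# in a properties file. If the key is set more than once, the last one wins.
active_prop() {
  local file="$1" key="$2"
  awk -F= -v k="$key" '
    /^[[:space:]]*#/ { next }
    $1 == k { sub(/^[^=]*=/, ""); v = $0 }
    END { if (v != "") print v }
  ' "$file"
}

# Example (from the Druid package root):
#   conf=conf/druid/auto/_common/common.runtime.properties
#   for key in druid.storage.type druid.storage.storageDirectory \
#              druid.indexer.logs.type druid.indexer.logs.directory; do
#     echo "$key=$(active_prop "$conf" "$key")"
#   done
```

After the edits above, the four keys should report `hdfs`, `/druid/segments`, `hdfs`, and `/druid/indexing-logs`.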

### Restart Druid cluster

Once the Hadoop .xml files have been copied to the Druid cluster and the segment/log storage configuration has been updated to use HDFS, the Druid cluster needs to be restarted for the new configurations to take effect.

If the cluster is still running, press CTRL-C to terminate the `bin/start-druid` script, and re-run it to bring the Druid services back up.

## Load batch data

We've included a sample of Wikipedia edits from September 12, 2015 to get you started.

To load this data into Druid, you can submit an *ingestion task* pointing to the file. We've included
a task that loads the `wikiticker-2015-09-12-sampled.json.gz` file included in the archive.

Let's submit the `wikipedia-index-hadoop.json` task:

```bash
bin/post-index-task --file quickstart/tutorial/wikipedia-index-hadoop.json --url http://localhost:8081
```
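
The `bin/post-index-task` script wraps an HTTP request to the task API. As a rough sketch of the equivalent request, assuming `curl` is installed and that the task endpoint is `/druid/indexer/v1/task` on the URL passed above, the spec could also be posted directly. `task_endpoint` and `submit_task` are hypothetical helper names. Unlike the script, this sketch does not wait for the task to finish.

```bash
# task_endpoint builds the task submission URL from a base URL,
# dropping one trailing slash if present.
task_endpoint() {
  echo "${1%/}/druid/indexer/v1/task"
}

# submit_task is a hypothetical wrapper around curl: it posts a task spec
# file as JSON and prints the server response.
submit_task() {
  local spec="$1" base_url="$2"
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d @"$spec" "$(task_endpoint "$base_url")"
}

# Example (from the Druid package root):
#   submit_task quickstart/tutorial/wikipedia-index-hadoop.json http://localhost:8081
```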

## Querying your data

After the data load is complete, please follow the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.

## Cleanup

This tutorial is only meant to be used together with the [query tutorial](../tutorials/tutorial-query.md).

If you wish to go through any of the other tutorials, you will need to:
* Shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the Druid package.
* Revert the deep storage and task log storage config back to local types in `conf/druid/auto/_common/common.runtime.properties`.
* Restart the cluster.

This is necessary because the other ingestion tutorials will write to the same "wikipedia" datasource, and later tutorials expect the cluster to use local deep storage.

Example reverted config:

```
#
# Deep storage
#

# For local disk (only viable in a cluster if this is a network mount):
druid.storage.type=local
druid.storage.storageDirectory=var/druid/segments

# For HDFS:
#druid.storage.type=hdfs
#druid.storage.storageDirectory=/druid/segments

#
# Indexing service logs
#

# For local disk (only viable in a cluster if this is a network mount):
druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs

# For HDFS:
#druid.indexer.logs.type=hdfs
#druid.indexer.logs.directory=/druid/indexing-logs
```
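
For the first cleanup step, removing the contents of `var` while keeping the directory itself can be sketched as below. `clear_dir` is a hypothetical helper, and it assumes a `find` that supports `-mindepth` and `-delete` (both GNU and BSD versions do). Run it only after the cluster has been shut down, because the deletion cannot be undone.

```bash
# clear_dir is a hypothetical helper: delete everything inside a directory
# while keeping the directory itself. It refuses empty or non-directory arguments.
clear_dir() {
  local dir="$1"
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "clear_dir: not a directory: '$dir'" >&2
    return 1
  fi
  # -mindepth 1 skips the directory itself; -delete removes contents before their parents.
  find "$dir" -mindepth 1 -delete
}

# Example (from the Druid package root, after shutting down the cluster):
#   clear_dir var
```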

## Further reading

For more information on loading batch data with Hadoop, please see [the Hadoop batch ingestion documentation](../ingestion/hadoop.md).