---
layout: doc_page
---
# What to Do When You Have a Firewall
When you are behind a firewall, neither the Maven Druid dependencies nor the IRC Wikipedia channels that feed realtime data into Druid will be accessible. To work around these two challenges, you will need to:
- Make the Maven Druid dependencies available offline
- Make the Wikipedia example GeoLite DB dependency available offline
## Making Maven Druid Dependencies Available Offline
1. Extract Druid to a machine that has internet access; e.g. `/Users/foo/druid-<version>`
2. Create a repository directory to download the dependencies to; e.g. `/Users/foo/druid-<version>/repo`
3. Create the property `druid.extensions.localRepository=<path to repo directory>` in the `<Druid Directory>/config/_common/common.runtime.properties` file; e.g. `druid.extensions.localRepository=/Users/foo/druid-<version>/repo`
4. From within the Druid directory, run the `pull-deps` command to download all Druid dependencies to the repository specified in the `common.runtime.properties` file:

   ```bash
   java -classpath "config/_common:lib/*" io.druid.cli.Main tools pull-deps
   ```

5. Once all dependencies have been downloaded successfully, replicate the `repo` directory to the machine behind the firewall; e.g. `/opt/druid-<version>/repo` (one way to do this is sketched after this list)
6. On the machine behind the firewall, create the property `druid.extensions.localRepository=<path to repo directory>` in the `<Druid Directory>/config/_common/common.runtime.properties` file; e.g. `druid.extensions.localRepository=/opt/druid-<version>/repo`
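How you replicate the `repo` directory in step 5 depends on your environment. As a minimal sketch, assuming the machine behind the firewall is reachable over SSH as `user@firewalled-host` (a hypothetical host name) and that Druid lives under `/opt/druid-<version>` there:

```bash
# Copy the pre-downloaded Maven repository to the machine behind the firewall.
# user@firewalled-host and the destination path are assumptions; adjust for your environment.
rsync -av /Users/foo/druid-<version>/repo/ user@firewalled-host:/opt/druid-<version>/repo/

# On the firewalled machine, point Druid at the local repository (step 6)
# by appending the property to the common runtime properties file.
echo 'druid.extensions.localRepository=/opt/druid-<version>/repo' \
  >> /opt/druid-<version>/config/_common/common.runtime.properties
```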
## Making the Wikipedia Example GeoLite DB Dependency Available Offline
1. Download the GeoLite2 City DB from http://dev.maxmind.com/geoip/geoip2/geolite2/
2. Copy and extract the DB to `<java.io.tmpdir>/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb`; e.g. `/tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb` (a sketch of this step follows this list)

   **Note**: Depending on the machine's reboot policy, if `java.io.tmpdir` resolves to the `/tmp` directory, you may have to recreate this file after a machine reboot.
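The following is a minimal sketch of step 2, assuming the database was downloaded on a machine with internet access and transferred to the firewalled machine as `GeoLite2-City.mmdb.gz` (a hypothetical file name):

```bash
# Extract the GeoLite2 City database; the archive name is an assumption,
# use whatever file you obtained from the MaxMind site.
gunzip GeoLite2-City.mmdb.gz

# Place it where the Wikipedia IRC decoder expects to find it,
# assuming java.io.tmpdir resolves to /tmp on this machine.
cp GeoLite2-City.mmdb /tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb
```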
## Loading the Data into Druid directly from Kafka
As an alternative to reading the data from the IRC channels, which is difficult from behind a firewall, we will use Kafka to stream the data to Druid. To do so, we will need to:
- Configure the Wikipedia example to read streaming data from Kafka
- Set up and configure Kafka
### Wikipedia Example Configuration
1. In your favorite editor, open the file `druid-<version>/examples/wikipedia/wikipedia_realtime.spec`
2. Back up the file, if necessary, then replace the file content with the following:

   ```json
   [
     {
       "dataSchema": {
         "dataSource": "wikipedia",
         "parser": {
           "type": "string",
           "parseSpec": {
             "format": "json",
             "timestampSpec": { "column": "timestamp", "format": "auto" },
             "dimensionsSpec": {
               "dimensions": [
                 "page", "language", "user", "unpatrolled", "newPage", "robot",
                 "anonymous", "namespace", "continent", "country", "region", "city"
               ],
               "dimensionExclusions": [],
               "spatialDimensions": []
             }
           }
         },
         "metricsSpec": [
           { "type": "count", "name": "count" },
           { "type": "doubleSum", "name": "added", "fieldName": "added" },
           { "type": "doubleSum", "name": "deleted", "fieldName": "deleted" },
           { "type": "doubleSum", "name": "delta", "fieldName": "delta" }
         ],
         "granularitySpec": {
           "type": "uniform",
           "segmentGranularity": "DAY",
           "queryGranularity": "NONE"
         }
       },
       "ioConfig": {
         "type": "realtime",
         "firehose": {
           "type": "kafka-0.8",
           "consumerProps": {
             "zookeeper.connect": "localhost:2181",
             "zookeeper.connection.timeout.ms": "15000",
             "zookeeper.session.timeout.ms": "15000",
             "zookeeper.sync.time.ms": "5000",
             "group.id": "druid-example",
             "fetch.message.max.bytes": "1048586",
             "auto.offset.reset": "largest",
             "auto.commit.enable": "false"
           },
           "feed": "wikipedia"
         },
         "plumber": { "type": "realtime" }
       },
       "tuningConfig": {
         "type": "realtime",
         "maxRowsInMemory": 500000,
         "intermediatePersistPeriod": "PT10m",
         "windowPeriod": "PT10m",
         "basePersistDirectory": "/tmp/realtime/basePersist",
         "rejectionPolicy": { "type": "messageTime" }
       }
     }
   ]
   ```

3. Refer to the Running Example Scripts section to start the example Druid Realtime node by issuing the following from within your Druid directory (a quick check that the node is up is sketched after this list):

   ```bash
   ./run_example_server.sh
   ```
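As a quick check that the node came up, Druid nodes expose a `/status` HTTP endpoint. A minimal sketch, assuming the example Realtime node listens on port 8084 (the port is an assumption based on the default example configuration):

```bash
# A JSON response with the Druid version indicates the Realtime node is up.
# Port 8084 is an assumption; check the example configuration if it differs.
curl http://localhost:8084/status
```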
### Kafka Setup and Configuration
1. Download Kafka

   For this tutorial we will [download Kafka 0.8.2.1](https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz):

   ```bash
   tar -xzf kafka_2.10-0.8.2.1.tgz
   cd kafka_2.10-0.8.2.1
   ```
2. Start Kafka

   First, launch ZooKeeper (refer to the Set up Zookeeper section for details), then start the Kafka server in a separate console:

   ```bash
   ./bin/kafka-server-start.sh config/server.properties
   ```
3. Create a topic named `wikipedia`:

   ```bash
   ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
   ```
4. Launch a console producer for the topic `wikipedia`:

   ```bash
   ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia
   ```
5. Copy and paste the following data into the terminal where we launched the Kafka console producer in the previous step (an alternative that pipes the events from a file is sketched after this list):

   ```json
   {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
   {"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
   {"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
   {"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
   {"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
   ```
## Finally
Now that data has been fed into Druid, refer to the Running Example Scripts section to query the real-time node by issuing the following from within the Druid directory:

```bash
./run_example_client.sh
```
The Querying Druid section also has further querying examples.
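For reference, queries can also be issued to the node directly over HTTP. The following is a minimal sketch of a timeseries query against the `wikipedia` datasource defined above, assuming the Realtime node answers queries at `http://localhost:8084/druid/v2/` (the port and path are assumptions based on the default example configuration):

```bash
# Sum the count and added metrics over the day covered by the sample events.
curl -X POST 'http://localhost:8084/druid/v2/?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
        "queryType": "timeseries",
        "dataSource": "wikipedia",
        "granularity": "all",
        "intervals": ["2013-08-31/2013-09-01"],
        "aggregations": [
          { "type": "longSum", "name": "edit_count", "fieldName": "count" },
          { "type": "doubleSum", "name": "chars_added", "fieldName": "added" }
        ]
      }'
```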