---
layout: doc_page
---
# What to Do When You Have a Firewall
When you are behind a firewall, if the IRC Wikipedia channels that feed realtime data into Druid are not accessible, then there is no workaround. If the IRC channels are accessible, but downloading the GeoLite DB from MaxMind is blocked, you can work around this by making the GeoLite DB dependency available offline, as described below.
## Making the Wikipedia Example GeoLite DB Dependency Available Offline
- Download the GeoLite2 City DB from http://dev.maxmind.com/geoip/geoip2/geolite2/
- Copy and extract the DB to `java.io.tmpdir`/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb; e.g. `/tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb`

**Note**: depending on the machine's reboot policy, if `java.io.tmpdir` resolves to the `/tmp` directory, you may have to create this file again in the `/tmp` directory after a machine reboot.
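A minimal sketch of the copy step, assuming `java.io.tmpdir` resolves to `/tmp` and that the archive (assumed here to be named `GeoLite2-City.mmdb.gz`) has already been downloaded from the MaxMind page above on a machine with internet access:

```bash
# Assumption: GeoLite2-City.mmdb.gz was fetched separately from the
# MaxMind page above, and java.io.tmpdir resolves to /tmp on this machine.
gunzip GeoLite2-City.mmdb.gz
cp GeoLite2-City.mmdb /tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb
```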
## Loading the Data into Druid directly from Kafka
As an alternative to reading the data from the IRC channels, which is challenging from behind a firewall, we will use Kafka to stream the data to Druid. To do so, we will need to:
- Configure the Wikipedia example to read streaming data from Kafka
- Set up and configure Kafka
### Wikipedia Example Configuration
- In your favorite editor, open the file `druid-<version>/examples/wikipedia/wikipedia_realtime.spec`
- Back up the file, if necessary, then replace the file content with the following:
[ { "dataSchema": { "dataSource": "wikipedia", "parser": { "type": "string", "parseSpec": { "format": "json", "timestampSpec": { "column": "timestamp", "format": "auto" }, "dimensionsSpec": { "dimensions": [ "page", "language", "user", "unpatrolled", "newPage", "robot", "anonymous", "namespace", "continent", "country", "region", "city" ], "dimensionExclusions": [], "spatialDimensions": [] } } }, "metricsSpec": [ { "type": "count", "name": "count" }, { "type": "doubleSum", "name": "added", "fieldName": "added" }, { "type": "doubleSum", "name": "deleted", "fieldName": "deleted" }, { "type": "doubleSum", "name": "delta", "fieldName": "delta" } ], "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "NONE" } }, "ioConfig": { "type": "realtime", "firehose": { "type": "kafka-0.8", "consumerProps": { "zookeeper.connect": "localhost:2181", "zookeeper.connection.timeout.ms": "15000", "zookeeper.session.timeout.ms": "15000", "zookeeper.sync.time.ms": "5000", "group.id": "druid-example", "fetch.message.max.bytes": "1048586", "auto.offset.reset": "largest", "auto.commit.enable": "false" }, "feed": "wikipedia" }, "plumber": { "type": "realtime" } }, "tuningConfig": { "type": "realtime", "maxRowsInMemory": 500000, "intermediatePersistPeriod": "PT10m", "windowPeriod": "PT10m", "basePersistDirectory": "/tmp/realtime/basePersist", "rejectionPolicy": { "type": "messageTime" } } } ]
- Refer to the Running Example Scripts section to start the example Druid Realtime node by issuing the following from within your Druid directory:

```bash
./run_example_server.sh
```
### Kafka Setup and Configuration
- Download Kafka

For this tutorial we will [download Kafka 0.8.2.1](https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz)

```bash
tar -xzf kafka_2.10-0.8.2.1.tgz
cd kafka_2.10-0.8.2.1
```
- Start Kafka

First, launch ZooKeeper (refer to the Set up Zookeeper section for details), then start the Kafka server in a separate console:

```bash
./bin/kafka-server-start.sh config/server.properties
```
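If you do not already have a ZooKeeper instance running, note that Kafka ships with a convenience script and config for a local single-node ZooKeeper, sketched below; see the Set up Zookeeper section for a proper setup.

```bash
# Starts a local single-node ZooKeeper using the config bundled with Kafka.
./bin/zookeeper-server-start.sh config/zookeeper.properties
```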
- Create a topic named `wikipedia`:

```bash
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
```
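You can confirm the topic was created by listing all topics registered in ZooKeeper:

```bash
# Should print "wikipedia" among the known topics.
./bin/kafka-topics.sh --list --zookeeper localhost:2181
```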
- Launch a console producer for the topic `wikipedia`:

```bash
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia
```
- Copy and paste the following data into the terminal where we launched the Kafka console producer in the previous step:

```json
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
```
## Finally

Now that data has been fed into Druid, refer to the Running Example Scripts section to query the realtime node by issuing the following from within the Druid directory:

```bash
./run_example_client.sh
```
The Querying Druid section also has further querying examples.
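If you prefer to query the node directly rather than through the script, a minimal sketch using `curl` is shown below. The port 8083 is an assumption and may differ in your setup; check your realtime node's configuration. The `timeBoundary` query simply returns the earliest and latest event timestamps in the `wikipedia` datasource.

```bash
# Direct query sketch: the port 8083 is an assumption; adjust it to
# match the port your realtime node is actually listening on.
curl -X POST 'http://localhost:8083/druid/v2/?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"queryType": "timeBoundary", "dataSource": "wikipedia"}'
```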