---
layout: doc_page
---

# What to Do When You Have a Firewall

When you are behind a firewall, the IRC Wikipedia channels that feed realtime data into Druid may not be accessible; in that case, you can stream the data into Druid through Kafka instead, as described below. If the IRC channels are accessible but downloading the GeoLite DB from MaxMind is blocked, you can work around this by making the GeoLite DB dependency available offline; see the next section.

## Making the Wikipedia Example GeoLite DB Dependency Available Offline

  1. Download the GeoLite2 City DB from http://dev.maxmind.com/geoip/geoip2/geolite2/

  2. Extract the DB and copy it to `java.io.tmpdir`/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb; e.g. `/tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb`

    Note: if `java.io.tmpdir` resolves to the `/tmp` directory, the file may be removed when the machine reboots (depending on the machine's reboot policy), and you will have to recreate it there afterwards. A scripted version of these two steps is sketched below.
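
A minimal shell sketch of the two steps above. The download URL and archive name are assumptions based on MaxMind's historical layout, so check the page linked in step 1 for the current location:

    # Assumed download location; verify on the MaxMind page linked above
    curl -LO http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz
    gunzip GeoLite2-City.mmdb.gz

    # Place the DB where the Wikipedia example looks for it (here java.io.tmpdir resolves to /tmp)
    cp GeoLite2-City.mmdb /tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb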

## Loading the Data into Druid directly from Kafka

Reading the data from the IRC channels is difficult from behind a firewall, so as an alternative we will use Kafka to stream the data to Druid. To do so, we will need to:

  1. Configure the Wikipedia example to read streaming data from Kafka
  2. Set up and configure Kafka

### Wikipedia Example Configuration

  1. In your favorite editor, open the file `druid-<version>/examples/wikipedia/wikipedia_realtime.spec`

  2. Back up the file, if necessary, then replace the file content with the following:

    [
       {
           "dataSchema": {
               "dataSource": "wikipedia",
               "parser": {
                   "type": "string",
                   "parseSpec": {
                       "format": "json",
                       "timestampSpec": {
                           "column": "timestamp",
                           "format": "auto"
                       },
                       "dimensionsSpec": {
                           "dimensions": [
                               "page",
                               "language",
                               "user",
                               "unpatrolled",
                               "newPage",
                               "robot",
                               "anonymous",
                               "namespace",
                               "continent",
                               "country",
                               "region",
                               "city"
                           ],
                           "dimensionExclusions": [],
                           "spatialDimensions": []
                       }
                   }
               },
               "metricsSpec": [
                   {
                       "type": "count",
                       "name": "count"
                   },
                   {
                       "type": "doubleSum",
                       "name": "added",
                       "fieldName": "added"
                   },
                   {
                       "type": "doubleSum",
                       "name": "deleted",
                       "fieldName": "deleted"
                   },
                   {
                       "type": "doubleSum",
                       "name": "delta",
                       "fieldName": "delta"
                   }
               ],
               "granularitySpec": {
                   "type": "uniform",
                   "segmentGranularity": "DAY",
                   "queryGranularity": "NONE"
               }
           },
           "ioConfig": {
               "type": "realtime",
               "firehose": {
                   "type": "kafka-0.8",
                   "consumerProps": {
                       "zookeeper.connect": "localhost:2181",
                       "zookeeper.connection.timeout.ms": "15000",
                       "zookeeper.session.timeout.ms": "15000",
                       "zookeeper.sync.time.ms": "5000",
                       "group.id": "druid-example",
                       "fetch.message.max.bytes": "1048586",
                       "auto.offset.reset": "largest",
                       "auto.commit.enable": "false"
                   },
                   "feed": "wikipedia"
               },
               "plumber": {
                   "type": "realtime"
               }
           },
           "tuningConfig": {
               "type": "realtime",
               "maxRowsInMemory": 500000,
               "intermediatePersistPeriod": "PT10m",
               "windowPeriod": "PT10m",
               "basePersistDirectory": "/tmp/realtime/basePersist",
               "rejectionPolicy": {
                   "type": "messageTime"
               }
           }
       }
    ]
    
  3. Refer to the Running Example Scripts section to start the example Druid Realtime node by issuing the following from within your Druid directory:

    ./run_example_server.sh
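
If the realtime node fails to start after the edit, a common cause is malformed JSON in the spec; you can sanity-check the file with Python's built-in `json.tool` module:

    python -m json.tool < examples/wikipedia/wikipedia_realtime.spec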
    

### Kafka Setup and Configuration

  1. Download Kafka

    For this tutorial we will [download Kafka 0.8.2.1](https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz) and extract it:

    tar -xzf kafka_2.10-0.8.2.1.tgz
    cd kafka_2.10-0.8.2.1
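
    If fetching through the mirror chooser is awkward from the command line, old releases are also kept on the Apache archive at a stable URL (assumed still available; verify before relying on it), which you can fetch before running the commands above:

    wget https://archive.apache.org/dist/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz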
    
  2. Start Kafka

    First, launch ZooKeeper (refer to the Set up ZooKeeper section for details), then start the Kafka server (in a separate console):

    ./bin/kafka-server-start.sh config/server.properties
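
    If you do not already have a ZooKeeper instance running, Kafka ships with a convenience script that starts a single-node ZooKeeper using the bundled configuration; run it in its own console before starting the Kafka server:

    ./bin/zookeeper-server-start.sh config/zookeeper.properties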
    
  3. Create a topic named `wikipedia`

    ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
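
    To verify that the topic was created, list the topics registered in ZooKeeper:

    ./bin/kafka-topics.sh --list --zookeeper localhost:2181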
    
  4. Launch a console producer for the topic `wikipedia`

    ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia
    
  5. Copy and paste the following data into the terminal where you launched the Kafka console producer in the previous step. The `messageTime` rejection policy in the spec measures the ingestion window against event timestamps rather than the wall clock, which is what allows these 2013-era events to be accepted:

    {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
    {"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
    {"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
    {"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
    {"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
    

### Finally

Now that data has been fed into Druid, refer to the Running Example Scripts section and query the realtime node by issuing the following from within the Druid directory:

./run_example_client.sh

The Querying Druid section contains further query examples.
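
If you would rather query the realtime node directly than through the example script, the sketch below sends a minimal `timeBoundary` query with `curl`. It assumes the example realtime node listens on port 8083, as in the other example tutorials; adjust the port if your configuration differs:

    curl -X POST 'http://localhost:8083/druid/v2/?pretty' \
      -H 'content-type: application/json' \
      -d '{"queryType":"timeBoundary","dataSource":"wikipedia"}'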