---
layout: doc_page
---

# What to Do When You Have a Firewall

When you are behind a firewall, the IRC Wikipedia channels that feed realtime data into Druid may not be accessible; in that case, you can stream the data into Druid through Kafka instead, as described below. If the IRC channels are accessible but downloading the GeoLite DB from MaxMind is blocked, you can work around this by making the GeoLite DB dependency available offline; see the next section.

## Making the Wikipedia Example GeoLite DB Dependency Available Offline

  1. Download the GeoLite2 City DB from http://dev.maxmind.com/geoip/geoip2/geolite2/

  2. Extract the DB and copy it to `java.io.tmpdir`/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb; e.g. `/tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb`

    Note: if `java.io.tmpdir` resolves to the `/tmp` directory, the file may be removed when the machine reboots (depending on the machine's reboot policy), and you will have to recreate it there afterwards. A scripted version of these two steps is sketched below.
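
A minimal shell sketch of the two steps above. The download URL and archive name are assumptions based on MaxMind's historical layout, so check the page linked in step 1 for the current location:

    # Assumed download location; verify on the MaxMind page linked above
    curl -LO http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz
    gunzip GeoLite2-City.mmdb.gz

    # Place the DB where the Wikipedia example looks for it (here java.io.tmpdir resolves to /tmp)
    cp GeoLite2-City.mmdb /tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb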

## Loading the Data into Druid directly from Kafka

Reading the data from the IRC channels is difficult from behind a firewall, so as an alternative we will use Kafka to stream the data to Druid. To do so, we will need to:

  1. Configure the Wikipedia example to read streaming data from Kafka
  2. Set up and configure Kafka

### Wikipedia Example Configuration

  1. In your favorite editor, open the file `druid-<version>/examples/wikipedia/wikipedia_realtime.spec`

  2. Back up the file, if necessary, then replace the file content with the following:

    [
       {
           "dataSchema": {
               "dataSource": "wikipedia",
               "parser": {
                   "type": "string",
                   "parseSpec": {
                       "format": "json",
                       "timestampSpec": {
                           "column": "timestamp",
                           "format": "auto"
                       },
                       "dimensionsSpec": {
                           "dimensions": [
                               "page",
                               "language",
                               "user",
                               "unpatrolled",
                               "newPage",
                               "robot",
                               "anonymous",
                               "namespace",
                               "continent",
                               "country",
                               "region",
                               "city"
                           ],
                           "dimensionExclusions": [],
                           "spatialDimensions": []
                       }
                   }
               },
               "metricsSpec": [
                   {
                       "type": "count",
                       "name": "count"
                   },
                   {
                       "type": "doubleSum",
                       "name": "added",
                       "fieldName": "added"
                   },
                   {
                       "type": "doubleSum",
                       "name": "deleted",
                       "fieldName": "deleted"
                   },
                   {
                       "type": "doubleSum",
                       "name": "delta",
                       "fieldName": "delta"
                   }
               ],
               "granularitySpec": {
                   "type": "uniform",
                   "segmentGranularity": "DAY",
                   "queryGranularity": "NONE"
               }
           },
           "ioConfig": {
               "type": "realtime",
               "firehose": {
                   "type": "kafka-0.8",
                   "consumerProps": {
                       "zookeeper.connect": "localhost:2181",
                       "zookeeper.connection.timeout.ms": "15000",
                       "zookeeper.session.timeout.ms": "15000",
                       "zookeeper.sync.time.ms": "5000",
                       "group.id": "druid-example",
                       "fetch.message.max.bytes": "1048586",
                       "auto.offset.reset": "largest",
                       "auto.commit.enable": "false"
                   },
                   "feed": "wikipedia"
               },
               "plumber": {
                   "type": "realtime"
               }
           },
           "tuningConfig": {
               "type": "realtime",
               "maxRowsInMemory": 500000,
               "intermediatePersistPeriod": "PT10m",
               "windowPeriod": "PT10m",
               "basePersistDirectory": "/tmp/realtime/basePersist",
               "rejectionPolicy": {
                   "type": "messageTime"
               }
           }
       }
    ]
    
  3. Refer to the Running Example Scripts section to start the example Druid Realtime node by issuing the following from within your Druid directory:

    ./run_example_server.sh
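
If the realtime node fails to start after the edit, a common cause is malformed JSON in the spec; you can sanity-check the file with Python's built-in `json.tool` module:

    python -m json.tool < examples/wikipedia/wikipedia_realtime.spec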
    

### Kafka Setup and Configuration

  1. Download Kafka

    For this tutorial we will [download Kafka 0.8.2.1](https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz) and extract it:

    tar -xzf kafka_2.10-0.8.2.1.tgz
    cd kafka_2.10-0.8.2.1
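
    If fetching through the mirror chooser is awkward from the command line, old releases are also kept on the Apache archive at a stable URL (assumed still available; verify before relying on it), which you can fetch before running the commands above:

    wget https://archive.apache.org/dist/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz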
    
  2. Start Kafka

    First, launch ZooKeeper (refer to the Set up ZooKeeper section for details), then start the Kafka server (in a separate console):

    ./bin/kafka-server-start.sh config/server.properties
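
    If you do not already have a ZooKeeper instance running, Kafka ships with a convenience script that starts a single-node ZooKeeper using the bundled configuration; run it in its own console before starting the Kafka server:

    ./bin/zookeeper-server-start.sh config/zookeeper.properties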
    
  3. Create a topic named `wikipedia`

    ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
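
    To verify that the topic was created, list the topics registered in ZooKeeper:

    ./bin/kafka-topics.sh --list --zookeeper localhost:2181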
    
  4. Launch a console producer for the topic `wikipedia`

    ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia
    
  5. Copy and paste the following data into the terminal where you launched the Kafka console producer in the previous step. The `messageTime` rejection policy in the spec measures the ingestion window against event timestamps rather than the wall clock, which is what allows these 2013-era events to be accepted:

    {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
    {"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
    {"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
    {"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
    {"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
    

### Finally

Now that data has been fed into Druid, refer to the Running Example Scripts section and query the realtime node by issuing the following from within the Druid directory:

./run_example_client.sh

The Querying Druid section contains further query examples.
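
If you would rather query the realtime node directly than through the example script, the sketch below sends a minimal `timeBoundary` query with `curl`. It assumes the example realtime node listens on port 8083, as in the other example tutorials; adjust the port if your configuration differs:

    curl -X POST 'http://localhost:8083/druid/v2/?pretty' \
      -H 'content-type: application/json' \
      -d '{"queryType":"timeBoundary","dataSource":"wikipedia"}'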