---
layout: doc_page
---
What to Do When You Have a Firewall
-----------------------------------
When you are behind a firewall, if the IRC Wikipedia channels that feed realtime data into Druid are not accessible, then there is nothing you can do. If the IRC channels are accessible but downloading the GeoLite DB from MaxMind is blocked, you can work around this by making the GeoLite DB dependency available offline, as described below.
## Making the Wikipedia Example GeoLite DB Dependency Available Offline
1. Download GeoLite2 City DB from http://dev.maxmind.com/geoip/geoip2/geolite2/
2. Copy and extract the DB to *`java.io.tmpdir`*`/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb`; e.g. `/tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb`
**Note**: if `java.io.tmpdir` resolves to the `/tmp` directory, the file may be deleted on reboot (depending on the machine's reboot policy), in which case you will have to recreate it there afterwards.
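For example, assuming you have already downloaded the compressed GeoLite2 City database from the page above (the exact file name and download link may vary) and that `java.io.tmpdir` resolves to `/tmp`, the copy step might look like this:

```bash
# Assumes GeoLite2-City.mmdb.gz has already been downloaded from
# the MaxMind page above; the actual file name may differ.
gunzip GeoLite2-City.mmdb.gz

# Place the extracted DB where the Wikipedia IRC decoder expects it
# (here java.io.tmpdir is assumed to resolve to /tmp).
cp GeoLite2-City.mmdb /tmp/io.druid.segment.realtime.firehose.WikipediaIrcDecoder.GeoLite2-City.mmdb
```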
## Loading the Data into Druid directly from Kafka
As an alternative to reading the data from the IRC channels, which can be difficult from behind a firewall, we will use Kafka to stream the data to Druid. To do so, we will need to:
1. Configure the Wikipedia example to read streaming data from Kafka
2. Set up and configure Kafka
#### Wikipedia Example Configuration
1. In your favorite editor, open the file `druid-<version>/examples/wikipedia/wikipedia_realtime.spec`
2. Back up the file, if necessary, then replace its contents with the following:
```json
[
  {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "page",
              "language",
              "user",
              "unpatrolled",
              "newPage",
              "robot",
              "anonymous",
              "namespace",
              "continent",
              "country",
              "region",
              "city"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "doubleSum",
          "name": "added",
          "fieldName": "added"
        },
        {
          "type": "doubleSum",
          "name": "deleted",
          "fieldName": "deleted"
        },
        {
          "type": "doubleSum",
          "name": "delta",
          "fieldName": "delta"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE"
      }
    },
    "ioConfig": {
      "type": "realtime",
      "firehose": {
        "type": "kafka-0.8",
        "consumerProps": {
          "zookeeper.connect": "localhost:2181",
          "zookeeper.connection.timeout.ms": "15000",
          "zookeeper.session.timeout.ms": "15000",
          "zookeeper.sync.time.ms": "5000",
          "group.id": "druid-example",
          "fetch.message.max.bytes": "1048586",
          "auto.offset.reset": "largest",
          "auto.commit.enable": "false"
        },
        "feed": "wikipedia"
      },
      "plumber": {
        "type": "realtime"
      }
    },
    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m",
      "windowPeriod": "PT10m",
      "basePersistDirectory": "/tmp/realtime/basePersist",
      "rejectionPolicy": {
        "type": "messageTime"
      }
    }
  }
]
```
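In this spec, the `kafka-0.8` firehose's `feed` property names the Kafka topic to consume from, so it must match the `wikipedia` topic created in the Kafka setup below, and `zookeeper.connect` must point at the ZooKeeper instance your Kafka broker registers with.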
3. Refer to the [Running Example Scripts](#running-example-scripts) section to start the example Druid Realtime node by issuing the following from within your Druid directory:
```bash
./run_example_server.sh
```
#### Kafka Setup and Configuration
1. Download Kafka
For this tutorial we will [download Kafka 0.8.2.1](https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz)
```bash
tar -xzf kafka_2.10-0.8.2.1.tgz
cd kafka_2.10-0.8.2.1
```
2. Start Kafka
**First, launch ZooKeeper** (refer to the [Set up Zookeeper](#set-up-zookeeper) section for details, or use the bundled script shown after this step), then start the Kafka server (in a separate console):
```bash
./bin/kafka-server-start.sh config/server.properties
```
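If you do not have a standalone ZooKeeper installation, the Kafka distribution bundles a single-node ZooKeeper convenience script and configuration that is sufficient for this tutorial; run it before starting the broker:

```bash
# Start the single-node ZooKeeper bundled with Kafka
# (run from the Kafka directory, in its own console).
./bin/zookeeper-server-start.sh config/zookeeper.properties
```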
3. Create a topic named `wikipedia`
```bash
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
```
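You can verify that the topic was created by listing the topics registered in ZooKeeper:

```bash
# List all topics known to this ZooKeeper instance; "wikipedia"
# should appear in the output.
./bin/kafka-topics.sh --list --zookeeper localhost:2181
```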
4. Launch a console producer for the topic `wikipedia`
```bash
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia
```
5. Copy and paste the following data into the terminal where we launched the Kafka console producer in the previous step:
```json
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
```
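Optionally, you can confirm that the messages reached the topic by running a console consumer in a separate terminal; it replays the topic from the beginning, so the five rows above should be echoed back:

```bash
# Read the wikipedia topic from offset zero to verify ingestion.
./bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia --from-beginning
```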
#### Finally
Now that data has been fed into Druid, refer to the [Running Example Scripts](#running-example-scripts) section to query the real-time node by issuing the following from within the Druid directory:
```bash
./run_example_client.sh
```
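Alternatively, you can POST a query to the realtime node directly. The sketch below issues a minimal `timeBoundary` query against Druid's standard query endpoint; the port is an assumption, so check the output of `run_example_server.sh` for the actual host and port the node binds to:

```bash
# Minimal timeBoundary query against the example realtime node.
# The port (8084) is an assumption; verify it from the server output.
curl -X POST "http://localhost:8084/druid/v2/?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"queryType": "timeBoundary", "dataSource": "wikipedia"}'
```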
The [Querying Druid](../querying/querying.md) section also has further querying examples.