Update Kafka loading docs to use the streaming data loader (#8544)
* fix redirects
* remove useless page
* fix Single server reference configurations formatting
* update batch data loading
* update Kafka docs
* fix typos and tests
* add more links
* fix spelling
|
@ -676,7 +676,7 @@ These Coordinator static configurations can be defined in the `coordinator/runti
|
|||
|--------|---------------|-----------|-------|
|
||||
|`druid.serverview.type`|batch or http|Segment discovery method to use. "http" enables discovering segments using HTTP instead of zookeeper.|batch|
|
||||
|`druid.coordinator.loadqueuepeon.type`|curator or http|Whether to use "http" or "curator" implementation to assign segment loads/drops to historical|curator|
|
||||
|`druid.coordinator.segment.awaitInitializationOnStart`|true or false|Whether the the Coordinator will wait for its view of segments to fully initialize before starting up. If set to 'true', the Coordinator's HTTP server will not start up, and the Coordinator will not announce itself as available, until the server view is initialized.|true|
|
||||
|`druid.coordinator.segment.awaitInitializationOnStart`|true or false|Whether the Coordinator will wait for its view of segments to fully initialize before starting up. If set to 'true', the Coordinator's HTTP server will not start up, and the Coordinator will not announce itself as available, until the server view is initialized.|true|
|
||||
|
||||
###### Additional config when "http" loadqueuepeon is used
|
||||
|Property|Description|Default|
|
||||
|
@ -1456,7 +1456,7 @@ The Druid SQL server is configured through the following properties on the Broke
|
|||
|`druid.sql.avatica.maxStatementsPerConnection`|Maximum number of simultaneous open statements per Avatica client connection.|1|
|
||||
|`druid.sql.avatica.connectionIdleTimeout`|Avatica client connection idle timeout.|PT5M|
|
||||
|`druid.sql.http.enable`|Whether to enable JSON over HTTP querying at `/druid/v2/sql/`.|true|
|
||||
|`druid.sql.planner.awaitInitializationOnStart`|Boolean|Whether the the Broker will wait for its SQL metadata view to fully initialize before starting up. If set to 'true', the Broker's HTTP server will not start up, and the Broker will not announce itself as available, until the server view is initialized. See also `druid.broker.segment.awaitInitializationOnStart`, a related setting.|true|
|
||||
|`druid.sql.planner.awaitInitializationOnStart`|Boolean|Whether the Broker will wait for its SQL metadata view to fully initialize before starting up. If set to 'true', the Broker's HTTP server will not start up, and the Broker will not announce itself as available, until the server view is initialized. See also `druid.broker.segment.awaitInitializationOnStart`, a related setting.|true|
|
||||
|`druid.sql.planner.maxQueryCount`|Maximum number of queries to issue, including nested queries. Set to 1 to disable sub-queries, or set to 0 for unlimited.|8|
|
||||
|`druid.sql.planner.maxSemiJoinRowsInMemory`|Maximum number of rows to keep in memory for executing two-stage semi-join queries like `SELECT * FROM Employee WHERE DeptName IN (SELECT DeptName FROM Dept)`.|100000|
|
||||
|`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
|
||||
|
@ -1491,7 +1491,7 @@ See [cache configuration](#cache-configuration) for how to configure cache setti
|
|||
|`druid.serverview.type`|batch or http|Segment discovery method to use. "http" enables discovering segments using HTTP instead of zookeeper.|batch|
|
||||
|`druid.broker.segment.watchedTiers`|List of strings|The Broker watches segment announcements from processes serving segments to build a cache of which process is serving which segments. This configuration restricts the Broker to segments served from a whitelist of tiers. By default, the Broker considers all tiers. This can be used to partition your dataSources into specific Historical tiers and configure Brokers in partitions so that they are only queryable for specific dataSources.|none|
|
||||
|`druid.broker.segment.watchedDataSources`|List of strings|The Broker watches segment announcements from processes serving segments to build a cache of which process is serving which segments. This configuration restricts the Broker to segments served from a whitelist of dataSources. By default, the Broker considers all dataSources. This can be used to configure Brokers in partitions so that they are only queryable for specific dataSources.|none|
|
||||
|`druid.broker.segment.awaitInitializationOnStart`|Boolean|Whether the the Broker will wait for its view of segments to fully initialize before starting up. If set to 'true', the Broker's HTTP server will not start up, and the Broker will not announce itself as available, until the server view is initialized. See also `druid.sql.planner.awaitInitializationOnStart`, a related setting.|true|
|
||||
|`druid.broker.segment.awaitInitializationOnStart`|Boolean|Whether the Broker will wait for its view of segments to fully initialize before starting up. If set to 'true', the Broker's HTTP server will not start up, and the Broker will not announce itself as available, until the server view is initialized. See also `druid.sql.planner.awaitInitializationOnStart`, a related setting.|true|
|
||||
|
||||
## Cache Configuration
|
||||
|
||||
|
|
|
@ -1,39 +0,0 @@
|
|||
---
|
||||
id: integrating-druid-with-other-technologies
|
||||
title: "Integrating Apache Druid with other technologies"
|
||||
sidebar_label: "Integrating with other technologies"
|
||||
---
|
||||
|
||||
<!--
|
||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||
~ or more contributor license agreements. See the NOTICE file
|
||||
~ distributed with this work for additional information
|
||||
~ regarding copyright ownership. The ASF licenses this file
|
||||
~ to you under the Apache License, Version 2.0 (the
|
||||
~ "License"); you may not use this file except in compliance
|
||||
~ with the License. You may obtain a copy of the License at
|
||||
~
|
||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~
|
||||
~ Unless required by applicable law or agreed to in writing,
|
||||
~ software distributed under the License is distributed on an
|
||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
~ KIND, either express or implied. See the License for the
|
||||
~ specific language governing permissions and limitations
|
||||
~ under the License.
|
||||
-->
|
||||
|
||||
|
||||
This page discusses how we can integrate Druid with other technologies.
|
||||
|
||||
## Integrating with Open Source Streaming Technologies
|
||||
|
||||
Event streams can be stored in a distributed message bus such as Kafka and further processed via a distributed stream
|
||||
processor system such as Storm, Samza, or Spark Streaming. Data processed by the stream processor can feed into Druid using
|
||||
the [Tranquility](https://github.com/druid-io/tranquility) library.
|
||||
|
||||
<img src="../assets/druid-production.png" width="800"/>
|
||||
|
||||
## Integrating with SQL-on-Hadoop Technologies
|
||||
|
||||
Druid should theoretically integrate well with SQL-on-Hadoop technologies such as Apache Drill, Spark SQL, Presto, Impala, and Hive.
|
|
@ -143,7 +143,7 @@ java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log
|
|||
|
||||
In the example command above:
|
||||
|
||||
- `lib` is the the Druid lib directory
|
||||
- `lib` is the Druid lib directory
|
||||
- `extensions` is the Druid extensions directory
|
||||
- `/tmp/csv` is the output directory. Please make sure that this directory exists.
|
||||
|
||||
|
|
|
@ -61,7 +61,7 @@ Druid provides a `metadata-init` tool for creating Druid's metadata tables. Afte
|
|||
|
||||
In the example commands below:
|
||||
|
||||
- `lib` is the the Druid lib directory
|
||||
- `lib` is the Druid lib directory
|
||||
- `extensions` is the Druid extensions directory
|
||||
- `base` corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.
|
||||
- The `--connectURI` parameter corresponds to the value of `druid.metadata.storage.connector.connectURI`.
|
||||
|
|
|
@ -44,35 +44,35 @@ The example configurations run the Druid Coordinator and Overlord together in a
|
|||
|
||||
While example configurations are provided for very large single machines, at higher scales we recommend running Druid in a [clustered deployment](../tutorials/cluster.md), for fault-tolerance and reduced resource contention.
|
||||
|
||||
## Single Server Reference Configurations
|
||||
## Single server reference configurations
|
||||
|
||||
Nano-Quickstart: 1 CPU, 4GB RAM
|
||||
------------
|
||||
Launch command: `bin/start-nano-quickstart`
|
||||
Configuration directory: `conf/druid/single-server/nano-quickstart`
|
||||
### Nano-Quickstart: 1 CPU, 4GB RAM
|
||||
|
||||
Micro-Quickstart: 4 CPU, 16GB RAM
|
||||
------------
|
||||
Launch command: `bin/start-micro-quickstart`
|
||||
Configuration directory: `conf/druid/single-server/micro-quickstart`
|
||||
- Launch command: `bin/start-nano-quickstart`
|
||||
- Configuration directory: `conf/druid/single-server/nano-quickstart`
|
||||
|
||||
Small: 8 CPU, 64GB RAM (~i3.2xlarge)
|
||||
------------
|
||||
Launch command: `bin/start-small`
|
||||
Configuration directory: `conf/druid/single-server/small`
|
||||
### Micro-Quickstart: 4 CPU, 16GB RAM
|
||||
|
||||
Medium: 16 CPU, 128GB RAM (~i3.4xlarge)
|
||||
------------
|
||||
Launch command: `bin/start-medium`
|
||||
Configuration directory: `conf/druid/single-server/medium`
|
||||
- Launch command: `bin/start-micro-quickstart`
|
||||
- Configuration directory: `conf/druid/single-server/micro-quickstart`
|
||||
|
||||
Large: 32 CPU, 256GB RAM (~i3.8xlarge)
|
||||
------------
|
||||
Launch command: `bin/start-large`
|
||||
Configuration directory: `conf/druid/single-server/large`
|
||||
### Small: 8 CPU, 64GB RAM (~i3.2xlarge)
|
||||
|
||||
X-Large: 64 CPU, 512GB RAM (~i3.16xlarge)
|
||||
------------
|
||||
Launch command: `bin/start-xlarge`
|
||||
Configuration directory: `conf/druid/single-server/xlarge`
|
||||
- Launch command: `bin/start-small`
|
||||
- Configuration directory: `conf/druid/single-server/small`
|
||||
|
||||
### Medium: 16 CPU, 128GB RAM (~i3.4xlarge)
|
||||
|
||||
- Launch command: `bin/start-medium`
|
||||
- Configuration directory: `conf/druid/single-server/medium`
|
||||
|
||||
### Large: 32 CPU, 256GB RAM (~i3.8xlarge)
|
||||
|
||||
- Launch command: `bin/start-large`
|
||||
- Configuration directory: `conf/druid/single-server/large`
|
||||
|
||||
### X-Large: 64 CPU, 512GB RAM (~i3.16xlarge)
|
||||
|
||||
- Launch command: `bin/start-xlarge`
|
||||
- Configuration directory: `conf/druid/single-server/xlarge`
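
As a rough illustration of how the reference configurations above line up, the following Python sketch tabulates the six profiles and picks the largest one whose listed CPU and RAM figures fit a given machine. The selection rule and the helper names (`pick_profile`, `config_dir`) are assumptions made for this example, not part of Druid; only the profile names, sizes, launch commands, and directory layout come from the list above.

```python
# Profiles taken from the list above: (name, CPUs, RAM in GB, launch command).
PROFILES = [
    ("nano-quickstart", 1, 4, "bin/start-nano-quickstart"),
    ("micro-quickstart", 4, 16, "bin/start-micro-quickstart"),
    ("small", 8, 64, "bin/start-small"),
    ("medium", 16, 128, "bin/start-medium"),
    ("large", 32, 256, "bin/start-large"),
    ("xlarge", 64, 512, "bin/start-xlarge"),
]


def pick_profile(cpus, ram_gb):
    """Return the largest profile that fits the machine, or None.

    Hypothetical rule: a profile "fits" when the machine has at least
    the listed CPU count and RAM.
    """
    best = None
    for profile in PROFILES:  # PROFILES is ordered from smallest to largest
        _, need_cpus, need_ram, _ = profile
        if need_cpus <= cpus and need_ram <= ram_gb:
            best = profile
    return best


def config_dir(name):
    """Configuration directory for a profile, following the layout above."""
    return f"conf/druid/single-server/{name}"
```

For a machine with 8 CPUs and 64GB of RAM, `pick_profile(8, 64)` selects the `small` profile, whose launch command is `bin/start-small`.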
|
||||
|
||||
|
|
|
@ -117,16 +117,16 @@ All persistent state such as the cluster metadata store and segments for the ser
|
|||
Later on, if you'd like to stop the services, press CTRL-C to exit the `bin/start-micro-quickstart` script, which will terminate the Druid processes.
|
||||
|
||||
Once the cluster has started, you can navigate to [http://localhost:8888](http://localhost:8888).
|
||||
The [Druid router process](../design/router.md), which serves the Druid console, resides at this address.
|
||||
The [Druid router process](../design/router.md), which serves the [Druid console](../operations/druid-console.md), resides at this address.
|
||||
|
||||
![Druid console](../assets/tutorial-quickstart-01.png "Druid console")
|
||||
|
||||
It takes a few seconds for all the Druid processes to fully start up. If you open the console immediately after starting the services, you may see some errors that you can safely ignore.
|
||||
|
||||
|
||||
## Loading Data
|
||||
## Loading data
|
||||
|
||||
### Tutorial Dataset
|
||||
### Tutorial dataset
|
||||
|
||||
For the following data loading tutorials, we have included a sample data file containing Wikipedia page edit events that occurred on 2015-09-12.
|
||||
|
||||
|
|
|
@ -43,80 +43,87 @@ We've included a sample of Wikipedia edits from September 12, 2015 to get you st
|
|||
## Loading data with the data loader
|
||||
|
||||
Navigate to [localhost:8888](http://localhost:8888) and click `Load data` in the console header.
|
||||
Select `Local disk`.
|
||||
|
||||
![Data loader init](../assets/tutorial-batch-data-loader-01.png "Data loader init")
|
||||
|
||||
Enter the value of `quickstart/tutorial/` as the base directory and `wikiticker-2015-09-12-sampled.json.gz` as a filter.
|
||||
The separation of base directory and [wildcard file filter](https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) is there if you need to ingest data from multiple files.
|
||||
|
||||
Click `Preview` and make sure that the the data you are seeing is correct.
|
||||
Select `Local disk` and click `Connect data`.
|
||||
|
||||
![Data loader sample](../assets/tutorial-batch-data-loader-02.png "Data loader sample")
|
||||
|
||||
Enter `quickstart/tutorial/` as the base directory and `wikiticker-2015-09-12-sampled.json.gz` as a filter.
|
||||
The separation of base directory and [wildcard file filter](https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) is there if you need to ingest data from multiple files.
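
To get a feel for how a wildcard filter narrows a base directory to the files you want, here is a small Python sketch. It uses Python's `fnmatch` module as a stand-in for the Commons IO `WildcardFileFilter`; both understand `*` and `?`, but they are not identical, so treat this as an approximation. The second and third file names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_files(names, pattern):
    """Return the file names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Pretend listing of the base directory; only the first name is from the tutorial.
candidates = [
    "wikiticker-2015-09-12-sampled.json.gz",
    "wikiticker-2015-09-13-sampled.json.gz",  # hypothetical second day
    "README.md",  # hypothetical unrelated file
]

# The exact file name used in this tutorial matches a single file.
exact = select_files(candidates, "wikiticker-2015-09-12-sampled.json.gz")

# A wildcard would pick up every daily file in the directory.
wild = select_files(candidates, "wikiticker-*.json.gz")
```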
|
||||
|
||||
Click `Preview` and make sure that the data you are seeing is correct.
|
||||
|
||||
Once the data is located, you can click `Next: Parse data` to go to the next step.
|
||||
|
||||
![Data loader parse data](../assets/tutorial-batch-data-loader-03.png "Data loader parse data")
|
||||
|
||||
The data loader will try to automatically determine the correct parser for the data.
|
||||
In this case it will successfully determine `json`.
|
||||
Feel free to play around with different parser options to get a preview of how Druid will parse your data.
|
||||
|
||||
![Data loader parse data](../assets/tutorial-batch-data-loader-03.png "Data loader parse data")
|
||||
|
||||
With the `json` parser selected, click `Next: Parse time` to get to the step centered around determining your primary timestamp column.
|
||||
|
||||
![Data loader parse time](../assets/tutorial-batch-data-loader-04.png "Data loader parse time")
|
||||
|
||||
Druid's architecture requires a primary timestamp column (internally stored in a column called `__time`).
|
||||
If you do not have a timestamp in your data, select `Constant value`.
|
||||
In our example, the data loader will determine that the `time` column in our raw data is the only candidate that can be used as the primary time column.
|
||||
|
||||
![Data loader parse time](../assets/tutorial-batch-data-loader-04.png "Data loader parse time")
|
||||
|
||||
Click `Next: ...` twice to go past the `Transform` and `Filter` steps.
|
||||
You do not need to enter anything in these steps because applying ingestion-time transforms and filters is out of scope for this tutorial.
|
||||
|
||||
In the `Configure schema` step, you can configure which dimensions (and metrics) will be ingested into Druid.
|
||||
This is exactly how the data will appear in Druid once it is ingested.
|
||||
Since our dataset is very small, go ahead and turn off `Rollup` by clicking on the switch and confirming the change.
|
||||
|
||||
![Data loader schema](../assets/tutorial-batch-data-loader-05.png "Data loader schema")
|
||||
|
||||
In the `Configure schema` step, you can configure which [dimensions](../ingestion/index.md#dimensions) and [metrics](../ingestion/index.md#metrics) will be ingested into Druid.
|
||||
This is exactly how the data will appear in Druid once it is ingested.
|
||||
Since our dataset is very small, go ahead and turn off [`Rollup`](../ingestion/index.md#rollup) by clicking on the switch and confirming the change.
|
||||
|
||||
Once you are satisfied with the schema, click `Next` to go to the `Partition` step where you can fine tune how the data will be partitioned into segments.
|
||||
Here you can adjust how the data will be split up into segments in Druid.
|
||||
Since this is a small dataset, there are no adjustments that need to be made in this step.
|
||||
|
||||
![Data loader partition](../assets/tutorial-batch-data-loader-06.png "Data loader partition")
|
||||
|
||||
Clicking past the `Tune` step, we get to the publish step, which is where we can specify the datasource name in Druid.
|
||||
Let's name this datasource `wikipedia`.
|
||||
Here, you can adjust how the data will be split up into segments in Druid.
|
||||
Since this is a small dataset, there are no adjustments that need to be made in this step.
|
||||
|
||||
Click past the `Tune` step to get to the publish step.
|
||||
|
||||
![Data loader publish](../assets/tutorial-batch-data-loader-07.png "Data loader publish")
|
||||
|
||||
The `Publish` step is where we can specify the datasource name in Druid.
|
||||
Let's name this datasource `wikipedia`.
|
||||
Finally, click `Next` to review your spec.
|
||||
|
||||
![Data loader spec](../assets/tutorial-batch-data-loader-08.png "Data loader spec")
|
||||
|
||||
This is the spec you have constructed.
|
||||
Feel free to go back and make changes in previous steps to see how changes will update the spec.
|
||||
Similarly, you can also edit the spec directly and see it reflected in the previous steps.
|
||||
|
||||
![Data loader spec](../assets/tutorial-batch-data-loader-08.png "Data loader spec")
|
||||
|
||||
Once you are satisfied with the spec, click `Submit` and an ingestion task will be created.
|
||||
|
||||
You will be taken to the task view with the focus on the newly created task.
|
||||
|
||||
![Tasks view](../assets/tutorial-batch-data-loader-09.png "Tasks view")
|
||||
|
||||
In the tasks view, you can click `Refresh` a couple of times until your ingestion task (hopefully) succeeds.
|
||||
You will be taken to the task view with the focus on the newly created task.
|
||||
The task view is set to auto-refresh; wait until your task succeeds.
|
||||
|
||||
When a task succeeds, it means that it built one or more segments that will now be picked up by the data servers.
|
||||
|
||||
Navigate to the `Datasources` view and click refresh until your datasource (`wikipedia`) appears.
|
||||
This can take a few seconds as the segments are being loaded.
|
||||
Navigate to the `Datasources` view from the header.
|
||||
|
||||
![Datasource view](../assets/tutorial-batch-data-loader-10.png "Datasource view")
|
||||
|
||||
Wait until your datasource (`wikipedia`) appears.
|
||||
This can take a few seconds as the segments are being loaded.
|
||||
|
||||
A datasource is queryable once you see a green (fully available) circle.
|
||||
At this point, you can go to the `Query` view to run SQL queries against the datasource.
|
||||
|
||||
Since this is a small dataset, you can simply run a `SELECT * FROM wikipedia` query to see your results.
|
||||
|
||||
![Query view](../assets/tutorial-batch-data-loader-11.png "Query view")
|
||||
|
||||
Run a `SELECT * FROM "wikipedia"` query to see your results.
|
||||
|
||||
Check out the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
|
||||
|
||||
|
||||
|
|
|
@ -56,15 +56,124 @@ Run this command to create a Kafka topic called *wikipedia*, to which we'll send
|
|||
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
|
||||
```
|
||||
|
||||
## Start Druid Kafka ingestion
|
||||
## Load data into Kafka
|
||||
|
||||
Let's launch a producer for our topic and send some data!
|
||||
|
||||
In your Druid directory, run the following command:
|
||||
|
||||
```bash
|
||||
cd quickstart/tutorial
|
||||
gunzip -c wikiticker-2015-09-12-sampled.json.gz > wikiticker-2015-09-12-sampled.json
|
||||
```
|
||||
|
||||
In your Kafka directory, run the following command, where {PATH_TO_DRUID} is replaced by the path to the Druid directory:
|
||||
|
||||
```bash
|
||||
export KAFKA_OPTS="-Dfile.encoding=UTF-8"
|
||||
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
|
||||
```
|
||||
|
||||
The previous command posted sample events to the *wikipedia* Kafka topic.
|
||||
Now we will use Druid's Kafka indexing service to ingest messages from our newly created topic.
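
The console producer reads its input line by line and sends each line as one Kafka message, so the decompressed file needs to contain one JSON event per line. The following Python sketch mirrors the `gunzip -c` step and counts the events in the result. The function names are made up for this example, and it assumes the sample file is newline-delimited JSON.

```python
import gzip
import json
import shutil


def gunzip_to_file(src_path, dst_path):
    """Decompress a gzip file, similar to `gunzip -c src > dst`."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def count_json_events(path):
    """Count non-empty lines, parsing each one to confirm it is valid JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed event
                count += 1
    return count
```

Running `count_json_events` on the decompressed file before starting the producer is a cheap way to check how many messages to expect in the topic.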
|
||||
|
||||
## Loading data with the data loader
|
||||
|
||||
Navigate to [localhost:8888](http://localhost:8888) and click `Load data` in the console header.
|
||||
|
||||
![Data loader init](../assets/tutorial-kafka-data-loader-01.png "Data loader init")
|
||||
|
||||
Select `Apache Kafka` and click `Connect data`.
|
||||
|
||||
![Data loader sample](../assets/tutorial-kafka-data-loader-02.png "Data loader sample")
|
||||
|
||||
Enter `localhost:9092` as the bootstrap server and `wikipedia` as the topic.
|
||||
|
||||
Click `Preview` and make sure that the data you are seeing is correct.
|
||||
|
||||
Once the data is located, you can click `Next: Parse data` to go to the next step.
|
||||
|
||||
![Data loader parse data](../assets/tutorial-kafka-data-loader-03.png "Data loader parse data")
|
||||
|
||||
The data loader will try to automatically determine the correct parser for the data.
|
||||
In this case it will successfully determine `json`.
|
||||
Feel free to play around with different parser options to get a preview of how Druid will parse your data.
|
||||
|
||||
With the `json` parser selected, click `Next: Parse time` to get to the step centered around determining your primary timestamp column.
|
||||
|
||||
![Data loader parse time](../assets/tutorial-kafka-data-loader-04.png "Data loader parse time")
|
||||
|
||||
Druid's architecture requires a primary timestamp column (internally stored in a column called `__time`).
|
||||
If you do not have a timestamp in your data, select `Constant value`.
|
||||
In our example, the data loader will determine that the `time` column in our raw data is the only candidate that can be used as the primary time column.
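
As an illustration of what parsing the primary timestamp involves, this Python sketch converts an ISO 8601 string into milliseconds since the Unix epoch, which is how Druid is generally understood to represent `__time` internally. The function name and the input values are hypothetical; the exact format of the `time` column in the sample data is an assumption here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_millis(value):
    """Convert a UTC ISO 8601 string such as 2015-09-12T00:00:00.000Z to epoch millis."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


# Midnight UTC at the start of the day covered by the tutorial dataset.
example_millis = iso_to_epoch_millis("2015-09-12T00:00:00.000Z")
```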
|
||||
|
||||
Click `Next: ...` twice to go past the `Transform` and `Filter` steps.
|
||||
You do not need to enter anything in these steps because applying ingestion-time transforms and filters is out of scope for this tutorial.
|
||||
|
||||
![Data loader schema](../assets/tutorial-kafka-data-loader-05.png "Data loader schema")
|
||||
|
||||
In the `Configure schema` step, you can configure which [dimensions](../ingestion/index.md#dimensions) and [metrics](../ingestion/index.md#metrics) will be ingested into Druid.
|
||||
This is exactly how the data will appear in Druid once it is ingested.
|
||||
Since our dataset is very small, go ahead and turn off [`Rollup`](../ingestion/index.md#rollup) by clicking on the switch and confirming the change.
|
||||
|
||||
Once you are satisfied with the schema, click `Next` to go to the `Partition` step where you can fine tune how the data will be partitioned into segments.
|
||||
|
||||
![Data loader partition](../assets/tutorial-kafka-data-loader-06.png "Data loader partition")
|
||||
|
||||
Here, you can adjust how the data will be split up into segments in Druid.
|
||||
Since this is a small dataset, there are no adjustments that need to be made in this step.
|
||||
|
||||
Click `Next: Tune` to go to the tuning step.
|
||||
|
||||
![Data loader tune](../assets/tutorial-kafka-data-loader-07.png "Data loader tune")
|
||||
|
||||
In the `Tune` step, it is *very important* to set `Use earliest offset` to `True` since we want to consume the data from the start of the stream.
|
||||
There are no other changes that need to be made here, so click `Next: Publish` to go to the `Publish` step.
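
In the generated supervisor spec, the `Use earliest offset` toggle should show up as the `useEarliestOffset` field of the Kafka `ioConfig`. The Python sketch below builds a minimal `ioConfig` fragment with that flag set. It is only a fragment under stated assumptions: the helper name is hypothetical, the field names follow the Kafka indexing service documentation as best understood here, and a real spec contains more fields than shown.

```python
import json


def kafka_io_config(topic, bootstrap_servers, use_earliest_offset=True):
    """Build a minimal Kafka ioConfig fragment (not a complete supervisor spec)."""
    return {
        "topic": topic,
        "consumerProperties": {"bootstrap.servers": bootstrap_servers},
        # True: start from the beginning of the topic when no offsets are stored.
        "useEarliestOffset": use_earliest_offset,
    }


io_config = kafka_io_config("wikipedia", "localhost:9092")
io_config_json = json.dumps(io_config, indent=2)
```

Because the tutorial data was posted to the topic before the supervisor existed, leaving this flag off would most likely mean the supervisor only sees messages that arrive later.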
|
||||
|
||||
![Data loader publish](../assets/tutorial-kafka-data-loader-08.png "Data loader publish")
|
||||
|
||||
Let's name this datasource `wikipedia-kafka`.
|
||||
|
||||
Finally, click `Next` to review your spec.
|
||||
|
||||
![Data loader spec](../assets/tutorial-kafka-data-loader-09.png "Data loader spec")
|
||||
|
||||
This is the spec you have constructed.
|
||||
Feel free to go back and make changes in previous steps to see how changes will update the spec.
|
||||
Similarly, you can also edit the spec directly and see it reflected in the previous steps.
|
||||
|
||||
Once you are satisfied with the spec, click `Submit` and an ingestion task will be created.
|
||||
|
||||
![Tasks view](../assets/tutorial-kafka-data-loader-10.png "Tasks view")
|
||||
|
||||
You will be taken to the task view with the focus on the newly created supervisor.
|
||||
|
||||
The task view is set to auto-refresh; wait until your supervisor launches a task.
|
||||
|
||||
When a task starts running, it will also start serving the data that it is ingesting.
|
||||
|
||||
Navigate to the `Datasources` view from the header.
|
||||
|
||||
![Datasource view](../assets/tutorial-kafka-data-loader-11.png "Datasource view")
|
||||
|
||||
Once the `wikipedia-kafka` datasource appears here, it can be queried.
|
||||
|
||||
*Note:* if the datasource does not appear after a minute, you might not have set the supervisor to read from the start of the stream (in the `Tune` step).
|
||||
|
||||
At this point, you can go to the `Query` view to run SQL queries against the datasource.
|
||||
|
||||
Since this is a small dataset, you can simply run a `SELECT * FROM "wikipedia-kafka"` query to see your results.
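
The same query can also be sent over HTTP instead of through the console. The following Python sketch builds a POST request for Druid's SQL endpoint at `/druid/v2/sql/`. It assumes the quickstart Router at `localhost:8888` forwards SQL requests to the Broker; the helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: the quickstart Router listens here and proxies SQL queries.
SQL_ENDPOINT = "http://localhost:8888/druid/v2/sql/"


def build_sql_request(sql, endpoint=SQL_ENDPOINT):
    """Build (but do not send) a JSON-over-HTTP SQL request."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, endpoint=SQL_ENDPOINT):
    """Send the query and decode the JSON response; needs a running cluster."""
    with urllib.request.urlopen(build_sql_request(sql, endpoint)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the cluster running, a call such as `run_sql('SELECT * FROM "wikipedia-kafka" LIMIT 5')` should return the first few rows. The double quotes around the datasource name matter because the name contains a hyphen.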
|
||||
|
||||
![Query view](../assets/tutorial-kafka-data-loader-12.png "Query view")
|
||||
|
||||
Check out the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
|
||||
|
||||
We will use Druid's Kafka indexing service to ingest messages from our newly created *wikipedia* topic.
|
||||
|
||||
### Submit a supervisor via the console
|
||||
|
||||
In the console, click `Submit supervisor` to open the submit supervisor dialog.
|
||||
|
||||
![Submit supervisor](../assets/tutorial-kafka-01.png "Submit supervisor")
|
||||
![Submit supervisor](../assets/tutorial-kafka-submit-supervisor-01.png "Submit supervisor")
|
||||
|
||||
Paste in this spec and click `Submit`.
|
||||
|
||||
|
@ -132,8 +241,6 @@ Paste in this spec and click `Submit`.
|
|||
|
||||
This will start the supervisor that will in turn spawn some tasks that will start listening for incoming data.
|
||||
|
||||
![Running supervisor](../assets/tutorial-kafka-02.png "Running supervisor")
|
||||
|
||||
### Submit a supervisor directly
|
||||
|
||||
To start the service directly, we will need to submit a supervisor spec to the Druid overlord by running the following from the Druid package root:
|
||||
|
@ -150,27 +257,6 @@ For more details about what's going on here, check out the
|
|||
|
||||
You can view the current supervisors and tasks in the Druid Console: [http://localhost:8888/unified-console.html#tasks](http://localhost:8888/unified-console.html#tasks).
|
||||
|
||||
|
||||
## Load data
|
||||
|
||||
Let's launch a producer for our topic and send some data!
|
||||
|
||||
In your Druid directory, run the following command:
|
||||
|
||||
```bash
|
||||
cd quickstart/tutorial
|
||||
gunzip -c wikiticker-2015-09-12-sampled.json.gz > wikiticker-2015-09-12-sampled.json
|
||||
```
|
||||
|
||||
In your Kafka directory, run the following command, where {PATH_TO_DRUID} is replaced by the path to the Druid directory:
|
||||
|
||||
```bash
|
||||
export KAFKA_OPTS="-Dfile.encoding=UTF-8"
|
||||
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
|
||||
```
|
||||
|
||||
The previous command posted sample events to the *wikipedia* Kafka topic which were then ingested into Druid by the Kafka indexing service. You're now ready to run some queries!
|
||||
|
||||
## Querying your data
|
||||
|
||||
After data is sent to the Kafka stream, it is immediately available for querying.
|
||||
|
|
|
@ -44,8 +44,9 @@ This query retrieves the 10 Wikipedia pages with the most page edits on 2015-09-
|
|||
```sql
|
||||
SELECT page, COUNT(*) AS Edits
|
||||
FROM wikipedia
|
||||
WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
|
||||
GROUP BY page ORDER BY Edits DESC
|
||||
WHERE TIMESTAMP '2015-09-12 00:00:00' <= "__time" AND "__time" < TIMESTAMP '2015-09-13 00:00:00'
|
||||
GROUP BY page
|
||||
ORDER BY Edits DESC
|
||||
LIMIT 10
|
||||
```
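
The rewritten `WHERE` clause uses a half-open interval, which differs from `BETWEEN` at exactly one point: the upper bound. SQL's `BETWEEN` includes both endpoints, so an event stamped exactly at midnight on 2015-09-13 would count toward 2015-09-12. This Python sketch shows the difference with plain `datetime` comparisons; the function names are illustrative only.

```python
from datetime import datetime

START = datetime(2015, 9, 12)
END = datetime(2015, 9, 13)


def matches_between(ts):
    """Mimics `"__time" BETWEEN START AND END`, which is inclusive on both ends."""
    return START <= ts <= END


def matches_half_open(ts):
    """Mimics `START <= "__time" AND "__time" < END`."""
    return START <= ts < END


# An event exactly at midnight of the following day.
boundary_event = datetime(2015, 9, 13, 0, 0, 0)
between_includes_boundary = matches_between(boundary_event)
half_open_includes_boundary = matches_half_open(boundary_event)
```

With half-open intervals, queries for consecutive days never count the same event twice.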
|
||||
|
||||
|
@@ -57,12 +58,18 @@ You can issue the above query from the console.

![Query autocomplete](../assets/tutorial-query-01.png "Query autocomplete")

The console query view provides autocomplete together with inline function documentation.
You can also configure extra context flags to be sent with the query from the more options menu.
The console query view provides autocomplete functionality with inline documentation.

![Query options](../assets/tutorial-query-02.png "Query options")

Note that the console will by default wrap your SQL queries in a limit so that you can issue queries like `SELECT * FROM wikipedia` without much hesitation - you can turn off this behavior.
You can also configure extra [context flags](../querying/query-context.md) to be sent with the query from the `...` options menu.

Note that the console will (by default) wrap your SQL queries in a limit where appropriate so that queries such as `SELECT * FROM wikipedia` can complete.
You can turn off this behavior from the `Smart query limit` toggle.

![Query actions](../assets/tutorial-query-03.png "Query actions")

The query view provides contextual actions that can write and modify the query for you.

### Query SQL via dsql

@@ -232,10 +232,6 @@
"development/geo": {
"title": "Spatial filters"
},
"development/integrating-druid-with-other-technologies": {
"title": "Integrating Apache Druid with other technologies",
"sidebar_label": "Integrating with other technologies"
},
"development/javascript": {
"title": "JavaScript programming guide",
"sidebar_label": "JavaScript functionality"

@@ -3913,7 +3913,8 @@
"ansi-regex": {
"version": "2.1.1",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"aproba": {
"version": "1.2.0",
@@ -3934,12 +3935,14 @@
"balanced-match": {
"version": "1.0.0",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"brace-expansion": {
"version": "1.1.11",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"balanced-match": "^1.0.0",
"concat-map": "0.0.1"
@@ -3954,17 +3957,20 @@
"code-point-at": {
"version": "1.1.0",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"concat-map": {
"version": "0.0.1",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"console-control-strings": {
"version": "1.1.0",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"core-util-is": {
"version": "1.0.2",
@@ -4081,7 +4087,8 @@
"inherits": {
"version": "2.0.3",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"ini": {
"version": "1.3.5",
@@ -4093,6 +4100,7 @@
"version": "1.0.0",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"number-is-nan": "^1.0.0"
}
@@ -4107,6 +4115,7 @@
"version": "3.0.4",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"brace-expansion": "^1.1.7"
}
@@ -4114,12 +4123,14 @@
"minimist": {
"version": "0.0.8",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"minipass": {
"version": "2.3.5",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"safe-buffer": "^5.1.2",
"yallist": "^3.0.0"
@@ -4138,6 +4149,7 @@
"version": "0.5.1",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"minimist": "0.0.8"
}
@@ -4218,7 +4230,8 @@
"number-is-nan": {
"version": "1.0.1",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"object-assign": {
"version": "4.1.1",
@@ -4230,6 +4243,7 @@
"version": "1.4.0",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"wrappy": "1"
}
@@ -4315,7 +4329,8 @@
"safe-buffer": {
"version": "5.1.2",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"safer-buffer": {
"version": "2.1.2",
@@ -4351,6 +4366,7 @@
"version": "1.0.2",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"code-point-at": "^1.0.0",
"is-fullwidth-code-point": "^1.0.0",
@@ -4370,6 +4386,7 @@
"version": "3.0.1",
"bundled": true,
"dev": true,
"optional": true,
"requires": {
"ansi-regex": "^2.0.0"
}
@@ -4413,12 +4430,14 @@
"wrappy": {
"version": "1.0.2",
"bundled": true,
"dev": true
"dev": true,
"optional": true
},
"yallist": {
"version": "3.0.3",
"bundled": true,
"dev": true
"dev": true,
"optional": true
}
}
},

@@ -123,14 +123,14 @@
{"source": "configuration/historical.html", "target": "../configuration/index.html#historical"}
{"source": "configuration/indexing-service.html", "target": "../configuration/index.html#overlord"}
{"source": "configuration/production-cluster.html", "target": "../tutorials/cluster.html"}
{"source": "configuration/realtime.md", "target": "../ingestion/standalone-realtime.html"}
{"source": "configuration/realtime.html", "target": "../ingestion/standalone-realtime.html"}
{"source": "configuration/simple-cluster.html", "target": "../tutorials/cluster.html"}
{"source": "configuration/zookeeper.html", "target": "../dependencies/zookeeper.html"}
{"source": "dependencies/cassandra-deep-storage.md", "target": "../development/extensions-contrib/cassandra.html"}
{"source": "dependencies/cassandra-deep-storage.html", "target": "../development/extensions-contrib/cassandra.html"}
{"source": "design/concepts-and-terminology.html", "target": "index.html"}
{"source": "design/design.html", "target": "index.html"}
{"source": "design/plumber.md", "target": "../ingestion/standalone-realtime.html"}
{"source": "design/realtime.md", "target": "../ingestion/standalone-realtime.html"}
{"source": "design/plumber.html", "target": "../ingestion/standalone-realtime.html"}
{"source": "design/realtime.html", "target": "../ingestion/standalone-realtime.html"}
{"source": "development/approximate-histograms.html", "target": "extensions-core/approximate-histograms.html"}
{"source": "development/community-extensions/azure.html", "target": "../extensions-contrib/azure.html"}
{"source": "development/community-extensions/cassandra.html", "target": "../extensions-contrib/cassandra.html"}
@@ -139,48 +139,48 @@
{"source": "development/community-extensions/kafka-simple.html", "target": "../extensions-core/kafka-ingestion.html"}
{"source": "development/community-extensions/rabbitmq.html", "target": "../extensions-core/kafka-ingestion.html"}
{"source": "development/datasketches-aggregators.html", "target": "extensions-core/datasketches-extension.html"}
{"source": "development/extensions-contrib/kafka-simple.md", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-contrib/kafka-simple.html", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-contrib/orc.html", "target": "../extensions-core/orc.html"}
{"source": "development/extensions-contrib/parquet.html", "target":"../../development/extensions-core/parquet.html"}
{"source": "development/extensions-contrib/rabbitmq.md", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-contrib/rocketmq.md", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-contrib/rabbitmq.html", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-contrib/rocketmq.html", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-contrib/scan-query.html", "target":"../../querying/scan-query.html"}
{"source": "development/extensions-core/caffeine-cache.html", "target":"../../configuration/index.html#cache-configuration"}
{"source": "development/extensions-core/datasketches-aggregators.html", "target": "datasketches-extension.html"}
{"source": "development/extensions-core/kafka-eight-firehose.md", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-core/kafka-eight-firehose.html", "target": "../../ingestion/standalone-realtime.html"}
{"source": "development/extensions-core/namespaced-lookup.html", "target": "lookups-cached-global.html"}
{"source": "development/indexer.md", "target": "../design/indexer.html"}
{"source": "development/indexer.html", "target": "../design/indexer.html"}
{"source": "development/kafka-simple-consumer-firehose.html", "target": "extensions-core/kafka-ingestion.html"}
{"source": "development/libraries.html", "target": "/libraries.html"}
{"source": "development/router.md", "target": "../design/router.html"}
{"source": "development/router.html", "target": "../design/router.html"}
{"source": "development/select-query.html", "target": "../querying/select-query.html"}
{"source": "index.html", "target": "design/index.html"}
{"source": "ingestion/batch-ingestion.md", "target": "index.html#batch"}
{"source": "ingestion/command-line-hadoop-indexer.md", "target": "hadoop.html#cli"}
{"source": "ingestion/compaction.md", "target": "data-management.html#compact"}
{"source": "ingestion/delete-data.md", "target": "data-management.html#delete"}
{"source": "ingestion/firehose.md", "target": "native-batch.html#firehoses"}
{"source": "ingestion/flatten-json.md", "target": "index.html#flattenspec"}
{"source": "ingestion/hadoop-vs-native-batch.md", "target": "index.html#batch"}
{"source": "ingestion/ingestion-spec.md", "target": "index.html#spec"}
{"source": "ingestion/batch-ingestion.html", "target": "index.html#batch"}
{"source": "ingestion/command-line-hadoop-indexer.html", "target": "hadoop.html#cli"}
{"source": "ingestion/compaction.html", "target": "data-management.html#compact"}
{"source": "ingestion/delete-data.html", "target": "data-management.html#delete"}
{"source": "ingestion/firehose.html", "target": "native-batch.html#firehoses"}
{"source": "ingestion/flatten-json.html", "target": "index.html#flattenspec"}
{"source": "ingestion/hadoop-vs-native-batch.html", "target": "index.html#batch"}
{"source": "ingestion/ingestion-spec.html", "target": "index.html#spec"}
{"source": "ingestion/ingestion.html", "target": "index.html"}
{"source": "ingestion/locking-and-priority.md", "target": "tasks.html#locks"}
{"source": "ingestion/misc-tasks.md", "target": "tasks.html#all-task-types"}
{"source": "ingestion/locking-and-priority.html", "target": "tasks.html#locks"}
{"source": "ingestion/misc-tasks.html", "target": "tasks.html#all-task-types"}
{"source": "ingestion/native_tasks.html", "target": "native-batch.html"}
{"source": "ingestion/native_tasks.html", "target": "native-batch.html"}
{"source": "ingestion/native_tasks.md", "target": "native-batch.html"}
{"source": "ingestion/overview.html", "target": "index.html"}
{"source": "ingestion/realtime-ingestion.html", "target": "index.html"}
{"source": "ingestion/reports.md", "target": "tasks.html#reports"}
{"source": "ingestion/schema-changes.md", "target": "data-management.html#schema-changes"}
{"source": "ingestion/stream-ingestion.md", "target": "index.html#streaming"}
{"source": "ingestion/stream-pull.md", "target": "../ingestion/standalone-realtime.html"}
{"source": "ingestion/stream-push.md", "target": "tranquility.html"}
{"source": "ingestion/transform-spec.md", "target": "index.html#transformspec"}
{"source": "ingestion/update-existing-data.md", "target": "data-management.html#update"}
{"source": "ingestion/reports.html", "target": "tasks.html#reports"}
{"source": "ingestion/schema-changes.html", "target": "data-management.html#schema-changes"}
{"source": "ingestion/stream-ingestion.html", "target": "index.html#streaming"}
{"source": "ingestion/stream-pull.html", "target": "../ingestion/standalone-realtime.html"}
{"source": "ingestion/stream-push.html", "target": "tranquility.html"}
{"source": "ingestion/transform-spec.html", "target": "index.html#transformspec"}
{"source": "ingestion/update-existing-data.html", "target": "data-management.html#update"}
{"source": "misc/cluster-setup.html", "target": "../tutorials/cluster.html"}
{"source": "misc/evaluate.html", "target": "../tutorials/cluster.html"}
{"source": "misc/tasks.html", "target": "../ingestion/tasks.html"}
{"source": "operations/including-extensions.md", "target": "../development/extensions.html"}
{"source": "operations/including-extensions.html", "target": "../development/extensions.html"}
{"source": "operations/multitenancy.html", "target": "../querying/multitenancy.html"}
{"source": "operations/performance-faq.html", "target": "../operations/basic-cluster-tuning.html"}
{"source": "operations/performance-faq.html", "target": "../operations/basic-cluster-tuning.html"}
@@ -196,5 +196,6 @@
{"source": "tutorials/tutorial-loading-batch-data.html", "target": "tutorial-batch.html"}
{"source": "tutorials/tutorial-loading-streaming-data.html", "target": "tutorial-kafka.html"}
{"source": "tutorials/tutorial-the-druid-cluster.html", "target": "cluster.html"}
{"source": "tutorials/tutorial-tranquility.md", "target": "../ingestion/tranquility.html"}
{"source": "tutorials/tutorial-tranquility.html", "target": "../ingestion/tranquility.html"}
{"source": "development/extensions-contrib/google.html", "target": "../extensions-core/google.html"}
{"source": "development/integrating-druid-with-other-technologies.html", "target": "../ingestion/index.html"}

@@ -145,7 +145,6 @@
"design/overlord",
"design/router",
"design/peons",
"development/integrating-druid-with-other-technologies",
"development/extensions-core/approximate-histograms",
"development/extensions-core/avro",
"development/extensions-core/bloom-filter",