Update the Kafka tutorial to load data with the data loader

YuCheng Hu 2021-08-06 14:22:48 -04:00
parent 9cda041d0b
commit 0955f29312
1 changed files with 47 additions and 83 deletions


@ -28,117 +28,117 @@ cd kafka_2.12-2.1.0
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
```
## Load data into Kafka
Let's launch a producer for our topic and send some data.
In your Druid directory, run the following commands:
```bash
cd quickstart/tutorial
gunzip -c wikiticker-2015-09-12-sampled.json.gz > wikiticker-2015-09-12-sampled.json
```
In your Kafka directory, run the following commands, replacing {PATH_TO_DRUID} with the path to your Druid directory:
```bash
export KAFKA_OPTS="-Dfile.encoding=UTF-8"
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
```
The previous command posted sample events to the *wikipedia* Kafka topic.
Now we will use Druid's Kafka indexing service to ingest the messages from that topic into Druid.
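Before moving on, it can help to confirm that the events actually reached the topic. The following is a minimal sketch, not part of the tutorial itself: it wraps Kafka's console consumer in a small function. The `KAFKA_DIR` and `MAX_MESSAGES` variables and the `peek_topic` name are assumptions introduced here for illustration.

```bash
# Print the first few messages from the topic to confirm the producer worked.
# KAFKA_DIR (path to the Kafka directory) and MAX_MESSAGES are illustrative
# variables introduced for this sketch; they are not defined by the tutorial.
peek_topic() {
  local kafka_dir="${KAFKA_DIR:-.}" max="${MAX_MESSAGES:-5}"
  "${kafka_dir%/}/bin/kafka-console-consumer.sh" \
    --bootstrap-server localhost:9092 \
    --topic wikipedia \
    --from-beginning \
    --max-messages "$max"
}
```

Run `peek_topic` from your Kafka directory while the broker is up. If it prints JSON events and exits, the topic is populated.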
## Loading data with the data loader
Navigate to [localhost:8888](http://localhost:8888) and click `Load data` in the console header.
![Data loader init](../assets/tutorial-kafka-data-loader-01.png "Data loader init")
Select `Apache Kafka` and click `Connect data`.
![Data loader sample](../assets/tutorial-kafka-data-loader-02.png "Data loader sample")
Enter `localhost:9092` as the bootstrap server and `wikipedia` as the topic.
Click `Apply` and make sure that the data you see is correct.
Once the data is located, click "Next: Parse data" to go to the next step.
![Data loader parse data](../assets/tutorial-kafka-data-loader-03.png "Data loader parse data")
The data loader will try to determine the correct parser for the data automatically.
In this case it will successfully determine `json`.
Feel free to try different parser options to preview how Druid will parse your data.
With the `json` parser selected, click `Next: Parse time` to move on to the step for determining your primary timestamp column.
![Data loader parse time](../assets/tutorial-kafka-data-loader-04.png "Data loader parse time")
Druid's architecture requires a primary timestamp column (stored internally in a column called `__time`).
If your data has no timestamp, select `Constant value`.
In our example, the data loader determines that the `time` column in the raw data is the only candidate for the primary time column.
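To see what the data loader is looking at, you can pull the `time` value out of a raw event yourself. This is a rough sketch under two assumptions: the events are compact JSON with no space after the colon, and the value contains no escaped quotes. The `event_time` helper and the sample record are made up for illustration and are not taken from the sample file.

```bash
# Extract the raw `time` value from one JSON event line.
# Assumes compact JSON ("time":"...") and no escaped quotes in the value.
event_time() {
  printf '%s\n' "$1" | sed -n 's/.*"time":"\([^"]*\)".*/\1/p'
}

# Illustrative record, invented for this sketch (not copied from the sample file).
sample='{"time":"2015-09-12T00:00:00.000Z","channel":"#example","page":"Example"}'
event_time "$sample"
```

Against the real data, something like `event_time "$(head -n 1 wikiticker-2015-09-12-sampled.json)"` should print the timestamp of the first event, provided the file matches the assumptions above.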
Click `Next: ...` twice to skip the `Transform` and `Filter` steps.
You do not need to enter anything in these steps, because applying ingestion-time transforms and filters is out of scope for this tutorial.
![Data loader schema](../assets/tutorial-kafka-data-loader-05.png "Data loader schema")
In the `Configure schema` step, you can configure which [dimensions](../ingestion/index.md#dimensions) and [metrics](../ingestion/index.md#metrics) will be ingested into Druid.
This is exactly how the data will appear in Druid once it is ingested.
Since our dataset is very small, turn off [`Rollup`](../ingestion/index.md#rollup) by clicking the switch and confirming the change.
Once you are satisfied with the schema, click `Next` to go to the `Partition` step, where you can fine-tune how the data will be partitioned into segments.
![Data loader partition](../assets/tutorial-kafka-data-loader-06.png "Data loader partition")
Here, you can adjust how the data will be split into segments in Druid.
Since this is a small dataset, no adjustments are needed in this step.
Click `Next: Tune` to go to the tuning step.
![Data loader tune](../assets/tutorial-kafka-data-loader-07.png "Data loader tune")
In the `Tune` step it is *very important* to set `Use earliest offset` to `True`, since we want to consume the data from the start of the stream.
No other changes are needed here, so click `Next: Publish` to go to the `Publish` step.
![Data loader publish](../assets/tutorial-kafka-data-loader-08.png "Data loader publish")
Let's name this datasource `wikipedia-kafka`.
Finally, click `Next` to review your spec.
![Data loader spec](../assets/tutorial-kafka-data-loader-09.png "Data loader spec")
This is the spec you have constructed.
Feel free to go back and change earlier steps to see how the spec updates.
Similarly, you can edit the spec directly and see the change reflected in the earlier steps.
Once you are satisfied with the spec, click `Submit` to create an ingestion task.
![Tasks view](../assets/tutorial-kafka-data-loader-10.png "Tasks view")
You will be taken to the task view, focused on the newly created supervisor.
The task view refreshes automatically; wait until your supervisor launches a task.
When a task starts running, it also starts serving the data it is ingesting.
Navigate to the `Datasources` view from the header.
![Datasource view](../assets/tutorial-kafka-data-loader-11.png "Datasource view")
When the `wikipedia-kafka` datasource appears here, it can be queried.
*Note:* if the datasource does not appear after a minute, you might not have set the supervisor to read from the start of the stream (in the `Tune` step).
At this point, you can go to the `Query` view to run SQL queries against the datasource.
Since this is a small dataset, you can simply run a `SELECT * FROM "wikipedia-kafka"` query to see your results.
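The same query can also be sent over HTTP instead of through the console. The sketch below assumes the router is reachable at `http://localhost:8888` and uses Druid's SQL endpoint at `/druid/v2/sql`. The `sql_payload` and `run_sql` helpers and the `DRUID_URL` variable are names introduced here for illustration.

```bash
# Wrap a SQL string in the JSON body expected by the SQL endpoint.
# Backslashes and double quotes are escaped so the SQL fits in a JSON string.
sql_payload() {
  local escaped
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"query":"%s"}\n' "$escaped"
}

# POST the query to Druid. DRUID_URL is an illustrative override; the default
# assumes the console address used earlier in this tutorial.
run_sql() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(sql_payload "$1")" \
    "${DRUID_URL:-http://localhost:8888}/druid/v2/sql"
}

# Show the request body without contacting Druid.
sql_payload 'SELECT * FROM "wikipedia-kafka" LIMIT 5'
```

Calling `run_sql 'SELECT * FROM "wikipedia-kafka" LIMIT 5'` once the datasource is available should return a few rows as JSON. The `LIMIT` is added here only to keep the response short.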
![Query view](../assets/tutorial-kafka-data-loader-12.png "Query view")
Check out the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
### Submit a supervisor via the console
@ -251,42 +251,6 @@ For more information on loading data from Kafka streams, please see the [Druid K
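### Download and start Kafka
[Apache Kafka](http://kafka.apache.org/) is a high-throughput message bus that works well with Druid. In this tutorial we will use Kafka 2.1.0. To download Kafka, run the following commands in your terminal: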
```json
curl -O https://archive.apache.org/dist/kafka/2.1.0/kafka_2.12-2.1.0.tgz
tar -xzf kafka_2.12-2.1.0.tgz
cd kafka_2.12-2.1.0
```
Start a Kafka broker by running the following command in a terminal:
```json
./bin/kafka-server-start.sh config/server.properties
```
Run the following command to create a Kafka topic called *wikipedia*, to which we will send data:
```json
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
```
### Send data to Kafka
Let's launch a producer for this topic and send some data.
In your Druid directory, run the following commands:
```json
cd quickstart/tutorial
gunzip -c wikiticker-2015-09-12-sampled.json.gz > wikiticker-2015-09-12-sampled.json
```
In your Kafka directory, run the following commands, replacing {PATH_TO_DRUID} with the path to the Druid directory:
```json
export KAFKA_OPTS="-Dfile.encoding=UTF-8"
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
```
The previous command posted sample events to the *wikipedia* Kafka topic. Now we will use Druid's Kafka indexing service to ingest messages from the newly created topic.