diff --git a/_sidebar.md b/_sidebar.md
index 159d913..347ca42 100644
--- a/_sidebar.md
+++ b/_sidebar.md
@@ -14,7 +14,7 @@
 - [Load data from Apache Kafka](tutorials/tutorial-kafka.md)
 - [Load data from Apache Hadoop](tutorials/tutorial-batch-hadoop.md)
 - [Query data](tutorials/tutorial-query.md)
-- [Rollback](tutorials/tutorial-rollup.md)
+- [Roll-up](tutorials/tutorial-rollup.md)
 - [Configure data retention](tutorials/tutorial-retention.md)
 - [Update existing data](tutorials/tutorial-update-data.md)
 - [Compact segments](tutorials/tutorial-compaction.md)
diff --git a/tutorials/chapter-5.md b/tutorials/chapter-5.md
deleted file mode 100644
index 6136ee0..0000000
--- a/tutorials/chapter-5.md
+++ /dev/null
@@ -1,179 +0,0 @@
-
-
-
-
-
-## Roll-up
-
-Apache Druid can summarize raw data at ingestion time using a process we refer to as "roll-up". Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.
-
-This tutorial will demonstrate the effects of roll-up on an example dataset.
-
-It assumes you have already downloaded Druid as described in the [single-server deployment](../GettingStarted/chapter-3.md) and have it running on your local machine.
-
-It will also be helpful to have finished the [loading a file](tutorial-batch.md) and [querying data](./chapter-4.md) tutorials.
-
-### Example data
-
-For this tutorial, we'll use a small sample of network flow event data, representing the packet and byte counts for traffic from a source to a destination IP address during a particular time.
-
-```json
-{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
-{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
-{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
-{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
-{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
-{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
-{"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
-{"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
-{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
-```
-
-A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.
-
-We'll ingest this data using the ingestion spec at `quickstart/tutorial/rollup-index.json`:
-
-```json
-{
-  "type" : "index_parallel",
-  "spec" : {
-    "dataSchema" : {
-      "dataSource" : "rollup-tutorial",
-      "dimensionsSpec" : {
-        "dimensions" : [
-          "srcIP",
-          "dstIP"
-        ]
-      },
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "iso"
-      },
-      "metricsSpec" : [
-        { "type" : "count", "name" : "count" },
-        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
-        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
-      ],
-      "granularitySpec" : {
-        "type" : "uniform",
-        "segmentGranularity" : "week",
-        "queryGranularity" : "minute",
-        "intervals" : ["2018-01-01/2018-01-03"],
-        "rollup" : true
-      }
-    },
-    "ioConfig" : {
-      "type" : "index_parallel",
-      "inputSource" : {
-        "type" : "local",
-        "baseDir" : "quickstart/tutorial",
-        "filter" : "rollup-data.json"
-      },
-      "inputFormat" : {
-        "type" : "json"
-      },
-      "appendToExisting" : false
-    },
-    "tuningConfig" : {
-      "type" : "index_parallel",
-      "maxRowsPerSegment" : 5000000,
-      "maxRowsInMemory" : 25000
-    }
-  }
-}
-```
-
-Roll-up has been enabled by setting `"rollup" : true` in the `granularitySpec`.
-
-Note that we have `srcIP` and `dstIP` defined as **dimensions**, the `packets` and `bytes` columns defined as `longSum` **metrics**, and the `queryGranularity` set to `minute`.
-
-We will see how these definitions are used after we load this data.
-
-### Load the example data
-
-From the Druid root directory, run the following command:
-
-```bash
-bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
-```
-
-After the script completes, we will query the data.
-
-### Query the example data
-
-Now run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see the data that was ingested:
-
-```bash
-$ bin/dsql
-Welcome to dsql, the command-line client for Druid SQL.
-Type "\h" for help.
-dsql> select * from "rollup-tutorial";
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
-│ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
-│ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
-│ 2018-01-02T21:33:00.000Z │ 100288 │     2 │ 8.8.8.8 │     161 │ 7.7.7.7 │
-│ 2018-01-02T21:35:00.000Z │   2818 │     1 │ 8.8.8.8 │      12 │ 7.7.7.7 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-Retrieved 5 rows in 1.18s.
-
-dsql>
-```
-
-Let's look at the three rows of original input data that occurred during `2018-01-01T01:01`:
-
-```json
-{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
-{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
-{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
-```
-
-These three rows have been rolled up into the following single row:
-
-```json
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-```
-
-The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}`, with sum aggregations applied to the metric columns `{packets, bytes}`.
-
-Before the grouping takes place, the timestamps of the original input data are floored to minute granularity, due to the `"queryGranularity" : "minute"` setting in the ingestion spec.
-
-Likewise, the two events that occurred during `2018-01-01T01:02` have been rolled up:
-
-```json
-{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
-{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
-```
-```json
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-```
-
-For the last event, which records traffic between 1.1.1.1 and 2.2.2.2, no roll-up took place, because it was the only event that occurred during `2018-01-01T01:03`:
-
-```json
-{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
-```
-```json
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-```
-
-Note that the `count` metric shows how many rows of the original input data contributed to the final rolled-up row.
\ No newline at end of file
diff --git a/tutorials/img-3/tutorial-query-01.png b/tutorials/img-3/tutorial-query-01.png
deleted file mode 100644
index 8884d9b..0000000
Binary files a/tutorials/img-3/tutorial-query-01.png and /dev/null differ
diff --git a/tutorials/img-3/tutorial-query-02.png b/tutorials/img-3/tutorial-query-02.png
deleted file mode 100644
index cf407e3..0000000
Binary files a/tutorials/img-3/tutorial-query-02.png and /dev/null differ
diff --git a/tutorials/img-3/tutorial-query-03.png b/tutorials/img-3/tutorial-query-03.png
deleted file mode 100644
index c3b599c..0000000
Binary files a/tutorials/img-3/tutorial-query-03.png and /dev/null differ
diff --git a/tutorials/img-3/tutorial-query-04.png b/tutorials/img-3/tutorial-query-04.png
deleted file mode 100644
index 3f800a6..0000000
Binary files a/tutorials/img-3/tutorial-query-04.png and /dev/null differ
diff --git a/tutorials/img-3/tutorial-query-05.png b/tutorials/img-3/tutorial-query-05.png
deleted file mode 100644
index 2fc59ce..0000000
Binary files a/tutorials/img-3/tutorial-query-05.png and /dev/null differ
diff --git a/tutorials/img-3/tutorial-query-06.png b/tutorials/img-3/tutorial-query-06.png
deleted file mode 100644
index 60b4e1a..0000000
Binary files a/tutorials/img-3/tutorial-query-06.png and /dev/null differ
diff --git a/tutorials/img-3/tutorial-query-07.png b/tutorials/img-3/tutorial-query-07.png
deleted file mode 100644
index d2e5a85..0000000
Binary files a/tutorials/img-3/tutorial-query-07.png and /dev/null differ
diff --git a/tutorials/tutorial-rollup.md b/tutorials/tutorial-rollup.md
index cc93997..4c1a5be 100644
--- a/tutorials/tutorial-rollup.md
+++ b/tutorials/tutorial-rollup.md
@@ -1,29 +1,4 @@
----
-id: tutorial-rollup
-title: "Tutorial: Roll-up"
-sidebar_label: "Roll-up"
----
-
-
-
+# Roll-up
 
 Apache Druid can summarize raw data at ingestion time using a process we refer to as "roll-up". Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.
 
 This tutorial will demonstrate the effects of roll-up on an example dataset.
@@ -194,3 +169,171 @@ For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up too
 ```
 
 Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row.
+
+
+## Roll-up
+
+Apache Druid can summarize raw data at ingestion time using a process we refer to as "roll-up". Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.
+
+This tutorial will demonstrate the effects of roll-up on an example dataset.
+
+It assumes you have already downloaded Druid as described in the [single-server deployment](../GettingStarted/chapter-3.md) and have it running on your local machine.
+
+It will also be helpful to have finished the [loading a file](tutorial-batch.md) and [querying data](./chapter-4.md) tutorials.
+
+### Example data
+
+For this tutorial, we'll use a small sample of network flow event data, representing the packet and byte counts for traffic from a source to a destination IP address during a particular time.
+
+```json
+{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
+{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
+{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
+{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
+{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
+{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
+{"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
+{"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
+{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
+```
+
+A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.
+
+We'll ingest this data using the ingestion spec at `quickstart/tutorial/rollup-index.json`:
+
+```json
+{
+  "type" : "index_parallel",
+  "spec" : {
+    "dataSchema" : {
+      "dataSource" : "rollup-tutorial",
+      "dimensionsSpec" : {
+        "dimensions" : [
+          "srcIP",
+          "dstIP"
+        ]
+      },
+      "timestampSpec": {
+        "column": "timestamp",
+        "format": "iso"
+      },
+      "metricsSpec" : [
+        { "type" : "count", "name" : "count" },
+        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
+        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
+      ],
+      "granularitySpec" : {
+        "type" : "uniform",
+        "segmentGranularity" : "week",
+        "queryGranularity" : "minute",
+        "intervals" : ["2018-01-01/2018-01-03"],
+        "rollup" : true
+      }
+    },
+    "ioConfig" : {
+      "type" : "index_parallel",
+      "inputSource" : {
+        "type" : "local",
+        "baseDir" : "quickstart/tutorial",
+        "filter" : "rollup-data.json"
+      },
+      "inputFormat" : {
+        "type" : "json"
+      },
+      "appendToExisting" : false
+    },
+    "tuningConfig" : {
+      "type" : "index_parallel",
+      "maxRowsPerSegment" : 5000000,
+      "maxRowsInMemory" : 25000
+    }
+  }
+}
+```
+
+Roll-up has been enabled by setting `"rollup" : true` in the `granularitySpec`.
+
+Note that we have `srcIP` and `dstIP` defined as **dimensions**, the `packets` and `bytes` columns defined as `longSum` **metrics**, and the `queryGranularity` set to `minute`, as illustrated by the sketch below.
+
+We will see how these definitions are used after we load this data.
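+
+Conceptually, the roll-up configured above behaves like a `GROUP BY` over the raw input, with timestamps floored to the `queryGranularity`. The following Druid SQL is only a sketch of that idea (`raw_input` is a hypothetical table standing in for the raw JSON rows; the actual grouping happens inside the indexing task, not through a query):
+
+```sql
+SELECT
+  TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1M') AS "__time",  -- "queryGranularity" : "minute"
+  "srcIP",                                                  -- dimensions
+  "dstIP",
+  COUNT(*)       AS "count",                                -- metricsSpec aggregators
+  SUM("packets") AS "packets",
+  SUM("bytes")   AS "bytes"
+FROM raw_input
+GROUP BY 1, 2, 3
+```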
"fieldName" : "bytes" } + ], + "granularitySpec" : { + "type" : "uniform", + "segmentGranularity" : "week", + "queryGranularity" : "minute", + "intervals" : ["2018-01-01/2018-01-03"], + "rollup" : true + } + }, + "ioConfig" : { + "type" : "index_parallel", + "inputSource" : { + "type" : "local", + "baseDir" : "quickstart/tutorial", + "filter" : "rollup-data.json" + }, + "inputFormat" : { + "type" : "json" + }, + "appendToExisting" : false + }, + "tuningConfig" : { + "type" : "index_parallel", + "maxRowsPerSegment" : 5000000, + "maxRowsInMemory" : 25000 + } + } +} +``` + +通过在 `granularitySpec` 选项中设置 `rollup : true` 来启用Roll-up + +注意,我们将`srcIP`和`dstIP`定义为**维度**,将`packets`和`bytes`列定义为了`longSum`类型的**指标**,并将 `queryGranularity` 配置定义为 `minute`。 + +加载这些数据后,我们将看到如何使用这些定义。 + +### 加载示例数据 + +在Druid的根目录下运行以下命令: + +```json +bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081 +``` + +脚本运行完成以后,我们将查询数据。 + +### 查询示例数据 + +现在运行 `bin/dsql` 然后执行查询 `select * from "rollup-tutorial";` 来查看已经被摄入的数据。 + +```json +$ bin/dsql +Welcome to dsql, the command-line client for Druid SQL. +Type "\h" for help. +dsql> select * from "rollup-tutorial"; +┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ +│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ +├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ +│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │ +│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │ +│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │ +│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │ +│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │ +└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ +Retrieved 5 rows in 1.18s. 
+
+Let's look at the three rows of original input data that occurred during `2018-01-01T01:01`:
+
+```json
+{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
+{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
+{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
+```
+
+These three rows have been rolled up into the following single row:
+
+```json
+┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
+│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
+├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
+│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
+└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
+```
+
+The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}`, with sum aggregations applied to the metric columns `{packets, bytes}`.
+
+Before the grouping takes place, the timestamps of the original input data are floored to minute granularity, due to the `"queryGranularity" : "minute"` setting in the ingestion spec.
+
+Likewise, the two events that occurred during `2018-01-01T01:02` have been rolled up:
+
+```json
+{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
+{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
+```
+```json
+┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
+│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
+├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
+│ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
+└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
+```
+
+For the last event, which records traffic between 1.1.1.1 and 2.2.2.2, no roll-up took place, because it was the only event that occurred during `2018-01-01T01:03`:
+
+```json
+{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
+```
+```json
+┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
+│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
+├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
+│ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
+└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
+```
+
+Note that the `count` metric shows how many rows of the original input data contributed to the final rolled-up row.
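+
+Rolled-up rows are still ordinary rows, so they can be aggregated further at query time — for example, a per-day summary over the same datasource might look like this sketch:
+
+```sql
+SELECT
+  TIME_FLOOR("__time", 'P1D') AS "day",  -- bucket the minute-grained rows by day
+  SUM("packets") AS "packets",
+  SUM("bytes")   AS "bytes"
+FROM "rollup-tutorial"
+GROUP BY 1
+```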