roll up

2020-03-29 15:21:06 +08:00 · 2020-03-29 15:21:06 +08:00 · 80837a8eff
parent c6fa73f690
commit 80837a8eff
1 changed files with 168 additions and 1 deletions
--- a/Tutorials/chapter-5.md
+++ b/Tutorials/chapter-5.md
@ -1 +1,168 @@
-## Simple tutorial
+<!-- toc -->
 ## Roll-up
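 Apache Druid can summarize raw data at ingestion time through a process called roll-up. Roll-up is a first-level aggregation over a selected set of columns, and it reduces the size of the stored data.
 This tutorial walks through the result of applying roll-up to an example dataset.
 It assumes that you have downloaded Druid as described in [Single-server deployment](../GettingStarted/chapter-3.md) and have it running on your local machine.
 It also helps to have completed [Loading a local file](./chapter-1.md) and [Querying data](./chapter-4.md).
 ### Example data
 For this tutorial, we use a small sample of network flow event data. Each event records the packet and byte counts for traffic from a source IP address to a destination IP address at a particular time.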
 ```
 {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
 {"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
 {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
 {"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
 {"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
 {"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
 {"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
 {"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
 {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
 ```
 The sample input data is in the file `quickstart/tutorial/rollup-data.json`.
 We ingest it with the ingestion spec in `quickstart/tutorial/rollup-index.json`:
 ```
 {
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rollup-tutorial",
      "dimensionsSpec" : {
        "dimensions" : [
          "srcIP",
          "dstIP"
        ]
      },
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "week",
        "queryGranularity" : "minute",
        "intervals" : ["2018-01-01/2018-01-03"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial",
        "filter" : "rollup-data.json"
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
 }
 ```
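 Roll-up is enabled by setting `"rollup" : true` in the `granularitySpec`.
 Note that `srcIP` and `dstIP` are defined as **dimensions**, the `packets` and `bytes` columns are defined as `longSum` **metrics** (alongside a `count` metric), and `queryGranularity` is set to `minute`.
 After loading the data, we will see how these definitions are used.
 ### Load the example data
 From the Druid root directory, run the following command: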
 ```
 bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
 ```
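For readers who prefer to submit the task programmatically, the following is a minimal Python sketch of the same step. It assumes the Overlord task endpoint `/druid/indexer/v1/task` is reachable at the URL used in the command above; the helper names `build_task_request` and `submit_task` are hypothetical and are not part of Druid.

```python
import json
import urllib.request


def build_task_request(spec, overlord_url="http://localhost:8081"):
    """Build (but do not send) a POST request carrying an ingestion spec."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url=overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec_path, overlord_url="http://localhost:8081"):
    """Read a spec file, submit it, and return the task id from the response.

    Assumes the response body is a JSON object with a "task" field.
    """
    with open(spec_path, encoding="utf-8") as spec_file:
        spec = json.load(spec_file)
    request = build_task_request(spec, overlord_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["task"]
```

With a running local cluster, `submit_task("quickstart/tutorial/rollup-index.json")` would submit the spec. Unlike `bin/post-index-task`, this sketch does not wait for the task to finish.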
 After the script completes, we can query the data.
 ### Query the example data
 Now run `bin/dsql` and issue the query `select * from "rollup-tutorial";` to see the data that was ingested.
 ```
 $ bin/dsql
 Welcome to dsql, the command-line client for Druid SQL.
 Type "\h" for help.
 dsql> select * from "rollup-tutorial";
 ┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
 │ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
 ├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
 │ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
 │ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
 │ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
 │ 2018-01-02T21:33:00.000Z │ 100288 │     2 │ 8.8.8.8 │     161 │ 7.7.7.7 │
 │ 2018-01-02T21:35:00.000Z │   2818 │     1 │ 8.8.8.8 │      12 │ 7.7.7.7 │
 └──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
 Retrieved 5 rows in 1.18s.
 dsql>
 ```
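The same query can also be issued over HTTP instead of through `dsql`. The sketch below assumes the Druid SQL endpoint `/druid/v2/sql` is served by a Broker at `http://localhost:8082`, which is an assumption about the local quickstart setup; the helper names `build_sql_request` and `run_sql` are hypothetical.

```python
import json
import urllib.request


def build_sql_request(query, broker_url="http://localhost:8082"):
    """Build (but do not send) a POST request for a Druid SQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(query, broker_url="http://localhost:8082"):
    """Send the query and return the decoded JSON response."""
    request = build_sql_request(query, broker_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running cluster, `run_sql('select * from "rollup-tutorial"')` should return one JSON object per rolled-up row. Note that the trailing semicolon used in `dsql` is omitted here.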
 Let's look at the three raw events that occurred during `2018-01-01T01:01`:
 ```
 {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
 {"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
 {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
 ```
 These three events were rolled up into the following single row:
 ```
 ┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
 │ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
 ├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
 │ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
 └──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
 ```
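 The input rows were grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}`, and the metric columns `{packets, bytes}` were summed within each group.
 Before the grouping occurs, the timestamps of the original input rows are bucketed (floored) by minute, because of the `"queryGranularity" : "minute"` setting in the ingestion spec.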
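The floor-then-group behavior described here can be imitated in plain Python. This is only an illustration of the arithmetic on the nine sample events, not how Druid implements roll-up internally; the function names `floor_to_minute` and `roll_up` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# The nine sample events as (timestamp, srcIP, dstIP, packets, bytes).
RAW_EVENTS = [
    ("2018-01-01T01:01:35Z", "1.1.1.1", "2.2.2.2", 20, 9024),
    ("2018-01-01T01:01:51Z", "1.1.1.1", "2.2.2.2", 255, 21133),
    ("2018-01-01T01:01:59Z", "1.1.1.1", "2.2.2.2", 11, 5780),
    ("2018-01-01T01:02:14Z", "1.1.1.1", "2.2.2.2", 38, 6289),
    ("2018-01-01T01:02:29Z", "1.1.1.1", "2.2.2.2", 377, 359971),
    ("2018-01-01T01:03:29Z", "1.1.1.1", "2.2.2.2", 49, 10204),
    ("2018-01-02T21:33:14Z", "7.7.7.7", "8.8.8.8", 38, 6289),
    ("2018-01-02T21:33:45Z", "7.7.7.7", "8.8.8.8", 123, 93999),
    ("2018-01-02T21:35:45Z", "7.7.7.7", "8.8.8.8", 12, 2818),
]


def floor_to_minute(iso_timestamp):
    """Truncate an ISO 8601 UTC timestamp to the start of its minute."""
    parsed = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(second=0).strftime("%Y-%m-%dT%H:%M:%S.000Z")


def roll_up(events):
    """Group by (floored time, srcIP, dstIP); sum the metrics and count rows."""
    totals = defaultdict(lambda: {"count": 0, "packets": 0, "bytes": 0})
    for timestamp, src_ip, dst_ip, packets, byte_count in events:
        key = (floor_to_minute(timestamp), src_ip, dst_ip)
        totals[key]["count"] += 1
        totals[key]["packets"] += packets
        totals[key]["bytes"] += byte_count
    return dict(sorted(totals.items()))


rolled = roll_up(RAW_EVENTS)
for (minute, src_ip, dst_ip), metrics in rolled.items():
    print(minute, src_ip, dst_ip, metrics["count"], metrics["packets"], metrics["bytes"])
```

Running the sketch yields five groups whose `count`, `packets`, and `bytes` values match the five rows returned by the `dsql` query.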
 Likewise, the two events that occurred during `2018-01-01T01:02` were rolled up:
 ```
 {"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
 {"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
 ```
 ```
 ┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
 │ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
 ├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
 │ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
 └──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
 ```
 The last event recording traffic between 1.1.1.1 and 2.2.2.2 was not rolled up, because it was the only event that occurred during `2018-01-01T01:03`:
 ```
 {"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
 ```
 ```
 ┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
 │ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
 ├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
 │ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
 └──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
 ```
 Note that the `count` metric shows how many rows of the original input data contributed to each final rolled-up row.
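One practical consequence is that the number of stored rows and the number of ingested events differ. The short sketch below uses the `count` values copied from the query output in this tutorial to recover both figures; the variable names are illustrative only.

```python
# `count` values copied from the five rolled-up rows in the query output.
rolled_up_counts = [3, 2, 1, 2, 1]

stored_rows = len(rolled_up_counts)  # rows kept after roll-up
raw_rows = sum(rolled_up_counts)     # original input events
rollup_ratio = raw_rows / stored_rows

print(stored_rows, raw_rows, rollup_ratio)
```

For the same reason, summing the `count` column in a query gives the number of original events, whereas counting rows gives the number of rolled-up rows.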