Prepare the translation of the Roll-up section

@@ -14,7 +14,7 @@
 - [Load data from Apache Kafka](tutorials/tutorial-kafka.md)
 - [Load data from Apache Hadoop](tutorials/tutorial-batch-hadoop.md)
 - [Query data](tutorials/tutorial-query.md)
-- [Rollback](tutorials/tutorial-rollup.md)
+- [Roll-up](tutorials/tutorial-rollup.md)
 - [Configure data retention](tutorials/tutorial-retention.md)
 - [Update existing data](tutorials/tutorial-update-data.md)
 - [Compact segments](tutorials/tutorial-compaction.md)

@@ -1,179 +0,0 @@

<!-- toc -->

## Roll-up

Apache Druid can summarize raw data at ingestion time through a process called roll-up. Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.

This tutorial will discuss the results of applying roll-up to an example dataset.

It assumes you have already downloaded Druid as described in [Single-server deployment](../GettingStarted/chapter-3.md) and have it running on your local machine.

It will also be helpful to have completed the [Loading a local file](tutorial-batch.md) and [Querying data](./chapter-4.md) tutorials.

### Example data

For this tutorial, we'll use a small sample of network flow event data, representing the packet and byte counts for traffic from a source to a destination IP address that occurred within a particular time period.

```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
{"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
{"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
```

A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.

We'll ingest this data using the ingestion spec located at `quickstart/tutorial/rollup-index.json`.

```json
{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rollup-tutorial",
      "dimensionsSpec" : {
        "dimensions" : [
          "srcIP",
          "dstIP"
        ]
      },
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "week",
        "queryGranularity" : "minute",
        "intervals" : ["2018-01-01/2018-01-03"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial",
        "filter" : "rollup-data.json"
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}
```

Roll-up is enabled by setting `"rollup" : true` in the `granularitySpec`.

Note that we have `srcIP` and `dstIP` defined as **dimensions**, the `packets` and `bytes` columns defined as **metrics** of type `longSum`, and the `queryGranularity` configuration set to `minute`.

We will see how these definitions are used after we load this data.

### Load the example data

From the Druid root directory, run the following command:

```bash
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
```
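
If you would also like to check on the submitted ingestion task from SQL, the `sys.tasks` system table can be queried from `dsql`. This is only a supplementary sketch: `bin/post-index-task` already waits for the task and reports its status, and the exact column set of `sys.tasks` may vary by Druid version.

```sql
-- Check the status of ingestion tasks for this datasource.
SELECT "task_id", "datasource", "status"
FROM sys.tasks
WHERE "datasource" = 'rollup-tutorial'
ORDER BY "created_time" DESC;
```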

After the script completes, we will query the data.

### Query the example data

Now run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested.

```bash
$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql> select * from "rollup-tutorial";
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
│ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
│ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
│ 2018-01-02T21:33:00.000Z │ 100288 │     2 │ 8.8.8.8 │     161 │ 7.7.7.7 │
│ 2018-01-02T21:35:00.000Z │   2818 │     1 │ 8.8.8.8 │      12 │ 7.7.7.7 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
Retrieved 5 rows in 1.18s.

dsql>
```
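
Because roll-up pre-aggregates one stored row per minute and per `{srcIP, dstIP}` pair, the data can still be aggregated further at query time. For example, a sketch of a per-IP-pair total over the whole dataset (the aliases `total_bytes` and `total_packets` are just illustrative):

```sql
-- Query-time aggregation on top of the rolled-up rows:
-- totals per source/destination pair across all minutes.
SELECT
  "srcIP",
  "dstIP",
  SUM("bytes")   AS total_bytes,
  SUM("packets") AS total_packets
FROM "rollup-tutorial"
GROUP BY "srcIP", "dstIP";
```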

Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:

```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
```

These three rows have been rolled up into the following single row:

```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```

The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}`, with sum aggregations applied to the metric columns `{packets, bytes}`.

Before the grouping takes place, the timestamps of the original input data are bucketed (floored) to the minute, due to the `"queryGranularity" : "minute"` setting in the ingestion spec.
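
Conceptually, the aggregation that roll-up performs at ingestion time resembles the following Druid SQL query. This is only a sketch: roll-up happens inside the indexing task rather than through SQL, and `raw_events` is a hypothetical table standing in for the un-rolled-up input.

```sql
-- Hypothetical illustration of the grouping and summing that roll-up applies.
SELECT
  FLOOR("__time" TO MINUTE) AS "__time",   -- bucketing from "queryGranularity" : "minute"
  "srcIP",
  "dstIP",
  COUNT(*)       AS "count",               -- the "count" metric
  SUM("packets") AS "packets",             -- longSum metric
  SUM("bytes")   AS "bytes"                -- longSum metric
FROM raw_events                            -- hypothetical un-rolled-up input
GROUP BY FLOOR("__time" TO MINUTE), "srcIP", "dstIP";
```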

Likewise, the two events that occurred during `2018-01-01T01:02` have been rolled up:

```json
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
```

```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```

For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up took place, because it was the only event that occurred during `2018-01-01T01:03`:

```json
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
```

```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```

Note that the `count` metric shows how many rows of the original input data contributed to the final "rolled up" row.
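
Since every stored row carries this `count`, the total number of raw input events remains recoverable after roll-up. A quick check you can run in `dsql` (the aliases are illustrative):

```sql
-- Compare the number of stored (rolled-up) rows with the number of
-- original input rows they represent.
SELECT
  COUNT(*)     AS stored_rows,   -- rows kept after roll-up (5 here)
  SUM("count") AS raw_rows       -- original input rows (9 here)
FROM "rollup-tutorial";
```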

@@ -1,29 +1,4 @@

---
id: tutorial-rollup
title: "Tutorial: Roll-up"
sidebar_label: "Roll-up"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements. See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership. The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License. You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied. See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
-->

# Roll-up

Apache Druid can summarize raw data at ingestion time using a process we refer to as "roll-up". Roll-up is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.

This tutorial will demonstrate the effects of roll-up on an example dataset.

@@ -194,3 +169,171 @@ For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no roll-up too

Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row.