druid-docs-cn/tutorials/chapter-11.md

5.5 KiB
Raw Permalink Blame History

输入数据变换

本教程将演示如何使用转换规范在接收期间过滤和转换输入数据

本教程我们假设您已经按照单服务器部署中描述下载了Druid并运行在本地机器上。

完成加载本地文件数据查询roll-up部分内容也是非常有帮助的

样例数据

我们在 quickstart/tutorial/transform-data.json 中包括了样例数据,为了方便我们展示一下:

{"timestamp":"2018-01-01T07:01:35Z","animal":"octopus",  "location":1, "number":100}
{"timestamp":"2018-01-01T05:01:35Z","animal":"mongoose", "location":2,"number":200}
{"timestamp":"2018-01-01T06:01:35Z","animal":"snake", "location":3, "number":300}
{"timestamp":"2018-01-01T01:01:35Z","animal":"lion", "location":4, "number":300}

使用转换规范加载数据

我们将使用以下规范摄取示例数据,该规范演示了转换规范的使用:

{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "transform-tutorial",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec" : {
        "dimensions" : [
          "animal",
          { "name": "location", "type": "long" }
        ]
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "number", "fieldName" : "number" },
        { "type" : "longSum", "name" : "triple-number", "fieldName" : "triple-number" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "week",
        "queryGranularity" : "minute",
        "intervals" : ["2018-01-01/2018-01-03"],
        "rollup" : true
      },
      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "name": "animal",
            "expression": "concat('super-', animal)"
          },
          {
            "type": "expression",
            "name": "triple-number",
            "expression": "number * 3"
          }
        ],
        "filter": {
          "type":"or",
          "fields": [
            { "type": "selector", "dimension": "animal", "value": "super-mongoose" },
            { "type": "selector", "dimension": "triple-number", "value": "300" },
            { "type": "selector", "dimension": "location", "value": "3" }
          ]
        }
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial",
        "filter" : "transform-data.json"
      },
      "inputFormat" : {
        "type" :"json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}

在转换规范中,我们有两个表达式转换:

  • super-animal: 在 "animal" 列的值前加上"super-"。这将用转换后的版本覆盖 animal 列,因为转换的名称是 animal
  • triple-number: 将数字列乘以3, 这将创建一个新的三位数列。注意,我们同时接收原始列和转换列

另外我们有一个包含三个子句的OR过滤器

  • super-animal 值匹配"super-mongoose"
  • triple-number 值匹配300
  • location值匹配3

这个过滤器选择前3行它将排除输入数据中的最后一个"lion"行。请注意,过滤器是在转换之后应用的。

现在提交位于 quickstart/tutorial/transform-index.json 的任务:

bin/post-index-task --file quickstart/tutorial/transform-index.json --url http://localhost:8081

查询已转换的数据

运行 bin/dsql 提交 select * from "transform-tutorial" 查询来看摄入的数据:

dsql> select * from "transform-tutorial";
┌──────────────────────────┬────────────────┬───────┬──────────┬────────┬───────────────┐
 __time                    animal          count  location  number  triple-number 
├──────────────────────────┼────────────────┼───────┼──────────┼────────┼───────────────┤
 2018-01-01T05:01:00.000Z  super-mongoose      1         2     200            600 
 2018-01-01T06:01:00.000Z  super-snake         1         3     300            900 
 2018-01-01T07:01:00.000Z  super-octopus       1         1     100            300 
└──────────────────────────┴────────────────┴───────┴──────────┴────────┴───────────────┘
Retrieved 3 rows in 0.03s.

"Lion"列被丢弃,animal列被转换,我们既有原始列,也有转换后的数字列。