---
id: tutorial-rollup
title: Aggregate data with rollup
sidebar_label: Aggregate data with rollup
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Apache Druid can summarize raw data at ingestion time using a process we refer to as "rollup". Rollup is a first-level aggregation operation over a selected set of columns that reduces the size of stored data.

This tutorial will demonstrate the effects of rollup on an example dataset.
For this tutorial, we'll assume you've already downloaded Druid as described in
the [single-machine quickstart](index.md) and have it running on your local machine.

It will also be helpful to have finished the [Load a file](../tutorials/tutorial-batch.md) and [Query data](../tutorials/tutorial-query.md) tutorials.
## Example data
For this tutorial, we'll use a small sample of network flow event data, representing packet and byte counts for traffic from a source to a destination IP address that occurred within a particular second.
```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
{"timestamp":"2018-01-02T21:33:14Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":38,"bytes":6289}
{"timestamp":"2018-01-02T21:33:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":123,"bytes":93999}
{"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
```
A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.
We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`.
```json
{
"type" : "index_parallel",
"spec" : {
"dataSchema" : {
"dataSource" : "rollup-tutorial",
"dimensionsSpec" : {
"dimensions" : [
"srcIP",
"dstIP"
]
},
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"metricsSpec" : [
{ "type" : "count", "name" : "count" },
{ "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
{ "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "week",
"queryGranularity" : "minute",
"intervals" : ["2018-01-01/2018-01-03"],
"rollup" : true
}
},
"ioConfig" : {
"type" : "index_parallel",
"inputSource" : {
"type" : "local",
"baseDir" : "quickstart/tutorial",
"filter" : "rollup-data.json"
},
"inputFormat" : {
"type" : "json"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index_parallel",
"partitionsSpec": {
"type": "dynamic"
},
"maxRowsInMemory" : 25000
}
}
}
```
Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`.

Note that `srcIP` and `dstIP` are defined as dimensions, `longSum` metrics are defined for the `packets` and `bytes` columns, and `queryGranularity` has been defined as `minute`.

We will see how these definitions are used after we load this data.
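
Conceptually, the rollup applied at ingestion time behaves like the SQL aggregation sketched below. This is an illustration only: `raw_input` is a hypothetical name standing in for the raw data, which Druid never exposes as a table.

```sql
-- Illustration only: "raw_input" is a hypothetical stand-in for the raw data.
SELECT
  FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,  -- queryGranularity: minute
  "srcIP",                                             -- dimension
  "dstIP",                                             -- dimension
  COUNT(*)       AS "count",                           -- count metric
  SUM("packets") AS "packets",                         -- longSum metric
  SUM("bytes")   AS "bytes"                            -- longSum metric
FROM raw_input
GROUP BY 1, 2, 3
```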
## Load the example data
From the apache-druid-{{DRUIDVERSION}} package root, run the following command:
```bash
bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
```
After the script completes, we will query the data.
## Query the example data
Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested.
```bash
$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql> select * from "rollup-tutorial";
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │
│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │
│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │
│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
Retrieved 5 rows in 1.18s.
dsql>
```
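
Because the stored rows are themselves partial aggregates, they can be aggregated further at query time. For example, we can total the traffic per source IP:

```sql
SELECT
  "srcIP",
  SUM("bytes")   AS total_bytes,
  SUM("packets") AS total_packets,
  SUM("count")   AS input_rows
FROM "rollup-tutorial"
GROUP BY "srcIP"
```

Based on the rolled-up rows above, this should return 412401 bytes, 750 packets, and 6 input rows for `1.1.1.1`, and 103106 bytes, 173 packets, and 3 input rows for `7.7.7.7`.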
Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:
```json
{"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
{"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
{"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
```
These three rows have been "rolled up" into the following row:
```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
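You can verify the aggregation from the original values: 20 + 255 + 11 = 286 packets and 9024 + 21133 + 5780 = 35937 bytes, with the `count` of 3 recording how many input rows were combined.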
The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}`, with sum aggregations applied to the metric columns `packets` and `bytes`.
Before the grouping occurs, the timestamps of the original input data are bucketed (floored) to the minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec.
Likewise, these two events that occurred during `2018-01-01T01:02` have been rolled up:
```json
{"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289}
{"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971}
```
```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no rollup took place, because this was the only event that occurred during `2018-01-01T01:03` :
```json
{"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204}
```
```bash
┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
│ __time │ bytes │ count │ dstIP │ packets │ srcIP │
├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │
└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
```
Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row.
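
As a consequence, `COUNT(*)` at query time returns the number of stored (rolled-up) rows rather than the number of original input events. To recover the original event count, sum the `count` metric instead:

```sql
-- With this example data, expect 5 rolled-up rows and 9 input rows.
SELECT
  COUNT(*)     AS rolled_up_rows,
  SUM("count") AS input_rows
FROM "rollup-tutorial"
```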