Merge pull request #37 from cwiki-us-docs/feature/tutorial-batch

Add formatting for the data ingestion section
This commit is contained in:
YuCheng Hu 2021-08-05 18:45:06 -04:00 committed by GitHub
commit 75fe9df234
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
21 changed files with 8063 additions and 3039 deletions

ingestion/compaction.md Normal file

@ -0,0 +1,231 @@
---
id: compaction
title: "Compaction"
description: "Defines compaction and automatic compaction (auto-compaction or autocompaction) for segment optimization. Use cases and strategies for compaction. Describes compaction task configuration."
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Query performance in Apache Druid depends on optimally sized segments. Compaction is one strategy you can use to optimize segment size for your Druid database. Compaction tasks read an existing set of segments for a given time interval and combine the data into a new "compacted" set of segments. In some cases the compacted segments are larger, but there are fewer of them. In other cases the compacted segments may be smaller. Compaction tends to increase performance because optimized segments require less per-segment processing and less memory overhead for ingestion and for querying paths.
## Compaction strategies
There are several cases where you should consider compaction for segment optimization:
- With streaming ingestion, data can arrive out of chronological order, creating lots of small segments.
- When you append data using `appendToExisting` for [native batch](native-batch.md) ingestion, creating suboptimal segments.
- When you use `index_parallel` for parallel batch indexing and the parallel ingestion tasks create many small segments.
- When a misconfigured ingestion task creates oversized segments.
By default, compaction does not modify the underlying data of the segments. However, there are cases when you may want to modify data during compaction to improve query performance:
- If, after ingestion, you realize that data for the time interval is sparse, you can use compaction to increase the segment granularity.
- Over time you may no longer need fine-grained granularity for older data, so you can use compaction to change older segments to a coarser query granularity. This reduces the storage space required for older data. For example, you can go from `minute` to `hour`, or `hour` to `day`. You cannot go from a coarser granularity to a finer granularity.
- You can change the dimension order to improve sorting and reduce segment size.
- You can remove unused columns in compaction or implement an aggregation metric for older data.
- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](index.md#perfect-rollup-vs-best-effort-rollup).
Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
## Types of compaction
You can configure the Druid Coordinator to perform automatic compaction, also called auto-compaction, for a datasource. Using a segment search policy, the Coordinator periodically identifies segments for compaction, starting with the newest and moving toward the oldest. When it discovers segments that have not been compacted, or segments that were compacted with a different or changed spec, it submits a compaction task for those segments and only those segments.
Automatic compaction works in most use cases and should be your first option. To learn more about automatic compaction, see [Compacting Segments](../design/coordinator.md#compacting-segments).
In cases where you require more control over compaction, you can manually submit compaction tasks. For example:
- Automatic compaction is running into the limit of task slots available to it, so tasks are waiting for previous automatic compaction tasks to complete. Manual compaction can use all available task slots, therefore you can complete compaction more quickly by submitting more concurrent tasks for more intervals.
- You want to force compaction for a specific time range or you want to compact data out of chronological order.
See [Setting up a manual compaction task](#setting-up-manual-compaction) for more about manual compaction tasks.
## Data handling with compaction
During compaction, Druid overwrites the original set of segments with the compacted set. Druid also locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
For compaction tasks, `dropExisting` in `ioConfig` can be set to `true` for Druid to drop (mark as unused) all existing segments fully contained by the interval of the compaction task. For an example of why this is important, see the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations). WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the compaction task interval.
If an ingestion task needs to write data to a segment for a time interval locked for compaction, by default the ingestion task supersedes the compaction task and the compaction task fails without finishing. For manual compaction tasks, you can adjust the input spec interval to avoid conflicts between ingestion and compaction. For automatic compaction, you can set the `skipOffsetFromLatest` key to adjust the auto-compaction starting point from the current time, which reduces the chance of conflicts between ingestion and compaction. See [Compaction dynamic configuration](../configuration/index.md#compaction-dynamic-configuration) for more information. Another option is to set the compaction task to a higher priority than the ingestion task.
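For reference, a minimal sketch of an automatic compaction configuration that backs compaction off from the latest data, submitted through the Coordinator's compaction configuration API; the datasource name, offset, and priority values here are illustrative assumptions:
```json
{
  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "PT1H",
  "taskPriority": 25
}
```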
### Segment granularity handling
Unless you modify the segment granularity in the [granularity spec](#compaction-granularity-spec), Druid attempts to retain the granularity for the compacted segments. When segments have different segment granularities with no overlap in interval, Druid creates a separate compaction task for each to retain the segment granularity in the compacted segment.
If segments have different segment granularities before compaction but there is some overlap in interval, Druid attempts to find the start and end of the overlapping interval and uses the closest segment granularity level for the compacted segment. For example, consider two overlapping segments: segment "A" for the interval 01/01/2021-01/02/2021 with `day` granularity, and segment "B" for the interval 01/01/2021-02/01/2021 with `month` granularity. Druid attempts to combine and compact the overlapping segments. In this example, the earliest start time of the two segments is 01/01/2021 and the latest end time is 02/01/2021. Druid compacts the segments together even though they have different segment granularities, and uses `month` segment granularity for the newly compacted segment even though segment A's original segment granularity was `day`.
### Query granularity handling
Unless you modify the query granularity in the [granularity spec](#compaction-granularity-spec), Druid retains the query granularity for the compacted segments. If segments have different query granularities before compaction, Druid chooses the finest level of granularity for the resulting compacted segment. For example if a compaction task combines two segments, one with day query granularity and one with minute query granularity, the resulting segment uses minute query granularity.
> In Apache Druid 0.21.0 and prior, Druid sets the granularity for compacted segments to the default granularity of `NONE` regardless of the query granularity of the original segments.
If you configure the query granularity in compaction to go from a finer granularity like month to a coarser query granularity like year, then Druid overshadows the original segments with segments of the coarser granularity. Because the new segments have a coarser granularity, running a kill task to remove the overshadowed segments for those intervals will cause you to permanently lose the finer-granularity data.
### Dimension handling
Apache Druid supports schema changes. Therefore, dimensions can be different across segments even if they are part of the same datasource. See [Different schemas among segments](../design/segments.md#different-schemas-among-segments). If the input segments have different dimensions, the resulting compacted segment includes all dimensions of the input segments.
Even when the input segments have the same set of dimensions, the dimension order or the data types of dimensions can be different. In that case, the dimensions of more recent segments take precedence over those of older segments, in terms of both data types and ordering, because more recent segments are more likely to have the preferred order and data types.
If you want to control dimension ordering or ensure specific values for dimension types, you can configure a custom `dimensionsSpec` in the compaction task spec.
### Rollup
Druid only rolls up the output segment when `rollup` is set for all input segments.
See [Roll-up](../ingestion/index.md#rollup) for more details.
You can check whether your segments are rolled up by using [Segment Metadata Queries](../querying/segmentmetadataquery.md#analysistypes).
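For example, a segment metadata query that reports whether the segments in an interval are rolled up might look like the following sketch; the datasource name and interval are illustrative:
```json
{
  "queryType": "segmentMetadata",
  "dataSource": "wikipedia",
  "intervals": ["2017-01-01/2018-01-01"],
  "analysisTypes": ["rollup"]
}
```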
## Setting up manual compaction
To perform a manual compaction, you submit a compaction task. Compaction tasks merge all segments for the defined interval according to the following syntax:
```json
{
  "type": "compact",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "ioConfig": <IO config>,
  "dimensionsSpec": <custom dimensionsSpec>,
  "metricsSpec": <custom metricsSpec>,
  "tuningConfig": <parallel indexing task tuningConfig>,
  "granularitySpec": <compaction task granularitySpec>,
  "context": <task context>
}
```
|Field|Description|Required|
|-----|-----------|--------|
|`type`|Task type. Should be `compact`|Yes|
|`id`|Task id|No|
|`dataSource`|Data source name to compact|Yes|
|`ioConfig`|I/O configuration for compaction task. See [Compaction I/O configuration](#compaction-io-configuration) for details.|Yes|
|`dimensionsSpec`|Custom dimensions spec. The compaction task uses the specified dimensions spec if it exists instead of generating one.|No|
|`metricsSpec`|Custom metrics spec. The compaction task uses the specified metrics spec rather than generating one.|No|
|`segmentGranularity`|When set, the compaction task changes the segment granularity for the given interval. Deprecated. Use `granularitySpec`. |No.|
|`tuningConfig`|[Parallel indexing task tuningConfig](native-batch.md#tuningconfig). Note that your tuning config cannot contain a non-zero value for `awaitSegmentAvailabilityTimeoutMillis` because it is not supported by compaction tasks at this time.|No|
|`context`|[Task context](./tasks.md#context)|No|
|`granularitySpec`|Custom `granularitySpec` to describe the `segmentGranularity` and `queryGranularity` for the compacted segments. See [Compaction granularitySpec](#compaction-granularity-spec).|No|
> Note: Use `granularitySpec` over `segmentGranularity` and only set one of these values. If you specify different values for these in the same compaction spec, the task fails.
To control the number of result segments per time chunk, you can set [maxRowsPerSegment](../configuration/index.md#compaction-dynamic-configuration) or [numShards](../ingestion/native-batch.md#tuningconfig).
> You can run multiple compaction tasks in parallel. For example, if you want to compact the data for a year, you are not limited to running a single task for the entire year. You can run 12 compaction tasks with month-long intervals.
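For instance, a manual compaction task might cap segment size through its tuning config. The following is a sketch assuming the parallel-task form of `tuningConfig` with a `dynamic` partitions spec; the row limit shown is illustrative:
```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "dynamic",
    "maxRowsPerSegment": 5000000
  }
}
```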
A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters. For example, its `inputSource` is always the [DruidInputSource](native-batch.md#druid-input-source), and `dimensionsSpec` and `metricsSpec` include all dimensions and metrics of the input segments by default.
A compaction task exits without doing anything and issues a failure status code in either of these cases:
- The interval you specify has no data segments loaded.
- The interval you specify is empty.
Note that the metadata between input segments and the resulting compacted segments may differ if the metadata among the input segments differs as well. If all input segments have the same metadata, however, the resulting output segment will have the same metadata as all input segments.
### Example compaction task
The following JSON illustrates a compaction task to compact _all segments_ within the interval `2017-01-01/2018-01-01` and create new segments:
```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2017-01-01/2018-01-01"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "hour"
  }
}
```
Note that the `granularitySpec` in this task sets the segment granularity of the compacted segments to `day` and the query granularity to `hour`. If you omit `granularitySpec`, Druid retains the original segment granularity unchanged when compaction is complete.
### Compaction I/O configuration
The compaction `ioConfig` requires specifying `inputSpec` as follows:
|Field|Description|Default|Required?|
|-----|-----------|-------|--------|
|`type`|Task type. Should be `compact`|none|Yes|
|`inputSpec`|Input specification|none|Yes|
|`dropExisting`|If `true`, the compaction task drops (marks as unused) all existing segments fully contained by either the `interval` in the `interval` type `inputSpec` or the umbrella interval of the `segments` in the `segment` type `inputSpec` when the task publishes new compacted segments. If compaction fails, Druid does not drop or mark unused any segments. WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the compaction task interval.|false|no|
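As a sketch, an `ioConfig` that opts into this beta behavior might look like the following; the interval is illustrative:
```json
"ioConfig": {
  "type": "compact",
  "inputSpec": {
    "type": "interval",
    "interval": "2020-01-01/2021-01-01"
  },
  "dropExisting": true
}
```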
There are two supported `inputSpec`s for now.
The interval `inputSpec` is:
|Field|Description|Required|
|-----|-----------|--------|
|`type`|Task type. Should be `interval`|Yes|
|`interval`|Interval to compact|Yes|
The segments `inputSpec` is:
|Field|Description|Required|
|-----|-----------|--------|
|`type`|Task type. Should be `segments`|Yes|
|`segments`|A list of segment IDs|Yes|
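For illustration, an `ioConfig` using the segments `inputSpec` might look like the following sketch; the segment IDs shown are hypothetical:
```json
"ioConfig": {
  "type": "compact",
  "inputSpec": {
    "type": "segments",
    "segments": [
      "wikipedia_2017-01-01T00:00:00.000Z_2017-01-02T00:00:00.000Z_2021-01-01T00:00:00.000Z",
      "wikipedia_2017-01-02T00:00:00.000Z_2017-01-03T00:00:00.000Z_2021-01-01T00:00:00.000Z"
    ]
  }
}
```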
### Compaction granularity spec
You can optionally use the `granularitySpec` object to configure the segment granularity and the query granularity of the compacted segments. The syntax is as follows:
```json
"type": "compact",
"id": <task_id>,
"dataSource": <task_datasource>,
...
"granularitySpec": {
  "segmentGranularity": <time_period>,
  "queryGranularity": <time_period>
},
...
```
`granularitySpec` takes the following keys:
|Field|Description|Required|
|-----|-----------|--------|
|`segmentGranularity`|Time chunking period for the segment granularity. Defaults to 'null', which preserves the original segment granularity. Accepts all [Query granularity](../querying/granularities.md) values.|No|
|`queryGranularity`|Time chunking period for the query granularity. Defaults to 'null', which preserves the original query granularity. Accepts all [Query granularity](../querying/granularities.md) values. Not supported for automatic compaction.|No|
For example, to set the segment granularity to "day" and the query granularity to "hour":
```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2017-01-01/2018-01-01"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "hour"
  }
}
```
## Learn more
See the following topics for more information:
- [Segment optimization](../operations/segment-optimization.md) for guidance to determine if compaction will help in your case.
- [Compacting Segments](../design/coordinator.md#compacting-segments) for more on automatic compaction.
- [Compaction Configuration API](../operations/api-reference.md#compaction-configuration) and [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for automatic compaction configuration information.

ingestion/data-formats.md Normal file

File diff suppressed because it is too large

@ -1,191 +0,0 @@
## Data management
### Schema changes
The schema of a datasource can change at any time, and Apache Druid supports different schemas among segments.
#### Replacing segment files
Druid uses the datasource, time interval, version number, and partition number to uniquely identify a segment. The partition number is only visible in the segment ID when multiple segments are created for a given time granularity. For example, if you have hourly segments but the data volume for one hour exceeds the capacity of a single segment, you can create multiple segments for that same hour. These segments share the same datasource, time interval, and version number, but have linearly increasing partition numbers.
```json
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-01/2015-01-02_v1_1
foo_2015-01-01/2015-01-02_v1_2
```
In the example segments above, `dataSource` = `foo`, `interval` = `2015-01-01/2015-01-02`, version = `v1`, and partitionNum = `0`. If at some later point you reindex the data with a new schema, the newly created segments will have a higher version ID.
```json
foo_2015-01-01/2015-01-02_v2_0
foo_2015-01-01/2015-01-02_v2_1
foo_2015-01-01/2015-01-02_v2_2
```
Druid batch indexing (either Hadoop-based or IndexTask-based) guarantees atomic updates on an interval-by-interval basis. In our example, until all `v2` segments for `2015-01-01/2015-01-02` are loaded into the Druid cluster, queries use only the `v1` segments. Once all `v2` segments are loaded and queryable, all queries ignore the `v1` segments and switch to the `v2` segments. Shortly afterwards, the `v1` segments are unloaded from the cluster.
Note that updates spanning multiple segment intervals are only atomic within each interval; they are not atomic across the whole update. For example, suppose you have the following segments:
```json
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v1_1
foo_2015-01-03/2015-01-04_v1_2
```
The `v2` segments are loaded into the cluster as soon as they are built and replace the `v1` segments for the overlapping time periods. Before all `v2` segments are fully loaded, the cluster may contain a mix of `v1` and `v2` segments:
```json
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v2_1
foo_2015-01-03/2015-01-04_v1_2
```
In this case, queries may hit a mixture of `v1` and `v2` segments.
#### Different schemas among segments
Druid segments for the same datasource may have different schemas. If a string column (dimension) exists in one segment but not in another, queries involving both segments still work. Queries against the segment that lacks the dimension behave as if the dimension contains only null values. Similarly, if one segment has a numeric column (metric) and another does not, queries on the segment missing the metric generally "do the right thing": aggregations over the missing metric behave as if the metric were missing.
### Compaction and reindexing
Compaction is an overwrite operation that reads an existing set of segments, combines them into a new set with larger but fewer segments, and overwrites the original set with the new compacted set, without changing the data stored inside.
For performance reasons, it is sometimes beneficial to compact a set of segments into a set of larger but fewer segments, because there is per-segment processing and memory overhead on both the ingestion and querying paths.
A compaction task merges all segments for a given time interval. The syntax is:
```json
{
"type": "compact",
"id": <task_id>,
"dataSource": <task_datasource>,
"ioConfig": <IO config>,
"dimensionsSpec": <custom dimensionsSpec>,
"metricsSpec": <custom metricsSpec>,
"segmentGranularity": <segment granularity after compaction>,
"tuningConfig": <parallel indexing task tuningConfig>,
"context": <task context>
}
```
| Field | Description | Required |
|-|-|-|
| `type` | Task type; should be `compact` | Yes |
| `id` | Task ID | No |
| `dataSource` | Name of the datasource to compact | Yes |
| `ioConfig` | The `ioConfig` for the compaction task; see [Compaction IOConfig](#compaction-ioconfig) for details | Yes |
| `dimensionsSpec` | Custom `dimensionsSpec`. The compaction task uses this dimensionsSpec if it exists, instead of generating one. See below for more details. | No |
| `metricsSpec` | Custom `metricsSpec`. The compaction task uses this metricsSpec if specified, rather than generating one. | No |
| `segmentGranularity` | If set, the compaction task changes the segment granularity for the given interval. See `segmentGranularity` of [granularitySpec](ingestion.md#granularityspec) for details. See the table below for the behavior. | No |
| `tuningConfig` | [Parallel indexing task tuningConfig](native.md#tuningConfig) | No |
| `context` | [Task context](taskrefer.md#上下文参数) | No |
An example compaction task is:
```json
{
"type" : "compact",
"dataSource" : "wikipedia",
"ioConfig" : {
"type": "compact",
"inputSpec": {
"type": "interval",
"interval": "2017-01-01/2018-01-01"
}
}
}
```
This compaction task reads *all segments* for the time interval `2017-01-01/2018-01-01` and produces new segments. Because `segmentGranularity` is null, the original segment granularity is preserved after compaction. To control the number of result segments per time chunk, you can set [`maxRowsPerSegment`](../configuration/human-readable-byte.md#Coordinator) or [`numShards`](native.md#tuningconfig). Note that you can run multiple compaction tasks at the same time; for example, you can run 12 compaction tasks, one per month, instead of a single task for the whole year.
A compaction task internally generates an `index` task spec for performing the compaction work with some fixed parameters. For example, its `inputSource` is always the [DruidInputSource](native.md#Druid输入源), and `dimensionsSpec` and `metricsSpec` include all dimensions and metrics of the input segments by default.
A compaction task exits with a failure status code, without doing anything, if the specified time interval has no data segments loaded (or if the specified time interval is empty).
The output segments can have different metadata from the input segments unless all input segments have the same metadata.
* Dimensions: Because Apache Druid supports schema changes, dimensions can differ across segments even if they are part of the same datasource. If the input segments have different dimensions, the output segment essentially includes all dimensions of the input segments. However, even if the input segments have the same set of dimensions, the dimension order or the data types of the dimensions can still differ. For example, the data type of some dimensions may change from `string` to a primitive type, or the dimension order may change for better locality. In such cases, the dimensions of more recent segments precede those of older segments in terms of data types and ordering, because more recent segments are more likely to have the desired new order and data types. If you want to use your own ordering and types, you can specify a custom `dimensionsSpec` in the compaction task spec.
* Roll-up: The output segment is rolled up only when `rollup` is set for all input segments. See [rollup](ingestion.md#rollup) for details. You can use a [segment metadata query](../querying/segmentMetadata.md) to check whether a segment has been rolled up.
#### Compaction IOConfig
The compaction IOConfig requires specifying `inputSpec` as follows.
| Field | Description | Required |
|-|-|-|
| `type` | Task type; always `compact` | Yes |
| `inputSpec` | Input specification | Yes |
There are currently two supported types of `inputSpec`:
The interval `inputSpec`:
| Field | Description | Required |
|-|-|-|
| `type` | Task type; always `interval` | Yes |
| `interval` | The time interval to compact | Yes |
The segments `inputSpec`:
| Field | Description | Required |
|-|-|-|
| `type` | Task type; always `segments` | Yes |
| `segments` | A list of segment IDs | Yes |
### Adding new data
Druid can insert new data into an existing datasource by appending new segments to the existing segment set. It can also add new data by merging an existing set of segments with new data and overwriting the original set.
Druid does not support updating individual records by primary key.
### Updating existing data
After you have ingested data for some time range into a datasource and created Apache Druid segments, you may want to make changes to the ingested data. There are several ways to do this.
#### Using lookups
If you have dimensions whose values need frequent updates, try [lookups](../querying/lookups.md) first. A typical use case for lookups is storing an ID dimension in Druid segments and mapping the ID dimension to a human-readable string value that may need to be updated periodically.
#### Reingesting data
If lookup-based techniques are not sufficient, you need to reindex the data for the time chunks you want to update into Druid. This can be done in overwrite mode (the default mode) with one of the [batch ingestion](ingestion.md#批量摄取) methods. It can also be done with [streaming ingestion](ingestion.md#流式摄取), provided you first delete the data for the relevant time chunks.
If you reingest in batch mode, Druid's atomic update mechanism means that queries transition seamlessly from the old data to the new data.
We recommend keeping a copy of your raw data in case you ever need to reingest it.
#### Using Hadoop-based ingestion
This section assumes you understand how to do batch ingestion with Hadoop. See [Hadoop batch ingestion](hadoop.md) for details. Hadoop batch ingestion can be used both to reindex data and to ingest data incrementally.
Druid uses the `inputSpec` in the `ioConfig` to know where the data to ingest is located and how to read it. For simple Hadoop batch ingestion, the `static` or `granularity` spec types let you read data stored in deep storage.
There are other types of `inputSpec` that enable reindexing data and delta ingestion.
#### Reindexing with native batch ingestion
This section assumes you understand how to perform batch ingestion without Hadoop using [native batch indexing](native.md), which uses an `inputSource` to know where and how to read the input data. The [`DruidInputSource`](native.md#Druid输入源) can be used to read data from segments stored inside Druid. Note that the **IndexTask** is only intended for prototyping because it has to do all processing inside a single process and cannot scale. For production scenarios processing more than 1 GB of data, please use Hadoop batch ingestion.
### Deleting data
Druid supports permanently deleting segments that are in the "unused" state (see [segment lifecycle](../design/Design.md#段生命周期) in the architecture design).
A kill task removes unused segments within a specified time interval from the metadata store and from deep storage.
See [kill tasks](taskrefer.md#kill) for more details.
Permanently deleting a segment requires two steps (a sketch with example API calls follows below):
1. The segment must first be marked as "unused". This happens when a user manually disables the segment through the Coordinator API.
2. After the segment has been marked as "unused", a kill task deletes any "unused" segments from Druid's metadata store and from deep storage.
For documentation on data retention rules, see [Data retention](../operations/retainingOrDropData.md).
For documentation on disabling segments through the Coordinator API, see [Coordinator datasource API](../operations/api.md#coordinator).
A tutorial on deleting data is included in this documentation; see the [data deletion tutorial](../tutorials/chapter-9.md).
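A sketch of both steps with `curl`, assuming hypothetical host names, datasource, segment ID, and interval:
```bash
# Step 1: mark a segment as unused through the Coordinator API
curl -X DELETE "http://<coordinator-host>:<port>/druid/coordinator/v1/datasources/wikipedia/segments/<segment-id>"

# Step 2: submit a kill task to the Overlord to permanently remove unused segments in the interval
curl -X POST -H 'Content-Type: application/json' \
  -d '{"type": "kill", "dataSource": "wikipedia", "interval": "2015-01-01/2015-02-01"}' \
  "http://<overlord-host>:<port>/druid/indexer/v1/task"
```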
### Kill task
A **kill task** deletes all information about a segment and removes it from deep storage. Segments to be killed must be in the "unused" state (`used==0`) in Druid's segment table. The syntax is:
```json
{
"type": "kill",
"id": <task_id>,
"dataSource": <task_datasource>,
"interval" : <all_segments_in_this_interval_will_die!>,
"context": <task context>
}
```
### Data retention
Druid supports retention rules, which define the time intervals for which data should be retained and the intervals for which data should be dropped.
Druid also supports splitting Historical processes into tiers, and retention rules can be configured to assign data for specific time intervals to specific tiers.
These features are useful for performance/cost management; a common scenario is splitting Historical processes into a "hot" tier and a "cold" tier.
For details, see [load rules](../operations/retainingOrDropData.md).


@ -1,3 +1,117 @@
---
id: faq
title: "Ingestion troubleshooting FAQ"
sidebar_label: "Troubleshooting FAQ"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## Batch Ingestion
If you are trying to batch load historical data but no events are being loaded, make sure the interval of your ingestion spec actually encapsulates the interval of your data. Events outside this interval are dropped.
## Druid ingested my events but they are not in my query results
If the number of ingested events seems correct, make sure your query is correctly formed. If you included a `count` aggregator in your ingestion spec, you will need to query for the results of this aggregate with a `longSum` aggregator. Issuing a query with a `count` aggregator will count the number of Druid rows, which includes [roll-up](../design/index.md).
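For example, if your ingestion spec defined a `count` metric named `count`, a query that returns the number of ingested events (rather than the number of Druid rows) might look like this sketch; the datasource name and interval are illustrative:
```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "granularity": "all",
  "intervals": ["2013-08-31/2013-09-01"],
  "aggregations": [
    { "type": "longSum", "name": "ingested_events", "fieldName": "count" }
  ]
}
```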
## What types of data does Druid support?
Druid can ingest JSON, CSV, TSV and other delimited data out of the box. Druid supports single dimension values, or multiple dimension values (an array of strings). Druid supports long, float, and double numeric columns.
## Where do my Druid segments end up after ingestion?
Depending on what `druid.storage.type` is set to, Druid will upload segments to some [Deep Storage](../dependencies/deep-storage.md). Local disk is used as the default deep storage.
## My stream ingest is not handing segments off
First, make sure there are no exceptions in the logs of the ingestion process. Also make sure that `druid.storage.type` is set to a deep storage that isn't `local` if you are running a distributed cluster.
Other common reasons that hand-off fails are as follows:
1) Druid is unable to write to the metadata storage. Make sure your configurations are correct.
2) Historical processes are out of capacity and cannot download any more segments. You'll see exceptions in the Coordinator logs if this occurs and the Coordinator console will show the Historicals are near capacity.
3) Segments are corrupt and cannot be downloaded. You'll see exceptions in your Historical processes if this occurs.
4) Deep storage is improperly configured. Make sure that your segment actually exists in deep storage and that the Coordinator logs have no errors.
## How do I get HDFS to work?
Make sure to include the `druid-hdfs-storage` extension and all the Hadoop configuration and dependencies (which can be obtained by running the command `hadoop classpath` on a machine where Hadoop has been set up) in the classpath. Also, provide the necessary HDFS settings as described in [deep storage](../dependencies/deep-storage.md).
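As a sketch, the relevant common runtime properties might look like the following; the NameNode address and path are illustrative assumptions:
```
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:8020/druid/segments
```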
## How do I know when I can make query to Druid after submitting batch ingestion task?
You can verify if segments created by a recent ingestion task are loaded onto historicals and available for querying using the following workflow.
1. Submit your ingestion task.
2. Repeatedly poll the [Overlord's tasks API](../operations/api-reference.md#tasks) ( `/druid/indexer/v1/task/{taskId}/status`) until your task is shown to be successfully completed.
3. Poll the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) (`/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus`) with
`forceMetadataRefresh=true` and `interval=<INTERVAL_OF_INGESTED_DATA>` once.
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms of the load on the metadata store but is necessary to make sure that we verify all the latest segments' load status)
If there are segments not yet loaded, continue to step 4, otherwise you can now query the data.
4. Repeatedly poll the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) (`/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus`) with
`forceMetadataRefresh=false` and `interval=<INTERVAL_OF_INGESTED_DATA>`.
Continue polling until all segments are loaded. Once all segments are loaded you can now query the data.
Note that this workflow only guarantees that the segments are available at the time of the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) call. Segments can still become missing because of historical process failures or any other reasons afterward.
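The polling calls in the workflow above might look like the following `curl` sketch, with a hypothetical Coordinator address, datasource, and interval (the `/` in the interval is URL-encoded as `%2F`):
```bash
# Step 3: one-time check that also refreshes the Coordinator's metadata cache
curl "http://<coordinator-host>:<port>/druid/coordinator/v1/datasources/wikipedia/loadstatus?forceMetadataRefresh=true&interval=2013-08-31T00:00:00.000%2F2013-09-01T00:00:00.000"

# Step 4: repeated polling without the expensive metadata refresh
curl "http://<coordinator-host>:<port>/druid/coordinator/v1/datasources/wikipedia/loadstatus?forceMetadataRefresh=false&interval=2013-08-31T00:00:00.000%2F2013-09-01T00:00:00.000"
```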
## I don't see my Druid segments on my Historical processes
You can check the Coordinator console located at `<COORDINATOR_IP>:<PORT>`. Make sure that your segments have actually loaded on [Historical processes](../design/historical.md). If your segments are not present, check the Coordinator logs for messages about capacity or replication errors. One reason that segments are not downloaded is that Historical processes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example):
```
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]
```
## My queries are returning empty results
You can use a [segment metadata query](../querying/segmentmetadataquery.md) for the dimensions and metrics that have been created for your datasource. Make sure that the name of the aggregators you use in your query matches one of these metrics. Also make sure that the query interval you specify matches a valid time range where data exists.
## How can I reindex existing data in Druid with schema changes?
You can use DruidInputSource with the [Parallel task](../ingestion/native-batch.md) to ingest existing Druid segments using a new schema and change the name, dimensions, metrics, rollup, etc. of the segment.
See [DruidInputSource](../ingestion/native-batch.md#druid-input-source) for more details.
Or, if you use Hadoop-based ingestion, you can use the `dataSource` input spec to do reindexing.
See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details.
## How can I change the query granularity of existing data in Druid?
In many situations you may want coarser granularity for older data. For example, data older than one month may only need hour-level granularity while newer data has minute-level granularity. This use case is the same as re-indexing.
To do this use the [DruidInputSource](../ingestion/native-batch.md#druid-input-source) and run a [Parallel task](../ingestion/native-batch.md). The DruidInputSource will allow you to take in existing segments from Druid and aggregate them and feed them back into Druid. It will also allow you to filter the data in those segments while feeding it back in. This means if there are rows you want to delete, you can just filter them away during re-ingestion.
Typically you would run the above as a periodic batch job that feeds in a chunk of data each day and aggregates it.
Or, if you use Hadoop-based ingestion, you can use the `dataSource` input spec to do reindexing.
See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details.
You can also change the query granularity using compaction. See [Query granularity handling](../ingestion/compaction.md#query-granularity-handling).
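For illustration, the `ioConfig` of such a Parallel task might read existing segments back through the `druid` input source and filter rows away during re-ingestion. This is a sketch; the datasource, interval, and filter dimension are hypothetical:
```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "druid",
    "dataSource": "wikipedia",
    "interval": "2020-01-01/2020-02-01",
    "filter": {
      "type": "not",
      "field": { "type": "selector", "dimension": "isRobot", "value": "true" }
    }
  }
}
```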
## Real-time ingestion seems to be stuck
There are a few ways this can occur. Druid will throttle ingestion to prevent out of memory problems if the intermediate persists are taking too long or if hand-off is taking too long. If your process logs indicate certain columns are taking a very long time to build (for example, if your segment granularity is hourly, but creating a single column takes 30 minutes), you should re-evaluate your configuration or scale up your real-time ingestion.
## More information
Data ingestion for Druid can be difficult for first time users. Please don't hesitate to ask questions in the [Druid Forum](https://www.druidforum.org/).


@ -1,3 +1,574 @@
---
id: hadoop
title: "Hadoop-based ingestion"
sidebar_label: "Hadoop-based"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Apache Hadoop-based batch ingestion in Apache Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running
instance of a Druid [Overlord](../design/overlord.md). Please refer to our [Hadoop-based vs. native batch comparison table](index.md#batch) for
comparisons between Hadoop-based, native batch (simple), and native batch (parallel) ingestion.
To run a Hadoop-based ingestion task, write an ingestion spec as specified below. Then POST it to the
[`/druid/indexer/v1/task`](../operations/api-reference.md#tasks) endpoint on the Overlord, or use the
`bin/post-index-task` script included with Druid.
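For example, assuming the spec is saved in a local file named `hadoop-index-task.json` and an illustrative Overlord address, the POST might look like:
```bash
curl -X POST -H 'Content-Type: application/json' \
  -d @hadoop-index-task.json \
  "http://<overlord-host>:<port>/druid/indexer/v1/task"
```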
## Tutorial
This page contains reference documentation for Hadoop-based ingestion.
For a walk-through instead, check out the [Loading from Apache Hadoop](../tutorials/tutorial-batch-hadoop.md) tutorial.
## Task syntax
A sample task is shown below:
```json
{
"type" : "index_hadoop",
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "hadoopyString",
"parseSpec" : {
"format" : "json",
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensionExclusions" : [],
"spatialDimensions" : []
}
}
},
"metricsSpec" : [
{
"type" : "count",
"name" : "count"
},
{
"type" : "doubleSum",
"name" : "added",
"fieldName" : "added"
},
{
"type" : "doubleSum",
"name" : "deleted",
"fieldName" : "deleted"
},
{
"type" : "doubleSum",
"name" : "delta",
"fieldName" : "delta"
}
],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "NONE",
"intervals" : [ "2013-08-31/2013-09-01" ]
}
},
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "static",
"paths" : "/MyDirectory/example/wikipedia_data.json"
}
},
"tuningConfig" : {
"type": "hadoop"
}
},
"hadoopDependencyCoordinates": <my_hadoop_version>
}
```
|property|description|required?|
|--------|-----------|---------|
|type|The task type, this should always be "index_hadoop".|yes|
|spec|A Hadoop Index Spec. See [Ingestion](../ingestion/index.md)|yes|
|hadoopDependencyCoordinates|A JSON array of Hadoop dependency coordinates that Druid will use, this property will override the default Hadoop coordinates. Once specified, Druid will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`|no|
|classpathPrefix|Classpath that will be prepended for the Peon process.|no|
Also note that Druid automatically computes the classpath for Hadoop job containers that run in the Hadoop cluster. But in case of conflicts between Hadoop and Druid's dependencies, you can manually specify the classpath by setting `druid.extensions.hadoopContainerDruidClasspath` property. See the extensions config in [base druid configuration](../configuration/index.md#extensions).
## `dataSchema`
This field is required. See the [`dataSchema`](index.md#legacy-dataschema-spec) section of the main ingestion page for details on
what it should contain.
## `ioConfig`
This field is required.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should always be 'hadoop'.|yes|
|inputSpec|Object|A specification of where to pull the data in from. See below.|yes|
|segmentOutputPath|String|The path to dump segments into.|Only used by the [Command-line Hadoop indexer](#cli). This field must be null otherwise.|
|metadataUpdateSpec|Object|A specification of how to update the metadata for the druid cluster these segments belong to.|Only used by the [Command-line Hadoop indexer](#cli). This field must be null otherwise.|
### `inputSpec`
There are multiple types of inputSpecs:
#### `static`
A type of inputSpec where a static path to the data files is provided.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|inputFormat|String|Specifies the Hadoop InputFormat class to use. e.g. `org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat` |no|
|paths|String|A comma-separated string of input paths indicating where the raw data is located.|yes|
For example, using the static input paths:
```
"paths" : "hdfs://path/to/data/is/here/data.gz,hdfs://path/to/data/is/here/moredata.gz,hdfs://path/to/data/is/here/evenmoredata.gz"
```
You can also read from cloud storage such as AWS S3 or Google Cloud Storage.
To do so, you need to install the necessary library under Druid's classpath in _all MiddleManager or Indexer processes_.
For S3, you can run the below command to install the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/).
```bash
java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
```
Once you install the Hadoop AWS module in all MiddleManager and Indexer processes, you can put
your S3 paths in the inputSpec with the below job properties.
For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/).
```
"paths" : "s3a://billy-bucket/the/data/is/here/data.gz,s3a://billy-bucket/the/data/is/here/moredata.gz,s3a://billy-bucket/the/data/is/here/evenmoredata.gz"
```
```json
"jobProperties" : {
"fs.s3a.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem",
"fs.AbstractFileSystem.s3a.impl" : "org.apache.hadoop.fs.s3a.S3A",
"fs.s3a.access.key" : "YOUR_ACCESS_KEY",
"fs.s3a.secret.key" : "YOUR_SECRET_KEY"
}
```
For Google Cloud Storage, you need to install [GCS connector jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer processes_.
Once you install the GCS Connector jar in all MiddleManager and Indexer processes, you can put
your Google Cloud Storage paths in the inputSpec with the below job properties.
For more configurations, see the [instructions to configure Hadoop](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop),
[GCS core default](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.0.0/gcs/conf/gcs-core-default.xml)
and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).
```
"paths" : "gs://billy-bucket/the/data/is/here/data.gz,gs://billy-bucket/the/data/is/here/moredata.gz,gs://billy-bucket/the/data/is/here/evenmoredata.gz"
```
```json
"jobProperties" : {
"fs.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
"fs.AbstractFileSystem.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
}
```
#### `granularity`
A type of inputSpec that expects data to be organized in directories according to datetime using the path format: `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (where date is represented by lowercase and time is represented by uppercase).
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dataGranularity|String|Specifies the granularity to expect the data at, e.g. hour means to expect directories `y=XXXX/m=XX/d=XX/H=XX`.|yes|
|inputFormat|String|Specifies the Hadoop InputFormat class to use. e.g. `org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat` |no|
|inputPath|String|Base path to append the datetime path to.|yes|
|filePattern|String|Pattern that files should match to be included.|yes|
|pathFormat|String|Joda datetime format for each directory. Default value is `"'y'=yyyy/'m'=MM/'d'=dd/'H'=HH"`, or see [Joda documentation](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html)|no|
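A sketch of a `granularity` inputSpec, using an illustrative S3 bucket as the base path:
```json
"inputSpec" : {
  "type" : "granularity",
  "dataGranularity" : "hour",
  "inputPath" : "s3n://billy-bucket/the/data/is/here",
  "filePattern" : ".*",
  "pathFormat" : "'y'=yyyy/'m'=MM/'d'=dd/'H'=HH"
}
```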
For example, if the sample config were run with the interval 2012-06-01/2012-06-02, it would expect data at the paths:
```
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
...
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
```
#### `dataSource`
This is a type of `inputSpec` that reads data already stored inside Druid. This is used to allow "re-indexing" data and for "delta-ingestion" described later in `multi` type inputSpec.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String.|This should always be 'dataSource'.|yes|
|ingestionSpec|JSON object.|Specification of Druid segments to be loaded. See below.|yes|
|maxSplitSize|Number|Enables combining multiple segments into single Hadoop InputSplit according to size of segments. With -1, druid calculates max split size based on user specified number of map task(mapred.map.tasks or mapreduce.job.maps). By default, one split is made for one segment. maxSplitSize is specified in bytes.|no|
|useNewAggs|Boolean|If "false", then list of aggregators in "metricsSpec" of hadoop indexing task must be same as that used in original indexing task while ingesting raw data. Default value is "false". This field can be set to "true" when "inputSpec" type is "dataSource" and not "multi" to enable arbitrary aggregators while reindexing. See below for "multi" type support for delta-ingestion.|no|
Here is what goes inside `ingestionSpec`:
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dataSource|String|Druid dataSource name from which you are loading the data.|yes|
|intervals|List|A list of strings representing ISO-8601 Intervals.|yes|
|segments|List|List of segments to read data from; by default it is obtained automatically. You can obtain the list of segments to put here by making a POST query to the Coordinator at `/druid/coordinator/v1/metadata/datasources/segments?full` with the list of intervals specified in the request payload, e.g. ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]. You may want to provide this list manually to ensure that the segments read are exactly the same as they were at the time of task submission; the task fails if the list provided by the user does not match the state of the database when the task actually runs.|no|
|filter|JSON|See [Filters](../querying/filters.md)|no|
|dimensions|Array of String|Name of dimension columns to load. By default, the list will be constructed from parseSpec. If parseSpec does not have an explicit list of dimensions then all the dimension columns present in stored data will be read.|no|
|metrics|Array of String|Name of metric columns to load. By default, the list will be constructed from the "name" of all the configured aggregators.|no|
|ignoreWhenNoSegments|boolean|Whether to ignore this ingestionSpec if no segments were found. Default behavior is to throw error when no segments were found.|no|
For example
```json
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "dataSource",
"ingestionSpec" : {
"dataSource": "wikipedia",
"intervals": ["2014-10-20T00:00:00Z/P2W"]
}
},
...
}
```
#### `multi`
This is a composing inputSpec to combine other inputSpecs. This inputSpec is used for delta ingestion. You can also use a `multi` inputSpec to combine data from multiple dataSources. However, each particular dataSource can only be specified one time.
Note that `useNewAggs` must be left at its default value of `false` to support delta-ingestion.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|children|Array of JSON objects|List of JSON objects containing other inputSpecs.|yes|
For example:
```json
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "multi",
"children": [
{
"type" : "dataSource",
"ingestionSpec" : {
"dataSource": "wikipedia",
"intervals": ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"],
"segments": [
{
"dataSource": "test1",
"interval": "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000",
"version": "v2",
"loadSpec": {
"type": "local",
"path": "/tmp/index1.zip"
},
"dimensions": "host",
"metrics": "visited_sum,unique_hosts",
"shardSpec": {
"type": "none"
},
"binaryVersion": 9,
"size": 2,
"identifier": "test1_2000-01-01T00:00:00.000Z_3000-01-01T00:00:00.000Z_v2"
}
]
}
},
{
"type" : "static",
"paths": "/path/to/more/wikipedia/data/"
}
]
},
...
}
```
It is STRONGLY RECOMMENDED to provide the list of segments in the `dataSource` inputSpec explicitly so that your delta ingestion task is idempotent. You can obtain that list of segments by making the following call to the Coordinator:
POST `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full`
Request body: `[interval1, interval2, ...]`, for example `["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]`
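As a sketch with `curl`, assuming an illustrative Coordinator address and datasource:
```bash
curl -X POST -H 'Content-Type: application/json' \
  -d '["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]' \
  "http://<coordinator-host>:<port>/druid/coordinator/v1/metadata/datasources/wikipedia/segments?full"
```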
## `tuningConfig`
The tuningConfig is optional and default parameters will be used if no tuningConfig is specified.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|workingPath|String|The working path to use for intermediate results (results between Hadoop jobs).|Only used by the [Command-line Hadoop indexer](#cli). The default is '/tmp/druid-indexing'. This field must be null otherwise.|
|version|String|The version of created segments. Ignored for HadoopIndexTask unless useExplicitVersion is set to true|no (default == datetime that indexing starts at)|
|partitionsSpec|Object|A specification of how to partition each time bucket into segments. Absence of this property means no partitioning will occur. See [`partitionsSpec`](#partitionsspec) below.|no (default == 'hashed')|
|maxRowsInMemory|Integer|The number of rows to aggregate before persisting. Note that this is the number of post-aggregation rows which may not be equal to the number of input events due to roll-up. This is used to manage the required JVM heap size. Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set.|no (default == 1000000)|
|maxBytesInMemory|Long|The number of bytes to aggregate in heap memory before persisting. Normally this is computed internally and user does not need to set it. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). Note that `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until the next persist will decrease, and the task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.|no (default == One-sixth of max JVM memory)|
|leaveIntermediate|Boolean|Leave behind intermediate files (for debugging) in the workingPath when a job completes, whether it passes or fails.|no (default == false)|
|cleanupOnFailure|Boolean|Clean up intermediate files when a job fails (unless leaveIntermediate is on).|no (default == true)|
|overwriteFiles|Boolean|Override existing files found during indexing.|no (default == false)|
|ignoreInvalidRows|Boolean|DEPRECATED. Ignore rows found to have problems. If false, any exception encountered during parsing will be thrown and will halt ingestion; if true, unparseable rows and fields will be skipped. If `maxParseExceptions` is defined, this property is ignored.|no (default == false)|
|combineText|Boolean|Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files.|no (default == false)|
|useCombiner|Boolean|Use Hadoop combiner to merge rows at mapper if possible.|no (default == false)|
|jobProperties|Object|A map of properties to add to the Hadoop job configuration, see below for details.|no (default == null)|
|indexSpec|Object|Tune how data is indexed. See [`indexSpec`](index.md#indexspec) on the main ingestion page for more information.|no|
|indexSpecForIntermediatePersists|Object|defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [`indexSpec`](index.md#indexspec) for possible values.|no (default = same as indexSpec)|
|numBackgroundPersistThreads|Integer|The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and CPU usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1.|no (default == 0)|
|forceExtendableShardSpecs|Boolean|Forces use of extendable shardSpecs. Hash-based partitioning always uses an extendable shardSpec. For single-dimension partitioning, this option should be set to true to use an extendable shardSpec. For partitioning, please check [Partitioning specification](#partitionsspec). This option can be useful when you need to append more data to existing dataSource.|no (default = false)|
|useExplicitVersion|Boolean|Forces HadoopIndexTask to use version.|no (default = false)|
|logParseExceptions|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no(default = false)|
|maxParseExceptions|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overrides `ignoreInvalidRows` if `maxParseExceptions` is defined.|no(default = unlimited)|
|useYarnRMJobStatusFallback|Boolean|If the Hadoop jobs created by the indexing task are unable to retrieve their completion status from the JobHistory server, and this parameter is true, the indexing task will try to fetch the application status from `http://<yarn-rm-address>/ws/v1/cluster/apps/<application-id>`, where `<yarn-rm-address>` is the value of `yarn.resourcemanager.webapp.address` in your Hadoop configuration. This flag is intended as a fallback for cases where an indexing task's jobs succeed, but the JobHistory server is unavailable, causing the indexing task to fail because it cannot determine the job statuses.|no (default = true)|
|awaitSegmentAvailabilityTimeoutMillis|Long|Milliseconds to wait for the newly indexed segments to become available for query after ingestion completes. If `<= 0`, no wait will occur. If `> 0`, the task will wait for the Coordinator to indicate that the new segments are available for querying. If the timeout expires, the task will exit as successful, but the segments were not confirmed to have become available for query.|no (default = 0)|
### `jobProperties`
```json
"tuningConfig" : {
"type": "hadoop",
"jobProperties": {
"<hadoop-property-a>": "<value-a>",
"<hadoop-property-b>": "<value-b>"
}
}
```
Hadoop's [MapReduce documentation](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) lists the possible configuration parameters.
With some Hadoop distributions, it may be necessary to set `mapreduce.job.classpath` or `mapreduce.job.user.classpath.first`
to avoid class loading issues. See the [working with different Hadoop versions documentation](../operations/other-hadoop.md)
for more details.
## `partitionsSpec`
Segments are always partitioned based on timestamp (according to the granularitySpec) and may be further partitioned in
some other way depending on partition type. Druid supports two types of partitioning strategies: `hashed` (based on the
hash of all dimensions in each row), and `single_dim` (based on ranges of a single dimension).
Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly
sized data segments relative to single-dimension partitioning.
### Hash-based partitioning
```json
"partitionsSpec": {
"type": "hashed",
"targetRowsPerSegment": 5000000
}
```
Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments
according to the hash of all dimensions in each row. The number of segments is determined automatically based on the
cardinality of the input set and a target partition size.
The configuration options are:
|Field|Description|Required|
|--------|-----------|---------|
|type|Type of partitionSpec to be used.|"hashed"|
|targetRowsPerSegment|Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB. Defaults to 5000000 if `numShards` is not set.|either this or `numShards`|
|targetPartitionSize|Deprecated. Renamed to `targetRowsPerSegment`. Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|either this or `numShards`|
|maxRowsPerSegment|Deprecated. Renamed to `targetRowsPerSegment`. Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|either this or `numShards`|
|numShards|Specify the number of partitions directly, instead of a target partition size. Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically.|either this or `maxRowsPerSegment`|
|partitionDimensions|The dimensions to partition on. Leave blank to select all dimensions. Only used with `numShards`, will be ignored when `targetRowsPerSegment` is set.|no|
|partitionFunction|A function to compute the hash of partition dimensions. See [Hash partition function](#hash-partition-function). Defaults to `murmur3_32_abs`.|no|
##### Hash partition function
In hash partitioning, the partition function is used to compute hash of partition dimensions. The partition dimension
values are first serialized into a byte array as a whole, and then the partition function is applied to compute hash of
the byte array.
Druid currently supports only one partition function.
|name|description|
|----|-----------|
|`murmur3_32_abs`|Applies an absolute value function to the result of [`murmur3_32`](https://guava.dev/releases/16.0/api/docs/com/google/common/hash/Hashing.html#murmur3_32()).|
### Single-dimension range partitioning
```json
"partitionsSpec": {
"type": "single_dim",
"targetRowsPerSegment": 5000000
}
```
Single-dimension range partitioning works by first selecting a dimension to partition on, and then separating that dimension
into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example,
your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and
"f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can
override it with a specific dimension.
The configuration options are:
|Field|Description|Required|
|--------|-----------|---------|
|type|Type of partitionSpec to be used.|"single_dim"|
|targetRowsPerSegment|Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|yes|
|targetPartitionSize|Deprecated. Renamed to `targetRowsPerSegment`. Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|no|
|maxRowsPerSegment|Maximum number of rows to include in a partition. Defaults to 50% larger than the `targetRowsPerSegment`.|no|
|maxPartitionSize|Deprecated. Use `maxRowsPerSegment` instead. Maximum number of rows to include in a partition. Defaults to 50% larger than the `targetPartitionSize`.|no|
|partitionDimension|The dimension to partition on. Leave blank to select a dimension automatically.|no|
|assumeGrouped|Assume that input data has already been grouped on time and dimensions. Ingestion will run faster, but may choose sub-optimal partitions if this assumption is violated.|no|
## Remote Hadoop clusters
If you have a remote Hadoop cluster, make sure to include the folder holding your configuration `*.xml` files in your Druid `_common` configuration folder.
If you are having dependency problems between your version of Hadoop and the version that Druid is compiled against, please see [these docs](../operations/other-hadoop.md).
## Elastic MapReduce
If your cluster is running on Amazon Web Services, you can use Elastic MapReduce (EMR) to index data
from S3. To do this:
- Create a persistent, [long-running cluster](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient).
- When creating your cluster, enter the following configuration. If you're using the wizard, this
should be in advanced mode under "Edit software settings":
```
classification=yarn-site,properties=[mapreduce.reduce.memory.mb=6144,mapreduce.reduce.java.opts=-server -Xms2g -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.map.memory.mb=758,mapreduce.map.java.opts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.task.timeout=1800000]
```
- Follow the instructions under
[Configure for connecting to Hadoop](../tutorials/cluster.md#hadoop) using the XML files from `/etc/hadoop/conf`
on your EMR master.
## Kerberized Hadoop clusters
By default, Druid can use the existing TGT Kerberos ticket available in the local Kerberos credential cache.
However, the TGT ticket has a limited life cycle, so you need to run the `kinit` command periodically to keep it valid.
To avoid an extra external cron job that calls `kinit` periodically, you can provide the principal name and keytab location,
and Druid will perform the authentication transparently at startup and at job launch time.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`| Principal user name |empty|
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
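For example, using the placeholder values from the table above, the corresponding entries in your common runtime properties might look like:
```
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.headlessUser.keytab
```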
### Loading from S3 with EMR
- In the `jobProperties` field in the `tuningConfig` section of your Hadoop indexing task, add:
```
"jobProperties" : {
"fs.s3.awsAccessKeyId" : "YOUR_ACCESS_KEY",
"fs.s3.awsSecretAccessKey" : "YOUR_SECRET_KEY",
"fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
"fs.s3n.awsAccessKeyId" : "YOUR_ACCESS_KEY",
"fs.s3n.awsSecretAccessKey" : "YOUR_SECRET_KEY",
"fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
"io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
}
```
Note that this method uses Hadoop's built-in S3 filesystem rather than Amazon's EMRFS, and is not compatible
with Amazon-specific features such as S3 encryption and consistent views. If you need to use these
features, you will need to make the Amazon EMR Hadoop JARs available to Druid through one of the
mechanisms described in the [Using other Hadoop distributions](#using-other-hadoop-distributions) section.
## Using other Hadoop distributions
Druid works out of the box with many Hadoop distributions.
If you are having dependency conflicts between Druid and your version of Hadoop, you can try
searching for a solution in the [Druid user groups](https://groups.google.com/forum/#!forum/druid-user), or reading the
Druid [Different Hadoop Versions](../operations/other-hadoop.md) documentation.
<a name="cli"></a>
## Command line (non-task) version
To run:
```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> org.apache.druid.cli.Main index hadoop <spec_file>
```
### Options
- "--coordinate" - provide a version of Apache Hadoop to use. This property will override the default Hadoop coordinates. Once specified, Apache Druid will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`.
- "--no-default-hadoop" - don't pull down the default hadoop version
### Spec file
The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task. See [Hadoop Batch Ingestion](../ingestion/hadoop.md) for details on the spec format.
In addition, `metadataUpdateSpec` and `segmentOutputPath` fields need to be added to the ioConfig:
```
"ioConfig" : {
...
"metadataUpdateSpec" : {
"type":"mysql",
"connectURI" : "jdbc:mysql://localhost:3306/druid",
"password" : "diurd",
"segmentTable" : "druid_segments",
"user" : "druid"
},
"segmentOutputPath" : "/MyDirectory/data/index/output"
},
```
and a `workingPath` field needs to be added to the tuningConfig:
```
"tuningConfig" : {
...
"workingPath": "/tmp",
...
}
```
#### Metadata Update Job Spec
This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|"metadata" is the only value available.|yes|
|connectURI|String|A valid JDBC url to metadata storage.|yes|
|user|String|Username for the database.|yes|
|password|String|Password for the database.|yes|
|segmentTable|String|Table to use in DB.|yes|
These properties should match what you have configured for your [Coordinator](../design/coordinator.md).
#### segmentOutputPath Config
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|segmentOutputPath|String|the path to dump segments into.|yes|
#### workingPath Config
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|workingPath|String|the working path to use for intermediate results (results between Hadoop jobs).|no (default == '/tmp/druid-indexing')|
Please note that the command line Hadoop indexer doesn't have the locking capabilities of the indexing service, so if you choose to use it,
you must take care not to overwrite segments created by real-time processing (if you have a real-time pipeline set up).
# Hadoop-based ingestion
Apache Hadoop-based batch ingestion in Apache Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running instance of a Druid Overlord.
@ -1,4 +1,3 @@
## Apache Kafka ingestion
The Kafka indexing service supports configuring *supervisors* on the Overlord. Supervisors manage the creation and lifetime of Kafka indexing tasks to facilitate ingestion of data from Kafka. These indexing tasks read events using Kafka's own partition and offset mechanism and can therefore provide **exactly-once** ingestion guarantees. The supervisor monitors the state of the indexing tasks to coordinate handoff, manage failures, and ensure that scalability and replication requirements are maintained.
2960
ingestion/native-batch.md Normal file
File diff suppressed because it is too large
438
ingestion/schema-design.md Normal file
@ -0,0 +1,438 @@
---
id: schema-design
title: "Schema design tips"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## Druid's data model
For general information, check out the documentation on [Druid's data model](index.md#data-model) on the main
ingestion overview page. The rest of this page discusses tips for users coming from other kinds of systems, as well as
general tips and common practices.
* Druid data is stored in [datasources](index.md#datasources), which are similar to tables in a traditional RDBMS.
* Druid datasources can be ingested with or without [rollup](#rollup). With rollup enabled, Druid partially aggregates your data during ingestion, potentially reducing its row count, decreasing storage footprint, and improving query performance. With rollup disabled, Druid stores one row for each row in your input data, without any pre-aggregation.
* Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on.
* All columns in Druid datasources, other than the timestamp column, are either dimensions or metrics. This follows the [standard naming convention](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems) of OLAP data.
* Typical production datasources have tens to hundreds of columns.
* [Dimension columns](index.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
* [Metric columns](index.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
## If you're coming from a...
### Relational model
(Like Hive or PostgreSQL.)
Druid datasources are generally equivalent to tables in a relational database. Druid [lookups](../querying/lookups.md)
can act similarly to data-warehouse-style dimension tables, but as you'll see below, denormalization is often
recommended if you can get away with it.
Common practice for relational data modeling involves [normalization](https://en.wikipedia.org/wiki/Database_normalization):
the idea of splitting up data into multiple tables such that data redundancy is reduced or eliminated. For example, in a
"sales" table, best-practices relational modeling calls for a "product id" column that is a foreign key into a separate
"products" table, which in turn has "product id", "product name", and "product category" columns. This prevents the
product name and category from needing to be repeated on different rows in the "sales" table that refer to the same
product.
In Druid, on the other hand, it is common to use totally flat datasources that do not require joins at query time. In
the example of the "sales" table, in Druid it would be typical to store "product_id", "product_name", and
"product_category" as dimensions directly in a Druid "sales" datasource, without using a separate "products" table.
Totally flat schemas substantially increase performance, since the need for joins is eliminated at query time. As an
added speed boost, this also allows Druid's query layer to operate directly on compressed dictionary-encoded data.
Perhaps counter-intuitively, this does _not_ substantially increase storage footprint relative to normalized schemas,
since Druid uses dictionary encoding to effectively store just a single integer per row for string columns.
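As a rough sketch (all field names and values here are hypothetical), a fully denormalized row in such a "sales" datasource might look like:
```json
{
  "timestamp": "2018-01-01T00:00:00Z",
  "product_id": "1001",
  "product_name": "Widget",
  "product_category": "Tools",
  "revenue": 19.99
}
```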
If necessary, Druid datasources can be partially normalized through the use of [lookups](../querying/lookups.md),
which are the rough equivalent of dimension tables in a relational database. At query time, you would use Druid's SQL
`LOOKUP` function, or native lookup extraction functions, instead of using the JOIN keyword like you would in a
relational database. Since lookup tables impose an increase in memory footprint and incur more computational overhead
at query time, it is only recommended to do this if you need the ability to update a lookup table and have the changes
reflected immediately for already-ingested rows in your main table.
Tips for modeling relational data in Druid:
- Druid datasources do not have primary or unique keys, so skip those.
- Denormalize if possible. If you need to be able to update dimension / lookup tables periodically and have those
changes reflected in already-ingested data, consider partial normalization with [lookups](../querying/lookups.md).
- If you need to join two large distributed tables with each other, you must do this before loading the data into Druid.
Druid does not support query-time joins of two datasources. Lookups do not help here, since a full copy of each lookup
table is stored on each Druid server, so they are not a good choice for large tables.
- Consider whether you want to enable [rollup](#rollup) for pre-aggregation, or whether you want to disable
rollup and load your existing data as-is. Rollup in Druid is similar to creating a summary table in a relational model.
### Time series model
(Like OpenTSDB or InfluxDB.)
Similar to time series databases, Druid's data model requires a timestamp. Druid is not a timeseries database, but
it is a natural choice for storing timeseries data. Its flexible data model allows it to store both timeseries and
non-timeseries data, even in the same datasource.
To achieve best-case compression and query performance in Druid for timeseries data, it is important to partition and
sort by metric name, like timeseries databases often do. See [Partitioning and sorting](index.md#partitioning) for more details.
Tips for modeling timeseries data in Druid:
- Druid does not think of data points as being part of a "time series". Instead, Druid treats each point separately
for ingestion and aggregation.
- Create a dimension that indicates the name of the series that a data point belongs to. This dimension is often called
"metric" or "name". Do not get the dimension named "metric" confused with the concept of Druid metrics. Place this
first in the list of dimensions in your "dimensionsSpec" for best performance (this helps because it improves locality;
see [partitioning and sorting](index.md#partitioning) below for details).
- Create other dimensions for attributes attached to your data points. These are often called "tags" in timeseries
database systems.
- Create [metrics](../querying/aggregations.md) corresponding to the types of aggregations that you want to be able
to query. Typically this includes "sum", "min", and "max" (in one of the long, float, or double flavors). If you want to
be able to compute percentiles or quantiles, use Druid's [approximate aggregators](../querying/aggregations.md#approx).
- Consider enabling [rollup](#rollup), which will allow Druid to potentially combine multiple points into one
row in your Druid datasource. This can be useful if you want to store data at a different time granularity than it is
naturally emitted. It is also useful if you want to combine timeseries and non-timeseries data in the same datasource.
- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
[automatic detection of dimension columns](#schema-less-dimensions).
### Log aggregation model
(Like Elasticsearch or Splunk.)
Similar to log aggregation systems, Druid offers inverted indexes for fast searching and filtering. Druid's search
capabilities are generally less developed than these systems, and its analytical capabilities are generally more
developed. The main data modeling difference between Druid and these systems is that ingesting data into Druid requires
you to be more explicit: Druid columns have types specified upfront, and Druid does not, at this time, natively support
nested data.
Tips for modeling log data in Druid:
- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
[automatic detection of dimension columns](#schema-less-dimensions).
- If you have nested data, flatten it using a [`flattenSpec`](index.md#flattenspec).
- Consider enabling [rollup](#rollup) if you have mainly analytical use cases for your log data. This will
mean you lose the ability to retrieve individual events from Druid, but you potentially gain substantial compression and
query performance boosts.
## General tips and best practices
### Rollup
Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. This is a form
of summarization or pre-aggregation. For more details, see the [Rollup](index.md#rollup) section of the ingestion
documentation.
### Partitioning and sorting
Optimally partitioning and sorting your data can have substantial impact on footprint and performance. For more details,
see the [Partitioning](index.md#partitioning) section of the ingestion documentation.
<a name="sketches"></a>
### Sketches for high cardinality columns
When dealing with high cardinality columns like user IDs or other unique IDs, consider using sketches for approximate
analysis rather than operating on the actual values. When you ingest data using a sketch, Druid does not store the
original raw data, but instead stores a "sketch" of it that it can feed into a later computation at query time. Popular
use cases for sketches include count-distinct and quantile computation. Each sketch is designed for just one particular
kind of computation.
In general using sketches serves two main purposes: improving rollup, and reducing memory footprint at
query time.
Sketches improve rollup ratios because they allow you to collapse multiple distinct values into the same sketch. For
example, if you have two rows that are identical except for a user ID (perhaps two users did the same action at the
same time), storing them in a count-distinct sketch instead of as-is means you can store the data in one row instead of
two. You won't be able to retrieve the user IDs or compute exact distinct counts, but you'll still be able to compute
approximate distinct counts, and you'll reduce your storage footprint.
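For example, assuming a hypothetical `user_id` input column, the core `hyperUnique` aggregator can store such a count-distinct sketch at ingestion time (extension-provided sketches such as Theta sketches are configured similarly):
```json
"metricsSpec": [
  { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id" }
]
```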
Sketches reduce memory footprint at query time because they limit the amount of data that needs to be shuffled between
servers. For example, in a quantile computation, instead of needing to send all data points to a central location
so they can be sorted and the quantile can be computed, Druid instead only needs to send a sketch of the points. This
can reduce data transfer needs to mere kilobytes.
For details about the sketches available in Druid, see the
[approximate aggregators](../querying/aggregations.md#approx) page.
If you prefer videos, take a look at [Not exactly!](https://www.youtube.com/watch?v=Hpd3f_MLdXo), a conference talk
about sketches in Druid.
### String vs numeric dimensions
If the user wishes to ingest a column as a numeric-typed dimension (Long, Double or Float), it is necessary to specify the type of the column in the `dimensions` section of the `dimensionsSpec`. If the type is omitted, Druid will ingest a column as the default String type.
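For example, a `dimensionsSpec` fragment that ingests a hypothetical `product_id` column as a long-typed dimension, while leaving another column as the default String type, might look like:
```json
"dimensionsSpec": {
  "dimensions": [
    "product_name",
    { "type": "long", "name": "product_id" }
  ]
}
```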
There are performance tradeoffs between string and numeric columns. Numeric columns are generally faster to group on
than string columns. But unlike string columns, numeric columns don't have indexes, so they can be slower to filter on.
You may want to experiment to find the optimal choice for your use case.
For details about how to configure numeric dimensions, see the [`dimensionsSpec`](index.md#dimensionsspec) documentation.
### Secondary timestamps
Druid schemas must always include a primary timestamp. The primary timestamp is used for
[partitioning and sorting](index.md#partitioning) your data, so it should be the timestamp that you will most often filter on.
Druid is able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column.
If your data has more than one timestamp, you can ingest the others as secondary timestamps. The best way to do this
is to ingest them as [long-typed dimensions](index.md#dimensionsspec) in milliseconds format.
If necessary, you can get them into this format using a [`transformSpec`](index.md#transformspec) and
[expressions](../misc/math-expr.md) like `timestamp_parse`, which returns millisecond timestamps.
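As a sketch, assuming a hypothetical input field `clicked_at` that contains an ISO 8601 timestamp string, a millisecond-valued secondary timestamp could be derived like this and then listed as a long-typed dimension:
```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "clicked_at_millis", "expression": "timestamp_parse(clicked_at)" }
  ]
}
```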
At query time, you can query secondary timestamps with [SQL time functions](../querying/sql.md#time-functions)
like `MILLIS_TO_TIMESTAMP`, `TIME_FLOOR`, and others. If you're using native Druid queries, you can use
[expressions](../misc/math-expr.md).
### Nested dimensions
At the time of this writing, Druid does not support nested dimensions. Nested dimensions need to be flattened. For example,
if you have data of the following form:
```
{"foo":{"bar": 3}}
```
then before indexing it, you should transform it to:
```
{"foo_bar": 3}
```
Druid is capable of flattening JSON, Avro, or Parquet input data.
Please read about [`flattenSpec`](index.md#flattenspec) for more details.
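For the `foo`/`bar` example above, a minimal `flattenSpec` sketch (assuming JSON input, so the expression is a JSONPath) might be:
```json
"flattenSpec": {
  "fields": [
    { "type": "path", "name": "foo_bar", "expr": "$.foo.bar" }
  ]
}
```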
<a name="counting"></a>
### Counting the number of ingested events
When rollup is enabled, count aggregators at query time do not actually tell you the number of rows that have been
ingested. They tell you the number of rows in the Druid datasource, which may be smaller than the number of rows
ingested.
In this case, a count aggregator at _ingestion_ time can be used to count the number of events. However, it is important to note
that when you query for this metric, you should use a `longSum` aggregator. A `count` aggregator at query time will return
the number of Druid rows for the time interval, which can be used to determine what the roll-up ratio was.
To clarify with an example, if your ingestion spec contains:
```
...
"metricsSpec" : [
{
"type" : "count",
"name" : "count"
},
...
```
You should query for the number of ingested rows with:
```
...
"aggregations": [
{ "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },
...
```
### Schema-less dimensions
If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
a dimension that has been excluded, or a metric column as a dimension.
Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
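A minimal schema-less `dimensionsSpec` sketch, with a hypothetical `user_id` column excluded from automatic detection, looks like:
```json
"dimensionsSpec": {
  "dimensions": [],
  "dimensionExclusions": ["user_id"]
}
```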
### Including the same column as a dimension and a metric
One workflow with unique IDs is to be able to filter on a particular ID, while still being able to do fast unique counts on the ID column.
If you are not using schema-less dimensions, this use case is supported by setting the `name` of the metric to something different than the dimension.
If you are using schema-less dimensions, the best practice here is to include the same column twice, once as a dimension, and once as a `hyperUnique` metric. This may involve
some work at ETL time.
As an example, for schema-less dimensions, repeat the same column:
```
{"device_id_dim":123, "device_id_met":123}
```
and in your `metricsSpec`, include:
```
{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }
```
`device_id_dim` should automatically get picked up as a dimension.
@ -0,0 +1,45 @@
---
id: standalone-realtime
layout: doc_page
title: "Realtime Process"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Older versions of Apache Druid supported a standalone 'Realtime' process to query and index 'stream pull'
modes of real-time ingestion. These processes would periodically build segments for the data they had collected over
some span of time and then set up hand-off to [Historical](../design/historical.md) servers.
This process could be invoked by
```
org.apache.druid.cli.Main server realtime
```
This model of stream pull ingestion was deprecated for a number of operational and architectural reasons, and
removed completely in Druid 0.16.0. Operationally, realtime nodes were difficult to configure, deploy, and scale because
each node required a unique configuration. The design of the stream pull ingestion system for realtime nodes also
suffered from limitations that made it impossible to achieve exactly-once ingestion.
The extensions `druid-kafka-eight`, `druid-kafka-eight-simpleConsumer`, `druid-rabbitmq`, and `druid-rocketmq` were also
removed at this time, since they were built to operate on the realtime nodes.
Please consider using the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md) or
[Kinesis Indexing Service](../development/extensions-core/kinesis-ingestion.md) for stream pull ingestion instead.

774
ingestion/tasks.md Normal file
@ -0,0 +1,774 @@
---
id: tasks
title: "Task reference"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Tasks do all [ingestion](index.md)-related work in Druid.
For batch ingestion, you will generally submit tasks directly to Druid using the
[Task APIs](../operations/api-reference.md#tasks). For streaming ingestion, tasks are generally submitted for you by a
supervisor.
## Task API
Task APIs are available in two main places:
- The [Overlord](../design/overlord.md) process offers HTTP APIs to submit tasks, cancel tasks, check their status,
review logs and reports, and more. Refer to the [Tasks API reference page](../operations/api-reference.md#tasks) for a
full list.
- Druid SQL includes a [`sys.tasks`](../querying/sql.md#tasks-table) table that provides information about currently
running tasks. This table is read-only, and has a limited (but useful!) subset of the full information available through
the Overlord APIs.
<a name="reports"></a>
## Task reports
A report containing information about the number of rows ingested and any parse exceptions that occurred is available for both completed tasks and running tasks.
The reporting feature is supported by the [simple native batch task](../ingestion/native-batch.md#simple-task), the Hadoop batch task, and Kafka and Kinesis ingestion tasks.
### Completion report
After a task completes, a completion report can be retrieved at:
```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
```
An example output is shown below:
```json
{
"ingestionStatsAndErrors": {
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
"payload": {
"ingestionState": "COMPLETED",
"unparseableEvents": {},
"rowStats": {
"determinePartitions": {
"processed": 0,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
},
"buildSegments": {
"processed": 5390324,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
}
},
"errorMsg": null
},
"type": "ingestionStatsAndErrors"
}
}
```
### Live report
When a task is running, a live report containing the ingestion state, unparseable events, and moving averages for the number of events processed over 1-minute, 5-minute, and 15-minute windows can be retrieved at:
```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
```
and
```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/liveReports
```
An example output is shown below:
```json
{
"ingestionStatsAndErrors": {
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
"payload": {
"ingestionState": "RUNNING",
"unparseableEvents": {},
"rowStats": {
"movingAverages": {
"buildSegments": {
"5m": {
"processed": 3.392158326408501,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"15m": {
"processed": 1.736165476881023,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"1m": {
"processed": 4.206417693750045,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
}
}
},
"totals": {
"buildSegments": {
"processed": 1994,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
}
}
},
"errorMsg": null
},
"type": "ingestionStatsAndErrors"
}
}
```
A description of the fields:
The `ingestionStatsAndErrors` report provides information about row counts and errors.
The `ingestionState` shows what step of ingestion the task reached. Possible states include:
* `NOT_STARTED`: The task has not begun reading any rows
* `DETERMINE_PARTITIONS`: The task is processing rows to determine partitioning
* `BUILD_SEGMENTS`: The task is processing rows to construct segments
* `COMPLETED`: The task has finished its work.
Only batch tasks have the DETERMINE_PARTITIONS phase. Realtime tasks such as those created by the Kafka Indexing Service do not have a DETERMINE_PARTITIONS phase.
`unparseableEvents` contains lists of exception messages that were caused by unparseable inputs. This can help with identifying problematic input rows. There will be one list each for the DETERMINE_PARTITIONS and BUILD_SEGMENTS phases. Note that the Hadoop batch task does not support saving of unparseable events.
The `rowStats` map contains information about row counts. There is one entry for each ingestion phase. The definitions of the different row counts are shown below:
* `processed`: Number of rows successfully ingested without parsing errors
* `processedWithError`: Number of rows that were ingested, but contained a parsing error within one or more columns. This typically occurs where input rows have a parseable structure but invalid types for columns, such as passing in a non-numeric String value for a numeric column.
* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](index.md#transformspec), but doesn't include the rows skipped by explicit user configurations. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted.
* `unparseable`: Number of rows that could not be parsed at all and were discarded. This tracks input rows without a parseable structure, such as passing in non-JSON data when using a JSON parser.
The `errorMsg` field shows a message describing the error that caused a task to fail. It will be null if the task was successful.
## Live reports
### Row stats
The non-parallel [simple native batch task](../ingestion/native-batch.md#simple-task), the Hadoop batch task, and Kafka and Kinesis ingestion tasks support retrieval of row stats while the task is running.
The live report can be accessed with a GET to the following URL on a Peon running a task:
```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/rowStats
```
An example report is shown below. The `movingAverages` section contains 1 minute, 5 minute, and 15 minute moving averages of increases to the four row counters, which have the same definitions as those in the completion report. The `totals` section shows the current totals.
```
{
"movingAverages": {
"buildSegments": {
"5m": {
"processed": 3.392158326408501,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"15m": {
"processed": 1.736165476881023,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"1m": {
"processed": 4.206417693750045,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
}
}
},
"totals": {
"buildSegments": {
"processed": 1994,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
}
}
}
```
For the Kafka Indexing Service, a GET to the following Overlord API will retrieve live row stat reports from each task being managed by the supervisor and provide a combined report.
```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/supervisor/<supervisor-id>/stats
```
### Unparseable events
Lists of recently-encountered unparseable events can be retrieved from a running task with a GET to the following Peon API:
```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparseableEvents
```
Note that this functionality is not supported by all task types. Currently, it is only supported by the
non-parallel [native batch task](../ingestion/native-batch.md) (type `index`) and the tasks created by the Kafka
and Kinesis indexing services.
<a name="locks"></a>
## Task lock system
This section explains the task locking system in Druid. Druid's locking system
and versioning system are tightly coupled with each other to guarantee the correctness of ingested data.
## "Overshadowing" between segments
You can run a task to overwrite existing data. The segments created by an overwriting task _overshadow_ existing segments.
Note that the overshadow relation holds only for the same time chunk and the same data source.
These overshadowed segments are not considered in query processing to filter out stale data.
Each segment has a _major_ version and a _minor_ version. The major version is
represented as a timestamp in the format of [`"yyyy-MM-dd'T'hh:mm:ss"`](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat)
while the minor version is an integer number. These major and minor versions
are used to determine the overshadow relation between segments as seen below.
A segment `s1` overshadows another `s2` if
- `s1` has a higher major version than `s2`, or
- `s1` has the same major version and a higher minor version than `s2`.
Here are some examples.
- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `0` overshadows
another of the major version of `2018-01-01T00:00:00.000Z` and the minor version of `1`.
- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `1` overshadows
another of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `0`.
## Locking
If you are running two or more [druid tasks](./tasks.md) which generate segments for the same data source and the same time chunk,
the generated segments could potentially overshadow each other, which could lead to incorrect query results.
To avoid this problem, tasks will attempt to get locks prior to creating any segment in Druid.
There are two types of locks, i.e., _time chunk lock_ and _segment lock_.
When the time chunk lock is used, a task locks the entire time chunk of a data source where generated segments will be written.
For example, suppose we have a task ingesting data into the time chunk of `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source.
With the time chunk locking, this task will lock the entire time chunk of `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source
before it creates any segments. As long as it holds the lock, any other tasks will be unable to create segments for the same time chunk of the same data source.
The segments created with the time chunk locking have a _higher_ major version than existing segments. Their minor version is always `0`.
When the segment lock is used, a task locks individual segments instead of the entire time chunk.
As a result, two or more tasks can create segments for the same time chunk of the same data source simultaneously
if they are reading different segments.
For example, a Kafka indexing task and a compaction task can always write segments into the same time chunk of the same data source simultaneously.
This is because a Kafka indexing task always appends new segments, while a compaction task always overwrites existing segments.
The segments created with the segment locking have the _same_ major version and a _higher_ minor version.
> The segment locking is still experimental. It could have unknown bugs which potentially lead to incorrect query results.
To enable segment locking, you may need to set `forceTimeChunkLock` to `false` in the [task context](#context).
Once `forceTimeChunkLock` is unset, the task will choose a proper lock type to use automatically.
Please note that segment lock is not always available. The most common use case where time chunk lock is enforced is
when an overwriting task changes the segment granularity.
Also, segment locking is supported only by native indexing tasks and Kafka/Kinesis indexing tasks;
Hadoop indexing tasks don't support it.
`forceTimeChunkLock` in the task context is only applied to individual tasks.
If you want to unset it for all tasks, you would want to set `druid.indexer.tasklock.forceTimeChunkLock` to false in the [overlord configuration](../configuration/index.md#overlord-operations).
Lock requests can conflict with each other if two or more tasks try to acquire locks for overlapping time chunks of the same data source.
Note that lock conflicts can happen between different lock types.
The behavior on lock conflicts depends on the [task priority](#lock-priority).
If all tasks with conflicting lock requests have the same priority, the task that requested the lock first will get it;
the other tasks will wait for that task to release the lock.
If a lower-priority task requests a lock later than a higher-priority task,
it will also wait for the higher-priority task to release the lock.
If a higher-priority task requests a lock later than a lower-priority task,
it will _preempt_ the lower-priority task: the lock
of the lower-priority task is revoked and the higher-priority task acquires a new lock.
This lock preemption can happen at any time while a task is running except
when it is _publishing segments_ in a critical section. Its locks become preemptible again once publishing segments is finished.
Note that locks are shared by the tasks of the same groupId.
For example, Kafka indexing tasks of the same supervisor have the same groupId and share all locks with each other.
<a name="priority"></a>
## Lock priority
Each task type has a different default lock priority. The table below shows the default priorities of the different task types. The higher the number, the higher the priority.
|task type|default priority|
|---------|----------------|
|Realtime index task|75|
|Batch index task|50|
|Merge/Append/Compaction task|25|
|Other tasks|0|
You can override the task priority by setting your priority in the task context as below.
```json
"context" : {
"priority" : 100
}
```
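The `context` object sits at the top level of the task payload. As an illustrative sketch only, the hypothetical `kill` task below (data source and interval are made up) is submitted with an elevated priority:
```json
{
  "type": "kill",
  "dataSource": "wikipedia",
  "interval": "2015-09-12/2015-09-13",
  "context": {
    "priority": 100
  }
}
```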
<a name="context"></a>
## Context parameters
The task context is used for various individual task configuration parameters. The following parameters apply to all task types.
|property|default|description|
|--------|-------|-----------|
|`taskLockTimeout`|300000|Task lock timeout in milliseconds. For more details, see [Locking](#locking).|
|`forceTimeChunkLock`|true|_Setting this to false is still experimental_<br/> Force to always use the time chunk lock. If not set, each task automatically chooses a lock type to use. If set, this overrides the `druid.indexer.tasklock.forceTimeChunkLock` [configuration for the overlord](../configuration/index.md#overlord-operations). See [Locking](#locking) for more details.|
|`priority`|Different based on task types. See [Priority](#priority).|Task priority|
|`useLineageBasedSegmentAllocation`|false in 0.21 or earlier, true in 0.22 or later|Enable the new lineage-based segment allocation protocol for the native Parallel task with dynamic partitioning. This option should be off during the replacing rolling upgrade from one of the Druid versions between 0.19 and 0.21 to Druid 0.22 or higher. Once the upgrade is done, it must be set to true to ensure data correctness.|
> When a task acquires a lock, it sends a request via HTTP and waits until it receives a response containing the lock acquisition result.
> As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of Overlords.
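As a loose illustration, a context combining several of these parameters might look like the following; the values are arbitrary, and `taskLockTimeout` should stay below the Overlord's `druid.server.http.maxIdleTime`:
```json
"context" : {
  "taskLockTimeout" : 300000,
  "forceTimeChunkLock" : true,
  "priority" : 75
}
```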
## All task types
### `index`
See [Native batch ingestion (simple task)](native-batch.md#simple-task).
### `index_parallel`
See [Native batch ingestion (parallel task)](native-batch.md#parallel-task).
### `index_sub`
Submitted automatically, on your behalf, by an [`index_parallel`](#index_parallel) task.
### `index_hadoop`
See [Hadoop-based ingestion](hadoop.md).
### `index_kafka`
Submitted automatically, on your behalf, by a
[Kafka-based ingestion supervisor](../development/extensions-core/kafka-ingestion.md).
### `index_kinesis`
Submitted automatically, on your behalf, by a
[Kinesis-based ingestion supervisor](../development/extensions-core/kinesis-ingestion.md).
### `index_realtime`
Submitted automatically, on your behalf, by [Tranquility](tranquility.md).
### `compact`
Compaction tasks merge all segments of the given interval. See the documentation on
[compaction](compaction.md) for details.
### `kill`
Kill tasks delete all metadata about certain segments and remove them from deep storage.
See the documentation on [deleting data](../ingestion/data-management.md#delete) for details.
## Task reference

Tasks do all [ingestion](ingestion.md)-related work in Druid.

For batch ingestion, you will generally submit tasks directly to Druid using the [task APIs](../operations/api.md#Overlord). For streaming ingestion, tasks are generally submitted to a supervisor.

### Task API

Task APIs are available in two main places:

* The [Overlord](../design/Overlord.md) process offers HTTP APIs to submit tasks, cancel tasks, check their status, and review task logs and reports. See the [Task API documentation](../operations/api.md) for a full list.
* Druid SQL includes a [`sys.tasks`](../querying/druidsql.md#系统Schema) table that provides information about currently running tasks. This table is read-only and offers a limited subset of the full task information available through the Overlord APIs.

### Task reports

A report containing information about the number of rows ingested and any parse exceptions that occurred is available for both completed tasks and running tasks.

The reporting feature is supported by the [simple native batch task](native.md#简单任务), the Hadoop batch task, and the tasks created by the Kafka and Kinesis indexing services.

#### Completion report

After a task completes, a completion report can be retrieved at:

```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
```

An example output is shown below:
```json
{
"ingestionStatsAndErrors": {
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
"payload": {
"ingestionState": "COMPLETED",
"unparseableEvents": {},
"rowStats": {
"determinePartitions": {
"processed": 0,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
},
"buildSegments": {
"processed": 5390324,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
}
},
"errorMsg": null
},
"type": "ingestionStatsAndErrors"
}
}
```
#### Live report

When a task is running, a live report containing ingestion state, unparseable events, and moving averages of the number of events processed over the past 1 minute, 5 minutes, and 15 minutes can be retrieved at:

```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
```

and

```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/liveReports
```

An example output is shown below:
```json
{
"ingestionStatsAndErrors": {
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
"payload": {
"ingestionState": "RUNNING",
"unparseableEvents": {},
"rowStats": {
"movingAverages": {
"buildSegments": {
"5m": {
"processed": 3.392158326408501,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"15m": {
"processed": 1.736165476881023,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"1m": {
"processed": 4.206417693750045,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
}
}
},
"totals": {
"buildSegments": {
"processed": 1994,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
}
}
},
"errorMsg": null
},
"type": "ingestionStatsAndErrors"
}
}
```
A description of the fields:

The `ingestionStatsAndErrors` report provides information about row counts and errors.

The `ingestionState` shows which step the ingestion task has reached. Possible states include:

* `NOT_STARTED`: The task has not started reading any rows
* `DETERMINE_PARTITIONS`: The task is processing rows to determine partitioning
* `BUILD_SEGMENTS`: The task is processing rows to construct segments
* `COMPLETED`: The task is complete

Only batch tasks have the `DETERMINE_PARTITIONS` phase. Realtime tasks, such as those created by the Kafka indexing service, do not have a `DETERMINE_PARTITIONS` phase.

`unparseableEvents` contains lists of exception messages that were caused by unparseable inputs. This can help with identifying problematic input rows. There is one list each for the `DETERMINE_PARTITIONS` and `BUILD_SEGMENTS` phases. Note that the Hadoop batch task does not support saving unparseable events.

The `rowStats` map contains information about row counts, with one entry per ingestion phase. The definitions of the different row counts are shown below:

* `processed`: Number of rows successfully ingested without parsing errors
* `processedWithError`: Number of rows that were ingested, but contained a parsing error within one or more columns. This typically occurs where input rows have a parseable structure but invalid types for the columns, such as passing in a non-numeric String value for a numeric column.
* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](ingestion.md#transformspec), but doesn't include rows skipped by explicit user configuration. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted.
* `unparseable`: Number of rows that could not be parsed at all and were discarded. This tracks input rows without a parseable structure, such as passing in non-JSON data when using a JSON parser.

The `errorMsg` field shows a message describing the error that caused the task to fail. It will be null if the task was successful.
### Live reports

#### Row stats

The non-parallel [simple native batch task](native.md#简单任务), the Hadoop batch task, and the tasks created by the Kafka and Kinesis indexing services support retrieval of row stats while the task is running.

The live report can be accessed with a GET to the following URL on a Peon running a task:

```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/rowStats
```

An example report is shown below. The `movingAverages` section contains 1 minute, 5 minute, and 15 minute moving averages of increases to the four row counters, which have the same definitions as those in the completion report. The `totals` section shows the current totals.
```json
{
"movingAverages": {
"buildSegments": {
"5m": {
"processed": 3.392158326408501,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"15m": {
"processed": 1.736165476881023,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
},
"1m": {
"processed": 4.206417693750045,
"unparseable": 0,
"thrownAway": 0,
"processedWithError": 0
}
}
},
"totals": {
"buildSegments": {
"processed": 1994,
"processedWithError": 0,
"thrownAway": 0,
"unparseable": 0
}
}
}
```
For the Kafka Indexing Service, a GET to the following Overlord API will retrieve live row stat reports from each task managed by the supervisor and provide a combined report.

```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/supervisor/<supervisor-id>/stats
```

#### Unparseable events

Lists of recently-encountered unparseable events can be retrieved from a running task with a GET to the following Peon API:

```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparseableEvents
```

Note that this functionality is not supported by all task types. Currently, it is only supported by the non-parallel [native batch task](native.md) (type `index`) and the tasks created by the Kafka and Kinesis indexing services.
### Task lock system

This section explains the task locking system in Druid. Druid's locking system and versioning system are tightly coupled with each other to guarantee the correctness of ingested data.

### "Overshadowing" between segments

You can run a task to overwrite existing data. The segments created by an overwriting task *overshadow* existing segments. Note that the overshadow relation holds only for the **same time chunk and the same data source**. Overshadowed segments are not considered in query processing, which filters out stale data.

Each segment has a *major* version and a *minor* version. The major version is represented as a timestamp in the format of ["yyyy-MM-dd'T'hh:mm:ss"](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) while the minor version is an integer. These major and minor versions are used to determine the overshadow relation between segments as seen below.

A segment `s1` overshadows another segment `s2` if:

* `s1` has a higher major version than `s2`, or
* `s1` has the same major version and a higher minor version than `s2`.

Here are some examples:

* A segment with major version `2019-01-01T00:00:00.000Z` and minor version `0` overshadows another with major version `2018-01-01T00:00:00.000Z` and minor version `1`.
* A segment with major version `2019-01-01T00:00:00.000Z` and minor version `1` overshadows another with major version `2019-01-01T00:00:00.000Z` and minor version `0`.

### Locking

If you are running two or more [Druid tasks](taskrefer.md) that generate segments for the same data source and the same time chunk, the generated segments could potentially overshadow each other, which could lead to incorrect query results. To avoid this problem, tasks try to acquire locks before creating any segments in Druid. There are two types of locks: the *time chunk lock* and the *segment lock*.

When the time chunk lock is used, a task locks the entire time chunk of the data source into which the generated segments will be written. For example, suppose we have a task ingesting data into the time chunk `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source. With time chunk locking, this task locks the entire time chunk `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source before it creates any segments. As long as it holds the lock, no other task can create segments for the same time chunk of the same data source. Segments created with time chunk locking have a *higher* major version than existing segments; their minor version is always `0`.

When the segment lock is used, a task locks individual segments instead of the entire time chunk. As a result, two or more tasks can simultaneously create segments for the same time chunk of the same data source if they are reading different segments. For example, a Kafka indexing task and a compaction task can always write segments into the same time chunk of the same data source simultaneously. This is because a Kafka indexing task always appends new segments, while a compaction task always overwrites existing segments. Segments created with segment locking have the *same* major version and a *higher* minor version.

> [!WARNING]
> Segment locking is still experimental. It could have unknown bugs which potentially lead to incorrect query results.

To enable segment locking, you may need to set `forceTimeChunkLock` to `false` in the [task context](#context-parameters). Once `forceTimeChunkLock` is unset, the task automatically chooses the proper lock type to use. **Note that** segment locking is not always available. The most common case where time chunk locking is enforced is when an overwriting task changes the segment granularity. Also, segment locking is supported only by native indexing tasks and Kafka/Kinesis indexing tasks; Hadoop indexing tasks and realtime indexing (`index_realtime`) tasks (used by [Tranquility](tranquility.md)) do not support it yet.

`forceTimeChunkLock` in the task context is applied only to individual tasks. If you want to unset it for all tasks, set `druid.indexer.tasklock.forceTimeChunkLock` to `false` in the [Overlord configuration](../configuration/human-readable-byte.md#overlord).

Lock requests can conflict with each other if two or more tasks try to acquire locks for overlapping time chunks of the same data source. **Note that** lock conflicts can happen between different lock types.

The behavior on lock conflicts depends on the [task priority](#lock-priority). If all tasks with conflicting lock requests have the same priority, the task that requested the lock first will get it; the other tasks will wait for it to be released.

If a lower-priority task requests a lock later than a higher-priority task, it will also wait for the higher-priority task to release the lock. If a higher-priority task requests a lock later than a lower-priority task, it will *preempt* the lower-priority task: the lock of the lower-priority task is revoked and the higher-priority task acquires a new lock.

This lock preemption can happen at any time while a task is running, except while it is *publishing segments* in a critical section. Its locks become preemptible again once publishing segments is finished.

**Note that** locks are shared by tasks with the same groupId. For example, Kafka indexing tasks of the same supervisor have the same groupId and share all locks with each other.

### Lock priority

Each task type has a different default lock priority. The table below shows the default priorities of the different task types. The higher the number, the higher the priority.

| Task type | Default priority |
|-|-|
| Realtime index task | 75 |
| Batch index task | 50 |
| Merge/Append/Compaction task | 25 |
| Other tasks | 0 |

You can override the task priority by setting your priority in the task context as below.
```json
"context" : {
"priority" : 100
}
```
### Context parameters

The task context is used for various individual task configuration parameters. The following parameters apply to all task types.

| Property | Default | Description |
|-|-|-|
| `taskLockTimeout` | 300000 | Task lock timeout in milliseconds. For more details, see the [Locking](#locking) section. |
| `forceTimeChunkLock` | true | *Setting this to false is still experimental.* Force to always use the time chunk lock. If not set, each task automatically chooses the lock type to use. If set, it overrides the `druid.indexer.tasklock.forceTimeChunkLock` setting in the [Overlord configuration](../Configuration/configuration.md#overlord). See the [Locking](#locking) section for more details. |
| `priority` | Differs based on the task type. See [Lock priority](#lock-priority). | Task priority |

> [!WARNING]
> When a task acquires a lock, it sends an HTTP request and waits until it receives a response containing the lock acquisition result. As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of the Overlord.
### All task types

#### `index`

See [Native batch ingestion (simple task)](native.md#简单任务).

#### `index_parallel`

See [Native batch ingestion (parallel task)](native.md#并行任务).

#### `index_sub`

Submitted automatically, on your behalf, by an [`index_parallel`](#index_parallel) task.

#### `index_hadoop`

See [Hadoop-based ingestion](hadoop.md).

#### `index_kafka`

Submitted automatically, on your behalf, by a [Kafka ingestion supervisor](kafka.md).

#### `index_kinesis`

Submitted automatically, on your behalf, by a [Kinesis ingestion supervisor](kinesis.md).

#### `index_realtime`

Submitted automatically, on your behalf, by [Tranquility](tranquility.md).

#### `compact`

Compaction tasks merge all segments of a given interval. See the documentation on [compaction](datamanage.md#压缩与重新索引) for details.

#### `kill`

Kill tasks delete all metadata about certain segments and remove them from deep storage. See the documentation on [deleting data](datamanage.md#删除数据) for details.

#### `append`

Append tasks append a list of segments into a single segment (one after another). The syntax is:
```json
{
"type": "append",
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>,
"aggregations": <optional list of aggregators>,
"context": <task context>
}
```
#### `merge`

Merge tasks merge a list of segments together. Any common timestamps are merged. If rollup is disabled as part of ingestion, common timestamps are not merged and rows are reordered by their timestamp.

> [!WARNING]
> The [`compact`](#compact) task is usually a better choice than the `merge` task.

The syntax is:
```json
{
"type": "merge",
"id": <task_id>,
"dataSource": <task_datasource>,
"aggregations": <list of aggregators>,
"rollup": <whether or not to rollup data during a merge>,
"segments": <JSON list of DataSegment objects to merge>,
"context": <task context>
}
```
#### `same_interval_merge`

The same interval merge task is a shortcut for the merge task: all segments in the given interval are merged.

> [!WARNING]
> The [`compact`](#compact) task is usually a better choice than the `same_interval_merge` task.

The syntax is:
```json
{
"type": "same_interval_merge",
"id": <task_id>,
"dataSource": <task_datasource>,
"aggregations": <list of aggregators>,
"rollup": <whether or not to rollup data during a merge>,
"interval": <DataSegment objects in this interval are going to be merged>,
"context": <task context>
}
```
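Like any other task, these legacy specs are submitted to the Overlord task endpoint. A minimal sketch, assuming the spec above has been saved to a hypothetical `same-interval-merge-task.json` and the Overlord listens on `localhost:8081`:
```bash
# Placeholder file name and Overlord address; adjust for your deployment.
curl -X POST -H 'Content-Type: application/json' \
  -d @same-interval-merge-task.json \
  http://localhost:8081/druid/indexer/v1/task
```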
36
ingestion/tranquility.md Normal file
View File
@ -0,0 +1,36 @@
---
id: tranquility
title: "Tranquility"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
[Tranquility](https://github.com/druid-io/tranquility/) is a separately distributed package for pushing
streams to Druid in real-time.
Tranquility has not been built against a version of Druid later than Druid 0.9.2
release. It may still work with the latest Druid servers, but not all features and functionality will be available
due to limitations of older Druid APIs on the Tranquility side.
For new projects that require streaming ingestion, we recommend using Druid's native support for
[Apache Kafka](../development/extensions-core/kafka-ingestion.md) or
[Amazon Kinesis](../development/extensions-core/kinesis-ingestion.md).
For more details, check out the [Tranquility GitHub page](https://github.com/druid-io/tranquility/).
View File
@ -1 +0,0 @@
<!-- toc -->
View File
@ -0,0 +1,43 @@
# Deep storage migration
If you have been running an evaluation Druid cluster using local deep storage and wish to migrate to a
more production-capable deep storage system such as S3 or HDFS, this document describes the necessary steps.
Migration of deep storage involves the following steps at a high level:
- Copying segments from local deep storage to the new deep storage
- Exporting Druid's segments table from metadata
- Rewriting the load specs in the exported segment data to reflect the new deep storage location
- Reimporting the edited segments into metadata
## Shut down cluster services
To ensure a clean migration, shut down the non-coordinator services to ensure that metadata state will not
change as you do the migration.
When migrating from Derby, the coordinator processes will still need to be up initially, as they host the Derby database.
## Copy segments from old deep storage to new deep storage.
Before migrating, you will need to copy your old segments to the new deep storage.
For information on what path structure to use in the new deep storage, please see [deep storage migration options](../operations/export-metadata.md#deep-storage-migration).
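As a rough sketch, copying segments from the local deep storage directory of a single-machine deployment to an S3 bucket might look like the following; the local path and bucket name are hypothetical, and the target layout must match the path structure described above:
```bash
# Placeholder paths; the local segment directory and target bucket depend on your configuration.
aws s3 cp --recursive var/druid/segments/ s3://my-druid-deep-storage/druid/segments/
```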
## Export segments with rewritten load specs
Druid provides an [Export Metadata Tool](../operations/export-metadata.md) for exporting metadata from Derby into CSV files
which can then be reimported.
By setting [deep storage migration options](../operations/export-metadata.md#deep-storage-migration), the `export-metadata` tool will export CSV files where the segment load specs have been rewritten to load from your new deep storage location.
Run the `export-metadata` tool on your existing cluster, using the migration options appropriate for your new deep storage location, and save the CSV files it generates. After a successful export, you can shut down the coordinator.
### Import metadata
After generating the CSV exports with the modified segment data, you can reimport the contents of the Druid segments table from the generated CSVs.
Please refer to [import commands](../operations/export-metadata.md#importing-metadata) for examples. Only the `druid_segments` table needs to be imported.
### Restart cluster
After importing the segment table successfully, you can now restart your cluster.
View File
@ -0,0 +1,68 @@
# Metadata migration
If you have been running an evaluation Druid cluster using the built-in Derby metadata storage and wish to migrate to a
more production-capable metadata store such as MySQL or PostgreSQL, this document describes the necessary steps.
## Shut down cluster services
To ensure a clean migration, shut down the non-coordinator services to ensure that metadata state will not
change as you do the migration.
When migrating from Derby, the coordinator processes will still need to be up initially, as they host the Derby database.
## Exporting metadata
Druid provides an [Export Metadata Tool](../operations/export-metadata.md) for exporting metadata from Derby into CSV files
which can then be imported into your new metadata store.
The tool also provides options for rewriting the deep storage locations of segments; this is useful
for [deep storage migration](../operations/deep-storage-migration.md).
Run the `export-metadata` tool on your existing cluster, and save the CSV files it generates. After a successful export, you can shut down the coordinator.
## Initializing the new metadata store
### Create database
Before importing the existing cluster metadata, you will need to set up the new metadata store.
The [MySQL extension](../development/extensions-core/mysql.md) and [PostgreSQL extension](../development/extensions-core/postgresql.md) docs have instructions for initial database setup.
### Update configuration
Update your Druid runtime properties with the new metadata configuration.
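For example, pointing a cluster at MySQL might involve properties along these lines in `conf/druid/cluster/_common/common.runtime.properties`; the host, database name, and credentials below are placeholders:
```
# Add the metadata storage extension to your existing druid.extensions.loadList as well,
# e.g. "mysql-metadata-storage".
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```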
### Create Druid tables
Druid provides a `metadata-init` tool for creating Druid's metadata tables. After initializing the Druid database, you can run the commands shown below from the root of the Druid package to initialize the tables.
In the example commands below:
- `lib` is the Druid lib directory
- `extensions` is the Druid extensions directory
- `base` corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.
- The `--connectURI` parameter corresponds to the value of `druid.metadata.storage.connector.connectURI`.
- The `--user` parameter corresponds to the value of `druid.metadata.storage.connector.user`.
- The `--password` parameter corresponds to the value of `druid.metadata.storage.connector.password`.
#### MySQL
```bash
cd ${DRUID_ROOT}
java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log4j2.xml -Ddruid.extensions.directory="extensions" -Ddruid.extensions.loadList=[\"mysql-metadata-storage\"] -Ddruid.metadata.storage.type=mysql org.apache.druid.cli.Main tools metadata-init --connectURI="<mysql-uri>" --user <user> --password <pass> --base druid
```
#### PostgreSQL
```bash
cd ${DRUID_ROOT}
java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log4j2.xml -Ddruid.extensions.directory="extensions" -Ddruid.extensions.loadList=[\"postgresql-metadata-storage\"] -Ddruid.metadata.storage.type=postgresql org.apache.druid.cli.Main tools metadata-init --connectURI="<postgresql-uri>" --user <user> --password <pass> --base druid
```
### Import metadata
After initializing the tables, please refer to the [import commands](../operations/export-metadata.md#importing-metadata) for your target database.
### Restart cluster
After importing the metadata successfully, you can now restart your cluster.
View File
@ -1 +0,0 @@
<!-- toc -->
View File
@ -249,27 +249,13 @@ druid.indexer.logs.directory=/druid/indexing-logs
See the [HDFS extension](../development/extensions-core/hdfs.md) page for more information.

## Hadoop connectivity configuration (optional)

If you want to load data from a Hadoop cluster, make the following configuration changes to your Druid cluster:

- Update `druid.indexer.task.hadoopWorkingPath` in `conf/druid/cluster/middleManager/runtime.properties` to an HDFS path that you want to use for temporary files during the indexing process. `druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing` is a common configuration.
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid processes. You can do this by copying them into the `conf/druid/cluster/_common` directory.

Note that you do not need to use HDFS deep storage in order to load data from Hadoop.
@ -277,6 +263,7 @@ druid.indexer.logs.directory=/druid/indexing-logs
For more information, see the [Hadoop-based ingestion](../ingestion/hadoop.md) page.

## Configure Zookeeper connection

In a production environment, we recommend deploying a dedicated ZK cluster, separate from the Druid cluster.
View File
@ -1,27 +1,159 @@
<!-- toc -->
# Loading data from local files

This guide demonstrates how to load data from local files into Apache Druid using Druid's native batch ingestion feature.
To load data into Druid, you submit an *ingestion task* spec to the Druid Overlord process. You can write the spec by hand, or you can use the _data loader_ built into the Druid console.

## Loading local files

The [quickstart](../tutorials/index.md) shows you how to use the data loader to build an ingestion spec. This tutorial demonstrates how to perform a batch file load using Apache Druid's native batch ingestion.

In a production environment, you will likely want your ingestion to run automatically. This guide first shows you how to submit an ingestion spec through the Druid console, and then how to automate the process by loading data from the command line and from a script.

This tutorial assumes you have already downloaded Druid as described in the [quickstart](../GettingStarted/chapter-1.md) and have it running on your local machine using the `micro-quickstart` single-machine configuration. You do not need to have loaded any data yet.

A data load in Druid is initiated by submitting an *ingestion task spec* to the Overlord service. For this tutorial, we will load the sample Wikipedia page edit data.

## Loading data with a spec (via console)
The Druid package includes the following sample native batch ingestion task spec at `quickstart/tutorial/wikipedia-index.json`, shown here for convenience,
which has been configured to read the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` input file:
```json
{
"type" : "index_parallel",
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia",
"dimensionsSpec" : {
"dimensions" : [
"channel",
"cityName",
"comment",
"countryIsoCode",
"countryName",
"isAnonymous",
"isMinor",
"isNew",
"isRobot",
"isUnpatrolled",
"metroCode",
"namespace",
"page",
"regionIsoCode",
"regionName",
"user",
{ "name": "added", "type": "long" },
{ "name": "deleted", "type": "long" },
{ "name": "delta", "type": "long" }
]
},
"timestampSpec": {
"column": "time",
"format": "iso"
},
"metricsSpec" : [],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2015-09-12/2015-09-13"],
"rollup" : false
}
},
"ioConfig" : {
"type" : "index_parallel",
"inputSource" : {
"type" : "local",
"baseDir" : "quickstart/tutorial/",
"filter" : "wikiticker-2015-09-12-sampled.json.gz"
},
"inputFormat" : {
"type": "json"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index_parallel",
"maxRowsPerSegment" : 5000000,
"maxRowsInMemory" : 25000
}
}
}
```
This spec creates a datasource named "wikipedia".
From the Ingestion view, click the ellipses next to Tasks and choose `Submit JSON task`.
![Tasks view add task](../assets/tutorial-batch-submit-task-01.png "Tasks view add task")
This brings up the spec submission dialog where you can paste the spec above.
![Query view](../assets/tutorial-batch-submit-task-02.png "Query view")
Once the spec is submitted, wait a few moments for the data to load, after which you can query it.
## Loading data with a spec (via command line)
For convenience, the Druid package includes a batch ingestion helper script at `bin/post-index-task`.
This script will POST an ingestion task to the Druid Overlord and poll Druid until the data is available for querying.
Run the following command from Druid package root:
```bash
bin/post-index-task --file quickstart/tutorial/wikipedia-index.json --url http://localhost:8081
```
You should see output like the following:
```bash
Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-07-27T06:37:44.323Z
Task log: http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/log
Task status: http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/status
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data
```
Once the spec is submitted, you can follow the same instructions as above to wait for the data to load and then query it.
## Loading data without the script
Let's briefly discuss how we would've submitted the ingestion task without using the script. You do not need to run these commands.
To submit the task, POST it to Druid in a new terminal window from the apache-druid-{{DRUIDVERSION}} directory:
```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8081/druid/indexer/v1/task
```
This will print the ID of the task if the submission was successful:
```bash
{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}
```
You can monitor the status of this task from the console as outlined above.
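If you prefer the command line, you can also poll the status endpoint shown in the script output earlier, using the task ID returned by the submission:
```bash
curl http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-06-09T21:30:32.802Z/status
```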
## Querying your data
Once the data is loaded, please follow the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
## Cleanup
If you wish to go through any of the other ingestion tutorials, you will need to shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the druid package, as the other tutorials will write to the same "wikipedia" datasource.
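A minimal sketch of that reset, run from the root of the Druid package after the cluster has been stopped:
```bash
# Stop all Druid services first, then remove the stored state.
rm -rf var
```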
## Further reading
For more information on loading batch data, please see [the native batch ingestion documentation](../ingestion/native-batch.md).
An *ingestion task spec* can be written by hand or built with the data loader in the Druid console. The data loader helps you build a spec by sampling your data and iteratively configuring various ingestion parameters. The data loader currently only supports native batch ingestion; support for streams, including data stored in Apache Kafka and AWS Kinesis, will be added in future releases. For now, streaming ingestion can only be configured with a hand-written ingestion spec.

We've included a sample of Wikipedia edits from September 12, 2015 to get you started.

### Loading data with the data loader