Merge pull request #43 from cwiki-us-docs/feature/kafka-ingestion

更新文档
This commit is contained in:
YuCheng Hu 2021-10-07 16:09:33 -04:00 committed by GitHub
commit d96c4bd3de
14 changed files with 944 additions and 417 deletions

View File

@ -88,7 +88,7 @@
* [过滤](querying/filters.md)
* [粒度](querying/granularity.md)
* [维度](querying/dimensionspec.md)
* [聚合](querying/Aggregations.md)
* [聚合](querying/aggregations.md)
* [后聚合](querying/postaggregation.md)
* [表达式](querying/expression.md)
* [Having(GroupBy)](querying/having.md)

View File

@ -5,10 +5,9 @@ supervisors 通过管理 Kafka 索引任务的创建和销毁的生命周期以
supervisor 对索引任务的状态进行监控,以便于对任务进行扩展或切换,故障管理等操作。
这个服务是由 `druid-kafka-indexing-service` 这个 druid 核心扩展(详情请见 [扩展列表](../../development/extensions.md)提供的。
这个服务是由 `druid-kafka-indexing-service` 这个 Druid 核心扩展提供的(详情请见 [扩展列表](../../development/extensions.md))。
> [!WARNING]
> Kafka索引服务支持在 Kafka 0.11.x 中开始使用的事务主题。这些更改使 Druid 使用的 Kafka 消费者与旧的 Kafka brokers 不兼容。
> Druid 的 Kafka 索引服务支持在 Kafka 0.11.x 中开始使用的事务主题。这些更改使 Druid 使用的 Kafka 消费者与旧的 Kafka brokers 不兼容。
> 在使用 Druid 从 Kafka中 导入数据之前,请确保你的 Kafka 版本为 0.11.x 或更高版本。
> 如果你使用的是旧版本的 Kafka brokers请先参阅《[Kafka 升级指南](https://kafka.apache.org/documentation/#upgrade)》进行升级。
@ -107,7 +106,7 @@ curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http:/
|`inputFormat`|Object|[`inputFormat`](../../ingestion/data-formats.md#input-format) 指定了如何解析输入数据。请参考 [下面的章节](#specifying-data-format) 来了解更多如何指定 input format 的内容。|Y|
|`consumerProperties`|Map<String, Object>|传递给 Kafka 消费者的一组属性 map其中必须包含 `bootstrap.servers` 属性,其值为 `<BROKER_1>:<PORT_1>,<BROKER_2>:<PORT_2>,...` 这样的服务器列表。针对使用 SSL 的连接:`keystore`、`truststore` 和 `key` 的密码可以使用字符串形式给出,或者使用 [Password Provider](../../operations/password-provider.md) 来提供。|Y|
|`pollTimeout`|Long|Kafka 消费者拉取poll记录时等待的时长单位为毫秒milliseconds。|N默认=100|
|`replicas`|Integer|副本的数量, 1 意味着一个单一任务(无副本)。副本任务将始终分配给不同的 workers以提供针对流程故障的恢复能力。|no默认值1|
|`replicas`|Integer|副本的数量, 1 意味着一个单一任务(无副本)。副本任务将始终分配给不同的 workers以提供针对流程故障的恢复能力。|N默认=1|
|`taskCount`|Integer|在一个 *replica set* 集中最大 *reading* 的数量。这意味着读取任务的最大的数量将是 `taskCount * replicas`, 任务总数(*reading* + *publishing*)是大于这个数值的。请参考 [Capacity Planning](#capacity-planning) 中的内容。如果 `taskCount > {numKafkaPartitions}` 的话,总的 reading 任务数量将会小于 `taskCount` 。|N默认=1|
|`taskDuration`|ISO8601 Period|任务停止读取数据并且将已经读取的数据发布为新段的时间周期|N默认=PT1H|
|`startDelay`|ISO8601 Period|supervisor 开始管理任务之前的等待时间周期。|N默认=PT1S|
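
下面是一个仅用于说明的 `ioConfig` 示例草图,展示了上表中部分字段的组合方式;其中的 topic 名称、broker 地址以及各项取值均为假设,请按实际环境填写:

```json
{
  "type": "kafka",
  "topic": "metrics",
  "inputFormat": { "type": "json" },
  "consumerProperties": {
    "bootstrap.servers": "kafka-broker-01:9092,kafka-broker-02:9092"
  },
  "pollTimeout": 100,
  "replicas": 2,
  "taskCount": 1,
  "taskDuration": "PT1H",
  "startDelay": "PT5S"
}
```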
@ -137,44 +136,42 @@ Kafka 索引服务indexing service支持 [`inputFormat`](../../ingestion/d
<a name="tuningconfig"></a>
### KafkaSupervisorTuningConfig
### KafkaSupervisorTuningConfig 配置
tuningConfig 的配置是可选的如果你不在这里对这个参数进行配置的话Druid 将会使用默认的配置来替代。
| Field | Type | Description | Required |
|-----------------------------------|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| `type` | String | The indexing task type, this should always be `kafka`. | yes |
| `maxRowsInMemory` | Integer | The number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists). Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set. | no (default == 1000000) |
| `maxBytesInMemory` | Long | The number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. Normally this is computed internally and user does not need to set it. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). | no (default == One-sixth of max JVM memory) |
| `maxRowsPerSegment` | Integer | The number of rows to aggregate into a segment; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier. | no (default == 5000000) |
| `maxTotalRows` | Long | The number of rows to aggregate across all segments; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier. | no (default == unlimited) |
| `intermediatePersistPeriod` | ISO8601 Period | The period that determines the rate at which intermediate persists occur. | no (default == PT10M) |
| `maxPendingPersists` | Integer | Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists). | no (default == 0, meaning one persist can be running concurrently with ingestion, and none can be queued up) |
| `indexSpec` | Object | Tune how data is indexed. See [IndexSpec](#indexspec) for more information. | no |
| `indexSpecForIntermediatePersists`| | Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values. | no (default = same as indexSpec) |
| `reportParseExceptions` | Boolean | *DEPRECATED*. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1. | no (default == false) |
| `handoffConditionTimeout` | Long | Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever. | no (default == 0) |
| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kafka messages that are no longer available (i.e. when OffsetOutOfRangeException is encountered).<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../operations/api-reference.html#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest offset available in Kafka, based on the value of the `useEarliestOffset` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestOffset` is false) or _DUPLICATED_ (if `useEarliestOffset` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful for non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.<br/><br/>This feature behaves similarly to the Kafka `auto.offset.reset` consumer property. | no (default == false) |
| `workerThreads` | Integer | The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation. | no (default == min(10, taskCount)) |
| `chatThreads` | Integer | The number of threads that will be used for communicating with indexing tasks. | no (default == min(10, taskCount * replicas)) |
| `chatRetries` | Integer | The number of times HTTP requests to indexing tasks will be retried before considering tasks unresponsive. | no (default == 8) |
| `httpTimeout` | ISO8601 Period | How long to wait for a HTTP response from an indexing task. | no (default == PT10S) |
| `shutdownTimeout` | ISO8601 Period | How long to wait for the supervisor to attempt a graceful shutdown of tasks before exiting. | no (default == PT80S) |
| `offsetFetchPeriod` | ISO8601 Period | How often the supervisor queries Kafka and the indexing tasks to fetch current offsets and calculate lag. | no (default == PT30S, min == PT5S) |
| `segmentWriteOutMediumFactory` | Object | Segment write-out medium to use when creating segments. See below for more information. | no (not specified by default, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used) |
| `intermediateHandoffPeriod` | ISO8601 Period | How often the tasks should hand off segments. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier. | no (default == P2147483647D) |
| `logParseExceptions` | Boolean | If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred. | no, default == false |
| `maxParseExceptions` | Integer | The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set. | no, unlimited default |
| `maxSavedParseExceptions` | Integer | When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../../ingestion/tasks.md#reports). Overridden if `reportParseExceptions` is set. | no, default == 0 |
|字段Field|类型Type|描述Description|是否必须Required|
| --- | --- | --- | --- |
|`type`|String|索引任务类型, 总是 kafka。|Y|
|`maxRowsInMemory`|Integer|在持久化之前在内存中聚合的最大行数。该数值为聚合之后的行数,所以它不等于原始输入事件的行数,而是事件被聚合后的行数。 通常用来管理所需的 JVM 堆内存。 使用 maxRowsInMemory * (2 + maxPendingPersists) 来当做索引任务的最大堆内存。通常用户不需要设置这个值,但是也需要根据数据的特点来决定,如果行的字节数较短,用户可能不想在内存中存储一百万行,应该设置这个值。|N默认=1000000|
|`maxBytesInMemory`|Long|在持久化之前在内存中聚合的最大字节数。这是基于对内存使用量的粗略估计,而不是实际使用量。通常这是在内部计算的,用户不需要设置它。 索引任务的最大内存使用量是 maxRowsInMemory * (2 + maxPendingPersists)|N默认=最大JVM内存的 1/6|
|`maxRowsPerSegment`|Integer|聚合到一个段中的行数,该数值为聚合后的数值。 当 maxRowsPerSegment 或者 maxTotalRows 有一个值命中的时候,则触发 handoff数据存盘后传到深度存储 该动作也会按照每 intermediateHandoffPeriod 时间间隔发生一次。|N默认=5000000|
|`maxTotalRows`|Long|所有段的聚合后的行数,该值为聚合后的行数。当 maxRowsPerSegment 或者 maxTotalRows 有一个值命中的时候则触发handoff数据落盘后传到深度存储 该动作也会按照每 intermediateHandoffPeriod 时间间隔发生一次。|N默认=unlimited|
|`intermediatePersistPeriod`|ISO8601 Period|确定触发持续化存储的周期|N默认= PT10M|
|`maxPendingPersists`|Integer|正在等待但尚未启动的持久化过程的最大数量。如果新的中间持久化将超过此限制,则在当前运行的持久化完成之前,摄取将被阻塞。索引任务的最大内存使用量是 maxRowsInMemory * (2 + maxPendingPersists)|N默认=0意味着一个持久化可以与摄取同时运行而没有持久化可以进入队列|
|`indexSpec`|Object|调整数据被如何索引。详情可以见 [IndexSpec](https://druid.ossez.com/#/development/extensions-core/kafka-ingestion?id=indexspec) 页面中的内容|N|
|`indexSpecForIntermediatePersists`||定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的维度/度量压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,而在它们被合并到发布的最终段之前使用它们,有关可能的值。详情可以见 [IndexSpec](https://druid.ossez.com/#/development/extensions-core/kafka-ingestion?id=indexspec) 页面中的内容。|N默认= 与 indexSpec 相同)|
|`reportParseExceptions`|Boolean|*已废弃DEPRECATED*。如果为 true则在解析期间遇到的异常将被抛出并中止摄取如果为 false则将跳过无法解析的行和字段。将 `reportParseExceptions` 设置为 true 将覆盖 `maxParseExceptions` 和 `maxSavedParseExceptions` 的现有配置:将 `maxParseExceptions` 设置为 0并将 `maxSavedParseExceptions` 限制为不超过 1。|N默认=false|
|`handoffConditionTimeout`|Long|段切换(持久化)可以等待的毫秒数(超时时间)。 该值要被设置为大于0的数设置为0意味着将会一直等待不超时。|N默认=0|
|`resetOffsetAutomatically`|Boolean|控制当Druid需要读取Kafka中不可用的消息时的行为比如当发生了 `OffsetOutOfRangeException` 异常时。 如果为false则异常将抛出这将导致任务失败并停止接收。如果发生这种情况则需要手动干预来纠正这种情况可能使用 [重置 Supervisor API](https://druid.ossez.com/#/../operations/api-reference?id=supervisor) 。此模式对于生产非常有用因为它将使您意识到摄取的问题。如果为trueDruid将根据 `useEarliestOffset` 属性的值(`true` 为 `earliest` `false` 为 `latest` 自动重置为Kafka中可用的较早或最新偏移量。请注意这可能导致数据在您不知情的情况下*被丢弃* (如果`useEarliestOffset` 为 `false` )或 *重复* (如果 `useEarliestOffset``true` 。消息将被记录下来以标识已发生重置但摄取将继续。这种模式对于非生产环境非常有用因为它将使Druid尝试自动从问题中恢复即使这些问题会导致数据被安静删除或重复。该特性与Kafka的 `auto.offset.reset` 消费者属性很相似|N默认=false|
|`workerThreads`|Integer|supervisor 用于为工作任务处理 请求/相应requests/responses异步操作的线程数。|N默认=min(10, taskCount)|
|`chatThreads`|Integer|用于与索引任务通信的线程数。|N默认=min(10, taskCount * replicas)|
|`chatRetries`|Integer|在认定索引任务无响应之前,对其发起的 HTTP 请求的重试次数。|N默认=8|
|`httpTimeout`|ISO8601 Period|索引任务的 HTTP 响应超时的时间。|N默认=PT10S|
|`shutdownTimeout`|ISO8601 Period|supervisor 尝试无故障的停掉一个任务的超时时间。|N默认=PT80S|
|`offsetFetchPeriod`|ISO8601 Period|supervisor 查询 Kafka 和索引任务以获取当前偏移和计算滞后的频率。|N默认=PT30Smin == PT5S|
|`segmentWriteOutMediumFactory`|Object|创建段时要使用的段写入介质。更多信息见下文。|N (默认不指定,使用来源于 `druid.peon.defaultSegmentWriteOutMediumFactory.type` 的值)|
|`intermediateHandoffPeriod`|ISO8601 Period|段发生切换的频率。当 `maxRowsPerSegment` 或者 `maxTotalRows` 有一个值命中的时候则触发handoff数据存盘后传到深度存储 该动作也会按照每 `intermediateHandoffPeriod` 时间间隔发生一次。|N默认=P2147483647D|
|`logParseExceptions`|Boolean|如果为 true则在发生解析异常时记录错误消息其中包含有关发生错误的行的信息。|N默认=false|
|`maxParseExceptions`|Integer|任务停止摄取并失败之前可发生的最大解析异常数。如果设置了 `reportParseExceptions` ,则该值会被覆盖。|N默认=unlimited|
|`maxSavedParseExceptions`|Integer|当出现解析异常时Druid 可以跟踪最近的若干个解析异常。`maxSavedParseExceptions` 决定将保存多少个异常实例。这些保存的异常将在任务完成后,在 [任务完成报告](https://druid.ossez.com/#/taskrefer?id=%e4%bb%bb%e5%8a%a1%e6%8a%a5%e5%91%8a) 中可用。如果设置了 `reportParseExceptions` ,则该值会被覆盖。|N默认=0|
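
作为参考,下面给出一个最小化的 `tuningConfig` 示例草图,只显式设置了上表中的少数字段,其余字段使用默认值;示例中的具体数值仅为假设:

```json
{
  "type": "kafka",
  "maxRowsInMemory": 100000,
  "maxRowsPerSegment": 5000000,
  "intermediatePersistPeriod": "PT10M",
  "logParseExceptions": true,
  "maxParseExceptions": 100,
  "maxSavedParseExceptions": 10
}
```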
#### IndexSpec
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|bitmap|Object|Compression format for bitmap indexes. Should be a JSON object. See [Bitmap types](#bitmap-types) below for options.|no (defaults to Roaring)|
|dimensionCompression|String|Compression format for dimension columns. Choose from `LZ4`, `LZF`, or `uncompressed`.|no (default == `LZ4`)|
|metricCompression|String|Compression format for primitive type metric columns. Choose from `LZ4`, `LZF`, `uncompressed`, or `none`.|no (default == `LZ4`)|
|longEncoding|String|Encoding format for metric and dimension columns with type long. Choose from `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as is with 8 bytes each.|no (default == `longs`)|
#### 索引属性IndexSpec
|字段Field|类型Type|描述Description|是否必须Required|
| --- | --- | --- | --- |
|bitmap|Object|针对 bitmap 索引使用的压缩格式。应该是一个 JSON 对象,请参考 [Bitmap types](https://druid.ossez.com/#/development/extensions-core/kafka-ingestion?id=bitmap-types) 来了解更多可选项。|N默认=Roaring|
|dimensionCompression|String|针对维度dimension列使用的压缩格式请从 `LZ4`、`LZF` 或者 `uncompressed` 中选择。|N默认=`LZ4`|
|metricCompression|String|针对原始类型primitive typemetric 列使用的压缩格式,请从 `LZ4`、`LZF`、`uncompressed` 或者 `none` 中选择。|N默认=`LZ4`|
|longEncoding|String|类型为 long 的 metric 列和维度dimension列的编码格式。请从 `auto` 或 `longs` 中选择。`auto` 根据列基数使用偏移量或查找表对值进行编码,并以可变大小存储;`longs` 将每个值按 8 字节原样存储。|N默认=`longs`)|
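
下面是一个 `indexSpec` 的示例草图,显式写出了上表中各字段的默认取值(放在 `tuningConfig` 的 `indexSpec` 字段中):

```json
{
  "bitmap": { "type": "roaring" },
  "dimensionCompression": "LZ4",
  "metricCompression": "LZ4",
  "longEncoding": "longs"
}
```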
##### Bitmap types

View File

@ -1088,7 +1088,7 @@ Druid以两种可能的方式来解释 `dimensionsSpec` : *normal* 和 *schemale
##### `metricsSpec`
`metricsSpec` 位于 `dataSchema` -> `metricsSpec` 中,是一个在摄入阶段要应用的 [聚合器](../querying/Aggregations.md) 列表。 在启用了 [rollup](#rollup) 时是很有用的,因为它将配置如何在摄入阶段进行聚合。
`metricsSpec` 位于 `dataSchema` -> `metricsSpec` 中,是一个在摄入阶段要应用的 [聚合器](../querying/aggregations.md) 列表。 在启用了 [rollup](#rollup) 时是很有用的,因为它将配置如何在摄入阶段进行聚合。
一个 `metricsSpec` 实例如下:
```json
@ -1099,7 +1099,7 @@ Druid以两种可能的方式来解释 `dimensionsSpec` : *normal* 和 *schemale
]
```
> [!WARNING]
> 通常,当 [rollup](#rollup) 被禁用时,应该有一个空的 `metricsSpec`因为没有rollupDruid不会在摄取时进行任何的聚合所以没有理由包含摄取时聚合器。但是在某些情况下定义Metrics仍然是有意义的例如如果要创建一个复杂的列作为 [近似聚合](../querying/Aggregations.md#近似聚合) 的预计算部分,则只能通过在 `metricsSpec` 中定义度量来实现
> 通常,当 [rollup](#rollup) 被禁用时,应该有一个空的 `metricsSpec`因为没有rollupDruid不会在摄取时进行任何的聚合所以没有理由包含摄取时聚合器。但是在某些情况下定义Metrics仍然是有意义的例如如果要创建一个复杂的列作为 [近似聚合](../querying/aggregations.md#近似聚合) 的预计算部分,则只能通过在 `metricsSpec` 中定义度量来实现
##### `granularitySpec`

View File

@ -294,7 +294,7 @@ and in your `metricsSpec`, include:
* 除了timestamp列之外Druid数据源中的所有列都是dimensions或metrics。这遵循 [OLAP数据的标准命名约定](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems)。
* 典型的生产数据源有几十到几百列。
* [dimension列](ingestion.md#维度) 按原样存储因此可以在查询时对其进行筛选、分组或聚合。它们总是单个字符串、字符串数组、单个long、单个double或单个float。
* [Metrics列](ingestion.md#指标) 是 [预聚合](../querying/Aggregations.md) 存储的,因此它们只能在查询时聚合(不能按筛选或分组)。它们通常存储为数字(整数或浮点数),但也可以存储为复杂对象,如[HyperLogLog草图或近似分位数草图](../querying/Aggregations.md)。即使禁用了rollup也可以在接收时配置metrics但在启用汇总时最有用。
* [Metrics列](ingestion.md#指标) 是 [预聚合](../querying/aggregations.md) 存储的,因此它们只能在查询时聚合(不能按筛选或分组)。它们通常存储为数字(整数或浮点数),但也可以存储为复杂对象,如[HyperLogLog草图或近似分位数草图](../querying/aggregations.md)。即使禁用了rollup也可以在接收时配置metrics但在启用汇总时最有用。
### 与其他设计模式类比
#### 关系模型
@ -325,7 +325,7 @@ Druid数据源通常相当于关系数据库中的表。Druid的 [lookups特性]
* Druid并不认为数据点是"时间序列"的一部分。相反Druid对每一点分别进行摄取和聚合
* 创建一个维度,该维度指示数据点所属系列的名称。这个维度通常被称为"metric"或"name"。不要将名为"metric"的维度与Druid Metrics的概念混淆。将它放在"dimensionsSpec"中维度列表的第一个位置,以获得最佳性能(这有助于提高局部性;有关详细信息,请参阅下面的 [分区和排序](ingestion.md#分区)
* 为附着到数据点的属性创建其他维度。在时序数据库系统中,这些通常称为"标签"
* 创建与您希望能够查询的聚合类型相对应的 [Druid Metrics](ingestion.md#指标)。通常这包括"sum"、"min"和"max"在long、float或double中的一种。如果你想计算百分位数或分位数可以使用Druid的 [近似聚合器](../querying/Aggregations.md)
* 创建与您希望能够查询的聚合类型相对应的 [Druid Metrics](ingestion.md#指标)。通常这包括"sum"、"min"和"max"在long、float或double中的一种。如果你想计算百分位数或分位数可以使用Druid的 [近似聚合器](../querying/aggregations.md)
* 考虑启用 [rollup](ingestion.md#rollup)这将允许Druid潜在地将多个点合并到Druid数据源中的一行中。如果希望以不同于原始发出的时间粒度存储数据则这可能非常有用。如果要在同一个数据源中组合时序和非时序数据它也很有用
* 如果您提前不知道要摄取哪些列,请使用空的维度列表来触发 [维度列的自动检测](#无schema的维度列)
@ -358,7 +358,7 @@ Druid可以在接收数据时将其汇总以最小化需要存储的原始数
草图(sketches)减少了查询时的内存占用因为它们限制了需要在服务器之间洗牌的数据量。例如在分位数计算中Druid不需要将所有数据点发送到中心位置以便对它们进行排序和计算分位数而只需要发送点的草图。这可以将数据传输需要减少到仅千字节。
有关Druid中可用的草图的详细信息请参阅 [近似聚合器页面](../querying/Aggregations.md)。
有关Druid中可用的草图的详细信息请参阅 [近似聚合器页面](../querying/aggregations.md)。
如果你更喜欢 [视频](https://www.youtube.com/watch?v=Hpd3f_MLdXo)那就看一看吧一个讨论Druid Sketches的会议。

View File

@ -1,339 +0,0 @@
<!-- toc -->
## 聚合(Aggregations)
> [!WARNING]
> Apache Druid支持两种查询语言 [Druid SQL](druidsql.md) 和 [原生查询](makeNativeQueries.md)。该文档描述了原生查询中的一种查询方式。 对于Druid SQL中使用的该种类型的信息可以参考 [SQL文档](druidsql.md)。
聚合可以在摄取时作为摄取规范的一部分提供作为在数据进入Apache Druid之前汇总数据的一种方式。聚合也可以在查询时指定为许多查询的一部分。
可用聚合包括:
### Count聚合器
`count`计算了过滤器匹配到行的总数:
```json
{ "type" : "count", "name" : <output_name> }
```
请注意计数聚合器计算Druid的行数这并不总是反映摄取的原始事件数。这是因为Druid可以配置为在摄取时汇总数据。要计算摄取的数据行数请在摄取时包括`count`聚合器,在查询时包括`longSum`聚合器。
### Sum聚合器
**`longSum`**
计算64位有符号整数的和
```json
{ "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
`name` 为求和后值的输出名
`fieldName` 为需要求和的指标列
**`doubleSum`**
计算64位浮点数的和与`longSum`相似
```json
{ "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`floatSum`**
计算32位浮点数的和与`longSum`和`doubleSum`相似
```json
{ "type" : "floatSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
### Min/Max聚合器
**`doubleMin`**
`doubleMin`计算所有指标值与Double.POSITIVE_INFINITY相比的较小者
```json
{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`doubleMax`**
`doubleMax`计算所有指标值与Double.NEGATIVE_INFINITY相比的较大者
```json
{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`floatMin`**
`floatMin`计算所有指标值与Float.POSITIVE_INFINITY相比的较小者
```json
{ "type" : "floatMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`floatMax`**
`floatMax`计算所有指标值与Float.NEGATIVE_INFINITY相比的较大者
```json
{ "type" : "floatMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`longMin`**
`longMin`计算所有指标值与Long.MAX_VALUE的较小者
```json
{ "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`longMax`**
`longMax`计算所有指标值与Long.MIN_VALUE的较大者
```json
{ "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`doubleMean`**
计算并返回列值的算术平均值作为64位浮点值。这只是一个查询时聚合器不应在摄入期间使用。
```json
{ "type" : "doubleMean", "name" : <output_name>, "fieldName" : <metric_name> }
```
### First/Last聚合器
Double/Float/Long的First/Last聚合器不能够使用在摄入规范中只能指定为查询时的一部分。
需要注意在启用了rollup的段上进行带有first/last聚合器查询将返回汇总后的值并不是返回原始数据的最后一个值。
**`doubleFirst`**
`doubleFirst`计算最小时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "doubleFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`doubleLast`**
`doubleLast`计算最大时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "doubleLast",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`floatFirst`**
`floatFirst`计算最小时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "floatFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`floatLast`**
`floatLast`计算最大时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "floatLast",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`longFirst`**
`longFirst`计算最小时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "longFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`longLast`**
`longLast`计算最大时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "longLast",
"name" : <output_name>,
"fieldName" : <metric_name>,
}
```
**`stringFirst`**
`stringFirst` 计算最小时间戳的维度值,行不存在的话为`null`
```json
{
"type" : "stringFirst",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024)
}
```
**`stringLast`**
`stringLast` 计算最大时间戳的维度值,行不存在的话为`null`
```json
{
"type" : "stringLast",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024)
}
```
### ANY聚合器
Double/Float/Long/String的ANY聚合器不能够使用在摄入规范中只能指定为查询时的一部分。
返回包括null在内的任何值。此聚合器可以通过返回第一个遇到的值包括null来简化和优化性能
**`doubleAny`**
`doubleAny`返回所有double类型的指标值
```json
{
"type" : "doubleAny",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`floatAny`**
`floatAny`返回所有float类型的指标值
```json
{
"type" : "floatAny",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`longAny`**
`longAny`返回所有long类型的指标值
```json
{
"type" : "longAny",
"name" : <output_name>,
"fieldName" : <metric_name>,
}
```
**`stringAny`**
`stringAny`返回所有string类型的指标值
```json
{
"type" : "stringAny",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024),
}
```
### JavaScript聚合器
计算一组列上的任意JavaScript函数同时允许指标和维度。JavaScript函数应该返回浮点值。
```json
{ "type": "javascript",
"name": "<output_name>",
"fieldNames" : [ <column1>, <column2>, ... ],
"fnAggregate" : "function(current, column1, column2, ...) {
<updates partial aggregate (current) based on the current row values>
return <updated partial aggregate>
}",
"fnCombine" : "function(partialA, partialB) { return <combined partial results>; }",
"fnReset" : "function() { return <initial value>; }"
}
```
实例:
```json
{
"type": "javascript",
"name": "sum(log(x)*y) + 10",
"fieldNames": ["x", "y"],
"fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
"fnCombine" : "function(partialA, partialB) { return partialA + partialB; }",
"fnReset" : "function() { return 10; }"
}
```
> [!WARNING]
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能参考 [JavaScript编程指南](../development/JavaScript.md)。
### 近似聚合(Approximate Aggregations)
#### 唯一计数(Count distinct)
**Apache DataSketches Theta Sketch**
由 [DataSketches Theta Sketch 扩展](../configuration/core-ext/datasketches-theta.md) 提供的聚合器使用 [Apache DataSketches 库](https://datasketches.apache.org/) 中的 Theta Sketch 给出去重计数distinct count估计并支持集合并集、交集和差集的后置聚合器。
**Apache DataSketches HLL Sketch**
由 [DataSketches HLL Sketch 扩展](../configuration/core-ext/datasketches-hll.md) 提供的聚合器使用 HyperLogLog 算法给出去重计数估计。
与 Theta 草图相比HLL 草图不支持集合set操作更新和合并速度稍慢但需要的空间要少得多。
**Cardinality, hyperUnique**
> [!WARNING]
> 对于新的场景,我们推荐评估使用 [DataSketches Theta Sketch扩展](../configuration/core-ext/datasketches-theta.md) 或 [DataSketches HLL Sketch扩展](../configuration/core-ext/datasketches-hll.md) 来替代。DataSketches 聚合器通常比经典的 Druid `cardinality` 和 `hyperUnique` 聚合器更灵活,并提供更好的精确度。
Cardinality 和 HyperUnique 聚合器是 Druid 中默认提供的较旧的聚合器实现,它们同样使用 HyperLogLog 算法给出去重计数估计。上文所述、由较新的 DataSketches Theta 和 HLL 扩展提供的聚合器具有更高的精度和性能,因此建议改为使用它们。
DataSketches 团队已经发表了一篇关于 Druid 原始 HLL 算法和 DataSketches HLL 算法的 [比较研究](https://datasketches.apache.org/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html)。基于 DataSketches 实现已证明的优势,我们建议优先使用它们,而不是 Druid 最初基于 HLL 的聚合器。但是,为了确保向后兼容性,我们将继续支持经典聚合器。
请注意,`hyperUnique` 聚合器与 DataSketches HLL 或 Theta sketches 互不兼容。
**多列操作(multi-column handling)**
#### 直方图与中位数
### 其他聚合
#### 过滤聚合器

760
querying/aggregations.md Normal file
View File

@ -0,0 +1,760 @@
# 聚合
> Apache Druid 支持两种查询语言: [Druid SQL](sql.md) 和 [原生查询native queries](querying.md)。
> 该文档描述了原生查询中的一种查询方式。
> 有关更多 Druid 在 SQL 中使用 aggregators聚合查询的方式请参考
> [SQL 文档](sql.md#aggregation-functions)。
聚合可以在数据导入的时候ingestion作为数据导入规范的一部分来进行提供作为在数据进入 Apache Druid 之前汇总数据的一种方式。聚合也可以在查询时指定为许多查询中的一部分。
可用聚合包括:
### Count aggregator
`count` computes the count of Druid rows that match the filters.
```json
{ "type" : "count", "name" : <output_name> }
```
Please note the count aggregator counts the number of Druid rows, which does not always reflect the number of raw events ingested.
This is because Druid can be configured to roll up data at ingestion time. To
count the number of ingested rows of data, include a count aggregator at ingestion time, and a longSum aggregator at
query time.
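As a sketch of that pattern (the metric name `count` here is only illustrative), the ingestion-time and query-time pieces could look like this. At ingestion time, in the `metricsSpec`:

```json
{ "type" : "count", "name" : "count" }
```

At query time, summing the ingestion-time counts:

```json
{ "type" : "longSum", "name" : "numIngestedRows", "fieldName" : "count" }
```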
### Sum aggregators
#### `longSum` aggregator
computes the sum of values as a 64-bit, signed integer
```json
{ "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
`name` output name for the summed value
`fieldName` name of the metric column to sum over
#### `doubleSum` aggregator
Computes and stores the sum of values as 64-bit floating point value. Similar to `longSum`
```json
{ "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
#### `floatSum` aggregator
Computes and stores the sum of values as 32-bit floating point value. Similar to `longSum` and `doubleSum`
```json
{ "type" : "floatSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
### Min / Max aggregators
#### `doubleMin` aggregator
`doubleMin` computes the minimum of all metric values and Double.POSITIVE_INFINITY
```json
{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
#### `doubleMax` aggregator
`doubleMax` computes the maximum of all metric values and Double.NEGATIVE_INFINITY
```json
{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
#### `floatMin` aggregator
`floatMin` computes the minimum of all metric values and Float.POSITIVE_INFINITY
```json
{ "type" : "floatMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
#### `floatMax` aggregator
`floatMax` computes the maximum of all metric values and Float.NEGATIVE_INFINITY
```json
{ "type" : "floatMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
#### `longMin` aggregator
`longMin` computes the minimum of all metric values and Long.MAX_VALUE
```json
{ "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
#### `longMax` aggregator
`longMax` computes the maximum of all metric values and Long.MIN_VALUE
```json
{ "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
### `doubleMean` aggregator
Computes and returns the arithmetic mean of a column's values as a 64-bit floating point value. `doubleMean` is a query time aggregator only. It is not available for indexing.
To accomplish mean aggregation on ingestion, refer to the [Quantiles aggregator](../development/extensions-core/datasketches-quantiles.md#aggregator) from the DataSketches extension.
```json
{ "type" : "doubleMean", "name" : <output_name>, "fieldName" : <metric_name> }
```
### First / Last aggregator
(Double/Float/Long) First and Last aggregator cannot be used in ingestion spec, and should only be specified as part of queries.
Note that queries with first/last aggregators on a segment created with rollup enabled will return the rolled up value, and not the last value within the raw ingested data.
#### `doubleFirst` aggregator
`doubleFirst` computes the metric value with the minimum timestamp or 0 in default mode or `null` in SQL compatible mode if no row exist
```json
{
"type" : "doubleFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
#### `doubleLast` aggregator
`doubleLast` computes the metric value with the maximum timestamp or 0 in default mode or `null` in SQL compatible mode if no row exist
```json
{
"type" : "doubleLast",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
#### `floatFirst` aggregator
`floatFirst` computes the metric value with the minimum timestamp or 0 in default mode or `null` in SQL compatible mode if no row exist
```json
{
"type" : "floatFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
#### `floatLast` aggregator
`floatLast` computes the metric value with the maximum timestamp or 0 in default mode or `null` in SQL compatible mode if no row exist
```json
{
"type" : "floatLast",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
#### `longFirst` aggregator
`longFirst` computes the metric value with the minimum timestamp or 0 in default mode or `null` in SQL compatible mode if no row exist
```json
{
"type" : "longFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
#### `longLast` aggregator
`longLast` computes the metric value with the maximum timestamp or 0 in default mode or `null` in SQL compatible mode if no row exist
```json
{
"type" : "longLast",
"name" : <output_name>,
"fieldName" : <metric_name>,
}
```
#### `stringFirst` aggregator
`stringFirst` computes the metric value with the minimum timestamp or `null` if no row exist
```json
{
"type" : "stringFirst",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024)
}
```
#### `stringLast` aggregator
`stringLast` computes the metric value with the maximum timestamp or `null` if no row exist
```json
{
"type" : "stringLast",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024)
}
```
### ANY aggregator
(Double/Float/Long/String) ANY aggregator cannot be used in ingestion spec, and should only be specified as part of queries.
Returns any value including null. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)
#### `doubleAny` aggregator
`doubleAny` returns any double metric value
```json
{
"type" : "doubleAny",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
#### `floatAny` aggregator
`floatAny` returns any float metric value
```json
{
"type" : "floatAny",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
#### `longAny` aggregator
`longAny` returns any long metric value
```json
{
"type" : "longAny",
"name" : <output_name>,
"fieldName" : <metric_name>,
}
```
#### `stringAny` aggregator
`stringAny` returns any string metric value
```json
{
"type" : "stringAny",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024),
}
```
### JavaScript aggregator
Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your
JavaScript functions are expected to return floating-point values.
```json
{ "type": "javascript",
"name": "<output_name>",
"fieldNames" : [ <column1>, <column2>, ... ],
"fnAggregate" : "function(current, column1, column2, ...) {
<updates partial aggregate (current) based on the current row values>
return <updated partial aggregate>
}",
"fnCombine" : "function(partialA, partialB) { return <combined partial results>; }",
"fnReset" : "function() { return <initial value>; }"
}
```
**Example**
```json
{
"type": "javascript",
"name": "sum(log(x)*y) + 10",
"fieldNames": ["x", "y"],
"fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
"fnCombine" : "function(partialA, partialB) { return partialA + partialB; }",
"fnReset" : "function() { return 10; }"
}
```
> JavaScript-based functionality is disabled by default. Please refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.
<a name="approx"></a>
## Approximate Aggregations
### Count distinct
#### Apache DataSketches Theta Sketch
The [DataSketches Theta Sketch](../development/extensions-core/datasketches-theta.md) extension-provided aggregator gives distinct count estimates with support for set union, intersection, and difference post-aggregators, using Theta sketches from the [Apache DataSketches](https://datasketches.apache.org/) library.
#### Apache DataSketches HLL Sketch
The [DataSketches HLL Sketch](../development/extensions-core/datasketches-hll.md) extension-provided aggregator gives distinct count estimates using the HyperLogLog algorithm.
Compared to the Theta sketch, the HLL sketch does not support set operations and has slightly slower update and merge speed, but requires significantly less space.
#### Cardinality, hyperUnique
> For new use cases, we recommend evaluating [DataSketches Theta Sketch](../development/extensions-core/datasketches-theta.md) or [DataSketches HLL Sketch](../development/extensions-core/datasketches-hll.md) instead.
> The DataSketches aggregators are generally able to offer more flexibility and better accuracy than the classic Druid `cardinality` and `hyperUnique` aggregators.
The [Cardinality and HyperUnique](../querying/hll-old.md) aggregators are older aggregator implementations available by default in Druid that also provide distinct count estimates using the HyperLogLog algorithm. The newer DataSketches Theta and HLL extension-provided aggregators described above have superior accuracy and performance and are recommended instead.
The DataSketches team has published a [comparison study](https://datasketches.apache.org/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html) between Druid's original HLL algorithm and the DataSketches HLL algorithm. Based on the demonstrated advantages of the DataSketches implementation, we are recommending using them in preference to Druid's original HLL-based aggregators.
However, to ensure backwards compatibility, we will continue to support the classic aggregators.
Please note that `hyperUnique` aggregators are not mutually compatible with Datasketches HLL or Theta sketches.
##### Multi-column handling
Note the DataSketches Theta and HLL aggregators currently only support single-column inputs. If you were previously using the Cardinality aggregator with multiple-column inputs, equivalent operations using Theta or HLL sketches are described below:
* Multi-column `byValue` Cardinality can be replaced with a union of Theta sketches on the individual input columns
* Multi-column `byRow` Cardinality can be replaced with a Theta or HLL sketch on a single [virtual column](../querying/virtual-columns.md) that combines the individual input columns.
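For example, a sketch of the `byRow` replacement might combine the input columns with an expression virtual column and feed it to a Theta sketch aggregator (this assumes the DataSketches Theta extension is loaded; the column names are illustrative):

```json
{
  "virtualColumns": [
    {
      "type": "expression",
      "name": "first_last",
      "expression": "concat(first_name, '|', last_name)",
      "outputType": "STRING"
    }
  ],
  "aggregations": [
    { "type": "thetaSketch", "name": "distinct_people", "fieldName": "first_last" }
  ]
}
```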
### Histograms and quantiles
#### DataSketches Quantiles Sketch
The [DataSketches Quantiles Sketch](../development/extensions-core/datasketches-quantiles.md) extension-provided aggregator provides quantile estimates and histogram approximations using the numeric quantiles DoublesSketch from the [datasketches](https://datasketches.apache.org/) library.
We recommend this aggregator in general for quantiles/histogram use cases, as it provides formal error bounds and has distribution-independent accuracy.
#### Moments Sketch (Experimental)
The [Moments Sketch](../development/extensions-contrib/momentsketch-quantiles.md) extension-provided aggregator is an experimental aggregator that provides quantile estimates using the [Moments Sketch](https://github.com/stanford-futuredata/momentsketch).
The Moments Sketch aggregator is provided as an experimental option. It is optimized for merging speed and it can have higher aggregation performance compared to the DataSketches quantiles aggregator. However, the accuracy of the Moments Sketch is distribution-dependent, so users will need to empirically verify that the aggregator is suitable for their input data.
As a general guideline for experimentation, the [Moments Sketch paper](https://arxiv.org/pdf/1803.01969.pdf) points out that this algorithm works better on inputs with high entropy. In particular, the algorithm is not a good fit when the input data consists of a small number of clustered discrete values.
#### Fixed Buckets Histogram
Druid also provides a [simple histogram implementation](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram) that uses a fixed range and fixed number of buckets with support for quantile estimation, backed by an array of bucket count values.
The fixed buckets histogram can perform well when the distribution of the input data allows a small number of buckets to be used.
We do not recommend the fixed buckets histogram for general use, as its usefulness is extremely data dependent. However, it is made available for users that have already identified use cases where a fixed buckets histogram is suitable.
#### Approximate Histogram (deprecated)
> The Approximate Histogram aggregator is deprecated.
> There are a number of other quantile estimation algorithms that offer better performance, accuracy, and memory footprint.
> We recommend using [DataSketches Quantiles](../development/extensions-core/datasketches-quantiles.md) instead.
The [Approximate Histogram](../development/extensions-core/approximate-histograms.md) extension-provided aggregator also provides quantile estimates and histogram approximations, based on [http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf](http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf).
The algorithm used by this deprecated aggregator is highly distribution-dependent and its output is subject to serious distortions when the input does not fit within the algorithm's limitations.
A [study published by the DataSketches team](https://datasketches.apache.org/docs/QuantilesStudies/DruidApproxHistogramStudy.html) demonstrates some of the known failure modes of this algorithm:
- The algorithm's quantile calculations can fail to provide results for a large range of rank values (all ranks less than 0.89 in the example used in the study), returning all zeroes instead.
- The algorithm can completely fail to record spikes in the tail ends of the distribution
- In general, the histogram produced by the algorithm can deviate significantly from the true histogram, with no bounds on the errors.
It is not possible to determine a priori how well this aggregator will behave for a given input stream, nor does the aggregator provide any indication that serious distortions are present in the output.
For these reasons, we have deprecated this aggregator and recommend using the DataSketches Quantiles aggregator instead for new and existing use cases, although we will continue to support Approximate Histogram for backwards compatibility.
## Miscellaneous Aggregations
### Filtered Aggregator
A filtered aggregator wraps any given aggregator, but only aggregates the values for which the given dimension filter matches.
This makes it possible to compute the results of a filtered and an unfiltered aggregation simultaneously, without having to issue multiple queries, and use both results as part of post-aggregations.
*Note:* If only the filtered results are required, consider putting the filter on the query itself, which will be much faster since it does not require scanning all the data.
```json
{
"type" : "filtered",
"filter" : {
"type" : "selector",
"dimension" : <dimension>,
"value" : <dimension value>
},
"aggregator" : <aggregation>
}
```
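For instance, a query's `aggregations` list might compute an overall count alongside a filtered count in a single pass (the dimension and value below are placeholders):

```json
{
  "aggregations": [
    { "type": "count", "name": "totalRows" },
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "country", "value": "US" },
      "aggregator": { "type": "count", "name": "usRows" }
    }
  ]
}
```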
### Grouping Aggregator
A grouping aggregator can only be used as part of GroupBy queries which have a subtotal spec. It returns a number for
each output row that lets you infer whether a particular dimension is included in the sub-grouping used for that row. You can pass
a *non-empty* list of dimensions to this aggregator which *must* be a subset of dimensions that you are grouping on.
E.g if the aggregator has `["dim1", "dim2"]` as input dimensions and `[["dim1", "dim2"], ["dim1"], ["dim2"], []]` as subtotals,
following can be the possible output of the aggregator
| subtotal used in query | Output | (bits representation) |
|------------------------|--------|-----------------------|
| `["dim1", "dim2"]` | 0 | (00) |
| `["dim1"]` | 1 | (01) |
| `["dim2"]` | 2 | (10) |
| `[]` | 3 | (11) |
As illustrated in above example, output number can be thought of as an unsigned n bit number where n is the number of dimensions passed to the aggregator.
The bit at position X is set in this number to 0 if a dimension at position X in input to aggregators is included in the sub-grouping. Otherwise, this bit
is set to 1.
```json
{ "type" : "grouping", "name" : <output_name>, "groupings" : [<dimension>] }
```
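A sketch of how this might be used in a GroupBy query with a subtotals spec (the dataSource, dimensions, and interval are illustrative):

```json
{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "intervals": ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"],
  "granularity": "all",
  "dimensions": ["dim1", "dim2"],
  "subtotalsSpec": [["dim1", "dim2"], ["dim1"], ["dim2"], []],
  "aggregations": [
    { "type": "count", "name": "rows" },
    { "type": "grouping", "name": "groupingId", "groupings": ["dim1", "dim2"] }
  ]
}
```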
## 聚合(Aggregations)
> [!WARNING]
> Apache Druid支持两种查询语言 [Druid SQL](druidsql.md) 和 [原生查询](makeNativeQueries.md)。该文档描述了原生查询中的一种查询方式。 对于Druid SQL中使用的该种类型的信息可以参考 [SQL文档](druidsql.md)。
聚合可以在摄取时作为摄取规范的一部分提供作为在数据进入Apache Druid之前汇总数据的一种方式。聚合也可以在查询时指定为许多查询的一部分。
可用聚合包括:
### Count聚合器
`count`计算了过滤器匹配到行的总数:
```json
{ "type" : "count", "name" : <output_name> }
```
请注意计数聚合器计算Druid的行数这并不总是反映摄取的原始事件数。这是因为Druid可以配置为在摄取时汇总数据。要计算摄取的数据行数请在摄取时包括`count`聚合器,在查询时包括`longSum`聚合器。
### Sum聚合器
**`longSum`**
计算64位有符号整数的和
```json
{ "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
`name` 为求和后值的输出名
`fieldName` 为需要求和的指标列
**`doubleSum`**
计算64位浮点数的和与`longSum`相似
```json
{ "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`floatSum`**
计算32位浮点数的和与`longSum`和`doubleSum`相似
```json
{ "type" : "floatSum", "name" : <output_name>, "fieldName" : <metric_name> }
```
### Min/Max聚合器
**`doubleMin`**
`doubleMin`计算所有指标值与Double.POSITIVE_INFINITY相比的较小者
```json
{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`doubleMax`**
`doubleMax`计算所有指标值与Double.NEGATIVE_INFINITY相比的较大者
```json
{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`floatMin`**
`floatMin`计算所有指标值与Float.POSITIVE_INFINITY相比的较小者
```json
{ "type" : "floatMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`floatMax`**
`floatMax`计算所有指标值与Float.NEGATIVE_INFINITY相比的较大者
```json
{ "type" : "floatMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`longMin`**
`longMin`计算所有指标值与Long.MAX_VALUE的较小者
```json
{ "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`longMax`**
`longMax`计算所有指标值与Long.MIN_VALUE的较大者
```json
{ "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }
```
**`doubleMean`**
计算并返回列值的算术平均值作为64位浮点值。这只是一个查询时聚合器不应在摄入期间使用。
```json
{ "type" : "doubleMean", "name" : <output_name>, "fieldName" : <metric_name> }
```
### First/Last聚合器
Double/Float/Long的First/Last聚合器不能够使用在摄入规范中只能指定为查询时的一部分。
需要注意在启用了rollup的段上进行带有first/last聚合器查询将返回汇总后的值并不是返回原始数据的最后一个值。
**`doubleFirst`**
`doubleFirst`计算最小时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "doubleFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`doubleLast`**
`doubleLast`计算最大时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "doubleLast",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`floatFirst`**
`floatFirst`计算最小时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "floatFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`floatLast`**
`floatLast`计算最大时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "floatLast",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`longFirst`**
`longFirst`计算最小时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "longFirst",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`longLast`**
`longLast`计算最大时间戳的指标值如果不存在行的话默认为0或者SQL兼容下是`null`
```json
{
"type" : "longLast",
"name" : <output_name>,
"fieldName" : <metric_name>,
}
```
**`stringFirst`**
`stringFirst` 计算最小时间戳的维度值,行不存在的话为`null`
```json
{
"type" : "stringFirst",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024)
}
```
**`stringLast`**
`stringLast` 计算最大时间戳的维度值,行不存在的话为`null`
```json
{
"type" : "stringLast",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024)
}
```
### ANY聚合器
Double/Float/Long/String的ANY聚合器不能够使用在摄入规范中只能指定为查询时的一部分。
返回包括null在内的任何值。此聚合器可以通过返回第一个遇到的值包括null来简化和优化性能
**`doubleAny`**
`doubleAny`返回所有double类型的指标值
```json
{
"type" : "doubleAny",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`floatAny`**
`floatAny`返回所有float类型的指标值
```json
{
"type" : "floatAny",
"name" : <output_name>,
"fieldName" : <metric_name>
}
```
**`longAny`**
`longAny`返回所有long类型的指标值
```json
{
"type" : "longAny",
"name" : <output_name>,
"fieldName" : <metric_name>,
}
```
**`stringAny`**
`stringAny`返回所有string类型的指标值
```json
{
"type" : "stringAny",
"name" : <output_name>,
"fieldName" : <metric_name>,
"maxStringBytes" : <integer> # (optional, defaults to 1024),
}
```
### JavaScript聚合器
计算一组列上的任意JavaScript函数同时允许指标和维度。JavaScript函数应该返回浮点值。
```json
{ "type": "javascript",
"name": "<output_name>",
"fieldNames" : [ <column1>, <column2>, ... ],
"fnAggregate" : "function(current, column1, column2, ...) {
<updates partial aggregate (current) based on the current row values>
return <updated partial aggregate>
}",
"fnCombine" : "function(partialA, partialB) { return <combined partial results>; }",
"fnReset" : "function() { return <initial value>; }"
}
```
实例:
```json
{
"type": "javascript",
"name": "sum(log(x)*y) + 10",
"fieldNames": ["x", "y"],
"fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
"fnCombine" : "function(partialA, partialB) { return partialA + partialB; }",
"fnReset" : "function() { return 10; }"
}
```
> [!WARNING]
> 基于JavaScript的功能默认是禁用的。 如何启用它以及如何使用Druid JavaScript功能参考 [JavaScript编程指南](../development/JavaScript.md)。
### 近似聚合(Approximate Aggregations)
#### 唯一计数(Count distinct)
**Apache DataSketches Theta Sketch**
由 [DataSketches Theta Sketch 扩展](../configuration/core-ext/datasketches-theta.md) 提供的聚合器使用 [Apache DataSketches 库](https://datasketches.apache.org/) 中的 Theta Sketch 给出去重计数distinct count估计并支持集合并集、交集和差集的后置聚合器。
**Apache DataSketches HLL Sketch**
由 [DataSketches HLL Sketch 扩展](../configuration/core-ext/datasketches-hll.md) 提供的聚合器使用 HyperLogLog 算法给出去重计数估计。
与 Theta 草图相比HLL 草图不支持集合set操作更新和合并速度稍慢但需要的空间要少得多。
**Cardinality, hyperUnique**
> [!WARNING]
> 对于新的场景,我们推荐评估使用 [DataSketches Theta Sketch扩展](../configuration/core-ext/datasketches-theta.md) 或 [DataSketches HLL Sketch扩展](../configuration/core-ext/datasketches-hll.md) 来替代。DataSketches 聚合器通常比经典的 Druid `cardinality` 和 `hyperUnique` 聚合器更灵活,并提供更好的精确度。
Cardinality 和 HyperUnique 聚合器是 Druid 中默认提供的较旧的聚合器实现,它们同样使用 HyperLogLog 算法给出去重计数估计。上文所述、由较新的 DataSketches Theta 和 HLL 扩展提供的聚合器具有更高的精度和性能,因此建议改为使用它们。
DataSketches 团队已经发表了一篇关于 Druid 原始 HLL 算法和 DataSketches HLL 算法的 [比较研究](https://datasketches.apache.org/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html)。基于 DataSketches 实现已证明的优势,我们建议优先使用它们,而不是 Druid 最初基于 HLL 的聚合器。但是,为了确保向后兼容性,我们将继续支持经典聚合器。
请注意,`hyperUnique` 聚合器与 DataSketches HLL 或 Theta sketches 互不兼容。
**多列操作(multi-column handling)**
#### 直方图与中位数
### 其他聚合
#### 过滤聚合器

View File

@ -172,7 +172,7 @@ Druid的原生类型系统允许字符串可能有多个值。这些 [多值维
| `ANY_VALUE(expr)` | 返回 `expr` 的任何值包括null。`expr`必须是数字, 此聚合器可以通过返回第一个遇到的值(包括空值)来简化和优化性能 |
| `ANY_VALUE(expr, maxBytesPerString)` | 与 `ANY_VALUE(expr)` 类似但是面向string。`maxBytesPerString` 参数确定每个字符串要分配多少聚合空间, 超过此限制的字符串将被截断。这个参数应该设置得尽可能低,因为高值会导致内存浪费。|
对于近似聚合函数,请查看 [近似聚合文档](Aggregations.md#近似聚合)
对于近似聚合函数,请查看 [近似聚合文档](aggregations.md#近似聚合)
### 扩展函数
#### 数值函数

View File

@ -20,7 +20,7 @@ Filter是一个JSON对象指示查询的计算中应该包含哪些数据行
**注意**
过滤器通常情况下应用于维度列,但是也可以使用在聚合后的指标上,例如,参见 [filtered-aggregator](Aggregations.md#过滤聚合器) 和 [having-filter](having.md)
过滤器通常情况下应用于维度列,但是也可以使用在聚合后的指标上,例如,参见 [filtered-aggregator](aggregations.md#过滤聚合器) 和 [having-filter](having.md)
### **选择过滤器(Selector Filter)**

View File

@ -76,7 +76,7 @@ GroupBy查询对象的示例如下所示:
| having | 参见[Having](having.md) | 否 |
| granularity | 定义查询粒度,参见 [Granularities](granularity.md) | 是 |
| filter | 参见[Filters](filters.md) | 否 |
| aggregations | 参见[Aggregations](Aggregations.md) | 否 |
| aggregations | 参见[Aggregations](aggregations.md) | 否 |
| postAggregations | 参见[Post Aggregations](postaggregation.md) | 否 |
| intervals | ISO-8601格式的时间间隔定义了查询的时间范围 | 是 |
| subtotalsSpec | 一个JSON数组返回顶级维度子集分组的附加结果集。稍后将更详细地[描述它](#关于subtotalSpec)。| 否 |

116
querying/hll-old.md Normal file
View File

@ -0,0 +1,116 @@
## Cardinality aggregator基数聚合
Computes the cardinality of a set of Apache Druid dimensions, using HyperLogLog to estimate the cardinality. Please note that this
aggregator will be much slower than indexing a column with the hyperUnique aggregator. This aggregator also runs over a dimension column, which
means the string dimension cannot be removed from the dataset to improve rollup. In general, we strongly recommend using the hyperUnique aggregator
instead of the cardinality aggregator if you do not care about the individual values of a dimension.
```json
{
"type": "cardinality",
"name": "<output_name>",
"fields": [ <dimension1>, <dimension2>, ... ],
"byRow": <false | true> # (optional, defaults to false),
"round": <false | true> # (optional, defaults to false)
}
```
Each individual element of the "fields" list can be a String or [DimensionSpec](../querying/dimensionspecs.md). A String dimension in the fields list is equivalent to a DefaultDimensionSpec (no transformations).
The HyperLogLog algorithm generates decimal estimates with some error. "round" can be set to true to round off estimated
values to whole numbers. Note that even with rounding, the cardinality is still an estimate. The "round" field only
affects query-time behavior, and is ignored at ingestion-time.
### Cardinality by value
When setting `byRow` to `false` (the default) it computes the cardinality of the set composed of the union of all dimension values for all the given dimensions.
* For a single dimension, this is equivalent to
```sql
SELECT COUNT(DISTINCT(dimension)) FROM <datasource>
```
* For multiple dimensions, this is equivalent to something akin to
```sql
SELECT COUNT(DISTINCT(value)) FROM (
SELECT dim_1 as value FROM <datasource>
UNION
SELECT dim_2 as value FROM <datasource>
UNION
SELECT dim_3 as value FROM <datasource>
)
```
### Cardinality by row
When setting `byRow` to `true` it computes the cardinality by row, i.e. the cardinality of distinct dimension combinations.
This is equivalent to something akin to
```sql
SELECT COUNT(*) FROM ( SELECT DIM1, DIM2, DIM3 FROM <datasource> GROUP BY DIM1, DIM2, DIM3 )
```
**Example**
Determine the number of distinct countries people are living in or have come from.
```json
{
"type": "cardinality",
"name": "distinct_countries",
"fields": [ "country_of_origin", "country_of_residence" ]
}
```
Determine the number of distinct people (i.e. combinations of first and last name).
```json
{
"type": "cardinality",
"name": "distinct_people",
"fields": [ "first_name", "last_name" ],
"byRow" : true
}
```
Determine the number of distinct starting characters of last names
```json
{
"type": "cardinality",
"name": "distinct_last_name_first_char",
"fields": [
{
"type" : "extraction",
"dimension" : "last_name",
"outputName" : "last_name_first_char",
"extractionFn" : { "type" : "substring", "index" : 0, "length" : 1 }
}
],
"byRow" : true
}
```
## HyperUnique aggregator
Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time.
```json
{
"type" : "hyperUnique",
"name" : <output_name>,
"fieldName" : <metric_name>,
"isInputHyperUnique" : false,
"round" : false
}
```
"isInputHyperUnique" can be set to true to index precomputed HLL (Base64 encoded output from druid-hll is expected).
The "isInputHyperUnique" field only affects ingestion-time behavior, and is ignored at query-time.
The HyperLogLog algorithm generates decimal estimates with some error. "round" can be set to true to round off estimated
values to whole numbers. Note that even with rounding, the cardinality is still an estimate. The "round" field only
affects query-time behavior, and is ignored at ingestion-time.
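
As a sketch, a `hyperUnique` metric might be defined at ingestion time and then referenced at query time like this (the metric and column names are illustrative). At ingestion time, in the `metricsSpec`:

```json
{ "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_id" }
```

At query time, aggregating over that metric:

```json
{ "type" : "hyperUnique", "name" : "distinct_users", "fieldName" : "unique_users", "round" : true }
```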

View File

@ -29,7 +29,7 @@ Apache Druid支持多值字符串维度。当输入字段中包括一个数组
#### 过滤(Filtering)
所有的查询类型,包括 [Filtered Aggregator](Aggregations.md#过滤聚合器),都可以在多值维度上进行过滤。 在多值维度上进行使用Filter遵循以下规则
所有的查询类型,包括 [Filtered Aggregator](aggregations.md#过滤聚合器)都可以在多值维度上进行过滤。在多值维度上使用过滤器Filter时遵循以下规则列表后给出了一个示例
* 当多值维度的任何一个值匹配到值过滤器(例如 "selector", "bound" 和 "in"),该行即被匹配上
* 如果维度有重叠,则列比较过滤器会匹配该行
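
例如,假设某行的多值维度 `tags` 的值为 `["t1", "t3", "t5"]`,那么下面这个选择过滤器就会匹配到该行,因为其中任意一个值等于 "t3"

```json
{ "type": "selector", "dimension": "tags", "value": "t3" }
```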

View File

@ -50,9 +50,9 @@ postAggregation : {
### 字段访问后聚合器(Field accessor post-aggregators)
该后聚合器返回由指定的 [聚合器](Aggregations.md) 输出的值。
该后聚合器返回由指定的 [聚合器](aggregations.md) 输出的值。
`fieldName` 引用在查询时 [聚合部分](Aggregations.md) 给定的聚合器的输出名。 对于复杂的聚合器,如 "cardinality" 和 "hyperUnique", 后聚合器的 `type` 决定了后聚合器将返回什么。 使用 `"fieldAccess" type` 将返回原始的聚合对象,或者使用 `"finalizingFieldAccess" type` 返回最终确定的值,例如估计的基数。
`fieldName` 引用在查询时 [聚合部分](aggregations.md) 给定的聚合器的输出名。 对于复杂的聚合器,如 "cardinality" 和 "hyperUnique", 后聚合器的 `type` 决定了后聚合器将返回什么。 使用 `"fieldAccess" type` 将返回原始的聚合对象,或者使用 `"finalizingFieldAccess" type` 返回最终确定的值,例如估计的基数。
```json
{ "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

View File

@ -1,20 +1,7 @@
<!-- toc -->
## Timeseries 查询
> [!WARNING]
> Apache Druid支持两种查询语言 [Druid SQL](druidsql.md) 和 [原生查询](makeNativeQueries.md)。该文档描述了原生查询中的一种查询方式。 对于Druid SQL中使用的该种类型的信息可以参考 [SQL文档](druidsql.md)。
>
> Apache Druid支持两种查询语言 [Druid SQL](sql.md) 和 [原生查询](querying.md)。该文档描述了原生查询中的一种查询方式。 对于Druid SQL中使用的该种类型的信息可以参考 [SQL文档](sql.md#query-types)。
该类型的查询将会得到一个时间序列的查询结果,返回的是一个 JSON 对象数组,数组中的每一个对象表示 Timeseries 查询所查询到的值。
@ -66,12 +53,16 @@
| `intervals` | ISO-8601格式的JSON对象定义了要查询的时间范围 | 是 |
| `granularity` | 定义了查询结果的粒度,参见 [Granularity](granularity.md) | 是 |
| `filter` | 参见 [Filters](filters.md) | 否 |
| `aggregations` | 参见 [聚合](Aggregations.md)| 否 |
| `aggregations` | 参见 [聚合](aggregations.md)| 否 |
| `postAggregations` | 参见[Post Aggregations](postaggregation.md) | 否 |
| `limit` | 限制返回结果数量的整数值默认是unlimited | 否 |
| `context` | 可以被用来修改查询行为,包括 [Grand Total](#grand-total共计) 和 [Zero-filling](#zero-filling0填充)。详情可以看 [上下文参数](query-context.md)部分中的所有参数类型 | 否 |
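
作为参考,下文所描述的查询大致形如下面的草图(数据源与字段名沿用文中的示例名称,过滤器及 post-aggregation 的具体写法仅为示意):

```json
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "filter": { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}
```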
为了将所有数据集中起来,上面的查询将从"sample_datasource"表返回2个数据点在 2012-01-01 和 2012-01-03 期间每天一个。每个数据点将是sample_fieldName1的longSum、sample_fieldName2的doubleSum以及sample_fieldName1除以sample_fieldName2的double结果。输出如下
为了将所有数据集中起来,上面的查询将从 "sample_datasource" 表返回2个数据点在 2012-01-01 和 2012-01-03 期间每天一个。
每个数据点将是 sample_fieldName1 的 longSum、sample_fieldName2 的 doubleSum 以及 sample_fieldName1 除以sample_fieldName2 的 double结果。
输出如下:
```json
[
@ -86,7 +77,7 @@
]
```
### Grand Total(计)
### Grand Total共计
Druid 可以在时间序列查询的结果集中增加一个额外的 "总计"行,通过在上下文中增加 `"grandTotal":true` 来启用该功能,例如:
@ -106,7 +97,9 @@ Druid可以在时间序列查询的结果集中增加一个额外的"总计"行
}
```
总计行将显示为结果数组中的最后一行,并且没有时间戳。即使查询以"降序"模式运行,它也将是最后一行。总计行中的后聚合将基于总计聚合计算。
总计行将显示为结果数组中的最后一行,并且没有时间戳。即使查询以"降序"模式运行,它也将是最后一行。
总计行中的后聚合将基于总计聚合计算。
### Zero-filling(0填充)

View File

@ -91,7 +91,7 @@ TopN的查询对象如下所示
| intervals | ISO-8601格式的时间间隔定义了查询的时间范围 | 是 |
| granularity | 定义查询粒度, 参见 [Granularities](granularity.md) | 是 |
| filter | 参见 [Filters](filters.md) | 否 |
| aggregations | 参见[Aggregations](Aggregations.md) | 对于数值类型的metricSpec aggregations或者postAggregations必须指定否则非必须 |
| aggregations | 参见[Aggregations](aggregations.md) | 对于数值类型的metricSpec aggregations或者postAggregations必须指定否则非必须 |
| postAggregations | 参见[postAggregations](postaggregation.md) | 对于数值类型的metricSpec aggregations或者postAggregations必须指定否则非必须 |
| dimension | 一个string或者json对象用来定义topN查询的维度列详情参见[DimensionSpec](dimensionspec.md) | 是 |
| threshold | 在topN中定义N的一个整型数字例如在top列表中返回多少个结果 | 是 |