Clean up the file formatting in the data-ingestion folder

YuCheng Hu 2021-08-05 18:50:37 -04:00
parent 75fe9df234
commit c568381768
41 changed files with 3956 additions and 40 deletions

View File

@ -52,7 +52,7 @@
* [Ingestion overview](ingestion/ingestion.md)
* [Data formats](ingestion/dataformats.md)
* [Schema design](ingestion/schemadesign.md)
* [Data management](ingestion/datamanage.md)
* [Data management](ingestion/data-management.md)
* [Streaming ingestion](ingestion/kafka.md)
* [Apache Kafka](ingestion/kafka.md)
* [Apache Kinesis](ingestion/kinesis.md)

View File

@ -16,7 +16,7 @@
For Apache Druid Broker configuration, see [Broker configuration](../configuration/human-readable-byte.md#Broker)
### HTTP
For a list of the Broker APIs, see [Broker API](../operations/api.md#Broker)
For a list of the Broker APIs, see [Broker API](../operations/api-reference.md#Broker)
### Overview

View File

@ -16,7 +16,7 @@
For Apache Druid Coordinator process configuration, see [Coordinator configuration](../configuration/human-readable-byte.md#Coordinator)
### HTTP
For the Coordinator APIs, see [Coordinator API](../operations/api.md#Coordinator)
For the Coordinator APIs, see [Coordinator API](../operations/api-reference.md#Coordinator)
### Overview
The Druid Coordinator process is primarily responsible for segment management and distribution. More specifically, the Druid Coordinator communicates with Historical processes to load or drop segments based on configuration. The Druid Coordinator is responsible for loading new segments, dropping outdated segments, managing segment replication, and balancing segment load.
@ -45,7 +45,7 @@ org.apache.druid.cli.Main server coordinator
On each run, the Druid Coordinator compacts segments by merging small segments or splitting a large one. This is useful when your segments are not optimized in terms of segment size, which may degrade query performance. See [Segment size optimization](../operations/segmentSizeOpt.md) for details.
The Coordinator first finds the segments to compact based on the [segment search policy](#段搜索策略). Once some segments are found, it issues a [compaction task](../ingestion/taskrefer.md#compact) to compact them. The maximum number of running compaction tasks is `min(sum of worker capacity * slotRatio, maxSlots)`. Note that even if `min(sum of worker capacity * slotRatio, maxSlots)` = 0, at least one compaction task is always submitted if compaction is enabled for a datasource. See the [Compaction configuration API](../operations/api.md#Coordinator) and [Compaction configuration](../configuration/human-readable-byte.md#Coordinator) to enable compaction.
The Coordinator first finds the segments to compact based on the [segment search policy](#段搜索策略). Once some segments are found, it issues a [compaction task](../ingestion/taskrefer.md#compact) to compact them. The maximum number of running compaction tasks is `min(sum of worker capacity * slotRatio, maxSlots)`. Note that even if `min(sum of worker capacity * slotRatio, maxSlots)` = 0, at least one compaction task is always submitted if compaction is enabled for a datasource. See the [Compaction configuration API](../operations/api-reference.md#Coordinator) and [Compaction configuration](../configuration/human-readable-byte.md#Coordinator) to enable compaction.
Compaction tasks may fail for the following reasons:

View File

@ -16,7 +16,7 @@
For Apache Druid Historical configuration, see [Historical configuration](../configuration/human-readable-byte.md#Historical)
### HTTP
For a list of the Historical APIs, see [Historical API](../operations/api.md#Historical)
For a list of the Historical APIs, see [Historical API](../operations/api-reference.md#Historical)
### Running
```json

View File

@ -24,7 +24,7 @@ The Apache Druid Indexer process is an alternative to the MiddleManager + Peon task execution system
For Apache Druid Indexer process configuration, see [Indexer configuration](../configuration/human-readable-byte.md#Indexer)
### HTTP
The Indexer process shares APIs with the [MiddleManager](../operations/api.md#MiddleManager)
The Indexer process shares APIs with the [MiddleManager](../operations/api-reference.md#MiddleManager)
### Running
```json

View File

@ -16,7 +16,7 @@
For Apache Druid MiddleManager configuration, see [Indexing service configuration](../configuration/human-readable-byte.md#MiddleManager)
### HTTP
For the MiddleManager APIs, see [MiddleManager API](../operations/api.md#MiddleManager)
For the MiddleManager APIs, see [MiddleManager API](../operations/api-reference.md#MiddleManager)
### Overview
The MiddleManager process is a worker process that executes submitted tasks. MiddleManagers forward tasks to Peons that run in separate JVMs. The reason for having a separate JVM per task is resource and log isolation. Each Peon is capable of running only one task at a time, but a MiddleManager may have multiple Peons.

View File

@ -16,7 +16,7 @@
For Apache Druid Overlord process configuration, see [Overlord configuration](../configuration/human-readable-byte.md#Overlord)
### HTTP
For the Overlord APIs, see [Overlord API](../operations/api.md#Overlord)
For the Overlord APIs, see [Overlord API](../operations/api-reference.md#Overlord)
### Overview
The Overlord process is responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers. The Overlord can be configured to run in local mode or remote mode (local by default). In local mode, the Overlord also creates the Peons that execute tasks; when running the Overlord in local mode, all MiddleManager and Peon configurations must be provided as well. Local mode is typically used for simple workflows. In remote mode, the Overlord and MiddleManagers run in separate processes, and you can run each on a different server. This mode is recommended if you intend to use the indexing service as the single endpoint for all Druid indexing.

View File

@ -16,7 +16,7 @@
For Apache Druid Peon configuration, see [Peon query configuration](../configuration/human-readable-byte.md) and [Additional Peon configuration](../configuration/human-readable-byte.md)
### HTTP
For the Peon APIs, see [Peon API](../operations/api.md#Peon)
For the Peon APIs, see [Peon API](../operations/api-reference.md#Peon)
Peons run a single task in a single JVM. The MiddleManager is responsible for creating the Peons that run tasks. Peons should rarely (if ever, except for testing purposes) be run on their own.

View File

@ -28,7 +28,7 @@ The Apache Druid Router is used to route queries to different Brokers. By default, the B…
### HTTP
For a list of the Router APIs, see [Router API](../operations/api.md#Router)
For a list of the Router APIs, see [Router API](../operations/api-reference.md#Router)
### Running

View File

@ -0,0 +1,123 @@
---
id: data-management
title: "Data management"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Within the context of this topic, data management refers to Apache Druid's data maintenance capabilities for existing datasources. There are several options to help you keep your data relevant and to help your Druid cluster remain performant, for example updating, reingesting, adding lookups, reindexing, or deleting data.
In addition to the tasks covered on this page, you can also use segment compaction to improve the layout of your existing data. Refer to [Segment optimization](../operations/segment-optimization.md) to see if compaction will help in your environment. For an overview and steps to configure manual compaction tasks, see [Compaction](./compaction.md).
## Adding new data to existing datasources
Druid can insert new data to an existing datasource by appending new segments to existing segment sets. It can also add new data by merging an existing set of segments with new data and overwriting the original set.
Druid does not support single-record updates by primary key.
<a name="update"></a>
## Updating existing data
Once you ingest some data in a dataSource for an interval and create Apache Druid segments, you might want to make changes to
the ingested data. There are several ways this can be done.
### Using lookups
If you have a dimension where values need to be updated frequently, try first using [lookups](../querying/lookups.md). A
classic use case of lookups is when you have an ID dimension stored in a Druid segment, and want to map the ID dimension to a
human-readable String value that may need to be updated periodically.
### Reingesting data
If lookup-based techniques are not sufficient, you will need to reingest data into Druid for the time chunks that you
want to update. This can be done using one of the [batch ingestion methods](index.md#batch) in overwrite mode (the
default mode). It can also be done using [streaming ingestion](index.md#streaming), provided you drop data for the
relevant time chunks first.
If you do the reingestion in batch mode, Druid's atomic update mechanism means that queries will flip seamlessly from
the old data to the new data.
We recommend keeping a copy of your raw data around in case you ever need to reingest it.
### With Hadoop-based ingestion
This section assumes you understand how to do batch ingestion using Hadoop. See
[Hadoop batch ingestion](./hadoop.md) for more information. Hadoop batch-ingestion can be used for reindexing and delta ingestion.
Druid uses an `inputSpec` in the `ioConfig` to know where the data to be ingested is located and how to read it.
For simple Hadoop batch ingestion, `static` or `granularity` spec types allow you to read data stored in deep storage.
There are other types of `inputSpec` to enable reindexing and delta ingestion.
### Reindexing with Native Batch Ingestion
This section assumes you understand how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md). Native batch indexing uses an `inputSource` to know where and how to read the input data. You can use the [`DruidInputSource`](native-batch.md#druid-input-source) to read data from segments inside Druid. You can use Parallel task (`index_parallel`) for all native batch reindexing tasks. Increase the `maxNumConcurrentSubTasks` to accommodate the amount of data you are reindexing. See [Capacity planning](native-batch.md#capacity-planning).
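To make this concrete, here is a minimal, hypothetical sketch of such a reindexing task: it reads existing segments of an assumed `wikipedia` datasource with the `DruidInputSource` and writes them to a new datasource. The datasource names, interval, Overlord address, and tuning values are placeholders, and a real spec would also carry your full `dimensionsSpec` and `metricsSpec`.

```bash
# Hypothetical reindexing sketch: datasource names, interval, and the Overlord
# address (localhost:8090) are assumptions, not values from this document.
cat > reindex-spec.json <<'EOF'
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "wikipedia",
        "interval": "2016-06-27/2016-06-28"
      }
    },
    "dataSchema": {
      "dataSource": "wikipedia_reindexed",
      "timestampSpec": {"column": "__time", "format": "millis"},
      "dimensionsSpec": {},
      "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "none"}
    },
    "tuningConfig": {"type": "index_parallel", "maxNumConcurrentSubTasks": 4}
  }
}
EOF

# Submit the spec to the Overlord task endpoint.
curl -X POST -H 'Content-Type: application/json' -d @reindex-spec.json \
  http://localhost:8090/druid/indexer/v1/task
```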
<a name="delete"></a>
## Deleting data
Druid supports permanent deletion of segments that are in an "unused" state (see the
[Segment lifecycle](../design/architecture.md#segment-lifecycle) section of the Architecture page).
The Kill Task deletes unused segments within a specified interval from metadata storage and deep storage.
For more information, please see [Kill Task](../ingestion/tasks.md#kill).
Permanent deletion of a segment in Apache Druid has two steps:
1. The segment must first be marked as "unused". This occurs when a segment is dropped by retention rules, and when a user manually disables a segment through the Coordinator API.
2. After segments have been marked as "unused", a Kill Task will delete any "unused" segments from Druid's metadata store as well as deep storage.
For documentation on retention rules, please see [Data Retention](../operations/rule-configuration.md).
For documentation on disabling segments using the Coordinator API, please see the
[Coordinator Datasources API](../operations/api-reference.md#coordinator-datasources) reference.
A data deletion tutorial is available at [Tutorial: Deleting data](../tutorials/tutorial-delete-data.md)
## Kill Task
Kill tasks delete all information about a segment and remove it from deep storage. Segments to kill must be unused (used==0) in the Druid segment table. The available grammar is:
```json
{
"type": "kill",
"id": <task_id>,
"dataSource": <task_datasource>,
"interval" : <all_segments_in_this_interval_will_die!>,
"context": <task context>
}
```
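As an illustration, the sketch below fills the grammar in with hypothetical values (a `wikipedia` datasource, a one-day interval, and an Overlord assumed at localhost:8090) and submits it through the Overlord task endpoint; `id` and `context` are optional and omitted here.

```bash
# Hypothetical kill task: datasource, interval, and Overlord address are assumptions.
cat > kill-task.json <<'EOF'
{
  "type": "kill",
  "dataSource": "wikipedia",
  "interval": "2016-06-27/2016-06-28"
}
EOF

curl -X POST -H 'Content-Type: application/json' -d @kill-task.json \
  http://localhost:8090/druid/indexer/v1/task
```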
## Retention
Druid supports retention rules, which are used to define intervals of time where data should be preserved, and intervals where data should be discarded.
Druid also supports separating Historical processes into tiers, and the retention rules can be configured to assign data for specific intervals to specific tiers.
These features are useful for performance/cost management; a common use case is separating Historical processes into a "hot" tier and a "cold" tier.
For more information, please see [Load rules](../operations/rule-configuration.md).
## Learn more
See the following topics for more information:
- [Compaction](./compaction.md) for an overview and steps to configure manual compaction tasks.
- [Segments](../design/segments.md) for information on how Druid handles segment versioning.

View File

@ -171,7 +171,7 @@ Druid rejects events outside of the time window. To confirm whether events were rejected…
You can use the [DruidInputSource](native.md#Druid输入源) with a [Parallel task](native.md#并行任务) to ingest existing Druid segments using a new schema and change the name, dimensions, metrics, rollup, etc. of the segments. See [DruidInputSource](native.md#Druid输入源) for more details. Or, if you use Hadoop-based ingestion, you can use the "dataSource" input spec to reindex.
See the [Updating existing data](datamanage.md#更新现有的数据) section of the [Data management](datamanage.md) page for more details.
See the [Updating existing data](data-management.md#更新现有的数据) section of the [Data management](data-management.md) page for more details.
### How to change the segment granularity of existing data in Druid
@ -179,7 +179,7 @@ Druid rejects events outside of the time window. To confirm whether events were rejected…
To do this, use the [DruidInputSource](native.md#Druid输入源) and run a [Parallel task](native.md#并行任务). The [DruidInputSource](native.md#Druid输入源) lets you take existing segments from Druid, aggregate them, and feed them back into Druid. It also lets you filter the data in those segments while feeding it back in, which means that rows you want to delete can be filtered out during reingestion. Typically, the above operation runs as a batch job, i.e., feeding in one chunk of data per day and aggregating it. Or, if you use Hadoop-based ingestion, you can use the "dataSource" input spec to reindex.
See the [Updating existing data](datamanage.md#更新现有的数据) section of the [Data management](datamanage.md) page for more details.
See the [Updating existing data](data-management.md#更新现有的数据) section of the [Data management](data-management.md) page for more details.
### Real-time ingestion seems to be stuck

View File

@ -1122,7 +1122,7 @@ you have to take caution to not override segments created by real-time processin
Apache Druid currently supports Apache Hadoop-based batch indexing tasks, which are submitted to a running instance of the [Druid Overlord](../design/Overlord.md). See [Hadoop-based ingestion vs. native batch ingestion](ingestion.md#批量摄取) for a comparison of Hadoop-based ingestion, native simple batch ingestion, and native parallel ingestion.
To run a Hadoop-based batch ingestion task, write an ingestion spec as shown below, then submit it to the Overlord's [`druid/indexer/v1/task`](../operations/api.md#overlord) endpoint, or use the `bin/post-index-task` script included with the Druid package.
To run a Hadoop-based batch ingestion task, write an ingestion spec as shown below, then submit it to the Overlord's [`druid/indexer/v1/task`](../operations/api-reference.md#overlord) endpoint, or use the `bin/post-index-task` script included with the Druid package.
### Tutorial

View File

@ -775,7 +775,7 @@ All data in Druid is organized into *segments*, which are data files, typically…
#### Datasources
Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid offers a unique data modeling system that bears similarity to both relational and timeseries models.
#### Primary timestamp column
Druid schemas must always include a primary timestamp. The primary timestamp is used to [partition and sort](#分区) your data. Druid queries are able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column. Druid can also use the primary timestamp column for time-based [data management operations](datamanage.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
Druid schemas must always include a primary timestamp. The primary timestamp is used to [partition and sort](#分区) your data. Druid queries are able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column. Druid can also use the primary timestamp column for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
The primary timestamp is parsed based on the [`timestampSpec`](#timestampSpec). In addition, the [`granularitySpec`](#granularitySpec) controls other important operations that are based on the primary timestamp. Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time` in your Druid datasource.
@ -824,7 +824,7 @@ SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
* Use [sketches](schemadesign.md#Sketches高基维处理) to avoid storing high-cardinality dimensions, which harm rollup ratios
* Adjust your `queryGranularity` at ingestion time (e.g., using `PT5M` instead of `PT1M`), which increases the likelihood of two rows in Druid having matching timestamps and can improve your rollup ratios
* It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full" datasource that has rollup disabled (or enabled, but with a minimal rollup ratio), as well as an "abbreviated" datasource that has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using that datasource leads to much faster query times. This can often be done with only a small increase in storage footprint, since abbreviated datasources tend to be much smaller.
* If you use a [best-effort rollup](#) ingestion configuration that does not guarantee [perfect rollup](#), you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by [reindexing](./datamanage.md#压缩与重新索引) your data in the background after initial ingestion.
* If you use a [best-effort rollup](#) ingestion configuration that does not guarantee [perfect rollup](#), you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by [reindexing](./data-management.md#压缩与重新索引) your data in the background after initial ingestion.
#### Perfect rollup vs best-effort rollup
Some Druid ingestion methods guarantee *perfect rollup*, meaning that input data is perfectly aggregated at ingestion time. Others offer *best-effort rollup*, meaning that input data might not be perfectly aggregated and thus there could be multiple segments holding rows with the same timestamp and dimension values.
@ -857,7 +857,7 @@ Druid datasources are always partitioned by time into *time chunks*, and each time chunk contains one…
> Note, however, that Druid currently always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`, which can prevent dimension sorting from being maximally effective. If necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your `granularitySpec`, which sets all timestamps within a segment to the same value and saves the "true" timestamps as a [secondary timestamp](./schemadesign.md#辅助时间戳). This limitation may be removed in a future version of Druid.
#### How to set up partitioning
Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. As of the current Druid version, if you do your initial ingestion with a less-flexible method (like Kafka), you can use [reindexing](./datamanage.md#压缩与重新索引) to repartition your data after it is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain threshold is optimally partitioned, even as you continuously add new data from a stream.
Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. As of the current Druid version, if you do your initial ingestion with a less-flexible method (like Kafka), you can use [reindexing](./data-management.md#压缩与重新索引) to repartition your data after it is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain threshold is optimally partitioned, even as you continuously add new data from a stream.
The following table shows how each ingestion method handles partitioning:
@ -865,8 +865,8 @@ Druid datasources are always partitioned by time into *time chunks*, and each time chunk contains one…
| - | - |
| [Native batch](native.md) | Configured using [`partitionsSpec`](./native.md#partitionsSpec) inside the `tuningConfig` |
| [Hadoop batch](hadoop.md) | Configured using [`partitionsSpec`](./native.md#partitionsSpec) inside the `tuningConfig` |
| [Kafka indexing service](kafka.md) | Partitioning in Druid is guided by how your Kafka topic is partitioned. You can [reindex](./datamanage.md#压缩与重新索引) to repartition after initial ingestion |
| [Kinesis indexing service](kinesis.md) | Partitioning in Druid is guided by how your Kinesis stream is sharded. You can [reindex](./datamanage.md#压缩与重新索引) to repartition after initial ingestion |
| [Kafka indexing service](kafka.md) | Partitioning in Druid is guided by how your Kafka topic is partitioned. You can [reindex](./data-management.md#压缩与重新索引) to repartition after initial ingestion |
| [Kinesis indexing service](kinesis.md) | Partitioning in Druid is guided by how your Kinesis stream is sharded. You can [reindex](./data-management.md#压缩与重新索引) to repartition after initial ingestion |
> [!WARNING]
>

View File

@ -108,7 +108,7 @@ curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http:/
| `indexSpecForIntermediatePersists` | | Defines segment storage format options to use at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce the memory required for final merging. However, disabling compression on intermediate segments might increase page cache usage while they are used before getting merged into the final published segment. See IndexSpec for possible values. | No (default is the same as `indexSpec`) |
| `reportParseExceptions` | Boolean | *Deprecated*. If true, exceptions encountered during parsing cause ingestion to halt; if false, unparseable rows and fields are skipped. Setting `reportParseExceptions` to `true` overrides existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to `0` and limiting `maxSavedParseExceptions` to no more than 1. | No (default == false) |
| `handoffConditionTimeout` | Long | Number of milliseconds to wait for segment handoff (persist) before timing out. This value must be greater than or equal to 0, where 0 means to wait forever. | No (default == 0) |
| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kafka messages that are no longer available, for example when an `OffsetOutOfRangeException` occurs. <br> If false, the exception bubbles up, causing tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../operations/api.md#Supervisor). This mode is useful for production, since it makes you aware of issues with ingestion. <br> If true, Druid automatically resets to the earliest or latest offset available in Kafka, based on the value of the `useEarliestOffset` property (`earliest` if `true`, `latest` if `false`). Note that this can lead to data being *dropped* (if `useEarliestOffset` is `false`) or *duplicated* (if `useEarliestOffset` is `true`) without your knowledge. Messages are logged indicating that a reset has occurred, but ingestion continues. This mode is useful for non-production situations, since it makes Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data. <br> This feature behaves similarly to the Kafka `auto.offset.reset` consumer property. | No (default == false) |
| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kafka messages that are no longer available, for example when an `OffsetOutOfRangeException` occurs. <br> If false, the exception bubbles up, causing tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../operations/api-reference.md#Supervisor). This mode is useful for production, since it makes you aware of issues with ingestion. <br> If true, Druid automatically resets to the earliest or latest offset available in Kafka, based on the value of the `useEarliestOffset` property (`earliest` if `true`, `latest` if `false`). Note that this can lead to data being *dropped* (if `useEarliestOffset` is `false`) or *duplicated* (if `useEarliestOffset` is `true`) without your knowledge. Messages are logged indicating that a reset has occurred, but ingestion continues. This mode is useful for non-production situations, since it makes Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data. <br> This feature behaves similarly to the Kafka `auto.offset.reset` consumer property. | No (default == false) |
| `workerThreads` | Integer | The number of threads that the supervisor uses for asynchronous operations. | No (default == min(10, taskCount)) |
| `chatThreads` | Integer | The number of threads used for communicating with indexing tasks. | No (default == min(10, taskCount * replicas)) |
| `chatRetries` | Integer | The number of times HTTP requests to indexing tasks are retried before the task is considered unresponsive. | No (default == 8) |
@ -177,7 +177,7 @@ The Kafka indexing service supports both [`inputFormat`](dataformats.md#inputformat) and…
### Operations
This section describes how some supervisor APIs work specifically in the Kafka indexing service. For all supervisor APIs, see [Supervisor APIs](../operations/api.md#Supervisor).
This section describes how some supervisor APIs work specifically in the Kafka indexing service. For all supervisor APIs, see [Supervisor APIs](../operations/api-reference.md#Supervisor).
#### Getting supervisor status report

View File

@ -405,13 +405,13 @@ See the documentation on [deleting data](../ingestion/data-management.md#delete)
Tasks do all [ingestion](ingestion.md)-related work in Druid.
For batch ingestion, you will generally submit tasks directly to Druid using the [Task APIs](../operations/api.md#Overlord). For streaming ingestion, tasks are generally submitted to a supervisor.
For batch ingestion, you will generally submit tasks directly to Druid using the [Task APIs](../operations/api-reference.md#Overlord). For streaming ingestion, tasks are generally submitted to a supervisor.
### Task APIs
Task APIs are mainly available in two places:
* The [Overlord](../design/Overlord.md) process offers HTTP APIs to submit tasks, cancel tasks, check their status, and review logs and reports. See the [Task API documentation](../operations/api.md) for a full list.
* The [Overlord](../design/Overlord.md) process offers HTTP APIs to submit tasks, cancel tasks, check their status, and review logs and reports. See the [Task API documentation](../operations/api-reference.md) for a full list.
* Druid SQL includes a [`sys.tasks`](../querying/druidsql.md#系统Schema) table that provides information about currently running tasks. This table is read-only and offers a limited (but useful) subset of the full information available through the Overlord APIs.
### Task reports
@ -710,11 +710,11 @@ http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparse
#### `compact`
Compaction tasks merge all segments of the given interval. See the documentation on [compaction](datamanage.md#压缩与重新索引) for details.
Compaction tasks merge all segments of the given interval. See the documentation on [compaction](data-management.md#压缩与重新索引) for details.
#### `kill`
Kill tasks delete all metadata about certain segments and remove them from deep storage. See the documentation on [deleting data](datamanage.md#删除数据) for details.
Kill tasks delete all metadata about certain segments and remove them from deep storage. See the documentation on [deleting data](data-management.md#删除数据) for details.
#### `append`

operations/api-reference.md (new file, 944 lines)
View File

@ -0,0 +1,944 @@
---
id: api-reference
title: "API reference"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
This page documents all of the API endpoints for each Druid service type.
## Common
The following endpoints are supported by all processes.
### Process information
#### GET
* `/status`
Returns the Druid version, loaded extensions, memory used, total memory and other useful information about the process.
* `/status/health`
An endpoint that always returns a boolean "true" value with a 200 OK response, useful for automated health checks.
* `/status/properties`
Returns the current configuration properties of the process.
* `/status/selfDiscovered/status`
Returns a JSON map of the form `{"selfDiscovered": true/false}`, indicating whether the node has received a confirmation
from the central node discovery mechanism (currently ZooKeeper) of the Druid cluster that the node has been added to the
cluster. It is recommended to not consider a Druid node "healthy" or "ready" in automated deployment/container
management systems until it returns `{"selfDiscovered": true}` from this endpoint. This is because a node may be
isolated from the rest of the cluster due to network issues and it doesn't make sense to consider nodes "healthy" in
this case. Also, when nodes such as Brokers use ZooKeeper segment discovery for building their view of the Druid cluster
(as opposed to HTTP segment discovery), they may be unusable until the ZooKeeper client is fully initialized and starts
to receive data from the ZooKeeper cluster. `{"selfDiscovered": true}` is a proxy event indicating that the ZooKeeper
client on the node has started to receive data from the ZooKeeper cluster and it's expected that all segments and other
nodes will be discovered by this node timely from this point.
* `/status/selfDiscovered`
Similar to `/status/selfDiscovered/status`, but returns 200 OK response with empty body if the node has discovered itself
and 503 SERVICE UNAVAILABLE if the node hasn't discovered itself yet. This endpoint might be useful because some
monitoring checks such as AWS load balancer health checks are not able to look at the response body.
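For example, a quick check of these endpoints against a single process might look like the sketch below; the host and port (a Broker assumed at localhost:8082) are placeholders, since every Druid process serves the same paths.

```bash
# Assumed host/port; any Druid process exposes these common endpoints.
curl http://localhost:8082/status                        # version, extensions, memory usage
curl http://localhost:8082/status/health                 # returns "true" with 200 OK
curl http://localhost:8082/status/selfDiscovered/status  # {"selfDiscovered": true/false}
```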
## Master Server
This section documents the API endpoints for the processes that reside on Master servers (Coordinators and Overlords)
in the suggested [three-server configuration](../design/processes.md#server-types).
### Coordinator
#### Leadership
##### GET
* `/druid/coordinator/v1/leader`
Returns the current leader Coordinator of the cluster.
* `/druid/coordinator/v1/isLeader`
Returns a JSON object with field "leader", either true or false, indicating if this server is the current leader
Coordinator of the cluster. In addition, returns HTTP 200 if the server is the current leader and HTTP 404 if not.
This is suitable for use as a load balancer status check if you only want the active leader to be considered in-service
at the load balancer.
<a name="coordinator-segment-loading"></a>
#### Segment Loading
##### GET
* `/druid/coordinator/v1/loadstatus`
Returns the percentage of segments actually loaded in the cluster versus segments that should be loaded in the cluster.
* `/druid/coordinator/v1/loadstatus?simple`
Returns the number of segments left to load until segments that should be loaded in the cluster are available for queries. This does not include segment replication counts.
* `/druid/coordinator/v1/loadstatus?full`
Returns the number of segments left to load in each tier until segments that should be loaded in the cluster are all available. This includes segment replication counts.
* `/druid/coordinator/v1/loadstatus?full&computeUsingClusterView`
Returns the number of segments not yet loaded for each tier until all segments loading in the cluster are available.
The result includes segment replication counts. It also factors in the number of available nodes that are of a service type that can load the segment when computing the number of segments remaining to load.
A segment is considered fully loaded when:
- Druid has replicated it the number of times configured in the corresponding load rule.
- Or the number of replicas for the segment in each tier where it is configured to be replicated equals the available nodes of a service type that are currently allowed to load the segment in the tier.
* `/druid/coordinator/v1/loadqueue`
Returns the ids of segments to load and drop for each Historical process.
* `/druid/coordinator/v1/loadqueue?simple`
Returns the number of segments to load and drop, as well as the total segment load and drop size in bytes for each Historical process.
* `/druid/coordinator/v1/loadqueue?full`
Returns the serialized JSON of segments to load and drop for each Historical process.
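A minimal sketch of polling these segment-loading endpoints, assuming a Coordinator at localhost:8081:

```bash
# Assumed Coordinator address; these are the load-status endpoints described above.
curl "http://localhost:8081/druid/coordinator/v1/loadstatus"         # percent loaded per datasource
curl "http://localhost:8081/druid/coordinator/v1/loadstatus?simple"  # segments left to load
curl "http://localhost:8081/druid/coordinator/v1/loadqueue?simple"   # load/drop queue per Historical
```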
#### Segment Loading by Datasource
Note that all _interval_ query parameters are ISO 8601 strings (e.g., 2016-06-27/2016-06-28).
Also note that these APIs only guarantee that the segments are available at the time of the call.
Segments can still become missing because of historical process failures or any other reasons afterward.
##### GET
* `/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus?forceMetadataRefresh={boolean}&interval={myInterval}`
Returns the percentage of segments actually loaded in the cluster versus segments that should be loaded in the cluster for the given
datasource over the given interval (or last 2 weeks if interval is not given). `forceMetadataRefresh` is required to be set.
Setting `forceMetadataRefresh` to true will force the coordinator to poll latest segment metadata from the metadata store
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms
of the load on the metadata store but can be necessary to make sure that we verify all the latest segments' load status)
Setting `forceMetadataRefresh` to false will use the metadata cached on the coordinator from the last force/periodic refresh.
If no used segments are found for the given inputs, this API returns `204 No Content`
* `/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus?simple&forceMetadataRefresh={boolean}&interval={myInterval}`
Returns the number of segments left to load until segments that should be loaded in the cluster are available for the given datasource
over the given interval (or last 2 weeks if interval is not given). This does not include segment replication counts. `forceMetadataRefresh` is required to be set.
Setting `forceMetadataRefresh` to true will force the coordinator to poll latest segment metadata from the metadata store
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms
of the load on the metadata store but can be necessary to make sure that we verify all the latest segments' load status)
Setting `forceMetadataRefresh` to false will use the metadata cached on the coordinator from the last force/periodic refresh.
If no used segments are found for the given inputs, this API returns `204 No Content`
* `/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus?full&forceMetadataRefresh={boolean}&interval={myInterval}`
Returns the number of segments left to load in each tier until segments that should be loaded in the cluster are all available for the given datasource
over the given interval (or last 2 weeks if interval is not given). This includes segment replication counts. `forceMetadataRefresh` is required to be set.
Setting `forceMetadataRefresh` to true will force the coordinator to poll latest segment metadata from the metadata store
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms
of the load on the metadata store but can be necessary to make sure that we verify all the latest segments' load status)
Setting `forceMetadataRefresh` to false will use the metadata cached on the coordinator from the last force/periodic refresh.
You can pass the optional query parameter `computeUsingClusterView` to factor in the available cluster services when calculating
the segments left to load. See [Coordinator Segment Loading](#coordinator-segment-loading) for details.
If no used segments are found for the given inputs, this API returns `204 No Content`
#### Metadata store information
##### GET
* `/druid/coordinator/v1/metadata/segments`
Returns a list of all segments for each datasource enabled in the cluster.
* `/druid/coordinator/v1/metadata/segments?datasources={dataSourceName1}&datasources={dataSourceName2}`
Returns a list of all segments for one or more specific datasources enabled in the cluster.
* `/druid/coordinator/v1/metadata/segments?includeOvershadowedStatus`
Returns a list of all segments for each datasource with the full segment metadata and an extra field `overshadowed`.
* `/druid/coordinator/v1/metadata/segments?includeOvershadowedStatus&datasources={dataSourceName1}&datasources={dataSourceName2}`
Returns a list of all segments for one or more specific datasources with the full segment metadata and an extra field `overshadowed`.
* `/druid/coordinator/v1/metadata/datasources`
Returns a list of the names of datasources with at least one used segment in the cluster.
* `/druid/coordinator/v1/metadata/datasources?includeUnused`
Returns a list of the names of datasources, regardless of whether there are used segments belonging to those datasources in the cluster or not.
* `/druid/coordinator/v1/metadata/datasources?includeDisabled`
Returns a list of the names of datasources, regardless of whether the datasource is disabled or not.
* `/druid/coordinator/v1/metadata/datasources?full`
Returns a list of all datasources with at least one used segment in the cluster. Returns all metadata about those datasources as stored in the metadata store.
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}`
Returns full metadata for a datasource as stored in the metadata store.
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments`
Returns a list of all segments for a datasource as stored in the metadata store.
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full`
Returns a list of all segments for a datasource with the full segment metadata as stored in the metadata store.
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments/{segmentId}`
Returns full segment metadata for a specific segment as stored in the metadata store, if the segment is used. If the
segment is unused, or is unknown, a 404 response is returned.
##### POST
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments`
Returns a list of all segments, overlapping with any of given intervals, for a datasource as stored in the metadata store. Request body is array of string ISO 8601 intervals like [interval1, interval2,...] for example ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full`
Returns a list of all segments, overlapping with any of given intervals, for a datasource with the full segment metadata as stored in the metadata store. Request body is array of string ISO 8601 intervals like [interval1, interval2,...] for example ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]
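For example, the POST form can be exercised as in the following sketch, which lists used segments of an assumed `wikipedia` datasource overlapping two intervals (Coordinator address assumed):

```bash
# Assumed datasource and Coordinator address; the body is an array of ISO 8601 intervals.
curl -X POST -H 'Content-Type: application/json' \
  -d '["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]' \
  "http://localhost:8081/druid/coordinator/v1/metadata/datasources/wikipedia/segments?full"
```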
<a name="coordinator-datasources"></a>
#### Datasources
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
(e.g., 2016-06-27_2016-06-28).
##### GET
* `/druid/coordinator/v1/datasources`
Returns a list of datasource names found in the cluster.
* `/druid/coordinator/v1/datasources?simple`
Returns a list of JSON objects containing the name and properties of datasources found in the cluster. Properties include segment count, total segment byte size, replicated total segment byte size, minTime, and maxTime.
* `/druid/coordinator/v1/datasources?full`
Returns a list of datasource names found in the cluster with all metadata about those datasources.
* `/druid/coordinator/v1/datasources/{dataSourceName}`
Returns a JSON object containing the name and properties of a datasource. Properties include segment count, total segment byte size, replicated total segment byte size, minTime, and maxTime.
* `/druid/coordinator/v1/datasources/{dataSourceName}?full`
Returns full metadata for a datasource.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals`
Returns a set of segment intervals.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals?simple`
Returns a map of an interval to a JSON object containing the total byte size of segments and number of segments for that interval.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals?full`
Returns a map of an interval to a map of segment metadata to a set of server names that contain the segment for that interval.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}`
Returns a set of segment ids for an interval.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?simple`
Returns a map of segment intervals contained within the specified interval to a JSON object containing the total byte size of segments and number of segments for an interval.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?full`
Returns a map of segment intervals contained within the specified interval to a map of segment metadata to a set of server names that contain the segment for an interval.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}/serverview`
Returns a map of segment intervals contained within the specified interval to information about the servers that contain the segment for an interval.
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments`
Returns a list of all segments for a datasource in the cluster.
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments?full`
Returns a list of all segments for a datasource in the cluster with the full segment metadata.
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}`
Returns full segment metadata for a specific segment in the cluster.
* `/druid/coordinator/v1/datasources/{dataSourceName}/tiers`
Return the tiers that a datasource exists in.
#### Note for coordinator's POST and DELETE API's
The segments would be enabled when these APIs are called, but then can be disabled again by the coordinator if any dropRule matches. Segments enabled by these APIs might not be loaded by historical processes if no loadRule matches. If an indexing or kill task runs at the same time as these APIs are invoked, the behavior is undefined. Some segments might be killed and others might be enabled. It's also possible that all segments might be disabled, but at the same time the indexing task is able to read data from those segments and succeed.
Caution: Avoid using indexing or kill tasks and these APIs at the same time for the same datasource and time chunk. (It's fine if the time chunks or datasources don't overlap.)
##### POST
* `/druid/coordinator/v1/datasources/{dataSourceName}`
Marks as used all segments belonging to a datasource. Returns a JSON object of the form
`{"numChangedSegments": <number>}` with the number of segments in the database whose state has been changed (that is,
the segments were marked as used) as the result of this API call.
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}`
Marks as used a segment of a datasource. Returns a JSON object of the form `{"segmentStateChanged": <boolean>}` with
the boolean indicating if the state of the segment has been changed (that is, the segment was marked as used) as the
result of this API call.
* `/druid/coordinator/v1/datasources/{dataSourceName}/markUsed`
* `/druid/coordinator/v1/datasources/{dataSourceName}/markUnused`
Marks segments (un)used for a datasource by interval or set of segment Ids.
When marking used only segments that are not overshadowed will be updated.
The request payload contains the interval or set of segment Ids to be marked unused.
Either interval or segment ids should be provided, if both or none are provided in the payload, the API would throw an error (400 BAD REQUEST).
Interval specifies the start and end times as ISO 8601 strings. `interval=(start/end)` where start and end both are inclusive and only the segments completely contained within the specified interval will be disabled; partially overlapping segments will not be affected.
JSON Request Payload:
|Key|Description|Example|
|----------|-------------|---------|
|`interval`|The interval for which to mark segments unused|"2015-09-12T03:00:00.000Z/2015-09-12T05:00:00.000Z"|
|`segmentIds`|Set of segment Ids to be marked unused|["segmentId1", "segmentId2"]|
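For instance, a sketch of marking one interval of an assumed `wikipedia` datasource unused (Coordinator address assumed):

```bash
# Assumed datasource and Coordinator address; payload uses the `interval` key from the table above.
curl -X POST -H 'Content-Type: application/json' \
  -d '{"interval": "2015-09-12T03:00:00.000Z/2015-09-12T05:00:00.000Z"}' \
  http://localhost:8081/druid/coordinator/v1/datasources/wikipedia/markUnused
```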
##### DELETE
* `/druid/coordinator/v1/datasources/{dataSourceName}`
Marks as unused all segments belonging to a datasource. Returns a JSON object of the form
`{"numChangedSegments": <number>}` with the number of segments in the database whose state has been changed (that is,
the segments were marked as unused) as the result of this API call.
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}`
* `@Deprecated. /druid/coordinator/v1/datasources/{dataSourceName}?kill=true&interval={myInterval}`
Runs a [Kill task](../ingestion/tasks.md) for a given interval and datasource.
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}`
Marks as unused a segment of a datasource. Returns a JSON object of the form `{"segmentStateChanged": <boolean>}` with
the boolean indicating if the state of the segment has been changed (that is, the segment was marked as unused) as the
result of this API call.
#### Retention Rules
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
(e.g., 2016-06-27_2016-06-28).
##### GET
* `/druid/coordinator/v1/rules`
Returns all rules as JSON objects for all datasources in the cluster including the default datasource.
* `/druid/coordinator/v1/rules/{dataSourceName}`
Returns all rules for a specified datasource.
* `/druid/coordinator/v1/rules/{dataSourceName}?full`
Returns all rules for a specified datasource and includes default datasource.
* `/druid/coordinator/v1/rules/history?interval=<interval>`
Returns audit history of rules for all datasources. default value of interval can be specified by setting `druid.audit.manager.auditHistoryMillis` (1 week if not configured) in Coordinator runtime.properties
* `/druid/coordinator/v1/rules/history?count=<n>`
Returns last <n> entries of audit history of rules for all datasources.
* `/druid/coordinator/v1/rules/{dataSourceName}/history?interval=<interval>`
Returns audit history of rules for a specified datasource. default value of interval can be specified by setting `druid.audit.manager.auditHistoryMillis` (1 week if not configured) in Coordinator runtime.properties
* `/druid/coordinator/v1/rules/{dataSourceName}/history?count=<n>`
Returns last <n> entries of audit history of rules for a specified datasource.
##### POST
* `/druid/coordinator/v1/rules/{dataSourceName}`
POST with a list of rules in JSON form to update rules.
Optional Header Parameters for auditing the config change can also be specified.
|Header Param Name| Description | Default |
|----------|-------------|---------|
|`X-Druid-Author`| author making the config change|""|
|`X-Druid-Comment`| comment describing the change being done|""|
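A sketch of posting a rule set for an assumed `wikipedia` datasource, including the optional audit headers; the rule contents (a 30-day load rule followed by a drop-forever rule) and the Coordinator address are assumptions:

```bash
# Assumed datasource, rules, and Coordinator address.
curl -X POST -H 'Content-Type: application/json' \
  -H 'X-Druid-Author: jane' \
  -H 'X-Druid-Comment: keep 30 days, drop the rest' \
  -d '[
        {"type": "loadByPeriod", "period": "P30D", "tieredReplicants": {"_default_tier": 2}},
        {"type": "dropForever"}
      ]' \
  http://localhost:8081/druid/coordinator/v1/rules/wikipedia
```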
#### Intervals
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
(e.g., 2016-06-27_2016-06-28).
##### GET
* `/druid/coordinator/v1/intervals`
Returns all intervals for all datasources with total size and count.
* `/druid/coordinator/v1/intervals/{interval}`
Returns aggregated total size and count for all intervals that intersect given isointerval.
* `/druid/coordinator/v1/intervals/{interval}?simple`
Returns total size and count for each interval within given isointerval.
* `/druid/coordinator/v1/intervals/{interval}?full`
Returns total size and count for each datasource for each interval within given isointerval.
#### Dynamic configuration
See [Coordinator Dynamic Configuration](../configuration/index.md#dynamic-configuration) for details.
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
(e.g., 2016-06-27_2016-06-28).
##### GET
* `/druid/coordinator/v1/config`
Retrieves current coordinator dynamic configuration.
* `/druid/coordinator/v1/config/history?interval={interval}&count={count}`
Retrieves history of changes to coordinator dynamic configuration. Accepts `interval` and `count` query string parameters
to filter by interval and limit the number of results respectively.
##### POST
* `/druid/coordinator/v1/config`
Update coordinator dynamic configuration.
#### Compaction Status
##### GET
* `/druid/coordinator/v1/compaction/progress?dataSource={dataSource}`
Returns the total size of segments awaiting compaction for the given dataSource.
This is only valid for dataSource which has compaction enabled.
##### GET
* `/druid/coordinator/v1/compaction/status`
Returns the status and statistics from the latest auto compaction run of all dataSources which have/had auto compaction enabled.
The response payload includes a list of `latestStatus` objects. Each `latestStatus` represents the status for a dataSource (which has/had auto compaction enabled).
The `latestStatus` object has the following keys:
* `dataSource`: name of the datasource for this status information
* `scheduleStatus`: auto compaction scheduling status. Possible values are `NOT_ENABLED` and `RUNNING`. Returns `RUNNING` if the dataSource has an active auto compaction config submitted; otherwise, `NOT_ENABLED`
* `bytesAwaitingCompaction`: total bytes of this datasource waiting to be compacted by the auto compaction (only consider intervals/segments that are eligible for auto compaction)
* `bytesCompacted`: total bytes of this datasource that are already compacted with the spec set in the auto compaction config.
* `bytesSkipped`: total bytes of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.
* `segmentCountAwaitingCompaction`: total number of segments of this datasource waiting to be compacted by the auto compaction (only consider intervals/segments that are eligible for auto compaction)
* `segmentCountCompacted`: total number of segments of this datasource that are already compacted with the spec set in the auto compaction config.
* `segmentCountSkipped`: total number of segments of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.
* `intervalCountAwaitingCompaction`: total number of intervals of this datasource waiting to be compacted by the auto compaction (only consider intervals/segments that are eligible for auto compaction)
* `intervalCountCompacted`: total number of intervals of this datasource that are already compacted with the spec set in the auto compaction config.
* `intervalCountSkipped`: total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.
##### GET
* `/druid/coordinator/v1/compaction/status?dataSource={dataSource}`
Similar to the API `/druid/coordinator/v1/compaction/status` above but filters response to only return information for the {dataSource} given.
Note that {dataSource} given must have/had auto compaction enabled.
#### Compaction Configuration
##### GET
* `/druid/coordinator/v1/config/compaction`
Returns all compaction configs.
* `/druid/coordinator/v1/config/compaction/{dataSource}`
Returns a compaction config of a dataSource.
##### POST
* `/druid/coordinator/v1/config/compaction/taskslots?ratio={someRatio}&max={someMaxSlots}`
Update the capacity for compaction tasks. `ratio` and `max` are used to limit the max number of compaction tasks.
They mean the ratio of the total task slots to the compaction task slots and the maximum number of task slots for compaction tasks, respectively.
The actual max number of compaction tasks is `min(max, ratio * total task slots)`.
Note that `ratio` and `max` are optional and can be omitted. If they are omitted, default values (0.1 and unbounded)
will be set for them.
* `/druid/coordinator/v1/config/compaction`
Creates or updates the compaction config for a dataSource.
See [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for configuration details.
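For example, a sketch of capping compaction at 10% of the task slots (up to 5 tasks) and then submitting a minimal compaction config for an assumed datasource; the values and Coordinator address are assumptions:

```bash
# Assumed ratio/max, datasource, and Coordinator address.
curl -X POST \
  "http://localhost:8081/druid/coordinator/v1/config/compaction/taskslots?ratio=0.1&max=5"

curl -X POST -H 'Content-Type: application/json' \
  -d '{"dataSource": "wikipedia"}' \
  http://localhost:8081/druid/coordinator/v1/config/compaction
```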
##### DELETE
* `/druid/coordinator/v1/config/compaction/{dataSource}`
Removes the compaction config for a dataSource.
#### Server information
##### GET
* `/druid/coordinator/v1/servers`
Returns a list of server URLs using the format `{hostname}:{port}`. Note that
processes that run with different types will appear multiple times with different
ports.
* `/druid/coordinator/v1/servers?simple`
Returns a list of server data objects in which each object has the following keys:
* `host`: host URL include (`{hostname}:{port}`)
* `type`: process type (`indexer-executor`, `historical`)
* `currSize`: storage size currently used
* `maxSize`: maximum storage size
* `priority`
* `tier`
### Overlord
#### Leadership
##### GET
* `/druid/indexer/v1/leader`
Returns the current leader Overlord of the cluster. If you have multiple Overlords, just one is leading at any given time. The others are on standby.
* `/druid/indexer/v1/isLeader`
This returns a JSON object with field "leader", either true or false. In addition, this call returns HTTP 200 if the
server is the current leader and HTTP 404 if not. This is suitable for use as a load balancer status check if you
only want the active leader to be considered in-service at the load balancer.
#### Tasks
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
(e.g., 2016-06-27_2016-06-28).
##### GET
* `/druid/indexer/v1/tasks`
Retrieve list of tasks. Accepts query string parameters `state`, `datasource`, `createdTimeInterval`, `max`, and `type`.
|Query Parameter |Description |
|---|---|
|`state`|filter list of tasks by task state, valid options are `running`, `complete`, `waiting`, and `pending`.|
| `datasource`| return tasks filtered by Druid datasource.|
| `createdTimeInterval`| return tasks created within the specified interval. |
| `max`| maximum number of `"complete"` tasks to return. Only applies when `state` is set to `"complete"`.|
| `type`| filter tasks by task type. See [task documentation](../ingestion/tasks.md) for more details.|
* `/druid/indexer/v1/completeTasks`
Retrieve list of complete tasks. Equivalent to `/druid/indexer/v1/tasks?state=complete`.
* `/druid/indexer/v1/runningTasks`
Retrieve list of running tasks. Equivalent to `/druid/indexer/v1/tasks?state=running`.
* `/druid/indexer/v1/waitingTasks`
Retrieve list of waiting tasks. Equivalent to `/druid/indexer/v1/tasks?state=waiting`.
* `/druid/indexer/v1/pendingTasks`
Retrieve list of pending tasks. Equivalent to `/druid/indexer/v1/tasks?state=pending`.
* `/druid/indexer/v1/task/{taskId}`
Retrieve the 'payload' of a task.
* `/druid/indexer/v1/task/{taskId}/status`
Retrieve the status of a task.
* `/druid/indexer/v1/task/{taskId}/segments`
Retrieve information about the segments of a task.
> This API is deprecated and will be removed in future releases.
* `/druid/indexer/v1/task/{taskId}/reports`
Retrieve a [task completion report](../ingestion/tasks.md#task-reports) for a task. Only works for completed tasks.
##### POST
* `/druid/indexer/v1/task`
Endpoint for submitting tasks and supervisor specs to the Overlord. Returns the taskId of the submitted task.
* `/druid/indexer/v1/task/{taskId}/shutdown`
Shuts down a task.
* `/druid/indexer/v1/datasources/{dataSource}/shutdownAllTasks`
Shuts down all tasks for a dataSource.
* `/druid/indexer/v1/taskStatus`
Retrieve list of task status objects for list of task id strings in request body.
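A sketch of the typical task lifecycle against an assumed Overlord at localhost:8090 (the spec file and task ID are placeholders):

```bash
# Assumed Overlord address; task-spec.json and <taskId> are placeholders.
curl -X POST -H 'Content-Type: application/json' -d @task-spec.json \
  http://localhost:8090/druid/indexer/v1/task                              # submit; returns the taskId
curl http://localhost:8090/druid/indexer/v1/task/<taskId>/status           # poll task status
curl -X POST http://localhost:8090/druid/indexer/v1/task/<taskId>/shutdown # shut the task down
```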
##### DELETE
* `/druid/indexer/v1/pendingSegments/{dataSource}`
Manually clean up pending segments table in metadata storage for `datasource`. Returns a JSON object response with
`numDeleted` and count of rows deleted from the pending segments table. This API is used by the
`druid.coordinator.kill.pendingSegments.on` [coordinator setting](../configuration/index.md#coordinator-operation)
which automates this operation to perform periodically.
#### Supervisors
##### GET
* `/druid/indexer/v1/supervisor`
Returns a list of strings of the currently active supervisor ids.
* `/druid/indexer/v1/supervisor?full`
Returns a list of objects of the currently active supervisors.
|Field|Type|Description|
|---|---|---|
|`id`|String|supervisor unique identifier|
|`state`|String|basic state of the supervisor. Available states:`UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.md#operations) for details.|
|`detailedState`|String|supervisor specific state. (See documentation of specific supervisor for details), e.g. [Kafka](../development/extensions-core/kafka-ingestion.md) or [Kinesis](../development/extensions-core/kinesis-ingestion.md))|
|`healthy`|Boolean|true or false indicator of overall supervisor health|
|`spec`|SupervisorSpec|json specification of supervisor (See Supervisor Configuration for details)|
* `/druid/indexer/v1/supervisor?state=true`
Returns a list of objects of the currently active supervisors and their current state.
|Field|Type|Description|
|---|---|---|
|`id`|String|supervisor unique identifier|
|`state`|String|basic state of the supervisor. Available states: `UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.md#operations) for details.|
|`detailedState`|String|supervisor specific state. (See documentation of the specific supervisor for details, e.g. [Kafka](../development/extensions-core/kafka-ingestion.md) or [Kinesis](../development/extensions-core/kinesis-ingestion.md))|
|`healthy`|Boolean|true or false indicator of overall supervisor health|
|`suspended`|Boolean|true or false indicator of whether the supervisor is in suspended state|
* `/druid/indexer/v1/supervisor/<supervisorId>`
Returns the current spec for the supervisor with the provided ID.
* `/druid/indexer/v1/supervisor/<supervisorId>/status`
Returns the current status of the supervisor with the provided ID.
* `/druid/indexer/v1/supervisor/history`
Returns an audit history of specs for all supervisors (current and past).
* `/druid/indexer/v1/supervisor/<supervisorId>/history`
Returns an audit history of specs for the supervisor with the provided ID.
##### POST
* `/druid/indexer/v1/supervisor`
Create a new supervisor or update an existing one.
* `/druid/indexer/v1/supervisor/<supervisorId>/suspend`
Suspend the current running supervisor of the provided ID. Responds with updated SupervisorSpec.
* `/druid/indexer/v1/supervisor/suspendAll`
Suspend all supervisors at once.
* `/druid/indexer/v1/supervisor/<supervisorId>/resume`
Resume indexing tasks for a supervisor. Responds with updated SupervisorSpec.
* `/druid/indexer/v1/supervisor/resumeAll`
Resume all supervisors at once.
* `/druid/indexer/v1/supervisor/<supervisorId>/reset`
Reset the specified supervisor.
* `/druid/indexer/v1/supervisor/<supervisorId>/terminate`
Terminate a supervisor of the provided ID.
* `/druid/indexer/v1/supervisor/terminateAll`
Terminate all supervisors at once.
* `/druid/indexer/v1/supervisor/<supervisorId>/shutdown`
Shutdown a supervisor.
> This API is deprecated and will be removed in future releases.
> Please use the equivalent 'terminate' instead.
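For example, suspending and later resuming a supervisor might look like this sketch; the supervisor ID and Overlord address are assumptions:

```bash
# Assumed supervisor ID and Overlord address.
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/wikipedia_kafka/suspend
curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/wikipedia_kafka/resume
```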
#### Dynamic configuration
See [Overlord Dynamic Configuration](../configuration/index.md#overlord-dynamic-configuration) for details.
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
(e.g., 2016-06-27_2016-06-28).
##### GET
* `/druid/indexer/v1/worker`
Retrieves current overlord dynamic configuration.
* `/druid/indexer/v1/worker/history?interval={interval}&count={count}`
Retrieves history of changes to overlord dynamic configuration. Accepts `interval` and `count` query string parameters
to filter by interval and limit the number of results respectively.
* `/druid/indexer/v1/workers`
Retrieves a list of all the worker nodes in the cluster along with their metadata.
* `/druid/indexer/v1/scaling`
Retrieves overlord scaling events if auto-scaling runners are in use.
##### POST
* `/druid/indexer/v1/worker`
Update overlord dynamic worker configuration.
## Data Server
This section documents the API endpoints for the processes that reside on Data servers (MiddleManagers/Peons and Historicals)
in the suggested [three-server configuration](../design/processes.md#server-types).
### MiddleManager
##### GET
* `/druid/worker/v1/enabled`
Check whether a MiddleManager is in an enabled or disabled state. Returns JSON object keyed by the combined `druid.host`
and `druid.port` with the boolean state as the value.
```json
{"localhost:8091":true}
```
* `/druid/worker/v1/tasks`
Retrieve a list of active tasks being run on MiddleManager. Returns JSON list of taskid strings. Normal usage should
prefer to use the `/druid/indexer/v1/tasks` [Overlord API](#overlord) or one of its task-state-specific variants instead.
```json
["index_wikiticker_2019-02-11T02:20:15.316Z"]
```
* `/druid/worker/v1/task/{taskid}/log`
Retrieve task log output stream by task id. Normal usage should prefer to use the `/druid/indexer/v1/task/{taskId}/log`
[Overlord API](#overlord) instead.
##### POST
* `/druid/worker/v1/disable`
'Disable' a MiddleManager, causing it to stop accepting new tasks but complete all existing tasks. Returns JSON object
keyed by the combined `druid.host` and `druid.port`:
```json
{"localhost:8091":"disabled"}
```
* `/druid/worker/v1/enable`
'Enable' a MiddleManager, allowing it to accept new tasks again if it was previously disabled. Returns JSON object
keyed by the combined `druid.host` and `druid.port`:
```json
{"localhost:8091":"enabled"}
```
* `/druid/worker/v1/task/{taskid}/shutdown`
Shutdown a running task by `taskid`. Normal usage should prefer to use the `/druid/indexer/v1/task/{taskId}/shutdown`
[Overlord API](#overlord) instead. Returns JSON:
```json
{"task":"index_kafka_wikiticker_f7011f8ffba384b_fpeclode"}
```
### Peon
#### GET
* `/druid/worker/v1/chat/{taskId}/rowStats`
Retrieve a live row stats report from a Peon. See [task reports](../ingestion/tasks.md#task-reports) for more details.
* `/druid/worker/v1/chat/{taskId}/unparseableEvents`
Retrieve an unparseable events report from a Peon. See [task reports](../ingestion/tasks.md#task-reports) for more details.
### Historical
#### Segment Loading
##### GET
* `/druid/historical/v1/loadstatus`
Returns JSON of the form `{"cacheInitialized":<value>}`, where value is either `true` or `false` indicating if all
segments in the local cache have been loaded. This can be used to know when a Historical process is ready
to be queried after a restart.
* `/druid/historical/v1/readiness`
Similar to `/druid/historical/v1/loadstatus`, but instead of returning JSON with a flag, it responds with 200 OK if segments
in the local cache have been loaded, and 503 SERVICE UNAVAILABLE if they haven't.
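A sketch of how the readiness endpoint might be used in a restart script, assuming a Historical at localhost:8083:

```bash
# Assumed Historical address; curl -f exits non-zero on a 503 response, so the
# loop waits until the local segment cache has finished loading.
until curl -sf http://localhost:8083/druid/historical/v1/readiness; do
  sleep 5
done
echo "Historical is ready for queries"
```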
## Query Server
This section documents the API endpoints for the processes that reside on Query servers (Brokers) in the suggested [three-server configuration](../design/processes.md#server-types).
### Broker
#### Datasource Information
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
(e.g., 2016-06-27_2016-06-28).
##### GET
* `/druid/v2/datasources`
Returns a list of queryable datasources.
* `/druid/v2/datasources/{dataSourceName}`
Returns the dimensions and metrics of the datasource. Optionally, you can provide request parameter "full" to get list of served intervals with dimensions and metrics being served for those intervals. You can also provide request param "interval" explicitly to refer to a particular interval.
If no interval is specified, a default interval spanning a configurable period before the current time will be used. The default duration of this interval is specified in ISO 8601 duration format via:
druid.query.segmentMetadata.defaultHistory
* `/druid/v2/datasources/{dataSourceName}/dimensions`
Returns the dimensions of the datasource.
> This API is deprecated and will be removed in future releases. Please use [SegmentMetadataQuery](../querying/segmentmetadataquery.md) instead
> which provides more comprehensive information and supports all dataSource types including streaming dataSources. It's also encouraged to use [INFORMATION_SCHEMA tables](../querying/sql.md#metadata-tables)
> if you're using SQL.
* `/druid/v2/datasources/{dataSourceName}/metrics`
Returns the metrics of the datasource.
> This API is deprecated and will be removed in future releases. Please use [SegmentMetadataQuery](../querying/segmentmetadataquery.md) instead
> which provides more comprehensive information and supports all dataSource types including streaming dataSources. It's also encouraged to use [INFORMATION_SCHEMA tables](../querying/sql.md#metadata-tables)
> if you're using SQL.
* `/druid/v2/datasources/{dataSourceName}/candidates?intervals={comma-separated-intervals}&numCandidates={numCandidates}`
Returns segment information lists including server locations for the given datasource and intervals. If "numCandidates" is not specified, it will return all servers for each interval.
#### Load Status
##### GET
* `/druid/broker/v1/loadstatus`
Returns a flag indicating if the Broker knows about all segments in the cluster. This can be used to know when a Broker process is ready to be queried after a restart.
* `/druid/broker/v1/readiness`
Similar to `/druid/broker/v1/loadstatus`, but instead of returning JSON, responds with 200 OK if the Broker is ready and with 503 SERVICE UNAVAILABLE otherwise.
#### Queries
##### POST
* `/druid/v2/`
The endpoint for submitting queries. Accepts an optional `?pretty` query parameter that pretty-prints the results (see the example after this list).
* `/druid/v2/candidates/`
Returns segment information lists including server locations for the given query.
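For example, a minimal native query against `/druid/v2/` (a sketch assuming a Broker on the default port 8082 and a datasource named `wikipedia`):
```bash
curl -X POST -H 'Content-Type: application/json' \
  -d '{"queryType":"timeBoundary","dataSource":"wikipedia"}' \
  'http://localhost:8082/druid/v2/?pretty'
```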
### Router
#### GET
* `/druid/v2/datasources`
Returns a list of queryable datasources.
* `/druid/v2/datasources/{dataSourceName}`
Returns the dimensions and metrics of the datasource.
* `/druid/v2/datasources/{dataSourceName}/dimensions`
Returns the dimensions of the datasource.
* `/druid/v2/datasources/{dataSourceName}/metrics`
Returns the metrics of the datasource.
operations/auth-ldap.md
---
id: auth-ldap
title: "LDAP auth"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
This page describes how to set up Druid user authentication and authorization through LDAP. The first step is to enable LDAP authentication and authorization for Druid. You then map an LDAP group to roles and assign permissions to roles.
## Enable LDAP in Druid
Before starting, verify that the Active Directory is reachable from the Druid Master servers. Command line tools such as `ldapsearch` and `ldapwhoami`, which are included with OpenLDAP, are useful for this testing.
### Check the connection
First, test that the basic connection and user credentials work. For example, given a user `uuser1@example.com`, try:
```bash
ldapwhoami -vv -H ldap://<ip_address>:389  -D"uuser1@example.com" -W
```
Enter the password associated with the user when prompted and verify that the command succeeded. If it didn't, try the following troubleshooting steps:
* Verify that you've used the correct port for your LDAP instance. By default, the LDAP port is 389, but double-check with your LDAP admin if unable to connect.
* Check that a network firewall is not blocking connections to the LDAP port.
* Check whether LDAP clients need to be specifically whitelisted at the LDAP server to be able to reach it. If so, add the Druid Coordinator server to the AD whitelist.
### Check the search criteria
After verifying basic connectivity, check your search criteria. For example, the command for searching for user `uuser1@example.com` is as follows:
```bash
ldapsearch -x -W -H ldap://<ldap_server>  -D"uuser1@example.com" -b "dc=example,dc=com" "(sAMAccountName=uuser1)"
```
Note the `memberOf` attribute in the results; it shows the groups that the user belongs to. You will use this value to map the LDAP group to Druid roles later. This attribute may be implemented differently on different types of LDAP servers. For instance, some LDAP servers may support recursive groupings, and some may not. Some LDAP server implementations may not have any object classes that contain this attribute at all. If your LDAP server does not use the `memberOf` attribute, then Druid will not be able to determine a user's group membership using LDAP. The `sAMAccountName` attribute used in this example contains the authenticated user identity; it is an attribute of an object class specific to Microsoft Active Directory. The object classes and attributes used in your LDAP server may be different.
## Configure Druid user authentication with LDAP/Active Directory 
1. Enable the `druid-basic-security` extension in the `common.runtime.properties` file. See [Security Overview](security-overview.md) for details.
2. As a best practice, create a user in LDAP to be used for internal communication with Druid.
3. In `common.runtime.properties`, update LDAP-related properties, as shown in the following listing: 
```
druid.auth.authenticatorChain=["ldap"]
druid.auth.authenticator.ldap.type=basic
druid.auth.authenticator.ldap.enableCacheNotifications=true
druid.auth.authenticator.ldap.credentialsValidator.type=ldap
druid.auth.authenticator.ldap.credentialsValidator.url=ldap://<AD host>:<AD port>
druid.auth.authenticator.ldap.credentialsValidator.bindUser=<AD admin user, e.g.: Administrator@example.com>
druid.auth.authenticator.ldap.credentialsValidator.bindPassword=<AD admin password>
druid.auth.authenticator.ldap.credentialsValidator.baseDn=<base dn, e.g.: dc=example,dc=com>
druid.auth.authenticator.ldap.credentialsValidator.userSearch=<The LDAP search, e.g.: (&(sAMAccountName=%s)(objectClass=user))>
druid.auth.authenticator.ldap.credentialsValidator.userAttribute=sAMAccountName
druid.auth.authenticator.ldap.authorizerName=ldapauth
druid.escalator.type=basic
druid.escalator.internalClientUsername=<AD internal user, e.g.: internal@example.com>
druid.escalator.internalClientPassword=Welcome123
druid.escalator.authorizerName=ldapauth
druid.auth.authorizers=["ldapauth"]
druid.auth.authorizer.ldapauth.type=basic
druid.auth.authorizer.ldapauth.initialAdminUser=<AD user who acts as the initial admin user, e.g.: internal@example.com>
druid.auth.authorizer.ldapauth.initialAdminRole=admin
druid.auth.authorizer.ldapauth.roleProvider.type=ldap
```
Notice that the LDAP user created in the previous step, `internal@example.com`, serves as the internal client user and the initial admin user.
## Use LDAP groups to assign roles
You can map LDAP groups to a role in Druid. Members in the group get access to the permissions of the corresponding role. 
### Step 1: Create a role
First create the role in Druid using the Druid REST API.
Creating a role involves submitting a POST request to the Coordinator process. 
The following REST API calls create a role that grants read access to datasources, config, and state.
> As mentioned, the REST API calls need to address the Coordinator node. The examples used below use localhost as the Coordinator host and 8081 as the port. Adjust these settings according to your deployment.
Call the following API to create the role `readRole`:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/roles/readRole
```
Check that the role has been created successfully by entering the following:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X GET  http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/roles
```
### Step 2: Add permissions to a role 
You can now add one or more permissions to the role. The following example adds read-only access to a `wikipedia` data source.
Given the following JSON in a file named `perm.json`:
```
[{ "resource": { "name": "wikipedia", "type": "DATASOURCE" }, "action": "READ" }
,{ "resource": { "name": ".*", "type": "STATE" }, "action": "READ" },
{ "resource": {"name": ".*", "type": "CONFIG"}, "action": "READ"}]
```
The following command associates the permissions in the JSON file with the role:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X POST -d@perm.json  http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/roles/readRole/permissions
```
Note that the STATE and CONFIG permissions in `perm.json` are needed to see the data source in the Druid console. If you only need query permissions, the DATASOURCE READ permission alone is sufficient:
```
[{ "resource": { "name": "wikipedia", "type": "DATASOURCE" }, "action": "READ" }]
```
You can also provide the name in the form of a regular expression. For example, to give access to all data sources starting with `wiki`, specify the name as `{ "name": "wiki.*", .....`.
### Step 3: Create a group mapping
The following shows an example of a group-to-role mapping. It assumes that a group named `group1` exists in the directory, and that the following role mapping is saved in a file named `groupmap.json`:
```
{
    "name": "group1map",
    "groupPattern": "CN=group1,CN=Users,DC=example,DC=com",
    "roles": [
        "readRole"
    ]
}
```
You can configure the mapping as follows:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X POST -d @groupmap.json http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1map
```
To check whether the group mapping was created successfully, run the following command:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings
```
To check the details of a specific group mapping, use the following:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1map
```
To add additional roles to the group mapping, use the following API:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1/roles/<newrole> 
```
In the next two steps you will create a user and assign previously created roles to it. These steps are only needed in the following cases:
- Your LDAP server does not support the `memberOf` attribute, or
- You want to configure a user with additional roles that are not mapped to the group(s) that the user is a member of
If this is not the case for your scenario, you can skip these steps.
### Step 4. Create a user
Once LDAP is enabled, only user passwords are verified with LDAP. You add the LDAP user to Druid as follows: 
```
curl -i -v  -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authentication/db/ldap/users/<AD user> 
```
### Step 5. Assign the role to the user
The following command shows how to assign a role to a user:
```
curl -i -v  -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/users/<AD user>/roles/<rolename> 
```
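To verify the assignment, you can fetch the user definition, which includes its roles (a sketch; the same Coordinator host and port assumptions as above apply):
```
curl -i -v  -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/users/<AD user>
```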
For more information about security and the basic security extension, see [Security Overview](security-overview.md).
---
id: basic-cluster-tuning
title: "Basic cluster tuning"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
This document provides basic guidelines for configuration properties and cluster architecture considerations related to performance tuning of an Apache Druid deployment.
Please note that this document provides general guidelines and rules-of-thumb: these are not absolute, universal rules for cluster tuning, and this introductory guide is not an exhaustive description of all Druid tuning properties, which are described in the [configuration reference](../configuration/index.md).
If you have questions on tuning Druid for specific use cases, or questions on configuration properties not covered in this guide, please ask the [Druid user mailing list or other community channels](https://druid.apache.org/community/).
## Process-specific guidelines
### Historical
#### Heap sizing
The biggest contributions to heap usage on Historicals are:
- Partial unmerged query results from segments
- The stored maps for [lookups](../querying/lookups.md).
A general rule-of-thumb for sizing the Historical heap is `(0.5GB * number of CPU cores)`, with an upper limit of ~24GB.
This rule-of-thumb scales using the number of CPU cores as a convenient proxy for hardware size and level of concurrency (note: this formula is not a hard rule for sizing Historical heaps).
Having a heap that is too large can result in excessively long GC pauses; the ~24GB upper limit is imposed to avoid this.
If caching is enabled on Historicals, the cache is stored on heap, sized by `druid.cache.sizeInBytes`.
Running out of heap on the Historicals can indicate misconfiguration or usage patterns that are overloading the cluster.
##### Lookups
If you are using lookups, calculate the total size of the lookup maps being loaded.
Druid performs an atomic swap when updating lookup maps (both the old map and the new map will exist in heap during the swap), so the maximum potential heap usage from lookup maps will be (2 * total size of all loaded lookups).
Be sure to add `(2 * total size of all loaded lookups)` to your heap size in addition to the `(0.5GB * number of CPU cores)` guideline.
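As a hypothetical illustration (not a recommendation): a 16-core Historical with 4GB of total loaded lookup maps would get `(0.5GB * 16) + (2 * 4GB) = 16GB` of heap, plus `druid.cache.sizeInBytes` if caching is enabled.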
#### Processing Threads and Buffers
Please see the [General Guidelines for Processing Threads and Buffers](#processing-threads-buffers) section for an overview of processing thread/buffer configuration.
On Historicals:
- `druid.processing.numThreads` should generally be set to `(number of cores - 1)`: a smaller value can result in CPU underutilization, while going over the number of cores can result in unnecessary CPU contention.
- `druid.processing.buffer.sizeBytes` can be set to 500MB.
- For `druid.processing.numMergeBuffers`, a 1:4 ratio of merge buffers to processing threads is a reasonable choice for general use.
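As an illustration only, the hypothetical 16-core Historical described above might be configured roughly as follows (values are assumptions, not recommendations):
```
druid.processing.numThreads=15
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=4
```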
#### Direct Memory Sizing
The processing and merge buffers described above are direct memory buffers.
When a Historical processes a query, it must open a set of segments for reading. This also requires some direct memory space, described in [segment decompression buffers](#segment-decompression).
A formula for estimating direct memory usage follows:
(`druid.processing.numThreads` + `druid.processing.numMergeBuffers` + 1) * `druid.processing.buffer.sizeBytes`
The `+ 1` factor is a fuzzy estimate meant to account for the segment decompression buffers.
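Plugging in the hypothetical values above: `(15 + 4 + 1) * 500MB = 10GB` of direct memory, so `-XX:MaxDirectMemorySize` would need to be at least roughly that high on such a Historical.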
#### Connection pool sizing
Please see the [General Connection Pool Guidelines](#connection-pool) section for an overview of connection pool configuration.
For Historicals, `druid.server.http.numThreads` should be set to a value slightly higher than the sum of `druid.broker.http.numConnections` across all the Brokers in the cluster.
Tuning the cluster so that each Historical can accept 50 queries and 10 non-queries is a reasonable starting point.
#### Segment Cache Size
`druid.segmentCache.locations` specifies locations where segment data can be stored on the Historical. The sum of available disk space across these locations is used as the default value for the `druid.server.maxSize` property, which controls the total size of segment data that can be assigned by the Coordinator to a Historical.
Segments are memory-mapped by Historical processes using any available free system memory (i.e., memory not used by the Historical JVM and heap/direct memory buffers or other processes on the system). Segments that are not currently in memory will be paged from disk when queried.
Therefore, the size of cache locations set within `druid.segmentCache.locations` should be such that a Historical is not allocated an excessive amount of segment data. As the value of (`free system memory` / total size of all `druid.segmentCache.locations`) increases, a greater proportion of segments can be kept in memory, allowing for better query performance. The total segment data size assigned to a Historical can be overridden with `druid.server.maxSize`, but this is not required for most of the use cases.
#### Number of Historicals
The number of Historicals needed in a cluster depends on how much data the cluster has. For good performance, you will want enough Historicals such that each Historical has a good (`free system memory` / total size of all `druid.segmentCache.locations`) ratio, as described in the segment cache size section above.
Having a smaller number of big servers is generally better than having a large number of small servers, as long as you have enough fault tolerance for your use case.
#### SSD storage
We recommend using SSDs for storage on the Historicals, as they handle segment data stored on disk.
#### Total memory usage
To estimate total memory usage of the Historical under these guidelines:
- Heap: `(0.5GB * number of CPU cores) + (2 * total size of lookup maps) + druid.cache.sizeInBytes`
- Direct Memory: `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`
The Historical will use any available free system memory (i.e., memory not used by the Historical JVM and heap/direct memory buffers or other processes on the system) for memory-mapping of segments on disk. For better query performance, you will want to ensure a good (`free system memory` / total size of all `druid.segmentCache.locations`) ratio so that a greater proportion of segments can be kept in memory.
#### Segment sizes matter
Be sure to check out [segment size optimization](./segment-optimization.md) to help tune your Historical processes for maximum performance.
### Broker
#### Heap sizing
The biggest contributions to heap usage on Brokers are:
- Partial unmerged query results from Historicals and Tasks
- The segment timeline: this consists of location information (which Historical/Task is serving a segment) for all currently [available](../design/architecture.md#segment-lifecycle) segments.
- Cached segment metadata: this consists of metadata, such as per-segment schemas, for all currently available segments.
The Broker heap requirements scale based on the number of segments in the cluster, and the total data size of the segments.
The heap size will vary based on data size and usage patterns, but 4G to 8G is a good starting point for a small or medium cluster (~15 servers or less). For a rough estimate of memory requirements on the high end, very large clusters with a node count on the order of ~100 nodes may need Broker heaps of 30GB-60GB.
If caching is enabled on the Broker, the cache is stored on heap, sized by `druid.cache.sizeInBytes`.
#### Direct memory sizing
On the Broker, the amount of direct memory needed depends on how many merge buffers (used for merging GroupBys) are configured. The Broker does not generally need processing threads or processing buffers, as query results are merged on-heap in the HTTP connection threads instead.
- `druid.processing.buffer.sizeBytes` can be set to 500MB.
- `druid.processing.numThreads`: set this to 1 (the minimum allowed)
- `druid.processing.numMergeBuffers`: set this to the same value as on Historicals or a bit higher
#### Connection pool sizing
Please see the [General Connection Pool Guidelines](#connection-pool) section for an overview of connection pool configuration.
On the Brokers, please ensure that the sum of `druid.broker.http.numConnections` across all the Brokers is slightly lower than the value of `druid.server.http.numThreads` on your Historicals and Tasks.
`druid.server.http.numThreads` on the Broker should be set to a value slightly higher than `druid.broker.http.numConnections` on the same Broker.
Tuning the cluster so that each Historical can accept 50 queries and 10 non-queries, adjusting the Brokers accordingly, is a reasonable starting point.
#### Broker backpressure
When retrieving query results from Historical processes or Tasks, the Broker can optionally specify a maximum buffer size for queued, unread data, and exert backpressure on the channel to the Historical or Tasks when limit is reached (causing writes to the channel to block on the Historical/Task side until the Broker is able to drain some data from the channel).
This buffer size is controlled by the `druid.broker.http.maxQueuedBytes` setting.
The limit is divided across the number of Historicals/Tasks that a query would hit: suppose `druid.broker.http.maxQueuedBytes` is set to 5MB and the Broker receives a query that needs to be fanned out to 2 Historicals. Each per-Historical channel would get a 2.5MB buffer in this case.
You can generally set this to a value of approximately `2MB * number of Historicals`. As your cluster scales up with more Historicals and Tasks, consider increasing this buffer size and increasing the Broker heap accordingly.
- If the buffer is too small, this can lead to inefficient queries due to the buffer filling up rapidly and stalling the channel
- If the buffer is too large, this puts more memory pressure on the Broker due to more queued result data in the HTTP channels.
#### Number of brokers
A 1:15 ratio of Brokers to Historicals is a reasonable starting point (this is not a hard rule).
If you need Broker HA, you can deploy 2 initially and then use the 1:15 ratio guideline for additional Brokers.
#### Total memory usage
To estimate total memory usage of the Broker under these guidelines:
- Heap: allocated heap size
- Direct Memory: `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`
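As a hypothetical illustration: with `druid.processing.numThreads=1`, `druid.processing.numMergeBuffers=4`, and a 500MB buffer size, Broker direct memory comes to `(1 + 4 + 1) * 500MB = 3GB`; adding a 6GB heap for a small-to-medium cluster gives roughly 9GB of total memory.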
### MiddleManager
The MiddleManager is a lightweight task controller/manager that launches Task processes, which perform ingestion work.
#### MiddleManager heap sizing
The MiddleManager itself does not require many resources; you can generally set its heap to ~128MB.
#### SSD storage
We recommend using SSDs for storage on the MiddleManagers, as the Tasks launched by MiddleManagers handle segment data stored on disk.
#### Task Count
The number of tasks a MiddleManager can launch is controlled by the `druid.worker.capacity` setting.
The number of workers needed in your cluster depends on how many concurrent ingestion tasks you need to run for your use cases. The number of workers that can be launched on a given machine depends on the size of resources allocated per worker and available system resources.
You can allocate more MiddleManager machines to your cluster to add task capacity.
#### Task configurations
The following section describes configuration for Tasks launched by the MiddleManager. Tasks can be queried and perform ingestion workloads, so they require more resources than the MiddleManager itself.
##### Task heap sizing
A 1GB heap is usually enough for Tasks.
###### Lookups
If you are using lookups, calculate the total size of the lookup maps being loaded.
Druid performs an atomic swap when updating lookup maps (both the old map and the new map will exist in heap during the swap), so the maximum potential heap usage from lookup maps will be (2 * total size of all loaded lookups).
Be sure to add `(2 * total size of all loaded lookups)` to your Task heap size if you are using lookups.
##### Task processing threads and buffers
For Tasks, 1 or 2 processing threads are often enough, as the Tasks tend to hold much less queryable data than Historical processes.
- `druid.indexer.fork.property.druid.processing.numThreads`: set this to 1 or 2
- `druid.indexer.fork.property.druid.processing.numMergeBuffers`: set this to 2
- `druid.indexer.fork.property.druid.processing.buffer.sizeBytes`: can be set to 100MB
##### Direct memory sizing
The processing and merge buffers described above are direct memory buffers.
When a Task processes a query, it must open a set of segments for reading. This also requires some direct memory space, described in [segment decompression buffers](#segment-decompression).
An ingestion Task also needs to merge partial ingestion results, which requires direct memory space, described in [segment merging](#segment-merging).
A formula for estimating direct memory usage follows:
(`druid.processing.numThreads` + `druid.processing.numMergeBuffers` + 1) * `druid.processing.buffer.sizeBytes`
The `+ 1` factor is a fuzzy estimate meant to account for the segment decompression buffers and dictionary merging buffers.
##### Connection pool sizing
Please see the [General Connection Pool Guidelines](#connection-pool) section for an overview of connection pool configuration.
For Tasks, `druid.server.http.numThreads` should be set to a value slightly higher than the sum of `druid.broker.http.numConnections` across all the Brokers in the cluster.
Tuning the cluster so that each Task can accept 50 queries and 10 non-queries is a reasonable starting point.
#### Total memory usage
To estimate total memory usage of a Task under these guidelines:
- Heap: `1GB + (2 * total size of lookup maps)`
- Direct Memory: `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`
The total memory usage of the MiddleManager + Tasks:
`MM heap size + druid.worker.capacity * (single task memory usage)`
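As a hypothetical illustration: with `druid.worker.capacity=4`, a 1GB Task heap, and Task direct memory of `(2 + 2 + 1) * 100MB = 500MB`, total usage would be about `128MB + 4 * (1GB + 500MB) ≈ 6.1GB`.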
##### Configuration guidelines for specific ingestion types
###### Kafka/Kinesis ingestion
If you use the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md) or [Kinesis Indexing Service](../development/extensions-core/kinesis-ingestion.md), the number of tasks required will depend on the number of partitions and your taskCount/replica settings.
On top of those requirements, allocating more task slots in your cluster is a good idea, so that you have free task
slots available for other tasks, such as [compaction tasks](../ingestion/compaction.md).
###### Hadoop ingestion
If you are only using [Hadoop-based batch ingestion](../ingestion/hadoop.md) with no other ingestion types, you can lower the amount of resources allocated per Task. Batch ingestion tasks do not need to answer queries, and the bulk of the ingestion workload will be executed on the Hadoop cluster, so the Tasks do not require much resources.
###### Parallel native ingestion
If you are using [parallel native batch ingestion](../ingestion/native-batch.md#parallel-task), allocating more available task slots is a good idea and will allow greater ingestion concurrency.
### Coordinator
The main performance-related setting on the Coordinator is the heap size.
The heap requirements of the Coordinator scale with the number of servers, segments, and tasks in the cluster.
You can set the Coordinator heap to the same size as your Broker heap, or slightly smaller: both services have to process cluster-wide state and answer API requests about this state.
#### Dynamic Configuration
`percentOfSegmentsToConsiderPerMove`
* The default value is 100. This means that the Coordinator will consider all segments when it is looking for a segment to move. The Coordinator makes a weighted choice, with segments on Servers with the least capacity being the most likely segments to be moved.
* This weighted selection strategy means that the segments on the servers who have the most available capacity are the least likely to be chosen.
* As the number of segments in the cluster increases, the probability of choosing the Nth segment to move decreases, where N is the last segment considered for moving.
* An admin can use this config to skip consideration of that Nth segment.
* Instead of skipping a precise number of segments, we skip a percentage of segments in the cluster.
* For example, with the value set to 25, only the first 25% of segments will be considered as a segment that can be moved. This 25% of segments will come from the servers that have the least available capacity.
* In this example, each time the Coordinator looks for a segment to move, it will consider 75% less segments than it did when the configuration was 100. On clusters with hundreds of thousands of segments, this can add up to meaningful coordination time savings.
* General recommendations for this configuration:
* If you are not worried about the amount of time it takes your Coordinator to complete a full coordination cycle, you likely do not need to modify this config.
* If you are frustrated with how long the Coordinator takes to run a full coordination cycle, and you have set the Coordinator dynamic config `maxSegmentsToMove` to a value above 0 (the default is 5), setting this config to a non-default value can help shorten coordination time.
* The recommended starting point value is 66. It represents a meaningful decrease in the percentage of segments considered while also not being too aggressive (You will consider 1/3 fewer segments per move operation with this value).
* The impact that modifying this config will have on your coordination time will be a function of how low you set the config value, the value for `maxSegmentsToMove` and the total number of segments in your cluster.
* If your cluster has a relatively small number of segments, or you choose to move few segments per coordination cycle, there may not be much savings to be had here.
### Overlord
The main performance-related setting on the Overlord is the heap size.
The heap requirements of the Overlord scale primarily with the number of running Tasks.
The Overlord tends to require less resources than the Coordinator or Broker. You can generally set the Overlord heap to a value that's 25-50% of your Coordinator heap.
### Router
The Router has light resource requirements, as it proxies requests to Brokers without performing much computational work itself.
You can assign it 256MB heap as a starting point, growing it if needed.
<a name="processing-threads-buffers"></a>
## Guidelines for processing threads and buffers
### Processing threads
The `druid.processing.numThreads` configuration controls the size of the processing thread pool used for computing query results. The size of this pool limits how many queries can be concurrently processed.
### Processing buffers
`druid.processing.buffer.sizeBytes` is a closely related property that controls the size of the off-heap buffers allocated to the processing threads.
One buffer is allocated for each processing thread. A size between 500MB and 1GB is a reasonable choice for general use.
The TopN and GroupBy queries use these buffers to store intermediate computed results. As the buffer size increases, more data can be processed in a single pass.
### GroupBy merging buffers
If you plan to issue GroupBy V2 queries, `druid.processing.numMergeBuffers` is an important configuration property.
GroupBy V2 queries use an additional pool of off-heap buffers for merging query results. These buffers have the same size as the processing buffers described above, set by the `druid.processing.buffer.sizeBytes` property.
Non-nested GroupBy V2 queries require 1 merge buffer per query, while a nested GroupBy V2 query requires 2 merge buffers (regardless of the depth of nesting).
The number of merge buffers determines the number of GroupBy V2 queries that can be processed concurrently.
<a name="connection-pool"></a>
## Connection pool guidelines
Each Druid process has a configuration property for the number of HTTP connection handling threads, `druid.server.http.numThreads`.
The number of HTTP server threads limits how many concurrent HTTP API requests a given process can handle.
### Sizing the connection pool for queries
The Broker has a setting `druid.broker.http.numConnections` that controls how many outgoing connections it can make to a given Historical or Task process.
These connections are used to send queries to the Historicals or Tasks, with one connection per query; the value of `druid.broker.http.numConnections` is effectively a limit on the number of concurrent queries that a given Broker can process.
Suppose we have a cluster with 3 Brokers and `druid.broker.http.numConnections` is set to 10.
This means that each Broker in the cluster will open up to 10 connections to each individual Historical or Task (for a total of 30 incoming query connections per Historical/Task).
On the Historical/Task side, this means that `druid.server.http.numThreads` must be set to a value at least as high as the sum of `druid.broker.http.numConnections` across all the Brokers in the cluster.
In practice, you will want to allocate additional server threads for non-query API requests such as status checks; adding 10 threads for those is a good general guideline. Using the example with 3 Brokers in the cluster and `druid.broker.http.numConnections` set to 10, a value of 40 would be appropriate for `druid.server.http.numThreads` on Historicals and Tasks.
As a starting point, allowing for 50 concurrent queries (requests that read segment data from datasources) + 10 non-query requests (other requests like status checks) on Historicals and Tasks is reasonable (i.e., set `druid.server.http.numThreads` to 60 there), while sizing `druid.broker.http.numConnections` based on the number of Brokers in the cluster to fit within the 50 query connection limit per Historical/Task.
- If the connection pool across Brokers and Historicals/Tasks is too small, the cluster will be underutilized as there are too few concurrent query slots.
- If the connection pool is too large, you may get out-of-memory errors due to excessive concurrent load, and increased resource contention.
- The connection pool sizing matters most when you require QoS-type guarantees and use query priorities; otherwise, these settings can be more loosely configured.
- If your cluster usage patterns are heavily biased towards a high number of small concurrent queries (where each query takes less than ~15ms), enlarging the connection pool can be a good idea.
- The 50/10 general guideline here is a rough starting point, since different queries impose different amounts of load on the system. To size the connection pool more exactly for your cluster, you would need to know the execution times for your queries and ensure that the rate of incoming queries does not exceed your "drain" rate.
## Per-segment direct memory buffers
### Segment decompression
When opening a segment for reading during segment merging or query processing, Druid allocates a 64KB off-heap decompression buffer for each column being read.
Thus, there is additional direct memory overhead of (64KB * number of columns read per segment * number of segments read) when reading segments.
### Segment merging
In addition to the segment decompression overhead described above, when a set of segments are merged during ingestion, a direct buffer is allocated for every String typed column, for every segment in the set to be merged.
The size of these buffers is equal to the cardinality of the String column within its segment, times 4 bytes (the buffers store integers).
For example, if two segments are being merged, the first segment having a single String column with cardinality 1000, and the second segment having a String column with cardinality 500, the merge step would allocate (1000 + 500) * 4 = 6000 bytes of direct memory.
These buffers are used for merging the value dictionaries of the String column across segments. These "dictionary merging buffers" are independent of the "merge buffers" configured by `druid.processing.numMergeBuffers`.
## General recommendations
### JVM tuning
#### Garbage Collection
We recommend using the G1GC garbage collector:
`-XX:+UseG1GC`
Enabling process termination on out-of-memory errors is useful as well, since the process generally will not recover from such a state, and it's better to restart the process:
`-XX:+ExitOnOutOfMemoryError`
#### Other useful JVM flags
```
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=<should not be volatile tmpfs and also has good read and write speed. Strongly recommended to avoid using NFS mount>
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Dorg.jboss.logging.provider=slf4j
-Dnet.spy.log.LoggerImpl=net.spy.memcached.compat.log.SLF4JLogger
-Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown
-Dlog4j.shutdownHookEnabled=true
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-Xloggc:/var/logs/druid/historical.gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=50
-XX:GCLogFileSize=10m
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/logs/druid/historical.hprof
-XX:MaxDirectMemorySize=1g
```
> Please note that the flag settings above represent sample, general guidelines only. Be careful to use values appropriate
> for your specific scenario and be sure to test any changes in staging environments.
The `ExitOnOutOfMemoryError` flag is only supported starting with JDK 8u92. For older versions, `-XX:OnOutOfMemoryError='kill -9 %p'` can be used.
`MaxDirectMemorySize` restricts the JVM from allocating more direct memory than the specified limit; if it is not set, the JVM restriction is effectively lifted and only OS-level memory limits apply. It's still important to make sure that Druid is not configured to allocate more off-heap memory than your machine has available. Important settings here include `druid.processing.numThreads`, `druid.processing.numMergeBuffers`, and `druid.processing.buffer.sizeBytes`.
Additionally, for large JVM heaps, here are a few Garbage Collection efficiency guidelines that have been known to help in some cases.
- Mount /tmp on tmpfs. See [The Four Month Bug: JVM statistics cause garbage collection pauses](http://www.evanjones.ca/jvm-mmap-pause.html).
- On Disk-IO intensive processes (e.g., Historical and MiddleManager), GC and Druid logs should be written to a different disk than where data is written.
- Disable [Transparent Huge Pages](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html).
- Try disabling biased locking by using `-XX:-UseBiasedLocking` JVM flag. See [Logging Stop-the-world Pauses in JVM](https://dzone.com/articles/logging-stop-world-pauses-jvm).
### Use UTC timezone
We recommend using the UTC timezone for all your events and across your hosts, not just for Druid, but for all data infrastructure. This can greatly mitigate potential query problems caused by inconsistent timezones. To query in a non-UTC timezone, see [query granularities](../querying/granularities.md#period-granularities).
### System configuration
#### SSDs
SSDs are highly recommended for Historical, MiddleManager, and Indexer processes if you are not running a cluster that is entirely in memory. SSDs can greatly mitigate the time required to page data in and out of memory.
#### JBOD vs RAID
Historical processes store a large number of segments on disk and support specifying multiple paths for storing them. Typically, hosts have multiple disks configured with RAID, which makes them look like a single disk to the OS. RAID can add overhead, especially if it is software-based rather than backed by a hardware controller, so Historicals might get improved disk throughput with JBOD.
#### Swap space
We recommend _not_ using swap space for Historical, MiddleManager, and Indexer processes, since the large number of memory-mapped segment files can lead to poor and unpredictable performance.
#### Linux limits
For Historical, MiddleManager, and Indexer processes (and for really large clusters, Broker processes), you might need to adjust some Linux system limits to account for a large number of open files, a large number of network connections, or a large number of memory mapped files.
##### ulimit
The limit on the number of open files can be set permanently by editing `/etc/security/limits.conf`. This value should be substantially greater than the number of segment files that will exist on the server.
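A minimal sketch of such an entry, assuming Druid runs as a user named `druid` (the user name and value are illustrative):
```
# /etc/security/limits.conf
druid soft nofile 65536
druid hard nofile 65536
```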
##### max_map_count
Historical processes, and to a lesser extent MiddleManager and Indexer processes, memory-map segment files, so depending on the number of segments per server, `/proc/sys/vm/max_map_count` might also need to be adjusted. Depending on the variant of Linux, this might be done via `sysctl` by placing a file in `/etc/sysctl.d/` that sets `vm.max_map_count`.
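For example, you might drop a file like the following into `/etc/sysctl.d/` (the file name and value are illustrative) and apply it with `sudo sysctl --system`:
```
# /etc/sysctl.d/99-druid.conf
vm.max_map_count=262144
```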
---
id: druid-console
title: "Web console"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Druid includes a web console for managing datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. Users can also run SQL and native Druid queries in the console.
The Druid Console is hosted by the [Router](../design/router.md) process.
The following cluster settings must be enabled, as they are by default:
- the Router's [management proxy](../design/router.md#enabling-the-management-proxy) must be enabled.
- the Broker processes in the cluster must have [Druid SQL](../querying/sql.md) enabled.
The Druid console can be accessed at:
```
http://<ROUTER_IP>:<ROUTER_PORT>
```
> It is important to note that any Druid console user will have, effectively, the same file permissions as the user under which Druid runs. One way these permissions are surfaced is in the file browser dialog. The dialog
> will show console users the files that the underlying user has permissions to. In general, avoid running Druid as
> the root user. Consider creating a dedicated user account for running Druid.
Below is a description of the high-level features and functionality of the Druid Console.
## Home
The home view provides a high level overview of the cluster.
Each card is clickable and links to the appropriate view.
The legacy menu allows you to go to the [legacy coordinator and overlord consoles](./management-uis.md#legacy-consoles) should you need them.
![home-view](../assets/web-console-01-home-view.png "home view")
## Data loader
The data loader view allows you to load data by building an ingestion spec with a step-by-step wizard.
![data-loader-1](../assets/web-console-02-data-loader-1.png)
After selecting the location of your data, follow the series of steps that show you incremental previews of the data as it will be ingested.
After filling in the required details on every step you can navigate to the next step by clicking the `Next` button.
You can also freely navigate between the steps from the top navigation.
Navigating with the top navigation will leave the underlying spec unmodified while clicking the `Next` button will attempt to fill in the subsequent steps with appropriate defaults.
![data-loader-2](../assets/web-console-03-data-loader-2.png)
## Datasources
The datasources view shows all the currently enabled datasources.
From this view you can see the sizes and availability of the different datasources.
You can edit the retention rules, configure automatic compaction, and drop data.
Like any view that is powered by a Druid SQL query, you can click `View SQL query for table` from the `...` menu to run the underlying SQL query directly.
![datasources](../assets/web-console-04-datasources.png)
You can view and edit retention rules to determine the general availability of a datasource.
![retention](../assets/web-console-05-retention.png)
## Segments
The segment view shows all the segments in the cluster.
Each segment has a detail view that provides more information.
The Segment ID is also conveniently broken down into Datasource, Start, End, Version, and Partition columns for ease of filtering and sorting.
![segments](../assets/web-console-06-segments.png)
## Tasks and supervisors
From this view you can check the status of existing supervisors as well as suspend, resume, and reset them.
The tasks table allows you to see the currently running and recently completed tasks.
To make managing a lot of tasks easier, you can group them by `Type`, `Datasource`, or `Status`.
![supervisors](../assets/web-console-07-supervisors.png)
Click on the magnifying glass for any supervisor to see detailed reports of its progress.
![supervisor-status](../assets/web-console-08-supervisor-status.png)
Click on the magnifying glass for any task to see more detail about it.
![tasks-status](../assets/web-console-09-task-status.png)
## Servers
The servers tab lets you see the current status of the nodes making up your cluster.
You can group the nodes by type or by tier to get meaningful summary statistics.
![servers](../assets/web-console-10-servers.png)
## Query
The query view lets you issue [DruidSQL](../querying/sql.md) queries and display the results as a table.
The view will attempt to infer your query and let you modify it via contextual actions, such as adding filters and changing the sort order, when possible.
![query-sql](../assets/web-console-11-query-sql.png)
The query view can also issue queries in Druid's [native query format](../querying/querying.md), which is JSON over HTTP.
To send a native Druid query, you must start your query with `{` and format it as JSON.
![query-rune](../assets/web-console-12-query-rune.png)
## Lookups
You can create and edit query time lookups via the lookup view.
![lookups](../assets/web-console-13-lookups.png)
operations/dump-segment.md
---
id: dump-segment
title: "dump-segment tool"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
The DumpSegment tool can be used to dump the metadata or contents of an Apache Druid segment for debugging purposes. Note that the
dump is not necessarily a full-fidelity translation of the segment. In particular, not all metadata is included, and
complex metric values may not be complete.
To run the tool, point it at a segment directory and provide a file for writing output:
```
java -classpath "/my/druid/lib/*" -Ddruid.extensions.loadList="[]" org.apache.druid.cli.Main \
tools dump-segment \
--directory /home/druid/path/to/segment/ \
--out /home/druid/output.txt
```
### Output format
#### Data dumps
By default, or with `--dump rows`, this tool dumps rows of the segment as newline-separated JSON objects, with one
object per line, using the default serialization for each column. Normally all columns are included, but if you like,
you can limit the dump to specific columns with `--column name`.
For example, one line might look like this when pretty-printed:
```
{
"__time": 1442018818771,
"added": 36,
"channel": "#en.wikipedia",
"cityName": null,
"comment": "added project",
"count": 1,
"countryIsoCode": null,
"countryName": null,
"deleted": 0,
"delta": 36,
"isAnonymous": "false",
"isMinor": "false",
"isNew": "false",
"isRobot": "false",
"isUnpatrolled": "false",
"iuser": "00001553",
"metroCode": null,
"namespace": "Talk",
"page": "Talk:Oswald Tilghman",
"regionIsoCode": null,
"regionName": null,
"user": "GELongstreet"
}
```
#### Metadata dumps
With `--dump metadata`, this tool dumps metadata instead of rows. Metadata dumps generated by this tool are in the same
format as returned by the [SegmentMetadata query](../querying/segmentmetadataquery.md).
#### Bitmap dumps
With `--dump bitmaps`, this tool dumps bitmap indexes instead of rows. Bitmap dumps generated by this tool include
dictionary-encoded string columns only. The output contains a field "bitmapSerdeFactory" describing the type of bitmaps
used in the segment, and a field "bitmaps" containing the bitmaps for each value of each column. These are base64
encoded by default, but you can also dump them as lists of row numbers with `--decompress-bitmaps`.
Normally all columns are included, but if you like, you can limit the dump to specific columns with `--column name`.
Sample output:
```
{
"bitmapSerdeFactory": {
"type": "roaring",
"compressRunOnSerialization": true
},
"bitmaps": {
"isRobot": {
"false": "//aExfu+Nv3X...",
"true": "gAl7OoRByQ..."
}
}
}
```
### Command line arguments
|argument|description|required?|
|--------|-----------|---------|
|--directory file|Directory containing segment data. This could be generated by unzipping an "index.zip" from deep storage.|yes|
|--out file|File to write to, or omit to write to stdout.|no|
|--dump TYPE|Dump either 'rows' (default), 'metadata', or 'bitmaps'|no|
|--column columnName|Column to include. Specify multiple times for multiple columns, or omit to include all columns.|no|
|--filter json|JSON-encoded [query filter](../querying/filters.md). Omit to include all rows. Only used if dumping rows.|no|
|--time-iso8601|Format __time column in ISO8601 format rather than long. Only used if dumping rows.|no|
|--decompress-bitmaps|Dump bitmaps as arrays rather than base64-encoded compressed bitmaps. Only used if dumping bitmaps.|no|
---
id: dynamic-config-provider
title: "Dynamic Config Providers"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
A dynamic config provider is Druid's core mechanism for supplying multiple related sets of credentials, secrets, and configurations via the Druid extension mechanism. Currently, it is only supported for providing Kafka consumer configuration in [Kafka Ingestion](../development/extensions-core/kafka-ingestion.md).
Eventually this will replace [PasswordProvider](./password-provider.md).
Users can create custom extension of the `DynamicConfigProvider` interface that is registered at Druid process startup.
For more information, see [Adding a new DynamicConfigProvider implementation](../development/modules.md#adding-a-new-dynamicconfigprovider-implementation).
---
id: export-metadata
title: "Export Metadata Tool"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Druid includes an `export-metadata` tool for assisting with migration of cluster metadata and deep storage.
This tool exports the contents of the following Druid metadata tables:
- segments
- rules
- config
- datasource
- supervisors
Additionally, the tool can rewrite the local deep storage location descriptors in the rows of the segments table
to point to new deep storage locations (S3, HDFS, and local rewrite paths are supported).
The tool has the following limitations:
- Only exporting from Derby metadata is currently supported
- If rewriting load specs for deep storage migration, only migrating from local deep storage is currently supported.
## `export-metadata` Options
The `export-metadata` tool provides the following options:
### Connection Properties
- `--connectURI`: The URI of the Derby database, e.g. `jdbc:derby://localhost:1527/var/druid/metadata.db;create=true`
- `--user`: Username
- `--password`: Password
- `--base`: corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.
### Output Path
- `--output-path`, `-o`: The output directory of the tool. CSV files for the Druid segments, rules, config, datasource, and supervisors tables will be written to this directory.
### Export Format Options
- `--use-hex-blobs`, `-x`: If set, export BLOB payload columns as hexadecimal strings. This needs to be set if importing back into Derby. Default is false.
- `--booleans-as-strings`, `-t`: If set, write boolean values as "true" or "false" instead of "1" and "0". This needs to be set if importing back into Derby. Default is false.
### Deep Storage Migration
#### Migration to S3 Deep Storage
By setting the options below, the tool will rewrite the segment load specs to point to a new S3 deep storage location.
This helps users migrate segments stored in local deep storage to S3.
- `--s3bucket`, `-b`: The S3 bucket that will hold the migrated segments
- `--s3baseKey`, `-k`: The base S3 key where the migrated segments will be stored
When copying the local deep storage segments to S3, the rewrite performed by this tool requires that the directory structure of the segments be unchanged.
For example, if the cluster had the following local deep storage configuration:
```
druid.storage.type=local
druid.storage.storageDirectory=/druid/segments
```
If the target S3 bucket was `migration`, with a base key of `example`, the contents of `s3://migration/example/` must be identical to that of `/druid/segments` on the old local filesystem.
#### Migration to HDFS Deep Storage
By setting the options below, the tool will rewrite the segment load specs to point to a new HDFS deep storage location.
This helps users migrate segments stored in local deep storage to HDFS.
`--hadoopStorageDirectory`, `-h`: The HDFS path that will hold the migrated segments
When copying the local deep storage segments to HDFS, the rewrite performed by this tool requires that the directory structure of the segments be unchanged, with the exception of directory names containing colons (`:`).
For example, if the cluster had the following local deep storage configuration:
```
druid.storage.type=local
druid.storage.storageDirectory=/druid/segments
```
If the target hadoopStorageDirectory was `/migration/example`, the contents of `hdfs:///migration/example/` must be identical to that of `/druid/segments` on the old local filesystem.
Additionally, segment paths in local deep storage contain colons (`:`) in their names, e.g.:
`wikipedia/2016-06-27T02:00:00.000Z_2016-06-27T03:00:00.000Z/2019-05-03T21:57:15.950Z/1/index.zip`
HDFS cannot store files containing colons, and this tool expects the colons to be replaced with underscores (`_`) in HDFS.
In this example, the `wikipedia` segment above under `/druid/segments` in local deep storage would need to be migrated to HDFS under `hdfs:///migration/example/` with the following path:
`wikipedia/2016-06-27T02_00_00.000Z_2016-06-27T03_00_00.000Z/2019-05-03T21_57_15.950Z/1/index.zip`
#### Migration to New Local Deep Storage Path
By setting the options below, the tool will rewrite the segment load specs to point to a new local deep storage location.
This helps users migrate segments stored in local deep storage to a new path (e.g., a new NFS mount).
- `--newLocalPath`, `-n`: The new path on the local filesystem that will hold the migrated segments
When copying the local deep storage segments to a new path, the rewrite performed by this tool requires that the directory structure of the segments be unchanged.
For example, if the cluster had the following local deep storage configuration:
```
druid.storage.type=local
druid.storage.storageDirectory=/druid/segments
```
If the new path was `/migration/example`, the contents of `/migration/example/` must be identical to that of `/druid/segments` on the local filesystem.
## Running the tool
To use the tool, you can run the following from the root of the Druid package:
```bash
cd ${DRUID_ROOT}
mkdir -p /tmp/csv
java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log4j2.xml -Ddruid.extensions.directory="extensions" -Ddruid.extensions.loadList=[] org.apache.druid.cli.Main tools export-metadata --connectURI "jdbc:derby://localhost:1527/var/druid/metadata.db;" -o /tmp/csv
```
In the example command above:
- `lib` is the Druid lib directory
- `extensions` is the Druid extensions directory
- `/tmp/csv` is the output directory. Please make sure that this directory exists.
## Importing Metadata
After running the tool, the output directory will contain `<table-name>_raw.csv` and `<table-name>.csv` files.
The `<table-name>_raw.csv` files are intermediate files used by the tool, containing the table data as exported by Derby without modification.
The `<table-name>.csv` files are used for import into another database such as MySQL or PostgreSQL and have any configured deep storage location rewrites applied.
Example import commands for Derby, MySQL, and PostgreSQL are shown below.
These example import commands expect `/tmp/csv` and its contents to be accessible from the server. For other options, such as importing from the client filesystem, please refer to the database's documentation.
### Derby
```sql
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_SEGMENTS','/tmp/csv/druid_segments.csv',',','"',null,0);
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_RULES','/tmp/csv/druid_rules.csv',',','"',null,0);
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_CONFIG','/tmp/csv/druid_config.csv',',','"',null,0);
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_DATASOURCE','/tmp/csv/druid_dataSource.csv',',','"',null,0);
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_SUPERVISORS','/tmp/csv/druid_supervisors.csv',',','"',null,0);
```
### MySQL
```sql
LOAD DATA INFILE '/tmp/csv/druid_segments.csv' INTO TABLE druid_segments FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (id,dataSource,created_date,start,end,partitioned,version,used,payload); SHOW WARNINGS;
LOAD DATA INFILE '/tmp/csv/druid_rules.csv' INTO TABLE druid_rules FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (id,dataSource,version,payload); SHOW WARNINGS;
LOAD DATA INFILE '/tmp/csv/druid_config.csv' INTO TABLE druid_config FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (name,payload); SHOW WARNINGS;
LOAD DATA INFILE '/tmp/csv/druid_dataSource.csv' INTO TABLE druid_dataSource FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (dataSource,created_date,commit_metadata_payload,commit_metadata_sha1); SHOW WARNINGS;
LOAD DATA INFILE '/tmp/csv/druid_supervisors.csv' INTO TABLE druid_supervisors FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (id,spec_id,created_date,payload); SHOW WARNINGS;
```
### PostgreSQL
```sql
COPY druid_segments(id,dataSource,created_date,start,"end",partitioned,version,used,payload) FROM '/tmp/csv/druid_segments.csv' DELIMITER ',' CSV;
COPY druid_rules(id,dataSource,version,payload) FROM '/tmp/csv/druid_rules.csv' DELIMITER ',' CSV;
COPY druid_config(name,payload) FROM '/tmp/csv/druid_config.csv' DELIMITER ',' CSV;
COPY druid_dataSource(dataSource,created_date,commit_metadata_payload,commit_metadata_sha1) FROM '/tmp/csv/druid_dataSource.csv' DELIMITER ',' CSV;
COPY druid_supervisors(id,spec_id,created_date,payload) FROM '/tmp/csv/druid_supervisors.csv' DELIMITER ',' CSV;
```
---
id: getting-started
title: "Getting started with Apache Druid"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## Overview
If you are new to Druid, we recommend reading the [Design Overview](../design/index.md) and the [Ingestion Overview](../ingestion/index.md) first for a basic understanding of Druid.
## Single-server Quickstart and Tutorials
To get started with running Druid, the simplest and quickest way is to try the [single-server quickstart and tutorials](../tutorials/index.md).
## Deploying a Druid cluster
If you wish to jump straight to deploying Druid as a cluster, or if you have an existing single-server deployment that you wish to migrate to a clustered deployment, please see the [Clustered Deployment Guide](../tutorials/cluster.md).
## Operating Druid
The [configuration reference](../configuration/index.md) describes all of Druid's configuration properties.
The [API reference](../operations/api-reference.md) describes the APIs available on each Druid process.
The [basic cluster tuning guide](../operations/basic-cluster-tuning.md) is an introductory guide for tuning your Druid cluster.
## Need help with Druid?
If you have questions about using Druid, please reach out to the [Druid user mailing list or other community channels](https://druid.apache.org/community/)!
---
id: high-availability
title: "High availability"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
For a highly available Druid environment, you should provide redundancy for Apache ZooKeeper, the metadata store, the Coordinators, the Overlords, and the Brokers.
- For highly-available ZooKeeper, you will need a cluster of 3 or 5 ZooKeeper nodes.
We recommend either installing ZooKeeper on its own hardware, or running 3 or 5 Master servers (where overlords or coordinators are running)
and configuring ZooKeeper on them appropriately. See the [ZooKeeper admin guide](https://zookeeper.apache.org/doc/current/zookeeperAdmin) for more details.
- For highly-available metadata storage, we recommend MySQL or PostgreSQL with replication and failover enabled.
See [MySQL HA/Scalability Guide](https://dev.mysql.com/doc/mysql-ha-scalability/en/)
and [PostgreSQL's High Availability, Load Balancing, and Replication](https://www.postgresql.org/docs/current/high-availability.html) for MySQL and PostgreSQL, respectively.
- For highly-available Apache Druid Coordinators and Overlords, we recommend running multiple servers.
If they are all configured to use the same ZooKeeper cluster and metadata storage,
then they will automatically failover between each other as necessary.
Only one will be active at a time, but inactive servers will redirect to the currently active server.
- Druid Brokers can be scaled out and all running servers will be active and queryable.
We recommend placing them behind a load balancer.
---
id: http-compression
title: "HTTP compression"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Apache Druid supports HTTP request decompression and response compression. To use these, set the HTTP request headers `Content-Encoding: gzip` and `Accept-Encoding: gzip`, respectively.
|Property|Description|Default|
|--------|-----------|-------|
|`druid.server.http.compressionLevel`|The compression level. Value should be between [-1,9], -1 for default level, 0 for no compression.|-1 (default compression level)|
|`druid.server.http.inflateBufferSize`|The buffer size used by gzip decoder. Set to 0 to disable request decompression.|4096|
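For example, assuming a Broker listening on the default plaintext port 8082 and a hypothetical native query file `query.json`, both directions can be exercised with curl (a sketch):
```bash
# Response compression: --compressed makes curl send Accept-Encoding: gzip
# and transparently decompress the gzip-encoded response body.
curl --compressed -H 'Content-Type: application/json' \
  --data-binary @query.json 'http://localhost:8082/druid/v2/'

# Request decompression: send a gzip-compressed request body.
gzip -k query.json
curl -H 'Content-Type: application/json' -H 'Content-Encoding: gzip' \
  --data-binary @query.json.gz 'http://localhost:8082/druid/v2/'
```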
---
id: insert-segment-to-db
title: "insert-segment-to-db tool"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
In older versions of Apache Druid, `insert-segment-to-db` was a tool that could scan deep storage and
insert data from there into Druid metadata storage. It was intended to be used to update the segment table in the
metadata storage after manually migrating segments from one place to another, or even to recover lost metadata storage
by telling it where the segments are stored.
In Druid 0.14.x and earlier, Druid wrote segment metadata to two places: the metadata store's `druid_segments` table, and
`descriptor.json` files in deep storage. This practice was stopped in Druid 0.15.0 as part of
[consolidated metadata management](https://github.com/apache/druid/issues/6849), for the following reasons:
1. If any segments are manually dropped or re-enabled by cluster operators, this information is not reflected in
deep storage. Restoring metadata from deep storage would undo any such drops or re-enables.
2. Ingestion methods that allocate segments optimistically (such as native Kafka or Kinesis stream ingestion, or native
batch ingestion in 'append' mode) can write segments to deep storage that are not meant to actually be used by the
Druid cluster. There is no way, while purely looking at deep storage, to differentiate the segments that made it into
the metadata store originally (and therefore _should_ be used) from the segments that did not (and therefore
_should not_ be used).
3. Nothing in Druid other than the `insert-segment-to-db` tool read the `descriptor.json` files.
After this change, Druid stopped writing `descriptor.json` files to deep storage, and now only writes segment metadata
to the metadata store. This meant the `insert-segment-to-db` tool was no longer useful, so it was removed in Druid 0.15.0.
It is highly recommended that you take regular backups of your metadata store, since it is difficult to recover Druid
clusters properly without it.
---
id: kubernetes
title: "kubernetes"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
The Apache Druid distribution is also available as a [Docker](https://www.docker.com/) image from [Docker Hub](https://hub.docker.com/r/apache/druid). For example, you can obtain release 0.16.0-incubating using the command below.
```
$ docker pull apache/druid:0.16.0-incubating
```
[druid-operator](https://github.com/druid-io/druid-operator) can be used to manage a Druid cluster on [Kubernetes](https://kubernetes.io/).
Druid clusters deployed on Kubernetes can function without ZooKeeper using [druid-kubernetes-extensions](../development/extensions-core/kubernetes.md).
---
id: management-uis
title: "Legacy Management UIs"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## Legacy consoles
Druid provides a console for managing datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
For more information on the Druid Console, have a look at the [Druid Console overview](./druid-console.md).
The Druid Console contains all of the functionality provided by the older consoles described below, which are still available if needed. The legacy consoles may be replaced by the Druid Console in the future.
These older consoles provide a subset of the functionality of the Druid Console. We recommend using the Druid Console if possible.
### Coordinator consoles
#### Version 2
The Druid Coordinator exposes a web console for displaying cluster information and rule configuration. After the Coordinator starts, the console can be accessed at:
```
http://<COORDINATOR_IP>:<COORDINATOR_PORT>
```
There exists a full cluster view (which shows indexing tasks and Historical processes), as well as views for individual Historical processes, datasources and segments themselves. Segment information can be displayed in raw JSON form or as part of a sortable and filterable table.
The Coordinator console also exposes an interface for creating and editing rules. All valid datasources configured in the segment database, along with a default datasource, are available for configuration. Rules of different types can be added, deleted or edited.
#### Version 1
The oldest version of Druid's Coordinator console is still available for backwards compatibility at:
```
http://<COORDINATOR_IP>:<COORDINATOR_PORT>/old-console
```
### Overlord console
The Overlord console can be used to view pending tasks, running tasks, available workers, and recent worker creation and termination. The console can be accessed at:
```
http://<OVERLORD_IP>:<OVERLORD_PORT>/console.html
```
---
id: password-provider
title: "Password providers"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Passwords help secure Apache Druid systems such as the metadata store and the keystore that contains server certificates.
These passwords have corresponding runtime properties associated with them, for example `druid.metadata.storage.connector.password` corresponds to the metadata store password.
By default, users can directly set the passwords in plaintext for runtime properties. For example, `druid.metadata.storage.connector.password=pwd` sets the password used by Druid to connect to the metadata store to `pwd`. Alternatively, users can set passwords as environment variables.
Environment variable passwords allow users to avoid exposing passwords in the `runtime.properties` file.
You can set an environment variable password as in the following example:
```json
druid.metadata.storage.connector.password={ "type": "environment", "variable": "METADATA_STORAGE_PASSWORD" }
```
The values are described below.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`type`|String|password provider type|Yes: `environment`|
|`variable`|String|environment variable to read password from|Yes|
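Note that the referenced variable must be present in the environment of every Druid process that reads the property, for example (a sketch; the value is a placeholder):
```bash
# Hypothetical: export the password before starting the Druid process
export METADATA_STORAGE_PASSWORD='replace-with-real-password'
```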
Another option that provides even greater control is to securely fetch passwords at runtime using a custom extension of the `PasswordProvider` interface that is registered at Druid process startup.
For more information, see [Adding a new Password Provider implementation](../development/modules.md#adding-a-new-password-provider-implementation).
To use this implementation, set the relevant password runtime property in the same way as shown above for the environment variable password:
```json
druid.metadata.storage.connector.password={ "type": "<registered_password_provider_name>", "<jackson_property>": "<value>", ... }
```
---
id: pull-deps
title: "pull-deps tool"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
`pull-deps` is an Apache Druid tool that can pull down dependencies to the local repository and lay dependencies out into the extension directory as needed.
`pull-deps` has several command line options:
`-c` or `--coordinate` (Can be specified multiple times)
Extension coordinate to pull down, followed by a maven coordinate, e.g. org.apache.druid.extensions:mysql-metadata-storage
`-h` or `--hadoop-coordinate` (Can be specified multiple times)
Apache Hadoop dependency to pull down, followed by a maven coordinate, e.g. org.apache.hadoop:hadoop-client:2.4.0
`--no-default-hadoop`
Don't pull down the default hadoop coordinate, i.e., org.apache.hadoop:hadoop-client:2.3.0. If the `-h` option is supplied, the default hadoop coordinate will not be downloaded.
`--clean`
Remove existing extension and hadoop dependencies directories before pulling down dependencies.
`-l` or `--localRepository`
A local repository where Maven will put downloaded files. `pull-deps` will then lay these files out into the extensions directory as needed.
`-r` or `--remoteRepository`
Add a remote repository. Unless `--no-default-remote-repositories` is provided, these will be used after https://repo1.maven.org/maven2/ and http://metamx.artifactoryonline.com/metamx/pub-libs-releases-local
`--no-default-remote-repositories`
Don't use the default remote repositories; only use the repositories provided directly via `--remoteRepository`.
`-d` or `--defaultVersion`
Version to use for an extension coordinate that doesn't have version information. For example, if the extension coordinate is `org.apache.druid.extensions:mysql-metadata-storage` and the default version is `{{DRUIDVERSION}}`, then this coordinate will be treated as `org.apache.druid.extensions:mysql-metadata-storage:{{DRUIDVERSION}}`.
`--use-proxy`
Use an http/https proxy when sending requests to the remote repository servers. `--proxy-host` and `--proxy-port` must be set explicitly if this option is enabled.
`--proxy-type`
Set the proxy type; should be either *http* or *https*. The default value is *https*.
`--proxy-host`
Set the proxy host, e.g. proxy.com.
`--proxy-port`
Set the proxy port number, e.g. 8080.
`--proxy-username`
Set a username to connect to the proxy; this option is only required if the proxy server uses authentication.
`--proxy-password`
Set a password to connect to the proxy; this option is only required if the proxy server uses authentication.
To run `pull-deps`, you should
1) Specify `druid.extensions.directory` and `druid.extensions.hadoopDependenciesDir`; these two properties tell `pull-deps` where to put extensions. If you don't specify them, default values will be used; see [Configuration](../configuration/index.md).
2) Tell `pull-deps` what to download using the `-c` or `-h` options, each followed by a maven coordinate.
Example:
Suppose you want to download `mysql-metadata-storage` with a specific version, as well as `hadoop-client` (both 2.3.0 and 2.4.0). You can run `pull-deps` with `-c org.apache.druid.extensions:mysql-metadata-storage:{{DRUIDVERSION}}`, `-h org.apache.hadoop:hadoop-client:2.3.0` and `-h org.apache.hadoop:hadoop-client:2.4.0`:
```
java -classpath "/my/druid/lib/*" org.apache.druid.cli.Main tools pull-deps --clean -c org.apache.druid.extensions:mysql-metadata-storage:{{DRUIDVERSION}} -h org.apache.hadoop:hadoop-client:2.3.0 -h org.apache.hadoop:hadoop-client:2.4.0
```
Because `--clean` is supplied, this command will first remove the directories specified by `druid.extensions.directory` and `druid.extensions.hadoopDependenciesDir`, then recreate them and download the extensions there. After the downloads finish, the extension directories you specified will look like this:
```
tree extensions
extensions
└── mysql-metadata-storage
└── mysql-metadata-storage-{{DRUIDVERSION}}.jar
```
```
tree hadoop-dependencies
hadoop-dependencies/
└── hadoop-client
├── 2.3.0
│   ├── activation-1.1.jar
│   ├── avro-1.7.4.jar
│   ├── commons-beanutils-1.7.0.jar
│   ├── commons-beanutils-core-1.8.0.jar
│   ├── commons-cli-1.2.jar
│   ├── commons-codec-1.4.jar
..... lots of jars
└── 2.4.0
├── activation-1.1.jar
├── avro-1.7.4.jar
├── commons-beanutils-1.7.0.jar
├── commons-beanutils-core-1.8.0.jar
├── commons-cli-1.2.jar
├── commons-codec-1.4.jar
..... lots of jars
```
Note that if you specify `--defaultVersion`, you don't have to put version information in the coordinate. For example, if you want `mysql-metadata-storage` to use version `{{DRUIDVERSION}}`, you can change the command above to
```
java -classpath "/my/druid/lib/*" org.apache.druid.cli.Main tools pull-deps --defaultVersion {{DRUIDVERSION}} --clean -c org.apache.druid.extensions:mysql-metadata-storage -h org.apache.hadoop:hadoop-client:2.3.0 -h org.apache.hadoop:hadoop-client:2.4.0
```
> Please note to use the pull-deps tool you must know the Maven groupId, artifactId, and version of your extension.
>
> For Druid community extensions listed [here](../development/extensions.md), the groupId is "org.apache.druid.extensions.contrib" and the artifactId is the name of the extension.
---
id: reset-cluster
title: "reset-cluster tool"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
The `reset-cluster` tool can be used to completely wipe out Apache Druid cluster state stored in metadata and deep storage. This is
intended to be used in dev/test environments where you typically want to reset the cluster before running
the test suite.
`reset-cluster` automatically figures out the necessary information from the Druid cluster configuration, so the java classpath
used in the command must include all the necessary Druid configuration files.
It can be run in one of the following ways.
```
java -classpath "/my/druid/lib/*" -Ddruid.extensions.loadList="[]" org.apache.druid.cli.Main \
tools reset-cluster \
[--metadataStore] \
[--segmentFiles] \
[--taskLogs] \
[--hadoopWorkingPath]
```
or
```
java -classpath "/my/druid/lib/*" -Ddruid.extensions.loadList="[]" org.apache.druid.cli.Main \
tools reset-cluster \
--all
```
Usage documentation can be printed by running the following command.
```
$ java -classpath "/my/druid/lib/*" -Ddruid.extensions.loadList="[]" org.apache.druid.cli.Main help tools reset-cluster
NAME
druid tools reset-cluster - Cleanup all persisted state from metadata
and deep storage.
SYNOPSIS
druid tools reset-cluster [--all] [--hadoopWorkingPath]
[--metadataStore] [--segmentFiles] [--taskLogs]
OPTIONS
--all
delete all state stored in metadata and deep storage
--hadoopWorkingPath
delete hadoopWorkingPath
--metadataStore
delete all records in metadata storage
--segmentFiles
delete all segment files from deep storage
--taskLogs
delete all tasklogs
```
---
id: rolling-updates
title: "Rolling updates"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
For rolling Apache Druid cluster updates with no downtime, we recommend updating Druid processes in the
following order:
1. Historical
2. \*Overlord (if any)
3. \*Middle Manager/Indexers (if any)
4. Standalone Real-time (if any)
5. Broker
6. Coordinator ( or merged Coordinator+Overlord )
For information about the latest release, see [Druid releases](https://github.com/apache/druid/releases).
\* In 0.12.0, there are protocol changes between the Kafka supervisor and the Kafka indexing tasks, as well as changes to the metadata formats persisted on disk. Therefore, to support a rolling upgrade, all Middle Managers need to be upgraded before the Overlord. Note that this ordering differs from the standard upgrade order, and that it is only necessary when using the Kafka Indexing Service. If you are not using the Kafka Indexing Service, or can tolerate downtime for the Kafka supervisor, you can upgrade in any order.
## Historical
Historical processes can be updated one at a time. Each Historical process has a startup time to memory map
all the segments it was serving before the update. The startup time typically takes a few seconds to
a few minutes, depending on the hardware of the host. As long as each Historical process is updated
with a sufficient delay (greater than the time required to start a single process), you can perform a
rolling update across the entire Historical cluster.
## Overlord
Overlord processes can be updated one at a time in a rolling fashion.
## Middle Managers/Indexers
Middle Managers or Indexer nodes run both batch and real-time indexing tasks. Generally you want to update Middle
Managers in such a way that real-time indexing tasks do not fail. There are three strategies for
doing that.
### Rolling restart (restore-based)
Middle Managers can be updated one at a time in a rolling fashion when you set
`druid.indexer.task.restoreTasksOnRestart=true`. In this case, indexing tasks that support restoring
will restore their state on Middle Manager restart, and will not fail.
Currently, only realtime tasks support restoring, so non-realtime indexing tasks will fail and will
need to be resubmitted.
### Rolling restart (graceful-termination-based)
Middle Managers can be gracefully terminated using the "disable" API. This works for all task types,
even tasks that are not restorable.
To prepare a Middle Manager for update, send a POST request to
`<MiddleManager_IP:PORT>/druid/worker/v1/disable`. The Overlord will now no longer send tasks to
this Middle Manager. Tasks that have already started will run to completion. Current state can be checked
using `<MiddleManager_IP:PORT>/druid/worker/v1/enabled`.
To view all existing tasks, send a GET request to `<MiddleManager_IP:PORT>/druid/worker/v1/tasks`.
When this list is empty, you can safely update the Middle Manager. After the Middle Manager starts
back up, it is automatically enabled again. You can also manually enable Middle Managers by POSTing
to `<MiddleManager_IP:PORT>/druid/worker/v1/enable`.
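Putting these endpoints together, a graceful update of one Middle Manager might look like the following (a sketch; `MM_HOST:8091` is a placeholder for your Middle Manager's host and default port):
```bash
# Tell the Overlord to stop assigning new tasks to this Middle Manager
curl -X POST http://MM_HOST:8091/druid/worker/v1/disable

# Confirm the worker is now disabled
curl http://MM_HOST:8091/druid/worker/v1/enabled

# Poll until the task list is empty, then update and restart the process
curl http://MM_HOST:8091/druid/worker/v1/tasks

# Optional: re-enable manually (restarting also re-enables automatically)
curl -X POST http://MM_HOST:8091/druid/worker/v1/enable
```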
### Autoscaling-based replacement
If autoscaling is enabled on your Overlord, then Overlord processes can launch new Middle Manager processes
en masse and then gracefully terminate old ones as their tasks finish. This process is configured by
setting `druid.indexer.runner.minWorkerVersion=#{VERSION}`. Each time you update your Overlord process,
the `VERSION` value should be increased, which will trigger a mass launch of new Middle Managers.
The config `druid.indexer.autoscale.workerVersion=#{VERSION}` also needs to be set.
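For example, both properties might be bumped to the same new value on each update (a sketch with a hypothetical version string):
```
druid.indexer.runner.minWorkerVersion=2021-08-05-1
druid.indexer.autoscale.workerVersion=2021-08-05-1
```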
## Standalone Real-time
Standalone real-time processes can be updated one at a time in a rolling fashion.
## Broker
Broker processes can be updated one at a time in a rolling fashion. There needs to be some delay between
updating each process as Brokers must load the entire state of the cluster before they return valid
results.
## Coordinator
Coordinator processes can be updated one at a time in a rolling fashion.
---
id: rule-configuration
title: "Retaining or automatically dropping data"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
In Apache Druid, Coordinator processes use rules to determine what data should be loaded to or dropped from the cluster. Rules are used for data retention and query execution, and are set on the Coordinator console (http://coordinator_ip:port).
There are three types of rules: load rules, drop rules, and broadcast rules. Load rules indicate how segments should be assigned to different historical process tiers and how many replicas of a segment should exist in each tier.
Drop rules indicate when segments should be dropped entirely from the cluster. Finally, broadcast rules indicate how segments of different datasources should be co-located in Historical processes.
The Coordinator loads a set of rules from the metadata storage. Rules may be specific to a certain datasource and/or a
default set of rules can be configured. Rules are read in order and hence the ordering of rules is important. The
Coordinator will cycle through all used segments and match each segment with the first rule that applies. Each segment
may only match a single rule.
Note: It is recommended that the Coordinator console is used to configure rules. However, the Coordinator process does have HTTP endpoints to programmatically configure rules.
## Load rules
Load rules indicate how many replicas of a segment should exist in a server tier. **Please note**: If a Load rule is used to retain only data from a certain interval or period, it must be accompanied by a Drop rule. If a Drop rule is not included, data not within the specified interval or period will be retained by the default rule (loadForever).
### Forever Load Rule
Forever load rules are of the form:
```json
{
"type" : "loadForever",
"tieredReplicants": {
"hot": 1,
"_default_tier" : 1
}
}
```
* `type` - this should always be "loadForever"
* `tieredReplicants` - A JSON Object where the keys are the tier names and values are the number of replicas for that tier.
### Interval Load Rule
Interval load rules are of the form:
```json
{
"type" : "loadByInterval",
"interval": "2012-01-01/2013-01-01",
"tieredReplicants": {
"hot": 1,
"_default_tier" : 1
}
}
```
* `type` - this should always be "loadByInterval"
* `interval` - A JSON Object representing ISO-8601 Intervals
* `tieredReplicants` - A JSON Object where the keys are the tier names and values are the number of replicas for that tier.
### Period Load Rule
Period load rules are of the form:
```json
{
"type" : "loadByPeriod",
"period" : "P1M",
"includeFuture" : true,
"tieredReplicants": {
"hot": 1,
"_default_tier" : 1
}
}
```
* `type` - this should always be "loadByPeriod"
* `period` - A JSON Object representing ISO-8601 Periods
* `includeFuture` - A JSON Boolean indicating whether the load period should include the future. This property is optional; the default is true.
* `tieredReplicants` - A JSON Object where the keys are the tier names and values are the number of replicas for that tier.
The interval of a segment will be compared against the specified period. The period extends from some time in the past to the future or to the current time, depending on whether `includeFuture` is true or false. The rule matches if the period *overlaps* the interval.
## Drop Rules
Drop rules indicate when segments should be dropped from the cluster.
### Forever Drop Rule
Forever drop rules are of the form:
```json
{
"type" : "dropForever"
}
```
* `type` - this should always be "dropForever"
All segments that match this rule are dropped from the cluster.
### Interval Drop Rule
Interval drop rules are of the form:
```json
{
"type" : "dropByInterval",
"interval" : "2012-01-01/2013-01-01"
}
```
* `type` - this should always be "dropByInterval"
* `interval` - A JSON Object representing ISO-8601 Intervals
A segment is dropped if the rule's interval contains the segment's interval.
### Period Drop Rule
Period drop rules are of the form:
```json
{
"type" : "dropByPeriod",
"period" : "P1M",
"includeFuture" : true
}
```
* `type` - this should always be "dropByPeriod"
* `period` - A JSON Object representing ISO-8601 Periods
* `includeFuture` - A JSON Boolean indicating whether the drop period should include the future. This property is optional; the default is true.
The interval of a segment will be compared against the specified period. The period extends from some time in the past to the future or to the current time, depending on whether `includeFuture` is true or false. The rule matches if the period *contains* the interval. Because the period always extends up to the present, this rule always drops recent data.
### Period Drop Before Rule
Period drop before rules are of the form:
```json
{
"type" : "dropBeforeByPeriod",
"period" : "P1M"
}
```
* `type` - this should always be "dropBeforeByPeriod"
* `period` - A JSON Object representing ISO-8601 Periods
The interval of a segment will be compared against the specified period. The period is from some time in the past to the current time. The rule matches if the segment's interval is before the period. If you just want to retain recent data, you can use this rule to drop data older than a specified period, followed by a `loadForever` rule. Note that `dropBeforeByPeriod + loadForever` is equivalent to `loadByPeriod(includeFuture = true) + dropForever`.
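For example, to retain only the most recent month of data in the default tier, the first form of that rule chain might look like this (a sketch; remember that segments match the first applicable rule):
```json
[
  {
    "type" : "dropBeforeByPeriod",
    "period" : "P1M"
  },
  {
    "type" : "loadForever",
    "tieredReplicants": {
      "_default_tier" : 1
    }
  }
]
```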
## Broadcast Rules
Broadcast rules indicate that segments of a data source should be loaded by all servers of a cluster of the following types: historicals, brokers, tasks, and indexers.
Note that the broadcast segments are only directly queryable through the historicals, but they are currently loaded on other server types to support join queries.
### Forever Broadcast Rule
Forever broadcast rules are of the form:
```json
{
"type" : "broadcastForever"
}
```
* `type` - this should always be "broadcastForever"
This rule applies to all segments of a datasource, covering all intervals.
### Interval Broadcast Rule
Interval broadcast rules are of the form:
```json
{
"type" : "broadcastByInterval",
"interval" : "2012-01-01/2013-01-01"
}
```
* `type` - this should always be "broadcastByInterval"
* `interval` - A JSON Object representing ISO-8601 Intervals. Only segments within the interval will be broadcast.
### Period Broadcast Rule
Period broadcast rules are of the form:
```json
{
"type" : "broadcastByPeriod",
"period" : "P1M",
"includeFuture" : true
}
```
* `type` - this should always be "broadcastByPeriod"
* `period` - A JSON Object representing ISO-8601 Periods
* `includeFuture` - A JSON Boolean indicating whether the broadcast period should include the future. This property is optional; the default is true.
The interval of a segment will be compared against the specified period. The period extends from some time in the past to the future or to the current time, depending on whether `includeFuture` is true or false. The rule matches if the period *overlaps* the interval.
## Permanently deleting data
Druid can fully drop data from the cluster, wipe the metadata store entry, and remove the data from deep storage for any
segments that are marked as unused (segments dropped from the cluster via rules are always marked as unused). You can
submit a [kill task](../ingestion/tasks.md) to the [Overlord](../design/overlord.md) to do this.
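For example, a kill task spec is a small JSON payload that can be submitted to the Overlord's task endpoint (a sketch; the host, datasource, and interval are placeholders):
```bash
curl -X POST -H 'Content-Type: application/json' \
  -d '{"type": "kill", "dataSource": "wikipedia", "interval": "2012-01-01/2013-01-01"}' \
  http://OVERLORD_IP:8090/druid/indexer/v1/task
```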
## Reloading dropped data
Data that has been dropped from a Druid cluster cannot be reloaded using only rules. To reload dropped data in Druid,
you must first extend your retention period (e.g., changing the retention period from 1 month to 2 months), and then mark as
used all segments belonging to the datasource in the Druid Coordinator console, or through the Druid Coordinator
endpoints.
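For example, after extending the retention period, all segments of a datasource can be marked as used through the Coordinator API (a sketch; the host and datasource are placeholders):
```bash
# Marks as used all segments belonging to the datasource
curl -X POST http://COORDINATOR_IP:8081/druid/coordinator/v1/datasources/wikipedia
```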
---
id: security-overview
title: "Security overview"
description: Overview of Apache Druid security. Includes best practices, configuration instructions, and a description of the security model.
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
This document provides an overview of Apache Druid security features, configuration instructions, and some best practices to secure Druid.
By default, security features in Druid are disabled, which simplifies the initial deployment experience. However, security features must be configured in a production deployment. These features include TLS, authentication, and authorization.
## Best practices
The following recommendations apply to the Druid cluster setup:
* Run Druid as an unprivileged Unix user. Do not run Druid as the root user.
> **WARNING!** \
Druid administrators have the same OS permissions as the Unix user account running Druid. See [Authentication and authorization model](security-user-auth.md#authentication-and-authorization-model). If the Druid process is running under the OS root user account, then Druid administrators can read or write all files that the root account has access to, including sensitive files such as `/etc/passwd`.
* Enable authentication to the Druid cluster for production environments and other environments that can be accessed by untrusted networks.
* Enable authorization and do not expose the Druid Console without authorization enabled. If authorization is not enabled, any user that has access to the web console has the same privileges as the operating system user that runs the Druid Console process.
* Grant users the minimum permissions necessary to perform their functions. For instance, do not allow users who only need to query data to write to data sources or view state.
* Disable JavaScript, as noted in the [Security section](https://druid.apache.org/docs/latest/development/javascript.html#security) of the JavaScript guide.
The following recommendations apply to the network where Druid runs:
* Enable TLS to encrypt communication within the cluster.
* Use an API gateway to:
- Restrict access from untrusted networks
- Create an allow list of specific APIs that your users need to access
- Implement account lockout and throttling features.
* When possible, use firewall and other network layer filtering to only expose Druid services and ports specifically required for your use case. For example, only expose Broker ports to downstream applications that execute queries. You can limit access to a specific IP address or IP range to further tighten and enhance security.
The following recommendations apply to Druid's authorization and authentication model:
* Only grant `WRITE` permissions to any `DATASOURCE` to trusted users. Druid's trust model assumes those users have the same privileges as the operating system user that runs the Druid Console process.
* Only grant `STATE READ`, `STATE WRITE`, `CONFIG WRITE`, and `DATASOURCE WRITE` permissions to highly-trusted users. These permissions allow users to access resources on behalf of the Druid server process regardless of the datasource.
* If your Druid client application allows less-trusted users to control the input source or firehose of an ingestion task, validate the URLs from the users. It is possible to point unchecked URLs to other locations and resources within your network or local file system.
## Enable TLS
Enabling TLS encrypts the traffic between external clients and the Druid cluster and traffic between services within the cluster.
### Generating keys
Before you enable TLS in Druid, generate the KeyStore and trustStore. When one Druid process (e.g., the Broker) contacts another (e.g., a Historical), the first service acts as a client of the second service, which is considered the server.
The client, for example the Broker, uses a trustStore that contains certificates trusted by the client.
The server uses a KeyStore that contains the private keys and certificate chain used to securely identify itself.
The following example demonstrates how to use Java keytool to generate the KeyStore for the server and then create a trustStore to trust the key for the client:
1. Generate the KeyStore with the Java `keytool` command:
```
$> keytool -keystore keystore.jks -alias druid -genkey -keyalg RSA
```
2. Export a public certificate:
```
$> keytool -export -alias druid -keystore keystore.jks -rfc -file public.cert
```
3. Create the trustStore:
```
$> keytool -import -file public.cert -alias druid -keystore truststore.jks
```
Druid uses Jetty as its embedded web server. See [Configuring SSL/TLS KeyStores](https://www.eclipse.org/jetty/documentation/jetty-11/operations-guide/index.html#og-keystore) from the Jetty documentation.
> WARNING: Do not use self-signed certificates for production environments. Instead, rely on your current public key infrastructure to generate and distribute trusted keys.
### Update Druid TLS configurations
Edit `common.runtime.properties` for all Druid services on all nodes. Add or update the following TLS options. Restart the cluster when you are finished.
```
# Turn on TLS globally
druid.enableTlsPort=true
# Disable non-TLS communications
druid.enablePlaintextPort=false
# For Druid processes acting as a client
# Load simple-client-sslcontext to enable client side TLS
# Add the following to extension load list
druid.extensions.loadList=[......., "simple-client-sslcontext"]
# Setup client side TLS
druid.client.https.protocol=TLSv1.2
druid.client.https.trustStoreType=jks
druid.client.https.trustStorePath=truststore.jks # replace with correct trustStore file
druid.client.https.trustStorePassword=secret123 # replace with your own password
# Setup server side TLS
druid.server.https.keyStoreType=jks
druid.server.https.keyStorePath=my-keystore.jks # replace with correct keyStore file
druid.server.https.keyStorePassword=secret123 # replace with your own password
druid.server.https.certAlias=druid
```
For more information, see [TLS support](tls-support.md) and [Simple SSLContext Provider Module](../development/extensions-core/simple-client-sslcontext.md).
## Authentication and authorization
You can configure authentication and authorization to control access to the Druid APIs. Then configure users, roles, and permissions, as described in the following sections. Make the configuration changes in the `common.runtime.properties` file on all Druid servers in the cluster.
Within Druid's operating context, authenticators control the way user identities are verified. Authorizers employ user roles to relate authenticated users to the datasources they are permitted to access. You can set the finest-grained permissions on a per-datasource basis.
The following graphic depicts the course of a request through the authentication process:
![Druid security check flow](../assets/security-model-1.png "Druid security check flow")
## Enable an authenticator
To authenticate requests in Druid, you configure an Authenticator. Authenticator extensions exist for HTTP basic authentication, LDAP, and Kerberos.
The following takes you through sample configuration steps for enabling basic auth:
1. Add the `druid-basic-security` extension to `druid.extensions.loadList` in `common.runtime.properties`. For the quickstart installation, for example, the properties file is at `conf/druid/cluster/_common`:
```
druid.extensions.loadList=["druid-basic-security", "druid-histogram", "druid-datasketches", "druid-kafka-indexing-service"]
```
2. Configure the basic Authenticator, Authorizer, and Escalator settings in the same common.runtime.properties file. The Escalator defines how Druid processes authenticate with one another.
An example configuration:
```
# Druid basic security
druid.auth.authenticatorChain=["MyBasicMetadataAuthenticator"]
druid.auth.authenticator.MyBasicMetadataAuthenticator.type=basic
# Default password for 'admin' user, should be changed for production.
druid.auth.authenticator.MyBasicMetadataAuthenticator.initialAdminPassword=password1
# Default password for internal 'druid_system' user, should be changed for production.
druid.auth.authenticator.MyBasicMetadataAuthenticator.initialInternalClientPassword=password2
# Uses the metadata store for storing users, you can use authentication API to create new users and grant permissions
druid.auth.authenticator.MyBasicMetadataAuthenticator.credentialsValidator.type=metadata
# If true and the request credential doesn't exist in this credentials store, the request will proceed to the next Authenticator in the chain.
druid.auth.authenticator.MyBasicMetadataAuthenticator.skipOnFailure=false
druid.auth.authenticator.MyBasicMetadataAuthenticator.authorizerName=MyBasicMetadataAuthorizer
# Escalator
druid.escalator.type=basic
druid.escalator.internalClientUsername=druid_system
druid.escalator.internalClientPassword=password2
druid.escalator.authorizerName=MyBasicMetadataAuthorizer
druid.auth.authorizers=["MyBasicMetadataAuthorizer"]
druid.auth.authorizer.MyBasicMetadataAuthorizer.type=basic
```
3. Restart the cluster.
See [Authentication and Authorization](../design/auth.md) for more information about the Authenticator, Escalator, and Authorizer concepts. See [Basic Security](../development/extensions-core/druid-basic-security.md) for more information about the extension used in the examples above, and [Kerberos](../development/extensions-core/druid-kerberos.md) for Kerberos authentication.
## Enable authorizers
After enabling the basic auth extension, you can add users, roles, and permissions via the Druid Coordinator `user` endpoint. Note that you cannot assign permissions directly to individual users. They must be assigned through roles.
The following diagram depicts the authorization model, and the relationship between users, roles, permissions, and resources.
![Druid Security model](../assets/security-model-2.png "Druid security model")
The following steps walk through a sample setup procedure:
> The default Coordinator API port is 8081 for non-TLS connections and 8281 for secured connections.
1. Create a user by issuing a POST request to `druid-ext/basic-security/authentication/db/MyBasicMetadataAuthenticator/users/<USERNAME>`, replacing USERNAME with the *new* username you are trying to create. For example:
```
curl -u admin:password1 -XPOST https://my-coordinator-ip:8281/druid-ext/basic-security/authentication/db/basic/users/myname
```
> If you have TLS enabled, be sure to adjust the curl command accordingly. For example, if your Druid servers use self-signed certificates, you may choose to include the `--insecure` curl option to forgo certificate checking.
2. Add a credential for the user by issuing a POST to `druid-ext/basic-security/authentication/db/MyBasicMetadataAuthenticator/users/<USERNAME>/credentials`. For example:
```
curl -u admin:password1 -H'Content-Type: application/json' -XPOST --data-binary @pass.json https://my-coordinator-ip:8281/druid-ext/basic-security/authentication/db/basic/users/myname/credentials
```
The password is conveyed in the `pass.json` file in the following form:
```
{
"password": "myname_password"
}
```
3. For each authenticator user you create, create a corresponding authorizer user by issuing a POST request to `druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/users/<USERNAME>`. For example:
```
curl -u admin:password1 -XPOST https://my-coordinator-ip:8281/druid-ext/basic-security/authorization/db/basic/users/myname
```
4. Create authorizer roles to control permissions by issuing a POST request to `druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/roles/<ROLENAME>`. For example:
```
curl -u admin:password1 -XPOST https://my-coordinator-ip:8281/druid-ext/basic-security/authorization/db/basic/roles/myrole
```
5. Assign roles to users by issuing a POST request to `druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/users/<USERNAME>/roles/<ROLENAME>`. For example:
```
curl -u admin:password1 -XPOST https://my-coordinator-ip:8281/druid-ext/basic-security/authorization/db/basic/users/myname/roles/myrole | jq
```
6. Finally, attach permissions to the roles to control how they can interact with Druid at `druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/roles/<ROLENAME>/permissions`.
For example:
```
curl -u admin:password1 -H'Content-Type: application/json' -XPOST --data-binary @perms.json https://my-coordinator-ip:8281/druid-ext/basic-security/authorization/db/basic/roles/myrole/permissions
```
The payload of `perms.json` should be in the form:
```
[
{
"resource": {
"name": "<PATTERN>",
"type": "DATASOURCE"
},
"action": "READ"
},
{
"resource": {
"name": "STATE",
"type": "STATE"
},
"action": "READ"
}
]
```
> Note: Druid treats the resource name as a regular expression (regex). You can use a specific datasource name or regex to grant permissions for multiple datasources at a time.
## Configuring an LDAP authenticator
As an alternative to using the basic metadata authenticator, you can use LDAP to authenticate users. The following steps provide an overview of the setup procedure. For more information on these settings, see [Properties for LDAP user authentication](../development/extensions-core/druid-basic-security.md#properties-for-ldap-user-authentication).
1. In `common.runtime.properties`, add LDAP to the authenticator chain in the order in which you want requests to be evaluated. For example:
```
# Druid basic security
druid.auth.authenticatorChain=["ldap", "MyBasicMetadataAuthenticator"]
```
2. Configure LDAP settings in `common.runtime.properties` as appropriate for your LDAP scheme and system. For example:
```
druid.auth.authenticator.ldap.type=basic
druid.auth.authenticator.ldap.enableCacheNotifications=true
druid.auth.authenticator.ldap.credentialsValidator.type=ldap
druid.auth.authenticator.ldap.credentialsValidator.url=ldap://ad_host:389
druid.auth.authenticator.ldap.credentialsValidator.bindUser=ad_admin_user
druid.auth.authenticator.ldap.credentialsValidator.bindPassword=ad_admin_password
druid.auth.authenticator.ldap.credentialsValidator.baseDn=dc=example,dc=com
druid.auth.authenticator.ldap.credentialsValidator.userSearch=(&(sAMAccountName=%s)(objectClass=user))
druid.auth.authenticator.ldap.credentialsValidator.userAttribute=sAMAccountName
druid.auth.authenticator.ldap.authorizerName=ldapauth
druid.escalator.type=basic
druid.escalator.internalClientUsername=ad_internal_user
druid.escalator.internalClientPassword=Welcome123
druid.escalator.authorizerName=ldapauth
druid.auth.authorizers=["ldapauth"]
druid.auth.authorizer.ldapauth.type=basic
druid.auth.authorizer.ldapauth.initialAdminUser=<ad_initial_admin_user>
druid.auth.authorizer.ldapauth.initialAdminRole=admin
druid.auth.authorizer.ldapauth.roleProvider.type=ldap
```
3. Use the Druid API to create the group mapping and allocate initial roles. For example, using curl and given a group named `group1` in the directory, run:
```
curl -i -v -H "Content-Type: application/json" -u internal -X POST -d @groupmap.json http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1map
```
The `groupmap.json` file contents would be something like:
```
{
"name": "group1map",
"groupPattern": "CN=group1,CN=Users,DC=example,DC=com",
"roles": [
"readRole"
]
}
```
4. Check that the group mapping was created successfully by calling the following API, which lists all group mappings:
```
curl -i -v -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings
```
Alternatively, to check the details of a specific group mapping, use the following API:
```
curl -i -v -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1map
```
5. To add additional roles to the group mapping, use the following API:
```
curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1map/roles/<newrole>
```
6. Add the LDAP user to Druid. To add a user, use the following authentication API:
```
curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authentication/db/ldap/users/<ad_user>
```
7. Use the following command to assign the role to a user:
```
curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/users/<ad_user>/roles/<rolename>
```
Congratulations, you have configured permissions for user-assigned roles in Druid!
## Druid security trust model
Within Druid's trust model, users can have different authorization levels:
- Users with resource write permissions are allowed to do anything that the Druid process can do.
- Authenticated read-only users can execute queries against resources to which they have permissions.
- An authenticated user without any permissions is allowed to execute queries that don't require access to a resource.
Additionally, Druid operates according to the following principles:
From the innermost layer:
1. Druid processes have the same access to the local files granted to the specified system user running the process.
2. The Druid ingestion system can create new processes to execute tasks. Those tasks inherit the user of their parent process. This means that any user authorized to submit an ingestion task can use the ingestion task permissions to read or write any local files or external resources that the Druid process has access to.
> Note: Only grant `DATASOURCE WRITE` permissions to trusted users, because such users can effectively act as the Druid process.
Within the cluster:
1. Druid assumes it operates on an isolated, protected network where no reachable IP within the network is under adversary control. When you implement Druid, take care to set up firewalls and other security measures to secure both inbound and outbound connections.
2. Druid assumes network traffic within the cluster is encrypted, including API calls and data transfers. The default encryption implementation uses TLS.
3. Druid assumes auxiliary services such as the metadata store and ZooKeeper nodes are not under adversary control.
Cluster to deep storage:
1. Druid does not make assumptions about the security for deep storage. It follows the system's native security policies to authenticate and authorize with deep storage.
2. Druid does not encrypt files for deep storage. Instead, it relies on the storage system's native encryption capabilities to ensure compatibility with encryption schemes across all storage types.
Cluster to client:
1. Druid authenticates with the client based on the configured authenticator.
2. Druid only performs actions when an authorizer grants permission. The default configuration uses the `allowAll` authorizer.
---
id: security-user-auth
title: "User authentication and authorization"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
This document describes the Druid security model that extensions use to enable user authentication and authorization services to Druid.
## Authentication and authorization model
At the center of the Druid user authentication and authorization model are _resources_ and _actions_. A resource is something that authenticated users are trying to access or modify. An action is something that users are trying to do.
There are three resource types:
* DATASOURCE &ndash; Each Druid table (i.e., `tables` in the `druid` schema in SQL) is a resource.
* CONFIG &ndash; Configuration resources exposed by the cluster components.
* STATE &ndash; Cluster-wide state resources.
For specific resources associated with the types, see the endpoint list below and corresponding descriptions in [API Reference](./api-reference.md).
There are two actions:
* READ &ndash; Used for read-only operations.
* WRITE &ndash; Used for operations that are not read-only.
In practice, most deployments will only need to define two classes of users:
* Administrators, who have WRITE action permissions on all resource types. These users will add datasources and administer the system.
* Data users, who only need READ access to DATASOURCE. These users should access Query APIs only through an API gateway. Other APIs and permissions include functionality that should be limited to server admins.
It is important to note that WRITE access to DATASOURCE grants a user broad access. For instance, such users will have access to the Druid file system, S3 buckets, and credentials, among other things. As such, the ability to add and manage datasources should be allocated selectively to administrators.
## Default user accounts
### Authenticator
If `druid.auth.authenticator.<authenticator-name>.initialAdminPassword` is set, a default admin user named "admin" will be created, with the specified initial password. If this configuration is omitted, the "admin" user will not be created.
If `druid.auth.authenticator.<authenticator-name>.initialInternalClientPassword` is set, a default internal system user named "druid_system" will be created, with the specified initial password. If this configuration is omitted, the "druid_system" user will not be created.
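For example, a minimal sketch of these two properties, assuming an authenticator named `MyBasicMetadataAuthenticator` (the passwords are placeholders):
```
druid.auth.authenticator.MyBasicMetadataAuthenticator.initialAdminPassword=password1
druid.auth.authenticator.MyBasicMetadataAuthenticator.initialInternalClientPassword=druid_system_pass
```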
### Authorizer
Each Authorizer will always have a default "admin" and "druid_system" user with full privileges.
## Defining permissions
There are two action types in Druid: READ and WRITE.
There are three resource types in Druid: DATASOURCE, CONFIG, and STATE.
### DATASOURCE
Resource names for this type are datasource names. Specifying a datasource permission allows the administrator to grant users access to specific datasources.
### CONFIG
There are two possible resource names for the "CONFIG" resource type, "CONFIG" and "security". Granting a user access to CONFIG resources allows them to access the following endpoints.
"CONFIG" resource name covers the following endpoints:
|Endpoint|Process Type|
|--------|---------|
|`/druid/coordinator/v1/config`|coordinator|
|`/druid/indexer/v1/worker`|overlord|
|`/druid/indexer/v1/worker/history`|overlord|
|`/druid/worker/v1/disable`|middleManager|
|`/druid/worker/v1/enable`|middleManager|
"security" resource name covers the following endpoint:
|Endpoint|Process Type|
|--------|---------|
|`/druid-ext/basic-security/authentication`|coordinator|
|`/druid-ext/basic-security/authorization`|coordinator|
### STATE
There is only one possible resource name for the "STATE" resource type, "STATE". Granting a user access to STATE resources allows them to access the following endpoints.
"STATE" resource name covers the following endpoints:
|Endpoint|Process Type|
|--------|---------|
|`/druid/coordinator/v1`|coordinator|
|`/druid/coordinator/v1/rules`|coordinator|
|`/druid/coordinator/v1/rules/history`|coordinator|
|`/druid/coordinator/v1/servers`|coordinator|
|`/druid/coordinator/v1/tiers`|coordinator|
|`/druid/broker/v1`|broker|
|`/druid/v2/candidates`|broker|
|`/druid/indexer/v1/leader`|overlord|
|`/druid/indexer/v1/isLeader`|overlord|
|`/druid/indexer/v1/action`|overlord|
|`/druid/indexer/v1/workers`|overlord|
|`/druid/indexer/v1/scaling`|overlord|
|`/druid/worker/v1/enabled`|middleManager|
|`/druid/worker/v1/tasks`|middleManager|
|`/druid/worker/v1/task/{taskid}/shutdown`|middleManager|
|`/druid/worker/v1/task/{taskid}/log`|middleManager|
|`/druid/historical/v1`|historical|
|`/druid-internal/v1/segments/`|historical|
|`/druid-internal/v1/segments/`|peon|
|`/druid-internal/v1/segments/`|realtime|
|`/status`|all process types|
### HTTP methods
For information on what HTTP methods are supported on a particular request endpoint, please refer to the [API documentation](./api-reference.md).
GET requires READ permission, while POST and DELETE require WRITE permission.
### SQL Permissions
Queries on Druid datasources require DATASOURCE READ permissions for the specified datasource.
Queries on the [INFORMATION_SCHEMA tables](../querying/sql.md#information-schema) will
return information about datasources that the caller has DATASOURCE READ access to. Other
datasources will be omitted.
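For example, a minimal check of which Druid tables are visible to the current user:
```sql
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'druid';
```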
Queries on the [system schema tables](../querying/sql.md#system-schema) require the following permissions:
- `segments`: Segments will be filtered based on DATASOURCE READ permissions.
- `servers`: The user requires STATE READ permissions.
- `server_segments`: The user requires STATE READ permissions and segments will be filtered based on DATASOURCE READ permissions.
- `tasks`: Tasks will be filtered based on DATASOURCE READ permissions.
## Configuration Propagation
To prevent excessive load on the Coordinator, the Authenticator and Authorizer user/role Druid metadata store state is cached on each Druid process.
Each process will periodically poll the Coordinator for the latest Druid metadata store state, controlled by the `druid.auth.basic.common.pollingPeriod` and `druid.auth.basic.common.maxRandomDelay` properties.
When a configuration update occurs, the Coordinator can optionally notify each process with the updated Druid metadata store state. This behavior is controlled by the `enableCacheNotifications` and `cacheNotificationTimeout` properties on Authenticators and Authorizers.
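A sketch of these settings with illustrative (not necessarily default) values; the polling properties are in milliseconds, and the notification settings are applied per Authenticator/Authorizer:
```
druid.auth.basic.common.pollingPeriod=60000
druid.auth.basic.common.maxRandomDelay=6000
druid.auth.authenticator.MyBasicMetadataAuthenticator.enableCacheNotifications=true
druid.auth.authenticator.MyBasicMetadataAuthenticator.cacheNotificationTimeout=5000
```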
Note that because of the caching, changes made to the user/role Druid metadata store may not be immediately reflected at each Druid process.
---
id: segment-optimization
title: "Segment Size Optimization"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
In Apache Druid, it's important to optimize the segment size because:
1. Druid stores data in segments. If you're using the [best-effort roll-up](../ingestion/index.md#rollup) mode,
increasing the segment size might introduce further aggregation which reduces the dataSource size.
2. When a query is submitted, that query is distributed to all Historicals and realtime tasks
which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool
to process a single segment. If segment sizes are too large, data might not be well distributed between data
servers, decreasing the degree of parallelism possible during query processing.
At the other extreme where segment sizes are too small, the scheduling
overhead of processing a larger number of segments per query can reduce
performance, as the threads that process each segment compete for the fixed
slots of the processing pool.
It is best to optimize the segment size at ingestion time, but this is not always easy,
especially for stream ingestion, because the amount of data ingested can vary over time. In this case,
you can create segments with a sub-optimal size first and optimize them later using [compaction](../ingestion/compaction.md).
You may need to consider the following factors to optimize your segments.
- Number of rows per segment: it's generally recommended for each segment to have around 5 million rows.
This setting is usually _more_ important than the "segment byte size" below.
This is because Druid uses a single thread to process each segment,
and thus this setting directly controls how many rows each thread processes,
which in turn determines how well query execution is parallelized.
- Segment byte size: it's recommended to keep this between 300 and 700 MB. If this value
conflicts with the "number of rows per segment" recommendation, prioritize the
number of rows per segment rather than this value.
> The above recommendation works in general, but the optimal setting can
> vary based on your workload. For example, if most of your queries
> are heavy and take a long time to process each row, you may want to make
> segments smaller so that the query processing can be more parallelized.
> If you still see some performance issue after optimizing segment size,
> you may need to find the optimal settings for your workload.
There are several ways to check whether compaction is necessary. One way
is to use the [System Schema](../querying/sql.md#system-schema), which
provides several tables about the current system status, including the `segments` table.
The query below returns the average number of rows and average size for published segments.
```sql
SELECT
"start",
"end",
version,
COUNT(*) AS num_segments,
AVG("num_rows") AS avg_num_rows,
SUM("num_rows") AS total_num_rows,
AVG("size") AS avg_size,
SUM("size") AS total_size
FROM
sys.segments A
WHERE
datasource = 'your_dataSource' AND
is_published = 1
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3 DESC;
```
Please note that the query result might include overshadowed segments.
In that case, you may want to look only at rows with the maximum version per interval (pair of `start` and `end`).
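One hedged sketch of such a filter, reusing the hypothetical `your_dataSource` from the query above and assuming your Druid version supports SQL JOINs on the `sys` schema (otherwise, run the inner query separately and filter the earlier results by hand):
```sql
SELECT
  s."start",
  s."end",
  s.version,
  COUNT(*) AS num_segments,
  AVG(s."num_rows") AS avg_num_rows,
  AVG(s."size") AS avg_size
FROM sys.segments s
JOIN (
  -- Latest (max) version per interval.
  SELECT "start", "end", MAX(version) AS max_version
  FROM sys.segments
  WHERE datasource = 'your_dataSource' AND is_published = 1
  GROUP BY 1, 2
) m
  ON s."start" = m."start" AND s."end" = m."end" AND s.version = m.max_version
WHERE s.datasource = 'your_dataSource' AND s.is_published = 1
GROUP BY 1, 2, 3
ORDER BY 1, 2;
```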
Once you find that your segments need compaction, consider the following two options:
- Turning on the [automatic compaction of Coordinators](../design/coordinator.md#compacting-segments).
The Coordinator periodically submits [compaction tasks](../ingestion/tasks.md#compact) to re-index small segments.
To enable the automatic compaction, you need to configure it for each dataSource via Coordinator's dynamic configuration.
See [Compaction Configuration API](../operations/api-reference.md#compaction-configuration)
and [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for details.
- Running periodic Hadoop batch ingestion jobs and using a `dataSource`
inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel.
Details on how to do this can be found on the [Updating existing data](../ingestion/data-management.md#update) section
of the data management page.
## Learn more
For an overview of compaction and how to submit a manual compaction task, see [Compaction](../ingestion/compaction.md).
---
id: tls-support
title: "TLS support"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## General configuration
|Property|Description|Default|
|--------|-----------|-------|
|`druid.enablePlaintextPort`|Enable/Disable HTTP connector.|`true`|
|`druid.enableTlsPort`|Enable/Disable HTTPS connector.|`false`|
Although not recommended, the HTTP and HTTPS connectors can both be enabled at the same time. The respective ports are configurable using the `druid.plaintextPort`
and `druid.tlsPort` properties on each process. Please see the `Configuration` section of the individual processes for the valid and default values of these ports.
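For example, a minimal sketch that disables the plaintext connector and serves HTTPS on port 8281, the port used by the curl examples in this document:
```
druid.enablePlaintextPort=false
druid.enableTlsPort=true
druid.tlsPort=8281
```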
## Jetty server configuration
Apache Druid uses Jetty as its embedded web server.
To get familiar with TLS/SSL, along with related concepts like keys and certificates,
read [Configuring SSL/TLS](https://www.eclipse.org/jetty/documentation/current/configuring-ssl.html) in the Jetty documentation.
To get more in-depth knowledge of TLS/SSL support in Java in general, refer to the [Java Secure Socket Extension (JSSE) Reference Guide](http://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html).
The [Configuring the Jetty SslContextFactory](https://www.eclipse.org/jetty/documentation/current/configuring-ssl.html#configuring-sslcontextfactory)
section can help in understanding TLS/SSL configurations listed below. Finally, [Java Cryptography Architecture
Standard Algorithm Name Documentation for JDK 8](http://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html) lists all possible
values for the configs below, among others provided by the Java implementation.
|Property|Description|Default|Required|
|--------|-----------|-------|--------|
|`druid.server.https.keyStorePath`|The file path or URL of the TLS/SSL Key store.|none|yes|
|`druid.server.https.keyStoreType`|The type of the key store.|none|yes|
|`druid.server.https.certAlias`|Alias of TLS/SSL certificate for the connector.|none|yes|
|`druid.server.https.keyStorePassword`|The [Password Provider](../operations/password-provider.md) or String password for the Key Store.|none|yes|
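A sketch of these four required settings; the path, alias, and password shown are placeholders:
```
druid.server.https.keyStorePath=/opt/druid/conf/tls/keystore.jks
druid.server.https.keyStoreType=jks
druid.server.https.certAlias=druid
druid.server.https.keyStorePassword=changeit
```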
The following table contains configuration options related to client certificate authentication.
|Property|Description|Default|Required|
|--------|-----------|-------|--------|
|`druid.server.https.requireClientCertificate`|If set to true, clients must identify themselves by providing a TLS certificate, without which connections will fail.|false|no|
|`druid.server.https.requestClientCertificate`|If set to true, clients may optionally identify themselves by providing a TLS certificate. Connections will not fail if TLS certificate is not provided. This property is ignored if `requireClientCertificate` is set to true. If `requireClientCertificate` and `requestClientCertificate` are false, the rest of the options in this table are ignored.|false|no|
|`druid.server.https.trustStoreType`|The type of the trust store containing certificates used to validate client certificates. Not needed if `requireClientCertificate` and `requestClientCertificate` are false.|`java.security.KeyStore.getDefaultType()`|no|
|`druid.server.https.trustStorePath`|The file path or URL of the trust store containing certificates used to validate client certificates. Not needed if `requireClientCertificate` and `requestClientCertificate` are false.|none|yes, only if `requireClientCertificate` is true|
|`druid.server.https.trustStoreAlgorithm`|Algorithm to be used by TrustManager to validate client certificate chains. Not needed if `requireClientCertificate` and `requestClientCertificate` are false.|`javax.net.ssl.TrustManagerFactory.getDefaultAlgorithm()`|no|
|`druid.server.https.trustStorePassword`|The [password provider](../operations/password-provider.md) or String password for the Trust Store. Not needed if `requireClientCertificate` and `requestClientCertificate` are false.|none|no|
|`druid.server.https.validateHostnames`|If set to true, check that the client's hostname matches the CN/subjectAltNames in the client certificate. Not used if `requireClientCertificate` and `requestClientCertificate` are false.|true|no|
|`druid.server.https.crlPath`|Specifies a path to a file containing static [Certificate Revocation Lists](https://en.wikipedia.org/wiki/Certificate_revocation_list), used to check if a client certificate has been revoked. Not used if `requireClientCertificate` and `requestClientCertificate` are false.|null|no|
The following table contains non-mandatory advanced configuration options; use them with caution.
|Property|Description|Default|Required|
|--------|-----------|-------|--------|
|`druid.server.https.keyManagerFactoryAlgorithm`|Algorithm to use for creating KeyManager, more details [here](https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/JSSERefGuide.html#KeyManager).|`javax.net.ssl.KeyManagerFactory.getDefaultAlgorithm()`|no|
|`druid.server.https.keyManagerPassword`|The [Password Provider](../operations/password-provider.md) or String password for the Key Manager.|none|no|
|`druid.server.https.includeCipherSuites`|List of cipher suite names to include. You can either use the exact cipher suite name or a regular expression.|Jetty's default include cipher list|no|
|`druid.server.https.excludeCipherSuites`|List of cipher suite names to exclude. You can either use the exact cipher suite name or a regular expression.|Jetty's default exclude cipher list|no|
|`druid.server.https.includeProtocols`|List of exact protocols names to include.|Jetty's default include protocol list|no|
|`druid.server.https.excludeProtocols`|List of exact protocols names to exclude.|Jetty's default exclude protocol list|no|
## Internal communication over TLS
Whenever possible, Druid processes will use HTTPS to talk to each other. To enable this communication, Druid's HttpClient needs to
be configured with a proper [SSLContext](http://docs.oracle.com/javase/8/docs/api/javax/net/ssl/SSLContext) that can
validate the server certificates; otherwise communication will fail.
Since there are various ways to configure an SSLContext, by default Druid looks for an SSLContext Guice binding
when creating the HttpClient. This binding can be provided by writing a [Druid extension](../development/extensions.md)
that supplies an instance of SSLContext. Druid ships with a simple extension, described [here](../development/extensions-core/simple-client-sslcontext.md),
which should be sufficient for most simple cases; see [this](../development/extensions.md#loading-extensions) for how to include extensions.
If this extension does not satisfy your requirements, follow the extension [implementation](https://github.com/apache/druid/tree/master/extensions-core/simple-client-sslcontext)
to create your own extension.
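As a sketch, assuming the bundled `simple-client-sslcontext` extension and a truststore containing the certificates of your Druid servers (paths and password are placeholders; append the extension to your existing `druid.extensions.loadList`):
```
druid.extensions.loadList=["simple-client-sslcontext"]
druid.client.https.protocol=TLSv1.2
druid.client.https.trustStoreType=jks
druid.client.https.trustStorePath=/opt/druid/conf/tls/truststore.jks
druid.client.https.trustStorePassword=changeit
```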
When the Druid Coordinator/Overlord has both HTTP and HTTPS enabled and a client sends a request to a non-leader process, the client is always redirected to the HTTPS endpoint on the leader process.
Therefore, clients should first be upgraded to handle redirects to HTTPS. Then the Druid Overlord/Coordinator should be upgraded and configured to serve both HTTP and HTTPS ports. Next, client configuration should be changed to refer to the Druid Coordinator/Overlord via the HTTPS endpoint, and finally the HTTP port on the Druid Coordinator/Overlord should be disabled.
## Custom certificate checks
Druid supports custom certificate check extensions. Please refer to the `org.apache.druid.server.security.TLSCertificateChecker` interface for details on the methods to be implemented.
To use a custom TLS certificate checker, specify the following property:
|Property|Description|Default|Required|
|--------|-----------|-------|--------|
|`druid.tls.certificateChecker`|Type name of custom TLS certificate checker, provided by extensions. Please refer to extension documentation for the type name that should be specified.|"default"|no|
The default checker delegates to the standard trust manager and performs no additional actions or checks.
If using a non-default certificate checker, please refer to the extension documentation for additional configuration properties needed.
---
id: use_sbt_to_build_fat_jar
title: "Content for build.sbt"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
```scala
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk" % "1.9.23" exclude("common-logging", "common-logging"),
"org.joda" % "joda-convert" % "1.7",
"joda-time" % "joda-time" % "2.7",
"org.apache.druid" % "druid" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.apache.druid" % "druid-services" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.apache.druid" % "druid-indexing-service" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.apache.druid" % "druid-indexing-hadoop" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.apache.druid.extensions" % "mysql-metadata-storage" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.apache.druid.extensions" % "druid-s3-extensions" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.apache.druid.extensions" % "druid-histogram" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"org.apache.druid.extensions" % "druid-hdfs-storage" % "0.8.1" excludeAll (
ExclusionRule("org.ow2.asm"),
ExclusionRule("com.fasterxml.jackson.core"),
ExclusionRule("com.fasterxml.jackson.datatype"),
ExclusionRule("com.fasterxml.jackson.dataformat"),
ExclusionRule("com.fasterxml.jackson.jaxrs"),
ExclusionRule("com.fasterxml.jackson.module")
),
"com.fasterxml.jackson.core" % "jackson-annotations" % "2.3.0",
"com.fasterxml.jackson.core" % "jackson-core" % "2.3.0",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.3.0",
"com.fasterxml.jackson.datatype" % "jackson-datatype-guava" % "2.3.0",
"com.fasterxml.jackson.datatype" % "jackson-datatype-joda" % "2.3.0",
"com.fasterxml.jackson.jaxrs" % "jackson-jaxrs-base" % "2.3.0",
"com.fasterxml.jackson.jaxrs" % "jackson-jaxrs-json-provider" % "2.3.0",
"com.fasterxml.jackson.jaxrs" % "jackson-jaxrs-smile-provider" % "2.3.0",
"com.fasterxml.jackson.module" % "jackson-module-jaxb-annotations" % "2.3.0",
"com.sun.jersey" % "jersey-servlet" % "1.17.1",
"mysql" % "mysql-connector-java" % "5.1.34",
"org.scalatest" %% "scalatest" % "2.2.3" % "test",
"org.mockito" % "mockito-core" % "1.10.19" % "test"
)
assemblyMergeStrategy in assembly := {
case path if path contains "pom." => MergeStrategy.first
case path if path contains "javax.inject.Named" => MergeStrategy.first
case path if path contains "mime.types" => MergeStrategy.first
case path if path contains "org/apache/commons/logging/impl/SimpleLog.class" => MergeStrategy.first
case path if path contains "org/apache/commons/logging/impl/SimpleLog$1.class" => MergeStrategy.first
case path if path contains "org/apache/commons/logging/impl/NoOpLog.class" => MergeStrategy.first
case path if path contains "org/apache/commons/logging/LogFactory.class" => MergeStrategy.first
case path if path contains "org/apache/commons/logging/LogConfigurationException.class" => MergeStrategy.first
case path if path contains "org/apache/commons/logging/Log.class" => MergeStrategy.first
case path if path contains "META-INF/jersey-module-version" => MergeStrategy.first
case path if path contains ".properties" => MergeStrategy.first
case path if path contains ".class" => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
```