Compare commits
No commits in common. "master" and "feature/getting_started" have entirely different histories.
master ... feature/getting_started

16  .idea/checkstyle-idea.xml  (generated)
@@ -1,16 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="CheckStyle-IDEA" serialisationVersion="2">
    <checkstyleVersion>10.14.0</checkstyleVersion>
    <scanScope>JavaOnly</scanScope>
    <copyLibs>true</copyLibs>
    <option name="thirdPartyClasspath" />
    <option name="activeLocationIds" />
    <option name="locations">
      <list>
        <ConfigurationLocation id="bundled-sun-checks" type="BUNDLED" scope="All" description="Sun Checks">(bundled)</ConfigurationLocation>
        <ConfigurationLocation id="bundled-google-checks" type="BUNDLED" scope="All" description="Google Checks">(bundled)</ConfigurationLocation>
      </list>
    </option>
  </component>
</project>
6  .idea/jpa-buddy.xml  (generated)
@@ -1,6 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="JpaBuddyIdeaProjectConfig">
    <option name="renamerInitialized" value="true" />
  </component>
</project>
10  .idea/runConfigurations.xml  (generated, new file)
@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="RunConfigurationProducerService">
    <option name="ignoredProducers">
      <set>
        <option value="com.android.tools.idea.compose.preview.runconfiguration.ComposePreviewRunConfigurationProducer" />
      </set>
    </option>
  </component>
</project>
6  .idea/thriftCompiler.xml  (generated, new file)
@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="ThriftCompiler">
    <compilers />
  </component>
</project>
80  Configuration/configuration.md  (new file)
@@ -0,0 +1,80 @@
<!-- toc -->

## Configuration reference

This section lists the configuration options for every type of Druid service.

### Recommended configuration file organization

A recommended way to organize Druid configuration files is to place them under the `conf` directory of the Druid root, as shown below:

```json
$ ls -R conf
druid

conf/druid:
_common broker coordinator historical middleManager overlord

conf/druid/_common:
common.runtime.properties log4j2.xml

conf/druid/broker:
jvm.config runtime.properties

conf/druid/coordinator:
jvm.config runtime.properties

conf/druid/historical:
jvm.config runtime.properties

conf/druid/middleManager:
jvm.config runtime.properties

conf/druid/overlord:
jvm.config runtime.properties
```

Each directory contains a `runtime.properties` file with configuration specific to that Druid process type, for example `historical`.

The `jvm.config` files contain the JVM flags for each service, such as heap sizing properties.

Common properties shared by all processes are kept in `_common/common.runtime.properties`.

### Common configuration

The properties in this section are common configuration that should be shared across all Druid services in a cluster.

#### JVM configuration best practices

There are four JVM parameters that we set on all of our processes:

1. `-Duser.timezone=UTC` This sets the default timezone of the JVM to UTC. We always set this and do not test with other default timezones, so local timezones might work, but they also might uncover weird and interesting bugs. To issue queries in a non-UTC timezone, see [query granularities](../querying/granularity.md).
2. `-Dfile.encoding=UTF-8` This is similar to the timezone: we test assuming UTF-8. Local encodings might work, but they also might result in weird and interesting bugs.
3. `-Djava.io.tmpdir=<a path>` Various parts of the system that interact with the file system do so through temporary files, and those files can get somewhat large. Many production systems are set up with small (but fast) `/tmp` directories, which can be a problem for Druid, so we recommend pointing the JVM's tmp directory at a location with more headroom. This directory should not be volatile tmpfs, and it should have good read and write speed, so NFS mounts should be strongly avoided.
4. `-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager` This allows log4j2 to handle logs from non-log4j2 components (such as Jetty) that use standard Java logging.

#### Extensions
#### Request logging
#### SQL-compatible null handling
### Master
#### Coordinator
#### Overlord
### Data
#### MiddleManager and Peons
##### SegmentWriteOutMediumFactory
#### Indexer
#### Historical
### Query
#### Broker
#### Router
1  Configuration/core-ext/approximate-histograms.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/bloom-filter.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/datasketches-hll.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/datasketches-quantiles.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/datasketches-theta.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/druid-basic-security.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/google-cloud-storage.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/hdfs.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/kafka-extraction-namespace.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/lookups-cached-global.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/microsoft-azure.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/mysql.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/postgresql.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/s3.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/stats.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1  Configuration/core-ext/tdigestsketch-quantiles.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

2  Configuration/extensions.md  (new file)
@@ -0,0 +1,2 @@
<!-- toc -->
### Community extensions

1  DataIngestion/batchingestion.md  (new file)
@@ -0,0 +1 @@
<!-- toc -->

1152  DataIngestion/dataformats.md  (new file)
File diff suppressed because it is too large
202  DataIngestion/datamanage.md  (new file)
@@ -0,0 +1,202 @@
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## 数据管理
|
||||||
|
|
||||||
|
### schema更新
|
||||||
|
The schema of a datasource may change at any time, and Apache Druid supports different schemas across the segments of a single datasource.
|
||||||
|
#### 替换段文件
|
||||||
|
Druid使用数据源、时间间隔、版本号和分区号唯一地标识段。只有在某个时间粒度内创建多个段时,分区号才在段id中可见。例如,如果有小时段,但一小时内的数据量超过单个段的容量,则可以在同一小时内创建多个段。这些段将共享相同的数据源、时间间隔和版本号,但具有线性增加的分区号。
|
||||||
|
```json
|
||||||
|
foo_2015-01-01/2015-01-02_v1_0
|
||||||
|
foo_2015-01-01/2015-01-02_v1_1
|
||||||
|
foo_2015-01-01/2015-01-02_v1_2
|
||||||
|
```
|
||||||
|
在上面的示例段中,`dataSource`=`foo`,`interval`=`2015-01-01/2015-01-02`,version=`v1`,partitionNum=`0`。如果在以后的某个时间点,使用新的schema重新索引数据,则新创建的段将具有更高的版本id。
|
||||||
|
```json
|
||||||
|
foo_2015-01-01/2015-01-02_v2_0
|
||||||
|
foo_2015-01-01/2015-01-02_v2_1
|
||||||
|
foo_2015-01-01/2015-01-02_v2_2
|
||||||
|
```
|
||||||
|
Druid批索引(基于Hadoop或基于IndexTask)保证了时间间隔内的原子更新。在我们的例子中,直到 `2015-01-01/2015-01-02` 的所有 `v2` 段加载到Druid集群中之前,查询只使用 `v1` 段, 当加载完所有v2段并可查询后,所有查询都将忽略 `v1` 段并切换到 `v2` 段。不久之后,`v1` 段将从集群中卸载。
|
||||||
|
|
||||||
|
请注意,跨越多个段间隔的更新在每个间隔内都是原子的。在整个更新过程中它们不是原子的。例如,您有如下段:
|
||||||
|
```json
|
||||||
|
foo_2015-01-01/2015-01-02_v1_0
|
||||||
|
foo_2015-01-02/2015-01-03_v1_1
|
||||||
|
foo_2015-01-03/2015-01-04_v1_2
|
||||||
|
```
|
||||||
|
`v2` 段将在构建后立即加载到集群中,并在段重叠的时间段内替换 `v1` 段。在完全加载 `v2` 段之前,集群可能混合了 `v1` 和 `v2` 段。
|
||||||
|
```json
|
||||||
|
foo_2015-01-01/2015-01-02_v1_0
|
||||||
|
foo_2015-01-02/2015-01-03_v2_1
|
||||||
|
foo_2015-01-03/2015-01-04_v1_2
|
||||||
|
```
|
||||||
|
在这种情况下,查询可能会命中 `v1` 和 `v2` 段的混合。
|
||||||
|
#### 在段中不同的schema
|
||||||
|
同一数据源的Druid段可能有不同的schema。如果一个字符串列(维度)存在于一个段中而不是另一个段中,则涉及这两个段的查询仍然有效。对缺少维度的段的查询将表现为该维度只有空值。类似地,如果一个段有一个数值列(metric),而另一个没有,那么查询缺少metric的段通常会"做正确的事情"。在此缺失的Metric上的使用聚合的行为类似于该Metric缺失。
|
||||||
|
### 压缩与重新索引
|
||||||
|
压缩合并是一种覆盖操作,它读取现有的一组段,将它们组合成一个具有较大但较少段的新集,并用新的压缩集覆盖原始集,而不更改它内部存储的数据。
|
||||||
|
|
||||||
|
出于性能原因,有时将一组段压缩为一组较大但较少的段是有益的,因为在接收和查询路径中都存在一些段处理和内存开销。
|
||||||
|
|
||||||
|
压缩任务合并给定时间间隔内的所有段。语法为:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type": "compact",
|
||||||
|
"id": <task_id>,
|
||||||
|
"dataSource": <task_datasource>,
|
||||||
|
"ioConfig": <IO config>,
|
||||||
|
"dimensionsSpec" <custom dimensionsSpec>,
|
||||||
|
"metricsSpec" <custom metricsSpec>,
|
||||||
|
"segmentGranularity": <segment granularity after compaction>,
|
||||||
|
"tuningConfig" <parallel indexing task tuningConfig>,
|
||||||
|
"context": <task context>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
| 字段 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|
|
||||||
|
| `type` | 任务类型,应该是 `compact` | 是 |
|
||||||
|
| `id` | 任务id | 否 |
|
||||||
|
| `dataSource` | 将被压缩合并的数据源名称 | 是 |
|
||||||
|
| `ioConfig` | 压缩合并任务的 `ioConfig`, 详情见 [Compaction ioConfig](#压缩合并的IOConfig) | 是 |
|
||||||
|
| `dimensionsSpec` | 自定义 `dimensionsSpec`。压缩任务将使用此dimensionsSpec(如果存在),而不是生成dimensionsSpec。更多细节见下文。| 否 |
|
||||||
|
| `metricsSpec` | 自定义 `metricsSpec`。如果指定了压缩任务,则压缩任务将使用此metricsSpec,而不是生成一个metricsSpec。| 否 |
|
||||||
|
| `segmentGranularity` | 如果设置了此值,压缩合并任务将更改给定时间间隔内的段粒度。有关详细信息,请参阅 [granularitySpec](ingestion.md#granularityspec) 的 `segmentGranularity`。行为见下表。 | 否 |
|
||||||
|
| `tuningConfig` | [并行索引任务的tuningConfig](native.md#tuningConfig) | 否 |
|
||||||
|
| `context` | [任务的上下文](taskrefer.md#上下文参数) | 否 |
|
||||||
|
|
||||||
|
一个压缩合并任务的示例如下:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type" : "compact",
|
||||||
|
"dataSource" : "wikipedia",
|
||||||
|
"ioConfig" : {
|
||||||
|
"type": "compact",
|
||||||
|
"inputSpec": {
|
||||||
|
"type": "interval",
|
||||||
|
"interval": "2017-01-01/2018-01-01"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
压缩任务读取时间间隔 `2017-01-01/2018-01-01` 的*所有分段*,并生成新分段。由于 `segmentGranularity` 为空,压缩后原始的段粒度将保持不变。要控制每个时间块的结果段数,可以设置 [`maxRowsPerSegment`](../Configuration/configuration.md#Coordinator) 或 [`numShards`](native.md#tuningconfig)。请注意,您可以同时运行多个压缩任务。例如,您可以每月运行12个compactionTasks,而不是一整年只运行一个任务。
|
||||||
|
|
||||||
|
压缩任务在内部生成 `index` 任务规范,用于使用某些固定参数执行的压缩工作。例如,它的 `inputSource` 始终是 [DruidInputSource](native.md#Druid输入源),`dimensionsSpec` 和 `metricsSpec` 默认包含输入段的所有Dimensions和Metrics。
|
||||||
|
|
||||||
|
如果指定的时间间隔中没有加载数据段(或者指定的时间间隔为空),则压缩任务将以失败状态代码退出,而不执行任何操作。
|
||||||
|
|
||||||
|
除非所有输入段具有相同的元数据,否则输出段可以具有与输入段不同的元数据。
|
||||||
|
|
||||||
|
* Dimensions: 由于Apache Druid支持schema更改,因此即使是同一个数据源的一部分,各个段之间的维度也可能不同。如果输入段具有不同的维度,则输出段基本上包括输入段的所有维度。但是,即使输入段具有相同的维度集,维度顺序或维度的数据类型也可能不同。例如,某些维度的数据类型可以从 `字符串` 类型更改为基本类型,或者可以更改维度的顺序以获得更好的局部性。在这种情况下,在数据类型和排序方面,最近段的维度先于旧段的维度。这是因为最近的段更有可能具有所需的新顺序和数据类型。如果要使用自己的顺序和类型,可以在压缩任务规范中指定自定义 `dimensionsSpec`。
|
||||||
|
* Roll-up: 仅当为所有输入段设置了 `rollup` 时,才会汇总输出段。有关详细信息,请参见 [rollup](ingestion.md#rollup)。您可以使用 [段元数据查询](../querying/segmentMetadata.md) 检查段是否已被rollup。
|
||||||
|
|
||||||
|
#### 压缩合并的IOConfig
|
||||||
|
压缩IOConfig需要指定 `inputSpec`,如下所示。
|
||||||
|
|
||||||
|
| 字段 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|
|
||||||
|
| `type` | 任务类型,固定为 `compact` | 是 |
|
||||||
|
| `inputSpec` | 输入规范 | 是 |
|
||||||
|
|
||||||
|
目前有两种支持的 `inputSpec`:
|
||||||
|
|
||||||
|
时间间隔 `inputSpec`:
|
||||||
|
|
||||||
|
| 字段 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|
|
||||||
|
| `type` | 任务类型,固定为 `interval` | 是 |
|
||||||
|
| `interval` | 需要合并压缩的时间间隔 | 是 |
|
||||||
|
|
||||||
|
段 `inputSpec`:
|
||||||
|
|
||||||
|
| 字段 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|
|
||||||
|
| `type` | 任务类型,固定为 `segments` | 是 |
|
||||||
|
| `segments` | 段ID列表 | 是 |
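
For illustration, a compaction `ioConfig` using the `segments` inputSpec might look like the sketch below. The segment IDs shown are hypothetical placeholders; in practice you would copy the exact IDs returned by the Coordinator metadata API for the datasource.

```json
"ioConfig" : {
  "type": "compact",
  "inputSpec": {
    "type": "segments",
    "segments": [
      "wikipedia_2017-01-01T00:00:00.000Z_2017-01-02T00:00:00.000Z_2019-05-01T00:00:00.000Z",
      "wikipedia_2017-01-02T00:00:00.000Z_2017-01-03T00:00:00.000Z_2019-05-01T00:00:00.000Z"
    ]
  }
}
```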
|
||||||
|
|
||||||
|
### 增加新的数据
|
||||||
|
|
||||||
|
Druid可以通过将新的段追加到现有的段集,来实现新数据插入到现有的数据源中。它还可以通过将现有段集与新数据合并并覆盖原始集来添加新数据。
|
||||||
|
|
||||||
|
Druid不支持按主键更新单个记录。
|
||||||
|
|
||||||
|
### 更新现有的数据
|
||||||
|
|
||||||
|
在数据源中摄取一段时间的数据并创建Apache Druid段之后,您可能需要对摄取的数据进行更改。有几种方法可以做到这一点。
|
||||||
|
|
||||||
|
#### 使用lookups
|
||||||
|
|
||||||
|
如果有需要经常更新值的维度,请首先尝试使用 [lookups](../querying/lookups.md)。lookups的一个典型用例是,在Druid段中存储一个ID维度,并希望将ID维度映射到一个人类可读的字符串值,该字符串值可能需要定期更新。
|
||||||
|
|
||||||
|
#### 重新摄取数据
|
||||||
|
|
||||||
|
如果基于lookups的技术还不够,您需要将想更新的时间块的数据重新索引到Druid中。这可以在覆盖模式(默认模式)下使用 [批处理摄取](ingestion.md#批量摄取) 方法之一来完成。它也可以使用 [流式摄取](ingestion.md#流式摄取) 来完成,前提是您先删除相关时间块的数据。
|
||||||
|
|
||||||
|
如果在批处理模式下进行重新摄取,Druid的原子更新机制意味着查询将从旧数据无缝地转换到新数据。
|
||||||
|
|
||||||
|
我们建议保留一份原始数据的副本,以防您需要重新摄取它。
|
||||||
|
|
||||||
|
#### 使用基于Hadoop的摄取
|
||||||
|
|
||||||
|
本节假设读者理解如何使用Hadoop进行批量摄取。有关详细信息,请参见 [Hadoop批处理摄取](hadoopbased.md)。Hadoop批量摄取可用于重新索引数据和增量摄取数据。
|
||||||
|
|
||||||
|
Druid使用 `ioConfig` 中的 `inputSpec` 来知道要接收的数据位于何处以及如何读取它。对于简单的Hadoop批接收,`static` 或 `granularity` 粒度规范类型允许您读取存储在深层存储中的数据。
|
||||||
|
|
||||||
|
还有其他类型的 `inputSpec` 可以启用重新索引数据和增量接收数据。
|
||||||
|
|
||||||
|
#### 使用原生批摄取重新索引
|
||||||
|
|
||||||
|
本节假设读者了解如何使用 [原生批处理索引](native.md) 而不使用Hadoop的情况下执行批处理摄取(使用 `inputSource` 知道在何处以及如何读取输入数据)。[`DruidInputSource`](native.md#Druid输入源) 可以用来从Druid内部的段读取数据。请注意,**IndexTask**只用于原型设计,因为它必须在一个进程内完成所有处理,并且无法扩展。对于处理超过1GB数据的生产方案,请使用Hadoop批量摄取。
|
||||||
|
|
||||||
|
### 删除数据
|
||||||
|
|
||||||
|
Druid支持永久的将标记为"unused"状态(详情可见架构设计中的 [段的生命周期](../design/Design.md#段生命周期))的段删除掉
|
||||||
|
|
||||||
|
杀死任务负责从元数据存储和深度存储中删除掉指定时间间隔内的不被使用的段
|
||||||
|
|
||||||
|
更多详细信息,可以看 [杀死任务](taskrefer.md#kill)
|
||||||
|
|
||||||
|
永久删除一个段需要两步:
|
||||||
|
1. 段必须首先标记为"未使用"。当用户通过Coordinator API手动禁用段时,就会发生这种情况
|
||||||
|
2. 在段被标记为"未使用"之后,一个Kill任务将从Druid的元数据存储和深层存储中删除任何“未使用”的段
|
||||||
|
|
||||||
|
对于数据保留规则的文档,可以详细看 [数据保留](../operations/retainingOrDropData.md)
|
||||||
|
|
||||||
|
对于通过Coordinator API来禁用段的文档,可以详细看 [Coordinator数据源API](../operations/api.md#coordinator)
|
||||||
|
|
||||||
|
A tutorial on deleting data is also included in this documentation; see the [data deletion tutorial](../tutorials/chapter-9.md).
|
||||||
|
|
||||||
|
### 杀死任务
|
||||||
|
|
||||||
|
**杀死任务**删除段的所有信息并将其从深层存储中删除。在Druid的段表中,要杀死的段必须是未使用的(used==0)。可用语法为:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type": "kill",
|
||||||
|
"id": <task_id>,
|
||||||
|
"dataSource": <task_datasource>,
|
||||||
|
"interval" : <all_segments_in_this_interval_will_die!>,
|
||||||
|
"context": <task context>
|
||||||
|
}
|
||||||
|
```
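
As a concrete instance of the syntax above, assuming a datasource named `wikipedia` whose segments in the interval have already been marked unused, a minimal kill task could look like:

```json
{
  "type": "kill",
  "dataSource": "wikipedia",
  "interval": "2015-09-12/2015-09-13"
}
```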
|
||||||
|
### 数据保留
|
||||||
|
|
||||||
|
Druid支持保留规则,这些规则用于定义数据应保留的时间间隔和应丢弃数据的时间间隔。
|
||||||
|
|
||||||
|
Druid还支持将Historical进程分成不同的层,并且可以将保留规则配置为将特定时间间隔的数据分配给特定的层。
|
||||||
|
|
||||||
|
这些特性对于性能/成本管理非常有用;一个常见的场景是将Historical进程分为"热(hot)"层和"冷(cold)"层。
|
||||||
|
|
||||||
|
有关详细信息,请参阅 [加载规则](../operations/retainingOrDropData.md)。
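
As a sketch of the hot/cold scenario described above, a rule chain for a datasource might keep the most recent month of data replicated on a `hot` tier and drop everything older. The tier names and replicant counts here are assumptions for illustration, not defaults:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "hot": 2, "_default_tier": 1 }
  },
  { "type": "dropForever" }
]
```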
|
87  DataIngestion/faq.md  (new file)
@@ -0,0 +1,87 @@
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## 数据摄取相关问题FAQ
|
||||||
|
### 实时摄取
|
||||||
|
|
||||||
|
The most common reason is that the events being ingested fall outside Druid's configured window period (`windowPeriod`). Druid real-time ingestion only accepts events within a configurable window around the current time. You can verify whether this is what is happening by looking at the logs of your real-time processes for lines containing `ingest/events/*`; these metrics report how many events were received, rejected, and so on.
|
||||||
|
|
||||||
|
我们建议对生产中的历史数据使用批量摄取方法。
|
||||||
|
|
||||||
|
### 批量摄取
|
||||||
|
|
||||||
|
如果尝试批量加载历史数据,但没有事件被加载到,请确保摄取规范的时间间隔实际上包含了数据的间隔。此间隔之外的事件将被删除。
|
||||||
|
|
||||||
|
### Druid支持什么样的数据类型
|
||||||
|
|
||||||
|
Druid可以摄取JSON、CSV、TSV和其他分隔数据。Druid支持一维值或多维值(字符串数组)。Druid支持long、float和double数值列。
|
||||||
|
|
||||||
|
### 并非所有的事件都被摄取了
|
||||||
|
|
||||||
|
Druid会拒绝时间窗口之外的事件, 确认事件是否被拒绝了的最佳方式是查看 [Druid摄取指标](../operations/metrics.md)
|
||||||
|
|
||||||
|
如果摄取的事件数似乎正确,请确保查询的格式正确。如果在摄取规范中包含 `count` 聚合器,则需要使用 `longSum` 聚合器查询此聚合的结果。使用count聚合器发出查询将计算Druid行的数量,包括 [rollup](ingestion.md#rollup)。
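
For example, a sketch of a native timeseries query that distinguishes the two numbers is shown below, assuming the datasource was ingested with a `count`-type metric named `count`: the `longSum` over `count` returns the number of ingested events, while the `count` aggregator returns the number of Druid rows after rollup.

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2013-08-31/2013-09-01"],
  "granularity": "all",
  "aggregations": [
    { "type": "longSum", "name": "ingested_events", "fieldName": "count" },
    { "type": "count", "name": "druid_rows" }
  ]
}
```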
|
||||||
|
|
||||||
|
### 摄取之后段存储在哪里
|
||||||
|
|
||||||
|
段的存储位置由 `druid.storage.type` 配置决定的,Druid会将段上传到 [深度存储](../design/Deepstorage.md)。 本地磁盘是默认的深度存储位置。
|
||||||
|
|
||||||
|
### 流摄取任务没有发生段切换递交
|
||||||
|
|
||||||
|
首先,确保摄取过程的日志中没有异常,如果运行的是分布式集群,还要确保 `druid.storage.type` 被设置为非本地的深度存储。
|
||||||
|
|
||||||
|
移交失败的其他常见原因如下:
|
||||||
|
|
||||||
|
1. Druid无法写入元数据存储,确保您的配置正确
|
||||||
|
2. Historical进程容量不足,无法再下载任何段。如果发生这种情况,您将在Coordinator日志中看到异常,Coordinator控制台将显示历史记录接近容量
|
||||||
|
3. 段已损坏,无法下载。如果发生这种情况,您将在Historical进程中看到异常
|
||||||
|
4. 深度存储配置不正确。确保您的段实际存在于深度存储中,并且Coordinator日志没有错误
|
||||||
|
|
||||||
|
### 如何让HDFS工作
|
||||||
|
|
||||||
|
确保在类路径中包含 `druid-hdfs-storage` 和所有的hadoop配置、依赖项(可以通过在安装了hadoop的计算机上运行 `hadoop classpath`命令获得)。并且,提供必要的HDFS设置,如 [深度存储](../design/Deepstorage.md) 中所述。
|
||||||
|
|
||||||
|
### 没有在Historical进程中看到Druid段
|
||||||
|
|
||||||
|
您可以查看位于 `<Coordinator_IP>:<PORT>` 的Coordinator控制台, 确保您的段实际上已加载到 [Historical进程](../design/Historical.md)中。如果段不存在,请检查Coordinator日志中有关复制错误容量的消息。不下载段的一个原因是,Historical进程的 `maxSize` 太小,使它们无法下载更多数据。您可以使用(例如)更改它:
|
||||||
|
|
||||||
|
```json
|
||||||
|
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]
|
||||||
|
-Ddruid.server.maxSize=500000000000
|
||||||
|
```
|
||||||
|
|
||||||
|
### 查询返回来了空结果
|
||||||
|
|
||||||
|
您可以对为数据源创建的dimension和metric使用段 [元数据查询](../querying/segmentMetadata.md)。确保您在查询中使用的聚合器的名称与这些metric之一匹配,还要确保指定的查询间隔与存在数据的有效时间范围匹配。
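
A minimal segment metadata query for checking which dimensions, metrics, and intervals actually exist might look like the following; the datasource name and interval are placeholders:

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "wikipedia",
  "intervals": ["2013-01-01/2014-01-01"]
}
```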
|
||||||
|
|
||||||
|
### schema变化时如何在Druid中重新索引现有数据
|
||||||
|
|
||||||
|
您可以将 [DruidInputSource](native.md#Druid输入源) 与 [并行任务](native.md#并行任务) 一起使用,以使用新schema摄取现有的druid段,并更改该段的name、dimensions、metrics、rollup等。有关详细信息,请参阅 [DruidInputSource](native.md#Druid输入源)。或者,如果使用基于hadoop的摄取,那么可以使用"dataSource"输入规范来重新编制索引。
|
||||||
|
|
||||||
|
有关详细信息,请参阅 [数据管理](datamanage.md) 页的 [更新现有数据](datamanage.md#更新现有的数据) 部分。
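
As a rough sketch (not a complete task spec), the `ioConfig` of a parallel reindexing task that reads from an existing datasource could look like the snippet below; the datasource name and interval are placeholders:

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "druid",
    "dataSource": "wikipedia",
    "interval": "2013-01-01/2013-02-01"
  }
}
```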
|
||||||
|
|
||||||
|
### How to change the segment granularity of existing data in Druid
|
||||||
|
|
||||||
|
在很多情况下,您可能希望降低旧数据的粒度。例如,任何超过1个月的数据都只有小时级别的粒度,而较新的数据只有分钟级别的粒度。此场景与重新索引相同。
|
||||||
|
|
||||||
|
为此,使用 [DruidInputSource](native.md#Druid输入源) 并运行一个 [并行任务](native.md#并行任务)。[DruidInputSource](native.md#Druid输入源) 将允许你从Druid中获取现有的段并将它们聚合并反馈给Druid。它还允许您在反馈数据时过滤这些段中的数据,这意味着,如果有要删除的行,可以在重新摄取期间将它们过滤掉。通常,上面的操作将作为一个批处理作业运行,即每天输入一大块数据并对其进行聚合。或者,如果使用基于hadoop的摄取,那么可以使用"dataSource"输入规范来重新编制索引。
|
||||||
|
|
||||||
|
有关详细信息,请参阅 [数据管理](datamanage.md) 页的 [更新现有数据](datamanage.md#更新现有的数据) 部分。
|
||||||
|
|
||||||
|
### 实时摄取似乎被卡住了
|
||||||
|
|
||||||
|
There are a few ways this can happen. Druid will throttle ingestion to prevent out-of-memory problems if intermediate persists are taking too long or if hand-off is taking too long. If your process logs indicate that certain columns are taking a very long time to build (for example, if your segment granularity is hourly but creating a single column takes 30 minutes), you should re-evaluate your configuration or scale up your real-time ingestion.
|
||||||
|
|
||||||
|
### 更多信息
|
||||||
|
|
||||||
|
对于第一次使用Druid的用户来说,将数据输入Druid是非常困难的。请不要犹豫,在我们的IRC频道或在我们的 [google群组](https://groups.google.com/forum/#!forum/druid-user) 页面上提问。
|
489  DataIngestion/hadoopbased.md  (new file)
@@ -0,0 +1,489 @@
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## 基于Hadoop的摄入
|
||||||
|
|
||||||
|
Apache Druid当前支持通过一个Hadoop摄取任务来支持基于Apache Hadoop的批量索引任务, 这些任务被提交到 [Druid Overlord](../design/Overlord.md)的一个运行实例上。详情可以查看 [基于Hadoop的摄取vs基于本地批摄取的对比](ingestion.md#批量摄取) 来了解基于Hadoop的摄取、本地简单批摄取、本地并行摄取三者的比较。
|
||||||
|
|
||||||
|
运行一个基于Hadoop的批量摄取任务,首先需要编写一个如下的摄取规范, 然后提交到Overlord的 [`druid/indexer/v1/task`](../operations/api.md#overlord) 接口,或者使用Druid软件包中自带的 `bin/post-index-task` 脚本。
|
||||||
|
|
||||||
|
### 教程
|
||||||
|
|
||||||
|
本章包括了基于Hadoop摄取的参考文档,对于粗略的查看,可以查看 [从Hadoop加载数据](../GettingStarted/chapter-3.md) 教程。
|
||||||
|
|
||||||
|
### 任务符号
|
||||||
|
|
||||||
|
以下为一个示例任务:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type" : "index_hadoop",
|
||||||
|
"spec" : {
|
||||||
|
"dataSchema" : {
|
||||||
|
"dataSource" : "wikipedia",
|
||||||
|
"parser" : {
|
||||||
|
"type" : "hadoopyString",
|
||||||
|
"parseSpec" : {
|
||||||
|
"format" : "json",
|
||||||
|
"timestampSpec" : {
|
||||||
|
"column" : "timestamp",
|
||||||
|
"format" : "auto"
|
||||||
|
},
|
||||||
|
"dimensionsSpec" : {
|
||||||
|
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
|
||||||
|
"dimensionExclusions" : [],
|
||||||
|
"spatialDimensions" : []
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"metricsSpec" : [
|
||||||
|
{
|
||||||
|
"type" : "count",
|
||||||
|
"name" : "count"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type" : "doubleSum",
|
||||||
|
"name" : "added",
|
||||||
|
"fieldName" : "added"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type" : "doubleSum",
|
||||||
|
"name" : "deleted",
|
||||||
|
"fieldName" : "deleted"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type" : "doubleSum",
|
||||||
|
"name" : "delta",
|
||||||
|
"fieldName" : "delta"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"granularitySpec" : {
|
||||||
|
"type" : "uniform",
|
||||||
|
"segmentGranularity" : "DAY",
|
||||||
|
"queryGranularity" : "NONE",
|
||||||
|
"intervals" : [ "2013-08-31/2013-09-01" ]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"ioConfig" : {
|
||||||
|
"type" : "hadoop",
|
||||||
|
"inputSpec" : {
|
||||||
|
"type" : "static",
|
||||||
|
"paths" : "/MyDirectory/example/wikipedia_data.json"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"tuningConfig" : {
|
||||||
|
"type": "hadoop"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"hadoopDependencyCoordinates": <my_hadoop_version>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
| 属性 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|
|
||||||
|
| `type` | 任务类型,应该总是 `index_hadoop` | 是 |
|
||||||
|
| `spec` | Hadoop索引任务规范。 详见 [ingestion](ingestion.md) | 是 |
|
||||||
|
| `hadoopDependencyCoordinates` | Druid使用的Hadoop依赖,这些属性会覆盖默认的Hadoop依赖。 如果该值被指定,Druid将在 `druid.extensions.hadoopDependenciesDir` 目录下查找指定的Hadoop依赖 | 否 |
|
||||||
|
| `classpathPrefix` | 为Peon进程准备的类路径。| 否 |
|
||||||
|
|
||||||
|
还要注意,Druid会自动计算在Hadoop集群中运行的Hadoop作业容器的类路径。但是,如果Hadoop和Druid的依赖项之间发生冲突,可以通过设置 `druid.extensions.hadoopContainerDruidClasspath`属性。请参阅 [基本druid配置中的扩展配置](../Configuration/configuration.md#扩展) 。
|
||||||
|
#### `dataSchema`
|
||||||
|
|
||||||
|
该字段是必须的。 详情可以查看摄取页中的 [`dataSchema`](ingestion.md#dataschema) 部分来看它应该包括哪些部分。
|
||||||
|
|
||||||
|
#### `ioConfig`
|
||||||
|
|
||||||
|
该字段是必须的。
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `type` | String | 应该总是 `hadoop` | 是 |
|
||||||
|
| `inputSpec` | Object | 指定从哪里拉数据。详情见以下。 | 是 |
|
||||||
|
| `segmentOutputPath` | String | 将段转储到的路径 | 仅仅在 [命令行Hadoop索引](#命令行版本) 中使用, 否则该字段必须为null |
|
||||||
|
| `metadataUpdateSpec` | Object | 关于如何更新这些段所属的druid集群的元数据的规范 | 仅仅在 [命令行Hadoop索引](#命令行版本) 中使用, 否则该字段必须为null |
|
||||||
|
|
||||||
|
##### `inputSpec`
|
||||||
|
|
||||||
|
There are multiple types of `inputSpec`:
|
||||||
|
|
||||||
|
**`static`**
|
||||||
|
|
||||||
|
一种`inputSpec`的类型,该类型提供数据文件的静态路径。
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `inputFormat` | String | 指定要使用的Hadoop输入格式的类,比如 `org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat` | 否 |
|
||||||
|
| `paths` | String数组 | 标识原始数据位置的输入路径的字符串 | 是 |
|
||||||
|
|
||||||
|
例如,以下例子使用了静态输入路径:
|
||||||
|
|
||||||
|
```json
|
||||||
|
"paths" : "hdfs://path/to/data/is/here/data.gz,hdfs://path/to/data/is/here/moredata.gz,hdfs://path/to/data/is/here/evenmoredata.gz"
|
||||||
|
```
|
||||||
|
|
||||||
|
It is also possible to read data directly from cloud storage, such as AWS S3 or Google Cloud Storage, provided the necessary library dependencies are first installed on the classpath of *all Druid MiddleManager or Indexer processes*. For S3, the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) can be installed with the following commands:
|
||||||
|
|
||||||
|
```json
|
||||||
|
java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
|
||||||
|
cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
|
||||||
|
```
|
||||||
|
|
||||||
|
一旦在所有的MiddleManager和Indexer进程中安装了Hadoop AWS模块,即可将S3路径放到 `inputSpec` 中,同时需要有任务属性。 对于更多配置,可以查看 [Hadoop AWS 模块](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)
|
||||||
|
|
||||||
|
```json
|
||||||
|
"paths" : "s3a://billy-bucket/the/data/is/here/data.gz,s3a://billy-bucket/the/data/is/here/moredata.gz,s3a://billy-bucket/the/data/is/here/evenmoredata.gz"
|
||||||
|
```
|
||||||
|
|
||||||
|
```json
|
||||||
|
"jobProperties" : {
|
||||||
|
"fs.s3a.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem",
|
||||||
|
"fs.AbstractFileSystem.s3a.impl" : "org.apache.hadoop.fs.s3a.S3A",
|
||||||
|
"fs.s3a.access.key" : "YOUR_ACCESS_KEY",
|
||||||
|
"fs.s3a.secret.key" : "YOUR_SECRET_KEY"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
对于谷歌云存储,需要将 [GCS connector jar](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md) 安装到*所有MiddleManager或者Indexer进程*的 `${DRUID_HOME}/hadoop-dependencies`。 一旦在所有的MiddleManager和Indexer进程中安装了GCS连接器jar包,即可将谷歌云存储路径放到 `inputSpec` 中,同时需要有任务属性。对于更多配置,可以查看 [instructions to configure Hadoop](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop), [GCS core default](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/conf/gcs-core-default.xml) 和 [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).
|
||||||
|
|
||||||
|
```json
|
||||||
|
"paths" : "gs://billy-bucket/the/data/is/here/data.gz,gs://billy-bucket/the/data/is/here/moredata.gz,gs://billy-bucket/the/data/is/here/evenmoredata.gz"
|
||||||
|
```
|
||||||
|
```json
|
||||||
|
"jobProperties" : {
|
||||||
|
"fs.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
|
||||||
|
"fs.AbstractFileSystem.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**`granularity`**
|
||||||
|
|
||||||
|
一种`inputSpec`类型,该类型期望数据已经按照日期时间组织到对应的目录中,路径格式为: `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (其中日期用小写表示,时间用大写表示)。
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `dataGranularity` | String | 指定期望的数据粒度,例如,hour意味着期望的目录格式为: `y=XXXX/m=XX/d=XX/H=XX` | 是 |
|
||||||
|
| `inputFormat` | String | 指定要使用的Hadoop输入格式的类,比如 `org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat` | 否 |
|
||||||
|
| `inputPath` | String | 要将日期时间路径附加到的基路径。| 是 |
|
||||||
|
| `filePattern` | String | 要包含的文件应匹配的模式 | 是 |
|
||||||
|
| `pathFormat` | String | 每个目录的Joda datetime目录。 默认值为: `"'y'=yyyy/'m'=MM/'d'=dd/'H'=HH"` ,详情可以看 [Joda文档](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) | 否 |
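
Putting the fields above together, a `granularity` inputSpec might look like the following sketch; the bucket path and file pattern are placeholders, and the default `pathFormat` is assumed:

```json
"inputSpec": {
  "type": "granularity",
  "dataGranularity": "hour",
  "inputPath": "s3n://billy-bucket/the/data/is/here",
  "filePattern": ".*\\.gz"
}
```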
|
||||||
|
|
||||||
|
例如, 如果示例配置具有 2012-06-01/2012-06-02 时间间隔,则数据期望的路径是:
|
||||||
|
|
||||||
|
```json
|
||||||
|
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
|
||||||
|
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
|
||||||
|
...
|
||||||
|
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
|
||||||
|
```
|
||||||
|
|
||||||
|
**`dataSource`**
|
||||||
|
|
||||||
|
一种`inputSpec`的类型, 该类型读取已经存储在Druid中的数据。 该类型被用来"re-indexing"(重新索引)数据和下边描述 `multi` 类型 `inputSpec` 的 "delta-ingestion"(增量摄取)。
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `type` | String | 应该总是 `dataSource` | 是 |
|
||||||
|
| `ingestionSpec` | JSON对象 | 要加载的Druid段的规范。详情见下边内容。 | 是 |
|
||||||
|
| `maxSplitSize` | Number | 允许根据段的大小将多个段合并为单个Hadoop InputSplit。使用-1,druid根据用户指定的映射任务数计算最大拆分大小(`mapred.map.tasks` 或者 `mapreduce.job.maps`). 默认情况下,对一个段进行一次拆分。`maxSplitSize` 以字节为单位指定。 | 否 |
|
||||||
|
| `useNewAggs` | Boolean | 如果"false",则hadoop索引任务的"metricsSpec"中的聚合器列表必须与接收原始数据时在原始索引任务中使用的聚合器列表相同。默认值为"false"。当"inputSpec"类型为"dataSource"而不是"multi"时,可以将此字段设置为"true",以便在重新编制索引时启用任意聚合器。请参阅下面的"multi"类型增量摄取支持。| 否 |
|
||||||
|
|
||||||
|
下表中为`ingestionSpec`中的一些选项:
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `dataSource` | String | Druid数据源名称,从该数据源读取数据 | 是 |
|
||||||
|
| `intervals` | List | ISO-8601时间间隔的字符串List | 是 |
|
||||||
|
| `segments` | List | 从中读取数据的段的列表,默认情况下自动获取。您可以通过向Coordinator的接口 `/druid/Coordinator/v1/metadata/datasources/segments?full` 进行POST查询来获取要放在这里的段列表。例如["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000","2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]. 您可能希望手动提供此列表,以确保读取的段与任务提交时的段完全相同,如果用户提供的列表与任务实际运行时的数据库状态不匹配,则任务将失败 | 否 |
|
||||||
|
| `filter` | JSON | 查看 [Filter](../querying/filters.md) | 否 |
|
||||||
|
| `dimensions` | String数组 | 要加载的维度列的名称。默认情况下,列表将根据 `parseSpec` 构造。如果 `parseSpec` 没有维度的显式列表,则将读取存储数据中的所有维度列。 | 否 |
|
||||||
|
| `metrics` | String数组 | 要加载的Metric列的名称。默认情况下,列表将根据所有已配置聚合器的"name"构造。 | 否 |
|
||||||
|
| `ignoreWhenNoSegments` | boolean | 如果找不到段,是否忽略此 `ingestionSpec`。默认行为是在找不到段时引发错误。| 否 |
|
||||||
|
|
||||||
|
示例:
|
||||||
|
|
||||||
|
```json
|
||||||
|
"ioConfig" : {
|
||||||
|
"type" : "hadoop",
|
||||||
|
"inputSpec" : {
|
||||||
|
"type" : "dataSource",
|
||||||
|
"ingestionSpec" : {
|
||||||
|
"dataSource": "wikipedia",
|
||||||
|
"intervals": ["2014-10-20T00:00:00Z/P2W"]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
...
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**`multi`**
|
||||||
|
|
||||||
|
这是一个组合类型的 `inputSpec`, 来组合其他 `inputSpec`。此inputSpec用于增量接收。您还可以使用一个 `multi` 类型的inputSpec组合来自多个数据源的数据。但是,每个特定的数据源只能指定一次。注意,"useNewAggs"必须设置为默认值false以支持增量摄取。
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `children` | JSON对象数组 | 一个JSON对象List,里边包含了其他类型的inputSpec | 是 |
|
||||||
|
|
||||||
|
示例:
|
||||||
|
|
||||||
|
```json
|
||||||
|
"ioConfig" : {
|
||||||
|
"type" : "hadoop",
|
||||||
|
"inputSpec" : {
|
||||||
|
"type" : "multi",
|
||||||
|
"children": [
|
||||||
|
{
|
||||||
|
"type" : "dataSource",
|
||||||
|
"ingestionSpec" : {
|
||||||
|
"dataSource": "wikipedia",
|
||||||
|
"intervals": ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"],
|
||||||
|
"segments": [
|
||||||
|
{
|
||||||
|
"dataSource": "test1",
|
||||||
|
"interval": "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000",
|
||||||
|
"version": "v2",
|
||||||
|
"loadSpec": {
|
||||||
|
"type": "local",
|
||||||
|
"path": "/tmp/index1.zip"
|
||||||
|
},
|
||||||
|
"dimensions": "host",
|
||||||
|
"metrics": "visited_sum,unique_hosts",
|
||||||
|
"shardSpec": {
|
||||||
|
"type": "none"
|
||||||
|
},
|
||||||
|
"binaryVersion": 9,
|
||||||
|
"size": 2,
|
||||||
|
"identifier": "test1_2000-01-01T00:00:00.000Z_3000-01-01T00:00:00.000Z_v2"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type" : "static",
|
||||||
|
"paths": "/path/to/more/wikipedia/data/"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
...
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**强烈建议显式**地在 `dataSource` 中的 `inputSpec` 中提供段列表,以便增量摄取任务是幂等的。您可以通过对Coordinator进行以下调用来获取该段列表,POST `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full`, 请求体:[interval1,interval2,…], 例如["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000","2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]
|
||||||
|
|
||||||
|
#### `tuningConfig`
|
||||||
|
|
||||||
|
`tuningConfig` 是一个可选项,如果未指定的话,则使用默认的参数。
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `workingPath` | String | 用于存储中间结果(Hadoop作业之间的结果)的工作路径 | 该配置仅仅使用在 [命令行Hadoop索引](#命令行版本) ,默认值为: `/tmp/druid-indexing`, 否则该值必须设置为null |
|
||||||
|
| `version` | String | 创建的段的版本。 对于Hadoop索引任务一般是忽略的,除非 `useExplicitVersion` 被设置为 `true` | 否(默认为索引任务开始的时间) |
|
||||||
|
| `partitionsSpec` | Object | 指定如何将时间块内的分区为段。缺少此属性意味着不会发生分区。 详情可见 [`partitionsSpec`](#partitionsspec) | 否(默认为 `hashed`) |
|
||||||
|
| `maxRowsInMemory` | Integer | 在持久化之前在堆内存中聚合的行数。注意:由于rollup操作,该值是聚合后的行数,可能不等于输入的行数。 该值常用来管理需要的JVM堆内存大小。通常情况下,用户并不需要设置该值,而是依赖数据自身。 如果数据是非常小的,用户希望在内存存储上百万行数据的话,则需要设置该值。 | 否(默认为:1000000)|
|
||||||
|
| `maxBytesInMemory` | Long | 在持久化之前在堆内存中聚合的字节数。通常这是在内部计算的,用户不需要设置它。此值表示在持久化之前要在堆内存中聚合的字节数。这是基于对内存使用量的粗略估计,而不是实际使用量。用于索引的最大堆内存使用量为 `maxBytesInMemory *(2 + maxPendingResistent)` | 否(默认为:最大JVM内存的1/6)|
|
||||||
|
| `leaveIntermediate` | Boolean | 作业完成时,不管通过还是失败,都在工作路径中留下中间文件(用于调试)。 | 否(默认为false)|
|
||||||
|
| `cleanupOnFailure` | Boolean | 当任务失败时清理中间文件(除非 `leaveIntermediate` 设置为true) | 否(默认为true)|
|
||||||
|
| `overwriteFiles` | Boolean | 在索引过程中覆盖找到的现存文件 | 否(默认为false)|
|
||||||
|
| `ignoreInvalidRows` | Boolean | **已废弃**。忽略发现有问题的行。如果为false,解析过程中遇到的任何异常都将引发并停止摄取;如果为true,将跳过不可解析的行和字段。如果定义了 `maxParseExceptions`,则忽略此属性。 | 否(默认为false)|
|
||||||
|
| `combineText` | Boolean | 使用CombineTextInputFormat将多个文件合并为一个文件拆分。这可以在处理大量小文件时加快Hadoop作业的速度。 | 否(默认为false)|
|
||||||
|
| `useCombiner` | Boolean | 如果可能的话,使用Hadoop Combiner在mapper阶段合并行 | 否(默认为false)|
|
||||||
|
| `jobProperties` | Object | 增加到Hadoop作业配置的属性map,详情见下边。 | 否(默认为null)|
|
||||||
|
| `indexSpec` | Object | 调整数据如何被索引。 详细信息可以见位于摄取页的 [`indexSpec`](ingestion.md#tuningConfig) | 否 |
|
||||||
|
| `indexSpecForIntermediatePersists` | Object | 定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的dimension/metric压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,因为可能在它们被合并到发布的最终段之前使用它们,有关可能的值,请参阅 [`indexSpec`](ingestion.md#tuningConfig)。 | 否(默认与indexSpec一样)|
|
||||||
|
| `numBackgroundPersistThreads` | Integer | 用于增量持久化的新后台线程数。使用此功能会显著增加内存压力和CPU使用率,但会使任务更快完成。如果从默认值0(对持久性使用当前线程)更改,建议将其设置为1。 | 否(默认为0)|
|
||||||
|
| `forceExtendableShardSpecs` | Boolean | 强制使用可扩展的shardSpec。基于哈希的分区总是使用可扩展的shardSpec。对于单维分区,此选项应设置为true以使用可扩展shardSpec。对于分区,请检查 [分区规范](#partitionsspec) | 否(默认为false)|
|
||||||
|
| `useExplicitVersion` | Boolean | 强制HadoopIndexTask使用version | 否(默认为false)|
|
||||||
|
| `logParseExceptions` | Boolean | 如果为true,则在发生解析异常时记录错误消息,其中包含有关发生错误的行的信息。| 否(默认为false)|
|
||||||
|
| `maxParseExceptions` | Integer | 任务停止接收并失败之前可能发生的最大分析异常数。如果设置了`reportParseExceptions`,则该配置被覆盖。 | 否(默认为unlimited)|
|
||||||
|
| `useYarnRMJobStatusFallback` | Boolean | 如果索引任务创建的Hadoop作业无法从JobHistory服务器检索其完成状态,并且此参数为true,则索引任务将尝试从 `http://<yarn rm address>/ws/v1/cluster/apps/<application id>` 获取应用程序状态,其中 `<yarn rm address>` 是Hadoop配置中 `yarn.resourcemanager.webapp.address` 的地址。此标志用于索引任务的作业成功但JobHistory服务器不可用的情况下的回退,从而导致索引任务失败,因为它无法确定作业状态。 | 否(默认为true)|
|
||||||
|
|
||||||
|
##### `jobProperties`
|
||||||
|
|
||||||
|
```json
|
||||||
|
"tuningConfig" : {
|
||||||
|
"type": "hadoop",
|
||||||
|
"jobProperties": {
|
||||||
|
"<hadoop-property-a>": "<value-a>",
|
||||||
|
"<hadoop-property-b>": "<value-b>"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
Hadoop的 [MapReduce文档](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) 列出来了所有可能的配置参数。
|
||||||
|
|
||||||
|
在一些Hadoop分布式环境中,可能需要设置 `mapreduce.job.classpath` 或者 `mapreduce.job.user.classpath.first` 来避免类加载相关的问题。 更多详细信息可以参见 [使用不同Hadoop版本的文档](../operations/other-hadoop.md)
|
||||||
|
|
||||||
|
#### `partitionsSpec`
|
||||||
|
|
||||||
|
段总是基于时间戳进行分区(根据 `granularitySpec`),并且可以根据分区类型以其他方式进一步分区。Druid支持两种类型的分区策略:`hashed`(基于每行中所有维度的hash)和 `single_dim`(基于单个维度的范围)。
|
||||||
|
|
||||||
|
在大多数情况下,建议使用哈希分区,因为相对于单一维度分区,哈希分区将提高索引性能并创建更统一大小的数据段。
|
||||||
|
|
||||||
|
##### 基于哈希的分区
|
||||||
|
|
||||||
|
```json
|
||||||
|
"partitionsSpec": {
|
||||||
|
"type": "hashed",
|
||||||
|
"targetRowsPerSegment": 5000000
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
哈希分区的工作原理是首先选择多个段,然后根据每一行中所有维度的哈希对这些段中的行进行分区。段的数量是根据输入集的基数和目标分区大小自动确定的。
|
||||||
|
|
||||||
|
配置项为:
|
||||||
|
|
||||||
|
| 字段 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|
|
||||||
|
| `type` | 使用的partitionsSpec的类型 | "hashed" |
|
||||||
|
| `targetRowsPerSegment` | 要包含在分区中的目标行数,应为500MB~1GB段的数。如果未设置 `numShards` ,则默认为5000000。 | 为该配置或者 `numShards` |
|
||||||
|
| `targetPartitionSize` | 已弃用。重命名为`targetRowsPerSegment`。要包含在分区中的目标行数,应为500MB~1GB段的数。 | 为该配置或者 `numShards` |
|
||||||
|
| `maxRowsPerSegment` | Deprecated. Renamed to `targetRowsPerSegment`. The target number of rows to include in a partition, sized so that segments end up at roughly 500MB–1GB. | Either this or `numShards` |
|
||||||
|
| `numShards` | 直接指定分区数,而不是目标分区大小。摄取将运行得更快,因为它可以跳过自动选择多个分区所需的步骤。| 为该配置或者 `maxRowsPerSegment` |
|
||||||
|
| `partitionDimensions` | 要划分的维度。留空可选择所有维度。仅与`numShard` 一起使用,在设置 `targetRowsPerSegment` 时将被忽略。| 否 |
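
If you prefer to fix the number of partitions up front instead of targeting a row count, a sketch using the `numShards` and `partitionDimensions` fields described above might look like this (the `host` dimension is a placeholder):

```json
"partitionsSpec": {
  "type": "hashed",
  "numShards": 10,
  "partitionDimensions": ["host"]
}
```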
|
||||||
|
|
||||||
|
|
||||||
|
##### 单一维度范围分区
|
||||||
|
|
||||||
|
```json
|
||||||
|
"partitionsSpec": {
|
||||||
|
"type": "single_dim",
|
||||||
|
"targetRowsPerSegment": 5000000
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
单一维度范围分区的工作原理是首先选择要分区的维度,然后将该维度分隔成连续的范围,每个段将包含该维度值在该范围内的所有行。例如,可以在维度"host"上对段进行分区,范围为"a.example.com"到"f.example.com"和"f.example.com"到"z.example.com"。 默认情况下,将自动确定要使用的维度,但可以使用特定维度替代它。
|
||||||
|
|
||||||
|
配置项为:
|
||||||
|
|
||||||
|
| 字段 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|
|
||||||
|
| `type` | 使用的partitionsSpec的类型 | "single_dim" |
|
||||||
|
| `targetRowsPerSegment` | 要包含在分区中的目标行数,应为500MB~1GB段的数。 | 是 |
|
||||||
|
| `targetPartitionSize` | 已弃用。重命名为`targetRowsPerSegment`。要包含在分区中的目标行数,应为500MB~1GB段的数。 | 否 |
|
||||||
|
| `maxRowsPerSegment` | 要包含在分区中的最大行数。默认值为比`targetRowsPerSegment` 大50%。 | 否 |
|
||||||
|
| `maxPartitionSize` | 已弃用。请改用 `maxRowsPerSegment`。要包含在分区中的最大行数, 默认为比 `targetPartitionSize` 大50%。 | 否 |
|
||||||
|
| `partitionDimension` | 要分区的维度。留空可自动选择维度。 | 否 |
|
||||||
|
| `assumeGrouped` | 假设输入数据已经按时间和维度分组。摄取将运行得更快,但如果违反此假设,则可能会选择次优分区。 | 否 |
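
For instance, to force range partitioning on a specific dimension such as `host` (a placeholder name), a sketch combining the options above could be:

```json
"partitionsSpec": {
  "type": "single_dim",
  "targetRowsPerSegment": 5000000,
  "partitionDimension": "host",
  "assumeGrouped": false
}
```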
|
||||||
|
|
||||||
|
### 远程Hadoop集群
|
||||||
|
|
||||||
|
如果已经有了一个远程的Hadoop集群,确保在Druid的 `_common` 配置目录中包含 `*.xml` 文件。
|
||||||
|
|
||||||
|
如果Hadoop与Druid的版本存在依赖等问题,请查看 [这些文档](../operations/other-hadoop.md)
|
||||||
|
|
||||||
|
### Elastic MapReduce
|
||||||
|
|
||||||
|
如果集群运行在AWS上,可以使用Elastic MapReduce(EMR)来从S3中索引数据。需要以下几步:
|
||||||
|
|
||||||
|
* 创建一个 [持续运行的集群](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-longrunning-transient.html)
|
||||||
|
* 创建集群时,请输入以下配置。如果使用向导,则应在"编辑软件设置"下处于高级模式:
|
||||||
|
|
||||||
|
```json
|
||||||
|
classification=yarn-site,properties=[mapreduce.reduce.memory.mb=6144,mapreduce.reduce.java.opts=-server -Xms2g -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.map.memory.mb=758,mapreduce.map.java.opts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.task.timeout=1800000]
|
||||||
|
```
|
||||||
|
* 按照 [Hadoop连接配置](../tutorials/img/chapter-4.md#Hadoop连接配置) 指导,使用EMR master中 `/etc/hadoop/conf` 的XML文件。
|
||||||
|
|
||||||
|
### Kerberized Hadoop集群
|
||||||
|
|
||||||
|
默认情况下,druid可以使用本地kerberos密钥缓存中现有的TGT kerberos票证。虽然TGT票证的生命周期有限,但您需要定期调用 `kinit` 命令以确保TGT票证的有效性。为了避免这个额外的外部cron作业脚本周期性地调用 `kinit`,您可以提供主体名称和keytab位置,druid将在启动和作业启动时透明地执行身份验证。
|
||||||
|
|
||||||
|
| 属性 | 可能的值 |
|
||||||
|
|-|-|
|
||||||
|
| `druid.hadoop.security.kerberos.principal` | `druid@EXAMPLE.COM` |
|
||||||
|
| `druid.hadoop.security.kerberos.keytab` | `/etc/security/keytabs/druid.headlessUser.keytab` |
|
||||||
|
|
||||||
|
#### 从具有EMR的S3加载
|
||||||
|
|
||||||
|
* 在Hadoop索引任务中 `tuningConfig` 部分的 `jobProperties` 字段中添加一下内容:
|
||||||
|
|
||||||
|
```json
|
||||||
|
"jobProperties" : {
|
||||||
|
"fs.s3.awsAccessKeyId" : "YOUR_ACCESS_KEY",
|
||||||
|
"fs.s3.awsSecretAccessKey" : "YOUR_SECRET_KEY",
|
||||||
|
"fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
|
||||||
|
"fs.s3n.awsAccessKeyId" : "YOUR_ACCESS_KEY",
|
||||||
|
"fs.s3n.awsSecretAccessKey" : "YOUR_SECRET_KEY",
|
||||||
|
"fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
|
||||||
|
"io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
注意,此方法使用Hadoop的内置S3文件系统,而不是Amazon的EMRFS,并且与Amazon的特定功能(如S3加密和一致视图)不兼容。如果您需要使用这些特性,那么您将需要通过 [其他Hadoop发行版](#使用其他的Hadoop) 一节中描述的机制之一,使Amazon EMR Hadoop JARs对Druid可用。
|
||||||
|
|
||||||
|
### 使用其他的Hadoop
|
||||||
|
|
||||||
|
Druid在许多Hadoop发行版中都是开箱即用的。
|
||||||
|
|
||||||
|
如果Druid与您当前使用的Hadoop版本发生依赖冲突时,您可以尝试在 [Druid用户组](https://groups.google.com/forum/#!forum/druid-user) 中搜索解决方案, 或者阅读 [Druid不同版本Hadoop文档](../operations/other-hadoop.md)
|
||||||
|
|
||||||
|
### 命令行版本
|
||||||
|
|
||||||
|
运行:
|
||||||
|
|
||||||
|
```json
|
||||||
|
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> org.apache.druid.cli.Main index hadoop <spec_file>
|
||||||
|
```
|
||||||
|
#### 可选项
|
||||||
|
|
||||||
|
* "--coordinate" - 提供要使用的Apache Hadoop版本。此属性将覆盖默认的Hadoop。一旦指定,Apache Druid将从 `druid.extensions.hadoopDependenciesDir` 位置寻找Hadoop依赖。
|
||||||
|
* "--no-default-hadoop" - 不要下拉默认的hadoop版本
|
||||||
|
|
||||||
|
#### 规范文件
|
||||||
|
|
||||||
|
spec文件需要包含一个JSON对象,其中的内容与Hadoop索引任务中的"spec"字段相同。有关规范格式的详细信息,请参见 [Hadoop批处理摄取](hadoopbased.md)。
|
||||||
|
|
||||||
|
另外, `metadataUpdateSpec` 和 `segmentOutputPath` 字段需要被添加到ioConfig中:
|
||||||
|
```json
|
||||||
|
"ioConfig" : {
|
||||||
|
...
|
||||||
|
"metadataUpdateSpec" : {
|
||||||
|
"type":"mysql",
|
||||||
|
"connectURI" : "jdbc:mysql://localhost:3306/druid",
|
||||||
|
"password" : "druid",
|
||||||
|
"segmentTable" : "druid_segments",
|
||||||
|
"user" : "druid"
|
||||||
|
},
|
||||||
|
"segmentOutputPath" : "/MyDirectory/data/index/output"
|
||||||
|
},
|
||||||
|
```
|
||||||
|
同时, `workingPath` 字段需要被添加到tuningConfig:
|
||||||
|
```json
|
||||||
|
"tuningConfig" : {
|
||||||
|
...
|
||||||
|
"workingPath": "/tmp",
|
||||||
|
...
|
||||||
|
}
|
||||||
|
```
|
||||||
|
**Metadata Update Job Spec**
|
||||||
|
|
||||||
|
这是一个属性规范,告诉作业如何更新元数据,以便Druid集群能够看到输出段并加载它们。
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `type` | String | "metadata"是唯一可用的值 | 是 |
|
||||||
|
| `connectURI` | String | 连接元数据存储的可用的JDBC | 是 |
|
||||||
|
| `user` | String | DB的用户名 | 是 |
|
||||||
|
| `password` | String | DB的密码 | 是 |
|
||||||
|
| `segmentTable` | String | DB中使用的表 | 是 |
|
||||||
|
|
||||||
|
这些属性应该模仿您为 [Coordinator](../design/Coordinator.md) 配置的内容。
|
||||||
|
|
||||||
|
**segmentOutputPath配置**
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `segmentOutputPath` | String | 将段转储到的路径 | 是 |
|
||||||
|
|
||||||
|
**workingPath配置**
|
||||||
|
|
||||||
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|
|-|-|-|-|
|
||||||
|
| `workingPath` | String | 用于中间结果(Hadoop作业之间的结果)的工作路径。 | 否(默认为 `/tmp/druid-indexing` )|
|
||||||
|
|
||||||
|
请注意,命令行Hadoop indexer不具备索引服务的锁定功能,因此如果选择使用它,则必须注意不要覆盖由实时处理创建的段(如果设置了实时管道)。
|
541  DataIngestion/ingestion.md  (new file)
@@ -0,0 +1,541 @@
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## 数据摄入
|
||||||
|
### 综述
|
||||||
|
|
||||||
|
Druid中的所有数据都被组织成*段*,这些段是数据文件,通常每个段最多有几百万行。在Druid中加载数据称为*摄取或索引*,它包括从源系统读取数据并基于该数据创建段。
|
||||||
|
|
||||||
|
在大多数摄取方法中,加载数据的工作由Druid [MiddleManager](../design/MiddleManager.md) 进程(或 [Indexer](../design/Indexer.md) 进程)完成。一个例外是基于Hadoop的摄取,这项工作是使用Hadoop MapReduce作业在YARN上完成的(尽管MiddleManager或Indexer进程仍然参与启动和监视Hadoop作业)。一旦段被生成并存储在 [深层存储](../design/Deepstorage.md) 中,它们将被Historical进程加载。有关如何在引擎下工作的更多细节,请参阅Druid设计文档的[存储设计](../design/Design.md) 部分。
|
||||||
|
|
||||||
|
### 如何使用本文档
|
||||||
|
|
||||||
|
您**当前正在阅读的这个页面**提供了通用Druid摄取概念的信息,以及 [所有摄取方法](#摄入方式) **通用的配置**信息。
|
||||||
|
|
||||||
|
**每个摄取方法的单独页面**提供了有关每个摄取方法**独有的概念和配置**的附加信息。
|
||||||
|
|
||||||
|
我们建议您先阅读(或至少略读)这个通用页面,然后参考您选择的一种或多种摄取方法的页面。
|
||||||
|
|
||||||
|
### 摄入方式
|
||||||
|
|
||||||
|
下表列出了Druid最常用的数据摄取方法,帮助您根据自己的情况选择最佳方法。每个摄取方法都支持自己的一组源系统。有关每个方法如何工作的详细信息以及特定于该方法的配置属性,请查看其文档页。
|
||||||
|
|
||||||
|
#### 流式摄取
|
||||||
|
最推荐、也是最流行的流式摄取方法是直接从Kafka读取数据的 [Kafka索引服务](kafka.md) 。如果你喜欢Kinesis,[Kinesis索引服务](kinesis.md) 也能很好地工作。
|
||||||
|
|
||||||
|
下表比较了主要可用选项:
|
||||||
|
|
||||||
|
| **Method** | [**Kafka**](kafka.md) | [**Kinesis**](kinesis.md) | [**Tranquility**](tranquility.md) |
|
||||||
|
| - | - | - | - |
|
||||||
|
| **Supervisor类型** | `kafka` | `kinesis` | `N/A` |
|
||||||
|
| 如何工作 | Druid直接从 Apache Kafka读取数据 | Druid直接从Amazon Kinesis中读取数据 | Tranquility, 一个独立于Druid的库,用来将数据推送到Druid |
|
||||||
|
| 可以摄入迟到的数据 | Yes | Yes | No(迟到的数据将会被基于 `windowPeriod` 的配置丢弃掉) |
|
||||||
|
| 保证不重不丢(Exactly-once)| Yes | Yes | No
|
||||||
|
|
||||||
|
#### 批量摄取
|
||||||
|
|
||||||
|
从文件进行批加载时,应使用一次性 [任务](taskrefer.md),并且有三个选项:`index_parallel`(本地并行批任务)、`index_hadoop`(基于hadoop)或`index`(本地简单批任务)。
|
||||||
|
|
||||||
|
一般来说,如果本地批处理能满足您的需要时我们建议使用它,因为设置更简单(它不依赖于外部Hadoop集群)。但是,仍有一些情况下,基于Hadoop的批摄取可能是更好的选择,例如,当您已经有一个正在运行的Hadoop集群,并且希望使用现有集群的集群资源进行批摄取时。
|
||||||
|
|
||||||
|
此表比较了三个可用选项:
|
||||||
|
|
||||||
|
| **方式** | [**本地批任务(并行)**](native.md#并行任务) | [**基于Hadoop**](hadoopbased.md) | [**本地批任务(简单)**](native.md#简单任务) |
|
||||||
|
| - | - | - | - |
|
||||||
|
| **任务类型** | `index_parallel` | `index_hadoop` | `index` |
|
||||||
|
| **并行?** | 如果 `inputFormat` 是可分割的且 `tuningConfig` 中的 `maxNumConcurrentSubTasks` > 1, 则 **Yes** | Yes | No,每个任务都是单线程的 |
|
||||||
|
| **支持追加或者覆盖** | 都支持 | 只支持覆盖 | 都支持 |
|
||||||
|
| **外部依赖** | 无 | Hadoop集群,用来提交Map-Reduce任务 | 无 |
|
||||||
|
| **输入位置** | 任何 [输入数据源](native.md#输入数据源) | 任何Hadoop文件系统或者Druid数据源 | 任何 [输入数据源](native.md#输入数据源) |
|
||||||
|
| **文件格式** | 任何 [输入格式](dataformats.md) | 任何Hadoop输入格式 | 任何 [输入格式](dataformats.md) |
|
||||||
|
| [**Rollup modes**](#Rollup) | 如果 `tuningConfig` 中的 `forceGuaranteedRollup` = true, 则为 **Perfect(最佳rollup)** | 总是Perfect(最佳rollup) | 如果 `tuningConfig` 中的 `forceGuaranteedRollup` = true, 则为 **Perfect(最佳rollup)** |
|
||||||
|
| **分区选项** | 可选的有`Dynamic`, `hash-based` 和 `range-based` 三种分区方式,详情参见 [分区规范](native.md#partitionsSpec) | 通过 [partitionsSpec](hadoopbased.md#partitionsSpec)中指定 `hash-based` 和 `range-based`分区 | 可选的有`Dynamic`和`hash-based`二种分区方式,详情参见 [分区规范](native.md#partitionsSpec) |
|
||||||
|
|
||||||
|
### Druid数据模型
|
||||||
|
#### 数据源
|
||||||
|
Druid数据存储在数据源中,与传统RDBMS中的表类似。Druid提供了一个独特的数据建模系统,它与关系模型和时间序列模型都具有相似性。
|
||||||
|
#### 主时间戳列
|
||||||
|
Druid Schema必须始终包含一个主时间戳。主时间戳用于对 [数据进行分区和排序](#分区)。Druid查询能够快速识别和检索与主时间戳列的时间范围相对应的数据。Druid还可以将主时间戳列用于基于时间的[数据管理操作](datamanage.md),例如删除时间块、覆盖时间块和基于时间的保留规则。
|
||||||
|
|
||||||
|
主时间戳基于 [`timestampSpec`](#timestampSpec) 进行解析。此外,[`granularitySpec`](#granularitySpec) 控制基于主时间戳的其他重要操作。无论从哪个输入字段读取主时间戳,它都将作为名为 `__time` 的列存储在Druid数据源中。
|
||||||
|
|
||||||
|
如果有多个时间戳列,则可以将其他列存储为 [辅助时间戳](schemadesign.md#辅助时间戳)。
|
||||||
|
|
||||||
|
#### 维度
|
||||||
|
维度是按原样存储的列,可以用于任何目的, 可以在查询时以特殊方式对维度进行分组、筛选或应用聚合器。如果在禁用了 [rollup](#Rollup) 的情况下运行,那么该维度集将被简单地视为要摄取的一组列,并且其行为与不支持rollup功能的典型数据库的预期完全相同。
|
||||||
|
|
||||||
|
通过 [`dimensionSpec`](#dimensionSpec) 配置维度。
|
||||||
|
|
||||||
|
#### 指标
|
||||||
|
Metrics是以聚合形式存储的列。启用 [rollup](#Rollup) 时,它们最有用。指定一个Metric允许您为Druid选择一个聚合函数,以便在摄取期间应用于每一行。这有两个好处:
|
||||||
|
|
||||||
|
1. 如果启用了 [rollup](#Rollup),即使保留摘要信息,也可以将多行折叠为一行。在 [Rollup教程](../tutorials/chapter-5.md) 中,这用于将netflow数据折叠为每(`minute`,`srcIP`,`dstIP`)元组一行,同时保留有关总数据包和字节计数的聚合信息。
|
||||||
|
2. 一些聚合器,特别是近似聚合器,即使在非汇总数据上,如果在接收时部分计算,也可以在查询时更快地计算它们。
|
||||||
|
|
||||||
|
Metrics是通过 [`metricsSpec`](#metricsSpec) 配置的。
|
||||||
|
|
||||||
|
### Rollup
|
||||||
|
#### 什么是rollup
|
||||||
|
Druid可以在接收过程中将数据进行汇总,以最小化需要存储的原始数据量。Rollup是一种汇总或预聚合的形式。实际上,Rollup可以极大地减少需要存储的数据的大小,从而潜在地减少行数的数量级。这种存储量的减少是有代价的:当我们汇总数据时,我们就失去了查询单个事件的能力。
|
||||||
|
|
||||||
|
禁用rollup时,Druid将按原样加载每一行,而不进行任何形式的预聚合。此模式类似于您对不支持汇总功能的典型数据库的期望。
|
||||||
|
|
||||||
|
如果启用了rollup,那么任何具有相同[维度](#维度)和[时间戳](#主时间戳列)的行(在基于 `queryGranularity` 的截断之后)都可以在Druid中折叠或汇总为一行。
|
||||||
|
|
||||||
|
rollup默认是启用状态。
|
||||||
|
|
||||||
|
#### 启用或者禁用rollup
|
||||||
|
|
||||||
|
Rollup由 `granularitySpec` 中的 `rollup` 配置项控制。 默认情况下,值为 `true`(启用状态)。如果你想让Druid按原样存储每条记录,而不需要任何汇总,将该值设置为 `false`。
|
||||||
|
|
||||||
|
#### rollup示例
|
||||||
|
有关如何配置Rollup以及该特性将如何修改数据的示例,请参阅[Rollup教程](../tutorials/chapter-5.md)。
|
||||||
|
|
||||||
|
#### 最大化rollup比率
|
||||||
|
通过比较Druid中的行数和接收的事件数,可以测量数据源的汇总率。这个数字越高,从汇总中获得的好处就越多。一种方法是使用[Druid SQL](../querying/druidsql.md)查询,比如:
|
||||||
|
```json
|
||||||
|
SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
|
||||||
|
```
|
||||||
|
|
||||||
|
在这个查询中,`cnt` 应该引用在摄取时指定的"count"类型Metrics。有关启用汇总时计数工作方式的详细信息,请参阅"架构设计"页上的 [计数接收事件数](../DataIngestion/schemadesign.md#计数接收事件数)。
|
||||||
|
|
||||||
|
最大化Rollup的提示:
|
||||||
|
* 一般来说,拥有的维度越少,维度的基数越低,您将获得更好的汇总比率
|
||||||
|
* 使用 [Sketches](schemadesign.md#Sketches高基维处理) 避免存储高基数维度,因为会损害汇总比率
|
||||||
|
* 在摄入时调整 `queryGranularity`(例如,使用 `PT5M` 而不是 `PT1M` )会增加Druid中两行具有匹配时间戳的可能性,并可以提高汇总率
|
||||||
|
* 将相同的数据加载到多个Druid数据源中是有益的。有些用户选择创建禁用汇总(或启用汇总,但汇总比率最小)的"完整"数据源和具有较少维度和较高汇总比率的"缩写"数据源。当查询只涉及"缩写"集里边的维度时,使用该数据源将导致更快的查询时间,这种方案只需稍微增加存储空间即可完成,因为简化的数据源往往要小得多。
|
||||||
|
* 如果您使用的 [尽力而为的汇总(best-effort rollup)](#) 摄取配置不能保证[完全汇总(perfect rollup)](#),则可以通过切换到保证的完全汇总选项,或在初始摄取后在[后台重新编制(reindex)](./datamanage.md#压缩与重新索引)数据索引,潜在地提高汇总比率。
|
||||||
|
|
||||||
|
#### 最佳rollup VS 尽可能rollup
|
||||||
|
一些Druid摄取方法保证了*完美的汇总(perfect rollup)*,这意味着输入数据在摄取时被完美地聚合。另一些则提供了*尽力而为的汇总(best-effort rollup)*,这意味着输入数据可能无法完全聚合,因此可能有多个段保存具有相同时间戳和维度值的行。
|
||||||
|
|
||||||
|
一般来说,提供*尽力而为的汇总(best-effort rollup)*的摄取方法之所以这样做,是因为它们要么是在没有清洗步骤(这是*完美的汇总(perfect rollup)*所必需的)的情况下并行摄取,要么是因为它们在接收到某个时间段的所有数据(我们称之为*增量发布(incremental publishing)*)之前完成并发布段。在这两种情况下,理论上可以汇总的记录可能会以不同的段结束。所有类型的流接收都在此模式下运行。
|
||||||
|
|
||||||
|
保证*完美的汇总(perfect rollup)*的摄取方法通过额外的预处理步骤来确定实际数据摄取阶段之前的间隔和分区。此预处理步骤扫描整个输入数据集,这通常会增加摄取所需的时间,但提供完美汇总所需的信息。
|
||||||
|
|
||||||
|
下表显示了每个方法如何处理汇总:
|
||||||
|
| **方法** | **如何工作** |
|
||||||
|
| - | - |
|
||||||
|
| [本地批](native.md) | 基于配置,`index_parallel` 和 `index` 可以是完美的,也可以是最佳的。 |
|
||||||
|
| [Hadoop批](hadoopbased.md) | 总是 perfect |
|
||||||
|
| [Kafka索引服务](kafka.md) | 总是 best-effort |
|
||||||
|
| [Kinesis索引服务](kinesis.md) | 总是 best-effort |
|
||||||
|
|
||||||
|
### 分区
|
||||||
|
#### 为什么分区
|
||||||
|
|
||||||
|
Optimal partitioning and sorting of the segments within your datasource can have a substantial impact on footprint and performance.
|
||||||
|
Druid数据源总是按时间划分为*时间块*,每个时间块包含一个或多个段。此分区适用于所有摄取方法,并基于摄取规范的 `dataSchema` 中的 `segmentGranularity`参数。
|
||||||
|
|
||||||
|
特定时间块内的段也可以进一步分区,使用的选项根据您选择的摄取类型而不同。一般来说,使用特定维度执行此辅助分区将改善局部性,这意味着具有该维度相同值的行存储在一起,并且可以快速访问。
|
||||||
|
|
||||||
|
通常,通过将数据分区到一些常用来做过滤操作的维度(如果存在的话)上,可以获得最佳性能和最小的总体占用空间。而且,这种分区通常会改善压缩性能而且还往往会提高查询性能(用户报告存储容量减少了三倍)。
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
> 分区和排序是最好的朋友!如果您确实有一个天然的分区维度,那么您还应该考虑将它放在 `dimensionsSpec` 的 `dimension` 列表中的第一个维度,它告诉Druid按照该列对每个段中的行进行排序。除了单独分区所获得的改进之外,这通常还会进一步改进压缩。
|
||||||
|
> 但是,请注意,目前,Druid总是首先按时间戳对一个段内的行进行排序,甚至在 `dimensionsSpec` 中列出的第一个维度之前,这将使得维度排序达不到最大效率。如果需要,可以通过在 `granularitySpec` 中将 `queryGranularity` 设置为等于 `segmentGranularity` 的值来解决此限制,这将把段内的所有时间戳设置为相同的值,并将"真实"时间戳保存为[辅助时间戳](./schemadesign.md#辅助时间戳)。这个限制可能在Druid的未来版本中被移除。
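
As a sketch of the workaround mentioned in the note above, a `granularitySpec` with `queryGranularity` set equal to `segmentGranularity` might look like this; hourly granularity and the interval are just examples:

```json
"granularitySpec": {
  "segmentGranularity": "hour",
  "queryGranularity": "hour",
  "rollup": true,
  "intervals": ["2013-08-31/2013-09-01"]
}
```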
|
||||||
|
|
||||||
|
#### 如何设置分区
|
||||||
|
并不是所有的摄入方式都支持显式的分区配置,也不是所有的方法都具有同样的灵活性。在当前的Druid版本中,如果您是通过一个不太灵活的方法(如Kafka)进行初始摄取,那么您可以使用 [重新索引的技术(reindex)](./datamanage.md#压缩与重新索引),在最初摄取数据后对其重新分区。这是一种强大的技术:即使您不断地从流中添加新数据, 也可以使用它来确保任何早于某个阈值的数据都得到最佳分区。
|
||||||
|
|
||||||
|
下表显示了每个摄取方法如何处理分区:
|
||||||
|
|
||||||
|
| **方法** | **如何工作** |
|
||||||
|
| - | - |
|
||||||
|
| [本地批](native.md) | 通过 `tuningConfig` 中的 [`partitionsSpec`](./native.md#partitionsSpec) |
|
||||||
|
| [Hadoop批](hadoopbased.md) | 通过 `tuningConfig` 中的 [`partitionsSpec`](./native.md#partitionsSpec) |
|
||||||
|
| [Kafka索引服务](kafka.md) | Druid中的分区是由Kafka主题的分区方式决定的。您可以在初次摄入后 [重新索引的技术(reindex)](./datamanage.md#压缩与重新索引)以重新分区 |
|
||||||
|
| [Kinesis索引服务](kinesis.md) | Druid中的分区是由Kinesis流的分区方式决定的。您可以在初次摄入后 [重新索引的技术(reindex)](./datamanage.md#压缩与重新索引)以重新分区 |
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
>
|
||||||
|
> 注意,当然,划分数据的一种方法是将其加载到分开的数据源中。这是一种完全可行的方法,当数据源的数量不会导致每个数据源的开销过大时,它可以很好地工作。如果使用这种方法,那么可以忽略这一部分,因为这部分描述了如何在单个数据源中设置分区。
|
||||||
|
>
|
||||||
|
> 有关将数据拆分为单独数据源的详细信息以及潜在的操作注意事项,请参阅 [多租户注意事项](../querying/multitenancy.md)。
|
||||||
|
|
||||||
|
### 摄入规范
|
||||||
|
|
||||||
|
无论使用哪一种摄入方式,数据要么是通过一次性[tasks](taskrefer.md)或者通过持续性的"supervisor"(运行并监控一段时间内的一系列任务)来被加载到Druid中。 在任一种情况下,task或者supervisor的定义都在*摄入规范*中定义。
|
||||||
|
|
||||||
|
摄入规范包括以下三个主要的部分:
|
||||||
|
* [`dataSchema`](#dataschema), 包含了 [`数据源名称`](#datasource), [`主时间戳列`](#timestampspec), [`维度`](#dimensionspec), [`指标`](#metricsspec) 和 [`转换与过滤`](#transformspec)
|
||||||
|
* [`ioConfig`](#ioconfig), 该部分告诉Druid如何去连接数据源系统以及如何去解析数据。 更多详细信息,可以看[摄入方法](#摄入方式)的文档。
|
||||||
|
* [`tuningConfig`](#tuningconfig), 该部分控制着每一种[摄入方法](#摄入方式)的不同的特定调整参数
|
||||||
|
|
||||||
|
一个 `index_parallel` 类型任务的示例摄入规范如下:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type": "index_parallel",
|
||||||
|
"spec": {
|
||||||
|
"dataSchema": {
|
||||||
|
"dataSource": "wikipedia",
|
||||||
|
"timestampSpec": {
|
||||||
|
"column": "timestamp",
|
||||||
|
"format": "auto"
|
||||||
|
},
|
||||||
|
"dimensionsSpec": {
|
||||||
|
"dimensions": [
|
||||||
|
{ "type": "string", "page" },
|
||||||
|
{ "type": "string", "language" },
|
||||||
|
{ "type": "long", "name": "userId" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metricsSpec": [
|
||||||
|
{ "type": "count", "name": "count" },
|
||||||
|
{ "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
|
||||||
|
{ "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
|
||||||
|
],
|
||||||
|
"granularitySpec": {
|
||||||
|
"segmentGranularity": "day",
|
||||||
|
"queryGranularity": "none",
|
||||||
|
"intervals": [
|
||||||
|
"2013-08-31/2013-09-01"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"ioConfig": {
|
||||||
|
"type": "index_parallel",
|
||||||
|
"inputSource": {
|
||||||
|
"type": "local",
|
||||||
|
"baseDir": "examples/indexing/",
|
||||||
|
"filter": "wikipedia_data.json"
|
||||||
|
},
|
||||||
|
"inputFormat": {
|
||||||
|
"type": "json",
|
||||||
|
"flattenSpec": {
|
||||||
|
"useFieldDiscovery": true,
|
||||||
|
"fields": [
|
||||||
|
{ "type": "path", "name": "userId", "expr": "$.user.id" }
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"tuningConfig": {
|
||||||
|
"type": "index_parallel"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
该部分中支持的特定选项依赖于选择的[摄入方法](#摄入方式)。 更多的示例,可以参考每一种[摄入方法](#摄入方式)的文档。
|
||||||
|
|
||||||
|
您还可以不用编写一个摄入规范,可视化的加载数据,该功能位于 [Druid控制台](../operations/manageui.md) 的 "Load Data" 视图中。 Druid可视化数据加载器目前支持 [Kafka](kafka.md), [Kinesis](kinesis.md) 和 [本地批](native.md) 模式。
|
||||||
|
|
||||||
|
#### `dataSchema`

> [!WARNING]
>
> `dataSchema` 规范在0.17.0版本中做了更改,新的规范支持除*Hadoop摄取方式*外的所有方式。老的规范可以在下文的 [过时的 `dataSchema` 规范](#过时的-dataschema-规范) 中查看。

`dataSchema` 包含了以下部分:

* [`数据源名称`](#datasource)
* [`主时间戳列`](#timestampspec)
* [`维度`](#dimensionspec)
* [`指标`](#metricsspec)
* [`转换与过滤`](#transformspec)

一个 `dataSchema` 示例如下:

```json
"dataSchema": {
  "dataSource": "wikipedia",
  "timestampSpec": {
    "column": "timestamp",
    "format": "auto"
  },
  "dimensionsSpec": {
    "dimensions": [
      { "type": "string", "name": "page" },
      { "type": "string", "name": "language" },
      { "type": "long", "name": "userId" }
    ]
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
  ],
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "none",
    "intervals": [
      "2013-08-31/2013-09-01"
    ]
  }
}
```

##### `dataSource`

`dataSource` 位于 `dataSchema` -> `dataSource` 中,简单地标识了数据将被写入的数据源名称,示例如下:

```json
"dataSource": "my-first-datasource"
```

##### `timestampSpec`

`timestampSpec` 位于 `dataSchema` -> `timestampSpec` 中,用来配置 [主时间戳](#timestampspec),示例如下:

```json
"timestampSpec": {
  "column": "timestamp",
  "format": "auto"
}
```

> [!WARNING]
> 概念上,输入数据被读取后,Druid会以一个特定的顺序对数据应用摄入规范: 首先 `flattenSpec`(如果有),然后 `timestampSpec`,然后 `transformSpec`,最后是 `dimensionsSpec` 和 `metricsSpec`。在编写摄入规范时需要牢记这一点。

`timestampSpec` 可以包含以下的部分:

<table>
  <thead>
    <tr>
      <th>字段</th>
      <th>描述</th>
      <th>默认值</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>column</td>
      <td>要从中读取主时间戳的输入行字段。<br><br>不管这个输入字段的名称是什么,主时间戳总是作为一个名为"__time"的列存储在您的Druid数据源中</td>
      <td>timestamp</td>
    </tr>
    <tr>
      <td>format</td>
      <td>
        时间戳格式,可选项有:
        <ul>
          <li><code>iso</code>: 使用"T"分割的ISO8601,像"2000-01-01T01:02:03.456"</li>
          <li><code>posix</code>: 自纪元以来的秒数</li>
          <li><code>millis</code>: 自纪元以来的毫秒数</li>
          <li><code>micro</code>: 自纪元以来的微秒数</li>
          <li><code>nano</code>: 自纪元以来的纳秒数</li>
          <li><code>auto</code>: 自动检测ISO或者毫秒格式</li>
          <li>任何 <a href="http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html">Joda DateTimeFormat字符串</a></li>
        </ul>
      </td>
      <td>auto</td>
    </tr>
    <tr>
      <td>missingValue</td>
      <td>用于具有空或缺少时间戳列的输入记录的时间戳。应该是ISO8601格式,如<code>"2000-01-01T01:02:03.456"</code>。由于Druid需要一个主时间戳,因此此设置对于摄取根本没有任何时间戳的数据集非常有用。</td>
      <td>none</td>
    </tr>
  </tbody>
</table>

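作为补充,下面给出一个带有 `missingValue` 的 `timestampSpec` 草稿,用于摄取完全没有时间戳列的数据集(这里沿用默认的列名 `timestamp`,具体的缺省时间戳取值仅为示意,并非来自上文示例):

```json
"timestampSpec": {
  "column": "timestamp",
  "format": "auto",
  "missingValue": "2010-01-01T00:00:00.000Z"
}
```

当输入记录中不存在该列或该列为空时,Druid会把 `missingValue` 指定的时间戳写入 `__time` 列。
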
##### `dimensionSpec`

`dimensionsSpec` 位于 `dataSchema` -> `dimensionsSpec` 中,用来配置维度。示例如下:

```json
"dimensionsSpec" : {
  "dimensions": [
    "page",
    "language",
    { "type": "long", "name": "userId" }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}
```

> [!WARNING]
> 概念上,输入数据被读取后,Druid会以一个特定的顺序对数据应用摄入规范: 首先 `flattenSpec`(如果有),然后 `timestampSpec`,然后 `transformSpec`,最后是 `dimensionsSpec` 和 `metricsSpec`。在编写摄入规范时需要牢记这一点。

`dimensionsSpec` 可以包括以下部分:

| 字段 | 描述 | 默认值 |
|-|-|-|
| dimensions | 维度名称或者对象的列表,在 `dimensions` 和 `dimensionExclusions` 中不能包含相同的列。 <br><br> 如果该配置为一个空数组,Druid将会把所有未出现在 `dimensionExclusions` 中的非时间、非指标列当做字符串类型的维度列,参见 [Inclusions and exclusions](#inclusions-and-exclusions)。 | `[]` |
| dimensionExclusions | 在摄取中需要排除的列名称,在该配置中只支持名称,不支持对象。在 `dimensions` 和 `dimensionExclusions` 中不能包含相同的列。 | `[]` |
| spatialDimensions | 一个 [空间维度](../querying/spatialfilter.md) 的数组 | `[]` |

###### `Dimension objects`

在 `dimensions` 列表中的每一个维度可以是一个名称,也可以是一个对象。提供一个名称等价于提供了一个给定名称的 `string` 类型的维度对象。例如: `page` 等价于 `{"name": "page", "type": "string"}`。

维度对象可以有以下的部分:

| 字段 | 描述 | 默认值 |
|-|-|-|
| type | `string`, `long`, `float` 或者 `double` | `string` |
| name | 维度名称,将用作从输入记录中读取的字段名,以及存储在生成的段中的列名。<br><br> 注意: 如果想在摄取的时候重新命名列,可以使用 [`transformSpec`](#transformspec) | none(必填)|
| createBitmapIndex | 对于字符串类型的维度,是否应为生成的段中的列创建位图索引。创建位图索引需要更多存储空间,但会加快某些类型的筛选(特别是相等和前缀筛选)。仅支持字符串类型的维度。| `true` |

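下面是一个混合使用名称与维度对象的 `dimensions` 片段草稿,演示如何对某个字符串列关闭位图索引(其中 `sessionId` 等字段名仅为示意,并非来自上文示例):

```json
"dimensions": [
  "page",
  { "type": "string", "name": "sessionId", "createBitmapIndex": false },
  { "type": "long", "name": "userId" }
]
```

按上表的说明,关闭位图索引可以节省存储空间,但会使针对该列的相等与前缀过滤变慢,是否值得需要结合实际查询模式权衡。
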
###### `Inclusions and exclusions`

Druid以两种可能的方式来解释 `dimensionsSpec`: *normal*(正常)和 *schemaless*(无模式)。

当 `dimensions` 或者 `spatialDimensions` 为非空时,将采用正常的解释方式。在这种情况下,前边两个列表结合起来的集合将被当做要摄入的维度集合。

当 `dimensions` 和 `spatialDimensions` 同时为空或者为null时,将采用无模式的解释方式。在这种情况下,维度集合由以下方式决定(本小节末尾给出了一个无模式配置的简单草稿):

1. 首先,从 [`inputFormat`](./dataformats.md)(或者 [`flattenSpec`](./dataformats.md#FlattenSpec),如果正在使用)中的所有输入字段集合开始
2. 排除掉任何在 `dimensionExclusions` 中的列
3. 排除掉 [`timestampSpec`](#timestampspec) 中的时间列
4. 排除掉 [`metricsSpec`](#metricsspec) 中用作聚合器输入的列
5. 排除掉 [`metricsSpec`](#metricsspec) 中任何与聚合器同名的列
6. 所有其他字段都按照 [默认配置](#dimensionspec) 摄入为 `string` 类型的维度

> [!WARNING]
> 注意:在无模式的维度解释方式中,当前并不会考虑由 [`transformSpec`](#transformspec) 生成的列。

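例如,下面是一个会触发无模式解释方式的 `dimensionsSpec` 草稿(假设希望自动发现除 `userId` 之外的所有维度,字段名仅为示意):

```json
"dimensionsSpec": {
  "dimensions": [],
  "dimensionExclusions": ["userId"]
}
```

在这种配置下,所有被自动发现的维度都会按字符串类型摄入;如果需要数值类型的维度,仍需在 `dimensions` 中显式列出。
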
##### `metricsSpec`

`metricsSpec` 位于 `dataSchema` -> `metricsSpec` 中,是一个在摄入阶段要应用的 [聚合器](../querying/Aggregations.md) 列表。在启用了 [rollup](#rollup) 时它尤其有用,因为它配置了摄入阶段如何进行聚合。

一个 `metricsSpec` 示例如下:

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
]
```

> [!WARNING]
> 通常,当 [rollup](#rollup) 被禁用时,`metricsSpec` 应该为空(因为没有rollup,Druid不会在摄取时做任何聚合,所以没有理由配置摄取时聚合器)。但是,在某些情况下,定义Metrics仍然是有意义的:例如,如果要创建一个复杂列作为 [近似聚合](../querying/Aggregations.md#近似聚合) 的预计算部分,则只能通过在 `metricsSpec` 中定义Metrics来实现。

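例如,下面是一个在摄入阶段预计算近似去重列的 `metricsSpec` 草稿(使用内置的 `hyperUnique` 聚合器,`userId` 字段沿用上文示例,指标名 `user_unique` 仅为示意):

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "hyperUnique", "name": "user_unique", "fieldName": "userId" }
]
```

查询时可以直接对 `user_unique` 列做近似去重计数,而无需保留原始的 `userId` 值,这通常也有助于提高rollup比率。
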
##### `granularitySpec`

`granularitySpec` 位于 `dataSchema` -> `granularitySpec`,用来配置以下操作:

1. 通过 `segmentGranularity` 将数据源分区到 [时间块](../design/Design.md#数据源和段)
2. 如果需要的话,通过 `queryGranularity` 截断时间戳
3. 通过 `intervals` 指定批摄取中应创建段的时间块
4. 通过 `rollup` 指定是否在摄取时进行汇总

除了 `rollup` 之外,这些操作都是基于 [主时间戳列](#主时间戳列) 进行的。

一个 `granularitySpec` 示例如下:

```json
"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "none",
  "intervals": [
    "2013-08-31/2013-09-01"
  ],
  "rollup": true
}
```

`granularitySpec` 可以有以下的部分:

| 字段 | 描述 | 默认值 |
|-|-|-|
| type | `uniform` 或者 `arbitrary`,大多数时候使用 `uniform` | `uniform` |
| segmentGranularity | 数据源的 [时间分块](../design/Design.md#数据源和段) 粒度。每个时间块可以创建多个段,例如,当设置为 `day` 时,同一天的事件属于同一时间块,该时间块可以根据其他配置和输入大小进一步划分为多个段。这里可以提供任何粒度。请注意,同一时间块中的所有段应具有相同的段粒度。 <br><br> 如果 `type` 字段设置为 `arbitrary` 则忽略该配置 | `day` |
| queryGranularity | 每个段内时间戳存储的分辨率,必须等于或比 `segmentGranularity` 更细。这将是您可以查询的最细粒度,并且仍然可以得到合理的结果。但是请注意,您仍然可以在比此粒度更粗的粒度上进行查询,例如 "`minute`" 的值意味着记录将以分钟粒度存储,并且可以在分钟的任意倍数(包括分钟、5分钟、小时等)上进行查询。<br><br> 这里可以提供任何 [粒度](../querying/AggregationGranularity.md)。使用 `none` 按原样存储时间戳,而不进行任何截断。请注意,即使将 `queryGranularity` 设置为 `none`,也仍会应用 `rollup`。 | `none` |
| rollup | 是否在摄取时使用 [rollup](#rollup)。注意:即使 `queryGranularity` 设置为 `none`,rollup也仍然是有效的,当数据具有相同的时间戳时数据将被汇总 | `true` |
| intervals | 描述应该创建段的时间块的间隔列表。如果 `type` 设置为 `uniform`,则此列表将根据 `segmentGranularity` 进行拆分和舍入;如果 `type` 设置为 `arbitrary`,则将按原样使用此列表。<br><br> 如果该值不提供或者为空,则批处理摄取任务通常会根据输入数据中找到的时间戳来确定要输出的时间块。<br><br> 如果指定了该值,批处理摄取任务可以跳过确定分区的阶段,这可能会使摄取更快;批量摄取任务也可以预先请求它们的所有锁,而不是逐个请求。批处理摄取任务将丢弃任何时间戳超出指定间隔的记录。<br><br> 在任何形式的流摄取中该配置都会被忽略。 | `null` |

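作为对照,下面是一个更偏向流式摄取场景的 `granularitySpec` 草稿(按小时分段、按分钟截断时间戳;由于流摄取会忽略 `intervals`,这里将其省略,具体取值仅为示意):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "hour",
  "queryGranularity": "minute",
  "rollup": true
}
```
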
##### `transformSpec`

`transformSpec` 位于 `dataSchema` -> `transformSpec`,用来在摄取时转换和过滤输入数据。一个 `transformSpec` 示例如下:

```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
  ],
  "filter": {
    "type": "selector",
    "dimension": "country",
    "value": "San Serriffe"
  }
}
```

> [!WARNING]
> 概念上,输入数据被读取后,Druid会以一个特定的顺序对数据应用摄入规范: 首先 `flattenSpec`(如果有),然后 `timestampSpec`,然后 `transformSpec`,最后是 `dimensionsSpec` 和 `metricsSpec`。在编写摄入规范时需要牢记这一点。

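前文维度对象一节提到,重命名列可以借助 `transformSpec` 实现。下面是一个按此思路编写的草稿(假设输入字段为 `cntry`,希望以 `country` 作为维度名,字段名仅为示意):

```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "country", "expression": "cntry" }
  ]
}
```

由于 `transformSpec` 在 `dimensionsSpec` 之前应用,之后只需在 `dimensions` 中列出 `country` 即可;原始的 `cntry` 字段如不需要,可以不再列出。
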
##### 过时的 `dataSchema` 规范

> [!WARNING]
>
> `dataSchema` 规范在0.17.0版本中做了更改,新的规范支持除*Hadoop摄取方式*外的所有方式。新的规范参见上文的 [`dataSchema`](#dataschema)。

除了上面 `dataSchema` 一节中列出的组件之外,过时的 `dataSchema` 规范还有以下两个组件:

* 输入行解析器(`parser`)与嵌套数据展平(`flattenSpec`),见下文。

**parser**(已废弃)

在过时的 `dataSchema` 中,`parser` 位于 `dataSchema` -> `parser` 中,负责配置与解析输入记录相关的各种项。由于 `parser` 已经废弃,不推荐使用,强烈建议改用 `inputFormat`。对于 `inputFormat` 和支持的 `parser` 类型,可以参见 [数据格式](dataformats.md)。

`parseSpec` 主要部分的详细内容,参见它们各自的小节:

* [`timestampSpec`](#timestampspec), 配置 [主时间戳列](#主时间戳列)
* [`dimensionsSpec`](#dimensionspec), 配置 [维度](#维度)
* [`flattenSpec`](./dataformats.md#FlattenSpec)

一个 `parser` 示例如下:

```json
"parser": {
  "type": "string",
  "parseSpec": {
    "format": "json",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "userId", "expr": "$.user.id" }
      ]
    },
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": [
        { "type": "string", "name": "page" },
        { "type": "string", "name": "language" },
        { "type": "long", "name": "userId" }
      ]
    }
  }
}
```

**flattenSpec**

在过时的 `dataSchema` 中,`flattenSpec` 位于 `dataSchema` -> `parser` -> `parseSpec` -> `flattenSpec` 中,负责在潜在的嵌套输入数据(如JSON、Avro等)和Druid的数据模型之间架起桥梁。有关详细信息,请参见 [flattenSpec](./dataformats.md#FlattenSpec)。

#### `ioConfig`

`ioConfig` 影响从源系统(如Apache Kafka、Amazon S3、挂载的文件系统或任何其他受支持的源系统)读取数据的方式。`inputFormat` 属性适用于除Hadoop摄取之外的 [所有摄取方法](#摄入方式);Hadoop摄取仍然使用过时的 `dataSchema` 中的 `parser`。`ioConfig` 的其余部分则特定于每个单独的摄取方法。读取JSON数据的 `ioConfig` 示例如下:

```json
"ioConfig": {
  "type": "<ingestion-method-specific type code>",
  "inputFormat": {
    "type": "json"
  },
  ...
}
```

详情可以参见每种 [摄取方式](#摄入方式) 提供的文档。

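下面再给出一个读取带表头CSV文件的 `ioConfig` 草稿作为对照(`index_parallel` 类型与本地 `inputSource` 沿用上文示例,文件名 `wikipedia_data.csv` 仅为示意):

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "local",
    "baseDir": "examples/indexing/",
    "filter": "wikipedia_data.csv"
  },
  "inputFormat": {
    "type": "csv",
    "findColumnsFromHeader": true
  }
}
```
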
#### `tuningConfig`

优化属性在 `tuningConfig` 中指定,`tuningConfig` 位于摄取规范的顶层。有些属性适用于所有摄取方法,但大多数属性特定于每个单独的摄取方法。将所有共享的公共属性设置为默认值的 `tuningConfig` 示例如下:

```json
"tuningConfig": {
  "type": "<ingestion-method-specific type code>",
  "maxRowsInMemory": 1000000,
  "maxBytesInMemory": <one-sixth of JVM memory>,
  "indexSpec": {
    "bitmap": { "type": "concise" },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs"
  },
  <other ingestion-method-specific properties>
}
```

| 字段 | 描述 | 默认值 |
|-|-|-|
| type | 每一种摄入方式都有自己的类型,必须指定为与摄入方式匹配的类型。通常的选项有 `index`, `hadoop`, `kafka` 和 `kinesis` | |
| maxRowsInMemory | 数据持久化到硬盘前在内存中存储的最大数据条数。注意,这个数字是汇总后的,所以可能并不等于输入的记录数。当摄入的数据达到 `maxRowsInMemory` 或者 `maxBytesInMemory` 时(以先达到的为准),数据将被持久化到硬盘。 | `1000000` |
| maxBytesInMemory | 在持久化之前要存储在JVM堆中的数据最大字节数。这是基于对内存使用的粗略估计。当达到 `maxRowsInMemory` 或 `maxBytesInMemory` 时(以先发生的为准),摄取的记录将被持久化到磁盘。<br><br>将 `maxBytesInMemory` 设置为-1将禁用此检查,这意味着Druid将完全依赖 `maxRowsInMemory` 来控制内存使用。将其设置为零意味着将使用默认值(JVM堆大小的六分之一)。<br><br> 请注意,内存使用量的估计值被设计为高估值,并且在使用复杂的摄取时聚合器(包括sketches)时可能特别高。如果这导致索引工作负载过于频繁地持久化到磁盘,则可以将 `maxBytesInMemory` 设置为-1并转而依赖 `maxRowsInMemory`。 | JVM堆内存最大值的1/6 |
| indexSpec | 优化数据如何被索引,详情可以看下面的表格 | 见下面的表格 |
| 其他属性 | 每一种摄入方式都有其自己的优化属性,详情可以查看每种方法的文档: [Kafka索引服务](kafka.md), [Kinesis索引服务](kinesis.md), [本地批](native.md) 和 [Hadoop批](hadoopbased.md) | |

**`indexSpec`**

上面表格中的 `indexSpec` 部分可以包含以下属性:

| 字段 | 描述 | 默认值 |
|-|-|-|
| bitmap | 位图索引的压缩格式。需要一个 `type` 设置为 `concise` 或者 `roaring` 的JSON对象。对于 `roaring` 类型,布尔属性 `compressRunOnSerialization`(默认为true)控制在确定运行长度编码更节省空间时是否使用该编码。 | `{"type":"concise"}` |
| dimensionCompression | 维度列的压缩格式。可选项有 `lz4`, `lzf` 或者 `uncompressed` | `lz4` |
| metricCompression | Metrics列的压缩格式。可选项有 `lz4`, `lzf`, `uncompressed` 或者 `none`(`none` 比 `uncompressed` 更高效,但是老版本的Druid不支持) | `lz4` |
| longEncoding | long类型列的编码格式,无论它们是维度还是Metrics都适用。可选项是 `auto` 或 `longs`。`auto` 根据列基数使用偏移量或查找表对值进行编码,并以可变大小存储它们;`longs` 按原样存储值,每个值8字节。 | `longs` |

除了这些属性之外,每种摄取方法都有自己的特定调优属性,详情请参阅每种 [摄取方法](#摄入方式) 的文档。

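例如,下面是一个改用roaring位图并让long列自动选择编码的 `indexSpec` 草稿,可以在上面的 `tuningConfig` 中替换默认的 `indexSpec`(是否更优取决于具体数据,需要自行测试):

```json
"indexSpec": {
  "bitmap": { "type": "roaring", "compressRunOnSerialization": true },
  "dimensionCompression": "lz4",
  "metricCompression": "lz4",
  "longEncoding": "auto"
}
```
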
@ -1,3 +1,16 @@
|
|||||||
|
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
## Apache Kafka 摄取数据
|
## Apache Kafka 摄取数据
|
||||||
|
|
||||||
Kafka索引服务支持在Overlord上配置*supervisors*,supervisors通过管理Kafka索引任务的创建和生存期来便于从Kafka摄取数据。这些索引任务使用Kafka自己的分区和偏移机制读取事件,因此能够保证只接收一次(**exactly-once**)。supervisor监视索引任务的状态,以便于协调切换、管理故障,并确保维护可伸缩性和复制要求。
|
Kafka索引服务支持在Overlord上配置*supervisors*,supervisors通过管理Kafka索引任务的创建和生存期来便于从Kafka摄取数据。这些索引任务使用Kafka自己的分区和偏移机制读取事件,因此能够保证只接收一次(**exactly-once**)。supervisor监视索引任务的状态,以便于协调切换、管理故障,并确保维护可伸缩性和复制要求。
|
||||||
@ -108,7 +121,7 @@ curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http:/
|
|||||||
| `indexSpecForIntermediatePersists` | | 定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的维度/度量压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,而在它们被合并到发布的最终段之前使用它们,有关可能的值,请参阅IndexSpec。 | 否(默认与 `indexSpec` 相同) |
|
| `indexSpecForIntermediatePersists` | | 定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的维度/度量压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,而在它们被合并到发布的最终段之前使用它们,有关可能的值,请参阅IndexSpec。 | 否(默认与 `indexSpec` 相同) |
|
||||||
| `reportParseExceptions` | Boolean | *已废弃*。如果为true,则在解析期间遇到的异常即停止摄取;如果为false,则将跳过不可解析的行和字段。将 `reportParseExceptions` 设置为 `true` 将覆盖`maxParseExceptions` 和 `maxSavedParseExceptions` 的现有配置,将`maxParseExceptions` 设置为 `0` 并将 `maxSavedParseExceptions` 限制为不超过1。 | 否(默认为false)|
|
| `reportParseExceptions` | Boolean | *已废弃*。如果为true,则在解析期间遇到的异常即停止摄取;如果为false,则将跳过不可解析的行和字段。将 `reportParseExceptions` 设置为 `true` 将覆盖`maxParseExceptions` 和 `maxSavedParseExceptions` 的现有配置,将`maxParseExceptions` 设置为 `0` 并将 `maxSavedParseExceptions` 限制为不超过1。 | 否(默认为false)|
|
||||||
| `handoffConditionTimeout` | Long | 段切换(持久化)可以等待的毫秒数(超时时间)。 该值要被设置为大于0的数,设置为0意味着将会一直等待不超时 | 否(默认为0)|
|
| `handoffConditionTimeout` | Long | 段切换(持久化)可以等待的毫秒数(超时时间)。 该值要被设置为大于0的数,设置为0意味着将会一直等待不超时 | 否(默认为0)|
|
||||||
| `resetOffsetAutomatically` | Boolean | 控制当Druid需要读取Kafka中不可用的消息时的行为,比如当发生了 `OffsetOutOfRangeException` 异常时。 <br> 如果为false,则异常将抛出,这将导致任务失败并停止接收。如果发生这种情况,则需要手动干预来纠正这种情况;可能使用 [重置 Supervisor API](../operations/api-reference.md#Supervisor)。此模式对于生产非常有用,因为它将使您意识到摄取的问题。 <br> 如果为true,Druid将根据 `useEarliestOffset` 属性的值(`true` 为 `earliest`,`false` 为 `latest`)自动重置为Kafka中可用的较早或最新偏移量。请注意,这可能导致数据在您不知情的情况下*被丢弃*(如果`useEarliestOffset` 为 `false`)或 *重复*(如果 `useEarliestOffset` 为 `true`)。消息将被记录下来,以标识已发生重置,但摄取将继续。这种模式对于非生产环境非常有用,因为它将使Druid尝试自动从问题中恢复,即使这些问题会导致数据被安静删除或重复。 <br> 该特性与Kafka的 `auto.offset.reset` 消费者属性很相似 | 否(默认为false)|
|
| `resetOffsetAutomatically` | Boolean | 控制当Druid需要读取Kafka中不可用的消息时的行为,比如当发生了 `OffsetOutOfRangeException` 异常时。 <br> 如果为false,则异常将抛出,这将导致任务失败并停止接收。如果发生这种情况,则需要手动干预来纠正这种情况;可能使用 [重置 Supervisor API](../operations/api.md#Supervisor)。此模式对于生产非常有用,因为它将使您意识到摄取的问题。 <br> 如果为true,Druid将根据 `useEarliestOffset` 属性的值(`true` 为 `earliest`,`false` 为 `latest`)自动重置为Kafka中可用的较早或最新偏移量。请注意,这可能导致数据在您不知情的情况下*被丢弃*(如果`useEarliestOffset` 为 `false`)或 *重复*(如果 `useEarliestOffset` 为 `true`)。消息将被记录下来,以标识已发生重置,但摄取将继续。这种模式对于非生产环境非常有用,因为它将使Druid尝试自动从问题中恢复,即使这些问题会导致数据被安静删除或重复。 <br> 该特性与Kafka的 `auto.offset.reset` 消费者属性很相似 | 否(默认为false)|
|
||||||
| `workerThreads` | Integer | supervisor用于异步操作的线程数。| 否(默认为: min(10, taskCount)) |
|
| `workerThreads` | Integer | supervisor用于异步操作的线程数。| 否(默认为: min(10, taskCount)) |
|
||||||
| `chatThreads` | Integer | 与索引任务的会话线程数 | 否(默认为:min(10, taskCount * replicas))|
|
| `chatThreads` | Integer | 与索引任务的会话线程数 | 否(默认为:min(10, taskCount * replicas))|
|
||||||
| `chatRetries` | Integer | 在任务没有响应之前,将重试对索引任务的HTTP请求的次数 | 否(默认为8)|
|
| `chatRetries` | Integer | 在任务没有响应之前,将重试对索引任务的HTTP请求的次数 | 否(默认为8)|
|
||||||
@ -149,7 +162,7 @@ curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http:/
|
|||||||
|
|
||||||
| 字段 | 类型 | 描述 | 是否必须 |
|
| 字段 | 类型 | 描述 | 是否必须 |
|
||||||
|-|-|-|-|
|
|-|-|-|-|
|
||||||
| `type` | String | 对于可用选项,可以见 [额外的Peon配置:SegmentWriteOutMediumFactory](../configuration/human-readable-byte.md#SegmentWriteOutMediumFactory) | 是 |
|
| `type` | String | 对于可用选项,可以见 [额外的Peon配置:SegmentWriteOutMediumFactory](../Configuration/configuration.md#SegmentWriteOutMediumFactory) | 是 |
|
||||||
|
|
||||||
#### KafkaSupervisorIOConfig
|
#### KafkaSupervisorIOConfig
|
||||||
|
|
||||||
@ -177,7 +190,7 @@ Kafka索引服务同时支持通过 [`inputFormat`](dataformats.md#inputformat)
|
|||||||
|
|
||||||
### 操作
|
### 操作
|
||||||
|
|
||||||
本节描述了一些supervisor API如何在Kafka索引服务中具体工作。对于所有的supervisor API,请查看 [Supervisor APIs](../operations/api-reference.md#Supervisor)
|
本节描述了一些supervisor API如何在Kafka索引服务中具体工作。对于所有的supervisor API,请查看 [Supervisor APIs](../operations/api.md#Supervisor)
|
||||||
|
|
||||||
#### 获取supervisor的状态报告
|
#### 获取supervisor的状态报告
|
||||||
|
|
1
DataIngestion/kinesis.md
Normal file
1
DataIngestion/kinesis.md
Normal file
@ -0,0 +1 @@
|
|||||||
|
<!-- toc -->
|
1154
DataIngestion/native.md
Normal file
1154
DataIngestion/native.md
Normal file
File diff suppressed because it is too large
Load Diff
166
DataIngestion/schemadesign.md
Normal file
166
DataIngestion/schemadesign.md
Normal file
@ -0,0 +1,166 @@
|
|||||||
|
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## Schema设计
|
||||||
|
### Druid数据模型
|
||||||
|
|
||||||
|
有关一般信息,请查看摄取概述页面上有关 [Druid数据模型](ingestion.md#Druid数据模型) 的文档。本页的其余部分将讨论来自其他类型系统的用户的提示,以及一般提示和常见做法。
|
||||||
|
|
||||||
|
* Druid数据存储在 [数据源](ingestion.md#数据源) 中,与传统RDBMS中的表类似。
|
||||||
|
* Druid数据源可以在摄取过程中使用或不使用 [rollup](ingestion.md#rollup) 。启用rollup后,Druid会在接收期间部分聚合您的数据,这可能会减少其行数,减少存储空间,并提高查询性能。禁用rollup后,Druid为输入数据中的每一行存储一行,而不进行任何预聚合。
|
||||||
|
* Druid的每一行都必须有时间戳。数据总是按时间进行分区,每个查询都有一个时间过滤器。查询结果也可以按时间段(如分钟、小时、天等)进行细分。
|
||||||
|
* 除了timestamp列之外,Druid数据源中的所有列都是dimensions或metrics。这遵循 [OLAP数据的标准命名约定](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems)。
|
||||||
|
* 典型的生产数据源有几十到几百列。
|
||||||
|
* [dimension列](ingestion.md#维度) 按原样存储,因此可以在查询时对其进行筛选、分组或聚合。它们总是单个字符串、字符串数组、单个long、单个double或单个float。
|
||||||
|
* [Metrics列](ingestion.md#指标) 是 [预聚合](../querying/Aggregations.md) 存储的,因此它们只能在查询时聚合(不能按筛选或分组)。它们通常存储为数字(整数或浮点数),但也可以存储为复杂对象,如[HyperLogLog草图或近似分位数草图](../querying/Aggregations.md)。即使禁用了rollup,也可以在接收时配置metrics,但在启用汇总时最有用。
|
||||||
|
|
||||||
|
### 与其他设计模式类比
|
||||||
|
#### 关系模型
|
||||||
|
(如 Hive 或者 PostgreSQL)
|
||||||
|
|
||||||
|
Druid数据源通常相当于关系数据库中的表。Druid的 [lookups特性](../querying/lookups.md) 可以类似于数据仓库样式的维度表,但是正如您将在下面看到的,如果您能够摆脱它,通常建议您进行非规范化。
|
||||||
|
|
||||||
|
关系数据建模的常见实践涉及 [规范化](https://en.wikipedia.org/wiki/Database_normalization) 的思想:将数据拆分为多个表,从而减少或消除数据冗余。例如,在"sales"表中,最佳实践关系建模要求将"product id"列作为外键放入单独的"products"表中,该表依次具有"product id"、"product name"和"product category"列, 这可以防止产品名称和类别需要在"sales"表中引用同一产品的不同行上重复。
|
||||||
|
|
||||||
|
另一方面,在Druid中,通常使用在查询时不需要连接的完全平坦的数据源。在"sales"表的例子中,在Druid中,通常直接将"product_id"、"product_name"和"product_category"作为维度存储在Druid "sales"数据源中,而不使用单独的"products"表。完全平坦的模式大大提高了性能,因为查询时不需要连接。作为一个额外的速度提升,这也允许Druid的查询层直接操作压缩字典编码的数据。因为Druid使用字典编码来有效地为字符串列每行存储一个整数, 所以可能与直觉相反,这并*没有*显著增加相对于规范化模式的存储空间。
|
||||||
|
|
||||||
|
如果需要的话,可以通过使用 [lookups](../querying/lookups.md) 规范化Druid数据源,这大致相当于关系数据库中的维度表。在查询时,您将使用Druid的SQL `LOOKUP` 查找函数或者原生 `lookup` 提取函数,而不是像在关系数据库中那样使用JOIN关键字。由于lookup表会增加内存占用并在查询时产生更多的计算开销,因此仅当需要更新lookup表并立即反映主表中已摄取行的更改时,才建议执行此操作。
|
||||||
|
|
||||||
|
在Druid中建模关系数据的技巧:
|
||||||
|
* Druid数据源没有主键或唯一键,所以跳过这些。
|
||||||
|
* 如果可能的话,去规格化。如果需要定期更新dimensions/lookup并将这些更改反映在已接收的数据中,请考虑使用 [lookups](../querying/lookups.md) 进行部分规范化。
|
||||||
|
* 如果需要将两个大型的分布式表连接起来,则必须在将数据加载到Druid之前执行此操作。Druid不支持两个数据源的查询时间连接。lookup在这里没有帮助,因为每个lookup表的完整副本存储在每个Druid服务器上,所以对于大型表来说,它们不是一个好的选择。
|
||||||
|
* 考虑是否要为预聚合启用[rollup](ingestion.md#rollup),或者是否要禁用rollup并按原样加载现有数据。Druid中的Rollup类似于在关系模型中创建摘要表。
|
||||||
|
|
||||||
|
#### 时序模型
|
||||||
|
(如 OpenTSDB 或者 InfluxDB)
|
||||||
|
|
||||||
|
与时间序列数据库类似,Druid的数据模型需要时间戳。Druid不是时序数据库,但它同时也是存储时序数据的自然选择。它灵活的数据模型允许它同时存储时序和非时序数据,甚至在同一个数据源中。
|
||||||
|
|
||||||
|
为了在Druid中实现时序数据的最佳压缩和查询性能,像时序数据库经常做的一样,按照metric名称进行分区和排序很重要。有关详细信息,请参见 [分区和排序](ingestion.md#分区)。
|
||||||
|
|
||||||
|
在Druid中建模时序数据的技巧:
|
||||||
|
* Druid并不认为数据点是"时间序列"的一部分。相反,Druid对每一点分别进行摄取和聚合
|
||||||
|
* 创建一个维度,该维度指示数据点所属系列的名称。这个维度通常被称为"metric"或"name"。不要将名为"metric"的维度与Druid Metrics的概念混淆。将它放在"dimensionsSpec"中维度列表的第一个位置,以获得最佳性能(这有助于提高局部性;有关详细信息,请参阅下面的 [分区和排序](ingestion.md#分区))
|
||||||
|
* 为附着到数据点的属性创建其他维度。在时序数据库系统中,这些通常称为"标签"
|
||||||
|
* 创建与您希望能够查询的聚合类型相对应的 [Druid Metrics](ingestion.md#指标)。通常这包括"sum"、"min"和"max"(在long、float或double中的一种)。如果你想计算百分位数或分位数,可以使用Druid的 [近似聚合器](../querying/Aggregations.md)
|
||||||
|
* 考虑启用 [rollup](ingestion.md#rollup),这将允许Druid潜在地将多个点合并到Druid数据源中的一行中。如果希望以不同于原始发出的时间粒度存储数据,则这可能非常有用。如果要在同一个数据源中组合时序和非时序数据,它也很有用
|
||||||
|
* 如果您提前不知道要摄取哪些列,请使用空的维度列表来触发 [维度列的自动检测](#无schema的维度列)
|
||||||
|
|
||||||
|
#### 日志聚合模型
|
||||||
|
(如 ElasticSearch 或者 Splunk)
|
||||||
|
|
||||||
|
与日志聚合系统类似,Druid提供反向索引,用于快速搜索和筛选。Druid的搜索能力通常不如这些系统发达,其分析能力通常更为发达。Druid和这些系统之间的主要数据建模差异在于,在将数据摄取到Druid中时,必须更加明确。Druid列具有特定的类型,而Druid目前不支持嵌套数据。
|
||||||
|
|
||||||
|
在Druid中建模日志数据的技巧:
|
||||||
|
* 如果您提前不知道要摄取哪些列,请使用空维度列表来触发 [维度列的自动检测](#无schema的维度列)
|
||||||
|
* 如果有嵌套数据,请使用 [展平规范](ingestion.md#flattenspec) 将其扁平化
|
||||||
|
* 如果您主要有日志数据的分析场景,请考虑启用 [rollup](ingestion.md#rollup),这意味着您将失去从Druid中检索单个事件的能力,但您可能获得大量的压缩和查询性能提升
|
||||||
|
|
||||||
|
### 一般提示以及最佳实践
|
||||||
|
#### Rollup
|
||||||
|
|
||||||
|
Druid可以在接收数据时将其汇总,以最小化需要存储的原始数据量。这是一种汇总或预聚合的形式。有关更多详细信息,请参阅摄取文档的 [汇总部分](ingestion.md#rollup)。
|
||||||
|
|
||||||
|
#### 分区与排序
|
||||||
|
|
||||||
|
对数据进行最佳分区和排序会对占用空间和性能产生重大影响。有关更多详细信息,请参阅摄取文档的 [分区部分](ingestion.md#分区)。
|
||||||
|
|
||||||
|
#### Sketches高基维处理
|
||||||
|
|
||||||
|
在处理高基数列(如用户ID或其他唯一ID)时,请考虑使用草图(sketches)进行近似分析,而不是对实际值进行操作。当您使用草图(sketches)摄取数据时,Druid不存储原始原始数据,而是存储它的"草图(sketches)",它可以在查询时输入到以后的计算中。草图(sketches)的常用场景包括 `count-distinct` 和分位数计算。每个草图都是为一种特定的计算而设计的。
|
||||||
|
|
||||||
|
一般来说,使用草图(sketches)有两个主要目的:改进rollup和减少查询时的内存占用。
|
||||||
|
|
||||||
|
草图(sketches)可以提高rollup比率,因为它们允许您将多个不同的值折叠到同一个草图(sketches)中。例如,如果有两行除了用户ID之外都是相同的(可能两个用户同时执行了相同的操作),则将它们存储在 `count-distinct sketch` 中而不是按原样,这意味着您可以将数据存储在一行而不是两行中。您将无法检索用户id或计算精确的非重复计数,但您仍将能够计算近似的非重复计数,并且您将减少存储空间。
|
||||||
|
|
||||||
|
草图(sketches)减少了查询时的内存占用,因为它们限制了需要在服务器之间洗牌的数据量。例如,在分位数计算中,Druid不需要将所有数据点发送到中心位置,以便对它们进行排序和计算分位数,而只需要发送点的草图。这可以将数据传输需要减少到仅千字节。
|
||||||
|
|
||||||
|
有关Druid中可用的草图的详细信息,请参阅 [近似聚合器页面](../querying/Aggregations.md)。
|
||||||
|
|
||||||
|
如果你更喜欢视频,可以观看 [这场讨论Druid Sketches的会议演讲](https://www.youtube.com/watch?v=Hpd3f_MLdXo)。
|
||||||
|
|
||||||
|
#### 字符串 VS 数值维度
|
||||||
|
|
||||||
|
如果用户希望将列摄取为数值类型的维度(Long、Double或Float),则需要在 `dimensionsSpec` 的 `dimensions` 部分中指定列的类型。如果省略了该类型,Druid会将列作为默认的字符串类型。
|
||||||
|
|
||||||
|
字符串列和数值列之间存在性能折衷。数值列通常比字符串列更快分组。但与字符串列不同,数值列没有索引,因此可以更慢地进行筛选。您可能想尝试为您的用例找到最佳选择。
|
||||||
|
|
||||||
|
有关如何配置数值维度的详细信息,请参阅 [`dimensionsSpec`文档](ingestion.md#dimensionsSpec)
|
||||||
|
|
||||||
|
#### 辅助时间戳
|
||||||
|
|
||||||
|
Druid schema必须始终包含一个主时间戳, 主时间戳用于对数据进行 [分区和排序](ingestion.md#分区),因此它应该是您最常筛选的时间戳。Druid能够快速识别和检索与主时间戳列的时间范围相对应的数据。
|
||||||
|
|
||||||
|
如果数据有多个时间戳,则可以将其他时间戳作为辅助时间戳摄取。最好的方法是将它们作为 [毫秒格式的Long类型维度](ingestion.md#dimensionsspec) 摄取。如有必要,可以使用 [`transformSpec`](ingestion.md#transformspec) 和 `timestamp_parse` 等 [表达式](../misc/expression.md) 将它们转换成这种格式,后者返回毫秒时间戳。
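
按这个思路,下面是一个把辅助时间戳解析为毫秒Long值的 `transformSpec` 草稿(假设输入中有一个ISO8601格式的 `updated_at` 字段,字段名仅为示意):

```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "updated_at_millis", "expression": "timestamp_parse(updated_at)" }
  ]
}
```

随后在 `dimensionsSpec` 的 `dimensions` 中加入 `{ "type": "long", "name": "updated_at_millis" }` 即可将其作为Long类型维度存储。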
|
||||||
|
|
||||||
|
在查询时,可以使用诸如 `MILLIS_TO_TIMESTAMP`、`TIME_FLOOR` 等 [SQL时间函数](../querying/druidsql.md) 查询辅助时间戳。如果您使用的是原生Druid查询,那么可以使用 [表达式](../misc/expression.md)。
|
||||||
|
|
||||||
|
#### 嵌套维度
|
||||||
|
|
||||||
|
在编写本文时,Druid不支持嵌套维度。嵌套维度需要展平,例如,如果您有以下数据:
|
||||||
|
```json
|
||||||
|
{"foo":{"bar": 3}}
|
||||||
|
```
|
||||||
|
|
||||||
|
然后在编制索引之前,应将其转换为:
|
||||||
|
```json
|
||||||
|
{"foo_bar": 3}
|
||||||
|
```
|
||||||
|
|
||||||
|
Druid能够将JSON、Avro或Parquet输入数据展平化。请阅读 [展平规格](ingestion.md#flattenspec) 了解更多细节。
|
||||||
|
|
||||||
|
#### 计数接收事件数
|
||||||
|
|
||||||
|
启用rollup后,查询时的计数聚合器(count aggregator)实际上不会告诉您已摄取的行数。它们告诉您Druid数据源中的行数,可能小于接收的行数。
|
||||||
|
|
||||||
|
在这种情况下,可以使用*摄取时*的计数聚合器来计算事件数。但是,需要注意的是,在查询此Metrics时,应该使用 `longSum` 聚合器。查询时的 `count` 聚合器将返回时间间隔的Druid行数,该行数可用于确定rollup比率。
|
||||||
|
|
||||||
|
为了举例说明,如果摄取规范包含:
|
||||||
|
```json
|
||||||
|
...
|
||||||
|
"metricsSpec" : [
|
||||||
|
{
|
||||||
|
"type" : "count",
|
||||||
|
"name" : "count"
|
||||||
|
},
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
您应该使用查询:
|
||||||
|
```json
|
||||||
|
...
|
||||||
|
"aggregations": [
|
||||||
|
{ "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 无schema的维度列
|
||||||
|
|
||||||
|
如果摄取规范中的 `dimensions` 字段为空,Druid将把不是timestamp列、已排除的维度和metric列之外的每一列都视为维度。
|
||||||
|
|
||||||
|
注意,当使用无schema摄取时,所有维度都将被摄取为字符串类型的维度。
|
||||||
|
|
||||||
|
##### 包含与Dimension和Metric相同的列
|
||||||
|
|
||||||
|
一个具有唯一ID的工作流能够对特定ID进行过滤,同时仍然能够对ID列进行快速的唯一计数。如果不使用无schema维度,则通过将Metric的 `name` 设置为与维度不同的值来支持此场景。如果使用无schema维度,这里的最佳实践是将同一列包含两次,一次作为维度,一次作为 `hyperUnique` Metric。这可能涉及到ETL时的一些工作。
|
||||||
|
|
||||||
|
例如,对于无schema维度,请重复同一列:
|
||||||
|
```json
|
||||||
|
{"device_id_dim":123, "device_id_met":123}
|
||||||
|
```
|
||||||
|
同时在 `metricsSpec` 中包含:
|
||||||
|
```json
|
||||||
|
{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }
|
||||||
|
```
|
||||||
|
`device_id_dim` 将自动作为维度来被选取
|
1
DataIngestion/streamingest.md
Normal file
1
DataIngestion/streamingest.md
Normal file
@ -0,0 +1 @@
|
|||||||
|
<!-- toc -->
|
384
DataIngestion/taskrefer.md
Normal file
384
DataIngestion/taskrefer.md
Normal file
@ -0,0 +1,384 @@
|
|||||||
|
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## 任务参考文档
|
||||||
|
|
||||||
|
任务在Druid中完成所有与 [摄取](ingestion.md) 相关的工作。
|
||||||
|
|
||||||
|
对于批量摄取,通常使用 [任务api](../operations/api.md#Overlord) 直接将任务提交给Druid。对于流式接收,任务通常被提交给supervisor。
|
||||||
|
|
||||||
|
### 任务API
|
||||||
|
|
||||||
|
任务API主要在两个地方是可用的:
|
||||||
|
|
||||||
|
* [Overlord](../design/Overlord.md) 进程提供HTTP API接口来进行提交任务、取消任务、检查任务状态、查看任务日志与报告等。 查看 [任务API文档](../operations/api.md) 可以看到完整列表
|
||||||
|
* Druid SQL包括了一个 [`sys.tasks`](../querying/druidsql.md#系统Schema) ,保存了当前任务运行的信息。 此表是只读的,并且可以通过Overlord API查询完整信息的有限制的子集。
|
||||||
|
|
||||||
|
### 任务报告
|
||||||
|
|
||||||
|
报告包含已完成的任务和正在运行的任务中有关接收的行数和发生的任何分析异常的信息的报表。
|
||||||
|
|
||||||
|
[简单的本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和Kinesis摄取任务支持报告功能。
|
||||||
|
|
||||||
|
#### 任务结束报告
|
||||||
|
|
||||||
|
任务运行完成后,一个完整的报告可以在以下接口获取:
|
||||||
|
|
||||||
|
```json
|
||||||
|
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
|
||||||
|
```
|
||||||
|
|
||||||
|
一个示例输出如下:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"ingestionStatsAndErrors": {
|
||||||
|
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
|
||||||
|
"payload": {
|
||||||
|
"ingestionState": "COMPLETED",
|
||||||
|
"unparseableEvents": {},
|
||||||
|
"rowStats": {
|
||||||
|
"determinePartitions": {
|
||||||
|
"processed": 0,
|
||||||
|
"processedWithError": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"unparseable": 0
|
||||||
|
},
|
||||||
|
"buildSegments": {
|
||||||
|
"processed": 5390324,
|
||||||
|
"processedWithError": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"unparseable": 0
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"errorMsg": null
|
||||||
|
},
|
||||||
|
"type": "ingestionStatsAndErrors"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 任务运行报告
|
||||||
|
|
||||||
|
当一个任务正在运行时, 任务运行报告可以通过以下接口获得,包括摄取状态、未解析事件和过去1分钟、5分钟、15分钟内处理的平均事件数。
|
||||||
|
|
||||||
|
```json
|
||||||
|
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
|
||||||
|
```
|
||||||
|
和
|
||||||
|
```json
|
||||||
|
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/liveReports
|
||||||
|
```
|
||||||
|
|
||||||
|
一个示例输出如下:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"ingestionStatsAndErrors": {
|
||||||
|
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
|
||||||
|
"payload": {
|
||||||
|
"ingestionState": "RUNNING",
|
||||||
|
"unparseableEvents": {},
|
||||||
|
"rowStats": {
|
||||||
|
"movingAverages": {
|
||||||
|
"buildSegments": {
|
||||||
|
"5m": {
|
||||||
|
"processed": 3.392158326408501,
|
||||||
|
"unparseable": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"processedWithError": 0
|
||||||
|
},
|
||||||
|
"15m": {
|
||||||
|
"processed": 1.736165476881023,
|
||||||
|
"unparseable": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"processedWithError": 0
|
||||||
|
},
|
||||||
|
"1m": {
|
||||||
|
"processed": 4.206417693750045,
|
||||||
|
"unparseable": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"processedWithError": 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"totals": {
|
||||||
|
"buildSegments": {
|
||||||
|
"processed": 1994,
|
||||||
|
"processedWithError": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"unparseable": 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"errorMsg": null
|
||||||
|
},
|
||||||
|
"type": "ingestionStatsAndErrors"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
字段的描述信息如下:
|
||||||
|
|
||||||
|
`ingestionStatsAndErrors` 提供了行数和错误数的信息
|
||||||
|
|
||||||
|
`ingestionState` 标识了摄取任务当前达到了哪一步,可能的取值包括:
|
||||||
|
* `NOT_STARTED`: 任务还没有读取任何行
|
||||||
|
* `DETERMINE_PARTITIONS`: 任务正在处理行来决定分区信息
|
||||||
|
* `BUILD_SEGMENTS`: 任务正在处理行来构建段
|
||||||
|
* `COMPLETED`: 任务已经完成
|
||||||
|
|
||||||
|
只有批处理任务具有 `DETERMINE_PARTITIONS` 阶段。实时任务(如由Kafka索引服务创建的任务)没有 `DETERMINE_PARTITIONS` 阶段。
|
||||||
|
|
||||||
|
`unparseableEvents` 包含由不可解析输入引起的异常消息列表。这有助于识别有问题的输入行。对于 `DETERMINE_PARTITIONS` 和 `BUILD_SEGMENTS` 阶段,每个阶段都有一个列表。请注意,Hadoop批处理任务不支持保存不可解析事件。
|
||||||
|
|
||||||
|
`rowStats` map包含有关行计数的信息。每个摄取阶段有一个条目。不同行计数的定义如下所示:
|
||||||
|
|
||||||
|
* `processed`: 成功摄入且没有报错的行数
|
||||||
|
* `processedWithError`: 摄取但在一列或多列中包含解析错误的行数。这通常发生在输入行具有可解析的结构但列的类型无效的情况下,例如为数值列传入非数值字符串值
|
||||||
|
* `thrownAway`: 跳过的行数。 这包括时间戳在摄取任务定义的时间间隔之外的行,以及使用 [`transformSpec`](ingestion.md#transformspec) 过滤掉的行,但不包括显式用户配置跳过的行。例如,CSV格式的 `skipHeaderRows` 或 `hasHeaderRow` 跳过的行不计算在内
|
||||||
|
* `unparseable`: 完全无法分析并被丢弃的行数。这将跟踪没有可解析结构的输入行,例如在使用JSON解析器时传入非JSON数据。
|
||||||
|
|
||||||
|
`errorMsg` 字段显示一条消息,描述导致任务失败的错误。如果任务成功,则为空
|
||||||
|
|
||||||
|
### 实时报告
|
||||||
|
#### 行画像
|
||||||
|
|
||||||
|
非并行的 [简单本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和kinesis摄取任务支持在任务运行时检索行统计信息。
|
||||||
|
|
||||||
|
可以通过运行任务的Peon上的以下URL访问实时报告:
|
||||||
|
|
||||||
|
```json
|
||||||
|
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/rowStats
|
||||||
|
```
|
||||||
|
|
||||||
|
示例报告如下所示。`movingAverages` 部分包含四行计数器的1分钟、5分钟和15分钟移动平均增量,其定义与结束报告中的定义相同。`totals` 部分显示当前总计。
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"movingAverages": {
|
||||||
|
"buildSegments": {
|
||||||
|
"5m": {
|
||||||
|
"processed": 3.392158326408501,
|
||||||
|
"unparseable": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"processedWithError": 0
|
||||||
|
},
|
||||||
|
"15m": {
|
||||||
|
"processed": 1.736165476881023,
|
||||||
|
"unparseable": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"processedWithError": 0
|
||||||
|
},
|
||||||
|
"1m": {
|
||||||
|
"processed": 4.206417693750045,
|
||||||
|
"unparseable": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"processedWithError": 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"totals": {
|
||||||
|
"buildSegments": {
|
||||||
|
"processed": 1994,
|
||||||
|
"processedWithError": 0,
|
||||||
|
"thrownAway": 0,
|
||||||
|
"unparseable": 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
对于Kafka索引服务,向Overlord API发送一个GET请求,将从supervisor管理的每个任务中检索实时行统计报告,并提供一个组合报告。
|
||||||
|
|
||||||
|
```json
|
||||||
|
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/supervisor/<supervisor-id>/stats
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 未解析的事件
|
||||||
|
|
||||||
|
可以对Peon API发起一次Get请求,从正在运行的任务中检索最近遇到的不可解析事件的列表:
|
||||||
|
|
||||||
|
```json
|
||||||
|
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparseableEvents
|
||||||
|
```
|
||||||
|
注意:并不是所有的任务类型支持该功能。 当前,该功能只支持非并行的 [本地批任务](native.md) (`index`类型) 和由Kafka、Kinesis索引服务创建的任务。
|
||||||
|
|
||||||
|
### 任务锁系统
|
||||||
|
|
||||||
|
本节介绍Druid中的任务锁定系统。Druid的锁定系统和版本控制系统是紧密耦合的,以保证接收数据的正确性。
|
||||||
|
|
||||||
|
### 段与段之间的"阴影"
|
||||||
|
|
||||||
|
可以运行任务覆盖现有数据。覆盖任务创建的段将*覆盖*现有段。请注意,覆盖关系只适用于**同一时间块和同一数据源**。在过滤过时数据的查询处理中,不考虑这些被遮盖的段。
|
||||||
|
|
||||||
|
每个段都有一个*主*版本和一个*次*版本。主版本表示为时间戳,格式为["yyyy-MM-dd'T'hh:MM:ss"](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html),次版本表示为整数。这些主版本和次版本用于确定段之间的阴影关系,如下所示。
|
||||||
|
|
||||||
|
在以下条件下,段 `s1` 将会覆盖另一个段 `s2`:
|
||||||
|
* `s1` 比 `s2` 有一个更高的主版本
|
||||||
|
* `s1` 和 `s2` 有相同的主版本,但是有更高的次版本
|
||||||
|
|
||||||
|
以下是一些示例:
|
||||||
|
* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段将覆盖另一个主版本为 `2018-01-01T00:00:00.000Z` 且次版本为 `1` 的段
|
||||||
|
* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `1` 的段将覆盖另一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段
|
||||||
|
|
||||||
|
### 锁
|
||||||
|
|
||||||
|
如果您正在运行两个或多个 [Druid任务](taskrefer.md),这些任务为同一数据源和同一时间块生成段,那么生成的段可能会相互覆盖,从而导致错误的查询结果。
|
||||||
|
|
||||||
|
为了避免这个问题,任务将在Druid中创建任何段之前尝试获取锁, 有两种类型的锁,即 *时间块锁* 和 *段锁*。
|
||||||
|
|
||||||
|
使用时间块锁时,任务将锁定生成的段将写入数据源的整个时间块。例如,假设我们有一个任务将数据摄取到 `wikipedia` 数据源的时间块 `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` 中。使用时间块锁,此任务将在创建段之前锁定wikipedia数据源的 `2019-01-01T00:00.000Z/2019-01-02T00:00:00.000Z` 整个时间块。只要它持有锁,任何其他任务都将无法为同一数据源的同一时间块创建段。使用时间块锁创建的段的主版本*高于*现有段, 它们的次版本总是 `0`。
|
||||||
|
|
||||||
|
使用段锁时,任务锁定单个段而不是整个时间块。因此,如果两个或多个任务正在读取不同的段,则它们可以同时为同一时间创建同一数据源的块。例如,Kafka索引任务和压缩合并任务总是可以同时将段写入同一数据源的同一时间块中。原因是,Kafka索引任务总是附加新段,而压缩合并任务总是覆盖现有段。使用段锁创建的段具有*相同的*主版本和较高的次版本。
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
> 段锁仍然是实验性的。它可能有未知的错误,这可能会导致错误的查询结果。
|
||||||
|
|
||||||
|
要启用段锁定,可能需要在 [task context(任务上下文)](#上下文参数) 中将 `forceTimeChunkLock` 设置为 `false`。一旦 `forceTimeChunkLock` 被取消设置,任务将自动选择正确的锁类型。**请注意**,段锁并不总是可用的。使用时间块锁的最常见场景是当覆盖任务更改段粒度时。此外,只有本地索引任务和Kafka/kinesis索引任务支持段锁。Hadoop索引任务和索引实时(`index_realtime`)任务(被 [Tranquility](tranquility.md)使用)还不支持它。
|
||||||
|
|
||||||
|
任务上下文中的 `forceTimeChunkLock` 仅应用于单个任务。如果要为所有任务取消设置,则需要在 [Overlord配置](../Configuration/configuration.md#overlord) 中设置 `druid.indexer.tasklock.forceTimeChunkLock` 为false。
|
||||||
|
|
||||||
|
如果两个或多个任务尝试为同一数据源的重叠时间块获取锁,则锁请求可能会相互冲突。**请注意,**锁冲突可能发生在不同的锁类型之间。
|
||||||
|
|
||||||
|
锁冲突的行为取决于 [任务优先级](#锁优先级)。如果冲突锁请求的所有任务具有相同的优先级,则首先请求的任务将获得锁, 其他任务将等待任务释放锁。
|
||||||
|
|
||||||
|
如果优先级较低的任务请求锁的时间晚于优先级较高的任务,则此任务还将等待优先级较高的任务释放锁。如果优先级较高的任务比优先级较低的任务请求锁的时间晚,则此任务将*抢占*优先级较低的另一个任务。优先级较低的任务的锁将被撤销,优先级较高的任务将获得一个新锁。
|
||||||
|
|
||||||
|
锁抢占可以在任务运行时随时发生,除非它在关键的*段发布阶段*。一旦发布段完成,它的锁将再次成为可抢占的。
|
||||||
|
|
||||||
|
**请注意**,锁由同一groupId的任务共享。例如,同一supervisor的Kafka索引任务具有相同的groupId,并且彼此共享所有锁。
|
||||||
|
|
||||||
|
### 锁优先级
|
||||||
|
|
||||||
|
每个任务类型都有不同的默认锁优先级。下表显示了不同任务类型的默认优先级。数字越高,优先级越高。
|
||||||
|
|
||||||
|
| 任务类型 | 默认优先级 |
|
||||||
|
|-|-|
|
||||||
|
| 实时索引任务 | 75 |
|
||||||
|
| 批量索引任务 | 50 |
|
||||||
|
| 合并/追加/压缩任务 | 25 |
|
||||||
|
| 其他任务 | 0 |
|
||||||
|
|
||||||
|
通过在任务上下文中设置优先级,可以覆盖任务优先级,如下所示。
|
||||||
|
|
||||||
|
```json
|
||||||
|
"context" : {
|
||||||
|
"priority" : 100
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### 上下文参数
|
||||||
|
|
||||||
|
任务上下文用于各种单独的任务配置。以下参数适用于所有任务类型。
|
||||||
|
|
||||||
|
| 属性 | 默认值 | 描述 |
|
||||||
|
|-|-|-|
|
||||||
|
| `taskLockTimeout` | 300000 | 任务锁定超时(毫秒)。更多详细信息,可以查看 [锁](#锁) 部分 |
|
||||||
|
| `forceTimeChunkLock` | true | *将此设置为false仍然是实验性的* 。强制始终使用时间块锁。如果未设置,则每个任务都会自动选择要使用的锁类型。如果设置了,它将覆盖 [Overlord配置](../Configuration/configuration.md#overlord) 的 `druid.indexer.tasklock.forceTimeChunkLock` 配置。有关详细信息,可以查看 [锁](#锁) 部分。|
|
||||||
|
| `priority` | 不同任务类型是不同的。 参见 [锁优先级](#锁优先级) | 任务优先级 |
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
> 当任务获取锁时,它通过HTTP发送请求并等待,直到它收到包含锁获取结果的响应为止。因此,如果 `taskLockTimeout` 大于 Overlord的`druid.server.http.maxIdleTime` 将会产生HTTP超时错误。
|
||||||
|
|
||||||
|
### 所有任务类型
|
||||||
|
#### `index`
|
||||||
|
|
||||||
|
参见 [本地批量摄取(简单任务)](native.md#简单任务)
|
||||||
|
|
||||||
|
#### `index_parallel`
|
||||||
|
|
||||||
|
参见 [本地批量摄取(并行任务)](native.md#并行任务)
|
||||||
|
|
||||||
|
#### `index_sub`
|
||||||
|
|
||||||
|
由 [`index_parallel`](#index_parallel) 代表您自动提交的任务。
|
||||||
|
|
||||||
|
#### `index_hadoop`
|
||||||
|
|
||||||
|
参见 [基于Hadoop的摄取](hadoopbased.md)
|
||||||
|
|
||||||
|
#### `index_kafka`
|
||||||
|
|
||||||
|
由 [`Kafka摄取supervisor`](kafka.md) 代表您自动提交的任务。
|
||||||
|
|
||||||
|
#### `index_kinesis`
|
||||||
|
|
||||||
|
由 [`Kinesis摄取supervisor`](kinesis.md) 代表您自动提交的任务。
|
||||||
|
|
||||||
|
#### `index_realtime`
|
||||||
|
|
||||||
|
由 [`Tranquility`](tranquility.md) 代表您自动提交的任务。
|
||||||
|
|
||||||
|
#### `compact`
|
||||||
|
|
||||||
|
压缩任务合并给定间隔的所有段。有关详细信息,请参见有关 [压缩](datamanage.md#压缩与重新索引) 的文档。
|
||||||
|
|
||||||
|
#### `kill`
|
||||||
|
|
||||||
|
Kill tasks删除有关某些段的所有元数据,并将其从深层存储中删除。有关详细信息,请参阅有关 [删除数据](datamanage.md#删除数据) 的文档。
|
||||||
|
|
||||||
|
#### `append`
|
||||||
|
|
||||||
|
附加任务将段列表附加到单个段中(一个接一个)。语法是:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type": "append",
|
||||||
|
"id": <task_id>,
|
||||||
|
"dataSource": <task_datasource>,
|
||||||
|
"segments": <JSON list of DataSegment objects to append>,
|
||||||
|
"aggregations": <optional list of aggregators>,
|
||||||
|
"context": <task context>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
#### `merge`
|
||||||
|
|
||||||
|
合并任务将段列表合并在一起,并对相同时间戳的行进行合并。如果在摄取过程中禁用了rollup,则不会合并相同时间戳的行,而是按时间戳对行重新排序。
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
> [`compact`](#compact) 任务通常是比 `merge` 任务更好的选择。
|
||||||
|
|
||||||
|
语法是:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type": "merge",
|
||||||
|
"id": <task_id>,
|
||||||
|
"dataSource": <task_datasource>,
|
||||||
|
"aggregations": <list of aggregators>,
|
||||||
|
"rollup": <whether or not to rollup data during a merge>,
|
||||||
|
"segments": <JSON list of DataSegment objects to merge>,
|
||||||
|
"context": <task context>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
#### `same_interval_merge`
|
||||||
|
|
||||||
|
同一间隔合并任务是合并任务的快捷方式,间隔中的所有段都将被合并。
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
> [`compact`](#compact) 任务通常是比 `same_interval_merge` 任务更好的选择。
|
||||||
|
|
||||||
|
语法是:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"type": "same_interval_merge",
|
||||||
|
"id": <task_id>,
|
||||||
|
"dataSource": <task_datasource>,
|
||||||
|
"aggregations": <list of aggregators>,
|
||||||
|
"rollup": <whether or not to rollup data during a merge>,
|
||||||
|
"interval": <DataSegment objects in this interval are going to be merged>,
|
||||||
|
"context": <task context>
|
||||||
|
}
|
||||||
|
```
|
1
DataIngestion/tranquility.md
Normal file
1
DataIngestion/tranquility.md
Normal file
@ -0,0 +1 @@
|
|||||||
|
<!-- toc -->
|
63
GettingStarted/Docker.md
Normal file
63
GettingStarted/Docker.md
Normal file
@ -0,0 +1,63 @@
|
|||||||
|
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## Docker
|
||||||
|
|
||||||
|
在这个部分中,我们将从 [Docker Hub](https://hub.docker.com/r/apache/druid) 下载Apache Druid镜像,并使用 [Docker](https://www.docker.com/get-started) 和 [Docker Compose](https://docs.docker.com/compose/) 在一台机器上安装它。完成此初始设置后,集群将准备好加载数据。
|
||||||
|
|
||||||
|
在开始快速启动之前,阅读 [Druid概述](chapter-1.md) 和 [摄取概述](../DataIngestion/ingestion.md) 是很有帮助的,因为教程将参考这些页面上讨论的概念。此外,建议熟悉Docker。
|
||||||
|
|
||||||
|
### 前提条件
|
||||||
|
|
||||||
|
* Docker
|
||||||
|
|
||||||
|
### 快速开始
|
||||||
|
|
||||||
|
Druid源代码包含一个 [示例docker-compose.yml](https://github.com/apache/druid/blob/master/distribution/docker/docker-compose.yml) 它可以从Docker Hub中提取一个镜像,适合用作示例环境,并用于试验基于Docker的Druid配置和部署。
|
||||||
|
|
||||||
|
#### Compose文件
|
||||||
|
|
||||||
|
示例 `docker-compose.yml` 将为每个Druid服务创建一个容器,包括Zookeeper和作为元数据存储PostgreSQL容器。深度存储将是本地目录,默认配置为相对于 `docker-compose.yml`文件的 `./storage`,并将作为 `/opt/data` 挂载,并在需要访问深层存储的Druid容器之间共享。Druid容器是通过 [环境文件](https://github.com/apache/druid/blob/master/distribution/docker/environment) 配置的。
|
||||||
|
|
||||||
|
#### 配置
|
||||||
|
|
||||||
|
Druid Docker容器的配置是通过环境变量完成的,环境变量还可以指定到 [标准Druid配置文件](../Configuration/configuration.md) 的路径
|
||||||
|
|
||||||
|
特殊环境变量:
|
||||||
|
|
||||||
|
* `JAVA_OPTS` -- 设置 java options
|
||||||
|
* `DRUID_LOG4J` -- 设置完成的 `log4j.xml`
|
||||||
|
* `DRUID_LOG_LEVEL` -- 覆盖在log4j中的默认日志级别
|
||||||
|
* `DRUID_XMX` -- 设置 Java `Xmx`
|
||||||
|
* `DRUID_XMS` -- 设置 Java `Xms`
|
||||||
|
* `DRUID_MAXNEWSIZE` -- 设置 Java最大新生代大小
|
||||||
|
* `DRUID_NEWSIZE` -- 设置 Java 新生代大小
|
||||||
|
* `DRUID_MAXDIRECTMEMORYSIZE` -- 设置Java最大直接内存大小
|
||||||
|
* `DRUID_CONFIG_COMMON` -- druid "common"属性文件的完整路径
|
||||||
|
* `DRUID_CONFIG_${service}` -- druid "service"属性文件的完整路径
|
||||||
|
|
||||||
|
除了特殊的环境变量外,在容器中启动Druid的脚本还将尝试使用以 `druid_`前缀开头的任何环境变量作为命令行配置。例如,Druid容器进程中的环境变量`druid_metadata_storage_type=postgresql` 将被转换为 `-Ddruid.metadata.storage.type=postgresql`
|
||||||
|
|
||||||
|
Druid `docker-compose.yml` 示例使用单个环境文件来指定完整的Druid配置;但是,在生产用例中,我们建议使用 `DRUID_COMMON_CONFIG` 和`DRUID_CONFIG_${service}` 或专门定制的特定于服务的环境文件。
|
||||||
|
|
||||||
|
### 启动集群
|
||||||
|
|
||||||
|
运行 `docker-compose up` 启动附加shell的集群,或运行 `docker-compose up -d` 在后台运行集群。如果直接使用示例文件,这个命令应该从Druid安装目录中的 `distribution/docker/` 运行。
|
||||||
|
|
||||||
|
启动集群后,可以导航到 [http://localhost:8888](http://localhost:8888) 。服务于 [Druid控制台](../operations/druid-console.md) 的 [Druid路由进程](../Design/Router.md) 位于这个地址。
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
所有Druid进程需要几秒钟才能完全启动。如果在启动服务后立即打开控制台,可能会看到一些可以安全忽略的错误。
|
||||||
|
|
||||||
|
从这里你可以跟着 [标准教程](chapter-2.md),或者详细说明你的 `docker-compose.yml` 根据需要添加任何其他外部服务依赖项。
|
69
GettingStarted/chapter-3.md
Normal file
69
GettingStarted/chapter-3.md
Normal file
@ -0,0 +1,69 @@
|
|||||||
|
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
### 单服务器部署
|
||||||
|
|
||||||
|
Druid包括一组参考配置和用于单机部署的启动脚本:
|
||||||
|
|
||||||
|
* `nano-quickstart`
|
||||||
|
* `micro-quickstart`
|
||||||
|
* `small`
|
||||||
|
* `medium`
|
||||||
|
* `large`
|
||||||
|
* `xlarge`
|
||||||
|
|
||||||
|
`micro-quickstart`适合于笔记本电脑等小型机器,旨在用于快速评估测试使用场景。
|
||||||
|
|
||||||
|
`nano-quickstart`是一种甚至更小的配置,目标是具有1个CPU和4GB内存的计算机。它旨在在资源受限的环境(例如小型Docker容器)中进行有限的评估测试。
|
||||||
|
|
||||||
|
其他配置旨在用于一般用途的单机部署,它们的大小适合大致基于亚马逊i3系列EC2实例的硬件。
|
||||||
|
|
||||||
|
这些示例配置的启动脚本与Druid服务一起运行单个ZK实例,您也可以选择单独部署ZK。
|
||||||
|
|
||||||
|
通过[Coordinator配置文档](../Configuration/configuration.md#Coordinator)中描述的可选配置`druid.coordinator.asOverlord.enabled = true`可以在单个进程中同时运行Druid Coordinator和Overlord。
|
||||||
|
|
||||||
|
虽然为大型单台计算机提供了示例配置,但在更高规模下,我们建议在集群部署中运行Druid,以实现容错和减少资源争用。
|
||||||
|
|
||||||
|
#### 单服务器参考配置
|
||||||
|
##### Nano-Quickstart: 1 CPU, 4GB 内存
|
||||||
|
|
||||||
|
* 启动命令: `bin/start-nano-quickstart`
|
||||||
|
* 配置目录: `conf/druid/single-server/nano-quickstart`
|
||||||
|
|
||||||
|
##### Micro-Quickstart: 4 CPU, 16GB 内存
|
||||||
|
|
||||||
|
* 启动命令: `bin/start-micro-quickstart`
|
||||||
|
* 配置目录: `conf/druid/single-server/micro-quickstart`
|
||||||
|
|
||||||
|
##### Small: 8 CPU, 64GB 内存 (~i3.2xlarge)
|
||||||
|
|
||||||
|
* 启动命令: `bin/start-small`
|
||||||
|
* 配置目录: `conf/druid/single-server/small`
|
||||||
|
|
||||||
|
##### Medium: 16 CPU, 128GB 内存 (~i3.4xlarge)
|
||||||
|
|
||||||
|
* 启动命令: `bin/start-medium`
|
||||||
|
* 配置目录: `conf/druid/single-server/medium`
|
||||||
|
|
||||||
|
##### Large: 32 CPU, 256GB 内存 (~i3.8xlarge)
|
||||||
|
|
||||||
|
* 启动命令: `bin/start-large`
|
||||||
|
* 配置目录: `conf/druid/single-server/large`
|
||||||
|
|
||||||
|
##### X-Large: 64 CPU, 512GB 内存 (~i3.16xlarge)
|
||||||
|
|
||||||
|
* 启动命令: `bin/start-xlarge`
|
||||||
|
* 配置目录: `conf/druid/single-server/xlarge`
|
||||||
|
|
||||||
|
---
|
421
GettingStarted/chapter-4.md
Normal file
421
GettingStarted/chapter-4.md
Normal file
@ -0,0 +1,421 @@
|
|||||||
|
<!-- toc -->
|
||||||
|
|
||||||
|
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
|
||||||
|
<ins class="adsbygoogle"
|
||||||
|
style="display:block; text-align:center;"
|
||||||
|
data-ad-layout="in-article"
|
||||||
|
data-ad-format="fluid"
|
||||||
|
data-ad-client="ca-pub-8828078415045620"
|
||||||
|
data-ad-slot="7586680510"></ins>
|
||||||
|
<script>
|
||||||
|
(adsbygoogle = window.adsbygoogle || []).push({});
|
||||||
|
</script>
|
||||||
|
|
||||||
|
## 集群部署
|
||||||
|
|
||||||
|
Apache Druid旨在作为可伸缩的容错集群进行部署。
|
||||||
|
|
||||||
|
在本文档中,我们将安装一个简单的集群,并讨论如何对其进行进一步配置以满足您的需求。
|
||||||
|
|
||||||
|
这个简单的集群将具有以下特点:
|
||||||
|
* 一个Master服务同时起Coordinator和Overlord进程
|
||||||
|
* 两个可伸缩、容错的Data服务来运行Historical和MiddleManager进程
|
||||||
|
* 一个Query服务,运行Druid Broker和Router进程
|
||||||
|
|
||||||
|
在生产中,我们建议根据您的特定容错需求部署多个Master服务器和多个Query服务器,但是您可以使用一台Master服务器和一台Query服务器将服务快速运行起来,然后再添加更多服务器。
|
||||||
|
### 选择硬件
|
||||||
|
#### 首次部署
|
||||||
|
|
||||||
|
如果您现在没有Druid集群,并打算首次以集群模式部署运行Druid,则本指南提供了一个包含预先配置的集群部署示例。
|
||||||
|
|
||||||
|
##### Master服务
|
||||||
|
|
||||||
|
Coordinator进程和Overlord进程负责处理集群的元数据和协调需求,它们可以运行在同一台服务器上。
|
||||||
|
|
||||||
|
在本示例中,我们将在等效于AWS[m5.2xlarge](https://aws.amazon.com/ec2/instance-types/m5/)实例的硬件环境上部署。
|
||||||
|
|
||||||
|
硬件规格为:
|
||||||
|
|
||||||
|
* 8核CPU
|
||||||
|
* 31GB内存
|
||||||
|
|
||||||
|
可以在`conf/druid/cluster/master`下找到适用于此硬件规格的Master示例服务配置。
|
||||||
|
|
||||||
|
##### Data服务
|
||||||
|
|
||||||
|
Historical和MiddleManager可以分配在同一台服务器上运行,以处理集群中的实际数据,这两个服务受益于CPU、内存和固态硬盘。
|
||||||
|
|
||||||
|
在本示例中,我们将在等效于AWS[i3.4xlarge](https://aws.amazon.com/cn/ec2/instance-types/i3/)实例的硬件环境上部署。
|
||||||
|
|
||||||
|
硬件规格为:
|
||||||
|
* 16核CPU
|
||||||
|
* 122GB内存
|
||||||
|
* 2 * 1.9TB 固态硬盘
|
||||||
|
|
||||||
|
可以在`conf/druid/cluster/data`下找到适用于此硬件规格的Data示例服务配置。
|
||||||
|
|
||||||
|
##### Query服务
|
||||||
|
|
||||||
|
Druid Broker服务接收查询请求,并将其转发到集群中的其他部分,同时其可以可选的配置内存缓存。 Broker服务受益于CPU和内存。
|
||||||
|
|
||||||
|
在本示例中,我们将在等效于AWS[m5.2xlarge](https://aws.amazon.com/ec2/instance-types/m5/)实例的硬件环境上部署。
|
||||||
|
|
||||||
|
硬件规格为:
|
||||||
|
|
||||||
|
* 8核CPU
|
||||||
|
* 31GB内存
|
||||||
|
|
||||||
|
您可以考虑将所有的其他开源UI工具或者查询依赖等与Broker服务部署在同一台服务器上。
|
||||||
|
|
||||||
|
可以在`conf/druid/cluster/query`下找到适用于此硬件规格的Query示例服务配置。
|
||||||
|
|
||||||
|
##### 其他硬件配置
|
||||||
|
|
||||||
|
上面的示例集群是从多种确定Druid集群大小的可能方式中选择的一个示例。
|
||||||
|
|
||||||
|
您可以根据自己的特定需求和限制选择较小/较大的硬件或较少/更多的服务器。
|
||||||
|
|
||||||
|
如果您的使用场景具有复杂的扩展要求,则还可以选择不将Druid服务混合部署(例如,独立的Historical Server)。
|
||||||
|
|
||||||
|
[基本集群调整指南](../operations/basicClusterTuning.md)中的信息可以帮助您进行决策,并可以调整配置大小。
|
||||||
|
|
||||||
|
#### 从单服务器环境迁移部署
|
||||||
|
|
||||||
|
如果您现在已有单服务器部署的环境,例如[单服务器部署示例](./chapter-3.md)中的部署,并且希望迁移到类似规模的集群部署,则以下部分包含一些选择Master/Data/Query服务等效硬件的准则。
|
||||||
|
|
||||||
|
##### Master服务
|
||||||
|
|
||||||
|
Master服务的主要考虑点是可用CPU以及用于Coordinator和Overlord进程的堆内存。
|
||||||
|
|
||||||
|
首先计算出来在单服务器环境下Coordinator和Overlord已分配堆内存之和,然后选择具有足够内存的Master服务硬件,同时还需要考虑到为服务器上其他进程预留一些额外的内存。
|
||||||
|
|
||||||
|
对于CPU,可以选择接近于单服务器环境核数1/4的硬件。
|
||||||
|
|
||||||
|
##### Data服务
|
||||||
|
|
||||||
|
在为集群Data服务选择硬件时,主要考虑可用的CPU和内存,可行时使用SSD存储。
|
||||||
|
|
||||||
|
在集群化部署时,出于容错的考虑,最好是部署多个Data服务。
|
||||||
|
|
||||||
|
在选择Data服务的硬件时,可以假定一个分裂因子`N`,将原来的单服务器环境的CPU和内存除以`N`,然后在新集群中部署`N`个硬件规格缩小的Data服务。
|
||||||
|
|
||||||
|
##### Query服务
|
||||||
|
|
||||||
|
Query服务的硬件选择主要考虑可用的CPU、Broker服务的堆内和堆外内存、Router服务的堆内存。
|
||||||
|
|
||||||
|
首先计算出来在单服务器环境下Broker和Router已分配堆内存之和,然后选择可以覆盖Broker和Router内存的Query服务硬件,同时还需要考虑到为服务器上其他进程预留一些额外的内存。
|
||||||
|
|
||||||
|
对于CPU,可以选择接近于单服务器环境核数1/4的硬件。
|
||||||
|
|
||||||
|
[基本集群调优指南](../operations/basicClusterTuning.md)包含有关如何计算Broker和Router服务内存使用量的信息。
|
||||||
|
|
||||||
|
### 选择操作系统
|
||||||
|
|
||||||
|
我们建议运行您喜欢的Linux发行版,同时还需要:
|
||||||
|
|
||||||
|
* **Java 8**
|
||||||
|
|
||||||
|
> [!WARNING]
|
||||||
|
> Druid服务运行依赖Java 8,可以使用环境变量`DRUID_JAVA_HOME`或`JAVA_HOME`指定在何处查找Java,有关更多详细信息,请运行`verify-java`脚本。
|
||||||
|
|
||||||
|
### 下载发行版
|
||||||
|
|
||||||
|
首先,下载并解压缩发布安装包。最好首先在单台计算机上执行此操作,因为您将编辑配置,然后将修改后的配置分发到所有服务器上。
|
||||||
|
|
||||||
|
[下载](https://www.apache.org/dyn/closer.cgi?path=/druid/0.17.0/apache-druid-0.17.0-bin.tar.gz)Druid最新0.17.0release安装包
|
||||||
|
|
||||||
|
在终端中运行以下命令来提取Druid
|
||||||
|
|
||||||
|
```
|
||||||
|
tar -xzf apache-druid-0.17.0-bin.tar.gz
|
||||||
|
cd apache-druid-0.17.0
|
||||||
|
```
|
||||||
|
|
||||||
|
在安装包中有以下文件:
|
||||||
|
|
||||||
|
* `LICENSE`和`NOTICE`文件
|
||||||
|
* `bin/*` - 启停等脚本
|
||||||
|
* `conf/druid/cluster/*` - 用于集群部署的模板配置
|
||||||
|
* `extensions/*` - Druid核心扩展
|
||||||
|
* `hadoop-dependencies/*` - Druid Hadoop依赖
|
||||||
|
* `lib/*` - Druid核心库和依赖
|
||||||
|
* `quickstart/*` - 与[快速入门](./chapter-2.md)相关的文件
|
||||||
|
|
||||||
|
我们主要是编辑`conf/druid/cluster/`中的文件。
|
||||||
|
|
||||||
|
#### 从单服务器环境迁移部署
|
||||||
|
|
||||||
|
在以下各节中,我们将在`conf/druid/cluster`下编辑配置。
|
||||||
|
|
||||||
|
如果您已经有一个单服务器部署,请将您的现有配置复制到`conf/druid /cluster`以保留您所做的所有配置更改。
|
||||||
|
|
||||||
|
### 配置元数据存储和深度存储
|
||||||
|
#### 从单服务器环境迁移部署
|
||||||
|
|
||||||
|
如果您已经有一个单服务器部署,并且希望在整个迁移过程中保留数据,请在更新元数据/深层存储配置之前,按照[元数据迁移](../operations/metadataMigration.md)和[深层存储迁移](../operations/DeepstorageMigration.md)中的说明进行操作。
|
||||||
|
|
||||||
|
这些指南针对使用Derby元数据存储和本地深度存储的单服务器部署。 如果您已经在单服务器集群中使用了非Derby元数据存储,则可以在新集群中可以继续使用当前的元数据存储。
|
||||||
|
|
||||||
|
这些指南还提供了有关从本地深度存储迁移段的信息。集群部署需要分布式深度存储,例如S3或HDFS。 如果单服务器部署已在使用分布式深度存储,则可以在新集群中继续使用当前的深度存储。
|
||||||
|
|
||||||
|
#### 元数据存储
|
||||||
|
|
||||||
|
在`conf/druid/cluster/_common/common.runtime.properties`中,使用您将用作元数据存储的服务器地址来替换"metadata.storage.*":
|
||||||
|
|
||||||
|
* `druid.metadata.storage.connector.connectURI`
|
||||||
|
* `druid.metadata.storage.connector.host`
|
||||||
|
|
||||||
|
在生产部署中,我们建议运行专用的元数据存储,例如具有复制功能的MySQL或PostgreSQL,与Druid服务器分开部署。
|
||||||
|
|
||||||
|
[MySQL扩展](../Configuration/core-ext/mysql.md)和[PostgreSQL](../Configuration/core-ext/postgresql.md)扩展文档包含有关扩展配置和初始数据库安装的说明。
|
||||||
|
|
||||||
|
#### 深度存储
|
||||||
|
|
||||||
|
Druid依赖于分布式文件系统或大对象(blob)存储来存储数据,最常用的深度存储实现是S3(适合于在AWS上)和HDFS(适合于已有Hadoop集群)。
|
||||||
|
|
||||||
|
##### S3
|
||||||
|
|
||||||
|
在`conf/druid/cluster/_common/common.runtime.properties`中,
|
||||||
|
|
||||||
|
* 在`druid.extension.loadList`配置项中增加"druid-s3-extensions"扩展
|
||||||
|
* 注释掉配置文件中用于本地存储的"Deep Storage"和"Indexing service logs"
|
||||||
|
* 打开配置文件中关于"For S3"部分中"Deep Storage"和"Indexing service logs"的配置
|
||||||
|
|
||||||
|
上述操作之后,您将看到以下的变化:
|
||||||
|
|
||||||
|
```json
|
||||||
|
druid.extensions.loadList=["druid-s3-extensions"]
|
||||||
|
|
||||||
|
#druid.storage.type=local
|
||||||
|
#druid.storage.storageDirectory=var/druid/segments
|
||||||
|
|
||||||
|
druid.storage.type=s3
|
||||||
|
druid.storage.bucket=your-bucket
|
||||||
|
druid.storage.baseKey=druid/segments
|
||||||
|
druid.s3.accessKey=...
|
||||||
|
druid.s3.secretKey=...
|
||||||
|
|
||||||
|
#druid.indexer.logs.type=file
|
||||||
|
#druid.indexer.logs.directory=var/druid/indexing-logs
|
||||||
|
|
||||||
|
druid.indexer.logs.type=s3
|
||||||
|
druid.indexer.logs.s3Bucket=your-bucket
|
||||||
|
druid.indexer.logs.s3Prefix=druid/indexing-logs
|
||||||
|
```
|
||||||
|
更多信息可以看[S3扩展](../Configuration/core-ext/s3.md)部分的文档。
|
||||||
|
|
||||||
|
##### HDFS
|
||||||
|
|
||||||
|
在`conf/druid/cluster/_common/common.runtime.properties`中,
|
||||||
|
|
||||||
|
* 在`druid.extension.loadList`配置项中增加"druid-hdfs-storage"扩展
|
||||||
|
* 注释掉配置文件中用于本地存储的"Deep Storage"和"Indexing service logs"
|
||||||
|
* 打开配置文件中关于"For HDFS"部分中"Deep Storage"和"Indexing service logs"的配置
|
||||||
|
|
||||||
|
上述操作之后,您将看到以下的变化:
|
||||||
|
|
||||||
|
```json
|
||||||
|
druid.extensions.loadList=["druid-hdfs-storage"]
|
||||||
|
|
||||||
|
#druid.storage.type=local
|
||||||
|
#druid.storage.storageDirectory=var/druid/segments
|
||||||
|
|
||||||
|
druid.storage.type=hdfs
|
||||||
|
druid.storage.storageDirectory=/druid/segments
|
||||||
|
|
||||||
|
#druid.indexer.logs.type=file
|
||||||
|
#druid.indexer.logs.directory=var/druid/indexing-logs
|
||||||
|
|
||||||
|
druid.indexer.logs.type=hdfs
|
||||||
|
druid.indexer.logs.directory=/druid/indexing-logs
|
||||||
|
```
|
||||||
|
|
||||||
|
同时:
|
||||||
|
|
||||||
|
* 需要将Hadoop的配置文件(core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml)放置在Druid进程的classpath中,可以将他们拷贝到`conf/druid/cluster/_common`目录中
|
||||||
|
|
||||||
|
更多信息可以看[HDFS扩展](../Configuration/core-ext/hdfs.md)部分的文档。
|
||||||
|
|
||||||
|
### Hadoop连接配置
|
||||||
|
|
||||||
|
如果要从Hadoop集群加载数据,那么此时应对Druid做如下配置:
|
||||||
|
|
||||||
|
* 在`conf/druid/cluster/_common/common.runtime.properties`文件中更新`druid.indexer.task.hadoopWorkingPath`配置项,将其更新为您期望的一个用于临时文件存储的HDFS路径。 通常会配置为`druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing`
|
||||||
|
* 需要将Hadoop的配置文件(core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml)放置在Druid进程的classpath中,可以将他们拷贝到`conf/druid/cluster/_common`目录中
|
||||||
|
|
||||||
|
请注意,您无需为了可以从Hadoop加载数据而使用HDFS深度存储。例如,如果您的集群在Amazon Web Services上运行,即使您使用Hadoop或Elastic MapReduce加载数据,我们也建议使用S3进行深度存储。
|
||||||
|
|
||||||
|
更多信息可以看[基于Hadoop的数据摄取](../DataIngestion/hadoopbased.md)部分的文档。
|
||||||
|
|
||||||
|
### Zookeeper连接配置
|
||||||
|
|
||||||
|
在生产集群中,我们建议使用专用的ZK集群,该集群与Druid服务器分开部署。
|
||||||
|
|
||||||
|
在 `conf/druid/cluster/_common/common.runtime.properties` 中,将 `druid.zk.service.host` 设置为包含用逗号分隔的host:port对列表的连接字符串,每个host:port对对应ZK集群中的一台ZooKeeper服务器(例如 "127.0.0.1:4545" 或 "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002")。
|
||||||
|
|
||||||
|
您也可以选择在Master服务上运行ZK,而不使用专用的ZK集群。如果这样做,我们建议部署3个Master服务,以便您具有ZK仲裁。
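For a dedicated three-node ensemble, the setting might look like the following (host names are placeholders):

```properties
# Comma-separated list of ZooKeeper servers
druid.zk.service.host=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```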

### Configuration tuning

#### Migrating from a single-server deployment

##### Master services

If you are using the example configurations from the [single-server deployment examples](./chapter-3.md), note that those examples combine the Coordinator and Overlord processes into one combined process.

The example configurations under `conf/druid/cluster/master/coordinator-overlord` likewise combine the Coordinator and Overlord processes.

You can copy your existing `coordinator-overlord` configs from the single-server deployment into `conf/druid/cluster/master/coordinator-overlord`, as sketched below.
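A minimal sketch of that copy, assuming the old single-server configs live under `conf/druid/single-server/<size>/` in the previous installation directory (the layout and size name are assumptions; adjust the paths to match your old deployment):

```bash
# Copy the combined Coordinator/Overlord config from the old single-server layout
OLD_DRUID=/path/to/old/apache-druid
cp "$OLD_DRUID"/conf/druid/single-server/medium/coordinator-overlord/jvm.config \
   "$OLD_DRUID"/conf/druid/single-server/medium/coordinator-overlord/runtime.properties \
   conf/druid/cluster/master/coordinator-overlord/
```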

##### Data services

Suppose we are migrating from a single-server deployment with 32 CPUs and 256GB RAM. In the old deployment, the Historical and MiddleManager used the following configurations:

Historical (single-server)

```properties
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=8
druid.processing.numThreads=31
```

MiddleManager (single-server)

```properties
druid.worker.capacity=8
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
druid.indexer.fork.property.druid.processing.numThreads=1
```

In the clustered deployment, we choose a split factor (say 2) and deploy 2 Data servers, each with 16 CPUs and 128GB RAM. The settings are adjusted as follows:

Historical

* `druid.processing.numThreads`: set to (`number of CPU cores - 1`) on the new hardware

* `druid.processing.numMergeBuffers`: divide the single-server value by the split factor

* `druid.processing.buffer.sizeBytes`: keep this value unchanged

MiddleManager:

* `druid.worker.capacity`: divide the single-server value by the split factor

* `druid.indexer.fork.property.druid.processing.numMergeBuffers`: keep this value unchanged

* `druid.indexer.fork.property.druid.processing.buffer.sizeBytes`: keep this value unchanged

* `druid.indexer.fork.property.druid.processing.numThreads`: keep this value unchanged

Applying those rules (16 cores − 1 = 15 processing threads; 8 merge buffers ÷ 2 = 4; 8 worker slots ÷ 2 = 4) gives the following configurations:

New Historical (on 2 Data servers)

```properties
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=4
druid.processing.numThreads=15
```

New MiddleManager (on 2 Data servers)

```properties
druid.worker.capacity=4
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
druid.indexer.fork.property.druid.processing.numThreads=1
```

##### Query services

You can copy your existing Broker and Router configs into the directories under `conf/druid/cluster/query` without any modification.

#### Fresh deployment

If you are using the example cluster sizing described below:

* 1 Master server (m5.2xlarge)

* 2 Data servers (i3.4xlarge)

* 1 Query server (m5.2xlarge)

then the configurations under `conf/druid/cluster` have already been sized for this hardware, and you generally do not need to make further modifications.

If you have chosen different hardware, the [basic cluster tuning guide](../operations/basicClusterTuning.md) can help you size your configurations appropriately.

### Open ports (if using a firewall)

If you are using a firewall or some other system that only allows traffic on specific ports, allow inbound connections on the following ports:

#### Master services

* 1527 (Derby metadata store; not needed if you are using a separate metadata store such as MySQL or PostgreSQL)

* 2181 (ZooKeeper; not needed if you are using a separate ZK cluster)

* 8081 (Coordinator)

* 8090 (Overlord)

#### Data services

* 8083 (Historical)

* 8091, 8100-8199 (Druid MiddleManager; you may need ports above 8199 if `druid.worker.capacity` is set to a large value)

#### Query services

* 8082 (Broker)

* 8088 (Router, if used)

> [!WARNING]
> In production, we recommend deploying ZooKeeper and the metadata store on their own dedicated hardware, rather than on the Master server.
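Purely as an illustration (assuming the hosts use `ufw`; translate to your own firewall tooling), opening the Data-service ports might look like:

```bash
# Historical
sudo ufw allow 8083/tcp
# MiddleManager and its task (Peon) port range
sudo ufw allow 8091/tcp
sudo ufw allow 8100:8199/tcp
```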

### Start Master services

Copy the Druid distribution and your edited configurations to the Master server.

If you have been editing the configurations on your local machine, you can use rsync to copy them:

```bash
rsync -az apache-druid-0.17.0/ MASTER_SERVER:apache-druid-0.17.0/
```

#### Starting without Zookeeper

From the distribution root, run the following command to start the Master services:

```bash
bin/start-cluster-master-no-zk-server
```

#### Starting with Zookeeper

If you plan to run ZK on the Master servers, first update `conf/zoo.cfg` to reflect how you plan to run ZK (a minimal example follows the command below). Then you can start the Master services together with ZK using:

```bash
bin/start-cluster-master-with-zk-server
```
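A minimal `conf/zoo.cfg` for a standalone ZK on the Master server could look like the following; for a three-node ensemble you would add `server.N` entries (the data directory and host names here are placeholders):

```properties
# Basic ZooKeeper timing and storage settings
tickTime=2000
dataDir=var/zk
clientPort=2181
initLimit=5
syncLimit=2

# Uncomment and adapt for a three-node ensemble:
# server.1=zk1.example.com:2888:3888
# server.2=zk2.example.com:2888:3888
# server.3=zk3.example.com:2888:3888
```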

> [!WARNING]
> In production, we recommend running ZooKeeper on its own dedicated hardware.

### Start Data services

Copy the Druid distribution and your edited configurations to your Data servers.

From the distribution root, run the following command to start the Data services:

```bash
bin/start-cluster-data-server
```

You can add more Data servers as needed.

> [!WARNING]
> For clusters with complex resource allocation needs, you can break apart Historicals and MiddleManagers and scale the components individually. This also lets you take advantage of Druid's built-in MiddleManager autoscaling facility.
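Once a Data server is up, you can check that its processes respond on their default ports (`DATA_SERVER` is a placeholder for the host name):

```bash
# Historical health check
curl http://DATA_SERVER:8083/status/health
# MiddleManager health check
curl http://DATA_SERVER:8091/status/health
```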

### Start Query services

Copy the Druid distribution and your edited configurations to your Query servers.

From the distribution root, run the following command to start the Query services:

```bash
bin/start-cluster-query-server
```

You can add more Query servers as your query load grows. If you do increase the number of Query servers, be sure to adjust the connection pools on your Historicals and Tasks as described in the [basic cluster tuning guide](../operations/basicClusterTuning.md).
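For orientation, a rough sketch of the kind of adjustment the tuning guide describes (the numbers below are illustrative only): each Broker holds a pool of `druid.broker.http.numConnections` connections to the data processes, so the server-side HTTP thread pool on Historicals and Tasks should stay comfortably above the sum of that setting across all Brokers.

```properties
# On each Broker (conf/druid/cluster/query/broker/runtime.properties)
druid.broker.http.numConnections=50

# On each Historical (and similarly for task processes): keep this value
# above (number of Brokers) * druid.broker.http.numConnections
druid.server.http.numThreads=60
```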

### Loading data

Congratulations, you now have a Druid cluster! The next step is to learn about the recommended ways to load data into Druid for your use case.

Read more about [loading data](../DataIngestion/index.md).
|
BIN
GettingStarted/img/tutorial-quickstart-01.png
Normal file
BIN
GettingStarted/img/tutorial-quickstart-01.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 65 KiB |
30
SUMMARY.md
30
SUMMARY.md
@ -49,19 +49,19 @@
|
|||||||
* [Zookeeper](design/Zookeeper.md)
|
* [Zookeeper](design/Zookeeper.md)
|
||||||
|
|
||||||
* [数据摄取]()
|
* [数据摄取]()
|
||||||
* [摄取概述](ingestion/ingestion.md)
|
* [摄取概述](DataIngestion/ingestion.md)
|
||||||
* [数据格式](ingestion/dataformats.md)
|
* [数据格式](DataIngestion/dataformats.md)
|
||||||
* [schema设计](ingestion/schemadesign.md)
|
* [schema设计](DataIngestion/schemadesign.md)
|
||||||
* [数据管理](ingestion/data-management.md)
|
* [数据管理](DataIngestion/datamanage.md)
|
||||||
* [流式摄取](ingestion/kafka.md)
|
* [流式摄取](DataIngestion/kafka.md)
|
||||||
* [Apache Kafka](ingestion/kafka.md)
|
* [Apache Kafka](DataIngestion/kafka.md)
|
||||||
* [Apache Kinesis](ingestion/kinesis.md)
|
* [Apache Kinesis](DataIngestion/kinesis.md)
|
||||||
* [Tranquility](ingestion/tranquility.md)
|
* [Tranquility](DataIngestion/tranquility.md)
|
||||||
* [批量摄取](ingestion/native.md)
|
* [批量摄取](DataIngestion/native.md)
|
||||||
* [本地批](ingestion/native.md)
|
* [本地批](DataIngestion/native.md)
|
||||||
* [Hadoop批](ingestion/hadoop.md)
|
* [Hadoop批](DataIngestion/hadoopbased.md)
|
||||||
* [任务参考](ingestion/taskrefer.md)
|
* [任务参考](DataIngestion/taskrefer.md)
|
||||||
* [问题FAQ](ingestion/faq.md)
|
* [问题FAQ](DataIngestion/faq.md)
|
||||||
|
|
||||||
* [数据查询]()
|
* [数据查询]()
|
||||||
* [Druid SQL](querying/druidsql.md)
|
* [Druid SQL](querying/druidsql.md)
|
||||||
@ -88,7 +88,7 @@
|
|||||||
* [过滤](querying/filters.md)
|
* [过滤](querying/filters.md)
|
||||||
* [粒度](querying/granularity.md)
|
* [粒度](querying/granularity.md)
|
||||||
* [维度](querying/dimensionspec.md)
|
* [维度](querying/dimensionspec.md)
|
||||||
* [聚合](querying/aggregations.md)
|
* [聚合](querying/Aggregations.md)
|
||||||
* [后聚合](querying/postaggregation.md)
|
* [后聚合](querying/postaggregation.md)
|
||||||
* [表达式](querying/expression.md)
|
* [表达式](querying/expression.md)
|
||||||
* [Having(GroupBy)](querying/having.md)
|
* [Having(GroupBy)](querying/having.md)
|
||||||
@ -99,7 +99,7 @@
|
|||||||
* [空间过滤器(Spatial Filter)](querying/spatialfilter.md)
|
* [空间过滤器(Spatial Filter)](querying/spatialfilter.md)
|
||||||
|
|
||||||
* [配置列表]()
|
* [配置列表]()
|
||||||
* [配置列表](configuration/human-readable-byte.md)
|
* [配置列表](Configuration/configuration.md)
|
||||||
|
|
||||||
* [操作指南]()
|
* [操作指南]()
|
||||||
* [操作指南](operations/index.md)
|
* [操作指南](operations/index.md)
|
||||||
|
28
_sidebar.md
28
_sidebar.md
@ -31,32 +31,10 @@
|
|||||||
- [元数据存储](dependencies/metadata-storage.md)
|
- [元数据存储](dependencies/metadata-storage.md)
|
||||||
- [ZooKeeper](dependencies/zookeeper.md)
|
- [ZooKeeper](dependencies/zookeeper.md)
|
||||||
|
|
||||||
- 载入(Ingestion)
|
- 摄取(Ingestion)
|
||||||
- [载入数据](ingestion/index.md)
|
- [面试问题和经验](interview/index.md)
|
||||||
- [数据格式](ingestion/data-formats.md)
|
- [算法题](algorithm/index.md)
|
||||||
- [Schema 设计技巧](ingestion/schema-design.md)
|
|
||||||
- [数据管理](ingestion/data-management.md)
|
|
||||||
- 流(Stream)数据载入
|
|
||||||
- [Apache Kafka](development/extensions-core/kafka-ingestion.md)
|
|
||||||
- [Amazon Kinesis](development/extensions-core/kinesis-ingestion.md)
|
|
||||||
- [Tranquility](ingestion/tranquility.md)
|
|
||||||
- 批量数据载入
|
|
||||||
- [原生批量](ingestion/native-batch.md)
|
|
||||||
- [Hadoop 数据载入](ingestion/hadoop.md)
|
|
||||||
- [任务参考](ingestion/tasks.md)
|
|
||||||
- [FAQ 常见问题](ingestion/faq.md)
|
|
||||||
|
|
||||||
- 查询(Querying)
|
- 查询(Querying)
|
||||||
- [Druid SQL](querying/sql.md)
|
|
||||||
- [原生查询](querying/querying.md)
|
|
||||||
- [查询执行](querying/query-execution.md)
|
|
||||||
- 概念
|
|
||||||
- [数据源](querying/datasource.md)
|
|
||||||
- [连接(joins)](querying/joins.md)
|
|
||||||
- 原生查询类型
|
|
||||||
- [Timeseries 查询](querying/timeseriesquery.md)
|
|
||||||
- [TopN 查询](querying/topnquery.md)
|
|
||||||
- [GroupBy 查询](querying/groupbyquery.md)
|
|
||||||
|
|
||||||
- 开发(Development)
|
- 开发(Development)
|
||||||
- [在 Druid 中进行开发](development/index.md)
|
- [在 Druid 中进行开发](development/index.md)
|
||||||
|
34
book/1.md
Normal file
34
book/1.md
Normal file
@ -0,0 +1,34 @@
|
|||||||
|
|
||||||
|
## B端的奇点:产品架构师进阶之路
|
||||||
|
|
||||||
|
[](https://union-click.jd.com/jdc?e=&p=AyIGZRprFQATD1wcUhQyVlgNRQQlW1dCFFlQCxxKQgFHRE5XDVULR0UVABMPXBxSFB1LQglGa28YFmMSQTsVYAhhXWw4cV5RXQJsOHUOHjdUK1sUAxAGUxpYEgEiN1Uca15sEzdUK1sSAhcHUBxTEwYWA1IrXBULIgJWGlkWABEASRteHQAXA2UraxYyIjdVK1glQHxVAUlTFQBGAVETUhcHGgYCS1wcAhUEXRIIFAQUVAUbWiUAEwZREg%3D%3D)
|
||||||
|
|
||||||
|
(点击图片立即购买)
|
||||||
|
|
||||||
|
产品经理岗位最早是在快消行业中产生的,最初的目的是聚焦力量实现销量突破。为什么要这样做呢?
|
||||||
|
|
||||||
|
因为,在早期的快消行业中,所有的品类都是整体进行宣传推广的,并没有针对每个单品的区别化宣传,更没有为某个群体实现单独定制宣传计划的情况。在这种情况下,老产品还能够依靠口碑和影响力持续销售,但是新产品由于刚刚上市,缺少认知度,很难提升销量。长此以往,就导致老产品能够持续地销售,而新产品的销量疲软不振。即便新产品具有更好的效果、更低廉的价格、更诱人的包装,但是由于上市时间不佳、口碑传播缓慢等原因,导致成本难以回收、利润不尽如人意。
|
||||||
|
|
||||||
|
为了解决这个问题,快销行业开始尝试为新产品选择一位负责人,由这位负责人进行新产品的推广、销售、运营等工作。在这种情况下,新产品的负责人便开始利用各种营销手段对产品进行宣传和推广,包括针对性的广告、社区的营销活动等。这位负责人对产品的利润负责,同时对经营活动中的成本团队负责。这就是最早的产品经理的雏形,以现在的标准来看,这只是产品经理众多岗位中的一个岗位——产品营销经理。
|
||||||
|
|
||||||
|
这个岗位给产品销量的提升带来了非常显著的效果,这也使得后续的产品都会有一个负责人,同时这位负责人所需要承担的事务由最初的产品推广变为当前的产品规划、产品运营和产品推广三项主要的工作。
|
||||||
|
|
||||||
|
目前,产品经理在不同的行业中有着不同的工作范畴,除了前面提到的产品规划、产品运营、产品推广,还扩展了财务核算、供应链管理、团队建设等诸多管理内容。因此,产品经理也逐渐衍生出众多的细分岗位,包括目前常见的产品营销经理、产品规划经理,还有比较少见的产品架构师、产品核算师、供应链产品经理,等等。
|
||||||
|
|
||||||
|
在这些众多的产品经理细分岗位中,最核心的岗位是产品规划经理,其余岗位的人都是围绕产品规划经理开展工作的。这是因为产品的演进都是由产品规划经理负责的,既然产品的演进确定了,那么产品在未来不同时期具备的能力也就明确了。这时产品推广经理便能够根据规划的内容制订推广计划,按推广计划为营销活动准备必需的材料,按照一定的节奏进行有针对性的营销活动。同时,产品运营经理能够根据产品规划的结构制定运营方式,修改当前的运营指标以适应未来的发展需要。并且,产品架构师能够明确未来要实现的内容,对于所需技术或应用支撑进行前瞻性的设计和规划工作。
|
||||||
|
|
||||||
|
因此,产品规划经理是推动所有其他产品经理细分岗位工作进展的核心,其他产品经理细分岗位的人紧密地围绕产品规划经理的成果开展工作。
|
||||||
|
|
||||||
|
产品经理是对应的产品负责人,那么与产品相关的工作内容都是产品经理所需要管理或者过问的。但这并不意味着产品经理一定需要完成所有的工作内容,产品经理是产品的设计者、培训者、监督者等。对于产品的推广活动、运营活动和交付,产品经理并不需要在现场,但是推广、运营、交付的方式和考核标准都应当由产品经理制定。
|
||||||
|
|
||||||
|
产品从最初的规划到最后的消亡会经历设计、研发测试、发布、销售、使用、运维、升级、下市、销毁等几个阶段。这些阶段都是由多个团队通力配合来完成的,产品经理在这些阶段起到串联、推动和监督的作用。除了设计阶段,后续的研发测试、发布、销售、使用、运维、升级、下市、销毁等阶段都需要由产品经理进行衔接。
|
||||||
|
|
||||||
|
而随着分工的细化,目前产品架构师已经逐渐进入我们的视野。严格来说,产品架构师是在产品规划经理的基础上进一步划分出来的。产品规划经理的工作目前可以分为三部分,分别是需求管理、版本管理和架构设计。之前的产品需求经理是由产品规划经理细分而来的,同理现在的产品架构师也是由产品规划经理衍生而来的。
|
||||||
|
|
||||||
|
产品架构师需要具备产品规划经理的全部能力,但是偏重架构设计这部分工作。在没有产品架构师之前,产品的架构设计往往是混乱的。产品架构图只能从一个侧面进行说明。在对产品进行描述时,一张图是难以表述清楚的。这时需要从多角度对产品进行描述,这就是产品架构师需要完成的工作了——产品架构师基于产品规划的内容绘制产品架构图。产品架构图包括多个方面,其中最重要的是业务、应用、数据和技术四个方面。
|
||||||
|
|
||||||
|
产品架构师不仅能够独立完成产品规划、需求拆分、功能设计、原型绘制的工作,也能够根据以上内容完成产品架构图的绘制工作。产品架构师的主要职责并不是绘制产品架构图,而是通过产品架构图使得技术保障团队的技术架构师和研发负责人清晰、准确地理解产品,并找出技术实现方法。技术保障团队知道产品的需求是如何转变为技术实现的。同时市场类团队能够明白产品的内部运行机理,从而更好地进行营销推广活动。
|
||||||
|
|
||||||
|
产品架构师应该具备哪些能力,产品经理如何一步步成长为产品架构师便是本书要说明的重点内容
|
||||||
|
|
||||||
|

|
13
book/2.md
Normal file
13
book/2.md
Normal file
@ -0,0 +1,13 @@
|
|||||||
|
## Elasticsearch源码解析与优化实战
|
||||||
|
|
||||||
|
[](https://union-click.jd.com/jdc?e=&p=AyIGZRprFQEXAFYdWhcyVlgNRQQlW1dCFFlQCxxKQgFHRE5XDVULR0UVARcAVh1aFx1LQglGa1J9G2NXExBTZ1JlMQE7fWdoBlF%2BKFMOHjdUK1sUAxAGUxpYEgEiN1Uca0NsEgZUGloUBxICVitaJQIVB1AbXhMHFAddHFolBRIOZR5YFAARBVYcRxUHGgVQH2slMhE3ZStbJQEiRTsfCRJREARWEgxFChsCVB5bRlBAVVUfXhQCE1QCEwkRViIFVBpfHA%3D%3D)
|
||||||
|
|
||||||
|
(点击图片可立即购买)
|
||||||
|
|
||||||
|
Elasticsearch 是一个开源的全文搜索引擎,很多用户对于大规模集群应用时遇到的各种问题难以分析处理,或者知其然而不知其所以然。本书分析 Elasticsearch 中重要模块及其实现原理和机制,让用户深入理解相关重要配置项意义,应对系统故障时不再迷茫。另外,本书提供实际应用场景中一些常见问题的优化建议,这些建议都是作者经过大规模测试及应用验证过的。
|
||||||
|
|
||||||
|
本书介绍了Elasticsearch的系统原理,旨在帮助读者了解其内部原理、设计思想,以及在生产环境中如何正确地部署、优化系统。系统原理分两方面介绍,一方面详细介绍主要流程,例如启动流程、选主流程、恢复流程;另一方面介绍各重要模块的实现,以及模块之间的关系,例如gateway模块、allocation模块等。本书的最后一部分介绍如何优化写入速度、搜索速度等大家关心的实际问题,并提供了一些诊断问题的方法和工具供读者参考。
|
||||||
|
|
||||||
|
本书适合对Elasticsearch进行改进的研发人员、平台运维人员,对分布式搜索感兴趣的朋友,以及在使用Elasticsearch过程中遇到问题的人们。
|
||||||
|
|
||||||
|

|
BIN
book/img/1.jpg
Normal file
BIN
book/img/1.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 626 KiB |
BIN
book/img/2.jpg
Normal file
BIN
book/img/2.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 744 KiB |
BIN
book/img/3.jpg
Normal file
BIN
book/img/3.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 80 KiB |
BIN
book/img/4.png
Normal file
BIN
book/img/4.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 59 KiB |
17
book/product.md
Normal file
17
book/product.md
Normal file
@ -0,0 +1,17 @@
|
|||||||
|
1. [B端的奇点——产品架构师进阶之路](https://union-click.jd.com/jdc?e=&p=AyIGZRprFQATD1wcUhQyVlgNRQQlW1dCFFlQCxxKQgFHRE5XDVULR0UVABMPXBxSFB1LQglGa28YFmMSQTsVYAhhXWw4cV5RXQJsOHUOHjdUK1sUAxAGUxpYEgEiN1Uca15sEzdUK1sSAhcHUBxTEwYWA1IrXBULIgJWGlkWABEASRteHQAXA2UraxYyIjdVK1glQHxVAUlTFQBGAVETUhcHGgYCS1wcAhUEXRIIFAQUVAUbWiUAEwZREg%3D%3D)
|
||||||
|
|
||||||
|
[](https://union-click.jd.com/jdc?e=&p=AyIGZRprFQATD1wcUhQyVlgNRQQlW1dCFFlQCxxKQgFHRE5XDVULR0UVABMPXBxSFB1LQglGa28YFmMSQTsVYAhhXWw4cV5RXQJsOHUOHjdUK1sUAxAGUxpYEgEiN1Uca15sEzdUK1sSAhcHUBxTEwYWA1IrXBULIgJWGlkWABEASRteHQAXA2UraxYyIjdVK1glQHxVAUlTFQBGAVETUhcHGgYCS1wcAhUEXRIIFAQUVAUbWiUAEwZREg%3D%3D)
|
||||||
|
|
||||||
|
(点击图片可立即购买)
|
||||||
|
|
||||||
|
适读人群 :产品经理
|
||||||
|
|
||||||
|
对于准备或即将奔赴产品架构师岗位的小伙伴,本书可以提前梳理相应的技能点;
|
||||||
|
|
||||||
|
对于已经从事产品架构师岗位的小伙伴,本书可以帮助其回顾自身不足以提升能力;
|
||||||
|
|
||||||
|
对于技术负责人,本书可以进一步帮助其理解产品经理岗位的内容,提升技术经理与产品经理的配合程度;
|
||||||
|
|
||||||
|
对于产品规划经理或产品营销经理,本书可以帮助其拓宽业务视野、提升业务能力。
|
||||||
|
|
||||||
|
详情查看请点击 [B端的奇点——产品架构师进阶之路](1.md)
|
11
book/tech.md
Normal file
11
book/tech.md
Normal file
@ -0,0 +1,11 @@
|
|||||||
|
1. [Elasticsearch源码解析与优化实战](https://union-click.jd.com/jdc?e=&p=AyIGZRprFQEXAFYdWhcyVlgNRQQlW1dCFFlQCxxKQgFHRE5XDVULR0UVARcAVh1aFx1LQglGa1J9G2NXExBTZ1JlMQE7fWdoBlF%2BKFMOHjdUK1sUAxAGUxpYEgEiN1Uca0NsEgZUGloUBxICVitaJQIVB1AbXhMHFAddHFolBRIOZR5YFAARBVYcRxUHGgVQH2slMhE3ZStbJQEiRTsfCRJREARWEgxFChsCVB5bRlBAVVUfXhQCE1QCEwkRViIFVBpfHA%3D%3D)
|
||||||
|
|
||||||
|
[](https://union-click.jd.com/jdc?e=&p=AyIGZRprFQEXAFYdWhcyVlgNRQQlW1dCFFlQCxxKQgFHRE5XDVULR0UVARcAVh1aFx1LQglGa1J9G2NXExBTZ1JlMQE7fWdoBlF%2BKFMOHjdUK1sUAxAGUxpYEgEiN1Uca0NsEgZUGloUBxICVitaJQIVB1AbXhMHFAddHFolBRIOZR5YFAARBVYcRxUHGgVQH2slMhE3ZStbJQEiRTsfCRJREARWEgxFChsCVB5bRlBAVVUfXhQCE1QCEwkRViIFVBpfHA%3D%3D)
|
||||||
|
|
||||||
|
(点击图片可立即购买)
|
||||||
|
|
||||||
|
适读人群 :本书适合对Elasticsearch进行改进的研发人员、平台运维人员,对分布式搜索感兴趣的朋友,以及在使用Elasticsearch过程中遇到问题的人们。
|
||||||
|
|
||||||
|
Elasticsearch 是一个开源的全文搜索引擎,很多用户对于大规模集群应用时遇到的各种问题难以分析处理,或者知其然而不知其所以然。本书分析 Elasticsearch 中重要模块及其实现原理和机制,让用户深入理解相关重要配置项意义,应对系统故障时不再迷茫。另外,本书提供实际应用场景中一些常见问题的优化建议,这些建议都是作者经过大规模测试及应用验证过的
|
||||||
|
|
||||||
|
详情查看请点击 [Elasticsearch源码解析与优化实战](2.md)
|
@ -1,158 +0,0 @@
|
|||||||
---
|
|
||||||
id: human-readable-byte
|
|
||||||
title: "Human-readable Byte Configuration Reference"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
This page documents configuration properties related to bytes.
|
|
||||||
|
|
||||||
These properties can be configured through 2 ways:
|
|
||||||
1. a simple number in bytes
|
|
||||||
2. a number with a unit suffix
|
|
||||||
|
|
||||||
## A number in bytes
|
|
||||||
|
|
||||||
Given that cache size is 3G, there's a configuration as below
|
|
||||||
|
|
||||||
```properties
|
|
||||||
# 3G bytes = 3_000_000_000 bytes
|
|
||||||
druid.cache.sizeInBytes=3000000000
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
## A number with a unit suffix
|
|
||||||
|
|
||||||
When you have to put a large number for some configuration as above, it is easy to make a mistake such as extra or missing 0s. Druid supports a better way, a number with a unit suffix.
|
|
||||||
|
|
||||||
Given a disk of 1T, the configuration can be
|
|
||||||
|
|
||||||
```properties
|
|
||||||
druid.segmentCache.locations=[{"path":"/segment-cache-00","maxSize":"1t"},{"path":"/segment-cache-01","maxSize":"1200g"}]
|
|
||||||
```
|
|
||||||
|
|
||||||
Note: in above example, both `1t` and `1T` are acceptable since it's case-insensitive.
|
|
||||||
Also, only integers are valid as the number part. For example, you can't replace `1200g` with `1.2t`.
|
|
||||||
|
|
||||||
### Supported Units
|
|
||||||
In the world of computer, a unit like `K` is ambiguous. It means 1000 or 1024 in different contexts, for more information please see [Here](https://en.wikipedia.org/wiki/Binary_prefix).
|
|
||||||
|
|
||||||
To make it clear, the base of units are defined in Druid as below
|
|
||||||
|
|
||||||
| Unit | Description | Base |
|
|
||||||
|---|---|---|
|
|
||||||
| K | Kilo Decimal Byte | 1_000 |
|
|
||||||
| M | Mega Decimal Byte | 1_000_000 |
|
|
||||||
| G | Giga Decimal Byte | 1_000_000_000 |
|
|
||||||
| T | Tera Decimal Byte | 1_000_000_000_000 |
|
|
||||||
| P | Peta Decimal Byte | 1_000_000_000_000_000 |
|
|
||||||
| KiB | Kilo Binary Byte | 1024 |
|
|
||||||
| MiB | Mega Binary Byte | 1024 * 1024 |
|
|
||||||
| GiB | Giga Binary Byte | 1024 * 1024 * 1024 |
|
|
||||||
| TiB | Tera Binary Byte | 1024 * 1024 * 1024 * 1024 |
|
|
||||||
| PiB | Peta Binary Byte | 1024 * 1024 * 1024 * 1024 * 1024 |
|
|
||||||
|
|
||||||
Unit is case-insensitive. `k`, `kib`, `KiB`, `kiB` are all acceptable.
|
|
||||||
|
|
||||||
Here are two examples
|
|
||||||
|
|
||||||
```properties
|
|
||||||
# 1G bytes = 1_000_000_000 bytes
|
|
||||||
druid.cache.sizeInBytes=1g
|
|
||||||
```
|
|
||||||
|
|
||||||
```properties
|
|
||||||
# 256MiB bytes = 256 * 1024 * 1024 bytes
|
|
||||||
druid.cache.sizeInBytes=256MiB
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## 配置文档
|
|
||||||
|
|
||||||
本部分内容列出来了每一种Druid服务的所有配置项
|
|
||||||
|
|
||||||
### 推荐的配置文件组织方式
|
|
||||||
|
|
||||||
对于Druid的配置文件,一种推荐的结构组织方式为将配置文件放置在Druid根目录的`conf`目录下,如以下所示:
|
|
||||||
|
|
||||||
```json
|
|
||||||
$ ls -R conf
|
|
||||||
druid
|
|
||||||
|
|
||||||
conf/druid:
|
|
||||||
_common broker coordinator historical middleManager overlord
|
|
||||||
|
|
||||||
conf/druid/_common:
|
|
||||||
common.runtime.properties log4j2.xml
|
|
||||||
|
|
||||||
conf/druid/broker:
|
|
||||||
jvm.config runtime.properties
|
|
||||||
|
|
||||||
conf/druid/coordinator:
|
|
||||||
jvm.config runtime.properties
|
|
||||||
|
|
||||||
conf/druid/historical:
|
|
||||||
jvm.config runtime.properties
|
|
||||||
|
|
||||||
conf/druid/middleManager:
|
|
||||||
jvm.config runtime.properties
|
|
||||||
|
|
||||||
conf/druid/overlord:
|
|
||||||
jvm.config runtime.properties
|
|
||||||
```
|
|
||||||
|
|
||||||
每一个目录下都有一个 `runtime.properties` 文件,该文件中包含了特定的Druid进程相关的配置项,例如 `historical`
|
|
||||||
|
|
||||||
`jvm.config` 文件包含了每一个服务的JVM参数,例如堆内存属性等
|
|
||||||
|
|
||||||
所有进程共享的通用属性位于 `_common/common.runtime.properties` 中。
|
|
||||||
|
|
||||||
### 通用配置
|
|
||||||
|
|
||||||
本节下的属性是应该在集群中的所有Druid服务之间共享的公共配置。
|
|
||||||
|
|
||||||
#### JVM配置最佳实践
|
|
||||||
|
|
||||||
在我们的所有进程中有四个需要配置的JVM参数
|
|
||||||
|
|
||||||
1. `-Duser.timezone=UTC` 该参数将JVM的默认时区设置为UTC。我们总是这样设置,不使用其他默认时区进行测试,因此本地时区可能会工作,但它们也可能会发现奇怪和有趣的错误。要在非UTC时区中发出查询,请参阅 [查询粒度](../querying/granularity.md)
|
|
||||||
2. `-Dfile.encoding=UTF-8` 这类似于时区,我们假设UTF-8进行测试。本地编码可能有效,但也可能导致奇怪和有趣的错误。
|
|
||||||
3. `-Djava.io.tmpdir=<a path>` 系统中与文件系统交互的各个部分都是通过临时文件完成的,这些文件可能会变得有些大。许多生产系统都被设置为具有小的(但是很快的)`/tmp`目录,这对于Druid来说可能是个问题,因此我们建议将JVM的tmp目录指向一些有更多内容的目录。此目录不应为volatile tmpfs。这个目录还应该具有良好的读写速度,因此应该强烈避免NFS挂载。
|
|
||||||
4. `-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager` 这允许log4j2处理使用标准java日志的非log4j2组件(如jetty)的日志。
|
|
||||||
|
|
||||||
#### 扩展
|
|
||||||
#### 请求日志
|
|
||||||
#### SQL兼容的空值处理
|
|
||||||
### Master
|
|
||||||
#### Coordinator
|
|
||||||
#### Overlord
|
|
||||||
### Data
|
|
||||||
#### MiddleManager and Peons
|
|
||||||
##### SegmentWriteOutMediumFactory
|
|
||||||
#### Indexer
|
|
||||||
#### Historical
|
|
||||||
### Query
|
|
||||||
#### Broker
|
|
||||||
#### Router
|
|
File diff suppressed because it is too large
Load Diff
@ -1,87 +0,0 @@
|
|||||||
---
|
|
||||||
id: logging
|
|
||||||
title: "Logging"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
Apache Druid processes will emit logs that are useful for debugging to the console. Druid processes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.md#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.
|
|
||||||
|
|
||||||
Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (e.g. the _common/ dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir.
|
|
||||||
|
|
||||||
To enable java logging to go through log4j2, set the `-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager` server parameter.
|
|
||||||
|
|
||||||
An example log4j2.xml ships with Druid under config/_common/log4j2.xml, and a sample file is also shown below:
|
|
||||||
|
|
||||||
```
|
|
||||||
<?xml version="1.0" encoding="UTF-8" ?>
|
|
||||||
<Configuration status="WARN">
|
|
||||||
<Appenders>
|
|
||||||
<Console name="Console" target="SYSTEM_OUT">
|
|
||||||
<PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
|
|
||||||
</Console>
|
|
||||||
</Appenders>
|
|
||||||
<Loggers>
|
|
||||||
<Root level="info">
|
|
||||||
<AppenderRef ref="Console"/>
|
|
||||||
</Root>
|
|
||||||
|
|
||||||
<!-- Uncomment to enable logging of all HTTP requests
|
|
||||||
<Logger name="org.apache.druid.jetty.RequestLog" additivity="false" level="DEBUG">
|
|
||||||
<AppenderRef ref="Console"/>
|
|
||||||
</Logger>
|
|
||||||
-->
|
|
||||||
</Loggers>
|
|
||||||
</Configuration>
|
|
||||||
```
|
|
||||||
|
|
||||||
## My logs are really chatty, can I set them to asynchronously write?
|
|
||||||
|
|
||||||
Yes, using a `log4j2.xml` similar to the following causes some of the more chatty classes to write asynchronously:
|
|
||||||
|
|
||||||
```
|
|
||||||
<?xml version="1.0" encoding="UTF-8" ?>
|
|
||||||
<Configuration status="WARN">
|
|
||||||
<Appenders>
|
|
||||||
<Console name="Console" target="SYSTEM_OUT">
|
|
||||||
<PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
|
|
||||||
</Console>
|
|
||||||
</Appenders>
|
|
||||||
<Loggers>
|
|
||||||
<AsyncLogger name="org.apache.druid.curator.inventory.CuratorInventoryManager" level="debug" additivity="false">
|
|
||||||
<AppenderRef ref="Console"/>
|
|
||||||
</AsyncLogger>
|
|
||||||
<AsyncLogger name="org.apache.druid.client.BatchServerInventoryView" level="debug" additivity="false">
|
|
||||||
<AppenderRef ref="Console"/>
|
|
||||||
</AsyncLogger>
|
|
||||||
<!-- Make extra sure nobody adds logs in a bad way that can hurt performance -->
|
|
||||||
<AsyncLogger name="org.apache.druid.client.ServerInventoryView" level="debug" additivity="false">
|
|
||||||
<AppenderRef ref="Console"/>
|
|
||||||
</AsyncLogger>
|
|
||||||
<AsyncLogger name ="org.apache.druid.java.util.http.client.pool.ChannelResourceFactory" level="info" additivity="false">
|
|
||||||
<AppenderRef ref="Console"/>
|
|
||||||
</AsyncLogger>
|
|
||||||
<Root level="info">
|
|
||||||
<AppenderRef ref="Console"/>
|
|
||||||
</Root>
|
|
||||||
</Loggers>
|
|
||||||
</Configuration>
|
|
||||||
```
|
|
25
dependencies/deep-storage.md
vendored
25
dependencies/deep-storage.md
vendored
@ -1,4 +1,27 @@
|
|||||||
# 深度存储
|
---
|
||||||
|
id: deep-storage
|
||||||
|
title: "Deep storage"
|
||||||
|
---
|
||||||
|
|
||||||
|
<!--
|
||||||
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
~ or more contributor license agreements. See the NOTICE file
|
||||||
|
~ distributed with this work for additional information
|
||||||
|
~ regarding copyright ownership. The ASF licenses this file
|
||||||
|
~ to you under the Apache License, Version 2.0 (the
|
||||||
|
~ "License"); you may not use this file except in compliance
|
||||||
|
~ with the License. You may obtain a copy of the License at
|
||||||
|
~
|
||||||
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
~
|
||||||
|
~ Unless required by applicable law or agreed to in writing,
|
||||||
|
~ software distributed under the License is distributed on an
|
||||||
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
~ KIND, either express or implied. See the License for the
|
||||||
|
~ specific language governing permissions and limitations
|
||||||
|
~ under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
|
||||||
Deep storage is where segments are stored. It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data, as long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.
|
Deep storage is where segments are stored. It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data, as long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.
|
||||||
|
|
||||||
|
25
dependencies/metadata-storage.md
vendored
25
dependencies/metadata-storage.md
vendored
@ -1,4 +1,27 @@
|
|||||||
# 元数据存储
|
---
|
||||||
|
id: metadata-storage
|
||||||
|
title: "Metadata storage"
|
||||||
|
---
|
||||||
|
|
||||||
|
<!--
|
||||||
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
~ or more contributor license agreements. See the NOTICE file
|
||||||
|
~ distributed with this work for additional information
|
||||||
|
~ regarding copyright ownership. The ASF licenses this file
|
||||||
|
~ to you under the Apache License, Version 2.0 (the
|
||||||
|
~ "License"); you may not use this file except in compliance
|
||||||
|
~ with the License. You may obtain a copy of the License at
|
||||||
|
~
|
||||||
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
~
|
||||||
|
~ Unless required by applicable law or agreed to in writing,
|
||||||
|
~ software distributed under the License is distributed on an
|
||||||
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
~ KIND, either express or implied. See the License for the
|
||||||
|
~ specific language governing permissions and limitations
|
||||||
|
~ under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
|
||||||
The Metadata Storage is an external dependency of Apache Druid. Druid uses it to store
|
The Metadata Storage is an external dependency of Apache Druid. Druid uses it to store
|
||||||
various metadata about the system, but not to store the actual data. There are
|
various metadata about the system, but not to store the actual data. There are
|
||||||
|
25
dependencies/zookeeper.md
vendored
25
dependencies/zookeeper.md
vendored
@ -1,4 +1,27 @@
|
|||||||
# ZooKeeper
|
---
|
||||||
|
id: zookeeper
|
||||||
|
title: "ZooKeeper"
|
||||||
|
---
|
||||||
|
|
||||||
|
<!--
|
||||||
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
~ or more contributor license agreements. See the NOTICE file
|
||||||
|
~ distributed with this work for additional information
|
||||||
|
~ regarding copyright ownership. The ASF licenses this file
|
||||||
|
~ to you under the Apache License, Version 2.0 (the
|
||||||
|
~ "License"); you may not use this file except in compliance
|
||||||
|
~ with the License. You may obtain a copy of the License at
|
||||||
|
~
|
||||||
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
~
|
||||||
|
~ Unless required by applicable law or agreed to in writing,
|
||||||
|
~ software distributed under the License is distributed on an
|
||||||
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
~ KIND, either express or implied. See the License for the
|
||||||
|
~ specific language governing permissions and limitations
|
||||||
|
~ under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
|
||||||
Apache Druid uses [Apache ZooKeeper](http://zookeeper.apache.org/) (ZK) for management of current cluster state.
|
Apache Druid uses [Apache ZooKeeper](http://zookeeper.apache.org/) (ZK) for management of current cluster state.
|
||||||
|
|
||||||
|
@ -13,10 +13,10 @@
|
|||||||
|
|
||||||
## Broker
|
## Broker
|
||||||
### 配置
|
### 配置
|
||||||
对于Apache Druid Broker的配置,请参见 [Broker配置](../configuration/human-readable-byte.md#Broker)
|
对于Apache Druid Broker的配置,请参见 [Broker配置](../Configuration/configuration.md#Broker)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
对于Broker的API的列表,请参见 [Broker API](../operations/api-reference.md#Broker)
|
对于Broker的API的列表,请参见 [Broker API](../operations/api.md#Broker)
|
||||||
|
|
||||||
### 综述
|
### 综述
|
||||||
|
|
||||||
|
@ -13,10 +13,10 @@
|
|||||||
|
|
||||||
## Coordinator进程
|
## Coordinator进程
|
||||||
### 配置
|
### 配置
|
||||||
对于Apache Druid的Coordinator进程配置,详见 [Coordinator配置](../configuration/human-readable-byte.md#Coordinator)
|
对于Apache Druid的Coordinator进程配置,详见 [Coordinator配置](../Configuration/configuration.md#Coordinator)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
对于Coordinator的API接口,详见 [Coordinator API](../operations/api-reference.md#Coordinator)
|
对于Coordinator的API接口,详见 [Coordinator API](../operations/api.md#Coordinator)
|
||||||
|
|
||||||
### 综述
|
### 综述
|
||||||
Druid Coordinator程序主要负责段管理和分发。更具体地说,Druid Coordinator进程与Historical进程通信,根据配置加载或删除段。Druid Coordinator负责加载新段、删除过时段、管理段复制和平衡段负载。
|
Druid Coordinator程序主要负责段管理和分发。更具体地说,Druid Coordinator进程与Historical进程通信,根据配置加载或删除段。Druid Coordinator负责加载新段、删除过时段、管理段复制和平衡段负载。
|
||||||
@ -45,12 +45,12 @@ org.apache.druid.cli.Main server coordinator
|
|||||||
|
|
||||||
每次运行时,Druid Coordinator都通过合并小段或拆分大片段来压缩段。当您的段没有进行段大小(可能会导致查询性能下降)优化时,该操作非常有用。有关详细信息,请参见[段大小优化](../operations/segmentSizeOpt.md)。
|
每次运行时,Druid Coordinator都通过合并小段或拆分大片段来压缩段。当您的段没有进行段大小(可能会导致查询性能下降)优化时,该操作非常有用。有关详细信息,请参见[段大小优化](../operations/segmentSizeOpt.md)。
|
||||||
|
|
||||||
Coordinator首先根据[段搜索策略](#段搜索策略)查找要压缩的段。找到某些段后,它会发出[压缩任务](../ingestion/taskrefer.md#compact)来压缩这些段。运行压缩任务的最大数目为 `min(sum of worker capacity * slotRatio, maxSlots)`。请注意,即使 `min(sum of worker capacity * slotRatio, maxSlots)` = 0,如果为数据源启用了压缩,则始终会提交至少一个压缩任务。请参阅[压缩配置API](../operations/api-reference.md#Coordinator)和[压缩配置](../configuration/human-readable-byte.md#Coordinator)以启用压缩。
|
Coordinator首先根据[段搜索策略](#段搜索策略)查找要压缩的段。找到某些段后,它会发出[压缩任务](../DataIngestion/taskrefer.md#compact)来压缩这些段。运行压缩任务的最大数目为 `min(sum of worker capacity * slotRatio, maxSlots)`。请注意,即使 `min(sum of worker capacity * slotRatio, maxSlots)` = 0,如果为数据源启用了压缩,则始终会提交至少一个压缩任务。请参阅[压缩配置API](../operations/api.md#Coordinator)和[压缩配置](../Configuration/configuration.md#Coordinator)以启用压缩。
|
||||||
|
|
||||||
压缩任务可能由于以下原因而失败:
|
压缩任务可能由于以下原因而失败:
|
||||||
|
|
||||||
* 如果压缩任务的输入段在开始前被删除或覆盖,则该压缩任务将立即失败。
|
* 如果压缩任务的输入段在开始前被删除或覆盖,则该压缩任务将立即失败。
|
||||||
* 如果优先级较高的任务获取与压缩任务的时间间隔重叠的[时间块锁](../ingestion/taskrefer.md#锁),则压缩任务失败。
|
* 如果优先级较高的任务获取与压缩任务的时间间隔重叠的[时间块锁](../DataIngestion/taskrefer.md#锁),则压缩任务失败。
|
||||||
|
|
||||||
一旦压缩任务失败,Coordinator只需再次检查失败任务间隔中的段,并在下次运行中发出另一个压缩任务。
|
一旦压缩任务失败,Coordinator只需再次检查失败任务间隔中的段,并在下次运行中发出另一个压缩任务。
|
||||||
|
|
||||||
@ -76,7 +76,7 @@ Coordinator首先根据[段搜索策略](#段搜索策略)查找要压缩的段
|
|||||||
|
|
||||||
如果Coordinator还有足够的用于压缩任务的插槽,该策略则继续搜索剩下的段并返回 `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` 和 `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION_1`。最后,因为在 `2017-09-01T00:00:00.000Z/2017-10-01T00:00:00.000Z` 时间间隔中只有一个段,所以 `foo_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION` 段也会被选择。
|
如果Coordinator还有足够的用于压缩任务的插槽,该策略则继续搜索剩下的段并返回 `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` 和 `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION_1`。最后,因为在 `2017-09-01T00:00:00.000Z/2017-10-01T00:00:00.000Z` 时间间隔中只有一个段,所以 `foo_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION` 段也会被选择。
|
||||||
|
|
||||||
搜索的起点可以通过 [skipOffsetFromLatest](../configuration/human-readable-byte.md#Coordinator) 来更改设置。如果设置了此选项,则此策略将忽略范围内的时间段(最新段的结束时间 - `skipOffsetFromLatest`), 该配置项主要是为了避免压缩任务和实时任务之间的冲突。请注意,默认情况下,实时任务的优先级高于压缩任务。如果两个任务的时间间隔重叠,实时任务将撤消压缩任务的锁,从而终止压缩任务。
|
搜索的起点可以通过 [skipOffsetFromLatest](../Configuration/configuration.md#Coordinator) 来更改设置。如果设置了此选项,则此策略将忽略范围内的时间段(最新段的结束时间 - `skipOffsetFromLatest`), 该配置项主要是为了避免压缩任务和实时任务之间的冲突。请注意,默认情况下,实时任务的优先级高于压缩任务。如果两个任务的时间间隔重叠,实时任务将撤消压缩任务的锁,从而终止压缩任务。
|
||||||
|
|
||||||
> [!WARNING]
|
> [!WARNING]
|
||||||
> 当有很多相同间隔的小段,并且它们的总大小超过 `inputSegmentSizeBytes` 时,此策略当前无法处理这种情况。如果它找到这样的段,它只会跳过它们。
|
> 当有很多相同间隔的小段,并且它们的总大小超过 `inputSegmentSizeBytes` 时,此策略当前无法处理这种情况。如果它找到这样的段,它只会跳过它们。
|
||||||
|
@ -32,12 +32,12 @@ Apache Druid不提供的存储机制,深度存储是存储段的地方。深
|
|||||||
|
|
||||||
### S3适配
|
### S3适配
|
||||||
|
|
||||||
请看[druid-s3-extensions](../configuration/core-ext/s3.md)扩展文档
|
请看[druid-s3-extensions](../Configuration/core-ext/s3.md)扩展文档
|
||||||
|
|
||||||
### HDFS
|
### HDFS
|
||||||
|
|
||||||
请看[druid-hdfs-extensions](../configuration/core-ext/hdfs.md)扩展文档
|
请看[druid-hdfs-extensions](../Configuration/core-ext/hdfs.md)扩展文档
|
||||||
|
|
||||||
### 其他深度存储
|
### 其他深度存储
|
||||||
|
|
||||||
对于另外的深度存储等,可以参见[扩展列表](../configuration/logging.md)
|
对于另外的深度存储等,可以参见[扩展列表](../Configuration/extensions.md)
|
@ -79,7 +79,7 @@ Druid数据被存储在"datasources"中,类似于传统RDBMS中的表。每一
|
|||||||
|
|
||||||
有关段文件格式的信息,请参见[段文件](segments.md)
|
有关段文件格式的信息,请参见[段文件](segments.md)
|
||||||
|
|
||||||
有关数据在Druid的建模,请参见[schema设计](../ingestion/schemadesign.md)
|
有关数据在Druid的建模,请参见[schema设计](../DataIngestion/schemadesign.md)
|
||||||
|
|
||||||
#### 索引和切换(Indexing and handoff)
|
#### 索引和切换(Indexing and handoff)
|
||||||
|
|
||||||
|
@ -13,10 +13,10 @@
|
|||||||
|
|
||||||
## Historical
|
## Historical
|
||||||
### 配置
|
### 配置
|
||||||
对于Apache Druid Historical的配置,请参见 [Historical配置](../configuration/human-readable-byte.md#Historical)
|
对于Apache Druid Historical的配置,请参见 [Historical配置](../Configuration/configuration.md#Historical)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
Historical的API列表,请参见 [Historical API](../operations/api-reference.md#Historical)
|
Historical的API列表,请参见 [Historical API](../operations/api.md#Historical)
|
||||||
|
|
||||||
### 运行
|
### 运行
|
||||||
```json
|
```json
|
||||||
|
@ -21,10 +21,10 @@ Apache Druid索引器进程是MiddleManager + Peon任务执行系统的另一种
|
|||||||
与MiddleManager + Peon系统相比,Indexer的设计更易于配置和部署,并且能够更好地实现跨任务的资源共享。
|
与MiddleManager + Peon系统相比,Indexer的设计更易于配置和部署,并且能够更好地实现跨任务的资源共享。
|
||||||
|
|
||||||
### 配置
|
### 配置
|
||||||
对于Apache Druid Indexer进程的配置,请参见 [Indexer配置](../configuration/human-readable-byte.md#Indexer)
|
对于Apache Druid Indexer进程的配置,请参见 [Indexer配置](../Configuration/configuration.md#Indexer)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
Indexer进程与[MiddleManager](../operations/api-reference.md#MiddleManager)共用API
|
Indexer进程与[MiddleManager](../operations/api.md#MiddleManager)共用API
|
||||||
|
|
||||||
### 运行
|
### 运行
|
||||||
```json
|
```json
|
||||||
@ -38,7 +38,7 @@ org.apache.druid.cli.Main server indexer
|
|||||||
**查询资源**
|
**查询资源**
|
||||||
查询处理线程和缓冲区在所有任务中共享。索引器将为来自所有任务共享的单个端点的查询提供服务。
|
查询处理线程和缓冲区在所有任务中共享。索引器将为来自所有任务共享的单个端点的查询提供服务。
|
||||||
|
|
||||||
如果启用了[查询缓存](../configuration/human-readable-byte.md),则查询缓存也将在所有任务中共享。
|
如果启用了[查询缓存](../Configuration/configuration.md),则查询缓存也将在所有任务中共享。
|
||||||
|
|
||||||
**服务端HTTP线程**
|
**服务端HTTP线程**
|
||||||
索引器维护两个大小相等的HTTP线程池。
|
索引器维护两个大小相等的HTTP线程池。
|
||||||
|
@ -15,7 +15,7 @@
|
|||||||
|
|
||||||
元数据存储是Apache Druid的一个外部依赖。Druid使用它来存储系统的各种元数据,但不存储实际的数据。下面有许多用于各种目的的表。
|
元数据存储是Apache Druid的一个外部依赖。Druid使用它来存储系统的各种元数据,但不存储实际的数据。下面有许多用于各种目的的表。
|
||||||
|
|
||||||
Derby是Druid的默认元数据存储,但是它不适合生产环境。[MySQL](../configuration/core-ext/mysql.md)和[PostgreSQL](../configuration/core-ext/postgresql.md)是更适合生产的元数据存储。
|
Derby是Druid的默认元数据存储,但是它不适合生产环境。[MySQL](../Configuration/core-ext/mysql.md)和[PostgreSQL](../Configuration/core-ext/postgresql.md)是更适合生产的元数据存储。
|
||||||
|
|
||||||
> [!WARNING]
|
> [!WARNING]
|
||||||
> 元数据存储存储了Druid集群工作所必需的整个元数据。对于生产集群,考虑使用MySQL或PostgreSQL而不是Derby。此外,强烈建议设置数据库的高可用,因为如果丢失任何元数据,将无法恢复。
|
> 元数据存储存储了Druid集群工作所必需的整个元数据。对于生产集群,考虑使用MySQL或PostgreSQL而不是Derby。此外,强烈建议设置数据库的高可用,因为如果丢失任何元数据,将无法恢复。
|
||||||
@ -31,11 +31,11 @@ druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527//opt/var
|
|||||||
|
|
||||||
### MySQL
|
### MySQL
|
||||||
|
|
||||||
参见[mysql-metadata-storage](../configuration/core-ext/mysql.md)扩展文档
|
参见[mysql-metadata-storage](../Configuration/core-ext/mysql.md)扩展文档
|
||||||
|
|
||||||
### PostgreSQL
|
### PostgreSQL
|
||||||
|
|
||||||
参见[postgresql-metadata-storage](../configuration/core-ext/postgresql.md)扩展文档
|
参见[postgresql-metadata-storage](../Configuration/core-ext/postgresql.md)扩展文档
|
||||||
|
|
||||||
### 添加自定义的数据库连接池属性
|
### 添加自定义的数据库连接池属性
|
||||||
|
|
||||||
|
@ -13,10 +13,10 @@
|
|||||||
|
|
||||||
## MiddleManager进程
|
## MiddleManager进程
|
||||||
### 配置
|
### 配置
|
||||||
对于Apache Druid MiddleManager配置,可以参见[索引服务配置](../configuration/human-readable-byte.md#MiddleManager)
|
对于Apache Druid MiddleManager配置,可以参见[索引服务配置](../Configuration/configuration.md#MiddleManager)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
对于MiddleManager的API接口,详见 [MiddleManager API](../operations/api-reference.md#MiddleManager)
|
对于MiddleManager的API接口,详见 [MiddleManager API](../operations/api.md#MiddleManager)
|
||||||
|
|
||||||
### 综述
|
### 综述
|
||||||
MiddleManager进程是执行提交的任务的工作进程。MiddleManager将任务转发给运行在不同jvm中的Peon。我们为每个任务设置单独的jvm的原因是为了隔离资源和日志。每个Peon一次只能运行一个任务,但是,一个MiddleManager可能有多个Peon。
|
MiddleManager进程是执行提交的任务的工作进程。MiddleManager将任务转发给运行在不同jvm中的Peon。我们为每个任务设置单独的jvm的原因是为了隔离资源和日志。每个Peon一次只能运行一个任务,但是,一个MiddleManager可能有多个Peon。
|
||||||
|
@ -13,10 +13,10 @@
|
|||||||
|
|
||||||
## Overload进程
|
## Overload进程
|
||||||
### 配置
|
### 配置
|
||||||
对于Apache Druid的Overlord进程配置,详见 [Overlord配置](../configuration/human-readable-byte.md#Overlord)
|
对于Apache Druid的Overlord进程配置,详见 [Overlord配置](../Configuration/configuration.md#Overlord)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
对于Overlord的API接口,详见 [Overlord API](../operations/api-reference.md#Overlord)
|
对于Overlord的API接口,详见 [Overlord API](../operations/api.md#Overlord)
|
||||||
|
|
||||||
### 综述
|
### 综述
|
||||||
Overlord进程负责接收任务、协调任务分配、创建任务锁并将状态返回给调用方。Overlord可以配置为本地模式运行或者远程模式运行(默认为本地)。在本地模式下,Overlord还负责创建执行任务的Peon, 在本地模式下运行Overlord时,还必须提供所有MiddleManager和Peon配置。本地模式通常用于简单的工作流。在远程模式下,Overlord和MiddleManager在不同的进程中运行,您可以在不同的服务器上运行每一个进程。如果要将索引服务用作所有Druid索引的单个端点,建议使用此模式。
|
Overlord进程负责接收任务、协调任务分配、创建任务锁并将状态返回给调用方。Overlord可以配置为本地模式运行或者远程模式运行(默认为本地)。在本地模式下,Overlord还负责创建执行任务的Peon, 在本地模式下运行Overlord时,还必须提供所有MiddleManager和Peon配置。本地模式通常用于简单的工作流。在远程模式下,Overlord和MiddleManager在不同的进程中运行,您可以在不同的服务器上运行每一个进程。如果要将索引服务用作所有Druid索引的单个端点,建议使用此模式。
|
||||||
|
@ -13,10 +13,10 @@
|
|||||||
|
|
||||||
## Peons
|
## Peons
|
||||||
### 配置
|
### 配置
|
||||||
对于Apache Druid Peon配置,可以参见 [Peon查询配置](../configuration/human-readable-byte.md) 和 [额外的Peon配置](../configuration/human-readable-byte.md)
|
对于Apache Druid Peon配置,可以参见 [Peon查询配置](../Configuration/configuration.md) 和 [额外的Peon配置](../Configuration/configuration.md)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
对于Peon的API接口,详见 [Peon API](../operations/api-reference.md#Peon)
|
对于Peon的API接口,详见 [Peon API](../operations/api.md#Peon)
|
||||||
|
|
||||||
Peon在单个JVM中运行单个任务。MiddleManager负责创建运行任务的Peon。Peon应该很少(如果为了测试目的)自己运行。
|
Peon在单个JVM中运行单个任务。MiddleManager负责创建运行任务的Peon。Peon应该很少(如果为了测试目的)自己运行。
|
||||||
|
|
||||||
|
@ -108,7 +108,7 @@ Coordinator进程的工作负载往往随着集群中段的数量而增加。Ove
|
|||||||
|
|
||||||
通过设置 `druid.Coordinator.asOverlord.enabled` 属性,Coordinator进程和Overlord进程可以作为单个组合进程运行。
|
通过设置 `druid.Coordinator.asOverlord.enabled` 属性,Coordinator进程和Overlord进程可以作为单个组合进程运行。
|
||||||
|
|
||||||
有关详细信息,请参阅[Coordinator配置](../configuration/human-readable-byte.md#Coordinator)。
|
有关详细信息,请参阅[Coordinator配置](../Configuration/configuration.md#Coordinator)。
|
||||||
|
|
||||||
#### Historical和MiddleManager
|
#### Historical和MiddleManager
|
||||||
|
|
||||||
|
@ -24,11 +24,11 @@ Apache Druid Router用于将查询路由到不同的Broker。默认情况下,B
|
|||||||
|
|
||||||
### 配置
|
### 配置
|
||||||
|
|
||||||
对于Apache Druid Router的配置,请参见 [Router 配置](../configuration/human-readable-byte.md#Router)
|
对于Apache Druid Router的配置,请参见 [Router 配置](../Configuration/configuration.md#Router)
|
||||||
|
|
||||||
### HTTP
|
### HTTP
|
||||||
|
|
||||||
对于Router的API列表,请参见 [Router API](../operations/api-reference.md#Router)
|
对于Router的API列表,请参见 [Router API](../operations/api.md#Router)
|
||||||
|
|
||||||
### 运行
|
### 运行
|
||||||
|
|
||||||
|
@ -12,7 +12,7 @@
|
|||||||
</script>
|
</script>
|
||||||
|
|
||||||
## 段
|
## 段
|
||||||
ApacheDruid将索引存储在按时间分区的*段文件*中。在基本设置中,通常为每个时间间隔创建一个段文件,其中时间间隔可在 `granularitySpec` 的`segmentGranularity` 参数中配置。为了使Druid在繁重的查询负载下运行良好,段文件大小必须在建议的300MB-700MB范围内。如果段文件大于此范围,请考虑更改时间间隔的粒度,或者对数据进行分区,并在 `partitionsSpec` 中调整 `targetPartitionSize`(此参数的建议起点是500万行)。有关更多信息,请参阅下面的**分片部分**和[批处理摄取](../ingestion/native.md)文档的**分区规范**部分。
|
ApacheDruid将索引存储在按时间分区的*段文件*中。在基本设置中,通常为每个时间间隔创建一个段文件,其中时间间隔可在 `granularitySpec` 的`segmentGranularity` 参数中配置。为了使Druid在繁重的查询负载下运行良好,段文件大小必须在建议的300MB-700MB范围内。如果段文件大于此范围,请考虑更改时间间隔的粒度,或者对数据进行分区,并在 `partitionsSpec` 中调整 `targetPartitionSize`(此参数的建议起点是500万行)。有关更多信息,请参阅下面的**分片部分**和[批处理摄取](../DataIngestion/native.md)文档的**分区规范**部分。
|
||||||
|
|
||||||
### 段文件的核心数据结构
|
### 段文件的核心数据结构
|
||||||
|
|
||||||
|
@ -1,31 +1,59 @@
|
|||||||
# Kafka 数据载入
|
---
|
||||||
Kafka 索引服务(Kafka indexing service)将会在 Overlord 上启动并配置 *supervisors*,
|
id: kafka-ingestion
|
||||||
supervisors 通过管理 Kafka 索引任务的创建和销毁的生命周期以便于从 Kafka 中载入数据。
|
title: "Apache Kafka ingestion"
|
||||||
这些索引任务使用Kafka自己的分区和偏移机制读取事件,因此能够保证只读取一次(exactly-once)。
|
sidebar_label: "Apache Kafka"
|
||||||
|
---
|
||||||
|
|
||||||
supervisor 对索引任务的状态进行监控,以便于对任务进行扩展或切换,故障管理等操作。
|
<!--
|
||||||
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
~ or more contributor license agreements. See the NOTICE file
|
||||||
|
~ distributed with this work for additional information
|
||||||
|
~ regarding copyright ownership. The ASF licenses this file
|
||||||
|
~ to you under the Apache License, Version 2.0 (the
|
||||||
|
~ "License"); you may not use this file except in compliance
|
||||||
|
~ with the License. You may obtain a copy of the License at
|
||||||
|
~
|
||||||
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
~
|
||||||
|
~ Unless required by applicable law or agreed to in writing,
|
||||||
|
~ software distributed under the License is distributed on an
|
||||||
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
~ KIND, either express or implied. See the License for the
|
||||||
|
~ specific language governing permissions and limitations
|
||||||
|
~ under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
这个服务是由 `druid-kafka-indexing-service` 这个 druid 核心扩展(详情请见 [扩展列表](../../development/extensions.md)提供的内容)。
|
|
||||||
|
|
||||||
> Druid 的 Kafka 索引服务支持在 Kafka 0.11.x 中开始使用的事务主题。这些更改使 Druid 使用的 Kafka 消费者与旧的 Kafka brokers 不兼容。
|
The Kafka indexing service enables the configuration of *supervisors* on the Overlord, which facilitate ingestion from
|
||||||
> 在使用 Druid 从 Kafka中 导入数据之前,请确保你的 Kafka 版本为 0.11.x 或更高版本。
|
Kafka by managing the creation and lifetime of Kafka indexing tasks. These indexing tasks read events using Kafka's own
|
||||||
> 如果你使用的是旧版本的 Kafka brokers,请参阅《 [Kafka升级指南](https://kafka.apache.org/documentation/#upgrade) 》中的内容先进行升级。
|
partition and offset mechanism and are therefore able to provide guarantees of exactly-once ingestion.
|
||||||
|
The supervisor oversees the state of the indexing tasks to coordinate handoffs,
|
||||||
|
manage failures, and ensure that the scalability and replication requirements are maintained.
|
||||||
|
|
||||||
## 教程
|
This service is provided in the `druid-kafka-indexing-service` core Apache Druid extension (see
|
||||||
针对使用 Apache Kafka 数据导入中的参考文档,请访问 [Loading from Apache Kafka](../../tutorials/tutorial-kafka.md) 页面中的教程。
|
[Including Extensions](../../development/extensions.md#loading-extensions)).
|
||||||
|
|
||||||
## 提交一个 Supervisor 规范
|
> The Kafka indexing service supports transactional topics which were introduced in Kafka 0.11.x. These changes make the
|
||||||
|
> Kafka consumer that Druid uses incompatible with older brokers. Ensure that your Kafka brokers are version 0.11.x or
|
||||||
|
> better before using this functionality. Refer [Kafka upgrade guide](https://kafka.apache.org/documentation/#upgrade)
|
||||||
|
> if you are using older version of Kafka brokers.
|
||||||
|
|
||||||
Kafka 的所以服务需要 `druid-kafka-indexing-service` 扩展同时安装在 Overlord 和 MiddleManagers 服务器上。
|
## Tutorial
|
||||||
你可以通过提交一个 supervisor 规范到 Druid 中来完成数据源的设置。你可以采用 HTTP POST 的方法来进行提交,发送的地址为:
|
|
||||||
|
|
||||||
`http://<OVERLORD_IP>:<OVERLORD_PORT>/druid/indexer/v1/supervisor`,一个具体的提交示例如下:
|
This page contains reference documentation for Apache Kafka-based ingestion.
|
||||||
|
For a walk-through instead, check out the [Loading from Apache Kafka](../../tutorials/tutorial-kafka.md) tutorial.
|
||||||
|
|
||||||
|
## Submitting a Supervisor Spec
|
||||||
|
|
||||||
|
The Kafka indexing service requires that the `druid-kafka-indexing-service` extension be loaded on both the Overlord and the
|
||||||
|
MiddleManagers. A supervisor for a dataSource is started by submitting a supervisor spec via HTTP POST to
|
||||||
|
`http://<OVERLORD_IP>:<OVERLORD_PORT>/druid/indexer/v1/supervisor`, for example:
|
||||||
|
|
||||||
```
|
```
|
||||||
curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor
|
curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor
|
||||||
```
|
```
|
||||||
|
|
||||||
一个示例的 supervisor 规范如下:
|
A sample supervisor spec is shown below:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
@ -89,89 +117,87 @@ curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http:/
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
## Supervisor 配置
|
## Supervisor Configuration
|
||||||
|
|
||||||
|字段(Field)|描述(Description)|是否必须(Required)|
|
|Field|Description|Required|
|
||||||
|--------|-----------|---------|
|
|--------|-----------|---------|
|
||||||
|`type`| supervisor 的类型,总是 `kafka` 字符串。|Y|
|
|`type`|The supervisor type, this should always be `kafka`.|yes|
|
||||||
|`dataSchema`|Kafka 索引服务在对数据进行导入的时候使用的数据 schema。请参考 [`dataSchema`](../../ingestion/index.md#dataschema) 页面来了解更多信息 |Y|
|
|`dataSchema`|The schema that will be used by the Kafka indexing task during ingestion. See [`dataSchema`](../../ingestion/index.md#dataschema) for details.|yes|
|
||||||
|`ioConfig`| 一个 KafkaSupervisorIOConfig 对象。在这个对象中我们对 supervisor 和 索引任务(indexing task)使用 Kafka 的连接参数进行定义;对 I/O-related 进行相关设置。请参考本页面下半部分 [KafkaSupervisorIOConfig](#kafkasupervisorioconfig) 的内容。|Y|
|
|`ioConfig`|A KafkaSupervisorIOConfig object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task. See [KafkaSupervisorIOConfig](#kafkasupervisorioconfig) below.|yes|
|
||||||
|`tuningConfig`|一个 KafkaSupervisorTuningConfig 对象。在这个配置对象中,我们对 supervisor 和 索引任务(indexing task)的性能进行设置。请参考本页面下半部分 [KafkaSupervisorTuningConfig](#kafkasupervisortuningconfig) 的内容。|N|
|
|`tuningConfig`|A KafkaSupervisorTuningConfig object for configuring performance-related settings for the supervisor and indexing tasks. See [KafkaSupervisorTuningConfig](#kafkasupervisortuningconfig) below.|no|
|
||||||
|
|
||||||
### KafkaSupervisorIOConfig
|
### KafkaSupervisorIOConfig
|
||||||
|
|
||||||
|字段(Field)|类型(Type)|描述(Description)|是否必须(Required)|
|
|Field|Type|Description|Required|
|
||||||
|-----|----|-----------|--------|
|
|-----|----|-----------|--------|
|
||||||
|`topic`|String|从 Kafka 中读取数据的 主题(topic)名。你必须要指定一个明确的 topic。例如 topic patterns 还不能被支持。|Y|
|
|`topic`|String|The Kafka topic to read from. This must be a specific topic as topic patterns are not supported.|yes|
|
||||||
|`inputFormat`|Object|[`inputFormat`](../../ingestion/data-formats.md#input-format) 被指定如何来解析处理数据。请参考 [the below section](#specifying-data-format) 来了解更多如何指定 input format 的内容。|Y|
|
|`inputFormat`|Object|[`inputFormat`](../../ingestion/data-formats.md#input-format) to specify how to parse input data. See [the below section](#specifying-data-format) for details about specifying the input format.|yes|
|
||||||
|`consumerProperties`|Map<String, Object>|传递给 Kafka 消费者的一组属性 map。这个必须包含有一个 `bootstrap.servers` 属性。这个属性的值为: `<BROKER_1>:<PORT_1>,<BROKER_2>:<PORT_2>,...` 这样的服务器列表。针对使用 SSL 的链接: `keystore`, `truststore`,`key` 可以使用字符串密码,或者使用 [Password Provider](../../operations/password-provider.md) 来进行提供。|Y|
|
|`consumerProperties`|Map<String, Object>|A map of properties to be passed to the Kafka consumer. This must contain a property `bootstrap.servers` with a list of Kafka brokers in the form: `<BROKER_1>:<PORT_1>,<BROKER_2>:<PORT_2>,...`. For SSL connections, the `keystore`, `truststore` and `key` passwords can be provided as a [Password Provider](../../operations/password-provider.md) or String password.|yes|
|
||||||
|`pollTimeout`|Long| Kafka 消费者拉取数据等待的时间。单位为:毫秒(milliseconds)The length of time to wait for the Kafka consumer to poll records, in |N(默认=100))|
|
|`pollTimeout`|Long|The length of time to wait for the Kafka consumer to poll records, in milliseconds|no (default == 100)|
|
||||||
|`replicas`|Integer|副本的数量, 1 意味着一个单一任务(无副本)。副本任务将始终分配给不同的 workers,以提供针对流程故障的恢复能力。|N(默认=1))|
|
|`replicas`|Integer|The number of replica sets, where 1 means a single set of tasks (no replication). Replica tasks will always be assigned to different workers to provide resiliency against process failure.|no (default == 1)|
|
||||||
|`taskCount`|Integer|在一个 *replica set* 集中最大 *reading* 的数量。这意味着读取任务的最大的数量将是 `taskCount * replicas`, 任务总数(*reading* + *publishing*)是大于这个数值的。请参考 [Capacity Planning](#capacity-planning) 中的内容。如果 `taskCount > {numKafkaPartitions}` 的话,总的 reading 任务数量将会小于 `taskCount` 。|N(默认=1))|
|
|`taskCount`|Integer|The maximum number of *reading* tasks in a *replica set*. This means that the maximum number of reading tasks will be `taskCount * replicas` and the total number of tasks (*reading* + *publishing*) will be higher than this. See [Capacity Planning](#capacity-planning) below for more details. The number of reading tasks will be less than `taskCount` if `taskCount > {numKafkaPartitions}`.|no (default == 1)|
|`taskDuration`|ISO8601 Period|The length of time before tasks stop reading and begin publishing their segments.|no (default == PT1H)|
|`startDelay`|ISO8601 Period|The period to wait before the supervisor starts managing tasks.|no (default == PT5S)|
|`period`|ISO8601 Period|How often the supervisor will execute its management logic. Note that the supervisor will also run in response to certain events (such as tasks succeeding, failing, and reaching their taskDuration) so this value specifies the maximum time between iterations.|no (default == PT30S)|
|`useEarliestOffset`|Boolean|If a supervisor is managing a dataSource for the first time, it will obtain a set of starting offsets from Kafka. This flag determines whether it retrieves the earliest or latest offsets in Kafka. Under normal circumstances, subsequent tasks will start from where the previous segments ended so this flag will only be used on first run.|no (default == false)|
|`completionTimeout`|ISO8601 Period|The length of time to wait before declaring a publishing task as failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after `taskDuration` elapses.|no (default == PT30M)|
|`lateMessageRejectionStartDateTime`|ISO8601 DateTime|Configure tasks to reject messages with timestamps earlier than this date time; for example if this is set to `2016-01-01T11:00Z` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* will be dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a realtime and a nightly batch ingestion pipeline).|no (default == none)|
|`lateMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps earlier than this period before the task was created; for example if this is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* will be dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a realtime and a nightly batch ingestion pipeline). Please note that only one of `lateMessageRejectionPeriod` or `lateMessageRejectionStartDateTime` can be specified.|no (default == none)|
|`earlyMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps later than this period after the task reached its taskDuration; for example if this is set to `PT1H`, the taskDuration is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps later than *2016-01-01T14:00Z* will be dropped. **Note:** Tasks sometimes run past their task duration, for example, in cases of supervisor failover. Setting `earlyMessageRejectionPeriod` too low may cause messages to be dropped unexpectedly whenever a task runs past its originally configured task duration.|no (default == none)|
#### Specifying data format

Kafka indexing service supports both [`inputFormat`](../../ingestion/data-formats.md#input-format) and [`parser`](../../ingestion/data-formats.md#parser) to specify the data format.
The `inputFormat` is a new and recommended way to specify the data format for Kafka indexing service, but unfortunately, it doesn't support all data formats supported by the legacy `parser`. (They will be supported in the future.)

The supported `inputFormat`s include [`csv`](../../ingestion/data-formats.md#csv), [`delimited`](../../ingestion/data-formats.md#tsv-delimited), and [`json`](../../ingestion/data-formats.md#json).
You can also read [`avro_stream`](../../ingestion/data-formats.md#avro-stream-parser), [`protobuf`](../../ingestion/data-formats.md#protobuf-parser), and [`thrift`](../extensions-contrib/thrift.md) formats using `parser`.
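
For orientation, here is a minimal sketch of how an `inputFormat` is embedded in a Kafka supervisor `ioConfig`. The topic name, broker address, and task counts are illustrative placeholders, not values taken from this document:

```json
"ioConfig": {
  "topic": "metrics",
  "inputFormat": {
    "type": "json"
  },
  "consumerProperties": {
    "bootstrap.servers": "localhost:9092"
  },
  "taskCount": 1,
  "replicas": 1,
  "taskDuration": "PT1H"
}
```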
<a name="tuningconfig"></a>

### KafkaSupervisorTuningConfig

The tuningConfig is optional and default parameters will be used if no tuningConfig is specified.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`type`|String|The indexing task type, this should always be `kafka`.|yes|
|`maxRowsInMemory`|Integer|The number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists). Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set.|no (default == 1000000)|
|`maxBytesInMemory`|Long|The number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. Normally this is computed internally and user does not need to set it. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists).|no (default == one-sixth of max JVM memory)|
|`maxRowsPerSegment`|Integer|The number of rows to aggregate into a segment; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.|no (default == 5000000)|
|`maxTotalRows`|Long|The number of rows to aggregate across all segments; this number is post-aggregation rows. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.|no (default == unlimited)|
|`intermediatePersistPeriod`|ISO8601 Period|The period that determines the rate at which intermediate persists occur.|no (default == PT10M)|
|`maxPendingPersists`|Integer|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|no (default == 0, meaning one persist can be running concurrently with ingestion, and none can be queued up)|
|`indexSpec`|Object|Tune how data is indexed. See [IndexSpec](#indexspec) for more information.|no|
|`indexSpecForIntermediatePersists`| |Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into the final segment published. See [IndexSpec](#indexspec) for possible values.|no (default == same as indexSpec)|
|`reportParseExceptions`|Boolean|*DEPRECATED*. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1.|no (default == false)|
|`handoffConditionTimeout`|Long|Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever.|no (default == 0)|
|`resetOffsetAutomatically`|Boolean|Controls behavior when Druid needs to read Kafka messages that are no longer available (i.e. when OffsetOutOfRangeException is encountered).<br/><br/>If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../operations/api-reference.html#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.<br/><br/>If true, Druid will automatically reset to the earlier or latest offset available in Kafka, based on the value of the `useEarliestOffset` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestOffset` is false) or _DUPLICATED_ (if `useEarliestOffset` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful for non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.<br/><br/>This feature behaves similarly to the Kafka `auto.offset.reset` consumer property.|no (default == false)|
|`workerThreads`|Integer|The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation.|no (default == min(10, taskCount))|
|`chatThreads`|Integer|The number of threads that will be used for communicating with indexing tasks.|no (default == min(10, taskCount * replicas))|
|`chatRetries`|Integer|The number of times HTTP requests to indexing tasks will be retried before considering tasks unresponsive.|no (default == 8)|
|`httpTimeout`|ISO8601 Period|How long to wait for an HTTP response from an indexing task.|no (default == PT10S)|
|`shutdownTimeout`|ISO8601 Period|How long to wait for the supervisor to attempt a graceful shutdown of tasks before exiting.|no (default == PT80S)|
|`offsetFetchPeriod`|ISO8601 Period|How often the supervisor queries Kafka and the indexing tasks to fetch current offsets and calculate lag.|no (default == PT30S, min == PT5S)|
|`segmentWriteOutMediumFactory`|Object|Segment write-out medium to use when creating segments. See below for more information.|no (not specified by default, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used)|
|`intermediateHandoffPeriod`|ISO8601 Period|How often the tasks should hand off segments. Handoff will happen either if `maxRowsPerSegment` or `maxTotalRows` is hit or every `intermediateHandoffPeriod`, whichever happens earlier.|no (default == P2147483647D)|
|`logParseExceptions`|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no (default == false)|
|`maxParseExceptions`|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|no (default == unlimited)|
|`maxSavedParseExceptions`|Integer|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](../../ingestion/tasks.md#reports). Overridden if `reportParseExceptions` is set.|no (default == 0)|
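
As a quick reference, here is a hedged sketch of a `tuningConfig` that sets a few of the fields above; the values are illustrative, and any field you omit falls back to the defaults listed in the table:

```json
"tuningConfig": {
  "type": "kafka",
  "maxRowsInMemory": 100000,
  "maxRowsPerSegment": 5000000,
  "intermediatePersistPeriod": "PT10M",
  "logParseExceptions": true,
  "maxSavedParseExceptions": 10
}
```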
#### IndexSpec

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|bitmap|Object|Compression format for bitmap indexes. Should be a JSON object. See [Bitmap types](#bitmap-types) below for options.|no (defaults to Roaring)|
|dimensionCompression|String|Compression format for dimension columns. Choose from `LZ4`, `LZF`, or `uncompressed`.|no (default == `LZ4`)|
|metricCompression|String|Compression format for primitive type metric columns. Choose from `LZ4`, `LZF`, `uncompressed`, or `none`.|no (default == `LZ4`)|
|longEncoding|String|Encoding format for metric and dimension columns with type long. Choose from `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as is with 8 bytes each.|no (default == `longs`)|
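
For illustration, an `indexSpec` that spells out the defaults from the table above might look like the following sketch (the values mirror the documented defaults; adjust them as needed):

```json
"indexSpec": {
  "bitmap": { "type": "roaring" },
  "dimensionCompression": "lz4",
  "metricCompression": "lz4",
  "longEncoding": "longs"
}
```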

##### Bitmap types
@ -389,205 +415,3 @@ one can schedule re-indexing tasks be run to merge segments together into new se
Details on how to optimize the segment size can be found on [Segment size optimization](../../operations/segment-optimization.md).
There is also ongoing work to support automatic segment compaction of sharded segments as well as compaction not requiring
Hadoop (see [here](https://github.com/apache/druid/pull/5102)).

## Apache Kafka Ingestion

#### KafkaSupervisorTuningConfig

The `tuningConfig` is optional; default parameters will be used if no `tuningConfig` is specified.

| Field | Type | Description | Required |
|-|-|-|-|
| `type` | String | The indexing task type, always `kafka`. | yes |
| `maxRowsInMemory` | Integer | The number of rows to aggregate before persisting. This is the post-aggregation row count, so it is not equal to the number of raw input events but to the number of rows those events are aggregated into. It is used to manage the required JVM heap size: maximum heap memory usage for indexing scales with `maxRowsInMemory * (2 + maxPendingPersists)`. Normally you do not need to set this, but depending on the nature of your data, if rows are short in terms of bytes, you may not want to keep a million rows in memory and should set this value. | no (default == 1000000) |
| `maxBytesInMemory` | Long | The number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage, not actual usage. Normally it is computed internally and you do not need to set it. The maximum heap memory usage for indexing is `maxBytesInMemory * (2 + maxPendingPersists)`. | no (default == one-sixth of max JVM memory) |
| `maxRowsPerSegment` | Integer | The number of rows to aggregate into a segment; this is the post-aggregation count. Handoff (persisting segments and pushing them to deep storage) happens when either `maxRowsPerSegment` or `maxTotalRows` is hit, or every `intermediateHandoffPeriod`, whichever happens first. | no (default == 5000000) |
| `maxTotalRows` | Long | The number of rows to aggregate across all segments; this is the post-aggregation count. Handoff happens when either `maxRowsPerSegment` or `maxTotalRows` is hit, or every `intermediateHandoffPeriod`, whichever happens first. | no (default == unlimited) |
| `intermediatePersistPeriod` | ISO8601 Period | The period that determines the rate at which intermediate persists occur. | no (default == PT10M) |
| `maxPendingPersists` | Integer | Maximum number of persists that can be pending but not started. If a new intermediate persist would exceed this limit, ingestion blocks until the currently running persist finishes. The maximum heap memory usage for indexing is `maxRowsInMemory * (2 + maxPendingPersists)`. | no (default == 0, meaning one persist can run concurrently with ingestion and none can be queued) |
| `indexSpec` | Object | Tune how data is indexed. See [indexSpec](#indexspec) for details. | no |
| `indexSpecForIntermediatePersists` | | Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce the memory required for final merging. However, disabling compression on intermediate segments might increase page cache usage while they are in use before being merged into the final published segment. See [indexSpec](#indexspec) for possible values. | no (default == same as `indexSpec`) |
| `reportParseExceptions` | Boolean | *DEPRECATED*. If true, exceptions encountered during parsing are thrown and halt ingestion; if false, unparseable rows and fields are skipped. Setting `reportParseExceptions` to `true` overrides existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to `0` and limiting `maxSavedParseExceptions` to no more than 1. | no (default == false) |
| `handoffConditionTimeout` | Long | Number of milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever. | no (default == 0) |
| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kafka messages that are no longer available, i.e. when an `OffsetOutOfRangeException` is encountered. <br> If false, the exception bubbles up, causing tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation, potentially using the [Reset Supervisor API](../operations/api-reference.md#Supervisor). This mode is useful for production, since it makes you aware of issues with ingestion. <br> If true, Druid automatically resets to the earliest or latest offset available in Kafka, based on the value of `useEarliestOffset` (`earliest` if `true`, `latest` if `false`). Note that this can lead to data being *dropped* (if `useEarliestOffset` is `false`) or *duplicated* (if `useEarliestOffset` is `true`) without your knowledge. Messages are logged to indicate that a reset has occurred, but ingestion continues. This mode is useful for non-production situations, since it makes Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data. <br> This feature behaves similarly to the Kafka `auto.offset.reset` consumer property. | no (default == false) |
| `workerThreads` | Integer | The number of threads the supervisor uses for asynchronous operations. | no (default == min(10, taskCount)) |
| `chatThreads` | Integer | The number of threads used for communicating with indexing tasks. | no (default == min(10, taskCount * replicas)) |
| `chatRetries` | Integer | The number of times HTTP requests to indexing tasks are retried before the tasks are considered unresponsive. | no (default == 8) |
| `httpTimeout` | ISO8601 Period | How long to wait for an HTTP response from an indexing task. | no (default == PT10S) |
| `shutdownTimeout` | ISO8601 Period | How long the supervisor waits while attempting a graceful shutdown of a task. | no (default == PT80S) |
| `offsetFetchPeriod` | ISO8601 Period | How often the supervisor queries Kafka and the indexing tasks to fetch current offsets and calculate lag. | no (default == PT30S, min == PT5S) |
| `segmentWriteOutMediumFactory` | Object | Segment write-out medium to use when creating segments. See below for more information. | no (not specified by default; the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used) |
| `intermediateHandoffPeriod` | ISO8601 Period | How often the tasks should hand off segments. Handoff happens when either `maxRowsPerSegment` or `maxTotalRows` is hit, or every `intermediateHandoffPeriod`, whichever happens first. | no (default == P2147483647D) |
| `logParseExceptions` | Boolean | If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred. | no (default == false) |
| `maxParseExceptions` | Integer | The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set. | no (default == unlimited) |
| `maxSavedParseExceptions` | Integer | When a parse exception occurs, Druid can keep track of the most recent parse exceptions. `maxSavedParseExceptions` limits how many exception instances are saved. These saved exceptions are available after the task finishes in the [task completion report](taskrefer.md#任务报告). Overridden if `reportParseExceptions` is set. | no (default == 0) |
##### IndexSpec

| Field | Type | Description | Required |
|-|-|-|-|
| `bitmap` | Object | Compression format for bitmap indexes. Should be a JSON object; see below for details. | no (default == `roaring`) |
| `dimensionCompression` | String | Compression format for dimension columns. Choose from `LZ4`, `LZF`, or `uncompressed`. | no (default == `LZ4`) |
| `metricCompression` | String | Compression format for metric columns. Choose from `LZ4`, `LZF`, `uncompressed`, or `none`. | no (default == `LZ4`) |
| `longEncoding` | String | Encoding format for metric and dimension columns with type long. Choose from `auto` or `longs`. `auto` encodes the values using an offset or lookup table depending on column cardinality and stores them with variable size. `longs` stores the values as-is, 8 bytes each. | no (default == `longs`) |

**Bitmap types**

For Roaring bitmaps:

| Field | Type | Description | Required |
|-|-|-|-|
| `type` | String | Must be `roaring`. | yes |
| `compressRunOnSerialization` | Boolean | Use a run-length encoding where it is more space-efficient. | no (default == `true`) |

For Concise bitmaps:

| Field | Type | Description | Required |
|-|-|-|-|
| `type` | String | Must be `concise`. | yes |

##### SegmentWriteOutMediumFactory

| Field | Type | Description | Required |
|-|-|-|-|
| `type` | String | See [Additional Peon configuration: SegmentWriteOutMediumFactory](../configuration/human-readable-byte.md#SegmentWriteOutMediumFactory) for available options. | yes |
#### KafkaSupervisorIOConfig

| Field | Type | Description | Required |
|-|-|-|-|
| `topic` | String | The Kafka topic to read from. This must be a specific topic, as topic patterns are not supported. | yes |
| `inputFormat` | Object | [`inputFormat`](dataformats.md#inputformat) to specify how to parse the input data. See [below](#specifying-the-input-data-format) for details on specifying the input format. | yes |
| `consumerProperties` | Map<String, Object> | A map of properties to pass to the Kafka consumer. It must contain the property `bootstrap.servers` with a list of Kafka brokers in the form `<BROKER_1>:<PORT_1>,<BROKER_2>:<PORT_2>,...`. For SSL connections, the `keystore`, `truststore` and `key` passwords can be provided as a string password or as a [password provider](../operations/passwordproviders.md). | yes |
| `pollTimeout` | Long | The length of time to wait, in milliseconds, for the Kafka consumer to poll records. | no (default == 100) |
| `replicas` | Integer | The number of replica sets, where 1 means a single set of tasks (no replication). Replica tasks are always assigned to different workers to provide resiliency against process failure. | no (default == 1) |
| `taskCount` | Integer | The maximum number of *reading* tasks in a *replica set*. This means that the maximum number of reading tasks is `taskCount * replicas`, and the total number of tasks (*reading* + *publishing*) is higher than this. See [Capacity planning](#capacity-planning) below for more details. The number of reading tasks is less than `taskCount` if `taskCount > {numKafkaPartitions}`. | no (default == 1) |
| `taskDuration` | ISO8601 Period | The length of time before tasks stop reading and begin publishing their segments. | no (default == PT1H) |
| `startDelay` | ISO8601 Period | The period to wait before the supervisor starts managing tasks. | no (default == PT5S) |
| `useEarliestOffset` | Boolean | If the supervisor is managing a dataSource for the first time, it obtains a set of starting offsets from Kafka. This flag determines whether it retrieves the earliest or latest offsets in Kafka. Under normal circumstances, subsequent tasks start from where the previous segments ended, so this flag is only used on the first run. | no (default == false) |
| `completionTimeout` | ISO8601 Period | The length of time to wait before declaring a publishing task as failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after `taskDuration` elapses. | no (default == PT30M) |
| `lateMessageRejectionStartDateTime` | ISO8601 DateTime | Configure tasks to reject messages with timestamps earlier than this date time; for example, if this is set to `2016-01-01T11:00Z` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* are dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a realtime and a nightly batch ingestion pipeline). | no (default == none) |
| `lateMessageRejectionPeriod` | ISO8601 Period | Configure tasks to reject messages with timestamps earlier than this period before the task was created; for example, if this is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* are dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a realtime and a nightly batch ingestion pipeline). **Note** that only one of `lateMessageRejectionPeriod` or `lateMessageRejectionStartDateTime` can be specified. | no (default == none) |
| `earlyMessageRejectionPeriod` | ISO8601 Period | Configure tasks to reject messages with timestamps later than this period after the task reached its `taskDuration`; for example, if this is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps later than *2016-01-01T14:00Z* are dropped. **Note:** tasks sometimes run past their task duration, for example in cases of supervisor failover. Setting `earlyMessageRejectionPeriod` too low may cause messages to be dropped unexpectedly whenever a task runs past its originally configured task duration. | no (default == none) |

##### Specifying the input data format

Kafka indexing service supports both [`inputFormat`](dataformats.md#inputformat) and [`parser`](dataformats.md#parser) to specify the data format. `inputFormat` is the newer and recommended way to specify the data format for the Kafka indexing service, but unfortunately it does not yet support all the formats supported by the legacy `parser` (support will be added in the future).

The supported `inputFormat`s include [`csv`](dataformats.md#csv), [`delimited`](dataformats.md#TSV(Delimited)), and [`json`](dataformats.md#json). You can use `parser` to read [`avro_stream`](dataformats.md#AvroStreamParser), [`protobuf`](dataformats.md#ProtobufParser), and [`thrift`](../development/overview.md) formats.
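
Putting the pieces together, a skeleton supervisor spec submitted to `POST /druid/indexer/v1/supervisor` might look roughly like the sketch below. The dataSource, topic, and broker address are illustrative placeholders and the `dataSchema` is abridged; consult the full example in the upstream Kafka ingestion documentation for a complete spec.

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "metrics-kafka",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": { "dimensions": [] },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE"
      }
    },
    "ioConfig": {
      "topic": "metrics",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "taskCount": 1,
      "replicas": 1,
      "taskDuration": "PT1H"
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000
    }
  }
}
```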
### Operations

This section describes how some supervisor APIs work specifically in the Kafka indexing service. For all supervisor APIs, see [Supervisor APIs](../operations/api-reference.md#Supervisor).

#### Getting supervisor status reports

`GET /druid/indexer/v1/supervisor/<supervisorId>/status` returns a snapshot report of the current state of the tasks managed by the given supervisor. The report includes the latest offsets as reported by Kafka, the consumer lag per partition, and the aggregate lag across all partitions. The consumer lag per partition may be reported as negative values if the supervisor has not received a recent latest-offset response from Kafka. The aggregate lag value is always >= 0.

The status report also contains the supervisor's state and a list of recently thrown exceptions (reported as `recentErrors`, whose maximum size can be controlled using the `druid.supervisor.maxStoredExceptionEvents` configuration). There are two fields related to the supervisor's state - `state` and `detailedState`. The `state` field is always one of a small number of generic states that apply to any type of supervisor, while the `detailedState` field contains a more descriptive, implementation-specific state that can provide more insight into the supervisor's activities than the generic `state` field.

The list of possible `state` values is: [`PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`, `UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`]

The `detailedState` values and their corresponding `state` mappings are as follows:

| Detailed State | Corresponding State | Description |
|-|-|-|
| UNHEALTHY_SUPERVISOR | UNHEALTHY_SUPERVISOR | The supervisor has encountered errors on the past `druid.supervisor.unhealthinessThreshold` iterations |
| UNHEALTHY_TASKS | UNHEALTHY_TASKS | The last `druid.supervisor.taskUnhealthinessThreshold` tasks have all failed |
| UNABLE_TO_CONNECT_TO_STREAM | UNHEALTHY_SUPERVISOR | The supervisor is encountering connectivity issues with Kafka and has not successfully connected in the past |
| LOST_CONTACT_WITH_STREAM | UNHEALTHY_SUPERVISOR | The supervisor is encountering connectivity issues with Kafka but has successfully connected in the past |
| PENDING (first iteration only) | PENDING | The supervisor has been initialized and hasn't started connecting to the stream |
| CONNECTING_TO_STREAM (first iteration only) | RUNNING | The supervisor is trying to connect to the stream and update partition data |
| DISCOVERING_INITIAL_TASKS (first iteration only) | RUNNING | The supervisor is discovering already-running tasks |
| CREATING_TASKS (first iteration only) | RUNNING | The supervisor is creating tasks and discovering state |
| RUNNING | RUNNING | The supervisor has started tasks and is waiting for taskDuration to elapse |
| SUSPENDED | SUSPENDED | The supervisor has been suspended |
| STOPPING | STOPPING | The supervisor is stopping |

On each iteration of the supervisor's run loop, the supervisor completes the following tasks in sequence:

1. Fetch the list of partitions from Kafka and determine the starting offset for each partition (either based on the last processed offsets if continuing, or starting from the beginning or end of the stream if this is a new topic).
2. Discover any running indexing tasks that are writing to the supervisor's datasource and adopt them if they match the supervisor's configuration, else signal them to stop.
3. Send a status request to each supervised task to update the view of the state of the tasks under supervision.
4. Handle tasks that have exceeded `taskDuration` and should transition from the reading to the publishing state.
5. Handle tasks that have finished publishing and signal redundant replica tasks to stop.
6. Handle tasks that have failed and clean up the supervisor's internal state.
7. Compare the list of healthy tasks to the requested `taskCount` and `replicas` and create additional tasks if required.

The `detailedState` field shows additional values (those marked "first iteration only" in the table above) the first time the supervisor executes this run loop after startup or after resuming from a suspension. This is intended to surface initialization-type issues, where the supervisor is unable to reach a stable state (perhaps because it cannot connect to Kafka, cannot read from the Kafka topic, or cannot communicate with existing tasks). Once the supervisor is stable - that is, once it has completed a full execution without encountering any issues - `detailedState` shows a `RUNNING` state until it is stopped, suspended, or hits a failure threshold and transitions to an unhealthy state.
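
For orientation only, the body returned by the `/status` endpoint is a JSON document whose `payload` carries the fields discussed above. The sketch below is abridged and the exact field set varies by Druid version, so treat the names and values as illustrative rather than authoritative:

```json
{
  "id": "metrics-kafka",
  "payload": {
    "dataSource": "metrics-kafka",
    "topic": "metrics",
    "aggregateLag": 0,
    "state": "RUNNING",
    "detailedState": "RUNNING",
    "healthy": true,
    "recentErrors": []
  }
}
```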
#### Getting supervisor ingestion stats reports

`GET /druid/indexer/v1/supervisor/<supervisorId>/stats` returns a snapshot of the current ingestion row counters for each task managed by the supervisor, along with moving averages for the row counters.

See [Task reports: row stats](taskrefer.md#行画像) for more information.

#### Supervisor health check

`GET /druid/indexer/v1/supervisor/<supervisorId>/health` returns `200 OK` if the supervisor is healthy and `503 Service Unavailable` if it is unhealthy. Healthiness is determined by the supervisor's `state` (as returned by the `/status` endpoint) and the `druid.supervisor.*` Overlord configuration thresholds.

#### Updating existing supervisors

`POST /druid/indexer/v1/supervisor` can be used to update an existing supervisor spec. Calling this endpoint when there is already an existing supervisor for the same dataSource will cause:

* The running supervisor to signal its managed tasks to stop reading and begin publishing
* The running supervisor to exit
* A new supervisor to be created using the configuration provided in the request body. This supervisor will retain the existing publishing tasks and will create new tasks starting at the offsets the publishing tasks ended on

Seamless schema migrations can thus be achieved by simply submitting the new schema using this endpoint.

#### Suspending and resuming supervisors

You can suspend and resume a supervisor using `POST /druid/indexer/v1/supervisor/<supervisorId>/suspend` and `POST /druid/indexer/v1/supervisor/<supervisorId>/resume`, respectively.

Note that the supervisor itself will still be operating and emitting logs and metrics; it will just ensure that no indexing tasks are running until the supervisor is resumed.
#### Resetting supervisors

The `POST /druid/indexer/v1/supervisor/<supervisorId>/reset` operation clears stored offsets, causing the supervisor to start reading from either the earliest or the latest offsets in Kafka (depending on the value of `useEarliestOffset`). After clearing stored offsets, the supervisor kills and recreates the tasks, so that tasks begin reading from valid offsets.

**Use this operation carefully!** Resetting the supervisor may cause Kafka messages to be skipped or read twice, resulting in missing or duplicate data.

The reason for using this operation is to recover from a state in which the supervisor ceases operating due to missing offsets. The indexing service keeps track of the latest persisted Kafka offsets in order to provide exactly-once ingestion guarantees across tasks. Subsequent tasks must start reading from where the previous task completed in order for the generated segments to be accepted. If the messages at the expected starting offsets are no longer available in Kafka (typically because the message retention period has elapsed or the topic was removed and re-created), the supervisor will refuse to start and in-flight tasks will fail. This operation enables you to recover from this condition.

**Note that the supervisor must be running for this endpoint to be available.**

#### Terminating supervisors

The `POST /druid/indexer/v1/supervisor/<supervisorId>/terminate` operation terminates a supervisor and causes all associated indexing tasks managed by this supervisor to immediately stop and begin publishing their segments. The supervisor will still exist in the metadata store and its history can be retrieved with the supervisor history API, but it will not be listed in the "Get supervisors" API response, nor can its configuration or status report be retrieved. The only way this supervisor can start again is by submitting a functioning supervisor spec to the "create" API.
#### Capacity planning

Kafka indexing tasks run on MiddleManagers and are thus limited by the resources available in the MiddleManager cluster. In particular, you should make sure that you have sufficient worker capacity (configured using the `druid.worker.capacity` property) to handle the configuration in the supervisor spec. Note that worker capacity is shared across all types of indexing tasks, so you should plan your worker capacity to handle your total indexing load (e.g. batch processing, realtime tasks, merging tasks, etc.). If your workers run out of capacity, Kafka indexing tasks will queue and wait for the next available worker. This may cause queries to return partial results but will not result in data loss (assuming the tasks run before Kafka purges those offsets).

A running task is normally in one of two states: *reading* or *publishing*. A task remains in the reading state for `taskDuration`, at which point it transitions to the publishing state. A task remains in the publishing state for as long as it takes to generate segments, push them to deep storage, and have them loaded and served by a Historical process (or until `completionTimeout` elapses).

The number of reading tasks is controlled by `replicas` and `taskCount`. In general, there will be `replicas * taskCount` reading tasks, the exception being if taskCount > {numKafkaPartitions}, in which case {numKafkaPartitions} tasks will be used instead. When `taskDuration` elapses, these tasks transition to the publishing state and `replicas * taskCount` new reading tasks are created. Therefore, to allow reading tasks and publishing tasks to run concurrently, there should be a minimum capacity of:

```json
workerCapacity = 2 * replicas * taskCount
```

This value is for the ideal situation in which there is at most one set of tasks publishing while another set is reading. In some circumstances it is possible to have multiple sets of tasks publishing simultaneously. This happens if the time-to-publish (generate segments, push to deep storage, load on Historicals) > `taskDuration`. This is a valid scenario (correctness-wise) but requires additional worker capacity to support. In general, it is a good idea to have `taskDuration` be large enough that the previous set of tasks finishes publishing before the current set begins.
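
As a concrete illustration (the figures below are made up for this example, not taken from the text above), a supervisor configured with `replicas = 2` and `taskCount = 3` needs at least:

```json
workerCapacity = 2 * replicas * taskCount
               = 2 * 2 * 3
               = 12
```

worker slots for itself, on top of whatever capacity other batch or streaming ingestion tasks require.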
#### Supervisor persistence

When a supervisor spec is submitted via the `POST /druid/indexer/v1/supervisor` endpoint, it is persisted in the configured metadata database. There can only be a single supervisor per dataSource, and submitting a second spec for the same dataSource overwrites the previous one.

When an Overlord gains leadership, either by being started or as a result of another Overlord failing, it spawns a supervisor for each supervisor spec in the metadata database. The supervisor then discovers running Kafka indexing tasks and attempts to adopt them if they are compatible with the supervisor's configuration. If they are not compatible, because they have a different ingestion spec or partition allocation, the tasks are killed and the supervisor creates a new set of tasks. In this way, supervisors persist across Overlord restarts and failovers.

A supervisor is stopped via the `POST /druid/indexer/v1/supervisor/<supervisorId>/terminate` endpoint. This places a tombstone marker in the database (to prevent the supervisor from being reloaded on a restart) and then gracefully shuts down the currently running supervisor. When a supervisor is shut down in this way, it instructs its managed tasks to stop reading and begin publishing their segments immediately. The call to the shutdown endpoint returns after all tasks have been signaled to stop, but before the tasks finish publishing their segments.

#### Schema/configuration changes

Schema and configuration changes are handled by submitting the new supervisor spec via the same `POST /druid/indexer/v1/supervisor` endpoint used to initially create the supervisor. The Overlord initiates a graceful shutdown of the existing supervisor, which causes the tasks managed by that supervisor to stop reading and begin publishing their segments. A new supervisor is then started; it creates a new set of tasks that start reading from the offsets where the previous, now-publishing tasks left off, but using the updated schema. In this way, configuration changes can be applied without requiring any pause in ingestion.

#### Deployment notes

Each Kafka indexing task places the events consumed from the Kafka partitions assigned to it in a single segment for each segment granularity interval until the `maxRowsPerSegment`, `maxTotalRows`, or `intermediateHandoffPeriod` limit is reached; at that point a new partition for this segment granularity is created for further events. Kafka indexing tasks also perform incremental hand-offs, which means that the segments created by a task are not all held until the task duration is over. As soon as the `maxRowsPerSegment`, `maxTotalRows`, or `intermediateHandoffPeriod` limit is hit, all the segments held by the task at that point in time are handed off and new segments are created for further events. This means that tasks can run for longer durations without accumulating old segments locally on MiddleManager processes, and doing so is encouraged.

The Kafka indexing service may still produce some small segments. For example, say the task duration is 4 hours, segment granularity is set to 1 hour, and the supervisor was started at 9:10. After the 4-hour task duration ends at 13:10, a new set of tasks is started, and events for the interval 13:00 - 14:00 may be split across the previous and the new set of tasks. If you find this to be a problem, you can schedule re-indexing tasks to merge segments together into new segments of an ideal size (around 500-700 MB per segment). Details on how to optimize the segment size can be found in [Segment size optimization](../operations/segmentSizeOpt.md). There is also ongoing work to support automatic segment compaction of sharded segments, as well as compaction that does not require Hadoop (see [here](https://github.com/apache/druid/pull/5102)).
|
@ -1,231 +0,0 @@
|
|||||||
---
|
|
||||||
id: compaction
|
|
||||||
title: "Compaction"
|
|
||||||
description: "Defines compaction and automatic compaction (auto-compaction or autocompaction) for segment optimization. Use cases and strategies for compaction. Describes compaction task configuration."
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
Query performance in Apache Druid depends on optimally sized segments. Compaction is one strategy you can use to optimize segment size for your Druid database. Compaction tasks read an existing set of segments for a given time interval and combine the data into a new "compacted" set of segments. In some cases the compacted segments are larger, but there are fewer of them. In other cases the compacted segments may be smaller. Compaction tends to increase performance because optimized segments require less per-segment processing and less memory overhead for ingestion and for querying paths.
|
|
||||||
|
|
||||||
## Compaction strategies
|
|
||||||
There are several cases to consider compaction for segment optimization:
|
|
||||||
- With streaming ingestion, data can arrive out of chronological order creating lots of small segments.
|
|
||||||
- If you append data using `appendToExisting` for [native batch](native-batch.md) ingestion, creating suboptimal segments.
|
|
||||||
- When you use `index_parallel` for parallel batch indexing and the parallel ingestion tasks create many small segments.
|
|
||||||
- When a misconfigured ingestion task creates oversized segments.
|
|
||||||
|
|
||||||
By default, compaction does not modify the underlying data of the segments. However, there are cases when you may want to modify data during compaction to improve query performance:
|
|
||||||
- If, after ingestion, you realize that data for the time interval is sparse, you can use compaction to increase the segment granularity.
|
|
||||||
- Over time you don't need fine-grained granularity for older data, so you may want to use compaction to change older segments to a coarser query granularity. This reduces the storage space required for older data. For example from `minute` to `hour`, or `hour` to `day`. You cannot go from coarser granularity to finer granularity.
|
|
||||||
- You can change the dimension order to improve sorting and reduce segment size.
|
|
||||||
- You can remove unused columns in compaction or implement an aggregation metric for older data.
|
|
||||||
- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](index.md#perfect-rollup-vs-best-effort-rollup).
|
|
||||||
|
|
||||||
Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
|
|
||||||
|
|
||||||
## Types of compaction
|
|
||||||
You can configure the Druid Coordinator to perform automatic compaction, also called auto-compaction, for a datasource. Using a segment search policy, the coordinator periodically identifies segments for compaction, starting from the newest and moving to the oldest. When it discovers segments that have not been compacted or segments that were compacted with a different or changed spec, it submits a compaction task for those segments and only those segments.
|
|
||||||
|
|
||||||
Automatic compaction works in most use cases and should be your first option. To learn more about automatic compaction, see [Compacting Segments](../design/coordinator.md#compacting-segments).
|
|
||||||
|
|
||||||
In cases where you require more control over compaction, you can manually submit compaction tasks. For example:
|
|
||||||
- Automatic compaction is running into the limit of task slots available to it, so tasks are waiting for previous automatic compaction tasks to complete. Manual compaction can use all available task slots, therefore you can complete compaction more quickly by submitting more concurrent tasks for more intervals.
|
|
||||||
- You want to force compaction for a specific time range or you want to compact data out of chronological order.
|
|
||||||
|
|
||||||
See [Setting up a manual compaction task](#setting-up-manual-compaction) for more about manual compaction tasks.
|
|
||||||
|
|
||||||
## Data handling with compaction
|
|
||||||
During compaction, Druid overwrites the original set of segments with the compacted set. Druid also locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
|
|
||||||
|
|
||||||
For compaction tasks, `dropExisting` in `ioConfig` can be set to "true" for Druid to drop (mark unused) all existing segments fully contained by the interval of the compaction task. For an example of why this is important, see the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations). WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the compaction task interval.
|
|
||||||
|
|
||||||
If an ingestion task needs to write data to a segment for a time interval locked for compaction, by default the ingestion task supersedes the compaction task and the compaction task fails without finishing. For manual compaction tasks you can adjust the input spec interval to avoid conflicts between ingestion and compaction. For automatic compaction, you can set the `skipOffsetFromLatest` key to adjust the auto-compaction starting point from the current time to reduce the chance of conflicts between ingestion and compaction. See [Compaction dynamic configuration](../configuration/index.md#compaction-dynamic-configuration) for more information. Another option is to set the compaction task to a higher priority than the ingestion task.
|
|
||||||
|
|
||||||
### Segment granularity handling
|
|
||||||
|
|
||||||
Unless you modify the segment granularity in the [granularity spec](#compaction-granularity-spec), Druid attempts to retain the granularity for the compacted segments. When segments have different segment granularities with no overlap in interval, Druid creates a separate compaction task for each to retain the segment granularity in the compacted segment.
|
|
||||||
|
|
||||||
If segments have different segment granularities before compaction but there is some overlap in interval, Druid attempts to find the start and end of the overlapping interval and uses the closest segment granularity level for the compacted segment. For example, consider two overlapping segments: segment "A" for the interval 01/01/2021-01/02/2021 with day granularity and segment "B" for the interval 01/01/2021-02/01/2021. Druid attempts to combine and compact the overlapping segments. In this example, the earliest start time of the two segments is 01/01/2021 and the latest end time is 02/01/2021. Druid compacts the segments together even though they have different segment granularities, and it uses month segment granularity for the newly compacted segment even though segment A's original segment granularity was DAY.
|
|
||||||
|
|
||||||
### Query granularity handling
|
|
||||||
|
|
||||||
Unless you modify the query granularity in the [granularity spec](#compaction-granularity-spec), Druid retains the query granularity for the compacted segments. If segments have different query granularities before compaction, Druid chooses the finest level of granularity for the resulting compacted segment. For example if a compaction task combines two segments, one with day query granularity and one with minute query granularity, the resulting segment uses minute query granularity.
|
|
||||||
|
|
||||||
> In Apache Druid 0.21.0 and prior, Druid sets the granularity for compacted segments to the default granularity of `NONE` regardless of the query granularity of the original segments.
|
|
||||||
|
|
||||||
If you configure query granularity in compaction to go from a finer granularity like month to a coarser query granularity like year, then Druid overshadows the original segment with coarser granularity. Because the new segments have a coarser granularity, running a kill task to remove the overshadowed segments for those intervals will cause you to permanently lose the finer granularity data.
|
|
||||||
|
|
||||||
### Dimension handling
|
|
||||||
Apache Druid supports schema changes. Therefore, dimensions can be different across segments even if they are part of the same data source. See [Different schemas among segments](../design/segments.md#different-schemas-among-segments). If the input segments have different dimensions, the resulting compacted segment includes all dimensions of the input segments.
|
|
||||||
|
|
||||||
Even when the input segments have the same set of dimensions, the dimension order or the data type of dimensions can be different. In that case, the dimensions of more recent segments take precedence over those of older segments for data types and ordering, because more recent segments are more likely to have the preferred order and data types.
|
|
||||||
|
|
||||||
If you want to control dimension ordering or ensure specific values for dimension types, you can configure a custom `dimensionsSpec` in the compaction task spec.
|
|
||||||
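
A minimal sketch of such a custom `dimensionsSpec` (the column names below are placeholders, not taken from this document):

```json
"dimensionsSpec": {
  "dimensions": [
    "countryName",
    "cityName",
    { "type": "long", "name": "commentLength" }
  ]
}
```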
|
|
||||||
### Rollup
|
|
||||||
Druid only rolls up the output segment when `rollup` is set for all input segments.
|
|
||||||
See [Roll-up](../ingestion/index.md#rollup) for more details.
|
|
||||||
You can check that your segments are rolled up or not by using [Segment Metadata Queries](../querying/segmentmetadataquery.md#analysistypes).
|
|
||||||
|
|
||||||
## Setting up manual compaction
|
|
||||||
|
|
||||||
To perform a manual compaction, you submit a compaction task. Compaction tasks merge all segments for the defined interval according to the following syntax:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type": "compact",
|
|
||||||
"id": <task_id>,
|
|
||||||
"dataSource": <task_datasource>,
|
|
||||||
"ioConfig": <IO config>,
|
|
||||||
"dimensionsSpec" <custom dimensionsSpec>,
|
|
||||||
"metricsSpec" <custom metricsSpec>,
|
|
||||||
"tuningConfig" <parallel indexing task tuningConfig>,
|
|
||||||
"granularitySpec" <compaction task granularitySpec>,
|
|
||||||
"context": <task context>
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
|Field|Description|Required|
|
|
||||||
|-----|-----------|--------|
|
|
||||||
|`type`|Task type. Should be `compact`|Yes|
|
|
||||||
|`id`|Task id|No|
|
|
||||||
|`dataSource`|Data source name to compact|Yes|
|
|
||||||
|`ioConfig`|I/O configuration for compaction task. See [Compaction I/O configuration](#compaction-io-configuration) for details.|Yes|
|
|
||||||
|`dimensionsSpec`|Custom dimensions spec. The compaction task uses the specified dimensions spec if it exists instead of generating one.|No|
|
|
||||||
|`metricsSpec`|Custom metrics spec. The compaction task uses the specified metrics spec rather than generating one.|No|
|
|
||||||
|`segmentGranularity`|When set, the compaction task changes the segment granularity for the given interval. Deprecated. Use `granularitySpec`. |No.|
|
|
||||||
|`tuningConfig`|[Parallel indexing task tuningConfig](native-batch.md#tuningconfig). Note that your tuning config cannot contain a non-zero value for `awaitSegmentAvailabilityTimeoutMillis` because it is not supported by compaction tasks at this time.|No|
|
|
||||||
|`context`|[Task context](./tasks.md#context)|No|
|
|
||||||
|`granularitySpec`|Custom `granularitySpec` to describe the `segmentGranularity` and `queryGranularity` for the compacted segments. See [Compaction granularitySpec](#compaction-granularity-spec).|No|
|
|
||||||
|
|
||||||
> Note: Use `granularitySpec` over `segmentGranularity` and only set one of these values. If you specify different values for these in the same compaction spec, the task fails.
|
|
||||||
|
|
||||||
To control the number of result segments per time chunk, you can set [maxRowsPerSegment](../configuration/index.md#compaction-dynamic-configuration) or [numShards](../ingestion/native-batch.md#tuningconfig).
|
|
||||||
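
For example, one way to control segment size in a manual compaction task is through the `partitionsSpec` of its parallel-indexing `tuningConfig`; the sketch below uses an illustrative row limit:

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "dynamic",
    "maxRowsPerSegment": 5000000
  }
}
```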
|
|
||||||
> You can run multiple compaction tasks in parallel. For example, if you want to compact the data for a year, you are not limited to running a single task for the entire year. You can run 12 compaction tasks with month-long intervals.
|
|
||||||
|
|
||||||
A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters. For example, its `inputSource` is always the [DruidInputSource](native-batch.md#druid-input-source), and `dimensionsSpec` and `metricsSpec` include all dimensions and metrics of the input segments by default.
|
|
||||||
|
|
||||||
Compaction tasks would exit without doing anything and issue a failure status code:
|
|
||||||
- if the interval you specify has no data segments loaded<br>
|
|
||||||
OR
|
|
||||||
- if the interval you specify is empty.
|
|
||||||
|
|
||||||
Note that the metadata between input segments and the resulting compacted segments may differ if the metadata among the input segments differs as well. If all input segments have the same metadata, however, the resulting output segment will have the same metadata as all input segments.
|
|
||||||
|
|
||||||
|
|
||||||
### Example compaction task
|
|
||||||
The following JSON illustrates a compaction task to compact _all segments_ within the interval `2017-01-01/2018-01-01` and create new segments:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type" : "compact",
|
|
||||||
"dataSource" : "wikipedia",
|
|
||||||
"ioConfig" : {
|
|
||||||
"type": "compact",
|
|
||||||
"inputSpec": {
|
|
||||||
"type": "interval",
|
|
||||||
"interval": "2020-01-01/2021-01-01",
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"granularitySpec": {
|
|
||||||
"segmentGranularity":"day",
|
|
||||||
"queryGranularity":"hour"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
This task doesn't specify a `granularitySpec` so Druid retains the original segment granularity unchanged when compaction is complete.
|
|
||||||
|
|
||||||
### Compaction I/O configuration
|
|
||||||
|
|
||||||
The compaction `ioConfig` requires specifying `inputSpec` as follows:
|
|
||||||
|
|
||||||
|Field|Description|Default|Required?|
|
|
||||||
|-----|-----------|-------|--------|
|
|
||||||
|`type`|Task type. Should be `compact`|none|Yes|
|
|
||||||
|`inputSpec`|Input specification|none|Yes|
|
|
||||||
|`dropExisting`|If `true`, then the compaction task drops (mark unused) all existing segments fully contained by either the `interval` in the `interval` type `inputSpec` or the umbrella interval of the `segments` in the `segment` type `inputSpec` when the task publishes new compacted segments. If compaction fails, Druid does not drop or mark unused any segments. WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the compaction task interval.|false|no|
|
|
||||||
|
|
||||||
|
|
||||||
There are two supported `inputSpec`s for now.
|
|
||||||
|
|
||||||
The interval `inputSpec` is:
|
|
||||||
|
|
||||||
|Field|Description|Required|
|
|
||||||
|-----|-----------|--------|
|
|
||||||
|`type`|Task type. Should be `interval`|Yes|
|
|
||||||
|`interval`|Interval to compact|Yes|
|
|
||||||
|
|
||||||
The segments `inputSpec` is:
|
|
||||||
|
|
||||||
|Field|Description|Required|
|
|
||||||
|-----|-----------|--------|
|
|
||||||
|`type`|Task type. Should be `segments`|Yes|
|
|
||||||
|`segments`|A list of segment IDs|Yes|
|
|
||||||
|
|
||||||
### Compaction granularity spec
|
|
||||||
|
|
||||||
You can optionally use the `granularitySpec` object to configure the segment granularity and the query granularity of the compacted segments. Their syntax is as follows:
|
|
||||||
```json
|
|
||||||
"type": "compact",
|
|
||||||
"id": <task_id>,
|
|
||||||
"dataSource": <task_datasource>,
|
|
||||||
...
|
|
||||||
,
|
|
||||||
"granularitySpec": {
|
|
||||||
"segmentGranularity": <time_period>,
|
|
||||||
"queryGranularity": <time_period>
|
|
||||||
}
|
|
||||||
...
|
|
||||||
```
|
|
||||||
|
|
||||||
`granularitySpec` takes the following keys:
|
|
||||||
|
|
||||||
|Field|Description|Required|
|
|
||||||
|-----|-----------|--------|
|
|
||||||
|`segmentGranularity`|Time chunking period for the segment granularity. Defaults to 'null', which preserves the original segment granularity. Accepts all [Query granularity](../querying/granularities.md) values.|No|
|
|
||||||
|`queryGranularity`|Time chunking period for the query granularity. Defaults to 'null', which preserves the original query granularity. Accepts all [Query granularity](../querying/granularities.md) values. Not supported for automatic compaction.|No|
|
|
||||||
|
|
||||||
For example, to set the segment granularity to "day" and the query granularity to "hour":
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type" : "compact",
|
|
||||||
"dataSource" : "wikipedia",
|
|
||||||
"ioConfig" : {
|
|
||||||
"type": "compact",
|
|
||||||
"inputSpec": {
|
|
||||||
"type": "interval",
|
|
||||||
"interval": "2017-01-01/2018-01-01",
|
|
||||||
},
|
|
||||||
"granularitySpec": {
|
|
||||||
"segmentGranularity":"day",
|
|
||||||
"queryGranularity":"hour"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Learn more
|
|
||||||
See the following topics for more information:
|
|
||||||
- [Segment optimization](../operations/segment-optimization.md) for guidance to determine if compaction will help in your case.
|
|
||||||
- [Compacting Segments](../design/coordinator.md#compacting-segments) for more on automatic compaction.
|
|
||||||
- See [Compaction Configuration API](../operations/api-reference.md#compaction-configuration)
|
|
||||||
and [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for automatic compaction configuration information.
|
|
File diff suppressed because it is too large
Load Diff
@ -1,123 +0,0 @@
|
|||||||
---
|
|
||||||
id: data-management
|
|
||||||
title: "Data management"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
Within the context of this topic data management refers to Apache Druid's data maintenance capabilities for existing datasources. There are several options to help you keep your data relevant and to help your Druid cluster remain performant. For example updating, reingesting, adding lookups, reindexing, or deleting data.
|
|
||||||
|
|
||||||
In addition to the tasks covered on this page, you can also use segment compaction to improve the layout of your existing data. Refer to [Segment optimization](../operations/segment-optimization.md) to see if compaction will help in your environment. For an overview and steps to configure manual compaction tasks, see [Compaction](./compaction.md).
|
|
||||||
|
|
||||||
## Adding new data to existing datasources
|
|
||||||
|
|
||||||
Druid can insert new data to an existing datasource by appending new segments to existing segment sets. It can also add new data by merging an existing set of segments with new data and overwriting the original set.
|
|
||||||
|
|
||||||
Druid does not support single-record updates by primary key.
|
|
||||||
|
|
||||||
<a name="update"></a>
|
|
||||||
|
|
||||||
## Updating existing data
|
|
||||||
|
|
||||||
Once you ingest some data in a dataSource for an interval and create Apache Druid segments, you might want to make changes to
|
|
||||||
the ingested data. There are several ways this can be done.
|
|
||||||
|
|
||||||
### Using lookups
|
|
||||||
|
|
||||||
If you have a dimension where values need to be updated frequently, try first using [lookups](../querying/lookups.md). A
|
|
||||||
classic use case of lookups is when you have an ID dimension stored in a Druid segment, and want to map the ID dimension to a
|
|
||||||
human-readable String value that may need to be updated periodically.
|
|
||||||
|
|
||||||
### Reingesting data
|
|
||||||
|
|
||||||
If lookup-based techniques are not sufficient, you will need to reingest data into Druid for the time chunks that you
|
|
||||||
want to update. This can be done using one of the [batch ingestion methods](index.md#batch) in overwrite mode (the
|
|
||||||
default mode). It can also be done using [streaming ingestion](index.md#streaming), provided you drop data for the
|
|
||||||
relevant time chunks first.
|
|
||||||
|
|
||||||
If you do the reingestion in batch mode, Druid's atomic update mechanism means that queries will flip seamlessly from
|
|
||||||
the old data to the new data.
|
|
||||||
|
|
||||||
We recommend keeping a copy of your raw data around in case you ever need to reingest it.
|
|
||||||
|
|
||||||
### With Hadoop-based ingestion
|
|
||||||
|
|
||||||
This section assumes you understand how to do batch ingestion using Hadoop. See
|
|
||||||
[Hadoop batch ingestion](./hadoop.md) for more information. Hadoop batch-ingestion can be used for reindexing and delta ingestion.
|
|
||||||
|
|
||||||
Druid uses an `inputSpec` in the `ioConfig` to know where the data to be ingested is located and how to read it.
|
|
||||||
For simple Hadoop batch ingestion, `static` or `granularity` spec types allow you to read data stored in deep storage.
|
|
||||||
|
|
||||||
There are other types of `inputSpec` to enable reindexing and delta ingestion.
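As a rough sketch only (the datasource name and interval below are placeholders), a Hadoop reindexing job points the `dataSource` inputSpec at the existing segments to read back; see [Hadoop batch ingestion](./hadoop.md) for the authoritative list of options:

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "wikipedia",
      "intervals": ["2020-01-01/2020-02-01"]
    }
  }
}
```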
|
|
||||||
|
|
||||||
### Reindexing with Native Batch Ingestion
|
|
||||||
|
|
||||||
This section assumes you understand how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md). Native batch indexing uses an `inputSource` to know where and how to read the input data. You can use the [`DruidInputSource`](native-batch.md#druid-input-source) to read data from segments inside Druid. You can use Parallel task (`index_parallel`) for all native batch reindexing tasks. Increase the `maxNumConcurrentSubTasks` to accommodate the amount of data you are reindexing. See [Capacity planning](native-batch.md#capacity-planning).
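As a minimal sketch (the datasource name, interval, and subtask count are placeholders, and the `dataSchema` is elided), the `ioConfig` of an `index_parallel` reindexing task reads existing segments back through the `druid` input source:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "wikipedia",
        "interval": "2020-01-01/2020-02-01"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  }
}
```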
|
|
||||||
|
|
||||||
<a name="delete"></a>
|
|
||||||
|
|
||||||
## Deleting data
|
|
||||||
|
|
||||||
Druid supports permanent deletion of segments that are in an "unused" state (see the
|
|
||||||
[Segment lifecycle](../design/architecture.md#segment-lifecycle) section of the Architecture page).
|
|
||||||
|
|
||||||
The Kill Task deletes unused segments within a specified interval from metadata storage and deep storage.
|
|
||||||
|
|
||||||
For more information, please see [Kill Task](../ingestion/tasks.md#kill).
|
|
||||||
|
|
||||||
Permanent deletion of a segment in Apache Druid has two steps:
|
|
||||||
|
|
||||||
1. The segment must first be marked as "unused". This occurs when a segment is dropped by retention rules, and when a user manually disables a segment through the Coordinator API.
|
|
||||||
2. After segments have been marked as "unused", a Kill Task will delete any "unused" segments from Druid's metadata store as well as deep storage.
|
|
||||||
|
|
||||||
For documentation on retention rules, please see [Data Retention](../operations/rule-configuration.md).
|
|
||||||
|
|
||||||
For documentation on disabling segments using the Coordinator API, please see the
|
|
||||||
[Coordinator Datasources API](../operations/api-reference.md#coordinator-datasources) reference.
|
|
||||||
|
|
||||||
A data deletion tutorial is available at [Tutorial: Deleting data](../tutorials/tutorial-delete-data.md)
|
|
||||||
|
|
||||||
## Kill Task
|
|
||||||
|
|
||||||
Kill tasks delete all information about a segment and remove it from deep storage. Segments to kill must be unused (used==0) in the Druid segment table. The available grammar is:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type": "kill",
|
|
||||||
"id": <task_id>,
|
|
||||||
"dataSource": <task_datasource>,
|
|
||||||
"interval" : <all_segments_in_this_interval_will_die!>,
|
|
||||||
"context": <task context>
|
|
||||||
}
|
|
||||||
```
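For example, a concrete kill task might look like the following (the datasource and interval are placeholders; `id` and `context` can usually be omitted). It is submitted to the Overlord through the [Task API](../operations/api-reference.md#tasks):

```json
{
  "type": "kill",
  "dataSource": "wikipedia",
  "interval": "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"
}
```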
|
|
||||||
|
|
||||||
## Retention
|
|
||||||
|
|
||||||
Druid supports retention rules, which are used to define intervals of time where data should be preserved, and intervals where data should be discarded.
|
|
||||||
|
|
||||||
Druid also supports separating Historical processes into tiers, and the retention rules can be configured to assign data for specific intervals to specific tiers.
|
|
||||||
|
|
||||||
These features are useful for performance/cost management; a common use case is separating Historical processes into a "hot" tier and a "cold" tier.
|
|
||||||
|
|
||||||
For more information, please see [Load rules](../operations/rule-configuration.md).
|
|
||||||
|
|
||||||
## Learn more
|
|
||||||
See the following topics for more information:
|
|
||||||
- [Compaction](./compaction.md) for an overview and steps to configure manual compaction tasks.
|
|
||||||
- [Segments](../design/segments.md) for information on how Druid handles segment versioning.
|
|
190
ingestion/faq.md
---
|
|
||||||
id: faq
|
|
||||||
title: "Ingestion troubleshooting FAQ"
|
|
||||||
sidebar_label: "Troubleshooting FAQ"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
## Batch Ingestion
|
|
||||||
|
|
||||||
If you are trying to batch load historical data but no events are being loaded, make sure the interval of your ingestion spec actually encapsulates the interval of your data. Events outside this interval are dropped.
|
|
||||||
|
|
||||||
## Druid ingested my events but they are not in my query results
|
|
||||||
|
|
||||||
If the number of ingested events seem correct, make sure your query is correctly formed. If you included a `count` aggregator in your ingestion spec, you will need to query for the results of this aggregate with a `longSum` aggregator. Issuing a query with a count aggregator will count the number of Druid rows, which includes [roll-up](../design/index.md).
|
|
||||||
|
|
||||||
## What types of data does Druid support?
|
|
||||||
|
|
||||||
Druid can ingest JSON, CSV, TSV and other delimited data out of the box. Druid supports single dimension values, or multiple dimension values (an array of strings). Druid supports long, float, and double numeric columns.
|
|
||||||
|
|
||||||
## Where do my Druid segments end up after ingestion?
|
|
||||||
|
|
||||||
Depending on what `druid.storage.type` is set to, Druid will upload segments to some [Deep Storage](../dependencies/deep-storage.md). Local disk is used as the default deep storage.
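For example, a cluster using S3 for deep storage would carry settings along these lines in `common.runtime.properties` (the bucket and base key are placeholders, and the `druid-s3-extensions` extension must be loaded):

```
druid.storage.type=s3
druid.storage.bucket=your-druid-bucket
druid.storage.baseKey=druid/segments
```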
|
|
||||||
|
|
||||||
## My stream ingest is not handing segments off
|
|
||||||
|
|
||||||
First, make sure there are no exceptions in the logs of the ingestion process. Also make sure that `druid.storage.type` is set to a deep storage that isn't `local` if you are running a distributed cluster.
|
|
||||||
|
|
||||||
Other common reasons that hand-off fails are as follows:
|
|
||||||
|
|
||||||
1) Druid is unable to write to the metadata storage. Make sure your configurations are correct.
|
|
||||||
|
|
||||||
2) Historical processes are out of capacity and cannot download any more segments. You'll see exceptions in the Coordinator logs if this occurs and the Coordinator console will show the Historicals are near capacity.
|
|
||||||
|
|
||||||
3) Segments are corrupt and cannot be downloaded. You'll see exceptions in your Historical processes if this occurs.
|
|
||||||
|
|
||||||
4) Deep storage is improperly configured. Make sure that your segment actually exists in deep storage and that the Coordinator logs have no errors.
|
|
||||||
|
|
||||||
## How do I get HDFS to work?
|
|
||||||
|
|
||||||
Make sure to include the `druid-hdfs-storage` and all the hadoop configuration, dependencies (that can be obtained by running command `hadoop classpath` on a machine where hadoop has been setup) in the classpath. And, provide necessary HDFS settings as described in [deep storage](../dependencies/deep-storage.md) .
|
|
||||||
|
|
||||||
## How do I know when I can make query to Druid after submitting batch ingestion task?
|
|
||||||
|
|
||||||
You can verify if segments created by a recent ingestion task are loaded onto historicals and available for querying using the following workflow.
|
|
||||||
1. Submit your ingestion task.
|
|
||||||
2. Repeatedly poll the [Overlord's tasks API](../operations/api-reference.md#tasks) ( `/druid/indexer/v1/task/{taskId}/status`) until your task is shown to be successfully completed.
|
|
||||||
3. Poll the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) (`/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus`) with
|
|
||||||
`forceMetadataRefresh=true` and `interval=<INTERVAL_OF_INGESTED_DATA>` once.
|
|
||||||
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms of the load on the metadata store but is necessary to make sure that we verify all the latest segments' load status)
|
|
||||||
If there are segments not yet loaded, continue to step 4, otherwise you can now query the data.
|
|
||||||
4. Repeatedly poll the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) (`/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus`) with
|
|
||||||
`forceMetadataRefresh=false` and `interval=<INTERVAL_OF_INGESTED_DATA>`.
|
|
||||||
Continue polling until all segments are loaded. Once all segments are loaded you can now query the data.
|
|
||||||
Note that this workflow only guarantees that the segments are available at the time of the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) call. Segments can still become missing because of historical process failures or any other reasons afterward.
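Putting it together, the step 3 call looks roughly like this (the host, datasource, and interval are placeholders, and the interval should be URL-encoded in practice):

```
http://<COORDINATOR_IP>:<PORT>/druid/coordinator/v1/datasources/wikipedia/loadstatus?forceMetadataRefresh=true&interval=2020-01-01/2020-01-02
```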
|
|
||||||
|
|
||||||
## I don't see my Druid segments on my Historical processes
|
|
||||||
|
|
||||||
You can check the Coordinator console located at `<COORDINATOR_IP>:<PORT>`. Make sure that your segments have actually loaded on [Historical processes](../design/historical.md). If your segments are not present, check the Coordinator logs for messages about capacity or replication errors. One reason that segments are not downloaded is that Historical processes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example):
|
|
||||||
|
|
||||||
```
|
|
||||||
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]
|
|
||||||
```
|
|
||||||
|
|
||||||
## My queries are returning empty results
|
|
||||||
|
|
||||||
You can use a [segment metadata query](../querying/segmentmetadataquery.md) for the dimensions and metrics that have been created for your datasource. Make sure that the name of the aggregators you use in your query match one of these metrics. Also make sure that the query interval you specify match a valid time range where data exists.
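A minimal native segment metadata query, assuming a datasource named `wikipedia` and a placeholder interval, looks like this and can be POSTed to the Broker's `/druid/v2/` endpoint:

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-01-02"]
}
```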
|
|
||||||
|
|
||||||
## How can I Reindex existing data in Druid with schema changes?
|
|
||||||
|
|
||||||
You can use DruidInputSource with the [Parallel task](../ingestion/native-batch.md) to ingest existing druid segments using a new schema and change the name, dimensions, metrics, rollup, etc. of the segment.
|
|
||||||
See [DruidInputSource](../ingestion/native-batch.md#druid-input-source) for more details.
|
|
||||||
Or, if you use hadoop based ingestion, then you can use "dataSource" input spec to do reindexing.
|
|
||||||
|
|
||||||
See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details.
|
|
||||||
|
|
||||||
## How can I change the query granularity of existing data in Druid?
|
|
||||||
|
|
||||||
In a lot of situations you may want coarser granularity for older data. Example, any data older than 1 month has only hour level granularity but newer data has minute level granularity. This use case is same as re-indexing.
|
|
||||||
|
|
||||||
To do this use the [DruidInputSource](../ingestion/native-batch.md#druid-input-source) and run a [Parallel task](../ingestion/native-batch.md). The DruidInputSource will allow you to take in existing segments from Druid and aggregate them and feed them back into Druid. It will also allow you to filter the data in those segments while feeding it back in. This means if there are rows you want to delete, you can just filter them away during re-ingestion.
|
|
||||||
Typically the above will be run as a batch job to say everyday feed in a chunk of data and aggregate it.
|
|
||||||
Or, if you use hadoop based ingestion, then you can use "dataSource" input spec to do reindexing.
|
|
||||||
|
|
||||||
See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details.
|
|
||||||
|
|
||||||
You can also change the query granularity using compaction. See [Query granularity handling](../ingestion/compaction.md#query-granularity-handling).
|
|
||||||
|
|
||||||
## Real-time ingestion seems to be stuck
|
|
||||||
|
|
||||||
There are a few ways this can occur. Druid will throttle ingestion to prevent out of memory problems if the intermediate persists are taking too long or if hand-off is taking too long. If your process logs indicate certain columns are taking a very long time to build (for example, if your segment granularity is hourly, but creating a single column takes 30 minutes), you should re-evaluate your configuration or scale up your real-time ingestion.
|
|
||||||
|
|
||||||
## More information
|
|
||||||
|
|
||||||
Data ingestion for Druid can be difficult for first time users. Please don't hesitate to ask questions in the [Druid Forum](https://www.druidforum.org/).
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Data ingestion FAQ

### Real-time ingestion

The most common cause is that events are being ingested outside Druid's `windowPeriod`. Druid real-time ingestion only accepts events that fall within a configurable window period around the current time. You can verify whether this is the case by looking at the logs of your real-time process for lines containing `ingest/events/*`. These metrics identify how many events were received, rejected, and so on.

We recommend using batch ingestion methods for historical data in production.

### Batch ingestion

If you are trying to batch load historical data but no events are being loaded, make sure the interval of your ingestion spec actually encapsulates the interval of your data. Events outside this interval are dropped.

### What types of data does Druid support?

Druid can ingest JSON, CSV, TSV, and other delimited data. Druid supports single dimension values or multiple dimension values (an array of strings). Druid supports long, float, and double numeric columns.

### Not all of my events were ingested

Druid rejects events outside of the time window. The best way to confirm whether events are being rejected is to check the [Druid ingestion metrics](../operations/metrics.md).

If the number of ingested events seems correct, make sure your query is correctly formed. If you included a `count` aggregator in your ingestion spec, you will need to query for the results of this aggregate with a `longSum` aggregator. Issuing a query with a `count` aggregator counts the number of Druid rows, which includes [rollup](ingestion.md#rollup).

### Where do my Druid segments end up after ingestion?

The storage location of segments is determined by the `druid.storage.type` configuration; Druid uploads segments to [deep storage](../design/Deepstorage.md). Local disk is the default deep storage location.

### My stream ingestion task is not handing segments off

First, make sure there are no exceptions in the logs of the ingestion process. If you are running a distributed cluster, also make sure that `druid.storage.type` is set to a deep storage other than `local`.

Other common reasons that hand-off fails are as follows:

1. Druid is unable to write to the metadata storage. Make sure your configurations are correct.
2. Historical processes are out of capacity and cannot download any more segments. If this occurs, you will see exceptions in the Coordinator logs and the Coordinator console will show the Historicals near capacity.
3. Segments are corrupt and cannot be downloaded. If this occurs, you will see exceptions in your Historical processes.
4. Deep storage is improperly configured. Make sure your segments actually exist in deep storage and that the Coordinator logs show no errors.

### How do I get HDFS to work?

Make sure to include `druid-hdfs-storage` and all the Hadoop configuration and dependencies (which can be obtained by running the `hadoop classpath` command on a machine where Hadoop has been set up) in the classpath. Also provide the necessary HDFS settings as described in [deep storage](../design/Deepstorage.md).

### I don't see my Druid segments on my Historical processes

You can check the Coordinator console located at `<Coordinator_IP>:<PORT>`. Make sure that your segments have actually loaded on [Historical processes](../design/Historical.md). If your segments are not present, check the Coordinator logs for messages about capacity or replication errors. One reason segments are not downloaded is that Historical processes have a `maxSize` that is too small, making them unable to download more data. You can change it with, for example:

```
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]
-Ddruid.server.maxSize=500000000000
```

### My queries are returning empty results

You can use a [segment metadata query](../querying/segmentMetadata.md) for the dimensions and metrics that have been created for your datasource. Make sure that the name of the aggregators you use in your query matches one of these metrics, and that the query interval you specify matches a valid time range where data exists.

### How can I reindex existing data in Druid with schema changes?

You can use the [DruidInputSource](native.md#Druid输入源) with a [Parallel task](native.md#并行任务) to ingest existing Druid segments using a new schema and to change the name, dimensions, metrics, rollup, and so on of the segments. See [DruidInputSource](native.md#Druid输入源) for more details. Or, if you use Hadoop-based ingestion, you can use the "dataSource" input spec to reindex.

See the [Update existing data](data-management.md#更新现有的数据) section of the data management page for more details.

### How can I change the query granularity of existing data in Druid?

In a lot of situations you may want coarser granularity for older data. For example, any data older than one month has only hour-level granularity, while newer data has minute-level granularity. This use case is the same as reindexing.

To do this, use the [DruidInputSource](native.md#Druid输入源) and run a [Parallel task](native.md#并行任务). The DruidInputSource allows you to take in existing segments from Druid, aggregate them, and feed them back into Druid. It also allows you to filter the data in those segments while feeding it back in, which means that if there are rows you want to delete, you can simply filter them away during re-ingestion. Typically the above is run as a batch job that, say, feeds in and aggregates a chunk of data every day. Or, if you use Hadoop-based ingestion, you can use the "dataSource" input spec to reindex.

See the [Update existing data](data-management.md#更新现有的数据) section of the data management page for more details.

### Real-time ingestion seems to be stuck

There are a few ways this can occur. Druid will throttle ingestion to prevent out-of-memory problems if the intermediate persists are taking too long or if hand-off is taking too long. If your process logs indicate that certain columns are taking a very long time to build (for example, if your segment granularity is hourly but creating a single column takes 30 minutes), you should re-evaluate your configuration or scale up your real-time ingestion.

### More information

Data ingestion for Druid can be difficult for first-time users. Please don't hesitate to ask questions in our IRC channel or on our [Google group](https://groups.google.com/forum/#!forum/druid-user) page.
|
|
1596
ingestion/hadoop.md
1251
ingestion/index.md
---
|
|
||||||
id: schema-design
|
|
||||||
title: "Schema design tips"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
## Druid's data model
|
|
||||||
|
|
||||||
For general information, check out the documentation on [Druid's data model](index.md#data-model) on the main
|
|
||||||
ingestion overview page. The rest of this page discusses tips for users coming from other kinds of systems, as well as
|
|
||||||
general tips and common practices.
|
|
||||||
|
|
||||||
* Druid data is stored in [datasources](index.md#datasources), which are similar to tables in a traditional RDBMS.
|
|
||||||
* Druid datasources can be ingested with or without [rollup](#rollup). With rollup enabled, Druid partially aggregates your data during ingestion, potentially reducing its row count, decreasing storage footprint, and improving query performance. With rollup disabled, Druid stores one row for each row in your input data, without any pre-aggregation.
|
|
||||||
* Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on.
|
|
||||||
* All columns in Druid datasources, other than the timestamp column, are either dimensions or metrics. This follows the [standard naming convention](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems) of OLAP data.
|
|
||||||
* Typical production datasources have tens to hundreds of columns.
|
|
||||||
* [Dimension columns](index.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
|
|
||||||
* [Metric columns](index.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
|
|
||||||
|
|
||||||
|
|
||||||
## If you're coming from a...
|
|
||||||
|
|
||||||
### Relational model
|
|
||||||
|
|
||||||
(Like Hive or PostgreSQL.)
|
|
||||||
|
|
||||||
Druid datasources are generally equivalent to tables in a relational database. Druid [lookups](../querying/lookups.md)
|
|
||||||
can act similarly to data-warehouse-style dimension tables, but as you'll see below, denormalization is often
|
|
||||||
recommended if you can get away with it.
|
|
||||||
|
|
||||||
Common practice for relational data modeling involves [normalization](https://en.wikipedia.org/wiki/Database_normalization):
|
|
||||||
the idea of splitting up data into multiple tables such that data redundancy is reduced or eliminated. For example, in a
|
|
||||||
"sales" table, best-practices relational modeling calls for a "product id" column that is a foreign key into a separate
|
|
||||||
"products" table, which in turn has "product id", "product name", and "product category" columns. This prevents the
|
|
||||||
product name and category from needing to be repeated on different rows in the "sales" table that refer to the same
|
|
||||||
product.
|
|
||||||
|
|
||||||
In Druid, on the other hand, it is common to use totally flat datasources that do not require joins at query time. In
|
|
||||||
the example of the "sales" table, in Druid it would be typical to store "product_id", "product_name", and
|
|
||||||
"product_category" as dimensions directly in a Druid "sales" datasource, without using a separate "products" table.
|
|
||||||
Totally flat schemas substantially increase performance, since the need for joins is eliminated at query time. As an
|
|
||||||
added speed boost, this also allows Druid's query layer to operate directly on compressed dictionary-encoded data.
|
|
||||||
Perhaps counter-intuitively, this does _not_ substantially increase storage footprint relative to normalized schemas,
|
|
||||||
since Druid uses dictionary encoding to effectively store just a single integer per row for string columns.
|
|
||||||
|
|
||||||
If necessary, Druid datasources can be partially normalized through the use of [lookups](../querying/lookups.md),
|
|
||||||
which are the rough equivalent of dimension tables in a relational database. At query time, you would use Druid's SQL
|
|
||||||
`LOOKUP` function, or native lookup extraction functions, instead of using the JOIN keyword like you would in a
|
|
||||||
relational database. Since lookup tables impose an increase in memory footprint and incur more computational overhead
|
|
||||||
at query time, it is only recommended to do this if you need the ability to update a lookup table and have the changes
|
|
||||||
reflected immediately for already-ingested rows in your main table.
|
|
||||||
|
|
||||||
Tips for modeling relational data in Druid:
|
|
||||||
|
|
||||||
- Druid datasources do not have primary or unique keys, so skip those.
|
|
||||||
- Denormalize if possible. If you need to be able to update dimension / lookup tables periodically and have those
|
|
||||||
changes reflected in already-ingested data, consider partial normalization with [lookups](../querying/lookups.md).
|
|
||||||
- If you need to join two large distributed tables with each other, you must do this before loading the data into Druid.
|
|
||||||
Druid does not support query-time joins of two datasources. Lookups do not help here, since a full copy of each lookup
|
|
||||||
table is stored on each Druid server, so they are not a good choice for large tables.
|
|
||||||
- Consider whether you want to enable [rollup](#rollup) for pre-aggregation, or whether you want to disable
|
|
||||||
rollup and load your existing data as-is. Rollup in Druid is similar to creating a summary table in a relational model.
|
|
||||||
|
|
||||||
### Time series model
|
|
||||||
|
|
||||||
(Like OpenTSDB or InfluxDB.)
|
|
||||||
|
|
||||||
Similar to time series databases, Druid's data model requires a timestamp. Druid is not a timeseries database, but
|
|
||||||
it is a natural choice for storing timeseries data. Its flexible data model allows it to store both timeseries and
|
|
||||||
non-timeseries data, even in the same datasource.
|
|
||||||
|
|
||||||
To achieve best-case compression and query performance in Druid for timeseries data, it is important to partition and
|
|
||||||
sort by metric name, like timeseries databases often do. See [Partitioning and sorting](index.md#partitioning) for more details.
|
|
||||||
|
|
||||||
Tips for modeling timeseries data in Druid:
|
|
||||||
|
|
||||||
- Druid does not think of data points as being part of a "time series". Instead, Druid treats each point separately
|
|
||||||
for ingestion and aggregation.
|
|
||||||
- Create a dimension that indicates the name of the series that a data point belongs to. This dimension is often called
|
|
||||||
"metric" or "name". Do not get the dimension named "metric" confused with the concept of Druid metrics. Place this
|
|
||||||
first in the list of dimensions in your "dimensionsSpec" for best performance (this helps because it improves locality;
|
|
||||||
see [partitioning and sorting](index.md#partitioning) below for details).
|
|
||||||
- Create other dimensions for attributes attached to your data points. These are often called "tags" in timeseries
|
|
||||||
database systems.
|
|
||||||
- Create [metrics](../querying/aggregations.md) corresponding to the types of aggregations that you want to be able
|
|
||||||
to query. Typically this includes "sum", "min", and "max" (in one of the long, float, or double flavors). If you want to
|
|
||||||
be able to compute percentiles or quantiles, use Druid's [approximate aggregators](../querying/aggregations.md#approx).
|
|
||||||
- Consider enabling [rollup](#rollup), which will allow Druid to potentially combine multiple points into one
|
|
||||||
row in your Druid datasource. This can be useful if you want to store data at a different time granularity than it is
|
|
||||||
naturally emitted. It is also useful if you want to combine timeseries and non-timeseries data in the same datasource.
|
|
||||||
- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
|
|
||||||
[automatic detection of dimension columns](#schema-less-dimensions).
|
|
||||||
|
|
||||||
### Log aggregation model
|
|
||||||
|
|
||||||
(Like Elasticsearch or Splunk.)
|
|
||||||
|
|
||||||
Similar to log aggregation systems, Druid offers inverted indexes for fast searching and filtering. Druid's search
|
|
||||||
capabilities are generally less developed than these systems, and its analytical capabilities are generally more
|
|
||||||
developed. The main data modeling differences between Druid and these systems are that when ingesting data into Druid,
|
|
||||||
you must be more explicit. Druid columns have specific types declared upfront, and Druid does not, at this time, natively support
|
|
||||||
nested data.
|
|
||||||
|
|
||||||
Tips for modeling log data in Druid:
|
|
||||||
|
|
||||||
- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
|
|
||||||
[automatic detection of dimension columns](#schema-less-dimensions).
|
|
||||||
- If you have nested data, flatten it using a [`flattenSpec`](index.md#flattenspec).
|
|
||||||
- Consider enabling [rollup](#rollup) if you have mainly analytical use cases for your log data. This will
|
|
||||||
mean you lose the ability to retrieve individual events from Druid, but you potentially gain substantial compression and
|
|
||||||
query performance boosts.
|
|
||||||
|
|
||||||
## General tips and best practices
|
|
||||||
|
|
||||||
### Rollup
|
|
||||||
|
|
||||||
Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. This is a form
|
|
||||||
of summarization or pre-aggregation. For more details, see the [Rollup](index.md#rollup) section of the ingestion
|
|
||||||
documentation.
|
|
||||||
|
|
||||||
### Partitioning and sorting
|
|
||||||
|
|
||||||
Optimally partitioning and sorting your data can have substantial impact on footprint and performance. For more details,
|
|
||||||
see the [Partitioning](index.md#partitioning) section of the ingestion documentation.
|
|
||||||
|
|
||||||
<a name="sketches"></a>
|
|
||||||
|
|
||||||
### Sketches for high cardinality columns
|
|
||||||
|
|
||||||
When dealing with high cardinality columns like user IDs or other unique IDs, consider using sketches for approximate
|
|
||||||
analysis rather than operating on the actual values. When you ingest data using a sketch, Druid does not store the
|
|
||||||
original raw data, but instead stores a "sketch" of it that it can feed into a later computation at query time. Popular
|
|
||||||
use cases for sketches include count-distinct and quantile computation. Each sketch is designed for just one particular
|
|
||||||
kind of computation.
|
|
||||||
|
|
||||||
In general using sketches serves two main purposes: improving rollup, and reducing memory footprint at
|
|
||||||
query time.
|
|
||||||
|
|
||||||
Sketches improve rollup ratios because they allow you to collapse multiple distinct values into the same sketch. For
|
|
||||||
example, if you have two rows that are identical except for a user ID (perhaps two users did the same action at the
|
|
||||||
same time), storing them in a count-distinct sketch instead of as-is means you can store the data in one row instead of
|
|
||||||
two. You won't be able to retrieve the user IDs or compute exact distinct counts, but you'll still be able to compute
|
|
||||||
approximate distinct counts, and you'll reduce your storage footprint.
|
|
||||||
|
|
||||||
Sketches reduce memory footprint at query time because they limit the amount of data that needs to be shuffled between
|
|
||||||
servers. For example, in a quantile computation, instead of needing to send all data points to a central location
|
|
||||||
so they can be sorted and the quantile can be computed, Druid instead only needs to send a sketch of the points. This
|
|
||||||
can reduce data transfer needs to mere kilobytes.
|
|
||||||
|
|
||||||
For details about the sketches available in Druid, see the
|
|
||||||
[approximate aggregators](../querying/aggregations.md#approx) page.
|
|
||||||
|
|
||||||
If you prefer videos, take a look at [Not exactly!](https://www.youtube.com/watch?v=Hpd3f_MLdXo), a conference talk
|
|
||||||
about sketches in Druid.
|
|
||||||
|
|
||||||
### String vs numeric dimensions
|
|
||||||
|
|
||||||
If the user wishes to ingest a column as a numeric-typed dimension (Long, Double or Float), it is necessary to specify the type of the column in the `dimensions` section of the `dimensionsSpec`. If the type is omitted, Druid will ingest a column as the default String type.
|
|
||||||
|
|
||||||
There are performance tradeoffs between string and numeric columns. Numeric columns are generally faster to group on
|
|
||||||
than string columns. But unlike string columns, numeric columns don't have indexes, so they can be slower to filter on.
|
|
||||||
You may want to experiment to find the optimal choice for your use case.
|
|
||||||
|
|
||||||
For details about how to configure numeric dimensions, see the [`dimensionsSpec`](index.md#dimensionsspec) documentation.
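As a brief sketch (the column names are hypothetical), numeric dimensions are declared with an explicit type in the `dimensionsSpec`, while bare strings default to String-typed dimensions:

```json
"dimensionsSpec": {
  "dimensions": [
    "page",
    { "name": "userId", "type": "long" },
    { "name": "price", "type": "double" }
  ]
}
```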
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### Secondary timestamps
|
|
||||||
|
|
||||||
Druid schemas must always include a primary timestamp. The primary timestamp is used for
|
|
||||||
[partitioning and sorting](index.md#partitioning) your data, so it should be the timestamp that you will most often filter on.
|
|
||||||
Druid is able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column.
|
|
||||||
|
|
||||||
If your data has more than one timestamp, you can ingest the others as secondary timestamps. The best way to do this
|
|
||||||
is to ingest them as [long-typed dimensions](index.md#dimensionsspec) in milliseconds format.
|
|
||||||
If necessary, you can get them into this format using a [`transformSpec`](index.md#transformspec) and
|
|
||||||
[expressions](../misc/math-expr.md) like `timestamp_parse`, which returns millisecond timestamps.
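For example (the field names here are hypothetical), a `transformSpec` along these lines would parse a secondary timestamp string into a millisecond value at ingestion time; the resulting `shippedTime` column would then be listed as a long-typed dimension:

```json
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "shippedTime",
      "expression": "timestamp_parse(\"shippedTimeRaw\")"
    }
  ]
}
```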
|
|
||||||
|
|
||||||
At query time, you can query secondary timestamps with [SQL time functions](../querying/sql.md#time-functions)
|
|
||||||
like `MILLIS_TO_TIMESTAMP`, `TIME_FLOOR`, and others. If you're using native Druid queries, you can use
|
|
||||||
[expressions](../misc/math-expr.md).
|
|
||||||
|
|
||||||
### Nested dimensions
|
|
||||||
|
|
||||||
At the time of this writing, Druid does not support nested dimensions. Nested dimensions need to be flattened. For example,
|
|
||||||
if you have data of the following form:
|
|
||||||
|
|
||||||
```
|
|
||||||
{"foo":{"bar": 3}}
|
|
||||||
```
|
|
||||||
|
|
||||||
then before indexing it, you should transform it to:
|
|
||||||
|
|
||||||
```
|
|
||||||
{"foo_bar": 3}
|
|
||||||
```
|
|
||||||
|
|
||||||
Druid is capable of flattening JSON, Avro, or Parquet input data.
|
|
||||||
Please read about [`flattenSpec`](index.md#flattenspec) for more details.
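For instance, a `flattenSpec` along these lines (field names follow the example above) would produce the flattened `foo_bar` column from the nested input:

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "foo_bar", "expr": "$.foo.bar" }
  ]
}
```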
|
|
||||||
|
|
||||||
<a name="counting"></a>
|
|
||||||
|
|
||||||
### Counting the number of ingested events
|
|
||||||
|
|
||||||
When rollup is enabled, count aggregators at query time do not actually tell you the number of rows that have been
|
|
||||||
ingested. They tell you the number of rows in the Druid datasource, which may be smaller than the number of rows
|
|
||||||
ingested.
|
|
||||||
|
|
||||||
In this case, a count aggregator at _ingestion_ time can be used to count the number of events. However, it is important to note
|
|
||||||
that when you query for this metric, you should use a `longSum` aggregator. A `count` aggregator at query time will return
|
|
||||||
the number of Druid rows for the time interval, which can be used to determine what the roll-up ratio was.
|
|
||||||
|
|
||||||
To clarify with an example, if your ingestion spec contains:
|
|
||||||
|
|
||||||
```
|
|
||||||
...
|
|
||||||
"metricsSpec" : [
|
|
||||||
{
|
|
||||||
"type" : "count",
|
|
||||||
"name" : "count"
|
|
||||||
},
|
|
||||||
...
|
|
||||||
```
|
|
||||||
|
|
||||||
You should query for the number of ingested rows with:
|
|
||||||
|
|
||||||
```
|
|
||||||
...
|
|
||||||
"aggregations": [
|
|
||||||
{ "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },
|
|
||||||
...
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### Schema-less dimensions
|
|
||||||
|
|
||||||
If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
|
|
||||||
a dimension that has been excluded, or a metric column as a dimension.
|
|
||||||
|
|
||||||
Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions.
|
|
||||||
|
|
||||||
### Including the same column as a dimension and a metric
|
|
||||||
|
|
||||||
One workflow with unique IDs is to be able to filter on a particular ID, while still being able to do fast unique counts on the ID column.
|
|
||||||
If you are not using schema-less dimensions, this use case is supported by setting the `name` of the metric to something different than the dimension.
|
|
||||||
If you are using schema-less dimensions, the best practice here is to include the same column twice, once as a dimension, and as a `hyperUnique` metric. This may involve
|
|
||||||
some work at ETL time.
|
|
||||||
|
|
||||||
As an example, for schema-less dimensions, repeat the same column:
|
|
||||||
|
|
||||||
```
|
|
||||||
{"device_id_dim":123, "device_id_met":123}
|
|
||||||
```
|
|
||||||
|
|
||||||
and in your `metricsSpec`, include:
|
|
||||||
|
|
||||||
```
|
|
||||||
{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }
|
|
||||||
```
|
|
||||||
|
|
||||||
`device_id_dim` should automatically get picked up as a dimension.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Schema设计
|
|
||||||
### Druid数据模型
|
|
||||||
|
|
||||||
有关一般信息,请查看摄取概述页面上有关 [Druid数据模型](ingestion.md#Druid数据模型) 的文档。本页的其余部分将讨论来自其他类型系统的用户的提示,以及一般提示和常见做法。
|
|
||||||
|
|
||||||
* Druid数据存储在 [数据源](ingestion.md#数据源) 中,与传统RDBMS中的表类似。
|
|
||||||
* Druid数据源可以在摄取过程中使用或不使用 [rollup](ingestion.md#rollup) 。启用rollup后,Druid会在接收期间部分聚合您的数据,这可能会减少其行数,减少存储空间,并提高查询性能。禁用rollup后,Druid为输入数据中的每一行存储一行,而不进行任何预聚合。
|
|
||||||
* Druid的每一行都必须有时间戳。数据总是按时间进行分区,每个查询都有一个时间过滤器。查询结果也可以按时间段(如分钟、小时、天等)进行细分。
|
|
||||||
* 除了timestamp列之外,Druid数据源中的所有列都是dimensions或metrics。这遵循 [OLAP数据的标准命名约定](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems)。
|
|
||||||
* 典型的生产数据源有几十到几百列。
|
|
||||||
* [dimension列](ingestion.md#维度) 按原样存储,因此可以在查询时对其进行筛选、分组或聚合。它们总是单个字符串、字符串数组、单个long、单个double或单个float。
|
|
||||||
* [Metrics列](ingestion.md#指标) 是 [预聚合](../querying/aggregations.md) 存储的,因此它们只能在查询时聚合(不能按筛选或分组)。它们通常存储为数字(整数或浮点数),但也可以存储为复杂对象,如[HyperLogLog草图或近似分位数草图](../querying/aggregations.md)。即使禁用了rollup,也可以在接收时配置metrics,但在启用汇总时最有用。
|
|
||||||
|
|
||||||
### 与其他设计模式类比
|
|
||||||
#### 关系模型
|
|
||||||
(如 Hive 或者 PostgreSQL)
|
|
||||||
|
|
||||||
Druid数据源通常相当于关系数据库中的表。Druid的 [lookups特性](../querying/lookups.md) 可以类似于数据仓库样式的维度表,但是正如您将在下面看到的,如果您能够摆脱它,通常建议您进行非规范化。
|
|
||||||
|
|
||||||
关系数据建模的常见实践涉及 [规范化](https://en.wikipedia.org/wiki/Database_normalization) 的思想:将数据拆分为多个表,从而减少或消除数据冗余。例如,在"sales"表中,最佳实践关系建模要求将"product id"列作为外键放入单独的"products"表中,该表依次具有"product id"、"product name"和"product category"列, 这可以防止产品名称和类别需要在"sales"表中引用同一产品的不同行上重复。
|
|
||||||
|
|
||||||
另一方面,在Druid中,通常使用在查询时不需要连接的完全平坦的数据源。在"sales"表的例子中,在Druid中,通常直接将"product_id"、"product_name"和"product_category"作为维度存储在Druid "sales"数据源中,而不使用单独的"products"表。完全平坦的模式大大提高了性能,因为查询时不需要连接。作为一个额外的速度提升,这也允许Druid的查询层直接操作压缩字典编码的数据。因为Druid使用字典编码来有效地为字符串列每行存储一个整数, 所以可能与直觉相反,这并*没有*显著增加相对于规范化模式的存储空间。
|
|
||||||
|
|
||||||
如果需要的话,可以通过使用 [lookups](../querying/lookups.md) 规范化Druid数据源,这大致相当于关系数据库中的维度表。在查询时,您将使用Druid的SQL `LOOKUP` 查找函数或者原生 `lookup` 提取函数,而不是像在关系数据库中那样使用JOIN关键字。由于lookup表会增加内存占用并在查询时产生更多的计算开销,因此仅当需要更新lookup表并立即反映主表中已摄取行的更改时,才建议执行此操作。
|
|
||||||
|
|
||||||
在Druid中建模关系数据的技巧:
|
|
||||||
* Druid数据源没有主键或唯一键,所以跳过这些。
|
|
||||||
* 如果可能的话,去规格化。如果需要定期更新dimensions/lookup并将这些更改反映在已接收的数据中,请考虑使用 [lookups](../querying/lookups.md) 进行部分规范化。
|
|
||||||
* 如果需要将两个大型的分布式表连接起来,则必须在将数据加载到Druid之前执行此操作。Druid不支持两个数据源的查询时间连接。lookup在这里没有帮助,因为每个lookup表的完整副本存储在每个Druid服务器上,所以对于大型表来说,它们不是一个好的选择。
|
|
||||||
* 考虑是否要为预聚合启用[rollup](ingestion.md#rollup),或者是否要禁用rollup并按原样加载现有数据。Druid中的Rollup类似于在关系模型中创建摘要表。
|
|
||||||
|
|
||||||
#### 时序模型
|
|
||||||
(如 OpenTSDB 或者 InfluxDB)
|
|
||||||
|
|
||||||
与时间序列数据库类似,Druid的数据模型需要时间戳。Druid不是时序数据库,但它同时也是存储时序数据的自然选择。它灵活的数据模型允许它同时存储时序和非时序数据,甚至在同一个数据源中。
|
|
||||||
|
|
||||||
为了在Druid中实现时序数据的最佳压缩和查询性能,像时序数据库经常做的一样,按照metric名称进行分区和排序很重要。有关详细信息,请参见 [分区和排序](ingestion.md#分区)。
|
|
||||||
|
|
||||||
在Druid中建模时序数据的技巧:
|
|
||||||
* Druid并不认为数据点是"时间序列"的一部分。相反,Druid对每一点分别进行摄取和聚合
|
|
||||||
* 创建一个维度,该维度指示数据点所属系列的名称。这个维度通常被称为"metric"或"name"。不要将名为"metric"的维度与Druid Metrics的概念混淆。将它放在"dimensionsSpec"中维度列表的第一个位置,以获得最佳性能(这有助于提高局部性;有关详细信息,请参阅下面的 [分区和排序](ingestion.md#分区))
|
|
||||||
* 为附着到数据点的属性创建其他维度。在时序数据库系统中,这些通常称为"标签"
|
|
||||||
* 创建与您希望能够查询的聚合类型相对应的 [Druid Metrics](ingestion.md#指标)。通常这包括"sum"、"min"和"max"(在long、float或double中的一种)。如果你想计算百分位数或分位数,可以使用Druid的 [近似聚合器](../querying/aggregations.md)
|
|
||||||
* 考虑启用 [rollup](ingestion.md#rollup),这将允许Druid潜在地将多个点合并到Druid数据源中的一行中。如果希望以不同于原始发出的时间粒度存储数据,则这可能非常有用。如果要在同一个数据源中组合时序和非时序数据,它也很有用
|
|
||||||
* 如果您提前不知道要摄取哪些列,请使用空的维度列表来触发 [维度列的自动检测](#无schema的维度列)
|
|
||||||
|
|
||||||
#### 日志聚合模型
|
|
||||||
(如 ElasticSearch 或者 Splunk)
|
|
||||||
|
|
||||||
与日志聚合系统类似,Druid提供反向索引,用于快速搜索和筛选。Druid的搜索能力通常不如这些系统发达,其分析能力通常更为发达。Druid和这些系统之间的主要数据建模差异在于,在将数据摄取到Druid中时,必须更加明确。Druid列具有特定的类型,而Druid目前不支持嵌套数据。
|
|
||||||
|
|
||||||
在Druid中建模日志数据的技巧:
|
|
||||||
* 如果您提前不知道要摄取哪些列,请使用空维度列表来触发 [维度列的自动检测](#无schema的维度列)
|
|
||||||
* 如果有嵌套数据,请使用 [展平规范](ingestion.md#flattenspec) 将其扁平化
|
|
||||||
* 如果您主要有日志数据的分析场景,请考虑启用 [rollup](ingestion.md#rollup),这意味着您将失去从Druid中检索单个事件的能力,但您可能获得大量的压缩和查询性能提升
|
|
||||||
|
|
||||||
### 一般提示以及最佳实践
|
|
||||||
#### Rollup
|
|
||||||
|
|
||||||
Druid可以在接收数据时将其汇总,以最小化需要存储的原始数据量。这是一种汇总或预聚合的形式。有关更多详细信息,请参阅摄取文档的 [汇总部分](ingestion.md#rollup)。
|
|
||||||
|
|
||||||
#### 分区与排序
|
|
||||||
|
|
||||||
对数据进行最佳分区和排序会对占用空间和性能产生重大影响。有关更多详细信息,请参阅摄取文档的 [分区部分](ingestion.md#分区)。
|
|
||||||
|
|
||||||
#### Sketches高基维处理
|
|
||||||
|
|
||||||
在处理高基数列(如用户ID或其他唯一ID)时,请考虑使用草图(sketches)进行近似分析,而不是对实际值进行操作。当您使用草图(sketches)摄取数据时,Druid不存储原始原始数据,而是存储它的"草图(sketches)",它可以在查询时输入到以后的计算中。草图(sketches)的常用场景包括 `count-distinct` 和分位数计算。每个草图都是为一种特定的计算而设计的。
|
|
||||||
|
|
||||||
一般来说,使用草图(sketches)有两个主要目的:改进rollup和减少查询时的内存占用。
|
|
||||||
|
|
||||||
草图(sketches)可以提高rollup比率,因为它们允许您将多个不同的值折叠到同一个草图(sketches)中。例如,如果有两行除了用户ID之外都是相同的(可能两个用户同时执行了相同的操作),则将它们存储在 `count-distinct sketch` 中而不是按原样,这意味着您可以将数据存储在一行而不是两行中。您将无法检索用户id或计算精确的非重复计数,但您仍将能够计算近似的非重复计数,并且您将减少存储空间。
|
|
||||||
|
|
||||||
草图(sketches)减少了查询时的内存占用,因为它们限制了需要在服务器之间洗牌的数据量。例如,在分位数计算中,Druid不需要将所有数据点发送到中心位置,以便对它们进行排序和计算分位数,而只需要发送点的草图。这可以将数据传输需要减少到仅千字节。
|
|
||||||
|
|
||||||
有关Druid中可用的草图的详细信息,请参阅 [近似聚合器页面](../querying/aggregations.md)。
|
|
||||||
|
|
||||||
如果你更喜欢 [视频](https://www.youtube.com/watch?v=Hpd3f_MLdXo),那就看一看吧!,一个讨论Druid Sketches的会议。
|
|
||||||
|
|
||||||
#### 字符串 VS 数值维度
|
|
||||||
|
|
||||||
如果用户希望将列摄取为数值类型的维度(Long、Double或Float),则需要在 `dimensionsSpec` 的 `dimensions` 部分中指定列的类型。如果省略了该类型,Druid会将列作为默认的字符串类型。
|
|
||||||
|
|
||||||
字符串列和数值列之间存在性能折衷。数值列通常比字符串列更快分组。但与字符串列不同,数值列没有索引,因此可以更慢地进行筛选。您可能想尝试为您的用例找到最佳选择。
|
|
||||||
|
|
||||||
有关如何配置数值维度的详细信息,请参阅 [`dimensionsSpec`文档](ingestion.md#dimensionsSpec)
|
|
||||||
|
|
||||||
#### 辅助时间戳
|
|
||||||
|
|
||||||
Druid schema必须始终包含一个主时间戳, 主时间戳用于对数据进行 [分区和排序](ingestion.md#分区),因此它应该是您最常筛选的时间戳。Druid能够快速识别和检索与主时间戳列的时间范围相对应的数据。
|
|
||||||
|
|
||||||
如果数据有多个时间戳,则可以将其他时间戳作为辅助时间戳摄取。最好的方法是将它们作为 [毫秒格式的Long类型维度](ingestion.md#dimensionsspec) 摄取。如有必要,可以使用 [`transformSpec`](ingestion.md#transformspec) 和 `timestamp_parse` 等 [表达式](../misc/expression.md) 将它们转换成这种格式,后者返回毫秒时间戳。
|
|
||||||
|
|
||||||
在查询时,可以使用诸如 `MILLIS_TO_TIMESTAMP`、`TIME_FLOOR` 等 [SQL时间函数](../querying/druidsql.md) 查询辅助时间戳。如果您使用的是原生Druid查询,那么可以使用 [表达式](../misc/expression.md)。
|
|
||||||
|
|
||||||
#### 嵌套维度
|
|
||||||
|
|
||||||
在编写本文时,Druid不支持嵌套维度。嵌套维度需要展平,例如,如果您有以下数据:
|
|
||||||
```json
|
|
||||||
{"foo":{"bar": 3}}
|
|
||||||
```
|
|
||||||
|
|
||||||
然后在编制索引之前,应将其转换为:
|
|
||||||
```json
|
|
||||||
{"foo_bar": 3}
|
|
||||||
```
|
|
||||||
|
|
||||||
Druid能够将JSON、Avro或Parquet输入数据展平化。请阅读 [展平规格](ingestion.md#flattenspec) 了解更多细节。
|
|
||||||
|
|
||||||
#### 计数接收事件数
|
|
||||||
|
|
||||||
启用rollup后,查询时的计数聚合器(count aggregator)实际上不会告诉您已摄取的行数。它们告诉您Druid数据源中的行数,可能小于接收的行数。
|
|
||||||
|
|
||||||
在这种情况下,可以使用*摄取时*的计数聚合器来计算事件数。但是,需要注意的是,在查询此Metrics时,应该使用 `longSum` 聚合器。查询时的 `count` 聚合器将返回时间间隔的Druid行数,该行数可用于确定rollup比率。
|
|
||||||
|
|
||||||
为了举例说明,如果摄取规范包含:
|
|
||||||
```json
|
|
||||||
...
|
|
||||||
"metricsSpec" : [
|
|
||||||
{
|
|
||||||
"type" : "count",
|
|
||||||
"name" : "count"
|
|
||||||
},
|
|
||||||
...
|
|
||||||
```
|
|
||||||
|
|
||||||
您应该使用查询:
|
|
||||||
```json
|
|
||||||
...
|
|
||||||
"aggregations": [
|
|
||||||
{ "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },
|
|
||||||
...
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 无schema的维度列
|
|
||||||
|
|
||||||
如果摄取规范中的 `dimensions` 字段为空,Druid将把不是timestamp列、已排除的维度和metric列之外的每一列都视为维度。
|
|
||||||
|
|
||||||
注意,当使用无schema摄取时,所有维度都将被摄取为字符串类型的维度。
|
|
||||||
|
|
||||||
##### 包含与Dimension和Metric相同的列
|
|
||||||
|
|
||||||
一个具有唯一ID的工作流能够对特定ID进行过滤,同时仍然能够对ID列进行快速的唯一计数。如果不使用无schema维度,则通过将Metric的 `name` 设置为与维度不同的值来支持此场景。如果使用无schema维度,这里的最佳实践是将同一列包含两次,一次作为维度,一次作为 `hyperUnique` Metric。这可能涉及到ETL时的一些工作。
|
|
||||||
|
|
||||||
例如,对于无schema维度,请重复同一列:
|
|
||||||
```json
|
|
||||||
{"device_id_dim":123, "device_id_met":123}
|
|
||||||
```
|
|
||||||
同时在 `metricsSpec` 中包含:
|
|
||||||
```json
|
|
||||||
{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }
|
|
||||||
```
|
|
||||||
`device_id_dim` 将自动作为维度来被选取
---
|
|
||||||
id: standalone-realtime
|
|
||||||
layout: doc_page
|
|
||||||
title: "Realtime Process"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
Older versions of Apache Druid supported a standalone 'Realtime' process to query and index 'stream pull'
|
|
||||||
modes of real-time ingestion. These processes would periodically build segments for the data they had collected over
|
|
||||||
some span of time and then set up hand-off to [Historical](../design/historical.md) servers.
|
|
||||||
|
|
||||||
This process could be invoked by
|
|
||||||
|
|
||||||
```
|
|
||||||
org.apache.druid.cli.Main server realtime
|
|
||||||
```
|
|
||||||
|
|
||||||
This model of stream pull ingestion was deprecated for a number of both operational and architectural reasons, and
|
|
||||||
removed completely in Druid 0.16.0. Operationally, realtime nodes were difficult to configure, deploy, and scale because
|
|
||||||
each node required a unique configuration. The design of the stream pull ingestion system for realtime nodes also
|
|
||||||
suffered from limitations which made it not possible to achieve exactly once ingestion.
|
|
||||||
|
|
||||||
The extensions `druid-kafka-eight`, `druid-kafka-eight-simpleConsumer`, `druid-rabbitmq`, and `druid-rocketmq` were also
|
|
||||||
removed at this time, since they were built to operate on the realtime nodes.
|
|
||||||
|
|
||||||
Please consider using the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md) or
|
|
||||||
[Kinesis Indexing Service](../development/extensions-core/kinesis-ingestion.md) for stream pull ingestion instead.
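As a rough sketch only (the topic, broker address, and schema are placeholders; see the Kafka indexing service documentation for the authoritative spec), a Kafka supervisor spec submitted to the Overlord's `/druid/indexer/v1/supervisor` endpoint has this general shape:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": {
      "topic": "metrics",
      "consumerProperties": { "bootstrap.servers": "kafkabroker:9092" },
      "taskCount": 1,
      "taskDuration": "PT1H"
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```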
|
|
---
|
|
||||||
id: tasks
|
|
||||||
title: "Task reference"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
Tasks do all [ingestion](index.md)-related work in Druid.
|
|
||||||
|
|
||||||
For batch ingestion, you will generally submit tasks directly to Druid using the
|
|
||||||
[Task APIs](../operations/api-reference.md#tasks). For streaming ingestion, tasks are generally submitted for you by a
|
|
||||||
supervisor.
|
|
||||||
|
|
||||||
## Task API
|
|
||||||
|
|
||||||
Task APIs are available in two main places:
|
|
||||||
|
|
||||||
- The [Overlord](../design/overlord.md) process offers HTTP APIs to submit tasks, cancel tasks, check their status,
|
|
||||||
review logs and reports, and more. Refer to the [Tasks API reference page](../operations/api-reference.md#tasks) for a
|
|
||||||
full list.
|
|
||||||
- Druid SQL includes a [`sys.tasks`](../querying/sql.md#tasks-table) table that provides information about currently
|
|
||||||
running tasks. This table is read-only, and has a limited (but useful!) subset of the full information available through
|
|
||||||
the Overlord APIs.
|
|
||||||
|
|
||||||
<a name="reports"></a>
|
|
||||||
|
|
||||||
## Task reports
|
|
||||||
|
|
||||||
A report containing information about the number of rows ingested and any parse exceptions that occurred is available for both completed tasks and running tasks.
|
|
||||||
|
|
||||||
The reporting feature is supported by the [simple native batch task](../ingestion/native-batch.md#simple-task), the Hadoop batch task, and Kafka and Kinesis ingestion tasks.
|
|
||||||
|
|
||||||
### Completion report
|
|
||||||
|
|
||||||
After a task completes, a completion report can be retrieved at:
|
|
||||||
|
|
||||||
```
|
|
||||||
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
|
|
||||||
```
|
|
||||||
|
|
||||||
An example output is shown below:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"ingestionStatsAndErrors": {
|
|
||||||
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
|
|
||||||
"payload": {
|
|
||||||
"ingestionState": "COMPLETED",
|
|
||||||
"unparseableEvents": {},
|
|
||||||
"rowStats": {
|
|
||||||
"determinePartitions": {
|
|
||||||
"processed": 0,
|
|
||||||
"processedWithError": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"unparseable": 0
|
|
||||||
},
|
|
||||||
"buildSegments": {
|
|
||||||
"processed": 5390324,
|
|
||||||
"processedWithError": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"unparseable": 0
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"errorMsg": null
|
|
||||||
},
|
|
||||||
"type": "ingestionStatsAndErrors"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Live report
|
|
||||||
|
|
||||||
When a task is running, a live report containing ingestion state, unparseable events and moving average for number of events processed for 1 min, 5 min, 15 min time window can be retrieved at:
|
|
||||||
|
|
||||||
```
|
|
||||||
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
|
|
||||||
```
|
|
||||||
|
|
||||||
and
|
|
||||||
|
|
||||||
```
|
|
||||||
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/liveReports
|
|
||||||
```
|
|
||||||
|
|
||||||
An example output is shown below:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"ingestionStatsAndErrors": {
|
|
||||||
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
|
|
||||||
"payload": {
|
|
||||||
"ingestionState": "RUNNING",
|
|
||||||
"unparseableEvents": {},
|
|
||||||
"rowStats": {
|
|
||||||
"movingAverages": {
|
|
||||||
"buildSegments": {
|
|
||||||
"5m": {
|
|
||||||
"processed": 3.392158326408501,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
},
|
|
||||||
"15m": {
|
|
||||||
"processed": 1.736165476881023,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
},
|
|
||||||
"1m": {
|
|
||||||
"processed": 4.206417693750045,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"totals": {
|
|
||||||
"buildSegments": {
|
|
||||||
"processed": 1994,
|
|
||||||
"processedWithError": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"unparseable": 0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"errorMsg": null
|
|
||||||
},
|
|
||||||
"type": "ingestionStatsAndErrors"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
A description of the fields:
|
|
||||||
|
|
||||||
The `ingestionStatsAndErrors` report provides information about row counts and errors.
|
|
||||||
|
|
||||||
The `ingestionState` shows what step of ingestion the task reached. Possible states include:
|
|
||||||
* `NOT_STARTED`: The task has not begun reading any rows
|
|
||||||
* `DETERMINE_PARTITIONS`: The task is processing rows to determine partitioning
|
|
||||||
* `BUILD_SEGMENTS`: The task is processing rows to construct segments
|
|
||||||
* `COMPLETED`: The task has finished its work.
|
|
||||||
|
|
||||||
Only batch tasks have the DETERMINE_PARTITIONS phase. Realtime tasks such as those created by the Kafka Indexing Service do not have a DETERMINE_PARTITIONS phase.
|
|
||||||
|
|
||||||
`unparseableEvents` contains lists of exception messages that were caused by unparseable inputs. This can help with identifying problematic input rows. There will be one list each for the DETERMINE_PARTITIONS and BUILD_SEGMENTS phases. Note that the Hadoop batch task does not support saving of unparseable events.
|
|
||||||
|
|
||||||
The `rowStats` map contains information about row counts. There is one entry for each ingestion phase. The definitions of the different row counts are shown below:
|
|
||||||
* `processed`: Number of rows successfully ingested without parsing errors
|
|
||||||
* `processedWithError`: Number of rows that were ingested, but contained a parsing error within one or more columns. This typically occurs where input rows have a parseable structure but invalid types for columns, such as passing in a non-numeric String value for a numeric column.
|
|
||||||
* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](index.md#transformspec), but doesn't include the rows skipped by explicit user configurations. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted.
|
|
||||||
* `unparseable`: Number of rows that could not be parsed at all and were discarded. This tracks input rows without a parseable structure, such as passing in non-JSON data when using a JSON parser.
|
|
||||||
|
|
||||||
The `errorMsg` field shows a message describing the error that caused a task to fail. It will be null if the task was successful.
|
|
||||||
|
|
||||||
## Live reports

### Row stats

The non-parallel [simple native batch task](../ingestion/native-batch.md#simple-task), the Hadoop batch task, and Kafka and Kinesis ingestion tasks support retrieval of row stats while the task is running.

The live report can be accessed with a GET to the following URL on a Peon running a task:

```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/rowStats
```

An example report is shown below. The `movingAverages` section contains 1 minute, 5 minute, and 15 minute moving averages of increases to the four row counters, which have the same definitions as those in the completion report. The `totals` section shows the current totals.

```
{
  "movingAverages": {
    "buildSegments": {
      "5m": {
        "processed": 3.392158326408501,
        "unparseable": 0,
        "thrownAway": 0,
        "processedWithError": 0
      },
      "15m": {
        "processed": 1.736165476881023,
        "unparseable": 0,
        "thrownAway": 0,
        "processedWithError": 0
      },
      "1m": {
        "processed": 4.206417693750045,
        "unparseable": 0,
        "thrownAway": 0,
        "processedWithError": 0
      }
    }
  },
  "totals": {
    "buildSegments": {
      "processed": 1994,
      "processedWithError": 0,
      "thrownAway": 0,
      "unparseable": 0
    }
  }
}
```
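
For quick checks from a script, the live report can be fetched and reduced to just the running totals. The snippet below is a minimal sketch using Python's `requests` library; the host, port, and task id are placeholders to substitute for your own deployment.

```python
import requests

# Placeholders: substitute the MiddleManager host, worker port, and task id.
ROW_STATS_URL = (
    "http://{host}:{port}/druid/worker/v1/chat/{task_id}/rowStats"
    .format(host="middlemanager.example.com", port=8091, task_id="index_wikipedia_example")
)

def fetch_row_totals(url: str) -> dict:
    """Fetch the live row stats report and return its 'totals' section."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    report = response.json()
    return report.get("totals", {})

if __name__ == "__main__":
    for phase, counters in fetch_row_totals(ROW_STATS_URL).items():
        print(phase, counters)  # e.g. buildSegments {'processed': 1994, ...}
```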

For the Kafka Indexing Service, a GET to the following Overlord API will retrieve live row stat reports from each task being managed by the supervisor and provide a combined report.

```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/supervisor/<supervisor-id>/stats
```

### Unparseable events

Lists of recently-encountered unparseable events can be retrieved from a running task with a GET to the following Peon API:

```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparseableEvents
```

Note that this functionality is not supported by all task types. Currently, it is only supported by the non-parallel [native batch task](../ingestion/native-batch.md) (type `index`) and the tasks created by the Kafka and Kinesis indexing services.

<a name="locks"></a>
|
|
||||||
|
|
||||||
## Task lock system
|
|
||||||
|
|
||||||
This section explains the task locking system in Druid. Druid's locking system
|
|
||||||
and versioning system are tightly coupled with each other to guarantee the correctness of ingested data.
|
|
||||||
|
|
||||||
## "Overshadowing" between segments
|
|
||||||
|
|
||||||
You can run a task to overwrite existing data. The segments created by an overwriting task _overshadows_ existing segments.
|
|
||||||
Note that the overshadow relation holds only for the same time chunk and the same data source.
|
|
||||||
These overshadowed segments are not considered in query processing to filter out stale data.
|
|
||||||
|
|
||||||
Each segment has a _major_ version and a _minor_ version. The major version is
|
|
||||||
represented as a timestamp in the format of [`"yyyy-MM-dd'T'hh:mm:ss"`](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat)
|
|
||||||
while the minor version is an integer number. These major and minor versions
|
|
||||||
are used to determine the overshadow relation between segments as seen below.
|
|
||||||
|
|
||||||
A segment `s1` overshadows another `s2` if
|
|
||||||
|
|
||||||
- `s1` has a higher major version than `s2`, or
|
|
||||||
- `s1` has the same major version and a higher minor version than `s2`.
|
|
||||||
|
|
||||||
Here are some examples.
|
|
||||||
|
|
||||||
- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `0` overshadows
|
|
||||||
another of the major version of `2018-01-01T00:00:00.000Z` and the minor version of `1`.
|
|
||||||
- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `1` overshadows
|
|
||||||
another of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `0`.
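
To make the comparison rule concrete, here is a minimal sketch of the overshadow check in Python. The ordering on the `(major, minor)` pair is the point; the class and function names are illustrative and not Druid's internal API.

```python
from dataclasses import dataclass

@dataclass
class SegmentVersion:
    major: str   # timestamp string, e.g. "2019-01-01T00:00:00.000Z"
    minor: int   # integer minor version

def overshadows(s1: SegmentVersion, s2: SegmentVersion) -> bool:
    """s1 overshadows s2 when it has a higher major version, or the same
    major version and a higher minor version. Assumes both segments belong
    to the same time chunk and data source; lexicographic comparison of the
    major version works because the timestamps share one fixed format."""
    return (s1.major, s1.minor) > (s2.major, s2.minor)

# The two examples from the text:
assert overshadows(SegmentVersion("2019-01-01T00:00:00.000Z", 0),
                   SegmentVersion("2018-01-01T00:00:00.000Z", 1))
assert overshadows(SegmentVersion("2019-01-01T00:00:00.000Z", 1),
                   SegmentVersion("2019-01-01T00:00:00.000Z", 0))
```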

## Locking

If you are running two or more [druid tasks](./tasks.md) which generate segments for the same data source and the same time chunk, the generated segments could potentially overshadow each other, which could lead to incorrect query results.

To avoid this problem, tasks will attempt to get locks prior to creating any segment in Druid. There are two types of locks, i.e., _time chunk lock_ and _segment lock_.

When the time chunk lock is used, a task locks the entire time chunk of a data source where generated segments will be written. For example, suppose we have a task ingesting data into the time chunk of `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source. With the time chunk locking, this task will lock the entire time chunk of `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source before it creates any segments. As long as it holds the lock, any other tasks will be unable to create segments for the same time chunk of the same data source. The segments created with the time chunk locking have a _higher_ major version than existing segments. Their minor version is always `0`.

When the segment lock is used, a task locks individual segments instead of the entire time chunk. As a result, two or more tasks can create segments for the same time chunk of the same data source simultaneously if they are reading different segments. For example, a Kafka indexing task and a compaction task can always write segments into the same time chunk of the same data source simultaneously. This is because a Kafka indexing task always appends new segments, while a compaction task always overwrites existing segments. The segments created with the segment locking have the _same_ major version and a _higher_ minor version.

> The segment locking is still experimental. It could have unknown bugs which potentially lead to incorrect query results.

To enable segment locking, you may need to set `forceTimeChunkLock` to `false` in the [task context](#context). Once `forceTimeChunkLock` is unset, the task will choose a proper lock type to use automatically. Please note that segment lock is not always available. The most common use case where time chunk lock is enforced is when an overwriting task changes the segment granularity. Also, the segment locking is supported by only native indexing tasks and Kafka/Kinesis indexing tasks. Hadoop indexing tasks don't support it.

`forceTimeChunkLock` in the task context is only applied to individual tasks. If you want to unset it for all tasks, you would want to set `druid.indexer.tasklock.forceTimeChunkLock` to false in the [overlord configuration](../configuration/index.md#overlord-operations).

Lock requests can conflict with each other if two or more tasks try to get locks for overlapping time chunks of the same data source. Note that lock conflicts can happen between different lock types.

The behavior on lock conflicts depends on the [task priority](#lock-priority). If all tasks of conflicting lock requests have the same priority, then the task that requested first will get the lock. Other tasks will wait for that task to release the lock.

If a task of a lower priority requests a lock later than another of a higher priority, this task will also wait for the task of a higher priority to release the lock. If a task of a higher priority requests a lock later than another of a lower priority, then this task will _preempt_ the other task of a lower priority. The lock of the lower-prioritized task will be revoked and the higher-prioritized task will acquire a new lock.

This lock preemption can happen at any time while a task is running except when it is _publishing segments_ in a critical section. Its locks become preemptible again once publishing segments is finished.

Note that locks are shared by the tasks of the same groupId. For example, Kafka indexing tasks of the same supervisor have the same groupId and share all locks with each other.

<a name="priority"></a>

## Lock priority

Each task type has a different default lock priority. The table below shows the default priorities of different task types. The higher the number, the higher the priority.

|task type|default priority|
|---------|----------------|
|Realtime index task|75|
|Batch index task|50|
|Merge/Append/Compaction task|25|
|Other tasks|0|

You can override the task priority by setting your priority in the task context as below.

```json
"context" : {
  "priority" : 100
}
```

<a name="context"></a>

## Context parameters

The task context is used for various individual task configuration. The following parameters apply to all task types.

|property|default|description|
|--------|-------|-----------|
|`taskLockTimeout`|300000|task lock timeout in milliseconds. For more details, see [Locking](#locking).|
|`forceTimeChunkLock`|true|_Setting this to false is still experimental_<br/> Force to always use time chunk lock. If not set, each task automatically chooses a lock type to use. If it is set, it will override the `druid.indexer.tasklock.forceTimeChunkLock` [configuration for the overlord](../configuration/index.md#overlord-operations). See [Locking](#locking) for more details.|
|`priority`|Different based on task types. See [Priority](#priority).|Task priority|
|`useLineageBasedSegmentAllocation`|false in 0.21 or earlier, true in 0.22 or later|Enable the new lineage-based segment allocation protocol for the native Parallel task with dynamic partitioning. This option should be off during the replacing rolling upgrade from one of the Druid versions between 0.19 and 0.21 to Druid 0.22 or higher. Once the upgrade is done, it must be set to true to ensure data correctness.|

> When a task acquires a lock, it sends a request via HTTP and waits until it receives a response containing the lock acquisition result.
> As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of Overlords.

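To tie these context parameters together, here is a minimal sketch of how a task spec carrying a custom `context` could be submitted to the Overlord's task endpoint. The ingestion spec itself is elided, and the host, port, and spec contents are placeholders, not a complete working task.

```python
import requests

OVERLORD = "http://overlord.example.com:8090"  # placeholder host:port

# An (elided) task spec; only the context block is the point of this sketch.
task_spec = {
    "type": "index_parallel",
    "spec": {
        # ... dataSchema / ioConfig / tuningConfig go here ...
    },
    "context": {
        "priority": 100,            # override the default lock priority
        "taskLockTimeout": 300000,  # milliseconds
        "forceTimeChunkLock": True, # set to False to allow segment locks
    },
}

response = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=task_spec, timeout=30)
response.raise_for_status()
print(response.json())  # typically contains the assigned task id
```
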
|
|
||||||
## All task types
|
|
||||||
|
|
||||||
### `index`
|
|
||||||
|
|
||||||
See [Native batch ingestion (simple task)](native-batch.md#simple-task).
|
|
||||||
|
|
||||||
### `index_parallel`
|
|
||||||
|
|
||||||
See [Native batch ingestion (parallel task)](native-batch.md#parallel-task).
|
|
||||||
|
|
||||||
### `index_sub`
|
|
||||||
|
|
||||||
Submitted automatically, on your behalf, by an [`index_parallel`](#index_parallel) task.
|
|
||||||
|
|
||||||
### `index_hadoop`
|
|
||||||
|
|
||||||
See [Hadoop-based ingestion](hadoop.md).
|
|
||||||
|
|
||||||
### `index_kafka`
|
|
||||||
|
|
||||||
Submitted automatically, on your behalf, by a
|
|
||||||
[Kafka-based ingestion supervisor](../development/extensions-core/kafka-ingestion.md).
|
|
||||||
|
|
||||||
### `index_kinesis`
|
|
||||||
|
|
||||||
Submitted automatically, on your behalf, by a
|
|
||||||
[Kinesis-based ingestion supervisor](../development/extensions-core/kinesis-ingestion.md).
|
|
||||||
|
|
||||||
### `index_realtime`
|
|
||||||
|
|
||||||
Submitted automatically, on your behalf, by [Tranquility](tranquility.md).
|
|
||||||
|
|
||||||
### `compact`
|
|
||||||
|
|
||||||
Compaction tasks merge all segments of the given interval. See the documentation on
|
|
||||||
[compaction](compaction.md) for details.
|
|
||||||
|
|
||||||
### `kill`
|
|
||||||
|
|
||||||
Kill tasks delete all metadata about certain segments and remove them from deep storage.
See the documentation on [deleting data](../ingestion/data-management.md#delete) for details.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## 任务参考文档
|
|
||||||
|
|
||||||
任务在Druid中完成所有与 [摄取](ingestion.md) 相关的工作。
|
|
||||||
|
|
||||||
对于批量摄取,通常使用 [任务API](../operations/api-reference.md#Overlord) 直接将任务提交给Druid。对于流式摄取,任务通常由supervisor代为提交。
|
|
||||||
|
|
||||||
### 任务API
|
|
||||||
|
|
||||||
任务API主要在两个地方是可用的:
|
|
||||||
|
|
||||||
* [Overlord](../design/Overlord.md) 进程提供HTTP API接口来进行提交任务、取消任务、检查任务状态、查看任务日志与报告等。 查看 [任务API文档](../operations/api-reference.md) 可以看到完整列表
|
|
||||||
* Druid SQL包括了一个 [`sys.tasks`](../querying/druidsql.md#系统Schema) 表,保存了当前运行任务的信息。此表是只读的,其内容是可通过Overlord API获取的完整信息的一个有限子集。
|
|
||||||
|
|
||||||
### 任务报告
|
|
||||||
|
|
||||||
报告包含已完成任务和正在运行任务的信息,包括摄取的行数以及发生的任何解析异常。
|
|
||||||
|
|
||||||
[简单的本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和Kinesis摄取任务支持报告功能。
|
|
||||||
|
|
||||||
#### 任务结束报告
|
|
||||||
|
|
||||||
任务运行完成后,一个完整的报告可以在以下接口获取:
|
|
||||||
|
|
||||||
```json
|
|
||||||
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
|
|
||||||
```
|
|
||||||
|
|
||||||
一个示例输出如下:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"ingestionStatsAndErrors": {
|
|
||||||
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
|
|
||||||
"payload": {
|
|
||||||
"ingestionState": "COMPLETED",
|
|
||||||
"unparseableEvents": {},
|
|
||||||
"rowStats": {
|
|
||||||
"determinePartitions": {
|
|
||||||
"processed": 0,
|
|
||||||
"processedWithError": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"unparseable": 0
|
|
||||||
},
|
|
||||||
"buildSegments": {
|
|
||||||
"processed": 5390324,
|
|
||||||
"processedWithError": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"unparseable": 0
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"errorMsg": null
|
|
||||||
},
|
|
||||||
"type": "ingestionStatsAndErrors"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 任务运行报告
|
|
||||||
|
|
||||||
当一个任务正在运行时, 任务运行报告可以通过以下接口获得,包括摄取状态、未解析事件和过去1分钟、5分钟、15分钟内处理的平均事件数。
|
|
||||||
|
|
||||||
```json
|
|
||||||
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
|
|
||||||
```
|
|
||||||
和
|
|
||||||
```json
|
|
||||||
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/liveReports
|
|
||||||
```
|
|
||||||
|
|
||||||
一个示例输出如下:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"ingestionStatsAndErrors": {
|
|
||||||
"taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
|
|
||||||
"payload": {
|
|
||||||
"ingestionState": "RUNNING",
|
|
||||||
"unparseableEvents": {},
|
|
||||||
"rowStats": {
|
|
||||||
"movingAverages": {
|
|
||||||
"buildSegments": {
|
|
||||||
"5m": {
|
|
||||||
"processed": 3.392158326408501,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
},
|
|
||||||
"15m": {
|
|
||||||
"processed": 1.736165476881023,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
},
|
|
||||||
"1m": {
|
|
||||||
"processed": 4.206417693750045,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"totals": {
|
|
||||||
"buildSegments": {
|
|
||||||
"processed": 1994,
|
|
||||||
"processedWithError": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"unparseable": 0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"errorMsg": null
|
|
||||||
},
|
|
||||||
"type": "ingestionStatsAndErrors"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
字段的描述信息如下:
|
|
||||||
|
|
||||||
`ingestionStatsAndErrors` 提供了行数和错误数的信息
|
|
||||||
|
|
||||||
`ingestionState` 标识了摄取任务当前达到了哪一步,可能的取值包括:
|
|
||||||
* `NOT_STARTED`: 任务还没有读取任何行
|
|
||||||
* `DETERMINE_PARTITIONS`: 任务正在处理行来决定分区信息
|
|
||||||
* `BUILD_SEGMENTS`: 任务正在处理行来构建段
|
|
||||||
* `COMPLETED`: 任务已经完成
|
|
||||||
|
|
||||||
只有批处理任务具有 `DETERMINE_PARTITIONS` 阶段。实时任务(如由Kafka索引服务创建的任务)没有 `DETERMINE_PARTITIONS` 阶段。
|
|
||||||
|
|
||||||
`unparseableEvents` 包含由不可解析输入引起的异常消息列表。这有助于识别有问题的输入行。对于 `DETERMINE_PARTITIONS` 和 `BUILD_SEGMENTS` 阶段,每个阶段都有一个列表。请注意,Hadoop批处理任务不支持保存不可解析事件。
|
|
||||||
|
|
||||||
`rowStats` map包含有关行计数的信息。每个摄取阶段有一个条目。不同行计数的定义如下所示:
|
|
||||||
|
|
||||||
* `processed`: 成功摄入且没有报错的行数
|
|
||||||
* `processedWithError`: 摄取但在一列或多列中包含解析错误的行数。这通常发生在输入行具有可解析的结构但列的类型无效的情况下,例如为数值列传入非数值字符串值
|
|
||||||
* `thrownAway`: 跳过的行数。 这包括时间戳在摄取任务定义的时间间隔之外的行,以及使用 [`transformSpec`](ingestion.md#transformspec) 过滤掉的行,但不包括显式用户配置跳过的行。例如,CSV格式的 `skipHeaderRows` 或 `hasHeaderRow` 跳过的行不计算在内
|
|
||||||
* `unparseable`: 完全无法分析并被丢弃的行数。这将跟踪没有可解析结构的输入行,例如在使用JSON解析器时传入非JSON数据。
|
|
||||||
|
|
||||||
`errorMsg` 字段显示一条消息,描述导致任务失败的错误。如果任务成功,则为空
|
|
||||||
|
|
||||||
### 实时报告
|
|
||||||
#### 行统计
|
|
||||||
|
|
||||||
非并行的 [简单本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和kinesis摄取任务支持在任务运行时检索行统计信息。
|
|
||||||
|
|
||||||
可以通过运行任务的Peon上的以下URL访问实时报告:
|
|
||||||
|
|
||||||
```json
|
|
||||||
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/rowStats
|
|
||||||
```
|
|
||||||
|
|
||||||
示例报告如下所示。`movingAverages` 部分包含四行计数器的1分钟、5分钟和15分钟移动平均增量,其定义与结束报告中的定义相同。`totals` 部分显示当前总计。
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"movingAverages": {
|
|
||||||
"buildSegments": {
|
|
||||||
"5m": {
|
|
||||||
"processed": 3.392158326408501,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
},
|
|
||||||
"15m": {
|
|
||||||
"processed": 1.736165476881023,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
},
|
|
||||||
"1m": {
|
|
||||||
"processed": 4.206417693750045,
|
|
||||||
"unparseable": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"processedWithError": 0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"totals": {
|
|
||||||
"buildSegments": {
|
|
||||||
"processed": 1994,
|
|
||||||
"processedWithError": 0,
|
|
||||||
"thrownAway": 0,
|
|
||||||
"unparseable": 0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
对于Kafka索引服务,向Overlord API发送一个GET请求,将从supervisor管理的每个任务中检索实时行统计报告,并提供一个组合报告。
|
|
||||||
|
|
||||||
```json
|
|
||||||
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/supervisor/<supervisor-id>/stats
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 未解析的事件
|
|
||||||
|
|
||||||
可以对Peon API发起一次Get请求,从正在运行的任务中检索最近遇到的不可解析事件的列表:
|
|
||||||
|
|
||||||
```json
|
|
||||||
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparseableEvents
|
|
||||||
```
|
|
||||||
注意:并不是所有的任务类型支持该功能。 当前,该功能只支持非并行的 [本地批任务](native.md) (`index`类型) 和由Kafka、Kinesis索引服务创建的任务。
|
|
||||||
|
|
||||||
### 任务锁系统
|
|
||||||
|
|
||||||
本节介绍Druid中的任务锁定系统。Druid的锁定系统和版本控制系统是紧密耦合的,以保证接收数据的正确性。
|
|
||||||
|
|
||||||
### 段与段之间的"阴影"
|
|
||||||
|
|
||||||
可以运行任务覆盖现有数据。覆盖任务创建的段将*覆盖*现有段。请注意,覆盖关系只适用于**同一时间块和同一数据源**。在过滤过时数据的查询处理中,不考虑这些被遮盖的段。
|
|
||||||
|
|
||||||
每个段都有一个*主*版本和一个*次*版本。主版本表示为时间戳,格式为["yyyy-MM-dd'T'hh:MM:ss"](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html),次版本表示为整数。这些主版本和次版本用于确定段之间的阴影关系,如下所示。
|
|
||||||
|
|
||||||
在以下条件下,段 `s1` 将会覆盖另一个段 `s2`:
|
|
||||||
* `s1` 比 `s2` 有一个更高的主版本
|
|
||||||
* `s1` 和 `s2` 有相同的主版本,但是有更高的次版本
|
|
||||||
|
|
||||||
以下是一些示例:
|
|
||||||
* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段将覆盖另一个主版本为 `2018-01-01T00:00:00.000Z` 且次版本为 `1` 的段
|
|
||||||
* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `1` 的段将覆盖另一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段
|
|
||||||
|
|
||||||
### 锁
|
|
||||||
|
|
||||||
如果您正在运行两个或多个 [Druid任务](taskrefer.md),这些任务为同一数据源和同一时间块生成段,那么生成的段可能会相互覆盖,从而导致错误的查询结果。
|
|
||||||
|
|
||||||
为了避免这个问题,任务将在Druid中创建任何段之前尝试获取锁, 有两种类型的锁,即 *时间块锁* 和 *段锁*。
|
|
||||||
|
|
||||||
使用时间块锁时,任务将锁定数据源中将要写入生成段的整个时间块。例如,假设我们有一个任务将数据摄取到 `wikipedia` 数据源的时间块 `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` 中。使用时间块锁,此任务将在创建任何段之前锁定 `wikipedia` 数据源的 `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` 整个时间块。只要它持有锁,任何其他任务都将无法为同一数据源的同一时间块创建段。使用时间块锁创建的段的主版本*高于*现有段,它们的次版本总是 `0`。
|
|
||||||
|
|
||||||
使用段锁时,任务锁定单个段而不是整个时间块。因此,如果两个或多个任务正在读取不同的段,则它们可以同时为同一数据源的同一时间块创建段。例如,Kafka索引任务和压缩合并任务总是可以同时将段写入同一数据源的同一时间块中。原因是,Kafka索引任务总是附加新段,而压缩合并任务总是覆盖现有段。使用段锁创建的段具有*相同的*主版本和*较高的*次版本。
|
|
||||||
|
|
||||||
> [!WARNING]
|
|
||||||
> 段锁仍然是实验性的。它可能有未知的错误,这可能会导致错误的查询结果。
|
|
||||||
|
|
||||||
要启用段锁定,可能需要在 [task context(任务上下文)](#上下文参数) 中将 `forceTimeChunkLock` 设置为 `false`。一旦 `forceTimeChunkLock` 被取消设置,任务将自动选择正确的锁类型。**请注意**,段锁并不总是可用的。使用时间块锁的最常见场景是当覆盖任务更改段粒度时。此外,只有本地索引任务和Kafka/kinesis索引任务支持段锁。Hadoop索引任务和索引实时(`index_realtime`)任务(被 [Tranquility](tranquility.md)使用)还不支持它。
|
|
||||||
|
|
||||||
任务上下文中的 `forceTimeChunkLock` 仅应用于单个任务。如果要为所有任务取消设置,则需要在 [Overlord配置](../configuration/human-readable-byte.md#overlord) 中设置 `druid.indexer.tasklock.forceTimeChunkLock` 为false。
|
|
||||||
|
|
||||||
如果两个或多个任务尝试为同一数据源的重叠时间块获取锁,则锁请求可能会相互冲突。**请注意,**锁冲突可能发生在不同的锁类型之间。
|
|
||||||
|
|
||||||
锁冲突的行为取决于 [任务优先级](#锁优先级)。如果冲突锁请求的所有任务具有相同的优先级,则首先请求的任务将获得锁, 其他任务将等待任务释放锁。
|
|
||||||
|
|
||||||
如果优先级较低的任务请求锁的时间晚于优先级较高的任务,则此任务还将等待优先级较高的任务释放锁。如果优先级较高的任务比优先级较低的任务请求锁的时间晚,则此任务将*抢占*优先级较低的另一个任务。优先级较低的任务的锁将被撤销,优先级较高的任务将获得一个新锁。
|
|
||||||
|
|
||||||
锁抢占可以在任务运行时随时发生,除非它在关键的*段发布阶段*。一旦发布段完成,它的锁将再次成为可抢占的。
|
|
||||||
|
|
||||||
**请注意**,锁由同一groupId的任务共享。例如,同一supervisor的Kafka索引任务具有相同的groupId,并且彼此共享所有锁。
|
|
||||||
|
|
||||||
### 锁优先级
|
|
||||||
|
|
||||||
每个任务类型都有不同的默认锁优先级。下表显示了不同任务类型的默认优先级。数字越高,优先级越高。
|
|
||||||
|
|
||||||
| 任务类型 | 默认优先级 |
|
|
||||||
|-|-|
|
|
||||||
| 实时索引任务 | 75 |
|
|
||||||
| 批量索引任务 | 50 |
|
|
||||||
| 合并/追加/压缩任务 | 25 |
|
|
||||||
| 其他任务 | 0 |
|
|
||||||
|
|
||||||
通过在任务上下文中设置优先级,可以覆盖任务优先级,如下所示。
|
|
||||||
|
|
||||||
```json
|
|
||||||
"context" : {
|
|
||||||
"priority" : 100
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 上下文参数
|
|
||||||
|
|
||||||
任务上下文用于各种单独的任务配置。以下参数适用于所有任务类型。
|
|
||||||
|
|
||||||
| 属性 | 默认值 | 描述 |
|
|
||||||
|-|-|-|
|
|
||||||
| `taskLockTimeout` | 300000 | 任务锁定超时(毫秒)。更多详细信息,可以查看 [锁](#锁) 部分 |
|
|
||||||
| `forceTimeChunkLock` | true | *将此设置为false仍然是实验性的*。强制始终使用时间块锁。如果未设置,则每个任务都会自动选择要使用的锁类型。如果设置了,它将覆盖 [Overlord配置](../Configuration/configuration.md#overlord) 的 `druid.indexer.tasklock.forceTimeChunkLock` 配置。有关详细信息,可以查看 [锁](#锁) 部分。|
|
|
||||||
| `priority` | 不同任务类型是不同的。 参见 [锁优先级](#锁优先级) | 任务优先级 |
|
|
||||||
|
|
||||||
> [!WARNING]
|
|
||||||
> 当任务获取锁时,它通过HTTP发送请求并等待,直到它收到包含锁获取结果的响应为止。因此,如果 `taskLockTimeout` 大于 Overlord的`druid.server.http.maxIdleTime` 将会产生HTTP超时错误。
|
|
||||||
|
|
||||||
### 所有任务类型
|
|
||||||
#### `index`
|
|
||||||
|
|
||||||
参见 [本地批量摄取(简单任务)](native.md#简单任务)
|
|
||||||
|
|
||||||
#### `index_parallel`
|
|
||||||
|
|
||||||
参见 [本地批量摄取(并行任务)](native.md#并行任务)
|
|
||||||
|
|
||||||
#### `index_sub`
|
|
||||||
|
|
||||||
由 [`index_parallel`](#index_parallel) 代表您自动提交的任务。
|
|
||||||
|
|
||||||
#### `index_hadoop`
|
|
||||||
|
|
||||||
参见 [基于Hadoop的摄取](hadoop.md)
|
|
||||||
|
|
||||||
#### `index_kafka`
|
|
||||||
|
|
||||||
由 [`Kafka摄取supervisor`](kafka.md) 代表您自动提交的任务。
|
|
||||||
|
|
||||||
#### `index_kinesis`
|
|
||||||
|
|
||||||
由 [`Kinesis摄取supervisor`](kinesis.md) 代表您自动提交的任务。
|
|
||||||
|
|
||||||
#### `index_realtime`
|
|
||||||
|
|
||||||
由 [`Tranquility`](tranquility.md) 代表您自动提交的任务。
|
|
||||||
|
|
||||||
#### `compact`
|
|
||||||
|
|
||||||
压缩任务合并给定间隔的所有段。有关详细信息,请参见有关 [压缩](data-management.md#压缩与重新索引) 的文档。
|
|
||||||
|
|
||||||
#### `kill`
|
|
||||||
|
|
||||||
Kill tasks删除有关某些段的所有元数据,并将其从深层存储中删除。有关详细信息,请参阅有关 [删除数据](data-management.md#删除数据) 的文档。
|
|
||||||
|
|
||||||
#### `append`
|
|
||||||
|
|
||||||
附加任务将段列表附加到单个段中(一个接一个)。语法是:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type": "append",
|
|
||||||
"id": <task_id>,
|
|
||||||
"dataSource": <task_datasource>,
|
|
||||||
"segments": <JSON list of DataSegment objects to append>,
|
|
||||||
"aggregations": <optional list of aggregators>,
|
|
||||||
"context": <task context>
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### `merge`
|
|
||||||
|
|
||||||
合并任务将段列表合并在一起,并对公共时间戳进行上卷(rollup)。如果在摄取过程中禁用了rollup,则不会合并公共时间戳,而是按时间戳对行重新排序。
|
|
||||||
|
|
||||||
> [!WARNING]
|
|
||||||
> [`compact`](#compact) 任务通常是比 `merge` 任务更好的选择。
|
|
||||||
|
|
||||||
语法是:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type": "merge",
|
|
||||||
"id": <task_id>,
|
|
||||||
"dataSource": <task_datasource>,
|
|
||||||
"aggregations": <list of aggregators>,
|
|
||||||
"rollup": <whether or not to rollup data during a merge>,
|
|
||||||
"segments": <JSON list of DataSegment objects to merge>,
|
|
||||||
"context": <task context>
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### `same_interval_merge`
|
|
||||||
|
|
||||||
同一间隔合并任务是合并任务的快捷方式,间隔中的所有段都将被合并。
|
|
||||||
|
|
||||||
> [!WARNING]
|
|
||||||
> [`compact`](#compact) 任务通常是比 `same_interval_merge` 任务更好的选择。
|
|
||||||
|
|
||||||
语法是:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type": "same_interval_merge",
|
|
||||||
"id": <task_id>,
|
|
||||||
"dataSource": <task_datasource>,
|
|
||||||
"aggregations": <list of aggregators>,
|
|
||||||
"rollup": <whether or not to rollup data during a merge>,
|
|
||||||
"interval": <DataSegment objects in this interval are going to be merged>,
|
|
||||||
"context": <task context>
|
|
||||||
}
|
|
||||||
```
|
|
@ -1,36 +0,0 @@
|
|||||||
---
|
|
||||||
id: tranquility
|
|
||||||
title: "Tranquility"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
[Tranquility](https://github.com/druid-io/tranquility/) is a separately distributed package for pushing
|
|
||||||
streams to Druid in real-time.
|
|
||||||
|
|
||||||
Tranquility has not been built against a version of Druid later than Druid 0.9.2
|
|
||||||
release. It may still work with the latest Druid servers, but not all features and functionality will be available
|
|
||||||
due to limitations of older Druid APIs on the Tranquility side.
|
|
||||||
|
|
||||||
For new projects that require streaming ingestion, we recommend using Druid's native support for
|
|
||||||
[Apache Kafka](../development/extensions-core/kafka-ingestion.md) or
|
|
||||||
[Amazon Kinesis](../development/extensions-core/kinesis-ingestion.md).
|
|
||||||
|
|
||||||
For more details, check out the [Tranquility GitHub page](https://github.com/druid-io/tranquility/).
|
|
1
operations/DeepstorageMigration.md
Normal file
1
operations/DeepstorageMigration.md
Normal file
@ -0,0 +1 @@
|
|||||||
|
<!-- toc -->
|
@ -1,37 +0,0 @@
|
|||||||
---
|
|
||||||
id: alerts
|
|
||||||
title: "Alerts"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
Druid generates alerts on getting into unexpected situations.
|
|
||||||
|
|
||||||
Alerts are emitted as JSON objects to a runtime log file or over HTTP (to a service such as Apache Kafka). Alert emission is disabled by default.
|
|
||||||
|
|
||||||
All Druid alerts share a common set of fields:
|
|
||||||
|
|
||||||
* `timestamp` - the time the alert was created
|
|
||||||
* `service` - the service name that emitted the alert
|
|
||||||
* `host` - the host name that emitted the alert
|
|
||||||
* `severity` - severity of the alert, e.g. anomaly, component-failure, service-failure, etc.
|
|
||||||
* `description` - a description of the alert
|
|
||||||
* `data` - if there was an exception then a JSON object with fields `exceptionType`, `exceptionMessage` and `exceptionStackTrace`
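
Putting those fields together, an emitted alert might look like the following. This is an illustrative example assembled from the field list above, not output captured from a real cluster.

```python
import json

# Illustrative only: an alert built from the common fields listed above.
example_alert = {
    "timestamp": "2023-01-01T00:00:00.000Z",
    "service": "druid/broker",
    "host": "broker.example.com:8082",
    "severity": "component-failure",
    "description": "Exception while processing query",
    "data": {
        "exceptionType": "java.lang.RuntimeException",
        "exceptionMessage": "example failure",
        "exceptionStackTrace": "...",
    },
}

print(json.dumps(example_alert, indent=2))
```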
|
|
@ -1,944 +0,0 @@
|
|||||||
---
|
|
||||||
id: api-reference
|
|
||||||
title: "API reference"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
This page documents all of the API endpoints for each Druid service type.
|
|
||||||
|
|
||||||
## Common
|
|
||||||
|
|
||||||
The following endpoints are supported by all processes.
|
|
||||||
|
|
||||||
### Process information
|
|
||||||
|
|
||||||
#### GET
|
|
||||||
|
|
||||||
* `/status`
|
|
||||||
|
|
||||||
Returns the Druid version, loaded extensions, memory used, total memory and other useful information about the process.
|
|
||||||
|
|
||||||
* `/status/health`
|
|
||||||
|
|
||||||
An endpoint that always returns a boolean "true" value with a 200 OK response, useful for automated health checks.
|
|
||||||
|
|
||||||
* `/status/properties`
|
|
||||||
|
|
||||||
Returns the current configuration properties of the process.
|
|
||||||
|
|
||||||
* `/status/selfDiscovered/status`
|
|
||||||
|
|
||||||
Returns a JSON map of the form `{"selfDiscovered": true/false}`, indicating whether the node has received a confirmation
|
|
||||||
from the central node discovery mechanism (currently ZooKeeper) of the Druid cluster that the node has been added to the
|
|
||||||
cluster. It is recommended to not consider a Druid node "healthy" or "ready" in automated deployment/container
|
|
||||||
management systems until it returns `{"selfDiscovered": true}` from this endpoint. This is because a node may be
|
|
||||||
isolated from the rest of the cluster due to network issues and it doesn't make sense to consider nodes "healthy" in
|
|
||||||
this case. Also, when nodes such as Brokers use ZooKeeper segment discovery for building their view of the Druid cluster
|
|
||||||
(as opposed to HTTP segment discovery), they may be unusable until the ZooKeeper client is fully initialized and starts
|
|
||||||
to receive data from the ZooKeeper cluster. `{"selfDiscovered": true}` is a proxy event indicating that the ZooKeeper
|
|
||||||
client on the node has started to receive data from the ZooKeeper cluster and it's expected that all segments and other
|
|
||||||
nodes will be discovered by this node timely from this point.
|
|
||||||
|
|
||||||
* `/status/selfDiscovered`
|
|
||||||
|
|
||||||
Similar to `/status/selfDiscovered/status`, but returns 200 OK response with empty body if the node has discovered itself
|
|
||||||
and 503 SERVICE UNAVAILABLE if the node hasn't discovered itself yet. This endpoint might be useful because some
|
|
||||||
monitoring checks such as AWS load balancer health checks are not able to look at the response body.
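
As a sketch of how these endpoints can be combined in a readiness probe, the snippet below first checks `/status/health` and then `/status/selfDiscovered/status`; the process host and port are placeholders.

```python
import requests

BASE = "http://historical.example.com:8083"  # placeholder process host:port

def is_ready(base_url: str) -> bool:
    """Consider the process ready only when it is healthy and self-discovered."""
    health = requests.get(f"{base_url}/status/health", timeout=5)
    if not (health.ok and health.json() is True):
        return False
    discovered = requests.get(f"{base_url}/status/selfDiscovered/status", timeout=5)
    return discovered.ok and discovered.json().get("selfDiscovered", False)

if __name__ == "__main__":
    print(is_ready(BASE))
```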
|
|
||||||
|
|
||||||
## Master Server
|
|
||||||
|
|
||||||
This section documents the API endpoints for the processes that reside on Master servers (Coordinators and Overlords)
|
|
||||||
in the suggested [three-server configuration](../design/processes.md#server-types).
|
|
||||||
|
|
||||||
### Coordinator
|
|
||||||
|
|
||||||
#### Leadership
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/leader`
|
|
||||||
|
|
||||||
Returns the current leader Coordinator of the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/isLeader`
|
|
||||||
|
|
||||||
Returns a JSON object with field "leader", either true or false, indicating if this server is the current leader
|
|
||||||
Coordinator of the cluster. In addition, returns HTTP 200 if the server is the current leader and HTTP 404 if not.
|
|
||||||
This is suitable for use as a load balancer status check if you only want the active leader to be considered in-service
|
|
||||||
at the load balancer.
|
|
||||||
|
|
||||||
|
|
||||||
<a name="coordinator-segment-loading"></a>
|
|
||||||
#### Segment Loading
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/loadstatus`
|
|
||||||
|
|
||||||
Returns the percentage of segments actually loaded in the cluster versus segments that should be loaded in the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/loadstatus?simple`
|
|
||||||
|
|
||||||
Returns the number of segments left to load until segments that should be loaded in the cluster are available for queries. This does not include segment replication counts.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/loadstatus?full`
|
|
||||||
|
|
||||||
Returns the number of segments left to load in each tier until segments that should be loaded in the cluster are all available. This includes segment replication counts.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/loadstatus?full&computeUsingClusterView`
|
|
||||||
|
|
||||||
Returns the number of segments not yet loaded for each tier until all segments loading in the cluster are available.
|
|
||||||
The result includes segment replication counts. It also factors in the number of available nodes that are of a service type that can load the segment when computing the number of segments remaining to load.
|
|
||||||
A segment is considered fully loaded when:
|
|
||||||
- Druid has replicated it the number of times configured in the corresponding load rule.
|
|
||||||
- Or the number of replicas for the segment in each tier where it is configured to be replicated equals the available nodes of a service type that are currently allowed to load the segment in the tier.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/loadqueue`
|
|
||||||
|
|
||||||
Returns the ids of segments to load and drop for each Historical process.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/loadqueue?simple`
|
|
||||||
|
|
||||||
Returns the number of segments to load and drop, as well as the total segment load and drop size in bytes for each Historical process.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/loadqueue?full`
|
|
||||||
|
|
||||||
Returns the serialized JSON of segments to load and drop for each Historical process.
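
As a small illustration of the segment loading endpoints above, the sketch below polls `/druid/coordinator/v1/loadstatus` until every datasource reports 100% of its segments loaded; the Coordinator address is a placeholder.

```python
import time
import requests

COORDINATOR = "http://coordinator.example.com:8081"  # placeholder host:port

def wait_until_loaded(poll_seconds: int = 10) -> None:
    """Block until the Coordinator reports 100% loaded segments for all datasources."""
    while True:
        status = requests.get(
            f"{COORDINATOR}/druid/coordinator/v1/loadstatus", timeout=10
        ).json()
        # The response maps datasource name -> percentage of segments loaded.
        if status and all(pct >= 100.0 for pct in status.values()):
            return
        print("still loading:", status)
        time.sleep(poll_seconds)
```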
|
|
||||||
|
|
||||||
|
|
||||||
#### Segment Loading by Datasource
|
|
||||||
|
|
||||||
Note that all _interval_ query parameters are ISO 8601 strings (e.g., 2016-06-27/2016-06-28).
|
|
||||||
Also note that these APIs only guarantee that the segments are available at the time of the call.
|
|
||||||
Segments can still become missing because of historical process failures or any other reasons afterward.
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus?forceMetadataRefresh={boolean}&interval={myInterval}`
|
|
||||||
|
|
||||||
Returns the percentage of segments actually loaded in the cluster versus segments that should be loaded in the cluster for the given
|
|
||||||
datasource over the given interval (or last 2 weeks if interval is not given). `forceMetadataRefresh` is required to be set.
|
|
||||||
Setting `forceMetadataRefresh` to true will force the coordinator to poll latest segment metadata from the metadata store
|
|
||||||
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms
|
|
||||||
of the load on the metadata store but can be necessary to make sure that we verify all the latest segments' load status)
|
|
||||||
Setting `forceMetadataRefresh` to false will use the metadata cached on the coordinator from the last force/periodic refresh.
|
|
||||||
If no used segments are found for the given inputs, this API returns `204 No Content`
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus?simple&forceMetadataRefresh={boolean}&interval={myInterval}`
|
|
||||||
|
|
||||||
Returns the number of segments left to load until segments that should be loaded in the cluster are available for the given datasource
|
|
||||||
over the given interval (or last 2 weeks if interval is not given). This does not include segment replication counts. `forceMetadataRefresh` is required to be set.
|
|
||||||
Setting `forceMetadataRefresh` to true will force the coordinator to poll latest segment metadata from the metadata store
|
|
||||||
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms
|
|
||||||
of the load on the metadata store but can be necessary to make sure that we verify all the latest segments' load status)
|
|
||||||
Setting `forceMetadataRefresh` to false will use the metadata cached on the coordinator from the last force/periodic refresh.
|
|
||||||
If no used segments are found for the given inputs, this API returns `204 No Content`
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus?full&forceMetadataRefresh={boolean}&interval={myInterval}`
|
|
||||||
|
|
||||||
Returns the number of segments left to load in each tier until segments that should be loaded in the cluster are all available for the given datasource
|
|
||||||
over the given interval (or last 2 weeks if interval is not given). This includes segment replication counts. `forceMetadataRefresh` is required to be set.
|
|
||||||
Setting `forceMetadataRefresh` to true will force the coordinator to poll latest segment metadata from the metadata store
|
|
||||||
(Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms
|
|
||||||
of the load on the metadata store but can be necessary to make sure that we verify all the latest segments' load status)
|
|
||||||
Setting `forceMetadataRefresh` to false will use the metadata cached on the coordinator from the last force/periodic refresh.
|
|
||||||
You can pass the optional query parameter `computeUsingClusterView` to factor in the available cluster services when calculating
|
|
||||||
the segments left to load. See [Coordinator Segment Loading](#coordinator-segment-loading) for details.
|
|
||||||
If no used segments are found for the given inputs, this API returns `204 No Content`
|
|
||||||
|
|
||||||
#### Metadata store information
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/segments`
|
|
||||||
|
|
||||||
Returns a list of all segments for each datasource enabled in the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/segments?datasources={dataSourceName1}&datasources={dataSourceName2}`
|
|
||||||
|
|
||||||
Returns a list of all segments for one or more specific datasources enabled in the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/segments?includeOvershadowedStatus`
|
|
||||||
|
|
||||||
Returns a list of all segments for each datasource with the full segment metadata and an extra field `overshadowed`.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/segments?includeOvershadowedStatus&datasources={dataSourceName1}&datasources={dataSourceName2}`
|
|
||||||
|
|
||||||
Returns a list of all segments for one or more specific datasources with the full segment metadata and an extra field `overshadowed`.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources`
|
|
||||||
|
|
||||||
Returns a list of the names of datasources with at least one used segment in the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources?includeUnused`
|
|
||||||
|
|
||||||
Returns a list of the names of datasources, regardless of whether there are used segments belonging to those datasources in the cluster or not.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources?includeDisabled`
|
|
||||||
|
|
||||||
Returns a list of the names of datasources, regardless of whether the datasource is disabled or not.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources?full`
|
|
||||||
|
|
||||||
Returns a list of all datasources with at least one used segment in the cluster. Returns all metadata about those datasources as stored in the metadata store.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}`
|
|
||||||
|
|
||||||
Returns full metadata for a datasource as stored in the metadata store.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments`
|
|
||||||
|
|
||||||
Returns a list of all segments for a datasource as stored in the metadata store.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full`
|
|
||||||
|
|
||||||
Returns a list of all segments for a datasource with the full segment metadata as stored in the metadata store.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments/{segmentId}`
|
|
||||||
|
|
||||||
Returns full segment metadata for a specific segment as stored in the metadata store, if the segment is used. If the
|
|
||||||
segment is unused, or is unknown, a 404 response is returned.
|
|
||||||
|
|
||||||
##### POST
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments`
|
|
||||||
|
|
||||||
Returns a list of all segments, overlapping with any of given intervals, for a datasource as stored in the metadata store. Request body is an array of ISO 8601 interval strings like [interval1, interval2,...] for example ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full`
|
|
||||||
|
|
||||||
Returns a list of all segments, overlapping with any of given intervals, for a datasource with the full segment metadata as stored in the metadata store. Request body is array of string ISO 8601 intervals like [interval1, interval2,...] for example ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]
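
For example, the interval-filtered segment listing could be requested like this; the datasource name, intervals, and Coordinator address are placeholders.

```python
import requests

COORDINATOR = "http://coordinator.example.com:8081"  # placeholder host:port
DATASOURCE = "wikipedia"                             # placeholder datasource

# Request body: an array of ISO 8601 interval strings, as described above.
intervals = [
    "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000",
    "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000",
]

response = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/metadata/datasources/{DATASOURCE}/segments?full",
    json=intervals,
    timeout=30,
)
response.raise_for_status()
for segment in response.json():
    print(segment["dataSource"], segment["interval"])
```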
|
|
||||||
|
|
||||||
<a name="coordinator-datasources"></a>
|
|
||||||
|
|
||||||
#### Datasources
|
|
||||||
|
|
||||||
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
|
|
||||||
(e.g., 2016-06-27_2016-06-28).
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources`
|
|
||||||
|
|
||||||
Returns a list of datasource names found in the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources?simple`
|
|
||||||
|
|
||||||
Returns a list of JSON objects containing the name and properties of datasources found in the cluster. Properties include segment count, total segment byte size, replicated total segment byte size, minTime, and maxTime.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources?full`
|
|
||||||
|
|
||||||
Returns a list of datasource names found in the cluster with all metadata about those datasources.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}`
|
|
||||||
|
|
||||||
Returns a JSON object containing the name and properties of a datasource. Properties include segment count, total segment byte size, replicated total segment byte size, minTime, and maxTime.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}?full`
|
|
||||||
|
|
||||||
Returns full metadata for a datasource.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals`
|
|
||||||
|
|
||||||
Returns a set of segment intervals.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals?simple`
|
|
||||||
|
|
||||||
Returns a map of an interval to a JSON object containing the total byte size of segments and number of segments for that interval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals?full`
|
|
||||||
|
|
||||||
Returns a map of an interval to a map of segment metadata to a set of server names that contain the segment for that interval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}`
|
|
||||||
|
|
||||||
Returns a set of segment ids for an interval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?simple`
|
|
||||||
|
|
||||||
Returns a map of segment intervals contained within the specified interval to a JSON object containing the total byte size of segments and number of segments for an interval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?full`
|
|
||||||
|
|
||||||
Returns a map of segment intervals contained within the specified interval to a map of segment metadata to a set of server names that contain the segment for an interval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}/serverview`
|
|
||||||
|
|
||||||
Returns a map of segment intervals contained within the specified interval to information about the servers that contain the segment for an interval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments`
|
|
||||||
|
|
||||||
Returns a list of all segments for a datasource in the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments?full`
|
|
||||||
|
|
||||||
Returns a list of all segments for a datasource in the cluster with the full segment metadata.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}`
|
|
||||||
|
|
||||||
Returns full segment metadata for a specific segment in the cluster.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/tiers`
|
|
||||||
|
|
||||||
Return the tiers that a datasource exists in.
|
|
||||||
|
|
||||||
#### Note for coordinator's POST and DELETE API's
|
|
||||||
The segments would be enabled when these APIs are called, but they can then be disabled again by the coordinator if any dropRule matches. Segments enabled by these APIs might not be loaded by historical processes if no loadRule matches. If an indexing or kill task runs at the same time as these APIs are invoked, the behavior is undefined. Some segments might be killed and others might be enabled. It's also possible that all segments might be disabled but at the same time, the indexing task is able to read data from those segments and succeed.
|
|
||||||
|
|
||||||
Caution: Avoid using indexing or kill tasks and these APIs at the same time for the same datasource and time chunk. (It's fine if the time chunks or datasources don't overlap.)
|
|
||||||
|
|
||||||
##### POST
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}`
|
|
||||||
|
|
||||||
Marks as used all segments belonging to a datasource. Returns a JSON object of the form
|
|
||||||
`{"numChangedSegments": <number>}` with the number of segments in the database whose state has been changed (that is,
|
|
||||||
the segments were marked as used) as the result of this API call.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}`
|
|
||||||
|
|
||||||
Marks as used a segment of a datasource. Returns a JSON object of the form `{"segmentStateChanged": <boolean>}` with
|
|
||||||
the boolean indicating if the state of the segment has been changed (that is, the segment was marked as used) as the
|
|
||||||
result of this API call.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/markUsed`
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/markUnused`
|
|
||||||
|
|
||||||
Marks segments (un)used for a datasource by interval or set of segment Ids.
|
|
||||||
|
|
||||||
When marking used only segments that are not overshadowed will be updated.
|
|
||||||
|
|
||||||
The request payload contains the interval or set of segment Ids to be marked unused.
|
|
||||||
Either interval or segment ids should be provided, if both or none are provided in the payload, the API would throw an error (400 BAD REQUEST).
|
|
||||||
|
|
||||||
Interval specifies the start and end times as ISO 8601 strings. `interval=(start/end)` where both start and end are inclusive, and only the segments completely contained within the specified interval will be disabled; partially overlapping segments will not be affected.
|
|
||||||
|
|
||||||
JSON Request Payload:
|
|
||||||
|
|
||||||
|Key|Description|Example|
|
|
||||||
|----------|-------------|---------|
|
|
||||||
|`interval`|The interval for which to mark segments unused|"2015-09-12T03:00:00.000Z/2015-09-12T05:00:00.000Z"|
|
|
||||||
|`segmentIds`|Set of segment Ids to be marked unused|["segmentId1", "segmentId2"]|
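
A minimal sketch of marking segments unused by interval follows; as noted above, the payload must contain either `interval` or `segmentIds`, never both. The datasource name and Coordinator address are placeholders.

```python
import requests

COORDINATOR = "http://coordinator.example.com:8081"  # placeholder host:port
DATASOURCE = "wikipedia"                             # placeholder datasource

# Either "interval" or "segmentIds" must be provided, not both.
payload = {"interval": "2015-09-12T03:00:00.000Z/2015-09-12T05:00:00.000Z"}

response = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}/markUnused",
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. {"numChangedSegments": 2}
```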
|
|
||||||
|
|
||||||
##### DELETE
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}`
|
|
||||||
|
|
||||||
Marks as unused all segments belonging to a datasource. Returns a JSON object of the form
|
|
||||||
`{"numChangedSegments": <number>}` with the number of segments in the database whose state has been changed (that is,
|
|
||||||
the segments were marked as unused) as the result of this API call.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}`
|
|
||||||
* `@Deprecated. /druid/coordinator/v1/datasources/{dataSourceName}?kill=true&interval={myInterval}`
|
|
||||||
|
|
||||||
Runs a [Kill task](../ingestion/tasks.md) for a given interval and datasource.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}`
|
|
||||||
|
|
||||||
Marks as unused a segment of a datasource. Returns a JSON object of the form `{"segmentStateChanged": <boolean>}` with
|
|
||||||
the boolean indicating if the state of the segment has been changed (that is, the segment was marked as unused) as the
|
|
||||||
result of this API call.
|
|
||||||
|
|
||||||
#### Retention Rules
|
|
||||||
|
|
||||||
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
|
|
||||||
(e.g., 2016-06-27_2016-06-28).
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules`
|
|
||||||
|
|
||||||
Returns all rules as JSON objects for all datasources in the cluster including the default datasource.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules/{dataSourceName}`
|
|
||||||
|
|
||||||
Returns all rules for a specified datasource.
|
|
||||||
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules/{dataSourceName}?full`
|
|
||||||
|
|
||||||
Returns all rules for a specified datasource and includes default datasource.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules/history?interval=<interval>`
|
|
||||||
|
|
||||||
Returns audit history of rules for all datasources. The default value of `interval` can be specified by setting `druid.audit.manager.auditHistoryMillis` (1 week if not configured) in Coordinator runtime.properties.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules/history?count=<n>`
|
|
||||||
|
|
||||||
Returns last <n> entries of audit history of rules for all datasources.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules/{dataSourceName}/history?interval=<interval>`
|
|
||||||
|
|
||||||
Returns audit history of rules for a specified datasource. The default value of `interval` can be specified by setting `druid.audit.manager.auditHistoryMillis` (1 week if not configured) in Coordinator runtime.properties.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules/{dataSourceName}/history?count=<n>`
|
|
||||||
|
|
||||||
Returns last <n> entries of audit history of rules for a specified datasource.
|
|
||||||
|
|
||||||
##### POST
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/rules/{dataSourceName}`
|
|
||||||
|
|
||||||
POST with a list of rules in JSON form to update rules.
|
|
||||||
|
|
||||||
Optional Header Parameters for auditing the config change can also be specified.
|
|
||||||
|
|
||||||
|Header Param Name| Description | Default |
|
|
||||||
|----------|-------------|---------|
|
|
||||||
|`X-Druid-Author`| author making the config change|""|
|
|
||||||
|`X-Druid-Comment`| comment describing the change being done|""|
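
For instance, a rule list could be posted together with the audit headers like this. The rule shown (a `loadForever` rule with two replicas in the default tier) is only an illustration, and the Coordinator address and datasource name are placeholders.

```python
import requests

COORDINATOR = "http://coordinator.example.com:8081"  # placeholder host:port
DATASOURCE = "wikipedia"                             # placeholder datasource

# Illustrative rule list: keep everything, replicated twice in the default tier.
rules = [
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 2}},
]

response = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
    json=rules,
    headers={
        "X-Druid-Author": "ops",
        "X-Druid-Comment": "keep wikipedia forever with 2 replicas",
    },
    timeout=30,
)
response.raise_for_status()
```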
|
|
||||||
|
|
||||||
#### Intervals
|
|
||||||
|
|
||||||
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
|
|
||||||
(e.g., 2016-06-27_2016-06-28).
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/intervals`
|
|
||||||
|
|
||||||
Returns all intervals for all datasources with total size and count.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/intervals/{interval}`
|
|
||||||
|
|
||||||
Returns aggregated total size and count for all intervals that intersect given isointerval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/intervals/{interval}?simple`
|
|
||||||
|
|
||||||
Returns total size and count for each interval within given isointerval.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/intervals/{interval}?full`
|
|
||||||
|
|
||||||
Returns total size and count for each datasource for each interval within given isointerval.
|
|
||||||
|
|
||||||
#### Dynamic configuration
|
|
||||||
|
|
||||||
See [Coordinator Dynamic Configuration](../configuration/index.md#dynamic-configuration) for details.
|
|
||||||
|
|
||||||
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
|
|
||||||
(e.g., 2016-06-27_2016-06-28).
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/config`
|
|
||||||
|
|
||||||
Retrieves current coordinator dynamic configuration.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/config/history?interval={interval}&count={count}`
|
|
||||||
|
|
||||||
Retrieves history of changes to overlord dynamic configuration. Accepts `interval` and `count` query string parameters
|
|
||||||
to filter by interval and limit the number of results respectively.
|
|
||||||
|
|
||||||
##### POST
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/config`
|
|
||||||
|
|
||||||
Update overlord dynamic worker configuration.
|
|
||||||
|
|
||||||
#### Compaction Status
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/compaction/progress?dataSource={dataSource}`
|
|
||||||
|
|
||||||
Returns the total size of segments awaiting compaction for the given dataSource.
|
|
||||||
This is only valid for dataSource which has compaction enabled.
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/compaction/status`
|
|
||||||
|
|
||||||
Returns the status and statistics from the latest auto compaction run of all dataSources which have/had auto compaction enabled.
|
|
||||||
The response payload includes a list of `latestStatus` objects. Each `latestStatus` represents the status for a dataSource (which has/had auto compaction enabled).
|
|
||||||
The `latestStatus` object has the following keys:
|
|
||||||
* `dataSource`: name of the datasource for this status information
|
|
||||||
* `scheduleStatus`: auto compaction scheduling status. Possible values are `NOT_ENABLED` and `RUNNING`. Returns `RUNNING` if the dataSource has an active auto compaction config submitted; otherwise, `NOT_ENABLED`
|
|
||||||
* `bytesAwaitingCompaction`: total bytes of this datasource waiting to be compacted by the auto compaction (only consider intervals/segments that are eligible for auto compaction)
|
|
||||||
* `bytesCompacted`: total bytes of this datasource that are already compacted with the spec set in the auto compaction config.
|
|
||||||
* `bytesSkipped`: total bytes of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.
|
|
||||||
* `segmentCountAwaitingCompaction`: total number of segments of this datasource waiting to be compacted by the auto compaction (only consider intervals/segments that are eligible for auto compaction)
|
|
||||||
* `segmentCountCompacted`: total number of segments of this datasource that are already compacted with the spec set in the auto compaction config.
|
|
||||||
* `segmentCountSkipped`: total number of segments of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.
|
|
||||||
* `intervalCountAwaitingCompaction`: total number of intervals of this datasource waiting to be compacted by the auto compaction (only consider intervals/segments that are eligible for auto compaction)
|
|
||||||
* `intervalCountCompacted`: total number of intervals of this datasource that are already compacted with the spec set in the auto compaction config.
|
|
||||||
* `intervalCountSkipped`: total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/compaction/status?dataSource={dataSource}`
|
|
||||||
|
|
||||||
Similar to the API `/druid/coordinator/v1/compaction/status` above but filters response to only return information for the {dataSource} given.
|
|
||||||
Note that {dataSource} given must have/had auto compaction enabled.
|
|
||||||
|
|
||||||
#### Compaction Configuration
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/config/compaction`
|
|
||||||
|
|
||||||
Returns all compaction configs.
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/config/compaction/{dataSource}`
|
|
||||||
|
|
||||||
Returns a compaction config of a dataSource.
|
|
||||||
|
|
||||||
##### POST

* `/druid/coordinator/v1/config/compaction/taskslots?ratio={someRatio}&max={someMaxSlots}`

Update the capacity for compaction tasks. `ratio` and `max` limit the maximum number of compaction tasks:
`ratio` is the maximum fraction of the total task slots that compaction tasks may use, and `max` is an absolute cap on the number of task slots for compaction tasks.
The actual max number of compaction tasks is `min(max, ratio * total task slots)`.
Note that `ratio` and `max` are optional and can be omitted. If they are omitted, default values (0.1 and unbounded)
will be set for them.

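
For example, the sketch below caps compaction at 30% of the cluster's task slots with an absolute maximum of 10 slots; the specific values and the `localhost:8081` Coordinator address are assumptions.

```bash
# Sketch: allow compaction to use at most min(10, 0.3 * total task slots) task slots.
curl -X POST 'http://localhost:8081/druid/coordinator/v1/config/compaction/taskslots?ratio=0.3&max=10'
```
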
* `/druid/coordinator/v1/config/compaction`

Creates or updates the compaction config for a dataSource.
See [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for configuration details.

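
A minimal sketch of submitting such a config for a hypothetical `wikipedia` datasource follows; the field values and the `localhost:8081` address are illustrative only, and the linked configuration reference describes the full set of options.

```bash
# Sketch: enable auto compaction for "wikipedia", skipping the most recent day of data.
curl -X POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
  -H 'Content-Type: application/json' \
  -d '{
        "dataSource": "wikipedia",
        "skipOffsetFromLatest": "P1D"
      }'
```
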
##### DELETE
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/config/compaction/{dataSource}`
|
|
||||||
|
|
||||||
Removes the compaction config for a dataSource.
|
|
||||||
|
|
||||||
#### Server information
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/coordinator/v1/servers`

Returns a list of server URLs using the format `{hostname}:{port}`. Note that
processes that run with different types will appear multiple times with different
ports.

* `/druid/coordinator/v1/servers?simple`

Returns a list of server data objects in which each object has the following keys:
* `host`: host URL in the form `{hostname}:{port}`
* `type`: process type (`indexer-executor`, `historical`)
* `currSize`: storage size currently used
* `maxSize`: maximum storage size
* `priority`
* `tier`

### Overlord
|
|
||||||
|
|
||||||
#### Leadership
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/leader`
|
|
||||||
|
|
||||||
Returns the current leader Overlord of the cluster. If you have multiple Overlords, just one is leading at any given time. The others are on standby.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/isLeader`

This returns a JSON object with field "leader", either true or false. In addition, this call returns HTTP 200 if the
server is the current leader and HTTP 404 if not. This is suitable for use as a load balancer status check if you
only want the active leader to be considered in-service at the load balancer.

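
For instance, a load balancer style probe could be sketched as follows; the `localhost:8090` Overlord address is an assumption.

```bash
# Sketch: succeeds (HTTP 200) only on the current leader Overlord.
curl -sf http://localhost:8090/druid/indexer/v1/isLeader && echo "this Overlord is the leader"
```
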
#### Tasks
|
|
||||||
|
|
||||||
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
|
|
||||||
(e.g., 2016-06-27_2016-06-28).
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/tasks`

Retrieve list of tasks. Accepts query string parameters `state`, `datasource`, `createdTimeInterval`, `max`, and `type`.

|Query Parameter |Description |
|---|---|
|`state`|filter list of tasks by task state, valid options are `running`, `complete`, `waiting`, and `pending`.|
| `datasource`| return tasks filtered by Druid datasource.|
| `createdTimeInterval`| return tasks created within the specified interval. |
| `max`| maximum number of `"complete"` tasks to return. Only applies when `state` is set to `"complete"`.|
| `type`| filter tasks by task type. See [task documentation](../ingestion/tasks.md) for more details.|

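
As an example, the following sketch lists up to 10 completed tasks for a hypothetical `wikipedia` datasource; the datasource name and the `localhost:8090` Overlord address are assumptions.

```bash
# Sketch: query the task list with state, datasource, and max filters.
curl -s 'http://localhost:8090/druid/indexer/v1/tasks?state=complete&datasource=wikipedia&max=10'
```
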
* `/druid/indexer/v1/completeTasks`
|
|
||||||
|
|
||||||
Retrieve list of complete tasks. Equivalent to `/druid/indexer/v1/tasks?state=complete`.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/runningTasks`
|
|
||||||
|
|
||||||
Retrieve list of running tasks. Equivalent to `/druid/indexer/v1/tasks?state=running`.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/waitingTasks`
|
|
||||||
|
|
||||||
Retrieve list of waiting tasks. Equivalent to `/druid/indexer/v1/tasks?state=waiting`.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/pendingTasks`
|
|
||||||
|
|
||||||
Retrieve list of pending tasks. Equivalent to `/druid/indexer/v1/tasks?state=pending`.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/task/{taskId}`
|
|
||||||
|
|
||||||
Retrieve the 'payload' of a task.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/task/{taskId}/status`
|
|
||||||
|
|
||||||
Retrieve the status of a task.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/task/{taskId}/segments`
|
|
||||||
|
|
||||||
Retrieve information about the segments of a task.
|
|
||||||
|
|
||||||
> This API is deprecated and will be removed in future releases.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/task/{taskId}/reports`
|
|
||||||
|
|
||||||
Retrieve a [task completion report](../ingestion/tasks.md#task-reports) for a task. Only works for completed tasks.
|
|
||||||
|
|
||||||
##### POST

* `/druid/indexer/v1/task`

Endpoint for submitting tasks and supervisor specs to the Overlord. Returns the taskId of the submitted task.

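
A hedged example of submitting a task spec kept in a local file; the file name `index-task.json` and the `localhost:8090` Overlord address are assumptions.

```bash
# Sketch: submit an ingestion task spec; the Overlord responds with the taskId.
curl -X POST -H 'Content-Type: application/json' \
  -d @index-task.json \
  http://localhost:8090/druid/indexer/v1/task
```
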
* `/druid/indexer/v1/task/{taskId}/shutdown`
|
|
||||||
|
|
||||||
Shuts down a task.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/datasources/{dataSource}/shutdownAllTasks`
|
|
||||||
|
|
||||||
Shuts down all tasks for a dataSource.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/taskStatus`

Retrieve list of task status objects for list of task id strings in request body.

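
For example (the task ids below are hypothetical, as is the `localhost:8090` address):

```bash
# Sketch: fetch status objects for several task ids in a single call.
curl -X POST -H 'Content-Type: application/json' \
  -d '["index_wikipedia_2023-01-01T00:00:00.000Z", "index_wikipedia_2023-01-02T00:00:00.000Z"]' \
  http://localhost:8090/druid/indexer/v1/taskStatus
```
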
##### DELETE
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/pendingSegments/{dataSource}`

Manually clean up pending segments table in metadata storage for `datasource`. Returns a JSON object response with
`numDeleted`, the count of rows deleted from the pending segments table. This API is used by the
`druid.coordinator.kill.pendingSegments.on` [coordinator setting](../configuration/index.md#coordinator-operation)
which automates this operation so that it runs periodically.

#### Supervisors
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor`
|
|
||||||
|
|
||||||
Returns a list of strings of the currently active supervisor ids.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor?full`
|
|
||||||
|
|
||||||
Returns a list of objects of the currently active supervisors.
|
|
||||||
|
|
||||||
|Field|Type|Description|
|
|
||||||
|---|---|---|
|
|
||||||
|`id`|String|supervisor unique identifier|
|
|
||||||
|`state`|String|basic state of the supervisor. Available states:`UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.md#operations) for details.|
|
|
||||||
|`detailedState`|String|supervisor specific state. (See documentation of specific supervisor for details), e.g. [Kafka](../development/extensions-core/kafka-ingestion.md) or [Kinesis](../development/extensions-core/kinesis-ingestion.md))|
|
|
||||||
|`healthy`|Boolean|true or false indicator of overall supervisor health|
|
|
||||||
|`spec`|SupervisorSpec|json specification of supervisor (See Supervisor Configuration for details)|
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor?state=true`
|
|
||||||
|
|
||||||
Returns a list of objects of the currently active supervisors and their current state.
|
|
||||||
|
|
||||||
|Field|Type|Description|
|
|
||||||
|---|---|---|
|
|
||||||
|`id`|String|supervisor unique identifier|
|
|
||||||
|`state`|String|basic state of the supervisor. Available states: `UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.md#operations) for details.|
|
|
||||||
|`detailedState`|String|supervisor specific state. (See documentation of the specific supervisor for details, e.g. [Kafka](../development/extensions-core/kafka-ingestion.md) or [Kinesis](../development/extensions-core/kinesis-ingestion.md))|
|
|
||||||
|`healthy`|Boolean|true or false indicator of overall supervisor health|
|
|
||||||
|`suspended`|Boolean|true or false indicator of whether the supervisor is in suspended state|
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/<supervisorId>`
|
|
||||||
|
|
||||||
Returns the current spec for the supervisor with the provided ID.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/<supervisorId>/status`
|
|
||||||
|
|
||||||
Returns the current status of the supervisor with the provided ID.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/history`
|
|
||||||
|
|
||||||
Returns an audit history of specs for all supervisors (current and past).
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/<supervisorId>/history`
|
|
||||||
|
|
||||||
Returns an audit history of specs for the supervisor with the provided ID.
|
|
||||||
|
|
||||||
##### POST
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor`

Create a new supervisor or update an existing one.

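
A minimal sketch, assuming the supervisor spec lives in a local file named `kafka-supervisor.json` and the Overlord listens on `localhost:8090`; see the Kafka/Kinesis ingestion docs for the spec format itself.

```bash
# Sketch: create or update a supervisor from a spec file.
curl -X POST -H 'Content-Type: application/json' \
  -d @kafka-supervisor.json \
  http://localhost:8090/druid/indexer/v1/supervisor
```
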
* `/druid/indexer/v1/supervisor/<supervisorId>/suspend`
|
|
||||||
|
|
||||||
Suspend the current running supervisor of the provided ID. Responds with updated SupervisorSpec.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/suspendAll`
|
|
||||||
|
|
||||||
Suspend all supervisors at once.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/<supervisorId>/resume`
|
|
||||||
|
|
||||||
Resume indexing tasks for a supervisor. Responds with updated SupervisorSpec.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/resumeAll`
|
|
||||||
|
|
||||||
Resume all supervisors at once.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/<supervisorId>/reset`
|
|
||||||
|
|
||||||
Reset the specified supervisor.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/<supervisorId>/terminate`
|
|
||||||
|
|
||||||
Terminate a supervisor of the provided ID.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/terminateAll`
|
|
||||||
|
|
||||||
Terminate all supervisors at once.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/supervisor/<supervisorId>/shutdown`
|
|
||||||
|
|
||||||
Shutdown a supervisor.
|
|
||||||
|
|
||||||
> This API is deprecated and will be removed in future releases.
|
|
||||||
> Please use the equivalent 'terminate' instead.
|
|
||||||
|
|
||||||
#### Dynamic configuration
|
|
||||||
|
|
||||||
See [Overlord Dynamic Configuration](../configuration/index.md#overlord-dynamic-configuration) for details.
|
|
||||||
|
|
||||||
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
|
|
||||||
(e.g., 2016-06-27_2016-06-28).
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/worker`
|
|
||||||
|
|
||||||
Retrieves current overlord dynamic configuration.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/worker/history?interval={interval}&count={count}`
|
|
||||||
|
|
||||||
Retrieves history of changes to overlord dynamic configuration. Accepts `interval` and `count` query string parameters
|
|
||||||
to filter by interval and limit the number of results respectively.
|
|
||||||
|
|
||||||
* `/druid/indexer/v1/workers`

Retrieves a list of all the worker nodes in the cluster along with their metadata.

* `/druid/indexer/v1/scaling`

Retrieves overlord scaling events if auto-scaling runners are in use.

##### POST

* `/druid/indexer/v1/worker`

Update overlord dynamic worker configuration.

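
As an illustration, the sketch below switches the worker selection strategy to `equalDistribution`; the chosen strategy and the `localhost:8090` address are assumptions, and the Overlord dynamic configuration reference documents the full schema.

```bash
# Sketch: update the Overlord dynamic worker configuration.
curl -X POST -H 'Content-Type: application/json' \
  -d '{"selectStrategy": {"type": "equalDistribution"}}' \
  http://localhost:8090/druid/indexer/v1/worker
```
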
## Data Server
|
|
||||||
|
|
||||||
This section documents the API endpoints for the processes that reside on Data servers (MiddleManagers/Peons and Historicals)
|
|
||||||
in the suggested [three-server configuration](../design/processes.md#server-types).
|
|
||||||
|
|
||||||
### MiddleManager
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/worker/v1/enabled`
|
|
||||||
|
|
||||||
Check whether a MiddleManager is in an enabled or disabled state. Returns JSON object keyed by the combined `druid.host`
|
|
||||||
and `druid.port` with the boolean state as the value.
|
|
||||||
|
|
||||||
```json
|
|
||||||
{"localhost:8091":true}
|
|
||||||
```
|
|
||||||
|
|
||||||
* `/druid/worker/v1/tasks`
|
|
||||||
|
|
||||||
Retrieve a list of active tasks being run on MiddleManager. Returns JSON list of taskid strings. Normal usage should
prefer to use the `/druid/indexer/v1/tasks` [Overlord API](#overlord) or one of its task-state-specific variants instead.
|
|
||||||
|
|
||||||
```json
|
|
||||||
["index_wikiticker_2019-02-11T02:20:15.316Z"]
|
|
||||||
```
|
|
||||||
|
|
||||||
* `/druid/worker/v1/task/{taskid}/log`
|
|
||||||
|
|
||||||
Retrieve task log output stream by task id. Normal usage should prefer to use the `/druid/indexer/v1/task/{taskId}/log`
|
|
||||||
[Overlord API](#overlord) instead.
|
|
||||||
|
|
||||||
##### POST
|
|
||||||
|
|
||||||
* `/druid/worker/v1/disable`

'Disable' a MiddleManager, causing it to stop accepting new tasks but complete all existing tasks. Returns JSON object
keyed by the combined `druid.host` and `druid.port`:

```json
{"localhost:8091":"disabled"}
```

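
A sketch of the typical rolling-restart flow, assuming the MiddleManager is reachable at `localhost:8091`:

```bash
# Sketch: drain a MiddleManager before maintenance, then bring it back.
curl -X POST http://localhost:8091/druid/worker/v1/disable
# ...wait for running tasks to finish and restart the process, then:
curl -X POST http://localhost:8091/druid/worker/v1/enable
```
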
* `/druid/worker/v1/enable`
|
|
||||||
|
|
||||||
'Enable' a MiddleManager, allowing it to accept new tasks again if it was previously disabled. Returns JSON object
|
|
||||||
keyed by the combined `druid.host` and `druid.port`:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{"localhost:8091":"enabled"}
|
|
||||||
```
|
|
||||||
|
|
||||||
* `/druid/worker/v1/task/{taskid}/shutdown`
|
|
||||||
|
|
||||||
Shutdown a running task by `taskid`. Normal usage should prefer to use the `/druid/indexer/v1/task/{taskId}/shutdown`
|
|
||||||
[Overlord API](#overlord) instead. Returns JSON:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{"task":"index_kafka_wikiticker_f7011f8ffba384b_fpeclode"}
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
### Peon
|
|
||||||
|
|
||||||
#### GET
|
|
||||||
|
|
||||||
* `/druid/worker/v1/chat/{taskId}/rowStats`
|
|
||||||
|
|
||||||
Retrieve a live row stats report from a Peon. See [task reports](../ingestion/tasks.md#task-reports) for more details.
|
|
||||||
|
|
||||||
* `/druid/worker/v1/chat/{taskId}/unparseableEvents`
|
|
||||||
|
|
||||||
Retrieve an unparseable events report from a Peon. See [task reports](../ingestion/tasks.md#task-reports) for more details.
|
|
||||||
|
|
||||||
### Historical
|
|
||||||
|
|
||||||
#### Segment Loading
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/historical/v1/loadstatus`
|
|
||||||
|
|
||||||
Returns JSON of the form `{"cacheInitialized":<value>}`, where value is either `true` or `false` indicating if all
|
|
||||||
segments in the local cache have been loaded. This can be used to know when a Historical process is ready
|
|
||||||
to be queried after a restart.
|
|
||||||
|
|
||||||
* `/druid/historical/v1/readiness`

Similar to `/druid/historical/v1/loadstatus`, but instead of returning JSON with a flag, responds with 200 OK if segments
in the local cache have been loaded, and 503 SERVICE UNAVAILABLE if they haven't.

## Query Server
|
|
||||||
|
|
||||||
This section documents the API endpoints for the processes that reside on Query servers (Brokers) in the suggested [three-server configuration](../design/processes.md#server-types).
|
|
||||||
|
|
||||||
### Broker
|
|
||||||
|
|
||||||
#### Datasource Information
|
|
||||||
|
|
||||||
Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/`
|
|
||||||
(e.g., 2016-06-27_2016-06-28).
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/v2/datasources`
|
|
||||||
|
|
||||||
Returns a list of queryable datasources.
|
|
||||||
|
|
||||||
* `/druid/v2/datasources/{dataSourceName}`
|
|
||||||
|
|
||||||
Returns the dimensions and metrics of the datasource. Optionally, you can provide request parameter "full" to get list of served intervals with dimensions and metrics being served for those intervals. You can also provide request param "interval" explicitly to refer to a particular interval.
|
|
||||||
|
|
||||||
If no interval is specified, a default interval spanning a configurable period before the current time will be used. The default duration of this interval is specified in ISO 8601 duration format via:

`druid.query.segmentMetadata.defaultHistory`

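
For example, a hedged sketch of inspecting a hypothetical `wikipedia` datasource over an explicit interval (Broker assumed at `localhost:8082`):

```bash
# Sketch: list dimensions and metrics served for the given interval.
curl -s 'http://localhost:8082/druid/v2/datasources/wikipedia?full&interval=2023-01-01_2023-02-01'
```
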
* `/druid/v2/datasources/{dataSourceName}/dimensions`
|
|
||||||
|
|
||||||
Returns the dimensions of the datasource.
|
|
||||||
|
|
||||||
> This API is deprecated and will be removed in future releases. Please use [SegmentMetadataQuery](../querying/segmentmetadataquery.md) instead
|
|
||||||
> which provides more comprehensive information and supports all dataSource types including streaming dataSources. It's also encouraged to use [INFORMATION_SCHEMA tables](../querying/sql.md#metadata-tables)
|
|
||||||
> if you're using SQL.
|
|
||||||
|
|
||||||
* `/druid/v2/datasources/{dataSourceName}/metrics`
|
|
||||||
|
|
||||||
Returns the metrics of the datasource.
|
|
||||||
|
|
||||||
> This API is deprecated and will be removed in future releases. Please use [SegmentMetadataQuery](../querying/segmentmetadataquery.md) instead
|
|
||||||
> which provides more comprehensive information and supports all dataSource types including streaming dataSources. It's also encouraged to use [INFORMATION_SCHEMA tables](../querying/sql.md#metadata-tables)
|
|
||||||
> if you're using SQL.
|
|
||||||
|
|
||||||
* `/druid/v2/datasources/{dataSourceName}/candidates?intervals={comma-separated-intervals}&numCandidates={numCandidates}`
|
|
||||||
|
|
||||||
Returns segment information lists including server locations for the given datasource and intervals. If "numCandidates" is not specified, it will return all servers for each interval.
|
|
||||||
|
|
||||||
#### Load Status
|
|
||||||
|
|
||||||
##### GET
|
|
||||||
|
|
||||||
* `/druid/broker/v1/loadstatus`
|
|
||||||
|
|
||||||
Returns a flag indicating if the Broker knows about all segments in the cluster. This can be used to know when a Broker process is ready to be queried after a restart.
|
|
||||||
|
|
||||||
* `/druid/broker/v1/readiness`

Similar to `/druid/broker/v1/loadstatus`, but instead of returning JSON, responds with 200 OK if it is ready and 503 SERVICE UNAVAILABLE otherwise.

#### Queries

##### POST

* `/druid/v2/`

The endpoint for submitting queries. Accepts an optional `?pretty` parameter that pretty-prints the results.

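
For instance, a minimal native query against a hypothetical `wikipedia` datasource could be submitted as sketched below; the datasource name, interval, and `localhost:8082` Broker address are assumptions.

```bash
# Sketch: submit a simple timeseries query counting rows per day.
curl -X POST 'http://localhost:8082/druid/v2/?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
        "queryType": "timeseries",
        "dataSource": "wikipedia",
        "granularity": "day",
        "intervals": ["2023-01-01/2023-02-01"],
        "aggregations": [{"type": "count", "name": "rows"}]
      }'
```
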
* `/druid/v2/candidates/`

Returns segment information lists including server locations for the given query.

### Router
|
|
||||||
|
|
||||||
#### GET
|
|
||||||
|
|
||||||
* `/druid/v2/datasources`
|
|
||||||
|
|
||||||
Returns a list of queryable datasources.
|
|
||||||
|
|
||||||
* `/druid/v2/datasources/{dataSourceName}`
|
|
||||||
|
|
||||||
Returns the dimensions and metrics of the datasource.
|
|
||||||
|
|
||||||
* `/druid/v2/datasources/{dataSourceName}/dimensions`
|
|
||||||
|
|
||||||
Returns the dimensions of the datasource.
|
|
||||||
|
|
||||||
* `/druid/v2/datasources/{dataSourceName}/metrics`
|
|
||||||
|
|
||||||
Returns the metrics of the datasource.
|
|
13
operations/api.md
Normal file
@@ -0,0 +1,13 @@
<!-- toc -->
### Common
### Master
#### Coordinator
#### Overlord
##### Supervisor
### Data
#### MiddleManager
#### Peon
#### Historical
### Query
#### Broker
#### Router

@@ -1,203 +0,0 @@
---
|
|
||||||
id: auth-ldap
|
|
||||||
title: "LDAP auth"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
This page describes how to set up Druid user authentication and authorization through LDAP. The first step is to enable LDAP authentication and authorization for Druid. You then map an LDAP group to roles and assign permissions to roles.
|
|
||||||
|
|
||||||
## Enable LDAP in Druid
|
|
||||||
|
|
||||||
Before starting, verify that the active directory is reachable from the Druid Master servers. Command line tools such as `ldapsearch` and `ldapwhoami`, which are included with OpenLDAP, are useful for this testing.
|
|
||||||
|
|
||||||
### Check the connection
|
|
||||||
|
|
||||||
First test that the basic connection and user credential works. For example, given a user `uuser1@example.com`, try:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ldapwhoami -vv -H ldap://<ip_address>:389 -D"uuser1@example.com" -W
|
|
||||||
```
|
|
||||||
|
|
||||||
Enter the password associated with the user when prompted and verify that the command succeeded. If it didn't, try the following troubleshooting steps:
|
|
||||||
|
|
||||||
* Verify that you've used the correct port for your LDAP instance. By default, the LDAP port is 389, but double-check with your LDAP admin if unable to connect.
* Check that a network firewall is not blocking connections to the LDAP port.
* Check whether LDAP clients need to be specifically whitelisted at the LDAP server to be able to reach it. If so, add the Druid Coordinator server to the AD whitelist.
|
|
||||||
|
|
||||||
|
|
||||||
### Check the search criteria
|
|
||||||
|
|
||||||
After verifying basic connectivity, check your search criteria. For example, the command for searching for user `uuser1@example.com` is as follows:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
ldapsearch -x -W -H ldap://<ldap_server> -D"uuser1@example.com" -b "dc=example,dc=com" "(sAMAccountName=uuser1)"
|
|
||||||
```
|
|
||||||
|
|
||||||
Note the `memberOf` attribute in the results; it shows the groups that the user belongs to. You will use this value to map the LDAP group to the Druid roles later. This attribute may be implemented differently on different types of LDAP servers. For instance, some LDAP servers may support recursive groupings, and some may not. Some LDAP server implementations may not have any object classes that contain this attribute altogether. If your LDAP server does not use the `memberOf` attribute, then Druid will not be able to determine a user's group membership using LDAP. The sAMAccountName attribute used in this example contains the authenticated user identity. This is an attribute of an object class specific to Microsoft Active Directory. The object classes and attribute used in your LDAP server may be different.
|
|
||||||
|
|
||||||
## Configure Druid user authentication with LDAP/Active Directory
|
|
||||||
|
|
||||||
1. Enable the `druid-basic-security` extension in the `common.runtime.properties` file. See [Security Overview](security-overview.md) for details.
|
|
||||||
2. As a best practice, create a user in LDAP to be used for internal communication with Druid.
|
|
||||||
3. In `common.runtime.properties`, update LDAP-related properties, as shown in the following listing:
|
|
||||||
```
|
|
||||||
druid.auth.authenticatorChain=["ldap"]
|
|
||||||
druid.auth.authenticator.ldap.type=basic
|
|
||||||
druid.auth.authenticator.ldap.enableCacheNotifications=true
|
|
||||||
druid.auth.authenticator.ldap.credentialsValidator.type=ldap
|
|
||||||
druid.auth.authenticator.ldap.credentialsValidator.url=ldap://<AD host>:<AD port>
|
|
||||||
druid.auth.authenticator.ldap.credentialsValidator.bindUser=<AD admin user, e.g.: Administrator@example.com>
|
|
||||||
druid.auth.authenticator.ldap.credentialsValidator.bindPassword=<AD admin password>
|
|
||||||
druid.auth.authenticator.ldap.credentialsValidator.baseDn=<base dn, e.g.: dc=example,dc=com>
|
|
||||||
druid.auth.authenticator.ldap.credentialsValidator.userSearch=<The LDAP search, e.g.: (&(sAMAccountName=%s)(objectClass=user))>
|
|
||||||
druid.auth.authenticator.ldap.credentialsValidator.userAttribute=sAMAccountName
|
|
||||||
druid.auth.authenticator.ldap.authorizerName=ldapauth
|
|
||||||
druid.escalator.type=basic
|
|
||||||
druid.escalator.internalClientUsername=<AD internal user, e.g.: internal@example.com>
|
|
||||||
druid.escalator.internalClientPassword=Welcome123
|
|
||||||
druid.escalator.authorizerName=ldapauth
|
|
||||||
druid.auth.authorizers=["ldapauth"]
|
|
||||||
druid.auth.authorizer.ldapauth.type=basic
|
|
||||||
druid.auth.authorizer.ldapauth.initialAdminUser=<AD user who acts as the initial admin user, e.g.: internal@example.com>
|
|
||||||
druid.auth.authorizer.ldapauth.initialAdminRole=admin
|
|
||||||
druid.auth.authorizer.ldapauth.roleProvider.type=ldap
|
|
||||||
```
|
|
||||||
|
|
||||||
Notice that the LDAP user created in the previous step, `internal@example.com`, serves as the internal client user and the initial admin user.
|
|
||||||
|
|
||||||
## Use LDAP groups to assign roles
|
|
||||||
|
|
||||||
You can map LDAP groups to a role in Druid. Members in the group get access to the permissions of the corresponding role.
|
|
||||||
|
|
||||||
|
|
||||||
### Step 1: Create a role
|
|
||||||
|
|
||||||
First create the role in Druid using the Druid REST API.
|
|
||||||
|
|
||||||
Creating a role involves submitting a POST request to the Coordinator process.
|
|
||||||
|
|
||||||
The following REST API calls create a role with read access to datasource, config, and state resources.
|
|
||||||
|
|
||||||
> As mentioned, the REST API calls need to address the Coordinator node. The examples used below use localhost as the Coordinator host and 8081 as the port. Adjust these settings according to your deployment.
|
|
||||||
|
|
||||||
Call the following API to create the role `readRole`.
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/roles/readRole
|
|
||||||
```
|
|
||||||
|
|
||||||
Check that the role has been created successfully by entering the following:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/roles
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
### Step 2: Add permissions to a role
|
|
||||||
|
|
||||||
You can now add one or more permissions to the role. The following example adds read-only access to a `wikipedia` data source.
|
|
||||||
|
|
||||||
Given the following JSON in a file named `perm.json`:
|
|
||||||
|
|
||||||
```
|
|
||||||
[{ "resource": { "name": "wikipedia", "type": "DATASOURCE" }, "action": "READ" }
|
|
||||||
,{ "resource": { "name": ".*", "type": "STATE" }, "action": "READ" },
|
|
||||||
{ "resource": {"name": ".*", "type": "CONFIG"}, "action": "READ"}]
|
|
||||||
```
|
|
||||||
|
|
||||||
The following command associates the permissions in the JSON file with the role:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X POST -d@perm.json http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/roles/readRole/permissions
|
|
||||||
```
|
|
||||||
|
|
||||||
Note that the STATE and CONFIG permissions in `perm.json` are needed to see the data source in the Druid console. If only querying permissions are needed, the READ action is sufficient:
|
|
||||||
|
|
||||||
```
|
|
||||||
[{ "resource": { "name": "wikipedia", "type": "DATASOURCE" }, "action": "READ" }]
|
|
||||||
```
|
|
||||||
|
|
||||||
You can also provide the name in the form of a regular expression. For example, to give access to all data sources starting with `wiki`, specify the name as `{ "name": "wiki.*", .....`.
|
|
||||||
|
|
||||||
|
|
||||||
### Step 3: Create a group mapping

The following shows an example of a group to role mapping. It assumes that a group named `group1` exists in the directory. Assume also the following role mapping in a file named `groupmap.json`:
|
|
||||||
|
|
||||||
```
|
|
||||||
{
|
|
||||||
"name": "group1map",
|
|
||||||
"groupPattern": "CN=group1,CN=Users,DC=example,DC=com",
|
|
||||||
"roles": [
|
|
||||||
"readRole"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
You can configure the mapping as follows:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X POST -d @groupmap.json http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1map
|
|
||||||
```
|
|
||||||
|
|
||||||
To check whether the group mapping was created successfully, run the following command:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings
|
|
||||||
```
|
|
||||||
|
|
||||||
To check the details of a specific group mapping, use the following:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X GET http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1map
|
|
||||||
```
|
|
||||||
|
|
||||||
To add additional roles to the group mapping, use the following API:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/groupMappings/group1/roles/<newrole>
|
|
||||||
```
|
|
||||||
|
|
||||||
In the next two steps, you create a user and assign previously created roles to it. These steps are only needed in the following cases:
|
|
||||||
|
|
||||||
- Your LDAP server does not support the `memberOf` attribute, or
|
|
||||||
- You want to configure a user with additional roles that are not mapped to the group(s) that the user is a member of
|
|
||||||
|
|
||||||
If this is not the case for your scenario, you can skip these steps.
|
|
||||||
|
|
||||||
### Step 4. Create a user
|
|
||||||
|
|
||||||
Once LDAP is enabled, only user passwords are verified with LDAP. You add the LDAP user to Druid as follows:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authentication/db/ldap/users/<AD user>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Step 5. Assign the role to the user
|
|
||||||
|
|
||||||
The following command shows how to assign a role to a user:
|
|
||||||
|
|
||||||
```
|
|
||||||
curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/users/<AD user>/roles/<rolename>
|
|
||||||
```
|
|
||||||
|
|
||||||
For more information about security and the basic security extension, see [Security Overview](security-overview.md).
|
|
@@ -1,471 +0,0 @@
---
|
|
||||||
id: basic-cluster-tuning
|
|
||||||
title: "Basic cluster tuning"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
This document provides basic guidelines for configuration properties and cluster architecture considerations related to performance tuning of an Apache Druid deployment.
|
|
||||||
|
|
||||||
Please note that this document provides general guidelines and rules-of-thumb: these are not absolute, universal rules for cluster tuning, and this introductory guide is not an exhaustive description of all Druid tuning properties, which are described in the [configuration reference](../configuration/index.md).
|
|
||||||
|
|
||||||
If you have questions on tuning Druid for specific use cases, or questions on configuration properties not covered in this guide, please ask the [Druid user mailing list or other community channels](https://druid.apache.org/community/).
|
|
||||||
|
|
||||||
## Process-specific guidelines
|
|
||||||
|
|
||||||
### Historical
|
|
||||||
|
|
||||||
#### Heap sizing
|
|
||||||
|
|
||||||
The biggest contributions to heap usage on Historicals are:
|
|
||||||
|
|
||||||
- Partial unmerged query results from segments
|
|
||||||
- The stored maps for [lookups](../querying/lookups.md).
|
|
||||||
|
|
||||||
A general rule-of-thumb for sizing the Historical heap is `(0.5GB * number of CPU cores)`, with an upper limit of ~24GB.
|
|
||||||
|
|
||||||
This rule-of-thumb scales using the number of CPU cores as a convenient proxy for hardware size and level of concurrency (note: this formula is not a hard rule for sizing Historical heaps).
|
|
||||||
|
|
||||||
Having a heap that is too large can result in excessively long GC collection pauses; the ~24GB upper limit is imposed to avoid this.
|
|
||||||
|
|
||||||
If caching is enabled on Historicals, the cache is stored on heap, sized by `druid.cache.sizeInBytes`.
|
|
||||||
|
|
||||||
Running out of heap on the Historicals can indicate misconfiguration or usage patterns that are overloading the cluster.
|
|
||||||
|
|
||||||
##### Lookups
|
|
||||||
|
|
||||||
If you are using lookups, calculate the total size of the lookup maps being loaded.
|
|
||||||
|
|
||||||
Druid performs an atomic swap when updating lookup maps (both the old map and the new map will exist in heap during the swap), so the maximum potential heap usage from lookup maps will be (2 * total size of all loaded lookups).
|
|
||||||
|
|
||||||
Be sure to add `(2 * total size of all loaded lookups)` to your heap size in addition to the `(0.5GB * number of CPU cores)` guideline.
|
|
||||||
|
|
||||||
#### Processing Threads and Buffers
|
|
||||||
|
|
||||||
Please see the [General Guidelines for Processing Threads and Buffers](#processing-threads-buffers) section for an overview of processing thread/buffer configuration.
|
|
||||||
|
|
||||||
On Historicals:
|
|
||||||
|
|
||||||
- `druid.processing.numThreads` should generally be set to `(number of cores - 1)`: a smaller value can result in CPU underutilization, while going over the number of cores can result in unnecessary CPU contention.
|
|
||||||
- `druid.processing.buffer.sizeBytes` can be set to 500MB.
|
|
||||||
- `druid.processing.numMergeBuffers`: a 1:4 ratio of merge buffers to processing threads is a reasonable choice for general use.
|
|
||||||
|
|
||||||
#### Direct Memory Sizing
|
|
||||||
|
|
||||||
The processing and merge buffers described above are direct memory buffers.
|
|
||||||
|
|
||||||
When a historical processes a query, it must open a set of segments for reading. This also requires some direct memory space, described in [segment decompression buffers](#segment-decompression).
|
|
||||||
|
|
||||||
A formula for estimating direct memory usage follows:

(`druid.processing.numThreads` + `druid.processing.numMergeBuffers` + 1) * `druid.processing.buffer.sizeBytes`

The `+ 1` factor is a fuzzy estimate meant to account for the segment decompression buffers.

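
As a worked example under assumed (not recommended) settings, a 16-core Historical with `druid.processing.numThreads=15`, `druid.processing.numMergeBuffers=4`, and a 500MB processing buffer would need roughly:

```bash
# Sketch: direct memory estimate for a hypothetical 16-core Historical.
# (15 processing threads + 4 merge buffers + 1 decompression allowance) * 500MB
echo $(( (15 + 4 + 1) * 500 ))   # => 10000 MB, i.e. roughly 10GB of direct memory
```
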
#### Connection pool sizing
|
|
||||||
|
|
||||||
Please see the [General Connection Pool Guidelines](#connection-pool) section for an overview of connection pool configuration.
|
|
||||||
|
|
||||||
For Historicals, `druid.server.http.numThreads` should be set to a value slightly higher than the sum of `druid.broker.http.numConnections` across all the Brokers in the cluster.
|
|
||||||
|
|
||||||
Tuning the cluster so that each Historical can accept 50 queries and 10 non-queries is a reasonable starting point.
|
|
||||||
|
|
||||||
#### Segment Cache Size
|
|
||||||
|
|
||||||
`druid.segmentCache.locations` specifies locations where segment data can be stored on the Historical. The sum of available disk space across these locations is set as the default value for property: `druid.server.maxSize`, which controls the total size of segment data that can be assigned by the Coordinator to a Historical.
|
|
||||||
|
|
||||||
Segments are memory-mapped by Historical processes using any available free system memory (i.e., memory not used by the Historical JVM and heap/direct memory buffers or other processes on the system). Segments that are not currently in memory will be paged from disk when queried.
|
|
||||||
|
|
||||||
Therefore, the size of cache locations set within `druid.segmentCache.locations` should be such that a Historical is not allocated an excessive amount of segment data. As the value of (`free system memory` / total size of all `druid.segmentCache.locations`) increases, a greater proportion of segments can be kept in memory, allowing for better query performance. The total segment data size assigned to a Historical can be overridden with `druid.server.maxSize`, but this is not required for most of the use cases.
|
|
||||||
|
|
||||||
#### Number of Historicals
|
|
||||||
|
|
||||||
The number of Historicals needed in a cluster depends on how much data the cluster has. For good performance, you will want enough Historicals such that each Historical has a good (`free system memory` / total size of all `druid.segmentCache.locations`) ratio, as described in the segment cache size section above.
|
|
||||||
|
|
||||||
Having a smaller number of big servers is generally better than having a large number of small servers, as long as you have enough fault tolerance for your use case.
|
|
||||||
|
|
||||||
#### SSD storage
|
|
||||||
|
|
||||||
We recommend using SSDs for storage on the Historicals, as they handle segment data stored on disk.
|
|
||||||
|
|
||||||
#### Total memory usage
|
|
||||||
|
|
||||||
To estimate total memory usage of the Historical under these guidelines:
|
|
||||||
|
|
||||||
- Heap: `(0.5GB * number of CPU cores) + (2 * total size of lookup maps) + druid.cache.sizeInBytes`
|
|
||||||
- Direct Memory: `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`
|
|
||||||
|
|
||||||
The Historical will use any available free system memory (i.e., memory not used by the Historical JVM and heap/direct memory buffers or other processes on the system) for memory-mapping of segments on disk. For better query performance, you will want to ensure a good (`free system memory` / total size of all `druid.segmentCache.locations`) ratio so that a greater proportion of segments can be kept in memory.
|
|
||||||
|
|
||||||
#### Segment sizes matter
|
|
||||||
|
|
||||||
Be sure to check out [segment size optimization](./segment-optimization.md) to help tune your Historical processes for maximum performance.
|
|
||||||
|
|
||||||
### Broker
|
|
||||||
|
|
||||||
#### Heap sizing
|
|
||||||
|
|
||||||
The biggest contributions to heap usage on Brokers are:
|
|
||||||
- Partial unmerged query results from Historicals and Tasks
|
|
||||||
- The segment timeline: this consists of location information (which Historical/Task is serving a segment) for all currently [available](../design/architecture.md#segment-lifecycle) segments.
|
|
||||||
- Cached segment metadata: this consists of metadata, such as per-segment schemas, for all currently available segments.
|
|
||||||
|
|
||||||
The Broker heap requirements scale based on the number of segments in the cluster, and the total data size of the segments.
|
|
||||||
|
|
||||||
The heap size will vary based on data size and usage patterns, but 4G to 8G is a good starting point for a small or medium cluster (~15 servers or less). For a rough estimate of memory requirements on the high end, very large clusters with a node count on the order of ~100 nodes may need Broker heaps of 30GB-60GB.
|
|
||||||
|
|
||||||
If caching is enabled on the Broker, the cache is stored on heap, sized by `druid.cache.sizeInBytes`.
|
|
||||||
|
|
||||||
#### Direct memory sizing
|
|
||||||
|
|
||||||
On the Broker, the amount of direct memory needed depends on how many merge buffers (used for merging GroupBys) are configured. The Broker does not generally need processing threads or processing buffers, as query results are merged on-heap in the HTTP connection threads instead.
|
|
||||||
|
|
||||||
- `druid.processing.buffer.sizeBytes` can be set to 500MB.
|
|
||||||
- `druid.processing.numThreads`: set this to 1 (the minimum allowed)
|
|
||||||
- `druid.processing.numMergeBuffers`: set this to the same value as on Historicals or a bit higher
|
|
||||||
|
|
||||||
#### Connection pool sizing
|
|
||||||
|
|
||||||
Please see the [General Connection Pool Guidelines](#connection-pool) section for an overview of connection pool configuration.
|
|
||||||
|
|
||||||
On the Brokers, please ensure that the sum of `druid.broker.http.numConnections` across all the Brokers is slightly lower than the value of `druid.server.http.numThreads` on your Historicals and Tasks.
|
|
||||||
|
|
||||||
`druid.server.http.numThreads` on the Broker should be set to a value slightly higher than `druid.broker.http.numConnections` on the same Broker.
|
|
||||||
|
|
||||||
Tuning the cluster so that each Historical can accept 50 queries and 10 non-queries, adjusting the Brokers accordingly, is a reasonable starting point.
|
|
||||||
|
|
||||||
#### Broker backpressure
|
|
||||||
|
|
||||||
When retrieving query results from Historical processes or Tasks, the Broker can optionally specify a maximum buffer size for queued, unread data, and exert backpressure on the channel to the Historical or Tasks when limit is reached (causing writes to the channel to block on the Historical/Task side until the Broker is able to drain some data from the channel).
|
|
||||||
|
|
||||||
This buffer size is controlled by the `druid.broker.http.maxQueuedBytes` setting.
|
|
||||||
|
|
||||||
The limit is divided across the number of Historicals/Tasks that a query would hit: for example, suppose `druid.broker.http.maxQueuedBytes` is set to 5MB and the Broker receives a query that needs to be fanned out to 2 Historicals. Each per-Historical channel would get a 2.5MB buffer in this case.
|
|
||||||
|
|
||||||
You can generally set this to a value of approximately `2MB * number of Historicals`. As your cluster scales up with more Historicals and Tasks, consider increasing this buffer size and increasing the Broker heap accordingly.
|
|
||||||
|
|
||||||
- If the buffer is too small, this can lead to inefficient queries due to the buffer filling up rapidly and stalling the channel
|
|
||||||
- If the buffer is too large, this puts more memory pressure on the Broker due to more queued result data in the HTTP channels.
|
|
||||||
|
|
||||||
#### Number of brokers
|
|
||||||
|
|
||||||
A 1:15 ratio of Brokers to Historicals is a reasonable starting point (this is not a hard rule).
|
|
||||||
|
|
||||||
If you need Broker HA, you can deploy 2 initially and then use the 1:15 ratio guideline for additional Brokers.
|
|
||||||
|
|
||||||
#### Total memory usage
|
|
||||||
|
|
||||||
To estimate total memory usage of the Broker under these guidelines:
|
|
||||||
|
|
||||||
- Heap: allocated heap size
|
|
||||||
- Direct Memory: `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`
|
|
||||||
|
|
||||||
### MiddleManager
|
|
||||||
|
|
||||||
The MiddleManager is a lightweight task controller/manager that launches Task processes, which perform ingestion work.
|
|
||||||
|
|
||||||
#### MiddleManager heap sizing
|
|
||||||
|
|
||||||
The MiddleManager itself does not require many resources; you can generally set its heap to ~128MB.
|
|
||||||
|
|
||||||
#### SSD storage
|
|
||||||
|
|
||||||
We recommend using SSDs for storage on the MiddleManagers, as the Tasks launched by MiddleManagers handle segment data stored on disk.
|
|
||||||
|
|
||||||
#### Task Count
|
|
||||||
|
|
||||||
The number of tasks a MiddleManager can launch is controlled by the `druid.worker.capacity` setting.
|
|
||||||
|
|
||||||
The number of workers needed in your cluster depends on how many concurrent ingestion tasks you need to run for your use cases. The number of workers that can be launched on a given machine depends on the size of resources allocated per worker and available system resources.
|
|
||||||
|
|
||||||
You can allocate more MiddleManager machines to your cluster to add task capacity.
|
|
||||||
|
|
||||||
#### Task configurations
|
|
||||||
|
|
||||||
The following section describes configuration for Tasks launched by the MiddleManager. Tasks can be queried and perform ingestion workloads, so they require more resources than the MiddleManager itself.
|
|
||||||
|
|
||||||
##### Task heap sizing
|
|
||||||
|
|
||||||
A 1GB heap is usually enough for Tasks.
|
|
||||||
|
|
||||||
###### Lookups
|
|
||||||
|
|
||||||
If you are using lookups, calculate the total size of the lookup maps being loaded.
|
|
||||||
|
|
||||||
Druid performs an atomic swap when updating lookup maps (both the old map and the new map will exist in heap during the swap), so the maximum potential heap usage from lookup maps will be (2 * total size of all loaded lookups).
|
|
||||||
|
|
||||||
Be sure to add `(2 * total size of all loaded lookups)` to your Task heap size if you are using lookups.
|
|
||||||
|
|
||||||
##### Task processing threads and buffers
|
|
||||||
|
|
||||||
For Tasks, 1 or 2 processing threads are often enough, as the Tasks tend to hold much less queryable data than Historical processes.
|
|
||||||
|
|
||||||
- `druid.indexer.fork.property.druid.processing.numThreads`: set this to 1 or 2
|
|
||||||
- `druid.indexer.fork.property.druid.processing.numMergeBuffers`: set this to 2
|
|
||||||
- `druid.indexer.fork.property.druid.processing.buffer.sizeBytes`: can be set to 100MB
|
|
||||||
|
|
||||||
##### Direct memory sizing
|
|
||||||
|
|
||||||
The processing and merge buffers described above are direct memory buffers.
|
|
||||||
|
|
||||||
When a Task processes a query, it must open a set of segments for reading. This also requires some direct memory space, described in [segment decompression buffers](#segment-decompression).
|
|
||||||
|
|
||||||
An ingestion Task also needs to merge partial ingestion results, which requires direct memory space, described in [segment merging](#segment-merging).
|
|
||||||
|
|
||||||
A formula for estimating direct memory usage follows:
|
|
||||||
|
|
||||||
(`druid.processing.numThreads` + `druid.processing.numMergeBuffers` + 1) * `druid.processing.buffer.sizeBytes`
|
|
||||||
|
|
||||||
The `+ 1` factor is a fuzzy estimate meant to account for the segment decompression buffers and dictionary merging buffers.
|
|
||||||
|
|
||||||
##### Connection pool sizing
|
|
||||||
|
|
||||||
Please see the [General Connection Pool Guidelines](#connection-pool) section for an overview of connection pool configuration.
|
|
||||||
|
|
||||||
For Tasks, `druid.server.http.numThreads` should be set to a value slightly higher than the sum of `druid.broker.http.numConnections` across all the Brokers in the cluster.
|
|
||||||
|
|
||||||
Tuning the cluster so that each Task can accept 50 queries and 10 non-queries is a reasonable starting point.
|
|
||||||
|
|
||||||
#### Total memory usage
|
|
||||||
|
|
||||||
To estimate total memory usage of a Task under these guidelines:
|
|
||||||
|
|
||||||
- Heap: `1GB + (2 * total size of lookup maps)`
|
|
||||||
- Direct Memory: `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`
|
|
||||||
|
|
||||||
The total memory usage of the MiddleManager + Tasks:

`MM heap size + druid.worker.capacity * (single task memory usage)`

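
For instance, under the per-Task sizing sketched above (1GB heap, 2 processing threads, 2 merge buffers, 100MB buffers) and a hypothetical `druid.worker.capacity=8`, a rough budget would be:

```bash
# Sketch: rough MiddleManager + Tasks memory budget (all numbers are illustrative).
# per-task direct memory: (2 threads + 2 merge buffers + 1) * 100MB = 500MB
# per-task total: 1GB heap + 0.5GB direct = 1.5GB
# MM heap (~128MB) + 8 task slots * 1.5GB:
awk 'BEGIN { print 0.128 + 8 * 1.5 }'   # => 12.128, i.e. roughly 12GB
```
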
##### Configuration guidelines for specific ingestion types
|
|
||||||
|
|
||||||
###### Kafka/Kinesis ingestion
|
|
||||||
|
|
||||||
If you use the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md) or [Kinesis Indexing Service](../development/extensions-core/kinesis-ingestion.md), the number of tasks required will depend on the number of partitions and your taskCount/replica settings.
|
|
||||||
|
|
||||||
On top of those requirements, allocating more task slots in your cluster is a good idea, so that you have free task
|
|
||||||
slots available for other tasks, such as [compaction tasks](../ingestion/compaction.md).
|
|
||||||
|
|
||||||
###### Hadoop ingestion
|
|
||||||
|
|
||||||
If you are only using [Hadoop-based batch ingestion](../ingestion/hadoop.md) with no other ingestion types, you can lower the amount of resources allocated per Task. Batch ingestion tasks do not need to answer queries, and the bulk of the ingestion workload will be executed on the Hadoop cluster, so the Tasks do not require many resources.
|
|
||||||
|
|
||||||
###### Parallel native ingestion
|
|
||||||
|
|
||||||
If you are using [parallel native batch ingestion](../ingestion/native-batch.md#parallel-task), allocating more available task slots is a good idea and will allow greater ingestion concurrency.
|
|
||||||
|
|
||||||
### Coordinator
|
|
||||||
|
|
||||||
The main performance-related setting on the Coordinator is the heap size.
|
|
||||||
|
|
||||||
The heap requirements of the Coordinator scale with the number of servers, segments, and tasks in the cluster.
|
|
||||||
|
|
||||||
You can set the Coordinator heap to the same size as your Broker heap, or slightly smaller: both services have to process cluster-wide state and answer API requests about this state.
|
|
||||||
|
|
||||||
#### Dynamic Configuration
|
|
||||||
|
|
||||||
`percentOfSegmentsToConsiderPerMove`
|
|
||||||
* The default value is 100. This means that the Coordinator will consider all segments when it is looking for a segment to move. The Coordinator makes a weighted choice, with segments on Servers with the least capacity being the most likely segments to be moved.
|
|
||||||
* This weighted selection strategy means that the segments on the servers who have the most available capacity are the least likely to be chosen.
|
|
||||||
* As the number of segments in the cluster increases, the probability of choosing the Nth segment to move decreases; where N is the last segment considered for moving.
|
|
||||||
* An admin can use this config to skip consideration of that Nth segment.
|
|
||||||
* Instead of skipping a precise amount of segments, we skip a percentage of segments in the cluster.
|
|
||||||
* For example, with the value set to 25, only the first 25% of segments will be considered as a segment that can be moved. This 25% of segments will come from the servers that have the least available capacity.
|
|
||||||
* In this example, each time the Coordinator looks for a segment to move, it will consider 75% less segments than it did when the configuration was 100. On clusters with hundreds of thousands of segments, this can add up to meaningful coordination time savings.
|
|
||||||
* General recommendations for this configuration:
|
|
||||||
* If you are not worried about the amount of time it takes your Coordinator to complete a full coordination cycle, you likely do not need to modify this config.
|
|
||||||
* If you are frustrated with how long the Coordinator takes to run a full coordination cycle, and you have set the Coordinator dynamic config `maxSegmentsToMove` to a value above 0 (the default is 5), setting this config to a non-default value can help shorten coordination time.
|
|
||||||
* The recommended starting point value is 66. It represents a meaningful decrease in the percentage of segments considered while also not being too aggressive (You will consider 1/3 fewer segments per move operation with this value).
|
|
||||||
* The impact that modifying this config will have on your coordination time will be a function of how low you set the config value, the value for `maxSegmentsToMove` and the total number of segments in your cluster.
|
|
||||||
* If your cluster has a relatively small number of segments, or you choose to move few segments per coordination cycle, there may not be much savings to be had here.
|
|
||||||
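As a sketch (assuming default ports, no authentication, and the standard Coordinator dynamic config endpoint), you could update these values by POSTing JSON to the Coordinator; the numbers below are illustrative, not recommendations:

```bash
curl -X POST -H 'Content-Type: application/json' \
  -d '{"maxSegmentsToMove": 100, "percentOfSegmentsToConsiderPerMove": 66}' \
  http://<COORDINATOR_IP>:<COORDINATOR_PORT>/druid/coordinator/v1/config
```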
|
|
||||||
### Overlord
|
|
||||||
|
|
||||||
The main performance-related setting on the Overlord is the heap size.
|
|
||||||
|
|
||||||
The heap requirements of the Overlord scale primarily with the number of running Tasks.
|
|
||||||
|
|
||||||
The Overlord tends to require less resources than the Coordinator or Broker. You can generally set the Overlord heap to a value that's 25-50% of your Coordinator heap.
|
|
||||||
|
|
||||||
### Router
|
|
||||||
|
|
||||||
The Router has light resource requirements, as it proxies requests to Brokers without performing much computational work itself.
|
|
||||||
|
|
||||||
You can assign it a 256MB heap as a starting point and grow it if needed.
|
|
||||||
|
|
||||||
<a name="processing-threads-buffers"></a>
|
|
||||||
|
|
||||||
## Guidelines for processing threads and buffers
|
|
||||||
|
|
||||||
### Processing threads
|
|
||||||
|
|
||||||
The `druid.processing.numThreads` configuration controls the size of the processing thread pool used for computing query results. The size of this pool limits how many queries can be concurrently processed.
|
|
||||||
|
|
||||||
### Processing buffers
|
|
||||||
|
|
||||||
`druid.processing.buffer.sizeBytes` is a closely related property that controls the size of the off-heap buffers allocated to the processing threads.
|
|
||||||
|
|
||||||
One buffer is allocated for each processing thread. A size between 500MB and 1GB is a reasonable choice for general use.
|
|
||||||
|
|
||||||
The TopN and GroupBy queries use these buffers to store intermediate computed results. As the buffer size increases, more data can be processed in a single pass.
|
|
||||||
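As a non-authoritative sketch for a Historical with, say, 16 cores (a number assumed here purely for illustration), the two properties might be set together in `runtime.properties` like this:

```
# Historical runtime.properties (illustrative values for a 16-core machine)
# One processing thread per core, minus one for overhead
druid.processing.numThreads=15
# 500MB off-heap buffer allocated per processing thread
druid.processing.buffer.sizeBytes=500000000
```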
|
|
||||||
### GroupBy merging buffers
|
|
||||||
|
|
||||||
If you plan to issue GroupBy V2 queries, `druid.processing.numMergeBuffers` is an important configuration property.
|
|
||||||
|
|
||||||
GroupBy V2 queries use an additional pool of off-heap buffers for merging query results. These buffers have the same size as the processing buffers described above, set by the `druid.processing.buffer.sizeBytes` property.
|
|
||||||
|
|
||||||
Non-nested GroupBy V2 queries require 1 merge buffer per query, while a nested GroupBy V2 query requires 2 merge buffers (regardless of the depth of nesting).
|
|
||||||
|
|
||||||
The number of merge buffers determines the number of GroupBy V2 queries that can be processed concurrently.
|
|
||||||
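Continuing the illustrative sketch above, adding merge buffers might look like the following; the value of 4 is an assumption, not a recommendation, and should reflect how many concurrent GroupBy V2 queries you expect:

```
# Allows up to 4 concurrent non-nested GroupBy V2 queries
# (or 2 concurrent nested GroupBy V2 queries, which need 2 buffers each)
druid.processing.numMergeBuffers=4
```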
|
|
||||||
<a name="connection-pool"></a>
|
|
||||||
|
|
||||||
## Connection pool guidelines
|
|
||||||
|
|
||||||
Each Druid process has a configuration property for the number of HTTP connection handling threads, `druid.server.http.numThreads`.
|
|
||||||
|
|
||||||
The number of HTTP server threads limits how many concurrent HTTP API requests a given process can handle.
|
|
||||||
|
|
||||||
### Sizing the connection pool for queries
|
|
||||||
|
|
||||||
The Broker has a setting `druid.broker.http.numConnections` that controls how many outgoing connections it can make to a given Historical or Task process.
|
|
||||||
|
|
||||||
These connections are used to send queries to the Historicals or Tasks, with one connection per query; the value of `druid.broker.http.numConnections` is effectively a limit on the number of concurrent queries that a given broker can process.
|
|
||||||
|
|
||||||
Suppose we have a cluster with 3 Brokers and `druid.broker.http.numConnections` is set to 10.
|
|
||||||
|
|
||||||
This means that each Broker in the cluster will open up to 10 connections to each individual Historical or Task (for a total of 30 incoming query connections per Historical/Task).
|
|
||||||
|
|
||||||
On the Historical/Task side, this means that `druid.server.http.numThreads` must be set to a value at least as high as the sum of `druid.broker.http.numConnections` across all the Brokers in the cluster.
|
|
||||||
|
|
||||||
In practice, you will want to allocate additional server threads for non-query API requests such as status checks; adding 10 threads for those is a good general guideline. Using the example with 3 Brokers in the cluster and `druid.broker.http.numConnections` set to 10, a value of 40 would be appropriate for `druid.server.http.numThreads` on Historicals and Tasks.
|
|
||||||
|
|
||||||
As a starting point, allowing for 50 concurrent queries (requests that read segment data from datasources) + 10 non-query requests (other requests like status checks) on Historicals and Tasks is reasonable (i.e., set `druid.server.http.numThreads` to 60 there), while sizing `druid.broker.http.numConnections` based on the number of Brokers in the cluster to fit within the 50 query connection limit per Historical/Task.
|
|
||||||
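A hedged sketch of the worked example above (3 Brokers, 10 connections each), expressed as configuration; the exact numbers are assumptions to be adapted to your cluster:

```
# Broker runtime.properties (each of the 3 Brokers)
druid.broker.http.numConnections=10

# Historical / Task runtime.properties
# 3 Brokers * 10 connections each + ~10 threads for non-query requests
druid.server.http.numThreads=40
```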
|
|
||||||
- If the connection pool across Brokers and Historicals/Tasks is too small, the cluster will be underutilized as there are too few concurrent query slots.
|
|
||||||
- If the connection pool is too large, you may get out-of-memory errors due to excessive concurrent load, and increased resource contention.
|
|
||||||
- The connection pool sizing matters most when you require QoS-type guarantees and use query priorities; otherwise, these settings can be more loosely configured.
|
|
||||||
- If your cluster usage patterns are heavily biased towards a high number of small concurrent queries (where each query takes less than ~15ms), enlarging the connection pool can be a good idea.
|
|
||||||
- The 50/10 general guideline here is a rough starting point, since different queries impose different amounts of load on the system. To size the connection pool more exactly for your cluster, you would need to know the execution times for your queries and ensure that the rate of incoming queries does not exceed your "drain" rate.
|
|
||||||
|
|
||||||
## Per-segment direct memory buffers
|
|
||||||
|
|
||||||
### Segment decompression
|
|
||||||
|
|
||||||
When opening a segment for reading during segment merging or query processing, Druid allocates a 64KB off-heap decompression buffer for each column being read.
|
|
||||||
|
|
||||||
Thus, there is additional direct memory overhead of (64KB * number of columns read per segment * number of segments read) when reading segments.
|
|
||||||
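For instance, with purely hypothetical numbers, a query that reads 16 columns from each of 1,000 segments would add roughly:

```
64KB * 16 columns * 1,000 segments = 1,024,000KB ≈ 1GB of direct memory
```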
|
|
||||||
### Segment merging
|
|
||||||
|
|
||||||
In addition to the segment decompression overhead described above, when a set of segments are merged during ingestion, a direct buffer is allocated for every String typed column, for every segment in the set to be merged.
|
|
||||||
|
|
||||||
The size of each buffer is equal to the cardinality of the String column within its segment, times 4 bytes (the buffers store integers).
|
|
||||||
|
|
||||||
For example, if two segments are being merged, the first segment having a single String column with cardinality 1000, and the second segment having a String column with cardinality 500, the merge step would allocate (1000 + 500) * 4 = 6000 bytes of direct memory.
|
|
||||||
|
|
||||||
These buffers are used for merging the value dictionaries of the String column across segments. These "dictionary merging buffers" are independent of the "merge buffers" configured by `druid.processing.numMergeBuffers`.
|
|
||||||
|
|
||||||
|
|
||||||
## General recommendations
|
|
||||||
|
|
||||||
### JVM tuning
|
|
||||||
|
|
||||||
#### Garbage Collection
|
|
||||||
We recommend using the G1GC garbage collector:
|
|
||||||
|
|
||||||
`-XX:+UseG1GC`
|
|
||||||
|
|
||||||
Enabling process termination on out-of-memory errors is useful as well, since the process generally will not recover from such a state, and it's better to restart the process:
|
|
||||||
|
|
||||||
`-XX:+ExitOnOutOfMemoryError`
|
|
||||||
|
|
||||||
#### Other useful JVM flags
|
|
||||||
|
|
||||||
```
|
|
||||||
-Duser.timezone=UTC
|
|
||||||
-Dfile.encoding=UTF-8
|
|
||||||
-Djava.io.tmpdir=<should not be volatile tmpfs and also has good read and write speed. Strongly recommended to avoid using NFS mount>
|
|
||||||
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
|
|
||||||
-Dorg.jboss.logging.provider=slf4j
|
|
||||||
-Dnet.spy.log.LoggerImpl=net.spy.memcached.compat.log.SLF4JLogger
|
|
||||||
-Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown
|
|
||||||
-Dlog4j.shutdownHookEnabled=true
|
|
||||||
-XX:+PrintGCDetails
|
|
||||||
-XX:+PrintGCDateStamps
|
|
||||||
-XX:+PrintGCTimeStamps
|
|
||||||
-XX:+PrintGCApplicationStoppedTime
|
|
||||||
-XX:+PrintGCApplicationConcurrentTime
|
|
||||||
-Xloggc:/var/logs/druid/historical.gc.log
|
|
||||||
-XX:+UseGCLogFileRotation
|
|
||||||
-XX:NumberOfGCLogFiles=50
|
|
||||||
-XX:GCLogFileSize=10m
|
|
||||||
-XX:+ExitOnOutOfMemoryError
|
|
||||||
-XX:+HeapDumpOnOutOfMemoryError
|
|
||||||
-XX:HeapDumpPath=/var/logs/druid/historical.hprof
|
|
||||||
-XX:MaxDirectMemorySize=1g
|
|
||||||
```
|
|
||||||
> Please note that the flag settings above represent sample, general guidelines only. Be careful to use values appropriate
|
|
||||||
for your specific scenario and be sure to test any changes in staging environments.
|
|
||||||
|
|
||||||
The `ExitOnOutOfMemoryError` flag is only supported starting with JDK 8u92. For older versions, `-XX:OnOutOfMemoryError='kill -9 %p'` can be used.
|
|
||||||
|
|
||||||
`MaxDirectMemorySize` restricts the JVM from allocating more direct memory than the specified limit; if it is set to unlimited, the JVM-level restriction is lifted and only OS-level memory limits remain in effect. It's still important to make sure that Druid is not configured to allocate more off-heap memory than your machine has available. Important settings here include `druid.processing.numThreads`, `druid.processing.numMergeBuffers`, and `druid.processing.buffer.sizeBytes`.
|
|
||||||
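A commonly cited rule of thumb (worth verifying against your own configuration) is to leave room for at least `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes` of direct memory, plus the per-segment decompression overhead described above. With the purely illustrative values assumed in the earlier sketches:

```
(15 threads + 4 merge buffers + 1) * 500MB = 10GB
-XX:MaxDirectMemorySize=12g   # illustrative: 10GB plus headroom for decompression buffers
```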
|
|
||||||
Additionally, for large JVM heaps, here are a few Garbage Collection efficiency guidelines that have been known to help in some cases.
|
|
||||||
|
|
||||||
|
|
||||||
- Mount /tmp on tmpfs. See [The Four Month Bug: JVM statistics cause garbage collection pauses](http://www.evanjones.ca/jvm-mmap-pause.html).
|
|
||||||
- On disk-I/O-intensive processes (e.g., Historical and MiddleManager), GC and Druid logs should be written to a different disk than the one where data is written.
|
|
||||||
- Disable [Transparent Huge Pages](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html).
|
|
||||||
- Try disabling biased locking by using `-XX:-UseBiasedLocking` JVM flag. See [Logging Stop-the-world Pauses in JVM](https://dzone.com/articles/logging-stop-world-pauses-jvm).
|
|
||||||
|
|
||||||
### Use UTC timezone
|
|
||||||
|
|
||||||
We recommend using the UTC timezone for all your events and across your hosts, not just for Druid, but for all data infrastructure. This can greatly mitigate potential query problems with inconsistent timezones. To query in a non-UTC timezone, see [query granularities](../querying/granularities.md#period-granularities).
|
|
||||||
|
|
||||||
### System configuration
|
|
||||||
|
|
||||||
#### SSDs
|
|
||||||
|
|
||||||
SSDs are highly recommended for Historical, MiddleManager, and Indexer processes if you are not running a cluster that is entirely in memory. SSDs can greatly mitigate the time required to page data in and out of memory.
|
|
||||||
|
|
||||||
#### JBOD vs RAID
|
|
||||||
|
|
||||||
Historical processes store a large number of segments on disk and support specifying multiple paths for storing them. Typically, hosts have multiple disks configured with RAID, which makes them look like a single disk to the OS. RAID can add overhead, especially if it is software-based rather than backed by a hardware controller, so Historicals might get improved disk throughput with JBOD.
|
|
||||||
|
|
||||||
#### Swap space
|
|
||||||
|
|
||||||
We recommend _not_ using swap space for Historical, MiddleManager, and Indexer processes, since the large number of memory-mapped segment files can lead to poor and unpredictable performance if the system starts swapping.
|
|
||||||
|
|
||||||
#### Linux limits
|
|
||||||
|
|
||||||
For Historical, MiddleManager, and Indexer processes (and for really large clusters, Broker processes), you might need to adjust some Linux system limits to account for a large number of open files, a large number of network connections, or a large number of memory mapped files.
|
|
||||||
|
|
||||||
##### ulimit
|
|
||||||
|
|
||||||
The limit on the number of open files can be set permanently by editing `/etc/security/limits.conf`. This value should be substantially greater than the number of segment files that will exist on the server.
|
|
||||||
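A hedged example of raising the open-file limit in `/etc/security/limits.conf`, assuming Druid runs as a (hypothetical) `druid` user and that 65536 is high enough for your segment count:

```
# /etc/security/limits.conf (illustrative)
druid soft nofile 65536
druid hard nofile 65536
```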
|
|
||||||
##### max_map_count
|
|
||||||
|
|
||||||
Historical processes, and to a lesser extent MiddleManager and Indexer processes, memory-map segment files, so depending on the number of segments per server, `/proc/sys/vm/max_map_count` might also need to be adjusted. Depending on the variant of Linux, this might be done via `sysctl` by placing a file in `/etc/sysctl.d/` that sets `vm.max_map_count`.
|
|
||||||
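For example (the value and the drop-in file name below are assumptions to be sized against your segment count), you could create a drop-in file and apply it with `sysctl`:

```bash
# Create a drop-in file (name is illustrative) and apply it
echo 'vm.max_map_count=262144' | sudo tee /etc/sysctl.d/99-druid.conf
sudo sysctl --system   # reload all sysctl configuration files
```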
|
|
1
operations/basicClusterTuning.md
Normal file
1
operations/basicClusterTuning.md
Normal file
@ -0,0 +1 @@
|
|||||||
|
<!-- toc -->
|
@ -1,43 +0,0 @@
|
|||||||
# Deep storage migration
|
|
||||||
|
|
||||||
If you have been running an evaluation Druid cluster using local deep storage and wish to migrate to a
|
|
||||||
more production-capable deep storage system such as S3 or HDFS, this document describes the necessary steps.
|
|
||||||
|
|
||||||
Migration of deep storage involves the following steps at a high level:
|
|
||||||
|
|
||||||
- Copying segments from local deep storage to the new deep storage
|
|
||||||
- Exporting Druid's segments table from metadata
|
|
||||||
- Rewriting the load specs in the exported segment data to reflect the new deep storage location
|
|
||||||
- Reimporting the edited segments into metadata
|
|
||||||
|
|
||||||
## Shut down cluster services
|
|
||||||
|
|
||||||
To ensure a clean migration, shut down the non-coordinator services to ensure that metadata state will not
|
|
||||||
change as you do the migration.
|
|
||||||
|
|
||||||
When migrating from Derby, the coordinator processes will still need to be up initially, as they host the Derby database.
|
|
||||||
|
|
||||||
## Copy segments from old deep storage to new deep storage
|
|
||||||
|
|
||||||
Before migrating, you will need to copy your old segments to the new deep storage.
|
|
||||||
|
|
||||||
For information on what path structure to use in the new deep storage, please see [deep storage migration options](../operations/export-metadata.md#deep-storage-migration).
|
|
||||||
|
|
||||||
## Export segments with rewritten load specs
|
|
||||||
|
|
||||||
Druid provides an [Export Metadata Tool](../operations/export-metadata.md) for exporting metadata from Derby into CSV files
|
|
||||||
which can then be reimported.
|
|
||||||
|
|
||||||
By setting [deep storage migration options](../operations/export-metadata.md#deep-storage-migration), the `export-metadata` tool will export CSV files where the segment load specs have been rewritten to load from your new deep storage location.
|
|
||||||
|
|
||||||
Run the `export-metadata` tool on your existing cluster, using the migration options appropriate for your new deep storage location, and save the CSV files it generates. After a successful export, you can shut down the coordinator.
|
|
||||||
|
|
||||||
### Import metadata
|
|
||||||
|
|
||||||
After generating the CSV exports with the modified segment data, you can reimport the contents of the Druid segments table from the generated CSVs.
|
|
||||||
|
|
||||||
Please refer to [import commands](../operations/export-metadata.md#importing-metadata) for examples. Only the `druid_segments` table needs to be imported.
|
|
||||||
|
|
||||||
### Restart cluster
|
|
||||||
|
|
||||||
After importing the segment table successfully, you can now restart your cluster.
|
|
@ -1,128 +1 @@
|
|||||||
---
|
<!-- toc -->
|
||||||
id: druid-console
|
|
||||||
title: "Web console"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
Druid includes a console for managing datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. Users can also run SQL and native Druid queries in the console.
|
|
||||||
|
|
||||||
The Druid Console is hosted by the [Router](../design/router.md) process.
|
|
||||||
|
|
||||||
The following cluster settings must be enabled, as they are by default:
|
|
||||||
|
|
||||||
- the Router's [management proxy](../design/router.md#enabling-the-management-proxy) must be enabled.
|
|
||||||
- the Broker processes in the cluster must have [Druid SQL](../querying/sql.md) enabled.
|
|
||||||
|
|
||||||
The Druid console can be accessed at:
|
|
||||||
|
|
||||||
```
|
|
||||||
http://<ROUTER_IP>:<ROUTER_PORT>
|
|
||||||
```
|
|
||||||
|
|
||||||
> It is important to note that any Druid console user will have, effectively, the same file permissions as the user under which Druid runs. One way these permissions are surfaced is in the file browser dialog. The dialog
|
|
||||||
will show console users the files that the underlying user has permissions to. In general, avoid running Druid as
|
|
||||||
the root user. Consider creating a dedicated user account for running Druid.
|
|
||||||
|
|
||||||
Below is a description of the high-level features and functionality of the Druid Console.
|
|
||||||
|
|
||||||
## Home
|
|
||||||
|
|
||||||
The home view provides a high level overview of the cluster.
|
|
||||||
Each card is clickable and links to the appropriate view.
|
|
||||||
The legacy menu allows you to go to the [legacy coordinator and overlord consoles](./management-uis.md#legacy-consoles) should you need them.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
## Data loader
|
|
||||||
|
|
||||||
The data loader view allows you to load data by building an ingestion spec with a step-by-step wizard.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
After selecting the location of your data, just follow the series of steps that show you incremental previews of the data as it will be ingested.
|
|
||||||
After filling in the required details on every step you can navigate to the next step by clicking the `Next` button.
|
|
||||||
You can also freely navigate between the steps from the top navigation.
|
|
||||||
|
|
||||||
Navigating with the top navigation will leave the underlying spec unmodified while clicking the `Next` button will attempt to fill in the subsequent steps with appropriate defaults.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
## Datasources
|
|
||||||
|
|
||||||
The datasources view shows all the currently enabled datasources.
|
|
||||||
From this view you can see the sizes and availability of the different datasources.
|
|
||||||
You can edit the retention rules, configure automatic compaction, and drop data.
|
|
||||||
Like any view that is powered by a DruidSQL query you can click `View SQL query for table` from the `...` menu to run the underlying SQL query directly.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
You can view and edit retention rules to determine the general availability of a datasource.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
## Segments
|
|
||||||
|
|
||||||
The segment view shows all the segments in the cluster.
|
|
||||||
Each segment has a detail view that provides more information.
|
|
||||||
The Segment ID is also conveniently broken down into Datasource, Start, End, Version, and Partition columns for ease of filtering and sorting.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
## Tasks and supervisors
|
|
||||||
|
|
||||||
From this view you can check the status of existing supervisors as well as suspend, resume, and reset them.
|
|
||||||
The tasks table allows you to see the currently running and recently completed tasks.
|
|
||||||
To make managing a lot of tasks easier, you can group them by their `Type`, `Datasource`, or `Status`.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
Click on the magnifying glass for any supervisor to see detailed reports of its progress.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
Click on the magnifying glass for any task to see more detail about it.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
## Servers
|
|
||||||
|
|
||||||
The servers tab lets you see the current status of the nodes making up your cluster.
|
|
||||||
You can group the nodes by type or by tier to get meaningful summary statistics.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
## Query
|
|
||||||
|
|
||||||
The query view lets you issue [DruidSQL](../querying/sql.md) queries and display the results as a table.
|
|
||||||
The view will attempt to infer your query and, when possible, let you modify it via contextual actions such as adding filters and changing the sort order.
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
The query view can also issue queries in Druid's [native query format](../querying/querying.md), which is JSON over HTTP.
|
|
||||||
To send a native Druid query, you must start your query with `{` and format it as JSON.
|
|
||||||
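For example, a minimal native query you could paste into the query view might look like the sketch below (assuming a datasource named `wikipedia` and an interval that actually covers your data):

```
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2016-06-27/2016-06-28"],
  "granularity": "all",
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}
```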
|
|
||||||

|
|
||||||
|
|
||||||
## Lookups
|
|
||||||
|
|
||||||
You can create and edit query time lookups via the lookup view.
|
|
||||||
|
|
||||||

|
|
@ -1,117 +0,0 @@
|
|||||||
---
|
|
||||||
id: dump-segment
|
|
||||||
title: "dump-segment tool"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
The DumpSegment tool can be used to dump the metadata or contents of an Apache Druid segment for debugging purposes. Note that the
|
|
||||||
dump is not necessarily a full-fidelity translation of the segment. In particular, not all metadata is included, and
|
|
||||||
complex metric values may not be complete.
|
|
||||||
|
|
||||||
To run the tool, point it at a segment directory and provide a file for writing output:
|
|
||||||
|
|
||||||
```
|
|
||||||
java -classpath "/my/druid/lib/*" -Ddruid.extensions.loadList="[]" org.apache.druid.cli.Main \
|
|
||||||
tools dump-segment \
|
|
||||||
--directory /home/druid/path/to/segment/ \
|
|
||||||
--out /home/druid/output.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
### Output format
|
|
||||||
|
|
||||||
#### Data dumps
|
|
||||||
|
|
||||||
By default, or with `--dump rows`, this tool dumps rows of the segment as newline-separated JSON objects, with one
|
|
||||||
object per line, using the default serialization for each column. Normally all columns are included, but if you like,
|
|
||||||
you can limit the dump to specific columns with `--column name`.
|
|
||||||
|
|
||||||
For example, one line might look like this when pretty-printed:
|
|
||||||
|
|
||||||
```
|
|
||||||
{
|
|
||||||
"__time": 1442018818771,
|
|
||||||
"added": 36,
|
|
||||||
"channel": "#en.wikipedia",
|
|
||||||
"cityName": null,
|
|
||||||
"comment": "added project",
|
|
||||||
"count": 1,
|
|
||||||
"countryIsoCode": null,
|
|
||||||
"countryName": null,
|
|
||||||
"deleted": 0,
|
|
||||||
"delta": 36,
|
|
||||||
"isAnonymous": "false",
|
|
||||||
"isMinor": "false",
|
|
||||||
"isNew": "false",
|
|
||||||
"isRobot": "false",
|
|
||||||
"isUnpatrolled": "false",
|
|
||||||
"iuser": "00001553",
|
|
||||||
"metroCode": null,
|
|
||||||
"namespace": "Talk",
|
|
||||||
"page": "Talk:Oswald Tilghman",
|
|
||||||
"regionIsoCode": null,
|
|
||||||
"regionName": null,
|
|
||||||
"user": "GELongstreet"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Metadata dumps
|
|
||||||
|
|
||||||
With `--dump metadata`, this tool dumps metadata instead of rows. Metadata dumps generated by this tool are in the same
|
|
||||||
format as returned by the [SegmentMetadata query](../querying/segmentmetadataquery.md).
|
|
||||||
|
|
||||||
#### Bitmap dumps
|
|
||||||
|
|
||||||
With `--dump bitmaps`, this tool dumps bitmap indexes instead of rows. Bitmap dumps generated by this tool include
|
|
||||||
dictionary-encoded string columns only. The output contains a field "bitmapSerdeFactory" describing the type of bitmaps
|
|
||||||
used in the segment, and a field "bitmaps" containing the bitmaps for each value of each column. These are base64
|
|
||||||
encoded by default, but you can also dump them as lists of row numbers with `--decompress-bitmaps`.
|
|
||||||
|
|
||||||
Normally all columns are included, but if you like, you can limit the dump to specific columns with `--column name`.
|
|
||||||
|
|
||||||
Sample output:
|
|
||||||
|
|
||||||
```
|
|
||||||
{
|
|
||||||
"bitmapSerdeFactory": {
|
|
||||||
"type": "roaring",
|
|
||||||
"compressRunOnSerialization": true
|
|
||||||
},
|
|
||||||
"bitmaps": {
|
|
||||||
"isRobot": {
|
|
||||||
"false": "//aExfu+Nv3X...",
|
|
||||||
"true": "gAl7OoRByQ..."
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Command line arguments
|
|
||||||
|
|
||||||
|argument|description|required?|
|
|
||||||
|--------|-----------|---------|
|
|
||||||
|--directory file|Directory containing segment data. This could be generated by unzipping an "index.zip" from deep storage.|yes|
|
|
||||||
|--out file|File to write to, or omit to write to stdout.|no|
|
|
||||||
|--dump TYPE|Dump either 'rows' (default), 'metadata', or 'bitmaps'|no|
|
|
||||||
|--column columnName|Column to include. Specify multiple times for multiple columns, or omit to include all columns.|no|
|
|
||||||
|--filter json|JSON-encoded [query filter](../querying/filters.md). Omit to include all rows. Only used if dumping rows.|no|
|
|
||||||
|--time-iso8601|Format __time column in ISO8601 format rather than long. Only used if dumping rows.|no|
|
|
||||||
|--decompress-bitmaps|Dump bitmaps as arrays rather than base64-encoded compressed bitmaps. Only used if dumping bitmaps.|no|
|
|
@ -1,33 +0,0 @@
|
|||||||
---
|
|
||||||
id: dynamic-config-provider
|
|
||||||
title: "Dynamic Config Providers"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
A dynamic config provider is Druid's core mechanism for supplying multiple related sets of credentials, secrets, and configurations via the Druid extension mechanism. Currently, it is only supported for providing Kafka consumer configuration in [Kafka Ingestion](../development/extensions-core/kafka-ingestion.md).
|
|
||||||
|
|
||||||
Eventually, this will replace [PasswordProvider](./password-provider.md).
|
|
||||||
|
|
||||||
|
|
||||||
Users can create a custom extension of the `DynamicConfigProvider` interface that is registered at Druid process startup.
|
|
||||||
|
|
||||||
For more information, see [Adding a new DynamicConfigProvider implementation](../development/modules.md#adding-a-new-dynamicconfigprovider-implementation).
|
|
||||||
|
|
@ -1,202 +0,0 @@
|
|||||||
---
|
|
||||||
id: export-metadata
|
|
||||||
title: "Export Metadata Tool"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
Druid includes an `export-metadata` tool for assisting with migration of cluster metadata and deep storage.
|
|
||||||
|
|
||||||
This tool exports the contents of the following Druid metadata tables:
|
|
||||||
|
|
||||||
- segments
|
|
||||||
- rules
|
|
||||||
- config
|
|
||||||
- datasource
|
|
||||||
- supervisors
|
|
||||||
|
|
||||||
Additionally, the tool can rewrite the local deep storage location descriptors in the rows of the segments table
|
|
||||||
to point to new deep storage locations (S3, HDFS, and local rewrite paths are supported).
|
|
||||||
|
|
||||||
The tool has the following limitations:
|
|
||||||
|
|
||||||
- Only exporting from Derby metadata is currently supported
|
|
||||||
- If rewriting load specs for deep storage migration, only migrating from local deep storage is currently supported.
|
|
||||||
|
|
||||||
## `export-metadata` Options
|
|
||||||
|
|
||||||
The `export-metadata` tool provides the following options:
|
|
||||||
|
|
||||||
### Connection Properties
|
|
||||||
|
|
||||||
- `--connectURI`: The URI of the Derby database, e.g. `jdbc:derby://localhost:1527/var/druid/metadata.db;create=true`
|
|
||||||
- `--user`: Username
|
|
||||||
- `--password`: Password
|
|
||||||
- `--base`: corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.
|
|
||||||
|
|
||||||
### Output Path
|
|
||||||
|
|
||||||
- `--output-path`, `-o`: The output directory of the tool. CSV files for the Druid segments, rules, config, datasource, and supervisors tables will be written to this directory.
|
|
||||||
|
|
||||||
### Export Format Options
|
|
||||||
|
|
||||||
- `--use-hex-blobs`, `-x`: If set, export BLOB payload columns as hexadecimal strings. This needs to be set if importing back into Derby. Default is false.
|
|
||||||
- `--booleans-as-strings`, `-t`: If set, write boolean values as "true" or "false" instead of "1" and "0". This needs to be set if importing back into Derby. Default is false.
|
|
||||||
|
|
||||||
### Deep Storage Migration
|
|
||||||
|
|
||||||
#### Migration to S3 Deep Storage
|
|
||||||
|
|
||||||
By setting the options below, the tool will rewrite the segment load specs to point to a new S3 deep storage location.
|
|
||||||
|
|
||||||
This helps users migrate segments stored in local deep storage to S3.
|
|
||||||
|
|
||||||
- `--s3bucket`, `-b`: The S3 bucket that will hold the migrated segments
|
|
||||||
- `--s3baseKey`, `-k`: The base S3 key where the migrated segments will be stored
|
|
||||||
|
|
||||||
When copying the local deep storage segments to S3, the rewrite performed by this tool requires that the directory structure of the segments be unchanged.
|
|
||||||
|
|
||||||
For example, if the cluster had the following local deep storage configuration:
|
|
||||||
|
|
||||||
```
|
|
||||||
druid.storage.type=local
|
|
||||||
druid.storage.storageDirectory=/druid/segments
|
|
||||||
```
|
|
||||||
|
|
||||||
If the target S3 bucket was `migration`, with a base key of `example`, the contents of `s3://migration/example/` must be identical to that of `/druid/segments` on the old local filesystem.
|
|
||||||
|
|
||||||
#### Migration to HDFS Deep Storage
|
|
||||||
|
|
||||||
By setting the options below, the tool will rewrite the segment load specs to point to a new HDFS deep storage location.
|
|
||||||
|
|
||||||
This helps users migrate segments stored in local deep storage to HDFS.
|
|
||||||
|
|
||||||
`--hadoopStorageDirectory`, `-h`: The HDFS path that will hold the migrated segments
|
|
||||||
|
|
||||||
When copying the local deep storage segments to HDFS, the rewrite performed by this tool requires that the directory structure of the segments be unchanged, with the exception of directory names containing colons (`:`).
|
|
||||||
|
|
||||||
For example, if the cluster had the following local deep storage configuration:
|
|
||||||
|
|
||||||
```
|
|
||||||
druid.storage.type=local
|
|
||||||
druid.storage.storageDirectory=/druid/segments
|
|
||||||
```
|
|
||||||
|
|
||||||
If the target hadoopStorageDirectory was `/migration/example`, the contents of `hdfs:///migration/example/` must be identical to that of `/druid/segments` on the old local filesystem.
|
|
||||||
|
|
||||||
Additionally, the segment paths in local deep storage contain colons (`:`) in their names, e.g.:
|
|
||||||
|
|
||||||
`wikipedia/2016-06-27T02:00:00.000Z_2016-06-27T03:00:00.000Z/2019-05-03T21:57:15.950Z/1/index.zip`
|
|
||||||
|
|
||||||
HDFS cannot store files containing colons, and this tool expects the colons to be replaced with underscores (`_`) in HDFS.
|
|
||||||
|
|
||||||
In this example, the `wikipedia` segment above under `/druid/segments` in local deep storage would need to be migrated to HDFS under `hdfs:///migration/example/` with the following path:
|
|
||||||
|
|
||||||
`wikipedia/2016-06-27T02_00_00.000Z_2016-06-27T03_00_00.000Z/2019-05-03T21_57_15.950Z/1/index.zip`
|
|
||||||
|
|
||||||
#### Migration to New Local Deep Storage Path
|
|
||||||
|
|
||||||
By setting the options below, the tool will rewrite the segment load specs to point to a new local deep storage location.
|
|
||||||
|
|
||||||
This helps users migrate segments stored in local deep storage to a new path (e.g., a new NFS mount).
|
|
||||||
|
|
||||||
`--newLocalPath`, `-n`: The new path on the local filesystem that will hold the migrated segments
|
|
||||||
|
|
||||||
When copying the local deep storage segments to a new path, the rewrite performed by this tool requires that the directory structure of the segments be unchanged.
|
|
||||||
|
|
||||||
For example, if the cluster had the following local deep storage configuration:
|
|
||||||
|
|
||||||
```
|
|
||||||
druid.storage.type=local
|
|
||||||
druid.storage.storageDirectory=/druid/segments
|
|
||||||
```
|
|
||||||
|
|
||||||
If the new path was `/migration/example`, the contents of `/migration/example/` must be identical to that of `/druid/segments` on the local filesystem.
|
|
||||||
|
|
||||||
## Running the tool
|
|
||||||
|
|
||||||
To use the tool, you can run the following from the root of the Druid package:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd ${DRUID_ROOT}
|
|
||||||
mkdir -p /tmp/csv
|
|
||||||
java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log4j2.xml -Ddruid.extensions.directory="extensions" -Ddruid.extensions.loadList=[] org.apache.druid.cli.Main tools export-metadata --connectURI "jdbc:derby://localhost:1527/var/druid/metadata.db;" -o /tmp/csv
|
|
||||||
```
|
|
||||||
|
|
||||||
In the example command above:
|
|
||||||
|
|
||||||
- `lib` is the Druid lib directory
|
|
||||||
- `extensions` is the Druid extensions directory
|
|
||||||
- `/tmp/csv` is the output directory. Please make sure that this directory exists.
|
|
||||||
|
|
||||||
## Importing Metadata
|
|
||||||
|
|
||||||
After running the tool, the output directory will contain `<table-name>_raw.csv` and `<table-name>.csv` files.
|
|
||||||
|
|
||||||
The `<table-name>_raw.csv` files are intermediate files used by the tool, containing the table data as exported by Derby without modification.
|
|
||||||
|
|
||||||
The `<table-name>.csv` files are used for import into another database such as MySQL and PostgreSQL and have any configured deep storage location rewrites applied.
|
|
||||||
|
|
||||||
Example import commands for Derby, MySQL, and PostgreSQL are shown below.
|
|
||||||
|
|
||||||
These example import commands expect `/tmp/csv` and its contents to be accessible from the server. For other options, such as importing from the client filesystem, please refer to the database's documentation.
|
|
||||||
|
|
||||||
### Derby
|
|
||||||
|
|
||||||
```sql
|
|
||||||
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_SEGMENTS','/tmp/csv/druid_segments.csv',',','"',null,0);
|
|
||||||
|
|
||||||
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_RULES','/tmp/csv/druid_rules.csv',',','"',null,0);
|
|
||||||
|
|
||||||
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_CONFIG','/tmp/csv/druid_config.csv',',','"',null,0);
|
|
||||||
|
|
||||||
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_DATASOURCE','/tmp/csv/druid_dataSource.csv',',','"',null,0);
|
|
||||||
|
|
||||||
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (null,'DRUID_SUPERVISORS','/tmp/csv/druid_supervisors.csv',',','"',null,0);
|
|
||||||
```
|
|
||||||
|
|
||||||
### MySQL
|
|
||||||
|
|
||||||
```sql
|
|
||||||
LOAD DATA INFILE '/tmp/csv/druid_segments.csv' INTO TABLE druid_segments FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (id,dataSource,created_date,start,end,partitioned,version,used,payload); SHOW WARNINGS;
|
|
||||||
|
|
||||||
LOAD DATA INFILE '/tmp/csv/druid_rules.csv' INTO TABLE druid_rules FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (id,dataSource,version,payload); SHOW WARNINGS;
|
|
||||||
|
|
||||||
LOAD DATA INFILE '/tmp/csv/druid_config.csv' INTO TABLE druid_config FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (name,payload); SHOW WARNINGS;
|
|
||||||
|
|
||||||
LOAD DATA INFILE '/tmp/csv/druid_dataSource.csv' INTO TABLE druid_dataSource FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (dataSource,created_date,commit_metadata_payload,commit_metadata_sha1); SHOW WARNINGS;
|
|
||||||
|
|
||||||
LOAD DATA INFILE '/tmp/csv/druid_supervisors.csv' INTO TABLE druid_supervisors FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' (id,spec_id,created_date,payload); SHOW WARNINGS;
|
|
||||||
```
|
|
||||||
|
|
||||||
### PostgreSQL
|
|
||||||
|
|
||||||
```sql
|
|
||||||
COPY druid_segments(id,dataSource,created_date,start,"end",partitioned,version,used,payload) FROM '/tmp/csv/druid_segments.csv' DELIMITER ',' CSV;
|
|
||||||
|
|
||||||
COPY druid_rules(id,dataSource,version,payload) FROM '/tmp/csv/druid_rules.csv' DELIMITER ',' CSV;
|
|
||||||
|
|
||||||
COPY druid_config(name,payload) FROM '/tmp/csv/druid_config.csv' DELIMITER ',' CSV;
|
|
||||||
|
|
||||||
COPY druid_dataSource(dataSource,created_date,commit_metadata_payload,commit_metadata_sha1) FROM '/tmp/csv/druid_dataSource.csv' DELIMITER ',' CSV;
|
|
||||||
|
|
||||||
COPY druid_supervisors(id,spec_id,created_date,payload) FROM '/tmp/csv/druid_supervisors.csv' DELIMITER ',' CSV;
|
|
||||||
```
|
|
@ -1,48 +0,0 @@
|
|||||||
---
|
|
||||||
id: getting-started
|
|
||||||
title: "Getting started with Apache Druid"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
If you are new to Druid, we recommend reading the [Design Overview](../design/index.md) and the [Ingestion Overview](../ingestion/index.md) first for a basic understanding of Druid.
|
|
||||||
|
|
||||||
## Single-server Quickstart and Tutorials
|
|
||||||
|
|
||||||
To get started with running Druid, the simplest and quickest way is to try the [single-server quickstart and tutorials](../tutorials/index.md).
|
|
||||||
|
|
||||||
## Deploying a Druid cluster
|
|
||||||
|
|
||||||
If you wish to jump straight to deploying Druid as a cluster, or if you have an existing single-server deployment that you wish to migrate to a clustered deployment, please see the [Clustered Deployment Guide](../tutorials/cluster.md).
|
|
||||||
|
|
||||||
## Operating Druid
|
|
||||||
|
|
||||||
The [configuration reference](../configuration/index.md) describes all of Druid's configuration properties.
|
|
||||||
|
|
||||||
The [API reference](../operations/api-reference.md) describes the APIs available on each Druid process.
|
|
||||||
|
|
||||||
The [basic cluster tuning guide](../operations/basic-cluster-tuning.md) is an introductory guide for tuning your Druid cluster.
|
|
||||||
|
|
||||||
## Need help with Druid?
|
|
||||||
|
|
||||||
If you have questions about using Druid, please reach out to the [Druid user mailing list or other community channels](https://druid.apache.org/community/)!
|
|
@ -1,39 +0,0 @@
|
|||||||
---
|
|
||||||
id: high-availability
|
|
||||||
title: "High availability"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
To set up a high availability environment, Apache ZooKeeper, the metadata store, the Coordinators, the Overlords, and the Brokers should all be made highly available.
|
|
||||||
|
|
||||||
- For highly-available ZooKeeper, you will need a cluster of 3 or 5 ZooKeeper nodes.
|
|
||||||
We recommend either installing ZooKeeper on its own hardware, or running 3 or 5 Master servers (where overlords or coordinators are running)
|
|
||||||
and configuring ZooKeeper on them appropriately. See the [ZooKeeper admin guide](https://zookeeper.apache.org/doc/current/zookeeperAdmin) for more details.
|
|
||||||
- For highly-available metadata storage, we recommend MySQL or PostgreSQL with replication and failover enabled.
|
|
||||||
See [MySQL HA/Scalability Guide](https://dev.mysql.com/doc/mysql-ha-scalability/en/)
|
|
||||||
and [PostgreSQL's High Availability, Load Balancing, and Replication](https://www.postgresql.org/docs/current/high-availability.html) for MySQL and PostgreSQL, respectively.
|
|
||||||
- For highly-available Apache Druid Coordinators and Overlords, we recommend running multiple servers.
|
|
||||||
If they are all configured to use the same ZooKeeper cluster and metadata storage,
|
|
||||||
then they will automatically failover between each other as necessary.
|
|
||||||
Only one will be active at a time, but inactive servers will redirect to the currently active server.
|
|
||||||
- Druid Brokers can be scaled out and all running servers will be active and queryable.
|
|
||||||
We recommend placing them behind a load balancer.
|
|
@ -1,31 +0,0 @@
|
|||||||
---
|
|
||||||
id: http-compression
|
|
||||||
title: "HTTP compression"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
Apache Druid supports HTTP request decompression and response compression. To use this, set the HTTP request headers `Content-Encoding: gzip` and `Accept-Encoding: gzip` (see the example request after the table below).
|
|
||||||
|
|
||||||
|Property|Description|Default|
|
|
||||||
|--------|-----------|-------|
|
|
||||||
|`druid.server.http.compressionLevel`|The compression level. Value should be between [-1,9], -1 for default level, 0 for no compression.|-1 (default compression level)|
|
|
||||||
|`druid.server.http.inflateBufferSize`|The buffer size used by gzip decoder. Set to 0 to disable request decompression.|4096|
|
|
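As a hedged illustration (assuming a Broker reachable at `<BROKER_IP>:<BROKER_PORT>` and Druid SQL enabled), a client can request a gzip-compressed response like this:

```bash
# Ask the Broker to gzip the response; curl decompresses it automatically
curl -X POST 'http://<BROKER_IP>:<BROKER_PORT>/druid/v2/sql/' \
  -H 'Content-Type: application/json' \
  -H 'Accept-Encoding: gzip' \
  --compressed \
  -d '{"query": "SELECT 1"}'
```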
@ -1,48 +0,0 @@
|
|||||||
---
|
|
||||||
id: insert-segment-to-db
|
|
||||||
title: "insert-segment-to-db tool"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
In older versions of Apache Druid, `insert-segment-to-db` was a tool that could scan deep storage and
|
|
||||||
insert data from there into Druid metadata storage. It was intended to be used to update the segment table in the
|
|
||||||
metadata storage after manually migrating segments from one place to another, or even to recover lost metadata storage
|
|
||||||
by telling it where the segments are stored.
|
|
||||||
|
|
||||||
In Druid 0.14.x and earlier, Druid wrote segment metadata to two places: the metadata store's `druid_segments` table, and
|
|
||||||
`descriptor.json` files in deep storage. This practice was stopped in Druid 0.15.0 as part of
|
|
||||||
[consolidated metadata management](https://github.com/apache/druid/issues/6849), for the following reasons:
|
|
||||||
|
|
||||||
1. If any segments are manually dropped or re-enabled by cluster operators, this information is not reflected in
|
|
||||||
deep storage. Restoring metadata from deep storage would undo any such drops or re-enables.
|
|
||||||
2. Ingestion methods that allocate segments optimistically (such as native Kafka or Kinesis stream ingestion, or native
|
|
||||||
batch ingestion in 'append' mode) can write segments to deep storage that are not meant to actually be used by the
|
|
||||||
Druid cluster. There is no way, while purely looking at deep storage, to differentiate the segments that made it into
|
|
||||||
the metadata store originally (and therefore _should_ be used) from the segments that did not (and therefore
|
|
||||||
_should not_ be used).
|
|
||||||
3. Nothing in Druid other than the `insert-segment-to-db` tool read the `descriptor.json` files.
|
|
||||||
|
|
||||||
After this change, Druid stopped writing `descriptor.json` files to deep storage, and now only writes segment metadata
|
|
||||||
to the metadata store. This meant that the `insert-segment-to-db` tool was no longer useful, so it was removed in Druid 0.15.0.
|
|
||||||
|
|
||||||
It is highly recommended that you take regular backups of your metadata store, since it is difficult to recover Druid
|
|
||||||
clusters properly without it.
|
|
@ -1,34 +0,0 @@
|
|||||||
---
|
|
||||||
id: kubernetes
|
|
||||||
title: "kubernetes"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
The Apache Druid distribution is also available as a [Docker](https://www.docker.com/) image from [Docker Hub](https://hub.docker.com/r/apache/druid). For example, you can obtain release 0.16.0-incubating using the command below.
|
|
||||||
|
|
||||||
```
|
|
||||||
$ docker pull apache/druid:0.16.0-incubating
|
|
||||||
```
|
|
||||||
|
|
||||||
[druid-operator](https://github.com/druid-io/druid-operator) can be used to manage a Druid cluster on [Kubernetes](https://kubernetes.io/).
|
|
||||||
|
|
||||||
Druid clusters deployed on Kubernetes can function without ZooKeeper using [druid-kubernetes-extensions](../development/extensions-core/kubernetes.md).
|
|
@ -1,64 +0,0 @@
|
|||||||
---
|
|
||||||
id: management-uis
|
|
||||||
title: "Legacy Management UIs"
|
|
||||||
---
|
|
||||||
|
|
||||||
<!--
|
|
||||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
~ or more contributor license agreements. See the NOTICE file
|
|
||||||
~ distributed with this work for additional information
|
|
||||||
~ regarding copyright ownership. The ASF licenses this file
|
|
||||||
~ to you under the Apache License, Version 2.0 (the
|
|
||||||
~ "License"); you may not use this file except in compliance
|
|
||||||
~ with the License. You may obtain a copy of the License at
|
|
||||||
~
|
|
||||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
~
|
|
||||||
~ Unless required by applicable law or agreed to in writing,
|
|
||||||
~ software distributed under the License is distributed on an
|
|
||||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
~ KIND, either express or implied. See the License for the
|
|
||||||
~ specific language governing permissions and limitations
|
|
||||||
~ under the License.
|
|
||||||
-->
|
|
||||||
|
|
||||||
|
|
||||||
## Legacy consoles
|
|
||||||
|
|
||||||
Druid provides a console for managing datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
|
|
||||||
|
|
||||||
For more information on the Druid Console, have a look at the [Druid Console overview](./druid-console.md).
|
|
||||||
|
|
||||||
The Druid Console contains all of the functionality provided by the older consoles described below, which are still available if needed. The legacy consoles may be replaced by the Druid Console in the future.
|
|
||||||
|
|
||||||
These older consoles provide a subset of the functionality of the Druid Console. We recommend using the Druid Console if possible.
|
|
||||||
|
|
||||||
### Coordinator consoles
|
|
||||||
|
|
||||||
#### Version 2
|
|
||||||
|
|
||||||
The Druid Coordinator exposes a web console for displaying cluster information and rule configuration. After the Coordinator starts, the console can be accessed at:
|
|
||||||
|
|
||||||
```
|
|
||||||
http://<COORDINATOR_IP>:<COORDINATOR_PORT>
|
|
||||||
```
|
|
||||||
|
|
||||||
There exists a full cluster view (which shows indexing tasks and Historical processes), as well as views for individual Historical processes, datasources and segments themselves. Segment information can be displayed in raw JSON form or as part of a sortable and filterable table.
|
|
||||||
|
|
||||||
The Coordinator console also exposes an interface for creating and editing rules. All valid datasources configured in the segment database, along with a default datasource, are available for configuration. Rules of different types can be added, deleted, or edited.
|
|
||||||
|
|
||||||
#### Version 1
|
|
||||||
|
|
||||||
The oldest version of Druid's Coordinator console is still available for backwards compatibility at:
|
|
||||||
|
|
||||||
```
|
|
||||||
http://<COORDINATOR_IP>:<COORDINATOR_PORT>/old-console
|
|
||||||
```
|
|
||||||
|
|
||||||
### Overlord console
|
|
||||||
|
|
||||||
The Overlord console can be used to view pending tasks, running tasks, available workers, and recent worker creation and termination. The console can be accessed at:
|
|
||||||
|
|
||||||
```
|
|
||||||
http://<OVERLORD_IP>:<OVERLORD_PORT>/console.html
|
|
||||||
```
|
|
@ -1,68 +0,0 @@
|
|||||||
# Metadata migration
|
|
||||||
|
|
||||||
If you have been running an evaluation Druid cluster using the built-in Derby metadata storage and wish to migrate to a
|
|
||||||
more production-capable metadata store such as MySQL or PostgreSQL, this document describes the necessary steps.
|
|
||||||
|
|
||||||
## Shut down cluster services
|
|
||||||
|
|
||||||
To ensure a clean migration, shut down the non-coordinator services to ensure that metadata state will not
|
|
||||||
change as you do the migration.
|
|
||||||
|
|
||||||
When migrating from Derby, the coordinator processes will still need to be up initially, as they host the Derby database.
|
|
||||||
|
|
||||||
## Exporting metadata
|
|
||||||
|
|
||||||
Druid provides an [Export Metadata Tool](../operations/export-metadata.md) for exporting metadata from Derby into CSV files
|
|
||||||
which can then be imported into your new metadata store.
|
|
||||||
|
|
||||||
The tool also provides options for rewriting the deep storage locations of segments; this is useful
|
|
||||||
for [deep storage migration](../operations/deep-storage-migration.md).
|
|
||||||
|
|
||||||
Run the `export-metadata` tool on your existing cluster, and save the CSV files it generates. After a successful export, you can shut down the coordinator.
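
As a rough illustration, the invocation below sketches what an export from a Derby store on the Coordinator host might look like, writing CSV files to `/tmp/csv`. The connector options mirror the `metadata-init` options shown later in this document, but the Derby URI and the output-directory flag are assumptions here; consult the Export Metadata Tool page for the exact options your version supports.

```bash
cd ${DRUID_ROOT}

# Sketch only: the Derby URI and the -o output flag are assumptions; see the
# Export Metadata Tool documentation for the authoritative option names.
java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log4j2.xml \
  -Ddruid.metadata.storage.type=derby \
  org.apache.druid.cli.Main tools export-metadata \
  --connectURI="jdbc:derby://localhost:1527/var/druid/metadata.db" \
  -o /tmp/csv
```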

## Initializing the new metadata store

### Create database

Before importing the existing cluster metadata, you will need to set up the new metadata store.

The [MySQL extension](../development/extensions-core/mysql.md) and [PostgreSQL extension](../development/extensions-core/postgresql.md) docs have instructions for initial database setup.
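
As an illustration of the kind of setup those docs describe, a minimal MySQL sketch is shown below; the database name, user, password, and character set are placeholders, and the extension docs linked above are the authoritative reference for the exact statements and privileges required.

```sql
-- Sketch only: names, password, and character set are placeholders.
CREATE DATABASE druid DEFAULT CHARACTER SET utf8mb4;
CREATE USER 'druid'@'%' IDENTIFIED BY 'diurd';
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%';
```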

### Update configuration

Update your Druid runtime properties with the new metadata configuration.
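
For example, a minimal sketch of the relevant properties in `conf/druid/cluster/_common/common.runtime.properties` for a MySQL metadata store; the host `db.example.com`, database name, and credentials are placeholders, and PostgreSQL users would swap in the corresponding extension and `postgresql` settings.

```properties
# Load the metadata storage extension (use postgresql-metadata-storage for PostgreSQL).
druid.extensions.loadList=["mysql-metadata-storage"]

# Point Druid at the new metadata store; host, database, and credentials are placeholders.
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```

If `druid.extensions.loadList` already names other extensions, append the metadata storage extension to the existing list rather than replacing it.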

### Create Druid tables

Druid provides a `metadata-init` tool for creating Druid's metadata tables. After initializing the Druid database, you can run the commands shown below from the root of the Druid package to initialize the tables.

In the example commands below:

- `lib` is the Druid lib directory
- `extensions` is the Druid extensions directory
- `base` corresponds to the value of `druid.metadata.storage.tables.base` in the configuration, `druid` by default.
- The `--connectURI` parameter corresponds to the value of `druid.metadata.storage.connector.connectURI`.
- The `--user` parameter corresponds to the value of `druid.metadata.storage.connector.user`.
- The `--password` parameter corresponds to the value of `druid.metadata.storage.connector.password`.

#### MySQL

```bash
cd ${DRUID_ROOT}
java -classpath "lib/*" \
  -Dlog4j.configurationFile=conf/druid/cluster/_common/log4j2.xml \
  -Ddruid.extensions.directory="extensions" \
  -Ddruid.extensions.loadList=[\"mysql-metadata-storage\"] \
  -Ddruid.metadata.storage.type=mysql \
  org.apache.druid.cli.Main tools metadata-init \
  --connectURI="<mysql-uri>" --user <user> --password <pass> --base druid
```

#### PostgreSQL

```bash
cd ${DRUID_ROOT}
java -classpath "lib/*" \
  -Dlog4j.configurationFile=conf/druid/cluster/_common/log4j2.xml \
  -Ddruid.extensions.directory="extensions" \
  -Ddruid.extensions.loadList=[\"postgresql-metadata-storage\"] \
  -Ddruid.metadata.storage.type=postgresql \
  org.apache.druid.cli.Main tools metadata-init \
  --connectURI="<postgresql-uri>" --user <user> --password <pass> --base druid
```

### Import metadata

After initializing the tables, please refer to the [import commands](../operations/export-metadata.md#importing-metadata) for your target database.
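
As one concrete sketch, a MySQL import of the exported segments CSV might look like the command below, assuming the default `druid` base name (so the table is `druid_segments`) and an export file named `druid_segments.csv`; the field and line options, and the equivalent PostgreSQL `COPY` form, should be taken from the linked import documentation.

```bash
# Sketch only: file name and CSV options are assumptions; repeat for the other
# exported tables, and use the linked docs for the exact per-database commands.
mysql --local-infile=1 -u druid -p druid -e \
  "LOAD DATA LOCAL INFILE 'druid_segments.csv' INTO TABLE druid_segments
   FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
   LINES TERMINATED BY '\n';"
```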

### Restart cluster

After importing the metadata successfully, you can now restart your cluster.
Some files were not shown because too many files have changed in this diff.