mirror of https://github.com/apache/druid.git
Doc update for the new input source and the new input format (#9171)
* Doc update for new input source and input format.
  - The input source and input format are promoted in all docs under docs/ingestion
  - All input sources including core extension ones are located in docs/ingestion/native-batch.md
  - All input formats and parsers including core extension ones are located in docs/ingestion/data-formats.md
  - New behavior of the parallel task with different partitionsSpecs is documented in docs/ingestion/native-batch.md
* parquet
* add warning for range partitioning with sequential mode
* hdfs + s3, gs
* add fs impl for gs
* address comments
* address comments
* gcs
This commit is contained in:
parent 84ff0d2352
commit 153495068b
@@ -24,206 +24,7 @@ title: "Apache Avro"

This Apache Druid extension enables Druid to ingest and understand the Apache Avro data format. Make sure to [include](../../development/extensions.md#loading-extensions) `druid-avro-extensions` as an extension.

The `druid-avro-extensions` extension provides two Avro parsers for stream ingestion and Hadoop batch ingestion.
See the [Avro Hadoop Parser](../../ingestion/data-formats.md#avro-hadoop-parser)
and the [Avro Stream Parser](../../ingestion/data-formats.md#avro-stream-parser)
for details.

### Avro Stream Parser

This is for streaming/realtime ingestion.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `avro_stream`. | no |
| avroBytesDecoder | JSON Object | Specifies how to decode bytes to Avro record. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be an "avro" parseSpec. | yes |

An Avro parseSpec can contain a [`flattenSpec`](../../ingestion/index.md#flattenspec) using either the "root" or "path"
field types, which can be used to read nested Avro records. The "jq" field type is not currently supported for Avro.
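As a hedged illustration (not from the original page), a minimal `flattenSpec` of this kind could look like the following; the field names and the nested path are placeholders:

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "root", "name": "someTopLevelField" },
    { "type": "path", "name": "someNestedField", "expr": "$.nested.field" }
  ]
}
```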
For example, using Avro stream parser with schema repo Avro bytes decoder:

```json
"parser" : {
  "type" : "avro_stream",
  "avroBytesDecoder" : {
    "type" : "schema_repo",
    "subjectAndIdConverter" : {
      "type" : "avro_1124",
      "topic" : "${YOUR_TOPIC}"
    },
    "schemaRepository" : {
      "type" : "avro_1124_rest_client",
      "url" : "${YOUR_SCHEMA_REPO_END_POINT}"
    }
  },
  "parseSpec" : {
    "format": "avro",
    "timestampSpec": <standard timestampSpec>,
    "dimensionsSpec": <standard dimensionsSpec>,
    "flattenSpec": <optional>
  }
}
```
#### Avro Bytes Decoder

If `type` is not included, the avroBytesDecoder defaults to `schema_repo`.

##### Inline Schema Based Avro Bytes Decoder

> The "schema_inline" decoder reads Avro records using a fixed schema and does not support schema migration. If you
> may need to migrate schemas in the future, consider one of the other decoders, all of which use a message header that
> allows the parser to identify the proper Avro schema for reading records.

This decoder can be used if all the input events can be read using the same schema. In that case, the schema can be specified in the input task JSON itself, as described below.
```
...
"avroBytesDecoder": {
  "type": "schema_inline",
  "schema": {
    //your schema goes here, for example
    "namespace": "org.apache.druid.data",
    "name": "User",
    "type": "record",
    "fields": [
      { "name": "FullName", "type": "string" },
      { "name": "Country", "type": "string" }
    ]
  }
}
...
```
##### Multiple Inline Schemas Based Avro Bytes Decoder

This decoder can be used if different input events can have different read schemas. In that case, the schemas can be specified in the input task JSON itself, as described below.
```
...
"avroBytesDecoder": {
  "type": "multiple_schemas_inline",
  "schemas": {
    //your id -> schema map goes here, for example
    "1": {
      "namespace": "org.apache.druid.data",
      "name": "User",
      "type": "record",
      "fields": [
        { "name": "FullName", "type": "string" },
        { "name": "Country", "type": "string" }
      ]
    },
    "2": {
      "namespace": "org.apache.druid.otherdata",
      "name": "UserIdentity",
      "type": "record",
      "fields": [
        { "name": "Name", "type": "string" },
        { "name": "Location", "type": "string" }
      ]
    },
    ...
    ...
  }
}
...
```
Note that this is essentially a map of integer schema ID to Avro schema object. This parser assumes that the record has the following format:
* the first byte is the version and must always be 1.
* the next 4 bytes are the integer schema ID, serialized using big-endian byte order.
* the remaining bytes contain the serialized Avro message.
##### SchemaRepo Based Avro Bytes Decoder

This Avro bytes decoder first extracts `subject` and `id` from the input message bytes, then uses them to look up the Avro schema with which to decode the Avro record from bytes. Details can be found in [schema repo](https://github.com/schema-repo/schema-repo) and [AVRO-1124](https://issues.apache.org/jira/browse/AVRO-1124). You will need an http service like schema repo to hold the avro schema. For schema registration on the message producer side, you can refer to `org.apache.druid.data.input.AvroStreamInputRowParserTest#testParse()`.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `schema_repo`. | no |
| subjectAndIdConverter | JSON Object | Specifies how to extract the subject and id from message bytes. | yes |
| schemaRepository | JSON Object | Specifies how to look up the Avro schema from the subject and id. | yes |
###### Avro-1124 Subject And Id Converter

This section describes the format of the `subjectAndIdConverter` object for the `schema_repo` Avro bytes decoder.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `avro_1124`. | no |
| topic | String | Specifies the topic of your Kafka stream. | yes |

###### Avro-1124 Schema Repository

This section describes the format of the `schemaRepository` object for the `schema_repo` Avro bytes decoder.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `avro_1124_rest_client`. | no |
| url | String | Specifies the endpoint url of your Avro-1124 schema repository. | yes |
##### Confluent Schema Registry-based Avro Bytes Decoder

This Avro bytes decoder first extracts a unique `id` from the input message bytes, then uses it to look up the related schema in the Schema Registry, which is used to decode the Avro record from bytes.
Details can be found in the Schema Registry [documentation](http://docs.confluent.io/current/schema-registry/docs/) and [repository](https://github.com/confluentinc/schema-registry).

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `schema_registry`. | no |
| url | String | Specifies the url endpoint of the Schema Registry. | yes |
| capacity | Integer | Specifies the max size of the cache (default == Integer.MAX_VALUE). | no |
```json
...
"avroBytesDecoder" : {
  "type" : "schema_registry",
  "url" : <schema-registry-url>
}
...
```
### Avro Hadoop Parser

This is for batch ingestion using the `HadoopDruidIndexer`. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.druid.data.input.avro.AvroValueInputFormat"`. You may want to set the Avro reader's schema in `jobProperties` in `tuningConfig`, e.g.: `"avro.schema.input.value.path": "/path/to/your/schema.avsc"` or `"avro.schema.input.value": "your_schema_JSON_object"`. If the reader's schema is not set, the schema in the Avro object container file will be used; see the [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution). Make sure to include "org.apache.druid.extensions:druid-avro-extensions" as an extension.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `avro_hadoop`. | no |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be an "avro" parseSpec. | yes |

An Avro parseSpec can contain a [`flattenSpec`](../../ingestion/index.md#flattenspec) using either the "root" or "path"
field types, which can be used to read nested Avro records. The "jq" field type is not currently supported for Avro.

For example, using Avro Hadoop parser with custom reader's schema file:
```json
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "",
      "parser" : {
        "type" : "avro_hadoop",
        "parseSpec" : {
          "format": "avro",
          "timestampSpec": <standard timestampSpec>,
          "dimensionsSpec": <standard dimensionsSpec>,
          "flattenSpec": <optional>
        }
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "inputFormat": "org.apache.druid.data.input.avro.AvroValueInputFormat",
        "paths" : ""
      }
    },
    "tuningConfig" : {
      "jobProperties" : {
        "avro.schema.input.value.path" : "/path/to/my/schema.avsc"
      }
    }
  }
}
```
@@ -37,127 +37,8 @@ Deep storage can be written to Google Cloud Storage either via this extension or

|`druid.google.bucket`||GCS bucket name.|Must be set.|
|`druid.google.prefix`||GCS prefix.|No-prefix|

## Reading data from Google Cloud Storage

The [Google Cloud Storage input source](../../ingestion/native-batch.md#google-cloud-storage-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
to read objects directly from Google Cloud Storage. If you use the [Hadoop task](../../ingestion/hadoop.md),
you can read data from Google Cloud Storage by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).

<a name="input-source"></a>

## Google cloud storage batch ingestion input source

This extension also provides an input source for Druid native batch ingestion to support reading objects directly from Google Cloud Storage. Objects can be specified as a list of Google Cloud Storage URI strings. The Google Cloud Storage input source is splittable and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task), where each worker task of `index_parallel` will read a single object.
```json
...
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "google",
    "uris": ["gs://foo/bar/file.json", "gs://bar/foo/file2.json"]
  },
  "inputFormat": {
    "type": "json"
  },
  ...
},
...
```
```json
...
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "google",
    "prefixes": ["gs://foo/bar", "gs://bar/foo"]
  },
  "inputFormat": {
    "type": "json"
  },
  ...
},
...
```
```json
...
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "google",
    "objects": [
      { "bucket": "foo", "path": "bar/file1.json"},
      { "bucket": "bar", "path": "foo/file2.json"}
    ]
  },
  "inputFormat": {
    "type": "json"
  },
  ...
},
...
```
|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `google`.|N/A|yes|
|uris|JSON array of URIs where Google Cloud Storage objects to be ingested are located.|N/A|`uris` or `prefixes` or `objects` must be set|
|prefixes|JSON array of URI prefixes for the locations of Google Cloud Storage objects to be ingested.|N/A|`uris` or `prefixes` or `objects` must be set|
|objects|JSON array of Google Cloud Storage objects to be ingested.|N/A|`uris` or `prefixes` or `objects` must be set|

Google Cloud Storage object:

|property|description|default|required?|
|--------|-----------|-------|---------|
|bucket|Name of the Google Cloud Storage bucket|N/A|yes|
|path|The path where data is located.|N/A|yes|
## Firehose

<a name="firehose"></a>

#### StaticGoogleBlobStoreFirehose

This firehose ingests events, similar to the StaticS3Firehose, but from Google Cloud Storage.

As with the S3 blobstore, a file is assumed to be gzipped if its extension ends in .gz.

This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object.

Sample spec:
```json
"firehose" : {
    "type" : "static-google-blobstore",
    "blobs": [
        {
          "bucket": "foo",
          "path": "/path/to/your/file.json"
        },
        {
          "bucket": "bar",
          "path": "/another/path.json"
        }
    ]
}
```
This firehose provides caching and prefetching features. In IndexTask, a firehose can be read twice if intervals or
shardSpecs are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scan of objects is slow.

|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `static-google-blobstore`.|N/A|yes|
|blobs|JSON array of Google Blobs.|N/A|yes|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no|
|prefetchTriggerBytes|Threshold to trigger prefetching Google Blobs.|maxFetchCapacityBytes / 2|no|
|fetchTimeout|Timeout for fetching a Google Blob.|60000|no|
|maxFetchRetry|Maximum retry for fetching a Google Blob.|3|no|

Google Blobs:

|property|description|default|required?|
|--------|-----------|-------|---------|
|bucket|Name of the Google Cloud bucket|N/A|yes|
|path|The path where data is located.|N/A|yes|
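As a hedged sketch tying the table above to a spec, the optional caching and prefetching properties could be set like this; the bucket, path, and values are illustrative only:

```json
"firehose" : {
  "type" : "static-google-blobstore",
  "blobs": [
    { "bucket": "foo", "path": "/path/to/your/file.json" }
  ],
  "maxCacheCapacityBytes": 1073741824,
  "maxFetchCapacityBytes": 1073741824,
  "prefetchTriggerBytes": 536870912,
  "fetchTimeout": 60000,
  "maxFetchRetry": 3
}
```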
@@ -36,49 +36,134 @@ To use this Apache Druid extension, make sure to [include](../../development/ext

|`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`| Principal user name |empty|
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|

If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work.

Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`)
in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`.

If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
If you want to eagerly authenticate against a secured hadoop/hdfs cluster, you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`; this is an alternative to the cron job method that runs the `kinit` command periodically.

### Configuration for Google Cloud Storage

The HDFS extension can also be used for GCS as deep storage.

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs| |Must be set.|
|`druid.storage.storageDirectory`||gs://bucket/example/directory|Must be set.|

All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in <druid>/lib/ and <druid>/extensions/druid-hdfs-storage/.

Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.

<a name="firehose"></a>

## Native batch ingestion

This firehose ingests events from a predefined list of files from a Hadoop filesystem.
This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
Since each split represents an HDFS file, each worker task of `index_parallel` will read an object.

Sample spec:

```json
"firehose" : {
    "type" : "hdfs",
    "paths": "/foo/bar,/foo/baz"
}
```

This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if
`intervals` are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scanning
of files is slow.

|Property|Description|Default|
|--------|-----------|-------|
|type|This should be `hdfs`.|none (required)|
|paths|HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths.|none (required)|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|
|prefetchTriggerBytes|Threshold to trigger prefetching files.|maxFetchCapacityBytes / 2|
|fetchTimeout|Timeout for fetching each file.|60000|
|maxFetchRetry|Maximum number of retries for fetching each file.|3|

### Configuration for Cloud Storage

You can also use AWS S3 or Google Cloud Storage as the deep storage via HDFS.

#### Configuration for AWS S3

To use AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs| |Must be set.|
|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|

You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar`, in the Druid classpath.
Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` on all nodes.

```bash
java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
```

Finally, you need to add the below properties in the `core-site.xml`.
For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).

```xml
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>

<property>
  <name>fs.AbstractFileSystem.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3A</value>
  <description>The implementation class of the S3A AbstractFileSystem.</description>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
  <value>your access key</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
  <value>your secret key</value>
</property>
```

#### Configuration for Google Cloud Storage

To use Google Cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly.

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs||Must be set.|
|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the deep storage|Must be set.|

All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters) in their class path.
Please read the [install instructions](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
to properly set up the necessary libraries and configurations.
One option is to place this jar in `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`.

Finally, you need to configure the `core-site.xml` file with the filesystem
and authentication properties needed for GCS. You may want to copy the below
example properties. Please follow the instructions at
[https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
for more details.
For more configurations, see the [GCS core default](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/conf/gcs-core-default.xml)
and the [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).

```xml
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: (GCS) uris.</description>
</property>

<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>The AbstractFileSystem for gs: uris.</description>
</property>

<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
  <description>
    Whether to use a service account for GCS authorization.
    Setting this property to `false` will disable use of service accounts for
    authentication.
  </description>
</property>

<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile</value>
  <description>
    The JSON key file of the service account used for GCS
    access when google.cloud.auth.service.account.enable is true.
  </description>
</property>
```

Tested with Druid 0.17.0, Hadoop 2.8.5 and gcs-connector jar 2.0.0-hadoop2.
## Reading data from HDFS or Cloud Storage

### Native batch ingestion

The [HDFS input source](../../ingestion/native-batch.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
to read files directly from HDFS storage. You may be able to read objects from cloud storage
with the HDFS input source, but we highly recommend using a proper
[Input Source](../../ingestion/native-batch.md#input-sources) instead if possible, because it is simple to set up.
For now, only the [S3 input source](../../ingestion/native-batch.md#s3-input-source)
and the [Google Cloud Storage input source](../../ingestion/native-batch.md#google-cloud-storage-input-source)
are supported for cloud storage types, so you may still want to use the HDFS input source
to read from cloud storage other than those two.
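For orientation, a minimal sketch of an `ioConfig` using the HDFS input source (not part of the original page; the path is a placeholder):

```json
...
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "hdfs",
    "paths": "hdfs://namenode_host/foo/bar/*.json"
  },
  "inputFormat": {
    "type": "json"
  },
  ...
},
...
```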
### Hadoop-based ingestion

If you use the [Hadoop ingestion](../../ingestion/hadoop.md), you can read data from HDFS
by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
See the [Static](../../ingestion/hadoop.md#static) inputSpec for details.
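As a quick, hedged illustration of such a static `inputSpec` (paths are placeholders):

```json
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "hdfs://namenode_host/path/to/data/file1.gz,hdfs://namenode_host/path/to/data/file2.gz"
  }
}
```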
@@ -60,10 +60,6 @@ A sample supervisor spec is shown below:

  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics-kafka",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
@@ -74,8 +70,6 @@ A sample supervisor spec is shown below:

          "timestamp",
          "value"
        ]
      }
    }
  },
  "metricsSpec": [
    {
@@ -110,6 +104,9 @@ A sample supervisor spec is shown below:

  },
  "ioConfig": {
    "topic": "metrics",
    "inputFormat": {
      "type": "json"
    },
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    },
@@ -196,6 +193,7 @@ For Roaring bitmaps:

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`topic`|String|The Kafka topic to read from. This must be a specific topic as topic patterns are not supported.|yes|
|`inputFormat`|Object|[`inputFormat`](../../ingestion/data-formats.md#input-format) to specify how to parse input data. See [the below section](#specifying-data-format) for details about specifying the input format.|yes|
|`consumerProperties`|Map<String, Object>|A map of properties to be passed to the Kafka consumer. This must contain a property `bootstrap.servers` with a list of Kafka brokers in the form: `<BROKER_1>:<PORT_1>,<BROKER_2>:<PORT_2>,...`. For SSL connections, the `keystore`, `truststore` and `key` passwords can be provided as a [Password Provider](../../operations/password-provider.md) or String password.|yes|
|`pollTimeout`|Long|The length of time to wait for the Kafka consumer to poll records, in milliseconds|no (default == 100)|
|`replicas`|Integer|The number of replica sets, where 1 means a single set of tasks (no replication). Replica tasks will always be assigned to different workers to provide resiliency against process failure.|no (default == 1)|
@@ -209,6 +207,19 @@ For Roaring bitmaps:

|`lateMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps earlier than this period before the task was created; for example if this is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps earlier than *2016-01-01T11:00Z* will be dropped. This may help prevent concurrency issues if your data stream has late messages and you have multiple pipelines that need to operate on the same segments (e.g. a realtime and a nightly batch ingestion pipeline). Please note that only one of `lateMessageRejectionPeriod` or `lateMessageRejectionStartDateTime` can be specified.|no (default == none)|
|`earlyMessageRejectionPeriod`|ISO8601 Period|Configure tasks to reject messages with timestamps later than this period after the task reached its taskDuration; for example if this is set to `PT1H`, the taskDuration is set to `PT1H` and the supervisor creates a task at *2016-01-01T12:00Z*, messages with timestamps later than *2016-01-01T14:00Z* will be dropped. **Note:** Tasks sometimes run past their task duration, for example, in cases of supervisor failover. Setting earlyMessageRejectionPeriod too low may cause messages to be dropped unexpectedly whenever a task runs past its originally configured task duration.|no (default == none)|

#### Specifying data format

Kafka indexing service supports both [`inputFormat`](../../ingestion/data-formats.md#input-format) and [`parser`](../../ingestion/data-formats.md#parser) to specify the data format.
The `inputFormat` is a new and recommended way to specify the data format for Kafka indexing service,
but unfortunately, it doesn't support all data formats supported by the legacy `parser`.
(They will be supported in the future.)

The supported `inputFormat`s include [`csv`](../../ingestion/data-formats.md#csv),
[`delimited`](../../ingestion/data-formats.md#tsv-delimited), and [`json`](../../ingestion/data-formats.md#json).
You can also read [`avro_stream`](../../ingestion/data-formats.md#avro-stream-parser),
[`protobuf`](../../ingestion/data-formats.md#protobuf-parser),
and [`thrift`](../extensions-contrib/thrift.md) formats using `parser`.
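For example, a hedged sketch of the `ioConfig` portion of a Kafka supervisor spec using the `csv` input format; the column names are placeholders:

```json
"ioConfig": {
  "topic": "metrics",
  "inputFormat": {
    "type": "csv",
    "columns": ["timestamp", "metric", "value"]
  },
  "consumerProperties": {
    "bootstrap.servers": "localhost:9092"
  }
}
```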
## Operations

This section gives descriptions of how some supervisor APIs work specifically in Kafka Indexing Service.
@@ -52,10 +52,6 @@ A sample supervisor spec is shown below:

  "type": "kinesis",
  "dataSchema": {
    "dataSource": "metrics-kinesis",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
@@ -66,8 +62,6 @@ A sample supervisor spec is shown below:

          "timestamp",
          "value"
        ]
      }
    }
  },
  "metricsSpec": [
    {
@@ -102,6 +96,9 @@ A sample supervisor spec is shown below:

  },
  "ioConfig": {
    "stream": "metrics",
    "inputFormat": {
      "type": "json"
    },
    "endpoint": "kinesis.us-east-1.amazonaws.com",
    "taskCount": 1,
    "replicas": 1,
@@ -195,6 +192,7 @@ For Roaring bitmaps:

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`stream`|String|The Kinesis stream to read.|yes|
|`inputFormat`|Object|[`inputFormat`](../../ingestion/data-formats.md#input-format) to specify how to parse input data. See [the below section](#specifying-data-format) for details about specifying the input format.|yes|
|`endpoint`|String|The AWS Kinesis stream endpoint for a region. You can find a list of endpoints [here](http://docs.aws.amazon.com/general/latest/gr/rande.html#ak_region).|no (default == kinesis.us-east-1.amazonaws.com)|
|`replicas`|Integer|The number of replica sets, where 1 means a single set of tasks (no replication). Replica tasks will always be assigned to different workers to provide resiliency against process failure.|no (default == 1)|
|`taskCount`|Integer|The maximum number of *reading* tasks in a *replica set*. This means that the maximum number of reading tasks will be `taskCount * replicas` and the total number of tasks (*reading* + *publishing*) will be higher than this. See 'Capacity Planning' below for more details. The number of reading tasks will be less than `taskCount` if `taskCount > {numKinesisShards}`.|no (default == 1)|
@@ -211,6 +209,19 @@ For Roaring bitmaps:

|`awsExternalId`|String|The AWS external id to use for additional permissions.|no|
|`deaggregate`|Boolean|Whether to use the de-aggregate function of the KCL. See below for details.|no|

#### Specifying data format

Kinesis indexing service supports both [`inputFormat`](../../ingestion/data-formats.md#input-format) and [`parser`](../../ingestion/data-formats.md#parser) to specify the data format.
The `inputFormat` is a new and recommended way to specify the data format for Kinesis indexing service,
but unfortunately, it doesn't support all data formats supported by the legacy `parser`.
(They will be supported in the future.)

The supported `inputFormat`s include [`csv`](../../ingestion/data-formats.md#csv),
[`delimited`](../../ingestion/data-formats.md#tsv-delimited), and [`json`](../../ingestion/data-formats.md#json).
You can also read [`avro_stream`](../../ingestion/data-formats.md#avro-stream-parser),
[`protobuf`](../../ingestion/data-formats.md#protobuf-parser),
and [`thrift`](../extensions-contrib/thrift.md) formats using `parser`.
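Analogously for Kinesis, a hedged sketch of an `ioConfig` using the `tsv` (delimited) input format; the column names are placeholders:

```json
"ioConfig": {
  "stream": "metrics",
  "inputFormat": {
    "type": "tsv",
    "columns": ["timestamp", "metric", "value"],
    "delimiter": "\t"
  },
  "endpoint": "kinesis.us-east-1.amazonaws.com"
}
```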
## Operations

This section gives descriptions of how some supervisor APIs work specifically in Kinesis Indexing Service.
@@ -108,7 +108,7 @@ Copy or symlink this file to `extensions/mysql-metadata-storage` under the distr

### MySQL Firehose

The MySQL extension provides an implementation of an [SqlFirehose](../../ingestion/native-batch.md#firehoses) which can be used to ingest data into Druid from a MySQL database.
The MySQL extension provides an implementation of an [SqlFirehose](../../ingestion/native-batch.md#firehoses-deprecated) which can be used to ingest data into Druid from a MySQL database.

```json
{
@@ -28,239 +28,9 @@ Apache ORC files.

To use this extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-orc-extensions`.

The `druid-orc-extensions` extension provides the [ORC input format](../../ingestion/data-formats.md#orc) and the [ORC Hadoop parser](../../ingestion/data-formats.md#orc-hadoop-parser)
for [native batch ingestion](../../ingestion/native-batch.md) and [Hadoop batch ingestion](../../ingestion/hadoop.md), respectively.
Please see the corresponding docs for details.
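As a hedged sketch of the native batch side, an ORC `inputFormat` entry might look like the following; the flatten expression is a placeholder:

```json
"inputFormat": {
  "type": "orc",
  "binaryAsString": false,
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "nestedDim", "expr": "$.nestedData.dim1" }
    ]
  }
}
```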
## ORC Hadoop Parser

The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.orc.mapreduce.OrcInputFormat"`.

|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
|type | String | This should say `orc` | yes|
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes|

The parser supports two `parseSpec` formats: `orc` and `timeAndDims`.

`orc` supports auto field discovery and flattening, if specified with a [`flattenSpec`](../../ingestion/index.md#flattenspec).
If no `flattenSpec` is specified, `useFieldDiscovery` will be enabled by default. Specifying a `dimensionSpec` is
optional if `useFieldDiscovery` is enabled: if a `dimensionSpec` is supplied, the list of `dimensions` it defines will be
the set of ingested dimensions; if missing, the discovered fields will make up the list.

`timeAndDims` parse spec must specify which fields will be extracted as dimensions through the `dimensionSpec`.

[All column types](https://orc.apache.org/docs/types.html) are supported, with the exception of `union` types. Columns of
`list` type, if filled with primitives, may be used as a multi-value dimension, or specific elements can be extracted with
`flattenSpec` expressions. Likewise, primitive fields may be extracted from `map` and `struct` types in the same manner.
Auto field discovery will automatically create a string dimension for every (non-timestamp) primitive or `list` of
primitives, as well as any flatten expressions defined in the `flattenSpec`.

### Hadoop job properties

Like most Hadoop jobs, the best outcomes will add `"mapreduce.job.user.classpath.first": "true"` or
`"mapreduce.job.classloader": "true"` to the `jobProperties` section of `tuningConfig`. Note that if you use
`"mapreduce.job.classloader": "true"`, you will likely need to set `mapreduce.job.classloader.system.classes` to include
`-org.apache.hadoop.hive.` to instruct Hadoop to load `org.apache.hadoop.hive` classes from the application jars instead
of system jars, e.g.

```json
...
"mapreduce.job.classloader": "true",
"mapreduce.job.classloader.system.classes" : "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml",
...
```

This is due to the `hive-storage-api` dependency of the
`orc-mapreduce` library, which provides some classes under the `org.apache.hadoop.hive` package. If instead using the
setting `"mapreduce.job.user.classpath.first": "true"`, then this will not be an issue.

### Examples

#### `orc` parser, `orc` parseSpec, auto field discovery, flatten expressions

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

#### `orc` parser, `orc` parseSpec, field discovery with no flattenSpec or dimensionSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

#### `orc` parser, `orc` parseSpec, no autodiscovery

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "flattenSpec": {
            "useFieldDiscovery": false,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim3",
              "nestedDim",
              "listDimFirstItem"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

#### `orc` parser, `timeAndDims` parseSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

### Migration from 'contrib' extension

This extension, first available in version 0.15.0, replaces the previous 'contrib' extension which was available until
@@ -29,233 +29,8 @@ Apache Parquet files.

Note: If using the `parquet-avro` parser for Apache Hadoop based indexing, `druid-parquet-extensions` depends on the `druid-avro-extensions` module, so be sure to
[include both](../../development/extensions.md#loading-extensions).

The `druid-parquet-extensions` extension provides the [Parquet input format](../../ingestion/data-formats.md#parquet), the [Parquet Hadoop parser](../../ingestion/data-formats.md#parquet-hadoop-parser),
and the [Parquet Avro Hadoop Parser](../../ingestion/data-formats.md#parquet-avro-hadoop-parser) with `druid-avro-extensions`.
The Parquet input format is available for [native batch ingestion](../../ingestion/native-batch.md)
and the other two parsers are for [Hadoop batch ingestion](../../ingestion/hadoop.md).
Please see the corresponding docs for details.

## Parquet and Native Batch

This extension provides a `parquet` input format which can be used with Druid [native batch ingestion](../../ingestion/native-batch.md).

### Parquet InputFormat

|Field | Type | Description | Required|
||||||
|---|---|---|---|
|
|
||||||
|type| String| This should be set to `parquet` to read Parquet file| yes |
|
|
||||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](../../ingestion/index.md#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
|
||||||
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default == false) |
|
|
||||||
|
|
||||||
### Example
|
|
||||||
|
|
||||||
```json
|
|
||||||
...
|
|
||||||
"ioConfig": {
|
|
||||||
"type": "index_parallel",
|
|
||||||
"inputSource": {
|
|
||||||
"type": "local",
|
|
||||||
"baseDir": "/some/path/to/file/",
|
|
||||||
"filter": "file.parquet"
|
|
||||||
},
|
|
||||||
"inputFormat": {
|
|
||||||
"type": "parquet"
|
|
||||||
"flattenSpec": {
|
|
||||||
"useFieldDiscovery": true,
|
|
||||||
"fields": [
|
|
||||||
{
|
|
||||||
"type": "path",
|
|
||||||
"name": "nested",
|
|
||||||
"expr": "$.path.to.nested"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
"binaryAsString": false
|
|
||||||
},
|
|
||||||
...
|
|
||||||
}
|
|
||||||
...
|
|
||||||
```
|
|
||||||
## Parquet Hadoop Parser
|
|
||||||
|
|
||||||
For Hadoop, this extension provides two parser implementations for reading Parquet files:
|
|
||||||
|
|
||||||
* `parquet` - using a simple conversion contained within this extension
|
|
||||||
* `parquet-avro` - conversion to avro records with the `parquet-avro` library and using the `druid-avro-extensions`
|
|
||||||
module to parse the avro data
|
|
||||||
|
|
||||||
Selection of conversion method is controlled by parser type, and the correct hadoop input format must also be set in
|
|
||||||
the `ioConfig`:
|
|
||||||
|
|
||||||
* `org.apache.druid.data.input.parquet.DruidParquetInputFormat` for `parquet`
|
|
||||||
* `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat` for `parquet-avro`
|
|
||||||
|
|
||||||
|
|
||||||
Both parse options support auto field discovery and flattening if provided with a
|
|
||||||
[`flattenSpec`](../../ingestion/index.md#flattenspec) with `parquet` or `avro` as the format. Parquet nested list and map
|
|
||||||
[logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) _should_ operate correctly with
|
|
||||||
JSON path expressions for all supported types. `parquet-avro` sets a hadoop job property
|
|
||||||
`parquet.avro.add-list-element-records` to `false` (which normally defaults to `true`), in order to 'unwrap' primitive
|
|
||||||
list elements into multi-value dimensions.
|
|
||||||
|
|
||||||
The `parquet` parser supports `int96` Parquet values, while `parquet-avro` does not. There may also be some subtle
|
|
||||||
differences in the behavior of JSON path expression evaluation of `flattenSpec`.
|
|
||||||
|
|
||||||
We suggest using `parquet` over `parquet-avro` to allow ingesting data beyond the schema constraints of Avro conversion.
|
|
||||||
However, `parquet-avro` was the original basis for this extension, and as such it is a bit more mature.
|
|
||||||
|
|
||||||
|
|
||||||
|Field | Type | Description | Required|
|
|
||||||
|----------|-------------|----------------------------------------------------------------------------------------|---------|
|
|
||||||
| type | String | Choose `parquet` or `parquet-avro` to determine how Parquet files are parsed | yes |
|
|
||||||
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`, `parquet`, `avro` (if used with avro conversion). | yes |
|
|
||||||
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no(default == false) |
|
|
||||||
|
|
||||||
When the time dimension is a [DateType column](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), a format should not be supplied. When the format is UTF8 (String), either `auto` or a explicitly defined [format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required.
|
|
||||||
|
|
||||||
### Examples
|
|
||||||
|
|
||||||
#### `parquet` parser, `parquet` parseSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

#### `parquet` parser, `timeAndDims` parseSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

#### `parquet-avro` parser, `avro` parseSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

For additional details see [Hadoop ingestion](../../ingestion/hadoop.md) and [general ingestion spec](../../ingestion/index.md) documentation.

@@ -87,7 +87,7 @@ In most cases, the configuration options map directly to the [postgres JDBC conn

### PostgreSQL Firehose

The PostgreSQL extension provides an implementation of an [SqlFirehose](../../ingestion/native-batch.md#firehoses-deprecated) which can be used to ingest data into Druid from a PostgreSQL database.

@@ -25,15 +25,8 @@ title: "Protobuf"

This Apache Druid extension enables Druid to ingest and understand the Protobuf data format. Make sure to [include](../../development/extensions.md#loading-extensions) `druid-protobuf-extensions` as an extension.

The `druid-protobuf-extensions` provides the [Protobuf Parser](../../ingestion/data-formats.md#protobuf-parser)
for [stream ingestion](../../ingestion/index.md#streaming). See corresponding docs for details.

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should say `protobuf`. | no |
| descriptor | String | Protobuf descriptor file name in the classpath or URL. | yes |
| protoMessageType | String | Protobuf message type in the descriptor. Both short name and fully qualified name are accepted. The parser uses the first message type found in the descriptor if not specified. | no |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. The format must be JSON. See [JSON ParseSpec](../../ingestion/index.md) for more configuration options. Please note timeAndDims parseSpec is no longer supported. | yes |

## Example: Load Protobuf messages from Kafka

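Putting those fields together, a `parser` entry might look like the following sketch. This is only an illustration: the descriptor file, message type, and column names are hypothetical, and the authoritative examples live in the data formats documentation linked above.

```json
"parser": {
  "type": "protobuf",
  "descriptor": "file:///tmp/metrics.desc",
  "protoMessageType": "Metrics",
  "parseSpec": {
    "format": "json",
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["unit", "metricType"]
    }
  }
}
```
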
@@ -98,111 +98,8 @@ You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/lat

- kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html)
- custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html)

## Reading data from S3

<a name="input-source"></a>

The [S3 input source](../../ingestion/native-batch.md#s3-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
to read objects directly from S3. If you use the [Hadoop task](../../ingestion/hadoop.md),
you can read data from S3 by specifying the S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).

This extension also provides an input source for Druid native batch ingestion to support reading objects directly from S3. Objects can be specified either via a list of S3 URI strings or a list of S3 location prefixes, which will attempt to list the contents and ingest all objects contained in the locations. The S3 input source is splittable and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task), where each worker task of `index_parallel` will read a single object.

Sample spec:

```json
...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": ["s3://foo/bar/file.json", "s3://bar/foo/file2.json"]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...
```

```json
...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://foo/bar", "s3://bar/foo"]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...
```

```json
...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "objects": [
          { "bucket": "foo", "path": "bar/file1.json"},
          { "bucket": "bar", "path": "foo/file2.json"}
        ]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...
```

|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `s3`.|N/A|yes|
|uris|JSON array of URIs where S3 objects to be ingested are located.|N/A|`uris` or `prefixes` or `objects` must be set|
|prefixes|JSON array of URI prefixes for the locations of S3 objects to be ingested.|N/A|`uris` or `prefixes` or `objects` must be set|
|objects|JSON array of S3 Objects to be ingested.|N/A|`uris` or `prefixes` or `objects` must be set|

S3 Object:

|property|description|default|required?|
|--------|-----------|-------|---------|
|bucket|Name of the S3 bucket|N/A|yes|
|path|The path where data is located.|N/A|yes|

<a name="firehose"></a>
|
|
||||||
|
|
||||||
## StaticS3Firehose
|
|
||||||
|
|
||||||
This firehose ingests events from a predefined list of S3 objects.
|
|
||||||
This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
|
|
||||||
Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object.
|
|
||||||
|
|
||||||
Sample spec:
|
|
||||||
|
|
||||||
```json
"firehose" : {
  "type" : "static-s3",
  "uris": ["s3://foo/bar/file.gz", "s3://bar/foo/file2.gz"]
}
```

This firehose provides caching and prefetching features. In IndexTask, a firehose can be read twice if intervals or
shardSpecs are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scan of objects is slow.

|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `static-s3`.|N/A|yes|
|uris|JSON array of URIs where s3 files to be ingested are located.|N/A|`uris` or `prefixes` must be set|
|prefixes|JSON array of URI prefixes for the locations of s3 files to be ingested.|N/A|`uris` or `prefixes` must be set|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no|
|prefetchTriggerBytes|Threshold to trigger prefetching s3 objects.|maxFetchCapacityBytes / 2|no|
|fetchTimeout|Timeout for fetching an s3 object.|60000|no|
|maxFetchRetry|Maximum retry for fetching an s3 object.|3|no|

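As a rough sketch of how these options combine with `prefixes` (the bucket name and byte values below are illustrative, not recommendations):

```json
"firehose" : {
  "type" : "static-s3",
  "prefixes": ["s3://foo/bar"],
  "maxCacheCapacityBytes": 1073741824,
  "maxFetchCapacityBytes": 1073741824,
  "prefetchTriggerBytes": 536870912,
  "fetchTimeout": 60000,
  "maxFetchRetry": 3
}
```
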
@@ -34,7 +34,7 @@ JavaScript can be used to extend Druid in a variety of ways:

- [Extraction functions](../querying/dimensionspecs.html#javascript-extraction-function)
- [Filters](../querying/filters.html#javascript-filter)
- [Post-aggregators](../querying/post-aggregations.html#javascript-post-aggregator)
- [Input parsers](../ingestion/data-formats.html#javascript-parsespec)
- [Router strategy](../design/router.html#javascript)
- [Worker select strategy](../configuration/index.html#javascript-worker-select-strategy)

@@ -31,9 +31,11 @@ Druid's extensions leverage Guice in order to add things at runtime. Basically,

1. Add a new deep storage implementation by extending the `org.apache.druid.segment.loading.DataSegment*` and
   `org.apache.druid.tasklogs.TaskLog*` classes.
1. Add a new input source by extending `org.apache.druid.data.input.InputSource`.
1. Add a new input entity by extending `org.apache.druid.data.input.InputEntity`.
1. Add a new input source reader if necessary by extending `org.apache.druid.data.input.InputSourceReader`. You can use `org.apache.druid.data.input.impl.InputEntityIteratingReader` in most cases.
1. Add a new input format by extending `org.apache.druid.data.input.InputFormat`.
1. Add a new input entity reader by extending `org.apache.druid.data.input.TextReader` for text formats or `org.apache.druid.data.input.IntermediateRowParsingReader` for binary formats.
1. Add Aggregators by extending `org.apache.druid.query.aggregation.AggregatorFactory`, `org.apache.druid.query.aggregation.Aggregator`,
   and `org.apache.druid.query.aggregation.BufferAggregator`.
1. Add PostAggregators by extending `org.apache.druid.query.aggregation.PostAggregator`.

@@ -57,7 +59,7 @@ The DruidModule class is has two methods

The `configure(Binder)` method is the same method that a normal Guice module would have.

The `getJacksonModules()` method provides a list of Jackson modules that are used to help initialize the Jackson ObjectMapper instances used by Druid. This is how you add extensions that are instantiated via Jackson (like AggregatorFactory and InputSource objects) to Druid.

### Registering your Druid Module

@@ -148,29 +150,43 @@ To start a segment killing task, you need to access the old Coordinator console

After the killing task ends, `index.zip` (`partitionNum_index.zip` for HDFS data storage) file should be deleted from the data storage.

### Adding support for a new input source

Adding support for a new input source requires implementing three interfaces: `InputSource`, `InputEntity`, and `InputSourceReader`.
`InputSource` defines where the input data is stored. `InputEntity` defines how data can be read in parallel
in [native parallel indexing](../ingestion/native-batch.md).
`InputSourceReader` defines how to read your new input source; in most cases you can simply use the provided `InputEntityIteratingReader`.

There is an example of this in the `druid-s3-extensions` module with the `S3InputSource` and `S3Entity`.

Adding an InputSource is done almost entirely through the Jackson Modules instead of Guice. Specifically, note the implementation

``` java
@Override
public List<? extends Module> getJacksonModules()
{
  return ImmutableList.of(
      new SimpleModule().registerSubtypes(new NamedType(S3InputSource.class, "s3"))
  );
}
```

This is registering the InputSource with Jackson's polymorphic serialization/deserialization layer. More concretely, having this will mean that if you specify an `"inputSource": { "type": "s3", ... }` in your IO config, then the system will use your `InputSource` implementation for that input source.

Note that inside of Druid, we have made the `@JacksonInject` annotation for Jackson deserialized objects actually use the base Guice injector to resolve the object to be injected. So, if your InputSource needs access to some object, you can add a `@JacksonInject` annotation on a setter and it will get set on instantiation.

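For reference, the type name registered above is what a task then refers to in its `ioConfig`. A minimal sketch (the bucket and object names are hypothetical):

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "s3",
    "uris": ["s3://example-bucket/path/file1.json"]
  },
  "inputFormat": {
    "type": "json"
  }
}
```
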
### Adding support for a new data format

Adding support for a new data format requires implementing two interfaces: `InputFormat` and `InputEntityReader`.
`InputFormat` defines how your data is formatted. `InputEntityReader` defines how to parse your data and convert it into Druid `InputRow`s.

There is an example in the `druid-orc-extensions` module with the `OrcInputFormat` and `OrcReader`.

Adding an InputFormat is very similar to adding an InputSource. They operate purely through Jackson and thus should just be additions to the Jackson modules returned by your DruidModule.

### Adding Aggregators

Adding AggregatorFactory objects is very similar to InputSource objects. They operate purely through Jackson and thus should just be additions to the Jackson modules returned by your DruidModule.

### Adding Complex Metrics

@@ -143,7 +143,7 @@ To control the number of result segments per time chunk, you can set [maxRowsPer

Please note that you can run multiple compactionTasks at the same time. For example, you can run 12 compactionTasks per month instead of running a single task for the entire year.

A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters.
For example, its `inputSource` is always the [DruidInputSource](native-batch.md#druid-input-source), and `dimensionsSpec` and `metricsSpec`
include all dimensions and metrics of the input segments by default.

Compaction tasks will exit with a failure status code, without doing anything, if the interval you specify has no

@@ -233,7 +233,7 @@ There are other types of `inputSpec` to enable reindexing and delta ingestion.

### Reindexing with Native Batch Ingestion

This section assumes the reader understands how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md),
which uses an `inputSource` to know where and how to read the input data. The [`DruidInputSource`](native-batch.md#druid-input-source)
can be used to read data from segments inside Druid. Note that IndexTask is to be used for prototyping purposes only as
it has to do all processing inside a single process and can't scale. Please use Hadoop batch ingestion for production
scenarios dealing with more than 1GB of data.

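As a rough sketch, a reindexing `ioConfig` built on the `DruidInputSource` could look like the following (the datasource name and interval are hypothetical; see the DruidInputSource documentation for the authoritative field list):

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "druid",
    "dataSource": "wikipedia",
    "interval": "2013-01-01/2013-01-02"
  }
}
```
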
@@ -81,8 +81,8 @@ You can use a [segment metadata query](../querying/segmentmetadataquery.md) for

## How can I Reindex existing data in Druid with schema changes?

You can use DruidInputSource with the [Parallel task](../ingestion/native-batch.md) to ingest existing druid segments using a new schema and change the name, dimensions, metrics, rollup, etc. of the segment.
See [DruidInputSource](../ingestion/native-batch.md#druid-input-source) for more details.
Or, if you use hadoop based ingestion, then you can use "dataSource" input spec to do reindexing.

See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details.

@@ -91,7 +91,7 @@ See the [Update existing data](../ingestion/data-management.md#update) section o

In a lot of situations you may want to lower the granularity of older data. For example, any data older than 1 month has only hour level granularity but newer data has minute level granularity. This use case is the same as re-indexing.

To do this use the [DruidInputSource](../ingestion/native-batch.md#druid-input-source) and run a [Parallel task](../ingestion/native-batch.md). The DruidInputSource will allow you to take in existing segments from Druid and aggregate them and feed them back into Druid. It will also allow you to filter the data in those segments while feeding it back in. This means if there are rows you want to delete, you can just filter them away during re-ingestion.
Typically the above will be run as a batch job that, say, feeds in a day's worth of data at a time and aggregates it.
Or, if you use hadoop based ingestion, then you can use "dataSource" input spec to do reindexing.

@@ -115,7 +115,7 @@ Also note that Druid automatically computes the classpath for Hadoop job contain

## `dataSchema`

This field is required. See the [`dataSchema`](index.md#legacy-dataschema-spec) section of the main ingestion page for details on
what it should contain.

## `ioConfig`

@@ -145,7 +145,52 @@ A type of inputSpec where a static path to the data files is provided.

For example, using the static input paths:

```
"paths" : "hdfs://path/to/data/is/here/data.gz,hdfs://path/to/data/is/here/moredata.gz,hdfs://path/to/data/is/here/evenmoredata.gz"
```

You can also read from cloud storage such as AWS S3 or Google Cloud Storage.
To do so, you need to install the necessary library under Druid's classpath in _all MiddleManager or Indexer processes_.
For S3, you can run the below command to install the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).

```bash
java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
```

Once you install the Hadoop AWS module in all MiddleManager and Indexer processes, you can put
your S3 paths in the inputSpec with the below job properties.
For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).

```
"paths" : "s3a://billy-bucket/the/data/is/here/data.gz,s3a://billy-bucket/the/data/is/here/moredata.gz,s3a://billy-bucket/the/data/is/here/evenmoredata.gz"
```

```json
"jobProperties" : {
  "fs.s3a.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem",
  "fs.AbstractFileSystem.s3a.impl" : "org.apache.hadoop.fs.s3a.S3A",
  "fs.s3a.access.key" : "YOUR_ACCESS_KEY",
  "fs.s3a.secret.key" : "YOUR_SECRET_KEY"
}
```

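To sketch how the two snippets above fit together in a Hadoop ingestion spec: the `paths` go inside the static `inputSpec` of the `ioConfig`, while `jobProperties` is typically supplied in the `tuningConfig` (a sketch only; the bucket name is hypothetical):

```json
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "s3a://billy-bucket/the/data/is/here/data.gz"
  }
},
"tuningConfig" : {
  "type" : "hadoop",
  "jobProperties" : {
    "fs.s3a.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.AbstractFileSystem.s3a.impl" : "org.apache.hadoop.fs.s3a.S3A",
    "fs.s3a.access.key" : "YOUR_ACCESS_KEY",
    "fs.s3a.secret.key" : "YOUR_SECRET_KEY"
  }
}
```
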
For Google Cloud Storage, you need to install [GCS connector jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer processes_.
Once you install the GCS Connector jar in all MiddleManager and Indexer processes, you can put
your Google Cloud Storage paths in the inputSpec with the below job properties.
For more configurations, see the [instructions to configure Hadoop](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop),
[GCS core default](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/conf/gcs-core-default.xml)
and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).

```
"paths" : "gs://billy-bucket/the/data/is/here/data.gz,gs://billy-bucket/the/data/is/here/moredata.gz,gs://billy-bucket/the/data/is/here/evenmoredata.gz"
```

```json
"jobProperties" : {
  "fs.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
  "fs.AbstractFileSystem.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
}
```

#### `granularity`

@@ -28,8 +28,9 @@ All data in Druid is organized into _segments_, which are data files that genera

Loading data in Druid is called _ingestion_ or _indexing_ and consists of reading data from a source system and creating
segments based on that data.

In most ingestion methods, the work of loading data is done by Druid [MiddleManager](../design/middlemanager.md) processes
(or the [Indexer](../design/indexer.md) processes). One exception is
Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
processes are still involved in starting and monitoring the Hadoop jobs). Once segments have been generated and stored
in [deep storage](../dependencies/deep-storage.md), they will be loaded by Historical processes. For more details on
how this works under the hood, see the [Storage design](../design/architecture.md#storage-design) section of Druid's design

@@ -70,25 +71,26 @@ This table compares the major available options:

### Batch

When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index_parallel` (native batch; parallel), `index_hadoop` (Hadoop-based),
or `index` (native batch; single-task).

In general, we recommend native batch whenever it meets your needs, since the setup is simpler (it does not depend on
an external Hadoop cluster). However, there are still scenarios where Hadoop-based batch ingestion might be a better choice,
for example when you already have a running Hadoop cluster and want to
use the resources of the existing cluster for batch ingestion.

This table compares the three available options:

| **Method** | [Native batch (parallel)](native-batch.html#parallel-task) | [Hadoop-based](hadoop.html) | [Native batch (simple)](native-batch.html#simple-task) |
|---|-----|--------------|------------|
| **Task type** | `index_parallel` | `index_hadoop` | `index` |
| **Parallel?** | Yes, if `inputFormat` is splittable and `maxNumConcurrentSubTasks` > 1 in `tuningConfig`. See [data format documentation](./data-formats.md) for details. | Yes, always. | No. Each task is single-threaded. |
| **Can append or overwrite?** | Yes, both. | Overwrite only. | Yes, both. |
| **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
| **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
| **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. |

<a name="data-model"></a>

@@ -192,11 +194,11 @@ datasource that has rollup disabled (or enabled, but with a minimal rollup ratio

has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
that datasource leads to much faster query times. This can often be done with just a small increase in storage
footprint, since abbreviated datasources tend to be substantially smaller.
- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
[reindexing](data-management.md#compaction-and-reindexing) your data in the background after initial ingestion.

### Perfect rollup vs Best-effort rollup

Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could

@@ -233,7 +235,7 @@ This partitioning happens for all ingestion methods, and is based on the `segmen

ingestion spec's `dataSchema`.

The segments within a particular time chunk may also be partitioned further, using options that vary based on the
ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
quickly.

@@ -287,32 +289,21 @@ definition is an _ingestion spec_.

Ingestion specs consist of three main components:

- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
  [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
  documentation for each [ingestion method](#ingestion-methods).
- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
  [ingestion method](#ingestion-methods).

Example ingestion spec for task type `index_parallel` (native batch):

```
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
```

@@ -323,8 +314,6 @@ Example ingestion spec for task type "index" (native batch):

```
          { "type": "string", "name": "language" },
          { "type": "long", "name": "userId" }
        ]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
```

@@ -340,15 +329,24 @@ Example ingestion spec for task type "index" (native batch):

```
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "examples/indexing/",
        "filter": "wikipedia_data.json"
      },
      "inputFormat": {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            { "type": "path", "name": "userId", "expr": "$.user.id" }
          ]
        }
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}
```

@@ -365,27 +363,20 @@ available in Druid's [web console](../operations/druid-console.md). Druid's visu

## `dataSchema`

> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
> except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.

The `dataSchema` is a holder for the following components:

- [datasource name](#datasource), [primary timestamp](#timestampspec),
  [dimensions](#dimensionsspec), [metrics](#metricsspec), and
  [transforms and filters](#transformspec) (if needed).

An example `dataSchema` is:

```
"dataSchema": {
  "dataSource": "wikipedia",
  "timestampSpec": {
    "column": "timestamp",
    "format": "auto"
```

@@ -396,8 +387,6 @@ An example `dataSchema` is:

```
      { "type": "string", "name": "language" },
      { "type": "long", "name": "userId" }
    ]
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
```

@@ -424,50 +413,9 @@ The `dataSource` is located in `dataSchema` → `dataSource` and is simply the n

```
"dataSource": "my-first-datasource"
```

### `parser`

The `parser` is located in `dataSchema` → `parser` and is responsible for configuring a wide variety of
items related to parsing input records.

For details about supported data formats, see the ["Data formats" page](data-formats.md).

For details about major components of the `parseSpec`, refer to their subsections:

- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.

An example `parser` is:

```
"parser": {
  "type": "string",
  "parseSpec": {
    "format": "json",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "userId", "expr": "$.user.id" }
      ]
    },
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": [
        { "type": "string", "name": "page" },
        { "type": "string", "name": "language" },
        { "type": "long", "name": "userId" }
      ]
    }
  }
}
```

### `timestampSpec`

The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:

@@ -478,7 +426,7 @@ configuring the [primary timestamp](#primary-timestamp). An example `timestampSp

> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
> your ingestion spec.

@@ -492,7 +440,7 @@ A `timestampSpec` can have the following components:

### `dimensionsSpec`

The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
configuring [dimensions](#dimensions). An example `dimensionsSpec` is:

@@ -508,7 +456,7 @@ configuring [dimensions](#dimensions). An example `dimensionsSpec` is:

> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
> your ingestion spec.

@@ -516,8 +464,8 @@ A `dimensionsSpec` can have the following components:

| Field | Description | Default |
|-------|-------------|---------|
| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this is an empty array, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns (see [inclusions and exclusions](#inclusions-and-exclusions) below). | `[]` |
| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects. Cannot have the same column in both `dimensions` and `dimensionExclusions`. | `[]` |
| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |

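As a small illustration of these fields, a `dimensionsSpec` might look like the following sketch (the column names are hypothetical):

```json
"dimensionsSpec": {
  "dimensions": [
    "page",
    "language",
    { "type": "long", "name": "userId" }
  ],
  "dimensionExclusions": [],
  "spatialDimensions": []
}
```
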
#### Dimension objects

@@ -537,11 +485,11 @@ Dimension objects can have the following components:

Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.

Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested.

Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:

1. First, start from the set of all input fields from the [`inputFormat`](./data-formats.md) (or the [`flattenSpec`](./data-formats.md#flattenspec), if one is being used).
2. Any field listed in `dimensionExclusions` is excluded.
3. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
4. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.

@ -551,58 +499,6 @@ Schemaless interpretation occurs when both `dimensions` and `spatialDimensions`
|
||||||
> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
|
> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
|
||||||
> schemaless dimension interpretation.
|
> schemaless dimension interpretation.
|
||||||
|
|
||||||
-### `flattenSpec`
-
-The `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-An example `flattenSpec` is:
-
-```
-"flattenSpec": {
-  "useFieldDiscovery": true,
-  "fields": [
-    { "name": "baz", "type": "root" },
-    { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
-    { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
-  ]
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](#flattenspec), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-Flattening is only supported for [data formats](data-formats.md) that support nesting, including `avro`, `json`, `orc`,
-and `parquet`. Flattening is _not_ supported for the `timeAndDims` parseSpec type.
-
-A `flattenSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](#timestampspec), [`transformSpec`](#transformspec), [`dimensionsSpec`](#dimensionsspec), and [`metricsSpec`](#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
-| fields | Specifies the fields of interest and how they are accessed. [See below for details.](#field-flattening-specifications) | `[]` |
-
-#### Field flattening specifications
-
-Each entry in the `fields` list can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Options are as follows:<br><br><ul><li>`root`, referring to a field at the root level of the record. Only really useful if `useFieldDiscovery` is false.</li><li>`path`, referring to a field using [JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data formats that offer nesting, including `avro`, `json`, `orc`, and `parquet`.</li><li>`jq`, referring to a field using [jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported for the `json` format.</li></ul> | none (required) |
-| name | Name of the field after flattening. This name can be referred to by the [`timestampSpec`](#timestampspec), [`transformSpec`](#transformspec), [`dimensionsSpec`](#dimensionsspec), and [`metricsSpec`](#metricsspec). | none (required) |
-| expr | Expression for accessing the field while flattening. For type `path`, this should be [JsonPath](https://github.com/jayway/JsonPath). For type `jq`, this should be [jackson-jq](https://github.com/eiiches/jackson-jq) notation. For other types, this parameter is ignored. | none (required for types `path` and `jq`) |
-
-#### Notes on flattening
-
-* For convenience, when defining a root-level field, it is possible to define only the field name, as a string, instead of a JSON object. For example, `{"name": "baz", "type": "root"}` is equivalent to `"baz"`.
-* Enabling `useFieldDiscovery` will only autodetect "simple" fields at the root level that correspond to data types that Druid supports. This includes strings, numbers, and lists of strings or numbers. Other types will not be automatically detected, and must be specified explicitly in the `fields` list.
-* Duplicate field `name`s are not allowed. An exception will be thrown.
-* If `useFieldDiscovery` is enabled, any discovered field with the same name as one already defined in the `fields` list will be skipped, rather than added twice.
-* [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) is useful for testing `path`-type expressions.
-* jackson-jq supports a subset of the full [jq](https://stedolan.github.io/jq/) syntax. Please refer to the [jackson-jq documentation](https://github.com/eiiches/jackson-jq) for details.

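For orientation (an illustrative record, not part of the change above), the example `flattenSpec` in the removed section would turn the nested record below into the flat fields `baz` (24), `foo_bar` ("abc"), and `first_food` ("sandwich"), with `useFieldDiscovery` additionally picking up simple root-level fields such as `timestamp`:

```json
{
  "timestamp": "2018-01-01T01:02:03Z",
  "baz": 24,
  "foo": { "bar": "abc" },
  "thing": { "food": ["salad", "sandwich"] }
}
```
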
### `metricsSpec`

The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)

@@ -655,9 +551,9 @@ A `granularitySpec` can have the following components:

| Field | Description | Default |
|-------|-------------|---------|
| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`. | `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.html#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here.<br><br>Ignored if `type` is set to `arbitrary`. | `day` |
+| segmentGranularity | [Time chunking](../design/architecture.html#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`. | `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer than, `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. | `none` |
+| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer than, `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will still be applied if it is enabled, even when `queryGranularity` is set to `none`. | `none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. | `true` |
+| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Rows will be rolled up if they have exactly the same timestamp. | `true` |
| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |

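As a point of reference (an illustrative fragment, not part of the change above), a `uniform` `granularitySpec` that creates day-sized segments, truncates timestamps to the hour, keeps rollup enabled, and pins ingestion to one week might look like:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "hour",
  "rollup": true,
  "intervals": ["2020-01-01/2020-01-08"]
}
```
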
### `transformSpec`

@@ -679,7 +575,7 @@ records during ingestion time. It is optional. An example `transformSpec` is:

> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](#flattenspec), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
> your ingestion spec.

@@ -712,17 +608,91 @@ Druid currently includes one kind of built-in transform, the expression transform

The `expression` is a [Druid query expression](../misc/math-expr.md).

+> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
+> your ingestion spec.

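As a sketch only (the `country` and `city` fields and the output name are hypothetical, not taken from the diff), an expression transform that derives a new field from two input fields could look like:

```json
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "countryAndCity",
      "expression": "concat(country, '-', city)"
    }
  ]
}
```

The derived `countryAndCity` field can then be referenced from `dimensionsSpec` like any other input field.
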
#### Filter

The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.

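For illustration (the `country` dimension is a hypothetical example), a `transformSpec` that ingests only rows whose `country` value is `US` could use a standard selector filter:

```json
"transformSpec": {
  "filter": {
    "type": "selector",
    "dimension": "country",
    "value": "US"
  }
}
```
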
+### Legacy `dataSchema` spec
+
+> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
+except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
+
+The legacy `dataSchema` spec has two more components in addition to the ones listed in the [`dataSchema`](#dataschema) section above:
+
+- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
+
+#### `parser` (Deprecated)
+
+In the legacy `dataSchema`, the `parser` is located at `dataSchema` → `parser` and is responsible for configuring a wide variety of
+items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
+For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
+
+For details about major components of the `parseSpec`, refer to their subsections:
+
+- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
+- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
+- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
+
+An example `parser` is:
+
+```
+"parser": {
+  "type": "string",
+  "parseSpec": {
+    "format": "json",
+    "flattenSpec": {
+      "useFieldDiscovery": true,
+      "fields": [
+        { "type": "path", "name": "userId", "expr": "$.user.id" }
+      ]
+    },
+    "timestampSpec": {
+      "column": "timestamp",
+      "format": "auto"
+    },
+    "dimensionsSpec": {
+      "dimensions": [
+        { "type": "string", "name": "page" },
+        { "type": "string", "name": "language" },
+        { "type": "long", "name": "userId" }
+      ]
+    }
+  }
+}
+```
+
+#### `flattenSpec`
+
+In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
+bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
+See [Flatten spec](./data-formats.md#flattenspec) for more details.

## `ioConfig`

The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source system. For details, see the documentation provided by each
-[ingestion method](#ingestion-methods).
+filesystem, or any other supported source system. The `inputFormat` property applies to all
+[ingestion methods](#ingestion-methods) except for Hadoop ingestion. Hadoop ingestion still
+uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
+The rest of `ioConfig` is specific to each individual ingestion method.
+An example `ioConfig` to read JSON data is:
+
+```json
+"ioConfig": {
+    "type": "<ingestion-method-specific type code>",
+    "inputFormat": {
+      "type": "json"
+    },
+    ...
+}
+```
+For more details, see the documentation provided by each [ingestion method](#ingestion-methods).

## `tuningConfig`

@@ -34,6 +34,8 @@ import java.util.Map;

class PartialDimensionDistributionParallelIndexTaskRunner
    extends InputSourceSplitParallelIndexTaskRunner<PartialDimensionDistributionTask, DimensionDistributionReport>
{
+  private static final String PHASE_NAME = "partial dimension distribution";
+
  // For tests
  private final IndexTaskClientFactory<ParallelIndexSupervisorTaskClient> taskClientFactory;

@@ -82,7 +84,7 @@ class PartialDimensionDistributionParallelIndexTaskRunner
  @Override
  public String getName()
  {
-    return PartialDimensionDistributionTask.TYPE;
+    return PHASE_NAME;
  }

  @Override

@@ -35,6 +35,8 @@ import java.util.Map;

class PartialGenericSegmentMergeParallelIndexTaskRunner
    extends ParallelIndexPhaseRunner<PartialGenericSegmentMergeTask, PushedSegmentsReport>
{
+  private static final String PHASE_NAME = "partial segment merge";
+
  private final DataSchema dataSchema;
  private final List<PartialGenericSegmentMergeIOConfig> mergeIOConfigs;

@@ -58,7 +60,7 @@ class PartialGenericSegmentMergeParallelIndexTaskRunner
  @Override
  public String getName()
  {
-    return PartialGenericSegmentMergeTask.TYPE;
+    return PHASE_NAME;
  }

  @Override

@@ -36,6 +36,8 @@ import java.util.Map;

class PartialHashSegmentGenerateParallelIndexTaskRunner
    extends InputSourceSplitParallelIndexTaskRunner<PartialHashSegmentGenerateTask, GeneratedHashPartitionsReport>
{
+  private static final String PHASE_NAME = "partial segment generation";
+
  // For tests
  private final IndexTaskClientFactory<ParallelIndexSupervisorTaskClient> taskClientFactory;
  private final AppenderatorsManager appenderatorsManager;

@@ -72,7 +74,7 @@ class PartialHashSegmentGenerateParallelIndexTaskRunner
  @Override
  public String getName()
  {
-    return PartialHashSegmentGenerateTask.TYPE;
+    return PHASE_NAME;
  }

  @Override

@@ -37,6 +37,8 @@ import java.util.Map;

class PartialHashSegmentMergeParallelIndexTaskRunner
    extends ParallelIndexPhaseRunner<PartialHashSegmentMergeTask, PushedSegmentsReport>
{
+  private static final String PHASE_NAME = "partial segment merge";
+
  private final DataSchema dataSchema;
  private final List<PartialHashSegmentMergeIOConfig> mergeIOConfigs;

@@ -60,7 +62,7 @@ class PartialHashSegmentMergeParallelIndexTaskRunner
  @Override
  public String getName()
  {
-    return PartialHashSegmentMergeTask.TYPE;
+    return PHASE_NAME;
  }

  @Override

@@ -38,6 +38,8 @@ import java.util.Map;

class PartialRangeSegmentGenerateParallelIndexTaskRunner
    extends InputSourceSplitParallelIndexTaskRunner<PartialRangeSegmentGenerateTask, GeneratedPartitionsReport<GenericPartitionStat>>
{
+  private static final String PHASE_NAME = "partial segment generation";
+
  private final IndexTaskClientFactory<ParallelIndexSupervisorTaskClient> taskClientFactory;
  private final AppenderatorsManager appenderatorsManager;
  private final Map<Interval, PartitionBoundaries> intervalToPartitions;

@@ -87,7 +89,7 @@ class PartialRangeSegmentGenerateParallelIndexTaskRunner
  @Override
  public String getName()
  {
-    return PartialRangeSegmentGenerateTask.TYPE;
+    return PHASE_NAME;
  }

  @Override

@@ -38,9 +38,10 @@ import java.util.Map;
 * As its name indicates, distributed indexing is done in a single phase, i.e., without shuffling intermediate data. As
 * a result, this task can't be used for perfect rollup.
 */
-class SinglePhaseParallelIndexTaskRunner
-    extends ParallelIndexPhaseRunner<SinglePhaseSubTask, PushedSegmentsReport>
+class SinglePhaseParallelIndexTaskRunner extends ParallelIndexPhaseRunner<SinglePhaseSubTask, PushedSegmentsReport>
{
+  private static final String PHASE_NAME = "segment generation";
+
  private final ParallelIndexIngestionSpec ingestionSchema;
  private final SplittableInputSource<?> baseInputSource;

@@ -70,7 +71,7 @@ class SinglePhaseParallelIndexTaskRunner
  @Override
  public String getName()
  {
-    return SinglePhaseSubTask.TYPE;
+    return PHASE_NAME;
  }

  @VisibleForTesting

@@ -58,6 +58,7 @@ Double.POSITIVE_INFINITY
Double.POSITIVE_INFINITY.
Dropwizard
dropwizard
+DruidInputSource
DruidSQL
EC2
EC2ContainerCredentialsProviderWrapper
@@ -67,6 +68,7 @@ EMRFS
ETL
Elasticsearch
FirehoseFactory
+FlattenSpec
Float.NEGATIVE_INFINITY
Float.POSITIVE_INFINITY
ForwardedRequestCustomizer
@@ -77,6 +79,7 @@ GUIs
GroupBy
Guice
HDFS
+HDFSFirehose
HLL
HashSet
Homebrew
@@ -92,6 +95,7 @@ IndexSpec
IndexTask
InfluxDB
InputFormat
+InputSource
Integer.MAX_VALUE
JBOD
JDBC
@@ -103,6 +107,7 @@ JMX
JRE
JS
JSON
+JsonPath
JVM
JVMs
Joda
@@ -220,7 +225,10 @@ e.g.
encodings
endian
enum
+expr
failover
+featureSpec
+findColumnsFromHeader
filenames
filesystem
firefox
@@ -243,6 +251,7 @@ influxdb
injective
inlined
interruptible
+jackson-jq
javadoc
kerberos
keystore
@@ -329,6 +338,7 @@ searchable
servlet
sharded
sharding
+skipHeaderRows
smooshed
splittable
stdout
@@ -362,6 +372,7 @@ unparseable
unparsed
uptime
uris
+useFieldDiscovery
v1
v2
vCPUs