mirror of https://github.com/apache/druid.git
Remove the metadataUpdateSpec from specfile (#3973)
Get rid of the metadataUpdateSpec section in the JSON example for ingesting Parquet into Druid. When this element is present, the indexing job fails to start.
This commit is contained in:
parent 8316b4f48f
commit add17fa7db

---
layout: doc_page
---

# Ingestion using Parquet format

To use this extension, make sure to [include](../../operations/including-extensions.html) both `druid-avro-extensions` and `druid-parquet-extensions`.

This extension enables Druid to ingest and understand the Apache Parquet data format offline.

## Parquet Hadoop Parser

This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.parquet.DruidParquetInputFormat"`.

|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a timeAndDims parseSpec. | yes |
| binaryAsString | Boolean | Specifies if the bytes parquet column should be converted to strings. | no (default == false) |
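
The full example below leaves out `binaryAsString`; as a minimal sketch (field values here are illustrative, not taken from the commit), a `parser` object combining the two fields above could look like this:

```json
{
  "type": "parquet",
  "binaryAsString": false,
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "time",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["name"]
    }
  }
}
```

A complete spec using this parser appears in the overlord example below.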

### Example json for overlord

When posting the index job to the overlord, setting the correct `inputFormat` is required to switch to parquet ingestion. Make sure to set `jobProperties` so that HDFS paths do not depend on the timezone:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      }
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {},
      "leaveIntermediate": true
    }
  }
}
```
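
The example above leaves `jobProperties` empty. As a hedged sketch (the property names below are standard Hadoop MapReduce settings, not part of this commit), forcing the map and reduce JVMs to UTC is one way to keep HDFS paths timezone-independent:

```json
"jobProperties": {
  "mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8"
}
```

The finished task spec is then POSTed to the overlord's task API, by default `http://<OVERLORD_HOST>:8090/druid/indexer/v1/task`.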

### Example json for standalone jvm

When using a standalone JVM instead, additional configuration fields are required. You can fire a Hadoop job with your locally compiled jars like this:

```bash
# Build the Hadoop classpath for the indexer JVM
HADOOP_CLASS_PATH=`hadoop classpath | sed s/*.jar/*/g`

# Run the Druid command-line Hadoop indexer against the task spec file
java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/* \
  io.druid.cli.Main index hadoop \
  wikipedia_hadoop_parquet_job.json
```

An example index json when using the standalone JVM:

```json
{
  ...
}
```

Almost all the fields listed above are required, including `inputFormat`, `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`).
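
For illustration only, a `metadataUpdateSpec` carrying the fields named above might look like the following inside the standalone spec's `ioConfig`; the storage type and connection values are placeholders for your own metadata store:

```json
"metadataUpdateSpec": {
  "type": "mysql",
  "connectURI": "jdbc:mysql://localhost:3306/druid",
  "user": "druid",
  "password": "diurd",
  "segmentTable": "druid_segments"
}
```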