---
layout: doc_page
---
# Ingestion using Parquet format

To use this extension, make sure to [include](../../operations/including-extensions.html) both `druid-avro-extensions` and `druid-parquet-extensions`.

This extension enables Druid to ingest and understand the Apache Parquet data format offline.

## Parquet Hadoop Parser

This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.druid.data.input.parquet.DruidParquetInputFormat"`.

|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
| type | String | Must be `parquet`. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a `timeAndDims` parseSpec. | yes |
| binaryAsString | Boolean | Specifies whether Parquet columns of binary (bytes) type should be converted to strings. | no (default == false) |
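
As a rough illustration of how the fields in the table fit together, the fragment below sketches a `parser` entry with `binaryAsString` enabled. The column names `time` and `name` are assumptions borrowed from the examples on this page, not requirements of the extension.

```json
{
  "parser": {
    "type": "parquet",
    "binaryAsString": true,
    "parseSpec": {
      "format": "timeAndDims",
      "timestampSpec": {
        "column": "time",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": ["name"]
      }
    }
  }
}
```

Leaving `binaryAsString` out is equivalent to setting it to `false`, per the default listed in the table.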

When the time dimension is a [DateType column](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), a format should not be supplied. When the column type is UTF8 (String), either `auto` or an explicitly defined [format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required.
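
For the UTF8 (String) case, a `timestampSpec` with an explicit Joda pattern might look like the fragment below. This is only a sketch: it assumes a hypothetical string column named `time` whose values look like `2015-12-31 23:59:59`, so adjust the pattern to match your data.

```json
{
  "timestampSpec": {
    "column": "time",
    "format": "yyyy-MM-dd HH:mm:ss"
  }
}
```

For a DateType column, the same fragment would simply omit the `format` field.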

### Example JSON for overlord

When posting the index job to the overlord, you must set the correct `inputFormat` to switch to Parquet ingestion. Also make sure to set `jobProperties` so that HDFS paths do not depend on the timezone:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      }
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {},
      "leaveIntermediate": true
    }
  }
}
```
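
The example above leaves `jobProperties` empty. One possible way to fill it in so that the Hadoop map and reduce JVMs run in UTC is sketched below. It assumes the standard Hadoop properties `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts`; treat the exact values as a starting point rather than a required setting, since your cluster may already define these options.

```json
{
  "jobProperties": {
    "mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
    "mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8"
  }
}
```

These are the same JVM flags that the standalone command on this page passes to `java` directly.
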
### Example JSON for standalone JVM

When using a standalone JVM instead, additional configuration fields are required. You can launch a Hadoop job with your locally compiled jars like this:

```bash
# Collect the Hadoop classpath, rewriting "*.jar" entries to the "*" wildcard form
HADOOP_CLASS_PATH=$(hadoop classpath | sed 's/\*\.jar/*/g')

# Quote the classpath so the shell leaves the "*" wildcards for the JVM to expand
java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/*" \
  org.apache.druid.cli.Main index hadoop \
  wikipedia_hadoop_parquet_job.json
```
An example index json when using the standalone JVM:
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      },
      "metadataUpdateSpec": {
        "type": "postgresql",
        "connectURI": "jdbc:postgresql://localhost/druid",
        "user" : "druid",
        "password" : "asdf",
        "segmentTable": "druid_segments"
      },
      "segmentOutputPath": "tmp/segments"
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": "tmp/working_path",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {},
      "leaveIntermediate": true
    }
  }
}
```

Almost all the fields listed above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`).