---
layout: doc_page
---
# Ingestion using Parquet format
To use this extension, make sure to [include](../../operations/including-extensions.html) both `druid-avro-extensions` and `druid-parquet-extensions`.
This extension enables Druid to ingest and understand the Apache Parquet data format offline.
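For example, a minimal sketch of loading both extensions through `druid.extensions.loadList` in `common.runtime.properties` (any other entries in your load list are deployment-specific):

```properties
# Load the Avro and Parquet extensions alongside whatever else the deployment needs
druid.extensions.loadList=["druid-avro-extensions", "druid-parquet-extensions"]
```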
## Parquet Hadoop Parser
This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.parquet.DruidParquetInputFormat"`.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
| type | String | This should say `parquet`. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a `timeAndDims` parseSpec. | yes |
| binaryAsString | Boolean | Specifies whether binary Parquet columns should be converted to strings. | no (default = false) |
When the time dimension is a [DateType column](https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md), a format should not be supplied. When the time dimension is stored as UTF8 (String), either `auto` or an explicitly defined [format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required.
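As an illustration, here is a sketch of a parser for a hypothetical file whose `time` column is stored as UTF8 and whose binary columns should be read as strings (the column names and date format are placeholders, not taken from the examples below):

```json
"parser": {
  "type": "parquet",
  "binaryAsString": true,
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "time",
      "format": "yyyy-MM-dd HH:mm:ss"
    },
    "dimensionsSpec": {
      "dimensions": ["name"]
    }
  }
}
```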
### Example JSON for overlord
When posting the index job to the overlord, setting the correct `inputFormat` is required to switch to Parquet ingestion. Make sure to set `jobProperties` so that HDFS paths do not depend on the local timezone (a sketch of such properties follows the example):
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      }
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {},
      "leaveIntermediate": true
    }
  }
}
```
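The example above leaves `jobProperties` empty. A minimal sketch of properties commonly used to pin the Hadoop map and reduce JVMs to UTC (the property names are standard MapReduce settings; the exact JVM options are assumptions to adapt to your cluster):

```json
"jobProperties": {
  "mapreduce.map.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8"
}
```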
### Example JSON for standalone JVM
When using a standalone JVM instead, additional configuration fields are required. You can launch a Hadoop indexing job with your locally compiled jars like this:
```bash
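# Expand the Hadoop classpath to directory wildcards, then run the Hadoop indexing task with the index spec below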
HADOOP_CLASS_PATH=`hadoop classpath | sed s/*.jar/*/g`
java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/* \
io.druid.cli.Main index hadoop \
wikipedia_hadoop_parquet_job.json
```
An example index JSON when using the standalone JVM:
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      },
      "metadataUpdateSpec": {
        "type": "postgresql",
        "connectURI": "jdbc:postgresql://localhost/druid",
        "user": "druid",
        "password": "asdf",
        "segmentTable": "druid_segments"
      },
      "segmentOutputPath": "tmp/segments"
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": "tmp/working_path",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {},
      "leaveIntermediate": true
    }
  }
}
```
Almost all the fields listed above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`).