---
layout: doc_page
---
# Ingestion using Parquet format
To use this extension, make sure to [include](../../operations/including-extensions.html) both `druid-avro-extensions` and `druid-parquet-extensions`.
This extension enables Druid to ingest and understand the Apache Parquet data format in offline (batch) ingestion.
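The linked page describes how extensions are loaded in general. As a quick sketch, assuming both extensions sit in Druid's default `extensions` directory, the `common.runtime.properties` entry would look like:

```properties
# Load both extensions needed for Parquet ingestion
druid.extensions.loadList=["druid-avro-extensions", "druid-parquet-extensions"]
```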
## Parquet Hadoop Parser
This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.parquet.DruidParquetInputFormat"`.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
| type | String | This should say `parquet` | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a timeAndDims parseSpec. | yes |
| binaryAsString | Boolean | Specifies whether Parquet binary (bytes) columns should be converted to strings. | no (default = false) |
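Note that, per the table above, `binaryAsString` is a parser-level field that sits next to `parseSpec`. A minimal illustrative parser (column names are borrowed from the examples below; adjust them to your schema):

```json
{
  "type": "parquet",
  "binaryAsString": true,
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "time",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["name"]
    }
  }
}
```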
### Example JSON for overlord
When posting the index job to the overlord, setting the correct `inputFormat` is required to switch to Parquet ingestion. Make sure to set `jobProperties` so that HDFS paths do not depend on the local timezone:
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      }
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {},
      "leaveIntermediate": true
    }
  }
}
```
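The `jobProperties` object above is left empty. One way to make the job timezone-independent, as recommended earlier, is to pin the map and reduce JVMs to UTC via the standard MapReduce options; a sketch (the exact flags you need may differ per cluster):

```json
"jobProperties" : {
  "mapreduce.map.java.opts" : "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.java.opts" : "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8"
}
```

This fragment would replace the empty `"jobProperties" : {}` entry in `tuningConfig`.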
### Example JSON for standalone JVM
When using a standalone JVM instead, additional configuration fields are required. You can launch the Hadoop job with your locally compiled jars like this:
```bash
HADOOP_CLASS_PATH=`hadoop classpath | sed s/*.jar/*/g`
java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/* \
io.druid.cli.Main index hadoop \
wikipedia_hadoop_parquet_job.json
```
An example index JSON when using the standalone JVM:
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      },
      "metadataUpdateSpec": {
        "type": "postgresql",
        "connectURI": "jdbc:postgresql://localhost/druid",
        "user" : "druid",
        "password" : "asdf",
        "segmentTable": "druid_segments"
      },
      "segmentOutputPath": "tmp/segments"
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": "tmp/working_path",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {},
      "leaveIntermediate": true
    }
  }
}
```
Almost all of the fields listed above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`).