druid/docs/content/development/extensions-core/parquet.md

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
---
layout: doc_page
title: "Druid Parquet Extension"
---
# Druid Parquet Extension

This module extends [Druid Hadoop-based indexing](../../ingestion/hadoop.html) to ingest data directly from offline
Apache Parquet files.

Note: `druid-parquet-extensions` depends on the `druid-avro-extensions` module, so be sure to
[include both](../../operations/including-extensions.html).

## Parquet Hadoop Parser

This extension provides two ways to parse Parquet files:
* `parquet` - uses a simple conversion contained within this extension
* `parquet-avro` - converts to Avro records with the `parquet-avro` library, then uses the `druid-avro-extensions`
 module to parse the Avro data

The parser type selects the conversion method. The matching Hadoop input format must also be set as the `inputFormat`
of the `inputSpec` in the `ioConfig`: `org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat` for `parquet` and
`org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat` for `parquet-avro`.
 

Both parsers support automatic field discovery and flattening when given a
[flattenSpec](../../ingestion/flatten-json.html) with `parquet` or `avro` as the parseSpec `format`. Parquet nested list and map
[logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) _should_ operate correctly with
JSON path expressions for all supported types. `parquet-avro` sets the Hadoop job property
`parquet.avro.add-list-element-records` to `false` (it normally defaults to `true`) in order to 'unwrap' primitive
list elements into multi-value dimensions.
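
As a rough illustration of how flattening and list unwrapping combine, assume a hypothetical Parquet record that, rendered as JSON, looks like the following. The field names and values are invented for this example.

```json
{
  "timestamp": "2018-11-06T00:00:00Z",
  "dim1": "a",
  "nestedData": {
    "dim1": "b"
  },
  "listDim": ["x", "y", "z"]
}
```

A `flattenSpec` along these lines could then pull values out of the nested structures. The output names `nestedDim` and `firstListItem` are also assumptions.

```json
{
  "useFieldDiscovery": true,
  "fields": [
    {
      "type": "path",
      "name": "nestedDim",
      "expr": "$.nestedData.dim1"
    },
    {
      "type": "path",
      "name": "firstListItem",
      "expr": "$.listDim[0]"
    }
  ]
}
```

Under these assumptions, `nestedDim` would be expected to resolve to `b` and `firstListItem` to `x`, since JSON path array indexes start at 0. Field discovery should additionally pick up the top-level `dim1` and expose the primitive list `listDim` as a multi-value dimension.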

The `parquet` parser supports `int96` Parquet values, while `parquet-avro` does not. There may also be some subtle
differences between the two parsers in how `flattenSpec` JSON path expressions are evaluated.

We suggest using `parquet` over `parquet-avro` to allow ingesting data beyond the schema constraints of Avro conversion. 
However, `parquet-avro` was the original basis for this extension, and as such it is a bit more mature.


|Field     | Type        | Description                                                                            | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
| type      | String      | Choose `parquet` or `parquet-avro` to determine how Parquet files are parsed | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data and, optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`, `parquet`, and `avro` (only when using Avro conversion). | yes |
| binaryAsString | Boolean | Specifies whether a binary (bytes) Parquet column that is not logically marked as a string or enum type should be converted to a string anyway. | no (default = false) |
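
To show where these fields sit relative to one another, here is a minimal sketch of a `parser` object with `binaryAsString` enabled. The dimension names `dim1` and `rawBytesDim` are hypothetical, the latter standing in for a binary column that lacks a string annotation.

```json
{
  "parser": {
    "type": "parquet",
    "binaryAsString": true,
    "parseSpec": {
      "format": "timeAndDims",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": ["dim1", "rawBytesDim"]
      }
    }
  }
}
```

Note that `binaryAsString` is a sibling of `type` and `parseSpec`, not a field inside the `parseSpec`.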

When the time dimension is a [DateType column](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), a format should not be supplied. When the time dimension is a UTF8 (String) column, either `auto` or an explicitly defined [format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required.
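
The two cases might look as follows. This is a sketch only: the column names `event_date` and `event_time` are hypothetical. For a column stored with the Parquet date logical type, the `timestampSpec` names the column and omits `format`:

```json
{
  "timestampSpec": {
    "column": "event_date"
  }
}
```

For a string column holding values such as `2018-11-06 00:33:42`, an explicit Joda pattern could be supplied instead of `auto`:

```json
{
  "timestampSpec": {
    "column": "event_time",
    "format": "yyyy-MM-dd HH:mm:ss"
  }
}
```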

### Examples

#### `parquet` parser, `parquet` parseSpec
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

#### `parquet` parser, `timeAndDims` parseSpec
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

#### `parquet-avro` parser, `avro` parseSpec
```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```

For additional details, see the [Hadoop ingestion](../../ingestion/hadoop.html) and [general ingestion spec](../../ingestion/ingestion-spec.html) documentation.