Apache Parquet Extension
This Apache Druid (incubating) module extends Druid Hadoop-based indexing to ingest data directly from offline Apache Parquet files.
Note: druid-parquet-extensions depends on the druid-avro-extensions module, so be sure to include both.
Parquet Hadoop Parser
This extension provides two ways to parse Parquet files:
- parquet: using a simple conversion contained within this extension
- parquet-avro: conversion to Avro records with the parquet-avro library, using the druid-avro-extensions module to parse the Avro data
Selection of the conversion method is controlled by the parser type, and the correct Hadoop input format must also be set in the ioConfig:
- org.apache.druid.data.input.parquet.DruidParquetInputFormat for parquet
- org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat for parquet-avro
Both parse options support auto field discovery and flattening if provided with a flattenSpec in a parseSpec whose format is parquet or avro. Parquet nested list and map logical types should operate correctly with JSON path expressions for all supported types. parquet-avro sets the Hadoop job property parquet.avro.add-list-element-records to false (it normally defaults to true) in order to 'unwrap' primitive list elements into multi-value dimensions.
The parquet parser supports int96 Parquet values, while parquet-avro does not. There may also be some subtle differences in how the two parsers evaluate the JSON path expressions of a flattenSpec.
We suggest using parquet over parquet-avro to allow ingesting data beyond the schema constraints of Avro conversion. However, parquet-avro was the original basis for this extension, and as such it is a bit more mature.
Field | Type | Description | Required |
---|---|---|---|
type | String | Choose parquet or parquet-avro to determine how Parquet files are parsed | yes |
parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are timeAndDims, parquet, and avro (if used with Avro conversion). | yes |
binaryAsString | Boolean | Specifies whether a bytes Parquet column that is not logically marked as a string or enum type should be converted to strings anyway. | no (default == false) |
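As a rough illustration of where the optional binaryAsString field from the table sits, the sketch below shows only the parser object of a dataSchema. The dimension names dim1 and dim2 are placeholders chosen for this sketch, not names required by the extension.

```json
{
  "type": "parquet",
  "binaryAsString": true,
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["dim1", "dim2"]
    }
  }
}
```

Because the default is false, the field only needs to appear when untyped binary columns should be read as strings.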
When the time dimension is a DateType column, a format should not be supplied. When the time dimension is a UTF8 (String) column, either auto or an explicitly defined format is required.
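For the string case, a timestampSpec with an explicit format might look like the following sketch. It assumes a hypothetical UTF8 column named timestamp holding values such as 2019-01-01 12:00:00, and the pattern shown is a Joda-style format string picked for illustration only.

```json
{
  "column": "timestamp",
  "format": "yyyy-MM-dd HH:mm:ss"
}
```

If the string values are already in a form Druid can detect, "format": "auto" as used in the examples on this page should be sufficient.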
Examples
parquet parser, parquet parseSpec
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
parquet parser, timeAndDims parseSpec
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
parquet-avro parser, avro parseSpec
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
For additional details see hadoop ingestion and general ingestion spec documentation.