---
id: parquet
title: "Apache Parquet Extension"
---
This Apache Druid module extends Druid Hadoop-based indexing to ingest data directly from offline Apache Parquet files.

Note: If using the `parquet-avro` parser for Apache Hadoop-based indexing, `druid-parquet-extensions` depends on the `druid-avro-extensions` module, so be sure to include both.
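
If both extensions are needed, one way to load them is to list them in `druid.extensions.loadList`. The snippet below is a minimal sketch of that entry in `common.runtime.properties` (the rest of the file is omitted):

```properties
# Excerpt of common.runtime.properties: load the Parquet extension, plus the
# Avro extension that the parquet-avro parser depends on.
druid.extensions.loadList=["druid-parquet-extensions", "druid-avro-extensions"]
```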
## Parquet and Native Batch

This extension provides a `parquet` input format which can be used with Druid native batch ingestion.

### Parquet InputFormat
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | This should be set to `parquet` to read Parquet files. | yes |
| flattenSpec | JSON Object | Define a `flattenSpec` to extract nested values from a Parquet file. Note that only 'path' expressions are supported ('jq' is unavailable). | no (default will auto-discover 'root' level properties) |
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default == false) |
#### Example
...
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "local",
    "baseDir": "/some/path/to/file/",
    "filter": "file.parquet"
  },
  "inputFormat": {
    "type": "parquet",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        {
          "type": "path",
          "name": "nested",
          "expr": "$.path.to.nested"
        }
      ]
    },
    "binaryAsString": false
  },
  ...
}
...
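
As a concrete illustration of what the `flattenSpec` above does, consider a hypothetical input record such as the one below (the `path.to.nested` structure only mirrors the path expression in the example and is purely illustrative):

```json
{
  "timestamp": "2018-01-01T00:00:00Z",
  "dim1": "example",
  "path": {
    "to": {
      "nested": "some_value"
    }
  }
}
```

With `useFieldDiscovery` set to `true`, the root-level `timestamp` and `dim1` fields are picked up automatically, and the `path` expression adds a `nested` column containing `some_value` to the ingested row.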
## Parquet Hadoop Parser

For Hadoop, this extension provides two parser implementations for reading Parquet files:

- `parquet` - using a simple conversion contained within this extension
- `parquet-avro` - conversion to avro records with the `parquet-avro` library and using the `druid-avro-extensions` module to parse the avro data

Selection of conversion method is controlled by parser type, and the correct hadoop input format must also be set in the `ioConfig`:

- `org.apache.druid.data.input.parquet.DruidParquetInputFormat` for `parquet`
- `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat` for `parquet-avro`
Both parse options support auto field discovery and flattening if provided with a `flattenSpec` with `parquet` or `avro` as the format. Parquet nested list and map logical types should operate correctly with JSON path expressions for all supported types. `parquet-avro` sets a hadoop job property `parquet.avro.add-list-element-records` to `false` (which normally defaults to `true`), in order to 'unwrap' primitive list elements into multi-value dimensions.
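
As an illustration of the list handling, suppose a record carries a Parquet list-of-strings column named `tags` (a hypothetical name, shown here as JSON):

```json
{
  "timestamp": "2018-01-01T00:00:00Z",
  "tags": ["a", "b", "c"]
}
```

Listing `tags` in `dimensionsSpec.dimensions` would typically ingest it as a multi-value dimension containing all three values, while a `flattenSpec` path expression such as `$.tags[0]` would instead extract only the first element into a single-value dimension.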
The `parquet` parser supports `int96` Parquet values, while `parquet-avro` does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of `flattenSpec`.

We suggest using `parquet` over `parquet-avro` to allow ingesting data beyond the schema constraints of Avro conversion. However, `parquet-avro` was the original basis for this extension, and as such it is a bit more mature.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| type | String | Choose `parquet` or `parquet-avro` to determine how Parquet files are parsed. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`, `parquet`, and `avro` (if used with avro conversion). | yes |
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default == false) |
When the time dimension is a DateType column, a format should not be supplied. When the format is UTF8 (String), either `auto` or an explicitly defined format is required.
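
For instance, with a hypothetical DateType column named `event_date`, the `timestampSpec` would omit the format:

```json
"timestampSpec": {
  "column": "event_date"
}
```

whereas a hypothetical UTF8 (String) column could use `auto` or an explicit Joda-time pattern:

```json
"timestampSpec": {
  "column": "event_time",
  "format": "yyyy-MM-dd HH:mm:ss"
}
```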
### Examples

#### `parquet` parser, `parquet` parseSpec
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
#### `parquet` parser, `timeAndDims` parseSpec
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
#### `parquet-avro` parser, `avro` parseSpec
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
For additional details see Hadoop ingestion and general ingestion spec documentation.