id	title
parquet	Apache Parquet Extension

This Apache Druid module extends Druid's Hadoop-based indexing to ingest data directly from offline Apache Parquet files.

Note: If you use the parquet-avro parser for Apache Hadoop-based indexing, druid-parquet-extensions depends on the druid-avro-extensions module, so be sure to include both.

Parquet and Native Batch

This extension provides a Parquet input format that can be used with Druid native batch ingestion.

Parquet InputFormat

Field	Type	Description	Required
type	String	Set this to `parquet` to read Parquet files.	yes
flattenSpec	JSON Object	Define a `flattenSpec` to extract nested values from a Parquet file. Note that only 'path' expressions are supported ('jq' is unavailable).	no (by default, 'root' level properties are auto-discovered)
binaryAsString	Boolean	Specifies whether a Parquet bytes column that is not logically marked as a string or enum type should be treated as a UTF-8 encoded string.	no (default == false)

Example

    ...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/some/path/to/file/",
        "filter": "file.parquet"
      },
      "inputFormat": {
        "type": "parquet",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            {
              "type": "path",
              "name": "nested",
              "expr": "$.path.to.nested"
            }
          ]
        },
        "binaryAsString": false
      },
      ...
    }
    ...

Parquet Hadoop Parser

For Hadoop, this extension provides two parser implementations for reading Parquet files:

parquet - uses a simple conversion contained within this extension
parquet-avro - converts to Avro records with the parquet-avro library and uses the druid-avro-extensions module to parse the Avro data

The parser type selects the conversion method, and the matching Hadoop input format must also be set in the ioConfig:

org.apache.druid.data.input.parquet.DruidParquetInputFormat for parquet
org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat for parquet-avro

Both parsers support automatic field discovery and flattening when given a flattenSpec in a parseSpec whose format is parquet or avro. Parquet nested list and map logical types should work correctly with JSON path expressions for all supported types. parquet-avro sets the Hadoop job property parquet.avro.add-list-element-records to false (it normally defaults to true) in order to 'unwrap' primitive list elements into multi-value dimensions.
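As a rough illustration of the discovery and flattening described above, the sketch below pairs a made-up input record with the row one would expect after flattening. The wrapper keys `_note`, `inputRecord`, and `flattenedRow` are hypothetical and exist only for this illustration; they are not Druid spec fields, and the record values are invented. It assumes that field discovery picks up root-level primitives and lists of primitives, while nested objects need an explicit path entry.

```json
{
  "_note": "Illustration only: inputRecord and flattenedRow are hypothetical wrapper keys, not part of any Druid spec",
  "inputRecord": {
    "dim1": "a",
    "listDim": ["x", "y"],
    "nestedData": { "dim1": "b" }
  },
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "nestedDim", "expr": "$.nestedData.dim1" }
    ]
  },
  "flattenedRow": {
    "dim1": "a",
    "listDim": ["x", "y"],
    "nestedDim": "b"
  }
}
```

Here `dim1` and `listDim` come from field discovery, and `nestedDim` comes from the explicit path expression.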

The parquet parser supports int96 Parquet values, while parquet-avro does not. There may also be subtle differences between the two parsers in how the JSON path expressions of a flattenSpec are evaluated.

We suggest using parquet over parquet-avro to allow ingesting data beyond the schema constraints of Avro conversion. However, parquet-avro was the original basis for this extension, and as such it is a bit more mature.

Field	Type	Description	Required
type	String	Choose `parquet` or `parquet-avro` to determine how Parquet files are parsed	yes
parseSpec	JSON Object	Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`, `parquet`, `avro` (if used with avro conversion).	yes
binaryAsString	Boolean	Specifies whether a Parquet bytes column that is not logically marked as a string or enum type should be treated as a UTF-8 encoded string.	no (default == false)
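The examples further down leave `binaryAsString` at its default. As a minimal sketch of where the field sits, assuming it is set at the parser level alongside `type` and `parseSpec` as the table lists it, a parser that reads unannotated binary columns as strings might look like this (the dimension names are placeholders borrowed from the examples in this document):

```json
{
  "parser": {
    "type": "parquet",
    "binaryAsString": true,
    "parseSpec": {
      "format": "timeAndDims",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": ["dim1", "dim2"]
      }
    }
  }
}
```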

When the time dimension is a DateType column, do not supply a format. When the column is a UTF8 (String) type, either auto or an explicitly defined format is required.
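For a string timestamp column, an explicit format might be given as in the sketch below. The pattern is only an assumed example of a Joda-style format string and should match how the timestamps are actually written in the file. The column name is reused from the examples in this document.

```json
{
  "timestampSpec": {
    "column": "timestamp",
    "format": "yyyy-MM-dd HH:mm:ss"
  }
}
```

For a DateType column, the same `timestampSpec` would carry only the `column` entry.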

Examples

`parquet` parser, `parquet` parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

`parquet` parser, `timeAndDims` parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

`parquet-avro` parser, `avro` parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

For additional details, see the Hadoop ingestion and general ingestion spec documentation.
