druid/docs/development/extensions-core/parquet.md

7.3 KiB

id title
parquet Apache Parquet Extension

This Apache Druid (incubating) module extends Druid Hadoop based indexing to ingest data directly from offline Apache Parquet files.

Note: druid-parquet-extensions depends on the druid-avro-extensions module, so be sure to include both.

Parquet Hadoop Parser

This extension provides two ways to parse Parquet files:

  • parquet - using a simple conversion contained within this extension
  • parquet-avro - conversion to avro records with the parquet-avro library and using the druid-avro-extensions module to parse the avro data

Selection of conversion method is controlled by parser type, and the correct hadoop input format must also be set in the ioConfig:

  • org.apache.druid.data.input.parquet.DruidParquetInputFormat for parquet
  • org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat for parquet-avro

Both parse options support auto field discovery and flattening if provided with a flattenSpec with parquet or avro as the format. Parquet nested list and map logical types should operate correctly with json path expressions for all supported types. parquet-avro sets a hadoop job property parquet.avro.add-list-element-records to false (which normally defaults to true), in order to 'unwrap' primitive list elements into multi-value dimensions.

The parquet parser supports int96 Parquet values, while parquet-avro does not. There may also be some subtle differences in the behavior of json path expression evaluation of flattenSpec.

We suggest using parquet over parquet-avro to allow ingesting data beyond the schema constraints of Avro conversion. However, parquet-avro was the original basis for this extension, and as such it is a bit more mature.

Field Type Description Required
type String Choose parquet or parquet-avro to determine how Parquet files are parsed yes
parseSpec JSON Object Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are timeAndDims, parquet, avro (if used with avro conversion). yes
binaryAsString Boolean Specifies if the bytes parquet column which is not logically marked as a string or enum type should be converted to strings anyway. no(default == false)

When the time dimension is a DateType column, a format should not be supplied. When the format is UTF8 (String), either auto or a explicitly defined format is required.

Examples

parquet parser, parquet parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
    }
  }
}

parquet parser, timeAndDims parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

parquet-avro parser, avro parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
    }
  }
}

For additional details see Hadoop ingestion and general ingestion spec documentation.