druid/parquet.md at ea8e4066f6a6f50487ec6cd2bc4e0ea78731b025

7.3 KiB

Raw Blame History

id	title
parquet	Apache Parquet Extension

This Apache Druid (incubating) module extends Druid Hadoop based indexing to ingest data directly from offline Apache Parquet files.

Note: druid-parquet-extensions depends on the druid-avro-extensions module, so be sure to include both.

Parquet Hadoop Parser

This extension provides two ways to parse Parquet files:

parquet - using a simple conversion contained within this extension
parquet-avro - conversion to avro records with the parquet-avro library and using the druid-avro-extensions module to parse the avro data

Selection of conversion method is controlled by parser type, and the correct hadoop input format must also be set in the ioConfig:

org.apache.druid.data.input.parquet.DruidParquetInputFormat for parquet
org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat for parquet-avro

Both parse options support auto field discovery and flattening if provided with a flattenSpec with parquet or avro as the format. Parquet nested list and map logical types should operate correctly with JSON path expressions for all supported types. parquet-avro sets a hadoop job property parquet.avro.add-list-element-records to false (which normally defaults to true), in order to 'unwrap' primitive list elements into multi-value dimensions.

The parquet parser supports int96 Parquet values, while parquet-avro does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of flattenSpec.

We suggest using parquet over parquet-avro to allow ingesting data beyond the schema constraints of Avro conversion. However, parquet-avro was the original basis for this extension, and as such it is a bit more mature.

Field	Type	Description	Required
type	String	Choose `parquet` or `parquet-avro` to determine how Parquet files are parsed	yes
parseSpec	JSON Object	Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims`, `parquet`, `avro` (if used with avro conversion).	yes
binaryAsString	Boolean	Specifies if the bytes parquet column which is not logically marked as a string or enum type should be converted to strings anyway.	no(default == false)

When the time dimension is a DateType column, a format should not be supplied. When the format is UTF8 (String), either auto or a explicitly defined format is required.

Examples

`parquet` parser, `parquet` parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
    }
  }
}

`parquet` parser, `timeAndDims` parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

`parquet-avro` parser, `avro` parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
    }
  }
}

For additional details see Hadoop ingestion and general ingestion spec documentation.

7.3 KiB Raw Blame History

Parquet Hadoop Parser

Examples

parquet parser, parquet parseSpec

parquet parser, timeAndDims parseSpec

parquet-avro parser, avro parseSpec

7.3 KiB

Raw Blame History

`parquet` parser, `parquet` parseSpec

`parquet` parser, `timeAndDims` parseSpec

`parquet-avro` parser, `avro` parseSpec