diff --git a/docs/content/development/extensions-contrib/parquet.md b/docs/content/development/extensions-contrib/parquet.md
index e64e366f545..b3732f52d6e 100644
--- a/docs/content/development/extensions-contrib/parquet.md
+++ b/docs/content/development/extensions-contrib/parquet.md
@@ -2,15 +2,15 @@
layout: doc_page
---

-# Parquet
+# Ingestion using Parquet format

-To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-avro-extensions` and `druid-parquet-extensions`.
+To use this extension, make sure to [include](../../operations/including-extensions.html) both `druid-avro-extensions` and `druid-parquet-extensions`.

This extension enables Druid to ingest and understand the Apache Parquet data format offline.

## Parquet Hadoop Parser

-This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inputSpec in ioConfig must be set to `"io.druid.data.input.parquet.DruidParquetInputFormat"`. Make sure also to include "io.druid.extensions:druid-avro-extensions" as an extension.
+This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.parquet.DruidParquetInputFormat"`.

|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
@@ -18,7 +18,77 @@ This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inp
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a timeAndDims parseSpec. | yes |
| binaryAsString | Boolean | Specifies if the bytes parquet column should be converted to strings. | no(default == false) |

-For example:
+### Example JSON for overlord

+When posting the index job to the overlord, setting the correct `inputFormat` is required to switch to Parquet ingestion. Make sure to set `jobProperties` so that the HDFS paths used by the job do not depend on the local timezone:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "no_metrics"
      }
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {},
      "leaveIntermediate": true
    }
  }
}
```
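The example above leaves `jobProperties` empty. As an illustrative sketch only (not part of the example), one way to keep the job's paths timezone-independent is to pass the same `-Duser.timezone=UTC -Dfile.encoding=UTF-8` flags to the map and reduce JVMs through the standard Hadoop `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` properties:

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "targetPartitionSize": 5000000
    },
    "jobProperties": {
      "mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
      "mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8"
    },
    "leaveIntermediate": true
  }
}
```

Adjust these values to whatever your cluster already uses; the point is only that the indexing JVMs should agree on a timezone.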
+### Example JSON for standalone JVM

+When using a standalone JVM instead, additional configuration fields are required. You can launch a Hadoop job with your locally compiled jars like this:

```bash
HADOOP_CLASS_PATH=`hadoop classpath | sed s/*.jar/*/g`

java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/* \
  io.druid.cli.Main index hadoop \
  wikipedia_hadoop_parquet_job.json
```

An example index JSON when using the standalone JVM:

```json
{
@@ -83,15 +153,4 @@ For example:
}
```
-Almost all the fields listed above are required, including `inputFormat`, `metadataUpdateSpec`(`type`, `connectURI`, `user`, `password`, `segmentTable`). Set `jobProperties` to make hdfs path timezone unrelated.
-
-It is no need to make your cluster to update to SNAPSHOT, you can just fire a hadoop job with your local compiled jars like:
-
-```bash
-HADOOP_CLASS_PATH=`hadoop classpath | sed s/*.jar/*/g`
-
-java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-  -classpath config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/* \
-  io.druid.cli.Main index hadoop \
-  wikipedia_hadoop_parquet_job.json
-```
+Almost all the fields listed above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`).
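Finally, note that neither example sets the optional `binaryAsString` flag from the parser field table above. A minimal, hypothetical parser block with it enabled (reusing the `time` and `name` columns from the overlord example) might look like this:

```json
{
  "parser": {
    "type": "parquet",
    "binaryAsString": true,
    "parseSpec": {
      "format": "timeAndDims",
      "timestampSpec": {
        "column": "time",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": ["name"]
      }
    }
  }
}
```

With the flag set to `true`, bytes Parquet columns are converted to strings during ingestion, as described in the field table.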