Remove the metadataUpdateSpec from specfile (#3973)

Get rid of the metadataUpdateSpec section in the JSON example for
ingesting Parquet into Druid. When this element is present, the
indexing job will fail to start.
Fokko Driesprong 2017-03-01 23:24:36 +01:00 committed by Gian Merlino
parent 8316b4f48f
commit add17fa7db
1 changed file with 75 additions and 16 deletions


@@ -2,15 +2,15 @@
layout: doc_page
---
# Parquet
# Ingestion using Parquet format
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-avro-extensions` and `druid-parquet-extensions`.
To use this extension, make sure to [include](../../operations/including-extensions.html) both `druid-avro-extensions` and `druid-parquet-extensions`.
This extension enables Druid to ingest and understand the Apache Parquet data format offline.
## Parquet Hadoop Parser
This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inputSpec in ioConfig must be set to `"io.druid.data.input.parquet.DruidParquetInputFormat"`. Make sure also to include "io.druid.extensions:druid-avro-extensions" as an extension.
This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.parquet.DruidParquetInputFormat"`.
|Field | Type | Description | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
@@ -18,7 +18,77 @@ This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inp
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a timeAndDims parseSpec. | yes |
| binaryAsString | Boolean | Specifies if Parquet byte (binary) columns should be converted to strings. | no (default == false) |
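For instance, to decode Parquet byte columns as strings, the flag sits at the parser level next to `parseSpec` (a minimal sketch; the `parseSpec` fields are abbreviated from the full example below):
```json
{
  "type": "parquet",
  "binaryAsString": true,
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": { "column": "time", "format": "auto" },
    "dimensionsSpec": { "dimensions": ["name"] }
  }
}
```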
For example:
### Example JSON for overlord
When posting the index job to the overlord, you must set the correct `inputFormat` to switch to Parquet ingestion. Also set `jobProperties` so that HDFS paths are handled independently of the local timezone (see the note after the example):
```json
{
"type": "index_hadoop",
"spec": {
"ioConfig": {
"type": "hadoop",
"inputSpec": {
"type": "static",
"inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
"paths": "no_metrics"
}
},
"dataSchema": {
"dataSource": "no_metrics",
"parser": {
"type": "parquet",
"parseSpec": {
"format": "timeAndDims",
"timestampSpec": {
"column": "time",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [
"name"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [{
"type": "count",
"name": "count"
}],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "DAY",
"queryGranularity": "ALL",
"intervals": ["2015-12-31/2016-01-02"]
}
},
"tuningConfig": {
"type": "hadoop",
"partitionsSpec": {
"targetPartitionSize": 5000000
},
"jobProperties" : {},
"leaveIntermediate": true
}
}
}
```
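Note that the example above leaves `jobProperties` empty. One way to make HDFS paths timezone-independent, as mentioned before the example, is to pin the MapReduce task JVMs to UTC via the standard Hadoop properties; a sketch (the exact values are assumptions and depend on your cluster):
```json
"jobProperties" : {
  "mapreduce.map.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8"
}
```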
### Example JSON for standalone JVM
When using a standalone JVM instead, additional configuration fields are required. You can fire a Hadoop job with your locally compiled jars like this:
```bash
HADOOP_CLASS_PATH=`hadoop classpath | sed 's/*.jar/*/g'`
java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/* \
io.druid.cli.Main index hadoop \
wikipedia_hadoop_parquet_job.json
```
An example index JSON when using the standalone JVM:
```json
{
@@ -83,15 +153,4 @@
}
```
Almost all the fields listed above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`). Set `jobProperties` to make HDFS paths independent of the local timezone.
There is no need to update your cluster to a SNAPSHOT version; you can just fire a Hadoop job with your locally compiled jars like:
```bash
HADOOP_CLASS_PATH=`hadoop classpath | sed 's/*.jar/*/g'`
java -Xmx32m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath config/overlord:config/_common:lib/*:$HADOOP_CLASS_PATH:extensions/druid-avro-extensions/* \
io.druid.cli.Main index hadoop \
wikipedia_hadoop_parquet_job.json
```
Almost all the fields listed above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`).
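For reference, the element this commit removes looked roughly like the sketch below; the field names come from the sentence above, while the `postgresql` type and all values are illustrative assumptions, not the original example's contents:
```json
"metadataUpdateSpec": {
  "type": "postgresql",
  "connectURI": "jdbc:postgresql://localhost/druid",
  "user": "druid",
  "password": "password",
  "segmentTable": "druid_segments"
}
```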