---
layout: doc_page
---

# Orc

To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-orc-extensions`.

This extension enables Druid to ingest data stored in the Apache ORC format through offline (batch) ingestion.

## Orc Hadoop Parser

This parser is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"`.

|Field     | Type        | Description                                                                            | Required|
|----------|-------------|----------------------------------------------------------------------------------------|---------|
|type      | String      | Must be `orc`.                                                                         | yes|
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is accepted, but only its TimestampSpec and DimensionsSpec are used. | yes|
|typeString| String      | String representation of the ORC struct type info. If not specified, it is constructed automatically from `parseSpec`, but all metric columns are dropped. | no|

As an example of `typeString`, a string column `col1` and an array-of-strings column `col2` are represented by `"struct<col1:string,col2:array<string>>"`.

Currently, only Java primitive types and arrays of Java primitive types are supported. Of the compound types listed in [ORC types](https://orc.apache.org/docs/types.html), this means only `list` is supported, and a list of lists is not.
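
To make the type rules above concrete, here is a minimal sketch of a parser block that declares one string column and one list-of-strings column. The column names `time`, `col1`, and `col2` are illustrative assumptions, not names required by the extension. Treating `col2` as a dimension is also an assumption about how you would want to index it.

```json
{
  "type": "orc",
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "time",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["col1", "col2"]
    }
  },
  "typeString": "struct<time:string,col1:string,col2:array<string>>"
}
```

A nested declaration such as `array<array<string>>` falls outside what is described as supported, so it should be avoided in `typeString`.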

An example Hadoop indexing spec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
        "paths": "/data/path/in/HDFS/"
      },
      "metadataUpdateSpec": {
        "type": "postgresql",
        "connectURI": "jdbc:postgresql://localhost/druid",
        "user" : "druid",
        "password" : "asdf",
        "segmentTable": "druid_segments"
      },
      "segmentOutputPath": "tmp/segments"
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        },
        "typeString": "struct<time:string,name:string>"
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": "tmp/working_path",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {},
      "leaveIntermediate": true
    }
  }
}
```

Almost all of the fields shown above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`). Set `jobProperties` so that HDFS paths are resolved independently of the time zone.
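
One possible way to do this, offered as a sketch and not a required configuration, is to pin the JVM time zone of the map and reduce tasks. `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` are standard Hadoop properties. The choice of UTC and the `-Dfile.encoding` flag are assumptions for illustration and are not values stated in this page.

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
      "mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8"
    }
  }
}
```

Merge the `jobProperties` entries into the existing `tuningConfig` of your spec instead of replacing the other tuning fields.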