--- layout: doc_page title: "ORC Extension" --- # ORC Extension This Apache Druid (incubating) module extends [Druid Hadoop based indexing](../../ingestion/hadoop.html) to ingest data directly from offline Apache ORC files. To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-orc-extensions`. ## ORC Hadoop Parser The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.orc.mapreduce.OrcInputFormat"`. |Field | Type | Description | Required| |----------|-------------|----------------------------------------------------------------------------------------|---------| |type | String | This should say `orc` | yes| |parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes| The parser supports two `parseSpec` formats: `orc` and `timeAndDims`. `orc` supports auto field discovery and flattening, if specified with a [flattenSpec](../../ingestion/flatten-json.html). If no `flattenSpec` is specified, `useFieldDiscovery` will be enabled by default. Specifying a `dimensionSpec` is optional if `useFieldDiscovery` is enabled: if a `dimensionSpec` is supplied, the list of `dimensions` it defines will be the set of ingested dimensions, if missing the discovered fields will make up the list. `timeAndDims` parse spec must specify which fields will be extracted as dimensions through the `dimensionSpec`. [All column types](https://orc.apache.org/docs/types.html) are supported, with the exception of `union` types. Columns of `list` type, if filled with primitives, may be used as a multi-value dimension, or specific elements can be extracted with `flattenSpec` expressions. Likewise, primitive fields may be extracted from `map` and `struct` types in the same manner. Auto field discovery will automatically create a string dimension for every (non-timestamp) primitive or `list` of primitives, as well as any flatten expressions defined in the `flattenSpec`. ### Hadoop Job Properties Like most Hadoop jobs, the best outcomes will add `"mapreduce.job.user.classpath.first": "true"` or `"mapreduce.job.classloader": "true"` to the `jobProperties` section of `tuningConfig`. Note that it is likely if using `"mapreduce.job.classloader": "true"` that you will need to set `mapreduce.job.classloader.system.classes` to include `-org.apache.hadoop.hive.` to instruct Hadoop to load `org.apache.hadoop.hive` classes from the application jars instead of system jars, e.g. ```json ... "mapreduce.job.classloader": "true", "mapreduce.job.classloader.system.classes" : "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml", ... ``` This is due to the `hive-storage-api` dependency of the `orc-mapreduce` library, which provides some classes under the `org.apache.hadoop.hive` package. If instead using the setting `"mapreduce.job.user.classpath.first": "true"`, then this will not be an issue. ### Examples #### `orc` parser, `orc` parseSpec, auto field discovery, flatten expressions ```json { "type": "index_hadoop", "spec": { "ioConfig": { "type": "hadoop", "inputSpec": { "type": "static", "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", "paths": "path/to/file.orc" }, ... }, "dataSchema": { "dataSource": "example", "parser": { "type": "orc", "parseSpec": { "format": "orc", "flattenSpec": { "useFieldDiscovery": true, "fields": [ { "type": "path", "name": "nestedDim", "expr": "$.nestedData.dim1" }, { "type": "path", "name": "listDimFirstItem", "expr": "$.listDim[1]" } ] }, "timestampSpec": { "column": "timestamp", "format": "millis" } } }, ... }, "tuningConfig": } } } ``` #### `orc` parser, `orc` parseSpec, field discovery with no flattenSpec or dimensionSpec ```json { "type": "index_hadoop", "spec": { "ioConfig": { "type": "hadoop", "inputSpec": { "type": "static", "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", "paths": "path/to/file.orc" }, ... }, "dataSchema": { "dataSource": "example", "parser": { "type": "orc", "parseSpec": { "format": "orc", "timestampSpec": { "column": "timestamp", "format": "millis" } } }, ... }, "tuningConfig": } } } ``` #### `orc` parser, `orc` parseSpec, no autodiscovery ```json { "type": "index_hadoop", "spec": { "ioConfig": { "type": "hadoop", "inputSpec": { "type": "static", "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", "paths": "path/to/file.orc" }, ... }, "dataSchema": { "dataSource": "example", "parser": { "type": "orc", "parseSpec": { "format": "orc", "flattenSpec": { "useFieldDiscovery": false, "fields": [ { "type": "path", "name": "nestedDim", "expr": "$.nestedData.dim1" }, { "type": "path", "name": "listDimFirstItem", "expr": "$.listDim[1]" } ] }, "timestampSpec": { "column": "timestamp", "format": "millis" }, "dimensionsSpec": { "dimensions": [ "dim1", "dim3", "nestedDim", "listDimFirstItem" ], "dimensionExclusions": [], "spatialDimensions": [] } } }, ... }, "tuningConfig": } } } ``` #### `orc` parser, `timeAndDims` parseSpec ```json { "type": "index_hadoop", "spec": { "ioConfig": { "type": "hadoop", "inputSpec": { "type": "static", "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", "paths": "path/to/file.orc" }, ... }, "dataSchema": { "dataSource": "example", "parser": { "type": "orc", "parseSpec": { "format": "timeAndDims", "timestampSpec": { "column": "timestamp", "format": "auto" }, "dimensionsSpec": { "dimensions": [ "dim1", "dim2", "dim3", "listDim" ], "dimensionExclusions": [], "spatialDimensions": [] } } }, ... }, "tuningConfig": } } ``` ### Migration from 'contrib' extension This extension, first available in version 0.15.0, replaces the previous 'contrib' extension which was available until 0.14.0-incubating. While this extension can index any data the 'contrib' extension could, the json spec for the ingestion task is *incompatible*, and will need modified to work with the newer 'core' extension. To migrate to 0.15.0+: * In `inputSpec` of `ioConfig`, `inputFormat` must be changed from `"org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"` to `"org.apache.orc.mapreduce.OrcInputFormat"` * The 'contrib' extension supported a `typeString` property, which provided the schema of the ORC file, of which was essentially required to have the types correct, but notably _not_ the column names, which facilitated column renaming. In the 'core' extension, column renaming can be achieved with [`flattenSpec` expressions](../../ingestion/flatten-json.html). For example, `"typeString":"struct"` with the actual schema `struct<_col0:string,_col1:string>`, to preserve Druid schema would need replaced with: ```json "flattenSpec": { "fields": [ { "type": "path", "name": "time", "expr": "$._col0" }, { "type": "path", "name": "name", "expr": "$._col1" } ] ... } ``` * The 'contrib' extension supported a `mapFieldNameFormat` property, which provided a way to specify a dimension to flatten `OrcMap` columns with primitive types. This functionality has also been replaced with [`flattenSpec` expressions](../../ingestion/flatten-json.html). For example: `"mapFieldNameFormat": "_"` for a dimension `nestedData_dim1`, to preserve Druid schema could be replaced with ```json "flattenSpec": { "fields": [ { "type": "path", "name": "nestedData_dim1", "expr": "$.nestedData.dim1" } ] ... } ```