---
id: orc
title: "ORC Extension"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

This Apache Druid module extends [Druid Hadoop based indexing](../../ingestion/hadoop.md) to ingest data directly from offline
Apache ORC files.
To use this extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-orc-extensions`.

The `druid-orc-extensions` extension provides the [ORC input format](../../ingestion/data-formats.md#orc) and the [ORC Hadoop parser](../../ingestion/data-formats.md#orc-hadoop-parser)
for [native batch ingestion](../../ingestion/native-batch.md) and [Hadoop batch ingestion](../../ingestion/hadoop.md), respectively.
See the corresponding docs for details.
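
As a rough illustration of where the ORC input format fits in a native batch task, the fragment below sketches an `ioConfig` that reads ORC files. The `local` input source, the `baseDir` path, and the `filter` pattern are placeholder assumptions chosen for illustration, and the rest of the task spec is omitted; the linked input format docs remain the authoritative reference for the available properties.

```json
{
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "local",
      "baseDir": "/path/to/orc/files",
      "filter": "*.orc"
    },
    "inputFormat": {
      "type": "orc"
    }
  }
}
```
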
### Migration from 'contrib' extension

This extension, first available in version 0.15.0, replaces the previous 'contrib' extension, which was available until
0.14.0-incubating. While this extension can index any data the 'contrib' extension could, the JSON spec for the
ingestion task is *incompatible* and will need to be modified to work with the newer 'core' extension.

To migrate to 0.15.0+:

* In `inputSpec` of `ioConfig`, `inputFormat` must be changed from `"org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"` to
`"org.apache.orc.mapreduce.OrcInputFormat"`.
* The 'contrib' extension supported a `typeString` property, which provided the schema of the
ORC file. The types in it essentially had to be correct, but notably _not_ the column names, which
facilitated column renaming. In the 'core' extension, column renaming can be achieved with
[`flattenSpec`](../../ingestion/index.md#flattenspec). For example, `"typeString":"struct<time:string,name:string>"`
with the actual schema `struct<_col0:string,_col1:string>` would, to preserve the Druid schema, need to be replaced with:

```json
"flattenSpec": {
  "fields": [
    {
      "type": "path",
      "name": "time",
      "expr": "$._col0"
    },
    {
      "type": "path",
      "name": "name",
      "expr": "$._col1"
    }
  ]
  ...
}
```

* The 'contrib' extension supported a `mapFieldNameFormat` property, which provided a way to name the dimensions
produced by flattening `OrcMap` columns with primitive types. This functionality has also been replaced with
[`flattenSpec`](../../ingestion/index.md#flattenspec). For example, `"mapFieldNameFormat": "<PARENT>_<CHILD>"`
for a dimension `nestedData_dim1` could, to preserve the Druid schema, be replaced with:

```json
"flattenSpec": {
  "fields": [
    {
      "type": "path",
      "name": "nestedData_dim1",
      "expr": "$.nestedData.dim1"
    }
  ]
  ...
}
```
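
Putting the migration steps above together, a migrated Hadoop ingestion spec might look roughly like the following sketch. It assumes the `orc` parser with an `orc` parseSpec as described in the ORC Hadoop parser docs, and it reuses the `typeString` renaming example. The `paths` value, the `dataSource` name, and the `timestampSpec` are hypothetical placeholders, and other parts of a complete spec (such as `dimensionsSpec`, `granularitySpec`, and `tuningConfig`) are left out.

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "/path/to/example.orc"
      }
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "flattenSpec": {
            "fields": [
              { "type": "path", "name": "time", "expr": "$._col0" },
              { "type": "path", "name": "name", "expr": "$._col1" }
            ]
          },
          "timestampSpec": { "column": "time", "format": "auto" }
        }
      }
    }
  }
}
```

Here the `flattenSpec` takes over the role that `typeString` played in the 'contrib' extension: each `path` field maps an actual ORC column (`_col0`, `_col1`) to the name the Druid schema expects.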