---
id: orc
title: "ORC Extension"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

## ORC extension

This Apache Druid extension enables Druid to ingest and understand the Apache ORC data format.

The extension provides the [ORC input format](../../ingestion/data-formats.md#orc) and the [ORC Hadoop parser](../../ingestion/data-formats.md#orc-hadoop-parser)
for [native batch ingestion](../../ingestion/native-batch.md) and [Hadoop batch ingestion](../../ingestion/hadoop.md), respectively.
See the corresponding docs for details.

To use this extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-orc-extensions`.

### Migration from 'contrib' extension
This extension, first available in version 0.15.0, replaces the previous 'contrib' extension, which was available until
0.14.0-incubating. While this extension can index any data the 'contrib' extension could, the JSON spec for the
ingestion task is *incompatible* and will need to be modified to work with the newer 'core' extension.

To migrate to 0.15.0+:

* In `inputSpec` of `ioConfig`, `inputFormat` must be changed from `"org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"` to
`"org.apache.orc.mapreduce.OrcInputFormat"`.
* The 'contrib' extension supported a `typeString` property, which provided the schema of the
ORC file. In practice only the types had to be correct, notably _not_ the column names, which
made column renaming possible. In the 'core' extension, column renaming can be achieved with
[`flattenSpec`](../../ingestion/index.md#flattenspec). For example, `"typeString":"struct<time:string,name:string>"`
with the actual schema `struct<_col0:string,_col1:string>` would, to preserve the Druid schema, need to be replaced with:

```json
"flattenSpec": {
  "fields": [
    {
      "type": "path",
      "name": "time",
      "expr": "$._col0"
    },
    {
      "type": "path",
      "name": "name",
      "expr": "$._col1"
    }
  ]
  ...
}
```

* The 'contrib' extension supported a `mapFieldNameFormat` property, which provided a way to specify a dimension to
flatten `OrcMap` columns with primitive types. This functionality has also been replaced with
[`flattenSpec`](../../ingestion/index.md#flattenspec). For example, `"mapFieldNameFormat": "<PARENT>_<CHILD>"`
for a dimension `nestedData_dim1` could, to preserve the Druid schema, be replaced with:

```json
"flattenSpec": {
  "fields": [
    {
      "type": "path",
      "name": "nestedData_dim1",
      "expr": "$.nestedData.dim1"
    }
  ]
  ...
}
```
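
Putting the migration steps above together, a migrated Hadoop ingestion spec might look roughly like the following sketch. It is trimmed to the parts this page discusses: `granularitySpec`, `metricsSpec`, and `tuningConfig` are omitted, and the data source name, input path, `timestampSpec`, and `dimensionsSpec` values are placeholders assumed for illustration. The parser layout (`"type": "orc"` with an `"orc"` parse spec) is assumed from the ORC Hadoop parser documentation linked at the top of this page, so check that page for the options your Druid version supports.

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "/path/to/example.orc"
      }
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "flattenSpec": {
            "fields": [
              { "type": "path", "name": "time", "expr": "$._col0" },
              { "type": "path", "name": "name", "expr": "$._col1" }
            ]
          },
          "timestampSpec": { "column": "time", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["name"] }
        }
      }
    }
  }
}
```

Here the `flattenSpec` entries take over the renaming that `typeString` used to provide: the ORC columns `_col0` and `_col1` are exposed to Druid as `time` and `name`, so the timestamp and dimension specs can keep referring to the names used before the migration.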