id: orc
title: ORC Extension

This Apache Druid module extends Druid's Hadoop-based indexing to ingest data directly from offline Apache ORC files.

To use this extension, make sure to include druid-orc-extensions in the extensions load list.
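
The extension is loaded like any other core extension, through the load list in the common runtime properties. A minimal sketch (the exact list depends on which other extensions your cluster uses):

druid.extensions.loadList=["druid-orc-extensions"]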

The druid-orc-extensions extension provides the ORC input format for native batch ingestion and the ORC Hadoop parser for Hadoop batch ingestion. Please see the corresponding docs for details.
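
For native batch ingestion, the ORC input format is selected by setting the inputFormat type to orc in the task's ioConfig. A minimal sketch; the inputSource and the rest of the ioConfig are elided, and binaryAsString is optional:

"ioConfig": {
  ...
  "inputFormat": {
    "type": "orc",
    "binaryAsString": false
  },
  ...
}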

Migration from 'contrib' extension

This extension, first available in version 0.15.0, replaces the previous 'contrib' extension, which was available until 0.14.0-incubating. While this extension can index any data the 'contrib' extension could, the JSON spec for the ingestion task is incompatible and will need to be modified to work with the newer 'core' extension.

To migrate to 0.15.0+:

  • In inputSpec of ioConfig, inputFormat must be changed from "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat" to "org.apache.orc.mapreduce.OrcInputFormat" (a combined example follows this list)
  • The 'contrib' extension supported a typeString property, which provided the schema of the ORC file. The types in this schema essentially had to be correct, but the column names did not need to match the file, which facilitated column renaming. In the 'core' extension, column renaming can be achieved with flattenSpec. For example, "typeString":"struct<time:string,name:string>" with the actual schema struct<_col0:string,_col1:string>, in order to preserve the Druid schema, would need to be replaced with:
"flattenSpec": {
  "fields": [
    {
      "type": "path",
      "name": "time",
      "expr": "$._col0"
    },
    {
      "type": "path",
      "name": "name",
      "expr": "$._col1"
    }
  ]
  ...
}
  • The 'contrib' extension supported a mapFieldNameFormat property, which provided a way to specify the dimension names produced when flattening OrcMap columns with primitive types. This functionality has also been replaced with flattenSpec. For example, "mapFieldNameFormat": "<PARENT>_<CHILD>" for a dimension nestedData_dim1, to preserve the Druid schema, could be replaced with:
"flattenSpec": {
 "fields": [
   {
     "type": "path",
     "name": "nestedData_dim1",
     "expr": "$.nestedData.dim1"
   }
 ]
 ...
}
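
Putting the pieces together, a migrated Hadoop ingestion spec would contain fragments roughly like the following sketch. The paths, dataSource, and timestamp column are placeholders, and only the migration-relevant fields are shown:

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
    "paths": "path/to/example.orc"
  }
},
"dataSchema": {
  "dataSource": "example",
  "parser": {
    "type": "orc",
    "parseSpec": {
      "format": "orc",
      "timestampSpec": {
        "column": "time",
        "format": "auto"
      },
      "flattenSpec": {
        "fields": [
          {
            "type": "path",
            "name": "nestedData_dim1",
            "expr": "$.nestedData.dim1"
          }
        ]
      },
      ...
    }
  },
  ...
}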