---
layout: doc_page
---

# Orc

To use this extension, make sure to include `druid-orc-extensions`.

This extension enables Druid to ingest and understand the Apache Orc data format offline.

## Orc Hadoop Parser

This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"`.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should say `orc`.|yes|
|parseSpec|JSON Object|Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec can be used, but only its TimestampSpec and DimensionsSpec are used.|yes|
|typeString|String|String representation of the Orc struct type info. If not specified, it is auto-constructed from the parseSpec, but all metric columns are dropped.|no|

For example, the `typeString` for a string column `col1` and an array-of-string column `col2` is `"struct<col1:string,col2:array<string>>"`.

Currently, only Java primitive types and arrays of Java primitive types are supported, which means that among Orc's compound types only `list` is supported (a list of lists is not supported).
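As an illustration, the sketch below (a made-up parser snippet, not taken from a real ingestion spec) shows how a hypothetical ORC file with a string `time` column and an array-of-string `tags` column might be described; the array column is simply listed as an ordinary dimension:

```json
{
  "type": "orc",
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "time",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["tags"]
    }
  },
  "typeString": "struct<time:string,tags:array<string>>"
}
```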

Example of a Hadoop indexing task:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
        "paths": "/data/path/in/HDFS/"
      },
      "metadataUpdateSpec": {
        "type": "postgresql",
        "connectURI": "jdbc:postgresql://localhost/druid",
        "user" : "druid",
        "password" : "asdf",
        "segmentTable": "druid_segments"
      },
      "segmentOutputPath": "tmp/segments"
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "name"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        },
        "typeString": "struct<time:string,name:string>"
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2015-12-31/2016-01-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": "tmp/working_path",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {},
      "leaveIntermediate": true
    }
  }
}
```

Almost all the fields listed above are required, including `inputFormat` and `metadataUpdateSpec` (`type`, `connectURI`, `user`, `password`, `segmentTable`). Set `jobProperties` to make HDFS paths timezone-independent.
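One way to do this (a hedged suggestion based on general Druid Hadoop indexing practice, not something specific to the ORC extension) is to pin the JVM timezone of the map and reduce tasks to UTC through `jobProperties`, for example:

```json
"jobProperties" : {
  "mapreduce.map.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8",
  "mapreduce.reduce.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8"
}
```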