Ingestion Spec

A Druid ingestion spec consists of 3 components:

{
  "dataSchema" : {...},
  "ioConfig" : {...},
  "tuningConfig" : {...}
}

Field	Type	Description	Required
dataSchema	JSON Object	Specifies the schema of the incoming data. All ingestion specs can share the same dataSchema.	yes
ioConfig	JSON Object	Specifies where the data is coming from and where the data is going. This object will vary with the ingestion method.	yes
tuningConfig	JSON Object	Specifies how to tune various ingestion parameters. This object will vary with the ingestion method.	no

DataSchema

An example dataSchema is shown below:

"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
        "dimensionExclusions" : [],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [{
    "type" : "count",
    "name" : "count"
  }, {
    "type" : "doubleSum",
    "name" : "added",
    "fieldName" : "added"
  }, {
    "type" : "doubleSum",
    "name" : "deleted",
    "fieldName" : "deleted"
  }, {
    "type" : "doubleSum",
    "name" : "delta",
    "fieldName" : "delta"
  }],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  }
}

Field	Type	Description	Required
dataSource	String	The name of the ingested datasource. Datasources can be thought of as tables.	yes
parser	JSON Object	Specifies how ingested data can be parsed.	yes
metricsSpec	JSON Object array	A list of aggregators.	yes
granularitySpec	JSON Object	Specifies how to create segments and roll up data.	yes

Parser

If type is not included, the parser defaults to string.

String Parser

Field	Type	Description	Required
type	String	This should say `string`.	no
parseSpec	JSON Object	Specifies the format of the data.	yes

Protobuf Parser

Field	Type	Description	Required
type	String	This should say `protobuf`.	no
parseSpec	JSON Object	Specifies the format of the data.	yes

ParseSpec

If format is not included, the parseSpec defaults to tsv.

JSON ParseSpec

Field	Type	Description	Required
format	String	This should say `json`.	no
timestampSpec	JSON Object	Specifies the column and format of the timestamp.	yes
dimensionsSpec	JSON Object	Specifies the dimensions of the data.	yes

JSON Lowercase ParseSpec

This is a special variation of the JSON ParseSpec that lowercases all the column names in the incoming JSON data. This parseSpec is required if all of the following apply: you are updating from Druid 0.6.x to Druid 0.7.x, you directly ingest JSON with mixed-case column names, you have no ETL in place to lowercase those column names, and you want to run queries that span data created with both 0.6.x and 0.7.x.

Field	Type	Description	Required
format	String	This should say `jsonLowercase`.	yes
timestampSpec	JSON Object	Specifies the column and format of the timestamp.	yes
dimensionsSpec	JSON Object	Specifies the dimensions of the data.	yes

CSV ParseSpec

Field	Type	Description	Required
format	String	This should say `csv`.	yes
timestampSpec	JSON Object	Specifies the column and format of the timestamp.	yes
dimensionsSpec	JSON Object	Specifies the dimensions of the data.	yes
listDelimiter	String	A custom delimiter for multi-value dimensions.	no (default == ctrl+A)
columns	JSON String array	Specifies the columns of the data.	yes
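
As a minimal sketch of how the fields above fit together, a CSV parseSpec might look like the following. The column names are borrowed from the wikipedia example earlier on this page, and the `|` list delimiter is an illustrative assumption rather than a recommended value.

```json
"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "dimensionsSpec" : {
    "dimensions" : ["page", "language", "user"],
    "dimensionExclusions" : [],
    "spatialDimensions" : []
  },
  "listDelimiter" : "|",
  "columns" : ["timestamp", "page", "language", "user", "added", "deleted", "delta"]
}
```

Here `columns` names every field in the input, while `dimensions` lists only the subset to ingest as dimensions.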

TSV ParseSpec

Field	Type	Description	Required
format	String	This should say `tsv`.	yes
timestampSpec	JSON Object	Specifies the column and format of the timestamp.	yes
dimensionsSpec	JSON Object	Specifies the dimensions of the data.	yes
delimiter	String	A custom delimiter for data values.	no (default == \t)
listDelimiter	String	A custom delimiter for multi-value dimensions.	no (default == ctrl+A)
columns	JSON String array	Specifies the columns of the data.	yes
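
A TSV parseSpec differs from the CSV one mainly in the optional `delimiter` field. The sketch below assumes hypothetical input whose values are separated by semicolons instead of tabs, with multi-value dimensions separated by commas; the column names again come from the wikipedia example.

```json
"parseSpec" : {
  "format" : "tsv",
  "timestampSpec" : {
    "column" : "timestamp",
    "format" : "auto"
  },
  "dimensionsSpec" : {
    "dimensions" : ["page", "user"],
    "dimensionExclusions" : [],
    "spatialDimensions" : []
  },
  "delimiter" : ";",
  "listDelimiter" : ",",
  "columns" : ["timestamp", "page", "user", "added"]
}
```

Omitting `delimiter` and `listDelimiter` leaves the defaults shown in the table in place.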

Timestamp Spec

Field	Type	Description	Required
column	String	The column of the timestamp.	yes
format	String	iso, millis, posix, auto or any Joda time format.	no (default == 'auto')
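
For illustration, a timestampSpec that reads a hypothetical `ts` column using an explicit Joda time pattern could be sketched as follows. Both the column name and the pattern are assumptions chosen for the example.

```json
"timestampSpec" : {
  "column" : "ts",
  "format" : "yyyy-MM-dd HH:mm:ss"
}
```

If the column held one of the named formats instead, the pattern would be replaced by that name, for example `millis`.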

DimensionsSpec

Field	Type	Description	Required
dimensions	JSON String array	The names of the dimensions. If this is an empty array, Druid will treat all columns that are not timestamp or metric columns as dimension columns.	yes
dimensionExclusions	JSON String array	The names of dimensions to exclude from ingestion.	no (default == [])
spatialDimensions	JSON Object array	An array of spatial dimensions.	no (default == [])
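
The empty-array behavior described above can be sketched as follows. With `dimensions` left empty, Druid treats every column that is not the timestamp or a metric input as a dimension; the two excluded names are taken from the wikipedia example and are only illustrative.

```json
"dimensionsSpec" : {
  "dimensions" : [],
  "dimensionExclusions" : ["unpatrolled", "robot"],
  "spatialDimensions" : []
}
```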

GranularitySpec

The default granularity spec is uniform.

Uniform Granularity Spec

This spec is used to generate segments with uniform intervals.

Field	Type	Description	Required
type	string	The type of granularity spec.	no (default == 'uniform')
segmentGranularity	string	The granularity to create segments at.	no (default == 'DAY')
queryGranularity	string	The minimum granularity at which results can be queried, and the granularity of the data stored inside the segment. For example, a value of "minute" means that data is aggregated at minute granularity: if several rows collide on the tuple (minute(timestamp), dimensions), Druid aggregates their values using the aggregators instead of storing individual rows.	no (default == 'NONE')
intervals	string	A list of intervals for the raw data being ingested. Ignored for real-time ingestion.	yes for batch, no for real-time
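
A uniform granularity spec that rolls data up to the minute might be sketched like this. The interval is the one used in the dataSchema example above; the `minute` value follows the description in the table.

```json
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "minute",
  "intervals" : [ "2013-08-31/2013-09-01" ]
}
```

Under this spec, two hypothetical events with identical dimension values and timestamps of 01:02:10 and 01:02:45 would be stored as a single row for the 01:02 minute. With the `count` and `doubleSum` aggregators from the dataSchema example, that row would hold a count of 2 and the sum of each metric.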

Arbitrary Granularity Spec

This spec is used to generate segments with arbitrary intervals (it tries to create evenly sized segments). This spec is not supported for real-time processing.

Field	Type	Description	Required
type	string	The type of granularity spec.	no (default == 'uniform')
queryGranularity	string	The minimum granularity at which results can be queried, and the granularity of the data stored inside the segment. For example, a value of "minute" means that data is aggregated at minute granularity: if several rows collide on the tuple (minute(timestamp), dimensions), Druid aggregates their values using the aggregators instead of storing individual rows.	no (default == 'NONE')
intervals	string	A list of intervals for the raw data being ingested. Ignored for real-time ingestion.	yes for batch, no for real-time
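
A sketch of an arbitrary granularity spec is shown below. The table does not state the `type` value for this spec; `arbitrary` is assumed here, and it must be set explicitly because the default is `uniform`. The two intervals are hypothetical and deliberately differ in length.

```json
"granularitySpec" : {
  "type" : "arbitrary",
  "queryGranularity" : "NONE",
  "intervals" : [ "2013-08-31/2013-09-01", "2013-09-01/2013-09-08" ]
}
```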

IO Config

Real-time ingestion: see Real-time ingestion.
Batch ingestion: see Batch ingestion.

Tuning Config

Real-time ingestion: see Real-time ingestion.
Batch ingestion: see Batch ingestion.

Evaluating Timestamp, Dimensions and Metrics

Druid will interpret dimensions, dimension exclusions, and metrics in the following order:

Any column listed in the list of dimensions is treated as a dimension.
Any column listed in the list of dimension exclusions is excluded as a dimension.
The timestamp column and columns/fieldNames required by metrics are excluded by default.
If a metric is also listed as a dimension, the metric must have a different name than the dimension name.
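
The rules above can be illustrated with a trimmed dataSchema. In this sketch, `page` and `added` are ingested as dimensions, `robot` is excluded, and the metric that sums the `added` column is given the hypothetical name `added_sum` so that it does not clash with the dimension of the same source column.

```json
"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions" : ["page", "added"],
        "dimensionExclusions" : ["robot"],
        "spatialDimensions" : []
      }
    }
  },
  "metricsSpec" : [{
    "type" : "doubleSum",
    "name" : "added_sum",
    "fieldName" : "added"
  }],
  "granularitySpec" : {
    "segmentGranularity" : "DAY",
    "queryGranularity" : "NONE",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  }
}
```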
