druid/docs/content/ingestion/update-existing-data.md

4.3 KiB

layout
doc_page

Once you ingest some data in a dataSource for an interval, You might want to make following kind of changes to existing data.

Reindexing

You ingested some raw data to a dataSource A and later you want to re-index and create another dataSource B which has a subset of columns or different granularity. Or, you may want to change granularity of data in A itself for some interval.

Delta-ingestion

You ingested some raw data to a dataSource A in an interval, later you want to "append" more data to same interval. This might happen because you used realtime ingestion originally and then received some late events.

Here are the Druid Features you could use to achieve above updates.

  • You can use batch ingestion to override an interval completely by doing the ingestion again with the raw data.
  • You can use re-indexing and delta-ingestion features provided by batch ingestion.

Re-indexing and Delta Ingestion with Hadoop Batch Ingestion

This section assumes the reader has understanding of batch ingestion using Hadoop. See HadoopIndexTask and further explained in batch-ingestion. You can use hadoop batch-ingestion to do re-indexing and delta-ingestion as well.

It is enabled by how Druid reads input data for doing hadoop batch ingestion. Druid uses specified inputSpec to know where the data to be ingested is located and how to read it. For simple hadoop batch ingestion you would use static or granularity spec types which allow you to read data stored on HDFS.

There are two other inputSpec types to enable reindexing and delta-ingestion.

dataSource

It is a type of inputSpec that reads data already stored inside druid. It is useful for doing "re-indexing".

Field Type Description Required
ingestionSpec Json Object Specification of druid segments to be loaded. See below. yes
maxSplitSize Number Enables combining multiple segments into single Hadoop InputSplit according to size of segments. Default is none. no

Here is what goes inside "ingestionSpec"

Field Type Description Required
dataSource String Druid dataSource name from which you are loading the data. yes
intervals List A list of strings representing ISO-8601 Intervals. yes
granularity String Defines the granularity of the query while loading data. Default value is "none".See Granularities. no
filter Json See Filters no
dimensions Array of String Name of dimension columns to load. By default, the list will be constructed from parseSpec. If parseSpec does not have explicit list of dimensions then all the dimension columns present in stored data will be read. no
metrics Array of String Name of metric columns to load. By default, the list will be constructed from the "name" of all the configured aggregators. no

For example

"ingestionSpec" :
    {
        "dataSource": "wikipedia",
        "intervals": ["2014-10-20T00:00:00Z/P2W"]
    }

multi

It is a composing inputSpec to combine two other input specs. It is useful for doing delta ingestion. Note that this is not idempotent operation, we might add some features in future to make it idempotent.

Field Type Description Required
children Array of Json Objects List of json objects containing other inputSpecs yes

For example

"children": [
    {
        "type" : "dataSource",
        "ingestionSpec" : {
            "dataSource": "wikipedia",
            "intervals": ["2014-10-20T00:00:00Z/P2W"]
        }
    },
    {
        "type" : "static",
        "paths": "/path/to/more/wikipedia/data/"
    }
]

Re-indexing with non-hadoop Batch Ingestion

This section assumes the reader has understanding of batch ingestion without hadoop using IndexTask which uses a "firehose" to know where and how to read the input data. IngestSegmentFirehose can be used to read data from segments inside Druid. Note that IndexTask is to be used for prototyping purposes only as it has to do all processing inside a single process and can't scale, please use hadoop batch ingestion for realistic scenarios such as dealing with data more than a GB.