mirror of https://github.com/apache/druid.git
updating the docs on how to do hadoop batch re-ingestion and delta ingestion
This commit is contained in:
parent a3bab5b7d9
commit cfd81bfac7

@@ -136,7 +136,7 @@ There are multiple types of inputSpecs:

##### `static`

Is a type of inputSpec where a static path to where the data files are located is passed.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
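
As a rough sketch only (the `static` field rows fall outside this hunk; the `paths` key is inferred from the `static` entry in the `multi` example further down), a static inputSpec might look like:

```
"inputSpec" : {
  "type" : "static",
  "paths" : "/path/to/data/file1,/path/to/data/file2"
}
```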

@@ -150,7 +150,7 @@ For example, using the static input paths:

##### `granularity`

Is a type of inputSpec that expects data to be laid out in a specific path format. Specifically, it expects it to be segregated by day in this directory format `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (dates are represented by lowercase, time is represented by uppercase).

|Field|Type|Description|Required|
|-----|----|-----------|--------|

@@ -166,6 +166,61 @@ s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
...
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
```
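
A corresponding `granularity` inputSpec for a layout like the one above might look roughly as follows; the field names (`dataGranularity`, `inputPath`, `filePattern`) are assumptions here, since the rows of the granularity field table are elided from this hunk:

```
"inputSpec" : {
  "type" : "granularity",
  "dataGranularity" : "hour",
  "inputPath" : "s3n://billy-bucket/the/data/is/here",
  "filePattern" : ".*"
}
```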

##### `dataSource`

It is a type of inputSpec that reads data already stored inside Druid. It is useful for "re-indexing": for example, you ingested some data in an interval and, at a later time, you want to change the granularity of the rows or remove some columns from the data stored in Druid.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|ingestionSpec|JSON object|Specification of Druid segments to be loaded. See below.|yes|
|maxSplitSize|Number|Enables combining multiple segments into a single Hadoop InputSplit according to segment size. Default is none.|no|

Here is what goes inside `ingestionSpec`:

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dataSource|String|Druid dataSource name from which you are loading the data.|yes|
|interval|String|A string representing an ISO-8601 interval.|yes|
|granularity|String|Defines the granularity of the query while loading data. Default value is "none". See [Granularities](../querying/granularities.html).|no|
|filter|JSON|See [Filters](../querying/filters.html)|no|
|dimensions|Array of String|Names of the dimension columns to load. By default, the list will be constructed from the parseSpec. If the parseSpec does not have an explicit list of dimensions, then all the dimension columns present in the stored data will be read.|no|
|metrics|Array of String|Names of the metric columns to load. By default, the list will be constructed from the "name" of all the configured aggregators.|no|

For example:

```
"ingestionSpec" :
{
  "dataSource": "wikipedia",
  "interval": "2014-10-20T00:00:00Z/P2W"
}
```
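
To show how the optional fields from the tables above fit together, here is a hedged sketch of a fuller `dataSource` inputSpec; the dimension and metric names and the `maxSplitSize` value are placeholders, not part of the original example:

```
"inputSpec" : {
  "type" : "dataSource",
  "ingestionSpec" : {
    "dataSource": "wikipedia",
    "interval": "2014-10-20T00:00:00Z/P2W",
    "granularity": "day",
    "dimensions": ["page", "language"],
    "metrics": ["added", "deleted"]
  },
  "maxSplitSize": 500000000
}
```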

##### `multi`

It is a composing inputSpec used to combine other inputSpecs. It is useful for doing "delta ingestion": for example, you ingested some data in an interval and, at a later time, you want to "append" more data to that interval. You can use this inputSpec to combine the `dataSource` and `static` (or other) inputSpecs to add more data to an already indexed interval.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|children|Array of JSON objects|List of JSON objects containing other inputSpecs.|yes|

For example:

```
"children": [
  {
    "type" : "dataSource",
    "ingestionSpec" : {
      "dataSource": "wikipedia",
      "interval": "2014-10-20T00:00:00Z/P2W"
    }
  },
  {
    "type" : "static",
    "paths": "/path/to/more/wikipedia/data/"
  }
]
```
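
In a full task spec these children sit inside the `multi` inputSpec itself; a minimal sketch of that wrapper, reusing the children shown above:

```
"inputSpec" : {
  "type" : "multi",
  "children": [
    { "type" : "dataSource", "ingestionSpec" : { "dataSource": "wikipedia", "interval": "2014-10-20T00:00:00Z/P2W" } },
    { "type" : "static", "paths": "/path/to/more/wikipedia/data/" }
  ]
}
```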

#### Metadata Update Job Spec