<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

---
layout: doc_page
title: "Tutorial: Loading a file"
---
# Tutorial: Loading a file

## Getting started

This tutorial demonstrates how to perform a batch file load, using Druid's native batch ingestion.

For this tutorial, we'll assume you've already downloaded Druid as described in 
the [single-machine quickstart](index.html) and have it running on your local machine. You 
don't need to have loaded any data yet.

## Preparing the data and the ingestion task spec

A data load is initiated by submitting an *ingestion task* spec to the Druid overlord. For this tutorial, we'll be loading the sample Wikipedia page edits data.

The Druid package includes a sample native batch ingestion task spec at `quickstart/tutorial/wikipedia-index.json`,
which has been configured to read the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` input file. It is shown here for convenience:

```json
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}
```

This spec will create a datasource named "wikipedia".

## Load batch data

We've included a sample of Wikipedia edits from September 12, 2015 to get you started.

To load this data into Druid, you can submit an *ingestion task* pointing to the file. We've included
a task that loads the `wikiticker-2015-09-12-sampled.json.gz` file included in the archive. 

For convenience, the Druid package includes a batch ingestion helper script at `bin/post-index-task`.

This script will POST an ingestion task to the Druid overlord and poll Druid until the data is available for querying.

Run the following command from the Druid package root:

```bash
bin/post-index-task --file quickstart/tutorial/wikipedia-index.json 
```

You should see output like the following:

```bash
Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-07-27T06:37:44.323Z
Task log:     http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/log
Task status:  http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/status
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data
```

## Querying your data

Once the data is loaded, please follow the [query tutorial](../tutorials/tutorial-query.html) to run some example queries on the newly loaded data.

## Cleanup

The other ingestion tutorials write to the same "wikipedia" datasource. If you wish to go through any of them, you will first need to shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the Druid package.
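The reset step can be sketched as a small shell function. This is only an illustration: `reset_druid_state` is a hypothetical helper name, not something shipped with Druid, and it assumes the cluster has already been shut down and that you pass it the Druid package root. It removes everything inside `var` but keeps the directory itself.

```bash
# Hypothetical helper (not part of the Druid package): delete everything
# inside <druid-package-root>/var while keeping the var directory itself.
# Assumes the cluster has already been shut down.
reset_druid_state() {
  druid_root="${1:-}"
  if [ -z "$druid_root" ]; then
    echo "usage: reset_druid_state <druid-package-root>" >&2
    return 1
  fi

  var_dir="$druid_root/var"
  if [ ! -d "$var_dir" ]; then
    # Nothing has been written yet, so there is nothing to clean up.
    echo "no var directory at $var_dir; nothing to remove" >&2
    return 0
  fi

  # -mindepth 1 skips var itself; -delete removes files and subdirectories,
  # including hidden ones.
  find "$var_dir" -mindepth 1 -delete
}
```

From the Druid package root, this would be invoked as `reset_druid_state .`.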

## Extra: Loading data without the script

Let's briefly discuss how we would've submitted the ingestion task without using the script. You do not need to run these commands.

To submit the task, POST it to Druid in a new terminal window from the apache-druid-#{DRUIDVERSION} directory:

```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8090/druid/indexer/v1/task
```

This prints the ID of the task if the submission was successful:

```bash
{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}
```
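If you want to reuse the task ID in later commands, it can be pulled out of that response. The following is a minimal sketch, assuming the response is the single-line JSON object shown above; `extract_task_id` is a hypothetical helper name, and the `sed` pattern is a simple text match rather than a real JSON parser.

```bash
# Hypothetical helper: read the overlord's JSON response on stdin and print
# the value of its "task" field. Assumes a single-line response shaped like
# {"task":"<task id>"}; this is a text match, not a full JSON parser.
extract_task_id() {
  sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example usage with the curl command from this section (not run here):
#   TASK_ID=$(curl -s -X 'POST' -H 'Content-Type:application/json' \
#     -d @quickstart/tutorial/wikipedia-index.json \
#     http://localhost:8090/druid/indexer/v1/task | extract_task_id)
```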

To view the status of the ingestion task, go to the overlord console:
[http://localhost:8090/console.html](http://localhost:8090/console.html). You can refresh the console periodically, and after
the task is successful, you should see a "SUCCESS" status for the task.
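The task status and log can also be fetched from the command line, using the same URLs that the helper script prints in its output. The sketch below only builds those URLs; `task_status_url`, `task_log_url`, and the `OVERLORD_URL` variable are hypothetical names introduced for illustration, and the shape of the response body is not covered by this tutorial.

```bash
# Base URL of the overlord, as used throughout this tutorial. OVERLORD_URL is
# a hypothetical variable name; override it if your overlord runs elsewhere.
OVERLORD_URL="${OVERLORD_URL:-http://localhost:8090}"

# Print the status URL for the task ID given as the first argument.
task_status_url() {
  printf '%s/druid/indexer/v1/task/%s/status\n' "$OVERLORD_URL" "$1"
}

# Print the log URL for the task ID given as the first argument.
task_log_url() {
  printf '%s/druid/indexer/v1/task/%s/log\n' "$OVERLORD_URL" "$1"
}

# Example usage (not run here):
#   curl -s "$(task_status_url "index_wikipedia_2018-06-09T21:30:32.802Z")"
```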

After the ingestion task finishes, the data will be loaded by historical nodes and available for
querying within a minute or two. You can monitor the progress of loading the data in the
coordinator console, by checking whether there is a datasource "wikipedia" with a blue circle
indicating "fully available": [http://localhost:8081/#/](http://localhost:8081/#/).

![Coordinator console](../tutorials/img/tutorial-batch-01.png "Wikipedia 100% loaded")

## Further reading

For more information on loading batch data, please see [the batch ingestion documentation](../ingestion/batch-ingestion.html).