---
layout: doc_page
title: "Tutorial: Loading a file"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Tutorial: Loading a file
## Getting started
This tutorial demonstrates how to perform a batch file load, using Druid's native batch ingestion.

For this tutorial, we'll assume you've already downloaded Druid as described in
the [single-machine quickstart](index.html) and have it running on your local machine. You
don't need to have loaded any data yet.
## Preparing the data and the ingestion task spec
A data load is initiated by submitting an *ingestion task* spec to the Druid Overlord. For this tutorial, we'll be loading the sample Wikipedia page edits data.

The Druid package includes a sample native batch ingestion task spec at `quickstart/tutorial/wikipedia-index.json`, which has been configured to read the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` input file. It is shown here for convenience:
```json
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}
```
This spec will create a datasource named "wikipedia".
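
Before submitting the spec, it can help to confirm that the input file named by the firehose's `baseDir` and `filter` exists, and to look at a record to see the fields the `dimensionsSpec` refers to. The following is a minimal sketch, assuming `gunzip` and `head` are available and that the sample file holds one JSON object per line. `peek_records` is a hypothetical helper name, not something shipped with Druid.

```bash
# Hypothetical helper (not part of the Druid package): print the first N
# lines of a gzipped file. N defaults to 1.
peek_records() {
  local file="$1"
  local count="${2:-1}"
  if [ ! -f "$file" ]; then
    echo "input file not found: $file" >&2
    return 1
  fi
  # Decompress to stdout and keep only the first N lines.
  gunzip -c "$file" | head -n "$count"
}
```

Run from the Druid package root, `peek_records quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` would print the first record of the sample data.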
## Load batch data
We've included a sample of Wikipedia edits from September 12, 2015 to get you started.

To load this data into Druid, you can submit an *ingestion task* pointing to the file. The task spec shown above loads the `wikiticker-2015-09-12-sampled.json.gz` file, which is included in the archive.

For convenience, the Druid package includes a batch ingestion helper script at `bin/post-index-task`. This script will POST an ingestion task to the Druid Overlord and poll Druid until the data is available for querying.

Run the following command from the Druid package root:
```bash
bin/post-index-task --file quickstart/tutorial/wikipedia-index.json
```
You should see output like the following:
```bash
Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-07-27T06:37:44.323Z
Task log: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/log
Task status: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/status
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data
```
## Querying your data
Once the data is loaded, please follow the [query tutorial](../tutorials/tutorial-query.html) to run some example queries on the newly loaded data.
## Cleanup
If you wish to go through any of the other ingestion tutorials, you will need to shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the Druid package, as the other tutorials will write to the same "wikipedia" datasource.
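
The reset step could be scripted along the following lines. This is a sketch only, assuming a `find` that supports `-mindepth` and `-delete` (GNU and BSD versions do) and that the cluster has already been shut down. `reset_druid_state` is a hypothetical helper name, and it deletes data irreversibly, so check the path before running it.

```bash
# Hypothetical helper: remove everything inside <druid_root>/var while
# keeping the var directory itself. Shut the cluster down first.
reset_druid_state() {
  local druid_root="${1:-.}"
  local var_dir="$druid_root/var"
  if [ ! -d "$var_dir" ]; then
    echo "no var directory under: $druid_root" >&2
    return 1
  fi
  # -mindepth 1 skips var/ itself; -delete removes contents depth-first.
  find "$var_dir" -mindepth 1 -delete
}
```

Called as `reset_druid_state .` from the Druid package root, this leaves an empty `var` directory behind.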
## Extra: Loading data without the script
Let's briefly discuss how we would've submitted the ingestion task without using the script. You do not need to run these commands.

To submit the task, POST it to Druid in a new terminal window from the apache-druid-#{DRUIDVERSION} directory:
```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8090/druid/indexer/v1/task
```
This prints the ID of the task if the submission was successful:
```bash
{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}
```
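If you script the submission, the task ID in this response is what you need to build the status and log URLs that `bin/post-index-task` prints. The following is a minimal sketch, assuming the response has exactly the `{"task":"<id>"}` shape shown above and using `sed` rather than a real JSON parser. The helper names and the `OVERLORD_URL` variable are hypothetical.

```bash
# Base URL of the Overlord; the default matches the tutorial's local setup.
OVERLORD_URL="${OVERLORD_URL:-http://localhost:8090}"

# Extract the task ID from a response of the form {"task":"<id>"}.
# Prints nothing if the response does not contain a "task" field.
task_id_from_response() {
  printf '%s\n' "$1" |
    sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Build the status and log URLs for a task ID, following the URL layout
# shown in the helper script's output.
task_status_url() {
  printf '%s/druid/indexer/v1/task/%s/status\n' "$OVERLORD_URL" "$1"
}
task_log_url() {
  printf '%s/druid/indexer/v1/task/%s/log\n' "$OVERLORD_URL" "$1"
}

# Example:
#   id="$(task_id_from_response '{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}')"
#   curl -s "$(task_status_url "$id")"
```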
To view the status of the ingestion task, go to the Overlord console:
[http://localhost:8090/console.html](http://localhost:8090/console.html). Refresh the console periodically; once
the task completes successfully, it shows a "SUCCESS" status.

After the ingestion task finishes, the data will be loaded by Historical nodes and become available for
querying within a minute or two. You can monitor loading progress in the
Coordinator console by checking whether there is a "wikipedia" datasource with a blue circle
indicating "fully available": [http://localhost:8081/#/](http://localhost:8081/#/).

![Coordinator console](../tutorials/img/tutorial-batch-01.png "Wikipedia 100% loaded")
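
The two console checks above can also be scripted. The following is a rough sketch, not a description of how `bin/post-index-task` works internally. It assumes `curl` is installed, that the task status endpoint returns compact JSON containing `"status":"SUCCESS"` or `"status":"FAILED"`, and that the Coordinator reports load percentages at `/druid/coordinator/v1/loadstatus` in a form like `{"wikipedia":100.0}`. All function and variable names are hypothetical.

```bash
OVERLORD_URL="${OVERLORD_URL:-http://localhost:8090}"
COORDINATOR_URL="${COORDINATOR_URL:-http://localhost:8081}"

# Fetch raw JSON. Kept as separate functions so they can be stubbed out
# when no cluster is running.
fetch_task_status() { curl -s "$OVERLORD_URL/druid/indexer/v1/task/$1/status"; }
fetch_load_status() { curl -s "$COORDINATOR_URL/druid/coordinator/v1/loadstatus"; }

# Poll the Overlord until the task succeeds (returns 0), fails (returns 1),
# or max_attempts polls have been made (returns 2).
wait_for_task() {
  local task_id="$1" interval="${2:-5}" max_attempts="${3:-120}"
  local attempt=0 body
  while [ "$attempt" -lt "$max_attempts" ]; do
    body="$(fetch_task_status "$task_id")"
    case "$body" in
      *'"status":"SUCCESS"'*) echo "SUCCESS"; return 0 ;;
      *'"status":"FAILED"'*)  echo "FAILED";  return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "timed out waiting for task $task_id" >&2
  return 2
}

# Print the load percentage reported for one datasource, e.g. 100.0.
# Assumes the datasource name contains no regex metacharacters.
datasource_load_percent() {
  fetch_load_status |
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p'
}
```

For example, `wait_for_task "index_wikipedia_2018-06-09T21:30:32.802Z"` followed by `datasource_load_percent wikipedia` would block until the task finishes and then print how much of the datasource has been loaded.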
## Further reading
For more information on loading batch data, please see [the batch ingestion documentation](../ingestion/batch-ingestion.html).