<!-- toc -->

# Loading data from local files

This tutorial demonstrates how to load data into Apache Druid from a file using its native batch ingestion feature.

You initiate data loading in Druid by submitting an *ingestion task* spec to the Druid Overlord. You can write ingestion specs by hand or with the _data loader_ built into the Druid console.

The [Quickstart](./index.md) shows you how to use the data loader to build an ingestion spec. For production environments, it's likely that you'll want to automate data ingestion. This tutorial starts by showing you how to submit an ingestion spec directly in the Druid console, and then introduces ways to ingest batch data that lend themselves to automation: from the command line and from a script.

## Loading data with a spec (via console)

The Druid package includes a sample native batch ingestion task spec at `quickstart/tutorial/wikipedia-index.json`, shown here for convenience. It is configured to read the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` input file:

```json
{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "dimensionsSpec" : {
        "dimensions" : [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user",
          { "name": "added", "type": "long" },
          { "name": "deleted", "type": "long" },
          { "name": "delta", "type": "long" }
        ]
      },
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat" : {
        "type": "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}
```

This spec creates a datasource named "wikipedia".
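
If you later edit the spec by hand, a quick syntax check before submitting can save a round trip. This is an optional step, not part of the official workflow (it assumes Python is available on your PATH):

```bash
# Parse the spec; print a confirmation only if it is well-formed JSON.
python -m json.tool quickstart/tutorial/wikipedia-index.json > /dev/null \
  && echo "spec is valid JSON"
```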
|
||||||
|
|
||||||
|
From the Ingestion view, click the ellipses next to Tasks and choose `Submit JSON task`.
|
||||||
|
|
||||||
|
![Tasks view add task](../assets/tutorial-batch-submit-task-01.png "Tasks view add task")
|
||||||
|
|
||||||
|
This brings up the spec submission dialog where you can paste the spec above.
|
||||||
|
|
||||||
|
![Query view](../assets/tutorial-batch-submit-task-02.png "Query view")
|
||||||
|
|
||||||
|
Once the spec is submitted, wait a few moments for the data to load, after which you can query it.
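
To check from the command line whether the load has completed, you can poll the Coordinator's datasource list. This is a minimal sketch, assuming the quickstart default port of 8081; `wikipedia` appears in the returned list once its segments are available:

```bash
# Lists the names of datasources whose segments are loaded on the cluster.
curl http://localhost:8081/druid/coordinator/v1/datasources
```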

## Loading data with a spec (via command line)

For convenience, the Druid package includes a batch ingestion helper script at `bin/post-index-task`.

This script POSTs an ingestion task to the Druid Overlord and polls Druid until the data is available for querying.

Run the following command from the Druid package root:

```bash
bin/post-index-task --file quickstart/tutorial/wikipedia-index.json --url http://localhost:8081
```

You should see output like the following:

```bash
Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-07-27T06:37:44.323Z
Task log: http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/log
Task status: http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/status
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data
```

Once the spec is submitted, you can follow the same instructions as above to wait for the data to load and then query it.
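
Because `post-index-task` blocks until ingestion finishes, it also lends itself to scripting. Here is a hypothetical sketch for loading several task specs in sequence (the `specs/` directory is illustrative, not part of the Druid package):

```bash
#!/usr/bin/env bash
# Submit each task spec in turn; post-index-task waits for each task
# to complete before the loop moves on to the next one.
for spec in specs/*.json; do
  bin/post-index-task --file "$spec" --url http://localhost:8081
done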

## Loading data without the script

Let's briefly discuss how we would've submitted the ingestion task without using the script. You do not need to run these commands.

To submit the task, POST it to Druid in a new terminal window from the apache-druid-{{DRUIDVERSION}} directory:

```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8081/druid/indexer/v1/task
```

If the submission is successful, this prints the ID of the task:

```bash
{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}
```

You can monitor the status of this task from the console as outlined above.
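
You can also poll the task's status endpoint directly, using the task ID returned above. A minimal sketch, assuming `jq` is installed (the exact field layout of the status payload can vary between Druid versions):

```bash
#!/usr/bin/env bash
TASK_ID="index_wikipedia_2018-06-09T21:30:32.802Z"   # substitute your own task ID
while true; do
  # The status endpoint is the same one shown in the script output above.
  STATUS=$(curl -s "http://localhost:8081/druid/indexer/v1/task/${TASK_ID}/status" \
    | jq -r '.status.status')
  echo "Task status: ${STATUS}"
  case "$STATUS" in
    SUCCESS|FAILED) break ;;
  esac
  sleep 5
done
```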

## Querying your data

Once the data is loaded, please follow the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
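
As a quick smoke test before moving on, you can run a SQL query over HTTP. This is a sketch that assumes the quickstart default Router port of 8888:

```bash
# Counts the rows in the freshly loaded "wikipedia" datasource.
curl -X POST -H 'Content-Type:application/json' \
  -d '{"query":"SELECT COUNT(*) AS num_rows FROM wikipedia"}' \
  http://localhost:8888/druid/v2/sql
```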

## Cleanup

If you wish to go through any of the other ingestion tutorials, you will need to shut down the cluster and reset the cluster state by removing the contents of the `var` directory under the Druid package, as the other tutorials write to the same "wikipedia" datasource.
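
For example, assuming you started Druid with one of the quickstart start scripts, stop it (Ctrl-C) and then run the following from the package root:

```bash
# Removes locally stored segments, task logs, and quickstart metadata.
rm -rf var
```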

## Further reading

For more information on loading batch data, please see [the native batch ingestion documentation](../ingestion/native-batch.md).