Prepare the file structure for the data import documentation

2021-08-05 13:54:06 -04:00 · 2021-08-05 13:54:06 -04:00 · 9c1797cc91
commit 9c1797cc91
parent f181a67ad3
1 changed files with 158 additions and 11 deletions
--- a/tutorials/tutorial-batch.md
+++ b/tutorials/tutorial-batch.md
@@ -1,15 +1,162 @@
-<!-- toc -->
+# Loading data from a local file
+
+This tutorial demonstrates how to load data into Apache Druid from a file using Druid's native batch ingestion feature.
+
+You initiate data loading in Druid by submitting an *ingestion task* spec to the Druid Overlord. You can write ingestion
+specs by hand or using the _data loader_ built into the Druid console.
+
+The [Quickstart](./index.md) shows you how to use the data loader to build an ingestion spec. For production environments, it's
+likely that you'll want to automate data ingestion. This tutorial starts by showing you how to submit an ingestion spec
+directly in the Druid console, and then introduces ways to ingest batch data that lend themselves to
+automation&mdash;from the command line and from a script.
+
+
+## Loading data with a spec (via console)
+
+The Druid package includes a sample native batch ingestion task spec at `quickstart/tutorial/wikipedia-index.json`.
+It is configured to read the `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz` input file and is reproduced here for convenience:
+
+```json
+{
+  "type" : "index_parallel",
+  "spec" : {
+    "dataSchema" : {
+      "dataSource" : "wikipedia",
+      "dimensionsSpec" : {
+        "dimensions" : [
+          "channel",
+          "cityName",
+          "comment",
+          "countryIsoCode",
+          "countryName",
+          "isAnonymous",
+          "isMinor",
+          "isNew",
+          "isRobot",
+          "isUnpatrolled",
+          "metroCode",
+          "namespace",
+          "page",
+          "regionIsoCode",
+          "regionName",
+          "user",
+          { "name": "added", "type": "long" },
+          { "name": "deleted", "type": "long" },
+          { "name": "delta", "type": "long" }
+        ]
+      },
+      "timestampSpec": {
+        "column": "time",
+        "format": "iso"
+      },
+      "metricsSpec" : [],
+      "granularitySpec" : {
+        "type" : "uniform",
+        "segmentGranularity" : "day",
+        "queryGranularity" : "none",
+        "intervals" : ["2015-09-12/2015-09-13"],
+        "rollup" : false
+      }
+    },
+    "ioConfig" : {
+      "type" : "index_parallel",
+      "inputSource" : {
+        "type" : "local",
+        "baseDir" : "quickstart/tutorial/",
+        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
+      },
+      "inputFormat" :  {
+        "type": "json"
+      },
+      "appendToExisting" : false
+    },
+    "tuningConfig" : {
+      "type" : "index_parallel",
+      "maxRowsPerSegment" : 5000000,
+      "maxRowsInMemory" : 25000
+    }
+  }
+}
+```
+
+This spec creates a datasource named "wikipedia".
+
+From the Ingestion view, click the ellipsis (...) next to Tasks and choose `Submit JSON task`.
+
+![Tasks view add task](../assets/tutorial-batch-submit-task-01.png "Tasks view add task")
+
+This brings up the spec submission dialog where you can paste the spec above.
+
+![Spec submission dialog](../assets/tutorial-batch-submit-task-02.png "Spec submission dialog")
+
+Once the spec is submitted, wait a few moments for the data to load, after which you can query it.
+
+
+## Loading data with a spec (via command line)
+
+For convenience, the Druid package includes a batch ingestion helper script at `bin/post-index-task`.
+
+The script POSTs an ingestion task to the Druid Overlord and polls Druid until the data is available for querying.
+
+Run the following command from the Druid package root:
+
+```bash
+bin/post-index-task --file quickstart/tutorial/wikipedia-index.json --url http://localhost:8081
+```
+
+You should see output like the following:
+
+```bash
+Beginning indexing data for wikipedia
+Task started: index_wikipedia_2018-07-27T06:37:44.323Z
+Task log:     http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/log
+Task status:  http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/status
+Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
+Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
+Task finished with status: SUCCESS
+Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
+wikipedia loading complete! You may now query your data
+```
+
+Once the task is submitted, wait for the data to load as described above, after which you can query it.
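+
+When several spec files need to be loaded in sequence, the helper script can be wrapped in a short shell function. The following is a minimal sketch, not part of the Druid package. The function name `submit_specs` and the `POST_INDEX_TASK` override variable are hypothetical names introduced here, the only flags used are the `--file` and `--url` flags shown above, and it assumes the script exits with a non-zero status when a task fails.
+
+```bash
+# Submit each spec file in turn with the helper script.
+# POST_INDEX_TASK is a hypothetical override for the script path (an assumption,
+# not a Druid setting); it defaults to the script shipped in the package.
+submit_specs() {
+  overlord_url="$1"
+  shift
+  for spec in "$@"; do
+    # Stop at the first failing submission (assumes a non-zero exit status on failure).
+    "${POST_INDEX_TASK:-bin/post-index-task}" --file "$spec" --url "$overlord_url" || return 1
+  done
+}
+```
+
+For example, `submit_specs http://localhost:8081 quickstart/tutorial/wikipedia-index.json` runs the same command as above.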
+
+
+## Loading data without the script
+
+This section briefly shows how to submit the ingestion task without using the script. You do not need to run these commands.
+
+To submit the task, POST it to Druid in a new terminal window from the apache-druid-{{DRUIDVERSION}} directory:
+
+```bash
+curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8081/druid/indexer/v1/task
+```
+
+If the submission succeeds, the command prints the ID of the task:
+
+```json
+{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}
+```
+
+You can monitor the status of this task from the console as outlined above.
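+
+The task ID in the response can also be used from the command line. The sketch below derives the status URL from that response, following the `Task status` URL pattern printed by the helper script earlier in this tutorial. The function names are hypothetical, and the `sed` pattern assumes the compact single-line JSON shown above; a real script would more likely use a JSON parser.
+
+```bash
+# Extract the task ID from a submission response such as {"task":"index_..."}.
+extract_task_id() {
+  printf '%s\n' "$1" | sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
+}
+
+# Build the status URL for a task, using the path shown in the helper script's output.
+task_status_url() {
+  printf '%s/druid/indexer/v1/task/%s/status\n' "$1" "$2"
+}
+
+# Example (requires a running Overlord, so it is left commented out):
+# response=$(curl -s -X POST -H 'Content-Type:application/json' \
+#   -d @quickstart/tutorial/wikipedia-index.json http://localhost:8081/druid/indexer/v1/task)
+# curl -s "$(task_status_url http://localhost:8081 "$(extract_task_id "$response")")"
+```
+
+Keeping the two helpers free of network calls makes them easy to check in isolation before wiring them to `curl`.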
+
+
+## Querying your data
+
+Once the data is loaded, please follow the [query tutorial](../tutorials/tutorial-query.md) to run some example queries on the newly loaded data.
+
+
+## Cleanup
+
+The other ingestion tutorials write to the same "wikipedia" datasource. Before starting any of them, shut down the cluster and reset its state by removing the contents of the `var` directory under the Druid package.
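+
+The reset step can be scripted. This is a minimal sketch under two assumptions: the cluster has already been shut down, and `reset_druid_state` is a hypothetical helper name, not something shipped with Druid. It deletes everything inside `var` but keeps the directory itself, and it relies on the `-mindepth` and `-delete` options available in GNU and BSD `find`.
+
+```bash
+# Remove the contents of the var directory under the given Druid package root.
+# Run this only after the cluster has been shut down.
+reset_druid_state() {
+  druid_home="${1:-.}"
+  if [ ! -d "$druid_home/var" ]; then
+    echo "no var directory under $druid_home" >&2
+    return 1
+  fi
+  # -mindepth 1 keeps var itself; -delete removes files first, then the emptied directories.
+  find "$druid_home/var" -mindepth 1 -delete
+}
+```
+
+For example, `reset_druid_state .` resets the state when run from the Druid package root.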
+
+
+## Further reading
+
+For more information on loading batch data, please see [the native batch ingestion documentation](../ingestion/native-batch.md).
+
+
+

-<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
-<ins class="adsbygoogle"
-     style="display:block; text-align:center;"
-     data-ad-layout="in-article"
-     data-ad-format="fluid"
-     data-ad-client="ca-pub-8828078415045620"
-     data-ad-slot="7586680510"></ins>
-<script>
-     (adsbygoogle = window.adsbygoogle || []).push({});
-</script>

 ## Loading a local file