---
layout: doc_page
---

## Load your own streaming data

## Getting started

This tutorial shows you how to load your own streams into Druid.

For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have them running on your local machine. You
don't need to have loaded any data yet.

Once that's complete, you can load your own dataset by writing a custom ingestion spec.

## Writing an ingestion spec

When loading streams into Druid, we recommend using the [stream push](../ingestion/stream-push.html)
process. In this tutorial we'll be using [Tranquility Server](../ingestion/stream-ingestion.html#server) to push
data into Druid over HTTP.

```note-info
This tutorial will show you how to push streams to Druid using HTTP, but Druid additionally supports
a wide variety of batch and streaming loading methods. See the *[Loading files](../ingestion/batch-ingestion.html)*
and *[Loading streams](../ingestion/stream-ingestion.html)* pages for more information about other options,
including from Hadoop, Kafka, Storm, Samza, Spark Streaming, and your own JVM apps.
```

You can prepare for loading a new dataset over HTTP by writing a custom Tranquility Server
configuration. The bundled configuration is in `conf-quickstart/tranquility/server.json`, which
you can modify for your own needs.

The most important questions are (the sketch after this list shows where each field lives):

* What should the dataset be called? This is the "dataSource" field of the "dataSchema".
* Which field should be treated as a timestamp? This belongs in the "column" field of the "timestampSpec".
* Which fields should be treated as dimensions? This belongs in the "dimensions" field of the "dimensionsSpec".
* Which fields should be treated as measures? This belongs in the "metricsSpec" field.

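Here is a rough sketch of how those fields nest inside a Tranquility server configuration. It is abridged, not a complete working file: the bundled file's granularity, tuning, and other settings are omitted, and the `your...` placeholder values are illustrative only:

```json
{
  "dataSources": {
    "yourDataSource": {
      "spec": {
        "dataSchema": {
          "dataSource": "yourDataSource",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {"column": "yourTimeField", "format": "auto"},
              "dimensionsSpec": {"dimensions": ["yourFirstDimension", "yourSecondDimension"]}
            }
          },
          "metricsSpec": []
        }
      },
      "properties": {}
    }
  },
  "properties": {}
}
```
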
Let's use a small JSON pageviews dataset as an example, with records like:

```json
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
```

So the answers to the questions above are:

* Let's call the dataset "pageviews".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
  sum when we load the data will allow us to compute an average at query time as well.

Now, edit the existing `conf-quickstart/tranquility/server.json` file by altering these
sections:

1. Change the key `"metrics"` under `"dataSources"` to `"pageviews"`.
2. Alter these sections under the new `"pageviews"` key (the assembled result is shown after the snippets):

```json
"dataSource": "pageviews"
```

```json
"timestampSpec": {
  "format": "auto",
  "column": "time"
}
```

```json
"dimensionsSpec": {
  "dimensions": ["url", "user"]
}
```

```json
"metricsSpec": [
  {"name": "views", "type": "count"},
  {"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"}
]
```

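Taken together, the edited portion of `server.json` should look roughly like this. This is an abridged sketch: the bundled file's granularity, tuning, and properties sections are left as they were and omitted here:

```json
"dataSources": {
  "pageviews": {
    "spec": {
      "dataSchema": {
        "dataSource": "pageviews",
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "json",
            "timestampSpec": {"format": "auto", "column": "time"},
            "dimensionsSpec": {"dimensions": ["url", "user"]}
          }
        },
        "metricsSpec": [
          {"name": "views", "type": "count"},
          {"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"}
        ]
      }
    }
  }
}
```
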
## Restarting the server

Restart the server to pick up the new configuration file by stopping Tranquility (CTRL-C) and starting it up again.

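For example, assuming you launched Tranquility Server from its distribution directory as in the quickstart (adjust the path to `server.json` if your layout differs):

```bash
# Press CTRL-C in the terminal running Tranquility Server, then relaunch it
# against the freshly edited quickstart configuration.
bin/tranquility server -configFile conf-quickstart/tranquility/server.json
```
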
## Sending data

Let's send some data! We'll start with these three records:

```json
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2000-01-01T00:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```

Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](../ingestion/stream-push.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:

```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```

Update the timestamps in the JSON above, and save it to a file named `pageviews.json`. Then send
it to Druid by running:

```bash
curl -XPOST -H'Content-Type: application/json' --data-binary @pageviews.json http://localhost:8200/v1/post/pageviews
```

This will print something like:

```
{"result":{"received":3,"sent":3}}
```

This indicates that the HTTP server received 3 events from you, and sent 3 to Druid. Note that
this may take a few seconds to finish the first time you run it, as Druid resources must be
allocated to the ingestion task. Subsequent POSTs should complete quickly.

If you see `"sent":0`, this likely means that your timestamps are not recent enough. Try adjusting
your timestamps and re-sending your data.

## Querying your data

After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).

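For example, here is a sketch of a native timeseries query over the new datasource. It totals the two metrics we defined and derives the average latency with an arithmetic post-aggregator; note that re-aggregating the ingested "views" count at query time uses `longSum`, not `count`:

```json
{
  "queryType": "timeseries",
  "dataSource": "pageviews",
  "intervals": ["2000-01-01/3000-01-01"],
  "granularity": "all",
  "aggregations": [
    {"type": "longSum", "name": "views", "fieldName": "views"},
    {"type": "doubleSum", "name": "latencyMs", "fieldName": "latencyMs"}
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "avgLatencyMs",
      "fn": "/",
      "fields": [
        {"type": "fieldAccess", "fieldName": "latencyMs"},
        {"type": "fieldAccess", "fieldName": "views"}
      ]
    }
  ]
}
```

Save it as `pageviews-query.json` (an illustrative name) and POST it to the quickstart broker, mirroring the earlier command:

```bash
curl -XPOST -H'Content-Type: application/json' --data-binary @pageviews-query.json http://localhost:8082/druid/v2/?pretty
```
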
## Further reading

To read more about loading streams, see our [streaming ingestion documentation](../ingestion/stream-ingestion.html).