---
layout: doc_page
---
# Tutorial: Load your own streaming data
## Getting started
This tutorial shows you how to load your own streams into Druid.

For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.

Once that's complete, you can load your own dataset by writing a custom ingestion spec.

## Writing an ingestion spec
When loading streams into Druid, we recommend using the [stream push](../ingestion/stream-push.html)
process. In this tutorial we'll be using [Tranquility Server](../ingestion/stream-ingestion.html#server) to push
data into Druid over HTTP.

<div class="note info">
This tutorial will show you how to push streams to Druid using HTTP, but Druid additionally supports
a wide variety of batch and streaming loading methods. See the <a href="../ingestion/batch-ingestion.html">Loading files</a>
and <a href="../ingestion/stream-ingestion.html">Loading streams</a> pages for more information about other options,
including from Hadoop, Kafka, Storm, Samza, Spark Streaming, and your own JVM apps.
</div>

You can prepare for loading a new dataset over HTTP by writing a custom Tranquility Server
configuration. The bundled configuration is in `conf-quickstart/tranquility/server.json`, which
you can modify for your own needs.

The most important questions are:

* What should the dataset be called? This is the "dataSource" field of the "dataSchema".
* Which field should be treated as a timestamp? This belongs in the "column" field of the "timestampSpec".
* Which fields should be treated as dimensions? This belongs in the "dimensions" field of the "dimensionsSpec".
* Which fields should be treated as measures? This belongs in the "metricsSpec" field.

Let's use a small JSON pageviews dataset as an example, with records like:

```json
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
```
So the answers to the questions above are:

* Let's call the dataset "pageviews".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
  sum when we load the data will allow us to compute an average at query time as well.

Now, edit the existing `conf-quickstart/tranquility/server.json` file by altering these sections:

1. Change the key `"metrics"` under `"dataSources"` to `"pageviews"`.
2. Alter these sections under the new `"pageviews"` key:

```json
"dataSource": "pageviews"
```

```json
"timestampSpec": {
  "format": "auto",
  "column": "time"
}
```

```json
"dimensionsSpec": {
  "dimensions": ["url", "user"]
}
```

```json
"metricsSpec": [
  {"name": "views", "type": "count"},
  {"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"}
]
```
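Hand-editing JSON makes it easy to leave behind a trailing comma or mismatched brace. As a sanity check, you can parse the file and confirm your edits took effect. The sketch below is illustrative, not part of Druid or Tranquility: it assumes the quickstart file's `dataSources` → `pageviews` → `spec` → `dataSchema` nesting, so verify the key paths against your own copy.

```python
import json

def check_config(path):
    """Parse a Tranquility server config and verify the 'pageviews' edits.

    Raises ValueError on malformed JSON, and KeyError/AssertionError if one
    of the edited sections is missing or wrong.
    """
    with open(path) as f:
        config = json.load(f)

    # Assumed layout (matches the quickstart config; check yours):
    schema = config["dataSources"]["pageviews"]["spec"]["dataSchema"]
    parse_spec = schema["parser"]["parseSpec"]

    assert schema["dataSource"] == "pageviews"
    assert parse_spec["timestampSpec"]["column"] == "time"
    assert parse_spec["dimensionsSpec"]["dimensions"] == ["url", "user"]
    assert {m["name"] for m in schema["metricsSpec"]} == {"views", "latencyMs"}
    return schema
```

For example, `check_config("conf-quickstart/tranquility/server.json")` should return the schema without raising.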
## Restarting the server

Restart the server to pick up the new configuration file by stopping Tranquility (CTRL-C) and starting it up again.

## Sending data

Let's send some data! We'll start with these three records:

```json
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2000-01-01T00:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```
Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](ingestion-streams.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:

```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```
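If you'd rather not paste the timestamp into each record by hand, a short script can stamp the records for you. This is a sketch using the same three sample records; `make_payload` is a hypothetical helper, not part of Druid or Tranquility.

```python
import json
from datetime import datetime, timezone

# The three sample records from above, minus their timestamps.
RECORDS = [
    {"url": "/foo/bar", "user": "alice", "latencyMs": 32},
    {"url": "/", "user": "bob", "latencyMs": 11},
    {"url": "/foo/bar", "user": "bob", "latencyMs": 45},
]

def make_payload(records, now=None):
    """Return newline-delimited JSON with a current ISO8601 'time' field."""
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return "\n".join(json.dumps({"time": stamp, **rec}) for rec in records)

if __name__ == "__main__":
    # Write the stamped records to a file ready to POST.
    with open("pageviews.json", "w") as f:
        f.write(make_payload(RECORDS) + "\n")
```

Running it leaves a ready-to-send `pageviews.json` in the current directory.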
Update the timestamps in the JSON above, and save it to a file named `pageviews.json`. Then send
it to Druid by running:

```bash
curl -XPOST -H'Content-Type: application/json' --data-binary @pageviews.json http://localhost:8200/v1/post/pageviews
```
This will print something like:

```
{"result":{"received":3,"sent":3}}
```
This indicates that the HTTP server received 3 events from you, and sent 3 to Druid. Note that
this may take a few seconds to finish the first time you run it, as Druid resources must be
allocated to the ingestion task. Subsequent POSTs should complete quickly.
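Since the response is JSON, you can script the success check instead of eyeballing it. A minimal sketch (`check_result` is a hypothetical helper) that flags dropped events:

```python
import json

def check_result(body):
    """Parse Tranquility's HTTP response and return (received, sent).

    sent < received usually means some events fell outside the
    windowPeriod and were dropped.
    """
    result = json.loads(body)["result"]
    received, sent = result["received"], result["sent"]
    if sent < received:
        print("%d of %d events dropped; are your timestamps current?"
              % (received - sent, received))
    return received, sent

# The response from the curl command above parses like so:
print(check_result('{"result":{"received":3,"sent":3}}'))  # (3, 3)
```

To check a live POST, you could feed it the response body read from `urllib.request.urlopen` against the same URL the curl command uses.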
If you see `"sent":0`, this likely means that your timestamps are not recent enough. Try adjusting
your timestamps and re-sending your data.

## Querying your data

After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).

## Further reading

To read more about loading streams, see our [streaming ingestion documentation](../ingestion/stream-ingestion.html).