fix protobuf extension packaging and docs (#9320)

* fix protobuf extension packaging and docs

* fix paths

* Update protobuf.md

* Update protobuf.md
This commit is contained in:
Clint Wylie 2020-02-07 09:26:52 -08:00 committed by GitHub
parent 53bb45fc9a
commit b55657cc26
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 68 additions and 45 deletions

View File

@ -30,16 +30,16 @@ for [stream ingestion](../../ingestion/index.md#streaming). See corresponding do
## Example: Load Protobuf messages from Kafka ## Example: Load Protobuf messages from Kafka
This example demonstrates how to load Protobuf messages from Kafka. Please read the [Load from Kafka tutorial](../../tutorials/tutorial-kafka.md) first. This example will use the same "metrics" dataset. This example demonstrates how to load Protobuf messages from Kafka. Please read the [Load from Kafka tutorial](../../tutorials/tutorial-kafka.md) first, and see [Kafka Indexing Service](./kafka-ingestion.md) documentation for more details.
Files used in this example are found at `./examples/quickstart/protobuf` in your Druid directory. The files used in this example are found at [`./examples/quickstart/protobuf` in your Druid directory](https://github.com/apache/druid/tree/master/examples/quickstart/protobuf).
- We will use [Kafka Indexing Service](./kafka-ingestion.md). For this example:
- Kafka broker host is `localhost:9092`. - Kafka broker host is `localhost:9092`
- Kafka topic is `metrics_pb` instead of `metrics`. - Kafka topic is `metrics_pb`
- datasource name is `metrics-kafka-pb` instead of `metrics-kafka` to avoid the confusion. - Datasource name is `metrics-protobuf`
Here is the metrics JSON example. Here is a JSON example of the 'metrics' data schema used in the example.
```json ```json
{ {
@ -56,7 +56,7 @@ Here is the metrics JSON example.
### Proto file ### Proto file
The proto file should look like this. Save it as metrics.proto. The corresponding proto file for our 'metrics' dataset looks like this.
``` ```
syntax = "proto3"; syntax = "proto3";
@ -74,28 +74,27 @@ message Metrics {
### Descriptor file ### Descriptor file
Using the `protoc` Protobuf compiler to generate the descriptor file. Save the metrics.desc file either in the classpath or reachable by URL. In this example the descriptor file was saved at /tmp/metrics.desc. Next, we use the `protoc` Protobuf compiler to generate the descriptor file and save it as `metrics.desc`. The descriptor file must be either in the classpath or reachable by URL. In this example the descriptor file was saved at `/tmp/metrics.desc`, however this file is also available in the example files. From your Druid install directory:
``` ```
protoc -o /tmp/metrics.desc metrics.proto protoc -o /tmp/metrics.desc ./quickstart/protobuf/metrics.proto
``` ```
### Supervisor spec JSON ## Create Kafka Supervisor
Below is the complete Supervisor spec JSON to be submitted to the Overlord. Below is the complete Supervisor spec JSON to be submitted to the Overlord.
Please make sure these keys are properly configured for successful ingestion. Make sure these keys are properly configured for successful ingestion.
- `descriptor` for the descriptor file URL. Important supervisor properties
- `protoMessageType` from the proto definition. - `descriptor` for the descriptor file URL
- parseSpec `format` must be `json`. - `protoMessageType` from the proto definition
- `topic` to subscribe. The topic is "metrics_pb" instead of "metrics". - `parser` should have `type` set to `protobuf`, but note that the `format` of the `parseSpec` must be `json`
- `bootstrap.server` is the Kafka broker host.
```json ```json
{ {
"type": "kafka", "type": "kafka",
"dataSchema": { "dataSchema": {
"dataSource": "metrics-kafka2", "dataSource": "metrics-protobuf",
"parser": { "parser": {
"type": "protobuf", "type": "protobuf",
"descriptor": "file:///tmp/metrics.desc", "descriptor": "file:///tmp/metrics.desc",
@ -165,18 +164,55 @@ Please make sure these keys are properly configured for successful ingestion.
} }
``` ```
## Kafka Producer ## Adding Protobuf messages to Kafka
Here is the sample script that publishes the metrics to Kafka in Protobuf format. If necessary, from your Kafka installation directory run the following command to create the Kafka topic
1. Run `protoc` again with the Python binding option. This command generates `metrics_pb2.py` file. ```
``` ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic metrics_pb
protoc -o metrics.desc metrics.proto --python_out=. ```
```
2. Create Kafka producer script. This example script requires `protobuf` and `kafka-python` modules. With the topic in place, messages can be inserted running the following command from your Druid installation directory
This script requires `protobuf` and `kafka-python` modules. ```
./bin/generate-example-metrics | ./quickstart/protobuf/pb_publisher.py
```
You can confirm that data has been inserted to your Kafka topic using the following command from your Kafka installation directory
```
./bin/kafka-console-consumer --zookeeper localhost --topic metrics_pb
```
which should print messages like this
```
millisecondsGETR"2017-04-06T03:23:56Z*2002/list:request/latencyBwww1.example.com
```
If your supervisor created in the previous step is running, the indexing tasks should begin producing the messages and the data will soon be available for querying in Druid.
## Generating the example files
The files provided in the example quickstart can be generated in the following manner starting with only `metrics.proto`.
### `metrics.desc`
The descriptor file is generated using `protoc` Protobuf compiler. Given a `.proto` file, a `.desc` file can be generated like so.
```
protoc -o metrics.desc metrics.proto
```
### `metrics_pb2.py`
`metrics_pb2.py` is also generated with `protoc`
```
protoc -o metrics.desc metrics.proto --python_out=.
```
### `pb_publisher.py`
After `metrics_pb2.py` is generated, another script can be constructed to parse JSON data, convert it to Protobuf, and produce to a Kafka topic
```python ```python
#!/usr/bin/env python #!/usr/bin/env python
@ -187,32 +223,17 @@ import json
from kafka import KafkaProducer from kafka import KafkaProducer
from metrics_pb2 import Metrics from metrics_pb2 import Metrics
producer = KafkaProducer(bootstrap_servers='localhost:9092') producer = KafkaProducer(bootstrap_servers='localhost:9092')
topic = 'metrics_pb' topic = 'metrics_pb'
metrics = Metrics()
for row in iter(sys.stdin): for row in iter(sys.stdin):
d = json.loads(row) d = json.loads(row)
metrics = Metrics()
for k, v in d.items(): for k, v in d.items():
setattr(metrics, k, v) setattr(metrics, k, v)
pb = metrics.SerializeToString() pb = metrics.SerializeToString()
producer.send(topic, pb) producer.send(topic, pb)
```
3. run producer
``` producer.flush()
./bin/generate-example-metrics | ./pb_publisher.py
```
4. test
```
kafka-console-consumer --zookeeper localhost --topic metrics_pb
```
It should print messages like this
```
millisecondsGETR"2017-04-06T03:23:56Z*2002/list:request/latencyBwww1.example.com
``` ```

View File

@ -32,3 +32,5 @@ for row in iter(sys.stdin):
setattr(metrics, k, v) setattr(metrics, k, v)
pb = metrics.SerializeToString() pb = metrics.SerializeToString()
producer.send(topic, pb) producer.send(topic, pb)
producer.flush()

View File

@ -109,7 +109,7 @@
<artifactId>maven-shade-plugin</artifactId> <artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version> <version>3.2.1</version>
<configuration> <configuration>
<createDependencyReducedPom>false</createDependencyReducedPom> <createDependencyReducedPom>true</createDependencyReducedPom>
<relocations> <relocations>
<relocation> <relocation>
<pattern>com.google.protobuf</pattern> <pattern>com.google.protobuf</pattern>