druid/docs/development/extensions-contrib/thrift.md

88 lines
3.6 KiB
Markdown

---
id: thrift
title: "Thrift"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-thrift-extensions`.
This extension enables Druid to ingest thrift compact data online (`ByteBuffer`) and offline (SequenceFile of type `<Writable, BytesWritable>` or LzoThriftBlock File).
You may want to use another version of thrift, change the dependency in pom and compile yourself.
## LZO Support
If you plan to read LZO-compressed Thrift files, you will need to download version 0.4.19 of the [hadoop-lzo JAR](https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo/0.4.19) and place it in your `extensions/druid-thrift-extensions` directory.
## Thrift Parser
| Field | Type | Description | Required |
| ----------- | ----------- | ---------------------------------------- | -------- |
| type | String | This should say `thrift` | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be a JSON parseSpec. | yes |
| thriftJar | String | path of thrift jar, if not provided, it will try to find the thrift class in classpath. Thrift jar in batch ingestion should be uploaded to HDFS first and configure `jobProperties` with `"tmpjars":"/path/to/your/thrift.jar"` | no |
| thriftClass | String | classname of thrift | yes |
- Batch Ingestion example - `inputFormat` and `tmpjars` should be set.
This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inputSpec in ioConfig could be one of `"org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat"` and `com.twitter.elephantbird.mapreduce.input.LzoThriftBlockInputFormat`. Be careful, when `LzoThriftBlockInputFormat` is used, thrift class must be provided twice.
```json
{
"type": "index_hadoop",
"spec": {
"dataSchema": {
"dataSource": "book",
"parser": {
"type": "thrift",
"jarPath": "book.jar",
"thriftClass": "org.apache.druid.data.input.thrift.Book",
"protocol": "compact",
"parseSpec": {
"format": "json",
...
}
},
"metricsSpec": [],
"granularitySpec": {}
},
"ioConfig": {
"type": "hadoop",
"inputSpec": {
"type": "static",
"inputFormat": "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
// "inputFormat": "com.twitter.elephantbird.mapreduce.input.LzoThriftBlockInputFormat",
"paths": "/user/to/some/book.seq"
}
},
"tuningConfig": {
"type": "hadoop",
"jobProperties": {
"tmpjars":"/user/h_user_profile/du00/druid/test/book.jar",
// "elephantbird.class.for.MultiInputFormat" : "${YOUR_THRIFT_CLASS_NAME}"
}
}
}
}
```