druid

mirror of https://github.com/apache/druid.git synced 2025-02-07 10:38:18 +00:00


lokesh-lingarajan ad6609a606

Kafka Input Format for headers, key and payload parsing (#11630 )

### Description

Today we ingest a number of high-cardinality metrics into Druid across dimensions. These metrics are rolled up on a per-minute basis and are very useful when looking at metrics on a partition or client basis. Events are another class of data that provides useful information about a particular incident or scenario inside a Kafka cluster. Events themselves are carried inside the Kafka payload, but some very useful metadata is carried in the Kafka headers, which can serve as useful dimensions for aggregation and in turn bring better insights.

PR https://github.com/apache/druid/pull/10730 introduced support for Kafka headers in InputFormats.

We still need an input format that parses the headers and translates them into relevant columns in Druid. Until that is implemented, none of the information available in the Kafka message headers is exposed. So first we need an input format that can parse headers in any supported format, the same way we parse payloads today. Apart from headers, the key portion of the Kafka record also carries useful information, and we need a way to expose it as Druid columns. Finally, we need a generic way to express at configuration time which attributes from headers, key and payload are ingested into Druid, and the design must stay generic enough that users can specify different parsers for headers, key and payload.

This PR is designed to solve the above by providing a wrapper around any existing input format and merging the data into a single unified Druid row.

Let's look at a sample input format from the above discussion:

"inputFormat":
{
  "type": "kafka", // New input format type
  "headerLabelPrefix": "kafka.header.", // Label prefix for header columns; this avoids collisions while merging columns
  "recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made available in case the payload does not carry a timestamp
  "headerFormat": // Header parser specifying that values are of type string
  {
    "type": "string"
  },
  "valueFormat": // Value parser using JSON parsing
  {
    "type": "json",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [...]
    }
  },
  "keyFormat": // Key parser also using JSON parsing
  {
    "type": "json"
  }
}

Since header, key and payload have independent sections, each can be parsed with its own parser, e.g., headers coming in as strings and the payload as JSON.

KafkaInputFormat is the umbrella class that implements the InputFormat interface. It is responsible for creating the individual parsers for header, key and payload, blending the data while resolving column conflicts, and generating a single unified InputRow for Druid ingestion.

"headerFormat" lets users plug in a parser type for the header values. Header attributes get the default prefix "kafka.header." (which can be overridden) to avoid collisions when they are merged with payload attributes.

The Kafka payload parser is responsible for parsing the value portion of the Kafka record. This is where most of the data comes from, and any existing parser can be plugged in. Note that if the payload holds a batch of records, the code adds the header and key values to every record in the batch.

The Kafka key parser handles the key portion of the Kafka record and ingests the key under the dimension name "kafka.key".
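
The header, payload and key handling described above can be sketched roughly as follows. This is an illustration in Python, not the Java code from the PR: the function names `header_columns` and `rows_for_record` are hypothetical, the header values are assumed to be UTF-8 strings, and the key is assumed to be already parsed into a single value.

```python
from typing import Any, Dict, Iterable, List, Tuple

HEADER_PREFIX = "kafka.header."  # default header prefix named in the description
KEY_COLUMN = "kafka.key"         # key dimension name named in the description


def header_columns(
    headers: Iterable[Tuple[str, bytes]], prefix: str = HEADER_PREFIX
) -> Dict[str, str]:
    """Decode each header value as text and prefix its name (hypothetical helper)."""
    # Assumption: header values are UTF-8 strings, as with "type": "string".
    return {prefix + name: raw.decode("utf-8") for name, raw in headers}


def rows_for_record(
    headers: Iterable[Tuple[str, bytes]],
    key: Any,
    payload_rows: List[Dict[str, Any]],
    prefix: str = HEADER_PREFIX,
) -> List[Dict[str, Any]]:
    """Attach the record-level header and key columns to every payload row."""
    record_level: Dict[str, Any] = header_columns(headers, prefix)
    record_level[KEY_COLUMN] = key
    # One Kafka record may yield several payload rows (a batch); each row
    # receives the same header and key columns.
    return [{**record_level, **row} for row in payload_rows]


rows = rows_for_record(
    headers=[("env", b"prod")],
    key="broker-1",
    payload_rows=[
        {"metric": "bytes_in", "value": 10},
        {"metric": "bytes_out", "value": 7},
    ],
)
for row in rows:
    print(row)
```

Both printed rows carry `kafka.header.env` and `kafka.key` next to their own payload columns.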

## KafkaInputFormat Class:
This class orchestrates sending the consumerRecord to each parser, retrieving the rows, and merging the columns into one final row for Druid consumption. KafkaInputFormat should make sure to release the resources allocated by the reader in CloseableIterator<InputRow>, in both normal and exception cases.
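
The resource-release requirement can be illustrated with a small Python sketch. It is an analogy for the CloseableIterator contract, not the Java implementation: `TrackingReader` and `iterate_and_close` are hypothetical names, and the reader is assumed to expose a `close()` method.

```python
from typing import Any, Dict, Iterator, List, Optional


class TrackingReader:
    """Hypothetical reader that records whether close() was called."""

    def __init__(self, rows: List[Dict[str, Any]], fail_at: Optional[int] = None):
        self._rows = rows
        self._fail_at = fail_at  # index at which parsing fails, if any
        self.closed = False

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        for i, row in enumerate(self._rows):
            if i == self._fail_at:
                raise ValueError("unparseable record")
            yield row

    def close(self) -> None:
        self.closed = True


def iterate_and_close(reader: TrackingReader) -> Iterator[Dict[str, Any]]:
    """Yield rows and close the reader on exhaustion or on error."""
    try:
        yield from reader
    finally:
        # Runs after the last row and also when parsing raises.
        reader.close()


ok = TrackingReader([{"a": 1}])
ok_rows = list(iterate_and_close(ok))

bad = TrackingReader([{"a": 1}, {"a": 2}], fail_at=1)
try:
    list(iterate_and_close(bad))
    raised = False
except ValueError:
    raised = True
```

In both runs the reader ends up closed, which is the behavior the paragraph above asks of the real iterator.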

When dimension or metric names conflict, the code prefers the value from the payload and ignores the one from the headers or key. This lets existing input formats migrate to the new format without worrying about losing information.
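
A minimal sketch of this conflict rule, written in Python for illustration only. The helpers `merge_preferring_payload` and `shadowed_columns` are hypothetical and are not part of Druid; the column names in the example are made up.

```python
from typing import Any, Dict, List


def merge_preferring_payload(
    payload_row: Dict[str, Any], metadata_cols: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge header/key columns with the payload; the payload wins on conflict."""
    merged = dict(metadata_cols)
    merged.update(payload_row)  # payload values overwrite same-named metadata
    return merged


def shadowed_columns(
    payload_row: Dict[str, Any], metadata_cols: Dict[str, Any]
) -> List[str]:
    """Names present on both sides, i.e. header/key values that get ignored."""
    return sorted(set(payload_row) & set(metadata_cols))


payload = {"kafka.key": "from-payload", "metric": "bytes_in"}
metadata = {"kafka.key": "from-record-key", "kafka.header.env": "prod"}

merged = merge_preferring_payload(payload, metadata)
ignored = shadowed_columns(payload, metadata)
print(merged)
print(ignored)
```

Here the payload already has a column called `kafka.key`, so the value taken from the record key is dropped while the non-conflicting header column is kept.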

2021-10-07 08:56:27 -07:00

.github

Lock hadoop dependencies to 2.8.5 (#11583 )

2021-08-12 15:16:47 +05:30

.idea

Use ExecutorService variables to assign ExecutorService Instances (#11373 )

2021-06-25 16:56:34 -07:00

benchmarks

refactor sql authorization to get resource type from schema, resource type to be string (#11692 )

2021-09-17 09:53:25 -07:00

cloud

bump version to 0.23.0-SNAPSHOT (#11670 )

2021-09-08 15:56:04 -07:00

codestyle

handle timestamps of complex types when parsing protobuf messages (#11293 )

2021-06-07 15:19:39 +05:30

core

Add cpu/cpuset cgroup and procfs data gathering (#11763 )

2021-10-06 20:27:36 -07:00

dev

chore: fix case of GitHub (#10928 )

2021-05-07 01:15:43 -07:00

distribution

add utility to aid in formatting release notes to be linkable (#11728 )

2021-10-05 18:26:41 -07:00

docs

Kafka Input Format for headers, key and payload parsing (#11630 )

2021-10-07 08:56:27 -07:00

examples

Allow spaces in java home. (#11407 )

2021-07-05 18:50:36 +05:30

extendedset

bump version to 0.23.0-SNAPSHOT (#11670 )

2021-09-08 15:56:04 -07:00

extensions-contrib

Fix moving average extension loading in middle manager and overlord (#11662 )

2021-09-08 22:09:22 -07:00

extensions-core

Kafka Input Format for headers, key and payload parsing (#11630 )

2021-10-07 08:56:27 -07:00

helm/druid

remove DEPRECATION part (#11326 )

2021-06-09 15:52:43 +08:00

hll

bump version to 0.23.0-SNAPSHOT (#11670 )

2021-09-08 15:56:04 -07:00

hooks

Add git pre-commit hook to source control (#9554 )

2020-06-05 11:19:42 -10:00

indexing-hadoop

bump version to 0.23.0-SNAPSHOT (#11670 )

2021-09-08 15:56:04 -07:00

indexing-service

fix broken build (#11727 )

2021-09-21 22:59:51 +07:00

integration-tests

Add killAndRestart for container for integration tests (#11754 )

2021-09-30 13:47:57 -07:00

licenses

Web console: Better hotkeys and library upgrades (#11365 )

2021-06-17 18:24:29 -07:00

processing

Task reports for parallel task: single phase and sequential mode (#11688 )

2021-09-16 13:58:11 -05:00

publications

De-incubation cleanup in code, docs, packaging (#9108 )

2020-01-03 12:33:19 -05:00

server

Implement configurable internally generated query context (#11429 )

2021-10-06 09:02:41 -07:00

services

Implement configurable internally generated query context (#11429 )

2021-10-06 09:02:41 -07:00

sql

Implement configurable internally generated query context (#11429 )

2021-10-06 09:02:41 -07:00

web-console

don't throw local storage errors (#11752 )

2021-10-05 18:49:16 -07:00

website

Kafka Input Format for headers, key and payload parsing (#11630 )

2021-10-07 08:56:27 -07:00

.asf.yaml

Add .asf.yaml. (#9083 )

2019-12-20 16:45:38 -08:00

.backportrc.json

Add 0.18.0 to .backportrc.json to facilitate backport. (#9661 )

2020-04-11 13:49:04 -07:00

.codecov.yml

Use Codecov (#8388 )

2019-08-28 08:49:30 -07:00

.dockerignore

Add docker container for druid (#6896 )

2019-02-08 12:12:28 +00:00

.gitignore

Web console basic end-to-end-test (#9595 )

2020-04-09 12:38:09 -07:00

.lgtm.yml

Suppress LGTM warnings about stack trace exposure (#9631 )

2020-04-09 17:31:03 -07:00

.travis.yml

dependency check with inhert instead of aggregate (#11709 )

2021-09-15 04:18:59 -07:00

check_test_suite_test.py

suppress false positive cve (#11699 )

2021-09-13 20:45:38 -07:00

check_test_suite.py

suppress false positive cve (#11699 )

2021-09-13 20:45:38 -07:00

CONTRIBUTING.md

Fix numbered list formatting in markdown. (#9664 )

2020-04-21 20:18:12 -07:00

LABELS

Add plain text README.txt, use relative link from README.md to build.md (#7611 )

2019-05-09 21:29:26 -07:00

LICENSE

support Aliyun OSS service as deep storage (#9898 )

2020-07-01 22:20:53 -07:00

licenses.yaml

Update Apache Kafka client libraries to 3.0.0 (#11735 )

2021-10-05 10:23:19 -07:00

NOTICE

license.yaml fixes for code introduced related to AWS RDS token based password provider in PR #9518 (#10885 )

2021-03-10 12:59:25 -08:00

owasp-dependency-check-suppressions.xml

suppress hive-storage-api thrift security vulnerability (#11753 )

2021-09-28 23:54:13 -07:00

pom.xml

Update Apache Kafka client libraries to 3.0.0 (#11735 )

2021-10-05 10:23:19 -07:00

README.md

Updates to source and doc build pages (#11464 )

2021-07-20 18:07:34 -07:00

README.template

De-incubation cleanup in code, docs, packaging (#9108 )

2020-01-03 12:33:19 -05:00

setup-hooks.sh

Add git pre-commit hook to source control (#9554 )

2020-06-05 11:19:42 -10:00

upload.sh

Adding licenses and enable apache-rat-plugin. (#6215 )

2018-09-18 08:39:26 -07:00

README.md

Apache Druid

Druid is a high performance real-time analytics database. Druid's main value add is to reduce time to insight and action.

Druid is designed for workflows where fast queries and ingest really matter. Druid excels at powering UIs, running operational (ad-hoc) queries, or handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases. The design documentation explains the key concepts.

Getting started

You can get started with Druid with our local or Docker quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in console (shown below).

Load data

Load streaming and batch data using a point-and-click wizard to guide you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and services from one convenient location. All views are powered by SQL system tables, allowing you to see the underlying query for each view.

Issue queries

Use the built-in query workbench to prototype DruidSQL and native queries or connect one of the many tools that help you make the most out of Druid.

Documentation

You can find the documentation for the latest Druid release on the project website.

If you would like to contribute documentation, please do so under /docs in this repository and submit a pull request.

Community

Community support is available on the druid-user mailing list, which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

Chat with Druid committers and users in real-time on the #druid channel in the Apache Slack team. Please use this invitation link to join the ASF Slack, and once joined, go into the #druid channel.

Building from source

Please note that JDK 8 is required to build Druid.

For instructions on building Druid from source, see docs/development/build.md

Contributing

Please follow the community guidelines for contributing.

For instructions on setting up IntelliJ, see dev/intellij-setup.md

License

Apache License, Version 2.0
