---
id: kafka-extraction-namespace
title: "Apache Kafka Lookups"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

To use this Apache Druid extension, [include](../../development/extensions.md#loading-extensions) `druid-lookups-cached-global` and `druid-kafka-extraction-namespace` in the extensions load list.

If lookup updates need to propagate as quickly as possible, you can use a Kafka topic as a `LookupExtractorFactory`. The key of each record is the old value and the message is the desired new value, both encoded as UTF-8.

```json
{
  "type":"kafka",
  "kafkaTopic":"testTopic",
  "kafkaProperties":{
    "bootstrap.servers":"kafka.service:9092"
  }
}
```

| Parameter         | Description                                                                             | Required | Default           |
|-------------------|-----------------------------------------------------------------------------------------|----------|-------------------|
| `kafkaTopic`      | The Kafka topic to read the data from                                                   | Yes      ||
| `kafkaProperties` | Kafka consumer properties (`bootstrap.servers` must be specified)                       | Yes      ||
| `connectTimeout`  | How long to wait for an initial connection                                              | No       | `0` (do not wait) |
| `isOneToOne`      | Whether the map is one-to-one (see [Lookup DimensionSpecs](../../querying/dimensionspecs.md)) | No       | `false`           |

The `druid-kafka-extraction-namespace` extension reads key/value pairs from an [Apache Kafka](https://kafka.apache.org/) topic so that dimension values can be renamed. An example use case is renaming an ID to a human-readable format.

## How it Works

The extractor consumes the configured Kafka topic from the beginning and adds every record to an internal map. The key of the Kafka record is used as the key of the map, and the payload of the record is used as the value. At query time, a lookup can be used to transform the key into the associated value. See [lookups](../../querying/lookups.md) for how to configure and use lookups in a query. The lookup extractor stores both keys and values as strings.

The extractor remains subscribed to the topic, so new records are added to the lookup map as they appear. This allows for lookup values to be updated in near-realtime. If two records are added to the topic with the same key, the record with the larger offset will replace the previous record in the lookup map. A record with a `null` payload will be treated as a tombstone record, and the associated key will be removed from the lookup map.

The extractor treats the input topic much like a [KTable](https://kafka.apache.org/23/javadoc/org/apache/kafka/streams/kstream/KTable.html). As such, it is best to create your Kafka topic using a [log compaction](https://kafka.apache.org/documentation/#compaction) strategy, so that the most-recent version of a key is always preserved in Kafka. Without properly configuring retention and log compaction, older keys that are automatically removed from Kafka will not be available and will be lost when Druid services are restarted.

### Example

Suppose a `country_codes` topic is being consumed, and the following records are added to the topic in this order:

| Offset | Key | Payload     |
|--------|-----|-------------|
| 1      | NZ  | Nu Zeelund  |
| 2      | AU  | Australia   |
| 3      | NZ  | New Zealand |
| 4      | AU  | `null`      |
| 5      | NZ  | Aotearoa    |
| 6      | CZ  | Czechia     |

This input topic would be consumed from the beginning, and result in a lookup namespace containing the following mappings (notice that the entry for _Australia_ was added and then deleted):

| Key | Value     |
|-----|-----------|
| NZ  | Aotearoa  |
| CZ  | Czechia   |

Now when a query uses this extraction namespace, the country codes can be mapped to the full country name at query time.
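
The replay in this example can be sketched in a few lines of Python. This is only an illustration of the map semantics described above (later offsets overwrite earlier ones, `null` payloads delete), not the extension's actual implementation. The function name `build_lookup_map` is hypothetical, and `None` stands in for a `null` Kafka payload.

```python
from typing import Dict, Iterable, Optional, Tuple

# (key, payload) in offset order; a payload of None models a null (tombstone) record.
Record = Tuple[str, Optional[str]]


def build_lookup_map(records: Iterable[Record]) -> Dict[str, str]:
    """Replay records in offset order into a key -> value lookup map."""
    lookup: Dict[str, str] = {}
    for key, payload in records:
        if payload is None:
            # Tombstone: remove the key if it is present.
            lookup.pop(key, None)
        else:
            # A record with a larger offset replaces the previous value.
            lookup[key] = payload
    return lookup


country_codes = [
    ("NZ", "Nu Zeelund"),
    ("AU", "Australia"),
    ("NZ", "New Zealand"),
    ("AU", None),
    ("NZ", "Aotearoa"),
    ("CZ", "Czechia"),
]

print(build_lookup_map(country_codes))
```

Running the sketch on the six records leaves only the `NZ` and `CZ` entries, matching the table above.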

## Tombstones and Deleting Records

The Kafka lookup extractor treats records with a `null` message as tombstones. When a record with a `null` payload arrives on the input topic, the extractor removes the associated key from the lookup map, effectively deleting it.

## Limitations

The consumer properties `group.id`, `auto.offset.reset` and `enable.auto.commit` cannot be set in `kafkaProperties` because the extension sets them to `UUID.randomUUID().toString()`, `earliest` and `false` respectively. The Druid service must consume the entire topic from the very beginning so that it can build a complete map of lookup values. Setting any of these consumer properties prevents the extractor from starting.
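
Because a reserved property stops the extractor from starting, it can help to check a lookup spec before submitting it. The following is a minimal sketch of such a pre-flight check in Python. The helper `find_kafka_property_problems` is hypothetical and is not part of Druid; it only encodes the rules stated on this page.

```python
from typing import List, Mapping

# Consumer properties that the extension sets itself, as listed above.
RESERVED_CONSUMER_PROPERTIES = ("group.id", "auto.offset.reset", "enable.auto.commit")


def find_kafka_property_problems(kafka_properties: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in a kafkaProperties map (empty if none)."""
    problems = []
    for name in RESERVED_CONSUMER_PROPERTIES:
        if name in kafka_properties:
            problems.append(f"'{name}' is set by the extension and must not be specified")
    if "bootstrap.servers" not in kafka_properties:
        problems.append("'bootstrap.servers' must be specified")
    return problems


spec = {
    "type": "kafka",
    "kafkaTopic": "testTopic",
    "kafkaProperties": {"bootstrap.servers": "kafka.service:9092"},
}

for problem in find_kafka_property_problems(spec["kafkaProperties"]):
    print(problem)
```

The example spec is the one shown at the top of this page, so the check reports no problems for it.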

Currently, the Kafka lookup extractor loads the entire Kafka topic into a local cache. If you are using on-heap caching, a topic with a large number of unique keys can exhaust your Java heap. Off-heap caching should alleviate these concerns, but there is still a limit to the quantity of data that can be stored. There is currently no eviction policy.

## Testing the Kafka rename functionality

To test this setup, you can send key/value pairs to a Kafka topic using the console producer:

```
./bin/kafka-console-producer.sh --property parse.key=true --property key.separator="->" --broker-list localhost:9092 --topic testTopic
```

Renames can then be published as `OLD_VAL->NEW_VAL` followed by a newline (enter or return).
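
When testing with more than a handful of renames, the input lines can be generated rather than typed. The sketch below formats pairs as `OLD_VAL->NEW_VAL` lines that could be piped into the console producer command above. The helper `format_renames` is hypothetical, and the validation assumes the console producer splits each line on the first occurrence of the separator, so a key that contains the separator would be parsed incorrectly.

```python
from typing import Iterable, Tuple

# Must match the key.separator value passed to the console producer.
KEY_SEPARATOR = "->"


def format_renames(pairs: Iterable[Tuple[str, str]], separator: str = KEY_SEPARATOR) -> str:
    """Format (old_value, new_value) pairs as one producer input line each."""
    lines = []
    for old_value, new_value in pairs:
        if separator in old_value:
            raise ValueError(f"key {old_value!r} contains the separator {separator!r}")
        if "\n" in old_value or "\n" in new_value:
            raise ValueError("keys and values must not contain newlines")
        lines.append(f"{old_value}{separator}{new_value}\n")
    return "".join(lines)


print(format_renames([("NZ", "Aotearoa"), ("CZ", "Czechia")]), end="")
```

This only produces non-null values. It does not cover sending tombstone records.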