druid/docs/development/extensions-contrib/delta-lake.md

---
id: delta-lake
title: "Delta Lake extension"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->


Delta Lake is an open source storage framework that enables building a
Lakehouse architecture with various compute engines. [DeltaLakeInputSource](../../ingestion/input-sources.md#delta-lake-input-source) lets
you ingest data stored in a Delta Lake table into Apache Druid. To use the Delta Lake extension, add the `druid-deltalake-extensions` to the list of loaded extensions.
See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.

The Delta input source reads the configured Delta Lake table and extracts the underlying Delta files in the table's latest snapshot
based on an optional Delta filter. These Delta Lake files are versioned Parquet files.

## Version support

The Delta Lake extension uses the Delta Kernel introduced in Delta Lake 3.0.0, which is compatible with Apache Spark 3.5.x.
Older versions are unsupported, so consider upgrading to Delta Lake 3.0.x or higher to use this extension.

## Downloading Delta Lake extension

To download `druid-deltalake-extensions`, run the following command after replacing `<VERSION>` with the desired
Druid version:

```shell
java \
  -cp "lib/*" \
  -Ddruid.extensions.directory="extensions" \
  -Ddruid.extensions.hadoopDependenciesDir="hadoop-dependencies" \
  org.apache.druid.cli.Main tools pull-deps \
  --no-default-hadoop \
  -c "org.apache.druid.extensions.contrib:druid-deltalake-extensions:<VERSION>"
```

See [Loading community extensions](../../configuration/extensions.md#loading-community-extensions) for more information.
Extension to read and ingest Delta Lake tables (#15755) * something * test commit * compilation fix * more compilation fixes (fixme placeholders) * Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake Will need to sort out the dependencies later. * checkpoint * remove snapshot schema since we can get schema from the row * iterator bug fix * json json json * sampler flow * empty impls for read(InputStats) and sample() * conversion? * conversion, without timestamp * Web console changes to show Delta Lake * Asset bug fix and tile load * Add missing pieces to input source info, etc. * fix stuff * Use a different delta lake asset * Delta lake extension dependencies * Cleanup * Add InputSource, module init and helper code to process delta files. * Test init * Checkpoint changes * Test resources and updates * some fixes * move to the correct package * More tests * Test cleanup * TODOs * Test updates * requirements and javadocs * Adjust dependencies * Update readme * Bump up version * fixup typo in deps * forbidden api and checkstyle checks * Trim down dependencies * new lines * Fixup Intellij inspections. * Add equals() and hashCode() * chain splits, intellij inspections * review comments and todo placeholder * fix up some docs * null table path and test dependencies. Fixup broken link. * run prettify * Different test; fixes * Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests * yank the old test resource. * add a couple of sad path tests * Updates to readme based on latest. * Version support * Extract Delta DateTime converstions to DeltaTimeUtils class and add test * More comprehensive split tests. * Some test renames. * Cleanup and update instructions. * add pruneSchema() optimization for table scans. * Oops, missed the parquet files. * Update default table and rename schema constants. * Test setup and misc changes. * Add class loader logic as the context class loader is unaware about extension classes * change some table client creation logic. * Add hadoop-aws, hadoop-common and related exclusions. * Remove org.apache.hadoop:hadoop-common * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Add entry to .spelling to fix docs static check --------- Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> 2024-01-31 00:53:50 -05:00			`---`
			`id: delta-lake`
			`title: "Delta Lake extension"`
			`---`

			`<!--`
			`~ Licensed to the Apache Software Foundation (ASF) under one`
			`~ or more contributor license agreements. See the NOTICE file`
			`~ distributed with this work for additional information`
			`~ regarding copyright ownership. The ASF licenses this file`
			`~ to you under the Apache License, Version 2.0 (the`
			`~ "License"); you may not use this file except in compliance`
			`~ with the License. You may obtain a copy of the License at`
			`~`
			`~ http://www.apache.org/licenses/LICENSE-2.0`
			`~`
			`~ Unless required by applicable law or agreed to in writing,`
			`~ software distributed under the License is distributed on an`
			`~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY`
			`~ KIND, either express or implied. See the License for the`
			`~ specific language governing permissions and limitations`
			`~ under the License.`
			`-->`


			`Delta Lake is an open source storage framework that enables building a`
			`Lakehouse architecture with various compute engines. [DeltaLakeInputSource](../../ingestion/input-sources.md#delta-lake-input-source) lets`
			you ingest data stored in a Delta Lake table into Apache Druid. To use the Delta Lake extension, add the `druid-deltalake-extensions` to the list of loaded extensions.
			`See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.`

Support for filters in the Druid Delta Lake connector (#16288) * Delta Lake support for filters. * Updates * cleanup comments * Docs * Remmove Enclosed runner * Rename * Cleanup test * Serde test for the Delta input source and fix jackson annotation. * Updates and docs. * Update error messages to be clearer * Fixes * Handle NumberFormatException to provide a nicer error message. * Apply suggestions from code review Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Doc fixes based on feedback * Yes -> yes in docs; reword slightly. * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Documentation, javadoc and more updates. * Not with an or expression end-to-end test. * Break up =, >, >=, <, <= into its own types instead of sub-classing. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com> 2024-04-29 14:31:36 -04:00			`The Delta input source reads the configured Delta Lake table and extracts the underlying Delta files in the table's latest snapshot`
			`based on an optional Delta filter. These Delta Lake files are versioned Parquet files.`
Extension to read and ingest Delta Lake tables (#15755) * something * test commit * compilation fix * more compilation fixes (fixme placeholders) * Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake Will need to sort out the dependencies later. * checkpoint * remove snapshot schema since we can get schema from the row * iterator bug fix * json json json * sampler flow * empty impls for read(InputStats) and sample() * conversion? * conversion, without timestamp * Web console changes to show Delta Lake * Asset bug fix and tile load * Add missing pieces to input source info, etc. * fix stuff * Use a different delta lake asset * Delta lake extension dependencies * Cleanup * Add InputSource, module init and helper code to process delta files. * Test init * Checkpoint changes * Test resources and updates * some fixes * move to the correct package * More tests * Test cleanup * TODOs * Test updates * requirements and javadocs * Adjust dependencies * Update readme * Bump up version * fixup typo in deps * forbidden api and checkstyle checks * Trim down dependencies * new lines * Fixup Intellij inspections. * Add equals() and hashCode() * chain splits, intellij inspections * review comments and todo placeholder * fix up some docs * null table path and test dependencies. Fixup broken link. * run prettify * Different test; fixes * Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests * yank the old test resource. * add a couple of sad path tests * Updates to readme based on latest. * Version support * Extract Delta DateTime converstions to DeltaTimeUtils class and add test * More comprehensive split tests. * Some test renames. * Cleanup and update instructions. * add pruneSchema() optimization for table scans. * Oops, missed the parquet files. * Update default table and rename schema constants. * Test setup and misc changes. * Add class loader logic as the context class loader is unaware about extension classes * change some table client creation logic. * Add hadoop-aws, hadoop-common and related exclusions. * Remove org.apache.hadoop:hadoop-common * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Add entry to .spelling to fix docs static check --------- Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> 2024-01-31 00:53:50 -05:00
			`## Version support`

			`The Delta Lake extension uses the Delta Kernel introduced in Delta Lake 3.0.0, which is compatible with Apache Spark 3.5.x.`
			`Older versions are unsupported, so consider upgrading to Delta Lake 3.0.x or higher to use this extension.`

Update `groupId` for delta-lake and iceberg extensions (#15843) * Update the group id to org.apache.druid.extensions.contrib for contrib exts. * Note iceberg and delta lake extensions in extensions.md * properties and shell backticks * Update groupId in distribution/pom.xml * remove delta-lake from dist. * Add note on downloading extension. 2024-02-08 02:54:06 -05:00			`## Downloading Delta Lake extension`

			To download `druid-deltalake-extensions`, run the following command after replacing `<VERSION>` with the desired
			`Druid version:`

			```shell
			`java \`
			`-cp "lib/*" \`
			`-Ddruid.extensions.directory="extensions" \`
			`-Ddruid.extensions.hadoopDependenciesDir="hadoop-dependencies" \`
			`org.apache.druid.cli.Main tools pull-deps \`
			`--no-default-hadoop \`
			`-c "org.apache.druid.extensions.contrib:druid-deltalake-extensions:<VERSION>"`
			```

Support for reading Delta Lake table snapshots (#17004) Problem Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation. Description Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid. In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently. 2024-09-09 04:42:48 -04:00			`See [Loading community extensions](../../configuration/extensions.md#loading-community-extensions) for more information.`