Commit Graph

6 Commits

Author SHA1 Message Date
Abhishek Radhakrishnan aa833a711c
Support for reading Delta Lake table snapshots (#17004)
Problem
Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation.

Description
Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid.

In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently.
2024-09-09 14:12:48 +05:30
Abhishek Radhakrishnan acadc2df20
Handle Delta StructType, ArrayType and MapType (#16884)
Handle the following Delta complex types:
a. StructType as JSON
b. ArrayType as Java list
c. MapType as Java map

Generate and add a new Delta table complex-types-table that contains the above complex types for testing.

Update the tests to include a parameterized test with complex-types-table, with the expectations defined in ComplexTypesDeltaTable.java.
2024-08-13 07:50:03 -07:00
Abhishek Radhakrishnan 75937c98e8
Upgrade delta kernel from 3.1.0 to 3.2.0 (#16513)
Upstream release: https://github.com/delta-io/delta/releases/tag/v3.2.0

- Upgrade kernel dependency to 3.2.0
- Notable breaking changes introduced in upstream that affects the Druid extension:
 - Rename TableClient -> Engine
 - Rename DefaultTableClient -> DefaultEngine
 - Exceptions moved to a separate package
 - Table.getPath() doesn't throw TableNotFoundException. Instead the exception is thrown
   when getting snapshot info from the Table object
2024-05-29 10:46:30 -07:00
Abhishek Radhakrishnan 1d7595f3f7
Support for filters in the Druid Delta Lake connector (#16288)
* Delta Lake support for filters.

* Updates

* cleanup comments

* Docs

* Remmove Enclosed runner

* Rename

* Cleanup test

* Serde test for the Delta input source and fix jackson annotation.

* Updates and docs.

* Update error messages to be clearer

* Fixes

* Handle NumberFormatException to provide a nicer error message.

* Apply suggestions from code review

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Doc fixes based on feedback

* Yes -> yes in docs; reword slightly.

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Documentation, javadoc and more updates.

* Not with an or expression end-to-end test.

* Break up =, >, >=, <, <= into its own types instead of sub-classing.

---------

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
2024-04-29 11:31:36 -07:00
Abhishek Radhakrishnan 1affa35b29
Bump up Delta Lake Kernel to 3.1.0 (#15842)
This patch bumps Delta Lake Kernel dependency from 3.0.0 to 3.1.0, which released last week - please see https://github.com/delta-io/delta/releases/tag/v3.1.0 for release notes.

There were a few "breaking" API changes in 3.1.0, you can find the rationale for some of those changes here.

Next-up in this extension: add and expose filter predicates.
2024-02-06 21:25:17 +05:30
Abhishek Radhakrishnan 9f95a691f7
Extension to read and ingest Delta Lake tables (#15755)
* something

* test commit

* compilation fix

* more compilation fixes (fixme placeholders)

* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake

Will need to sort out the dependencies later.

* checkpoint

* remove snapshot schema since we can get schema from the row

* iterator bug fix

* json json json

* sampler flow

* empty impls for read(InputStats) and sample()

* conversion?

* conversion, without timestamp

* Web console changes to show Delta Lake

* Asset bug fix and tile load

* Add missing pieces to input source info, etc.

* fix stuff

* Use a different delta lake asset

* Delta lake extension dependencies

* Cleanup

* Add InputSource, module init and helper code to process delta files.

* Test init

* Checkpoint changes

* Test resources and updates

* some fixes

* move to the correct package

* More tests

* Test cleanup

* TODOs

* Test updates

* requirements and javadocs

* Adjust dependencies

* Update readme

* Bump up version

* fixup typo in deps

* forbidden api and checkstyle checks

* Trim down dependencies

* new lines

* Fixup Intellij inspections.

* Add equals() and hashCode()

* chain splits, intellij inspections

* review comments and todo placeholder

* fix up some docs

* null table path and test dependencies. Fixup broken link.

* run prettify

* Different test; fixes

* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests

* yank the old test resource.

* add a couple of sad path tests

* Updates to readme based on latest.

* Version support

* Extract Delta DateTime converstions to DeltaTimeUtils class and add test

* More comprehensive split tests.

* Some test renames.

* Cleanup and update instructions.

* add pruneSchema() optimization for table scans.

* Oops, missed the parquet files.

* Update default table and rename schema constants.

* Test setup and misc changes.

* Add class loader logic as the context class loader is unaware about extension classes

* change some table client creation logic.

* Add hadoop-aws, hadoop-common and related exclusions.

* Remove org.apache.hadoop:hadoop-common

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Add entry to .spelling to fix docs static check

---------

Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-01-30 21:53:50 -08:00