Commit Graph

10 Commits

Author SHA1 Message Date
Abhishek Agarwal 78775ad398
Prepare master for 32.0.0 release () 2024-09-10 11:01:20 +05:30
Abhishek Radhakrishnan aa833a711c
Support for reading Delta Lake table snapshots ()
Problem
Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation.

Description
Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid.

In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently.
2024-09-09 14:12:48 +05:30
Abhishek Radhakrishnan acadc2df20
Handle Delta StructType, ArrayType and MapType ()
Handle the following Delta complex types:
a. StructType as JSON
b. ArrayType as Java list
c. MapType as Java map

Generate and add a new Delta table complex-types-table that contains the above complex types for testing.

Update the tests to include a parameterized test with complex-types-table, with the expectations defined in ComplexTypesDeltaTable.java.
2024-08-13 07:50:03 -07:00
Abhishek Radhakrishnan e01f155209
Add missing `delta-storage` dependency and class loader workaround to Delta table ingestion ()
* Workaround to ingesting from Delta table in 3.2.0.

With the upgrade to Kernel 3.2.0, the Druid Delta connector extension
isn't able to ingest from Delta tables successfully.

Please see https://github.com/delta-io/delta/issues/3299

The underlying problem seems to be coming from
https://github.com/delta-io/delta/blob/master/kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/logstore/LogStoreProvider.java#L99

This patch is a workaround to setting the thread class loader explictly.
The Kernel community may consider a fix in the next release as it's affected another
connector as well.

* Address review comment: clear the CL after the Thread CL is set.
2024-06-25 09:16:13 -07:00
Abhishek Radhakrishnan 75937c98e8
Upgrade delta kernel from 3.1.0 to 3.2.0 ()
Upstream release: https://github.com/delta-io/delta/releases/tag/v3.2.0

- Upgrade kernel dependency to 3.2.0
- Notable breaking changes introduced in upstream that affects the Druid extension:
 - Rename TableClient -> Engine
 - Rename DefaultTableClient -> DefaultEngine
 - Exceptions moved to a separate package
 - Table.getPath() doesn't throw TableNotFoundException. Instead the exception is thrown
   when getting snapshot info from the Table object
2024-05-29 10:46:30 -07:00
Abhishek Radhakrishnan 1d7595f3f7
Support for filters in the Druid Delta Lake connector ()
* Delta Lake support for filters.

* Updates

* cleanup comments

* Docs

* Remmove Enclosed runner

* Rename

* Cleanup test

* Serde test for the Delta input source and fix jackson annotation.

* Updates and docs.

* Update error messages to be clearer

* Fixes

* Handle NumberFormatException to provide a nicer error message.

* Apply suggestions from code review

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Doc fixes based on feedback

* Yes -> yes in docs; reword slightly.

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Documentation, javadoc and more updates.

* Not with an or expression end-to-end test.

* Break up =, >, >=, <, <= into its own types instead of sub-classing.

---------

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
2024-04-29 11:31:36 -07:00
Adarsh Sanjeev 9a2d7c28bc
Prepare master branch for 31.0.0 release () 2024-04-26 09:22:43 +05:30
Abhishek Radhakrishnan 1a5b57df84
Update `groupId` for delta-lake and iceberg extensions ()
* Update the group id to org.apache.druid.extensions.contrib for contrib exts.

* Note iceberg and delta lake extensions in extensions.md

* properties and shell backticks

* Update groupId in distribution/pom.xml

* remove delta-lake from dist.

* Add note on downloading extension.
2024-02-07 23:54:06 -08:00
Abhishek Radhakrishnan 1affa35b29
Bump up Delta Lake Kernel to 3.1.0 ()
This patch bumps Delta Lake Kernel dependency from 3.0.0 to 3.1.0, which released last week - please see https://github.com/delta-io/delta/releases/tag/v3.1.0 for release notes.

There were a few "breaking" API changes in 3.1.0, you can find the rationale for some of those changes here.

Next-up in this extension: add and expose filter predicates.
2024-02-06 21:25:17 +05:30
Abhishek Radhakrishnan 9f95a691f7
Extension to read and ingest Delta Lake tables ()
* something

* test commit

* compilation fix

* more compilation fixes (fixme placeholders)

* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake

Will need to sort out the dependencies later.

* checkpoint

* remove snapshot schema since we can get schema from the row

* iterator bug fix

* json json json

* sampler flow

* empty impls for read(InputStats) and sample()

* conversion?

* conversion, without timestamp

* Web console changes to show Delta Lake

* Asset bug fix and tile load

* Add missing pieces to input source info, etc.

* fix stuff

* Use a different delta lake asset

* Delta lake extension dependencies

* Cleanup

* Add InputSource, module init and helper code to process delta files.

* Test init

* Checkpoint changes

* Test resources and updates

* some fixes

* move to the correct package

* More tests

* Test cleanup

* TODOs

* Test updates

* requirements and javadocs

* Adjust dependencies

* Update readme

* Bump up version

* fixup typo in deps

* forbidden api and checkstyle checks

* Trim down dependencies

* new lines

* Fixup Intellij inspections.

* Add equals() and hashCode()

* chain splits, intellij inspections

* review comments and todo placeholder

* fix up some docs

* null table path and test dependencies. Fixup broken link.

* run prettify

* Different test; fixes

* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests

* yank the old test resource.

* add a couple of sad path tests

* Updates to readme based on latest.

* Version support

* Extract Delta DateTime converstions to DeltaTimeUtils class and add test

* More comprehensive split tests.

* Some test renames.

* Cleanup and update instructions.

* add pruneSchema() optimization for table scans.

* Oops, missed the parquet files.

* Update default table and rename schema constants.

* Test setup and misc changes.

* Add class loader logic as the context class loader is unaware about extension classes

* change some table client creation logic.

* Add hadoop-aws, hadoop-common and related exclusions.

* Remove org.apache.hadoop:hadoop-common

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Add entry to .spelling to fix docs static check

---------

Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-01-30 21:53:50 -08:00