Problem
Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation.
Description
Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid.
In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently.
Handle the following Delta complex types:
a. StructType as JSON
b. ArrayType as Java list
c. MapType as Java map
Generate and add a new Delta table complex-types-table that contains the above complex types for testing.
Update the tests to include a parameterized test with complex-types-table, with the expectations defined in ComplexTypesDeltaTable.java.
Upstream release: https://github.com/delta-io/delta/releases/tag/v3.2.0
- Upgrade kernel dependency to 3.2.0
- Notable breaking changes introduced in upstream that affects the Druid extension:
- Rename TableClient -> Engine
- Rename DefaultTableClient -> DefaultEngine
- Exceptions moved to a separate package
- Table.getPath() doesn't throw TableNotFoundException. Instead the exception is thrown
when getting snapshot info from the Table object
* Delta Lake support for filters.
* Updates
* cleanup comments
* Docs
* Remmove Enclosed runner
* Rename
* Cleanup test
* Serde test for the Delta input source and fix jackson annotation.
* Updates and docs.
* Update error messages to be clearer
* Fixes
* Handle NumberFormatException to provide a nicer error message.
* Apply suggestions from code review
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
* Doc fixes based on feedback
* Yes -> yes in docs; reword slightly.
* Update docs/ingestion/input-sources.md
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
* Update docs/ingestion/input-sources.md
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
* Documentation, javadoc and more updates.
* Not with an or expression end-to-end test.
* Break up =, >, >=, <, <= into its own types instead of sub-classing.
---------
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
This patch bumps Delta Lake Kernel dependency from 3.0.0 to 3.1.0, which released last week - please see https://github.com/delta-io/delta/releases/tag/v3.1.0 for release notes.
There were a few "breaking" API changes in 3.1.0, you can find the rationale for some of those changes here.
Next-up in this extension: add and expose filter predicates.
* something
* test commit
* compilation fix
* more compilation fixes (fixme placeholders)
* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake
Will need to sort out the dependencies later.
* checkpoint
* remove snapshot schema since we can get schema from the row
* iterator bug fix
* json json json
* sampler flow
* empty impls for read(InputStats) and sample()
* conversion?
* conversion, without timestamp
* Web console changes to show Delta Lake
* Asset bug fix and tile load
* Add missing pieces to input source info, etc.
* fix stuff
* Use a different delta lake asset
* Delta lake extension dependencies
* Cleanup
* Add InputSource, module init and helper code to process delta files.
* Test init
* Checkpoint changes
* Test resources and updates
* some fixes
* move to the correct package
* More tests
* Test cleanup
* TODOs
* Test updates
* requirements and javadocs
* Adjust dependencies
* Update readme
* Bump up version
* fixup typo in deps
* forbidden api and checkstyle checks
* Trim down dependencies
* new lines
* Fixup Intellij inspections.
* Add equals() and hashCode()
* chain splits, intellij inspections
* review comments and todo placeholder
* fix up some docs
* null table path and test dependencies. Fixup broken link.
* run prettify
* Different test; fixes
* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests
* yank the old test resource.
* add a couple of sad path tests
* Updates to readme based on latest.
* Version support
* Extract Delta DateTime converstions to DeltaTimeUtils class and add test
* More comprehensive split tests.
* Some test renames.
* Cleanup and update instructions.
* add pruneSchema() optimization for table scans.
* Oops, missed the parquet files.
* Update default table and rename schema constants.
* Test setup and misc changes.
* Add class loader logic as the context class loader is unaware about extension classes
* change some table client creation logic.
* Add hadoop-aws, hadoop-common and related exclusions.
* Remove org.apache.hadoop:hadoop-common
* Apply suggestions from code review
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Add entry to .spelling to fix docs static check
---------
Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>