Commit Graph

18 Commits

Author SHA1 Message Date
Atul Mohan c1f8ae25b5
Support Iceberg ingestion from REST based catalogs (#17124)
Adds support to the iceberg input source to read from Iceberg REST Catalogs.
2024-09-23 22:13:24 -07:00
Victoria Lim 2e2f3cf66a
docs: Refresh docs for SQL input source (#17031)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-09-16 15:52:37 -07:00
Abhishek Radhakrishnan aa833a711c
Support for reading Delta Lake table snapshots (#17004)
Problem
Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation.

Description
Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid.

In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently.
2024-09-09 14:12:48 +05:30
Jill Osborne b4d83a86c2
Middle Manager wording update in docs (#17005) 2024-09-05 10:25:30 -07:00
Andreas Maechler ae70e18bc8
docs: Update Azure extension (#16585)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-06-20 09:31:29 -07:00
Atul Mohan b53d75758f
IcebergInputSource : Add option to toggle case sensitivity while reading columns from iceberg catalog (#16496)
* Toggle case sensitivity while reading columns from iceberg

* Fix tests

* Drop case check and set unconditionally
2024-05-31 10:18:52 -07:00
George Shiqi Wu b3b62ac431
Update azure input source docs (#16508)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-05-29 10:00:46 -07:00
Abhishek Radhakrishnan 1d7595f3f7
Support for filters in the Druid Delta Lake connector (#16288)
* Delta Lake support for filters.

* Updates

* cleanup comments

* Docs

* Remmove Enclosed runner

* Rename

* Cleanup test

* Serde test for the Delta input source and fix jackson annotation.

* Updates and docs.

* Update error messages to be clearer

* Fixes

* Handle NumberFormatException to provide a nicer error message.

* Apply suggestions from code review

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Doc fixes based on feedback

* Yes -> yes in docs; reword slightly.

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Documentation, javadoc and more updates.

* Not with an or expression end-to-end test.

* Break up =, >, >=, <, <= into its own types instead of sub-classing.

---------

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
2024-04-29 11:31:36 -07:00
Atul Mohan 2e46a98024
Add range filtering support for iceberg ingestion (#15782)
* Add range filtering support for iceberg ingestion

* Docs formatting

* Spelling
2024-02-01 23:32:30 -08:00
Aru Raghuwanshi 223f29d64c
Update input-sources.md for fixing the warehouse path example under S3 (#15823) 2024-02-01 23:32:05 -08:00
Abhishek Radhakrishnan 9f95a691f7
Extension to read and ingest Delta Lake tables (#15755)
* something

* test commit

* compilation fix

* more compilation fixes (fixme placeholders)

* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake

Will need to sort out the dependencies later.

* checkpoint

* remove snapshot schema since we can get schema from the row

* iterator bug fix

* json json json

* sampler flow

* empty impls for read(InputStats) and sample()

* conversion?

* conversion, without timestamp

* Web console changes to show Delta Lake

* Asset bug fix and tile load

* Add missing pieces to input source info, etc.

* fix stuff

* Use a different delta lake asset

* Delta lake extension dependencies

* Cleanup

* Add InputSource, module init and helper code to process delta files.

* Test init

* Checkpoint changes

* Test resources and updates

* some fixes

* move to the correct package

* More tests

* Test cleanup

* TODOs

* Test updates

* requirements and javadocs

* Adjust dependencies

* Update readme

* Bump up version

* fixup typo in deps

* forbidden api and checkstyle checks

* Trim down dependencies

* new lines

* Fixup Intellij inspections.

* Add equals() and hashCode()

* chain splits, intellij inspections

* review comments and todo placeholder

* fix up some docs

* null table path and test dependencies. Fixup broken link.

* run prettify

* Different test; fixes

* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests

* yank the old test resource.

* add a couple of sad path tests

* Updates to readme based on latest.

* Version support

* Extract Delta DateTime converstions to DeltaTimeUtils class and add test

* More comprehensive split tests.

* Some test renames.

* Cleanup and update instructions.

* add pruneSchema() optimization for table scans.

* Oops, missed the parquet files.

* Update default table and rename schema constants.

* Test setup and misc changes.

* Add class loader logic as the context class loader is unaware about extension classes

* change some table client creation logic.

* Add hadoop-aws, hadoop-common and related exclusions.

* Remove org.apache.hadoop:hadoop-common

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Add entry to .spelling to fix docs static check

---------

Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-01-30 21:53:50 -08:00
Benjamin Hopp 6177f6efd7
Fixing formatting of Iceberg Catalog Object (#15748) 2024-01-30 20:17:38 -08:00
George Shiqi Wu 3e512249e3
Azure multi read options (#15630)
* Include new dependencies

* Mostly implemented

* More azure fixes

* Tests passing

* Unit tests running

* Test running after removing storage exception

* Happy with coverage now

* Add more tests

* fix client factory

* cleanup from testing

* Remove old client

* update docs

* Exclude from spellcheck

* Add licenses

* Fix identity version

* Save work

* Add azure clients

* add licenses

* typos

* Add dependencies

* Exception is not thrown

* Fix intellij check

* Don't need to override

* specify length

* urldecode

* encode path

* Fix checks

* Revert urlencode changes

* Urlencode with azure library

* Update docs/development/extensions-core/azure.md

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* PR changes

* Update docs/development/extensions-core/azure.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Add config for multiple storage accounts

* Deprecate AzureTaskLogsConfig.maxRetries

* Clean up azure retry block

* logic update to reuse clients

* fix comments

* Create container conditionally

* Fix key auth

* save work

* Fix unit tests

* Revert old azure input type

* Separate input source

* save work

* Add support for app registrations

* Fix unit tests

* clean up spacing

* Add coverage

* fixes from testing

* cleanup some caching behavior

* Add docs

* Fix spelling issues

* fix more spelling errors'

* Fix intellij inspections

* add simple changes from pr

* save work on fixing bug

* Fix unit tests

* Add more testing

* Fix unit test

* Add tests

* Add annotation for azureStorage

* Fix up docs

* Add comment for list method

* Fix tests

* Remove uneeded toString

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* PR changes

* fix injection of StorageConnector

* Fix checkstyle

* clean up unit tests

* More pr fixes

---------

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-01-25 13:29:16 -05:00
Atul Mohan a2914789d7
Add support for ingesting older iceberg snapshots (#15348)
This patch introduces a param snapshotTime in the iceberg inputsource spec that allows the user to ingest data files associated with the most recent snapshot as of the given time. This helps the user ingest data based on older snapshots by specifying the associated snapshot time.
This patch also upgrades the iceberg core version to 1.4.1
2023-11-17 12:32:28 +05:30
Gian Merlino d87d92bc43
Add system fields to input sources. (#15276)
* Add system fields to input sources.

Main changes:

1) The SystemField enum defines system fields "__file_uri", "__file_path",
   and "__file_bucket". They are associated with each input entity.

2) The SystemFieldInputSource interface can be added to any InputSource
   to make it system-field-capable. It sets up serialization of a list
   of configured "systemFields" in the JSON form of the input source, and
   provides a method getSystemFieldValue for computing the value of each
   system field. Cloud object, HDFS, HTTP, and Local now have this.

* Fix various LocalInputSource calls.

* Fix style stuff.

* Fixups.

* Fix tests and coverage.
2023-11-02 10:31:28 -07:00
317brian 6b4dda964d
Docusaurus2 upgrade for master (#14411)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-08-16 19:01:21 -07:00
Atul Mohan 03d6d395a0
Extension to read and ingest iceberg data files (#14329)
This adds a new contrib extension: druid-iceberg-extensions which can be used to ingest data stored in Apache Iceberg format. It adds a new input source of type iceberg that connects to a catalog and retrieves the data files associated with an iceberg table and provides these data file paths to either an S3 or HDFS input source depending on the warehouse location.

Two important dependencies associated with Apache Iceberg tables are:

Catalog : This extension supports reading from either a Hive Metastore catalog or a Local file-based catalog. Support for AWS Glue is not available yet.
Warehouse : This extension supports reading data files from either HDFS or S3. Adapters for other cloud object locations should be easy to add by extending the AbstractInputSourceAdapter.
2023-07-18 08:59:57 +05:30
Katya Macedo 269137c682
Update Ingestion section (#14023)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>
2023-05-19 09:42:27 -07:00