index_realtime tasks were removed from the documentation in #13107. Even
at that time, they weren't really documented per se, just mentioned. They
existed solely to support Tranquility, an obsolete ingestion
method that predates Druid's migration to the ASF and is no longer
maintained.
maintained. Tranquility docs were also de-linked from the sidebars and
the other doc pages in #11134. Only a stub remains, so people with
links to the page can see that it's no longer recommended.
index_realtime_appenderator tasks existed in the code base, but were
never documented nor, as far as I am aware, used for any purpose.
This patch removes both task types completely, along with all
supporting code that was otherwise unused. It also updates the stub
doc for Tranquility to state more firmly that it is not compatible. (Previously,
the stub doc said it wasn't recommended, and pointed out that it is
built against an ancient 0.9.2 version of Druid.)
ITUnionQueryTest has been migrated to the new integration tests framework and updated to use Kafka ingestion.
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Fixes a few minor issues with scripts.
- Add additional information, since it was confusing and not clear that the number was the ID from GitHub and not just the major version number.
- Fix an issue where the milestone displayed in an output message was the milestone supplied as an argument, instead of the milestone the PR is already tagged against in GitHub, as returned by the sent request.
* specify node type so that the log filename can get resolved
* Update distribution/docker/druid.sh
Co-authored-by: Benedict Jin <asdf2014@apache.org>
---------
Co-authored-by: Benedict Jin <asdf2014@apache.org>
This PR creates symlinks when duplicate jars are present in an extension. The Docker image includes contrib extensions too, and its size has bloated up quite a lot of late. This change also fixes the ITNestedQueryPushDownTest integration test.
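As an illustration of the deduplication idea only (the names, layout, and limit here are hypothetical, not the distribution tooling's actual code), a sketch that replaces byte-identical duplicate jars with relative symlinks might look like this:

```java
// Illustrative sketch: keep one copy of each identically named, byte-identical jar
// and replace the duplicates with relative symlinks.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class JarDeduper
{
  public static void dedupe(Path extensionsDir) throws IOException
  {
    List<Path> jars;
    try (Stream<Path> stream = Files.walk(extensionsDir)) {
      jars = stream.filter(p -> p.toString().endsWith(".jar")).collect(Collectors.toList());
    }
    Map<String, Path> firstCopyByName = new HashMap<>();
    for (Path jar : jars) {
      String name = jar.getFileName().toString();
      Path existing = firstCopyByName.get(name);
      if (existing == null) {
        firstCopyByName.put(name, jar);
      } else if (Files.mismatch(existing, jar) == -1L) {
        // Same file name and identical content: delete the duplicate and symlink it
        // to the first copy, so the image only carries one physical copy.
        Files.delete(jar);
        Files.createSymbolicLink(jar, jar.getParent().relativize(existing));
      }
    }
  }
}
```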
* Update the group id to org.apache.druid.extensions.contrib for contrib exts.
* Note iceberg and delta lake extensions in extensions.md
* properties and shell backticks
* Update groupId in distribution/pom.xml
* remove delta-lake from dist.
* Add note on downloading extension.
* something
* test commit
* compilation fix
* more compilation fixes (fixme placeholders)
* Comment out druid-kerberos build since it conflicts with newly added transitive deps from delta-lake
Will need to sort out the dependencies later.
* checkpoint
* remove snapshot schema since we can get schema from the row
* iterator bug fix
* json json json
* sampler flow
* empty impls for read(InputStats) and sample()
* conversion?
* conversion, without timestamp
* Web console changes to show Delta Lake
* Asset bug fix and tile load
* Add missing pieces to input source info, etc.
* fix stuff
* Use a different delta lake asset
* Delta lake extension dependencies
* Cleanup
* Add InputSource, module init and helper code to process delta files.
* Test init
* Checkpoint changes
* Test resources and updates
* some fixes
* move to the correct package
* More tests
* Test cleanup
* TODOs
* Test updates
* requirements and javadocs
* Adjust dependencies
* Update readme
* Bump up version
* fixup typo in deps
* forbidden api and checkstyle checks
* Trim down dependencies
* new lines
* Fixup Intellij inspections.
* Add equals() and hashCode()
* chain splits, intellij inspections
* review comments and todo placeholder
* fix up some docs
* null table path and test dependencies. Fixup broken link.
* run prettify
* Different test; fixes
* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests
* yank the old test resource.
* add a couple of sad path tests
* Updates to readme based on latest.
* Version support
* Extract Delta DateTime conversions to DeltaTimeUtils class and add test
* More comprehensive split tests.
* Some test renames.
* Cleanup and update instructions.
* add pruneSchema() optimization for table scans.
* Oops, missed the parquet files.
* Update default table and rename schema constants.
* Test setup and misc changes.
* Add class loader logic, as the context class loader is unaware of extension classes
* change some table client creation logic.
* Add hadoop-aws, hadoop-common and related exclusions.
* Remove org.apache.hadoop:hadoop-common
* Apply suggestions from code review
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Add entry to .spelling to fix docs static check
---------
Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* New: Add DDSketch-Druid extension
- Based on http://www.vldb.org/pvldb/vol12/p2195-masson.pdf and uses
the corresponding https://github.com/DataDog/sketches-java library
- contains tests for building and using the aggregation/post aggregation.
- New aggregator: `ddSketch`
- New post aggregators: `quantileFromDDSketch` and
`quantilesFromDDSketch` (see the sketch at the end of this entry)
* Fixing easy CodeQL warnings/errors
* Fixing docs, and dependencies
Also moved aggregator ids to AggregatorUtil and PostAggregatorIds
* Adding more Docs and better null/empty handling for aggregators
* Fixing docs, and pom version
* DDSketch documentation format and wording
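For context on what the `ddSketch` aggregator approximates, here is a minimal, self-contained sketch of the DDSketch construction from the paper linked above: values go into log-spaced buckets, so quantile estimates carry a bounded relative error. This is illustrative only; the class and method names are made up and this is not the extension's implementation.

```java
// Toy DDSketch: relative-accuracy quantile estimates from log-spaced buckets.
import java.util.Map;
import java.util.TreeMap;

public class SimpleDdSketch
{
  private final double gamma;      // bucket growth factor: (1 + alpha) / (1 - alpha)
  private final double logGamma;
  private final TreeMap<Integer, Long> buckets = new TreeMap<>();
  private long count;

  public SimpleDdSketch(double relativeAccuracy)
  {
    this.gamma = (1 + relativeAccuracy) / (1 - relativeAccuracy);
    this.logGamma = Math.log(gamma);
  }

  public void add(double value)
  {
    // Bucket i covers (gamma^(i-1), gamma^i]; this toy version handles positive values only.
    int index = (int) Math.ceil(Math.log(value) / logGamma);
    buckets.merge(index, 1L, Long::sum);
    count++;
  }

  public double quantileEstimate(double q)
  {
    long rank = (long) (q * (count - 1));
    long seen = 0;
    for (Map.Entry<Integer, Long> bucket : buckets.entrySet()) {
      seen += bucket.getValue();
      if (seen > rank) {
        // Midpoint-style estimate, within the configured relative accuracy of the true value.
        return 2 * Math.pow(gamma, bucket.getKey()) / (gamma + 1);
      }
    }
    throw new IllegalStateException("empty sketch");
  }
}
```

For example, `new SimpleDdSketch(0.01)` filled with values and queried via `quantileEstimate(0.99)` returns an estimate within roughly 1% of the true 99th percentile.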
* Add SpectatorHistogram extension
* Clarify documentation
Cleanup comments
* Use ColumnValueSelector directly
so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram.
When queried as a Number, we return the count of entries in the histogram (see the sketch at the end of this entry).
* Apply suggestions from code review
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Fix references
* Fix spelling
* Update docs/development/extensions-contrib/spectator-histogram.md
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
---------
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
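To illustrate the "queried as a Number" behavior described above, here is a toy sketch of a histogram value whose numeric view is its entry count. It is not the SpectatorHistogram extension's actual code; the names are hypothetical.

```java
// Toy histogram value: histogram-aware queries see the bucket distribution,
// while numeric aggregators (e.g. longSum) see the number of recorded entries.
import java.util.HashMap;
import java.util.Map;

public class ToyHistogramValue
{
  private final Map<Integer, Long> bucketCounts = new HashMap<>();

  public void record(int bucket)
  {
    bucketCounts.merge(bucket, 1L, Long::sum);
  }

  /** What a histogram-aware query sees: the full bucket distribution. */
  public Map<Integer, Long> asHistogram()
  {
    return bucketCounts;
  }

  /** What a numeric aggregator sees: the count of entries across all buckets. */
  public long asLong()
  {
    return bucketCounts.values().stream().mapToLong(Long::longValue).sum();
  }
}
```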
This PR addresses two things:
Add MSQ durable storage connector for GCS
Change GCS client library from the old Google API Client Library to the recommended Google Cloud Client Library. Ref: https://cloud.google.com/apis/docs/client-libraries-explained
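For reference, object writes and reads with the Google Cloud Client Library (com.google.cloud:google-cloud-storage, the library this change migrates to) look roughly like the sketch below; the bucket and object names are placeholders.

```java
// Minimal GCS write/read using the Google Cloud Client Library and
// application-default credentials. Bucket and object names are placeholders.
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;

public class GcsExample
{
  public static void main(String[] args)
  {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobId blobId = BlobId.of("my-bucket", "druid/durable-storage/example-object");
    BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();

    // Write an object, then read it back.
    storage.create(blobInfo, "hello".getBytes(StandardCharsets.UTF_8));
    byte[] contents = storage.readAllBytes(blobId);
    System.out.println(new String(contents, StandardCharsets.UTF_8));
  }
}
```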
- The licenses file contains several licenses for outdated libraries. This PR removes licenses for components that are no longer used.
This change is purely cosmetic / cleans up the license database.
The candidates were identified by reviewing the output of the license check script and comparing it against the dependency tree.
- Minor fix to the license check tool to fail more gracefully when the license of a used dependency is not listed as known, and to not fail on multi-licensed components when at least one of the licenses is accepted.
---------
Co-authored-by: Xavier Léauté <xl+github@xvrl.net>
This change fixes an issue where passing too large a task payload to the MM-less task runner causes the peon to fail to start up, because the payload is passed (compressed) as an environment variable (TASK_JSON). On Linux systems the limit for an environment variable is commonly 128 KB; on Windows systems it is less. Setting an environment variable longer than this results in a bunch of "Argument list too long" errors.
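A hypothetical sketch of the size concern described above; the limit constant, class, and method names are illustrative and not the task runner's actual code.

```java
// Check whether a gzip-compressed, base64-encoded task spec still fits within a
// typical environment-variable size limit (~128 KB on Linux); if not, the payload
// has to reach the peon some other way.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class TaskPayloadCheck
{
  private static final int ENV_VAR_LIMIT_BYTES = 128 * 1024;

  public static boolean fitsInEnvVar(String taskJson) throws IOException
  {
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
      gzip.write(taskJson.getBytes(StandardCharsets.UTF_8));
    }
    String encoded = Base64.getEncoder().encodeToString(compressed.toByteArray());
    return encoded.length() <= ENV_VAR_LIMIT_BYTES;
  }
}
```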
This adds a new contrib extension, druid-iceberg-extensions, which can be used to ingest data stored in the Apache Iceberg format. It adds a new input source of type iceberg that connects to a catalog, retrieves the data files associated with an Iceberg table, and provides these data file paths to either an S3 or HDFS input source, depending on the warehouse location.
Two important dependencies associated with Apache Iceberg tables are:
Catalog: this extension supports reading from either a Hive Metastore catalog or a local file-based catalog. Support for AWS Glue is not available yet.
Warehouse: this extension supports reading data files from either HDFS or S3. Adapters for other cloud object stores should be easy to add by extending the AbstractInputSourceAdapter.
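As a rough illustration of that flow, the Apache Iceberg Java API can list a table's data file paths as sketched below. This is not the extension's code; catalog construction is elided and the surrounding class and method names are illustrative.

```java
// List the data file paths of a catalog-managed Iceberg table; these are the
// warehouse locations (HDFS or S3) that get handed to the delegate input source.
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.io.CloseableIterable;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class IcebergDataFileLister
{
  public static List<String> dataFilePaths(Catalog catalog, String db, String table) throws IOException
  {
    Table icebergTable = catalog.loadTable(TableIdentifier.of(db, table));
    List<String> paths = new ArrayList<>();
    try (CloseableIterable<FileScanTask> tasks = icebergTable.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        paths.add(task.file().path().toString());
      }
    }
    return paths;
  }
}
```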
* Claim full support for Java 17.
No production code has changed, except the startup scripts.
Changes:
1) Allow Java 17 without DRUID_SKIP_JAVA_CHECK.
2) Include the full list of opens and exports on both Java 11 and 17 (see the sketch at the end of this entry).
3) Document that Java 17 is both supported and preferred.
4) Switch some tests from Java 11 to 17 to get better coverage on the
preferred version.
* Doc update.
* Update errorprone.
* Update docker_build_containers.sh.
* Update errorprone in licenses.yaml.
* Add some more run-javas.
* Additional run-javas.
* Update errorprone.
* Suppress new errorprone error.
* Add exports and opens in ForkingTaskRunner for Java 11+.
Test, doc changes.
* Additional errorprone updates.
* Update for errorprone.
* Restore old formatting in LdapCredentialsValidator.
* Copy bin/ too.
* Fix Java 15, 17 build line in docker_build_containers.sh.
* Update busybox image.
* One more java command.
* Fix interpolation.
* IT commandline refinements.
* Switch to busybox 1.34.1-glibc.
* POM adjustments, build and test one IT on 17.
* Additional debugging.
* Fix silly thing.
* Adjust command line.
* Add exports and opens one more place.
* Additional harmonization of strong encapsulation parameters.
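To illustrate the kind of strong-encapsulation parameters involved, here is a sketch of building `--add-opens` / `--add-exports` arguments for forked JVMs on Java 11+. The specific modules and packages listed are illustrative, not the exact set Druid adds.

```java
// Build illustrative --add-exports / --add-opens JVM arguments for Java 11+.
import java.util.ArrayList;
import java.util.List;

public class StrongEncapsulationArgs
{
  public static List<String> jvmArgs()
  {
    List<String> args = new ArrayList<>();
    if (Runtime.version().feature() >= 11) {
      args.add("--add-exports=java.base/jdk.internal.misc=ALL-UNNAMED");
      args.add("--add-opens=java.base/java.lang=ALL-UNNAMED");
      args.add("--add-opens=java.base/java.io=ALL-UNNAMED");
      args.add("--add-opens=java.base/java.nio=ALL-UNNAMED");
      args.add("--add-opens=java.base/sun.nio.ch=ALL-UNNAMED");
    }
    return args;
  }
}
```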
Added a new monitor, SysMonitorOshi, to replace SysMonitor. The new monitor has wider support for different machine architectures, including ARM instances. Please switch to SysMonitorOshi, as SysMonitor is now deprecated and will be removed in future releases.
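For context, SysMonitorOshi is based on the OSHI library (com.github.oshi:oshi-core), which reads system stats in a cross-platform way rather than parsing /proc. A minimal usage sketch follows; the printed names are illustrative, not the monitor's emitted metric names.

```java
// Read basic system stats via OSHI, the library behind SysMonitorOshi.
import oshi.SystemInfo;
import oshi.hardware.GlobalMemory;
import oshi.hardware.HardwareAbstractionLayer;

public class OshiExample
{
  public static void main(String[] args)
  {
    SystemInfo systemInfo = new SystemInfo();
    HardwareAbstractionLayer hardware = systemInfo.getHardware();
    GlobalMemory memory = hardware.getMemory();

    System.out.println("mem/total = " + memory.getTotal());
    System.out.println("mem/available = " + memory.getAvailable());
    System.out.println("cpu/logicalCount = " + hardware.getProcessor().getLogicalProcessorCount());
  }
}
```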
Hadoop 2 often causes red security scans on the Druid distribution because of the dependencies it brings. We want to move away from Hadoop 2 and make a Hadoop 3 distribution available. This switches Druid to building with Hadoop 3 by default. Druid will still be compatible with Hadoop 2, and users can build a Hadoop 2-compatible distribution using the hadoop2 profile.
You can now add additional configuration files to be copied to the final conf directory on startup when running in a containerized environment. This is useful when running on Kubernetes and needing to add more files via a ConfigMap. To specify the path where the ConfigMap is mounted, use the DRUID_ADDITIONAL_CONF_DIR environment variable.
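Illustratively, the described behavior amounts to the copy sketched below; the actual logic lives in the container entrypoint script, and the paths and names here are placeholders.

```java
// Copy every file under DRUID_ADDITIONAL_CONF_DIR into the runtime conf directory.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AdditionalConfCopier
{
  public static void copyAdditionalConf(Path confDir) throws IOException
  {
    String additional = System.getenv("DRUID_ADDITIONAL_CONF_DIR");
    if (additional == null) {
      return;
    }
    List<Path> files;
    try (Stream<Path> stream = Files.list(Path.of(additional))) {
      files = stream.collect(Collectors.toList());
    }
    for (Path file : files) {
      Files.copy(file, confDir.resolve(file.getFileName()), StandardCopyOption.REPLACE_EXISTING);
    }
  }
}
```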
Druid catalog basics
Catalog object model for tables, columns
Druid metadata DB storage (as an extension)
REST API to update the catalog (as an extension)
Integration tests
Model only: no planner integration yet
* Support for middle manager less druid, tasks launch as k8s jobs
* Fixing forking task runner test
* Test cleanup, dependency cleanup, intellij inspections cleanup
* Changes per PR review
Add configuration option to disable http/https proxy for the k8s client
Update the docs to provide more detail about sidecar support
* Removing un-needed log lines
* Small changes per PR review
* Upon task completion we call back to the Overlord to update the status/location; for slower k8s clusters, this reduces locking time significantly
* Merge conflict fix
* Fixing tests and docs
* update tiny-cluster.yaml
changed `enableTaskLevelLogPush` to `encapsulatedTask`
* Apply suggestions from code review
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Minor changes per PR request
* Cleanup, adding test to AbstractTask
* Add comment in peon.sh
* Bumping code coverage
* More tests to make code coverage happy
* Doh, a duplicate dependency
* Integration test setup is weird for k8s, will do this in a different PR
* Reverting all integration test changes, will do in another PR
* use StringUtils.base64 instead of Base64
* The JDK is nasty: if I compress in JDK 11, in JDK 17 the decompressed result is different
Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>