* something
* test commit
* compilation fix
* more compilation fixes (fixme placeholders)
* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake
Will need to sort out the dependencies later.
* checkpoint
* remove snapshot schema since we can get schema from the row
* iterator bug fix
* json json json
* sampler flow
* empty impls for read(InputStats) and sample()
* conversion?
* conversion, without timestamp
* Web console changes to show Delta Lake
* Asset bug fix and tile load
* Add missing pieces to input source info, etc.
* fix stuff
* Use a different delta lake asset
* Delta lake extension dependencies
* Cleanup
* Add InputSource, module init and helper code to process delta files.
* Test init
* Checkpoint changes
* Test resources and updates
* some fixes
* move to the correct package
* More tests
* Test cleanup
* TODOs
* Test updates
* requirements and javadocs
* Adjust dependencies
* Update readme
* Bump up version
* fixup typo in deps
* forbidden api and checkstyle checks
* Trim down dependencies
* new lines
* Fixup Intellij inspections.
* Add equals() and hashCode()
* chain splits, intellij inspections
* review comments and todo placeholder
* fix up some docs
* null table path and test dependencies. Fixup broken link.
* run prettify
* Different test; fixes
* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests
* yank the old test resource.
* add a couple of sad path tests
* Updates to readme based on latest.
* Version support
* Extract Delta DateTime converstions to DeltaTimeUtils class and add test
* More comprehensive split tests.
* Some test renames.
* Cleanup and update instructions.
* add pruneSchema() optimization for table scans.
* Oops, missed the parquet files.
* Update default table and rename schema constants.
* Test setup and misc changes.
* Add class loader logic as the context class loader is unaware about extension classes
* change some table client creation logic.
* Add hadoop-aws, hadoop-common and related exclusions.
* Remove org.apache.hadoop:hadoop-common
* Apply suggestions from code review
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Add entry to .spelling to fix docs static check
---------
Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Possibly stabilize intellij-inspections
* remove `integration-tests-ex/cases` from excluded projects from initial build
* enable ErrorProne's `CheckedExceptionNotThrown` to get earlier errors than intellij-inspections
* fix ddsketch pom.xml
* fix spellcheck
* New: Add DDSketch-Druid extension
- Based off of http://www.vldb.org/pvldb/vol12/p2195-masson.pdf and uses
the corresponding https://github.com/DataDog/sketches-java library
- contains tests for post building and using aggregation/post
aggregation.
- New aggregator: `ddSketch`
- New post aggregators: `quantileFromDDSketch` and
`quantilesFromDDSketch`
* Fixing easy CodeQL warnings/errors
* Fixing docs, and dependencies
Also moved aggregator ids to AggregatorUtil and PostAggregatorIds
* Adding more Docs and better null/empty handling for aggregators
* Fixing docs, and pom version
* DDSketch documentation format and wording
* Faster parsing: reduce String usage, list-based input rows.
Three changes:
1) Reworked FastLineIterator to optionally avoid generating Strings
entirely, and reduce copying somewhat. Benefits the line-oriented
JSON, CSV, delimited (TSV), and regex formats.
2) In the delimited (TSV) format, when the delimiter is a single byte,
split on UTF-8 bytes directly.
3) In CSV and delimited (TSV) formats, use list-based input rows when
the column list is provided upfront by the user.
* Fix style.
* Fix inspections.
* Restore validation.
* Remove fastutil-extra.
* Exception type.
* Fixes for error messages.
* Fixes for null handling.
* Add SpectatorHistogram extension
* Clarify documentation
Cleanup comments
* Use ColumnValueSelector directly
so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram.
When queried as a Number, we're returning the count of entries in the histogram.
* Apply suggestions from code review
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Fix references
* Fix spelling
* Update docs/development/extensions-contrib/spectator-histogram.md
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
---------
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* unpin snakeyaml globally, add suppressions and licenses
* pin snakeyaml in the specific modules that require version 1.x, update licenses and owasp suppression
This removes the pin of the Snakeyaml introduced in: https://github.com/apache/druid/pull/14519
After the updates of io.kubernetes.java-client and io.confluent.kafka-clients, the only uses of the Snakeyaml 1.x are:
- in test scope, transitive dependency of jackson-dataformat-yaml🫙2.12.7
- in compile scope in contrib extension druid-cassandra-storage
- in compile scope in it-tests.
With the dependency version un-pinned, io.kubernetes.java-client and io.confluent.kafka-clients bring Snakeyaml versions 2.0 and 2.2, consequently allowing to build a Druid distribution without the contrib-extension and free of vulnerable Snakeyaml versions.
The PR addresses 2 things:
Add MSQ durable storage connector for GCS
Change GCS client library from the old Google API Client Library to the recommended Google Cloud Client Library. Ref: https://cloud.google.com/apis/docs/client-libraries-explained
This PR updates the tag present in pom.xml to match the druid version in pom.xml
This was last updated in 0da8ffc
It seems to me like this was missed in further Druid version upgrades.
Update of direct dependencies:
* kubernetes java-client to 19.0.0
* docker-java-bom to 3.3.4
In order to update transitive dependencies:
* okio to 3.6.0
* bcjava to 1.76
To address CVES:
- CVE-2023-3635 in okio
- CVE-2023-33201 in bcjava
---------
Co-authored-by: Xavier Léauté <xvrl@apache.org>
The website pom was removed as part of
https://github.com/apache/druid/pull/14411 so we no longer need to
reference it as a module and the profile can be removed.
Dependabot is currently failing trying to look for this module, so
removing it should also fix that.
This change completes the change introduced in #15461
and unifies the version of gson dependency used between all the modules.
gson is used by kubernetes-extension, avro-extensions, ranger-security,
and as a test dependency in several core modules.
---------
Co-authored-by: Xavier Léauté <xl+github@xvrl.net>
Update guava to 32.0.1-jre to address two CVEs: CVE-2020-8908, CVE-2023-2976
This change requires a minor test change to remove assumptions about ordering.
---------
Co-authored-by: Xavier Léauté <xl+github@xvrl.net>
Update multiple dependencies to clear CVEs
Update dropwizard-metrics to 4.2.22 to address GHSA-mm8h-8587-p46h in com.rabbitmq:amqp-client
Update ant to 1.10.14 to resolve GHSA-f62v-xpxf-3v68 GHSA-4p6w-m9wc-c9c9 GHSA-q5r4-cfpx-h6fh GHSA-5v34-g2px-j4fw
Update comomons-compress to resolve GHSA-cgwf-w82q-5jrr
Update jose4j to 0.9.3 to resolve GHSA-7g24-qg88-p43q GHSA-jgvc-jfgh-rjvv
Update kotlin-stdlib to 1.6.0 to resolve GHSA-cqj8-47ch-rvvq and CVE-2022-24329
Recent upgrade of ranger introduced CVE regressions due to outdated elasticsearch components.
Druid-ranger-plugin does not elasticsearch components , and they have been explicitly removed.
Update woodstox-core to 6.4.0 to address GHSA-3f7h-mf4q-vrm4
Upgrade Jackson to version 2.12.7.1 to address CVE-2022-42003, CVE-2022-42004 which affects jackson-databind.
Upgrade com.google.code.gson:gson from 2.2.4 to the latest version (2.10.1) since 2.2.4 is affected by CVE-2022-25647.
* update confluent's dependencies to common, supported version
Update io.confluent.* dependencies to common, updated version 6.2.12
currently used versions are EOL
* move version definition to the top level pom
Update jetty dependencies version to 9.4.53.v20231009
Update netty4 dependencies version to 4.1.100.Final to resolve CVE-2023-4586 (Netty-handler does not validate host names by default)
This change updates dependencies as needed and fixes tests to remove code incompatible with Java 21
As a result all unit tests now pass with Java 21.
* update maven-shade-plugin to 3.5.0 and follow-up to #15042
* explain why we need to override configuration when specifying outputFile
* remove configuration from dependency management in favor of explicit overrides in each module.
* update to mockito to 5.5.0 for Java 21 support when running with Java 11+
* continue using latest mockito 4.x (4.11.0) when running with Java 8
* remove need to mock private fields
* exclude incorrectly declared mockito dependency from pac4j-oidc
* remove mocking of ByteBuffer, since sealed classes can no longer be mocked in Java 21
* add JVM options workaround for system-rules junit plugin not supporting Java 18+
* exclude older versions of byte-buddy from assertj-core
* fix for Java 19 changes in floating point string representation
* fix missing InitializedNullHandlingTest
* update easymock to 5.2.0 for Java 21 compatibility
* update animal-sniffer-plugin to 1.23
* update nl.jqno.equalsverifier to 3.15.1
* update exec-maven-plugin to 3.1.0
This PR updates the library used for Memcached client to AWS Elasticache Client : https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-java
This enables us to use the option of encrypting data in transit:
Amazon ElastiCache for Memcached now supports encryption of data in transit
For clusters running the Memcached engine, ElastiCache supports Auto Discovery—the ability for client programs to automatically identify all of the nodes in a cache cluster, and to initiate and maintain connections to all of these nodes.
Benefits of Auto Discovery - Amazon ElastiCache
AWS has forked spymemcached 2.12.1, and has since added all the patches included in 2.12.2 and 2.12.3 as part of the 1.2.0 release. So, this can now be considered as an equivalent drop-in replacement.
GitHub - awslabs/aws-elasticache-cluster-client-memcached-for-java: Amazon ElastiCache Cluster Client for Java - enhanced library to connect to ElastiCache clusters.
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticache/AmazonElastiCacheClient.html#AmazonElastiCacheClient--
How to enable TLS with Elasticache
On server side:
https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/in-transit-encryption-mc.html#in-transit-encryption-enable-existing-mc
On client side:
GitHub - awslabs/aws-elasticache-cluster-client-memcached-for-java: Amazon ElastiCache Cluster Client for Java - enhanced library to connect to ElastiCache clusters.
* disable parallel builds; enable batch mode to get rid of transfer progress
* restore .m2 from setup-java if not found
* some change to sql
* add ws
* fix quote
* fix quote
* undo querytest change
* nullhandling in mvtest
* init more
* skip commitid plugin
* add-back 1.0C to build ; remove redundant skip-s from copy-resources; add comment
Upgrade maven shade plugin to try to fix build failures
Sometimes we get maven shade errors in our integ tests becasue we don't run clean in between runs to clear the cache in order to speed them up. This can lead to the below error.
Error: Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:3.2.4:shade (opentelemetry-extension) on project opentelemetry-emitter: Error creating shaded jar: duplicate entry: META-INF/services/org.apache.druid.opentelemetry.shaded.io.grpc.NameResolverProvider
See: https://issues.apache.org/jira/projects/MSHADE/issues/MSHADE-425?filter=allissues
An example run that failed: https://github.com/apache/druid/actions/runs/6301662092/job/17117142375?pr=14887
According to the ticket this is fixed by updating shade to 3.4.1.
When I updated to 3.4.1 I kept running into a different issue during static checks. (Caused by: java.lang.NoClassDefFoundError: com/github/rvesse/airline/parser/errors/ParseException)
I had to add the createDependencyReducedPom: false to get the build to pass.
The dependency reduced pom feature was added in 3.3.0 which we were not using before so setting it explicitly to false should not be a issue. https://issues.apache.org/jira/browse/MSHADE-36)
* update RoaringBitmap to 0.9.49
update RoaringBitmap from 0.9.0 to 0.9.49
Many optimizations and improvements have gone into recent releases of
RoaringBitmap. It seems worthwhile to incorporate those.
* implement workaround for BatchIterator interface change
* add test case for BatchIteratorAdapter.advanceIfNeeded
Currently Druid is using google apis client 1.26.0 version and google-oauth-client-1.26.0.jar in particular is bringing following CVEs CVE-2020-7692, CVE-2021-22573. Despite the CVEs being false positives, they're causing red security scans on Druid distribution. Hence updating the version to latest version with these CVE fixes.
This patch fixes a few issues toward #14858
1. some phony classes were added to enable maven to track the compilation of those classes
2. cyclonedx 2.7.9 seem to handle incremental compilation better; it had a PR relating to that
3. needed to update root pom to 25
4. update antlr to 4.5.3 older one didn't really worked incrementally; 4.5.3 works much better
Currently, Druid is using Guava 16.0.1 version. This upgrade to 31.1-jre fixes the following issues.
CVE-2018-10237 (Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows remote attackers to conduct denial of service attacks against servers that depend on this library and deserialize attacker-provided data because the AtomicDoubleArray class (when serialized with Java serialization) and the CompoundOrdering class (when serialized with GWT serialization) perform eager allocation without appropriate checks on what a client has sent and whether the data size is reasonable). We don't use Java or GWT serializations. Despite being false positive they're causing red security scans on Druid distribution.
Latest version of google-client-api is incompatible with the existing Guava version. This PR unblocks Update google client apis to latest version #14414
* Add new configurable buffer period to create gap between mark unused and kill of segment
* Changes after testing
* fixes and improvements
* changes after initial self review
* self review changes
* update sql statement that was lacking last_used
* shore up some code in SqlMetadataConnector after self review
* fix derby compatibility and improve testing/docs
* fix checkstyle violations
* Fixes post merge with master
* add some unit tests to improve coverage
* ignore test coverage on new UpdateTools cli tool
* another attempt to ignore UpdateTables in coverage check
* change column name to used_flag_last_updated
* fix a method signature after column name switch
* update docs spelling
* Update spelling dictionary
* Fixing up docs/spelling and integrating altering tasks table with my alteration code
* Update NULL values for used_flag_last_updated in the background
* Remove logic to allow segs with null used_flag_last_updated to be killed regardless of bufferPeriod
* remove unneeded things now that the new column is automatically updated
* Test new background row updater method
* fix broken tests
* fix create table statement
* cleanup DDL formatting
* Revert adding columns to entry table by default
* fix compilation issues after merge with master
* discovered and fixed metastore inserts that were breaking integration tests
* fixup forgotten insert by using pattern of sharing now timestamp across columns
* fix issue introduced by merge
* fixup after merge with master
* add some directions to docs in the case of segment table validation issues