Commit Graph

13114 Commits

Author SHA1 Message Date
Kashif Faraz 786e772d26
Remove config `druid.coordinator.compaction.skipLockedIntervals` (#14807)
The value of `druid.coordinator.compaction.skipLockedIntervals` should always be `true`.
2023-08-14 12:31:15 +05:30
Rishabh Singh 0dc305f9e4
Upgrade hibernate validator version to fix CVE-2019-10219 (#14757) 2023-08-14 11:50:51 +05:30
dependabot[bot] e2d2afce46
Bump postgresql from 42.4.1 to 42.6.0 (#13959)
* Bump postgresql from 42.4.1 to 42.6.0

Bumps [postgresql](https://github.com/pgjdbc/pgjdbc) from 42.4.1 to 42.6.0.
- [Release notes](https://github.com/pgjdbc/pgjdbc/releases)
- [Changelog](https://github.com/pgjdbc/pgjdbc/blob/master/CHANGELOG.md)
- [Commits](https://github.com/pgjdbc/pgjdbc/compare/REL42.4.1...REL42.6.0)

---
updated-dependencies:
- dependency-name: org.postgresql:postgresql
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* update licenses.yaml

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xavier Léauté <xvrl@apache.org>
2023-08-12 19:17:00 -04:00
Soumyava afe22907a5
Calcite upgrade 1.35 (#14510)
* Update to Calcite 1.35.0
* Update from.ftl for Calcite 1.35.0.
* Fixed tests in Calcite upgrade by doing the following:
1. Added a new rule, CoreRules.PROJECT_FILTER_TRANSPOSE_WHOLE_PROJECT_EXPRESSIONS, to Base rules
2. Refactored the CorrelateUnnestRule
3. Updated CorrelateUnnestRel accordingly
4. Fixed a case with selector filters on the left where Calcite was eliding the virtual column
5. Additional test cases for fixes in 2,3,4
6. Update to StringListAggregator to fail a query if separators are not propagated appropriately
* Refactored for testcases to pass after the upgrade, introduced 2 new data sources for handling filters and select projects
* Added a literalSqlAggregator because the upgraded Calcite changed the subquery remove rule. This corrects the plans for 2 queries with joins and subqueries by replacing a useless literal dimension with a post agg. Also added a test with COUNT DISTINCT and FILTER that failed with Calcite 1.21 and passes with 1.35
* Updated to latest avatica and updated code as SqlUnknownTimeStamp is now used in Calcite which needs to be resolved to a timestamp literal
* Added a wrapper segment ref to use for unnest and filter segment reference
2023-08-11 12:47:16 -07:00
George Shiqi Wu c8a11702db
Support broadcast segments (#14789) 2023-08-11 11:14:05 -07:00
Vadim Ogievetsky ec28672d07
Web console: allow format picking for download (#14794)
* allow format picking for download

* better popover

* ux review tweaks
2023-08-11 09:43:29 -07:00
Vadim Ogievetsky b0c78ff295
Web console: make retention dialog clearer (#14793)
* make retention dialog clearer

* tweak

* another tweak

* Update web-console/src/dialogs/retention-dialog/retention-dialog.tsx

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* update snapshot for copy

---------

Co-authored-by: Suneet Saldanha <suneet@apache.org>
2023-08-11 09:43:00 -07:00
hqx871 a0234c4e13
Add sampling factor for DeterminePartitionsJob (#13840)
There are two types of DeterminePartitionsJob:
- When the input data is not assumed grouped, there may be duplicate rows.
In this case, two MR jobs are launched. The first one groups the rows to remove duplicates,
and the second one performs a global sort to find the lower and upper bounds for target segments.
- When the input data is assumed grouped, only the global sorting
MR job needs to be launched to find the lower and upper bounds for segments.

Sampling strategy:
- If the input data is assumed grouped, sample randomly at the mapper side of the global sort MR job.
- If the input data is not assumed grouped, sample at the mapper of the group job. Hash the time
and all dimensions and take the result modulo the sampling factor; random sampling is not used because there
may be duplicate rows.
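
The two sampling strategies described above can be sketched roughly as follows. This is only an illustration, not the Hadoop job's actual code: the function names, the row representation (a timestamp plus a dict of dimensions), and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib
import random


def keep_row_hashed(timestamp_ms, dimensions, sampling_factor):
    """Deterministic sampling: identical rows always get the same decision."""
    # Build a stable key from the time and all dimension values (hypothetical row shape).
    key = repr((timestamp_ms, sorted(dimensions.items()))).encode("utf-8")
    # hashlib yields the same digest in every process, unlike the built-in hash().
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return bucket % sampling_factor == 0


def keep_row_random(sampling_factor, rng=random):
    """Random sampling: acceptable when rows are already grouped (no duplicates)."""
    return rng.randrange(sampling_factor) == 0
```

Because the hashed variant depends only on the row contents, every copy of a duplicated row is either kept or dropped together, which is why it suits the ungrouped case.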
2023-08-11 10:42:25 +05:30
Sergio Ferragut 353f7bed7f
Adding data generation pod to jupyter notebooks deployment (#14742)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2023-08-10 15:43:05 -07:00
zachjsh 82d82dfbd6
Add stats to KillUnusedSegments coordinator duty (#14782)
### Description

Added the following metrics, which are calculated by the `KillUnusedSegments` coordinator duty:

`"killTask/availableSlot/count"`: calculates the number of remaining task slots available for auto kill
`"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill
`"killTask/task/count"`: calculates the number of tasks submitted by auto kill.

#### Release note
NEW: metrics added for auto kill

`"killTask/availableSlot/count"`: calculates the number of remaining task slots available for auto kill
`"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill
`"killTask/task/count"`: calculates the number of tasks submitted by auto kill.
2023-08-10 18:36:53 -04:00
zachjsh 23306c4d80
retry when killing s3 based segments (#14776)
### Description

The S3 `deleteObjects` request sent when killing S3-based segments is now retried if the failure is retryable.
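
As a rough illustration of retrying only retryable failures, here is a generic sketch. It does not use the AWS SDK or Druid's retry utilities; `RetryableError`, `call_with_retries`, and the backoff parameters are hypothetical names chosen for the example.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for a transient failure that is worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay_s=0.0):
    """Run operation(), retrying only when it raises RetryableError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Exponential backoff between attempts (no wait when base_delay_s is 0).
            time.sleep(base_delay_s * (2 ** (attempt - 1)))
```

Any other exception type propagates immediately, so permanent failures are not retried.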
2023-08-10 14:04:16 -04:00
Xavier Léauté 37ed0f4a17
Bump jclouds.version from 1.9.1 to 2.0.3 (#14746)
* Updates `org.apache.jclouds:*` from 1.9.1 to 2.0.3
* Pin jclouds to 2.0.x since 2.1.x requires Guava 18+
* replace easymock with mockito

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-08-10 06:24:01 -07:00
Rishabh Singh 4b9846b90f
Improve exception message when DruidLeaderClient doesn't find leader node (#14775)
The existing exception message `No known server` thrown in DruidLeaderClient is unhelpful.
2023-08-10 16:37:37 +05:30
George Shiqi Wu c8537dbeaf
Add lifecycle hooks to KubernetesTaskRunner (#14790) 2023-08-09 21:16:44 -07:00
Vadim Ogievetsky b1988b2f93
Web console: fix result count (#14786)
* fix result count

* fixes
2023-08-09 20:33:01 +05:30
Laksh Singla 8f102f9031
Introduce StorageConnector for Azure (#14660)
The Azure connector is introduced, and MSQ's fault tolerance and durable storage can now be used with Microsoft Azure's blob storage. Also, the newly introduced queries from deep storage can now store and fetch their results in Azure's blob storage.
2023-08-09 12:25:27 +00:00
Tejaswini Bandlamudi a45b25fa1d
Removes support for Hadoop 2 (#14763)
Removing Hadoop 2 support as discussed in https://lists.apache.org/list?dev@druid.apache.org:lte=1M:hadoop
2023-08-09 17:47:52 +05:30
Tejaswini Bandlamudi 550a66d71e
Upgrade jackson-databind to 2.12.7 (#14770)
The current version of jackson-databind is flagged for vulnerabilities CVE-2020-28491 (although the CBOR format is not used in Druid) and CVE-2020-36518 (seems genuine, as deeply nested JSON can cause resource exhaustion). Updating the dependency to the latest version 2.12.7 to fix these vulnerabilities.
2023-08-09 12:22:16 +05:30
Karan Kumar cd817fc469
Fixing typo in `resultsTruncated` (#14779) 2023-08-08 20:51:44 -07:00
Clint Wylie e57f880020
document new filters and stuff (#14760) 2023-08-08 16:01:06 -07:00
Clint Wylie 667e4dab5e
document expression aggregator (#14497) 2023-08-08 15:49:29 -07:00
317brian 8a4dabc431
docs: remove experimental from schema auto-discovery (#14759) 2023-08-08 12:45:44 -07:00
zachjsh 660e6cfa01
Allow for task limit on kill tasks spawned by auto kill coordinator duty (#14769)
### Description

Previously, the `KillUnusedSegments` coordinator duty, in charge of periodically deleting unused segments, could spawn an unlimited number of kill tasks for unused segments. This change adds 2 new coordinator dynamic configs that can be used to control the limit of tasks spawned by this coordinator duty

`killTaskSlotRatio`: Ratio of total available task slots, including autoscaling if applicable, that will be allowed for kill tasks. This limit only applies to kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is 1, which allows all available task slots to be used, which is the existing behavior.

`maxKillTaskSlots`: Maximum number of tasks that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is INT.MAX, which essentially allows for unbounded number of tasks, which is the existing behavior. 

I realize that we could effectively get away with just `killTaskSlotRatio`, but, following the compaction config, which has similar properties, I thought it was good to have some control over the upper limit regardless of the ratio provided.
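
One plausible way the two limits interact is sketched below. How the coordinator actually combines them is an assumption here (ratio applied to the total slots, rounded down, then capped by the maximum); the function name and the reading of "INT.MAX" as Java's `Integer.MAX_VALUE` are also assumptions.

```python
import math

# Assumed meaning of "INT.MAX" in the description: Java's Integer.MAX_VALUE.
INT_MAX = 2**31 - 1


def kill_task_slot_limit(total_task_slots, kill_task_slot_ratio=1.0,
                         max_kill_task_slots=INT_MAX):
    """Hypothetical helper: slots that auto kill may use under both limits."""
    # Share of the cluster's task slots allowed by the ratio, rounded down.
    by_ratio = math.floor(total_task_slots * kill_task_slot_ratio)
    # The absolute cap applies regardless of the ratio.
    return max(0, min(by_ratio, max_kill_task_slots))
```

With the defaults, the result equals the total slot count, which matches the stated existing behavior of allowing all available slots.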

#### Release note
NEW: `killTaskSlotRatio`  and `maxKillTaskSlots` coordinator dynamic config properties added that allow control of task resource usage spawned by `KillUnusedSegments` coordinator task (auto kill)
2023-08-08 08:40:55 -04:00
Clint Wylie 2845b6a424
add new filters to unnest filter pushdown (#14777) 2023-08-08 03:29:18 -07:00
Tejaswini Bandlamudi d0403f00fd
upgrade org.mozilla:rhino (#14765) 2023-08-08 12:17:59 +05:30
Suneet Saldanha 2af0ab2425
Metric to report time spent fetching and analyzing segments (#14752)
* Metric to report time spent fetching and analyzing segments

* fix test

* spell check

* fix tests

* checkstyle

* remove unused variable

* Update docs/operations/metrics.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/metrics.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/metrics.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2023-08-07 18:32:48 -07:00
Abhishek Radhakrishnan bff8f9e12e
Update kinesis docs (#14768)
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2023-08-07 17:08:34 -07:00
Suneet Saldanha b624a4ec4a
Rolling Supervisor restarts at taskDuration (#14396)
* Rolling supervisor task publishing

* add an option for number of task groups to roll over

* better

* remove docs

* oops

* checkstyle

* wip test

* undo partial test change

* remove incomplete test
2023-08-07 16:24:32 -07:00
George Shiqi Wu 14940dc3ed
Add pod name to TaskLocation for easier observability and debugging. (#14758)
* Add pod name to location

* Add log

* fix style

* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Fix unit tests

---------

Co-authored-by: Suneet Saldanha <suneet@apache.org>
2023-08-07 12:33:35 -07:00
Victoria Lim 7d7813372a
Docs: Include EARLIEST_BY and LATEST_BY as supported aggregation functions (#14280) 2023-08-07 09:59:12 -07:00
Adarsh Sanjeev 56ab81f381
Add support for different result formats to MSQ SqlStatementResource (#14571)
* Add support for different result format

* Add tests

* Add tests

* Fix checkstyle

* Remove changes to destination

* Removed some unwanted code

* Address review comments

* Rename parameter

* Fix tests
2023-08-07 20:48:59 +05:30
Kashif Faraz 2d8e0f28f3
Refactor: Cleanup coordinator duties for metadata cleanup (#14631)
Changes
- Add abstract class `MetadataCleanupDuty`
- Make `KillAuditLogs`, `KillCompactionConfig`, etc extend `MetadataCleanupDuty` 
- Improve log and error messages
- Cleanup tests
- No functional change
2023-08-05 13:08:23 +05:30
Suneet Saldanha 62ddeaf16f
Additional dimensions for service/heartbeat (#14743)
* Additional dimensions for service/heartbeat

* docs

* review

* review
2023-08-04 11:01:07 -07:00
Suneet Saldanha 590734b5eb
Update tutorial-kafka.md (#14749) 2023-08-04 10:56:33 -07:00
Clint Wylie e5661a394c
refactor front-coded into static classes instead of using functional interfaces (#14572)
* refactor front-coded into static classes instead of using functional interfaces

* shared v0 static method instead of copy
2023-08-04 10:52:36 -07:00
Laksh Singla d6c73ca6e5
Cleanup the documentation for deep storage 2023-08-04 10:20:01 +00:00
Abhishek Agarwal 6ced208391
Improve the backport missing script (#14723) 2023-08-04 15:21:55 +05:30
Soumyava 0d73480c8f
Latest aggregator factories should accept time as VectorValueSelecto… (#14753)
Fix the queries that have latest aggregator with an expression as time column
2023-08-04 13:04:25 +05:30
317brian 3b5b6c6a41
docs: query from deep storage (#14609)
* cold tier wip

* wip

* copyedits

* wip

* copyedits

* copyedits

* wip

* wip

* update rules page

* typo

* typo

* update sidebar

* moves durable storage info to its own page in operations

* update screenshots

* add apache license

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* add query from deep storage tutorial stub

* address some of the feedback

* revert screenshot update. handled in separate pr

* load rule update

* wip tutorial

* reformat deep storage endpoints

* rest of tutorial

* typo

* cleanup

* screenshot and sidebar for tutorial

* add license

* typos

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* rest of review comments

* clarify where results are stored

* update api reference for durablestorage context param

* Apply suggestions from code review

Co-authored-by: Karan Kumar <karankumar1100@gmail.com>

* comments

* incorporate #14720

* address rest of comments

* missed one

* Update docs/api-reference/sql-api.md

* Update docs/api-reference/sql-api.md

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: demo-kratia <56242907+demo-kratia@users.noreply.github.com>
Co-authored-by: Karan Kumar <karankumar1100@gmail.com>
2023-08-04 11:10:08 +05:30
Pranav d31c04c4c6
Fix the bug in getIndexInfo for mysql (#14750) 2023-08-03 21:45:01 -07:00
YongGang 3335040b22
Report task/pending/time metrics for k8s based ingestion (#14698)
Changes:
* Add and invoke `StateListener` when state changes in `KubernetesPeonLifecycle`
* Report `task/pending/time` metric in `KubernetesTaskRunner` when state moves to RUNNING
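
The listener arrangement described in these changes might look roughly like the sketch below. It is a simplified model, not the extension's code: the class names, the state set, the injected clock, and the in-memory `emitted` list standing in for a metrics emitter are all hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"


class PeonLifecycle:
    """Stand-in for a task lifecycle that notifies a listener on state changes."""

    def __init__(self, task_id, state_listener):
        self.task_id = task_id
        self.state = State.PENDING
        self._listener = state_listener

    def set_state(self, new_state):
        self.state = new_state
        self._listener(new_state, self.task_id)


class TaskRunner:
    """Stand-in for a runner that records pending time when a task starts running."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._submitted_at = {}
        self.emitted = []  # (metric name, task id, milliseconds)

    def submit(self, task_id):
        self._submitted_at[task_id] = self._clock()
        return PeonLifecycle(task_id, self._on_state_change)

    def _on_state_change(self, state, task_id):
        # Only the transition to RUNNING produces the pending-time metric.
        if state is State.RUNNING:
            pending_ms = (self._clock() - self._submitted_at[task_id]) * 1000
            self.emitted.append(("task/pending/time", task_id, round(pending_ms)))
```

Passing the callback into the lifecycle keeps the metric logic in the runner, so the lifecycle only needs to announce its state changes.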
2023-08-04 09:07:11 +05:30
zachjsh ba957a9b97
Add ability to limit the number of segments killed in kill task (#14662)
### Description

Previously, the `maxSegments` configured for auto kill could be ignored if an interval of data for a given datasource had more than this number of unused segments: the kill task spawned to delete unused segments in that interval could delete more than the configured `maxSegments`. Now each kill task spawned by the auto kill coordinator duty kills at most `limit` segments. This is done by adding a new config property to the `KillUnusedSegmentTask` that allows users to specify this limit.
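
The effect of the new limit can be pictured with the small sketch below. The function name, the list-of-ids representation, and the rule that a missing limit means "no cap" are assumptions for illustration only.

```python
def segments_to_kill(unused_segments, limit=None):
    """Hypothetical helper: pick the segments a single kill task would delete."""
    segments = list(unused_segments)
    if limit is None:
        # Assumed: without a limit, every unused segment in the interval is killed.
        return segments
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # With a limit, at most `limit` segments are killed by this task.
    return segments[:limit]
```

Segments beyond the limit stay unused and would be picked up by a later kill task.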
2023-08-03 22:17:04 -04:00
imply-cheddar 748874405c
Minimize PostAggregator computations (#14708)
* Minimize PostAggregator computations

Since a change back in 2014, the topN query has been computing
all PostAggregators on all intermediate responses from leaf nodes
to brokers.  This generates significant slowdowns for queries
with relatively expensive PostAggregators.  This change rewrites
the query that is pushed down to only have the minimal set of
PostAggregators such that it is impossible for downstream
processing to do too much work.  The final PostAggregators are
applied at the very end.
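
The idea of pushing down only a minimal set of PostAggregators can be sketched as a dependency walk. This is not Druid's implementation: the tuple representation, the function name, and the assumption that the "required" names are those needed by intermediate processing (for example the ordering metric) are all illustrative.

```python
def prune_post_aggregators(post_aggs, required_names):
    """Keep only the post-aggregators that the required names depend on.

    post_aggs: ordered list of (name, set_of_input_field_names); each entry is
    assumed to read only aggregators or earlier post-aggregators.
    """
    needed = set(required_names)
    kept = []
    # Walk backwards so a kept post-aggregator can pull in earlier ones it reads.
    for name, inputs in reversed(post_aggs):
        if name in needed:
            kept.append((name, inputs))
            needed |= set(inputs)
    kept.reverse()  # restore the original evaluation order
    return kept
```

The full list would still be evaluated once at the very end, on the merged result, while intermediate responses compute only the pruned list.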
2023-08-04 00:04:31 +05:30
YongGang 20c48b6a3d
Retry S3 task log fetch in case of transient S3 exceptions (#14714) 2023-08-03 19:46:10 +05:30
Kashif Faraz b27d281b11
Remove unused param in MetadataResource (#14747) 2023-08-03 19:18:01 +05:30
Suneet Saldanha 00f1f8cef5
Enable ServiceStatusMonitor in the examples (#14744) 2023-08-03 06:07:01 -07:00
AmatyaAvadhanula 5a52f7a457
Fix IT failure due to query interval (#14738) 2023-08-02 11:29:35 -07:00
Adarsh Sanjeev 6837a7be19
Add logging for downsampling sketches in MSQ (#14580)
* Add more logs for downsampling sketches

* Fix builds

* Lower log level

* Add new log message
2023-08-02 20:07:54 +05:30
Abhishek Agarwal 955734ba8d
Fix exempt labels in stale.yml (#14733) 2023-08-02 17:12:18 +05:30
Clint Wylie 94fb41a4df
fix nested field virtual column array column element vector object selector (#14729)
Fixes a case I missed in #14688 when the return type is STRING but it's coming from a top level array typed column instead of a nested array column while making a vector object selector.

Also while here I noticed that the internal JSON_VALUE functions for array types were named inconsistently with the non-array functions, so I renamed them. These are not documented so it should not be disruptive in any way, since they are only used internally for rewrites while planning to make the correct virtual column.

JSON_VALUE_RETURNING_ARRAY_VARCHAR -> JSON_VALUE_ARRAY_VARCHAR
JSON_VALUE_RETURNING_ARRAY_BIGINT -> JSON_VALUE_ARRAY_BIGINT
JSON_VALUE_RETURNING_ARRAY_DOUBLE -> JSON_VALUE_ARRAY_DOUBLE
The internal non-array functions are JSON_VALUE_VARCHAR, JSON_VALUE_BIGINT, and JSON_VALUE_DOUBLE.
2023-08-02 17:08:24 +05:30