Commit Graph

1005 Commits

Author SHA1 Message Date
Zoltan Haindrich ba09e7d1de add 2024-05-21 14:00:02 +00:00
Zoltan Haindrich 08b73d1969 Revert "some stuff"
This reverts commit 52598d3bca.
2024-05-21 12:02:46 +00:00
Zoltan Haindrich 52598d3bca some stuff 2024-05-21 12:02:44 +00:00
Zoltan Haindrich e7e119b559 reduce copypaste 2024-05-16 13:33:27 +00:00
Zoltan Haindrich f4c73e1499 make old defaults to overides 2024-05-16 11:31:28 +00:00
Zoltan Haindrich 59be71c068 add other 2024-05-16 11:23:26 +00:00
Zoltan Haindrich 607ef174c5 indent 2024-05-16 11:18:54 +00:00
Zoltan Haindrich 3658ab24c3 remove f 2024-05-16 11:18:24 +00:00
Zoltan Haindrich bec1f38a0e move sqlmodule down 2024-05-16 11:17:05 +00:00
Zoltan Haindrich 688611eab3 undo 2024-05-16 11:11:55 +00:00
Zoltan Haindrich 93892b6524 undo some 2024-05-16 11:11:03 +00:00
Zoltan Haindrich 118eb61939 there - with 1 boot 2024-05-16 10:31:38 +00:00
Zoltan Haindrich 28ea884e19 almost ready? 2024-05-16 10:01:22 +00:00
Zoltan Haindrich 8ee41f58d0 it does work 2024-05-15 15:14:43 +00:00
Zoltan Haindrich d4b052a579 stuff 2024-05-15 11:57:13 +00:00
Zoltan Haindrich a16f982699 remove crap 2024-05-14 16:04:19 +00:00
Zoltan Haindrich b7b73fa7fe fix context key order 2024-05-14 08:54:53 +00:00
Zoltan Haindrich 9578953678 Merge remote-tracking branch 'apache/master' into quidem-runner-extension-submit 2024-05-14 07:36:48 +00:00
Zoltan Haindrich 3132c12781 remove unnecessary \\ 2024-05-14 07:36:07 +00:00
Sree Charan Manamala b8dd7478d0
Custom Calcite Rule to remove redundant references (#16402)
Custom calcite rule mimicking AggregateProjectMergeRule to extend support to expressions.
The current calcite rule return null in such cases.
In addition, this removes the redundant references.
2024-05-14 06:38:05 +02:00
Zoltan Haindrich e36c46a85a fix import
style fixes

clenaup
2024-05-13 15:52:03 +00:00
Laksh Singla 4bfc186153
Support sorting on complex columns in MSQ (#16322)
MSQ sorts the columns in a highly specialized manner by byte comparisons. As such the values are serialized differently. This works well for the primitive types and primitive arrays, however complex types cannot be serialized specially.

This PR adds the support for sorting the complex columns by deserializing the value from the field and comparing it via the type strategy. This is a lot slower than the byte comparisons, however, it's the only way to support sorting on complex columns that can have arbitrary serialization not optimized for MSQ.

The primitives and the arrays are still compared via the byte comparison, therefore this doesn't affect the performance of the queries supported before the patch. If there's a sorting key with mixed complex and primitive/primitive array types, for example: longCol1 ASC, longCol2 ASC, complexCol1 DESC, complexCol2 DESC, stringCol1 DESC, longCol3 DESC, longCol4 ASC, the comparison will happen like:

    longCol1, longCol2 (ASC) - Compared together via byte-comparison, since both are byte comparable and need to be sorted in ascending order
    complexCol1 (DESC) - Compared via deserialization, cannot be clubbed with any other field
    complexCol2 (DESC) - Compared via deserialization, cannot be clubbed with any other field, even though the prior field was a complex column with the same order
    stringCol1, longCol3 (DESC) - Compared together via byte-comparison, since both are byte comparable and need to be sorted in descending order
    longCol4 (ASC) - Compared via byte-comparison, couldn't be coalesced with the previous fields as the direction was different

This way, we only deserialize the field wherever required
2024-05-13 15:07:05 +05:30
Zoltan Haindrich e13d560b6e Enable quidem shadowing for decoupled testcases
* Altered `QueryTestBuilder` to be able to switch to a backing quidem test
* added a small crc to ensure that the shadow testcase does not deviate from the original one
* Packaged all decoupled related things into a a single `DecoupledExtension` to reduce copy-paste
* `DecoupledTestConfig#quidemReason` must describe why its being used
* `DecoupledTestConfig#separateDefaultModeTest` can be used to make multiple case files based on `NullHandling` state
* fixed a cosmetic bug during decoupled join translation
* enhanced `!druidPlan` to report the final logical plan in non-decoupled mode as well
* add check to ensure that only supported params are present in a druidtest uri
* enabled shadow testcases for previously disabled testcases
2024-05-10 13:38:54 +00:00
Zoltan Haindrich 1811674753
Enable quidem tests to use different suppliers (#16382)
* enable quidem uri support for `druidtest:///?ComponentSupplier=Nested` and similar
* changes the way `SqlTestFrameworkConfig` is being applied; all options will have their own annotation (its kinda impossible to detect that an annotation has a set value or its the default)
* enables hierarchical processing of config annotation (was needed to enable class level supplier annotation)
* moves uri processing related string2config stuff into `SqlTestFrameworkConfig`
2024-05-09 09:21:02 +02:00
Akshat Jain 775d654a6c
Load only the required lookups for MSQ tasks (#16358)
With this PR changes, MSQ tasks (MSQControllerTask and MSQWorkerTask) only load the required lookups during querying and ingestion, based on the value of CTX_LOOKUPS_TO_LOAD key in the query context.
2024-05-09 11:21:54 +05:30
Misha b5958b6b07
Feature configurable calcite bloat (#16248)
* Configurable bloat for calcite ProjectMergeRule implemented

* Comment added

* Default bloat value increased to 1000

* Implemented bloat configuration from QueryContext

* Code refactored, docs updated

---------

Co-authored-by: sviatahorau <mikhail.sviatahorau@deep.bi>
2024-05-06 20:43:39 +05:30
Gian Merlino 588d442422
Add native filter conversion for SCALAR_IN_ARRAY. (#16312)
* Add native filter conversion for SCALAR_IN_ARRAY.

Main changes:

1) Add an implementation of "toDruidFilter" in ScalarInArrayOperatorConversion.

2) Split up Expressions.literalToDruidExpression into two functions, so the first
   half (literalToExprEval) can be used by ScalarInArrayOperatorConversion to more
   efficiently create the list of match values.

* Fix type in time arithmetic conversion.

* Test updates.

* Update test cases to use null instead of '' in default-value mode.

* Switch test from msqIncompatible to compatible with a different result.

* Update one more test.

* Fix test.

* Update tests.

* Use ExprEvalWrapper to differentiate between empty string and null.

* Fix tests some more.

* Fix test.

* Additional comment.

* Style adjustment.

* Fix tests.

* trueValue -> actualValue.

* Use different approach, DruidLiteral instead of ExprEvalWrapper.

* Revert changes in ArrayOfDoublesSketchSqlAggregatorTest.
2024-05-03 13:00:33 -07:00
zachjsh fb7c84fb5d
Catalog clustering keys fixes (#16351)
* * add another catalog clustering columns unit test

* * dissallow clusterKeys with descending order

* * make more clear that clustering is re-written into ingest node
whether a catalog table or not

* * when partitionedBy is stored in catalog, user shouldnt need to specify
it in order to specify clustering

* * fix intellij inspection failure
2024-05-03 14:02:56 -04:00
Zoltan Haindrich 2d0e86cbdc
Use quidem to run tests (#16249)
* test scoped jdbc driver for druidtest:/// backed DruidAvaticaTestDriver
** DecoupledTestConfig is used inside the URI - this will make it possible to attach to existing things more easily
* DruidQuidemTestBase can be used to create module level set of quidem tests
* added quidem commands: !convertedPlan, !logicalPlan, !druidPlan, !nativePlan
** for these I've used some values of the Hook which was there in calcite
* there are some shortcuts with proxies(they are only used during testing) - we can probably remove those later
2024-05-02 02:12:42 -04:00
Laksh Singla e695e52d3f
Improve code flow in the First/Last vector aggregators and unify the numeric aggregators with the String implementations (#16230)
This PR fixes the first and last vector aggregators and improves their readability. Following changes are introduced

    The folding is broken in the vectorized versions. We consider time before checking the folded object.

    If the numerical aggregator gets passed any other object type for some other reason (like String), then the aggregator considers it to be folded, even though it shouldn’t be. We should convert these objects to the desired type, and aggregate them properly.

    The aggregators must properly use generics. This would minimize the ClassCastException issues that can happen with mixed segment types. We are unifying the string first/last aggregators with numeric versions as well.

    The aggregators must aggregate null values (https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java#L55-L56 ). The aggregator should only ignore pairs with time == null, and not value == null

    Time nullity is ignored when trying to vectorize the data.

    String versions initialized with DateTimes.MIN that is equal to Long.MIN / 2. This can cause incorrect results in case the user enters a custom time column. NOTE: This is still present because it would require a larger refactor in all of the versions.

    There is a difference in what users might expect from the results because the code flow is changed (for example, the direction of the for loops, etc), however, this will only change the results, and not the contract set by first/last aggregators, which is that if multiple values have the same timestamp, then any of them can get picked.

    If the column is non-existent, the users might expect a change in the timestamp from DateTime.MAX to Long.MAX, because the code incorrectly used DateTime.MAX to initialize the aggregator, however, in case of a custom timestamp column, this might not be the case. The SQL query might be prohibited from using any Long since it requires a cast to the timestamp function that can fail, but AFAICT native queries don't have such limitations.
2024-04-30 15:13:14 +05:30
Laksh Singla 26d63e7b65
Prevent joining on nested arrays and complex types (#16349)
#16068 modified DimensionHandlerUtils to accept complex types to be dimensions. This had an unintended side effect of allowing complex types to be joined upon (which wasn't guarded explicitly, it doesn't work).
This PR modifies the IndexedTable to reject building the index on the complex types to prevent joining on complex types. The PR adds back the check in the same place, explicitly.
2024-04-30 11:36:53 +05:30
Akshat Jain 9d2cae40c3
Add support for selective loading of lookups in the task layer (#16328)
Changes:
- Add `LookupLoadingSpec` to support 3 modes of lookup loading: ALL, NONE, ONLY_REQUIRED
- Add method `Task.getLookupLoadingSpec()`
- Do not load any lookups for `KillUnusedSegmentsTask`
2024-04-29 07:19:59 +05:30
zachjsh 365cd7e8e7
INSERT/REPLACE can omit clustering when catalog has default (#16260)
* * fix

* * fix

* * address review comments

* * fix

* * simplify tests

* * fix complex type nullability issue

* * implement and add tests

* * address review comments

* * address test review comments

* * fix checkstyle

* * fix dependencies

* * all tests passing

* * cleanup

* * remove unneeded code

* * remove unused dependency

* * fix checkstyle
2024-04-26 10:19:45 -04:00
Adarsh Sanjeev 9a2d7c28bc
Prepare master branch for 31.0.0 release (#16333) 2024-04-26 09:22:43 +05:30
Gian Merlino 68d6e682e8
Fix TimeBoundary planning when filters require virtual columns. (#16337)
The timeBoundary query does not support virtual columns, so we should
avoid it if the query requires virtual columns.
2024-04-25 16:49:40 -07:00
Zoltan Haindrich 9c0bd56f5b
Make QueryComponentSupliers independent from test classes (#16275) 2024-04-25 02:12:07 -04:00
Laksh Singla 6bca406d31
Grouping on complex columns aka unifying GroupBy strategies (#16068)
Users can pass complex types as dimensions to the group by queries. For example:

SELECT nested_col1, count(*) FROM foo GROUP BY nested_col1
2024-04-24 23:00:14 +05:30
Rishabh Singh e30790e013
Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building (#15817)
Issue: #14989

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.

This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
2024-04-24 22:22:53 +05:30
Sree Charan Manamala 080476f9ea
WINDOWING - Fix 2 nodes with same digest causing mapping issue (#16301)
Fixes the mapping issue in window fucntions where 2 nodes get the same reference.
2024-04-24 16:45:02 +05:30
Laksh Singla b9bbde5c0a
Fix deadlock that can occur while merging group by results (#15420)
This PR prevents such a deadlock from happening by acquiring the merge buffers in a single place and passing it down to the runner that might need it.
2024-04-22 14:10:44 +05:30
Sree Charan Manamala ad5701e891
new SCALAR_IN_ARRAY function analogous to DRUID_IN (#16306)
* scalar_in function

* api doc

* refactor
2024-04-18 21:15:15 -07:00
Sree Charan Manamala 960a674442
Corrected Strict NON NULL return type checks (#16279) 2024-04-18 12:17:13 +02:00
Gian Merlino ccc1ffb032
Additional short circuiting knowledge in filter bundles. (#16292)
* Additional short circuiting knowledge in filter bundles.

Three updates:

1) The parameter "selectionRowCount" on "makeFilterBundle" is renamed
   "applyRowCount", and redefined as an upper bound on rows remaining
   after short-circuiting (rather than number of rows selected so far).
   This definition works better for OR filters, which pass through the
   FALSE set rather than the TRUE set to the next subfilter.

2) AndFilter uses min(applyRowCount, indexIntersectionSize) rather
   than using selectionRowCount for the first subfilter and indexIntersectionSize
   for each filter thereafter. This improves accuracy when the incoming
   applyRowCount is smaller than the row count from the first few indexes.

3) OrFilter uses min(applyRowCount, totalRowCount - indexUnionSize) rather
   than applyRowCount for subfilters. This allows an OR filter to pass
   information about short-circuiting to its subfilters.

To help write tests for this, the patch also moves the sampled
wikiticker data file from sql to processing.

* Forbidden APIs.

* Forbidden APIs.

* Better comments.

* Fix inspection.

* Adjustments to tests.
2024-04-16 22:42:28 -07:00
zachjsh a5428e75ff
INSERT/REPLACE complex target column types are validated against source input expressions (#16223)
* * fix

* * fix

* * address review comments

* * fix

* * simplify tests

* * fix complex type nullability issue

* * address review comments

* * address test review comments

* * fix checkstyle
2024-04-16 17:20:35 -04:00
Sree Charan Manamala 5247059d2f
Allow Double & null values in sql type array through dynamic params (#16274) 2024-04-15 10:44:42 +02:00
Adarsh Sanjeev 3df00aef9d
Add manifest file for MSQ export (#15953)
Currently, export creates the files at the provided destination. The addition of the manifest file will provide a list of files created as part of the manifest. This will allow easier consumption of the data exported from Druid, especially for automated data pipelines
2024-04-15 11:37:31 +05:30
Sree Charan Manamala 3340b200db
Fix window function drill tests failures falling under RESULT_MISMATCH & RESULT_COUNT_MISMATCH (#16264)
* Updated the drill test expected results which are failing due to druid's default sorting algorithm taking nulls first approach.
* Corrected the queries where date time values are directly provided
* marked 2 cases failing with resultset casting issues
2024-04-12 13:54:48 +02:00
Sree Charan Manamala f65c166327
Windowed aggregates should update the aggregation value based on final compute (#16244) 2024-04-12 08:28:33 +02:00
Gian Merlino 9f358f5f4a
SQL tests: avoid mixing skip and cannot vectorize. (#16251)
* SQL tests: avoid mixing skip and cannot vectorize.

skipVectorize switches off vectorization tests completely, and
cannotVectorize turns vectorization tests into negative tests. It doesn't
make sense to use them together, so this patch makes it an error to do so,
and cleans up cases where both are mentioned.

This patch also has the effect of changing various tests from skipVectorize
to cannotVectorize, because in the past when both were mentioned,
skipVectorize would take priority.

* Fix bug with StringAnyAggregatorFactory attempting to vectorize when it cannt.

* Fix tests.
2024-04-11 15:06:11 -07:00
Soumyava 7759f25095
Moving bitwise_or to use native calcite operator (#16237) 2024-04-04 12:49:29 -07:00