Commit Graph

1203 Commits

Author SHA1 Message Date
Zoltan Haindrich 32e8e074ae
Planning could have failed if UNION ALL operator was completely removed (#16946) 2024-09-02 04:37:10 -04:00
Parag Jain 6eb42e8d5a
fix extraction of timeseries results from result level cache (#16895)
* fix extraction of timeseries results from result level cache

* remove unneded import

* add test
2024-09-01 00:25:55 +05:30
Akshat Jain 72f8e79a42
Use multiple workers in MSQ WF drill test suite (#16949) 2024-08-26 11:34:40 +05:30
Gian Merlino 0603d5153d
Segments sorted by non-time columns. (#16849)
* Segments primarily sorted by non-time columns.

Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.

However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.

For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".

For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.

In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.

The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)

Changes on the write path:

1) DimensionsSpec can now optionally contain a __time dimension, which
   controls the placement of __time in the sort order. If not present,
   __time is considered to be first in the sort order, as it has always
   been.

2) IncrementalIndex and IndexMerger are updated to sort facts more
   flexibly; not always by time first.

3) Metadata (stored in metadata.drd) gains a "sortOrder" field.

4) MSQ can generate range-based shard specs even when not all columns are
   singly-valued strings. It merely stops accepting new clustering key
   fields when it encounters the first one that isn't a singly-valued
   string. This is useful because it enables range shard specs on
   "someDim" to be created for clauses like "CLUSTERED BY someDim, __time".

Changes on the read path:

1) Add StorageAdapter#getSortOrder so query engines can tell how a
   segment is sorted.

2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
   and VectorCursorGranularizer to throw errors when using granularities
   on non-time-ordered segments.

3) Update ScanQueryEngine to throw an error when using the time-ordering
  "order" parameter on non-time-ordered segments.

4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
   running on a non-time-ordered segment.

5) Add "sqlUseGranularity" context parameter that causes the SQL planner
   to avoid using granularities other than ALL.

Other changes:

1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
   and change the meaning subtly: it now returns true if the DimensionsSpec
   represents an unchanging list of dimensions, or false if there is
   some discovery happening. This is what call sites had expected anyway.

* Fixups from CI.

* Fixes.

* Fix missing arg.

* Additional changes.

* Fix logic.

* Fixes.

* Fix test.

* Adjust test.

* Remove throws.

* Fix styles.

* Fix javadocs.

* Cleanup.

* Smoother handling of null ordering.

* Fix tests.

* Missed a spot on the merge.

* Fixups.

* Avoid needless Filters.and.

* Add timeBoundaryInspector to test.

* Fix tests.

* Fix FrameStorageAdapterTest.

* Fix various tests.

* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.

* Pom fix.

* Fix doc.
2024-08-23 08:24:43 -07:00
Clint Wylie 2aef6ac685
fix ipv4_parse function return type in SQL to be bigint instead of integer (#16942)
* fix ipv4_parse function return type in SQL to be bigint instead of integer

* fix default value mode
2024-08-22 13:36:43 -07:00
Gian Merlino 338da67bc6
Add type coercion and null check to left, right, repeat exprs. (#16480)
* Add type coercion and null check to left, right, repeat exprs.

These exprs shouldn't validate types; they should coerce types. Coercion
is typical behavior for functions because it enables schema evolution.

The functions are also modified to check isNumericNull on the right-hand
argument. This was missing previously, which would erroneously cause
nulls to be treated as zeroes.

* Fix tests.
2024-08-21 15:07:24 -07:00
Akshat Jain 97f9502ad2
Enable MSQ WF drill tests which were previously disabled (#16935) 2024-08-21 15:47:50 +05:30
Akshat Jain 0ce1b6b22f
MSQ window function: Take segment granularity into consideration to fix NPE issues with ingestion (#16854)
This PR changes the logic for window functions to use the resultShuffleSpecFactory for the last window stage.
2024-08-21 10:06:04 +05:30
Rishabh Singh bc4b3a2f91
Filter out tombstone segments from metadata cache (#16890)
* Fix build

* Support segment metadata queries for tombstones

* Filter out tombstone segments from metadata cache

* Revert some changes

* checkstyle

* Update docs
2024-08-20 11:35:02 +05:30
Gian Merlino 806649f8af
SQL: Fix nullable DATE, TIMESTAMP reduction. (#16915)
Reduction of nullable DATE and TIMESTAMP expressions did not perform
a necessary null check, so would in some cases reduce to
1970-01-01 00:00:00 (epoch) rather than NULL.
2024-08-16 22:41:12 -07:00
Clint Wylie 4283b270e3
rework cursor creation (#16533)
changes:
* Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* added new interface `CursorHolder` and new interface `CursorHolderFactory` as a replacement for `CursorFactory`, with method `makeCursorHolder`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* `CursorFactory.makeCursors` previously returned a `Sequence<Cursor>` corresponding to the query granularity buckets, with a separate `Cursor` per bucket. `CursorHolder.asCursor` instead returns a single `Cursor` (equivalent to 'ALL' granularity), and a new `CursorGranularizer` has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its `VectorCursorGranularizer`), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities. 
* Deprecated `CursorFactory`, `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* updated all `StorageAdapter` implementations to implement `makeCursorHolder`, transitioned direct `CursorFactory` implementations to instead implement `CursorMakerFactory`. `StorageAdapter` being a `CursorMakerFactory` is intended to be a transitional thing, ideally will not be released in favor of moving `CursorMakerFactory` to be fetched directly from `Segment`, however this PR was already large enough so this will be done in a follow-up.
* updated all query engines to use `makeCursorHolder`, granularity based engines to use `CursorGranularizer`.
2024-08-16 11:34:10 -07:00
Sree Charan Manamala 964cf47bb5
fix NPE (#16897) 2024-08-15 18:12:22 +08:00
Akshat Jain 3d6cedb25f
Fix IndexOutOfBoundsException for MSQ window function queries with empty RAC (#16865)
* Fix IndexOutOfBoundsException for MSQ window function queries with empty RAC
2024-08-09 11:39:53 +05:30
zachjsh cb09b572e6
Fix Druid table schema resolution when table defined in catalog and has schema manager (#16869)
* SQL syntax error should target USER persona

* * revert change to queryHandler and related tests, based on review comments

* * add test

* Properly handle Druid schema blending with catalog definition and segment metadata

* * add javadocs
2024-08-08 21:21:03 -04:00
Zoltan Haindrich 408702e100
Add ability to run MSQ in Quidem tests (#16798)
* implements some jdbc facade to enable msq usage
* adds an !msqPlan command
* adds more guice usage to testsystem startup
2024-08-08 06:37:06 +02:00
Gian Merlino de40d81b29
SQL: Add ProjectableFilterableTable to SegmentsTable. (#16841)
* SQL: Add ProjectableFilterableTable to SegmentsTable.

This allows us to skip serialization of expensive fields such as
shard_spec, dimensions, metrics, and last_compaction_state, if those
fields are not actually being queried.

* Restructure logic to avoid unnecessary toString() as well.
2024-08-06 06:40:21 -07:00
Sree Charan Manamala ed6b547481
Handle default bounds correctly in WINDOW clause (#16833)
When a window is defined as WINDOW W AS <DEF> and using a syntax of (PARTITION BY col1 ORDER BY col2 ROWS x PRECEDING), we would need to default the other bound to CURRENT ROW

We already have implemented this earlier, but when defined as WINDOW W AS <DEF>, Calcite takes a different route to validate the window.
2024-08-06 09:58:44 +02:00
Zoltan Haindrich 26e3c44f4b
Quidem record (#16624)
* enables to launch a fake broker based on test resources (druidtest uri)
* could record queries into new testfiles during usage
* instead of re-purpose Calcite's Hook migrates to use DruidHook which we can add further keys
* added a quidem-ut module which could be the place for tests which could iteract with modules/etc
2024-08-05 14:58:32 +02:00
Sree Charan Manamala c7eacd079e
fallback SQL IN filter to expression filter when VirtualColumnRegistry is null (#16836) 2024-08-05 11:27:51 +05:30
Abhishek Radhakrishnan 31b43753fb
Add `druid.indexing.formats.stringMultiValueHandlingMode` system config (#16822)
This patch introduces an optional cluster configuration, druid.indexing.formats.stringMultiValueHandlingMode, allowing operators to override the default mode SORTED_SET for string dimensions. The possible values for the config are SORTED_SET, SORTED_ARRAY, or ARRAY (SORTED_SET is the default). Case insensitive values are allowed.
While this cluster property allows users to manage the multi-value handling mode for string dimension types, it's recommended to migrate to using real array types instead of MVDs.
 
This fixes a long-standing issue where compaction will honor the configured cluster wide property instead of rewriting it as the default SORTED_ARRAY always, even if the data was originally ingested with ARRAY or SORTED_SET.
2024-08-03 10:23:44 -07:00
Zoltan Haindrich c7cde31a89
HAVING clauses may not contain window functions (#16742)
Rejects having clauses if they contain windowed expressions.
Also added a check to produce a more descriptive error if an OVER expression
reaches the filter translation layer.

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-29 04:11:36 -04:00
Sree Charan Manamala 9b76d13ff8
Check for Aggregation inside a window clause when syntax used as - WINDOW W AS DEF (#16801) 2024-07-26 11:18:35 +02:00
Clint Wylie 14954c7eb9
serialize legacy as false for scan query for rolling downgrade/upgrade (#16793)
Fixes rolling downgrades/upgrades after #16659 by hard coding scan query "legacy":false since it is a required property during deserialization.
2024-07-25 14:51:58 +05:30
Zoltan Haindrich 7e3fab5bf9
Make WindowFrames more specific (#16741)
Changes the WindowFrame internals / representation a bit; introduces dedicated frametypes for rows and groups which corresponds to the implemented processing methods
2024-07-25 04:57:36 +02:00
Akshat Jain a0437b6c93
MSQ window functions: Fix partition boundary issues for arrays (#16780)
* MSQ window functions: Fix partition boundary issues for arrays

* Address review comments

* Cache type strategies

* Trigger Build

* Convert typeStrategies from list to array
2024-07-24 18:47:04 +05:30
Sree Charan Manamala 3f4d66c399
Check for Unsupported Aggregation with Distinct when useApproxCountDistinct is enabled (#16770)
* init

* add NativelySupportsDistinct

* refactor

* javadoc

* refactor

* fix tests

* fix drill tests

* comments

* Update sql/src/test/java/org/apache/druid/sql/calcite/DrillWindowQueryTest.java

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-24 11:13:22 +08:00
Laksh Singla 11bb40981e
Deduce type from the aggregators when materializing subquery results (#16703)
For aggregators like StringFirst/Last, whose intermediate type isn't the same as the final type, using them in GroupBy, TopN or Timeseries subqueries causes a fallback when maxSubqueryBytes is set. This is because we assume that the finalization is not known, due to which the row signature cannot determine whether to use the intermediate or the final type, and it puts it as null. This PR figures out the finalization from the query context and uses the intermediate or the final type appropriately.
2024-07-23 11:52:39 +05:30
Akshat Jain c45d4fdbca
MSQ window functions: Minor cleanup for empty over clause related flows + Exhaustive tests (#16754)
* MSQ window functions: Revamp logic to create separate window stages when empty over() clause is present

* Fix tests

* Revert changes of creating separate stages for empty over clause

* Address review comments
2024-07-23 11:37:34 +05:30
Akshat Jain 6a2348b78b
Preemptive restriction for queries with approximate count distinct on complex columns of unsupported type (#16682)
This PR aims to check if the complex column being queried aligns with the supported types in the aggregator and aggregator factories, and throws a user-friendly error message if they don't.
2024-07-22 21:34:06 +05:30
Sree Charan Manamala 149d7c5207
Throw exceptions in SqlValidator when DISTINCT used over WINDOW (#16738)
* Throw exception if DISTINCT used with window functions aggregate call
* Improve error message when unsupported aggregations are used with window functions
2024-07-22 16:29:46 +02:00
Sree Charan Manamala c9aae9d8e6
Enable WINDOW_LEAF_OPERATOR for native engine to support queries without group by (#16753) 2024-07-22 12:31:55 +02:00
Clint Wylie 35b876436b
remove native scan query legacy mode (#16659) 2024-07-18 23:33:27 -07:00
Akshat Jain b53c26f5c5
Fix issues with partitioning boundaries for MSQ window functions (#16729)
* Fix issues with partitioning boundaries for MSQ window functions

* Address review comments

* Address review comments

* Add test for coverage check failure

* Address review comment

* Remove DruidWindowQueryTest and WindowQueryTestBase, move those tests to DrillWindowQueryTest

* Update extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/WindowOperatorQueryKit.java

* Address review comments

* Add test for equals and hashcode for WindowOperatorQueryFrameProcessorFactory

* Address review comment

* Fix checkstyle

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-18 10:05:09 +08:00
Sree Charan Manamala 40ef9fc4ec
Bug fix for array type selector causing array aggregation over window frame fail (#16653) 2024-07-17 14:09:56 +02:00
Sree Charan Manamala 78a4a09d01
Window Function offset correction for RAC (#16718)
* When an ArrayList RAC creates a child RAC, the start and end offsets need to have the offset of parent's start offset
* Defaults the 2nd window bound to CURRENT ROW when only a single bound is specified
* Removes the windowingStrictValidation warning and throws a hard exception when Order By alongside RANGE clause is not provided with UNBOUNDED or CURRENT ROW as both bounds
2024-07-15 12:43:27 +02:00
Rishabh Singh 64104533ac
Enable querying entirely cold datasources (#16676)
Add ability to query entirely cold datasources.
2024-07-15 15:02:59 +05:30
Vishesh Garg 197c54f673
Auto-Compaction using Multi-Stage Query Engine (#16291)
Description:
Compaction operations issued by the Coordinator currently run using the native query engine.
As majority of the advancements that we are making in batch ingestion are in MSQ, it is imperative
that we support compaction on MSQ to make Compaction more robust and possibly faster. 
For instance, we have seen OOM errors in native compaction that MSQ could have handled by its
auto-calculation of tuning parameters. 

This commit enables compaction on MSQ to remove the dependency on native engine. 

Main changes:
* `DataSourceCompactionConfig` now has an additional field `engine` that can be one of 
`[native, msq]` with `native` being the default.
*  if engine is MSQ, `CompactSegments` duty assigns all available compaction task slots to the
launched `CompactionTask` to ensure full capacity is available to MSQ. This is to avoid stalling which
could happen in case a fraction of the tasks were allotted and they eventually fell short of the number
of tasks required by the MSQ engine to run the compaction.
* `ClientCompactionTaskQuery` has a new field `compactionRunner` with just one `engine` field.
* `CompactionTask` now has `CompactionRunner` interface instance with its implementations
`NativeCompactinRunner` and `MSQCompactionRunner` in the `druid-multi-stage-query` extension.
The objectmapper deserializes `ClientCompactionRunnerInfo` in `ClientCompactionTaskQuery` to the
`CompactionRunner` instance that is mapped to the specified type [`native`, `msq`]. 
* `CompactTask` uses the `CompactionRunner` instance it receives to create the indexing tasks.
* `CompactionTask` to `MSQControllerTask` conversion logic checks whether metrics are present in 
the segment schema. If present, the task is created with a native group-by query; if not, the task is
issued with a scan query. The `storeCompactionState` flag is set in the context.
* Each created `MSQControllerTask` is launched in-place and its `TaskStatus` tracked to determine the
final status of the `CompactionTask`. The id of each of these tasks is the same as that of `CompactionTask`
since otherwise, the workers will be unable to determine the controller task's location for communication
(as they haven't been launched via the overlord).
2024-07-12 16:40:20 +05:30
Sree Charan Manamala 760d70312f
Window Drill tests coverage improvement (#16722)
Window Drill tests coverage improvement
2024-07-11 19:11:36 +05:30
Zoltan Haindrich a9bd0eea2a
Fix queries filtering for the same condition with both an IN and EQUALS to not return empty results (#16597)
temp fix until CALCITE-6435 gets fixed (released&upgraded to)
added a custom rule (FixIncorrectInExpansionTypes) to fix-up types of the affected literals
added a testcase which will alert on upgrade
2024-07-09 12:28:21 +05:30
Alberic Liu c6c2652c89
unified the code format in NestedDataOperatorConversions (#16695) 2024-07-08 10:06:24 +08:00
Akshat Jain 34c80ee3de
Add MSQ engine support for window function drill tests (#16665)
* Add MSQ engine support for window function drill tests

* Address review comments

* Revert formatting changes in TestDataBuilder
2024-06-28 11:14:17 +05:30
Rishabh Singh b9c7664ac3
Fix empty datasource schema on the Broker when metadata query is disabled (#16645)
* Fix build

* Fix empty datasource schema on the broker

* review comment

* Remove unused import
2024-06-28 11:06:56 +05:30
Clint Wylie d4f2636325
fix greatest/least function non-vectorized processing to ignore null argument types (#16649) 2024-06-26 12:59:42 -07:00
Tom 52c9929019
Column name in parse exceptions (#16529)
* first pass

* more changes

* fix tests and formatting

* fix kinesis failing tests

* fix kafka tests

* add dimension name to float parse errors

* double and convertToType handling of dimensionName can report parse errors with dimension name

* fix checkstyle issue

* fix tests

* more cases to have better parse exception messages

* fix test

* fix tests

* partially address comments

* annotate method parameter with nullable

* address comments

* fix tests

* let float, double, long dimensionIndexer pass dimensionName down to dimensionHandlerUtils

* fix compilation error and clean up formatting

* clean up whitespace

* address feedback. undo change, pass down report parse exception for convertToType

* fix test
2024-06-25 13:42:52 -07:00
Clint Wylie 37a50e6803
Remove index_realtime and index_realtime_appenderator tasks (#16602)
index_realtime tasks were removed from the documentation in #13107. Even
at that time, they weren't really documented per se— just mentioned. They
existed solely to support Tranquility, which is an obsolete ingestion
method that predates migration of Druid to ASF and is no longer being
maintained. Tranquility docs were also de-linked from the sidebars and
the other doc pages in #11134. Only a stub remains, so people with
links to the page can see that it's no longer recommended.

index_realtime_appenderator tasks existed in the code base, but were
never documented, nor as far as I am aware were they used for any purpose.

This patch removes both task types completely, as well as removes all
supporting code that was otherwise unused. It also updates the stub
doc for Tranquility to be firmer that it is not compatible. (Previously,
the stub doc said it wasn't recommended, and pointed out that it is
built against an ancient 0.9.2 version of Druid.)

ITUnionQueryTest has been migrated to the new integration tests framework and updated to use Kafka ingestion.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2024-06-24 20:13:33 -07:00
Sree Charan Manamala 990fd5f5fb
Make use group iterator for all window frames & support for same bound kinds (#16603)
Fixes apache/druid#15739
2024-06-24 15:52:41 +02:00
Laksh Singla 00c96432af
Materialize scan results correctly when columns are not present in the segments (#16619)
Fixes a bug causing maxSubqueryBytes not to work when segments have missing columns.
2024-06-23 23:15:45 +05:30
Abhishek Radhakrishnan b20c3dbadf
Fix malformed period throwing `ADMIN` persona error (#16626)
* Turn invalid periods into user-facing exception providing more context.

The current exception is targeting the ADMIN persona. Catch that and turn
it into a USER persona instead. Also, provide more context in the error
message.

* Review comment: pass the wrapping expression and stringify.

* Update processing/src/main/java/org/apache/druid/query/expression/ExprUtils.java

Co-authored-by: Clint Wylie <cjwylie@gmail.com>

---------

Co-authored-by: Clint Wylie <cjwylie@gmail.com>
2024-06-20 08:40:28 -07:00
Sree Charan Manamala 7ac0862287
Grouping Engine fix when a limit spec with different order by columns is applied (#16534) 2024-06-20 11:35:58 +02:00
Laksh Singla da1e293a57
Deserialize dimensions in group by queries to their respective types when reading from their serialized format (#16511)
* init

* tests, pair groupable

* framework change

* tests

* update benchmarks

* comments

* add javadoc for the jsonMapper

* remove extra deserialization

* add special serde for map based result rows

* revert unnecessary change

---------

Co-authored-by: asdf2014 <asdf2014@apache.org>
2024-06-14 16:27:47 +08:00
Zoltan Haindrich ac19b148c2
Upgrade calcite to 1.37.0 (#16504)
* contains Make a full copy of the parser and apply our modifications to it #16503
* some minor api changes pair/entry
* some unnecessary aggregation was removed from a set of queries in `CalciteSubqueryTest`
* `AliasedOperatorConversion` was detecting `CHAR_LENGTH` as not a function ; I've removed the check
  * the field it was using doesn't look maintained that much
  * the `kind` is passed for the created `SqlFunction` so I don't think this check is actually needed
* some decoupled test cases become broken - will be fixed later
* some aggregate related changes: due to the fact that SUM() and COUNT() of no inputs are different
* upgrade avatica to 1.25.0
* `CalciteQueryTest#testExactCountDistinctWithFilter` is now executable

Close apache/druid#16503
2024-06-13 08:47:50 +02:00
Zoltan Haindrich f8645de341
Remove incorrect utf8 conversion of ResultCache keys (#16569) 2024-06-12 13:12:05 -07:00
Clint Wylie fee509df2e
fix NestedDataColumnIndexerV4 to not report cardinality (#16507)
* fix NestedDataColumnIndexerV4 to not report cardinality
changes:
* fix issue similar to #16489 but for NestedDataColumnIndexerV4, which can report STRING type if it only processes a single type of values. this should be less common than the auto indexer problem
* fix some issues with sql benchmarks
2024-06-11 20:58:12 -07:00
zachjsh 3f5f5921e0
Fix sql syntax error user (#16583)
This fixes an issue where in some cases, a SQL syntax error encountered when parsing / planning a query results in an error returned to the user with persona a `admin` when it should instead be `user`.
2024-06-11 18:08:35 -04:00
Clint Wylie 3fb6ba22e8
fix expression column capabilities to not report dictionary encoded unless input is string (#16577) 2024-06-08 13:05:19 -07:00
Gian Merlino 277006446d
Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr. (#16366)
* Fallback vectorization for FunctionExpr and BaseMacroFunctionExpr.

This patch adds FallbackVectorProcessor, a processor that adapts non-vectorizable
operations into vectorizable ones. It is used in FunctionExpr and BaseMacroFunctionExpr.

In addition:

- Identifiers are updated to offer getObjectVector for ARRAY and COMPLEX in addition
  to STRING. ExprEvalObjectVector is updated to offer ARRAY and COMPLEX as well.

- In SQL tests, cannotVectorize now fails tests if an exception is not thrown. This makes
  it easier to identify tests that can now vectorize.

- Fix a null-matcher bug in StringObjectVectorValueMatcher.

* Fix tests.

* Fixes.

* Fix tests.

* Fix test.

* Fix test.
2024-06-05 20:03:02 -07:00
Gian Merlino b837ce565b
Simplify serialized form of JsonInputFormat. (#15691)
* Simplify serialized form of JsonInputFormat.

Use JsonInclude for keepNullColumns, assumeNewlineDelimited, and
useJsonNodeReader. Because the default value of keepNullColumns is
variable, we store the original configured value rather than the
derived value, and include if the original value is nonnull.

* Fix test.
2024-06-05 20:01:14 -07:00
Gian Merlino 1040a29bc5
Fix capabilities reported by UnnestStorageAdapter. (#16551)
UnnestStorageAdapter and its cursors did not return capabilities correctly
for the output column. This patch fixes two problems:

1) UnnestStorageAdapter returned the capabilities of the unnest virtual
   column prior to unnesting. It should return the post-unnest capabilities.

2) UnnestColumnValueSelectorCursor passed through isDictionaryEncoded from
   the unnest virtual column. This is incorrect, because the dimension selector
   created by this class never has a dictionary. This is the cause of #16543.
2024-06-05 15:19:42 -07:00
Akshat Jain 6d7d2ffa63
Add interface method for returning canonical lookup name (#16557)
* Add interface method for returning canonical lookup name

* Address review comment

* Add test in LookupReferencesManagerTest for coverage check

* Add test in LookupSerdeModuleTest for coverage check
2024-06-05 14:33:18 -07:00
Abhishek Radhakrishnan b9ba286423
Fix task bootstrapping & simplify segment load/drop flows (#16475)
* Fix task bootstrap locations.

* Remove dependency of SegmentCacheManager from SegmentLoadDropHandler.

- The load drop handler code talks to the local cache manager via
SegmentManager.

* Clean up unused imports and stuff.

* Test fixes.

* Intellij inspections and test bind.

* Clean up dependencies some more

* Extract test load spec and factory to its own class.

* Cleanup test util

* Pull SegmentForTesting out to TestSegmentUtils.

* Fix up.

* Minor changes to infoDir

* Replace server announcer mock and verify that.

* Add tests.

* Update javadocs.

* Address review comments.

* Separate methods for download and bootstrap load

* Clean up return types and exception handling.

* No callback for loadSegment().

* Minor cleanup

* Pull out the test helpers into its own static class so it can have better state control.

* LocalCacheManager stuff

* Fix build.

* Fix build.

* Address some CI warnings.

* Minor updates to javadocs and test code.

* Address some CodeQL test warnings and checkstyle fix.

* Pass a Consumer<DataSegment> instead of boolean & rename variables.

* Small updates

* Remove one test constructor.

* Remove the other constructor that wasn't initializing fully and update usages.

* Cleanup withInfoDir() builder and unnecessary test hooks.

* Remove mocks and elaborate on comments.

* Commentary

* Fix a few Intellij inspection warnings.

* Suppress corePoolSize intellij-inspect warning.

The intellij-inspect tool doesn't seem to correctly inspect
lambda usages. See ScheduledExecutors.

* Update docs and add more tests.

* Use hamcrest for asserting order on expectation.

* Shutdown bootstrap exec.

* Fix checkstyle
2024-06-04 10:44:46 -07:00
Sree Charan Manamala 6bbf9613f8
Throw soft exception in case of empty signature while building Scan Query (#16502) 2024-05-29 09:41:54 +02:00
Sree Charan Manamala 27cfe12f4a
Enable reordering of window operators (#16482)
This commit aims to enable the re-ordering of window operators in order to optimise
the sort and partition operators.
Example : 
```
SELECT m1, m2,
SUM(m1) OVER(PARTITION BY m2) as sum1,
SUM(m2) OVER() as sum2
from numFoo
GROUP BY m1,m2
```

In order to compute this query, we can order the operators as to first compute the operators
corresponding to sum2 and then place the operators corresponding to sum1 which would
help us in reducing one sort operator if we order our operators by sum1 and then sum2.
2024-05-29 12:17:12 +05:30
Clint Wylie 4e1de50e30
fix issue with auto column grouping (#16489)
* fix issue with auto column grouping
changes:
* fixes bug where AutoTypeColumnIndexer reports incorrect cardinality, allowing it to incorrectly use array grouper algorithm for realtime queries producing incorrect results for strings
* fixes bug where auto LONG and DOUBLE type columns incorrectly report not having null values, resulting in incorrect null handling when grouping

* fix test
2024-05-27 11:18:17 +05:30
zachjsh b0cc1ee84b
Add ability to turn off Druid Catalog specific validation done on catalog defined tables in Druid (#16465)
* * add property to enable / disable catalog validation and add tests

* * add integration tests for catalog validation disabled

* * add integration tests

* * remove debugging logs

* * fix forbidden api call
2024-05-23 13:19:51 -04:00
Zoltan Haindrich 12f79acc7e
Enable quidem shadowing for decoupled testcases (#16431)
* Altered `QueryTestBuilder` to be able to switch to a backing quidem test
* added a small crc to ensure that the shadow testcase does not deviate from the original one
* Packaged all decoupled related things into a a single `DecoupledExtension` to reduce copy-paste
* `DecoupledTestConfig#quidemReason` must describe why its being used
* `DecoupledTestConfig#separateDefaultModeTest` can be used to make multiple case files based on `NullHandling` state
* fixed a cosmetic bug during decoupled join translation
* enhanced `!druidPlan` to report the final logical plan in non-decoupled mode as well
* add check to ensure that only supported params are present in a druidtest uri
* enabled shadow testcases for previously disabled testcases
2024-05-23 07:03:16 +02:00
Gian Merlino 599586bcfc
Add SQL DIV function. (#16464)
* Add SQL DIV function.

This function has been documented for some time, but lacked a binding,
so it wasn't usable.

* Add a case with two expression inputs.
2024-05-17 11:11:32 -07:00
Gian Merlino 0fb09445a5
Fix ExpressionPredicateIndexSupplier numeric replace-with-default behavior. (#16448)
* Fix ExpressionPredicateIndexSupplier numeric replace-with-default behavior.

In replace-with-default mode, null numeric values from the index should be
interpreted as zeroes by expressions. This makes the index supplier more
consistent with the behavior of the selectors created by the expression
virtual column.

* Fix test case.
2024-05-15 15:11:47 +05:30
Akshat Jain ddfd62d9a9
Disable loading lookups by default in CompactionTask (#16420)
This PR updates CompactionTask to not load any lookups by default, unless transformSpec is present.

If transformSpec is present, we will make the decision based on context values, loading all lookups by default. This is done to ensure backward compatibility since transformSpec can reference lookups.
If transform spec is not present and no context value is passed, we donot load any lookup.

This behavior can be overridden by supplying lookupLoadingMode and lookupsToLoad in the task context.
2024-05-15 11:39:23 +05:30
Gian Merlino 72432c2e78
Speed up SQL IN using SCALAR_IN_ARRAY. (#16388)
* Speed up SQL IN using SCALAR_IN_ARRAY.

Main changes:

1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
   the IN is above inFunctionThreshold. The default value of inFunctionThreshold
   is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.

2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
   expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
   value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
   it to Integer.MAX_VALUE.

3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
   greater than inFunctionThreshold.

* Revert test.

* Additional coverage.

* Update docs/querying/sql-query-context.md

Co-authored-by: Benedict Jin <asdf2014@apache.org>

* New test.

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-05-14 08:09:27 -07:00
Sree Charan Manamala b8dd7478d0
Custom Calcite Rule to remove redundant references (#16402)
Custom calcite rule mimicking AggregateProjectMergeRule to extend support to expressions.
The current calcite rule return null in such cases.
In addition, this removes the redundant references.
2024-05-14 06:38:05 +02:00
Laksh Singla 4bfc186153
Support sorting on complex columns in MSQ (#16322)
MSQ sorts the columns in a highly specialized manner by byte comparisons. As such the values are serialized differently. This works well for the primitive types and primitive arrays, however complex types cannot be serialized specially.

This PR adds the support for sorting the complex columns by deserializing the value from the field and comparing it via the type strategy. This is a lot slower than the byte comparisons, however, it's the only way to support sorting on complex columns that can have arbitrary serialization not optimized for MSQ.

The primitives and the arrays are still compared via the byte comparison, therefore this doesn't affect the performance of the queries supported before the patch. If there's a sorting key with mixed complex and primitive/primitive array types, for example: longCol1 ASC, longCol2 ASC, complexCol1 DESC, complexCol2 DESC, stringCol1 DESC, longCol3 DESC, longCol4 ASC, the comparison will happen like:

    longCol1, longCol2 (ASC) - Compared together via byte-comparison, since both are byte comparable and need to be sorted in ascending order
    complexCol1 (DESC) - Compared via deserialization, cannot be clubbed with any other field
    complexCol2 (DESC) - Compared via deserialization, cannot be clubbed with any other field, even though the prior field was a complex column with the same order
    stringCol1, longCol3 (DESC) - Compared together via byte-comparison, since both are byte comparable and need to be sorted in descending order
    longCol4 (ASC) - Compared via byte-comparison, couldn't be coalesced with the previous fields as the direction was different

This way, we only deserialize the field wherever required
2024-05-13 15:07:05 +05:30
Zoltan Haindrich 1811674753
Enable quidem tests to use different suppliers (#16382)
* enable quidem uri support for `druidtest:///?ComponentSupplier=Nested` and similar
* changes the way `SqlTestFrameworkConfig` is being applied; all options will have their own annotation (its kinda impossible to detect that an annotation has a set value or its the default)
* enables hierarchical processing of config annotation (was needed to enable class level supplier annotation)
* moves uri processing related string2config stuff into `SqlTestFrameworkConfig`
2024-05-09 09:21:02 +02:00
Akshat Jain 775d654a6c
Load only the required lookups for MSQ tasks (#16358)
With this PR changes, MSQ tasks (MSQControllerTask and MSQWorkerTask) only load the required lookups during querying and ingestion, based on the value of CTX_LOOKUPS_TO_LOAD key in the query context.
2024-05-09 11:21:54 +05:30
Misha b5958b6b07
Feature configurable calcite bloat (#16248)
* Configurable bloat for calcite ProjectMergeRule implemented

* Comment added

* Default bloat value increased to 1000

* Implemented bloat configuration from QueryContext

* Code refactored, docs updated

---------

Co-authored-by: sviatahorau <mikhail.sviatahorau@deep.bi>
2024-05-06 20:43:39 +05:30
Gian Merlino 588d442422
Add native filter conversion for SCALAR_IN_ARRAY. (#16312)
* Add native filter conversion for SCALAR_IN_ARRAY.

Main changes:

1) Add an implementation of "toDruidFilter" in ScalarInArrayOperatorConversion.

2) Split up Expressions.literalToDruidExpression into two functions, so the first
   half (literalToExprEval) can be used by ScalarInArrayOperatorConversion to more
   efficiently create the list of match values.

* Fix type in time arithmetic conversion.

* Test updates.

* Update test cases to use null instead of '' in default-value mode.

* Switch test from msqIncompatible to compatible with a different result.

* Update one more test.

* Fix test.

* Update tests.

* Use ExprEvalWrapper to differentiate between empty string and null.

* Fix tests some more.

* Fix test.

* Additional comment.

* Style adjustment.

* Fix tests.

* trueValue -> actualValue.

* Use different approach, DruidLiteral instead of ExprEvalWrapper.

* Revert changes in ArrayOfDoublesSketchSqlAggregatorTest.
2024-05-03 13:00:33 -07:00
zachjsh fb7c84fb5d
Catalog clustering keys fixes (#16351)
* * add another catalog clustering columns unit test

* * dissallow clusterKeys with descending order

* * make more clear that clustering is re-written into ingest node
whether a catalog table or not

* * when partitionedBy is stored in catalog, user shouldnt need to specify
it in order to specify clustering

* * fix intellij inspection failure
2024-05-03 14:02:56 -04:00
Zoltan Haindrich 2d0e86cbdc
Use quidem to run tests (#16249)
* test scoped jdbc driver for druidtest:/// backed DruidAvaticaTestDriver
** DecoupledTestConfig is used inside the URI - this will make it possible to attach to existing things more easily
* DruidQuidemTestBase can be used to create module level set of quidem tests
* added quidem commands: !convertedPlan, !logicalPlan, !druidPlan, !nativePlan
** for these I've used some values of the Hook which was there in calcite
* there are some shortcuts with proxies(they are only used during testing) - we can probably remove those later
2024-05-02 02:12:42 -04:00
Laksh Singla e695e52d3f
Improve code flow in the First/Last vector aggregators and unify the numeric aggregators with the String implementations (#16230)
This PR fixes the first and last vector aggregators and improves their readability. Following changes are introduced

    The folding is broken in the vectorized versions. We consider time before checking the folded object.

    If the numerical aggregator gets passed any other object type for some other reason (like String), then the aggregator considers it to be folded, even though it shouldn’t be. We should convert these objects to the desired type, and aggregate them properly.

    The aggregators must properly use generics. This would minimize the ClassCastException issues that can happen with mixed segment types. We are unifying the string first/last aggregators with numeric versions as well.

    The aggregators must aggregate null values (https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java#L55-L56 ). The aggregator should only ignore pairs with time == null, and not value == null

    Time nullity is ignored when trying to vectorize the data.

    String versions initialized with DateTimes.MIN that is equal to Long.MIN / 2. This can cause incorrect results in case the user enters a custom time column. NOTE: This is still present because it would require a larger refactor in all of the versions.

    There is a difference in what users might expect from the results because the code flow is changed (for example, the direction of the for loops, etc), however, this will only change the results, and not the contract set by first/last aggregators, which is that if multiple values have the same timestamp, then any of them can get picked.

    If the column is non-existent, the users might expect a change in the timestamp from DateTime.MAX to Long.MAX, because the code incorrectly used DateTime.MAX to initialize the aggregator, however, in case of a custom timestamp column, this might not be the case. The SQL query might be prohibited from using any Long since it requires a cast to the timestamp function that can fail, but AFAICT native queries don't have such limitations.
2024-04-30 15:13:14 +05:30
Laksh Singla 26d63e7b65
Prevent joining on nested arrays and complex types (#16349)
#16068 modified DimensionHandlerUtils to accept complex types to be dimensions. This had an unintended side effect of allowing complex types to be joined upon (which wasn't guarded explicitly, it doesn't work).
This PR modifies the IndexedTable to reject building the index on the complex types to prevent joining on complex types. The PR adds back the check in the same place, explicitly.
2024-04-30 11:36:53 +05:30
Akshat Jain 9d2cae40c3
Add support for selective loading of lookups in the task layer (#16328)
Changes:
- Add `LookupLoadingSpec` to support 3 modes of lookup loading: ALL, NONE, ONLY_REQUIRED
- Add method `Task.getLookupLoadingSpec()`
- Do not load any lookups for `KillUnusedSegmentsTask`
2024-04-29 07:19:59 +05:30
zachjsh 365cd7e8e7
INSERT/REPLACE can omit clustering when catalog has default (#16260)
* * fix

* * fix

* * address review comments

* * fix

* * simplify tests

* * fix complex type nullability issue

* * implement and add tests

* * address review comments

* * address test review comments

* * fix checkstyle

* * fix dependencies

* * all tests passing

* * cleanup

* * remove unneeded code

* * remove unused dependency

* * fix checkstyle
2024-04-26 10:19:45 -04:00
Adarsh Sanjeev 9a2d7c28bc
Prepare master branch for 31.0.0 release (#16333) 2024-04-26 09:22:43 +05:30
Gian Merlino 68d6e682e8
Fix TimeBoundary planning when filters require virtual columns. (#16337)
The timeBoundary query does not support virtual columns, so we should
avoid it if the query requires virtual columns.
2024-04-25 16:49:40 -07:00
Zoltan Haindrich 9c0bd56f5b
Make QueryComponentSupliers independent from test classes (#16275) 2024-04-25 02:12:07 -04:00
Laksh Singla 6bca406d31
Grouping on complex columns aka unifying GroupBy strategies (#16068)
Users can pass complex types as dimensions to the group by queries. For example:

SELECT nested_col1, count(*) FROM foo GROUP BY nested_col1
2024-04-24 23:00:14 +05:30
Rishabh Singh e30790e013
Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building (#15817)
Issue: #14989

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.

This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
2024-04-24 22:22:53 +05:30
Sree Charan Manamala 080476f9ea
WINDOWING - Fix 2 nodes with same digest causing mapping issue (#16301)
Fixes the mapping issue in window fucntions where 2 nodes get the same reference.
2024-04-24 16:45:02 +05:30
Laksh Singla b9bbde5c0a
Fix deadlock that can occur while merging group by results (#15420)
This PR prevents such a deadlock from happening by acquiring the merge buffers in a single place and passing it down to the runner that might need it.
2024-04-22 14:10:44 +05:30
Sree Charan Manamala ad5701e891
new SCALAR_IN_ARRAY function analogous to DRUID_IN (#16306)
* scalar_in function

* api doc

* refactor
2024-04-18 21:15:15 -07:00
Sree Charan Manamala 960a674442
Corrected Strict NON NULL return type checks (#16279) 2024-04-18 12:17:13 +02:00
Gian Merlino ccc1ffb032
Additional short circuiting knowledge in filter bundles. (#16292)
* Additional short circuiting knowledge in filter bundles.

Three updates:

1) The parameter "selectionRowCount" on "makeFilterBundle" is renamed
   "applyRowCount", and redefined as an upper bound on rows remaining
   after short-circuiting (rather than number of rows selected so far).
   This definition works better for OR filters, which pass through the
   FALSE set rather than the TRUE set to the next subfilter.

2) AndFilter uses min(applyRowCount, indexIntersectionSize) rather
   than using selectionRowCount for the first subfilter and indexIntersectionSize
   for each filter thereafter. This improves accuracy when the incoming
   applyRowCount is smaller than the row count from the first few indexes.

3) OrFilter uses min(applyRowCount, totalRowCount - indexUnionSize) rather
   than applyRowCount for subfilters. This allows an OR filter to pass
   information about short-circuiting to its subfilters.

To help write tests for this, the patch also moves the sampled
wikiticker data file from sql to processing.

* Forbidden APIs.

* Forbidden APIs.

* Better comments.

* Fix inspection.

* Adjustments to tests.
2024-04-16 22:42:28 -07:00
zachjsh a5428e75ff
INSERT/REPLACE complex target column types are validated against source input expressions (#16223)
* * fix

* * fix

* * address review comments

* * fix

* * simplify tests

* * fix complex type nullability issue

* * address review comments

* * address test review comments

* * fix checkstyle
2024-04-16 17:20:35 -04:00
Sree Charan Manamala 5247059d2f
Allow Double & null values in sql type array through dynamic params (#16274) 2024-04-15 10:44:42 +02:00
Adarsh Sanjeev 3df00aef9d
Add manifest file for MSQ export (#15953)
Currently, export creates the files at the provided destination. The addition of the manifest file will provide a list of files created as part of the manifest. This will allow easier consumption of the data exported from Druid, especially for automated data pipelines
2024-04-15 11:37:31 +05:30
Sree Charan Manamala 3340b200db
Fix window function drill tests failures falling under RESULT_MISMATCH & RESULT_COUNT_MISMATCH (#16264)
* Updated the drill test expected results which are failing due to druid's default sorting algorithm taking nulls first approach.
* Corrected the queries where date time values are directly provided
* marked 2 cases failing with resultset casting issues
2024-04-12 13:54:48 +02:00
Sree Charan Manamala f65c166327
Windowed aggregates should update the aggregation value based on final compute (#16244) 2024-04-12 08:28:33 +02:00
Gian Merlino 9f358f5f4a
SQL tests: avoid mixing skip and cannot vectorize. (#16251)
* SQL tests: avoid mixing skip and cannot vectorize.

skipVectorize switches off vectorization tests completely, and
cannotVectorize turns vectorization tests into negative tests. It doesn't
make sense to use them together, so this patch makes it an error to do so,
and cleans up cases where both are mentioned.

This patch also has the effect of changing various tests from skipVectorize
to cannotVectorize, because in the past when both were mentioned,
skipVectorize would take priority.

* Fix bug with StringAnyAggregatorFactory attempting to vectorize when it cannt.

* Fix tests.
2024-04-11 15:06:11 -07:00
Soumyava 7759f25095
Moving bitwise_or to use native calcite operator (#16237) 2024-04-04 12:49:29 -07:00
Soumyava 972937659d
Fixing return type for IPV4 (#15916)
* Fixing return type for IPV4

* Update ipv4match
2024-04-04 08:49:50 -07:00
Soumyava 4bea865697
Restore context flag for window functions (#16229) 2024-04-03 13:57:13 +05:30
zachjsh 9b52c909e0
fix complex types returning UNKNOWN as their SQL type inference (#16216)
* * fix

* * fix

* * address review comments
2024-04-02 14:36:01 -04:00
Aleksey Plekhanov a818b8acb6
Fix CalciteQueryTest#testCountStarWithTimeFilterUsingStringLiterals (#16221)
* Add cases to check handling equals between timestamp and string literal
2024-04-01 13:46:24 -07:00
Soumyava 524842a3bb
Window function on msq (#15470)
This PR aims to introduce Window functions on MSQ by doing the following:

    Introduce a Window querykit for handling window queries along with its factory and a processor for window queries
    If a window operator is present with a partition by clause, pushes the partition as a shuffle spec of the previous stage
    In presence of empty OVER() clause lets all operators loose on a single rac
    In presence of no empty OVER() clause, breaks down each window into individual stages
    Associated machinery to handle window functions in MSQ
    Introduced a separate hidden engine feature WINDOW_LEAF_OPERATOR which is set only for MSQ engine. In presence of this feature, the planner plans without the leaf operators by creating a window query over an inner scan query. In case of native this is set to false and the planner generates the leafOperators
    Guardrails around materialization
    Comprehensive UTs
2024-03-28 14:58:34 +05:30
Sree Charan Manamala f29c8ac368
Allow non literal rhs in MV_FILTER_ONLY and MV_FILTER_NONE (#16113)
This commit allows to use the MV_FILTER_ONLY & MV_FILTER_NONE functions
with a non literal argument. 
Currently `select mv_filter_only('mvd_dim', 'array_dim') from 'table'`
returns a `Unhandled Query Planning Failure`

This is being tackled and also considered for the cases where the `array_dim`
having null & empty values.

Changed classes:
 * `MultiValueStringOperatorConversions`
 * `ApplyFunction`
 * `CalciteMultiValueStringQueryTest`
2024-03-26 12:31:09 +05:30
Zoltan Haindrich a16092b16a
Rewrite exotic LAST_VALUE/FIRST_VALUE to self-reference. (#16063)
* Rewrite exotic LAST_VALUE/FIRST_VALUE to self-reference.

* rewrite `LAST_VALUE(x) OVER (ORDER BY y)` to `LAG(x,0) OVER (ORDER BY y)`
* not directly to `x` because some queries get unplannable that way
* restrict `NTILE` from framing - as its not supported
* add test to ensure that all of the `KNOWN_WINDOW_FNS`'s framing is accounted for

* checkstyle/etc

* add test

* apidoc

* add assume to avoid MSQ fail
2024-03-25 11:03:47 -07:00
zachjsh 8370db106c
INSERT/REPLACE dimension target column types are validated against source input expressions (#15962)
* * address remaining comments from https://github.com/apache/druid/pull/15836

* *  address remaining comments from https://github.com/apache/druid/pull/15908

* * add test that exposes relational algebra issue

* * simplify test exposing issue

* * fix

* * add tests for sealed / non-sealed

* * update test descriptions

* * fix test failure when -Ddruid.generic.useDefaultValueForNull=true

* * check type assignment based on natice Druid types

* * add tests that cover missing jacoco coverage

* * add replace tests

* * add more tests and comments about column ordering

* * simplify tests

* * review comments

* * remove commented line

* * STRING family types should be validated as non-null
2024-03-25 12:34:07 -04:00
Clint Wylie b0a9c318d6
add new typed in filter (#16039)
changes:
* adds TypedInFilter which preserves matching sets in the native match value type
* SQL planner uses new TypedInFilter when druid.generic.useDefaultValueForNull=false (the default)
2024-03-22 12:45:08 -07:00
Clint Wylie 48b8d42698
fix regexp_like, contains_string, icontains_string to return null instead of false for null inputs in sql compatible mode (#15963) 2024-03-19 22:12:47 -07:00
Gian Merlino c96b215dd6
SortMerge join support for IS NOT DISTINCT FROM. (#16003)
* SortMerge join support for IS NOT DISTINCT FROM.

The patch adds a "requiredNonNullKeyParts" field to the sortMerge
processor, which has the list of key parts that must be nonnull for
an equijoin condition to match. Conditions with SQL "=" are present in
the list; conditions with SQL "IS NOT DISTINCT FROM" are absent from
the list.

* Fix test.

* Update javadoc.
2024-03-19 12:02:13 -07:00
Zoltan Haindrich 0a42342cef
Update Calcite*Test to use junit5 (#16106)
* Update Calcite*Test to use junit5

* change the way temp dirs are handled
* add openrewrite workflow to safeguard upgrade
* replace junitparamrunner with standard junit5 parametered tests
* update a few rules to junit5 api
* lots of boring changes

* cleanup QueryLogHook

* cleanup

* fix compile error: ARRAYS_DATASOURCE

* fix test

* remove enclosed

* empty

+TEST:TDigestSketchSqlAggregatorTest,HllSketchSqlAggregatorTest,DoublesSketchSqlAggregatorTest,ThetaSketchSqlAggregatorTest,ArrayOfDoublesSketchSqlAggregatorTest,BloomFilterSqlAggregatorTest,BloomDimFilterSqlTest,CatalogIngestionTest,CatalogQueryTest,FixedBucketsHistogramQuantileSqlAggregatorTest,QuantileSqlAggregatorTest,MSQArraysTest,MSQDataSketchesTest,MSQExportTest,MSQFaultsTest,MSQInsertTest,MSQLoadedSegmentTests,MSQParseExceptionsTest,MSQReplaceTest,MSQSelectTest,InsertLockPreemptedFaultTest,MSQWarningsTest,SqlMSQStatementResourcePostTest,SqlStatementResourceTest,CalciteSelectJoinQueryMSQTest,CalciteSelectQueryMSQTest,CalciteUnionQueryMSQTest,MSQTestBase,VarianceSqlAggregatorTest,SleepSqlTest,SqlRowTransformerTest,DruidAvaticaHandlerTest,DruidStatementTest,BaseCalciteQueryTest,CalciteArraysQueryTest,CalciteCorrelatedQueryTest,CalciteExplainQueryTest,CalciteExportTest,CalciteIngestionDmlTest,CalciteInsertDmlTest,CalciteJoinQueryTest,CalciteLookupFunctionQueryTest,CalciteMultiValueStringQueryTest,CalciteNestedDataQueryTest,CalciteParameterQueryTest,CalciteQueryTest,CalciteReplaceDmlTest,CalciteScanSignatureTest,CalciteSelectQueryTest,CalciteSimpleQueryTest,CalciteSubqueryTest,CalciteSysQueryTest,CalciteTableAppendTest,CalciteTimeBoundaryQueryTest,CalciteUnionQueryTest,CalciteWindowQueryTest,DecoupledPlanningCalciteJoinQueryTest,DecoupledPlanningCalciteQueryTest,DecoupledPlanningCalciteUnionQueryTest,DrillWindowQueryTest,DruidPlannerResourceAnalyzeTest,IngestTableFunctionTest,QueryTestRunner,SqlTestFrameworkConfig,SqlAggregationModuleTest,ExpressionsTest,GreatestExpressionTest,IPv4AddressMatchExpressionTest,IPv4AddressParseExpressionTest,IPv4AddressStringifyExpressionTest,LeastExpressionTest,TimeFormatOperatorConversionTest,CombineAndSimplifyBoundsTest,FiltrationTest,SqlQueryTest,CalcitePlannerModuleTest,CalcitesTest,DruidCalciteSchemaModuleTest,DruidSchemaNoDataInitTest,InformationSchemaTest,NamedDruidSchemaTest,NamedLookupSchemaTest,NamedSystemSchemaTest,RootSchemaProviderTest,SystemSchemaTest,CalciteTestBase,SqlResourceTest

* use @Nested

* add rule to remove enclosed; upgrade surefire

* remove enclosed

* cleanup

* add comment about surefire exclude
2024-03-19 04:05:12 -07:00
Clint Wylie 5afd5c41a5
fix ColumnType to RelDataType conversion for nested arrays (#16138)
* fix ColumnType to RelDataType conversion for nested arrays

* fix test
2024-03-18 23:34:08 -07:00
Zoltan Haindrich d3e22c6e92
fix compile error: ARRAYS_DATASOURCE (#16120) 2024-03-14 18:15:43 +05:30
Clint Wylie dd9bc3749a
fix issues with array_contains and array_overlap with null left side arguments (#15974)
changes:
* fix issues with array_contains and array_overlap with null left side arguments
* modify singleThreaded stuff to allow optimizing Function similar to how we do for ExprMacro - removed SingleThreadSpecializable in favor of default impl of asSingleThreaded on Expr with clear javadocs that most callers shouldn't be calling it directly and should be using Expr.singleThreaded static method which uses a shuttle and delegates to asSingleThreaded instead
* add optimized 'singleThreaded' versions of array_contains and array_overlap
* add mv_harmonize_nulls native expression to use with MV_CONTAINS and MV_OVERLAP to allow them to behave consistently with filter rewrites, coercing null and [] into [null]
* fix bug with casting rhs argument for native array_contains and array_overlap expressions
2024-03-13 18:16:10 -07:00
Sree Charan Manamala e9d2caccb6
Handling null operand in JSON_QUERY_ARRAY (#16118)
* fix return type inference for JSON_QUERY_ARRAY to be nullable
2024-03-13 18:06:27 -07:00
Gian Merlino 256160aba6
MSQ: Validate that strings and string arrays are not mixed. (#15920)
* MSQ: Validate that strings and string arrays are not mixed.

When multi-value strings and string arrays coexist in the same column,
it causes problems with "classic MVD" style queries such as:

  select * from wikipedia -- fails at runtime
  select count(*) from wikipedia where flags = 'B' -- fails at planning time
  select flags, count(*) from wikipedia group by 1 -- fails at runtime

To avoid these problems, this patch adds type verification for INSERT
and REPLACE. It is targeted: the only type changes that are blocked are
string-to-array and array-to-string. There is also a way to exclude
certain columns from the type checks, if the user really knows what
they're doing.

* Fixes.

* Tests and docs and error messages.

* More docs.

* Adjustments.

* Adjust message.

* Fix tests.

* Fix test in DV mode.
2024-03-13 15:37:27 -07:00
Gian Merlino 910124d4de
MSQ: Plan without implicit sorting. (#16073)
* MSQ: Plan without implicit sorting.

This patch adds an EngineFeature "GROUPBY_IMPLICITLY_SORTS" and sets
it true for native, false for MSQ. It's useful for two reasons:

1) In the future we'll likely want MSQ to hash-partition for GROUP BY
   instead of using a global sort, which would mean MSQ would not
   implicitly ORDER BY when there is a GROUP BY.

2) When doing REPLACE with MSQ, CLUSTERED BY is transformed to ORDER BY.
   We should retain that ORDER BY, as it may be a subset of the GROUP BY,
   and it is important to remember which fields the user wanted to include in
   range shard specs.

* Fix tests.

* Fix tests for real.

* Fix test.
2024-03-13 08:27:39 -07:00
Clint Wylie 795e342ba8
fix sql results mixed array and scalar values (#16105)
* fix sql results mixed array and scalar values

* simplify
2024-03-12 23:47:35 -07:00
Zoltan Haindrich 8252d72e2a
Pull up literals in InputAccessor (#16033)
* Pull up literals in InputAccessor

* pull up literals in `InputAccessor`
* remove the need to pass `constants` of `Window`  operator

Fixes #15353

* update test

* enable relax_nulls
2024-03-12 09:14:31 -07:00
Sree Charan Manamala ef9637eef1
Handling array with boolean literals (#16093)
Handling array with boolean literals like ARRAY[true, false]

Druid appears to be able to convert an array with boolean expressions like this array[added=deleted, added=delta] into a numeric array of 0 and 1: select array[added=deleted, added=delta] from wikipedia

However, select array[true, false] from wikipedia doesn't work.
This PR fixes this.
2024-03-12 12:28:16 +05:30
Soumyava 85ee775390
Handling latest_by and earliest_by on numeric columns correctly (#15939)
* Handling latest_by and earliest_by on numeric columns correctly

* Adding test
2024-03-11 13:49:21 -07:00
Zoltan Haindrich 2eb7d7a89b
Calcite tests remove expected exception (#16046)
* Calcite tests remove expected exception

* update testcases using `expectedException` to utilize `assertThrows` instead
* remove `BaseCalciteQueryTest#expectedException`
* fixes `cannotVectorize` so it doesn't anymore stops further processing
* `msqIncompatible` is not anymore toggles a boolean - its an `Assume` instead

Fixes #15423

* cleanup

* move msqIncompat

* update test

* cleanup

* remove comment

* empty-commit

* empty-commit
2024-03-11 13:23:57 +05:30
Zoltan Haindrich aaa64832fd
Disable DecoupledPlanningCalciteJoinQueryTest until it gets fixed (#16070)
Recently this test started other tests from executing by triggering a bug somewhere in surefire.
This patch disables the testcases in case of non-sql compat mode.
2024-03-07 12:55:48 -08:00
Laksh Singla 5f588fa45c
Fix bug while materializing scan's result to frames (#15987)
While converting Sequence<ScanResultValue> to Sequence<Frames>, when maxSubqueryBytes is enabled, we batch the results to prevent creating a single frame per ScanResultValue. Batching requires peeking into the actual value, and checking if the row signature of the scan result’s value matches that of the previous value.

Since we can do this indefinitely (in the worst case all of them have the same signature), we keep fetching them and accumulating them in a list (on the heap). We don’t really know how much to batch before we actually write the value as frames.

The PR modifies the batching logic to not accumulate the results in an intermediary list
2024-03-07 17:11:44 +05:30
Vishesh Garg cf9bc507f6
Fix compilation failure due to missing constant MISSING_JOIN_CONVERSION (#16050)
* Reintroduce variable MISSING_JOIN_CONVERSION

* Remove redundant constant MISSING_JOIN_CONVERSION2

* Correct fix to address failing tests
2024-03-06 15:34:39 +08:00
Zoltan Haindrich 65c3b4d31a
Support join in decoupled mode (#15957)
* plan join(s) in decoupled mode
* configure DecoupledPlanningCalciteJoinQueryTest
        the test has 593 cases; however there are quite a few parameterized
        from the 107 methods annotated with @Test - 42 is not yet working
 * replace the isRoot hack in DruidQueryGenerator with a logic that instead looks ahead for the next node; and doesn't let the previous node do the Project - this makes it plan more likely than the existing planner
2024-03-05 19:10:13 -06:00
Zoltan Haindrich bb882727c0
Fix Windowing/scanAndSort query issues on top of Joins. (#15996)
allow a hashjoin result to be converted to RowsAndColumns
added StorageAdapterRowsAndColumns
fix incorrect isConcrete() return values during early phase of planning
2024-03-05 15:05:31 +05:30
Zoltan Haindrich e469b7ed34
Make setting QUERY_CONTEXT_DEFAULT explicit in tests (#16010) 2024-03-05 10:54:16 +05:30
Adarsh Sanjeev 93eeb05eaf
Revert explain attributes change to old behaviour. (#16004)
* Revert explain attributes change

* Fix tests

* Fix tests

* Rename function
2024-03-04 15:56:02 +05:30
Zoltan Haindrich bf0995f846
Introduce dynamic table append (#15897) 2024-03-01 04:31:57 -05:00
Laksh Singla 17e4f3ac60
Refactor GroupBy and TopN code to relax the constraint of dimensions being comparable (#15559)
The code in the groupBy engine and the topN engine assume that the dimensions are comparable and can call dimA.compareTo(dimB) to sort the dimensions and group them together.
This works well for the primitive dimensions, because they are Comparable, however falls apart when the dimensions can be arrays (or in future scenarios complex columns). In cases when the dimensions are not comparable, Druid resorts to having a wrapper type ComparableStringArray and ComparableList, which is a Comparable, based on the list comparator.
2024-02-27 11:39:29 +05:30
Soumyava 51cc729fd1
Enforcing type checking for flatten concat (#15903) 2024-02-26 21:53:49 -08:00
Abhishek Radhakrishnan 67a6224d91
Fix up incorrect `PARTITIONED BY` error messages (#15961)
* Fix up typos, inaccuracies and clean up code related to PARTITIONED BY.

* Remove wrapper function and update tests to use DruidExceptionMatcher.

* Checkstyle and Intellij inspection fixes.
2024-02-26 14:17:53 -05:00
Zoltan Haindrich 06deda9415
ScanAndSort query fails with NPE for simple queries (#15914)
* some stuff

* add dummy fields

* draft-fix

* rename test

* cleanup

* add null

* cleanup

* cleanup

* add test

* updates

* move check tp constructore

* cleanup

* updates/etc

* fix some more

* add rowSignatureMode

* checkstyle/etc

* override

* missing msqIncompat

* fix test

* fixes

* undo

* updates

* remove param
2024-02-24 15:33:50 -08:00
zachjsh 8ebf237576
Move INSERT & REPLACE validation to the Calcite validator (#15908)
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, Refactoring the IngestHandler and subclasses to produce a validated SqlInsert instance node instead of the previous Insert source node. The SqlInsert node is then validated in the calcite validator. The validation that is implemented as part of this pr, is only that for the source node, and some of the validation that was previously done in the ingest handlers. As part of this change, the partitionedBy clause can be supplied by the table catalog metadata if it exists, and can be omitted from the ingest time query in this case.
2024-02-22 14:01:59 -05:00
Zoltan Haindrich bcce0806d7
Support Union in decoupled mode (#15870) 2024-02-21 10:54:50 -05:00
Gian Merlino 9c41827dba
Globally disable AUTO_CLOSE_JSON_CONTENT. (#15880)
* Globally disable AUTO_CLOSE_JSON_CONTENT.

This JsonGenerator feature is on by default. It causes problems with code
like this:

  try (JsonGenerator jg = ...) {
    jg.writeStartArray();
    for (x : xs) {
      jg.writeObject(x);
    }
    jg.writeEndArray();
  }

If a jg.writeObject call fails due to some problem with the data it's
reading, the JsonGenerator will write the end array marker automatically
when closed as part of the try-with-resources. If the generator is writing
to a stream where the reader does not have some other mechanism to realize
that an exception was thrown, this leads the reader to believe that the
array is complete when it actually isn't.

Prior to this patch, we disabled AUTO_CLOSE_JSON_CONTENT for JSON-wrapped
SQL result formats in #11685, which fixed an issue where such results
could be erroneously interpreted as complete. This patch fixes a similar
issue with task reports, and all similar issues that may exist elsewhere,
by disabling the feature globally.

* Update test.
2024-02-16 08:52:48 -08:00
Clint Wylie fe2ba8cc28
fix return type inference of parse_long, which can also be null if string is not parseable into a long (#15909)
* fix return type inference of parse_long, which can also be null if string is not parseable into a long

* fix msq test
2024-02-15 08:45:34 -08:00
zachjsh f9ee2c353b
Extend the PARTITION BY clause to accept string literals for the time partitioning (#15836)
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning
2024-02-09 11:45:38 -05:00
Sree Charan Manamala 57e12df352
Sql Single Value Aggregator for scalar queries (#15700)
Executing single value correlated queries will throw an exception today since single_value function is not available in druid.
With these added classes, this provides druid, the capability to plan and run such queries.
2024-02-08 19:20:30 +05:30
Soumyava f3996b96ff
Fixes for safe_divide with vectorize and datatypes (#15839)
* Fix for save_divide with vectorize

* More fixes

* Update to use expr.eval(null) for both cases when denominator is 0
2024-02-08 14:40:42 +05:30
Adarsh Sanjeev 514b3b4d01
Add export capabilities to MSQ with SQL syntax (#15689)
* Add test

* Parser changes to support export statements

* Fix builds

* Address comments

* Add frame processor

* Address review comments

* Fix builds

* Update syntax

* Webconsole workaround

* Refactor

* Refactor

* Change export file path

* Update docs

* Remove webconsole changes

* Fix spelling mistake

* Parser changes, add tests

* Parser changes, resolve build warnings

* Fix failing test

* Fix failing test

* Fix IT tests

* Add tests

* Cleanup

* Fix unparse

* Fix forbidden API

* Update docs

* Update docs

* Address review comments

* Address review comments

* Fix tests

* Address review comments

* Fix insert unparse

* Add external write resource action

* Fix tests

* Add resource check to overlord resource

* Fix tests

* Add IT

* Update syntax

* Update tests

* Update permission

* Address review comments

* Address review comments

* Address review comments

* Add tests

* Add check for runtime parameter for bucket and path

* Add check for runtime parameter for bucket and path

* Add tests

* Update docs

* Fix NPE

* Update docs, remove deadcode

* Fix formatting
2024-02-07 22:08:50 +05:30
Clint Wylie 23d4fade90
use NullFilter for SQL rewrite of MV_CONTAINS and MV_OVERLAP for null array elements (#15855)
Fixes an oversight after #14542 that happens in the SQL planner rewrite of MV_CONTAINS and MV_OVERLAP when faced with array elements that are NULL, where we were incorrectly using EqualityFilter instead of NullFilter for null elements (EqualityFilter does not accept null elements).
2024-02-07 19:40:41 +05:30
Zoltan Haindrich fdc7cec271
Support Window operators in decoupled planning (#15815) 2024-02-07 04:09:48 -05:00
Gian Merlino 54b30646f3
Add sqlReverseLookupThreshold for ReverseLookupRule. (#15832)
If lots of keys map to the same value, reversing a LOOKUP call can slow
things down unacceptably. To protect against this, this patch introduces
a parameter sqlReverseLookupThreshold representing the maximum size of an
IN filter that will be created as part of lookup reversal.

If inSubQueryThreshold is set to a smaller value than
sqlReverseLookupThreshold, then inSubQueryThreshold will be used instead.
This allows users to use that single parameter to control IN sizes if they
wish.
2024-02-06 16:32:05 +05:30
Soumyava b86f31f2c0
Addressing shapeshifting issues with window functions (#15807)
Addressing shapeshifting issues with window functions
2024-02-06 11:12:20 +05:30
Zoltan Haindrich 392d585ff8
Identify not range filters without negating subexpressions (#15766)
* Identify not range filters without negating subexpressions

Earlier betweenish (range/bounds) filters were identified thru
a process of negating the subexpressions which may have not performed that well.
(it could have dominated the runtime in some cases)
This patch makes that unnecessary as its able to create the negate expression directly.

* add test;fix for multiple intervals
2024-02-05 19:12:58 -08:00
Zoltan Haindrich 8f5b7522c7
Strict window frame checks (#15746)
introduce checks to ensure that window frame is supported
added check to ensure that no expressions are set as bounds
added logic to detect following/following like cases - described in Window function fails to demarcate if 2 following are used #15739
currently RANGE frames are only supported correctly if both endpoints are unbounded or current row Offset based window range support #15767
added windowingStrictValidation context key to provide a way to override the check
2024-02-02 16:21:53 +05:30
Laksh Singla 7d65caf0c5
Update the docs for EARLIEST_BY/LATEST_BY aggregators with the newly added numeric capabilities (#15670) 2024-02-01 10:24:43 +05:30
Zoltan Haindrich f701197224
Enable ArrayListRowsAndColumns to StorageAdapter conversion (#15735) 2024-01-31 02:36:58 -05:00
Gian Merlino 38a1e827ab
Fix up value types when creating range filters. (#15778)
Fixes a bug introduced in #15609, where queries involving filters on
TIME_FLOOR could encounter ClassCastException when comparing RangeValue
in CombineAndSimplifyBounds.

Prior to #15609, CombineAndSimplifyBounds would remove, rebuild, and
re-add all numeric range filters as part of consolidating numeric range
filters for the same column under the least restrictive type. #15609
included a change to only rebuild numeric range filters when a consolidation
opportunity actually arises. The bug was introduced because the unconditional
rebuild, as a side effect, masked the fact that in some cases range filters
would be created with string match values and a LONG match value type.

This patch changes the fixup to happen at the time the range filter is
initially created, rather than in CombineAndSimplifyBounds.
2024-01-29 13:30:47 -08:00
Abhishek Agarwal 989a8f7874
Better error message for date_trunc operators (#15759)
IAEs are not bubbled up and show up as a runtime failure to the user which are not helpful. See https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1706185796975109 for one such example. This change will fix that.
2024-01-27 11:22:39 +05:30
Karan Kumar c4990f56d6
Prepare main branch for next 30.0.0 release. (#15707) 2024-01-23 15:55:54 +05:30
Zoltan Haindrich d6a12c4389
Add ability to enable ResultCache in tests (#15465) 2024-01-22 09:02:59 -05:00
Pranav 45b30dc07d
Revert "Change default inSubQueryThreshold (#15336)" (#15722)
A low value of inSubQueryThreshold can cause queries with IN filter to plan as joins more commonly. However, some of these join queries may not get planned as IN filter on data nodes and causes significant perf regression.
2024-01-22 11:34:39 +05:30
Zoltan Haindrich 8a43db9395
Range support in window expressions (support them as groups) (#15365)
* support groups windowing mode; which is a close relative of ranges (but not in the standard)
* all windows with range expressions will be executed wit it groups
* it will be 100% correct in case for both bounds its true that: isCurrentRow() || isUnBounded()
  * this covers OVER ( ORDER BY COL )
* for other cases it will have some chances of getting correct results...
2024-01-17 00:05:21 -06:00
Gian Merlino 500681d0cb
Add ImmutableLookupMap for static lookups. (#15675)
* Add ImmutableLookupMap for static lookups.

This patch adds a new ImmutableLookupMap, which comes with an
ImmutableLookupExtractor. It uses a fastutil open hashmap plus two
lists to store its data in such a way that forward and reverse
lookups can both be done quickly. I also observed footprint to be
somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million
row lookup.

The main advantage, though, is that reverse lookups can be done much
more quickly than MapLookupExtractor (which iterates the entire map
for each call to unapplyAll). This speeds up the recently added
ReverseLookupRule (#15626) during SQL planning with very large lookups.

* Use in one more test.

* Fix benchmark.

* Object2ObjectOpenHashMap

* Fixes, and LookupExtractor interface update to have asMap.

* Remove commented-out code.

* Fix style.

* Fix import order.

* Add fastutil.

* Avoid storing Map entries.
2024-01-13 13:14:01 -08:00
Gian Merlino 866fe1cda6
Fix some naming related to AggregatePullUpLookupRule. (#15677)
It was called "split" rather than "pull up" in some places. This patch
standardizes on "pull up".
2024-01-12 15:41:58 -08:00
Gian Merlino cccf13ea82
Reverse, pull up lookups in the SQL planner. (#15626)
* Reverse, pull up lookups in the SQL planner.

Adds two new rules:

1) ReverseLookupRule, which eliminates calls to LOOKUP by doing
   reverse lookups.

2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above
   GROUP BY, when the lookup is injective.

Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether
these rules fire. Both are enabled by default.

To minimize the chance of performance problems due to many keys mapping to
the same value, ReverseLookupRule refrains from reversing a lookup if there
are more keys than `inSubQueryThreshold`. The rationale for using this setting
is that reversal works by generating an IN, and the `inSubQueryThreshold`
describes the largest IN the user wants the planner to create.

* Add additional line.

* Style.

* Remove commented-out lines.

* Fix tests.

* Add test.

* Fix doc link.

* Fix docs.

* Add one more test.

* Fix tests.

* Logic, test updates.

* - Make FilterDecomposeConcatRule more flexible.

- Make CalciteRulesManager apply reduction rules til fixpoint.

* Additional tests, simplify code.
2024-01-12 00:06:31 -08:00
Zoltan Haindrich e597cc2949
Remove UnaryFunctionOperatorConversion and RoundOperatorConversion (#15566)
* get rid of roun op conv

* cleanup

* use DirectOperatorConversion instead unary

* import order
2024-01-12 10:06:23 +05:30
Gian Merlino 6c18434028
CONCAT flattening, filter decomposition. (#15634)
* CONCAT flattening, filter decomposition.

Flattening: CONCAT(CONCAT(x, y), z) is flattened to CONCAT(x, y, z). This
is especially useful for the || operator, which is a binary operator and
leads to non-flat CONCAT calls.

Filter decomposition: transforms CONCAT(x, '-', y) = 'a-b' into
x = 'a' AND y = 'b'.

* One more test.

* Fix two tests.

* Adjustments from review.

* Fix empty string problem, add tests.
2024-01-11 11:18:50 -08:00
Gian Merlino ee77fa7fb3
Add tests for CASE decomposition. (#15639)
I was looking into adding a rule to do this, and found that it was already
happening as part of Calcite's RexSimplify. So this patch simply adds some
tests to ensure that it continues to happen.
2024-01-10 13:24:24 -08:00
Ankit Kothari 355c2f5da0
Add sql + ingestion compatibility for first/last on numeric values (#15607)
SQL compatibility for numeric last and first column types.
Ingestion UI now provides option for first and last aggregation as well.
2024-01-10 12:59:38 +05:30
Rishabh Singh 71f5307277
Eliminate Periodic Realtime Segment Metadata Queries: Task Now Publish Schema for Seamless Coordinator Updates (#15475)
The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This task encompasses addressing both realtime and finalized segments.

This modification specifically addresses the issue with realtime segments. Tasks will now routinely communicate the schema for realtime segments during the segment announcement process. The Coordinator will identify the schema alongside the segment announcement and subsequently update the schema for realtime segments in the metadata cache.
2024-01-10 08:55:56 +05:30
Abhishek Agarwal 468b99e608
Enable query request queuing by default when total laning is turned on. (#15440)
This PR enables the flag by default to queue excess query requests in the jetty queue. Still keeping the flag so that it can be turned off if necessary. But the flag will be removed in the future.
2024-01-09 07:54:26 +05:30
Clint Wylie df5bcd1367
fix bugs with expression virtual column indexes for expression virtual columns which refer to other virtual columns (#15633)
changes:
* ColumnIndexSelector now extends ColumnSelector. The only real implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector, already has a ColumnSelector, so this isn't very disruptive
* removed getColumnNames from ColumnSelector since it was not used
* VirtualColumns and VirtualColumn getIndexSupplier method now needs argument of ColumnIndexSelector instead of ColumnSelector, which allows expression virtual columns to correctly recognize other virtual columns, fixing an issue which would incorrectly handle other virtual columns as non-existent columns instead
* fixed a bug with sql planner incorrectly not using expression filter for equality filters on columns with extractionFn and no virtual column registry
2024-01-08 13:10:11 -08:00
Jonathan Wei 5d1e66b8f9
Allow broker to use catalog for datasource schemas for SQL queries (#15469)
* Allow broker to use catalog for datasource schemas

* More PR comments

* PR comments
2024-01-08 13:46:08 -06:00
Gian Merlino 0422d9d507
Fix redundant expansion in SearchOperatorConversion. (#15625)
This logic error causes sarg expansion to happen twice for IN or NOT IN points.
It doesn't affect the final generated native query, because the
redundant expansions gets combined. But it slows down planning, especially
for large NOT IN.
2024-01-05 12:42:12 -08:00
Zoltan Haindrich b9679d0884
Run filter-into-join rule early for subqueries and disable project-filter rule (#15511)
FILTER_INTO_JOIN is mainly run along with the other rules with the Volcano planner; however if the query starts highly underdefined (join conditions in the where clauses) that generic query could give a lot of room for the other rules to play around with only enabled it for when the join uses subqueries for its inputs. 

PROJECT_FILTER rule is not that useful. and could increase planning times by providing new plans. This problem worsened after we started supporting inner joins with arbitrary join conditions in https://github.com/apache/druid/pull/15302
2024-01-04 15:33:45 +05:30
Gian Merlino 5c3391a084
Follow-ups to SEARCH and IN from #15609. (#15623)
- Rename ExprType to BaseType in CollectComparisons, since ExprType is a thing
  that exists elsewhere.
- Remove unused "notInRexNodes" from SearchOperatorConversion.
2024-01-03 22:38:12 -08:00
Clint Wylie f19ece146f
expression virtual column indexes (#15585)
* ExpressionVirtualColumn + indexes = bff. Expression virtual columns can now use indexes of the underlying columns similar to how expression filters
2024-01-03 21:00:39 -08:00
Gian Merlino 01eec4a55e
New handling for COALESCE, SEARCH, and filter optimization. (#15609)
* New handling for COALESCE, SEARCH, and filter optimization.

COALESCE is converted by Calcite's parser to CASE, which is largely
counterproductive for us, because it ends up duplicating expressions.
In the current code we end up un-doing it in our CaseOperatorConversion.
This patch has a different approach:

1) Add CaseToCoalesceRule to convert CASE back to COALESCE earlier, before
   the Volcano planner runs, using CaseToCoalesceRule.

2) Add FilterDecomposeCoalesceRule to decompose calls like
   "f(COALESCE(x, y))" into "(x IS NOT NULL AND f(x)) OR (x IS NULL AND f(y))".
   This helps use indexes when available on x and y.

3) Add CoalesceLookupRule to push COALESCE into the third arg of LOOKUP.

4) Add a native "coalesce" function so we can convert 3+ arg COALESCE.

The advantage of this approach is that by un-doing the CASE to COALESCE
conversion earlier, we have flexibility to do more stuff with
COALESCE (like decomposition and pushing into LOOKUP).

SEARCH is an operator used internally by Calcite to represent matching
an argument against some set of ranges. This patch improves our handling
of SEARCH in two ways:

1) Expand NOT points (point "holes" in the range set) from SEARCH as
   `!(a || b)` rather than `!a && !b`, which makes it possible to convert
   them to a "not" of "in" filter later.

2) Generate those nice conversions for NOT points even if the SEARCH
   is not composed of 100% NOT points. Without this change, a SEARCH
   for "x NOT IN ('a', 'b') AND x < 'm'" would get converted like
   "x < 'a' OR (x > 'a' AND x < 'b') OR (x > 'b' AND x < 'm')".

One of the steps we take when generating Druid queries from Calcite
plans is to optimize native filters. This patch improves this step:

1) Extract common ANDed predicates in ConvertSelectorsToIns, so we can
   convert "(a && x = 'b') || (a && x = 'c')" into "a && x IN ('b', 'c')".

2) Speed up CombineAndSimplifyBounds and ConvertSelectorsToIns on
   ORs with lots of children by adjusting the logic to avoid calling
   "indexOf" and "remove" on an ArrayList.

3) Refactor ConvertSelectorsToIns to reduce duplicated code between the
   handling for "selector" and "equals" filters.

* Not so final.

* Fixes.

* Fix test.

* Fix test.
2024-01-03 08:56:22 -08:00
AlbericByte a2e65e6a89
Support to pass dynamic values to timestamp Extract function (#15586)
Fixes #15072

Before this modification , the third parameter (timezone) require to be a Literal, it will throw a error when this parameter is column Identifier.
2023-12-21 11:57:52 +05:30
Clint Wylie e373f62692
fix expression post aggregator array handling when grouping wrapper types leak (#15543)
* fix expression post aggregator array handling when grouping wrapper types leak
* more consistent expression function error messaging
2023-12-15 21:43:27 -08:00
Soumyava 3e15522d6b
Round works correctly on system metadata columns (#15554) 2023-12-13 17:23:14 -08:00
Soumyava 38f3cf9e65
Fixing a case where datatype mismatch was happenning in join (#15541) 2023-12-12 12:50:32 -08:00
Clint Wylie 42f2496b7d
fix bug with nested empty array fields (#15532) 2023-12-09 12:20:21 -08:00
Clint Wylie e7c8f2e208
lift restriction of array_to_mv to only support direct column access (#15528) 2023-12-08 16:27:17 -08:00
Soumyava ca4ecdf7d0
Fixing NPE with virtual expression with unnest (#15513)
* Fixing NPE with virtual expression with unnest

* Fixing a comment
2023-12-08 10:51:56 -08:00
Clint Wylie e64b92eb35
add JSON_QUERY_ARRAY function to pluck ARRAY<COMPLEX<json>> out of COMPLEX<json> (#15521) 2023-12-08 05:28:46 -08:00
Adarsh Sanjeev 2e45eadc08
Add better error messages for using OVERWRITE with INSERT statments (#15517)
* Add better error messages for using OVERWRITE with INSERT statments
2023-12-08 15:33:46 +05:30
Zoltan Haindrich c353ccfdef
Windowed min aggregates null-s as 0 (#15371) 2023-12-08 01:41:16 -08:00
sb89594 5fda8613ad
Feature: Add IPv6 Match Function (#15212) 2023-12-07 23:09:06 -08:00
Clint Wylie c241c6980c
store auto columns with only empty or null containing arrays as ARRAY<LONG> instead of COMPLEX<json> (#15505) 2023-12-07 03:31:43 -08:00
Clint Wylie 557f3f6f57
add array column type support to EXTEND operator (#15458) 2023-12-06 23:21:35 -08:00
Rishabh Singh d968bb3f43
Rename config for enabling CentralizedDatasourceSchema feature (#15476)
* Rename property to druid.centralizedDatasourceSchema.enabled
* Update config name in docker-compose
2023-12-05 16:57:25 +05:30
Zoltan Haindrich a1aa4340d0
Changing the queryFrameWork in Calcite*Tests may have sideeffects (#15428)
changes how its configured a bit to use an annotation instead of methods
2023-12-04 00:38:01 +05:30
Clint Wylie 5ce4aab3b8
update ARRAY_OVERLAP to plan with ArrayContainsElement for ARRAY columns (#15451)
Updates ARRAY_OVERLAP to use the same ArrayContainsElement filter added in #15366 when filtering ARRAY typed columns so that it can also use indexes like ARRAY_CONTAINS.
2023-11-30 10:05:20 +05:30
Pranav 93cd638645
Enabling aggregateMultipleValues in all StringAnyAggregators (#15434)
* Enabling aggregateMultipleValues in all StringAnyAggregators

* Adding more tests

* More validation

* fix warning

* updating asserts in decoupled mode

* fix intellij inspection

* Addressing comments

* Addressing comments

* Adding early validations and make aggregate consistent across all

* fixing tests

* fixing tests

* Update docs/querying/sql-aggregations.md

Co-authored-by: Clint Wylie <cjwylie@gmail.com>

* fixing static check

---------

Co-authored-by: Clint Wylie <cjwylie@gmail.com>
2023-11-29 14:32:49 -08:00
Clint Wylie 64fcb32bcf
add native 'array contains element' filter (#15366)
* add native arrayContainsElement filter to use array column element indexes
2023-11-29 03:33:00 -08:00
Abhishek Agarwal 0a56c87e93
SQL: Plan non-equijoin conditions as cross join followed by filter (#15302)
This PR revives #14978 with a few more bells and whistles. Instead of an unconditional cross-join, we will now split the join condition such that some conditions are now evaluated post-join. To decide what sub-condition goes where, I have refactored DruidJoinRule class to extract unsupported sub-conditions. We build a postJoinFilter out of these unsupported sub-conditions and push to the join.
2023-11-29 13:46:11 +05:30
Clint Wylie 97623b408c
add optional 'castToType' parameter to 'auto' column schema (#15417)
* auto but.. with an expected type
2023-11-28 17:19:23 -08:00
Zoltan Haindrich eb056e23b5
Fix dictionarySize overrides in tests (#15354)
I think this is a problem as it discards the false return value when the putToKeyBuffer can't store the value because of the limit

Not forwarding the return value at that point may lead to the normal continuation here regardless something was not added to the dictionary like here
2023-11-28 18:49:09 +05:30
Zoltan Haindrich ca544e552c
Add option to compare results with relative error tolerance (#15429)
Adds a result comparision mode of EQUALS_RELATIVE_1000_ULPS ; which accepts floating point differences up-to 1000 units of least precision
2023-11-28 13:03:16 +05:30
Abhishek Agarwal 3113e7b350
Fix grouping aggregator when one of the dimension is a simple extraction (#15421)
This PR fixes an issue where the grouping aggregator wrongly assumes that a key dimension is a virtual column and assigns a wrong name to it. This results in a mismatch between the dimensions that grouping aggregator sees and the dimension names that rows are aggregated on. And finally, grouping aggregator generates wrong result.
2023-11-24 13:15:07 +05:30
Clint Wylie a95c22ce70
support non-constant expressions for path arguments for json_value and json_query (#15320)
* support dynamic expressions for path arguments for json_value and json_query
2023-11-17 01:12:05 -08:00
Adarsh Sanjeev a134cc30a6
Change default inSubQueryThreshold (#15336) 2023-11-14 14:08:12 +05:30
Rishabh Singh 5446494e63
Non-existent datasource shouldn't affect schema rebuilding for other datasources (#15355)
In pull request #14985, a bug was introduced where periodic refresh would skip rebuilding a datasource's schema after encountering a non-existent datasource. This resulted in remaining datasources having stale schema information.

This change addresses the bug and adds a unit test to validate the refresh mechanism's behaviour when a datasource is removed, and other datasources have schema changes.
2023-11-14 12:52:33 +05:30
Rishabh Singh 8c802e4c9b
Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency (#14985)
In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal.

To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.
2023-11-04 19:33:25 +05:30
Laksh Singla 0cc8839a60
Allow casted literal values in SQL functions accepting literals (Part 2) (#15316) 2023-11-03 21:22:19 +05:30
Gian Merlino d87d92bc43
Add system fields to input sources. (#15276)
* Add system fields to input sources.

Main changes:

1) The SystemField enum defines system fields "__file_uri", "__file_path",
   and "__file_bucket". They are associated with each input entity.

2) The SystemFieldInputSource interface can be added to any InputSource
   to make it system-field-capable. It sets up serialization of a list
   of configured "systemFields" in the JSON form of the input source, and
   provides a method getSystemFieldValue for computing the value of each
   system field. Cloud object, HDFS, HTTP, and Local now have this.

* Fix various LocalInputSource calls.

* Fix style stuff.

* Fixups.

* Fix tests and coverage.
2023-11-02 10:31:28 -07:00