A low value of inSubQueryThreshold can cause queries with IN filters to be planned as joins more commonly. However, some of these join queries may not be planned back into IN filters on the data nodes, causing significant performance regressions.
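For example (a hypothetical table and threshold value; the exact plan shape depends on the query):

```sql
-- With a low inSubQueryThreshold (say 2), this IN of three values may be
-- planned as a join against an inline datasource instead of an IN filter:
SELECT COUNT(*) FROM t WHERE dim IN ('a', 'b', 'c');
```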
* support the GROUPS windowing mode, which is a close relative of ranges (but not in the standard)
* all windows with range expressions will be executed with GROUPS
* it will be 100% correct when, for both bounds, it is true that: isCurrentRow() || isUnBounded()
* this covers OVER ( ORDER BY COL ) (see the sketch after this list)
* for other cases it will only have some chance of producing correct results...
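A minimal sketch of the fully-correct case, assuming a hypothetical table t with columns col and val. The implicit frame of OVER (ORDER BY col) is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so both bounds satisfy isCurrentRow() || isUnBounded() and the frame can be executed with GROUPS:

```sql
-- These are conceptually equivalent: the implicit frame, the explicit RANGE
-- frame, and the GROUPS frame that actually gets executed.
SELECT col, SUM(val) OVER (ORDER BY col) FROM t;
SELECT col, SUM(val) OVER
  (ORDER BY col RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM t;
SELECT col, SUM(val) OVER
  (ORDER BY col GROUPS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM t;
```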
* Add ImmutableLookupMap for static lookups.
This patch adds a new ImmutableLookupMap, which comes with an
ImmutableLookupExtractor. It uses a fastutil open hashmap plus two
lists to store its data in such a way that forward and reverse
lookups can both be done quickly. I also observed the footprint to be
somewhat smaller than Java HashMap + MapLookupExtractor for a 1-million-row
lookup.
The main advantage, though, is that reverse lookups can be done much
more quickly than MapLookupExtractor (which iterates the entire map
for each call to unapplyAll). This speeds up the recently added
ReverseLookupRule (#15626) during SQL planning with very large lookups.
* Use in one more test.
* Fix benchmark.
* Object2ObjectOpenHashMap
* Fixes, and LookupExtractor interface update to have asMap.
* Remove commented-out code.
* Fix style.
* Fix import order.
* Add fastutil.
* Avoid storing Map entries.
* Reverse, pull up lookups in the SQL planner.
Adds two new rules:
1) ReverseLookupRule, which eliminates calls to LOOKUP by doing
reverse lookups.
2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above
GROUP BY, when the lookup is injective.
Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether
these rules fire. Both are enabled by default.
To minimize the chance of performance problems due to many keys mapping to
the same value, ReverseLookupRule refrains from reversing a lookup if there
are more keys than `inSubQueryThreshold`. The rationale for using this setting
is that reversal works by generating an IN, and the `inSubQueryThreshold`
describes the largest IN the user wants the planner to create.
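A sketch of both rules, assuming a hypothetical lookup named country_name that maps country codes to names:

```sql
-- ReverseLookupRule: a filter on LOOKUP output becomes a filter on the key,
-- assuming only the key 'US' maps to 'United States'.
-- Before:  WHERE LOOKUP(country, 'country_name') = 'United States'
-- After:   WHERE country = 'US'

-- AggregatePullUpLookupRule: when the lookup is injective, LOOKUP is pulled
-- above the GROUP BY, so grouping happens on the raw key.
-- Before:
SELECT LOOKUP(country, 'country_name'), COUNT(*) FROM t GROUP BY 1;
-- After (conceptually):
SELECT LOOKUP(country, 'country_name'), cnt
FROM (SELECT country, COUNT(*) AS cnt FROM t GROUP BY country) AS g;
```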
* Add additional line.
* Style.
* Remove commented-out lines.
* Fix tests.
* Add test.
* Fix doc link.
* Fix docs.
* Add one more test.
* Fix tests.
* Logic, test updates.
* - Make FilterDecomposeConcatRule more flexible.
- Make CalciteRulesManager apply reduction rules until fixpoint.
* Additional tests, simplify code.
* CONCAT flattening, filter decomposition.
Flattening: CONCAT(CONCAT(x, y), z) is flattened to CONCAT(x, y, z). This
is especially useful for the || operator, which is a binary operator and
leads to non-flat CONCAT calls.
Filter decomposition: transforms CONCAT(x, '-', y) = 'a-b' into
x = 'a' AND y = 'b'.
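A sketch of both transformations (hypothetical table t with string columns x, y, z):

```sql
-- Flattening: the binary || operator produces nested CONCATs, so
-- x || y || z parses as CONCAT(CONCAT(x, y), z) and is flattened
-- to CONCAT(x, y, z).
SELECT x || y || z FROM t;

-- Filter decomposition:
-- Before:
SELECT * FROM t WHERE CONCAT(x, '-', y) = 'a-b';
-- After (conceptually):
SELECT * FROM t WHERE x = 'a' AND y = 'b';
```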
* One more test.
* Fix two tests.
* Adjustments from review.
* Fix empty string problem, add tests.
I was looking into adding a rule to do this, and found that it was already
happening as part of Calcite's RexSimplify. So this patch simply adds some
tests to ensure that it continues to happen.
The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This task encompasses addressing both realtime and finalized segments.
This modification specifically addresses the issue with realtime segments. Tasks will now routinely communicate the schema for realtime segments during the segment announcement process. The Coordinator will pick up the schema alongside the segment announcement and subsequently update the schema for realtime segments in the metadata cache.
This PR enables the flag by default, so that excess query requests are queued in the Jetty queue. The flag is kept for now so the behavior can be turned off if necessary, but it will be removed in the future.
changes:
* ColumnIndexSelector now extends ColumnSelector. The only real implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector, already has a ColumnSelector, so this isn't very disruptive
* removed getColumnNames from ColumnSelector since it was not used
* the VirtualColumns and VirtualColumn getIndexSupplier methods now take a ColumnIndexSelector argument instead of ColumnSelector, which allows expression virtual columns to correctly recognize other virtual columns, fixing an issue where other virtual columns were incorrectly handled as non-existent columns
* fixed a bug where the SQL planner incorrectly did not use an expression filter for equality filters on columns with an extractionFn and no virtual column registry
This logic error causes sarg expansion to happen twice for IN or NOT IN points.
It doesn't affect the final generated native query, because the
redundant expansions get combined. But it slows down planning, especially
for large NOT IN filters.
FILTER_INTO_JOIN mainly runs along with the other rules in the Volcano planner; however, if the query starts out highly underdefined (join conditions in the WHERE clause), that generic form gives the other rules a lot of room to play around in. This change enables the rule only when the join uses subqueries for its inputs.
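A sketch of such an underdefined query (hypothetical tables): the join condition lives in the WHERE clause, and FILTER_INTO_JOIN moves it into the join:

```sql
-- Before FILTER_INTO_JOIN this is a cross join plus a filter;
-- the rule rewrites it into an equi-join on a.k = b.k.
SELECT a.dim, b.val
FROM a, b
WHERE a.k = b.k AND b.val > 10;
```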
The PROJECT_FILTER rule is not that useful and could increase planning times by providing new plans. This problem worsened after we started supporting inner joins with arbitrary join conditions in https://github.com/apache/druid/pull/15302.
- Rename ExprType to BaseType in CollectComparisons, since ExprType is a thing
that exists elsewhere.
- Remove unused "notInRexNodes" from SearchOperatorConversion.
* New handling for COALESCE, SEARCH, and filter optimization.
COALESCE is converted by Calcite's parser to CASE, which is largely
counterproductive for us, because it ends up duplicating expressions.
In the current code we end up un-doing it in our CaseOperatorConversion.
This patch has a different approach:
1) Add CaseToCoalesceRule to convert CASE back to COALESCE earlier, before
the Volcano planner runs.
2) Add FilterDecomposeCoalesceRule to decompose calls like
"f(COALESCE(x, y))" into "(x IS NOT NULL AND f(x)) OR (x IS NULL AND f(y))".
This helps use indexes when available on x and y.
3) Add CoalesceLookupRule to push COALESCE into the third arg of LOOKUP.
4) Add a native "coalesce" function so we can convert 3+ arg COALESCE.
The advantage of this approach is that by un-doing the CASE to COALESCE
conversion earlier, we have flexibility to do more stuff with
COALESCE (like decomposition and pushing into LOOKUP).
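A sketch of the decomposition in (2), with hypothetical string columns x and y:

```sql
-- Before: the filter references COALESCE, so per-column indexes don't apply.
SELECT COUNT(*) FROM t WHERE COALESCE(x, y) = 'val';
-- After FilterDecomposeCoalesceRule (conceptually): indexes on x and y can be used.
SELECT COUNT(*) FROM t
WHERE (x IS NOT NULL AND x = 'val') OR (x IS NULL AND y = 'val');
```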
SEARCH is an operator used internally by Calcite to represent matching
an argument against some set of ranges. This patch improves our handling
of SEARCH in two ways:
1) Expand NOT points (point "holes" in the range set) from SEARCH as
`!(a || b)` rather than `!a && !b`, which makes it possible to convert
them to a "not" of "in" filter later.
2) Generate those nice conversions for NOT points even if the SEARCH
is not composed of 100% NOT points. Without this change, a SEARCH
for "x NOT IN ('a', 'b') AND x < 'm'" would get converted like
"x < 'a' OR (x > 'a' AND x < 'b') OR (x > 'b' AND x < 'm')".
One of the steps we take when generating Druid queries from Calcite
plans is to optimize native filters. This patch improves this step:
1) Extract common ANDed predicates in ConvertSelectorsToIns, so we can
convert "(a && x = 'b') || (a && x = 'c')" into "a && x IN ('b', 'c')".
2) Speed up CombineAndSimplifyBounds and ConvertSelectorsToIns on
ORs with lots of children by adjusting the logic to avoid calling
"indexOf" and "remove" on an ArrayList.
3) Refactor ConvertSelectorsToIns to reduce duplicated code between the
handling for "selector" and "equals" filters.
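At the SQL level, the extraction in (1) corresponds to a rewrite like this (hypothetical columns):

```sql
-- Before:
... WHERE (city = 'SF' AND x = 'b') OR (city = 'SF' AND x = 'c')
-- After ConvertSelectorsToIns extracts the common ANDed predicate:
... WHERE city = 'SF' AND x IN ('b', 'c')
```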
* Not so final.
* Fixes.
* Fix test.
* Fix test.
Fixes #15072
Before this modification, the third parameter (timezone) was required to be a literal; an error was thrown when this parameter was a column identifier.
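A hypothetical illustration (TIME_FORMAT stands in here for the function covered by #15072; tz is a column holding timezone names):

```sql
-- Previously threw an error because tz is a column identifier, not a literal:
SELECT TIME_FORMAT(__time, 'yyyy-MM-dd HH:mm:ss', tz) FROM t;
```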
Updates ARRAY_OVERLAP to use the same ArrayContainsElement filter added in #15366 when filtering ARRAY typed columns so that it can also use indexes like ARRAY_CONTAINS.
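A sketch with a hypothetical ARRAY-typed column arr:

```sql
-- Both of these can now plan to ArrayContainsElement filters and use indexes:
SELECT COUNT(*) FROM t WHERE ARRAY_CONTAINS(arr, 'a');
SELECT COUNT(*) FROM t WHERE ARRAY_OVERLAP(arr, ARRAY['a', 'b']);
```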
This PR revives #14978 with a few more bells and whistles. Instead of an unconditional cross join, we now split the join condition so that some sub-conditions are evaluated post-join. To decide which sub-condition goes where, I have refactored the DruidJoinRule class to extract unsupported sub-conditions. We build a postJoinFilter out of these unsupported sub-conditions and push it onto the join.
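A sketch of the split, with a hypothetical join where only the equality is directly supported:

```sql
-- The equi-condition t1.k = t2.k stays in the join; the inequality
-- t1.v > t2.v becomes a postJoinFilter applied after the join.
SELECT *
FROM t1 JOIN t2
  ON t1.k = t2.k AND t1.v > t2.v;
```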
I think this is a problem, as it discards the false return value when putToKeyBuffer can't store the value because of the limit.
Not forwarding the return value at that point may lead to normal continuation even though something was not added to the dictionary.
This PR fixes an issue where the grouping aggregator wrongly assumes that a key dimension is a virtual column and assigns a wrong name to it. This results in a mismatch between the dimensions that the grouping aggregator sees and the dimension names that rows are aggregated on. As a result, the grouping aggregator generates wrong results.
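A hypothetical query that exercises the grouping aggregator; with the bug, the GROUPING values could be computed against the wrong dimension names:

```sql
SELECT dim1, dim2, GROUPING(dim1, dim2), COUNT(*)
FROM t
GROUP BY GROUPING SETS ((dim1), (dim2));
```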
In pull request #14985, a bug was introduced where periodic refresh would skip rebuilding a datasource's schema after encountering a non-existent datasource. This resulted in remaining datasources having stale schema information.
This change addresses the bug and adds a unit test to validate the refresh mechanism's behaviour when a datasource is removed, and other datasources have schema changes.
In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal.
To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.
* Add system fields to input sources.
Main changes:
1) The SystemField enum defines system fields "__file_uri", "__file_path",
and "__file_bucket". They are associated with each input entity.
2) The SystemFieldInputSource interface can be added to any InputSource
to make it system-field-capable. It sets up serialization of a list
of configured "systemFields" in the JSON form of the input source, and
provides a method getSystemFieldValue for computing the value of each
system field. Cloud object, HDFS, HTTP, and Local now have this.
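A sketch of the JSON form via SQL-based ingestion's EXTERN, assuming a hypothetical local input source (whether the system field must also appear in the signature is an assumption here):

```sql
-- "systemFields" requests "__file_path" for each input row:
SELECT *
FROM TABLE(EXTERN(
  '{"type": "local", "baseDir": "/data", "filter": "*.json", "systemFields": ["__file_path"]}',
  '{"type": "json"}',
  '[{"name": "__file_path", "type": "string"}, {"name": "x", "type": "string"}]'
));
```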
* Fix various LocalInputSource calls.
* Fix style stuff.
* Fixups.
* Fix tests and coverage.
* better documentation for the differences between arrays and mvds
* add outputType to ExpressionPostAggregator to make docs true
* add output coercion if outputType is defined on ExpressionPostAgg
* updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables