druid

Commit Graph

Author	SHA1	Message	Date
Hiroshi Fukada	3fe3a65344	New: Add DDSketch in extensions-contrib (#15049 ) * New: Add DDSketch-Druid extension - Based off of http://www.vldb.org/pvldb/vol12/p2195-masson.pdf and uses the corresponding https://github.com/DataDog/sketches-java library - contains tests for post building and using aggregation/post aggregation. - New aggregator: `ddSketch` - New post aggregators: `quantileFromDDSketch` and `quantilesFromDDSketch` * Fixing easy CodeQL warnings/errors * Fixing docs, and dependencies Also moved aggregator ids to AggregatorUtil and PostAggregatorIds * Adding more Docs and better null/empty handling for aggregators * Fixing docs, and pom version * DDSketch documentation format and wording	2024-01-23 20:17:07 +05:30
Karan Kumar	c4990f56d6	Prepare main branch for next 30.0.0 release. (#15707 )	2024-01-23 15:55:54 +05:30
Pranav	45b30dc07d	Revert "Change default inSubQueryThreshold (#15336)" (#15722) A low value of inSubQueryThreshold can cause queries with IN filter to plan as joins more commonly. However, some of these join queries may not get planned as IN filter on data nodes, which causes a significant perf regression.	2024-01-22 11:34:39 +05:30
zachjsh	9d4e8053a4	Kinesis adaptive memory management (#15360 ) ### Description Our Kinesis consumer works by using the [GetRecords API](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html) in some number of `fetchThreads`, each fetching some number of records (`recordsPerFetch`) and each inserting into a shared buffer that can hold a `recordBufferSize` number of records. The logic is described in our documentation at: https://druid.apache.org/docs/27.0.0/development/extensions-core/kinesis-ingestion/#determine-fetch-settings There is a problem with the logic that this pr fixes: the memory limits rely on a hard-coded “estimated record size” that is `10 KB` if `deaggregate: false` and `1 MB` if `deaggregate: true`. There have been cases where a supervisor had `deaggregate: true` set even though it wasn’t needed, leading to under-utilization of memory and poor ingestion performance. Users don’t always know if their records are aggregated or not. Also, even if they could figure it out, it’s better to not have to. So we’d like to eliminate the `deaggregate` parameter, which means we need to do memory management more adaptively based on the actual record sizes. We take advantage of the fact that GetRecords doesn’t return more than 10MB (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html ): This pr: eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set) eliminate `deaggregate`, always have it true cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments add `recordBufferSizeBytes` as a bytes-based limit rather than records-based limit for the shared queue. 
We do know the byte size of kinesis records at this point. Default should be `100MB` or `10% of heap`, whichever is smaller. add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes. deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified. deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if `maxRecordsPerPoll` is specified. Fixed issue that when the record buffer is full, the fetchRecords logic throws away the rest of the GetRecords result after `recordBufferOfferTimeout` and starts a new shard iterator. This seems excessively churny. Instead, wait an unbounded amount of time for queue to stop being full. If the queue remains full, we’ll end up right back waiting for it after the restarted fetch. There was also a call to `newQ::offer` without check in `filterBufferAndResetBackgroundFetch`, which seemed like it could cause data loss. Now checking return value here, and failing if false. ### Release Note Kinesis ingestion memory tuning config has been greatly simplified, and a more adaptive approach is now taken for the configuration. Here is a summary of the changes made: eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set). eliminate `deaggregate`, always have it true. cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments. add `recordBufferSizeBytes` as a bytes-based limit rather than records-based limit for the shared queue. We do know the byte size of kinesis records at this point. Default should be `100MB` or `10% of heap`, whichever is smaller.
add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes. deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified. deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if `maxRecordsPerPoll` is specified.	2024-01-19 14:30:21 -05:00
Benedict Jin	96b4abc8e9	Add @VisibleForTesting annotation for the backingArray() method (#15690 )	2024-01-18 19:30:10 -08:00
Gian Merlino	792e5c58e4	IncrementalIndex#add is no longer thread-safe. (#15697 ) * IncrementalIndex#add is no longer thread-safe. Following #14866, there is no longer a reason for IncrementalIndex#add to be thread-safe. It turns out it already was not using its selectors in a thread-safe way, as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript` in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't new: Strings have been stored in the dimension selectors for some time, but we didn't have a test that checked for that case; we only have this test that checks for concurrent adds involving numeric selectors. At any rate, this patch changes OnheapIncrementalIndex to no longer try to offer a thread-safe "add" method. It also improves performance a bit by adding a row ID supplier to the selectors it uses to read InputRows, meaning that it can get the benefit of caching values inside the selectors. This patch also: 1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator, which the similar datasketches versions already have. This is done to help them adhere to the contract of Aggregator: concurrent calls to "aggregate" and "get" must be thread-safe. 2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the druid-benchmarks module. * Spelling. * Changes from static analysis. * Fix javadoc.	2024-01-18 03:45:22 -08:00
Gian Merlino	764f41d959	Clear "lineSplittable" for JSON when using KafkaInputFormat. (#15692 ) * Clear "lineSplittable" for JSON when using KafkaInputFormat. JsonInputFormat has a "withLineSplittable" method that can be used to control whether JSON is read line-by-line, or as a whole. The intent is that in streaming ingestion, "lineSplittable" is false (although it can be overridden by "assumeNewlineDelimited"), and in batch ingestion, lineSplittable is true. When a "json" format is wrapped by a "kafka" format, this isn't set properly. This patch updates KafkaInputFormat to set this on an underlying "json" format. The tests for KafkaInputFormat were overriding the "lineSplittable" parameter explicitly, which wasn't really fair, because that made them unrealistic to what happens in production. Now they omit the parameter and get the production behavior. * Add test. * Fix test coverage.	2024-01-18 03:22:41 -08:00
Gian Merlino	d3d0c1c91e	Faster parsing: reduce String usage, list-based input rows. (#15681 ) * Faster parsing: reduce String usage, list-based input rows. Three changes: 1) Reworked FastLineIterator to optionally avoid generating Strings entirely, and reduce copying somewhat. Benefits the line-oriented JSON, CSV, delimited (TSV), and regex formats. 2) In the delimited (TSV) format, when the delimiter is a single byte, split on UTF-8 bytes directly. 3) In CSV and delimited (TSV) formats, use list-based input rows when the column list is provided upfront by the user. * Fix style. * Fix inspections. * Restore validation. * Remove fastutil-extra. * Exception type. * Fixes for error messages. * Fixes for null handling.	2024-01-18 19:18:46 +08:00
Laksh Singla	fc06f2d075	Fix summary iterator in grouping engine (#15658) This PR fixes the summary iterator to add aggregators in the correct position. The summary iterator is used when dims are not present, so the new behavior is identical to the old one, but the code reads more correctly.	2024-01-17 20:43:45 +05:30
Zoltan Haindrich	8a43db9395	Range support in window expressions (support them as groups) (#15365) * support groups windowing mode; which is a close relative of ranges (but not in the standard) * all windows with range expressions will be executed with groups * it will be 100% correct when, for both bounds, it is true that: isCurrentRow() || isUnBounded() * this covers OVER ( ORDER BY COL ) * for other cases it will have some chances of getting correct results...	2024-01-17 00:05:21 -06:00
AmatyaAvadhanula	11dbfb6e3f	Better error message when partition space is exhausted (#15685 ) * Better error message when partition space is exhausted	2024-01-16 12:32:40 +05:30
Sam Rash	072b16c6df	Fix SQL Interval.of() error message (#15454) Better error message for poorly constructed intervals	2024-01-15 22:34:35 -06:00
Gian Merlino	d359fb3d68	Cache value selectors in RowBasedColumnSelectorFactory. (#15615 ) * Cache value selectors in RowBasedColumnSelectorFactory. There was already caching for dimension selectors. This patch adds caching for value (object and number) selectors. It's helpful when the same field is read multiple times during processing of a single row (for example, by being an input to both MIN and MAX aggregations). * Fix typing. * Fix logic.	2024-01-15 18:03:27 -08:00
Kashif Faraz	18d2a8957f	Refactor: Cleanup test impls of ServiceEmitter (#15683 )	2024-01-15 17:37:00 +05:30
Ben Sykes	e49a7bb3cd	Add SpectatorHistogram extension (#15340 ) * Add SpectatorHistogram extension * Clarify documentation Cleanup comments * Use ColumnValueSelector directly so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram. When queried as a Number, we're returning the count of entries in the histogram. * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Fix references * Fix spelling * Update docs/development/extensions-contrib/spectator-histogram.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-01-14 09:52:30 -08:00
Gian Merlino	500681d0cb	Add ImmutableLookupMap for static lookups. (#15675 ) * Add ImmutableLookupMap for static lookups. This patch adds a new ImmutableLookupMap, which comes with an ImmutableLookupExtractor. It uses a fastutil open hashmap plus two lists to store its data in such a way that forward and reverse lookups can both be done quickly. I also observed footprint to be somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million row lookup. The main advantage, though, is that reverse lookups can be done much more quickly than MapLookupExtractor (which iterates the entire map for each call to unapplyAll). This speeds up the recently added ReverseLookupRule (#15626) during SQL planning with very large lookups. * Use in one more test. * Fix benchmark. * Object2ObjectOpenHashMap * Fixes, and LookupExtractor interface update to have asMap. * Remove commented-out code. * Fix style. * Fix import order. * Add fastutil. * Avoid storing Map entries.	2024-01-13 13:14:01 -08:00
Gian Merlino	cccf13ea82	Reverse, pull up lookups in the SQL planner. (#15626 ) * Reverse, pull up lookups in the SQL planner. Adds two new rules: 1) ReverseLookupRule, which eliminates calls to LOOKUP by doing reverse lookups. 2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above GROUP BY, when the lookup is injective. Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether these rules fire. Both are enabled by default. To minimize the chance of performance problems due to many keys mapping to the same value, ReverseLookupRule refrains from reversing a lookup if there are more keys than `inSubQueryThreshold`. The rationale for using this setting is that reversal works by generating an IN, and the `inSubQueryThreshold` describes the largest IN the user wants the planner to create. * Add additional line. * Style. * Remove commented-out lines. * Fix tests. * Add test. * Fix doc link. * Fix docs. * Add one more test. * Fix tests. * Logic, test updates. * - Make FilterDecomposeConcatRule more flexible. - Make CalciteRulesManager apply reduction rules til fixpoint. * Additional tests, simplify code.	2024-01-12 00:06:31 -08:00
Gian Merlino	2231cb30a4	Faster k-way merging using tournament trees, 8-byte key strides. (#15661 ) * Faster k-way merging using tournament trees, 8-byte key strides. Two speedups for FrameChannelMerger (which does k-way merging in MSQ): 1) Replace the priority queue with a tournament tree, which does fewer comparisons. 2) Compare keys using 8-byte strides, rather than 1 byte at a time. * Adjust comments. * Fix style. * Adjust benchmark and test. * Add eight-list test (power of two).	2024-01-11 08:36:22 -08:00
Clint Wylie	2118258b54	tidy up group by engines after removal of v1 (#15665 )	2024-01-11 00:52:52 -08:00
Laksh Singla	0b91cc4db2	Fix incorrect tests in String first/last serde's null handling (#15657) Fixes a couple of incorrect test cases that got merged accidentally	2024-01-10 19:16:12 +05:30
Zoltan Haindrich	fefa763722	Resultcache fetch should deserialize aggregates when they are real results (#15654 ) Fixes #15538	2024-01-10 06:42:33 -05:00
Laksh Singla	4149f98934	Fixes a bug with long string pair serde where null and empty strings are treated equivalently (#15525 ) This PR fixes a bug with the long string pair serde where null and empty strings are treated equivalently, and the return value is always null. When 'useDefaultValueForNull' was set to true by default, this wasn't a commonly seen issue, because nulls were equivalent to empty strings. However, since the default has changed to false, this can create incorrect results when the long string pairs are serded, where the empty strings are incorrectly converted to nulls.	2024-01-10 14:17:57 +05:30
Clint Wylie	2938b8de53	fix issue with NestedPathArrayElement not correctly handling negative index for Object[] like it has for List (#15650 )	2024-01-10 09:46:08 +05:30
Clint Wylie	cafc748f7e	skip expression virtual column indexes when mvd is used as array (#15644 )	2024-01-08 21:22:37 -08:00
Clint Wylie	911941b4a6	fix issue with nested virtual column index supplier for partial paths when processing from raw (#15643 )	2024-01-09 07:55:08 +05:30
Clint Wylie	df5bcd1367	fix bugs with expression virtual column indexes for expression virtual columns which refer to other virtual columns (#15633 ) changes: * ColumnIndexSelector now extends ColumnSelector. The only real implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector, already has a ColumnSelector, so this isn't very disruptive * removed getColumnNames from ColumnSelector since it was not used * VirtualColumns and VirtualColumn getIndexSupplier method now needs argument of ColumnIndexSelector instead of ColumnSelector, which allows expression virtual columns to correctly recognize other virtual columns, fixing an issue which would incorrectly handle other virtual columns as non-existent columns instead * fixed a bug with sql planner incorrectly not using expression filter for equality filters on columns with extractionFn and no virtual column registry	2024-01-08 13:10:11 -08:00
Clint Wylie	c221a2634b	overhaul DruidPredicateFactory to better handle 3VL (#15629 ) * overhaul DruidPredicateFactory to better handle 3VL fixes some bugs caused by some limitations of the original design of how DruidPredicateFactory interacts with 3-value logic. The primary impacted area was with how filters on values transformed with expressions or extractionFn which turn non-null values into nulls, which were not possible to be modelled with the 'isNullInputUnknown' method changes: * adds DruidObjectPredicate to specialize string, array, and object based predicates instead of using guava Predicate * DruidPredicateFactory now uses DruidObjectPredicate * introduces DruidPredicateMatch enum, which all predicates returned from DruidPredicateFactory now use instead of booleans to indicate match. This means DruidLongPredicate, DruidFloatPredicate, DruidDoublePredicate, and the newly added DruidObjectPredicate apply methods all now return DruidPredicateMatch. This allows matchers and indexes * isNullInputUnknown has been removed from DruidPredicateFactory * rename, fix test * adjust * style * npe * more test * fix default value mode to not match new test	2024-01-05 19:08:02 -08:00
AmatyaAvadhanula	c41e99e10c	Do not allocate week granular segments unless requested (#15589 ) * Do not allocate week granular segments unless explicitly requested	2024-01-05 12:14:52 +05:30
Clint Wylie	f19ece146f	expression virtual column indexes (#15585 ) * ExpressionVirtualColumn + indexes = bff. Expression virtual columns can now use indexes of the underlying columns similar to how expression filters	2024-01-03 21:00:39 -08:00
Gian Merlino	e40b96e026	Reverse lookup fixes and enhancements. (#15611 ) * Reverse lookup fixes and enhancements. 1) Add a "mayIncludeUnknown" parameter to DimFilter#optimize. This is important because otherwise the reverse-lookup optimization is done improperly when the "in" filter appears under a "not", and the lookup extractionFn may return null for some possible values of the filtered column. The "includeUnknown" test cases in InDimFilterTest illustrate the difference in behavior. 2) Enhance InDimFilter#optimizeLookup to handle "mayIncludeUnknown", and to be able to do a reverse lookup in a wider variety of cases. 3) Make "unapply" protected in LookupExtractor, and move callers to "unapplyAll". The main reason is that MapLookupExtractor, a common implementation, lacks a reverse mapping and therefore does a scan of the map for each call to "unapply". For performance sake these calls need to be batched. * Remove optimize call from BloomDimFilter. * Follow the law. * Fix tests. * Fix imports. * Switch function. * Fix tests. * More tests.	2024-01-03 13:28:44 -08:00
Gian Merlino	01eec4a55e	New handling for COALESCE, SEARCH, and filter optimization. (#15609) * New handling for COALESCE, SEARCH, and filter optimization. COALESCE is converted by Calcite's parser to CASE, which is largely counterproductive for us, because it ends up duplicating expressions. In the current code we end up un-doing it in our CaseOperatorConversion. This patch has a different approach: 1) Add CaseToCoalesceRule to convert CASE back to COALESCE earlier, before the Volcano planner runs. 2) Add FilterDecomposeCoalesceRule to decompose calls like "f(COALESCE(x, y))" into "(x IS NOT NULL AND f(x)) OR (x IS NULL AND f(y))". This helps use indexes when available on x and y. 3) Add CoalesceLookupRule to push COALESCE into the third arg of LOOKUP. 4) Add a native "coalesce" function so we can convert 3+ arg COALESCE. The advantage of this approach is that by un-doing the COALESCE to CASE conversion earlier, we have flexibility to do more stuff with COALESCE (like decomposition and pushing into LOOKUP). SEARCH is an operator used internally by Calcite to represent matching an argument against some set of ranges. This patch improves our handling of SEARCH in two ways: 1) Expand NOT points (point "holes" in the range set) from SEARCH as `!(a || b)` rather than `!a && !b`, which makes it possible to convert them to a "not" of "in" filter later. 2) Generate those nice conversions for NOT points even if the SEARCH is not composed of 100% NOT points. Without this change, a SEARCH for "x NOT IN ('a', 'b') AND x < 'm'" would get converted like "x < 'a' OR (x > 'a' AND x < 'b') OR (x > 'b' AND x < 'm')". One of the steps we take when generating Druid queries from Calcite plans is to optimize native filters. This patch improves this step: 1) Extract common ANDed predicates in ConvertSelectorsToIns, so we can convert "(a && x = 'b') || (a && x = 'c')" into "a && x IN ('b', 'c')".
2) Speed up CombineAndSimplifyBounds and ConvertSelectorsToIns on ORs with lots of children by adjusting the logic to avoid calling "indexOf" and "remove" on an ArrayList. 3) Refactor ConvertSelectorsToIns to reduce duplicated code between the handling for "selector" and "equals" filters. * Not so final. * Fixes. * Fix test. * Fix test.	2024-01-03 08:56:22 -08:00
Gian Merlino	b0e52c99bb	Fix ColumnSelectorColumnIndexSelector#getColumnCapabilities. (#15614 ) * Fix ColumnSelectorColumnIndexSelector#getColumnCapabilities. It was using virtualColumns.getColumnCapabilities, which only returns capabilities for virtual columns, not regular columns. The effect of this is that expression filters (and in some cases, arrayContainsElement filters) would build value matchers rather than use indexes. I think this has been like this since #12315, which added the getColumnCapabilities method to BitmapIndexSelector, and included the same implementation as exists in the code today. This error is easy to make due to the design of virtualColumns.getColumnCapabilities, so to help avoid it in the future, this patch renames the method to getColumnCapabilitiesWithoutFallback to emphasize that it does not return capabilities for regular columns. * Make getColumnCapabilitiesWithoutFallback package-private. * Fix expression filter bitmap usage.	2024-01-02 21:09:18 -08:00
Parth Agrawal	8505e8a909	Provide default implementation for RowFunction evalDimension method (#15452) PR #13947 introduced a function evalDimension() in the interface RowFunction. No default implementation was added to the interface, which causes all implementations and custom transforms to fail unless they implement their own version of the evalDimension method. This PR adds a default implementation in the interface which allows evalDimension to return its value as a singleton array of the eval result.	2024-01-02 11:14:23 +05:30
AlbericByte	a2e65e6a89	Support to pass dynamic values to timestamp Extract function (#15586) Fixes #15072 Before this modification, the third parameter (timezone) was required to be a literal; an error was thrown when this parameter was a column identifier.	2023-12-21 11:57:52 +05:30
Clint Wylie	8a45efbf65	fix some null handling bugs with vector expression processors (#15587 )	2023-12-19 08:14:17 -08:00
Kashif Faraz	9f568858ef	Add logging implementation for AuditManager and audit more endpoints (#15480) Changes - Add `log` implementation for `AuditManager` along with `SQLAuditManager` - `LoggingAuditManager` simply logs the audit event. Thus, it returns empty for all `fetchAuditHistory` calls. - Add new config `druid.audit.manager.type` which can take values `log`, `sql` (default) - Add new config `druid.audit.manager.logLevel` which can take values `DEBUG`, `INFO`, `WARN`. This gets activated only if `type` is `log`. - Remove usage of `ConfigSerde` from `AuditManager` as audit is not just limited to configs - Add `AuditSerdeHelper` for a single implementation of serialization/deserialization of audit payload and other utility methods.	2023-12-19 13:14:04 +05:30
Clint Wylie	e373f62692	fix expression post aggregator array handling when grouping wrapper types leak (#15543 ) * fix expression post aggregator array handling when grouping wrapper types leak * more consistent expression function error messaging	2023-12-15 21:43:27 -08:00
Tom	901ebbb744	Allow for kafka emitter producer secrets to be masked in logs (#15485) * Allow for kafka emitter producer secrets to be masked in logs instead of being visible This change will allow for kafka producer config values that should be secrets to not show up in the logs. This improves security for people who use the kafka emitter, if they choose to enable it. This is opt in and will not affect prior configs for this emitter * fix checkstyle issue * change property name	2023-12-15 12:21:21 -05:00
Zoltan Haindrich	7552dc49fb	Reduce amount of expression objects created during evaluations (#15552) I was looking into a query which was performing a bit poorly because the case_searched was touching more than 1 column (if there is only 1 column there is a cache based evaluator). While I was doing that I noticed that there are a few simple things which could help a bit: use a static TRUE/FALSE instead of creating a new object every time; create the ExprEval early for ConstantExpr-s (except the one for BigInteger, which seems to have some odd contract); return early from type autodetection. These changes mostly reduce the amount of garbage the query creates during case_searched evaluation; ExpressionSelectorBenchmark shows improvements of ~15%, but my manual trials on the taxi dataset with 60M rows showed more improvements, probably due to the fact that these changes mostly only reduce gc pressure.	2023-12-15 16:11:59 +05:30
Soumyava	3e15522d6b	Round works correctly on system metadata columns (#15554 )	2023-12-13 17:23:14 -08:00
Clint Wylie	e55f6b6202	remove search auto strategy, estimateSelectivity of BitmapColumnIndex (#15550 ) * remove search auto strategy, estimateSelectivity of BitmapColumnIndex * more cleanup	2023-12-13 16:30:01 -08:00
zachjsh	857693f5cf	Decorate sampling response with system fields if specified (#15536 ) * * decorate sampling response with system fields if specified * * add unit test	2023-12-13 12:16:59 -08:00
Ankit Kothari	8735d023a1	Add experimental support for first/last for double/float/long #10702 (#14462) Add experimental support for doubleLast, doubleFirst, floatLast, floatFirst, longLast and longFirst.	2023-12-12 11:36:51 +05:30
Abhishek Radhakrishnan	96be82a3e6	Clean up duty for non-overlapping eternity tombstones (#15281 ) * Add initial draft of MarkDanglingTombstonesAsUnused duty. * Use overshadowed segments instead of all used segments. * Add unit test for MarkDanglingSegmentsAsUnused duty. * Add mock call * Simplify code. * Docs * shorter lines formatting * metric doc * More tests, refactor and fix up some logic. * update javadocs; other review comments. * Make numCorePartitions as 0 in the TombstoneShardSpec. * fix up test * Add tombstone core partition tests * Update docs/design/coordinator.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * review comment * Minor cleanup * Only consider tombstones with 0 core partitions * Need to register the test shard type to make jackson happy * test comments * checkstyle * fixup misc typos in comments * Update logic to use overshadowed segments * minor cleanup * Rename duty to eternity tombstone instead of dangling. Add test for full eternity tombstone. * Address review feedback. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-12-11 08:57:15 -08:00
Clint Wylie	42f2496b7d	fix bug with nested empty array fields (#15532 )	2023-12-09 12:20:21 -08:00
Rishabh Singh	54df235026	Lazily build Filter in FilteredAggregatorFactory to avoid parsing exceptions in Router (#15526) Query with lookups in FilteredAggregator fails with this exception in router: Cannot construct instance of `org.apache.druid.query.aggregation.FilteredAggregatorFactory`, problem: Lookup [campaigns_lookup[campaignId][is_sold][autodsp]] not found at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 913] (through reference chain: org.apache.druid.query.groupby.GroupByQuery["aggregations"]->java.util.ArrayList[1]) The problem is that the constructor of FilteredAggregatorFactory validates whether the lookup exists via the statement dimFilter.toFilter(). This fails on the router, which is to be expected, because the router isn’t assigned any lookups. The fix is to initialise the filter object lazily instead of in the constructor.	2023-12-09 12:18:37 +05:30
Clint Wylie	e7c8f2e208	lift restriction of array_to_mv to only support direct column access (#15528 )	2023-12-08 16:27:17 -08:00
Clint Wylie	e64b92eb35	add JSON_QUERY_ARRAY function to pluck ARRAY<COMPLEX<json>> out of COMPLEX<json> (#15521 )	2023-12-08 05:28:46 -08:00
Zoltan Haindrich	c353ccfdef	Windowed min aggregates null-s as 0 (#15371 )	2023-12-08 01:41:16 -08:00
Clint Wylie	1eafe983ec	fix array presenting columns to not match single element arrays to scalars for equality (#15503 ) * fix array presenting columns to not match single element arrays to scalars for equality * update docs to clarify usage model of mixed type columns	2023-12-08 01:22:07 -08:00
sb89594	5fda8613ad	Feature: Add IPv6 Match Function (#15212 )	2023-12-07 23:09:06 -08:00
AlbericByte	935aa187a0	add Assert function to verify in the DataGeneratorTest (#15504 ) * add Assert function to verify in the DataGeneratorTest * remove unused log in DataGeneratorTest * add comment for DataGeneratorTest	2023-12-08 09:12:17 +08:00
Clint Wylie	c241c6980c	store auto columns with only empty or null containing arrays as ARRAY<LONG> instead of COMPLEX<json> (#15505 )	2023-12-07 03:31:43 -08:00
Clint Wylie	557f3f6f57	add array column type support to EXTEND operator (#15458 )	2023-12-06 23:21:35 -08:00
Gian Merlino	6f51155ccb	Fix NullFilter getDimensionRangeSet. (#15500 ) It wasn't checking the column name, so it would return a domain regardless of the input column. This means that null filters on data sources with range partitioning would lead to excessive pruning of segments, and therefore missing results.	2023-12-06 15:09:59 +05:30
Clint Wylie	0516d0dae4	simplify IncrementalIndex since group-by v1 has been removed (#15448 )	2023-11-29 14:46:16 -08:00
Pranav	93cd638645	Enabling aggregateMultipleValues in all StringAnyAggregators (#15434 ) * Enabling aggregateMultipleValues in all StringAnyAggregators * Adding more tests * More validation * fix warning * updating asserts in decoupled mode * fix intellij inspection * Addressing comments * Addressing comments * Adding early validations and make aggregate consistent across all * fixing tests * fixing tests * Update docs/querying/sql-aggregations.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * fixing static check --------- Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-11-29 14:32:49 -08:00
Clint Wylie	64fcb32bcf	add native 'array contains element' filter (#15366 ) * add native arrayContainsElement filter to use array column element indexes	2023-11-29 03:33:00 -08:00
Clint Wylie	97623b408c	add optional 'castToType' parameter to 'auto' column schema (#15417 ) * auto but.. with an expected type	2023-11-28 17:19:23 -08:00
Zoltan Haindrich	eb056e23b5	Fix dictionarySize overrides in tests (#15354) I think this is a problem as it discards the false return value when putToKeyBuffer can't store the value because of the limit. Not forwarding the return value at that point may lead to normal continuation regardless of whether something failed to be added to the dictionary.	2023-11-28 18:49:09 +05:30
Kashif Faraz	58a724c7e4	Use StubServiceEmitter in tests (#15426 ) * Use StubServiceEmitter in tests * Remove unthrown exception from declaration	2023-11-28 09:43:09 +05:30
Zoltan Haindrich	dff5bcb0a6	Fix resultcache multiple postaggregation restore (#15402 ) Fixes https://github.com/apache/druid/issues/15393	2023-11-21 15:58:20 +05:30
Abhishek Radhakrishnan	470c8ed7b0	Make `numCorePartitions` as 0 for tombstones (#15379 ) * Make numCorePartitions as 0 in the TombstoneShardSpec. * fix up test * Add tombstone core partition tests * review comment * Need to register the test shard type to make jackson happy	2023-11-20 09:42:51 -08:00
Clint Wylie	a95c22ce70	support non-constant expressions for path arguments for json_value and json_query (#15320 ) * support dynamic expressions for path arguments for json_value and json_query	2023-11-17 01:12:05 -08:00
Yashdeep Thorat	7b5790c72c	Fix flaky tests in ParserTest.java (#15318 ) Fixed the following flaky tests: org.apache.druid.math.expr.ParserTest#testApplyFunctions org.apache.druid.math.expr.ParserTest#testSimpleMultiplicativeOp1 org.apache.druid.math.expr.ParserTest#testFunctions org.apache.druid.math.expr.ParserTest#testSimpleLogicalOps1 org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp1 org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp2 The above mentioned tests have been reported as flaky (tests assuming deterministic implementation of a non-deterministic specification ) when ran against the NonDex tool. The tests contain assertions (Assertion 1 & Assertion 2) that compare an ArrayList created from a HashSet using the ArrayList() constructor with another List. However, HashSet does not guarantee the ordering of elements and thus resulting in these flaky tests that assume deterministic implementation of HashSet. Thus, when the NonDex tool shuffles the HashSet elements, it results in the test failures: Co-authored-by: ythorat2 <ythorat2@illinois.edu>	2023-11-17 12:29:23 +05:30
Abhishek Radhakrishnan	2e79fd56a7	MSQ generates tombstones honoring granularity specified in a `REPLACE` query. (#15243 ) * MSQ generates tombstones honoring the query's granularity. This change tweaks to only account for the infinite-interval tombstones. For finite-interval tombstones, the MSQ query granularity will be used which is consistent with how MSQ works. * more tests and some cleanup. * checkstyle * comment edits * Throw TooManyBuckets fault based on review; add more tests. * Add javadocs for both methods on reconciling the methods. * review: Move testReplaceTombstonesWithTooManyBucketsThrowsException to MsqFaultsTest * remove unused imports. * Move TooManyBucketsException to indexing package for shared exception handling. * lower max bucket for tests and fixup count * Advance and count the iterator. * checkstyle	2023-11-14 23:35:36 -08:00
Adarsh Sanjeev	a134cc30a6	Change default inSubQueryThreshold (#15336 )	2023-11-14 14:08:12 +05:30
Pranav	e2fde8c516	Refactor lookups behavior while loading/dropping the containers (#14806 )	2023-11-07 10:07:28 -08:00
Rishabh Singh	8c802e4c9b	Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency (#14985 ) In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal. To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.	2023-11-04 19:33:25 +05:30
Clint Wylie	5d39b94149	allow compaction to work with spatial dimensions (#15321 )	2023-11-03 11:27:50 -07:00
Gian Merlino	98f1eb8ede	Use filters for pruning properly for hash-joins. (#15299 ) * Use filters for pruning properly for hash-joins. Native used them too aggressively: it might use filters for the RHS to prune the LHS. MSQ used them not at all. Now, both use them properly, pruning based on base (LHS) columns only. * Fix tests. * Fix style. * Clear filterFields too. * Update.	2023-11-03 07:29:16 -07:00
Gian Merlino	d87d92bc43	Add system fields to input sources. (#15276 ) * Add system fields to input sources. Main changes: 1) The SystemField enum defines system fields "__file_uri", "__file_path", and "__file_bucket". They are associated with each input entity. 2) The SystemFieldInputSource interface can be added to any InputSource to make it system-field-capable. It sets up serialization of a list of configured "systemFields" in the JSON form of the input source, and provides a method getSystemFieldValue for computing the value of each system field. Cloud object, HDFS, HTTP, and Local now have this. * Fix various LocalInputSource calls. * Fix style stuff. * Fixups. * Fix tests and coverage.	2023-11-02 10:31:28 -07:00
Clint Wylie	d261587f4a	explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds (#15245 ) * better documentation for the differences between arrays and mvds * add outputType to ExpressionPostAggregator to make docs true * add output coercion if outputType is defined on ExpressionPostAgg * updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables	2023-11-02 00:31:37 -07:00
Gian Merlino	37e158c2c4	Frames: consider writing singly-valued column when input column hasMultipleValues is UNKNOWN. (#15300 ) * Frames: consider writing singly-valued column when input column hasMultipleValues is UNKNOWN. Prior to this patch, columnar frames would always write multi-valued columns if the input column had hasMultipleValues = UNKNOWN. This had the effect of flipping UNKNOWN to TRUE when copying data into frames, which is problematic because TRUE causes expressions to assume that string inputs must be treated as arrays. We now avoid this by flipping UNKNOWN to FALSE if no multi-valuedness is encountered, and flipping it to TRUE if multi-valuedness is encountered. * Add regression test case.	2023-11-01 22:05:53 -07:00
Vishesh Garg	a27598a487	Segregate advance and advanceUninterruptibly flow in postJoinCursor to allow for interrupts in advance (#15222 ) Currently advance function in postJoinCursor calls advanceUninterruptibly which in turn keeps calling baseCursor.advanceUninterruptibly until the post join condition matches, without checking for interrupts. This causes the CPU to hit 100% without getting a chance for query to be cancelled. With this change, the call flow of advance and advanceUninterruptibly is separated out so that they call baseCursor.advance and baseCursor.advanceUninterruptibly in them, respectively, giving a chance for interrupts in the former case between successive calls to baseCursor.advance.	2023-10-30 14:39:15 +05:30
Ben Sykes	275c1ec64c	Fix error assuming a Complex Type that is a Number is a double (#15272 ) * Fix error assuming a Complex Type that is a Number is a double In the case where a complex type is a number, it may not be castable to double. It can safely be cast as Number first to get to the doubleValue.	2023-10-30 09:52:52 +05:30
Zoltan Haindrich	f4a74710e6	Process pure ordering changes with windowing operators (#15241 ) - adds a new query build path: DruidQuery#toScanAndSortQuery which: - builds a ScanQuery without considering the current ordering - builds an operator to execute the sort - fixes a null string to "null" literal string conversion in the frame serializer code - fixes some DrillWindowQueryTest cases - fix NPE in NaiveSortOperator in case there was no input - enables back CoreRules.AGGREGATE_REMOVE - adds a processing level OffsetLimit class and uses that instead of just the limit in the rac parts - earlier window expressions on top of a subquery with an offset may have ignored the offset	2023-10-29 16:40:49 +05:30
Simon Hofbauer	e9b7e4a0eb	fix JSON flaky tests (#15261 ) Co-authored-by: simonh5 <simonh5@illinois.edu>	2023-10-26 20:27:09 -07:00
Zoltan Haindrich	f48263bbb3	Report function name for unknown exceptions during execution (#14987 ) * provide function name when unknown exceptions are encountered * fix keywords/etc * fix keyword order - regex exercise * add test * add check&fix keywords * decoupledIgnore * Revert "decoupledIgnore" This reverts commit `e922c820a7`. * unpatch Function * move to a different location * checkstyle	2023-10-25 13:37:30 -07:00
Zoltan Haindrich	6784e9c507	Fix summary row issues in case postaggregations are happening (#15232 ) * fix-1/2 * add message v1 * extend test to cover for IOB issue * move stuff around * change message * fix testcase string * compute postaggs (thank you Clint!) * enable feature for test * ignore tests in msq --------- Co-authored-by: Soumyava Das <soumyava@users.noreply.github.com>	2023-10-24 20:33:59 -07:00
Clint Wylie	4149c9422c	cleanup temp files for nested column serializer (#15236 ) * cleanup temp files for nested column serializer * fix style * fix tests in default value mode	2023-10-24 15:30:00 -07:00
Zoltan Haindrich	b95035f183	Fix VirtualColumn related issues in window expressions (#15119 ) for some exotic queries like: SELECT '_'||dim1, MIN(cast(0 as double)) OVER (), MIN(cast((cnt||cnt) as bigint)) OVER () FROM foo the compilation resulted in NPEs, mostly because VirtualColumns were not handled properly	2023-10-23 14:05:59 +05:30
Clint Wylie	c8e458452d	Fix native is boolean filter cache key tests to test the right thing (#15216 )	2023-10-23 11:24:46 +05:30
Clint Wylie	5c14b42e72	fix incorrect unnest dimension cursor value matcher implementation (#15192 )	2023-10-18 16:43:06 -07:00
Clint Wylie	061cfee224	add native filters for "(filter) is true" and "(filter) is false" (#15182 ) * add native filters for "(filter) is true" and "(filter) is false" changes: * add IsTrueDimFilter, IsFalseDimFilter, and abstract IsBooleanDimFilter for native json filter implementations of `(filter) IS TRUE` and `(filter) IS FALSE` * add IsBooleanFilter for actual filtering logic for these filters, which ignore includeUnknown to always use matches with false for true and !matches with true for false * fix test incorrectly adjusted to wrong answer in #15058 * add tests for default value mode	2023-10-18 13:07:35 -07:00
Clint Wylie	22034a1630	preserve Rows.objectToStrings behavior of translating null into "null" inside of lists and arrays (#15190 )	2023-10-17 19:49:36 -07:00
Laksh Singla	b4540ed5d4	Optimize the reading of numerical frame arrays in MSQ (#15175 )	2023-10-18 02:33:42 +05:30
Laksh Singla	dc8d2192c3	Introduce natural comparator for types that don't have a StringComparator (#15145 ) Fixes a bug when executing queries with the ordering of arrays	2023-10-16 10:37:32 +05:30
Pranav	4b0d1b3488	Fix expression result writing of arrays in Hadoop Ingestion (#15127 )	2023-10-13 13:41:41 -07:00
Zoltan Haindrich	6d62c75866	Fix columns with null values in windowing expressions (#15131 )	2023-10-13 10:42:45 -04:00
AmatyaAvadhanula	d25caaefa4	Add support for streaming ingestion with concurrent replace (#15039 ) Add support for streaming ingestion with concurrent replace --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-10-13 09:09:03 +05:30
Clint Wylie	d0f64608eb	sql compatible three-valued logic native filters (#15058 ) * sql compatible tri-state native logical filters when druid.expressions.useStrictBooleans=true and druid.generic.useDefaultValueForNull=false, and new druid.generic.useThreeValueLogicForNativeFilters=true * log.warn if non-default configurations are used to guide operators towards SQL compliant behavior	2023-10-12 00:06:23 -07:00
Zoltan Haindrich	ae88f2c0b6	Fix non-sqlcompat validation in CalciteWindowQueryTest (#15086 ) * fixes * check for latest rewrite place * Revert "check for latest rewrite place" This reverts commit `5cf1e2c1ca`. * some stuff (cherry picked from commit ab346d4373ea888eb8ef6115e018e7fb0d27407f) * update test output * updates to test outputs * some stuff * move validator * cleanup * fix * change test slightly * add apidoc cleanup warnings * cleanup/etc * instead of telling the story; add a fail with some reason whats the issue * lead-lag fix * add test * remove unnecessary throw * druidexception-trial * Revert "druidexception-trial" This reverts commit `8fa06644bc`. * undo changes to no_grouping; add no_grouping2 * add missing assert on resultcount * rename method; update * introduce enum/etc * make resultmatchmode accessible from TestBuilder#expectedResults * fix dump results to use log * fix * handle null correctly * disable feature type based things for MSQ * fix varianssqlaggtest * use eps in other test * fix intellij error * add final * address review * update test/string/etc * write concat in 3 lines :D	2023-10-11 12:34:31 -07:00
Laksh Singla	5f86072456	Prepare master for Druid 29 (#15121 ) Prepare master for Druid 29	2023-10-11 10:33:45 +05:30
Karan Kumar	48f35b3fdd	Add query id to processing pool thread name. (#15059 ) This patch changes the thread name of the processing pool of the indexers/peons/historicals from query.getType() + "_" + query.getDataSource() + "_" + query.getIntervals() to query.getId()	2023-10-10 05:59:03 +05:30
Laksh Singla	95bf331c08	Rename the default setting of 'maxSubqueryBytes' from 'unlimited' to 'disabled' (#15108 ) The default setting of 'maxSubqueryBytes' is renamed from 'unlimited' to 'disabled'.	2023-10-10 02:03:29 +05:30
Clint Wylie	1fc8fb1b20	add a bunch of tests with array typed columns to CalciteArraysQueryTest (#15101 ) * add a bunch of tests with array typed columns to CalciteArraysQueryTest * fix a bug with unnest filter pushdown when filtering on unnested array columns	2023-10-09 06:16:06 -07:00
Laksh Singla	549ef56288	UNION ALLs in MSQ (#14981 ) MSQ now supports UNION ALL with UnionDataSource	2023-10-09 18:18:15 +05:30
Adarsh Sanjeev	7a35ce886d	Add ability for MSQ tasks to query realtime tasks (#15024 ) This PR aims to add the capabilities to: 1. Fetch the realtime segment metadata from the coordinator server view, 2. Adds the ability for workers to query indexers, similar to how brokers do the same for native queries.	2023-10-09 15:14:03 +05:30
kaisun2000	e2cc1c4ad1	Add metric -- count of queries waiting for merge buffers (#15025 ) Add 'mergeBuffer/pendingRequests' metric that exposes the count of waiting queries (threads) blocking in the merge buffers pools.	2023-10-09 12:56:23 +05:30
Gian Merlino	c483cb863d	Fix IndexerWorkerClient#fetchChannelData when response has data and error. (#15084 ) * Fix IndexerWorkerClient#fetchChannelData when response has data and error. When a channel data response from a worker includes some data and then some I/O error, then when the call is retried, we will re-read the set of data that was read by the previous connection and add it to the local channel again. This causes the local channel to become corrupted. The patch fixes this case by skipping data that has already been read.	2023-10-09 11:12:28 +05:30
Soumyava	57ab8e13dc	Updating plans when using joins with unnest on the left (#15075 ) * Updating plans when using joins with unnest on the left * Correcting segment map function for hashJoin * The changes done here are not reflected into MSQ yet so these tests might not run in MSQ * native tests * Self joins with unnest data source * Making this pass * Addressing comments by adding explanation and new test	2023-10-06 19:23:12 -07:00
Pranav	06c5527c85	Allow aliasing of Macros and add new alias for complex decode 64 (#15034 ) * Add AliasExprMacro to allow aliasing of native expression macros * Add decode_base64_complex alias for complex_decode_base64	2023-10-05 16:24:36 -07:00
Laksh Singla	2c286d6f42	Fix monomorphic processing code running on JDK8 since it references a non-existing method (#15092 ) Code relying on monomorphic processing on JDK8 doesn't work correctly, since it tries to reference getArrayLength using method handles, which might have been accidentally removed here since it seems unused. This PR adds the method back as is.	2023-10-05 11:05:38 +05:30
Clint Wylie	b4bc9b6950	fix issue with auto columns with mix of scalar values and empty arrays (#15083 )	2023-10-05 10:15:45 +05:30
Laksh Singla	b8d03d36b0	Free up the resources when materializing the results as Frames (#15032 ) Refactor the code to clean up the result sequences when materializing the results as Frames	2023-10-05 10:14:27 +05:30
Clint Wylie	3afe09a19d	urlencode nested serializer temp file names so they dont explode stuff (#15068 ) Fixes a bug caused by #14919, which was just using the column name as part of a temp file name, which.. isn't very cool, my bad. Switched to use StringUtils.urlEncode so that ugly chars don't explode stuff. The modified test fails without the changes in this PR.	2023-10-05 10:13:45 +05:30
Laksh Singla	30cf76db99	Field writers for numerical arrays (#14900 ) Row-based frames, and by extension, MSQ now supports numeric array types. This means that all queries consuming or producing arrays would also work with MSQ. Numeric arrays can also be ingested via MSQ. Post this patch, queries like, SELECT [1, 2] would work with MSQ since they consume a numeric array, instead of failing with an unsupported column type exception.	2023-10-04 23:16:47 +05:30
Gian Merlino	a9021e4cd7	Fix NPE with lenient aggregators merging in segmentMetadata. (#15078 ) When merging analyses, lenient merging sets unmergeable aggregators to null. Merging such a null aggregator record into a nonnull record would potentially lead to NPE in getMergingFactory. The new code only calls getMergingFactory if both the old and new aggregators are nonnull; else, if either is null, then the merged aggregator is also set to null.	2023-10-04 02:41:41 -07:00
Clint Wylie	632811b285	fix json compat layer to not rewrite v4 into v5 after segment merging (#14997 )	2023-10-04 00:18:18 -07:00
Gian Merlino	2ed4fd1ae3	Compute broadcast-join segmentMapFn only once per worker. (#15007 ) This patch introduces "processor managers" to processor factories, as a replacement for the sequence of processors. Processor managers can use the results of earlier processors to influence the creation of later processors, which provides us with the building block we need to ensure that broadcast join data is only read once. In particular, when broadcast join is happening, the BaseFrameProcessorFactory now uses a ChainedProcessorManager to first run BroadcastJoinSegmentMapFnProcessor (in a single thread), and then run all of the regular processors (possibly multithreaded).	2023-10-04 11:47:00 +05:30
Vishesh Garg	7e8f3e69ef	Avoid intermediate offsets in bucketStart calculation logic to handle DST transition (#15038 ) When moving timestamps by an offset using org.joda.time.chrono.ISOChronology library, if the new timestamp falls in Daylight Savings Time (DST) transition period, the library rounds it off to the nearest valid time. This can lead to incorrect final timestamp when calculated using intermediate offsets landing in DST transition, for e.g. +21D arrived at using +14D and +7D offset, where +14D lands in DST transition period. Since bucketStart values are calculated using this library, this behaviour can lead to incorrect bucketStart times.	2023-10-04 11:32:29 +05:30
Xavier Léauté	adef2069b1	Make unit tests pass with Java 21 (#15014 ) This change updates dependencies as needed and fixes tests to remove code incompatible with Java 21 As a result all unit tests now pass with Java 21. * update maven-shade-plugin to 3.5.0 and follow-up to #15042 * explain why we need to override configuration when specifying outputFile * remove configuration from dependency management in favor of explicit overrides in each module. * update to mockito to 5.5.0 for Java 21 support when running with Java 11+ * continue using latest mockito 4.x (4.11.0) when running with Java 8 * remove need to mock private fields * exclude incorrectly declared mockito dependency from pac4j-oidc * remove mocking of ByteBuffer, since sealed classes can no longer be mocked in Java 21 * add JVM options workaround for system-rules junit plugin not supporting Java 18+ * exclude older versions of byte-buddy from assertj-core * fix for Java 19 changes in floating point string representation * fix missing InitializedNullHandlingTest * update easymock to 5.2.0 for Java 21 compatibility * update animal-sniffer-plugin to 1.23 * update nl.jqno.equalsverifier to 3.15.1 * update exec-maven-plugin to 3.1.0	2023-10-03 22:41:21 -07:00
George Shiqi Wu	64754b6799	Allow users to pass task payload via deep storage instead of environment variable (#14887 ) This change is meant to fix an issue where passing too large a task payload to the mm-less task runner causes the peon to fail to start up, because the payload is passed (compressed) as an environment variable (TASK_JSON). On Linux systems the limit for an environment variable is commonly 128KB; on Windows systems it is less than this. Setting an env variable longer than this results in a bunch of "Argument list too long" errors.	2023-10-03 14:08:59 +05:30
Pranav	f1edd671fb	Exposing optional replaceMissingValueWith in lookup function and macros (#14956 ) * Exposing optional replaceMissingValueWith in lookup function and macros * args range validation * Updating docs * Addressing comments * Update docs/querying/sql-scalar.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Update docs/querying/sql-functions.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Addressing comments --------- Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-10-02 17:09:23 -07:00
Pranav	07c28f17ca	Fix missing format strings in calls to DruidException.build (#15056 ) * Fix the NPE bug in nonStrictFormat * using non null format string * using Assert.assertThrows	2023-09-29 17:00:36 -07:00
Karan Kumar	2f1bcd6717	Adding `segment/scan/active` metric for processing thread pool. (#15060 )	2023-09-29 12:34:28 -07:00
Zoltan Haindrich	022950a0c5	MV_FILTER_ONLY may run into Exceptions in case duplicate values were processed (#15012 )	2023-09-27 19:19:42 +05:30
Gian Merlino	3dabfead05	Fix getResultType for HLL, quantiles aggregators. (#15043 ) The aggregators had incorrect types for getResultType when shouldFinalize is false. They had the finalized type, but they should have had the intermediate type. Also includes a refactor of how ExprMacroTable is handled in tests, to make it easier to add tests for this to the MSQ module. The bug was originally noticed because the incorrect result types caused MSQ queries with DS_HLL to behave erratically.	2023-09-27 08:51:14 +05:30
Gian Merlino	0850e615b2	Remove istrue, isfalse vectorized impls. (#14991 ) These were added in #14977, but the implementations are incorrect, because they return null when the input arg is null. They should return false when the input is null. Remove them for now, rather than fixing them, since they're so new that they might as well never have existed.	2023-09-25 11:34:24 +05:30
AmatyaAvadhanula	c62193c4d7	Add support for concurrent batch Append and Replace (#14407 ) Changes: - Add task context parameter `taskLockType`. This determines the type of lock used by a batch task. - Add new task actions for transactional replace and append of segments - Add methods StorageCoordinator.commitAppendSegments and commitReplaceSegments - Upgrade segments to appropriate versions when performing replace and append - Add new metadata table `upgradeSegments` to track segments that need to be upgraded - Add tests	2023-09-25 07:06:37 +05:30
Pranav	883c2692d2	Adding new function decode_base64_utf8 and expr macro (#14943 ) * Adding new function decode_base64_utf8 and expr macro * using BaseScalarUnivariateMacroFunctionExpr * Print stack trace in case of debug in ChainedExecutionQueryRunner * fix static check	2023-09-20 17:06:34 -07:00
Xavier Léauté	22abc10f24	update RoaringBitmap to 0.9.49 (#15006 ) * update RoaringBitmap to 0.9.49 update RoaringBitmap from 0.9.0 to 0.9.49 Many optimizations and improvements have gone into recent releases of RoaringBitmap. It seems worthwhile to incorporate those. * implement workaround for BatchIterator interface change * add test case for BatchIteratorAdapter.advanceIfNeeded	2023-09-20 15:52:27 -07:00
Gian Merlino	823f620ede	Add IS [NOT] DISTINCT FROM to SQL and join matchers. (#14976 ) * Add IS [NOT] DISTINCT FROM to SQL and join matchers. Changes: 1) Add "isdistinctfrom" and "notdistinctfrom" native expressions. 2) Add "IS [NOT] DISTINCT FROM" to SQL. It uses the new native expressions when generating expressions, and is treated the same as equals and not-equals when generating native filters on literals. 3) Update join matchers to have an "includeNull" parameter that determines whether we are operating in "equals" mode or "is not distinct from" mode. * Main changes: - Add ARRAY handling to "notdistinctfrom" and "isdistinctfrom". - Include null in pushed-down filters when using "notdistinctfrom" in a join. Other changes: - Adjust join filter analyzer to more explicitly use InDimFilter's ValuesSets, relying less on remembering to get it right to avoid copies. * Remove unused "wrap" method. * Fixes. * Remove methods we do not need. * Fix bug with INPUT_REF.	2023-09-20 10:44:32 -07:00
Zoltan Haindrich	79f882f48c	Fix exception cause logging in QueryResultPusher (#14975 )	2023-09-20 15:44:02 +05:30
Rohan Garg	39d95955f5	Do not eagerly close inner iterators in CloseableIterator#flatMap (#14986 )	2023-09-15 15:14:20 +05:30
Soumyava	279b3818f0	Make Unnest work with nullif operator (#14993 ) This is due to the recursive filter creation in unnest storage adapter not performing correctly in case of an empty children. This PR addresses the issue	2023-09-15 09:54:14 +05:30
Gian Merlino	3ae5e97801	Add IS [NOT] TRUE, IS [NOT] FALSE native functions. (#14977 ) They are not quite the same as "x == true", "x != true", etc. These functions never return null, even when "x" itself is null.	2023-09-14 09:19:09 -07:00
Soumyava	5c42ac8c4d	Fix for latest agg to handle nulls in time column. Also adding optimi… (#14911 ) * Fix for latest agg to handle nulls in time column. Also adding optimization for dictionary encoded string columns * One minor fix * Adding more tests for the new class * Changing the init to a putInt	2023-09-13 17:37:26 -07:00
Soumyava	bf99d2c7b2	Fix for schema mismatch to go down using the non vectorize path till we update the vectorized aggs properly (#14924 ) * Fix for schema mismatch to go down using the non vectorize path till we update the vectorized aggs properly * Fixing a failed test * Updating numericNilAgg * Moving to use default values in case of nil agg * Adding the same for first agg * Fixing a test * fixing vectorized string agg for last/first with cast if numeric * Updating tests to remove mockito and cover the case of string first/last on non string columns * Updating a test to vectorize * Addressing review comments: Name change to NilVectorAggregator and using static variables now * fixing intellij inspections	2023-09-13 13:15:14 -07:00
Clint Wylie	23b78c0f95	use mmap for nested column value to dictionary id lookup for more chill heap usage during serialization (#14919 )	2023-09-12 21:01:18 -07:00
Kashif Faraz	286eecad7c	Simplify DruidCoordinatorConfig and binding of metadata cleanup duties (#14891 ) Changes: - Move following configs from `CliCoordinator` to `DruidCoordinatorConfig`: - `druid.coordinator.kill.on` - `druid.coordinator.kill.pendingSegments.on` - `druid.coordinator.kill.supervisors.on` - `druid.coordinator.kill.rules.on` - `druid.coordinator.kill.audit.on` - `druid.coordinator.kill.datasource.on` - `druid.coordinator.kill.compaction.on` - In the Coordinator style used by historical management duties, always instantiate all the metadata cleanup duties but execute only if enabled. In the existing code, they are instantiated only when enabled by using optional binding with Guice. - Add a wrapper `MetadataManager` which contains handles to all the different metadata managers for rules, supervisors, segments, etc. - Add a `CoordinatorConfigManager` to simplify read and update of coordinator configs - Remove persistence related methods from `CoordinatorCompactionConfig` and `CoordinatorDynamicConfig` as these are config classes. - Remove annotations `@CoordinatorIndexingServiceDuty`, `@CoordinatorMetadataStoreManagementDuty`	2023-09-13 09:06:57 +05:30
Clint Wylie	891f0a3fe9	longer compatibility window for nested column format v4 (#14955 ) changes: * add back nested column v4 serializers * 'json' schema by default still uses the newer 'nested common format' used by 'auto', but now has an optional 'formatVersion' property which can be specified to override format versions on native ingest jobs * add system config to specify default column format stuff, 'druid.indexing.formats', and property 'druid.indexing.formats.nestedColumnFormatVersion' to specify system level preferred nested column format for friendly rolling upgrades from versions which do not support the newer 'nested common format' used by 'auto'	2023-09-12 14:07:53 -07:00
Zoltan Haindrich	5d16d0edf0	Count distinct returned incorrect results without useApproximateCountDistinct (#14748 ) * fix grouping engine handling of summaries when result set is empty	2023-09-12 13:57:54 -07:00
Suneet Saldanha	757603a773	Set task location as k8sPodName for mm-less ingestion (#14959 ) * Set task location as k8sPodName for mm-less ingestion * tests	2023-09-11 19:44:26 -07:00
Clint Wylie	2b7f2c5119	use VectorValueSelector instead of BaseLongVectorValueSelector for StringFirstAggregatorFactory.factorizeVector (#14957 )	2023-09-09 04:03:05 -07:00
Zoltan Haindrich	699893bcff	Fix StringLastAggregatorFactory equals/toString (#14907 ) * update test * update test * format * test * fix0 * Revert "fix0" This reverts commit `44992cb393`. * ok resultset * add plan * update test * before rewind * test * fix toString/compare/test * move test * add timeColumn to hashCode	2023-09-08 09:20:54 -07:00
Kashif Faraz	647686aee2	Add test and metrics for KillStalePendingSegments duty (#14951 ) Changes: - Add new metric `kill/pendingSegments/count` with dimension `dataSource` - Add tests for `KillStalePendingSegments` - Reduce no-op logs that spit out for each datasource even when no pending segments have been deleted. This can get particularly noisy at low values of `indexingPeriod`. - Refactor the code in `KillStalePendingSegments` for readability and add javadocs	2023-09-08 10:33:47 +05:30
Soumyava	a8fa979115	Unnest dont push down not (#14942 ) * Not pushing down not filters * New test case * Updating tests * Removing a stale comment	2023-09-06 08:57:03 -07:00
Laksh Singla	6ee0b06e38	Auto configuration for maxSubqueryBytes (#14808 ) A new monitor SubqueryCountStatsMonitor which emits the metrics corresponding to the subqueries and their execution is now introduced. Moreover, the user can now also use the auto mode to automatically set the number of bytes available per query for the inlining of its subquery's results.	2023-09-06 05:47:19 +00:00
Soumyava	8088a763a6	Vectorize earliest aggregator for both numeric and string types (#14408 ) * Vectorizing earliest for numeric * Vectorizing earliest string aggregator * checkstyle fix * Removing unnecessary exceptions * Ignoring tests in MSQ as earliest is not supported for numeric there * Fixing benchmarks * Updating tests as MSQ does not support earliest for some cases * Addressing review comments by adding the following: 1. Checking capabilities first before creating selectors 2. Removing mockito in tests for numeric first aggs 3. Removing unnecessary tests * Addressing issues for dictionary encoded single string columns where we can use the dictionary ids instead of the entire string * Adding a flag for multi value dimension selector * Addressing comments * 1 more change * Handling review comments part 1 * Handling review comments and correctness fix for latest_by when the time expression need not be in sorted order * Updating numeric first vector agg * Revert "Updating numeric first vector agg" This reverts commit `4291709901`. * Updating code for correctness issues * fixing an issue with latest agg * Adding more comments and removing an unnecessary check * Addressing null checks for tie selector and only vectorize false for quantile sketches	2023-09-05 08:41:42 -07:00
Kashif Faraz	7f26b80e21	Simplify ServiceMetricEvent.Builder (#14933 ) Changes: - Make ServiceMetricEvent.Builder extend ServiceEventBuilder<ServiceMetricEvent> and thus convert it to a plain builder rather than a builder of builder. - Add methods setCreatedTime, setMetricAndValue to the builder	2023-09-01 11:30:45 +05:30
Gian Merlino	004cd012e1	HttpClient: Include error handler on all connection attempts. (#14915 ) Currently we have an error handler for https connection attempts, but not for plaintext connection attempts. This leads to warnings like the following for plaintext connection errors: EXCEPTION, please implement org.jboss.netty.handler.codec.http.HttpContentDecompressor.exceptionCaught() for proper handling. This happens because if we don't add our own error handler, the last handler in the chain during a connection attempt is HttpContentDecompressor, which doesn't handle errors. The new error handler for plaintext doesn't do much: it just closes the channel.	2023-08-29 14:28:04 +05:30
Zoltan Haindrich	54336e2a3e	Improve on incremental compilation (#14860 ) This patch fixes a few issues toward #14858 1. some phony classes were added to enable maven to track the compilation of those classes 2. cyclonedx 2.7.9 seems to handle incremental compilation better; it had a PR relating to that 3. needed to update root pom to 25 4. update antlr to 4.5.3; older one didn't really work incrementally; 4.5.3 works much better	2023-08-24 16:06:16 +05:30
Clint Wylie	36e659a501	remove group-by v1 (#14866 ) * remove group-by v1 * docs * remove unused configs, fix test * fix test * adjustments * why not * adjust * review stuff	2023-08-23 12:44:06 -07:00
Clint Wylie	7b5012ea6e	override retry attempts for InputEntityIteratingReaderTest for much faster test run (#14897 )	2023-08-22 22:01:47 -07:00
Clint Wylie	fb053c399c	consolidate json and auto indexers, remove v4 nested column serializer (#14456 )	2023-08-22 18:50:11 -07:00
Clint Wylie	194a9c9abc	set druid.expressions.useStrictBooleans to true by default (#14734 )	2023-08-22 00:19:56 -07:00
Tejaswini Bandlamudi	d87056e708	Upgrade guava version to 31.1-jre (#14767 ) Currently, Druid is using Guava 16.0.1 version. This upgrade to 31.1-jre fixes the following issues. CVE-2018-10237 (Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows remote attackers to conduct denial of service attacks against servers that depend on this library and deserialize attacker-provided data because the AtomicDoubleArray class (when serialized with Java serialization) and the CompoundOrdering class (when serialized with GWT serialization) perform eager allocation without appropriate checks on what a client has sent and whether the data size is reasonable). We don't use Java or GWT serializations. Despite being false positive they're causing red security scans on Druid distribution. Latest version of google-client-api is incompatible with the existing Guava version. This PR unblocks Update google client apis to latest version #14414	2023-08-22 12:09:53 +05:30
Clint Wylie	5d1412949e	enable sql compatible null handling mode by default (#14792 ) * enable sql compatible null handling mode by default * fix bug with string first/last aggs when druid.generic.useDefaultValueForNull=false	2023-08-21 20:07:13 -07:00

3122 Commits