3018 Commits

Author SHA1 Message Date
Benedict Jin
96b4abc8e9
Add @VisibleForTesting annotation for the backingArray() method (#15690) 2024-01-18 19:30:10 -08:00
Gian Merlino
792e5c58e4
IncrementalIndex#add is no longer thread-safe. (#15697)
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
2024-01-18 03:45:22 -08:00
Gian Merlino
764f41d959
Clear "lineSplittable" for JSON when using KafkaInputFormat. (#15692)
* Clear "lineSplittable" for JSON when using KafkaInputFormat.

JsonInputFormat has a "withLineSplittable" method that can be used to
control whether JSON is read line-by-line, or as a whole. The intent
is that in streaming ingestion, "lineSplittable" is false (although it
can be overridden by "assumeNewlineDelimited"), and in batch ingestion,
lineSplittable is true.

When a "json" format is wrapped by a "kafka" format, this isn't set
properly. This patch updates KafkaInputFormat to set this on an
underlying "json" format.

The tests for KafkaInputFormat were overriding the "lineSplittable"
parameter explicitly, which wasn't really fair, because that made them
unrealistic to what happens in production. Now they omit the parameter
and get the production behavior.

* Add test.

* Fix test coverage.
2024-01-18 03:22:41 -08:00
Gian Merlino
d3d0c1c91e
Faster parsing: reduce String usage, list-based input rows. (#15681)
* Faster parsing: reduce String usage, list-based input rows.

Three changes:

1) Reworked FastLineIterator to optionally avoid generating Strings
   entirely, and reduce copying somewhat. Benefits the line-oriented
   JSON, CSV, delimited (TSV), and regex formats.

2) In the delimited (TSV) format, when the delimiter is a single byte,
   split on UTF-8 bytes directly.

3) In CSV and delimited (TSV) formats, use list-based input rows when
   the column list is provided upfront by the user.

* Fix style.

* Fix inspections.

* Restore validation.

* Remove fastutil-extra.

* Exception type.

* Fixes for error messages.

* Fixes for null handling.
2024-01-18 19:18:46 +08:00
Laksh Singla
fc06f2d075
Fix summary iterator in grouping engine(#15658)
This PR fixes the summary iterator to add aggregators in the correct position. The summary iterator is used when dims are not present, therefore the new change is identical to the old one, but seems more correct while reading.
2024-01-17 20:43:45 +05:30
Zoltan Haindrich
8a43db9395
Range support in window expressions (support them as groups) (#15365)
* support groups windowing mode; which is a close relative of ranges (but not in the standard)
* all windows with range expressions will be executed wit it groups
* it will be 100% correct in case for both bounds its true that: isCurrentRow() || isUnBounded()
  * this covers OVER ( ORDER BY COL )
* for other cases it will have some chances of getting correct results...
2024-01-17 00:05:21 -06:00
AmatyaAvadhanula
11dbfb6e3f
Better error message when partition space is exhausted (#15685)
* Better error message when partition space is exhausted
2024-01-16 12:32:40 +05:30
Sam Rash
072b16c6df
Fix SQL Innterval.of() error message (#15454)
Better error message for poorly constructed intervals
2024-01-15 22:34:35 -06:00
Gian Merlino
d359fb3d68
Cache value selectors in RowBasedColumnSelectorFactory. (#15615)
* Cache value selectors in RowBasedColumnSelectorFactory.

There was already caching for dimension selectors. This patch adds caching
for value (object and number) selectors. It's helpful when the same field is
read multiple times during processing of a single row (for example, by being
an input to both MIN and MAX aggregations).

* Fix typing.

* Fix logic.
2024-01-15 18:03:27 -08:00
Kashif Faraz
18d2a8957f
Refactor: Cleanup test impls of ServiceEmitter (#15683) 2024-01-15 17:37:00 +05:30
Ben Sykes
e49a7bb3cd
Add SpectatorHistogram extension (#15340)
* Add SpectatorHistogram extension

* Clarify documentation
Cleanup comments

* Use ColumnValueSelector directly
so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram.
When queried as a Number, we're returning the count of entries in the histogram.

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Fix references

* Fix spelling

* Update docs/development/extensions-contrib/spectator-histogram.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-01-14 09:52:30 -08:00
Gian Merlino
500681d0cb
Add ImmutableLookupMap for static lookups. (#15675)
* Add ImmutableLookupMap for static lookups.

This patch adds a new ImmutableLookupMap, which comes with an
ImmutableLookupExtractor. It uses a fastutil open hashmap plus two
lists to store its data in such a way that forward and reverse
lookups can both be done quickly. I also observed footprint to be
somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million
row lookup.

The main advantage, though, is that reverse lookups can be done much
more quickly than MapLookupExtractor (which iterates the entire map
for each call to unapplyAll). This speeds up the recently added
ReverseLookupRule (#15626) during SQL planning with very large lookups.

* Use in one more test.

* Fix benchmark.

* Object2ObjectOpenHashMap

* Fixes, and LookupExtractor interface update to have asMap.

* Remove commented-out code.

* Fix style.

* Fix import order.

* Add fastutil.

* Avoid storing Map entries.
2024-01-13 13:14:01 -08:00
Gian Merlino
cccf13ea82
Reverse, pull up lookups in the SQL planner. (#15626)
* Reverse, pull up lookups in the SQL planner.

Adds two new rules:

1) ReverseLookupRule, which eliminates calls to LOOKUP by doing
   reverse lookups.

2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above
   GROUP BY, when the lookup is injective.

Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether
these rules fire. Both are enabled by default.

To minimize the chance of performance problems due to many keys mapping to
the same value, ReverseLookupRule refrains from reversing a lookup if there
are more keys than `inSubQueryThreshold`. The rationale for using this setting
is that reversal works by generating an IN, and the `inSubQueryThreshold`
describes the largest IN the user wants the planner to create.

* Add additional line.

* Style.

* Remove commented-out lines.

* Fix tests.

* Add test.

* Fix doc link.

* Fix docs.

* Add one more test.

* Fix tests.

* Logic, test updates.

* - Make FilterDecomposeConcatRule more flexible.

- Make CalciteRulesManager apply reduction rules til fixpoint.

* Additional tests, simplify code.
2024-01-12 00:06:31 -08:00
Gian Merlino
2231cb30a4
Faster k-way merging using tournament trees, 8-byte key strides. (#15661)
* Faster k-way merging using tournament trees, 8-byte key strides.

Two speedups for FrameChannelMerger (which does k-way merging in MSQ):

1) Replace the priority queue with a tournament tree, which does fewer
   comparisons.

2) Compare keys using 8-byte strides, rather than 1 byte at a time.

* Adjust comments.

* Fix style.

* Adjust benchmark and test.

* Add eight-list test (power of two).
2024-01-11 08:36:22 -08:00
Clint Wylie
2118258b54
tidy up group by engines after removal of v1 (#15665) 2024-01-11 00:52:52 -08:00
Laksh Singla
0b91cc4db2
Fix incorrect tests in Sting first/last serde's null handling (#15657)
Fixes a couple of incorrect test cases, that got merged accidentally
2024-01-10 19:16:12 +05:30
Zoltan Haindrich
fefa763722
Resultcache fetch should deserialize aggregates when they are real results (#15654)
Fixes #15538
2024-01-10 06:42:33 -05:00
Laksh Singla
4149f98934
Fixes a bug with long string pair serde where null and empty strings are treated equivalently (#15525)
This PR fixes a bug with the long string pair serde where null and empty strings are treated equivalently, and the return value is always null. When 'useDefaultValueForNull' was set to true by default, this wasn't a commonly seen issue, because nulls were equivalent to empty strings. However, since the default has changed to false, this can create incorrect results when the long string pairs are serded, where the empty strings are incorrectly converted to nulls.
2024-01-10 14:17:57 +05:30
Clint Wylie
2938b8de53
fix issue with NestedPathArrayElement not correctly handling negative index for Object[] like it has for List (#15650) 2024-01-10 09:46:08 +05:30
Clint Wylie
cafc748f7e
skip expression virtual column indexes when mvd is used as array (#15644) 2024-01-08 21:22:37 -08:00
Clint Wylie
911941b4a6
fix issue with nested virtual column index supplier for partial paths when processing from raw (#15643) 2024-01-09 07:55:08 +05:30
Clint Wylie
df5bcd1367
fix bugs with expression virtual column indexes for expression virtual columns which refer to other virtual columns (#15633)
changes:
* ColumnIndexSelector now extends ColumnSelector. The only real implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector, already has a ColumnSelector, so this isn't very disruptive
* removed getColumnNames from ColumnSelector since it was not used
* VirtualColumns and VirtualColumn getIndexSupplier method now needs argument of ColumnIndexSelector instead of ColumnSelector, which allows expression virtual columns to correctly recognize other virtual columns, fixing an issue which would incorrectly handle other virtual columns as non-existent columns instead
* fixed a bug with sql planner incorrectly not using expression filter for equality filters on columns with extractionFn and no virtual column registry
2024-01-08 13:10:11 -08:00
Clint Wylie
c221a2634b
overhaul DruidPredicateFactory to better handle 3VL (#15629)
* overhaul DruidPredicateFactory to better handle 3VL

fixes some bugs caused by some limitations of the original design of how DruidPredicateFactory interacts with 3-value logic. The primary impacted area was with how filters on values transformed with expressions or extractionFn which turn non-null values into nulls, which were not possible to be modelled with the 'isNullInputUnknown' method

changes:
* adds DruidObjectPredicate to specialize string, array, and object based predicates instead of using guava Predicate
* DruidPredicateFactory now uses DruidObjectPredicate
* introduces DruidPredicateMatch enum, which all predicates returned from DruidPredicateFactory now use instead of booleans to indicate match. This means DruidLongPredicate, DruidFloatPredicate, DruidDoublePredicate, and the newly added DruidObjectPredicate apply methods all now return DruidPredicateMatch. This allows matchers and indexes
* isNullInputUnknown has been removed from DruidPredicateFactory

* rename, fix test

* adjust

* style

* npe

* more test

* fix default value mode to not match new test
2024-01-05 19:08:02 -08:00
AmatyaAvadhanula
c41e99e10c
Do not allocate week granular segments unless requested (#15589)
* Do not allocate week granular segments unless explicitly requested
2024-01-05 12:14:52 +05:30
Clint Wylie
f19ece146f
expression virtual column indexes (#15585)
* ExpressionVirtualColumn + indexes = bff. Expression virtual columns can now use indexes of the underlying columns similar to how expression filters
2024-01-03 21:00:39 -08:00
Gian Merlino
e40b96e026
Reverse lookup fixes and enhancements. (#15611)
* Reverse lookup fixes and enhancements.

1) Add a "mayIncludeUnknown" parameter to DimFilter#optimize. This is important
   because otherwise the reverse-lookup optimization is done improperly when
   the "in" filter appears under a "not", and the lookup extractionFn may return
   null for some possible values of the filtered column. The "includeUnknown" test
   cases in InDimFilterTest illustrate the difference in behavior.

2) Enhance InDimFilter#optimizeLookup to handle "mayIncludeUnknown", and to be able
   to do a reverse lookup in a wider variety of cases.

3) Make "unapply" protected in LookupExtractor, and move callers to "unapplyAll".
   The main reason is that MapLookupExtractor, a common implementation, lacks a
   reverse mapping and therefore does a scan of the map for each call to "unapply".
   For performance sake these calls need to be batched.

* Remove optimize call from BloomDimFilter.

* Follow the law.

* Fix tests.

* Fix imports.

* Switch function.

* Fix tests.

* More tests.
2024-01-03 13:28:44 -08:00
Gian Merlino
01eec4a55e
New handling for COALESCE, SEARCH, and filter optimization. (#15609)
* New handling for COALESCE, SEARCH, and filter optimization.

COALESCE is converted by Calcite's parser to CASE, which is largely
counterproductive for us, because it ends up duplicating expressions.
In the current code we end up un-doing it in our CaseOperatorConversion.
This patch has a different approach:

1) Add CaseToCoalesceRule to convert CASE back to COALESCE earlier, before
   the Volcano planner runs, using CaseToCoalesceRule.

2) Add FilterDecomposeCoalesceRule to decompose calls like
   "f(COALESCE(x, y))" into "(x IS NOT NULL AND f(x)) OR (x IS NULL AND f(y))".
   This helps use indexes when available on x and y.

3) Add CoalesceLookupRule to push COALESCE into the third arg of LOOKUP.

4) Add a native "coalesce" function so we can convert 3+ arg COALESCE.

The advantage of this approach is that by un-doing the CASE to COALESCE
conversion earlier, we have flexibility to do more stuff with
COALESCE (like decomposition and pushing into LOOKUP).

SEARCH is an operator used internally by Calcite to represent matching
an argument against some set of ranges. This patch improves our handling
of SEARCH in two ways:

1) Expand NOT points (point "holes" in the range set) from SEARCH as
   `!(a || b)` rather than `!a && !b`, which makes it possible to convert
   them to a "not" of "in" filter later.

2) Generate those nice conversions for NOT points even if the SEARCH
   is not composed of 100% NOT points. Without this change, a SEARCH
   for "x NOT IN ('a', 'b') AND x < 'm'" would get converted like
   "x < 'a' OR (x > 'a' AND x < 'b') OR (x > 'b' AND x < 'm')".

One of the steps we take when generating Druid queries from Calcite
plans is to optimize native filters. This patch improves this step:

1) Extract common ANDed predicates in ConvertSelectorsToIns, so we can
   convert "(a && x = 'b') || (a && x = 'c')" into "a && x IN ('b', 'c')".

2) Speed up CombineAndSimplifyBounds and ConvertSelectorsToIns on
   ORs with lots of children by adjusting the logic to avoid calling
   "indexOf" and "remove" on an ArrayList.

3) Refactor ConvertSelectorsToIns to reduce duplicated code between the
   handling for "selector" and "equals" filters.

* Not so final.

* Fixes.

* Fix test.

* Fix test.
2024-01-03 08:56:22 -08:00
Gian Merlino
b0e52c99bb
Fix ColumnSelectorColumnIndexSelector#getColumnCapabilities. (#15614)
* Fix ColumnSelectorColumnIndexSelector#getColumnCapabilities.

It was using virtualColumns.getColumnCapabilities, which only returns
capabilities for virtual columns, not regular columns. The effect of this
is that expression filters (and in some cases, arrayContainsElement filters)
would build value matchers rather than use indexes.

I think this has been like this since #12315, which added the
getColumnCapabilities method to BitmapIndexSelector, and included the same
implementation as exists in the code today.

This error is easy to make due to the design of virtualColumns.getColumnCapabilities,
so to help avoid it in the future, this patch renames the method to
getColumnCapabilitiesWithoutFallback to emphasize that it does not return
capabilities for regular columns.

* Make getColumnCapabilitiesWithoutFallback package-private.

* Fix expression filter bitmap usage.
2024-01-02 21:09:18 -08:00
Parth Agrawal
8505e8a909
Provide default implementation for RowFunction evalDimension method (#15452)
The PR: #13947 introduced a function evalDimension() in the interface RowFunction.
There was no default implementation added for this interface which causes all the implementations and custom transforms to fail and require to implement their own version of evalDimension method. This PR adds a default implementation in the interface which allows the evalDimension to return value as a Singleton array of eval result.
2024-01-02 11:14:23 +05:30
AlbericByte
a2e65e6a89
Support to pass dynamic values to timestamp Extract function (#15586)
Fixes #15072

Before this modification , the third parameter (timezone) require to be a Literal, it will throw a error when this parameter is column Identifier.
2023-12-21 11:57:52 +05:30
Clint Wylie
8a45efbf65
fix some null handling bugs with vector expression processors (#15587) 2023-12-19 08:14:17 -08:00
Kashif Faraz
9f568858ef
Add logging implementation for AuditManager and audit more endpoints (#15480)
Changes
- Add `log` implementation for `AuditManager` alongwith `SQLAuditManager`
- `LoggingAuditManager` simply logs the audit event. Thus, it returns empty for
all `fetchAuditHistory` calls.
- Add new config `druid.audit.manager.type` which can take values `log`, `sql` (default)
- Add new config `druid.audit.manager.logLevel` which can take values `DEBUG`, `INFO`, `WARN`.
This gets activated only if `type` is `log`.
- Remove usage of `ConfigSerde` from `AuditManager` as audit is not just limited to configs
- Add `AuditSerdeHelper` for a single implementation of serialization/deserialization of
audit payload and other utility methods.
2023-12-19 13:14:04 +05:30
Clint Wylie
e373f62692
fix expression post aggregator array handling when grouping wrapper types leak (#15543)
* fix expression post aggregator array handling when grouping wrapper types leak
* more consistent expression function error messaging
2023-12-15 21:43:27 -08:00
Tom
901ebbb744
Allow for kafka emitter producer secrets to be masked in logs (#15485)
* Allow for kafka emitter producer secrets to be masked in logs instead of being visible

This change will allow for kafka producer config values that should be secrets to not show up in the logs.
This will enhance the security of the people who use the kafka emitter to use this if they want to.
This is opt in and will not affect prior configs for this emitter

* fix checkstyle issue

* change property name
2023-12-15 12:21:21 -05:00
Zoltan Haindrich
7552dc49fb
Reduce amount of expression objects created during evaluations (#15552)
I was looking into a query which was performing a bit poorly because the case_searched was touching more than 1 columns (if there is only 1 column there is a cache based evaluator).
While I was doing that I've noticed that there are a few simple things which could help a bit:

use a static TRUE/FALSE instead of creating a new object every time
create the ExprEval early for ConstantExpr -s (except the one for BigInteger which seem to have some odd contract)
return early from type autodetection
these changes mostly reduce the amount of garbage the query creates during case_searched evaluation; although ExpressionSelectorBenchmark shows some improvements ~15% - but my manual trials on the taxi dataset with 60M rows showed more improvements - probably due to the fact that these changes mostly only reduce gc pressure.
2023-12-15 16:11:59 +05:30
Soumyava
3e15522d6b
Round works correctly on system metadata columns (#15554) 2023-12-13 17:23:14 -08:00
Clint Wylie
e55f6b6202
remove search auto strategy, estimateSelectivity of BitmapColumnIndex (#15550)
* remove search auto strategy, estimateSelectivity of BitmapColumnIndex

* more cleanup
2023-12-13 16:30:01 -08:00
zachjsh
857693f5cf
Decorate sampling response with system fields if specified (#15536)
* * decorate sampling response with system fields if specified

* * add unit test
2023-12-13 12:16:59 -08:00
Ankit Kothari
8735d023a1
Add experimental support for first/last for double/float/long #10702 (#14462)
Add experimental support for doubleLast, doubleFirst, FloatLast, FloatFirst, longLast and longFirst.
2023-12-12 11:36:51 +05:30
Abhishek Radhakrishnan
96be82a3e6
Clean up duty for non-overlapping eternity tombstones (#15281)
* Add initial draft of MarkDanglingTombstonesAsUnused duty.

* Use overshadowed segments instead of all used segments.

* Add unit test for MarkDanglingSegmentsAsUnused duty.

* Add mock call

* Simplify code.

* Docs

* shorter lines formatting

* metric doc

* More tests, refactor and fix up some logic.

* update javadocs; other review comments.

* Make numCorePartitions as 0 in the TombstoneShardSpec.

* fix up test

* Add tombstone core partition tests

* Update docs/design/coordinator.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* review comment

* Minor cleanup

* Only consider tombstones with 0 core partitions

* Need to register the test shard type to make jackson happy

* test comments

* checkstyle

* fixup misc typos in comments

* Update logic to use overshadowed segments

* minor cleanup

* Rename duty to eternity tombstone instead of dangling. Add test for full eternity tombstone.

* Address review feedback.

---------

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2023-12-11 08:57:15 -08:00
Clint Wylie
42f2496b7d
fix bug with nested empty array fields (#15532) 2023-12-09 12:20:21 -08:00
Rishabh Singh
54df235026
Lazily build Filter in FilteredAggregatorFactory to avoid parsing exceptions in Router (#15526)
Query with lookups in FilteredAggregator fails with this exception in router,

Cannot construct instance of `org.apache.druid.query.aggregation.FilteredAggregatorFactory`, problem: Lookup [campaigns_lookup[campaignId][is_sold][autodsp]] not found at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 913] (through reference chain: org.apache.druid.query.groupby.GroupByQuery["aggregations"]->java.util.ArrayList[1])
T
he problem is that constructor of FilteredAggregatorFactory is actually validating if the lookup exists in this statement dimFilter.toFilter().
This is failing on the router, which is to be expected, because, the router isn’t assigned any lookups.
The fix is to move to a lazy initialisation of the filter object in the constructor.
2023-12-09 12:18:37 +05:30
Clint Wylie
e7c8f2e208
lift restriction of array_to_mv to only support direct column access (#15528) 2023-12-08 16:27:17 -08:00
Clint Wylie
e64b92eb35
add JSON_QUERY_ARRAY function to pluck ARRAY<COMPLEX<json>> out of COMPLEX<json> (#15521) 2023-12-08 05:28:46 -08:00
Zoltan Haindrich
c353ccfdef
Windowed min aggregates null-s as 0 (#15371) 2023-12-08 01:41:16 -08:00
Clint Wylie
1eafe983ec
fix array presenting columns to not match single element arrays to scalars for equality (#15503)
* fix array presenting columns to not match single element arrays to scalars for equality
* update docs to clarify usage model of mixed type columns
2023-12-08 01:22:07 -08:00
sb89594
5fda8613ad
Feature: Add IPv6 Match Function (#15212) 2023-12-07 23:09:06 -08:00
AlbericByte
935aa187a0
add Assert function to verify in the DataGeneratorTest (#15504)
* add Assert function to verify in the DataGeneratorTest

* remove unused log in DataGeneratorTest

* add comment for DataGeneratorTest
2023-12-08 09:12:17 +08:00
Clint Wylie
c241c6980c
store auto columns with only empty or null containing arrays as ARRAY<LONG> instead of COMPLEX<json> (#15505) 2023-12-07 03:31:43 -08:00
Clint Wylie
557f3f6f57
add array column type support to EXTEND operator (#15458) 2023-12-06 23:21:35 -08:00