Commit Graph

13745 Commits

Author SHA1 Message Date
Gian Merlino 792e5c58e4
IncrementalIndex#add is no longer thread-safe. (#15697)
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
2024-01-18 03:45:22 -08:00
Gian Merlino 764f41d959
Clear "lineSplittable" for JSON when using KafkaInputFormat. (#15692)
* Clear "lineSplittable" for JSON when using KafkaInputFormat.

JsonInputFormat has a "withLineSplittable" method that can be used to
control whether JSON is read line-by-line, or as a whole. The intent
is that in streaming ingestion, "lineSplittable" is false (although it
can be overridden by "assumeNewlineDelimited"), and in batch ingestion,
lineSplittable is true.

When a "json" format is wrapped by a "kafka" format, this isn't set
properly. This patch updates KafkaInputFormat to set this on an
underlying "json" format.

The tests for KafkaInputFormat were overriding the "lineSplittable"
parameter explicitly, which wasn't really fair, because that made them
unrealistic to what happens in production. Now they omit the parameter
and get the production behavior.

* Add test.

* Fix test coverage.
2024-01-18 03:22:41 -08:00
Gian Merlino d3d0c1c91e
Faster parsing: reduce String usage, list-based input rows. (#15681)
* Faster parsing: reduce String usage, list-based input rows.

Three changes:

1) Reworked FastLineIterator to optionally avoid generating Strings
   entirely, and reduce copying somewhat. Benefits the line-oriented
   JSON, CSV, delimited (TSV), and regex formats.

2) In the delimited (TSV) format, when the delimiter is a single byte,
   split on UTF-8 bytes directly.

3) In CSV and delimited (TSV) formats, use list-based input rows when
   the column list is provided upfront by the user.

* Fix style.

* Fix inspections.

* Restore validation.

* Remove fastutil-extra.

* Exception type.

* Fixes for error messages.

* Fixes for null handling.
2024-01-18 19:18:46 +08:00
Maytas Monsereenusorn 55acf2e2ff
Fix incorrect scale when reading decimal from parquet (#15715)
* Fix incorrect scale when reading decimal from parquet

* add comments

* fix test
2024-01-18 02:10:27 -08:00
Maytas Monsereenusorn a3b32fbd26
Fix comparator and remove deprecated methods from spectatorHistogram extension (#15698)
* Remove deprecated methods from SpectatorHistogram

* Remove deprecated methods from SpectatorHistogram

* Remove deprecated methods from SpectatorHistogram
2024-01-17 21:59:53 -08:00
AmatyaAvadhanula a26defd64b
Clean up stale entries from upgradeSegments table (#15637)
* Clean up stale entries from upgradeSegments table
2024-01-17 20:49:52 +05:30
Laksh Singla fc06f2d075
Fix summary iterator in grouping engine(#15658)
This PR fixes the summary iterator to add aggregators in the correct position. The summary iterator is used when dims are not present, therefore the new change is identical to the old one, but seems more correct while reading.
2024-01-17 20:43:45 +05:30
Abhishek Radhakrishnan c27f5bf52f
Report zero values instead of unknown for empty ingest queries (#15674)
MSQ now allows empty ingest queries by default. For such queries that don't generate any output rows, the query counters in the async status result object/task report don't contain numTotalRows and totalSizeInBytes. These properties when not set/undefined can be confusing to API clients. For example, the web-console treats it as unknown values.

This patch fixes the counters by explicitly reporting them as 0 instead of null for empty ingest queries.
2024-01-17 16:26:10 +05:30
Zoltan Haindrich 8a43db9395
Range support in window expressions (support them as groups) (#15365)
* support groups windowing mode; which is a close relative of ranges (but not in the standard)
* all windows with range expressions will be executed wit it groups
* it will be 100% correct in case for both bounds its true that: isCurrentRow() || isUnBounded()
  * this covers OVER ( ORDER BY COL )
* for other cases it will have some chances of getting correct results...
2024-01-17 00:05:21 -06:00
Laksh Singla 8ba06cf723
account for null values in the stddev post aggregator (#15660) 2024-01-16 19:57:33 +05:30
AmatyaAvadhanula 6b951b94c0
Add new context parameter for using concurrent locks (#15684)
Changes:
- Add new task context flag useConcurrentLocks.
- This can be set for an individual task or at a cluster level using `druid.indexer.task.default.context`.
- When set to true, any appending task would use an APPEND lock and any other
ingestion task would use a REPLACE lock when using time chunk locking.
- If false (default), we fall back on the context flag taskLockType and then useSharedLock.
2024-01-16 12:43:39 +05:30
AmatyaAvadhanula 11dbfb6e3f
Better error message when partition space is exhausted (#15685)
* Better error message when partition space is exhausted
2024-01-16 12:32:40 +05:30
AmatyaAvadhanula 67720b60ae
Skip compaction for intervals without data (#15676)
* Skip compaction for intervals with a single tombstone and no data
2024-01-16 12:31:36 +05:30
Sam Rash 072b16c6df
Fix SQL Innterval.of() error message (#15454)
Better error message for poorly constructed intervals
2024-01-15 22:34:35 -06:00
Gian Merlino d359fb3d68
Cache value selectors in RowBasedColumnSelectorFactory. (#15615)
* Cache value selectors in RowBasedColumnSelectorFactory.

There was already caching for dimension selectors. This patch adds caching
for value (object and number) selectors. It's helpful when the same field is
read multiple times during processing of a single row (for example, by being
an input to both MIN and MAX aggregations).

* Fix typing.

* Fix logic.
2024-01-15 18:03:27 -08:00
Kashif Faraz 18d2a8957f
Refactor: Cleanup test impls of ServiceEmitter (#15683) 2024-01-15 17:37:00 +05:30
Abhishek Radhakrishnan 08c01f1dae
Handle and map errors in delete pending segments API (#15673)
Changes:
- Handle exception in deletePendingSegments API and map to correct HTTP status code
- Clean up exception message using `DruidException`
- Add unit tests
2024-01-15 10:09:01 +05:30
Ben Sykes e49a7bb3cd
Add SpectatorHistogram extension (#15340)
* Add SpectatorHistogram extension

* Clarify documentation
Cleanup comments

* Use ColumnValueSelector directly
so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram.
When queried as a Number, we're returning the count of entries in the histogram.

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Fix references

* Fix spelling

* Update docs/development/extensions-contrib/spectator-histogram.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-01-14 09:52:30 -08:00
Kashif Faraz f0c552b2f9
Fix basic auth integration test (#15679)
* Add some retries

* Add a delay to allow creds to propagate

* Checkstyle and stuff
2024-01-14 08:59:15 -08:00
Gian Merlino 500681d0cb
Add ImmutableLookupMap for static lookups. (#15675)
* Add ImmutableLookupMap for static lookups.

This patch adds a new ImmutableLookupMap, which comes with an
ImmutableLookupExtractor. It uses a fastutil open hashmap plus two
lists to store its data in such a way that forward and reverse
lookups can both be done quickly. I also observed footprint to be
somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million
row lookup.

The main advantage, though, is that reverse lookups can be done much
more quickly than MapLookupExtractor (which iterates the entire map
for each call to unapplyAll). This speeds up the recently added
ReverseLookupRule (#15626) during SQL planning with very large lookups.

* Use in one more test.

* Fix benchmark.

* Object2ObjectOpenHashMap

* Fixes, and LookupExtractor interface update to have asMap.

* Remove commented-out code.

* Fix style.

* Fix import order.

* Add fastutil.

* Avoid storing Map entries.
2024-01-13 13:14:01 -08:00
Gian Merlino 866fe1cda6
Fix some naming related to AggregatePullUpLookupRule. (#15677)
It was called "split" rather than "pull up" in some places. This patch
standardizes on "pull up".
2024-01-12 15:41:58 -08:00
YongGang 0457c71d03
Fix k8sAndWorker mode in a zookeeper-less environment (#15445)
* Fix k8sAndWorker mode in a zookeeper-less environment

* add unit test

* code reformat

* minor refine

* change to inject Provider

* correct style

* bind HttpRemoteTaskRunnerFactory as provider

* change to bind on TaskRunnerFactory

* fix styling
2024-01-12 09:30:01 -08:00
Gian Merlino cccf13ea82
Reverse, pull up lookups in the SQL planner. (#15626)
* Reverse, pull up lookups in the SQL planner.

Adds two new rules:

1) ReverseLookupRule, which eliminates calls to LOOKUP by doing
   reverse lookups.

2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above
   GROUP BY, when the lookup is injective.

Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether
these rules fire. Both are enabled by default.

To minimize the chance of performance problems due to many keys mapping to
the same value, ReverseLookupRule refrains from reversing a lookup if there
are more keys than `inSubQueryThreshold`. The rationale for using this setting
is that reversal works by generating an IN, and the `inSubQueryThreshold`
describes the largest IN the user wants the planner to create.

* Add additional line.

* Style.

* Remove commented-out lines.

* Fix tests.

* Add test.

* Fix doc link.

* Fix docs.

* Add one more test.

* Fix tests.

* Logic, test updates.

* - Make FilterDecomposeConcatRule more flexible.

- Make CalciteRulesManager apply reduction rules til fixpoint.

* Additional tests, simplify code.
2024-01-12 00:06:31 -08:00
Zoltan Haindrich e597cc2949
Remove UnaryFunctionOperatorConversion and RoundOperatorConversion (#15566)
* get rid of roun op conv

* cleanup

* use DirectOperatorConversion instead unary

* import order
2024-01-12 10:06:23 +05:30
Gian Merlino 6c18434028
CONCAT flattening, filter decomposition. (#15634)
* CONCAT flattening, filter decomposition.

Flattening: CONCAT(CONCAT(x, y), z) is flattened to CONCAT(x, y, z). This
is especially useful for the || operator, which is a binary operator and
leads to non-flat CONCAT calls.

Filter decomposition: transforms CONCAT(x, '-', y) = 'a-b' into
x = 'a' AND y = 'b'.

* One more test.

* Fix two tests.

* Adjustments from review.

* Fix empty string problem, add tests.
2024-01-11 11:18:50 -08:00
Gian Merlino 2231cb30a4
Faster k-way merging using tournament trees, 8-byte key strides. (#15661)
* Faster k-way merging using tournament trees, 8-byte key strides.

Two speedups for FrameChannelMerger (which does k-way merging in MSQ):

1) Replace the priority queue with a tournament tree, which does fewer
   comparisons.

2) Compare keys using 8-byte strides, rather than 1 byte at a time.

* Adjust comments.

* Fix style.

* Adjust benchmark and test.

* Add eight-list test (power of two).
2024-01-11 08:36:22 -08:00
Clint Wylie 2118258b54
tidy up group by engines after removal of v1 (#15665) 2024-01-11 00:52:52 -08:00
Laksh Singla 87fbe42218
"Partition boost" the group by queries in MSQ for better splits (#15474)
"Partition boost" the group by queries in MSQ for better splits
2024-01-11 12:46:27 +05:30
Vadim Ogievetsky 5b769a7d32
Update load query detail archive dialog for file input support (#15632)
* Update execution-submit-dialog for file input support

Modified the execution-submit-dialog to support file inputs instead of text inputs for better usability. Users can now submit their queries by selecting a JSON file directly or dragging the file into the dialog. Made appropriate UI adjustments to accommodate this change in execution-submit-dialog styles file.

* Update web-console/src/views/workbench-view/execution-submit-dialog/execution-submit-dialog.tsx

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update web-console/src/views/workbench-view/execution-submit-dialog/execution-submit-dialog.tsx

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update web-console/src/views/workbench-view/execution-submit-dialog/execution-submit-dialog.tsx

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update drag-and-drop instructions in execution-submit-dialog

* Add snapshot tests for ExecutionSubmitDialog

* prettify

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-01-10 20:48:46 -08:00
Kashif Faraz f445ba4d6b
Audit API DELETE datasource (markAllSegmentsAsUnused) (#15653)
Changes:
- Add audit for `DELETE /druid/coordinator/v1/datasources/{datasourceName}`
- Minor refactor
2024-01-11 09:43:32 +05:30
Gian Merlino ee77fa7fb3
Add tests for CASE decomposition. (#15639)
I was looking into adding a rule to do this, and found that it was already
happening as part of Calcite's RexSimplify. So this patch simply adds some
tests to ensure that it continues to happen.
2024-01-10 13:24:24 -08:00
Laksh Singla 0b91cc4db2
Fix incorrect tests in Sting first/last serde's null handling (#15657)
Fixes a couple of incorrect test cases, that got merged accidentally
2024-01-10 19:16:12 +05:30
Kashif Faraz d623756c66
Add cache for password hash in druid-basic-security (#15648)
Add class PasswordHashGenerator. Move hashing logic from BasicAuthUtils to this new class.
Add cache in the hash generator to contain the computed hash of passwords and boost validator performance
Cache has max size 1000 and expiry 1 hour
Key of the cache is an SHA-256 hash of the (password + random salt generated on service startup)
2024-01-10 17:46:09 +05:30
Zoltan Haindrich fefa763722
Resultcache fetch should deserialize aggregates when they are real results (#15654)
Fixes #15538
2024-01-10 06:42:33 -05:00
Yuanli Han 99d4b7dca7
Use to trigger search bar of official site (#15652) 2024-01-10 17:18:36 +08:00
Laksh Singla 4149f98934
Fixes a bug with long string pair serde where null and empty strings are treated equivalently (#15525)
This PR fixes a bug with the long string pair serde where null and empty strings are treated equivalently, and the return value is always null. When 'useDefaultValueForNull' was set to true by default, this wasn't a commonly seen issue, because nulls were equivalent to empty strings. However, since the default has changed to false, this can create incorrect results when the long string pairs are serded, where the empty strings are incorrectly converted to nulls.
2024-01-10 14:17:57 +05:30
Ankit Kothari 355c2f5da0
Add sql + ingestion compatibility for first/last on numeric values (#15607)
SQL compatibility for numeric last and first column types.
Ingestion UI now provides option for first and last aggregation as well.
2024-01-10 12:59:38 +05:30
PANKAJ KUMAR 047c7340ab
Adding retries to update the metadata store instead of failure (#15141)
Currently, If 2 tasks are consuming from the same partitions, try to publish the segment and update the metadata, the second task can fail because the end offset stored in the metadata store doesn't match with the start offset of the second task. We can fix this by retrying instead of failing.

AFAIK apart from the above issue, the metadata mismatch can happen in 2 scenarios:

- when we update the input topic name for the data source
- when we run 2 replicas of ingestion tasks(1 replica will publish and 1 will fail as the first replica has already updated the metadata).

Implemented the comparable function to compare the last committed end offset and new Sequence start offset. And return a specific error msg for this.

Add retry logic on indexers to retry for this specific error msg.

Updated the existing test case.
2024-01-10 12:30:54 +05:30
Clint Wylie 2938b8de53
fix issue with NestedPathArrayElement not correctly handling negative index for Object[] like it has for List (#15650) 2024-01-10 09:46:08 +05:30
Rishabh Singh 71f5307277
Eliminate Periodic Realtime Segment Metadata Queries: Task Now Publish Schema for Seamless Coordinator Updates (#15475)
The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This task encompasses addressing both realtime and finalized segments.

This modification specifically addresses the issue with realtime segments. Tasks will now routinely communicate the schema for realtime segments during the segment announcement process. The Coordinator will identify the schema alongside the segment announcement and subsequently update the schema for realtime segments in the metadata cache.
2024-01-10 08:55:56 +05:30
Vadim Ogievetsky 85b8cf9f37
Web console: Fix concurrent tasks (#15649)
* Improve handling of concurrent tasks option

* Update snapshots
2024-01-09 16:09:42 -08:00
Pranav 747d973752
Skip waiting for first lookup version to get initialized (#15598) 2024-01-09 13:18:39 -08:00
Misha ea6ba40ce1
Add support for Azure Goverment storage (#15523)
Added support for Azure Government storage in Druid Azure-Extensions. This enhancement allows the Azure-Extensions to be compatible with different Azure storage types by updating the endpoint suffix from a hardcoded value to a configurable one.
2024-01-09 22:33:32 +05:30
Clint Wylie cafc748f7e
skip expression virtual column indexes when mvd is used as array (#15644) 2024-01-08 21:22:37 -08:00
Clint Wylie 911941b4a6
fix issue with nested virtual column index supplier for partial paths when processing from raw (#15643) 2024-01-09 07:55:08 +05:30
Abhishek Agarwal 468b99e608
Enable query request queuing by default when total laning is turned on. (#15440)
This PR enables the flag by default to queue excess query requests in the jetty queue. Still keeping the flag so that it can be turned off if necessary. But the flag will be removed in the future.
2024-01-09 07:54:26 +05:30
Victoria Lim 52313c51ac
docs: Anchor link checker (#15624)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-01-08 15:19:05 -08:00
Clint Wylie df5bcd1367
fix bugs with expression virtual column indexes for expression virtual columns which refer to other virtual columns (#15633)
changes:
* ColumnIndexSelector now extends ColumnSelector. The only real implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector, already has a ColumnSelector, so this isn't very disruptive
* removed getColumnNames from ColumnSelector since it was not used
* VirtualColumns and VirtualColumn getIndexSupplier method now needs argument of ColumnIndexSelector instead of ColumnSelector, which allows expression virtual columns to correctly recognize other virtual columns, fixing an issue which would incorrectly handle other virtual columns as non-existent columns instead
* fixed a bug with sql planner incorrectly not using expression filter for equality filters on columns with extractionFn and no virtual column registry
2024-01-08 13:10:11 -08:00
Vadim Ogievetsky 84adb9255e
Web console: Fix spec conversion, expose failOnEmptyInsert (#15627)
* Spec converter should dedupe the columns

* Add "Fail on empty insert" setting to QueryContext toggles
2024-01-08 12:05:12 -08:00
Jonathan Wei 5d1e66b8f9
Allow broker to use catalog for datasource schemas for SQL queries (#15469)
* Allow broker to use catalog for datasource schemas

* More PR comments

* PR comments
2024-01-08 13:46:08 -06:00