Commit Graph

454 Commits

Author SHA1 Message Date
Zoltan Haindrich f72dbfd9cd
[Backport] Window Functions : Improve performance by comparing Strings in frame bytes without converting them (#17091) (#17311)
(cherry picked from commit b9a4c73e52)

Co-authored-by: Sree Charan Manamala <sree.manamala@imply.io>
2024-10-10 08:52:54 +05:30
Kashif Faraz b275ffed8d
[Backport] Changes required for Dart (#17046) (#17047) (#17048) (#17066) (#17074) (#17076) (#17077) (#17193) (#17243)
Backport the following patches for a clean backport of Dart changes
1. Add "targetPartitionsPerWorker" setting for MSQ. (#17048)
2. MSQ: Improved worker cancellation. (#17046)
3. Add "includeAllCounters()" to WorkerContext. (#17047)
4. MSQ: Include worker context maps in WorkOrders. (#17076)
5. TableInputSpecSlicer changes to support running on Brokers. (#17074)
6. Fix call to MemoryIntrospector in IndexerControllerContext. (#17066)
7. MSQ: Add QueryKitSpec to encapsulate QueryKit params. (#17077)
8. MSQ: Use task context flag useConcurrentLocks to determine task lock type (#17193)
2024-10-04 11:36:05 +05:30
Kashif Faraz 23b9039a02
MSQ: Rework memory management. (#17057) (#17210)
This patch reworks memory management to better support multi-threaded
workers running in shared JVMs. There are two main changes.

First, processing buffers and threads are moved from a per-JVM model to
a per-worker model. This enables queries to hold processing buffers
without blocking other concurrently-running queries. Changes:

- Introduce ProcessingBuffersSet and ProcessingBuffers to hold the
  per-worker and per-work-order processing buffers (respectively). On Peons,
  this is the JVM-wide processing pool. On Indexers, this is a per-worker
  pool of on-heap buffers. (This change fixes a bug on Indexers where
  excessive processing buffers could be used if MSQ tasks ran concurrently
  with realtime tasks.)

- Add "bufferPool" argument to GroupingEngine#process so a per-worker pool
  can be passed in.

- Add "druid.msq.task.memory.maxThreads" property, which controls the
  maximum number of processing threads to use per task. This allows usage of
  multiple processing buffers per task if admins desire.

- IndexerWorkerContext acquires processingBuffers when creating the FrameContext
  for a work order, and releases them when closing the FrameContext.

- Add "usesProcessingBuffers()" to FrameProcessorFactory so workers know
  how many sets of processing buffers are needed to run a given query.

Second, adjustments to how WorkerMemoryParameters slices up bundles, to
favor more memory for sorting and segment generation. Changes:

- Instead of using same-sized bundles for processing and for sorting,
  workers now use minimally-sized processing bundles (just enough to read
  inputs plus a little overhead). The rest is devoted to broadcast data
  buffering, sorting, and segment-building.

- Segment-building is now limited to 1 concurrent segment per work order.
  This allows each segment-building action to use more memory. Note that
  segment-building is internally multi-threaded to a degree. (Build and
  persist can run concurrently.)

- Simplify frame size calculations by removing the distinction between
  "standard" and "large" frames. The new default frame size is the same
  as the old "standard" frames, 1 MB. The original goal of of the large
  frames was to reduce the number of temporary files during sorting, but
  I think we can achieve the same thing by simply merging a larger number
  of standard frames at once.

- Remove the small worker adjustment that was added in #14117 to account
  for an extra frame involved in writing to durable storage. Instead,
  account for the extra frame whenever we are actually using durable storage.

- Cap super-sorter parallelism using the number of output partitions, rather
  than using a hard coded cap at 4. Note that in practice, so far, this cap
  has not been relevant for tasks because they have only been using a single
  processing thread anyway.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2024-10-01 19:50:24 +05:30
Cece Mei c31da4f7a7
Incorporate `estimatedComputeCost` into all `BitmapColumnIndex` classes. (#17125) (#17172)
changes:
* filter index processing is now automatically ordered based on estimated 'cost', which is approximated based on how many expected bitmap operations are required to construct the bitmap used for the 'offset'
* cursorAutoArrangeFilters context flag now defaults to true, but can be set to false to disable cost based filter index sorting
2024-09-30 02:24:05 -07:00
Sree Charan Manamala b7cc0bb343
Window Functions : Remove enable windowing flag (#17087) (#17128)
(cherry picked from commit 67d361c9bf)
2024-09-24 10:28:11 +02:00
Clint Wylie c462e103b6
transition away from StorageAdapter (#16985) (#17024)
* transition away from StorageAdapter
changes:
* CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via 'asCursorFactory'. The previous deprecated CursorFactory interface has been merged into StorageAdapter
* StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
* added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
* added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
* added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for HashJoinSegment, UnnestSegment, and FilteredSegment
* Updated all engines and tests to no longer use StorageAdapter
2024-09-09 21:43:41 -07:00
Kashif Faraz fe3d589ff9
Run compaction as a supervisor on Overlord (#16768)
Description
-----------
Auto-compaction currently poses several challenges as it:
1. may get stuck on a failing interval.
2. may get stuck on the latest interval if more data keeps coming into it.
3. always picks the latest interval regardless of the level of compaction in it.
4. may never pick a datasource if its intervals are not very recent.
5. requires setting an explicit period which does not cater to the changing needs of a Druid cluster.

This PR introduces various improvements to compaction scheduling to tackle the above problems.

Change Summary
--------------
1. Run compaction for a datasource as a supervisor of type `autocompact` on Overlord.
2. Make compaction policy extensible and configurable.
3. Track status of recently submitted compaction tasks and pass this info to policy.
4. Add `/simulate` API on both Coordinator and Overlord to run compaction simulations.
5. Redirect compaction status APIs to the Overlord when compaction supervisors are enabled.
2024-09-02 07:53:13 +05:30
Gian Merlino 0603d5153d
Segments sorted by non-time columns. (#16849)
* Segments primarily sorted by non-time columns.

Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.

However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.

For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".

For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.

In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.

The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)

Changes on the write path:

1) DimensionsSpec can now optionally contain a __time dimension, which
   controls the placement of __time in the sort order. If not present,
   __time is considered to be first in the sort order, as it has always
   been.

2) IncrementalIndex and IndexMerger are updated to sort facts more
   flexibly; not always by time first.

3) Metadata (stored in metadata.drd) gains a "sortOrder" field.

4) MSQ can generate range-based shard specs even when not all columns are
   singly-valued strings. It merely stops accepting new clustering key
   fields when it encounters the first one that isn't a singly-valued
   string. This is useful because it enables range shard specs on
   "someDim" to be created for clauses like "CLUSTERED BY someDim, __time".

Changes on the read path:

1) Add StorageAdapter#getSortOrder so query engines can tell how a
   segment is sorted.

2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
   and VectorCursorGranularizer to throw errors when using granularities
   on non-time-ordered segments.

3) Update ScanQueryEngine to throw an error when using the time-ordering
  "order" parameter on non-time-ordered segments.

4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
   running on a non-time-ordered segment.

5) Add "sqlUseGranularity" context parameter that causes the SQL planner
   to avoid using granularities other than ALL.

Other changes:

1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
   and change the meaning subtly: it now returns true if the DimensionsSpec
   represents an unchanging list of dimensions, or false if there is
   some discovery happening. This is what call sites had expected anyway.

* Fixups from CI.

* Fixes.

* Fix missing arg.

* Additional changes.

* Fix logic.

* Fixes.

* Fix test.

* Adjust test.

* Remove throws.

* Fix styles.

* Fix javadocs.

* Cleanup.

* Smoother handling of null ordering.

* Fix tests.

* Missed a spot on the merge.

* Fixups.

* Avoid needless Filters.and.

* Add timeBoundaryInspector to test.

* Fix tests.

* Fix FrameStorageAdapterTest.

* Fix various tests.

* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.

* Pom fix.

* Fix doc.
2024-08-23 08:24:43 -07:00
Clint Wylie 4283b270e3
rework cursor creation (#16533)
changes:
* Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* added new interface `CursorHolder` and new interface `CursorHolderFactory` as a replacement for `CursorFactory`, with method `makeCursorHolder`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* `CursorFactory.makeCursors` previously returned a `Sequence<Cursor>` corresponding to the query granularity buckets, with a separate `Cursor` per bucket. `CursorHolder.asCursor` instead returns a single `Cursor` (equivalent to 'ALL' granularity), and a new `CursorGranularizer` has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its `VectorCursorGranularizer`), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities. 
* Deprecated `CursorFactory`, `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* updated all `StorageAdapter` implementations to implement `makeCursorHolder`, transitioned direct `CursorFactory` implementations to instead implement `CursorMakerFactory`. `StorageAdapter` being a `CursorMakerFactory` is intended to be a transitional thing, ideally will not be released in favor of moving `CursorMakerFactory` to be fetched directly from `Segment`, however this PR was already large enough so this will be done in a follow-up.
* updated all query engines to use `makeCursorHolder`, granularity based engines to use `CursorGranularizer`.
2024-08-16 11:34:10 -07:00
Sree Charan Manamala 84192b11d7
Benchmark for window functions (#16824) 2024-08-07 11:07:11 +02:00
Adarsh Sanjeev 2b81c18fd7
Refactor SemanticCreator (#16700)
Refactors the SemanticCreator annotation.

    Moves the interface to the semantic package.
    Create a SemanticUtils to hold logic for storing semantic maps.
    Add FrameMaker interface.
2024-08-06 11:29:38 -05:00
Zoltan Haindrich 26e3c44f4b
Quidem record (#16624)
* enables to launch a fake broker based on test resources (druidtest uri)
* could record queries into new testfiles during usage
* instead of re-purpose Calcite's Hook migrates to use DruidHook which we can add further keys
* added a quidem-ut module which could be the place for tests which could iteract with modules/etc
2024-08-05 14:58:32 +02:00
Abhishek Radhakrishnan 31b43753fb
Add `druid.indexing.formats.stringMultiValueHandlingMode` system config (#16822)
This patch introduces an optional cluster configuration, druid.indexing.formats.stringMultiValueHandlingMode, allowing operators to override the default mode SORTED_SET for string dimensions. The possible values for the config are SORTED_SET, SORTED_ARRAY, or ARRAY (SORTED_SET is the default). Case insensitive values are allowed.
While this cluster property allows users to manage the multi-value handling mode for string dimension types, it's recommended to migrate to using real array types instead of MVDs.
 
This fixes a long-standing issue where compaction will honor the configured cluster wide property instead of rewriting it as the default SORTED_ARRAY always, even if the data was originally ingested with ARRAY or SORTED_SET.
2024-08-03 10:23:44 -07:00
Kashif Faraz caedeb66cd
Add API to update compaction engine (#16803)
Changes:
- Add API `/druid/coordinator/v1/config/compaction/global` to update cluster level compaction config
- Add class `CompactionConfigUpdateRequest`
- Fix bug in `CoordinatorCompactionConfig` which caused compaction engine to not be persisted.
Use json field name `engine` instead of `compactionEngine` because JSON field names must align
with the getter name.
- Update MSQ validation error messages
- Complete overhaul of `CoordinatorCompactionConfigResourceTest` to remove unnecessary mocking
and add more meaningful tests.
- Add `TuningConfigBuilder` to easily build tuning configs for tests.
- Add `DatasourceCompactionConfigBuilder`
2024-07-27 09:14:51 +05:30
Laksh Singla 725d442355
Faster dimension deserialization on the brokers (#16740)
Speedier dimension deserialization on the brokers.
2024-07-26 14:36:11 +05:30
Clint Wylie 35b876436b
remove native scan query legacy mode (#16659) 2024-07-18 23:33:27 -07:00
Kashif Faraz 01d67ae543
Allow CompactionSegmentIterator to have custom priority (#16737)
Changes:
- Break `NewestSegmentFirstIterator` into two parts
  - `DatasourceCompactibleSegmentIterator` - this contains all the code from `NewestSegmentFirstIterator`
  but now handles a single datasource and allows a priority to be specified
  - `PriorityBasedCompactionSegmentIterator` - contains separate iterator for each datasource and
  combines the results into a single queue to be used by a compaction search policy
- Update `NewestSegmentFirstPolicy` to use the above new classes
- Cleanup `CompactionStatistics` and `AutoCompactionSnapshot`
- Cleanup `CompactSegments`
- Remove unused methods from `Tasks`
- Remove unneeded `TasksTest`
- Move tests from `NewestSegmentFirstIteratorTest` to `CompactionStatusTest`
and `DatasourceCompactibleSegmentIteratorTest`
2024-07-16 19:54:49 +05:30
Vishesh Garg 197c54f673
Auto-Compaction using Multi-Stage Query Engine (#16291)
Description:
Compaction operations issued by the Coordinator currently run using the native query engine.
As majority of the advancements that we are making in batch ingestion are in MSQ, it is imperative
that we support compaction on MSQ to make Compaction more robust and possibly faster. 
For instance, we have seen OOM errors in native compaction that MSQ could have handled by its
auto-calculation of tuning parameters. 

This commit enables compaction on MSQ to remove the dependency on native engine. 

Main changes:
* `DataSourceCompactionConfig` now has an additional field `engine` that can be one of 
`[native, msq]` with `native` being the default.
*  if engine is MSQ, `CompactSegments` duty assigns all available compaction task slots to the
launched `CompactionTask` to ensure full capacity is available to MSQ. This is to avoid stalling which
could happen in case a fraction of the tasks were allotted and they eventually fell short of the number
of tasks required by the MSQ engine to run the compaction.
* `ClientCompactionTaskQuery` has a new field `compactionRunner` with just one `engine` field.
* `CompactionTask` now has `CompactionRunner` interface instance with its implementations
`NativeCompactinRunner` and `MSQCompactionRunner` in the `druid-multi-stage-query` extension.
The objectmapper deserializes `ClientCompactionRunnerInfo` in `ClientCompactionTaskQuery` to the
`CompactionRunner` instance that is mapped to the specified type [`native`, `msq`]. 
* `CompactTask` uses the `CompactionRunner` instance it receives to create the indexing tasks.
* `CompactionTask` to `MSQControllerTask` conversion logic checks whether metrics are present in 
the segment schema. If present, the task is created with a native group-by query; if not, the task is
issued with a scan query. The `storeCompactionState` flag is set in the context.
* Each created `MSQControllerTask` is launched in-place and its `TaskStatus` tracked to determine the
final status of the `CompactionTask`. The id of each of these tasks is the same as that of `CompactionTask`
since otherwise, the workers will be unable to determine the controller task's location for communication
(as they haven't been launched via the overlord).
2024-07-12 16:40:20 +05:30
Gian Merlino dbed1b0f50
Defer more expressions in vectorized groupBy. (#16338)
* Defer more expressions in vectorized groupBy.

This patch adds a way for columns to provide GroupByVectorColumnSelectors,
which controls how the groupBy engine operates on them. This mechanism is used
by ExpressionVirtualColumn to provide an ExpressionDeferredGroupByVectorColumnSelector
that uses the inputs of an expression as the grouping key. The actual expression
evaluation is deferred until the grouped ResultRow is created.

A new context parameter "deferExpressionDimensions" allows users to control when
this deferred selector is used. The default is "fixedWidthNonNumeric", which is a
behavioral change from the prior behavior. Users can get the prior behavior by setting
this to "singleString".

* Fix style.

* Add deferExpressionDimensions to SqlExpressionBenchmark.

* Fix style.

* Fix inspections.

* Add more testing.

* Use valueOrDefault.

* Compute exprKeyBytes a bit lighter-weight.
2024-06-26 17:28:36 -07:00
Laksh Singla 71b3b5ab5d
Add query context parameter to remove null bytes when writing frames (#16579)
MSQ cannot process null bytes in string fields, and the current workaround is to remove them using the REPLACE function. 'removeNullBytes' context parameter has been added which sanitizes the input string fields by removing these null bytes.
2024-06-26 15:00:30 +05:30
Laksh Singla da1e293a57
Deserialize dimensions in group by queries to their respective types when reading from their serialized format (#16511)
* init

* tests, pair groupable

* framework change

* tests

* update benchmarks

* comments

* add javadoc for the jsonMapper

* remove extra deserialization

* add special serde for map based result rows

* revert unnecessary change

---------

Co-authored-by: asdf2014 <asdf2014@apache.org>
2024-06-14 16:27:47 +08:00
Clint Wylie fee509df2e
fix NestedDataColumnIndexerV4 to not report cardinality (#16507)
* fix NestedDataColumnIndexerV4 to not report cardinality
changes:
* fix issue similar to #16489 but for NestedDataColumnIndexerV4, which can report STRING type if it only processes a single type of values. this should be less common than the auto indexer problem
* fix some issues with sql benchmarks
2024-06-11 20:58:12 -07:00
Akshat Jain 6d7d2ffa63
Add interface method for returning canonical lookup name (#16557)
* Add interface method for returning canonical lookup name

* Address review comment

* Add test in LookupReferencesManagerTest for coverage check

* Add test in LookupSerdeModuleTest for coverage check
2024-06-05 14:33:18 -07:00
Adarsh Sanjeev 21f725f33e
Add octet streaming of sketchs in MSQ (#16269)
There are a few issues with using Jackson serialization in sending datasketches between controller and worker in MSQ. This caused a blowup due to holding multiple copies of the sketch being stored.

This PR aims to resolve this by switching to deserializing the sketch payload without Jackson.

The PR adds a new query parameter used during communication between controller and worker while fetching sketches, "sketchEncoding".

    If the value of this parameter is OCTET, the sketch is returned as a binary encoding, done by ClusterByStatisticsSnapshotSerde.
    If the value is not the above, the sketch is encoded by Jackson as before.
2024-05-28 18:12:38 +05:30
Gian Merlino 72432c2e78
Speed up SQL IN using SCALAR_IN_ARRAY. (#16388)
* Speed up SQL IN using SCALAR_IN_ARRAY.

Main changes:

1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
   the IN is above inFunctionThreshold. The default value of inFunctionThreshold
   is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.

2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
   expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
   value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
   it to Integer.MAX_VALUE.

3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
   greater than inFunctionThreshold.

* Revert test.

* Additional coverage.

* Update docs/querying/sql-query-context.md

Co-authored-by: Benedict Jin <asdf2014@apache.org>

* New test.

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-05-14 08:09:27 -07:00
Gian Merlino cdf78ecccd
Fix IndexSpec in SqlBenchmark to use stringEncodingStrategy (#16336) 2024-05-14 14:59:15 +05:30
Laksh Singla 4bfc186153
Support sorting on complex columns in MSQ (#16322)
MSQ sorts the columns in a highly specialized manner by byte comparisons. As such the values are serialized differently. This works well for the primitive types and primitive arrays, however complex types cannot be serialized specially.

This PR adds the support for sorting the complex columns by deserializing the value from the field and comparing it via the type strategy. This is a lot slower than the byte comparisons, however, it's the only way to support sorting on complex columns that can have arbitrary serialization not optimized for MSQ.

The primitives and the arrays are still compared via the byte comparison, therefore this doesn't affect the performance of the queries supported before the patch. If there's a sorting key with mixed complex and primitive/primitive array types, for example: longCol1 ASC, longCol2 ASC, complexCol1 DESC, complexCol2 DESC, stringCol1 DESC, longCol3 DESC, longCol4 ASC, the comparison will happen like:

    longCol1, longCol2 (ASC) - Compared together via byte-comparison, since both are byte comparable and need to be sorted in ascending order
    complexCol1 (DESC) - Compared via deserialization, cannot be clubbed with any other field
    complexCol2 (DESC) - Compared via deserialization, cannot be clubbed with any other field, even though the prior field was a complex column with the same order
    stringCol1, longCol3 (DESC) - Compared together via byte-comparison, since both are byte comparable and need to be sorted in descending order
    longCol4 (ASC) - Compared via byte-comparison, couldn't be coalesced with the previous fields as the direction was different

This way, we only deserialize the field wherever required
2024-05-13 15:07:05 +05:30
Zoltan Haindrich 2d0e86cbdc
Use quidem to run tests (#16249)
* test scoped jdbc driver for druidtest:/// backed DruidAvaticaTestDriver
** DecoupledTestConfig is used inside the URI - this will make it possible to attach to existing things more easily
* DruidQuidemTestBase can be used to create module level set of quidem tests
* added quidem commands: !convertedPlan, !logicalPlan, !druidPlan, !nativePlan
** for these I've used some values of the Hook which was there in calcite
* there are some shortcuts with proxies(they are only used during testing) - we can probably remove those later
2024-05-02 02:12:42 -04:00
Adarsh Sanjeev 9a2d7c28bc
Prepare master branch for 31.0.0 release (#16333) 2024-04-26 09:22:43 +05:30
Laksh Singla 6bca406d31
Grouping on complex columns aka unifying GroupBy strategies (#16068)
Users can pass complex types as dimensions to the group by queries. For example:

SELECT nested_col1, count(*) FROM foo GROUP BY nested_col1
2024-04-24 23:00:14 +05:30
Rishabh Singh e30790e013
Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building (#15817)
Issue: #14989

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.

This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
2024-04-24 22:22:53 +05:30
Tim Williamson 4bdc1890f7
Improve worst-case performance of LIKE filters by 20x (#16153)
* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids backtracking.

Technically, the algorithm runs in `O(nm)`, where `n` is the length of the string to match and `m` is the length of the pattern. In practice, it should run in linear time: essentially as fast as `String.indexOf()` can search for the next match. Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.7x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing.

```
Benchmark                                (cardinality)  Mode  Cnt  Before Score       Error  After Score       Error  Units  Before / After
LikeFilterBenchmark.matchBoundPrefix              1000  avgt   10         6.686 ±     0.026        6.765 ±     0.087  us/op           0.99x
LikeFilterBenchmark.matchBoundPrefix            100000  avgt   10       163.936 ±     1.589      140.014 ±     0.563  us/op           1.17x
LikeFilterBenchmark.matchBoundPrefix           1000000  avgt   10      1235.259 ±     7.318     1165.330 ±     9.300  us/op           1.06x
LikeFilterBenchmark.matchLikeContains             1000  avgt   10       255.074 ±     1.530      130.212 ±     3.314  us/op           1.96x
LikeFilterBenchmark.matchLikeContains           100000  avgt   10     34789.639 ±   210.219    18563.644 ±   100.030  us/op           1.87x
LikeFilterBenchmark.matchLikeContains          1000000  avgt   10    287265.302 ±  1790.957   164684.778 ±   317.698  us/op           1.74x
LikeFilterBenchmark.matchLikeEquals               1000  avgt   10         0.410 ±     0.003        0.399 ±     0.001  us/op           1.03x
LikeFilterBenchmark.matchLikeEquals             100000  avgt   10         0.793 ±     0.005        0.719 ±     0.003  us/op           1.10x
LikeFilterBenchmark.matchLikeEquals            1000000  avgt   10         0.864 ±     0.004        0.839 ±     0.005  us/op           1.03x
LikeFilterBenchmark.matchLikeKiller               1000  avgt   10      3077.629 ±     7.928      103.714 ±     2.417  us/op          29.67x
LikeFilterBenchmark.matchLikeKiller             100000  avgt   10    311048.049 ± 13466.911    14777.567 ±    70.242  us/op          21.05x
LikeFilterBenchmark.matchLikeKiller            1000000  avgt   10   3055855.099 ± 18387.839    92476.621 ±  1198.255  us/op          33.04x
LikeFilterBenchmark.matchLikePrefix               1000  avgt   10         6.711 ±     0.035        6.653 ±     0.046  us/op           1.01x
LikeFilterBenchmark.matchLikePrefix             100000  avgt   10       161.535 ±     0.574      163.740 ±     0.833  us/op           0.99x
LikeFilterBenchmark.matchLikePrefix            1000000  avgt   10      1255.696 ±     5.207     1201.378 ±     3.466  us/op           1.05x
LikeFilterBenchmark.matchRegexContains            1000  avgt   10       467.736 ±     2.546      481.431 ±     5.647  us/op           0.97x
LikeFilterBenchmark.matchRegexContains          100000  avgt   10     64871.766 ±   223.341    65483.992 ±   391.249  us/op           0.99x
LikeFilterBenchmark.matchRegexContains         1000000  avgt   10    482906.004 ±  2003.583   477195.835 ±  3094.605  us/op           1.01x
LikeFilterBenchmark.matchRegexKiller              1000  avgt   10      8071.881 ±    18.026     8052.322 ±    17.336  us/op           1.00x
LikeFilterBenchmark.matchRegexKiller            100000  avgt   10   1120094.520 ±  2428.172   808321.542 ±  2411.032  us/op           1.39x
LikeFilterBenchmark.matchRegexKiller           1000000  avgt   10   8096745.012 ± 40782.747  8114114.896 ± 43250.204  us/op           1.00x
LikeFilterBenchmark.matchRegexPrefix              1000  avgt   10       170.843 ±     1.095      175.924 ±     1.144  us/op           0.97x
LikeFilterBenchmark.matchRegexPrefix            100000  avgt   10     17785.280 ±   116.813    18708.888 ±    61.857  us/op           0.95x
LikeFilterBenchmark.matchRegexPrefix           1000000  avgt   10    174415.586 ±  1827.478   173190.799 ±   949.224  us/op           1.01x
LikeFilterBenchmark.matchSelectorEquals           1000  avgt   10         0.411 ±     0.003        0.416 ±     0.002  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals         100000  avgt   10         0.728 ±     0.003        0.739 ±     0.003  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals        1000000  avgt   10         0.842 ±     0.002        0.879 ±     0.007  us/op           0.96x
```

* Take into account whether druid.generic.useDefaultValueForNull is set in LikeDimFilterTest assertions.

* Attempt to placate CodeQL.

* Fix handling of multi-pattern suffixes.

* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids the catastrophic backtracking, converting the `LIKE` pattern into a list of `java.util.regex.Pattern` by splitting on the `%` wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop using `Matcher.find()` to progress through the string is used.

Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.15x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing.

```
Benchmark                                      (cardinality)  Mode  Cnt  Before Score       Error      After Score     Error  Units  Before/After
LikeFilterBenchmark.matchBoundPrefix                    1000  avgt   10         5.410 ±     0.010          5.582 ±     0.004  us/op         0.97x
LikeFilterBenchmark.matchBoundPrefix                  100000  avgt   10       140.920 ±     0.306        141.082 ±     0.391  us/op         1.00x
LikeFilterBenchmark.matchBoundPrefix                 1000000  avgt   10      1082.762 ±     1.070       1171.407 ±     1.628  us/op         0.92x
LikeFilterBenchmark.matchLikeComplexContains            1000  avgt   10       221.572 ±     0.228        183.742 ±     0.210  us/op         1.21x
LikeFilterBenchmark.matchLikeComplexContains          100000  avgt   10     25461.362 ±    21.481      17373.828 ±    42.577  us/op         1.47x
LikeFilterBenchmark.matchLikeComplexContains         1000000  avgt   10    221075.917 ±   919.238     177454.683 ±   506.420  us/op         1.25x
LikeFilterBenchmark.matchLikeContains                   1000  avgt   10       283.015 ±     0.219        218.835 ±     3.126  us/op         1.29x
LikeFilterBenchmark.matchLikeContains                 100000  avgt   10     30202.910 ±    32.697      26713.488 ±    49.525  us/op         1.13x
LikeFilterBenchmark.matchLikeContains                1000000  avgt   10    284661.411 ±   130.324     243381.857 ±   540.143  us/op         1.17x
LikeFilterBenchmark.matchLikeEquals                     1000  avgt   10         0.386 ±     0.001          0.380 ±     0.001  us/op         1.02x
LikeFilterBenchmark.matchLikeEquals                   100000  avgt   10         0.670 ±     0.001          0.705 ±     0.002  us/op         0.95x
LikeFilterBenchmark.matchLikeEquals                  1000000  avgt   10         0.839 ±     0.001          0.796 ±     0.001  us/op         1.05x
LikeFilterBenchmark.matchLikeKiller                     1000  avgt   10      4882.099 ±     7.953        170.142 ±     0.494  us/op        28.69x
LikeFilterBenchmark.matchLikeKiller                   100000  avgt   10    524122.010 ±   390.170      19461.637 ±   117.090  us/op        26.93x
LikeFilterBenchmark.matchLikeKiller                  1000000  avgt   10   5121795.377 ±  4176.052     181162.978 ±   368.443  us/op        28.27x
LikeFilterBenchmark.matchLikePrefix                     1000  avgt   10         5.708 ±     0.005          5.677 ±     0.011  us/op         1.01x
LikeFilterBenchmark.matchLikePrefix                   100000  avgt   10       141.853 ±     0.554        108.313 ±     0.330  us/op         1.31x
LikeFilterBenchmark.matchLikePrefix                  1000000  avgt   10      1199.148 ±     1.298       1153.297 ±     1.575  us/op         1.04x
LikeFilterBenchmark.matchLikeSuffix                     1000  avgt   10       256.020 ±     0.283        196.339 ±     0.564  us/op         1.30x
LikeFilterBenchmark.matchLikeSuffix                   100000  avgt   10     29917.931 ±    28.218      21450.997 ±    20.341  us/op         1.39x
LikeFilterBenchmark.matchLikeSuffix                  1000000  avgt   10    241225.193 ±   465.824     194034.292 ±   362.312  us/op         1.24x
LikeFilterBenchmark.matchRegexComplexContains           1000  avgt   10       119.597 ±     0.635        135.550 ±     0.697  us/op         0.88x
LikeFilterBenchmark.matchRegexComplexContains         100000  avgt   10     13089.670 ±    13.738      13766.712 ±    12.802  us/op         0.95x
LikeFilterBenchmark.matchRegexComplexContains        1000000  avgt   10    130822.830 ±  1624.048     131076.029 ±  1636.811  us/op         1.00x
LikeFilterBenchmark.matchRegexContains                  1000  avgt   10       573.273 ±     0.421        615.399 ±     0.633  us/op         0.93x
LikeFilterBenchmark.matchRegexContains                100000  avgt   10     57259.313 ±   162.747      62900.380 ±    44.746  us/op         0.91x
LikeFilterBenchmark.matchRegexContains               1000000  avgt   10    571335.768 ±  2822.776     542536.982 ±   780.290  us/op         1.05x
LikeFilterBenchmark.matchRegexKiller                    1000  avgt   10     11525.499 ±     8.741      11061.791 ±    21.746  us/op         1.04x
LikeFilterBenchmark.matchRegexKiller                  100000  avgt   10   1170414.723 ±   766.160    1144437.291 ±   886.263  us/op         1.02x
LikeFilterBenchmark.matchRegexKiller                 1000000  avgt   10  11507668.302 ± 11318.176  110381620.014 ± 10707.974  us/op         1.11x
LikeFilterBenchmark.matchRegexPrefix                    1000  avgt   10       156.460 ±     0.097        155.217 ±     0.431  us/op         1.01x
LikeFilterBenchmark.matchRegexPrefix                  100000  avgt   10     15056.491 ±    23.906      15508.965 ±   763.976  us/op         0.97x
LikeFilterBenchmark.matchRegexPrefix                 1000000  avgt   10    154416.563 ±   473.108     153737.912 ±   273.347  us/op         1.00x
LikeFilterBenchmark.matchRegexSuffix                    1000  avgt   10       610.684 ±     0.462        590.352 ±     0.334  us/op         1.03x
LikeFilterBenchmark.matchRegexSuffix                  100000  avgt   10     53196.517 ±    78.155      59460.261 ±    56.934  us/op         0.89x
LikeFilterBenchmark.matchRegexSuffix                 1000000  avgt   10    536100.944 ±   440.353     550098.917 ±   740.464  us/op         0.97x
LikeFilterBenchmark.matchSelectorEquals                 1000  avgt   10         0.390 ±     0.001          0.366 ±     0.001  us/op         1.07x
LikeFilterBenchmark.matchSelectorEquals               100000  avgt   10         0.724 ±     0.001          0.714 ±     0.001  us/op         1.01x
LikeFilterBenchmark.matchSelectorEquals              1000000  avgt   10         0.826 ±     0.001          0.847 ±     0.001  us/op         0.98x
```
2024-04-23 22:45:23 -07:00
Laksh Singla b9bbde5c0a
Fix deadlock that can occur while merging group by results (#15420)
This PR prevents such a deadlock from happening by acquiring the merge buffers in a single place and passing it down to the runner that might need it.
2024-04-22 14:10:44 +05:30
Clint Wylie aa230642dd
use PeekableIntIterator for OR filter "partial index" value matchers (#16300) 2024-04-17 08:27:21 -07:00
Gian Merlino 58a8a23243
Avoid conversion to String in JsonReader, JsonNodeReader. (#15693)
* Avoid conversion to String in JsonReader, JsonNodeReader.

These readers were running UTF-8 decode on the provided entity to
convert it to a String, then parsing the String as JSON. The patch
changes them to parse the provided entity's input stream directly.

In order to preserve the nice error messages that include parse errors,
the readers now need to open the entity again on the error path, to
re-read the data. To make this possible, the InputEntity#open contract
is tightened to require the ability to re-open entities, and existing
InputEntity implementations are updated to allow re-opening.

This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark,
updates it to benchmark all three JSON readers, and adds a case that reads
fields out of the parsed row (not just creates it).

* Fixes for static analysis.

* Implement intermediateRowAsString in JsonReader.

* Enhanced JsonInputFormatBenchmark.

Renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, and enhances it to
test various readers (JsonReader, JsonLineReader, JsonNodeReader) as well as
to test with/without field discovery.
2024-03-26 08:16:05 -07:00
Clint Wylie b0a9c318d6
add new typed in filter (#16039)
changes:
* adds TypedInFilter which preserves matching sets in the native match value type
* SQL planner uses new TypedInFilter when druid.generic.useDefaultValueForNull=false (the default)
2024-03-22 12:45:08 -07:00
AmatyaAvadhanula 488d376209
Optimize isOvershadowed when there is a unique minor version for an interval (#15952)
* Optimize isOvershadowed for intervals with timechunk locking
2024-03-20 19:30:00 +05:30
Clint Wylie dd9bc3749a
fix issues with array_contains and array_overlap with null left side arguments (#15974)
changes:
* fix issues with array_contains and array_overlap with null left side arguments
* modify singleThreaded stuff to allow optimizing Function similar to how we do for ExprMacro - removed SingleThreadSpecializable in favor of default impl of asSingleThreaded on Expr with clear javadocs that most callers shouldn't be calling it directly and should be using Expr.singleThreaded static method which uses a shuttle and delegates to asSingleThreaded instead
* add optimized 'singleThreaded' versions of array_contains and array_overlap
* add mv_harmonize_nulls native expression to use with MV_CONTAINS and MV_OVERLAP to allow them to behave consistently with filter rewrites, coercing null and [] into [null]
* fix bug with casting rhs argument for native array_contains and array_overlap expressions
2024-03-13 18:16:10 -07:00
Clint Wylie 313da98879
decouple column serializer compression closers from SegmentWriteoutMedium to optionally allow serializers to release direct memory allocated for compression earlier than when segment is completed (#16076) 2024-03-11 12:28:04 -07:00
Zoltan Haindrich 27d7c30c38
Only use ExprEval in ConstantExpr if its known that it will be safe (#15694)
* `Expr#singleThreaded` which creates a singleThreaded version of the actual expression (caching ExprEval is allowed)
* `Expr#makeSingleThreaded` to make a whole subtree of expressions 'singleThreaded' - uses `Shuttle` to create the new expression tree
* `ConstantExpr#singleThreaded` creates a specialized `ConstantExpr` which does cache the `ExprEval`
* some `@Immutable` annotations were added to make it more likely to notice that there might be something off if a similar change will be made around here for some reason
2024-03-05 12:53:09 -08:00
Clint Wylie 101176590c
adaptive filter partitioning (#15838)
* cooler cursor filter processing allowing much smart utilization of indexes by feeding selectivity forward, with implementations for range and predicate based filters
* added new method Filter.makeFilterBundle which cursors use to get indexes and matchers for building offsets
* AND filter partitioning is now pushed all the way down, even to nested AND filters
* vector engine now uses same indexed base value matcher strategy for OR filters which partially support indexes
2024-02-29 15:38:12 -08:00
Gian Merlino 54b30646f3
Add sqlReverseLookupThreshold for ReverseLookupRule. (#15832)
If lots of keys map to the same value, reversing a LOOKUP call can slow
things down unacceptably. To protect against this, this patch introduces
a parameter sqlReverseLookupThreshold representing the maximum size of an
IN filter that will be created as part of lookup reversal.

If inSubQueryThreshold is set to a smaller value than
sqlReverseLookupThreshold, then inSubQueryThreshold will be used instead.
This allows users to use that single parameter to control IN sizes if they
wish.
2024-02-06 16:32:05 +05:30
Vishesh Garg 37d1650ccf
Benchmark for query planning time for IN queries (#15688)
Adds a set of benchmark queries for measuring the planning time with the IN operator. Current results indicate that with the recent optimizations, the IN planning time with 100K expressions in the IN clause is just 3s and with 1M is 46s. For IN clause paired with OR <col>=<val> expr, the numbers are 10s and 155s for 100K and 1M, resp.
2024-01-31 15:40:31 +05:30
Karan Kumar c4990f56d6
Prepare main branch for next 30.0.0 release. (#15707) 2024-01-23 15:55:54 +05:30
Zoltan Haindrich d6a12c4389
Add ability to enable ResultCache in tests (#15465) 2024-01-22 09:02:59 -05:00
Gian Merlino 792e5c58e4
IncrementalIndex#add is no longer thread-safe. (#15697)
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
2024-01-18 03:45:22 -08:00
Gian Merlino d3d0c1c91e
Faster parsing: reduce String usage, list-based input rows. (#15681)
* Faster parsing: reduce String usage, list-based input rows.

Three changes:

1) Reworked FastLineIterator to optionally avoid generating Strings
   entirely, and reduce copying somewhat. Benefits the line-oriented
   JSON, CSV, delimited (TSV), and regex formats.

2) In the delimited (TSV) format, when the delimiter is a single byte,
   split on UTF-8 bytes directly.

3) In CSV and delimited (TSV) formats, use list-based input rows when
   the column list is provided upfront by the user.

* Fix style.

* Fix inspections.

* Restore validation.

* Remove fastutil-extra.

* Exception type.

* Fixes for error messages.

* Fixes for null handling.
2024-01-18 19:18:46 +08:00
Gian Merlino 500681d0cb
Add ImmutableLookupMap for static lookups. (#15675)
* Add ImmutableLookupMap for static lookups.

This patch adds a new ImmutableLookupMap, which comes with an
ImmutableLookupExtractor. It uses a fastutil open hashmap plus two
lists to store its data in such a way that forward and reverse
lookups can both be done quickly. I also observed footprint to be
somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million
row lookup.

The main advantage, though, is that reverse lookups can be done much
more quickly than MapLookupExtractor (which iterates the entire map
for each call to unapplyAll). This speeds up the recently added
ReverseLookupRule (#15626) during SQL planning with very large lookups.

* Use in one more test.

* Fix benchmark.

* Object2ObjectOpenHashMap

* Fixes, and LookupExtractor interface update to have asMap.

* Remove commented-out code.

* Fix style.

* Fix import order.

* Add fastutil.

* Avoid storing Map entries.
2024-01-13 13:14:01 -08:00
Gian Merlino cccf13ea82
Reverse, pull up lookups in the SQL planner. (#15626)
* Reverse, pull up lookups in the SQL planner.

Adds two new rules:

1) ReverseLookupRule, which eliminates calls to LOOKUP by doing
   reverse lookups.

2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above
   GROUP BY, when the lookup is injective.

Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether
these rules fire. Both are enabled by default.

To minimize the chance of performance problems due to many keys mapping to
the same value, ReverseLookupRule refrains from reversing a lookup if there
are more keys than `inSubQueryThreshold`. The rationale for using this setting
is that reversal works by generating an IN, and the `inSubQueryThreshold`
describes the largest IN the user wants the planner to create.

* Add additional line.

* Style.

* Remove commented-out lines.

* Fix tests.

* Add test.

* Fix doc link.

* Fix docs.

* Add one more test.

* Fix tests.

* Logic, test updates.

* - Make FilterDecomposeConcatRule more flexible.

- Make CalciteRulesManager apply reduction rules til fixpoint.

* Additional tests, simplify code.
2024-01-12 00:06:31 -08:00
Gian Merlino 2231cb30a4
Faster k-way merging using tournament trees, 8-byte key strides. (#15661)
* Faster k-way merging using tournament trees, 8-byte key strides.

Two speedups for FrameChannelMerger (which does k-way merging in MSQ):

1) Replace the priority queue with a tournament tree, which does fewer
   comparisons.

2) Compare keys using 8-byte strides, rather than 1 byte at a time.

* Adjust comments.

* Fix style.

* Adjust benchmark and test.

* Add eight-list test (power of two).
2024-01-11 08:36:22 -08:00