* enables the use of DruidHook for native plan logging
* quidem tests don't necessarily need to run the query to get an explain - this helps during development, as a query with a runtime issue can still be explained in the test
Text-based input formats like csv and tsv currently parse inputs only as strings, following the RFC4180Parser spec.
To work around this, the web console and other tools need to further inspect the sample data returned by the Druid sampler API to parse them as numbers.
This patch introduces a new optional config, tryParseNumbers, for the csv and tsv input formats. If enabled, numbers present in the input are parsed as follows: integer values as the long data type and floating-point values as double; if parsing fails for whatever reason, the input is treated as a string. By default, this configuration is set to false, so numeric strings will be treated as strings.
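A rough sketch of the fallback order described above (plain Java, not the actual Druid parser code):

```java
// Minimal sketch of the tryParseNumbers fallback; not the actual Druid parser code.
public class NumberParseFallback
{
  // Integer-looking values become Long, other numerics become Double,
  // and anything that fails to parse is kept as the original String.
  public static Object tryParseNumber(final String value)
  {
    if (value == null || value.isEmpty()) {
      return value;
    }
    try {
      return Long.parseLong(value);
    }
    catch (NumberFormatException ignored) {
      // not an integer; try floating point next
    }
    try {
      return Double.parseDouble(value);
    }
    catch (NumberFormatException ignored) {
      // not a number at all; keep the string
    }
    return value;
  }
}
```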
* Use the whole frame when writing rows.
This patch makes the following adjustments to enable writing larger
single rows to frames:
1) RowBasedFrameWriter: Max out allocation size on the final doubling.
i.e., if the final allocation "naturally" would be 1 MiB but the
max frame size is 900 KiB, use 900 KiB rather than failing the 1 MiB
allocation.
2) AppendableMemory: In reserveAdditional, release the last block if it
is empty. This eliminates waste when a frame writer uses a
successive-doubling approach to find the right allocation size.
3) ArenaMemoryAllocator: Reclaim memory from the last allocation when
the last allocation is closed.
Prior to these changes, a single row could be much smaller than the
frame size and still fail to be added to the frame.
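A hedged sketch of adjustment (1), capping the final doubling at the maximum frame size instead of failing; the names here are illustrative, not the actual RowBasedFrameWriter code:

```java
// Illustrative sketch of capped successive doubling; names are hypothetical,
// not the actual RowBasedFrameWriter implementation.
public class CappedDoubling
{
  /**
   * Returns the next allocation size to try, doubling the current size but
   * never exceeding maxFrameSize. Returns -1 if we are already at the cap.
   */
  public static long nextAllocationSize(final long currentSize, final long maxFrameSize)
  {
    if (currentSize >= maxFrameSize) {
      return -1; // cannot grow further; the row truly does not fit
    }
    // e.g. current = 512 KiB, max = 900 KiB: the "natural" doubling to 1 MiB
    // is clamped to 900 KiB rather than failing outright.
    return Math.min(currentSize * 2, maxFrameSize);
  }
}
```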
* Style.
* Fix test.
The "suggested server memory" figure needs to take into account
maxConcurrentStages. The fix here does not affect the main memory
calculations, but it does affect the accuracy of error messages.
Currently, TaskDataSegmentProvider fetches the DataSegment from the Coordinator while loading the segment, but just discards it later. This PR refactors this to also return the DataSegment so that it can be used by workers without a separate fetch.
* MSQ: Add QueryKitSpec to encapsulate QueryKit params.
This patch introduces QueryKitSpec, an object that encapsulates the
parameters to makeQueryDefinition that are consistent from call to
call. This simplifies things because we avoid passing around all the
components individually.
This patch also splits "maxWorkerCount" into "maxLeafWorkerCount" and
"maxNonLeafWorkerCount", which apply to leaf stages (no other stages as
inputs) and nonleaf stages respectively.
Finally, this patch also provides a way for ControllerContext to supply a
QueryKitSpec to its liking. It is expected that this will be used by
controllers of quick interactive queries to set maxNonLeafWorkerCount = 1,
which will generate fanning-in query plans.
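A minimal sketch of the parameter-object idea, with field names taken from the description above (the real QueryKitSpec may carry different or additional fields):

```java
// Hedged sketch of a parameter object like QueryKitSpec; the real class in
// the MSQ codebase may carry different or additional fields.
public class QueryKitSpecSketch
{
  private final String queryId;
  private final int maxLeafWorkerCount;     // applies to leaf stages (no stage inputs)
  private final int maxNonLeafWorkerCount;  // applies to stages that read other stages

  public QueryKitSpecSketch(String queryId, int maxLeafWorkerCount, int maxNonLeafWorkerCount)
  {
    this.queryId = queryId;
    this.maxLeafWorkerCount = maxLeafWorkerCount;
    this.maxNonLeafWorkerCount = maxNonLeafWorkerCount;
  }

  public int maxWorkerCount(final boolean isLeafStage)
  {
    // Interactive controllers can set maxNonLeafWorkerCount = 1 to produce
    // fanning-in plans, while leaf stages still scan with full parallelism.
    return isLeafStage ? maxLeafWorkerCount : maxNonLeafWorkerCount;
  }
}
```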
* Fix javadoc.
* MSQ: Properly report errors that occur when starting up RunWorkOrder.
In #17046, an exception thrown by RunWorkOrder#startAsync would be ignored
and replaced with a generic CanceledFault. This patch fixes it by retaining
the original error.
* TableInputSpecSlicer changes to support running on Brokers.
Changes:
1) Rename TableInputSpecSlicer to IndexerTableInputSpecSlicer, in anticipation
of a new implementation being added for controllers running on Brokers.
2) Allow the context to use the WorkerManager to build the TableInputSpecSlicer,
in anticipation of Brokers wanting to use this to assign segments to servers
that are already serving those segments.
3) Remove unused DataSegmentTimelineView interface.
4) Add additional javadoc to DataSegmentProvider.
* Style.
* MSQ: Include worker context maps in WorkOrders.
This provides a mechanism to send contexts to workers in long-lived,
shared JVMs that are not part of the task system.
* Style, coverage.
This isn't necessary when using MSQWorkerTaskLauncher as the WorkerManager
implementation, because in that case, task failure also wakes up the
main thread. However, when using workers that are not task-based, we don't
want to rely on the WorkerManager for this.
The existing tests are moved into a "WithMaximalBuffering" subclass,
and a new "WithMinimalBuffering" subclass is added to test cases
where only a single frame is buffered.
* Move TerminalStageSpecFactory packages.
These packages are moved from the "guice" package to the "indexing.destination"
package. They make more sense here, since "guice" is reserved for Guice modules,
annotations, and providers.
* Rearrange imports.
* BaseWorkerClientImpl: Don't attempt to recover from a closed channel.
This patch introduces an exception type "ChannelClosedForWritesException",
which allows the BaseWorkerClientImpl to avoid retrying when the local
channel has been closed. This can happen in cases of cancellation.
* Add some test coverage.
* wip
* Add test coverage.
* Style.
* MSQ: Improved worker cancellation.
Four changes:
1) FrameProcessorExecutor now requires that cancellationIds be registered
with "registerCancellationId" prior to being used in "runFully" or "runAllFully".
2) FrameProcessorExecutor gains an "asExecutor" method, which allows that
executor to be used as an executor for future callbacks in such a way
that respects cancellationId.
3) RunWorkOrder gains a "stop" method, which cancels the current
cancellationId and closes the current FrameContext. It blocks until
both operations are complete.
4) Fixes a bug in RunAllFullyWidget where "processorManager.result()" was
called outside "runAllFullyLock", which could cause it to be called
out-of-order with "cleanup()" in case of cancellation or other error.
Together, these changes help ensure cancellation does not have races.
Once "cancel" is called for a given cancellationId, all existing processors
and running callbacks are canceled and exit in an orderly manner. Future
processors and callbacks with the same cancellationId are rejected
before being executed.
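A self-contained sketch of the register/run/cancel contract described in (1); this only illustrates the bookkeeping and is not the actual FrameProcessorExecutor implementation:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the register/run/cancel contract; not the real FrameProcessorExecutor.
public class CancellationRegistrySketch
{
  private final Set<String> registered = ConcurrentHashMap.newKeySet();
  private final Set<String> canceled = ConcurrentHashMap.newKeySet();

  /** Must be called before any work is submitted under this cancellationId. */
  public void registerCancellationId(final String cancellationId)
  {
    registered.add(cancellationId);
  }

  /** Work under an unregistered or canceled id is rejected before executing. */
  public void runFully(final Runnable work, final String cancellationId)
  {
    if (!registered.contains(cancellationId) || canceled.contains(cancellationId)) {
      throw new IllegalStateException("cancellationId not registered or already canceled");
    }
    work.run();
  }

  /** After cancel, future work under the same id is rejected up front. */
  public void cancel(final String cancellationId)
  {
    canceled.add(cancellationId);
  }
}
```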
* Fix test.
* Use execute, which doesn't return a value, to avoid errorprone complaints.
* Fix some style stuff.
* Further enhancements.
* Fix style.
* MSQ: Rework memory management.
This patch reworks memory management to better support multi-threaded
workers running in shared JVMs. There are two main changes.
First, processing buffers and threads are moved from a per-JVM model to
a per-worker model. This enables queries to hold processing buffers
without blocking other concurrently-running queries. Changes:
- Introduce ProcessingBuffersSet and ProcessingBuffers to hold the
per-worker and per-work-order processing buffers (respectively). On Peons,
this is the JVM-wide processing pool. On Indexers, this is a per-worker
pool of on-heap buffers. (This change fixes a bug on Indexers where
excessive processing buffers could be used if MSQ tasks ran concurrently
with realtime tasks.)
- Add "bufferPool" argument to GroupingEngine#process so a per-worker pool
can be passed in.
- Add "druid.msq.task.memory.maxThreads" property, which controls the
maximum number of processing threads to use per task. This allows usage of
multiple processing buffers per task if admins desire.
- IndexerWorkerContext acquires processingBuffers when creating the FrameContext
for a work order, and releases them when closing the FrameContext.
- Add "usesProcessingBuffers()" to FrameProcessorFactory so workers know
how many sets of processing buffers are needed to run a given query.
Second, adjustments to how WorkerMemoryParameters slices up bundles, to
favor more memory for sorting and segment generation. Changes:
- Instead of using same-sized bundles for processing and for sorting,
workers now use minimally-sized processing bundles (just enough to read
inputs plus a little overhead). The rest is devoted to broadcast data
buffering, sorting, and segment-building.
- Segment-building is now limited to 1 concurrent segment per work order.
This allows each segment-building action to use more memory. Note that
segment-building is internally multi-threaded to a degree. (Build and
persist can run concurrently.)
- Simplify frame size calculations by removing the distinction between
"standard" and "large" frames. The new default frame size is the same
as the old "standard" frames, 1 MB. The original goal of of the large
frames was to reduce the number of temporary files during sorting, but
I think we can achieve the same thing by simply merging a larger number
of standard frames at once.
- Remove the small worker adjustment that was added in #14117 to account
for an extra frame involved in writing to durable storage. Instead,
account for the extra frame whenever we are actually using durable storage.
- Cap super-sorter parallelism using the number of output partitions, rather
than using a hard-coded cap of 4. Note that in practice, so far, this cap
has not been relevant for tasks because they have only been using a single
processing thread anyway.
* Remove unused import.
* Fix errorprone annotation.
* Fixes for javadocs and inspections.
* Additional test coverage.
* Fix test.
As we move towards multi-threaded MSQ workers, it helps for parallelism
to generate more than one partition per worker. That way, we can fully
utilize all worker threads throughout all stages.
The default value is the number of processing threads. Currently, this
is hard-coded to 1 for peons, but that is expected to change in the future.
1) ControllerQueryKernel: Update readyToReadResults to acknowledge that sorting stages can
go directly from READING_INPUT to RESULTS_READY.
2) WorkerStageKernel: Ignore RESULTS_COMPLETE if work is already finished, which can happen
if the transition to FINISHED comes early due to a downstream LIMIT.
Tasks control the loading of broadcast datasources via BroadcastDatasourceLoadingSpec getBroadcastDatasourceLoadingSpec(). By default, tasks download all broadcast datasources, unless there's an override as with kill and MSQ controller task.
The CLIPeon command line option --loadBroadcastSegments is deprecated in favor of --loadBroadcastDatasourceMode.
Broadcast datasources can be specified in SQL queries through JOIN and FROM clauses, or obtained from other sources such as lookups. To this end, we have introduced a BroadcastDatasourceLoadingSpec. Finding the set of broadcast datasources during SQL planning will be done in a follow-up, which will apply only to MSQ tasks, so they load only required broadcast datasources. This PR primarily focuses on the skeletal changes around BroadcastDatasourceLoadingSpec and integrating it from the Task interface via CliPeon to SegmentBootstrapper.
Currently, only kill tasks and MSQ controller tasks skip loading broadcast datasources.
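A hedged sketch of the loading-spec idea; mode names and methods here are illustrative and may differ from the real BroadcastDatasourceLoadingSpec:

```java
import java.util.Set;

// Hedged sketch of the loading-spec idea; the real BroadcastDatasourceLoadingSpec
// in Druid may use different mode names and factory methods.
public class BroadcastLoadingSpecSketch
{
  public enum Mode { ALL, NONE, ONLY_REQUIRED }

  private final Mode mode;
  private final Set<String> datasourcesToLoad; // only meaningful for ONLY_REQUIRED

  public BroadcastLoadingSpecSketch(Mode mode, Set<String> datasourcesToLoad)
  {
    this.mode = mode;
    this.datasourcesToLoad = datasourcesToLoad;
  }

  /** Tasks default to loading everything; kill and MSQ controller tasks override this. */
  public boolean shouldLoad(final String datasource)
  {
    switch (mode) {
      case ALL:
        return true;
      case NONE:
        return false;
      default:
        return datasourcesToLoad != null && datasourcesToLoad.contains(datasource);
    }
  }
}
```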
* transition away from StorageAdapter
changes:
* CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via 'asCursorFactory'. The previous deprecated CursorFactory interface has been merged into StorageAdapter
* StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
* added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
* added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
* added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for HashJoinSegment, UnnestSegment, and FilteredSegment
* Updated all engines and tests to no longer use StorageAdapter
* Add framework for running MSQ tests with taskSpec instead of SQL
* Allow configurable datasegment for tests
* Add test
* Revert "Add test"
This reverts commit 79fb241545.
* Revert "Allow configurable datasegment for tests"
This reverts commit caf04ede2b.
In the compaction config, a range type partitionsSpec supports setting one of maxRowsPerSegment and targetRowsPerSegment. When compaction is run with the native engine, while maxRowsPerSegment = x results in segments of size x, targetRowsPerSegment = y results in segments of size 1.5 * y.
MSQ only supports rowsPerSegment = x as part of its tuning config, the resulting segment size being approx. x -- which is in line with maxRowsPerSegment behaviour in native compaction.
This PR makes the following changes:
- Use the effective maxRowsPerSegment to pass as the rowsPerSegment parameter for MSQ.
- Persist rowsPerSegment as maxRowsPerSegment in lastCompactionState for MSQ.
- Use the effective maxRowsPerSegment-based range spec in the CompactionStatus check for both native and MSQ.
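A small sketch of how an "effective maxRowsPerSegment" could be derived from a range partitionsSpec, assuming the 1.5x targetRowsPerSegment behaviour of the native engine described above; names are illustrative:

```java
// Sketch of deriving an effective maxRowsPerSegment from a range partitionsSpec,
// assuming the 1.5x targetRowsPerSegment behaviour described above. One of the
// two parameters is expected to be non-null, matching the compaction config contract.
public class EffectiveMaxRows
{
  public static int effectiveMaxRowsPerSegment(final Integer maxRowsPerSegment, final Integer targetRowsPerSegment)
  {
    if (maxRowsPerSegment != null) {
      return maxRowsPerSegment;
    }
    // targetRowsPerSegment = y yields segments of roughly 1.5 * y under the native
    // engine, so 1.5 * y serves as the effective max passed to MSQ's rowsPerSegment.
    return (int) (targetRowsPerSegment * 1.5);
  }
}
```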
This reverts commit f1d24c868f.
Updating nimbus to version 9+ is causing HTTP ERROR 500 java.lang.NoSuchMethodError: 'net.minidev.json.JSONObject com.nimbusds.jwt.JWTClaimsSet.toJSONObject()'
Refer to SAP/cloud-security-services-integration-library#429 (comment) for more details.
We would need to upgrade other libraries as well to update nimbus.jose.jwt.
This patch adds "TypeCastSelectors", which is used when writing frames to
perform two coercions:
- When a numeric type is desired and the underlying type is non-numeric or
unknown, the underlying selector is wrapped, "getObject" is called and the
result is coerced using "ExprEval.ofType". This differs from the prior
behavior where the primitive methods like "getLong", "getDouble", etc, would
be called directly. This fixes an issue where a column would be read as
all-zeroes when its SQL type is numeric and its physical type is string, which
can happen when evolving a column's type from string to number.
- When an array type is desired, the underlying selector is wrapped,
"getObject" is called, and the result is coerced to Object[]. This coercion
replaces some earlier logic from #15917.
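A generic illustration of the object-then-coerce approach for the numeric case; the actual TypeCastSelectors code wraps Druid selectors and uses ExprEval.ofType rather than this hypothetical helper:

```java
// Generic illustration of the object-then-coerce approach; the real code uses
// Druid's ExprEval.ofType instead of this hypothetical helper.
public class CoerceToLong
{
  public static long coerce(final Object value)
  {
    if (value == null) {
      return 0L;
    }
    if (value instanceof Number) {
      return ((Number) value).longValue();
    }
    if (value instanceof String) {
      try {
        // A string physical type with a numeric SQL type gets parsed, rather
        // than read as zero via getLong(), which was the pre-patch behavior.
        return Long.parseLong(((String) value).trim());
      }
      catch (NumberFormatException e) {
        return 0L;
      }
    }
    return 0L;
  }
}
```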
Currently, compaction with the MSQ engine doesn't work for rollup on multi-value dimensions (MVDs), because grouping on MVD dimensions unnests the dimension values by default; for instance, grouping on `[s1,s2]` with aggregate `a` will result in two rows: `<s1,a>` and `<s2,a>`.
This change enables rollup on MVDs (without unnest) by converting MVDs to Arrays before rollup using virtual columns, and then converting them back to MVDs using post aggregators. If segment schema is available to the compaction task (when it ends up downloading segments to get existing dimensions/metrics/granularity), it selectively does the MVD-Array conversion only for known multi-valued columns; else it conservatively performs this conversion for all `string` columns.
* MSQ: Add limitHint to global-sort shuffles.
This allows pushing down limits into the SuperSorter.
* Test fixes.
* Add limitSpec to ScanQueryKit. Fix SuperSorter tracking.
Description
-----------
Auto-compaction currently poses several challenges as it:
1. may get stuck on a failing interval.
2. may get stuck on the latest interval if more data keeps coming into it.
3. always picks the latest interval regardless of the level of compaction in it.
4. may never pick a datasource if its intervals are not very recent.
5. requires setting an explicit period which does not cater to the changing needs of a Druid cluster.
This PR introduces various improvements to compaction scheduling to tackle the above problems.
Change Summary
--------------
1. Run compaction for a datasource as a supervisor of type `autocompact` on Overlord.
2. Make compaction policy extensible and configurable.
3. Track status of recently submitted compaction tasks and pass this info to policy.
4. Add `/simulate` API on both Coordinator and Overlord to run compaction simulations.
5. Redirect compaction status APIs to the Overlord when compaction supervisors are enabled.
* MSQ: Add CPU and thread usage counters.
The main change adds "cpu" and "wall" counters. The "cpu" counter measures
CPU time (using JvmUtils.getCurrentThreadCpuTime) taken up by processors
in processing threads. The "wall" counter measures the amount of wall time
taken up by processors in those same processing threads. Both counters are
broken down by type of processor.
This patch also includes changes to support adding new counters. Due to an
oversight in the original design, older deserializers are not forwards-compatible;
they throw errors when encountering an unknown counter type. To manage this,
the following changes are made:
1) The defaultImpl NilQueryCounterSnapshot is added to QueryCounterSnapshot's
deserialization configuration. This means that any unrecognized counter types
will be read as "nil" by deserializers. Going forward, once all servers are
on the latest code, this is enough to enable easily adding new counters.
2) A new context parameter "includeAllCounters" is added, which defaults to "false".
When this parameter is set "false", only legacy counters are included. When set
to "true", all counters are included. This is currently undocumented. In a future
version, we should set the default to "true", and at that time, include a release
note that people updating from versions prior to Druid 31 should set this to
"false" until their upgrade is complete.
* Style, coverage.
* Fix.
Changes:
- Simplify exception handling in `CryptoService` by just catching an `Exception`
- Throw a `DruidException` as the exception is user facing
- Log the exception for easier debugging
- Add a test to verify thrown exception
Currently, if we have a query with window function having PARTITION BY xyz, and we have a million unique values for xyz each having 1 row, we'd end up creating a million individual RACs for processing, each having a single row. This is unnecessary, and we can batch the PARTITION BY keys together for processing, and process them only when we can't batch further rows to adhere to maxRowsMaterialized config.
The previous iteration of this PR was simplifying WindowOperatorQueryFrameProcessor to run all operators on all the rows instead of creating smaller RACs per partition by key. That approach was discarded in favor of the batching approach, and the details are summarized here: #16823 (comment).
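The batching idea can be sketched as follows; this is illustrative only, with partitions represented by row counts rather than the RACs used by the real WindowOperatorQueryFrameProcessor:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: whole PARTITION BY groups are batched together until adding
// the next group would exceed maxRowsMaterialized, instead of one tiny RAC per key.
public class PartitionBatcher
{
  public static List<List<Integer>> batchPartitions(final List<Integer> partitionRowCounts, final int maxRowsMaterialized)
  {
    final List<List<Integer>> batches = new ArrayList<>();
    List<Integer> currentBatch = new ArrayList<>();
    int currentRows = 0;

    for (int rows : partitionRowCounts) {
      if (!currentBatch.isEmpty() && currentRows + rows > maxRowsMaterialized) {
        // Flush the current batch and start a new one.
        batches.add(currentBatch);
        currentBatch = new ArrayList<>();
        currentRows = 0;
      }
      currentBatch.add(rows);
      currentRows += rows;
    }
    if (!currentBatch.isEmpty()) {
      batches.add(currentBatch);
    }
    return batches;
  }
}
```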
changes:
* Adds new `CompressedComplexColumn`, `CompressedComplexColumnSerializer`, `CompressedComplexColumnSupplier` based on `CompressedVariableSizedBlobColumn` used by JSON columns
* Adds `IndexSpec.complexMetricCompression` which can be used to specify compression for the generic compressed complex column. Defaults to uncompressed because compressed columns are not backwards compatible.
* Adds new definition of `ComplexMetricSerde.getSerializer` which accepts an `IndexSpec` argument when creating a serializer. The old signature has been marked `@Deprecated` and has a default implementation that returns `null`, but it will be used by the default implementation of the new version if it is implemented to return a non-null value. The default implementation of the new method will use a `CompressedComplexColumnSerializer` if `IndexSpec.complexMetricCompression` is not null/none/uncompressed, or will use `LargeColumnSupportedComplexColumnSerializer` otherwise.
* Consolidated all duplicate generic implementations of `ComplexMetricSerde.getSerializer` and `ComplexMetricSerde.deserializeColumn` into default implementations on `ComplexMetricSerde` instead of being copied all over the place. The default implementation of `deserializeColumn` will check if the first byte indicates that the new compression was used, otherwise it will use the `GenericIndexed` based supplier.
* Complex columns with custom serializers/deserializers are unaffected and may continue doing whatever it is they do, either with specialized compression or whatever else; this new functionality just provides generic implementations built around `ObjectStrategy`.
* add ObjectStrategy.readRetainsBufferReference so CompressedComplexColumn only copies on read if required
* add copyValueOnRead flag down to CompressedBlockReader to avoid duplicating the buffer if the value needs to be copied anyway
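A hedged sketch of the getSerializer delegation pattern described above; the real ComplexMetricSerde methods take column/IndexSpec arguments and return actual serializer types rather than these placeholders:

```java
// Hedged sketch of the delegation pattern; placeholder types stand in for
// GenericColumnSerializer, CompressedComplexColumnSerializer, etc.
interface ComplexSerdeSketch
{
  interface Serializer {}
  class CompressedSerializer implements Serializer {}
  class LegacySerializer implements Serializer {}

  /** Old signature, now deprecated: default returns null, i.e. "nothing custom". */
  @Deprecated
  default Serializer getSerializer()
  {
    return null;
  }

  /** New signature: uses the old method's result if an implementation provides one. */
  default Serializer getSerializer(final boolean compressionRequested)
  {
    final Serializer custom = getSerializer();
    if (custom != null) {
      return custom; // custom serdes are unaffected by the new generic path
    }
    // Generic path: compressed column when requested, legacy layout otherwise.
    return compressionRequested ? new CompressedSerializer() : new LegacySerializer();
  }
}
```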
* MSQ: Fix validation of time position in collations.
It is possible for the collation to refer to a field that isn't mapped,
such as when the DML includes "CLUSTERED BY some_function(some_field)".
In this case, the collation refers to a projected column that is not
part of the field mappings. Prior to this patch, that would lead to an
out of bounds list access on fieldMappings.
This patch fixes the problem by identifying the position of __time in
the fieldMappings first, rather than retrieving each collation field
from fieldMappings.
Fixes a bug introduced in #16849.
* Fix test. Better warning message.
* Place __time in signatures according to sort order.
Updates a variety of places to put __time in row signatures according
to its position in the sort order, rather than always first, including:
- InputSourceSampler.
- ScanQueryEngine (in the default signature when "columns" is empty).
- Various StorageAdapters, which also have the effect of reordering
the column order in segmentMetadata queries, and therefore in SQL
schemas as well.
Follow-up to #16849.
* Fix compilation.
* Additional fixes.
* Fix.
* Fix style.
* Omit nonexistent columns from the row signature.
* Fix tests.
* Segments primarily sorted by non-time columns.
Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.
However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.
For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".
For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.
In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.
The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)
Changes on the write path:
1) DimensionsSpec can now optionally contain a __time dimension, which
controls the placement of __time in the sort order. If not present,
__time is considered to be first in the sort order, as it has always
been.
2) IncrementalIndex and IndexMerger are updated to sort facts more
flexibly; not always by time first.
3) Metadata (stored in metadata.drd) gains a "sortOrder" field.
4) MSQ can generate range-based shard specs even when not all columns are
singly-valued strings. It merely stops accepting new clustering key
fields when it encounters the first one that isn't a singly-valued
string. This is useful because it enables range shard specs on
"someDim" to be created for clauses like "CLUSTERED BY someDim, __time".
Changes on the read path:
1) Add StorageAdapter#getSortOrder so query engines can tell how a
segment is sorted.
2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
and VectorCursorGranularizer to throw errors when using granularities
on non-time-ordered segments.
3) Update ScanQueryEngine to throw an error when using the time-ordering
"order" parameter on non-time-ordered segments.
4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
running on a non-time-ordered segment.
5) Add "sqlUseGranularity" context parameter that causes the SQL planner
to avoid using granularities other than ALL.
Other changes:
1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
and change the meaning subtly: it now returns true if the DimensionsSpec
represents an unchanging list of dimensions, or false if there is
some discovery happening. This is what call sites had expected anyway.
* Fixups from CI.
* Fixes.
* Fix missing arg.
* Additional changes.
* Fix logic.
* Fixes.
* Fix test.
* Adjust test.
* Remove throws.
* Fix styles.
* Fix javadocs.
* Cleanup.
* Smoother handling of null ordering.
* Fix tests.
* Missed a spot on the merge.
* Fixups.
* Avoid needless Filters.and.
* Add timeBoundaryInspector to test.
* Fix tests.
* Fix FrameStorageAdapterTest.
* Fix various tests.
* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.
* Pom fix.
* Fix doc.
Previously, SeekableStreamIndexTaskRunner set ingestion state to
COMPLETED when it finished reading data from Kafka. This is incorrect.
After the changes in this patch, the transitions go:
1) The task stays in BUILD_SEGMENTS after it finishes reading from Kafka,
while it is building its final set of segments to publish.
2) The task transitions to SEGMENT_AVAILABILITY_WAIT after publishing,
while waiting for handoff.
3) The task transitions to COMPLETED immediately before exiting, when
truly done.
A follow-up PR for #16864. Just renames dimensionToSchemaMap to dimensionSchemas and always overrides ARRAY_INGEST_MODE context value to array for MSQ compaction.
changes:
* Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* added new interface `CursorHolder` and new interface `CursorHolderFactory` as a replacement for `CursorFactory`, with method `makeCursorHolder`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* `CursorFactory.makeCursors` previously returned a `Sequence<Cursor>` corresponding to the query granularity buckets, with a separate `Cursor` per bucket. `CursorHolder.asCursor` instead returns a single `Cursor` (equivalent to 'ALL' granularity), and a new `CursorGranularizer` has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its `VectorCursorGranularizer`), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities.
* Deprecated `CursorFactory`, `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* updated all `StorageAdapter` implementations to implement `makeCursorHolder`, and transitioned direct `CursorFactory` implementations to instead implement `CursorHolderFactory`. `StorageAdapter` being a `CursorHolderFactory` is intended to be transitional; ideally it will not be released, in favor of moving `CursorHolderFactory` to be fetched directly from `Segment`, however this PR was already large enough so this will be done in a follow-up.
* updated all query engines to use `makeCursorHolder`, granularity based engines to use `CursorGranularizer`.
Upgrades or downgrades involving any version at or before Druid 30, where the newer version runs a worker task while the older version runs a controller task, can fail. The patch removes that verification check until it's safe to add it back.
Fixes #13936
In cases where a supervisor is idle and the Overlord is restarted for some reason, the supervisor would
start spinning up tasks again. In clusters with many low-throughput streams, this would spike
the task count unnecessarily.
This commit compares the latest stream offsets with the ones in metadata during supervisor startup
and sets the supervisor to the idle state if they match.
This PR adds checks to verify DataSourceCompactionConfig and CompactionTask with the MSQ engine to ensure that:
- each aggregator in metricsSpec is idempotent
- metricsSpec is non-null when rollup is set to true
Unit tests and existing compaction ITs have been updated accordingly.
This PR fixes query correctness issues for MSQ window functions when using more than 1 worker (that is, maxNumTasks > 2).
Previously, we kept the shuffle spec of the previous stage when we didn't have any partition columns for the window stage. This PR changes it to override the shuffle spec of the previous stage to MixShuffleSpec (if we have a window function with an empty over clause) so that the window stage gets a single partition to work on.
A test has been added for a query which returned incorrect results prior to this change when using more than 1 worker.
* enables launching a fake broker based on test resources (druidtest uri)
* can record queries into new test files during usage
* instead of re-purposing Calcite's Hook, migrates to DruidHook, to which we can add further keys
* added a quidem-ut module which could be the place for tests that interact with modules/etc
This patch introduces an optional cluster configuration, druid.indexing.formats.stringMultiValueHandlingMode, allowing operators to override the default mode SORTED_SET for string dimensions. The possible values for the config are SORTED_SET, SORTED_ARRAY, or ARRAY (SORTED_SET is the default). Case insensitive values are allowed.
While this cluster property allows users to manage the multi-value handling mode for string dimension types, it's recommended to migrate to using real array types instead of MVDs.
This fixes a long-standing issue: compaction now honors the configured cluster-wide property instead of always rewriting it as the default SORTED_ARRAY, even if the data was originally ingested with ARRAY or SORTED_SET.
* SQL syntax error should target USER persona
* * revert change to queryHandler and related tests, based on review comments
* * add test
* Introduce KinesisRecordEntity to support Kinesis headers in InputFormats
* * add kinesisInputFormat and Reader, and tests
* * bind KinesisInputFormat class to module
* * improve test coverage
* * remove references to kafka
* * resolve review comments
* * remove comment
* * fix grammar of comment
* * fix comment again
* * fix comment again
* * more review comments
* * add partitionKey
* * add check for same timestamp and partitionKey column name
* * fix intellij inspection
If the optional query parameter detail is supplied, then the response also includes the following:
* A stages object that summarizes information about the different stages being used for query execution, such as stage number, phase, start time, duration, input and output information, processing methods, and partitioning.
* A counters object that provides details on the rows, bytes, and files processed at various stages for each worker across different channels, along with sort progress.
* A warnings object that provides details about any warnings.
* MSQ worker: Support in-memory shuffles.
This patch is a follow-up to #16168, adding worker-side support for
in-memory shuffles. Changes include:
1) Worker-side code now respects the same context parameter "maxConcurrentStages"
that was added to the controller in #16168. The parameter remains undocumented
for now, to give us a chance to more fully develop and test this functionality.
2) WorkerImpl is broken up into WorkerImpl, RunWorkOrder, and RunWorkOrderListener
to improve readability.
3) WorkerImpl has a new StageOutputHolder + StageOutputReader concept, which
abstract over memory-based or file-based stage results.
4) RunWorkOrder is updated to create in-memory stage output channels when
instructed to.
5) ControllerResource is updated to add /doneReadingInput/, so the controller
can tell when workers that sort, but do not gather statistics, are done reading
their inputs.
6) WorkerMemoryParameters is updated to consider maxConcurrentStages.
Additionally, WorkerChatHandler is split into WorkerResource, so as to match
ControllerChatHandler and ControllerResource.
* Updates for static checks, test coverage.
* Fixes.
* Remove exception.
* Changes from review.
* Address static check.
* Changes from review.
* Improvements to docs and method names.
* Update comments, add test.
* Additional javadocs.
* Fix throws.
* Fix worker stopping in tests.
* Fix stuck test.
Follow-up to #16291, this commit enables a subset of existing native compaction ITs on the MSQ engine.
In the process, the following changes have been introduced in the MSQ compaction flow:
- Populate `metricsSpec` in `CompactionState` from `querySpec` in `MSQControllerTask` instead of `dataSchema`
- Add check for pre-rolled-up segments having `AggregatorFactory` with different input and output column names
- Fix passing missing cluster-by clause in scan queries
- Add annotation of `CompactionState` to tombstone segments
Changes:
- Add API `/druid/coordinator/v1/config/compaction/global` to update cluster level compaction config
- Add class `CompactionConfigUpdateRequest`
- Fix bug in `CoordinatorCompactionConfig` which caused compaction engine to not be persisted.
Use json field name `engine` instead of `compactionEngine` because JSON field names must align
with the getter name.
- Update MSQ validation error messages
- Complete overhaul of `CoordinatorCompactionConfigResourceTest` to remove unnecessary mocking
and add more meaningful tests.
- Add `TuningConfigBuilder` to easily build tuning configs for tests.
- Add `DatasourceCompactionConfigBuilder`
Changes the WindowFrame internals / representation a bit; introduces dedicated frame types for rows and groups which correspond to the implemented processing methods
* MSQ window functions: Revamp logic to create separate window stages when empty over() clause is present
* Fix tests
* Revert changes of creating separate stages for empty over clause
* Address review comments
changes:
* removes `druid.indexer.task.batchProcessingMode` in favor of always using `CLOSED_SEGMENT_SINKS` which uses `BatchAppenderator`. This was intended to become the default for native batch, but that was missed so `CLOSED_SEGMENTS` was the default (using `AppenderatorImpl`), however MSQ has been exclusively using `BatchAppenderator` with no problems so it seems safe to just roll it out as the only option for batch ingestion everywhere.
* with `batchProcessingMode` gone, there is no use for `AppenderatorImpl` so it has been removed
* simplify `Appenderator` construction since there are only separate stream and batch versions now
* simplify tests since `batchProcessingMode` is gone
This PR aims to check if the complex column being queried aligns with the supported types in the aggregator and aggregator factories, and throws a user-friendly error message if they don't.
* Throw exception if DISTINCT used with window functions aggregate call
* Improve error message when unsupported aggregations are used with window functions
Fixes #16766
Change log level from INFO to DEBUG when processing an empty user map
during polling. An empty user map is a normal situation for some
authenticators (e.g. LDAP) and polling is frequent (1 minute by
default).
changes:
* removed `Firehose` and `FirehoseFactory` and remaining implementations which were mostly no longer used after #16602
* Moved `IngestSegmentFirehose` which was still used internally by Hadoop ingestion to `DatasourceRecordReader.SegmentReader`
* Rename `SQLFirehoseFactoryDatabaseConnector` to `SQLInputSourceDatabaseConnector` and similar renames for sub-classes
* Moved anything remaining in a 'firehose' package somewhere else
* Clean up docs on firehose stuff
* #16717 defer provider instantiation
* add license header
* fix style, ignore new class in jacoco as it is still initialization code
---------
Co-authored-by: Alberto Lago Alvarado <albl@sitecore.net>
* When an ArrayList RAC creates a child RAC, the start and end offsets need to be offset by the parent's start offset
* Defaults the 2nd window bound to CURRENT ROW when only a single bound is specified
* Removes the windowingStrictValidation warning and throws a hard exception when a RANGE frame with ORDER BY does not use UNBOUNDED or CURRENT ROW as both bounds
Changes:
- No functional change
- Add class `TuningConfigBuilder` to build `IndexTuningConfig`, `CompactionTuningConfig`
- Remove old class `ParallelIndexTestingFactory.TuningConfigBuilder`
- Remove some unused fields and methods
Changes:
- No functional change
- Remove unused method `IndexTuningConfig.withPartitionsSpec()`
- Remove unused method `ParallelIndexTuningConfig.withPartitionsSpec()`
- Remove redundant method `CompactTask.emitIngestionModeMetrics()`
- Remove Clock argument from `CompactionTask.createDataSchemasForInterval()` as it was only needed
for one test which was just verifying the value passed by the test itself. The code now uses a `Stopwatch`
instead, and the test simply verifies that the metric has been emitted.
- Other minor cleanup changes
Better fallback strategy when the broker is unable to materialize the subquery's results as frames for estimating the bytes:
a. We don't touch the subquery sequence until we know that we can materialize the result as frames
Description:
Compaction operations issued by the Coordinator currently run using the native query engine.
As the majority of the advancements that we are making in batch ingestion are in MSQ, it is imperative
that we support compaction on MSQ to make compaction more robust and possibly faster.
For instance, we have seen OOM errors in native compaction that MSQ could have handled by its
auto-calculation of tuning parameters.
This commit enables compaction on MSQ to remove the dependency on native engine.
Main changes:
* `DataSourceCompactionConfig` now has an additional field `engine` that can be one of
`[native, msq]` with `native` being the default.
* if engine is MSQ, `CompactSegments` duty assigns all available compaction task slots to the
launched `CompactionTask` to ensure full capacity is available to MSQ. This is to avoid stalling which
could happen in case a fraction of the tasks were allotted and they eventually fell short of the number
of tasks required by the MSQ engine to run the compaction.
* `ClientCompactionTaskQuery` has a new field `compactionRunner` with just one `engine` field.
* `CompactionTask` now has `CompactionRunner` interface instance with its implementations
`NativeCompactionRunner` and `MSQCompactionRunner` in the `druid-multi-stage-query` extension.
The ObjectMapper deserializes `ClientCompactionRunnerInfo` in `ClientCompactionTaskQuery` to the
`CompactionRunner` instance that is mapped to the specified type [`native`, `msq`].
* `CompactTask` uses the `CompactionRunner` instance it receives to create the indexing tasks.
* `CompactionTask` to `MSQControllerTask` conversion logic checks whether metrics are present in
the segment schema. If present, the task is created with a native group-by query; if not, the task is
issued with a scan query. The `storeCompactionState` flag is set in the context.
* Each created `MSQControllerTask` is launched in-place and its `TaskStatus` tracked to determine the
final status of the `CompactionTask`. The id of each of these tasks is the same as that of `CompactionTask`
since otherwise, the workers will be unable to determine the controller task's location for communication
(as they haven't been launched via the overlord).
In the case of a few aggregators, for example BloomSqlAggregator, BaseVarianceSqlAggregator etc., the aggName is updated from a0 to a0:agg, breaching the contract that the aggName should be the name which is passed in. This is causing a mismatch while creating a column accessor.
This commit aims to correct those violating sql aggregators.
Adds a shuffle based on the resultShuffleSpecFactory after a limit processor, depending on the query destination. LimitFrameProcessors currently do not update the partition boosting column, so we also add the boost column to the previous stage, if one is required.
* Add MSQ query context maxNumSegments.
- Default is MAX_INT (unbounded).
- When set, if a time chunk contains more segments than the limit set in the
query context, the MSQ task will fail with TooManySegments fault.
* Fixup hashCode().
* Rename and checkpoint.
* Add some insert and replace happy and sad path tests.
* Update error msg.
* Commentary
* Adjust the default to be null (meaning no max bound on number of segments).
Also fix formatter.
* Fix CodeQL warnings and minor cleanup.
* Assert on maxNumSegments tuning config.
* Minor test cleanup.
* Use null default for the MultiStageQueryContext as well
* Review feedback
* Review feedback
* Move logic to common function getPartitionsByBucket shared by INSERT and REPLACE.
* Rename to validateNumSegmentsPerBucketOrThrow() for consistency.
* Add segmentGranularity to error message.
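A rough sketch of the per-time-chunk check, reusing the validateNumSegmentsPerBucketOrThrow name from the list above; the real validation lives in the MSQ controller and raises a TooManySegments fault rather than throwing directly:

```java
import java.util.Map;

// Illustrative sketch of the per-time-chunk segment count check.
public class SegmentCountValidator
{
  public static void validateNumSegmentsPerBucketOrThrow(
      final Map<String, Integer> segmentCountsByTimeChunk,
      final Integer maxNumSegments
  )
  {
    if (maxNumSegments == null) {
      return; // null default means no bound on the number of segments
    }
    for (Map.Entry<String, Integer> entry : segmentCountsByTimeChunk.entrySet()) {
      if (entry.getValue() > maxNumSegments) {
        throw new IllegalStateException(
            "Too many segments [" + entry.getValue() + "] in time chunk [" + entry.getKey()
            + "], max allowed is [" + maxNumSegments + "]"
        );
      }
    }
  }
}
```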
MSQ cannot process null bytes in string fields, and the current workaround is to remove them using the REPLACE function. A 'removeNullBytes' context parameter has been added, which sanitizes input string fields by removing these null bytes.
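A minimal sketch of the sanitization that removeNullBytes implies, stripping null bytes (\u0000) from string fields before processing:

```java
// Minimal sketch of stripping null bytes from a string field; illustrative only.
public class NullByteSanitizer
{
  public static String removeNullBytes(final String value)
  {
    if (value == null || value.indexOf('\u0000') < 0) {
      return value; // nothing to strip
    }
    return value.replace("\u0000", "");
  }
}
```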
* first pass
* more changes
* fix tests and formatting
* fix kinesis failing tests
* fix kafka tests
* add dimension name to float parse errors
* double and convertToType handling of dimensionName can report parse errors with dimension name
* fix checkstyle issue
* fix tests
* more cases to have better parse exception messages
* fix test
* fix tests
* partially address comments
* annotate method parameter with nullable
* address comments
* fix tests
* let float, double, long dimensionIndexer pass dimensionName down to dimensionHandlerUtils
* fix compilation error and clean up formatting
* clean up whitespace
* address feedback. undo change, pass down report parse exception for convertToType
* fix test
* Support ListBasedInputRow in Kafka ingestion with header
* Fix up buildBlendedEventMap
* Add new test for KafkaInputFormat with csv value and headers
* Do not use forbidden APIs
* Move utility method to TestUtils
index_realtime tasks were removed from the documentation in #13107. Even
at that time, they weren't really documented per se— just mentioned. They
existed solely to support Tranquility, which is an obsolete ingestion
method that predates migration of Druid to ASF and is no longer being
maintained. Tranquility docs were also de-linked from the sidebars and
the other doc pages in #11134. Only a stub remains, so people with
links to the page can see that it's no longer recommended.
index_realtime_appenderator tasks existed in the code base, but were
never documented, nor as far as I am aware were they used for any purpose.
This patch removes both task types completely, as well as removes all
supporting code that was otherwise unused. It also updates the stub
doc for Tranquility to be firmer that it is not compatible. (Previously,
the stub doc said it wasn't recommended, and pointed out that it is
built against an ancient 0.9.2 version of Druid.)
ITUnionQueryTest has been migrated to the new integration tests framework and updated to use Kafka ingestion.
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
This PR fixes a few bugs with MSQ export. The main change is calling SqlResults#coerce before writing the column. This allows sketches and json to be correctly deserialized. The format of the exported complex columns is similar to those produced by Async MSQ queries with CSV format.
Notes:
Fix printing of complex columns during export. Sketches and JSON are now correctly formatted during export.
Fix an NPE if the writer has not been initialized. Empty export queries will create an empty file at the location.
Fix a bug with counters for MSQ export, where rows were reported for only the first partition.