Changes
---------
- Add Overlord runtime property `druid.indexer.tasklock.batchAllocationReduceMetadataIO`
- Setting this flag to true (default value) allows the Overlord to fetch only necessary segment
payloads during segment allocation
- Setting this flag to false restores original segment allocation behaviour
All JDK 8 based CI checks have been removed.
Images used in Dockerfile(s) have been updated to Java 17 based images.
Documentation has been updated accordingly.
* DruidOverlord: Move becomeLeader/stopBeingLeader earlier.
On becoming leader, it is helpful for the TaskRunner and TaskQueue to be
available when the SupervisorManager starts up, to aid the supervisors
in discovering their tasks.
On stopping leadership, it is helpful for the TaskRunner and TaskQueue
to be available until the SupervisorManager has finished shutting down.
They are only available when the TaskMaster is in "leader" mode, so to
achieve the above, this patch moves it earlier in the sequence.
* Adjust leadership into two phases.
* Update test.
* Adjustments for coverage.
* Stop mirrors start better.
Following #17394, workerExec can get deadlocked with itself, because it
waits for task futures and is also used as the connectExec for the task
client. To fix this, we need to never await task futures in the workerExec.
There are two specific changes: in "verifyAndMergeCheckpoints" and
"checkpointTaskGroup", two "coalesceAndAwait" calls that formerly occurred
in workerExec are replaced with Futures.transform (using a callback in
workerExec).
Because this adjustment removes a source of blocking, it may also improve
supervisor responsiveness for high task counts. This is not the primary
goal, however. The primary goal is to fix the bug introduced by #17394.
* SeekableStreamSupervisor: Use workerExec as the client connectExec.
This patch uses the already-existing per-supervisor workerExec as the
connectExec for task clients, rather than using the process-wide default
ServiceClientFactory pool.
This helps prevent callbacks from backlogging on the process-wide pool.
It's especially useful for retries, where callbacks may need to establish
new TCP connections or perform TLS handshakes.
* Fix compilation, tests.
* Fix style.
MSQ currently supports only single-valued string dimensions as partition keys.
This patch adds a check to ensure that partition keys are single-valued in case
this info is available by virtue of segment download for schema inference.
During compaction, if MSQ finds multi-valued dimensions (MVDs) declared as part
of `range` partitionsSpec, it switches partitioning type to dynamic, ending up in
repeated compactions of the same interval. To avoid this scenario, the segment
download logic is also updated to always download segments if info on multi-valued
dimensions is required.
Description
-----------
The `OverlordCompactionScheduler` may sometimes launch a duplicate compaction
task for an interval that has just been compacted.
This may happen as follows:
- Scheduler launches a compaction task for an uncompacted interval.
- While the compaction task is running, the `CompactionStatusTracker` does not consider
this interval as compactible and returns the `CompactionStatus` as `SKIPPED` for it.
- As soon as the compaction task finishes, the `CompactionStatusTracker` starts considering
the interval eligible for compaction again.
- This interval remains eligible for compaction until the newly published segments are polled
from the database.
- Once the new segments have been polled, the `CompactionStatus` of the interval changes
to `COMPLETE`.
Change
--------
- Keep track of the `snapshotTime` in `DataSourcesSnapshot`. This time represents the start of the poll.
- Use the `snapshotTime` to determine if a poll has happened after a compaction task completed.
- If not, then skip the interval to avoid launching duplicate tasks.
- For tests, use a future `snapshotTime` to ensure that compaction is always triggered.
Fixes#16587
Streaming ingestion tasks operate by allocating segments before ingesting rows.
These allocations happen across replicas which may send different requests but
must get the same segment id for a given (datasource, interval, version, sequenceName)
across replicas.
This patch fixes the bug by ignoring the previousSegmentId when skipLineageCheck is true.
The patch makes the following changes:
1. Fixes a bug causing compaction to fail on array, complex, and other non-primitive-type columns
2. Updates compaction status check to be conscious of partition dimensions when comparing dimension ordering.
3. Ensures only string columns are specified as partition dimensions
4. Ensures `rollup` is true if and only if metricsSpec is non-empty
5. Ensures disjoint intervals aren't submitted for compaction
6. Adds `compactionReason` to compaction task context.
Follow up to #17137.
Instead of moving the streaming-only methods to the SeekableStreamSupervisor abstract class, this patch moves them to a separate StreamSupervisor interface. The reason is that the SeekableStreamSupervisor abstract class also has many other abstract methods. The StreamSupervisor interface on the other hand provides a minimal set of functions offering a good middle ground for any custom concrete implementation that doesn't require all the goodies from SeekableStreamSupervisor.
Extracting a few miscellaneous non-functional changes from the batch supervisor branch:
- Replace anonymous inner classes with lambda expressions in the SQL supervisor manager layer
- Add explicit @Nullable annotations in DynamicConfigProviderUtils to make IDE happy
- Small variable renames (copy-paste error perhaps) and fix typos
- Add table name for this exception message: Delete the supervisor from the table[%s] in the database...
- Prefer CollectionUtils.isEmptyOrNull() over list == null || list.size() > 0. We can change the Precondition checks to throwing DruidException separately for a batch of APIs at a time.
The current Supervisor interface is primarily focused on streaming use cases. However, as we introduce supervisors for non-streaming use cases, such as the recently added CompactionSupervisor (and the upcoming BatchSupervisor), certain operations like resetting offsets, checkpointing, task group handoff, etc., are not really applicable to non-streaming use cases.
So the methods are split between:
1. Supervisor: common methods that are applicable to both streaming and non-streaming use cases
2. SeekableStreamSupervisor: Supervisor + streaming-only operations. The existing streaming-only overrides exist along with the new abstract method public abstract LagStats computeLagStats(), for which custom implementations already exist in the concrete types
This PR is primarily a refactoring change with minimal functional adjustments (e.g., throwing an exception in a few places in SupervisorManager when the supervisor isn't the expected SeekableStreamSupervisor type).
Description
-----------
Coordinator logs are fairly noisy and don't give much useful information (see example below).
Even when the Coordinator misbehaves, these logs are not very useful.
Main changes
------------
- Add API `GET /druid/coordinator/v1/duties` that returns a status list of all duty groups currently running on the Coordinator
- Emit metrics `segment/poll/time`, `segment/pollWithSchema/time`, `segment/buildSnapshot/time`
- Remove redundant logs that indicate normal operation of well-tested aspects of the Coordinator
Refactors
---------
- Move some logic from `DutiesRunnable` to `CoordinatorDutyGroup`
- Move stats collection from `CollectSegmentAndServerStats` to `PrepareBalancerAndLoadQueues`
- Minor cleanup of class `DruidCoordinator`
- Clean up class `DruidCoordinatorRuntimeParams`
- Remove field `coordinatorStartTime`. Maintain start time in `MarkOvershadowedSegmentsAsUnused` instead.
- Remove field `MetadataRuleManager`. Pass supplier to constructor of applicable duties instead.
- Make `usedSegmentsNewestFirst` and `datasourcesSnapshot` as non-nullable as they are always required.
#16768 added the functionality to run compaction as a supervisor on the overlord.
This patch builds on top of that to restrict MSQ engine to compaction in the supervisor-mode only.
With these changes, users can no longer add MSQ engine as part of datasource compaction config,
or as the default cluster-level compaction engine, on the Coordinator.
The patch also adds an Overlord runtime property `druid.supervisor.compaction.engine=<msq/native>`
to specify the default engine for compaction supervisors.
Since these updates require major changes to existing MSQ compaction integration tests,
this patch disables MSQ-specific compaction integration tests -- they will be taken up in a follow-up PR.
Key changed/added classes in this patch:
* CompactionSupervisor
* CompactionSupervisorSpec
* CoordinatorCompactionConfigsResource
* OverlordCompactionScheduler
Text-based input formats like csv and tsv currently parse inputs only as strings, following the RFC4180Parser spec).
To workaround this, the web-console and other tools need to further inspect the sample data returned to sample data returned by the Druid sampler API to parse them as numbers.
This patch introduces a new optional config, tryParseNumbers, for the csv and tsv input formats. If enabled, any numbers present in the input will be parsed in the following manner -- long data type for integer types and double for floating-point numbers, and if parsing fails for whatever reason, the input is treated as a string. By default, this configuration is set to false, so numeric strings will be treated as strings.
Fixed vulnerabilities
CVE-2021-26291 : Apache Maven is vulnerable to Man-in-the-Middle (MitM) attacks. Various
functions across several files, mentioned below, allow for custom repositories to use the
insecure HTTP protocol. An attacker can exploit this as part of a Man-in-the-Middle (MitM)
attack, taking over or impersonating a repository using the insecure HTTP protocol.
Unsuspecting users may then have the compromised repository defined as a dependency in
their Project Object Model (pom) file and download potentially malicious files from it.
Was fixed by removing outdated tesla-aether library containing vulnerable maven-settings (v3.1.1) package, pull-deps utility updated to use maven resolver instead.
sonatype-2020-0244 : The joni package is vulnerable to Man-in-the-Middle (MitM) attacks.
This project downloads dependencies over HTTP due to an insecure repository configuration
within the .pom file. Consequently, a MitM could intercept requests to the specified
repository and replace the requested dependencies with malicious versions, which can execute
arbitrary code from the application that was built with them.
Was fixed by upgrading joni package to recommended 2.1.34 version
Tasks control the loading of broadcast datasources via BroadcastDatasourceLoadingSpec getBroadcastDatasourceLoadingSpec(). By default, tasks download all broadcast datasources, unless there's an override as with kill and MSQ controller task.
The CLIPeon command line option --loadBroadcastSegments is deprecated in favor of --loadBroadcastDatasourceMode.
Broadcast datasources can be specified in SQL queries through JOIN and FROM clauses, or obtained from other sources such as lookups.To this effect, we have introduced a BroadcastDatasourceLoadingSpec. Finding the set of broadcast datasources during SQL planning will be done in a follow-up, which will apply only to MSQ tasks, so they load only required broadcast datasources. This PR primarily focuses on the skeletal changes around BroadcastDatasourceLoadingSpec and integrating it from the Task interface via CliPeon to SegmentBootstrapper.
Currently, only kill tasks and MSQ controller tasks skip loading broadcast datasources.
Tasks that do not support querying or query processing i.e. supportsQueries = false do not require processing threads, processing buffers, and merge buffers.
* transition away from StorageAdapter
changes:
* CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via 'asCursorFactory'. The previous deprecated CursorFactory interface has been merged into StorageAdapter
* StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
* added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
* added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
* added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for HashJoinSegment, UnnestSegment, and FilteredSegment
* Updated all engines and tests to no longer use StorageAdapter
Description:
#16768 introduces new compaction APIs on the Overlord `/compact/status` and `/compact/progress`.
But the corresponding `OverlordClient` methods do not return an object compatible with the actual
endpoints defined in `OverlordCompactionResource`.
This patch ensures that the objects are compatible.
Changes:
- Add `CompactionStatusResponse` and `CompactionProgressResponse`
- Use these as the return type in `OverlordClient` methods and as the response entity in `OverlordCompactionResource`
- Add `SupervisorCleanupModule` bound on the Coordinator to perform cleanup of supervisors.
Without this module, Coordinator cannot deserialize compaction supervisors.
Currently compaction with MSQ engine doesn't work for rollup on multi-value dimensions (MVDs), the reason being the default behaviour of grouping on MVD dimensions to unnest the dimension values; for instance grouping on `[s1,s2]` with aggregate `a` will result in two rows: `<s1,a>` and `<s2,a>`.
This change enables rollup on MVDs (without unnest) by converting MVDs to Arrays before rollup using virtual columns, and then converting them back to MVDs using post aggregators. If segment schema is available to the compaction task (when it ends up downloading segments to get existing dimensions/metrics/granularity), it selectively does the MVD-Array conversion only for known multi-valued columns; else it conservatively performs this conversion for all `string` columns.
Description
-----------
Auto-compaction currently poses several challenges as it:
1. may get stuck on a failing interval.
2. may get stuck on the latest interval if more data keeps coming into it.
3. always picks the latest interval regardless of the level of compaction in it.
4. may never pick a datasource if its intervals are not very recent.
5. requires setting an explicit period which does not cater to the changing needs of a Druid cluster.
This PR introduces various improvements to compaction scheduling to tackle the above problems.
Change Summary
--------------
1. Run compaction for a datasource as a supervisor of type `autocompact` on Overlord.
2. Make compaction policy extensible and configurable.
3. Track status of recently submitted compaction tasks and pass this info to policy.
4. Add `/simulate` API on both Coordinator and Overlord to run compaction simulations.
5. Redirect compaction status APIs to the Overlord when compaction supervisors are enabled.
* Make IntelliJ's MethodIsIdenticalToSuperMethod an error
* Change codebase to follow new IntelliJ inspection
* Restore non-short-circuit boolean expressions to pass tests
* Place __time in signatures according to sort order.
Updates a variety of places to put __time in row signatures according
to its position in the sort order, rather than always first, including:
- InputSourceSampler.
- ScanQueryEngine (in the default signature when "columns" is empty).
- Various StorageAdapters, which also have the effect of reordering
the column order in segmentMetadata queries, and therefore in SQL
schemas as well.
Follow-up to #16849.
* Fix compilation.
* Additional fixes.
* Fix.
* Fix style.
* Omit nonexistent columns from the row signature.
* Fix tests.
* Segments primarily sorted by non-time columns.
Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.
However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.
For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".
For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.
In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.
The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)
Changes on the write path:
1) DimensionsSpec can now optionally contain a __time dimension, which
controls the placement of __time in the sort order. If not present,
__time is considered to be first in the sort order, as it has always
been.
2) IncrementalIndex and IndexMerger are updated to sort facts more
flexibly; not always by time first.
3) Metadata (stored in metadata.drd) gains a "sortOrder" field.
4) MSQ can generate range-based shard specs even when not all columns are
singly-valued strings. It merely stops accepting new clustering key
fields when it encounters the first one that isn't a singly-valued
string. This is useful because it enables range shard specs on
"someDim" to be created for clauses like "CLUSTERED BY someDim, __time".
Changes on the read path:
1) Add StorageAdapter#getSortOrder so query engines can tell how a
segment is sorted.
2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
and VectorCursorGranularizer to throw errors when using granularities
on non-time-ordered segments.
3) Update ScanQueryEngine to throw an error when using the time-ordering
"order" parameter on non-time-ordered segments.
4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
running on a non-time-ordered segment.
5) Add "sqlUseGranularity" context parameter that causes the SQL planner
to avoid using granularities other than ALL.
Other changes:
1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
and change the meaning subtly: it now returns true if the DimensionsSpec
represents an unchanging list of dimensions, or false if there is
some discovery happening. This is what call sites had expected anyway.
* Fixups from CI.
* Fixes.
* Fix missing arg.
* Additional changes.
* Fix logic.
* Fixes.
* Fix test.
* Adjust test.
* Remove throws.
* Fix styles.
* Fix javadocs.
* Cleanup.
* Smoother handling of null ordering.
* Fix tests.
* Missed a spot on the merge.
* Fixups.
* Avoid needless Filters.and.
* Add timeBoundaryInspector to test.
* Fix tests.
* Fix FrameStorageAdapterTest.
* Fix various tests.
* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.
* Pom fix.
* Fix doc.
Previously, SeekableStreamIndexTaskRunner set ingestion state to
COMPLETED when it finished reading data from Kafka. This is incorrect.
After the changes in this patch, the transitions go:
1) The task stays in BUILD_SEGMENTS after it finishes reading from Kafka,
while it is building its final set of segments to publish.
2) The task transitions to SEGMENT_AVAILABILITY_WAIT after publishing,
while waiting for handoff.
3) The task transitions to COMPLETED immediately before exiting, when
truly done.
changes:
* Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* added new interface `CursorHolder` and new interface `CursorHolderFactory` as a replacement for `CursorFactory`, with method `makeCursorHolder`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* `CursorFactory.makeCursors` previously returned a `Sequence<Cursor>` corresponding to the query granularity buckets, with a separate `Cursor` per bucket. `CursorHolder.asCursor` instead returns a single `Cursor` (equivalent to 'ALL' granularity), and a new `CursorGranularizer` has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its `VectorCursorGranularizer`), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities.
* Deprecated `CursorFactory`, `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* updated all `StorageAdapter` implementations to implement `makeCursorHolder`, transitioned direct `CursorFactory` implementations to instead implement `CursorMakerFactory`. `StorageAdapter` being a `CursorMakerFactory` is intended to be a transitional thing, ideally will not be released in favor of moving `CursorMakerFactory` to be fetched directly from `Segment`, however this PR was already large enough so this will be done in a follow-up.
* updated all query engines to use `makeCursorHolder`, granularity based engines to use `CursorGranularizer`.
Fixes#13936
In cases where a supervisor is idle and the overlord is restarted for some reason, the supervisor would
start spinning tasks again. In clusters where there are many low throughput streams, this would spike
the task count unnecessarily.
This commit compares the latest stream offset with the ones in metadata during the startup of supervisor
and sets it to idle state if they match.
Refactors the SemanticCreator annotation.
Moves the interface to the semantic package.
Create a SemanticUtils to hold logic for storing semantic maps.
Add FrameMaker interface.