Bug: When coordinator period is less than 30s, `maxSegmentsToMove` is always computed as 0
irrespective of number of available threads.
Changes:
- Fix lower bound condition and set minimum value to 100.
- Add new test which fails without this fix
Description
-----------
Auto-compaction currently poses several challenges as it:
1. may get stuck on a failing interval.
2. may get stuck on the latest interval if more data keeps coming into it.
3. always picks the latest interval regardless of the level of compaction in it.
4. may never pick a datasource if its intervals are not very recent.
5. requires setting an explicit period which does not cater to the changing needs of a Druid cluster.
This PR introduces various improvements to compaction scheduling to tackle the above problems.
Change Summary
--------------
1. Run compaction for a datasource as a supervisor of type `autocompact` on Overlord.
2. Make compaction policy extensible and configurable.
3. Track status of recently submitted compaction tasks and pass this info to policy.
4. Add `/simulate` API on both Coordinator and Overlord to run compaction simulations.
5. Redirect compaction status APIs to the Overlord when compaction supervisors are enabled.
* Make IntelliJ's MethodIsIdenticalToSuperMethod an error
* Change codebase to follow new IntelliJ inspection
* Restore non-short-circuit boolean expressions to pass tests
* Place __time in signatures according to sort order.
Updates a variety of places to put __time in row signatures according
to its position in the sort order, rather than always first, including:
- InputSourceSampler.
- ScanQueryEngine (in the default signature when "columns" is empty).
- Various StorageAdapters, which also have the effect of reordering
the column order in segmentMetadata queries, and therefore in SQL
schemas as well.
Follow-up to #16849.
* Fix compilation.
* Additional fixes.
* Fix.
* Fix style.
* Omit nonexistent columns from the row signature.
* Fix tests.
* Segments primarily sorted by non-time columns.
Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.
However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.
For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".
For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.
In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.
The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)
Changes on the write path:
1) DimensionsSpec can now optionally contain a __time dimension, which
controls the placement of __time in the sort order. If not present,
__time is considered to be first in the sort order, as it has always
been.
2) IncrementalIndex and IndexMerger are updated to sort facts more
flexibly; not always by time first.
3) Metadata (stored in metadata.drd) gains a "sortOrder" field.
4) MSQ can generate range-based shard specs even when not all columns are
singly-valued strings. It merely stops accepting new clustering key
fields when it encounters the first one that isn't a singly-valued
string. This is useful because it enables range shard specs on
"someDim" to be created for clauses like "CLUSTERED BY someDim, __time".
Changes on the read path:
1) Add StorageAdapter#getSortOrder so query engines can tell how a
segment is sorted.
2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
and VectorCursorGranularizer to throw errors when using granularities
on non-time-ordered segments.
3) Update ScanQueryEngine to throw an error when using the time-ordering
"order" parameter on non-time-ordered segments.
4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
running on a non-time-ordered segment.
5) Add "sqlUseGranularity" context parameter that causes the SQL planner
to avoid using granularities other than ALL.
Other changes:
1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
and change the meaning subtly: it now returns true if the DimensionsSpec
represents an unchanging list of dimensions, or false if there is
some discovery happening. This is what call sites had expected anyway.
* Fixups from CI.
* Fixes.
* Fix missing arg.
* Additional changes.
* Fix logic.
* Fixes.
* Fix test.
* Adjust test.
* Remove throws.
* Fix styles.
* Fix javadocs.
* Cleanup.
* Smoother handling of null ordering.
* Fix tests.
* Missed a spot on the merge.
* Fixups.
* Avoid needless Filters.and.
* Add timeBoundaryInspector to test.
* Fix tests.
* Fix FrameStorageAdapterTest.
* Fix various tests.
* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.
* Pom fix.
* Fix doc.
changes:
* Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* added new interface `CursorHolder` and new interface `CursorHolderFactory` as a replacement for `CursorFactory`, with method `makeCursorHolder`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursor`, and `CursorFactory.makeVectorCursor`
* `CursorFactory.makeCursors` previously returned a `Sequence<Cursor>` corresponding to the query granularity buckets, with a separate `Cursor` per bucket. `CursorHolder.asCursor` instead returns a single `Cursor` (equivalent to 'ALL' granularity), and a new `CursorGranularizer` has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its `VectorCursorGranularizer`), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities.
* Deprecated `CursorFactory`, `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* updated all `StorageAdapter` implementations to implement `makeCursorHolder`, transitioned direct `CursorFactory` implementations to instead implement `CursorMakerFactory`. `StorageAdapter` being a `CursorMakerFactory` is intended to be a transitional thing, ideally will not be released in favor of moving `CursorMakerFactory` to be fetched directly from `Segment`, however this PR was already large enough so this will be done in a follow-up.
* updated all query engines to use `makeCursorHolder`, granularity based engines to use `CursorGranularizer`.
* Fix build
* Run coldSchemaExec thread periodically
* Bugfix: Run cold schema refresh periodically
* Rename metrics for deep storage only segment schema process
Bug description:
Peons to fail to start up when `WorkerTaskCountStatsMonitor` is used on MiddleManagers.
This is because MiddleManagers pass on their properties to peons and peons are unable to
find `IndexerTaskCountStatsProvider` as that is bound only for indexer nodes.
Fix:
Check if node is an indexer before trying to get instance of `IndexerTaskCountStatsProvider`.
Fixes#13936
In cases where a supervisor is idle and the overlord is restarted for some reason, the supervisor would
start spinning tasks again. In clusters where there are many low throughput streams, this would spike
the task count unnecessarily.
This commit compares the latest stream offset with the ones in metadata during the startup of supervisor
and sets it to idle state if they match.
Adds a configuration clientConnectTimeout to our http client config which controls the connection timeout for our http client requests.
It was observed that on busy K8S clusters, the default connect timeout of 500ms is sometimes not enough time to complete syn/acks for a request and in these cases, the requests timeout with the error:
exceptionType=java.net.SocketTimeoutException, exceptionMessage=Connect Timeout
This behavior was mostly observed on the router while forwarding queries to the broker.
Having a slightly higher connect timeout helped resolve these issues.
This PR adds checks for verification of DataSourceCompactionConfig and CompactionTask with msq engine to ensure:
each aggregator in metricsSpec is idempotent
metricsSpec is non-null when rollup is set to true
Unit tests and existing compaction ITs have been updated accordingly.
Background:
ZK-based segment loading has been completely disabled in #15705 .
ZK `servedSegmentsPath` has been deprecated since Druid 0.7.1, #1182 .
This legacy path has been replaced by the `liveSegmentsPath` and is not used in the code anymore.
Changes:
- Never create ZK loadQueuePath as it is never used.
- Never create ZK servedSegmentsPath as it is never used.
- Do not create ZK liveSegmentsPath if announcement on ZK is disabled
- Fix up tests
* enables to launch a fake broker based on test resources (druidtest uri)
* could record queries into new testfiles during usage
* instead of re-purpose Calcite's Hook migrates to use DruidHook which we can add further keys
* added a quidem-ut module which could be the place for tests which could iteract with modules/etc
This PR adds indexer-level task metrics-
"indexer/task/failed/count"
"indexer/task/success/count"
the current "worker/task/completed/count" metric shows all the tasks completed irrespective of success or failure status so these metrics would help us get more visibility into the status of the completed tasks
This patch introduces an optional cluster configuration, druid.indexing.formats.stringMultiValueHandlingMode, allowing operators to override the default mode SORTED_SET for string dimensions. The possible values for the config are SORTED_SET, SORTED_ARRAY, or ARRAY (SORTED_SET is the default). Case insensitive values are allowed.
While this cluster property allows users to manage the multi-value handling mode for string dimension types, it's recommended to migrate to using real array types instead of MVDs.
This fixes a long-standing issue where compaction will honor the configured cluster wide property instead of rewriting it as the default SORTED_ARRAY always, even if the data was originally ingested with ARRAY or SORTED_SET.
Design:
The loading rate is computed as a moving average of at least the last 10 GiB of successful segment loads.
To account for multiple loading threads on a server, we use the concept of a batch to track load times.
A batch is a set of segments added by the coordinator to the load queue of a server in one go.
Computation:
batchDurationMillis = t(load queue becomes empty) - t(first load request in batch is sent to server)
batchBytes = total bytes successfully loaded in batch
avg loading rate in batch (kbps) = (8 * batchBytes) / batchDurationMillis
overall avg loading rate (kbps) = (8 * sumOverWindow(batchBytes)) / sumOverWindow(batchDurationMillis)
Changes:
- Add `LoadingRateTracker` which computes a moving average load rate based on
the last few GBs of successful segment loads.
- Emit metric `segment/loading/rateKbps` from the Coordinator. In the future, we may
also consider emitting this metric from the historicals themselves.
- Add `expectedLoadTimeMillis` to response of API `/druid/coordinator/v1/loadQueue?simple`
Changes:
- Rename `CoordinatorCompactionConfig` to `DruidCompactionConfig`
- Rename `CompactionConfigUpdateRequest` to `ClusterCompactionConfig`
- Refactor methods in `DruidCompactionConfig`
- Clean up `DataSourceCompactionConfigHistory` and its tests
- Clean up tests and add new tests
- Change API path `/druid/coordinator/v1/config/global` to `/druid/coordinator/v1/config/cluster`
Follow-up to #16291, this commit enables a subset of existing native compaction ITs on the MSQ engine.
In the process, the following changes have been introduced in the MSQ compaction flow:
- Populate `metricsSpec` in `CompactionState` from `querySpec` in `MSQControllerTask` instead of `dataSchema`
- Add check for pre-rolled-up segments having `AggregatorFactory` with different input and output column names
- Fix passing missing cluster-by clause in scan queries
- Add annotation of `CompactionState` to tombstone segments
Changes:
- Add API `/druid/coordinator/v1/config/compaction/global` to update cluster level compaction config
- Add class `CompactionConfigUpdateRequest`
- Fix bug in `CoordinatorCompactionConfig` which caused compaction engine to not be persisted.
Use json field name `engine` instead of `compactionEngine` because JSON field names must align
with the getter name.
- Update MSQ validation error messages
- Complete overhaul of `CoordinatorCompactionConfigResourceTest` to remove unnecessary mocking
and add more meaningful tests.
- Add `TuningConfigBuilder` to easily build tuning configs for tests.
- Add `DatasourceCompactionConfigBuilder`
* Round-robin iterator for datasources to kill.
Currently there's a fairness problem in the KillUnusedSegments duty
where the duty consistently selects the same set of datasources as discovered
from the metadata store or dynamic config params. This is a problem especially
when there are multiple unused. In a medium to large cluster, while we can increase
the task slots to increase the likelihood of broader coverage. This patch adds a simple
round-robin iterator to select datasources and has the following properties:
1. Starts with an initial random cursor position in an ordered list of candidates.
2. Consecutive {@code next()} iterations from {@link #getIterator()} are guaranteed to be deterministic
unless the set of candidates change when {@link #updateCandidates(Set)} is called.
3. Guarantees that no duplicate candidates are returned in two consecutive {@code next()} iterations.
* Renames in RoundRobinIteratorTest.
* Address review comments.
1. Clarify javadocs on the ordered list. Also flesh out the details a bit more.
2. Rename the test hooks to make intent clearer and fix typo.
3. Add NotThreadSafe annotation.
4. Remove one potentially noisy log that's in the path of iteration.
* Add null check to input candidates.
* More commentary.
* Addres review feedback: downgrade some new info logs to debug; invert condition.
Remove redundant comments.
Remove rendundant variable tracking.
* CircularList adjustments.
* Updates to CircularList and cleanup RoundRobinInterator.
* One more case and add more tests.
* Make advanceCursor private for now.
* Review comments.
* more aggressive cancellation of broker parallel merge, more chill blocking queue timeouts
* wire parallel merge into query cancellation system
* oops
* style
* adjust metrics initialization
* fix timeout, fix cleanup to not block
* javadocs to clarify why cancellation future and gizmo are split
* cancelled -> canceled, simplify QueuePusher since it always takes a ResultBatch, non-static terminal marker to make stuff stop complaining about types, specialize tryOffer to be tryOfferTerminal so it wont be misused, add comments to clarify reason for non-blocking offers that might fail
changes:
* removes `druid.indexer.task.batchProcessingMode` in favor of always using `CLOSED_SEGMENT_SINKS` which uses `BatchAppenderator`. This was intended to become the default for native batch, but that was missed so `CLOSED_SEGMENTS` was the default (using `AppenderatorImpl`), however MSQ has been exclusively using `BatchAppenderator` with no problems so it seems safe to just roll it out as the only option for batch ingestion everywhere.
* with `batchProcessingMode` gone, there is no use for `AppenderatorImpl` so it has been removed
* implify `Appenderator` construction since there are only separate stream and batch versions now
* simplify tests since `batchProcessingMode` is gone
changes:
* removed `Firehose` and `FirehoseFactory` and remaining implementations which were mostly no longer used after #16602
* Moved `IngestSegmentFirehose` which was still used internally by Hadoop ingestion to `DatasourceRecordReader.SegmentReader`
* Rename `SQLFirehoseFactoryDatabaseConnector` to `SQLInputSourceDatabaseConnector` and similar renames for sub-classes
* Moved anything remaining in a 'firehose' package somewhere else
* Clean up docs on firehose stuff
Description:
Task action audit logging was first deprecated and disabled by default in Druid 0.13, #6368.
As called out in the original discussion #5859, there are several drawbacks to persisting task action audit logs.
- Only usage of the task audit logs is to serve the API `/indexer/v1/task/{taskId}/segments`
which returns the list of segments created by a task.
- The use case is really narrow and no prod clusters really use this information.
- There can be better ways of obtaining this information, such as the metric
`segment/added/bytes` which reports both the segment ID and task ID
when a segment is committed by a task. We could also include committed segment IDs in task reports.
- A task persisting several segments would bloat up the audit logs table putting unnecessary strain
on metadata storage.
Changes:
- Remove `TaskAuditLogConfig`
- Remove method `TaskAction.isAudited()`. No task action is audited anymore.
- Remove `SegmentInsertAction` as it is not used anymore. `SegmentTransactionalInsertAction`
is the new incarnation which has been in use for a while.
- Deprecate `MetadataStorageActionHandler.addLog()` and `getLogs()`. These are not used anymore
but need to be retained for backward compatibility of extensions.
- Do not create `druid_taskLog` metadata table anymore.
Changes:
- Break `NewestSegmentFirstIterator` into two parts
- `DatasourceCompactibleSegmentIterator` - this contains all the code from `NewestSegmentFirstIterator`
but now handles a single datasource and allows a priority to be specified
- `PriorityBasedCompactionSegmentIterator` - contains separate iterator for each datasource and
combines the results into a single queue to be used by a compaction search policy
- Update `NewestSegmentFirstPolicy` to use the above new classes
- Cleanup `CompactionStatistics` and `AutoCompactionSnapshot`
- Cleanup `CompactSegments`
- Remove unused methods from `Tasks`
- Remove unneeded `TasksTest`
- Move tests from `NewestSegmentFirstIteratorTest` to `CompactionStatusTest`
and `DatasourceCompactibleSegmentIteratorTest`
Better fallback strategy when the broker is unable to materialize the subquery's results as frames for estimating the bytes:
a. We don't touch the subquery sequence till we know that we can materialize the result as frames
Description:
Compaction operations issued by the Coordinator currently run using the native query engine.
As majority of the advancements that we are making in batch ingestion are in MSQ, it is imperative
that we support compaction on MSQ to make Compaction more robust and possibly faster.
For instance, we have seen OOM errors in native compaction that MSQ could have handled by its
auto-calculation of tuning parameters.
This commit enables compaction on MSQ to remove the dependency on native engine.
Main changes:
* `DataSourceCompactionConfig` now has an additional field `engine` that can be one of
`[native, msq]` with `native` being the default.
* if engine is MSQ, `CompactSegments` duty assigns all available compaction task slots to the
launched `CompactionTask` to ensure full capacity is available to MSQ. This is to avoid stalling which
could happen in case a fraction of the tasks were allotted and they eventually fell short of the number
of tasks required by the MSQ engine to run the compaction.
* `ClientCompactionTaskQuery` has a new field `compactionRunner` with just one `engine` field.
* `CompactionTask` now has `CompactionRunner` interface instance with its implementations
`NativeCompactinRunner` and `MSQCompactionRunner` in the `druid-multi-stage-query` extension.
The objectmapper deserializes `ClientCompactionRunnerInfo` in `ClientCompactionTaskQuery` to the
`CompactionRunner` instance that is mapped to the specified type [`native`, `msq`].
* `CompactTask` uses the `CompactionRunner` instance it receives to create the indexing tasks.
* `CompactionTask` to `MSQControllerTask` conversion logic checks whether metrics are present in
the segment schema. If present, the task is created with a native group-by query; if not, the task is
issued with a scan query. The `storeCompactionState` flag is set in the context.
* Each created `MSQControllerTask` is launched in-place and its `TaskStatus` tracked to determine the
final status of the `CompactionTask`. The id of each of these tasks is the same as that of `CompactionTask`
since otherwise, the workers will be unable to determine the controller task's location for communication
(as they haven't been launched via the overlord).
Motivation:
- Improve code hygeiene
- Make `SegmentLoadDropHandler` easily extensible
Changes:
- Add `SegmentBootstrapper`
- Move code for bootstrapping segments already cached on disk and fetched from coordinator to
`SegmentBootstrapper`.
- No functional change
- Use separate executor service in `SegmentBootstrapper`
- Bind `SegmentBootstrapper` to `ManageLifecycle` explicitly in `CliBroker`, `CliHistorical` etc.
Previously, the segment granularity for tables in the catalog had to be defined in period format, ie `'PT1H'` , `'P1D'`, etc. This disallows a user from defining segment granularity of `'ALL'` for a table in the catalog, which may be a valid use case. This change makes it so that a user may define the segment granularity of a table in the catalog, as any string that results in a valid granularity using either the `Granularity.fromString(str)` method, or `new PeriodGranularity(new Period(value), null, null)`, and that granularity maps to a standard supported granularity, where `GranularityType.isStandard(granularity)` returns true. As a result a user may who wants to assign a catalog table's segment granularity to be hourly, may assign the segment granularity property of the table to be either `PT1H`, or `HOUR`. These are the same formats accepted at query time.
Changes:
- Rename `UsedSegmentChecker` to `PublishedSegmentsRetriever`
- Remove deprecated single `Interval` argument from `RetrieveUsedSegmentsAction`
as it is now unused and has been deprecated since #1988
- Return `Set` of segments instead of a `Collection` from `IndexerMetadataStorageCoordinator.retrieveUsedSegments()`
index_realtime tasks were removed from the documentation in #13107. Even
at that time, they weren't really documented per se— just mentioned. They
existed solely to support Tranquility, which is an obsolete ingestion
method that predates migration of Druid to ASF and is no longer being
maintained. Tranquility docs were also de-linked from the sidebars and
the other doc pages in #11134. Only a stub remains, so people with
links to the page can see that it's no longer recommended.
index_realtime_appenderator tasks existed in the code base, but were
never documented, nor as far as I am aware were they used for any purpose.
This patch removes both task types completely, as well as removes all
supporting code that was otherwise unused. It also updates the stub
doc for Tranquility to be firmer that it is not compatible. (Previously,
the stub doc said it wasn't recommended, and pointed out that it is
built against an ancient 0.9.2 version of Druid.)
ITUnionQueryTest has been migrated to the new integration tests framework and updated to use Kafka ingestion.
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
* Initial support for bootstrap segments.
- Adds a new API in the coordinator.
- All processes that have storage locations configured (including tasks)
talk to the coordinator if they can, and fetch bootstrap segments from it.
- Then load the segments onto the segment cache as part of startup.
- This addresses the segment bootstrapping logic required by processes before
they can start serving queries or ingesting.
This patch also lays the foundation to speed up upgrades.
* Fail open by default if there are any errors talking to the coordinator.
* Add test for failure scenario and cleanup logs.
* Cleanup and add debug log
* Assert the events so we know the list exactly.
* Revert RunRules test.
The rules aren't evaluated if there are no clusters.
* Revert RunRulesTest too.
* Remove debug info.
* Make the API POST and update log.
* Fix up UTs.
* Throw 503 from MetadataResource; clean up exception handling and DruidException.
* Remove unused logger, add verification of metrics and docs.
* Update error message
* Update server/src/main/java/org/apache/druid/server/coordination/SegmentLoadDropHandler.java
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Apply suggestions from code review
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Adjust test metric expectations with the rename.
* Add BootstrapSegmentResponse container in the response for future extensibility.
* Rename to BootstrapSegmentsInfo for internal consistency.
* Remove unused log.
* Use a member variable for broadcast segments instead of segmentAssigner.
* Minor cleanup
* Add test for loadable bootstrap segments and clarify comment.
* Review suggestions.
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Changes:
- Add new task action `RetrieveSegmentsByIdAction`
- Use new task action to retrieve segments irrespective of their visibility
- During rolling upgrades, this task action would fail as Overlord would be on old version
- If new action fails, fall back to just fetching used segments as before
* Fix flakey BrokerClientTest.
The testError() method reliably fails in the IDE. This is because the
the test runner has
<surefire.rerunFailingTestsCount>3</surefire.rerunFailingTestsCount> is set to 3, so maven
retries this "flaky test" multiple times and the test code returns a successful response
in the third attempt.
The exception handling in BrokerClientTest was broken:
- All non-2xx errors were being turned as 5xx errors. Remove that block of
code. If we need to handle retries of more specific 5xx error codes, that should be
hanlded explicitly. Or if there's a source of 4xx class error that needs to be 5xx,
fix that in the source of error.
* Fix CodeQL warning for unused parameter.
* Handle null values of centralized schema config in PartialMergeTask
* Fix checkstyle
* Do not pass centralized schema config from supervisor task to sub-tasks
* Do not pass ObjectMapper in constructor of task
* Fix logs
* Fix tests
* Router: Authorize permissionless internal requests.
Router-internal requests like /proxy/enabled and errors for invalid
requests should not require permissions, but they still need to be
authorized in order to satisfy the PreResponseAuthorizationCheckFilter.
This patch adds authorization checks that do not require any particular
permissions.
* Fix tests.
* Add interface method for returning canonical lookup name
* Address review comment
* Add test in LookupReferencesManagerTest for coverage check
* Add test in LookupSerdeModuleTest for coverage check
* Fix task bootstrap locations.
* Remove dependency of SegmentCacheManager from SegmentLoadDropHandler.
- The load drop handler code talks to the local cache manager via
SegmentManager.
* Clean up unused imports and stuff.
* Test fixes.
* Intellij inspections and test bind.
* Clean up dependencies some more
* Extract test load spec and factory to its own class.
* Cleanup test util
* Pull SegmentForTesting out to TestSegmentUtils.
* Fix up.
* Minor changes to infoDir
* Replace server announcer mock and verify that.
* Add tests.
* Update javadocs.
* Address review comments.
* Separate methods for download and bootstrap load
* Clean up return types and exception handling.
* No callback for loadSegment().
* Minor cleanup
* Pull out the test helpers into its own static class so it can have better state control.
* LocalCacheManager stuff
* Fix build.
* Fix build.
* Address some CI warnings.
* Minor updates to javadocs and test code.
* Address some CodeQL test warnings and checkstyle fix.
* Pass a Consumer<DataSegment> instead of boolean & rename variables.
* Small updates
* Remove one test constructor.
* Remove the other constructor that wasn't initializing fully and update usages.
* Cleanup withInfoDir() builder and unnecessary test hooks.
* Remove mocks and elaborate on comments.
* Commentary
* Fix a few Intellij inspection warnings.
* Suppress corePoolSize intellij-inspect warning.
The intellij-inspect tool doesn't seem to correctly inspect
lambda usages. See ScheduledExecutors.
* Update docs and add more tests.
* Use hamcrest for asserting order on expectation.
* Shutdown bootstrap exec.
* Fix checkstyle
* fix issue with auto column grouping
changes:
* fixes bug where AutoTypeColumnIndexer reports incorrect cardinality, allowing it to incorrectly use array grouper algorithm for realtime queries producing incorrect results for strings
* fixes bug where auto LONG and DOUBLE type columns incorrectly report not having null values, resulting in incorrect null handling when grouping
* fix test
This PR updates CompactionTask to not load any lookups by default, unless transformSpec is present.
If transformSpec is present, we will make the decision based on context values, loading all lookups by default. This is done to ensure backward compatibility since transformSpec can reference lookups.
If transform spec is not present and no context value is passed, we donot load any lookup.
This behavior can be overridden by supplying lookupLoadingMode and lookupsToLoad in the task context.
Changes:
- Remove deprecated `markAsUnused` parameter from `KillUnusedSegmentsTask`
- Allow `kill` task to use `REPLACE` lock when `useConcurrentLocks` is true
- Use `EXCLUSIVE` lock by default
This parameter has been removed for awhile now as of Druid 0.23.0
https://github.com/apache/druid/pull/12187.
The code was only used in tests to verify that serialization works.
Now remove all references to avoid any confusion.
* * add another catalog clustering columns unit test
* * dissallow clusterKeys with descending order
* * make more clear that clustering is re-written into ingest node
whether a catalog table or not
* * when partitionedBy is stored in catalog, user shouldnt need to specify
it in order to specify clustering
* * fix intellij inspection failure
Changes:
- Rename `DataSegmentChangeRequestAndStatus` to `DataSegmentChangeResponse`
- Rename `SegmentLoadDropHandler.Status` to `SegmentChangeStatus`
- Remove method `CoordinatorRunStats.getSnapshotAndReset()` as it was used only in
load queue peon implementations. Using an atomic reference is much simpler.
- Remove `ServerTestHelper.MAPPER`. Use existing `TestHelper.makeJsonMapper()` instead.
* MSQ controller: Support in-memory shuffles; towards JVM reuse.
This patch contains two controller changes that make progress towards a
lower-latency MSQ.
First, support for in-memory shuffles. The main feature of in-memory shuffles,
as far as the controller is concerned, is that they are not fully buffered. That
means that whenever a producer stage uses in-memory output, its consumer must run
concurrently. The controller determines which stages run concurrently, and when
they start and stop.
"Leapfrogging" allows any chain of sort-based stages to use in-memory shuffles
even if we can only run two stages at once. For example, in a linear chain of
stages 0 -> 1 -> 2 where all do sort-based shuffles, we can use in-memory shuffling
for each one while only running two at once. (When stage 1 is done reading input
and about to start writing its output, we can stop 0 and start 2.)
1) New OutputChannelMode enum attached to WorkOrders that tells workers
whether stage output should be in memory (MEMORY), or use local or durable
storage.
2) New logic in the ControllerQueryKernel to determine which stages can use
in-memory shuffling (ControllerUtils#computeStageGroups) and to launch them
at the appropriate time (ControllerQueryKernel#createNewKernels).
3) New "doneReadingInput" method on Controller (passed down to the stage kernels)
which allows stages to transition to POST_READING even if they are not
gathering statistics. This is important because it enables "leapfrogging"
for HASH_LOCAL_SORT shuffles, and for GLOBAL_SORT shuffles with 1 partition.
4) Moved result-reading from ControllerContext#writeReports to new QueryListener
interface, which ControllerImpl feeds results to row-by-row while the query
is still running. Important so we can read query results from the final
stage using an in-memory channel.
5) New class ControllerQueryKernelConfig holds configs that control kernel
behavior (such as whether to pipeline, maximum number of concurrent stages,
etc). Generated by the ControllerContext.
Second, a refactor towards running workers in persistent JVMs that are able to
cache data across queries. This is helpful because I believe we'll want to reuse
JVMs and cached data for latency reasons.
1) Move creation of WorkerManager and TableInputSpecSlicer to the
ControllerContext, rather than ControllerImpl. This allows managing workers and
work assignment differently when JVMs are reusable.
2) Lift the Controller Jersey resource out from ControllerChatHandler to a
reusable resource.
3) Move memory introspection to a MemoryIntrospector interface, and introduce
ControllerMemoryParameters that uses it. This makes it easier to run MSQ in
process types other than Indexer and Peon.
Both of these areas will have follow-ups that make similar changes on the
worker side.
* Address static checks.
* Address static checks.
* Fixes.
* Report writer tests.
* Adjustments.
* Fix reports.
* Review updates.
* Adjust name.
* Small changes.
This PR fixes the first and last vector aggregators and improves their readability. Following changes are introduced
The folding is broken in the vectorized versions. We consider time before checking the folded object.
If the numerical aggregator gets passed any other object type for some other reason (like String), then the aggregator considers it to be folded, even though it shouldn’t be. We should convert these objects to the desired type, and aggregate them properly.
The aggregators must properly use generics. This would minimize the ClassCastException issues that can happen with mixed segment types. We are unifying the string first/last aggregators with numeric versions as well.
The aggregators must aggregate null values (https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java#L55-L56 ). The aggregator should only ignore pairs with time == null, and not value == null
Time nullity is ignored when trying to vectorize the data.
String versions initialized with DateTimes.MIN that is equal to Long.MIN / 2. This can cause incorrect results in case the user enters a custom time column. NOTE: This is still present because it would require a larger refactor in all of the versions.
There is a difference in what users might expect from the results because the code flow is changed (for example, the direction of the for loops, etc), however, this will only change the results, and not the contract set by first/last aggregators, which is that if multiple values have the same timestamp, then any of them can get picked.
If the column is non-existent, the users might expect a change in the timestamp from DateTime.MAX to Long.MAX, because the code incorrectly used DateTime.MAX to initialize the aggregator, however, in case of a custom timestamp column, this might not be the case. The SQL query might be prohibited from using any Long since it requires a cast to the timestamp function that can fail, but AFAICT native queries don't have such limitations.
Changes:
- Add new config `lagAggregate` to `LagBasedAutoScalerConfig`
- Add field `aggregateForScaling` to `LagStats`
- Use the new field/config to determine which aggregate to use to compute lag
- Remove method `Supervisor.computeLagForAutoScaler()`
Changes:
- Add `LookupLoadingSpec` to support 3 modes of lookup loading: ALL, NONE, ONLY_REQUIRED
- Add method `Task.getLookupLoadingSpec()`
- Do not load any lookups for `KillUnusedSegmentsTask`
Changes:
1) Check for handoff of upgraded realtime segments.
2) Drop sink only when all associated realtime segments have been abandoned.
3) Delete pending segments upon commit to prevent unnecessary upgrades and
partition space exhaustion when a concurrent replace happens. This also prevents
potential data duplication.
4) Register pending segment upgrade only on those tasks to which the segment is associated.
Issue: #14989
The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.
This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
Tries to address the comments made on #16284 after merged.
Changes:
- Remove method `Supervisor.getLagMetric()`
- Add method `Supervisor.computeLagForAutoScaler()`
- Remove classes `LagMetric` and `LagMetricTest`
Changes:
- Add column `task_allocator_id` to `pendingSegments` metadata table.
- Add column `upgraded_from_segment_id` to `pendingSegments` metadata table.
- Add interface `PendingSegmentAllocatingTask` and implement it by all tasks which
can allocate pending segments.
- Use `taskAllocatorId` to identify the task (and its sub-tasks or replicas) to which
a pending segment has been allocated.
- Perform active cleanup of pending segments in `TaskLockbox` once there are no
active tasks for the corresponding task allocator id.
- When committing APPEND segments, also commit all upgraded pending segments
corresponding to that task allocator id.
- When committing REPLACE segments, upgrade all overlapping pending segments in
the same transaction.
Bug:
#15724 introduced a bug where a rolling upgrade would cause all task locations
returned by the Overlord on an older version to be unknown.
Fix:
If the new API fails, fall back to single task status API which always returns a valid task location.
Follow up to #16217
Changes:
- Update `OverlordClient.getReportAsMap()` to return `TaskReport.ReportMap`
- Move the following classes to `org.apache.druid.indexer.report` in the `druid-processing` module
- `TaskReport`
- `KillTaskReport`
- `IngestionStatsAndErrorsTaskReport`
- `TaskContextReport`
- `TaskReportFileWriter`
- `SingleFileTaskReportFileWriter`
- `TaskReportSerdeTest`
- Remove `MsqOverlordResourceTestClient` as it had only one method
which is already present in `OverlordResourceTestClient` itself
The default value for druid.coordinator.kill.period (if unspecified) has changed from P1D to the value of druid.coordinator.period.indexingPeriod. Operators can choose to override druid.coordinator.kill.period and that will take precedence over the default behavior.
The default value for the coordinator dynamic config killTaskSlotRatio is updated from 1.0 to 0.1. This ensures that that kill tasks take up only 1 task slot right out-of-the-box instead of taking up all the task slots.
* Remove stale comment and inline canDutyRun()
* druid.coordinator.kill.period defaults to druid.coordinator.period.indexingPeriod if not set.
- Remove the default P1D value for druid.coordinator.kill.period. Instead default
druid.coordinator.kill.period to whatever value druid.coordinator.period.indexingPeriod is set
to if the former config isn't specified.
- If druid.coordinator.kill.period is set, the value will take precedence over
druid.coordinator.period.indexingPeriod
* Update server/src/test/java/org/apache/druid/server/coordinator/DruidCoordinatorConfigTest.java
* Fix checkstyle error
* Clarify comment
* Update server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinatorConfig.java
* Put back canDutyRun()
* Default killTaskSlotsRatio to 0.1 instead of 1.0 (all slots)
* Fix typo DEFAULT_MAX_COMPACTION_TASK_SLOTS
* Remove unused test method.
* Update default value of killTaskSlotsRatio in docs and web-console default mock
* Move initDuty() after params and config setup.
* Update error message when topic messages.
Suggest resetting the supervisor when the topic changes instead of changing
the supervisor name which is actually making a new supervisor.
* Update server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Cleanup
* Remove log and include oldCommitMetadataFromDb
* Fix test
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Changes:
- Handle exceptions in the API and map them to a `Response` object with the appropriate error code.
- Replace `AuthorizationUtils.filterAuthorizedResources()` with `DatasourceResourceFilter`.
The endpoint is annotated consistent with other usages.
- Update `DatasourceResourceFilter` to remove the lambda and update javadocs.
The usages information is self-evident with an IDE.
- Adjust the invalid interval exception message.
- Break up the large unit test `testGetUnusedSegmentsInDataSource()` into smaller unit tests
for each test case. Also, validate the error codes.
* Differentiate null and empty lists of segment IDs and versions.
Treat them differently so the. Segment IDs and versions can be An empty list,
in which case, the queries should just not return anything. Versions are optional, so
they can be null, which just indicates nothing, so the queries should return segments with
all possible versions. Segment IDs cannot be null as indicated by the absence of @Nullable
annotation.
* Update javadocs and add empty versions test to kill task.
* Add test for RetrieveSegmentsActions as well.
* Add parameterized segment IDs.
* Refactor into one common method.
* Refactor getConditionForIntervalsAndMatchMode - pass in only what's needed.
* Minor cleanup.