Changes:
- Determine the default value of balancerComputeThreads based on number of
coordinator cpus rather than number of segments. Even if the number of segments
is low and we create more balancer threads, it doesn't hurt the system as threads
would mostly be idle.
- Remove unused field from SegmentLoadQueueManager
Expected values:
- Clusters with ~1M segments typically work with Coordinators having 16 cores or more.
This would give us 8 balancer threads, which is the same as the current maximum.
- On small clusters, even a single thread is enough to do the required balancing work.
### Description
This change enables the `KillUnusedSegments` coordinator duty to be scheduled continuously. Things that prevented this, or made this difficult before were the following:
1. If scheduled at fast enough rate, the duty would find the same intervals to kill for the same datasources, while kill tasks submitted for those same datasources and intervals were already underway, thus wasting task slots on duplicated work.
2. The task resources used by auto kill were previously unbounded. Each duty run period, if unused
segments were found for any datasource, a kill task would be submitted to kill them.
This pr solves for both of these issues:
1. The duty keeps track of the end time of the last interval found when killing unused segments for each datasource, in a in memory map. The end time for each datasource, if found, is used as the start time lower bound, when searching for unused intervals for that same datasource. Each duty run, we remove any datasource keys from this map that are no longer found to match datasources in the system, or in whitelist, and also remove a datasource entry, if there is found to be no unused segments for the datasource, which happens when we fail to find an interval which includes unused segments. Removing the datasource entry from the map, allows for searching for unusedSegments in the datasource from the beginning of time once again
2. The unbounded task resource usage can be mitigated with coordinator dynamic config added as part of ba957a9b97
Operators can configure continous auto kill by providing coordinator runtime properties similar to the following:
```
druid.coordinator.period.indexingPeriod=PT60S
druid.coordinator.kill.period=PT60S
```
And providing sensible limits to the killTask usage via coordinator dynamic properties.
There is a current issue due to inconsistent metadata between worker and controller in MSQ. A controller can receive one set of segments, which are then marked as unused by, say, a compaction job. The worker would be unable to get the segment information as MetadataResource.
Motivation:
- Clean up `DruidCoordinator` and move methods to classes where they are most relevant
Changes:
- No functional change
- Add duty `PrepareBalancerAndLoadQueues` to replace `UpdateCoordinatorState`
- Move map of `LoadQueuePeon` from `DruidCoordinator` to `LoadQueueTaskMaster`
- Make `BalancerStrategyFactory` an abstract class and keep the balancer executor here
- Move reporting of used segment stats and historical capacity stats from
`CollectSegmentAndServerStats` to `PrepareBalancerAndLoadQueues`
- Move reporting of unavailable and under-replicated segment stats from
`CollectSegmentAndServerStats` to `UpdateReplicationStatus` duty
Currently, Druid is using Guava 16.0.1 version. This upgrade to 31.1-jre fixes the following issues.
CVE-2018-10237 (Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows remote attackers to conduct denial of service attacks against servers that depend on this library and deserialize attacker-provided data because the AtomicDoubleArray class (when serialized with Java serialization) and the CompoundOrdering class (when serialized with GWT serialization) perform eager allocation without appropriate checks on what a client has sent and whether the data size is reasonable). We don't use Java or GWT serializations. Despite being false positive they're causing red security scans on Druid distribution.
Latest version of google-client-api is incompatible with the existing Guava version. This PR unblocks Update google client apis to latest version #14414
Motivation:
- There is no usage of the `SegmentTransactionInsertAction` which passes a
non-null non-empty value of `segmentsToBeDropped`.
- This is not really needed either as overshadowed segments are marked as unused
by the Coordinator and need not be done in the same transaction as committing segments.
- It will also help simplify the changes being made in #14407
Changes:
- Remove `segmentsToBeDropped` from the task action and all intermediate methods
- Remove related tests which are not needed anymore
Changes:
- Move logic of `NewestSegmentFirstIterator.needsCompaction` to `CompactionStatus`
to improve testability and readability
- Capture the list of checks performed to determine if compaction is needed in a readable
manner in `CompactionStatus.CHECKS`
- Make `CompactionSegmentIterator` iterate over instances of `SegmentsToCompact`
instead of `List<DataSegment>`. This allows use of the `umbrellaInterval` later.
- Replace usages of `QueueEntry` with `SegmentsToCompact`
- Move `SegmentsToCompact` out of `NewestSegmentFirstIterator`
- Simplify `CompactionStatistics`
- Reduce level of less important logs to debug
- No change made to tests to ensure correctness
Changes:
- Use separate executor for every duty group
- This change is thread-safe as every duty group uses its own copy of
`DruidCoordinatorRuntimeParams` and does not share any other mutable instances
with other duty groups.
- With the exception of `HistoricalManagementDuties`, duty groups are typically not
very compute intensive and mostly perform database or HTTP I/O. So, coordinator
resources would still mostly be available for `HistoricalManagementDuties`.
Follow up changes to #12599
Changes:
- Rename column `used_flag_last_updated` to `used_status_last_updated`
- Remove new CLI tool `UpdateTables`.
- We already have a `CreateTables` with similar functionality, which should be able to
handle update cases too.
- Any user running the cluster for the first time should either just have `connector.createTables`
enabled or run `CreateTables` which should create tables at the latest version.
- For instance, the `UpdateTables` tool would be inadequate when a new metadata table has
been added to Druid, and users would have to run `CreateTables` anyway.
- Remove `upgrade-prep.md` and include that info in `metadata-init.md`.
- Fix log messages to adhere to Druid style
- Use lambdas
* Add new configurable buffer period to create gap between mark unused and kill of segment
* Changes after testing
* fixes and improvements
* changes after initial self review
* self review changes
* update sql statement that was lacking last_used
* shore up some code in SqlMetadataConnector after self review
* fix derby compatibility and improve testing/docs
* fix checkstyle violations
* Fixes post merge with master
* add some unit tests to improve coverage
* ignore test coverage on new UpdateTools cli tool
* another attempt to ignore UpdateTables in coverage check
* change column name to used_flag_last_updated
* fix a method signature after column name switch
* update docs spelling
* Update spelling dictionary
* Fixing up docs/spelling and integrating altering tasks table with my alteration code
* Update NULL values for used_flag_last_updated in the background
* Remove logic to allow segs with null used_flag_last_updated to be killed regardless of bufferPeriod
* remove unneeded things now that the new column is automatically updated
* Test new background row updater method
* fix broken tests
* fix create table statement
* cleanup DDL formatting
* Revert adding columns to entry table by default
* fix compilation issues after merge with master
* discovered and fixed metastore inserts that were breaking integration tests
* fixup forgotten insert by using pattern of sharing now timestamp across columns
* fix issue introduced by merge
* fixup after merge with master
* add some directions to docs in the case of segment table validation issues
* Add supervisor /resetOffsets API.
- Add a new endpoint /druid/indexer/v1/supervisor/<supervisorId>/resetOffsets
which accepts DataSourceMetadata as a body parameter.
- Update logs, unit tests and docs.
* Add a new interface method for backwards compatibility.
* Rename
* Adjust tests and javadocs.
* Use CoreInjectorBuilder instead of deprecated makeInjectorWithModules
* UT fix
* Doc updates.
* remove extraneous debugging logs.
* Remove the boolean setting; only ResetHandle() and resetInternal()
* Relax constraints and add a new ResetOffsetsNotice; cleanup old logic.
* A separate ResetOffsetsNotice and some cleanup.
* Minor cleanup
* Add a check & test to verify that sequence numbers are only of type SeekableStreamEndSequenceNumbers
* Add unit tests for the no op implementations for test coverage
* CodeQL fix
* checkstyle from merge conflict
* Doc changes
* DOCUSAURUS code tabs fix. Thanks, Brian!
Changes:
- No change in behaviour if `smartSegmentLoading` is disabled
- If `smartSegmentLoading` is enabled
- Compute `balancerComputeThreads` based on `numUsedSegments`
- Compute `maxSegmentsToMove` based on `balancerComputeThreads`
- Compute `segmentsToMoveToFixSkew` based on usage skew
- Compute `segmentsToMove = Math.min(maxSegmentsToMove, segmentsToMoveToFixSkew)`
Limits:
- 1 <= `balancerComputeThreads` <= 8
- `maxSegmentsToMove` <= 20% of total segments
- `minSegmentsToMove` = 0.15% of total segments
`cachingCost` has been deprecated in #14484 and is not advised to be used in
production clusters as it may cause usage skew across historicals which the
coordinator is unable to rectify. This PR completely disables `cachingCost` strategy
as it has now been rendered redundant due to recent performance improvements
made to `cost` strategy.
Changes
- Disable `cachingCost` strategy
- Add `DisabledCachingCostBalancerStrategyFactory` for the time being so that we
can give a proper error message before falling back to `CostBalancerStrategy`. This
will be removed in subsequent releases.
- Retain `CachingCostBalancerStrategy` for testing/benchmarking purposes.
- Add javadocs to `DiskNormalizedCostBalancerStrategy`
### Description
Added the following metrics, which are calculated from the `KillUnusedSegments` coordinatorDuty
`"killTask/availableSlot/count"`: calculates the number remaining task slots available for auto kill
`"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill
`"killTask/task/count"`: calculates the number of tasks submitted by auto kill.
#### Release note
NEW: metrics added for auto kill
`"killTask/availableSlot/count"`: calculates the number remaining task slots available for auto kill
`"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill
`"killTask/task/count"`: calculates the number of tasks submitted by auto kill.
The current version of jackson-databind is flagged for vulnerabilities CVE-2020-28491 (Although cbor format is not used in druid), CVE-2020-36518 (Seems genuine as deeply nested json in can cause resource exhaustion). Updating the dependency to the latest version 2.12.7 to fix these vulnerabilities.
### Description
Previously, the `KillUnusedSegments` coordinator duty, in charge of periodically deleting unused segments, could spawn an unlimited number of kill tasks for unused segments. This change adds 2 new coordinator dynamic configs that can be used to control the limit of tasks spawned by this coordinator duty
`killTaskSlotRatio`: Ratio of total available task slots, including autoscaling if applicable that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is 1, which allows all available tasks to be used, which is the existing behavior
`maxKillTaskSlots`: Maximum number of tasks that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is INT.MAX, which essentially allows for unbounded number of tasks, which is the existing behavior.
Realize that we can effectively get away with just the one `killTaskSlotRatio`, but following similarly to the compaction config, which has similar properties; I thought it was good to have some control of the upper limit regardless of ratio provided.
#### Release note
NEW: `killTaskSlotRatio` and `maxKillTaskSlots` coordinator dynamic config properties added that allow control of task resource usage spawned by `KillUnusedSegments` coordinator task (auto kill)
### Description
Previously, the `maxSegments` configured for auto kill could be ignored if an interval of data for a given datasource had more than this number of unused segments, causing the kill task spawned with the task of deleting unused segments in that given interval of data to delete more than the `maxSegments` configured. Now each kill task spawned by the auto kill coordinator duty, will kill at most `limit` segments. This is done by adding a new config property to the `KillUnusedSegmentTask` which allows users to specify this limit.
* Minimize PostAggregator computations
Since a change back in 2014, the topN query has been computing
all PostAggregators on all intermediate responses from leaf nodes
to brokers. This generates significant slow downs for queries
with relatively expensive PostAggregators. This change rewrites
the query that is pushed down to only have the minimal set of
PostAggregators such that it is impossible for downstream
processing to do too much work. The final PostAggregators are
applied at the very end.
Changes:
- Add interface `SegmentDeleteHandler` for marking segments as unused
- In `StrategicSegmentAssigner`, collect all segments on which a drop rule applies in a list
- Process the list above as a batch delete rather than individual deletes
- Improve alert messages when an invalid tier is specified in a load rule
- Improve alert message when no rule applies on a segment
split KillUnusedSegmentsTask to smaller batches
Processing in smaller chunks allows the task execution to yield the TaskLockbox lock,
which allows the overlord to continue being responsive to other tasks and users while
this particular kill task is executing.
* introduce KillUnusedSegmentsTask batchSize parameter to control size of batching
* provide an explanation for kill task batchSize parameter
* add logging details for kill batch progress
Changes
- Rename `LoadQueuePeonTester` to `TestLoadQueuePeon`
- Simplify `TestLoadQueuePeon` by removing dependency on `CuratorLoadQueuePeon`
- Remove usages of mock peons in `LoadRuleTest` and use `TestLoadQueuePeon` instead
* allow for batched delete of segments instead of deleting segment data one by one
create new batchdelete method in datasegment killer that has default functionality
of iterating through all segments and calling delete on them. This will enable
a slow rollout of other deepstorage implementations to move to a batched delete
on their own time
* cleanup batchdelete segments
* batch delete with the omni data deleter
cleaned up code
just need to add tests and docs for this functionality
* update java doc to explain how it will try to use batch if function is overwritten
* rename killBatch to kill
add unit tests
* add omniDataSegmentKillerTest for deleting multiple segments at a time. fix checkstyle
* explain test peculiarity better
* clean up batch kill in s3.
* remove unused return value. cleanup comments and fix checkstyle
* default to batch delete. more specific java docs. list segments that couldn't be deleted
if there was a client error or server error
* simplify error handling
* add tests where an exception is thrown when killing multiple s3 segments
* add test for failing to delete two calls with the s3 client
* fix javadoc for kill(List<DataSegment> segments) clean up tests remove feature flag
* fix typo in javadocs
* fix test failure
* fix checkstyle and improve tests
* fix intellij inspections issues
* address comments, make delete multiple segments not assume same bucket
* fix test errors
* better grammar and punctuation. fix test. and better logging for exception
* remove unused code
* avoid extra arraylist instantiation
* fix broken test
* fix broken test
* fix tests to use assert.throws
* Merge core CoordinatorClient with MSQ CoordinatorServiceClient.
Continuing the work from #12696, this patch merges the MSQ
CoordinatorServiceClient into the core CoordinatorClient, yielding a single
interface that serves both needs and is based on the ServiceClient RPC
system rather than DruidLeaderClient.
Also removes the backwards-compatibility code for the handoff API in
CoordinatorBasedSegmentHandoffNotifier, because the new API was added
in 0.14.0. That's long enough ago that we don't need backwards
compatibility for rolling updates.
* Fixups.
* Trigger GHA.
* Remove unnecessary retrying in DruidInputSource. Add "about an hour"
retry policy and h
* EasyMock
* Use OverlordClient for all Overlord RPCs.
Continuing the work from #12696, this patch removes HttpIndexingServiceClient
and the IndexingService flavor of DruidLeaderClient completely. All remaining
usages are migrated to OverlordClient.
Supporting changes include:
1) Add a variety of methods to OverlordClient.
2) Update MetadataTaskStorage to skip the complete-task lookup when
the caller requests zero completed tasks. This helps performance of
the "get active tasks" APIs, which don't want to see complete ones.
* Use less forbidden APIs.
* Fixes from CI.
* Add test coverage.
* Two more tests.
* Fix test.
* Updates from CR.
* Remove unthrown exceptions.
* Refactor to improve testability and test coverage.
* Add isNil tests.
* Remove unnecessary "deserialize" methods.
Related to #14634
Changes:
- Update `IndexerSQLMetadataStorageCoordinator.deleteSegments` to use
JDBI PreparedBatch instead of issuing single DELETE statements
This PR uses the QoSFilter available in Jetty to park the query requests that exceed a configured limit. This is done so that other HTTP requests such as health check calls do not get blocked if the query server is busy serving long-running queries. The same mechanism can also be used in the future to isolate interactive queries from long-running select queries from interactive queries within the same broker.
Right now, you can still get that isolation by setting druid.query.scheduler.numThreads to a value lowe than druid.server.http.numThreads. That enables total laning but the side effect is that excess requests are not queued and rejected outright that leads to a bad user experience.
Parked requests are timed out after 30 seconds by default. I overrode that to the maxQueryTimeout in this PR.
changes:
* new filters that preserve match value typing to better handle filtering different column types
* sql planner uses new filters by default in sql compatible null handling mode
* remove isFilterable from column capabilities
* proper handling of array filtering, add array processor to column processors
* javadoc for sql test filter functions
* range filter support for arrays, tons more tests, fixes
* add dimension selector tests for mixed type roots
* support json equality
* rename semantic index maker thingys to mostly have plural names since they typically make many indexes, e.g. StringValueSetIndex -> StringValueSetIndexes
* add cooler equality index maker, ValueIndexes
* fix missing string utf8 index supplier
* expression array comparator stuff
Tests to verify the following behaviour have been added:
- Segments from more populous servers are more likely to be picked irrespective of
sample size.
- Segments from all servers are equally likely to be picked if all servers have equivalent
number of segments.
* Change default handoffConditionTimeout to 15 minutes.
Most of the time, when handoff is taking this long, it's because something
is preventing Historicals from loading new data. In this case, we have
two choices:
1) Stop making progress on ingestion, wait for Historicals to load stuff,
and keep the waiting-for-handoff segments available on realtime tasks.
(handoffConditionTimeout = 0, the current default)
2) Continue making progress on ingestion, by exiting the realtime tasks
that were waiting for handoff. Once the Historicals get their act
together, the segments will be loaded, as they are still there on
deep storage. They will just not be continuously available.
(handoffConditionTimeout > 0)
I believe most users would prefer [2], because [1] risks ingestion falling
behind the stream, which causes many other problems. It can cause data loss
if the stream ages-out data before we have a chance to ingest it.
Due to the way tuningConfigs are serialized -- defaults are baked into the
serialized form that is written to the database -- this default change will
not change anyone's existing supervisors. It will take effect for newly
created supervisors.
* Fix tests.
* Update docs/development/extensions-core/kafka-supervisor-reference.md
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Update docs/development/extensions-core/kinesis-ingestion.md
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
---------
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Add aggregatorMergeStrategy property to SegmentMetadaQuery.
- Adds a new property aggregatorMergeStrategy to segmentMetadata query.
aggregatorMergeStrategy currently supports three types of merge strategies -
the legacy strict and lenient strategies, and the new latest strategy.
- The latest strategy considers the latest aggregator from the latest segment
by time order when there's a conflict when merging aggregators from different
segments.
- Deprecate lenientAggregatorMerge property; The API validates that both the new
and old properties are not set, and returns an exception.
- When merging segments as part of segmentMetadata query, the segments have a more
elaborate id -- <datasource>_<interval>_merged_<partition_number> format, similar to
the name format that segments usually contain. Previously it was simply "merged".
- Adjust unit tests to test the latest strategy, to assert the returned complete
SegmentAnalysis object instead of just the aggregators for completeness.
* Don't explicitly set strict strategy in tests
* Apply suggestions from code review
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Update docs/querying/segmentmetadataquery.md
* Apply suggestions from code review
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
---------
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>