Commit Graph

12988 Commits

Author SHA1 Message Date
Vadim Ogievetsky 6425ef4c3c
Web console: fix service view filtering and other bugs (#14597)
* fix service view filter

* MSQ choose best timeformat also
2023-07-17 13:57:37 -07:00
Benedict Jin 0e3ce0a7f7
Improve shields with suitable logos (#14575) 2023-07-17 19:05:54 +05:30
Kashif Faraz ab051d9c5e
Add test for ReservoirSegmentSampler (#14591)
Tests to verify the following behaviour have been added:
- Segments from more populous servers are more likely to be picked irrespective of
sample size.
- Segments from all servers are equally likely to be picked if all servers have equivalent
number of segments.
2023-07-17 18:50:02 +05:30
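The behaviour verified by the tests above follows from how reservoir sampling works in general. The following is a minimal sketch of the classic technique (Algorithm R), not the actual `ReservoirSegmentSampler` code; the class and method names are hypothetical. If the segments of all servers are fed through one stream, every segment is equally likely to be picked, so a server holding more segments contributes more picks on average.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of reservoir sampling; not Druid's implementation.
final class ReservoirSketch {
    /** Picks up to k items uniformly at random from a stream of unknown length. */
    static <T> List<T> sample(Iterator<T> items, int k, Random random) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        while (items.hasNext()) {
            T item = items.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen by replacing a random slot.
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }
}
```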
Abhishek Radhakrishnan 1f6507dd60
Remove the deprecated `InsertCannotOrderByDescending` MSQ fault (#14588)
The deprecated MSQ fault, InsertCannotOrderByDescending, is removed.
2023-07-17 09:23:39 +00:00
Vadim Ogievetsky d5f6749aa3
Web console: catchup to all the backend changes (#14540)
This PR catches the console up to all the backend changes for Druid 27

Specifically:

Add page information to SqlStatementResource API #14512
Allow empty tiered replicants map for load rules #14432
Adding Interactive API's for MSQ engine #14416
Add replication factor column to sys table #14403
Account for data format and compression in MSQ auto taskAssignment #14307
Errors take 3 #14004
2023-07-17 11:26:46 +05:30
YongGang 214f7c3f65
Expose leader dimension in service/heartbeat metric into statsd-reporter (#14593) 2023-07-17 10:38:24 +05:30
Gian Merlino 95ca43034f
Change default handoffConditionTimeout to 15 minutes. (#14539)
* Change default handoffConditionTimeout to 15 minutes.

Most of the time, when handoff is taking this long, it's because something
is preventing Historicals from loading new data. In this case, we have
two choices:

1) Stop making progress on ingestion, wait for Historicals to load stuff,
   and keep the waiting-for-handoff segments available on realtime tasks.
   (handoffConditionTimeout = 0, the current default)

2) Continue making progress on ingestion, by exiting the realtime tasks
   that were waiting for handoff. Once the Historicals get their act
   together, the segments will be loaded, as they are still there on
   deep storage. They will just not be continuously available.
   (handoffConditionTimeout > 0)

I believe most users would prefer [2], because [1] risks ingestion falling
behind the stream, which causes many other problems. It can cause data loss
if the stream ages-out data before we have a chance to ingest it.

Due to the way tuningConfigs are serialized -- defaults are baked into the
serialized form that is written to the database -- this default change will
not change anyone's existing supervisors. It will take effect for newly
created supervisors.

* Fix tests.

* Update docs/development/extensions-core/kafka-supervisor-reference.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/kinesis-ingestion.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2023-07-13 13:17:14 -07:00
Laksh Singla c1c7dff2ad
Using DruidExceptions in MSQ (changes related to the Broker) (#14534)
MSQ engine returns correct error codes for invalid user inputs in the query context. Also, using DruidExceptions for MSQ related errors happening in the Broker with improved error messages.
2023-07-13 19:08:49 +00:00
zachjsh 589aac8b31
Make errorCode of InsertTimeOutOfBoundsFault consistent with others (#14495)
The errorCode of this fault when serialized over the wire was being
set to the name of the class `InsertTimeOutOfBoundsFault` instead of
the CODE `InsertTimeOutOfBounds`. All other faults' errorCodes are
serialized as the respective Fault's code, so making consistent here
as well.
2023-07-13 14:34:21 -04:00
Abhishek Radhakrishnan f4ee58eaa8
Add `aggregatorMergeStrategy` property in SegmentMetadata queries (#14560)
* Add aggregatorMergeStrategy property to SegmentMetadataQuery.

- Adds a new property aggregatorMergeStrategy to segmentMetadata query.
aggregatorMergeStrategy currently supports three types of merge strategies -
the legacy strict and lenient strategies, and the new latest strategy.
- The latest strategy considers the latest aggregator from the latest segment
by time order when there's a conflict when merging aggregators from different
segments.
- Deprecate lenientAggregatorMerge property; The API validates that both the new
and old properties are not set, and returns an exception.
- When merging segments as part of segmentMetadata query, the segments have a more
elaborate id -- <datasource>_<interval>_merged_<partition_number> format, similar to
the name format that segments usually contain. Previously it was simply "merged".
- Adjust unit tests to test the latest strategy, to assert the returned complete
SegmentAnalysis object instead of just the aggregators for completeness.

* Don't explicitly set strict strategy in tests

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/segmentmetadataquery.md

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2023-07-13 12:37:36 -04:00
Gian Merlino 450ecd6370
More efficient generation of ImmutableWorkerHolder from WorkerHolder. (#14546)
* More efficient generation of ImmutableWorkerHolder from WorkerHolder.

Taking the work done in #12096 a little further:

1) Applying a similar optimization to WorkerHolder (HttpRemoteTaskRunner).
   The original patch only helped with the ZkWorker (RemoteTaskRunner).

2) Improve the ZkWorker version somewhat by avoiding multiple iterations
   through the task announcements map.

* Pick better names and use better logic.

* Only runnable tasks.

* Fix test.

* Fix testBlacklistZKWorkers50Percent.
2023-07-13 07:57:16 -07:00
imply-cheddar 7650a71d37
Add window query test files from Drill (#14561) 2023-07-12 20:14:39 -07:00
Sam Rash 0dcb19f7e3
Add Continuous Profiling to Unit Tests (#14506)
Uses a custom continuous JFR profiler.

Modifies the github actions for tests to do profiling only in the case
of jdk17, as the profiler requires jdk17+ to use the JFR streaming API
plus a few other language features in the code.

Continuous Profiling service is provided to the Apache Druid project
free of charge by Imply and any committer can request free access to
the UI.
2023-07-12 17:50:38 -07:00
imply-cheddar 65e1b27aa7
Fix a resource leak with Window processing (#14573)
* Fix a resource leak with Window processing

Additionally, in order to find the leak, there were
adjustments to the StupidPool to track leaks a bit better.
It would appear that the pool objects get GC'd during testing
for some reason which was causing some incorrect identification
of leaks from objects that had been returned but were GC'd along
with the pool.

* Suppress unused warning
2023-07-12 17:25:42 -05:00
Katya Macedo 12ce187ae4
Update slack text (#14578) 2023-07-12 12:08:48 -07:00
cristian-popa d21c54fb73
Cross reference backpressure info (#14508)
Co-authored-by: Jill Osborne <jill.osborne@imply.io>
Co-authored-by: Cristian Popa <cristian.popa@imply.io>
2023-07-12 10:02:04 -07:00
Karan Kumar 89aee6caaa
Fixing an issue in sequential merge (#14574)
* Fixing an issue in sequential merge where workers without any partial key statistics would get stuck because controller did not change the worker state.

* Removing empty check

* Adding IT for MSQ sequential bug fix.
2023-07-12 22:05:30 +05:30
Gian Merlino 3ff51487b7
Add ZooKeeper connection state alerts and metrics. (#14333)
* Add ZooKeeper connection state alerts and metrics.

- New metric "zk/connected" is an indicator showing 1 when connected,
  0 when disconnected.
- New metric "zk/disconnected/time" measures time spent disconnected.
- New alert when Curator connection state enters LOST or SUSPENDED.

* Use right GuardedBy.

* Test fixes, coverage.

* Adjustment.

* Fix tests.

* Fix ITs.

* Improved injection.

* Adjust metric name, add tests.
2023-07-12 09:34:28 -07:00
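As a rough illustration of the two metrics described in the commit above, here is a minimal sketch of a tracker that derives a 1/0 connected gauge and an accumulated disconnected time from connection state changes. It is not the Druid or Curator code: the `State` enum is a hypothetical stand-in for Curator's connection states, and the choice not to count time before the first connection is an assumption.

```java
// Hypothetical illustration; the enum stands in for Curator's connection states.
final class ZkConnectionTrackerSketch {
    enum State { CONNECTED, SUSPENDED, LOST, RECONNECTED }

    private boolean connected = false;
    private long disconnectedSinceMillis = -1;
    private long totalDisconnectedMillis = 0;

    /** Records a state change observed at the given wall-clock time. */
    synchronized void onStateChange(State newState, long nowMillis) {
        boolean nowConnected = newState == State.CONNECTED || newState == State.RECONNECTED;
        if (connected && !nowConnected) {
            // Connection dropped: start timing the outage.
            disconnectedSinceMillis = nowMillis;
        } else if (!connected && nowConnected && disconnectedSinceMillis >= 0) {
            // Connection restored: close out the outage.
            totalDisconnectedMillis += nowMillis - disconnectedSinceMillis;
            disconnectedSinceMillis = -1;
        }
        connected = nowConnected;
    }

    /** 1 when connected, 0 when disconnected. */
    synchronized int connectedGauge() {
        return connected ? 1 : 0;
    }

    /** Total time spent disconnected so far, including any outage still in progress. */
    synchronized long disconnectedMillis(long nowMillis) {
        long ongoing = (!connected && disconnectedSinceMillis >= 0)
            ? nowMillis - disconnectedSinceMillis
            : 0;
        return totalDisconnectedMillis + ongoing;
    }
}
```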
Gian Merlino 3711c0d987
Reduce heap footprint of GenericIndexed. (#14563)
Two changes:

1) Intern DecompressingByteBufferObjectStrategy. Saves ~32 bytes per column.

2) Split GenericIndexed into GenericIndexed.V1 and GenericIndexed.V2. The
   major benefit here is isolating out the ByteBuffers that are only needed
   for V2. This saves ~80 bytes for V1 (one buffer instead of two).
2023-07-12 08:11:41 -07:00
Gian Merlino cc8b210e4c
AggregatorFactory: Use guessAggregatorHeapFootprint when factorizeWithSize is not implemented. (#14567)
There are two ways of estimating heap footprint of an Aggregator:

1) AggregatorFactory#guessAggregatorHeapFootprint
2) AggregatorFactory#factorizeWithSize + Aggregator#aggregateWithSize

When the second path is used, the default implementation of factorizeWithSize
is now updated to delegate to guessAggregatorHeapFootprint, making these equivalent.
The old logic used getMaxIntermediateSize, which is less accurate.

Also fixes a bug where, when using the second path, calling factorizeWithSize
on PassthroughAggregatorFactory would fail because getMaxIntermediateSize was
not implemented. (There is no buffer aggregator, so there would be no need.)
2023-07-12 07:33:27 -07:00
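The delegation described in the commit above can be pictured with a small sketch. The types below are simplified, hypothetical stand-ins that only borrow the method names from the commit message; the real `AggregatorFactory` takes additional arguments, and the row count passed to the estimate here is an assumption.

```java
// Simplified, hypothetical stand-ins for the Druid types named in the commit message.
final class FactorizeWithSizeSketch {
    interface Aggregator {}

    /** Pairs an aggregator with its estimated initial heap size in bytes. */
    static final class AggregatorAndSize {
        final Aggregator aggregator;
        final long initialSizeBytes;

        AggregatorAndSize(Aggregator aggregator, long initialSizeBytes) {
            this.aggregator = aggregator;
            this.initialSizeBytes = initialSizeBytes;
        }
    }

    abstract static class AggregatorFactory {
        abstract Aggregator factorize();

        /** Path 1: a direct estimate of heap footprint. */
        abstract long guessAggregatorHeapFootprint(long rows);

        /**
         * Path 2: the default implementation delegates to the path-1 estimate,
         * so both paths agree unless a subclass overrides this method.
         * Passing 1 as the row count is an assumption made for this sketch.
         */
        AggregatorAndSize factorizeWithSize() {
            return new AggregatorAndSize(factorize(), guessAggregatorHeapFootprint(1));
        }
    }
}
```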
hqx871 7142b0c39e
Enable result level cache for GroupByStrategyV2 on broker (#11595)
Caching has been disabled for GroupByStrategyV2 on the Broker since PR #3820 [groupBy v2: Results not fully merged when caching is enabled on the broker]. However, the result-level cache can be enabled on the Broker for GroupByStrategyV2 while the segment-level cache stays disabled.
2023-07-12 15:00:01 +05:30
Nhi Pham d76903f10b
Tasks API documentation refactor (#14492)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-07-11 13:19:39 -07:00
Abhishek Radhakrishnan 854ef98235
Minor doc fixes. (#14565)
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2023-07-11 13:12:40 -07:00
YongGang 0ca3ba0b30
Add service/heartbeat metric into statsd-reporter (#14564) 2023-07-11 12:38:08 -07:00
Nhi Pham a764ed7fde
Update Jupyter notebook tutorial instructions for ARM devices (#14459)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-07-11 10:01:20 -07:00
dependabot[bot] c91148c43b
Bump tough-cookie from 4.0.0 to 4.1.3 in /web-console (#14557)
Bumps [tough-cookie](https://github.com/salesforce/tough-cookie) from 4.0.0 to 4.1.3.
- [Release notes](https://github.com/salesforce/tough-cookie/releases)
- [Changelog](https://github.com/salesforce/tough-cookie/blob/master/CHANGELOG.md)
- [Commits](https://github.com/salesforce/tough-cookie/compare/v4.0.0...v4.1.3)

---
updated-dependencies:
- dependency-name: tough-cookie
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-11 08:53:42 -07:00
Laksh Singla 5ce536355e
Fix planning bug while using sort merge frame processor (#14450)
sqlJoinAlgorithm is now a hint to the planner to execute the join in the specified manner. The planner can decide to ignore the hint if it deduces that the specified algorithm can be detrimental to the performance of the join beforehand.
2023-07-11 09:58:44 +00:00
Pranav 8087aa2b80
Adding the null check in combine and fold in doublesSketch (#14568) 2023-07-11 14:28:34 +05:30
Adarsh Sanjeev 30a91be15a
Add log statements for tmpStorageBytes in MSQ (#14449)
* Add log statements for tmpStorageBytes in MSQ

* Add log

* Update log message
2023-07-11 11:02:12 +05:30
imply-cheddar 66cac08a52
Refactor HllSketchBuildAggregatorFactory (#14544)
* Refactor HllSketchBuildAggregatorFactory

The usage of ColumnProcessors and HllSketchBuildColumnProcessorFactory
made it very difficult to figure out what was going on from just looking
at the AggregatorFactory or Aggregator code.  It also didn't properly
double check that you could use UTF8 ahead of time, even though it's
entirely possible to validate it before trying to use it.  This refactor
keeps the general indirection that had been implemented by
the Consumer<Supplier<HllSketch>> but centralizes the decision logic and
makes it easier to understand the code.

* Test fixes

* Add test that validates the types are maintained

* Add back indirection to avoid buffer calls

* Cover floats and doubles are the same thing

* Static checks
2023-07-10 09:57:09 -07:00
Tejaswini Bandlamudi c3f84f9ea0
Suppress CVEs (#14291)
Address various CVEs by upgrading dependencies or adding suppression with a justification
2023-07-10 15:19:26 +05:30
Kashif Faraz 58a35bf07e
Deprecate EntryExistsException in Druid 27 and remove in Druid 28 (#14554)
Also deprecate UnknownSegmentIdsException.
2023-07-08 15:40:14 +05:30
Gian Merlino 63ee69b4e8
Claim full support for Java 17. (#14384)
* Claim full support for Java 17.

No production code has changed, except the startup scripts.

Changes:

1) Allow Java 17 without DRUID_SKIP_JAVA_CHECK.

2) Include the full list of opens and exports on both Java 11 and 17.

3) Document that Java 17 is both supported and preferred.

4) Switch some tests from Java 11 to 17 to get better coverage on the
   preferred version.

* Doc update.

* Update errorprone.

* Update docker_build_containers.sh.

* Update errorprone in licenses.yaml.

* Add some more run-javas.

* Additional run-javas.

* Update errorprone.

* Suppress new errorprone error.

* Add exports and opens in ForkingTaskRunner for Java 11+.

Test, doc changes.

* Additional errorprone updates.

* Update for errorprone.

* Restore old formatting in LdapCredentialsValidator.

* Copy bin/ too.

* Fix Java 15, 17 build line in docker_build_containers.sh.

* Update busybox image.

* One more java command.

* Fix interpolation.

* IT commandline refinements.

* Switch to busybox 1.34.1-glibc.

* POM adjustments, build and test one IT on 17.

* Additional debugging.

* Fix silly thing.

* Adjust command line.

* Add exports and opens one more place.

* Additional harmonization of strong encapsulation parameters.
2023-07-07 12:52:35 -07:00
Katya Macedo 5f94a2a9c2
Add link to Slack channel (#14553)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-07-07 10:09:15 -07:00
Gian Merlino 021a01df45
RTR, HRTR: Fix incorrect maxLazyWorkers check in markLazyWorkers. (#14545)
Recently #14532 fixed a problem when maxLazyWorkers == 0 and lazyWorkers
starts out empty. Unfortunately, even after that patch, there remained
a more general version of this problem when maxLazyWorkers == lazyWorkers.size().
This patch fixes it.

I'm not sure if this would actually happen in production, because the
provisioning strategies do try to avoid calling markWorkersLazy until
previously-initiated terminations have finished. Nevertheless, it still
seems like a good thing to fix.
2023-07-07 10:08:12 -07:00
Laksh Singla 9e617373a0
Handle dimensionless group by queries with partitioning 2023-07-07 21:51:47 +05:30
Karan Kumar afa8c7b8ab
Adding Ability for MSQ to write select results to durable storage. (#14527)
One of the most requested features in Druid is the ability to download big result sets.
As part of #14416, we added the ability for MSQ to be queried via a query-friendly endpoint. This PR builds upon that work and adds the ability for MSQ to write select results to durable storage.

We write the results to the durable storage location <prefix>/results/<queryId> in the Druid frame format. This is exposed to users by
/v2/sql/statements/:queryId/results.
2023-07-07 20:49:48 +05:30
Kashif Faraz 40d0dc9e0e
Use separate executor to handle task updates in TaskQueue (#14533)
Description:
`TaskQueue.notifyStatus` is often a heavy call as it performs the following operations:
- Update task status in metadata DB
- Update task locks in metadata DB
- Request (synchronously) the task runner to shutdown the completed task
- Clean up in-memory data structures

This method can often be slow and can cause worker sync / task runners to slow down.

Main changes:
- Run task completion callbacks in a separate executor to handle task completion updates
- Add new config `druid.indexer.queue.taskCompleteHandlerNumThreads`
- Add metrics to monitor number of processed and queued items
- There are still other paths that can invoke `notifyStatus`, but those need not be moved to
the new executor as they are synchronous on purpose.

Other changes:
- Add new metrics `task/status/queue/count`, `task/status/handled/count`
- Add `TaskCountStatsProvider.getStats()` which deprecates the other `getXXXTaskCount` methods.
- Use `CoordinatorRunStats` to collect and report metrics. This class has been used as is
for now but will later be renamed and repurposed to use across all Druid services.
2023-07-07 20:43:12 +05:30
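The main change in the commit above, moving slow completion handling off the caller's thread, can be sketched as follows. This is a minimal illustration and not the `TaskQueue` code: the class, its methods, and the two counters are hypothetical, with the thread count standing in for what `druid.indexer.queue.taskCompleteHandlerNumThreads` would configure.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of handling task completion on a dedicated executor.
final class TaskCompletionHandlerSketch {
    private final ExecutorService completionExecutor;
    private final AtomicInteger queued = new AtomicInteger();
    private final AtomicInteger handled = new AtomicInteger();

    TaskCompletionHandlerSketch(int numThreads) {
        this.completionExecutor = Executors.newFixedThreadPool(numThreads);
    }

    /** Called from the task runner's callback thread; returns without waiting for the update. */
    void onTaskComplete(Runnable slowStatusUpdate) {
        queued.incrementAndGet();
        completionExecutor.submit(() -> {
            try {
                // Metadata updates, lock cleanup, shutdown requests, and so on.
                slowStatusUpdate.run();
            } finally {
                queued.decrementAndGet();
                handled.incrementAndGet();
            }
        });
    }

    /** Number of completion updates submitted but not yet finished. */
    int queuedCount() {
        return queued.get();
    }

    /** Number of completion updates that have finished. */
    int handledCount() {
        return handled.get();
    }

    /** Stops accepting new work and waits for queued updates to finish. */
    boolean shutdownAndAwait(long timeoutMillis) {
        completionExecutor.shutdown();
        try {
            return completionExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The two counters mirror the idea behind the queued and handled metrics: one shows how much work is waiting, the other how much has been processed.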
Jan Werner 95115d722a
CVE fixes - update of multiple dependencies. (#14519)
Apache Druid brings multiple direct and transitive dependencies that are affected by a plethora of CVEs.
This PR attempts to update all the dependencies that did not require code refactoring.
This PR modifies pom files, license file and OWASP Dependency Check suppression file.
2023-07-07 20:27:30 +05:30
Gian Merlino 1fe61bc869
ChangeRequestHttpSyncer: Don't wait 1ms when checking isInitialized(). (#14547)
The wait doesn't seem to serve a purpose, other than causing delays
when checking isInitialized() for a large number of things that have
not yet been initialized.
2023-07-07 05:54:39 -07:00
Kashif Faraz d63eff3b1b
Reduce contention in HttpRemoteTaskRunner.getKnownTasks() (#14541) 2023-07-07 13:43:59 +05:30
Gian Merlino dd78e00dc5
Fix ColumnSignature error message and jdk17 test issue. (#14538)
* Fix ColumnSignature error message and jdk17 test issue.

On jdk17, the "problem" part of the error message could change from
NullPointerException to:

  Cannot invoke "String.length()" because "s" is null

Due to the new more-helpful NPEs in Java 17. This broke the expectation
and led to test failures on this case.

This patch fixes the problem by improving the error message so it isn't
a generic NullPointerException.

* Fix format.
2023-07-06 15:10:59 -07:00
Gian Merlino 037f09bef2
HttpRemoteTaskRunner: Fix markLazyWorkers for maxLazyWorkers == 0. (#14532) 2023-07-06 11:51:04 -07:00
Abhishek Radhakrishnan d02bb8bb6e
Set explain attributes after the query is prepared (#14490)
* Add support for DML WITH AS.

* One more UT for with as subquery.

* Add a test with join query

* Use root query prepared node instead of individual SqlNode types.

- Set the explain plan attributes after the query is prepared when
the query is planned and we've the finalized output names in the root
source rel node.
- Adjust tests; add unit test for negative ordinal case.
- Remove the exception / error handling logic from resolveClusteredBy
function since the validations now happen before it comes to the function

* Update comment.
2023-07-06 14:13:32 -04:00
imply-cheddar 5fc122a144
Add window-focused tests from Drill (#13773)
This commit borrows some test definitions from Drill's test suite
and tries to use them to flesh out the full validation of window
function capabilities.

In order to be able to run these tests, we also add the ability to
run a Scan operation against segments, which also meant an
implementation of RowsAndColumns for frames.
2023-07-06 09:20:32 -07:00
imply-cheddar 277b357256
Optimize IntervalIterator (#14530)
UniformGranularityTest's test to test a large number of intervals
runs through 10 years of 1 second intervals.  This pushes a lot of
stuff through IntervalIterator and shows up in terms of test
runtime as one of the hottest tests.  Most of the time is going to
constructing jodatime objects because it is doing things with
DateTime objects instead of millis.  Change the calls to use
millis instead and things go faster.
2023-07-06 14:44:23 +05:30
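To show why working in millis is cheaper, here is a minimal sketch of iterating fixed-width time buckets with primitive `long` arithmetic, so that no date object is allocated for the bookkeeping. It is not the `IntervalIterator` code: the class is hypothetical, it only handles fixed-duration buckets (not calendar-aware periods such as months), and aligning the first bucket down to a multiple of the bucket width is an assumption.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration; each element is a {startMillis, endMillis} pair.
final class MillisBucketIteratorSketch implements Iterator<long[]> {
    private final long endMillis;
    private final long bucketMillis;
    private long cursor;

    MillisBucketIteratorSketch(long startMillis, long endMillis, long bucketMillis) {
        if (bucketMillis <= 0) {
            throw new IllegalArgumentException("bucketMillis must be positive");
        }
        // Align the first bucket down to a multiple of the bucket width.
        this.cursor = startMillis - Math.floorMod(startMillis, bucketMillis);
        this.endMillis = endMillis;
        this.bucketMillis = bucketMillis;
    }

    @Override
    public boolean hasNext() {
        return cursor < endMillis;
    }

    @Override
    public long[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        long[] bucket = {cursor, cursor + bucketMillis};
        cursor += bucketMillis; // plain addition instead of date arithmetic
        return bucket;
    }
}
```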
Kashif Faraz 87bb1b9709
Fix bug during initialization of HttpServerInventoryView (#14517)
If a server is removed during `HttpServerInventoryView.serverInventoryInitialized`,
the initialization gets stuck as this server is never synced. The method eventually times
out (default 250s).

Fix: Mark a server as stopped if it is removed. `serverInventoryInitialized` only waits for
non-stopped servers to sync.

Other changes:
- Add new metrics for better debugging of slow broker/coordinator startup
  - `segment/serverview/sync/healthy`: whether the server view is syncing properly with a server
  - `segment/serverview/sync/unstableTime`: time for which sync with a server has been unstable  
- Clean up logging in `HttpServerInventoryView` and `ChangeRequestHttpSyncer`
- Minor refactor for readability
- Add utility class `Stopwatch`
- Add tests and stubs
2023-07-06 13:04:53 +05:30
Kashif Faraz a6547febaf
Remove unused coordinator dynamic configs (#14524)
After #13197, several coordinator configs are now redundant as they are not being
used anymore, neither with `smartSegmentLoading` nor otherwise.

Changes:
- Remove dynamic configs `emitBalancingStats`: balancer error stats are always
emitted, debug stats can be logged by using `debugDimensions`
- `useBatchedSegmentSampler`, `percentOfSegmentsToConsiderPerMove`:
batched segment sampling is always used
- Add test to verify deserialization with unknown properties
- Update `CoordinatorRunStats` to always track stats, this can be optimized later.
2023-07-06 12:11:10 +05:30
Soumyava 78db7a4414
A query in MSQ would issue wrong error code (#14531)
with a RuntimeException. Now the RuntimeException is replaced by a user-facing DruidException of the Invalid category, which allows Calcite to avoid throwing an uncategorized exception.
2023-07-06 08:59:35 +05:30
Jonathan Wei f29a9faa94
Better surfacing of invalid pattern errors for SQL REGEXP_EXTRACT function (#14505) 2023-07-05 17:12:54 -05:00