14575 Commits

Author SHA1 Message Date
Adithya Chakilam e834e49290
supervisor/autoscaler: Fix clearing of collected lags on skipped scale actions (#17356)
* supervisor/autoscaler: Fix clearing of collected lags on skipped scale actions

* comments

* supervisor/autoscaler: Skip scaling when partitions are less than minTaskCount (#17335)

* Fix pip installation after ubuntu upgrade (#17358)

* fix tests

---------

Co-authored-by: Pranav <pranavbhole@gmail.com>
2024-10-17 11:05:16 -07:00
317brian d1b81f312a
docs: msq autocompaction (#16681)
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Co-authored-by: Vishesh Garg <vishesh.garg@imply.io>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-10-17 10:40:53 -07:00
Abhishek Radhakrishnan 0e6c388b7f
Delta snapshots are zero-indexed, so remove zeroMeansUndefined: true. (#17367)
This lets users filter by snapshotVersion: 0. Previously, zeroMeansUndefined
was set to true, so it would silently default to the latest snapshot.
2024-10-17 08:33:10 -07:00
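To illustrate the behaviour change in #17367, here is a minimal Python sketch (not Druid's Java code). The function `resolve_snapshot_version` and its parameters are hypothetical names chosen for this example; only the idea that version 0 is a real snapshot and must not be read as "unset" comes from the commit message.

```python
from typing import Optional


def resolve_snapshot_version(
    requested: Optional[int],
    latest: int,
    zero_means_undefined: bool = False,
) -> int:
    """Return the snapshot version to read (hypothetical helper)."""
    if requested is None:
        # Nothing requested: fall back to the latest snapshot.
        return latest
    if zero_means_undefined and requested == 0:
        # Old behaviour: 0 was indistinguishable from "not set",
        # so the latest snapshot was silently used instead.
        return latest
    return requested
```

With `zero_means_undefined=False`, a request for version 0 returns 0, which is what a zero-indexed snapshot history needs.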
Akshat Jain 8c52be81d3
Fix postgres metadata storage warning logs because of tablename causing issues (#17351) 2024-10-17 15:48:22 +05:30
Akshat Jain 450fb0147b
Add GlueingPartitioningOperator + Corresponding changes in window function layer to consume it for MSQ (#17038)
* GlueingPartitioningOperator: It continuously receives data, and outputs batches of partitioned RACs. It maintains a last-partitioning-boundary of the last-pushed-RAC, and attempts to glue it with the next RAC it receives, ensuring that partitions are handled correctly, even across multiple RACs. You can check GlueingPartitioningOperatorTest for some good examples of the "glueing" work.
* PartitionSortOperator: It sorts rows inside partitioned RACs, on the sort columns. The input RACs it receives are expected to be "complete / separate" partitions of data.
2024-10-17 10:54:52 +05:30
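The "glueing" idea described in #17038 can be sketched in a few lines of Python. This is an illustration only, not the operator's implementation: plain lists stand in for RACs, the function name `glue_partitions` is hypothetical, and the sketch assumes rows arrive already clustered by partition key.

```python
from itertools import groupby
from typing import Callable, Hashable, Iterable, Iterator, List, TypeVar

Row = TypeVar("Row")


def glue_partitions(
    batches: Iterable[List[Row]],
    key: Callable[[Row], Hashable],
) -> Iterator[List[Row]]:
    """Yield complete partitions, even when one spans several input batches."""
    pending: List[Row] = []  # trailing partition that may continue in the next batch
    for batch in batches:
        for k, group in groupby(batch, key=key):
            rows = list(group)
            if pending and key(pending[0]) == k:
                # Same partition key as the held-back rows: glue them together.
                pending.extend(rows)
            else:
                if pending:
                    yield pending
                pending = rows
    if pending:
        # No more input, so the last partition is complete.
        yield pending
```

The last partition of each batch is held back until the next batch shows whether it continues, which is why a partition split across two batches still comes out as one unit.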
Ashwin Tumma 90175b8927
[Prometheus Emitter] Add to code coverage and remove code smell (#17362)
* [Prometheus Emitter] Add to code coverage and remove code smell
2024-10-17 10:49:16 +05:30
Vadim Ogievetsky 26e2ca66d7
update to node 20 (#17363) 2024-10-16 13:15:10 -07:00
Vadim Ogievetsky 877784e5fd
Web console: add expectedLoadTimeMillis (#17359)
* add expectedLoadTimeMillis

* make spec cleaning less aggressive

* more cleanup
2024-10-16 13:14:27 -07:00
Vadim Ogievetsky 8ddb316e68
Web console: fix progress indication for table input (#17334)
* fix progress indication for table input

* fix snapshot
2024-10-16 13:14:11 -07:00
Suraj Goel c1fe1ac898
Remove EOL file-loader dependency (#17346) 2024-10-16 11:11:06 -07:00
George Shiqi Wu a664fc8be3
always set taskLocation (#17350) 2024-10-16 14:02:39 -04:00
Kashif Faraz df3a307e83
Do not use cachingCost balancer strategy in Docker environment (#17349) 2024-10-16 20:59:46 +05:30
TessaIO a9f582711e
Fix loading lookup extension (#17212)
We introduce the option to iterate over fetched data from the dataFetcher for loadingLookups in the lookups-cached-single extension. We also handle the case where the data exists in Druid but not in the actual data fetcher (in our case, the JDBC data fetcher), where the value returned is null.

Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>
2024-10-16 07:28:32 -07:00
Pranav f80e2c229e
Fix pip installation after ubuntu upgrade (#17358) 2024-10-15 17:50:18 -07:00
Adithya Chakilam c57bd3b438
supervisor/autoscaler: Skip scaling when partitions are less than minTaskCount (#17335) 2024-10-15 14:12:53 -07:00
Hardik Bajaj 32ce341a6c
Fix RejectExecutionHandler of Blocking Single Threaded executor (#17146)
Throw RejectedExecutionException when submitting tasks to executor that has been shut down.
2024-10-15 22:02:34 +05:30
Clint Wylie c2149d59a7
remove stale comment in QueryableIndexCursorHolder (#17333) 2024-10-11 16:23:59 -07:00
Gian Merlino b287b219a8
MSQ: Include stageId, workerNumber in processing thread names. (#17324)
* MSQ: Include stageId, workerNumber in processing thread names.

Helps identify which query was running in a thread dump.

* s/dart/msq/
2024-10-11 08:37:15 -07:00
Gian Merlino a0c29f8bbb
MSQ WorkerResource: Fix timeout handler for httpGetChannelData. (#17328)
The timeout handler should fire if the response has not been handled yet
(i.e. if responseResolved was previously false). However, it erroneously
fires only if the response *was* handled. This causes HTTP 500 errors if
the timeout actually does fire. The timeout is 30 seconds, which can be
hit during pipelined queries, if an earlier stage of the query hasn't
produced its first frame within 30 seconds.

This fixes a regression introduced in #17140.
2024-10-11 16:29:04 +05:30
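The condition described in #17328 can be shown with a small Python sketch. It is not the WorkerResource code: `ResponseState`, `try_resolve`, and `on_timeout` are hypothetical names, and a lock-guarded flag stands in for whatever mechanism tracks `responseResolved` in Druid.

```python
import threading
from typing import Callable


class ResponseState:
    """Tracks whether a response has already been handled (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._resolved = False

    def try_resolve(self) -> bool:
        """Mark the response as handled; return True only for the first caller."""
        with self._lock:
            was_resolved = self._resolved
            self._resolved = True
            return not was_resolved


def on_timeout(state: ResponseState, send_timeout_response: Callable[[], None]) -> None:
    # Fire only if the response was NOT handled before the timeout.
    # The regression described above had this condition inverted.
    if state.try_resolve():
        send_timeout_response()
```

Checking and setting the flag in one step means the timeout path and the normal response path cannot both act on the same request.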
Karan Kumar 034bb9dbea
Removing enable windowing from MSQ tests. (#17276) 2024-10-11 05:33:27 +02:00
Shivam Garg 6898a5a359
Removed Microsecond from Extract function (#17247) 2024-10-11 05:32:26 +02:00
Clint Wylie a6236c3d15
add substituteCombiningFactory implementations for datasketches aggs (#17314)
Follow up to #17214, adds implementations for substituteCombiningFactory so that more
datasketches aggs can match projections, along with some projections tests for datasketches.
2024-10-10 16:14:06 +05:30
Suneet Saldanha fb38e483cf
statsd-emitter: Add dutyGroup to coordinator global time metric (#17320)
The duty group is a low cardinality dimension and can be helpful in providing insight
into whether a particular duty group is not running fast enough on the coordinator.
2024-10-10 16:03:50 +05:30
Gian Merlino 1d95ef34f0
Logger: Log context of DruidExceptions. (#17316)
* Logger: Log context of DruidExceptions.

There is often interesting and unique information available in the
"context" of a DruidException. This information is additive to both
the message and the cause, and was previously omitted from log
messages. This patch adds the DruidException context to log messages
whenever stack traces are enabled.

* Only log nonempty contexts.
2024-10-10 01:44:50 -07:00
Gian Merlino 074944e02c
Dart: Only use historicals as workers. (#17319)
Only historicals load the Dart worker modules. Other types of servers in
the server view (such as realtime tasks) should not be included.
2024-10-10 13:47:58 +05:30
Gian Merlino 4092f3fe47
MSQ: Call "onQueryComplete" after the query is closed. (#17313)
This fixes a concurrency issue where, for failed queries, "onQueryComplete"
could be called concurrently with "onResultsStart" or "onResultRow". Fully
closing the controller ensures that the result reader is no longer active,
which eliminates the race.
2024-10-10 10:44:44 +05:30
Gian Merlino b27712933e
MSQ: Use leaf worker count for stages that have any leaf inputs. (#17312)
Previously, the leaf worker count was used for stages that have *no*
stage inputs. It should actually be used for stages that have *any*
non-broadcast, non-stage inputs.

This fixes a bug with broadcast joins. In a broadcast join, the stage has
both a table and a broadcast stage as input. Previously, it would be planned
using the non-leaf worker count. It should actually be planned using the
leaf worker count.
2024-10-10 10:44:31 +05:30
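The rule change in #17312 can be compared side by side in a short Python sketch. The `StageInput` type and both function names are hypothetical; the sketch only encodes the two rules as the commit message states them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StageInput:
    from_stage: bool          # True if the input is the output of an earlier stage
    broadcast: bool = False   # True if the input is broadcast to every worker


def worker_count_old(inputs: Sequence[StageInput], leaf: int, non_leaf: int) -> int:
    # Old rule: leaf count only when the stage has *no* stage inputs.
    return leaf if not any(i.from_stage for i in inputs) else non_leaf


def worker_count_new(inputs: Sequence[StageInput], leaf: int, non_leaf: int) -> int:
    # New rule: leaf count when *any* input is non-broadcast and non-stage.
    has_leaf_input = any(not i.from_stage and not i.broadcast for i in inputs)
    return leaf if has_leaf_input else non_leaf
```

For a broadcast join (a table input plus a broadcast stage input), the old rule picks the non-leaf count and the new rule picks the leaf count, which matches the bug described above.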
Kashif Faraz 3f797c52d0
Fix duplicate compaction task launched by OverlordCompactionScheduler (#17287)
Description
-----------
The `OverlordCompactionScheduler` may sometimes launch a duplicate compaction
task for an interval that has just been compacted.

This may happen as follows:
- Scheduler launches a compaction task for an uncompacted interval.
- While the compaction task is running, the `CompactionStatusTracker` does not consider
this interval as compactible and returns the `CompactionStatus` as `SKIPPED` for it.
- As soon as the compaction task finishes, the `CompactionStatusTracker` starts considering
the interval eligible for compaction again.
- This interval remains eligible for compaction until the newly published segments are polled
from the database.
- Once the new segments have been polled, the `CompactionStatus` of the interval changes
to `COMPLETE`.

Change
--------
- Keep track of the `snapshotTime` in `DataSourcesSnapshot`. This time represents the start of the poll.
- Use the `snapshotTime` to determine if a poll has happened after a compaction task completed.
- If not, then skip the interval to avoid launching duplicate tasks.
- For tests, use a future `snapshotTime` to ensure that compaction is always triggered.
2024-10-10 08:44:09 +05:30
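The `snapshotTime` check from #17287 might look roughly like the following Python sketch. The function name and parameters are hypothetical, and treating an equal timestamp as "no poll yet" is an assumption of this example, not something the commit message specifies.

```python
from datetime import datetime
from typing import Optional


def should_skip_interval(
    last_task_completed_at: Optional[datetime],
    snapshot_time: datetime,
) -> bool:
    """Return True if launching compaction now could duplicate a finished task."""
    if last_task_completed_at is None:
        # No compaction task has completed for this interval yet.
        return False
    # snapshot_time marks the START of the latest segment poll. If that poll
    # did not start after the task completed, the snapshot may still be
    # missing the newly published segments, so skip the interval for now.
    return snapshot_time <= last_task_completed_at
```

Once a later poll starts, `snapshot_time` moves past the completion time and the interval is evaluated normally again.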
Karan Kumar 4fdb38118a
CVE suppression for various dependencies. (#17307) 2024-10-09 18:07:09 +05:30
AmatyaAvadhanula 88d26e4541
Fix queries for updated segments on SinkQuerySegmentWalker (#17157)
Fix the logic for usage of segment descriptors from queries in SinkQuerySegmentWalker when there are upgraded segments as a result of concurrent replace.

Concurrent append and replace:
With the introduction of concurrent append and replace, for a given interval:

- The same sink can correspond to a base segment V0_x0, and have multiple mappings to higher versions with distinct partition numbers, such as V1_x1 ... Vn_xn.
- The initial segment allocation can happen on version V0, but there can be several allocations during the lifecycle of a task, which can have different versions spanning from V0 to Vn.

Changes:
- Maintain a new timeline of overshadowable objects, each holding a SegmentDescriptor.
- Every segment allocation or version upgrade adds the latest segment descriptor to this timeline.
- Iterate over this timeline instead of the sinkTimeline to get the segment descriptors in getQueryRunnerForIntervals.
- Also maintain a mapping of the upgraded segment to its base segment.
- When a sink is needed to process the query, find the base segment corresponding to a given descriptor, and then use the sinkTimeline to find its chunk.
2024-10-09 14:43:17 +05:30
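The upgraded-to-base mapping described in #17157 can be sketched as a small lookup table in Python. This is not the SinkQuerySegmentWalker code: the class `UpgradedSegmentIndex`, its methods, and the tuple layout `(interval, version, partition)` are all hypothetical stand-ins.

```python
from typing import Dict, Tuple

# (interval, version, partition number), a stand-in for a segment descriptor
Descriptor = Tuple[str, str, int]


class UpgradedSegmentIndex:
    """Maps every known descriptor to the base segment whose sink holds the data."""

    def __init__(self) -> None:
        self._base_of: Dict[Descriptor, Descriptor] = {}

    def register_base(self, base: Descriptor) -> None:
        # A base segment maps to itself.
        self._base_of[base] = base

    def register_upgrade(self, upgraded: Descriptor, base: Descriptor) -> None:
        # An upgraded version points back at the original base segment.
        self._base_of[upgraded] = self._base_of.get(base, base)

    def base_for(self, descriptor: Descriptor) -> Descriptor:
        # Raises KeyError for descriptors that were never registered.
        return self._base_of[descriptor]
```

A query that names an upgraded descriptor is first resolved to its base segment, and only then is the sink looked up.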
Vadim Ogievetsky a395368622
run npm audit fix (#17290) 2024-10-08 16:44:09 -07:00
Vadim Ogievetsky 4570809b4a
better timing bar styling (#17295) 2024-10-08 16:30:58 -07:00
anny-imply dca69c5761
update line in architecture md (#17289) 2024-10-08 11:51:47 -07:00
Gian Merlino baa16f30f6
DartWorkerContext: Return the correct workerId(). (#17280)
Prior to this patch, the workerId() method did not actually return
the worker ID. It returned some other string that had similar information,
but was different.

This caused the /druid/dart-worker/workers API to return an internal
server error. The API is useful for debugging, although it is not used
during actual queries.
2024-10-08 09:52:55 -07:00
Charles Smith 5ed68622c3
[Docs] Update known issues for window functions (#17097)
* draft update to known issues

* Update known issues

Remove addressed known issues. Clarify the issue with SELECT * queries.
2024-10-08 08:47:13 -07:00
Gian Merlino 152330c5a8
WorkerManager: Correct javadoc for "stop". (#17279)
The javadoc had a factual error: Dart's implementation does not in
fact always return immediately.
2024-10-08 15:49:43 +05:30
Gian Merlino 0a279e634a
DartSqlResource: Return HTTP 202 on cancellation even if no such query. (#17278)
Return HTTP 202 (Accepted) on cancellation, even if the requested query
ID was not found.

The main reason for this is that when the Router broadcasts DELETE requests
to all Brokers, it returns the response from one of them randomly. If we
return 404 when a query ID isn't found, then the Router randomly returns 404s
even when the query really was found and canceled.

This is also arguably still correct behavior. The cancellation request
*was* accepted, it just won't do anything because the query was not in
fact running.
2024-10-08 15:49:34 +05:30
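The cancellation behaviour from #17278 can be sketched in Python as follows. The function `cancel_query` and the dictionary of running queries are hypothetical; only the "always answer 202" rule comes from the commit message.

```python
from typing import Callable, Dict

HTTP_ACCEPTED = 202


def cancel_query(running: Dict[str, Callable[[], None]], query_id: str) -> int:
    """Cancel the query if this Broker is running it; always answer 202."""
    cancel = running.pop(query_id, None)
    if cancel is not None:
        cancel()
    # Same status whether or not the query was found here, so a Router that
    # relays one Broker's response at random never reports a misleading 404.
    return HTTP_ACCEPTED
```

Because every Broker returns the same status, it no longer matters which response the Router happens to pick.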
Gian Merlino 01baf99148
DartWorkerModule: Replace en dash with regular dash. (#17281)
Due to a typo, the thread name of the worker executor used an en dash (–)
rather than a regular hyphen (-). This was unintentional, and makes it
difficult to search for in thread dumps.
2024-10-08 15:48:10 +05:30
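The search problem mentioned in #17281 is easy to reproduce with a short Python sketch. The thread names below are made up for illustration (the real name is not given in the commit message), and `find_non_ascii` is a hypothetical helper.

```python
from typing import List, Tuple

EN_DASH = "\u2013"


def find_non_ascii(name: str) -> List[Tuple[int, str]]:
    """Return (index, character) pairs for every non-ASCII character in name."""
    return [(i, ch) for i, ch in enumerate(name) if ord(ch) > 127]


# Hypothetical thread names: one with the typo, one with a regular hyphen.
typo_name = f"example-worker{EN_DASH}0"
fixed_name = "example-worker-0"
```

A plain-text search for `example-worker-` matches `fixed_name` but not `typo_name`, even though the two look nearly identical in most fonts.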
Gian Merlino 2309aa7bdf
DartSqlResource: Add controllerHost to GetQueriesResponse. (#17283)
This helps find the specific Broker that is executing a query.
2024-10-08 15:47:32 +05:30
Gian Merlino 9921ac1b19
DartSqlResource: Sort queries by start time. (#17282)
* DartSqlResource: Sort queries by start time.

This keeps the list of queries returned by the API in a consistent order.

* Fix test.
2024-10-08 15:47:21 +05:30
Gian Merlino 06bbdb38ce
MSQ: Allow for worker gaps. (#17277)
In a Dart query, all Historicals are given worker IDs, but not all of them
are going to actually be started or receive work orders. This can create gaps
in the set of workers. For example, workers 1 and 3 could have work assigned
while workers 0 and 2 do not.

This patch updates ControllerStageTracker and WorkerInputs to handle such
gaps, by using the set of actual worker numbers, rather than 0..workerCount,
in various places.
2024-10-08 15:07:57 +05:30
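The difference between "0..workerCount" and "the set of actual worker numbers" in #17277 can be shown with a small Python sketch. Both function names are hypothetical and the sketch does not mirror ControllerStageTracker or WorkerInputs.

```python
from typing import Iterable, List, Set


def unfinished_assuming_dense(worker_count: int, finished: Set[int]) -> List[int]:
    # Assumes workers are numbered 0..worker_count-1 with no gaps.
    return [w for w in range(worker_count) if w not in finished]


def unfinished_with_gaps(assigned: Iterable[int], finished: Set[int]) -> List[int]:
    # Uses the actual worker numbers, so gaps such as {1, 3} are handled.
    return sorted(set(assigned) - finished)
```

With work assigned to workers 1 and 3 and worker 1 finished, the dense version waits on worker 0, which never received work, while the gap-aware version correctly waits on worker 3.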
Gian Merlino 4fbb129027
Improve javadocs for SegmentDescriptor. (#17274)
The javadoc for SegmentDescriptor discusses differences between it and
SegmentId, but misses the most important difference: SegmentDescriptor
can have a narrower interval than the segment being referenced.
2024-10-08 00:59:55 -07:00
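The point made in #17274 can be illustrated with a Python sketch. The types `Interval`, `SegmentRef`, and `DescriptorRef` are hypothetical stand-ins, with integer timestamps for brevity; they are not Druid's SegmentId or SegmentDescriptor classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # inclusive
    end: int    # exclusive

    def contains(self, other: "Interval") -> bool:
        return self.start <= other.start and other.end <= self.end


@dataclass(frozen=True)
class SegmentRef:  # stand-in for a segment identifier
    interval: Interval
    version: str
    partition: int


@dataclass(frozen=True)
class DescriptorRef:  # stand-in for a segment descriptor
    interval: Interval
    version: str
    partition: int


def describes(descriptor: DescriptorRef, segment: SegmentRef) -> bool:
    """True if the descriptor refers to this segment, possibly to only part of it."""
    return (
        descriptor.version == segment.version
        and descriptor.partition == segment.partition
        and segment.interval.contains(descriptor.interval)
    )
```

A descriptor covering only part of a segment's interval still refers to that segment, so interval equality is the wrong test when matching the two.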
Clint Wylie ab0d6eb620
Fix string array grouping comparator (#17183) 2024-10-08 09:47:28 +05:30
Edgar Melendrez a67a3c8e0a
[docs] update tutorial for Theta sketches (#16953)
* from start to step 3 of Ingest data using Theta sketches

* updated up to "Query the Theta sketch column"

* fixed sentence

* another typo

* using sql ingestion instead of batch-sql

* waiting for explanations on DS_THETA

* Revert "using sql ingestion instead of batch-sql"

This reverts commit b95fcb9b32.

* Revert "using sql ingestion instead of batch-sql"

This reverts commit b95fcb9b32.

* just copy and pasting to where I was

* updated tutorial

* fixing images, and removing unused

* slightly updating explanation

* Update docs/tutorials/tutorial-sketches-theta.md

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* addressing comments in review

* made filter clause consistent with other instances

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-10-08 10:44:37 +08:00
317brian 9932f2e70a
docs: concurrent append and replace is GA (#17269) 2024-10-08 07:55:55 +05:30
AmatyaAvadhanula f42ecc9f25
Fail concurrent replace tasks with finer segment granularity than append (#17265) 2024-10-08 07:35:13 +05:30
George Shiqi Wu 5d7c7a87ec
Add maximumCapacity to taskRunner (#17107)
* Add maximumCapacity to taskRunner

* fix tests

* pr comments
2024-10-07 15:03:51 -04:00
AmatyaAvadhanula ff97c67945
Fix batch segment allocation failure with replicas (#17262)
Fixes #16587

Streaming ingestion tasks operate by allocating segments before ingesting rows.
These allocations happen across replicas which may send different requests but
must get the same segment id for a given (datasource, interval, version, sequenceName)
across replicas.

This patch fixes the bug by ignoring the previousSegmentId when skipLineageCheck is true.
2024-10-07 19:52:38 +05:30
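The fix in #17262 can be pictured as a change to the key used to match allocation requests. The Python sketch below is an illustration under that assumption: the function `allocation_key` and its parameters are hypothetical, and the real allocation logic is not shown in the commit message.

```python
from typing import Optional, Tuple


def allocation_key(
    datasource: str,
    interval: str,
    version: str,
    sequence_name: str,
    previous_segment_id: Optional[str],
    skip_lineage_check: bool,
) -> Tuple:
    """Build the key that decides whether two requests share a segment id."""
    base = (datasource, interval, version, sequence_name)
    if skip_lineage_check:
        # Replicas may send different previous ids; ignore them so that
        # both replicas resolve to the same allocation.
        return base
    return base + (previous_segment_id,)
```

With `skip_lineage_check=True`, two replicas that disagree on `previous_segment_id` still produce the same key and therefore the same segment id.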
Karan Kumar 6a4352f466
When removeNullBytes is set, length calculations did not take into account null bytes. (#17232)
* When removeNullBytes is set, length calculations did not take into account null bytes.
2024-10-07 18:02:52 +05:30
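The kind of mismatch described in #17232 can be shown with a short Python sketch. The function `encoded_length` is hypothetical, and the commit message does not say exactly which length Druid computes; the sketch only shows why a length taken before null bytes are removed overstates what is actually written.

```python
def encoded_length(value: str, remove_null_bytes: bool) -> int:
    """Return the number of UTF-8 bytes that would be written for value."""
    data = value.encode("utf-8")
    if remove_null_bytes:
        # Null bytes are dropped on write, so they must not be counted.
        return len(data) - data.count(b"\x00")
    return len(data)
```

The computed length then agrees with the length of the string after its null bytes have been stripped.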
Adarsh Sanjeev c9201ad658
Minor refactors to processing module (#17136)
Refactors a few things.

- Adds SemanticUtils maps to columns.
- Add some addAll functions to reduce duplication, and for future reuse.
- Refactor VariantColumnAndIndexSupplier to only take a SmooshedFileMapper instead.
- Refactor LongColumnSerializerV2 to have separate functions for serializing a value and null.
2024-10-07 13:18:35 +05:30