This patch is extracted from PR 17353.
Changes:
- Added BrokerClient and BrokerClientImpl to the sql package that leverages the ServiceClient functionality; similar to OverlordClient and CoordinatorClient implementations in the server module.
- For now, only two broker API stubs are added: submitSqlTask() and fetchExplainPlan().
- Added a new POJO class ExplainPlan that encapsulates explain plan info.
- Deprecated org.apache.druid.discovery.BrokerClient in favor of the new BrokerClient in this patch.
- Clean up ExplainAttributesTest a bit and added serde verification.
MSQ currently supports only single-valued string dimensions as partition keys.
This patch adds a check to ensure that partition keys are single-valued in case
this info is available by virtue of segment download for schema inference.
During compaction, if MSQ finds multi-valued dimensions (MVDs) declared as part
of `range` partitionsSpec, it switches partitioning type to dynamic, ending up in
repeated compactions of the same interval. To avoid this scenario, the segment
download logic is also updated to always download segments if info on multi-valued
dimensions is required.
- This is a non-functional change that moves SqlTaskStatus and its unit test SqlTaskStatusTest from the msq module to the sql module to help class reuse in other places.
- This refactor is extracted from this PR to facilitate easier review.
- Fix a minor spacing issue in the TaskStartTimeoutFault error message.
The Delta Decimal type wasn't handled correctly in the Druid Delta connector, resulting in the error: Unsupported fieldType[Decimal(4, 2)] for fieldName[price].
There were no tests or existing tables with the Decimal type, so I've updated the existing table, complex-types-table to include this data type.
Note that the Decimal type can only be handled as a double at most in Druid. For a big decimal that cannot fit inside a double, it should be ingested as a string.
* GlueingPartitioningOperator: It continuously receives data, and outputs batches of partitioned RACs. It maintains a last-partitioning-boundary of the last-pushed-RAC, and attempts to glue it with the next RAC it receives, ensuring that partitions are handled correctly, even across multiple RACs. You can check GlueingPartitioningOperatorTest for some good examples of the "glueing" work.
* PartitionSortOperator: It sorts rows inside partitioned RACs, on the sort columns. The input RACs it receives are expected to be "complete / separate" partitions of data.
We introduce the option to iterate over fetched data from the dataFetcher for loadingLookups in the lookups-cached-single extension. Also, added the handling of a use case where the data exists in Druid but not in the actual data fetcher, which is in our use-case JDBC Data fetcher, where the value returned is null.
Signed-off-by: TessaIO <ahmedgrati1999@gmail.com>
The timeout handler should fire if the response has not been handled yet
(i.e. if responseResolved was previously false). However, it erroneously
fires only if the response *was* handled. This causes HTTP 500 errors if
the timeout actually does fire. The timeout is 30 seconds, which can be
hit during pipelined queries, if an earlier stage of the query hasn't
produced its first frame within 30 seconds.
This fixes a regression introduced in #17140.
Follow up to #17214, adds implementations for substituteCombiningFactory so that more
datasketches aggs can match projections, along with some projections tests for datasketches.
The duty group is a low cardinality dimension and can be helpful in providing insight
into whether a particular duty group is not running fast enough on the coordinator.
* Logger: Log context of DruidExceptions.
There is often interesting and unique information available in the
"context" of a DruidException. This information is additive to both
the message and the cause, and was missed when we log. This patch adds
the DruidException context to log messages whenever stack traces are
enabled.
* Only log nonempty contexts.
This fixes a concurrency issue where, for failed queries, "onQueryComplete"
could be called concurrently with "onResultsStart" or "onResultRow". Fully
closing the controller ensures that the result reader is no longer active,
which eliminates the race.
Previously, the leaf worker count was used for stages that have *no*
stage inputs. It should actually be used for stages that have *any*
non-broadcast, non-stage inputs.
This fixes a bug with broadcast joins. In a broadcast join, the stage has
both a table and a broadcast stage as input. Previously, it would be planned
using the non-leaf worker count. It should actually be planned using the
leaf worker count.
Description
-----------
The `OverlordCompactionScheduler` may sometimes launch a duplicate compaction
task for an interval that has just been compacted.
This may happen as follows:
- Scheduler launches a compaction task for an uncompacted interval.
- While the compaction task is running, the `CompactionStatusTracker` does not consider
this interval as compactible and returns the `CompactionStatus` as `SKIPPED` for it.
- As soon as the compaction task finishes, the `CompactionStatusTracker` starts considering
the interval eligible for compaction again.
- This interval remains eligible for compaction until the newly published segments are polled
from the database.
- Once the new segments have been polled, the `CompactionStatus` of the interval changes
to `COMPLETE`.
Change
--------
- Keep track of the `snapshotTime` in `DataSourcesSnapshot`. This time represents the start of the poll.
- Use the `snapshotTime` to determine if a poll has happened after a compaction task completed.
- If not, then skip the interval to avoid launching duplicate tasks.
- For tests, use a future `snapshotTime` to ensure that compaction is always triggered.
Fix the logic for usage of segment descriptors from queries in SinkQuerySegmentWalker when there are upgraded segments as a result of concurrent replace.
Concurrent append and replace:
With the introduction of concurrent append and replace, for a given interval:
The same sink can correspond to a base segment V0_x0, and have multiple mappings to higher versions with distinct partition numbers such as V1_x1.... Vn_xn.
The initial segment allocation can happen on version V0, but there can be several allocations during the lifecycle of a task which can have different versions spanning from V0 to Vn.
Changes:
Maintain a new timeline of (An overshadowable holding a SegmentDescriptor)
Every segment allocation of version upgrade adds the latest segment descriptor to this timeline.
Iterate this timeline instead of the sinkTimeline to get the segment descriptors in getQueryRunnerForIntervals
Also maintain a mapping of the upgraded segment to its base segment.
When a sink is needed to process the query, find the base segment corresponding to a given descriptor, and then use the sinkTimeline to find its chunk.
Prior to this patch, the workerId() method did not actually return
the worker ID. It returned some other string that had similar information,
but was different.
This caused the /druid/dart-worker/workers API, to return an internal
server error. The API is useful for debugging, although it is not used
during actual queries.
Return HTTP 202 (Accepted) on cancellation, even if the requested query
ID was not found.
The main reason for this is that when the Router broadcasts DELETE requests
to all Brokers, it returns the response from one of them randomly. If we
return 404 when a query ID isn't found, then the Router randomly returns 404s
even when the query really was found and canceled.
This is also arguably still correct behavior. The cancellation request
*was* accepted, it just won't do anything because the query was not in
fact running.
Due to a typo, the thread name of the worker executor used an en dash (–)
rather than a regular hyphen (-). This was unintentional, and makes it
difficult to search for in thread dumps.
In a Dart query, all Historicals are given worker IDs, but not all of them
are going to actually be started or receive work orders. This can create gaps
in the set of workers. For example, workers 1 and 3 could have work assigned
while workers 0 and 2 do not.
This patch updates ControllerStageTracker and WorkerInputs to handle such
gaps, by using the set of actual worker numbers, rather than 0..workerCount,
in various places.