Commit Graph

14248 Commits

Author SHA1 Message Date
Laksh Singla c84e689eb8
Don't use ComplexMetricExtractor to fetch the class of the object in field readers (#16825)
This patch fixes queries like `SELECT COUNT(DISTINCT json_col) FROM foo`
2024-08-05 14:13:56 +05:30
Laksh Singla 0411c4e67e
Add metrics for number of rows/bytes materialized while running subqueries (#16835)
subquery/rows and subquery/bytes metrics have been added, which indicate the size of the results materialized on the heap.
2024-08-05 14:13:20 +05:30
Sree Charan Manamala c7eacd079e
fallback SQL IN filter to expression filter when VirtualColumnRegistry is null (#16836) 2024-08-05 11:27:51 +05:30
Abhishek Radhakrishnan 31b43753fb
Add `druid.indexing.formats.stringMultiValueHandlingMode` system config (#16822)
This patch introduces an optional cluster configuration, druid.indexing.formats.stringMultiValueHandlingMode, allowing operators to override the default mode SORTED_SET for string dimensions. The possible values for the config are SORTED_SET, SORTED_ARRAY, or ARRAY (SORTED_SET is the default). Case insensitive values are allowed.
While this cluster property allows users to manage the multi-value handling mode for string dimension types, it's recommended to migrate to using real array types instead of MVDs.
 
This fixes a long-standing issue where compaction will honor the configured cluster wide property instead of rewriting it as the default SORTED_ARRAY always, even if the data was originally ingested with ARRAY or SORTED_SET.
2024-08-03 10:23:44 -07:00
Kashif Faraz 9dc2569f22
Track and emit segment loading rate for HttpLoadQueuePeon on Coordinator (#16691)
Design:
The loading rate is computed as a moving average of at least the last 10 GiB of successful segment loads.
To account for multiple loading threads on a server, we use the concept of a batch to track load times.
A batch is a set of segments added by the coordinator to the load queue of a server in one go.

Computation:
batchDurationMillis = t(load queue becomes empty) - t(first load request in batch is sent to server)
batchBytes = total bytes successfully loaded in batch
avg loading rate in batch (kbps) = (8 * batchBytes) / batchDurationMillis
overall avg loading rate (kbps) = (8 * sumOverWindow(batchBytes)) / sumOverWindow(batchDurationMillis)

Changes:
- Add `LoadingRateTracker` which computes a moving average load rate based on
the last few GBs of successful segment loads.
- Emit metric `segment/loading/rateKbps` from the Coordinator. In the future, we may
also consider emitting this metric from the historicals themselves.
- Add `expectedLoadTimeMillis` to response of API `/druid/coordinator/v1/loadQueue?simple`
2024-08-03 13:14:21 +05:30
Abhishek Radhakrishnan fe6772a101
Rename test builder `MSQTester.setExpectedSegment` (#16837)
* Rename setExpectedSegment to setExpectedSegments in MSQTestBase.

* Add expected segments for max num segments test cases.
2024-08-02 10:01:55 -07:00
zachjsh 9b731e8f0a
Kinesis Input Format for timestamp, and payload parsing (#16813)
* SQL syntax error should target USER persona

* * revert change to queryHandler and related tests, based on review comments

* * add test

* Introduce KinesisRecordEntity to support Kinesis headers in InputFormats

* * add kinesisInputFormat and Reader, and tests

* * bind KinesisInputFormat class to module

* * improve test coverage

* * remove references to kafka

* * resolve review comments

* * remove comment

* * fix grammer of comment

* * fix comment again

* * fix comment again

* * more review comments

* * add partitionKey

* * add check for same timestamp and partitionKey column name

* * fix intellij inspection
2024-08-02 08:48:44 -04:00
Akshat Jain 63ba5a4113
Fix issues with fetching task reports in SQL statements endpoint for middlemanager (#16832) 2024-08-01 23:37:15 -04:00
Vadim Ogievetsky 8c170f7d0e
Web console: use stages, counters, and warnings from the new detailed status API (#16809)
* stages and counters can be seen on the status reponse

* warnings are exposed also

* mark as msq when attached

* update snapshots

* download CSV/TSV null as empty cell
2024-08-01 02:30:30 -07:00
Akshat Jain bb4d6cc001
Add task report fields in response of SQL statements endpoint (#16808)
If the optional query parameter detail is supplied, then the response also includes the following:

 * A stages object that summarizes information about the different stages being used for query execution, such as stage number, phase, start time, duration, input and output information, processing methods, and partitioning.
* A counters object that provides details on the rows, bytes, and files processed at various stages for each worker across different channels, along with sort progress.
* A warnings object that provides details about any warnings.
2024-08-01 10:26:04 +05:30
Gian Merlino 01f6cfcbf5
MSQ worker: Support in-memory shuffles. (#16790)
* MSQ worker: Support in-memory shuffles.

This patch is a follow-up to #16168, adding worker-side support for
in-memory shuffles. Changes include:

1) Worker-side code now respects the same context parameter "maxConcurrentStages"
   that was added to the controller in #16168. The parameter remains undocumented
   for now, to give us a chance to more fully develop and test this functionality.

1) WorkerImpl is broken up into WorkerImpl, RunWorkOrder, and RunWorkOrderListener
   to improve readability.

2) WorkerImpl has a new StageOutputHolder + StageOutputReader concept, which
   abstract over memory-based or file-based stage results.

3) RunWorkOrder is updated to create in-memory stage output channels when
   instructed to.

4) ControllerResource is updated to add /doneReadingInput/, so the controller
   can tell when workers that sort, but do not gather statistics, are done reading
   their inputs.

5) WorkerMemoryParameters is updated to consider maxConcurrentStages.

Additionally, WorkerChatHandler is split into WorkerResource, so as to match
ControllerChatHandler and ControllerResource.

* Updates for static checks, test coverage.

* Fixes.

* Remove exception.

* Changes from review.

* Address static check.

* Changes from review.

* Improvements to docs and method names.

* Update comments, add test.

* Additional javadocs.

* Fix throws.

* Fix worker stopping in tests.

* Fix stuck test.
2024-07-30 18:41:24 -07:00
Edgar Melendrez 3bb6d40285
[docs] batch 5 updating functions (#16812)
* batch 5

* Update docs/querying/sql-functions.md

* applying suggestions

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-30 17:30:01 -07:00
Edgar Melendrez 85a8a1d805
[Docs]Batch04 - Bitwise numeric functions (#16805)
* Batch04 - Bitwise numeric functions

* Batch04 - Bitwise numeric functions

* minor fixes

* rewording bitwise_shift functions

* rewording bitwise_shift functions

* Update docs/querying/sql-functions.md

* applying suggestions

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-30 10:53:59 -07:00
Kashif Faraz 954aaafe0c
Refactor: Clean up compaction config classes (#16810)
Changes:
- Rename `CoordinatorCompactionConfig` to `DruidCompactionConfig`
- Rename `CompactionConfigUpdateRequest` to `ClusterCompactionConfig`
- Refactor methods in `DruidCompactionConfig`
- Clean up `DataSourceCompactionConfigHistory` and its tests
- Clean up tests and add new tests
- Change API path `/druid/coordinator/v1/config/global` to `/druid/coordinator/v1/config/cluster`
2024-07-30 12:17:25 +05:30
AmatyaAvadhanula 92a40d8169
Add API to fetch conflicting task locks (#16799)
* Add API to fetch conflicting active locks
2024-07-30 11:40:48 +05:30
Vishesh Garg e9ea243d97
Enable compaction ITs on MSQ engine (#16778)
Follow-up to #16291, this commit enables a subset of existing native compaction ITs on the MSQ engine.

In the process, the following changes have been introduced in the MSQ compaction flow:
- Populate `metricsSpec` in `CompactionState` from `querySpec` in `MSQControllerTask` instead of `dataSchema`
- Add check for pre-rolled-up segments having `AggregatorFactory` with different input and output column names
- Fix passing missing cluster-by clause in scan queries
- Add annotation of `CompactionState` to tombstone segments
2024-07-30 09:34:46 +05:30
Zoltan Haindrich c7cde31a89
HAVING clauses may not contain window functions (#16742)
Rejects having clauses if they contain windowed expressions.
Also added a check to produce a more descriptive error if an OVER expression
reaches the filter translation layer.

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-29 04:11:36 -04:00
dependabot[bot] f5527dc3e7
Bump io.grpc:grpc-netty-shaded from 1.57.2 to 1.65.1 (#16731)
Bumps [io.grpc:grpc-netty-shaded](https://github.com/grpc/grpc-java) from 1.57.2 to 1.65.1.
- [Release notes](https://github.com/grpc/grpc-java/releases)
- [Commits](https://github.com/grpc/grpc-java/compare/v1.57.2...v1.65.1)

---
updated-dependencies:
- dependency-name: io.grpc:grpc-netty-shaded
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: asdf2014 <asdf2014@apache.org>
2024-07-29 14:51:39 +08:00
dependabot[bot] cbca0dc969
Bump jclouds.version from 2.5.0 to 2.6.0 (#16796)
Bumps `jclouds.version` from 2.5.0 to 2.6.0.

Updates `org.apache.jclouds:jclouds-core` from 2.5.0 to 2.6.0

Updates `org.apache.jclouds.api:openstack-swift` from 2.5.0 to 2.6.0

Updates `org.apache.jclouds.driver:jclouds-slf4j` from 2.5.0 to 2.6.0

Updates `org.apache.jclouds.api:openstack-keystone` from 2.5.0 to 2.6.0

Updates `org.apache.jclouds.api:rackspace-cloudfiles` from 2.5.0 to 2.6.0

Updates `org.apache.jclouds.provider:rackspace-cloudfiles-us` from 2.5.0 to 2.6.0

Updates `org.apache.jclouds.provider:rackspace-cloudfiles-uk` from 2.5.0 to 2.6.0

---
updated-dependencies:
- dependency-name: org.apache.jclouds:jclouds-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.jclouds.api:openstack-swift
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.jclouds.driver:jclouds-slf4j
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.jclouds.api:openstack-keystone
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.jclouds.api:rackspace-cloudfiles
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.jclouds.provider:rackspace-cloudfiles-us
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.jclouds.provider:rackspace-cloudfiles-uk
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: asdf2014 <asdf2014@apache.org>
2024-07-29 14:49:26 +08:00
Kashif Faraz caedeb66cd
Add API to update compaction engine (#16803)
Changes:
- Add API `/druid/coordinator/v1/config/compaction/global` to update cluster level compaction config
- Add class `CompactionConfigUpdateRequest`
- Fix bug in `CoordinatorCompactionConfig` which caused compaction engine to not be persisted.
Use json field name `engine` instead of `compactionEngine` because JSON field names must align
with the getter name.
- Update MSQ validation error messages
- Complete overhaul of `CoordinatorCompactionConfigResourceTest` to remove unnecessary mocking
and add more meaningful tests.
- Add `TuningConfigBuilder` to easily build tuning configs for tests.
- Add `DatasourceCompactionConfigBuilder`
2024-07-27 09:14:51 +05:30
Edgar Melendrez c07aeedbec
[docs] Updating Rollup tutorial (#16762)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-07-26 15:43:31 -07:00
Edgar Melendrez 028ee23a1e
[Docs] batch 03 - trig functions (#16795)
* batch 03 - trig functions

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* applying suggestions and corrections

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-07-26 13:11:17 -07:00
Charles Smith ed48cb82e9
[Docs} Remove avro_ocf support from Kafka & Kinesis streaming sources (Revert changes from #11865) (#16807) 2024-07-26 13:06:22 -07:00
Abhishek Radhakrishnan 3c493dc3ed
CircularList round-robin iterator for the KillUnusedSegments duty (#16719)
* Round-robin iterator for datasources to kill.

Currently there's a fairness problem in the KillUnusedSegments duty
where the duty consistently selects the same set of datasources as discovered
from the metadata store or dynamic config params. This is a problem especially
when there are multiple unused. In a medium to large cluster, while we can increase
the task slots to increase the likelihood of broader coverage. This patch adds a simple
round-robin iterator to select datasources and has the following properties:

1. Starts with an initial random cursor position in an ordered list of candidates.
2. Consecutive {@code next()} iterations from {@link #getIterator()} are guaranteed to be deterministic
   unless the set of candidates change when {@link #updateCandidates(Set)} is called.
3. Guarantees that no duplicate candidates are returned in two consecutive {@code next()} iterations.

* Renames in RoundRobinIteratorTest.

* Address review comments.

1. Clarify javadocs on the ordered list. Also flesh out the details a bit more.
2. Rename the test hooks to make intent clearer and fix typo.
3. Add NotThreadSafe annotation.
4. Remove one potentially noisy log that's in the path of iteration.

* Add null check to input candidates.

* More commentary.

* Addres review feedback: downgrade some new info logs to debug; invert condition.

Remove redundant comments.
Remove rendundant variable tracking.

* CircularList adjustments.

* Updates to CircularList and cleanup RoundRobinInterator.

* One more case and add more tests.

* Make advanceCursor private for now.

* Review comments.
2024-07-26 12:20:49 -07:00
Sree Charan Manamala 9b76d13ff8
Check for Aggregation inside a window clause when syntax used as - WINDOW W AS DEF (#16801) 2024-07-26 11:18:35 +02:00
Laksh Singla 725d442355
Faster dimension deserialization on the brokers (#16740)
Speedier dimension deserialization on the brokers.
2024-07-26 14:36:11 +05:30
Clint Wylie 71725b41b5
ignore dependencies for github stale action (#16797) 2024-07-25 10:32:43 -07:00
Gian Merlino b2a88da200
Attempt to coerce COMPLEX to number in numeric aggregators. (#16564)
* Coerce COMPLEX to number in numeric aggregators.

PR #15371 eliminated ObjectColumnSelector's built-in implementations of
numeric methods, which had been marked deprecated.

However, some complex types, like SpectatorHistogram, can be successfully coerced
to number. The documentation for spectator histograms encourages taking advantage of
this by aggregating complex columns with doubleSum and longSum. Currently, this
doesn't work properly for IncrementalIndex, where the behavior relied on those
deprecated ObjectColumnSelector methods.

This patch fixes the behavior by making two changes:

1) SimpleXYZAggregatorFactory (XYZ = type; base class for simple numeric aggregators;
   all of these extend NullableNumericAggregatorFactory) use getObject for STRING
   and COMPLEX. Previously, getObject was only used for STRING.

2) NullableNumericAggregatorFactory (base class for simple numeric aggregators)
   has a new protected method "useGetObject". This allows the base class to
   correctly check for null (using getObject or isNull).

The patch also adds a test for SpectatorHistogram + doubleSum + IncrementalIndex.

* Fix tests.

* Remove the special ColumnValueSelector.

* Add test.
2024-07-25 08:45:29 -07:00
Rohan Garg b5f117bca2
Check for tombstones in wrapping storage adapters (#16791) 2024-07-25 06:55:40 -04:00
Clint Wylie 14954c7eb9
serialize legacy as false for scan query for rolling downgrade/upgrade (#16793)
Fixes rolling downgrades/upgrades after #16659 by hard coding scan query "legacy":false since it is a required property during deserialization.
2024-07-25 14:51:58 +05:30
Gian Merlino c1875e7c1d
HashJoinEngine: Check for interruptions while walking left cursor. (#16773)
* HashJoinEngine: Check for interruptions while walking left cursor.

Previously, the engine only checked for interruptions between emitting
joined rows. In scenarios where large numbers of left rows are skipped
completely (such as a highly selective INNER JOIN) this led to the
join cursor being insufficiently responsive to cancellation.

* Coverage.
2024-07-25 15:10:50 +08:00
Clint Wylie 5da69a01cb
change arrayIngestMode default to array (#16789)
* change arrayIngestMode default to array

* remove arrayIngestMode flag option none

* fix space

* fix test
2024-07-25 15:09:40 +08:00
Zoltan Haindrich 7e3fab5bf9
Make WindowFrames more specific (#16741)
Changes the WindowFrame internals / representation a bit; introduces dedicated frametypes for rows and groups which corresponds to the implemented processing methods
2024-07-25 04:57:36 +02:00
Edgar Melendrez ca787885c9
[docs] batch02 of updating functions (#16761)
* applying changes

* ensuring batch is updated

* Update docs/querying/sql-functions.md

* raise -> raises

* addressing review

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-07-24 15:28:57 -07:00
John Gozde 6ff0cbfa54
Prune date-fns locales, bump sass TODO (#16792) 2024-07-24 10:50:53 -07:00
Akshat Jain a0437b6c93
MSQ window functions: Fix partition boundary issues for arrays (#16780)
* MSQ window functions: Fix partition boundary issues for arrays

* Address review comments

* Cache type strategies

* Trigger Build

* Convert typeStrategies from list to array
2024-07-24 18:47:04 +05:30
Clint Wylie 302739aa58
more aggressive cancellation of broker parallel merge, more chill blocking queue timeouts, and query cancellation participation (#16748)
* more aggressive cancellation of broker parallel merge, more chill blocking queue timeouts

* wire parallel merge into query cancellation system

* oops

* style

* adjust metrics initialization

* fix timeout, fix cleanup to not block

* javadocs to clarify why cancellation future and gizmo are split

* cancelled -> canceled, simplify QueuePusher since it always takes a ResultBatch, non-static terminal marker to make stuff stop complaining about types, specialize tryOffer to be tryOfferTerminal so it wont be misused, add comments to clarify reason for non-blocking offers that might fail
2024-07-24 14:58:34 +08:00
Vadim Ogievetsky 4f0b80bef5
Web console: change to use @fontsource/open-sans (#16786)
* change to use @fontsource/open-sans

* import locale directly

* update check license
2024-07-23 21:28:59 -07:00
Sree Charan Manamala 3f4d66c399
Check for Unsupported Aggregation with Distinct when useApproxCountDistinct is enabled (#16770)
* init

* add NativelySupportsDistinct

* refactor

* javadoc

* refactor

* fix tests

* fix drill tests

* comments

* Update sql/src/test/java/org/apache/druid/sql/calcite/DrillWindowQueryTest.java

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-24 11:13:22 +08:00
Sébastien aeb2ee59a2
Added an option to hide the workbench-view toolbar (#16785) 2024-07-23 15:36:54 -07:00
317brian 704962ec8e
doc: minor fixes to migration guides (#16784) 2024-07-23 13:09:51 -07:00
George Shiqi Wu a64e9a1746
Add annotation for pod template (#16772)
* Add annotation for pod template

* pr comments

* add test cases

* add tests
2024-07-23 07:25:15 -07:00
Laksh Singla 11bb40981e
Deduce type from the aggregators when materializing subquery results (#16703)
For aggregators like StringFirst/Last, whose intermediate type isn't the same as the final type, using them in GroupBy, TopN or Timeseries subqueries causes a fallback when maxSubqueryBytes is set. This is because we assume that the finalization is not known, due to which the row signature cannot determine whether to use the intermediate or the final type, and it puts it as null. This PR figures out the finalization from the query context and uses the intermediate or the final type appropriately.
2024-07-23 11:52:39 +05:30
Akshat Jain c45d4fdbca
MSQ window functions: Minor cleanup for empty over clause related flows + Exhaustive tests (#16754)
* MSQ window functions: Revamp logic to create separate window stages when empty over() clause is present

* Fix tests

* Revert changes of creating separate stages for empty over clause

* Address review comments
2024-07-23 11:37:34 +05:30
Gian Merlino 8b8ca0d7fc
DimFilterUtils: Exit filterShards early when filter is null. (#16774)
When the filter is null, there is no need to run the converter on
all the input objects.
2024-07-22 21:17:11 -07:00
Clint Wylie b645d09c5d
move long and double nested field serialization to later phase of serialization (#16769)
changes:
* moves value column serializer initialization, call to `writeValue` method to `GlobalDictionaryEncodedFieldColumnWriter.writeTo` instead of during `GlobalDictionaryEncodedFieldColumnWriter.addValue`. This shift means these numeric value columns are now done in the per field section that happens after serializing the nested column raw data, so only a single compression buffer and temp file will be needed at a time instead of the total number of nested literal fields present in the column. This should be especially helpful for complicated nested structures with thousands of columns as even those 64k compression buffers can add up pretty quickly to a sizeable chunk of direct memory.
2024-07-22 21:14:30 -07:00
Edgar Melendrez 934c10b1cd
docs: Adding admonition box to warn about MVD (#16712)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-22 17:32:23 -07:00
Clint Wylie 02b8738c00
remove batchProcessingMode from task config, remove AppenderatorImpl (#16765)
changes:
* removes `druid.indexer.task.batchProcessingMode` in favor of always using `CLOSED_SEGMENT_SINKS` which uses `BatchAppenderator`. This was intended to become the default for native batch, but that was missed so `CLOSED_SEGMENTS` was the default (using `AppenderatorImpl`), however MSQ has been exclusively using `BatchAppenderator` with no problems so it seems safe to just roll it out as the only option for batch ingestion everywhere.
* with `batchProcessingMode` gone, there is no use for `AppenderatorImpl` so it has been removed
* implify `Appenderator` construction since there are only separate stream and batch versions now
* simplify tests since `batchProcessingMode` is gone
2024-07-22 13:56:44 -07:00
Akshat Jain 6a2348b78b
Preemptive restriction for queries with approximate count distinct on complex columns of unsupported type (#16682)
This PR aims to check if the complex column being queried aligns with the supported types in the aggregator and aggregator factories, and throws a user-friendly error message if they don't.
2024-07-22 21:34:06 +05:30
Sree Charan Manamala 149d7c5207
Throw exceptions in SqlValidator when DISTINCT used over WINDOW (#16738)
* Throw exception if DISTINCT used with window functions aggregate call
* Improve error message when unsupported aggregations are used with window functions
2024-07-22 16:29:46 +02:00