Commit Graph

3285 Commits

Author SHA1 Message Date
Edgar Melendrez 721a65046f
docs: add examples for SQL functions (#16745)
* updating first batch of numeric functions

* First batch of functions

* addressing first few comments

* alphabetize list

* draft with suggestions applied

* minor discrepency expr -> <NUMERIC>

* changed raises to calculates

* Update docs/querying/sql-functions.md

* switch to underscore

* changed to exp(1) to match slack message

* adding html text for trademark symbol to .spelling

* fixed discrepancy between description and example

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-07-18 17:06:22 -07:00
Kashif Faraz 9f6ce6ddc0
Remove task action audit logging and druid_taskLog metadata table (#16309)
Description:
Task action audit logging was first deprecated and disabled by default in Druid 0.13, #6368.

As called out in the original discussion #5859, there are several drawbacks to persisting task action audit logs. 
- Only usage of the task audit logs is to serve the API `/indexer/v1/task/{taskId}/segments`
which returns the list of segments created by a task.
- The use case is really narrow and no prod clusters really use this information.
- There can be better ways of obtaining this information, such as the metric
`segment/added/bytes` which reports both the segment ID and task ID
when a segment is committed by a task. We could also include committed segment IDs in task reports.
- A task persisting several segments would bloat up the audit logs table putting unnecessary strain
on metadata storage.

Changes:
- Remove `TaskAuditLogConfig`
- Remove method `TaskAction.isAudited()`. No task action is audited anymore.
- Remove `SegmentInsertAction` as it is not used anymore. `SegmentTransactionalInsertAction`
is the new incarnation which has been in use for a while.
- Deprecate `MetadataStorageActionHandler.addLog()` and `getLogs()`. These are not used anymore
but need to be retained for backward compatibility of extensions.
- Do not create `druid_taskLog` metadata table anymore.
2024-07-17 17:09:00 +05:30
Vadim Ogievetsky 307b8849de
Web console: better sql data loader reset (#16696)
* better sql data loader reset

* snapshot

* fix destination pane sizing

* clean doc links

* update doc links

* more doc links

* extract getClusterCapacity

* update snapsohts

* allow submit suspended

* some renaming

* diff with current

* Do delta
2024-07-11 14:45:04 -07:00
YongGang 4b293fc2a9
Docs: Fix k8s dynamic config URL (#16720) 2024-07-11 10:05:47 +05:30
Lars Francke 586c713d12
Updates build documentation to not mention explicit Java version as it was out of sync with the dedicated Java page. (#16674)
This means there is one less place to keep information in sync.
2024-07-03 20:53:15 +05:30
317brian d65e015c94
docs: nit for link format (#16687) 2024-07-02 16:45:09 -07:00
Victoria Lim adde024e11
docs: Subtitle updates in migration guide overview (#16683) 2024-07-02 12:56:05 -07:00
Jill Osborne bd49ecfd29
Addition to subquery limit migration guide (#16671)
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-07-01 14:22:47 -07:00
Hugh Evans 920d9020c0
Docs: Fix default value for globalIngestionHeapLimitBytes (#16654)
Use the new default value added in #8255
2024-06-27 07:01:56 +05:30
Gian Merlino dbed1b0f50
Defer more expressions in vectorized groupBy. (#16338)
* Defer more expressions in vectorized groupBy.

This patch adds a way for columns to provide GroupByVectorColumnSelectors,
which controls how the groupBy engine operates on them. This mechanism is used
by ExpressionVirtualColumn to provide an ExpressionDeferredGroupByVectorColumnSelector
that uses the inputs of an expression as the grouping key. The actual expression
evaluation is deferred until the grouped ResultRow is created.

A new context parameter "deferExpressionDimensions" allows users to control when
this deferred selector is used. The default is "fixedWidthNonNumeric", which is a
behavioral change from the prior behavior. Users can get the prior behavior by setting
this to "singleString".

* Fix style.

* Add deferExpressionDimensions to SqlExpressionBenchmark.

* Fix style.

* Fix inspections.

* Add more testing.

* Use valueOrDefault.

* Compute exprKeyBytes a bit lighter-weight.
2024-06-26 17:28:36 -07:00
Andreas Maechler ab76d851ad
Update docs contribution with correct script (#16581)
* Spacing

* Fix ordering

* npm run start
2024-06-26 10:30:52 -07:00
Laksh Singla 71b3b5ab5d
Add query context parameter to remove null bytes when writing frames (#16579)
MSQ cannot process null bytes in string fields, and the current workaround is to remove them using the REPLACE function. 'removeNullBytes' context parameter has been added which sanitizes the input string fields by removing these null bytes.
2024-06-26 15:00:30 +05:30
Edgar Melendrez b43f4063c5
Docs: update link and title of quickstart (#16638)
* update link and title

* Discard changes to website/package.json

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-06-25 09:07:00 -07:00
Clint Wylie 37a50e6803
Remove index_realtime and index_realtime_appenderator tasks (#16602)
index_realtime tasks were removed from the documentation in #13107. Even
at that time, they weren't really documented per se— just mentioned. They
existed solely to support Tranquility, which is an obsolete ingestion
method that predates migration of Druid to ASF and is no longer being
maintained. Tranquility docs were also de-linked from the sidebars and
the other doc pages in #11134. Only a stub remains, so people with
links to the page can see that it's no longer recommended.

index_realtime_appenderator tasks existed in the code base, but were
never documented, nor as far as I am aware were they used for any purpose.

This patch removes both task types completely, as well as removes all
supporting code that was otherwise unused. It also updates the stub
doc for Tranquility to be firmer that it is not compatible. (Previously,
the stub doc said it wasn't recommended, and pointed out that it is
built against an ancient 0.9.2 version of Druid.)

ITUnionQueryTest has been migrated to the new integration tests framework and updated to use Kafka ingestion.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2024-06-24 20:13:33 -07:00
317brian 2131917f16
docs: added front-coded dictionaries to upgrade notes (#16647)
* docs: add front-coded dictionareis to upgrade notes

* add it to release notes template
2024-06-24 10:52:26 -07:00
Abhishek Radhakrishnan 7463589b07
Support for bootstrap segments (#16609)
* Initial support for bootstrap segments.

  - Adds a new API in the coordinator.
  - All processes that have storage locations configured (including tasks)
    talk to the coordinator if they can, and fetch bootstrap segments from it.
  - Then load the segments onto the segment cache as part of startup.
  - This addresses the segment bootstrapping logic required by processes before
    they can start serving queries or ingesting.

    This patch also lays the foundation to speed up upgrades.

* Fail open by default if there are any errors talking to the coordinator.

* Add test for failure scenario and cleanup logs.

* Cleanup and add debug log

* Assert the events so we know the list exactly.

* Revert RunRules test.

The rules aren't evaluated if there are no clusters.

* Revert RunRulesTest too.

* Remove debug info.

* Make the API POST and update log.

* Fix up UTs.

* Throw 503 from MetadataResource; clean up exception handling and DruidException.

* Remove unused logger, add verification of metrics and docs.

* Update error message

* Update server/src/main/java/org/apache/druid/server/coordination/SegmentLoadDropHandler.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Apply suggestions from code review

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Adjust test metric expectations with the rename.

* Add BootstrapSegmentResponse container in the response for future extensibility.

* Rename to BootstrapSegmentsInfo for internal consistency.

* Remove unused log.

* Use a member variable for broadcast segments instead of segmentAssigner.

* Minor cleanup

* Add test for loadable bootstrap segments and clarify comment.

* Review suggestions.

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-06-24 09:27:17 -07:00
Suneet Saldanha 4e0ea7823b
Update docs for K8s TaskRunner Dynamic Config (#16600)
* Update docs for K8s TaskRunner Dynamic Config

* touchups

* code review

* npe

* oopsies
2024-06-21 06:01:59 -07:00
Akshat Jain cd438b1918
Emit metrics for S3UploadThreadPool (#16616)
* Emit metrics for S3UploadThreadPool

* Address review comments

* Revert unnecessary formatting change

* Revert unnecessary formatting change in metrics.md file

* Address review comments

* Add metric for task duration

* Minor fix in metrics.md

* Add s3Key and uploadId in the log message

* Address review comments

* Create new instance of ServiceMetricEvent.Builder for thread safety

* Address review comments

* Address review comments
2024-06-21 11:36:47 +05:30
Andreas Maechler ae70e18bc8
docs: Update Azure extension (#16585)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-06-20 09:31:29 -07:00
Jill Osborne aec1d5ddd6
Link fix (#16596)
* Link fix

* Update docs/operations/auth.md

Co-authored-by: Andreas Maechler <amaechler@gmail.com>

---------

Co-authored-by: Andreas Maechler <amaechler@gmail.com>
2024-06-14 11:40:53 -07:00
317brian e1926e2549
docs: fix redirect (#16548)
* doc: cleanup unnecessary redirect

(cherry picked from commit d86aaadbc78cc51345f768ee66c9a8b2cbf13f27)

* restore redirect file entry. delete md file
2024-06-14 09:54:16 +08:00
Alberic Liu ea2de517b2
Update the youtube link for druid presentations page (#16601)
* Update the link to lambda architectures with Druid

* update the youtube link
2024-06-14 09:47:46 +08:00
Victoria Lim 836cdb48a5
docs: Migration guide for MVDs to arrays (#16516)
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-06-13 13:05:58 -07:00
George Shiqi Wu d5a25a94b8
Docs: Clarify that all supervisors can support early handoff (#16588) 2024-06-13 08:43:22 +05:30
YongGang 46dbc74053
Support Dynamic Peon Pod Template Selection in K8s extension (#16510)
* initial commit

* add Javadocs

* refine JSON input config

* more test and fix build

* extract existing behavior as default strategy

* change template mapping fallback

* add docs

* update doc

* fix doc

* address comments

* define Matcher interface

* fix test coverage

* use lower case for endpoint path

* update Json name

* add more tests

* refactoring Selector class
2024-06-12 15:27:10 -07:00
Andreas Maechler fec48432d4
docs: Correct some outdated module names (#16584)
* Fix module names

* Better spacing

* Some spacing

* Suggestions from code review

Thanks Abhishek.

* More links

* Roll-up time

* Remove logs

* More spelling
2024-06-11 14:17:40 -07:00
Andreas Maechler 24056b90b5
Bring back missing property in indexer documentation (#16582)
* Bring back druid.peon.taskActionClient.retry.minWait

* Update docs/configuration/index.md

* Consistent italics

Thanks Abhishek.

* Update docs/configuration/index.md

Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>

* Consistent list style

* Remove extra space

---------

Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
2024-06-10 16:52:54 -07:00
Kashif Faraz e4fdf1055b
Update default value of `druid.indexer.tasklock.batchAllocationWaitTime` to zero (#16578)
Update default value of druid.indexer.tasklock.batchAllocationWaitTime to 0.
Thus, a segment allocation request is processed immediately unless there are already some requests queued before this one. While in queue, a segment allocation request may get clubbed together with other similar requests into a batch to reduce load on the metadata store.
2024-06-10 20:07:23 +05:30
317brian 8e11adfc6f
docs: remove outdated druidversion var from a page (#16570)
Co-authored-by: asdf2014 <asdf2014@apache.org>
2024-06-10 15:30:36 +08:00
Gian Merlino b837ce565b
Simplify serialized form of JsonInputFormat. (#15691)
* Simplify serialized form of JsonInputFormat.

Use JsonInclude for keepNullColumns, assumeNewlineDelimited, and
useJsonNodeReader. Because the default value of keepNullColumns is
variable, we store the original configured value rather than the
derived value, and include if the original value is nonnull.

* Fix test.
2024-06-05 20:01:14 -07:00
Katya Macedo 7aecc09230
Docs: Remove circular link (#16553) 2024-06-05 11:07:36 -07:00
Charles Smith c100ae0ecc
Add a tutorial for LATEST_BY to get most recent data (#16515)
Co-authored-by: Will Xu <2bethere@gmail.com>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-06-04 17:00:25 -07:00
Jill Osborne 8b5802d4cd
docs: add maxSubqueryBytes limit to migration guide landing page (#16547)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-06-04 12:52:06 -07:00
Amit 540d3e6af5
Added new use cases and description of the use case - 5/14/24 (#16451)
Thanks for your contribution @amit-git-account

* Added new use cases and description of the use case - 5/14/24

The use case listing is not changed in a long time. While speaking with users, I came across several other use cases not listed here in the index. So I added new use cases and also added description against the use cases.

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* update spelling file

* Update docs/design/index.md

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-06-04 09:47:49 -07:00
Charles Smith 8f78c901e7
docs: add lookups to the sidebar (#16530)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-06-03 16:04:15 -07:00
Charles Smith b1568fb95b
docs: Adds a redirect for flatten-json which was removed (#16263) 2024-05-31 16:16:12 -07:00
Katya Macedo f70ef1f434
Update front coding text (#16491)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-05-31 15:13:10 -07:00
Katya Macedo 92e660dd21
Add Druid 30.0.0 upgrade notes (#16522) 2024-05-31 13:23:22 -07:00
Atul Mohan b53d75758f
IcebergInputSource : Add option to toggle case sensitivity while reading columns from iceberg catalog (#16496)
* Toggle case sensitivity while reading columns from iceberg

* Fix tests

* Drop case check and set unconditionally
2024-05-31 10:18:52 -07:00
George Shiqi Wu 0936798122
Add limit to task payload size (#16512)
* Add limit to task payload size

* Change to a warning

* Remove test

* Fix unit tests

* Optionally throw alert

* PR comments

* Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* PR comments

* Reject large payloads

* Update docs/configuration/index.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-05-31 09:17:36 -07:00
Jill Osborne 3c72ec8413
docs: Migration guide for subquery limit (#16519)
Adds a migration guide for Druid 30 to help users understand the new byte-based subquery limit property maxSubqueryBytes
2024-05-31 09:26:07 +05:30
Charles Smith 92e565e3b8
Adds a migration guide overview page to the release-info section (#16506)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Katya Macedo <katya.macedo@imply.io>
2024-05-30 09:50:30 -07:00
Adithya Chakilam a9044ac235
Add cgroup cpu/mem/disk usage metrics (#16472)
* Add cgroup cpu/mem usage metrics

* checks

* comments

* docs fix

* add disk metrics

* fapi check

* checkstyle

* issues

* spelling

* change asserts

* checks

* use proc builder instead of runtime

* specify charset

* spotbug
2024-05-29 12:44:37 -07:00
George Shiqi Wu b3b62ac431
Update azure input source docs (#16508)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-05-29 10:00:46 -07:00
Vadim Ogievetsky 10ea88e5bf
Web console: more robust durable storage setting detection (#16493)
* more robust durable storage setting

* add test
2024-05-22 15:47:20 -07:00
Vadim Ogievetsky a124c6cbbd
fix typo in extension name (#16466) 2024-05-20 09:47:22 +08:00
George Shiqi Wu ed9881df88
Cleanup logic from handoff API (#16457)
* Cleanup logic from handoff API

* Fix test

* Fix checkstyle

* Update docs
2024-05-16 08:42:44 -07:00
Gian Merlino 72432c2e78
Speed up SQL IN using SCALAR_IN_ARRAY. (#16388)
* Speed up SQL IN using SCALAR_IN_ARRAY.

Main changes:

1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
   the IN is above inFunctionThreshold. The default value of inFunctionThreshold
   is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.

2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
   expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
   value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
   it to Integer.MAX_VALUE.

3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
   greater than inFunctionThreshold.

* Revert test.

* Additional coverage.

* Update docs/querying/sql-query-context.md

Co-authored-by: Benedict Jin <asdf2014@apache.org>

* New test.

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-05-14 08:09:27 -07:00
George Shiqi Wu c1bf4fed90
API for stopping streaming tasks early (#16310)
* Try stopping task early

* Fix checkstyle

* Add unit test

* Add a couple more tests

* PR changes

* Use notice

* fix checkstyle

* PR changes

* Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Change payload

* Remove quotes

---------

Co-authored-by: Suneet Saldanha <suneet@apache.org>
2024-05-14 06:39:50 -07:00
Alberic Liu 811dcd1726
update protobuf.md (#16434) 2024-05-11 17:52:54 +08:00
Charles Smith 2d0b4e5f1e
Update sidebar to organize tutorials + other minor improvements (#16184)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-05-09 08:57:43 -07:00
Adarsh Sanjeev 269e035e76
Add validation for reindex with realtime sources (#16390)
Add validation for reindex with realtime sources.

With the addition of concurrent compaction, it is possible to ingest data while querying from realtime sources with MSQ into the same datasource. This could potentially lead to issues if the interval that is ingested into is replaced by an MSQ job, which has queried only some of the data from the realtime task.

This PR adds validation to check that the datasource being ingested into is not being queried from, if the query includes realtime sources.
2024-05-07 10:32:15 +05:30
Misha b5958b6b07
Feature configurable calcite bloat (#16248)
* Configurable bloat for calcite ProjectMergeRule implemented

* Comment added

* Default bloat value increased to 1000

* Implemented bloat configuration from QueryContext

* Code refactored, docs updated

---------

Co-authored-by: sviatahorau <mikhail.sviatahorau@deep.bi>
2024-05-06 20:43:39 +05:30
Alberic Liu 92fb0ff718
upgrade mysql:mysql-connector-java to 8.2.0 (#16024)
* upgrade mysql:mysql-connector-java to 8.2.0

* fix the check errors

* remove unused comment
2024-05-06 21:58:37 +08:00
Rishabh Singh c61c3785a0
Followup changes to 15817 (Segment schema publishing and polling) (#16368)
* Fix build

* Nit changes in KillUnreferencedSegmentSchema

* Replace reference to the abbreviation SMQ with Metadata Query, rename inTransit maps in schema cache

* nitpicks

* Remove reference to smq abbreviation from integration-tests

* Remove reference to smq abbreviation from integration-tests

* minor change

* Update index.md

* Add delimiter while computing schema fingerprint hash
2024-05-03 19:13:52 +05:30
Kashif Faraz 51104e8bb3
Docs: Remove references to Zk-based segment loading (#16360)
Follow up to #15705

Changes:
- Remove references to ZK-based segment loading in the docs
- Fix doc for existing config `druid.coordinator.loadqueuepeon.http.repeatDelay`
2024-05-01 08:06:00 +05:30
Abhishek Radhakrishnan 1d7595f3f7
Support for filters in the Druid Delta Lake connector (#16288)
* Delta Lake support for filters.

* Updates

* cleanup comments

* Docs

* Remmove Enclosed runner

* Rename

* Cleanup test

* Serde test for the Delta input source and fix jackson annotation.

* Updates and docs.

* Update error messages to be clearer

* Fixes

* Handle NumberFormatException to provide a nicer error message.

* Apply suggestions from code review

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Doc fixes based on feedback

* Yes -> yes in docs; reword slightly.

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

* Documentation, javadoc and more updates.

* Not with an or expression end-to-end test.

* Break up =, >, >=, <, <= into its own types instead of sub-classing.

---------

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
2024-04-29 11:31:36 -07:00
Adithya Chakilam f8015eb02a
Add config lagAggregate to LagBasedAutoScalerConfig (#16334)
Changes:
- Add new config `lagAggregate` to `LagBasedAutoScalerConfig`
- Add field `aggregateForScaling` to `LagStats`
- Use the new field/config to determine which aggregate to use to compute lag
- Remove method `Supervisor.computeLagForAutoScaler()`
2024-04-29 22:20:41 +05:30
Gian Merlino db82adcdfd
SCALAR_IN_ARRAY: Optimization and behavioral follow-ups. (#16311)
* Four changes to scalar_in_array as follow-ups to #16306:

1) Align behavior for `null` scalars to the behavior of the native `in` and `inType` filters: return `true` if the array itself contains null, else return `null`.

2) Rename the class to more closely match the function name.

3) Add a specialization for constant arrays, where we build a `HashSet`.

4) Use `castForEqualityComparison` to properly handle cross-type comparisons.
   Additional tests verify comparisons between LONG and DOUBLE are now
   handled properly.

* Fix spelling.

* Adjustments from review.
2024-04-26 16:01:17 -07:00
Atul Mohan 77333e56fa
Docs: Add missing kafka emitter config (#16332) 2024-04-25 10:37:14 +05:30
Katya Macedo ceb6646dec
Add supervisor actions (#16276)
* Add supervisor actions

* Update text

* Update text

* Update after review

* Update after review
2024-04-24 13:14:01 -07:00
Rishabh Singh e30790e013
Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building (#15817)
Issue: #14989

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.

This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
2024-04-24 22:22:53 +05:30
Charles Smith 65412f80ab
remove additional column marks (#16319) 2024-04-22 19:41:54 -07:00
Sree Charan Manamala ad5701e891
new SCALAR_IN_ARRAY function analogous to DRUID_IN (#16306)
* scalar_in function

* api doc

* refactor
2024-04-18 21:15:15 -07:00
Gian Merlino 4285a5e2c6
Update documentation for exceptions to subquery limit. (#16295)
The true exception for groupBy is somewhat more narrow than the docs
suggest.
2024-04-17 21:04:43 -07:00
Hardik Bajaj 0bf5e7745d
Add configurable parameters for statsd client (#16283)
Statsd client sometimes drops metrics when this queueSize of statsd client with max unprocessed messages is completely full. This causes some high cardinality metrics like per partition lag being droppped.
There are multiple parameters of statsdclient that can be initialized and can help increase the load/capacity of client to not to drop metrics more frequently.
Properties like queueSize, poolSize, processorWorkers and senderWorkers will now be configurable at runtime
2024-04-17 18:35:31 +05:30
Nikhil Rao a805c5612e
Adds Druid SQL query examples for the Stats aggregator Native Queries (#16277)
* Adds Druid SQL query examples for the Timeseries and GroupBy Native queries in the stats aggregator docs page

* Updates intervals in Native Query to remove excess Time part in timestamp

* Moves Druid SQL section above Native query because sql used more often by users

* removes old Druid SQL sections

* Adds TopN Druid SQL query using ORDER BY and LIMIT

* Adds table for Druid SQL variance and standard deviation functions

* Update docs/development/extensions-core/stats.md

Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>

---------

Co-authored-by: Karan Kumar <karankumar1100@gmail.com>
Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
2024-04-15 08:05:34 -07:00
Adarsh Sanjeev 3df00aef9d
Add manifest file for MSQ export (#15953)
Currently, export creates the files at the provided destination. The addition of the manifest file will provide a list of files created as part of the manifest. This will allow easier consumption of the data exported from Druid, especially for automated data pipelines
2024-04-15 11:37:31 +05:30
Abhishek Radhakrishnan 041d0bff5e
Set default `KillUnusedSegments` duty to coordinator's indexing period & `killTaskSlotRatio` to 0.1 (#16247)
The default value for druid.coordinator.kill.period (if unspecified) has changed from P1D to the value of druid.coordinator.period.indexingPeriod. Operators can choose to override druid.coordinator.kill.period and that will take precedence over the default behavior.
The default value for the coordinator dynamic config killTaskSlotRatio is updated from 1.0 to 0.1. This ensures that that kill tasks take up only 1 task slot right out-of-the-box instead of taking up all the task slots.

* Remove stale comment and inline canDutyRun()

* druid.coordinator.kill.period defaults to druid.coordinator.period.indexingPeriod if not set.

- Remove the default P1D value for druid.coordinator.kill.period. Instead default
  druid.coordinator.kill.period to whatever value druid.coordinator.period.indexingPeriod is set
  to if the former config isn't specified.
- If druid.coordinator.kill.period is set, the value will take precedence over
  druid.coordinator.period.indexingPeriod

* Update server/src/test/java/org/apache/druid/server/coordinator/DruidCoordinatorConfigTest.java

* Fix checkstyle error

* Clarify comment

* Update server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinatorConfig.java

* Put back canDutyRun()

* Default killTaskSlotsRatio to 0.1 instead of 1.0 (all slots)

* Fix typo DEFAULT_MAX_COMPACTION_TASK_SLOTS

* Remove unused test method.

* Update default value of killTaskSlotsRatio in docs and web-console default mock

* Move initDuty() after params and config setup.
2024-04-14 18:56:17 -07:00
Katya Macedo 7f06a53cb1
[Docs] Fix API placeholder formatting (#16240) 2024-04-12 09:19:13 -07:00
YongGang da9feb4430
Introduce TaskContextReport for reporting task context (#16041)
Changes:
- Add `TaskContextEnricher` interface to improve task management and monitoring
- Invoke `enrichContext` in `TaskQueue.add()` whenever a new task is submitted to the Overlord
- Add `TaskContextReport` to write out task context information in reports
2024-04-12 08:57:49 +05:30
Pranav fc2600b8e2
Adding jvmVersion dimension in JVM Monitor (#16262) 2024-04-11 15:44:56 -07:00
317brian df9e1bb97b
Docs: Fix typo in tutorial (#16254) 2024-04-10 08:59:52 +05:30
Katya Macedo cd69f145b7
docs: Add upgrade notes for Druid 29.0.1 (#16123) 2024-04-09 13:56:57 -07:00
Vishesh Garg 3d595cfab1
Add storeCompactionState flag support to msq (#15965)
Compaction in the native engine by default records the state of compaction for each segment in the lastCompactionState segment field. This PR adds support for doing the same in the MSQ engine, targeted for future cases such as REPLACE and compaction done via MSQ.

Note that this PR doesn't implicitly store the compaction state for MSQ replace tasks; it is stored with flag "storeCompactionState": true in the query context.
2024-04-09 16:47:47 +05:30
Vishesh Garg 9a4fb58543
Record column name for exceptions while writing frames in RowBasedFrameWriter (#16130)
Current Runtime Exceptions generated while writing frames only include the exception itself without including the name of the column they were encountered in. This patch introduces the further information in the error and makes it non-retryable.
2024-04-09 15:39:10 +05:30
Adarsh Sanjeev e2e0cb905c
Add reasoning for choosing shardSpec to the MSQ report (#16175)
This PR logs the segment type and reason chosen. It also adds it to the query report, to be displayed in the UI.

This PR adds a new section to the reports, segmentReport. This contains the segment type created, if the query is an ingestion, and null otherwise.
2024-04-09 11:32:02 +05:30
Parag Jain f55c9e58a8
add google as external storage for msq export (#16051)
Support for exporting msq results to gcs bucket. This is essentially copying the logic of s3 export for gs, originally done by @adarshsanjeev in this PR - #15689
2024-04-05 12:10:10 +05:30
Sergio Ferragut 64433eb2ff
Update kubernetes-overloard-extension extension name in docs (#16239) 2024-04-04 14:38:28 -07:00
Zoltan Haindrich 1df41db46d
Migrate to use docker compose v2 (#16232)
https://github.com/actions/runner-images/issues/9557
2024-04-03 12:32:55 +02:00
Charles Smith 1aa6808b9a
docs: add tutorial with examples of sql null handling (#16185)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-04-01 11:03:42 -07:00
Pranav 20de7fd95a
Geo spatial interfaces (#16029)
This PR creates an interface for ImmutableRTree and moved the existing implementation to new class which represent 32 bit implementation (stores coordinate as floats). This PR makes the ImmutableRTree extendable to create higher precision implementation as well (64 bit).
In all spatial bound filters, we accept float as input which might not be accurate in the case of high precision implementation of ImmutableRTree. This PR changed the bound filters to accepts the query bounds as double instead of float and it is backward compatible change as it compares double to existing float values in RTree. Previously it was comparing input float to RTree floats which can cause precision loss, now it is little better as it compares double to float which is still not 100% accurate.
There are no changes in the way that we query spatial dimension today except input bound parsing. There is little improvement in string filter predicate which now parse double strings instead of float and compares double to double which is 100% accurate but string predicate is only called when we dont have spatial index.
With allowing the interface to extend ImmutableRTree, we allow to create high precision (HP) implementation and defines new search strategies to perform HP search Iterable<ImmutableBitmap> search(ImmutableDoubleNode node, Bound bound);
With possible HP implementations, Radius bound filter can not really focus on accuracy, it is calculating Euclidean distance in comparing. As EARTH 🌍 is round and not flat, Euclidean distances are not accurate in geo system. This PR adds new param called 'radiusUnit' which allows you to specify units like meters, km, miles etc. It uses https://en.wikipedia.org/wiki/Haversine_formula to check if given geo point falls inside circle or not. Added a test that generates set of points inside and outside in RadiusBoundTest.
2024-04-01 14:58:03 +05:30
Adithya Chakilam 463010bb29
Populate segment stats for non-parallel compaction jobs (#16171)
* Populate segment stats for non-parallel compaction jobs

* fix

* add-tests

* comments

* update-test

* comments
2024-03-29 09:40:55 -04:00
Soumyava 524842a3bb
Window function on msq (#15470)
This PR aims to introduce Window functions on MSQ by doing the following:

    Introduce a Window querykit for handling window queries along with its factory and a processor for window queries
    If a window operator is present with a partition by clause, pushes the partition as a shuffle spec of the previous stage
    In presence of empty OVER() clause lets all operators loose on a single rac
    In presence of no empty OVER() clause, breaks down each window into individual stages
    Associated machinery to handle window functions in MSQ
    Introduced a separate hidden engine feature WINDOW_LEAF_OPERATOR which is set only for MSQ engine. In presence of this feature, the planner plans without the leaf operators by creating a window query over an inner scan query. In case of native this is set to false and the planner generates the leafOperators
    Guardrails around materialization
    Comprehensive UTs
2024-03-28 14:58:34 +05:30
Adithya Chakilam a65b2d4f41
Visibility into LagBased AutoScaler desired task count (#16199)
* Visibility into skipped scale notices

* comments

* change to emit always instead of just skips

* fix failing test

* comments

* Add couple more tests
2024-03-27 13:08:00 -04:00
Rushikesh Bankar 3d8b0ffae8
Add indexer level task metrics to provide more visibility in the task distribution (#15991)
Changes:

Add the following indexer level task metrics:
- `worker/task/running/count`
- `worker/task/assigned/count`
- `worker/task/completed/count`

These metrics will provide more visibility into the tasks distribution across indexers
(We often see a task skew issue across indexers and with this issue it would be easier
to catch the imbalance)
2024-03-21 11:08:01 +05:30
Abhishek Radhakrishnan fa8e511492
Add versions to `markUsed` and `markUnused` APIs (#16141)
* Mark used and unused APIs by versions.

* remove the conditional invocations.

* isValid() and test updates.

* isValid() and tests.

* Remove warning logs for invalid user requests. Also, downgrade visibility.

* Update resp message, etc.

* tests and some cleanup.

* Docs draft

* Clarify docs

* Update server/src/main/java/org/apache/druid/server/http/DataSourcesResource.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Review comments

* Remove default interface methods only used in tests and update docs.

* Clarify javadocs and @Nullable.

* Add more tests.

* Parameterized versions.

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-03-19 09:22:25 -07:00
Katya Macedo da6158c166
[Docs] Improve the "Update existing data" tutorial (#16081)
* Modify update data tutorial

* Update after review

* Add append topic

* Update after review

* Add upsert to spelling
2024-03-14 16:31:33 -07:00
Gian Merlino 256160aba6
MSQ: Validate that strings and string arrays are not mixed. (#15920)
* MSQ: Validate that strings and string arrays are not mixed.

When multi-value strings and string arrays coexist in the same column,
it causes problems with "classic MVD" style queries such as:

  select * from wikipedia -- fails at runtime
  select count(*) from wikipedia where flags = 'B' -- fails at planning time
  select flags, count(*) from wikipedia group by 1 -- fails at runtime

To avoid these problems, this patch adds type verification for INSERT
and REPLACE. It is targeted: the only type changes that are blocked are
string-to-array and array-to-string. There is also a way to exclude
certain columns from the type checks, if the user really knows what
they're doing.

* Fixes.

* Tests and docs and error messages.

* More docs.

* Adjustments.

* Adjust message.

* Fix tests.

* Fix test in DV mode.
2024-03-13 15:37:27 -07:00
317brian 03c191f701
docs: clarify description of uri/uriprefix (#16110)
* docs: clarify description of uri/uripath

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-03-13 11:52:01 -07:00
Abhishek Radhakrishnan fb7bb0953d
Kill segments by versions (#15994)
* Kill task version support.

Kill tasks by default kill all versions of unused segments in the specified
interval. Users wanting to delete specific versions (for example, data compliance
reasons) and keep rest of the versions can specify the optional version in the
kill task payload.

* Formatting changes.

* Multi version tests in RetrieveSegmentsActionsTest

Sort of like method-level parameterized tests.

* Address review feedback

* Accept a list of versions instead of a single version.

Support multiple versions.

* Tests for multiple versions.

* Update docs

* Cleanup

* Address review comments.

Retain the old interface method and make it default and route it to
the method with nullable versions variant. Update usages to use the
default method where versions doesn't matter.

* Remove versions from retreive used segments action.

* Some updates.

* Apply suggestions from code review

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* /s/actual/observed/g

* minor test cleanup

* WIP: Test fixes and updates. Also add test for kill by version with used load spec.

Checkpoint.

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-03-13 09:37:30 +05:30
Katya Macedo 6f6f86c325
Update `maxRowsInMemory` and `maxBytesInMemory` description (#16104) 2024-03-12 14:40:15 -07:00
George Shiqi Wu 94d2a28465
Add deep storage segment metric (#16072)
* Add new metric for deepStorage segments

* Add docs

* change metric name
2024-03-11 10:24:46 -04:00
Sensor 2d62b4f09b
docs refinement: json format (#16080)
* docs refinement: json format

* Update docs/api-reference/tasks-api.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-03-11 15:49:14 +08:00
Charles Smith 3caacba8c5
update window functions doc (#15902)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-03-07 15:16:52 -08:00
Jill Osborne 67ae0ff450
Update docs for rabbit community extension (#16069)
* Updated docs for rabbit community extension

* Updated after review
2024-03-07 11:29:53 -08:00
Adithya Chakilam 564c44ed85
Add stats segmentsRead and segmentsPublished to compaction task reports (#15947)
Changes:
- Add visibility into number of segments read/published by each parallel compaction
- Add new fields `segmentsRead`, `segmentsPublished` to `IngestionStatsAndErrorsTaskReportData`
- Update `ParallelIndexSupervisorTask` to populate the new stats
2024-03-07 09:37:23 +05:30
Charles Smith ebf3bdd909
restore information about truncated responses to sql api (#16001)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>
2024-03-06 14:03:58 -08:00
Sergio Ferragut d38703281c
updated description of rowsPerPage in export operations (#16048)
* updated description of rowsPerPage in export operations

* Update docs/multi-stage-query/reference.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-03-05 15:42:12 -08:00
Gian Merlino 566013a5f5
Docs: Fix spelling of 5 GB. (#16040)
The spellchecker does not consider "5GB" to be spelled correctly.
2024-03-04 22:37:38 -08:00
zachjsh 720f1e834a
Add support for AzureDNSZone enabled storage accounts used for deep storage (#16016)
* Add support for AzureDNSZone enabled storage accounts used for deep storage

Added a new config to AzureAccountConfig

`storageAccountEndpointSuffix`

which allows the user to specify a storage account endpoint suffix where the underlying
storage account is enabled for AzureDNSZone. The previous config `endpointSuffix`, did not allow
support for such accounts. The previous config has been deprecated in favor of this new config. Also
fixed an issue where `managedIdentityClientId` was not being set properly

* * address review comments

* * add back azure government link and docs
2024-03-04 16:13:28 -05:00
Gian Merlino 930655ff18
Move retries into DataSegmentPusher implementations. (#15938)
* Move retries into DataSegmentPusher implementations.

The individual implementations know better when they should and should
not retry. They can also generate better error messages.

The inspiration for this patch was a situation where EntityTooLarge was
generated by the S3DataSegmentPusher, and retried uselessly by the
retry harness in PartialSegmentMergeTask.

* Fix missing var.

* Adjust imports.

* Tests, comments, style.

* Remove unused import.
2024-03-04 10:36:21 -08:00
Katya Macedo ced8be3044
docs: Add upgrade notes for Druid 29.0.0 (#16022) 2024-03-04 08:58:52 -08:00
Sensor 4e9b758661
Support CPU resource configurable for Kubernates job under MoK Mode (#16008)
* support CPU resource configurable for Kubernates job

* update property doc

* fix test name

* refine doc format
2024-03-04 10:12:09 -05:00
Adithya Chakilam ec52f686c0
Fix compaction tasks reports getting overwritten (#15981)
* Fix compaction tasks reports geting overwrittened

* only skip for compactiont task

* address comments

* fix boolean

* move boolean flag to task rather than spec

* rename variable

* add docs, fix missing case

* Update docs/ingestion/tasks.md

* rename var

* add task report decode check in IT

* change assert
2024-03-04 10:10:17 -05:00
317brian b3015bd7ce
docs: mention acid-compliance for meta store (#16014)
* docs: add mermaid diagram support

* fix crash when parsing data in data loader that can not be parsed (#15983)

* update jetty to address CVE (#16000)

* Concurrent replace should work with supervisors using concurrent locks (#15995)

* Concurrent replace should work with supervisors using concurrent locks

* Ignore supervisors with useConcurrentLocks set to false

* Apply feedback

* Add pre-check for heavy debug logs (#15706)

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>

* Remove helm paths from CodeQL config (#16006)

* docs: mention acid-compliance for metadb

---------

Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
Co-authored-by: Jan Werner <105367074+janjwerner-confluent@users.noreply.github.com>
Co-authored-by: AmatyaAvadhanula <amatya.avadhanula@imply.io>
Co-authored-by: Sensor <fectrain@outlook.com>
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-03-04 11:00:38 +08:00
Zoltan Haindrich bf0995f846
Introduce dynamic table append (#15897) 2024-03-01 04:31:57 -05:00
317brian 3df161f73c
docs: update security doc for hashing (#15970)
* docs: add mermaid diagram support

* docs: update druid-basic-security doc to mention caching

* Update docs/development/extensions-core/druid-basic-security.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-02-28 09:48:37 +08:00
benkrug 0c601bf430
Update basic-cluster-tuning.md (#14964)
* Update basic-cluster-tuning.md

The sentence "When free system memory is greater than or equal to druid.segmentCache.locations, the more segment data the Historical can be held in the memory-mapped segment cache" didn't read well.  Updated to clarify it.

* Update docs/operations/basic-cluster-tuning.md

* Update docs/operations/basic-cluster-tuning.md

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-02-28 09:48:20 +08:00
AlbericByte e7d753d4b0
update the doc for dump-segment tool when using jdk11+ (#15971)
* update the doc for dump-segment tool when using jdk11+

* update the style

* fix spell check error
2024-02-28 09:40:10 +08:00
Abhishek Radhakrishnan beccc401e1
Segments created in the same batch have the same `created_date` entry & rename metric (#15977)
* All segments stored in the same batch have the same created_date entry.

In the absence of a group_id column, this metadata would allow us to easily
reason about and troubleshoot ingestion-related issues.

* Rename metric name and code references to eligibleUnusedSegments.

Address review comment from https://github.com/apache/druid/pull/15941#discussion_r1503631992
2024-02-27 17:28:43 +05:30
Karan Kumar 5bb5b41b18
Adding task pending time in MSQ reports (#15966)
Added a new field pendingMs in MSQ task reports. This helps in figuring out the exact run time of the MSQ worker tasks.
    Fixed data races.
2024-02-27 14:41:28 +05:30
Abhishek Radhakrishnan 38ecf980d0
Refactor and add tests and metric to KillUnusedSegments duty (auto-kill) (#15941)
* Kill duty and test improvements.

Initial commit with:
- Bug fixes - auto-kill can throw NPE when there are no datasources present and defaults mismatch.
- Add new stat for candidate segment intervals killed.
- Move a couple of debug logs to info logs for improved visibility (should only log once per kill period).
- Remove redundant checks for code readability.
- Updated tests from using mocks (also the mocks weren't using last updated timestamp) and
  add more test coverage for different config parameters.
- Add a couple of unit tests that are ignored for the eternity case to prove that
  the kill duty doesn't clean up segments with ALL grain or that end in DateTimes.MAX.
- Migrate Druid exception from user to operator persona.

* Address review comments.

* Remove unused methods.

* fix up format specifier and validate bad config tests.

* Consolidate the helpers a bit more and add another test.

* Update test names. Add javadoc placeholders for slightly involved tests.

* Add docs for metric kill/candidateUnusedSegments/count.

Also, rename to disambiguate.

* Comments.

* Apply logging suggestions from code review

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Review comments

- Clarify docs on eligibility.
- Add test for multiple segments in the same interval. Clarify comment.
- Remove log line from test.
- Remove lastUpdatedDate = now.plus(10) from test.

* minor cleanup.

* Clarify javadocs for getUnusedSegmentIntervals().

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-02-27 12:14:41 +05:30
Abhishek Radhakrishnan 67a6224d91
Fix up incorrect `PARTITIONED BY` error messages (#15961)
* Fix up typos, inaccuracies and clean up code related to PARTITIONED BY.

* Remove wrapper function and update tests to use DruidExceptionMatcher.

* Checkstyle and Intellij inspection fixes.
2024-02-26 14:17:53 -05:00
Benjamin Hopp ebb7190545
Docs: Change single-dim to hashed in example for index task (#15529) 2024-02-26 09:16:10 +05:30
Gian Merlino b69f89d9f8
Clarify where to set druid.monitoring.monitors. (#15729) 2024-02-23 18:49:37 +05:30
Adithya Chakilam 1f443d218c
Enable partition stats on streaming task completion report (#15930)
Changes:
- Add visibility into number of records processed by each streaming task per partition
- Add field `recordsProcessed` to `IngestionStatsAndErrorsTaskReportData`
- Populate number of records processed per partition in `SeekableStreamIndexTaskRunner`
2024-02-23 16:29:03 +05:30
Jamie 80942d5754
Feature: add support for ingesting from rabbitmq super streams (#14137)
* Add support for ingesting from Rabbit MQ Super Streams
2024-02-22 10:50:37 +05:30
George Shiqi Wu 59bb72a926
Fix parsing of env variables when properties have underscores (#15919)
* Fix parsing of env variables when properties have underscores

* Add documentation

* Use a % sign instead
2024-02-21 13:18:21 -05:00
317brian c98d54f3c4
docs: delete unused file that causes confusion (#15910) 2024-02-14 16:42:02 -08:00
Peter Marshall cae9cbd7d7
Update tasks.md (#15887)
Remove erroneous white space causing render issues on this page.
2024-02-13 05:20:09 -08:00
Clint Wylie dad8398a4d
start process of deprecating non-sql compatible legacy configurations (#15713)
Starting the process to officially deprecate non SQL compatible modes by updating docs to aggressively call out that Druids non SQL compliant modes are deprecated and will go away someday. There are no code or behavior changes at this PR.
2024-02-13 15:31:45 +05:30
Katya Macedo 0f29ece6a9
[Docs] Refactor streaming ingestion section (#15591)
Merging the work so far. @ektravel , @vogievetsky if there are additional improvements, let's track them & make another pr.



* Refactor streaming ingestion docs

* Update property definition

* Update after review

* Update known issues

* Move kinesis and kafka topics to ingestion, add redirects

* Saving changes

* Saving

* Add input format text

* Update after review

* Minor text edit

* Update example syntax

* Revert back to colon

* Fix merge conflicts

* Fix broken links

* Fix spelling error
2024-02-12 13:52:42 -08:00
Charles Smith 2a42b11660
remove legacy Jupyter tutorial files (#15834)
* remove legacy files

* redirection for the jupyter tutorial page

* remove tutorial from sidebar

* remove redirection
2024-02-12 13:45:47 -08:00
Gian Merlino 7fea34abdd
LOOKUP docs: clarify behavior of replaceMissingValueWith. (#15879)
Clarify behavior when expr is null.
2024-02-11 13:11:00 -08:00
zachjsh f9ee2c353b
Extend the PARTITION BY clause to accept string literals for the time partitioning (#15836)
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning
2024-02-09 11:45:38 -05:00
Tom 11a8624ef1
allow for kafka-emitter to have extra dimensions be set for each event it emits (#15845)
* allow for kafka-emitter to have extra dimensions be set for each event it emits

* fix checktsyle issue in kafkaemitterconfig

* make changes to fix docs, and cleanup copy paste error in #toString()

* undo formatting to markdown table

* add more branches so test passes

* fix checkstyle issue
2024-02-08 22:55:24 -08:00
Abhishek Radhakrishnan 1a5b57df84
Update `groupId` for delta-lake and iceberg extensions (#15843)
* Update the group id to org.apache.druid.extensions.contrib for contrib exts.

* Note iceberg and delta lake extensions in extensions.md

* properties and shell backticks

* Update groupId in distribution/pom.xml

* remove delta-lake from dist.

* Add note on downloading extension.
2024-02-07 23:54:06 -08:00
Adarsh Sanjeev 514b3b4d01
Add export capabilities to MSQ with SQL syntax (#15689)
* Add test

* Parser changes to support export statements

* Fix builds

* Address comments

* Add frame processor

* Address review comments

* Fix builds

* Update syntax

* Webconsole workaround

* Refactor

* Refactor

* Change export file path

* Update docs

* Remove webconsole changes

* Fix spelling mistake

* Parser changes, add tests

* Parser changes, resolve build warnings

* Fix failing test

* Fix failing test

* Fix IT tests

* Add tests

* Cleanup

* Fix unparse

* Fix forbidden API

* Update docs

* Update docs

* Address review comments

* Address review comments

* Fix tests

* Address review comments

* Fix insert unparse

* Add external write resource action

* Fix tests

* Add resource check to overlord resource

* Fix tests

* Add IT

* Update syntax

* Update tests

* Update permission

* Address review comments

* Address review comments

* Address review comments

* Add tests

* Add check for runtime parameter for bucket and path

* Add check for runtime parameter for bucket and path

* Add tests

* Update docs

* Fix NPE

* Update docs, remove deadcode

* Fix formatting
2024-02-07 22:08:50 +05:30
Vadim Ogievetsky f2b242b6e6
update console to core Druid changes (#15854) 2024-02-07 19:44:25 +05:30
Pramod Immaneni 59bca0951a
Parallelize storage of incremental segments (#13982)
During ingestion, incremental segments are created in memory for the different time chunks and persisted to disk when certain thresholds are reached (max number of rows, max memory, incremental persist period etc). In the case where there are a lot of dimension and metrics (1000+) it was observed that the creation/serialization of incremental segment file format for persistence and persisting the file took a while and it was blocking ingestion of new data. This affected the real-time ingestion. This serialization and persistence can be parallelized across the different time chunks. This update aims to do that.

The patch adds a simple configuration parameter to the ingestion tuning configuration to specify number of persistence threads. The default value is 1 if it not specified which makes it the same as it is today.
2024-02-07 10:43:05 +05:30
317brian 2dc71c7874
docs: fix rendering (#15835) 2024-02-06 07:18:43 -08:00
Gian Merlino 54b30646f3
Add sqlReverseLookupThreshold for ReverseLookupRule. (#15832)
If lots of keys map to the same value, reversing a LOOKUP call can slow
things down unacceptably. To protect against this, this patch introduces
a parameter sqlReverseLookupThreshold representing the maximum size of an
IN filter that will be created as part of lookup reversal.

If inSubQueryThreshold is set to a smaller value than
sqlReverseLookupThreshold, then inSubQueryThreshold will be used instead.
This allows users to use that single parameter to control IN sizes if they
wish.
2024-02-06 16:32:05 +05:30
Atul Mohan 2e46a98024
Add range filtering support for iceberg ingestion (#15782)
* Add range filtering support for iceberg ingestion

* Docs formatting

* Spelling
2024-02-01 23:32:30 -08:00
Aru Raghuwanshi 223f29d64c
Update input-sources.md for fixing the warehouse path example under S3 (#15823) 2024-02-01 23:32:05 -08:00
317brian 6d617c34d2
docs: revise concurrent append and replace (#15760)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-02-01 11:03:36 -08:00
Laksh Singla 7d65caf0c5
Update the docs for EARLIEST_BY/LATEST_BY aggregators with the newly added numeric capabilities (#15670) 2024-02-01 10:24:43 +05:30
Abhishek Radhakrishnan 9f95a691f7
Extension to read and ingest Delta Lake tables (#15755)
* something

* test commit

* compilation fix

* more compilation fixes (fixme placeholders)

* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake

Will need to sort out the dependencies later.

* checkpoint

* remove snapshot schema since we can get schema from the row

* iterator bug fix

* json json json

* sampler flow

* empty impls for read(InputStats) and sample()

* conversion?

* conversion, without timestamp

* Web console changes to show Delta Lake

* Asset bug fix and tile load

* Add missing pieces to input source info, etc.

* fix stuff

* Use a different delta lake asset

* Delta lake extension dependencies

* Cleanup

* Add InputSource, module init and helper code to process delta files.

* Test init

* Checkpoint changes

* Test resources and updates

* some fixes

* move to the correct package

* More tests

* Test cleanup

* TODOs

* Test updates

* requirements and javadocs

* Adjust dependencies

* Update readme

* Bump up version

* fixup typo in deps

* forbidden api and checkstyle checks

* Trim down dependencies

* new lines

* Fixup Intellij inspections.

* Add equals() and hashCode()

* chain splits, intellij inspections

* review comments and todo placeholder

* fix up some docs

* null table path and test dependencies. Fixup broken link.

* run prettify

* Different test; fixes

* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests

* yank the old test resource.

* add a couple of sad path tests

* Updates to readme based on latest.

* Version support

* Extract Delta DateTime converstions to DeltaTimeUtils class and add test

* More comprehensive split tests.

* Some test renames.

* Cleanup and update instructions.

* add pruneSchema() optimization for table scans.

* Oops, missed the parquet files.

* Update default table and rename schema constants.

* Test setup and misc changes.

* Add class loader logic as the context class loader is unaware about extension classes

* change some table client creation logic.

* Add hadoop-aws, hadoop-common and related exclusions.

* Remove org.apache.hadoop:hadoop-common

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Add entry to .spelling to fix docs static check

---------

Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-01-30 21:53:50 -08:00
Benjamin Hopp 6177f6efd7
Fixing formatting of Iceberg Catalog Object (#15748) 2024-01-30 20:17:38 -08:00
Abhishek Radhakrishnan dbdfae3011
Fix up typo </br /> -> <br /> and adjust interpolated exception msg in InvalidNullByteFault. (#15804) 2024-01-30 12:44:51 -08:00
317brian ba07965580
docs: clean up some rolling updates stuff (#15762) 2024-01-26 14:10:53 -08:00
George Shiqi Wu 3e512249e3
Azure multi read options (#15630)
* Include new dependencies

* Mostly implemented

* More azure fixes

* Tests passing

* Unit tests running

* Test running after removing storage exception

* Happy with coverage now

* Add more tests

* fix client factory

* cleanup from testing

* Remove old client

* update docs

* Exclude from spellcheck

* Add licenses

* Fix identity version

* Save work

* Add azure clients

* add licenses

* typos

* Add dependencies

* Exception is not thrown

* Fix intellij check

* Don't need to override

* specify length

* urldecode

* encode path

* Fix checks

* Revert urlencode changes

* Urlencode with azure library

* Update docs/development/extensions-core/azure.md

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* PR changes

* Update docs/development/extensions-core/azure.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Add config for multiple storage accounts

* Deprecate AzureTaskLogsConfig.maxRetries

* Clean up azure retry block

* logic update to reuse clients

* fix comments

* Create container conditionally

* Fix key auth

* save work

* Fix unit tests

* Revert old azure input type

* Separate input source

* save work

* Add support for app registrations

* Fix unit tests

* clean up spacing

* Add coverage

* fixes from testing

* cleanup some caching behavior

* Add docs

* Fix spelling issues

* fix more spelling errors'

* Fix intellij inspections

* add simple changes from pr

* save work on fixing bug

* Fix unit tests

* Add more testing

* Fix unit test

* Add tests

* Add annotation for azureStorage

* Fix up docs

* Add comment for list method

* Fix tests

* Remove uneeded toString

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/ingestion/input-sources.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* PR changes

* fix injection of StorageConnector

* Fix checkstyle

* clean up unit tests

* More pr fixes

---------

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-01-25 13:29:16 -05:00
Katya Macedo 867c636629
Document pivot and unpivot operators (#15669) 2024-01-25 09:53:39 -08:00
Abhishek Agarwal 0ab2781a7f
Disable eager initialization for non-query connection requests (#15751) 2024-01-25 14:38:50 +05:30
Hiroshi Fukada 3fe3a65344
New: Add DDSketch in extensions-contrib (#15049)
* New: Add DDSketch-Druid extension

- Based off of http://www.vldb.org/pvldb/vol12/p2195-masson.pdf and uses
 the corresponding https://github.com/DataDog/sketches-java library
- contains tests for post building and using aggregation/post
  aggregation.
- New aggregator: `ddSketch`
- New post aggregators: `quantileFromDDSketch` and
  `quantilesFromDDSketch`

* Fixing easy CodeQL warnings/errors

* Fixing docs, and dependencies

Also moved aggregator ids to AggregatorUtil and PostAggregatorIds

* Adding more Docs and better null/empty handling for aggregators

* Fixing docs, and pom version

* DDSketch documentation format and wording
2024-01-23 20:17:07 +05:30
Pranav 45b30dc07d
Revert "Change default inSubQueryThreshold (#15336)" (#15722)
A low value of inSubQueryThreshold can cause queries with IN filter to plan as joins more commonly. However, some of these join queries may not get planned as IN filter on data nodes and causes significant perf regression.
2024-01-22 11:34:39 +05:30
zachjsh 9d4e8053a4
Kinesis adaptive memory management (#15360)
### Description

Our Kinesis consumer works by using the [GetRecords API](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html) in some number of `fetchThreads`, each fetching some number of records (`recordsPerFetch`) and each inserting into a shared buffer that can hold a `recordBufferSize` number of records. The logic is described in our documentation at: https://druid.apache.org/docs/27.0.0/development/extensions-core/kinesis-ingestion/#determine-fetch-settings 

There is a problem with the logic that this pr fixes: the memory limits rely on a hard-coded “estimated record size” that is `10 KB` if `deaggregate: false` and `1 MB` if `deaggregate: true`. There have been cases where a supervisor had `deaggregate: true` set even though it wasn’t needed, leading to under-utilization of memory and poor ingestion performance.

Users don’t always know if their records are aggregated or not. Also, even if they could figure it out, it’s better to not have to. So we’d like to eliminate the `deaggregate` parameter, which means we need to do memory management more adaptively based on the actual record sizes.

We take advantage of the fact that GetRecords doesn’t return more than 10MB (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html ):

This pr: 

eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set)

eliminate `deaggregate`, always have it true

cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments

add `recordBufferSizeBytes` as a bytes-based limit rather than records-based limit for the shared queue. We do know the byte size of kinesis records by at this point. Default should be `100MB` or `10% of heap`, whichever is smaller.

add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes.

deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified

deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if maxRecordsPerPoll` is specified

Fixed issue that when the record buffer is full, the fetchRecords logic throws away the rest of the GetRecords result after `recordBufferOfferTimeout` and starts a new shard iterator. This seems excessively churny. Instead,  wait an unbounded amount of time for queue to stop being full. If the queue remains full, we’ll end up right back waiting for it after the restarted fetch.

There was also a call to `newQ::offer` without check in `filterBufferAndResetBackgroundFetch`, which seemed like it could cause data loss. Now checking return value here, and failing if false.

### Release Note

Kinesis ingestion memory tuning config has been greatly simplified, and a more adaptive approach is now taken for the configuration. Here is a summary of the changes made:

eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set)

eliminate `deaggregate`, always have it true

cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments

add `recordBufferSizeBytes` as a bytes-based limit rather than records-based limit for the shared queue. We do know the byte size of kinesis records by at this point. Default should be `100MB` or `10% of heap`, whichever is smaller.

add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes.

deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified

deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if maxRecordsPerPoll` is specified
2024-01-19 14:30:21 -05:00
Abhishek Radhakrishnan 38c1def95a
Kill tasks honor the buffer period of unused segments (#15710)
* Kill tasks should honor the buffer period of unused segments.

- The coordinator duty KillUnusedSegments determines an umbrella interval
 for each datasource to determine the kill interval. There can be multiple unused
segments in an umbrella interval with different used_status_last_updated timestamps.
For example, consider an unused segment that is 30 days old and one that is 1 hour old. Currently
the kill task after the 30-day mark would kill both the unused segments and not retain the 1-hour
old one.

- However, when a kill task is instantiated with this umbrella interval, it’d kill
all the unused segments regardless of the last updated timestamp. We need kill
tasks and RetrieveUnusedSegmentsAction to honor the bufferPeriod to avoid killing
unused segments in the kill interval prematurely.

* Clarify default behavior in docs.

* test comments

* fix canDutyRun()

* small updates.

* checkstyle

* forbidden api fix

* doc fix, unused import, codeql scan error, and cleanup logs.

* Address review comments

* Rename maxUsedFlagLastUpdatedTime to maxUsedStatusLastUpdatedTime

This is consistent with the column name `used_status_last_updated`.

* Apply suggestions from code review

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Make period Duration type

* Remove older variants of runKilLTask() in OverlordClient interface

* Test can now run without waiting for canDutyRun().

* Remove previous variants of retrieveUnusedSegments from internal metadata storage coordinator interface.

Removes the following interface methods in favor of a new method added:
- retrieveUnusedSegmentsForInterval(String, Interval)
- retrieveUnusedSegmentsForInterval(String, Interval, Integer)

* Chain stream operations

* cleanup

* Pass in the lastUpdatedTime to markUnused test function and remove sleep.

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2024-01-18 22:23:50 -08:00
Abhishek Radhakrishnan f51f0e07e2
Remove documentation for unused segments retrieval API (#15721)
* Undocument unused segments retrieval API.

* Mark API deprecated and unstable. Note that it'll be removed.

* Cleanup .spelling entries

* Remove the Unstable annotation
2024-01-18 18:25:48 -08:00
Ben Sykes e49a7bb3cd
Add SpectatorHistogram extension (#15340)
* Add SpectatorHistogram extension

* Clarify documentation
Cleanup comments

* Use ColumnValueSelector directly
so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram.
When queried as a Number, we're returning the count of entries in the histogram.

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Fix references

* Fix spelling

* Update docs/development/extensions-contrib/spectator-histogram.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-01-14 09:52:30 -08:00