Commit Graph

3251 Commits

Author SHA1 Message Date
Katya Macedo 490211f2b1
Docs - update streaming ingestion terminology for Kafka and Kinesis (#17003) 2024-09-17 09:49:24 -07:00
Lasse Mammen 307b8e3357
feat: json_merge expression and sql function (#17081) 2024-09-17 18:27:34 +05:30
Victoria Lim 2e2f3cf66a
docs: Refresh docs for SQL input source (#17031)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-09-16 15:52:37 -07:00
Adithya Chakilam 6ef8d5d8e1
OshiSysMonitor: Add ability to skip emitting metrics (#16972)
* OshiSysMonitor: Add ability to skip emitting metrics

* comments

* static checks

* remove oshi
2024-09-12 11:32:31 -04:00
George Shiqi Wu 428f58cf15
Support maxColumnsToMerge in supervisor tuningConfig (#17030)
* support maxColumnsToMerge in supervisor specs

* remove log line

* fix style

* add docs

* fix unit tests
2024-09-11 18:00:13 -04:00
aho135 2427972c10
Implement segment range threshold for automatic query prioritization (#17009)
Implements threshold based automatic query prioritization using the time period of the actual segments scanned. This differs from the current implementation of durationThreshold which uses the duration in the user supplied query. There are some usability constraints with using durationThreshold from the user supplied query, especially when using SQL. For example, if a client does not explicitly specify both start and end timestamps then the duration is extremely large and will always exceed the configured durationThreshold. This is one example interval from a query that specifies no end timestamp:
"interval":["2024-08-30T08:05:41.944Z/146140482-04-24T15:36:27.903Z"]. This interval is generated from a query like SELECT * FROM table WHERE __time > CURRENT_TIMESTAMP - INTERVAL '15' HOUR. Using the time period of the actual segments scanned allows proper prioritization without explicitly having to specify start and end timestamps. This PR adds onto #9493
2024-09-10 15:01:52 +05:30
Abhishek Radhakrishnan aa833a711c
Support for reading Delta Lake table snapshots (#17004)
Problem
Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation.

Description
Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid.

In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently.
2024-09-09 14:12:48 +05:30
Edgar Melendrez 48a758ee08
[docs] reverting changes for sql-functions.md (#17019) 2024-09-06 16:07:32 -07:00
Katya Macedo 94b0705109
Docs - Update the architecture diagram (#17007) 2024-09-06 12:21:27 -07:00
Edgar Melendrez 2d9e92ce78
[docs] Batch11 date and time functions (#16926)
* first draft of functions

* minor improvments

* Update docs/querying/sql-functions.md

* Update docs/querying/sql-scalar.md

* Apply suggestions from code review

Accepted as is

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* applying next round of suggestions

* fixing missing column name

* addressing floor and ceil functions

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* re-wording TIMESTAMPADD

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-09-06 12:20:47 -07:00
Edgar Melendrez ed811262e3
[docs] Batch13 IP functions (#16947)
* new datasource

* reviewing before pr

* Update docs/querying/sql-functions.md

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Applying suggestions to IPV4_PARSE

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-09-06 12:19:36 -07:00
Virushade 476b205efa
Docs: Fix language in Schema Design docs (#17010) 2024-09-06 08:48:00 +05:30
Edgar Melendrez c49dc83b22
[docs] batch 12: reduction functions (#16930)
* [docs] batch 12: reduction functions

* Update docs/querying/sql-functions.md

* Update docs/querying/sql-functions.md

* Update docs/querying/sql-functions.md

* applying suggestions

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-09-05 17:02:45 -07:00
Jill Osborne b4d83a86c2
Middle Manager wording update in docs (#17005) 2024-09-05 10:25:30 -07:00
Hugh Evans 9162339fa8
Replace dsql instructions in example (#16977) 2024-09-04 12:45:58 -07:00
Katya Macedo 03c37b3143
Fix spelling (#17001) 2024-09-04 13:33:17 -04:00
Hardik Bajaj 2ef936be40
Update Documentation on meregeBuffer/pendingRequests for Real-time nodes (#16992)
#15025 adds mergeBuffer/pendingRequests metric in QueryCountStatsMonitor. Since real-time nodes also use the same merge buffers for queries and have QueryCountStatsMonitor , the documentation is being updated to include this metric.
2024-09-04 00:25:09 +05:30
Vadim Ogievetsky 358d06abc1
Web console: expose forceSegmentSortByTime (#16967)
* no force time

* time UI

* update menus

* tweaks

* dont use bp5

* nicer values

* update snapshots

* similar engine lables

* update snaps
2024-08-29 09:58:15 -07:00
317brian 1d292c5a59
docs: fix the case of the s3 examples for msq (#16969) 2024-08-28 11:13:15 -07:00
Charles Smith e562dd3ac6
Docs: note on iceberg (#16955)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-08-27 14:27:23 -07:00
Jill Osborne 3e031b9dc2
Add dynamic query params example (#16964)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-08-27 14:27:13 -07:00
Clint Wylie f8301a314f
generic block compressed complex columns (#16863)
changes:
* Adds new `CompressedComplexColumn`, `CompressedComplexColumnSerializer`, `CompressedComplexColumnSupplier` based on `CompressedVariableSizedBlobColumn` used by JSON columns
* Adds `IndexSpec.complexMetricCompression` which can be used to specify compression for the generic compressed complex column. Defaults to uncompressed because compressed columns are not backwards compatible.
* Adds new definition of `ComplexMetricSerde.getSerializer` which accepts an `IndexSpec` argument when creating a serializer. The old signature has been marked `@Deprecated` and has a default implementation that returns `null`, but it will be used by the default implementation of the new version if it is implemented to return a non-null value. The default implementation of the new method will use a `CompressedComplexColumnSerializer` if `IndexSpec.complexMetricCompression` is not null/none/uncompressed, or will use `LargeColumnSupportedComplexColumnSerializer` otherwise.
* Removed all duplicate generic implementations of `ComplexMetricSerde.getSerializer` and `ComplexMetricSerde.deserializeColumn` into default implementations `ComplexMetricSerde` instead of being copied all over the place. The default implementation of `deserializeColumn` will check if the first byte indicates that the new compression was used, otherwise will use the `GenericIndexed` based supplier.
* Complex columns with custom serializers/deserializers are unaffected and may continue doing whatever it is they do, either with specialized compression or whatever else, this new stuff is just to provide generic implementations built around `ObjectStrategy`.
* add ObjectStrategy.readRetainsBufferReference so CompressedComplexColumn only copies on read if required
* add copyValueOnRead flag down to CompressedBlockReader to avoid buffer duplicate if the value needs copied anyway
2024-08-27 00:34:41 -07:00
317brian 418da92228
docs: update query from deepstorage segment requirement (#16842)
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Rishabh Singh <6513075+findingrish@users.noreply.github.com>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-08-23 11:59:29 -07:00
Hugh Evans 60d4317968
Linked back to query granularity docs (#16883)
* Linked back to query granularity docs

* Update ingestion-spec.md

clairfy about query granularities in the spec.

* Update docs/design/storage.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/ingestion-spec.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/granularities.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-08-23 08:44:19 -07:00
Gian Merlino 0603d5153d
Segments sorted by non-time columns. (#16849)
* Segments primarily sorted by non-time columns.

Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.

However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.

For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".

For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.

In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.

The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)

Changes on the write path:

1) DimensionsSpec can now optionally contain a __time dimension, which
   controls the placement of __time in the sort order. If not present,
   __time is considered to be first in the sort order, as it has always
   been.

2) IncrementalIndex and IndexMerger are updated to sort facts more
   flexibly; not always by time first.

3) Metadata (stored in metadata.drd) gains a "sortOrder" field.

4) MSQ can generate range-based shard specs even when not all columns are
   singly-valued strings. It merely stops accepting new clustering key
   fields when it encounters the first one that isn't a singly-valued
   string. This is useful because it enables range shard specs on
   "someDim" to be created for clauses like "CLUSTERED BY someDim, __time".

Changes on the read path:

1) Add StorageAdapter#getSortOrder so query engines can tell how a
   segment is sorted.

2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
   and VectorCursorGranularizer to throw errors when using granularities
   on non-time-ordered segments.

3) Update ScanQueryEngine to throw an error when using the time-ordering
  "order" parameter on non-time-ordered segments.

4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
   running on a non-time-ordered segment.

5) Add "sqlUseGranularity" context parameter that causes the SQL planner
   to avoid using granularities other than ALL.

Other changes:

1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
   and change the meaning subtly: it now returns true if the DimensionsSpec
   represents an unchanging list of dimensions, or false if there is
   some discovery happening. This is what call sites had expected anyway.

* Fixups from CI.

* Fixes.

* Fix missing arg.

* Additional changes.

* Fix logic.

* Fixes.

* Fix test.

* Adjust test.

* Remove throws.

* Fix styles.

* Fix javadocs.

* Cleanup.

* Smoother handling of null ordering.

* Fix tests.

* Missed a spot on the merge.

* Fixups.

* Avoid needless Filters.and.

* Add timeBoundaryInspector to test.

* Fix tests.

* Fix FrameStorageAdapterTest.

* Fix various tests.

* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.

* Pom fix.

* Fix doc.
2024-08-23 08:24:43 -07:00
AmatyaAvadhanula 8c8a4b2302
Remove references to chatAsync (#16950)
Remove references to chatAsync from Rabbit stream supervisors
2024-08-23 13:21:07 +05:30
Edgar Melendrez c4981e34c4
[docs] Batch10 date and time functions (#16900)
* just starting

* TIME_PARSE and TIME_FORMAT remaining

* fixing typo

* adding last two functions

* review sql-functions.md

* Apply suggestions from code review

Suggestions that were accepted as is

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/sql-functions.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/sql-functions.md

needed to confirm that it did indeed return as a number

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* reviewing remaining suggestions

* addressing review for time_format

* Apply suggestions from code review

Accepted as is

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* addressing final suggestion

* time_zone -> timezone

* timezone fix

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-08-22 20:25:27 -07:00
Hugh Evans 2f6722db63
Updated auth to use variables with default values (#16876)
* Updated auth to use variables with default values

* Update docs/api-reference/sql-ingestion-api.md

* Remove Python auth entirely as its not being used

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2024-08-22 18:09:09 -07:00
Edgar Melendrez fda2d19b88
[Docs] Batch09: only `lookup` (#16878)
* [Docs] Batch09: only `lookup`

* slight changes

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* applying suggestiontions

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* otherwise null -> otherwise returns null

* updating definition in sql-scalar.md

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* hoping to re-run web checks

* change replaceMissingValueWith -> defaultValue

* Update docs/querying/sql-scalar.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* acronym_to_name -> airportcode_to_name

* shortens `airportcode_to_name` to `code_to_name`

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-08-22 11:11:16 -07:00
Gian Merlino a83125e4a0
Track IngestionState more accurately in realtime tasks. (#16934)
Previously, SeekableStreamIndexTaskRunner set ingestion state to
COMPLETED when it finished reading data from Kafka. This is incorrect.
After the changes in this patch, the transitions go:

1) The task stays in BUILD_SEGMENTS after it finishes reading from Kafka,
   while it is building its final set of segments to publish.

2) The task transitions to SEGMENT_AVAILABILITY_WAIT after publishing,
   while waiting for handoff.

3) The task transitions to COMPLETED immediately before exiting, when
   truly done.
2024-08-22 11:43:46 +05:30
Edgar Melendrez 725695342c
[Docs] Batch07: adding examples to string functions (#16862)
* Lower,Upper,Lpad,Rpad,Parse_long

* up to REGEXP_EXTRACT

* batch 07 ready for review

* updated definitions in scalar

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* rpad and lpad

* addressing comments

* minor fixes

* improving examples based on suggestions

* matched -> matches

* correcting typo

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-08-21 15:08:25 -07:00
benkrug 7b8573ed3d
Update index.md - remove the extra word "does" from one sentence. (#16922)
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-08-20 11:06:12 -07:00
Jakub Matyszewski 82d9ff9cc8
Add docs for log audit manager (#16927)
* Add docs for log audit manager

* Adjust descriptions
2024-08-20 15:58:31 +05:30
Kashif Faraz 2198001930
Remove unused cachingCost strategy runtime properties (#16918) 2024-08-19 10:15:03 +05:30
Edgar Melendrez c968e73171
[Docs] updating transformation during ingestion tutorial (#16845)
* first major revision of tutorial

* more edits

* re-ID the file to reflect new content + redirect

* renaming file

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* addressing suggestions

* adding column names

* Update docs/tutorials/tutorial-transform.md

* Update docs/tutorials/tutorial-transform.md

* Addressing suggestions

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* adding trademark logo and moving paragraph

* decided to shorten final paragraph

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-08-16 11:39:57 -07:00
Edgar Melendrez 5b94839d9d
[Docs] Batch08: adding examples to string functions (#16871)
* batch08 completed

* reviewing batch08

* apply corrections suggestions by @FrankChen021
2024-08-16 10:15:30 +08:00
Hugh Evans e91f680d50
Removed deprecated deep storage properties (#16904) 2024-08-15 11:54:34 -07:00
Hugh Evans 6cfdeb3894
Added a topic listing reserved keywords (#16843) 2024-08-15 10:25:09 -07:00
Hugh Evans 8c030feefc
Migration guide fixes (#16902)
* Fix typo in table header

* Fixed example NVL result
2024-08-15 09:26:34 -07:00
Rishabh Singh f67ff92d07
[bugfix] Run cold schema refresh thread periodically (#16873)
* Fix build

* Run coldSchemaExec thread periodically

* Bugfix: Run cold schema refresh periodically

* Rename metrics for deep storage only segment schema process
2024-08-13 11:44:01 +05:30
Abhishek Radhakrishnan d7dfbebf97
[Docs]: Fix typo and update broadcast rules section (#16882)
* Fix typo in waitUntilSegmentsLoad.

* Add a note on configuring druid.segmentCache.locations for broadcast rules.

* Update docs/operations/rule-configuration.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2024-08-12 13:55:33 -07:00
aaronm-bi ceed4a0634
Docs: Update list of ingestion types that support concurrent append and replace (#16852) 2024-08-08 08:06:22 +05:30
Atul Mohan 76ad17fb4c
Add config for http client connect timeout (#16831)
Adds a configuration clientConnectTimeout to our http client config which controls the connection timeout for our http client requests.

It was observed that on busy K8S clusters, the default connect timeout of 500ms is sometimes not enough time to complete syn/acks for a request and in these cases, the requests timeout with the error:
exceptionType=java.net.SocketTimeoutException, exceptionMessage=Connect Timeout
This behavior was mostly observed on the router while forwarding queries to the broker.
Having a slightly higher connect timeout helped resolve these issues.
2024-08-07 19:31:10 +05:30
Sree Charan Manamala 1f6d2c41d2
Update doc for dynamic parameters supporting array (#16660)
Update dynamic parameter docs to provide how it can used to replace an Array
2024-08-07 12:33:37 +05:30
Edgar Melendrez 83cf4dc554
[docs] fixes to sql-scalar.md (#16826)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2024-08-06 17:12:57 -07:00
zachjsh c324f09108
Kinesis input format docs (#16840)
* SQL syntax error should target USER persona

* * revert change to queryHandler and related tests, based on review comments

* * add test

* Docs for Kinesis input format

* * remove reference to kafka

* * fix spellcheck error

* Apply suggestions from code review

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

---------

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2024-08-06 18:53:10 -04:00
Edgar Melendrez ebea34a814
[Docs] Batch06: starting string functions (#16838)
* batch06, starting string functions

* addind space after Syntax

* quick change

* correcting spelling

* Update docs/querying/sql-functions.md

* Update sql-functions.md

* applying suggestions

* Update docs/querying/sql-functions.md

* Update docs/querying/sql-functions.md

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-08-06 11:32:26 -07:00
Kashif Faraz aa49be61ea
Do not create ZK paths if not needed (#16816)
Background:
ZK-based segment loading has been completely disabled in #15705 .
ZK `servedSegmentsPath` has been deprecated since Druid 0.7.1, #1182 .
This legacy path has been replaced by the `liveSegmentsPath` and is not used in the code anymore.

Changes:
- Never create ZK loadQueuePath as it is never used.
- Never create ZK servedSegmentsPath as it is never used.
- Do not create ZK liveSegmentsPath if announcement on ZK is disabled
- Fix up tests
2024-08-06 19:29:13 +05:30
Rushikesh Bankar c8323d1a7c
Add indexer task success and failure metrics (#16829)
This PR adds indexer-level task metrics-

"indexer/task/failed/count"
"indexer/task/success/count"

the current "worker/task/completed/count" metric shows all the tasks completed irrespective of success or failure status so these metrics would help us get more visibility into the status of the completed tasks
2024-08-05 16:21:27 +05:30
Laksh Singla 0411c4e67e
Add metrics for number of rows/bytes materialized while running subqueries (#16835)
subquery/rows and subquery/bytes metrics have been added, which indicate the size of the results materialized on the heap.
2024-08-05 14:13:20 +05:30