* Segments primarily sorted by non-time columns.
Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.
However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.
For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".
For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.
In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.
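To make the rule above concrete, here is a minimal, self-contained sketch (the class and helper names are invented for illustration; this is not the actual DimensionsSpec code) of how the effective segment sort order could be derived from the declared dimensions and the flag:
```java
import java.util.ArrayList;
import java.util.List;

public class SortOrderExample
{
  /**
   * Returns the effective segment sort order. When the explicit-sort flag is
   * false (the default), "__time" is implicitly prepended, matching the
   * historical behavior. When the flag is true, the declared dimensions are
   * used as-is and "__time" participates only where it is explicitly listed.
   */
  static List<String> effectiveSortOrder(List<String> declaredDimensions, boolean useExplicitSegmentSortOrder)
  {
    final List<String> sortOrder = new ArrayList<>();
    if (!useExplicitSegmentSortOrder && !declaredDimensions.contains("__time")) {
      sortOrder.add("__time");
    }
    sortOrder.addAll(declaredDimensions);
    return sortOrder;
  }

  public static void main(String[] args)
  {
    // Default behavior: __time leads the sort order.
    System.out.println(effectiveSortOrder(List.of("countryName", "cityName"), false));
    // Explicit behavior: __time sorts last because it is declared last.
    System.out.println(effectiveSortOrder(List.of("countryName", "cityName", "__time"), true));
  }
}
```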
The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)
Changes on the write path:
1) DimensionsSpec can now optionally contain a __time dimension, which
controls the placement of __time in the sort order. If not present,
__time is considered to be first in the sort order, as it has always
been.
2) IncrementalIndex and IndexMerger are updated to sort facts more
flexibly; not always by time first.
3) Metadata (stored in metadata.drd) gains a "sortOrder" field.
4) MSQ can generate range-based shard specs even when not all columns are
singly-valued strings. It merely stops accepting new clustering key
fields when it encounters the first one that isn't a singly-valued
string. This is useful because it enables range shard specs on
"someDim" to be created for clauses like "CLUSTERED BY someDim, __time".
Changes on the read path:
1) Add StorageAdapter#getSortOrder so query engines can tell how a
segment is sorted (see the sketch after this list).
2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
and VectorCursorGranularizer to throw errors when using granularities
on non-time-ordered segments.
3) Update ScanQueryEngine to throw an error when using the time-ordering
"order" parameter on non-time-ordered segments.
4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
running on a non-time-ordered segment.
5) Add "sqlUseGranularity" context parameter that causes the SQL planner
to avoid using granularities other than ALL.
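As referenced in item 1, here is a rough, self-contained sketch of how an engine could use the reported sort order to reject granular execution on non-time-ordered segments (the interface, method names, and error message below are simplified stand-ins, not the actual StorageAdapter API):
```java
import java.util.List;

public class SortOrderCheckExample
{
  /** Simplified stand-in for the information exposed by a segment's storage adapter. */
  interface SegmentSortInfo
  {
    List<String> getSortOrder();
  }

  /** True when the segment is sorted by __time first, i.e. time-ordering and granularity are usable. */
  static boolean isTimeOrdered(SegmentSortInfo segment)
  {
    final List<String> sortOrder = segment.getSortOrder();
    return !sortOrder.isEmpty() && "__time".equals(sortOrder.get(0));
  }

  /** Mirrors the new guard: granular (non-ALL) cursors require a time-ordered segment. */
  static void checkGranularityAllowed(SegmentSortInfo segment, boolean isAllGranularity)
  {
    if (!isAllGranularity && !isTimeOrdered(segment)) {
      throw new IllegalStateException("Cannot use a non-ALL granularity on a segment not sorted by __time");
    }
  }

  public static void main(String[] args)
  {
    final SegmentSortInfo nonTimeSorted = () -> List.of("countryName", "__time");
    checkGranularityAllowed(nonTimeSorted, true); // ALL granularity is always fine
    try {
      checkGranularityAllowed(nonTimeSorted, false);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // segment is not time-ordered
    }
  }
}
```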
Other changes:
1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
and change the meaning subtly: it now returns true if the DimensionsSpec
represents an unchanging list of dimensions, or false if there is
some discovery happening. This is what call sites had expected anyway.
* Fixups from CI.
* Fixes.
* Fix missing arg.
* Additional changes.
* Fix logic.
* Fixes.
* Fix test.
* Adjust test.
* Remove throws.
* Fix styles.
* Fix javadocs.
* Cleanup.
* Smoother handling of null ordering.
* Fix tests.
* Missed a spot on the merge.
* Fixups.
* Avoid needless Filters.and.
* Add timeBoundaryInspector to test.
* Fix tests.
* Fix FrameStorageAdapterTest.
* Fix various tests.
* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.
* Pom fix.
* Fix doc.
This PR generally improves the behavior of WriteOutBytes and WriteOutMedium. Analysis of TmpFileSegmentWriteOutMedium usage shows that it is periodically used for very small writes, where the overhead of creating a tmp file is very large. To improve performance in these cases, this PR modifies TmpFileSegmentWriteOutMedium to return a heap-based WriteOutBytes that falls back to creating a tmp file only when it actually fills up.
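A minimal sketch of the spill-on-overflow idea (this is not Druid's WriteOutBytes API; the class, fields, and limit below are invented for illustration): small writes stay in a heap buffer, and the tmp file is created only once the buffer limit is exceeded.
```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Buffers writes on heap and spills to a temp file only when the heap limit is exceeded. */
public class SpillableOutput extends OutputStream
{
  private final int heapLimit;
  private ByteArrayOutputStream heapBuffer = new ByteArrayOutputStream();
  private OutputStream fileOut; // created lazily on first spill

  public SpillableOutput(int heapLimit)
  {
    this.heapLimit = heapLimit;
  }

  @Override
  public void write(int b) throws IOException
  {
    if (fileOut == null && heapBuffer.size() + 1 > heapLimit) {
      spillToFile();
    }
    (fileOut != null ? fileOut : heapBuffer).write(b);
  }

  private void spillToFile() throws IOException
  {
    // Creating the tmp file is the expensive part, so it is deferred until actually needed.
    final File tmp = File.createTempFile("spillable", ".bin");
    tmp.deleteOnExit();
    fileOut = new FileOutputStream(tmp);
    heapBuffer.writeTo(fileOut);
    heapBuffer = null;
  }

  @Override
  public void close() throws IOException
  {
    if (fileOut != null) {
      fileOut.close();
    }
  }

  public static void main(String[] args) throws IOException
  {
    try (SpillableOutput out = new SpillableOutput(1024)) {
      out.write(new byte[]{1, 2, 3}); // stays entirely on heap; no tmp file is created
    }
  }
}
```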
---------
Co-authored-by: imply-cheddar <eric.tschetter@imply.io>
* just starting
* TIME_PARSE and TIME_FORMAT remaining
* fixing typo
* adding last two functions
* review sql-functions.md
* Apply suggestions from code review
Suggestions that were accepted as is
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Update docs/querying/sql-functions.md
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Update docs/querying/sql-functions.md
needed to confirm that it did indeed return as a number
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* reviewing remaining suggestions
* addressing review for time_format
* Apply suggestions from code review
Accepted as is
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* addressing final suggestion
* time_zone -> timezone
* timezone fix
---------
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Updated auth to use variables with default values
* Update docs/api-reference/sql-ingestion-api.md
* Remove Python auth entirely as it's not being used
---------
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Previously, SeekableStreamIndexTaskRunner set ingestion state to
COMPLETED when it finished reading data from Kafka. This is incorrect.
After the changes in this patch, the transitions go:
1) The task stays in BUILD_SEGMENTS after it finishes reading from Kafka,
while it is building its final set of segments to publish.
2) The task transitions to SEGMENT_AVAILABILITY_WAIT after publishing,
while waiting for handoff.
3) The task transitions to COMPLETED immediately before exiting, when
truly done.
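The corrected lifecycle can be summarized with a tiny illustrative sketch (the enum below lists only the states named above and is not the task runner's actual code):
```java
public class IngestionStateExample
{
  enum IngestionState { BUILD_SEGMENTS, SEGMENT_AVAILABILITY_WAIT, COMPLETED }

  public static void main(String[] args)
  {
    // 1) Still BUILD_SEGMENTS after reading from Kafka, while building the final segments to publish.
    IngestionState state = IngestionState.BUILD_SEGMENTS;
    // 2) After publishing, the task waits for handoff.
    state = IngestionState.SEGMENT_AVAILABILITY_WAIT;
    // 3) Only immediately before exiting is the task truly done.
    state = IngestionState.COMPLETED;
    System.out.println(state);
  }
}
```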
* Lower,Upper,Lpad,Rpad,Parse_long
* up to REGEXP_EXTRACT
* batch 07 ready for review
* updated definitions in scalar
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* rpad and lpad
* addressing comments
* minor fixes
* improving examples based on suggestions
* matched -> matches
* correcting typo
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Add type coercion and null check to left, right, repeat exprs.
These exprs shouldn't validate types; they should coerce types. Coercion
is typical behavior for functions because it enables schema evolution.
The functions are also modified to check isNumericNull on the right-hand
argument. This was missing previously, which would erroneously cause
nulls to be treated as zeroes.
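A hedged sketch of the intended evaluation order, using LEFT(string, length) as an example (the helper below is illustrative and not the actual Expr implementation): check for a null length argument before coercing it, so null stays null instead of behaving like zero.
```java
public class LeftExprExample
{
  /**
   * Behaves like LEFT(string, length): coerces the numeric length argument
   * instead of validating its exact type, and returns null when either
   * argument is null rather than treating a null length as zero.
   */
  static String leftOf(String s, Number lengthArg)
  {
    if (s == null || lengthArg == null) {
      return null; // the missing null check: null must not become 0
    }
    // Coerce rather than validate: longs and doubles are both accepted.
    final int length = lengthArg.intValue();
    return s.substring(0, Math.min(Math.max(length, 0), s.length()));
  }

  public static void main(String[] args)
  {
    System.out.println(leftOf("druid", 3L));   // "dru" (long coerced to int)
    System.out.println(leftOf("druid", 2.0));  // "dr"  (double coerced to int)
    System.out.println(leftOf("druid", null)); // null, not ""
  }
}
```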
* Fix tests.
The specific error on a truncated file can vary based on how the final
frame of the truncated file is written. This patch loosens the check so
it passes regardless of how the truncated file is written.
A follow-up PR for #16864. Just renames dimensionToSchemaMap to dimensionSchemas and always overrides ARRAY_INGEST_MODE context value to array for MSQ compaction.
Reduction of nullable DATE and TIMESTAMP expressions did not perform
a necessary null check, so would in some cases reduce to
1970-01-01 00:00:00 (epoch) rather than NULL.
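The fix amounts to guarding the reduction with a null check, roughly as in this illustrative sketch (the helper name and string output are invented; this is not the planner's real code):
```java
import java.time.Instant;

public class TimestampReductionExample
{
  /**
   * Sketch of reducing a constant TIMESTAMP expression to a literal.
   * The buggy path used a nullable millis value directly, so NULL silently
   * became 0 (the epoch); the fix checks for null first.
   */
  static String reduceToLiteral(Long constantMillis)
  {
    if (constantMillis == null) {
      return "NULL"; // the missing check: keep NULL as NULL
    }
    // Only a non-null constant is folded to a concrete timestamp literal.
    return "TIMESTAMP '" + Instant.ofEpochMilli(constantMillis) + "'";
  }

  public static void main(String[] args)
  {
    System.out.println(reduceToLiteral(null));           // NULL
    System.out.println(reduceToLiteral(0L));             // TIMESTAMP '1970-01-01T00:00:00Z'
    System.out.println(reduceToLiteral(1700000000000L)); // TIMESTAMP '2023-11-14T22:13:20Z'
  }
}
```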
changes:
* Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* added new interface `CursorHolder` and new interface `CursorHolderFactory` as a replacement for `CursorFactory`, with method `makeCursorHolder`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* `CursorFactory.makeCursors` previously returned a `Sequence<Cursor>` corresponding to the query granularity buckets, with a separate `Cursor` per bucket. `CursorHolder.asCursor` instead returns a single `Cursor` (equivalent to 'ALL' granularity), and a new `CursorGranularizer` has been added for query engines to iterate over the cursor and divide it into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its `VectorCursorGranularizer`), and simplifies a lot of code that reads segments, particularly when it does not care about bucketing the results into granularities.
* Deprecated `CursorFactory`, `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursor`
* updated all `StorageAdapter` implementations to implement `makeCursorHolder`, and transitioned direct `CursorFactory` implementations to instead implement `CursorHolderFactory`. `StorageAdapter` being a `CursorHolderFactory` is intended to be transitional and ideally will not be released; the goal is for `CursorHolderFactory` to be fetched directly from `Segment`, but this PR was already large enough, so that will be done in a follow-up.
* updated all query engines to use `makeCursorHolder`, granularity based engines to use `CursorGranularizer`.
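To illustrate the new flow described above, here is a self-contained sketch using simplified stand-in interfaces (the real `CursorBuildSpec`, `CursorHolder`, and `CursorHolderFactory` carry filters, intervals, virtual columns, ordering, and more): the engine builds a spec, obtains a single cursor from the holder, and iterates it itself. A real engine would additionally layer `CursorGranularizer` on top of this loop to bucket rows by query granularity.
```java
import java.util.List;

public class CursorHolderFlowExample
{
  // Simplified stand-ins for the interfaces described above.
  record CursorBuildSpec(List<String> columns) {}

  interface Cursor
  {
    boolean isDone();
    void advance();
  }

  interface CursorHolder
  {
    Cursor asCursor(); // a single cursor, equivalent to 'ALL' granularity
  }

  interface CursorHolderFactory
  {
    CursorHolder makeCursorHolder(CursorBuildSpec spec);
  }

  /** One cursor over the whole segment; this toy engine just counts rows. */
  static int countRows(CursorHolderFactory factory, CursorBuildSpec spec)
  {
    final Cursor cursor = factory.makeCursorHolder(spec).asCursor();
    int rows = 0;
    while (!cursor.isDone()) {
      rows++;
      cursor.advance();
    }
    return rows;
  }

  public static void main(String[] args)
  {
    // A toy factory whose cursor iterates over three in-memory rows.
    final CursorHolderFactory factory = spec -> () -> new Cursor()
    {
      private int row = 0;

      @Override
      public boolean isDone()
      {
        return row >= 3;
      }

      @Override
      public void advance()
      {
        row++;
      }
    };

    System.out.println(countRows(factory, new CursorBuildSpec(List.of("dim")))); // 3
  }
}
```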
Upgrades or downgrades involving any version up to and including Druid 30 can fail
when the newer version runs a worker task while the older version runs a controller
task. This patch removes that verification check until it is safe to add it back.
Handle the following Delta complex types:
a. StructType as JSON
b. ArrayType as Java list
c. MapType as Java map
Generate and add a new Delta table complex-types-table that contains the above complex types for testing.
Update the tests to include a parameterized test with complex-types-table, with the expectations defined in ComplexTypesDeltaTable.java.
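As a rough illustration of the mapping (assuming Jackson on the classpath; this is not the Delta connector's actual code), a struct value, represented here simply as a Map of field names to values, is serialized to a JSON string, while arrays and maps pass through as Java collections:
```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;

public class DeltaComplexTypeExample
{
  private static final ObjectMapper JSON = new ObjectMapper();

  /**
   * Illustrative mapping: struct -> JSON string, array -> Java list, map -> Java map.
   * Whether a value is a struct is decided by the caller in this sketch.
   */
  static Object toIngestedValue(Object deltaValue, boolean isStruct) throws Exception
  {
    if (isStruct) {
      return JSON.writeValueAsString(deltaValue); // StructType as JSON
    }
    return deltaValue; // ArrayType as Java list, MapType as Java map
  }

  public static void main(String[] args) throws Exception
  {
    System.out.println(toIngestedValue(Map.of("city", "Helsinki", "zip", 960), true));
    System.out.println(toIngestedValue(List.of(1, 2, 3), false));
    System.out.println(toIngestedValue(Map.of("a", 1), false));
  }
}
```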
* Fix build
* Run coldSchemaExec thread periodically
* Bugfix: Run cold schema refresh periodically
* Rename metrics for deep storage only segment schema process
* Fix typo in waitUntilSegmentsLoad.
* Add a note on configuring druid.segmentCache.locations for broadcast rules.
* Update docs/operations/rule-configuration.md
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
---------
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Use fuzzy matchers for compaction bytes asserts.
This still enables us to test that the bytes are zero and nonzero
when they're supposed to be, without having to get them exactly
right. The need to get bytes exactly right makes it difficult to
ensure ITs pass when making changes to default segment metadata.
* Additional fuzziness.
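The fuzzy check can be pictured with a small sketch (the helper and tolerance below are invented for illustration, not the actual matcher): zero must match exactly, while nonzero values only need to land within a tolerance of the expected size.
```java
public class FuzzyBytesAssertExample
{
  /**
   * Fuzzy check: zero must match exactly, while nonzero values only need to be
   * within a relative tolerance of the expected size, so small changes to
   * default segment metadata do not break the assertion.
   */
  static void assertBytesRoughlyEqual(long expected, long actual, double tolerance)
  {
    if (expected == 0) {
      if (actual != 0) {
        throw new AssertionError("expected zero bytes but got " + actual);
      }
      return;
    }
    if (Math.abs(actual - expected) > expected * tolerance) {
      throw new AssertionError("expected roughly " + expected + " bytes but got " + actual);
    }
  }

  public static void main(String[] args)
  {
    assertBytesRoughlyEqual(0, 0, 0.1);       // zero must stay exactly zero
    assertBytesRoughlyEqual(1000, 1042, 0.1); // within 10%, passes
  }
}
```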
Bug description:
Peons fail to start up when `WorkerTaskCountStatsMonitor` is used on MiddleManagers.
This is because MiddleManagers pass on their properties to peons and peons are unable to
find `IndexerTaskCountStatsProvider` as that is bound only for indexer nodes.
Fix:
Check whether the node is an indexer before trying to get an instance of `IndexerTaskCountStatsProvider`.
Fixes #13936
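A minimal sketch of the guard (the node roles and the lookup below are simplified stand-ins, not the actual Guice wiring): the indexer-only stats provider is looked up only when the process runs as an indexer, so peons that inherit the monitor configuration from a MiddleManager can still start.
```java
import java.util.Set;
import java.util.function.Supplier;

public class NodeRoleGuardExample
{
  enum NodeRole { INDEXER, MIDDLE_MANAGER, PEON }

  /** Skip the indexer-only binding lookup entirely on non-indexer nodes. */
  static Object statsProviderOrNull(Set<NodeRole> roles, Supplier<Object> injectorLookup)
  {
    if (!roles.contains(NodeRole.INDEXER)) {
      return null;
    }
    return injectorLookup.get();
  }

  public static void main(String[] args)
  {
    // A peon has no indexer-only binding; the lookup is skipped instead of failing at startup.
    System.out.println(statsProviderOrNull(Set.of(NodeRole.PEON), () -> {
      throw new IllegalStateException("no binding available on this node type");
    }));
  }
}
```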
In cases where a supervisor is idle and the overlord is restarted for some reason, the supervisor would
start spinning up tasks again. In clusters with many low-throughput streams, this would spike
the task count unnecessarily.
This commit compares the latest stream offsets with the ones in metadata during supervisor startup
and sets the supervisor to the idle state if they match.
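A minimal sketch of the startup comparison (the method and offset maps below are illustrative, not the supervisor's actual code): if the latest stream offsets equal the offsets already committed to metadata for every partition, there is nothing new to read and the supervisor can start idle instead of spinning up tasks.
```java
import java.util.Map;

public class IdleSupervisorStartupExample
{
  /** True when every partition's latest stream offset matches the committed metadata offset. */
  static boolean shouldStartIdle(Map<Integer, Long> latestStreamOffsets, Map<Integer, Long> committedOffsets)
  {
    return latestStreamOffsets.equals(committedOffsets);
  }

  public static void main(String[] args)
  {
    System.out.println(shouldStartIdle(Map.of(0, 100L, 1, 42L), Map.of(0, 100L, 1, 42L))); // true: go idle
    System.out.println(shouldStartIdle(Map.of(0, 105L, 1, 42L), Map.of(0, 100L, 1, 42L))); // false: new data
  }
}
```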
* SQL syntax error should target USER persona
* * revert change to queryHandler and related tests, based on review comments
* * add test
* Properly handle Druid schema blending with catalog definition and segment metadata
* * add javadocs