druid

Commit Graph

Author	SHA1	Message	Date
Suraj Goel	7306d280cc	Migrate jaxb bind dependency to jakarta (#17370 ) - Migrated from javax.xml.bind 2.3.1 to jakarta.xml.bind 2.3.3. - Minor version is modified to avoid any breaking changes.	2024-10-26 21:24:17 -07:00
Gian Merlino	7e8671caa9	GroupByQueryConfig: Skip unnecessary toString. (#17396 ) Calling toString on newConfig is unnecessary, because it will be done automatically by the logger. This saves some effort under log levels higher than DEBUG.	2024-10-23 19:57:22 +05:30
Clint Wylie	1157ecdec3	abstract common base of SQL micro-benchmarks to reduce boilerplate and standardize parameters (#17383 ) changes: * adds `SqlBenchmarkDatasets` which contains commonly used benchmark data generator schemas * adds `SqlBaseBenchmark` which contains common benchmark segment generation methods for any benchmark using `SqlBenchmarkDatasets` * adds `SqlBaseQueryBenchmark` and `SqlBasePlanBenchmark` for benchmarks measuring queries and planning respectively * migrate all existing SQL jmh benchmarks to extend `SqlBaseQueryBenchmark`, quite dramatically reducing the boilerplate needed to create benchmarks, and allowing the use of multiple datasources within a benchmark file * adjustments to data generator stuff to allow passing in an ObjectMapper so that the same mapper can be used for both benchmark queries and segment generation, avoiding the need to register stuff with both mappers for benchmarks * adds `SqlProjectionsBenchmark` and `SqlComplexMetricsColumnsBenchmark` for measuring projections and measuring complex metric compression respectively	2024-10-22 19:37:17 -07:00
Laksh Singla	5b09329479	Fixes an issue with AppendableMemory that can cause MSQ jobs to fail (#17369 )	2024-10-18 09:05:53 +05:30
Akshat Jain	450fb0147b	Add GlueingPartitioningOperator + Corresponding changes in window function layer to consume it for MSQ (#17038 ) * GlueingPartitioningOperator: It continuously receives data, and outputs batches of partitioned RACs. It maintains a last-partitioning-boundary of the last-pushed-RAC, and attempts to glue it with the next RAC it receives, ensuring that partitions are handled correctly, even across multiple RACs. You can check GlueingPartitioningOperatorTest for some good examples of the "glueing" work. * PartitionSortOperator: It sorts rows inside partitioned RACs, on the sort columns. The input RACs it receives are expected to be "complete / separate" partitions of data.	2024-10-17 10:54:52 +05:30
Hardik Bajaj	32ce341a6c	Fix RejectExecutionHandler of Blocking Single Threaded executor (#17146 ) Throw RejectedExecutionException when submitting tasks to executor that has been shut down.	2024-10-15 22:02:34 +05:30
Clint Wylie	c2149d59a7	remove stale comment in QueryableIndexCursorHolder (#17333 )	2024-10-11 16:23:59 -07:00
Gian Merlino	b287b219a8	MSQ: Include stageId, workerNumber in processing thread names. (#17324 ) * MSQ: Include stageId, workerNumber in processing thread names. Helps identify which query was running in a thread dump. * s/dart/msq/	2024-10-11 08:37:15 -07:00
Shivam Garg	6898a5a359	Removed Microsecond from Extract function (#17247 )	2024-10-11 05:32:26 +02:00
Clint Wylie	a6236c3d15	add substituteCombiningFactory implementations for datasketches aggs (#17314 ) Follow up to #17214, adds implementations for substituteCombiningFactory so that more datasketches aggs can match projections, along with some projections tests for datasketches.	2024-10-10 16:14:06 +05:30
Gian Merlino	1d95ef34f0	Logger: Log context of DruidExceptions. (#17316 ) * Logger: Log context of DruidExceptions. There is often interesting and unique information available in the "context" of a DruidException. This information is additive to both the message and the cause, and was missed when we log. This patch adds the DruidException context to log messages whenever stack traces are enabled. * Only log nonempty contexts.	2024-10-10 01:44:50 -07:00
Gian Merlino	4fbb129027	Improve javadocs for SegmentDescriptor. (#17274 ) The javadoc for SegmentDescriptor discusses differences between it and SegmentId, but misses the most important difference: SegmentDescriptor can have a narrower interval than the segment being referenced.	2024-10-08 00:59:55 -07:00
Clint Wylie	ab0d6eb620	Fix string array grouping comparator (#17183 )	2024-10-08 09:47:28 +05:30
Karan Kumar	6a4352f466	When removeNullBytes is set, length calculations did not take into account null bytes. (#17232 ) * When replaceNullBytes is set, length calculations did not take into account null bytes.	2024-10-07 18:02:52 +05:30
Adarsh Sanjeev	c9201ad658	Minor refactors to processing module (#17136 ) Refactors a few things. - Adds SemanticUtils maps to columns. - Add some addAll functions to reduce duplication, and for future reuse. - Refactor VariantColumnAndIndexSupplier to only take a SmooshedFileMapper instead. - Refactor LongColumnSerializerV2 to have separate functions for serializing a value and null.	2024-10-07 13:18:35 +05:30
Clint Wylie	0bd13bcd51	Projections prototype (#17214 )	2024-10-05 04:38:57 -07:00
Clint Wylie	04fe56835d	add druid.expressions.allowVectorizeFallback and default to false (#17248 ) changes: adds ExpressionProcessing.allowVectorizeFallback() and ExpressionProcessingConfig.allowVectorizeFallback(), defaulting to false until few remaining bugs can be fixed (mostly complex types and some odd interactions with mixed types) add cannotVectorizeUnlessFallback functions to make it easy to toggle the default of this config, and easy to know what to delete when we remove it in the future	2024-10-05 12:42:42 +05:30
Gian Merlino	b9634a8613	SuperSorter: Don't set allDone if it's already set. (#17238 ) This fixes a race where, if there is no output at all, setAllDoneIfPossible could be called twice (once when the output partitions future resolves, and once when the batcher finishes). If the calls happen in that order, it would try to create nil output channels both times, resulting in a "Channel already set" error.	2024-10-04 06:41:16 +05:30
Gian Merlino	db7cc4634c	Dart: Smoother handling of stage early-exit. (#17228 ) Stages can be instructed to exit before they finish, especially when a downstream stage includes a "LIMIT". This patch has improvements related to early-exiting stages. Bug fix: - WorkerStageKernel: Don't allow fail() to set an exception if the stage is already in a terminal state (FINISHED or FAILED). If fail() is called while in a terminal state, log the exception, then throw it away. If it's a cancellation exception, don't even log it. This fixes a bug where a stage that exited early could transition to FINISHED and then to FAILED, causing the overall query to fail. Performance: - DartWorkerManager previously sent stopWorker commands to workers even when "interrupt" was false. Now it only sends those commands when "interrupt" is true. The method javadoc already claimed this is what the method did, but the implementation did not match the javadoc. This reduces the number of RPCs by 1 per worker per query. Quieter logging: - In ReadableByteChunksFrameChannel, skip logging exception from setError if the channel has been closed. Channels are closed when readers are done with them, so at that point, we wouldn't be interested in the errors. - In RunWorkOrder, skip calling notifyListener on failure of the main work, in the case when stop() has already been called. The stop() method will set its own error using CanceledFault. This enables callers to detect when a stage was canceled vs. failed for some other reason. - In WorkerStageKernel, skip logging cancellation errors in fail(). This is made possible by the previous change in RunWorkOrder.	2024-10-03 20:09:02 +05:30
Zoltan Haindrich	65277b17a9	Decoupled planning: add support for unnest (#17177 ) * adds support for `UNNEST` expressions * introduces `LogicalUnnestRule` to transform a `Correlate` doing UNNEST into a `LogicalUnnest` * `UnnestInputCleanupRule` could move the final unnested expr into the `LogicalUnnest` itself (usually its an `mv_to_array` expression) * enhanced source unwrapping to utilize `FilteredDataSource` if it looks right	2024-10-02 08:54:56 +02:00
Gian Merlino	878adff9aa	MSQ profile for Brokers and Historicals. (#17140 ) This patch adds a profile of MSQ named "Dart" that runs on Brokers and Historicals, and which is compatible with the standard SQL query API. For more high-level description, and notes on future work, refer to #17139. This patch contains the following changes, grouped into packages. Controller (org.apache.druid.msq.dart.controller): The controller runs on Brokers. Main classes are, - DartSqlResource, which serves /druid/v2/sql/dart/. - DartSqlEngine and DartQueryMaker, the entry points from SQL that actually run the MSQ controller code. - DartControllerContext, which configures the MSQ controller. - DartMessageRelays, which sets up relays (see "message relays" below) to read messages from workers' DartControllerClients. - DartTableInputSpecSlicer, which assigns work based on a TimelineServerView. Worker (org.apache.druid.msq.dart.worker) The worker runs on Historicals. Main classes are, - DartWorkerResource, which supplies the regular MSQ WorkerResource, plus Dart-specific APIs. - DartWorkerRunner, which runs MSQ worker code. - DartWorkerContext, which configures the MSQ worker. - DartProcessingBuffersProvider, which provides processing buffers from sliced-up merge buffers. - DartDataSegmentProvider, which provides segments from the Historical's local cache. Message relays (org.apache.druid.messages): To avoid the need for Historicals to contact Brokers during a query, which would create opportunities for queries to get stuck, all connections are opened from Broker to Historical. This is made possible by a message relay system, where the relay server (worker) has an outbox of messages. The relay client (controller) connects to the outbox and retrieves messages. Code for this system lives in the "server" package to keep it separate from the MSQ extension and make it easier to maintain. The worker-to-controller ControllerClient is implemented using message relays. Other changes: - Controller: Added the method "hasWorker". Used by the ControllerMessageListener to notify the appropriate controllers when a worker fails. - WorkerResource: No longer tries to respond more than once in the "httpGetChannelData" API. This comes up when a response due to resolved future is ready at about the same time as a timeout occurs. - MSQTaskQueryMaker: Refactor to separate out some useful functions for reuse in DartQueryMaker. - SqlEngine: Add "queryContext" to "resultTypeForSelect" and "resultTypeForInsert". This allows the DartSqlEngine to modify result format based on whether a "fullReport" context parameter is set. - LimitedOutputStream: New utility class. Used when in "fullReport" mode. - TimelineServerView: Add getDruidServerMetadata as a performance optimization. - CliHistorical: Add SegmentWrangler, so it can query inline data, lookups, etc. - ServiceLocation: Add "fromUri" method, relocating some code from ServiceClientImpl. - FixedServiceLocator: New locator for a fixed set of service locations. Useful for URI locations.	2024-10-01 14:38:55 -07:00
Shivam Garg	ab361747a8	Migrated commons-lang usages to commons-lang3 (#17156 )	2024-09-28 10:28:11 +02:00
Clint Wylie	f8a72b987a	read metadata in SimpleQueryableIndex if available to compute segment ordering (#17181 )	2024-09-27 19:39:03 -07:00
Clint Wylie	157fe1bc1f	fix a mistake in CursorGranularizer to check doneness after advance (#17175 ) Fixes a mistake introduced in #16533 which can result in CursorGranularizer incorrectly trying to get values from a selector after calling cursor.advance because of a missing check for cursor.isDone	2024-09-27 09:36:05 +05:30
Clint Wylie	d77637344d	log.warn anytime a column is relying on ArrayIngestMode.MVD (#17164 ) * log.warn anytime a column is relying on ArrayIngestMode.MVD	2024-09-26 13:44:37 +05:30
Cece Mei	a2b011cdcd	Incorporate `estimatedComputeCost` into all `BitmapColumnIndex` classes. (#17125 ) changes: * filter index processing is now automatically ordered based on estimated 'cost', which is approximated based on how many expected bitmap operations are required to construct the bitmap used for the 'offset' * cursorAutoArrangeFilters context flag now defaults to true, but can be set to false to disable cost based filter index sorting	2024-09-25 23:11:26 -07:00
Clint Wylie	1b5b61ef7f	add multi-value string object vector matcher and expression vector object selectors (#17162 )	2024-09-25 22:57:29 -07:00
Clint Wylie	0ec7eb3ae5	use CastToObjectVectorProcessor for cast to string (#17148 )	2024-09-24 15:21:33 -07:00
Abhishek Radhakrishnan	83299e9882	Miscellaneous cleanup in the supervisor API flow. (#17144 ) Extracting a few miscellaneous non-functional changes from the batch supervisor branch: - Replace anonymous inner classes with lambda expressions in the SQL supervisor manager layer - Add explicit @Nullable annotations in DynamicConfigProviderUtils to make IDE happy - Small variable renames (copy-paste error perhaps) and fix typos - Add table name for this exception message: Delete the supervisor from the table[%s] in the database... - Prefer CollectionUtils.isEmptyOrNull() over list == null \|\| list.size() > 0. We can change the Precondition checks to throwing DruidException separately for a batch of APIs at a time.	2024-09-24 13:06:23 -07:00
George Shiqi Wu	d1bfabbf4d	inter-Extension dependency support (#16973 ) * update docs for kafka lookup extension to specify correct extension ordering * fix first line * test with extension dependencies * save work on dependency management * working dependency graph * working pull * fix style * fix style * remove name * load extension dependencies recursively * generate depenencies on classloader creation * add check for circular dependencies * fix style * revert style changes * remove mutable class loader * clean up class heirarchy * extensions loader test working * add unit tests * pr comments * fix unit tests	2024-09-24 14:17:33 -04:00
Clint Wylie	77a362c555	various fixes and improvements to vectorization fallback (#17098 ) changes: * add `ApplyFunction` support to vectorization fallback, allowing many of the remaining expressions to be vectorized * add `CastToObjectVectorProcessor` so that vector engine can correctly cast any type * add support for array and complex vector constants * reduce number of cases which can block vectorization in expression planner to be unknown inputs (such as unknown multi-valuedness) * fix array constructor expression, apply map expression to make actual evaluated type match the output type inference * fix bug in array_contains where something like array_contains([null], 'hello') would return true if the array was a numeric array since the non-null string value would cast to a null numeric * fix isNull/isNotNull to correctly handle any type of input argument	2024-09-24 04:29:08 -07:00
Adithya Chakilam	8eaac2c051	cgroup monitors: Add mem/disk/cpu usage metrics for V2 (#16905 ) * cgroup monitors: Add mem/disk/cpu usage metrics for V2 * intellij inspection * docs and checks * fix-dos * add comments * comments	2024-09-23 20:32:01 -07:00
Akshat Jain	40414cfe78	MSQ window functions: Reject MVDs during window processing (#17036 ) * MSQ window functions: Reject MVDs during window processing * MSQ window functions: Reject MVDs during window processing * Remove parameterization from MSQWindowTest	2024-09-23 11:39:35 +05:30
Abhishek Radhakrishnan	635e418131	Support to parse numbers in text-based input formats (#17082 ) Text-based input formats like csv and tsv currently parse inputs only as strings, following the RFC4180Parser spec). To workaround this, the web-console and other tools need to further inspect the sample data returned to sample data returned by the Druid sampler API to parse them as numbers. This patch introduces a new optional config, tryParseNumbers, for the csv and tsv input formats. If enabled, any numbers present in the input will be parsed in the following manner -- long data type for integer types and double for floating-point numbers, and if parsing fails for whatever reason, the input is treated as a string. By default, this configuration is set to false, so numeric strings will be treated as strings.	2024-09-19 13:21:18 -07:00
Gian Merlino	3d45f9829c	Use the whole frame when writing rows. (#17094 ) * Use the whole frame when writing rows. This patch makes the following adjustments to enable writing larger single rows to frames: 1) RowBasedFrameWriter: Max out allocation size on the final doubling. i.e., if the final allocation "naturally" would be 1 MiB but the max frame size is 900 KiB, use 900 KiB rather than failing the 1 MiB allocation. 2) AppendableMemory: In reserveAdditional, release the last block if it is empty. This eliminates waste when a frame writer uses a successive-doubling approach to find the right allocation size. 3) ArenaMemoryAllocator: Reclaim memory from the last allocation when the last allocation is closed. Prior to these changes, a single row could be much smaller than the frame size and still fail to be added to the frame. * Style. * Fix test.	2024-09-19 00:42:03 -07:00
Sree Charan Manamala	b9a4c73e52	Window Functions : Improve performance by comparing Strings in frame bytes without converting them (#17091 )	2024-09-19 09:36:28 +02:00
Abhishek Agarwal	8d1e596740	PostJoinCursor should never advance without interruption (#17099 )	2024-09-19 09:10:59 +02:00
Pranav	d1bd6a8156	Update doc for allowedHeaders (#17045 ) Update doc for allowedHeaders and make allowedHeaders more restrictive	2024-09-19 08:37:39 +05:30
Gian Merlino	2d2882cdfe	Add test for exceptions in FutureUtils.transformAsync. (#17106 ) Adds an additional test case to FutureUtilsTest.	2024-09-18 16:08:47 -07:00
Adarsh Sanjeev	2f50138af9	Modify DataSegmentProvider to also return DataSegment (#17021 ) Currently, TaskDataSegmentProvider fetches the DataSegment from the Coordinator while loading the segment, but just discards it later. This PR refactors this to also return the DataSegment so that it can be used by workers without a separate fetch.	2024-09-18 11:20:20 +05:30
Cece Mei	88c3c20ab6	Create a FilterBundle.Builder class and use it to construct FilterBundle. (#17055 )	2024-09-17 15:59:33 -07:00
Clint Wylie	a93546d493	add VirtualColumns.findEquivalent and VirtualColumn.EquivalenceKey (#17084 )	2024-09-17 13:17:44 -07:00
Gian Merlino	46cbb33428	FrameChannelMerger: Fix incorrect behavior of finished(). (#17088 ) Previously, the processor used "remainingChannels" to track the number of non-null entries of currentFrame. Now, "remainingChannels" tracks the number of channels that are unfinished. The difference is subtle. In the previous code, when an input channel was blocked upon exiting nextFrame(), the "currentFrames" entry would be null, and therefore the "remainingChannels" variable would be decremented. After the next await and call to populateCurrentFramesAndTournamentTree(), "remainingChannels" would be incremented if the channel had become unblocked after awaiting. This means that finished(), which returned true if remainingChannels was zero, would not be reliable if called between nextFrame() and the next await + populateCurrentFramesAndTournamentTree(). This patch changes things such that finished() is always reliable. This fixes a regression introduced in PR #16911, which added a call to finished() that was, at that time, unsafe.	2024-09-17 08:35:54 -07:00
Lasse Mammen	307b8e3357	feat: json_merge expression and sql function (#17081 )	2024-09-17 18:27:34 +05:30
Sree Charan Manamala	bb1c3c1749	Add serde for ColumnBasedRowsAndColumns to fix window queries without group by (#16658 ) Register a Ser-De for RowsAndColumns so that the window operator query running on leaf operators would be transferred properly on the wire. Would fix the empty response given by window queries without group by on the native engine.	2024-09-17 06:44:40 +02:00
Laksh Singla	bb487a4193	Support maxSubqueryBytes for window functions (#16800 ) Window queries now acknowledge maxSubqueryBytes.	2024-09-17 10:06:24 +05:30
Gian Merlino	5b7fb5fbca	Speed up FrameFileTest, SuperSorterTest. (#17068 ) * Speed up FrameFileTest, SuperSorterTest. These are two heavily parameterized tests that, together, account for about 60% of runtime in the test suite. FrameFileTest changes: 1) Cache frame files in a static, rather than building the frame file for each parameterization of the test. 2) Adjust TestArrayCursorFactory to cache the signature, rather than re-creating it on each call to getColumnCapabilities. SuperSorterTest changes: 1) Dramatically reduce the number of tests that run with "maxRowsPerFrame" = 1. These are particularly slow due to writing so many small files. Some still run, since it's useful to test edge cases, but much fewer than before. 2) Reduce the "maxActiveProcessors" axis of the test from [1, 2, 4] to [1, 3]. The aim is to reduce the number of cases while still getting good coverage of the feature. 3) Reduce the "maxChannelsPerProcessor" axis of the test from [2, 3, 8] to [2, 7]. The aim is to reduce the number of cases while still getting good coverage of the feature. 4) Use in-memory input channels rather than file channels. 5) Defer formatting of assertion failure messages until they are needed. 6) Cache the cursor factory and its signature in a static. 7) Cache sorted test rows (used for verification) in a static. * It helps to include the file. * Style.	2024-09-15 17:03:18 -07:00
Clint Wylie	73a644258d	abstract `IncrementalIndex` cursor stuff to prepare for using different "views" of the data based on the cursor build spec (#17064 ) * abstract `IncrementalIndex` cursor stuff to prepare to allow for possibility of using different "views" of the data based on the cursor build spec changes: * introduce `IncrementalIndexRowSelector` interface to capture how `IncrementalIndexCursor` and `IncrementalIndexColumnSelectorFactory` read data * `IncrementalIndex` implements `IncrementalIndexRowSelector` * move `FactsHolder` interface to separate file * other minor refactorings	2024-09-15 16:45:51 -07:00
Gian Merlino	4dc5942dab	BaseWorkerClientImpl: Don't attempt to recover from a closed channel. (#17052 ) * BaseWorkerClientImpl: Don't attempt to recover from a closed channel. This patch introduces an exception type "ChannelClosedForWritesException", which allows the BaseWorkerClientImpl to avoid retrying when the local channel has been closed. This can happen in cases of cancellation. * Add some test coverage. * wip * Add test coverage. * Style.	2024-09-15 02:10:58 -07:00
Gian Merlino	6fac267f17	MSQ: Improved worker cancellation. (#17046 ) * MSQ: Improved worker cancellation. Four changes: 1) FrameProcessorExecutor now requires that cancellationIds be registered with "registerCancellationId" prior to being used in "runFully" or "runAllFully". 2) FrameProcessorExecutor gains an "asExecutor" method, which allows that executor to be used as an executor for future callbacks in such a way that respects cancellationId. 3) RunWorkOrder gains a "stop" method, which cancels the current cancellationId and closes the current FrameContext. It blocks until both operations are complete. 4) Fixes a bug in RunAllFullyWidget where "processorManager.result()" was called outside "runAllFullyLock", which could cause it to be called out-of-order with "cleanup()" in case of cancellation or other error. Together, these changes help ensure cancellation does not have races. Once "cancel" is called for a given cancellationId, all existing processors and running callbacks are canceled and exit in an orderly manner. Future processors and callbacks with the same cancellationId are rejected before being executed. * Fix test. * Use execute, which doesn't return, to avoid errorprone complaints. * Fix some style stuff. * Further enhancements. * Fix style.	2024-09-15 01:22:28 -07:00

1 2 3 4 5 ...

3285 Commits