druid

Commit Graph

Author	SHA1	Message	Date
Akshat Jain	cd6083fb94	Add back `UnnecessaryFullyQualifiedName` rule in pmd ruleset (#17570 ) * Add back UnnecessaryFullyQualifiedName rule in pmd ruleset * Fix checkstyle	2024-12-17 12:43:12 +05:30
Akshat Jain	fed36844f1	Re-visit previously disabled spotbugs patterns and enable them (#17560 )	2024-12-13 15:24:40 +01:00
Akshat Jain	a26e4c0e06	Cleanup unreachable Java 8 code flows (#17559 )	2024-12-13 15:24:21 +01:00
Kashif Faraz	24e5d8a9e8	Refactor: Minor cleanup of segment allocation flow (#17524 ) Changes -------- - Simplify the arguments of IndexerMetadataStorageCoordinator.allocatePendingSegment - Remove field SegmentCreateRequest.upgradedFromSegmentId as it was always null - Miscellaneous cleanup	2024-12-13 07:46:57 +05:30
Zoltan Haindrich	1a38434d8d	Restore usage of filtered SUM (#17378 )	2024-12-12 10:30:42 +01:00
Clint Wylie	80d2cd3632	snapshot column capabilities for realtime cursors (#17386 ) * snapshot column capabilities for realtime cursors changes: * adds `CursorBuildSpec.getPhysicalColumns()` to allow specifying the set of required physical columns from a segment. if null, all columns are assumed to be required (e.g. full scan) * `IncrementalIndexCursorFactory`/`IncrementalIndexCursorHolder` uses the physical columns from the cursor build spec to know which set of dimensions to 'snapshot' the capabilities for, allowing expression selectors on realtime queries to no longer be required to treat selectors from `StringDimensionIndexer` as multi-valued unless they truly are multi-valued. this fixes several bugs with expressions on realtime queries that change a value from `StringDimensionIndexer` to some type other than string, which would often result in a single element array from the column being handled as multi-valued * `StringDimensionIndexer.setSparseIndexed()` now adds the default value to the dictionary when set * `StringDimensionIndexer` column value selectors now always report that they are dictionary encoded, and that name lookup is possible in advance on their selectors (since set sparse adds the null value so the cardinality is correct) * fixed a mistake that expression selectors for realtime queries with no null values could not use dictionary encoded selectors * hmm * test changes * cleanup * add test coverage * fix test * fixes * cleanup	2024-12-09 08:44:54 -08:00
Abhishek Radhakrishnan	3a2220c68d	Refactor: Move some classes from `sql` to `processing` & `server` for reusability (#17542 ) This PR contains non-functional / refactoring changes of the following classes in the sql module: 1. Move ExplainPlan and ExplainAttributes fromsql/src/main/java/org/apache/druid/sql/http to processing/src/main/java/org/apache/druid/query/explain 2. Move sql/src/main/java/org/apache/druid/sql/SqlTaskStatus.java -> processing/src/main/java/org/apache/druid/query/http/SqlTaskStatus.java 3. Add a new class processing/src/main/java/org/apache/druid/query/http/ClientSqlQuery.java that is effectively a thin POJO version of SqlQuery in the sql module but without any of the Calcite functionality and business logic. 4. Move BrokerClient, BrokerClientImpl and Broker classes from sql/src/main/java/org/apache/druid/sql/client to server/src/main/java/org/apache/druid/client/broker. 5. Remove BrokerServiceModule that provided the BrokerClient. The functionality is now contained in ServiceClientModule in the server package itself which provides all the clients as well. This is done so that we can reuse the said classes in #17353 without brining in Calcite and other dependencies to the Overlord.	2024-12-06 09:32:03 -08:00
Zoltan Haindrich	c1ef38b052	Minor fixes and enhancements in UnionQuery handling (#17483 ) * plan consistently with either UnionDataSource or UnionQuery for decoupled mode * expose errors * move decoupled related setting from PlannerConfig to QueryContexts	2024-11-28 10:05:12 +01:00
Akshat Jain	dd46c7722d	Remove pre-java-11 profile (#17511 ) We have removed support for Java 8 in #17466. This PR removes an unused profile pre-java-11 which activated for JDK < 11.	2024-11-26 08:43:20 +01:00
Clint Wylie	ede9e4077a	add support for aggregate only projections (#17484 )	2024-11-25 09:22:46 -08:00
Rishabh Singh	74422b58f5	Emit disk spill and merge buffer utilisation metrics for GroupBy queries (#17360 ) This change is to emit following metrics as part of GroupByStatsMonitor monitor, mergeBuffer/used -> Number of merge buffers used. mergeBuffer/acquisitionTimeNs -> Total time required to acquire merge buffer. mergeBuffer/acquisition -> Number of queries that acquired a batch of merge buffers. groupBy/spilledQueries -> Number of queries that spilled onto the disk. groupBy/spilledBytes-> Spilled bytes on the disk. groupBy/mergeDictionarySize -> Size of the merging dictionary.	2024-11-22 14:22:03 +05:30
Adarsh Sanjeev	df649c0bbd	Refactors (#17498 ) Follow-up PR to #17493 to address pending unaddressed comments.	2024-11-22 09:22:38 +05:30
Adarsh Sanjeev	2726c6f388	Minor refactors to processing Some refactors across druid to clean up the code and add utility functions where required.	2024-11-21 15:37:55 +05:30
Akshat Jain	17215cd677	Remove support for Java 8 (#17466 ) All JDK 8 based CI checks have been removed. Images used in Dockerfile(s) have been updated to Java 17 based images. Documentation has been updated accordingly.	2024-11-21 15:33:08 +05:30
Clint Wylie	24a1fafaa7	projection segment merge fixes (#17460 ) changes: * fix issue when merging projections from multiple-incremental persists which was hoping that some 'dim conversion' buffers were not closed, but they already were (by the merging iterator). fix involves selectively persisting these conversion buffers to temp files in the segment write out directory and mapping them and tying them to the segment level closer so that they are available after the lifetime of the parent merger * modify auto column serializers to use segment write out directory for temp files instead of java.io.tmpdir * fix queryable index projection to not put the time-like column as a dimension, instead only adding it as __time * use smoosh for temp files so can safely write any Serializer to a temp smoosh	2024-11-15 16:46:04 -08:00
Zoltan Haindrich	f296102f05	ScanQuery should not ignore columnTypes in equals/hashCode (#17463 ) * ScanQuery: equals/hashCode/toString * DruidQuery: changes of Align ScanQuery column order with its desired signature #17457 * ScanQueryTest: add equalsverifer test	2024-11-12 14:26:59 +05:30
Virushade	8278a1f7df	Fix Javadocs in ColumnCapablities.java (#17462 )	2024-11-12 11:30:33 +05:30
Vivek Dhiman	0dcc2bc469	Fixed NPE in `array_overlap` and `array_contains`. (#17465 )	2024-11-08 20:39:14 -08:00
Gian Merlino	9c25226e06	QueryableIndexSegment: Re-use time boundary inspector. (#17397 ) This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in #16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.	2024-11-06 09:27:59 -08:00
Zoltan Haindrich	2eac8318f8	Support Union in Decoupled planning (#17354 ) * introduces `UnionQuery` * some changes to enable a `UnionQuery` to have multiple input datasources * `UnionQuery` execution is driven by the `QueryLogic` - which could later enable to reduce some complexity in `ClientQuerySegmentWalker` * to run the subqueries of `UnionQuery` there was a need to access the `conglomerate` from the `Runner`; to enable that some refactors were done * renamed `UnionQueryRunner` to `UnionDataSourceQueryRunner` * `QueryRunnerFactoryConglomerate` have taken the place of `QueryToolChestWarehouse` which shaves of some unnecessary things here and there * small cleanup/refactors	2024-11-05 16:58:57 +01:00
Clint Wylie	10208baab2	use big endian for compressed complex column values to fit object strategy expectations (#17422 )	2024-10-29 10:21:09 -07:00
Adarsh Sanjeev	b7c661b801	Make tempStorageDirectory configuration optional and rely on task dir instead (#17015 ) Currently, durable storage and export both require configuring a temporary directory to be used using druid.export.storage.<connectorType>.tempLocalDir and druid.msq.intermediate.storage.tempDir. Tasks on middle manager already have a configured temporary directory. This PR aims to reduce the configuration required by using the task directory as a default if it is not explicitly configured, thus reducing the number of configs that a user has to set. Please note that preference would be given to the user configured, druid..storage.tempDir, on the tasks. If that is not configured, we then use the configured temporary directory. Overlord and brokers also require storage connector configurations (for the durableStorageCleanerOverlordDuty and to fetch results of async queries respectively), but do not have a default temporary task directory. The configuration is still required for these services.	2024-10-29 13:36:59 +05:30
Gian Merlino	446a8f466f	Update errorprone, mockito, jacoco, checkerframework. (#17414 ) * Update errorprone, mockito, jacoco, checkerframework. This patch updates various build and test dependencies, to see if they cause unit tests on JDK 21 to behave more reliably. * Update licenses, tests. * Remove assertEquals. * Repair two tests. * Update some more tests.	2024-10-28 11:34:03 -07:00
Clint Wylie	73675d0671	clean up some thread pools in tests (#17421 )	2024-10-28 09:05:15 -07:00
Suraj Goel	7306d280cc	Migrate jaxb bind dependency to jakarta (#17370 ) - Migrated from javax.xml.bind 2.3.1 to jakarta.xml.bind 2.3.3. - Minor version is modified to avoid any breaking changes.	2024-10-26 21:24:17 -07:00
Gian Merlino	7e8671caa9	GroupByQueryConfig: Skip unnecessary toString. (#17396 ) Calling toString on newConfig is unnecessary, because it will be done automatically by the logger. This saves some effort under log levels higher than DEBUG.	2024-10-23 19:57:22 +05:30
Clint Wylie	1157ecdec3	abstract common base of SQL micro-benchmarks to reduce boilerplate and standardize parameters (#17383 ) changes: * adds `SqlBenchmarkDatasets` which contains commonly used benchmark data generator schemas * adds `SqlBaseBenchmark` which contains common benchmark segment generation methods for any benchmark using `SqlBenchmarkDatasets` * adds `SqlBaseQueryBenchmark` and `SqlBasePlanBenchmark` for benchmarks measuring queries and planning respectively * migrate all existing SQL jmh benchmarks to extend `SqlBaseQueryBenchmark`, quite dramatically reducing the boilerplate needed to create benchmarks, and allowing the use of multiple datasources within a benchmark file * adjustments to data generator stuff to allow passing in an ObjectMapper so that the same mapper can be used for both benchmark queries and segment generation, avoiding the need to register stuff with both mappers for benchmarks * adds `SqlProjectionsBenchmark` and `SqlComplexMetricsColumnsBenchmark` for measuring projections and measuring complex metric compression respectively	2024-10-22 19:37:17 -07:00
Laksh Singla	5b09329479	Fixes an issue with AppendableMemory that can cause MSQ jobs to fail (#17369 )	2024-10-18 09:05:53 +05:30
Akshat Jain	450fb0147b	Add GlueingPartitioningOperator + Corresponding changes in window function layer to consume it for MSQ (#17038 ) * GlueingPartitioningOperator: It continuously receives data, and outputs batches of partitioned RACs. It maintains a last-partitioning-boundary of the last-pushed-RAC, and attempts to glue it with the next RAC it receives, ensuring that partitions are handled correctly, even across multiple RACs. You can check GlueingPartitioningOperatorTest for some good examples of the "glueing" work. * PartitionSortOperator: It sorts rows inside partitioned RACs, on the sort columns. The input RACs it receives are expected to be "complete / separate" partitions of data.	2024-10-17 10:54:52 +05:30
Hardik Bajaj	32ce341a6c	Fix RejectExecutionHandler of Blocking Single Threaded executor (#17146 ) Throw RejectedExecutionException when submitting tasks to executor that has been shut down.	2024-10-15 22:02:34 +05:30
Clint Wylie	c2149d59a7	remove stale comment in QueryableIndexCursorHolder (#17333 )	2024-10-11 16:23:59 -07:00
Gian Merlino	b287b219a8	MSQ: Include stageId, workerNumber in processing thread names. (#17324 ) * MSQ: Include stageId, workerNumber in processing thread names. Helps identify which query was running in a thread dump. * s/dart/msq/	2024-10-11 08:37:15 -07:00
Shivam Garg	6898a5a359	Removed Microsecond from Extract function (#17247 )	2024-10-11 05:32:26 +02:00
Clint Wylie	a6236c3d15	add substituteCombiningFactory implementations for datasketches aggs (#17314 ) Follow up to #17214, adds implementations for substituteCombiningFactory so that more datasketches aggs can match projections, along with some projections tests for datasketches.	2024-10-10 16:14:06 +05:30
Gian Merlino	1d95ef34f0	Logger: Log context of DruidExceptions. (#17316 ) * Logger: Log context of DruidExceptions. There is often interesting and unique information available in the "context" of a DruidException. This information is additive to both the message and the cause, and was missed when we log. This patch adds the DruidException context to log messages whenever stack traces are enabled. * Only log nonempty contexts.	2024-10-10 01:44:50 -07:00
Gian Merlino	4fbb129027	Improve javadocs for SegmentDescriptor. (#17274 ) The javadoc for SegmentDescriptor discusses differences between it and SegmentId, but misses the most important difference: SegmentDescriptor can have a narrower interval than the segment being referenced.	2024-10-08 00:59:55 -07:00
Clint Wylie	ab0d6eb620	Fix string array grouping comparator (#17183 )	2024-10-08 09:47:28 +05:30
Karan Kumar	6a4352f466	When removeNullBytes is set, length calculations did not take into account null bytes. (#17232 ) * When replaceNullBytes is set, length calculations did not take into account null bytes.	2024-10-07 18:02:52 +05:30
Adarsh Sanjeev	c9201ad658	Minor refactors to processing module (#17136 ) Refactors a few things. - Adds SemanticUtils maps to columns. - Add some addAll functions to reduce duplication, and for future reuse. - Refactor VariantColumnAndIndexSupplier to only take a SmooshedFileMapper instead. - Refactor LongColumnSerializerV2 to have separate functions for serializing a value and null.	2024-10-07 13:18:35 +05:30
Clint Wylie	0bd13bcd51	Projections prototype (#17214 )	2024-10-05 04:38:57 -07:00
Clint Wylie	04fe56835d	add druid.expressions.allowVectorizeFallback and default to false (#17248 ) changes: adds ExpressionProcessing.allowVectorizeFallback() and ExpressionProcessingConfig.allowVectorizeFallback(), defaulting to false until few remaining bugs can be fixed (mostly complex types and some odd interactions with mixed types) add cannotVectorizeUnlessFallback functions to make it easy to toggle the default of this config, and easy to know what to delete when we remove it in the future	2024-10-05 12:42:42 +05:30
Gian Merlino	b9634a8613	SuperSorter: Don't set allDone if it's already set. (#17238 ) This fixes a race where, if there is no output at all, setAllDoneIfPossible could be called twice (once when the output partitions future resolves, and once when the batcher finishes). If the calls happen in that order, it would try to create nil output channels both times, resulting in a "Channel already set" error.	2024-10-04 06:41:16 +05:30
Gian Merlino	db7cc4634c	Dart: Smoother handling of stage early-exit. (#17228 ) Stages can be instructed to exit before they finish, especially when a downstream stage includes a "LIMIT". This patch has improvements related to early-exiting stages. Bug fix: - WorkerStageKernel: Don't allow fail() to set an exception if the stage is already in a terminal state (FINISHED or FAILED). If fail() is called while in a terminal state, log the exception, then throw it away. If it's a cancellation exception, don't even log it. This fixes a bug where a stage that exited early could transition to FINISHED and then to FAILED, causing the overall query to fail. Performance: - DartWorkerManager previously sent stopWorker commands to workers even when "interrupt" was false. Now it only sends those commands when "interrupt" is true. The method javadoc already claimed this is what the method did, but the implementation did not match the javadoc. This reduces the number of RPCs by 1 per worker per query. Quieter logging: - In ReadableByteChunksFrameChannel, skip logging exception from setError if the channel has been closed. Channels are closed when readers are done with them, so at that point, we wouldn't be interested in the errors. - In RunWorkOrder, skip calling notifyListener on failure of the main work, in the case when stop() has already been called. The stop() method will set its own error using CanceledFault. This enables callers to detect when a stage was canceled vs. failed for some other reason. - In WorkerStageKernel, skip logging cancellation errors in fail(). This is made possible by the previous change in RunWorkOrder.	2024-10-03 20:09:02 +05:30
Zoltan Haindrich	65277b17a9	Decoupled planning: add support for unnest (#17177 ) * adds support for `UNNEST` expressions * introduces `LogicalUnnestRule` to transform a `Correlate` doing UNNEST into a `LogicalUnnest` * `UnnestInputCleanupRule` could move the final unnested expr into the `LogicalUnnest` itself (usually its an `mv_to_array` expression) * enhanced source unwrapping to utilize `FilteredDataSource` if it looks right	2024-10-02 08:54:56 +02:00
Gian Merlino	878adff9aa	MSQ profile for Brokers and Historicals. (#17140 ) This patch adds a profile of MSQ named "Dart" that runs on Brokers and Historicals, and which is compatible with the standard SQL query API. For more high-level description, and notes on future work, refer to #17139. This patch contains the following changes, grouped into packages. Controller (org.apache.druid.msq.dart.controller): The controller runs on Brokers. Main classes are, - DartSqlResource, which serves /druid/v2/sql/dart/. - DartSqlEngine and DartQueryMaker, the entry points from SQL that actually run the MSQ controller code. - DartControllerContext, which configures the MSQ controller. - DartMessageRelays, which sets up relays (see "message relays" below) to read messages from workers' DartControllerClients. - DartTableInputSpecSlicer, which assigns work based on a TimelineServerView. Worker (org.apache.druid.msq.dart.worker) The worker runs on Historicals. Main classes are, - DartWorkerResource, which supplies the regular MSQ WorkerResource, plus Dart-specific APIs. - DartWorkerRunner, which runs MSQ worker code. - DartWorkerContext, which configures the MSQ worker. - DartProcessingBuffersProvider, which provides processing buffers from sliced-up merge buffers. - DartDataSegmentProvider, which provides segments from the Historical's local cache. Message relays (org.apache.druid.messages): To avoid the need for Historicals to contact Brokers during a query, which would create opportunities for queries to get stuck, all connections are opened from Broker to Historical. This is made possible by a message relay system, where the relay server (worker) has an outbox of messages. The relay client (controller) connects to the outbox and retrieves messages. Code for this system lives in the "server" package to keep it separate from the MSQ extension and make it easier to maintain. The worker-to-controller ControllerClient is implemented using message relays. Other changes: - Controller: Added the method "hasWorker". Used by the ControllerMessageListener to notify the appropriate controllers when a worker fails. - WorkerResource: No longer tries to respond more than once in the "httpGetChannelData" API. This comes up when a response due to resolved future is ready at about the same time as a timeout occurs. - MSQTaskQueryMaker: Refactor to separate out some useful functions for reuse in DartQueryMaker. - SqlEngine: Add "queryContext" to "resultTypeForSelect" and "resultTypeForInsert". This allows the DartSqlEngine to modify result format based on whether a "fullReport" context parameter is set. - LimitedOutputStream: New utility class. Used when in "fullReport" mode. - TimelineServerView: Add getDruidServerMetadata as a performance optimization. - CliHistorical: Add SegmentWrangler, so it can query inline data, lookups, etc. - ServiceLocation: Add "fromUri" method, relocating some code from ServiceClientImpl. - FixedServiceLocator: New locator for a fixed set of service locations. Useful for URI locations.	2024-10-01 14:38:55 -07:00
Shivam Garg	ab361747a8	Migrated commons-lang usages to commons-lang3 (#17156 )	2024-09-28 10:28:11 +02:00
Clint Wylie	f8a72b987a	read metadata in SimpleQueryableIndex if available to compute segment ordering (#17181 )	2024-09-27 19:39:03 -07:00
Clint Wylie	157fe1bc1f	fix a mistake in CursorGranularizer to check doneness after advance (#17175 ) Fixes a mistake introduced in #16533 which can result in CursorGranularizer incorrectly trying to get values from a selector after calling cursor.advance because of a missing check for cursor.isDone	2024-09-27 09:36:05 +05:30
Clint Wylie	d77637344d	log.warn anytime a column is relying on ArrayIngestMode.MVD (#17164 ) * log.warn anytime a column is relying on ArrayIngestMode.MVD	2024-09-26 13:44:37 +05:30
Cece Mei	a2b011cdcd	Incorporate `estimatedComputeCost` into all `BitmapColumnIndex` classes. (#17125 ) changes: * filter index processing is now automatically ordered based on estimated 'cost', which is approximated based on how many expected bitmap operations are required to construct the bitmap used for the 'offset' * cursorAutoArrangeFilters context flag now defaults to true, but can be set to false to disable cost based filter index sorting	2024-09-25 23:11:26 -07:00

1 2 3 4 5 ...

3309 Commits