druid

Commit Graph

Author	SHA1	Message	Date
Gian Merlino	dd78e00dc5	Fix ColumnSignature error message and jdk17 test issue. (#14538 ) * Fix ColumnSignature error message and jdk17 test issue. On jdk17, the "problem" part of the error message could change from NullPointerException to: Cannot invoke "String.length()" because "s" is null Due to the new more-helpful NPEs in Java 17. This broke the expectation and led to test failures on this case. This patch fixes the problem by improving the error message so it isn't a generic NullPointerException. * Fix format.	2023-07-06 15:10:59 -07:00
imply-cheddar	5fc122a144	Add window-focused tests from Drill (#13773 ) This commit borrows some test definitions from Drill's test suite and tries to use them to flesh out the full validation of window function capabilities. In order to be able to run these tests, we also add the ability to run a Scan operation against segments, which also meant an implementation of RowsAndColumns for frames.	2023-07-06 09:20:32 -07:00
imply-cheddar	277b357256	Optimize IntervalIterator (#14530 ) UniformGranularityTest's test to test a large number of intervals runs through 10 years of 1 second intervals. This pushes a lot of stuff through IntervalIterator and shows up in terms of test runtime as one of the hottest tests. Most of the time is going to constructing jodatime objects because it is doing things with DateTime objects instead of millis. Change the calls to use millis instead and things go faster.	2023-07-06 14:44:23 +05:30
Kashif Faraz	87bb1b9709	Fix bug during initialization of HttpServerInventoryView (#14517 ) If a server is removed during `HttpServerInventoryView.serverInventoryInitialized`, the initialization gets stuck as this server is never synced. The method eventually times out (default 250s). Fix: Mark a server as stopped if it is removed. `serverInventoryInitialized` only waits for non-stopped servers to sync. Other changes: - Add new metrics for better debugging of slow broker/coordinator startup - `segment/serverview/sync/healthy`: whether the server view is syncing properly with a server - `segment/serverview/sync/unstableTime`: time for which sync with a server has been unstable - Clean up logging in `HttpServerInventoryView` and `ChangeRequestHttpSyncer` - Minor refactor for readability - Add utility class `Stopwatch` - Add tests and stubs	2023-07-06 13:04:53 +05:30
Clint Wylie	277aaa5c57	remove druid.processing.columnCache.sizeBytes and CachingIndexed, combine string column implementations (#14500 ) * combine string column implementations changes: * generic indexed, front-coded, and auto string columns now all share the same column and index supplier implementations * remove CachingIndexed implementation, which I think is largely no longer needed by the switch of many things to directly using ByteBuffer, avoiding the cost of creating Strings * remove ColumnConfig.columnCacheSizeBytes since CachingIndexed was the only user	2023-07-02 19:37:15 -07:00
Gian Merlino	67fbd8e7fc	Add "stringEncoding" parameter to DataSketches HLL. (#11201 ) * Add "stringEncoding" parameter to DataSketches HLL. Builds on the concept from #11172 and adds a way to feed HLL sketches with UTF-8 bytes. This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default. Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches. * Fix benchmark. * Fix style issues, improve test coverage. * Put round back, to make IT updates easier. * Fix test. * Fix issue with filtered aggregators and add test. * Use DS native update(ByteBuffer) method. Improve test coverage. * Add another suppression. * Fix ITAutoCompactionTest. * Update benchmarks. * Updates. * Fix conflict. * Adjustments.	2023-06-30 12:45:55 -07:00
Gian Merlino	e10e35aa2c	Add REGEXP_REPLACE function. (#14460 ) * Add REGEXP_REPLACE function. Replaces all instances of a pattern with a replacement string. * Fixes. * Improve test coverage. * Adjust behavior.	2023-06-29 13:47:57 -07:00
Gian Merlino	a6cabbe10f	SQL: Avoid "intervals" for non-table-based datasources. (#14336 ) In these other cases, stick to plain "filter". This simplifies lots of logic downstream, and doesn't hurt since we don't have intervals-specific optimizations outside of tables. Fixes an issue where we couldn't properly filter on a column from an external datasource if it was named __time.	2023-06-29 09:57:11 +05:30
Gian Merlino	82fbb31c7c	Properly read SQL-compatible segments in default-value mode. (#14142 ) * Properly read SQL-compatible segments in default-value mode. Main changes: 1) Dictionary-encoded and front-coded string columns: in default-value mode, detect cases where a dictionary has the empty string in it, then either combine it with null (if null is present) or replace it with null (if null is not present). 2) Numeric nullable columns: in default-value mode, ignore the null value bitmap. This causes all null numbers to be read as zeroes. Testing strategy: 1) Add a mmappedWithSqlCompatibleNulls case to BaseFilterTest that writes segments under SQL-compatible mode, and reads them under default-value mode. 2) Unit tests for the new wrapper classes (CombineFirstTwoEntriesIndexed, CombineFirstTwoValuesColumnarInts, CombineFirstTwoValuesColumnarMultiInts, CombineFirstTwoValuesIndexedInts). * Fix a mistake, use more singlethreadedness. * WIP * Tests, improvements. * Style. * See Spot bug. * Remove unused method. * Address review comments. 1) Read bitmaps even if we don't retain them. 2) Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn. * Add missing tests.	2023-06-28 10:30:27 -07:00
Karan Kumar	cb3a9d2b57	Adding Interactive APIs for MSQ engine (#14416 ) This PR aims to expose a new API at "/druid/v2/sql/statements/" which takes the same payload as the current "/druid/v2/sql" endpoint and allows users to fetch results in an async manner.	2023-06-28 17:51:58 +05:30
imply-cheddar	fd20bbd30e	Fix another infinite loop and remove Mockito usage (#14493 ) * Fix another infinite loop and remove Mockito usage The ConfigManager objects were `started()` without ever being stopped. This scheduled a poll call that never ended. To make matters worse, the poll interval was set to 0 ms, making an infinite poll with 0 sleep, i.e. an infinite loop. Also introduce test classes and remove usage of mocks * Checkstyle	2023-06-27 21:49:27 -07:00
Adarsh Sanjeev	0335aaa279	Add query results directory and prevent the auto cleaner from cleaning it (#14446 ) Adds support for automatic cleaning of a "query-results" directory in durable storage. This directory will be cleaned up only if the task id is not known to the overlord. This will allow the storage of query results after the task has finished running.	2023-06-28 10:14:04 +05:30
Abhishek Radhakrishnan	2cfb00b1de	Add missing `isNull()` implementation to `FilteredAggregator` (#14465 )	2023-06-27 16:35:15 -07:00
Gian Merlino	c78d885b80	Cache parsed expressions and binding analysis in more places. (#14124 ) * Cache parsed expressions and binding analysis in more places. Main changes: 1) Cache parsed and analyzed expressions within PlannerContext for a single SQL query. 2) Cache parsed expressions together with input binding analysis using a new class AnalyzedExpr. This speeds up SQL planning, because SQL planning involves parsing and analyzing the same expression strings over and over again. * Fixes. * Fix style. * Fix test. * Simplify: get rid of AnalyzedExpr, focus on caching. * Rename parse -> parseExpression.	2023-06-27 13:40:35 -07:00
imply-cheddar	2f0a43790c	Make GuavaUtilsTest use less CPU (#14487 )	2023-06-26 21:45:29 -07:00
Clint Wylie	6ba10c8b6c	fix bug with json_value expression array extraction (#14461 )	2023-06-26 21:02:44 -07:00
Laksh Singla	f546cd64a9	MSQ: Ensure that the allocated segment aligns with the requested granularity (#14475 ) Changes: - Throw an `InsertCannotAllocateSegmentFault` if the allocated segment is not aligned with the requested granularity. - Tests to verify new behaviour	2023-06-27 09:25:32 +05:30
Laksh Singla	114380749d	MSQ: Improve the parse exception errors and the handling of null UTF characters in Strings in Frames (#14398 )	2023-06-26 18:14:29 +05:30
Laksh Singla	1647d5f4a0	Limit the subquery results by memory usage (#13952 ) Users can now add a guardrail to prevent subquery’s results from exceeding the set number of bytes by setting druid.server.http.maxSubqueryBytes in Broker's config or maxSubqueryBytes in the query context. This feature is experimental for now and would default back to row-based limiting in case it fails to get the accurate size of the results consumed by the query.	2023-06-26 18:12:28 +05:30
Gian Merlino	970288067a	Fix flaky HttpEmitterConfigTest and ParametrizedUriEmitterConfigTest. (#14481 ) Recently, we have seen flakiness in these two tests, apparently due to computations based on Runtime.getRuntime().maxMemory() differing during static initialization and in the actual tests. I can't think of a reason why this would be happening, but anyway, this patch switches the tests to use the statics instead of recomputing Runtime.getRuntime().maxMemory().	2023-06-23 16:27:11 -07:00
imply-cheddar	7e2cf35d7b	Fix compatibility issue with SqlTaskResource (#14466 ) * Fix compatibility issue with SqlTaskResource The DruidException changes broke the response format for errors coming back from the SqlTaskResource, so fix those	2023-06-23 01:15:32 -07:00
Clint Wylie	31b9d5695d	Extend InitializedNullHandlingTest instead of NullHandlingTest (#14467 ) NullHandlingTest is an actual test, it shouldn't be used as a base class	2023-06-22 15:01:50 +05:30
Hardik Bajaj	1ea9158a50	Added new SysMonitorOshi v0 using Oshi library (#14359 ) Added a new monitor SysMonitorOshi to replace SysMonitor. The new monitor has wider support for different machine architectures including ARM instances. Please switch to SysMonitorOshi as SysMonitor is now deprecated and will be removed in future releases.	2023-06-20 20:57:58 +05:30
Kashif Faraz	50461c3bd5	Enable smartSegmentLoading on the Coordinator (#13197 ) This commit does a complete revamp of the coordinator to address problem areas: - Stability: Fix several bugs, add capabilities to prioritize and cancel load queue items - Visibility: Add new metrics, improve logs, revamp `CoordinatorRunStats` - Configuration: Add dynamic config `smartSegmentLoading` to automatically set optimal values for all segment loading configs such as `maxSegmentsToMove`, `replicationThrottleLimit` and `maxSegmentsInNodeLoadingQueue`. Changed classes: - Add `StrategicSegmentAssigner` to make assignment decisions for load, replicate and move - Add `SegmentAction` to distinguish between load, replicate, drop and move operations - Add `SegmentReplicationStatus` to capture current state of replication of all used segments - Add `SegmentLoadingConfig` to contain recomputed dynamic config values - Simplify classes `LoadRule`, `BroadcastRule` - Simplify the `BalancerStrategy` and `CostBalancerStrategy` - Add several new methods to `ServerHolder` to track loaded and queued segments - Refactor `DruidCoordinator` Impact: - Enable `smartSegmentLoading` by default. With this enabled, none of the following dynamic configs need to be set: `maxSegmentsToMove`, `replicationThrottleLimit`, `maxSegmentsInNodeLoadingQueue`, `useRoundRobinSegmentAssignment`, `emitBalancingStats` and `replicantLifetime`. - Coordinator reports richer metrics and produces cleaner and more informative logs - Coordinator uses an unlimited load queue for all servers, and makes better assignment decisions	2023-06-19 14:27:35 +05:30
imply-cheddar	cfd07a95b7	Errors take 3 (#14004 ) Introduce DruidException, an exception whose goal in life is to be delivered to a user. DruidException itself has javadoc on it to describe how it should be used. This commit both introduces the Exception and adjusts some of the places that are generating exceptions to generate DruidException objects instead, as a way to show how the Exception should be used. This work was a 3rd iteration on top of work that was started by Paul Rogers. I don't know if his name will survive the squash-and-merge, so I'm calling it out here and thanking him for starting on this.	2023-06-19 01:11:13 -07:00
Soumyava Das	a9a6fc261c	Updating tests as MSQ does not support earliest for some cases	2023-06-19 10:28:04 +05:30
Adarsh Sanjeev	128133fadc	Add replication_factor column to sys.segments table (#14403 ) Description: Druid allows a configuration of load rules that may cause a used segment to not be loaded on any historical. This status is not tracked in the sys.segments table on the broker, which makes it difficult to determine if the unavailability of a segment is expected and if we should not wait for it to be loaded on a server after ingestion has finished. Changes: - Track replication factor in `SegmentReplicantLookup` during evaluation of load rules - Update API `/druid/coordinator/v1/metadata/segments` to return replication factor - Add column `replication_factor` to the sys.segments virtual table and populate it in `MetadataSegmentView` - If this column is 0, the segment is not assigned to any historical and will not be loaded.	2023-06-18 10:02:21 +05:30
George Shiqi Wu	64af9bfe5b	Add groupId to metrics (#14402 ) * Add group id as a dimension * Revert changes * Add to forking task runner * Add missing metrics * Fix indenting * revert metrics * Fix indentation	2023-06-16 09:28:16 -07:00
Soumyava Das	df3db6e6c9	Merge remote-tracking branch 'upstream/master' into vectorize_earliest_num	2023-06-16 14:41:51 +05:30
Clint Wylie	359bd63cc9	allow expression "best effort" type determination to better handle mixed type arrays (#14438 )	2023-06-16 00:02:43 -07:00
Clint Wylie	ff5ae4db6c	fix kafka input format reader schema discovery and partial schema discovery (#14421 ) * fix kafka input format reader schema discovery and partial schema discovery to actually work right, by re-using dimension filtering logic of MapInputRowParser	2023-06-15 00:11:04 -07:00
Clint Wylie	ca116cf886	adjust broker parallel merge to help managed blocking be more well behaved (#14427 )	2023-06-15 00:10:31 -07:00
Pranav	e426d370ea	Start with solo accumulator and empty partition (#14426 ) * Starting parallel merge with solo accumulator and empty partitions * shut down pool in test	2023-06-14 16:20:48 -07:00
Clint Wylie	8454cc619a	auto columns fixes (#14422 ) changes: * auto columns no longer participate in generic 'null column' handling, this was a mistake to try to support and caused ingestion failures due to mismatched ColumnFormat, and will be replaced in the future with nested common format constant column functionality (not in this PR) * fix bugs with auto columns which contain empty objects, empty arrays, or primitive types mixed with either of these empty constructs * fix bug with bound filter when upper is null equivalent but is strict	2023-06-14 08:57:06 -07:00
Abhishek Radhakrishnan	be5a6593a9	Reset `RuntimeInfo` to fix flaky test `ParametrizedUriEmitterConfigTest`. (#14405 ) * Add injector so JVM settings are correctly set up and bound for the test. * Add VisibleForTesting IDE annotation. * spacing	2023-06-13 18:07:51 -07:00
Clint Wylie	61120dc49a	fix Kafka input format to throw ParseException if timestamp is missing (#14413 )	2023-06-13 09:00:11 -07:00
Soumyava Das	6c139de4f2	checkstyle fix	2023-06-12 17:15:16 +05:30
Soumyava Das	59118ae885	Vectorizing earliest string aggregator	2023-06-12 17:10:20 +05:30
Soumyava Das	2b556f6b19	Vectorizing earliest for numeric	2023-06-12 16:12:28 +05:30
Adarsh Sanjeev	267cbac6ff	Add logs for deleting files using storage connector (#14350 ) * Add logs for deleting files using storage connector * Address review comments * Update log message format	2023-06-11 21:24:30 +05:30
Kashif Faraz	6e158704cb	Do not retry INSERT task into metadata if max_allowed_packet limit is violated (#14271 ) Changes - Add a `DruidException` which contains a user-facing error message, HTTP response code - Make `EntryExistsException` extend `DruidException` - If metadata store max_allowed_packet limit is violated while inserting a new task, throw `DruidException` with response code 400 (bad request) to prevent retries - Add `SQLMetadataConnector.isRootCausePacketTooBigException` with impl for MySQL	2023-06-10 12:15:44 +05:30
imply-cheddar	87149d5975	Remove AbstractIndex (#14388 ) The class apparently only exists to add a toString() method to Indexes, which basically just crashes any debugger on any meaningfully sized index. It's a pointless abstract class that basically only causes pain.	2023-06-08 19:52:16 -07:00
Harini Rajendran	4ff6026d30	Adding SegmentMetadataEvent and publishing them via KafkaEmitter (#14281 ) In this PR, we are enhancing KafkaEmitter, to emit metadata about published segments (SegmentMetadataEvent) into a Kafka topic. This segment metadata information that gets published into Kafka, can be used by any other downstream services to query Druid intelligently based on the segments published. The segment metadata gets published into kafka topic in json string format similar to other events.	2023-06-02 21:28:26 +05:30
zachjsh	e75fb8e8e3	Account for data format and compression in MSQ auto taskAssignment (#14307 ) ### Description This change allows for consideration of the input format and compression when computing how to split the input files among available tasks, in MSQ ingestion, when considering the value of the `maxInputBytesPerWorker` query context parameter. This query parameter allows users to control the maximum number of bytes, with granularity of input file / object, that ingestion tasks will be assigned to ingest. With this change, this context parameter now denotes the estimated weighted size in bytes of the input to split on, with consideration for input format and compression format, rather than the actual file size, reported by the file system. We assume uncompressed newline delimited json as a baseline, with scaling factor of `1`. This means that when computing the byte weight that a file has towards the input splitting, we take the file size as is, if uncompressed json, 1:1. It was found during testing that gzip compressed json, and parquet, has scale factors of `4` and `8` respectively, meaning that each byte of data is weighted 4x and 8x respectively, when computing input splits. This weighted byte scaling is only considered for MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource at the moment. The default value of the `maxInputBytesPerWorker` query context parameter has been updated from 10 GiB, to 512 MiB	2023-06-01 12:53:49 -07:00
Clint Wylie	4096f51f0b	add configurable ColumnTypeMergePolicy to SegmentMetadataCache (#14319 ) This PR adds a new interface to control how SegmentMetadataCache chooses ColumnType when faced with differences between segments for SQL schemas which are computed, exposed as druid.sql.planner.metadataColumnTypeMergePolicy and adds a new 'least restrictive type' mode to allow choosing the type that data across all segments can best be coerced into and sets this as the default behavior. This is a behavior change around when segment driven schema migrations take effect for the SQL schema. With latestInterval, the SQL schema will be updated as soon as the first job with the new schema has published segments, while using leastRestrictive, the schema will only be updated once all segments are reindexed to the new type. The benefit of leastRestrictive is that it eliminates a bunch of type coercion errors that can happen in SQL when types are varied across segments with latestInterval because the newest type is not able to correctly represent older data, such as if the segments have a mix of ARRAY and number types, or any other combinations that lead to odd query plans.	2023-05-24 20:32:51 +05:30
Soumyava	22ba457d29	Expr getCacheKey now delegates to children (#14287 ) * Expr getCacheKey now delegates to children * Removed the LOOKUP_EXPR_CACHE_KEY as we do not need it * Adding a unit test * Update processing/src/main/java/org/apache/druid/math/expr/Expr.java Co-authored-by: Clint Wylie <cjwylie@gmail.com> --------- Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-05-23 14:49:38 -07:00
Abhishek Radhakrishnan	a5e04d95a4	Add `TYPE_NAME` to the complex serde classes and replace the hardcoded names. (#14317 ) * Add TYPE_NAME to the serde classes and reuse them instead of hardcoded strings. * Static check fixes.	2023-05-23 00:54:47 -05:00
Clint Wylie	d92b9fbfac	more resilient segment metadata, dont parallel merge internal segment metadata queries (#14296 )	2023-05-17 04:12:55 -07:00
Clint Wylie	b038a11280	fix issues with handling arrays with all null elements and arrays of booleans in strict mode (#14297 )	2023-05-17 01:33:44 -07:00
Soumyava	96a3c00754	Fixing an issue with filtering on a single dimension by converting In… (#14277 ) * Fixing an issue with filtering on a single dimension by converting In filter to a selector filter as needed with Filters.toFilter * Adding a test so that any future refactoring does not break this behavior * Made comment a bit more meaningful	2023-05-15 20:10:36 -07:00
imply-cheddar	f9861808bc	Be able to load segments on Peons (#14239 ) * Be able to load segments on Peons This change introduces a new config on WorkerConfig that indicates how many bytes of each storage location to use for storage of a task. Said config is divided up amongst the locations and slots and then used to set TaskConfig.tmpStorageBytesPerTask The Peons use their local task dir and tmpStorageBytesPerTask as their StorageLocations for the SegmentManager such that they can accept broadcast segments.	2023-05-12 16:51:00 -07:00
Kashif Faraz	ba11b3d462	Refactor: Add OverlordDuty to replace OverlordHelper and align with CoordinatorDuty (#14235 ) Changes: - Replace `OverlordHelper` with `OverlordDuty` to align with `CoordinatorDuty` - Each duty has a `run()` method and defines a `Schedule` with an initial delay and period. - Update existing duties `TaskLogAutoCleaner` and `DurableStorageCleaner` - Add utility class `Configs` - Update log, error messages and javadocs - Other minor style improvements	2023-05-12 22:39:56 +05:30
Clint Wylie	9875090bee	fix segment metadata queries for auto ingested columns that had all null values (#14262 )	2023-05-11 20:58:06 -07:00
Soumyava	f128b9b666	Updates to filter processing for inner query in Joins (#14237 )	2023-05-11 17:21:41 +05:30
Clint Wylie	a58cebe491	add array_to_mv function to convert arrays into mvds to assist with migration from mvds to arrays (#14236 )	2023-05-11 04:43:28 -07:00
Kashif Faraz	64e6283eca	Do not allow retention rules to be null (#14223 ) Changes: - Do not allow retention rules for any datasource or cluster to be null - Allow empty rules at the datasource level but not at the cluster level - Add validation to ensure that `druid.manager.rules.defaultRule` is always set correctly - Minor style refactors	2023-05-11 14:33:56 +05:30
Clint Wylie	aaaff74740	fix npe regression in json_value when filtering non-existent paths (#14250 ) * fix npe regression in json_value when filtering non-existent paths * more coverage	2023-05-10 22:39:22 -07:00
Clint Wylie	6db11bfc60	suppress some cves and fix javadoc build when using java 17 (#14241 )	2023-05-10 15:47:10 -07:00
Clint Wylie	8805d8d7db	fix issues with filtering nulls on values coerced to numeric types (#14139 ) * fix issues with filtering nulls on values coerced to numeric types * fix issues with 'auto' type numeric columns in default value mode * optimize variant typed columns without nested data * more tests for 'auto' type column ingestion	2023-05-08 13:19:02 -07:00
Clint Wylie	a7a4bfd331	modify QueryScheduler to lazily acquire lanes when executing queries to avoid leaks (#14184 ) This PR fixes an issue that could occur if druid.query.scheduler.numThreads is configured and any exception occurs after QueryScheduler.run has been called to create a Sequence. This would result in total and/or lane specific locks being acquired, but because the sequence was not actually being evaluated, the "baggage" which typically releases these locks was not being executed. An example of how this can happen is if a group-by having filter, which wraps and transforms this sequence happens to explode while wrapping the sequence. The end result is that the locks are acquired, but never released, eventually halting the ability to execute any queries.	2023-05-08 11:42:05 +05:30
Clint Wylie	90ea192d9c	fix bugs with auto encoded long vector deserializers (#14186 ) This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so impacts Druid versions 0.22.0+. While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.	2023-05-01 11:49:27 +05:30
Suneet Saldanha	84c11df980	Make LoggingEmitter more useful by using Markers (#14121 ) * Make LoggingEmitter more useful * Skip code coverage for facade classes * fix spellcheck * code review * fix dependency * logging.md * fix checkstyle * Add back jacoco version to main pom	2023-04-27 15:06:06 -07:00
Adarsh Sanjeev	5aa119dfda	Add retry to opening retrying stream (#14126 ) * Add retry to opening retrying stream * Add retry to S3Entity for network issues * Fix tests and clean up code	2023-04-27 16:52:22 +05:30
Gian Merlino	42c8c84eb6	TimeBoundary: Use cursor when datasource is not a regular table. (#14151 ) * TimeBoundary: Use cursor when datasource is not a regular table. Fixes a bug where TimeBoundary could return incorrect results with INNER Join or inline data. * Addl Javadocs.	2023-04-26 17:00:13 -07:00
Gian Merlino	752475b799	Fix two concurrency issues with segment fetching. (#14042 ) * Fix two concurrency issues with segment fetching. 1) SegmentLocalCacheManager: Fix a concurrency issue where certain directory cleanup happened outside of directoryWriteRemoveLock. This created the possibility that segments would be deleted by one thread, while being actively downloaded by another thread. 2) TaskDataSegmentProcessor (MSQ): Fix a concurrency issue when two stages in the same process both use the same segment. For example: a self-join using distributed sort-merge. Prior to this change, the two stages could delete each others' segments. 3) ReferenceCountingResourceHolder: increment() returns a new ResourceHolder, rather than a Releaser. This allows it to be passed to callers without them having to hold on to both the original ResourceHolder and a Releaser. 4) Simplify various interfaces and implementations by using ResourceHolder instead of Pair and instead of split-up fields. * Add test. * Fix style. * Remove Releaser. * Updates from master. * Add some GuardedBys. * Use the correct GuardedBy. * Adjustments.	2023-04-25 20:49:27 -07:00
Gian Merlino	2dfb693d4c	Improved handling for zero-length intervals. (#14136 ) * Improved handling for zero-length intervals. 1) Return an empty list from VersionedIntervalTimeline.lookup when provided with an empty interval. (The logic doesn't quite work when intervals are empty, which led to #14129.) 2) Don't return zero-length intervals from JodaUtils.condenseIntervals. 3) Detect "incorrect" comparator in JodaUtils.condenseIntervals, and recreate the SortedSet if needed. (Not strictly related to the theme of this patch. Just another thing in the same file.) 4) Remove unused method JodaUtils.containOverlappingIntervals. Fixes #14129. * Fix TimewarpOperatorTest.	2023-04-25 17:12:56 -07:00
Gian Merlino	89e7948159	MSQ: Subclass CalciteJoinQueryTest, other supporting changes. (#14105 ) * MSQ: Subclass CalciteJoinQueryTest, other supporting changes. The main change is the new tests: we now subclass CalciteJoinQueryTest in CalciteSelectJoinQueryMSQTest twice, once for Broadcast and once for SortMerge. Three supporting production changes for default-value mode: 1) InputNumberDataSource is marked as concrete, to allow leftFilter to be pushed down to it. 2) In default-value mode, numeric frame field readers can now return nulls. This is necessary when stacking joins on top of joins: nulls must be preserved for semantics that match broadcast joins and native queries. 3) In default-value mode, StringFieldReader.isNull returns true on empty strings in addition to nulls. This is more consistent with the behavior of the selectors, which map empty strings to null as well in that mode. As an effect of change (2), the InsertTimeNull change from #14020 (to replace null timestamps with default timestamps) is reverted. IMO, this is fine, as either behavior is defensible, and the change from #14020 hasn't been released yet. * Adjust tests. * Style fix. * Additional tests.	2023-04-25 12:10:23 -07:00
Gian Merlino	73f050027b	MSQ: Preserve original ParseException when writing frames. (#14122 )	2023-04-25 11:47:15 +05:30
Nicholas Lippis	9d4cc501f7	return task status reported by peon (#14040 ) * return task status reported by peon * Write TaskStatus to file in AbstractTask.cleanUp * Get TaskStatus from task log * Fix merge conflicts in AbstractTaskTest * Add unit tests for TaskLogPusher, TaskLogStreamer, NoopTaskLogs to satisfy code coverage * Add license headers * Fix style * Remove unknown exception declarations	2023-04-24 12:05:39 -07:00
TSFenwick	accd5536df	Allow for Log4J to be configured for peons but still ensure console logging is enforced (#14094 ) * Allow for Log4J to be configured for peons but still ensure console logging is enforced This change will allow for log4j to be configured for peons but require console logging is still configured for them to ensure peon logs are saved to deep storage. Also fixed the test ConsoleLoggingEnforcementTest to use a valid appender for the non console Config as the previous config was incorrect and would never return a logger. * fix checkstyle * add warning to logger when it overwrites all loggers to be console * optimize calls for altering logging config for ConsoleLoggingEnforcementConfigurationFactory add getName to the druid logger class * update docs, and error message * edit docs to be more clear * fix checkstyle issues * CI fixes - LoggerTest code coverage and fix spelling issue for logging docs	2023-04-24 10:41:56 -07:00
Soumyava	8d60edcfcb	Updating segment map function for QueryDataSource to ensure group by … (#14112 ) * Updating segment map function for QueryDataSource to ensure group by of group by of join data source gets into proper segment map function path * Adding unit tests for the failed case * There you go coverage bot, be happy now	2023-04-20 13:22:29 -07:00
Gian Merlino	9436ee8a63	Nicer error message for CSV with no properties. (#14093 ) * Nicer error message for CSV with no properties. * Take two. * Adjustments from review, and test fixes. * Fix test. * Fix static check.	2023-04-18 12:52:02 -07:00
Clint Wylie	e7d2e8b914	fix bug filtering nested columns with expression filters (#14096 )	2023-04-17 14:21:32 -07:00
Gian Merlino	facd82b493	Add HLLC tests for empty strings that don't pass. (#14085 ) I believe the test case illustrates the cause of the problem in #13950.	2023-04-17 15:46:42 +05:30
Gian Merlino	0884a22c41	MSQ: Support for querying lookup and inline data directly. (#14048 ) * MSQ: Support for querying lookup and inline data directly. Main changes: 1) Add LookupInputSpec and DataSourcePlan.forLookup. 2) Add InlineInputSpec, and modify DataSourcePlan.forInline to use this instead of an ExternalInputSpec with JSON. This allows the inline data to act as the right-hand side of a join, if needed. Supporting changes: 1) Modify JoinDataSource's leftFilter validation to be a little less strict: it's now OK with leftFilter being attached to any concrete leaf (no children) datasource, rather than requiring it be a table. This allows MSQ to create JoinDataSource with InputNumberDataSource as the base. 2) Add SegmentWranglerModule to CliIndexer, CliPeon. This allows them to query lookups and inline data directly. * Updates based on CI. * Additional tests. * Style fix. * Remove unused import.	2023-04-14 14:04:02 -07:00
Clint Wylie	179e2e8108	adjust useSchemaDiscovery to also include the behavior of includeAllDimensions to support partial schema declaration without having to set two flags (#14076 )	2023-04-12 23:12:49 -07:00
Gian Merlino	81074411a9	MSQ: Support multiple result columns with the same name. (#14025 ) * MSQ: Support multiple result columns with the same name. This is allowed in SQL, and is supported by the regular SQL endpoint. We retain a validation that INSERT ... SELECT does not allow multiple columns with the same name, because column names in segments must be unique.	2023-04-13 11:09:39 +05:30
Clint Wylie	9ed8beca5e	bug fixes and add support for boolean inputs to classic long dimension indexer (#14069 ) changes: * adds support for boolean inputs to the classic long dimension indexer, which plays nice with LONG being the semi official boolean type in Druid, and even nicer when druid.expressions.useStrictBooleans is set to true, since the sampler when using the new 'auto' schema when 'useSchemaDiscovery' is specified on the dimensions spec will call the type out as LONG * fix bugs with sampler response and new schema discovery stuff incorrectly using classic 'json' type for the logical schema instead of the new 'auto' type	2023-04-11 20:49:52 -07:00
Clint Wylie	29652bd246	fix NPE that can happen when merging all null nested v4 format columns (#14068 )	2023-04-11 19:04:51 -07:00
Clint Wylie	d61bd7f8f1	fix bug in nested v4 format merger from refactoring (#14053 )	2023-04-10 20:38:58 -07:00
Clint Wylie	1aef72aa7e	Bump up the version in pom to 27.0.0 in preparation of release (#14051 )	2023-04-10 14:56:59 +05:30
Gian Merlino	d52bc333aa	Frames: Ensure nulls are read as default values when appropriate. (#14020 ) * Frames: Ensure nulls are read as default values when appropriate. Fixes a bug where LongFieldWriter didn't write a properly transformed zero when writing out a null. This had no meaningful effect in SQL-compatible null handling mode, because the field would get treated as a null anyway. But it does have an effect in default-value mode: it would cause Long.MIN_VALUE to get read out instead of zero. Also adds NullHandling checks to the various frame-based column selectors, allowing reading of nullable frames by servers in default-value mode.	2023-04-10 05:28:46 +05:30
Clint Wylie	f41468fd46	fix off by one error in FrontCodedIndexedWriter and FrontCodedIntArrayIndexedWriter getCardinality method (#14047 ) * fix off by one error in FrontCodedIndexedWriter and FrontCodedIntArrayIndexedWriter getCardinality method	2023-04-07 03:11:15 -07:00
zachjsh	5c0221375c	Allow for Input source security in native task layer (#14003 ) Fixes #13837. ### Description This change allows for input source type security in the native task layer. To enable this feature, the user must set the following property to true: `druid.auth.enableInputSourceSecurity=true` The default value for this property is false, which will continue the existing functionality of needing authorization to write to the respective datasource. When this config is enabled, the users will be required to be authorized for the following resource action, in addition to write permission on the respective datasource. `new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}), Action.READ)` where `{INPUT_SOURCE_TYPE}` is the type of the input source being used: http, inline, s3, etc. Only tasks that provide a non-default implementation of the `getInputSourceResources` method can be submitted when config `druid.auth.enableInputSourceSecurity=true` is set. Otherwise, a 400 error will be thrown.	2023-04-06 13:13:09 -04:00
Abhishek Agarwal	92912a6a2b	JOIN or UNNEST queries over tombstone segment can fail (#14021 ) Join,Unnest queries over tombstone segment can fail	2023-04-06 16:55:58 +05:30
Clint Wylie	b11c0bc249	smarter nested column index utilization (#13977 ) * smarter nested column index utilization changes: * adds skipValueRangeIndexScale and skipValuePredicateIndexScale to ColumnConfig (e.g. DruidProcessingConfig) available as system config via druid.processing.indexes.skipValueRangeIndexScale and druid.processing.indexes.skipValuePredicateIndexScale * NestedColumnIndexSupplier uses skipValueRangeIndexScale and skipValuePredicateIndexScale to multiply by the total number of rows to be processed to determine the threshold at which we should no longer consider using bitmap indexes because it will be too many operations * Default values for skipValueRangeIndexScale and skipValuePredicateIndexScale have been initially set to 0.08, but are separate to allow independent tuning * these are not documented on purpose yet because they are kind of hard to explain, they mainly exist to help conduct larger scale experiments than the jmh benchmarks used to derive the initial set of values * these changes provide a pretty sweet performance boost for filter processing on nested columns	2023-04-06 04:09:24 -07:00
Gian Merlino	319f99db05	Always use file sizes when determining batch ingest splits (#13955 ) * Always use file sizes when determining batch ingest splits. Main changes: 1) Update CloudObjectInputSource and its subclasses (S3, GCS, Azure, Aliyun OSS) to use SplitHintSpecs in all cases. Previously, they were only used for prefixes, not uris or objects. 2) Update ExternalInputSpecSlicer (MSQ) to consider file size. Previously, file size was ignored; all files were treated as equal weight when determining splits. A side effect of these changes is that we'll make additional network calls to find the sizes of objects when users specify URIs or objects as opposed to prefixes. IMO, this is worth it because it's the only way to respect the user's split hint and task assignment settings. Secondary changes: 1) S3, Aliyun OSS: Use getObjectMetadata instead of listObjects to get metadata for a single object. This is a simpler call that is also expected to be less expensive. 2) Azure: Fix a bug where getBlobLength did not populate blob reference attributes, and therefore would not actually retrieve the blob length. 3) MSQ: Align dynamic slicing logic between ExternalInputSpecSlicer and TableInputSpecSlicer. 4) MSQ: Adjust WorkerInputs to ensure there is always at least one worker, even if it has a nil slice. * Add msqCompatible to testGroupByWithImpossibleTimeFilter. * Fix tests. * Add additional tests. * Remove unused stuff. * Remove more unused stuff. * Adjust thresholds. * Remove irrelevant test. * Fix comments. * Fix bug. * Updates.	2023-04-05 08:54:01 -07:00
Clint Wylie	d21babc5b8	remix nested columns (#14014 ) changes: * introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation * introduce new 'auto' type indexer and merger which produces a new common nested format of columns, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>. * revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility * fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)	2023-04-04 17:51:59 -07:00
Karan Kumar	217b0f6832	Eagerly fetching remote s3 files leading to out of disk (OOD) (#13981 ) * Eagerly fetching remote s3 files leading to OOD.	2023-04-03 14:10:37 +05:30
Clint Wylie	518698a952	lower segment heap footprint and fix bug with expression type coercion (#14002 )	2023-03-31 13:53:22 -07:00
Clint Wylie	e3211e3be0	actually backwards compatible frontCoded string encoding strategy (#13996 )	2023-03-31 02:24:12 -07:00
Soumyava	1eeecf5fb2	Fixing regression issues on unnest (#13976 ) * select sum(c) on an unnested column now does not return 'Type mismatch' error and works properly * Making sure an inner join query works properly * Having on unnested column with a group by now works correctly * count(*) on an unnested query now works correctly	2023-03-31 09:06:43 +05:30
Karan Kumar	8dce3ca4d5	OOM fix for running MSQ jobs with `intermediateSuperSorterStorageMaxLocalBytes` set (#13974 ) While using intermediateSuperSorterStorageMaxLocalBytes the super sorter was retaining references of the memory allocator. The fix clears the current outputChannel when close() is called on the ComposingWritableFrameChannel.java	2023-03-29 18:00:00 +05:30
Clint Wylie	2219e68fa3	add backwards compat mode for frontCoded stringEncodingStrategy (#13988 )	2023-03-28 14:44:44 -07:00
Paul Rogers	76fe26d4ba	Fix typos, add tests for http() function (#13954 )	2023-03-28 14:41:06 -07:00
Karan Kumar	c2fe6a4956	Reworking s3 connector with various improvements (#13960 ) * Reworking s3 connector with 1. Adding retries 2. Adding max fetch size 3. Using s3Utils for most of the api's 4. Fixing bugs in DurableStorageCleaner 5. Moving to Iterator for listDir call	2023-03-28 17:05:16 +05:30
Clint Wylie	d5b1b5bc8e	nested columns + arrays = array columns! (#13803 ) array columns! changes: * add support for storing nested arrays of string, long, and double values as specialized nested columns instead of breaking them into separate element columns * nested column type mimic behavior means that columns ingested with only root arrays of primitive values will be ARRAY typed columns * neat test refactor stuff * add v4 segment test * add array element indexes * add tests for unnest and array columns * fix unnest column value selector cursor handling of null and empty arrays	2023-03-27 12:42:35 -07:00
abhagraw	c52d15d65d	Fixing security vulnerability check errors (#13956 ) * Fixing security vulnerability check errors * Updating javax.el to jakarta.el * Adding cron job trigger on changes to suppressions file	2023-03-23 11:10:06 +05:30
Soumyava	2ad133c06e	Unnest changes for moving the filter on right side of correlate to inside the unnest datasource (#13934 ) * Refactoring and bug fixes on top of unnest. The filter now is passed inside the unnest cursors. Added tests for scenarios such as 1. filter on unnested column which involves a left filter rewrite 2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite 3. not filters 4. SQL functions applied on top of unnested column 5. null present in first row of the column to be unnested	2023-03-22 18:24:00 -07:00
Clint Wylie	f4392a3155	expression transform improvements and fixes (#13947 ) changes: * fixes inconsistent handling of byte[] values between ExprEval.bestEffortOf and ExprEval.ofType, which could cause byte[] values to end up as java toString values instead of base64 encoded strings in ingest time transforms * improved ExpressionTransform binding to re-use ExprEval.bestEffortOf when evaluating a binding instead of throwing it away * improved ExpressionTransform array handling, added RowFunction.evalDimension that returns List<String> to back Row.getDimension and remove the automatic coercing of array types that would typically happen to expression transforms unless using Row.getDimension * added some tests for ExpressionTransform with array inputs * improved ExpressionPostAggregator to use partial type information from decoration * migrate some test uses of InputBindings.forMap to use other methods	2023-03-21 23:26:53 -07:00

2882 Commits