druid

Commit Graph

Author	SHA1	Message	Date
Gian Merlino	588d442422	Add native filter conversion for SCALAR_IN_ARRAY. (#16312 ) * Add native filter conversion for SCALAR_IN_ARRAY. Main changes: 1) Add an implementation of "toDruidFilter" in ScalarInArrayOperatorConversion. 2) Split up Expressions.literalToDruidExpression into two functions, so the first half (literalToExprEval) can be used by ScalarInArrayOperatorConversion to more efficiently create the list of match values. * Fix type in time arithmetic conversion. * Test updates. * Update test cases to use null instead of '' in default-value mode. * Switch test from msqIncompatible to compatible with a different result. * Update one more test. * Fix test. * Update tests. * Use ExprEvalWrapper to differentiate between empty string and null. * Fix tests some more. * Fix test. * Additional comment. * Style adjustment. * Fix tests. * trueValue -> actualValue. * Use different approach, DruidLiteral instead of ExprEvalWrapper. * Revert changes in ArrayOfDoublesSketchSqlAggregatorTest.	2024-05-03 13:00:33 -07:00
Gian Merlino	1b107ff695	QueryableIndex: Close columns after failed vector cursor setup. (#16365 ) * QueryableIndex: Close columns after failed vector cursor setup. If anything fails while setting up a vector cursor, the prior code in QueryableIndex would not close its ColumnCache and would therefore leak columns. Columns often contain references to buffers that must be closed. * Fix style.	2024-05-03 12:58:40 -07:00
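The leak fix described in the commit above follows a common pattern: if setup throws, close what was already opened before rethrowing. A minimal sketch of that pattern follows; it is not the actual QueryableIndex code, and `VectorCursorSetupSketch`, `makeCursor`, and the `setup` function are hypothetical names used only for illustration.

```java
import java.util.function.Function;

final class VectorCursorSetupSketch {
    /**
     * Runs the (hypothetical) cursor setup against an already-opened column cache.
     * On success the cache stays open, because the returned cursor still needs it.
     * On failure the cache is closed before the original error is rethrown.
     */
    static <C extends AutoCloseable, R> R makeCursor(C columnCache, Function<C, R> setup) {
        try {
            return setup.apply(columnCache);
        } catch (RuntimeException | Error e) {
            try {
                columnCache.close();
            } catch (Exception closeError) {
                // Keep the original failure as the primary error.
                e.addSuppressed(closeError);
            }
            throw e;
        }
    }
}
```

Attaching the close failure as a suppressed exception keeps the setup error visible to the caller, which is usually the more useful of the two.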
zachjsh	fb7c84fb5d	Catalog clustering keys fixes (#16351 ) * * add another catalog clustering columns unit test * * disallow clusterKeys with descending order * * make more clear that clustering is re-written into ingest node whether a catalog table or not * * when partitionedBy is stored in catalog, user shouldn't need to specify it in order to specify clustering * * fix intellij inspection failure	2024-05-03 14:02:56 -04:00
Vadim Ogievetsky	4d62c4a917	Web console: concat data when doing a durable storage download (#16375 ) * concat data * fix silly console.error	2024-05-03 08:00:32 -07:00
Rishabh Singh	c61c3785a0	Followup changes to 15817 (Segment schema publishing and polling) (#16368 ) * Fix build * Nit changes in KillUnreferencedSegmentSchema * Replace reference to the abbreviation SMQ with Metadata Query, rename inTransit maps in schema cache * nitpicks * Remove reference to smq abbreviation from integration-tests * Remove reference to smq abbreviation from integration-tests * minor change * Update index.md * Add delimiter while computing schema fingerprint hash	2024-05-03 19:13:52 +05:30
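The last item in the commit above ("Add delimiter while computing schema fingerprint hash") addresses a general hashing pitfall: when fields are fed into a digest back to back, different field lists can produce the same byte stream. The sketch below shows the effect. It assumes SHA-256 and a caller-supplied delimiter; the actual hash function, delimiter, and field layout used by Druid are not stated in the commit message, and `FingerprintSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

final class FingerprintSketch {
    /** Hashes each field followed by the delimiter and returns the digest as lowercase hex. */
    static String fingerprint(List<String> fields, String delimiter) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String field : fields) {
                digest.update(field.getBytes(StandardCharsets.UTF_8));
                // With an empty delimiter, ["ab", "c"] and ["a", "bc"] feed identical bytes.
                digest.update(delimiter.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With an empty delimiter, `fingerprint(Arrays.asList("ab", "c"), "")` and `fingerprint(Arrays.asList("a", "bc"), "")` are equal; with a delimiter such as `"|"` they differ. A delimiter that can also occur inside a field would still leave some ambiguity.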
AmatyaAvadhanula	5fae20d287	Do not allocate ids conflicting with existing segment ids (#16380 ) * Do not allocate ids conflicting with existing segment ids * Parameterized tests * Add doc and retain test for coverage	2024-05-03 19:09:48 +05:30
Jan Werner	b16401323b	update dependencies to address CVEs (#16374 ) update dependencies to address new batch of CVEs: - Azure POM from 1.2.19 to 1.2.23 to update transitive dependency nimbus-jose-jwt to address: CVE-2023-52428 - commons-configuration2 from 2.8.0 to 2.10.1 to address: CVE-2024-29131 CVE-2024-29133 - bcpkix-jdk18on from 1.76 to 1.78.1 to address: CVE-2024-30172 CVE-2024-30171 CVE-2024-29857	2024-05-02 21:35:21 -07:00
Abhishek Radhakrishnan	3717554e16	Web console changes for https://github.com/apache/druid/pull/16288 (#16379 ) Adds a text box for delta filter that can accept an optional json object.	2024-05-02 15:50:17 -07:00
Vadim Ogievetsky	39ada8b9ad	Web console: surface more info on the supervisor view (#16318 ) * add rate and stats * better tabs * detail * add recent errors * update tests * don't let people hide the actions column because why * don't sort on actions * better way to agg * add timeouts * show error only once * fix tests and Explain showing up * only consider active tasks * refresh * fix tests * better formatting	2024-05-02 08:50:27 -07:00
AmatyaAvadhanula	b7ae78296a	Allow different timechunk lock types to coexist in a task group (#16369 ) Description: All the streaming ingestion tasks for a given datasource share the same lock for a given interval. Changing lock types in the supervisor can lead to segment allocation errors due to lock conflicts for the new tasks while the older tasks are still running. Fix: Allow locks of different types (EXCLUSIVE, SHARED, APPEND, REPLACE) to co-exist if they have the same interval and the same task group.	2024-05-02 19:54:43 +05:30
Kashif Faraz	e5b40b0b8c	Miscellaneous cleanup of load queue references (#16367 ) Changes: - Rename `DataSegmentChangeRequestAndStatus` to `DataSegmentChangeResponse` - Rename `SegmentLoadDropHandler.Status` to `SegmentChangeStatus` - Remove method `CoordinatorRunStats.getSnapshotAndReset()` as it was used only in load queue peon implementations. Using an atomic reference is much simpler. - Remove `ServerTestHelper.MAPPER`. Use existing `TestHelper.makeJsonMapper()` instead.	2024-05-02 15:59:50 +05:30
Zoltan Haindrich	2d0e86cbdc	Use quidem to run tests (#16249 ) * test scoped jdbc driver for druidtest:/// backed DruidAvaticaTestDriver ** DecoupledTestConfig is used inside the URI - this will make it possible to attach to existing things more easily * DruidQuidemTestBase can be used to create module level set of quidem tests * added quidem commands: !convertedPlan, !logicalPlan, !druidPlan, !nativePlan ** for these I've used some values of the Hook which was there in calcite * there are some shortcuts with proxies(they are only used during testing) - we can probably remove those later	2024-05-02 02:12:42 -04:00
Gian Merlino	5d1950d451	MSQ controller: Support in-memory shuffles; towards JVM reuse. (#16168 ) * MSQ controller: Support in-memory shuffles; towards JVM reuse. This patch contains two controller changes that make progress towards a lower-latency MSQ. First, support for in-memory shuffles. The main feature of in-memory shuffles, as far as the controller is concerned, is that they are not fully buffered. That means that whenever a producer stage uses in-memory output, its consumer must run concurrently. The controller determines which stages run concurrently, and when they start and stop. "Leapfrogging" allows any chain of sort-based stages to use in-memory shuffles even if we can only run two stages at once. For example, in a linear chain of stages 0 -> 1 -> 2 where all do sort-based shuffles, we can use in-memory shuffling for each one while only running two at once. (When stage 1 is done reading input and about to start writing its output, we can stop 0 and start 2.) 1) New OutputChannelMode enum attached to WorkOrders that tells workers whether stage output should be in memory (MEMORY), or use local or durable storage. 2) New logic in the ControllerQueryKernel to determine which stages can use in-memory shuffling (ControllerUtils#computeStageGroups) and to launch them at the appropriate time (ControllerQueryKernel#createNewKernels). 3) New "doneReadingInput" method on Controller (passed down to the stage kernels) which allows stages to transition to POST_READING even if they are not gathering statistics. This is important because it enables "leapfrogging" for HASH_LOCAL_SORT shuffles, and for GLOBAL_SORT shuffles with 1 partition. 4) Moved result-reading from ControllerContext#writeReports to new QueryListener interface, which ControllerImpl feeds results to row-by-row while the query is still running. Important so we can read query results from the final stage using an in-memory channel. 
5) New class ControllerQueryKernelConfig holds configs that control kernel behavior (such as whether to pipeline, maximum number of concurrent stages, etc). Generated by the ControllerContext. Second, a refactor towards running workers in persistent JVMs that are able to cache data across queries. This is helpful because I believe we'll want to reuse JVMs and cached data for latency reasons. 1) Move creation of WorkerManager and TableInputSpecSlicer to the ControllerContext, rather than ControllerImpl. This allows managing workers and work assignment differently when JVMs are reusable. 2) Lift the Controller Jersey resource out from ControllerChatHandler to a reusable resource. 3) Move memory introspection to a MemoryIntrospector interface, and introduce ControllerMemoryParameters that uses it. This makes it easier to run MSQ in process types other than Indexer and Peon. Both of these areas will have follow-ups that make similar changes on the worker side. * Address static checks. * Address static checks. * Fixes. * Report writer tests. * Adjustments. * Fix reports. * Review updates. * Adjust name. * Small changes.	2024-04-30 21:30:27 -07:00
Kashif Faraz	51104e8bb3	Docs: Remove references to Zk-based segment loading (#16360 ) Follow up to #15705 Changes: - Remove references to ZK-based segment loading in the docs - Fix doc for existing config `druid.coordinator.loadqueuepeon.http.repeatDelay`	2024-05-01 08:06:00 +05:30
John Gozde	834b0eddeb	web-console: ACE editor refactoring (#16359 ) * Move druid-sql completions to dsql mode * Use font-size 12 * Convert ace-modes to typescript * Move aceCompleters to class member * Use namespace imports	2024-04-30 11:53:39 -07:00
AmatyaAvadhanula	42e99bf912	Add new index on datasource and task_allocator_id for pending segments (#16355 ) * Add pending segments index on datasource and task_allocator_id * Use both datasource and task_allocator_id in queries	2024-04-30 15:48:16 +05:30
Laksh Singla	e695e52d3f	Improve code flow in the First/Last vector aggregators and unify the numeric aggregators with the String implementations (#16230 ) This PR fixes the first and last vector aggregators and improves their readability. Following changes are introduced The folding is broken in the vectorized versions. We consider time before checking the folded object. If the numerical aggregator gets passed any other object type for some other reason (like String), then the aggregator considers it to be folded, even though it shouldn’t be. We should convert these objects to the desired type, and aggregate them properly. The aggregators must properly use generics. This would minimize the ClassCastException issues that can happen with mixed segment types. We are unifying the string first/last aggregators with numeric versions as well. The aggregators must aggregate null values (https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/aggregation/first/StringFirstLastUtils.java#L55-L56 ). The aggregator should only ignore pairs with time == null, and not value == null Time nullity is ignored when trying to vectorize the data. String versions initialized with DateTimes.MIN that is equal to Long.MIN / 2. This can cause incorrect results in case the user enters a custom time column. NOTE: This is still present because it would require a larger refactor in all of the versions. There is a difference in what users might expect from the results because the code flow is changed (for example, the direction of the for loops, etc), however, this will only change the results, and not the contract set by first/last aggregators, which is that if multiple values have the same timestamp, then any of them can get picked. 
If the column is non-existent, the users might expect a change in the timestamp from DateTime.MAX to Long.MAX, because the code incorrectly used DateTime.MAX to initialize the aggregator, however, in case of a custom timestamp column, this might not be the case. The SQL query might be prohibited from using any Long since it requires a cast to the timestamp function that can fail, but AFAICT native queries don't have such limitations.	2024-04-30 15:13:14 +05:30
Laksh Singla	26d63e7b65	Prevent joining on nested arrays and complex types (#16349 ) #16068 modified DimensionHandlerUtils to accept complex types to be dimensions. This had an unintended side effect of allowing complex types to be joined upon (which wasn't guarded explicitly, it doesn't work). This PR modifies the IndexedTable to reject building the index on the complex types to prevent joining on complex types. The PR adds back the check in the same place, explicitly.	2024-04-30 11:36:53 +05:30
Adarsh Sanjeev	fb63520de9	Add tests for ProcessorManager (#16327 ) * Add tests for ProcessorManager	2024-04-30 09:35:26 +05:30
Alberic Liu	736a2ab7c1	update code style for task type (#16343 ) * update code style for task type * address the comments	2024-04-29 14:42:55 -07:00
Kashif Faraz	aa46314971	Remove usage of skife from DruidCoordinatorConfig (#15705 ) * Remove usage of skife from DruidCoordinatorConfig * Remove old config class * Address static checks * Fix tests * Remove unnecessary mocks * Fix config typos * Fix config condition * Fix test, spotbug check * Move validation to DruidCoordinatorConfig * Move DruidCoordinatorConfig to different package * Fix validation of killunusedconfig * Simplify and fix KillSupervisorsCustomDuty * Address review comments * Fix new tests * Add KillUnusedSchemasConfig * Remove KillUnusedSchemasConfig * Minor renames	2024-04-29 11:37:13 -07:00
Abhishek Radhakrishnan	1d7595f3f7	Support for filters in the Druid Delta Lake connector (#16288 ) * Delta Lake support for filters. * Updates * cleanup comments * Docs * Remove Enclosed runner * Rename * Cleanup test * Serde test for the Delta input source and fix jackson annotation. * Updates and docs. * Update error messages to be clearer * Fixes * Handle NumberFormatException to provide a nicer error message. * Apply suggestions from code review Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Doc fixes based on feedback * Yes -> yes in docs; reword slightly. * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Documentation, javadoc and more updates. * Not with an or expression end-to-end test. * Break up =, >, >=, <, <= into their own types instead of sub-classing. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2024-04-29 11:31:36 -07:00
Adithya Chakilam	f8015eb02a	Add config lagAggregate to LagBasedAutoScalerConfig (#16334 ) Changes: - Add new config `lagAggregate` to `LagBasedAutoScalerConfig` - Add field `aggregateForScaling` to `LagStats` - Use the new field/config to determine which aggregate to use to compute lag - Remove method `Supervisor.computeLagForAutoScaler()`	2024-04-29 22:20:41 +05:30
Kashif Faraz	89ec0da5c5	Disable upload of coverage report to codecov.io (#16347 )	2024-04-29 21:04:55 +05:30
Akshat Jain	9d2cae40c3	Add support for selective loading of lookups in the task layer (#16328 ) Changes: - Add `LookupLoadingSpec` to support 3 modes of lookup loading: ALL, NONE, ONLY_REQUIRED - Add method `Task.getLookupLoadingSpec()` - Do not load any lookups for `KillUnusedSegmentsTask`	2024-04-29 07:19:59 +05:30
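The three lookup loading modes named in the commit above (ALL, NONE, ONLY_REQUIRED) can be pictured as a filter over the set of available lookups. The sketch below only illustrates that idea; the real `LookupLoadingSpec` class and `Task.getLookupLoadingSpec()` may be shaped quite differently, and `LookupLoadingSketch`, `Mode`, and `lookupsToLoad` are hypothetical names.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class LookupLoadingSketch {
    enum Mode { ALL, NONE, ONLY_REQUIRED }

    /** Returns the subset of available lookups that a task in the given mode would load. */
    static Set<String> lookupsToLoad(Mode mode, Set<String> available, Set<String> required) {
        switch (mode) {
            case ALL:
                return new HashSet<>(available);
            case NONE:
                // For example, a task that never reads lookups.
                return Collections.emptySet();
            case ONLY_REQUIRED:
                Set<String> result = new HashSet<>(available);
                result.retainAll(required);
                return result;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

Under this reading, a task such as `KillUnusedSegmentsTask` would correspond to `Mode.NONE`.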
Bünyamin	9aef8e02ef	Expand coverage for default mapping (#16340 )	2024-04-27 17:39:07 +05:30
Gian Merlino	db82adcdfd	SCALAR_IN_ARRAY: Optimization and behavioral follow-ups. (#16311 ) * Four changes to scalar_in_array as follow-ups to #16306: 1) Align behavior for `null` scalars to the behavior of the native `in` and `inType` filters: return `true` if the array itself contains null, else return `null`. 2) Rename the class to more closely match the function name. 3) Add a specialization for constant arrays, where we build a `HashSet`. 4) Use `castForEqualityComparison` to properly handle cross-type comparisons. Additional tests verify comparisons between LONG and DOUBLE are now handled properly. * Fix spelling. * Adjustments from review.	2024-04-26 16:01:17 -07:00
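Points 1 and 3 of the commit above (null-scalar behavior and the `HashSet` specialization for constant arrays) can be sketched as follows. This is an illustration under assumptions, not the Druid implementation: `ScalarInArraySketch` and `eval` are hypothetical names, a Java `null` result stands in for SQL NULL, non-null scalars are assumed to return a plain membership result, and the cross-type handling from point 4 (`castForEqualityComparison`) is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ScalarInArraySketch {
    // Built once for a constant array, so each row costs one hash lookup.
    private final Set<Object> matchValues;

    ScalarInArraySketch(Object... constantArray) {
        this.matchValues = new HashSet<>(Arrays.asList(constantArray));
    }

    /** Returns TRUE, FALSE, or null (unknown) for the given scalar. */
    Boolean eval(Object scalar) {
        if (scalar == null) {
            // Null scalar: true if the array itself contains null, else null.
            return matchValues.contains(null) ? Boolean.TRUE : null;
        }
        return matchValues.contains(scalar);
    }
}
```

For example, `new ScalarInArraySketch("a", "b").eval(null)` yields `null`, while `new ScalarInArraySketch("a", null).eval(null)` yields `true`.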
Andreas Maechler	9cd1890855	Fix log count (#16341 )	2024-04-26 14:04:19 -07:00
Charles Smith	4e3cb9c251	change ownership of /opt/shared to druid (#16253 )	2024-04-26 21:16:00 +05:30
zachjsh	365cd7e8e7	INSERT/REPLACE can omit clustering when catalog has default (#16260 ) * * fix * * fix * * address review comments * * fix * * simplify tests * * fix complex type nullability issue * * implement and add tests * * address review comments * * address test review comments * * fix checkstyle * * fix dependencies * * all tests passing * * cleanup * * remove unneeded code * * remove unused dependency * * fix checkstyle	2024-04-26 10:19:45 -04:00
Gian Merlino	64a6fc8fc0	JSONFlattenerMaker: Speed up charsetFix. (#16212 ) JSON parsing has this function "charsetFix" that fixes up strings so they can round-trip through UTF-8 encoding without loss of fidelity. It was originally introduced to fix a bug where strings could be sorted, encoded, then decoded, and the resulting decoded strings could end up no longer in sorted order (due to character swaps during the encode operation). The code has been in place for some time, and only applies to JSON. I am not sure if it needs to apply to other formats; it's certainly more difficult to get broken strings from other formats. It's easy in JSON because you can write a JSON string like "foo\uD900". At any rate, this patch does not revisit whether charsetFix should be applied to all formats. It merely optimizes it for the JSON case. The function works by using CharsetEncoder.canEncode, which is a relatively slow method (just as expensive as actually encoding). This patch adds a short-circuit to skip canEncode if all chars in a string are in the basic multilingual plane (i.e. if no chars are surrogates).	2024-04-26 10:46:07 +05:30
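The short-circuit described in the commit above can be sketched like this: scan for surrogate chars first, and only fall through to the slow `CharsetEncoder.canEncode` check when one is present. This is a minimal sketch, not the JSONFlattenerMaker code. `CharsetFixSketch` and `hasSurrogate` are hypothetical names, and the fallback (round-tripping through UTF-8 bytes so that an unencodable char is replaced) is an assumption about what the fix-up does.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CharsetFixSketch {
    /** True if any char is a surrogate, i.e. the string is not purely basic-multilingual-plane chars. */
    static boolean hasSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    static String charsetFix(String s) {
        if (s == null || !hasSurrogate(s)) {
            // Fast path: no surrogates, so the string always encodes cleanly.
            return s;
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        if (encoder.canEncode(s)) {
            // Well-formed surrogate pairs are fine as they are.
            return s;
        }
        // Assumed fallback: getBytes replaces malformed input, making the string round-trip safe.
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }
}
```

The fast path is a single pass of cheap char checks, so strings without surrogates never pay for an encoder at all.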
Adarsh Sanjeev	9a2d7c28bc	Prepare master branch for 31.0.0 release (#16333 )	2024-04-26 09:22:43 +05:30
Arun Ramani	126a0c219a	Surface lock revocation exceptions in task status (#16325 )	2024-04-26 08:39:44 +05:30
Kashif Faraz	4b6748bdc9	Update default value of useMaxMemoryEstimates for Hadoop jobs (#16280 )	2024-04-26 08:07:21 +05:30
Gian Merlino	68d6e682e8	Fix TimeBoundary planning when filters require virtual columns. (#16337 ) The timeBoundary query does not support virtual columns, so we should avoid it if the query requires virtual columns.	2024-04-25 16:49:40 -07:00
AmatyaAvadhanula	31eee7d51e	Check for handoff of upgraded segments (#16162 ) Changes: 1) Check for handoff of upgraded realtime segments. 2) Drop sink only when all associated realtime segments have been abandoned. 3) Delete pending segments upon commit to prevent unnecessary upgrades and partition space exhaustion when a concurrent replace happens. This also prevents potential data duplication. 4) Register pending segment upgrade only on those tasks to which the segment is associated.	2024-04-25 22:03:38 +05:30
Jakub Matyszewski	5061507541	pac4j: add UserProfile attributes to AuthenticationResult context (#16109 ) I'm adding OIDC context to the AuthenticationResult returned by pac4j extension. I wanted to use this context as input in OpenPolicyAgent authorization. Since AuthenticationResult already accepts context as a parameter it felt okay to pass the profile attributes there.	2024-04-25 12:10:12 +05:30
Zoltan Haindrich	9c0bd56f5b	Make QueryComponentSupliers independent from test classes (#16275 )	2024-04-25 02:12:07 -04:00
Atul Mohan	77333e56fa	Docs: Add missing kafka emitter config (#16332 )	2024-04-25 10:37:14 +05:30
Gian Merlino	8a5cc976a9	ArrayOfDoublesSketchBuildAggregator: Fix NPE in get() for empty sketch. (#16330 ) Fixes a bug introduced in #16296, where the sketch might not be initialized if get() is called without calling aggregate(). Also adds a test for this case.	2024-04-25 00:59:59 -04:00
Bünyamin	e74da6a6b6	Add new metrics for prometheus emitter (#16329 )	2024-04-25 07:16:24 +05:30
Katya Macedo	ceb6646dec	Add supervisor actions (#16276 ) * Add supervisor actions * Update text * Update text * Update after review * Update after review	2024-04-24 13:14:01 -07:00
Laksh Singla	6bca406d31	Grouping on complex columns aka unifying GroupBy strategies (#16068 ) Users can pass complex types as dimensions to the group by queries. For example: SELECT nested_col1, count(*) FROM foo GROUP BY nested_col1	2024-04-24 23:00:14 +05:30
Rishabh Singh	e30790e013	Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building (#15817 ) Issue: #14989 The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.	2024-04-24 22:22:53 +05:30
Sree Charan Manamala	080476f9ea	WINDOWING - Fix 2 nodes with same digest causing mapping issue (#16301 ) Fixes the mapping issue in window functions where 2 nodes get the same reference.	2024-04-24 16:45:02 +05:30
Gian Merlino	274ccbfd85	Reset buffer aggregators when resetting Groupers. (#16296 ) Buffer aggregators can contain some cached objects within them, such as Memory references or HLL Unions. Prior to this patch, various Grouper implementations were not releasing this state when resetting their own internal state, which could lead to excessive memory use. This patch renames AggregatorAdapater#close to "reset", and updates Grouper implementations to call this reset method whenever they reset their internal state. The base method on BufferAggregator and VectorAggregator remains named "close", for compatibility with existing extensions, but the contract is adjusted to say that the aggregator may be reused after the method is called. All existing implementations in core already adhere to this new contract, except for the ArrayOfDoubles build flavors, which are updated in this patch to adhere. Additionally, this patch harmonizes buffer sketch helpers to call their clear method "clear" rather than a mix of "clear" and "close". (Others were already using "clear".)	2024-04-24 05:39:24 -04:00
Kashif Faraz	1dabb02843	Fix `ForkingTaskRunnerTest` (#16323 ) Changes: - Use non-static fields to track task counts in `ForkingTaskRunner` - Update assertions in `ForkingTaskRunnerTest` to ensure that the tests are idempotent	2024-04-24 14:05:05 +05:30
Tim Williamson	4bdc1890f7	Improve worst-case performance of LIKE filters by 20x (#16153 ) * Expected-linear-time LIKE `LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids backtracking. Technically, the algorithm runs in `O(nm)`, where `n` is the length of the string to match and `m` is the length of the pattern. In practice, it should run in linear time: essentially as fast as `String.indexOf()` can search for the next match. Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.7x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more. Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing. 
``` Benchmark (cardinality) Mode Cnt Before Score Error After Score Error Units Before / After LikeFilterBenchmark.matchBoundPrefix 1000 avgt 10 6.686 ± 0.026 6.765 ± 0.087 us/op 0.99x LikeFilterBenchmark.matchBoundPrefix 100000 avgt 10 163.936 ± 1.589 140.014 ± 0.563 us/op 1.17x LikeFilterBenchmark.matchBoundPrefix 1000000 avgt 10 1235.259 ± 7.318 1165.330 ± 9.300 us/op 1.06x LikeFilterBenchmark.matchLikeContains 1000 avgt 10 255.074 ± 1.530 130.212 ± 3.314 us/op 1.96x LikeFilterBenchmark.matchLikeContains 100000 avgt 10 34789.639 ± 210.219 18563.644 ± 100.030 us/op 1.87x LikeFilterBenchmark.matchLikeContains 1000000 avgt 10 287265.302 ± 1790.957 164684.778 ± 317.698 us/op 1.74x LikeFilterBenchmark.matchLikeEquals 1000 avgt 10 0.410 ± 0.003 0.399 ± 0.001 us/op 1.03x LikeFilterBenchmark.matchLikeEquals 100000 avgt 10 0.793 ± 0.005 0.719 ± 0.003 us/op 1.10x LikeFilterBenchmark.matchLikeEquals 1000000 avgt 10 0.864 ± 0.004 0.839 ± 0.005 us/op 1.03x LikeFilterBenchmark.matchLikeKiller 1000 avgt 10 3077.629 ± 7.928 103.714 ± 2.417 us/op 29.67x LikeFilterBenchmark.matchLikeKiller 100000 avgt 10 311048.049 ± 13466.911 14777.567 ± 70.242 us/op 21.05x LikeFilterBenchmark.matchLikeKiller 1000000 avgt 10 3055855.099 ± 18387.839 92476.621 ± 1198.255 us/op 33.04x LikeFilterBenchmark.matchLikePrefix 1000 avgt 10 6.711 ± 0.035 6.653 ± 0.046 us/op 1.01x LikeFilterBenchmark.matchLikePrefix 100000 avgt 10 161.535 ± 0.574 163.740 ± 0.833 us/op 0.99x LikeFilterBenchmark.matchLikePrefix 1000000 avgt 10 1255.696 ± 5.207 1201.378 ± 3.466 us/op 1.05x LikeFilterBenchmark.matchRegexContains 1000 avgt 10 467.736 ± 2.546 481.431 ± 5.647 us/op 0.97x LikeFilterBenchmark.matchRegexContains 100000 avgt 10 64871.766 ± 223.341 65483.992 ± 391.249 us/op 0.99x LikeFilterBenchmark.matchRegexContains 1000000 avgt 10 482906.004 ± 2003.583 477195.835 ± 3094.605 us/op 1.01x LikeFilterBenchmark.matchRegexKiller 1000 avgt 10 8071.881 ± 18.026 8052.322 ± 17.336 us/op 1.00x 
LikeFilterBenchmark.matchRegexKiller 100000 avgt 10 1120094.520 ± 2428.172 808321.542 ± 2411.032 us/op 1.39x LikeFilterBenchmark.matchRegexKiller 1000000 avgt 10 8096745.012 ± 40782.747 8114114.896 ± 43250.204 us/op 1.00x LikeFilterBenchmark.matchRegexPrefix 1000 avgt 10 170.843 ± 1.095 175.924 ± 1.144 us/op 0.97x LikeFilterBenchmark.matchRegexPrefix 100000 avgt 10 17785.280 ± 116.813 18708.888 ± 61.857 us/op 0.95x LikeFilterBenchmark.matchRegexPrefix 1000000 avgt 10 174415.586 ± 1827.478 173190.799 ± 949.224 us/op 1.01x LikeFilterBenchmark.matchSelectorEquals 1000 avgt 10 0.411 ± 0.003 0.416 ± 0.002 us/op 0.99x LikeFilterBenchmark.matchSelectorEquals 100000 avgt 10 0.728 ± 0.003 0.739 ± 0.003 us/op 0.99x LikeFilterBenchmark.matchSelectorEquals 1000000 avgt 10 0.842 ± 0.002 0.879 ± 0.007 us/op 0.96x ``` * Take into account whether druid.generic.useDefaultValueForNull is set in LikeDimFilterTest assertions. * Attempt to placate CodeQL. * Fix handling of multi-pattern suffixes. * Expected-linear-time LIKE `LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids the catastrophic backtracking, converting the `LIKE` pattern into a list of `java.util.regex.Pattern` by splitting on the `%` wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop using `Matcher.find()` to progress through the string is used. 
Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.15x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more. Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing. ``` Benchmark (cardinality) Mode Cnt Before Score Error After Score Error Units Before/After LikeFilterBenchmark.matchBoundPrefix 1000 avgt 10 5.410 ± 0.010 5.582 ± 0.004 us/op 0.97x LikeFilterBenchmark.matchBoundPrefix 100000 avgt 10 140.920 ± 0.306 141.082 ± 0.391 us/op 1.00x LikeFilterBenchmark.matchBoundPrefix 1000000 avgt 10 1082.762 ± 1.070 1171.407 ± 1.628 us/op 0.92x LikeFilterBenchmark.matchLikeComplexContains 1000 avgt 10 221.572 ± 0.228 183.742 ± 0.210 us/op 1.21x LikeFilterBenchmark.matchLikeComplexContains 100000 avgt 10 25461.362 ± 21.481 17373.828 ± 42.577 us/op 1.47x LikeFilterBenchmark.matchLikeComplexContains 1000000 avgt 10 221075.917 ± 919.238 177454.683 ± 506.420 us/op 1.25x LikeFilterBenchmark.matchLikeContains 1000 avgt 10 283.015 ± 0.219 218.835 ± 3.126 us/op 1.29x LikeFilterBenchmark.matchLikeContains 100000 avgt 10 30202.910 ± 32.697 26713.488 ± 49.525 us/op 1.13x LikeFilterBenchmark.matchLikeContains 1000000 avgt 10 284661.411 ± 130.324 243381.857 ± 540.143 us/op 1.17x LikeFilterBenchmark.matchLikeEquals 1000 avgt 10 0.386 ± 0.001 0.380 ± 0.001 us/op 1.02x LikeFilterBenchmark.matchLikeEquals 100000 avgt 10 0.670 ± 0.001 0.705 ± 0.002 us/op 0.95x LikeFilterBenchmark.matchLikeEquals 1000000 avgt 10 0.839 ± 0.001 0.796 ± 0.001 us/op 1.05x LikeFilterBenchmark.matchLikeKiller 1000 avgt 10 4882.099 ± 7.953 170.142 ± 0.494 us/op 28.69x 
LikeFilterBenchmark.matchLikeKiller 100000 avgt 10 524122.010 ± 390.170 19461.637 ± 117.090 us/op 26.93x LikeFilterBenchmark.matchLikeKiller 1000000 avgt 10 5121795.377 ± 4176.052 181162.978 ± 368.443 us/op 28.27x LikeFilterBenchmark.matchLikePrefix 1000 avgt 10 5.708 ± 0.005 5.677 ± 0.011 us/op 1.01x LikeFilterBenchmark.matchLikePrefix 100000 avgt 10 141.853 ± 0.554 108.313 ± 0.330 us/op 1.31x LikeFilterBenchmark.matchLikePrefix 1000000 avgt 10 1199.148 ± 1.298 1153.297 ± 1.575 us/op 1.04x LikeFilterBenchmark.matchLikeSuffix 1000 avgt 10 256.020 ± 0.283 196.339 ± 0.564 us/op 1.30x LikeFilterBenchmark.matchLikeSuffix 100000 avgt 10 29917.931 ± 28.218 21450.997 ± 20.341 us/op 1.39x LikeFilterBenchmark.matchLikeSuffix 1000000 avgt 10 241225.193 ± 465.824 194034.292 ± 362.312 us/op 1.24x LikeFilterBenchmark.matchRegexComplexContains 1000 avgt 10 119.597 ± 0.635 135.550 ± 0.697 us/op 0.88x LikeFilterBenchmark.matchRegexComplexContains 100000 avgt 10 13089.670 ± 13.738 13766.712 ± 12.802 us/op 0.95x LikeFilterBenchmark.matchRegexComplexContains 1000000 avgt 10 130822.830 ± 1624.048 131076.029 ± 1636.811 us/op 1.00x LikeFilterBenchmark.matchRegexContains 1000 avgt 10 573.273 ± 0.421 615.399 ± 0.633 us/op 0.93x LikeFilterBenchmark.matchRegexContains 100000 avgt 10 57259.313 ± 162.747 62900.380 ± 44.746 us/op 0.91x LikeFilterBenchmark.matchRegexContains 1000000 avgt 10 571335.768 ± 2822.776 542536.982 ± 780.290 us/op 1.05x LikeFilterBenchmark.matchRegexKiller 1000 avgt 10 11525.499 ± 8.741 11061.791 ± 21.746 us/op 1.04x LikeFilterBenchmark.matchRegexKiller 100000 avgt 10 1170414.723 ± 766.160 1144437.291 ± 886.263 us/op 1.02x LikeFilterBenchmark.matchRegexKiller 1000000 avgt 10 11507668.302 ± 11318.176 110381620.014 ± 10707.974 us/op 1.11x LikeFilterBenchmark.matchRegexPrefix 1000 avgt 10 156.460 ± 0.097 155.217 ± 0.431 us/op 1.01x LikeFilterBenchmark.matchRegexPrefix 100000 avgt 10 15056.491 ± 23.906 15508.965 ± 763.976 us/op 0.97x LikeFilterBenchmark.matchRegexPrefix 
1000000 avgt 10 154416.563 ± 473.108 153737.912 ± 273.347 us/op 1.00x LikeFilterBenchmark.matchRegexSuffix 1000 avgt 10 610.684 ± 0.462 590.352 ± 0.334 us/op 1.03x LikeFilterBenchmark.matchRegexSuffix 100000 avgt 10 53196.517 ± 78.155 59460.261 ± 56.934 us/op 0.89x LikeFilterBenchmark.matchRegexSuffix 1000000 avgt 10 536100.944 ± 440.353 550098.917 ± 740.464 us/op 0.97x LikeFilterBenchmark.matchSelectorEquals 1000 avgt 10 0.390 ± 0.001 0.366 ± 0.001 us/op 1.07x LikeFilterBenchmark.matchSelectorEquals 100000 avgt 10 0.724 ± 0.001 0.714 ± 0.001 us/op 1.01x LikeFilterBenchmark.matchSelectorEquals 1000000 avgt 10 0.826 ± 0.001 0.847 ± 0.001 us/op 0.98x ```	2024-04-23 22:45:23 -07:00
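The greedy, backtracking-free approach described in the commit above can be illustrated with a simplified matcher. The sketch handles only the `%` wildcard (no `_` and no escape characters) and searches with `String.indexOf`, so it is not the `LikeDimFilter` implementation; `GreedyLikeSketch` and `like` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class GreedyLikeSketch {
    /** Matches value against a LIKE pattern in which '%' is the only wildcard. */
    static boolean like(String value, String pattern) {
        // Split on '%', keeping empty segments so leading/trailing wildcards are visible.
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= pattern.length(); i++) {
            if (i == pattern.length() || pattern.charAt(i) == '%') {
                parts.add(pattern.substring(start, i));
                start = i + 1;
            }
        }
        if (parts.size() == 1) {
            return value.equals(pattern); // no wildcard at all
        }
        String prefix = parts.get(0);
        String suffix = parts.get(parts.size() - 1);
        int pos = prefix.length();
        int end = value.length() - suffix.length();
        if (end < pos || !value.startsWith(prefix) || !value.endsWith(suffix)) {
            return false;
        }
        // Greedy scan: take the leftmost occurrence of each middle segment and never go back.
        for (int k = 1; k < parts.size() - 1; k++) {
            String part = parts.get(k);
            int found = value.indexOf(part, pos);
            if (found < 0 || found + part.length() > end) {
                return false;
            }
            pos = found + part.length();
        }
        return true;
    }
}
```

Taking the leftmost match for each middle segment is safe because it leaves the most room for the segments that follow, which is why no backtracking is needed. A "killer" pattern such as `%%%%x` is rejected by the single `endsWith` check when the value does not end in `x`.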
Parth Agrawal	f1d24c868f	[CVE Fixes] Update version of Nimbus.jose.jwt (#16320 ) * Update version of nimbus.jose.jwt.version * update licenses.yaml	2024-04-23 15:11:54 +05:30
Charles Smith	65412f80ab	remove additional column marks (#16319 )	2024-04-22 19:41:54 -07:00

14188 Commits

All Branches