druid

Commit Graph

Author	SHA1	Message	Date
Clint Wylie	5ce4aab3b8	update ARRAY_OVERLAP to plan with ArrayContainsElement for ARRAY columns (#15451 ) Updates ARRAY_OVERLAP to use the same ArrayContainsElement filter added in #15366 when filtering ARRAY typed columns so that it can also use indexes like ARRAY_CONTAINS.	2023-11-30 10:05:20 +05:30
Clint Wylie	0516d0dae4	simplify IncrementalIndex since group-by v1 has been removed (#15448 )	2023-11-29 14:46:16 -08:00
Clint Wylie	64fcb32bcf	add native 'array contains element' filter (#15366 ) * add native arrayContainsElement filter to use array column element indexes	2023-11-29 03:33:00 -08:00
Clint Wylie	97623b408c	add optional 'castToType' parameter to 'auto' column schema (#15417 ) * auto but.. with an expected type	2023-11-28 17:19:23 -08:00
Rishabh Singh	8c802e4c9b	Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency (#14985 ) In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal. To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.	2023-11-04 19:33:25 +05:30
Laksh Singla	5f86072456	Prepare master for Druid 29 (#15121 ) Prepare master for Druid 29	2023-10-11 10:33:45 +05:30
Xavier Léauté	adef2069b1	Make unit tests pass with Java 21 (#15014 ) This change updates dependencies as needed and fixes tests to remove code incompatible with Java 21 As a result all unit tests now pass with Java 21. * update maven-shade-plugin to 3.5.0 and follow-up to #15042 * explain why we need to override configuration when specifying outputFile * remove configuration from dependency management in favor of explicit overrides in each module. * update to mockito to 5.5.0 for Java 21 support when running with Java 11+ * continue using latest mockito 4.x (4.11.0) when running with Java 8 * remove need to mock private fields * exclude incorrectly declared mockito dependency from pac4j-oidc * remove mocking of ByteBuffer, since sealed classes can no longer be mocked in Java 21 * add JVM options workaround for system-rules junit plugin not supporting Java 18+ * exclude older versions of byte-buddy from assertj-core * fix for Java 19 changes in floating point string representation * fix missing InitializedNullHandlingTest * update easymock to 5.2.0 for Java 21 compatibility * update animal-sniffer-plugin to 1.23 * update nl.jqno.equalsverifier to 3.15.1 * update exec-maven-plugin to 3.1.0	2023-10-03 22:41:21 -07:00
Soumyava	8088a763a6	Vectorize earliest aggregator for both numeric and string types (#14408 ) * Vectorizing earliest for numeric * Vectorizing earliest string aggregator * checkstyle fix * Removing unnecessary exceptions * Ignoring tests in MSQ as earliest is not supported for numeric there * Fixing benchmarks * Updating tests as MSQ does not support earliest for some cases * Addressing review comments by adding the following: 1. Checking capabilities first before creating selectors 2. Removing mockito in tests for numeric first aggs 3. Removing unnecessary tests * Addressing issues for dictionary encoded single string columns where we can use the dictionary ids instead of the entire string * Adding a flag for multi value dimension selector * Addressing comments * 1 more change * Handling review comments part 1 * Handling review comments and correctness fix for latest_by when the time expression need not be in sorted order * Updating numeric first vector agg * Revert "Updating numeric first vector agg" This reverts commit `4291709901`. * Updating code for correctness issues * fixing an issue with latest agg * Adding more comments and removing an unnecessary check * Addressing null checks for tie selector and only vectorize false for quantile sketches	2023-09-05 08:41:42 -07:00
Clint Wylie	36e659a501	remove group-by v1 (#14866 ) * remove group-by v1 * docs * remove unused configs, fix test * fix test * adjustments * why not * adjust * review stuff	2023-08-23 12:44:06 -07:00
Clint Wylie	fb053c399c	consolidate json and auto indexers, remove v4 nested column serializer (#14456 )	2023-08-22 18:50:11 -07:00
Clint Wylie	194a9c9abc	set druid.expressions.useStrictBooleans to true by default (#14734 )	2023-08-22 00:19:56 -07:00
Kashif Faraz	c211dcc4b3	Clean up compaction logs on coordinator (#14875 ) Changes: - Move logic of `NewestSegmentFirstIterator.needsCompaction` to `CompactionStatus` to improve testability and readability - Capture the list of checks performed to determine if compaction is needed in a readable manner in `CompactionStatus.CHECKS` - Make `CompactionSegmentIterator` iterate over instances of `SegmentsToCompact` instead of `List<DataSegment>`. This allows use of the `umbrellaInterval` later. - Replace usages of `QueueEntry` with `SegmentsToCompact` - Move `SegmentsToCompact` out of `NewestSegmentFirstIterator` - Simplify `CompactionStatistics` - Reduce level of less important logs to debug - No change made to tests to ensure correctness	2023-08-21 17:30:41 +05:30
Clint Wylie	6b14dde50e	deprecate config-magic in favor of json configuration stuff (#14695 ) * json config based processing and broker merge configs to deprecate config-magic	2023-08-16 18:23:57 -07:00
dependabot[bot]	f0a79fa0e4	Bump org.apache.maven.plugins:maven-source-plugin from 2.2.1 to 3.3.0 (#14812 ) Bumps [org.apache.maven.plugins:maven-source-plugin](https://github.com/apache/maven-source-plugin) from 2.2.1 to 3.3.0. - [Commits](https://github.com/apache/maven-source-plugin/compare/maven-source-plugin-2.2.1...maven-source-plugin-3.3.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-source-plugin dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-08-15 07:39:19 -07:00
Clint Wylie	e5661a394c	refactor front-coded into static classes instead of using functional interfaces (#14572 ) * refactor front-coded into static classes instead of using functional interfaces * shared v0 static method instead of copy	2023-08-04 10:52:36 -07:00
imply-cheddar	748874405c	Minimize PostAggregator computations (#14708 ) * Minimize PostAggregator computations Since a change back in 2014, the topN query has been computing all PostAggregators on all intermediate responses from leaf nodes to brokers. This generates significant slow downs for queries with relatively expensive PostAggregators. This change rewrites the query that is pushed down to only have the minimal set of PostAggregators such that it is impossible for downstream processing to do too much work. The final PostAggregators are applied at the very end.	2023-08-04 00:04:31 +05:30
Kashif Faraz	22290fd632	Test: Simplify test impl of LoadQueuePeon (#14684 ) Changes - Rename `LoadQueuePeonTester` to `TestLoadQueuePeon` - Simplify `TestLoadQueuePeon` by removing dependency on `CuratorLoadQueuePeon` - Remove usages of mock peons in `LoadRuleTest` and use `TestLoadQueuePeon` instead	2023-07-28 16:14:23 +05:30
Clint Wylie	913416c669	add equality, null, and range filter (#14542 ) changes: * new filters that preserve match value typing to better handle filtering different column types * sql planner uses new filters by default in sql compatible null handling mode * remove isFilterable from column capabilities * proper handling of array filtering, add array processor to column processors * javadoc for sql test filter functions * range filter support for arrays, tons more tests, fixes * add dimension selector tests for mixed type roots * support json equality * rename semantic index maker thingys to mostly have plural names since they typically make many indexes, e.g. StringValueSetIndex -> StringValueSetIndexes * add cooler equality index maker, ValueIndexes * fix missing string utf8 index supplier * expression array comparator stuff	2023-07-18 12:15:22 -07:00
AmatyaAvadhanula	0412f40d36	Prepare master branch for next release, 28.0.0 (#14595 ) * Prepare master branch for next release, 28.0.0	2023-07-18 09:22:30 +05:30
Kashif Faraz	a6547febaf	Remove unused coordinator dynamic configs (#14524 ) After #13197 , several coordinator configs are now redundant as they are not being used anymore, neither with `smartSegmentLoading` nor otherwise. Changes: - Remove dynamic configs `emitBalancingStats`: balancer error stats are always emitted, debug stats can be logged by using `debugDimensions` - `useBatchedSegmentSampler`, `percentOfSegmentsToConsiderPerMove`: batched segment sampling is always used - Add test to verify deserialization with unknown properties - Update `CoordinatorRunStats` to always track stats, this can be optimized later.	2023-07-06 12:11:10 +05:30
Clint Wylie	277aaa5c57	remove druid.processing.columnCache.sizeBytes and CachingIndexed, combine string column implementations (#14500 ) * combine string column implementations changes: * generic indexed, front-coded, and auto string columns now all share the same column and index supplier implementations * remove CachingIndexed implementation, which I think is largely no longer needed by the switch of many things to directly using ByteBuffer, avoiding the cost of creating Strings * remove ColumnConfig.columnCacheSizeBytes since CachingIndexed was the only user	2023-07-02 19:37:15 -07:00
Gian Merlino	67fbd8e7fc	Add "stringEncoding" parameter to DataSketches HLL. (#11201 ) * Add "stringEncoding" parameter to DataSketches HLL. Builds on the concept from #11172 and adds a way to feed HLL sketches with UTF-8 bytes. This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default. Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches. * Fix benchmark. * Fix style issues, improve test coverage. * Put round back, to make IT updates easier. * Fix test. * Fix issue with filtered aggregators and add test. * Use DS native update(ByteBuffer) method. Improve test coverage. * Add another suppression. * Fix ITAutoCompactionTest. * Update benchmarks. * Updates. * Fix conflict. * Adjustments.	2023-06-30 12:45:55 -07:00
Kashif Faraz	50461c3bd5	Enable smartSegmentLoading on the Coordinator (#13197 ) This commit does a complete revamp of the coordinator to address problem areas: - Stability: Fix several bugs, add capabilities to prioritize and cancel load queue items - Visibility: Add new metrics, improve logs, revamp `CoordinatorRunStats` - Configuration: Add dynamic config `smartSegmentLoading` to automatically set optimal values for all segment loading configs such as `maxSegmentsToMove`, `replicationThrottleLimit` and `maxSegmentsInNodeLoadingQueue`. Changed classes: - Add `StrategicSegmentAssigner` to make assignment decisions for load, replicate and move - Add `SegmentAction` to distinguish between load, replicate, drop and move operations - Add `SegmentReplicationStatus` to capture current state of replication of all used segments - Add `SegmentLoadingConfig` to contain recomputed dynamic config values - Simplify classes `LoadRule`, `BroadcastRule` - Simplify the `BalancerStrategy` and `CostBalancerStrategy` - Add several new methods to `ServerHolder` to track loaded and queued segments - Refactor `DruidCoordinator` Impact: - Enable `smartSegmentLoading` by default. With this enabled, none of the following dynamic configs need to be set: `maxSegmentsToMove`, `replicationThrottleLimit`, `maxSegmentsInNodeLoadingQueue`, `useRoundRobinSegmentAssignment`, `emitBalancingStats` and `replicantLifetime`. - Coordinator reports richer metrics and produces cleaner and more informative logs - Coordinator uses an unlimited load queue for all serves, and makes better assignment decisions	2023-06-19 14:27:35 +05:30
imply-cheddar	cfd07a95b7	Errors take 3 (#14004 ) Introduce DruidException, an exception whose goal in life is to be delivered to a user. DruidException itself has javadoc on it to describe how it should be used. This commit both introduces the Exception and adjusts some of the places that are generating exceptions to generate DruidException objects instead, as a way to show how the Exception should be used. This work was a 3rd iteration on top of work that was started by Paul Rogers. I don't know if his name will survive the squash-and-merge, so I'm calling it out here and thanking him for starting on this.	2023-06-19 01:11:13 -07:00
Abhishek Radhakrishnan	a5e04d95a4	Add `TYPE_NAME` to the complex serde classes and replace the hardcoded names. (#14317 ) * Add TYPE_NAME to the serde classes and reuse them instead of hardcoded strings. * Static check fixes.	2023-05-23 00:54:47 -05:00
Clint Wylie	90ea192d9c	fix bugs with auto encoded long vector deserializers (#14186 ) This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so impacts Druid versions 0.22.0+. While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.	2023-05-01 11:49:27 +05:30
Clint Wylie	1aef72aa7e	Bump up the version in pom to 27.0.0 in preparation of release (#14051 )	2023-04-10 14:56:59 +05:30
Clint Wylie	b11c0bc249	smarter nested column index utilization (#13977 ) * smarter nested column index utilization changes: * adds skipValueRangeIndexScale and skipValuePredicateIndexScale to ColumnConfig (e.g. DruidProcessingConfig) available as system config via druid.processing.indexes.skipValueRangeIndexScale and druid.processing.indexes.skipValuePredicateIndexScale * NestedColumnIndexSupplier uses skipValueRangeIndexScale and skipValuePredicateIndexScale to multiply by the total number of rows to be processed to determine the threshold at which we should no longer consider using bitmap indexes because it will be too many operations * Default values for skipValueRangeIndexScale and skipValuePredicateIndexScale have been initially set to 0.08, but are separate to allow independent tuning * these are not documented on purpose yet because they are kind of hard to explain, the mainly exist to help conduct larger scale experiments than the jmh benchmarks used to derive the initial set of values * these changes provide a pretty sweet performance boost for filter processing on nested columns	2023-04-06 04:09:24 -07:00
Clint Wylie	e3211e3be0	actually backwards compatible frontCoded string encoding strategy (#13996 )	2023-03-31 02:24:12 -07:00
zachjsh	3bb67721f7	Allow for Input source security in SQL layer (#13989 ) This change introduces the concept of input source type security model, proposed in #13837.. With this change, this feature is only available at the SQL layer, but we will expand to native layer in a follow up PR. To enable this feature, the user must set the following property to true: druid.auth.enableInputSourceSecurity=true The default value for this property is false, which will continue the existing functionality of having the usage all external sources being authorized against the hardcoded resource action new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ When this config is enabled, the users will be required to be authorized for the following resource action new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}, Action.READ where {INPUT_SOURCE_TYPE} is the type of the input source being used;, http, inline, s3, etc.. Documentation has not been added for the feature as it is not complete at the moment, as we still need to enable this for the native layer in a follow up pr.	2023-03-29 22:15:33 -04:00
Clint Wylie	2219e68fa3	add backwards compat mode for frontCoded stringEncodingStrategy (#13988 )	2023-03-28 14:44:44 -07:00
Clint Wylie	ed57c5c853	better FrontCodedIndexed (#13854 ) * Adds new implementation of 'frontCoded' string encoding strategy, which writes out a v1 FrontCodedIndexed which stores buckets on a prefix of the previous value instead of the first value in the bucket	2023-03-14 18:14:11 -07:00
Gian Merlino	fe9d0c46d5	Improve memory efficiency of WrappedRoaringBitmap. (#13889 ) * Improve memory efficiency of WrappedRoaringBitmap. Two changes: 1) Use an int[] for sizes 4 or below. 2) Remove the boolean compressRunOnSerialization. Doesn't save much space, but it does save a little, and it isn't adding a ton of value to have it be configurable. It was originally configurable in case anything broke when enabling it, but it's been a while and nothing has broken. * Slight adjustment. * Adjust for inspection. * Updates. * Update snaps. * Update test. * Adjust test. * Fix snaps.	2023-03-09 15:48:02 -08:00
Paul Rogers	914eebb4b7	Wire up the catalog resolver (#13788 ) Introduces the catalog resolver interface Wires the resolver up to the planner factory Refactors planner factory	2023-02-22 11:42:32 -08:00
zachjsh	665dee43bf	Revert "Operator conversion deny list (#13766 )" (#13829 ) This reverts commit `38e620aa4c`.	2023-02-21 15:14:49 -08:00
Clint Wylie	08b5951cc5	merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything (#13698 ) * merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything * fix poms and license stuff * mockito is evil * allow reset of JvmUtils RuntimeInfo if tests used static injection to override	2023-02-17 14:27:41 -08:00
zachjsh	38e620aa4c	Operator conversion deny list (#13766 ) ### Description This change adds a new config property `druid.sql.planner.operatorConversion.denyList`, which allows a user to specify any operator conversions that they wish to disallow. A user may want to do this for a number of reasons, including security concerns. The default value of this property is the empty list `[]`, which does not disallow any operator conversions. An example usage of this property is `druid.sql.planner.operatorConversion.denyList=["extern"]`, which disallows the usage of the `extern` operator conversion. If the property is configured this way, and a user of the Druid cluster tries to submit a query that uses the `extern` function, such as the example given [here](https://druid.apache.org/docs/latest/multi-stage-query/examples.html#insert-with-no-rollup), a response with http response code `400` is returned with en error body similar to the following: ``` { "taskId": "4ec5b0b6-fa9b-4c3a-827d-2308294e9985", "state": "FAILED", "error": { "error": "Plan validation failed", "errorMessage": "org.apache.calcite.runtime.CalciteContextException: From line 28, column 5 to line 32, column 5: No match found for function signature EXTERN(<CHARACTER>, <CHARACTER>, <CHARACTER>)", "errorClass": "org.apache.calcite.tools.ValidationException", "host": null } } ```	2023-02-10 09:59:26 -08:00
AmatyaAvadhanula	dcdae84888	Add server view initialization metrics (#13716 ) * Add server view init metrics * Test coverage * Rename metrics	2023-02-07 20:02:00 +05:30
Jason Koch	7a3bd89a85	Dimension dictionary reduce locking (#13710 ) * perf: introduce benchmark for StringDimensionIndexer jdk11 -- Benchmark Mode Cnt Score Error Units StringDimensionIndexerProcessBenchmark.parallelReadWrite avgt 10 30471.552 ± 456.716 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader avgt 10 18069.863 ± 327.923 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter avgt 10 67676.617 ± 2351.311 us/op StringDimensionIndexerProcessBenchmark.soloReader avgt 10 1048.079 ± 1.120 us/op StringDimensionIndexerProcessBenchmark.soloWriter avgt 10 4629.769 ± 29.353 us/op * perf: switch DimensionDictionary to StampedLock jdk11 - Benchmark Mode Cnt Score Error Units StringDimensionIndexerProcessBenchmark.parallelReadWrite avgt 10 37958.372 ± 1685.206 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader avgt 10 31192.232 ± 2755.365 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter avgt 10 58256.791 ± 1998.220 us/op StringDimensionIndexerProcessBenchmark.soloReader avgt 10 1079.440 ± 1.753 us/op StringDimensionIndexerProcessBenchmark.soloWriter avgt 10 4585.690 ± 13.225 us/op * perf: use optimistic locking in DimensionDictionary jdk11 - Benchmark Mode Cnt Score Error Units StringDimensionIndexerProcessBenchmark.parallelReadWrite avgt 10 6212.366 ± 162.684 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader avgt 10 1807.235 ± 109.339 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter avgt 10 19427.759 ± 611.692 us/op StringDimensionIndexerProcessBenchmark.soloReader avgt 10 194.370 ± 1.050 us/op StringDimensionIndexerProcessBenchmark.soloWriter avgt 10 2871.423 ± 14.426 us/op * perf: refactor DimensionDictionary null handling to need less locks jdk11 - Benchmark Mode Cnt Score Error Units StringDimensionIndexerProcessBenchmark.parallelReadWrite avgt 10 6591.619 ± 470.497 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader avgt 10 1387.338 ± 144.587 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter avgt 10 22204.462 ± 1620.806 us/op StringDimensionIndexerProcessBenchmark.soloReader avgt 10 204.911 ± 0.459 us/op StringDimensionIndexerProcessBenchmark.soloWriter avgt 10 2935.376 ± 12.639 us/op * perf: refactor DimensionDictionary add handling to do a little less work jdk11 - Benchmark Mode Cnt Score Error Units StringDimensionIndexerProcessBenchmark.parallelReadWrite avgt 10 2914.859 ± 22.519 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader avgt 10 508.010 ± 14.675 us/op StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter avgt 10 10135.408 ± 82.745 us/op StringDimensionIndexerProcessBenchmark.soloReader avgt 10 205.415 ± 0.158 us/op StringDimensionIndexerProcessBenchmark.soloWriter avgt 10 3098.743 ± 23.603 us/op	2023-02-01 02:59:12 -08:00
Adarsh Sanjeev	0a486c3bcf	Update forbidden apis with fixed executor (#13633 ) * Update forbidden apis with fixed executor	2023-01-12 15:34:36 +05:30
Jason Koch	6c44dd8175	perf: core/TextReader for faster json ingestion (#13545 ) * perf: provide a custom utf8 specific buffered line iterator (benchmark) Benchmark Mode Cnt Score Error Units JsonLineReaderBenchmark.baseline avgt 15 3459.871 ± 106.175 us/op * perf: provide a custom utf8 specific buffered line iterator Benchmark Mode Cnt Score Error Units JsonLineReaderBenchmark.baseline avgt 15 3022.053 ± 51.286 us/op * perf: provide a custom utf8 specific buffered line iterator (more tests) * perf: provide a custom utf8 specific buffered line iterator (pr feedback) Ensure field visibility is as limited as possible Null check for buffer in constructor * perf: provide a custom utf8 specific buffered line iterator (pr feedback) Remove additional 'finished' variable. * perf: provide a custom utf8 specific buffered line iterator (more tests and bugfix)	2022-12-19 23:12:37 -08:00
Kashif Faraz	7cf761cee4	Prepare master branch for next release, 26.0.0 (#13401 ) * Prepare master branch for next release, 26.0.0 * Use docker image for druid 24.0.1 * Fix version in druid-it-cases pom.xml	2022-11-22 15:31:01 +05:30
Gian Merlino	78d0b0abce	Add string comparison methods to StringUtils, fix dictionary comparisons. (#13364 ) * Add string comparison methods to StringUtils, fix dictionary comparisons. There are various places in Druid code where we assume that String.compareTo is consistent with Unicode code-point ordering. Sadly this is not the case. To help deal with this, this patch introduces the following helpers: 1) compareUnicode: Compares two Strings in Unicode code-point order. 2) compareUtf8: Compares two UTF-8 byte arrays in Unicode code-point order. Equivalent to comparison as unsigned bytes. 3) compareUtf8UsingJavaStringOrdering: Compares two UTF-8 byte arrays, or ByteBuffers, in a manner consistent with String.compareTo. There is no helper for comparing two Strings in a manner consistent with String.compareTo, because for that we can use compareTo directly. The patch also fixes an inconsistency between the String and UTF-8 dictionary GenericIndexed flavors of string-typed columns: they were formerly using incompatible comparators. * Adjust test. * FrontCodedIndexed updates. * Add test. * Fix comments.	2022-11-16 07:15:00 -08:00
Gian Merlino	8f90589ce5	Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247 ) * Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. These aggregation functions are documented as creating sketches. However, they are planned into native aggregators that include finalization logic to convert the sketch to a number of some sort. This creates an inconsistency: the functions sometimes return sketches, and sometimes return numbers, depending on where they lie in the native query plan. This patch changes these SQL aggregators to _never_ finalize, by using the "shouldFinalize" feature of the native aggregators. It already existed for theta sketches. This patch adds the feature for hll and quantiles sketches. As to impact, Druid finalizes aggregators in two cases: - When they appear in the outer level of a query (not a subquery). - When they are used as input to an expression or finalizing-field-access post-aggregator (not any other kind of post-aggregator). With this patch, the functions will no longer be finalized in these cases. The second item is not likely to matter much. The SQL functions all declare return type OTHER, which would be usable as an input to any other function that makes sense and that would be planned into an expression. So, the main effect of this patch is the first item. To provide backwards compatibility with anyone that was depending on the old behavior, the patch adds a "sqlFinalizeOuterSketches" query context parameter that restores the old behavior. Other changes: 1) Move various argument-checking logic from runtime to planning time in DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker. 2) Add various JsonIgnores to the sketches to simplify their JSON representations. 3) Allow chaining of ExpressionPostAggregators and other PostAggregators in the SQL layer. 4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer, now that expressions can operate on complex inputs. 5) Adjust return type to thetaSketch (instead of OTHER) in ThetaSketchSetBaseOperatorConversion. * Fix benchmark class. * Fix compilation error. * Fix ThetaSketchSqlAggregatorTest. * Hopefully fix ITAutoCompactionTest. * Adjustment to ITAutoCompactionTest.	2022-11-03 09:43:00 -07:00
Kashif Faraz	fd7864ae33	Improve run time of coordinator duty MarkAsUnusedOvershadowedSegments (#13287 ) In clusters with a large number of segments, the duty `MarkAsUnusedOvershadowedSegments` can take a long very long time to finish. This is because of the costly invocation of `timeline.isOvershadowed` which is done for every used segment in every coordinator run. Changes - Use `DataSourceSnapshot.getOvershadowedSegments` to get all overshadowed segments - Iterate over this set instead of all used segments to identify segments that can be marked as unused - Mark segments as unused in the DB in batches rather than one at a time - Refactor: Add class `SegmentTimeline` for ease of use and readability while using a `VersionedIntervalTimeline` of segments.	2022-11-01 20:19:52 +05:30
Jason Koch	0d03ce435f	introduce a "tree" type to the flattenSpec (#12177 ) * introduce a "tree" type to the flattenSpec * feedback - rename exprs to nodes, use CollectionsUtils.isNullOrEmpty for guard * feedback - expand docs to more clearly capture limitations of "tree" flattenSpec * feedback - fix for typo on docs * introduce a comment to explain defensive copy, tweak null handling * fix: part of rebase * mark ObjectFlatteners.FlattenerMaker as an ExtensionPoint and provide default for new tree type * fix: objectflattener restore previous behavior to call getRootField for root type * docs: ingestion/data-formats add note that ORC only supports path expressions * chore: linter remove unused import * fix: use correct newer form for empty DimensionsSpec in FlattenJSONBenchmark	2022-11-01 14:49:30 +08:00
somu-imply	affc522b9f	Refactoring the data source before unnest (#13085 ) * First set of changes for framework * Second set of changes to move segment map function to data source * Minot change to server manager * Removing the createSegmentMapFunction from JoinableFactoryWrapper and moving to JoinDataSource * Checkstyle fixes * Patching Eric's fix for injection * Checkstyle and fixing some CI issues * Fixing code inspections and some failed tests and one injector for test in avatica * Another set of changes for CI...almost there * Equals and hashcode part update * Fixing injector from Eric + refactoring for broadcastJoinHelper * Updating second injector. Might revert later if better way found * Fixing guice issue in JoinableFactory * Addressing review comments part 1 * Temp changes refactoring * Revert "Temp changes refactoring" This reverts commit `9da42a9ef0`. * temp * Temp discussions * Refactoring temp * Refatoring the query rewrite to refer to a datasource * Refactoring getCacheKey by moving it inside data source * Nullable annotation check in injector * Addressing some comments, removing 2 analysis.isJoin() checks and correcting the benchmark files * Minor changes for refactoring * Addressing reviews part 1 * Refactoring part 2 with new test cases for broadcast join * Set for nullables * removing instance of checks * Storing nullables in guice to avoid checking on reruns * Fixing a test case and removing an irrelevant line * Addressing the atomic reference review comments	2022-10-26 15:58:58 -07:00
Clint Wylie	77e4246598	add support for 'front coded' string dictionaries for smaller string columns (#12277 ) * add FrontCodedIndexed for delta string encoding * now for actual segments * fix indexOf * fixes and thread safety * add bucket size 4, which seems generally better * fixes * fixes maybe * update indexes to latest interfaces * utf8 support * adjust * oops * oops * refactor, better, faster * more test * fixes * revert * adjustments * fix prefixing * more chill * sql nested benchmark too * refactor * more comments and javadocs * better get * remove base class * fix * hot rod * adjust comments * faster still * minor adjustments * spatial index support * spotbugs * add isSorted to Indexed to strengthen indexOf contract if set, improve javadocs, add docs * fix docs * push into constructor * use base buffer instead of copy * oops	2022-10-25 18:05:38 -07:00
Paul Rogers	86e6e61e88	Modular Calcite Test Framework (#12965 ) * Refactor Calcite test "framework" for planner tests Refactors the current Calcite tests to make it a bit easier to adjust the set of runtime objects used within a test. * Move data creation out of CalciteTests into TestDataBuilder * Move "framework" creation out of CalciteTests into a QueryFramework * Move injector-dependent functions from CalciteTests into QueryFrameworkUtils * Wrapper around the planner factory, etc. to allow customization. * Bulk of the "framework" created once per class rather than once per test. * Refactor tests to use a test builder * Change all testQuery() methods to use the test builder. Move test execution & verification into a test runner.	2022-10-20 15:45:44 -07:00
Gian Merlino	6aca61763e	SQL: Use timestamp_floor when granularity is not safe. (#13206 ) * SQL: Use timestamp_floor when granularity is not safe. PR #12944 added a check at the execution layer to avoid materializing excessive amounts of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors, by switching certain plans to use the timestamp_floor function instead of granularities. This applies both to the Timeseries query type, and the GroupBy timestampResultFieldGranularity feature. The patch also goes one step further: we switch to timestamp_floor not just in the ETERNITY + non-ALL case, but also if the estimated number of time-granular buckets exceeds 100,000. Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature. * Fix test, address PR comments. * Fix ControllerImpl. * Fix test. * Fix unused import.	2022-10-17 08:22:45 -07:00

1 2 3 4 5 ...

399 Commits