druid

Commit Graph

Author	SHA1	Message	Date
AmatyaAvadhanula	0412f40d36	Prepare master branch for next release, 28.0.0 (#14595 ) * Prepare master branch for next release, 28.0.0	2023-07-18 09:22:30 +05:30
Pranav	8087aa2b80	Adding the null check in combine and fold in doublesSketch (#14568 )	2023-07-11 14:28:34 +05:30
imply-cheddar	66cac08a52	Refactor HllSketchBuildAggregatorFactory (#14544 ) * Refactor HllSketchBuildAggregatorFactory The usage of ColumnProcessors and HllSketchBuildColumnProcessorFactory made it very difficult to figure out what was going on from just looking at the AggregatorFactory or Aggregator code. It also didn't properly double check that you could use UTF8 ahead of time, even though it's entirely possible to validate it before trying to use it. This refactor makes keeps the general indirection that had been implemented by the Consumer<Supplier<HllSketch>> but centralizes the decision logic and makes it easier to understand the code. * Test fixes * Add test that validates the types are maintained * Add back indirection to avoid buffer calls * Cover floats and doubles are the same thing * Static checks	2023-07-10 09:57:09 -07:00
Gian Merlino	67fbd8e7fc	Add "stringEncoding" parameter to DataSketches HLL. (#11201 ) * Add "stringEncoding" parameter to DataSketches HLL. Builds on the concept from #11172 and adds a way to feed HLL sketches with UTF-8 bytes. This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default. Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches. * Fix benchmark. * Fix style issues, improve test coverage. * Put round back, to make IT updates easier. * Fix test. * Fix issue with filtered aggregators and add test. * Use DS native update(ByteBuffer) method. Improve test coverage. * Add another suppression. * Fix ITAutoCompactionTest. * Update benchmarks. * Updates. * Fix conflict. * Adjustments.	2023-06-30 12:45:55 -07:00
Gian Merlino	3d19b748fb	SQL OperatorConversions: Introduce.aggregatorBuilder, allow CAST-as-literal. (#14249 ) * SQL OperatorConversions: Introduce.aggregatorBuilder, allow CAST-as-literal. Four main changes: 1) Provide aggregatorBuilder, a more consistent way of defining the SqlAggFunction we need for all of our SQL aggregators. The mechanism is analogous to the one we already use for SQL functions (OperatorConversions.operatorBuilder). 2) Allow CASTs of constants to be considered as "literalOperands". This fixes an issue where various of our operators are defined with OperandTypes.LITERAL as part of their checkers, which doesn't allow casts. However, in these cases we generally _do_ want to allow casts. The important piece is that the value must be reducible to a constant, not that the SQL text is literally a literal. 3) Update DataSketches SQL aggregators to use the new aggregatorBuilder functionality. The main user-visible effect here is [2]: the aggregators would now accept, for example, "CAST(0.99 AS DOUBLE)" as a literal argument. Other aggregators could be updated in a future patch. 4) Rename "requiredOperands" to "requiredOperandCount", because the old name was confusing. (It rhymes with "literalOperands" but the arguments mean different things.) * Adjust method calls.	2023-06-23 16:25:04 -07:00
imply-cheddar	cfd07a95b7	Errors take 3 (#14004 ) Introduce DruidException, an exception whose goal in life is to be delivered to a user. DruidException itself has javadoc on it to describe how it should be used. This commit both introduces the Exception and adjusts some of the places that are generating exceptions to generate DruidException objects instead, as a way to show how the Exception should be used. This work was a 3rd iteration on top of work that was started by Paul Rogers. I don't know if his name will survive the squash-and-merge, so I'm calling it out here and thanking him for starting on this.	2023-06-19 01:11:13 -07:00
Pranav	5314db9f85	Adding the file mapper to handle v2 buffer deserialization (#14429 )	2023-06-14 19:41:44 -07:00
Soumyava	01b22ca022	Hll Sketch and Theta sketch estimate can now be used as an expression (#14312 ) * Hll Sketch estimate can now be used as an expression * Theta sketch estimate now can be used as an expression	2023-06-06 20:14:25 -07:00
Abhishek Radhakrishnan	2d258a95ad	Fix `EARLIEST_BY`/`LATEST_BY` signature and include function name in signature. (#14352 ) * Fix EarliestLatestBySqlAggregator signature; Include function name for all signatures. * Single quote function signatures, space between args and remove \n. * fixup UT assertion	2023-06-06 09:41:05 -07:00
Alexander Saydakov	4131c0df13	use the latest datasketches-java-4.0.0 (#14334 ) * use the latest datasketches-java-4.0.0 * updated versions of datasketches * adjusted expectation * fixed the expectations --------- Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2023-05-27 22:19:18 -07:00
Clint Wylie	1aef72aa7e	Bump up the version in pom to 27.0.0 in preparation of release (#14051 )	2023-04-10 14:56:59 +05:30
frankgrimes97	2f98675285	Tuple sketch SQL support (#13887 ) This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.	2023-03-28 18:47:12 +05:30
Gian Merlino	90d8f67e3d	Avoid creating new RelDataTypeFactory during SQL planning. (#13904 ) * Avoid creating new RelDataTypeFactory during SQL planning. Reduces unnecessary CPU cycles. * Fix.	2023-03-08 21:55:49 -08:00
Anshu Makkar	a10e4150d5	Add Post Aggregators for Tuple Sketches (#13819 ) You can now do the following operations with TupleSketches in Post Aggregation Step Get the Sketch Output as Base64 String Provide a constant Tuple Sketch in post-aggregation step that can be used in Set Operations Get the Estimated Value(Sum) of Summary/Metrics Objects associated with Tuple Sketch	2023-03-03 09:32:09 +05:30
Clint Wylie	08b5951cc5	merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything (#13698 ) * merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything * fix poms and license stuff * mockito is evil * allow reset of JvmUtils RuntimeInfo if tests used static injection to override	2023-02-17 14:27:41 -08:00
imply-cheddar	f684df4c22	Use an HllSketchHolder object to enable optimized merge (#13737 ) * Use an HllSketchHolder object to enable optimized merge HllSketchAggregatorFactory.combine had been implemented using a pure pair-wise, "make a union -> add 2 things to union -> get sketch" algorithm. This algorithm does 2 things that was CPU 1) The Union object always builds an HLL_8 sketch regardless of the target type. This means that when the target type is not HLL_8, we spent CPU cycles converting to HLL_8 and back over and over again 2) By throwing away the Union object and converting back to the HllSketch only to build another Union object, we do lots and lots of copy+conversions of the HllSketch This change introduces an HllSketchHolder object which can hold onto a Union object and delay conversion back into an HllSketch until it is actually needed. This follows the same pattern as the SketchHolder object for theta sketches.	2023-02-07 13:57:48 -08:00
imply-cheddar	9c5b61e114	Fallback virtual column (#13739 ) * Fallback virtual column This virtual columns enables falling back to another column if the original column doesn't exist. This is useful when doing column migrations and you have some old data with column X, new data with column Y and you want to use Y if it exists, X otherwise so that you can run a consistent query against all of the data.	2023-02-06 19:36:50 -08:00
Abhishek Agarwal	b25cf216d5	Better error message when theta_sketch_intersect is used on scalar expression (#13508 )	2022-12-07 09:35:43 +05:30
Paul Rogers	b76ff16d00	SQL test framework extensions (#13426 ) SQL test framework extensions * Capture planner artifacts: logical plan, etc. * Planner test builder validates the logical plan * Validation for the SQL resut schema (we already have validation for the Druid row signature) * Better Guice integration: properties, reuse Guice modules * Avoid need for hand-coded expr, macro tables * Retire some of the test-specific query component creation * Fix query log hook race condition	2022-12-02 09:11:59 -08:00
Clint Wylie	f524c68f08	Add mechanism for 'safe' memory reads for complex types (#13361 ) * we can read where we want to we can leave your bounds behind 'cause if the memory is not there we really don't care and we'll crash this process of mine	2022-11-23 00:25:22 -08:00
Kashif Faraz	7cf761cee4	Prepare master branch for next release, 26.0.0 (#13401 ) * Prepare master branch for next release, 26.0.0 * Use docker image for druid 24.0.1 * Fix version in druid-it-cases pom.xml	2022-11-22 15:31:01 +05:30
Clint Wylie	27215d1ff1	fix complex_decode_base64 function, add SQL bindings (#13332 ) * fix complex_decode_base64 function, add SQL bindings * more permissive	2022-11-09 23:40:25 -08:00
Paul Rogers	7e600d2c63	Enhancements to the Calcite test framework (#13283 ) * Enhancements to the Calcite test framework * Standardize "Unauthorized" messages * Additional test framework extension points * Resolved joinable factory dependency issue	2022-11-08 14:28:49 -08:00
Gian Merlino	8f90589ce5	Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247 ) * Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. These aggregation functions are documented as creating sketches. However, they are planned into native aggregators that include finalization logic to convert the sketch to a number of some sort. This creates an inconsistency: the functions sometimes return sketches, and sometimes return numbers, depending on where they lie in the native query plan. This patch changes these SQL aggregators to _never_ finalize, by using the "shouldFinalize" feature of the native aggregators. It already existed for theta sketches. This patch adds the feature for hll and quantiles sketches. As to impact, Druid finalizes aggregators in two cases: - When they appear in the outer level of a query (not a subquery). - When they are used as input to an expression or finalizing-field-access post-aggregator (not any other kind of post-aggregator). With this patch, the functions will no longer be finalized in these cases. The second item is not likely to matter much. The SQL functions all declare return type OTHER, which would be usable as an input to any other function that makes sense and that would be planned into an expression. So, the main effect of this patch is the first item. To provide backwards compatibility with anyone that was depending on the old behavior, the patch adds a "sqlFinalizeOuterSketches" query context parameter that restores the old behavior. Other changes: 1) Move various argument-checking logic from runtime to planning time in DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker. 2) Add various JsonIgnores to the sketches to simplify their JSON representations. 3) Allow chaining of ExpressionPostAggregators and other PostAggregators in the SQL layer. 4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer, now that expressions can operate on complex inputs. 5) Adjust return type to thetaSketch (instead of OTHER) in ThetaSketchSetBaseOperatorConversion. * Fix benchmark class. * Fix compilation error. * Fix ThetaSketchSqlAggregatorTest. * Hopefully fix ITAutoCompactionTest. * Adjustment to ITAutoCompactionTest.	2022-11-03 09:43:00 -07:00
Paul Rogers	86e6e61e88	Modular Calcite Test Framework (#12965 ) * Refactor Calcite test "framework" for planner tests Refactors the current Calcite tests to make it a bit easier to adjust the set of runtime objects used within a test. * Move data creation out of CalciteTests into TestDataBuilder * Move "framework" creation out of CalciteTests into a QueryFramework * Move injector-dependent functions from CalciteTests into QueryFrameworkUtils * Wrapper around the planner factory, etc. to allow customization. * Bulk of the "framework" created once per class rather than once per test. * Refactor tests to use a test builder * Change all testQuery() methods to use the test builder. Move test execution & verification into a test runner.	2022-10-20 15:45:44 -07:00
Paul Rogers	f4dcc52dac	Redesign QueryContext class (#13071 ) We introduce two new configuration keys that refine the query context security model controlled by druid.auth.authorizeQueryContextParams. When that value is set to true then two other configuration options become available: druid.auth.unsecuredContextKeys: The set of query context keys that do not require a security check. Use this for the "white-list" of key to allow. All other keys go through the existing context key security checks. druid.auth.securedContextKeys: The set of query context keys that do require a security check. Use this when you want to allow all but a specific set of keys: only these keys go through the existing context key security checks. Both are set using JSON list format: druid.auth.securedContextKeys=["secretKey1", "secretKey2"] You generally set one or the other values. If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys. In addition, Druid defines two query context keys which always bypass checks because Druid uses them internally: sqlQueryId sqlStringifyArrays	2022-10-15 11:02:11 +05:30
Abhishek Agarwal	618757352b	Bump up the version to 25.0.0 (#12975 ) * Bump up the version to 25.0.0 * Fix the version in console	2022-08-29 11:27:38 +05:30
Alexander Saydakov	7e2371bbde	KLL sketch (#12498 ) * KLL sketch * added documentation * direct static refs * direct static refs * fixed test * addressed review points * added KLL sketch related terms * return a copy from get * Copy unions when returning them from "get". * Remove redundant "final". Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com>	2022-08-26 21:19:24 -07:00
Paul Rogers	41712b7a3a	Refactor SqlLifecycle into statement classes (#12845 ) * Refactor SqlLifecycle into statement classes Create direct & prepared statements Remove redundant exceptions from tests Tidy up Calcite query tests Make PlannerConfig more testable * Build fixes * Added builder to SqlQueryPlus * Moved Calcites system properties to saffron.properties * Build fix * Resolve merge conflict * Fix IntelliJ inspection issue * Revisions from reviews Backed out a revision to Calcite tests that didn't work out as planned * Build fix * Fixed spelling errors * Fixed failed test Prepare now enforces security; before it did not. * Rebase and fix IntelliJ inspections issue * Clean up exception handling * Fix handling of JDBC auth errors * Build fix * More tweaks to security messages	2022-08-14 00:44:08 -07:00
Karan Kumar	607b0b9310	Adding withName implementation to AggregatorFactory (#12862 ) * Adding agg factory with name impl * Adding test cases * Fixing test case * Fixing test case * Updated java docs.	2022-08-08 18:31:56 +05:30
Kashif Faraz	7ab2170802	Use datasketches version 3.2.0 (#12509 ) Changes: - Use apache datasketches version 3.2.0. - Remove unsafe reflection-based usage of datasketch internals added in #12022	2022-05-13 11:28:15 +05:30
Gian Merlino	529b983ad0	GroupBy: Reduce allocations by reusing entry and key holders. (#12474 ) * GroupBy: Reduce allocations by reusing entry and key holders. Two main changes: 1) Reuse Entry objects returned by various implementations of Grouper.iterator. 2) Reuse key objects contained within those Entry objects. This is allowed by the contract, which states that entries must be processed and immediately discarded. However, not all call sites respected this, so this patch also updates those call sites. One particularly sneaky way that the old code retained entries too long is due to Guava's MergingIterator and CombiningIterator. Internally, these both advance to the next value prior to returning the current value. So, this patch addresses that in two ways: 1) For merging, we have our own implementation MergeIterator already, although it had the same problem. So, this patch updates our implementation to return the current item prior to advancing to the next item. It also adds a forbidden-api entry to ensure that this safer implementation is used instead of Guava's. 2) For combining, we address the problem in a different way: by copying the key when creating the new, combined entry. * Attempt to fix test. * Remove unused import.	2022-04-28 23:21:13 -07:00
Abhishek Agarwal	2fe053c5cb	Bump up the versions (#12480 )	2022-04-27 14:28:20 +05:30
Jihoon Son	73ce5df22d	Add support for authorizing query context params (#12396 ) The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below. Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params. User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters. System context params. They are set by the Druid query engine during query processing. These params override other context params. Today, any context params are allowed to users. This can cause 1) a bad UX if the context param is not matured yet or 2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows. This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission. { "resourceAction" : { "resource" : { "name" : "maxSubqueryRows", "type" : "QUERY_CONTEXT" }, "action" : "WRITE" }, "resourceNamePattern" : "maxSubqueryRows" } Each role can have multiple permissions for context params. Each permission should be set for different context params. When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case, HTTP endpoints will return 403 response code. JDBC will throw ForbiddenException. Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService. The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.	2022-04-21 14:21:16 +05:30
Alexander Saydakov	50038d9344	latest datasketches-java-3.1.0 (#12224 ) These changes are to use the latest datasketches-java-3.1.0 and also to restore support for quantile and HLL4 sketches to be able to grow larger than a given buffer in a buffer aggregator and move to heap in rare cases. This was discussed in #11544. Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2022-03-01 17:14:42 -08:00
Clint Wylie	3ee66bb492	allow optimizing sql expressions and virtual columns (#12241 ) * rework sql planner expression and virtual column handling * simplify a bit * add back and deprecate old methods, more tests, fix multi-value string coercion bug and associated tests * spotbugs * fix bugs with multi-value string array expression handling * javadocs and adjust test * better * fix tests	2022-02-09 14:55:50 -08:00
Kashif Faraz	e648b01afb	Improve memory estimates in Aggregator and DimensionIndexer (#12073 ) Fixes #12022 ### Description The current implementations of memory estimation in `OnHeapIncrementalIndex` and `StringDimensionIndexer` tend to over-estimate which leads to more persistence cycles than necessary. This PR replaces the max estimation mechanism with getting the incremental memory used by the aggregator or indexer at each invocation of `aggregate` or `encode` respectively. ### Changes - Add new flag `useMaxMemoryEstimates` in the task context. This overrides the same flag in DefaultTaskConfig i.e. `druid.indexer.task.default.context` map - Add method `AggregatorFactory.factorizeWithSize()` that returns an `AggregatorAndSize` which contains the aggregator instance and the estimated initial size of the aggregator - Add method `Aggregator.aggregateWithSize()` which returns the incremental memory used by this aggregation step - Update the method `DimensionIndexer.processRowValsToKeyComponent()` to return the encoded key component as well as its effective size in bytes - Update `OnHeapIncrementalIndex` to use the new estimations only if `useMaxMemoryEstimates = false`	2022-02-03 10:34:02 +05:30
imply-cheddar	b153cb2342	Add a small LRU cache and use utf8 bytes in ArrayOfDoubles (#12130 ) * Add a small LRU cache and use utf8 bytes in ArrayOfDoubles * Add tests for extra branches * Even more tests for branch coverage * Fix Style	2022-01-11 13:04:11 -08:00
somu-imply	08fea7a46a	input type validation for datasketches hll "build" aggregator factory (#12131 ) * Ingestion will fail for HLLSketchBuild instead of creating with incorrect values * Addressing review comments for HLL< updated error message introduced test case	2022-01-11 12:00:14 -08:00
Paul Rogers	a66f10eea1	Code cleanup from query profile project (#11822 ) * Code cleanup from query profile project * Fix spelling errors * Fix Javadoc formatting * Abstract out repeated test code * Reuse constants in place of some string literals * Fix up some parameterized types * Reduce warnings reported by Eclipse * Reverted change due to lack of tests	2021-11-30 11:35:38 -08:00
Gian Merlino	93aeaf4801	Improve on-heap aggregator footprint estimates. (#11950 ) Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that mitigates #6743 by enabling heap footprint estimates based on a specific number of rows. The idea is that at ingestion time, the number of rows that go into an aggregator will be 1 (if rollup is off) or will likely be a small number (if rollup is on). It's a heuristic, because of course nothing guarantees that the rollup ratio is a small number. But it's a common case, and I expect this logic to go wrong much less often than the current logic. Also, when it does go wrong, users can fix it by lowering maxRowsInMemory or maxBytesInMemory. The current situation is unintuitive: when the estimation goes wrong, users get an OOME, but actually they need to raise these limits to fix it.	2021-11-28 13:21:24 +05:30
Clint Wylie	f260bbed23	restore and deprecate AggregatorFactory methods (#11917 ) * add back and deprecate aggregator factory methods so i can say i told you so when i delete these later * rename to make less ambiguous, fix fill method * adjust	2021-11-19 15:59:35 -08:00
Clint Wylie	a8805ab60d	add missing json type for ListFilteredVirtualColumn (#11887 ) * add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again * fixes * ugly, but maybe this * oops * too many mappers	2021-11-09 17:25:12 -08:00
Gian Merlino	fc95c92806	Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124 ) * Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. This patch does the following: - Removes OffheapIncrementalIndex. - Clarifies that Aggregators are required to be thread safe. - Clarifies that BufferAggregators and VectorAggregators are not required to be thread safe. - Removes thread safety code from some DataSketches aggregators that had it. (Not all of them did, and that's OK, because it wasn't necessary anyway.) - Makes enabling "useOffheap" with groupBy v1 an error. Rationale for removing the offheap incremental index: - It is only used in one rare scenario: groupBy v1 (which is non-default) in "useOffheap" mode (also non-default). So you have to go pretty deep into the wilderness to get this code to activate in production. It is never used during ingestion. - Its existence complicates developer efforts to reason about how aggregators get used, because the way it uses buffer aggregators is so different from how every other query engine uses them. - It doesn't have meaningful testing. By the way, I do believe that the given way the offheap incremental index works, it actually didn't require buffer aggregators to be thread-safe. It synchronizes on "aggregate" and doesn't call "get" until it has stopped calling "aggregate". Nevertheless, this is a bother to think about, and for the above reasons I think it makes sense to remove the code anyway. * Remove things that are now unused. * Revert removal of getFloat, getLong, getDouble from BufferAggregator. * OAK-related warnings, suppressions. * Unused item suppressions.	2021-10-26 08:05:56 -07:00
Gian Merlino	8276c031c5	Add druid.sql.approxCountDistinct.function property. (#11181 ) * Add druid.sql.approxCountDistinct.function property. The new property allows admins to configure the implementation for APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode. The motivation for adding this setting is to enable site admins to switch the default HLL implementation to DataSketches. For example, an admin can set: druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL * Fixes * Fix tests. * Remove erroneous cannotVectorize. * Remove unused import. * Remove unused test imports.	2021-10-25 12:16:21 -07:00
Gian Merlino	d4cace385f	SQL: Allow Scans to be used as outer queries. (#11831 ) * SQL: Allow Scans to be used as outer queries. This has been possible in the native query system for a while, but the capability hasn't yet propagated into the SQL layer. One example of where this is useful is a query like: SELECT * FROM (... LIMIT X) WHERE <filter> Because this expands the kinds of subquery structures the SQL layer will consider, it was also necessary to improve the cost calculations. These changes appear in PartialDruidQuery and DruidOuterQueryRel. The ideas are: - Attach per-column penalties to the output signature of each query, instead of to the initial projection that starts a query. This encourages moving projections into subqueries instead of leaving them on outer queries. - Only attach penalties to projections if there are actually expressions happening. So, now, projections that simply reorder or remove fields are free. - Attach a constant penalty to every outer query. This discourages creating them when they are not needed. The changes are generally beneficial to the test cases we have in CalciteQueryTest. Most plans are unchanged, or are changed in purely cosmetic ways. Two have changed for the better: - testUsingSubqueryWithLimit now returns a constant from the subquery, instead of returning every column. - testJoinOuterGroupByAndSubqueryHasLimit returns a minimal set of columns from the innermost subquery; two unnecessary columns are no longer there. * Fix various DS operator conversions. These were all implemented as direct conversions, which isn't appropriate because they do not actually map onto native functions. These are only usable as post-aggregations. * Test case adjustment.	2021-10-23 17:18:43 -07:00
Gian Merlino	b7a4c79314	Null handling fixes for DS HLL and Theta sketches. (#11830 ) * Null handling fixes for DS HLL and Theta sketches. For HLL, this fixes an NPE when processing a null in a multi-value dimension. For both, empty strings are now properly treated as nulls (and ignored) in replace-with-default mode. Behavior in SQL-compatible mode is unchanged. * Fix expectation.	2021-10-22 19:09:00 -07:00
Clint Wylie	741b4ed516	add output type information to ExpressionPostAggregator (#11818 ) * add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types * add test for test for coverage * simplify * Remove unused imports. Co-authored-by: Gian Merlino <gian@imply.io>	2021-10-22 13:52:51 -07:00
Alexander Saydakov	8cf1cbc4a9	latest datasketches-java and datasketches-memory (#11773 ) * latest datasketches-java and datasketches-memory * updated versions of datasketches-java and datasketches-memory Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2021-10-19 23:42:30 -07:00
Clint Wylie	187df58e30	better types (#11713 ) * better type system * needle in a haystack * ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support * fixup merge * more test * fixup * intern * fix * oops * oops again * ... * more test coverage * fix error message * adjust interning, more javadocs * oops * more docs more better	2021-10-19 01:47:25 -07:00
Clint Wylie	fe1d8c206a	bump version to 0.23.0-SNAPSHOT (#11670 )	2021-09-08 15:56:04 -07:00
Jihoon Son	7e90d00cc0	Configurable maxStreamLength for doubles sketches (#11574 ) * Configurable maxStreamLength for doubles sketches * fix equals/hashcode and it test failure * fix test * fix it test * benchmark * doc * grouping key * fix comment * dependency check * Update docs/development/extensions-core/datasketches-quantiles.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2021-08-31 14:56:37 -07:00
dependabot[bot]	167044f715	Bump fastutil from 8.2.3 to 8.5.4 (#11347 ) * Bump fastutil from 8.2.3 to 8.5.4 Bumps [fastutil](https://github.com/vigna/fastutil) from 8.2.3 to 8.5.4. - [Release notes](https://github.com/vigna/fastutil/releases) - [Changelog](https://github.com/vigna/fastutil/blob/master/CHANGES) - [Commits](https://github.com/vigna/fastutil/compare/8.2.3...8.5.4) --- updated-dependencies: - dependency-name: it.unimi.dsi:fastutil dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * update licenses.yaml * update maven dependency list for -core and -extra libraries to pass maven dependency checks Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Xavier Léauté <xvrl@apache.org>	2021-06-10 07:43:18 -07:00
Clint Wylie	f6662b4893	fix count and average SQL aggregators on constant virtual columns (#11208 ) * fix count and average SQL aggregators on constant virtual columns * style * even better, why are we tracking virtual columns in aggregations at all if we have a virtual column registry * oops missed a few * remove unused * this will fix it	2021-05-10 13:41:48 -07:00
Clint Wylie	691d7a1d54	SQL timeseries no longer skip empty buckets with all granularity (#11188 ) * SQL timeseries no longer skip empty buckets with all granularity * add comment, fix tests * the ol switcheroo * revert unintended change * docs and more tests * style * make checkstyle happy * docs fixes and more tests * add docs, tests for array_agg * fixes * oops * doc stuffs * fix compile, match doc style	2021-05-10 10:13:37 -07:00
Gian Merlino	809e001939	Vectorize the DataSketches quantiles aggregator. (#11183 ) * Vectorize the DataSketches quantiles aggregator. Also removes synchronization for the BufferAggregator and VectorAggregator implementations, since it is not necessary (similar to #11115). Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest to run all test cases in vectorized mode. * Style fix.	2021-05-02 16:14:21 -07:00
Gian Merlino	f2b54de205	Vectorized versions of HllSketch aggregators. (#11115 ) * Vectorized versions of HllSketch aggregators. The patch uses the same "helper" approach as #10767 and #10304, and extends the tests to run in both vectorized and non-vectorized modes. Also includes some minor changes to the theta sketch vector aggregator: - Cosmetic changes to make the hll and theta implementations look more similar. - Extends the theta SQL tests to run in vectorized mode. * Updates post-code-review. * Fix javadoc.	2021-04-16 18:45:46 -07:00
Jihoon Son	25db8787b3	Fix CAST being ignored when aggregating on strings after cast (#11083 ) * Fix CAST being ignored when aggregating on strings after cast * fix checkstyle and dependency * unused import	2021-04-12 22:21:24 -07:00
Makdon	d939420f23	Update SketchAggregator.java for removing duplicated parentheses (#11021 ) * Update SketchAggregator.java * Add test for sketches aggregator update unoin with double	2021-04-09 17:11:25 -07:00
Alexander Saydakov	f930cf14d6	Use the latest Apache DataSketches release 2.0.0 (#10917 ) * use the latest Apache DataSketches release 2.0.0 * updated datasketches version Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2021-02-26 07:52:00 -06:00
Gian Merlino	6c0c6e60b3	Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. (#10767 ) * Vectorized theta sketch aggregator. Also a refactoring of BufferAggregator and VectorAggregator such that they share a common interface, BaseBufferAggregator. This allows implementing both in the same file with an abstract + dual subclass structure. * Rework implementation to use composition instead of inheritance. * Rework things to enable working properly for both complex types and regular types. Involved finally moving makeVectorProcessor from DimensionHandlerUtils into ColumnProcessors and harmonizing the two things. * Add missing method. * Style and name changes. * Fix issues from inspections. * Fix style issue.	2021-01-29 09:30:09 -08:00
Jihoon Son	95065bdf1a	Bump dev version to 0.22.0-SNAPSHOT (#10759 )	2021-01-15 13:16:23 -08:00
Jonathan Wei	65c0d64676	Update version to 0.21.0-SNAPSHOT (#10450 ) * [maven-release-plugin] prepare release druid-0.21.0 * [maven-release-plugin] prepare for next development iteration * Update web-console versions	2020-10-03 16:08:34 -07:00
Clint Wylie	ab60661008	refactor internal type system (#9638 ) * better type tracking: add typed postaggs, finalized types for agg factories * more javadoc * adjustments * transition to getTypeName to be used exclusively for complex types * remove unused fn * adjust * more better * rename getTypeName to getComplexTypeName * setup expression post agg for type inference existing * more javadocs * fixup * oops * more test * more test * more comments/javadoc * nulls * explicitly handle only numeric and complex aggregators for incremental index * checkstyle * more tests * adjust * more tests to showcase difference in behavior * timeseries longsum array	2020-08-26 10:53:44 -07:00
Clint Wylie	cfb7a893e7	fill out missing test coverage for druid-datasketches postaggs (#9730 ) * fill out missing test coverage for druid-datasketches postaggs * fixup * fixup merge * oops * oops again	2020-07-31 10:08:07 -07:00
Maytas Monsereenusorn	4e8570b71b	Add integration tests for all InputFormat (#10088 ) * Add integration tests for Avro OCF InputFormat * Add integration tests for Avro OCF InputFormat * add tests * fix bug * fix bug * fix failing tests * add comments * address comments * address comments * address comments * fix test data * reduce resource needed for IT * remove bug fix * fix checkstyle * add bug fix	2020-07-08 12:50:29 -07:00
Franklyn Dsouza	1b9aacb1cd	Fix avg sql aggregator (#10135 ) * new average aggregator * method to create count aggregator factory * test everything * update other usages * fix style * fix more tests * fix datasketches tests	2020-07-08 08:38:56 -07:00
Clint Wylie	c86e7ce30b	bump version to 0.20.0-SNAPSHOT (#10124 )	2020-07-06 15:08:32 -07:00
Clint Wylie	477335abb4	update links datasketches.github.io to datasketches.apache.org (#10107 ) * update links datasketches.github.io to datasketches.apache.org * now with more apache * oops * oops	2020-07-01 14:56:17 -07:00
Maytas Monsereenusorn	9bab6b6371	SketchAggregator.updateUnion should handle null inside List update object (#10055 )	2020-06-19 20:29:25 -07:00
Maytas Monsereenusorn	790e9482ea	Fix Subquery could not be converted to groupBy query (#9959 ) * Fix join * Fix Subquery could not be converted to groupBy query * Fix Subquery could not be converted to groupBy query * Fix Subquery could not be converted to groupBy query * Fix Subquery could not be converted to groupBy query * Fix Subquery could not be converted to groupBy query * Fix Subquery could not be converted to groupBy query * Fix Subquery could not be converted to groupBy query * Fix Subquery could not be converted to groupBy query * add tests * address comments * fix failing tests	2020-06-03 16:46:28 -07:00
Gian Merlino	3dfd7c30c0	Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT. (#9893 ) * Add REGEXP_LIKE, fix empty-pattern bug in REGEXP_EXTRACT. - Add REGEXP_LIKE function that returns a boolean, and is useful in WHERE clauses. - Fix REGEXP_EXTRACT return type (should be nullable; causes incorrect filter elision). - Fix REGEXP_EXTRACT behavior for empty patterns: should always match (previously, they threw errors). - Improve error behavior when REGEXP_EXTRACT and REGEXP_LIKE are passed non-literal patterns. - Improve documentation of REGEXP_EXTRACT. * Changes based on PR review. * Fix arg check. * Important fixes! * Add speller. * wip * Additional tests. * Fix up tests. * Add validation error tests. * Additional tests. * Remove useless call.	2020-06-03 14:31:37 -07:00
Alexander Saydakov	522df300c2	Datasketches 1 3 0 (#9880 ) * use the latest datasketches release * new sketch debug print Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2020-05-16 14:09:23 -07:00
Alexander Saydakov	844d626738	added number of bins parameter (#9436 ) * added number of bins parameter * addressed review points * test equals Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2020-05-04 16:53:09 -07:00
Clint Wylie	fc5383cd00	revert datasketches-java version to 1.1.0-incubating until new version is released (#9751 ) * revert datasketches-java version to 1.1.0-incubating until fix is in place * fix tests * checkstyle	2020-04-24 12:52:12 -07:00
Suneet Saldanha	1ced3b33fb	IntelliJ inspections cleanup (#9339 ) * IntelliJ inspections cleanup * Standard Charset object can be used * Redundant Collection.addAll() call * String literal concatenation missing whitespace * Statement with empty body * Redundant Collection operation * StringBuilder can be replaced with String * Type parameter hides visible type * fix warnings in test code * more test fixes * remove string concatenation inspection error * fix extra curly brace * cleanup AzureTestUtils * fix charsets for RangerAdminClient * review comments	2020-04-10 10:04:40 -07:00
Jihoon Son	0da8ffc3ff	Bump up development version to 0.19.0-SNAPSHOT (#9586 )	2020-03-30 16:24:04 -07:00
Gian Merlino	1ef25a438f	Broker: Add ability to inline subqueries. (#9533 ) * Broker: Add ability to inline subqueries. The main changes: - ClientQuerySegmentWalker: Add ability to inline queries. - Query: Add "getSubQueryId" and "withSubQueryId" methods. - QueryMetrics: Add "subQueryId" dimension. - ServerConfig: Add new "maxSubqueryRows" parameter, which is used by ClientQuerySegmentWalker to limit how many rows can be inlined per query. - IndexedTableJoinMatcher: Allow creating keys on top of unknown types, by assuming they are strings. This is useful because not all types are known for fields in query results. - InlineDataSource: Store RowSignature rather than component parts. Add more zealous "equals" and "hashCode" methods to ease testing. - Moved QuerySegmentWalker test code from CalciteTests and SpecificSegmentsQueryWalker in druid-sql to QueryStackTests in druid-server. Use this to spin up a new ClientQuerySegmentWalkerTest. * Adjustments from CI. * Fix integration test.	2020-03-18 15:06:45 -07:00
Gian Merlino	ff59d2e78b	Move RowSignature from druid-sql to druid-processing and make use of it. (#9508 ) * Move RowSignature from druid-sql to druid-processing and make use of it. 1) Moved (most of) RowSignature from sql to processing. Left behind the SQL-specific stuff in a RowSignatures utility class. It also picked up some new convenience methods along the way. 2) There were a lot of places in the code where Map<String, ValueType> was used to associate columns with type info. These are now all replaced with RowSignature. 3) QueryToolChest's resultArrayFields method is replaced with resultArraySignature, and it now provides type info. * Fix up extensions. * Various fixes	2020-03-12 11:06:44 -07:00
Gian Merlino	c6c2282b59	Harmonization and bug-fixing for selector and filter behavior on unknown types. (#9484 ) * Harmonization and bug-fixing for selector and filter behavior on unknown types. - Migrate ValueMatcherColumnSelectorStrategy to newer ColumnProcessorFactory system, and set defaultType COMPLEX so unknown types can be dynamically matched. - Remove ValueGetters in favor of ColumnComparisonFilter doing its own thing. - Switch various methods to use convertObjectToX when casting to numbers, rather than ad-hoc and inconsistent logic. - Fix bug in RowBasedExpressionColumnValueSelector: isBindingArray should return true even for 0- or 1- element arrays. - Adjust various javadocs. * Add throwParseExceptions option to Rows.objectToNumber, switch back to that. * Update tests. * Adjust moment sketch tests.	2020-03-10 07:15:57 -07:00
Clint Wylie	f8b1f2f7f3	fix issue when distinct grouping dimensions are optimized into the same virtual column expression (#9429 ) * fix issue when distinct grouping dimensions are optimized into the same virtual column expression * fix tests * more better * fixes	2020-03-09 17:48:29 -07:00
Clint Wylie	b408a6d774	sql support for dynamic parameters (#6974 ) * sql support for dynamic parameters * fixup * javadocs * fixup from merge * formatting * fixes * fix it * doc fix * remove druid fallback self-join parameterized test * unused imports * ignore test for now * fix imports * fixup * fix merge * merge fixup * fix test that cannot vectorize * fixup and more better * dependency thingo * fix docs * tweaks * fix docs * spelling * unused imports after merge * review stuffs * add comment * add ignore text * review stuffs	2020-02-19 13:09:20 -08:00
Gian Merlino	3ef5c2f2e8	Add MemoryOpenHashTable, a table similar to ByteBufferHashTable. (#9308 ) * Add MemoryOpenHashTable, a table similar to ByteBufferHashTable. With some key differences to improve speed and design simplicity: 1) Uses Memory rather than ByteBuffer for its backing storage. 2) Uses faster hashing and comparison routines (see HashTableUtils). 3) Capacity is always a power of two, allowing simpler design and more efficient implementation of findBucket. 4) Does not implement growability; instead, leaves that to its callers. The idea is this removes the need for subclasses, while still giving callers flexibility in how to handle table-full scenarios. * Fix LGTM warnings. * Adjust dependencies. * Remove easymock from druid-benchmarks. * Adjustments from review. * Fix datasketches unit tests. * Fix checkstyle.	2020-02-04 19:57:59 -08:00
Suneet Saldanha	33a97dfaae	Guicify druid sql module (#9279 ) * Guicify druid sql module Break up the SQLModule in to smaller modules and provide a binding that modules can use to register schemas with druid sql. * fix some tests * address code review * tests compile * Working tests * Add all the tests * fix up licenses and dependencies * add calcite dependency to druid-benchmarks * tests pass * rename the schemas	2020-02-04 11:33:48 -08:00
Gian Merlino	b411443d22	SQL join support for lookups. (#9294 ) * SQL join support for lookups. 1) Add LookupSchema to SQL, so lookups show up in the catalog. 2) Add join-related rels and rules to SQL, allowing joins to be planned into native Druid queries. * Add two missing LookupSchema calls in tests. * Fix tests. * Fix typo.	2020-01-31 23:51:16 -08:00
Suneet Saldanha	303b02eba1	intelliJ inspections cleanup (#9260 ) * intelliJ inspections cleanup - remove redundant escapes - performance warnings - access static member via instance reference - static method declared final - inner class may be static Most of these changes are aesthetic, however, they will allow inspections to be enabled as part of CI checks going forward The valuable changes in this delta are: - using StringBuilder instead of string addition in a loop indexing-hadoop/.../Utils.java processing/.../ByteBufferMinMaxOffsetHeap.java - Use class variables instead of static variables for parameterized test processing/src/.../ScanQueryLimitRowIteratorTest.java * Add intelliJ inspection warnings as errors to druid profile * one more static inner class	2020-01-29 11:50:52 -08:00
Atul Mohan	ea51bc45bf	Fix nullhandling in tests (#9119 )	2020-01-12 20:19:12 -08:00
Clint Wylie	7af85250cb	null handling for doubles sketch and array of doubles sketch aggs (#9112 ) * doubles sketch and array of doubles sketch aggs now skip rows with nulls in sql compatible null handling mode * formatting	2020-01-07 14:15:32 -06:00
Jonathan Wei	aa539177ec	De-incubation cleanup in code, docs, packaging (#9108 ) * De-incubation cleanup in code, docs, packaging * remove unused docs script	2020-01-03 12:33:19 -05:00
Jonathan Wei	4e8368a5d9	Set version to 0.18.0-SNAPSHOT (#9109 )	2020-01-02 17:55:10 -05:00
Jonathan Wei	8af41d7cd0	Update version to 0.18.0-incubating-SNAPSHOT (#9009 )	2019-12-11 14:04:03 -08:00
Chi Cao Minh	3de7ab8523	DataSketches jars in core (#9003 ) Having DataSketches jars in core will allow potential improvements, for example: - Provide an alternative implementation of HLL: https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html - Range partitioning for native parallel batch indexing without having the user load extensions on the classpath Dev mailing list discussion: https://lists.apache.org/thread.html/301410d71ff799cf616bf17c4ebcf9999fc30829f5fa62909f403e6c%40%3Cdev.druid.apache.org%3E	2019-12-10 14:02:34 -08:00
Chi Cao Minh	bab78fc80e	Parallel indexing single dim partitions (#8925 ) * Parallel indexing single dim partitions Implements single dimension range partitioning for native parallel batch indexing as described in #8769. This initial version requires the druid-datasketches extension to be loaded. The algorithm has 5 phases that are orchestrated by the supervisor in `ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`. These phases and the main classes involved are described below: 1) In parallel, determine the distribution of dimension values for each input source split. `PartialDimensionDistributionTask` uses `StringSketch` to generate the approximate distribution of dimension values for each input source split. If the rows are ungrouped, `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter` uses a Bloom filter to skip rows that would be grouped. The final distribution is sent back to the supervisor via `DimensionDistributionReport`. 2) The range partitions are determined. In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the supervisor uses `StringSketchMerger` to merge the individual `StringSketch`es created in the preceding phase. The merged sketch is then used to create the range partitions. 3) In parallel, generate partial range-partitioned segments. `PartialRangeSegmentGenerateTask` uses the range partitions determined in the preceding phase and `RangePartitionCachingLocalSegmentAllocator` to generate `SingleDimensionShardSpec`s. The partition information is sent back to the supervisor via `GeneratedGenericPartitionsReport`. 4) The partial range segments are grouped. In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`, the supervisor creates the `PartialGenericSegmentMergeIOConfig`s necessary for the next phase. 5) In parallel, merge partial range-partitioned segments. `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to retrieve the partial range-partitioned segments generated earlier and then merges and publishes them. * Fix dependencies & forbidden apis * Fixes for integration test * Address review comments * Fix docs, strict compile, sketch check, rollup check * Fix first shard spec, partition serde, single subtask * Fix first partition check in test * Misc rewording/refactoring to address code review * Fix doc link * Split batch index integration test * Do not run parallel-batch-index twice * Adjust last partition * Split ITParallelIndexTest to reduce runtime * Rename test class * Allow null values in range partitions * Indicate which phase failed * Improve asserts in tests	2019-12-09 23:05:49 -08:00
jon-wei	dfbc066163	Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1" This reverts commit `a0f21d9b07`.	2019-11-27 23:22:43 -08:00
jon-wei	0402ff85b8	Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit `8ffa71e7e6`.	2019-11-27 23:22:32 -08:00
jon-wei	8ffa71e7e6	[maven-release-plugin] prepare for next development iteration	2019-11-27 23:18:48 -08:00
jon-wei	a0f21d9b07	[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1	2019-11-27 23:18:37 -08:00
Alexander Saydakov	4a9da3f3fc	use the latest release of datasketches (#8647 ) * use the latest release of datasketches * added datasketches-memory dependency * updated datasketches entries * use datasketches-memory-1.2.0 * updated dependencies * fixed tests	2019-11-25 19:45:51 -08:00
Clint Wylie	3fcaa1a61b	fix sql compatible null handling config work with runtime.properties (#8876 ) * fix sql compatible null handling config work with runtime.properties * fix npe * fix tests * add friendly error * comment, and friendlier still * fix compile * fix from merges	2019-11-20 03:55:29 -08:00
Jonathan Wei	75ea0d592a	Add more datasketches doubles sketch SQL functions (#8843 ) * Add more datasketches doubles sketch SQL postaggs * style and lgtm	2019-11-08 18:05:06 -08:00

1 2 3 4 5 ...

291 Commits