Commit Graph

767 Commits

Author SHA1 Message Date
AmatyaAvadhanula 0412f40d36
Prepare master branch for next release, 28.0.0 (#14595)
* Prepare master branch for next release, 28.0.0
2023-07-18 09:22:30 +05:30
Laksh Singla c1c7dff2ad
Using DruidExceptions in MSQ (changes related to the Broker) (#14534)
The MSQ engine now returns correct error codes for invalid user inputs in the query context. MSQ-related errors that occur in the Broker now use DruidExceptions with improved error messages.
2023-07-13 19:08:49 +00:00
Abhishek Radhakrishnan f4ee58eaa8
Add `aggregatorMergeStrategy` property in SegmentMetadata queries (#14560)
* Add aggregatorMergeStrategy property to SegmentMetadataQuery.

- Adds a new property aggregatorMergeStrategy to segmentMetadata query.
aggregatorMergeStrategy currently supports three types of merge strategies -
the legacy strict and lenient strategies, and the new latest strategy.
- The latest strategy considers the latest aggregator from the latest segment
by time order when there's a conflict when merging aggregators from different
segments.
- Deprecate the lenientAggregatorMerge property. The API validates that the new
and old properties are not both set, and returns an exception if they are.
- When merging segments as part of segmentMetadata query, the segments have a more
elaborate id -- <datasource>_<interval>_merged_<partition_number> format, similar to
the name format that segments usually contain. Previously it was simply "merged".
- Adjust unit tests to test the latest strategy, to assert the returned complete
SegmentAnalysis object instead of just the aggregators for completeness.

* Don't explicitly set strict strategy in tests

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/segmentmetadataquery.md

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2023-07-13 12:37:36 -04:00
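As a rough illustration of the `latest` strategy described in the commit above, the sketch below merges per-segment aggregator maps so that, on conflict, the aggregator from the segment that is latest by time wins. The data shapes, the `merge_latest` helper, and the `wikipedia` datasource name are hypothetical and are not Druid's implementation. The query body only shows where the new `aggregatorMergeStrategy` property would be set.

```python
from typing import Dict, List, Tuple

# (interval_start, {column: aggregator_type}): a simplified, hypothetical stand-in
# for per-segment analysis results; this is not Druid's actual data model.
SegmentAggs = Tuple[str, Dict[str, str]]


def merge_latest(segments: List[SegmentAggs]) -> Dict[str, str]:
    """On conflict, keep the aggregator from the segment that is latest by time."""
    merged: Dict[str, str] = {}
    # ISO-8601 timestamps in the same format and zone sort chronologically as strings.
    for _, aggs in sorted(segments, key=lambda s: s[0]):
        merged.update(aggs)  # later segments overwrite earlier ones
    return merged


segments = [
    ("2023-07-02T00:00:00Z", {"clicks": "longSum", "price": "doubleSum"}),
    ("2023-07-01T00:00:00Z", {"clicks": "doubleSum"}),
]
merged = merge_latest(segments)

# A segmentMetadata query body that opts into the new strategy.
# The datasource name is an assumption for illustration.
query = {
    "queryType": "segmentMetadata",
    "dataSource": "wikipedia",
    "analysisTypes": ["aggregators"],
    "aggregatorMergeStrategy": "latest",
}
```

Here `clicks` conflicts between the two segments, so the merged result keeps the value from the 2023-07-02 segment.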
imply-cheddar 7650a71d37
Add window query test files from Drill (#14561) 2023-07-12 20:14:39 -07:00
imply-cheddar 65e1b27aa7
Fix a resource leak with Window processing (#14573)
* Fix a resource leak with Window processing

Additionally, in order to find the leak, there were
adjustments to the StupidPool to track leaks a bit better.
It would appear that the pool objects get GC'd during testing
for some reason which was causing some incorrect identification
of leaks from objects that had been returned but were GC'd along
with the pool.

* Suppress unused warning
2023-07-12 17:25:42 -05:00
Laksh Singla 5ce536355e
Fix planning bug while using sort merge frame processor (#14450)
sqlJoinAlgorithm is now a hint to the planner to execute the join in the specified manner. The planner can ignore the hint if it determines beforehand that the specified algorithm would be detrimental to the performance of the join.
2023-07-11 09:58:44 +00:00
Gian Merlino 63ee69b4e8
Claim full support for Java 17. (#14384)
* Claim full support for Java 17.

No production code has changed, except the startup scripts.

Changes:

1) Allow Java 17 without DRUID_SKIP_JAVA_CHECK.

2) Include the full list of opens and exports on both Java 11 and 17.

3) Document that Java 17 is both supported and preferred.

4) Switch some tests from Java 11 to 17 to get better coverage on the
   preferred version.

* Doc update.

* Update errorprone.

* Update docker_build_containers.sh.

* Update errorprone in licenses.yaml.

* Add some more run-javas.

* Additional run-javas.

* Update errorprone.

* Suppress new errorprone error.

* Add exports and opens in ForkingTaskRunner for Java 11+.

Test, doc changes.

* Additional errorprone updates.

* Update for errorprone.

* Restore old formatting in LdapCredentialsValidator.

* Copy bin/ too.

* Fix Java 15, 17 build line in docker_build_containers.sh.

* Update busybox image.

* One more java command.

* Fix interpolation.

* IT commandline refinements.

* Switch to busybox 1.34.1-glibc.

* POM adjustments, build and test one IT on 17.

* Additional debugging.

* Fix silly thing.

* Adjust command line.

* Add exports and opens one more place.

* Additional harmonization of strong encapsulation parameters.
2023-07-07 12:52:35 -07:00
Gian Merlino dd78e00dc5
Fix ColumnSignature error message and jdk17 test issue. (#14538)
* Fix ColumnSignature error message and jdk17 test issue.

On jdk17, the "problem" part of the error message could change from
NullPointerException to:

  Cannot invoke "String.length()" because "s" is null

Due to the new more-helpful NPEs in Java 17. This broke the expectation
and led to test failures on this case.

This patch fixes the problem by improving the error message so it isn't
a generic NullPointerException.

* Fix format.
2023-07-06 15:10:59 -07:00
Abhishek Radhakrishnan d02bb8bb6e
Set explain attributes after the query is prepared (#14490)
* Add support for DML WITH AS.

* One more UT for with as subquery.

* Add a test with join query

* Use root query prepared node instead of individual SqlNode types.

- Set the explain plan attributes after the query is prepared, when
the query is planned and we have the finalized output names in the root
source rel node.
- Adjust tests; add unit test for negative ordinal case.
- Remove the exception / error handling logic from resolveClusteredBy
function since the validations now happen before it comes to the function

* Update comment.
2023-07-06 14:13:32 -04:00
imply-cheddar 5fc122a144
Add window-focused tests from Drill (#13773)
This commit borrows some test definitions from Drill's test suite
and tries to use them to flesh out the full validation of window
function capabilities.

In order to be able to run these tests, we also add the ability to
run a Scan operation against segments, which also meant an
implementation of RowsAndColumns for frames.
2023-07-06 09:20:32 -07:00
Soumyava 78db7a4414
A query in MSQ would issue wrong error code (#14531)
with a RuntimeException. Now the RuntimeException is replaced by a user-facing DruidException of the Invalid category, which allows Calcite not to throw an uncategorized exception.
2023-07-06 08:59:35 +05:30
Jonathan Wei f29a9faa94
Better surfacing of invalid pattern errors for SQL REGEXP_EXTRACT function (#14505) 2023-07-05 17:12:54 -05:00
Pranav 2d5b27358e
Logging the fieldName in the coerce exceptions (#14483)
Logging the fieldName in the coerce exceptions
2023-07-03 14:13:27 +05:30
Gian Merlino e10e35aa2c
Add REGEXP_REPLACE function. (#14460)
* Add REGEXP_REPLACE function.

Replaces all instances of a pattern with a replacement string.

* Fixes.

* Improve test coverage.

* Adjust behavior.
2023-06-29 13:47:57 -07:00
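The "replaces all instances" semantics described in the commit above can be sketched with Python's `re.sub` as a stand-in. The `regexp_replace` helper is hypothetical, and Druid's regex dialect and replacement syntax may differ from Python's.

```python
import re


def regexp_replace(value: str, pattern: str, replacement: str) -> str:
    """Replace every non-overlapping match of `pattern` in `value`."""
    # re.sub replaces all matches by default (count=0 means "no limit").
    return re.sub(pattern, replacement, value)


result = regexp_replace("a1b22c333", r"[0-9]+", "#")
unchanged = regexp_replace("abc", r"[0-9]+", "#")
```

Every run of digits is replaced, not just the first one. A value with no match is returned unchanged.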
Gian Merlino a6cabbe10f
SQL: Avoid "intervals" for non-table-based datasources. (#14336)
In these other cases, stick to plain "filter". This simplifies lots of
logic downstream, and doesn't hurt since we don't have intervals-specific
optimizations outside of tables.

Fixes an issue where we couldn't properly filter on a column from an
external datasource if it was named __time.
2023-06-29 09:57:11 +05:30
Gian Merlino 34c55a0bde
SQL: SUBSTRING support for non-literals. (#14480)
* SQL: SUBSTRING support for non-literals.

* Fix AssertionError test.

* Fix header.
2023-06-28 13:43:05 -07:00
Jonathan Wei c36f12f1d8
Support complex variance object inputs for variance SQL agg function (#14463)
* Support complex variance object inputs for variance SQL agg function

* Add test

* Include complexTypeChecker, address PR comments

* Checkstyle, javadoc link
2023-06-28 13:14:19 -05:00
Karan Kumar cb3a9d2b57
Adding Interactive APIs for MSQ engine (#14416)
This PR aims to expose a new API at
`@Path("/druid/v2/sql/statements/")` that takes the same payload as the current "/druid/v2/sql" endpoint and lets users fetch results asynchronously.
2023-06-28 17:51:58 +05:30
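A minimal sketch of building a request for the statements endpoint named in the commit above, assuming the usual `{"query": ...}` SQL payload. The base URL and the `build_statement_request` helper are assumptions for illustration. The request is only constructed, not sent, and any further parameters the endpoint may require are not covered here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8888"  # assumption: address of a local Druid router


def build_statement_request(sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the statements endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + "/druid/v2/sql/statements/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_statement_request("SELECT 1")
```

Passing `req` to `urllib.request.urlopen` would submit it against a running cluster.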
Gian Merlino c78d885b80
Cache parsed expressions and binding analysis in more places. (#14124)
* Cache parsed expressions and binding analysis in more places.

Main changes:

1) Cache parsed and analyzed expressions within PlannerContext for a
   single SQL query.

2) Cache parsed expressions together with input binding analysis using
   a new class AnalyzedExpr.

This speeds up SQL planning, because SQL planning involves parsing and
analyzing the same expression strings over and over again.

* Fixes.

* Fix style.

* Fix test.

* Simplify: get rid of AnalyzedExpr, focus on caching.

* Rename parse -> parseExpression.
2023-06-27 13:40:35 -07:00
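The caching idea in the commit above can be sketched as a per-query memo keyed by the expression string. The Python class below only borrows the names `PlannerContext` and `parse_expression` for readability. It is a hypothetical toy, not Druid's Java implementation, and the whitespace-splitting "parser" is a placeholder.

```python
from typing import Callable, Dict


class PlannerContext:
    """Toy per-query context that parses each distinct expression string once."""

    def __init__(self, parser: Callable[[str], object]) -> None:
        self._parser = parser
        self._cache: Dict[str, object] = {}
        self.parse_calls = 0  # counts real parser invocations, for demonstration

    def parse_expression(self, expression: str) -> object:
        if expression not in self._cache:
            self.parse_calls += 1
            self._cache[expression] = self._parser(expression)
        return self._cache[expression]


# Placeholder parser: split on whitespace. A real parser would build an AST.
ctx = PlannerContext(parser=lambda s: tuple(s.split()))
for _ in range(3):
    parsed = ctx.parse_expression('"x" + 1')
```

Scoping the cache to one context object means it is discarded with the query, so nothing needs to be evicted.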
Clint Wylie 6ba10c8b6c
fix bug with json_value expression array extraction (#14461) 2023-06-26 21:02:44 -07:00
Abhishek Radhakrishnan 903addf7c2
Make agg and scalar routines test to depend on specific routine names. (#14482) 2023-06-26 23:03:08 -04:00
Abhishek Radhakrishnan 79bff4bbf7
Improvements to `EXPLAIN PLAN` attributes (#14441)
* Updates: use the target table directly, sanitized replace time chunks and clustered by cols.

* Add DruidSqlParserUtil and tests.

* minor refactor

* Use SqlUtil.isLiteral

* Throw ValidationException if CLUSTERED BY column descending order is specified.

- Fails query planning

* Some more tests.

* fixup existing comment

* Update comment

* checkstyle fix: remove unused imports

* Remove InsertCannotOrderByDescendingFault and deprecate the fault in readme.

* minor naming

* move deprecated field to the bottom

* update docs.

* add one more example.

* Collapsible query and result

* checkstyle fixes

* Code cleanup

* order by changes

* conditionally set attributes only for explain queries.

* Cleaner ordinal check.

* Add limit test and update javadoc.

* Commentary and minor adjustments.

* Checkstyle fixes.

* One more checkArg.

* add unexpected kind to exception.
2023-06-26 23:01:11 -04:00
Laksh Singla 1647d5f4a0
Limit the subquery results by memory usage (#13952)
Users can now add a guardrail to prevent a subquery’s results from exceeding a set number of bytes by setting druid.server.http.maxSubqueryBytes in the Broker's config or maxSubqueryBytes in the query context. This feature is experimental for now and falls back to row-based limiting if it fails to get an accurate size of the results consumed by the query.
2023-06-26 18:12:28 +05:30
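The guardrail and its fallback, as described in the commit above, can be sketched as follows. All names here (`collect_subquery_results`, `estimate_size`, `SubqueryLimitExceeded`) are hypothetical. The sketch only shows the control flow of a byte limit that falls back to a row limit when sizes cannot be estimated, and it makes no claim about how Druid measures result sizes.

```python
from typing import Callable, Iterable, List, Optional


class SubqueryLimitExceeded(Exception):
    """Raised when a subquery's results exceed the configured guardrail."""


def collect_subquery_results(
    rows: Iterable[str],
    max_rows: int,
    max_bytes: Optional[int],
    estimate_size: Callable[[str], Optional[int]],
) -> List[str]:
    collected: List[str] = []
    total_bytes = 0
    use_bytes = max_bytes is not None
    for row in rows:
        collected.append(row)
        if use_bytes:
            size = estimate_size(row)
            if size is None:
                # Size is unknown: fall back to row-based limiting from here on.
                use_bytes = False
            else:
                total_bytes += size
                if total_bytes > max_bytes:
                    raise SubqueryLimitExceeded("byte limit exceeded")
        if not use_bytes and len(collected) > max_rows:
            raise SubqueryLimitExceeded("row limit exceeded")
    return collected


ok = collect_subquery_results(["ab", "cd"], max_rows=10, max_bytes=5, estimate_size=len)
```

With `estimate_size=len`, two 2-character rows fit under the 5-byte limit. A third row would push the total to 6 and raise.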
Gian Merlino d7c9c2f367
SqlResults: Coerce arrays to lists for VARCHAR. (#14260)
* SqlResults: Coerce arrays to lists for VARCHAR.

Useful for STRING_TO_MV, which returns VARCHAR at the SQL layer and an
ExprEval with String[] at the native layer.

* Fix style.

* Improve test coverage.

* Remove unnecessary throws.
2023-06-25 09:35:18 -07:00
Gian Merlino 3d19b748fb
SQL OperatorConversions: Introduce aggregatorBuilder, allow CAST-as-literal. (#14249)
* SQL OperatorConversions: Introduce aggregatorBuilder, allow CAST-as-literal.

Four main changes:

1) Provide aggregatorBuilder, a more consistent way of defining the
   SqlAggFunction we need for all of our SQL aggregators. The mechanism
   is analogous to the one we already use for SQL functions
   (OperatorConversions.operatorBuilder).

2) Allow CASTs of constants to be considered as "literalOperands". This
   fixes an issue where various of our operators are defined with
   OperandTypes.LITERAL as part of their checkers, which doesn't allow
   casts. However, in these cases we generally _do_ want to allow casts.
   The important piece is that the value must be reducible to a constant,
   not that the SQL text is literally a literal.

3) Update DataSketches SQL aggregators to use the new aggregatorBuilder
   functionality. The main user-visible effect here is [2]: the aggregators
   would now accept, for example, "CAST(0.99 AS DOUBLE)" as a literal
   argument. Other aggregators could be updated in a future patch.

4) Rename "requiredOperands" to "requiredOperandCount", because the
   old name was confusing. (It rhymes with "literalOperands" but the
   arguments mean different things.)

* Adjust method calls.
2023-06-23 16:25:04 -07:00
Rishabh Singh 155fde33ff
Add metrics to SegmentMetadataCache refresh (#14453)
New metrics:
- `segment/metadatacache/refresh/time`: time taken to refresh segments per datasource
- `segment/metadatacache/refresh/count`: number of segments being refreshed per datasource
2023-06-23 16:51:08 +05:30
Rohan Garg 09d6c5a45e
Decouple logical planning and native query generation in SQL planning (#14232)
Add a new planning strategy that explicitly decouples the DAG from building the native query.

With this mode, it is Calcite's job to generate a "logical DAG" which is all of the various
DruidProject, DruidFilter, etc. nodes.  We then take those nodes and use them to build a native
query.  The current commit doesn't pass all tests, but it does work for some things and is a
decent starting baseline.
2023-06-19 16:00:40 -07:00
imply-cheddar cfd07a95b7
Errors take 3 (#14004)
Introduce DruidException, an exception whose goal in life is to be delivered to a user.

DruidException itself has javadoc on it to describe how it should be used.  This commit both introduces the Exception and adjusts some of the places that are generating exceptions to generate DruidException objects instead, as a way to show how the Exception should be used.

This work was a 3rd iteration on top of work that was started by Paul Rogers.  I don't know if his name will survive the squash-and-merge, so I'm calling it out here and thanking him for starting on this.
2023-06-19 01:11:13 -07:00
Adarsh Sanjeev 128133fadc
Add replication_factor column to sys.segments table (#14403)
Description:
Druid allows a configuration of load rules that may cause a used segment to not be loaded
on any historical. This status is not tracked in the sys.segments table on the broker, which
makes it difficult to determine if the unavailability of a segment is expected and if we should
not wait for it to be loaded on a server after ingestion has finished.

Changes:
- Track replication factor in `SegmentReplicantLookup` during evaluation of load rules
- Update API `/druid/coordinator/v1/metadata/segments` to return replication factor
- Add column `replication_factor` to the sys.segments virtual table and populate it in
`MetadataSegmentView`
- If this column is 0, the segment is not assigned to any historical and will not be loaded.
2023-06-18 10:02:21 +05:30
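One way the new column from the commit above might be used is to list segments that no load rule assigns to a historical. The sketch builds a Druid SQL payload for that. The `unassigned_segments_payload` helper and the `wikipedia` datasource name are assumptions, and the payload is only constructed, not sent.

```python
import json


def unassigned_segments_payload(datasource: str) -> str:
    """Build a SQL API payload listing segments whose replication_factor is 0."""
    sql = (
        "SELECT segment_id FROM sys.segments "
        "WHERE datasource = ? AND replication_factor = 0"
    )
    # The datasource is passed as a query parameter instead of being
    # concatenated into the SQL text.
    return json.dumps(
        {"query": sql, "parameters": [{"type": "VARCHAR", "value": datasource}]}
    )


payload = json.loads(unassigned_segments_payload("wikipedia"))
```

A client waiting for ingestion to become queryable could exclude these segments from the set it waits on.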
Abhishek Radhakrishnan 04fb75719e
Fail query planning if a `CLUSTERED BY` column contains descending order (#14436)
* Throw ValidationException if CLUSTERED BY column descending order is specified.

- Fails query planning

* Some more tests.

* fixup existing comment

* Update comment

* checkstyle fix: remove unused imports

* Remove InsertCannotOrderByDescendingFault and deprecate the fault in readme.

* move deprecated field to the bottom
2023-06-16 18:10:12 -04:00
Clint Wylie 359bd63cc9
allow expression "best effort" type determination to better handle mixed type arrays (#14438) 2023-06-16 00:02:43 -07:00
Clint Wylie 8454cc619a
auto columns fixes (#14422)
changes:
* auto columns no longer participate in generic 'null column' handling, this was a mistake to try to support and caused ingestion failures due to mismatched ColumnFormat, and will be replaced in the future with nested common format constant column functionality (not in this PR)
* fix bugs with auto columns which contain empty objects, empty arrays, or primitive types mixed with either of these empty constructs
* fix bug with bound filter when upper is null equivalent but is strict
2023-06-14 08:57:06 -07:00
Abhishek Radhakrishnan b8495d45a1
Expose Druid functions in `INFORMATION_SCHEMA.ROUTINES` table. (#14378)
* Add INFORMATION_SCHEMA.ROUTINES to expose Druid operators and functions.

* checkstyle

* remove IS_DETERMINISTIC.

* test

* cleanup test

* remove logs and simplify

* fixup unit test

* Add docs for INFORMATION_SCHEMA.ROUTINES table.

* Update test and add another SQL query.

* add stuff to .spelling and checkstyle fix.

* Add more tests for custom operators.

* checkstyle and comment.

* Some naming cleanup.

* Add FUNCTION_ID

* The different Calcite function syntax enums get translated to FUNCTION

* Update docs.

* Cleanup markdown table.

* fixup test.

* fixup intellij inspection

* Review comment: nullable column; add a function to determine function syntax.

* More tests; add non-function syntax operators.

* More unit tests. Also add a separate test for DruidOperatorTable.

* actually just validate non-zero count.

* switch up the order

* checkstyle fixes.
2023-06-13 15:44:04 -04:00
Abhishek Radhakrishnan 326f2c5020
Add more statement attributes to explain plan result. (#14391)
This PR adds the following to the ATTRIBUTES column in the explain plan output:
- partitionedBy
- clusteredBy
- replaceTimeChunks

This PR leverages the work done in #14074, which added a new column ATTRIBUTES
to encapsulate all the statement-related attributes.
2023-06-12 19:18:02 +05:30
Abhishek Radhakrishnan 2d258a95ad
Fix `EARLIEST_BY`/`LATEST_BY` signature and include function name in signature. (#14352)
* Fix EarliestLatestBySqlAggregator signature; Include function name for all signatures.

* Single quote function signatures, space between args and remove \n.

* fixup UT assertion
2023-06-06 09:41:05 -07:00
zachjsh 04a82da63d
Input source security fixes (#14266)
It was found that several supported tasks and input sources did not implement the methods used by the input source security feature, causing them to fail when used with this feature. This PR adds the missing implementations. It also secures the sampling endpoint with input source security, when enabled.
2023-06-01 16:37:19 -07:00
Clint Wylie 4096f51f0b
add configurable ColumnTypeMergePolicy to SegmentMetadataCache (#14319)
This PR adds a new interface to control how SegmentMetadataCache chooses a ColumnType when segments disagree while computing SQL schemas. The policy is exposed as druid.sql.planner.metadataColumnTypeMergePolicy. The PR also adds a new 'least restrictive type' mode, which chooses the type that data across all segments can best be coerced into, and sets it as the default behavior.

This is a behavior change around when segment driven schema migrations take effect for the SQL schema. With latestInterval, the SQL schema will be updated as soon as the first job with the new schema has published segments, while using leastRestrictive, the schema will only be updated once all segments are reindexed to the new type. The benefit of leastRestrictive is that it eliminates a bunch of type coercion errors that can happen in SQL when types are varied across segments with latestInterval because the newest type is not able to correctly represent older data, such as if the segments have a mix of ARRAY and number types, or any other combinations that lead to odd query plans.
2023-05-24 20:32:51 +05:30
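The difference between the two policies in the commit above can be sketched with a toy model. The widening order below (`LONG`, then `DOUBLE`, then `STRING`) is a hypothetical simplification chosen for illustration, and both helper functions are assumptions rather than Druid's implementation.

```python
from typing import List, Tuple

# Hypothetical widening order, for illustration only: each type can be
# coerced into any type that appears later in the list.
WIDENING_ORDER = ["LONG", "DOUBLE", "STRING"]

# (interval_start, column_type) per segment
SegmentType = Tuple[str, str]


def latest_interval(segment_types: List[SegmentType]) -> str:
    """Use the column type reported by the most recent segment."""
    return max(segment_types, key=lambda s: s[0])[1]


def least_restrictive(segment_types: List[SegmentType]) -> str:
    """Use the widest type, so data from every segment can be coerced into it."""
    return max((t for _, t in segment_types), key=WIDENING_ORDER.index)


# An older segment stores DOUBLE; a newer job has published LONG.
segments = [("2023-01-01", "DOUBLE"), ("2023-02-01", "LONG")]
by_latest = latest_interval(segments)
by_least_restrictive = least_restrictive(segments)

# After the older segment is reindexed to LONG, both policies agree.
reindexed = [("2023-01-01", "LONG"), ("2023-02-01", "LONG")]
```

In this toy model `latest_interval` switches to `LONG` as soon as the new segment is published, while `least_restrictive` stays on `DOUBLE` until every segment has been reindexed.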
Abhishek Radhakrishnan 338bdb35ea
Return `RESOURCES` in `EXPLAIN PLAN` as an ordered collection (#14323)
* Make resources an ordered collection so it's deterministic.

* test cleanup

* fixup docs.

* Replace deprecated ObjectNode#put() calls with ObjectNode#set().
2023-05-23 00:55:00 -05:00
Clint Wylie d92b9fbfac
more resilient segment metadata, dont parallel merge internal segment metadata queries (#14296) 2023-05-17 04:12:55 -07:00
Paul Rogers 3c0983c8e9
Extend the IT framework to allow tests in extensions (#13877)
The "new" IT framework provides a convenient way to package and run integration tests (ITs), but only for core modules. We have a use case to run an IT for a contrib extension: the proposed gRPC query extension. This PR provides the IT framework functionality to allow non-core ITs.
2023-05-15 20:29:51 +05:30
imply-cheddar f9861808bc
Be able to load segments on Peons (#14239)
* Be able to load segments on Peons

This change introduces a new config on WorkerConfig
that indicates how many bytes of each storage
location to use for storage of a task.  Said config
is divided up amongst the locations and slots
and then used to set TaskConfig.tmpStorageBytesPerTask

The Peons use their local task dir and
tmpStorageBytesPerTask as their StorageLocations for
the SegmentManager such that they can accept broadcast
segments.
2023-05-12 16:51:00 -07:00
Soumyava f128b9b666
Updates to filter processing for inner query in Joins (#14237) 2023-05-11 17:21:41 +05:30
Clint Wylie a58cebe491
add array_to_mv function to convert arrays into mvds to assist with migration from mvds to arrays (#14236) 2023-05-11 04:43:28 -07:00
Clint Wylie 8805d8d7db
fix issues with filtering nulls on values coerced to numeric types (#14139)
* fix issues with filtering nulls on values coerced to numeric types
* fix issues with 'auto' type numeric columns in default value mode
* optimize variant typed columns without nested data
* more tests for 'auto' type column ingestion
2023-05-08 13:19:02 -07:00
Rohan Garg 4d8feeb279
Fix planning in CASE expressions with complex WHEN and ELSE expressions (#14220) 2023-05-08 11:35:04 +05:30
zachjsh 48cde236c4
Add columnMappings to explain plan output (#14187)
* Add columnMappings to explain plan output

* * fix checkstyle
* add tests

* * improve test coverage

* * temporarily remove unit-test need to run ITs

* * depend on build

* * temporarily lower unit test threshold

* * add back dependency on unit-tests

* * add license headers

* * fix header order

* * review comments

* * fix intellij inspection errors

* * revert code coverage change
2023-05-04 10:36:28 -07:00
Gian Merlino 42c8c84eb6
TimeBoundary: Use cursor when datasource is not a regular table. (#14151)
* TimeBoundary: Use cursor when datasource is not a regular table.

Fixes a bug where TimeBoundary could return incorrect results with
INNER Join or inline data.

* Addl Javadocs.
2023-04-26 17:00:13 -07:00
Gian Merlino 89e7948159
MSQ: Subclass CalciteJoinQueryTest, other supporting changes. (#14105)
* MSQ: Subclass CalciteJoinQueryTest, other supporting changes.

The main change is the new tests: we now subclass CalciteJoinQueryTest
in CalciteSelectJoinQueryMSQTest twice, once for Broadcast and once for
SortMerge.

Three supporting production changes for default-value mode:

1) InputNumberDataSource is marked as concrete, to allow leftFilter to
   be pushed down to it.

2) In default-value mode, numeric frame field readers can now return nulls.
   This is necessary when stacking joins on top of joins: nulls must be
   preserved for semantics that match broadcast joins and native queries.

3) In default-value mode, StringFieldReader.isNull returns true on empty
   strings in addition to nulls. This is more consistent with the behavior
   of the selectors, which map empty strings to null as well in that mode.

As an effect of change (2), the InsertTimeNull change from #14020 (to
replace null timestamps with default timestamps) is reverted. IMO, this
is fine, as either behavior is defensible, and the change from #14020
hasn't been released yet.

* Adjust tests.

* Style fix.

* Additional tests.
2023-04-25 12:10:23 -07:00
Gian Merlino f643abdad9
SQL planning: Consider subqueries in fewer scenarios. (#14123)
* SQL planning: Consider subqueries in fewer scenarios.

Further adjusts logic in DruidRules that was previously adjusted in #13902.
The reason for the original change was that the comment "Subquery must be
a groupBy, so stage must be >= AGGREGATE" was no longer accurate. Subqueries
do not need to be groupBy anymore; they can really be any type of query.
If I recall correctly, the change was needed for certain window queries
to be able to plan on top of Scan queries.

However, this impacts performance negatively, because it causes many
additional outer-query scenarios to be considered, which is expensive.

So, this patch updates the matching logic to consider fewer scenarios. The
skipped scenarios are ones where we expect that, for one reason or another,
it isn't necessary to consider a subquery.

* Remove unnecessary escaping.

* Fix test.
2023-04-21 08:32:13 -07:00
Soumyava 8d60edcfcb
Updating segment map function for QueryDataSource to ensure group by … (#14112)
* Updating segment map function for QueryDataSource to ensure group by of group by of join data source gets into proper segment map function path

* Adding unit tests for the failed case

* There you go coverage bot, be happy now
2023-04-20 13:22:29 -07:00