Commit Graph

246 Commits

Author SHA1 Message Date
Clint Wylie 36e659a501
remove group-by v1 (#14866)
* remove group-by v1

* docs

* remove unused configs, fix test

* fix test

* adjustments

* why not

* adjust

* review stuff
2023-08-23 12:44:06 -07:00
Zoltan Haindrich e806d09309
Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (#14848) 2023-08-22 22:50:19 -07:00
Clint Wylie fb053c399c
consolidate json and auto indexers, remove v4 nested column serializer (#14456) 2023-08-22 18:50:11 -07:00
Clint Wylie 194a9c9abc
set druid.expressions.useStrictBooleans to true by default (#14734) 2023-08-22 00:19:56 -07:00
Clint Wylie 5d1412949e
enable sql compatible null handling mode by default (#14792)
* enable sql compatible null handling mode by default
* fix bug with string first/last aggs when druid.generic.useDefaultValueForNull=false
2023-08-21 20:07:13 -07:00
317brian 6b4dda964d
Docusaurus2 upgrade for master (#14411)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-08-16 19:01:21 -07:00
Soumyava afe22907a5
Calcite upgrade 1.35 (#14510)
* Update to Calcite 1.35.0
* Update from.ftl for Calcite 1.35.0.
* Fixed tests in Calcite upgrade by doing the following:
1. Added a new rule, CoreRules.PROJECT_FILTER_TRANSPOSE_WHOLE_PROJECT_EXPRESSIONS, to Base rules
2. Refactored the CorrelateUnnestRule
3. Updated CorrelateUnnestRel accordingly
4. Fixed a case with selector filters on the left where Calcite was eliding the virtual column
5. Additional test cases for fixes in 2,3,4
6. Update to StringListAggregator to fail a query if separators are not propagated appropriately
* Refactored for testcases to pass after the upgrade, introduced 2 new data sources for handling filters and select projects
* Added a literalSqlAggregator, as the upgraded Calcite changed the subquery remove rule. This corrected plans for 2 queries with joins and subqueries by replacing a useless literal dimension with a post agg. Additionally, a test with COUNT DISTINCT and FILTER, which was failing with Calcite 1.21, is added here and passes with 1.35
* Updated to latest avatica and updated code as SqlUnknownTimeStamp is now used in Calcite which needs to be resolved to a timestamp literal
* Added a wrapper segment ref to use for unnest and filter segment reference
2023-08-11 12:47:16 -07:00
Clint Wylie e57f880020
document new filters and stuff (#14760) 2023-08-08 16:01:06 -07:00
Clint Wylie 667e4dab5e
document expression aggregator (#14497) 2023-08-08 15:49:29 -07:00
317brian 3b5b6c6a41
docs: query from deep storage (#14609)
* cold tier wip

* wip

* copyedits

* wip

* copyedits

* copyedits

* wip

* wip

* update rules page

* typo

* typo

* update sidebar

* moves durable storage info to its own page in operations

* update screenshots

* add apache license

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* add query from deep storage tutorial stub

* address some of the feedback

* revert screenshot update. handled in separate pr

* load rule update

* wip tutorial

* reformat deep storage endpoints

* rest of tutorial

* typo

* cleanup

* screenshot and sidebar for tutorial

* add license

* typos

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* rest of review comments

* clarify where results are stored

* update api reference for durablestorage context param

* Apply suggestions from code review

Co-authored-by: Karan Kumar <karankumar1100@gmail.com>

* comments

* incorporate #14720

* address rest of comments

* missed one

* Update docs/api-reference/sql-api.md

* Update docs/api-reference/sql-api.md

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: demo-kratia <56242907+demo-kratia@users.noreply.github.com>
Co-authored-by: Karan Kumar <karankumar1100@gmail.com>
2023-08-04 11:10:08 +05:30
slfan1989 d69edb7723
Docs: Fix some typos. (#14663)
---------

Co-authored-by: slfan1989 <louj1988@@>
2023-07-26 21:24:18 +05:30
Abhishek Radhakrishnan f4d0ea7bc8
Add support for earliest `aggregatorMergeStrategy` (#14598)
* Add EARLIEST aggregator merge strategy.

- More unit tests.
- Include the aggregators analysis type by default in tests.

* Docs.

* Some comments and a test

* Collapse into individual code blocks.
2023-07-18 12:37:10 -07:00
Abhishek Radhakrishnan f4ee58eaa8
Add `aggregatorMergeStrategy` property in SegmentMetadata queries (#14560)
* Add aggregatorMergeStrategy property to SegmentMetadataQuery.

- Adds a new property aggregatorMergeStrategy to segmentMetadata query.
aggregatorMergeStrategy currently supports three types of merge strategies -
the legacy strict and lenient strategies, and the new latest strategy.
- The latest strategy considers the latest aggregator from the latest segment
by time order when there's a conflict when merging aggregators from different
segments.
- Deprecate the lenientAggregatorMerge property. The API validates that the new
and old properties are not both set, and returns an exception if they are.
- When merging segments as part of segmentMetadata query, the segments have a more
elaborate id -- <datasource>_<interval>_merged_<partition_number> format, similar to
the name format that segments usually contain. Previously it was simply "merged".
- Adjust unit tests to test the latest strategy, to assert the returned complete
SegmentAnalysis object instead of just the aggregators for completeness.

* Don't explicitly set strict strategy in tests

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/segmentmetadataquery.md

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2023-07-13 12:37:36 -04:00
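As a rough illustration of the `latest` strategy described in the `aggregatorMergeStrategy` commit above (#14560), the following Python sketch merges per-segment aggregator maps so that, on a name conflict, the aggregator from the latest segment by time order wins. It is not Druid's implementation: `merge_aggregators`, the tuple layout, and the sample aggregator specs are hypothetical, and the `earliest` branch only mirrors the strategy named in #14598.

```python
from typing import Dict, List, Tuple

# Hypothetical shape: aggregator name -> aggregator spec.
Aggregators = Dict[str, dict]


def merge_aggregators(
    segments: List[Tuple[str, Aggregators]],
    strategy: str = "latest",
) -> Aggregators:
    """Merge aggregators from (segment_start, aggregators) pairs.

    Assumes segment_start values are ISO-8601 strings in one format and
    time zone, so that string order matches time order.
    """
    if strategy not in ("latest", "earliest"):
        raise ValueError(f"unsupported strategy in this sketch: {strategy}")

    # Process segments so that the preferred segment is applied last and
    # therefore overwrites any conflicting aggregator of the same name.
    ordered = sorted(
        segments,
        key=lambda seg: seg[0],
        reverse=(strategy == "earliest"),
    )
    merged: Aggregators = {}
    for _, aggregators in ordered:
        merged.update(aggregators)
    return merged
```

With two segments that both define an aggregator `m`, `latest` keeps the definition from the segment with the later start time and `earliest` keeps the other one; aggregators that appear in only one segment are retained either way.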
Abhishek Radhakrishnan 854ef98235
Minor doc fixes. (#14565)
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2023-07-11 13:12:40 -07:00
Gian Merlino e10e35aa2c
Add REGEXP_REPLACE function. (#14460)
* Add REGEXP_REPLACE function.

Replaces all instances of a pattern with a replacement string.

* Fixes.

* Improve test coverage.

* Adjust behavior.
2023-06-29 13:47:57 -07:00
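The behavior described for REGEXP_REPLACE in the commit above (replace all instances of a pattern with a replacement string) has a close analogue in Python's `re.sub`, shown here only as an illustration. Druid's function may differ in regular-expression dialect and replacement syntax, so this sketch should not be read as its exact semantics; the helper name `regexp_replace` is hypothetical.

```python
import re


def regexp_replace(text: str, pattern: str, replacement: str) -> str:
    # re.sub replaces every non-overlapping match when count is left at 0.
    return re.sub(pattern, replacement, text)
```

For example, `regexp_replace("a1b22c333", r"\d+", "#")` replaces each run of digits with a single `#`.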
Abhishek Radhakrishnan 79bff4bbf7
Improvements to `EXPLAIN PLAN` attributes (#14441)
* Updates: use the target table directly, sanitized replace time chunks and clustered by cols.

* Add DruidSqlParserUtil and tests.

* minor refactor

* Use SqlUtil.isLiteral

* Throw ValidationException if CLUSTERED BY column descending order is specified.

- Fails query planning

* Some more tests.

* fixup existing comment

* Update comment

* checkstyle fix: remove unused imports

* Remove InsertCannotOrderByDescendingFault and deprecate the fault in readme.

* minor naming

* move deprecated field to the bottom

* update docs.

* add one more example.

* Collapsible query and result

* checkstyle fixes

* Code cleanup

* order by changes

* conditionally set attributes only for explain queries.

* Cleaner ordinal check.

* Add limit test and update javadoc.

* Commentary and minor adjustments.

* Checkstyle fixes.

* One more checkArg.

* add unexpected kind to exception.
2023-06-26 23:01:11 -04:00
Nhi Pham 579b93f282
API reference refactor (#14372)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-06-26 15:48:54 -07:00
Laksh Singla 1647d5f4a0
Limit the subquery results by memory usage (#13952)
Users can now add a guardrail to prevent a subquery’s results from exceeding a set number of bytes by setting druid.server.http.maxSubqueryBytes in the Broker's config or maxSubqueryBytes in the query context. This feature is experimental for now and falls back to row-based limiting if it fails to get an accurate size for the results consumed by the query.
2023-06-26 18:12:28 +05:30
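The fallback described in the subquery-limit commit above (#13952) can be sketched as follows: apply a byte-based limit when the result size can be estimated, and otherwise fall back to a row-based limit. This is a minimal sketch under stated assumptions, not Druid's code; `enforce_subquery_limit`, `SubqueryLimitExceeded`, and the `size_of` estimator are hypothetical names.

```python
class SubqueryLimitExceeded(Exception):
    """Raised when subquery results exceed the configured guardrail."""


def enforce_subquery_limit(rows, max_rows, max_bytes=None, size_of=None):
    """Return which limit was applied: "bytes" or "rows".

    size_of is a caller-supplied (hypothetical) size estimator. If it
    cannot size some row, the check falls back to counting rows.
    """
    if max_bytes is not None and size_of is not None:
        try:
            total_bytes = sum(size_of(row) for row in rows)
        except (TypeError, ValueError):
            total_bytes = None  # size unknown: use the row-based limit
        if total_bytes is not None:
            if total_bytes > max_bytes:
                raise SubqueryLimitExceeded(
                    f"{total_bytes} bytes exceeds limit of {max_bytes}"
                )
            return "bytes"

    if len(rows) > max_rows:
        raise SubqueryLimitExceeded(
            f"{len(rows)} rows exceeds limit of {max_rows}"
        )
    return "rows"
```

Returning the name of the limit that was applied makes the fallback visible to the caller, which is convenient while a feature like this is still experimental.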
Adarsh Sanjeev 128133fadc
Add replication_factor column to sys.segments table (#14403)
Description:
Druid allows a configuration of load rules that may cause a used segment to not be loaded
on any historical. This status is not tracked in the sys.segments table on the broker, which
makes it difficult to determine if the unavailability of a segment is expected and if we should
not wait for it to be loaded on a server after ingestion has finished.

Changes:
- Track replication factor in `SegmentReplicantLookup` during evaluation of load rules
- Update API `/druid/coordinator/v1/metadata/segments` to return replication factor
- Add column `replication_factor` to the sys.segments virtual table and populate it in
`MetadataSegmentView`
- If this column is 0, the segment is not assigned to any historical and will not be loaded.
2023-06-18 10:02:21 +05:30
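To make the use of the `replication_factor` column from the commit above (#14403) concrete, here is a small Python sketch. It builds a SQL request body that lists segments with a replication factor of 0 and encodes the rule stated in the commit: such segments will not be loaded, so there is no point waiting for them. The function names are hypothetical, and the request shape (`query` plus typed `parameters`) is assumed to match Druid's SQL-over-HTTP API; verify it against the API reference before relying on it.

```python
def unassigned_segments_request(datasource: str) -> dict:
    """Build a SQL request body for segments not assigned to any historical."""
    return {
        "query": (
            'SELECT "segment_id" FROM sys.segments '
            'WHERE "datasource" = ? AND "replication_factor" = 0'
        ),
        # Assumed parameter format: a list of typed values bound to "?".
        "parameters": [{"type": "VARCHAR", "value": datasource}],
    }


def should_wait_for_load(replication_factor: int) -> bool:
    """A replication factor of 0 means the segment will not be loaded."""
    return replication_factor != 0
```

An ingestion client could run the request after a job finishes and skip waiting on any segment it returns.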
Abhishek Radhakrishnan b8495d45a1
Expose Druid functions in `INFORMATION_SCHEMA.ROUTINES` table. (#14378)
* Add INFORMATION_SCHEMA.ROUTINES to expose Druid operators and functions.

* checkstyle

* remove IS_DETERMINISTIC.

* test

* cleanup test

* remove logs and simplify

* fixup unit test

* Add docs for INFORMATION_SCHEMA.ROUTINES table.

* Update test and add another SQL query.

* add stuff to .spelling and checkstyle fix.

* Add more tests for custom operators.

* checkstyle and comment.

* Some naming cleanup.

* Add FUNCTION_ID

* The different Calcite function syntax enums get translated to FUNCTION

* Update docs.

* Cleanup markdown table.

* fixup test.

* fixup intellij inspection

* Review comment: nullable column; add a function to determine function syntax.

* More tests; add non-function syntax operators.

* More unit tests. Also add a separate test for DruidOperatorTable.

* actually just validate non-zero count.

* switch up the order

* checkstyle fixes.
2023-06-13 15:44:04 -04:00
Abhishek Radhakrishnan 326f2c5020
Add more statement attributes to explain plan result. (#14391)
This PR adds the following to the ATTRIBUTES column in the explain plan output:
- partitionedBy
- clusteredBy
- replaceTimeChunks

This PR leverages the work done in #14074, which added a new column ATTRIBUTES
to encapsulate all the statement-related attributes.
2023-06-12 19:18:02 +05:30
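A client reading the ATTRIBUTES column described in the commit above (#14391) might pull out the three new fields like this. The sketch assumes the column arrives as a JSON object string; the sample value, the defaults, and the helper name `parse_statement_attributes` are illustrative assumptions rather than documented output.

```python
import json


def parse_statement_attributes(raw: str) -> dict:
    """Extract the attributes named in the commit from an ATTRIBUTES value."""
    attributes = json.loads(raw)
    return {
        "partitionedBy": attributes.get("partitionedBy"),
        "clusteredBy": attributes.get("clusteredBy", []),
        "replaceTimeChunks": attributes.get("replaceTimeChunks"),
    }


# Hypothetical ATTRIBUTES value for a REPLACE statement.
sample = json.dumps(
    {
        "targetDataSource": "example_table",
        "partitionedBy": "DAY",
        "clusteredBy": ["dim1"],
        "replaceTimeChunks": "all",
    }
)
parsed = parse_statement_attributes(sample)
```

Using `dict.get` keeps the parser tolerant of statements, such as a plain SELECT, for which some of these attributes are presumably absent.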
317brian 70952c0977
docs: add sql array functions to nav (#14361)
* docs: add sql array functions to nav

* fix typo

* add sql array functions to list

* fix spelling errors
2023-06-01 16:45:27 -07:00
Katya Macedo 2da84de87f
docs: remove the note about segments (#14161) 2023-05-31 16:37:19 -07:00
Nhi Pham 70c06fc0e1
Advise against using WEEK granularity for Native Batch and MSQ (#14341)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-05-30 11:40:12 -07:00
Abhishek Radhakrishnan 338bdb35ea
Return `RESOURCES` in `EXPLAIN PLAN` as an ordered collection (#14323)
* Make resources an ordered collection so it's deterministic.

* test cleanup

* fixup docs.

* Replace deprecated ObjectNode#put() calls with ObjectNode#set().
2023-05-23 00:55:00 -05:00
Katya Macedo 269137c682
Update Ingestion section (#14023)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>
2023-05-19 09:42:27 -07:00
317brian 8bda7297e1
doc: fix unnest datasource syntax (#14272) 2023-05-12 13:05:27 -07:00
317brian 6254658f61
docs: fix links (#14111) 2023-05-12 09:59:16 -07:00
Clint Wylie a58cebe491
add array_to_mv function to convert arrays into mvds to assist with migration from mvds to arrays (#14236) 2023-05-11 04:43:28 -07:00
zachjsh 48cde236c4
Add columnMappings to explain plan output (#14187)
* Add columnMappings to explain plan output

* * fix checkstyle
* add tests

* * improve test coverage

* * temporarily remove unit-test need to run ITs

* * depend on build

* * temporarily lower unit test threshold

* * add back dependency on unit-tests

* * add license headers

* * fix header order

* * review comments

* * fix intellij inspection errors

* * revert code coverage change
2023-05-04 10:36:28 -07:00
Jill Osborne d4e478c909
NVL function docs update (#14169)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2023-04-27 11:17:21 -07:00
Clint Wylie f6a0888bc0
document arrays in sql (#12549)
* document arrays in sql

* adjustments

* Update docs/querying/sql-array-functions.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/sql-data-types.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/sql-data-types.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/sql-array-functions.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/querying/sql-array-functions.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update sql-array-functions.md

* fix stuff

* fix spelling

---------

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2023-04-17 19:08:46 -07:00
Abhishek Radhakrishnan c98c66558f
Include statement attributes in `EXPLAIN PLAN` output (#14074)
This commit adds attributes that contain metadata information about the query
in the EXPLAIN PLAN output. The attributes currently contain two items:
- `statementType`: SELECT, INSERT or REPLACE
- `targetDataSource`: provides the target datasource name for DML statements

It is added to both the legacy and native query plan outputs.
2023-04-17 21:00:25 +05:30
Atul Mohan e3c160f2f2
Add start_time column to sys.servers (#13358)
Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.
2023-04-14 15:23:34 +05:30
317brian 7e572eef08
docs: sql unnest and cleanup unnest datasource (#13736)
Co-authored-by: Elliott Freis <elliottfreis@Elliott-Freis.earth.dynamic.blacklight.net>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Paul Rogers <paul-rogers@users.noreply.github.com>
Co-authored-by: Jill Osborne <jill.osborne@imply.io>
Co-authored-by: Anshu Makkar <83963638+anshu-makkar@users.noreply.github.com>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Elliott Freis <108356317+imply-elliott@users.noreply.github.com>
Co-authored-by: Nicholas Lippis <nick.lippis@imply.io>
Co-authored-by: Rohan Garg <7731512+rohangarg@users.noreply.github.com>
Co-authored-by: Karan Kumar <karankumar1100@gmail.com>
Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Co-authored-by: Clint Wylie <cwylie@apache.org>
Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
2023-04-04 13:07:54 -07:00
frankgrimes97 2f98675285
Tuple sketch SQL support (#13887)
This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.
2023-03-28 18:47:12 +05:30
Jill Osborne 4f95285406
Correct nested columns JSON example (#13953) 2023-03-21 09:17:26 -07:00
317brian 65a663adbb
docs: clarify Java precision (#13671)
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2023-03-15 11:43:41 -07:00
somu-imply a7ba361666
Refactoring and bug fixes on top of unnest. The allowList now is not passed … (#13922)
* Refactoring and bug fixes on top of unnest. The filter now is passed inside the unnest cursors. Added tests for scenarios such as
1. filter on unnested column which involves a left filter rewrite
2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
3. not filters
4. SQL functions applied on top of unnested column
5. null present in first row of the column to be unnested
2023-03-14 16:05:56 -07:00
Gian Merlino 4b1ffbc452
Various changes and fixes to UNNEST. (#13892)
* Various changes and fixes to UNNEST.

Native changes:

1) UnnestDataSource: Replace "column" and "outputName" with "virtualColumn".
   This enables pushing expressions into the datasource. This in turn
   allows us to do the next thing...

2) UnnestStorageAdapter: Logically apply query-level filters and virtual
   columns after the unnest operation. (Physically, filters are pulled up,
   when possible.) This is beneficial because it allows filters and
   virtual columns to reference the unnested column, and because it is
   consistent with how the join datasource works.

3) Various documentation updates, including declaring "unnest" as an
   experimental feature for now.

SQL changes:

1) Rename DruidUnnestDatasourceRel (& Rule) to DruidUnnestRel (& Rule). The rel
   is simplified: it only handles the UNNEST part of a correlated join.
   Constant UNNESTs are handled with regular inline rels.

2) Rework DruidCorrelateUnnestRule to focus on pulling Projects from
   the left side up above the Correlate. New test testUnnestTwice verifies
   that this works even when two UNNESTs are stacked on the same table.

3) Include ProjectCorrelateTransposeRule from Calcite to encourage
   pushing mappings down below the left-hand side of the Correlate.

4) Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule
   to handle pulling Filters up above the Correlate. New tests
   testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify
   this behavior.

5) Require a context feature flag for SQL UNNEST, since it's undocumented.
   As part of this, also cleaned up how we handle feature flags in SQL.
   They're now hooked into EngineFeatures, which is useful because not
   all engines support all features.
2023-03-10 16:42:08 +05:30
Gian Merlino fe9d0c46d5
Improve memory efficiency of WrappedRoaringBitmap. (#13889)
* Improve memory efficiency of WrappedRoaringBitmap.

Two changes:

1) Use an int[] for sizes 4 or below.
2) Remove the boolean compressRunOnSerialization. Doesn't save much
   space, but it does save a little, and it isn't adding a ton of value
   to have it be configurable. It was originally configurable in case
   anything broke when enabling it, but it's been a while and nothing
   has broken.

* Slight adjustment.

* Adjust for inspection.

* Updates.

* Update snaps.

* Update test.

* Adjust test.

* Fix snaps.
2023-03-09 15:48:02 -08:00
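The first change in the WrappedRoaringBitmap commit above (#13889), using an `int[]` for sizes 4 or below, is an instance of a small-collection optimization. The Python sketch below shows the general idea only: values live in a tiny list until a fifth distinct value arrives, at which point they move to a heavier structure. `SmallIntSet` is a hypothetical class, and a Python `set` stands in for the real bitmap.

```python
class SmallIntSet:
    """Keeps up to SMALL_LIMIT distinct ints in a list, then upgrades."""

    SMALL_LIMIT = 4

    def __init__(self):
        self._small = []    # used while the set is tiny
        self._large = None  # stand-in for the full bitmap

    def add(self, value: int) -> None:
        if self._large is not None:
            self._large.add(value)
            return
        if value in self._small:
            return  # already present, nothing to store
        if len(self._small) < self.SMALL_LIMIT:
            self._small.append(value)
        else:
            # Fifth distinct value: move everything to the large structure.
            self._large = set(self._small)
            self._large.add(value)
            self._small = []

    def uses_small_storage(self) -> bool:
        return self._large is None

    def __contains__(self, value: int) -> bool:
        if self._large is not None:
            return value in self._large
        return value in self._small

    def __len__(self) -> int:
        if self._large is not None:
            return len(self._large)
        return len(self._small)
```

The benefit comes from workloads where most sets stay tiny, so the cost of the larger structure is paid only by the few that grow.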
Gian Merlino 82f7a56475
Sort-merge join and hash shuffles for MSQ. (#13506)
* Sort-merge join and hash shuffles for MSQ.

The main changes are in the processing, multi-stage-query, and sql modules.

processing module:

1) Rename SortColumn to KeyColumn, replace boolean descending with KeyOrder.
   This makes it nicer to model hash keys, which use KeyOrder.NONE.

2) Add nullability checkers to the FieldReader interface, and an
   "isPartiallyNullKey" method to FrameComparisonWidget. The join
   processor uses this to detect null keys.

3) Add WritableFrameChannel.isClosed and OutputChannel.isReadableChannelReady
   so callers can tell which OutputChannels are ready for reading and which
   aren't.

4) Specialize FrameProcessors.makeCursor to return FrameCursor, a random-access
   implementation. The join processor uses this to rewind when it needs to
   replay a set of rows with a particular key.

5) Add MemoryAllocatorFactory, which is embedded inside FrameWriterFactory
   instead of a particular MemoryAllocator. This allows FrameWriterFactory
   to be shared in more scenarios.

multi-stage-query module:

1) ShuffleSpec: Add hash-based shuffles. New enum ShuffleKind helps callers
   figure out what kind of shuffle is happening. The change from SortColumn
   to KeyColumn allows ClusterBy to be used for both hash-based and sort-based
   shuffling.

2) WorkerImpl: Add ability to handle hash-based shuffles. Refactor the logic
   to be more readable by moving the work-order-running code to the inner
   class RunWorkOrder, and the shuffle-pipeline-building code to the inner
   class ShufflePipelineBuilder.

3) Add SortMergeJoinFrameProcessor and factory.

4) WorkerMemoryParameters: Adjust logic to reserve space for output frames
   for hash partitioning. (We need one frame per partition.)

sql module:

1) Add sqlJoinAlgorithm context parameter; can be "broadcast" or
   "sortMerge". With native, it must always be "broadcast", or it's a
   validation error. MSQ supports both. Default is "broadcast" in
   both engines.

2) Validate that MSQs do not use broadcast join with RIGHT or FULL join,
   as results are not correct for broadcast join with those types. Allow
   this in native for two reasons: legacy (the docs caution against it,
   but it's always been allowed), and the fact that it actually *does*
   generate correct results in native when the join is processed on the
   Broker. It is much less likely that MSQ will plan in such a way that
   generates correct results.

3) Remove subquery penalty in DruidJoinQueryRel when using sort-merge
   join, because subqueries are always required, so there's no reason
   to penalize them.

4) Move previously-disabled join reordering and manipulation rules to
   FANCY_JOIN_RULES, and enable them when using sort-merge join. Helps
   get to better plans where projections and filters are pushed down.

* Work around compiler problem.

* Updates from static analysis.

* Fix @param tag.

* Fix declared exception.

* Fix spelling.

* Minor adjustments.

* wip

* Merge fixups

* fixes

* Fix CalciteSelectQueryMSQTest

* Empty keys are sortable.

* Address comments from code review. Rename mux -> mix.

* Restore inspection config.

* Restore original doc.

* Reorder imports.

* Adjustments

* Fix.

* Fix imports.

* Adjustments from review.

* Update header.

* Adjust docs.
2023-03-08 14:19:39 -08:00
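The validation rules listed under "sql module" in the commit above (#13506) can be summarized in a short Python sketch: `sqlJoinAlgorithm` is either "broadcast" or "sortMerge", the native engine accepts only "broadcast", and MSQ rejects broadcast joins of type RIGHT or FULL. The function `join_context`, the engine labels "native" and "msq", and the error messages are hypothetical; only the parameter name and its two values come from the commit message.

```python
VALID_ALGORITHMS = ("broadcast", "sortMerge")


def join_context(engine: str, algorithm: str = "broadcast",
                 join_type: str = "INNER") -> dict:
    """Validate a join configuration and return the query context entry."""
    if algorithm not in VALID_ALGORITHMS:
        raise ValueError(f"unknown sqlJoinAlgorithm: {algorithm}")
    if engine == "native" and algorithm != "broadcast":
        # Native supports broadcast joins only.
        raise ValueError("native engine requires sqlJoinAlgorithm=broadcast")
    if (engine == "msq" and algorithm == "broadcast"
            and join_type.upper() in ("RIGHT", "FULL")):
        # Broadcast results are not correct for these join types in MSQ.
        raise ValueError(f"MSQ cannot use broadcast for {join_type} joins")
    return {"sqlJoinAlgorithm": algorithm}
```

The default of "broadcast" matches the commit's statement that it is the default in both engines; native RIGHT and FULL broadcast joins are allowed here because the commit keeps them for legacy reasons.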
Adarsh Sanjeev ef82756176
Add validation for aggregations on __time (#13793)
* Add validation for aggregations on __time
2023-03-07 17:16:36 -08:00
317brian b4b354b658
docs: fix html nits (#13835) 2023-03-02 11:19:32 -08:00
Apoorv Gupta b26f1b4a5d
Update datasources.md: Fix Documentation. (#13865)
Fixed documentation to clarify that a union query can't be run over query datasources.
2023-03-01 20:29:15 +05:30
benkrug 66034dd8bc
Update default for finalize in query-context.md (#13763)
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2023-02-22 12:35:36 -08:00
Paul Rogers 85d36be085
Information schema now uses numeric column types (#13777)
Change to use SQL schemas to allow null numeric columns

* Updated docs
2023-02-17 14:39:31 -08:00
Kashif Faraz f629643c50
Fix value of lookup sync period in docs (#13695)
* Fix lookup docs

* Fix spelling

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-02-01 18:12:00 -08:00
Tijo Thomas 1beef30bb2
Support postaggregation function as in Math.pow() (#13703) (#13704)
Support postaggregation function as in Math.pow()
2023-01-31 22:55:04 +05:30
Vadim Ogievetsky 93dc01b6c5
fix broken table missing new line (#13666) 2023-01-12 15:29:51 -08:00