druid/docs/querying
Gian Merlino 82f7a56475
Sort-merge join and hash shuffles for MSQ. (#13506)
* Sort-merge join and hash shuffles for MSQ.

The main changes are in the processing, multi-stage-query, and sql modules.

processing module:

1) Rename SortColumn to KeyColumn, replace boolean descending with KeyOrder.
   This makes it nicer to model hash keys, which use KeyOrder.NONE.

2) Add nullability checkers to the FieldReader interface, and an
   "isPartiallyNullKey" method to FrameComparisonWidget. The join
   processor uses this to detect null keys.

3) Add WritableFrameChannel.isClosed and OutputChannel.isReadableChannelReady
   so callers can tell which OutputChannels are ready for reading and which
   aren't.

4) Specialize FrameProcessors.makeCursor to return FrameCursor, a random-access
   implementation. The join processor uses this to rewind when it needs to
   replay a set of rows with a particular key.

5) Add MemoryAllocatorFactory, which is embedded inside FrameWriterFactory
   instead of a particular MemoryAllocator. This allows FrameWriterFactory
   to be shared in more scenarios.
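
Items 2 and 4 above describe two things the join processor relies on: detecting null keys, and rewinding a random-access cursor to replay the rows that share a key. The following is a minimal sketch of that idea over in-memory lists, not the actual SortMergeJoinFrameProcessor. The class, the `Row` type, and `innerJoin` are hypothetical names; the sketch assumes an inner equi-join on a single integer key, with both inputs sorted ascending and null keys sorted first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Druid.
class SortMergeJoinSketch {
    /** A row with a nullable join key and a payload. */
    static final class Row {
        final Integer key;
        final String value;

        Row(Integer key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Inner equi-join of two inputs sorted ascending by key, nulls first. */
    static List<String> innerJoin(List<Row> left, List<Row> right) {
        List<String> out = new ArrayList<>();
        int l = 0;
        int r = 0;
        while (l < left.size() && r < right.size()) {
            Row lr = left.get(l);
            Row rr = right.get(r);
            // Null keys never match in an inner equi-join, so skip them.
            if (lr.key == null) { l++; continue; }
            if (rr.key == null) { r++; continue; }

            int cmp = Integer.compare(lr.key, rr.key);
            if (cmp < 0) {
                l++;
            } else if (cmp > 0) {
                r++;
            } else {
                int key = lr.key;
                int mark = r; // start of the matching run on the right side
                while (l < left.size() && left.get(l).key != null && left.get(l).key == key) {
                    // Rewind to the mark and replay the right-side run for this left row.
                    r = mark;
                    while (r < right.size() && right.get(r).key != null && right.get(r).key == key) {
                        out.add(left.get(l).value + "|" + right.get(r).value);
                        r++;
                    }
                    l++;
                }
            }
        }
        return out;
    }
}
```

The rewind (`r = mark`) is the step that needs random access: a forward-only cursor could not replay the right-side run for the second and later left rows with the same key.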

multi-stage-query module:

1) ShuffleSpec: Add hash-based shuffles. New enum ShuffleKind helps callers
   figure out what kind of shuffle is happening. The change from SortColumn
   to KeyColumn allows ClusterBy to be used for both hash-based and sort-based
   shuffling.

2) WorkerImpl: Add ability to handle hash-based shuffles. Refactor the logic
   to be more readable by moving the work-order-running code to the inner
   class RunWorkOrder, and the shuffle-pipeline-building code to the inner
   class ShufflePipelineBuilder.

3) Add SortMergeJoinFrameProcessor and factory.

4) WorkerMemoryParameters: Adjust logic to reserve space for output frames
   for hash partitioning. (We need one frame per partition.)
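
To illustrate the hash-based shuffle in items 1 and 4 above, here is a minimal sketch that routes each row to a partition by hashing its key columns. It is not the ShuffleSpec or WorkerImpl code: the class and method names are hypothetical, rows are plain `Object[]` arrays instead of frames, and `Arrays.hashCode` stands in for whatever hash function the real implementation uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Druid.
class HashShuffleSketch {
    /** Maps a key to a partition in [0, numPartitions). floorMod keeps the result non-negative. */
    static int partitionFor(Object[] key, int numPartitions) {
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }

    /** Splits rows into numPartitions output buffers, one buffer per partition. */
    static List<List<Object[]>> shuffle(List<Object[]> rows, int[] keyIndexes, int numPartitions) {
        List<List<Object[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Object[] row : rows) {
            // Extract only the key columns; non-key columns do not affect routing.
            Object[] key = new Object[keyIndexes.length];
            for (int i = 0; i < keyIndexes.length; i++) {
                key[i] = row[keyIndexes[i]];
            }
            partitions.get(partitionFor(key, numPartitions)).add(row);
        }
        return partitions;
    }
}
```

Each partition holds its own output buffer at the same time, which is the in-memory analogue of reserving one output frame per partition. Rows with equal keys always land in the same partition, but the hash imposes no order on them, which is the case the commit message models with KeyOrder.NONE.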

sql module:

1) Add sqlJoinAlgorithm context parameter; can be "broadcast" or
   "sortMerge". With native, it must always be "broadcast", or it's a
   validation error. MSQ supports both. Default is "broadcast" in
   both engines.

2) Validate that MSQ queries do not use broadcast join with RIGHT or FULL
   join, because broadcast join does not produce correct results for those
   join types. Keep allowing this in native for two reasons: legacy (the
   docs caution against it, but it has always been allowed), and the fact
   that native actually *does* generate correct results when the join is
   processed on the Broker. MSQ is much less likely to plan the query in a
   way that generates correct results.

3) Remove subquery penalty in DruidJoinQueryRel when using sort-merge
   join, because subqueries are always required, so there's no reason
   to penalize them.

4) Move previously-disabled join reordering and manipulation rules to
   FANCY_JOIN_RULES, and enable them when using sort-merge join. This helps
   the planner reach better plans where projections and filters are pushed
   down.
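
The validation rules in items 1 and 2 above can be summarized in a few lines. This is a sketch of the rules as stated in this commit message, not the planner code: the class, enums, and `resolve` method are hypothetical, and treating the parameter value as an exact, case-sensitive string match is an assumption.

```java
// Hypothetical illustration only; none of these names come from Druid.
class JoinAlgorithmValidationSketch {
    enum Engine { NATIVE, MSQ }
    enum JoinType { INNER, LEFT, RIGHT, FULL }

    /** Returns the effective join algorithm, or throws if the combination is invalid. */
    static String resolve(Engine engine, String sqlJoinAlgorithm, JoinType joinType) {
        // The default is "broadcast" in both engines.
        String algorithm = sqlJoinAlgorithm == null ? "broadcast" : sqlJoinAlgorithm;

        if (!algorithm.equals("broadcast") && !algorithm.equals("sortMerge")) {
            throw new IllegalArgumentException("Unknown sqlJoinAlgorithm: " + algorithm);
        }
        // Native supports only broadcast joins.
        if (engine == Engine.NATIVE && !algorithm.equals("broadcast")) {
            throw new IllegalArgumentException("Native engine requires sqlJoinAlgorithm = broadcast");
        }
        // MSQ rejects broadcast for RIGHT and FULL joins; native still allows it.
        if (engine == Engine.MSQ
                && algorithm.equals("broadcast")
                && (joinType == JoinType.RIGHT || joinType == JoinType.FULL)) {
            throw new IllegalArgumentException(joinType + " join requires sqlJoinAlgorithm = sortMerge in MSQ");
        }
        return algorithm;
    }
}
```

Under these rules, an MSQ query with a FULL join has to set `sqlJoinAlgorithm` to `sortMerge` in its query context, while the same join in native passes validation with the default.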

* Work around compiler problem.

* Updates from static analysis.

* Fix @param tag.

* Fix declared exception.

* Fix spelling.

* Minor adjustments.

* wip

* Merge fixups

* fixes

* Fix CalciteSelectQueryMSQTest

* Empty keys are sortable.

* Address comments from code review. Rename mux -> mix.

* Restore inspection config.

* Restore original doc.

* Reorder imports.

* Adjustments

* Fix.

* Fix imports.

* Adjustments from review.

* Update header.

* Adjust docs.
2023-03-08 14:19:39 -08:00
aggregations.md stringFirst and stringLast supported in ingestion (#12466) 2022-04-22 10:28:49 +08:00
caching.md docs: fix html nits (#13835) 2023-03-02 11:19:32 -08:00
datasource.md Sort-merge join and hash shuffles for MSQ. (#13506) 2023-03-08 14:19:39 -08:00
datasourcemetadataquery.md Refresh query docs. (#9704) 2020-04-15 16:12:20 -07:00
dimensionspecs.md Various documentation updates. (#13107) 2022-09-16 21:58:11 -07:00
filters.md Adjust "in" filter null behavior to match "selector". (#12863) 2022-08-08 09:08:36 -07:00
granularities.md Document missed simple granularities (#12768) 2022-07-14 14:02:28 +08:00
groupbyquery.md document virtualColumns in native query documentation, fix some redirects (#12917) 2022-08-18 20:49:23 -07:00
having.md Refactor SQL docs (#12239) 2022-02-11 14:43:30 -08:00
hll-old.md De-incubation cleanup in code, docs, packaging (#9108) 2020-01-03 12:33:19 -05:00
joins.md Sort-merge join and hash shuffles for MSQ. (#13506) 2023-03-08 14:19:39 -08:00
limitspec.md Sql docs items (#12530) 2022-05-17 16:56:31 -07:00
lookups.md Fix value of lookup sync period in docs (#13695) 2023-02-01 18:12:00 -08:00
multi-value-dimensions.md Adding new config for disabling group by on multiValue column (#12253) 2022-02-16 20:53:26 +05:30
multitenancy.md Addition to Multitenancy considerations doc (#12567) 2022-06-02 10:32:14 -07:00
nested-columns.md doc: List Protobuf as a supported format (#13640) 2023-01-06 15:09:37 -08:00
post-aggregations.md Support postaggregation function as in Math.pow() (#13703) (#13704) 2023-01-31 22:55:04 +05:30
query-context.md Update default for finalize in query-context.md (#13763) 2023-02-22 12:35:36 -08:00
query-execution.md Refactor SQL docs (#12239) 2022-02-11 14:43:30 -08:00
querying.md Various documentation updates. (#13107) 2022-09-16 21:58:11 -07:00
scan-query.md Refactor SQL docs (#12239) 2022-02-11 14:43:30 -08:00
searchquery.md document virtualColumns in native query documentation, fix some redirects (#12917) 2022-08-18 20:49:23 -07:00
segmentmetadataquery.md Refactor SQL docs (#12239) 2022-02-11 14:43:30 -08:00
select-query.md Add "offset" parameter to the Scan query. (#10233) 2020-08-13 14:56:24 -07:00
sorting-orders.md Refactor SQL docs (#12239) 2022-02-11 14:43:30 -08:00
sql-aggregations.md Add validation for aggregations on __time (#13793) 2023-03-07 17:16:36 -08:00
sql-api.md Sql docs items (#12530) 2022-05-17 16:56:31 -07:00
sql-data-types.md SQL: Improve docs around casts. (#13466) 2022-12-15 15:01:40 -08:00
sql-functions.md basic docs for nested column query functions (#12922) 2022-08-19 17:12:19 -07:00
sql-jdbc.md doc: add a basic JDBC tutorial (#13343) 2022-11-30 16:25:35 -08:00
sql-json-functions.md Various documentation updates. (#13107) 2022-09-16 21:58:11 -07:00
sql-metadata-tables.md Information schema now uses numeric column types (#13777) 2023-02-17 14:39:31 -08:00
sql-multivalue-string-functions.md Sql docs items (#12530) 2022-05-17 16:56:31 -07:00
sql-operators.md Sql docs items (#12530) 2022-05-17 16:56:31 -07:00
sql-query-context.md Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247) 2022-11-03 09:43:00 -07:00
sql-scalar.md Add TIME_IN_INTERVAL SQL operator. (#12662) 2022-06-21 13:05:37 -07:00
sql-translation.md Remove limit from timeseries (#13457) 2022-12-02 12:19:59 -08:00
sql.md IMPLY-12348: Update description of UNION ALL in SQL syntax doc (#12710) 2022-07-05 13:08:01 -07:00
timeboundaryquery.md Refresh query docs. (#9704) 2020-04-15 16:12:20 -07:00
timeseriesquery.md document virtualColumns in native query documentation, fix some redirects (#12917) 2022-08-18 20:49:23 -07:00
topnmetricspec.md Sql docs items (#12530) 2022-05-17 16:56:31 -07:00
topnquery.md document virtualColumns in native query documentation, fix some redirects (#12917) 2022-08-18 20:49:23 -07:00
troubleshooting.md Refactor SQL docs (#12239) 2022-02-11 14:43:30 -08:00
using-caching.md Docs - query caching (#11584) 2022-04-18 17:00:21 +08:00
virtual-columns.md Add missing MSQ error code fields to docs (#13308) 2022-11-10 21:03:04 +05:30