docs: copyedits for MSQ join algos (#14012)

This commit is contained in:
317brian 2023-05-11 14:21:09 -07:00 committed by GitHub
parent f128b9b666
commit cc37987dff
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 30 additions and 24 deletions

View File

@ -244,16 +244,22 @@ The following table lists the context parameters for the MSQ task engine:
## Joins ## Joins
Joins in multi-stage queries use one of two algorithms, based on the [context parameter](#context-parameters) Joins in multi-stage queries use one of two algorithms based on what you set the [context parameter](#context-parameters) `sqlJoinAlgorithm` to:
`sqlJoinAlgorithm`. This context parameter applies to the entire SQL statement, so it is not possible to mix different
- [`broadcast`](#broadcast) (default)
- [`sortMerge`](#sort-merge).
If you omit this context parameter, the MSQ task engine uses broadcast since it's the default join algorithm. The context parameter applies to the entire SQL statement, so you can't mix different
join algorithms in the same query. join algorithms in the same query.
### Broadcast ### Broadcast
Set `sqlJoinAlgorithm` to `broadcast`.
The default join algorithm for multi-stage queries is a broadcast hash join, which is similar to how The default join algorithm for multi-stage queries is a broadcast hash join, which is similar to how
[joins are executed with native queries](../querying/query-execution.md#join). First, any adjacent joins are flattened [joins are executed with native queries](../querying/query-execution.md#join).
To use broadcast joins, either omit the `sqlJoinAlgorithm` or set it to `broadcast`.
For a broadcast join, any adjacent joins are flattened
into a structure with a "base" input (the bottom-leftmost one) and other leaf inputs (the rest). Next, any subqueries into a structure with a "base" input (the bottom-leftmost one) and other leaf inputs (the rest). Next, any subqueries
that are inputs the join (either base or other leafs) are planned into independent stages. Then, the non-base leaf that are inputs the join (either base or other leafs) are planned into independent stages. Then, the non-base leaf
inputs are all connected as broadcast inputs to the "base" stage. inputs are all connected as broadcast inputs to the "base" stage.
@ -266,13 +272,14 @@ Only LEFT JOIN, INNER JOIN, and CROSS JOIN are supported with with `broadcast`.
Join conditions, if present, must be equalities. It is not necessary to include a join condition; for example, Join conditions, if present, must be equalities. It is not necessary to include a join condition; for example,
`CROSS JOIN` and comma join do not require join conditions. `CROSS JOIN` and comma join do not require join conditions.
As an example, the following statement has a single join chain where `orders` is the base input, and `products` and The following example has a single join chain where `orders` is the base input while `products` and
`customers` are non-base leaf inputs. The query will first read `products` and `customers`, then broadcast both to `customers` are non-base leaf inputs. The broadcast inputs (`products` and `customers`) must fall under the limit on broadcast table footprint, but the base `orders` input
the stage that reads `orders`. That stage loads the broadcast inputs (`products` and `customers`) in memory, and walks
through `orders` row by row. The results are then aggregated and written to the table `orders_enriched`. The broadcast
inputs (`products` and `customers`) must fall under the limit on broadcast table footprint, but the base `orders` input
can be unlimited in size. can be unlimited in size.
The query reads `products` and `customers` and then broadcasts both to
the stage that reads `orders`. That stage loads the broadcast inputs (`products` and `customers`) in memory and walks
through `orders` row by row. The results are aggregated and written to the table `orders_enriched`.
``` ```
REPLACE INTO orders_enriched REPLACE INTO orders_enriched
OVERWRITE ALL OVERWRITE ALL
@ -291,26 +298,23 @@ CLUSTERED BY product_name
### Sort-merge ### Sort-merge
Set `sqlJoinAlgorithm` to `sortMerge`. You can use the sort-merge join algorithm to make queries more scalable at the cost of performance. If your goal is performance, consider [broadcast joins](#broadcast). There are various scenarios where broadcast join would return a [`BroadcastTablesTooLarge`](#error-codes) error, but a sort-merge join would succeed.
Multi-stage queries can use a sort-merge join algorithm. With this algorithm, each pairwise join is planned into its own To use the sort-merge join algorithm, set the context parameter `sqlJoinAlgorithm` to `sortMerge`.
stage with two inputs. The two inputs are partitioned and sorted using a hash partitioning on the same key. This
approach is generally less performant, but more scalable, than `broadcast`. There are various scenarios where broadcast
join would return a [`BroadcastTablesTooLarge`](#errors) error, but a sort-merge join would succeed.
There is no limit on the overall size of either input, so sort-merge is a good choice for performing a join of two large In a sort-merge join, each pairwise join is planned into its own stage with two inputs. The two inputs are partitioned and sorted using a hash partitioning on the same key.
inputs, or for performing a self-join of a large input with itself.
There is a limit on the amount of data associated with each individual key. If _both_ sides of the join exceed this When using the sort-merge algorithm, keep the following in mind:
limit, the query returns a [`TooManyRowsWithSameKey`](#errors) error. If only one side exceeds the limit, the query
does not return this error.
Join conditions, if present, must be equalities. It is not necessary to include a join condition; for example, - There is no limit on the overall size of either input, so sort-merge is a good choice for performing a join of two large inputs or for performing a self-join of a large input with itself.
`CROSS JOIN` and comma join do not require join conditions.
All join types are supported with `sortMerge`: LEFT, RIGHT, INNER, FULL, and CROSS. - There is a limit on the amount of data associated with each individual key. If _both_ sides of the join exceed this limit, the query returns a [`TooManyRowsWithSameKey`](#error-codes) error. If only one side exceeds the limit, the query does not return this error.
As an example, the following statement runs using a single sort-merge join stage that receives `eventstream` - Join conditions are optional but must be equalities if they are present. For example, `CROSS JOIN` and comma join do not require join conditions.
- All join types are supported with `sortMerge`: LEFT, RIGHT, INNER, FULL, and CROSS.
The following example runs using a single sort-merge join stage that receives `eventstream`
(partitioned on `user_id`) and `users` (partitioned on `id`) as inputs. There is no limit on the size of either input. (partitioned on `user_id`) and `users` (partitioned on `id`) as inputs. There is no limit on the size of either input.
``` ```
@ -328,6 +332,8 @@ PARTITIONED BY HOUR
CLUSTERED BY user CLUSTERED BY user
``` ```
The context parameter that sets `sqlJoinAlgorithm` to `sortMerge` is not shown in the above example.
## Durable Storage ## Durable Storage
Using durable storage with your SQL-based ingestions can improve their reliability by writing intermediate files to a storage location temporarily. Using durable storage with your SQL-based ingestions can improve their reliability by writing intermediate files to a storage location temporarily.