diff --git a/docs/release-info/migr-ansi-sql-null.md b/docs/release-info/migr-ansi-sql-null.md new file mode 100644 index 00000000000..322958f4e8e --- /dev/null +++ b/docs/release-info/migr-ansi-sql-null.md @@ -0,0 +1,386 @@ +--- +id: migr-ansi-sql-null +title: "Migration guide: SQL compliant mode" +sidebar_label: SQL compliant mode +--- + + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +In Apache Druid 28.0.0, the default [null handling](../querying/sql-data-types.md#null-values) mode changed to be compliant with the ANSI SQL standard. +This guide provides strategies to help Druid operators whose applications rely on the legacy Druid null handling behavior transition to SQL compliant mode. +Legacy mode is planned for removal from Druid. + +## SQL compliant null handling + +As of Druid 28.0.0, Druid writes segments in an ANSI SQL compatible null handling mode by default. +This means that Druid stores null values distinctly from empty strings for string dimensions and distinctly from 0 for numeric dimensions. + +This can impact your application behavior because the ANSI SQL standard defines any comparison to null to be unknown. +Under this three-valued logic, a filter such as `x <> 'some value'` matches only rows where `x` is non-null and differs from `'some value'`; rows where `x` is null evaluate to unknown and are excluded. + +The default Druid configurations for 28.0.0 and later that enable ANSI SQL compatible null handling mode are the following: + +* `druid.generic.useDefaultValueForNull=false` +* `druid.expressions.useStrictBooleans=true` +* `druid.generic.useThreeValueLogicForNativeFilters=true` + +Follow the [Null handling tutorial](../tutorials/tutorial-sql-null.md) to learn how the default null handling works in Druid. + +## Legacy null handling and two-value logic + +Prior to Druid 28.0.0, Druid defaulted to a legacy mode that stored default values instead of nulls. +In legacy mode, Druid created segments with the following characteristics at ingestion time: + +- String columns couldn't distinguish an empty string, `''`, from null.
+ Therefore, Druid treated them both as interchangeable values. - Numeric columns couldn't represent null-valued rows. + Therefore, Druid stored 0 instead of null. + +The Druid configurations for the deprecated legacy mode were the following: + +* `druid.generic.useDefaultValueForNull=true` +* `druid.expressions.useStrictBooleans=false` +* `druid.generic.useThreeValueLogicForNativeFilters=false` + +These configurations are deprecated and scheduled for removal. +After the configurations are removed, Druid will ignore them if they exist in your configuration files and use the default SQL compliant mode. + +## Migrate to SQL compliant mode + +If your business logic relies on the behavior of legacy mode, you have the following options to operate Druid in an ANSI SQL compatible null handling mode: + +- Modify incoming data to either [avoid nulls](#replace-null-values-at-ingestion-time) or [avoid empty strings](#coerce-empty-strings-to-null-at-ingestion-time) to achieve the same query behavior as legacy mode. This means modifying your ingestion SQL queries and ingestion specs to handle nulls or empty strings. + For example, replace nulls in string columns with empty strings and nulls in numeric columns with 0. + As a result, your existing queries continue to behave as if Druid were in legacy mode. + If you don't need to preserve null values, this is a good option for you. + +- Preserve null values and [update all of your SQL queries to be ANSI SQL compliant](#rewrite-your-queries-to-be-sql-compliant). + This means you can preserve the incoming data with nulls intact. + However, you must rewrite any affected client-side queries to be ANSI SQL compliant. + If you have a requirement to preserve null values, choose this option. + +### Replace null values at ingestion time + +If you don't need to preserve null values within Druid, you can use a transform at ingestion time to replace nulls with other values.
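If you can't re-ingest the data right away, you can approximate the same replacement at query time while the stored data still contains nulls. The following is only a sketch: `my_table` is a placeholder datasource name, and the replacement values mirror the legacy defaults.

```sql
-- Interim query-time replacement while nulls remain in the data:
-- COALESCE and NVL substitute legacy-style defaults for null values.
SELECT
  COALESCE("string_example", '') AS "string_example",
  NVL("number_example", 0) AS "number_example"
FROM "my_table"
```

After you re-ingest with an ingestion-time transform, these query-time wrappers become unnecessary.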
+ +Consider the following input data: + +```json +{"time":"2024-01-01T00:00:00.000Z","string_example":"my_string","number_example":99} +{"time":"2024-01-02T00:00:00.000Z","string_example":"","number_example":0} +{"time":"2024-01-03T00:00:00.000Z","string_example":null,"number_example":null} +``` + +The following example illustrates how to use COALESCE and NVL at ingestion time to avoid null values in Druid: + + + + + +```sql +REPLACE INTO "no_nulls_example" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"inline","data":"{\"time\":\"2024-01-01T00:00:00.000Z\",\"string_example\":\"my_string\",\"number_example\":99}\n{\"time\":\"2024-01-02T00:00:00.000Z\",\"string_example\":\"\",\"number_example\":0}\n{\"time\":\"2024-01-03T00:00:00.000Z\",\"string_example\":null,\"number_example\":null}"}', + '{"type":"json"}' + ) + ) EXTEND ("time" VARCHAR, "string_example" VARCHAR, "number_example" BIGINT) +) +SELECT + TIME_PARSE("time") AS "__time", + -- Replace any null string values with an empty string + COALESCE("string_example",'') AS string_example, + -- Replace any null numeric values with 0 + NVL("number_example",0) AS number_example +FROM "ext" +PARTITIONED BY MONTH +``` + + + + +```json +{ + "type": "index_parallel", + "spec": { + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "inline", + "data": "{\"time\":\"2024-01-01T00:00:00.000Z\",\"string_example\":\"my_string\",\"number_example\":99}\n{\"time\":\"2024-01-02T00:00:00.000Z\",\"string_example\":\"\",\"number_example\":0}\n{\"time\":\"2024-01-03T00:00:00.000Z\",\"string_example\":null,\"number_example\":null}" + }, + "inputFormat": { + "type": "json" + } + }, + "tuningConfig": { + "type": "index_parallel", + "partitionsSpec": { + "type": "dynamic" + } + }, + "dataSchema": { + "dataSource": "inline_data_native", + "timestampSpec": { + "column": "time", + "format": "iso" + }, + "dimensionsSpec": { + "dimensions": [ + "string_example", + { + "type": "long", + "name": 
"number_example" + } + ] + }, + "granularitySpec": { + "queryGranularity": "none", + "rollup": false, + "segmentGranularity": "MONTH" + }, + "transformSpec": { + "transforms": [ + { + "type": "expression", + "name": "string_example", + "expression": "COALESCE(\"string_example\",'')" + }, + { + "type": "expression", + "name": "number_example", + "expression": "NVL(\"number_example\",0)" + } + ] + } + } + } +} +``` + + + +Druid ingests the data with no null values as follows: + +| `__time` | `string_example` | `number_example`| +| -- | -- | -- | +| `2024-01-01T00:00:00.000Z`| `my_string`| 99 | +| `2024-01-02T00:00:00.000Z`| `empty`| 0 | +| `2024-01-03T00:00:00.000Z`| `empty`| 0 | + +### Coerce empty strings to null at ingestion time + +In legacy mode, Druid recognized empty strings as nulls for equality comparison. +If your queries rely on empty strings to represent nulls, you can coerce empty strings to null at ingestion time using NULLIF. + +For example, consider the following sample input data: + +```json +{"time":"2024-01-01T00:00:00.000Z","string_example":"my_string"} +{"time":"2024-01-02T00:00:00.000Z","string_example":""} +{"time":"2024-01-03T00:00:00.000Z","string_example":null} +``` + +In legacy mode, Druid wrote an empty string for the third record. +Therefore, the following query returned 2: + +```sql +SELECT count(*) +FROM "null_string" +WHERE "string_example" IS NULL +``` + +In SQL compliant mode, Druid differentiates between empty strings and nulls, so the same query would return 1.
+The following example shows how to coerce empty strings into null to accommodate IS NULL comparisons: + + + + + +```sql +REPLACE INTO "null_string" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"inline","data":"{\"time\":\"2024-01-01T00:00:00.000Z\",\"string_example\":\"my_string\"}\n{\"time\":\"2024-01-02T00:00:00.000Z\",\"string_example\":\"\"}\n{\"time\":\"2024-01-03T00:00:00.000Z\",\"string_example\":null}"}', + '{"type":"json"}' + ) + ) EXTEND ("time" VARCHAR, "string_example" VARCHAR) +) +SELECT + TIME_PARSE("time") AS "__time", + NULLIF("string_example",'') AS "string_example" +FROM "ext" +PARTITIONED BY MONTH +``` + + + + + +```json +{ + "type": "index_parallel", + "spec": { + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "inline", + "data": "{\"time\":\"2024-01-01T00:00:00.000Z\",\"string_example\":\"my_string\"}\n{\"time\":\"2024-01-02T00:00:00.000Z\",\"string_example\":\"\"}\n{\"time\":\"2024-01-03T00:00:00.000Z\",\"string_example\":null}" + }, + "inputFormat": { + "type": "json" + } + }, + "tuningConfig": { + "type": "index_parallel", + "partitionsSpec": { + "type": "dynamic" + } + }, + "dataSchema": { + "dataSource": "null_string", + "timestampSpec": { + "column": "time", + "format": "iso" + }, + "transformSpec": { + "transforms": [ + { + "type": "expression", + "expression": "case_searched((\"string_example\" == ''),null,\"string_example\")", + "name": "string_example" + } + ] + }, + "dimensionsSpec": { + "dimensions": [ + "string_example" + ] + }, + "granularitySpec": { + "queryGranularity": "none", + "rollup": false, + "segmentGranularity": "month" + } + } + } +} +``` + + + + +Druid ingests the data with no empty strings as follows: + +| `__time` | `string_example` | +| -- | -- | +| `2024-01-01T00:00:00.000Z`| `my_string`| +| `2024-01-02T00:00:00.000Z`| `null`| +| `2024-01-03T00:00:00.000Z`| `null`| + +Therefore, `SELECT count(*) FROM "null_string" WHERE "string_example" IS NULL` returns 2.
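If re-ingesting the datasource isn't practical, a query can instead normalize empty strings to null inside the filter. The following sketch assumes a `null_string` datasource ingested without the NULLIF transform, so it still contains both empty strings and nulls:

```sql
-- NULLIF maps '' to null, so the IS NULL check matches both
-- empty-string rows and null rows, mirroring the legacy count.
SELECT count(*)
FROM "null_string"
WHERE NULLIF("string_example", '') IS NULL
```

Against the sample data, this matches the rows for January 2 and January 3, giving the same count of 2 as legacy mode.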
+ +### Rewrite your queries to be SQL compliant + +If you want to maintain null values in your data within Druid, you can use the following ANSI SQL compliant querying strategies to achieve the same results as legacy null handling: + +- Modify inequality queries to include null values. + For example, `x <> 'some value'` becomes `(x <> 'some value' OR x IS NULL)`. +- Use COALESCE or NVL to replace nulls with a value. + For example, `x + 1` becomes `NVL(x, 0) + 1`. + +Consider the following Druid datasource `null_example`: + +| `__time` | `string_example` | `number_example`| +| -- | -- | -- | +| `2024-01-01T00:00:00.000Z`| `my_string`| 99 | +| `2024-01-02T00:00:00.000Z`| `empty`| 0 | +| `2024-01-03T00:00:00.000Z`| `null`| null | + +Druid excludes null strings from equality comparisons. For example: + +```sql +SELECT COUNT(*) AS count_example +FROM "null_example" +WHERE "string_example" <> 'my_string' +``` + +Druid returns 1 because null is considered unknown: neither equal nor unequal to the value. + +To count null values in the result, you can use an OR operator: + +```sql +SELECT COUNT(*) AS count_example +FROM "null_example" +WHERE ("string_example" <> 'my_string') OR "string_example" IS NULL +``` + +Druid returns 2. +To achieve the same result, you can use IS DISTINCT FROM for null-safe comparison: + +```sql +SELECT COUNT(*) AS count_example +FROM "null_example" +WHERE "string_example" IS DISTINCT FROM 'my_string' +``` + +Similarly, arithmetic operators on null return null. For example: + +```sql +SELECT "number_example" + 1 AS addition_example +FROM "null_example" +``` + +Druid returns the following because null + any value is null under the ANSI SQL standard: + +| `addition_example`| +| -- | +| 100 | +| 1 | +| null | + +Use NVL to avoid nulls with arithmetic.
For example: + +```sql +SELECT NVL("number_example",0) + 1 AS addition_example +FROM "null_example" +``` + +Druid returns the following: + +| `addition_example` | +| -- | +| 100 | +| 1 | +| 1 | + +## Learn more + +See the following topics for more information: + - [Null handling tutorial](../tutorials/tutorial-sql-null.md) to learn how the default null handling works in Druid. + - [Null values](../querying/sql-data-types.md#null-values) for a description of Druid's behavior with null values. + - [Handling null values](../design/segments.md#handling-null-values) for details about how Druid stores null values. \ No newline at end of file diff --git a/docs/release-info/migration-guide.md b/docs/release-info/migration-guide.md index 92053b83a11..760b691d143 100644 --- a/docs/release-info/migration-guide.md +++ b/docs/release-info/migration-guide.md @@ -30,12 +30,18 @@ The guides in this section outline breaking changes introduced in Druid 25.0.0 a ## Migrate from multi-value dimensions to arrays -Druid now supports SQL-compliant array types. Whenever possible, you should use the array type over multi-value dimensions. See [Migration guide: MVDs to arrays](migr-mvd-array.md). +Druid now supports SQL-compliant array types. Whenever possible, you should use the array type over multi-value dimensions. See [Migration guide: MVDs to arrays](./migr-mvd-array.md). ## Migrate to front-coded dictionary encoding -Druid encodes string columns into dictionaries for better compression. Front-coded dictionary encoding reduces storage and improves performance by optimizing for strings that share similar beginning substrings. See [Migration guide: front-coded dictionaries](migr-front-coded-dict.md) for more information. +Druid encodes string columns into dictionaries for better compression. Front-coded dictionary encoding reduces storage and improves performance by optimizing for strings that share similar beginning substrings.
See [Migration guide: front-coded dictionaries](./migr-front-coded-dict.md) for more information. ## Migrate from `maxSubqueryRows` to `maxSubqueryBytes` -Druid allows you to set a byte-based limit on subquery size to prevent Brokers from running out of memory when handling large subqueries. The byte-based subquery limit overrides Druid's row-based subquery limit. We recommend that you move towards using byte-based limits starting in Druid 30.0.0. See [Migration guide: subquery limit](migr-subquery-limit.md) for more information. +Druid allows you to set a byte-based limit on subquery size to prevent Brokers from running out of memory when handling large subqueries. The byte-based subquery limit overrides Druid's row-based subquery limit. We recommend that you move towards using byte-based limits starting in Druid 30.0.0. See [Migration guide: subquery limit](./migr-subquery-limit.md) for more information. + +## Migrate to SQL compliant null handling mode + +By default, the Druid [null handling](../querying/sql-data-types.md#null-values) mode is now compliant with ANSI SQL. +This guide provides strategies for Druid operators and users who rely on the legacy Druid null handling behavior in their applications to transition to ANSI SQL compliant mode. See [Migration guide: SQL compliant mode](./migr-ansi-sql-null.md) for more information. \ No newline at end of file diff --git a/docs/tutorials/tutorial-sql-null.md b/docs/tutorials/tutorial-sql-null.md index 37cf23a7152..b91e8019bc2 100644 --- a/docs/tutorials/tutorial-sql-null.md +++ b/docs/tutorials/tutorial-sql-null.md @@ -118,7 +118,7 @@ Druid returns the following: |`another_value`|1|1| |`some_value`|1|1| -Also note that GROUP BY expressions yields distinct entries for `null` and the empty string. +Also note that GROUP BY expressions yield distinct entries for `null` and the empty string.
### Filter for empty strings in addition to null diff --git a/website/sidebars.json b/website/sidebars.json index fff44585331..b74efa5c7ab 100644 --- a/website/sidebars.json +++ b/website/sidebars.json @@ -376,6 +376,7 @@ "items": [ "release-info/migr-front-coded-dict", "release-info/migr-subquery-limit", + "release-info/migr-ansi-sql-null", "release-info/migr-mvd-array" ] }