change arrayIngestMode default to array (#16789)

* change arrayIngestMode default to array

* remove arrayIngestMode flag option none

* fix space

* fix test
Clint Wylie 2024-07-25 00:09:40 -07:00 committed by GitHub
parent 7e3fab5bf9
commit 5da69a01cb
7 changed files with 61 additions and 119 deletions


@@ -71,46 +71,10 @@ The following shows an example `dimensionsSpec` for native ingestion of the data
### SQL-based ingestion
#### `arrayIngestMode`
Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md) when you include the query context
parameter `arrayIngestMode: array`.
When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
tables.
When `arrayIngestMode` is `mvd`, SQL `VARCHAR ARRAY` values are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
is the default behavior when `arrayIngestMode` is not provided in your query context, although the default behavior
may change to `array` in a future release.
When `arrayIngestMode` is `none`, Druid throws an exception when trying to store any type of arrays. This mode is most
useful when set in the system default query context with `druid.query.default.context.arrayIngestMode = none`, in cases
where the cluster administrator wants SQL query authors to explicitly provide one or the other in their query context.
The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
`arrayIngestMode: mvd`.
| SQL type | Stored type when `arrayIngestMode: array` | Stored type when `arrayIngestMode: mvd` (default) |
|---|---|---|
|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|
In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
[multi-value strings](multi-value-dimensions.md).
When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
mixing arrays and multi-value strings in the same column.
Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md).
#### Examples
Set [`arrayIngestMode: array`](#arrayingestmode) in your query context to run the following examples.
```sql
REPLACE INTO "array_example" OVERWRITE ALL
WITH "ext" AS (
@@ -169,6 +133,35 @@ GROUP BY 1,2,3,4,5
PARTITIONED BY DAY
```
#### `arrayIngestMode`
For backwards-compatible behavior with Druid versions older than 31, use the `arrayIngestMode` query context flag.
When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
tables and is the default for Druid 31 and newer.
When `arrayIngestMode` is `mvd` (legacy), SQL `VARCHAR ARRAY` values are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
mode is not recommended; it is provided only for backwards compatibility and will be removed in a future release.
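For example, a minimal sketch of legacy-mode ingestion (the `events` and `events_legacy` datasources and the `tags` column are assumed names for illustration, not part of this change):
```sql
-- Run with the query context {"arrayIngestMode": "mvd"}.
-- Assumes "events" has a VARCHAR ARRAY column "tags"; under mvd mode it is
-- implicitly wrapped in ARRAY_TO_MV and stored as a multi-value STRING column
-- instead of ARRAY<STRING>.
INSERT INTO "events_legacy"
SELECT
  "__time",
  "tags"
FROM "events"
PARTITIONED BY DAY
```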
The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
`arrayIngestMode: mvd`.
| SQL type | Stored type when `arrayIngestMode: array` (default) | Stored type when `arrayIngestMode: mvd` |
|---|---|---|
|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|
In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
[multi-value strings](multi-value-dimensions.md).
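As a minimal sketch (the `array_source` and `mvd_example` datasources and the `tags` column are assumed names), the following stores a string array as a multi-value `STRING` column even under the default `arrayIngestMode: array`:
```sql
-- Assumes "array_source" contains a VARCHAR ARRAY column "tags".
-- ARRAY_TO_MV forces multi-value STRING storage regardless of arrayIngestMode.
REPLACE INTO "mvd_example" OVERWRITE ALL
SELECT
  "__time",
  ARRAY_TO_MV("tags") AS "tags"
FROM "array_source"
PARTITIONED BY DAY
```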
When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
mixing arrays and multi-value strings in the same column.
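As an illustrative sketch only (the `mixed_example` and `legacy_source` datasources and the `tags` column are assumed names), a statement that would otherwise fail this validation can name the affected column under `skipTypeVerification`:
```sql
-- Run with the query context {"skipTypeVerification": "tags"} if "mixed_example"
-- already stores "tags" as a multi-value STRING while this query writes ARRAY<STRING>.
INSERT INTO "mixed_example"
SELECT
  "__time",
  MV_TO_ARRAY("tags") AS "tags"
FROM "legacy_source"
PARTITIONED BY DAY
```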
## Querying arrays
@@ -284,9 +277,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio
Use care during ingestion to ensure you get the type you want.
To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers.
To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.
To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings.
To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions. Multi-value dimensions can only contain strings.
You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:
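A minimal sketch of such a query (the `array_example` table name is an assumed placeholder):
```sql
-- Multi-value STRING columns report VARCHAR in DATA_TYPE; true array columns
-- report an ARRAY type.
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'array_example'
```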


@@ -507,9 +507,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio
Use care during ingestion to ensure you get the type you want.
To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter [`"arrayIngestMode": "array"`](arrays.md#arrayingestmode). Arrays may contain strings or numbers.
To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.
To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any [`arrayIngestMode`](arrays.md#arrayingestmode). Multi-value dimensions can only contain strings.
To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion). Multi-value dimensions can only contain strings.
You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:


@@ -25,11 +25,6 @@ package org.apache.druid.msq.util;
*/
public enum ArrayIngestMode
{
/**
* Disables the ingestion of arrays via MSQ's INSERT queries.
*/
NONE,
/**
* String arrays are ingested as MVDs. This is to preserve the legacy behaviour of Druid and will be removed in the
* future, since MVDs are not true array types and the behaviour is incorrect.


@@ -131,19 +131,9 @@ public class DimensionSchemaUtils
} else if (queryType.getType() == ValueType.ARRAY) {
ValueType elementType = queryType.getElementType().getType();
if (elementType == ValueType.STRING) {
if (arrayIngestMode == ArrayIngestMode.NONE) {
throw InvalidInput.exception(
"String arrays can not be ingested when '%s' is set to '%s'. Set '%s' in query context "
+ "to 'array' to ingest the string array as an array, or ingest it as an MVD by explicitly casting the "
+ "array to an MVD with the ARRAY_TO_MV function.",
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE,
StringUtils.toLowerCase(arrayIngestMode.name()),
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE
);
} else if (arrayIngestMode == ArrayIngestMode.MVD) {
if (arrayIngestMode == ArrayIngestMode.MVD) {
return ColumnType.STRING;
} else {
assert arrayIngestMode == ArrayIngestMode.ARRAY;
return queryType;
}
} else if (elementType.isNumeric()) {


@@ -165,7 +165,7 @@ public class MultiStageQueryContext
public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;
public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.MVD;
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.ARRAY;
public static final String NEXT_WINDOW_SHUFFLE_COL = "__windowShuffleCol";


@@ -122,30 +122,7 @@ public class MSQArraysTest extends MSQTestBase
}
/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* string arrays
*/
@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
public void testInsertStringArrayWithArrayIngestModeNone(String contextName, Map<String, Object> context)
{
final Map<String, Object> adjustedContext = new HashMap<>(context);
adjustedContext.put(MultiStageQueryContext.CTX_ARRAY_INGEST_MODE, "none");
testIngestQuery().setSql(
"INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
.setQueryContext(adjustedContext)
.setExpectedExecutionErrorMatcher(CoreMatchers.allOf(
CoreMatchers.instanceOf(ISE.class),
ThrowableMessageMatcher.hasMessage(CoreMatchers.containsString(
"String arrays can not be ingested when 'arrayIngestMode' is set to 'none'"))
))
.verifyExecutionError();
}
/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
@@ -172,7 +149,7 @@ public class MSQArraysTest extends MSQTestBase
}
/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
@@ -200,7 +177,7 @@ public class MSQArraysTest extends MSQTestBase
}
/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
@@ -228,7 +205,7 @@ public class MSQArraysTest extends MSQTestBase
}
/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
@@ -277,7 +254,7 @@ public class MSQArraysTest extends MSQTestBase
}
/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
* Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
* string arrays
*/
@MethodSource("data")
@@ -316,8 +293,7 @@ public class MSQArraysTest extends MSQTestBase
}
/**
* Tests the behaviour of INSERT query when arrayIngestMode is set to mvd (default) and the only array type to be
* ingested is string array
* Tests the behaviour of INSERT query when arrayIngestMode is set to array (default)
*/
@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
@@ -325,16 +301,32 @@
{
RowSignature rowSignature = RowSignature.builder()
.add("__time", ColumnType.LONG)
.add("dim3", ColumnType.STRING)
.add("dim3", ColumnType.STRING_ARRAY)
.build();
List<Object[]> expectedRows = new ArrayList<>(
ImmutableList.of(
new Object[]{0L, null},
new Object[]{0L, new Object[]{"a", "b"}}
)
);
if (!useDefault) {
expectedRows.add(new Object[]{0L, new Object[]{""}});
}
expectedRows.addAll(
ImmutableList.of(
new Object[]{0L, new Object[]{"b", "c"}},
new Object[]{0L, new Object[]{"d"}}
)
);
testIngestQuery().setSql(
"INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
.setExpectedDataSource("foo1")
.setExpectedRowSignature(rowSignature)
.setQueryContext(context)
.setExpectedSegment(ImmutableSet.of(SegmentId.of("foo1", Intervals.ETERNITY, "test", 0)))
.setExpectedResultRows(expectedMultiValueFooRowsToArray())
.setExpectedResultRows(expectedRows)
.verifyResults();
}
@@ -603,13 +595,6 @@ public class MSQArraysTest extends MSQTestBase
.verifyResults();
}
@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
public void testSelectOnArraysWithArrayIngestModeAsNone(String contextName, Map<String, Object> context)
{
testSelectOnArrays(contextName, context, "none");
}
@MethodSource("data")
@ParameterizedTest(name = "{index}:with context {0}")
public void testSelectOnArraysWithArrayIngestModeAsMVD(String contextName, Map<String, Object> context)
@@ -1128,20 +1113,4 @@ public class MSQArraysTest extends MSQTestBase
.setExpectedResultRows(expectedRows)
.verifyResults();
}
private List<Object[]> expectedMultiValueFooRowsToArray()
{
List<Object[]> expectedRows = new ArrayList<>();
expectedRows.add(new Object[]{0L, null});
if (!useDefault) {
expectedRows.add(new Object[]{0L, ""});
}
expectedRows.addAll(ImmutableList.of(
new Object[]{0L, ImmutableList.of("a", "b")},
new Object[]{0L, ImmutableList.of("b", "c")},
new Object[]{0L, "d"}
));
return expectedRows;
}
}


@@ -221,17 +221,12 @@ public class MultiStageQueryContextTest
@Test
public void arrayIngestMode_unset_returnsDefaultValue()
{
Assert.assertEquals(ArrayIngestMode.MVD, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
Assert.assertEquals(ArrayIngestMode.ARRAY, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
}
@Test
public void arrayIngestMode_set_returnsCorrectValue()
{
Assert.assertEquals(
ArrayIngestMode.NONE,
MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "none")))
);
Assert.assertEquals(
ArrayIngestMode.MVD,
MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "mvd")))