mirror of
https://github.com/apache/druid.git
synced 2025-03-08 10:30:38 +00:00
change arrayIngestMode default to array (#16789)
* change arrayIngestMode default to array
* remove arrayIngestMode flag option none
* fix space
* fix test
parent 7e3fab5bf9
commit 5da69a01cb
@@ -71,46 +71,10 @@ The following shows an example `dimensionsSpec` for native ingestion of the data
 
 ### SQL-based ingestion
 
-#### `arrayIngestMode`
-
-Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md) when you include the query context
-parameter `arrayIngestMode: array`.
-
-When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
-tables.
-
-When `arrayIngestMode` is `mvd`, SQL `VARCHAR ARRAY` are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
-This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
-as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
-is the default behavior when `arrayIngestMode` is not provided in your query context, although the default behavior
-may change to `array` in a future release.
-
-When `arrayIngestMode` is `none`, Druid throws an exception when trying to store any type of arrays. This mode is most
-useful when set in the system default query context with `druid.query.default.context.arrayIngestMode = none`, in cases
-where the cluster administrator wants SQL query authors to explicitly provide one or the other in their query context.
-
-The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
-`arrayIngestMode: mvd`.
-
-| SQL type | Stored type when `arrayIngestMode: array` | Stored type when `arrayIngestMode: mvd` (default) |
-|---|---|---|
-|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
-|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
-|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|
-
-In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
-[multi-value strings](multi-value-dimensions.md).
-
-When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
-to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
-validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
-a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
-mixing arrays and multi-value strings in the same column.
+Arrays can be inserted with [SQL-based ingestion](../multi-stage-query/index.md).
 
 #### Examples
 
-Set [`arrayIngestMode: array`](#arrayingestmode) in your query context to run the following examples.
-
 ```sql
 REPLACE INTO "array_example" OVERWRITE ALL
 WITH "ext" AS (
@@ -169,6 +133,35 @@ GROUP BY 1,2,3,4,5
 PARTITIONED BY DAY
 ```
 
+#### `arrayIngestMode`
+
+For seamless backwards compatible behavior with Druid versions older than 31, there is an `arrayIngestMode` query context flag.
+
+When `arrayIngestMode` is `array`, SQL ARRAY types are stored using Druid array columns. This is recommended for new
+tables and the default configuration for Druid 31 and newer.
+
+When `arrayIngestMode` is `mvd` (legacy), SQL `VARCHAR ARRAY` are implicitly wrapped in [`ARRAY_TO_MV`](sql-functions.md#array_to_mv).
+This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
+as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
+mode is not recommended and will be removed in a future release, but provided for backwards compatibility.
+
+The following table summarizes the differences in SQL ARRAY handling between `arrayIngestMode: array` and
+`arrayIngestMode: mvd`.
+
+| SQL type | Stored type when `arrayIngestMode: array` (default) | Stored type when `arrayIngestMode: mvd` |
+|---|---|---|
+|`VARCHAR ARRAY`|`ARRAY<STRING>`|[multi-value `STRING`](multi-value-dimensions.md)|
+|`BIGINT ARRAY`|`ARRAY<LONG>`|not possible (validation error)|
+|`DOUBLE ARRAY`|`ARRAY<DOUBLE>`|not possible (validation error)|
+
+In either mode, you can explicitly wrap string arrays in `ARRAY_TO_MV` to cause them to be stored as
+[multi-value strings](multi-value-dimensions.md).
+
+When validating a SQL INSERT or REPLACE statement that contains arrays, Druid checks whether the statement would lead
+to mixing string arrays and multi-value strings in the same column. If this condition is detected, the statement fails
+validation unless the column is named under the `skipTypeVerification` context parameter. This parameter can be either
+a comma-separated list of column names, or a JSON array in string form. This validation is done to prevent accidentally
+mixing arrays and multi-value strings in the same column.
 
 ## Querying arrays
 
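As a reading aid for the `arrayIngestMode` table in the documentation hunk above, the mapping from SQL array type to stored type can be sketched in Java. This is only an illustration under stated assumptions: `ArrayStorageSketch`, `Mode`, and `storedType` are hypothetical names invented for this note, the stored types are plain strings copied from the table, and none of it is Druid's actual implementation.

```java
import java.util.Locale;

// Hypothetical sketch of the documented type mapping; not Druid code.
final class ArrayStorageSketch {

  // The two modes that remain after this commit ("none" is removed).
  enum Mode { ARRAY, MVD }

  /**
   * Returns the stored type named in the documentation table for a SQL array
   * whose element type is VARCHAR, BIGINT, or DOUBLE.
   */
  static String storedType(String sqlElementType, Mode mode) {
    String element = sqlElementType.toUpperCase(Locale.ROOT);
    switch (element) {
      case "VARCHAR":
        // String arrays are the only arrays that mvd mode can store.
        return mode == Mode.ARRAY ? "ARRAY<STRING>" : "multi-value STRING";
      case "BIGINT":
      case "DOUBLE":
        if (mode == Mode.MVD) {
          // The table lists this combination as a validation error.
          throw new IllegalArgumentException(
              element + " ARRAY cannot be stored when arrayIngestMode is mvd");
        }
        return element.equals("BIGINT") ? "ARRAY<LONG>" : "ARRAY<DOUBLE>";
      default:
        throw new IllegalArgumentException("Unsupported element type: " + sqlElementType);
    }
  }

  public static void main(String[] args) {
    System.out.println(storedType("VARCHAR", Mode.ARRAY));
    System.out.println(storedType("VARCHAR", Mode.MVD));
    System.out.println(storedType("BIGINT", Mode.ARRAY));
  }
}
```

The numeric cases throw rather than fall back to strings because the table describes them as validation errors in `mvd` mode, not as silent conversions.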
@@ -284,9 +277,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio
 
 Use care during ingestion to ensure you get the type you want.
 
-To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter `"arrayIngestMode": "array"`. Arrays may contain strings or numbers.
+To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.
 
-To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings.
+To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions. Multi-value dimensions can only contain strings.
 
 You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:
 
@@ -507,9 +507,9 @@ Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensio
 
 Use care during ingestion to ensure you get the type you want.
 
-To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays and set the context parameter [`"arrayIngestMode": "array"`](arrays.md#arrayingestmode). Arrays may contain strings or numbers.
+To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.
 
-To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion), which ensures you get multi-value dimensions in any [`arrayIngestMode`](arrays.md#arrayingestmode). Multi-value dimensions can only contain strings.
+To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion). Multi-value dimensions can only contain strings.
 
 You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:
 
@@ -25,11 +25,6 @@ package org.apache.druid.msq.util;
  */
 public enum ArrayIngestMode
 {
-  /**
-   * Disables the ingestion of arrays via MSQ's INSERT queries.
-   */
-  NONE,
-
   /**
    * String arrays are ingested as MVDs. This is to preserve the legacy behaviour of Druid and will be removed in the
    * future, since MVDs are not true array types and the behaviour is incorrect.
@@ -131,19 +131,9 @@ public class DimensionSchemaUtils
     } else if (queryType.getType() == ValueType.ARRAY) {
       ValueType elementType = queryType.getElementType().getType();
       if (elementType == ValueType.STRING) {
-        if (arrayIngestMode == ArrayIngestMode.NONE) {
-          throw InvalidInput.exception(
-              "String arrays can not be ingested when '%s' is set to '%s'. Set '%s' in query context "
-              + "to 'array' to ingest the string array as an array, or ingest it as an MVD by explicitly casting the "
-              + "array to an MVD with the ARRAY_TO_MV function.",
-              MultiStageQueryContext.CTX_ARRAY_INGEST_MODE,
-              StringUtils.toLowerCase(arrayIngestMode.name()),
-              MultiStageQueryContext.CTX_ARRAY_INGEST_MODE
-          );
-        } else if (arrayIngestMode == ArrayIngestMode.MVD) {
+        if (arrayIngestMode == ArrayIngestMode.MVD) {
           return ColumnType.STRING;
         } else {
-          assert arrayIngestMode == ArrayIngestMode.ARRAY;
           return queryType;
         }
       } else if (elementType.isNumeric()) {
@@ -165,7 +165,7 @@ public class MultiStageQueryContext
   public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;
 
   public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
-  public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.MVD;
+  public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.ARRAY;
 
   public static final String NEXT_WINDOW_SHUFFLE_COL = "__windowShuffleCol";
 
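To make the effect of the default change in the hunk above concrete, here is a minimal, self-contained Java sketch of resolving `arrayIngestMode` from a query context map. It is an assumption-laden illustration, not Druid's `MultiStageQueryContext`: the class `ArrayIngestModeResolverSketch`, its `Mode` enum, and the `resolve` method are hypothetical, and the way Druid itself reports an unknown value such as `none` is not shown in this diff.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and error handling are assumptions, not Druid code.
final class ArrayIngestModeResolverSketch {

  // Only two values remain once NONE is removed from the enum.
  enum Mode { ARRAY, MVD }

  static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
  static final Mode DEFAULT_ARRAY_INGEST_MODE = Mode.ARRAY;

  /** Reads the mode from the context, falling back to the default when unset. */
  static Mode resolve(Map<String, Object> context) {
    Object raw = context.get(CTX_ARRAY_INGEST_MODE);
    if (raw == null) {
      return DEFAULT_ARRAY_INGEST_MODE;
    }
    try {
      return Mode.valueOf(raw.toString().toUpperCase(Locale.ROOT));
    }
    catch (IllegalArgumentException e) {
      // In this sketch, a removed value such as "none" is treated as unknown.
      throw new IllegalArgumentException("Unknown arrayIngestMode: " + raw, e);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> context = new HashMap<>();
    System.out.println(resolve(context)); // unset, so the default applies
    context.put(CTX_ARRAY_INGEST_MODE, "mvd");
    System.out.println(resolve(context)); // explicit legacy mode
  }
}
```

Queries that relied on the old implicit `mvd` behaviour would, under this reading, need to set `arrayIngestMode` to `mvd` explicitly or wrap string arrays in `ARRAY_TO_MV`.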
@@ -122,30 +122,7 @@ public class MSQArraysTest extends MSQTestBase
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
-   * string arrays
-   */
-  @MethodSource("data")
-  @ParameterizedTest(name = "{index}:with context {0}")
-  public void testInsertStringArrayWithArrayIngestModeNone(String contextName, Map<String, Object> context)
-  {
-
-    final Map<String, Object> adjustedContext = new HashMap<>(context);
-    adjustedContext.put(MultiStageQueryContext.CTX_ARRAY_INGEST_MODE, "none");
-
-    testIngestQuery().setSql(
-                         "INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
-                     .setQueryContext(adjustedContext)
-                     .setExpectedExecutionErrorMatcher(CoreMatchers.allOf(
-                         CoreMatchers.instanceOf(ISE.class),
-                         ThrowableMessageMatcher.hasMessage(CoreMatchers.containsString(
-                             "String arrays can not be ingested when 'arrayIngestMode' is set to 'none'"))
-                     ))
-                     .verifyExecutionError();
-  }
-
-  /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -172,7 +149,7 @@ public class MSQArraysTest extends MSQTestBase
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -200,7 +177,7 @@ public class MSQArraysTest extends MSQTestBase
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -228,7 +205,7 @@ public class MSQArraysTest extends MSQTestBase
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -277,7 +254,7 @@ public class MSQArraysTest extends MSQTestBase
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to none (default) and the user tries to ingest
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to default and the user tries to ingest
    * string arrays
    */
   @MethodSource("data")
@@ -316,8 +293,7 @@ public class MSQArraysTest extends MSQTestBase
   }
 
   /**
-   * Tests the behaviour of INSERT query when arrayIngestMode is set to mvd (default) and the only array type to be
-   * ingested is string array
+   * Tests the behaviour of INSERT query when arrayIngestMode is set to array (default)
    */
   @MethodSource("data")
   @ParameterizedTest(name = "{index}:with context {0}")
@@ -325,16 +301,32 @@ public class MSQArraysTest extends MSQTestBase
   {
     RowSignature rowSignature = RowSignature.builder()
                                             .add("__time", ColumnType.LONG)
-                                            .add("dim3", ColumnType.STRING)
+                                            .add("dim3", ColumnType.STRING_ARRAY)
                                             .build();
 
+    List<Object[]> expectedRows = new ArrayList<>(
+        ImmutableList.of(
+            new Object[]{0L, null},
+            new Object[]{0L, new Object[]{"a", "b"}}
+        )
+    );
+    if (!useDefault) {
+      expectedRows.add(new Object[]{0L, new Object[]{""}});
+    }
+    expectedRows.addAll(
+        ImmutableList.of(
+            new Object[]{0L, new Object[]{"b", "c"}},
+            new Object[]{0L, new Object[]{"d"}}
+        )
+    );
+
     testIngestQuery().setSql(
                          "INSERT INTO foo1 SELECT MV_TO_ARRAY(dim3) AS dim3 FROM foo GROUP BY 1 PARTITIONED BY ALL TIME")
                      .setExpectedDataSource("foo1")
                      .setExpectedRowSignature(rowSignature)
                      .setQueryContext(context)
                      .setExpectedSegment(ImmutableSet.of(SegmentId.of("foo1", Intervals.ETERNITY, "test", 0)))
-                     .setExpectedResultRows(expectedMultiValueFooRowsToArray())
+                     .setExpectedResultRows(expectedRows)
                      .verifyResults();
   }
 
@@ -603,13 +595,6 @@ public class MSQArraysTest extends MSQTestBase
                      .verifyResults();
   }
 
-  @MethodSource("data")
-  @ParameterizedTest(name = "{index}:with context {0}")
-  public void testSelectOnArraysWithArrayIngestModeAsNone(String contextName, Map<String, Object> context)
-  {
-    testSelectOnArrays(contextName, context, "none");
-  }
-
   @MethodSource("data")
   @ParameterizedTest(name = "{index}:with context {0}")
   public void testSelectOnArraysWithArrayIngestModeAsMVD(String contextName, Map<String, Object> context)
@@ -1128,20 +1113,4 @@ public class MSQArraysTest extends MSQTestBase
                      .setExpectedResultRows(expectedRows)
                      .verifyResults();
   }
-
-  private List<Object[]> expectedMultiValueFooRowsToArray()
-  {
-    List<Object[]> expectedRows = new ArrayList<>();
-    expectedRows.add(new Object[]{0L, null});
-    if (!useDefault) {
-      expectedRows.add(new Object[]{0L, ""});
-    }
-
-    expectedRows.addAll(ImmutableList.of(
-        new Object[]{0L, ImmutableList.of("a", "b")},
-        new Object[]{0L, ImmutableList.of("b", "c")},
-        new Object[]{0L, "d"}
-    ));
-    return expectedRows;
-  }
 }
@@ -221,17 +221,12 @@ public class MultiStageQueryContextTest
   @Test
   public void arrayIngestMode_unset_returnsDefaultValue()
   {
-    Assert.assertEquals(ArrayIngestMode.MVD, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
+    Assert.assertEquals(ArrayIngestMode.ARRAY, MultiStageQueryContext.getArrayIngestMode(QueryContext.empty()));
   }
 
   @Test
   public void arrayIngestMode_set_returnsCorrectValue()
   {
-    Assert.assertEquals(
-        ArrayIngestMode.NONE,
-        MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "none")))
-    );
-
     Assert.assertEquals(
         ArrayIngestMode.MVD,
         MultiStageQueryContext.getArrayIngestMode(QueryContext.of(ImmutableMap.of(CTX_ARRAY_INGEST_MODE, "mvd")))