Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (#14848)

2023-08-23 07:50:19 +02:00 · 2023-08-23 07:50:19 +02:00 · e806d09309
parent 7b5012ea6e
commit e806d09309
5 changed files with 54 additions and 40 deletions
--- a/docs/querying/sql-aggregations.md
+++ b/docs/querying/sql-aggregations.md
@ -86,16 +86,11 @@ In the aggregation functions supported by Druid, only `COUNT`, `ARRAY_AGG`, and
 |`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column. If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`, which must be numeric. The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`. If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`| Like `EARLIEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. The `expr` must come from a relation with a timestamp column (like `__time` in a Druid datasource) and the "latest" is taken from the row with the overall latest non-null value of the timestamp column. If the latest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. |`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=false` (legacy mode)|
-|`LATEST_BY(expr, timestampExpr)`|Returns the latest value of `expr`, which must be numeric. The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`. If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)|`null` or `0` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|`null` or `''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`EARLIEST(expr, [maxBytesPerValue])`|Returns the earliest value of `expr`.<br />If `expr` comes from a relation with a timestamp column (like `__time` in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column.<br />If the earliest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.<br /><br />If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The  `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the earliest value of `expr`.<br />The earliest value of `expr` is taken from the row with the overall earliest non-null value of `timestampExpr`. <br />If the earliest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.<br /><br />If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The  `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`LATEST(expr, [maxBytesPerValue])`|Returns the latest value of `expr`<br />The `expr` must come from a relation with a timestamp column (like `__time` in a Druid datasource) and the "latest" is taken from the row with the overall latest non-null value of the timestamp column.<br />If the latest non-null value of the timestamp column appears in multiple rows, the `expr` may be taken from any of those rows.<br /><br />If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The  `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`LATEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the latest value of `expr`.<br />The latest value of `expr` is taken from the row with the overall latest non-null value of `timestampExpr`.<br />If the overall latest non-null value of `timestampExpr` appears in multiple rows, the `expr` may be taken from any of those rows.<br /><br />If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`ANY_VALUE(expr, [maxBytesPerValue])`|Returns any value of `expr` including null. This aggregator can simplify and optimize the performance by returning the first encountered value (including `null`).<br /><br />If `expr` is a string or complex type `maxBytesPerValue` amount of space is allocated for the aggregation. Strings longer than this limit are truncated. The `maxBytesPerValue` parameter should be set as low as possible, since high values will lead to wasted memory.<br/>If `maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if `druid.generic.useDefaultValueForNull=true` (legacy mode)|
 |`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#grouping-aggregator) on how to infer this number.|N/A|
 |`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes). If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results within the output array may vary depending on processing order.|`null`|
 |`ARRAY_AGG(DISTINCT expr, [size])`|Collects all distinct values of `expr` into an ARRAY, including null values, with `size` in bytes limit on aggregation size (default of 1024 bytes) per aggregate. If the aggregated array grows larger than the maximum size in bytes, the query will fail. Use of `ORDER BY` within the `ARRAY_AGG` expression is not currently supported, and the ordering of results will be based on the default for the element type.|`null`|
--- a/docs/querying/sql-functions.md
+++ b/docs/querying/sql-functions.md
@ -50,11 +50,7 @@ Calculates the arc cosine of a numeric expression.

 ## ANY_VALUE

-`ANY_VALUE(<NUMERIC>)`
-
-`ANY_VALUE(<BOOLEAN>)`
-
-`ANY_VALUE(<CHARACTER>, <NUMERIC>)`
+`ANY_VALUE(expr, [maxBytesPerValue])`

 **Function type:** [Aggregation](sql-aggregations.md)

@ -641,9 +637,7 @@ Returns a union of Tuple sketches which each contain an array of double values a

 ## EARLIEST

-`EARLIEST(expr)`
-
-`EARLIEST(expr, maxBytesPerString)`
+`EARLIEST(expr, [maxBytesPerValue])`

 **Function type:** [Aggregation](sql-aggregations.md)

@ -651,9 +645,7 @@ Returns the value of a numeric or string expression corresponding to the earlies

 ## EARLIEST_BY

-`EARLIEST_BY(expr, timestampExpr)`
-
-`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`
+`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`

 **Function type:** [Aggregation](sql-aggregations.md)

@ -837,9 +829,7 @@ Extracts a literal value from `expr` at the specified `path`. If you specify `RE

 ## LATEST

-`LATEST(expr)`
-
-`LATEST(expr, maxBytesPerString)`
+`LATEST(expr, [maxBytesPerValue])`

 **Function type:** [Aggregation](sql-aggregations.md)

@ -847,9 +837,7 @@ Returns the value of a numeric or string expression corresponding to the latest

 ## LATEST_BY

-`LATEST_BY(expr, timestampExpr)`
-
-`LATEST_BY(expr, timestampExpr, maxBytesPerString)`
+`LATEST_BY(expr, timestampExpr, [maxBytesPerValue])`

 **Function type:** [Aggregation](sql-aggregations.md)

--- a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
+++ b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
@ -82,7 +82,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
          String fieldName,
          String timeColumn,
          ColumnType type,
-          int maxStringBytes
+          Integer maxStringBytes
      )
      {
        switch (type.getType()) {
@ -108,7 +108,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
          String fieldName,
          String timeColumn,
          ColumnType type,
-          int maxStringBytes
+          Integer maxStringBytes
      )
      {
        switch (type.getType()) {
@ -134,7 +134,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
          String fieldName,
          String timeColumn,
          ColumnType type,
-          int maxStringBytes
+          Integer maxStringBytes
      )
      {
        switch (type.getType()) {
@ -157,7 +157,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
        String fieldName,
        String timeColumn,
        ColumnType outputType,
-        int maxStringBytes
+        Integer maxStringBytes
    );
  }

@ -236,7 +236,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
    final AggregatorFactory theAggFactory;
    switch (args.size()) {
      case 1:
-        theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, -1);
+        theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, null);
        break;
      case 2:
        int maxStringBytes;
@ -327,8 +327,7 @@ public class EarliestLatestAnySqlAggregator implements SqlAggregator
          EARLIEST_LATEST_ARG0_RETURN_TYPE_INFERENCE,
          InferTypes.RETURN_TYPE,
          OperandTypes.or(
-              OperandTypes.NUMERIC,
-              OperandTypes.BOOLEAN,
+              OperandTypes.ANY,
              OperandTypes.sequence(
                  "'" + aggregatorType.name() + "(expr, maxBytesPerString)'",
                  OperandTypes.ANY,
--- a/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java
+++ b/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestBySqlAggregator.java
@ -128,7 +128,7 @@ public class EarliestLatestBySqlAggregator implements SqlAggregator
                rexNodes.get(1)
            ),
            outputType,
-            -1
+            null
        );
        break;
      case 3:
--- a/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
+++ b/sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
@ -13957,13 +13957,13 @@ public class CalciteQueryTest extends BaseCalciteQueryTest
                        .setVirtualColumns(
                            expressionVirtualColumn(
                                "v0",
-                                "CAST(greatest(\"dim1\",\"dim2\"), 'DOUBLE')",
-                                ColumnType.DOUBLE
+                                "greatest(\"dim1\",\"dim2\")",
+                                ColumnType.STRING
                            )
                        )
                        .setGranularity(Granularities.ALL)
                        .addDimension(new DefaultDimensionSpec("l1", "_d0", ColumnType.LONG))
-                        .addAggregator(new DoubleLastAggregatorFactory("a0", "v0", null))
+                        .addAggregator(new StringLastAggregatorFactory("a0", "v0", null, 1024))
                        .setPostAggregatorSpecs(ImmutableList.of(
                            expressionPostAgg("p0", "isnull(\"a0\")")
                        ))
@ -13976,9 +13976,9 @@ public class CalciteQueryTest extends BaseCalciteQueryTest
            new Object[]{325323L, false}
        ) :
        ImmutableList.of(
-            new Object[]{null, true},
+            new Object[]{null, false},
            new Object[]{0L, false},
-            new Object[]{7L, true},
+            new Object[]{7L, false},
            new Object[]{325323L, false}
        )
    );
@ -14269,4 +14269,36 @@ public class CalciteQueryTest extends BaseCalciteQueryTest
        )
    );
  }
+
+  @Test
+  public void testLatestByOnStringColumnWithoutMaxBytesSpecified()
+  {
+    String defaultString = useDefault ? "" : null;
+    cannotVectorize();
+    testQuery(
+        "SELECT dim2,LATEST(dim3),LATEST_BY(dim1, __time),EARLIEST(dim3),EARLIEST_BY(dim1, __time),ANY_VALUE(dim3) FROM druid.foo where dim2='abc' group by 1",
+        ImmutableList.of(
+            GroupByQuery.builder()
+                .setDataSource(CalciteTests.DATASOURCE1)
+                .setInterval(querySegmentSpec(Filtration.eternity()))
+                .setGranularity(Granularities.ALL)
+                .setVirtualColumns(
+                    expressionVirtualColumn("v0", "'abc'", ColumnType.STRING))
+                .setDimFilter(equality("dim2", "abc", ColumnType.STRING))
+                .setDimensions(
+                    dimensions(new DefaultDimensionSpec("v0", "d0", ColumnType.STRING)))
+                .setAggregatorSpecs(
+                    aggregators(
+                        new StringLastAggregatorFactory("a0", "dim3", "__time", 1024),
+                        new StringLastAggregatorFactory("a1", "dim1", "__time", 1024),
+                        new StringFirstAggregatorFactory("a2", "dim3", "__time", 1024),
+                        new StringFirstAggregatorFactory("a3", "dim1", "__time", 1024),
+                        new StringAnyAggregatorFactory("a4", "dim3", 1024)))
+                .build()
+
+        ),
+        ImmutableList.of(
+            new Object[] {"abc", defaultString, "def", defaultString, "def", defaultString}
+        ));
+  }
 }