All JDK 8 based CI checks have been removed.
Images used in Dockerfile(s) have been updated to Java 17 based images.
Documentation has been updated accordingly.
Map Lookup Introspection API endpoints /keys and /values no longer return an invalid JSON object.
Also, update documentation to clarify the version returned by the /version introspection endpoint.
---------
Co-authored-by: Ashwin Tumma <ashwin.tumma@salesforce.com>
* Linked back to query granularity docs
* Update ingestion-spec.md
clairfy about query granularities in the spec.
* Update docs/design/storage.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/ingestion-spec.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/granularities.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Apply suggestions from code review
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Segments primarily sorted by non-time columns.
Currently, segments are always sorted by __time, followed by the sort
order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting
by __time enables efficient execution of queries involving time-ordering
or granularity. Time-ordering is a simple matter of reading the rows in
stored order, and granular cursors can be generated in streaming fashion.
However, for various workloads, it's better for storage footprint and
query performance to sort by arbitrary orders that do not start with __time.
With this patch, users can sort segments by such orders.
For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to
dimensionsSpec. The "dimensions" list determines the sort order. To
define a sort order that includes "__time", users explicitly
include a dimension named "__time".
For SQL-based ingestion, users set the context parameter
"useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then
used as the explicit segment sort order.
In both cases, when the new "useExplicitSegmentSortOrder" parameter is
false (the default), __time is implicitly prepended to the sort order,
as it always was prior to this patch.
The new parameter is experimental for two main reasons. First, such
segments can cause errors when loaded by older servers, due to violating
their expectations that timestamps are always monotonically increasing.
Second, even on newer servers, not all queries can run on non-time-sorted
segments. Scan queries involving time-ordering and any query involving
granularity will not run. (To partially mitigate this, a currently-undocumented
SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner
avoids using "granularity".)
Changes on the write path:
1) DimensionsSpec can now optionally contain a __time dimension, which
controls the placement of __time in the sort order. If not present,
__time is considered to be first in the sort order, as it has always
been.
2) IncrementalIndex and IndexMerger are updated to sort facts more
flexibly; not always by time first.
3) Metadata (stored in metadata.drd) gains a "sortOrder" field.
4) MSQ can generate range-based shard specs even when not all columns are
singly-valued strings. It merely stops accepting new clustering key
fields when it encounters the first one that isn't a singly-valued
string. This is useful because it enables range shard specs on
"someDim" to be created for clauses like "CLUSTERED BY someDim, __time".
Changes on the read path:
1) Add StorageAdapter#getSortOrder so query engines can tell how a
segment is sorted.
2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter,
and VectorCursorGranularizer to throw errors when using granularities
on non-time-ordered segments.
3) Update ScanQueryEngine to throw an error when using the time-ordering
"order" parameter on non-time-ordered segments.
4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when
running on a non-time-ordered segment.
5) Add "sqlUseGranularity" context parameter that causes the SQL planner
to avoid using granularities other than ALL.
Other changes:
1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions"
and change the meaning subtly: it now returns true if the DimensionsSpec
represents an unchanging list of dimensions, or false if there is
some discovery happening. This is what call sites had expected anyway.
* Fixups from CI.
* Fixes.
* Fix missing arg.
* Additional changes.
* Fix logic.
* Fixes.
* Fix test.
* Adjust test.
* Remove throws.
* Fix styles.
* Fix javadocs.
* Cleanup.
* Smoother handling of null ordering.
* Fix tests.
* Missed a spot on the merge.
* Fixups.
* Avoid needless Filters.and.
* Add timeBoundaryInspector to test.
* Fix tests.
* Fix FrameStorageAdapterTest.
* Fix various tests.
* Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder.
* Pom fix.
* Fix doc.
* just starting
* TIME_PARSE and TIME_FORMAT remaining
* fixing typo
* adding last two functions
* review sql-functions.md
* Apply suggestions from code review
Suggestions that were accepted as is
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Update docs/querying/sql-functions.md
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Update docs/querying/sql-functions.md
needed to confirm that it did indeed return as a number
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* reviewing remaining suggestions
* addressing review for time_format
* Apply suggestions from code review
Accepted as is
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* addressing final suggestion
* time_zone -> timezone
* timezone fix
---------
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
* Lower,Upper,Lpad,Rpad,Parse_long
* up to REGEXP_EXTRACT
* batch 07 ready for review
* updated definitions in scalar
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* rpad and lpad
* addressing comments
* minor fixes
* improving examples based on suggestions
* matched -> matches
* correcting typo
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* batch 03 - trig functions
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* applying suggestions and corrections
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Changes the WindowFrame internals / representation a bit; introduces dedicated frametypes for rows and groups which corresponds to the implemented processing methods
* updating first batch of numeric functions
* First batch of functions
* addressing first few comments
* alphabetize list
* draft with suggestions applied
* minor discrepency expr -> <NUMERIC>
* changed raises to calculates
* Update docs/querying/sql-functions.md
* switch to underscore
* changed to exp(1) to match slack message
* adding html text for trademark symbol to .spelling
* fixed discrepancy between description and example
---------
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* Defer more expressions in vectorized groupBy.
This patch adds a way for columns to provide GroupByVectorColumnSelectors,
which controls how the groupBy engine operates on them. This mechanism is used
by ExpressionVirtualColumn to provide an ExpressionDeferredGroupByVectorColumnSelector
that uses the inputs of an expression as the grouping key. The actual expression
evaluation is deferred until the grouped ResultRow is created.
A new context parameter "deferExpressionDimensions" allows users to control when
this deferred selector is used. The default is "fixedWidthNonNumeric", which is a
behavioral change from the prior behavior. Users can get the prior behavior by setting
this to "singleString".
* Fix style.
* Add deferExpressionDimensions to SqlExpressionBenchmark.
* Fix style.
* Fix inspections.
* Add more testing.
* Use valueOrDefault.
* Compute exprKeyBytes a bit lighter-weight.
* Simplify serialized form of JsonInputFormat.
Use JsonInclude for keepNullColumns, assumeNewlineDelimited, and
useJsonNodeReader. Because the default value of keepNullColumns is
variable, we store the original configured value rather than the
derived value, and include if the original value is nonnull.
* Fix test.
* Speed up SQL IN using SCALAR_IN_ARRAY.
Main changes:
1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
the IN is above inFunctionThreshold. The default value of inFunctionThreshold
is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
it to Integer.MAX_VALUE.
3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
greater than inFunctionThreshold.
* Revert test.
* Additional coverage.
* Update docs/querying/sql-query-context.md
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* New test.
---------
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* Four changes to scalar_in_array as follow-ups to #16306:
1) Align behavior for `null` scalars to the behavior of the native `in` and `inType` filters: return `true` if the array itself contains null, else return `null`.
2) Rename the class to more closely match the function name.
3) Add a specialization for constant arrays, where we build a `HashSet`.
4) Use `castForEqualityComparison` to properly handle cross-type comparisons.
Additional tests verify comparisons between LONG and DOUBLE are now
handled properly.
* Fix spelling.
* Adjustments from review.