* Processors for Window Processing
This is an initial take on how to use Processors
for Window Processing. A Processor is an interface
that transforms RowsAndColumns objects.
RowsAndColumns objects are essentially combinations
of rows and columns.
The intention is that these Processors are the start
of a set of operators that more closely resemble what
DB engineers would be accustomed to seeing.
* Wire up windowed processors with a query type that
can run them end-to-end. This code can be used to
actually run a query, so yay!
* Wire up windowed processors with a query type that
can run them end-to-end. This code can be used to
actually run a query, so yay!
* Some SQL tests for window functions. Added wikipedia
data to the indexes available to the
SQL queries and tests validating the windowing
functionality as it exists now.
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
* Moving all unnest cursor code atop refactored code for unnest
* Updating unnest cursor
* Removing dedup and fixing up some null checks
* AllowList changes
* Fixing some NPEs
* Using bitset for allowlist
* Updating the initialization only when cursor is in non-done state
* Updating code to skip rows not in allow list
* Adding a flag for cases when first element is not in allowed list
* Updating for a null in allowList
* Splitting unnest cursor into 2 subclasses
* Intercepting some apis with columnName for new unnested column
* Adding test cases and renaming some stuff
* checkstyle fixes
* Moving to an interface for Unnest
* handling null rows in a dimension
* Updating cursors after comments part-1
* Addressing comments and adding some more tests
* Reverting a change to ScanQueryRunner and improving a comment
* removing an unused function
* Updating cursors after comments part 2
* One last fix for review comments
* Making some functions private, deleting some comments, adding a test for unnest of unnest with allowList
* Adding an exception for a case
* Closure for unnest data source
* Adding some javadocs
* One minor change in makeDimSelector of columnarCursor
* Updating an error message
* Update processing/src/main/java/org/apache/druid/segment/DimensionUnnestCursor.java
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Unnesting on virtual columns was missing an object array, adding that to support virtual columns unnesting
* Updating exceptions to use UOE
* Renamed files, added column capability test on adapter, return statement and made unnest datasource not cacheable for the time being
* Handling for null values in dim selector
* Fixing a NPE for null row
* Updating capabilities
* Updating capabilities
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
SQL test framework extensions
* Capture planner artifacts: logical plan, etc.
* Planner test builder validates the logical plan
* Validation for the SQL resut schema (we already have
validation for the Druid row signature)
* Better Guice integration: properties, reuse Guice modules
* Avoid need for hand-coded expr, macro tables
* Retire some of the test-specific query component creation
* Fix query log hook race condition
* fixes BlockLayoutColumnarLongs close method to nullify internal buffer.
* fixes other BlockLayoutColumnar supplier close methods to nullify internal buffers.
* fix spotbugs
* we can read where we want to
we can leave your bounds behind
'cause if the memory is not there
we really don't care
and we'll crash this process of mine
We added compression to the latest/first pair storage, but
the code change was forcing new things to be persisted
with the new format, meaning that any segment created with
the new code cannot be read by the old code. Instead, we
need to default to creating the old format and then remove that default in a future version.
* Add string comparison methods to StringUtils, fix dictionary comparisons.
There are various places in Druid code where we assume that String.compareTo
is consistent with Unicode code-point ordering. Sadly this is not the case.
To help deal with this, this patch introduces the following helpers:
1) compareUnicode: Compares two Strings in Unicode code-point order.
2) compareUtf8: Compares two UTF-8 byte arrays in Unicode code-point order.
Equivalent to comparison as unsigned bytes.
3) compareUtf8UsingJavaStringOrdering: Compares two UTF-8 byte arrays, or
ByteBuffers, in a manner consistent with String.compareTo.
There is no helper for comparing two Strings in a manner consistent
with String.compareTo, because for that we can use compareTo directly.
The patch also fixes an inconsistency between the String and UTF-8
dictionary GenericIndexed flavors of string-typed columns: they were
formerly using incompatible comparators.
* Adjust test.
* FrontCodedIndexed updates.
* Add test.
* Fix comments.
Changes:
- Add a metric for partition-wise kafka/kinesis lag for streaming ingestion.
- Emit lag metrics for streaming ingestion when supervisor is not suspended and state is in {RUNNING, IDLE, UNHEALTHY_TASKS, UNHEALTHY_SUPERVISOR}
- Document metrics
* Compaction: Fetch segments one at a time on main task; skip when possible.
Compact tasks include the ability to fetch existing segments and determine
reasonable defaults for granularitySpec, dimensionsSpec, and metricsSpec.
This is a useful feature that makes compact tasks work well even when the
user running the compaction does not have a clear idea of what they want
the compacted segments to be like.
However, this comes at a cost: it takes time, and disk space, to do all
of these fetches. This patch improves the situation in two ways:
1) When segments do need to be fetched, download them one at a time and
delete them when we're done. This still takes time, but minimizes the
required disk space.
2) Don't fetch segments on the main compact task when they aren't needed.
If the user provides a full granularitySpec, dimensionsSpec, and
metricsSpec, we can skip it.
* Adjustments.
* Changes from code review.
* Fix logic for determining rollup.
* Use lookup memory footprint in MSQ memory computations.
Two main changes:
1) Add estimateHeapFootprint to LookupExtractor.
2) Use this in MSQ's IndexerWorkerContext when determining the total
amount of available memory. It's taken off the top.
This prevents MSQ tasks from running out of memory when there are lookups
defined in the cluster.
* Updates from code review.
* First set of changes for framework
* Second set of changes to move segment map function to data source
* Minot change to server manager
* Removing the createSegmentMapFunction from JoinableFactoryWrapper and moving to JoinDataSource
* Checkstyle fixes
* Patching Eric's fix for injection
* Checkstyle and fixing some CI issues
* Fixing code inspections and some failed tests and one injector for test in avatica
* Another set of changes for CI...almost there
* Equals and hashcode part update
* Fixing injector from Eric + refactoring for broadcastJoinHelper
* Updating second injector. Might revert later if better way found
* Fixing guice issue in JoinableFactory
* Addressing review comments part 1
* Temp changes refactoring
* Revert "Temp changes refactoring"
This reverts commit 9da42a9ef0.
* temp
* Temp discussions
* Refactoring temp
* Refatoring the query rewrite to refer to a datasource
* Refactoring getCacheKey by moving it inside data source
* Nullable annotation check in injector
* Addressing some comments, removing 2 analysis.isJoin() checks and correcting the benchmark files
* Minor changes for refactoring
* Addressing reviews part 1
* Refactoring part 2 with new test cases for broadcast join
* Set for nullables
* removing instance of checks
* Storing nullables in guice to avoid checking on reruns
* Fixing a test case and removing an irrelevant line
* Addressing the atomic reference review comments
* add FrontCodedIndexed for delta string encoding
* now for actual segments
* fix indexOf
* fixes and thread safety
* add bucket size 4, which seems generally better
* fixes
* fixes maybe
* update indexes to latest interfaces
* utf8 support
* adjust
* oops
* oops
* refactor, better, faster
* more test
* fixes
* revert
* adjustments
* fix prefixing
* more chill
* sql nested benchmark too
* refactor
* more comments and javadocs
* better get
* remove base class
* fix
* hot rod
* adjust comments
* faster still
* minor adjustments
* spatial index support
* spotbugs
* add isSorted to Indexed to strengthen indexOf contract if set, improve javadocs, add docs
* fix docs
* push into constructor
* use base buffer instead of copy
* oops
* SQL: Use timestamp_floor when granularity is not safe.
PR #12944 added a check at the execution layer to avoid materializing
excessive amounts of time-granular buckets. This patch modifies the SQL
planner to avoid generating queries that would throw such errors, by
switching certain plans to use the timestamp_floor function instead of
granularities. This applies both to the Timeseries query type, and the
GroupBy timestampResultFieldGranularity feature.
The patch also goes one step further: we switch to timestamp_floor
not just in the ETERNITY + non-ALL case, but also if the estimated
number of time-granular buckets exceeds 100,000.
Finally, the patch modifies the timestampResultFieldGranularity
field to consistently be a String rather than a Granularity. This
ensures that it can be round-trip serialized and deserialized, which is
useful when trying to execute the results of "EXPLAIN PLAN FOR" with
GroupBy queries that use the timestampResultFieldGranularity feature.
* Fix test, address PR comments.
* Fix ControllerImpl.
* Fix test.
* Fix unused import.
We introduce two new configuration keys that refine the query context security model controlled by druid.auth.authorizeQueryContextParams. When that value is set to true then two other configuration options become available:
druid.auth.unsecuredContextKeys: The set of query context keys that do not require a security check. Use this for the "white-list" of key to allow. All other keys go through the existing context key security checks.
druid.auth.securedContextKeys: The set of query context keys that do require a security check. Use this when you want to allow all but a specific set of keys: only these keys go through the existing context key security checks.
Both are set using JSON list format:
druid.auth.securedContextKeys=["secretKey1", "secretKey2"]
You generally set one or the other values. If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys.
In addition, Druid defines two query context keys which always bypass checks because Druid uses them internally:
sqlQueryId
sqlStringifyArrays
This is in preparation for eventually retiring the flag `useMaxMemoryEstimates`,
after which the footprint of a value in the dimension dictionary will always be
estimated using the `estimateSizeOfValue()` method.
* fix json_value sql planning with decimal type, fix vectorized expression math null value handling in default mode
changes:
* json_value 'returning' decimal will now plan to native double typed query instead of ending up with default string typing, allowing decimal vector math expressions to work with this type
* vector math expressions now zero out 'null' values even in 'default' mode (druid.generic.useDefaultValueForNull=false) to prevent downstream things that do not check the null vector from producing incorrect results
* more better
* test and why not vectorize
* more test, more fix
* Expose HTTP Response headers from SqlResource
This change makes the SqlResource expose HTTP response
headers in the same way that the QueryResource exposes them.
Fundamentally, the change is to pipe the QueryResponse
object all the way through to the Resource so that it can
populate response headers. There is also some code
cleanup around DI, as there was a superfluous FactoryFactory
class muddying things up.
changes:
* long and double value columns are now written directly, at the same time as writing out the 'intermediary' dictionaryid column with unsorted ids
* remove reverse value lookup from GlobalDictionaryIdLookup since it is no longer needed
* more consistent expression error messages
* review stuff
* add NamedFunction for Function, ApplyFunction, and ExprMacro to share common stuff
* fixes
* add expression transform name to transformer failure, better parse_json error messaging