* add binary_byte_format/decimal_byte_format/decimal_format
* clean code
* fix doc
* fix review comments
* add spelling check rules
* remove extra param
* improve type handling and null handling
* remove extra zeros
* fix tests and add space between unit suffix and number as most size-format functions do
* fix tests
* add examples
* change function names according to review comments
* fix merge
Signed-off-by: frank chen <frank.chen021@outlook.com>
* no need to configure NullHandling explicitly for tests
Signed-off-by: frank chen <frank.chen021@outlook.com>
* fix tests in SQL-Compatible mode
Signed-off-by: frank chen <frank.chen021@outlook.com>
* Resolve review comments
* Update SQL test case to check null handling
* Fix intellij inspections
* Add more examples
* Fix example
* Improve concurrency between DruidSchema and BrokerServerView
* unused imports and workaround for error prone faiure
* count only known segments
* add comments
This PR fixes the incorrect results for query :
SELECT dim1, l1.k FROM foo LEFT JOIN (select k || '' as k from lookup.lookyloo group by 1) l1 ON foo.dim1 = l1.k WHERE l1.k IS NOT NULL (in CalciteQueryTests)
In the current code, the WHERE clause gets removed from the top of the left join and is pushed to the table foo
leading to incorrect results.
The fix for such a situation is done by :
Converting such left joins into inner joins (since logically the mentioned left join query is equivalent to an inner join) using Calcite while maintaining that the druid execution layer can execute such inner joins.
Preferring converted inner joins over original left joins in our cost model
This PR splits current SegmentLoader into SegmentLoader and SegmentCacheManager.
SegmentLoader - this class is responsible for building the segment object but does not expose any methods for downloading, cache space management, etc. Default implementation delegates the download operations to SegmentCacheManager and only contains the logic for building segments once downloaded. . This class will be used in SegmentManager to construct Segment objects.
SegmentCacheManager - this class manages the segment cache on the local disk. It fetches the segment files to the local disk, can clean up the cache, and in the future, support reserve and release on cache space. [See https://github.com/Make SegmentLoader extensible and customizable #11398]. This class will be used in ingestion tasks such as compaction, re-indexing where segment files need to be downloaded locally.
* improve groupBy query granularity translation when issued from sql layer
* fix style
* use virtual column to determine timestampResult granularity
* dont' apply postaggregators on compute nodes
* relocate constants
* fix order by correctness issue
* fix ut
* use more easier understanding code in DefaultLimitSpec
* address comment
* rollback use virtual column to determine timestampResult granularity
* fix style
* fix style
* address the comment
* add more detail document to explain the tradeoff
* address the comment
* address the comment
Users sometimes make typos when picking timezones - like `America/Los Angeles`
instead of `America/Los_Angeles` instead of defaulting to UTC, this change
makes it so that an error is thrown instead notifying the user of their mistake.
A constant expression may evaluate to Double.NEGATIVE_INFINITY/Double.POSITIVE_INFINITY/Double.NAN e.g. log10(0). When using such an expression in native queries, the user will get the corresponding value without any error. In SQL, however, the user will run into NumberFormatException because we convert the double to big-decimal while constructing a literal numeric expression. This probably should be fixed in calcite - see https://issues.apache.org/jira/browse/CALCITE-2067. This PR adds a verbose error message so that users can take corrective action without scratching their heads.
* add single input string expression dimension vector selector and better expression planning
* better
* fixes
* oops
* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing
* oops
* javadocs, renaming
* more javadocs
* benchmarks
* use string expression vector processor with vector size 1 instead of expr.eval
* better logging
* javadocs, surprising number of the the
* more
* simplify
* Fix is null selector returning incorrect value for Long data type
* Fix style errors
* Refactor getObject method to also cache null column values
* Make lastInput variable nullable
* Refactor unit test
* Use new boolean lastInputIsNull instead of Long for lastInput to avoid boxing
* Refactor to remove Long for input variable
* Make a separate null caching variable
* Cleaner null caching implementation
* fix count and average SQL aggregators on constant virtual columns
* style
* even better, why are we tracking virtual columns in aggregations at all if we have a virtual column registry
* oops missed a few
* remove unused
* this will fix it
* SQL timeseries no longer skip empty buckets with all granularity
* add comment, fix tests
* the ol switcheroo
* revert unintended change
* docs and more tests
* style
* make checkstyle happy
* docs fixes and more tests
* add docs, tests for array_agg
* fixes
* oops
* doc stuffs
* fix compile, match doc style
* Fix vectorized cardinality bug on certain string columns.
Fixes a bug introduced in #11182, related to the fact that in some cases,
ColumnProcessors.makeVectorProcessor will call "makeObjectProcessor"
instead of "makeSingleValueDimensionProcessor" or
"makeMultiValueDimensionProcessor". CardinalityVectorProcessorFactory
improperly ignored calls to "makeObjectProcessor".
In addition to fixing the bug, I added this detail to the javadocs for
VectorColumnProcessorFactory, to prevent others from running into the
same thing in the future. They do not currently call out this case.
* Improve test coverage.
* Additional fixes.
* ARRAY_AGG sql aggregator function
* add javadoc
* spelling
* review stuff, return null instead of empty when nil input
* review stuff
* Update sql.md
* use type inference for finalize, refactor some things
* Vectorize the cardinality aggregator.
Does not include a byRow implementation, so if byRow is true then
the aggregator still goes through the non-vectorized path.
Testing strategy:
- New tests that exercise both styles of "aggregate" for supported types.
- Some existing tests have also become active (note the deleted
"cannotVectorize" lines).
* Adjust whitespace.
* Enable rewriting certain inner joins as filters.
The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:
- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
duplicate values. (If they did, the inner join would not be guaranteed
to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
the join condition itself.
HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.
In support of this goal:
- Add Query getRequiredColumns() method to help us figure out whether
the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
figure out if the right-hand side of a join is being used by later
join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
JoinFilterRewriteConfig. The default is disabled.
* Test improvements.
* Test fixes.
* Avoid slow size() call.
* Remove invalid test.
* Fix style.
* Fix mistaken default.
* Small fixes.
* Fix logic error.
* add round test
* code style
* handle null val for round function
* handle null val for round function
* support null for round
* fix compatiblity
* fix test
* fix test
* code style
* optimize format
* Add a planner rule to handle empty tables
* adjust comment
* type handling
* add tests
* unused imports and fix test
* fix more tests
* fix more test
* javadoc
Size HashMap and HashSet appropriately. Perf analysis of the queries
revealed that over 25% of the query time was spent in resizing HashMap and HashSet
collections. Also, prevent the need to examine and authorize all resources when
AllowAllAuthorizer is the configured authorizer.
* expression filter support for vectorized query engines
* remove unused codes
* more tests
* refactor, more tests
* suppress
* more
* more
* more
* oops, i was wrong
* comment
* remove decorate, object dimension selector, more javadocs
* style
* fix SQL issue for group by queries with time filter that gets optimized to false
* short circuit always false in CombineAndSimplifyBounds
* adjust
* javadocs
* add preconditions for and/or filters to ensure they have children
* add comments, remove preconditions
* where filter left first draft
* Revert changes in calcite test
* Refactor a bit
* Fixing the Tests
* Changes
* Adding tests
* Add tests for correlated queries
* Add comment
* Fix typos
* add druid jdbc handler config for minimum number of rows per frame
* javadocs and docs adjustments
* spelling
* adjust docs per review with minor tweaks
* adjust more
* Support segmentGranularity for auto-compaction
* Support segmentGranularity for auto-compaction
* Support segmentGranularity for auto-compaction
* Support segmentGranularity for auto-compaction
* resolve conflict
* Support segmentGranularity for auto-compaction
* Support segmentGranularity for auto-compaction
* fix tests
* fix more tests
* fix checkstyle
* add unit tests
* fix checkstyle
* fix checkstyle
* fix checkstyle
* add unit tests
* add integration tests
* fix checkstyle
* fix checkstyle
* fix failing tests
* address comments
* address comments
* fix tests
* fix tests
* fix test
* fix test
* fix test
* fix test
* fix test
* fix test
* fix test
* fix test
* before i leaped i should've seen, the view from halfway down
* fixes
* fixes, more test
* rename
* fix style
* further refactoring
* review stuffs
* rename
* more javadoc and comments
* enhance the logic of Start up DruidSchema immediately if there are no segments.
* add UT to test DruidSchema init
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
* Tidy up query error codes
* fix tests
* Restore query exception type in JsonParserIterator
* address review comments; add a comment explaining the ugly switch
* fix test
* integration test for coordinator and overlord leadership, added sys.servers is_leader column
* docs
* remove not needed
* fix comments
* fix compile heh
* oof
* revert unintended
* fix tests, split out docker-compose file selection from starting cluster, use docker-compose down to stop cluster
* fixes
* style
* dang
* heh
* scripts are hard
* fix spelling
* fix thing that must not matter since was already wrong ip, log when test fails
* needs more heap
* fix merge
* less aggro
* fix race condition with DruidSchema tables and dataSourcesNeedingRebuild
* rework to see if it passes analysis
* more better
* maybe this
* re-arrange and comments
* First draft of grouping_id function
* Add more tests and documentation
* Add calcite tests
* Fix travis failures
* bit of a change
* Add documentation
* Fix typos
* typo fix
* Two fixes related to encoding of % symbols.
1) TaskResourceFilter: Don't double-decode task ids. request.getPathSegments()
returns already-decoded strings. Applying StringUtils.urlDecode on
top of that causes erroneous behavior with '%' characters.
2) Update various ThreadFactoryBuilder name formats to escape '%'
characters. This fixes situations where substrings starting with '%'
are erroneously treated as format specifiers.
ITs are updated to include a '%' in extra.datasource.name.suffix.
* Avoid String.replace.
* Work around surefire bug.
* Fix xml encoding.
* Another try at the proper encoding.
* Give up on the emojis.
* Less ambitious testing.
* Fix an additional problem.
* Adjust encodeForFormat to return null if the input is null.
* support for vectorizing expressions with non-existent inputs, more consistent type handling for non-vectorized expressions
* inspector
* changes
* more test
* clean
* fix JSON format
* Change all columns in sys segments to be JSON
* Change all columns in sys segments to be JSON
* add tests
* fix failing tests
* fix failing tests
* vectorize remaining math expressions
* fixes
* remove cannotVectorize() where no longer true
* disable vectorized groupby for numeric columns with nulls
* fixes
* push down ValueType to ExprType conversion, tidy up
* determine expr output type for given input types
* revert unintended name change
* add nullable
* tidy up
* fixup
* more better
* fix signatures
* naming things is hard
* fix inspection
* javadoc
* make default implementation of Expr.getOutputType that returns null
* rename method
* more test
* add output for contains expr macro, split operation and function auto conversion
* Add IndexMergerRollupTest
This changelist adds a test to merge indexes with StringFirst/StringLast aggregator.
* Fix StringFirstAggregateCombiner/StringLastAggregateCombiner
The segment-level type for stringFirst/stringLast is SerializablePairLongString,
not String. This changelist fixes it.
* Fix EarliestLatestAnySqlAggregator to handle COMPLEX type
This changelist allows EarliestLatestAnySqlAggregator to accept COMPLEX
type as an operand. For its return type, we set it to VARCHAR, since
COMPLEX column is only generated by stringFirst/stringLast during ingestion
rollup.
* Return value with smaller timestamp in StringFirstAggregatorFactory.combine function
* Add integration tests for stringFirst/stringLast during ingestion
* Use one EarliestLatestReturnTypeInference instance
Co-authored-by: Joy Kent <joy@automonic.ai>
* SQL support for union datasources.
Exposed via the "UNION ALL" operator. This means that there are now two
different implementations of UNION ALL: one at the top level of a query
that works by concatenating subquery results, and one at the table level
that works by creating a UnionDataSource.
The SQL documentation is updated to discuss these two use cases and how
they behave.
Future work could unify these by building support for a native datasource
that represents the union of multiple subqueries. (Today, UnionDataSource
can only represent the union of tables, not subqueries.)
* Fixes.
* Error message for sanity check.
* Additional test fixes.
* Add some error messages.
* better type tracking: add typed postaggs, finalized types for agg factories
* more javadoc
* adjustments
* transition to getTypeName to be used exclusively for complex types
* remove unused fn
* adjust
* more better
* rename getTypeName to getComplexTypeName
* setup expression post agg for type inference existing
* more javadocs
* fixup
* oops
* more test
* more test
* more comments/javadoc
* nulls
* explicitly handle only numeric and complex aggregators for incremental index
* checkstyle
* more tests
* adjust
* more tests to showcase difference in behavior
* timeseries longsum array
* Add SQL "OFFSET" clause.
Under the hood, this uses the new offset features from #10233 (Scan)
and #10235 (GroupBy). Since Timeseries and TopN queries do not currently
have an offset feature, SQL planning will switch from one of those to
Scan or GroupBy if users add an OFFSET.
Includes a refactoring to harmonize offset and limit planning using an
OffsetLimit wrapper class. This is useful because it ensures that the
various places that need to deal with offset and limit collapsing all
behave the same way, using its "andThen" method.
* Fix test and add another test.
* Segment backed broadcast join IndexedTable
* fix comments
* fix tests
* sharing is caring
* fix test
* i hope this doesnt fix it
* filter by schema to maybe fix test
* changes
* close join stuffs so it does not leak, allow table to directly make selector factory
* oops
* update comment
* review stuffs
* better check
* remove DruidLeaderClient.goAsync(..) that does not follow redirect.
Replace its usage by DruidLeaadereClient.go(..) with
InputStreamFullResponseHandler
* remove ByteArrayResponseHolder dependency from JsonParserIterator
* add UT to cover lines in InputStreamFullResponseHandler
* refactor SystemSchema to reduce branches
* further reduce branches
* Revert "add UT to cover lines in InputStreamFullResponseHandler"
This reverts commit 330aba3dd9.
* UTs for InputStreamFullResponseHandler
* remove unused imports
* Add "offset" parameter to the Scan query.
It works by doing the query as normal and then throwing away the first
"offset" number of rows on the broker.
* Fix constructor call.
* Fix up JSONs.
* Fix call to ScanQuery.
* Doc update.
* Fix javadocs.
* Spotbugs, LGTM suppressions.
* Javadocs.
* Fix suppression.
* Stabilize Scan query result order, add tests.
* Update LGTM comment.
* Fixup.
* Test different batch sizes too.
* Nicer tests.
* Fix comment.
* LongMaxVectorAggregator support and test case.
* DoubleMinVectorAggregator and test cases.
* DoubleMaxVectorAggregator and unit test.
* FloatMinVectorAggregator and FloatMaxVectorAggregator.
* Documentation update to include the other vector aggregators.
* Bug fix.
* checkstyle formatting fixes.
* CalciteQueryTest cases update.
* Separate test classes for FloatMaxAggregation and FloatMniAggregation.
* remove the cannotVectorize for float max/min aggregator in test.
* Tests in GroupByQueryRunner, GroupByTimeseriesQueryRunner and TimeseriesQueryRunner.
* Add "offset" parameter to GroupBy query.
It works by doing the query as normal and then throwing away the first
"offset" number of rows on the broker.
* Stabilize GroupBy sorts.
* Fix inspections.
* Fix suppression.
* Fixups.
* Move TopNSequence to druid-core.
* Addl comments.
* NumberedElement equals verification.
* Changes from review.
* Fix minor formatting in docs.
* Add Nullhandling initialization for test to run from IDE.
* Vectorize longMin aggregator.
- A new vectorized class for the vectorized long min aggregator.
- Changes to AggregatorFactory to support vectorize functionality.
- Few changes to schema evolution test to add LongMinAggregatorFactory.
* Add longSum to the supported vectorized aggregator implementations.
* Add MIN() long min to calcite query test that can vectorize.
* Add simple long aggregations test.
* Fixup formatting per checkstyle guide.
* fixup and add more tests for long min aggregator.
* Override test for groupBy since timestamps are handled differently.
* Null compatibility check in test.
* Review comment: Add a test case to LongMinAggregationTest.
* Fix timeseries query constructor when postAggregator has an expression reading timestamp result column
* fix npe
* Fix postAgg referencing timestampResultField and add a test for it
* fix test
* doc
* revert doc
* new average aggregator
* method to create count aggregator factory
* test everything
* update other usages
* fix style
* fix more tests
* fix datasketches tests