* IntelliJ inspection and checkstyle rule for "Collections.EMPTY_* field accesses replaceable with Collections.empty*()"
* Reverted checkstyle rule
* Added tests to pass CI
* Codestyle
* ROUND and having comparators correctly handle doubles
Double.NaN, Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY are not real
numbers. Because of this, they cannot be converted to BigDecimal; attempting
the conversion throws a NumberFormatException.
This change adds support for calculations that produce these numbers either
for use in the `ROUND` function or the HavingSpecMetricComparator by not
attempting to convert the number to a BigDecimal.
The bug in ROUND was first introduced in #7224 where we added the ability to
round to any decimal place. This PR changes the behavior back to using
`Math.round` if we recognize a number that cannot be converted to a
BigDecimal.
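A minimal sketch of the fallback described above, assuming a hypothetical helper named `round` (this is not Druid's actual ROUND code, and the later "fix up round for infinity" commits may have changed the exact values returned). It only shows the guard: non-finite doubles skip the BigDecimal path and go through `Math.round`.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class RoundSketch {
    // Hypothetical helper for illustration; not Druid's actual ROUND code.
    static double round(double value, int scale) {
        // NaN and the infinities have no BigDecimal representation:
        // BigDecimal.valueOf(value) would throw NumberFormatException.
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            // Math.round maps NaN to 0 and clamps the infinities to
            // Long.MIN_VALUE / Long.MAX_VALUE instead of throwing.
            return Math.round(value);
        }
        return BigDecimal.valueOf(value)
                         .setScale(scale, RoundingMode.HALF_UP)
                         .doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(round(3.14159, 2));
        System.out.println(round(Double.NaN, 2));
        System.out.println(round(Double.POSITIVE_INFINITY, 2));
    }
}
```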
* Add tests and fix spellcheck
* update error message in ExpressionsTest
* Address comments
* fix up round for infinity
* round non numeric doubles returns a double
* fix spotbugs
* Update docs/misc/math-expr.md
* Update docs/querying/sql.md
* Remove LegacyDataSource.
Its purpose was to enable deserialization of strings into TableDataSources.
But we can do this more straightforwardly with Jackson annotations.
* Slight test improvement.
The parameters generator uses CompressionStrategy.noNoneValues() instead
of CompressionStrategyTest.compressionStrategies() which wrapped each
strategy in a single element array. This improves readability of the
test.
* make joinables closeable
* tests and adjustments
* refactor to make join stuffs implement ReferenceCountedObject instead of Closeable, more tests
* fixes
* javadocs and stuff
* fix bugs
* more test
* fix lgtm alert
* simplify
* fixup javadoc
* review stuffs
* safeguard against exceptions
* i hate this checkstyle rule
* make IndexedTable extend Closeable
* remove incorrect and unnecessary overrides from BooleanVectorValueMatcher
* add test case
* add unit tests for ... part of VectorValueMatcherColumnProcessorFactory
* Update VectorValueMatcherColumnProcessorFactoryTest.java
* move benchmark data generator into druid-processing, add a GeneratorInputSource to fill up a cluster with data
* newlines
* make test coverage not fail maybe
* remove useless test
* Update pom.xml
* Update GeneratorInputSourceTest.java
* less passive aggressive test names
* fix groupBy with literal in subquery grouping
* address comments
* update javadocs
* Fix join
* Fix Subquery could not be converted to groupBy query
* add tests
* address comments
* fix failing tests
* Add REGEXP_LIKE, fix empty-pattern bug in REGEXP_EXTRACT.
- Add REGEXP_LIKE function that returns a boolean, and is useful in
WHERE clauses.
- Fix REGEXP_EXTRACT return type (should be nullable; causes incorrect
filter elision).
- Fix REGEXP_EXTRACT behavior for empty patterns: should always match
(previously, they threw errors).
- Improve error behavior when REGEXP_EXTRACT and REGEXP_LIKE are passed
non-literal patterns.
- Improve documentation of REGEXP_EXTRACT.
* Changes based on PR review.
* Fix arg check.
* Important fixes!
* Add speller.
* wip
* Additional tests.
* Fix up tests.
* Add validation error tests.
* Additional tests.
* Remove useless call.
* - GroupByQueryEngineV2: Fix leak of intermediate processing buffer when
exceptions are thrown before result sequence is created.
- PooledTopNAlgorithm: Fix leak of intermediate processing buffer when
exceptions are thrown before the PooledTopNParams object is created.
- BlockingPool: Remove unused "take" methods.
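The leak pattern fixed above can be sketched as follows. `Pool`, `Holder` and `startProcessing` are hypothetical names for illustration only and do not mirror the real GroupByQueryEngineV2 or PooledTopNAlgorithm code. The idea is that a buffer taken from a pool must be returned if setup fails before ownership is handed to the result.

```java
class BufferLeakSketch {
    // Hypothetical pool that only counts buffers handed out and not yet returned.
    static final class Pool {
        int outstanding = 0;

        Holder take() {
            outstanding++;
            return new Holder(this);
        }
    }

    static final class Holder implements AutoCloseable {
        private final Pool pool;
        private boolean closed = false;

        Holder(Pool pool) {
            this.pool = pool;
        }

        @Override
        public void close() {
            // Idempotent: returning the buffer twice must not corrupt the count.
            if (!closed) {
                closed = true;
                pool.outstanding--;
            }
        }
    }

    // On success the caller owns the holder and must close it.
    // On failure the holder is closed here, so nothing leaks.
    static Holder startProcessing(Pool pool, boolean failDuringSetup) {
        Holder holder = pool.take();
        try {
            if (failDuringSetup) {
                throw new IllegalStateException("setup failed");
            }
            return holder;
        } catch (RuntimeException e) {
            holder.close();
            throw e;
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        try {
            startProcessing(pool, true);
        } catch (IllegalStateException e) {
            System.out.println("outstanding after failure: " + pool.outstanding);
        }
    }
}
```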
* Add tests to verify that buffers have been returned.
* Fix various Yielder leaks.
- CombiningSequence leaked the input yielder from "toYielder" if it ran
into an exception while accumulating the last value from the input
yielder.
- MergeSequence leaked input yielders from "toYielder" if it ran into
an exception while building the initial priority queue.
- ScanQueryRunnerFactory leaked the input yielder in its
"priorityQueueSortAndLimit" strategy if it ran into an exception
while scanning and sorting.
- YieldingSequenceBase.accumulate chomped IOExceptions thrown in
"accumulate" during yielder closing.
* Add tests.
* Fix braces.
* Refactor JoinFilterAnalyzer
This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.
To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.
This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Filters with associated tests.
* fix tests
* Refactor JoinFilterAnalyzer - part 2
This change introduces the following components:
* RhsRewriteCandidates - a wrapper for a list of candidates and associated
functions to operate on the set of candidates.
* JoinableClauses - a wrapper for the list of JoinableClause that represent
a join condition and the associated functions to operate on the clauses.
* Equiconditions - a wrapper representing the equiconditions that are used
in the join condition.
And associated test changes.
This refactoring surfaced 2 bugs:
- Missing equals and hashcode implementation for RhsRewriteCandidate, thus
allowing potential duplicates in the rhs rewrite candidates
- Missing Filter#supportsRequiredColumnRewrite check in
analyzeJoinFilterClause, which could result in UnsupportedOperationException
being thrown by the filter
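To illustrate the first bug in general terms: a class without equals and hashCode falls back to identity comparison, so a HashSet keeps two logically identical instances. The classes below are hypothetical stand-ins, not the real RhsRewriteCandidate.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class CandidateSketch {
    // Hypothetical candidate without equals/hashCode: compared by identity.
    static final class WithoutEquals {
        final String column;

        WithoutEquals(String column) {
            this.column = column;
        }
    }

    // Same shape, but with value-based equals and hashCode.
    static final class WithEquals {
        final String column;

        WithEquals(String column) {
            this.column = column;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof WithEquals)) {
                return false;
            }
            return column.equals(((WithEquals) o).column);
        }

        @Override
        public int hashCode() {
            return Objects.hash(column);
        }
    }

    static int distinctWithoutEquals() {
        Set<WithoutEquals> set = new HashSet<>();
        set.add(new WithoutEquals("j.k"));
        set.add(new WithoutEquals("j.k"));
        return set.size();
    }

    static int distinctWithEquals() {
        Set<WithEquals> set = new HashSet<>();
        set.add(new WithEquals("j.k"));
        set.add(new WithEquals("j.k"));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctWithoutEquals());
        System.out.println(distinctWithEquals());
    }
}
```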
* fix compile error
* remove unused class
* Refactor JoinFilterAnalyzer - Correlations
Move the correlation related code out into its own class so it's easier
to maintain.
Another patch should follow this one so that the query path uses the
correlation object instead of its underlying maps.
* Optimize join queries where filter matches nothing
Fixes #9787
This PR changes the Joinable interface to return an Optional set of correlated
values for a column.
This allows the JoinFilterAnalyzer to differentiate between the case where the
column has no matching values and when the column could not find matching
values.
This PR chose not to distinguish between cases where correlated values could
not be computed because of a config that has this behavior disabled or because
of user error - like a column that could not be found. The reasoning was that
the latter is likely an error and the non filter pushdown path will surface the
error if it is.
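A rough sketch of the distinction an Optional return makes possible. The method and plan names here are assumptions for illustration, not the Joinable interface itself: an absent Optional means the values could not be computed, while a present but empty set means the filter matches nothing.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class CorrelatedValuesSketch {
    // Hypothetical lookup: absent means "could not compute",
    // present-but-empty means "computed, and nothing matches".
    static Optional<Set<String>> correlatedValues(Map<String, Set<String>> index, String column) {
        return Optional.ofNullable(index.get(column));
    }

    // Hypothetical planner decision based on the three possible outcomes.
    static String plan(Optional<Set<String>> values) {
        if (!values.isPresent()) {
            return "SKIP_PUSHDOWN";
        }
        if (values.get().isEmpty()) {
            return "MATCH_NOTHING";
        }
        return "PUSH_DOWN";
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("empty", Collections.emptySet());
        index.put("some", Collections.singleton("x"));
        System.out.println(plan(correlatedValues(index, "empty")));
        System.out.println(plan(correlatedValues(index, "some")));
        System.out.println(plan(correlatedValues(index, "missing")));
    }
}
```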
* add flag to flattenSpec to keep null columns
* remove changes to inputFormat interface
* add comment
* change comment message
* update web console e2e test
* move keepNullColumns to JSONParseSpec
* fix merge conflicts
* fix tests
* set keepNullColumns to false by default
* fix lgtm
* change Boolean to boolean, add keepNullColumns to hash, add tests for keepNullColumns false + true with no null columns
* Add equals verifier tests
* Fix potential NPEs in joins
IntelliJ reported issues with potential NPEs. This was first hit in testing
with a filter being pushed down to the left hand table when joining against
an indexed table.
* More null check cleanup
* Optimize filter value rewrite for IndexedTable
* Add unit tests for LookupJoinable
* Add tests for IndexedTableJoinable
* Add non null assert for dimension selector
* Suppress null warning in LookupJoinMatcher
* remove some null checks on hot path
* optimize FileWriteOutBytes to avoid high sys cpu
* optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException
* optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException in writeOutBytes.size
* Revert "optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException in writeOutBytes.size"
This reverts commit 965f7421
* Revert "optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException"
This reverts commit 149e08c0
* optimize FileWriteOutBytes to avoid high sys cpu -- avoid IOException never thrown check
* Fix size counting to handle IOE in FileWriteOutBytes + tests
* remove unused throws IOException in WriteOutBytes.size()
* Remove redundant throws IOException clauses
* Parameterize IndexMergeBenchmark
Co-authored-by: huanghui.bigrey <huanghui.bigrey@bytedance.com>
Co-authored-by: Suneet Saldanha <suneet.saldanha@imply.io>
* fixes for inline subqueries when multi-value dimension is present
* fix test
* allow missing capabilities for vectorized group by queries to be treated as single dims since it means that column doesn't exist
* add comment
* fix issue with group by limit pushdown for extractionFn, expressions, joins, etc
* remove unused
* fix test
* revert unintended change
* more tests
* consider capabilities for StringGroupByColumnSelectorStrategy
* fix test
* fix and more test
* revert because im scared
* Fix off-by-one in IndexedTableJoinMatcher.getCardinality.
It would report a cardinality that is one lower than the actual cardinality.
The missing value is the phantom null that can be generated by outer joins.
* Fix tests.
ApproximateHistogram - seems unlikely
SegmentAnalyzer - unclear if this is an actual issue
GenericIndexedWriter - unclear if this is an actual issue
IncrementalIndexRow and OnheapIncrementalIndex are non-issues because it's very
unlikely for the number of dims to be large enough to hit the overflow
condition
* IntelliJ inspections cleanup
* Standard Charset object can be used
* Redundant Collection.addAll() call
* String literal concatenation missing whitespace
* Statement with empty body
* Redundant Collection operation
* StringBuilder can be replaced with String
* Type parameter hides visible type
* fix warnings in test code
* more test fixes
* remove string concatenation inspection error
* fix extra curly brace
* cleanup AzureTestUtils
* fix charsets for RangerAdminClient
* review comments
This change fixes a potential integer overflow in BufferArrayGrouper that
was flagged by LGTM. It also adds a check that the vectorized arrays are
initialized before aggregateVector is called.
The changes in HashTableUtils should not have any effect since the numbers
being multiplied are small, but the change will remove the warnings from
being flagged in LGTM.
* fix MAX_INTERMEDIATE_SIZE for DoubleMeanHolder
* byte[] type handling in deserialize and finalizeComputation for DoubleMeanAggregatorFactory
* DoubleMeanAggregatorFactory tests: Max Intermediate Size, Deserialize, finalizeComputation
* moved byte[] check to first position
Co-authored-by: Stanislav <S.Poryadnyi@abcconsulting.ru>
* fix nullhandling exceptions related to test ordering
Tests might get executed in different order depending on the maven
version and the test environment. This may lead to "NullHandling module
not initialized" errors for some tests where we do not initialize
null-handling explicitly.
* use InitializedNullHandlingTest
* SQL support for joins on subqueries.
Changes to SQL module:
- DruidJoinRule: Allow joins on subqueries (left/right are no longer
required to be scans or mappings).
- DruidJoinRel: Add cost estimation code for joins on subqueries.
- DruidSemiJoinRule, DruidSemiJoinRel: Removed, since DruidJoinRule can
handle this case now.
- DruidRel: Remove Nullable annotation from toDruidQuery, because
it is no longer needed (it was used by DruidSemiJoinRel).
- Update Rules constants to reflect new rules available in our current
version of Calcite. Some of these are useful for optimizing joins on
subqueries.
- Rework cost estimation to be in terms of cost per row, and place all
relevant constants in CostEstimates.
Other changes:
- RowBasedColumnSelectorFactory: Don't set hasMultipleValues. The lack
of isComplete is enough to let callers know that columns might have
multiple values, and explicitly setting it to true causes
ExpressionSelectors to think it definitely has multiple values, and
treat the inputs as arrays. This behavior interfered with some of the
new tests that involved queries on lookups.
- QueryContexts: Add maxSubqueryRows parameter, and use it in druid-sql
tests.
* Fixes for tests.
* Adjustments.
* Broker: Add ability to inline subqueries.
The main changes:
- ClientQuerySegmentWalker: Add ability to inline queries.
- Query: Add "getSubQueryId" and "withSubQueryId" methods.
- QueryMetrics: Add "subQueryId" dimension.
- ServerConfig: Add new "maxSubqueryRows" parameter, which is used by
ClientQuerySegmentWalker to limit how many rows can be inlined per
query.
- IndexedTableJoinMatcher: Allow creating keys on top of unknown types,
by assuming they are strings. This is useful because not all types are
known for fields in query results.
- InlineDataSource: Store RowSignature rather than component parts. Add
more zealous "equals" and "hashCode" methods to ease testing.
- Moved QuerySegmentWalker test code from CalciteTests and
SpecificSegmentsQueryWalker in druid-sql to QueryStackTests in
druid-server. Use this to spin up a new ClientQuerySegmentWalkerTest.
* Adjustments from CI.
* Fix integration test.
* Move RowSignature from druid-sql to druid-processing and make use of it.
1) Moved (most of) RowSignature from sql to processing. Left behind the SQL-specific
stuff in a RowSignatures utility class. It also picked up some new convenience
methods along the way.
2) There were a lot of places in the code where Map<String, ValueType> was used to
associate columns with type info. These are now all replaced with RowSignature.
3) QueryToolChest's resultArrayFields method is replaced with resultArraySignature,
and it now provides type info.
* Fix up extensions.
* Various fixes
* Link up row-based datasources to serving layer.
- Add SegmentWrangler interface that allows linking of DataSources to Segments.
- Add LocalQuerySegmentWalker that uses SegmentWranglers to compute queries on
data that is available locally.
- Modify ClientQuerySegmentWalker to use LocalQuerySegmentWalker when the base
datasource is concrete and not a table.
- Add SegmentWranglerModule to the Broker so it has them available and can
properly instantiate LocalQuerySegmentWalkers.
- Set InlineDataSource and LookupDataSource to concrete, since they can be
directly queried now.
* Fix tests.
* Ability to directly query row-based datasources.
Includes:
- Foundational classes RowBasedSegment, RowBasedStorageAdapter,
RowBasedCursor provide a queryable interface on top of a
RowBasedColumnSelectorFactory.
- Add LookupSegment: A RowBasedSegment that is built on lookup data.
- Improve capability reporting in RowBasedColumnSelectorFactory.
* Fix import.
* Remove unthrown IOException.
* Harmonization and bug-fixing for selector and filter behavior on unknown types.
- Migrate ValueMatcherColumnSelectorStrategy to newer ColumnProcessorFactory
system, and set defaultType COMPLEX so unknown types can be dynamically matched.
- Remove ValueGetters in favor of ColumnComparisonFilter doing its own thing.
- Switch various methods to use convertObjectToX when casting to numbers, rather
than ad-hoc and inconsistent logic.
- Fix bug in RowBasedExpressionColumnValueSelector: isBindingArray should return
true even for 0- or 1- element arrays.
- Adjust various javadocs.
* Add throwParseExceptions option to Rows.objectToNumber, switch back to that.
* Update tests.
* Adjust moment sketch tests.
* Add OnHeapMemorySegmentWriteOutMediumFactory
Add a factory for OnHeapMemorySegmentWriteOutMedium to support direct writing via Spark.
* Register OnHeapMemorySegmentWriteOutMediumFactory.
Register OnHeapMemorySegmentWriteOutMediumFactory with SegmentWriteOutMediumFactory.
* Remove unnecessary throws
The base `makeSegmentWriteOutMedium` throws an IOException, but the particular implementation of OnHeapMemorySegmentWriteOutMediumFactory does not throw a checked exception.
* Update SegmentWriteOutMedium docs to include onHeapMemory
Update the SegmentWriteOutMedium section of the indexing docs to include a description of the new OnHeapSegmentMediumWriteOut option.
* BufferArrayGrouper: Fix potential overflow in requiredBufferCapacity.
If cardinality was high, the computation could overflow an int. There
were tests for this, but the tests were wrong.
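The overflow class described above can be sketched like this. The formula is a hypothetical simplification (cardinality times record size), not the real requiredBufferCapacity computation. It shows how an int product wraps silently and two ways to avoid that.

```java
class CapacitySketch {
    // Hypothetical simplification: int * int wraps silently on overflow.
    static int capacityAsInt(int cardinality, int recordSize) {
        return cardinality * recordSize;
    }

    // Widening one operand first makes the multiplication happen in long.
    static long capacityAsLong(int cardinality, int recordSize) {
        return (long) cardinality * recordSize;
    }

    // Math.multiplyExact throws ArithmeticException instead of wrapping.
    static int capacityChecked(int cardinality, int recordSize) {
        return Math.multiplyExact(cardinality, recordSize);
    }

    public static void main(String[] args) {
        int cardinality = 1 << 30;
        System.out.println(capacityAsInt(cardinality, 4));
        System.out.println(capacityAsLong(cardinality, 4));
    }
}
```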
* Nicer.
* Add SQL GROUPING SETS support.
Built on top of the subtotalsSpec feature in the groupBy query. This also involves
two changes to subtotalsSpec:
- Alter behavior so limitSpec is applied after subtotalsSpec, rather than applied to
each grouping set. This is more in line with SQL standard behavior. I think it is okay
to make this change, since the old behavior was not documented, so users should
hopefully not be depending on it.
- Fix a bug where virtual columns were included in the subtotal queries, but they
should not have been.
Also fixes two bugs in query equality checking:
- BaseQuery: Use getDuration() instead of "duration" in equals and hashCode, since the
latter is lazily initialized and might be null in one query but not the other.
- GroupByQuery: Include subtotalsSpec in equals and hashCode.
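A small sketch of why comparing a lazily initialized field directly is fragile. The `Interval` class and its method names are hypothetical and much simpler than BaseQuery; they only show that the raw field can be null in one object and set in another even though both are logically equal.

```java
import java.util.Objects;

class LazyFieldSketch {
    static final class Interval {
        private final long startMillis;
        private final long endMillis;
        private Long duration; // null until first requested

        Interval(long startMillis, long endMillis) {
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }

        long getDuration() {
            if (duration == null) {
                duration = endMillis - startMillis;
            }
            return duration;
        }

        // Fragile: depends on whether getDuration() happened to be called already.
        boolean equalsUsingField(Interval other) {
            return Objects.equals(duration, other.duration);
        }

        // Robust: the getter always yields the computed value.
        boolean equalsUsingGetter(Interval other) {
            return getDuration() == other.getDuration();
        }
    }

    public static void main(String[] args) {
        Interval a = new Interval(0, 10);
        Interval b = new Interval(0, 10);
        a.getDuration(); // initializes the cache in a only
        System.out.println(a.equalsUsingField(b));
        System.out.println(a.equalsUsingGetter(b));
    }
}
```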
* Fix bugs.
* Fix tests.
* PR updates.
* Grouping class hygiene.
* add Expr.stringify which produces parseable expression strings, parser support for null values in arrays, and parser support for empty numeric arrays
* oops, macros are expressions too
* style
* spotbugs
* qualified type arrays
* review stuffs
* simplify grammar
* more permissive array parsing
* reuse expr joiner
* fix it
* Run IntelliJ inspections on Travis
Running IntelliJ inspections currently takes about 90 minutes, but they
can be run in about 30 minutes on Travis.
* Restore assert statements
* Fix timestamp extract fn to match postgres
Update the timestamp extract function so that it matches the PostgreSQL docs.
Examples from the PostgreSQL docs were added as tests for DECADE, CENTURY
and MILLENNIUM extraction.
There were bugs in CENTURY and MILLENNIUM that were spotted because of IntelliJ
inspections - 'Integer division in floating point context'
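A sketch of the kind of bug that inspection catches. The exact expression used in Druid is not shown in this log, so the code below is an assumed reconstruction for positive years only. It follows the PostgreSQL convention that the first century starts in year 1, so 2000 is in century 20 and 2001 is in century 21.

```java
class CenturySketch {
    // Buggy: year / 100 is integer division, so the fraction is gone
    // before Math.ceil ever sees it.
    static long centuryBuggy(int year) {
        return (long) Math.ceil(year / 100);
    }

    // Fixed: dividing by 100.0 keeps the fraction for Math.ceil.
    static long centuryFixed(int year) {
        return (long) Math.ceil(year / 100.0);
    }

    static long millenniumBuggy(int year) {
        return (long) Math.ceil(year / 1000);
    }

    static long millenniumFixed(int year) {
        return (long) Math.ceil(year / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(centuryBuggy(2001) + " vs " + centuryFixed(2001));
        System.out.println(millenniumBuggy(2001) + " vs " + millenniumFixed(2001));
    }
}
```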
* Update CalciteQueryTest
* remove useless round
* mark integer division as an error
When the time extraction Top N algorithm is looking for aggregators, it makes
2 calls to hashCode on the key. Use Map#computeIfAbsent instead so that the
hashCode is calculated only once
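The difference can be sketched with a key that counts its own hashCode calls. The class and method names are hypothetical and unrelated to the actual Top N code. On a miss, get followed by put hashes the key twice, while computeIfAbsent hashes it once.

```java
import java.util.HashMap;
import java.util.Map;

class ComputeIfAbsentSketch {
    // Counts hashCode invocations across all keys (illustration only).
    static int hashCalls = 0;

    static final class CountingKey {
        final String value;

        CountingKey(String value) {
            this.value = value;
        }

        @Override
        public int hashCode() {
            hashCalls++;
            return value.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CountingKey && value.equals(((CountingKey) o).value);
        }
    }

    // Old style: one hash for get, a second one for put when the key is missing.
    static long[] getThenPut(Map<CountingKey, long[]> map, CountingKey key) {
        long[] aggregators = map.get(key);
        if (aggregators == null) {
            aggregators = new long[1];
            map.put(key, aggregators);
        }
        return aggregators;
    }

    // New style: a single lookup that inserts on a miss.
    static long[] viaComputeIfAbsent(Map<CountingKey, long[]> map, CountingKey key) {
        return map.computeIfAbsent(key, k -> new long[1]);
    }

    public static void main(String[] args) {
        getThenPut(new HashMap<>(), new CountingKey("k"));
        System.out.println("get+put hash calls: " + hashCalls);
        hashCalls = 0;
        viaComputeIfAbsent(new HashMap<>(), new CountingKey("k"));
        System.out.println("computeIfAbsent hash calls: " + hashCalls);
    }
}
```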
* Codestyle - use java style array declaration
Replaced C-style array declarations with Java-style declarations and marked
the IntelliJ inspection as an error
* cleanup test code
* Forbid easily misused HashSet and HashMap constructors
* Add two LinkedHashMap constructors to forbidden-apis and create utility method as replacement for them
* Fix visibility of constant in CollectionUtils.java
* Make an exception for an instance of LinkedHashMap#<init>(int) because proper sizing is used
* revert changes to sql module tests that should be in separate PR
* Finish reverting changes to sql module tests that were flagged in checkstyle during CI
* Add netty dependency resulting from SuppressForbidden
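Some background on why the HashMap and LinkedHashMap constructors forbidden above are easy to misuse: the int argument is an initial capacity, not an expected element count, and with the default load factor of 0.75 the map resizes once its size exceeds 75% of the capacity. The helper below is a hypothetical sketch of a sizing utility; its name and formula are assumptions, not the method added to CollectionUtils.

```java
import java.util.HashMap;
import java.util.Map;

class MapSizingSketch {
    // HashMap's documented default load factor.
    static final double LOAD_FACTOR = 0.75;

    // Capacity large enough that expectedSize entries fit without a resize.
    static int capacityFor(int expectedSize) {
        if (expectedSize < 0) {
            throw new IllegalArgumentException("expectedSize must be >= 0");
        }
        return (int) Math.ceil(expectedSize / LOAD_FACTOR);
    }

    // Hypothetical replacement for calling new HashMap<>(expectedSize) directly.
    static <K, V> Map<K, V> newHashMapWithExpectedSize(int expectedSize) {
        return new HashMap<>(capacityFor(expectedSize));
    }

    public static void main(String[] args) {
        Map<String, Integer> map = newHashMapWithExpectedSize(16);
        map.put("a", 1);
        System.out.println(capacityFor(16) + " " + map.size());
    }
}
```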
* Add HashVectorGrouper based on MemoryOpenHashTable.
Additional supporting changes:
1) Modifies VectorGrouper interface to use Memory instead of ByteBuffers.
2) Modifies BufferArrayGrouper to match the new VectorGrouper interface.
3) Removes "implements VectorGrouper" from BufferHashGrouper.
* Fix comment.
* Fix another comment.
* Remove unused stuff.
* Include hoisted bounds checks.
* Checks against too-large keySpaces.
* Add MemoryOpenHashTable, a table similar to ByteBufferHashTable.
With some key differences to improve speed and design simplicity:
1) Uses Memory rather than ByteBuffer for its backing storage.
2) Uses faster hashing and comparison routines (see HashTableUtils).
3) Capacity is always a power of two, allowing simpler design and more
efficient implementation of findBucket.
4) Does not implement growability; instead, leaves that to its callers.
The idea is this removes the need for subclasses, while still giving
callers flexibility in how to handle table-full scenarios.
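To illustrate point 3 in general terms: when the capacity is a power of two, the bucket for a hash can be picked with a bit mask instead of a modulo. This is a generic sketch with hypothetical names, not the real findBucket in MemoryOpenHashTable.

```java
class BucketSketch {
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // For a power-of-two capacity, hash & (capacity - 1) equals the
    // non-negative remainder of hash / capacity, even for negative hashes.
    static int bucketFor(int hash, int capacity) {
        if (!isPowerOfTwo(capacity)) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(37, 16));
        System.out.println(bucketFor(-1, 16));
    }
}
```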
* Fix LGTM warnings.
* Adjust dependencies.
* Remove easymock from druid-benchmarks.
* Adjustments from review.
* Fix datasketches unit tests.
* Fix checkstyle.
* Speed up joins on indexed tables with string keys
When joining on index tables with string keys, caching the computation
of row id to row numbers improves performance on the
JoinAndLookupBenchmark.joinIndexTableStringKey* benchmarks by about 10%
if the column cache is enabled and by about 100% if the column cache is
disabled.
* Faster cache impl and handle unknown cardinality
* Remove unused dependency
* Hoist cardinality check outside of hot loop
* Fix dummy DimensionSelector for tests
* Guicify druid sql module
Break up the SQLModule into smaller modules and provide a binding that
modules can use to register schemas with druid sql.
* fix some tests
* address code review
* tests compile
* Working tests
* Add all the tests
* fix up licenses and dependencies
* add calcite dependency to druid-benchmarks
* tests pass
* rename the schemas
* SQL join support for lookups.
1) Add LookupSchema to SQL, so lookups show up in the catalog.
2) Add join-related rels and rules to SQL, allowing joins to be planned into
native Druid queries.
* Add two missing LookupSchema calls in tests.
* Fix tests.
* Fix typo.
* Add LookupJoinableFactory.
Enables joins where the right-hand side is a lookup. Includes an
integration test.
Also, includes changes to LookupExtractorFactoryContainerProvider:
1) Add "getAllLookupNames", which will be needed to eventually connect
lookups to Druid's SQL catalog.
2) Convert "get" from nullable to Optional return.
3) Swap out most usages of LookupReferencesManager in favor of the
simpler LookupExtractorFactoryContainerProvider interface.
* Fixes for tests.
* Fix another test.
* Java 11 message fix.
* Fixups.
* Fixup benchmark class.
* Add getRightColumns to JoinConditionAnalysis
This change allows other implementations of JoinableFactory to ask the analysis
for the right key columns instead of having to calculate it themselves.
* Address some review comments
* more code review stuff
Add microbenchmark for joins. Enabling the column cache improves
performance by ~70% for the benchmarks for joins with string keys.
Adjusting LookupJoinMatcher.matchCondition() to have fewer branches
improves performance by ~10% for the benchmarks for joins with lookups.
* intelliJ inspections cleanup
- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static
Most of these changes are aesthetic; however, they will allow inspections to
be enabled as part of CI checks going forward.
The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
indexing-hadoop/.../Utils.java
processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
processing/src/.../ScanQueryLimitRowIteratorTest.java
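For the StringBuilder change, a generic before-and-after sketch (hypothetical method names, not the code in the files listed above). Both methods produce the same string, but the first allocates a new String on every iteration.

```java
class StringBuilderSketch {
    // Each += copies the whole accumulated string into a new String.
    static String joinWithConcat(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part + ",";
        }
        return result;
    }

    // A single StringBuilder appends in place and copies once at the end.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinWithConcat(parts));
        System.out.println(joinWithBuilder(parts));
    }
}
```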
* Add intelliJ inspection warnings as errors to druid profile
* one more static inner class
* Make JoinableFactory an extension point
This change makes it so that extensions can register a JoinableFactory that
should be used for a DataSource.
Extensions can provide the factories via DruidBinders#joinableFactoryBinder
Known DataSources, like InlineDataSource, are provided in the
JoinableFactoryModule. This module installs a FactoryWarehouse that is
used to decide which factory should be used to generate the Joinable for
the provided DataSource.
The ExtensionPoint is marked as Beta since it is not yet clear if this
needs to remain available to other extensions or if the best way to
register a factory is by using the datasource class.
* Add module test
* remove useless bindings in test
* remove ExtensionPoint annotation
* Make LifecycleLock not final to help with testing