Commit Graph

2515 Commits

Author SHA1 Message Date
Gian Merlino 6c196a5ea2
Remove StorageAdapter.getColumnTypeName. (#11893)
* Remove StorageAdapter.getColumnTypeName.

It was only used by SegmentAnalyzer, and isn't necessary anymore due to
the recent improvements to ColumnCapabilities.

Also: tidy ColumnDescriptor.read slightly by removing an instanceof
check, and moving the relevant logic into ComplexColumnPartSerde.

* Fix spellings.
2021-11-09 15:18:07 -08:00
Gian Merlino 324d4374f6
HashJoinEngine: Fix extraneous advance of left cursor. (#11890)
This could happen for right or full outer joins in certain cases. Tests
weren't catching this because existing Cursor implementations generally
ignore extraneous calls to "advance". So, to help catch this in tests,
extra state validations are also added to RowWalker, which is used by
RowBasedSegment.
2021-11-09 11:34:11 -08:00
Gian Merlino babf00f8e3
Migrate File.mkdirs to FileUtils.mkdirp. (#11879)
* Migrate File.mkdirs to FileUtils.mkdirp.

* Remove unused imports.

* Fix LookupReferencesManager.

* Simplify.

* Also migrate usages of forceMkdir.

* Fix var name.

* Fix incorrect call.

* Update test.
2021-11-09 11:10:49 -08:00
Gian Merlino 945a341acd
RowBasedCursor: Add column-value-reuse optimization. (#11884)
* RowBasedCursor: Add column-value-reuse optimization.

Most of the logic is in RowBasedColumnSelectorFactory, although in this
patch its only user is RowBasedCursor. This improves performance of
features that use RowBasedSegment, like lookup and inline datasources.
It's especially helpful for inline datasources that contain lengthy
arrays, due to the fact that the transformed array can be reused.

* Changes from code review.

* Fixes for ColumnCapabilitiesImplTest.
2021-11-09 07:18:09 -08:00
Gian Merlino a5bd0b8cc0
RowAdapter: Add a default implementation for timestampFunction. (#11885)
Enables simpler implementations for adapters that want to treat the
timestamp as "just another column".
2021-11-08 10:25:13 -08:00
Clint Wylie 7237dc837c
complex typed expressions (#11853)
* complex typed expressions

* add built-in hll collector expressions to get coverage on druid-processing, more types, more better

* rampage!!!

* more javadoc

* adjustments

* oops

* lol

* remove unused dependency

* contradiction?

* more test
2021-11-08 00:33:06 -08:00
Clint Wylie 907e4ca0c5
use correct DimensionSpec with for column value selectors created from dictionary encoded column indexers (#11873)
* use correct dimension spec for column value selectors of dictionary encoded column indexers
2021-11-05 01:51:15 -07:00
Liran Funaro 9ca8f1ec97
Remove IncrementalIndex template modifier (#11160)
Co-authored-by: Liran Funaro <liran.funaro@verizonmedia.com>
2021-10-27 13:10:37 -07:00
Gian Merlino fc95c92806
Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124)
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.

This patch does the following:

- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
  required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
  had it. (Not all of them did, and that's OK, because it wasn't necessary
  anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.

Rationale for removing the offheap incremental index:

- It is only used in one rare scenario: groupBy v1 (which is non-default)
  in "useOffheap" mode (also non-default). So you have to go pretty deep
  into the wilderness to get this code to activate in production. It is
  never used during ingestion.
- Its existence complicates developer efforts to reason about how
  aggregators get used, because the way it uses buffer aggregators is so
  different from how every other query engine uses them.
- It doesn't have meaningful testing.

By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.

* Remove things that are now unused.

* Revert removal of getFloat, getLong, getDouble from BufferAggregator.

* OAK-related warnings, suppressions.

* Unused item suppressions.
2021-10-26 08:05:56 -07:00
Gian Merlino 98ecbb21cd
Remove CloseQuietly and migrate its usages to other methods. (#10247)
* Remove CloseQuietly and migrate its usages to other methods.

These other methods include:

1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions
   in RuntimeExceptions for callers that just want to avoid dealing with
   checked exceptions. Most usages were migrated to this method, because it
   looks like they were mainly attempts to avoid declaring a throws clause,
   and perhaps were unintentionally suppressing IOExceptions.
2) New method CloseableUtils.closeInCatch, designed to properly close something
   in a catch block without losing exceptions. Some usages from catch blocks
   were migrated here, when it seemed that they were intended to avoid checked
   exception handling, and did not really intend to also suppress IOExceptions.
3) New method CloseableUtils.closeAndSuppressExceptions, which sends all
   exceptions to a "chomper" that consumes them. Nothing is thrown or returned.
   The behavior is slightly different: with this method, _all_ exceptions are
   suppressed, not just IOExceptions. Calls that seemed like they had good
   reason to suppress exceptions were migrated here.
4) Some calls were migrated to try-with-resources, in cases where it appeared
   that CloseQuietly was being used to avoid throwing an exception in a finally
   block.

🎵 You don't have to go home, but you can't stay here... 🎵

* Remove unused import.

* Fix up various issues.

* Adjustments to tests.

* Fix null handling.

* Additional test.

* Adjustments from review.

* Fixup style stuff.

* Fix NPE caused by holder starting out null.

* Fix spelling.

* Chomp Throwables too.
2021-10-23 17:03:21 -07:00
Clint Wylie 02b2057371
extract generic dictionary encoded column indexing and merging stuffs (#11829)
* extract generic dictionary encoded column indexing and merging stuffs to pave the path towards supporting other types of dictionary encoded columns

* spotbugs and inspections fixes

* friendlier

* javadoc

* better name

* adjust
2021-10-22 17:31:22 -07:00
Clint Wylie 741b4ed516
add output type information to ExpressionPostAggregator (#11818)
* add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types

* add test for test for coverage

* simplify

* Remove unused imports.

Co-authored-by: Gian Merlino <gian@imply.io>
2021-10-22 13:52:51 -07:00
Alexander Saydakov 8cf1cbc4a9
latest datasketches-java and datasketches-memory (#11773)
* latest datasketches-java and datasketches-memory

* updated versions of datasketches-java and datasketches-memory

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2021-10-19 23:42:30 -07:00
Clint Wylie 187df58e30
better types (#11713)
* better type system

* needle in a haystack

* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support

* fixup merge

* more test

* fixup

* intern

* fix

* oops

* oops again

* ...

* more test coverage

* fix error message

* adjust interning, more javadocs

* oops

* more docs more better
2021-10-19 01:47:25 -07:00
Jonathan Wei 22b41ddbbf
Task reports for parallel task: single phase and sequential mode (#11688)
* Task reports for parallel task: single phase and sequential mode

* Address comments

* Add null check for currentSubTaskHolder
2021-09-16 13:58:11 -05:00
Clint Wylie 5e092ccb9b
add MV_FILTER_ONLY, MV_FILTER_NONE, ListFilteredVirtualColumn (#11650)
* add MV_FILTER_ONLY SQL function, and list filter virtual column

* MV_FILTER_NONE and more tests

* formatting

* o yeah, forgot can do easy thing

* style

* hmm why was that there

* test filtering on virtual column

* style

* meh

* do it right

* good bot
2021-09-16 09:31:53 -07:00
Clint Wylie bbb86c8731
more tests for LimitedBufferHashGrouper (#11654)
* more tests for LimitedBufferHashGrouper

* fix style
2021-09-08 16:31:34 -07:00
Clint Wylie fe1d8c206a
bump version to 0.23.0-SNAPSHOT (#11670) 2021-09-08 15:56:04 -07:00
Clint Wylie 59d257816b
fix goldilocks bug with HashVectorGrouper improperly initializing memory (#11649)
* fix goldilocks bug with HashVectorGrouper improperly initializing memory that causes failure when there exists room to only grow one time

* fix unintended change

* cleanup
2021-09-02 02:25:26 -07:00
Jian Wang 3ff1c2b8ce
Fix bug which produces vastly inaccurate query results when forceLimitPushDown is enabled and order by clause has non grouping fields (#11097) 2021-09-01 21:19:38 -07:00
Jihoon Son 2a658acad4
Put sleep in an extension (#11632)
* Put sleep in an extension

* dependency
2021-08-25 01:27:45 -07:00
Kashif Faraz aaf0aaad8f
Enable routing of SQL queries at Router (#11566)
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.

This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.

To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
2021-08-13 18:44:39 +05:30
Clint Wylie 9af7ba9d2a
STRING_AGG SQL aggregator function (#11241)
* add string_agg

* oops

* style and fix test

* spelling

* fixup

* review stuffs
2021-08-10 13:47:09 -07:00
Maytas Monsereenusorn 3257913737
Improve query error logging (#11519)
* Improve query error logging

* add docs

* address comments

* address comments
2021-08-05 22:51:09 +07:00
Jihoon Son 8ba7f6a48c
Fix incorrect result of exact topN on an inner join with limit (#11517) 2021-07-31 15:55:49 -07:00
Xavier Léauté 4bca7f014e
update error-prone to 2.8.0 with fix for crashing check (#11494)
* error-prone 2.8.0 fixes https://github.com/google/error-prone/issues/2396
* fix for a few ignored return values
* fix unknown args in sub-modules
2021-07-29 09:13:46 -07:00
Kashif Faraz 8a4e27f51d
Select broker based on query context parameter `brokerService` (#11495)
This change allows the selection of a specific broker service (or broker tier) by the Router.

The newly added ManualTieredBrokerSelectorStrategy works as follows:

Check for the parameter brokerService in the query context. If this is a valid broker service, use it.
Check if the field defaultManualBrokerService has been set in the strategy. If this is a valid broker service, use it.
Move on to the next strategy
2021-07-27 20:56:05 +05:30
Lucas Capistrant 9767b42e85
Add a new metric query/segments/count that is not emitted by default (#11394)
* Add a new metric query/segments/count that is not emitted by default

* docs

* test the default implementation of the metric

* fix spelling error in docs

* document the fact that query retries will result in additional metric emissions

* update using recommended text from @jihoonson
2021-07-22 17:57:35 -07:00
Abhishek Agarwal ce1faa5635
Make SegmentLoader extensible and customizable (#11398)
This PR refactors the code related to segment loading specifically SegmentLoader and SegmentLoaderLocalCacheManager. SegmentLoader is marked UnstableAPI which means, it can be extended outside core druid in custom extensions. Here is a summary of changes

SegmentLoader returns an instance of ReferenceCountingSegment instead of Segment. Earlier, SegmentManager was wrapping Segment objects inside ReferenceCountingSegment. That is now moved to SegmentLoader. With this, a custom implementation can track the references of segments. It also allows them to create custom ReferenceCountingSegment implementations. For this reason, the constructor visibility in ReferenceCountingSegment is changed from private to protected.
SegmentCacheManager has two additional methods called - reserve(DataSegment) and release(DataSegment). These methods let the caller reserve or release space without calling SegmentLoader#getSegment. We already had similar methods in StorageLocation and now they are available in SegmentCacheManager too which wraps multiple locations.
Refactoring to simplify the code in SegmentCacheManager wherever possible. There is no change in the functionality.
2021-07-22 18:00:49 +05:30
kaijianding e39ff44481
improve groupBy query granularity translation with 2x query performance improve when issued from sql layer (#11379)
* improve groupBy query granularity translation when issued from sql layer

* fix style

* use virtual column to determine timestampResult granularity

* dont' apply postaggregators on compute nodes

* relocate constants

* fix order by correctness issue

* fix ut

* use more easier understanding code in DefaultLimitSpec

* address comment

* rollback use virtual column to determine timestampResult granularity

* fix style

* fix style

* address the comment

* add more detail document to explain the tradeoff

* address the comment

* address the comment
2021-07-11 10:22:47 -07:00
Clint Wylie 17efa6f556
add single input string expression dimension vector selector and better expression planning (#11213)
* add single input string expression dimension vector selector and better expression planning

* better

* fixes

* oops

* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing

* oops

* javadocs, renaming

* more javadocs

* benchmarks

* use string expression vector processor with vector size 1 instead of expr.eval

* better logging

* javadocs, surprising number of the the

* more

* simplify
2021-07-06 11:20:49 -07:00
Abhishek Agarwal 03a6a6d6e1
Replace Processing ExecutorService with QueryProcessingPool (#11382)
This PR refactors the code for QueryRunnerFactory#mergeRunners to accept a new interface called QueryProcessingPool instead of ExecutorService for concurrent execution of query runners. This interface will let custom extensions inject their own implementation for deciding which query-runner to prioritize first. The default implementation is the same as today that takes the priority of query into account. QueryProcessingPool can also be used as a regular executor service. It has a dedicated method for accepting query execution work so implementations can differentiate between regular async tasks and query execution tasks. This dedicated method also passes the QueryRunner object as part of the task information. This hook will let custom extensions carry any state from QuerySegmentWalker to QueryProcessingPool#mergeRunners which is not possible currently.
2021-07-01 16:03:08 +05:30
frank chen 906a704c55
Eliminate ambiguities of KB/MB/GB in the doc (#11333)
* GB ---> GiB

* suppress spelling check

* MB --> MiB, KB --> KiB

* Use IEC binary prefix

* Add reference link

* Fix doc style
2021-06-30 13:42:45 -07:00
Clint Wylie df9b57aa1a
bitwise aggregators, better null handling options for expression agg (#11280)
* bitwise aggregators, better nulls for expression agg

* correct behavior

* rework deserialize, better names

* fix json, share mask
2021-06-25 16:51:16 -07:00
Xavier Léauté 712f2a5d00
upgrade error-prone to 2.7.1 and support checks with Java 11+ (#11363)
* upgrade error-prone to 2.7.1 and support checks with Java 11+

- upgrade error-prone to 2.7.1
- support running error-prone with Java 11 and above using -Xplugin
  instead of custom compiler
- add compiler arguments to ignore warnings/errors in Java 15/16
- introduce strictCompile property to enable strict profiles since we
  now need multiple strict profiles for Java 8
- properly exclude all generated source files from error-prone
- fix druid-processing overriding annotation processors from parent pom
- fix druid-core disabling most non-default checks
- align plugin and annotation errorprone versions
- fix / suppress additional issues found by error-prone:
  * fix bug in SeekableStreamSupervisor initializing ArrayList size with
    the taskGroupdId
  * fix missing @Override annotations
- remove outdated compiler plugin in benchmarks
- remove deleted ParameterPackage error-prone rule
- re-enable checks on benchmark module as well

* fix IntelliJ inspections

* disable LongFloatConversion due to bug in error-prone with JDK 8

* add comment about InsecureCrypto
2021-06-16 12:55:34 -07:00
Clint Wylie bfbd7ec432
fix a bugs related to SQL type inference return type nullability (#11327)
* fix a bunch of type inference nullability bugs

* fixes

* style

* fix test

* fix concat
2021-06-15 12:26:59 -07:00
Clint Wylie 920aa414ca
enrich expression cache key information to support expressions which depend on external state (#11358)
* enrich expression cache key information to support expressions which depend on external state such as lookups

* cache rules everything around me

* low carb

* rename
2021-06-14 17:26:43 -07:00
Clint Wylie 50327b8f63
ignore bySegment query context for SQL queries (#11352)
* ignore bySegment query context for SQL queries

* revert unintended change
2021-06-11 13:49:03 -07:00
Clint Wylie 6b272c857f
adjust topn heap algorithm to only use known cardinality path when dictionary is unique (#11186)
* adjust topn heap algorithm to only use known cardinality path when dictionary is unique

* better check and add comment

* adjust comment more
2021-06-10 18:32:22 -05:00
Jihoon Son 51f983101e
Fix wrong encoding in PredicateFilteredDimensionSelector.getRow (#11339) 2021-06-10 09:14:34 -07:00
dependabot[bot] 167044f715
Bump fastutil from 8.2.3 to 8.5.4 (#11347)
* Bump fastutil from 8.2.3 to 8.5.4

Bumps [fastutil](https://github.com/vigna/fastutil) from 8.2.3 to 8.5.4.
- [Release notes](https://github.com/vigna/fastutil/releases)
- [Changelog](https://github.com/vigna/fastutil/blob/master/CHANGES)
- [Commits](https://github.com/vigna/fastutil/compare/8.2.3...8.5.4)

---
updated-dependencies:
- dependency-name: it.unimi.dsi:fastutil
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* update licenses.yaml
* update maven dependency list for -core and -extra libraries to pass maven dependency checks

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xavier Léauté <xvrl@apache.org>
2021-06-10 07:43:18 -07:00
Maria Sitkovets 259207753d
Fix is null selector returning incorrect value for Long data type (#11170)
* Fix is null selector returning incorrect value for Long data type

* Fix style errors

* Refactor getObject method to also cache null column values

* Make lastInput variable nullable

* Refactor unit test

* Use new boolean lastInputIsNull instead of Long for lastInput to avoid boxing

* Refactor to remove Long for input variable

* Make a separate null caching variable

* Cleaner null caching implementation
2021-05-19 20:47:02 -07:00
Clint Wylie 6d08a7051e
fix bug with aggregator expressions on realtime index with string columns always producing 0 values (#11185)
* fix bug with aggregator expressions on realtime index with string columns always producing 0 values

* more test

* rework some stuff

* javadocs
2021-05-17 11:59:13 -07:00
Clint Wylie 790262e5d0
add estimated byte size limit enforcement for heap based expression aggregator (#11236) 2021-05-12 01:21:50 -07:00
Clint Wylie f6662b4893
fix count and average SQL aggregators on constant virtual columns (#11208)
* fix count and average SQL aggregators on constant virtual columns

* style

* even better, why are we tracking virtual columns in aggregations at all if we have a virtual column registry

* oops missed a few

* remove unused

* this will fix it
2021-05-10 13:41:48 -07:00
Clint Wylie 691d7a1d54
SQL timeseries no longer skip empty buckets with all granularity (#11188)
* SQL timeseries no longer skip empty buckets with all granularity

* add comment, fix tests

* the ol switcheroo

* revert unintended change

* docs and more tests

* style

* make checkstyle happy

* docs fixes and more tests

* add docs, tests for array_agg

* fixes

* oops

* doc stuffs

* fix compile, match doc style
2021-05-10 10:13:37 -07:00
Gian Merlino a1f850d707
Fix vectorized cardinality bug on certain string columns. (#11199)
* Fix vectorized cardinality bug on certain string columns.

Fixes a bug introduced in #11182, related to the fact that in some cases,
ColumnProcessors.makeVectorProcessor will call "makeObjectProcessor"
instead of "makeSingleValueDimensionProcessor" or
"makeMultiValueDimensionProcessor". CardinalityVectorProcessorFactory
improperly ignored calls to "makeObjectProcessor".

In addition to fixing the bug, I added this detail to the javadocs for
VectorColumnProcessorFactory, to prevent others from running into the
same thing in the future. They do not currently call out this case.

* Improve test coverage.

* Additional fixes.
2021-05-07 08:37:10 -07:00
Clint Wylie 554f1ffeee
ARRAY_AGG sql aggregator function (#11157)
* ARRAY_AGG sql aggregator function

* add javadoc

* spelling

* review stuff, return null instead of empty when nil input

* review stuff

* Update sql.md

* use type inference for finalize, refactor some things
2021-05-03 22:17:10 -07:00
Gian Merlino bef7cc911f
Vectorize the cardinality aggregator. (#11182)
* Vectorize the cardinality aggregator.

Does not include a byRow implementation, so if byRow is true then
the aggregator still goes through the non-vectorized path.

Testing strategy:

- New tests that exercise both styles of "aggregate" for supported types.
- Some existing tests have also become active (note the deleted
  "cannotVectorize" lines).

* Adjust whitespace.
2021-05-03 20:27:02 -07:00
Gian Merlino 809e001939
Vectorize the DataSketches quantiles aggregator. (#11183)
* Vectorize the DataSketches quantiles aggregator.

Also removes synchronization for the BufferAggregator and VectorAggregator
implementations, since it is not necessary (similar to #11115).

Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest
to run all test cases in vectorized mode.

* Style fix.
2021-05-02 16:14:21 -07:00
Gian Merlino 046069f35a
Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector. (#11172)
* Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector.

The idea is that certain operations (like count distinct on strings) will
be faster if they are able to run directly on UTF-8 bytes instead of on
Java Strings decoded by "lookupName".

* Add license header.

* Updates suggested by robots.
2021-04-30 10:56:11 -07:00
Gian Merlino 6d82c3cbf1
StringComparators: No need to convert to UTF-8 for lexicographic comparison. (#11171)
Lexicographic ordering of UTF-8 byte sequences and in-memory UTF-16
strings are equivalent. So, we can skip the (expensive) conversion and
get an equivalent result. Thank you, Unicode!
2021-04-30 10:54:20 -07:00
Gian Merlino 7d808e357c
InDimFilter: Fix cache key computation to avoid collisions. (#11168)
The prior code did not include separation between values, and encoded
null ambiguously. This patch fixes both of those issues by encoding
strings as length + value instead of just value.

I think cache key computation was OK prior to #9800. Prior to that
patch, the cache key was computed using CacheKeyBuilder.appendStrings,
which encodes strings as UTF-8 and inserts a separator byte (0xff)
between them that cannot appear in a UTF-8 stream.
2021-04-28 17:28:29 -07:00
Gian Merlino ad028de538
InDimFilter: Fix NPE involving certain Set types. (#11169)
* InDimFilter: Fix NPE involving certain Set types.

Normally, InDimFilters that come from JSON have HashSets for "values".
However, programmatically-generated filters (like the ones from #11068)
may use other set types. Some set types, like TreeSets with natural
ordering, will throw NPE on "contains(null)", which causes the
InDimFilter's ValueMatcher to throw NPE if it encounters a null value.

This patch adds code to detect if the values set can support
contains(null), and if not, wrap that in a null-checking lambda.

Also included:

- Remove unneeded NullHandling.needsEmptyToNull method.
- Update IndexedTableJoinable to generate a TreeSet that does not
  require lambda-wrapping. (This particular TreeSet is how I noticed
  the bug in the first place.)

* Test fixes.

* Improve test coverage
2021-04-28 14:13:42 -07:00
Harini Rajendran 8a3be6bccc
Fix TimeSeriesUnionQueryRunnerTest by extending InitializedNullHandlingTest (#11154) 2021-04-23 08:56:03 -07:00
Clint Wylie 57ff1f9cdb
expression aggregator (#11104)
* add experimental expression aggregator

* add test

* fix lgtm

* fix test

* adjust test

* use not null constant

* array_set_concat docs

* add equals and hashcode and tostring

* fix it

* spelling

* do multi-value magic for expression agg, more javadocs, tests

* formatting

* fix inspection

* more better

* nullable
2021-04-22 18:30:16 -07:00
Gian Merlino 202c78c8f3
Enable rewriting certain inner joins as filters. (#11068)
* Enable rewriting certain inner joins as filters.

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.

* Test improvements.

* Test fixes.

* Avoid slow size() call.

* Remove invalid test.

* Fix style.

* Fix mistaken default.

* Small fixes.

* Fix logic error.
2021-04-14 10:49:27 -07:00
Maytas Monsereenusorn f968400170
Introduce a new configuration that skip storing audit payload if payload size exceed limit and skip storing null fields for audit payload (#11078)
* Add config to skip storing audit payload if exceed limit

* fix checkstyle

* change config name

* skip null fields for audit payload

* fix checkstyle

* address comments

* fix guice

* fix test

* add tests

* address comments

* address comments

* address comments

* fix checkstyle

* address comments

* fix test

* fix test

* address comments

* Address comments

Co-authored-by: Jihoon Son <jihoonson@apache.org>

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-04-13 20:18:28 -07:00
Clint Wylie 08d3786738
improve bitmap vector offset to report contiguous groups (#11039)
* improve bitmap vector offset to report contiguous groups

* benchmark style

* check for contiguous in getOffsets, tests for exceptions
2021-04-13 11:47:01 -07:00
Gian Merlino c158207ab6
Rename BitmapOperationTest base class to avoid flaky test. (#11102)
PR #10936 renamed BitmapBenchmark, the parent of a couple of bitmap tests, to
BitmapOperationTest. This patch renames it to BitmapOperationTestBase so JUnit
doesn't pick it up as a test case. When JUnit picks it up, it becomes a flaky
test, since its behavior and correctness depends on whether it runs before
or after its subclasses.
2021-04-13 08:01:15 -07:00
Gian Merlino c8e394015d
LongsLongEncodingReader: Implement "duplicate", fixing concurrency bug. (#11098)
Regression introduced in #11004 due to overzealous optimization. Even though
we replaced stateful usage of ByteBuffer with stateless usage of Memory, we
still need to create a new object on "duplicate" due to semantics of setBuffer.
2021-04-13 08:01:01 -07:00
Jihoon Son 25db8787b3
Fix CAST being ignored when aggregating on strings after cast (#11083)
* Fix CAST being ignored when aggregating on strings after cast

* fix checkstyle and dependency

* unused import
2021-04-12 22:21:24 -07:00
BIGrey d33fdd093b
Nested GroupBy query got wrong/empty result when using virtual column and filter (#11081)
* fix nested groupby got empty result when using virtual column

* move to query.getVirtualColumns().wrap instead of new VirtualizedColumnSelectorFactory

* move test to GroupByQueryRunnerTest

* Update processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryRunnerTest.java

Co-authored-by: huagnhui.bigrey <huanghui.bigrey@bytedance.com>
Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-04-10 21:29:41 -07:00
Clint Wylie 338886fd5f
vector group by support for string expressions (#11010)
* vector group by support for string expressions

* fix test

* comments, javadoc
2021-04-08 19:23:39 -07:00
Abhishek Agarwal 0df0bff44b
Enable multiple distinct aggregators in same query (#11014)
* Enable multiple distinct count

* Add more tests

* fix sql test

* docs fix

* Address nits
2021-04-07 00:52:19 -07:00
Clint Wylie c0e6d1c7f8
vectorize 'auto' long decoding (#11004)
* Vectorize LongDeserializers.

Also, add many more tests.

* more faster

* more more faster

* more cleanup

* fixes

* forbidden

* benchmark style

* idk why

* adjust

* add preconditions for value >= 0 for writers

* add 64 bit exception

Co-authored-by: Gian Merlino <gian@imply.io>
2021-03-26 18:39:13 -07:00
Gian Merlino bf20f9e979
DruidInputSource: Fix issues in column projection, timestamp handling. (#10267)
* DruidInputSource: Fix issues in column projection, timestamp handling.

DruidInputSource, DruidSegmentReader changes:

1) Remove "dimensions" and "metrics". They are not necessary, because we
   can compute which columns we need to read based on what is going to
   be used by the timestamp, transform, dimensions, and metrics.
2) Start using ColumnsFilter (see below) to decide which columns we need
   to read.
3) Actually respect the "timestampSpec". Previously, it was ignored, and
   the timestamp of the returned InputRows was set to the `__time` column
   of the input datasource.

(1) and (2) together fix a bug in which the DruidInputSource would not
properly read columns that are used as inputs to a transformSpec.

(3) fixes a bug where the timestampSpec would be ignored if you attempted
to set the column to something other than `__time`.

(1) and (3) are breaking changes.

Web console changes:

1) Remove "Dimensions" and "Metrics" from the Druid input source.
2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for
   compatibility with the new behavior.

Other changes:

1) Add ColumnsFilter, a new class that allows input readers to determine
   which columns they need to read. Currently, it's only used by the
   DruidInputSource, but it could be used by other columnar input sources
   in the future.
2) Add a ColumnsFilter to InputRowSchema.
3) Remove the metric names from InputRowSchema (they were unused).
4) Add InputRowSchemas.fromDataSchema method that computes the proper
   ColumnsFilter for given timestamp, dimensions, transform, and metrics.
5) Add "getRequiredColumns" method to TransformSpec to support the above.

* Various fixups.

* Uncomment incorrectly commented lines.

* Move TransformSpecTest to the proper module.

* Add druid.indexer.task.ignoreTimestampSpecForDruidInputSource setting.

* Fix.

* Fix build.

* Checkstyle.

* Misc fixes.

* Fix test.

* Move config.

* Fix imports.

* Fixup.

* Fix ShuffleResourceTest.

* Add import.

* Smarter exclusions.

* Fixes based on tests.

Also, add TIME_COLUMN constant in the web console.

* Adjustments for tests.

* Reorder test data.

* Update docs.

* Update docs to say Druid 0.22.0 instead of 0.21.0.

* Fix test.

* Fix ITAutoCompactionTest.

* Changes from review & from merging.
2021-03-25 10:32:21 -07:00
Maytas Monsereenusorn f19c2e9ce4
If ingested data has sparse columns, the ingested data with forceGuaranteedRollup=true can result in imperfect rollup and final dimension ordering can be different from dimensionSpec ordering in the ingestionSpec (#10948)
* add IT

* add IT

* add the fix

* fix checkstyle

* fix compile

* fix compile

* fix test

* fix test

* address comments
2021-03-18 17:04:28 -07:00
Xavier Léauté 1061faa6ba
prefer string concatenation over String.format in performance sensitive code (#10997)
String.format relies on regex parsing, which makes these calls expensive
at higher request volumes.
2021-03-16 22:06:26 -07:00
Clint Wylie 4cd4a22f87
expression filter support for vectorized query engines (#10613)
* expression filter support for vectorized query engines

* remove unused codes

* more tests

* refactor, more tests

* suppress

* more

* more

* more

* oops, i was wrong

* comment

* remove decorate, object dimension selector, more javadocs

* style
2021-03-16 11:46:50 -07:00
Xavier Léauté d26e1bc70d
update code check plugins for Java 15 support (#10978)
* update maven-forbidden-api plugin to 3.1
* update maven-pmd-plugin to 3.14
* update spotbugs to 4.2.2
* fixes validation failures newly caught by those updates
  - fix SpotBugs NP_NONNULL_PARAM_VIOLATION
  - fix PMD UnnecessaryFullyQualifiedName
2021-03-11 07:31:41 -08:00
frank chen b808fd2ef9
Fix NPE in the constructor of TopNQuery (#10969)
* fix NPE

* Add unit tests to cover parameter checking
2021-03-11 00:04:49 -08:00
Clint Wylie 58294329b7
fix SQL issue for group by queries with time filter that gets optimized to false (#10968)
* fix SQL issue for group by queries with time filter that gets optimized to false

* short circuit always false in CombineAndSimplifyBounds

* adjust

* javadocs

* add preconditions for and/or filters to ensure they have children

* add comments, remove preconditions
2021-03-09 19:41:16 -08:00
Abhishek Agarwal c66951a59e
Add flag in SQL to disable left base filter optimization for joins (#10947)
* Add flag to disable left base filter

* code coverage

* Draft

* Review comments

* code coverage

* add docs

* Add old tests
2021-03-09 13:07:34 -08:00
Jihoon Son 2c30f8b3b7
Migrate bitmap benchmarks to JMH (#10936)
* Migrate bitmap benchmarks to JMH

* add concise
2021-03-04 12:50:55 -08:00
Abhishek Agarwal 1a15987432
Supporting filters in the left base table for join datasources (#10697)
* where filter left first draft

* Revert changes in calcite test

* Refactor a bit

* Fixing the Tests

* Changes

* Adding tests

* Add tests for correlated queries

* Add comment

* Fix typos
2021-03-04 10:39:21 -08:00
Gian Merlino 87a2abff79
Fix runtime error when IndexedTableJoinMatcher matches long selector to unique string index. (#10942)
* Fix runtime error when IndexedTableJoinMatcher matches long selector to unique string index.

The issue arises when matching against a long selector on the left-hand side to a string
typed Index on the right-hand side, and when that Index also returns true from areKeysUnique.
In this case, IndexedTableJoinMatcher would generate a ConditionMatcher that implements
matchSingleRow by calling findUniqueLong on the Index. This is inappropriate because the Index
is actually string typed. The fix is to check the type of the Index before deciding how to
implement the ConditionMatcher.

The patch adds "testMatchSingleRowToUniqueStringIndex" to IndexedTableJoinMatcherTest, which
explores this case.

* Update tests.
2021-03-04 00:57:59 -08:00
Gian Merlino 07902f607b
Granularity: Introduce primitive-typed bucketStart, increment methods. (#10904)
* Granularity: Introduce primitive-typed bucketStart, increment methods.

Saves creation of unnecessary DateTime objects in timestamp_floor and
timestamp_ceil expressions.

* Fix style.

* Amp up the test coverage.
2021-02-25 07:59:20 -08:00
Gian Merlino b7e9f5bc85
BoundDimFilter: Simplify the various DruidLongPredicates. (#10906)
They all use Long.compare, but they don't need to. Changing to
regular comparisons simplifies the code and also removes branches.
(Internally, Long.compare has two branches.)
2021-02-19 16:44:56 -08:00
Abhishek Agarwal 8718155f8f
Allow for empty keys in hash map (#10869)
* allow for empty keys in hash map

* fix serde test
2021-02-10 11:19:57 -08:00
Jihoon Son ac41e41232
Update doc for query errors and add unit tests for JsonParserIterator (#10833)
* Update doc for query errors and add unit tests for JsonParserIterator

* static constructor for convenience

* rename method
2021-02-05 02:55:32 -08:00
Jihoon Son 3f8f00a231
Fix CVE-2021-25646 (#10818) 2021-02-04 11:21:43 -08:00
Gian Merlino 6c0c6e60b3
Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. (#10767)
* Vectorized theta sketch aggregator.

Also a refactoring of BufferAggregator and VectorAggregator such that
they share a common interface, BaseBufferAggregator. This allows
implementing both in the same file with an abstract + dual subclass
structure.

* Rework implementation to use composition instead of inheritance.

* Rework things to enable working properly for both complex types and
regular types.

Involved finally moving makeVectorProcessor from DimensionHandlerUtils
into ColumnProcessors and harmonizing the two things.

* Add missing method.

* Style and name changes.

* Fix issues from inspections.

* Fix style issue.
2021-01-29 09:30:09 -08:00
Clint Wylie cd6af93274
add leftover tests from #10743 (#10766) 2021-01-22 09:20:48 -08:00
Gian Merlino 8b808c4879
Retain order of AND, OR filter children. (#10758)
* Retain order of AND, OR filter children.

If we retain the order, it enables short-circuiting. People can put a
more selective filter earlier in the list and lower the chance that
later filters will need to be evaluated.

Short-circuiting was working before #9608, which switched to unordered
sets to solve a different problem. This patch tries to solve that
problem a different way.

This patch moves filter simplification logic from "optimize" to
"toFilter", because that allows the code to be shared with Filters.and
and Filters.or. The simplification has become more complicated and so
it's useful to share it.

This patch also removes code from CalciteCnfHelper that is no longer
necessary because Filters.and and Filters.or are now doing the work.

* Fixes for inspections.

* Fix tests.

* Back to a Set.
2021-01-20 08:59:20 -08:00
zhangyue19921010 2590ad4f67
Historical unloads damaged segments automatically when lazy on start. (#10688)
* ready to test

* tested on dev cluster

* tested

* code review

* add UTs

* add UTs

* ut passed

* ut passed

* opti imports

* done

* done

* fix checkstyle

* modify uts

* modify logs

* changing the package of SegmentLazyLoadFailCallback.java to org.apache.druid.segment

* merge from master

* modify import orders

* merge from master

* merge from master

* modify logs

* modify docs

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-01-16 19:53:30 -08:00
Gian Merlino 2b24dc3764
SegmentAnalyzer: Properly close column after retrieving it. (#10772) 2021-01-16 19:26:34 -08:00
Jihoon Son 95065bdf1a
Bump dev version to 0.22.0-SNAPSHOT (#10759) 2021-01-15 13:16:23 -08:00
Gian Merlino a82910e065
OrFilter: Properly handle child matchers that return the original mask. (#10754)
* OrFilter: Properly handle child matchers that return the original mask.

This happens when a child matcher is literally true (for example,
BooleanVectorValueMatcher). In this case, OrFilter would throw this
exception from its call to removeAll while processing the next filter:

  java.lang.IllegalStateException: 'other' must be a different instance from 'this'

Also update the javadocs for VectorValueMatcher to call out that the
returned object may be the same as the input mask.

* Fix style.
2021-01-14 23:28:13 -08:00
Gian Merlino 7354953b1b
VectorMatch: Disallow "copyFrom", "addAll" on self; improve tests. (#10755)
No existing code relies on being able to call these methods in this way.

The new tests exhaustively test all vectors up to size 7, and also test
behavior the run-on-self behavior that has been adjusted by this patch.
2021-01-14 18:29:13 -08:00
Gian Merlino 2bbf89db81
Remove FalseVectorMatcher, TrueVectorMatcher in favor of BooleanVectorValueMatcher. (#10757) 2021-01-14 18:28:25 -08:00
Jihoon Son 149306c9db
Tidy up HTTP status codes for query errors (#10746)
* Tidy up query error codes

* fix tests

* Restore query exception type in JsonParserIterator

* address review comments; add a comment explaining the ugly switch

* fix test
2021-01-13 17:20:00 -08:00
Clint Wylie 8c3c9b4060
fix limited queries with subtotals (#10743)
* i put my thing down, flip it and reverse it

* oops
2021-01-13 12:55:24 -08:00
Clint Wylie 9362dc7968
re-use expression vector evaluation results for the same offset in expression vector selectors (#10614)
* cache expression selector results by associating vector expression bindings to underlying vector offset

* better coverage, fix floats

* style

* stupid bot

* stupid me

* more test

* intellij threw me under the bus when it generated those junit methods

* narrow interface instead of passing around offset
2021-01-13 12:44:56 -08:00
秦臻 c62b7c19c3
javascript filter result convert to java boolean (#10721)
* javascript filter result convert to java boolean

* use type convert replace script convert, and add more unit test

Co-authored-by: qinzhen <qinzhen@kuaishou.com>
2021-01-08 14:30:09 -08:00
Gian Merlino 6eef0e4c9f
Fix collision between #10689 and #10593. (#10738) 2021-01-08 09:52:27 -08:00
Aleksey Plekhanov 26bcd47e51
Thread-safety for ResponseContext.REGISTERED_KEYS (#9667) 2021-01-08 00:37:49 -08:00
Liran Funaro 08ab82f55c
IncrementalIndex Tests and Benchmarks Parametrization (#10593)
* Remove redundant IncrementalIndex.Builder

* Parametrize incremental index tests and benchmarks

- Reveal and fix a bug in OffheapIncrementalIndex

* Fix forbiddenapis error: Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

* Fix Intellij errors: declared exception is never thrown

* Add documentation and validate before closing objects on tearDown.

* Add documentation to OffheapIncrementalIndexTestSpec

* Doc corrections and minor changes.

* Add logging for generated rows.

* Refactor new tests/benchmarks.

* Improve IncrementalIndexCreator documentation

* Add required tests for DataGenerator

* Revert "rollupOpportunity" to be a string
2021-01-07 22:18:47 -08:00
Gian Merlino 48e576a307
Scan query: More accurate error message when segment per time chunk limit is exceeded. (#10630)
* Scan query: More accurate error message when segment per time chunk limit is exceeded.

* Add guardrail test.
2021-01-06 14:11:28 -08:00
Jonathan Wei 68bb038b31
Multiphase segment merge for IndexMergerV9 (#10689)
* Multiphase merge for IndexMergerV9

* JSON fix

* Cleanup temp files

* Docs

* Address logging and add IT

* Fix spelling and test unloader datasource name
2021-01-05 22:19:09 -08:00