Commit Graph

2486 Commits

Author SHA1 Message Date
Clint Wylie 84b4bf56d8
vectorize logical operators and boolean functions (#11184)
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
2021-12-02 16:40:23 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Gian Merlino f6e6ca2893
Use intermediate-persist IndexSpec during multiphase merge. (#11940)
* Use intermediate-persist IndexSpec during multiphase merge.

The main change is the addition of an intermediate-persist IndexSpec
to the main "merge" method in IndexMerger. There are also a few minor
adjustments to the IndexMerger interface to encourage more harmonious
usage of its methods in the future.

* Additional changes inspired by the test coverage checker.

- Remove unused-in-production IndexMerger methods "append" and "convert".
- Add additional unit tests to UnifiedIndexerAppenderatorsManager.

* Additional adjustments.

* Even more additional adjustments.

* Test fixes.
2021-11-29 15:08:49 -08:00
Gian Merlino 93aeaf4801
Improve on-heap aggregator footprint estimates. (#11950)
Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that
mitigates #6743 by enabling heap footprint estimates based on a specific
number of rows. The idea is that at ingestion time, the number of rows
that go into an aggregator will be 1 (if rollup is off) or will likely
be a small number (if rollup is on).

It's a heuristic, because of course nothing guarantees that the rollup
ratio is a small number. But it's a common case, and I expect this logic
to go wrong much less often than the current logic. Also, when it does
go wrong, users can fix it by lowering maxRowsInMemory or
maxBytesInMemory. The current situation is unintuitive: when the
estimation goes wrong, users get an OOME, but actually they need to
*raise* these limits to fix it.
2021-11-28 13:21:24 +05:30
Rohan Garg 2c08055962
Specify time column for first/last aggregators (#11949)
Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.
2021-11-25 09:44:14 +05:30
Gian Merlino 12e2228510
RowBasedGrouperHelper: Set hasMultipleValues = false in capabilities. (#11954)
Useful because it enables anything that consumes groupBy results to
potentially operate more efficiently.
2021-11-24 13:14:58 -08:00
Gian Merlino 5e168b861a
StorageAdapter: Add getRowSignature method. (#11953)
Simplifies logic for callers that only want to get a list of all the
column names, or column names and types. Updated callers SegmentAnalyzer,
HashJoinSegmentStorageAdapter, and DruidSegmentReader.
2021-11-24 13:14:25 -08:00
Gian Merlino 0354407655
SQL INSERT planner support. (#11959)
* SQL INSERT planner support.

The main changes are:

1) DruidPlanner is able to validate and authorize INSERT queries. They
   require WRITE permission on the target datasource.

2) QueryMaker is now an interface, and there is a QueryMakerFactory that
   creates instances of it. There is only one production implementation
   of each (NativeQueryMaker and NativeQueryMakerFactory), which
   together behave the same way as the former QueryMaker class. But this
   opens the door to executing queries in ways other than the Druid
   query stack, and is used by unit tests (CalciteInsertDmlTest) to
   test the INSERT planning functionality.

3) Adds an EXTERN table macro that allows references external data using
   InputSource and InputFormat from Druid's batch ingestion API. This is
   not exposed in production yet, but is used by unit tests.

4) Adds a QueryFeature concept that enables the planner to change its
   behavior slightly depending on the capabilities of the execution
   system.

5) Adds an "AuthorizableOperator" concept that enables SqlOperators
   to require additional permissions. This is used by the EXTERN table
   macro.

Related odds and ends:

- Add equals, hashCode, toString methods to InlineInputSource. Aids in
  the "from external" tests in CalciteInsertDmlTest.
- Add JSON-serializability to RowSignature.
- Move the SQL string inside PlannerContext so it is "baked into" the
  planner when the planner is created. Cleans up the code a bit, since
  in practice, the same query is passed in every time to the
  same planner anyway.

* Fix up calls to CalciteTests.createMockQueryLifecycleFactory.

* Fix checkstyle issues.

* Adjustments for CI.

* Adjust DruidAvaticaHandlerTest for stricter test authorizations.
2021-11-24 12:14:04 -08:00
Gian Merlino 35b610ada7
QueryableIndexColumnSelectorFactory: Double-check cached column class. (#11957)
Important because an earlier call to getCachedColumn may have been
done with a different class, leading to a ClassCastException on the
second call. In the prior code, this could happen if a complex column
had makeDimensionSelector called on it after makeColumnValueSelector had
already been called.
2021-11-22 11:31:24 -08:00
Gian Merlino d6507c9428
PrioritizedExecutorService: Properly wrap on direct calls to "execute". (#11956)
Usually, "execute" is called by methods defined in the superclass
AbstractExecutorService, and the passed-in Runnable has been wrapped
by newTaskFor inside a PrioritizedListenableFutureTask. But this method
can also be called directly, and if so, the same wrapping is necessary
for the delegate to get a Runnable that can be entered into a priority
queue with the others.
2021-11-22 10:30:12 -08:00
Clint Wylie f260bbed23
restore and deprecate AggregatorFactory methods (#11917)
* add back and deprecate aggregator factory methods so i can say i told you so when i delete these later

* rename to make less ambiguous, fix fill method

* adjust
2021-11-19 15:59:35 -08:00
Gian Merlino 36ee0367ff
Scan: Add "orderBy" parameter. (#11930)
* Scan: Add "orderBy" parameter.

This patch adds an API for requesting non-time orderings, although it
does not actually add the ability to execute such queries.

The changes are done in such a way that no matter how Scan query objects
are constructed, they will have a correct "getOrderBy". This will enable
us to switch the execution to exclusively use "getOrderBy" later on when
it's implemented.

Scan queries are serialized such that they only include "order" (time
order) if the ordering is time-based, and they only include "orderBy" if
the ordering is non-time-based. This maximizes compatibility with
the existing API while also providing a clean look for formatted queries.

Because this patch does not include execution logic, if someone actually
tries to run a query with non-time ordering, then they will get an error
like "Cannot execute query with orderBy [quality ASC]".

* SQL module fixes.

* Add spotbugs-exclude.

* Remove unused method.
2021-11-19 08:19:12 -08:00
Clint Wylie 7f0bede878
autocompaction support for complex dimensions (#11924)
* autocompaction support for complex dimensions

* more test
2021-11-16 15:57:44 -08:00
Clint Wylie 00c976a3fe
only get bitmap index for string dictionary encoded columns (#11925) 2021-11-16 15:50:02 -08:00
Kashif Faraz 223c5692a8
Add dimension partitioningType to metrics to track usage of different partitioning schemes (#11902)
Add method ShardSpec.getType() to get name of shard spec type
List all names of shard spec types in the interface ShardSpec itself
for easy reference and maintenance
Add dimension partitioningType to metric segment/added/bytes
2021-11-11 18:34:27 +05:30
Gian Merlino fe2f7742f7
Fix incorrect comparison in RowSignature. (#11905)
PR #11882 introduced a type comparison using ==, but while it was in flight,
another PR #11713 changed the type enum to a class. So the comparison should
properly be done with "equals".
2021-11-11 04:30:42 -08:00
Laksh Singla 57ed5127a7
Make subquery IDs more comprehensive (#11809)
There are 3 types of query IDs - id, subQueryId, sqlQueryId. Currently, whenever a query generates subqueries, the subquery's subQueryId is populated randomly. Also, subquery's Id is not set to the parent query Id. Therefore there is no way of linking the subqueries to the parent query, and one loses the ability to look at end to end view of the query.

This PR aims to implement following couple of things:

Populate the subqueries with it's parent's id (and sqlQueryId if present)
Populate the subqueryId such that it forms a hierarchical relationship amongs themselves. For example, if there is a query which launches a subquery, which in turn launches a couple of subqueries, then the ids and subQueryIds should have following structure.
2021-11-11 16:31:56 +05:30
Clint Wylie 5baa22148e
revert ColumnAnalysis type, add typeSignature and use it for DruidSchema (#11895)
* revert ColumnAnalysis type, add typeSignature and use it for DruidSchema

* review stuffs

* maybe null

* better maybe null

* Update docs/querying/segmentmetadataquery.md

* Update docs/querying/segmentmetadataquery.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* fix null right

* sad

* oops

* Update batch_hadoop_queries.json

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-11-10 18:46:29 -08:00
Gian Merlino 14b0b4aee2
RowBasedSegment: Use Sequence instead of Iterable. (#11886)
* RowBasedSegment: Use Sequence instead of Iterable.

The main reason this is good is that Sequences can include baggage that
must be closed after iteration is finished. This enables creating
RowBasedSegments on top of closeable sequences of rows.

To preserve the optimization that allows reversing a List without
copying it, this patch also makes SimpleSequence its own class and allows
extracting the Iterable that was used to create it.

* Fix tests.
2021-11-10 06:06:52 -08:00
Gian Merlino db4d157be6
Add Finalization option to RowSignature.addAggregators. (#11882)
* Add Finalization option to RowSignature.addAggregators.

This make type signatures more useful when the caller knows whether it will
be reading aggregation results in their finalized or intermediate types.

* Fix call site.
2021-11-10 06:05:29 -08:00
Clint Wylie a8805ab60d
add missing json type for ListFilteredVirtualColumn (#11887)
* add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again

* fixes

* ugly, but maybe this

* oops

* too many mappers
2021-11-09 17:25:12 -08:00
Gian Merlino 6c196a5ea2
Remove StorageAdapter.getColumnTypeName. (#11893)
* Remove StorageAdapter.getColumnTypeName.

It was only used by SegmentAnalyzer, and isn't necessary anymore due to
the recent improvements to ColumnCapabilities.

Also: tidy ColumnDescriptor.read slightly by removing an instanceof
check, and moving the relevant logic into ComplexColumnPartSerde.

* Fix spellings.
2021-11-09 15:18:07 -08:00
Gian Merlino 324d4374f6
HashJoinEngine: Fix extraneous advance of left cursor. (#11890)
This could happen for right or full outer joins in certain cases. Tests
weren't catching this because existing Cursor implementations generally
ignore extraneous calls to "advance". So, to help catch this in tests,
extra state validations are also added to RowWalker, which is used by
RowBasedSegment.
2021-11-09 11:34:11 -08:00
Gian Merlino babf00f8e3
Migrate File.mkdirs to FileUtils.mkdirp. (#11879)
* Migrate File.mkdirs to FileUtils.mkdirp.

* Remove unused imports.

* Fix LookupReferencesManager.

* Simplify.

* Also migrate usages of forceMkdir.

* Fix var name.

* Fix incorrect call.

* Update test.
2021-11-09 11:10:49 -08:00
Gian Merlino 945a341acd
RowBasedCursor: Add column-value-reuse optimization. (#11884)
* RowBasedCursor: Add column-value-reuse optimization.

Most of the logic is in RowBasedColumnSelectorFactory, although in this
patch its only user is RowBasedCursor. This improves performance of
features that use RowBasedSegment, like lookup and inline datasources.
It's especially helpful for inline datasources that contain lengthy
arrays, due to the fact that the transformed array can be reused.

* Changes from code review.

* Fixes for ColumnCapabilitiesImplTest.
2021-11-09 07:18:09 -08:00
Gian Merlino a5bd0b8cc0
RowAdapter: Add a default implementation for timestampFunction. (#11885)
Enables simpler implementations for adapters that want to treat the
timestamp as "just another column".
2021-11-08 10:25:13 -08:00
Clint Wylie 7237dc837c
complex typed expressions (#11853)
* complex typed expressions

* add built-in hll collector expressions to get coverage on druid-processing, more types, more better

* rampage!!!

* more javadoc

* adjustments

* oops

* lol

* remove unused dependency

* contradiction?

* more test
2021-11-08 00:33:06 -08:00
Clint Wylie 907e4ca0c5
use correct DimensionSpec with for column value selectors created from dictionary encoded column indexers (#11873)
* use correct dimension spec for column value selectors of dictionary encoded column indexers
2021-11-05 01:51:15 -07:00
Liran Funaro 9ca8f1ec97
Remove IncrementalIndex template modifier (#11160)
Co-authored-by: Liran Funaro <liran.funaro@verizonmedia.com>
2021-10-27 13:10:37 -07:00
Gian Merlino fc95c92806
Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124)
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.

This patch does the following:

- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
  required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
  had it. (Not all of them did, and that's OK, because it wasn't necessary
  anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.

Rationale for removing the offheap incremental index:

- It is only used in one rare scenario: groupBy v1 (which is non-default)
  in "useOffheap" mode (also non-default). So you have to go pretty deep
  into the wilderness to get this code to activate in production. It is
  never used during ingestion.
- Its existence complicates developer efforts to reason about how
  aggregators get used, because the way it uses buffer aggregators is so
  different from how every other query engine uses them.
- It doesn't have meaningful testing.

By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.

* Remove things that are now unused.

* Revert removal of getFloat, getLong, getDouble from BufferAggregator.

* OAK-related warnings, suppressions.

* Unused item suppressions.
2021-10-26 08:05:56 -07:00
Gian Merlino 98ecbb21cd
Remove CloseQuietly and migrate its usages to other methods. (#10247)
* Remove CloseQuietly and migrate its usages to other methods.

These other methods include:

1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions
   in RuntimeExceptions for callers that just want to avoid dealing with
   checked exceptions. Most usages were migrated to this method, because it
   looks like they were mainly attempts to avoid declaring a throws clause,
   and perhaps were unintentionally suppressing IOExceptions.
2) New method CloseableUtils.closeInCatch, designed to properly close something
   in a catch block without losing exceptions. Some usages from catch blocks
   were migrated here, when it seemed that they were intended to avoid checked
   exception handling, and did not really intend to also suppress IOExceptions.
3) New method CloseableUtils.closeAndSuppressExceptions, which sends all
   exceptions to a "chomper" that consumes them. Nothing is thrown or returned.
   The behavior is slightly different: with this method, _all_ exceptions are
   suppressed, not just IOExceptions. Calls that seemed like they had good
   reason to suppress exceptions were migrated here.
4) Some calls were migrated to try-with-resources, in cases where it appeared
   that CloseQuietly was being used to avoid throwing an exception in a finally
   block.

🎵 You don't have to go home, but you can't stay here... 🎵

* Remove unused import.

* Fix up various issues.

* Adjustments to tests.

* Fix null handling.

* Additional test.

* Adjustments from review.

* Fixup style stuff.

* Fix NPE caused by holder starting out null.

* Fix spelling.

* Chomp Throwables too.
2021-10-23 17:03:21 -07:00
Clint Wylie 02b2057371
extract generic dictionary encoded column indexing and merging stuffs (#11829)
* extract generic dictionary encoded column indexing and merging stuffs to pave the path towards supporting other types of dictionary encoded columns

* spotbugs and inspections fixes

* friendlier

* javadoc

* better name

* adjust
2021-10-22 17:31:22 -07:00
Clint Wylie 741b4ed516
add output type information to ExpressionPostAggregator (#11818)
* add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types

* add test for test for coverage

* simplify

* Remove unused imports.

Co-authored-by: Gian Merlino <gian@imply.io>
2021-10-22 13:52:51 -07:00
Alexander Saydakov 8cf1cbc4a9
latest datasketches-java and datasketches-memory (#11773)
* latest datasketches-java and datasketches-memory

* updated versions of datasketches-java and datasketches-memory

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2021-10-19 23:42:30 -07:00
Clint Wylie 187df58e30
better types (#11713)
* better type system

* needle in a haystack

* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support

* fixup merge

* more test

* fixup

* intern

* fix

* oops

* oops again

* ...

* more test coverage

* fix error message

* adjust interning, more javadocs

* oops

* more docs more better
2021-10-19 01:47:25 -07:00
Jonathan Wei 22b41ddbbf
Task reports for parallel task: single phase and sequential mode (#11688)
* Task reports for parallel task: single phase and sequential mode

* Address comments

* Add null check for currentSubTaskHolder
2021-09-16 13:58:11 -05:00
Clint Wylie 5e092ccb9b
add MV_FILTER_ONLY, MV_FILTER_NONE, ListFilteredVirtualColumn (#11650)
* add MV_FILTER_ONLY SQL function, and list filter virtual column

* MV_FILTER_NONE and more tests

* formatting

* o yeah, forgot can do easy thing

* style

* hmm why was that there

* test filtering on virtual column

* style

* meh

* do it right

* good bot
2021-09-16 09:31:53 -07:00
Clint Wylie bbb86c8731
more tests for LimitedBufferHashGrouper (#11654)
* more tests for LimitedBufferHashGrouper

* fix style
2021-09-08 16:31:34 -07:00
Clint Wylie fe1d8c206a
bump version to 0.23.0-SNAPSHOT (#11670) 2021-09-08 15:56:04 -07:00
Clint Wylie 59d257816b
fix goldilocks bug with HashVectorGrouper improperly initializing memory (#11649)
* fix goldilocks bug with HashVectorGrouper improperly initializing memory that causes failure when there exists room to only grow one time

* fix unintended change

* cleanup
2021-09-02 02:25:26 -07:00
Jian Wang 3ff1c2b8ce
Fix bug which produces vastly inaccurate query results when forceLimitPushDown is enabled and order by clause has non grouping fields (#11097) 2021-09-01 21:19:38 -07:00
Jihoon Son 2a658acad4
Put sleep in an extension (#11632)
* Put sleep in an extension

* dependency
2021-08-25 01:27:45 -07:00
Kashif Faraz aaf0aaad8f
Enable routing of SQL queries at Router (#11566)
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.

This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.

To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
2021-08-13 18:44:39 +05:30
Clint Wylie 9af7ba9d2a
STRING_AGG SQL aggregator function (#11241)
* add string_agg

* oops

* style and fix test

* spelling

* fixup

* review stuffs
2021-08-10 13:47:09 -07:00
Maytas Monsereenusorn 3257913737
Improve query error logging (#11519)
* Improve query error logging

* add docs

* address comments

* address comments
2021-08-05 22:51:09 +07:00
Jihoon Son 8ba7f6a48c
Fix incorrect result of exact topN on an inner join with limit (#11517) 2021-07-31 15:55:49 -07:00
Xavier Léauté 4bca7f014e
update error-prone to 2.8.0 with fix for crashing check (#11494)
* error-prone 2.8.0 fixes https://github.com/google/error-prone/issues/2396
* fix for a few ignored return values
* fix unknown args in sub-modules
2021-07-29 09:13:46 -07:00
Kashif Faraz 8a4e27f51d
Select broker based on query context parameter `brokerService` (#11495)
This change allows the selection of a specific broker service (or broker tier) by the Router.

The newly added ManualTieredBrokerSelectorStrategy works as follows:

Check for the parameter brokerService in the query context. If this is a valid broker service, use it.
Check if the field defaultManualBrokerService has been set in the strategy. If this is a valid broker service, use it.
Move on to the next strategy
2021-07-27 20:56:05 +05:30
Lucas Capistrant 9767b42e85
Add a new metric query/segments/count that is not emitted by default (#11394)
* Add a new metric query/segments/count that is not emitted by default

* docs

* test the default implementation of the metric

* fix spelling error in docs

* document the fact that query retries will result in additional metric emissions

* update using recommended text from @jihoonson
2021-07-22 17:57:35 -07:00
Abhishek Agarwal ce1faa5635
Make SegmentLoader extensible and customizable (#11398)
This PR refactors the code related to segment loading specifically SegmentLoader and SegmentLoaderLocalCacheManager. SegmentLoader is marked UnstableAPI which means, it can be extended outside core druid in custom extensions. Here is a summary of changes

SegmentLoader returns an instance of ReferenceCountingSegment instead of Segment. Earlier, SegmentManager was wrapping Segment objects inside ReferenceCountingSegment. That is now moved to SegmentLoader. With this, a custom implementation can track the references of segments. It also allows them to create custom ReferenceCountingSegment implementations. For this reason, the constructor visibility in ReferenceCountingSegment is changed from private to protected.
SegmentCacheManager has two additional methods called - reserve(DataSegment) and release(DataSegment). These methods let the caller reserve or release space without calling SegmentLoader#getSegment. We already had similar methods in StorageLocation and now they are available in SegmentCacheManager too which wraps multiple locations.
Refactoring to simplify the code in SegmentCacheManager wherever possible. There is no change in the functionality.
2021-07-22 18:00:49 +05:30