2233 Commits

Author SHA1 Message Date
Gian Merlino 1ef25a438f
Broker: Add ability to inline subqueries. (#9533)
* Broker: Add ability to inline subqueries.

The main changes:

- ClientQuerySegmentWalker: Add ability to inline queries.
- Query: Add "getSubQueryId" and "withSubQueryId" methods.
- QueryMetrics: Add "subQueryId" dimension.
- ServerConfig: Add new "maxSubqueryRows" parameter, which is used by
  ClientQuerySegmentWalker to limit how many rows can be inlined per
  query.
- IndexedTableJoinMatcher: Allow creating keys on top of unknown types,
  by assuming they are strings. This is useful because not all types are
  known for fields in query results.
- InlineDataSource: Store RowSignature rather than component parts. Add
  more zealous "equals" and "hashCode" methods to ease testing.
- Moved QuerySegmentWalker test code from CalciteTests and
  SpecificSegmentsQueryWalker in druid-sql to QueryStackTests in
  druid-server. Use this to spin up a new ClientQuerySegmentWalkerTest.

* Adjustments from CI.

* Fix integration test.
2020-03-18 15:06:45 -07:00
Jonathan Wei b1847364b0
More efficient join filter rewrites (#9516)
* More efficient join filter rewrites

* Rebase

* Remove unused functions

* PR comments, fix compile

* Adjust comment

* Allow filter rewrite when join condition has LHS expression

* Fix inspections

* Fix tests
2020-03-16 22:16:14 -07:00
Clint Wylie 6afd55c8f4
threshold based automatic query prioritization (#9493)
* threshold based automatic query prioritization

* fixes

* spelling and fixes

* fix docs

* spelling

* checkstyle

* adjustments

* doc fix
2020-03-13 01:41:54 -07:00
Gian Merlino ff59d2e78b
Move RowSignature from druid-sql to druid-processing and make use of it. (#9508)
* Move RowSignature from druid-sql to druid-processing and make use of it.

1) Moved (most of) RowSignature from sql to processing. Left behind the SQL-specific
   stuff in a RowSignatures utility class. It also picked up some new convenience
   methods along the way.
2) There were a lot of places in the code where Map<String, ValueType> was used to
   associate columns with type info. These are now all replaced with RowSignature.
3) QueryToolChest's resultArrayFields method is replaced with resultArraySignature,
   and it now provides type info.

* Fix up extensions.

* Various fixes
2020-03-12 11:06:44 -07:00
Jonathan Wei 3082b9289a
Fix NPE when using IndexedTable and all left rows are filtered out (#9490)
* Fix NPE when using IndexedTable and all left rows are filtered out

* Fix compile

* Add constant for uninitialized current row

* Fix checkstyle
2020-03-11 19:23:05 -07:00
Gian Merlino 2ef5c17441
Link up row-based datasources to serving layer. (#9503)
* Link up row-based datasources to serving layer.

- Add SegmentWrangler interface that allows linking of DataSources to Segments.
- Add LocalQuerySegmentWalker that uses SegmentWranglers to compute queries on
  data that is available locally.
- Modify ClientQuerySegmentWalker to use LocalQuerySegmentWalker when the base
  datasource is concrete and not a table.
- Add SegmentWranglerModule to the Broker so it has them available and can
  properly instantiate LocalQuerySegmentWalkers.
- Set InlineDataSource and LookupDataSource to concrete, since they can be
  directly queried now.

* Fix tests.
2020-03-11 11:32:27 -07:00
Gian Merlino 4f085896c6
Ability to directly query row-based datasources. (#9502)
* Ability to directly query row-based datasources.

Includes:

- Foundational classes RowBasedSegment, RowBasedStorageAdapter,
  RowBasedCursor provide a queryable interface on top of a
  RowBasedColumnSelectorFactory.
- Add LookupSegment: A RowBasedSegment that is built on lookup data.
- Improve capability reporting in RowBasedColumnSelectorFactory.

* Fix import.

* Remove unthrown IOException.
2020-03-10 20:39:01 -07:00
Samarth Jain c74749f0f4
Don't exclude null dimension values from the map based query response (#9438) 2020-03-10 15:06:03 -07:00
Gian Merlino c6c2282b59
Harmonization and bug-fixing for selector and filter behavior on unknown types. (#9484)
* Harmonization and bug-fixing for selector and filter behavior on unknown types.

- Migrate ValueMatcherColumnSelectorStrategy to newer ColumnProcessorFactory
  system, and set defaultType COMPLEX so unknown types can be dynamically matched.
- Remove ValueGetters in favor of ColumnComparisonFilter doing its own thing.
- Switch various methods to use convertObjectToX when casting to numbers, rather
  than ad-hoc and inconsistent logic.
- Fix bug in RowBasedExpressionColumnValueSelector: isBindingArray should return
  true even for 0- or 1- element arrays.
- Adjust various javadocs.

* Add throwParseExceptions option to Rows.objectToNumber, switch back to that.

* Update tests.

* Adjust moment sketch tests.
2020-03-10 07:15:57 -07:00
Clint Wylie 8b9fe6f584
query laning and load shedding (#9407)
* prototype

* merge QueryScheduler and QueryManager

* everything in its right place

* adjustments

* docs

* fixes

* doc fixes

* use resilience4j instead of semaphore

* more tests

* simplify

* checkstyle

* spelling

* oops heh

* remove unused

* simplify

* concurrency tests

* add SqlResource tests, refactor error response

* add json config tests

* use LongAdder instead of AtomicLong

* remove test only stuffs from scheduler

* javadocs, etc

* style

* partial review stuffs

* adjust

* review stuffs

* more javadoc

* error response documentation

* spelling

* preserve user specified lane for NoSchedulingStrategy

* more test, why not

* doc adjustment

* style

* missed review for make a thing a constant

* fixes and tests

* fix test

* Update docs/configuration/index.md

Co-Authored-By: sthetland <steve.hetland@imply.io>

* doc update

Co-authored-by: sthetland <steve.hetland@imply.io>
2020-03-10 02:57:16 -07:00
Jihoon Son 75e2051195
Convert array_contains() and array_overlaps() into native filters if possible (#9487)
* Convert array_contains() and array_overlaps() into native filters if
possible

* make spotbugs happy and fix null results when null compatible
2020-03-09 22:50:38 -07:00
Jonathan Wei 0136dba95d
Add option to control join filter rewrites (#9472)
* Add option to control join filter rewrites

* Fix inspections
2020-03-09 17:36:07 -07:00
Clint Wylie a677664811
allow optimization of single multi-value column input expr with repeated identifier (#9425)
* allow optimization of single multi-value column input expr with repeated identifier

* add test
2020-03-06 12:53:32 -08:00
Julian Jaffe eda03630d0
Add OnHeapMemorySegmentWriteOutMediumFactory (#9454)
* Add OnHeapMemorySegmentWriteOutMediumFactory

Add a factory for OnHeapMemorySegmentWriteOutMedium to support direct writing via Spark.

* Register OnHeapMemorySegmentWriteOutMediumFactory.

Register OnHeapMemorySegmentWriteOutMediumFactory with SegmentWriteOutMediumFactory.

* Remove unnecessary throws

The base `makeSegmentWriteOutMedium` throws an IOException, but the particular implementation of OnHeapMemorySegmentWriteOutMediumFactory does not throw a checked exception.

* Update SegmentWriteOutMedium docs to include onHeapMemory

Update the SegmentWriteOutMedium section of the indexing docs to include a description of the new OnHeapSegmentMediumWriteOut option.
2020-03-05 22:34:08 -08:00
Jihoon Son 3016057178
Make Transform an ExtensionPoint (#9319)
* Make Transform an ExtensionPoint

* Add transform to the list of documented extensions

* Add example transform implementation
2020-03-04 12:13:14 -08:00
Gian Merlino 1fd865b7c1
BufferArrayGrouper: Fix potential overflow in requiredBufferCapacity. (#9435)
* BufferArrayGrouper: Fix potential overflow in requiredBufferCapacity.

If cardinality was high, the computation could overflow an int. There
were tests for this, but the tests were wrong.

* Nicer.
2020-02-28 14:27:52 -08:00
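
The overflow described in #9435 can be illustrated with a minimal sketch. This is not the actual BufferArrayGrouper code: the class name, the method names, and the 8-byte record size are assumptions made for illustration. Multiplying two ints wraps silently once the product exceeds Integer.MAX_VALUE, whereas widening one operand to long first preserves the true value.

```java
// Hypothetical illustration of an int overflow in a capacity computation.
// None of these names come from Druid itself.
class CapacityOverflowSketch {
    // Buggy variant: the multiplication is done in 32-bit arithmetic and wraps.
    static int capacityAsInt(int cardinality, int recordSize) {
        return cardinality * recordSize;
    }

    // Safer variant: widen to long before multiplying.
    static long capacityAsLong(int cardinality, int recordSize) {
        return (long) cardinality * recordSize;
    }

    public static void main(String[] args) {
        int cardinality = 1_000_000_000; // a high-cardinality dimension
        int recordSize = 8;              // assumed bytes per record

        System.out.println("int math:  " + capacityAsInt(cardinality, recordSize));
        System.out.println("long math: " + capacityAsLong(cardinality, recordSize));
    }
}
```

With these inputs the int variant yields a negative number, which is the kind of result a caller could misread as "plenty of room" unless the computation is widened or checked.
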
Gian Merlino 81d8be6e39
CacheStrategy: Improve Javadocs. (#9280)
* CacheStrategy: Improve Javadocs.

* Update processing/src/main/java/org/apache/druid/query/CacheStrategy.java

Co-Authored-By: Suneet Saldanha <44787917+suneet-s@users.noreply.github.com>

Co-authored-by: Suneet Saldanha <44787917+suneet-s@users.noreply.github.com>
2020-02-28 11:30:58 -08:00
Gian Merlino ef3d24e886
Add javadocs for enableFilterPushDown. (#9423) 2020-02-26 22:07:33 -08:00
Gian Merlino c9faf3e148
Add SQL GROUPING SETS support. (#9122)
* Add SQL GROUPING SETS support.

Built on top of the subtotalsSpec feature in the groupBy query. This also involves
two changes to subtotalsSpec:

- Alter behavior so limitSpec is applied after subtotalsSpec, rather than applied to
  each grouping set. This is more in line with SQL standard behavior. I think it is okay
  to make this change, since the old behavior was not documented, so users should
  hopefully not be depending on it.
- Fix a bug where virtual columns were included in the subtotal queries, but they
  should not have been.

Also fixes two bugs in query equality checking:

- BaseQuery: Use getDuration() instead of "duration" in equals and hashCode, since the
  latter is lazily initialized and might be null in one query but not the other.
- GroupByQuery: Include subtotalsSpec in equals and hashCode.

* Fix bugs.

* Fix tests.

* PR updates.

* Grouping class hygiene.
2020-02-26 08:52:39 -08:00
Jonathan Wei 5ce9c81b68
Add join prefix duplicate/shadowing check (#9384)
* Add join prefix duplicate/shadowing check

* Fix format string

* PR comments

* PR comment

* Optimize loop PR comment
2020-02-25 18:17:23 -08:00
Clint Wylie 6d8dd5ec10
string -> expression -> string -> expression (#9367)
* add Expr.stringify which produces parseable expression strings, parser support for null values in arrays, and parser support for empty numeric arrays

* oops, macros are expressions too

* style

* spotbugs

* qualified type arrays

* review stuffs

* simplify grammar

* more permissive array parsing

* reuse expr joiner

* fix it
2020-02-21 15:43:02 -08:00
Jonathan Wei cab08f941d
Fix join filter push down post-join virtual column handling (#9373)
* Fix join filter push down post-join virtual column handling

* Remove unused adapter param, update javadocs

* Fix TC

* Update processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java

Co-Authored-By: Suneet Saldanha <44787917+suneet-s@users.noreply.github.com>

* Address PR comments

Co-authored-by: Suneet Saldanha <44787917+suneet-s@users.noreply.github.com>
2020-02-19 15:51:05 -08:00
Chi Cao Minh e7eb45e648
Run IntelliJ inspections on Travis (#9179)
* Run IntelliJ inspections on Travis

Running IntelliJ inspections currently takes about 90 minutes, but they
can be run in about 30 minutes on Travis.

* Restore assert statements
2020-02-19 11:34:19 +03:00
Jonathan Wei 73a0181e34
Fix handling for columns that appear multiple times in join conditions (#9362)
* Fix handling for columns that appear multiple times in join conditions

* Remove unneeded comment

* Fix test
2020-02-17 10:54:04 -08:00
Suneet Saldanha b1f38131af
Fix timestamp extract fn to match postgreSQL (#9337)
* Fix timestamp extract fn to match postgres

Update the timestamp extract function so that it matches the PostgreSQL docs.
Examples from the PostgreSQL docs were added as tests for DECADE, CENTURY
and MILLENIUM extraction.

There were bugs in CENTURY and MILLENIUM that were spotted because of intelliJ
inspections - 'Integer division in floating point context'

* Update CalciteQueryTest

* remove useless round

* mark integer division as an error
2020-02-12 15:39:19 -08:00
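
The "Integer division in floating point context" inspection mentioned in #9337 flags code like the buggy method below. This is a sketch of that class of bug, not the actual Druid implementation; the class and method names are hypothetical. It assumes positive years and follows the PostgreSQL convention that the year 2000 is in century 20 and 2001 is in century 21.

```java
// Hypothetical illustration of integer division inside a floating-point call.
class CenturySketch {
    // Buggy: year / 100 is evaluated as int division (truncating) before
    // Math.ceil ever sees it, so the ceil has nothing left to round up.
    static long centuryBuggy(int year) {
        return (long) Math.ceil(year / 100);
    }

    // Fixed: dividing by 100.0 forces floating-point division.
    static long century(int year) {
        return (long) Math.ceil(year / 100.0);
    }

    public static void main(String[] args) {
        for (int year : new int[] {2000, 2001}) {
            System.out.println(year + ": buggy=" + centuryBuggy(year)
                               + " fixed=" + century(year));
        }
    }
}
```
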
Maytas Monsereenusorn c30579e47b
ANY Aggregator should not skip null values implementation (#9317)
* ANY Aggregator should not skip null values implementation

* add tests

* add more tests

* Update documentation

* add more tests

* address review comments

* optimize StringAnyBufferAggregator

* fix failing tests

* address pr comments
2020-02-12 14:01:41 -08:00
Jonathan Wei b2c00b3a79
Add query context option to disable join filter push down (#9335) 2020-02-11 15:31:34 -08:00
Suneet Saldanha ea006dc72a
Optimize TimeExtractionTopNAlgorithm (#9336)
When the time extraction Top N algorithm is looking for aggregators, it makes
2 calls to hashCode on the key. Use Map#computeIfAbsent instead so that the
hashCode is calculated only once
2020-02-10 14:26:10 -08:00
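
The effect described in #9336 can be sketched as follows. This is not the TimeExtractionTopNAlgorithm code: the counting key type, the long[] stand-in for an aggregator, and all method names are assumptions for illustration. On a cache miss, a get followed by a put hashes the key twice, while Map#computeIfAbsent hashes it once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these names come from Druid.
class ComputeIfAbsentSketch {
    /** Key that counts how often hashCode() is invoked. */
    static final class CountingKey {
        final String name;
        int hashCalls = 0;

        CountingKey(String name) {
            this.name = name;
        }

        @Override
        public int hashCode() {
            hashCalls++;
            return name.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CountingKey && ((CountingKey) o).name.equals(name);
        }
    }

    // Two map operations on a miss: get(), then put().
    static long[] getThenPut(Map<CountingKey, long[]> aggregators, CountingKey key) {
        long[] agg = aggregators.get(key);
        if (agg == null) {
            agg = new long[1];
            aggregators.put(key, agg);
        }
        return agg;
    }

    // One map operation regardless of hit or miss.
    static long[] viaComputeIfAbsent(Map<CountingKey, long[]> aggregators, CountingKey key) {
        return aggregators.computeIfAbsent(key, k -> new long[1]);
    }

    // Returns how many times hashCode() ran for a single lookup on an empty map.
    static int hashCallsOnMiss(boolean useComputeIfAbsent) {
        Map<CountingKey, long[]> aggregators = new HashMap<>();
        CountingKey key = new CountingKey("2020-02-10");
        if (useComputeIfAbsent) {
            viaComputeIfAbsent(aggregators, key);
        } else {
            getThenPut(aggregators, key);
        }
        return key.hashCalls;
    }

    public static void main(String[] args) {
        System.out.println("get + put:       " + hashCallsOnMiss(false));
        System.out.println("computeIfAbsent: " + hashCallsOnMiss(true));
    }
}
```
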
Suneet Saldanha 51d7864935
Codestyle - use java style array declaration (#9338)
* Codestyle - use java style array declaration

Replaced C-style array declarations with java style declarations and marked
the intelliJ inspection as an error

* cleanup test code
2020-02-10 14:25:26 -08:00
Jonathan Wei ad8afc565c
Join filter pushdown initial implementation (#9301)
* Join filter pushdown initial implementation

* Fix test and spotbugs check

* Address PR comments

* More PR comments

* Address some PR comments

* Address more PR comments

* Fix TC failures and address PR comments
2020-02-07 16:23:37 -08:00
Lucas Capistrant 53bb45fc9a
Forbid easily misused HashSet and HashMap constructors (#9165)
* Forbid easily misused HashSet and HashMap constructors

* Add two LinkedHashMap constructors to forbidden-apis and create utility method as replacement for them

* Fix visibility of constant in CollectionUtils.java

* Make an exception for an instance of LinkedHashMap#<init>(int) because proper sizing is used

* revert changes to sql module tests that should be in separate PR

* Finish reverting changes to sql module tests that were flagged in checkstyle during CI

* Add netty dependency resulting from SuppressForbidden
2020-02-07 10:44:09 +03:00
Gian Merlino 0aa7a2a3ee
Add HashVectorGrouper based on MemoryOpenHashTable. (#9314)
* Add HashVectorGrouper based on MemoryOpenHashTable.

Additional supporting changes:

1) Modifies VectorGrouper interface to use Memory instead of ByteBuffers.
2) Modifies BufferArrayGrouper to match the new VectorGrouper interface.
3) Removes "implements VectorGrouper" from BufferHashGrouper.

* Fix comment.

* Fix another comment.

* Remove unused stuff.

* Include hoisted bounds checks.

* Checks against too-large keySpaces.
2020-02-06 15:29:14 -08:00
Gian Merlino 3ef5c2f2e8
Add MemoryOpenHashTable, a table similar to ByteBufferHashTable. (#9308)
* Add MemoryOpenHashTable, a table similar to ByteBufferHashTable.

With some key differences to improve speed and design simplicity:

1) Uses Memory rather than ByteBuffer for its backing storage.
2) Uses faster hashing and comparison routines (see HashTableUtils).
3) Capacity is always a power of two, allowing simpler design and more
   efficient implementation of findBucket.
4) Does not implement growability; instead, leaves that to its callers.
   The idea is this removes the need for subclasses, while still giving
   callers flexibility in how to handle table-full scenarios.

* Fix LGTM warnings.

* Adjust dependencies.

* Remove easymock from druid-benchmarks.

* Adjustments from review.

* Fix datasketches unit tests.

* Fix checkstyle.
2020-02-04 19:57:59 -08:00
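
Point 3 of #9308 (capacity is always a power of two) is a common hash table design choice. The sketch below shows why it helps, under the assumption that the bucket for a key is derived from its hash; it is not the actual MemoryOpenHashTable.findBucket implementation, and all names are hypothetical. When the bucket count is a power of two, a single bitwise AND gives the same result as a non-negative modulo.

```java
// Hypothetical illustration; not Druid's MemoryOpenHashTable.
class PowerOfTwoBucketSketch {
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // Fast path: valid only when numBuckets is a power of two.
    static int bucketByMask(int hash, int numBuckets) {
        if (!isPowerOfTwo(numBuckets)) {
            throw new IllegalArgumentException("numBuckets must be a power of two");
        }
        return hash & (numBuckets - 1);
    }

    // General path: works for any positive numBuckets, but needs a division.
    static int bucketByMod(int hash, int numBuckets) {
        return Math.floorMod(hash, numBuckets);
    }

    public static void main(String[] args) {
        int numBuckets = 16;
        for (int hash : new int[] {0, 7, 16, 12345, -1, Integer.MIN_VALUE}) {
            System.out.println(hash + " -> mask=" + bucketByMask(hash, numBuckets)
                               + " mod=" + bucketByMod(hash, numBuckets));
        }
    }
}
```

The mask also handles negative hash codes without a separate sign check, which is part of what makes the power-of-two constraint attractive.
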
Chi Cao Minh 0d2b16c1d0
Speed up joins on indexed tables with string keys (#9278)
* Speed up joins on indexed tables with string keys

When joining on indexed tables with string keys, caching the computation
of row id to row numbers improves performance on the
JoinAndLookupBenchmark.joinIndexTableStringKey* benchmarks by about 10%
if the column cache is enabled and by about 100% if the column cache is
disabled.

* Faster cache impl and handle unknown cardinality

* Remove unused dependency

* Hoist cardinality check outside of hot loop

* Fix dummy DimensionSelector for tests
2020-02-04 17:34:55 -08:00
Suneet Saldanha 33a97dfaae
Guicify druid sql module (#9279)
* Guicify druid sql module

Break up the SQLModule in to smaller modules and provide a binding that
modules can use to register schemas with druid sql.

* fix some tests

* address code review

* tests compile

* Working tests

* Add all the tests

* fix up licenses and dependencies

* add calcite dependency to druid-benchmarks

* tests pass

* rename the schemas
2020-02-04 11:33:48 -08:00
Gian Merlino b411443d22
SQL join support for lookups. (#9294)
* SQL join support for lookups.

1) Add LookupSchema to SQL, so lookups show up in the catalog.
2) Add join-related rels and rules to SQL, allowing joins to be planned into
   native Druid queries.

* Add two missing LookupSchema calls in tests.

* Fix tests.

* Fix typo.
2020-01-31 23:51:16 -08:00
Gian Merlino 85d0d57fc9
Fix timestamp_format expr outside UTC timeZone. (#9282) 2020-01-31 16:20:35 -08:00
Gian Merlino 204ba9966f
Add LookupJoinableFactory. (#9281)
* Add LookupJoinableFactory.

Enables joins where the right-hand side is a lookup. Includes an
integration test.

Also, includes changes to LookupExtractorFactoryContainerProvider:

1) Add "getAllLookupNames", which will be needed to eventually connect
   lookups to Druid's SQL catalog.
2) Convert "get" from nullable to Optional return.
3) Swap out most usages of LookupReferencesManager in favor of the
   simpler LookupExtractorFactoryContainerProvider interface.

* Fixes for tests.

* Fix another test.

* Java 11 message fix.

* Fixups.

* Fixup benchmark class.
2020-01-30 14:46:21 -08:00
Suneet Saldanha 6b44d4aa80
Add getRightEquiConditionKeys to JoinConditionAnalysis (#9287)
* Add getRightColumns to JoinConditionAnalysis

This change allows other implementations of JoinableFactory to ask the analysis
for the right key columns instead of having to calculate it themselves.

* Address some review comments

* more code review stuff
2020-01-29 22:31:29 -08:00
Chi Cao Minh a1494c30e0
Join microbenchmark (#9267)
Add microbenchmark for joins. Enabling the column cache improves
performance by ~70% for the benchmarks for joins with string keys.
Adjusting LookupJoinMatcher.matchCondition() to have fewer branches
improves performance by ~10% for the benchmarks for joins with lookups.
2020-01-29 14:08:19 -08:00
Suneet Saldanha 303b02eba1
intelliJ inspections cleanup (#9260)
* intelliJ inspections cleanup

- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static

Most of these changes are aesthetic; however, they will allow inspections to
be enabled as part of CI checks going forward.

The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
    indexing-hadoop/.../Utils.java
    processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
    processing/src/.../ScanQueryLimitRowIteratorTest.java

* Add intelliJ inspection warnings as errors to druid profile

* one more static inner class
2020-01-29 11:50:52 -08:00
Suneet Saldanha 6ee0afa8e5
Rename MapDataSourceJoinableFactoryWarehouse (#9275) 2020-01-28 19:00:07 -08:00
Suneet Saldanha 0ccfe5ca89
Expose JoinableFactory through Guice Bindings (#9271)
* Make JoinableFactory an extension point

This change makes it so that extensions can register a JoinableFactory that
should be used for a DataSource.

Extensions can provide the factories via DruidBinders#joinableFactoryBinder
Known DataSources - like InlineDataSource are provided in the
JoinableFactoryModule. This module installs a FactoryWarehouse that is
used to decide which factory should be used to generate the Joinable for
the provided DataSource.

The ExtensionPoint is marked as Beta since it is not yet clear if this
needs to remain available to other extensions or if the best way to
register a factory is by using the datasource class.

* Add module test

* remove useless bindings in test

* remove ExtensionPoint annotation

* Make LifecycleLock not final to help with testing
2020-01-28 13:59:06 -08:00
Clint Wylie 14253c63d6
removed AsyncQueryRunner since was only used by removed interval chunking stuff (#9252) 2020-01-27 18:53:17 -08:00
Clint Wylie 36c5efe2ab
fix some issues with filters on numeric columns with nulls (#9251)
* fix issue with long column predicate filters and nulls

* dang

* uncomment a thing

* styles

* oops

* allcaps

* review stuff
2020-01-27 18:01:01 -08:00
Gian Merlino 19b427e8f3
Add JoinableFactory interface and use it in the query stack. (#9247)
* Add JoinableFactory interface and use it in the query stack.

Also includes InlineJoinableFactory, which enables joining against
inline datasources. This is the first patch where a basic join query
actually works. It includes integration tests.

* Fix test issues.

* Adjustments from code review.
2020-01-24 13:10:01 -08:00
Gian Merlino f0f68570ec
Use DataSourceAnalysis throughout the query stack. (#9239)
Builds on #9235, using the datasource analysis functionality to replace various ad-hoc
approaches. The most interesting changes are in ClientQuerySegmentWalker (brokers),
ServerManager (historicals), and SinkQuerySegmentWalker (indexing tasks).

Other changes related to improving how we analyze queries:

1) Changes TimelineServerView to return an Optional timeline, which I thought made
   the analysis changes cleaner to implement.
2) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to
   determine whether it is safe to pass a subquery dataSource to the query toolchest.
   Fixes an issue introduced in #5471 where subqueries under non-groupBy-typed queries
   were silently ignored, since neither the query entry point nor the toolchest did
   anything special with them.
3) Removes the QueryPlus.withQuerySegmentSpec method, which was mostly being used in
   error-prone ways (ignoring any potential subqueries, and not verifying that the
   underlying data source is actually a table). Replaces with a new function,
   Queries.withSpecificSegments, that includes sanity checks.
2020-01-23 14:07:14 -08:00
Gian Merlino d886463253
Add join-related DataSource types, and analysis functionality. (#9235)
* Add join-related DataSource types, and analysis functionality.

Builds on #9111 and implements the datasource analysis mentioned in #8728. Still can't
handle join datasources, but we're a step closer.

Join-related DataSource types:

1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used
   in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

1) Add DataSourceAnalysis class, which breaks down datasources into three components:
   outer queries, a base datasource (left-most of the highest level left-leaning join
   tree), and other joined-in leaf datasources (the right-hand branches of the
   left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
   support analysis.

Other notes:

1) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
   Also, made it a Set, so implementations don't need to worry about duplicates.
2) The addition of "isCacheable" should work around #8713, since UnionDataSource now
   returns false for cacheability.

* Remove javadoc comment.

* Updates reflecting code review.

* Add comments.

* Add more comments.
2020-01-22 14:54:47 -08:00
Suneet Saldanha a2939bbd1a
Optimize JoinCondition matching (#9200)
* Optimize JoinCondition matching

The LookupJoinMatcher needs to check if a condition is always true or false
multiple times. This can be pre-computed to speed up the match checking.

This change reduces the time it takes to perform a join on a long key
from ~36 ms/op to 23 ms/op.

* Rename variables

* fix typo
2020-01-21 09:11:50 -08:00
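
The pre-computation idea in #9200 can be sketched roughly as below. This is an assumption-laden illustration rather than Druid's JoinConditionAnalysis: it models a condition as an AND of constant (literal) terms plus an optional set of row-dependent terms, and every name here is hypothetical. The "always true" and "always false" answers are derived once in the constructor, so a matcher can read a field on each call instead of re-scanning the terms.

```java
import java.util.List;

// Hypothetical illustration; not Druid's actual join condition classes.
class PrecomputedConditionSketch {
    private final boolean alwaysFalse;
    private final boolean alwaysTrue;

    /**
     * @param literalTerms       constant sub-conditions that are ANDed together
     * @param hasNonLiteralTerms whether any sub-condition depends on row data
     */
    PrecomputedConditionSketch(List<Boolean> literalTerms, boolean hasNonLiteralTerms) {
        // An AND is always false as soon as one constant term is false.
        this.alwaysFalse = literalTerms.contains(Boolean.FALSE);
        // It is always true only if nothing is false and nothing depends on the row.
        this.alwaysTrue = !alwaysFalse && !hasNonLiteralTerms;
    }

    // Both accessors are O(1); the scan over the terms happened once, up front.
    boolean isAlwaysTrue() {
        return alwaysTrue;
    }

    boolean isAlwaysFalse() {
        return alwaysFalse;
    }
}
```
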
Clint Wylie 8011211a0c
first/last aggregators and nulls (#9161)
* null handling for numeric first/last aggregators, refactor to not extend nullable numeric agg since they are complex typed aggs

* initially null or not based on config

* review stuff, make string first/last consistent with null handling of numeric columns, more tests

* docs

* handle nil selectors, revert to primitive first/last types so groupby v1 works...
2020-01-20 11:51:54 -08:00