2205 Commits

Author SHA1 Message Date
Suneet Saldanha
51d7864935
Codestyle - use java style array declaration (#9338)
* Codestyle - use java style array declaration

Replaced C-style array declarations with java style declarations and marked
the intelliJ inspection as an error

* cleanup test code
2020-02-10 14:25:26 -08:00
Jonathan Wei
ad8afc565c
Join filter pushdown initial implementation (#9301)
* Join filter pushdown initial implementation

* Fix test and spotbugs check

* Address PR comments

* More PR comments

* Address some PR comments

* Address more PR comments

* Fix TC failures and address PR comments
2020-02-07 16:23:37 -08:00
Lucas Capistrant
53bb45fc9a
Forbid easily misused HashSet and HashMap constructors (#9165)
* Forbid easily misused HashSet and HashMap constructors

* Add two LinkedHashMap constructors to forbidden-apis and create utility method as replacement for them

* Fix visibility of constant in CollectionUtils.java

* Make an exception for an instance of LinkedHashMap#<init>(int) because proper sizing is used

* revert changes to sql module tests that should be in separate PR

* Finish reverting changes to sql module tests that were flagged in checkstyle during CI

* Add netty dependency resulting from SupressForbidden
2020-02-07 10:44:09 +03:00
Gian Merlino
0aa7a2a3ee
Add HashVectorGrouper based on MemoryOpenHashTable. (#9314)
* Add HashVectorGrouper based on MemoryOpenHashTable.

Additional supporting changes:

1) Modifies VectorGrouper interface to use Memory instead of ByteBuffers.
2) Modifies BufferArrayGrouper to match the new VectorGrouper interface.
3) Removes "implements VectorGrouper" from BufferHashGrouper.

* Fix comment.

* Fix another comment.

* Remove unused stuff.

* Include hoisted bounds checks.

* Checks against too-large keySpaces.
2020-02-06 15:29:14 -08:00
Gian Merlino
3ef5c2f2e8
Add MemoryOpenHashTable, a table similar to ByteBufferHashTable. (#9308)
* Add MemoryOpenHashTable, a table similar to ByteBufferHashTable.

With some key differences to improve speed and design simplicity:

1) Uses Memory rather than ByteBuffer for its backing storage.
2) Uses faster hashing and comparison routines (see HashTableUtils).
3) Capacity is always a power of two, allowing simpler design and more
   efficient implementation of findBucket.
4) Does not implement growability; instead, leaves that to its callers.
   The idea is this removes the need for subclasses, while still giving
   callers flexibility in how to handle table-full scenarios.

* Fix LGTM warnings.

* Adjust dependencies.

* Remove easymock from druid-benchmarks.

* Adjustments from review.

* Fix datasketches unit tests.

* Fix checkstyle.
2020-02-04 19:57:59 -08:00
Chi Cao Minh
0d2b16c1d0
Speed up joins on indexed tables with string keys (#9278)
* Speed up joins on indexed tables with string keys

When joining on index tables with string keys, caching the computation
of row id to row numbers improves performance on the
JoinAndLookupBenchmark.joinIndexTableStringKey* benchmarks by about 10%
if the column cache is enabled an by about 100% if the column cache is
disabled.

* Faster cache impl and handle unknown cardinality

* Remove unused dependency

* Hoist cardinality check outside of hot loop

* Fix dummy DimensionSelector for tests
2020-02-04 17:34:55 -08:00
Suneet Saldanha
33a97dfaae
Guicify druid sql module (#9279)
* Guicify druid sql module

Break up the SQLModule in to smaller modules and provide a binding that
modules can use to register schemas with druid sql.

* fix some tests

* address code review

* tests compile

* Working tests

* Add all the tests

* fix up licenses and dependencies

* add calcite dependency to druid-benchmarks

* tests pass

* rename the schemas
2020-02-04 11:33:48 -08:00
Gian Merlino
b411443d22
SQL join support for lookups. (#9294)
* SQL join support for lookups.

1) Add LookupSchema to SQL, so lookups show up in the catalog.
2) Add join-related rels and rules to SQL, allowing joins to be planned into
   native Druid queries.

* Add two missing LookupSchema calls in tests.

* Fix tests.

* Fix typo.
2020-01-31 23:51:16 -08:00
Gian Merlino
85d0d57fc9
Fix timestamp_format expr outside UTC timeZone. (#9282) 2020-01-31 16:20:35 -08:00
Gian Merlino
204ba9966f
Add LookupJoinableFactory. (#9281)
* Add LookupJoinableFactory.

Enables joins where the right-hand side is a lookup. Includes an
integration test.

Also, includes changes to LookupExtractorFactoryContainerProvider:

1) Add "getAllLookupNames", which will be needed to eventually connect
   lookups to Druid's SQL catalog.
2) Convert "get" from nullable to Optional return.
3) Swap out most usages of LookupReferencesManager in favor of the
   simpler LookupExtractorFactoryContainerProvider interface.

* Fixes for tests.

* Fix another test.

* Java 11 message fix.

* Fixups.

* Fixup benchmark class.
2020-01-30 14:46:21 -08:00
Suneet Saldanha
6b44d4aa80
Add getRightEquiConditionKeys to JoinConditionAnalysis (#9287)
* Add getRightColumns to JoinConditionAnalysis

This change other implementations of JoinableFactory to ask the analysis
for the right key columns instead of having to calculate it themselves.

* Address some review comments

* more code review stuff
2020-01-29 22:31:29 -08:00
Chi Cao Minh
a1494c30e0
Join microbenchmark (#9267)
Add microbenchmark for joins. Enabling the column cache improves
performance by ~70% for the benchmarks for joins with string keys.
Adjusting LookupJoinMatcher.matchCondition() to have fewer branches,
improves performance by ~10% for the benchmarks for joins with lookups.
2020-01-29 14:08:19 -08:00
Suneet Saldanha
303b02eba1
intelliJ inspections cleanup (#9260)
* intelliJ inspections cleanup

- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static

Most of these changes are aesthetic, however, they will allow inspections to
be enabled as part of CI checks going forward

The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
    indexing-hadoop/.../Utils.java
    processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
    processing/src/.../ScanQueryLimitRowIteratorTest.java

* Add intelliJ inspection warnings as errors to druid profile

* one more static inner class
2020-01-29 11:50:52 -08:00
Suneet Saldanha
6ee0afa8e5
Rename MapDataSourceJoinableFactoryWarehouse (#9275) 2020-01-28 19:00:07 -08:00
Suneet Saldanha
0ccfe5ca89 Expose JoinableFactory through Guice Bindings (#9271)
* Make JoinableFactory an extension point

This change makes it so that extensions can register a JoinableFactory that
should be used for a DataSource.

Extensions can provide the factories via DruidBinders#joinableFactoryBinder
Known DataSources - like InlineDataSource are provided in the
JoinableFactoryModule. This module installs a FactoryWarehouse that is
used to decide which factory should be used to generate the Joinable for
the provided DataSource.

The ExtensionPoint is marked as Beta since it is not yet clear if this
needs to remain available to other extensions or if the best way to
register a factory is by using the datasource class.

* Add module test

* remove useless bindings in test

* remove ExtensionPoint annotation

* Make LifecycleLock not final to help with testing
2020-01-28 13:59:06 -08:00
Clint Wylie
14253c63d6
removed AsyncQueryRunner since was only used by removed interval chunking stuff (#9252) 2020-01-27 18:53:17 -08:00
Clint Wylie
36c5efe2ab fix some issues with filters on numeric columns with nulls (#9251)
* fix issue with long column predicate filters and nulls

* dang

* uncomment a thing

* styles

* oops

* allcaps

* review stuff
2020-01-27 18:01:01 -08:00
Gian Merlino
19b427e8f3
Add JoinableFactory interface and use it in the query stack. (#9247)
* Add JoinableFactory interface and use it in the query stack.

Also includes InlineJoinableFactory, which enables joining against
inline datasources. This is the first patch where a basic join query
actually works. It includes integration tests.

* Fix test issues.

* Adjustments from code review.
2020-01-24 13:10:01 -08:00
Gian Merlino
f0f68570ec
Use DataSourceAnalysis throughout the query stack. (#9239)
Builds on #9235, using the datasource analysis functionality to replace various ad-hoc
approaches. The most interesting changes are in ClientQuerySegmentWalker (brokers),
ServerManager (historicals), and SinkQuerySegmentWalker (indexing tasks).

Other changes related to improving how we analyze queries:

1) Changes TimelineServerView to return an Optional timeline, which I thought made
   the analysis changes cleaner to implement.
2) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to
   determine whether it is safe to pass a subquery dataSource to the query toolchest.
   Fixes an issue introduced in #5471 where subqueries under non-groupBy-typed queries
   were silently ignored, since neither the query entry point nor the toolchest did
   anything special with them.
3) Removes the QueryPlus.withQuerySegmentSpec method, which was mostly being used in
   error-prone ways (ignoring any potential subqueries, and not verifying that the
   underlying data source is actually a table). Replaces with a new function,
   Queries.withSpecificSegments, that includes sanity checks.
2020-01-23 14:07:14 -08:00
Gian Merlino
d886463253
Add join-related DataSource types, and analysis functionality. (#9235)
* Add join-related DataSource types, and analysis functionality.

Builds on #9111 and implements the datasource analysis mentioned in #8728. Still can't
handle join datasources, but we're a step closer.

Join-related DataSource types:

1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used
   in the future for query rewriting (e.g. inlining of subqueries).

DataSource analysis functionality:

1) Add DataSourceAnalysis class, which breaks down datasources into three components:
   outer queries, a base datasource (left-most of the highest level left-leaning join
   tree), and other joined-in leaf datasources (the right-hand branches of the
   left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
   support analysis.

Other notes:

1) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
   Also, made it a Set, so implementations don't need to worry about duplicates.
2) The addition of "isCacheable" should work around #8713, since UnionDataSource now
   returns false for cacheability.

* Remove javadoc comment.

* Updates reflecting code review.

* Add comments.

* Add more comments.
2020-01-22 14:54:47 -08:00
Suneet Saldanha
a2939bbd1a Optimize JoinCondition matching (#9200)
* Optimize JoinCondition matching

The LookupJoinMatcher needs to check if a condition is always true or false
multiple times. This can be pre-computed to speed up the match checking

This change reduces the time it takes to perform a for joining on a long key
from ~ 36 ms/op to 23 ms/ op

* Rename variables

* fix typo
2020-01-21 09:11:50 -08:00
Clint Wylie
8011211a0c first/last aggregators and nulls (#9161)
* null handling for numeric first/last aggregators, refactor to not extend nullable numeric agg since they are complex typed aggs

* initially null or not based on config

* review stuff, make string first/last consistent with null handling of numeric columns, more tests

* docs

* handle nil selectors, revert to primitive first/last types so groupby v1 works...
2020-01-20 11:51:54 -08:00
Gian Merlino
d21054f7c5
Remove the deprecated interval-chunking stuff. (#9216)
* Remove the deprecated interval-chunking stuff.

See https://github.com/apache/druid/pull/6591, https://github.com/apache/druid/pull/4004#issuecomment-284171911 for details.

* Remove unused import.

* Remove chunkInterval too.
2020-01-19 17:14:23 -08:00
Clint Wylie
f0dddaa51a fix topn aggregation on numeric columns with null values (#9183)
* fix topn issue with aggregating on numeric columns with null values

* adjustments

* rename

* add more tests

* fix comments

* more javadocs

* computeIfAbsent
2020-01-17 18:12:24 -08:00
Maytas Monsereenusorn
68ed2a2c8f Fix LATEST / EARLIEST Buffer Aggregator does not work on String column (#9197)
* fix buff limit bug

* add tests

* add test

* add tests

* fix checkstyle
2020-01-16 21:02:37 -08:00
Gian Merlino
448da78765 Speed up String first/last aggregators when folding isn't needed. (#9181)
* Speed up String first/last aggregators when folding isn't needed.

Examines the value column, and disables fold checking via a needsFoldCheck
flag if that column can't possibly contain SerializableLongStringPairs. This
is helpful because it avoids calling getObject on the value selector when
unnecessary; say, because the time selector didn't yield an earlier or later
value.

* PR comments.

* Move fastLooseChop to StringUtils.
2020-01-16 21:02:02 -08:00
Gian Merlino
bd49ec03bc
Move result-to-array logic from SQL layer into QueryToolChests. (#9130)
* Move result-to-array logic from SQL layer into QueryToolChests.

* Checkstyle adjustment.

* Fix typo.
2020-01-16 15:42:10 -08:00
Gian Merlino
bfcb30e48f
Add javadocs and small improvements to join code. (#9196)
A follow-up to #9111.
2020-01-16 15:25:38 -08:00
Maytas Monsereenusorn
42359c93dd Implement ANY aggregator (#9187)
* Implement ANY aggregator

* Add copyright headers

* Add unit tests

* fix BufferAggregator

* Fix bug in BufferAggregator

* hook up the SQL command

* add check for buffer aggregator

* Address comment

* address comments

* add docs

* Address comments

* add more tests for numeric columns that have null values when run in sql compatible null mode

* fix checkstyle errors

* fix failing tests

* fix failing tests
2020-01-16 14:40:32 -08:00
Gian Merlino
a87db7f353
Add HashJoinSegment, a virtual segment for joins. (#9111)
* Add HashJoinSegment, a virtual segment for joins.

An initial step towards #8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.

* Fixups.

* Fix missing format argument.

* Various tests and minor improvements.

* Changes.

* Remove or add tests for unused stuff.

* Fix up package locations.
2020-01-16 13:14:20 -08:00
Lucas Capistrant
4716e0b585 Fix concurrency of ComplexMetrics.java (#9134) 2020-01-15 17:19:45 +03:00
Clint Wylie
85219ece13 fix null handling for arithmetic post aggregator comparator (#9159)
* fix null handling for arithmetic postagg comparator, add test for comparator for min/max/quantile postaggs in histogram ext

* fix
2020-01-10 13:49:19 -08:00
Jonathan Wei
8c53818fa9
Add numeric nulls to sample data, fix some numeric null handling issues (#9154)
* Fix LongSumAggregator comparator null handling

* Remove unneeded GroupBy test change

* Checkstyle

* Update other processing tests for new sample data

* Remove unused code

* Fix SearchQueryRunner column selectors

* Fix DimensionIndexer null handling and ScanQueryRunnerTest

* Fix TeamCity errors
2020-01-10 13:49:06 -08:00
Clint Wylie
c248e00984 fix moment sketch null handling (#9075) 2020-01-07 14:15:59 -06:00
Jonathan Wei
aa539177ec De-incubation cleanup in code, docs, packaging (#9108)
* De-incubation cleanup in code, docs, packaging

* remove unused docs script
2020-01-03 12:33:19 -05:00
Jonathan Wei
4e8368a5d9 Set version to 0.18.0-SNAPSHOT (#9109) 2020-01-02 17:55:10 -05:00
Clint Wylie
8ccce9857a fix vectorized query engine numeric filter matchers against null values (#9063)
* fix druid-sql issue with filtering numeric columns by null values

* fix vector numeric column matchers to check null vector for null matches
2019-12-20 13:15:48 -08:00
Gian Merlino
d452cbbb82 GenericIndexedWriter: Fix issue when writing large values to large columns. (#9029) 2019-12-13 15:33:14 -08:00
Jonathan Wei
8af41d7cd0 Update version to 0.18.0-incubating-SNAPSHOT (#9009) 2019-12-11 14:04:03 -08:00
Clint Wylie
4327892b84 modify multi-value expression transformation behavior to not treat re-use of the same input as a candidate for cartesian mapping (#8957) 2019-12-09 20:38:15 -08:00
Clint Wylie
06cd30460e
add query metrics for broker parallel merges, off by default (#8981)
* add a bunch of metrics for broker parallel merges, off by default, and tests

* fix tests

* review stuffs

* propogateIfPossible
2019-12-06 13:42:53 -08:00
Q
391646123e Fix double-checked locking in predicate suppliers in BoundDimFi… (#8974)
* Fix double-checked locking in predicate suppliers in BoundDimFilter

* Fix double-checked locking in predicate suppliers in BoundDimFilter

* 1. Use Suppliers.memoize() to initialize and publish singleton.

2. Fix coding style.

* Fix coding style

* Fix double-checked locking bug for predicate suppliers in InDimFilter
2019-12-04 20:01:52 +03:00
Fangyuan Deng
187cf0dd3f [Improvement] historical fast restart by lazy load columns metadata(20X faster) (#6988)
* historical fast restart by lazy load columns metadata

* delete repeated code

* add documentation for druid.segmentCache.lazyLoadOnStart

* fix unit test fail

* fix spellcheck

* update docs

* update docs mentioning a catch
2019-12-03 09:47:01 -08:00
jon-wei
dfbc066163 Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1"
This reverts commit a0f21d9b07bb4b4efd8ef98d0effd0c28f2f7d43.
2019-11-27 23:22:43 -08:00
jon-wei
0402ff85b8 Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit 8ffa71e7e6ed140446acaa94baf47b779e6f24a3.
2019-11-27 23:22:32 -08:00
jon-wei
8ffa71e7e6 [maven-release-plugin] prepare for next development iteration 2019-11-27 23:18:48 -08:00
jon-wei
a0f21d9b07 [maven-release-plugin] prepare release druid-0.16.1-incubating-rc1 2019-11-27 23:18:37 -08:00
Chi Cao Minh
fba876b607 Update jackson to 2.9.10 (#8940)
Addresses security vulnerabilities:

- sonatype-2016-0397:
  https://github.com/FasterXML/jackson-core/issues/315

- sonatype-2017-0355:
  https://github.com/FasterXML/jackson-core/pull/322
2019-11-26 21:41:14 -08:00
Gian Merlino
e0eb85ace7 Add FileUtils.createTempDir() and enforce its usage. (#8932)
* Add FileUtils.createTempDir() and enforce its usage.

The purpose of this is to improve error messages. Previously, the error
message on a nonexistent or unwritable temp directory would be
"Failed to create directory within 10,000 attempts".

* Further updates.

* Another update.

* Remove commons-io from benchmark.

* Fix tests.
2019-11-22 19:48:49 -08:00
Chi Cao Minh
ff6217365b Refactor parallel indexing perfect rollup partitioning (#8852)
* Refactor parallel indexing perfect rollup partitioning

Refactoring to make it easier to later add range partitioning for
perfect rollup parallel indexing. This is accomplished by adding several
new base classes (e.g., PerfectRollupWorkerTask) and new classes for
encapsulating logic that needs to be changed for different partitioning
strategies (e.g., IndexTaskInputRowIteratorBuilder).

The code is functionally equivalent to before except for the following
small behavior changes:

1) PartialSegmentMergeTask: Previously, this task had a priority of
   DEFAULT_TASK_PRIORITY. It now has a priority of
   DEFAULT_BATCH_INDEX_TASK_PRIORITY (via the new PerfectRollupWorkerTask
   base class), since it is a batch index task.

2) ParallelIndexPhaseRunner: A decorator was added to
   subTaskSpecIterator to ensure the subtasks are generated with unique
   ids. Previously, only tests (i.e., MultiPhaseParallelIndexingTest)
   would have this decorator, but this behavior is desired for non-test
   code as well.

* Fix forbidden apis and pmd warnings

* Fix analyze dependencies warnings

* Fix IndexTask json and add IT diags

* Fix parallel index supervisor<->worker serde

* Fix TeamCity inspection errors/warnings

* Fix TeamCity inspection errors/warnings again

* Integrate changes with those from #8823

* Address review comments

* Address more review comments

* Fix forbidden apis

* Address more review comments
2019-11-20 17:24:12 -08:00