Commit Graph

199 Commits

Author SHA1 Message Date
Clint Wylie 9e5a940cf1
remake column indexes and query processing of filters (#12388)
Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.
2022-05-11 11:57:08 +05:30
Gian Merlino 7b89682bbe
DimensionRangeShardSpec speed boost. (#12477)
* DimensionRangeShardSpec speed boost.

Calling isEmpty() and equals() on RangeSets is expensive, because these
fall back on default implementations that call size(). And size() is
_also_ a default implementation that iterates the entire collection.

* Fix and test from code review.
2022-04-27 14:20:35 -07:00
somu-imply 027935dcff
Vectorize numeric latest aggregators (#12439)
* Vectorizing Latest aggregator Part 1

* Updating benchmark tests

* Changing appropriate logic for vectors for null handling

* Introducing an abstract class and moving the commonalities there

* Adding vectorization for StringLast aggregator (initial version)

* Updated bufferized version of numeric aggregators

* Adding some javadocs

* Making sure this PR vectorizes numeric latest agg only

* Adding another benchmarking test

* Fixing intellij inspections

* Adding tests for double

* Adding test cases for long and float

* Updating testcases

* Checkstyle oops..

* One tiny change in test case

* Fixing spotbug and rhs not being used
2022-04-26 11:33:08 -07:00
Clint Wylie 9cfb23935f
push value range and set index get operations into BitmapIndex (#12315)
* push value range and set index get operations into BitmapIndex

* fix bug

* oops, fix better

* better like, fix test, javadocs

* fix checkstyle

* simplify and fixes

* cache

* fix tests

* move indexOf into GenericIndexed

* oops

* fix tests
2022-03-09 13:30:58 -08:00
Rohan Garg 56fbd2af6f
Guard against exponential increase of filters during CNF conversion (#12314)
Currently, the CNF conversion of a filter is unbounded, which means that it can create as many filters as possible thereby also leading to OOMs in historical heap. We should throw an error or disable CNF conversion if the filter count starts getting out of hand. There are ways to do CNF conversion with linear increase in filters as well but that has been left out of the scope of this change since those algorithms add new variables in the predicate - which can be contentious.
2022-03-09 13:19:52 +05:30
Jihoon Son e5ad862665
A new includeAllDimension flag for dimensionsSpec (#12276)
* includeAllDimensions in dimensionsSpec

* doc

* address comments

* unused import and doc spelling
2022-02-25 18:27:48 -08:00
Jason Koch eb1b53b7f8
perf: indexing: Introduce a bulk getValuesInto function to read values (#12105)
* perf: indexing: Introduce a bulk getValuesInto function to read values in bulk

If large number of values are required from DimensionDictionary
during indexing, fetch them all in a single lock/unlock instead of
lock/unlock each individual item.

* refactor: rename key to keys in function args

* fix: check explicitly that argument length on arrays match

* refactor: getValuesInto renamed to getValues, now creates and returns a new T[] rather than filling
2022-02-25 12:19:04 -08:00
somu-imply 033989eb1d
Adding vectorized time_shift (#12254)
* Adding vectorized time_shift

* Vectorize time shift, addressing review comments

* Remove an unused import
2022-02-11 14:44:52 -08:00
Kashif Faraz e648b01afb
Improve memory estimates in Aggregator and DimensionIndexer (#12073)
Fixes #12022  

### Description
The current implementations of memory estimation in `OnHeapIncrementalIndex` and `StringDimensionIndexer` tend to over-estimate which leads to more persistence cycles than necessary.

This PR replaces the max estimation mechanism with getting the incremental memory used by the aggregator or indexer at each invocation of `aggregate` or `encode` respectively.

### Changes
- Add new flag `useMaxMemoryEstimates` in the task context. This overrides the same flag in DefaultTaskConfig i.e. `druid.indexer.task.default.context` map
- Add method `AggregatorFactory.factorizeWithSize()` that returns an `AggregatorAndSize` which contains
  the aggregator instance and the estimated initial size of the aggregator
- Add method `Aggregator.aggregateWithSize()` which returns the incremental memory used by this aggregation step
- Update the method `DimensionIndexer.processRowValsToKeyComponent()` to return the encoded key component as well as its effective size in bytes
- Update `OnHeapIncrementalIndex` to use the new estimations only if `useMaxMemoryEstimates = false`
2022-02-03 10:34:02 +05:30
Maytas Monsereenusorn bd7fe45da0
Support adding metrics in Auto Compaction (#12125)
* add impl

* add impl

* add unit tests

* add unit tests

* add unit tests

* add unit tests

* add unit tests

* add integration tests

* add integration tests

* fix LGTM

* fix test

* remove doc
2022-01-17 20:19:31 -08:00
Jonathan Wei 9b598407c1
Add interface for external schema provider to Druid SQL (#12043)
* Add interfce for external schema provider to Druid SQL

* Add annotations
2021-12-22 22:17:57 +05:30
Laksh Singla ca260dfef6
Intern RowSignature in DruidSchema to reduce its memory footprint (#12001)
DruidSchema consists of a concurrent HashMap of DataSource -> Segement -> AvailableSegmentMetadata. AvailableSegmentMetadata contains RowSignature of the segment, and for each segment, a new object is getting created. RowSignature is an immutable class, and hence it can be interned, and this can lead to huge savings of memory being used in broker, since a lot of the segments of a table would potentially have same RowSignature.
2021-12-08 15:11:13 +05:30
Lucas Capistrant 150902b95c
clean up the balancing code around the batched vs deprecated way of sampling segments to balance (#11960)
* clean up the balancing code around the batched vs deprecated way of sampling segments to balance

* fix docs, clarify comments, add deprecated annotations to legacy code

* remove unused variable

* update dynamic config dialog in console to state percentOfSegmentsToConsiderPerMove deprecated

* fix dynamic config text for percentOfSegmentsToConsiderPerMove

* run prettier to cleanup coordinator-dynamic-config.tsx changes

* update jest snapshot

* update documentation per review feedback
2021-12-07 14:47:46 -08:00
Clint Wylie 84b4bf56d8
vectorize logical operators and boolean functions (#11184)
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
2021-12-02 16:40:23 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Gian Merlino 0354407655
SQL INSERT planner support. (#11959)
* SQL INSERT planner support.

The main changes are:

1) DruidPlanner is able to validate and authorize INSERT queries. They
   require WRITE permission on the target datasource.

2) QueryMaker is now an interface, and there is a QueryMakerFactory that
   creates instances of it. There is only one production implementation
   of each (NativeQueryMaker and NativeQueryMakerFactory), which
   together behave the same way as the former QueryMaker class. But this
   opens the door to executing queries in ways other than the Druid
   query stack, and is used by unit tests (CalciteInsertDmlTest) to
   test the INSERT planning functionality.

3) Adds an EXTERN table macro that allows references external data using
   InputSource and InputFormat from Druid's batch ingestion API. This is
   not exposed in production yet, but is used by unit tests.

4) Adds a QueryFeature concept that enables the planner to change its
   behavior slightly depending on the capabilities of the execution
   system.

5) Adds an "AuthorizableOperator" concept that enables SqlOperators
   to require additional permissions. This is used by the EXTERN table
   macro.

Related odds and ends:

- Add equals, hashCode, toString methods to InlineInputSource. Aids in
  the "from external" tests in CalciteInsertDmlTest.
- Add JSON-serializability to RowSignature.
- Move the SQL string inside PlannerContext so it is "baked into" the
  planner when the planner is created. Cleans up the code a bit, since
  in practice, the same query is passed in every time to the
  same planner anyway.

* Fix up calls to CalciteTests.createMockQueryLifecycleFactory.

* Fix checkstyle issues.

* Adjustments for CI.

* Adjust DruidAvaticaHandlerTest for stricter test authorizations.
2021-11-24 12:14:04 -08:00
Maytas Monsereenusorn bb3d2a433a
Support filtering data in Auto Compaction (#11922)
* add impl

* fix checkstyle

* add test

* add test

* add unit tests

* fix unit tests

* fix unit tests

* fix unit tests

* add IT

* add IT

* add comments

* fix spelling
2021-11-24 10:56:38 -08:00
Gian Merlino babf00f8e3
Migrate File.mkdirs to FileUtils.mkdirp. (#11879)
* Migrate File.mkdirs to FileUtils.mkdirp.

* Remove unused imports.

* Fix LookupReferencesManager.

* Simplify.

* Also migrate usages of forceMkdir.

* Fix var name.

* Fix incorrect call.

* Update test.
2021-11-09 11:10:49 -08:00
Maytas Monsereenusorn ddc68c6a81
Support changing dimension schema in Auto Compaction (#11874)
* add impl

* add unit tests

* fix checkstyle

* add impl

* add impl

* add impl

* add impl

* add impl

* add impl

* fix test

* add IT

* add IT

* fix docs

* add test

* address comments

* fix conflict
2021-11-08 21:17:08 -08:00
Clint Wylie 907e4ca0c5
use correct DimensionSpec with for column value selectors created from dictionary encoded column indexers (#11873)
* use correct dimension spec for column value selectors of dictionary encoded column indexers
2021-11-05 01:51:15 -07:00
Liran Funaro 9ca8f1ec97
Remove IncrementalIndex template modifier (#11160)
Co-authored-by: Liran Funaro <liran.funaro@verizonmedia.com>
2021-10-27 13:10:37 -07:00
Gian Merlino fc95c92806
Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124)
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.

This patch does the following:

- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
  required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
  had it. (Not all of them did, and that's OK, because it wasn't necessary
  anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.

Rationale for removing the offheap incremental index:

- It is only used in one rare scenario: groupBy v1 (which is non-default)
  in "useOffheap" mode (also non-default). So you have to go pretty deep
  into the wilderness to get this code to activate in production. It is
  never used during ingestion.
- Its existence complicates developer efforts to reason about how
  aggregators get used, because the way it uses buffer aggregators is so
  different from how every other query engine uses them.
- It doesn't have meaningful testing.

By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.

* Remove things that are now unused.

* Revert removal of getFloat, getLong, getDouble from BufferAggregator.

* OAK-related warnings, suppressions.

* Unused item suppressions.
2021-10-26 08:05:56 -07:00
Alexander Saydakov 8cf1cbc4a9
latest datasketches-java and datasketches-memory (#11773)
* latest datasketches-java and datasketches-memory

* updated versions of datasketches-java and datasketches-memory

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2021-10-19 23:42:30 -07:00
Clint Wylie 187df58e30
better types (#11713)
* better type system

* needle in a haystack

* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support

* fixup merge

* more test

* fixup

* intern

* fix

* oops

* oops again

* ...

* more test coverage

* fix error message

* adjust interning, more javadocs

* oops

* more docs more better
2021-10-19 01:47:25 -07:00
Clint Wylie 392f0ca1b5
refactor sql authorization to get resource type from schema, resource type to be string (#11692)
* refactor sql authorization to get resource type from schema, refactor resource type from enum to string

* information schema auth filtering adjustments

* refactor

* minor stuff

* Update SqlResourceCollectorShuttle.java
2021-09-17 09:53:25 -07:00
Jihoon Son 7e90d00cc0
Configurable maxStreamLength for doubles sketches (#11574)
* Configurable maxStreamLength for doubles sketches

* fix equals/hashcode and it test failure

* fix test

* fix it test

* benchmark

* doc

* grouping key

* fix comment

* dependency check

* Update docs/development/extensions-core/datasketches-quantiles.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-08-31 14:56:37 -07:00
Jihoon Son 98312d54cf
Fix CI for master (#11522) 2021-07-30 15:41:21 -07:00
Yuanli Han b83742179a
Reduce method invocation of reservoir sampling (#11257)
* reduce method invocation of reservoir sampling

* add a dynamic parameter and add benchmark

* rebase
2021-07-30 22:09:50 +08:00
Lucas Capistrant 9767b42e85
Add a new metric query/segments/count that is not emitted by default (#11394)
* Add a new metric query/segments/count that is not emitted by default

* docs

* test the default implementation of the metric

* fix spelling error in docs

* document the fact that query retries will result in additional metric emissions

* update using recommended text from @jihoonson
2021-07-22 17:57:35 -07:00
Clint Wylie 17efa6f556
add single input string expression dimension vector selector and better expression planning (#11213)
* add single input string expression dimension vector selector and better expression planning

* better

* fixes

* oops

* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing

* oops

* javadocs, renaming

* more javadocs

* benchmarks

* use string expression vector processor with vector size 1 instead of expr.eval

* better logging

* javadocs, surprising number of the the

* more

* simplify
2021-07-06 11:20:49 -07:00
frank chen 906a704c55
Eliminate ambiguities of KB/MB/GB in the doc (#11333)
* GB ---> GiB

* suppress spelling check

* MB --> MiB, KB --> KiB

* Use IEC binary prefix

* Add reference link

* Fix doc style
2021-06-30 13:42:45 -07:00
Gian Merlino 202c78c8f3
Enable rewriting certain inner joins as filters. (#11068)
* Enable rewriting certain inner joins as filters.

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.

* Test improvements.

* Test fixes.

* Avoid slow size() call.

* Remove invalid test.

* Fix style.

* Fix mistaken default.

* Small fixes.

* Fix logic error.
2021-04-14 10:49:27 -07:00
Clint Wylie 08d3786738
improve bitmap vector offset to report contiguous groups (#11039)
* improve bitmap vector offset to report contiguous groups

* benchmark style

* check for contiguous in getOffsets, tests for exceptions
2021-04-13 11:47:01 -07:00
Jihoon Son 25db8787b3
Fix CAST being ignored when aggregating on strings after cast (#11083)
* Fix CAST being ignored when aggregating on strings after cast

* fix checkstyle and dependency

* unused import
2021-04-12 22:21:24 -07:00
Maytas Monsereenusorn 4576152e4a
Make dropExisting flag for Compaction configurable and add warning documentations (#11070)
* Make dropExisting flag for Compaction configurable

* fix checkstyle

* fix checkstyle

* fix test

* add tests

* fix spelling

* fix docs

* add IT

* fix test

* fix doc

* fix doc
2021-04-09 00:12:28 -07:00
Clint Wylie 338886fd5f
vector group by support for string expressions (#11010)
* vector group by support for string expressions

* fix test

* comments, javadoc
2021-04-08 19:23:39 -07:00
Clint Wylie c0e6d1c7f8
vectorize 'auto' long decoding (#11004)
* Vectorize LongDeserializers.

Also, add many more tests.

* more faster

* more more faster

* more cleanup

* fixes

* forbidden

* benchmark style

* idk why

* adjust

* add preconditions for value >= 0 for writers

* add 64 bit exception

Co-authored-by: Gian Merlino <gian@imply.io>
2021-03-26 18:39:13 -07:00
Yi Yuan 36e86a2880
Add protobuf schema registry (#10839)
* dd_protobuf_schema_registry

* change licese

* delete some annotation

* nodify tests

* delete extra exception

* add licenses

* add descriptor and protoMessageType in ProtobufInputRowParser for adopt to old version

* seperate kafka-protobuf-provider

* modify protobuf.md

* refine protobuf.md

* add config and header

* bug fixed

Co-authored-by: yuanyi <yuanyi@freewheel.tv>
2021-03-09 15:15:51 -08:00
Jihoon Son 2c30f8b3b7
Migrate bitmap benchmarks to JMH (#10936)
* Migrate bitmap benchmarks to JMH

* add concise
2021-03-04 12:50:55 -08:00
Abhishek Agarwal 1a15987432
Supporting filters in the left base table for join datasources (#10697)
* where filter left first draft

* Revert changes in calcite test

* Refactor a bit

* Fixing the Tests

* Changes

* Adding tests

* Add tests for correlated queries

* Add comment

* Fix typos
2021-03-04 10:39:21 -08:00
Maytas Monsereenusorn 6541178c21
Support segmentGranularity for auto-compaction (#10843)
* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* resolve conflict

* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* fix tests

* fix more tests

* fix checkstyle

* add unit tests

* fix checkstyle

* fix checkstyle

* fix checkstyle

* add unit tests

* add integration tests

* fix checkstyle

* fix checkstyle

* fix failing tests

* address comments

* address comments

* fix tests

* fix tests

* fix test

* fix test

* fix test

* fix test

* fix test

* fix test

* fix test

* fix test
2021-02-12 03:03:20 -08:00
Clint Wylie fe30f4b414
refactor sql lifecycle, druid planner, views, and view permissions (#10812)
* before i leaped i should've seen, the view from halfway down

* fixes

* fixes, more test

* rename

* fix style

* further refactoring

* review stuffs

* rename

* more javadoc and comments
2021-02-05 12:56:55 -08:00
Liran Funaro 08ab82f55c
IncrementalIndex Tests and Benchmarks Parametrization (#10593)
* Remove redundant IncrementalIndex.Builder

* Parametrize incremental index tests and benchmarks

- Reveal and fix a bug in OffheapIncrementalIndex

* Fix forbiddenapis error: Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

* Fix Intellij errors: declared exception is never thrown

* Add documentation and validate before closing objects on tearDown.

* Add documentation to OffheapIncrementalIndexTestSpec

* Doc corrections and minor changes.

* Add logging for generated rows.

* Refactor new tests/benchmarks.

* Improve IncrementalIndexCreator documentation

* Add required tests for DataGenerator

* Revert "rollupOpportunity" to be a string
2021-01-07 22:18:47 -08:00
Jonathan Wei 68bb038b31
Multiphase segment merge for IndexMergerV9 (#10689)
* Multiphase merge for IndexMergerV9

* JSON fix

* Cleanup temp files

* Docs

* Address logging and add IT

* Fix spelling and test unloader datasource name
2021-01-05 22:19:09 -08:00
Jihoon Son d47d6cf081
Add time-to-first-result benchmark for groupBy (#10612) 2020-12-01 10:32:37 -08:00
Himanshu 30bcb0fd74
DataSourcesSnapshotBenchmark to measure iterateAllUsedSegmentsInSnapshot perf (#10604) 2020-11-29 14:42:14 -08:00
Abhishek Agarwal 4d2a92f46a
Add caching support to join queries (#10366)
* Proposed changes for making joins cacheable

* Add unit tests

* Fix tests

* simplify logic

* Pull empty byte array logic out of CachingQueryRunner

* remove useless null check

* Minor refactor

* Fix tests

* Fix segment caching on Broker

* Move join cache key computation in Broker

Move join cache key computation in Broker from ResultLevelCachingQueryRunner to CachingClusteredClient

* Fix compilation

* Review comments

* Add more tests

* Fix inspection errors

* Pushed condition analysis to JoinableFactory

* review comments

* Disable join caching for broker and add prefix key to BroadcastSegmentIndexedTable

* Remove commented lines

* Fix populateCache

* Disable caching for selective datasources

Refactored the code so that we can decide at the data source level, whether to enable cache for broker or data nodes
2020-10-09 17:42:30 -07:00
Clint Wylie 1d6cb624f4
add vectorizeVirtualColumns query context parameter (#10432)
* add vectorizeVirtualColumns query context parameter

* oops

* spelling

* default to false, more docs

* fix test

* fix spelling
2020-09-28 18:48:34 -07:00
Clint Wylie 19c4b16640
vectorized expressions and expression virtual columns (#10401)
* vectorized expression virtual columns

* cleanup

* fixes

* preserve float if explicitly specified

* oops

* null handling fixes, more tests

* what is an expression planner?

* better names

* remove unused method, add pi

* move vector processor builders into static methods

* reduce boilerplate

* oops

* more naming adjustments

* changes

* nullable

* missing hex

* more
2020-09-23 13:56:38 -07:00
Suneet Saldanha 0b4c897fbe
Vectorized variance aggregators (#10390)
* wip vectorize

* close but not quite

* faster

* unit tests

* fix complex types for variance
2020-09-17 15:05:40 -07:00
Clint Wylie 084b23deed
benchmark for indexed table experiments (#10327)
* benchmark for indexed table experiments

* fix style

* teardown outside of measurement
2020-09-14 15:14:38 -07:00
Jihoon Son 8f14ac814e
More structured way to handle parse exceptions (#10336)
* More structured way to handle parse exceptions

* checkstyle; add more tests

* forbidden api; test

* address comment; new test

* address review comments

* javadoc for parseException; remove redundant parseException in streaming ingestion

* fix tests

* unnecessary catch

* unused imports

* appenderator test

* unused import
2020-09-11 16:31:10 -07:00
Suneet Saldanha a9de00d43a
Remove NUMERIC_HASHING_THRESHOLD (#10313)
* Make NUMERIC_HASHING_THRESHOLD configurable

Change the default numeric hashing threshold to 1 and make it configurable.

Benchmarks attached to this PR show that binary searches are not more faster
than doing a set contains check. The attached flamegraph shows the amount of
time a query spent in the binary search. Given the benchmarks, we can expect
to see roughly a 2x speed up in this part of the query which works out to
~ a 10% faster query in this instance.

* Remove NUMERIC_HASHING_THRESHOLD

* Remove stale docs
2020-08-25 20:05:39 -07:00
Clint Wylie c72f96a4ba
fix bug with expressions on sparse string realtime columns without explicit null valued rows (#10248)
* fix bug with realtime expressions on sparse string columns

* fix test

* add comment back

* push capabilities for dimensions to dimension indexers since they know things

* style

* style

* fixes

* getting a bit carried away

* missed one

* fix it

* benchmark build fix

* review stuffs

* javadoc and comments

* add comment

* more strict check

* fix missed usaged of impl instead of interface
2020-08-11 11:07:17 -07:00
Maytas Monsereenusorn 574b062f1f
Cluster wide default query context setting (#10208)
* Cluster wide default query context setting

* Cluster wide default query context setting

* Cluster wide default query context setting

* add docs

* fix docs

* update props

* fix checkstyle

* fix checkstyle

* fix checkstyle

* update docs

* address comments

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix NPE
2020-07-29 15:19:18 -07:00
Jihoon Son 657f8ee80f
Fix RetryQueryRunner to actually do the job (#10082)
* Fix RetryQueryRunner to actually do the job

* more javadoc

* fix test and checkstyle

* don't combine for testing

* address comments

* fix unit tests

* always initialize response context in cachingClusteredClient

* fix subquery

* address comments

* fix test

* query id for builders

* make queryId optional in the builders and ClusterQueryResult

* fix test

* suppress tests and unused methods

* exclude groupBy builder

* fix jacoco exclusion

* add tests for builders

* address comments

* don't truncate
2020-07-01 14:02:21 -07:00
Gian Merlino 5faa897a34
Join filter pre-analysis simplifications and sanity checks. (#10104)
* Join filter pre-analysis simplifications and sanity checks.

- At pre-analysis time, only compute pre-analysis for the innermost
  root query, since this is the one that will run on the join that involves
  the base datasource. Previously, pre-analyses were computed for multiple
  levels of the query, some of which were unnecessary.
- Remove JoinFilterPreAnalysisGroup and join query level gathering code,
  since they existed to support precomputation of multiple pre-analyses.
- Embed JoinFilterPreAnalysisKey into JoinFilterPreAnalysis and use it to
  sanity check at processing time that the correct pre-analysis was done.

Tangentially related changes:

- Remove prioritizeAndLaneQuery functionality from LocalQuerySegmentWalker.
  The computed priority and lanes were not being used.
- Add "getBaseQuery" method to DataSourceAnalysis to support identification
  of the proper subquery for filter pre-analysis.

* Fix compilation errors.

* Adjust tests.
2020-06-30 19:14:22 -07:00
xhl0726 1596b3eacd
Optimize protobuf parsing for flatten data (#9999)
* optimize for protobuf parsing

* fix import error and maven dependency

* add unit test in protobufInputrowParserTest for flatten data

* solve code duplication (remove the log and main())

* rename 'flatten' to 'flat' to make it clearer

Co-authored-by: xionghuilin <xionghuilin@bytedance.com>
2020-06-24 18:01:31 -07:00
Clint Wylie c2f5d453f8
fix topn on string columns with non-sorted or non-unique dictionaries (#10053)
* fix topn on string columns with non-sorted or non-unique dictionaries

* fix metadata tests

* refactor, clarify comments and code, fix ci failures
2020-06-19 11:35:18 -07:00
Jonathan Wei 37e150c075
Fix join filter rewrites with nested queries (#10015)
* Fix join filter rewrites with nested queries

* Fix test, inspection, coverage

* Remove clauses from group key

* Fix import order

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2020-06-18 21:32:29 -07:00
Clint Wylie f8b643ec72
make joinables closeable (#9982)
* make joinables closeable

* tests and adjustments

* refactor to make join stuffs impelement ReferenceCountedObject instead of Closable, more tests

* fixes

* javadocs and stuff

* fix bugs

* more test

* fix lgtm alert

* simplify

* fixup javadoc

* review stuffs

* safeguard against exceptions

* i hate this checkstyle rule

* make IndexedTable extend Closeable
2020-06-09 20:12:36 -07:00
Clint Wylie c5d6163c76
add a GeneratorInputSource to fill up a cluster with generated data for testing (#9946)
* move benchmark data generator into druid-processing, add a GeneratorInputSource to fill up a cluster with data

* newlines

* make test coverage not fail maybe

* remove useless test

* Update pom.xml

* Update GeneratorInputSourceTest.java

* less passive aggressive test names
2020-06-09 19:31:04 -07:00
Clint Wylie 77dd5b06ae
ColumnCapabilities.hasMultipleValues refactor (#9731)
* transition ColumnCapabilities.hasMultipleValues to Capable enum, remove ColumnCapabilities.isComplete

* remove artifical, always multi-value capabilities from IncrementalIndexStorageAdapter and fix up fallout from that, fix ColumnCapabilities merge in index merger

* fix typo

* remove unused method

* review stuffs, revert IncrementalIndexStorageAdapater capabilities change, plumb lame workaround to SegmentAnalyzer

* more comment

* use volatile booleans

* fix line length

* correctly handle missing columns for vector processors

* return ColumnCapabilities.Capable for BitmapIndexSelector.hasMultipleValues, fix vector processor selection for complex

* false on non-existent
2020-06-04 23:52:37 -07:00
Suneet Saldanha 9c40bebc02
Refactor JoinFilterAnalyzer - part 2 (#9929)
* Refactor JoinFilterAnalyzer

This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.

To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.

This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Fitlers with associated tests.

* fix tests

* Refactor JoinFilterAnalyzer - part 2

This change introduces the following components:
 * RhsRewriteCandidates - a wrapper for a list of candidates and associated
     functions to operate on the set of candidates.
 * JoinableClauses - a wrapper for the list of JoinableClause that represent
     a join condition and the associated functions to operate on the clauses.
 * Equiconditions - a wrapper representing the equiconditions that are used
     in the join condition.

And associated test changes.

This refactoring surfaced 2 bugs:
 - Missing equals and hashcode implementation for RhsRewriteCandidate, thus
   allowing potential duplicates in the rhs rewrite candidates
 - Missing Filter#supportsRequiredColumnRewrite check in
   analyzeJoinFilterClause, which could result in UnsupportedOperationException
   being thrown by the filter

* fix compile error

* remove unused class
2020-05-29 15:03:35 -07:00
mcbrewster 28be107a1c
add flag to flattenSpec to keep null columns (#9814)
* add flag to flattenSpec to keep null columns

* remove changes to inputFormat interface

* add comment

* change comment message

* update web console e2e test

* move keepNullColmns to JSONParseSpec

* fix merge conflicts

* fix tests

* set keepNullColumns to false by default

* fix lgtm

* change Boolean to boolean, add keepNullColumns to hash, add tests for keepKeepNullColumns false + true with no nuulul columns

* Add equals verifier tests
2020-05-08 21:53:39 -07:00
BIGrey ee9a721acc
fix npe in IncrementalIndexReadBenchmark (#9754)
Co-authored-by: 黄辉 <huanghui.bigrey@bytedance.com>
2020-05-03 12:52:50 -07:00
Clint Wylie 9a293d554d
remove UnionMergeRule rules from SQL planner (#9797) 2020-05-01 12:50:11 -07:00
BIGrey c5bfe36011
Optimize FileWriteOutBytes to avoid high system cpu usage (#9722)
* optimize FileWriteOutBytes to avoid high sys cpu

* optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException

* optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException in writeOutBytes.size

* Revert "optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException in writeOutBytes.size"

This reverts commit 965f7421

* Revert "optimize FileWriteOutBytes to avoid high sys cpu -- remove IOException"

This reverts commit 149e08c0

* optimize FileWriteOutBytes to avoid high sys cpu -- avoid IOEception never thrown check

* Fix size counting to handle IOE in FileWriteOutBytes + tests

* remove unused throws IOException in WriteOutBytes.size()

* Remove redundant throws IOExcpetion clauses

* Parameterize IndexMergeBenchmark

Co-authored-by: huanghui.bigrey <huanghui.bigrey@bytedance.com>
Co-authored-by: Suneet Saldanha <suneet.saldanha@imply.io>
2020-04-23 20:18:42 -07:00
Suneet Saldanha 1ced3b33fb
IntelliJ inspections cleanup (#9339)
* IntelliJ inspections cleanup

* Standard Charset object can be used
* Redundant Collection.addAll() call
* String literal concatenation missing whitespace
* Statement with empty body
* Redundant Collection operation
* StringBuilder can be replaced with String
* Type parameter hides visible type

* fix warnings in test code

* more test fixes

* remove string concatenation inspection error

* fix extra curly brace

* cleanup AzureTestUtils

* fix charsets for RangerAdminClient

* review comments
2020-04-10 10:04:40 -07:00
Jihoon Son a6790ff22a
More optimize CNF conversion of filters (#9634)
* More optimize CNF conversion of filters

* update license

* fix build

* checkstyle

* remove unnecessary code

* split helper

* license

* checkstyle

* add comments on cnf conversion
2020-04-08 21:31:17 -07:00
Jihoon Son 40e84a171b
Eliminate common subfilters when converting it to a CNF (#9608) 2020-04-05 22:29:41 -07:00
Clint Wylie fa5da6693c
add lane enforcement for joinish queries (#9563)
* add lane enforcement for joinish queries

* oops

* style

* review stuffs
2020-03-30 11:58:16 -07:00
Clint Wylie 2c49f6d89a
error on value counter overflow instead of writing sad segments (#9559) 2020-03-26 16:54:48 -07:00
Gian Merlino 1ef25a438f
Broker: Add ability to inline subqueries. (#9533)
* Broker: Add ability to inline subqueries.

The main changes:

- ClientQuerySegmentWalker: Add ability to inline queries.
- Query: Add "getSubQueryId" and "withSubQueryId" methods.
- QueryMetrics: Add "subQueryId" dimension.
- ServerConfig: Add new "maxSubqueryRows" parameter, which is used by
  ClientQuerySegmentWalker to limit how many rows can be inlined per
  query.
- IndexedTableJoinMatcher: Allow creating keys on top of unknown types,
  by assuming they are strings. This is useful because not all types are
  known for fields in query results.
- InlineDataSource: Store RowSignature rather than component parts. Add
  more zealous "equals" and "hashCode" methods to ease testing.
- Moved QuerySegmentWalker test code from CalciteTests and
  SpecificSegmentsQueryWalker in druid-sql to QueryStackTests in
  druid-server. Use this to spin up a new ClientQuerySegmentWalkerTest.

* Adjustments from CI.

* Fix integration test.
2020-03-18 15:06:45 -07:00
Jonathan Wei b1847364b0
More efficient join filter rewrites (#9516)
* More efficient join filter rewrites

* Rebase

* Remove unused functions

* PR comments, fix compile

* Adjust comment

* Allow filter rewrite when join condition has LHS expression

* Fix inspections

* Fix tests
2020-03-16 22:16:14 -07:00
Clint Wylie 6afd55c8f4
threshold based automatic query prioritization (#9493)
* threshold based automatic query prioritization

* fixes

* spelling and fixes

* fix docs

* spelling

* checkstyle

* adjustments

* doc fix
2020-03-13 01:41:54 -07:00
Jihoon Son 7401bb3f93
Improve OvershadowableManager performance (#9441)
* Use the iterator instead of higherKey(); use the iterator API instead of stream

* Fix tests; fix a concurrency bug in timeline

* fix test

* add tests for findNonOvershadowedObjectsInInterval

* fix test

* add missing tests; fix a bug in QueueEntry

* equals tests

* fix test
2020-03-10 13:22:19 -07:00
Clint Wylie 8b9fe6f584
query laning and load shedding (#9407)
* prototype

* merge QueryScheduler and QueryManager

* everything in its right place

* adjustments

* docs

* fixes

* doc fixes

* use resilience4j instead of semaphore

* more tests

* simplify

* checkstyle

* spelling

* oops heh

* remove unused

* simplify

* concurrency tests

* add SqlResource tests, refactor error response

* add json config tests

* use LongAdder instead of AtomicLong

* remove test only stuffs from scheduler

* javadocs, etc

* style

* partial review stuffs

* adjust

* review stuffs

* more javadoc

* error response documentation

* spelling

* preserve user specified lane for NoSchedulingStrategy

* more test, why not

* doc adjustment

* style

* missed review for make a thing a constant

* fixes and tests

* fix test

* Update docs/configuration/index.md

Co-Authored-By: sthetland <steve.hetland@imply.io>

* doc update

Co-authored-by: sthetland <steve.hetland@imply.io>
2020-03-10 02:57:16 -07:00
Jihoon Son 75e2051195
Convert array_contains() and array_overlaps() into native filters if possible (#9487)
* Convert array_contains() and array_overlaps() into native filters if
possible

* make spotbugs happy and fix null results when null compatible
2020-03-09 22:50:38 -07:00
Jonathan Wei 0136dba95d
Add option to control join filter rewrites (#9472)
* Add option to control join filter rewrites

* Fix inspections
2020-03-09 17:36:07 -07:00
Jihoon Son b924161086
Add main method to VersionedIntervalTimelineBenchmark (#9404) 2020-02-26 12:01:02 -08:00
Clint Wylie b408a6d774
sql support for dynamic parameters (#6974)
* sql support for dynamic parameters

* fixup

* javadocs

* fixup from merge

* formatting

* fixes

* fix it

* doc fix

* remove druid fallback self-join parameterized test

* unused imports

* ignore test for now

* fix imports

* fixup

* fix merge

* merge fixup

* fix test that cannot vectorize

* fixup and more better

* dependency thingo

* fix docs

* tweaks

* fix docs

* spelling

* unused imports after merge

* review stuffs

* add comment

* add ignore text

* review stuffs
2020-02-19 13:09:20 -08:00
Adam Peck e9aebd994a
Fix for building in Eclipse & VS Code. (#7481)
Fixes #6866
Reverse dependencies from /main/ to /test/
Add generated-test-sources to source path for Eclipse
2020-02-13 14:58:32 -08:00
Roman Leventov 8f3fe9cd02 Prohibit String.replace() and String.replaceAll(), fix and prohibit some toString()-related redundancies (#6607)
* Prohibit String.replace() and String.replaceAll(), fix and prohibit some toString()-related redundancies

* Fix bug

* Replace checkstyle regexp with IntelliJ inspection
2018-11-15 13:21:34 -08:00
Roman Leventov 3ae563263a
Renamed 'Generic Column' -> 'Numeric Column'; Fixed a few resource leaks in processing; misc refinements (#5957)
This PR accumulates many refactorings and small improvements that I did while preparing the next change set of https://github.com/druid-io/druid/projects/2. I finally decided to make them a separate PR to minimize the volume of the main PR.

Some of the changes:
 - Renamed confusing "Generic Column" term to "Numeric Column" (what it actually implies) in many class names.
 - Generified `ComplexMetricExtractor`
2018-10-02 14:50:22 -03:00
Gian Merlino 431d3d8497
Rename io.druid to org.apache.druid. (#6266)
* Rename io.druid to org.apache.druid.

* Fix META-INF files and remove some benchmark results.

* MonitorsConfig update for metrics package migration.

* Reorder some dimensions in inner queries for some reason.

* Fix protobuf tests.
2018-08-30 09:56:26 -07:00
Benedict Jin 331a0afb98 Remove redundant type parameters and enforce some other style and inspection rules (#5980)
* Various changes about druid-services module

* Patch improvements from reviewer

* Add ToArrayCallWithZeroLengthArrayArgument & ArraysAsListWithZeroOrOneArgument into inspection profile

* Fix ArraysAsListWithZeroOrOneArgument

* Fix conflict

* Fix ToArrayCallWithZeroLengthArrayArgument

* Fix AliEqualsAvoidNull

* Remove blank line

* Remove unused import clauses

* Fix code style in TopNQueryRunnerTest

* Fix conflict

* Don't use Collections.singletonList when converting the type of array type

* Add argLine into maven-surefire-plugin in druid-process module & increase the timeout value for testMoveSegment testcase

* Roll back the latest commit

* Add java.io.File#toURL() into druid-forbidden-apis

* Using Boolean.parseBoolean instead of Boolean.valueOf for CliCoordinator#isOverlord

* Add a new regexp element into stylecode xml file

* Fix style error for new regexp

* Set the level of ArraysAsListWithZeroOrOneArgument as WARNING

* Fix style error for new regexp

* Add option BY_LEVEL for ToArrayCallWithZeroLengthArrayArgument in inspection profile

* Roll back the level as ToArrayCallWithZeroLengthArrayArgument as ERROR

* Add toArray(new Object[0]) regexp into checkstyle config file & fix them

* Set the level of ArraysAsListWithZeroOrOneArgument as ERROR & Roll back the level of ToArrayCallWithZeroLengthArrayArgument as WARNING until Youtrack fix it

* Add a comment for string equals regexp in checkstyle config

* Fix code format

* Add RedundantTypeArguments as ERROR level inspection

* Fix cannot resolve symbol datasource
2018-07-27 16:56:49 -05:00
Gian Merlino 04ea3c9f8c
Update license headers. (#5976)
* Update license headers.

For compliance with http://www.apache.org/legal/src-headers.html.

* More license adjustments.

* Fix mistakenly edited package line.
2018-07-11 09:55:18 -07:00
Roman Leventov 693e3575f9
Remove unused code and exception declarations (#5461)
* Remove unused code and exception declarations

* Address comments

* Remove redundant Exception declarations

* Make FirehoseFactoryV2.connect() to throw IOException again
2018-03-16 22:11:12 +01:00
Parag Jain 7c01f77b04 Parse Batch support (#5081)
* add parseBatch and deprecate parse method in InputRowParser

add addAll method, skip max rows in memory check for it

remove parse method from implemetations

transform transformers

add string multiplier input row parser

fix withParseSpec

fix kafka batch indexing

fix isPersistRequired

comments

* add unit test

* make persist async

* review comments
2017-12-04 16:06:16 -06:00
Roman Leventov 3541b7544b Prohibit and remove unused declarations in the processing module (#4930)
* Prohibit and remove unused declarations in the processing module

* Fix tests

* Fix integration tests

* Suppress unused

* Try to remove SuppressWarnings unused in VirtualColumn

* Remove reset 'false positives'

* Annotate CliCommandCreator as ExtensionPoint

* Unused import warning instead of error in IntelliJ

* Fixes

* Add comment

* Fix AzureBlob

* Fix CloudFilesBlob

* Address comments

* Add Project SDK section to INTELLIJ_SETUP.md

* Fix image
2017-11-09 09:27:27 -08:00
Kenji Noguchi c0be050242 Add jq expression support in flattenSpec (#4171)
* add jq expression in the flattenSpec

* more tests

* add benchmark

* fix style

* use JsonNode for both JSONPath and JQ

* clean up

* more clean up

* add documentation

* fix style

* move jackson-jq version to dependencyManagement section. remove commented code

* oops. revert wrong fix

* throw IllegalArgumentException for JQ syntax error

* remove e.printStackTrace() that is forbidden

* touch
2017-09-12 14:18:34 -05:00
Roman Leventov cbd1902db8 Add forbidden-apis plugin; prohibit using system time zone (#4611)
* Forbidden APIs WIP

* Remove some tests

* Restore io.druid.math.expr.Function

* Integration tests fix

* Add comments

* Fix in SimpleWorkerProvisioningStrategy

* Formatting

* Replace String.format() with StringUtils.format() in RemoteTaskRunnerTest

* Address comments

* Fix GroupByMultiSegmentTest
2017-08-21 13:02:42 -07:00
Roman Leventov c0beb78ffd Enforce brace formatting with Checkstyle (#4564) 2017-07-21 10:26:59 -05:00
Roman Leventov 1ebfa22955 Update Error prone configuration; Fix bugs (#4252)
* Make Errorprone the default compiler

* Address comments

* Make Error Prone's ClassCanBeStatic rule a error

* Preconditions allow only %s pattern

* Fix DruidCoordinatorBalancerTester

* Try to give the compiler more memory

* Remove distribution module activation on jdk 1.8 because only jdk 1.8 is used now

* Don't show compiler warnings

* Try different travis script

* Fix travis.yml

* Make Error Prone optional again

* For error-prone compiler

* Increase compiler's maxmem

* Don't run Error Prone for benchmarks because of OOM

* Skip install step in Travis

* Remove MetricHolder.writeToChannel()

* In travis.yml, check compilation before tests, because it may fail faster
2017-05-12 15:55:17 +09:00
Jihoon Son d80bec83cc Enable auto license checking (#3836)
* Enable license checking

* Clean duplicated license headers
2017-01-10 18:13:47 -08:00
Akash Dwivedi 4b3bd8bd63 Migrating java-util from Metamarkets. (#3585)
* Migrating java-util from Metamarkets.

* checkstyle and updated license on java-util files.

* Removed unused imports from whole project.

* cherry pick metamx/java-util@826021f.

* Copyright changes on java-util pom, address review comments.
2016-10-21 14:57:07 -07:00
Jonathan Wei b72c54c4f8 Add benchmark data generator, basic ingestion/persist/merge/query benchmarks (#2875) 2016-05-25 16:39:37 -07:00
Gian Merlino 1869c350d8 Move FlattenJSONBenchmarkUtilTest into the correct location. 2016-03-24 15:47:54 -07:00