Commit Graph

2404 Commits

Author SHA1 Message Date
Gian Merlino 6d2ff796a3
Add RowIdSupplier to ColumnSelectorFactory. (#12577)
* Add RowIdSupplier to ColumnSelectorFactory.

This enables virtual columns to cache their outputs in case they are
called multiple times on the same underlying row. This is common for
numeric selectors, where the common pattern is to call isNull() and
then follow with getLong(), getFloat(), or getDouble(). Here, output
caching reduces the number of expression evals by half.

* Fix tests.
2022-05-31 11:38:03 -07:00
Clint Wylie b746bf9129
fix virtual column cycle bug, sql virtual column optimize bug (#12576)
* fix virtual column cycle bug, sql virtual column optimize bug

* more test
2022-05-30 23:51:21 -07:00
Dr. Sizzles 7291c92f4f
Adding zstandard compression library (#12408)
* Adding zstandard compression library

* 1. Took @clintropolis's advice to have ZStandard decompressor use the byte array when the buffers are not direct.
2. Cleaned up checkstyle issues.

* Fixing zstandard version to latest stable version in pom's and updating license files

* Removing zstd from benchmarks and adding to processing (poms)

* fix the intellij inspection issue

* Removing the prefix v for the version in the license check for ztsd

* Fixing license checks

Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
2022-05-28 17:01:44 -07:00
Clint Wylie d0c9c37e35
make query context changes backwards compatible (#12564)
Adds a default implementation of getQueryContext, which was added to the Query interface in #12396. Query is marked with @ExtensionPoint, and lately we have been trying to be less volatile on these interfaces by providing default implementations to be more chill for extension writers.

The way this default implementation is done in this PR is a bit strange due to the way that getQueryContext is used (mutated with system default and system generated keys); the default implementation has a specific object that it returns, and I added another temporary default method isLegacyContext that checks if the getQueryContext returns that object or not. If not, callers fall back to using getContext and withOverriddenContext to set these default and system values.

I am open to other ideas as well, but this way should work at least without exploding, and added some tests to ensure that it is wired up correctly for QueryLifecycle, including the context authorization stuff.

The added test shows the strange behavior if query context authorization is enabled, mainly that the system default and system generated query context keys also need to be granted as permissions for things to function correctly. This is not great, so I mentioned it in the javadocs as well. Not sure if it needs to be called out anywhere else.
2022-05-25 15:24:41 +05:30
Karan Kumar 9f9faeec81
object[] handling for DimensionHandlers for arrays (#12552)
Description
Fixes a bug when running q's like

 SELECT cntarray,
       Count(*)
FROM   (SELECT dim1,
               dim2,
               Array_agg(cnt) AS cntarray
        FROM   (SELECT dim1,
                       dim2,
                       dim3,
                       Count(*) AS cnt
                FROM   foo
                GROUP  BY 1,
                          2,
                          3)
        GROUP  BY 1,
                  2)
GROUP  BY 1  
This generates an error:

org.apache.druid.java.util.common.ISE: Unable to convert type [Ljava.lang.Object; to org.apache.druid.segment.data.ComparableList
        at org.apache.druid.segment.DimensionHandlerUtils.convertToList(DimensionHandlerUtils.java:405) ~[druid-xx]
Because it's an array of numbers it looks like it does the convertToList call, which looks like:

  @Nullable
  public static ComparableList convertToList(Object obj)
  {
    if (obj == null) {
      return null;
    }
    if (obj instanceof List) {
      return new ComparableList((List) obj);
    }
    if (obj instanceof ComparableList) {
      return (ComparableList) obj;
    }
    throw new ISE("Unable to convert type %s to %s", obj.getClass().getName(), ComparableList.class.getName());
  }
I.e. it doesn't know about arrays. Added the array handling as part of this PR.
2022-05-25 15:24:18 +05:30
Agustin Gonzalez 2f3d7a4c07
Emit state of replace and append for native batch tasks (#12488)
* Emit state of replace and append for native batch tasks

* Emit count of one depending on batch ingestion mode (APPEND, OVERWRITE, REPLACE)

* Add metric to compaction job

* Avoid null ptr exc when null emitter

* Coverage

* Emit tombstone & segment counts

* Tasks need a type

* Spelling

* Integrate BatchIngestionMode in batch ingestion tasks functionality

* Typos

* Remove batch ingestion type from metric since it is already in a dimension. Move IngestionMode to AbstractTask to facilitate having mode as a dimension. Add metrics to streaming. Add missing coverage.

* Avoid inner class referenced by sub-class inspection. Refactor computation of IngestionMode to make it more robust to null IOConfig and fix test.

* Spelling

* Avoid polluting the Task interface

* Rename computeCompaction methods to avoid ambiguous java compiler error if they are passed null. Other minor cleanup.
2022-05-23 12:32:47 -07:00
Gian Merlino 37853f8de4
ConcurrentGrouper: Add mergeThreadLocal option, fix bug around the switch to spilling. (#12513)
* ConcurrentGrouper: Add option to always slice up merge buffers thread-locally.

Normally, the ConcurrentGrouper shares merge buffers across processing
threads until spilling starts, and then switches to a thread-local model.
This minimizes memory use and reduces likelihood of spilling, which is
good, but it creates thread contention. The new mergeThreadLocal option
causes a query to start in thread-local mode immediately, and allows us
to experiment with the relative performance of the two modes.

* Fix grammar in docs.

* Fix race in ConcurrentGrouper.

* Fix issue with timeouts.

* Remove unused import.

* Add "tradeoff" to dictionary.
2022-05-21 10:28:54 -07:00
Gian Merlino 69aac6c8dd
Direct UTF-8 access for "in" filters. (#12517)
* Direct UTF-8 access for "in" filters.

Directly related:

1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
   ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
   necessary, the input set is copied into a ValuesSet. Much logic is
   simplified, because we always know what type the values set will be.
   I think that there won't even be an efficiency loss in most cases.
   InDimFilter is most frequently created by deserialization, and this
   patch updates the JsonCreator constructor to deserialize
   directly into a ValuesSet.

2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
   during index lookups.

3) Add unsigned comparator to ByteBufferUtils and use it in
   GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
   bytes can be compared as bytes if, and only if, the comparison
   is unsigned.

4) Add specialization to GenericIndexed.singleThreaded().indexOf that
   avoids needless ByteBuffer allocations.

5) Clarify that objects returned by ColumnIndexSupplier.as are not
   thread-safe. DictionaryEncodedStringIndexSupplier now calls
   singleThreaded() on all relevant GenericIndexed objects, saving
   a ByteBuffer allocation per access.

Also:

1) Fix performance regression in LikeFilter: since #12315, it applied
   the suffix matcher to all values in range even for type MATCH_ALL.

2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
   which was broken due to calls to strategy.compare in
   GenericIndexed.fromIterable.

* Add like-filter implementation tests.

* Add in-filter implementation tests.

* Add tests, fix issues.

* Fix style.

* Adjustments from review.
2022-05-20 01:51:28 -07:00
machine424 90531fd53f
Do not alter query timeout in ScanQueryEngine (#12271)
Add test to detect timeout mutability
2022-05-19 09:24:42 -07:00
Gian Merlino 4631cff2a9
Free ByteBuffers in tests and fix some bugs. (#12521)
* Ensure ByteBuffers allocated in tests get freed.

Many tests had problems where a direct ByteBuffer would be allocated
and then not freed. This is bad because it causes flaky tests.

To fix this:

1) Add ByteBufferUtils.allocateDirect(size), which returns a ResourceHolder.
   This makes it easy to free the direct buffer. Currently, it's only used
   in tests, because production code seems OK.

2) Update all usages of ByteBuffer.allocateDirect (off-heap) in tests either
   to ByteBuffer.allocate (on-heap, which are garbaged collected), or to
   ByteBufferUtils.allocateDirect (wherever it seemed like there was a good
   reason for the buffer to be off-heap). Make sure to close all direct
   holders when done.

* Changes based on CI results.

* A different approach.

* Roll back BitmapOperationTest stuff.

* Try additional surefire memory.

* Revert "Roll back BitmapOperationTest stuff."

This reverts commit 49f846d9e3.

* Add TestBufferPool.

* Revert Xmx change in tests.

* Better behaved NestedQueryPushDownTest. Exit tests on OOME.

* Fix TestBufferPool.

* Remove T1C from ARM tests.

* Somewhat safer.

* Fix tests.

* Fix style stuff.

* Additional debugging.

* Reset null / expr configs better.

* ExpressionLambdaAggregatorFactory thread-safety.

* Alter forkNode to try to get better info when a JVM crashes.

* Fix buffer retention in ExpressionLambdaAggregatorFactory.

* Remove unused import.
2022-05-19 07:42:29 -07:00
Gian Merlino 5b6727f319
Enable vectorized virtual column processing by default. (#12520)
In the majority of cases, this improves performance.

There's only one case I'm aware of where this may be a net negative: for time_floor(__time, <period>) where there are many repeated __time values. In nonvectorized processing, SingleLongInputCachingExpressionColumnValueSelector implements an optimization to avoid computing the time_floor function on every row. There is no such optimization in vectorized processing.

IMO, we shouldn't mention this in the docs. Rationale: It's too fiddly of a thing: it's not guaranteed that nonvectorized processing will be faster due to the optimization, because it would have to overcome the inherent speed advantage of vectorization. So it'd always require testing to determine the best setting for a specific dataset. It would be bad if users disabled vectorization thinking it would speed up their queries, and it actually slowed them down. And even if users do their own testing, at some point in the future we'll implement the optimization for vectorized processing too, and it's likely that users that explicitly disabled vectorization will continue to have it disabled. I'd like to avoid this outcome by encouraging all users to enable vectorization at all times. Really advanced users would be following development activity anyway, and can read this issue
2022-05-16 15:43:53 +05:30
Gian Merlino ff253fd8a3
Add setProcessingThreadNames context parameter. (#12514)
setting thread names takes a measurable amount of time in the case where segment scans are very quick. In high-QPS testing we found a slight performance boost from turning off processing thread renaming. This option makes that possible.
2022-05-16 13:42:00 +05:30
Abhishek Radhakrishnan 9177515be2
Add IPAddress java library as dependency and migrate IPv4 functions to use the new library. (#11634)
* Add ipaddress library as dependency.

* IPv4 functions to use the inet.ipaddr package.

* Remove unused imports.

* Add new function.

* Minor rename.

* Add more unit tests.

* IPv4 address expr utils unit tests and address options.

* Adjust the IPv4Util functions.

* Move the UTs a bit around.

* Javadoc comments.

* Add license info for IPAddress.

* Fix groupId, artifact and version in license.yaml.

* Remove redundant subnet in messages - fixes UT.

* Remove unused commons-net dependency for /processing project.

* Make class and methods public so it can be accessed.

* Add initial version of benchmark

* Add subnetutils package for benchmarks.

* Auto generate ip addresses.

* Add more v4 address representations in setup to avoid bias.

* Use ThreadLocalRandom to avoid forbidden API usage.

* Adjust IPv4AddressBenchmark to adhere to codestyle rules.

* Update ipaddress library to latest 5.3.4

* Add ipaddress package dependency to benchmarks project.
2022-05-11 22:06:20 -07:00
Clint Wylie 9e5a940cf1
remake column indexes and query processing of filters (#12388)
Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.
2022-05-11 11:57:08 +05:30
Rohan Garg 75836a5a06
Add feature flag for sql planning of TimeBoundary queries (#12491)
* Add feature flag for sql planning of TimeBoundary queries

* fixup! Add feature flag for sql planning of TimeBoundary queries

* Add documentation for enableTimeBoundaryPlanning

* fixup! Add documentation for enableTimeBoundaryPlanning
2022-05-10 15:23:42 +05:30
somu-imply c68388ebcd
Vectorized version of string last aggregator (#12493)
* Vectorized version of string last aggregator

* Updating string last and adding testcases

* Updating code and adding testcases for serializable pairs

* Addressing review comments
2022-05-09 17:02:38 -07:00
Rohan Garg 2dd073c2cd
Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation (#12484)
* Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* fixup! Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* Document vectorized dimension
2022-05-09 10:40:17 -07:00
Gian Merlino 529b983ad0
GroupBy: Reduce allocations by reusing entry and key holders. (#12474)
* GroupBy: Reduce allocations by reusing entry and key holders.

Two main changes:

1) Reuse Entry objects returned by various implementations of
   Grouper.iterator.

2) Reuse key objects contained within those Entry objects.

This is allowed by the contract, which states that entries must be
processed and immediately discarded. However, not all call sites
respected this, so this patch also updates those call sites.

One particularly sneaky way that the old code retained entries too long
is due to Guava's MergingIterator and CombiningIterator. Internally,
these both advance to the next value prior to returning the current
value. So, this patch addresses that in two ways:

1) For merging, we have our own implementation MergeIterator already,
   although it had the same problem. So, this patch updates our
   implementation to return the current item prior to advancing to the
   next item. It also adds a forbidden-api entry to ensure that this
   safer implementation is used instead of Guava's.

2) For combining, we address the problem in a different way: by copying
   the key when creating the new, combined entry.

* Attempt to fix test.

* Remove unused import.
2022-04-28 23:21:13 -07:00
Gian Merlino a2bad0b3a2
Reduce allocations due to Jackson serialization. (#12468)
* Reduce allocations due to Jackson serialization.

This patch attacks two sources of allocations during Jackson
serialization:

1) ObjectMapper.writeValue and JsonGenerator.writeObject create a new
   DefaultSerializerProvider instance for each call. It has lots of
   fields and creates pressure on the garbage collector. So, this patch
   adds helper functions in JacksonUtils that enable reuse of
   SerializerProvider objects and updates various call sites to make
   use of this.

2) GroupByQueryToolChest copies the ObjectMapper for every query to
   install a special module that supports backwards compatibility with
   map-based rows. This isn't needed if resultAsArray is set and
   all servers are running Druid 0.16.0 or later. This release was a
   while ago. So, this patch disables backwards compatibility by default,
   which eliminates the need to copy the heavyweight ObjectMapper. The
   patch also introduces a configuration option that allows admins to
   explicitly enable backwards compatibility.

* Add test.

* Update additional call sites and add to forbidden APIs.
2022-04-27 14:17:26 -07:00
somu-imply 027935dcff
Vectorize numeric latest aggregators (#12439)
* Vectorizing Latest aggregator Part 1

* Updating benchmark tests

* Changing appropriate logic for vectors for null handling

* Introducing an abstract class and moving the commonalities there

* Adding vectorization for StringLast aggregator (initial version)

* Updated bufferized version of numeric aggregators

* Adding some javadocs

* Making sure this PR vectorizes numeric latest agg only

* Adding another benchmarking test

* Fixing intellij inspections

* Adding tests for double

* Adding test cases for long and float

* Updating testcases

* Checkstyle oops..

* One tiny change in test case

* Fixing spotbug and rhs not being used
2022-04-26 11:33:08 -07:00
Will Xu 4868ef9529
Enable Arm builds (#12451)
This PR enables ARM builds on Travis. I've ported over the changes from @martin-g on reducing heap requirements for some of the tests to ensure they run well on Travis arm instances.
2022-04-26 20:14:40 +05:30
Rohan Garg 95694b5afa
Convert simple min/max SQL queries on __time to timeBoundary queries (#12472)
* Support array based results in timeBoundary query

* Fix bug with query interval in timeBoundary

* Convert min(__time) and max(__time) SQL queries to timeBoundary

* Add tests for timeBoundary backed SQL queries

* Fix query plans for existing tests

* fixup! Convert min(__time) and max(__time) SQL queries to timeBoundary

* fixup! Add tests for timeBoundary backed SQL queries

* fixup! Fix bug with query interval in timeBoundary
2022-04-25 08:18:58 -07:00
Jihoon Son 73ce5df22d
Add support for authorizing query context params (#12396)
The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below.

Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params.
User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters.
System context params. They are set by the Druid query engine during query processing. These params override other context params.
Today, any context params are allowed to users. This can cause 
1) a bad UX if the context param is not matured yet or 
2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows.

This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission.

{
  "resourceAction" : {
    "resource" : {
      "name" : "maxSubqueryRows",
      "type" : "QUERY_CONTEXT"
    },
    "action" : "WRITE"
  },
  "resourceNamePattern" : "maxSubqueryRows"
}
Each role can have multiple permissions for context params. Each permission should be set for different context params.

When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case,

HTTP endpoints will return 403 response code.
JDBC will throw ForbiddenException.
Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService.

The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.
2022-04-21 14:21:16 +05:30
Rohan Garg 4c6ba73823
Emit vectorized metric dimension by default (#12464) 2022-04-20 21:14:55 -07:00
Maytas Monsereenusorn c25a556827
Fix bug in auto compaction preserveExistingMetrics feature (#12438)
* fix bug

* fix test

* fix IT
2022-04-15 15:47:47 -07:00
Clint Wylie 5824ab9608
fix issue with boolean expression input (#12429) 2022-04-13 16:34:01 -07:00
Jihoon Son 5e5625f3ae
Fix indexMerger to respect the includeAllDimensions flag (#12428)
* Fix indexMerger to respect flag includeAllDimensions flag; jsonInputFormat should set keepNullColumns if useFieldDiscovery is set

* address comments
2022-04-13 12:43:11 -07:00
Maytas Monsereenusorn 8edea5a82d
Add a new flag for ingestion to preserve existing metrics (#12185)
* add impl

* add impl

* fix checkstyle

* add impl

* add unit test

* fix stuff

* fix stuff

* fix stuff

* add unit test

* add more unit tests

* add more unit tests

* add IT

* add IT

* add IT

* add IT

* add ITs

* address comments

* fix test

* fix test

* fix test

* address comments

* address comments

* address comments

* fix conflict

* fix checkstyle

* address comments

* fix test

* fix checkstyle

* fix test

* fix test

* fix IT
2022-04-08 11:02:02 -07:00
Paul Rogers 2cc2088720
Method to specify eternity in the scan query builder (#12223)
* Method to specify eternity in the scan query builder

* Fix checkstyle issue

* Renamed eterity() to eternityInterval()

* Minor fixes
2022-04-04 15:11:32 -07:00
somu-imply a1ea658115
Introducing a new config to ignore nulls while computing String Cardinality (#12345)
* Counting nulls in String cardinality with a config

* Adding tests for the new config

* Wrapping the vectorize part to allow backward compatibility

* Adding different tests, cleaning the code and putting the check at the proper position, handling hasRow() and hasValue() changes

* Updating testcase and code

* Adding null handling test to improve coverage

* Checkstyle fix

* Adding 1 more change in docs

* Making docs clearer
2022-03-29 14:31:36 -07:00
Jihoon Son b6eeef31e5
Store null columns in the segments (#12279)
* Store null columns in the segments

* fix test

* remove NullNumericColumn and unused dependency

* fix compile failure

* use guava instead of apache commons

* split new tests

* unused imports

* address comments
2022-03-23 16:54:04 -07:00
Adarsh Sanjeev ef45a1551e
Convert inQueryThreshold into query context parameter. (#12357)
Added Calcites InQueryThreshold as a query context parameter. Setting this parameter appropriately reduces the time taken for queries with large number of values in their IN conditions.
2022-03-22 18:33:57 +05:30
Xavier Léauté 1f0447e613
fix use of deprecated initMocks method (#12351)
follow-up to #12341
- fix use of deprecated initMocks methods and properly close mocks on teardown
2022-03-19 10:19:02 -07:00
somu-imply b5195c5095
Graceful null handling and correctness in DoubleMean Aggregator (#12320)
* Adding null handling for double mean aggregator

* Updating code to handle nulls in DoubleMean aggregator

* oops last one should have checkstyle issues. fixed

* Updating some code and test cases

* Checking on object is null in case of numeric aggregator

* Adding one more test to improve coverage

* Changing one test as asked in the review

* Changing one test as asked in the review for nulls
2022-03-14 16:52:47 -07:00
mchades 3de1272926
bug fix: merge results of group by limit push down (#11969) 2022-03-11 09:04:34 -08:00
Gian Merlino cb2b2b696d
Fix error message for groupByEnableMultiValueUnnesting. (#12325)
* Fix error message for groupByEnableMultiValueUnnesting.

It referred to the incorrect context parameter.

Also, create a dedicated exception class, to allow easier detection of this
specific error.

* Fix other test.

* More better error messages.

* Test getDimensionName method.
2022-03-10 11:37:24 -08:00
Clint Wylie 9cfb23935f
push value range and set index get operations into BitmapIndex (#12315)
* push value range and set index get operations into BitmapIndex

* fix bug

* oops, fix better

* better like, fix test, javadocs

* fix checkstyle

* simplify and fixes

* cache

* fix tests

* move indexOf into GenericIndexed

* oops

* fix tests
2022-03-09 13:30:58 -08:00
Rohan Garg 9f6a930462
Fix join query incase of filter explosion during CNF conversion (#12324) 2022-03-09 12:43:09 -08:00
Clint Wylie dc0372a28e
improve FileWriteOutBytes.readFully (#12323)
* improve FileWriteOutBytes.readFully

* no need to flush if out of bounds
2022-03-09 11:45:45 -08:00
Rohan Garg 56fbd2af6f
Guard against exponential increase of filters during CNF conversion (#12314)
Currently, the CNF conversion of a filter is unbounded, which means that it can create as many filters as possible thereby also leading to OOMs in historical heap. We should throw an error or disable CNF conversion if the filter count starts getting out of hand. There are ways to do CNF conversion with linear increase in filters as well but that has been left out of the scope of this change since those algorithms add new variables in the predicate - which can be contentious.
2022-03-09 13:19:52 +05:30
Agustin Gonzalez abe76ccb90
Batch ingestion replace (#12137)
* Tombstone support for replace functionality

* A used segment interval is the interval of a current used segment that overlaps any of the input intervals for the spec

* Update compaction test to match replace behavior

* Adapt ITAutoCompactionTest to work with tombstones rather than dropping segments. Add support for tombstones in the broker.

* Style plus simple queriableindex test

* Add segment cache loader tombstone test

* Add more tests

* Add a method to the LogicalSegment to test whether it has any data

* Test filter with some empty logical segments

* Refactor more compaction/dropexisting tests

* Code coverage

* Support for all empty segments

* Skip tombstones when looking-up broker's timeline. Discard changes made to tool chest to avoid empty segments since they will no longer have empty segments after lookup because we are skipping over them.

* Fix null ptr when segment does not have a queriable index

* Add support for empty replace interval (all input data has been filtered out)

* Fixed coverage & style

* Find tombstone versions from lock versions

* Test failures & style

* Interner was making this fail since the two segments were consider equal due to their id's being equal

* Cleanup tombstone version code

* Force timeChunkLock whenever replace (i.e. dropExisting=true) is being used

* Reject replace spec when input intervals are empty

* Documentation

* Style and unit test

* Restore test code deleted by mistake

* Allocate forces TIME_CHUNK locking and uses lock versions. TombstoneShardSpec added.

* Unused imports. Dead code. Test coverage.

* Coverage.

* Prevent killer from throwing an exception for tombstones. This is the killer used in the peon for killing segments.

* Fix OmniKiller + more test coverage.

* Tombstones are now marked using a shard spec

* Drop a segment factory.json in the segment cache for tombstones

* Style

* Style + coverage

* style

* Add TombstoneLoadSpec.class to mapper in test

* Update core/src/main/java/org/apache/druid/segment/loading/TombstoneLoadSpec.java

Typo

Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>

* Update docs/configuration/index.md

Missing

Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>

* Typo

* Integrated replace with an existing test since the replace part was redundant and more importantly, the test file was very close or exceeding the 10 min default "no output" CI Travis threshold.

* Range does not work with multi-dim

Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>
2022-03-08 20:07:02 -07:00
Clint Wylie dae53ae36a
adjust topn heap operation when string is dictionary encoded, but not uniquely (#12291)
* add topn heap optimization when string is dictionary encoded, but not uniquely

* use array instead

* is same

* fix javadoc

* fix

* Update StringTopNColumnAggregatesProcessor.java
2022-03-08 14:32:40 -08:00
Gian Merlino 875e0696e0
GroupBy: Cap dictionary-building selector memory usage. (#12309)
* GroupBy: Cap dictionary-building selector memory usage.

New context parameter "maxSelectorDictionarySize" controls when the
per-segment processing code should return early and trigger a trip
to the merge buffer.

Includes:

- Vectorized and nonvectorized implementations.
- Adjustments to GroupByQueryRunnerTest to exercise this code in
  the v2SmallDictionary suite. (Both the selector dictionary and
  the merging dictionary will be small in that suite.)
- Tests for the new config parameter.

* Fix issues from tests.

* Add "pre-existing" to dictionary.

* Simplify GroupByColumnSelectorStrategy interface by removing one of the writeToKeyBuffer methods.

* Adjustments from review comments.
2022-03-08 13:13:11 -08:00
Gian Merlino 3b373114dc
Officially support Java 11. (#12232)
There aren't any changes in this patch that improve Java 11
compatibility; these changes have already been done separately. This
patch merely updates documentation and explicit Java version checks.

The log message adjustments in DruidProcessingConfig are there to make
things a little nicer when running in Java 11, where we can't measure
direct memory _directly_, and so we may auto-size processing buffers
incorrectly.
2022-03-04 14:15:45 -08:00
Clint Wylie 1c004ea47e
use virtual columns for sql simple aggregators instead of inline expressions (#12251)
* use virtual columns for sql simple aggregators instead of inline expressions

* fixes

* always use virtual columns

* add more tests
2022-03-03 15:05:28 -08:00
Tejaswini Bandlamudi 1af4c9c933
Display row stats for multiphase parallel indexing tasks (#12280)
Row stats are reported for single phase tasks in the `/liveReports` and `/rowStats` APIs
and are also a part of the overall task report. This commit adds changes to report
row stats for multiphase tasks too.

Changes:
- Add `TaskReport` in `GeneratedPartitionsReport` generated during hash and range partitioning
- Collect the reports for `index_generate` phase in `ParallelIndexSupervisorTask`
2022-03-02 10:10:31 +05:30
Jihoon Son e5ad862665
A new includeAllDimension flag for dimensionsSpec (#12276)
* includeAllDimensions in dimensionsSpec

* doc

* address comments

* unused import and doc spelling
2022-02-25 18:27:48 -08:00
Jason Koch eb1b53b7f8
perf: indexing: Introduce a bulk getValuesInto function to read values (#12105)
* perf: indexing: Introduce a bulk getValuesInto function to read values in bulk

If large number of values are required from DimensionDictionary
during indexing, fetch them all in a single lock/unlock instead of
lock/unlock each individual item.

* refactor: rename key to keys in function args

* fix: check explicitly that argument length on arrays match

* refactor: getValuesInto renamed to getValues, now creates and returns a new T[] rather than filling
2022-02-25 12:19:04 -08:00
Karan Kumar 5794331eb1
Adding new config for disabling group by on multiValue column (#12253)
As part of #12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key.
Added a config groupByEnableMultiValueUnnesting which can be set in the query context.

The default value of groupByEnableMultiValueUnnesting is true, therefore it does not change the current engine behavior.
If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.
2022-02-16 20:53:26 +05:30
somu-imply eae163a797
Moving in filter check to broker (#12195)
* Moving in filter check to broker

* Adding more unit tests, making error message meaningful

* Spelling and doc changes

* Updating default to -1 and making this feature hide by default. The number of IN filters can grow upto a max limit of 100

* Removing upper limit of 100, updated docs

* Making documentation more meaningful

* Moving check outside to PlannerConfig, updating test cases and adding back max limit

* Updated with some additional code comments

* Missed removing one line during the checkin

* Addressing doc changes and one forbidden API correction

* Final doc change

* Adding a speling exception, correcting a testcase

* Reading entire filter tree to address combinations of ANDs and ORs

* Specifying in docs that, this case works only for ORs

* Revert "Reading entire filter tree to address combinations of ANDs and ORs"

This reverts commit 81ca8f8496.

* Covering a class cast exception and updating docs

* Counting changed

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2022-02-15 20:45:07 -08:00
Jason Koch 26bc4b7345
perf: cache row if it is a transformed row (#12113)
* perf: cache row if it is a transformed row

* perf: cache row if it is a transformed row (also cache DateTime object)
2022-02-15 10:08:41 -08:00
somu-imply 033989eb1d
Adding vectorized time_shift (#12254)
* Adding vectorized time_shift

* Vectorize time shift, addressing review comments

* Remove an unused import
2022-02-11 14:44:52 -08:00
Clint Wylie 3ee66bb492
allow optimizing sql expressions and virtual columns (#12241)
* rework sql planner expression and virtual column handling

* simplify a bit

* add back and deprecate old methods, more tests, fix multi-value string coercion bug and associated tests

* spotbugs

* fix bugs with multi-value string array expression handling

* javadocs and adjust test

* better

* fix tests
2022-02-09 14:55:50 -08:00
Clint Wylie ae71e05fc5
array_concat_agg and array_agg support for array inputs (#12226)
* array_concat_agg and array_agg support for array inputs
changes:
* added array_concat_agg to aggregate arrays into a single array
* added array_agg support for array inputs to make nested array
* added 'shouldAggregateNullInputs' and 'shouldCombineAggregateNullInputs' to fix a correctness issue with STRING_AGG and ARRAY_AGG when merging results, with dual purpose of being an optimization for aggregating

* fix test

* tie capabilities type to legacy mode flag about coercing arrays to strings

* oops

* better javadoc
2022-02-07 19:59:30 -08:00
Gian Merlino de82c611de
Harmonize implementations of "visit" for Exprs from ExprMacros. (#12230)
* Harmonize implementations of "visit" for Exprs from ExprMacros.

Many of them had bugs where they would not visit all of the original
arguments. I don't think this has user-visible consequences right now,
but it's possible it would in a future world where "visit" is used
for more stuff than it is today.

So, this patch all updates all implementations to a more consistent
style that emphasizes reapplying the macro to the shuttled args.

* Test fixes, test coverage, PR review comments.
2022-02-04 08:08:54 -08:00
Clint Wylie a3affe1471
make EncodedKeyComponent constructor public, remove nullable from DimensionIndexer.processRowValsToUnsortedEncodedKeyComponent (#12229) 2022-02-03 15:02:32 -08:00
Kashif Faraz e648b01afb
Improve memory estimates in Aggregator and DimensionIndexer (#12073)
Fixes #12022  

### Description
The current implementations of memory estimation in `OnHeapIncrementalIndex` and `StringDimensionIndexer` tend to over-estimate which leads to more persistence cycles than necessary.

This PR replaces the max estimation mechanism with getting the incremental memory used by the aggregator or indexer at each invocation of `aggregate` or `encode` respectively.

### Changes
- Add new flag `useMaxMemoryEstimates` in the task context. This overrides the same flag in DefaultTaskConfig i.e. `druid.indexer.task.default.context` map
- Add method `AggregatorFactory.factorizeWithSize()` that returns an `AggregatorAndSize` which contains
  the aggregator instance and the estimated initial size of the aggregator
- Add method `Aggregator.aggregateWithSize()` which returns the incremental memory used by this aggregation step
- Update the method `DimensionIndexer.processRowValsToKeyComponent()` to return the encoded key component as well as its effective size in bytes
- Update `OnHeapIncrementalIndex` to use the new estimations only if `useMaxMemoryEstimates = false`
2022-02-03 10:34:02 +05:30
Clint Wylie f9b406c8f2
add backwards compatibility mode for multi-value string array null value coercion (#12210) 2022-01-31 22:38:15 -08:00
Jihoon Son eeed156dc0
Fix compile error in VirtualizedColumnSelectorFactoryTest (#12208) 2022-01-27 17:35:50 -08:00
Gian Merlino 99a5c2f3d3
Harmonize behavior when virtual columns reference each other. (#11955)
* VirtualizedColumnSelectorFactory: Allow virtual columns to reference each other.

This matches the behavior of QueryableIndex and IncrementalIndex based cursors.

* Fixes to getColumnCapabilities.
2022-01-27 14:31:48 -08:00
Karan Kumar 96b3498a40
Grouping on arrays as arrays (#12078)
* init multiValue column group by

* Changing sorting to Lexicographic as default

* Adding initial tests

* 1.Fixing test cases adding
2.Optimized inmem structs

* Linking SQL layer to native layer

* Adding multiDimension support to group by column strategy

* 1. Removing array coercion in Calcite layer
2. Removing ResultRowDeserializer

* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector

* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector

* 1. Checkstyle things
2. Removing flag

* Minor naming things

* CheckStyle Things

* Fixing test case

* Fixing hashing

* 1. Adding the MV function
2. Added few test cases

* 1. Adding MV function test cases

* Adding Selector strategy function test cases

* Fixing ClientQuerySegmentWalkerTest

* Adding GroupByQueryRunnerTest test cases

* Fixing test cases

* Adding few more test cases

* Fixing Exception asset statement and intellij inspection

* Adding null compatibility tests

* Review comments

* Fixing few failing tests

* Fixing few failing tests

* Do no convert to topN Q incase of group by on array

* Fixing checkstyle

* Fixing differences between jdk's class cast exception message

* 1. Fixing ordering if the grouping key is an array

* Fixing DefaultLimitSpec

* Fixing CalciteArraysQueryTest

* Dummy commit for LGTM

* changes:
* only coerce multi-value string null values when `ExpressionPlan.Trait.NEEDS_APPLIED` is set
* correct return type inference for ARRAY_APPEND,ARRAY_PREPEND,ARRAY_SLICE,ARRAY_CONCAT
* fix bug with ExprEval.ofType when actual type of object from binding doesn't match its claimed type

* Review comments

* Fixing test cases

* Fixing spot bugs

* Fixing strict compile

Co-authored-by: Clint Wylie <cwylie@apache.org>
2022-01-25 20:30:56 -08:00
Clint Wylie fce62b2643
fix StringAnyAggregatorFactory to use single value selector for non-existent columns (#12194) 2022-01-25 12:52:30 -08:00
somu-imply cc8b9c0b6e
Handling OOM error in ExpressionVector setup by reducing number of rows (#12186)
* Handling OOM error in ExpressionVector setup by reducing number of rows

* Removing row size to 10K in sanity tests
2022-01-24 08:37:13 -08:00
Clint Wylie e0c4c568cb
fix incorrect ColumnInspector in IncrementalIndex.makeColumnSelectorFactory (#12155) 2022-01-13 18:09:06 -08:00
Clint Wylie f2ce76966c
add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures less ambiguous (#12145)
* add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous

* switcheroo

* EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs

* revert unintended change

* fix docs

* fix docs better
2022-01-12 03:48:53 -08:00
Rohan Garg 81f0aba6cb
Use ListFilteredVirtualColumn for left/fact table expression in join condition (#12127)
* Pass VirtualColumnRegistry in PlannerContext for join expression planning

* Allow for including VCs from join fact table expression

* Optmize MV_FILTER functions to use a VC when in join fact table expression

* fixup! Allow for including VCs from join fact table expression

* Address review comments
2022-01-11 14:47:13 -08:00
imply-cheddar eb0bae49ec
Update PostAggregator to be backwards compat (#12138)
This change mimics what was done in PR #11917 to
fix the incompatibilities produced by #11713. #11917
fixed it with AggregatorFactory by creating default
methods to allow for extensions built against old
jars to still work.  This does the same for PostAggregator
2022-01-11 02:18:14 -08:00
Clint Wylie 7cf9192765
fix delegated smoosh writer and some new facilities for segment writeout medium (#12132)
* fix delegated smoosh writer and some new facilities for segment writeout medium
changes:
* fixed issue with delegated `SmooshedWriter` when writing files that look like paths, causing `NoSuchFileException` exceptions when attempting to open a channel to the file
* `FileSmoosher.addWithSmooshedWriter` when _not_ delegating now checks that it is still open when closing, making it a no-op if already closed (allowing column serializers to add additional files and avoid delegated mode if they are finished writing out their own content and ned to add additional files)
* add `makeChildWriteOutMedium` to `SegmentWriteOutMedium` interface, which allows users of a shared medium to clean up `WriteOutBytes` if they fully control the lifecycle. there are no callers of this yet, adding for future functionality
* `OnHeapByteBufferWriteOutBytes` now can be marked as not open so it `OnHeapMemorySegmentWriteOutMedium` can now behave identically to other medium implementations

* fix to address nit - use AtomicLong
2022-01-10 22:25:19 -08:00
Clint Wylie e583033231
add 'TypeStrategy' to types (#11888)
* add TypeStrategy - value comparators and binary serialization for any TypeSignature
2022-01-10 17:12:14 -08:00
somu-imply c267b65f97
Removing unused processing threadpool on broker (#12070)
* Thread pool for broker

* Updating two tests to improve coverage for new method added

* Updating druidProcessingConfigTest to cover coverage

* Adding missed spelling errors caused in doc

* Adding test to cover lines of new function added
2021-12-21 13:07:53 -08:00
Abhishek Agarwal 5d043cefbc
Fix test in ResponseContextTest (#12077) 2021-12-16 22:51:51 -08:00
Clint Wylie 244c2559e9
fix IncrementalIndex performance regression (#12048)
changes:
* IncrementalIndex is now a ColumnInspector
* fixes performance regression from using map of ColumnCapabilities from IncrementalIndex as a RowSignature
2021-12-09 22:04:32 -08:00
Jonathan Wei 229f82a6f0
Add parse error list API for stream supervisors, use structured object for parse exceptions, simplify parse exception message (#11961)
* Add parse error list API for stream supervisors, simplify parse exception message

* Add input string to parse exception

* Use structured ParseExceptionReport

* Fix tests

* Add test

* PR comments, add ParseExceptionReport equals verifier

* Fix test
2021-12-09 15:42:55 -06:00
Laksh Singla ca260dfef6
Intern RowSignature in DruidSchema to reduce its memory footprint (#12001)
DruidSchema consists of a concurrent HashMap of DataSource -> Segement -> AvailableSegmentMetadata. AvailableSegmentMetadata contains RowSignature of the segment, and for each segment, a new object is getting created. RowSignature is an immutable class, and hence it can be interned, and this can lead to huge savings of memory being used in broker, since a lot of the segments of a table would potentially have same RowSignature.
2021-12-08 15:11:13 +05:30
Clint Wylie 45be2be368
fix issues with multi-value string constant expressions (#12025)
* add specialized constant selector for multi-valued string constants
2021-12-08 00:10:26 -08:00
Clint Wylie a8815f671e
Fix druid client timeout zero (#12023)
* fix bug where queries fail immediately when timeout is 0 instead of using default timeout

* fix to use serverside max

* more better

* less flaky test

* oops
2021-12-07 12:41:01 -08:00
Paul Rogers 34a3d45737
Refactor ResponseContext (#11828)
* Refactor ResponseContext

Fixes a number of issues in preparation for request trailers
and the query profile.

* Converts keys from an enum to classes for smaller code
* Wraps stored values in functions for easier capture for other uses
* Reworks the "header squeezer" to handle types other than arrays.
* Uses metadata for visibility, and ability to compress,
  to replace ad-hoc code.
* Cleans up JSON serialization for the response context.
* Other miscellaneous cleanup.

* Handle unknown keys in deserialization

Also, make "Visibility" into a boolean.

* Revised comment

* Renamd variable
2021-12-06 17:03:12 -08:00
Clint Wylie 84b4bf56d8
vectorize logical operators and boolean functions (#11184)
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
2021-12-02 16:40:23 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Gian Merlino f6e6ca2893
Use intermediate-persist IndexSpec during multiphase merge. (#11940)
* Use intermediate-persist IndexSpec during multiphase merge.

The main change is the addition of an intermediate-persist IndexSpec
to the main "merge" method in IndexMerger. There are also a few minor
adjustments to the IndexMerger interface to encourage more harmonious
usage of its methods in the future.

* Additional changes inspired by the test coverage checker.

- Remove unused-in-production IndexMerger methods "append" and "convert".
- Add additional unit tests to UnifiedIndexerAppenderatorsManager.

* Additional adjustments.

* Even more additional adjustments.

* Test fixes.
2021-11-29 15:08:49 -08:00
Gian Merlino 93aeaf4801
Improve on-heap aggregator footprint estimates. (#11950)
Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that
mitigates #6743 by enabling heap footprint estimates based on a specific
number of rows. The idea is that at ingestion time, the number of rows
that go into an aggregator will be 1 (if rollup is off) or will likely
be a small number (if rollup is on).

It's a heuristic, because of course nothing guarantees that the rollup
ratio is a small number. But it's a common case, and I expect this logic
to go wrong much less often than the current logic. Also, when it does
go wrong, users can fix it by lowering maxRowsInMemory or
maxBytesInMemory. The current situation is unintuitive: when the
estimation goes wrong, users get an OOME, but actually they need to
*raise* these limits to fix it.
2021-11-28 13:21:24 +05:30
Rohan Garg 2c08055962
Specify time column for first/last aggregators (#11949)
Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.
2021-11-25 09:44:14 +05:30
Gian Merlino 12e2228510
RowBasedGrouperHelper: Set hasMultipleValues = false in capabilities. (#11954)
Useful because it enables anything that consumes groupBy results to
potentially operate more efficiently.
2021-11-24 13:14:58 -08:00
Gian Merlino 5e168b861a
StorageAdapter: Add getRowSignature method. (#11953)
Simplifies logic for callers that only want to get a list of all the
column names, or column names and types. Updated callers SegmentAnalyzer,
HashJoinSegmentStorageAdapter, and DruidSegmentReader.
2021-11-24 13:14:25 -08:00
Gian Merlino 0354407655
SQL INSERT planner support. (#11959)
* SQL INSERT planner support.

The main changes are:

1) DruidPlanner is able to validate and authorize INSERT queries. They
   require WRITE permission on the target datasource.

2) QueryMaker is now an interface, and there is a QueryMakerFactory that
   creates instances of it. There is only one production implementation
   of each (NativeQueryMaker and NativeQueryMakerFactory), which
   together behave the same way as the former QueryMaker class. But this
   opens the door to executing queries in ways other than the Druid
   query stack, and is used by unit tests (CalciteInsertDmlTest) to
   test the INSERT planning functionality.

3) Adds an EXTERN table macro that allows references external data using
   InputSource and InputFormat from Druid's batch ingestion API. This is
   not exposed in production yet, but is used by unit tests.

4) Adds a QueryFeature concept that enables the planner to change its
   behavior slightly depending on the capabilities of the execution
   system.

5) Adds an "AuthorizableOperator" concept that enables SqlOperators
   to require additional permissions. This is used by the EXTERN table
   macro.

Related odds and ends:

- Add equals, hashCode, toString methods to InlineInputSource. Aids in
  the "from external" tests in CalciteInsertDmlTest.
- Add JSON-serializability to RowSignature.
- Move the SQL string inside PlannerContext so it is "baked into" the
  planner when the planner is created. Cleans up the code a bit, since
  in practice, the same query is passed in every time to the
  same planner anyway.

* Fix up calls to CalciteTests.createMockQueryLifecycleFactory.

* Fix checkstyle issues.

* Adjustments for CI.

* Adjust DruidAvaticaHandlerTest for stricter test authorizations.
2021-11-24 12:14:04 -08:00
Gian Merlino 35b610ada7
QueryableIndexColumnSelectorFactory: Double-check cached column class. (#11957)
Important because an earlier call to getCachedColumn may have been
done with a different class, leading to a ClassCastException on the
second call. In the prior code, this could happen if a complex column
had makeDimensionSelector called on it after makeColumnValueSelector had
already been called.
2021-11-22 11:31:24 -08:00
Gian Merlino d6507c9428
PrioritizedExecutorService: Properly wrap on direct calls to "execute". (#11956)
Usually, "execute" is called by methods defined in the superclass
AbstractExecutorService, and the passed-in Runnable has been wrapped
by newTaskFor inside a PrioritizedListenableFutureTask. But this method
can also be called directly, and if so, the same wrapping is necessary
for the delegate to get a Runnable that can be entered into a priority
queue with the others.
2021-11-22 10:30:12 -08:00
Clint Wylie f260bbed23
restore and deprecate AggregatorFactory methods (#11917)
* add back and deprecate aggregator factory methods so i can say i told you so when i delete these later

* rename to make less ambiguous, fix fill method

* adjust
2021-11-19 15:59:35 -08:00
Gian Merlino 36ee0367ff
Scan: Add "orderBy" parameter. (#11930)
* Scan: Add "orderBy" parameter.

This patch adds an API for requesting non-time orderings, although it
does not actually add the ability to execute such queries.

The changes are done in such a way that no matter how Scan query objects
are constructed, they will have a correct "getOrderBy". This will enable
us to switch the execution to exclusively use "getOrderBy" later on when
it's implemented.

Scan queries are serialized such that they only include "order" (time
order) if the ordering is time-based, and they only include "orderBy" if
the ordering is non-time-based. This maximizes compatibility with
the existing API while also providing a clean look for formatted queries.

Because this patch does not include execution logic, if someone actually
tries to run a query with non-time ordering, then they will get an error
like "Cannot execute query with orderBy [quality ASC]".

* SQL module fixes.

* Add spotbugs-exclude.

* Remove unused method.
2021-11-19 08:19:12 -08:00
Clint Wylie 7f0bede878
autocompaction support for complex dimensions (#11924)
* autocompaction support for complex dimensions

* more test
2021-11-16 15:57:44 -08:00
Clint Wylie 00c976a3fe
only get bitmap index for string dictionary encoded columns (#11925) 2021-11-16 15:50:02 -08:00
Kashif Faraz 223c5692a8
Add dimension partitioningType to metrics to track usage of different partitioning schemes (#11902)
Add method ShardSpec.getType() to get name of shard spec type
List all names of shard spec types in the interface ShardSpec itself
for easy reference and maintenance
Add dimension partitioningType to metric segment/added/bytes
2021-11-11 18:34:27 +05:30
Gian Merlino fe2f7742f7
Fix incorrect comparison in RowSignature. (#11905)
PR #11882 introduced a type comparison using ==, but while it was in flight,
another PR #11713 changed the type enum to a class. So the comparison should
properly be done with "equals".
2021-11-11 04:30:42 -08:00
Laksh Singla 57ed5127a7
Make subquery IDs more comprehensive (#11809)
There are 3 types of query IDs - id, subQueryId, sqlQueryId. Currently, whenever a query generates subqueries, the subquery's subQueryId is populated randomly. Also, subquery's Id is not set to the parent query Id. Therefore there is no way of linking the subqueries to the parent query, and one loses the ability to look at end to end view of the query.

This PR aims to implement following couple of things:

Populate the subqueries with it's parent's id (and sqlQueryId if present)
Populate the subqueryId such that it forms a hierarchical relationship amongs themselves. For example, if there is a query which launches a subquery, which in turn launches a couple of subqueries, then the ids and subQueryIds should have following structure.
2021-11-11 16:31:56 +05:30
Clint Wylie 5baa22148e
revert ColumnAnalysis type, add typeSignature and use it for DruidSchema (#11895)
* revert ColumnAnalysis type, add typeSignature and use it for DruidSchema

* review stuffs

* maybe null

* better maybe null

* Update docs/querying/segmentmetadataquery.md

* Update docs/querying/segmentmetadataquery.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* fix null right

* sad

* oops

* Update batch_hadoop_queries.json

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-11-10 18:46:29 -08:00
Gian Merlino 14b0b4aee2
RowBasedSegment: Use Sequence instead of Iterable. (#11886)
* RowBasedSegment: Use Sequence instead of Iterable.

The main reason this is good is that Sequences can include baggage that
must be closed after iteration is finished. This enables creating
RowBasedSegments on top of closeable sequences of rows.

To preserve the optimization that allows reversing a List without
copying it, this patch also makes SimpleSequence its own class and allows
extracting the Iterable that was used to create it.

* Fix tests.
2021-11-10 06:06:52 -08:00
Gian Merlino db4d157be6
Add Finalization option to RowSignature.addAggregators. (#11882)
* Add Finalization option to RowSignature.addAggregators.

This make type signatures more useful when the caller knows whether it will
be reading aggregation results in their finalized or intermediate types.

* Fix call site.
2021-11-10 06:05:29 -08:00
Clint Wylie a8805ab60d
add missing json type for ListFilteredVirtualColumn (#11887)
* add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again

* fixes

* ugly, but maybe this

* oops

* too many mappers
2021-11-09 17:25:12 -08:00
Gian Merlino 6c196a5ea2
Remove StorageAdapter.getColumnTypeName. (#11893)
* Remove StorageAdapter.getColumnTypeName.

It was only used by SegmentAnalyzer, and isn't necessary anymore due to
the recent improvements to ColumnCapabilities.

Also: tidy ColumnDescriptor.read slightly by removing an instanceof
check, and moving the relevant logic into ComplexColumnPartSerde.

* Fix spellings.
2021-11-09 15:18:07 -08:00
Gian Merlino 324d4374f6
HashJoinEngine: Fix extraneous advance of left cursor. (#11890)
This could happen for right or full outer joins in certain cases. Tests
weren't catching this because existing Cursor implementations generally
ignore extraneous calls to "advance". So, to help catch this in tests,
extra state validations are also added to RowWalker, which is used by
RowBasedSegment.
2021-11-09 11:34:11 -08:00
Gian Merlino babf00f8e3
Migrate File.mkdirs to FileUtils.mkdirp. (#11879)
* Migrate File.mkdirs to FileUtils.mkdirp.

* Remove unused imports.

* Fix LookupReferencesManager.

* Simplify.

* Also migrate usages of forceMkdir.

* Fix var name.

* Fix incorrect call.

* Update test.
2021-11-09 11:10:49 -08:00
Gian Merlino 945a341acd
RowBasedCursor: Add column-value-reuse optimization. (#11884)
* RowBasedCursor: Add column-value-reuse optimization.

Most of the logic is in RowBasedColumnSelectorFactory, although in this
patch its only user is RowBasedCursor. This improves performance of
features that use RowBasedSegment, like lookup and inline datasources.
It's especially helpful for inline datasources that contain lengthy
arrays, due to the fact that the transformed array can be reused.

* Changes from code review.

* Fixes for ColumnCapabilitiesImplTest.
2021-11-09 07:18:09 -08:00
Gian Merlino a5bd0b8cc0
RowAdapter: Add a default implementation for timestampFunction. (#11885)
Enables simpler implementations for adapters that want to treat the
timestamp as "just another column".
2021-11-08 10:25:13 -08:00
Clint Wylie 7237dc837c
complex typed expressions (#11853)
* complex typed expressions

* add built-in hll collector expressions to get coverage on druid-processing, more types, more better

* rampage!!!

* more javadoc

* adjustments

* oops

* lol

* remove unused dependency

* contradiction?

* more test
2021-11-08 00:33:06 -08:00
Clint Wylie 907e4ca0c5
use correct DimensionSpec with for column value selectors created from dictionary encoded column indexers (#11873)
* use correct dimension spec for column value selectors of dictionary encoded column indexers
2021-11-05 01:51:15 -07:00
Liran Funaro 9ca8f1ec97
Remove IncrementalIndex template modifier (#11160)
Co-authored-by: Liran Funaro <liran.funaro@verizonmedia.com>
2021-10-27 13:10:37 -07:00
Gian Merlino fc95c92806
Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124)
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.

This patch does the following:

- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
  required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
  had it. (Not all of them did, and that's OK, because it wasn't necessary
  anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.

Rationale for removing the offheap incremental index:

- It is only used in one rare scenario: groupBy v1 (which is non-default)
  in "useOffheap" mode (also non-default). So you have to go pretty deep
  into the wilderness to get this code to activate in production. It is
  never used during ingestion.
- Its existence complicates developer efforts to reason about how
  aggregators get used, because the way it uses buffer aggregators is so
  different from how every other query engine uses them.
- It doesn't have meaningful testing.

By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.

* Remove things that are now unused.

* Revert removal of getFloat, getLong, getDouble from BufferAggregator.

* OAK-related warnings, suppressions.

* Unused item suppressions.
2021-10-26 08:05:56 -07:00
Gian Merlino 98ecbb21cd
Remove CloseQuietly and migrate its usages to other methods. (#10247)
* Remove CloseQuietly and migrate its usages to other methods.

These other methods include:

1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions
   in RuntimeExceptions for callers that just want to avoid dealing with
   checked exceptions. Most usages were migrated to this method, because it
   looks like they were mainly attempts to avoid declaring a throws clause,
   and perhaps were unintentionally suppressing IOExceptions.
2) New method CloseableUtils.closeInCatch, designed to properly close something
   in a catch block without losing exceptions. Some usages from catch blocks
   were migrated here, when it seemed that they were intended to avoid checked
   exception handling, and did not really intend to also suppress IOExceptions.
3) New method CloseableUtils.closeAndSuppressExceptions, which sends all
   exceptions to a "chomper" that consumes them. Nothing is thrown or returned.
   The behavior is slightly different: with this method, _all_ exceptions are
   suppressed, not just IOExceptions. Calls that seemed like they had good
   reason to suppress exceptions were migrated here.
4) Some calls were migrated to try-with-resources, in cases where it appeared
   that CloseQuietly was being used to avoid throwing an exception in a finally
   block.

🎵 You don't have to go home, but you can't stay here... 🎵

* Remove unused import.

* Fix up various issues.

* Adjustments to tests.

* Fix null handling.

* Additional test.

* Adjustments from review.

* Fixup style stuff.

* Fix NPE caused by holder starting out null.

* Fix spelling.

* Chomp Throwables too.
2021-10-23 17:03:21 -07:00
Clint Wylie 02b2057371
extract generic dictionary encoded column indexing and merging stuffs (#11829)
* extract generic dictionary encoded column indexing and merging stuffs to pave the path towards supporting other types of dictionary encoded columns

* spotbugs and inspections fixes

* friendlier

* javadoc

* better name

* adjust
2021-10-22 17:31:22 -07:00
Clint Wylie 741b4ed516
add output type information to ExpressionPostAggregator (#11818)
* add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types

* add test for test for coverage

* simplify

* Remove unused imports.

Co-authored-by: Gian Merlino <gian@imply.io>
2021-10-22 13:52:51 -07:00
Alexander Saydakov 8cf1cbc4a9
latest datasketches-java and datasketches-memory (#11773)
* latest datasketches-java and datasketches-memory

* updated versions of datasketches-java and datasketches-memory

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2021-10-19 23:42:30 -07:00
Clint Wylie 187df58e30
better types (#11713)
* better type system

* needle in a haystack

* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support

* fixup merge

* more test

* fixup

* intern

* fix

* oops

* oops again

* ...

* more test coverage

* fix error message

* adjust interning, more javadocs

* oops

* more docs more better
2021-10-19 01:47:25 -07:00
Jonathan Wei 22b41ddbbf
Task reports for parallel task: single phase and sequential mode (#11688)
* Task reports for parallel task: single phase and sequential mode

* Address comments

* Add null check for currentSubTaskHolder
2021-09-16 13:58:11 -05:00
Clint Wylie 5e092ccb9b
add MV_FILTER_ONLY, MV_FILTER_NONE, ListFilteredVirtualColumn (#11650)
* add MV_FILTER_ONLY SQL function, and list filter virtual column

* MV_FILTER_NONE and more tests

* formatting

* o yeah, forgot can do easy thing

* style

* hmm why was that there

* test filtering on virtual column

* style

* meh

* do it right

* good bot
2021-09-16 09:31:53 -07:00
Clint Wylie bbb86c8731
more tests for LimitedBufferHashGrouper (#11654)
* more tests for LimitedBufferHashGrouper

* fix style
2021-09-08 16:31:34 -07:00
Clint Wylie 59d257816b
fix goldilocks bug with HashVectorGrouper improperly initializing memory (#11649)
* fix goldilocks bug with HashVectorGrouper improperly initializing memory that causes failure when there exists room to only grow one time

* fix unintended change

* cleanup
2021-09-02 02:25:26 -07:00
Jian Wang 3ff1c2b8ce
Fix bug which produces vastly inaccurate query results when forceLimitPushDown is enabled and order by clause has non grouping fields (#11097) 2021-09-01 21:19:38 -07:00
Jihoon Son 2a658acad4
Put sleep in an extension (#11632)
* Put sleep in an extension

* dependency
2021-08-25 01:27:45 -07:00
Kashif Faraz aaf0aaad8f
Enable routing of SQL queries at Router (#11566)
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.

This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.

To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
2021-08-13 18:44:39 +05:30
Clint Wylie 9af7ba9d2a
STRING_AGG SQL aggregator function (#11241)
* add string_agg

* oops

* style and fix test

* spelling

* fixup

* review stuffs
2021-08-10 13:47:09 -07:00
Maytas Monsereenusorn 3257913737
Improve query error logging (#11519)
* Improve query error logging

* add docs

* address comments

* address comments
2021-08-05 22:51:09 +07:00
Jihoon Son 8ba7f6a48c
Fix incorrect result of exact topN on an inner join with limit (#11517) 2021-07-31 15:55:49 -07:00
Xavier Léauté 4bca7f014e
update error-prone to 2.8.0 with fix for crashing check (#11494)
* error-prone 2.8.0 fixes https://github.com/google/error-prone/issues/2396
* fix for a few ignored return values
* fix unknown args in sub-modules
2021-07-29 09:13:46 -07:00
Kashif Faraz 8a4e27f51d
Select broker based on query context parameter `brokerService` (#11495)
This change allows the selection of a specific broker service (or broker tier) by the Router.

The newly added ManualTieredBrokerSelectorStrategy works as follows:

Check for the parameter brokerService in the query context. If this is a valid broker service, use it.
Check if the field defaultManualBrokerService has been set in the strategy. If this is a valid broker service, use it.
Move on to the next strategy
2021-07-27 20:56:05 +05:30
Lucas Capistrant 9767b42e85
Add a new metric query/segments/count that is not emitted by default (#11394)
* Add a new metric query/segments/count that is not emitted by default

* docs

* test the default implementation of the metric

* fix spelling error in docs

* document the fact that query retries will result in additional metric emissions

* update using recommended text from @jihoonson
2021-07-22 17:57:35 -07:00
Abhishek Agarwal ce1faa5635
Make SegmentLoader extensible and customizable (#11398)
This PR refactors the code related to segment loading specifically SegmentLoader and SegmentLoaderLocalCacheManager. SegmentLoader is marked UnstableAPI which means, it can be extended outside core druid in custom extensions. Here is a summary of changes

SegmentLoader returns an instance of ReferenceCountingSegment instead of Segment. Earlier, SegmentManager was wrapping Segment objects inside ReferenceCountingSegment. That is now moved to SegmentLoader. With this, a custom implementation can track the references of segments. It also allows them to create custom ReferenceCountingSegment implementations. For this reason, the constructor visibility in ReferenceCountingSegment is changed from private to protected.
SegmentCacheManager has two additional methods called - reserve(DataSegment) and release(DataSegment). These methods let the caller reserve or release space without calling SegmentLoader#getSegment. We already had similar methods in StorageLocation and now they are available in SegmentCacheManager too which wraps multiple locations.
Refactoring to simplify the code in SegmentCacheManager wherever possible. There is no change in the functionality.
2021-07-22 18:00:49 +05:30
kaijianding e39ff44481
improve groupBy query granularity translation with 2x query performance improve when issued from sql layer (#11379)
* improve groupBy query granularity translation when issued from sql layer

* fix style

* use virtual column to determine timestampResult granularity

* dont' apply postaggregators on compute nodes

* relocate constants

* fix order by correctness issue

* fix ut

* use more easier understanding code in DefaultLimitSpec

* address comment

* rollback use virtual column to determine timestampResult granularity

* fix style

* fix style

* address the comment

* add more detail document to explain the tradeoff

* address the comment

* address the comment
2021-07-11 10:22:47 -07:00
Clint Wylie 17efa6f556
add single input string expression dimension vector selector and better expression planning (#11213)
* add single input string expression dimension vector selector and better expression planning

* better

* fixes

* oops

* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing

* oops

* javadocs, renaming

* more javadocs

* benchmarks

* use string expression vector processor with vector size 1 instead of expr.eval

* better logging

* javadocs, surprising number of the the

* more

* simplify
2021-07-06 11:20:49 -07:00
Abhishek Agarwal 03a6a6d6e1
Replace Processing ExecutorService with QueryProcessingPool (#11382)
This PR refactors the code for QueryRunnerFactory#mergeRunners to accept a new interface called QueryProcessingPool instead of ExecutorService for concurrent execution of query runners. This interface will let custom extensions inject their own implementation for deciding which query-runner to prioritize first. The default implementation is the same as today that takes the priority of query into account. QueryProcessingPool can also be used as a regular executor service. It has a dedicated method for accepting query execution work so implementations can differentiate between regular async tasks and query execution tasks. This dedicated method also passes the QueryRunner object as part of the task information. This hook will let custom extensions carry any state from QuerySegmentWalker to QueryProcessingPool#mergeRunners which is not possible currently.
2021-07-01 16:03:08 +05:30
frank chen 906a704c55
Eliminate ambiguities of KB/MB/GB in the doc (#11333)
* GB ---> GiB

* suppress spelling check

* MB --> MiB, KB --> KiB

* Use IEC binary prefix

* Add reference link

* Fix doc style
2021-06-30 13:42:45 -07:00
Clint Wylie df9b57aa1a
bitwise aggregators, better null handling options for expression agg (#11280)
* bitwise aggregators, better nulls for expression agg

* correct behavior

* rework deserialize, better names

* fix json, share mask
2021-06-25 16:51:16 -07:00
Xavier Léauté 712f2a5d00
upgrade error-prone to 2.7.1 and support checks with Java 11+ (#11363)
* upgrade error-prone to 2.7.1 and support checks with Java 11+

- upgrade error-prone to 2.7.1
- support running error-prone with Java 11 and above using -Xplugin
  instead of custom compiler
- add compiler arguments to ignore warnings/errors in Java 15/16
- introduce strictCompile property to enable strict profiles since we
  now need multiple strict profiles for Java 8
- properly exclude all generated source files from error-prone
- fix druid-processing overriding annotation processors from parent pom
- fix druid-core disabling most non-default checks
- align plugin and annotation errorprone versions
- fix / suppress additional issues found by error-prone:
  * fix bug in SeekableStreamSupervisor initializing ArrayList size with
    the taskGroupdId
  * fix missing @Override annotations
- remove outdated compiler plugin in benchmarks
- remove deleted ParameterPackage error-prone rule
- re-enable checks on benchmark module as well

* fix IntelliJ inspections

* disable LongFloatConversion due to bug in error-prone with JDK 8

* add comment about InsecureCrypto
2021-06-16 12:55:34 -07:00
Clint Wylie bfbd7ec432
fix a bugs related to SQL type inference return type nullability (#11327)
* fix a bunch of type inference nullability bugs

* fixes

* style

* fix test

* fix concat
2021-06-15 12:26:59 -07:00
Clint Wylie 920aa414ca
enrich expression cache key information to support expressions which depend on external state (#11358)
* enrich expression cache key information to support expressions which depend on external state such as lookups

* cache rules everything around me

* low carb

* rename
2021-06-14 17:26:43 -07:00
Clint Wylie 50327b8f63
ignore bySegment query context for SQL queries (#11352)
* ignore bySegment query context for SQL queries

* revert unintended change
2021-06-11 13:49:03 -07:00
Clint Wylie 6b272c857f
adjust topn heap algorithm to only use known cardinality path when dictionary is unique (#11186)
* adjust topn heap algorithm to only use known cardinality path when dictionary is unique

* better check and add comment

* adjust comment more
2021-06-10 18:32:22 -05:00
Jihoon Son 51f983101e
Fix wrong encoding in PredicateFilteredDimensionSelector.getRow (#11339) 2021-06-10 09:14:34 -07:00
Maria Sitkovets 259207753d
Fix is null selector returning incorrect value for Long data type (#11170)
* Fix is null selector returning incorrect value for Long data type

* Fix style errors

* Refactor getObject method to also cache null column values

* Make lastInput variable nullable

* Refactor unit test

* Use new boolean lastInputIsNull instead of Long for lastInput to avoid boxing

* Refactor to remove Long for input variable

* Make a separate null caching variable

* Cleaner null caching implementation
2021-05-19 20:47:02 -07:00
Clint Wylie 6d08a7051e
fix bug with aggregator expressions on realtime index with string columns always producing 0 values (#11185)
* fix bug with aggregator expressions on realtime index with string columns always producing 0 values

* more test

* rework some stuff

* javadocs
2021-05-17 11:59:13 -07:00
Clint Wylie 790262e5d0
add estimated byte size limit enforcement for heap based expression aggregator (#11236) 2021-05-12 01:21:50 -07:00
Clint Wylie f6662b4893
fix count and average SQL aggregators on constant virtual columns (#11208)
* fix count and average SQL aggregators on constant virtual columns

* style

* even better, why are we tracking virtual columns in aggregations at all if we have a virtual column registry

* oops missed a few

* remove unused

* this will fix it
2021-05-10 13:41:48 -07:00
Clint Wylie 691d7a1d54
SQL timeseries no longer skip empty buckets with all granularity (#11188)
* SQL timeseries no longer skip empty buckets with all granularity

* add comment, fix tests

* the ol switcheroo

* revert unintended change

* docs and more tests

* style

* make checkstyle happy

* docs fixes and more tests

* add docs, tests for array_agg

* fixes

* oops

* doc stuffs

* fix compile, match doc style
2021-05-10 10:13:37 -07:00
Gian Merlino a1f850d707
Fix vectorized cardinality bug on certain string columns. (#11199)
* Fix vectorized cardinality bug on certain string columns.

Fixes a bug introduced in #11182, related to the fact that in some cases,
ColumnProcessors.makeVectorProcessor will call "makeObjectProcessor"
instead of "makeSingleValueDimensionProcessor" or
"makeMultiValueDimensionProcessor". CardinalityVectorProcessorFactory
improperly ignored calls to "makeObjectProcessor".

In addition to fixing the bug, I added this detail to the javadocs for
VectorColumnProcessorFactory, to prevent others from running into the
same thing in the future. They do not currently call out this case.

* Improve test coverage.

* Additional fixes.
2021-05-07 08:37:10 -07:00
Clint Wylie 554f1ffeee
ARRAY_AGG sql aggregator function (#11157)
* ARRAY_AGG sql aggregator function

* add javadoc

* spelling

* review stuff, return null instead of empty when nil input

* review stuff

* Update sql.md

* use type inference for finalize, refactor some things
2021-05-03 22:17:10 -07:00
Gian Merlino bef7cc911f
Vectorize the cardinality aggregator. (#11182)
* Vectorize the cardinality aggregator.

Does not include a byRow implementation, so if byRow is true then
the aggregator still goes through the non-vectorized path.

Testing strategy:

- New tests that exercise both styles of "aggregate" for supported types.
- Some existing tests have also become active (note the deleted
  "cannotVectorize" lines).

* Adjust whitespace.
2021-05-03 20:27:02 -07:00
Gian Merlino 809e001939
Vectorize the DataSketches quantiles aggregator. (#11183)
* Vectorize the DataSketches quantiles aggregator.

Also removes synchronization for the BufferAggregator and VectorAggregator
implementations, since it is not necessary (similar to #11115).

Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest
to run all test cases in vectorized mode.

* Style fix.
2021-05-02 16:14:21 -07:00
Gian Merlino 046069f35a
Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector. (#11172)
* Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector.

The idea is that certain operations (like count distinct on strings) will
be faster if they are able to run directly on UTF-8 bytes instead of on
Java Strings decoded by "lookupName".

* Add license header.

* Updates suggested by robots.
2021-04-30 10:56:11 -07:00
Gian Merlino 6d82c3cbf1
StringComparators: No need to convert to UTF-8 for lexicographic comparison. (#11171)
Lexicographic ordering of UTF-8 byte sequences and in-memory UTF-16
strings are equivalent. So, we can skip the (expensive) conversion and
get an equivalent result. Thank you, Unicode!
2021-04-30 10:54:20 -07:00
Gian Merlino 7d808e357c
InDimFilter: Fix cache key computation to avoid collisions. (#11168)
The prior code did not include separation between values, and encoded
null ambiguously. This patch fixes both of those issues by encoding
strings as length + value instead of just value.

I think cache key computation was OK prior to #9800. Prior to that
patch, the cache key was computed using CacheKeyBuilder.appendStrings,
which encodes strings as UTF-8 and inserts a separator byte (0xff)
between them that cannot appear in a UTF-8 stream.
2021-04-28 17:28:29 -07:00
Gian Merlino ad028de538
InDimFilter: Fix NPE involving certain Set types. (#11169)
* InDimFilter: Fix NPE involving certain Set types.

Normally, InDimFilters that come from JSON have HashSets for "values".
However, programmatically-generated filters (like the ones from #11068)
may use other set types. Some set types, like TreeSets with natural
ordering, will throw NPE on "contains(null)", which causes the
InDimFilter's ValueMatcher to throw NPE if it encounters a null value.

This patch adds code to detect if the values set can support
contains(null), and if not, wrap that in a null-checking lambda.

Also included:

- Remove unneeded NullHandling.needsEmptyToNull method.
- Update IndexedTableJoinable to generate a TreeSet that does not
  require lambda-wrapping. (This particular TreeSet is how I noticed
  the bug in the first place.)

* Test fixes.

* Improve test coverage
2021-04-28 14:13:42 -07:00
Harini Rajendran 8a3be6bccc
Fix TimeSeriesUnionQueryRunnerTest by extending InitializedNullHandlingTest (#11154) 2021-04-23 08:56:03 -07:00
Clint Wylie 57ff1f9cdb
expression aggregator (#11104)
* add experimental expression aggregator

* add test

* fix lgtm

* fix test

* adjust test

* use not null constant

* array_set_concat docs

* add equals and hashcode and tostring

* fix it

* spelling

* do multi-value magic for expression agg, more javadocs, tests

* formatting

* fix inspection

* more better

* nullable
2021-04-22 18:30:16 -07:00
Gian Merlino 202c78c8f3
Enable rewriting certain inner joins as filters. (#11068)
* Enable rewriting certain inner joins as filters.

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.

* Test improvements.

* Test fixes.

* Avoid slow size() call.

* Remove invalid test.

* Fix style.

* Fix mistaken default.

* Small fixes.

* Fix logic error.
2021-04-14 10:49:27 -07:00
Maytas Monsereenusorn f968400170
Introduce a new configuration that skip storing audit payload if payload size exceed limit and skip storing null fields for audit payload (#11078)
* Add config to skip storing audit payload if exceed limit

* fix checkstyle

* change config name

* skip null fields for audit payload

* fix checkstyle

* address comments

* fix guice

* fix test

* add tests

* address comments

* address comments

* address comments

* fix checkstyle

* address comments

* fix test

* fix test

* address comments

* Address comments

Co-authored-by: Jihoon Son <jihoonson@apache.org>

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-04-13 20:18:28 -07:00
Clint Wylie 08d3786738
improve bitmap vector offset to report contiguous groups (#11039)
* improve bitmap vector offset to report contiguous groups

* benchmark style

* check for contiguous in getOffsets, tests for exceptions
2021-04-13 11:47:01 -07:00
Gian Merlino c158207ab6
Rename BitmapOperationTest base class to avoid flaky test. (#11102)
PR #10936 renamed BitmapBenchmark, the parent of a couple of bitmap tests, to
BitmapOperationTest. This patch renames it to BitmapOperationTestBase so JUnit
doesn't pick it up as a test case. When JUnit picks it up, it becomes a flaky
test, since its behavior and correctness depends on whether it runs before
or after its subclasses.
2021-04-13 08:01:15 -07:00
Gian Merlino c8e394015d
LongsLongEncodingReader: Implement "duplicate", fixing concurrency bug. (#11098)
Regression introduced in #11004 due to overzealous optimization. Even though
we replaced stateful usage of ByteBuffer with stateless usage of Memory, we
still need to create a new object on "duplicate" due to semantics of setBuffer.
2021-04-13 08:01:01 -07:00
Jihoon Son 25db8787b3
Fix CAST being ignored when aggregating on strings after cast (#11083)
* Fix CAST being ignored when aggregating on strings after cast

* fix checkstyle and dependency

* unused import
2021-04-12 22:21:24 -07:00
BIGrey d33fdd093b
Nested GroupBy query got wrong/empty result when using virtual column and filter (#11081)
* fix nested groupby got empty result when using virtual column

* move to query.getVirtualColumns().wrap instead of new VirtualizedColumnSelectorFactory

* move test to GroupByQueryRunnerTest

* Update processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryRunnerTest.java

Co-authored-by: huagnhui.bigrey <huanghui.bigrey@bytedance.com>
Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-04-10 21:29:41 -07:00
Clint Wylie 338886fd5f
vector group by support for string expressions (#11010)
* vector group by support for string expressions

* fix test

* comments, javadoc
2021-04-08 19:23:39 -07:00
Abhishek Agarwal 0df0bff44b
Enable multiple distinct aggregators in same query (#11014)
* Enable multiple distinct count

* Add more tests

* fix sql test

* docs fix

* Address nits
2021-04-07 00:52:19 -07:00
Clint Wylie c0e6d1c7f8
vectorize 'auto' long decoding (#11004)
* Vectorize LongDeserializers.

Also, add many more tests.

* more faster

* more more faster

* more cleanup

* fixes

* forbidden

* benchmark style

* idk why

* adjust

* add preconditions for value >= 0 for writers

* add 64 bit exception

Co-authored-by: Gian Merlino <gian@imply.io>
2021-03-26 18:39:13 -07:00
Gian Merlino bf20f9e979
DruidInputSource: Fix issues in column projection, timestamp handling. (#10267)
* DruidInputSource: Fix issues in column projection, timestamp handling.

DruidInputSource, DruidSegmentReader changes:

1) Remove "dimensions" and "metrics". They are not necessary, because we
   can compute which columns we need to read based on what is going to
   be used by the timestamp, transform, dimensions, and metrics.
2) Start using ColumnsFilter (see below) to decide which columns we need
   to read.
3) Actually respect the "timestampSpec". Previously, it was ignored, and
   the timestamp of the returned InputRows was set to the `__time` column
   of the input datasource.

(1) and (2) together fix a bug in which the DruidInputSource would not
properly read columns that are used as inputs to a transformSpec.

(3) fixes a bug where the timestampSpec would be ignored if you attempted
to set the column to something other than `__time`.

(1) and (3) are breaking changes.

Web console changes:

1) Remove "Dimensions" and "Metrics" from the Druid input source.
2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for
   compatibility with the new behavior.

Other changes:

1) Add ColumnsFilter, a new class that allows input readers to determine
   which columns they need to read. Currently, it's only used by the
   DruidInputSource, but it could be used by other columnar input sources
   in the future.
2) Add a ColumnsFilter to InputRowSchema.
3) Remove the metric names from InputRowSchema (they were unused).
4) Add InputRowSchemas.fromDataSchema method that computes the proper
   ColumnsFilter for given timestamp, dimensions, transform, and metrics.
5) Add "getRequiredColumns" method to TransformSpec to support the above.

* Various fixups.

* Uncomment incorrectly commented lines.

* Move TransformSpecTest to the proper module.

* Add druid.indexer.task.ignoreTimestampSpecForDruidInputSource setting.

* Fix.

* Fix build.

* Checkstyle.

* Misc fixes.

* Fix test.

* Move config.

* Fix imports.

* Fixup.

* Fix ShuffleResourceTest.

* Add import.

* Smarter exclusions.

* Fixes based on tests.

Also, add TIME_COLUMN constant in the web console.

* Adjustments for tests.

* Reorder test data.

* Update docs.

* Update docs to say Druid 0.22.0 instead of 0.21.0.

* Fix test.

* Fix ITAutoCompactionTest.

* Changes from review & from merging.
2021-03-25 10:32:21 -07:00
Maytas Monsereenusorn f19c2e9ce4
If ingested data has sparse columns, the ingested data with forceGuaranteedRollup=true can result in imperfect rollup and final dimension ordering can be different from dimensionSpec ordering in the ingestionSpec (#10948)
* add IT

* add IT

* add the fix

* fix checkstyle

* fix compile

* fix compile

* fix test

* fix test

* address comments
2021-03-18 17:04:28 -07:00
Xavier Léauté 1061faa6ba
prefer string concatenation over String.format in performance sensitive code (#10997)
String.format relies on regex parsing, which makes these calls expensive
at higher request volumes.
2021-03-16 22:06:26 -07:00
Clint Wylie 4cd4a22f87
expression filter support for vectorized query engines (#10613)
* expression filter support for vectorized query engines

* remove unused codes

* more tests

* refactor, more tests

* suppress

* more

* more

* more

* oops, i was wrong

* comment

* remove decorate, object dimension selector, more javadocs

* style
2021-03-16 11:46:50 -07:00
Xavier Léauté d26e1bc70d
update code check plugins for Java 15 support (#10978)
* update maven-forbidden-api plugin to 3.1
* update maven-pmd-plugin to 3.14
* update spotbugs to 4.2.2
* fixes validation failures newly caught by those updates
  - fix SpotBugs NP_NONNULL_PARAM_VIOLATION
  - fix PMD UnnecessaryFullyQualifiedName
2021-03-11 07:31:41 -08:00
frank chen b808fd2ef9
Fix NPE in the constructor of TopNQuery (#10969)
* fix NPE

* Add unit tests to cover parameter checking
2021-03-11 00:04:49 -08:00
Clint Wylie 58294329b7
fix SQL issue for group by queries with time filter that gets optimized to false (#10968)
* fix SQL issue for group by queries with time filter that gets optimized to false

* short circuit always false in CombineAndSimplifyBounds

* adjust

* javadocs

* add preconditions for and/or filters to ensure they have children

* add comments, remove preconditions
2021-03-09 19:41:16 -08:00
Abhishek Agarwal c66951a59e
Add flag in SQL to disable left base filter optimization for joins (#10947)
* Add flag to disable left base filter

* code coverage

* Draft

* Review comments

* code coverage

* add docs

* Add old tests
2021-03-09 13:07:34 -08:00
Jihoon Son 2c30f8b3b7
Migrate bitmap benchmarks to JMH (#10936)
* Migrate bitmap benchmarks to JMH

* add concise
2021-03-04 12:50:55 -08:00
Abhishek Agarwal 1a15987432
Supporting filters in the left base table for join datasources (#10697)
* where filter left first draft

* Revert changes in calcite test

* Refactor a bit

* Fixing the Tests

* Changes

* Adding tests

* Add tests for correlated queries

* Add comment

* Fix typos
2021-03-04 10:39:21 -08:00
Gian Merlino 87a2abff79
Fix runtime error when IndexedTableJoinMatcher matches long selector to unique string index. (#10942)
* Fix runtime error when IndexedTableJoinMatcher matches long selector to unique string index.

The issue arises when matching against a long selector on the left-hand side to a string
typed Index on the right-hand side, and when that Index also returns true from areKeysUnique.
In this case, IndexedTableJoinMatcher would generate a ConditionMatcher that implements
matchSingleRow by calling findUniqueLong on the Index. This is inappropriate because the Index
is actually string typed. The fix is to check the type of the Index before deciding how to
implement the ConditionMatcher.

The patch adds "testMatchSingleRowToUniqueStringIndex" to IndexedTableJoinMatcherTest, which
explores this case.

* Update tests.
2021-03-04 00:57:59 -08:00
Gian Merlino 07902f607b
Granularity: Introduce primitive-typed bucketStart, increment methods. (#10904)
* Granularity: Introduce primitive-typed bucketStart, increment methods.

Saves creation of unnecessary DateTime objects in timestamp_floor and
timestamp_ceil expressions.

* Fix style.

* Amp up the test coverage.
2021-02-25 07:59:20 -08:00
Gian Merlino b7e9f5bc85
BoundDimFilter: Simplify the various DruidLongPredicates. (#10906)
They all use Long.compare, but they don't need to. Changing to
regular comparisons simplifies the code and also removes branches.
(Internally, Long.compare has two branches.)
2021-02-19 16:44:56 -08:00
Abhishek Agarwal 8718155f8f
Allow for empty keys in hash map (#10869)
* allow for empty keys in hash map

* fix serde test
2021-02-10 11:19:57 -08:00
Jihoon Son ac41e41232
Update doc for query errors and add unit tests for JsonParserIterator (#10833)
* Update doc for query errors and add unit tests for JsonParserIterator

* static constructor for convenience

* rename method
2021-02-05 02:55:32 -08:00
Jihoon Son 3f8f00a231
Fix CVE-2021-25646 (#10818) 2021-02-04 11:21:43 -08:00
Gian Merlino 6c0c6e60b3
Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. (#10767)
* Vectorized theta sketch aggregator.

Also a refactoring of BufferAggregator and VectorAggregator such that
they share a common interface, BaseBufferAggregator. This allows
implementing both in the same file with an abstract + dual subclass
structure.

* Rework implementation to use composition instead of inheritance.

* Rework things to enable working properly for both complex types and
regular types.

Involved finally moving makeVectorProcessor from DimensionHandlerUtils
into ColumnProcessors and harmonizing the two things.

* Add missing method.

* Style and name changes.

* Fix issues from inspections.

* Fix style issue.
2021-01-29 09:30:09 -08:00
Clint Wylie cd6af93274
add leftover tests from #10743 (#10766) 2021-01-22 09:20:48 -08:00
Gian Merlino 8b808c4879
Retain order of AND, OR filter children. (#10758)
* Retain order of AND, OR filter children.

If we retain the order, it enables short-circuiting. People can put a
more selective filter earlier in the list and lower the chance that
later filters will need to be evaluated.

Short-circuiting was working before #9608, which switched to unordered
sets to solve a different problem. This patch tries to solve that
problem a different way.

This patch moves filter simplification logic from "optimize" to
"toFilter", because that allows the code to be shared with Filters.and
and Filters.or. The simplification has become more complicated and so
it's useful to share it.

This patch also removes code from CalciteCnfHelper that is no longer
necessary because Filters.and and Filters.or are now doing the work.

* Fixes for inspections.

* Fix tests.

* Back to a Set.
2021-01-20 08:59:20 -08:00
zhangyue19921010 2590ad4f67
Historical unloads damaged segments automatically when lazy on start. (#10688)
* ready to test

* tested on dev cluster

* tested

* code review

* add UTs

* add UTs

* ut passed

* ut passed

* opti imports

* done

* done

* fix checkstyle

* modify uts

* modify logs

* changing the package of SegmentLazyLoadFailCallback.java to org.apache.druid.segment

* merge from master

* modify import orders

* merge from master

* merge from master

* modify logs

* modify docs

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

* modify logs to rerun ci

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-01-16 19:53:30 -08:00
Gian Merlino 2b24dc3764
SegmentAnalyzer: Properly close column after retrieving it. (#10772) 2021-01-16 19:26:34 -08:00
Gian Merlino a82910e065
OrFilter: Properly handle child matchers that return the original mask. (#10754)
* OrFilter: Properly handle child matchers that return the original mask.

This happens when a child matcher is literally true (for example,
BooleanVectorValueMatcher). In this case, OrFilter would throw this
exception from its call to removeAll while processing the next filter:

  java.lang.IllegalStateException: 'other' must be a different instance from 'this'

Also update the javadocs for VectorValueMatcher to call out that the
returned object may be the same as the input mask.

* Fix style.
2021-01-14 23:28:13 -08:00
Gian Merlino 7354953b1b
VectorMatch: Disallow "copyFrom", "addAll" on self; improve tests. (#10755)
No existing code relies on being able to call these methods in this way.

The new tests exhaustively test all vectors up to size 7, and also test
behavior the run-on-self behavior that has been adjusted by this patch.
2021-01-14 18:29:13 -08:00
Gian Merlino 2bbf89db81
Remove FalseVectorMatcher, TrueVectorMatcher in favor of BooleanVectorValueMatcher. (#10757) 2021-01-14 18:28:25 -08:00
Jihoon Son 149306c9db
Tidy up HTTP status codes for query errors (#10746)
* Tidy up query error codes

* fix tests

* Restore query exception type in JsonParserIterator

* address review comments; add a comment explaining the ugly switch

* fix test
2021-01-13 17:20:00 -08:00
Clint Wylie 8c3c9b4060
fix limited queries with subtotals (#10743)
* i put my thing down, flip it and reverse it

* oops
2021-01-13 12:55:24 -08:00
Clint Wylie 9362dc7968
re-use expression vector evaluation results for the same offset in expression vector selectors (#10614)
* cache expression selector results by associating vector expression bindings to underlying vector offset

* better coverage, fix floats

* style

* stupid bot

* stupid me

* more test

* intellij threw me under the bus when it generated those junit methods

* narrow interface instead of passing around offset
2021-01-13 12:44:56 -08:00
秦臻 c62b7c19c3
javascript filter result convert to java boolean (#10721)
* javascript filter result convert to java boolean

* use type convert replace script convert, and add more unit test

Co-authored-by: qinzhen <qinzhen@kuaishou.com>
2021-01-08 14:30:09 -08:00
Gian Merlino 6eef0e4c9f
Fix collision between #10689 and #10593. (#10738) 2021-01-08 09:52:27 -08:00
Aleksey Plekhanov 26bcd47e51
Thread-safety for ResponseContext.REGISTERED_KEYS (#9667) 2021-01-08 00:37:49 -08:00
Liran Funaro 08ab82f55c
IncrementalIndex Tests and Benchmarks Parametrization (#10593)
* Remove redundant IncrementalIndex.Builder

* Parametrize incremental index tests and benchmarks

- Reveal and fix a bug in OffheapIncrementalIndex

* Fix forbiddenapis error: Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

* Fix Intellij errors: declared exception is never thrown

* Add documentation and validate before closing objects on tearDown.

* Add documentation to OffheapIncrementalIndexTestSpec

* Doc corrections and minor changes.

* Add logging for generated rows.

* Refactor new tests/benchmarks.

* Improve IncrementalIndexCreator documentation

* Add required tests for DataGenerator

* Revert "rollupOpportunity" to be a string
2021-01-07 22:18:47 -08:00
Gian Merlino 48e576a307
Scan query: More accurate error message when segment per time chunk limit is exceeded. (#10630)
* Scan query: More accurate error message when segment per time chunk limit is exceeded.

* Add guardrail test.
2021-01-06 14:11:28 -08:00
Jonathan Wei 68bb038b31
Multiphase segment merge for IndexMergerV9 (#10689)
* Multiphase merge for IndexMergerV9

* JSON fix

* Cleanup temp files

* Docs

* Address logging and add IT

* Fix spelling and test unloader datasource name
2021-01-05 22:19:09 -08:00
Abhishek Agarwal 796c25532e
Fix post-aggregator computation when used with subtotals (#10653)
* Fix post-aggregator computation

* remove commented code

* Fix numeric null handling

* Add test when subquery returns null long
2020-12-17 20:10:26 -08:00
Abhishek Agarwal 26d74b3580
Add grouping_id function (#10518)
* First draft of grouping_id function

* Add more tests and documentation

* Add calcite tests

* Fix travis failures

* bit of a change

* Add documentation

* Fix typos

* typo fix
2020-12-07 11:46:29 -08:00
Maytas Monsereenusorn 7eb5f59a9a
Fix string byte calculation in StringDimensionIndexer (#10623)
* fix string byte calculation

* fix tests

* fix test
2020-12-04 00:51:48 -08:00
Himanshu 813e18774e
make dimension column extensible with COMPLEX type (#10277)
* make dimension column extensible with COMPLEX type

* more changes

Change-Id: I9707dd644b8d71030b74a8c1d6fff0c0020d960d

* processing module changes for build fix

Change-Id: I146f95a41b79d20edb1721be13f0e9641f788e0e

* rename ColumnCapabilities.getTypeName() to getComplexTypeName()

* rename ColumnBuilder.setTypeName(..) -> ColumnBuilder.setComplexTypeName(..)
2020-12-03 08:58:17 -08:00
Lucas Capistrant 2e02eebd9d
Add context dimension to DefaultQueryMetrics (#10578)
* Add context dimension to DefaultQueryMetrics

* remove redundant addition of context dimension from DruidMetrics now that QueryMetrics adds it by default

* update SearchQueryMetrics to reflect the same pattern as other default dimensions in QueryMetrics

* add PublicApi annotation for context in QueryMetrics Interface
2020-12-01 18:34:03 -08:00
Lucas Capistrant 2560bf0a19
Add new coordinator metrics for coordinator duty runtimes (#10603)
* Add new coordinator metrics for duty runtimes

* fix spelling for a constant variable value

* add comment clarifying why the global runtime metric is emitted where it is

* Remove duty alias in lieu of using the class name for metrics

* fix docs

* CoordinatorStats tests + add duty stats to accumulate() logic
2020-11-29 14:47:35 -08:00
frank chen fe693a4f01
Improve doc and exception message for invalid user configurations (#10598)
* improve doc and exception message

* add spelling check rules and remove unused import

* add a test to improve test coverage
2020-11-23 15:03:13 -08:00
frank chen d7d2c804ad
Add zero period support to TIMESTAMPADD (#10550)
* Allow zero period for TIMESTAMPADD

* update test cases

* add empty zone test case

* add unit test cases for TimestampShiftMacro
2020-11-18 18:26:53 -08:00
frank chen e83d5cb59e
Fix ingestion failure of pretty-formatted JSON message (#10383)
* support multi-line text

* add test cases

* split json text into lines case by case

* improve exception handle

* fix CI

* use IntermediateRowParsingReader as base of JsonReader

* update doc

* ignore the non-immutable field in test case

* add more test cases

* mark `lineSplittable` as final

* fix testcases

* fix doc

* add a test case for SqlReader

* return all raw columns when exception occurs

* fix CI

* fix test cases

* resolve review comments

* handle ParseException returned by index.add

* apply Iterables.getOnlyElement

* fix CI

* fix test cases

* improve code in more graceful way

* fix test cases

* fix test cases

* add a test case to check multiple json string in one text block

* fix inspection check
2020-11-13 13:59:23 -08:00
Atul Mohan 6ccddedb7a
Improved exception handling in case of query timeouts (#10464)
* Separate timeout exceptions

* Add more tests

Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>
2020-11-03 09:00:33 -06:00
Clint Wylie d0821de854
support for vectorizing expressions with non-existent inputs, more consistent type handling for non-vectorized expressions (#10499)
* support for vectorizing expressions with non-existent inputs, more consistent type handling for non-vectorized expressions

* inspector

* changes

* more test

* clean
2020-10-26 19:55:24 -07:00
Liran Funaro f3a2903218
Configurable Index Type (#10335)
* Introduce a Configurable Index Type

* Change to @UnstableApi

* Add AppendableIndexSpecTest

* Update doc

* Add spelling exception

* Add tests coverage

* Revert some of the changes to reduce diff

* Minor fixes

* Update getMaxBytesInMemoryOrDefault() comment

* Fix typo, remove redundant interface

* Remove off-heap spec (postponed to a later PR)

* Add javadocs to AppendableIndexSpec

* Describe testCreateTask()

* Add tests for AppendableIndexSpec within TuningConfig

* Modify hashCode() to conform with equals()

* Add comment where building incremental-index

* Add "EqualsVerifier" tests

* Revert some of the API back to AppenderatorConfig

* Don't use multi-line comments

* Remove knob documentation (deferred)
2020-10-23 18:34:26 -07:00
Abhishek Agarwal 567e381705
Any virtual column on "__time" should be a pre-join virtual column (#10451)
* Virtual column on __time should be in pre-join

* Add unit test
2020-10-12 13:04:55 -07:00
Abhishek Agarwal 4d2a92f46a
Add caching support to join queries (#10366)
* Proposed changes for making joins cacheable

* Add unit tests

* Fix tests

* simplify logic

* Pull empty byte array logic out of CachingQueryRunner

* remove useless null check

* Minor refactor

* Fix tests

* Fix segment caching on Broker

* Move join cache key computation in Broker

Move join cache key computation in Broker from ResultLevelCachingQueryRunner to CachingClusteredClient

* Fix compilation

* Review comments

* Add more tests

* Fix inspection errors

* Pushed condition analysis to JoinableFactory

* review comments

* Disable join caching for broker and add prefix key to BroadcastSegmentIndexedTable

* Remove commented lines

* Fix populateCache

* Disable caching for selective datasources

Refactored the code so that we can decide at the data source level, whether to enable cache for broker or data nodes
2020-10-09 17:42:30 -07:00
Jihoon Son 1deed9fbcd
Close aggregators in HashVectorGrouper.close() (#10452)
* Close aggregators in HashVectorGrouper.close()

* reuse grouper

* Add missing dependency
2020-10-06 10:17:33 -07:00
Clint Wylie 207ef310f2
vectorized group by support for nullable numeric columns (#10441)
* vectorized group by support for numeric null columns

* revert unintended change

* adjust

* review stuffs
2020-10-05 21:53:53 -07:00
Clint Wylie 9ec5c08e2a
fix array types from escaping into wider query engine (#10460)
* fix array types from escaping into wider query engine

* oops

* adjust

* fix lgtm
2020-10-03 15:30:34 -07:00
Clint Wylie 753bce324b
vectorize constant expressions with optimized selectors (#10440) 2020-09-29 13:19:06 -07:00
Gian Merlino 2be1ae128f
RowBasedIndexedTable: Add specialized index types for long keys. (#10430)
* RowBasedIndexedTable: Add specialized index types for long keys.

Two new index types are added:

1) Use an int-array-based index in cases where the difference between
   the min and max values isn't too large, and keys are unique.

2) Use a Long2ObjectOpenHashMap (instead of the prior Java HashMap) in
   all other cases.

In addition:

1) RowBasedIndexBuilder, a new class, is responsible for picking which
   index implementation to use.

2) The IndexedTable.Index interface is extended to support using
   unboxed primitives in the unique-long-keys case, and callers are
   updated to use the new functionality.

Other key types continue to use indexes backed by Java HashMaps.

* Fixup logic.

* Add tests.
2020-09-29 10:46:47 -07:00
Gian Merlino 599aacce0f
Remove Expr.visit. (#10437)
* Remove Expr.visit.

It isn't used and doesn't have tests.

* Remove Visitor too.
2020-09-28 22:13:10 -07:00
Clint Wylie 1d6cb624f4
add vectorizeVirtualColumns query context parameter (#10432)
* add vectorizeVirtualColumns query context parameter

* oops

* spelling

* default to false, more docs

* fix test

* fix spelling
2020-09-28 18:48:34 -07:00
Clint Wylie 3d700a5e31
vectorize remaining math expressions (#10429)
* vectorize remaining math expressions

* fixes

* remove cannotVectorize() where no longer true

* disable vectorized groupby for numeric columns with nulls

* fixes
2020-09-26 23:30:14 -07:00
Jihoon Son 0cc9eb4903
Store hash partition function in dataSegment and allow segment pruning only when hash partition function is provided (#10288)
* Store hash partition function in dataSegment and allow segment pruning only when hash partition function is provided

* query context

* fix tests; add more test

* javadoc

* docs and more tests

* remove default and hadoop tests

* consistent name and fix javadoc

* spelling and field name

* default function for partitionsSpec

* other comments

* address comments

* fix tests and spelling

* test

* doc
2020-09-24 16:32:56 -07:00
Clint Wylie 19c4b16640
vectorized expressions and expression virtual columns (#10401)
* vectorized expression virtual columns

* cleanup

* fixes

* preserve float if explicitly specified

* oops

* null handling fixes, more tests

* what is an expression planner?

* better names

* remove unused method, add pi

* move vector processor builders into static methods

* reduce boilerplate

* oops

* more naming adjustments

* changes

* nullable

* missing hex

* more
2020-09-23 13:56:38 -07:00
Gian Merlino 1af2eace41
Include Sequence-building time in CPU time metric. (#10377)
* Include Sequence-building time in CPU time metric.

Meaningful work can be done while building Sequences, and we should
count this work. On the Broker, this includes subquery processing
work done by the mergeResults call of the GroupByQueryQueryToolChest.

* Add test.
2020-09-23 14:33:55 +08:00
Dylan Wylie f3eb0cfb3b
Avoid large limits causing int overflow in buffer size checks (#10356)
* Avoid large limits causing int overflow in buffer size checks

* fix lgtm overflow warning

Co-authored-by: Dylan <dwylie@spotx.tv>
2020-09-18 13:08:49 -07:00
Suneet Saldanha f71ba6f2c2
Vectorized ANY aggregators (#10338)
* WIP vectorized ANY aggregators

* tests

* fix aggs

* cleanup

* code review + tests

* docs

* use NilVectorSelector when needed

* fix spellcheck

* dont instantiate vectors

* cleanup
2020-09-14 19:44:58 -07:00
Clint Wylie e012d5c41b
allow vectorized query engines to utilize vectorized virtual columns (#10388)
* allow vectorized query engines to utilize vectorized virtual column implementations

* javadoc, refactor, checkstyle

* intellij inspection and more javadoc

* better

* review stuffs

* fix incorrect refactor, thanks tests

* minor adjustments
2020-09-14 19:29:35 -07:00
Clint Wylie 184b202411
add computed Expr output types (#10370)
* push down ValueType to ExprType conversion, tidy up

* determine expr output type for given input types

* revert unintended name change

* add nullable

* tidy up

* fixup

* more better

* fix signatures

* naming things is hard

* fix inspection

* javadoc

* make default implementation of Expr.getOutputType that returns null

* rename method

* more test

* add output for contains expr macro, split operation and function auto conversion
2020-09-14 18:18:56 -07:00
Abhishek Agarwal f5e2645bbb
Support SearchQueryDimFilter in sql via new methods (#10350)
* Support SearchQueryDimFilter in sql via new methods

* Contains is a reserved word

* revert unnecessary change

* Fix toDruidExpression method

* rename methods

* java docs

* Add native functions

* revert change in dockerfile

* remove changes from dockerfile

* More tests

* travis fix

* Handle null values better
2020-09-14 09:57:54 -07:00
Jihoon Son 8f14ac814e
More structured way to handle parse exceptions (#10336)
* More structured way to handle parse exceptions

* checkstyle; add more tests

* forbidden api; test

* address comment; new test

* address review comments

* javadoc for parseException; remove redundant parseException in streaming ingestion

* fix tests

* unnecessary catch

* unused imports

* appenderator test

* unused import
2020-09-11 16:31:10 -07:00
Joy Kent e5f0da30ae
Fix stringFirst/stringLast rollup during ingestion (#10332)
* Add IndexMergerRollupTest

This changelist adds a test to merge indexes with StringFirst/StringLast aggregator.

* Fix StringFirstAggregateCombiner/StringLastAggregateCombiner

The segment-level type for stringFirst/stringLast is SerializablePairLongString,
not String. This changelist fixes it.

* Fix EarliestLatestAnySqlAggregator to handle COMPLEX type

This changelist allows EarliestLatestAnySqlAggregator to accept COMPLEX
type as an operand. For its return type, we set it to VARCHAR, since
COMPLEX column is only generated by stringFirst/stringLast during ingestion
rollup.

* Return value with smaller timestamp in StringFirstAggregatorFactory.combine function

* Add integration tests for stringFirst/stringLast during ingestion

* Use one EarliestLatestReturnTypeInference instance

Co-authored-by: Joy Kent <joy@automonic.ai>
2020-09-08 17:36:04 -07:00
Suneet Saldanha 91a153820e
fix NPE in StringGroupByColumnSelectorStrategy#bufferComparator (#10325)
* fix NPE in StringGroupByColumnSelectorStrategy#bufferComparator

* Add tests

* javadocs
2020-09-04 13:23:40 -07:00
Gian Merlino d7fcff3aba
StringFirstAggregatorFactory: Fix incorrect "combine" method. (#10351)
* StringFirstAggregatorFactory: Fix incorrect "combine" method.

There was a test, but it was wrong.

* Fix superclass.
2020-09-03 20:03:26 -07:00
Gian Merlino 8ab1979304
Remove implied profanity from error messages. (#10270)
i.e. WTF, WTH.
2020-08-28 11:38:50 -07:00
Gian Merlino 21703d81ac
Fix handling of 'join' on top of 'union' datasources. (#10318)
* Fix handling of 'join' on top of 'union' datasources.

The problem is that unions are typically rewritten into a series of
individual queries on the underlying tables, but this isn't done when
the union is wrapped in a join.

The main changes are in UnionQueryRunner:

1) Replace an instanceof UnionQueryRunner check with DataSourceAnalysis.
2) Replace a "query.withDataSource" call with a new function, "Queries.withBaseDataSource".

Together, these enable UnionQueryRunner to "see through" a join.

* Tests.

* Adjust heap sizes for integration tests.

* Different approach, more tests.

* Tweak.

* Styling.
2020-08-26 14:23:54 -07:00
Jihoon Son b9ff3483ac
Add support for all partitioing schemes for auto compaction (#10307)
* Add support for all partitioing schemes for auto compaction

* annotate last compaction state for multi phase parallel indexing

* fix build and tests

* test

* better home
2020-08-26 13:19:18 -07:00
Clint Wylie ab60661008
refactor internal type system (#9638)
* better type tracking: add typed postaggs, finalized types for agg factories

* more javadoc

* adjustments

* transition to getTypeName to be used exclusively for complex types

* remove unused fn

* adjust

* more better

* rename getTypeName to getComplexTypeName

* setup expression post agg for type inference existing

* more javadocs

* fixup

* oops

* more test

* more test

* more comments/javadoc

* nulls

* explicitly handle only numeric and complex aggregators for incremental index

* checkstyle

* more tests

* adjust

* more tests to showcase difference in behavior

* timeseries longsum array
2020-08-26 10:53:44 -07:00
Suneet Saldanha a9de00d43a
Remove NUMERIC_HASHING_THRESHOLD (#10313)
* Make NUMERIC_HASHING_THRESHOLD configurable

Change the default numeric hashing threshold to 1 and make it configurable.

Benchmarks attached to this PR show that binary searches are not more faster
than doing a set contains check. The attached flamegraph shows the amount of
time a query spent in the binary search. Given the benchmarks, we can expect
to see roughly a 2x speed up in this part of the query which works out to
~ a 10% faster query in this instance.

* Remove NUMERIC_HASHING_THRESHOLD

* Remove stale docs
2020-08-25 20:05:39 -07:00
Gian Merlino f53785c52c
ExpressionFilter: Use index for expressions of single multi-value columns. (#10320)
Previously, this was disallowed, because expressions treated multi-values
as nulls. But now, if there's a single multi-value column that can be
mapped over, it's okay to use the index. Expression selectors already do
this.
2020-08-24 23:29:31 -07:00
Suneet Saldanha 707b5aae2b
Optimize large InDimFilters (#10312)
* Optimize large InDimFilters

For large InDimFilters, in default mode, the filter does a linear check of the
set to see if it contains either an empty or null. If it does, the empties are
converted to nulls by passing through the entire list again.

Instead of this, in default mode, we attempt to remove an empty string from the
values that are passed to the InDimFilter. If an empty string was removed, we
add null to the set

* code review

* Revert "code review"

This reverts commit 61fe33ebf7.

* code review - less brittle
2020-08-24 16:39:27 -07:00
Clint Wylie 7620b0c54e
Segment backed broadcast join IndexedTable (#10224)
* Segment backed broadcast join IndexedTable

* fix comments

* fix tests

* sharing is caring

* fix test

* i hope this doesnt fix it

* filter by schema to maybe fix test

* changes

* close join stuffs so it does not leak, allow table to directly make selector factory

* oops

* update comment

* review stuffs

* better check
2020-08-20 14:12:39 -07:00
Gian Merlino 6cca7242de
Add "offset" parameter to the Scan query. (#10233)
* Add "offset" parameter to the Scan query.

It works by doing the query as normal and then throwing away the first
"offset" number of rows on the broker.

* Fix constructor call.

* Fix up JSONs.

* Fix call to ScanQuery.

* Doc update.

* Fix javadocs.

* Spotbugs, LGTM suppressions.

* Javadocs.

* Fix suppression.

* Stabilize Scan query result order, add tests.

* Update LGTM comment.

* Fixup.

* Test different batch sizes too.

* Nicer tests.

* Fix comment.
2020-08-13 14:56:24 -07:00
Clint Wylie e053348f74
add hasNulls to ColumnCapabilities, ColumnAnalysis (#10219)
* add isNullable to ColumnCapabilities, ColumnAnalysis

* better builder

* fix segment metadata queries in integration tests

* adjustments

* cleanup

* fix spotbugs

* treat unknown as true in segmentmetadata

* rename to hasNulls, add docs

* fixup

* test the dim indexer selector isNull fix for numeric columns

* fixes

* oof
2020-08-13 14:55:32 -07:00
Jihoon Son a61263b4a9
Allow forceLimitPushDown in SQL (#10253)
* Allow forceLimitPushDown in SQL

* fix test

* fix test

* review comments

* fix test
2020-08-13 13:30:41 -07:00
Gian Merlino 89860b7d6a
Fix javadoc mistake in DefaultLimitSpec. (#10269)
Javadoc for getLimit should say it's a limit, not an offset.
2020-08-13 12:17:26 -07:00
Gian Merlino e273264332
Fix two id-over-maxId errors in StringDimensionIndexer. (#10245)
1) lookupId could return IDs beyond maxId if called with a recently added value.
2) getRow could return an ID for null beyond maxId, if null was recently
   encountered in a dimension that initially didn't appear at all. (In this case,
   the dictionary ID for null can be > 0).

Also add a comment explaining how this stuff is supposed to work.
2020-08-11 20:32:10 -07:00
Clint Wylie c72f96a4ba
fix bug with expressions on sparse string realtime columns without explicit null valued rows (#10248)
* fix bug with realtime expressions on sparse string columns

* fix test

* add comment back

* push capabilities for dimensions to dimension indexers since they know things

* style

* style

* fixes

* getting a bit carried away

* missed one

* fix it

* benchmark build fix

* review stuffs

* javadoc and comments

* add comment

* more strict check

* fix missed usaged of impl instead of interface
2020-08-11 11:07:17 -07:00
Abhishek Radhakrishnan dc16abae34
Vectorization support for long, double, float min & max aggregators. (#10260)
* LongMaxVectorAggregator support and test case.

* DoubleMinVectorAggregator and test cases.

* DoubleMaxVectorAggregator and unit test.

* FloatMinVectorAggregator and FloatMaxVectorAggregator.

* Documentation update to include the other vector aggregators.

* Bug fix.

* checkstyle formatting fixes.

* CalciteQueryTest cases update.

* Separate test classes for FloatMaxAggregation and FloatMniAggregation.

* remove the cannotVectorize for float max/min aggregator in test.

* Tests in GroupByQueryRunner, GroupByTimeseriesQueryRunner and TimeseriesQueryRunner.
2020-08-10 15:18:55 -07:00
Gian Merlino 170031744e
Combine InDimFilter, InFilter. (#10119)
* Combine InDimFilter, InFilter.

There are two motivations:

1. Ensure that when HashJoinSegmentStorageAdapter compares its Filter
   to the original one, and it is an "in" type, the comparison is by
   reference and does not need to check deep equality. This is useful
   when the "in" filter is very large.
2. Simplify things. (There isn't a great reason for the DimFilter and
   Filter logic to be separate, and combining them reduces some
   duplication.)

* Fix test.
2020-08-06 18:34:21 -07:00
Gian Merlino b6aaf59e8c
Add "offset" parameter to GroupBy query. (#10235)
* Add "offset" parameter to GroupBy query.

It works by doing the query as normal and then throwing away the first
"offset" number of rows on the broker.

* Stabilize GroupBy sorts.

* Fix inspections.

* Fix suppression.

* Fixups.

* Move TopNSequence to druid-core.

* Addl comments.

* NumberedElement equals verification.

* Changes from review.
2020-08-05 15:39:58 -07:00
Abhishek Radhakrishnan 34a4113752
Add vectorization support for the longMin aggregator. (#10211)
* Fix minor formatting in docs.

* Add Nullhandling initialization for test to run from IDE.

* Vectorize longMin aggregator.

- A new vectorized class for the vectorized long min aggregator.
- Changes to AggregatorFactory to support vectorize functionality.
- Few changes to schema evolution test to add LongMinAggregatorFactory.

* Add longSum to the supported vectorized aggregator implementations.

* Add MIN() long min to calcite query test that can vectorize.

* Add simple long aggregations test.

* Fixup formatting per checkstyle guide.

* fixup and add more tests for long min aggregator.

* Override test for groupBy since timestamps are handled differently.

* Null compatibility check in test.

* Review comment: Add a test case to LongMinAggregationTest.
2020-08-01 15:32:09 -07:00
frank chen 646fa84d04
Support unit on byte-related properties (#10203)
* support unit suffix on byte-related properties

* add doc

* change default value of byte-related properites in example files

* fix coding style

* fix doc

* fix CI

* suppress spelling errors

* improve code according to comments

* rename Bytes to HumanReadableBytes

* add getBytesInInt to get value safely

* improve doc

* fix problem reported by CI

* fix problem reported by CI

* resolve code review comments

* improve error message

* improve code & doc according to comments

* fix CI problem

* improve doc

* suppress spelling check errors
2020-07-31 09:58:48 +08:00
Maytas Monsereenusorn 574b062f1f
Cluster wide default query context setting (#10208)
* Cluster wide default query context setting

* Cluster wide default query context setting

* Cluster wide default query context setting

* add docs

* fix docs

* update props

* fix checkstyle

* fix checkstyle

* fix checkstyle

* update docs

* address comments

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix NPE
2020-07-29 15:19:18 -07:00
Jihoon Son 63c1746fe4
Fix timeseries query constructor when postAggregator has an expression reading timestamp result column (#10198)
* Fix timeseries query constructor when postAggregator has an expression reading timestamp result column

* fix npe

* Fix postAgg referencing timestampResultField and add a test for it

* fix test

* doc

* revert doc
2020-07-27 10:54:44 -07:00
Jihoon Son 6fdce36e41
Add integration tests for query retry on missing segments (#10171)
* Add integration tests for query retry on missing segments

* add missing dependencies; fix travis conf

* address comments

* Integration tests extension

* remove unused dependency

* remove druid_main

* fix java agent port
2020-07-22 22:30:35 -07:00
Jihoon Son 41982116f4
Report missing segments when there is no segment for the query datasource in historicals (#10199)
* Report missing segments when there is no segment for the query
datasource in historicals

* test

* missing part for test

* another test
2020-07-20 21:02:52 -07:00
Nishant Bangarwa 971d8a353b
Add groupBy limitSpec to queryCache key (#10093)
* Add groupBy limitSpec to queryCache key

* Only add limitSpec to cache key if pushdown is set to true

* review comment
2020-07-13 19:15:09 -07:00
Jihoon Son 53a2550571
Follow-up for RetryQueryRunner fix (#10144)
* address comments; use guice instead of query context

* typo

* QueryResource tests

* address comments

* catch queryException

* fix spell check
2020-07-08 13:28:11 -07:00
Clint Wylie 010fe047e1
AbstractOptimizableDimFilter should be public (#10142) 2020-07-06 15:19:32 -07:00
Jonathan Wei ed981ef88e
Add DimFilter.toOptimizedFilter(), ensure that join filter pre-analysis operates on optimized filters (#10056)
* Ensure that join filter pre-analysis operates on optimized filters, add DimFilter.toOptimizedFilter

* Remove aggressive equality check that was used for testing

* Use Suppliers.memoize

* Checkstyle
2020-07-01 22:26:17 -07:00
Samarth Jain e2c5bcc22d
Fix UnknownComplexTypeColumn#makeVectorObjectSelector. Add a warning … (#10123)
* Fix UnknownComplexTypeColumn#makeVectorObjectSelector. Add a warning message to indicate failure in deserializing.
2020-07-01 20:06:23 -07:00
Samarth Jain 3e92cdf1cf
Revert "Fix UnknownTypeComplexColumn#makeVectorObjectSelector" (#10121)
This reverts commit 7bb7489afc.
2020-07-01 14:33:17 -07:00
Jihoon Son 657f8ee80f
Fix RetryQueryRunner to actually do the job (#10082)
* Fix RetryQueryRunner to actually do the job

* more javadoc

* fix test and checkstyle

* don't combine for testing

* address comments

* fix unit tests

* always initialize response context in cachingClusteredClient

* fix subquery

* address comments

* fix test

* query id for builders

* make queryId optional in the builders and ClusterQueryResult

* fix test

* suppress tests and unused methods

* exclude groupBy builder

* fix jacoco exclusion

* add tests for builders

* address comments

* don't truncate
2020-07-01 14:02:21 -07:00
samarthjain 7bb7489afc Fix UnknownTypeComplexColumn#makeVectorObjectSelector 2020-07-01 12:02:23 -07:00
Gian Merlino 5faa897a34
Join filter pre-analysis simplifications and sanity checks. (#10104)
* Join filter pre-analysis simplifications and sanity checks.

- At pre-analysis time, only compute pre-analysis for the innermost
  root query, since this is the one that will run on the join that involves
  the base datasource. Previously, pre-analyses were computed for multiple
  levels of the query, some of which were unnecessary.
- Remove JoinFilterPreAnalysisGroup and join query level gathering code,
  since they existed to support precomputation of multiple pre-analyses.
- Embed JoinFilterPreAnalysisKey into JoinFilterPreAnalysis and use it to
  sanity check at processing time that the correct pre-analysis was done.

Tangentially related changes:

- Remove prioritizeAndLaneQuery functionality from LocalQuerySegmentWalker.
  The computed priority and lanes were not being used.
- Add "getBaseQuery" method to DataSourceAnalysis to support identification
  of the proper subquery for filter pre-analysis.

* Fix compilation errors.

* Adjust tests.
2020-06-30 19:14:22 -07:00
Samarth Jain 2c1b45842f
Prevent unknown complex types from breaking DruidSchema refresh (#9422) 2020-06-30 14:06:17 -07:00
chenyuzhi459 a4c6d5f37e
fix query memory leak (#10027)
* fix query memory leak

* rollup ./idea

* roll up .idea

* clean code

* optimize style

* optimize cancel function

* optimize style

* add concurrentGroupTest test case

* add test case

* add unit test

* fix code style

* optimize cancell method use

* format code

* reback code

* optimize cancelAll

* clean code

* add comment
2020-06-26 23:30:59 -07:00
Maytas Monsereenusorn 9be5039f68
Enable query vectorization by default (#10065)
* Enable query vectorization by default

* update docs
2020-06-24 13:08:49 -07:00
Maytas Monsereenusorn f80c02da02
Fix HyperUniquesAggregatorFactory.estimateCardinality null handling to respect output type (#10063)
* fix return type from HyperUniquesAggregator/HyperUniquesVectorAggregator

* address comments

* address comments
2020-06-23 15:54:37 -10:00
Clint Wylie eee99ff0d5
minor rework of topn algorithm selection for clarity and more javadocs (#10058)
* minor refactor of topn engine algorithm selection for clarity

* adjust

* more javadoc
2020-06-22 09:08:50 -07:00
Clint Wylie c2f5d453f8
fix topn on string columns with non-sorted or non-unique dictionaries (#10053)
* fix topn on string columns with non-sorted or non-unique dictionaries

* fix metadata tests

* refactor, clarify comments and code, fix ci failures
2020-06-19 11:35:18 -07:00
Jonathan Wei 37e150c075
Fix join filter rewrites with nested queries (#10015)
* Fix join filter rewrites with nested queries

* Fix test, inspection, coverage

* Remove clauses from group key

* Fix import order

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2020-06-18 21:32:29 -07:00
Clint Wylie b5e6569d2c
global table only if joinable (#10041)
* global table if only joinable

* oops

* fix style, add more tests

* Update sql/src/test/java/org/apache/druid/sql/calcite/schema/DruidSchemaTest.java

* better information schema columns, distinguish broadcast from joinable

* fix javadoc

* fix mistake

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2020-06-18 17:32:10 -07:00
Aleksey Plekhanov 2c384b61ff
IntelliJ inspection and checkstyle rule for "Collection.EMPTY_* field accesses replaceable with Collections.empty*()" (#9690)
* IntelliJ inspection and checkstyle rule for "Collection.EMPTY_* field accesses replaceable with Collections.empty*()"

* Reverted checkstyle rule

* Added tests to pass CI

* Codestyle
2020-06-18 09:47:07 -07:00
Maytas Monsereenusorn 7569ee3ec6
All aggregators should check if column can be vectorize (#10026)
* All aggregators should use vectorization-aware column processor

* All aggregators should use vectorization-aware column processor

* fix canVectorize

* fix canVectorize

* add tests

* revert back default

* address comment

* address comments

* address comment

* address comment
2020-06-17 01:52:02 -10:00
Clint Wylie 68aa384190
global table datasource for broadcast segments (#10020)
* global table datasource for broadcast segments

* tests

* fix

* fix test

* comments and javadocs

* review stuffs

* use generated equals and hashcode
2020-06-16 17:58:05 -07:00
Suneet Saldanha 4e483a70b4
ROUND and having comparators correctly handle special double values (#10014)
* ROUND and having comparators correctly handle doubles

Double.NaN, Double.POSITIVE_INFINITY and Double.NEGATIVE_INFINITY are not real
numbers. Because of this, they can not be converted to BigDecimal and instead
throw a NumberFormatException.

This change adds support for calculations that produce these numbers either
for use in the `ROUND` function or the HavingSpecMetricComparator by not
attempting to convert the number to a BigDecimal.

The bug in ROUND was first introduced in #7224 where we added the ability to
round to any decimal place. This PR changes the behavior back to using
`Math.round` if we recognize a number that can not be converted to a
BigDecimal.

* Add tests and fix spellcheck

* update error message in ExpressionsTest

* Address comments

* fix up round for infinity

* round non numeric doubles returns a double

* fix spotbugs

* Update docs/misc/math-expr.md

* Update docs/querying/sql.md
2020-06-16 16:09:46 -07:00
Gian Merlino 9330ca9717
Remove LegacyDataSource. (#10037)
* Remove LegacyDataSource.

Its purpose was to enable deserialization of strings into TableDataSources.
But we can do this more straightforwardly with Jackson annotations.

* Slight test improvement.
2020-06-16 14:40:35 -07:00
Clint Wylie 9468df4721
make phaser of ReferenceCountingCloseableObject protected instead of private so subclasses can do stuff with it (#10035) 2020-06-15 19:56:49 -07:00
Stefan Birkner 7282e2f2f9
Simplify CompressedVSizeColumnarIntsSupplierTest (#10003)
The parameters generator uses CompressionStrategy.noNoneValues() instead
of CompressionStrategyTest.compressionStrategies() which wrapped each
strategy in a single element array. This improves readability of the
test.
2020-06-10 09:32:00 -07:00
Clint Wylie f8b643ec72
make joinables closeable (#9982)
* make joinables closeable

* tests and adjustments

* refactor to make join stuffs impelement ReferenceCountedObject instead of Closable, more tests

* fixes

* javadocs and stuff

* fix bugs

* more test

* fix lgtm alert

* simplify

* fixup javadoc

* review stuffs

* safeguard against exceptions

* i hate this checkstyle rule

* make IndexedTable extend Closeable
2020-06-09 20:12:36 -07:00
Clint Wylie 1c9ca55247
remove incorrect and unnecessary overrides from BooleanVectorValueMatcher (#9994)
* remove incorrect and unnecessary overrides from BooleanVectorValueMatcher

* add test case

* add unit tests for ... part of VectorValueMatcherColumnProcessorFactory

* Update VectorValueMatcherColumnProcessorFactoryTest.java
2020-06-09 19:32:16 -07:00
Clint Wylie c5d6163c76
add a GeneratorInputSource to fill up a cluster with generated data for testing (#9946)
* move benchmark data generator into druid-processing, add a GeneratorInputSource to fill up a cluster with data

* newlines

* make test coverage not fail maybe

* remove useless test

* Update pom.xml

* Update GeneratorInputSourceTest.java

* less passive aggressive test names
2020-06-09 19:31:04 -07:00
Clint Wylie 7f51e44b00
fix NilVectorSelector filter optimization (#9989) 2020-06-08 17:40:29 -07:00
Clint Wylie 77dd5b06ae
ColumnCapabilities.hasMultipleValues refactor (#9731)
* transition ColumnCapabilities.hasMultipleValues to Capable enum, remove ColumnCapabilities.isComplete

* remove artifical, always multi-value capabilities from IncrementalIndexStorageAdapter and fix up fallout from that, fix ColumnCapabilities merge in index merger

* fix typo

* remove unused method

* review stuffs, revert IncrementalIndexStorageAdapater capabilities change, plumb lame workaround to SegmentAnalyzer

* more comment

* use volatile booleans

* fix line length

* correctly handle missing columns for vector processors

* return ColumnCapabilities.Capable for BitmapIndexSelector.hasMultipleValues, fix vector processor selection for complex

* false on non-existent
2020-06-04 23:52:37 -07:00
Maytas Monsereenusorn 9738a03c83
Fix groupBy with literal in subquery grouping (#9986)
* fix groupBy with literal in subquery grouping

* fix groupBy with literal in subquery grouping

* fix groupBy with literal in subquery grouping

* address comments

* update javadocs
2020-06-04 13:28:05 -10:00
Maytas Monsereenusorn 790e9482ea
Fix Subquery could not be converted to groupBy query (#9959)
* Fix join

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* add tests

* address comments

* fix failing tests
2020-06-03 16:46:28 -07:00
Gian Merlino 3dfd7c30c0
Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT. (#9893)
* Add REGEXP_LIKE, fix empty-pattern bug in REGEXP_EXTRACT.

- Add REGEXP_LIKE function that returns a boolean, and is useful in
  WHERE clauses.
- Fix REGEXP_EXTRACT return type (should be nullable; causes incorrect
  filter elision).
- Fix REGEXP_EXTRACT behavior for empty patterns: should always match
  (previously, they threw errors).
- Improve error behavior when REGEXP_EXTRACT and REGEXP_LIKE are passed
  non-literal patterns.
- Improve documentation of REGEXP_EXTRACT.

* Changes based on PR review.

* Fix arg check.

* Important fixes!

* Add speller.

* wip

* Additional tests.

* Fix up tests.

* Add validation error tests.

* Additional tests.

* Remove useless call.
2020-06-03 14:31:37 -07:00
Maytas Monsereenusorn 0d22462e07
Document unsupported Join on multi-value column (#9948)
* Document Unsupported Join on multi-value column

* Document Unsupported Join on multi-value column

* address comments

* Add unit tests

* address comments

* add tests
2020-06-03 09:55:52 -10:00
Gian Merlino 3d81564a14
Fix various processing buffer leaks and simplify BlockingPool. (#9928)
* - GroupByQueryEngineV2: Fix leak of intermediate processing buffer when
  exceptions are thrown before result sequence is created.
- PooledTopNAlgorithm: Fix leak of intermediate processing buffer when
  exceptions are thrown before the PooledTopNParams object is created.
- BlockingPool: Remove unused "take" methods.

* Add tests to verify that buffers have been returned.
2020-06-02 18:26:18 -07:00
Gian Merlino 309fc04d54
Fix various Yielder leaks. (#9934)
* Fix various Yielder leaks.

- CombiningSequence leaked the input yielder from "toYielder" if it ran
  into an exception while accumulating the last value from the input
  yielder.
- MergeSequence leaked input yielders from "toYielder" if it ran into
  an exception while building the initial priority queue.
- ScanQueryRunnerFactory leaked the input yielder in its
  "priorityQueueSortAndLimit" strategy if it ran into an exception
  while scanning and sorting.
- YieldingSequenceBase.accumulate chomped IOExceptions thrown in
  "accumulate" during yielder closing.

* Add tests.

* Fix braces.
2020-06-02 18:26:06 -07:00
Xavier Léauté 4ecf1900c3
fix nullhandling exceptions related to test ordering (#9964)
follow-up to https://github.com/apache/druid/pull/9570
2020-06-02 10:13:54 -07:00
Clint Wylie c690d10a7d
support customized factory.json via IndexSpec for segment persist (#9957)
* support customized factory.json via IndexSpec for segment persist

* equals verifier
2020-06-01 16:36:32 -07:00
Suneet Saldanha e03d38b6c8
Optimize join queries where filter matches nothing (#9931)
* Refactor JoinFilterAnalyzer

This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.

To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.

This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Fitlers with associated tests.

* fix tests

* Refactor JoinFilterAnalyzer - part 2

This change introduces the following components:
 * RhsRewriteCandidates - a wrapper for a list of candidates and associated
     functions to operate on the set of candidates.
 * JoinableClauses - a wrapper for the list of JoinableClause that represent
     a join condition and the associated functions to operate on the clauses.
 * Equiconditions - a wrapper representing the equiconditions that are used
     in the join condition.

And associated test changes.

This refactoring surfaced 2 bugs:
 - Missing equals and hashcode implementation for RhsRewriteCandidate, thus
   allowing potential duplicates in the rhs rewrite candidates
 - Missing Filter#supportsRequiredColumnRewrite check in
   analyzeJoinFilterClause, which could result in UnsupportedOperationException
   being thrown by the filter

* fix compile error

* remove unused class

* Refactor JoinFilterAnalyzer - Correlations

Move the correlation related code out into it's own class so it's easier
to maintain.
Another patch should follow this one so that the query path uses the
correlation object instead of it's underlying maps.

* Optimize join queries where filter matches nothing

Fixes #9787

This PR changes the Joinable interface to return an Optional set of correlated
values for a column.
This allows the JoinFilterAnalyzer to differentiate between the case where the
column has no matching values and when the column could not find matching
values.

This PR chose not to distinguish between cases where correlated values could
not be computed because of a config that has this behavior disabled or because
of user error - like a column that could not be found. The reasoning was that
the latter is likely an error and the non filter pushdown path will surface the
error if it is.
2020-05-29 16:53:03 -07:00
Suneet Saldanha 9c40bebc02
Refactor JoinFilterAnalyzer - part 2 (#9929)
* Refactor JoinFilterAnalyzer

This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.

To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.

This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Fitlers with associated tests.

* fix tests

* Refactor JoinFilterAnalyzer - part 2

This change introduces the following components:
 * RhsRewriteCandidates - a wrapper for a list of candidates and associated
     functions to operate on the set of candidates.
 * JoinableClauses - a wrapper for the list of JoinableClause that represent
     a join condition and the associated functions to operate on the clauses.
 * Equiconditions - a wrapper representing the equiconditions that are used
     in the join condition.

And associated test changes.

This refactoring surfaced 2 bugs:
 - Missing equals and hashcode implementation for RhsRewriteCandidate, thus
   allowing potential duplicates in the rhs rewrite candidates
 - Missing Filter#supportsRequiredColumnRewrite check in
   analyzeJoinFilterClause, which could result in UnsupportedOperationException
   being thrown by the filter

* fix compile error

* remove unused class
2020-05-29 15:03:35 -07:00
Suneet Saldanha faef31a0af
Refactor JoinFilterAnalyzer (#9921)
* Refactor JoinFilterAnalyzer

This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.

To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.

This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Fitlers with associated tests.

* fix tests
2020-05-28 22:32:09 -07:00
Suneet Saldanha b0167295d7
Fail incorrectly constructed join queries (#9830)
* Fail incorrectly constructed join queries

* wip annotation for equals implementations

* Add equals tests

* fix tests

* Actually fix the tests

* Address review comments

* prohibit Pattern.hashCode()
2020-05-13 14:23:04 -07:00
Jonathan Wei 16d293d6e0
Directly rewrite filters on RHS join columns into LHS equivalents (#9818)
* Directly rewrite filters on RHS join columns into LHS equivalents

* PR comments

* Fix inspection

* Revert unnecessary ExprMacroTable change

* Fix build after merge

* Address PR comments
2020-05-08 23:45:35 -07:00
mcbrewster 28be107a1c
add flag to flattenSpec to keep null columns (#9814)
* add flag to flattenSpec to keep null columns

* remove changes to inputFormat interface

* add comment

* change comment message

* update web console e2e test

* move keepNullColmns to JSONParseSpec

* fix merge conflicts

* fix tests

* set keepNullColumns to false by default

* fix lgtm

* change Boolean to boolean, add keepNullColumns to hash, add tests for keepKeepNullColumns false + true with no nuulul columns

* Add equals verifier tests
2020-05-08 21:53:39 -07:00
Maytas Monsereenusorn accd710115
Add equivalent test coverage for all RHS join impls (#9831)
* Add equivalent test coverage for all RHS join impls

* address comments
2020-05-06 16:10:41 -07:00
Jihoon Son 6674d721bc
Avoid sorting values in InDimFilter if possible (#9800)
* Avoid sorting values in InDimFilter if possible

* tests

* more tests

* fix and and or filters

* fix build

* false and true vector matchers

* fix vector matchers

* checkstyle

* in filter null handling

* remove wrong test

* address comments

* remove unnecessary null check

* redundant separator

* address comments

* typo

* tests
2020-05-06 15:26:36 -07:00
Suneet Saldanha 1e857c5303
Ignore druid-processing benchmarks in tests (#9821) 2020-05-06 08:59:48 -07:00
Jihoon Son c6caae9a24
Fix filtering on boolean values in transformation (#9812)
* Fix filter on boolean value in Transform

* assert

* more descriptive test

* remove assert

* add assert for cached string; disable tests

* typo
2020-05-04 18:47:10 -07:00
Jian Wang 85dfbb64cb
Update documention for metricCompression (#9811) 2020-05-03 12:56:48 -07:00