Commit Graph

239 Commits

Author SHA1 Message Date
Pranav 5314db9f85
Adding the file mapper to handle v2 buffer deserialization (#14429) 2023-06-14 19:41:44 -07:00
Soumyava 01b22ca022
Hll Sketch and Theta sketch estimate can now be used as an expression (#14312)
* Hll Sketch estimate can now be used as an expression
* Theta sketch estimate now can be used as an expression
2023-06-06 20:14:25 -07:00
Abhishek Radhakrishnan 2d258a95ad
Fix `EARLIEST_BY`/`LATEST_BY` signature and include function name in signature. (#14352)
* Fix EarliestLatestBySqlAggregator signature; Include function name for all signatures.

* Single quote function signatures, space between args and remove \n.

* fixup UT assertion
2023-06-06 09:41:05 -07:00
Alexander Saydakov 4131c0df13
use the latest datasketches-java-4.0.0 (#14334)
* use the latest datasketches-java-4.0.0

* updated versions of datasketches

* adjusted expectation

* fixed the expectations

---------

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2023-05-27 22:19:18 -07:00
frankgrimes97 2f98675285
Tuple sketch SQL support (#13887)
This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.
2023-03-28 18:47:12 +05:30
Gian Merlino 90d8f67e3d
Avoid creating new RelDataTypeFactory during SQL planning. (#13904)
* Avoid creating new RelDataTypeFactory during SQL planning.

Reduces unnecessary CPU cycles.

* Fix.
2023-03-08 21:55:49 -08:00
Anshu Makkar a10e4150d5
Add Post Aggregators for Tuple Sketches (#13819)
You can now do the following operations with TupleSketches in Post Aggregation Step

Get the Sketch Output as Base64 String
Provide a constant Tuple Sketch in post-aggregation step that can be used in Set Operations
Get the Estimated Value(Sum) of Summary/Metrics Objects associated with Tuple Sketch
2023-03-03 09:32:09 +05:30
imply-cheddar f684df4c22
Use an HllSketchHolder object to enable optimized merge (#13737)
* Use an HllSketchHolder object to enable optimized merge

HllSketchAggregatorFactory.combine had been implemented using a
pure pair-wise, "make a union -> add 2 things to union -> get sketch"
algorithm.  This algorithm does 2 things that was CPU

1) The Union object always builds an HLL_8 sketch regardless of the
  target type.  This means that when the target type is not HLL_8, we
  spent CPU cycles converting to HLL_8 and back over and over again
2) By throwing away the Union object and converting back to the
  HllSketch only to build another Union object, we do lots and lots
  of copy+conversions of the HllSketch

This change introduces an HllSketchHolder object which can hold onto
a Union object and delay conversion back into an HllSketch until
it is actually needed.  This follows the same pattern as the
SketchHolder object for theta sketches.
2023-02-07 13:57:48 -08:00
imply-cheddar 9c5b61e114
Fallback virtual column (#13739)
* Fallback virtual column

This virtual columns enables falling back to another column if
the original column doesn't exist.  This is useful when doing
column migrations and you have some old data with column X,
new data with column Y and you want to use Y if it exists, X
otherwise so that you can run a consistent query against all of
the data.
2023-02-06 19:36:50 -08:00
Abhishek Agarwal b25cf216d5
Better error message when theta_sketch_intersect is used on scalar expression (#13508) 2022-12-07 09:35:43 +05:30
Paul Rogers b76ff16d00
SQL test framework extensions (#13426)
SQL test framework extensions

* Capture planner artifacts: logical plan, etc.
* Planner test builder validates the logical plan
* Validation for the SQL resut schema (we already have
  validation for the Druid row signature)
* Better Guice integration: properties, reuse Guice modules
* Avoid need for hand-coded expr, macro tables
* Retire some of the test-specific query component creation
* Fix query log hook race condition
2022-12-02 09:11:59 -08:00
Clint Wylie f524c68f08
Add mechanism for 'safe' memory reads for complex types (#13361)
* we can read where we want to
we can leave your bounds behind
'cause if the memory is not there
we really don't care
and we'll crash this process of mine
2022-11-23 00:25:22 -08:00
Clint Wylie 27215d1ff1
fix complex_decode_base64 function, add SQL bindings (#13332)
* fix complex_decode_base64 function, add SQL bindings

* more permissive
2022-11-09 23:40:25 -08:00
Paul Rogers 7e600d2c63
Enhancements to the Calcite test framework (#13283)
* Enhancements to the Calcite test framework
* Standardize "Unauthorized" messages
* Additional test framework extension points
* Resolved joinable factory dependency issue
2022-11-08 14:28:49 -08:00
Gian Merlino 8f90589ce5
Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247)
* Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH.

These aggregation functions are documented as creating sketches. However,
they are planned into native aggregators that include finalization logic
to convert the sketch to a number of some sort. This creates an
inconsistency: the functions sometimes return sketches, and sometimes
return numbers, depending on where they lie in the native query plan.

This patch changes these SQL aggregators to _never_ finalize, by using
the "shouldFinalize" feature of the native aggregators. It already
existed for theta sketches. This patch adds the feature for hll and
quantiles sketches.

As to impact, Druid finalizes aggregators in two cases:

- When they appear in the outer level of a query (not a subquery).
- When they are used as input to an expression or finalizing-field-access
  post-aggregator (not any other kind of post-aggregator).

With this patch, the functions will no longer be finalized in these cases.

The second item is not likely to matter much. The SQL functions all declare
return type OTHER, which would be usable as an input to any other function
that makes sense and that would be planned into an expression.

So, the main effect of this patch is the first item. To provide backwards
compatibility with anyone that was depending on the old behavior, the
patch adds a "sqlFinalizeOuterSketches" query context parameter that
restores the old behavior.

Other changes:

1) Move various argument-checking logic from runtime to planning time in
   DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker.

2) Add various JsonIgnores to the sketches to simplify their JSON representations.

3) Allow chaining of ExpressionPostAggregators and other PostAggregators
   in the SQL layer.

4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer,
   now that expressions can operate on complex inputs.

5) Adjust return type to thetaSketch (instead of OTHER) in
   ThetaSketchSetBaseOperatorConversion.

* Fix benchmark class.

* Fix compilation error.

* Fix ThetaSketchSqlAggregatorTest.

* Hopefully fix ITAutoCompactionTest.

* Adjustment to ITAutoCompactionTest.
2022-11-03 09:43:00 -07:00
Paul Rogers 86e6e61e88
Modular Calcite Test Framework (#12965)
* Refactor Calcite test "framework" for planner tests

Refactors the current Calcite tests to make it a bit easier
to adjust the set of runtime objects used within a test.

* Move data creation out of CalciteTests into TestDataBuilder
* Move "framework" creation out of CalciteTests into
  a QueryFramework
* Move injector-dependent functions from CalciteTests
  into QueryFrameworkUtils
* Wrapper around the planner factory, etc. to allow
  customization.
* Bulk of the "framework" created once per class rather
  than once per test.
* Refactor tests to use a test builder
* Change all testQuery() methods to use the test builder.
Move test execution & verification into a test runner.
2022-10-20 15:45:44 -07:00
Paul Rogers f4dcc52dac
Redesign QueryContext class (#13071)
We introduce two new configuration keys that refine the query context security model controlled by druid.auth.authorizeQueryContextParams. When that value is set to true then two other configuration options become available:

druid.auth.unsecuredContextKeys: The set of query context keys that do not require a security check. Use this for the "white-list" of key to allow. All other keys go through the existing context key security checks.
druid.auth.securedContextKeys: The set of query context keys that do require a security check. Use this when you want to allow all but a specific set of keys: only these keys go through the existing context key security checks.
Both are set using JSON list format:

druid.auth.securedContextKeys=["secretKey1", "secretKey2"]
You generally set one or the other values. If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys.

In addition, Druid defines two query context keys which always bypass checks because Druid uses them internally:

sqlQueryId
sqlStringifyArrays
2022-10-15 11:02:11 +05:30
Alexander Saydakov 7e2371bbde
KLL sketch (#12498)
* KLL sketch

* added documentation

* direct static refs

* direct static refs

* fixed test

* addressed review points

* added KLL sketch related terms

* return a copy from get

* Copy unions when returning them from "get".

* Remove redundant "final".

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2022-08-26 21:19:24 -07:00
Paul Rogers 41712b7a3a
Refactor SqlLifecycle into statement classes (#12845)
* Refactor SqlLifecycle into statement classes

Create direct & prepared statements
Remove redundant exceptions from tests
Tidy up Calcite query tests
Make PlannerConfig more testable

* Build fixes

* Added builder to SqlQueryPlus

* Moved Calcites system properties to saffron.properties

* Build fix

* Resolve merge conflict

* Fix IntelliJ inspection issue

* Revisions from reviews

Backed out a revision to Calcite tests that didn't work out as planned

* Build fix

* Fixed spelling errors

* Fixed failed test

Prepare now enforces security; before it did not.

* Rebase and fix IntelliJ inspections issue

* Clean up exception handling

* Fix handling of JDBC auth errors

* Build fix

* More tweaks to security messages
2022-08-14 00:44:08 -07:00
Karan Kumar 607b0b9310
Adding withName implementation to AggregatorFactory (#12862)
* Adding agg factory with name impl

* Adding test cases

* Fixing test case

* Fixing test case

* Updated java docs.
2022-08-08 18:31:56 +05:30
Kashif Faraz 7ab2170802
Use datasketches version 3.2.0 (#12509)
Changes:
- Use apache datasketches version 3.2.0.
- Remove unsafe reflection-based usage of datasketch internals added in #12022
2022-05-13 11:28:15 +05:30
Gian Merlino 529b983ad0
GroupBy: Reduce allocations by reusing entry and key holders. (#12474)
* GroupBy: Reduce allocations by reusing entry and key holders.

Two main changes:

1) Reuse Entry objects returned by various implementations of
   Grouper.iterator.

2) Reuse key objects contained within those Entry objects.

This is allowed by the contract, which states that entries must be
processed and immediately discarded. However, not all call sites
respected this, so this patch also updates those call sites.

One particularly sneaky way that the old code retained entries too long
is due to Guava's MergingIterator and CombiningIterator. Internally,
these both advance to the next value prior to returning the current
value. So, this patch addresses that in two ways:

1) For merging, we have our own implementation MergeIterator already,
   although it had the same problem. So, this patch updates our
   implementation to return the current item prior to advancing to the
   next item. It also adds a forbidden-api entry to ensure that this
   safer implementation is used instead of Guava's.

2) For combining, we address the problem in a different way: by copying
   the key when creating the new, combined entry.

* Attempt to fix test.

* Remove unused import.
2022-04-28 23:21:13 -07:00
Jihoon Son 73ce5df22d
Add support for authorizing query context params (#12396)
The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below.

Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params.
User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters.
System context params. They are set by the Druid query engine during query processing. These params override other context params.
Today, any context params are allowed to users. This can cause 
1) a bad UX if the context param is not matured yet or 
2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows.

This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission.

{
  "resourceAction" : {
    "resource" : {
      "name" : "maxSubqueryRows",
      "type" : "QUERY_CONTEXT"
    },
    "action" : "WRITE"
  },
  "resourceNamePattern" : "maxSubqueryRows"
}
Each role can have multiple permissions for context params. Each permission should be set for different context params.

When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case,

HTTP endpoints will return 403 response code.
JDBC will throw ForbiddenException.
Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService.

The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.
2022-04-21 14:21:16 +05:30
Alexander Saydakov 50038d9344
latest datasketches-java-3.1.0 (#12224)
These changes are to use the latest datasketches-java-3.1.0 and also to restore support for quantile and HLL4 sketches to be able to grow larger than a given buffer in a buffer aggregator and move to heap in rare cases. This was discussed in #11544.

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2022-03-01 17:14:42 -08:00
Clint Wylie 3ee66bb492
allow optimizing sql expressions and virtual columns (#12241)
* rework sql planner expression and virtual column handling

* simplify a bit

* add back and deprecate old methods, more tests, fix multi-value string coercion bug and associated tests

* spotbugs

* fix bugs with multi-value string array expression handling

* javadocs and adjust test

* better

* fix tests
2022-02-09 14:55:50 -08:00
Kashif Faraz e648b01afb
Improve memory estimates in Aggregator and DimensionIndexer (#12073)
Fixes #12022  

### Description
The current implementations of memory estimation in `OnHeapIncrementalIndex` and `StringDimensionIndexer` tend to over-estimate which leads to more persistence cycles than necessary.

This PR replaces the max estimation mechanism with getting the incremental memory used by the aggregator or indexer at each invocation of `aggregate` or `encode` respectively.

### Changes
- Add new flag `useMaxMemoryEstimates` in the task context. This overrides the same flag in DefaultTaskConfig i.e. `druid.indexer.task.default.context` map
- Add method `AggregatorFactory.factorizeWithSize()` that returns an `AggregatorAndSize` which contains
  the aggregator instance and the estimated initial size of the aggregator
- Add method `Aggregator.aggregateWithSize()` which returns the incremental memory used by this aggregation step
- Update the method `DimensionIndexer.processRowValsToKeyComponent()` to return the encoded key component as well as its effective size in bytes
- Update `OnHeapIncrementalIndex` to use the new estimations only if `useMaxMemoryEstimates = false`
2022-02-03 10:34:02 +05:30
imply-cheddar b153cb2342
Add a small LRU cache and use utf8 bytes in ArrayOfDoubles (#12130)
* Add a small LRU cache and use utf8 bytes in ArrayOfDoubles

* Add tests for extra branches

* Even more tests for branch coverage

* Fix Style
2022-01-11 13:04:11 -08:00
somu-imply 08fea7a46a
input type validation for datasketches hll "build" aggregator factory (#12131)
* Ingestion will fail for HLLSketchBuild instead of creating with incorrect values

* Addressing review comments for HLL< updated error message introduced test case
2022-01-11 12:00:14 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Gian Merlino 93aeaf4801
Improve on-heap aggregator footprint estimates. (#11950)
Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that
mitigates #6743 by enabling heap footprint estimates based on a specific
number of rows. The idea is that at ingestion time, the number of rows
that go into an aggregator will be 1 (if rollup is off) or will likely
be a small number (if rollup is on).

It's a heuristic, because of course nothing guarantees that the rollup
ratio is a small number. But it's a common case, and I expect this logic
to go wrong much less often than the current logic. Also, when it does
go wrong, users can fix it by lowering maxRowsInMemory or
maxBytesInMemory. The current situation is unintuitive: when the
estimation goes wrong, users get an OOME, but actually they need to
*raise* these limits to fix it.
2021-11-28 13:21:24 +05:30
Clint Wylie f260bbed23
restore and deprecate AggregatorFactory methods (#11917)
* add back and deprecate aggregator factory methods so i can say i told you so when i delete these later

* rename to make less ambiguous, fix fill method

* adjust
2021-11-19 15:59:35 -08:00
Clint Wylie a8805ab60d
add missing json type for ListFilteredVirtualColumn (#11887)
* add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again

* fixes

* ugly, but maybe this

* oops

* too many mappers
2021-11-09 17:25:12 -08:00
Gian Merlino fc95c92806
Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124)
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.

This patch does the following:

- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
  required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
  had it. (Not all of them did, and that's OK, because it wasn't necessary
  anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.

Rationale for removing the offheap incremental index:

- It is only used in one rare scenario: groupBy v1 (which is non-default)
  in "useOffheap" mode (also non-default). So you have to go pretty deep
  into the wilderness to get this code to activate in production. It is
  never used during ingestion.
- Its existence complicates developer efforts to reason about how
  aggregators get used, because the way it uses buffer aggregators is so
  different from how every other query engine uses them.
- It doesn't have meaningful testing.

By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.

* Remove things that are now unused.

* Revert removal of getFloat, getLong, getDouble from BufferAggregator.

* OAK-related warnings, suppressions.

* Unused item suppressions.
2021-10-26 08:05:56 -07:00
Gian Merlino 8276c031c5
Add druid.sql.approxCountDistinct.function property. (#11181)
* Add druid.sql.approxCountDistinct.function property.

The new property allows admins to configure the implementation for
APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode.

The motivation for adding this setting is to enable site admins to
switch the default HLL implementation to DataSketches.

For example, an admin can set:

  druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL

* Fixes

* Fix tests.

* Remove erroneous cannotVectorize.

* Remove unused import.

* Remove unused test imports.
2021-10-25 12:16:21 -07:00
Gian Merlino d4cace385f
SQL: Allow Scans to be used as outer queries. (#11831)
* SQL: Allow Scans to be used as outer queries.

This has been possible in the native query system for a while, but the capability
hasn't yet propagated into the SQL layer. One example of where this is useful is
a query like:

  SELECT * FROM (... LIMIT X) WHERE <filter>

Because this expands the kinds of subquery structures the SQL layer will consider,
it was also necessary to improve the cost calculations. These changes appear in
PartialDruidQuery and DruidOuterQueryRel. The ideas are:

- Attach per-column penalties to the output signature of each query, instead of to
  the initial projection that starts a query. This encourages moving projections
  into subqueries instead of leaving them on outer queries.
- Only attach penalties to projections if there are actually expressions happening.
  So, now, projections that simply reorder or remove fields are free.
- Attach a constant penalty to every outer query. This discourages creating them
  when they are not needed.

The changes are generally beneficial to the test cases we have in CalciteQueryTest.
Most plans are unchanged, or are changed in purely cosmetic ways. Two have changed
for the better:

- testUsingSubqueryWithLimit now returns a constant from the subquery, instead of
  returning every column.
- testJoinOuterGroupByAndSubqueryHasLimit returns a minimal set of columns from
  the innermost subquery; two unnecessary columns are no longer there.

* Fix various DS operator conversions.

These were all implemented as direct conversions, which isn't appropriate
because they do not actually map onto native functions. These are only
usable as post-aggregations.

* Test case adjustment.
2021-10-23 17:18:43 -07:00
Gian Merlino b7a4c79314
Null handling fixes for DS HLL and Theta sketches. (#11830)
* Null handling fixes for DS HLL and Theta sketches.

For HLL, this fixes an NPE when processing a null in a multi-value dimension.

For both, empty strings are now properly treated as nulls (and ignored) in
replace-with-default mode. Behavior in SQL-compatible mode is unchanged.

* Fix expectation.
2021-10-22 19:09:00 -07:00
Clint Wylie 741b4ed516
add output type information to ExpressionPostAggregator (#11818)
* add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types

* add test for test for coverage

* simplify

* Remove unused imports.

Co-authored-by: Gian Merlino <gian@imply.io>
2021-10-22 13:52:51 -07:00
Alexander Saydakov 8cf1cbc4a9
latest datasketches-java and datasketches-memory (#11773)
* latest datasketches-java and datasketches-memory

* updated versions of datasketches-java and datasketches-memory

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2021-10-19 23:42:30 -07:00
Clint Wylie 187df58e30
better types (#11713)
* better type system

* needle in a haystack

* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support

* fixup merge

* more test

* fixup

* intern

* fix

* oops

* oops again

* ...

* more test coverage

* fix error message

* adjust interning, more javadocs

* oops

* more docs more better
2021-10-19 01:47:25 -07:00
Jihoon Son 7e90d00cc0
Configurable maxStreamLength for doubles sketches (#11574)
* Configurable maxStreamLength for doubles sketches

* fix equals/hashcode and it test failure

* fix test

* fix it test

* benchmark

* doc

* grouping key

* fix comment

* dependency check

* Update docs/development/extensions-core/datasketches-quantiles.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-08-31 14:56:37 -07:00
Clint Wylie f6662b4893
fix count and average SQL aggregators on constant virtual columns (#11208)
* fix count and average SQL aggregators on constant virtual columns

* style

* even better, why are we tracking virtual columns in aggregations at all if we have a virtual column registry

* oops missed a few

* remove unused

* this will fix it
2021-05-10 13:41:48 -07:00
Clint Wylie 691d7a1d54
SQL timeseries no longer skip empty buckets with all granularity (#11188)
* SQL timeseries no longer skip empty buckets with all granularity

* add comment, fix tests

* the ol switcheroo

* revert unintended change

* docs and more tests

* style

* make checkstyle happy

* docs fixes and more tests

* add docs, tests for array_agg

* fixes

* oops

* doc stuffs

* fix compile, match doc style
2021-05-10 10:13:37 -07:00
Gian Merlino 809e001939
Vectorize the DataSketches quantiles aggregator. (#11183)
* Vectorize the DataSketches quantiles aggregator.

Also removes synchronization for the BufferAggregator and VectorAggregator
implementations, since it is not necessary (similar to #11115).

Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest
to run all test cases in vectorized mode.

* Style fix.
2021-05-02 16:14:21 -07:00
Gian Merlino f2b54de205
Vectorized versions of HllSketch aggregators. (#11115)
* Vectorized versions of HllSketch aggregators.

The patch uses the same "helper" approach as #10767 and #10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

- Cosmetic changes to make the hll and theta implementations look
  more similar.
- Extends the theta SQL tests to run in vectorized mode.

* Updates post-code-review.

* Fix javadoc.
2021-04-16 18:45:46 -07:00
Jihoon Son 25db8787b3
Fix CAST being ignored when aggregating on strings after cast (#11083)
* Fix CAST being ignored when aggregating on strings after cast

* fix checkstyle and dependency

* unused import
2021-04-12 22:21:24 -07:00
Makdon d939420f23
Update SketchAggregator.java for removing duplicated parentheses (#11021)
* Update SketchAggregator.java

* Add test for sketches aggregator update unoin with double
2021-04-09 17:11:25 -07:00
Alexander Saydakov f930cf14d6
Use the latest Apache DataSketches release 2.0.0 (#10917)
* use the latest Apache DataSketches release 2.0.0

* updated datasketches version

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2021-02-26 07:52:00 -06:00
Gian Merlino 6c0c6e60b3
Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. (#10767)
* Vectorized theta sketch aggregator.

Also a refactoring of BufferAggregator and VectorAggregator such that
they share a common interface, BaseBufferAggregator. This allows
implementing both in the same file with an abstract + dual subclass
structure.

* Rework implementation to use composition instead of inheritance.

* Rework things to enable working properly for both complex types and
regular types.

Involved finally moving makeVectorProcessor from DimensionHandlerUtils
into ColumnProcessors and harmonizing the two things.

* Add missing method.

* Style and name changes.

* Fix issues from inspections.

* Fix style issue.
2021-01-29 09:30:09 -08:00
Clint Wylie ab60661008
refactor internal type system (#9638)
* better type tracking: add typed postaggs, finalized types for agg factories

* more javadoc

* adjustments

* transition to getTypeName to be used exclusively for complex types

* remove unused fn

* adjust

* more better

* rename getTypeName to getComplexTypeName

* setup expression post agg for type inference existing

* more javadocs

* fixup

* oops

* more test

* more test

* more comments/javadoc

* nulls

* explicitly handle only numeric and complex aggregators for incremental index

* checkstyle

* more tests

* adjust

* more tests to showcase difference in behavior

* timeseries longsum array
2020-08-26 10:53:44 -07:00
Clint Wylie cfb7a893e7
fill out missing test coverage for druid-datasketches postaggs (#9730)
* fill out missing test coverage for druid-datasketches postaggs

* fixup

* fixup merge

* oops

* oops again
2020-07-31 10:08:07 -07:00
Maytas Monsereenusorn 4e8570b71b
Add integration tests for all InputFormat (#10088)
* Add integration tests for Avro OCF InputFormat

* Add integration tests for Avro OCF InputFormat

* add tests

* fix bug

* fix bug

* fix failing tests

* add comments

* address comments

* address comments

* address comments

* fix test data

* reduce resource needed for IT

* remove bug fix

* fix checkstyle

* add bug fix
2020-07-08 12:50:29 -07:00
Franklyn Dsouza 1b9aacb1cd
Fix avg sql aggregator (#10135)
* new average aggregator

* method to create count aggregator factory

* test everything

* update other usages

* fix style

* fix more tests

* fix datasketches tests
2020-07-08 08:38:56 -07:00
Clint Wylie 477335abb4
update links datasketches.github.io to datasketches.apache.org (#10107)
* update links datasketches.github.io to datasketches.apache.org

* now with more apache

* oops

* oops
2020-07-01 14:56:17 -07:00
Maytas Monsereenusorn 9bab6b6371
SketchAggregator.updateUnion should handle null inside List update object (#10055) 2020-06-19 20:29:25 -07:00
Maytas Monsereenusorn 790e9482ea
Fix Subquery could not be converted to groupBy query (#9959)
* Fix join

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* Fix Subquery could not be converted to groupBy query

* add tests

* address comments

* fix failing tests
2020-06-03 16:46:28 -07:00
Gian Merlino 3dfd7c30c0
Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT. (#9893)
* Add REGEXP_LIKE, fix empty-pattern bug in REGEXP_EXTRACT.

- Add REGEXP_LIKE function that returns a boolean, and is useful in
  WHERE clauses.
- Fix REGEXP_EXTRACT return type (should be nullable; causes incorrect
  filter elision).
- Fix REGEXP_EXTRACT behavior for empty patterns: should always match
  (previously, they threw errors).
- Improve error behavior when REGEXP_EXTRACT and REGEXP_LIKE are passed
  non-literal patterns.
- Improve documentation of REGEXP_EXTRACT.

* Changes based on PR review.

* Fix arg check.

* Important fixes!

* Add speller.

* wip

* Additional tests.

* Fix up tests.

* Add validation error tests.

* Additional tests.

* Remove useless call.
2020-06-03 14:31:37 -07:00
Alexander Saydakov 522df300c2
Datasketches 1 3 0 (#9880)
* use the latest datasketches release

* new sketch debug print

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2020-05-16 14:09:23 -07:00
Alexander Saydakov 844d626738
added number of bins parameter (#9436)
* added number of bins parameter

* addressed review points

* test equals

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2020-05-04 16:53:09 -07:00
Clint Wylie fc5383cd00
revert datasketches-java version to 1.1.0-incubating until new version is released (#9751)
* revert datasketches-java version to 1.1.0-incubating until fix is in place

* fix tests

* checkstyle
2020-04-24 12:52:12 -07:00
Suneet Saldanha 1ced3b33fb
IntelliJ inspections cleanup (#9339)
* IntelliJ inspections cleanup

* Standard Charset object can be used
* Redundant Collection.addAll() call
* String literal concatenation missing whitespace
* Statement with empty body
* Redundant Collection operation
* StringBuilder can be replaced with String
* Type parameter hides visible type

* fix warnings in test code

* more test fixes

* remove string concatenation inspection error

* fix extra curly brace

* cleanup AzureTestUtils

* fix charsets for RangerAdminClient

* review comments
2020-04-10 10:04:40 -07:00
Gian Merlino 1ef25a438f
Broker: Add ability to inline subqueries. (#9533)
* Broker: Add ability to inline subqueries.

The main changes:

- ClientQuerySegmentWalker: Add ability to inline queries.
- Query: Add "getSubQueryId" and "withSubQueryId" methods.
- QueryMetrics: Add "subQueryId" dimension.
- ServerConfig: Add new "maxSubqueryRows" parameter, which is used by
  ClientQuerySegmentWalker to limit how many rows can be inlined per
  query.
- IndexedTableJoinMatcher: Allow creating keys on top of unknown types,
  by assuming they are strings. This is useful because not all types are
  known for fields in query results.
- InlineDataSource: Store RowSignature rather than component parts. Add
  more zealous "equals" and "hashCode" methods to ease testing.
- Moved QuerySegmentWalker test code from CalciteTests and
  SpecificSegmentsQueryWalker in druid-sql to QueryStackTests in
  druid-server. Use this to spin up a new ClientQuerySegmentWalkerTest.

* Adjustments from CI.

* Fix integration test.
2020-03-18 15:06:45 -07:00
Gian Merlino ff59d2e78b
Move RowSignature from druid-sql to druid-processing and make use of it. (#9508)
* Move RowSignature from druid-sql to druid-processing and make use of it.

1) Moved (most of) RowSignature from sql to processing. Left behind the SQL-specific
   stuff in a RowSignatures utility class. It also picked up some new convenience
   methods along the way.
2) There were a lot of places in the code where Map<String, ValueType> was used to
   associate columns with type info. These are now all replaced with RowSignature.
3) QueryToolChest's resultArrayFields method is replaced with resultArraySignature,
   and it now provides type info.

* Fix up extensions.

* Various fixes
2020-03-12 11:06:44 -07:00
Gian Merlino c6c2282b59
Harmonization and bug-fixing for selector and filter behavior on unknown types. (#9484)
* Harmonization and bug-fixing for selector and filter behavior on unknown types.

- Migrate ValueMatcherColumnSelectorStrategy to newer ColumnProcessorFactory
  system, and set defaultType COMPLEX so unknown types can be dynamically matched.
- Remove ValueGetters in favor of ColumnComparisonFilter doing its own thing.
- Switch various methods to use convertObjectToX when casting to numbers, rather
  than ad-hoc and inconsistent logic.
- Fix bug in RowBasedExpressionColumnValueSelector: isBindingArray should return
  true even for 0- or 1- element arrays.
- Adjust various javadocs.

* Add throwParseExceptions option to Rows.objectToNumber, switch back to that.

* Update tests.

* Adjust moment sketch tests.
2020-03-10 07:15:57 -07:00
Clint Wylie f8b1f2f7f3
fix issue when distinct grouping dimensions are optimized into the same virtual column expression (#9429)
* fix issue when distinct grouping dimensions are optimized into the same virtual column expression

* fix tests

* more better

* fixes
2020-03-09 17:48:29 -07:00
Clint Wylie b408a6d774
sql support for dynamic parameters (#6974)
* sql support for dynamic parameters

* fixup

* javadocs

* fixup from merge

* formatting

* fixes

* fix it

* doc fix

* remove druid fallback self-join parameterized test

* unused imports

* ignore test for now

* fix imports

* fixup

* fix merge

* merge fixup

* fix test that cannot vectorize

* fixup and more better

* dependency thingo

* fix docs

* tweaks

* fix docs

* spelling

* unused imports after merge

* review stuffs

* add comment

* add ignore text

* review stuffs
2020-02-19 13:09:20 -08:00
Gian Merlino 3ef5c2f2e8
Add MemoryOpenHashTable, a table similar to ByteBufferHashTable. (#9308)
* Add MemoryOpenHashTable, a table similar to ByteBufferHashTable.

With some key differences to improve speed and design simplicity:

1) Uses Memory rather than ByteBuffer for its backing storage.
2) Uses faster hashing and comparison routines (see HashTableUtils).
3) Capacity is always a power of two, allowing simpler design and more
   efficient implementation of findBucket.
4) Does not implement growability; instead, leaves that to its callers.
   The idea is this removes the need for subclasses, while still giving
   callers flexibility in how to handle table-full scenarios.

* Fix LGTM warnings.

* Adjust dependencies.

* Remove easymock from druid-benchmarks.

* Adjustments from review.

* Fix datasketches unit tests.

* Fix checkstyle.
2020-02-04 19:57:59 -08:00
Suneet Saldanha 33a97dfaae
Guicify druid sql module (#9279)
* Guicify druid sql module

Break up the SQLModule in to smaller modules and provide a binding that
modules can use to register schemas with druid sql.

* fix some tests

* address code review

* tests compile

* Working tests

* Add all the tests

* fix up licenses and dependencies

* add calcite dependency to druid-benchmarks

* tests pass

* rename the schemas
2020-02-04 11:33:48 -08:00
Gian Merlino b411443d22
SQL join support for lookups. (#9294)
* SQL join support for lookups.

1) Add LookupSchema to SQL, so lookups show up in the catalog.
2) Add join-related rels and rules to SQL, allowing joins to be planned into
   native Druid queries.

* Add two missing LookupSchema calls in tests.

* Fix tests.

* Fix typo.
2020-01-31 23:51:16 -08:00
Suneet Saldanha 303b02eba1
intelliJ inspections cleanup (#9260)
* intelliJ inspections cleanup

- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static

Most of these changes are aesthetic, however, they will allow inspections to
be enabled as part of CI checks going forward

The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
    indexing-hadoop/.../Utils.java
    processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
    processing/src/.../ScanQueryLimitRowIteratorTest.java

* Add intelliJ inspection warnings as errors to druid profile

* one more static inner class
2020-01-29 11:50:52 -08:00
Atul Mohan ea51bc45bf Fix nullhandling in tests (#9119) 2020-01-12 20:19:12 -08:00
Clint Wylie 7af85250cb null handling for doubles sketch and array of doubles sketch aggs (#9112)
* doubles sketch and array of doubles sketch aggs now skip rows with nulls in sql compatible null handling mode

* formatting
2020-01-07 14:15:32 -06:00
Jonathan Wei aa539177ec De-incubation cleanup in code, docs, packaging (#9108)
* De-incubation cleanup in code, docs, packaging

* remove unused docs script
2020-01-03 12:33:19 -05:00
Alexander Saydakov 4a9da3f3fc use the latest release of datasketches (#8647)
* use the latest release of datasketches

* added datasketches-memory dependency

* updated datasketches entries

* use datasketches-memory-1.2.0

* updated dependencies

* fixed tests
2019-11-25 19:45:51 -08:00
Clint Wylie 3fcaa1a61b
fix sql compatible null handling config work with runtime.properties (#8876)
* fix sql compatible null handling config work with runtime.properties

* fix npe

* fix tests

* add friendly error

* comment, and friendlier still

* fix compile

* fix from merges
2019-11-20 03:55:29 -08:00
Jonathan Wei 75ea0d592a Add more datasketches doubles sketch SQL functions (#8843)
* Add more datasketches doubles sketch SQL postaggs

* style and lgtm
2019-11-08 18:05:06 -08:00
Roman Leventov 5c0fc0a13a Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564)
* IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle() don't fetch abutting intervals; simplify getUsedSegmentsForIntervals()

* Add VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() method; Propagate the decision about whether only visible segmetns or visible and overshadowed segments should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmetnListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever

* Fix tests

* More fixes

* Add javadoc notes about returning Collection instead of Set. Add JacksonUtils.readValue() to reduce boilerplate code

* Fix KinesisIndexTaskTest, factor out common parts from KinesisIndexTaskTest and KafkaIndexTaskTest into SeekableStreamIndexTaskTestBase

* More test fixes

* More test fixes

* Add a comment to VersionedIntervalTimelineTestBase

* Fix tests

* Set DataSegment.size(0) in more tests

* Specify DataSegment.size(0) in more places in tests

* Fix more tests

* Fix DruidSchemaTest

* Set DataSegment's size in more tests and benchmarks

* Fix HdfsDataSegmentPusherTest

* Doc changes addressing comments

* Extended doc for visibility

* Typo

* Typo 2

* Address comment
2019-11-06 11:07:04 -08:00
Clint Wylie 3ff5e02237 remove select query (#8739)
* remove select query

* thanks teamcity

* oops

* oops

* add back a SelectQuery class that throws RuntimeExceptions linking to docs

* adjust text

* update docs per review

* deprecated
2019-10-30 19:29:56 -07:00
Jonathan Wei d88075237a
Add initial SQL support for non-expression sketch postaggs (#8487)
* Add initial SQL support for non-expression sketch postaggs

* Checkstyle, spotbugs

* checkstyle

* imports

* Update SQL docs

* Checkstyle

* Fix theta sketch operator docs

* PR comments

* Checkstyle fixes

* Add missing entries for HLL sketch module

* PR comments, add round param to HLL estimate operator, fix optional HLL param
2019-10-18 14:59:44 -07:00
SandishKumarHN 33f0753a70 Add Checkstyle for constant name static final (#8060)
* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* merging with upstream

* review-1

* unknow changes

* unknow changes

* review-2

* merging with master

* review-2 1 changes

* review changes-2 2

* bug fix
2019-08-23 13:13:54 +03:00
Atul Mohan 661976f266 Reset sketch combiner in AggregatorCombiner (#8368)
* Reset union in AggregateCombiner

* Use newer sketch objects for test

* Add empty sketch objects
2019-08-23 00:22:40 -07:00
Jihoon Son b654096194
Fix equals for ArrayOfDoublesSketchAggregatorFactory (#8326) 2019-08-16 14:47:37 -07:00
Clint Wylie 1054d85171
add mechanism to control filter optimization in historical query processing (#8209)
* add support for mechanism to control filter optimization in historical query processing

* oops

* adjust

* woo

* javadoc

* review comments

* fix

* default

* oops

* oof

* this will fix it

* more nullable, refactor DimFilter.getRequiredColumns to use Set, formatting

* extract class DimFilterToStringBuilder with common code from custom DimFilter toString implementations

* adjust variable naming

* missing nullable

* more nullable

* fix javadocs

* nullable

* address review comments

* javadocs, precondition

* nullable

* rename method to be consistent

* review comments

* remove tuning from ColumnComparisonFilter/ColumnComparisonDimFilter
2019-08-09 16:36:18 -07:00
Fokko Driesprong 91743eeebe Spotbugs: NP_NONNULL_PARAM_VIOLATION (#8129) 2019-08-02 19:20:22 +03:00
Gian Merlino 77297f4e6f GroupBy array-based result rows. (#8196)
* GroupBy array-based result rows.

Fixes #8118; see that proposal for details.

Other than the GroupBy changes, the main other "interesting" classes are:

- ResultRow: The array-based result type.
- BaseQuery: T is no longer required to be Comparable.
- QueryToolChest: Adds "decorateObjectMapper" to enable query-aware serialization
  and deserialization of result rows (necessary due to their positional nature).
- QueryResource: Uses the new decoration functionality.
- DirectDruidClient: Also uses the new decoration functionality.
- QueryMaker (in Druid SQL): Modifications to read ResultRows.

These classes weren't changed, but got some new javadocs:

- BySegmentQueryRunner
- FinalizeResultsQueryRunner
- Query

* Adjustments for TC stuff.
2019-07-31 16:15:12 -07:00
Gian Merlino 63461311f8
HllSketch Merge/Build BufferAggregators: Speed up init with prebuilt sketch. (#8194)
* HllSketchMergeBufferAggregator: Speed up init by copying prebuilt sketch.

* Remove useless writableRegion call.

* POM variables.

* Fix missing reposition.

* Apply similar optimization to HllSketchBuildBufferAggregator.

* Rename emptySketch -> emptyUnion in merge flavor.

* Adjustments based on review.

* Comment update.

* Additional updates.

* Comment push.
2019-07-31 08:18:42 -07:00
Fokko Driesprong e1a745717e Spotbugs: NP_STORE_INTO_NONNULL_FIELD (#8021) 2019-07-21 21:23:47 +08:00
Gian Merlino ffa25b7832
Query vectorization. (#6794)
* Benchmarks: New SqlBenchmark, add caching & vectorization to some others.

- Introduce a new SqlBenchmark geared towards benchmarking a wide
  variety of SQL queries. Rename the old SqlBenchmark to
  SqlVsNativeBenchmark.
- Add (optional) caching to SegmentGenerator to enable easier
  benchmarking of larger segments.
- Add vectorization to FilteredAggregatorBenchmark and GroupByBenchmark.

* Query vectorization.

This patch includes vectorized timeseries and groupBy engines, as well
as some analogs of your favorite Druid classes:

- VectorCursor is like Cursor. (It comes from StorageAdapter.makeVectorCursor.)
- VectorColumnSelectorFactory is like ColumnSelectorFactory, and it has
  methods to create analogs of the column selectors you know and love.
- VectorOffset and ReadableVectorOffset are like Offset and ReadableOffset.
- VectorAggregator is like BufferAggregator.
- VectorValueMatcher is like ValueMatcher.

There are some noticeable differences between vectorized and regular
execution:

- Unlike regular cursors, vector cursors do not understand time
  granularity. They expect query engines to handle this on their own,
  which a new VectorCursorGranularizer class helps with. This is to
  avoid too much batch-splitting and to respect the fact that vector
  selectors are somewhat more heavyweight than regular selectors.
- Unlike FilteredOffset, FilteredVectorOffset does not leverage indexes
  for filters that might partially support them (like an OR of one
  filter that supports indexing and another that doesn't). I'm not sure
  that this behavior is desirable anyway (it is potentially too eager)
  but, at any rate, it'd be better to harmonize it between the two
  classes. Potentially they should both do some different thing that
  is smarter than what either of them is doing right now.
- When vector cursors are created by QueryableIndexCursorSequenceBuilder,
  they use a morphing binary-then-linear search to find their start and
  end rows, rather than linear search.

Limitations in this patch are:

- Only timeseries and groupBy have vectorized engines.
- GroupBy doesn't handle multi-value dimensions yet.
- Vector cursors cannot handle virtual columns or descending order.
- Only some filters have vectorized matchers: "selector", "bound", "in",
  "like", "regex", "search", "and", "or", and "not".
- Only some aggregators have vectorized implementations: "count",
  "doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
- Dimension specs other than "default" don't work yet (no extraction
  functions or filtered dimension specs).

Currently, the testing strategy includes adding vectorization-enabled
tests to TimeseriesQueryRunnerTest, GroupByQueryRunnerTest,
GroupByTimeseriesQueryRunnerTest, CalciteQueryTest, and all of the
filtering tests that extend BaseFilterTest. In all of those classes,
there are some test cases that don't support vectorization. They are
marked by special function calls like "cannotVectorize" or "skipVectorize"
that tell the test harness to either expect an exception or to skip the
test case.

Testing should be expanded in the future -- a project in and of itself.

Related to #3011.

* WIP

* Adjustments for unused things.

* Adjust javadocs.

* DimensionDictionarySelector adjustments.

* Add "clone" to BatchIteratorAdapter.

* ValueMatcher javadocs.

* Fix benchmark.

* Fixups post-merge.

* Expect exception on testGroupByWithStringVirtualColumn for IncrementalIndex.

* BloomDimFilterSqlTest: Tag two non-vectorizable tests.

* Minor adjustments.

* Update surefire, bump up Xmx in Travis.

* Some more adjustments.

* Javadoc adjustments

* AggregatorAdapters adjustments.

* Additional comments.

* Remove switching search.

* Only missiles.
2019-07-12 12:54:07 -07:00
Clint Wylie abf9843e2a fail complex type 'serde' registration when registered type does not match expected type (#7985)
* make ComplexMetrics.registerSerde type check on register, resolves #7959

* add test

* simplify

* unused imports :/

* simplify

* burned by imports yet again

* remove unused constructor

* switch to getName

* heh oops
2019-07-11 23:03:15 -07:00
Alexander Saydakov 4b176ad265 force native order when wrapping ByteBuffer since Druid can have it set (#8055)
incorrectly
2019-07-11 17:17:53 -07:00
Chi Cao Minh 0ded0ce414 Add round support for DS-HLL (#8023)
* Add round support for DS-HLL

Since the Cardinality aggregator has a "round" option to round off estimated
values generated from the HyperLogLog algorithm, add the same "round" option to
the DataSketches HLL Sketch module aggregators to be consistent.

* Fix checkstyle errors

* Change HllSketchSqlAggregator to do rounding

* Fix test for standard-compliant null handling mode
2019-07-05 15:37:58 -07:00
Alexander Saydakov f38a62e949 theta sketch to string post agg (#7937) 2019-06-27 15:09:57 -07:00
Xue Yu b9c6a26c0e Use ComplexMetrics.registerSerde() across the codebase (#7925)
* refactor complexmetric registerserde

* fix error

* feedback address
2019-06-25 11:39:04 +03:00
Gian Merlino 8649b8ab4c
SQL: Allow select-sort-project query shapes. (#7769)
* SQL: Allow select-sort-project query shapes.

Fixes #7768.

Design changes:

- In PartialDruidQuery, allow projection after select + sort by removing
  the SELECT_SORT query stage and instead allowing the SORT and
  SORT_PROJECT stages to apply either after aggregation or after a plain
  non-aggregating select. This is different from prior behavior, where
  SORT and SORT_PROJECT were only considered valid after aggregation
  stages. This logic change is in the "canAccept" method.
- In DruidQuery, represent either kind of sorting with a single "Sorting"
  class (instead of DefaultLimitSpec). The Sorting class is still
  convertible into a DefaultLimitSpec, but is also convertible into the
  sorting parameters accepted by a Scan query.
- In DruidQuery, represent post-select and post-sorting projections with
  a single "Projection" class. This obsoletes the SortProject and
  SelectProjection classes, and simplifies the DruidQuery by allowing us
  to move virtual-column and post-aggregator-creation logic into the
  new Projection class.
- Split "DruidQuerySignature" into RowSignature and VirtualColumnRegistry.
  This effectively means that instead of having mutable and immutable
  versions of DruidQuerySignature, we instead of RowSignature (always
  immutable) and VirtualColumnRegistry (always mutable, but sometimes
  null). This change wasn't required, but IMO it this makes the logic
  involving them easier to follow, and makes it more clear when the
  virtual column registry is active and when it's not.

Other changes:

- ConvertBoundsToSelectors now just accepts a RowSignature, but we
  use the VirtualColumnRegistry.getFullRowSignature() method to get
  a signature that includes all columns, and therefore allows us to
  simplify the logic (no need to special-case virtual columns).
- Add `__time` to the Scan column list if the query is ordering by time.

* Remove unused import.
2019-05-30 12:56:29 -07:00
Clint Wylie eef69619d3 add support for multi-value string dimensions for HllSketch build aggregator (#7730) 2019-05-23 17:07:32 -07:00
Clint Wylie 23e96d15d4 allow quantiles merge aggregator to also accept doubles (#7718)
* allow quantiles merge aggregator to also accept doubles

* consolidate dupe

* import
2019-05-23 11:13:41 -07:00
Clint Wylie ffc2397bcd fix AggregatorFactory.finalizeComputation implementations to be ok with null inputs (#7731)
* AggregatorFactory finalizeComputation is nullable with nullable input, make implementations honor this

* fixes
2019-05-22 21:13:09 -07:00
Merlin Lee 5f08b0b474 Add checkstyle for "Prohibit @author tags in Javadoc" (#7682)
* Add checkstyle for "Prohibit @author tags in Javadoc"

* Add "Do not use author tags/information in the code" back to CONTRIBUTING.md
2019-05-20 00:09:51 -07:00
Fokko Driesprong 2aa9613bed Bump Checkstyle to 8.20 (#7651)
* Bump Checkstyle to 8.20

Moderate severity vulnerability that affects:
com.puppycrawl.tools:checkstyle

Checkstyle prior to 8.18 loads external DTDs by default,
which can potentially lead to denial of service attacks
or the leaking of confidential information.

Affected versions: < 8.18

* Oops, missed one

* Oops, missed a few
2019-05-14 11:53:37 -07:00
Alexander Saydakov ca1a6649f6 Datasketches quantiles more post-aggs (#7550)
* rank and CDF post-aggs

* added post-aggs to the module

* added new post-aggs

* moved post-agg IDs

* moved post-agg IDs
2019-05-10 11:46:54 -07:00
Alexander Saydakov 9d8f934e68 handle empty sketches (#7526)
* handle empty sketches

* return array of NaN in case of empty sketch

* noinspection ForLoopReplaceableByForEach in tests

* style fixes
2019-04-25 14:28:41 -07:00