Commit Graph

927 Commits

Author SHA1 Message Date
imply-cheddar f9861808bc
Be able to load segments on Peons (#14239)
* Be able to load segments on Peons

This change introduces a new config on WorkerConfig
that indicates how many bytes of each storage
location to use for storage of a task.  Said config
is divided up amongst the locations and slots
and then used to set TaskConfig.tmpStorageBytesPerTask

The Peons use their local task dir and
tmpStorageBytesPerTask as their StorageLocations for
the SegmentManager such that they can accept broadcast
segments.
2023-05-12 16:51:00 -07:00
Soumyava f128b9b666
Updates to filter processing for inner query in Joins (#14237) 2023-05-11 17:21:41 +05:30
Clint Wylie a58cebe491
add array_to_mv function to convert arrays into mvds to assist with migration from mvds to arrays (#14236) 2023-05-11 04:43:28 -07:00
Clint Wylie 8805d8d7db
fix issues with filtering nulls on values coerced to numeric types (#14139)
* fix issues with filtering nulls on values coerced to numeric types
* fix issues with 'auto' type numeric columns in default value mode
* optimize variant typed columns without nested data
* more tests for 'auto' type column ingestion
2023-05-08 13:19:02 -07:00
Rohan Garg 4d8feeb279
Fix planning in CASE expressions with complex WHEN and ELSE expressions (#14220) 2023-05-08 11:35:04 +05:30
zachjsh 48cde236c4
Add columnMappings to explain plan output (#14187)
* Add columnMappings to explain plan output

* * fix checkstyle
* add tests

* * improve test coverage

* * temporarily remove unit-test need to run ITs

* * depend on build

* * temporarily lower unit test threshold

* * add back dependency on unit-tests

* * add license headers

* * fix header order

* * review comments

* * fix intellij inspection errors

* * revert code coverage change
2023-05-04 10:36:28 -07:00
Gian Merlino 42c8c84eb6
TimeBoundary: Use cursor when datasource is not a regular table. (#14151)
* TimeBoundary: Use cursor when datasource is not a regular table.

Fixes a bug where TimeBoundary could return incorrect results with
INNER Join or inline data.

* Addl Javadocs.
2023-04-26 17:00:13 -07:00
Gian Merlino 89e7948159
MSQ: Subclass CalciteJoinQueryTest, other supporting changes. (#14105)
* MSQ: Subclass CalciteJoinQueryTest, other supporting changes.

The main change is the new tests: we now subclass CalciteJoinQueryTest
in CalciteSelectJoinQueryMSQTest twice, once for Broadcast and once for
SortMerge.

Two supporting production changes for default-value mode:

1) InputNumberDataSource is marked as concrete, to allow leftFilter to
   be pushed down to it.

2) In default-value mode, numeric frame field readers can now return nulls.
   This is necessary when stacking joins on top of joins: nulls must be
   preserved for semantics that match broadcast joins and native queries.

3) In default-value mode, StringFieldReader.isNull returns true on empty
   strings in addition to nulls. This is more consistent with the behavior
   of the selectors, which map empty strings to null as well in that mode.

As an effect of change (2), the InsertTimeNull change from #14020 (to
replace null timestamps with default timestamps) is reverted. IMO, this
is fine, as either behavior is defensible, and the change from #14020
hasn't been released yet.

* Adjust tests.

* Style fix.

* Additional tests.
2023-04-25 12:10:23 -07:00
Gian Merlino f643abdad9
SQL planning: Consider subqueries in fewer scenarios. (#14123)
* SQL planning: Consider subqueries in fewer scenarios.

Further adjusts logic in DruidRules that was previously adjusted in #13902.
The reason for the original change was that the comment "Subquery must be
a groupBy, so stage must be >= AGGREGATE" was no longer accurate. Subqueries
do not need to be groupBy anymore; they can really be any type of query.
If I recall correctly, the change was needed for certain window queries
to be able to plan on top of Scan queries.

However, this impacts performance negatively, because it causes many
additional outer-query scenarios to be considered, which is expensive.

So, this patch updates the matching logic to consider fewer scenarios. The
skipped scenarios are ones where we expect that, for one reason or another,
it isn't necessary to consider a subquery.

* Remove unnecessary escaping.

* Fix test.
2023-04-21 08:32:13 -07:00
Soumyava 8d60edcfcb
Updating segment map function for QueryDataSource to ensure group by … (#14112)
* Updating segment map function for QueryDataSource to ensure group by of group by of join data source gets into proper segment map function path

* Adding unit tests for the failed case

* There you go coverage bot, be happy now
2023-04-20 13:22:29 -07:00
zachjsh 04da0102cb
KillTask should return empty inputSource resources (#14106)
### Description

This pr fixes a few bugs found with the inputSource security feature.

1. `KillUnusedSegmentsTask` previously had no definition for the `getInputSourceResources`, which caused an unsupportedOperationException to be thrown when this task type was submitted with the inputSource security feature enabled. This task type should not require any input source specific resources, so returning an empty set for this task type now.

2. Fixed a bug where when the input source type security feature is enabled, all of the input source type specific resources used where authenticated against:

`{"resource": {"name": "EXTERNAL", "type": "{INPUT_SOURCE_TYPE}"}, "action": "READ"}`

When they should be instead authenticated against:

`{"resource": {"name": "{INPUT_SOURCE_TYPE}", "type": "EXTERNAL"}, "action": "READ"}`

3. fixed bug where supervisor tasks were not authenticated against the specific input source types used, if input source security feature was enabled.
2023-04-18 15:27:16 -04:00
Clint Wylie e7d2e8b914
fix bug filtering nested columns with expression filters (#14096) 2023-04-17 14:21:32 -07:00
Abhishek Radhakrishnan c98c66558f
Include statement attributes in `EXPLAIN PLAN` output (#14074)
This commit adds attributes that contain metadata information about the query
in the EXPLAIN PLAN output. The attributes currently contain two items:
- `statementTyp`: SELECT, INSERT or REPLACE
- `targetDataSource`: provides the target datasource name for DML statements

It is added to both the legacy and native query plan outputs.
2023-04-17 21:00:25 +05:30
Gian Merlino a8eb3f2f57
SQL: Fix natural comparator selection for groupBy. (#14075)
* SQL: Fix natural comparator selection for groupBy.

DruidQuery.computeSorting had some unique logic for finding natural
comparators for SQL types. It should be using getStringComparatorForRelDataType
instead.

One good effect here is that the comparator for BOOLEAN is now
NUMERIC rather than LEXICOGRAPHIC. The test case illustrates this.

* Remove msqCompatible, for now.

* Fix test.
2023-04-15 07:14:43 +05:30
Gian Merlino eeed5ed7e2
MSQ: Use the same result coercion routines as the regular SQL endpoint. (#14046)
* MSQ: Use the same result coercion routines as the regular SQL endpoint.

The main changes are to move NativeQueryMaker.coerce to SqlResults, and
to formally make the list of sqlTypeNames from the MSQ results reports
use SqlTypeNames.

- Change the default to MSQ-compatible rather than MSQ-incompatible.
  The explicit marker function is now "notMsqCompatible()".
2023-04-15 06:56:23 +05:30
Gian Merlino 0884a22c41
MSQ: Support for querying lookup and inline data directly. (#14048)
* MSQ: Support for querying lookup and inline data directly.

Main changes:

1) Add of LookupInputSpec and DataSourcePlan.forLookup.

2) Add InlineInputSpec, and modify of DataSourcePlan.forInline to use
   this instead of an ExternalInputSpec with JSON. This allows the inline
   data to act as the right-hand side of a join, if needed.

Supporting changes:

1) Modify JoinDataSource's leftFilter validation to be a little less
   strict: it's now OK with leftFilter being attached to any concrete
   leaf (no children) datasource, rather than requiring it be a table.
   This allows MSQ to create JoinDataSource with InputNumberDataSource
   as the base.

2) Add SegmentWranglerModule to CliIndexer, CliPeon. This allows them to
   query lookups and inline data directly.

* Updates based on CI.

* Additional tests.

* Style fix.

* Remove unused import.
2023-04-14 14:04:02 -07:00
Atul Mohan e3c160f2f2
Add start_time column to sys.servers (#13358)
Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.
2023-04-14 15:23:34 +05:30
zachjsh 2e87b5a901
Input source security sql layer can handle input source with multiple types (#14050)
### Description

This change allows for input sources used during MSQ ingestion to be authorized for multiple input source types, instead of just 1. Such an input source that allows for multiple types is the CombiningInputSource.

Also fixed bug that caused some input source specific functions to be authorized against the permissions

`
[
    new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ),
    new ResourceAction(new Resource(ResourceType.EXTERNAL, {input_source_type}), Action.READ)
]
`

when the inputSource based authorization feature is enabled, when it should instead be authorized against

`
[
    new ResourceAction(new Resource(ResourceType.EXTERNAL, {input_source_type}), Action.READ)
]
`
2023-04-10 09:48:57 -04:00
Clint Wylie 1aef72aa7e
Bump up the version in pom to 27.0.0 in preparation of release (#14051) 2023-04-10 14:56:59 +05:30
Gian Merlino d52bc333aa
Frames: Ensure nulls are read as default values when appropriate. (#14020)
* Frames: Ensure nulls are read as default values when appropriate.

Fixes a bug where LongFieldWriter didn't write a properly transformed
zero when writing out a null. This had no meaningful effect in SQL-compatible
null handling mode, because the field would get treated as a null anyway.
But it does have an effect in default-value mode: it would cause Long.MIN_VALUE
to get read out instead of zero.

Also adds NullHandling checks to the various frame-based column selectors,
allowing reading of nullable frames by servers in default-value mode.
2023-04-10 05:28:46 +05:30
zachjsh 5c0221375c
Allow for Input source security in native task layer (#14003)
Fixes #13837.

### Description

This change allows for input source type security in the native task layer.

To enable this feature, the user must set the following property to true:

`druid.auth.enableInputSourceSecurity=true`

The default value for this property is false, which will continue the existing functionality of needing authorization to write to the respective datasource.

When this config is enabled, the users will be required to be authorized for the following resource action, in addition to write permission on the respective datasource.

`new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}, Action.READ`

where `{INPUT_SOURCE_TYPE}` is the type of the input source being used;, http, inline, s3, etc..

Only tasks that provide a non-default implementation of the `getInputSourceResources` method can be submitted when config `druid.auth.enableInputSourceSecurity=true` is set. Otherwise, a 400 error will be thrown.
2023-04-06 13:13:09 -04:00
Clint Wylie b11c0bc249
smarter nested column index utilization (#13977)
* smarter nested column index utilization
changes:
* adds skipValueRangeIndexScale and skipValuePredicateIndexScale to ColumnConfig (e.g. DruidProcessingConfig) available as system config via druid.processing.indexes.skipValueRangeIndexScale and druid.processing.indexes.skipValuePredicateIndexScale
* NestedColumnIndexSupplier uses skipValueRangeIndexScale and skipValuePredicateIndexScale to multiply by the total number of rows to be processed to determine the threshold at which we should no longer consider using bitmap indexes because it will be too many operations
* Default values for skipValueRangeIndexScale and skipValuePredicateIndexScale have been initially set to 0.08, but are separate to allow independent tuning
* these are not documented on purpose yet because they are kind of hard to explain, the mainly exist to help conduct larger scale experiments than the jmh benchmarks used to derive the initial set of values
* these changes provide a pretty sweet performance boost for filter processing on nested columns
2023-04-06 04:09:24 -07:00
Paul Rogers 030ed911d4
Temporarily revert extended table functions for Druid 26 (#14019) 2023-04-05 21:09:33 -07:00
Gian Merlino 319f99db05
Always use file sizes when determining batch ingest splits (#13955)
* Always use file sizes when determining batch ingest splits.

Main changes:

1) Update CloudObjectInputSource and its subclasses (S3, GCS,
   Azure, Aliyun OSS) to use SplitHintSpecs in all cases. Previously, they
   were only used for prefixes, not uris or objects.

2) Update ExternalInputSpecSlicer (MSQ) to consider file size. Previously,
   file size was ignored; all files were treated as equal weight when
   determining splits.

A side effect of these changes is that we'll make additional network
calls to find the sizes of objects when users specify URIs or objects
as opposed to prefixes. IMO, this is worth it because it's the only way
to respect the user's split hint and task assignment settings.

Secondary changes:

1) S3, Aliyun OSS: Use getObjectMetadata instead of listObjects to get
   metadata for a single object. This is a simpler call that is also
   expected to be less expensive.

2) Azure: Fix a bug where getBlobLength did not populate blob
   reference attributes, and therefore would not actually retrieve the
   blob length.

3) MSQ: Align dynamic slicing logic between ExternalInputSpecSlicer and
   TableInputSpecSlicer.

4) MSQ: Adjust WorkerInputs to ensure there is always at least one
   worker, even if it has a nil slice.

* Add msqCompatible to testGroupByWithImpossibleTimeFilter.

* Fix tests.

* Add additional tests.

* Remove unused stuff.

* Remove more unused stuff.

* Adjust thresholds.

* Remove irrelevant test.

* Fix comments.

* Fix bug.

* Updates.
2023-04-05 08:54:01 -07:00
Clint Wylie 1c8a184677
add null safety checks for DiscoveryDruidNode services for more resilient http server and task views (#13930)
* add null safety checks for DiscoveryDruidNode services for more resilient http server and task vi
2023-04-05 02:45:39 -07:00
Clint Wylie d21babc5b8
remix nested columns (#14014)
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'auto' type indexer and merger which produces a new common nested format of columns, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
2023-04-04 17:51:59 -07:00
Soumyava ca94f7146f
Planning correctly for order by queries on time which previously thre… (#13965)
* Planning correctly for order by queries on time which previously threw a planning error
* Updating toDruidQueryForExplaining on a query data source if there is a window on the partial query
2023-04-03 18:30:19 -07:00
Soumyava 1eeecf5fb2
Fixing regression issues on unnest (#13976)
* select sum(c) on an unnested column now does not return 'Type mismatch' error and works properly
* Making sure an inner join query works properly
* Having on unnested column with a group by now works correctly
* count(*) on an unnested query now works correctly
2023-03-31 09:06:43 +05:30
zachjsh 3bb67721f7
Allow for Input source security in SQL layer (#13989)
This change introduces the concept of input source type security model, proposed in #13837.. With this change, this feature is only available at the SQL layer, but we will expand to native layer in a follow up PR.

To enable this feature, the user must set the following property to true:

druid.auth.enableInputSourceSecurity=true

The default value for this property is false, which will continue the existing functionality of having the usage all external sources being authorized against the hardcoded resource action

new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ

When this config is enabled, the users will be required to be authorized for the following resource action

new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}, Action.READ

where {INPUT_SOURCE_TYPE} is the type of the input source being used;, http, inline, s3, etc..

Documentation has not been added for the feature as it is not complete at the moment, as we still need to enable this for the native layer in a follow up pr.
2023-03-29 22:15:33 -04:00
Karan Kumar e4c5122a60
Fixing checkstyle (#14000) 2023-03-29 20:21:21 +05:30
Paul Rogers 76fe26d4ba
Fix typos, add tests for http() function (#13954) 2023-03-28 14:41:06 -07:00
Clint Wylie d5b1b5bc8e
nested columns + arrays = array columns! (#13803)
array columns!
changes:
* add support for storing nested arrays of string, long, and double values as specialized nested columns instead of breaking them into separate element columns
* nested column type mimic behavior means that columns ingested with only root arrays of primitive values will be ARRAY typed columns
* neat test refactor stuff
* add v4 segment test
* add array element indexes
* add tests for unnest and array columns
* fix unnest column value selector cursor handling of null and empty arrays
2023-03-27 12:42:35 -07:00
Paul Rogers da42ee5bfa
Added TYPE(native) data type for external tables (#13958) 2023-03-22 21:43:29 -07:00
Soumyava 2ad133c06e
Unnest changes for moving the filter on right side of correlate to inside the unnest datasource (#13934)
* Refactoring and bug fixes on top of unnest. The filter now is passed inside the unnest cursors. Added tests for scenarios such as
1. filter on unnested column which involves a left filter rewrite
2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
3. not filters
4. SQL functions applied on top of unnested column
5. null present in first row of the column to be unnested
2023-03-22 18:24:00 -07:00
Clint Wylie 086eb26b74
fix join and unnest planning to ensure that duplicate join prefixes are not used (#13943)
* fix join and unnest planning to ensure that duplicate join prefixes are not used

* wont somebody please think of the children
2023-03-22 12:53:55 -07:00
Clint Wylie f4392a3155
expression transform improvements and fixes (#13947)
changes:
* fixes inconsistent handling of byte[] values between ExprEval.bestEffortOf and ExprEval.ofType, which could cause byte[] values to end up as java toString values instead of base64 encoded strings in ingest time transforms
* improved ExpressionTransform binding to re-use ExprEval.bestEffortOf when evaluating a binding instead of throwing it away
* improved ExpressionTransform array handling, added RowFunction.evalDimension that returns List<String> to back Row.getDimension and remove the automatic coercing of array types that would typically happen to expression transforms unless using Row.getDimension
* added some tests for ExpressionTransform with array inputs
* improved ExpressionPostAggregator to use partial type information from decoration
* migrate some test uses of InputBindings.forMap to use other methods
2023-03-21 23:26:53 -07:00
Adarsh Sanjeev 143fdcfacf
Change test name so it triggers in CI (#13844)
As the name of the class did not end or start with "Test", CalciteSelectQueryMSQTest was not triggered in CI. This PR renames the test.
2023-03-20 15:55:52 +05:30
somu-imply a7ba361666
Refactoring and bug fixes on top of unnest. The allowList now is not passed … (#13922)
* Refactoring and bug fixes on top of unnest. The filter now is passed inside the unnest cursors. Added tests for scenarios such as
1. filter on unnested column which involves a left filter rewrite
2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite
3. not filters
4. SQL functions applied on top of unnested column
5. null present in first row of the column to be unnested
2023-03-14 16:05:56 -07:00
Suneet Saldanha 44547614ae
Report engine as a dimension for sqlQuery metrics (#13906)
* Report engine as a dimension for sqlQuery metrics

* docs
2023-03-10 11:23:57 -08:00
Gian Merlino 4b1ffbc452
Various changes and fixes to UNNEST. (#13892)
* Various changes and fixes to UNNEST.

Native changes:

1) UnnestDataSource: Replace "column" and "outputName" with "virtualColumn".
   This enables pushing expressions into the datasource. This in turn
   allows us to do the next thing...

2) UnnestStorageAdapter: Logically apply query-level filters and virtual
   columns after the unnest operation. (Physically, filters are pulled up,
   when possible.) This is beneficial because it allows filters and
   virtual columns to reference the unnested column, and because it is
   consistent with how the join datasource works.

3) Various documentation updates, including declaring "unnest" as an
   experimental feature for now.

SQL changes:

1) Rename DruidUnnestRel (& Rule) to DruidUnnestRel (& Rule). The rel
   is simplified: it only handles the UNNEST part of a correlated join.
   Constant UNNESTs are handled with regular inline rels.

2) Rework DruidCorrelateUnnestRule to focus on pulling Projects from
   the left side up above the Correlate. New test testUnnestTwice verifies
   that this works even when two UNNESTs are stacked on the same table.

3) Include ProjectCorrelateTransposeRule from Calcite to encourage
   pushing mappings down below the left-hand side of the Correlate.

4) Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule
   to handle pulling Filters up above the Correlate. New tests
   testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify
   this behavior.

5) Require a context feature flag for SQL UNNEST, since it's undocumented.
   As part of this, also cleaned up how we handle feature flags in SQL.
   They're now hooked into EngineFeatures, which is useful because not
   all engines support all features.
2023-03-10 16:42:08 +05:30
imply-cheddar 6b90a320cf
Add back function signature for compat (#13914)
* Add back function signature for compat

* Suppress IntelliJ Error
2023-03-09 21:06:34 -08:00
Gian Merlino bf39b4d313
Window planning: use collation traits, improve subquery logic. (#13902)
* Window planning: use collation traits, improve subquery logic.

SQL changes:

1) Attach RelCollation (sorting) trait to any PartialDruidQuery
   that ends in AGGREGATE or AGGREGATE_PROJECT. This allows planning to
   take advantage of the fact that Druid sorts by dimensions when
   doing aggregations.

2) Windowing: inspect RelCollation trait from input, and insert naiveSort
   if, and only if, necessary.

3) Windowing: add support for Project after Window, when the Project
   is a simple mapping. Helps eliminate subqueries.

4) DruidRules: update logic for considering subqueries to reflect that
   subqueries are not required to be GroupBys, and that we have a bunch
   of new Stages now. With all of this evolution that has happened, the
   old logic didn't quite make sense.

Native changes:

1) Use merge sort (stable) rather than quicksort when sorting
   RowsAndColumns. Makes it easier to write test cases for plans that
   involve re-sorting the data.

* Changes from review.

* Mark the bad test as failing.

* Additional update.

* Fix failingTest.

* Fix tests.

* Mark a var final.
2023-03-09 15:48:13 -08:00
Clint Wylie 48ac5ce50b
use native nvl expression for SQL NVL and 2 argument COALESCE (#13897)
* use custom case operator conversion instead of direct operator conversion, to produce native nvl expression for SQL NVL and 2 argument COALESCE, and add optimization for certain case filters from coalesce and nvl statements
2023-03-09 05:46:17 -08:00
Gian Merlino 90d8f67e3d
Avoid creating new RelDataTypeFactory during SQL planning. (#13904)
* Avoid creating new RelDataTypeFactory during SQL planning.

Reduces unnecessary CPU cycles.

* Fix.
2023-03-08 21:55:49 -08:00
Gian Merlino 82f7a56475
Sort-merge join and hash shuffles for MSQ. (#13506)
* Sort-merge join and hash shuffles for MSQ.

The main changes are in the processing, multi-stage-query, and sql modules.

processing module:

1) Rename SortColumn to KeyColumn, replace boolean descending with KeyOrder.
   This makes it nicer to model hash keys, which use KeyOrder.NONE.

2) Add nullability checkers to the FieldReader interface, and an
   "isPartiallyNullKey" method to FrameComparisonWidget. The join
   processor uses this to detect null keys.

3) Add WritableFrameChannel.isClosed and OutputChannel.isReadableChannelReady
   so callers can tell which OutputChannels are ready for reading and which
   aren't.

4) Specialize FrameProcessors.makeCursor to return FrameCursor, a random-access
   implementation. The join processor uses this to rewind when it needs to
   replay a set of rows with a particular key.

5) Add MemoryAllocatorFactory, which is embedded inside FrameWriterFactory
   instead of a particular MemoryAllocator. This allows FrameWriterFactory
   to be shared in more scenarios.

multi-stage-query module:

1) ShuffleSpec: Add hash-based shuffles. New enum ShuffleKind helps callers
   figure out what kind of shuffle is happening. The change from SortColumn
   to KeyColumn allows ClusterBy to be used for both hash-based and sort-based
   shuffling.

2) WorkerImpl: Add ability to handle hash-based shuffles. Refactor the logic
   to be more readable by moving the work-order-running code to the inner
   class RunWorkOrder, and the shuffle-pipeline-building code to the inner
   class ShufflePipelineBuilder.

3) Add SortMergeJoinFrameProcessor and factory.

4) WorkerMemoryParameters: Adjust logic to reserve space for output frames
   for hash partitioning. (We need one frame per partition.)

sql module:

1) Add sqlJoinAlgorithm context parameter; can be "broadcast" or
   "sortMerge". With native, it must always be "broadcast", or it's a
   validation error. MSQ supports both. Default is "broadcast" in
   both engines.

2) Validate that MSQs do not use broadcast join with RIGHT or FULL join,
   as results are not correct for broadcast join with those types. Allow
   this in native for two reasons: legacy (the docs caution against it,
   but it's always been allowed), and the fact that it actually *does*
   generate correct results in native when the join is processed on the
   Broker. It is much less likely that MSQ will plan in such a way that
   generates correct results.

3) Remove subquery penalty in DruidJoinQueryRel when using sort-merge
   join, because subqueries are always required, so there's no reason
   to penalize them.

4) Move previously-disabled join reordering and manipulation rules to
   FANCY_JOIN_RULES, and enable them when using sort-merge join. Helps
   get to better plans where projections and filters are pushed down.

* Work around compiler problem.

* Updates from static analysis.

* Fix @param tag.

* Fix declared exception.

* Fix spelling.

* Minor adjustments.

* wip

* Merge fixups

* fixes

* Fix CalciteSelectQueryMSQTest

* Empty keys are sortable.

* Address comments from code review. Rename mux -> mix.

* Restore inspection config.

* Restore original doc.

* Reorder imports.

* Adjustments

* Fix.

* Fix imports.

* Adjustments from review.

* Update header.

* Adjust docs.
2023-03-08 14:19:39 -08:00
Adarsh Sanjeev ef82756176
Add validation for aggregations on __time (#13793)
* Add validation for aggregations on __time
2023-03-07 17:16:36 -08:00
Clint Wylie 3924f0eff4
use Calcites.getColumnTypeForRelDataType for SQL CAST operator conversion (#13890)
* use Calcites.getColumnTypeForRelDataType for SQL CAST operator conversion

* fix comment

* intervals are strings but also longs
2023-03-07 13:12:15 -08:00
Gian Merlino fcfb7b8ff6
Add warning comments to Granularity.getIterable. (#13888)
This function is notorious for causing memory exhaustion and excessive
CPU usage; so much so that it was valuable to work around it in the
SQL planner in #13206. Hopefully, a warning comment will encourage
developers to stay away and come up with solutions that do not involve
computing all possible buckets.
2023-03-06 22:57:10 -08:00
Clint Wylie 6cf754b0e0
move numeric null value coercion out of expression processing engine (#13809)
* move numeric null value coercion out of expression processing engine
* add ExprEval.valueOrDefault() to allow consumers to automatically coerce to default values
* rename Expr.buildVectorized as Expr.asVectorProcessor more consistent naming with Function and ApplyFunction; javadocs for some stuff
2023-02-28 18:10:07 -08:00
Paul Rogers 914eebb4b7
Wire up the catalog resolver (#13788)
Introduces the catalog resolver interface
Wires the resolver up to the planner factory
Refactors planner factory
2023-02-22 11:42:32 -08:00
zachjsh 665dee43bf
Revert "Operator conversion deny list (#13766)" (#13829)
This reverts commit 38e620aa4c.
2023-02-21 15:14:49 -08:00
Paul Rogers 85d36be085
Information schema now uses numeric column types (#13777)
Change to use SQL schemas to allow null numeric columns

* Updated docs
2023-02-17 14:39:31 -08:00
Clint Wylie 08b5951cc5
merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything (#13698)
* merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything
* fix poms and license stuff
* mockito is evil
* allow reset of JvmUtils RuntimeInfo if tests used static injection to override
2023-02-17 14:27:41 -08:00
Paul Rogers 333196d207
Code cleanup & message improvements (#13778)
* Misc cleanup edits

Correct spacing
Add type parameters
Add toString() methods to formats so tests compare correctly
IT doc revisions
Error message edits
Display UT query results when tests fail

* Edit

* Build fix

* Build fixes
2023-02-15 15:22:54 +05:30
Clint Wylie fa4cab405f
fix bug with sql planner when virtual column capabilities are null (#13797) 2023-02-13 18:27:23 -08:00
Paul Rogers 842ee554de
Refinements to input-source specific table functions (#13780)
Refinements to table functions

Fixes various bugs
Improves the structure of the table function classes
Adds unit and integration tests
2023-02-13 16:21:27 -08:00
Clint Wylie f09f83697d
fix array_agg to work with complex types and bugs with expression aggregator complex array handling (#13781)
* fix array_agg to work with complex types and bugs with expression aggregator complex array handling
* more consistent handling of array expressions, numeric arrays more consistently honor druid.generic.useDefaultValueForNull, fix array_ordinal sql output type
2023-02-12 22:01:39 -08:00
zachjsh 38e620aa4c
Operator conversion deny list (#13766)
### Description

This change adds a new config property `druid.sql.planner.operatorConversion.denyList`, which allows a user to specify
any operator conversions that they wish to disallow. A user may want to do this for a number of reasons, including security concerns. The default value of this property is the empty list `[]`, which does not disallow any operator conversions.

An example usage of this property is `druid.sql.planner.operatorConversion.denyList=["extern"]`, which disallows the usage of the `extern` operator conversion. If the property is configured this way, and a user of the Druid cluster tries to submit a query that uses the `extern` function, such as the example given [here](https://druid.apache.org/docs/latest/multi-stage-query/examples.html#insert-with-no-rollup), a response with http response code `400` is returned with en error body similar to the following:

```
{
  "taskId": "4ec5b0b6-fa9b-4c3a-827d-2308294e9985",
  "state": "FAILED",
  "error": {
    "error": "Plan validation failed",
    "errorMessage": "org.apache.calcite.runtime.CalciteContextException: From line 28, column 5 to line 32, column 5: No match found for function signature EXTERN(<CHARACTER>, <CHARACTER>, <CHARACTER>)",
    "errorClass": "org.apache.calcite.tools.ValidationException",
    "host": null
  }
}
```
2023-02-10 09:59:26 -08:00
AmatyaAvadhanula dcdae84888
Add server view initialization metrics (#13716)
* Add server view init metrics

* Test coverage

* Rename metrics
2023-02-07 20:02:00 +05:30
Clint Wylie 2d3bee8545
various nested column (and other) fixes (#13732)
changes:
* modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json>
* NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not
* fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true
* fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for complex<json>
* added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time
* fix inconsistency between nested column indexer and serializer in handling values (coerce non primitive and non arrays of primitives using asString)
* ExprEval best effort mode now handles byte[] as string
* added test for ExprEval.bestEffortOf, and add missing conversion cases that tests uncovered
* more tests more better
2023-02-06 19:48:02 -08:00
imply-cheddar 706b8a0227
Adjust Operators to be Pausable (#13694)
* Adjust Operators to be Pausable

This enables "merge" style operations that
combine multiple streams.

This change includes a naive implementation
of one such merge operator just to provide
concrete evidence that the refactoring is
effective.
2023-01-23 20:52:06 -08:00
somu-imply 90d445536d
SQL version of unnest native druid function (#13576)
* adds the SQL component of the native unnest functionality in Druid to unnest SQL queries on a table dimension, virtual column or a constant array and convert them into native Druid queries
* unnest in SQL is implemented as a combination of Correlate (the comma join part) and Uncollect (the unnest part)
2023-01-23 12:53:31 -08:00
Laksh Singla a516eb1a41
Port Calcite's tests to run with MSQ (#13625)
* SQL test framework extensions

* Capture planner artifacts: logical plan, etc.
* Planner test builder validates the logical plan
* Validation for the SQL resut schema (we already have
  validation for the Druid row signature)
* Better Guice integration: properties, reuse Guice modules
* Avoid need for hand-coded expr, macro tables
* Retire some of the test-specific query component creation
* Fix query log hook race condition

Co-authored-by: Paul Rogers <progers@apache.org>
2023-01-19 08:51:11 -08:00
Paul Rogers 22630b0aab
Much improved table functions (#13627)
Much improved table functions

* Revises properties, definitions in the catalog
* Adds a "table function" abstraction to model such functions
* Specific functions for HTTP, inline, local and S3.
* Extended SQL types in the catalog
* Restructure external table definitions to use table functions
* EXTEND syntax for Druid's extern table function
* Support for array-valued table function parameters
* Support for array-valued SQL query parameters
* Much new documentation
2023-01-17 08:41:57 -08:00
imply-cheddar 7ff3722cb9
Swap LazySingleton for Singleton (#13673)
* Swap LazySingleton for Singleton
* Initialize WebserverTestUtils properly
2023-01-15 21:38:37 -08:00
Clint Wylie b5b740bbbb
allow using nested column indexer for schema discovery (#13653)
* single typed "root" only nested columns now mimic "regular" columns of those types
* incremental index can now use nested column indexer instead of string indexer for discovered columns
2023-01-12 18:31:12 -08:00
Adarsh Sanjeev 0a486c3bcf
Update forbidden apis with fixed executor (#13633)
* Update forbidden apis with fixed executor
2023-01-12 15:34:36 +05:30
imply-cheddar f1821a7c18
Add Sort Operator for Window Functions (#13619)
* Addition of NaiveSortMaker and Default implementation

Add the NaiveSortMaker which makes a sorter
object and a default implementation of the
interface.

This also allows us to plan multiple different window 
definitions on the same query.
2023-01-06 00:27:18 -08:00
imply-cheddar a8ecc48ffe
Validate response headers and fix exception logging (#13609)
* Validate response headers and fix exception logging

A class of QueryException were throwing away their
causes making it really hard to determine what's
going wrong when something goes wrong in the SQL
planner specifically.  Fix that and adjust tests
 to do more validation of response headers as well.

We allow 404s and 307s to be returned even without 
authorization validated, but others get converted to 403
2023-01-05 14:15:15 -08:00
Clint Wylie fd63e5a514
fix issue with jdbc and query metrics (#13608)
* fix issue with metrics emitting and jdbc results by getting yielder from query processing thread

* more better
2022-12-21 19:32:53 -08:00
imply-cheddar 0efd0879a8
Unify the handling of HTTP between SQL and Native (#13564)
* Unify the handling of HTTP between SQL and Native

The SqlResource and QueryResource have been
using independent logic for things like error
handling and response context stuff.  This
became abundantly clear and painful during a
change I was making for Window Functions, so
I unified them into using the same code for
walking the response and serializing it.

Things are still not perfectly unified (it would
be the absolute best if the SqlResource just
took SQL, planned it and then delegated the
query run entirely to the QueryResource), but
this refactor doesn't take that fully on.

The new code leverages async query processing
from our jetty container, the different
interaction model with the Resource means that
a lot of tests had to be adjusted to align with
the async query model.  The semantics of the
tests remain the same with one exception: the
SqlResource used to not log requests that failed
authorization checks, now it does.
2022-12-19 00:25:33 -08:00
Clint Wylie 9ae7a36ccd
improve nested column storage format for broader compatibility (#13568)
* bump nested column format version
changes:
* nested field files are now named by their position in field paths list, rather than directly by the path itself. this fixes issues with valid json properties with commas and newlines breaking the csv file meta.smoosh
* update StructuredDataProcessor to deal in NestedPathPart to be consistent with other abstract path handling rather than building JQ syntax strings directly
* add v3 format segment and test
2022-12-15 15:39:26 -08:00
imply-cheddar 089d8da561
Support Framing for Window Aggregations (#13514)
* Support Framing for Window Aggregations

This adds support for framing  over ROWS
for window aggregations.

Still not implemented as yet:
1. RANGE frames
2. Multiple different frames in the same query
3. Frames on last/first functions
2022-12-14 18:04:39 -08:00
Rohan Garg 35c983a351
Use template file for adding table functions grammar (#13553) 2022-12-14 21:52:09 +05:30
somu-imply 7682b0b6b1
Analysis refactor (#13501)
Refactor DataSource to have a getAnalysis method()

This removes various parts of the code where while loops and instanceof
checks were being used to walk through the structure of DataSource objects
in order to build a DataSourceAnalysis.  Instead we just ask the DataSource
for its analysis and allow the stack to rebuild whatever structure existed.
2022-12-12 17:35:44 -08:00
Paul Rogers 013a12e86f
Enhanced MSQ table functions (#13360)
* Enhanced MSQ table functions
* HTTP, LOCALFILES and INLINE table functions powered by
catalog metadata.
* Documentation
2022-12-08 13:56:02 -08:00
imply-cheddar 83261f9641
Starting on Window Functions (#13458)
* Processors for Window Processing

This is an initial take on how to use Processors
for Window Processing.  A Processor is an interface
that transforms RowsAndColumns objects.
RowsAndColumns objects are essentially combinations
of rows and columns.

The intention is that these Processors are the start
of a set of operators that more closely resemble what
DB engineers would be accustomed to seeing.

* Wire up windowed processors with a query type that
can run them end-to-end.  This code can be used to
actually run a query, so yay!

* Wire up windowed processors with a query type that
can run them end-to-end.  This code can be used to
actually run a query, so yay!

* Some SQL tests for window functions. Added wikipedia 
data to the indexes available to the
SQL queries and tests validating the windowing
functionality as it exists now.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2022-12-06 15:54:05 -08:00
Paul Rogers b76ff16d00
SQL test framework extensions (#13426)
SQL test framework extensions

* Capture planner artifacts: logical plan, etc.
* Planner test builder validates the logical plan
* Validation for the SQL resut schema (we already have
  validation for the Druid row signature)
* Better Guice integration: properties, reuse Guice modules
* Avoid need for hand-coded expr, macro tables
* Retire some of the test-specific query component creation
* Fix query log hook race condition
2022-12-02 09:11:59 -08:00
Kashif Faraz 656b6cdf62
Add MetricsVerifier to simplify verification of metric values in tests (#13442) 2022-11-28 19:32:37 +05:30
Kashif Faraz 7cf761cee4
Prepare master branch for next release, 26.0.0 (#13401)
* Prepare master branch for next release, 26.0.0

* Use docker image for druid 24.0.1

* Fix version in druid-it-cases pom.xml
2022-11-22 15:31:01 +05:30
Paul Rogers 81d005f267
Druid Catalog basics (#13165)
Druid catalog basics

Catalog object model for tables, columns
Druid metadata DB storage (as an extension)
REST API to update the catalog (as an extension)
Integration tests
Model only: no planner integration yet
2022-11-12 15:30:22 -08:00
Clint Wylie 27215d1ff1
fix complex_decode_base64 function, add SQL bindings (#13332)
* fix complex_decode_base64 function, add SQL bindings

* more permissive
2022-11-09 23:40:25 -08:00
Paul Rogers 7e600d2c63
Enhancements to the Calcite test framework (#13283)
* Enhancements to the Calcite test framework
* Standardize "Unauthorized" messages
* Additional test framework extension points
* Resolved joinable factory dependency issue
2022-11-08 14:28:49 -08:00
Gian Merlino 8f90589ce5
Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247)
* Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH.

These aggregation functions are documented as creating sketches. However,
they are planned into native aggregators that include finalization logic
to convert the sketch to a number of some sort. This creates an
inconsistency: the functions sometimes return sketches, and sometimes
return numbers, depending on where they lie in the native query plan.

This patch changes these SQL aggregators to _never_ finalize, by using
the "shouldFinalize" feature of the native aggregators. It already
existed for theta sketches. This patch adds the feature for hll and
quantiles sketches.

As to impact, Druid finalizes aggregators in two cases:

- When they appear in the outer level of a query (not a subquery).
- When they are used as input to an expression or finalizing-field-access
  post-aggregator (not any other kind of post-aggregator).

With this patch, the functions will no longer be finalized in these cases.

The second item is not likely to matter much. The SQL functions all declare
return type OTHER, which would be usable as an input to any other function
that makes sense and that would be planned into an expression.

So, the main effect of this patch is the first item. To provide backwards
compatibility with anyone that was depending on the old behavior, the
patch adds a "sqlFinalizeOuterSketches" query context parameter that
restores the old behavior.

Other changes:

1) Move various argument-checking logic from runtime to planning time in
   DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker.

2) Add various JsonIgnores to the sketches to simplify their JSON representations.

3) Allow chaining of ExpressionPostAggregators and other PostAggregators
   in the SQL layer.

4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer,
   now that expressions can operate on complex inputs.

5) Adjust return type to thetaSketch (instead of OTHER) in
   ThetaSketchSetBaseOperatorConversion.

* Fix benchmark class.

* Fix compilation error.

* Fix ThetaSketchSqlAggregatorTest.

* Hopefully fix ITAutoCompactionTest.

* Adjustment to ITAutoCompactionTest.
2022-11-03 09:43:00 -07:00
Paul Rogers 22c140251a
Removed unused planner context parameter (#13249)
Removed unused planner context parameter
2022-10-27 17:59:26 -07:00
somu-imply affc522b9f
Refactoring the data source before unnest (#13085)
* First set of changes for framework

* Second set of changes to move segment map function to data source

* Minot change to server manager

* Removing the createSegmentMapFunction from JoinableFactoryWrapper and moving to JoinDataSource

* Checkstyle fixes

* Patching Eric's fix for injection

* Checkstyle and fixing some CI issues

* Fixing code inspections and some failed tests and one injector for test in avatica

* Another set of changes for CI...almost there

* Equals and hashcode part update

* Fixing injector from Eric + refactoring for broadcastJoinHelper

* Updating second injector. Might revert later if better way found

* Fixing guice issue in JoinableFactory

* Addressing review comments part 1

* Temp changes refactoring

* Revert "Temp changes refactoring"

This reverts commit 9da42a9ef0.

* temp

* Temp discussions

* Refactoring temp

* Refatoring the query rewrite to refer to a datasource

* Refactoring getCacheKey by moving it inside data source

* Nullable annotation check in injector

* Addressing some comments, removing 2 analysis.isJoin() checks and correcting the benchmark files

* Minor changes for refactoring

* Addressing reviews part 1

* Refactoring part 2 with new test cases for broadcast join

* Set for nullables

* removing instance of checks

* Storing nullables in guice to avoid checking on reruns

* Fixing a test case and removing an irrelevant line

* Addressing the atomic reference review comments
2022-10-26 15:58:58 -07:00
Gian Merlino 2b0d873c7e
Fix two sources of SQL statement leaks. (#13259)
* Fix two sources of SQL statement leaks.

1) SqlTaskResource and DruidJdbcResultSet leaked statements 100% of the
   time, since they call stmt.plan(), which adds statements to
   SqlLifecycleManager, and they do not explicitly remove them.

2) SqlResource leaked statements if yielder.close() threw an exception.
   (And also would not emit metrics, since in that case it failed to
   call stmt.close as well.)

* Only closeQuietly is needed.
2022-10-25 09:31:56 -07:00
Paul Rogers 86e6e61e88
Modular Calcite Test Framework (#12965)
* Refactor Calcite test "framework" for planner tests

Refactors the current Calcite tests to make it a bit easier
to adjust the set of runtime objects used within a test.

* Move data creation out of CalciteTests into TestDataBuilder
* Move "framework" creation out of CalciteTests into
  a QueryFramework
* Move injector-dependent functions from CalciteTests
  into QueryFrameworkUtils
* Wrapper around the planner factory, etc. to allow
  customization.
* Bulk of the "framework" created once per class rather
  than once per test.
* Refactor tests to use a test builder
* Change all testQuery() methods to use the test builder.
Move test execution & verification into a test runner.
2022-10-20 15:45:44 -07:00
Paul Rogers b34b4353f4
Async reads for JDBC (#13196)
Async reads for JDBC:
Prevents JDBC timeouts on long queries by returning empty batches
when a batch fetch takes too long. Uses an async model to run the
result fetch concurrently with JDBC requests.

Fixed race condition in Druid's Avatica server-side handler
Fixed issue with no-user connections
2022-10-18 11:40:57 -07:00
Gian Merlino 6aca61763e
SQL: Use timestamp_floor when granularity is not safe. (#13206)
* SQL: Use timestamp_floor when granularity is not safe.

PR #12944 added a check at the execution layer to avoid materializing
excessive amounts of time-granular buckets. This patch modifies the SQL
planner to avoid generating queries that would throw such errors, by
switching certain plans to use the timestamp_floor function instead of
granularities. This applies both to the Timeseries query type, and the
GroupBy timestampResultFieldGranularity feature.

The patch also goes one step further: we switch to timestamp_floor
not just in the ETERNITY + non-ALL case, but also if the estimated
number of time-granular buckets exceeds 100,000.

Finally, the patch modifies the timestampResultFieldGranularity
field to consistently be a String rather than a Granularity. This
ensures that it can be round-trip serialized and deserialized, which is
useful when trying to execute the results of "EXPLAIN PLAN FOR" with
GroupBy queries that use the timestampResultFieldGranularity feature.

* Fix test, address PR comments.

* Fix ControllerImpl.

* Fix test.

* Fix unused import.
2022-10-17 08:22:45 -07:00
Paul Rogers f4dcc52dac
Redesign QueryContext class (#13071)
We introduce two new configuration keys that refine the query context security model controlled by druid.auth.authorizeQueryContextParams. When that value is set to true then two other configuration options become available:

druid.auth.unsecuredContextKeys: The set of query context keys that do not require a security check. Use this for the "white-list" of key to allow. All other keys go through the existing context key security checks.
druid.auth.securedContextKeys: The set of query context keys that do require a security check. Use this when you want to allow all but a specific set of keys: only these keys go through the existing context key security checks.
Both are set using JSON list format:

druid.auth.securedContextKeys=["secretKey1", "secretKey2"]
You generally set one or the other values. If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys.

In addition, Druid defines two query context keys which always bypass checks because Druid uses them internally:

sqlQueryId
sqlStringifyArrays
2022-10-15 11:02:11 +05:30
Clint Wylie 6eff6c9ae4
fix json_value sql planning with decimal type, fix vectorized expression math null value handling in default mode (#13214)
* fix json_value sql planning with decimal type, fix vectorized expression math null value handling in default mode
changes:
* json_value 'returning' decimal will now plan to native double typed query instead of ending up with default string typing, allowing decimal vector math expressions to work with this type
* vector math expressions now zero out 'null' values even in 'default' mode (druid.generic.useDefaultValueForNull=false) to prevent downstream things that do not check the null vector from producing incorrect results

* more better

* test and why not vectorize

* more test, more fix
2022-10-12 16:28:41 -07:00
Abhishek Agarwal 61b34950e7
Fix assertion error in sql planning for latest aggregators (#13151)
* Fix sql planning bug for latest aggregators

* change test name

* Fix error messages

* fix error message again
2022-09-28 21:01:32 +05:30
Sam Rash 28b9edc2a8
Add BIG_SUM SQL function (#13102)
This adds a sql function, "BIG_SUM", that uses
CompressedBigDecimal to do a sum. Other misc changes:

1. handle NumberFormatExceptions when parsing a string (default to set
   to 0, configurable in agg factory to be strict and throw on error)
2. format pom file (whitespace) + add dependency
3. scaleUp -> scale and always require scale as a parameter
2022-09-26 18:02:25 -07:00
Paul Rogers 8ce03eb094
Convert the Druid planner to use statement handlers (#12905)
* Converted Druid planner to use statement handlers

Converts the large collection of if-statements for statement
types into a set of classes: one per supported statement type.
Cleans up a few error messages.

* Revisions from review comments

* Build fix

* Build fix

* Resolve merge confict.

* More merges with QueryResponse PR

* More parameterized type cleanup

Forces a rebuild due to a flaky test
2022-09-19 11:58:45 +05:30
Frank Chen fd6c05eee8
Avoid ClassCastException when getting values from `QueryContext` (#13022)
* Use safe conversion methods

* Rename method

* Add getContextAsBoolean

* Update test case

* Remove generic from getContextValue

* Update catch-handler

* Add test

* Resolve comments

* Replace 'getContextXXX' to 'getQueryContext().getAsXXXX'
2022-09-13 18:00:09 +08:00
Gian Merlino 77925cdcdd
Expressions: fixes for round-trips of floating point literals, Long.MIN_VALUE literals, Shuffle.visitAll. (#13037)
* SQL: Fix round-trips of floating point literals.

When writing RexLiterals into Druid expressions, we now write non-integer
numeric literals in such a way that ensures they are parsed as doubles
on the other end.

* Updates from code review, and some additional stuff inspired by the
investigation.

- Remove unnecessary formatting code from DruidExpression.doubleLiteral:
  it handles things just fine with its default behavior.

- Fix a problem where expression literals could not represent Long.MIN_VALUE.
  Now, integer literals start life off as BigIntegerExpr instead of LongExpr,
  and are converted to LongExpr during flattening. This is necessary because,
  in order to avoid ambiguity between unary minus and negative literals, our
  grammar does not actually have true negative literals. Negative numbers must
  be represented as unary minus next to a positive literal.

- Fix a bug  introduced in #12230 where shuttle.visitAll(args) delegated
  to shuttle.visit(arg) instead of arg.visit(shuttle). The latter does
  a recursive visitation, which is the intended behavior.

* Style fixes.

* Move regexp to the right place.
2022-09-12 17:06:20 -07:00
Paul Rogers 80b97ac24d
Create a copy of the shared JDBC context (#13049) 2022-09-12 10:27:56 -07:00
imply-cheddar 5ba0075c0c
Expose HTTP Response headers from SqlResource (#13052)
* Expose HTTP Response headers from SqlResource

This change makes the SqlResource expose HTTP response
headers in the same way that the QueryResource exposes them.

Fundamentally, the change is to pipe the QueryResponse
object all the way through to the Resource so that it can
populate response headers.  There is also some code
cleanup around DI, as there was a superfluous FactoryFactory
class muddying things up.
2022-09-12 01:40:06 -07:00
Gian Merlino e29e7a8434
Add ARRAY_QUANTILE function. (#13061)
* Add ARRAY_QUANTILE function.

Expected usage is like: ARRAY_QUANTILE(ARRAY_AGG(x), 0.9).

* Fix test.
2022-09-09 11:29:20 -07:00
Rohan Garg 2f156b3610
Disallow timeseries queries with ETERNITY interval and non-ALL granularity (#12944) 2022-09-07 16:45:08 +05:30
Rohan Garg 7aa8d7f987
Add query/time metric for SQL queries from router (#12867)
* Add query/time metric for SQL queries from router

* Fix query cancel bug when user has overriden native query-id in a SQL query
2022-09-07 13:54:46 +05:30
Clint Wylie a3a377e570
more consistent expression error messages (#12995)
* more consistent expression error messages

* review stuff

* add NamedFunction for Function, ApplyFunction, and ExprMacro to share common stuff

* fixes

* add expression transform name to transformer failure, better parse_json error messaging
2022-09-06 23:21:38 -07:00
Gian Merlino 0460d8a502
Adjust SQL "cannot plan" error message. (#12903)
Two changes:

1) Restore the text of the SQL query. It was removed in #12897, but
   then it was later pointed out that the text is helpful for end
   users querying Druid through tools that do not show the SQL queries
   that they are making.

2) Adjust wording slightly, from "Cannot build plan for query" to
   "Query not supported". This will be clearer to most users. Generally
   the reason we get these errors is due to unsupported SQL constructs.
2022-08-29 18:33:00 +05:30
Abhishek Agarwal 618757352b
Bump up the version to 25.0.0 (#12975)
* Bump up the version to 25.0.0

* Fix the version in console
2022-08-29 11:27:38 +05:30
Clint Wylie 16f5ac5bd5
json_value adjustments (#12968)
* json_value adjustments
changes:
* native json_value expression now has optional 3rd argument to specify type, which will cast all values to the specified type
* rework how JSON_VALUE is wired up in SQL. Now we are using a custom convertlet to translate JSON_VALUE(... RETURNING type) into dedicated JSON_VALUE_BIGINT, JSON_VALUE_DOUBLE, JSON_VALUE_VARCHAR, JSON_VALUE_ANY instead of using the calcite StandardConvertletTable that wraps JSON_VALUE_ANY in a CAST, so that we preserve the typing of JSON_VALUE to pass down to the native expression as the 3rd argument

* fix json_value_any to be usable by humans too, coverage

* fix bug

* checkstyle

* checkstyle

* review stuff

* validate that options to json_value are the supported options rather than ignore them

* remove more legacy undocumented functions
2022-08-27 07:15:47 -07:00
Clint Wylie 4bdf9815c1
fix issue with SQL planner and null array constants (#12971) 2022-08-26 04:44:17 -07:00
Clint Wylie 72aba00e09
add json function support for paths with negative array indexes (#12972) 2022-08-25 17:11:28 -07:00
Clint Wylie 82ad927087
tighten up array handling, fix bug with array_slice output type inference (#12914) 2022-08-25 00:48:49 -07:00
Karan Kumar f7c6316992
Setting useNativeQueryExplain to true (#12936)
* Setting useNativeQueryExplain to true

* Update docs/querying/sql-query-context.md

Co-authored-by: Santosh Pingale <pingalesantosh@gmail.com>

* Fixing tests

* Fixing broken tests

Co-authored-by: Santosh Pingale <pingalesantosh@gmail.com>
2022-08-24 17:39:55 +05:30
Clint Wylie 289e43281e
stricter behavior for parse_json, add try_parse_json, remove to_json (#12920) 2022-08-22 18:41:07 -07:00
Rohan Garg 3c129f6728
Add sql planning time metric (#12923) 2022-08-22 11:09:44 +05:30
Paul Rogers eb902375a2
Light refactor of the heavily refactored statement classes (#12909)
Reflects lessons learned from working with consumers of
the new code.
2022-08-19 02:31:06 +05:30
Gian Merlino d3015d0f8e
DruidQuery: Return a copy from withScanSignatureIfNeeded, as promised. (#12906)
The method wasn't following its contract, leading to pollution of the
overall planner context, when really we just want to create a new
context for a specific query.
2022-08-16 13:23:14 -07:00
Gian Merlino 6c5a43106a
SQL: Morph QueryMakerFactory into SqlEngine. (#12897)
* SQL: Morph QueryMakerFactory into SqlEngine.

Groundwork for introducing an indexing-service-task-based SQL engine
under the umbrella of #12262. Also includes some other changes related
to improving error behavior.

Main changes:

1) Elevate the QueryMakerFactory interface (an extension point that allows
   customization of how queries are made) into SqlEngine. SQL engines
   can influence planner behavior through EngineFeatures, and can fully
   control the mechanics of query execution using QueryMakers.

2) Remove the server-wide QueryMakerFactory choice, in favor of the choice
   being made by the SQL entrypoint. The indexing-service-task-based
   SQL engine would be associated with its own entrypoint, like
   /druid/v2/sql/task.

Other changes:

1) Adjust DruidPlanner to try either DRUID or BINDABLE convention based
   on analysis of the planned rels; never try both. In particular, we
   no longer try BINDABLE when DRUID fails. This simplifies the logic
   and improves error messages.

2) Adjust error message "Cannot build plan for query" to omit the SQL
   query text. Useful because the text can be quite long, which makes it
   easy to miss the text about the problem.

3) Add a feature to block context parameters used internally by the SQL
   planner from being supplied by end users.

4) Add a feature to enable adding row signature to the context for
   Scan queries. This is useful in building the task-based engine.

5) Add saffron.properties file that turns off sets and graphviz dumps
   in "cannot plan" errors. Significantly reduces log spam on the Broker.

* Fixes from CI.

* Changes from review.

* Can vectorize, now that join-to-filter is on by default.

* Checkstyle! And variable renames!

* Remove throws from test.
2022-08-14 23:31:19 -07:00
Paul Rogers 41712b7a3a
Refactor SqlLifecycle into statement classes (#12845)
* Refactor SqlLifecycle into statement classes

Create direct & prepared statements
Remove redundant exceptions from tests
Tidy up Calcite query tests
Make PlannerConfig more testable

* Build fixes

* Added builder to SqlQueryPlus

* Moved Calcites system properties to saffron.properties

* Build fix

* Resolve merge conflict

* Fix IntelliJ inspection issue

* Revisions from reviews

Backed out a revision to Calcite tests that didn't work out as planned

* Build fix

* Fixed spelling errors

* Fixed failed test

Prepare now enforces security; before it did not.

* Rebase and fix IntelliJ inspections issue

* Clean up exception handling

* Fix handling of JDBC auth errors

* Build fix

* More tweaks to security messages
2022-08-14 00:44:08 -07:00
Rohan Garg 5394838030
Enable conversion of join to filter by default (#12868) 2022-08-13 20:37:43 +05:30
Gian Merlino 836430019a
Add EXTERNAL resource type. (#12896)
This is used to control access to the EXTERN function, which allows
reading external data in SQL. The EXTERN function is not usable in
production as of today, but it is used by the task-based SQL engine
contemplated in #12262.
2022-08-12 10:57:30 -07:00
Paul Rogers 8ad8582dc8
Refactor DruidSchema & DruidTable (#12835)
Refactors the DruidSchema and DruidTable abstractions to prepare for the Druid Catalog.

As we add the catalog, we’ll want to combine physical segment metadata information with “hints” provided by the catalog. This is best done if we tidy up the existing code to more clearly separate responsibilities.

This PR is purely a refactoring move: no functionality changed. There is no difference to user functionality or external APIs. Functionality changes will come later as we add the catalog itself.

DruidSchema
In the present code, DruidSchema does three tasks:

Holds the segment metadata cache
Interfaces with an external schema manager
Acts as a schema to Calcite
This PR splits those responsibilities.

DruidSchema holds the Calcite schema for the druid namespace, combining information fro the segment metadata cache, from the external schema manager and (later) from the catalog.
SegmentMetadataCache holds the segment metadata cache formerly in DruidSchema.
DruidTable
The present DruidTable class is a bit of a kitchen sink: it holds all the various kinds of tables which Druid supports, and uses if-statements to handle behavior that differs between types. Yet, any given DruidTable will handle only one such table type. To more clearly model the actual table types, we split DruidTable into several classes:

DruidTable becomes an abstract base class to hold Druid-specific methods.
DatasourceTable represents a datasource.
ExternalTable represents an external table, such as from EXTERN or (later) from the catalog.
InlineTable represents the internal case in which we attach data directly to a table.
LookupTable represents Druid’s lookup table mechanism.
The new subclasses are more focused: they can be selective about the data they hold and the various predicates since they represent just one table type. This will be important as the catalog information will differ depending on table type and the new structure makes adding that logic cleaner.

DatasourceMetadata
Previously, the DruidSchema segment cache would work with DruidTable objects. With the catalog, we need a layer between the segment metadata and the table as presented to Calcite. To fix this, the new SegmentMetadataCache class uses a new DatasourceMetadata class as its cache entry to hold only the “physical” segment metadata information: it is up to the DruidTable to combine this with the catalog information in a later PR.

More Efficient Table Resolution
Calcite provides a convenient base class for schema objects: AbstractSchema. However, this class is a bit too convenient: all we have to do is provide a map of tables and Calcite does the rest. This means that, to resolve any single datasource, say, foo, we need to cache segment metadata, external schema information, and catalog information for all tables. Just so Calcite can do a map lookup.

There is nothing special about AbstractSchema. We can handle table lookups ourselves. The new AbstractTableSchema does this. In fact, all the rest of Calcite wants is to resolve individual tables by name, and, for commands we don’t use, to provide a list of table names.

DruidSchema now extends AbstractTableSchema. SegmentMetadataCache resolves individual tables (and provides table names.)

DruidSchemaManager
DruidSchemaManager provides a way to specify table schemas externally. In this sense, it is similar to the catalog, but only for datasources. It originally followed the AbstractSchema pattern: it implements provide a map of tables. This PR provides new optional methods for the table lookup and table names operations. The default implementations work the same way that AbstractSchema works: we get the entire map and pick out the information we need. Extensions that use this API should be revised to support the individual operations instead. Druid code no longer calls the original getTables() method.

The PR has one breaking change: since the DruidSchemaManager map is read-only to the rest of Druid, we should return a Map, not a ConcurrentMap.
2022-08-10 10:24:04 +05:30
Clint Wylie ee41cc770f
fix issue with SQL sum aggregator due to bug with DruidTypeSystem and AggregateRemoveRule (#12880)
* fix issue with SQL sum aggregator due to bug with DruidTypeSystem and AggregateRemoveRule

* fix style

* add comment about using custom sum function
2022-08-09 15:17:45 -07:00
Gian Merlino 01d555e47b
Adjust "in" filter null behavior to match "selector". (#12863)
* Adjust "in" filter null behavior to match "selector".

Now, both of them match numeric nulls if constructed with a "null" value.

This is consistent as far as native execution goes, but doesn't match
the behavior of SQL = and IN. So, to address that, this patch also
updates the docs to clarify that the native filters do match nulls.

This patch also updates the SQL docs to describe how Boolean logic is
handled in addition to how NULL values are handled.

Fixes #12856.

* Fix test.
2022-08-08 09:08:36 -07:00
Paul Rogers a618458bf0
Tidy up construction of the Guice Injectors (#12816)
* Refactor Guice initialization

Builders for various module collections
Revise the extensions loader
Injector builders for server startup
Move Hadoop init to indexer
Clean up server node role filtering
Calcite test injector builder

* Revisions from review comments

* Build fixes

* Revisions from review comments
2022-08-04 00:05:07 -07:00
Clint Wylie 623b075d12
fix nested column sql operator return type inference (#12851)
* fix nested column sql operator return type inference

* oops, final
2022-08-03 15:39:08 -07:00
Clint Wylie 6981b1cc12
fix bugs with nested column jsonpath parser (#12831) 2022-08-02 11:38:25 -07:00
Clint Wylie 189e8b9d18
add NumericRangeIndex interface and BoundFilter support (#12830)
add NumericRangeIndex interface and BoundFilter support
changes:
* NumericRangeIndex interface, like LexicographicalRangeIndex but for numbers
* BoundFilter now uses NumericRangeIndex if comparator is numeric and there is no extractionFn
* NestedFieldLiteralColumnIndexSupplier.java now supports supplying NumericRangeIndex for single typed numeric nested literal columns

* better faster stronger and (ever so slightly) more understandable

* more tests, fix bug

* fix style
2022-07-29 18:58:49 -07:00
Paul Rogers d52abe7b38
Today is that day - Single pass through Calcite planner (#12636)
* Druid planner now makes only one pass through Calcite planner

Resolves the issue that required two parse/plan cycles: one
for validate, another for plan. Creates a clone of the Calcite
planner and validator to resolve the conflict that prevented
the merger.
2022-07-29 18:53:21 -07:00
Paul Rogers a8b155e9c6
Fixes for the Avatica JDBC driver (#12709)
* Fixes for the Avatica JDBC driver

Correctly implement regular and prepared statements
Correctly implement result sets
Fix race condition with contexts
Clarify when parameters are used
Prepare for single-pass through the planner

* Addressed review comments

* Addressed review comment
2022-07-27 15:22:40 -07:00
Laksh Singla 2e616e633a
Determine type of `__time` column by RowSignature in case of External Datasource (#12770)
Some queries like `REPLACE INTO ... SELECT TIME_PARSE("__time") AS __time FROM ...`
fail at the Calcite layer because any column with name `__time` is considered to be of
type `SqlTypeName.TIMESTAMP`.

Changes:
- Modify `RowSignatures.toRelDataType()` so that the type of `__time` column
  is determined by the RowSignature's type.
2022-07-26 12:09:40 +05:30
Maytas Monsereenusorn 3bf1e699ff
GREATEST/LEAST function is incorrectly specifying that it cannot return null (#12804) 2022-07-20 14:41:24 +05:30
Adarsh Sanjeev f3272a25f9
Add check for sqlOuterLimit to ingest queries (#12799)
* Add check for sqlOuterLimit to ingest queries

* Fix checkstyle

* Add comment
2022-07-19 09:02:43 -07:00
Paul Rogers ee15c238cc
Clone Calcite planner to access validator (#12708)
Done in preparation for the "single-pass" planner.
2022-07-14 18:10:33 -07:00
Clint Wylie 05b2e967ed
druid nested data column type (#12753)
* add new druid nested data column type

* fixes and such

* fixes

* adjustments, more tests

* self review

* oops

* fix and test

* more better

* style
2022-07-14 12:07:23 -07:00
Rohan Garg bb953be09b
Refactor usage of JoinableFactoryWrapper + more test coverage (#12767)
Refactor usage of JoinableFactoryWrapper to add e2e test for createSegmentMapFn with joinToFilter feature enabled
2022-07-12 06:25:36 -07:00
Gian Merlino 97207cdcc7
Automatic sizing for GroupBy dictionaries. (#12763)
* Automatic sizing for GroupBy dictionary sizes.

Merging and selector dictionary sizes currently both default to 100MB.
This is not optimal, because it can lead to OOM on small servers and
insufficient resource utilization on larger servers. It also invites
end users to try to tune it when queries run out of dictionary space,
which can make things worse if the end user sets it to too high.

So, this patch:

- Adds automatic tuning for selector and merge dictionaries. Selectors
  use up to 15% of the heap and merge buffers use up to 30% of the heap
  (aggregate across all queries).

- Updates out-of-memory error messages to emphasize enabling disk
  spilling vs. increasing memory parameters. With the memory parameters
  automatically sized, it is more likely that an end user will get
  benefit from enabling disk spilling.

- Removes the query context parameters that allow lowering of configured
  dictionary sizes. These complicate the calculation, and I don't see a
  reasonable use case for them.

* Adjust tests.

* Review adjustments.

* Additional comment.

* Remove unused import.
2022-07-11 08:20:50 -07:00
Gian Merlino edfbcc8455
Preserve column order in DruidSchema, SegmentMetadataQuery. (#12754)
* Preserve column order in DruidSchema, SegmentMetadataQuery.

Instead of putting columns in alphabetical order. This is helpful
because it makes query order better match ingestion order. It also
allows tools, like the reindexing flow in the web console, to more
easily do follow-on ingestions using a column order that matches the
pre-existing column order.

We prefer the order from the latest segments. The logic takes all
columns from the latest segments in the order they appear, then adds
on columns from older segments after those.

* Additional test adjustments.

* Adjust imports.
2022-07-08 22:04:11 -07:00
Gian Merlino 9c925b4f09
Frame format for data transfer and short-term storage. (#12745)
* Frame format for data transfer and short-term storage.

As we move towards query execution plans that involve more transfer
of data between servers, it's important to have a data format that
provides for doing this more efficiently than the options available to
us today.

This patch adds:

- Columnar frames, which support fast querying.
- Row-based frames, which support fast sorting via memory comparison
  and fast whole-row copies via memory copying.
- Frame files, a container format that can be stored on disk or
  transferred between servers.

The idea is we should use row-based frames when data is expected to
be sorted, and columnar frames when data is expected to be queried.

The code in this patch is not used in production yet. Therefore, the
patch involves minimal changes outside of the org.apache.druid.frame
package.  The main ones are adjustments to SqlBenchmark to add benchmarks
for queries on frames, and the addition of a "forEach" method to Sequence.

* Fixes based on tests, static analysis.

* Additional fixes.

* Skip DS mapping tests on JDK 14+

* Better JDK checking in tests.

* Fix imports.

* Additional comment.

* Adjustments from code review.

* Update test case.
2022-07-08 20:42:06 -07:00
Rohan Garg d732de9948
Allow adding calcite rules from extensions (#12715)
* Allow adding calcite rules from extensions

* fixup! Allow adding calcite rules from extensions

* Move Rules to CalciteRulesManager

* fixup! Move Rules to CalciteRulesManager
2022-07-06 19:32:35 +05:30
Didip Kerabat 06251c5d2a
Add EIGHT_HOUR into possible list of Granularities. (#12717)
* Add EIGHT_HOUR into possible list of Granularities.

* Add the missing definition.

* fix test.

* Fix another test.

* Stylecheck finally passed.

Co-authored-by: Didip Kerabat <didip@apple.com>
2022-07-05 11:05:37 -07:00
Clint Wylie bbbb6e1c3f
fix DruidSchema issue where datasources with no segments can become stuck in tables list indefinitely (#12727) 2022-07-01 18:54:01 -07:00
Clint Wylie 48731710fb
precursor changes for nested columns to minimize files changed (#12714)
* precursor changes for nested columns to minimize files changed

* inspection fix

* visibility

* adjustment

* unecessary change
2022-07-01 02:27:19 -07:00
Clint Wylie d30efb1c1e
fix bug when rewriting sql virtual column registry (#12718) 2022-07-01 02:24:00 -07:00
Tejaswini Bandlamudi 1fc2f6e4b0
Throw BadQueryContextException if context params cannot be parsed (#12680) 2022-06-24 09:21:25 +05:30
Paul Rogers ffcb996468
Cleanup changes pulled out of PR #12368 (#12672)
This commit contains the cleanup needed for the new integration test framework.

Changes:
- Fix log lines, misspellings, docs, etc.
- Allow the use of some of Druid's "JSON config" objects in tests
- Fix minor bug in `BaseNodeRoleWatcher`
2022-06-23 23:19:50 +05:30
Kashif Faraz b6f8d7a1b3
Add query context param `forceExpressionVirtualColumns` to always use "expression"-type virtual columns in query plan (#12583)
SQL expressions such as those containing `MV_FILTER_ONLY` and `MV_FILTER_NONE`
are planned as specialized virtual columns instead of the default `expression`-type virtual columns.
This commit adds a new context parameter to force the `expression`-type virtual columns.

Changes
- Add query context param `forceExpressionVirtualColumns`
- Use context param to determine if specialized virtual columns should be used or not
- Moved some tests into `CalciteExplainQueryTest`
2022-06-22 15:33:50 +05:30
Gian Merlino 0099940808
Add TIME_IN_INTERVAL SQL operator. (#12662)
* Add TIME_IN_INTERVAL SQL operator.

The operator is implemented as a convertlet rather than an
OperatorConversion, because this allows it to be equivalent to using
the >= and < operators directly.

* SqlParserPos cannot be null here.

* Remove unused import.

* Doc updates.

* Add words to dictionary.
2022-06-21 13:05:37 -07:00
Gian Merlino 818974f6e4
ScanQuery: Fix JsonIgnore for isLegacy. (#12674)
True, false, and null have different meanings: true/false mean "legacy"
and "not legacy"; null means use the default set by ScanQueryConfig.
So, we need to respect this in the JsonIgnore setup.
2022-06-18 15:55:54 -07:00
Paul Rogers 893759de91
Remove null and empty fields from native queries (#12634)
* Remove null and empty fields from native queries

* Test fixes

* Attempted IT fix.

* Revisions from review comments

* Build fixes resulting from changes suggested by reviews

* IT fix for changed segment size
2022-06-16 14:07:25 -07:00
TSFenwick a3603ad6b0
Use DefaultQueryConfig in SqlLifecycle to correctly populate request logs (#12613)
Fixes an issue where sql query request logs do not include the default query context
values set via `druid.query.default.context.xyz` runtime properties.

# Change summary
* Inject `DefaultQueryConfig` into `SqlLifecycleFactory`
* Add params from `DefaultQueryConfig` to the query context in `SqlLifecycle`

# Description
- This change does not affect query execution. This is because the
  `DefaultQueryConfig` was already being used in `QueryLifecycle`,
   which is initialized when the SQL is translated to a native query. 
- This also handles any potential use case where a context parameter should be
   handled at the SQL stage itself.
2022-06-08 12:52:50 +05:30
Laksh Singla 81c37c6515
Add validation for invalid partitioned by granularities (#12589)
* Add validation for invalid partitioned by granularities

* review comments

* improve error message, change location of the method

* remove imports

* use StringUtils.lowercase

Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
2022-06-06 22:00:29 +05:30
Adarsh Sanjeev 5a283964ca
Improve SQL validation error messages (#12611)
Update the SQL validation error message to specify whether
the ingest is INSERT or REPLACE for better user experience.
2022-06-06 16:14:28 +05:30
Clint Wylie dc0fdfec67
fix test comment (#12584) 2022-05-31 12:39:20 -07:00
Gian Merlino 02ae3e74ff
RowBasedColumnSelectorFactory: Add "useStringValueOfNullInLists" parameter. (#12578)
RowBasedColumnSelectorFactory inherited strange behavior from
Rows.objectToStrings for nulls that appear in lists: instead of being
left as a null, it is replaced with the string "null". Some callers may
need compatibility with this strange behavior, but it should be opt-in.

Query-time call sites are changed to opt-out of this behavior, since it
is not consistent with query-time expectations. The IncrementalIndex
ingestion-time call site retains the old behavior, as this is traditionally
when Rows.objectToStrings would be used.
2022-05-31 11:38:56 -07:00
Clint Wylie b746bf9129
fix virtual column cycle bug, sql virtual column optimize bug (#12576)
* fix virtual column cycle bug, sql virtual column optimize bug

* more test
2022-05-30 23:51:21 -07:00
Karan Kumar 9f9faeec81
object[] handling for DimensionHandlers for arrays (#12552)
Description
Fixes a bug when running q's like

 SELECT cntarray,
       Count(*)
FROM   (SELECT dim1,
               dim2,
               Array_agg(cnt) AS cntarray
        FROM   (SELECT dim1,
                       dim2,
                       dim3,
                       Count(*) AS cnt
                FROM   foo
                GROUP  BY 1,
                          2,
                          3)
        GROUP  BY 1,
                  2)
GROUP  BY 1  
This generates an error:

org.apache.druid.java.util.common.ISE: Unable to convert type [Ljava.lang.Object; to org.apache.druid.segment.data.ComparableList
        at org.apache.druid.segment.DimensionHandlerUtils.convertToList(DimensionHandlerUtils.java:405) ~[druid-xx]
Because it's an array of numbers it looks like it does the convertToList call, which looks like:

  @Nullable
  public static ComparableList convertToList(Object obj)
  {
    if (obj == null) {
      return null;
    }
    if (obj instanceof List) {
      return new ComparableList((List) obj);
    }
    if (obj instanceof ComparableList) {
      return (ComparableList) obj;
    }
    throw new ISE("Unable to convert type %s to %s", obj.getClass().getName(), ComparableList.class.getName());
  }
I.e. it doesn't know about arrays. Added the array handling as part of this PR.
2022-05-25 15:24:18 +05:30
Adarsh Sanjeev 5063eca5b9
Add error message for incorrectly ordered clause in sql (#12558)
In the case that the clustered by is before the partitioned by for an sql query, the error message is a bit confusing.

insert into foo select * from bar clustered by dim1 partitioned by all

Error: SQL parse failed

Encountered "PARTITIONED" at line 1, column 88.

Was expecting one of: <EOF> "," ... "ASC" ... "DESC" ... "NULLS" ... "." ... "NOT" ... "IN" ... "<" ... "<=" ... ">" ... ">=" ... "=" ... "<>" ... "!=" ... "BETWEEN" ... "LIKE" ... "SIMILAR" ... "+" ... "-" ... "*" ... "/" ... "%" ... "||" ... "AND" ... "OR" ... "IS" ... "MEMBER" ... "SUBMULTISET" ... "CONTAINS" ... "OVERLAPS" ... "EQUALS" ... "PRECEDES" ... "SUCCEEDS" ... "IMMEDIATELY" ... "MULTISET" ... "[" ... "FORMAT" ... "(" ... Less...

org.apache.calcite.sql.parser.SqlParseException
This is a bit confusing and adding a check could be added to throw a more user friendly message stating that the order should be reversed.

Add error message for incorrectly ordered clause in sql.
2022-05-23 12:41:18 +05:30
Gian Merlino 69aac6c8dd
Direct UTF-8 access for "in" filters. (#12517)
* Direct UTF-8 access for "in" filters.

Directly related:

1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
   ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
   necessary, the input set is copied into a ValuesSet. Much logic is
   simplified, because we always know what type the values set will be.
   I think that there won't even be an efficiency loss in most cases.
   InDimFilter is most frequently created by deserialization, and this
   patch updates the JsonCreator constructor to deserialize
   directly into a ValuesSet.

2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
   during index lookups.

3) Add unsigned comparator to ByteBufferUtils and use it in
   GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
   bytes can be compared as bytes if, and only if, the comparison
   is unsigned.

4) Add specialization to GenericIndexed.singleThreaded().indexOf that
   avoids needless ByteBuffer allocations.

5) Clarify that objects returned by ColumnIndexSupplier.as are not
   thread-safe. DictionaryEncodedStringIndexSupplier now calls
   singleThreaded() on all relevant GenericIndexed objects, saving
   a ByteBuffer allocation per access.

Also:

1) Fix performance regression in LikeFilter: since #12315, it applied
   the suffix matcher to all values in range even for type MATCH_ALL.

2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
   which was broken due to calls to strategy.compare in
   GenericIndexed.fromIterable.

* Add like-filter implementation tests.

* Add in-filter implementation tests.

* Add tests, fix issues.

* Fix style.

* Adjustments from review.
2022-05-20 01:51:28 -07:00
Gian Merlino 65a1375b67
SQL: Add is_active to sys.segments, update examples and docs. (#11550)
* SQL: Add is_active to sys.segments, update examples and docs.

is_active is short for:

  (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1

It's important because this represents "all the segments that should
be queryable, whether or not they actually are right now". Most of the
time, this is the set of segments that people will want to look at.

The web console already adds this filter to a lot of its queries,
proving its usefulness.

This patch also reworks the caveat at the bottom of the sys.segments
section, so its information is mixed into the description of each result
field. This should make it more likely for people to see the information.

* Wording updates.

* Adjustments for spellcheck.

* Adjust IT.
2022-05-19 14:23:28 -07:00
Adarsh Sanjeev fcb1c0b7bf
Add cluster by support for replace syntax (#12524)
* Add cluster by support for replace syntax

* Add unit test for with list
2022-05-17 15:15:29 +05:30
Adarsh Sanjeev 0fd4f1e386
Improve error messages from SQL REPLACE syntax (#12523)
- Add user friendly error messages for missing or incorrect OVERWRITE clause for REPLACE SQL query
- Move validation of missing OVERWRITE clause at code level instead of parser for custom error message
2022-05-17 09:55:58 +05:30
Adarsh Sanjeev 39b3487aa9
Add replace statement to sql parser (#12386)
Relevant Issue: #11929

- Add custom replace statement to Druid SQL parser.
- Edit DruidPlanner to convert relevant fields to Query Context.
- Refactor common code with INSERT statements to reuse them for REPLACE where possible.
2022-05-13 10:56:40 +05:30
Clint Wylie 9e5a940cf1
remake column indexes and query processing of filters (#12388)
Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.
2022-05-11 11:57:08 +05:30
Rohan Garg 75836a5a06
Add feature flag for sql planning of TimeBoundary queries (#12491)
* Add feature flag for sql planning of TimeBoundary queries

* fixup! Add feature flag for sql planning of TimeBoundary queries

* Add documentation for enableTimeBoundaryPlanning

* fixup! Add documentation for enableTimeBoundaryPlanning
2022-05-10 15:23:42 +05:30
somu-imply c68388ebcd
Vectorized version of string last aggregator (#12493)
* Vectorized version of string last aggregator

* Updating string last and adding testcases

* Updating code and adding testcases for serializable pairs

* Addressing review comments
2022-05-09 17:02:38 -07:00
Gian Merlino a2bad0b3a2
Reduce allocations due to Jackson serialization. (#12468)
* Reduce allocations due to Jackson serialization.

This patch attacks two sources of allocations during Jackson
serialization:

1) ObjectMapper.writeValue and JsonGenerator.writeObject create a new
   DefaultSerializerProvider instance for each call. It has lots of
   fields and creates pressure on the garbage collector. So, this patch
   adds helper functions in JacksonUtils that enable reuse of
   SerializerProvider objects and updates various call sites to make
   use of this.

2) GroupByQueryToolChest copies the ObjectMapper for every query to
   install a special module that supports backwards compatibility with
   map-based rows. This isn't needed if resultAsArray is set and
   all servers are running Druid 0.16.0 or later. This release was a
   while ago. So, this patch disables backwards compatibility by default,
   which eliminates the need to copy the heavyweight ObjectMapper. The
   patch also introduces a configuration option that allows admins to
   explicitly enable backwards compatibility.

* Add test.

* Update additional call sites and add to forbidden APIs.
2022-04-27 14:17:26 -07:00
Gian Merlino 2e42d04038
SQL: Create millisecond precision timestamp literals. (#12407)
* SQL: Create millisecond precision timestamp literals.

Fixes a bug where implicit casts of strings to timestamps would use seconds
precision rather than milliseconds. The new test case
testCountStarWithBetweenTimeFilterUsingMillisecondsInStringLiterals
exercises this.

* Update sql/src/main/java/org/apache/druid/sql/calcite/planner/Calcites.java

Co-authored-by: Frank Chen <frankchen@apache.org>

* Correct precision handling.

- Set default precision to 3 (millis) for things involving timestamps.
- Respect precision specified in types when available.

* Silence, checkstyle.

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-04-27 14:17:07 -07:00
Abhishek Agarwal 2fe053c5cb
Bump up the versions (#12480) 2022-04-27 14:28:20 +05:30
Adarsh Sanjeev 1306965c9e
Validate select columns for insert statement (#12431)
Unnamed columns in the select part of insert SQL statements currently create a table with the column name such as "EXPR$3". This PR adds a check for this.
2022-04-27 12:25:49 +05:30
somu-imply 027935dcff
Vectorize numeric latest aggregators (#12439)
* Vectorizing Latest aggregator Part 1

* Updating benchmark tests

* Changing appropriate logic for vectors for null handling

* Introducing an abstract class and moving the commonalities there

* Adding vectorization for StringLast aggregator (initial version)

* Updated bufferized version of numeric aggregators

* Adding some javadocs

* Making sure this PR vectorizes numeric latest agg only

* Adding another benchmarking test

* Fixing intellij inspections

* Adding tests for double

* Adding test cases for long and float

* Updating testcases

* Checkstyle oops..

* One tiny change in test case

* Fixing spotbug and rhs not being used
2022-04-26 11:33:08 -07:00
Rohan Garg 95694b5afa
Convert simple min/max SQL queries on __time to timeBoundary queries (#12472)
* Support array based results in timeBoundary query

* Fix bug with query interval in timeBoundary

* Convert min(__time) and max(__time) SQL queries to timeBoundary

* Add tests for timeBoundary backed SQL queries

* Fix query plans for existing tests

* fixup! Convert min(__time) and max(__time) SQL queries to timeBoundary

* fixup! Add tests for timeBoundary backed SQL queries

* fixup! Fix bug with query interval in timeBoundary
2022-04-25 08:18:58 -07:00
Jihoon Son 73ce5df22d
Add support for authorizing query context params (#12396)
The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below.

Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params.
User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters.
System context params. They are set by the Druid query engine during query processing. These params override other context params.
Today, any context params are allowed to users. This can cause 
1) a bad UX if the context param is not matured yet or 
2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows.

This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission.

{
  "resourceAction" : {
    "resource" : {
      "name" : "maxSubqueryRows",
      "type" : "QUERY_CONTEXT"
    },
    "action" : "WRITE"
  },
  "resourceNamePattern" : "maxSubqueryRows"
}
Each role can have multiple permissions for context params. Each permission should be set for different context params.

When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case,

HTTP endpoints will return 403 response code.
JDBC will throw ForbiddenException.
Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService.

The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.
2022-04-21 14:21:16 +05:30
somu-imply 2db02876cf
Updating an error msg (#12450)
* Updating an error msg

* Added an extra [] so removing it
2022-04-20 07:56:09 -07:00
somu-imply cd6fba2f6c
Handling planning with alias for time for group by and order by (#12418)
An outer scan query, that requires ordering on a column, should be considered an invalid query.
2022-04-15 10:29:17 +05:30
Adarsh Sanjeev b74cb7624d
Make error messages for insert statements consistent with select statements (#12414)
For a query like
INSERT INTO tablename SELECT channel, added as count FROM wikipedia the error message is Encountered "as count". However, for the insert statement
INSERT INTO t SELECT channel, added as count FROM wikipedia PARTITIONED BY ALL
returns INSERT statements must specify PARTITIONED BY clause explictly (incorrectly). This PR corrects this.

Add EOF to end of Druid SQL Insert statements
Rename SQL Insert statements in the parser to reflect the behaviour change
2022-04-09 12:21:40 +05:30
Paul Rogers 2cc2088720
Method to specify eternity in the scan query builder (#12223)
* Method to specify eternity in the scan query builder

* Fix checkstyle issue

* Renamed eterity() to eternityInterval()

* Minor fixes
2022-04-04 15:11:32 -07:00
Adarsh Sanjeev ef45a1551e
Convert inQueryThreshold into query context parameter. (#12357)
Added Calcites InQueryThreshold as a query context parameter. Setting this parameter appropriately reduces the time taken for queries with large number of values in their IN conditions.
2022-03-22 18:33:57 +05:30
Gian Merlino cb2b2b696d
Fix error message for groupByEnableMultiValueUnnesting. (#12325)
* Fix error message for groupByEnableMultiValueUnnesting.

It referred to the incorrect context parameter.

Also, create a dedicated exception class, to allow easier detection of this
specific error.

* Fix other test.

* More better error messages.

* Test getDimensionName method.
2022-03-10 11:37:24 -08:00
Rohan Garg 9f6a930462
Fix join query incase of filter explosion during CNF conversion (#12324) 2022-03-09 12:43:09 -08:00
Clint Wylie 1c004ea47e
use virtual columns for sql simple aggregators instead of inline expressions (#12251)
* use virtual columns for sql simple aggregators instead of inline expressions

* fixes

* always use virtual columns

* add more tests
2022-03-03 15:05:28 -08:00
Xavier Léauté 1434197ee1
update airline dependency to 2.x (#12270)
* upgrade Airline to Airline 2
  https://github.com/airlift/airline is no longer maintained, updating to
  https://github.com/rvesse/airline (Airline 2) to use an actively
  maintained version, while minimizing breaking changes.

  Note, this is a backwards incompatible change, and extensions relying on
  the CliCommandCreator extension point will also need to be updated.

* fix dependency checks where jakarta.inject is now resolved first instead
  of javax.inject, due to Airline 2 using jakarta
2022-02-27 15:19:28 -08:00
Jihoon Son e5ad862665
A new includeAllDimension flag for dimensionsSpec (#12276)
* includeAllDimensions in dimensionsSpec

* doc

* address comments

* unused import and doc spelling
2022-02-25 18:27:48 -08:00
Jonathan Wei b1640a72ee
Re-enable segment metadata cache when using external schema (#12264) 2022-02-22 19:50:29 -06:00
Karan Kumar 5794331eb1
Adding new config for disabling group by on multiValue column (#12253)
As part of #12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key.
Added a config groupByEnableMultiValueUnnesting which can be set in the query context.

The default value of groupByEnableMultiValueUnnesting is true, therefore it does not change the current engine behavior.
If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.
2022-02-16 20:53:26 +05:30
Laksh Singla 8fc0e5c95c
Explain plan for custom insert syntax (#12243)
* Initial commit, explain plan for custom insert syntax working

* Cleanup separate SqlInsert handling
2022-02-15 21:48:34 -08:00
somu-imply eae163a797
Moving in filter check to broker (#12195)
* Moving in filter check to broker

* Adding more unit tests, making error message meaningful

* Spelling and doc changes

* Updating default to -1 and making this feature hide by default. The number of IN filters can grow upto a max limit of 100

* Removing upper limit of 100, updated docs

* Making documentation more meaningful

* Moving check outside to PlannerConfig, updating test cases and adding back max limit

* Updated with some additional code comments

* Missed removing one line during the checkin

* Addressing doc changes and one forbidden API correction

* Final doc change

* Adding a speling exception, correcting a testcase

* Reading entire filter tree to address combinations of ANDs and ORs

* Specifying in docs that, this case works only for ORs

* Revert "Reading entire filter tree to address combinations of ANDs and ORs"

This reverts commit 81ca8f8496.

* Covering a class cast exception and updating docs

* Counting changed

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2022-02-15 20:45:07 -08:00
somu-imply 033989eb1d
Adding vectorized time_shift (#12254)
* Adding vectorized time_shift

* Vectorize time shift, addressing review comments

* Remove an unused import
2022-02-11 14:44:52 -08:00
Laksh Singla 5bd646e10a
Surface a user friendly error when PARTITIONED BY is omitted (#12246)
#12163 makes PARTITIONED BY a required clause in INSERT queries. While this is required, if a user accidentally omits the clause, it emits a JavaCC/Calcite error, since it's syntactically incorrect. The error message is cryptic. Since it's a custom clause, this PR aims to make the clause optional on the syntactic side, but move the validation to DruidSqlInsert where we can surface a friendlier error.
2022-02-11 11:49:00 +05:30
Clint Wylie 3ee66bb492
allow optimizing sql expressions and virtual columns (#12241)
* rework sql planner expression and virtual column handling

* simplify a bit

* add back and deprecate old methods, more tests, fix multi-value string coercion bug and associated tests

* spotbugs

* fix bugs with multi-value string array expression handling

* javadocs and adjust test

* better

* fix tests
2022-02-09 14:55:50 -08:00
Laksh Singla 4add2510ed
Add syntax support for PARTITIONED BY/CLUSTERED BY in INSERT queries (#12163)
This PR aims to add parser changes for supporting PARTITIONED BY and CLUSTERED BY as proposed in the issue #11929.
2022-02-08 16:23:15 +05:30
Clint Wylie ae71e05fc5
array_concat_agg and array_agg support for array inputs (#12226)
* array_concat_agg and array_agg support for array inputs
changes:
* added array_concat_agg to aggregate arrays into a single array
* added array_agg support for array inputs to make nested array
* added 'shouldAggregateNullInputs' and 'shouldCombineAggregateNullInputs' to fix a correctness issue with STRING_AGG and ARRAY_AGG when merging results, with dual purpose of being an optimization for aggregating

* fix test

* tie capabilities type to legacy mode flag about coercing arrays to strings

* oops

* better javadoc
2022-02-07 19:59:30 -08:00
Clint Wylie 8fd587b28c
remove duplicate Broker ServerInventoryView, improve HttpServerInventoryView logging (#12209)
* changes:
* remove SystemSchema duplicate ServerInventoryView in broker
* suppress duplicate segment added/removed warnings in HttpServerInventoryView when doing a full sync

* fixes
2022-02-03 12:57:34 -08:00
Maytas Monsereenusorn 3717693633
Fix java.lang.ClassCastException error when using useApproximateCountDistinct false for aggregation query (#12216)
* add imply

* add test

* add unit test

* add test
2022-02-03 12:01:13 -08:00
Clint Wylie f9b406c8f2
add backwards compatibility mode for multi-value string array null value coercion (#12210) 2022-01-31 22:38:15 -08:00
Abhishek Agarwal 1b8808cce8
Fix SQL queries for inline datasource with null values (#12092)
Fixes a bug because of which some SQL queries cannot be parsed using druid convention. Specifically, these queries translate to an inline datasource and have some null values. Calcite internally uses NULL as SQL type for these literals and that is not supported by the druid.
I am now allowing null column types to be returned while building RowSignature in org.apache.druid.sql.calcite.table.RowSignatures#fromRelDataType. RowSignature already allows null column type for any column. Doing so should also fix bindable queries such as select (1,2). When such queries are run with headers set to true, we get an exception in org.apache.druid.sql.http.ArrayWriter#writeHeader. This is again a similar exception to the one addressed in this PR. Because SQL type for the result column is RECORD and that doesn't have a corresponding columnType.
2022-01-27 18:04:12 +05:30
Karan Kumar 96b3498a40
Grouping on arrays as arrays (#12078)
* init multiValue column group by

* Changing sorting to Lexicographic as default

* Adding initial tests

* 1.Fixing test cases adding
2.Optimized inmem structs

* Linking SQL layer to native layer

* Adding multiDimension support to group by column strategy

* 1. Removing array coercion in Calcite layer
2. Removing ResultRowDeserializer

* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector

* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector

* 1. Checkstyle things
2. Removing flag

* Minor naming things

* CheckStyle Things

* Fixing test case

* Fixing hashing

* 1. Adding the MV function
2. Added few test cases

* 1. Adding MV function test cases

* Adding Selector strategy function test cases

* Fixing ClientQuerySegmentWalkerTest

* Adding GroupByQueryRunnerTest test cases

* Fixing test cases

* Adding few more test cases

* Fixing Exception asset statement and intellij inspection

* Adding null compatibility tests

* Review comments

* Fixing few failing tests

* Fixing few failing tests

* Do no convert to topN Q incase of group by on array

* Fixing checkstyle

* Fixing differences between jdk's class cast exception message

* 1. Fixing ordering if the grouping key is an array

* Fixing DefaultLimitSpec

* Fixing CalciteArraysQueryTest

* Dummy commit for LGTM

* changes:
* only coerce multi-value string null values when `ExpressionPlan.Trait.NEEDS_APPLIED` is set
* correct return type inference for ARRAY_APPEND,ARRAY_PREPEND,ARRAY_SLICE,ARRAY_CONCAT
* fix bug with ExprEval.ofType when actual type of object from binding doesn't match its claimed type

* Review comments

* Fixing test cases

* Fixing spot bugs

* Fixing strict compile

Co-authored-by: Clint Wylie <cwylie@apache.org>
2022-01-25 20:30:56 -08:00
somu-imply cc8b9c0b6e
Handling OOM error in ExpressionVector setup by reducing number of rows (#12186)
* Handling OOM error in ExpressionVector setup by reducing number of rows

* Removing row size to 10K in sanity tests
2022-01-24 08:37:13 -08:00
Laksh Singla dc1703d5f9
Change value of `druid.sql.planner.useGroupingSetForExactDistinct` in common.runtime.properties (#12182)
This PR changes the value of the property `druid.sql.planner.useGroupingSetForExactDistinct` from `false` to `true` in the runtime.properties files, so that newer installations have this property as `true`, while the default still remains as `false`.

The flag determines how queries which contain an aggregation over `DISTINCT` like `SELECT COUNT(DISTINCT foo.dim1) FILTER(WHERE foo.cnt = 1), SUM(foo.cnt) FROM druid.foo` get planned by Calcite. With the flag being set to false, it plans it via joins, whereas with it being set to true, the query is set using grouping sets.

There is a known issue with Calcite (https://github.com/apache/druid/issues/7953), where an NPE is thrown while planning the above query with joins. There is no such issue while planning the query using grouping sets.
2022-01-24 14:00:03 +05:30
Jihoon Son cc2ffc6c0f
Fix node discovery to ignore unknown DruidServices (#12157)
* Fix node discovery to ignore unknown DruidServices

* ignore all runtime exceptions

* fix test

* add custom deserializer

* custom serializer

* log host for unparseable druidService
2022-01-18 22:08:59 -08:00
Gian Merlino cf7191d2bc
Validate target dataSource for INSERT. (#12129) 2022-01-18 09:34:23 -08:00
Maytas Monsereenusorn bd7fe45da0
Support adding metrics in Auto Compaction (#12125)
* add impl

* add impl

* add unit tests

* add unit tests

* add unit tests

* add unit tests

* add unit tests

* add integration tests

* add integration tests

* fix LGTM

* fix test

* remove doc
2022-01-17 20:19:31 -08:00
Clint Wylie f2ce76966c
add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures less ambiguous (#12145)
* add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous

* switcheroo

* EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs

* revert unintended change

* fix docs

* fix docs better
2022-01-12 03:48:53 -08:00
Laksh Singla fae73800a7
Set plannerContext error when cannot query external datasources and when insert is not supported. (#12136)
This PR aims to add plannerContext.setPlanningError whenever external table scan rule is invoked, without the queryMaker having the ability to do so.
2022-01-12 15:11:17 +05:30
Rohan Garg 81f0aba6cb
Use ListFilteredVirtualColumn for left/fact table expression in join condition (#12127)
* Pass VirtualColumnRegistry in PlannerContext for join expression planning

* Allow for including VCs from join fact table expression

* Optmize MV_FILTER functions to use a VC when in join fact table expression

* fixup! Allow for including VCs from join fact table expression

* Address review comments
2022-01-11 14:47:13 -08:00
Laksh Singla 7c17341caa
Return empty result when a group by gets optimized to a timeseries query (#12065)
Related to #11188

The above mentioned PR allowed timeseries queries to return a default result, when queries of type: select count(*) from table where dim1="_not_present_dim_" were executed. Before the PR, it returned no row, after the PR, it would return a row with value of count(*) as 0 (as expected by SQL standards of different dbs).

In Grouping#applyProject, we can sometimes perform optimization of a groupBy query to a timeseries query if possible (when the keys of the groupBy are constants, as generated by automated tools). For example, in select count(*) from table where dim1="_present_dim_" group by "dummy_key", the groupBy clause can be removed. However, in the case when the filter doesn't return anything, i.e. select count(*) from table where dim1="_not_present_dim_" group by "dummy_key", the behavior of general databases would be to return nothing, while druid (due to above change) returns an empty row. This PR aims to fix this divergence of behavior.

Example cases:

select count(*) from table where dim1="_not_present_dim_" group by "dummy_key".
CURRENT: Returns a row with count(*) = 0
EXPECTED: Return no row

select 'A', dim1 from foo where m1 = 123123 and dim1 = '_not_present_again_' group by dim1
CURRENT: Returns a row with ('A', 'wat')
EXPECTED: Return no row

To do this, a boolean droppedDimensionsWhileApplyingProject has been added to Grouping which is true whenever we make changes to the original shape with optimization. Hence if a timeseries query has a grouping with this set to true, we set skipEmptyBuckets=true in the query context (i.e. donot return any row).
2022-01-07 21:53:48 +05:30
Jonathan Wei 9b598407c1
Add interface for external schema provider to Druid SQL (#12043)
* Add interfce for external schema provider to Druid SQL

* Add annotations
2021-12-22 22:17:57 +05:30
somu-imply 1871a1ab18
ARRAY_AGG and STRING_AGG will through errors if invoked on a complex datatype (#12089) 2021-12-21 17:41:04 -08:00
Laksh Singla 16642fb278
Fix incorrect type conversion in DruidLogicalValueRule (#11923)
DruidLogicalValuesRule while transforming to DruidRel can return incorrect values, if during the creation of the literal it was created from a float value. The BigDecimal representation stores 123.0, and it seems that using RexLiteral's method while conversion returns the inflated value (which is 1230). I am unsure if this is intentional from Calcite's perspective, and the actual change should be done somewhere else.

Extract the values of INT/LONG from the RexLiteral in the DruidLogicalValuesRule, via BigDecimal.longValue() method.
2021-12-15 10:44:35 +05:30
Clint Wylie 244c2559e9
fix IncrementalIndex performance regression (#12048)
changes:
* IncrementalIndex is now a ColumnInspector
* fixes performance regression from using map of ColumnCapabilities from IncrementalIndex as a RowSignature
2021-12-09 22:04:32 -08:00
Abhishek Agarwal 7abf847eae
Return 400 when SQL query cannot be planned (#12033)
In this PR, we will now return 400 instead of 500 when SQL query cannot be planned. I also fixed a bug where error messages were not getting sent to the users in case the rules throw UnsupportSQLQueryException.
2021-12-08 21:49:54 +05:30
Laksh Singla ca260dfef6
Intern RowSignature in DruidSchema to reduce its memory footprint (#12001)
DruidSchema consists of a concurrent HashMap of DataSource -> Segement -> AvailableSegmentMetadata. AvailableSegmentMetadata contains RowSignature of the segment, and for each segment, a new object is getting created. RowSignature is an immutable class, and hence it can be interned, and this can lead to huge savings of memory being used in broker, since a lot of the segments of a table would potentially have same RowSignature.
2021-12-08 15:11:13 +05:30
Clint Wylie 45be2be368
fix issues with multi-value string constant expressions (#12025)
* add specialized constant selector for multi-valued string constants
2021-12-08 00:10:26 -08:00
Abhishek Agarwal 834aae096a
Human-readable and actionable SQL error messages (#11911)
This PR does two things

1. It adds the capability to surface missing features in SQL to users - The calcite planner will explore through multiple rules to convert a logical SQL query to a druid native query. Some rules change the shape of the query itself, optimize it and some rules are responsible for translating the query into a druid native query. These are DruidQueryRule, DruidOuterQueryRule, DruidJoinRule, DruidUnionDataSourceRule, DruidUnionRule etc. These rules will look at SQL and will do the necessary transformation. But if the rule can't transform the query, it returns back the control to the calcite planner without recording why was it not able to transform. E.g. there is a join query with a non-equal join condition. DruidJoinRule will look at the condition, see that it is not supported, and return back the control. The reason can be that a query can be planned in many different ways so if one rule can't parse it, the query may still be parseable by other rules. In this PR, we are intercepting these gaps and passing them back to the user if the query could not be planned at all.

2. The said capability has been used to generate actionable errors for some common unsupported SQL features. However, not all possible errors are covered and we can keep adding more in the future.
2021-12-07 09:44:08 +05:30
Laksh Singla 44b2fb71ab
Fix the error case when there are multi top level unions (#12017)
This is a follow up to the PR #11908. This fixes the bug in top level union all queries when there are more than 2 SQL subqueries are present.
2021-12-07 01:12:02 +05:30
Jihoon Son 1f052b43c5
Better serverView exec name; remove SingleServerInventoryView (#11770)
Druid currently has 2 serverViews, regular serverView and filtered serverView. The regular serverView is used to monitor all segment announcements from all data nodes (historicals, tasks, indexers). The filtered serverView is used when you want to watch segment announcements from particular tiers. Since these server views keep track of different sets of druidServers and segments in memory, they should be maintained separately. However, they currently share the same name for their executorService, which can cause confusion and make debugging harder especially in the broker since it is using both serverViews, the filtered view for normal query processing and the regular view to serve the servers table (I'm unsure whether this is intended or whether this is a good behavior). This PR changes it to a more obvious name.

This PR also removes SingleServerInventoryView. This view was deprecated a long time ago and has not been documented at least since 0.13 (#6127). I also don't think this can be better in any case than BatchServerInventoryView. Finally, I merged AbstractCuratorServerInventoryView and BatchServerInventoryView as we no longer need AbstractCuratorServerInventoryView after SingleServerInventoryView is removed.
2021-12-04 18:43:05 +05:30
Gian Merlino e0e05aad99
Enhancements to IndexTaskClient. (#12011)
* Enhancements to IndexTaskClient.

1) Ability to use handlers other than StringFullResponseHandler. This
   functionality is not used in production code yet, but is useful
   because it will allow tasks to communicate with each other in
   non-string-based formats and in streaming fashion. In the future,
   we'll be able to use this to make task-to-task communication
   more efficient.

2) Truncate server errors at 1KB, so long errors do not pollute logs.

3) Change error log level for retryable errors from WARN to INFO. (The
   final error is still WARN.)

4) Harmonize log and exception messages to have a more consistent format.

* Additional tests and improvements.
2021-12-03 09:14:32 -08:00
Clint Wylie af6541a236
allow `DruidSchema` to fallback to segment metadata 'type' if 'typeSignature' is null (#12016)
* allow `DruidSchema` to fallback to segment metadata type if typeSignature is null, to avoid producing incorrect SQL schema if broker is upgraded to 0.23 before historicals

* mmm, forbidden tests
2021-12-02 17:42:01 -08:00
Clint Wylie 84b4bf56d8
vectorize logical operators and boolean functions (#11184)
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
2021-12-02 16:40:23 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Kashif Faraz b48f5a576b
Fix: Do not require time condition on InlineDataSource (#11982)
For queries on logical values, e.g. SELECT 1337, we need not check for
a filter on __time column even if requireTimeCondition is true.
2021-11-25 21:10:06 +05:30
Laksh Singla c381cae51b
Improve the output of SQL explain message (#11908)
Currently, when we try to do EXPLAIN PLAN FOR, it returns the structure of the SQL parsed (via Calcite's internal planner util), which is verbose (since it tries to explain about the nodes in the SQL, instead of the Druid Query), and not representative of the native Druid query which will get executed on the broker side.

This PR aims to change the format when user tries to EXPLAIN PLAN FOR for queries which are executed by converting them into Druid's native queries (i.e. not sys schemas).
2021-11-25 21:08:33 +05:30
Rohan Garg 2c08055962
Specify time column for first/last aggregators (#11949)
Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.
2021-11-25 09:44:14 +05:30
Gian Merlino 0354407655
SQL INSERT planner support. (#11959)
* SQL INSERT planner support.

The main changes are:

1) DruidPlanner is able to validate and authorize INSERT queries. They
   require WRITE permission on the target datasource.

2) QueryMaker is now an interface, and there is a QueryMakerFactory that
   creates instances of it. There is only one production implementation
   of each (NativeQueryMaker and NativeQueryMakerFactory), which
   together behave the same way as the former QueryMaker class. But this
   opens the door to executing queries in ways other than the Druid
   query stack, and is used by unit tests (CalciteInsertDmlTest) to
   test the INSERT planning functionality.

3) Adds an EXTERN table macro that allows references external data using
   InputSource and InputFormat from Druid's batch ingestion API. This is
   not exposed in production yet, but is used by unit tests.

4) Adds a QueryFeature concept that enables the planner to change its
   behavior slightly depending on the capabilities of the execution
   system.

5) Adds an "AuthorizableOperator" concept that enables SqlOperators
   to require additional permissions. This is used by the EXTERN table
   macro.

Related odds and ends:

- Add equals, hashCode, toString methods to InlineInputSource. Aids in
  the "from external" tests in CalciteInsertDmlTest.
- Add JSON-serializability to RowSignature.
- Move the SQL string inside PlannerContext so it is "baked into" the
  planner when the planner is created. Cleans up the code a bit, since
  in practice, the same query is passed in every time to the
  same planner anyway.

* Fix up calls to CalciteTests.createMockQueryLifecycleFactory.

* Fix checkstyle issues.

* Adjustments for CI.

* Adjust DruidAvaticaHandlerTest for stricter test authorizations.
2021-11-24 12:14:04 -08:00
Maytas Monsereenusorn bb3d2a433a
Support filtering data in Auto Compaction (#11922)
* add impl

* fix checkstyle

* add test

* add test

* add unit tests

* fix unit tests

* fix unit tests

* fix unit tests

* add IT

* add IT

* add comments

* fix spelling
2021-11-24 10:56:38 -08:00
Abhishek Agarwal b6a0fbc8b6
Break down CalciteQueryTest (#11979)
* Refactor calciteQueryTest

* Move more tests to CalciteJoinQueryTest
2021-11-24 00:15:42 +05:30
Laksh Singla b5a25f24f2
Improve the DruidRexExecutor w.r.t handling of numeric arrays (#11968)
DruidRexExecutor while reducing Arrays, specially numeric arrays, doesn't convert the value from ExprResult's type to BigDecimal, which causes makeLiteral to cast the values. Also, if NaN or Infinite values are present in the array, the error is a generic NumberFormatException. For example:

SELECT ARRAY[1.11, 2.22] returns [1, 2]
SELECT SQRT(-1) throws a generic NumberFormatException instead of IAE

This PR introduces change to cast the numeric values to BigDecimal since Calcite's library understands that easily, and doesn't perform casts.
2021-11-23 11:40:59 +05:30
TSFenwick a4cb1de87a
get rid of class cast exception and add a new testcase for that issue (#11951) 2021-11-22 08:44:20 -08:00
Gian Merlino b3502c3e50
DruidViewMacro: Remove unused escalator field. (#11931)
* DruidViewMacro: Remove unused escalator field.

* Remove additional unused fields.
2021-11-19 16:06:29 -08:00
Gian Merlino 36ee0367ff
Scan: Add "orderBy" parameter. (#11930)
* Scan: Add "orderBy" parameter.

This patch adds an API for requesting non-time orderings, although it
does not actually add the ability to execute such queries.

The changes are done in such a way that no matter how Scan query objects
are constructed, they will have a correct "getOrderBy". This will enable
us to switch the execution to exclusively use "getOrderBy" later on when
it's implemented.

Scan queries are serialized such that they only include "order" (time
order) if the ordering is time-based, and they only include "orderBy" if
the ordering is non-time-based. This maximizes compatibility with
the existing API while also providing a clean look for formatted queries.

Because this patch does not include execution logic, if someone actually
tries to run a query with non-time ordering, then they will get an error
like "Cannot execute query with orderBy [quality ASC]".

* SQL module fixes.

* Add spotbugs-exclude.

* Remove unused method.
2021-11-19 08:19:12 -08:00
somu-imply 29710789a4
Adding safe divide function (#11904)
* IMPLY-4344: Adding safe divide function along with testcases and documentation updates

* Changing based on review comments

* Addressing review comments, fixing coding style, docs and spelling

* Checkstyle passes for all code

* Fixing expected results for infinity

* Revert "Fixing expected results for infinity"

This reverts commit 5fd5cd480d.

* Updating test result and a space in docs
2021-11-17 08:22:41 -08:00
Gian Merlino d76e646700
Fix TestServerInventoryView behavioral discrepancy. (#11932)
Unlike a real one, TestServerInventoryView would call segmentRemoved
any time _any_ segment was removed. It should only be called when _all_
segments have been removed.
2021-11-16 18:08:35 -08:00
Clint Wylie 54fead3546
sql skip reduce of complex literal expressions (#11928) 2021-11-16 15:40:42 -08:00
TSFenwick 1487f558b1
Use a simple class to sanitize JDBC exceptions and also log them (#11843)
* Use a simple class to sanitize sanitizable errors and log them

The purpose of this is to sanitize JDBC errors, but can sanitize other errors
if they implement SanitizableError Interface

add a class to log errors and sanitize them
added a simple test that tests out that the error gets sanitized
add @NonNull annotation to serverconfig's ErrorResponseTransfromStrategy

* return less information as part of too many connections, and instead only log specific details

This is so an end user gets relevant information but not too much info since they might now how
many brokers they have

* return only runtime exceptions

added new error types that need to be sanitized
also sanitize deprecated and unsupported exceptions.

* dont reqrewite exceptions unless necessary for checked exceptions

add docs
avoid blanket turning all exceptions into runtime exceptions

* address comments, to fix up docs.

add more javadocs
add support UOE sanitization

* use try catch instead and sanitize at public methods

* checkstyle fixes

* throw noSuchStatement and NoSuchConnection as Avatica is affected by those

* address comments. move log error back to druid meta

clean up bad formatting and commented code. add missed catch for NoSuchStatementException
clean up comments for error handler and add comment explainging not wanting to santize avatica exceptions

* alter test to reflect new error message
2021-11-16 13:13:03 -08:00
Gian Merlino 6f6e88e02e
SQL: Add type headers to response formats. (#11914)
This allows clients to interpret the results of SQL queries without having to guess types.
2021-11-13 11:30:57 +05:30
Clint Wylie 5baa22148e
revert ColumnAnalysis type, add typeSignature and use it for DruidSchema (#11895)
* revert ColumnAnalysis type, add typeSignature and use it for DruidSchema

* review stuffs

* maybe null

* better maybe null

* Update docs/querying/segmentmetadataquery.md

* Update docs/querying/segmentmetadataquery.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* fix null right

* sad

* oops

* Update batch_hadoop_queries.json

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-11-10 18:46:29 -08:00
TSFenwick cdd1c2876c
catch throwable because calcite is throwing an error not exception (#11892)
* catch throwable because calcite is throwing an error not exception

* add test case
2021-11-10 17:22:04 -08:00
Jihoon Son 13bec7468a
Fix NPE for SQL queries when a query parameter is missing in the mid (#11900)
* Fix NPE for SQL queries when a query parameter is missing in the mid

* checkstyle

* Throw SqlPlanningException instead of IAE
2021-11-10 10:02:26 -08:00
Clint Wylie a8805ab60d
add missing json type for ListFilteredVirtualColumn (#11887)
* add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again

* fixes

* ugly, but maybe this

* oops

* too many mappers
2021-11-09 17:25:12 -08:00
Maytas Monsereenusorn ddc68c6a81
Support changing dimension schema in Auto Compaction (#11874)
* add impl

* add unit tests

* fix checkstyle

* add impl

* add impl

* add impl

* add impl

* add impl

* add impl

* fix test

* add IT

* add IT

* fix docs

* add test

* address comments

* fix conflict
2021-11-08 21:17:08 -08:00
Clint Wylie 7237dc837c
complex typed expressions (#11853)
* complex typed expressions

* add built-in hll collector expressions to get coverage on druid-processing, more types, more better

* rampage!!!

* more javadoc

* adjustments

* oops

* lol

* remove unused dependency

* contradiction?

* more test
2021-11-08 00:33:06 -08:00
Gian Merlino 8971056763
Properly count segment references in tests. (#11870) 2021-11-05 12:49:10 -07:00
Clint Wylie 9bd2ccbb9b
SqlAggregationModuleTest now extends CalciteTestBase to ensure consistent string encoding (#11861) 2021-11-01 15:11:40 -07:00
Gian Merlino 8276c031c5
Add druid.sql.approxCountDistinct.function property. (#11181)
* Add druid.sql.approxCountDistinct.function property.

The new property allows admins to configure the implementation for
APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode.

The motivation for adding this setting is to enable site admins to
switch the default HLL implementation to DataSketches.

For example, an admin can set:

  druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL

* Fixes

* Fix tests.

* Remove erroneous cannotVectorize.

* Remove unused import.

* Remove unused test imports.
2021-10-25 12:16:21 -07:00
Kashif Faraz abac9e39ed
Revert permission changes to Supervisor and Task APIs (#11819)
* Revert "Require Datasource WRITE authorization for Supervisor and Task access (#11718)"

This reverts commit f2d6100124.

* Revert "Require DATASOURCE WRITE access in SupervisorResourceFilter and TaskResourceFilter (#11680)"

This reverts commit 6779c4652d.

* Fix docs for the reverted commits

* Fix and restore deleted tests

* Fix and restore SystemSchemaTest
2021-10-25 14:50:38 +05:30
Gian Merlino d4cace385f
SQL: Allow Scans to be used as outer queries. (#11831)
* SQL: Allow Scans to be used as outer queries.

This has been possible in the native query system for a while, but the capability
hasn't yet propagated into the SQL layer. One example of where this is useful is
a query like:

  SELECT * FROM (... LIMIT X) WHERE <filter>

Because this expands the kinds of subquery structures the SQL layer will consider,
it was also necessary to improve the cost calculations. These changes appear in
PartialDruidQuery and DruidOuterQueryRel. The ideas are:

- Attach per-column penalties to the output signature of each query, instead of to
  the initial projection that starts a query. This encourages moving projections
  into subqueries instead of leaving them on outer queries.
- Only attach penalties to projections if there are actually expressions happening.
  So, now, projections that simply reorder or remove fields are free.
- Attach a constant penalty to every outer query. This discourages creating them
  when they are not needed.

The changes are generally beneficial to the test cases we have in CalciteQueryTest.
Most plans are unchanged, or are changed in purely cosmetic ways. Two have changed
for the better:

- testUsingSubqueryWithLimit now returns a constant from the subquery, instead of
  returning every column.
- testJoinOuterGroupByAndSubqueryHasLimit returns a minimal set of columns from
  the innermost subquery; two unnecessary columns are no longer there.

* Fix various DS operator conversions.

These were all implemented as direct conversions, which isn't appropriate
because they do not actually map onto native functions. These are only
usable as post-aggregations.

* Test case adjustment.
2021-10-23 17:18:43 -07:00
Clint Wylie 187df58e30
better types (#11713)
* better type system

* needle in a haystack

* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support

* fixup merge

* more test

* fixup

* intern

* fix

* oops

* oops again

* ...

* more test coverage

* fix error message

* adjust interning, more javadocs

* oops

* more docs more better
2021-10-19 01:47:25 -07:00
Kashif Faraz f2d6100124
Require Datasource WRITE authorization for Supervisor and Task access (#11718)
Follow up PR for #11680

Description
Supervisor and Task APIs are related to ingestion and must always require Datasource WRITE
authorization even if they are purely informative.

Changes
Check Datasource WRITE in SystemSchema for tables "supervisors" and "tasks"
Check Datasource WRITE for APIs /supervisor/history and /supervisor/{id}/history
Check Datasource for all Indexing Task APIs
2021-10-08 10:39:48 +05:30
Lucas Capistrant 1930ad1f47
Implement configurable internally generated query context (#11429)
* Add the ability to add a context to internally generated druid broker queries

* fix docs

* changes after first CI failure

* cleanup after merge with master

* change default to empty map and improve unit tests

* add doc info and fix checkstyle

* refactor DruidSchema#runSegmentMetadataQuery and add a unit test
2021-10-06 09:02:41 -07:00
Maytas Monsereenusorn 8cc58a4368
Add sql query id to response header for failed sql query (#11756)
* add impl

* add impl
2021-09-30 13:43:39 +07:00
Clint Wylie 11017ef00a
support jdbc even if trailing / is missing (#11737)
* support jdbc even if trailing / is missing

* fix tests
2021-09-29 13:59:26 -07:00
Maytas Monsereenusorn a04b08e45c
Add new config to filter internal Druid-related messages from Query API response (#11711)
* add impl

* add impl

* add tests

* add unit test

* fix checkstyle

* address comments

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix checkstyle

* fix checkstyle

* address comments

* address comments

* address comments

* fix test

* fix test

* fix test

* fix test

* fix test

* change config name

* change config name

* change config name

* address comments

* address comments

* address comments

* address comments

* address comments

* address comments

* fix compile

* fix compile

* change config

* add more tests

* fix IT
2021-09-29 12:55:49 +07:00
Clint Wylie 5de26cf6d9
add optional system schema authorization (#11720)
* add optional system schema authorization

* remove unused

* adjust docs

* doc fixes, missing ldap config change for integration tests

* style
2021-09-21 13:28:26 -07:00