Commit Graph

650 Commits

Author SHA1 Message Date
Laksh Singla 2e616e633a
Determine type of `__time` column by RowSignature in case of External Datasource (#12770)
Some queries like `REPLACE INTO ... SELECT TIME_PARSE("__time") AS __time FROM ...`
fail at the Calcite layer because any column with name `__time` is considered to be of
type `SqlTypeName.TIMESTAMP`.

Changes:
- Modify `RowSignatures.toRelDataType()` so that the type of `__time` column
  is determined by the RowSignature's type.
2022-07-26 12:09:40 +05:30
Maytas Monsereenusorn 3bf1e699ff
GREATEST/LEAST function is incorrectly specifying that it cannot return null (#12804) 2022-07-20 14:41:24 +05:30
Adarsh Sanjeev f3272a25f9
Add check for sqlOuterLimit to ingest queries (#12799)
* Add check for sqlOuterLimit to ingest queries

* Fix checkstyle

* Add comment
2022-07-19 09:02:43 -07:00
Paul Rogers ee15c238cc
Clone Calcite planner to access validator (#12708)
Done in preparation for the "single-pass" planner.
2022-07-14 18:10:33 -07:00
Clint Wylie 05b2e967ed
druid nested data column type (#12753)
* add new druid nested data column type

* fixes and such

* fixes

* adjustments, more tests

* self review

* oops

* fix and test

* more better

* style
2022-07-14 12:07:23 -07:00
Rohan Garg bb953be09b
Refactor usage of JoinableFactoryWrapper + more test coverage (#12767)
Refactor usage of JoinableFactoryWrapper to add e2e test for createSegmentMapFn with joinToFilter feature enabled
2022-07-12 06:25:36 -07:00
Gian Merlino 97207cdcc7
Automatic sizing for GroupBy dictionaries. (#12763)
* Automatic sizing for GroupBy dictionary sizes.

Merging and selector dictionary sizes currently both default to 100MB.
This is not optimal, because it can lead to OOM on small servers and
insufficient resource utilization on larger servers. It also invites
end users to try to tune it when queries run out of dictionary space,
which can make things worse if the end user sets it to too high.

So, this patch:

- Adds automatic tuning for selector and merge dictionaries. Selectors
  use up to 15% of the heap and merge buffers use up to 30% of the heap
  (aggregate across all queries).

- Updates out-of-memory error messages to emphasize enabling disk
  spilling vs. increasing memory parameters. With the memory parameters
  automatically sized, it is more likely that an end user will get
  benefit from enabling disk spilling.

- Removes the query context parameters that allow lowering of configured
  dictionary sizes. These complicate the calculation, and I don't see a
  reasonable use case for them.

* Adjust tests.

* Review adjustments.

* Additional comment.

* Remove unused import.
2022-07-11 08:20:50 -07:00
Gian Merlino edfbcc8455
Preserve column order in DruidSchema, SegmentMetadataQuery. (#12754)
* Preserve column order in DruidSchema, SegmentMetadataQuery.

Instead of putting columns in alphabetical order. This is helpful
because it makes query order better match ingestion order. It also
allows tools, like the reindexing flow in the web console, to more
easily do follow-on ingestions using a column order that matches the
pre-existing column order.

We prefer the order from the latest segments. The logic takes all
columns from the latest segments in the order they appear, then adds
on columns from older segments after those.

* Additional test adjustments.

* Adjust imports.
2022-07-08 22:04:11 -07:00
Gian Merlino 9c925b4f09
Frame format for data transfer and short-term storage. (#12745)
* Frame format for data transfer and short-term storage.

As we move towards query execution plans that involve more transfer
of data between servers, it's important to have a data format that
provides for doing this more efficiently than the options available to
us today.

This patch adds:

- Columnar frames, which support fast querying.
- Row-based frames, which support fast sorting via memory comparison
  and fast whole-row copies via memory copying.
- Frame files, a container format that can be stored on disk or
  transferred between servers.

The idea is we should use row-based frames when data is expected to
be sorted, and columnar frames when data is expected to be queried.

The code in this patch is not used in production yet. Therefore, the
patch involves minimal changes outside of the org.apache.druid.frame
package.  The main ones are adjustments to SqlBenchmark to add benchmarks
for queries on frames, and the addition of a "forEach" method to Sequence.

* Fixes based on tests, static analysis.

* Additional fixes.

* Skip DS mapping tests on JDK 14+

* Better JDK checking in tests.

* Fix imports.

* Additional comment.

* Adjustments from code review.

* Update test case.
2022-07-08 20:42:06 -07:00
Rohan Garg d732de9948
Allow adding calcite rules from extensions (#12715)
* Allow adding calcite rules from extensions

* fixup! Allow adding calcite rules from extensions

* Move Rules to CalciteRulesManager

* fixup! Move Rules to CalciteRulesManager
2022-07-06 19:32:35 +05:30
Didip Kerabat 06251c5d2a
Add EIGHT_HOUR into possible list of Granularities. (#12717)
* Add EIGHT_HOUR into possible list of Granularities.

* Add the missing definition.

* fix test.

* Fix another test.

* Stylecheck finally passed.

Co-authored-by: Didip Kerabat <didip@apple.com>
2022-07-05 11:05:37 -07:00
Clint Wylie bbbb6e1c3f
fix DruidSchema issue where datasources with no segments can become stuck in tables list indefinitely (#12727) 2022-07-01 18:54:01 -07:00
Clint Wylie 48731710fb
precursor changes for nested columns to minimize files changed (#12714)
* precursor changes for nested columns to minimize files changed

* inspection fix

* visibility

* adjustment

* unecessary change
2022-07-01 02:27:19 -07:00
Clint Wylie d30efb1c1e
fix bug when rewriting sql virtual column registry (#12718) 2022-07-01 02:24:00 -07:00
Tejaswini Bandlamudi 1fc2f6e4b0
Throw BadQueryContextException if context params cannot be parsed (#12680) 2022-06-24 09:21:25 +05:30
Paul Rogers ffcb996468
Cleanup changes pulled out of PR #12368 (#12672)
This commit contains the cleanup needed for the new integration test framework.

Changes:
- Fix log lines, misspellings, docs, etc.
- Allow the use of some of Druid's "JSON config" objects in tests
- Fix minor bug in `BaseNodeRoleWatcher`
2022-06-23 23:19:50 +05:30
Kashif Faraz b6f8d7a1b3
Add query context param `forceExpressionVirtualColumns` to always use "expression"-type virtual columns in query plan (#12583)
SQL expressions such as those containing `MV_FILTER_ONLY` and `MV_FILTER_NONE`
are planned as specialized virtual columns instead of the default `expression`-type virtual columns.
This commit adds a new context parameter to force the `expression`-type virtual columns.

Changes
- Add query context param `forceExpressionVirtualColumns`
- Use context param to determine if specialized virtual columns should be used or not
- Moved some tests into `CalciteExplainQueryTest`
2022-06-22 15:33:50 +05:30
Gian Merlino 0099940808
Add TIME_IN_INTERVAL SQL operator. (#12662)
* Add TIME_IN_INTERVAL SQL operator.

The operator is implemented as a convertlet rather than an
OperatorConversion, because this allows it to be equivalent to using
the >= and < operators directly.

* SqlParserPos cannot be null here.

* Remove unused import.

* Doc updates.

* Add words to dictionary.
2022-06-21 13:05:37 -07:00
Gian Merlino 818974f6e4
ScanQuery: Fix JsonIgnore for isLegacy. (#12674)
True, false, and null have different meanings: true/false mean "legacy"
and "not legacy"; null means use the default set by ScanQueryConfig.
So, we need to respect this in the JsonIgnore setup.
2022-06-18 15:55:54 -07:00
Paul Rogers 893759de91
Remove null and empty fields from native queries (#12634)
* Remove null and empty fields from native queries

* Test fixes

* Attempted IT fix.

* Revisions from review comments

* Build fixes resulting from changes suggested by reviews

* IT fix for changed segment size
2022-06-16 14:07:25 -07:00
TSFenwick a3603ad6b0
Use DefaultQueryConfig in SqlLifecycle to correctly populate request logs (#12613)
Fixes an issue where sql query request logs do not include the default query context
values set via `druid.query.default.context.xyz` runtime properties.

# Change summary
* Inject `DefaultQueryConfig` into `SqlLifecycleFactory`
* Add params from `DefaultQueryConfig` to the query context in `SqlLifecycle`

# Description
- This change does not affect query execution. This is because the
  `DefaultQueryConfig` was already being used in `QueryLifecycle`,
   which is initialized when the SQL is translated to a native query. 
- This also handles any potential use case where a context parameter should be
   handled at the SQL stage itself.
2022-06-08 12:52:50 +05:30
Laksh Singla 81c37c6515
Add validation for invalid partitioned by granularities (#12589)
* Add validation for invalid partitioned by granularities

* review comments

* improve error message, change location of the method

* remove imports

* use StringUtils.lowercase

Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
2022-06-06 22:00:29 +05:30
Adarsh Sanjeev 5a283964ca
Improve SQL validation error messages (#12611)
Update the SQL validation error message to specify whether
the ingest is INSERT or REPLACE for better user experience.
2022-06-06 16:14:28 +05:30
Clint Wylie dc0fdfec67
fix test comment (#12584) 2022-05-31 12:39:20 -07:00
Gian Merlino 02ae3e74ff
RowBasedColumnSelectorFactory: Add "useStringValueOfNullInLists" parameter. (#12578)
RowBasedColumnSelectorFactory inherited strange behavior from
Rows.objectToStrings for nulls that appear in lists: instead of being
left as a null, it is replaced with the string "null". Some callers may
need compatibility with this strange behavior, but it should be opt-in.

Query-time call sites are changed to opt-out of this behavior, since it
is not consistent with query-time expectations. The IncrementalIndex
ingestion-time call site retains the old behavior, as this is traditionally
when Rows.objectToStrings would be used.
2022-05-31 11:38:56 -07:00
Clint Wylie b746bf9129
fix virtual column cycle bug, sql virtual column optimize bug (#12576)
* fix virtual column cycle bug, sql virtual column optimize bug

* more test
2022-05-30 23:51:21 -07:00
Karan Kumar 9f9faeec81
object[] handling for DimensionHandlers for arrays (#12552)
Description
Fixes a bug when running q's like

 SELECT cntarray,
       Count(*)
FROM   (SELECT dim1,
               dim2,
               Array_agg(cnt) AS cntarray
        FROM   (SELECT dim1,
                       dim2,
                       dim3,
                       Count(*) AS cnt
                FROM   foo
                GROUP  BY 1,
                          2,
                          3)
        GROUP  BY 1,
                  2)
GROUP  BY 1  
This generates an error:

org.apache.druid.java.util.common.ISE: Unable to convert type [Ljava.lang.Object; to org.apache.druid.segment.data.ComparableList
        at org.apache.druid.segment.DimensionHandlerUtils.convertToList(DimensionHandlerUtils.java:405) ~[druid-xx]
Because it's an array of numbers it looks like it does the convertToList call, which looks like:

  @Nullable
  public static ComparableList convertToList(Object obj)
  {
    if (obj == null) {
      return null;
    }
    if (obj instanceof List) {
      return new ComparableList((List) obj);
    }
    if (obj instanceof ComparableList) {
      return (ComparableList) obj;
    }
    throw new ISE("Unable to convert type %s to %s", obj.getClass().getName(), ComparableList.class.getName());
  }
I.e. it doesn't know about arrays. Added the array handling as part of this PR.
2022-05-25 15:24:18 +05:30
Adarsh Sanjeev 5063eca5b9
Add error message for incorrectly ordered clause in sql (#12558)
In the case that the clustered by is before the partitioned by for an sql query, the error message is a bit confusing.

insert into foo select * from bar clustered by dim1 partitioned by all

Error: SQL parse failed

Encountered "PARTITIONED" at line 1, column 88.

Was expecting one of: <EOF> "," ... "ASC" ... "DESC" ... "NULLS" ... "." ... "NOT" ... "IN" ... "<" ... "<=" ... ">" ... ">=" ... "=" ... "<>" ... "!=" ... "BETWEEN" ... "LIKE" ... "SIMILAR" ... "+" ... "-" ... "*" ... "/" ... "%" ... "||" ... "AND" ... "OR" ... "IS" ... "MEMBER" ... "SUBMULTISET" ... "CONTAINS" ... "OVERLAPS" ... "EQUALS" ... "PRECEDES" ... "SUCCEEDS" ... "IMMEDIATELY" ... "MULTISET" ... "[" ... "FORMAT" ... "(" ... Less...

org.apache.calcite.sql.parser.SqlParseException
This is a bit confusing and adding a check could be added to throw a more user friendly message stating that the order should be reversed.

Add error message for incorrectly ordered clause in sql.
2022-05-23 12:41:18 +05:30
Gian Merlino 69aac6c8dd
Direct UTF-8 access for "in" filters. (#12517)
* Direct UTF-8 access for "in" filters.

Directly related:

1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
   ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
   necessary, the input set is copied into a ValuesSet. Much logic is
   simplified, because we always know what type the values set will be.
   I think that there won't even be an efficiency loss in most cases.
   InDimFilter is most frequently created by deserialization, and this
   patch updates the JsonCreator constructor to deserialize
   directly into a ValuesSet.

2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
   during index lookups.

3) Add unsigned comparator to ByteBufferUtils and use it in
   GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
   bytes can be compared as bytes if, and only if, the comparison
   is unsigned.

4) Add specialization to GenericIndexed.singleThreaded().indexOf that
   avoids needless ByteBuffer allocations.

5) Clarify that objects returned by ColumnIndexSupplier.as are not
   thread-safe. DictionaryEncodedStringIndexSupplier now calls
   singleThreaded() on all relevant GenericIndexed objects, saving
   a ByteBuffer allocation per access.

Also:

1) Fix performance regression in LikeFilter: since #12315, it applied
   the suffix matcher to all values in range even for type MATCH_ALL.

2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
   which was broken due to calls to strategy.compare in
   GenericIndexed.fromIterable.

* Add like-filter implementation tests.

* Add in-filter implementation tests.

* Add tests, fix issues.

* Fix style.

* Adjustments from review.
2022-05-20 01:51:28 -07:00
Gian Merlino 65a1375b67
SQL: Add is_active to sys.segments, update examples and docs. (#11550)
* SQL: Add is_active to sys.segments, update examples and docs.

is_active is short for:

  (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1

It's important because this represents "all the segments that should
be queryable, whether or not they actually are right now". Most of the
time, this is the set of segments that people will want to look at.

The web console already adds this filter to a lot of its queries,
proving its usefulness.

This patch also reworks the caveat at the bottom of the sys.segments
section, so its information is mixed into the description of each result
field. This should make it more likely for people to see the information.

* Wording updates.

* Adjustments for spellcheck.

* Adjust IT.
2022-05-19 14:23:28 -07:00
Adarsh Sanjeev fcb1c0b7bf
Add cluster by support for replace syntax (#12524)
* Add cluster by support for replace syntax

* Add unit test for with list
2022-05-17 15:15:29 +05:30
Adarsh Sanjeev 0fd4f1e386
Improve error messages from SQL REPLACE syntax (#12523)
- Add user friendly error messages for missing or incorrect OVERWRITE clause for REPLACE SQL query
- Move validation of missing OVERWRITE clause at code level instead of parser for custom error message
2022-05-17 09:55:58 +05:30
Adarsh Sanjeev 39b3487aa9
Add replace statement to sql parser (#12386)
Relevant Issue: #11929

- Add custom replace statement to Druid SQL parser.
- Edit DruidPlanner to convert relevant fields to Query Context.
- Refactor common code with INSERT statements to reuse them for REPLACE where possible.
2022-05-13 10:56:40 +05:30
Clint Wylie 9e5a940cf1
remake column indexes and query processing of filters (#12388)
Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.
2022-05-11 11:57:08 +05:30
Rohan Garg 75836a5a06
Add feature flag for sql planning of TimeBoundary queries (#12491)
* Add feature flag for sql planning of TimeBoundary queries

* fixup! Add feature flag for sql planning of TimeBoundary queries

* Add documentation for enableTimeBoundaryPlanning

* fixup! Add documentation for enableTimeBoundaryPlanning
2022-05-10 15:23:42 +05:30
somu-imply c68388ebcd
Vectorized version of string last aggregator (#12493)
* Vectorized version of string last aggregator

* Updating string last and adding testcases

* Updating code and adding testcases for serializable pairs

* Addressing review comments
2022-05-09 17:02:38 -07:00
Gian Merlino a2bad0b3a2
Reduce allocations due to Jackson serialization. (#12468)
* Reduce allocations due to Jackson serialization.

This patch attacks two sources of allocations during Jackson
serialization:

1) ObjectMapper.writeValue and JsonGenerator.writeObject create a new
   DefaultSerializerProvider instance for each call. It has lots of
   fields and creates pressure on the garbage collector. So, this patch
   adds helper functions in JacksonUtils that enable reuse of
   SerializerProvider objects and updates various call sites to make
   use of this.

2) GroupByQueryToolChest copies the ObjectMapper for every query to
   install a special module that supports backwards compatibility with
   map-based rows. This isn't needed if resultAsArray is set and
   all servers are running Druid 0.16.0 or later. This release was a
   while ago. So, this patch disables backwards compatibility by default,
   which eliminates the need to copy the heavyweight ObjectMapper. The
   patch also introduces a configuration option that allows admins to
   explicitly enable backwards compatibility.

* Add test.

* Update additional call sites and add to forbidden APIs.
2022-04-27 14:17:26 -07:00
Gian Merlino 2e42d04038
SQL: Create millisecond precision timestamp literals. (#12407)
* SQL: Create millisecond precision timestamp literals.

Fixes a bug where implicit casts of strings to timestamps would use seconds
precision rather than milliseconds. The new test case
testCountStarWithBetweenTimeFilterUsingMillisecondsInStringLiterals
exercises this.

* Update sql/src/main/java/org/apache/druid/sql/calcite/planner/Calcites.java

Co-authored-by: Frank Chen <frankchen@apache.org>

* Correct precision handling.

- Set default precision to 3 (millis) for things involving timestamps.
- Respect precision specified in types when available.

* Silence, checkstyle.

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-04-27 14:17:07 -07:00
Abhishek Agarwal 2fe053c5cb
Bump up the versions (#12480) 2022-04-27 14:28:20 +05:30
Adarsh Sanjeev 1306965c9e
Validate select columns for insert statement (#12431)
Unnamed columns in the select part of insert SQL statements currently create a table with the column name such as "EXPR$3". This PR adds a check for this.
2022-04-27 12:25:49 +05:30
somu-imply 027935dcff
Vectorize numeric latest aggregators (#12439)
* Vectorizing Latest aggregator Part 1

* Updating benchmark tests

* Changing appropriate logic for vectors for null handling

* Introducing an abstract class and moving the commonalities there

* Adding vectorization for StringLast aggregator (initial version)

* Updated bufferized version of numeric aggregators

* Adding some javadocs

* Making sure this PR vectorizes numeric latest agg only

* Adding another benchmarking test

* Fixing intellij inspections

* Adding tests for double

* Adding test cases for long and float

* Updating testcases

* Checkstyle oops..

* One tiny change in test case

* Fixing spotbug and rhs not being used
2022-04-26 11:33:08 -07:00
Rohan Garg 95694b5afa
Convert simple min/max SQL queries on __time to timeBoundary queries (#12472)
* Support array based results in timeBoundary query

* Fix bug with query interval in timeBoundary

* Convert min(__time) and max(__time) SQL queries to timeBoundary

* Add tests for timeBoundary backed SQL queries

* Fix query plans for existing tests

* fixup! Convert min(__time) and max(__time) SQL queries to timeBoundary

* fixup! Add tests for timeBoundary backed SQL queries

* fixup! Fix bug with query interval in timeBoundary
2022-04-25 08:18:58 -07:00
Jihoon Son 73ce5df22d
Add support for authorizing query context params (#12396)
The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below.

Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params.
User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters.
System context params. They are set by the Druid query engine during query processing. These params override other context params.
Today, any context params are allowed to users. This can cause 
1) a bad UX if the context param is not matured yet or 
2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows.

This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission.

{
  "resourceAction" : {
    "resource" : {
      "name" : "maxSubqueryRows",
      "type" : "QUERY_CONTEXT"
    },
    "action" : "WRITE"
  },
  "resourceNamePattern" : "maxSubqueryRows"
}
Each role can have multiple permissions for context params. Each permission should be set for different context params.

When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case,

HTTP endpoints will return 403 response code.
JDBC will throw ForbiddenException.
Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService.

The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.
2022-04-21 14:21:16 +05:30
somu-imply 2db02876cf
Updating an error msg (#12450)
* Updating an error msg

* Added an extra [] so removing it
2022-04-20 07:56:09 -07:00
somu-imply cd6fba2f6c
Handling planning with alias for time for group by and order by (#12418)
An outer scan query, that requires ordering on a column, should be considered an invalid query.
2022-04-15 10:29:17 +05:30
Adarsh Sanjeev b74cb7624d
Make error messages for insert statements consistent with select statements (#12414)
For a query like
INSERT INTO tablename SELECT channel, added as count FROM wikipedia the error message is Encountered "as count". However, for the insert statement
INSERT INTO t SELECT channel, added as count FROM wikipedia PARTITIONED BY ALL
returns INSERT statements must specify PARTITIONED BY clause explictly (incorrectly). This PR corrects this.

Add EOF to end of Druid SQL Insert statements
Rename SQL Insert statements in the parser to reflect the behaviour change
2022-04-09 12:21:40 +05:30
Paul Rogers 2cc2088720
Method to specify eternity in the scan query builder (#12223)
* Method to specify eternity in the scan query builder

* Fix checkstyle issue

* Renamed eterity() to eternityInterval()

* Minor fixes
2022-04-04 15:11:32 -07:00
Adarsh Sanjeev ef45a1551e
Convert inQueryThreshold into query context parameter. (#12357)
Added Calcites InQueryThreshold as a query context parameter. Setting this parameter appropriately reduces the time taken for queries with large number of values in their IN conditions.
2022-03-22 18:33:57 +05:30
Gian Merlino cb2b2b696d
Fix error message for groupByEnableMultiValueUnnesting. (#12325)
* Fix error message for groupByEnableMultiValueUnnesting.

It referred to the incorrect context parameter.

Also, create a dedicated exception class, to allow easier detection of this
specific error.

* Fix other test.

* More better error messages.

* Test getDimensionName method.
2022-03-10 11:37:24 -08:00
Rohan Garg 9f6a930462
Fix join query incase of filter explosion during CNF conversion (#12324) 2022-03-09 12:43:09 -08:00
Clint Wylie 1c004ea47e
use virtual columns for sql simple aggregators instead of inline expressions (#12251)
* use virtual columns for sql simple aggregators instead of inline expressions

* fixes

* always use virtual columns

* add more tests
2022-03-03 15:05:28 -08:00
Xavier Léauté 1434197ee1
update airline dependency to 2.x (#12270)
* upgrade Airline to Airline 2
  https://github.com/airlift/airline is no longer maintained, updating to
  https://github.com/rvesse/airline (Airline 2) to use an actively
  maintained version, while minimizing breaking changes.

  Note, this is a backwards incompatible change, and extensions relying on
  the CliCommandCreator extension point will also need to be updated.

* fix dependency checks where jakarta.inject is now resolved first instead
  of javax.inject, due to Airline 2 using jakarta
2022-02-27 15:19:28 -08:00
Jihoon Son e5ad862665
A new includeAllDimension flag for dimensionsSpec (#12276)
* includeAllDimensions in dimensionsSpec

* doc

* address comments

* unused import and doc spelling
2022-02-25 18:27:48 -08:00
Jonathan Wei b1640a72ee
Re-enable segment metadata cache when using external schema (#12264) 2022-02-22 19:50:29 -06:00
Karan Kumar 5794331eb1
Adding new config for disabling group by on multiValue column (#12253)
As part of #12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key.
Added a config groupByEnableMultiValueUnnesting which can be set in the query context.

The default value of groupByEnableMultiValueUnnesting is true, therefore it does not change the current engine behavior.
If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.
2022-02-16 20:53:26 +05:30
Laksh Singla 8fc0e5c95c
Explain plan for custom insert syntax (#12243)
* Initial commit, explain plan for custom insert syntax working

* Cleanup separate SqlInsert handling
2022-02-15 21:48:34 -08:00
somu-imply eae163a797
Moving in filter check to broker (#12195)
* Moving in filter check to broker

* Adding more unit tests, making error message meaningful

* Spelling and doc changes

* Updating default to -1 and making this feature hide by default. The number of IN filters can grow upto a max limit of 100

* Removing upper limit of 100, updated docs

* Making documentation more meaningful

* Moving check outside to PlannerConfig, updating test cases and adding back max limit

* Updated with some additional code comments

* Missed removing one line during the checkin

* Addressing doc changes and one forbidden API correction

* Final doc change

* Adding a speling exception, correcting a testcase

* Reading entire filter tree to address combinations of ANDs and ORs

* Specifying in docs that, this case works only for ORs

* Revert "Reading entire filter tree to address combinations of ANDs and ORs"

This reverts commit 81ca8f8496.

* Covering a class cast exception and updating docs

* Counting changed

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2022-02-15 20:45:07 -08:00
somu-imply 033989eb1d
Adding vectorized time_shift (#12254)
* Adding vectorized time_shift

* Vectorize time shift, addressing review comments

* Remove an unused import
2022-02-11 14:44:52 -08:00
Laksh Singla 5bd646e10a
Surface a user friendly error when PARTITIONED BY is omitted (#12246)
#12163 makes PARTITIONED BY a required clause in INSERT queries. While this is required, if a user accidentally omits the clause, it emits a JavaCC/Calcite error, since it's syntactically incorrect. The error message is cryptic. Since it's a custom clause, this PR aims to make the clause optional on the syntactic side, but move the validation to DruidSqlInsert where we can surface a friendlier error.
2022-02-11 11:49:00 +05:30
Clint Wylie 3ee66bb492
allow optimizing sql expressions and virtual columns (#12241)
* rework sql planner expression and virtual column handling

* simplify a bit

* add back and deprecate old methods, more tests, fix multi-value string coercion bug and associated tests

* spotbugs

* fix bugs with multi-value string array expression handling

* javadocs and adjust test

* better

* fix tests
2022-02-09 14:55:50 -08:00
Laksh Singla 4add2510ed
Add syntax support for PARTITIONED BY/CLUSTERED BY in INSERT queries (#12163)
This PR aims to add parser changes for supporting PARTITIONED BY and CLUSTERED BY as proposed in the issue #11929.
2022-02-08 16:23:15 +05:30
Clint Wylie ae71e05fc5
array_concat_agg and array_agg support for array inputs (#12226)
* array_concat_agg and array_agg support for array inputs
changes:
* added array_concat_agg to aggregate arrays into a single array
* added array_agg support for array inputs to make nested array
* added 'shouldAggregateNullInputs' and 'shouldCombineAggregateNullInputs' to fix a correctness issue with STRING_AGG and ARRAY_AGG when merging results, with dual purpose of being an optimization for aggregating

* fix test

* tie capabilities type to legacy mode flag about coercing arrays to strings

* oops

* better javadoc
2022-02-07 19:59:30 -08:00
Clint Wylie 8fd587b28c
remove duplicate Broker ServerInventoryView, improve HttpServerInventoryView logging (#12209)
* changes:
* remove SystemSchema duplicate ServerInventoryView in broker
* suppress duplicate segment added/removed warnings in HttpServerInventoryView when doing a full sync

* fixes
2022-02-03 12:57:34 -08:00
Maytas Monsereenusorn 3717693633
Fix java.lang.ClassCastException error when using useApproximateCountDistinct false for aggregation query (#12216)
* add imply

* add test

* add unit test

* add test
2022-02-03 12:01:13 -08:00
Clint Wylie f9b406c8f2
add backwards compatibility mode for multi-value string array null value coercion (#12210) 2022-01-31 22:38:15 -08:00
Abhishek Agarwal 1b8808cce8
Fix SQL queries for inline datasource with null values (#12092)
Fixes a bug because of which some SQL queries cannot be parsed using druid convention. Specifically, these queries translate to an inline datasource and have some null values. Calcite internally uses NULL as SQL type for these literals and that is not supported by the druid.
I am now allowing null column types to be returned while building RowSignature in org.apache.druid.sql.calcite.table.RowSignatures#fromRelDataType. RowSignature already allows null column type for any column. Doing so should also fix bindable queries such as select (1,2). When such queries are run with headers set to true, we get an exception in org.apache.druid.sql.http.ArrayWriter#writeHeader. This is again a similar exception to the one addressed in this PR. Because SQL type for the result column is RECORD and that doesn't have a corresponding columnType.
2022-01-27 18:04:12 +05:30
Karan Kumar 96b3498a40
Grouping on arrays as arrays (#12078)
* init multiValue column group by

* Changing sorting to Lexicographic as default

* Adding initial tests

* 1.Fixing test cases adding
2.Optimized inmem structs

* Linking SQL layer to native layer

* Adding multiDimension support to group by column strategy

* 1. Removing array coercion in Calcite layer
2. Removing ResultRowDeserializer

* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector

* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector

* 1. Checkstyle things
2. Removing flag

* Minor naming things

* CheckStyle Things

* Fixing test case

* Fixing hashing

* 1. Adding the MV function
2. Added few test cases

* 1. Adding MV function test cases

* Adding Selector strategy function test cases

* Fixing ClientQuerySegmentWalkerTest

* Adding GroupByQueryRunnerTest test cases

* Fixing test cases

* Adding few more test cases

* Fixing Exception asset statement and intellij inspection

* Adding null compatibility tests

* Review comments

* Fixing few failing tests

* Fixing few failing tests

* Do no convert to topN Q incase of group by on array

* Fixing checkstyle

* Fixing differences between jdk's class cast exception message

* 1. Fixing ordering if the grouping key is an array

* Fixing DefaultLimitSpec

* Fixing CalciteArraysQueryTest

* Dummy commit for LGTM

* changes:
* only coerce multi-value string null values when `ExpressionPlan.Trait.NEEDS_APPLIED` is set
* correct return type inference for ARRAY_APPEND,ARRAY_PREPEND,ARRAY_SLICE,ARRAY_CONCAT
* fix bug with ExprEval.ofType when actual type of object from binding doesn't match its claimed type

* Review comments

* Fixing test cases

* Fixing spot bugs

* Fixing strict compile

Co-authored-by: Clint Wylie <cwylie@apache.org>
2022-01-25 20:30:56 -08:00
somu-imply cc8b9c0b6e
Handling OOM error in ExpressionVector setup by reducing number of rows (#12186)
* Handling OOM error in ExpressionVector setup by reducing number of rows

* Removing row size to 10K in sanity tests
2022-01-24 08:37:13 -08:00
Laksh Singla dc1703d5f9
Change value of `druid.sql.planner.useGroupingSetForExactDistinct` in common.runtime.properties (#12182)
This PR changes the value of the property `druid.sql.planner.useGroupingSetForExactDistinct` from `false` to `true` in the runtime.properties files, so that newer installations have this property as `true`, while the default still remains as `false`.

The flag determines how queries which contain an aggregation over `DISTINCT` like `SELECT COUNT(DISTINCT foo.dim1) FILTER(WHERE foo.cnt = 1), SUM(foo.cnt) FROM druid.foo` get planned by Calcite. With the flag being set to false, it plans it via joins, whereas with it being set to true, the query is set using grouping sets.

There is a known issue with Calcite (https://github.com/apache/druid/issues/7953), where an NPE is thrown while planning the above query with joins. There is no such issue while planning the query using grouping sets.
2022-01-24 14:00:03 +05:30
Jihoon Son cc2ffc6c0f
Fix node discovery to ignore unknown DruidServices (#12157)
* Fix node discovery to ignore unknown DruidServices

* ignore all runtime exceptions

* fix test

* add custom deserializer

* custom serializer

* log host for unparseable druidService
2022-01-18 22:08:59 -08:00
Gian Merlino cf7191d2bc
Validate target dataSource for INSERT. (#12129) 2022-01-18 09:34:23 -08:00
Maytas Monsereenusorn bd7fe45da0
Support adding metrics in Auto Compaction (#12125)
* add impl

* add impl

* add unit tests

* add unit tests

* add unit tests

* add unit tests

* add unit tests

* add integration tests

* add integration tests

* fix LGTM

* fix test

* remove doc
2022-01-17 20:19:31 -08:00
Clint Wylie f2ce76966c
add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures less ambiguous (#12145)
* add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous

* switcheroo

* EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs

* revert unintended change

* fix docs

* fix docs better
2022-01-12 03:48:53 -08:00
Laksh Singla fae73800a7
Set plannerContext error when cannot query external datasources and when insert is not supported. (#12136)
This PR aims to add plannerContext.setPlanningError whenever external table scan rule is invoked, without the queryMaker having the ability to do so.
2022-01-12 15:11:17 +05:30
Rohan Garg 81f0aba6cb
Use ListFilteredVirtualColumn for left/fact table expression in join condition (#12127)
* Pass VirtualColumnRegistry in PlannerContext for join expression planning

* Allow for including VCs from join fact table expression

* Optmize MV_FILTER functions to use a VC when in join fact table expression

* fixup! Allow for including VCs from join fact table expression

* Address review comments
2022-01-11 14:47:13 -08:00
Laksh Singla 7c17341caa
Return empty result when a group by gets optimized to a timeseries query (#12065)
Related to #11188

The above mentioned PR allowed timeseries queries to return a default result, when queries of type: select count(*) from table where dim1="_not_present_dim_" were executed. Before the PR, it returned no row, after the PR, it would return a row with value of count(*) as 0 (as expected by SQL standards of different dbs).

In Grouping#applyProject, we can sometimes perform optimization of a groupBy query to a timeseries query if possible (when the keys of the groupBy are constants, as generated by automated tools). For example, in select count(*) from table where dim1="_present_dim_" group by "dummy_key", the groupBy clause can be removed. However, in the case when the filter doesn't return anything, i.e. select count(*) from table where dim1="_not_present_dim_" group by "dummy_key", the behavior of general databases would be to return nothing, while druid (due to above change) returns an empty row. This PR aims to fix this divergence of behavior.

Example cases:

select count(*) from table where dim1="_not_present_dim_" group by "dummy_key".
CURRENT: Returns a row with count(*) = 0
EXPECTED: Return no row

select 'A', dim1 from foo where m1 = 123123 and dim1 = '_not_present_again_' group by dim1
CURRENT: Returns a row with ('A', 'wat')
EXPECTED: Return no row

To do this, a boolean droppedDimensionsWhileApplyingProject has been added to Grouping which is true whenever we make changes to the original shape with optimization. Hence if a timeseries query has a grouping with this set to true, we set skipEmptyBuckets=true in the query context (i.e. donot return any row).
2022-01-07 21:53:48 +05:30
Jonathan Wei 9b598407c1
Add interface for external schema provider to Druid SQL (#12043)
* Add interfce for external schema provider to Druid SQL

* Add annotations
2021-12-22 22:17:57 +05:30
somu-imply 1871a1ab18
ARRAY_AGG and STRING_AGG will through errors if invoked on a complex datatype (#12089) 2021-12-21 17:41:04 -08:00
Laksh Singla 16642fb278
Fix incorrect type conversion in DruidLogicalValueRule (#11923)
DruidLogicalValuesRule while transforming to DruidRel can return incorrect values, if during the creation of the literal it was created from a float value. The BigDecimal representation stores 123.0, and it seems that using RexLiteral's method while conversion returns the inflated value (which is 1230). I am unsure if this is intentional from Calcite's perspective, and the actual change should be done somewhere else.

Extract the values of INT/LONG from the RexLiteral in the DruidLogicalValuesRule, via BigDecimal.longValue() method.
2021-12-15 10:44:35 +05:30
Clint Wylie 244c2559e9
fix IncrementalIndex performance regression (#12048)
changes:
* IncrementalIndex is now a ColumnInspector
* fixes performance regression from using map of ColumnCapabilities from IncrementalIndex as a RowSignature
2021-12-09 22:04:32 -08:00
Abhishek Agarwal 7abf847eae
Return 400 when SQL query cannot be planned (#12033)
In this PR, we will now return 400 instead of 500 when SQL query cannot be planned. I also fixed a bug where error messages were not getting sent to the users in case the rules throw UnsupportSQLQueryException.
2021-12-08 21:49:54 +05:30
Laksh Singla ca260dfef6
Intern RowSignature in DruidSchema to reduce its memory footprint (#12001)
DruidSchema consists of a concurrent HashMap of DataSource -> Segement -> AvailableSegmentMetadata. AvailableSegmentMetadata contains RowSignature of the segment, and for each segment, a new object is getting created. RowSignature is an immutable class, and hence it can be interned, and this can lead to huge savings of memory being used in broker, since a lot of the segments of a table would potentially have same RowSignature.
2021-12-08 15:11:13 +05:30
Clint Wylie 45be2be368
fix issues with multi-value string constant expressions (#12025)
* add specialized constant selector for multi-valued string constants
2021-12-08 00:10:26 -08:00
Abhishek Agarwal 834aae096a
Human-readable and actionable SQL error messages (#11911)
This PR does two things

1. It adds the capability to surface missing features in SQL to users - The calcite planner will explore through multiple rules to convert a logical SQL query to a druid native query. Some rules change the shape of the query itself, optimize it and some rules are responsible for translating the query into a druid native query. These are DruidQueryRule, DruidOuterQueryRule, DruidJoinRule, DruidUnionDataSourceRule, DruidUnionRule etc. These rules will look at SQL and will do the necessary transformation. But if the rule can't transform the query, it returns back the control to the calcite planner without recording why was it not able to transform. E.g. there is a join query with a non-equal join condition. DruidJoinRule will look at the condition, see that it is not supported, and return back the control. The reason can be that a query can be planned in many different ways so if one rule can't parse it, the query may still be parseable by other rules. In this PR, we are intercepting these gaps and passing them back to the user if the query could not be planned at all.

2. The said capability has been used to generate actionable errors for some common unsupported SQL features. However, not all possible errors are covered and we can keep adding more in the future.
2021-12-07 09:44:08 +05:30
Laksh Singla 44b2fb71ab
Fix the error case when there are multi top level unions (#12017)
This is a follow up to the PR #11908. This fixes the bug in top level union all queries when there are more than 2 SQL subqueries are present.
2021-12-07 01:12:02 +05:30
Jihoon Son 1f052b43c5
Better serverView exec name; remove SingleServerInventoryView (#11770)
Druid currently has 2 serverViews, regular serverView and filtered serverView. The regular serverView is used to monitor all segment announcements from all data nodes (historicals, tasks, indexers). The filtered serverView is used when you want to watch segment announcements from particular tiers. Since these server views keep track of different sets of druidServers and segments in memory, they should be maintained separately. However, they currently share the same name for their executorService, which can cause confusion and make debugging harder especially in the broker since it is using both serverViews, the filtered view for normal query processing and the regular view to serve the servers table (I'm unsure whether this is intended or whether this is a good behavior). This PR changes it to a more obvious name.

This PR also removes SingleServerInventoryView. This view was deprecated a long time ago and has not been documented at least since 0.13 (#6127). I also don't think this can be better in any case than BatchServerInventoryView. Finally, I merged AbstractCuratorServerInventoryView and BatchServerInventoryView as we no longer need AbstractCuratorServerInventoryView after SingleServerInventoryView is removed.
2021-12-04 18:43:05 +05:30
Gian Merlino e0e05aad99
Enhancements to IndexTaskClient. (#12011)
* Enhancements to IndexTaskClient.

1) Ability to use handlers other than StringFullResponseHandler. This
   functionality is not used in production code yet, but is useful
   because it will allow tasks to communicate with each other in
   non-string-based formats and in streaming fashion. In the future,
   we'll be able to use this to make task-to-task communication
   more efficient.

2) Truncate server errors at 1KB, so long errors do not pollute logs.

3) Change error log level for retryable errors from WARN to INFO. (The
   final error is still WARN.)

4) Harmonize log and exception messages to have a more consistent format.

* Additional tests and improvements.
2021-12-03 09:14:32 -08:00
Clint Wylie af6541a236
allow `DruidSchema` to fallback to segment metadata 'type' if 'typeSignature' is null (#12016)
* allow `DruidSchema` to fallback to segment metadata type if typeSignature is null, to avoid producing incorrect SQL schema if broker is upgraded to 0.23 before historicals

* mmm, forbidden tests
2021-12-02 17:42:01 -08:00
Clint Wylie 84b4bf56d8
vectorize logical operators and boolean functions (#11184)
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
2021-12-02 16:40:23 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Kashif Faraz b48f5a576b
Fix: Do not require time condition on InlineDataSource (#11982)
For queries on logical values, e.g. SELECT 1337, we need not check for
a filter on __time column even if requireTimeCondition is true.
2021-11-25 21:10:06 +05:30
Laksh Singla c381cae51b
Improve the output of SQL explain message (#11908)
Currently, when we try to do EXPLAIN PLAN FOR, it returns the structure of the SQL parsed (via Calcite's internal planner util), which is verbose (since it tries to explain about the nodes in the SQL, instead of the Druid Query), and not representative of the native Druid query which will get executed on the broker side.

This PR aims to change the format when user tries to EXPLAIN PLAN FOR for queries which are executed by converting them into Druid's native queries (i.e. not sys schemas).
2021-11-25 21:08:33 +05:30
Rohan Garg 2c08055962
Specify time column for first/last aggregators (#11949)
Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.
2021-11-25 09:44:14 +05:30
Gian Merlino 0354407655
SQL INSERT planner support. (#11959)
* SQL INSERT planner support.

The main changes are:

1) DruidPlanner is able to validate and authorize INSERT queries. They
   require WRITE permission on the target datasource.

2) QueryMaker is now an interface, and there is a QueryMakerFactory that
   creates instances of it. There is only one production implementation
   of each (NativeQueryMaker and NativeQueryMakerFactory), which
   together behave the same way as the former QueryMaker class. But this
   opens the door to executing queries in ways other than the Druid
   query stack, and is used by unit tests (CalciteInsertDmlTest) to
   test the INSERT planning functionality.

3) Adds an EXTERN table macro that allows references external data using
   InputSource and InputFormat from Druid's batch ingestion API. This is
   not exposed in production yet, but is used by unit tests.

4) Adds a QueryFeature concept that enables the planner to change its
   behavior slightly depending on the capabilities of the execution
   system.

5) Adds an "AuthorizableOperator" concept that enables SqlOperators
   to require additional permissions. This is used by the EXTERN table
   macro.

Related odds and ends:

- Add equals, hashCode, toString methods to InlineInputSource. Aids in
  the "from external" tests in CalciteInsertDmlTest.
- Add JSON-serializability to RowSignature.
- Move the SQL string inside PlannerContext so it is "baked into" the
  planner when the planner is created. Cleans up the code a bit, since
  in practice, the same query is passed in every time to the
  same planner anyway.

* Fix up calls to CalciteTests.createMockQueryLifecycleFactory.

* Fix checkstyle issues.

* Adjustments for CI.

* Adjust DruidAvaticaHandlerTest for stricter test authorizations.
2021-11-24 12:14:04 -08:00
Maytas Monsereenusorn bb3d2a433a
Support filtering data in Auto Compaction (#11922)
* add impl

* fix checkstyle

* add test

* add test

* add unit tests

* fix unit tests

* fix unit tests

* fix unit tests

* add IT

* add IT

* add comments

* fix spelling
2021-11-24 10:56:38 -08:00
Abhishek Agarwal b6a0fbc8b6
Break down CalciteQueryTest (#11979)
* Refactor calciteQueryTest

* Move more tests to CalciteJoinQueryTest
2021-11-24 00:15:42 +05:30
Laksh Singla b5a25f24f2
Improve the DruidRexExecutor w.r.t handling of numeric arrays (#11968)
DruidRexExecutor while reducing Arrays, specially numeric arrays, doesn't convert the value from ExprResult's type to BigDecimal, which causes makeLiteral to cast the values. Also, if NaN or Infinite values are present in the array, the error is a generic NumberFormatException. For example:

SELECT ARRAY[1.11, 2.22] returns [1, 2]
SELECT SQRT(-1) throws a generic NumberFormatException instead of IAE

This PR introduces change to cast the numeric values to BigDecimal since Calcite's library understands that easily, and doesn't perform casts.
2021-11-23 11:40:59 +05:30
TSFenwick a4cb1de87a
get rid of class cast exception and add a new testcase for that issue (#11951) 2021-11-22 08:44:20 -08:00
Gian Merlino b3502c3e50
DruidViewMacro: Remove unused escalator field. (#11931)
* DruidViewMacro: Remove unused escalator field.

* Remove additional unused fields.
2021-11-19 16:06:29 -08:00
Gian Merlino 36ee0367ff
Scan: Add "orderBy" parameter. (#11930)
* Scan: Add "orderBy" parameter.

This patch adds an API for requesting non-time orderings, although it
does not actually add the ability to execute such queries.

The changes are done in such a way that no matter how Scan query objects
are constructed, they will have a correct "getOrderBy". This will enable
us to switch the execution to exclusively use "getOrderBy" later on when
it's implemented.

Scan queries are serialized such that they only include "order" (time
order) if the ordering is time-based, and they only include "orderBy" if
the ordering is non-time-based. This maximizes compatibility with
the existing API while also providing a clean look for formatted queries.

Because this patch does not include execution logic, if someone actually
tries to run a query with non-time ordering, then they will get an error
like "Cannot execute query with orderBy [quality ASC]".

* SQL module fixes.

* Add spotbugs-exclude.

* Remove unused method.
2021-11-19 08:19:12 -08:00