Originally written by @AlexanderSaydakov in druid-io/druid-io.github.io#448.
I also added redirects and updated links to point to the new
datasketches-extension.html landing page for the extension, rather than to
the old page about theta sketches.
* SQL: Upgrade to Calcite 1.14.0, some refactoring of internals.
This brings benefits:
- Ability to do GROUP BY and ORDER BY with ordinals.
- Ability to support IN filters beyond 19 elements (fixes#4203).
Some refactoring of druid-sql internals:
- Builtin aggregators and operators are implemented as SqlAggregators
and SqlOperatorConversions rather being special cases. This simplifies
the Expressions and GroupByRules code, which were becoming complex.
- SqlAggregator implementations are no longer responsible for filtering.
Added new functions:
- Expressions: strpos.
- SQL: TRUNCATE, TRUNC, LENGTH, CHAR_LENGTH, STRLEN, STRPOS, SUBSTR,
and DATE_TRUNC.
* Add missing @Override annotation.
* Adjustments for forbidden APIs.
* Adjustments for forbidden APIs.
* Disable GROUP BY alias.
* Doc reword.
* Move scan-query from a contrib extension into core.
Based on a proposal at: https://groups.google.com/d/topic/druid-development/ME_OatUDnbk/discussion
This patch also adds support for virtual columns to the Scan query,
and updates Druid SQL to use Scan instead of Select.
This patch also makes some behavioral changes to handling of the __time
column. In particular, it is now is returned as "__time" rather than
"timestamp"; it is no longer included if you do not specifically ask for
it in your "columns"; and it is returned as a long rather than a string.
Users can revert time handling to the legacy extension behavior by
setting "legacy" : true in their queries, or setting the property
druid.query.scan.legacy = true. This is meant to provide a migration
path for users that were formerly using the contrib extension.
* Adjustments from review.
* Add back Select query.
* Adjust SQL docs.
* Restore SelectQuery link.
* SQL: Full TRIM support.
- Support trimming arbitrary characters
- Support BOTH, LEADING, and TRAILING
* Remove unused import.
* Fix tests, add RTRIM / LTRIM.
* Remove unused imports.
* BTRIM and docs.
* Replace for with foreach.
* Improved SQL support for floats and doubles.
- Use Druid FLOAT for SQL FLOAT, and Druid DOUBLE for SQL DOUBLE, REAL,
and DECIMAL.
- Use float* aggregators when appropriate.
- Add tests involving both float and double columns.
- Adjust documentation accordingly.
* CR comments.
* Fix braces.
* SQL + Expressions = Best friends forever.
- Use expressions as a projection layer for anything that can't be
expressed using traditional Druid extractionFns. Sometimes they're
embedded directly (like "expression" filters, builtin aggregators,
or "expression" post-aggregators). Sometimes they're referenced
through virtual columns (like dimensionSpecs, which can't innately
reference functions of more than one column without the virtual
column layer).
- Add many new functions and operators, taking advantage of the
expression capability (see the querying/sql.md doc).
- Improve consistency of constant reduction and of casting by
using Druid expressions for this instead of Calcite's RexExecutor.
* Fix casting bug, and other code review comments.
* Fix docs.
This puts all the SQL stuff in one place. It also makes life easier by
pointing out that configs be made in either common.runtime.properties
or the broker runtime.properties.
* SQL: Resolve column type conflicts in favor of newer segments.
Helps with schema evolution from e.g. long -> float, which is supported
on the query side.
* Take columns from highest timestamp instead of max segment id.
* Fixes and docs.
* SQL: Add context and contextual functions to planner.
Added support for context parameters specified as JDBC connection properties
or a JSON object for SQL-over-JSON-over-HTTP.
Also added features that depend on context functionality:
- Added CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP functions.
- Added support for time zones other than UTC via a "timeZone" context.
- Pass down query context to Druid queries too.
Also some bug fixes:
- Fix DATE handling, it was largely done incorrectly before.
- Fix CAST(__time TO DATE) which should do a floor-to-day.
- Fix non-equality comparisons to FLOOR(__time TO X).
- Fix maxQueryCount property.
* Pass down context to nested queries too.
* SQL: Add resolution parameter to quantile agg, rename to APPROX_QUANTILE.
* Fix bug with re-use of filtered approximate histogram aggregators.
Also add APPROX_QUANTILE tests for filtering and running on complex columns.
Includes some slight refactoring to allow tests to make DruidTables that
include complex columns.
* Remove unused import
* SQL: Ditch CalciteConnection layer and add DruidMeta, extension aggregators.
Switched from CalciteConnection to Planner, bringing benefits:
- CalciteConnection's JDBC interface no longer sits between the SQL server
(HTTP/Avatica) and Druid's query layer. Instead, the SQL servers can use
Druid Sequence objects directly, reducing overhead in the query return path.
- Implemented our own Planner-based Avatica Meta, letting us control
connection timeouts and connection / statement limits. The previous
CalciteConnection-based implementation didn't have any limits or timeouts.
- The Planner interface lets us override the operator table, opening up
SQL language extensions. This patch includes two: APPROX_COUNT_DISTINCT
in core, and a QUANTILE aggregator in the druid-histogram extension.
Also:
- Added INFORMATION_SCHEMA metadata schema.
- Added tests for Unicode literals and escapes.
* Verify statement is actually open before closing it.
* More detailed INFORMATION_SCHEMA docs.
* SQL support for nested groupBys.
Allows, for example, doing exact count distinct by writing:
SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)
Contrast with approximate count distinct, which is:
SELECT COUNT(DISTINCT col) FROM druid.foo
* Add deeply-nested groupBy docs, tests, and maxQueryCount config.
* Extract magic constants into statics.
* Rework rules to put preconditions in the "matches" method.