Sql docs items (#12530)

* touch up sql refactor

* brush up SQL refactor

* incorporate feedback

* reorder sql

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
This commit is contained in:
Charles Smith 2022-05-17 16:56:31 -07:00 committed by GitHub
parent 177638f171
commit 3e8d7a6d9f
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
17 changed files with 293 additions and 314 deletions

View File

@ -54,7 +54,7 @@ The table datasource is the most common type. This is the kind of datasource you
[data ingestion](../ingestion/index.md). They are split up into segments, distributed around the cluster,
and queried in parallel.
In [Druid SQL](sql-syntax.md#from), table datasources reside in the `druid` schema. This is the default schema, so table
In [Druid SQL](sql.md#from), table datasources reside in the `druid` schema. This is the default schema, so table
datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
@ -91,7 +91,7 @@ SELECT k, v FROM lookup.countries
```
<!--END_DOCUSAURUS_CODE_TABS-->
Lookup datasources correspond to Druid's key-value [lookup](lookups.md) objects. In [Druid SQL](sql-syntax.md#from),
Lookup datasources correspond to Druid's key-value [lookup](lookups.md) objects. In [Druid SQL](sql.md#from),
they reside in the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
They can be joined onto regular tables using the [join operator](#join).
@ -139,10 +139,10 @@ FROM (
<!--END_DOCUSAURUS_CODE_TABS-->
Unions allow you to treat two or more tables as a single datasource. In SQL, this is done with the UNION ALL operator
applied directly to tables, called a ["table-level union"](sql-syntax.md#table-level). In native queries, this is done with a
applied directly to tables, called a ["table-level union"](sql.md#table-level). In native queries, this is done with a
"union" datasource.
With SQL [table-level unions](sql-syntax.md#table-level) the same columns must be selected from each table in the same order,
With SQL [table-level unions](sql.md#table-level) the same columns must be selected from each table in the same order,
and those columns must either have the same types, or types that can be implicitly cast to each other (such as different
numeric types). For this reason, it is more robust to write your queries to select specific columns.

View File

@ -22,10 +22,10 @@ title: "Joins"
~ under the License.
-->
Druid has two features related to joining of data:
Apache Druid has two features related to joining of data:
1. [Join](datasource.md#join) operators. These are available using a [join datasource](datasource.md#join) in native
queries, or using the [JOIN operator](sql-syntax.md) in Druid SQL. Refer to the
queries, or using the [JOIN operator](sql.md) in Druid SQL. Refer to the
[join datasource](datasource.md#join) documentation for information about how joins work in Druid.
2. [Query-time lookups](lookups.md), simple key-to-value mappings. These are preloaded on all servers that are involved
in queries and can be accessed with or without an explicit join operator. Refer to the [lookups](lookups.md)

View File

@ -24,7 +24,7 @@ title: "Sorting and limiting (groupBy)"
> Apache Druid supports two query languages: [Druid SQL](sql.md) and [native queries](querying.md).
> This document describes the native
> language. For information about sorting in SQL, refer to the [SQL documentation](sql-syntax.md#order-by).
> language. For information about sorting in SQL, refer to the [SQL documentation](sql.md#order-by).
The limitSpec field provides the functionality to sort and limit the set of results from a groupBy query. If you group by a single dimension and are ordering by a single metric, we highly recommend using [TopN Queries](../querying/topnquery.md) instead. The performance will be substantially better. Available options are:

View File

@ -33,7 +33,8 @@ sidebar_label: "Aggregation functions"
> Apache Druid supports two query languages: Druid SQL and [native queries](querying.md).
> This document describes the SQL language.
You can use aggregation functions in the SELECT clause of any query.
You can use aggregation functions in the SELECT clause of any [Druid SQL](./sql.md) query.
Filter any aggregator using the FILTER clause, for example:
```

View File

@ -26,7 +26,7 @@ sidebar_label: "Druid SQL API"
> Apache Druid supports two query languages: Druid SQL and [native queries](querying.md).
> This document describes the SQL language.
You can submit and cancel Druid SQL queries using the Druid SQL API.
You can submit and cancel [Druid SQL](./sql.md) queries using the Druid SQL API.
The Druid SQL API is available at `https://ROUTER:8888/druid/v2/sql`, where `ROUTER` is the IP address of the Druid Router.
## Submit a query

View File

@ -27,7 +27,7 @@ sidebar_label: "SQL data types"
> This document describes the SQL language.
Columns in Druid are associated with a specific data type. This topic describes supported data types in Druid SQL.
Columns in Druid are associated with a specific data type. This topic describes supported data types in [Druid SQL](./sql.md).
## Standard types

View File

@ -27,7 +27,7 @@ sidebar_label: "JDBC driver API"
> This document describes the SQL language.
You can make Druid SQL queries using the [Avatica JDBC driver](https://calcite.apache.org/avatica/downloads/). We recommend using Avatica JDBC driver version 1.17.0 or later. Note that as of the time of this writing, Avatica 1.17.0, the latest version, does not support passing connection string parameters from the URL to Druid, so you must pass them using a `Properties` object. Once you've downloaded the Avatica client jar, add it to your classpath and use the connect string `jdbc:avatica:remote:url=http://BROKER:8082/druid/v2/sql/avatica/`.
You can make [Druid SQL](./sql.md) queries using the [Avatica JDBC driver](https://calcite.apache.org/avatica/downloads/). We recommend using Avatica JDBC driver version 1.17.0 or later. Note that as of the time of this writing, Avatica 1.17.0, the latest version, does not support passing connection string parameters from the URL to Druid, so you must pass them using a `Properties` object. Once you've downloaded the Avatica client jar, add it to your classpath and use the connect string `jdbc:avatica:remote:url=http://BROKER:8082/druid/v2/sql/avatica/`.
Example code:
@ -74,7 +74,7 @@ Note that the non-JDBC [JSON over HTTP](sql-api.md#submit-a-query) API is statel
## Dynamic parameters
You can use [parameterized queries](sql-syntax.md#dynamic-parameters) in JDBC code, as in this example:
You can use [parameterized queries](sql.md#dynamic-parameters) in JDBC code, as in this example:
```java
PreparedStatement statement = connection.prepareStatement("SELECT COUNT(*) AS cnt FROM druid.foo WHERE dim1 = ? OR dim1 = ?");

View File

@ -28,7 +28,7 @@ sidebar_label: "SQL metadata tables"
Druid Brokers infer table and column metadata for each datasource from segments loaded in the cluster, and use this to
plan SQL queries. This metadata is cached on Broker startup and also updated periodically in the background through
plan [SQL queries](./sql.md). This metadata is cached on Broker startup and also updated periodically in the background through
[SegmentMetadata queries](segmentmetadataquery.md). Background metadata refreshing is triggered by
segments entering and exiting the cluster, and can also be throttled through configuration.

View File

@ -35,7 +35,7 @@ sidebar_label: "Multi-value string functions"
> This document describes the SQL language.
Druid supports string dimensions containing multiple values.
This page describes the operations you can perform on multi-value string dimensions.
This page describes the operations you can perform on multi-value string dimensions using [Druid SQL](./sql.md).
See [Multi-value dimensions](multi-value-dimensions.md) for more information.
All "array" references in the multi-value string function documentation can refer to multi-value string columns or

View File

@ -35,7 +35,7 @@ sidebar_label: "Operators"
> This document describes the SQL language.
Operators in Druid SQL typically operate on one or two values and return a result based on the values. Types of operators in Druid SQL include arithmetic, comparison, logical, and more, as described here.
Operators in [Druid SQL](./sql.md) typically operate on one or two values and return a result based on the values. Types of operators in Druid SQL include arithmetic, comparison, logical, and more, as described here.
## Arithmetic operators

View File

@ -26,7 +26,7 @@ sidebar_label: "SQL query context"
> Apache Druid supports two query languages: Druid SQL and [native queries](querying.md).
> This document describes the SQL language.
Druid supports query context parameters which affect SQL planning.
Druid supports query context parameters which affect [SQL query](./sql.md) planning.
See [Query context](query-context.md) for general query context parameters for all query types.
## SQL query context parameters

View File

@ -34,7 +34,7 @@ sidebar_label: "Scalar functions"
> Apache Druid supports two query languages: Druid SQL and [native queries](querying.md).
> This document describes the SQL language.
Druid SQL includes scalar functions that include numeric and string functions, IP address functions, Sketch functions, and more, as described on this page.
[Druid SQL](./sql.md) includes scalar functions that include numeric and string functions, IP address functions, Sketch functions, and more, as described on this page.
## Numeric functions
@ -96,7 +96,7 @@ String functions accept strings, and return a type appropriate to the function.
|`CHAR_LENGTH(expr)`|Alias for `LENGTH`.|
|`CHARACTER_LENGTH(expr)`|Alias for `LENGTH`.|
|`STRLEN(expr)`|Alias for `LENGTH`.|
|`LOOKUP(expr, lookupName)`|Look up `expr` in a registered [query-time lookup table](lookups.md). Note that lookups can also be queried directly using the [`lookup` schema](sql-syntax.md#from).|
|`LOOKUP(expr, lookupName)`|Look up `expr` in a registered [query-time lookup table](lookups.md). Note that lookups can also be queried directly using the [`lookup` schema](sql.md#from).|
|`LOWER(expr)`|Returns `expr` in all lowercase.|
|`UPPER(expr)`|Returns `expr` in all uppercase.|
|`PARSE_LONG(string, [radix])`|Parses a string into a long (BIGINT) with the given radix, or 10 (decimal) if a radix is not provided.|

View File

@ -1,236 +0,0 @@
---
id: sql-syntax
title: "SQL query syntax"
sidebar_label: "SQL query syntax"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> Apache Druid supports two query languages: Druid SQL and [native queries](querying.md).
> This document describes the SQL language.
Druid SQL supports SELECT queries with the following structure:
```
[ EXPLAIN PLAN FOR ]
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM { <table> | (<subquery>) | <o1> [ INNER | LEFT ] JOIN <o2> ON condition }
[ WHERE expr ]
[ GROUP BY [ exprs | GROUPING SETS ( (exprs), ... ) | ROLLUP (exprs) | CUBE (exprs) ] ]
[ HAVING expr ]
[ ORDER BY expr [ ASC | DESC ], expr [ ASC | DESC ], ... ]
[ LIMIT limit ]
[ OFFSET offset ]
[ UNION ALL <another query> ]
```
## FROM
The FROM clause can refer to any of the following:
- [Table datasources](datasource.md#table) from the `druid` schema. This is the default schema, so Druid table
datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
- [Lookups](datasource.md#lookup) from the `lookup` schema, for example `lookup.countries`. Note that lookups can
also be queried using the [`LOOKUP` function](sql-scalar.md#string-functions).
- [Subqueries](datasource.md#query).
- [Joins](datasource.md#join) between anything in this list, except between native datasources (table, lookup,
query) and system tables. The join condition must be an equality between expressions from the left- and right-hand side
of the join.
- [Metadata tables](sql-metadata-tables.md) from the `INFORMATION_SCHEMA` or `sys` schemas. Unlike the other options for the
FROM clause, metadata tables are not considered datasources. They exist only in the SQL layer.
For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
documentation.
## WHERE
The WHERE clause refers to columns in the FROM table, and will be translated to [native filters](filters.md). The
WHERE clause can also reference a subquery, like `WHERE col1 IN (SELECT foo FROM ...)`. Queries like this are executed
as a join on the subquery, described in the [Query translation](sql-translation.md#subqueries) section.
Strings and numbers can be compared in the WHERE clause of a SQL query through implicit type conversion.
For example, you can evaluate `WHERE stringDim = 1` for a string-typed dimension named `stringDim`.
However, for optimal performance, you should explicitly cast the reference number as a string when comparing against a string dimension:
```
WHERE stringDim = '1'
```
Similarly, if you compare a string-typed dimension with reference to an array of numbers, cast the numbers to strings:
```
WHERE stringDim IN ('1', '2', '3')
```
Note that explicit type casting does not lead to significant performance improvement when comparing strings and numbers involving numeric dimensions since numeric dimensions are not indexed.
## GROUP BY
The GROUP BY clause refers to columns in the FROM table. Using GROUP BY, DISTINCT, or any aggregation functions will
trigger an aggregation query using one of Druid's [three native aggregation query types](sql-translation.md#query-types). GROUP BY
can refer to an expression or a select clause ordinal position (like `GROUP BY 2` to group by the second selected
column).
The GROUP BY clause can also refer to multiple grouping sets in three ways. The most flexible is GROUP BY GROUPING SETS,
for example `GROUP BY GROUPING SETS ( (country, city), () )`. This example is equivalent to a `GROUP BY country, city`
followed by `GROUP BY ()` (a grand total). With GROUPING SETS, the underlying data is only scanned one time, leading to
better efficiency. Second, GROUP BY ROLLUP computes a grouping set for each level of the grouping expressions. For
example `GROUP BY ROLLUP (country, city)` is equivalent to `GROUP BY GROUPING SETS ( (country, city), (country), () )`
and will produce grouped rows for each country / city pair, along with subtotals for each country, along with a grand
total. Finally, GROUP BY CUBE computes a grouping set for each combination of grouping expressions. For example,
`GROUP BY CUBE (country, city)` is equivalent to `GROUP BY GROUPING SETS ( (country, city), (country), (city), () )`.
Grouping columns that do not apply to a particular row will contain `NULL`. For example, when computing
`GROUP BY GROUPING SETS ( (country, city), () )`, the grand total row corresponding to `()` will have `NULL` for the
"country" and "city" columns. Column may also be `NULL` if it was `NULL` in the data itself. To differentiate such rows,
you can use `GROUPING` aggregation.
When using GROUP BY GROUPING SETS, GROUP BY ROLLUP, or GROUP BY CUBE, be aware that results may not be generated in the
order that you specify your grouping sets in the query. If you need results to be generated in a particular order, use
the ORDER BY clause.
## HAVING
The HAVING clause refers to columns that are present after execution of GROUP BY. It can be used to filter on either
grouping expressions or aggregated values. It can only be used together with GROUP BY.
## ORDER BY
The ORDER BY clause refers to columns that are present after execution of GROUP BY. It can be used to order the results
based on either grouping expressions or aggregated values. ORDER BY can refer to an expression or a select clause
ordinal position (like `ORDER BY 2` to order by the second selected column). For non-aggregation queries, ORDER BY
can only order by the `__time` column. For aggregation queries, ORDER BY can order by any column.
## LIMIT
The LIMIT clause limits the number of rows returned. In some situations Druid will push down this limit to data servers,
which boosts performance. Limits are always pushed down for queries that run with the native Scan or TopN query types.
With the native GroupBy query type, it is pushed down when ordering on a column that you are grouping by. If you notice
that adding a limit doesn't change performance very much, then it's possible that Druid wasn't able to push down the
limit for your query.
## OFFSET
The OFFSET clause skips a certain number of rows when returning results.
If both LIMIT and OFFSET are provided, then OFFSET will be applied first, followed by LIMIT. For example, using
LIMIT 100 OFFSET 10 will return 100 rows, starting from row number 10.
Together, LIMIT and OFFSET can be used to implement pagination. However, note that if the underlying datasource is
modified between page fetches, then the different pages will not necessarily align with each other.
There are two important factors that can affect the performance of queries that use OFFSET:
- Skipped rows still need to be generated internally and then discarded, meaning that raising offsets to high values
can cause queries to use additional resources.
- OFFSET is only supported by the Scan and GroupBy [native query types](sql-translation.md#query-types). Therefore, a query with OFFSET
will use one of those two types, even if it might otherwise have run as a Timeseries or TopN. Switching query engines
in this way can affect performance.
## UNION ALL
The "UNION ALL" operator fuses multiple queries together. Druid SQL supports the UNION ALL operator in two situations:
top-level and table-level. Queries that use UNION ALL in any other way will not be able to execute.
### Top-level
UNION ALL can be used at the very top outer layer of a SQL query (not in a subquery, and not in the FROM clause). In
this case, the underlying queries will be run separately, back to back. Their results will be concatenated together
and appear one after the other.
For example:
```
SELECT COUNT(*) FROM tbl WHERE my_column = 'value1'
UNION ALL
SELECT COUNT(*) FROM tbl WHERE my_column = 'value2'
```
With top-level UNION ALL, no further processing can be done after the UNION ALL. For example, the results of the
UNION ALL cannot have GROUP BY, ORDER BY, or any other operators applied to them.
### Table-level
UNION ALL can be used to query multiple tables at the same time. In this case, it must appear in a subquery in the
FROM clause, and the lower-level subqueries that are inputs to the UNION ALL operator must be simple table SELECTs.
Features like expressions, column aliasing, JOIN, GROUP BY, ORDER BY, and so on cannot be used. The query will run
natively using a [union datasource](datasource.md#union).
The same columns must be selected from each table in the same order, and those columns must either have the same types,
or types that can be implicitly cast to each other (such as different numeric types). For this reason, it is generally
more robust to write your queries to select specific columns. If you use `SELECT *`, you will need to modify your
queries if a new column is added to one of the tables but not to the others.
For example:
```
SELECT col1, COUNT(*)
FROM (
SELECT col1, col2, col3 FROM tbl1
UNION ALL
SELECT col1, col2, col3 FROM tbl2
)
GROUP BY col1
```
With table-level UNION ALL, the rows from the unioned tables are not guaranteed to be processed in
any particular order. They may be processed in an interleaved fashion. If you need a particular result ordering,
use [ORDER BY](#order-by) on the outer query.
## EXPLAIN PLAN
Add "EXPLAIN PLAN FOR" to the beginning of any query to get information about how it will be translated. In this case,
the query will not actually be executed. Refer to the [Query translation](sql-translation.md#interpreting-explain-plan-output)
documentation for more information on the output of EXPLAIN PLAN.
> Be careful when interpreting EXPLAIN PLAN output, and use [request logging](../configuration/index.md#request-logging) if in doubt.
Request logs show the exact native query that will be run.
## Identifiers and literals
Identifiers like datasource and column names can optionally be quoted using double quotes. To escape a double quote
inside an identifier, use another double quote, like `"My ""very own"" identifier"`. All identifiers are case-sensitive
and no implicit case conversions are performed.
Literal strings should be quoted with single quotes, like `'foo'`. Literal strings with Unicode escapes can be written
like `U&'fo\00F6'`, where character codes in hex are prefixed by a backslash. Literal numbers can be written in forms
like `100` (denoting an integer), `100.0` (denoting a floating point value), or `1.0e5` (scientific notation). Literal
timestamps can be written like `TIMESTAMP '2000-01-01 00:00:00'`. Literal intervals, used for time arithmetic, can be
written like `INTERVAL '1' HOUR`, `INTERVAL '1 02:03' DAY TO MINUTE`, `INTERVAL '1-2' YEAR TO MONTH`, and so on.
## Dynamic parameters
Druid SQL supports dynamic parameters using question mark (`?`) syntax, where parameters are bound to `?` placeholders
at execution time. To use dynamic parameters, replace any literal in the query with a `?` character and provide a
corresponding parameter value when you execute the query. Parameters are bound to the placeholders in the order in
which they are passed. Parameters are supported in both the [HTTP POST](sql-api.md) and [JDBC](sql-jdbc.md) APIs.
In certain cases, using dynamic parameters in expressions can cause type inference issues which cause your query to fail, for example:
```sql
SELECT * FROM druid.foo WHERE dim1 like CONCAT('%', ?, '%')
```
To solve this issue, explicitly provide the type of the dynamic parameter using the `CAST` keyword. Consider the fix for the preceding example:
```
SELECT * FROM druid.foo WHERE dim1 like CONCAT('%', CAST (? AS VARCHAR), '%')
```

View File

@ -26,9 +26,11 @@ sidebar_label: "SQL query translation"
> Apache Druid supports two query languages: Druid SQL and [native queries](querying.md).
> This document describes the SQL language.
Druid uses [Apache Calcite](https://calcite.apache.org/) to parse and plan SQL queries.
Druid translates SQL statements into its [native JSON-based query language](querying.md).
In general, the slight overhead of translating SQL on the Broker is the only minor performance penalty to using Druid SQL compared to native queries.
Druid SQL translates SQL queries to [native queries](querying.md) before running them, and understanding how this
translation works is key to getting good performance.
This topic includes best practices and tools to help you achieve good performance and minimize the impact of translation.
## Best practices
@ -60,7 +62,7 @@ appreciated.
## Interpreting EXPLAIN PLAN output
The [EXPLAIN PLAN](sql-syntax.md#explain-plan) functionality can help you understand how a given SQL query will
The [EXPLAIN PLAN](sql.md#explain-plan) functionality can help you understand how a given SQL query will
be translated to native. For simple queries that do not involve subqueries or joins, the output of EXPLAIN PLAN
is easy to interpret. The native query that will run is embedded as JSON inside a "DruidQueryRel" line:
@ -228,3 +230,27 @@ in the query context. Since it is set to 1,000,000,000 by default, you don't nee
See [accuracy information](https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch) in the DataSketches documentation for how many bytes are required per stream length.
This query context parameter is a temporary solution to avoid the known issue. It may be removed in a future release after the bug is fixed.
## Unsupported features
Druid does not support all SQL features. In particular, the following features are not supported.
- JOIN between native datasources (table, lookup, subquery) and [system tables](sql-metadata-tables.md).
- JOIN conditions that are not an equality between expressions from the left- and right-hand sides.
- JOIN conditions containing a constant value inside the condition.
- JOIN conditions on a column which contains a multi-value dimension.
- OVER clauses, and analytic functions such as `LAG` and `LEAD`.
- ORDER BY for a non-aggregating query, except for `ORDER BY __time` or `ORDER BY __time DESC`, which are supported.
This restriction only applies to non-aggregating queries; you can ORDER BY any column in an aggregating query.
- DDL and DML.
- Using Druid-specific functions like `TIME_PARSE` and `APPROX_QUANTILE_DS` on [system tables](sql-metadata-tables.md).
Additionally, some Druid native query features are not supported by the SQL language. Some unsupported Druid features
include:
- [Inline datasources](datasource.md#inline).
- [Spatial filters](../development/geo.md).
- [Multi-value dimensions](sql-data-types.md#multi-value-strings) are only partially implemented in Druid SQL. There are known
inconsistencies between their behavior in SQL queries and in native queries due to how they are currently treated by
the SQL planner.

View File

@ -1,7 +1,7 @@
---
id: sql
title: "Druid SQL overview"
sidebar_label: "Druid SQL overview"
sidebar_label: "Overview and syntax"
---
<!--
@ -26,37 +26,237 @@ sidebar_label: "Druid SQL overview"
> Apache Druid supports two query languages: Druid SQL and [native queries](querying.md).
> This document describes the SQL language.
You can query data in Druid datasources using Druid SQL.
Druid uses [Apache Calcite](https://calcite.apache.org/) to parse and plan SQL queries.
Druid translates SQL statements into its native JSON-based query language.
Other than the slight overhead of translating SQL on the Broker, there isn't an
additional performance penalty to using Druid SQL compared to native queries.
You can query data in Druid datasources using [Druid SQL](./sql.md). Druid translates SQL queries into its [native query language](./querying.md). To learn about translation and how to get the best performance from Druid SQL, see [SQL query translation](./sql-translation.md).
Druid SQL planning occurs on the Broker.
Set [Broker runtime properties](../configuration/index.md#sql) to configure the query plan and JDBC querying.
See [Defining SQL permissions](../operations/security-user-auth.md#sql-permissions)
for information on permissions needed to make SQL queries.
For information on permissions needed to make SQL queries, see [Defining SQL permissions](../operations/security-user-auth.md#sql-permissions).
This topic introduces Druid SQL syntax.
For more information and SQL querying options see:
- [Data types](./sql-data-types.md) for a list of supported data types for Druid columns.
- [Aggregation functions](./sql-aggregations.md) for a list of aggregation functions available for Druid SQL SELECT statements.
- [Scalar functions](./sql-scalar.md) for Druid SQL scalar functions including numeric and string functions, IP address functions, Sketch functions, and more.
- [SQL multi-value string functions](./sql-multivalue-string-functions.md) for operations you can perform on string dimensions containing multiple values.
- [Query translation](./sql-translation.md) for information about how Druid translates SQL queries to native queries before running them.
For information about APIs, see:
- [Druid SQL API](./sql-api.md) for information on the HTTP API.
- [SQL JDBC driver API](./sql-jdbc.md) for information about the JDBC driver API.
- [SQL query context](./sql-query-context.md) for information about the query context parameters that affect SQL planning.
## Syntax
Druid SQL supports SELECT queries with the following structure:
```
[ EXPLAIN PLAN FOR ]
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM { <table> | (<subquery>) | <o1> [ INNER | LEFT ] JOIN <o2> ON condition }
[ WHERE expr ]
[ GROUP BY [ exprs | GROUPING SETS ( (exprs), ... ) | ROLLUP (exprs) | CUBE (exprs) ] ]
[ HAVING expr ]
[ ORDER BY expr [ ASC | DESC ], expr [ ASC | DESC ], ... ]
[ LIMIT limit ]
[ OFFSET offset ]
[ UNION ALL <another query> ]
```
## FROM
The FROM clause can refer to any of the following:
- [Table datasources](datasource.md#table) from the `druid` schema. This is the default schema, so Druid table
datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
- [Lookups](datasource.md#lookup) from the `lookup` schema, for example `lookup.countries`. Note that lookups can
also be queried using the [`LOOKUP` function](sql-scalar.md#string-functions).
- [Subqueries](datasource.md#query).
- [Joins](datasource.md#join) between anything in this list, except between native datasources (table, lookup,
query) and system tables. The join condition must be an equality between expressions from the left- and right-hand side
of the join.
- [Metadata tables](sql-metadata-tables.md) from the `INFORMATION_SCHEMA` or `sys` schemas. Unlike the other options for the
FROM clause, metadata tables are not considered datasources. They exist only in the SQL layer.
For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md)
documentation.
## WHERE
The WHERE clause refers to columns in the FROM table, and will be translated to [native filters](filters.md). The
WHERE clause can also reference a subquery, like `WHERE col1 IN (SELECT foo FROM ...)`. Queries like this are executed
as a join on the subquery, described in the [Query translation](sql-translation.md#subqueries) section.
Strings and numbers can be compared in the WHERE clause of a SQL query through implicit type conversion.
For example, you can evaluate `WHERE stringDim = 1` for a string-typed dimension named `stringDim`.
However, for optimal performance, you should explicitly cast the reference number as a string when comparing against a string dimension:
```
WHERE stringDim = '1'
```
Similarly, if you compare a string-typed dimension with reference to an array of numbers, cast the numbers to strings:
```
WHERE stringDim IN ('1', '2', '3')
```
Note that explicit type casting does not lead to significant performance improvement when comparing strings and numbers involving numeric dimensions since numeric dimensions are not indexed.
## GROUP BY
The GROUP BY clause refers to columns in the FROM table. Using GROUP BY, DISTINCT, or any aggregation functions will
trigger an aggregation query using one of Druid's [three native aggregation query types](sql-translation.md#query-types). GROUP BY
can refer to an expression or a select clause ordinal position (like `GROUP BY 2` to group by the second selected
column).
The GROUP BY clause can also refer to multiple grouping sets in three ways. The most flexible is GROUP BY GROUPING SETS,
for example `GROUP BY GROUPING SETS ( (country, city), () )`. This example is equivalent to a `GROUP BY country, city`
followed by `GROUP BY ()` (a grand total). With GROUPING SETS, the underlying data is only scanned one time, leading to
better efficiency. Second, GROUP BY ROLLUP computes a grouping set for each level of the grouping expressions. For
example `GROUP BY ROLLUP (country, city)` is equivalent to `GROUP BY GROUPING SETS ( (country, city), (country), () )`
and will produce grouped rows for each country / city pair, along with subtotals for each country, along with a grand
total. Finally, GROUP BY CUBE computes a grouping set for each combination of grouping expressions. For example,
`GROUP BY CUBE (country, city)` is equivalent to `GROUP BY GROUPING SETS ( (country, city), (country), (city), () )`.
Grouping columns that do not apply to a particular row will contain `NULL`. For example, when computing
`GROUP BY GROUPING SETS ( (country, city), () )`, the grand total row corresponding to `()` will have `NULL` for the
"country" and "city" columns. Column may also be `NULL` if it was `NULL` in the data itself. To differentiate such rows,
you can use `GROUPING` aggregation.
When using GROUP BY GROUPING SETS, GROUP BY ROLLUP, or GROUP BY CUBE, be aware that results may not be generated in the
order that you specify your grouping sets in the query. If you need results to be generated in a particular order, use
the ORDER BY clause.
## HAVING
The HAVING clause refers to columns that are present after execution of GROUP BY. It can be used to filter on either
grouping expressions or aggregated values. It can only be used together with GROUP BY.
## ORDER BY
The ORDER BY clause refers to columns that are present after execution of GROUP BY. It can be used to order the results
based on either grouping expressions or aggregated values. ORDER BY can refer to an expression or a select clause
ordinal position (like `ORDER BY 2` to order by the second selected column). For non-aggregation queries, ORDER BY
can only order by the `__time` column. For aggregation queries, ORDER BY can order by any column.
## LIMIT
The LIMIT clause limits the number of rows returned. In some situations Druid will push down this limit to data servers,
which boosts performance. Limits are always pushed down for queries that run with the native Scan or TopN query types.
With the native GroupBy query type, it is pushed down when ordering on a column that you are grouping by. If you notice
that adding a limit doesn't change performance very much, then it's possible that Druid wasn't able to push down the
limit for your query.
## OFFSET
The OFFSET clause skips a certain number of rows when returning results.
If both LIMIT and OFFSET are provided, then OFFSET will be applied first, followed by LIMIT. For example, using
LIMIT 100 OFFSET 10 will return 100 rows, starting from row number 10.
Together, LIMIT and OFFSET can be used to implement pagination. However, note that if the underlying datasource is
modified between page fetches, then the different pages will not necessarily align with each other.
There are two important factors that can affect the performance of queries that use OFFSET:
- Skipped rows still need to be generated internally and then discarded, meaning that raising offsets to high values
can cause queries to use additional resources.
- OFFSET is only supported by the Scan and GroupBy [native query types](sql-translation.md#query-types). Therefore, a query with OFFSET
will use one of those two types, even if it might otherwise have run as a Timeseries or TopN. Switching query engines
in this way can affect performance.
## UNION ALL
The "UNION ALL" operator fuses multiple queries together. Druid SQL supports the UNION ALL operator in two situations:
top-level and table-level. Queries that use UNION ALL in any other way will not be able to execute.
### Top-level
UNION ALL can be used at the very top outer layer of a SQL query (not in a subquery, and not in the FROM clause). In
this case, the underlying queries will be run separately, back to back. Their results will be concatenated together
and appear one after the other.
For example:
```
SELECT COUNT(*) FROM tbl WHERE my_column = 'value1'
UNION ALL
SELECT COUNT(*) FROM tbl WHERE my_column = 'value2'
```
With top-level UNION ALL, no further processing can be done after the UNION ALL. For example, the results of the
UNION ALL cannot have GROUP BY, ORDER BY, or any other operators applied to them.
### Table-level
UNION ALL can be used to query multiple tables at the same time. In this case, it must appear in a subquery in the
FROM clause, and the lower-level subqueries that are inputs to the UNION ALL operator must be simple table SELECTs.
Features like expressions, column aliasing, JOIN, GROUP BY, ORDER BY, and so on cannot be used. The query will run
natively using a [union datasource](datasource.md#union).
The same columns must be selected from each table in the same order, and those columns must either have the same types,
or types that can be implicitly cast to each other (such as different numeric types). For this reason, it is generally
more robust to write your queries to select specific columns. If you use `SELECT *`, you will need to modify your
queries if a new column is added to one of the tables but not to the others.
For example:
```
SELECT col1, COUNT(*)
FROM (
SELECT col1, col2, col3 FROM tbl1
UNION ALL
SELECT col1, col2, col3 FROM tbl2
)
GROUP BY col1
```
With table-level UNION ALL, the rows from the unioned tables are not guaranteed to be processed in
any particular order. They may be processed in an interleaved fashion. If you need a particular result ordering,
use [ORDER BY](#order-by) on the outer query.
## EXPLAIN PLAN
Add "EXPLAIN PLAN FOR" to the beginning of any query to get information about how it will be translated. In this case,
the query will not actually be executed. Refer to the [Query translation](sql-translation.md#interpreting-explain-plan-output)
documentation for more information on the output of EXPLAIN PLAN.
> Be careful when interpreting EXPLAIN PLAN output, and use [request logging](../configuration/index.md#request-logging) if in doubt.
Request logs show the exact native query that will be run.
## Identifiers and literals
Identifiers like datasource and column names can optionally be quoted using double quotes. To escape a double quote
inside an identifier, use another double quote, like `"My ""very own"" identifier"`. All identifiers are case-sensitive
and no implicit case conversions are performed.
Literal strings should be quoted with single quotes, like `'foo'`. Literal strings with Unicode escapes can be written
like `U&'fo\00F6'`, where character codes in hex are prefixed by a backslash. Literal numbers can be written in forms
like `100` (denoting an integer), `100.0` (denoting a floating point value), or `1.0e5` (scientific notation). Literal
timestamps can be written like `TIMESTAMP '2000-01-01 00:00:00'`. Literal intervals, used for time arithmetic, can be
written like `INTERVAL '1' HOUR`, `INTERVAL '1 02:03' DAY TO MINUTE`, `INTERVAL '1-2' YEAR TO MONTH`, and so on.
## Dynamic parameters
Druid SQL supports dynamic parameters using question mark (`?`) syntax, where parameters are bound to `?` placeholders
at execution time. To use dynamic parameters, replace any literal in the query with a `?` character and provide a
corresponding parameter value when you execute the query. Parameters are bound to the placeholders in the order in
which they are passed. Parameters are supported in both the [HTTP POST](sql-api.md) and [JDBC](sql-jdbc.md) APIs.
In certain cases, using dynamic parameters in expressions can cause type inference issues which cause your query to fail, for example:
```sql
SELECT * FROM druid.foo WHERE dim1 like CONCAT('%', ?, '%')
```
To solve this issue, explicitly provide the type of the dynamic parameter using the `CAST` keyword. Consider the fix for the preceding example:
```
SELECT * FROM druid.foo WHERE dim1 like CONCAT('%', CAST (? AS VARCHAR), '%')
```
## Unsupported features
Druid does not support all SQL features. In particular, the following features are not supported.
- JOIN between native datasources (table, lookup, subquery) and [system tables](sql-metadata-tables.md).
- JOIN conditions that are not an equality between expressions from the left- and right-hand sides.
- JOIN conditions containing a constant value inside the condition.
- JOIN conditions on a column which contains a multi-value dimension.
- OVER clauses, and analytic functions such as `LAG` and `LEAD`.
- ORDER BY for a non-aggregating query, except for `ORDER BY __time` or `ORDER BY __time DESC`, which are supported.
This restriction only applies to non-aggregating queries; you can ORDER BY any column in an aggregating query.
- DDL and DML.
- Using Druid-specific functions like `TIME_PARSE` and `APPROX_QUANTILE_DS` on [system tables](sql-metadata-tables.md).
Additionally, some Druid native query features are not supported by the SQL language. Some unsupported Druid features
include:
- [Inline datasources](datasource.md#inline).
- [Spatial filters](../development/geo.md).
- [Multi-value dimensions](sql-data-types.md#multi-value-strings) are only partially implemented in Druid SQL. There are known
inconsistencies between their behavior in SQL queries and in native queries due to how they are currently treated by
the SQL planner.

View File

@ -24,7 +24,7 @@ title: "Sorting (topN)"
> Apache Druid supports two query languages: [Druid SQL](sql.md) and [native queries](querying.md).
> This document describes the native
> language. For information about sorting in SQL, refer to the [SQL documentation](sql-syntax.md#order-by).
> language. For information about sorting in SQL, refer to the [SQL documentation](sql.md#order-by).
In Apache Druid, the topN metric spec specifies how topN values should be sorted.

View File

@ -65,7 +65,23 @@
"ingestion/faq"
],
"Querying": [
{
"type": "subcategory",
"label": "Druid SQL",
"ids": [
"querying/sql",
"querying/sql-data-types",
"querying/sql-operators",
"querying/sql-scalar",
"querying/sql-aggregations",
"querying/sql-multivalue-string-functions",
"querying/sql-api",
"querying/sql-jdbc",
"querying/sql-query-context",
"querying/sql-metadata-tables",
"querying/sql-translation"
]
},
"querying/querying",
"querying/query-execution",
"querying/troubleshooting",
@ -83,35 +99,7 @@
"querying/query-context"
]
},
{
"type": "subcategory",
"label": "Druid SQL",
"ids": [
"querying/sql-syntax",
"querying/sql-data-types",
"querying/sql-metadata-tables",
"querying/sql-translation"
]
},
{
"type": "subcategory",
"label": "Druid SQL Functions",
"ids": [
"querying/sql-operators",
"querying/sql-scalar",
"querying/sql-aggregations",
"querying/sql-multivalue-string-functions"
]
},
{
"type": "subcategory",
"label": "Druid SQL APIs",
"ids": [
"querying/sql-api",
"querying/sql-jdbc",
"querying/sql-query-context"
]
},
{
"type": "subcategory",
"label": "Native query types",