mirror of https://github.com/apache/druid.git
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ad4e60b6",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"# Learn the basics of Druid SQL\n",
|
|
"\n",
|
|
"<!--\n",
|
|
" ~ Licensed to the Apache Software Foundation (ASF) under one\n",
|
|
" ~ or more contributor license agreements. See the NOTICE file\n",
|
|
" ~ distributed with this work for additional information\n",
|
|
" ~ regarding copyright ownership. The ASF licenses this file\n",
|
|
" ~ to you under the Apache License, Version 2.0 (the\n",
|
|
" ~ \"License\"); you may not use this file except in compliance\n",
|
|
" ~ with the License. You may obtain a copy of the License at\n",
|
|
" ~\n",
|
|
" ~ http://www.apache.org/licenses/LICENSE-2.0\n",
|
|
" ~\n",
|
|
" ~ Unless required by applicable law or agreed to in writing,\n",
|
|
" ~ software distributed under the License is distributed on an\n",
|
|
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
|
|
" ~ KIND, either express or implied. See the License for the\n",
|
|
" ~ specific language governing permissions and limitations\n",
|
|
" ~ under the License.\n",
|
|
" -->\n",
|
|
" \n",
|
|
"Apache Druid supports two query languages: Druid SQL and native queries.\n",
|
|
"Druid SQL is a Structured Query Language (SQL) dialect that enables you to query datasources in Apache Druid using SQL statements.\n",
|
|
"SQL and Druid SQL use similar syntax, with some notable differences.\n",
|
|
"Not all SQL functions are supported in Druid SQL. Instead, Druid includes Druid-specific SQL functions for optimized query performance.\n",
|
|
"\n",
|
|
"This interactive tutorial introduces you to the unique aspects of Druid SQL with the primary focus on the SELECT statement.\n",
|
|
"To learn about native queries, see [Native queries](https://druid.apache.org/docs/latest/querying/querying.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8d6bbbcb",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Prerequisites\n",
|
|
"\n",
|
|
"This tutorial works with Druid 25.0.0 or later.\n",
|
|
"\n",
|
|
"Launch this tutorial and all prerequisites using the `druid-jupyter` or `all-services` profiles of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
|
|
"\n",
|
|
"If you do not use the Docker Compose environment, you need the following:\n",
|
|
"\n",
|
|
"* A running Druid instance.<br>\n",
|
|
" Update the `druid_host` variable to point to your Router endpoint. For example:\n",
|
|
" ```\n",
|
|
" druid_host = \"http://localhost:8888\"\n",
|
|
" ```\n",
|
|
"* The [Druid Python API](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/) to simplify access to Druid.\n",
|
|
"\n",
|
|
"It will also help to have a working knowledge of SQL.\n",
|
|
"\n",
|
|
"\n",
|
|
"To start the tutorial, run the following cell. It imports the required Python packages and defines a variable for the Druid client, and another for the SQL client used to run SQL commands."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b7f08a52",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import druidapi\n",
|
|
"\n",
|
|
"# druid_host is the hostname and port for your Druid deployment. \n",
|
|
"# In the Docker Compose tutorial environment, this is the Router\n",
|
|
"# service running at \"http://router:8888\".\n",
|
|
"# If you are not using the Docker Compose environment, edit the `druid_host`.\n",
|
|
"\n",
|
|
"druid_host = \"http://router:8888\"\n",
|
|
"druid_host\n",
|
|
"\n",
|
|
"\n",
|
|
"druid = druidapi.jupyter_client(druid_host)\n",
|
|
"display = druid.display\n",
|
|
"sql_client = druid.sql"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Druid SQL statements\n",
|
|
"\n",
|
|
"The following are the main Druid SQL statements:\n",
|
|
"\n",
|
|
"* SELECT: extract data from a datasource\n",
|
|
"* INSERT INTO: create a new datasource or append to an existing datasource\n",
|
|
"* REPLACE INTO: create a new datasource or overwrite data in an existing datasource\n",
|
|
"\n",
|
|
"Druid SQL does not support CREATE TABLE, DELETE, and DROP TABLE statements.\n",
|
|
"\n",
|
|
"## Ingest data\n",
|
|
"\n",
|
|
"You can use either INSERT INTO or REPLACE INTO to create a datasource and ingest data.\n",
|
|
"INSERT INTO and REPLACE INTO statements both require the PARTITIONED BY clause which defines the granularity of time-based partitioning. For more information, see [Partitioning by time](https://druid.apache.org/docs/latest/multi-stage-query/concepts.html#partitioning-by-time).\n",
|
|
"\n",
|
|
"Run the following cell to ingest data from an external source into a table called `wikipedia-sql-tutorial`. \n",
|
|
"If you already have a table with the same name, use REPLACE INTO instead of INSERT INTO.\n",
|
|
"\n",
|
|
"Note the following about the query to ingest data:\n",
|
|
"- The query uses the TIME_PARSE function to parse ISO 8601 time strings into timestamps. See the section on [timestamp values](#timestamp-values) for more information.\n",
|
|
"- The asterisk ( * ) tells Druid to ingest all the columns.\n",
|
|
"- The EXTERN function describes the external data: the input source, the input format, and the input schema. See [Read external data with EXTERN](https://druid.apache.org/docs/latest/multi-stage-query/concepts.html#read-external-data-with-extern) for more information.\n",
|
|
"\n",
|
|
"The following cell defines the query, uses MSQ to ingest the data, and waits for the MSQ task to complete. You will see an asterisk `[*]` in the left margin while the task runs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "045f782c-74d8-4447-9487-529071812b51",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"INSERT INTO \"wikipedia-sql-tutorial\" \n",
|
|
"SELECT TIME_PARSE(\"timestamp\") AS __time, * \n",
|
|
"FROM TABLE (EXTERN(\n",
|
|
" '{\"type\": \"http\", \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
|
|
" '{\"type\": \"json\"}', \n",
|
|
" '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\", \"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"}, {\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\", \"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"}, {\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\", \"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\": \"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\": \"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\": \"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\": \"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\": \"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\": \"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\": \"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"regionIsoCode\", \"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"}, {\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]'\n",
|
|
" ))\n",
|
|
"PARTITIONED BY DAY\n",
|
|
"'''\n",
|
|
"sql_client.run_task(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a141e962",
|
|
"metadata": {},
|
|
"source": [
|
|
"MSQ reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments. Wait for the table to become ready."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cca15307",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql_client.wait_until_ready('wikipedia-sql-tutorial')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "240b0ad5-48f2-4737-b12b-5fd5f98da300",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Datasources\n",
|
|
"\n",
|
|
"Druid supports a variety of datasources, with the table datasource being the most common. In Druid documentation, the word \"datasource\" often implicitly refers to the table datasource.\n",
|
|
"The [Datasources](https://druid.apache.org/docs/latest/querying/datasource.html) topic provides a comprehensive overview of datasources supported by Druid SQL.\n",
|
|
"\n",
|
|
"In Druid SQL, table datasources reside in the `druid` schema. This is the default schema, so table datasources can be referenced as either `druid.DATASOURCE_NAME` or `DATASOURCE_NAME`.\n",
|
|
"\n",
|
|
"For example, run the next cell to return the rows of the column named `channel` from the `wikipedia-sql-tutorial` table. Because this tutorial is running in Jupyter, the cells use the LIMIT clause to limit the size of the query results for display purposes. The cell uses the built-in table formatting feature of the Python API. You can also retrieve the values as a Python object if you wish to perform additional processing."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6e5d8de0",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT \"channel\" FROM \"wikipedia-sql-tutorial\" LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "cbeb5a63",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Data types\n",
|
|
"\n",
|
|
"Druid maps SQL data types onto native types at query runtime.\n",
|
|
"The following native types are supported for Druid columns:\n",
|
|
"\n",
|
|
"* SQL: `VARCHAR`, Druid: `STRING`: UTF-8 encoded strings and string arrays\n",
|
|
"* SQL: `BIGINT`, Druid: `LONG`: 64-bit signed integer\n",
|
|
"* SQL & Druid: `FLOAT`: 32-bit floating point\n",
|
|
"* SQL & Druid: `DOUBLE`: 64-bit floating point\n",
|
|
"* Druid `COMPLEX`: represents non-standard data types, such as nested JSON, hyperUnique and approxHistogram aggregators, and DataSketches aggregators\n",
|
|
"\n",
|
|
"For reference on how SQL data types map onto Druid native types, see [Standard types](https://druid.apache.org/docs/latest/querying/sql-data-types.html#standard-types).\n",
|
|
"\n",
|
|
"Druid exposes table and column metadata through [INFORMATION_SCHEMA](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#information-schema) tables. Run the following query to retrieve metadata for the `wikipedia-sql-tutorial` datasource. In the results, each row corresponds to a column in the table.\n",
|
|
"Check the `DATA_TYPE` column for the SQL data types. You should see the TIMESTAMP, BIGINT, and VARCHAR SQL data types."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "c7a86e2e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT COLUMN_NAME, DATA_TYPE \n",
|
|
"FROM INFORMATION_SCHEMA.COLUMNS \n",
|
|
"WHERE \"TABLE_SCHEMA\" = 'druid' AND \"TABLE_NAME\" = 'wikipedia-sql-tutorial' \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "59f41229",
|
|
"metadata": {},
|
|
"source": [
|
|
"This is such a common query that the SQL client has it built in:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "1ac6c410",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"display.table('wikipedia-sql-tutorial')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c59ca797-dd91-442b-8d02-67b711b3fcc6",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Timestamp values\n",
|
|
"\n",
|
|
"Druid stores timestamp values as the number of milliseconds since the Unix epoch.\n",
|
|
"Primary timestamps are stored in a column named `__time`.\n",
|
|
"If a dataset doesn't have a timestamp, Druid uses the default value of `1970-01-01 00:00:00`.\n",
|
|
"\n",
|
|
"Druid time functions perform best when used with the `__time` column.\n",
|
|
"By default, time functions use the UTC time zone.\n",
|
|
"For more information about timestamp handling, see [Date and time functions](https://druid.apache.org/docs/latest/querying/sql-scalar.html#date-and-time-functions).\n",
|
|
"\n",
|
|
"Run the following cell to see a time function at work. This example uses the `TIME_IN_INTERVAL` function to query the `channel` and `page` columns of the `wikipedia-sql-tutorial` table for rows whose timestamp falls within the specified interval. The query groups the results by `channel` and `page`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "16c1a31a",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT channel, page\n",
|
|
"FROM \"wikipedia-sql-tutorial\" \n",
|
|
"WHERE TIME_IN_INTERVAL(__time, '2016-06-27T00:05:54.56/2016-06-27T00:06:53')\n",
|
|
"GROUP BY channel, page\n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
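{
"cell_type": "markdown",
"id": "sketch-epoch-millis",
"metadata": {},
"source": [
"As a rough illustration of the millisecond representation described in this section, the following sketch converts between epoch milliseconds and ISO 8601 strings in UTC using only the Python standard library. The helper names `millis_to_iso` and `iso_to_millis` are hypothetical; they are not part of Druid or the Druid Python API.\n",
"\n",
"```python\n",
"from datetime import datetime, timedelta, timezone\n",
"\n",
"# The Unix epoch in UTC; Druid timestamps count milliseconds from this instant.\n",
"EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)\n",
"\n",
"def millis_to_iso(millis):\n",
"    # Convert epoch milliseconds to an ISO 8601 string in UTC.\n",
"    return (EPOCH + timedelta(milliseconds=millis)).isoformat()\n",
"\n",
"def iso_to_millis(text):\n",
"    # Convert an ISO 8601 string that carries a UTC offset to epoch milliseconds.\n",
"    return (datetime.fromisoformat(text) - EPOCH) // timedelta(milliseconds=1)\n",
"\n",
"print(millis_to_iso(0))\n",
"print(iso_to_millis('2016-06-27T00:05:54.560+00:00'))\n",
"```\n",
"\n",
"Working with `timedelta` values keeps the arithmetic in whole milliseconds, which avoids the rounding that floating-point seconds can introduce."
]
},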
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f7cfdfae-ccba-49ba-a70f-63d0bd3527b2",
|
|
"metadata": {},
|
|
"source": [
|
|
"### NULL values\n",
|
|
"\n",
|
|
"Druid supports SQL compatible NULL handling, allowing string columns to distinguish empty strings from NULL and numeric columns to contain NULL rows. To store and query data in SQL compatible mode, explicitly set the `useDefaultValueForNull` property to `false` in `_common/common.runtime.properties`. See [Configuration reference](https://druid.apache.org/docs/latest/configuration/index.html) for common configuration properties.\n",
|
|
"\n",
|
|
"When `useDefaultValueForNull` is set to `true` (default behavior), Druid stores NULL values as `0` for numeric columns and as `''` for string columns."
|
|
]
|
|
},
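{
"cell_type": "markdown",
"id": "sketch-null-handling",
"metadata": {},
"source": [
"The two modes can be pictured with a toy model. The function below is only a sketch of the behavior described in this section, not Druid code; the name `stored_value` and its parameters are made up for illustration.\n",
"\n",
"```python\n",
"def stored_value(value, column_type, use_default_value_for_null=True):\n",
"    # Toy model: what a column holds after a value is ingested.\n",
"    # Non-NULL values are kept as-is in both modes.\n",
"    if value is not None or not use_default_value_for_null:\n",
"        return value\n",
"    # Default-value mode: NULL becomes '' for strings and 0 for numbers.\n",
"    return '' if column_type == 'string' else 0\n",
"\n",
"print(stored_value(None, 'long'))\n",
"print(stored_value(None, 'string', use_default_value_for_null=False))\n",
"```\n",
"\n",
"In default-value mode, a query cannot tell an ingested NULL apart from a real `0` or empty string, which is the main reason to consider SQL compatible mode."
]
},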
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "29c24856",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## SELECT statement syntax\n",
|
|
"\n",
|
|
"Druid SQL supports SELECT statements with the following structure:\n",
|
|
"\n",
|
|
"``` mysql\n",
|
|
"[ EXPLAIN PLAN FOR ]\n",
|
|
"[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]\n",
|
|
"SELECT [ ALL | DISTINCT ] { * | exprs }\n",
|
|
"FROM { <table> | (<subquery>) | <o1> [ INNER | LEFT ] JOIN <o2> ON condition }\n",
|
|
"[ WHERE expr ]\n",
|
|
"[ GROUP BY [ exprs | GROUPING SETS ( (exprs), ... ) | ROLLUP (exprs) | CUBE (exprs) ] ]\n",
|
|
"[ HAVING expr ]\n",
|
|
"[ ORDER BY expr [ ASC | DESC ], expr [ ASC | DESC ], ... ]\n",
|
|
"[ LIMIT limit ]\n",
|
|
"[ OFFSET offset ]\n",
|
|
"[ UNION ALL <another query> ]\n",
|
|
"```\n",
|
|
"\n",
|
|
"As a general rule, use the LIMIT clause with `SELECT *` to limit the number of rows returned. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8cf212e5-fb3f-4206-acdd-46ef1da327ab",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## WHERE clause\n",
|
|
"\n",
|
|
"Druid SQL uses the [SQL WHERE clause](https://druid.apache.org/docs/latest/querying/sql.html#where) of a SELECT statement to fetch data based on a particular condition.\n",
|
|
"\n",
|
|
"In most cases, filtering your results by time using the WHERE clause improves query performance.\n",
|
|
"This is because Druid partitions data into time chunks and having a time range allows Druid to skip over unrelated data.\n",
|
|
"At ingestion time, you can further partition segments within a time chunk using the CLUSTERED BY clause to improve locality.\n",
|
|
"At query time, using the WHERE clause to filter on clustered dimensions can improve query performance.\n",
|
|
"\n",
|
|
"Druid supports range filtering on columns that contain long millisecond values, with the boundaries specified as ISO 8601 time intervals. This is suitable for the `__time` column, long metric columns, and dimensions with values that can be parsed as long milliseconds.\n",
|
|
" \n",
|
|
"For example, the following cell uses a comparison operator on the `__time` field to filter results from a certain time range."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "123187d3",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT channel, page, comment \n",
|
|
"FROM \"wikipedia-sql-tutorial\" \n",
|
|
"WHERE __time >= TIMESTAMP '2015-09-12 23:33:55' \n",
|
|
" AND namespace = 'Main' \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5160db26-7e8d-40f7-8588-b7eabfc08355",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Comparison operators\n",
|
|
"\n",
|
|
"Druid SQL supports the following comparison operators. You can use these operators in conjunction with the WHERE clause to compare expressions.\n",
|
|
"\n",
|
|
"- equal to (=)\n",
|
|
"- greater than (>)\n",
|
|
"- less than (<)\n",
|
|
"- greater than or equal to (>=)\n",
|
|
"- less than or equal to (<=)\n",
|
|
"- not equal to (<>)\n",
|
|
"\n",
|
|
"For example, the next cell returns the first seven records that match the following criteria:\n",
|
|
"- `cityName` is not an empty string\n",
|
|
"- `countryIsoCode` equals `US`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b4656c81",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT channel, page, comment \n",
|
|
"FROM \"wikipedia-sql-tutorial\" \n",
|
|
"WHERE \"cityName\" <> '' AND \"countryIsoCode\" = 'US' \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dd24470d-25c2-4031-a711-8477d69c9e94",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"### Logical operators\n",
|
|
"\n",
|
|
"Druid's handling of logical operators is comparable to SQL with a few exceptions. For example, if an IN list contains NULL, the IN operator matches NULL values. This behavior is different from the SQL IN operator, which does not match NULL values. For a complete list of logical SQL operators supported by Druid SQL, see [Logical operators](https://druid.apache.org/docs/latest/querying/sql-operators.html#logical-operators)."
|
|
]
|
|
},
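{
"cell_type": "markdown",
"id": "sketch-in-null",
"metadata": {},
"source": [
"The difference can be sketched with a small model of the IN operator. `standard_sql_in` follows standard SQL three-valued logic, with `None` standing for UNKNOWN. `druid_in` changes only the one case described in this section; treating every other case like standard SQL is a simplifying assumption, and both function names are hypothetical.\n",
"\n",
"```python\n",
"def standard_sql_in(value, candidates):\n",
"    # Standard SQL: comparing anything with NULL yields UNKNOWN (None here).\n",
"    if value is None:\n",
"        return None\n",
"    if value in [c for c in candidates if c is not None]:\n",
"        return True\n",
"    # No match: UNKNOWN if the list contains NULL, otherwise FALSE.\n",
"    return None if None in candidates else False\n",
"\n",
"def druid_in(value, candidates):\n",
"    # Case described in this section: a NULL in the list matches NULL values.\n",
"    if value is None and None in candidates:\n",
"        return True\n",
"    # Assumption: all other cases behave like standard SQL.\n",
"    return standard_sql_in(value, candidates)\n",
"\n",
"print(standard_sql_in(None, ['a', None]))\n",
"print(druid_in(None, ['a', None]))\n",
"```\n",
"\n",
"A WHERE clause keeps only rows whose condition is TRUE, so under the standard SQL model the NULL row is filtered out, while under the Druid model it is returned."
]
},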
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2baf21b9-74d1-4df6-862f-afbaeef1812b",
|
|
"metadata": {},
|
|
"source": [
|
|
"## GROUP BY clause\n",
|
|
"\n",
|
|
"Druid SQL uses the [SQL GROUP BY](https://druid.apache.org/docs/latest/querying/sql.html#group-by) clause to separate items into groups, where each group is composed of rows with identical values. \n",
|
|
"The GROUP BY clause is often used with [aggregation functions](https://druid.apache.org/docs/latest/querying/sql-aggregations.html), such as COUNT or SUM, to produce summary values for each group.\n",
|
|
"\n",
|
|
"For example, the following cell counts the entries for each value of the `channel` field. The output is limited to seven rows and has two fields: `channel` and `counts`. For each unique value of `channel`, Druid aggregates all rows having that value, counts the number of entries in the group, and assigns the result to a field called `counts`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "0127e401",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT channel, COUNT(*) AS counts \n",
|
|
"FROM \"wikipedia-sql-tutorial\" \n",
|
|
"GROUP BY channel \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "eab67db4-a0f3-4177-b5be-1fef355bf33f",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can further define the groups by specifying multiple dimensions.\n",
|
|
"Druid SQL supports using numbers in GROUP BY and ORDER BY clauses to refer to column positions in the SELECT clause.\n",
|
|
"Similar to SQL, Druid SQL uses one-based indexing to reference elements in SQL statements.\n",
|
|
"\n",
|
|
"For example, the next cell aggregates entries grouped by fields `cityName` and `countryName`.\n",
|
|
"The output has three fields: `cityName`, `countryName`, and `counts`. For each unique combination of `cityName` and `countryName`, Druid aggregates all rows and counts the entries in the group.\n",
|
|
"The output is limited to seven rows for display purposes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d3724cc3",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT cityName, countryName, COUNT(*) AS counts \n",
|
|
"FROM \"wikipedia-sql-tutorial\" \n",
|
|
"GROUP BY cityName, countryName \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
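{
"cell_type": "markdown",
"id": "sketch-group-by-position",
"metadata": {},
"source": [
"The same grouping can also be written with column positions instead of names. The sketch below only builds and prints the query text; the variable name `sql_by_position` and the added ORDER BY clause are illustrative choices, not part of the original example.\n",
"\n",
"```python\n",
"# Positions are one-based and refer to the SELECT list:\n",
"# 1 = cityName, 2 = countryName, 3 = counts.\n",
"sql_by_position = '''\n",
"SELECT cityName, countryName, COUNT(*) AS counts\n",
"FROM \"wikipedia-sql-tutorial\"\n",
"GROUP BY 1, 2\n",
"ORDER BY 3 DESC\n",
"LIMIT 7\n",
"'''\n",
"print(sql_by_position)\n",
"\n",
"# In this notebook you could run it with: display.sql(sql_by_position)\n",
"```\n",
"\n",
"Positional references are compact, but they silently change meaning if the SELECT list is reordered, so column names are often the safer choice in longer queries."
]
},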
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9c2e56af-3fdf-40f1-869b-822ac8aafbc8",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Query types\n",
|
|
"\n",
|
|
"Druid SQL is optimized for the following [native query](https://druid.apache.org/docs/latest/querying/querying.html) types:\n",
|
|
"- Scan\n",
|
|
"- Timeseries\n",
|
|
"- TopN\n",
|
|
"- GroupBy\n",
|
|
"\n",
|
|
"Native queries are low-level JSON-based queries designed to be lightweight and complete very quickly.\n",
|
|
"Druid translates SQL statements into native queries using the [Apache Calcite](https://calcite.apache.org/) data management framework. The queries are then executed by the Druid cluster.\n",
|
|
"\n",
|
|
"To get information about how a Druid SQL query is translated into a native query type, add [EXPLAIN PLAN FOR](https://druid.apache.org/docs/latest/querying/sql.html#explain-plan) to the beginning of the query.\n",
|
|
"Alternatively, you can set up [request logging](https://druid.apache.org/docs/latest/configuration/index.html#request-logging)."
|
|
]
|
|
},
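{
"cell_type": "markdown",
"id": "sketch-explain-plan",
"metadata": {},
"source": [
"As a minimal sketch of the EXPLAIN PLAN FOR approach, the helper below prefixes a query with the keyword and prints the resulting statement. The helper name `explain` is hypothetical and not part of the Druid Python API.\n",
"\n",
"```python\n",
"def explain(sql):\n",
"    # Prefix a Druid SQL query with EXPLAIN PLAN FOR.\n",
"    return 'EXPLAIN PLAN FOR ' + sql.strip()\n",
"\n",
"query = '''\n",
"SELECT channel, COUNT(*) AS counts\n",
"FROM \"wikipedia-sql-tutorial\"\n",
"GROUP BY channel\n",
"'''\n",
"print(explain(query))\n",
"\n",
"# In this notebook you could try running it with: display.sql(explain(query))\n",
"```\n",
"\n",
"The plan that Druid returns includes the native query, so its `queryType` field shows which of the query types listed in this section was chosen."
]
},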
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2de6e6ba-473c-4ef0-9739-a9472a4c7065",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Scan\n",
|
|
"\n",
|
|
"The Scan native query type returns raw Druid rows in streaming mode.\n",
|
|
"Druid SQL uses the Scan query type for queries that do not aggregate—queries that do not have GROUP BY or DISTINCT clauses.\n",
|
|
"\n",
|
|
"For example, run the next cell to scan the `wikipedia-sql-tutorial` table for comments from Mexico City. Calcite translates this Druid SQL query into the Scan native query type."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "3ed58bfd",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT comment AS \"Entry\" \n",
|
|
"FROM \"wikipedia-sql-tutorial\" \n",
|
|
"WHERE cityName = 'Mexico City' \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "407cf489-3947-4326-9d81-18b38abaee58",
|
|
"metadata": {},
|
|
"source": [
|
|
"### TopN\n",
|
|
"\n",
|
|
"The TopN native query type returns a sorted set of results for the values in a given dimension according to some criteria. TopN results are always computed in memory. In some cases, the TopN query type delivers approximate ranking and results. To prevent this, set the `useApproximateTopN` query context parameter to `false` when calling the [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html). See [SQL query context](https://druid.apache.org/docs/latest/querying/sql-query-context.html) for more information.\n",
|
|
"\n",
|
|
"Druid SQL uses TopN for queries that meet the following criteria:\n",
|
|
"- queries that GROUP BY a single expression\n",
|
|
"- queries that have ORDER BY and LIMIT clauses\n",
|
|
"- queries that do not contain HAVING\n",
|
|
"- queries that are not nested\n",
|
|
"\n",
|
|
"For example, the next cell returns the five channels with the fewest events, sorted by event count in ascending order. Calcite translates this Druid SQL query into the TopN native query type."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "bb694442",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT channel, count(*) as \"Number of events\" \n",
|
|
"FROM \"wikipedia-sql-tutorial\" \n",
|
|
"GROUP BY channel \n",
|
|
"ORDER BY \"Number of events\" ASC \n",
|
|
"LIMIT 5\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
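{
"cell_type": "markdown",
"id": "sketch-topn-context",
"metadata": {},
"source": [
"To request exact TopN results, the query context travels alongside the SQL text in the request body. The following sketch only builds that JSON body; the helper name `sql_request_body` and the variable names are hypothetical, and actually sending the request is left out.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"def sql_request_body(sql, use_approximate_topn=False):\n",
"    # Build the JSON body for a POST request to the Druid SQL API (/druid/v2/sql).\n",
"    return json.dumps({\n",
"        'query': sql,\n",
"        'context': {'useApproximateTopN': use_approximate_topn},\n",
"    })\n",
"\n",
"topn_sql = '''\n",
"SELECT channel, COUNT(*) AS cnt\n",
"FROM \"wikipedia-sql-tutorial\"\n",
"GROUP BY channel\n",
"ORDER BY cnt DESC\n",
"LIMIT 5\n",
"'''\n",
"body = sql_request_body(topn_sql)\n",
"print(body)\n",
"```\n",
"\n",
"You could send `body` to the Router at `/druid/v2/sql` with a `Content-Type: application/json` header, using any HTTP client."
]
},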
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "38ee7d1d-2d47-4d36-b8c6-b0868c36c871",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Timeseries\n",
|
|
"\n",
|
|
"The Timeseries native query type returns an array of JSON objects, where each object represents a value asked for by the Timeseries query.\n",
|
|
"\n",
|
|
"Druid SQL uses Timeseries for queries that meet the following criteria:\n",
|
|
"- queries that GROUP BY `FLOOR(__time TO unit)` or `TIME_FLOOR(__time, period)`\n",
|
|
"- queries that do not contain other grouping expressions\n",
|
|
"- queries that do not contain HAVING\n",
|
|
"- queries that are not nested\n",
|
|
"- queries that either have no ORDER BY clause or an ORDER BY clause that orders by the same expression as present in GROUP BY\n",
|
|
"\n",
|
|
"For example, Calcite translates the following Druid SQL query into the Timeseries native query type:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "644b0cdd",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT \n",
|
|
" countryName AS \"Country\", \n",
|
|
" SUM(deleted) AS deleted, \n",
|
|
" SUM(added) AS added \n",
|
|
"FROM \"wikipedia-sql-tutorial\"\n",
|
|
"WHERE countryName = 'France' \n",
|
|
"GROUP BY countryName , FLOOR(__time TO HOUR) \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "eb5cfc5c-8f58-4e56-a218-7a55867e3e0c",
|
|
"metadata": {},
|
|
"source": [
|
|
"### GroupBy\n",
|
|
"\n",
|
|
"The GroupBy native query type returns an array of JSON objects where each object represents a grouping asked for by the GroupBy query. GroupBy delivers exact results and rankings.\n",
|
|
"\n",
|
|
"Druid SQL uses GroupBy for aggregations, including nested aggregation queries.\n",
|
|
"\n",
|
|
"For example, Calcite translates the following Druid SQL query into the GroupBy native query type:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b4f5d1dd",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"sql = '''\n",
|
|
"SELECT \n",
|
|
"  countryName AS \"Country\", \n",
|
|
" countryIsoCode AS \"ISO\" \n",
|
|
"FROM \"wikipedia-sql-tutorial\"\n",
|
|
"WHERE channel = '#es.wikipedia' \n",
|
|
"GROUP BY countryName, countryIsoCode \n",
|
|
"LIMIT 7\n",
|
|
"'''\n",
|
|
"display.sql(sql)"
|
|
]
|
|
},
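{
"cell_type": "markdown",
"id": "sketch-query-type-rules",
"metadata": {},
"source": [
"The selection criteria from the Scan, TopN, Timeseries, and GroupBy sections can be summarized as a small decision function. This is a simplified reading of the rules listed in this tutorial, not the actual planner logic; the function name and parameters are hypothetical, and the ORDER BY condition for Timeseries is left out for brevity.\n",
"\n",
"```python\n",
"def guess_native_query_type(aggregates, group_by_exprs=0,\n",
"                            groups_only_by_time_floor=False,\n",
"                            has_order_by=False, has_limit=False,\n",
"                            has_having=False, nested=False):\n",
"    # Queries without GROUP BY or DISTINCT are plain scans.\n",
"    if not aggregates:\n",
"        return 'scan'\n",
"    # Timeseries and TopN both require no HAVING and no nesting.\n",
"    simple = not has_having and not nested\n",
"    if simple and groups_only_by_time_floor:\n",
"        return 'timeseries'\n",
"    if simple and group_by_exprs == 1 and has_order_by and has_limit:\n",
"        return 'topN'\n",
"    # Everything else falls back to the general GroupBy query.\n",
"    return 'groupBy'\n",
"\n",
"print(guess_native_query_type(aggregates=False))\n",
"print(guess_native_query_type(True, group_by_exprs=1, has_order_by=True, has_limit=True))\n",
"print(guess_native_query_type(True, group_by_exprs=2))\n",
"```\n",
"\n",
"Treat the result as a first guess only. EXPLAIN PLAN FOR remains the reliable way to see which native query Druid actually chooses."
]
},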
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Learn more\n",
|
|
"\n",
|
|
"This tutorial covers the basics of Druid SQL. To learn more about querying datasets using Druid SQL, see the following topics:\n",
|
|
"\n",
|
|
"- [Druid SQL overview](https://druid.apache.org/docs/latest/querying/sql.html) to learn about Druid SQL syntax.\n",
|
|
"- [SQL data types](https://druid.apache.org/docs/latest/querying/sql-data-types.html) for information on how SQL data types map to Druid native types.\n",
|
|
"- [SQL query translation](https://druid.apache.org/docs/latest/querying/sql-translation.html) for best practices that help you minimize the impact of SQL translation.\n",
|
|
"- [Druid SQL operators](https://druid.apache.org/docs/latest/querying/sql-operators.html) for operators supported by Druid SQL.\n",
|
|
"- [SQL aggregation functions](https://druid.apache.org/docs/latest/querying/sql-aggregations.html) for reference on the aggregation functions supported by Druid SQL. \n",
|
|
"- [Unsupported features](https://druid.apache.org/docs/latest/querying/sql-translation.html#unsupported-features) for a list of SQL features not supported by Druid SQL.\n",
|
|
"- [SQL keywords](https://calcite.apache.org/docs/reference.html#keywords) for a list of SQL keywords."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.8.16"
|
|
},
|
|
"toc-autonumbering": false,
|
|
"toc-showcode": false,
|
|
"toc-showmarkdowntxt": false,
|
|
"toc-showtags": false,
|
|
"vscode": {
|
|
"interpreter": {
|
|
"hash": "392d024d9e577b3899d42c3b7a7b6a06db0d6efdc0b44e46dc281b668e7b3887"
|
|
}
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|