mirror of https://github.com/apache/druid.git
Jupyter nested columns tutorial (#14788)
This commit is contained in:
parent
faf79470ae
commit
2561477e87
|
@ -0,0 +1,426 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Working with nested columns\n",
|
||||
"\n",
|
||||
"<!--\n",
|
||||
" ~ Licensed to the Apache Software Foundation (ASF) under one\n",
|
||||
" ~ or more contributor license agreements. See the NOTICE file\n",
|
||||
" ~ distributed with this work for additional information\n",
|
||||
" ~ regarding copyright ownership. The ASF licenses this file\n",
|
||||
" ~ to you under the Apache License, Version 2.0 (the\n",
|
||||
" ~ \"License\"); you may not use this file except in compliance\n",
|
||||
" ~ with the License. You may obtain a copy of the License at\n",
|
||||
" ~\n",
|
||||
" ~ http://www.apache.org/licenses/LICENSE-2.0\n",
|
||||
" ~\n",
|
||||
" ~ Unless required by applicable law or agreed to in writing,\n",
|
||||
" ~ software distributed under the License is distributed on an\n",
|
||||
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
|
||||
" ~ KIND, either express or implied. See the License for the\n",
|
||||
" ~ specific language governing permissions and limitations\n",
|
||||
" ~ under the License.\n",
|
||||
" -->\n",
|
||||
"\n",
|
||||
"This tutorial demonstrates how to work with [nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in Apache Druid.\n",
|
||||
"\n",
|
||||
"Druid stores nested data structures in `COMPLEX<json>` columns. In this tutorial you perform the following tasks:\n",
|
||||
"\n",
|
||||
"- Ingest nested JSON data using SQL-based ingestion.\n",
|
||||
"- Transform nested data during ingestion using SQL JSON functions.\n",
|
||||
"- Perform queries to display, filter, and aggregate nested data.\n",
|
||||
"- Use helper operators to examine nested data and plan your queries.\n",
|
||||
"\n",
|
||||
"Druid supports directly ingesting nested data with the following formats: JSON, Parquet, Avro, ORC, Protobuf."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Table of contents\n",
|
||||
"\n",
|
||||
"- [Prerequisites](#Prerequisites)\n",
|
||||
"- [Initialization](#Initialization)\n",
|
||||
"- [Ingest nested data](#Ingest-nested-data)\n",
|
||||
"- [Transform nested data](#Transform-nested-data)\n",
|
||||
"- [Query nested data](#Query-nested-data)\n",
|
||||
"- [Group, filter, and aggregate nested data](#Group-filter-and-aggregate-nested-data)\n",
|
||||
"- [Use helper operators](#Use-helper-operators)\n",
|
||||
"- [Learn more](#Learn-more)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prerequisites\n",
|
||||
"\n",
|
||||
"This tutorial works with Druid 25.0.0 or later.\n",
|
||||
"\n",
|
||||
"### Run with Docker\n",
|
||||
"\n",
|
||||
"Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
|
||||
"\n",
|
||||
"### Run without Docker\n",
|
||||
"\n",
|
||||
"If you do not use the Docker Compose environment, you need the following:\n",
|
||||
"\n",
|
||||
"* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.\n",
|
||||
"* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialization"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run the next cell to set up the Druid Python client's connection to Apache Druid.\n",
|
||||
"\n",
|
||||
"If successful, the Druid version number will be shown in the output."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import druidapi\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"if 'DRUID_HOST' not in os.environ.keys():\n",
|
||||
" druid_host=f\"http://localhost:8888\"\n",
|
||||
"else:\n",
|
||||
" druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
|
||||
" \n",
|
||||
"print(f\"Opening a connection to {druid_host}.\")\n",
|
||||
"druid = druidapi.jupyter_client(druid_host)\n",
|
||||
"\n",
|
||||
"display = druid.display\n",
|
||||
"sql_client = druid.sql\n",
|
||||
"status_client = druid.status\n",
|
||||
"\n",
|
||||
"status_client.version"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Ingest nested data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run the following cell to ingest sample clickstream data from the [Koalas to the Max](https://www.koalastothemax.com/) game."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"INSERT INTO example_koalas_nesteddata\n",
|
||||
" WITH \"source\" AS\n",
|
||||
" (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
|
||||
" '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"client_ip\",\"type\":\"string\"},\n",
|
||||
" {\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},{\"name\":\"event\",\"type\":\"COMPLEX<json>\"},\n",
|
||||
" {\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
|
||||
" SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
|
||||
" \"client_ip\", \n",
|
||||
" \"session\", \n",
|
||||
" \"session_length\", \n",
|
||||
" \"event\", \n",
|
||||
" \"agent\", \n",
|
||||
" \"geo_ip\"\n",
|
||||
" FROM \"source\"\n",
|
||||
" PARTITIONED BY DAY\n",
|
||||
"'''\n",
|
||||
"\n",
|
||||
"sql_client.run_task(sql)\n",
|
||||
"sql_client.wait_until_ready(\"example_koalas_nesteddata\")\n",
|
||||
"display.table(\"example_koalas_nesteddata\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Druid reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments.\n",
|
||||
"\n",
|
||||
"Wait for the table detail to display, then run the following cell to query the data and return selected columns from 3 rows. Note the nested structure of the `event`, `agent`, and `geo_ip` columns."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"SELECT session, event, agent, geo_ip \n",
|
||||
"FROM example_koalas_nesteddata LIMIT 3\n",
|
||||
"'''\n",
|
||||
"resp = sql_client.sql_query(sql)\n",
|
||||
"resp.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Transform nested data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can use Druid's [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to transform nested data in your ingestion query.\n",
|
||||
"\n",
|
||||
"Run the following cell to insert sample data into a new datasource named `example_koalas_nesteddata_transform`. The SELECT query extracts the `country` and `city` elements from the nested `geo_ip` column and creates a composite object `sessionDetails` containing `session` and `session_length`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"INSERT INTO example_koalas_nesteddata_transform\n",
|
||||
" WITH \"source\" AS\n",
|
||||
" (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
|
||||
" '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},\n",
|
||||
" {\"name\":\"event\",\"type\":\"COMPLEX<json>\"},{\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
|
||||
" SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
|
||||
" JSON_QUERY(geo_ip, '$.country') as country,\n",
|
||||
" JSON_QUERY(geo_ip, '$.city') as city,\n",
|
||||
" JSON_OBJECT('session':session, 'session_length':session_length) as sessionDetails\n",
|
||||
" FROM \"source\"\n",
|
||||
" PARTITIONED BY DAY\n",
|
||||
"'''\n",
|
||||
"\n",
|
||||
"sql_client.run_task(sql)\n",
|
||||
"sql_client.wait_until_ready(\"example_koalas_nesteddata_transform\")\n",
|
||||
"display.table(\"example_koalas_nesteddata_transform\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When the table detail displays, run the following cell to query the data and return `country`, `city`, and `sessionDetails` from 3 rows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"SELECT country, city, sessionDetails \n",
|
||||
"FROM example_koalas_nesteddata_transform \n",
|
||||
"LIMIT 3\n",
|
||||
"'''\n",
|
||||
"resp = sql_client.sql_query(sql)\n",
|
||||
"resp.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Query nested data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run the following cell to display the data types for columns in the `example_koalas_nesteddata` datasource. Note that nested columns display as `COMPLEX<json>`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE\n",
|
||||
"FROM INFORMATION_SCHEMA.COLUMNS\n",
|
||||
"WHERE TABLE_NAME = 'example_koalas_nesteddata'\n",
|
||||
"'''\n",
|
||||
"resp = sql_client.sql_query(sql)\n",
|
||||
"resp.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can use [`JSON_VALUE`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to extract specific elements from a `COMPLEX<json>` object.\n",
|
||||
" \n",
|
||||
"Run the following cell to extract `continent` from `geo_ip` and `category` from `agent` for 3 rows:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"SELECT JSON_VALUE(geo_ip, '$.continent') as continent,\n",
|
||||
"JSON_VALUE(agent, '$.category') as category\n",
|
||||
"FROM example_koalas_nesteddata LIMIT 3\n",
|
||||
"'''\n",
|
||||
"resp = sql_client.sql_query(sql)\n",
|
||||
"resp.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Group, filter, and aggregate nested data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run the following cell to see how you can use the SELECT COUNT(DISTINCT) operator with `JSON_VALUE`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"SELECT COUNT(DISTINCT(JSON_VALUE(geo_ip, '$.city'))) as \"Number of cities\"\n",
|
||||
"FROM example_koalas_nesteddata\n",
|
||||
"'''\n",
|
||||
"resp = sql_client.sql_query(sql)\n",
|
||||
"resp.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Run the following cell to filter and group a query using `JSON_VALUE`. The query selects the `browser` element from the `agent` column and the `country` and `city` elements from the `geo_ip` column, for all rows with city `Helsinki`. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"SELECT JSON_VALUE(agent, '$.browser') as browser,\n",
|
||||
"JSON_VALUE(geo_ip, '$.country') as country,\n",
|
||||
"JSON_VALUE(geo_ip, '$.city') as city\n",
|
||||
"FROM example_koalas_nesteddata\n",
|
||||
"WHERE JSON_VALUE(geo_ip, '$.city') in ('Helsinki')\n",
|
||||
"GROUP BY 1,2,3\n",
|
||||
"ORDER BY 1\n",
|
||||
"'''\n",
|
||||
"resp = sql_client.sql_query(sql)\n",
|
||||
"resp.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Use helper operators"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can use SQL helper operators such as [`JSON_KEYS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) and [`JSON_PATHS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to examine nested data and plan your queries. Run the following cell to return an array of field names and an array of paths for the `geo_ip` nested column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sql = '''\n",
|
||||
"SELECT ARRAY_CONCAT_AGG(DISTINCT JSON_KEYS(geo_ip, '$.')) as \"geo_ip keys\",\n",
|
||||
"ARRAY_CONCAT_AGG(DISTINCT JSON_PATHS(geo_ip)) as \"geo_ip paths\"\n",
|
||||
"FROM example_koalas_nesteddata\n",
|
||||
"'''\n",
|
||||
"resp = sql_client.sql_query(sql)\n",
|
||||
"resp.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Learn more"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This tutorial covers the basics of working with nested data. To learn more about nested data in Druid and related Druid features, see the following topics:\n",
|
||||
"\n",
|
||||
"- [Nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) for information about the nested columns feature, with ingestion and query examples. \n",
|
||||
"- [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) for details on all of the functions you used in this tutorial.\n",
|
||||
"- [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) for information on how to use Druid SQL-based ingestion."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.3"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "a4289e5b8bae5973a6609d90f7bc464162478362b9a770893a3c5c597b0b36e7"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
Loading…
Reference in New Issue