Jupyter nested columns tutorial (#14788)

This commit is contained in:
Jill Osborne 2023-08-16 22:45:37 +01:00 committed by GitHub
parent faf79470ae
commit 2561477e87
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 426 additions and 0 deletions

View File

@ -0,0 +1,426 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with nested columns\n",
"\n",
"<!--\n",
" ~ Licensed to the Apache Software Foundation (ASF) under one\n",
" ~ or more contributor license agreements. See the NOTICE file\n",
" ~ distributed with this work for additional information\n",
" ~ regarding copyright ownership. The ASF licenses this file\n",
" ~ to you under the Apache License, Version 2.0 (the\n",
" ~ \"License\"); you may not use this file except in compliance\n",
" ~ with the License. You may obtain a copy of the License at\n",
" ~\n",
" ~ http://www.apache.org/licenses/LICENSE-2.0\n",
" ~\n",
" ~ Unless required by applicable law or agreed to in writing,\n",
" ~ software distributed under the License is distributed on an\n",
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" ~ KIND, either express or implied. See the License for the\n",
" ~ specific language governing permissions and limitations\n",
" ~ under the License.\n",
" -->\n",
"\n",
"This tutorial demonstrates how to work with [nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in Apache Druid.\n",
"\n",
"Druid stores nested data structures in `COMPLEX<json>` columns. In this tutorial you perform the following tasks:\n",
"\n",
"- Ingest nested JSON data using SQL-based ingestion.\n",
"- Transform nested data during ingestion using SQL JSON functions.\n",
"- Perform queries to display, filter, and aggregate nested data.\n",
"- Use helper operators to examine nested data and plan your queries.\n",
"\n",
"Druid supports directly ingesting nested data with the following formats: JSON, Parquet, Avro, ORC, Protobuf."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"- [Prerequisites](#Prerequisites)\n",
"- [Initialization](#Initialization)\n",
"- [Ingest nested data](#Ingest-nested-data)\n",
"- [Transform nested data](#Transform-nested-data)\n",
"- [Query nested data](#Query-nested-data)\n",
"- [Group, filter, and aggregate nested data](#Group-filter-and-aggregate-nested-data)\n",
"- [Use helper operators](#Use-helper-operators)\n",
"- [Learn more](#Learn-more)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"This tutorial works with Druid 25.0.0 or later.\n",
"\n",
"### Run with Docker\n",
"\n",
"Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
"\n",
"### Run without Docker\n",
"\n",
"If you do not use the Docker Compose environment, you need the following:\n",
"\n",
"* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.\n",
"* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the next cell to set up the Druid Python client's connection to Apache Druid.\n",
"\n",
"If successful, the Druid version number will be shown in the output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import druidapi\n",
"import os\n",
"\n",
"if 'DRUID_HOST' not in os.environ.keys():\n",
" druid_host=f\"http://localhost:8888\"\n",
"else:\n",
" druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
" \n",
"print(f\"Opening a connection to {druid_host}.\")\n",
"druid = druidapi.jupyter_client(druid_host)\n",
"\n",
"display = druid.display\n",
"sql_client = druid.sql\n",
"status_client = druid.status\n",
"\n",
"status_client.version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ingest nested data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to ingest sample clickstream data from the [Koalas to the Max](https://www.koalastothemax.com/) game."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"INSERT INTO example_koalas_nesteddata\n",
" WITH \"source\" AS\n",
" (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
" '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"client_ip\",\"type\":\"string\"},\n",
" {\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},{\"name\":\"event\",\"type\":\"COMPLEX<json>\"},\n",
" {\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
" SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
" \"client_ip\", \n",
" \"session\", \n",
" \"session_length\", \n",
" \"event\", \n",
" \"agent\", \n",
" \"geo_ip\"\n",
" FROM \"source\"\n",
" PARTITIONED BY DAY\n",
"'''\n",
"\n",
"sql_client.run_task(sql)\n",
"sql_client.wait_until_ready(\"example_koalas_nesteddata\")\n",
"display.table(\"example_koalas_nesteddata\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Druid reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments.\n",
"\n",
"Wait for the table detail to display, then run the following cell to query the data and return selected columns from 3 rows. Note the nested structure of the `event`, `agent`, and `geo_ip` columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT session, event, agent, geo_ip \n",
"FROM example_koalas_nesteddata LIMIT 3\n",
"'''\n",
"resp = sql_client.sql_query(sql)\n",
"resp.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform nested data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use Druid's [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to transform nested data in your ingestion query.\n",
"\n",
"Run the following cell to insert sample data into a new datasource named `example_koalas_nesteddata_transform`. The SELECT query extracts the `country` and `city` elements from the nested `geo_ip` column and creates a composite object `sessionDetails` containing `session` and `session_length`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"INSERT INTO example_koalas_nesteddata_transform\n",
" WITH \"source\" AS\n",
" (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
" '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},\n",
" {\"name\":\"event\",\"type\":\"COMPLEX<json>\"},{\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
" SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
" JSON_QUERY(geo_ip, '$.country') as country,\n",
" JSON_QUERY(geo_ip, '$.city') as city,\n",
" JSON_OBJECT('session':session, 'session_length':session_length) as sessionDetails\n",
" FROM \"source\"\n",
" PARTITIONED BY DAY\n",
"'''\n",
"\n",
"sql_client.run_task(sql)\n",
"sql_client.wait_until_ready(\"example_koalas_nesteddata_transform\")\n",
"display.table(\"example_koalas_nesteddata_transform\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the table detail displays, run the following cell to query the data and return `country`, `city`, and `sessionDetails` from 3 rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT country, city, sessionDetails \n",
"FROM example_koalas_nesteddata_transform \n",
"LIMIT 3\n",
"'''\n",
"resp = sql_client.sql_query(sql)\n",
"resp.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query nested data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to display the data types for columns in the `example_koalas_nesteddata` datasource. Note that nested columns display as `COMPLEX<json>`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE\n",
"FROM INFORMATION_SCHEMA.COLUMNS\n",
"WHERE TABLE_NAME = 'example_koalas_nesteddata'\n",
"'''\n",
"resp = sql_client.sql_query(sql)\n",
"resp.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use [`JSON_VALUE`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to extract specific elements from a `COMPLEX<json>` object.\n",
" \n",
"Run the following cell to extract `continent` from `geo_ip` and `category` from `agent` for 3 rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT JSON_VALUE(geo_ip, '$.continent') as continent,\n",
"JSON_VALUE(agent, '$.category') as category\n",
"FROM example_koalas_nesteddata LIMIT 3\n",
"'''\n",
"resp = sql_client.sql_query(sql)\n",
"resp.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Group, filter, and aggregate nested data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to see how you can use the SELECT COUNT(DISTINCT) operator with `JSON_VALUE`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT COUNT(DISTINCT(JSON_VALUE(geo_ip, '$.city'))) as \"Number of cities\"\n",
"FROM example_koalas_nesteddata\n",
"'''\n",
"resp = sql_client.sql_query(sql)\n",
"resp.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cell to filter and group a query using `JSON_VALUE`. The query selects the `browser` element from the `agent` column and the `country` and `city` elements from the `geo_ip` column, for all rows with city `Helsinki`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT JSON_VALUE(agent, '$.browser') as browser,\n",
"JSON_VALUE(geo_ip, '$.country') as country,\n",
"JSON_VALUE(geo_ip, '$.city') as city\n",
"FROM example_koalas_nesteddata\n",
"WHERE JSON_VALUE(geo_ip, '$.city') in ('Helsinki')\n",
"GROUP BY 1,2,3\n",
"ORDER BY 1\n",
"'''\n",
"resp = sql_client.sql_query(sql)\n",
"resp.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use helper operators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use SQL helper operators such as [`JSON_KEYS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) and [`JSON_PATHS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to examine nested data and plan your queries. Run the following cell to return an array of field names and an array of paths for the `geo_ip` nested column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT ARRAY_CONCAT_AGG(DISTINCT JSON_KEYS(geo_ip, '$.')) as \"geo_ip keys\",\n",
"ARRAY_CONCAT_AGG(DISTINCT JSON_PATHS(geo_ip)) as \"geo_ip paths\"\n",
"FROM example_koalas_nesteddata\n",
"'''\n",
"resp = sql_client.sql_query(sql)\n",
"resp.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learn more"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial covers the basics of working with nested data. To learn more about nested data in Druid and related Druid features, see the following topics:\n",
"\n",
"- [Nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) for information about the nested columns feature, with ingestion and query examples. \n",
"- [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) for details on all of the functions you used in this tutorial.\n",
"- [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) for information on how to use Druid SQL-based ingestion."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.3"
},
"vscode": {
"interpreter": {
"hash": "a4289e5b8bae5973a6609d90f7bc464162478362b9a770893a3c5c597b0b36e7"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}