From 2561477e875e3a8161f45e291f2f961496b23f6e Mon Sep 17 00:00:00 2001 From: Jill Osborne Date: Wed, 16 Aug 2023 22:45:37 +0100 Subject: [PATCH] Jupyter nested columns tutorial (#14788) --- .../02-working-with-nested-columns.ipynb | 426 ++++++++++++++++++ 1 file changed, 426 insertions(+) create mode 100644 examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb diff --git a/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb new file mode 100644 index 00000000000..f4538607c41 --- /dev/null +++ b/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb @@ -0,0 +1,426 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Working with nested columns\n", + "\n", + "\n", + "\n", + "This tutorial demonstrates how to work with [nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in Apache Druid.\n", + "\n", + "Druid stores nested data structures in `COMPLEX` columns. In this tutorial you perform the following tasks:\n", + "\n", + "- Ingest nested JSON data using SQL-based ingestion.\n", + "- Transform nested data during ingestion using SQL JSON functions.\n", + "- Perform queries to display, filter, and aggregate nested data.\n", + "- Use helper operators to examine nested data and plan your queries.\n", + "\n", + "Druid supports directly ingesting nested data with the following formats: JSON, Parquet, Avro, ORC, Protobuf." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Table of contents\n", + "\n", + "- [Prerequisites](#Prerequisites)\n", + "- [Initialization](#Initialization)\n", + "- [Ingest nested data](#Ingest-nested-data)\n", + "- [Transform nested data](#Transform-nested-data)\n", + "- [Query nested data](#Query-nested-data)\n", + "- [Group, filter, and aggregate nested data](#Group-filter-and-aggregate-nested-data)\n", + "- [Use helper operators](#Use-helper-operators)\n", + "- [Learn more](#Learn-more)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "This tutorial works with Druid 25.0.0 or later.\n", + "\n", + "### Run with Docker\n", + "\n", + "Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n", + "\n", + "### Run without Docker\n", + "\n", + "If you do not use the Docker Compose environment, you need the following:\n", + "\n", + "* A running Apache Druid instance, with a `DRUID_HOST` local environment variable containing the server name of your Druid router.\n", + "* [druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md), a Python client for Apache Druid. Follow the instructions in the Install section of the README file." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the next cell to set up the Druid Python client's connection to Apache Druid.\n", + "\n", + "If successful, the Druid version number will be shown in the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import druidapi\n", + "import os\n", + "\n", + "if 'DRUID_HOST' not in os.environ.keys():\n", + " druid_host=f\"http://localhost:8888\"\n", + "else:\n", + " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n", + " \n", + "print(f\"Opening a connection to {druid_host}.\")\n", + "druid = druidapi.jupyter_client(druid_host)\n", + "\n", + "display = druid.display\n", + "sql_client = druid.sql\n", + "status_client = druid.status\n", + "\n", + "status_client.version" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Ingest nested data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the following cell to ingest sample clickstream data from the [Koalas to the Max](https://www.koalastothemax.com/) game." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "INSERT INTO example_koalas_nesteddata\n", + " WITH \"source\" AS\n", + " (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n", + " '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"client_ip\",\"type\":\"string\"},\n", + " {\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},{\"name\":\"event\",\"type\":\"COMPLEX\"},\n", + " {\"name\":\"agent\",\"type\":\"COMPLEX\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX\"}]')))\n", + " SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n", + " \"client_ip\", \n", + " \"session\", \n", + " \"session_length\", \n", + " \"event\", \n", + " \"agent\", \n", + " \"geo_ip\"\n", + " FROM \"source\"\n", + " PARTITIONED BY DAY\n", + "'''\n", + "\n", + "sql_client.run_task(sql)\n", + "sql_client.wait_until_ready(\"example_koalas_nesteddata\")\n", + "display.table(\"example_koalas_nesteddata\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Druid reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments.\n", + "\n", + "Wait for the table detail to display, then run the following cell to query the data and return selected columns from 3 rows. Note the nested structure of the `event`, `agent`, and `geo_ip` columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "SELECT session, event, agent, geo_ip \n", + "FROM example_koalas_nesteddata LIMIT 3\n", + "'''\n", + "resp = sql_client.sql_query(sql)\n", + "resp.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Transform nested data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use Druid's [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to transform nested data in your ingestion query.\n", + "\n", + "Run the following cell to insert sample data into a new datasource named `example_koalas_nesteddata_transform`. The SELECT query extracts the `country` and `city` elements from the nested `geo_ip` column and creates a composite object `sessionDetails` containing `session` and `session_length`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "INSERT INTO example_koalas_nesteddata_transform\n", + " WITH \"source\" AS\n", + " (SELECT * FROM TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n", + " '{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},\n", + " {\"name\":\"event\",\"type\":\"COMPLEX\"},{\"name\":\"agent\",\"type\":\"COMPLEX\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX\"}]')))\n", + " SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n", + " JSON_QUERY(geo_ip, '$.country') as country,\n", + " JSON_QUERY(geo_ip, '$.city') as city,\n", + " JSON_OBJECT('session':session, 'session_length':session_length) as sessionDetails\n", + " FROM \"source\"\n", + " PARTITIONED BY DAY\n", + "'''\n", + "\n", + "sql_client.run_task(sql)\n", + "sql_client.wait_until_ready(\"example_koalas_nesteddata_transform\")\n", + "display.table(\"example_koalas_nesteddata_transform\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When the table detail displays, run the following cell to query the data and return `country`, `city`, and `sessionDetails` from 3 rows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "SELECT country, city, sessionDetails \n", + "FROM example_koalas_nesteddata_transform \n", + "LIMIT 3\n", + "'''\n", + "resp = sql_client.sql_query(sql)\n", + "resp.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Query nested data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the following cell to display the data types for columns in the `example_koalas_nesteddata` datasource. Note that nested columns display as `COMPLEX`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE\n", + "FROM INFORMATION_SCHEMA.COLUMNS\n", + "WHERE TABLE_NAME = 'example_koalas_nesteddata'\n", + "'''\n", + "resp = sql_client.sql_query(sql)\n", + "resp.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use [`JSON_VALUE`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to extract specific elements from a `COMPLEX` object.\n", + " \n", + "Run the following cell to extract `continent` from `geo_ip` and `category` from `agent` for 3 rows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "SELECT JSON_VALUE(geo_ip, '$.continent') as continent,\n", + "JSON_VALUE(agent, '$.category') as category\n", + "FROM example_koalas_nesteddata LIMIT 3\n", + "'''\n", + "resp = sql_client.sql_query(sql)\n", + "resp.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Group, filter, and aggregate nested data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the following cell to see how you can use the SELECT COUNT(DISTINCT) operator with `JSON_VALUE`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "SELECT COUNT(DISTINCT(JSON_VALUE(geo_ip, '$.city'))) as \"Number of cities\"\n", + "FROM example_koalas_nesteddata\n", + "'''\n", + "resp = sql_client.sql_query(sql)\n", + "resp.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the following cell to filter and group a query using `JSON_VALUE`. The query selects the `browser` element from the `agent` column and the `country` and `city` elements from the `geo_ip` column, for all rows with city `Helsinki`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "SELECT JSON_VALUE(agent, '$.browser') as browser,\n", + "JSON_VALUE(geo_ip, '$.country') as country,\n", + "JSON_VALUE(geo_ip, '$.city') as city\n", + "FROM example_koalas_nesteddata\n", + "WHERE JSON_VALUE(geo_ip, '$.city') in ('Helsinki')\n", + "GROUP BY 1,2,3\n", + "ORDER BY 1\n", + "'''\n", + "resp = sql_client.sql_query(sql)\n", + "resp.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Use helper operators" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use SQL helper operators such as [`JSON_KEYS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) and [`JSON_PATHS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) to examine nested data and plan your queries. Run the following cell to return an array of field names and an array of paths for the `geo_ip` nested column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "SELECT ARRAY_CONCAT_AGG(DISTINCT JSON_KEYS(geo_ip, '$.')) as \"geo_ip keys\",\n", + "ARRAY_CONCAT_AGG(DISTINCT JSON_PATHS(geo_ip)) as \"geo_ip paths\"\n", + "FROM example_koalas_nesteddata\n", + "'''\n", + "resp = sql_client.sql_query(sql)\n", + "resp.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learn more" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This tutorial covers the basics of working with nested data. To learn more about nested data in Druid and related Druid features, see the following topics:\n", + "\n", + "- [Nested columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) for information about the nested columns feature, with ingestion and query examples. \n", + "- [SQL JSON functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html) for details on all of the functions you used in this tutorial.\n", + "- [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) for information on how to use Druid SQL-based ingestion." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.3" + }, + "vscode": { + "interpreter": { + "hash": "a4289e5b8bae5973a6609d90f7bc464162478362b9a770893a3c5c597b0b36e7" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}