{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ingest and query data from Apache Kafka\n",
"\n",
"<!--\n",
" ~ Licensed to the Apache Software Foundation (ASF) under one\n",
" ~ or more contributor license agreements. See the NOTICE file\n",
" ~ distributed with this work for additional information\n",
" ~ regarding copyright ownership. The ASF licenses this file\n",
" ~ to you under the Apache License, Version 2.0 (the\n",
" ~ \"License\"); you may not use this file except in compliance\n",
" ~ with the License. You may obtain a copy of the License at\n",
" ~\n",
" ~ http://www.apache.org/licenses/LICENSE-2.0\n",
" ~\n",
" ~ Unless required by applicable law or agreed to in writing,\n",
" ~ software distributed under the License is distributed on an\n",
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" ~ KIND, either express or implied. See the License for the\n",
" ~ specific language governing permissions and limitations\n",
" ~ under the License.\n",
" -->\n",
"\n",
"This tutorial introduces you to streaming ingestion in Apache Druid using the Apache Kafka event streaming platform.\n",
"Follow along to learn how to create and load data into a Kafka topic, start ingesting data from the topic into Druid, and query results over time. This tutorial assumes you have a basic understanding of Druid ingestion, querying, and API requests."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"* [Prerequisites](#Prerequisites)\n",
"* [Load Druid API client](#Load-Druid-API-client)\n",
"* [Create Kafka topic](#Create-Kafka-topic)\n",
"* [Load data into Kafka topic](#Load-data-into-Kafka-topic)\n",
"* [Start Druid ingestion](#Start-Druid-ingestion)\n",
"* [Query Druid datasource and visualize query results](#Query-Druid-datasource-and-visualize-query-results)\n",
"* [Learn more](#Learn-more)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"This tutorial works with Druid 25.0.0 or later.\n",
"\n",
"Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
"\n",
"If you do not use the Docker Compose environment, you need the following:\n",
"* A running Druid instance.\n",
" * Update the `druid_host` variable to point to your Router endpoint. For example, `druid_host = \"http://localhost:8888\"`.\n",
" * Update the `rest_client` variable to point to your Coordinator endpoint. For example, `\"http://localhost:8081\"`.\n",
"* A running Kafka cluster.\n",
" * Update the Kafka bootstrap servers to point to your servers. For example, `bootstrap_servers=[\"localhost:9092\"]`.\n",
"* A running [Data Generator server](https://github.com/implydata/druid-datagenerator) accessible to the cluster.\n",
" * Update the data generator client. For example `datagen = druidapi.rest.DruidRestClient(\"http://localhost:9999\")`.\n",
"* The following Python packages:\n",
" * `druidapi`, a Python client for Apache Druid\n",
" * `kafka`, a Python client for Apache Kafka\n",
" * `pandas`, `matplotlib`, and `seaborn` for data visualization\n"
]
},
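{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you run outside Docker Compose, the following cell is a minimal configuration sketch. The endpoints shown are assumptions based on common local defaults; replace them with the addresses of your own Druid, Kafka, and Data Generator services. Skip this cell entirely if you use the Docker Compose environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch for a non-Docker setup. All endpoints below are assumptions\n",
"# (common local defaults); adjust them to your environment. Skip this cell\n",
"# if you use the Docker Compose environment.\n",
"import druidapi\n",
"\n",
"druid_host = 'http://localhost:8888'   # Druid Router endpoint\n",
"kafka_host = 'localhost:9092'          # Kafka bootstrap server (host:port, no scheme)\n",
"datagen = druidapi.rest.DruidRestClient('http://localhost:9999')  # Data Generator server\n",
"druid = druidapi.jupyter_client(druid_host)"
]
},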
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Druid API client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start the tutorial, run the following cell. It imports the required Python packages and defines a variable for the Druid client, and another for the SQL client used to run SQL commands."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import druidapi\n",
"import os\n",
"import time\n",
"\n",
"if 'DRUID_HOST' not in os.environ.keys():\n",
" druid_host=f\"http://localhost:8888\"\n",
"else:\n",
" druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
" \n",
"print(f\"Opening a connection to {druid_host}.\")\n",
"druid = druidapi.jupyter_client(druid_host)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use kafka_host variable when connecting to kafka \n",
"if 'KAFKA_HOST' not in os.environ.keys():\n",
" kafka_host=f\"http://localhost:9092\"\n",
"else:\n",
" kafka_host=f\"{os.environ['KAFKA_HOST']}:9092\"\n",
"\n",
"# this is the kafka topic we will be working with:\n",
"topic_name = \"social_media\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# shortcuts for display and sql api's\n",
"display = druid.display\n",
"sql_client = druid.sql\n",
"\n",
"# client for Data Generator API\n",
"datagen = druidapi.rest.DruidRestClient(\"http://datagen:9999\")\n",
"\n",
"# client for Druid API\n",
"rest_client = druid.rest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Publish generated data directly to Kafka topic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, you use the data generator included as part of the Docker application to generate a stream of messages. The data generator creates and send messages to a Kafka topic named `social_media`. To learn more about the Druid Data Generator, see the [project](https://github.com/implydata/druid-datagenerator) and the [data generation notebook](../01-introduction/02-datagen-intro.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate data\n",
"Run the following cells to load sample data into the `social_media` Kafka topic. The data generator sends events until it reaches 50,000 messages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" 'Content-Type': 'application/json'\n",
"}\n",
"\n",
"datagen_request = {\n",
" \"name\": \"social_stream\",\n",
" \"target\": { \"type\": \"kafka\", \"endpoint\": kafka_host, \"topic\": topic_name },\n",
" \"config_file\": \"social/social_posts.json\", \n",
" \"total_events\":50000,\n",
" \"concurrency\":100\n",
"}\n",
"datagen.post(\"/start\", json.dumps(datagen_request), headers=headers)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the status of the job with the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time.sleep(1) # avoid race between start of the job and its status being available\n",
"response = datagen.get('/status/social_stream')\n",
"response.json()"
]
},
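{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than checking once, you can poll the status endpoint until the generator finishes. The following cell is a sketch that reuses the `/status/social_stream` call from above and prints each raw response; the exact fields in the payload depend on the data generator version, so inspect the output to see when the job reports completion."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Poll the data generator's status endpoint a few times and print the raw\n",
"# JSON. This only reuses the GET call shown above; field names in the\n",
"# response vary by data generator version.\n",
"for _ in range(5):\n",
"    print(datagen.get('/status/social_stream').json())\n",
"    time.sleep(2)"
]
},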
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start Druid ingestion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have a new Kafka topic and data being streamed into the topic, you ingest the data into Druid by submitting a Kafka ingestion spec.\n",
"The ingestion spec describes the following:\n",
"* where to source the data to ingest (in `spec > ioConfig`),\n",
"* the datasource to ingest data into (in `spec > dataSchema > dataSource`), and\n",
"* what the data looks like (in `spec > dataSchema > dimensionsSpec`).\n",
"\n",
"Other properties control how Druid aggregates and stores data. For more information, see the Druid documenation:\n",
"* [Apache Kafka ingestion](https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html)\n",
"* [Ingestion spec reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)\n",
"\n",
"Run the following cells to define and view the Kafka ingestion spec."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"kafka_ingestion_spec = {\n",
" \"type\": \"kafka\",\n",
" \"spec\": {\n",
" \"ioConfig\": {\n",
" \"type\": \"kafka\",\n",
" \"consumerProperties\": {\n",
" \"bootstrap.servers\": \"kafka:9092\"\n",
" },\n",
" \"topic\": \"social_media\",\n",
" \"inputFormat\": {\n",
" \"type\": \"json\"\n",
" },\n",
" \"useEarliestOffset\": True\n",
" },\n",
" \"tuningConfig\": {\n",
" \"type\": \"kafka\"\n",
" },\n",
" \"dataSchema\": {\n",
" \"dataSource\": \"social_media\",\n",
" \"timestampSpec\": {\n",
" \"column\": \"time\",\n",
" \"format\": \"iso\"\n",
" },\n",
" \"dimensionsSpec\": {\n",
" \"dimensions\": [\n",
" \"username\",\n",
" \"post_title\",\n",
" {\n",
" \"type\": \"long\",\n",
" \"name\": \"views\"\n",
" },\n",
" {\n",
" \"type\": \"long\",\n",
" \"name\": \"upvotes\"\n",
" },\n",
" {\n",
" \"type\": \"long\",\n",
" \"name\": \"comments\"\n",
" },\n",
" \"edited\"\n",
" ]\n",
" },\n",
" \"granularitySpec\": {\n",
" \"queryGranularity\": \"none\",\n",
" \"rollup\": False,\n",
" \"segmentGranularity\": \"hour\"\n",
" }\n",
" }\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Send the spec to Druid to start the streaming ingestion from Kafka:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" 'Content-Type': 'application/json'\n",
"}\n",
"\n",
"supervisor = rest_client.post(\"/druid/indexer/v1/supervisor\", json.dumps(kafka_ingestion_spec), headers=headers)\n",
"print(supervisor.status_code)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `200` response indicates that the request was successful. You can view the running ingestion task and the new datasource in the web console's [ingestion view](http://localhost:8888/unified-console.html#ingestion).\n",
"\n",
"The following cell pauses further execution until the ingestion has started and the datasource is available for querying:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"druid.sql.wait_until_ready('social_media', verify_load_status=False)"
]
},
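{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also inspect the supervisor directly through Druid's supervisor API. The following cell is a sketch that calls the `/druid/indexer/v1/supervisor/social_media/status` endpoint with the same REST client `get()` helper used for the data generator above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fetch the supervisor status from Druid's supervisor API. The payload\n",
"# includes the supervisor state and recent task information.\n",
"status = rest_client.get('/druid/indexer/v1/supervisor/social_media/status')\n",
"status.json()"
]
},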
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query Druid datasource and visualize query results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now query the new datasource called `social_media`. In this section, you also visualize query results using the Matplotlib and Seaborn visualization libraries. Run the following cell import these packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run a simple query to view a subset of rows from the new datasource:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT * FROM social_media LIMIT 5\n",
"'''\n",
"display.sql(sql)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this social media scenario, each incoming event represents a post on social media, for which you collect the timestamp, username, and post metadata. You are interested in analyzing the total number of upvotes for all posts, compared between users. Preview this data with the following query:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT\n",
" COUNT(post_title) as num_posts,\n",
" SUM(upvotes) as total_upvotes,\n",
" username\n",
"FROM social_media\n",
"GROUP BY username\n",
"ORDER BY num_posts\n",
"'''\n",
"\n",
"response = sql_client.sql_query(sql)\n",
"response.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the total number of upvotes per user using a line plot. You sort the results by username before plotting because the order of users may vary as new results arrive."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(response.json)\n",
"df = df.sort_values('username')\n",
"\n",
"df.plot(x='username', y='total_upvotes', marker='o')\n",
"plt.xticks(rotation=45, ha='right')\n",
"plt.ylabel(\"Total number of upvotes\")\n",
"plt.gca().get_legend().remove()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The total number of upvotes likely depends on the total number of posts created per user. To better assess the relative impact per user, you compare the total number of upvotes (line plot) with the total number of posts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"matplotlib.rc_file_defaults()\n",
"ax1 = sns.set_style(style=None, rc=None )\n",
"\n",
"fig, ax1 = plt.subplots()\n",
"plt.xticks(rotation=45, ha='right')\n",
"\n",
"\n",
"sns.lineplot(\n",
" data=df, x='username', y='total_upvotes',\n",
" marker='o', ax=ax1, label=\"Sum of upvotes\")\n",
"ax1.get_legend().remove()\n",
"\n",
"ax2 = ax1.twinx()\n",
"sns.barplot(data=df, x='username', y='num_posts',\n",
" order=df['username'], alpha=0.5, ax=ax2, log=True,\n",
" color=\"orange\", label=\"Number of posts\")\n",
"\n",
"\n",
"# ask matplotlib for the plotted objects and their labels\n",
"lines, labels = ax1.get_legend_handles_labels()\n",
"lines2, labels2 = ax2.get_legend_handles_labels()\n",
"ax2.legend(lines + lines2, labels + labels2, bbox_to_anchor=(1.55, 1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see a correlation between total number of upvotes and total number of posts. In order to track user impact on a more equal footing, normalize the total number of upvotes relative to the total number of posts, and plot the result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['upvotes_normalized'] = df['total_upvotes']/df['num_posts']\n",
"\n",
"df.plot(x='username', y='upvotes_normalized', marker='o', color='green')\n",
"plt.xticks(rotation=45, ha='right')\n",
"plt.ylabel(\"Number of upvotes (normalized)\")\n",
"plt.gca().get_legend().remove()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You've been working with data taken at a single snapshot in time from when you ran the last query. Run the same query again, and store the output in `response2`, which you will compare with the previous results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response2 = sql_client.sql_query(sql)\n",
"response2.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizing the data also helps you evaluate trends over time more consistently on the same plot axes. Plot the normalized data again, this time alongside the results from the previous snapshot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(response2.json)\n",
"df2 = df2.sort_values('username')\n",
"df2['upvotes_normalized'] = df2['total_upvotes']/df2['num_posts']\n",
"\n",
"ax = df.plot(x='username', y='upvotes_normalized', marker='o', color='green', label=\"Time 1\")\n",
"df2.plot(x='username', y='upvotes_normalized', marker='o', color='purple', ax=ax, label=\"Time 2\")\n",
"plt.xticks(rotation=45, ha='right')\n",
"plt.ylabel(\"Number of upvotes (normalized)\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This plot shows how some users maintain relatively consistent social media impact between the two query snapshots, whereas other users grow or decline in their influence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanup \n",
"The following cells stop the data generation and ingestion jobs and removes the datasource from Druid."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Stop streaming generator: [{datagen.post('/stop/social_stream','',require_ok=False)}]\")\n",
"print(f'Reset offsets for ingestion: [{druid.rest.post(\"/druid/indexer/v1/supervisor/social_media/reset\",\"\", require_ok=False)}]')\n",
"print(f'Stop streaming ingestion: [{druid.rest.post(\"/druid/indexer/v1/supervisor/social_media/terminate\",\"\", require_ok=False)}]')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the ingestion process ends and completes any final ingestion steps, remove the datasource with the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time.sleep(5) # wait for streaming ingestion tasks to end\n",
"print(f\"Drop datasource: [{druid.datasources.drop('social_media')}]\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learn more\n",
"\n",
"This tutorial showed you how to create a Kafka topic using a Python client for Kafka, send a simulated stream of data to Kafka using a data generator, and query and visualize results over time. For more information, see the following resources:\n",
"\n",
"* [Apache Kafka ingestion](https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html)\n",
"* [Querying data](https://druid.apache.org/docs/latest/tutorials/tutorial-query.html)\n",
"* [Tutorial: Run with Docker](https://druid.apache.org/docs/latest/tutorials/docker.html)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"vscode": {
"interpreter": {
"hash": "a4289e5b8bae5973a6609d90f7bc464162478362b9a770893a3c5c597b0b36e7"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}