{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ingest and query data from Apache Kafka\n",
"\n",
"<!--\n",
" ~ Licensed to the Apache Software Foundation (ASF) under one\n",
" ~ or more contributor license agreements. See the NOTICE file\n",
" ~ distributed with this work for additional information\n",
" ~ regarding copyright ownership. The ASF licenses this file\n",
" ~ to you under the Apache License, Version 2.0 (the\n",
" ~ \"License\"); you may not use this file except in compliance\n",
" ~ with the License. You may obtain a copy of the License at\n",
" ~\n",
" ~ http://www.apache.org/licenses/LICENSE-2.0\n",
" ~\n",
" ~ Unless required by applicable law or agreed to in writing,\n",
" ~ software distributed under the License is distributed on an\n",
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" ~ KIND, either express or implied. See the License for the\n",
" ~ specific language governing permissions and limitations\n",
" ~ under the License.\n",
" -->\n",
"\n",
"This tutorial introduces you to streaming ingestion in Apache Druid using the Apache Kafka event streaming platform.\n",
"Follow along to learn how to create and load data into a Kafka topic, start ingesting data from the topic into Druid, and query results over time. This tutorial assumes you have a basic understanding of Druid ingestion, querying, and API requests."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"* [Prerequisites](#Prerequisites)\n",
"* [Load Druid API client](#Load-Druid-API-client)\n",
"* [Create Kafka topic](#Create-Kafka-topic)\n",
"* [Load data into Kafka topic](#Load-data-into-Kafka-topic)\n",
"* [Start Druid ingestion](#Start-Druid-ingestion)\n",
"* [Query Druid datasource and visualize query results](#Query-Druid-datasource-and-visualize-query-results)\n",
"* [Learn more](#Learn-more)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"This tutorial works with Druid 25.0.0 or later.\n",
"\n",
"Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
"\n",
"If you do not use the Docker Compose environment, you need the following:\n",
"* A running Druid instance.\n",
"  * Update the `druid_host` variable to point to your Router endpoint. For example, `druid_host = \"http://localhost:8888\"`.\n",
"  * Update the `rest_client` variable to point to your Coordinator endpoint. For example, `\"http://localhost:8081\"`.\n",
"* A running Kafka cluster.\n",
"  * Update the Kafka bootstrap servers to point to your servers. For example, `bootstrap_servers=[\"localhost:9092\"]`.\n",
"* The following Python packages:\n",
"  * `druidapi`, a Python client for Apache Druid\n",
"  * `DruidDataDriver`, a data generator\n",
"  * `kafka`, a Python client for Apache Kafka\n",
"  * `pandas`, `matplotlib`, and `seaborn` for data visualization\n"
]
},
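{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Docker Compose environment ships with all of these packages preinstalled. If you run the notebook elsewhere and a package is missing, the next cell sketches one way to install the third-party packages. Note that the `kafka` module is provided by the `kafka-python` package on PyPI, while `druidapi` and `DruidDataDriver` are distributed with the Druid tutorial files rather than through PyPI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch for environments outside Docker Compose: install the third-party\n",
"# packages this tutorial imports. The `kafka` module comes from kafka-python.\n",
"%pip install kafka-python pandas matplotlib seaborn"
]
},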
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Druid API client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start the tutorial, run the following cell. It imports the required Python packages and defines variables for the Druid client and for the SQL client you use to run SQL commands."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import druidapi\n",
"import json\n",
"\n",
"# druid_host is the hostname and port for your Druid deployment.\n",
"# In the Docker Compose tutorial environment, this is the Router\n",
"# service running at \"http://router:8888\".\n",
"# If you are not using the Docker Compose environment, edit the `druid_host` variable.\n",
"\n",
"druid_host = \"http://router:8888\"\n",
"druid_host\n",
"\n",
"druid = druidapi.jupyter_client(druid_host)\n",
"display = druid.display\n",
"sql_client = druid.sql\n",
"\n",
"# Create a REST client for native JSON ingestion of streaming data\n",
"rest_client = druidapi.rest.DruidRestClient(\"http://coordinator:8081\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Kafka topic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook relies on the Python client for Apache Kafka. Import the Kafka producer and consumer modules, then create a Kafka client. You use the Kafka producer to create and publish records to a new topic named `social_media`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from kafka import KafkaProducer\n",
"from kafka import KafkaConsumer\n",
"\n",
"# Kafka runs on kafka:9092 in the multi-container tutorial application\n",
"producer = KafkaProducer(bootstrap_servers='kafka:9092')\n",
"topic_name = \"social_media\""
]
},
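{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial relies on the Kafka broker creating the `social_media` topic automatically when the first record arrives. If your cluster disables automatic topic creation, you can create the topic explicitly first. The following optional cell is a minimal sketch using the admin client from the same `kafka-python` package; the partition and replication settings are assumptions sized for the single-broker tutorial cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from kafka.admin import KafkaAdminClient, NewTopic\n",
"\n",
"# Optional sketch: create the topic explicitly instead of relying on the\n",
"# broker's automatic topic creation.\n",
"admin = KafkaAdminClient(bootstrap_servers='kafka:9092')\n",
"if topic_name not in admin.list_topics():\n",
"    admin.create_topics([NewTopic(name=topic_name, num_partitions=1, replication_factor=1)])\n",
"admin.close()"
]
},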
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the `social_media` topic and send a sample event. The `send()` method returns a metadata descriptor for the record."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"event = {\n",
"    \"__time\": \"2023-01-03T16:40:21.501\",\n",
"    \"username\": \"willow\",\n",
"    \"post_title\": \"This title is required\",\n",
"    \"views\": 15284,\n",
"    \"upvotes\": 124,\n",
"    \"comments\": 21,\n",
"    \"edited\": \"True\"\n",
"}\n",
"\n",
"producer.send(topic_name, json.dumps(event).encode('utf-8'))"
]
},
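{
"cell_type": "markdown",
"metadata": {},
"source": [
"`send()` is asynchronous: the metadata descriptor it returns is a future that resolves once the broker acknowledges the record. As a sketch, you can block on the future to confirm delivery and see where the record landed. Note that this sends a second copy of the sample event to the topic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: block on the future returned by send(). get() raises an exception\n",
"# if the broker does not acknowledge the record within the timeout.\n",
"future = producer.send(topic_name, json.dumps(event).encode('utf-8'))\n",
"metadata = future.get(timeout=10)\n",
"print(f\"topic={metadata.topic} partition={metadata.partition} offset={metadata.offset}\")"
]
},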
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To verify that the Kafka topic stored the events, create a consumer client to read records from the Kafka cluster, and read the first message:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"consumer = KafkaConsumer(topic_name, bootstrap_servers=['kafka:9092'], auto_offset_reset='earliest',\n",
"                         enable_auto_commit=True)\n",
"\n",
"print(next(consumer).value.decode('utf-8'))"
]
},
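{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `next(consumer)` blocks indefinitely when no record is available. When you are not sure the topic has data, polling with a timeout is safer; the following cell is a minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: poll() returns a dict mapping each TopicPartition to a list of\n",
"# records, or an empty dict if nothing arrives before the timeout.\n",
"batch = consumer.poll(timeout_ms=5000)\n",
"for partition, records in batch.items():\n",
"    for record in records:\n",
"        print(record.offset, record.value.decode('utf-8'))"
]
},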
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data into Kafka topic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of manually creating events to send to the Kafka topic, use a data generator to simulate a continuous data stream. This tutorial uses the Druid Data Driver to stream simulated events into the `social_media` Kafka topic. To learn more about the Druid Data Driver, see the Druid Summit talk, [Generating Time centric Data for Apache Druid](https://www.youtube.com/watch?v=3zAOeLe3iAo).\n",
"\n",
"In this notebook, you use a background process to continuously load data into the Kafka topic.\n",
"This allows you to keep executing commands in this notebook while data is constantly being streamed into the topic."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following cells to load sample data into the `social_media` Kafka topic:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import multiprocessing as mp\n",
"from datetime import datetime\n",
"import DruidDataDriver"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def run_driver():\n",
"    DruidDataDriver.simulate(\"kafka_docker_config.json\", None, None, \"REAL\", datetime.now())\n",
"\n",
"mp.set_start_method('fork')\n",
"ps = mp.Process(target=run_driver)\n",
"ps.start()"
]
},
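{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data generator keeps streaming until the notebook kernel shuts down. The following sketch shows how to stop the background process once you finish the tutorial. It is commented out so that running all cells does not end the stream prematurely."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: stop the background data generator when you finish the tutorial.\n",
"# Uncomment the following lines to run it.\n",
"# ps.terminate()\n",
"# ps.join()"
]
},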
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start Druid ingestion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have a new Kafka topic and data being streamed into the topic, you ingest the data into Druid by submitting a Kafka ingestion spec.\n",
"The ingestion spec describes the following:\n",
"* where to source the data to ingest (in `spec > ioConfig`),\n",
"* the datasource to ingest data into (in `spec > dataSchema > dataSource`), and\n",
"* what the data looks like (in `spec > dataSchema > dimensionsSpec`).\n",
"\n",
"Other properties control how Druid aggregates and stores data. For more information, see the Druid documentation:\n",
"* [Apache Kafka ingestion](https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html)\n",
"* [Ingestion spec reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)\n",
"\n",
"Run the following cells to define and view the Kafka ingestion spec."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"kafka_ingestion_spec = json.dumps({\n",
"    \"type\": \"kafka\",\n",
"    \"spec\": {\n",
"        \"ioConfig\": {\n",
"            \"type\": \"kafka\",\n",
"            \"consumerProperties\": {\"bootstrap.servers\": \"kafka:9092\"},\n",
"            \"topic\": \"social_media\",\n",
"            \"inputFormat\": {\"type\": \"json\"},\n",
"            \"useEarliestOffset\": True\n",
"        },\n",
"        \"tuningConfig\": {\"type\": \"kafka\"},\n",
"        \"dataSchema\": {\n",
"            \"dataSource\": \"social_media\",\n",
"            \"timestampSpec\": {\"column\": \"__time\", \"format\": \"iso\"},\n",
"            \"dimensionsSpec\": {\n",
"                \"dimensions\": [\n",
"                    \"username\",\n",
"                    \"post_title\",\n",
"                    {\"type\": \"long\", \"name\": \"views\"},\n",
"                    {\"type\": \"long\", \"name\": \"upvotes\"},\n",
"                    {\"type\": \"long\", \"name\": \"comments\"},\n",
"                    \"edited\"\n",
"                ]\n",
"            },\n",
"            \"granularitySpec\": {\n",
"                \"queryGranularity\": \"none\",\n",
"                \"rollup\": False,\n",
"                \"segmentGranularity\": \"hour\"\n",
"            }\n",
"        }\n",
"    }\n",
"})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(json.dumps(json.loads(kafka_ingestion_spec), indent=4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Send the spec to Druid to start the streaming ingestion from Kafka:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
"    'Content-Type': 'application/json'\n",
"}\n",
"\n",
"rest_client.post(\"/druid/indexer/v1/supervisor\", kafka_ingestion_spec, headers=headers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `200` response indicates that the request was successful. You can view the running ingestion task and the new datasource in the web console at http://localhost:8888/unified-console.html."
]
},
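{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also check on the ingestion from this notebook by calling the supervisor status API. The following cell is a sketch that assumes the `requests` package is available and uses the Docker Compose hostname for the Coordinator; the supervisor ID matches the datasource name, `social_media`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# Sketch: ask the supervisor status endpoint how the ingestion is doing.\n",
"# The supervisor ID matches the dataSource name in the ingestion spec.\n",
"status = requests.get(\"http://coordinator:8081/druid/indexer/v1/supervisor/social_media/status\")\n",
"print(json.dumps(status.json(), indent=4))"
]
},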
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query Druid datasource and visualize query results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now query the new datasource called `social_media`. In this section, you also visualize query results using the Matplotlib and Seaborn visualization libraries. Run the following cell import these packages."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run a simple query to view a subset of rows from the new datasource:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT * FROM social_media LIMIT 5\n",
"'''\n",
"display.sql(sql)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this social media scenario, each incoming event represents a post on social media, for which you collect the timestamp, username, and post metadata. You are interested in analyzing the total number of upvotes for all posts, compared across users. Preview this data with the following query:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT\n",
"  COUNT(post_title) as num_posts,\n",
"  SUM(upvotes) as total_upvotes,\n",
"  username\n",
"FROM social_media\n",
"GROUP BY username\n",
"ORDER BY num_posts\n",
"'''\n",
"\n",
"response = sql_client.sql_query(sql)\n",
"response.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the total number of upvotes per user using a line plot. You sort the results by username before plotting because the order of users may vary as new results arrive."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(response.json)\n",
"df = df.sort_values('username')\n",
"\n",
"df.plot(x='username', y='total_upvotes', marker='o')\n",
"plt.xticks(rotation=45, ha='right')\n",
"plt.ylabel(\"Total number of upvotes\")\n",
"plt.gca().get_legend().remove()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The total number of upvotes likely depends on the total number of posts created per user. To better assess the relative impact per user, you compare the total number of upvotes (line plot) with the total number of posts (bar plot)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"matplotlib.rc_file_defaults()\n",
|
|
"ax1 = sns.set_style(style=None, rc=None )\n",
|
|
"\n",
|
|
"fig, ax1 = plt.subplots()\n",
|
|
"plt.xticks(rotation=45, ha='right')\n",
|
|
"\n",
|
|
"\n",
|
|
"sns.lineplot(\n",
|
|
" data=df, x='username', y='total_upvotes',\n",
|
|
" marker='o', ax=ax1, label=\"Sum of upvotes\")\n",
|
|
"ax1.get_legend().remove()\n",
|
|
"\n",
|
|
"ax2 = ax1.twinx()\n",
|
|
"sns.barplot(data=df, x='username', y='num_posts',\n",
|
|
" order=df['username'], alpha=0.5, ax=ax2, log=True,\n",
|
|
" color=\"orange\", label=\"Number of posts\")\n",
|
|
"\n",
|
|
"\n",
|
|
"# ask matplotlib for the plotted objects and their labels\n",
|
|
"lines, labels = ax1.get_legend_handles_labels()\n",
|
|
"lines2, labels2 = ax2.get_legend_handles_labels()\n",
|
|
"ax2.legend(lines + lines2, labels + labels2, bbox_to_anchor=(1.55, 1))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You should see a correlation between total number of upvotes and total number of posts. In order to track user impact on a more equal footing, normalize the total number of upvotes relative to the total number of posts, and plot the result:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['upvotes_normalized'] = df['total_upvotes']/df['num_posts']\n",
|
|
"\n",
|
|
"df.plot(x='username', y='upvotes_normalized', marker='o', color='green')\n",
|
|
"plt.xticks(rotation=45, ha='right')\n",
|
|
"plt.ylabel(\"Number of upvotes (normalized)\")\n",
|
|
"plt.gca().get_legend().remove()\n",
|
|
"plt.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You've been working with data taken at a single snapshot in time from when you ran the last query. Run the same query again, and store the output in `response2`, which you will compare with the previous results:"
|
|
]
|
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response2 = sql_client.sql_query(sql)\n",
"response2.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizing the data also helps you evaluate trends over time more consistently on the same plot axes. Plot the normalized data again, this time alongside the results from the previous snapshot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(response2.json)\n",
"df2 = df2.sort_values('username')\n",
"df2['upvotes_normalized'] = df2['total_upvotes']/df2['num_posts']\n",
"\n",
"ax = df.plot(x='username', y='upvotes_normalized', marker='o', color='green', label=\"Time 1\")\n",
"df2.plot(x='username', y='upvotes_normalized', marker='o', color='purple', ax=ax, label=\"Time 2\")\n",
"plt.xticks(rotation=45, ha='right')\n",
"plt.ylabel(\"Number of upvotes (normalized)\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This plot shows how some users maintain relatively consistent social media impact between the two query snapshots, whereas other users grow or decline in their influence.\n",
"\n",
"## Learn more\n",
"\n",
"This tutorial showed you how to create a Kafka topic using a Python client for Kafka, send a simulated stream of data to Kafka using a data generator, and query and visualize results over time. For more information, see the following resources:\n",
"\n",
"* [Apache Kafka ingestion](https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html)\n",
"* [Querying data](https://druid.apache.org/docs/latest/tutorials/tutorial-query.html)\n",
"* [Tutorial: Run with Docker](https://druid.apache.org/docs/latest/tutorials/docker.html)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"vscode": {
"interpreter": {
"hash": "a4289e5b8bae5973a6609d90f7bc464162478362b9a770893a3c5c597b0b36e7"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}