" ~ Unless required by applicable law or agreed to in writing,\n",
" ~ software distributed under the License is distributed on an\n",
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" ~ KIND, either express or implied. See the License for the\n",
" ~ specific language governing permissions and limitations\n",
" ~ under the License.\n",
" -->\n",
"\n",
"This tutorial introduces you to streaming ingestion in Apache Druid using the Apache Kafka event streaming platform.\n",
"Follow along to learn how to create and load data into a Kafka topic, start ingesting data from the topic into Druid, and query results over time. This tutorial assumes you have a basic understanding of Druid ingestion, querying, and API requests."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of contents\n",
"\n",
"* [Prerequisites](#Prerequisites)\n",
"* [Load Druid API client](#Load-Druid-API-client)\n",
"* [Create Kafka topic](#Create-Kafka-topic)\n",
"* [Load data into Kafka topic](#Load-data-into-Kafka-topic)\n",
"* [Query Druid datasource and visualize query results](#Query-Druid-datasource-and-visualize-query-results)\n",
"* [Learn more](#Learn-more)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"This tutorial works with Druid 25.0.0 or later.\n",
"\n",
"Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
"\n",
"If you do not use the Docker Compose environment, you need the following:\n",
"* A running Druid instance.\n",
" * Update the `druid_host` variable to point to your Router endpoint. For example, `druid_host = \"http://localhost:8888\"`.\n",
" * Update the `rest_client` variable to point to your Coordinator endpoint. For example, `\"http://localhost:8081\"`.\n",
"* A running Kafka cluster.\n",
" * Update the Kafka bootstrap servers to point to your servers. For example, `bootstrap_servers=[\"localhost:9092\"]`.\n",
" * `druidapi`, a Python client for Apache Druid\n",
" * `kafka`, a Python client for Apache Kafka\n",
" * `pandas`, `matplotlib`, and `seaborn` for data visualization\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Druid API client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start the tutorial, run the following cell. It imports the required Python packages and defines a variable for the Druid client, and another for the SQL client used to run SQL commands."
"In this section, you use the data generator included as part of the Docker application to generate a stream of messages. The data generator creates and send messages to a Kafka topic named `social_media`. To learn more about the Druid Data Generator, see the [project](https://github.com/implydata/druid-datagenerator) and the [data generation notebook](../01-introduction/02-datagen-intro.ipynb)."
"A `200` response indicates that the request was successful. You can view the running ingestion task and the new datasource in the web console's [ingestion view](http://localhost:8888/unified-console.html#ingestion).\n",
"\n",
"The following cell pauses further execution until the ingestion has started and the datasource is available for querying:"
"## Query Druid datasource and visualize query results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now query the new datasource called `social_media`. In this section, you also visualize query results using the Matplotlib and Seaborn visualization libraries. Run the following cell import these packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run a simple query to view a subset of rows from the new datasource:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT * FROM social_media LIMIT 5\n",
"'''\n",
"display.sql(sql)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this social media scenario, each incoming event represents a post on social media, for which you collect the timestamp, username, and post metadata. You are interested in analyzing the total number of upvotes for all posts, compared between users. Preview this data with the following query:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT\n",
" COUNT(post_title) as num_posts,\n",
" SUM(upvotes) as total_upvotes,\n",
" username\n",
"FROM social_media\n",
"GROUP BY username\n",
"ORDER BY num_posts\n",
"'''\n",
"\n",
"response = sql_client.sql_query(sql)\n",
"response.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the total number of upvotes per user using a line plot. You sort the results by username before plotting because the order of users may vary as new results arrive."
"The total number of upvotes likely depends on the total number of posts created per user. To better assess the relative impact per user, you compare the total number of upvotes (line plot) with the total number of posts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"matplotlib.rc_file_defaults()\n",
"ax1 = sns.set_style(style=None, rc=None )\n",
"\n",
"fig, ax1 = plt.subplots()\n",
"plt.xticks(rotation=45, ha='right')\n",
"\n",
"\n",
"sns.lineplot(\n",
" data=df, x='username', y='total_upvotes',\n",
" marker='o', ax=ax1, label=\"Sum of upvotes\")\n",
"You should see a correlation between total number of upvotes and total number of posts. In order to track user impact on a more equal footing, normalize the total number of upvotes relative to the total number of posts, and plot the result:"
"plt.ylabel(\"Number of upvotes (normalized)\")\n",
"plt.gca().get_legend().remove()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You've been working with data taken at a single snapshot in time from when you ran the last query. Run the same query again, and store the output in `response2`, which you will compare with the previous results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response2 = sql_client.sql_query(sql)\n",
"response2.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizing the data also helps you evaluate trends over time more consistently on the same plot axes. Plot the normalized data again, this time alongside the results from the previous snapshot:"
"This plot shows how some users maintain relatively consistent social media impact between the two query snapshots, whereas other users grow or decline in their influence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanup \n",
"The following cells stop the data generation and ingestion jobs and removes the datasource from Druid."
"This tutorial showed you how to create a Kafka topic using a Python client for Kafka, send a simulated stream of data to Kafka using a data generator, and query and visualize results over time. For more information, see the following resources:\n",