druid/examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb

750 lines
22 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "ce2efaaa",
"metadata": {},
"source": [
"# Learn the Druid Python API\n",
"\n",
"<!--\n",
" ~ Licensed to the Apache Software Foundation (ASF) under one\n",
" ~ or more contributor license agreements. See the NOTICE file\n",
" ~ distributed with this work for additional information\n",
" ~ regarding copyright ownership. The ASF licenses this file\n",
" ~ to you under the Apache License, Version 2.0 (the\n",
" ~ \"License\"); you may not use this file except in compliance\n",
" ~ with the License. You may obtain a copy of the License at\n",
" ~\n",
" ~ http://www.apache.org/licenses/LICENSE-2.0\n",
" ~\n",
" ~ Unless required by applicable law or agreed to in writing,\n",
" ~ software distributed under the License is distributed on an\n",
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" ~ KIND, either express or implied. See the License for the\n",
" ~ specific language governing permissions and limitations\n",
" ~ under the License.\n",
" -->\n",
"\n",
"This notebook provides a quick introduction to the Python wrapper around the [Druid REST API](api-tutorial.ipynb). This notebook assumes you are familiar with the basics of the REST API, and the [set of operations which Druid provides](https://druid.apache.org/docs/latest/operations/api-reference.html). This tutorial focuses on using Python to access those APIs rather than explaining the APIs themselves. The APIs themselves are covered in other notebooks that use the Python API.\n",
"\n",
"This tutorial works with Druid 25.0.0 or later.\n",
"\n",
"The Druid Python API is primarily intended to help with these notebook tutorials. It can also be used in your own ad-hoc notebooks, or in a regular Python program.\n",
"\n",
"The Druid Python API is a work in progress. The Druid team adds API wrappers as needed for the notebook tutorials. If you find you need additional wrappers, please feel free to add them, and post a PR to Apache Druid with your additions.\n",
"\n",
"The API provides two levels of functions. Most are simple wrappers around Druid's REST APIs. Others add additional code to make the API easier to use. The SQL query interface is a prime example: extra code translates a simple SQL query into Druid's `SQLQuery` object and interprets the results into a form that can be displayed in a notebook.\n",
"\n",
"This notebook contains sample output to allow it to function as a reference. To run it yourself, start by using the `Kernel` &rarr; `Restart & Clear Output` menu command to clear the sample output.\n",
"\n",
"Start by importing the `druidapi` package from the same folder as this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d90ca5d",
"metadata": {},
"outputs": [],
"source": [
"import druidapi"
]
},
{
"cell_type": "markdown",
"id": "fb68a838",
"metadata": {},
"source": [
"Next, connect to your cluster by providing the router endpoint. The code assumes the cluster is on your local machine, using the default port. Go ahead and change this if your setup is different.\n",
"\n",
"The API uses the router to forward messages to each of Druid's services so that you don't have to keep track of the host and port for each service.\n",
"\n",
"In the Docker Compose tutorial environment, the Router service runs at \"http://router:8888\".\n",
"If you are not using the Docker Compose environment, edit the URL for the `jupyter_client`.\n",
"For example, to `http://localhost:8888/`.\n",
"\n",
"The `jupyter_client()` method waits for the cluster to be ready and sets up the client to display tables and messages as HTML. To use this code without waiting and without HTML formatting, use the `client()` method instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae601081",
"metadata": {},
"outputs": [],
"source": [
"druid = druidapi.jupyter_client('http://router:8888')"
]
},
{
"cell_type": "markdown",
"id": "8b4e774b",
"metadata": {},
"source": [
"## Status Client\n",
"\n",
"The SDK groups Druid REST API calls into categories, with a client for each. Start with the status client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff16fc3b",
"metadata": {},
"outputs": [],
"source": [
"status_client = druid.status"
]
},
{
"cell_type": "markdown",
"id": "be992774",
"metadata": {},
"source": [
"Use the Python `help()` function to learn what methods are avaialble."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "03f26417",
"metadata": {},
"outputs": [],
"source": [
"help(status_client)"
]
},
{
"cell_type": "markdown",
"id": "e803c9fe",
"metadata": {},
"source": [
"Check the version of your cluster. Some of these notebooks illustrate newer features available only on specific versions of Druid."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2faa0d81",
"metadata": {},
"outputs": [],
"source": [
"status_client.version"
]
},
{
"cell_type": "markdown",
"id": "d78a6c35",
"metadata": {},
"source": [
"You can also check which extensions are loaded in your cluster. Some notebooks require specific extensions to be available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1001f412",
"metadata": {},
"outputs": [],
"source": [
"status_client.properties['druid.extensions.loadList']"
]
},
{
"cell_type": "markdown",
"id": "012b2e61",
"metadata": {},
"source": [
"## Display Client\n",
"\n",
"The display client performs Druid operations, then formats the results for display in a notebook. Running SQL queries in a notebook is easy with the display client.\n",
"\n",
"When run outside a notebook, the display client formats results as text. The display client is the most convenient way to work with Druid in a notebook. Most operations also have a form that returns results as Python objects rather than displaying them. Use these methods if you write code to work with the results. Here the goal is just to interact with Druid."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f867f1f0",
"metadata": {},
"outputs": [],
"source": [
"display = druid.display"
]
},
{
"cell_type": "markdown",
"id": "d051bc5e",
"metadata": {},
"source": [
"Start by getting a list of schemas."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd8387e0",
"metadata": {},
"outputs": [],
"source": [
"display.schemas()"
]
},
{
"cell_type": "markdown",
"id": "b8261ab0",
"metadata": {},
"source": [
"Then, retreive the tables (or datasources) within any schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64dcb46a",
"metadata": {},
"outputs": [],
"source": [
"display.tables('INFORMATION_SCHEMA')"
]
},
{
"cell_type": "markdown",
"id": "ff311595",
"metadata": {},
"source": [
"The above shows the list of datasources by default. You'll get an empty result if you have no datasources yet."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "616770ce",
"metadata": {},
"outputs": [],
"source": [
"display.tables()"
]
},
{
"cell_type": "markdown",
"id": "7392e484",
"metadata": {},
"source": [
"You can easily run a query and show the results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c649eef",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT TABLE_NAME\n",
"FROM INFORMATION_SCHEMA.TABLES\n",
"WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
"'''\n",
"display.sql(sql)"
]
},
{
"cell_type": "markdown",
"id": "c6c4e1d4",
"metadata": {},
"source": [
"The query above showed the same results as `tables()`. That is not surprising: `tables()` just runs this query for you."
]
},
{
"cell_type": "markdown",
"id": "f414d145",
"metadata": {},
"source": [
"## SQL Client\n",
"\n",
"While the display client is handy for simple queries, sometimes you need more control, or want to work with the data returned from a query. For this you use the SQL client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9951e976",
"metadata": {},
"outputs": [],
"source": [
"sql_client = druid.sql"
]
},
{
"cell_type": "markdown",
"id": "7b944084",
"metadata": {},
"source": [
"The SQL client allows you create a SQL request object that enables passing context parameters and query parameters. Druid will work out the query parameter type based on the Python type. Use the display client to show the query results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd559827",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT TABLE_NAME\n",
"FROM INFORMATION_SCHEMA.TABLES\n",
"WHERE TABLE_SCHEMA = ?\n",
"'''\n",
"req = sql_client.sql_request(sql)\n",
"req.add_parameter('INFORMATION_SCHEMA')\n",
"req.add_context(\"someParameter\", \"someValue\")\n",
"display.sql(req)"
]
},
{
"cell_type": "markdown",
"id": "937dc6b1",
"metadata": {},
"source": [
"The request has other features for advanced use cases: see the code for details. The query API actually returns a sql response object. Use this if you want to get the values directly, work with the schema, etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd7a1827",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT TABLE_NAME\n",
"FROM INFORMATION_SCHEMA.TABLES\n",
"WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
"'''\n",
"resp = sql_client.sql_query(sql)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fe6a749",
"metadata": {},
"outputs": [],
"source": [
"col1 = resp.schema[0]\n",
"print(col1.name, col1.sql_type, col1.druid_type)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41d27bb1",
"metadata": {},
"outputs": [],
"source": [
"resp.rows"
]
},
{
"cell_type": "markdown",
"id": "481af1f2",
"metadata": {},
"source": [
"The `show()` method uses this information for format an HTML table to present the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8dba807b",
"metadata": {},
"outputs": [],
"source": [
"resp.show()"
]
},
{
"cell_type": "markdown",
"id": "99f8db7b",
"metadata": {},
"source": [
"The display and SQL clients are intened for exploratory queries. The [pydruid](https://pythonhosted.org/pydruid/) library provides a robust way to run native queries, to run SQL queries, and to convert the results to various formats."
]
},
{
"cell_type": "markdown",
"id": "9e3be017",
"metadata": {},
"source": [
"## MSQ Ingestion\n",
"\n",
"The SQL client also performs MSQ-based ingestion using `INSERT` or `REPLACE` statements. Use the extension check above to ensure that `druid-multi-stage-query` is loaded in Druid 26. (Later versions may have MSQ built in.)\n",
"\n",
"An MSQ query is run using a different API: `task()`. This API returns a response object that describes the Overlord task which runs the MSQ query. For tutorials, data is usually small enough you can wait for the ingestion to complete. Do that with the `run_task()` call which handles the waiting. To illustrate, here is a query that ingests a subset of columns, and includes a few data clean-up steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10f1e451",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"REPLACE INTO \"myWiki1\" OVERWRITE ALL\n",
"SELECT\n",
" TIME_PARSE(\"timestamp\") AS \"__time\",\n",
" namespace,\n",
" page,\n",
" channel,\n",
" \"user\",\n",
" countryName,\n",
" CASE WHEN isRobot = 'true' THEN 1 ELSE 0 END AS isRobot,\n",
" \"added\",\n",
" \"delta\",\n",
" CASE WHEN isNew = 'true' THEN 1 ELSE 0 END AS isNew,\n",
" CAST(\"deltaBucket\" AS DOUBLE) AS deltaBucket,\n",
" \"deleted\"\n",
"FROM TABLE(\n",
" EXTERN(\n",
" '{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
" '{\"type\":\"json\"}',\n",
" '[{\"name\":\"isRobot\",\"type\":\"string\"},{\"name\":\"channel\",\"type\":\"string\"},{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"flags\",\"type\":\"string\"},{\"name\":\"isUnpatrolled\",\"type\":\"string\"},{\"name\":\"page\",\"type\":\"string\"},{\"name\":\"diffUrl\",\"type\":\"string\"},{\"name\":\"added\",\"type\":\"long\"},{\"name\":\"comment\",\"type\":\"string\"},{\"name\":\"commentLength\",\"type\":\"long\"},{\"name\":\"isNew\",\"type\":\"string\"},{\"name\":\"isMinor\",\"type\":\"string\"},{\"name\":\"delta\",\"type\":\"long\"},{\"name\":\"isAnonymous\",\"type\":\"string\"},{\"name\":\"user\",\"type\":\"string\"},{\"name\":\"deltaBucket\",\"type\":\"long\"},{\"name\":\"deleted\",\"type\":\"long\"},{\"name\":\"namespace\",\"type\":\"string\"},{\"name\":\"cityName\",\"type\":\"string\"},{\"name\":\"countryName\",\"type\":\"string\"},{\"name\":\"regionIsoCode\",\"type\":\"string\"},{\"name\":\"metroCode\",\"type\":\"long\"},{\"name\":\"countryIsoCode\",\"type\":\"string\"},{\"name\":\"regionName\",\"type\":\"string\"}]'\n",
" )\n",
")\n",
"PARTITIONED BY DAY\n",
"CLUSTERED BY namespace, page\n",
"'''"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d752b1d4",
"metadata": {},
"outputs": [],
"source": [
"sql_client.run_task(sql)"
]
},
{
"cell_type": "markdown",
"id": "ef4512f8",
"metadata": {},
"source": [
"MSQ reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments. Wait for the table to become ready."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37fcedf2",
"metadata": {},
"outputs": [],
"source": [
"sql_client.wait_until_ready('myWiki1')"
]
},
{
"cell_type": "markdown",
"id": "11d9c95a",
"metadata": {},
"source": [
"`describe_table()` lists the columns in a table."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b662697b",
"metadata": {},
"outputs": [],
"source": [
"display.table('myWiki1')"
]
},
{
"cell_type": "markdown",
"id": "936f57fb",
"metadata": {},
"source": [
"You can sample a few rows of data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4cfa5dc",
"metadata": {},
"outputs": [],
"source": [
"display.sql('SELECT * FROM myWiki1 LIMIT 10')"
]
},
{
"cell_type": "markdown",
"id": "c1152f41",
"metadata": {},
"source": [
"## Datasource Client\n",
"\n",
"The Datasource client lets you perform operations on datasource objects. The SQL layer allows you to get metadata and do queries. The datasource client works with the underlying segments. Explaining the full functionality is the topic of another notebook. For now, you can use the datasource client to clean up the datasource created above. The `True` argument asks for \"if exists\" semantics so you don't get an error if the datasource was alredy deleted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fba659ce",
"metadata": {},
"outputs": [],
"source": [
"ds_client = druid.datasources\n",
"ds_client.drop('myWiki', True)"
]
},
{
"cell_type": "markdown",
"id": "c96fdcc6",
"metadata": {},
"source": [
"## Tasks Client\n",
"\n",
"Use the tasks client to work with Overlord tasks. The `run_task()` call above actually uses the task client internally to poll Overlord."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4f5ea17",
"metadata": {},
"outputs": [],
"source": [
"task_client = druid.tasks\n",
"task_client.tasks()"
]
},
{
"cell_type": "markdown",
"id": "1deaf95f",
"metadata": {},
"source": [
"## REST Client\n",
"\n",
"The Druid Python API starts with a REST client that itself is built on the `requests` package. The REST client implements the common patterns seen in the Druid REST API. You can create a client directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1e55635",
"metadata": {},
"outputs": [],
"source": [
"from druidapi.rest import DruidRestClient\n",
"rest_client = DruidRestClient(\"http://localhost:8888\")"
]
},
{
"cell_type": "markdown",
"id": "dcb8055f",
"metadata": {},
"source": [
"Or, if you have already created the Druid client, you can reuse the existing REST client. This is how the various other clients work internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "370ba76a",
"metadata": {},
"outputs": [],
"source": [
"rest_client = druid.rest"
]
},
{
"cell_type": "markdown",
"id": "2654e72c",
"metadata": {},
"source": [
"Use the REST client if you need to make calls that are not yet wrapped by the Python API, or if you want to do something special. To illustrate the client, you can make some of the same calls as in the [Druid REST API notebook](api-tutorial.ipynb).\n",
"\n",
"The REST API maintains the Druid host: you just provide the specifc URL tail. There are methods to get or post JSON results. For example, to get status information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e42dfbc",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"rest_client.get_json('/status')"
]
},
{
"cell_type": "markdown",
"id": "837e08b0",
"metadata": {},
"source": [
"A quick comparison of the three approaches (Requests, REST client, Python client):\n",
"\n",
"Status:\n",
"\n",
"* Requests: `session.get(druid_host + '/status').json()`\n",
"* REST client: `rest_client.get_json('/status')`\n",
"* Status client: `status_client.status()`\n",
"\n",
"Health:\n",
"\n",
"* Requests: `session.get(druid_host + '/status/health').json()`\n",
"* REST client: `rest_client.get_json('/status/health')`\n",
"* Status client: `status_client.is_healthy()`\n",
"\n",
"Ingest data:\n",
"\n",
"* Requests: See the [REST tutorial](api_tutorial.ipynb)\n",
"* REST client: as the REST tutorial, but use `rest_client.post_json('/druid/v2/sql/task', sql_request)` and\n",
" `rest_client.get_json(f\"/druid/indexer/v1/task/{ingestion_taskId}/status\")`\n",
"* SQL client: `sql_client.run_task(sql)`, also a form for a full SQL request.\n",
"\n",
"List datasources:\n",
"\n",
"* Requests: `session.get(druid_host + '/druid/coordinator/v1/datasources').json()`\n",
"* REST client: `rest_client.get_json('/druid/coordinator/v1/datasources')`\n",
"* Datasources client: `ds_client.names()`\n",
"\n",
"Query data, where `sql_request` is a properly formatted `SqlRequest` dictionary:\n",
"\n",
"* Requests: `session.post(druid_host + '/druid/v2/sql', json=sql_request).json()`\n",
"* REST client: `rest_client.post_json('/druid/v2/sql', sql_request)`\n",
"* SQL Client: `sql_client.show(sql)`, where `sql` is the query text\n",
"\n",
"In general, you have to provide the all the details for the Requests library. The REST client handles the low-level repetitious bits. The Python clients provide methods that encapsulate the specifics of the URLs and return formats."
]
},
{
"cell_type": "markdown",
"id": "edc4ee39",
"metadata": {},
"source": [
"## Constants\n",
"\n",
"Druid has a large number of special constants: type names, options, etc. The `consts` module provides definitions for many of these:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a90187c6",
"metadata": {},
"outputs": [],
"source": [
"from druidapi import consts"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc535898",
"metadata": {},
"outputs": [],
"source": [
"help(consts)"
]
},
{
"cell_type": "markdown",
"id": "b661b29f",
"metadata": {},
"source": [
"Using the constants avoids typos:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3393af62",
"metadata": {},
"outputs": [],
"source": [
"sql_client.tables(consts.SYS_SCHEMA)"
]
},
{
"cell_type": "markdown",
"id": "5e789ca7",
"metadata": {},
"source": [
"## Tracing\n",
"\n",
"It is often handy to see what the Druid API is doing: what messages it sends to Druid. You may need to debug some function that isn't working as expected. Or, perhaps you want to see what is sent to Druid so you can replicate it in your own code. Either way, just turn on tracing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac68b60e",
"metadata": {},
"outputs": [],
"source": [
"druid.trace(True)"
]
},
{
"cell_type": "markdown",
"id": "7b9dc7e3",
"metadata": {},
"source": [
"Then, each call to Druid prints what it sends:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72c955c0",
"metadata": {},
"outputs": [],
"source": [
"sql_client.tables()"
]
},
{
"cell_type": "markdown",
"id": "ddaf0dc2",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"This notebook have you a whirlwind tour of the Python Druid API: just enough to check your cluster, ingest some data with MSQ and query that data. Druid has many more APIs. As noted earlier, the Python API is a work in progress: the team adds new wrappers as needed for tutorials. Your [contributions](https://github.com/apache/druid/pulls) and [feedback](https://github.com/apache/druid/issues) are welcome."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}