" ~ Unless required by applicable law or agreed to in writing,\n",
" ~ software distributed under the License is distributed on an\n",
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" ~ KIND, either express or implied. See the License for the\n",
" ~ specific language governing permissions and limitations\n",
" ~ under the License.\n",
" -->\n",
"\n",
"This notebook provides a quick introduction to the Python wrapper around the [Druid REST API](api-tutorial.ipynb). This notebook assumes you are familiar with the basics of the REST API, and the [set of operations which Druid provides](https://druid.apache.org/docs/latest/operations/api-reference.html). This tutorial focuses on using Python to access those APIs rather than explaining the APIs themselves. The APIs themselves are covered in other notebooks that use the Python API.\n",
"The Druid Python API is primarily intended to help with these notebook tutorials. It can also be used in your own ad-hoc notebooks, or in a regular Python program.\n",
"\n",
"The Druid Python API is a work in progress. The Druid team adds API wrappers as needed for the notebook tutorials. If you find you need additional wrappers, please feel free to add them, and post a PR to Apache Druid with your additions.\n",
"\n",
"The API provides two levels of functions. Most are simple wrappers around Druid's REST APIs. Others add additional code to make the API easier to use. The SQL query interface is a prime example: extra code translates a simple SQL query into Druid's `SQLQuery` object and interprets the results into a form that can be displayed in a notebook.\n",
"\n",
"This notebook contains sample output to allow it to function as a reference. To run it yourself, start by using the `Kernel` → `Restart & Clear Output` menu command to clear the sample output.\n",
"\n",
"Start by importing the `druidapi` package from the same folder as this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d90ca5d",
"metadata": {},
"outputs": [],
"source": [
"import druidapi"
]
},
{
"cell_type": "markdown",
"id": "fb68a838",
"metadata": {},
"source": [
"Next, connect to your cluster by providing the router endpoint. The code assumes the cluster is on your local machine, using the default port. Go ahead and change this if your setup is different.\n",
"\n",
"The API uses the router to forward messages to each of Druid's services so that you don't have to keep track of the host and port for each service.\n",
"The `jupyter_client()` method waits for the cluster to be ready and sets up the client to display tables and messages as HTML. To use this code without waiting and without HTML formatting, use the `client()` method instead."
"The display client performs Druid operations, then formats the results for display in a notebook. Running SQL queries in a notebook is easy with the display client.\n",
"\n",
"When run outside a notebook, the display client formats results as text. The display client is the most convenient way to work with Druid in a notebook. Most operations also have a form that returns results as Python objects rather than displaying them. Use these methods if you write code to work with the results. Here the goal is just to interact with Druid."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f867f1f0",
"metadata": {},
"outputs": [],
"source": [
"display = druid.display"
]
},
{
"cell_type": "markdown",
"id": "d051bc5e",
"metadata": {},
"source": [
"Start by getting a list of schemas."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd8387e0",
"metadata": {},
"outputs": [],
"source": [
"display.schemas()"
]
},
{
"cell_type": "markdown",
"id": "b8261ab0",
"metadata": {},
"source": [
"Then, retreive the tables (or datasources) within any schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64dcb46a",
"metadata": {},
"outputs": [],
"source": [
"display.tables('INFORMATION_SCHEMA')"
]
},
{
"cell_type": "markdown",
"id": "ff311595",
"metadata": {},
"source": [
"The above shows the list of datasources by default. You'll get an empty result if you have no datasources yet."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "616770ce",
"metadata": {},
"outputs": [],
"source": [
"display.tables()"
]
},
{
"cell_type": "markdown",
"id": "7392e484",
"metadata": {},
"source": [
"You can easily run a query and show the results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c649eef",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT TABLE_NAME\n",
"FROM INFORMATION_SCHEMA.TABLES\n",
"WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
"'''\n",
"display.sql(sql)"
]
},
{
"cell_type": "markdown",
"id": "c6c4e1d4",
"metadata": {},
"source": [
"The query above showed the same results as `tables()`. That is not surprising: `tables()` just runs this query for you."
]
},
{
"cell_type": "markdown",
"id": "f414d145",
"metadata": {},
"source": [
"## SQL Client\n",
"\n",
"While the display client is handy for simple queries, sometimes you need more control, or want to work with the data returned from a query. For this you use the SQL client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9951e976",
"metadata": {},
"outputs": [],
"source": [
"sql_client = druid.sql"
]
},
{
"cell_type": "markdown",
"id": "7b944084",
"metadata": {},
"source": [
"The SQL client allows you create a SQL request object that enables passing context parameters and query parameters. Druid will work out the query parameter type based on the Python type. Use the display client to show the query results."
"The request has other features for advanced use cases: see the code for details. The query API actually returns a sql response object. Use this if you want to get the values directly, work with the schema, etc."
"The `show()` method uses this information for format an HTML table to present the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8dba807b",
"metadata": {},
"outputs": [],
"source": [
"resp.show()"
]
},
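{
"cell_type": "markdown",
"id": "3e7c5a12",
"metadata": {},
"source": [
"The request object described earlier carries a query, typed parameters, and a context. As a rough picture of what that amounts to, here is a minimal sketch of the JSON body that Druid's SQL endpoint (`/druid/v2/sql`) accepts. The helper names `druid_param_type` and `build_sql_payload`, and the Python-to-SQL type mapping, are assumptions made for illustration; they are not the `druidapi` implementation.\n",
"\n",
"```python\n",
"def druid_param_type(value):\n",
"    # Assumed mapping from Python types to SQL parameter type names.\n",
"    # bool is checked first because bool is a subclass of int.\n",
"    if isinstance(value, bool):\n",
"        return 'BOOLEAN'\n",
"    if isinstance(value, int):\n",
"        return 'BIGINT'\n",
"    if isinstance(value, float):\n",
"        return 'DOUBLE'\n",
"    if isinstance(value, str):\n",
"        return 'VARCHAR'\n",
"    raise TypeError('unsupported parameter type: ' + type(value).__name__)\n",
"\n",
"\n",
"def build_sql_payload(sql, params=(), context=None):\n",
"    # Assemble the request body: query text, typed parameters, query context.\n",
"    payload = {'query': sql}\n",
"    if params:\n",
"        payload['parameters'] = [\n",
"            {'type': druid_param_type(p), 'value': p} for p in params\n",
"        ]\n",
"    if context:\n",
"        payload['context'] = dict(context)\n",
"    return payload\n",
"\n",
"\n",
"payload = build_sql_payload(\n",
"    'SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = ?',\n",
"    params=['INFORMATION_SCHEMA'],\n",
"    context={'sqlTimeZone': 'UTC'},\n",
")\n",
"```\n",
"\n",
"Only the payload is built here; nothing is sent to a cluster."
]
},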
{
"cell_type": "markdown",
"id": "99f8db7b",
"metadata": {},
"source": [
"The display and SQL clients are intened for exploratory queries. The [pydruid](https://pythonhosted.org/pydruid/) library provides a robust way to run native queries, to run SQL queries, and to convert the results to various formats."
]
},
{
"cell_type": "markdown",
"id": "9e3be017",
"metadata": {},
"source": [
"## MSQ Ingestion\n",
"\n",
"The SQL client also performs MSQ-based ingestion using `INSERT` or `REPLACE` statements. Use the extension check above to ensure that `druid-multi-stage-query` is loaded in Druid 26. (Later versions may have MSQ built in.)\n",
"\n",
"An MSQ query is run using a different API: `task()`. This API returns a response object that describes the Overlord task which runs the MSQ query. For tutorials, data is usually small enough you can wait for the ingestion to complete. Do that with the `run_task()` call which handles the waiting. To illustrate, here is a query that ingests a subset of columns, and includes a few data clean-up steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10f1e451",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"REPLACE INTO \"myWiki1\" OVERWRITE ALL\n",
"SELECT\n",
" TIME_PARSE(\"timestamp\") AS \"__time\",\n",
" namespace,\n",
" page,\n",
" channel,\n",
" \"user\",\n",
" countryName,\n",
" CASE WHEN isRobot = 'true' THEN 1 ELSE 0 END AS isRobot,\n",
" \"added\",\n",
" \"delta\",\n",
" CASE WHEN isNew = 'true' THEN 1 ELSE 0 END AS isNew,\n",
" CAST(\"deltaBucket\" AS DOUBLE) AS deltaBucket,\n",
"MSQ reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments. Wait for the table to become ready."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37fcedf2",
"metadata": {},
"outputs": [],
"source": [
"sql_client.wait_until_ready('myWiki1')"
]
},
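{
"cell_type": "markdown",
"id": "a94c0e3b",
"metadata": {},
"source": [
"Waiting for a table to become ready is a polling problem. The sketch below shows the general pattern with a hypothetical `wait_until` helper and a stand-in readiness check; it is an assumption about the shape of such a loop, not the code behind `wait_until_ready()`.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"\n",
"def wait_until(predicate, timeout=60.0, interval=1.0):\n",
"    # Poll predicate() until it returns True or the timeout elapses.\n",
"    deadline = time.monotonic() + timeout\n",
"    while True:\n",
"        if predicate():\n",
"            return True\n",
"        if time.monotonic() >= deadline:\n",
"            return False\n",
"        time.sleep(interval)\n",
"\n",
"\n",
"# Stand-in for a real readiness check: reports ready on the third call.\n",
"calls = {'count': 0}\n",
"\n",
"\n",
"def fake_table_ready():\n",
"    calls['count'] += 1\n",
"    return calls['count'] >= 3\n",
"\n",
"\n",
"ready = wait_until(fake_table_ready, timeout=5.0, interval=0.01)\n",
"```\n",
"\n",
"In a real setting the predicate would ask Druid whether the table's segments are loaded."
]
},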
{
"cell_type": "markdown",
"id": "11d9c95a",
"metadata": {},
"source": [
"`describe_table()` lists the columns in a table."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b662697b",
"metadata": {},
"outputs": [],
"source": [
"display.table('myWiki1')"
]
},
{
"cell_type": "markdown",
"id": "936f57fb",
"metadata": {},
"source": [
"You can sample a few rows of data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4cfa5dc",
"metadata": {},
"outputs": [],
"source": [
"display.sql('SELECT * FROM myWiki1 LIMIT 10')"
]
},
{
"cell_type": "markdown",
"id": "c1152f41",
"metadata": {},
"source": [
"## Datasource Client\n",
"\n",
"The Datasource client lets you perform operations on datasource objects. The SQL layer allows you to get metadata and do queries. The datasource client works with the underlying segments. Explaining the full functionality is the topic of another notebook. For now, you can use the datasource client to clean up the datasource created above. The `True` argument asks for \"if exists\" semantics so you don't get an error if the datasource was alredy deleted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fba659ce",
"metadata": {},
"outputs": [],
"source": [
"ds_client = druid.datasources\n",
"ds_client.drop('myWiki', True)"
]
},
{
"cell_type": "markdown",
"id": "c96fdcc6",
"metadata": {},
"source": [
"## Tasks Client\n",
"\n",
"Use the tasks client to work with Overlord tasks. The `run_task()` call above actually uses the task client internally to poll Overlord."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4f5ea17",
"metadata": {},
"outputs": [],
"source": [
"task_client = druid.tasks\n",
"task_client.tasks()"
]
},
{
"cell_type": "markdown",
"id": "1deaf95f",
"metadata": {},
"source": [
"## REST Client\n",
"\n",
"The Druid Python API starts with a REST client that itself is built on the `requests` package. The REST client implements the common patterns seen in the Druid REST API. You can create a client directly:"
"Use the REST client if you need to make calls that are not yet wrapped by the Python API, or if you want to do something special. To illustrate the client, you can make some of the same calls as in the [Druid REST API notebook](api-tutorial.ipynb).\n",
"The REST API maintains the Druid host: you just provide the specifc URL tail. There are methods to get or post JSON results. For example, to get status information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e42dfbc",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"rest_client.get_json('/status')"
]
},
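{
"cell_type": "markdown",
"id": "5b18d6f4",
"metadata": {},
"source": [
"The idea of a client that keeps the host and accepts only the URL tail can be sketched in a few lines. The `MiniRestClient` class below is hypothetical and exists only for illustration. It uses the standard library's `urllib` instead of `requests` so that it has no dependencies, and it is not the `druidapi` REST client.\n",
"\n",
"```python\n",
"import json\n",
"from urllib.request import urlopen\n",
"\n",
"\n",
"class MiniRestClient:\n",
"    # Hypothetical stand-in for illustration only.\n",
"    def __init__(self, endpoint):\n",
"        # Store the base endpoint once so callers pass only the URL tail.\n",
"        self.endpoint = endpoint.rstrip('/')\n",
"\n",
"    def build_url(self, tail):\n",
"        # Join the stored endpoint and the tail with exactly one slash.\n",
"        return self.endpoint + '/' + tail.lstrip('/')\n",
"\n",
"    def get_json(self, tail):\n",
"        # Issue a GET request and decode the JSON response body.\n",
"        with urlopen(self.build_url(tail)) as resp:\n",
"            return json.load(resp)\n",
"\n",
"\n",
"client = MiniRestClient('http://localhost:8888/')\n",
"status_url = client.build_url('/status')\n",
"```\n",
"\n",
"Calling `client.get_json('/status')` would need a running cluster, so the sketch only builds the URL."
]
},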
{
"cell_type": "markdown",
"id": "837e08b0",
"metadata": {},
"source": [
"A quick comparison of the three approaches (Requests, REST client, Python client):\n",
"* SQL Client: `sql_client.show(sql)`, where `sql` is the query text\n",
"\n",
"In general, you have to provide the all the details for the Requests library. The REST client handles the low-level repetitious bits. The Python clients provide methods that encapsulate the specifics of the URLs and return formats."
]
},
{
"cell_type": "markdown",
"id": "edc4ee39",
"metadata": {},
"source": [
"## Constants\n",
"\n",
"Druid has a large number of special constants: type names, options, etc. The `consts` module provides definitions for many of these:"
"It is often handy to see what the Druid API is doing: what messages it sends to Druid. You may need to debug some function that isn't working as expected. Or, perhaps you want to see what is sent to Druid so you can replicate it in your own code. Either way, just turn on tracing:"
"This notebook have you a whirlwind tour of the Python Druid API: just enough to check your cluster, ingest some data with MSQ and query that data. Druid has many more APIs. As noted earlier, the Python API is a work in progress: the team adds new wrappers as needed for tutorials. Your [contributions](https://github.com/apache/druid/pulls) and [feedback](https://github.com/apache/druid/issues) are welcome."