{
"cells": [
{
"cell_type": "markdown",
"id": "ad4e60b6",
"metadata": {
"tags": []
},
"source": [
"# Learn the basics of the Druid API\n",
"\n",
"<!--\n",
" ~ Licensed to the Apache Software Foundation (ASF) under one\n",
" ~ or more contributor license agreements. See the NOTICE file\n",
" ~ distributed with this work for additional information\n",
" ~ regarding copyright ownership. The ASF licenses this file\n",
" ~ to you under the Apache License, Version 2.0 (the\n",
" ~ \"License\"); you may not use this file except in compliance\n",
" ~ with the License. You may obtain a copy of the License at\n",
" ~\n",
" ~ http://www.apache.org/licenses/LICENSE-2.0\n",
" ~\n",
" ~ Unless required by applicable law or agreed to in writing,\n",
" ~ software distributed under the License is distributed on an\n",
" ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" ~ KIND, either express or implied. See the License for the\n",
" ~ specific language governing permissions and limitations\n",
" ~ under the License.\n",
" -->\n",
" \n",
"This tutorial introduces you to the basics of the Druid API and some of the endpoints you might use frequently to perform tasks, such as the following:\n",
"\n",
"- Checking if your cluster is up\n",
"- Ingesting data\n",
"- Querying data\n",
"\n",
"Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, the Overlord service on the Master server provides the status of a task. You'll also interact the Broker service on the Query Server to see what datasources are available. And to run queries, you'll interact with the Broker. The Router service on the Query servers routes API calls.\n",
"\n",
"For more information, see the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html), which is organized by server type.\n",
"\n",
"For work within other notebooks, prefer to use the [Python API](Python_API_Tutorial.ipynb) which is a notebook-friendly wrapper around the low-level API calls shown here.\n",
"\n",
"## Table of contents\n",
"\n",
"- [Prerequisites](#Prerequisites)\n",
"- [Get basic cluster information](#Get-basic-cluster-information)\n",
"- [Ingest data](#Ingest-data)\n",
"- [Query your data](#Query-your-data)\n",
"- [Learn more](#Learn-more)\n",
"\n",
"For the best experience, use JupyterLab so that you can always access the table of contents."
]
},
{
"cell_type": "markdown",
"id": "8d6bbbcb",
"metadata": {
"tags": []
},
"source": [
"## Prerequisites\n",
"\n",
"Install the [Requests](https://requests.readthedocs.io/en/latest/) library for Python before you start. For example:\n",
"\n",
"```bash\n",
"pip install requests\n",
"```\n",
"\n",
"Please read the [Requests Quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/) to gain a basic understanding of how Requests works.\n",
"\n",
"Next, you'll need a Druid cluster. This tutorial uses the `micro-quickstart` config described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So download that and start it if you haven't already. In the root of the Druid folder, run the following command to start Druid:\n",
"\n",
"```bash\n",
"./bin/start-micro-quickstart\n",
"```\n",
"\n",
"Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` by default, so you'll need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:\n",
"\n",
"```bash\n",
"# If you're using JupyterLab\n",
"jupyter lab --port 3001\n",
"# If you're using Jupyter Notebook\n",
"jupyter notebook --port 3001 \n",
"```"
]
},
{
"cell_type": "markdown",
"id": "12c0e1c3",
"metadata": {},
"source": [
"To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the the Druid host, where the Router service listens.\n",
"\n",
"`druid_host` is the hostname and port for your Druid deployment. In a distributed environment, you can point to other Druid services. In this tutorial, you'll use the Router service as the `druid_host`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7f08a52",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"druid_host = \"http://localhost:8888\"\n",
"druid_host"
]
},
{
"cell_type": "markdown",
"id": "a22c69c8",
"metadata": {},
"source": [
"If your cluster is secure, you'll need to provide authorization information on each request. You can automate it by using the Requests `session` feature. Although this tutorial assumes no authorization, the configuration below defines a session as an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06ea91de",
"metadata": {},
"outputs": [],
"source": [
"session = requests.Session()"
]
},
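{
"cell_type": "markdown",
"id": "a1f04c3d",
"metadata": {},
"source": [
"The next cell is a minimal sketch of where credentials would go, assuming HTTP Basic authentication. The username and password are placeholders, not values from this tutorial, so leave the line commented out for the unsecured quickstart cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c815aa",
"metadata": {},
"outputs": [],
"source": [
"# Attach credentials to the session once; Requests then sends them with\n",
"# every request made through this session. The values are placeholders.\n",
"# session.auth = ('some_user', 'some_password')"
]
},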
{
"cell_type": "markdown",
"id": "2093ecf0-fb4b-405b-a216-094583580e0a",
"metadata": {},
"source": [
"In the rest of this tutorial, the `endpoint` and other variables are updated in code cells to call a different Druid endpoint to accomplish a task."
]
},
{
"cell_type": "markdown",
"id": "29c24856",
"metadata": {
"tags": []
},
"source": [
"## Get basic cluster information\n",
"\n",
"In this cell, you'll use the `GET /status` endpoint to return basic information about your cluster, such as the Druid version, loaded extensions, and resource consumption.\n",
"\n",
"The following cell sets `endpoint` to `/status` and updates the HTTP method to `GET`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a1b453e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"endpoint = druid_host + '/status'\n",
"endpoint"
]
},
{
"cell_type": "markdown",
"id": "1e853795",
"metadata": {},
"source": [
"The Requests `Session` has a `get()` method that posts an HTTP `GET` request. The method takes multiple arguments. Here you only need the URL. The method returns a Requests `Response` object, which can convert the returned JSON result to Python. JSON objects become Python dictionaries. JSON arrays become Python arrays. When you run the cell, you should receive a response that starts with the version number of your Druid deployment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "baa140b8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"response = session.get(endpoint)\n",
"json = response.json()\n",
"json"
]
},
{
"cell_type": "markdown",
"id": "de82029e",
"metadata": {},
"source": [
"Because the JSON result is converted to Python, you can use Python to pull out the information you want. For example, to see just the version:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64e83cec",
"metadata": {},
"outputs": [],
"source": [
"json['version']"
]
},
{
"cell_type": "markdown",
"id": "cbeb5a63",
"metadata": {
"tags": []
},
"source": [
"### Get cluster health\n",
"\n",
"The `/status/health` endpoint returns JSON `true` if your cluster is up and running. It's useful if you want to do something like programmatically check if your cluster is available. When you run the following cell, you should receive the `True` Python value if your Druid cluster has finished starting up and is running."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21121c13",
"metadata": {},
"outputs": [],
"source": [
"endpoint = druid_host + '/status/health'\n",
"endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e51170e",
"metadata": {},
"outputs": [],
"source": [
"is_healthy = session.get(endpoint).json()\n",
"is_healthy"
]
},
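{
"cell_type": "markdown",
"id": "c3d82f41",
"metadata": {},
"source": [
"As a sketch of that programmatic check, the next cell polls `/status/health` until the cluster responds with `true`, treating connection errors as \"not ready yet.\" With the quickstart cluster already running, it prints immediately."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4a96b27",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"# Poll the health endpoint for up to a minute, treating connection\n",
"# errors as \"the cluster isn't up yet.\"\n",
"for _ in range(12):\n",
"    try:\n",
"        if session.get(endpoint).json():\n",
"            print(\"Cluster is available\")\n",
"            break\n",
"    except requests.exceptions.ConnectionError:\n",
"        pass\n",
"    time.sleep(5)"
]
},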
{
"cell_type": "markdown",
"id": "1917aace",
"metadata": {
"tags": []
},
"source": [
"## Ingest data\n",
"\n",
"Now that you've confirmed that your cluster is up and running, you can start ingesting data. There are different ways to ingest data based on what your needs are. For more information, see [Ingestion methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
"\n",
"This tutorial uses the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion.\n",
"\n",
"To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "994ba704",
"metadata": {},
"outputs": [],
"source": [
"endpoint = druid_host + '/druid/v2/sql/task'\n",
"endpoint"
]
},
{
"cell_type": "markdown",
"id": "2168db3a",
"metadata": {
"tags": []
},
"source": [
"The example uses INSERT, but you could also use REPLACE INTO. In fact, if you have an existing datasource with the name `wikipedia_api`, you need to use REPLACE INTO instead. \n",
"\n",
"The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the `query`, `context`, and `parameters` fields.\n",
"\n",
"The query inserts data from an external source into a table named `wikipedia_api`. \n",
"\n",
"Before you ingest the data, take a look at the query. Pay attention to two parts of it, `__time` and `PARTITIONED BY`, which relate to how Druid partitions data:\n",
"\n",
"- **`__time`**\n",
"\n",
" The `__time` column is a key concept for Druid. It's the default partition for Druid and is treated as the primary timestamp. Use it to help you write faster and more efficient queries. Big datasets, such as those for event data, typically have a time component. This means that instead of writing a query using only `COUNT`, you can combine that with `WHERE __time` to return results much more quickly.\n",
"\n",
"- **`PARTITIONED BY DAY`**\n",
"\n",
" If you partition by day, Druid creates segment files within the partition based on the day. You can only replace, delete and update data at the partition level. So when you're deciding how to partition data, make the partition large enough (min 500,000 rows) for good performance but not so big that those operations become impractical to run.\n",
"\n",
"To learn more, see [Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html).\n",
"\n",
"The query uses `INSERT INTO`. If you have an existing datasource with the name `wikipedia_api`, use `REPLACE INTO` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90c34908",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"INSERT INTO wikipedia_api \n",
"SELECT \n",
" TIME_PARSE(\"timestamp\") AS __time,\n",
" * \n",
"FROM TABLE(EXTERN(\n",
" '{\"type\": \"http\", \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"]}', \n",
" '{\"type\": \"json\"}', \n",
" '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\", \"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"}, {\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\", \"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"}, {\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\", \"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\": \"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\": \"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\": \"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\": \"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\": \"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\": \"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\": \"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"regionIsoCode\", \"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"}, {\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]'\n",
" ))\n",
"PARTITIONED BY DAY\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "f7dcddd7",
"metadata": {},
"source": [
"The query is included inline here. You can also store it in a file and provide the file.\n",
"\n",
"The above showed how the Requests library can convert the response from JSON to Python. Requests can also convert the request from Python to JSON. The next cell builds up a Python map that represents the Druid `SqlRequest` object. In this case, you need the query and a context variable to set the task count to 3."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6e82c0a",
"metadata": {},
"outputs": [],
"source": [
"sql_request = {\n",
" 'query': sql,\n",
" 'context': {\n",
" 'maxNumTasks': 3\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "6aa7f230",
"metadata": {},
"source": [
"With the SQL request ready, use the the `json` parameter to the `Session` `post` method to send a `POST` request with the `sql_request` object as the payload. The result is a Requests `Response` which is saved in a variable.\n",
"\n",
"Now, run the next cell to start the ingestion."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2939a07",
"metadata": {},
"outputs": [],
"source": [
"response = session.post(endpoint, json=sql_request)"
]
},
{
"cell_type": "markdown",
"id": "9ba1821f",
"metadata": {
"tags": []
},
"source": [
"The MSQ task engine uses a task to ingest data. The response for the API includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it."
]
},
{
"cell_type": "markdown",
"id": "f9cc2e45",
"metadata": {},
"source": [
"It is good practice to ensure that the response succeeded by checking the return status. The status should be 20x. (202 means \"accepted\".) If the response is something else, such as 4xx, display `response.text` to see the error message."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01d875f7",
"metadata": {},
"outputs": [],
"source": [
"response.status_code"
]
},
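{
"cell_type": "markdown",
"id": "e5b07c93",
"metadata": {},
"source": [
"As a sketch, you can make that check programmatic. The Requests `Response` exposes `ok`, which is true for any status below 400, and `raise_for_status()`, which raises an exception for error responses."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6c18d05",
"metadata": {},
"outputs": [],
"source": [
"# Print Druid's error message if the request failed, then raise so the\n",
"# notebook stops here instead of failing later in a confusing way.\n",
"if not response.ok:\n",
"    print(response.text)\n",
"response.raise_for_status()"
]
},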
{
"cell_type": "markdown",
"id": "a983270d",
"metadata": {},
"source": [
"Convert the JSON response to a Python object."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95eeb9bf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"json = response.json()\n",
"json"
]
},
{
"cell_type": "markdown",
"id": "9f9c1b6b",
"metadata": {},
"source": [
"Extract the taskId value from the taskId_response variable so that you can reference it later:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2eeb3e3f",
"metadata": {},
"outputs": [],
"source": [
"ingestion_taskId = json['taskId']\n",
"ingestion_taskId"
]
},
{
"cell_type": "markdown",
"id": "f17892d9-a8c1-43d6-890c-7d68cd792c72",
"metadata": {
"tags": []
},
"source": [
"### Get the status of your task\n",
"\n",
"The following cell shows you how to get the status of your ingestion task. The example continues to run API calls against the endpoint to fetch the status until the ingestion task completes. When it's done, you'll see the JSON response.\n",
"\n",
"You can see basic information about your query, such as when it started and whether or not it's finished.\n",
"\n",
"In addition to the status, you can retrieve a full report about it if you want using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need that information for this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7b65866",
"metadata": {},
"outputs": [],
"source": [
"endpoint = druid_host + f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
"endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "22b08321",
"metadata": {},
"outputs": [],
"source": [
"json = session.get(endpoint).json()\n",
"json"
]
},
{
"cell_type": "markdown",
"id": "74a9b593",
"metadata": {},
"source": [
"Wait until your ingestion completes before proceeding. Depending on what else is happening in your Druid cluster and the resources available, ingestion can take some time. Pro tip: an asterisk appears next to the cell while Python runs and waits for the task."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aae5c228",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"ingestion_status = json['status']['status']\n",
"\n",
"if ingestion_status == \"RUNNING\":\n",
" print(\"The ingestion is running...\")\n",
"\n",
"while ingestion_status != \"SUCCESS\":\n",
" time.sleep(5) # 5 seconds \n",
" json = session.get(endpoint).json()\n",
" ingestion_status = json['status']['status']\n",
" \n",
"if ingestion_status == \"SUCCESS\": \n",
" print(\"The ingestion is complete\")\n",
"else:\n",
" print(\"The ingestion task failed:\", json)"
]
},
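{
"cell_type": "markdown",
"id": "a7e29b16",
"metadata": {},
"source": [
"Now that the task has finished, you can fetch the full report mentioned earlier. It's a larger JSON document with per-stage detail for the MSQ task. To keep the output short, the next cell displays only the report's top-level keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8f31c28",
"metadata": {},
"outputs": [],
"source": [
"# Fetch the full task report for the ingestion task.\n",
"reports_endpoint = druid_host + f\"/druid/indexer/v1/task/{ingestion_taskId}/reports\"\n",
"report = session.get(reports_endpoint).json()\n",
"# Show only the top-level keys; the full report is verbose.\n",
"list(report.keys())"
]
},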
{
"cell_type": "markdown",
"id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
"metadata": {
"tags": []
},
"source": [
"## Query your data\n",
"\n",
"When you ingest data into Druid, Druid stores the data in a datasource, and this datasource is what you run queries against.\n",
"\n",
"### List your datasources\n",
"\n",
"You can get a list of datasources from the `/druid/coordinator/v1/datasources` endpoint. Since you're just getting started, there should only be a single datasource, the `wikipedia_api` table you created earlier when you ingested external data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a482c1f0",
"metadata": {},
"outputs": [],
"source": [
"endpoint = druid_host + '/druid/coordinator/v1/datasources'\n",
"endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0a7799e",
"metadata": {},
"outputs": [],
"source": [
"session.get(endpoint).json()"
]
},
{
"cell_type": "markdown",
"id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
"metadata": {
"tags": []
},
"source": [
"### SELECT data\n",
"\n",
"Now, you can query the data. Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. For example, the following cell selects all columns but limits the results to 3 rows for display purposes because each row is a JSON object. In actual use cases, you'll want to only select the rows that you need. For more information about the kinds of things you can do, see [Druid SQL](https://druid.apache.org/docs/latest/querying/sql.html).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44868ff9",
"metadata": {},
"outputs": [],
"source": [
"endpoint = druid_host + '/druid/v2/sql'\n",
"endpoint"
]
},
{
"cell_type": "markdown",
"id": "b2b366ad",
"metadata": {},
"source": [
"As for ingestion, define a query, then create a `SQLRequest` object as a Python `dict`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7c77093",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT * \n",
"FROM wikipedia_api \n",
"LIMIT 3\n",
"'''\n",
"sql_request = {'query': sql}"
]
},
{
"cell_type": "markdown",
"id": "0df529d6",
"metadata": {},
"source": [
"Now run the query. The result is an array of JSON objects. Each JSON object in the response represents a row in the `wikipedia_api` datasource."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac4f95b5",
"metadata": {},
"outputs": [],
"source": [
"json = session.post(endpoint, json=sql_request).json()\n",
"json"
]
},
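{
"cell_type": "markdown",
"id": "c9a42d39",
"metadata": {},
"source": [
"As an example of the `__time` filtering described in the ingestion section, the next cell counts the rows that fall within a single day, so Druid only needs to read the segments for that day. This assumes the sample Wikipedia data, whose events all occur on 2016-06-27."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0b53e4a",
"metadata": {},
"outputs": [],
"source": [
"# Count rows within one day; the __time filter limits the scan to the\n",
"# segments for that day.\n",
"sql = '''\n",
"SELECT COUNT(*) AS \"rowCount\"\n",
"FROM wikipedia_api\n",
"WHERE __time >= TIMESTAMP '2016-06-27 00:00:00'\n",
"  AND __time < TIMESTAMP '2016-06-28 00:00:00'\n",
"'''\n",
"session.post(endpoint, json={'query': sql}).json()"
]
},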
{
"cell_type": "markdown",
"id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
"metadata": {
"tags": []
},
"source": [
"In addition to the query, there are a few additional things you can define within the payload. For a full list, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
"\n",
"This tutorial uses a context parameter and a dynamic parameter.\n",
"\n",
"Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryID` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can use it to reference the query more easily. For example, if you need to cancel a query.\n",
"\n",
"\n",
"Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestmap value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n",
"\n",
"\n",
"The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a645287",
"metadata": {},
"outputs": [],
"source": [
"sql = '''\n",
"SELECT * \n",
"FROM wikipedia_api \n",
"WHERE __time > ? \n",
"LIMIT 1\n",
"'''\n",
"sql_request = {\n",
" 'query': sql,\n",
" 'context': {\n",
" 'sqlQueryId' : 'important-query'\n",
" },\n",
" 'parameters': [\n",
" { 'type': 'TIMESTAMP', 'value': '2016-06-27'}\n",
" ]\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5cced58",
"metadata": {},
"outputs": [],
"source": [
"json = session.post(endpoint, json=sql_request).json()\n",
"json"
]
},
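{
"cell_type": "markdown",
"id": "e1c64f5b",
"metadata": {},
"source": [
"Because the request set `sqlQueryId` to `important-query`, you could cancel the query while it runs by sending `DELETE` to `/druid/v2/sql/{sqlQueryId}`. The next cell sketches that call. Since the query above has already finished, expect a `404` status here rather than the `202` that a successful cancellation returns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2d75a6c",
"metadata": {},
"outputs": [],
"source": [
"# Attempt to cancel the query by its custom sqlQueryId. A running query\n",
"# returns 202 (accepted); a finished or unknown one returns 404.\n",
"cancel_endpoint = druid_host + '/druid/v2/sql/important-query'\n",
"session.delete(cancel_endpoint).status_code"
]
},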
{
"cell_type": "markdown",
"id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e",
"metadata": {
"tags": []
},
"source": [
"## Learn more\n",
"\n",
"This tutorial covers the some of the basics related to the Druid API. To learn more about the kinds of things you can do, see the API documentation:\n",
"\n",
"- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
"- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n",
"\n",
"You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by Paul Rogers, a Druid contributor. A simplified version of that library is included with these tutorials. See [the Python API Tutorial](Python_API_Tutorial.ipynb) for an overview. That tutorial shows how to do the same tasks as this one, but in a simpler form: focusing on the Druid actions and not the mechanics of the REST API."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"vscode": {
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}