Python API for Druid

druidapi is a Python library to interact with all aspects of your Apache Druid cluster. druidapi picks up where the venerable pydruid library left off to include full SQL support and support for many of Druid's APIs. druidapi is usable in any Python environment, but is optimized for use in Jupyter, providing a complete interactive environment which complements the UI-based Druid console. The primary use of druidapi at present is to support the set of tutorial notebooks provided in the parent directory.

druidapi works against any version of Druid. Operations that make use of newer features obviously work only against versions of Druid that support those features.

Install

At present, the best way to use druidapi is to clone the Druid repo itself:

git clone git@github.com:apache/druid.git

druidapi is located in examples/quickstart/jupyter-notebooks/druidapi/. From this directory, install the package and its dependencies with pip using the following command:

pip install .

Note that there is a second-level druidapi directory that contains the modules. Do not run the install command in that subdirectory.

Verify your installation by checking that the following command runs in Python:

import druidapi

The import statement should not return anything if it runs successfully.

Getting started

To use druidapi, first import the library, then connect to your cluster by providing the URL of your Router instance. The exact steps differ slightly depending on how you use the library.

From a tutorial Jupyter notebook

The tutorial Jupyter notebooks in examples/quickstart/jupyter-notebooks reside in the same directory tree as this library. They use the Jupyter-oriented API, which can render tables in HTML. First, identify your Router endpoint. Use the following for a local installation:

router_endpoint = 'http://localhost:8888'

Then, import the library and create a client for your cluster; the jupyter_client call also declares the druidapi CSS styles:

import druidapi
druid = druidapi.jupyter_client(router_endpoint)

The jupyter_client call defines a number of CSS styles to aid in displaying tabular results. It also provides a "display" client that renders information as HTML tables.

From a Python script

druidapi works in any Python script. When run outside of a Jupyter notebook, the various "display" commands revert to displaying a text (not HTML) format. The steps are similar to those above:

import druidapi
druid = druidapi.client(router_endpoint)

Library organization

druidapi organizes Druid REST operations into various "clients," each of which provides operations for one of Druid's functional areas. Obtain a client from the druid object created above. For status operations:

status_client = druid.status

The set of clients is still under construction; at present it includes those listed below. The set of operations within each client is also partial, covering only those operations used within one of the tutorial notebooks. Contributions to expand the scope are welcome. Clients are available as properties on the druid object created above, as the sketch after the list shows.

  • status - Status operations such as service health, property values, and so on. This client is special: it works only with the Router. The Router does not proxy these calls to other nodes. Use the status_for() method to get status for other nodes.
  • datasources - Operations on datasources such as dropping a datasource.
  • tasks - Work with Overlord tasks: status, reports, and more.
  • sql - SQL query operations for both the interactive query engine and MSQ.
  • display - A set of convenience operations to display results as lightly formatted tables in either HTML (for Jupyter notebooks) or text (for other Python scripts).
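
For example, you can obtain each of the clients above from the druid object; the variable names here are only suggestions:

status_client = druid.status       # status operations (Router only)
datasources = druid.datasources    # datasource operations
tasks = druid.tasks                # Overlord task operations
sql_client = druid.sql             # SQL queries (interactive and MSQ)
display = druid.display            # HTML/text display helpers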

Assumed cluster architecture

druidapi assumes that you run a standard Druid cluster with a Router in front of the other nodes. This design works well for most Druid clusters:

  • A local cluster, such as the various quickstart clusters.
  • A remote cluster on the same network.
  • A Druid cluster running under Docker Compose, such as the one described in the Druid documentation.
  • Druid integration test clusters launched via the Druid development it.sh command.
  • Druid clusters running under Kubernetes.

In all of the Docker, Docker Compose, and Kubernetes scenarios, the Router's port (typically 8888) must be visible to the machine running druidapi, perhaps via port mapping or a proxy.

The Router is then responsible for routing Druid REST requests to the various other Druid nodes, including those not visible outside of a private Docker or Kubernetes network.

The one exception to this rule is if you want to perform a health check (i.e. the /status endpoint) on a service other than the Router. These checks are not proxied by the Router: you must connect to the target service directly.
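
For example, a minimal sketch of a direct health check against a Historical service using the library's REST client (described in more detail below), assuming the service is reachable from your machine on port 8083:

from druidapi.rest import DruidRestClient

# Connect directly to the target service; the host and port here are assumptions.
historical = DruidRestClient('http://localhost:8083')
historical.get_json('/status/health')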

Status operations

When working with tutorials, a local Druid cluster, or a Druid integration test cluster, it is common to start your cluster then immediately start performing druidapi operations. However, because Druid is a distributed system, it can take some time for all the services to become ready. This seems to be particularly true when starting a cluster with Docker Compose or Kubernetes on the local system.

Therefore, the first operation is to wait for the cluster to become ready:

status_client = druid.status
status_client.wait_until_ready()

Without this step, your operations may mysteriously fail, and you'll wonder if you did something wrong. Some clients retry operations multiple times in case a service is not yet ready. For typical scripts against a stable cluster, the above line should be sufficient. This step is built into the jupyter_client() method to ensure notebooks provide a good experience.

If your notebook or script uses newer features, you should start by ensuring that the target Druid cluster is of the correct version:

status_client.version

This check will prevent frustration if the notebook is used against previous releases.

Similarly, if the notebook or script uses features defined in an extension, check that the required extension is loaded:

status_client.properties['druid.extensions.loadList']
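
For example, a minimal sketch that refuses to proceed unless the MSQ extension is loaded; the extension name here is just an example:

load_list = status_client.properties['druid.extensions.loadList']
if 'druid-multi-stage-query' not in load_list:
    raise Exception('This notebook requires the druid-multi-stage-query extension')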

Display client

When run in a Jupyter notebook, it is often handy to format results for display. A special display client performs operations and formats them for display as HTML tables within the notebook.

display = druid.display

The most common methods are:

  • sql(sql) - Run a query and display the results as an HTML table.
  • schemas() - Display the schemas defined in Druid.
  • tables(schema) - Display the tables (datasources) in the given schema, druid by default.
  • table(name) - Display the schema (list of columns) for the given table. The name can be one part (foo) or two parts (INFORMATION_SCHEMA.TABLES).
  • function(name) - Display the arguments for a table function defined in the catalog.

The display client also has other methods to format data as a table, to display various kinds of messages and so on.
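
For example, assuming the example Wikipedia datasource from the tutorials is loaded, you might explore it like this:

display = druid.display
display.schemas()             # list the available schemas
display.tables()              # list the datasources in the druid schema
display.table('wikipedia')    # show the columns of the wikipedia datasource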

Interactive queries

The original pydruid library revolves around Druid "native" queries. Most new applications now use SQL. druidapi provides two ways to run queries, depending on whether you want to display the results (typical in a notebook), or use the results in Python code. You can run SQL queries using the SQL client:

sql_client = druid.sql

To obtain the results of a SQL query against the example Wikipedia table (datasource) in a "raw" form:

sql = '''
SELECT
  channel,
  COUNT(*) AS "count"
FROM wikipedia
GROUP BY channel
ORDER BY COUNT(*) DESC
LIMIT 5
'''
sql_client.sql(sql)

Gives:

[{'channel': '#en.wikipedia', 'count': 6650},
 {'channel': '#sh.wikipedia', 'count': 3969},
 {'channel': '#sv.wikipedia', 'count': 1867},
 {'channel': '#ceb.wikipedia', 'count': 1808},
 {'channel': '#de.wikipedia', 'count': 1357}]

The raw results are handy when Python code consumes the results, or for a quick check. The raw results can also be forwarded to advanced visualization tools such as Pandas.
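
For example, a minimal sketch that loads the raw rows into a Pandas DataFrame, assuming Pandas is installed in your environment:

import pandas as pd

rows = sql_client.sql(sql)    # list of dictionaries, one per row
df = pd.DataFrame(rows)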

For simple visualization in notebooks (or as text in Python scripts), you can use the "display" client:

display = druid.display
display.sql(sql)

When run without HTML visualization, the above gives:

channel        count
#en.wikipedia   6650
#sh.wikipedia   3969
#sv.wikipedia   1867
#ceb.wikipedia  1808
#de.wikipedia   1357

Within Jupyter, the results are formatted as an HTML table.

Advanced queries

In addition to the SQL text, Druid also lets you specify:

  • A query context
  • Query parameters
  • Result format options

The Druid SqlQuery object specifies these options. You can build up a Python equivalent:

sql = '''
SELECT *
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = ?
'''

from druidapi import consts

sql_query = {
    'query': sql,
    'parameters': [
        {'type': consts.SQL_VARCHAR_TYPE, 'value': 'druid'}
    ],
    'resultFormat': consts.SQL_OBJECT
}

However, the easier approach is to let druidapi handle the work for you using a SQL request:

req = sql_client.sql_request(sql)
req.add_parameter('druid')

Either way, when you submit the query in this form, you get a SQL response back:

resp = sql_client.sql_query(req)

The SQL response wraps the REST response. First, we ensure that the request worked:

resp.ok

If the request failed, we can obtain the error message:

resp.error_message

If the request succeeded, we can obtain the results in a variety of ways. The easiest is to obtain the data as a list of Python objects. This is the form shown in the "raw" example above. This works only if you use the default ('objects') result format.

resp.rows

You can also obtain the schema of the result:

resp.schema

The result is a list of ColumnSchema objects. Get column information from the name, sql_type and druid_type fields in each object.
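
Putting these pieces together, a minimal sketch of the typical pattern for handling a SQL response:

resp = sql_client.sql_query(req)
if resp.ok:
    for col in resp.schema:
        print(col.name, col.sql_type, col.druid_type)
    for row in resp.rows:
        print(row)
else:
    print(resp.error_message)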

For other formats, you can obtain the REST payload directly:

resp.results

Use results if you requested other formats, such as CSV. The rows and schema properties are not available for these other result formats.

The response can also format the results as a text or HTML table, depending on how you created the client:

resp.show()

In fact, the display client sql() method uses the resp.show() method internally, which in turn uses the rows and schema properties.

Run a query and return results

The above forms are handy for interactive use in a notebook. If you just need to run a query and use the results in code, do the following:

rows = sql_client.sql(sql)

This form takes a set of arguments so that you can use Python to parameterize the query:

sql = 'SELECT * FROM {}'
rows = sql_client.sql(sql, ['myTable'])

MSQ queries

The SQL client can also run an MSQ query. See the sql-tutorial.ipynb notebook for examples. First define the query:

sql = '''
INSERT INTO myTable ...
'''

Then launch an ingestion task:

task = sql_client.task(sql)

To learn the Overlord task ID:

task.id

You can use the tasks client to track the status, or let the task object do it for you:

task.wait_until_done()

You can combine the run-and-wait operations into a single call:

task = sql_client.run_task(sql)

A quirk of Druid is that MSQ reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments, so you must wait for the table to become queryable:

sql_client.wait_until_ready('myTable')
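
Putting these steps together, a minimal sketch of a complete MSQ ingestion; the sql variable and the myTable name are the placeholders from above:

task = sql_client.run_task(sql)           # launch the ingestion task and wait for it to finish
sql_client.wait_until_ready('myTable')    # then wait for the new segments to become queryable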

Datasource operations

To get information about a datasource, prefer to query the INFORMATION_SCHEMA tables, or use the methods in the display client. Use the datasource client for other operations.

datasources = druid.datasources

To delete a datasource:

datasources.drop('myWiki', True)

The True argument asks for "if exists" semantics so you don't get an error if the datasource does not exist.

REST client

druidapi is based on a simple REST client, which is itself based on the Requests library. If you need to use Druid REST APIs not yet wrapped by this library, you can use the REST client directly. (If you find such APIs, we encourage you to add methods to the library and contribute them to Druid.)

The REST client implements the common patterns seen in the Druid REST API. You can create a client directly:

from druidapi.rest import DruidRestClient
rest_client = DruidRestClient("http://localhost:8888")

Or, if you have already created the Druid client, you can reuse the existing REST client. This is how the various other clients work internally.

rest_client = druid.rest

The REST client maintains the Druid host: you provide just the specific URL tail. There are methods to get or post JSON results. For example, to get status information:

rest_client.get_json('/status')
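
As another sketch, you can post a SQL query through the REST client using the /druid/v2/sql endpoint shown in the comparison below; the query itself is just an example:

sql_request = {'query': 'SELECT 1'}
rest_client.post_json('/druid/v2/sql', sql_request)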

A quick comparison of the three approaches (Requests, REST client, Python client):

Status:

  • Requests: session.get(druid_host + '/status').json()
  • REST client: rest_client.get_json('/status')
  • Status client: status_client.status()

Health:

  • Requests: session.get(druid_host + '/status/health').json()
  • REST client: rest_client.get_json('/status/health')
  • Status client: status_client.is_healthy()

Ingest data:

  • Requests: See the REST tutorial.
  • REST client: same as the REST tutorial, but use rest_client.post_json('/druid/v2/sql/task', sql_request) and rest_client.get_json(f"/druid/indexer/v1/task/{ingestion_taskId}/status")
  • SQL client: sql_client.run_task(sql); there is also a form that takes a full SQL request.

List datasources:

  • Requests: session.get(druid_host + '/druid/coordinator/v1/datasources').json()
  • REST client: rest_client.get_json('/druid/coordinator/v1/datasources')
  • Datasources client: ds_client.names()

Query data, where sql_request is a properly-formatted SqlRequest dictionary:

  • Requests: session.post(druid_host + '/druid/v2/sql', json=sql_request).json()
  • REST client: rest_client.post_json('/druid/v2/sql', sql_request)
  • SQL Client: sql_client.show(sql), where sql is the query text

In general, you have to provide all the details for the Requests library. The REST client handles the low-level repetitious bits. The Python clients provide methods that encapsulate the specifics of the URLs and return formats.

Constants

Druid has a large number of special constants: type names, options, etc. The consts module provides definitions for many of these:

from druidapi import consts
help(consts)

Contributing

We encourage you to contribute to the druidapi package. Set up an editable installation for development by running the following command from the examples/quickstart/jupyter-notebooks/druidapi/ directory of a local clone of the apache/druid repo:

pip install -e .

An editable installation allows you to implement and test changes iteratively without having to reinstall the package with every change.

When you update the package, also increment the version field in setup.py following the PEP 440 semantic versioning scheme.

Use the following guidelines for incrementing the version number:

  • Increment the third position for a patch or bug fix.
  • Increment the second position for new features, such as adding new method wrappers.
  • Increment the first position for major changes and changes that are not backwards compatible.

Submit your contribution by opening a pull request to the apache/druid GitHub repository.