# Python API for Druid

`druidapi` is a Python library to interact with all aspects of your [Apache Druid](https://druid.apache.org/) cluster. `druidapi` picks up where the venerable [pydruid](https://github.com/druid-io/pydruid) library left off to include full SQL support and support for many of Druid's APIs. `druidapi` is usable in any Python environment, but is optimized for use in Jupyter, providing a complete interactive environment which complements the UI-based Druid console. The primary use of `druidapi` at present is to support the set of tutorial notebooks provided in the parent directory.

`druidapi` works against any version of Druid. Operations that make use of newer features work only against versions of Druid that support those features.

## Install

At present, the best way to use `druidapi` is to clone the Druid repo itself:

```bash
git clone git@github.com:apache/druid.git
```

`druidapi` is located in `examples/quickstart/jupyter-notebooks/druidapi/`. From this directory, install the package and its dependencies with pip using the following command:

```
pip install .
```

Note that there is a second-level `druidapi` directory that contains the modules. Do not run the install command in the subdirectory.

Verify your installation by checking that the following command runs in Python:

```python
import druidapi
```

The import statement should not return anything if it runs successfully.

## Getting started

To use `druidapi`, first import the library, then connect to your cluster by providing the URL to your Router instance. How you do this differs slightly depending on where you run the code.

### From a tutorial Jupyter notebook

The tutorial Jupyter notebooks in `examples/quickstart/jupyter-notebooks` reside in the same directory tree as this library. We start the library using the Jupyter-oriented API, which is able to render tables in HTML. First, identify your Router endpoint.
Use the following for a local installation:

```python
router_endpoint = 'http://localhost:8888'
```

Then, import the library, declare the `druidapi` CSS styles, and create a client to your cluster:

```python
import druidapi
druid = druidapi.jupyter_client(router_endpoint)
```

The `jupyter_client` call defines a number of CSS styles to aid in displaying tabular results. It also provides a "display" client that renders information as HTML tables.

### From a Python script

`druidapi` works in any Python script. When run outside of a Jupyter notebook, the various "display" commands revert to displaying a text (not HTML) format. The steps are similar to those above:

```python
import druidapi
druid = druidapi.client(router_endpoint)
```

## Library organization

`druidapi` organizes Druid REST operations into various "clients," each of which provides operations for one of Druid's functional areas. Obtain a client from the `druid` client created above. For status operations:

```python
status_client = druid.status
```

The set of clients is still under construction. The set of operations within each client is also partial, and includes only those operations used within one of the tutorial notebooks. Contributions are welcome to expand the scope.

Clients are available as properties on the `druid` object created above. The set at present includes the following:

* `status` - Status operations such as service health, property values, and so on. This client is special: it works only with the Router. The Router does not proxy these calls to other nodes. Use the `status_for()` method to get status for other nodes.
* `datasources` - Operations on datasources such as dropping a datasource.
* `tasks` - Work with Overlord tasks: status, reports, and more.
* `sql` - SQL query operations for both the interactive query engine and MSQ.
* `display` - A set of convenience operations to display results as lightly formatted tables in either HTML (for Jupyter notebooks) or text (for other Python scripts).

## Assumed cluster architecture

`druidapi` assumes that you run a standard Druid cluster with a Router in front of the other nodes. This design works well for most Druid clusters:

* Clusters run locally, such as the various quickstart clusters.
* Remote clusters on the same network.
* Druid clusters running under Docker Compose, such as that explained in the Druid documentation.
* Druid integration test clusters launched via the Druid development `it.sh` command.
* Druid clusters running under Kubernetes.

In all the Docker, Docker Compose and Kubernetes scenarios, the Router's port (typically 8888) must be visible to the machine running `druidapi`, perhaps via port mapping or a proxy. The Router is then responsible for routing Druid REST requests to the various other Druid nodes, including those not visible outside of a private Docker or Kubernetes network.

The one exception to this rule is if you want to perform a health check (i.e. the `/status` endpoint) on a service other than the Router. These checks are _not_ proxied by the Router: you must connect to the target service directly.

## Status operations

When working with tutorials, a local Druid cluster, or a Druid integration test cluster, it is common to start your cluster and then immediately start performing `druidapi` operations. However, because Druid is a distributed system, it can take some time for all the services to become ready. This seems to be particularly true when starting a cluster with Docker Compose or Kubernetes on the local system.

Therefore, the first operation is to wait for the cluster to become ready:

```python
status_client = druid.status
status_client.wait_until_ready()
```

Without this step, your operations may mysteriously fail, and you'll wonder if you did something wrong.
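
The idea behind waiting for readiness can be sketched as a generic polling loop. This is a minimal illustration in plain Python, not the implementation of `wait_until_ready()`: the `wait_until` helper, its `timeout` and `interval` parameters, and the `fake_health_check` function are all hypothetical names introduced for the example.

```python
import time

def wait_until(check, timeout=30.0, interval=0.5):
    """Call check() until it returns True or `timeout` seconds elapse.

    Returns True if the check passed and False on timeout. Exceptions raised
    by check() are treated as "not ready yet", on the assumption that a
    service which is still starting refuses connections.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # Service not reachable yet; try again.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Hypothetical stand-in for a health check: fails twice, then succeeds.
attempts = []

def fake_health_check():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError('not up yet')
    return True

ready = wait_until(fake_health_check, timeout=5.0, interval=0.01)
```

In a real script the `check` argument would be whatever call reports service health; the loop itself only decides when to stop trying.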
Some client libraries retry each operation multiple times in case a service is not yet ready. For typical scripts against a stable cluster, the single `wait_until_ready()` call above should be sufficient instead. This step is built into the `jupyter_client()` method to ensure notebooks provide a good experience.

If your notebook or script uses newer features, you should start by ensuring that the target Druid cluster is of the correct version:

```python
status_client.version
```

This check prevents frustration if the notebook is used against previous releases.

Similarly, if the notebook or script uses features defined in an extension, check that the required extension is loaded:

```python
status_client.properties['druid.extensions.loadList']
```

## Display client

When run in a Jupyter notebook, it is often handy to format results for display. A special display client performs operations _and_ formats them for display as HTML tables within the notebook.

```python
display = druid.display
```

The most common methods are:

* `sql(sql)` - Run a query and display the results as an HTML table.
* `schemas()` - Display the schemas defined in Druid.
* `tables(schema)` - Display the tables (datasources) in the given schema, `druid` by default.
* `table(name)` - Display the schema (list of columns) for the given table. The name can be one part (`foo`) or two parts (`INFORMATION_SCHEMA.TABLES`).
* `function(name)` - Display the arguments for a table function defined in the catalog.

The display client also has other methods to format data as a table, to display various kinds of messages and so on.

## Interactive queries

The original [`pydruid`](https://pythonhosted.org/pydruid/) library revolves around Druid "native" queries. Most new applications now use SQL. `druidapi` provides two ways to run queries, depending on whether you want to display the results (typical in a notebook), or use the results in Python code.
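
The difference between the two can be illustrated without a cluster. The sketch below does not call `druidapi`: it assumes query results arrive as a list of dictionaries, uses a couple of sample rows in place of real results, and defines a hypothetical `format_table` helper to stand in for the "display" style.

```python
def format_table(rows):
    """Format a list of dicts as a left-aligned text table, one line per row."""
    if not rows:
        return ''
    headers = list(rows[0].keys())
    # Width of each column: the longest header or value, as text.
    widths = [max(len(str(h)), *(len(str(r[h])) for r in rows)) for h in headers]
    lines = ['  '.join(str(h).ljust(w) for h, w in zip(headers, widths))]
    for r in rows:
        lines.append('  '.join(str(r[h]).ljust(w) for h, w in zip(headers, widths)))
    return '\n'.join(line.rstrip() for line in lines)

# Sample rows standing in for query results.
sample_rows = [
    {'channel': '#en.wikipedia', 'count': 6650},
    {'channel': '#sh.wikipedia', 'count': 3969},
]

# Use the results in code: rows are ordinary Python dictionaries.
total = sum(row['count'] for row in sample_rows)

# Display the results: format the same rows as a table.
table = format_table(sample_rows)
print(table)
```
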
You can run SQL queries using the SQL client:

```python
sql_client = druid.sql
```

To obtain the results of a SQL query against the example Wikipedia table (datasource) in a "raw" form:

```python
sql = '''
SELECT
  channel,
  COUNT(*) AS "count"
FROM wikipedia
GROUP BY channel
ORDER BY COUNT(*) DESC
LIMIT 5
'''
sql_client.sql(sql)
```

Gives:

```text
[{'channel': '#en.wikipedia', 'count': 6650},
 {'channel': '#sh.wikipedia', 'count': 3969},
 {'channel': '#sv.wikipedia', 'count': 1867},
 {'channel': '#ceb.wikipedia', 'count': 1808},
 {'channel': '#de.wikipedia', 'count': 1357}]
```

The raw results are handy when Python code consumes the results, or for a quick check. The raw results can also be forwarded to advanced visualization tools such as Pandas.

For simple visualization in notebooks (or as text in Python scripts), you can use the "display" client:

```python
display = druid.display
display.sql(sql)
```

When run without HTML visualization, the above gives:

```text
channel        count
#en.wikipedia   6650
#sh.wikipedia   3969
#sv.wikipedia   1867
#ceb.wikipedia  1808
#de.wikipedia   1357
```

Within Jupyter, the results are formatted as an HTML table.

### Advanced queries

In addition to the SQL text, Druid also lets you specify:

* A query context
* Query parameters
* Result format options

The Druid `SqlQuery` object specifies these options. You can build up a Python equivalent:

```python
from druidapi import consts

sql = '''
SELECT *
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = ?
'''

sql_query = {
    'query': sql,
    'parameters': [
        {'type': consts.SQL_VARCHAR_TYPE, 'value': 'druid'}
    ],
    'resultFormat': consts.SQL_OBJECT
}
```

However, the easier approach is to let `druidapi` handle the work for you using a SQL request:

```python
req = sql_client.sql_request(sql)
req.add_parameter('druid')
```

Either way, when you submit the query in this form, you get a SQL response back:

```python
resp = sql_client.sql_query(req)
```

The SQL response wraps the REST response.
First, we ensure that the request worked:

```python
resp.ok
```

If the request failed, we can obtain the error message:

```python
resp.error_message
```

If the request succeeded, we can obtain the results in a variety of ways. The easiest is to obtain the data as a list of Python objects. This is the form shown in the "raw" example above. This works only if you use the default ('objects') result format.

```python
resp.rows
```

You can also obtain the schema of the result:

```python
resp.schema
```

The result is a list of `ColumnSchema` objects. Get column information from the `name`, `sql_type` and `druid_type` fields in each object.

For other formats, you can obtain the REST payload directly:

```python
resp.results
```

Use the `results` property if you requested other formats, such as CSV. The `rows` and `schema` properties are not available for these other result formats.

The response can also format the results as a text or HTML table, depending on how you created the client:

```python
resp.show()
```

In fact, the display client `sql()` method uses the `resp.show()` method internally, which in turn uses the `rows` and `schema` properties.

### Run a query and return results

The above forms are handy for interactive use in a notebook. If you only need to run a query and use the results in code, do the following:

```python
rows = sql_client.sql(sql)
```

This form takes a set of arguments so that you can use Python to parameterize the query:

```python
sql = 'SELECT * FROM {}'
rows = sql_client.sql(sql, ['myTable'])
```

## MSQ queries

The SQL client can also run an MSQ query. See the `sql-tutorial.ipynb` notebook for examples. First define the query:

```python
sql = '''
INSERT INTO myTable ...
'''
```

Then launch an ingestion task:

```python
task = sql_client.task(sql)
```

To learn the Overlord task ID:

```python
task.id
```

You can use the tasks client to track the status, or let the task object do it for you:

```python
task.wait_until_done()
```

You can combine the run-and-wait operations into a single call:

```python
task = sql_client.run_task(sql)
```

A quirk of Druid is that MSQ reports task completion as soon as ingestion is done. However, it takes a while for Druid to load the resulting segments, so you must wait for the table to become queryable:

```python
sql_client.wait_until_ready('myTable')
```

## Datasource operations

To get information about a datasource, prefer to query the `INFORMATION_SCHEMA` tables, or use the methods in the display client. Use the datasource client for other operations.

```python
datasources = druid.datasources
```

To delete a datasource:

```python
datasources.drop('myWiki', True)
```

The `True` argument asks for "if exists" semantics so you don't get an error if the datasource does not exist.

## REST client

`druidapi` is based on a simple REST client which is itself based on the Requests library. If you need to use Druid REST APIs not yet wrapped by this library, you can use the REST client directly. (If you find such APIs, we encourage you to add methods to the library and contribute them to Druid.)

The REST client implements the common patterns seen in the Druid REST API. You can create a client directly:

```python
from druidapi.rest import DruidRestClient
rest_client = DruidRestClient("http://localhost:8888")
```

Or, if you have already created the Druid client, you can reuse the existing REST client. This is how the various other clients work internally.

```python
rest_client = druid.rest
```

The REST client maintains the Druid host: you just provide the specific URL tail. There are methods to get or post JSON results.
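
The pattern of storing the host once and passing only the URL tail can be sketched as follows. This is a hypothetical stand-in, not `DruidRestClient`: the `MiniRestClient` class and its `build_url` method are invented for illustration, and it uses the standard library's `urllib` rather than Requests so that the example has no dependencies.

```python
import json
import urllib.request

class MiniRestClient:
    """Hypothetical stand-in for illustration; not druidapi's DruidRestClient."""

    def __init__(self, endpoint):
        # Store the host once, without a trailing slash.
        self.endpoint = endpoint.rstrip('/')

    def build_url(self, tail):
        # Join the stored host and the URL tail with exactly one slash.
        return self.endpoint + '/' + tail.lstrip('/')

    def get_json(self, tail):
        # Issue a GET and decode the JSON body (requires a reachable server).
        with urllib.request.urlopen(self.build_url(tail)) as resp:
            return json.load(resp)

client = MiniRestClient('http://localhost:8888/')
status_url = client.build_url('/status')
print(status_url)
```
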
For example, to get status information:

```python
rest_client.get_json('/status')
```

A quick comparison of the three approaches (Requests, REST client, Python client):

Status:

* Requests: `session.get(druid_host + '/status').json()`
* REST client: `rest_client.get_json('/status')`
* Status client: `status_client.status()`

Health:

* Requests: `session.get(druid_host + '/status/health').json()`
* REST client: `rest_client.get_json('/status/health')`
* Status client: `status_client.is_healthy()`

Ingest data:

* Requests: See the REST tutorial.
* REST client: As in the REST tutorial, but use `rest_client.post_json('/druid/v2/sql/task', sql_request)` and `rest_client.get_json(f"/druid/indexer/v1/task/{ingestion_taskId}/status")`
* SQL client: `sql_client.run_task(sql)`; there is also a form that takes a full SQL request.

List datasources:

* Requests: `session.get(druid_host + '/druid/coordinator/v1/datasources').json()`
* REST client: `rest_client.get_json('/druid/coordinator/v1/datasources')`
* Datasources client: `ds_client.names()`

Query data, where `sql_request` is a properly formatted `SqlRequest` dictionary:

* Requests: `session.post(druid_host + '/druid/v2/sql', json=sql_request).json()`
* REST client: `rest_client.post_json('/druid/v2/sql', sql_request)`
* SQL client: `sql_client.show(sql)`, where `sql` is the query text

In general, you have to provide all the details for the Requests library. The REST client handles the low-level repetitious bits. The Python clients provide methods that encapsulate the specifics of the URLs and return formats.

## Constants

Druid has a large number of special constants: type names, options, etc. The `consts` module provides definitions for many of these:

```python
from druidapi import consts
help(consts)
```

## Contributing

We encourage you to contribute to the `druidapi` package.
Set up an editable installation for development by running the following command in a local clone of your `apache/druid` repo in `examples/quickstart/jupyter-notebooks/druidapi/`:

```
pip install -e .
```

An editable installation allows you to implement and test changes iteratively without having to reinstall the package with every change.

When you update the package, also increment the version field in `setup.py` following the [PEP 440 semantic versioning scheme](https://peps.python.org/pep-0440/#semantic-versioning).

Use the following guidelines for incrementing the version number:

* Increment the third position for a patch or bug fix.
* Increment the second position for new features, such as adding new method wrappers.
* Increment the first position for major changes and changes that are not backwards compatible.

Submit your contribution by opening a pull request to the `apache/druid` GitHub repository.
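
As a worked illustration of the version guidelines above, the following sketch computes the next version string. The `bump_version` helper is hypothetical and not part of `druidapi`. It assumes a plain three-part `X.Y.Z` version and, following common semantic versioning practice rather than anything stated in this README, resets the lower positions to zero.

```python
def bump_version(version, change):
    """Return `version` ('X.Y.Z') incremented for 'major', 'minor' or 'patch'."""
    major, minor, patch = (int(part) for part in version.split('.'))
    if change == 'major':
        # Breaking change: first position; lower positions reset (assumption).
        return f'{major + 1}.0.0'
    if change == 'minor':
        # New feature: second position; patch resets (assumption).
        return f'{major}.{minor + 1}.0'
    if change == 'patch':
        # Bug fix: third position.
        return f'{major}.{minor}.{patch + 1}'
    raise ValueError(f'unknown change type: {change}')

next_patch = bump_version('0.1.0', 'patch')
next_minor = bump_version('0.1.3', 'minor')
next_major = bump_version('0.4.2', 'major')
```
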