mirror of https://github.com/apache/druid.git
511 lines
17 KiB
Markdown
511 lines
17 KiB
Markdown
<!--
|
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
~ or more contributor license agreements. See the NOTICE file
|
|
~ distributed with this work for additional information
|
|
~ regarding copyright ownership. The ASF licenses this file
|
|
~ to you under the Apache License, Version 2.0 (the
|
|
~ "License"); you may not use this file except in compliance
|
|
~ with the License. You may obtain a copy of the License at
|
|
~
|
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
~
|
|
~ Unless required by applicable law or agreed to in writing,
|
|
~ software distributed under the License is distributed on an
|
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
~ KIND, either express or implied. See the License for the
|
|
~ specific language governing permissions and limitations
|
|
~ under the License.
|
|
-->
|
|
|
|
# Python API for Druid
|
|
|
|
|
|
`druidapi` is a Python library to interact with all aspects of your
|
|
[Apache Druid](https://druid.apache.org/) cluster.
|
|
`druidapi` picks up where the venerable [pydruid](https://github.com/druid-io/pydruid) library
|
|
left off to include full SQL support and support for many of of Druid APIs. `druidapi` is usable
|
|
in any Python environment, but is optimized for use in Jupyter, providing a complete interactive
|
|
environment which complements the UI-based Druid console. The primary use of `druidapi` at present
|
|
is to support the set of tutorial notebooks provided in the parent directory.
|
|
|
|
`druidapi` works against any version of Druid. Operations that make use of newer features obviously work
|
|
only against versions of Druid that support those features.
|
|
|
|
## Install
|
|
|
|
At present, the best way to use `druidapi` is to clone the Druid repo itself:
|
|
|
|
```bash
|
|
git clone git@github.com:apache/druid.git
|
|
```
|
|
|
|
`druidapi` is located in `examples/quickstart/jupyter-notebooks/druidapi/`.
|
|
From this directory, install the package and its dependencies with pip using the following command:
|
|
|
|
```
|
|
pip install .
|
|
```
|
|
|
|
Note that there is a second level `druidapi` directory that contains the modules. Do not run
|
|
the install command in the subdirectory.
|
|
|
|
Verify your installation by checking that the following command runs in Python:
|
|
|
|
```python
|
|
import druidapi
|
|
```
|
|
|
|
The import statement should not return anything if it runs successfully.
|
|
|
|
## Getting started
|
|
|
|
To use `druidapi`, first import the library, then connect to your cluster by providing the URL to your Router instance. The way that is done differs a bit between consumers.
|
|
|
|
### From a tutorial Jupyter notebook
|
|
|
|
The tutorial Jupyter notebooks in `examples/quickstart/jupyter-notebooks` reside in the same directory tree
|
|
as this library. We start the library using the Jupyter-oriented API which is able to render tables in
|
|
HTML. First, identify your Router endpoint. Use the following for a local installation:
|
|
|
|
```python
|
|
router_endpoint = 'http://localhost:8888'
|
|
```
|
|
|
|
Then, import the library, declare the `druidapi` CSS styles, and create a client to your cluster:
|
|
|
|
```python
|
|
import druidapi
|
|
druid = druidapi.jupyter_client(router_endpoint)
|
|
```
|
|
|
|
The `jupyter_client` call defines a number of CSS styles to aid in displaying tabular results. It also
|
|
provides a "display" client that renders information as HTML tables.
|
|
|
|
### From a Python script
|
|
|
|
`druidapi` works in any Python script. When run outside of a Jupyter notebook, the various "display"
|
|
commands revert to displaying a text (not HTML) format. The steps are similar to those above:
|
|
|
|
```python
|
|
import druidapi
|
|
druid = druidapi.client(router_endpoint)
|
|
```
|
|
|
|
## Library organization
|
|
|
|
`druidapi` organizes Druid REST operations into various "clients," each of which provides operations
|
|
for one of Druid's functional areas. Obtain a client from the `druid` client created above. For
|
|
status operations:
|
|
|
|
```python
|
|
status_client = druid.status
|
|
```
|
|
|
|
The set of clients is still under construction. The set at present includes the following. The
|
|
set of operations within each client is also partial, and includes only those operations used
|
|
within one of the tutorial notebooks. Contributions are welcome to expand the scope. Clients are
|
|
available as properties on the `druid` object created above.
|
|
|
|
* `status` - Status operations such as service health, property values, and so on. This client
|
|
is special: it works only with the Router. The Router does not proxy these calls to other nodes.
|
|
Use the `status_for()` method to get status for other nodes.
|
|
* `datasources` - Operations on datasources such as dropping a datasource.
|
|
* `tasks` - Work with Overlord tasks: status, reports, and more.
|
|
* `sql` - SQL query operations for both the interactive query engine and MSQ.
|
|
* `display` - A set of convenience operations to display results as lightly formatted tables
|
|
in either HTML (for Jupyter notebooks) or text (for other Python scripts).
|
|
|
|
## Assumed cluster architecture
|
|
|
|
`druidapi` assumes that you run a standard Druid cluster with a Router in front of the other nodes.
|
|
This design works well for most Druid clusters:
|
|
|
|
* Run locally, such as the various quickstart clusters.
|
|
* Remote cluster on the same network.
|
|
* Druid cluster running under Docker Compose such as that explained in the Druid documentation.
|
|
* Druid integration test clusters launched via the Druid development `it.sh` command.
|
|
* Druid clusters running under Kubernetes
|
|
|
|
In all the Docker, Docker Compose and Kubernetes scenaris, the Router's port (typically 8888) must be visible
|
|
to the machine running `druidapi`, perhaps via port mapping or a proxy.
|
|
|
|
The Router is then responsible for routing Druid REST requests to the various other Druid nodes,
|
|
including those not visible outside of a private Docker or Kubernetes network.
|
|
|
|
The one exception to this rule is if you want to perform a health check (i.e. the `/status` endpoint)
|
|
on a service other than the Router. These checks are _not_ proxied by the Router: you must connect to
|
|
the target service directly.
|
|
|
|
## Status operations
|
|
|
|
When working with tutorials, a local Druid cluster, or a Druid integration test cluster, it is common
|
|
to start your cluster then immediately start performing `druidapi` operations. However, because Druid
|
|
is a distributed system, it can take some time for all the services to become ready. This seems to be
|
|
particularly true when starting a cluster with Docker Compose or Kubernetes on the local system.
|
|
|
|
Therefore, the first operation is to wait for the cluster to become ready:
|
|
|
|
```python
|
|
status_client = druid.status
|
|
status_client.wait_until_ready()
|
|
```
|
|
|
|
Without this step, your operations may mysteriously fail, and you'll wonder if you did something wrong.
|
|
Some clients retry operations multiple times in case a service is not yet ready. For typical scripts
|
|
against a stable cluster, the above line should be sufficient instead. This step is built into the
|
|
`jupyter_client()` method to ensure notebooks provide a good exerience.
|
|
|
|
If your notebook or script uses newer features, you should start by ensuring that the target Druid cluster
|
|
is of the correct version:
|
|
|
|
```python
|
|
status_client.version
|
|
```
|
|
|
|
This check will prevent frustration if the notebook is used against previous releases.
|
|
|
|
Similarly, if the notebook or script uses features defined in an extension, check that the required
|
|
extension is loaded:
|
|
|
|
```python
|
|
status_client.properties['druid.extensions.loadList']
|
|
```
|
|
|
|
## Display client
|
|
|
|
When run in a Jupyter notebook, it is often handy to format results for display. A special display
|
|
client performs operations _and_ formats them for display as HTML tables within the notebook.
|
|
|
|
```python
|
|
display = druid.display
|
|
```
|
|
|
|
The most common methods are:
|
|
|
|
* `sql(sql)` - Run a query and display the results as an HTML table.
|
|
* `schemas()` - Display the schemas defined in Druid.
|
|
* `tables(schema)` - Display the tables (datasources) in the given schema, `druid` by default.
|
|
* `table(name)` - Display the schema (list of columns) for the the given table. The name can
|
|
be one part (`foo`) or two parts (`INFORMATION_SCHEMA.TABLES`).
|
|
* `function(name)` - Display the arguments for a table function defined in the catalog.
|
|
|
|
The display client also has other methods to format data as a table, to display various kinds
|
|
of messages and so on.
|
|
|
|
## Interactive queries
|
|
|
|
The original [`pydruid`](https://pythonhosted.org/pydruid/) library revolves around Druid
|
|
"native" queries. Most new applications now use SQL. `druidapi` provides two ways to run
|
|
queries, depending on whether you want to display the results (typical in a notebook), or
|
|
use the results in Python code. You can run SQL queries using the SQL client:
|
|
|
|
```python
|
|
sql_client = druid.sql
|
|
```
|
|
|
|
To obtain the results of a SQL query against the example Wikipedia table (datasource) in a "raw" form:
|
|
|
|
|
|
```python
|
|
sql = '''
|
|
SELECT
|
|
channel,
|
|
COUNT(*) AS "count"
|
|
FROM wikipedia
|
|
GROUP BY channel
|
|
ORDER BY COUNT(*) DESC
|
|
LIMIT 5
|
|
'''
|
|
client.sql(sql)
|
|
```
|
|
|
|
Gives:
|
|
|
|
```text
|
|
[{'channel': '#en.wikipedia', 'count': 6650},
|
|
{'channel': '#sh.wikipedia', 'count': 3969},
|
|
{'channel': '#sv.wikipedia', 'count': 1867},
|
|
{'channel': '#ceb.wikipedia', 'count': 1808},
|
|
{'channel': '#de.wikipedia', 'count': 1357}]
|
|
```
|
|
|
|
The raw results are handy when Python code consumes the results, or for a quick check. The raw results
|
|
can also be forward to advanced visualization tools such a Pandas.
|
|
|
|
For simple visualization in notebooks (or as text in Python scripts), you can use the "display" client:
|
|
|
|
```python
|
|
display = druid.display
|
|
display.sql(sql)
|
|
```
|
|
|
|
When run without HTML visualization, the above gives:
|
|
|
|
```text
|
|
channel count
|
|
#en.wikipedia 6650
|
|
#sh.wikipedia 3969
|
|
#sv.wikipedia 1867
|
|
#ceb.wikipedia 1808
|
|
#de.wikipedia 1357
|
|
```
|
|
|
|
Within Jupyter, the results are formatted as an HTML table.
|
|
|
|
### Advanced queries
|
|
|
|
In addition to the SQL text, Druid also lets you specify:
|
|
|
|
* A query context
|
|
* Query parameters
|
|
* Result format options
|
|
|
|
The Druid `SqlQuery` object specifies these options. You can build up a Python equivalent:
|
|
|
|
```python
|
|
sql = '''
|
|
SELECT *
|
|
FROM INFORMATION_SCHEMA.SCHEMATA
|
|
WHERE SCHEMA_NAME = ?
|
|
'''
|
|
|
|
sql_query = {
|
|
'query': sql,
|
|
'parameters': [
|
|
{'type': consts.SQL_VARCHAR_TYPE, 'value': 'druid'}
|
|
],
|
|
'resultFormat': consts.SQL_OBJECT
|
|
}
|
|
```
|
|
|
|
However, the easier approach is to let `druidapi` handle the work for you using a SQL request:
|
|
|
|
```python
|
|
req = self.client.sql_request(sql)
|
|
req.add_parameter('druid')
|
|
```
|
|
|
|
Either way, when you submit the query in this form, you get a SQL response back:
|
|
|
|
```python
|
|
resp = sql_client.sql_query(req)
|
|
```
|
|
|
|
The SQL response wraps the REST response. First, we ensure that the request worked:
|
|
|
|
```python
|
|
resp.ok
|
|
```
|
|
|
|
If the request failed, we can obtain the error message:
|
|
|
|
```python
|
|
resp.error_message
|
|
```
|
|
|
|
If the request succeeded, we can obtain the results in a variety of ways. The easiest is to obtain
|
|
the data as a list of Java objects. This is the form shown in the "raw" example above. This works
|
|
only if you use the default ('objects') result format.
|
|
|
|
```python
|
|
resp.rows
|
|
```
|
|
|
|
You can also obtain the schema of the result:
|
|
|
|
```python
|
|
resp.schema
|
|
```
|
|
|
|
The result is a list of `ColumnSchema` objects. Get column information from the `name`, `sql_type`
|
|
and `druid_type` fields in each object.
|
|
|
|
For other formats, you can obtain the REST payload directly:
|
|
|
|
```python
|
|
resp.results
|
|
```
|
|
|
|
Use the `results()` method if you requested other formats, such as CSV. The `rows()` and `schema()` methods
|
|
are not available for these other result formats.
|
|
|
|
The result can also format the results as a text or HTML table, depending on how you created the client:
|
|
|
|
```python
|
|
resp.show()
|
|
```
|
|
|
|
In fact, the display client `sql()` method uses the `resp.show()` method internally, which in turn uses the
|
|
`rows` and `schema` properties.
|
|
|
|
### Run a query and return results
|
|
|
|
The above forms are handy for interactive use in a notebook. If you just need to run a query to use the results
|
|
in code, just do the following:
|
|
|
|
```python
|
|
rows = sql_client.sql(sql)
|
|
```
|
|
|
|
This form takes a set of arguments so that you can use Python to parameterize the query:
|
|
|
|
```python
|
|
sql = 'SELECT * FROM {}'
|
|
rows = sql_client.sql(sql, ['myTable'])
|
|
```
|
|
|
|
## MSQ queries
|
|
|
|
The SQL client can also run an MSQ query. See the `sql-tutorial.ipynb` notebook for examples. First define the
|
|
query:
|
|
|
|
```python
|
|
sql = '''
|
|
INSERT INTO myTable ...
|
|
'''
|
|
```
|
|
|
|
Then launch an ingestion task:
|
|
|
|
```python
|
|
task = sql_client.task(sql)
|
|
```
|
|
|
|
To learn the Overlord task ID:
|
|
|
|
```python
|
|
task.id
|
|
```
|
|
|
|
You can use the tasks client to track the status, or let the task object do it for you:
|
|
|
|
```python
|
|
task.wait_until_done()
|
|
```
|
|
|
|
You can combine the run-and-wait operations into a single call:
|
|
|
|
```python
|
|
task = sql_client.run_task(sql)
|
|
```
|
|
|
|
A quirk of Druid is that MSQ reports task completion as soon as ingestion is done. However, it takes a
|
|
while for Druid to load the resulting segments, so you must wait for the table to become queryable:
|
|
|
|
```python
|
|
sql_client.wait_until_ready('myTable')
|
|
```
|
|
|
|
## Datasource operations
|
|
|
|
To get information about a datasource, prefer to query the `INFORMATION_SCHEMA` tables, or use the methods
|
|
in the display client. Use the datasource client for other operations.
|
|
|
|
```python
|
|
datasources = druid.datasources
|
|
```
|
|
|
|
To delete a datasource:
|
|
|
|
```python
|
|
datasources.drop('myWiki', True)
|
|
```
|
|
|
|
The True argument asks for "if exists" semantics so you don't get an error if the datasource does not exist.
|
|
|
|
## REST client
|
|
|
|
The `druidapi` is based on a simple REST client which is itself based on the Requests library. If you
|
|
need to use Druid REST APIs not yet wrapped by this library, you can use the REST client directly.
|
|
(If you find such APIs, we encourage you to add methods to the library and contribute them to Druid.)
|
|
|
|
The REST client implements the common patterns seen in the Druid REST API. You can create a client directly:
|
|
|
|
```python
|
|
from druidapi.rest import DruidRestClient
|
|
rest_client = DruidRestClient("http://localhost:8888")
|
|
```
|
|
|
|
Or, if you have already created the Druid client, you can reuse the existing REST client. This is how
|
|
the various other clients work internally.
|
|
|
|
```python
|
|
rest_client = druid.rest
|
|
```
|
|
|
|
The REST API maintains the Druid host: you just provide the specifc URL tail. There are methods to get or
|
|
post JSON results. For example, to get status information:
|
|
|
|
```python
|
|
rest_client.get_json('/status')
|
|
```
|
|
|
|
A quick comparison of the three approaches (Requests, REST client, Python client):
|
|
|
|
Status:
|
|
|
|
* Requests: `session.get(druid_host + '/status').json()`
|
|
* REST client: `rest_client.get_json('/status')`
|
|
* Status client: `status_client.status()`
|
|
|
|
Health:
|
|
|
|
* Requests: `session.get(druid_host + '/status/health').json()`
|
|
* REST client: `rest_client.get_json('/status/health')`
|
|
* Status client: `status_client.is_healthy()`
|
|
|
|
Ingest data:
|
|
|
|
* Requests: See the REST tutorial.
|
|
* REST client: as the REST tutorial, but use `rest_client.post_json('/druid/v2/sql/task', sql_request)` and
|
|
`rest_client.get_json(f"/druid/indexer/v1/task/{ingestion_taskId}/status")`
|
|
* SQL client: `sql_client.run_task(sql)`, also a form for a full SQL request.
|
|
|
|
List datasources:
|
|
|
|
* Requests: `session.get(druid_host + '/druid/coordinator/v1/datasources').json()`
|
|
* REST client: `rest_client.get_json('/druid/coordinator/v1/datasources')`
|
|
* Datasources client: `ds_client.names()`
|
|
|
|
Query data, where `sql_request` is a properly-formatted `SqlRequest` dictionary:
|
|
|
|
* Requests: `session.post(druid_host + '/druid/v2/sql', json=sql_request).json()`
|
|
* REST client: `rest_client.post_json('/druid/v2/sql', sql_request)`
|
|
* SQL Client: `sql_client.show(sql)`, where `sql` is the query text
|
|
|
|
In general, you have to provide the all the details for the Requests library. The REST client handles the low-level repetitious bits. The Python clients provide methods that encapsulate the specifics of the URLs and return formats.
|
|
|
|
## Constants
|
|
|
|
Druid has a large number of special constants: type names, options, etc. The consts module provides definitions for many of these:
|
|
|
|
```python
|
|
from druidapi import consts
|
|
help(consts)
|
|
```
|
|
|
|
## Contributing
|
|
|
|
We encourage you to contribute to the `druidapi` package.
|
|
Set up an editable installation for development by running the following command
|
|
in a local clone of your `apache/druid` repo in
|
|
`examples/quickstart/jupyter-notebooks/druidapi/`:
|
|
|
|
```
|
|
pip install -e .
|
|
```
|
|
|
|
An editable installation allows you to implement and test changes iteratively
|
|
without having to reinstall the package with every change.
|
|
|
|
When you update the package, also increment the version field in `setup.py` following the
|
|
[PEP 440 semantic versioning scheme](https://peps.python.org/pep-0440/#semantic-versioning).
|
|
|
|
Use the following guidelines for incrementing the version number:
|
|
* Increment the third position for a patch or bug fix.
|
|
* Increment the second position for new features, such as adding new method wrappers.
|
|
* Increment the first position for major changes and changes that are not backwards compatible.
|
|
|
|
Submit your contribution by opening a pull request to the `apache/druid` GitHub repository.
|
|
|