---
id: multitenancy
title: "Multitenancy considerations"
sidebar_label: "Multitenancy"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements. See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership. The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License. You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied. See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Apache Druid is often used to power user-facing data applications, where multitenancy is an important requirement. This
document outlines Druid's multitenant storage and querying features.

## Shared datasources or datasource-per-tenant?

A datasource is the Druid equivalent of a database table. Multitenant workloads can either use a separate datasource
for each tenant, or can share one or more datasources between tenants using a "tenant_id" dimension. When deciding
which path to go down, consider that each path has pros and cons.

Pros of datasources per tenant:

- Each datasource can have its own schema, its own backfills, its own partitioning rules, and its own data loading
and expiration rules.
- Queries can be faster since there will be fewer segments to examine for a typical tenant's query.
- You get the most flexibility.

Pros of shared datasources:

- Each datasource requires its own JVMs for realtime indexing.
- Each datasource requires its own YARN resources for Hadoop batch jobs.
- Each datasource requires its own segment files on disk.
- For these reasons it can be wasteful to have a very large number of small datasources.

One compromise is to use more than one datasource, but a smaller number than tenants. For example, you could have some
tenants with partitioning rules A and some with partitioning rules B; you could use two datasources and split your
tenants between them.

## Partitioning shared datasources

If your multitenant cluster uses shared datasources, most of your queries will likely filter on a "tenant_id"
dimension. These sorts of queries perform best when data is well-partitioned by tenant. There are a few ways to
accomplish this.

With batch indexing, you can use [single-dimension partitioning](../ingestion/hadoop.md#single-dimension-range-partitioning)
to partition your data by tenant_id. Druid always partitions by time first, but the secondary partition within each
time bucket will be on tenant_id.

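As a sketch, a Hadoop-based ingestion spec could request this secondary partitioning with a `partitionsSpec` along
these lines (the exact type and field names depend on your Druid version, and the row target here is illustrative):

```json
"partitionsSpec": {
  "type": "single_dim",
  "partitionDimension": "tenant_id",
  "targetRowsPerSegment": 5000000
}
```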
With realtime indexing, you'd do this by tweaking the stream you send to Druid. For example, if you're using Kafka then
you can have your Kafka producer partition your topic by a hash of tenant_id.

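A minimal sketch of that producer-side idea, assuming you assign partitions yourself (Kafka clients apply the same
principle automatically when you pass tenant_id as the record key; the function name and hash choice here are
illustrative, not part of any Kafka API):

```python
import hashlib

def partition_for_tenant(tenant_id: str, num_partitions: int) -> int:
    """Map a tenant to a fixed partition using a stable hash.

    hashlib is used instead of Python's built-in hash(), which is
    salted per process and would scatter a tenant across partitions
    between producer restarts.
    """
    digest = hashlib.md5(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because every event for a given tenant lands in the same partition, that tenant's rows cluster into the same Druid
segments, which is exactly what the tenant_id filters above benefit from.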
## Customizing data distribution

Druid additionally supports multitenancy by providing configurable means of distributing data. Druid's Historical processes
can be configured into [tiers](../operations/rule-configuration.md), and [rules](../operations/rule-configuration.md)
can be set that determine which segments go into which tiers. One use case of this is that recent data tends to be accessed
more frequently than older data. Tiering enables more recent segments to be hosted on more powerful hardware for better performance.
A second copy of recent segments can be replicated on cheaper hardware (a different tier), and older segments can also be
stored on this tier.

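For example, a rule set along these lines (the "hot" tier name and one-month period are illustrative) keeps one replica
of the last month of segments on powerful hardware plus one on the default tier, while older segments live only on the
default tier:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "hot": 1, "_default_tier": 1 }
  },
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 1 }
  }
]
```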
## Supporting high query concurrency

Druid's fundamental unit of computation is a [segment](../design/segments.md). Processes scan segments in parallel and a
given process can scan `druid.processing.numThreads` segments concurrently. To
process more data in parallel and increase performance, more cores can be added to a cluster. Druid segments
should be sized such that any computation over any given segment completes in at most 500ms.

Druid internally stores requests to scan segments in a priority queue. If a given query requires scanning
more segments than the total number of available processors in a cluster, and many similarly expensive queries are concurrently
running, we don't want any query to be starved out. Druid's internal processing logic will scan a set of segments from one query and release resources as soon as the scans complete.
This allows for a second set of segments from another query to be scanned. By keeping segment computation time very small, we ensure
that resources are constantly being yielded, and segments pertaining to different queries are all being processed.

Druid queries can optionally set a `priority` flag in the [query context](../querying/query-context.md). Queries known to be
slow (download or reporting style queries) can be de-prioritized and more interactive queries can have higher priority.

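For instance, a reporting-style native query could be de-prioritized like this (the datasource, interval, and priority
value are illustrative; lower values mean lower priority):

```json
{
  "queryType": "timeseries",
  "dataSource": "shared_events",
  "intervals": ["2020-01-01/2020-02-01"],
  "granularity": "day",
  "aggregations": [{ "type": "count", "name": "rows" }],
  "context": { "priority": -1 }
}
```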
Broker processes can also be dedicated to a given tier. For example, one set of Broker processes can be dedicated to fast interactive queries,
and a second set of Broker processes can be dedicated to slower reporting queries. Druid also provides a [Router](../design/router.md)
process that can route queries to different Brokers based on various query parameters (datasource, interval, etc.).
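
As a sketch, the Router's runtime.properties could map Historical tiers to Broker services along these lines (the
service names are illustrative):

```
druid.router.defaultBrokerServiceName=druid/broker-cold
druid.router.tierToBrokerMap={"hot":"druid/broker-hot","_default_tier":"druid/broker-cold"}
```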