---
id: joins
title: "Joins"
---
Apache Druid (incubating) has limited support for joins through query-time lookups. The common use case of query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable String value). This is similar to a star-schema join.
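For example, a registered lookup can be applied at query time with Druid SQL's `LOOKUP` function. The sketch below assumes a lookup named `country_name` has been registered with the lookup framework and that the datasource has a `country_iso_code` dimension; the lookup, datasource, and column names are hypothetical.

```sql
-- Replace a string ID with a human-readable value at query time.
-- Assumes a lookup named 'country_name' (hypothetical) mapping
-- ISO codes to country names has been registered.
SELECT
  LOOKUP("country_iso_code", 'country_name') AS "country",
  COUNT(*) AS "edits"
FROM "wikipedia"
GROUP BY LOOKUP("country_iso_code", 'country_name')
```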
Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation of joins (there is no loss of fidelity for columns included as dimensions), full support for joins has not yet been implemented for the following reasons:
- Scaling join queries has, in our experience, been a constant bottleneck when working with distributed databases.
- The incremental gains in functionality are perceived to be of less value than the anticipated problems with managing highly concurrent, join-heavy workloads.
A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary high-level strategies for join queries that we are aware of are a hash-based strategy and a sorted-merge strategy. The hash-based strategy requires that all but one data set be available as something that looks like a hash table; a lookup operation is then performed on this hash table for every row in the “primary” stream. The sorted-merge strategy assumes that each stream is sorted by the join key, which allows for the incremental joining of the streams. Each of these strategies, however, requires the materialization of some number of the streams, either in sorted order or in hash table form. Both strategies are sketched below.
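The following is an illustrative Java sketch (not Druid code) contrasting the two strategies on hypothetical `Row` records joined on a string key: the hash-based strategy materializes one side as a hash table and probes it per primary row, while the sorted-merge strategy advances two cursors over inputs that are already sorted by key.

```java
import java.util.*;

// Minimal sketch of the two join strategies described above.
public class JoinSketch {
    record Row(String key, String value) {}

    // Hash-based: materialize the build side as a hash table, then
    // probe it once for every row of the "primary" stream.
    static List<String> hashJoin(List<Row> primary, List<Row> buildSide) {
        Map<String, List<String>> table = new HashMap<>();
        for (Row r : buildSide) {
            table.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r.value());
        }
        List<String> out = new ArrayList<>();
        for (Row r : primary) {
            for (String v : table.getOrDefault(r.key(), List.of())) {
                out.add(r.key() + ": " + r.value() + ", " + v);
            }
        }
        return out;
    }

    // Sorted-merge: both inputs must already be sorted by key; advance
    // two cursors in lockstep, emitting matches incrementally.
    static List<String> sortedMergeJoin(List<Row> left, List<Row> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).key().compareTo(right.get(j).key());
            if (cmp < 0) {
                i++;
            } else if (cmp > 0) {
                j++;
            } else {
                // Emit the cross product of the runs sharing this key.
                String key = left.get(i).key();
                int jStart = j;
                while (i < left.size() && left.get(i).key().equals(key)) {
                    for (j = jStart; j < right.size() && right.get(j).key().equals(key); j++) {
                        out.add(key + ": " + left.get(i).value() + ", " + right.get(j).value());
                    }
                    i++;
                }
            }
        }
        return out;
    }
}
```

Note that the hash-based variant must hold the entire build side in memory, while the sorted-merge variant streams both inputs but requires them to be pre-sorted; this is the materialization trade-off the paragraph above describes.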
When all sides of the join are significantly large tables (> 1 billion records), materializing the pre-join streams requires complex distributed memory management. The complexity of the memory management is only amplified by the fact that we are targeting highly concurrent, multi-tenant workloads.