mirror of https://github.com/apache/druid.git
some more doc improvements (#7675)
This commit is contained in:
parent d667655871 · commit dc85a5309e
@ -24,18 +24,23 @@ title: "Apache Druid (incubating) Design"

# What is Druid?<a id="what-is-druid"></a>

Apache Druid (incubating) is a real-time analytics database designed for fast slice-and-dice analytics
("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often
used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.
As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly-concurrent APIs
that need fast aggregations. Druid works best with event-oriented data.

Common application areas for Druid include:

- Clickstream analytics (web and mobile analytics)
- Network telemetry analytics (network performance monitoring)
- Server metrics storage
- Supply chain analytics (manufacturing metrics)
- Application performance metrics
- Digital marketing/advertising analytics
- Business intelligence / OLAP

Druid's core architecture combines ideas from data warehouses, timeseries databases, and logsearch systems. Some of
Druid's key features are:

1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns
@ -45,7 +50,7 @@ column is stored optimized for its particular data type, which supports fast sca

offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a
few seconds.
3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
4. **Realtime or batch ingestion.** Druid can ingest data either real-time (ingested data is immediately available for
querying) or in batches.
5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster out or in, simply add or
remove servers and the cluster will rebalance itself automatically, in the background, without any downtime. If any
@ -59,11 +64,14 @@ Druid servers, replication ensures that queries are still possible while the sys

7. **Indexes for quick filtering.** Druid uses [CONCISE](https://arxiv.org/pdf/1004.0403) or
[Roaring](https://roaringbitmap.org/) compressed bitmap indexes to create indexes that power fast filtering and
searching across multiple columns.
8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields.
This means time-based queries will only access the partitions that match the time range of the query. This leads to
significant performance improvements for time-based data.
9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often
substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also
offers exact count-distinct and exact ranking.
10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This
summarization partially pre-aggregates your data, and can lead to big cost savings and performance boosts.

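The summarization idea in the last item can be sketched in a few lines: rows that share a truncated timestamp and identical dimension values collapse into a single pre-aggregated row. This is an illustrative Python sketch of the concept, not Druid code; the row layout and function name are invented for the example.

```python
from collections import defaultdict

def rollup(rows, granularity_secs=3600):
    """Pre-aggregate rows that share an hour bucket and dimension values.

    Each row is (timestamp_secs, dimensions_tuple, metric_value).
    Mirrors Druid's ingest-time summarization in spirit only.
    """
    buckets = defaultdict(lambda: [0, 0])  # (bucket_start, dims) -> [count, sum]
    for ts, dims, value in rows:
        key = (ts // granularity_secs * granularity_secs, dims)
        buckets[key][0] += 1
        buckets[key][1] += value
    return {k: tuple(v) for k, v in buckets.items()}

rows = [
    (10, ("US", "web"), 5),
    (20, ("US", "web"), 7),   # same hour bucket and dims as above: merged
    (30, ("EU", "web"), 1),
]
summary = rollup(rows)
```

Two of the three input rows collapse into one, which is where the storage and query-time savings come from.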
# When should I use Druid?<a id="when-to-use-druid"></a>

@ -85,7 +93,8 @@ Situations where you would likely _not_ want to use Druid include:

- You need low-latency updates of _existing_ records using a primary key. Druid supports streaming inserts, but not streaming updates (updates are done using
background batch jobs).
- You are building an offline reporting system where query latency is not very important.
- You want to do "big" joins (joining one big fact table to another big fact table) and you are okay with these queries
taking hours to complete.

# Architecture

@ -157,7 +166,7 @@ The following diagram shows how queries and data flow through this architecture,

Druid data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is
partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for
example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more
["segments"](../design/segments.html). Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following:

<img src="../../img/druid-timeline.png" width="800" />

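The chunk idea can be made concrete with a tiny sketch: every event maps to the time chunk its timestamp falls into. The function below is illustrative Python (day-granularity chunks, names invented for the example), not Druid's implementation.

```python
from datetime import datetime, timezone

def chunk_key(ts):
    """Return the UTC day that a timestamp (epoch seconds) falls into.
    A day here plays the role of one Druid time chunk (illustrative only)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

# Three events: two land in the same UTC day, the third in the next day.
events = [0, 3600, 90000]
chunks = {}
for ts in events:
    chunks.setdefault(chunk_key(ts), []).append(ts)
# chunks now groups events by time chunk, the way segments group on a timeline
```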
@ -183,10 +192,10 @@ cluster.

# Query processing

Queries first enter the [Broker](../design/broker.html), where the Broker will identify which segments have data that may pertain to that query.
The list of segments is always pruned by time, and may also be pruned by other attributes depending on how your
datasource is partitioned. The Broker will then identify which [Historicals](../design/historical.html) and
[MiddleManagers](../design/middlemanager.html) are serving those segments and send a rewritten subquery to each of those processes. The Historical/MiddleManager processes will take in the
queries, process them and return results. The Broker receives results and merges them together to get the final answer,
which it returns to the original caller.
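The time-pruning step can be illustrated with a small sketch: only segments whose interval overlaps the query interval are consulted. This is hypothetical Python (segment records and function name invented for the example), not the Broker's actual logic.

```python
def prune_segments(segments, query_start, query_end):
    """Keep only segments whose [start, end) interval overlaps the query
    interval, mimicking the Broker's time-based pruning (illustrative only)."""
    return [s for s in segments
            if s["start"] < query_end and s["end"] > query_start]

segments = [
    {"id": "seg-day-1", "start": 0,  "end": 10},
    {"id": "seg-day-2", "start": 10, "end": 20},
    {"id": "seg-day-3", "start": 20, "end": 30},
]
# A query over [12, 25) only touches the second and third segments.
hit = prune_segments(segments, 12, 25)
```

Everything outside the query's time range is skipped before any subquery is sent, which is why time-partitioned data queries so efficiently.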

@ -200,4 +209,4 @@ So Druid uses three different techniques to maximize query performance:

- Pruning which segments are accessed for each query.
- Within each segment, using indexes to identify which rows must be accessed.
- Within each segment, only reading the specific rows and columns that are relevant to a particular query.

@ -28,7 +28,7 @@ Apache Druid (incubating) stores its index in *segment files*, which are partiti

time. In a basic setup, one segment file is created for each time
interval, where the time interval is configurable in the
`segmentGranularity` parameter of the `granularitySpec`, which is
documented [here](../ingestion/ingestion-spec.html#granularityspec). For Druid to
operate well under heavy query load, it is important for the segment
file size to be within the recommended range of 300mb-700mb. If your
segment files are larger than this range, then consider either
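For orientation, a `granularitySpec` with daily segments might look like the following, shown as a Python dict for illustration (in a real ingestion spec this appears as JSON; the field names `type`, `segmentGranularity`, and `queryGranularity` come from the ingestion spec docs, the values are examples):

```python
# Illustrative granularitySpec fragment, expressed as a Python dict.
# In a real ingestion spec this is JSON inside the dataSchema.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # one time chunk (and its segments) per day
    "queryGranularity": "HOUR",    # timestamps truncated to the hour at ingest
}
```

With `segmentGranularity` at `DAY`, segment file sizes track daily data volume, which is the knob to turn if files fall outside the recommended range.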

@ -33,7 +33,7 @@ title: "Ingestion"

Apache Druid (incubating) data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is
partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for
example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more
["segments"](../design/segments.html). Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following:

<img src="../../img/druid-timeline.png" width="800" />

@ -118,42 +118,47 @@ layout: toc

* [ZooKeeper](/docs/VERSION/dependencies/zookeeper.html)

## Operations
* [Management UIs](/docs/VERSION/operations/management-uis.html)
* [Including Extensions](/docs/VERSION/operations/including-extensions.html)
* [Data Retention](/docs/VERSION/operations/rule-configuration.html)
* [Metrics and Monitoring](/docs/VERSION/operations/metrics.html)
* [Alerts](/docs/VERSION/operations/alerts.html)
* [Updating the Cluster](/docs/VERSION/operations/rolling-updates.html)
* [Different Hadoop Versions](/docs/VERSION/operations/other-hadoop.html)
* [High Availability](/docs/VERSION/operations/high-availability.html)
* [HTTP Compression](/docs/VERSION/operations/http-compression.html)
* [Basic Cluster Tuning](/docs/VERSION/operations/basic-cluster-tuning.html)
* Examples
  * [Single-server Deployment Examples](/docs/VERSION/operations/single-server.html)
  * [Clustered Deployment Example](/docs/VERSION/operations/example-cluster.html)
* [Recommendations](/docs/VERSION/operations/recommendations.html)
* [Performance FAQ](/docs/VERSION/operations/performance-faq.html)
* [API Reference](/docs/VERSION/operations/api-reference.html)
  * [Coordinator](/docs/VERSION/operations/api-reference.html#coordinator)
  * [Overlord](/docs/VERSION/operations/api-reference.html#overlord)
  * [MiddleManager](/docs/VERSION/operations/api-reference.html#middlemanager)
  * [Peon](/docs/VERSION/operations/api-reference.html#peon)
  * [Broker](/docs/VERSION/operations/api-reference.html#broker)
  * [Historical](/docs/VERSION/operations/api-reference.html#historical)
* Tools
  * [Dump Segment Tool](/docs/VERSION/operations/dump-segment.html)
  * [Insert Segment Tool](/docs/VERSION/operations/insert-segment-to-db.html)
  * [Pull Dependencies Tool](/docs/VERSION/operations/pull-deps.html)
* Security
  * [TLS Support](/docs/VERSION/operations/tls-support.html)
  * [Password Provider](/docs/VERSION/operations/password-provider.html)

## Configuration
* [Configuration Reference](/docs/VERSION/configuration/index.html)
* [Recommended Configuration File Organization](/docs/VERSION/configuration/index.html#recommended-configuration-file-organization)
* [JVM Configuration Best Practices](/docs/VERSION/configuration/index.html#jvm-configuration-best-practices)
* [Common Configuration](/docs/VERSION/configuration/index.html#common-configurations)
* Processes
  * [Coordinator](/docs/VERSION/configuration/index.html#coordinator)
  * [Overlord](/docs/VERSION/configuration/index.html#overlord)
  * [MiddleManager & Peons](/docs/VERSION/configuration/index.html#middle-manager-and-peons)
  * [Historical](/docs/VERSION/configuration/index.html#historical)
  * [Broker](/docs/VERSION/configuration/index.html#broker)
* [Caching](/docs/VERSION/configuration/index.html#cache-configuration)
* [General Query Configuration](/docs/VERSION/configuration/index.html#general-query-configuration)
* [Configuring Logging](/docs/VERSION/configuration/logging.html)