some more doc improvements (#7675)

Fangjin Yang 2019-05-16 13:17:21 -07:00 committed by GitHub
parent d667655871
commit dc85a5309e
4 changed files with 55 additions and 41 deletions


@@ -24,18 +24,23 @@ title: "Apache Druid (incubating) Design"
# What is Druid?<a id="what-is-druid"></a>
-Apache Druid (incubating) is a data store designed for high-performance slice-and-dice analytics
-("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)"-style) on large data sets. Druid is most often
-used as a data store for powering GUI analytical applications, or as a backend for highly-concurrent APIs that need
-fast aggregations. Common application areas for Druid include:
+Apache Druid (incubating) is a real-time analytics database designed for fast slice-and-dice analytics
+("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)" queries) on large data sets. Druid is most often
+used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.
+As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly concurrent APIs
+that need fast aggregations. Druid works best with event-oriented data.
-- Clickstream analytics
-- Network flow analytics
+Common application areas for Druid include:
+- Clickstream analytics (web and mobile analytics)
+- Network telemetry analytics (network performance monitoring)
- Server metrics storage
+- Supply chain analytics (manufacturing metrics)
- Application performance metrics
-- Digital marketing analytics
+- Digital marketing/advertising analytics
- Business intelligence / OLAP
Druid's core architecture combines ideas from data warehouses, timeseries databases, and logsearch systems. Some of
Druid's key features are:
1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns
@@ -45,7 +50,7 @@ column is stored optimized for its particular data type, which supports fast sca
offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a
few seconds.
3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
-4. **Realtime or batch ingestion.** Druid can ingest data either realtime (ingested data is immediately available for
+4. **Realtime or batch ingestion.** Druid can ingest data either in real time (ingested data is immediately available for
querying) or in batches.
5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster out or in, simply add or
remove servers and the cluster will rebalance itself automatically, in the background, without any downtime. If any
@@ -59,11 +64,14 @@ Druid servers, replication ensures that queries are still possible while the sys
7. **Indexes for quick filtering.** Druid uses [CONCISE](https://arxiv.org/pdf/1004.0403) or
[Roaring](https://roaringbitmap.org/) compressed bitmap indexes to create indexes that power fast filtering and
searching across multiple columns.
-8. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
+8. **Time-based partitioning.** Druid first partitions data by time, and can additionally partition based on other fields.
+This means time-based queries will only access the partitions that match the time range of the query. This leads to
+significant performance improvements for time-based data (see the granularitySpec sketch after this list).
+9. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often
substantially faster than exact computations. For situations where accuracy is more important than speed, Druid also
offers exact count-distinct and exact ranking.
-9. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This
+10. **Automatic summarization at ingest time.** Druid optionally supports data summarization at ingestion time. This
summarization partially pre-aggregates your data, and can lead to big cost savings and performance boosts.
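
Items 8 and 10 above both surface in the ingestion spec's `granularitySpec` (the same parameter referenced in the segment-files section later in this diff). A minimal sketch, assuming day-partitioned segments with hour-level rollup; the surrounding ingestion spec, the `metricsSpec`, and the metric names are illustrative assumptions, not part of this commit:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "hour",
  "rollup": true
},
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
]
```

With a spec like this, Druid creates one time chunk per day (item 8) and, within each hour, pre-aggregates rows that share the same dimension values (item 10).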
# When should I use Druid?<a id="when-to-use-druid"></a>
@@ -85,7 +93,8 @@ Situations where you would likely _not_ want to use Druid include:
- You need low-latency updates of _existing_ records using a primary key. Druid supports streaming inserts, but not streaming updates (updates are done using
background batch jobs).
- You are building an offline reporting system where query latency is not very important.
- You want to do "big" joins (joining one big fact table to another big fact table).
- You want to do "big" joins (joining one big fact table to another big fact table) and you are okay with these queries
taking up to hours to complete.
# Architecture
@@ -157,7 +166,7 @@ The following diagram shows how queries and data flow through this architecture,
Druid data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is
partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for
example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more
"segments". Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
["segments"](../design/segments.html). Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following:
<img src="../../img/druid-timeline.png" width="800" />
@@ -183,10 +192,10 @@ cluster.
# Query processing
-Queries first enter the Broker, where the Broker will identify which segments have data that may pertain to that query.
+Queries first enter the [Broker](../design/broker.html), where the Broker will identify which segments have data that may pertain to that query.
The list of segments is always pruned by time, and may also be pruned by other attributes depending on how your
-datasource is partitioned. The Broker will then identify which Historicals and MiddleManagers are serving those segments
-and send a rewritten subquery to each of those processes. The Historical/MiddleManager processes will take in the
+datasource is partitioned. The Broker will then identify which [Historicals](../design/historical.html) and
+[MiddleManagers](../design/middlemanager.html) are serving those segments and send a rewritten subquery to each of those processes. The Historical/MiddleManager processes will take in the
queries, process them and return results. The Broker receives results and merges them together to get the final answer,
which it returns to the original caller.
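
As a hypothetical illustration of that pruning, a native Druid query carries an `intervals` field, and the Broker only fans the query out to segments whose time chunks overlap those intervals (the datasource and metric names below are made up):

```json
{
  "queryType": "timeseries",
  "dataSource": "clickstream",
  "intervals": ["2019-05-14/2019-05-15"],
  "granularity": "hour",
  "aggregations": [
    { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
  ]
}
```

Only segments in the 2019-05-14 chunk receive subqueries; every other chunk is pruned by time before any data is read.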
@@ -200,4 +209,4 @@ So Druid uses three different techniques to maximize query performance:
- Pruning which segments are accessed for each query.
- Within each segment, using indexes to identify which rows must be accessed.
-- Within each segment, only reading the specific rows and columns that are relevant to a particular query.
+- Within each segment, only reading the specific rows and columns that are relevant to a particular query.


@@ -28,7 +28,7 @@ Apache Druid (incubating) stores its index in *segment files*, which are partiti
time. In a basic setup, one segment file is created for each time
interval, where the time interval is configurable in the
`segmentGranularity` parameter of the `granularitySpec`, which is
-documented [here](../ingestion/ingestion-spec.html#granularityspec). For druid to
+documented [here](../ingestion/ingestion-spec.html#granularityspec). For Druid to
operate well under heavy query load, it is important for the segment
file size to be within the recommended range of 300MB-700MB. If your
segment files are larger than this range, then consider either
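
For reference, a minimal sketch of the parameter in question, assuming daily segments (the value is illustrative; the right granularity depends on your data volume):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day"
}
```

If daily files land well above the recommended range, a finer `segmentGranularity` such as `"hour"` is one way to bring per-file size down, since each segment then covers a smaller interval.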


@@ -33,7 +33,7 @@ title: "Ingestion"
Apache Druid (incubating) data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is
partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for
example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more
"segments". Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
["segments"](../design/segments.html). Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following:
<img src="../../img/druid-timeline.png" width="800" />


@@ -118,42 +118,47 @@ layout: toc
* [ZooKeeper](/docs/VERSION/dependencies/zookeeper.html)
## Operations
-* [API Reference](/docs/VERSION/operations/api-reference.html)
-  * [Coordinator](/docs/VERSION/operations/api-reference.html#coordinator)
-  * [Overlord](/docs/VERSION/operations/api-reference.html#overlord)
-  * [MiddleManager](/docs/VERSION/operations/api-reference.html#middlemanager)
-  * [Peon](/docs/VERSION/operations/api-reference.html#peon)
-  * [Broker](/docs/VERSION/operations/api-reference.html#broker)
-  * [Historical](/docs/VERSION/operations/api-reference.html#historical)
-* [Management UIs](/docs/VERSION/operations/management-uis.html)
* [Including Extensions](/docs/VERSION/operations/including-extensions.html)
* [Data Retention](/docs/VERSION/operations/rule-configuration.html)
-* [High Availability](/docs/VERSION/operations/high-availability.html)
-* [Updating the Cluster](/docs/VERSION/operations/rolling-updates.html)
* [Metrics and Monitoring](/docs/VERSION/operations/metrics.html)
* [Alerts](/docs/VERSION/operations/alerts.html)
+* [Updating the Cluster](/docs/VERSION/operations/rolling-updates.html)
* [Different Hadoop Versions](/docs/VERSION/operations/other-hadoop.html)
+* [High Availability](/docs/VERSION/operations/high-availability.html)
+* [Management UIs](/docs/VERSION/operations/management-uis.html)
-* [Dump Segment Tool](/docs/VERSION/operations/dump-segment.html)
-* [Insert Segment Tool](/docs/VERSION/operations/insert-segment-to-db.html)
-* [Pull Dependencies Tool](/docs/VERSION/operations/pull-deps.html)
-* [Recommendations](/docs/VERSION/operations/recommendations.html)
-* [TLS Support](/docs/VERSION/operations/tls-support.html)
-* [Password Provider](/docs/VERSION/operations/password-provider.html)
* [HTTP Compression](/docs/VERSION/operations/http-compression.html)
+* [Basic Cluster Tuning](/docs/VERSION/operations/basic-cluster-tuning.html)
-* [Single-server Deployment Examples](/docs/VERSION/operations/single-server.html)
-* [Clustered Deployment Example](/docs/VERSION/operations/example-cluster.html)
+* Examples
+  * [Single-server Deployment Examples](/docs/VERSION/operations/single-server.html)
+  * [Clustered Deployment Example](/docs/VERSION/operations/example-cluster.html)
+* [Recommendations](/docs/VERSION/operations/recommendations.html)
* [Performance FAQ](/docs/VERSION/operations/performance-faq.html)
+* [API Reference](/docs/VERSION/operations/api-reference.html)
+  * [Coordinator](/docs/VERSION/operations/api-reference.html#coordinator)
+  * [Overlord](/docs/VERSION/operations/api-reference.html#overlord)
+  * [MiddleManager](/docs/VERSION/operations/api-reference.html#middlemanager)
+  * [Peon](/docs/VERSION/operations/api-reference.html#peon)
+  * [Broker](/docs/VERSION/operations/api-reference.html#broker)
+  * [Historical](/docs/VERSION/operations/api-reference.html#historical)
+* Tools
+  * [Dump Segment Tool](/docs/VERSION/operations/dump-segment.html)
+  * [Insert Segment Tool](/docs/VERSION/operations/insert-segment-to-db.html)
+  * [Pull Dependencies Tool](/docs/VERSION/operations/pull-deps.html)
+* Security
+  * [TLS Support](/docs/VERSION/operations/tls-support.html)
+  * [Password Provider](/docs/VERSION/operations/password-provider.html)
## Configuration
* [Configuration Reference](/docs/VERSION/configuration/index.html)
* [Recommended Configuration File Organization](/docs/VERSION/configuration/index.html#recommended-configuration-file-organization)
* [JVM Configuration Best Practices](/docs/VERSION/configuration/index.html#jvm-configuration-best-practices)
* [Common Configuration](/docs/VERSION/configuration/index.html#common-configurations)
-* [Coordinator](/docs/VERSION/configuration/index.html#coordinator)
-* [Overlord](/docs/VERSION/configuration/index.html#overlord)
-* [MiddleManager & Peons](/docs/VERSION/configuration/index.html#middle-manager-and-peons)
-* [Broker](/docs/VERSION/configuration/index.html#broker)
-* [Historical](/docs/VERSION/configuration/index.html#historical)
+* Processes
+  * [Coordinator](/docs/VERSION/configuration/index.html#coordinator)
+  * [Overlord](/docs/VERSION/configuration/index.html#overlord)
+  * [MiddleManager & Peons](/docs/VERSION/configuration/index.html#middle-manager-and-peons)
+  * [Historical](/docs/VERSION/configuration/index.html#historical)
+  * [Broker](/docs/VERSION/configuration/index.html#broker)
* [Caching](/docs/VERSION/configuration/index.html#cache-configuration)
* [General Query Configuration](/docs/VERSION/configuration/index.html#general-query-configuration)
* [Configuring Logging](/docs/VERSION/configuration/logging.html)