Update to Architecture intro file:

* Updated description of druid architecture components
* Added links from descriptions to actual component pages
* Removed Data-Flow page, which was a (mostly) redundant subset of other pages under Architecture
* Moved non-redundant info from Data-Flow to other Architecture pages
* Updated info on how data actually flows in Druid 0.6
This commit is contained in:
Igal Levy 2013-11-01 17:17:57 -07:00
parent faa4f4ac9b
commit cf06a73091
6 changed files with 15 additions and 49 deletions

View File

@ -5,6 +5,7 @@ Broker
======
The Broker is the node to route queries to if you want to run a distributed cluster. It understands the metadata published to ZooKeeper about what segments exist on what nodes and routes queries such that they hit the right nodes. This node also merges the result sets from all of the individual nodes together.
On startup, Realtime nodes announce themselves and the segments they are serving in ZooKeeper.
Quick Start
-----------

View File

@ -1,37 +0,0 @@
---
layout: doc_page
---
# Data Flow
The diagram below illustrates how different Druid nodes download data and respond to queries:
<img src="../img/druid-dataflow-2x.png" width="800"/>
### Real-time Nodes
Real-time nodes ingest streaming data and announce themselves and the segments they are serving in Zookeeper on startup. During the segment hand-off stage, real-time nodes create a segment metadata entry in MySQL for the segment being handed off. This segment is uploaded to deep storage. Real-time nodes use Zookeeper to monitor when historical nodes complete downloading the segment (indicating hand-off completion) so that they can forget about it. Real-time nodes also respond to query requests from broker nodes and return query results to them.
### Deep Storage
Batch-indexed segments and segments created by real-time nodes are uploaded to deep storage. Historical nodes download these segments and serve them for queries.
### MySQL
Real-time nodes and batch indexing jobs create metadata entries for the new segments they create. Coordinator nodes read this metadata table to determine what segments should be loaded in the cluster.
### Coordinator Nodes
Coordinator nodes read segment metadata information from MySQL to determine what segments should be loaded in the cluster. Coordinator nodes use Zookeeper to determine what historical nodes exist, and also create Zookeeper entries to tell historical nodes to load and drop new segments.
### Zookeeper
Real-time nodes announce themselves and the segments they are serving in Zookeeper and also use Zookeeper to monitor segment hand-off. Coordinator nodes use Zookeeper to determine what historical nodes exist in the cluster and create entries that tell historical nodes to load or drop new data. Historical nodes announce themselves and the segments they serve in Zookeeper. Historical nodes also monitor Zookeeper for new load or drop requests. Broker nodes use Zookeeper to determine what historical and real-time nodes exist in the cluster.
### Historical Nodes
Historical nodes announce themselves and the segments they are serving in Zookeeper. Historical nodes also use Zookeeper to monitor for signals to load or drop new segments. Historical nodes download segments from deep storage, respond to the queries from broker nodes about these segments, and return results to the broker nodes.
### Broker Nodes
Broker nodes receive queries from external clients and forward those queries to real-time and historical nodes. When the individual nodes return their results, broker nodes merge these results and return them to the caller. Broker nodes use Zookeeper to determine what real-time and historical nodes exist.
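To make the announcement mechanism described above concrete, here is a minimal sketch of a node registering itself and one served segment as ephemeral znodes using Apache Curator. The ZooKeeper paths, host string, and segment id are illustrative assumptions, not necessarily Druid's exact layout.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

// Illustrative announcement sketch: paths and names are assumptions, not Druid's exact layout.
public class AnnounceSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    String node = "historical-host:8083";                   // hypothetical node identity
    String segment = "wikipedia_2013-10-01_2013-10-02_v1";  // hypothetical segment id

    // Ephemeral znodes disappear automatically if the node dies,
    // which is what lets the rest of the cluster treat ZooKeeper as the live cluster view.
    zk.create().creatingParentsIfNeeded().withMode(CreateMode.EPHEMERAL)
      .forPath("/druid/announcements/" + node);
    zk.create().creatingParentsIfNeeded().withMode(CreateMode.EPHEMERAL)
      .forPath("/druid/servedSegments/" + node + "/" + segment);
  }
}
```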

View File

@ -25,13 +25,15 @@ Druid is a good fit for products that require real-time data ingestion of a sing
Druid is architected as a grouping of systems, each with a distinct role, that together form a working system. The name comes from the Druid class in many role-playing games: it is a shape-shifter, capable of taking many different forms to fulfill various roles in a group.
Each of the systems, or components, described below also has a dedicated page with more details. You can find the page in the menu on the left, or click the link in the component's description.
The node types that currently exist are:
* **Historical** nodes are the workhorses that handle storage and querying on "historical" data (non-realtime).
* **Realtime** nodes ingest data in real-time. They are in charge of listening to a stream of incoming data and making it available immediately inside the Druid system. As data they have ingested ages, they hand it off to the historical nodes.
* **Coordinator** nodes look over the grouping of historical nodes and make sure that data is available, replicated and in a generally "optimal" configuration.
* **Broker** nodes understand the topology of data across all of the other nodes in the cluster and re-write and route queries accordingly.
* **Indexer** nodes form a cluster of workers to load batch and real-time data into the system as well as allow for alterations to the data stored in the system (also known as the Indexing Service).
* [**Historical**](Historical.html) nodes are the workhorses that handle storage and querying on "historical" data (non-realtime). Historical nodes download segments from deep storage, respond to the queries from broker nodes about these segments, and return results to the broker nodes. They announce themselves and the segments they are serving in Zookeeper, and also use Zookeeper to monitor for signals to load or drop new segments.
* [**Realtime**](Realtime.html) nodes ingest data in real time. They are in charge of listening to a stream of incoming data and making it available immediately inside the Druid system. Real-time nodes respond to query requests from Broker nodes, returning query results to those nodes. Aged data is pushed from Realtime nodes to Historical nodes.
* [**Coordinator**](Coordinator.html) nodes monitor the grouping of Historical nodes to ensure that data is available, replicated, and in a generally "optimal" configuration. They do this by reading segment metadata information from MySQL to determine what segments should be loaded in the cluster, using Zookeeper to determine what Historical nodes exist, and creating Zookeeper entries to tell Historical nodes to load and drop new segments.
* [**Broker**](Broker.html) nodes receive queries from external clients and forward those queries to Realtime and Historical nodes. When Broker nodes receive results, they merge these results and return them to the caller. To learn the cluster topology, Broker nodes use Zookeeper to determine what Realtime and Historical nodes exist (see the sketch below).
* [**Indexer**](Indexing-Service.html) nodes form a cluster of workers that load batch and real-time data into the system and allow alterations to data already stored in the system (this cluster is also known as the Indexing Service).
This separation allows each node to only care about what it is best at. By separating Historical and Realtime, we separate the memory concerns of listening on a real-time stream of data and processing it for entry into the system. By separating the Coordinator and Broker, we separate the needs for querying from the needs for maintaining "good" data distribution across the cluster.
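As a rough illustration of the topology discovery mentioned in the Broker description above, the following sketch watches an announcements path with Apache Curator and maintains a live set of known nodes. The path, and the use of Curator at all, are assumptions made for illustration rather than a description of Druid's actual discovery code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical sketch of broker-side discovery; the ZK path is an assumption.
public class DiscoverySketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    Set<String> liveNodes = ConcurrentHashMap.newKeySet();

    // Watch the children of the announcements path; each child is one node's announcement.
    PathChildrenCache cache = new PathChildrenCache(zk, "/druid/announcements", true);
    cache.getListenable().addListener((client, event) -> {
      switch (event.getType()) {
        case CHILD_ADDED:
          liveNodes.add(event.getData().getPath());      // a node joined the cluster
          break;
        case CHILD_REMOVED:
          liveNodes.remove(event.getData().getPath());   // its ephemeral znode vanished
          break;
        default:
          return;                                        // ignore connection-state events
      }
      System.out.println("live nodes: " + liveNodes);
    });
    cache.start();

    Thread.sleep(Long.MAX_VALUE);                        // keep watching
  }
}
```

Because the announcements are ephemeral, a crashed node simply disappears from this set without any explicit deregistration step.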
@ -43,9 +45,9 @@ All nodes can be run in some highly available fashion, either as symmetric peers
Aside from these nodes, there are 3 external dependencies to the system:
1. A running [ZooKeeper](http://zookeeper.apache.org/) cluster for cluster service discovery and maintenance of current data topology
2. A MySQL instance for maintenance of metadata about the data segments that should be served by the system
3. A "deep storage" LOB store/file system to hold the stored segments
1. A running [ZooKeeper](ZooKeeper.html) cluster for cluster service discovery and maintenance of current data topology
2. A [MySQL instance](MySQL.html) for maintenance of metadata about the data segments that should be served by the system
3. A ["deep storage" LOB store/file system](Deep-Storage.html) to hold the stored segments
The following diagram illustrates the cluster's management layer, showing how certain nodes and dependencies help manage the cluster by tracking and exchanging metadata:

View File

@ -4,7 +4,8 @@ layout: doc_page
Realtime
========
Realtime nodes provide a realtime index. Data indexed via these nodes is immediately available for querying. Realtime nodes will periodically build segments representing the data they've collected over some span of time and hand these segments off to [Historical](Historical.html) nodes.
Realtime nodes provide a realtime index. Data indexed via these nodes is immediately available for querying. Realtime nodes will periodically build segments representing the data they've collected over some span of time and transfer these segments to [Historical](Historical.html) nodes. They use ZooKeeper to monitor the transfer and MySQL to store metadata about the transferred segment. Once transferred, segments are forgotten by the Realtime nodes.
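As a hedged outline of the hand-off sequence just described, the sketch below records a metadata row for a finished segment and then waits for a Historical node to announce that it is serving the segment, at which point the Realtime node can drop its local copy. The ZooKeeper path, the table schema, and the polling loop are simplifying assumptions; Druid's own hand-off coordination is not necessarily implemented this way.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.curator.framework.CuratorFramework;

// Hypothetical hand-off outline; paths, schema, and flow are illustrative assumptions.
public class HandoffSketch {

  static void handOff(CuratorFramework zk, Connection db, String segmentId) throws Exception {
    // 1. The segment has already been pushed to deep storage (not shown here).

    // 2. Record the segment in the metadata store so the Coordinator notices it.
    try (PreparedStatement insert = db.prepareStatement(
             "INSERT INTO druid_segments (id, used, payload) VALUES (?, true, ?)")) {
      insert.setString(1, segmentId);
      insert.setString(2, "{\"loadSpec\": \"...\"}");   // placeholder payload
      insert.executeUpdate();
    }

    // 3. Wait until some Historical node announces that it serves the segment.
    //    Polling is a simplification of the ZooKeeper monitoring described above.
    String servedPath = "/druid/servedSegments/" + segmentId;   // assumed path layout
    while (zk.checkExists().forPath(servedPath) == null) {
      Thread.sleep(5_000);
    }

    // 4. Hand-off is complete: the Realtime node can now forget its local copy.
    dropLocalSegment(segmentId);
  }

  static void dropLocalSegment(String segmentId) {
    System.out.println("dropping local copy of " + segmentId);
  }
}
```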
Quick Start
-----------

View File

@ -1,7 +1,7 @@
---
layout: doc_page
---
Druid uses ZooKeeper (ZK) for management of current cluster state. The operations that happen over ZK are
Druid uses [ZooKeeper](http://zookeeper.apache.org/) (ZK) for management of current cluster state. The operations that happen over ZK are
1. [Coordinator](Coordinator.html) leader election
2. Segment "publishing" protocol from [Historical](Historical.html) and [Realtime](Realtime.html)

View File

@ -47,7 +47,6 @@ h2. Querying
h2. Architecture
* "Design":./Design.html
** "Data Flow":./Data-Flow.html
* "Segments":./Segments.html
* Node Types
** "Historical":./Historical.html
@ -71,4 +70,4 @@ h2. Development
* "Libraries":./Libraries.html
h2. Misc
* "Thanks":./Thanks.html
* "Thanks":./Thanks.html