Merge pull request #314 from metamx/igalDruid

Updates to documentation to improve concepts and fix small bugs in the segment flow
2013-12-09 10:24:22 -08:00 · 710b3cb139
parent 0cab224dc8 97ebad39be
commit 710b3cb139
10 changed files with 21 additions and 7 deletions
--- a/docs/content/Concepts-and-Terminology.md
+++ b/docs/content/Concepts-and-Terminology.md
@@ -16,6 +16,8 @@ More definitions are available on the [design page](Design.html).

 * **Dimensions** Aspects or categories of data, such as languages or locations. For example, with *language* and *country* as the type of dimension, values could be "English" or "Mandarin" for language, or "USA" or "China" for country. In Druid, dimensions can serve as filters for narrowing down hits (for example, language = "English" or country = "China").

+* **Ephemeral Node** A Zookeeper node (or "znode") that exists only for the time it is needed to complete the process for which it was created. In a Druid cluster, ephemeral nodes are typically used in work such as assigning [segments](#segment) to certain nodes.
+
 * **Granularity** The time interval corresponding to aggregation by time. Druid configuration settings specify the granularity of [timestamp](#timestamp) buckets in a [segment](#segment) (for example, by minute or by hour), as well as the granularity of the segment itself. The latter is essentially the overall range of absolute time covered by the segment. In queries, granularity settings control the summarization of findings.

 * **Ingestion** The pulling and initial storing and processing of data. Druid supports realtime and batch ingestion of data, and applies indexing in both cases.
@@ -24,16 +26,16 @@ More definitions are available on the [design page](Design.html).

 * **Rollup** The aggregation of data that occurs at one or more stages, based on settings in a [configuration file](#specFile). 

-<a name="segment"></a>
+  <a name="segment"></a>
 * **Segment** A collection of (internal) records that are stored and processed together. Druid chunks data into segments representing a time interval, and these are stored and manipulated in the cluster.

 * **Shard** A sub-partition of the data, allowing multiple [segments](#segment) to represent the data in a certain time interval. Sharding occurs along time partitions to better handle amounts of data that exceed certain limits on segment size, although sharding along dimensions may also occur to optimize efficiency.

-<a name="specfile"></a>
+  <a name="specfile"></a>
 * **specFile** The specification for services in JSON format; see [Realtime](Realtime.html) and [Batch-ingestion](Batch-ingestion.html)

-<a name="timeseries"></a>
+  <a name="timeseries"></a>
 * **Timeseries Data** Data points which are ordered in time. The closing value of a financial index or the number of tweets per hour with a certain hashtag are examples of timeseries data.

-<a name="timestamp"></a>
+  <a name="timestamp"></a>
 * **Timestamp** An absolute position on a timeline, given in a standard alpha-numerical format such as with UTC time. [Timeseries data](#timeseries) points can be ordered by timestamp, and in Druid, they are.
--- a/docs/content/Realtime.md
+++ b/docs/content/Realtime.md
@@ -178,7 +178,7 @@ Segment Propagation

 The segment propagation diagram for real-time data ingestion can be seen below:

-![Segment Propagation](https://raw.github.com/metamx/druid/druid-0.5.4/doc/segment_propagation.png "Segment Propagation")
+![Segment Propagation](../img/segmentPropagation.png "Segment Propagation")

 Requirements
 ------------
--- a/docs/content/Tutorial:-A-First-Look-at-Druid.md
+++ b/docs/content/Tutorial:-A-First-Look-at-Druid.md
@@ -1,6 +1,8 @@
 ---
 layout: doc_page
 ---
+
+# Tutorial: A First Look at Druid
 Greetings! This tutorial will help clarify some core Druid concepts. We will use a realtime dataset and issue some basic Druid queries. If you are ready to explore Druid, and learn a thing or two, read on!

 About the data
--- a/docs/content/Tutorial:-All-About-Queries.md
+++ b/docs/content/Tutorial:-All-About-Queries.md
@@ -1,6 +1,8 @@
 ---
 layout: doc_page
 ---
+
+# Tutorial: All About Queries
 Hello! This tutorial is meant to provide a more in-depth look into Druid queries. The tutorial is somewhat incomplete right now but we hope to add more content to it in the near future.

 Setup
--- a/docs/content/Tutorial:-Loading-Your-Data-Part-1.md
+++ b/docs/content/Tutorial:-Loading-Your-Data-Part-1.md
@@ -1,6 +1,8 @@
 ---
 layout: doc_page
 ---
+
+# Tutorial: Loading Your Data (Part 1)
 In our last [tutorial](Tutorial%3A-The-Druid-Cluster.html), we set up a complete Druid cluster. We created all the Druid dependencies and loaded some batched data. Druid shards data into self-contained chunks known as [segments](Segments.html). Segments are the fundamental unit of storage in Druid and all Druid nodes only understand segments.

 In this tutorial, we will learn about batch ingestion (as opposed to real-time ingestion) and how to create segments using the final piece of the Druid Cluster, the [indexing service](Indexing-Service.html). The indexing service is a standalone service that accepts [tasks](Tasks.html) in the form of POST requests. The output of most tasks are segments.
--- a/docs/content/Tutorial:-Loading-Your-Data-Part-2.md
+++ b/docs/content/Tutorial:-Loading-Your-Data-Part-2.md
@@ -1,6 +1,8 @@
 ---
 layout: doc_page
 ---
+
+# Tutorial: Loading Your Data (Part 2)
 In this tutorial we will cover more advanced/real-world ingestion topics.

 Druid can ingest streaming or batch data. Streaming data is ingested via the real-time node, and batch data is ingested via the Hadoop batch indexer. Druid also has a standalone ingestion service called the [indexing service](Indexing-Service.html).
--- a/docs/content/Tutorial:-The-Druid-Cluster.md
+++ b/docs/content/Tutorial:-The-Druid-Cluster.md
@@ -1,6 +1,8 @@
 ---
 layout: doc_page
 ---
+
+# Tutorial: The Druid Cluster
 Welcome back! In our first [tutorial](Tutorial%3A-A-First-Look-at-Druid.html), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture must naturally come to mind. What does the rest of Druid cluster look like? How does Druid load available static data?

 This tutorial will hopefully answer these questions!
--- a/docs/content/index.md
+++ b/docs/content/index.md
@@ -2,6 +2,8 @@
 layout: doc_page
 ---

+# About Druid
+
 Druid is an open-source analytics data store designed for real-time exploratory queries on large-scale data sets (100’s of Billions entries, 100’s TB data). Druid provides for cost-effective and always-on realtime data ingestion and arbitrary data exploration.

 -   Try out Druid with our Getting Started [Tutorial](./Tutorial%3A-A-First-Look-at-Druid.html)
--- a/docs/content/toc.textile
+++ b/docs/content/toc.textile
@@ -3,8 +3,8 @@

 <link rel="stylesheet" href="css/toc.css">

-h1. Contents
-* "Introduction":./
+h1. Introduction
+* "About Druid":./
 * "Concepts and Terminology":./Concepts-and-Terminology.html

 h2. Getting Started
--- a/docs/img/segmentPropagation.png
+++ b/docs/img/segmentPropagation.png