Improve segment file documentation

2015-06-17 16:06:44 -07:00 · 2015-06-17 16:06:44 -07:00 · f214b412ca
parent 2544f3655e
commit f214b412ca
2 changed files with 127 additions and 3 deletions
--- a/docs/content/design/segments.md
+++ b/docs/content/design/segments.md
@ -4,7 +4,125 @@ layout: doc_page
 Segments
 ========
-Druid segments contain data for a time interval, stored as separate columns. Dimensions (string columns) have inverted indexes associated with them for each dimension value. Metric columns are LZ4 compressed.
+
 Druid stores its index in *segment files*, which are partitioned by
 time. In a basic setup, one segment file is created for each time
 interval, where the time inteval is configurable in the
 `segmentGranularity` parameter of the `granularitySpec`, which is
 documented [here](../ingestion/batch-ingestion.html).  For druid to
 operate well under heavy query load, it is important for the segment
 file size to be within the recommended range of 300mb-700mb. If your
 segment files are larger than this range, then consider either
 changing the the granularity of the time interval or partitioning your
 data and tweaking the `targetPartitionSize` in your `partitioningSpec`
 (a good starting point for this parameter is 5 million rows).  See the
 sharding section below and the 'Partitioning specification' section of
 the [Batch ingestion](../ingestion/batch-ingestion.html) documentation
 for more information.
 ### A segment file's core data structures
 Here we describe the internal structure of segment files, which is
 essentially *columnar*: the data for each column is laid out in
 separate data structures. By storing each column separately, Druid can
 decrease query latency by scanning only those columns actually needed
 for a query.  There are three basic column types: the timestamp
 column, dimension columns, and metric columns, as illustrated in the
 image below:
 ![Druid column types](../img/druid-column-types.png "Druid Column Types")
 The timestamp and metric columns are simple: behind the scenes each of
 these is an array of integer or floating point values compressed with
 LZ4. Once a query knows which rows it needs to select, it simply
 decompresses these, pulls out the relevant rows, and applies the
 desired aggregation operator. As with all columns, if a query doesn’t
 require a column, then that column’s data is just skipped over.
 Dimensions columns are different because they support filter and
 group-by operations, so each dimension requires the following
 three data structures:
 1. A dictionary that maps values (which are always treated as strings) to integer IDs,
 2. For each distinct value in the column, a bitmap that indicates which rows contain that value, and
 3. A list of the column’s values, encoded using the dictionary in 1.
 Why these three data structures? The dictionary simply maps string
 values to integer ids so that the values in 2 and 3 can be
 represented compactly. The bitmaps in 2 -- also known as *inverted
 indexes* allow for quick filtering operations (specifically, bitmaps
 are convenient for quickly applying AND and OR operators). Finally,
 the list of values in 3 are needed for *group by* and *TopN*
 queries. In other words, queries that solely aggregate metrics based
 on filters do not need to touch the list of dimension values stored in
 3.
 To get a concrete sense of these data structures, consider the ‘page’
 column from the example data above.  The three data structures that
 represent this dimension are illustrated in the diagram below. 
 ```
 1: Dictionary that encodes column values
  {
    "Justin Bieber": 0,
    "Ke$ha":         1
  }
 2: Column data
  [0,
   0,
   1,
   1]
 3: Bitmaps - one for each unique value of the column
  value="Justin Bieber": [1,1,0,0]
  value="Ke$ha":         [0,0,1,1]
 ```
 Note that the bitmap is different from the first two data structures:
 whereas the first two grow linearly in the size of the data (in the
 worst case), the size of the bitmap section is the product of data
 size * column cardinality. Compression will help us here though
 because we know that each row will have only non-zero entry in a only
 a single bitmap. This means that high cardinality columns will have
 extremely sparse, and therefore highly compressible, bitmaps. Druid
 exploits this using compression algorithms that are specially suited
 for bitmaps, such as roaring bitmap compression.
 ### Multi-value columns
 If a data source makes use of multi-value columns, then the data
 structures within the segment files look a bit different. Let's
 imagine that in the example above, the second row were tagged with
 both the 'Ke$ha' *and* 'Justin Bieber' topics. In this case, the three
 data structures would now look as follows:
 ```
 1: Dictionary that encodes column values
  {
    "Justin Bieber": 0,
    "Ke$ha":         1
  }
 2: Column data
  [0,
   [0,1],  <--Row value of multi-value column can have array of values
   1,
   1]
 3: Bitmaps - one for each unique value
  value="Justin Bieber": [1,1,0,0]
  value="Ke$ha":         [0,1,1,1]
                            ^
                            |
                            |
    Multi-value column has multiple non-zero entries
 ```
 Note the changes to the second row in the column data and the Ke$ha
 bitmap. If a row has more than one value for a column, its entry in
 the 'column data' is an array of values. Additionally, a row with *n*
 values in a column columns will have *n* non-zero valued entries in
 that column's bitmaps.
 Naming Convention
 -----------------
@ -17,7 +135,7 @@ datasource_intervalStart_intervalEnd_version_partitionNum
 Segment Components
 ------------------
-A segment is comprised of several files, listed below.
+Behind the scenes, a segment is comprised of several files, listed below.
 * `version.bin`
@ -52,4 +170,10 @@ Sharding Data to Create Segments
 ### Sharding Data by Dimension
-If the cumulative total number of rows for the different values of a given column exceed some configurable threshold, multiple segments representing the same time interval for the same datasource may be created. These segments will contain some partition number as part of their identifier. Sharding by dimension reduces some of the the costs associated with operations over high cardinality dimensions.
+If the cumulative total number of rows for the different values of a
 given column exceed some configurable threshold, multiple segments
 representing the same time interval for the same datasource may be
 created. These segments will contain some partition number as part of
 their identifier. Sharding by dimension reduces some of the the costs
 associated with operations over high cardinality dimensions. For more
 information on sharding, see the ingestion documentat
--- a/docs/img/druid-column-types.png
+++ b/docs/img/druid-column-types.png