From 20d29666af5d984d85de04ce18abb48410e346c6 Mon Sep 17 00:00:00 2001
From: Gian Merlino <gianmerlino@gmail.com>
Date: Thu, 16 Oct 2014 13:17:04 -0700
Subject: [PATCH] New PartitionsSpec docs.

---
 docs/content/Batch-ingestion.md | 61 ++++++++++++++++++++++-----------
 1 file changed, 41 insertions(+), 20 deletions(-)

diff --git a/docs/content/Batch-ingestion.md b/docs/content/Batch-ingestion.md
index 2f53eb48b4e..0326d1e1a66 100644
--- a/docs/content/Batch-ingestion.md
+++ b/docs/content/Batch-ingestion.md
@@ -162,37 +162,58 @@ The indexing process has the ability to roll data up as it processes the incomin
 
 ### Partitioning specification
 
-Segments are always partitioned based on timestamp (according to the granularitySpec) and may be further partitioned in some other way depending on partition type.
-Druid supports two types of partitions spec - singleDimension and hashed.
+Segments are always partitioned based on timestamp (according to the granularitySpec) and may be further partitioned in
+some other way depending on partition type. Druid supports two types of partitioning strategies: "hashed" (based on the
+hash of all dimensions in each row), and "dimension" (based on ranges of a single dimension).
 
-In SingleDimension partition type data is partitioned based on the values in that dimension.
-For example, data for a day may be split by the dimension "last\_name" into two segments: one with all values from A-M and one with all values from N-Z.
+Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly
+sized data segments relative to single-dimension partitioning.
 
-In hashed partition type, the number of partitions is determined based on the targetPartitionSize and cardinality of input set and the data is partitioned based on the hashcode of the row.
-
-It is recommended to use Hashed partition as it is more efficient than singleDimension since it does not need to determine the dimension for creating partitions.
-Hashing also gives better distribution of data resulting in equal sized partitions and improving query performance
-
-To use this druid to automatically determine optimal partitions indexer must be given a target partition size. It can then find a good set of partition ranges on its own.
-
-#### Configuration for disabling auto-sharding and creating Fixed number of partitions
- Druid can be configured to NOT run determine partitions and create a fixed number of shards by specifying numShards in hashed partitionsSpec.
- e.g This configuration will skip determining optimal partitions and always create 4 shards for every segment granular interval
+#### Hash-based partitioning
 
 ```json
   "partitionsSpec": {
-     "type": "hashed"
-     "numShards": 4
+     "type": "hashed",
+     "targetPartitionSize": 5000000
    }
 ```
 
+Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments
+according to the hash of all dimensions in each row. The number of segments is determined automatically based on the
+cardinality of the input set and a target partition size.
+
+The configuration options are:
+
 |property|description|required?|
 |--------|-----------|---------|
-|type|type of partitionSpec to be used |no, default : singleDimension|
-|targetPartitionSize|target number of rows to include in a partition, should be a number that targets segments of 700MB\~1GB.|yes|
+|type|type of partitionSpec to be used |"hashed"|
+|targetPartitionSize|target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|either this or numShards|
+|numShards|specify the number of partitions directly, instead of a target partition size. Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically.|either this or targetPartitionSize|
+
+#### Single-dimension partitioning
+
+```json
+  "partitionsSpec": {
+     "type": "dimension",
+     "targetPartitionSize": 5000000
+   }
+```
+
+Single-dimension partitioning works by first selecting a dimension to partition on, and then separating that dimension
+into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example,
+your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and
+"f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can
+override it with a specific dimension.
+
+The configuration options are:
+
+|property|description|required?|
+|--------|-----------|---------|
+|type|type of partitionSpec to be used |"dimension"|
+|targetPartitionSize|target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|yes|
+|maxPartitionSize|maximum number of rows to include in a partition. Defaults to 50% larger than the targetPartitionSize.|no|
 |partitionDimension|the dimension to partition on. Leave blank to select a dimension automatically.|no|
-|assumeGrouped|assume input data has already been grouped on time and dimensions. This is faster, but can choose suboptimal partitions if the assumption is violated.|no|
-|numShards|provides a way to manually override druid-auto sharding and specify the number of shards to create for each segment granular interval.It is only supported by hashed partitionSpec and targetPartitionSize must be set to -1|no|
+|assumeGrouped|assume input data has already been grouped on time and dimensions. Ingestion will run faster, but can choose suboptimal partitions if the assumption is violated.|no|
 
 ### Updater job spec