mirror of https://github.com/apache/druid.git
Merge pull request #2179 from druid-io/the-docs
Multiple improvements for docs
commit a0ab65d169
@@ -2,6 +2,17 @@
layout: doc_page
---
# Integrating Druid With Other Technologies
-This page discusses how we can integrate druid with other technologies. Event streams can be stored in a distributed queue like Kafka, then it can be streamed to a distributed realtime computation system like Twitter Storm / Samza and then it can be feed into Druid via Tranquility plugin. With Tranquility, Middlemanager & Peons will act as a realtime node and they handle realtime queries, segment handoff and realtime indexing.

+This page discusses how we can integrate Druid with other technologies.

+## Integrating with Open Source Streaming Technologies

+Event streams can be stored in a distributed message bus such as Kafka and further processed via a distributed stream
+processor system such as Storm, Samza, or Spark Streaming. Data processed by the stream processor can feed into Druid using
+the [Tranquility](https://github.com/druid-io/tranquility) library. Data can be

+<img src="../../img/druid-production.png" width="800"/>

+## Integrating with SQL-on-Hadoop Technologies

+Druid should theoretically integrate well with SQL-on-Hadoop technologies such as Apache Drill, Spark SQL, Presto, Impala, and Hive.
@@ -164,6 +164,14 @@ s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
```

+##### `dataSource`

+Read Druid segments. See [here](../ingestion/update-existing-data.html) for more information.

+##### `multi`

+Read multiple sources of data. See [here](../ingestion/update-existing-data.html) for more information.

#### Metadata Update Job Spec

This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them.
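
As a rough sketch of what such a metadata update spec generally looks like for a MySQL-backed metadata store (the connection URI, credentials, and segment table name below are placeholders, not values from this commit):

```json
"metadataUpdateSpec" : {
  "type" : "mysql",
  "connectURI" : "jdbc:mysql://localhost:3306/druid",
  "user" : "druid",
  "password" : "diurd",
  "segmentTable" : "druid_segments"
}
```
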
@@ -1,84 +1,115 @@
---
layout: doc_page
---
# Updating Existing Data

-Once you ingest some data in a dataSource for an interval, You might want to make following kind of changes to existing data.
+Once you ingest some data in a dataSource for an interval and create Druid segments, you might want to make changes to
+the ingested data. There are several ways this can be done.

-##### Reindexing
-You ingested some raw data to a dataSource A and later you want to re-index and create another dataSource B which has a subset of columns or different granularity. Or, you may want to change granularity of data in A itself for some interval.
+##### Updating Dimension Values

-##### Delta-ingestion
-You ingested some raw data to a dataSource A in an interval, later you want to "append" more data to same interval. This might happen because you used realtime ingestion originally and then received some late events.
+If you have a dimension where values need to be updated frequently, try first using [lookups](../querying/lookups.html). A
+classic use case of lookups is when you have an ID dimension stored in a Druid segment, and want to map the ID dimension to a
+human-readable String value that may need to be updated periodically.

-Here are the Druid Features you could use to achieve above updates.
+##### Rebuilding Segments (Reindexing)

-- You can use batch ingestion to override an interval completely by doing the ingestion again with the raw data.
-- You can use re-indexing and delta-ingestion features provided by batch ingestion.
+If lookups are not sufficient, you can entirely rebuild Druid segments for specific intervals of time. Rebuilding a segment
+is known as reindexing the data. For example, if you want to add or remove columns from your existing segments, or you want to
+change the rollup granularity of your segments, you will have to reindex your data.

+We recommend keeping a copy of your raw data around in case you ever need to reindex your data.

-### Re-indexing and Delta Ingestion with Hadoop Batch Ingestion
+##### Dealing with Delayed Events (Delta Ingestion)

-This section assumes the reader has understanding of batch ingestion using Hadoop. See [HadoopIndexTask](../misc/tasks.html#index-hadoop-task) and further explained in [batch-ingestion](batch-ingestion.md). You can use hadoop batch-ingestion to do re-indexing and delta-ingestion as well.
+If you have a batch ingestion pipeline and have delayed events come in and want to append these events to existing
+segments and avoid the overhead of rebuilding new segments with reindexing, you can use delta ingestion.

-It is enabled by how Druid reads input data for doing hadoop batch ingestion. Druid uses specified `inputSpec` to know where the data to be ingested is located and how to read it. For simple hadoop batch ingestion you would use `static` or `granularity` spec types which allow you to read data stored on HDFS.
+### Reindexing and Delta Ingestion with Hadoop Batch Ingestion

-There are two other `inputSpec` types to enable reindexing and delta-ingestion.
+This section assumes the reader understands how to do batch ingestion using Hadoop. See
+[batch-ingestion](batch-ingestion.md) for more information. Hadoop batch-ingestion can be used for reindexing and delta ingestion.

+Druid uses an `inputSpec` in the `ioConfig` to know where the data to be ingested is located and how to read it.
+For simple Hadoop batch ingestion, `static` or `granularity` spec types allow you to read data stored in deep storage.

+There are other types of `inputSpec` to enable reindexing and delta ingestion.

#### `dataSource`

-It is a type of inputSpec that reads data already stored inside druid. It is useful for doing "re-indexing".
+This is a type of `inputSpec` that reads data already stored inside Druid.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
-|ingestionSpec|Json Object|Specification of druid segments to be loaded. See below.|yes|
+|type|String.|This should always be 'dataSource'.|yes|
+|ingestionSpec|JSON object.|Specification of Druid segments to be loaded. See below.|yes|
|maxSplitSize|Number|Enables combining multiple segments into single Hadoop InputSplit according to size of segments. Default is none. |no|

-Here is what goes inside "ingestionSpec"
+Here is what goes inside `ingestionSpec`:

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dataSource|String|Druid dataSource name from which you are loading the data.|yes|
|intervals|List|A list of strings representing ISO-8601 Intervals.|yes|
-|granularity|String|Defines the granularity of the query while loading data. Default value is "none".See [Granularities](../querying/granularities.html).|no|
-|filter|Json|See [Filters](../querying/filters.html)|no|
-|dimensions|Array of String|Name of dimension columns to load. By default, the list will be constructed from parseSpec. If parseSpec does not have explicit list of dimensions then all the dimension columns present in stored data will be read.|no|
+|granularity|String|Defines the granularity of the query while loading data. Default value is "none". See [Granularities](../querying/granularities.html).|no|
+|filter|JSON|See [Filters](../querying/filters.html)|no|
+|dimensions|Array of String|Name of dimension columns to load. By default, the list will be constructed from parseSpec. If parseSpec does not have an explicit list of dimensions then all the dimension columns present in stored data will be read.|no|
|metrics|Array of String|Name of metric columns to load. By default, the list will be constructed from the "name" of all the configured aggregators.|no|

For example

-```
-"ingestionSpec" :
-{
-    "dataSource": "wikipedia",
-    "intervals": ["2014-10-20T00:00:00Z/P2W"]
-}
-```
+```json
+"ioConfig" : {
+  "type" : "hadoop",
+  "inputSpec" : {
+    "type" : "dataSource",
+    "ingestionSpec" : {
+      "dataSource": "wikipedia",
+      "intervals": ["2014-10-20T00:00:00Z/P2W"]
+    }
+  },
+  ...
+}
+```

#### `multi`

-It is a composing inputSpec to combine two other input specs. It is useful for doing delta ingestion. Note that this is not idempotent operation, we might add some features in future to make it idempotent.
+This is a composing inputSpec to combine other inputSpecs. This inputSpec is used for delta ingestion.
+Please note that delta ingestion is not an idempotent operation. We may change things in the future to make it idempotent.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
-|children|Array of Json Objects|List of json objects containing other inputSpecs |yes|
+|children|Array of JSON objects|List of JSON objects containing other inputSpecs.|yes|

-For example
+For example:

-```
-"children": [
-  {
-    "type" : "dataSource",
-    "ingestionSpec" : {
-      "dataSource": "wikipedia",
-      "intervals": ["2014-10-20T00:00:00Z/P2W"]
-    }
-  },
-  {
-    "type" : "static",
-    "paths": "/path/to/more/wikipedia/data/"
-  }
-]
-```
+```json
+"ioConfig" : {
+  "type" : "hadoop",
+  "inputSpec" : {
+    "type" : "multi",
+    "children": [
+      {
+        "type" : "dataSource",
+        "ingestionSpec" : {
+          "dataSource": "wikipedia",
+          "intervals": ["2014-10-20T00:00:00Z/P2W"]
+        }
+      },
+      {
+        "type" : "static",
+        "paths": "/path/to/more/wikipedia/data/"
+      }
+    ]
+  },
+  ...
+}
+```

-### Re-indexing with non-hadoop Batch Ingestion
-This section assumes the reader has understanding of batch ingestion without hadoop using [IndexTask](../misc/tasks.html#index-task) which uses a "firehose" to know where and how to read the input data. [IngestSegmentFirehose](firehose.html#ingestsegmentfirehose) can be used to read data from segments inside Druid. Note that IndexTask is to be used for prototyping purposes only as it has to do all processing inside a single process and can't scale, please use hadoop batch ingestion for realistic scenarios such as dealing with data more than a GB.
+### Reindexing without Hadoop Batch Ingestion

+This section assumes the reader understands how to do batch ingestion without Hadoop using the [IndexTask](../misc/tasks.html#index-task),
+which uses a "firehose" to know where and how to read the input data. [IngestSegmentFirehose](firehose.html#ingestsegmentfirehose)
+can be used to read data from segments inside Druid. Note that IndexTask is to be used for prototyping purposes only as
+it has to do all processing inside a single process and can't scale. Please use Hadoop batch ingestion for production
+scenarios dealing with more than 1GB of data.
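
To make the non-Hadoop path concrete, here is a minimal sketch of an IngestSegmentFirehose stanza, reusing the wikipedia dataSource and interval from the examples above; treat the exact field layout as an assumption and confirm it against the firehose documentation linked above:

```json
"firehose" : {
  "type"       : "ingestSegment",
  "dataSource" : "wikipedia",
  "interval"   : "2014-10-20T00:00:00Z/P2W"
}
```
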
@@ -2,17 +2,25 @@
layout: doc_page
---
# Aggregations
Aggregations are specifications of processing over metrics available in Druid.

+Aggregations can be provided at ingestion time as part of the ingestion spec as a way of summarizing data before it enters Druid.
+Aggregations can also be specified as part of many queries at query time.

Available aggregations are:

### Count aggregator

-`count` computes the row count that match the filters
+`count` computes the count of Druid rows that match the filters.

```json
{ "type" : "count", "name" : <output_name> }
```

+Please note the count aggregator counts the number of Druid rows, which does not always reflect the number of raw events ingested.
+This is because Druid rolls up data at ingestion time. To
+count the number of ingested rows of data, include a count aggregator at ingestion time, and a longSum aggregator at
+query time.

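
A sketch of that ingestion-time/query-time pairing (the output names below are illustrative): define a count metric when ingesting,

```json
"metricsSpec" : [
  { "type" : "count", "name" : "count" }
]
```

and then sum it in the query's aggregations:

```json
{ "type" : "longSum", "name" : "numIngestedEvents", "fieldName" : "count" }
```
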
### Sum aggregators

#### `longSum` aggregator
@@ -100,6 +108,9 @@ All JavaScript functions must return numerical values.
}
```

+The javascript aggregator is recommended for rapidly prototyping features. This aggregator will be much slower in production
+use than a native Java aggregator.

### Cardinality aggregator

Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality.
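
As a sketch of the syntax (the field names are illustrative, echoing the distinct first/last name example referenced in the next hunk):

```json
{
  "type"       : "cardinality",
  "name"       : "distinct_people",
  "fieldNames" : [ "first_name", "last_name" ],
  "byRow"      : true
}
```
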
@@ -169,6 +180,8 @@ Determine the number of distinct people (i.e. combinations of first and last nam

## Complex Aggregations

+Druid supports complex aggregations such as various types of approximate sketches.

### HyperUnique aggregator

Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time.
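
A minimal example of the syntax, with placeholder names:

```json
{ "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_unique" }
```
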
@@ -195,7 +195,10 @@ Example for the `__time` dimension:
```

### Lookup extraction function
-Explicit lookups allow you to specify a set of keys and values to use when performing the extraction

+Lookups are a concept in Druid where dimension values are (optionally) replaced with new values.
+For more documentation on using lookups, please see [here](../querying/lookups.html).
+Explicit lookups allow you to specify a set of keys and values to use when performing the extraction.

```json
{
@@ -240,11 +243,12 @@ Explicit lookups allow you to specify a set of keys and values to use when perfo
}
```

-A lookup can be of type `namespace` or `map`. A `map` lookup is passed as part of the query. A `namespace` lookup is populated on all the nodes which handle queries as per [lookups](../querying/lookups.html)
+A lookup can be of type `namespace` or `map`. A `map` lookup is passed as part of the query.
+A `namespace` lookup is populated on all the nodes which handle queries as per [lookups](../querying/lookups.html)

-A property of `retainMissingValue` and `replaceMissingValueWith` can be specified at query time to hint how to handle missing values. Setting `replaceMissingValueWith` to `""` has the same effect of setting it to `null` or omitting the property. Setting `retainMissingValue` to true will use the dimension's original value if it is not found in the lookup. The default values are `replaceMissingValueWith = null` and `retainMissingValue = false` which causes missing values to be treated as missing.
+A property of `retainMissingValue` and `replaceMissingValueWith` can be specified at query time to hint how to handle missing values. Setting `replaceMissingValueWith` to `""` has the same effect as setting it to `null` or omitting the property. Setting `retainMissingValue` to true will use the dimension's original value if it is not found in the lookup. The default values are `replaceMissingValueWith = null` and `retainMissingValue = false` which causes missing values to be treated as missing.

-It is illegal to set `retainMissingValue = true` and also specify a `replaceMissingValueWith`
+It is illegal to set `retainMissingValue = true` and also specify a `replaceMissingValueWith`.

A property of `injective` specifies if optimizations can be used which assume there is no combining of multiple names into one. For example: If ABC123 is the only key that maps to SomeCompany, that can be optimized since it is a unique lookup. But if both ABC123 and DEF456 BOTH map to SomeCompany, then that is NOT a unique lookup. Setting this value to true and setting `retainMissingValue` to FALSE (the default) may cause undesired behavior.
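
Putting those flags together, a map lookup that retains missing values and is declared non-injective (because two keys map to the same value, as in the SomeCompany example above) might look like this sketch:

```json
{
  "type" : "lookup",
  "lookup" : {
    "type" : "map",
    "map" : { "ABC123" : "SomeCompany", "DEF456" : "SomeCompany" }
  },
  "retainMissingValue" : true,
  "injective" : false
}
```
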
@@ -254,6 +258,7 @@ For example, specifying `{"":"bar","bat":"baz"}` with dimension values `[null, "
Omitting the empty string key will cause the missing value to take over. For example, specifying `{"bat":"baz"}` with dimension values `[null, "foo", "bat"]` and replacing missing values with `"oof"` will yield results of `["oof", "oof", "baz"]`.

### Filtering DimensionSpecs

These are only valid for multi-valued dimensions. If you have a row in druid that has a multi-valued dimension with values ["v1", "v2", "v3"] and you send a groupBy/topN query grouping by that dimension with [query filter](filter.html) for value "v1". In the response you will get 3 rows containing "v1", "v2" and "v3". This behavior might be unintuitive for some use cases.

It happens because `query filter` is internally used on the bitmaps and only used to match the row to be included in the query result processing. With multivalued dimensions, "query filter" behaves like a contains check, which will match the row with dimension value ["v1", "v2", "v3"]. Please see the section on "Multi-value columns" in [segment](../design/segments.html) for more details.
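
One way such a filtering dimensionSpec could be expressed is sketched below; the `listFiltered` type, its field layout, and the dimension name are assumptions made for illustration, not text from this commit:

```json
{
  "type" : "listFiltered",
  "delegate" : {
    "type" : "default",
    "dimension" : "myMultiValueDimension",
    "outputName" : "myMultiValueDimension"
  },
  "values" : [ "v1" ]
}
```
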
@@ -3,42 +3,82 @@ layout: doc_page
---
# Lookups

-Lookups are a concept in Druid where dimension values are (optionally) replaced with a new value. See [dimension specs](../querying/dimensionspecs.html) for more information. For the purpose of these documents, a "key" refers to a dimension value to match, and a "value" refers to its replacement. So if you wanted to rename `appid-12345` to `Super Mega Awesome App` then the key would be `appid-12345` and the value would be `Super Mega Awesome App`.
+Lookups are a concept in Druid where dimension values are (optionally) replaced with new values.
+See [dimension specs](../querying/dimensionspecs.html) for more information. For the purpose of these documents,
+a "key" refers to a dimension value to match, and a "value" refers to its replacement.
+So if you wanted to rename `appid-12345` to `Super Mega Awesome App` then the key would be `appid-12345` and the value
+would be `Super Mega Awesome App`.

-It is worth noting that lookups support use cases where keys map to unique values (injective) as per a country code and a country name, and also supports use cases where multiple IDs map to the same value as per multiple app-ids belonging to a single account manager.
+It is worth noting that lookups support use cases where keys map to unique values (injective) such as a country code and
+a country name, and also supports use cases where multiple IDs map to the same value, e.g. multiple app-ids belonging to
+a single account manager.

-Lookups do not have history. They always use the current data. This means that if the chief account manager for a particular app-id changes, and you issue a query with a lookup to store the app-id to account manager relationship, it will return the current account manager for that app-id REGARDLESS of the time range over which you query.
+Lookups do not have history. They always use the current data. This means that if the chief account manager for a
+particular app-id changes, and you issue a query with a lookup to store the app-id to account manager relationship,
+it will return the current account manager for that app-id REGARDLESS of the time range over which you query.

-If you require data timerange sensitive lookups, such a use case is not currently supported dynamically at query time, and such data belongs in the raw denormalized data for use in Druid.
+If you require data time range sensitive lookups, such a use case is not currently supported dynamically at query time,
+and such data belongs in the raw denormalized data for use in Druid.

-Very small lookups (count of keys on the order of a few dozen to a few hundred) can be passed at query time as a map lookup as per [dimension specs](../querying/dimensionspecs.html).
+Very small lookups (count of keys on the order of a few dozen to a few hundred) can be passed at query time as a "map"
+lookup as per [dimension specs](../querying/dimensionspecs.html).

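
As a sketch of what passing a small map lookup at query time looks like inside a dimension spec (reusing the `appid-12345` rename above; the dimension and output names are placeholders):

```json
{
  "type" : "extraction",
  "dimension" : "appid",
  "outputName" : "application",
  "extractionFn" : {
    "type" : "lookup",
    "lookup" : {
      "type" : "map",
      "map" : { "appid-12345" : "Super Mega Awesome App" }
    },
    "retainMissingValue" : true
  }
}
```
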
-Namespaced lookups are appropriate for lookups which are not possible to pass at query time due to their size, or are not desired to be passed at query time because the data is to reside in and be handled by the Druid servers. Namespaced lookups can be specified as part of the runtime properties file. The property is a list of the namespaces described as per the sections on this page.
+Namespaced lookups are appropriate for lookups which are not possible to pass at query time due to their size,
+or are not desired to be passed at query time because the data is to reside in and be handled by the Druid servers.
+Namespaced lookups can be specified as part of the runtime properties file. The property is a list of the namespaces
+described as per the sections on this page. For example:

```json
-druid.query.extraction.namespace.lookups=\
-[{ "type":"uri", "namespace":"some_uri_lookup","uri": "file:/tmp/prefix/",\
-"namespaceParseSpec":\
-{"format":"csv","columns":["key","value"]},\
-"pollPeriod":"PT5M"},\
-{ "type":"jdbc", "namespace":"some_jdbc_lookup",\
-"connectorConfig":{"createTables":true,"connectURI":"jdbc:mysql://localhost:3306/druid","user":"druid","password":"diurd"},\
-"table": "lookupTable", "keyColumn": "mykeyColumn", "valueColumn": "MyValueColumn", "tsColumn": "timeColumn"}]
+druid.query.extraction.namespace.lookups=
+[
+  {
+    "type": "uri",
+    "namespace": "some_uri_lookup",
+    "uri": "file:/tmp/prefix/",
+    "namespaceParseSpec": {
+      "format": "csv",
+      "columns": [
+        "key",
+        "value"
+      ]
+    },
+    "pollPeriod": "PT5M"
+  },
+  {
+    "type": "jdbc",
+    "namespace": "some_jdbc_lookup",
+    "connectorConfig": {
+      "createTables": true,
+      "connectURI": "jdbc:mysql:\/\/localhost:3306\/druid",
+      "user": "druid",
+      "password": "diurd"
+    },
+    "table": "lookupTable",
+    "keyColumn": "mykeyColumn",
+    "valueColumn": "MyValueColumn",
+    "tsColumn": "timeColumn"
+  }
+]
```

-Proper funcitonality of Namespaced lookups requires the following extension to be loaded on the broker, peon, and historical nodes:
+Proper functionality of Namespaced lookups requires the following extension to be loaded on the broker, peon, and historical nodes:
`io.druid.extensions:druid-namespace-lookup`

## Cache Settings
-The following are settings used by the nodes which service queries when setting namespaces (broker, peon, historical)

+Lookups are cached locally on historical nodes. The following are settings used by the nodes which service queries when
+setting namespaces (broker, peon, historical)

|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.extraction.namespace.cache.type`|Specifies the type of caching to be used by the namespaces. May be one of [`offHeap`, `onHeap`]. `offHeap` uses a temporary file for off-heap storage of the namespace (memory mapped files). `onHeap` stores all cache on the heap in standard java map types.|`onHeap`|

-The cache is populated in different ways depending on the settings below. In general, most namespaces employ a `pollPeriod` at the end of which time they poll the remote resource of interest for updates. The notable exception being the kafka namespace lookup as defined below.
+The cache is populated in different ways depending on the settings below. In general, most namespaces employ
+a `pollPeriod` at the end of which time they poll the remote resource of interest for updates. A notable exception
+is the Kafka namespace lookup, defined below.

## URI namespace update

The remapping values for each namespaced lookup can be specified by json as per

```json
@@ -204,6 +244,7 @@ The JDBC lookups will poll a database to populate its local cache. If the `tsCol
```

# Kafka namespaced lookup

If you need updates to populate as promptly as possible, it is possible to plug into a kafka topic whose key is the old value and message is the desired new value (both in UTF-8). This requires the following extension: "io.druid.extensions:kafka-extraction-namespace"

```json
@@ -221,11 +262,13 @@ If you need updates to populate as promptly as possible, it is possible to plug


## Kafka renames

The extension `kafka-extraction-namespace` enables reading from a kafka feed which has name/key pairs to allow renaming of dimension values. An example use case would be to rename an ID to a human readable format.

+Currently the historical node caches the key/value pairs from the kafka feed in an ephemeral memory mapped DB via MapDB.

## Configuration

The following options are used to define the behavior and should be included wherever the extension is included (all query servicing nodes):

|Property|Description|Default|
@@ -240,7 +283,8 @@ The following are the handling for kafka consumer properties in `druid.query.ren
|`group.id`|Group ID, auto-assigned for publish-subscribe model and cannot be overridden|`UUID.randomUUID().toString()`|
|`auto.offset.reset`|Setting to get the entire kafka rename stream. Cannot be overridden|`smallest`|

-## Testing the kafka rename functionality
+## Testing the Kafka rename functionality

To test this setup, you can send key/value pairs to a kafka stream via the following producer console:

`./bin/kafka-console-producer.sh --property parse.key=true --property key.separator="->" --broker-list localhost:9092 --topic testTopic`
@@ -9,8 +9,13 @@ Queries are made using an HTTP REST style request to queryable nodes ([Broker](.
[Historical](../design/historical.html), or [Realtime](../design/realtime.html)). The
query is expressed in JSON and each of these node types expose the same
REST query interface. For normal Druid operations, queries should be issued to the broker nodes.

+Druid's native query language is JSON over HTTP, although many members of the community have contributed different
+[client libraries](../development/libraries.html) in other languages to query Druid.

-Although Druid's native query language is JSON over HTTP, many members of the community have contributed different [client libraries](../development/libraries.html) in other languages to query Druid.
+Druid's native query is relatively low level, mapping closely to how computations are performed internally. Druid queries
+are designed to be lightweight and complete very quickly. This means that for more complex analysis, or to build
+more complex visualizations, multiple Druid queries may be required.

Available Queries
-----------------
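
To make "JSON over HTTP" concrete, here is a sketch of a native query body; it would typically be POSTed to a broker at `/druid/v2/?pretty`, and the dataSource, interval, and metric names below are placeholders:

```json
{
  "queryType"    : "timeseries",
  "dataSource"   : "wikipedia",
  "granularity"  : "day",
  "intervals"    : [ "2014-10-20/2014-11-03" ],
  "aggregations" : [
    { "type" : "longSum", "name" : "edits", "fieldName" : "count" }
  ]
}
```
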