Merge pull request #1656 from druid-io/all-the-docs

more docs for common questions
Gian Merlino 2015-08-25 17:49:47 -07:00
commit 10946610f4
9 changed files with 165 additions and 2 deletions

View File

@ -0,0 +1,7 @@
---
layout: doc_page
---
Production Zookeeper Configuration
==================================
You can find great information about configuring and running Zookeeper [here](https://cwiki.apache.org/confluence/display/KAFKA/Operations#Operations-Zookeeper).

View File

@ -53,3 +53,4 @@ UIs
---
* [mistercrunch/panoramix](https://github.com/mistercrunch/panoramix) - A web application to slice, dice and visualize data out of Druid
* [grafana](https://github.com/Quantiply/grafana-plugins/tree/master/features/druid) - A plugin for [Grafana](http://grafana.org/)

View File

@ -0,0 +1,61 @@
---
layout: doc_page
---
# Schema Changes
Schemas for datasources can change at any time and Druid supports different schemas among segments.
## Replacing Segments
Druid uniquely
identifies segments using the datasource, interval, version, and partition number. The partition number is only visible in the segment id if
there are multiple segments created for some granularity of time. For example, if you have hourly segments, but you
have more data in an hour than a single segment can hold, you can create multiple segments for the same hour. These segments will share
the same datasource, interval, and version, but have linearly increasing partition numbers.
```
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-01/2015-01-02_v1_1
foo_2015-01-01/2015-01-02_v1_2
```
In the example segments above, dataSource = foo, interval = 2015-01-01/2015-01-02, version = v1, and the partition numbers are 0, 1, and 2.
If at some later point in time you reindex the data with a new schema, the newly created segments will have a higher version id.
```
foo_2015-01-01/2015-01-02_v2_0
foo_2015-01-01/2015-01-02_v2_1
foo_2015-01-01/2015-01-02_v2_2
```
Druid batch indexing (either Hadoop-based or IndexTask-based) guarantees atomic updates on an interval-by-interval basis.
In our example, until all `v2` segments for `2015-01-01/2015-01-02` are loaded in a Druid cluster, queries exclusively use `v1` segments.
Once all `v2` segments are loaded and queryable, all queries ignore `v1` segments and switch to the `v2` segments.
Shortly afterwards, the `v1` segments are unloaded from the cluster.
Note that updates that span multiple segment intervals are only atomic within each interval. They are not atomic across the entire update.
For example, say you have segments such as the following:
```
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v1_1
foo_2015-01-03/2015-01-04_v1_2
```
`v2` segments will be loaded into the cluster as soon as they are built and will replace `v1` segments for the periods of time where the
segments overlap. Before all `v2` segments are loaded, your cluster may have a mixture of `v1` and `v2` segments.
```
foo_2015-01-01/2015-01-02_v1_0
foo_2015-01-02/2015-01-03_v2_1
foo_2015-01-03/2015-01-04_v1_2
```
In this case, queries may hit a mixture of `v1` and `v2` segments.
## Different Schemas Among Segments
Druid segments for the same datasource may have different schemas. If a string column (dimension) exists in one segment but not
another, queries that involve both segments still work. Queries for the segment missing the dimension will behave as if the dimension has only null values.
Similarly, if one segment has a numeric column (metric) but another does not, queries on the segment missing the
metric will generally "do the right thing": aggregations simply treat the metric as absent in that segment.
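As a hypothetical illustration (the datasource, dimension, and metric names below are made up, not part of this change), suppose the dimension `country` only exists in the `v2` segments. A groupBy query over the whole interval would still work; rows from segments lacking `country` would simply fall into the null group:
```json
{
  "queryType": "groupBy",
  "dataSource": "foo",
  "intervals": ["2015-01-01/2015-01-04"],
  "granularity": "all",
  "dimensions": ["country"],
  "aggregations": [
    {"type": "longSum", "name": "edits", "fieldName": "count"}
  ]
}
```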

View File

@ -0,0 +1,38 @@
---
layout: doc_page
---
# Multitenancy Considerations
Druid is often used to power user-facing data applications and has several features built in to better support high
volumes of concurrent queries.
## Parallelization Model
Druid's fundamental unit of computation is a [segment](../design/segments.html). Nodes scan segments in parallel and a
given node can scan up to `druid.processing.numThreads` segments concurrently. To
process more data in parallel and increase performance, more cores can be added to a cluster. Druid segments
should be sized such that any computation over any given segment completes in at most 500ms.
Druid internally stores requests to scan segments in a priority queue. If a given query requires scanning
more segments than the total number of available processors in a cluster, and many similarly expensive queries are concurrently
running, we don't want any query to be starved out. Druid's internal processing logic will scan a set of segments from one query and release resources as soon as the scans complete.
This allows for a second set of segments from another query to be scanned. By keeping segment computation time very small, we ensure
that resources are constantly being yielded, and segments pertaining to different queries are all being processed.
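As a rough sketch (the values are illustrative, not recommendations from this change), a historical node might size its processing pool in `runtime.properties` like this:
```
druid.processing.numThreads=7
druid.processing.buffer.sizeBytes=1073741824
```
A common rule of thumb is to set `druid.processing.numThreads` to the number of cores minus one, leaving a core free for other work; see the configuration docs for authoritative guidance.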
## Data Distribution
Druid additionally supports multitenancy by providing configurable means of distributing data. Druid's historical nodes
can be configured into [tiers](../operations/rule-configuration.html), and [rules](../operations/rule-configuration.html)
can be set that determine which segments go into which tiers. One use case of this is that recent data tends to be accessed
more frequently than older data. Tiering enables more recent segments to be hosted on more powerful hardware for better performance.
A second copy of recent segments can be replicated on cheaper hardware (a different tier), and older segments can also be
stored on this tier.
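As a sketch (the `hot` tier name, period, and replica counts here are assumptions for illustration), a datasource's load rules might keep the most recent month on a beefier tier and let everything older fall through to the default tier:
```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": {"hot": 2, "_default_tier": 1}
  },
  {
    "type": "loadForever",
    "tieredReplicants": {"_default_tier": 1}
  }
]
```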
## Query Distribution
Druid queries can optionally set a `priority` flag in the [query context](../querying/query-context.html). Queries known to be
slow (download or reporting style queries) can be de-prioritized and more interactive queries can have higher priority.
Broker nodes can also be dedicated to a given tier. For example, one set of broker nodes can be dedicated to fast interactive queries,
and a second set of broker nodes can be dedicated to slower reporting queries. Druid also provides a [router](../development/router.html)
node that can route queries to different brokers based on various query parameters (datasource, interval, etc.).
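For example, a slow reporting-style query might be submitted with a lowered priority via the query context; the query below is only a sketch with a made-up datasource and interval:
```json
{
  "queryType": "timeseries",
  "dataSource": "foo",
  "intervals": ["2015-01-01/2015-02-01"],
  "granularity": "day",
  "aggregations": [
    {"type": "count", "name": "rows"}
  ],
  "context": {"priority": -1}
}
```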

View File

@ -17,11 +17,15 @@ SSDs are highly recommended for historical and real-time nodes if you are not ru
Although Druid supports schema-less ingestion of dimensions, because of [https://github.com/druid-io/druid/issues/658](https://github.com/druid-io/druid/issues/658), you may sometimes get bigger segments than necessary. To ensure segments are as compact as possible, providing dimension names in lexicographic order is recommended.
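For example, the `dimensionsSpec` of an ingestion spec could list the dimensions explicitly in lexicographic order (the dimension names below are purely illustrative):
```json
"dimensionsSpec": {
  "dimensions": ["browser", "city", "country", "page", "user"],
  "dimensionExclusions": [],
  "spatialDimensions": []
}
```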
# Use Timeseries and TopN Queries Instead of GroupBy Where Possible
Timeseries and TopN queries are much more optimized and significantly faster than groupBy queries for their designed use cases. Issuing multiple topN or timeseries queries from your application can potentially be more efficient than a single groupBy query.
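For example, a "top 25 pages by edit count" question can be expressed as a topN instead of a groupBy; the names and threshold below are hypothetical:
```json
{
  "queryType": "topN",
  "dataSource": "foo",
  "intervals": ["2015-01-01/2015-01-02"],
  "granularity": "all",
  "dimension": "page",
  "metric": "edits",
  "threshold": 25,
  "aggregations": [
    {"type": "longSum", "name": "edits", "fieldName": "count"}
  ]
}
```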
# Segment sizes matter
Segments should generally be between 300MB and 700MB in size. Too many small segments result in inefficient CPU utilization, and
too many large segments impact query performance, most notably with TopN queries.
# Read FAQs
You should read about common problems people have here:

View File

@ -167,6 +167,7 @@ Example for the `__time` dimension:
### Lookup lookup extraction function
Explicit lookups allow you to specify a set of keys and values to use when performing the extraction
```json
{
"type":"lookup",

View File

@ -0,0 +1,35 @@
---
layout: doc_page
---
# Joins
Druid has limited support for joins through [query-time lookups](../querying/lookups.html). The common use case of
query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable
String value).
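For instance, reusing the `appid-12345` to `Super Mega Awesome App` example from the lookups doc, such a replacement can be expressed at query time with a lookup extraction function; the snippet below is a sketch of the map-based form described in the dimension specs doc, and the field values are illustrative:
```json
{
  "type": "lookup",
  "lookup": {
    "type": "map",
    "map": {"appid-12345": "Super Mega Awesome App"}
  },
  "retainMissingValue": true
}
```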
Druid does not yet have full support for joins. Although Druid's storage format would allow for the implementation
of joins (there is no loss of fidelity for columns included as dimensions), full support for joins has not yet been implemented
for the following reasons:
1. Scaling join queries has been, in our professional experience,
a constant bottleneck of working with distributed databases.
2. The incremental gains in functionality are perceived to be
of less value than the anticipated problems with managing
highly concurrent, join-heavy workloads.
A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
high-level strategies for join queries we are aware of are a hash-based strategy and a
sorted-merge strategy. The hash-based strategy requires that all but
one data set be available as something that looks like a hash table;
a lookup operation is then performed on this hash table for every
row in the “primary” stream. The sorted-merge strategy assumes
that each stream is sorted by the join key and thus allows for the incremental
joining of the streams. Each of these strategies, however,
requires the materialization of some number of the streams either in
sorted order or in a hash table form.
When all sides of the join are significantly large tables (> 1 billion
records), materializing the pre-join streams requires complex
distributed memory management. The complexity of the memory
management is only amplified by the fact that we are targeting highly
concurrent, multi-tenant workloads.

View File

@ -1,3 +1,6 @@
---
layout: doc_page
---
# Lookups
Lookups are a concept in Druid where dimension values are (optionally) replaced with a new value. See [dimension specs](../querying/dimensionspecs.html) for more information. For the purpose of these documents, a "key" refers to a dimension value to match, and a "value" refers to its replacement. So if you wanted to rename `appid-12345` to `Super Mega Awesome App` then the key would be `appid-12345` and the value would be `Super Mega Awesome App`.
@ -36,6 +39,7 @@ The cache is populated in different ways depending on the settings below. In gen
## URI namespace update
The remapping values for each namespaced lookup can be specified as JSON, as in the following example:
```json
{
"type":"uri",
@ -73,6 +77,7 @@ The `namespaceParseSpec` can be one of a number of values. Each of the examples
|`valueColumn`|The name of the column containing the value|no|The second column|
*example input*
```
bar,something,foo
bat,something2,baz
@ -80,6 +85,7 @@ truck,something3,buck
```
*example namespaceParseSpec*
```json
"namespaceParseSpec": {
"format": "csv",
@ -100,6 +106,7 @@ truck,something3,buck
*example input*
```
bar|something,1|foo
bat|something,2|baz
@ -107,6 +114,7 @@ truck|something,3|buck
```
*example namespaceParseSpec*
```json
"namespaceParseSpec": {
"format": "tsv",
@ -125,6 +133,7 @@ truck|something,3|buck
|`valueFieldName`|The field name of the value|yes|null|
*example input*
```json
{"key": "foo", "value": "bar", "somethingElse" : "something"}
{"key": "baz", "value": "bat", "somethingElse" : "something"}
@ -132,6 +141,7 @@ truck|something,3|buck
```
*example namespaceParseSpec*
```json
"namespaceParseSpec": {
"format": "customJson",
@ -153,13 +163,13 @@ The `simpleJson` lookupParseSpec does not take any parameters. It is simply a li
```
*example namespaceParseSpec*
```json
"namespaceParseSpec":{
"type": "simpleJson"
}
```
## JDBC namespaced lookup
A JDBC lookup will poll a database to populate its local cache. If `tsColumn` is set, it must be able to accept comparisons in the format `'2015-01-01 00:00:00'`; for example, the following must be valid SQL for the table: `SELECT * FROM some_lookup_table WHERE timestamp_column > '2015-01-01 00:00:00'`. If `tsColumn` is set, the caching service will attempt to only poll values that were written *after* the last sync. If `tsColumn` is not set, the entire table is pulled every time.
@ -173,6 +183,7 @@ The JDBC lookups will poll a database to populate its local cache. If the `tsCol
|`valueColumn`|The column in `table` which contains the values|Yes||
|`tsColumn`| The column in `table` which contains when the key was updated|No|Not used|
|`pollPeriod`|How often to poll the DB|No|0 (only once)|
```json
{
"type":"jdbc",
@ -193,6 +204,7 @@ The JDBC lookups will poll a database to populate its local cache. If the `tsCol
# Kafka namespaced lookup
If you need updates to populate as promptly as possible, it is possible to plug into a Kafka topic whose key is the old value and whose message is the desired new value (both in UTF-8). This requires the following extension: "io.druid.extensions:kafka-extraction-namespace"
```json
{
"type":"kafka",

View File

@ -14,6 +14,7 @@ h2. Data Ingestion
* "Data Formats":../ingestion/data-formats.html
* "Data Schema":../ingestion/index.html
* "Schema Design":../ingestion/schema-design.html
* "Schema Changes":../ingestion/schema-changes.html
* "Realtime Ingestion":../ingestion/realtime-ingestion.html
* "Batch Ingestion":../ingestion/batch-ingestion.html
* "FAQ":../ingestion/faq.html
@ -36,6 +37,7 @@ h2. Querying
** "DimensionSpecs":../querying/dimensionspecs.html
** "Context":../querying/query-context.html
* "SQL":../querying/sql.html
* "Joins":../querying/joins.html
h2. Design
* "Overview":../design/design.html
@ -60,6 +62,7 @@ h2. Operations
* "Alerts":../operations/alerts.html
* "Updating the Cluster":../operations/rolling-updates.html
* "Different Hadoop Versions":../operations/other-hadoop.html
* "Multitenancy Considerations":../operations/multitenancy.html
* "Performance FAQ":../operations/performance-faq.html
h2. Configuration
@ -73,6 +76,7 @@ h2. Configuration
* "Simple Cluster Configuration":../configuration/simple-cluster.html
* "Production Cluster Configuration":../configuration/production-cluster.html
* "Production Hadoop Configuration":../configuration/hadoop.html
* "Production Zookeeper Configuration":../configuration/zookeeper.html
h2. Development
* "Libraries":../development/libraries.html