Merge pull request #1553 from druid-io/schema-design

Added section on best practices for schema design and a few other edits
Charles Allen 2015-07-24 14:18:18 -07:00
commit 68e50c2e9a
5 changed files with 113 additions and 1 deletions

@ -4,7 +4,7 @@ layout: doc_page
# Druid Concepts
Druid is an open source data store designed for [OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing) queries on time-series data.
Druid is an open source data store designed for [OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing) queries on event data.
This page is meant to provide readers with a high level overview of how Druid stores data, and the architecture of a Druid cluster.
## The Data

@ -20,6 +20,12 @@ If you are trying to batch load historical data but no events are being loaded,
Druid can ingest JSON, CSV, TSV and other delimited data out of the box. Druid supports single dimension values, or multiple dimension values (an array of strings). Druid supports long and float numeric columns.
## Not all of my events were ingested
Druid will reject events outside of a window period. The best way to see if events are being rejected is to check the [Druid ingest metrics](../operations/metrics.html).
If the number of ingested events seems correct, make sure your query is correctly formed. If you included a `count` aggregator in your ingestion spec, you will need to query for the results of this aggregate with a `longSum` aggregator. Issuing a query with a `count` aggregator will count the number of Druid rows, which is the row count after [roll-up](../design/index.html).
## Where do my Druid segments end up after ingestion?
Depending on what `druid.storage.type` is set to, Druid will upload segments to some [Deep Storage](../dependencies/deep-storage.html). Local disk is used as the default deep storage.

@ -0,0 +1,98 @@
---
layout: doc_page
---
# Schema Design
This page is meant to assist users in designing a schema for data to be ingested into Druid. Druid ingests denormalized data,
and each column is one of three types: a timestamp, a dimension, or a measure (known in Druid as a metric or
aggregator). This follows the [standard naming convention](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems)
of OLAP data.
For more detailed information:
* Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on.
* Dimensions are fields that can be filtered on or grouped by. They are always either single Strings or arrays of Strings.
* Metrics are fields that can be aggregated. They are often stored as numbers (integers or floats) but can also be stored as complex objects like HyperLogLog sketches or approximate histogram sketches.
Typical production tables (or datasources as they are known in Druid) have fewer than 100 dimensions and fewer
than 100 metrics, although users have reported creating datasources with thousands of dimensions.
Below, we outline some best practices for schema design:
## High cardinality dimensions (e.g. unique IDs)
In practice, we see that exact counts for unique IDs are often not required. Storing unique IDs as a column will defeat
[roll-up](../design/index.html) and hurt compression. Instead, storing a sketch of the unique IDs seen, and using that
sketch as part of aggregations, will greatly improve performance (by up to orders of magnitude) and significantly reduce storage.
Druid's `hyperUnique` aggregator is based on HyperLogLog and can be used for unique counts on a high cardinality dimension.
For more information, see [this video](https://www.youtube.com/watch?v=Hpd3f_MLdXo).
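To illustrate why a unique ID column defeats roll-up, here is a minimal Python sketch. It does not use Druid; it only counts how many distinct dimension combinations (and therefore rolled-up rows) a batch of events would produce. The function `rolled_up_rows` and the column names `hour`, `page`, and `user_id` are hypothetical and chosen for illustration.

```python
def rolled_up_rows(events, dimensions):
    """Count distinct dimension-value combinations, i.e. rows left after roll-up.

    Assumption: all events fall in the same time bucket, so only the
    dimension values decide whether two events collapse into one row.
    """
    return len({tuple(event[d] for d in dimensions) for event in events})


# 1000 events on the same page in the same hour, each from a different user.
events = [{"hour": "14", "page": "a", "user_id": f"u{i}"} for i in range(1000)]

# With the unique ID as a dimension, no two events share a key.
rows_with_id = rolled_up_rows(events, ["hour", "page", "user_id"])

# Without it, every event collapses into a single row.
rows_without_id = rolled_up_rows(events, ["hour", "page"])
```

In this toy data set the ID column turns one row into a thousand, which is the effect a sketch-based metric such as `hyperUnique` is meant to avoid.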
## Nested dimensions
At the time of this writing, Druid does not support nested dimensions. Nested dimensions need to be flattened. For example,
if you have data of the following form:
```
{"foo":{"bar": 3}}
```
then before indexing it, you should transform it to:
```
{"foo_bar": 3}
```
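The flattening step can be done in whatever tool prepares the data before indexing. As a minimal sketch, assuming events arrive as nested Python dictionaries and that joining keys with an underscore is acceptable, a helper like the hypothetical `flatten` below produces the form shown above. It is not part of Druid.

```python
def flatten(event, parent_key="", sep="_"):
    """Return a single-level dict whose keys are the joined nested keys."""
    flat = {}
    for key, value in event.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            # Leaf values (numbers, strings, arrays) are kept unchanged.
            flat[full_key] = value
    return flat


flattened = flatten({"foo": {"bar": 3}})
```

Note that this sketch leaves arrays untouched and does not guard against two different paths producing the same flattened key.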
## Counting the number of ingested events
A count aggregator at ingestion time can be used to count the number of events ingested. However, it is important to note
that when you query for this metric, you should use a `longSum` aggregator. A `count` aggregator at query time will return
the number of Druid rows for the time interval, which can be used to determine what the roll-up ratio was.
To clarify with an example, if your ingestion spec contains:
```
...
"metricsSpec" : [
  {
    "type" : "count",
    "name" : "count"
  },
...
```
You should query for the number of ingested events with:
```
...
"aggregations": [
  { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" },
...
```
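The difference between the two aggregators can be seen in a small simulation. The Python sketch below does not call Druid; it mimics roll-up by grouping events on their timestamp and dimensions while keeping an ingestion-time count per row. The helper `roll_up`, the sample events, and the definition of the roll-up ratio as events per stored row are assumptions made for illustration.

```python
from collections import defaultdict


def roll_up(events, dimensions):
    """Group events by (timestamp, dimension values), counting events per row.

    Assumption: timestamps are already truncated to the ingestion granularity.
    """
    counts = defaultdict(int)
    for event in events:
        key = (event["timestamp"],) + tuple(event[d] for d in dimensions)
        counts[key] += 1  # what an ingestion-time `count` aggregator stores
    return [{"key": key, "count": count} for key, count in counts.items()]


events = [
    {"timestamp": "2015-07-24T14:00", "page": "a"},
    {"timestamp": "2015-07-24T14:00", "page": "a"},
    {"timestamp": "2015-07-24T14:00", "page": "b"},
]
rows = roll_up(events, ["page"])

# A query-time `count` sees only the stored rows.
num_druid_rows = len(rows)

# A query-time `longSum` over the stored `count` column recovers the events.
num_ingested_events = sum(row["count"] for row in rows)

# One way to express the roll-up ratio: ingested events per stored row.
rollup_ratio = num_ingested_events / num_druid_rows
```

Here three events are stored as two rows, so the two aggregators return different numbers for the same interval.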
## Schema-less dimensions
If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column,
a dimension that has been excluded, or a metric column as a dimension. Note that because of [#658](https://github.com/druid-io/druid/issues/658)
these segments will be slightly larger than if the list of dimensions were explicitly specified in lexicographic order. This limitation
does not affect query correctness, only storage requirements.
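The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Druid's implementation; the function `infer_dimensions`, its parameters, and the sample column names are hypothetical.

```python
def infer_dimensions(event, timestamp_column, excluded, metric_fields):
    """Return the columns that would be treated as dimensions.

    A column is skipped if it is the timestamp column, an explicitly
    excluded dimension, or a column consumed by a metric.
    """
    skip = {timestamp_column} | set(excluded) | set(metric_fields)
    return [column for column in event if column not in skip]


event = {"timestamp": "2015-07-24T14:00", "page": "a", "user": "u1", "added": 5, "debug": 1}
dimensions = infer_dimensions(
    event,
    timestamp_column="timestamp",
    excluded=["debug"],
    metric_fields=["added"],
)

# An explicit, lexicographically ordered list, as mentioned above for smaller segments.
explicit_dimensions = sorted(dimensions)
```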
## Including the same column as a dimension and a metric
One workflow with unique IDs is to filter on a particular ID while still being able to do fast unique counts on the ID column.
If you are not using schema-less dimensions, this use case is supported by giving the metric a `name` that differs from the dimension's name.
If you are using schema-less dimensions, the best practice here is to include the same column twice, once as a dimension and once as a `hyperUnique` metric. This may involve
some work at ETL time.
As an example, for schema-less dimensions, repeat the same column:
```
{"device_id_dim":123, "device_id_met":123}
```
and in your `metricsSpec`, include:
```
{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" }
```
`device_id_dim` should automatically get picked up as a dimension.
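The ETL work mentioned above can be as small as copying one field under two names. A minimal Python sketch follows, assuming the raw events carry a single `device_id` field; the helper `duplicate_id` and that source field name are hypothetical, and the `_dim` and `_met` suffixes follow the example above.

```python
def duplicate_id(event, source="device_id"):
    """Return a copy of the event with the ID column present twice.

    The `<source>_dim` copy is left for Druid to pick up as a dimension;
    the `<source>_met` copy is the one the `hyperUnique` metric reads.
    """
    out = dict(event)  # do not mutate the caller's event
    value = out.pop(source)
    out[f"{source}_dim"] = value
    out[f"{source}_met"] = value
    return out


raw = {"device_id": 123}
prepared = duplicate_id(raw)
```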

@ -125,3 +125,8 @@ Period drop rules are of the form:
* `period` - A JSON Object representing ISO-8601 Periods
The interval of a segment will be compared against the specified period. The period is from some time in the past to the current time. The rule matches if the period contains the interval.
# Permanently Deleting Data
Druid can fully drop data from the cluster, wipe the metadata store entry, and remove the data from deep storage for any segments that are
marked as unused (segments dropped from the cluster via rules are always marked as unused). You can submit a [kill task](../misc/tasks.html) to the [indexing service](../design/indexing-service.html) to do this.

@ -13,6 +13,7 @@ h2. Data Ingestion
* "Overview":../ingestion/overview.html
* "Data Formats":../ingestion/data-formats.html
* "Data Schema":../ingestion/index.html
* "Schema Design":../ingestion/schema-design.html
* "Realtime Ingestion":../ingestion/realtime-ingestion.html
* "Batch Ingestion":../ingestion/batch-ingestion.html
* "FAQ":../ingestion/faq.html
@ -27,11 +28,13 @@ h2. Querying
* "DataSource Metadata":../querying/datasourcemetadataquery.html
* "Search":../querying/searchquery.html
* Components
** "Datasources":../querying/datasource.html
** "Filters":../querying/filters.html
** "Aggregations":../querying/aggregations.html
** "Post Aggregations":../querying/post-aggregations.html
** "Granularities":../querying/granularities.html
** "DimensionSpecs":../querying/dimensionspecs.html
** "Context":../querying/query-context.html
* "SQL":../querying/sql.html
h2. Design