From 70c06fc0e129a2a9fc684141bcd8671f4b70f3c3 Mon Sep 17 00:00:00 2001
From: Nhi Pham <56242907+demo-kratia@users.noreply.github.com>
Date: Tue, 30 May 2023 11:40:12 -0700
Subject: [PATCH] Advise against using WEEK granularity for Native Batch and MSQ (#14341)

Co-authored-by: Charles Smith
---
 docs/ingestion/ingestion-spec.md | 2 +-
 docs/multi-stage-query/reference.md | 5 +++--
 docs/querying/granularities.md | 4 +++-
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
index e5a2ee062dd..fd91694f0d3 100644
--- a/docs/ingestion/ingestion-spec.md
+++ b/docs/ingestion/ingestion-spec.md
@@ -317,7 +317,7 @@ A `granularitySpec` can have the following components:
 | Field | Description | Default |
 |-------|-------------|---------|
 | type |`uniform`| `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.| `day` |
+| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.

Avoid `WEEK` granularity for data partitioning because weeks don't align neatly with months and years, making it difficult to change partitioning by coarser granularity. Instead, opt for other partitioning options such as `DAY` or `MONTH`, which offer more flexibility.| `day` |
 | queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).

Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
 | rollup | Whether to use ingestion-time [rollup](./rollup.md) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Your data will be rolled up if they have the exactly same timestamp. | `true` |
 | intervals | A list of intervals defining time chunks for segments. Specify interval values using ISO8601 format. For example, `["2021-12-06T21:27:10+00:00/2021-12-07T00:00:00+00:00"]`. If you omit the time, the time defaults to "00:00:00".

Druid breaks the list up and rounds off the list values based on the `segmentGranularity`.

If `null` or not provided, batch ingestion tasks generally determine which time chunks to output based on the timestamps found in the input data.

If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks throw away any records with timestamps outside of the specified intervals.

Ignored for any form of streaming ingestion. | `null` |
diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md
index e9c238f9ad5..342e60bf42e 100644
--- a/docs/multi-stage-query/reference.md
+++ b/docs/multi-stage-query/reference.md
@@ -192,12 +192,13 @@ The following ISO 8601 periods are supported for `TIME_FLOOR` and the string con
 - PT1H
 - PT6H
 - P1D
-- P1W
+- P1W*
 - P1M
 - P3M
 - P1Y
-For more information about partitioning, see [Partitioning](concepts.md#partitioning-by-time).
+For more information about partitioning, see [Partitioning](concepts.md#partitioning-by-time).

+*Avoid partitioning by week, `P1W`, because weeks don't align neatly with months and years, making it difficult to partition by coarser granularities later.
 ### `CLUSTERED BY`
diff --git a/docs/querying/granularities.md b/docs/querying/granularities.md
index 4ca12bb3411..327bb239f1d 100644
--- a/docs/querying/granularities.md
+++ b/docs/querying/granularities.md
@@ -51,7 +51,7 @@ Druid supports the following granularity strings:
  - `six_hour`
  - `eight_hour`
  - `day`
- - `week`
+ - `week`*
  - `month`
  - `quarter`
  - `year`
@@ -61,6 +61,8 @@ The minimum and maximum granularities are `none` and `all`, described as follows
 * `none` does not mean zero bucketing. It buckets data to millisecond granularity—the granularity of the internal index. You can think of `none` as equivalent to `millisecond`.
 > Do not use `none` in a [timeseries query](../querying/timeseriesquery.md); Druid fills empty interior time buckets with zeroes, meaning the output will contain results for every single millisecond in the requested interval.
+*Avoid using the `week` granularity for partitioning at ingestion time, because weeks don't align neatly with months and years, making it difficult to partition by coarser granularities later.
+
 #### Example:
 Suppose you have data below stored in Apache Druid with millisecond ingestion granularity,