fix doc headers (#8729)

Vadim Ogievetsky 2019-10-24 11:17:39 -07:00 committed by Fangjin Yang
parent fdbc4ae147
commit cc3650ee3b
7 changed files with 77 additions and 77 deletions

@@ -31,7 +31,7 @@ Apache Druid (incubating) uses [Apache ZooKeeper](http://zookeeper.apache.org/)
 4. [Overlord](../design/overlord.md) leader election
 5. [Overlord](../design/overlord.md) and [MiddleManager](../design/middlemanager.md) task management
-### Coordinator Leader Election
+## Coordinator Leader Election
 We use the Curator LeadershipLatch recipe to do leader election at path
@@ -39,7 +39,7 @@ We use the Curator LeadershipLatch recipe to do leader election at path
 ${druid.zk.paths.coordinatorPath}/_COORDINATOR
 ```
-### Segment "publishing" protocol from Historical and Realtime
+## Segment "publishing" protocol from Historical and Realtime
 The `announcementsPath` and `servedSegmentsPath` are used for this.
@@ -63,7 +63,7 @@ ${druid.zk.paths.servedSegmentsPath}/${druid.host}/_segment_identifier_
 Processes like the [Coordinator](../design/coordinator.md) and [Broker](../design/broker.md) can then watch these paths to see which processes are currently serving which segments.
-### Segment load/drop protocol between Coordinator and Historical
+## Segment load/drop protocol between Coordinator and Historical
 The `loadQueuePath` is used for this.

@@ -122,7 +122,7 @@ To pull it all together, the above query would return *n\*m* data points, up to
 ]
 ```
-### Behavior on multi-value dimensions
+## Behavior on multi-value dimensions
 groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
 from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
@@ -133,7 +133,7 @@ improve performance.
 See [Multi-value dimensions](multi-value-dimensions.html) for more details.
-### More on subtotalsSpec
+## More on subtotalsSpec
 The subtotals feature allows computation of multiple sub-groupings in a single query. To use this feature, add a "subtotalsSpec" to your query, which should be a list of subgroup dimension sets. It should contain the "outputName" from dimensions in your "dimensions" attribute, in the same order as they appear in the "dimensions" attribute (although, of course, you may skip some). For example, consider a groupBy query like this one:
 ```json
@@ -219,9 +219,9 @@ Response for above query would look something like below...
 ]
 ```
-### Implementation details
+## Implementation details
-#### Strategies
+### Strategies
 GroupBy queries can be executed using two different strategies. The default strategy for a cluster is determined by the
 "druid.query.groupBy.defaultStrategy" runtime property on the Broker. This can be overridden using "groupByStrategy" in
@@ -242,7 +242,7 @@ merging is always single-threaded. Because the Broker merges results using the i
 the full result set before returning any results. On both the data processes and the Broker, the merging index is fully
 on-heap by default, but it can optionally store aggregated values off-heap.
-#### Differences between v1 and v2
+### Differences between v1 and v2
 Query API and results are compatible between the two engines; however, there are some differences from a cluster
 configuration perspective:
@@ -263,7 +263,7 @@ ignores chunkPeriod.
 when the grouping key is a single indexed string column. In array-based aggregation, the dictionary-encoded value is used
 as the index, so the aggregated values in the array can be accessed directly without finding buckets based on hashing.
-#### Memory tuning and resource limits
+### Memory tuning and resource limits
 When using groupBy v2, three parameters control resource usage and limits:
@@ -299,21 +299,21 @@ this limit will fail with a "Resource limit exceeded" error indicating they exce
 operators should make sure that the on-heap aggregations will not exceed available JVM heap space for the expected
 concurrent query load.
-#### Performance tuning for groupBy v2
+### Performance tuning for groupBy v2
-##### Limit pushdown optimization
+#### Limit pushdown optimization
 Druid pushes down the `limit` spec in groupBy queries to the segments on Historicals wherever possible to early prune unnecessary intermediate results and minimize the amount of data transferred to Brokers. By default, this technique is applied only when all fields in the `orderBy` spec is a subset of the grouping keys. This is because the `limitPushDown` doesn't guarantee the exact results if the `orderBy` spec includes any fields that are not in the grouping keys. However, you can enable this technique even in such cases if you can sacrifice some accuracy for fast query processing like in topN queries. See `forceLimitPushDown` in [advanced groupBy v2 configurations](#groupby-v2-configurations).
-##### Optimizing hash table
+#### Optimizing hash table
 The groupBy v2 engine uses an open addressing hash table for aggregation. The hash table is initialized with a given initial bucket number and gradually grows on buffer full. On hash collisions, the linear probing technique is used.
 The default number of initial buckets is 1024 and the default max load factor of the hash table is 0.7. If you can see too many collisions in the hash table, you can adjust these numbers. See `bufferGrouperInitialBuckets` and `bufferGrouperMaxLoadFactor` in [Advanced groupBy v2 configurations](#groupby-v2-configurations).
-##### Parallel combine
+#### Parallel combine
 Once a Historical finishes aggregation using the hash table, it sorts the aggregated results and merges them before sending to the
 Broker for N-way merge aggregation in the broker. By default, Historicals use all their available processing threads
@@ -341,7 +341,7 @@ Please note that each Historical needs two merge buffers to process a groupBy v2
 computing intermediate aggregates from each segment and another for combining intermediate aggregates in parallel.
-#### Alternatives
+### Alternatives
 There are some situations where other query types may be a better choice than groupBy.
@@ -353,7 +353,7 @@ advantage of the fact that segments are already sorted on time) and does not nee
 will sometimes be faster than groupBy. This is especially true if you are ordering by a metric and find approximate
 results acceptable.
-#### Nested groupBys
+### Nested groupBys
 Nested groupBys (dataSource of type "query") are performed differently for "v1" and "v2". The Broker first runs the
 inner groupBy query in the usual way. "v1" strategy then materializes the inner query's results on-heap with Druid's
@@ -361,11 +361,11 @@ indexing mechanism, and runs the outer query on these materialized results. "v2"
 inner query's results stream with off-heap fact map and on-heap string dictionary that can spill to disk. Both
 strategy perform the outer query on the Broker in a single-threaded fashion.
-#### Configurations
+### Configurations
 This section describes the configurations for groupBy queries. You can set the runtime properties in the `runtime.properties` file on Broker, Historical, and MiddleManager processes. You can set the query context parameters through the [query context](query-context.html).
-##### Configurations for groupBy v2
+#### Configurations for groupBy v2
 Supported runtime properties:
@@ -382,9 +382,9 @@ Supported query contexts:
 |`maxOnDiskStorage`|Can be used to lower the value of `druid.query.groupBy.maxOnDiskStorage` for this query.|
-#### Advanced configurations
+### Advanced configurations
-##### Common configurations for all groupBy strategies
+#### Common configurations for all groupBy strategies
 Supported runtime properties:
@@ -401,7 +401,7 @@ Supported query contexts:
 |`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|
-##### GroupBy v2 configurations
+#### GroupBy v2 configurations
 Supported runtime properties:
@@ -428,7 +428,7 @@ Supported query contexts:
 |`applyLimitPushDownToSegment`|If Broker pushes limit down to queryable nodes (historicals, peons) then limit results during segment scan. This context value can be used to override `druid.query.groupBy.applyLimitPushDownToSegment`.|true|
-##### GroupBy v1 configurations
+#### GroupBy v1 configurations
 Supported runtime properties:
@@ -445,7 +445,7 @@ Supported query contexts:
 |`maxResults`|Can be used to lower the value of `druid.query.groupBy.maxResults` for this query.|None|
 |`useOffheap`|Set to true to store aggregations off-heap when merging results.|false|
-##### Array based result rows
+#### Array based result rows
 Internally Druid always uses an array based representation of groupBy result rows, but by default this is translated
 into a map based result format at the Broker. To reduce the overhead of this translation, results may also be returned

@@ -124,7 +124,7 @@ only the rows which satisfy those filters, thereby saving I/O cost. However, it
 and cursor-based execution plans, and chooses the optimal one. Currently, it is not enabled by default due to the overhead
 of cost estimation.
-#### Server configuration
+## Server configuration
 The following runtime properties apply:
@@ -132,7 +132,7 @@ The following runtime properties apply:
 |--------|-----------|-------|
 |`druid.query.search.searchStrategy`|Default search query strategy.|useIndexes|
-#### Query context
+## Query context
 The following query context parameters apply:

@@ -89,18 +89,18 @@ undefined.
 Only columns which are dimensions (i.e., have type `STRING`) will have any cardinality. Rest of the columns (timestamp and metric columns) will show cardinality as `null`.
-### intervals
+## intervals
 If an interval is not specified, the query will use a default interval that spans a configurable period before the end time of the most recent segment.
 The length of this default time period is set in the Broker configuration via:
 druid.query.segmentMetadata.defaultHistory
-### toInclude
+## toInclude
 There are 3 types of toInclude objects.
-#### All
+### All
 The grammar is as follows:
@@ -108,7 +108,7 @@ The grammar is as follows:
 "toInclude": { "type": "all"}
 ```
-#### None
+### None
 The grammar is as follows:
@@ -116,7 +116,7 @@ The grammar is as follows:
 "toInclude": { "type": "none"}
 ```
-#### List
+### List
 The grammar is as follows:
@@ -124,7 +124,7 @@ The grammar is as follows:
 "toInclude": { "type": "list", "columns": [<string list of column names>]}
 ```
-### analysisTypes
+## analysisTypes
 This is a list of properties that determines the amount of information returned about the columns, i.e. analyses to be performed on the columns.
@@ -135,32 +135,32 @@ The default analysis types can be set in the Broker configuration via:
 Types of column analyses are described below:
-#### cardinality
+### cardinality
 * `cardinality` in the result will return the estimated floor of cardinality for each column. Only relevant for
 dimension columns.
-#### minmax
+### minmax
 * Estimated min/max values for each column. Only relevant for dimension columns.
-#### size
+### size
 * `size` in the result will contain the estimated total segment byte size as if the data were stored in text format
-#### interval
+### interval
 * `intervals` in the result will contain the list of intervals associated with the queried segments.
-#### timestampSpec
+### timestampSpec
 * `timestampSpec` in the result will contain timestampSpec of data stored in segments. this can be null if timestampSpec of segments was unknown or unmergeable (if merging is enabled).
-#### queryGranularity
+### queryGranularity
 * `queryGranularity` in the result will contain query granularity of data stored in segments. this can be null if query granularity of segments was unknown or unmergeable (if merging is enabled).
-#### aggregators
+### aggregators
 * `aggregators` in the result will contain the list of aggregators usable for querying metric columns. This may be
 null if the aggregators are unknown or unmergeable (if merging is enabled).
@@ -169,12 +169,12 @@ null if the aggregators are unknown or unmergeable (if merging is enabled).
 * The form of the result is a map of column name to aggregator.
-#### rollup
+### rollup
 * `rollup` in the result is true/false/null.
 * When merging is enabled, if some are rollup, others are not, result is null.
-### lenientAggregatorMerge
+## lenientAggregatorMerge
 Conflicts between aggregator metadata across segments can occur if some segments have unknown aggregators, or if
 two segments use incompatible aggregators for the same column (e.g. longSum changed to doubleSum).

@@ -94,7 +94,7 @@ To pull it all together, the above query would return 2 data points, one for eac
 ]
 ```
-#### Grand totals
+## Grand totals
 Druid can include an extra "grand totals" row as the last row of a timeseries result set. To enable this, add
 `"grandTotal" : true` to your query context. For example:
@@ -119,7 +119,7 @@ The grand totals row will appear as the last row in the result array, and will h
 row even if the query is run in "descending" mode. Post-aggregations in the grand totals row will be computed based
 upon the grand total aggregations.
-#### Zero-filling
+## Zero-filling
 Timeseries queries normally fill empty interior time buckets with zeroes. For example, if you issue a "day" granularity
 timeseries query for the interval 2012-01-01/2012-01-04, and no data exists for 2012-01-02, you will receive:

@@ -149,7 +149,7 @@ The format of the results would look like so:
 ]
 ```
-### Behavior on multi-value dimensions
+## Behavior on multi-value dimensions
 topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
 from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
@@ -160,7 +160,7 @@ improve performance.
 See [Multi-value dimensions](multi-value-dimensions.html) for more details.
-### Aliasing
+## Aliasing
 The current TopN algorithm is an approximate algorithm. The top 1000 local results from each segment are returned for merging to determine the global topN. As such, the topN algorithm is approximate in both rank and results. Approximate results *ONLY APPLY WHEN THERE ARE MORE THAN 1000 DIM VALUES*. A topN over a dimension with fewer than 1000 unique dimension values can be considered accurate in rank and accurate in aggregates.
@@ -176,7 +176,7 @@ Users wishing to get an *exact rank and exact aggregates* topN over a dimension
 Users who can tolerate *approximate rank* topN over a dimension with greater than 1000 unique values, but require *exact aggregates* can issue two queries. One to get the approximate topN dimension values, and another topN with dimension selection filters which only use the topN results of the first.
-#### Example First query:
+### Example First query
 ```json
 {
@@ -199,7 +199,7 @@ Users who can tolerate *approximate rank* topN over a dimension with greater tha
 }
 ```
-#### Example second query:
+### Example second query
 ```json
 {

@@ -50,6 +50,9 @@
 "design/coordinator": {
 "title": "Coordinator Process"
 },
+"design/extensions-contrib/dropwizard": {
+"title": "Dropwizard metrics emitter"
+},
 "design/historical": {
 "title": "Historical Process"
 },
@@ -336,9 +339,6 @@
 "operations/pull-deps": {
 "title": "pull-deps tool"
 },
-"operations/recommendations": {
-"title": "Recommendations"
-},
 "operations/reset-cluster": {
 "title": "reset-cluster tool"
 },
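
Most changes in this commit promote a Markdown heading by one level (for example `###` to `##`). The following is only a sketch of how such an edit could be scripted; nothing in the commit indicates it was produced this way. The helper name `promote_headings` and its `min_level` parameter are hypothetical, and the sketch assumes ATX-style `#` headings and fenced code blocks delimited by backticks or tildes.

```python
import re

# Lines that open or close a fenced code block (backtick or tilde fences).
FENCE = re.compile(r"^\s*(```|~~~)")
# ATX headings of level 2 to 6: the run of '#' followed by whitespace and text.
HEADING = re.compile(r"^(#{2,6})(\s+\S.*)$")


def promote_headings(text: str, min_level: int = 3) -> str:
    """Drop one leading '#' from every heading of depth >= min_level.

    Hypothetical helper for illustration only. Lines inside fenced code
    blocks are left untouched. A trailing newline in `text` is not preserved,
    because the text is split into lines and re-joined.
    """
    out = []
    in_fence = False
    for line in text.splitlines():
        if FENCE.match(line):
            # Toggle fence state and copy the fence line unchanged.
            in_fence = not in_fence
            out.append(line)
            continue
        match = HEADING.match(line)
        if match and not in_fence and len(match.group(1)) >= min_level:
            # Remove exactly one '#', keeping the heading text as-is.
            line = match.group(1)[1:] + match.group(2)
        out.append(line)
    return "\n".join(out)
```

Calling `promote_headings(doc_text)` with the default `min_level=3` turns `###` into `##` and `####` into `###` while leaving existing `##` headings alone. Several files in this diff need a different treatment (for instance `####` going straight to `##`), so any such script would still require per-file review.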