Various documentation updates. (#13107)

* Various documentation updates.

1) Split out "data management" from "ingestion". Break it into thematic pages.

2) Move "SQL-based ingestion" into the Ingestion category. Adjust content so
   all conceptual content is in concepts.md and all syntax content is in reference.md.
   Shorten the known issues page to the most interesting ones.

3) Add SQL-based ingestion to the ingestion method comparison page. Remove the
   index task, since index_parallel is just as good when maxNumConcurrentSubTasks: 1.

4) Rename various mentions of "Druid console" to "web console".

5) Add additional information to ingestion/partitioning.md.

6) Remove a mention of Tranquility.

7) Remove a note about upgrading to Druid 0.10.1.

8) Remove no-longer-relevant task types from ingestion/tasks.md.

9) Move ingestion/native-batch-firehose.md to the hidden section. It was previously deprecated.

10) Move ingestion/native-batch-simple-task.md to the hidden section. It is still linked in some
    places, but it isn't very useful compared to index_parallel, so it shouldn't take up space
    in the sidebar.

11) Make all br tags self-closing.

12) Certain other cosmetic changes.

13) Update to node-sass 7.

* make travis use node12 for docs

Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
Gian Merlino 2022-09-16 21:58:11 -07:00 committed by GitHub
parent c62a822121
commit d4967c38f8
73 changed files with 2612 additions and 1549 deletions


@ -394,7 +394,7 @@ jobs:
- name: "docs"
stage: Tests - phase 1
install: ./check_test_suite.py && travis_terminate 0 || (cd website && npm install)
install: ./check_test_suite.py && travis_terminate 0 || (cd website && nvm install 12.22.12 && npm install)
script: |-
(cd website && npm run lint && npm run spellcheck) || { echo "


@ -895,7 +895,7 @@ These Coordinator static configurations can be defined in the `coordinator/runti
The Coordinator has dynamic configuration to change certain behavior on the fly.
It is recommended that you use the [web console](../operations/druid-console.md) to configure these parameters.
It is recommended that you use the [web console](../operations/web-console.md) to configure these parameters.
However, if you need to do it via HTTP, the JSON object can be submitted to the Coordinator via a POST request at:
```
@ -983,7 +983,7 @@ These configuration options control the behavior of the Lookup dynamic configura
##### Automatic compaction dynamic configuration
You can set or update [automatic compaction](../ingestion/automatic-compaction.md) properties dynamically using the
You can set or update [automatic compaction](../data-management/automatic-compaction.md) properties dynamically using the
[Coordinator API](../operations/api-reference.md#automatic-compaction-configuration) without restarting Coordinators.
For details about segment compaction, see [Segment size optimization](../operations/segment-optimization.md).
@ -995,7 +995,7 @@ You can configure automatic compaction through the following properties:
|`dataSource`|dataSource name to be compacted.|yes|
|`taskPriority`|[Priority](../ingestion/tasks.md#priority) of compaction task.|no (default = 25)|
|`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 12GB will result in compaction tasks taking an excessive amount of time.|no (default = 100,000,000,000,000 i.e. 100TB)|
|`skipOffsetFromLatest`|The offset for searching segments to be compacted in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Strongly recommended to set for realtime dataSources. See [Data handling with compaction](../ingestion/compaction.md#data-handling-with-compaction).|no (default = "P1D")|
|`skipOffsetFromLatest`|The offset for searching segments to be compacted in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Strongly recommended to set for realtime dataSources. See [Data handling with compaction](../data-management/compaction.md#data-handling-with-compaction).|no (default = "P1D")|
|`tuningConfig`|Tuning config for compaction tasks. See below [Automatic compaction tuningConfig](#automatic-compaction-tuningconfig).|no|
|`taskContext`|[Task context](../ingestion/tasks.md#context) for compaction tasks.|no|
|`granularitySpec`|Custom `granularitySpec`. See [Automatic compaction granularitySpec](#automatic-compaction-granularityspec).|No|
@ -1020,7 +1020,7 @@ You may see this issue with streaming ingestion from Kafka and Kinesis, which in
To mitigate this problem, set `skipOffsetFromLatest` to a value large enough so that arriving data tends to fall outside the offset value from the current time. This way you can avoid conflicts between compaction tasks and realtime ingestion tasks.
For example, if you want to skip over segments from thirty days prior to the end time of the most recent segment, assign `"skipOffsetFromLatest": "P30D"`.
For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion).
For more information, see [Avoid conflicts with ingestion](../data-management/automatic-compaction.md#avoid-conflicts-with-ingestion).
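As a minimal sketch, an automatic compaction configuration that applies the thirty-day offset described above might look like the following. The datasource name `wikipedia` is a placeholder, and this fragment assumes every other property is left at its default; only `dataSource` is required.

```json
{
  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "P30D"
}
```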
###### Automatic compaction tuningConfig
@ -1038,7 +1038,7 @@ The below is a list of the supported configurations for auto-compaction.
|`indexSpecForIntermediatePersists`|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into the final published segment. See [IndexSpec](../ingestion/ingestion-spec.md#indexspec) for possible values.|no|
|`maxPendingPersists`|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with `maxRowsInMemory` * (2 + `maxPendingPersists`).|no (default = 0, meaning one persist can be running concurrently with ingestion, and none can be queued up)|
|`pushTimeout`|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|no (default = 0)|
|`segmentWriteOutMediumFactory`|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](../ingestion/native-batch-simple-task.md#segmentwriteoutmediumfactory).|no (default is the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used)|
|`segmentWriteOutMediumFactory`|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](../ingestion/native-batch.md#segmentwriteoutmediumfactory).|no (default is the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used)|
|`maxNumConcurrentSubTasks`|Maximum number of worker tasks which can be run in parallel at the same time. The supervisor task would spawn worker tasks up to `maxNumConcurrentSubTasks` regardless of the current available task slots. If this value is set to 1, the supervisor task processes data ingestion on its own instead of spawning worker tasks. If this value is set to too large, too many worker tasks can be created which might block other ingestion. Check [Capacity Planning](../ingestion/native-batch.md#capacity-planning) for more details.|no (default = 1)|
|`maxRetry`|Maximum number of retries on task failures.|no (default = 3)|
|`maxNumSegmentsToMerge`|Max limit for the number of segments that a single task can merge at the same time in the second phase. Used only with `hashed` or `single_dim` partitionsSpec.|no (default = 100)|
@ -1801,7 +1801,7 @@ Druid uses Jetty to serve HTTP requests. Each query being processed consumes a s
|`druid.server.http.enableRequestLimit`|If enabled, no requests would be queued in jetty queue and "HTTP 429 Too Many Requests" error response would be sent. |false|
|`druid.server.http.defaultQueryTimeout`|Query timeout in millis, beyond which unfinished queries will be cancelled|300000|
|`druid.server.http.maxScatterGatherBytes`|Maximum number of bytes gathered from data processes such as Historicals and realtime processes to execute a query. Queries that exceed this limit will fail. This is an advance configuration that allows to protect in case Broker is under heavy load and not utilizing the data gathered in memory fast enough and leading to OOMs. This limit can be further reduced at query time using `maxScatterGatherBytes` in the context. Note that having large limit is not necessarily bad if broker is never under heavy concurrent load in which case data gathered is processed quickly and freeing up the memory used. Human-readable format is supported, see [here](human-readable-byte.md). |Long.MAX_VALUE|
|`druid.server.http.maxSubqueryRows`|Maximum number of rows from all subqueries per query. Druid stores the subquery rows in temporary tables that live in the Java heap. `druid.server.http.maxSubqueryRows` is a guardrail to prevent the system from exhausting available heap. When a subquery exceeds the row limit, Druid throws a resource limit exceeded exception: "Subquery generated results beyond maximum."<br><br>It is a good practice to avoid large subqueries in Druid. However, if you choose to raise the subquery row limit, you must also increase the heap size of all Brokers, Historicals, and task Peons that process data for the subqueries to accommodate the subquery results.<br><br>There is no formula to calculate the correct value. Trial and error is the best approach.|100000|
|`druid.server.http.maxSubqueryRows`|Maximum number of rows from all subqueries per query. Druid stores the subquery rows in temporary tables that live in the Java heap. `druid.server.http.maxSubqueryRows` is a guardrail to prevent the system from exhausting available heap. When a subquery exceeds the row limit, Druid throws a resource limit exceeded exception: "Subquery generated results beyond maximum."<br /><br />It is a good practice to avoid large subqueries in Druid. However, if you choose to raise the subquery row limit, you must also increase the heap size of all Brokers, Historicals, and task Peons that process data for the subqueries to accommodate the subquery results.<br /><br />There is no formula to calculate the correct value. Trial and error is the best approach.|100000|
|`druid.server.http.gracefulShutdownTimeout`|The maximum amount of time Jetty waits after receiving shutdown signal. After this timeout the threads will be forcefully shutdown. This allows any queries that are executing to complete(Only values greater than zero are valid).|`PT30S`|
|`druid.server.http.unannouncePropagationDelay`|How long to wait for ZooKeeper unannouncements to propagate before shutting down Jetty. This is a minimum and `druid.server.http.gracefulShutdownTimeout` does not start counting down until after this period elapses.|`PT0S` (do not wait)|
|`druid.server.http.maxQueryTimeout`|Maximum allowed value (in milliseconds) for `timeout` parameter. See [query-context](../querying/query-context.md) to know more about `timeout`. Query is rejected if the query context `timeout` is greater than this value. |Long.MAX_VALUE|
@ -1901,7 +1901,7 @@ The Druid SQL server is configured through the following properties on the Broke
|`druid.sql.planner.authorizeSystemTablesDirectly`|If true, Druid authorizes queries against any of the system schema tables (`sys` in SQL) as `SYSTEM_TABLE` resources which require `READ` access, in addition to permissions based content filtering.|false|
|`druid.sql.planner.useNativeQueryExplain`|If true, `EXPLAIN PLAN FOR` will return the explain plan as a JSON representation of equivalent native query(s), else it will return the original version of explain plan generated by Calcite. It can be overridden per query with `useNativeQueryExplain` context key.|true|
|`druid.sql.planner.maxNumericInFilters`|Max limit for the amount of numeric values that can be compared for a string type dimension when the entire SQL WHERE clause of a query translates to an [OR](../querying/filters.md#or) of [Bound filter](../querying/filters.md#bound-filter). By default, Druid does not restrict the amount of numeric Bound Filters on String columns, although this situation may block other queries from running. Set this property to a smaller value to prevent Druid from running queries that have prohibitively long segment processing times. The optimal limit requires some trial and error; we recommend starting with 100. Users who submit a query that exceeds the limit of `maxNumericInFilters` should instead rewrite their queries to use strings in the `WHERE` clause instead of numbers. For example, `WHERE someString IN (123, 456)`. If this value is disabled, `maxNumericInFilters` set through query context is ignored.|`-1` (disabled)|
|`druid.sql.approxCountDistinct.function`|Implementation to use for the [`APPROX_COUNT_DISTINCT` function](../querying/sql-aggregations.md). Without extensions loaded, the only valid value is `APPROX_COUNT_DISTINCT_BUILTIN` (a HyperLogLog, or HLL, based implementation). If the [DataSketches extension](../development/extensions-core/datasketches-extension.md) is loaded, this can also be `APPROX_COUNT_DISTINCT_DS_HLL` (alternative HLL implementation) or `APPROX_COUNT_DISTINCT_DS_THETA`.<br><br>Theta sketches use significantly more memory than HLL sketches, so you should prefer one of the two HLL implementations.|APPROX_COUNT_DISTINCT_BUILTIN|
|`druid.sql.approxCountDistinct.function`|Implementation to use for the [`APPROX_COUNT_DISTINCT` function](../querying/sql-aggregations.md). Without extensions loaded, the only valid value is `APPROX_COUNT_DISTINCT_BUILTIN` (a HyperLogLog, or HLL, based implementation). If the [DataSketches extension](../development/extensions-core/datasketches-extension.md) is loaded, this can also be `APPROX_COUNT_DISTINCT_DS_HLL` (alternative HLL implementation) or `APPROX_COUNT_DISTINCT_DS_THETA`.<br /><br />Theta sketches use significantly more memory than HLL sketches, so you should prefer one of the two HLL implementations.|APPROX_COUNT_DISTINCT_BUILTIN|
> Previous versions of Druid had properties named `druid.sql.planner.maxQueryCount` and `druid.sql.planner.maxSemiJoinRowsInMemory`.
> These properties are no longer available. Since Druid 0.18.0, you can use `druid.server.http.maxSubqueryRows` to control the maximum


@ -39,12 +39,12 @@ This topic guides you through setting up automatic compaction for your Druid clu
## Enable automatic compaction
You can enable automatic compaction for a datasource using the Druid console or programmatically via an API.
This process differs for manual compaction tasks, which can be submitted from the [Tasks view of the Druid console](../operations/druid-console.md) or the [Tasks API](../operations/api-reference.md#post-5).
You can enable automatic compaction for a datasource using the web console or programmatically via an API.
This process differs for manual compaction tasks, which can be submitted from the [Tasks view of the web console](../operations/web-console.md) or the [Tasks API](../operations/api-reference.md#post-5).
### Druid console
### Web console
Use the Druid console to enable automatic compaction for a datasource as follows.
Use the web console to enable automatic compaction for a datasource as follows.
1. Click **Datasources** in the top-level navigation.
2. In the **Compaction** column, click the edit icon for the datasource to compact.
@ -142,13 +142,13 @@ druid.coordinator.compaction.period=PT60S
After the Coordinator has initiated auto-compaction, you can view compaction statistics for the datasource, including the number of bytes, segments, and intervals already compacted and those awaiting compaction. The Coordinator also reports the total bytes, segments, and intervals not eligible for compaction in accordance with its [segment search policy](../design/coordinator.md#segment-search-policy-in-automatic-compaction).
In the Druid console, the Datasources view displays auto-compaction statistics. The Tasks view shows the task information for compaction tasks that were triggered by the automatic compaction system.
In the web console, the Datasources view displays auto-compaction statistics. The Tasks view shows the task information for compaction tasks that were triggered by the automatic compaction system.
To get statistics by API, send a [`GET` request](../operations/api-reference.md#get-10) to `/druid/coordinator/v1/compaction/status`. To filter the results to a particular datasource, pass the datasource name as a query parameter to the request—for example, `/druid/coordinator/v1/compaction/status?dataSource=wikipedia`.
## Examples
The following examples demonstrate potential use cases in which auto-compaction may improve your Druid performance. See more details in [Compaction strategies](../ingestion/compaction.md#compaction-strategies). The examples in this section do not change the underlying data.
The following examples demonstrate potential use cases in which auto-compaction may improve your Druid performance. See more details in [Compaction strategies](../data-management/compaction.md#compaction-strategies). The examples in this section do not change the underlying data.
### Change segment granularity


@ -29,7 +29,7 @@ Query performance in Apache Druid depends on optimally sized segments. Compactio
There are several cases to consider compaction for segment optimization:
- With streaming ingestion, data can arrive out of chronological order creating many small segments.
- If you append data using `appendToExisting` for [native batch](native-batch.md) ingestion creating suboptimal segments.
- If you append data using `appendToExisting` for [native batch](../ingestion/native-batch.md) ingestion creating suboptimal segments.
- When you use `index_parallel` for parallel batch indexing and the parallel ingestion tasks create many small segments.
- When a misconfigured ingestion task creates oversized segments.
@ -39,7 +39,7 @@ By default, compaction does not modify the underlying data of the segments. Howe
- If you don't need fine-grained granularity for older data, you can use compaction to change older segments to a coarser query granularity. For example, from `minute` to `hour` or `hour` to `day`. This reduces the storage space required for older data.
- You can change the dimension order to improve sorting and reduce segment size.
- You can remove unused columns in compaction or implement an aggregation metric for older data.
- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](./rollup.md#perfect-rollup-vs-best-effort-rollup).
- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](../ingestion/rollup.md#perfect-rollup-vs-best-effort-rollup).
Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
@ -47,7 +47,7 @@ Compaction does not improve performance in all situations. For example, if you r
You can configure the Druid Coordinator to perform automatic compaction, also called auto-compaction, for a datasource. Using its [segment search policy](../design/coordinator.md#segment-search-policy-in-automatic-compaction), the Coordinator periodically identifies segments for compaction starting from newest to oldest. When the Coordinator discovers segments that have not been compacted or segments that were compacted with a different or changed spec, it submits compaction tasks for the time interval covering those segments.
Automatic compaction works in most use cases and should be your first option. To learn more, see [Automatic compaction](../ingestion/automatic-compaction.md).
Automatic compaction works in most use cases and should be your first option. To learn more, see [Automatic compaction](../data-management/automatic-compaction.md).
In cases where you require more control over compaction, you can manually submit compaction tasks. For example:
@ -60,12 +60,12 @@ See [Setting up a manual compaction task](#setting-up-manual-compaction) for mor
During compaction, Druid overwrites the original set of segments with the compacted set. Druid also locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes.
You can set `dropExisting` in `ioConfig` to "true" in the compaction task to configure Druid to replace all existing segments fully contained by the interval. See the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations) for an example.
You can set `dropExisting` in `ioConfig` to "true" in the compaction task to configure Druid to replace all existing segments fully contained by the interval. See the suggestion for reindexing with finer granularity under [Implementation considerations](../ingestion/native-batch.md#implementation-considerations) for an example.
> WARNING: `dropExisting` in `ioConfig` is a beta feature.
If an ingestion task needs to write data to a segment for a time interval locked for compaction, by default the ingestion task supersedes the compaction task and the compaction task fails without finishing. For manual compaction tasks, you can adjust the input spec interval to avoid conflicts between ingestion and compaction. For automatic compaction, you can set the `skipOffsetFromLatest` key to adjust the auto-compaction starting point from the current time to reduce the chance of conflicts between ingestion and compaction.
Another option is to set the compaction task to higher priority than the ingestion task.
For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion).
For more information, see [Avoid conflicts with ingestion](../data-management/automatic-compaction.md#avoid-conflicts-with-ingestion).
### Segment granularity handling
@ -126,21 +126,21 @@ To perform a manual compaction, you submit a compaction task. Compaction tasks m
|`transformSpec`|When set, the compaction task uses the specified `transformSpec` rather than using `null`. See [Compaction transformSpec](#compaction-transform-spec) for details.|No|
|`metricsSpec`|When set, the compaction task uses the specified `metricsSpec` rather than generating one.|No|
|`segmentGranularity`|Deprecated. Use `granularitySpec`.|No|
|`tuningConfig`|[Tuning configuration](native-batch.md#tuningconfig) for parallel indexing. `awaitSegmentAvailabilityTimeoutMillis` value is not supported for compaction tasks. Leave this parameter at the default value, 0.|No|
|`tuningConfig`|[Tuning configuration](../ingestion/native-batch.md#tuningconfig) for parallel indexing. `awaitSegmentAvailabilityTimeoutMillis` value is not supported for compaction tasks. Leave this parameter at the default value, 0.|No|
|`granularitySpec`|When set, the compaction task uses the specified `granularitySpec` rather than generating one. See [Compaction `granularitySpec`](#compaction-granularity-spec) for details.|No|
|`context`|[Task context](./tasks.md#context)|No|
|`context`|[Task context](../ingestion/tasks.md#context)|No|
> Note: Use `granularitySpec` over `segmentGranularity` and only set one of these values. If you specify different values for these in the same compaction spec, the task fails.
To control the number of result segments per time chunk, you can set [`maxRowsPerSegment`](./native-batch.md#partitionsspec) or [`numShards`](../ingestion/native-batch.md#tuningconfig).
To control the number of result segments per time chunk, you can set [`maxRowsPerSegment`](../ingestion/native-batch.md#partitionsspec) or [`numShards`](../ingestion/native-batch.md#tuningconfig).
> You can run multiple compaction tasks in parallel. For example, if you want to compact the data for a year, you are not limited to running a single task for the entire year. You can run 12 compaction tasks with month-long intervals.
A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters. For example, its `inputSource` is always the [DruidInputSource](./native-batch-input-source.md), and `dimensionsSpec` and `metricsSpec` include all dimensions and metrics of the input segments by default.
A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters. For example, its `inputSource` is always the [`druid` input source](../ingestion/native-batch-input-source.md), and `dimensionsSpec` and `metricsSpec` include all dimensions and metrics of the input segments by default.
Compaction tasks exit without doing anything and issue a failure status code in either of the following cases:
- If the interval you specify has no data segments loaded<br>
- If the interval you specify has no data segments loaded<br />
- If the interval you specify is empty.
Note that the metadata between input segments and the resulting compacted segments may differ if the metadata among the input segments differs as well. If all input segments have the same metadata, however, the resulting output segment will have the same metadata as all input segments.
@ -179,7 +179,7 @@ The compaction `ioConfig` requires specifying `inputSpec` as follows:
|-----|-----------|-------|--------|
|`type`|Task type. Set the value to `compact`.|none|Yes|
|`inputSpec`|Specification of the target [intervals](#interval-inputspec) or [segments](#segments-inputspec).|none|Yes|
|`dropExisting`|If `true`, the task replaces all existing segments fully contained by either of the following:<br>- the `interval` in the `interval` type `inputSpec`.<br>- the umbrella interval of the `segments` in the `segment` type `inputSpec`.<br>If compaction fails, Druid does not change any of the existing segments.<br>**WARNING**: `dropExisting` in `ioConfig` is a beta feature. |false|No|
|`dropExisting`|If `true`, the task replaces all existing segments fully contained by either of the following:<br />- the `interval` in the `interval` type `inputSpec`.<br />- the umbrella interval of the `segments` in the `segment` type `inputSpec`.<br />If compaction fails, Druid does not change any of the existing segments.<br />**WARNING**: `dropExisting` in `ioConfig` is a beta feature. |false|No|
Druid supports two `inputSpec` formats:
@ -223,5 +223,5 @@ Druid supports two supported `inputSpec` formats:
See the following topics for more information:
- [Segment optimization](../operations/segment-optimization.md) for guidance to determine if compaction will help in your case.
- [Automatic compaction](../ingestion/automatic-compaction.md) for how to enable and configure automatic compaction.
- [Automatic compaction](automatic-compaction.md) for how to enable and configure automatic compaction.


@ -0,0 +1,103 @@
---
id: delete
title: "Data deletion"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## By time range, manually
Apache Druid stores data [partitioned by time chunk](../design/architecture.md#datasources-and-segments) and supports
deleting data for time chunks by dropping segments. This is a fast, metadata-only operation.
Deletion by time range happens in two steps:
1. Segments to be deleted must first be marked as ["unused"](../design/architecture.md#segment-lifecycle). This can
happen when a segment is dropped by a [drop rule](../operations/rule-configuration.md) or when you manually mark a
segment unused through the Coordinator API or web console. This is a soft delete: the data is not available for
querying, but the segment files remain in deep storage, and the segment records remain in the metadata store.
2. Once a segment is marked "unused", you can use a [`kill` task](#kill-task) to permanently delete the segment file from
deep storage and remove its record from the metadata store. This is a hard delete: the data is unrecoverable unless
you have a backup.
For documentation on disabling segments using the Coordinator API, see the
[Coordinator API reference](../operations/api-reference.md#coordinator-datasources).
A data deletion tutorial is available at [Tutorial: Deleting data](../tutorials/tutorial-delete-data.md).
## By time range, automatically
Druid supports [load and drop rules](../operations/rule-configuration.md), which are used to define intervals of time
where data should be preserved, and intervals where data should be discarded. Data that falls under a drop rule is
marked unused, in the same manner as if you [manually mark that time range unused](#by-time-range-manually). This is a
fast, metadata-only operation.
Data that is dropped in this way is marked unused, but remains in deep storage. To permanently delete it, use a
[`kill` task](#kill-task).
## Specific records
Druid supports deleting specific records using [reindexing](update.md#reindex) with a filter. The filter specifies which
data remains after reindexing, so it must be the inverse of the data you want to delete. Because segments must be
rewritten to delete data in this way, it can be a time-consuming operation.
For example, to delete records where `userName` is `'bob'` with native batch indexing, use a
[`transformSpec`](../ingestion/ingestion-spec.md#transformspec) with filter `{"type": "not", "field": {"type":
"selector", "dimension": "userName", "value": "bob"}}`.
To delete the same records using SQL, use [REPLACE](../multi-stage-query/concepts.md#replace) with `WHERE userName <> 'bob'`.
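For the native batch variant, the filter sits inside the `transformSpec` of the reindexing spec's `dataSchema`. The fragment below is a sketch of only that part, assuming the rest of the spec (input source, dimensions, granularity) is defined elsewhere; `userName` and `bob` are the example values from the text.

```json
{
  "transformSpec": {
    "filter": {
      "type": "not",
      "field": {
        "type": "selector",
        "dimension": "userName",
        "value": "bob"
      }
    }
  }
}
```

Because the filter describes the rows to keep, wrapping the selector in `not` retains every row except those where `userName` is `bob`.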
To reindex using [native batch](../ingestion/native-batch.md), use the [`druid` input
source](../ingestion/native-batch-input-source.md#druid-input-source). If needed,
[`transformSpec`](../ingestion/ingestion-spec.md#transformspec) can be used to filter or modify data during the
reindexing job. To reindex with SQL, use [`REPLACE <table> OVERWRITE`](../multi-stage-query/reference.md#replace)
with `SELECT ... FROM <table>`. (Druid does not have `UPDATE` or `ALTER TABLE` statements.) Any SQL SELECT query can be
used to filter, modify, or enrich the data during the reindexing job.
Data that is deleted in this way is marked unused, but remains in deep storage. To permanently delete it, use a [`kill`
task](#kill-task).
## Entire table
Deleting an entire table works the same way as [deleting part of a table by time range](#by-time-range-manually). First,
mark all segments unused using the Coordinator API or web console. Then, optionally, delete it permanently using a
[`kill` task](#kill-task).
<a name="kill-task"></a>
## Permanently (`kill` task)
Data that has been overwritten or soft-deleted still remains as segments that have been marked unused. You can use a
`kill` task to permanently delete this data.
The available grammar is:
```json
{
"type": "kill",
"id": <task_id>,
"dataSource": <task_datasource>,
"interval" : <all_unused_segments_in_this_interval_will_die!>,
"context": <task context>
}
```
**WARNING:** The `kill` task permanently removes all information about the affected segments from the metadata store and
deep storage. This operation cannot be undone.
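A minimal sketch of filling in the grammar above, assuming a hypothetical datasource named `wikipedia` and a made-up interval. The `build_kill_task` helper is illustrative only, and it treats `id` and `context` as optional, which is an assumption of this sketch.

```python
import json

def build_kill_task(datasource, interval, task_id=None, context=None):
    """Build a kill task spec using only the fields listed in the grammar above."""
    task = {"type": "kill", "dataSource": datasource, "interval": interval}
    # 'id' and 'context' are treated as optional in this sketch (assumption).
    if task_id is not None:
        task["id"] = task_id
    if context is not None:
        task["context"] = context
    return task

# Hypothetical datasource name and ISO 8601 interval.
kill_task = build_kill_task("wikipedia", "2015-09-12/2015-09-13")
print(json.dumps(kill_task, indent=2))
```

Because the operation cannot be undone, it can be worth printing and reviewing the generated spec before submitting it.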

View File

@ -1,7 +1,7 @@
---
id: tutorial-msq-external-data
title: "Loading files with SQL"
sidebar_label: "Loading files with SQL"
id: index
title: "Data management"
sidebar_label: "Overview"
---
<!--
@ -23,22 +23,12 @@ sidebar_label: "Loading files with SQL"
~ under the License.
-->
<!DOCTYPE html>
<!--This redirects to the Multi-Stage Query tutorial. This redirect file exists cause duplicate entries in the left nav aren't allowed-->
<html lang="en-US">
<head>
<meta charset="UTF-8" />
<meta
http-equiv="refresh"
content="0; url=/docs/multi-stage-query/connect-external-data.html"
/>
<script type="text/javascript">
window.location.href = '/docs/multi-stage-query/connect-external-data.html';
</script>
<title>About the Druid documentation</title>
</head>
<body>
If you are not redirected automatically, follow this
<a href="/docs/multi-stage-query/connect-external-data.html">link</a>.
</body>
</html>
Apache Druid stores data [partitioned by time chunk](../design/architecture.md#datasources-and-segments) in immutable
files called [segments](../design/segments.md). Data management operations that involve replacing or deleting
these segments include:
- [Updates](update.md) to existing data.
- [Deletion](delete.md) of existing data.
- [Schema changes](schema-changes.md) for new and existing data.
- [Compaction](compaction.md) and [automatic compaction](automatic-compaction.md), which reindex existing data to
optimize storage footprint and performance.

View File

@ -0,0 +1,39 @@
---
id: schema-changes
title: "Schema changes"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## For new data
Apache Druid allows you to provide a new schema for new data without the need to update the schema of any existing data.
It is sufficient to update your supervisor spec, if using [streaming ingestion](../ingestion/index.md#streaming), or to
provide the new schema the next time you do a [batch ingestion](../ingestion/index.md#batch). This is made possible by
the fact that each [segment](../design/architecture.md#datasources-and-segments), at the time it is created, stores a
copy of its own schema. Druid reconciles all of these individual segment schemas automatically at query time.
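The idea of per-segment schemas can be pictured with a small Python sketch. The segment dictionaries and the `read_column` helper are hypothetical, and the sketch assumes that a column missing from a segment reads as null for that segment's rows. It is a conceptual model, not how Druid is implemented.

```python
# Hypothetical segments, each carrying its own schema (column list) and rows.
old_segment = {
    "columns": ["__time", "page"],
    "rows": [{"__time": "2022-01-01T00:00:00Z", "page": "home"}],
}
new_segment = {
    "columns": ["__time", "page", "country"],
    "rows": [{"__time": "2022-02-01T00:00:00Z", "page": "home", "country": "US"}],
}

def read_column(segments, column):
    """Read one column across segments; segments lacking it yield None (assumption)."""
    values = []
    for segment in segments:
        for row in segment["rows"]:
            values.append(row.get(column) if column in segment["columns"] else None)
    return values

countries = read_column([old_segment, new_segment], "country")
print(countries)
```

In this model the older segment is never rewritten: the new `country` column simply has no value for its rows.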
## For existing data
Schema changes are sometimes necessary for existing data. For example, you may want to change the type of a column in
previously-ingested data, or drop a column entirely. Druid handles this using [reindexing](update.md), the same method
it uses to handle updates of existing data. Reindexing involves rewriting all affected segments and can be a
time-consuming operation.

View File

@ -0,0 +1,76 @@
---
id: update
title: "Data updates"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## Overwrite
Apache Druid stores data [partitioned by time chunk](../design/architecture.md#datasources-and-segments) and supports
overwriting existing data using time ranges. Data outside the replacement time range is not touched. Overwriting of
existing data is done using the same mechanisms as [batch ingestion](../ingestion/index.md#batch).
For example:
- [Native batch](../ingestion/native-batch.md) with `appendToExisting: false`, and `intervals` set to a specific
time range, overwrites data for that time range.
- [SQL `REPLACE <table> OVERWRITE [ALL | WHERE ...]`](../multi-stage-query/reference.md#replace) overwrites data for
the entire table or for a specified time range.
In both cases, Druid's atomic update mechanism ensures that queries will flip seamlessly from the old data to the new
data on a time-chunk-by-time-chunk basis.
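As a sketch of the native batch case, the fragment below shows where `appendToExisting` and `intervals` sit in an `index_parallel` spec. The datasource name and interval are made up, and the fragment deliberately omits the input source, timestamp, and dimension settings that a real spec needs.

```python
import json

# Partial index_parallel spec: only the overwrite-related fields are shown.
overwrite_fragment = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "wikipedia",  # hypothetical datasource
            "granularitySpec": {
                # Only data in this time range is replaced.
                "intervals": ["2015-09-12/2015-09-13"],
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            # False means overwrite rather than append.
            "appendToExisting": False,
        },
    },
}

print(json.dumps(overwrite_fragment, indent=2))
```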
Ingestion and overwriting cannot run concurrently for the same time range of the same datasource. While an overwrite job
is ongoing for a particular time range of a datasource, new ingestions for that time range are queued up. Ingestions for
other time ranges proceed as normal. Read-only queries also proceed as normal, using the pre-existing version of the
data.
Druid does not support single-record updates by primary key.
## Reindex
Reindexing is an [overwrite of existing data](#overwrite) where the source of new data is the existing data itself. It
is used to perform schema changes, repartition data, filter out unwanted data, enrich existing data, and so on. This
behaves just like any other [overwrite](#overwrite) with regard to atomic updates and locking.
With [native batch](../ingestion/native-batch.md), use the [`druid` input
source](../ingestion/native-batch-input-source.md#druid-input-source). If needed,
[`transformSpec`](../ingestion/ingestion-spec.md#transformspec) can be used to filter or modify data during the
reindexing job.
With SQL, use [`REPLACE <table> OVERWRITE`](../multi-stage-query/reference.md#replace) with `SELECT ... FROM
<table>`. (Druid does not have `UPDATE` or `ALTER TABLE` statements.) Any SQL SELECT query can be used to filter,
modify, or enrich the data during the reindexing job.
## Rolled-up datasources
Rolled-up datasources can be effectively updated using appends, without rewrites. When you append a row that has an
identical set of dimensions to an existing row, queries that use aggregation operators automatically combine those two
rows together at query time.
[Compaction](compaction.md) or [automatic compaction](automatic-compaction.md) can be used to physically combine these
matching rows together later on, by rewriting segments in the background.
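The effect of appending to a rolled-up datasource can be modeled with a short Python sketch. The rows, the dimension names, and the `sum_clicks` helper are made up for illustration, and timestamps are left out for brevity.

```python
from collections import defaultdict

# Made-up rolled-up rows: two dimensions and one metric.
existing_rows = [
    {"country": "US", "page": "home", "clicks": 10},
    {"country": "CA", "page": "home", "clicks": 4},
]
appended_rows = [
    # Same dimension values as an existing row, appended later.
    {"country": "US", "page": "home", "clicks": 5},
]

def sum_clicks(rows):
    """Group by dimension values and sum 'clicks', like a query-time SUM aggregation."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["country"], row["page"])] += row["clicks"]
    return dict(totals)

totals = sum_clicks(existing_rows + appended_rows)
print(totals)
```

The stored data holds two separate rows for `("US", "home")`, but an aggregating query sees a single combined value. Compaction would later merge the two rows physically.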
## Lookups
If you have a dimension where values need to be updated frequently, try first using [lookups](../querying/lookups.md). A
classic use case of lookups is when you have an ID dimension stored in a Druid segment, and want to map the ID dimension to a
human-readable string that may need to be updated periodically.
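The following Python sketch models this use case with a plain dictionary. The row data, the `product_names` mapping, and the `apply_lookup` helper are hypothetical. It only illustrates why updating a lookup does not require rewriting stored rows.

```python
# Hypothetical stored rows: only the ID is kept in the segment.
rows = [
    {"product_id": "p1", "sales": 3},
    {"product_id": "p2", "sales": 5},
]

# Hypothetical lookup, maintained separately from the stored rows.
product_names = {"p1": "Widget", "p2": "Gadget"}

def apply_lookup(rows, lookup, missing="unknown"):
    """Return (name, sales) pairs, resolving each ID through the lookup."""
    return [(lookup.get(row["product_id"], missing), row["sales"]) for row in rows]

before = apply_lookup(rows, product_names)

# Renaming a product changes only the lookup; the stored rows are untouched.
product_names["p1"] = "Widget Pro"
after = apply_lookup(rows, product_names)

print(before)
print(after)
```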

View File

@ -44,9 +44,9 @@ Druid has several types of services:
* [**Historical**](../design/historical.md) services store queryable data.
* [**MiddleManager**](../design/middlemanager.md) services ingest data.
You can view services in the **Services** tab in the Druid console:
You can view services in the **Services** tab in the web console:
![Druid services](../assets/services-overview.png "Services in the Druid console")
![Druid services](../assets/services-overview.png "Services in the web console")
## Druid servers

View File

@ -83,7 +83,7 @@ To ensure an even distribution of segments across Historical processes in the cl
### Automatic compaction
The Druid Coordinator manages the [automatic compaction system](../ingestion/automatic-compaction.md).
The Druid Coordinator manages the [automatic compaction system](../data-management/automatic-compaction.md).
Each run, the Coordinator compacts segments by merging small segments or splitting a large one. This is useful when the size of your segments is not optimized, which may degrade query performance.
See [Segment size optimization](../operations/segment-optimization.md) for details.
@ -139,7 +139,7 @@ The search start point can be changed by setting `skipOffsetFromLatest`.
If this is set, this policy will ignore the segments falling into the time chunk of (the end time of the most recent segment - `skipOffsetFromLatest`).
This is to avoid conflicts between compaction tasks and realtime tasks.
Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task.
For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion).
For more information, see [Avoid conflicts with ingestion](../data-management/automatic-compaction.md#avoid-conflicts-with-ingestion).
> This policy currently cannot handle the situation when there are a lot of small segments which have the same interval,
> and their total size exceeds [`inputSegmentSizeBytes`](../configuration/index.md#automatic-compaction-dynamic-configuration).

View File

@ -91,6 +91,4 @@ Separate task logs are not currently supported when using the Indexer; all task
The Indexer currently imposes an identical memory limit on each task. In later releases, the per-task memory limit will be removed and only the global limit will apply. The limit on concurrent merges will also be removed.
The Indexer does not work properly with [`index_realtime`](../ingestion/tasks.md#index_realtime) task types. Therefore, it is not compatible with [Tranquility](../ingestion/tranquility.md). If you are using Tranquility, consider migrating to Druid's builtin [Apache Kafka](../development/extensions-core/kafka-ingestion.md) or [Amazon Kinesis](../development/extensions-core/kinesis-ingestion.md) ingestion options.
In later releases, per-task memory usage will be dynamically managed. Please see https://github.com/apache/druid/issues/7900 for details on future enhancements to the Indexer.

View File

@ -78,7 +78,7 @@ caller. End users typically query Brokers rather than querying Historicals or Mi
Overlords, and Coordinators. They are optional since you can also simply contact the Druid Brokers, Overlords, and
Coordinators directly.
The Router also runs the [Druid console](../operations/druid-console.md), a management UI for datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
The Router also runs the [web console](../operations/web-console.md), a management UI for datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
### Data server

View File

@ -26,7 +26,7 @@ The Apache Druid Router process can be used to route queries to different Broker
For query routing purposes, you should only ever need the Router process if you have a Druid cluster well into the terabyte range.
In addition to query routing, the Router also runs the [Druid console](../operations/druid-console.md), a management UI for datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
In addition to query routing, the Router also runs the [web console](../operations/web-console.md), a management UI for datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
### Configuration

View File

@ -23,7 +23,7 @@ title: "Segments"
-->
Apache Druid stores its data and indexes in *segment files* partitioned by time. Druid creates a segment for each segment interval that contains data. If an interval is empty—that is, containing no rows—no segment exists for that time interval. Druid may create multiple segments for the same interval if you ingest data for that period via different ingestion jobs. [Compaction](../ingestion/compaction.md) is the Druid process that attempts to combine these segments into a single segment per interval for optimal performance.
Apache Druid stores its data and indexes in *segment files* partitioned by time. Druid creates a segment for each segment interval that contains data. If an interval is empty—that is, containing no rows—no segment exists for that time interval. Druid may create multiple segments for the same interval if you ingest data for that period via different ingestion jobs. [Compaction](../data-management/compaction.md) is the Druid process that attempts to combine these segments into a single segment per interval for optimal performance.
The time interval is configurable in the `segmentGranularity` parameter of the [`granularitySpec`](../ingestion/ingestion-spec.md#granularityspec).

View File

@ -27,7 +27,7 @@ To use this Apache Druid extension, [include](../../development/extensions.md#lo
## Introduction
This extension emits Druid metrics to [Apache Kafka](https://kafka.apache.org) directly with JSON format.<br>
This extension emits Druid metrics to [Apache Kafka](https://kafka.apache.org) directly with JSON format.<br />
Kafka offers a rich ecosystem as well as a readily available consumer API.
So, if you already use Kafka, it's easy to integrate various tools or UIs
to monitor the status of your Druid cluster with this extension.

View File

@ -25,7 +25,7 @@ title: "Druid pac4j based Security extension"
Apache Druid Extension to enable [OpenID Connect](https://openid.net/connect/) based Authentication for Druid Processes using [pac4j](https://github.com/pac4j/pac4j) as the underlying client library.
This can be used with any authentication server that supports same e.g. [Okta](https://developer.okta.com/).
This extension should only be used at the router node to enable a group of users in existing authentication server to interact with Druid cluster, using the [Druid console](../../operations/druid-console.md). This extension does not support JDBC client authentication.
This extension should only be used at the router node to enable a group of users in existing authentication server to interact with Druid cluster, using the [web console](../../operations/web-console.md). This extension does not support JDBC client authentication.
## Configuration

View File

@ -254,7 +254,7 @@ development wiki-edit 1636399229823
For more information, see [`kafka` data format](../../ingestion/data-formats.md#kafka).
## Submit a supervisor spec
Druid starts a supervisor for a dataSource when you submit a supervisor spec. You can use the data loader in the Druid console or you can submit a supervisor spec to the following endpoint:
Druid starts a supervisor for a dataSource when you submit a supervisor spec. You can use the data loader in the web console or you can submit a supervisor spec to the following endpoint:
`http://<OVERLORD_IP>:<OVERLORD_PORT>/druid/indexer/v1/supervisor`

View File

@ -151,7 +151,7 @@ export SSL_TRUSTSTORE_PASSWORD=mysecrettruststorepassword
}
}
```
Verify that you've changed the values for all configurations to match your own environment. You can use the environment variable config provider syntax in the **Consumer properties** field on the **Connect tab** in the **Load Data** UI in the Druid console. When connecting to Kafka, Druid replaces the environment variables with their corresponding values.
Verify that you've changed the values for all configurations to match your own environment. You can use the environment variable config provider syntax in the **Consumer properties** field on the **Connect tab** in the **Load Data** UI in the web console. When connecting to Kafka, Druid replaces the environment variables with their corresponding values.
Note: You can provide SSL connections with [Password Provider](../../operations/password-provider.md) interface to define the `keystore`, `truststore`, and `key`, but this feature is deprecated.

View File

@ -144,7 +144,7 @@ T00:00:00.000Z/2015-04-14T02:41:09.484Z/0/index.zip] to [/opt/druid/zk_druid/dde
* DataSegmentKiller
The easiest way of testing the segment killing is marking a segment as not used and then starting a killing task in the [web console](../operations/druid-console.md).
The easiest way of testing the segment killing is marking a segment as not used and then starting a killing task in the [web console](../operations/web-console.md).
To mark a segment as not used, you need to connect to your metadata storage and update the `used` column to `false` on the segment table rows.

View File

@ -592,7 +592,7 @@ Configure your `flattenSpec` as follows:
| Field | Description | Default |
|-------|-------------|---------|
| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).<br /><br />If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
| fields | Specifies the fields of interest and how they are accessed. See [Field flattening specifications](#field-flattening-specifications) for more detail. | `[]` |
For example:
@ -616,7 +616,7 @@ Each entry in the `fields` list can have the following components:
| Field | Description | Default |
|-------|-------------|---------|
| type | Options are as follows:<br><br><ul><li>`root`, referring to a field at the root level of the record. Only really useful if `useFieldDiscovery` is false.</li><li>`path`, referring to a field using [JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data formats that offer nesting, including `avro`, `json`, `orc`, and `parquet`.</li><li>`jq`, referring to a field using [jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported for the `json` format.</li></ul> | none (required) |
| type | Options are as follows:<br /><br /><ul><li>`root`, referring to a field at the root level of the record. Only really useful if `useFieldDiscovery` is false.</li><li>`path`, referring to a field using [JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data formats that offer nesting, including `avro`, `json`, `orc`, and `parquet`.</li><li>`jq`, referring to a field using [jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported for the `json` format.</li></ul> | none (required) |
| name | Name of the field after flattening. This name can be referred to by the [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).| none (required) |
| expr | Expression for accessing the field while flattening. For type `path`, this should be [JsonPath](https://github.com/jayway/JsonPath). For type `jq`, this should be [jackson-jq](https://github.com/eiiches/jackson-jq) notation. For other types, this parameter is ignored. | none (required for types `path` and `jq`) |

View File

@ -1,130 +0,0 @@
---
id: data-management
title: "Data management"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Within the context of this topic data management refers to Apache Druid's data maintenance capabilities for existing datasources. There are several options to help you keep your data relevant and to help your Druid cluster remain performant. For example updating, reingesting, adding lookups, reindexing, or deleting data.
In addition to the tasks covered on this page, you can also use segment compaction to improve the layout of your existing data. Refer to [Segment optimization](../operations/segment-optimization.md) to see if compaction will help in your environment. For an overview and steps to configure manual compaction tasks, see [Compaction](./compaction.md).
## Adding new data to existing datasources
Druid can insert new data to an existing datasource by appending new segments to existing segment sets. It can also add new data by merging an existing set of segments with new data and overwriting the original set.
Druid does not support single-record updates by primary key.
<a name="update"></a>
## Updating existing data
Once you ingest some data in a dataSource for an interval and create Apache Druid segments, you might want to make changes to
the ingested data. There are several ways this can be done.
### Using lookups
If you have a dimension where values need to be updated frequently, try first using [lookups](../querying/lookups.md). A
classic use case of lookups is when you have an ID dimension stored in a Druid segment, and want to map the ID dimension to a
human-readable String value that may need to be updated periodically.
### Reingesting data
If lookup-based techniques are not sufficient, you will need to reingest data into Druid for the time chunks that you
want to update. This can be done using one of the [batch ingestion methods](index.md#batch) in overwrite mode (the
default mode). It can also be done using [streaming ingestion](index.md#streaming), provided you drop data for the
relevant time chunks first.
If you do the reingestion in batch mode, Druid's atomic update mechanism means that queries will flip seamlessly from
the old data to the new data.
We recommend keeping a copy of your raw data around in case you ever need to reingest it.
### With Hadoop-based ingestion
This section assumes you understand how to do batch ingestion using Hadoop. See
[Hadoop batch ingestion](./hadoop.md) for more information. Hadoop batch-ingestion can be used for reindexing and delta ingestion.
Druid uses an `inputSpec` in the `ioConfig` to know where the data to be ingested is located and how to read it.
For simple Hadoop batch ingestion, `static` or `granularity` spec types allow you to read data stored in deep storage.
There are other types of `inputSpec` to enable reindexing and delta ingestion.
### Reindexing with Native Batch Ingestion
This section assumes you understand how to do batch ingestion without Hadoop using [native batch indexing](../ingestion/native-batch.md). Native batch indexing uses an `inputSource` to know where and how to read the input data. You can use the [`DruidInputSource`](./native-batch-input-source.md) to read data from segments inside Druid. You can use Parallel task (`index_parallel`) for all native batch reindexing tasks. Increase the `maxNumConcurrentSubTasks` to accommodate the amount of data your are reindexing. See [Capacity planning](native-batch.md#capacity-planning).
<a name="delete"></a>
## Deleting data
Druid supports permanent deletion of segments that are in an "unused" state (see the
[Segment lifecycle](../design/architecture.md#segment-lifecycle) section of the Architecture page).
The Kill Task deletes unused segments within a specified interval from metadata storage and deep storage.
For more information, please see [Kill Task](../ingestion/tasks.md#kill).
Permanent deletion of a segment in Apache Druid has two steps:
1. The segment must first be marked as "unused". This occurs when a segment is dropped by retention rules, and when a user manually disables a segment through the Coordinator API.
2. After segments have been marked as "unused", a Kill Task will delete any "unused" segments from Druid's metadata store as well as deep storage.
For documentation on retention rules, please see [Data Retention](../operations/rule-configuration.md).
For documentation on disabling segments using the Coordinator API, please see the
[Coordinator Datasources API](../operations/api-reference.md#coordinator-datasources) reference.
A data deletion tutorial is available at [Tutorial: Deleting data](../tutorials/tutorial-delete-data.md)
## Kill Task
The kill task deletes all information about segments and removes them from deep storage. Segments to kill must be unused (used==0) in the Druid segment table.
The available grammar is:
```json
{
"type": "kill",
"id": <task_id>,
"dataSource": <task_datasource>,
"interval" : <all_segments_in_this_interval_will_die!>,
"markAsUnused": <true|false>,
"context": <task context>
}
```
If `markAsUnused` is true (default is false), the kill task will first mark any segments within the specified interval as unused, before deleting the unused segments within the interval.
**WARNING!** The kill task permanently removes all information about the affected segments from the metadata store and deep storage. These segments cannot be recovered after the kill task runs, this operation cannot be undone.
## Retention
Druid supports retention rules, which are used to define intervals of time where data should be preserved, and intervals where data should be discarded.
Druid also supports separating Historical processes into tiers, and the retention rules can be configured to assign data for specific intervals to specific tiers.
These features are useful for performance/cost management; a common use case is separating Historical processes into a "hot" tier and a "cold" tier.
For more information, please see [Load rules](../operations/rule-configuration.md).
## Learn more
See the following topics for more information:
- [Compaction](./compaction.md) for an overview and steps to configure manual compaction tasks.
- [Segments](../design/segments.md) for information on how Druid handles segment versioning.

View File

@ -29,7 +29,7 @@ Druid stores data in datasources, which are similar to tables in a traditional r
## Primary timestamp
Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
for time-based [data management operations](../data-management/index.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. Regardless of the source field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.

View File

@ -27,14 +27,10 @@ sidebar_label: "Troubleshooting FAQ"
If you are trying to batch load historical data but no events are being loaded, make sure the interval of your ingestion spec actually encapsulates the interval of your data. Events outside this interval are dropped.
## Druid ingested my events but I they are not in my query results
## Druid ingested my events but they are not in my query results
If the number of ingested events seems correct, make sure your query is correctly formed. If you included a `count` aggregator in your ingestion spec, you will need to query for the results of this aggregate with a `longSum` aggregator. Issuing a query with a count aggregator will count the number of Druid rows, which includes [roll-up](../design/index.md).
## What types of data does Druid support?
Druid can ingest JSON, CSV, TSV and other delimited data out of the box. Druid supports single dimension values, or multiple dimension values (an array of strings). Druid supports long, float, and double numeric columns.
## Where do my Druid segments end up after ingestion?
Depending on what `druid.storage.type` is set to, Druid will upload segments to some [Deep Storage](../dependencies/deep-storage.md). Local disk is used as the default deep storage.
@ -73,7 +69,7 @@ Note that this workflow only guarantees that the segments are available at the t
## I don't see my Druid segments on my Historical processes
You can check the [web console](../operations/druid-console.md) to make sure that your segments have actually loaded on [Historical processes](../design/historical.md). If your segments are not present, check the Coordinator logs for messages about capacity of replication errors. One reason that segments are not downloaded is because Historical processes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example):
You can check the [web console](../operations/web-console.md) to make sure that your segments have actually loaded on [Historical processes](../design/historical.md). If your segments are not present, check the Coordinator logs for messages about capacity of replication errors. One reason that segments are not downloaded is because Historical processes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example):
```
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]
@ -83,26 +79,6 @@ You can check the [web console](../operations/druid-console.md) to make sure tha
You can use a [segment metadata query](../querying/segmentmetadataquery.md) for the dimensions and metrics that have been created for your datasource. Make sure that the names of the aggregators you use in your query match these metrics. Also make sure that the query interval you specify matches a valid time range where data exists.
## How can I Reindex existing data in Druid with schema changes?
You can use DruidInputSource with the [Parallel task](../ingestion/native-batch.md) to ingest existing Druid segments using a new schema and change the name, dimensions, metrics, rollup, etc. of the segment.
See [DruidInputSource](./native-batch-input-source.md) for more details.
Or, if you use Hadoop-based ingestion, then you can use the "dataSource" input spec to do reindexing.
See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details.
## How can I change the query granularity of existing data in Druid?
In a lot of situations you may want coarser granularity for older data. For example, any data older than 1 month has only hour-level granularity, but newer data has minute-level granularity. This use case is the same as re-indexing.
To do this, use the [DruidInputSource](./native-batch-input-source.md) and run a [Parallel task](../ingestion/native-batch.md). The DruidInputSource allows you to take in existing segments from Druid, aggregate them, and feed them back into Druid. It also allows you to filter the data in those segments while feeding it back in. This means that if there are rows you want to delete, you can filter them out during re-ingestion.
Typically, the above runs as a batch job that, for example, feeds in a chunk of data every day and aggregates it.
Or, if you use Hadoop-based ingestion, then you can use the "dataSource" input spec to do reindexing.
See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details.
You can also change the query granularity using compaction. See [Query granularity handling](../ingestion/compaction.md#query-granularity-handling).
## Real-time ingestion seems to be stuck
There are a few ways this can occur. Druid will throttle ingestion to prevent out of memory problems if the intermediate persists are taking too long or if hand-off is taking too long. If your process logs indicate certain columns are taking a very long time to build (for example, if your segment granularity is hourly, but creating a single column takes 30 minutes), you should re-evaluate your configuration or scale up your real-time ingestion.


@ -22,10 +22,13 @@ title: "Ingestion"
~ under the License.
-->
Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from
your source system and stores it in data files called [_segments_](../design/architecture.md#datasources-and-segments).
In general, segment files contain a few million rows each.
For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs.
For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the
[Indexer](../design/indexer.md) processes load your source data. The sole exception is Hadoop-based ingestion, which
uses a Hadoop MapReduce job on YARN.
During ingestion Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md). Historical nodes load the segments into memory to respond to queries. For streaming ingestion, the Middle Managers and indexers can respond to queries in real-time with arriving data. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
@ -46,41 +49,32 @@ page.
### Streaming
The most recommended, and most popular, method of streaming ingestion is the
[Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. Alternatively, the Kinesis
indexing service works with Amazon Kinesis Data Streams.
Streaming ingestion uses an ongoing process called a supervisor that reads from the data stream to ingest data into Druid.
This table compares the options:
There are two available options for streaming ingestion. Streaming ingestion is controlled by a continuously-running
supervisor.
| **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | [Kinesis](../development/extensions-core/kinesis-ingestion.md) |
|---|-----|--------------|
| **Supervisor type** | `kafka` | `kinesis`|
| **How it works** | Druid reads directly from Apache Kafka. | Druid reads directly from Amazon Kinesis.|
| **Can ingest late data?** | Yes | Yes |
| **Exactly-once guarantees?** | Yes | Yes |
| **Can ingest late data?** | Yes. | Yes. |
| **Exactly-once guarantees?** | Yes. | Yes. |
### Batch
When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index_parallel` (native batch; parallel), `index_hadoop` (Hadoop-based),
or `index` (native batch; single-task).
There are three available options for batch ingestion. Batch ingestion jobs are associated with a controller task that
runs for the duration of the job.
In general, we recommend native batch whenever it meets your needs, since the setup is simpler (it does not depend on
an external Hadoop cluster). However, there are still scenarios where Hadoop-based batch ingestion might be a better choice,
for example when you already have a running Hadoop cluster and want to
use the cluster resource of the existing cluster for batch ingestion.
This table compares the three available options:
| **Method** | [Native batch (parallel)](./native-batch.md) | [Hadoop-based](hadoop.md) | [Native batch (simple)](./native-batch-simple-task.md) |
| **Method** | [Native batch](./native-batch.md) | [SQL](../multi-stage-query/index.md) | [Hadoop-based](hadoop.md) |
|---|-----|--------------|------------|
| **Task type** | `index_parallel` | `index_hadoop` | `index` |
| **Parallel?** | Yes, if `inputFormat` is splittable and `maxNumConcurrentSubTasks` > 1 in `tuningConfig`. See [data format documentation](./data-formats.md) for details. | Yes, always. | No. Each task is single-threaded. |
| **Can append or overwrite?** | Yes, both. | Overwrite only. | Yes, both. |
| **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
| **Input locations** | Any [`inputSource`](./native-batch-input-source.md). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch-input-source.md). |
| **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
| **[Rollup modes](./rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [partitionsSpec](./native-batch.md#partitionsspec) for details.| Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [partitionsSpec](./native-batch.md#partitionsspec) for details. |
| **Controller task type** | `index_parallel` | `query_controller` | `index_hadoop` |
| **How you submit it** | Send an `index_parallel` spec to the [task API](../operations/api-reference.md#task-submit). | Send an [INSERT](../multi-stage-query/concepts.md#insert) or [REPLACE](../multi-stage-query/concepts.md#replace) statement to the [SQL task API](../multi-stage-query/api.md#submit-a-query). | Send an `index_hadoop` spec to the [task API](../operations/api-reference.md#task-submit). |
| **Parallelism** | Using subtasks, if [`maxNumConcurrentSubTasks`](native-batch.md#tuningconfig) is greater than 1. | Using `query_worker` subtasks. | Using YARN. |
| **Fault tolerance** | Workers automatically relaunched upon failure. Controller task failure leads to job failure. | Controller or worker task failure leads to job failure. | YARN containers automatically relaunched upon failure. Controller task failure leads to job failure. |
| **Can append?** | Yes. | Yes (INSERT). | No. |
| **Can overwrite?** | Yes. | Yes (REPLACE). | Yes. |
| **External dependencies** | None. | None. | Hadoop cluster. |
| **Input sources** | Any [`inputSource`](./native-batch-input-source.md). | Any [`inputSource`](./native-batch-input-source.md) (using [EXTERN](../multi-stage-query/concepts.md#extern)) or Druid datasource (using FROM). | Any Hadoop FileSystem or Druid datasource. |
| **Input formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. |
| **Secondary partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [partitionsSpec](./native-batch.md#partitionsspec) for details.| Range partitioning ([CLUSTERED BY](../multi-stage-query/concepts.md#clustering)). | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). |
| **[Rollup modes](./rollup.md#perfect-rollup-vs-best-effort-rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Always perfect. | Always perfect. |


@ -95,7 +95,7 @@ The specific options supported by these sections will depend on the [ingestion m
For more examples, refer to the documentation for each ingestion method.
You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
available in Druid's [web console](../operations/web-console.md). Druid's visual data loader supports
[Kafka](../development/extensions-core/kafka-ingestion.md),
[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
[native batch](native-batch.md) mode.
@ -175,7 +175,7 @@ A `timestampSpec` can have the following components:
|Field|Description|Default|
|-----|-----------|-------|
|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
|column|Input row field to read the primary timestamp from.<br /><br />Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
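
To make the fixed `format` options in the table above concrete, the following minimal Python sketch converts a raw value to epoch milliseconds for `iso`, `posix`, `millis`, `micro`, and `nano`. The function name `to_epoch_millis` is hypothetical and is not part of Druid. The sketch assumes that ISO strings without a zone offset are UTC, and it does not cover `auto` or Joda format strings.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value, fmt):
    """Illustrative only: interpret a raw timestamp according to `format`."""
    if fmt == "iso":
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        # Dividing timedeltas keeps the arithmetic in integers.
        return (parsed - EPOCH) // timedelta(milliseconds=1)
    if fmt == "posix":
        return int(value) * 1000          # seconds since epoch
    if fmt == "millis":
        return int(value)                 # already milliseconds
    if fmt == "micro":
        return int(value) // 1000         # microseconds since epoch
    if fmt == "nano":
        return int(value) // 1_000_000    # nanoseconds since epoch
    raise ValueError(f"unsupported format in this sketch: {fmt}")


print(to_epoch_millis("2000-01-01T01:02:03.456", "iso"))
print(to_epoch_millis(946688523, "posix"))
```

Whatever the input format, the result corresponds to the single `__time` value that is stored for the row.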
@ -209,8 +209,8 @@ A `dimensionsSpec` can have the following components:
| Field | Description | Default |
|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| `dimensions` | A list of [dimension names or objects](#dimension-objects). You cannot include the same column in both `dimensions` and `dimensionExclusions`.<br><br>If `dimensions` and `spatialDimensions` are both null or empty arrays, Druid treats all columns other than timestamp or metrics that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) for details.<br><br>As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, it would also be good to consider [`partitioning`](partitioning.md) by those same dimensions. | `[]` |
| `dimensionExclusions` | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
| `dimensions` | A list of [dimension names or objects](#dimension-objects). You cannot include the same column in both `dimensions` and `dimensionExclusions`.<br /><br />If `dimensions` and `spatialDimensions` are both null or empty arrays, Druid treats all columns other than timestamp or metrics that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) for details.<br /><br />As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, it would also be good to consider [`partitioning`](partitioning.md) by those same dimensions. | `[]` |
| `dimensionExclusions` | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br /><br />This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
| `spatialDimensions` | An array of [spatial dimensions](../development/geo.md). | `[]` |
| `includeAllDimensions` | You can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions will appear first in order that you specify them and the dimensions dynamically discovered will come after. This flag can be useful especially with auto schema discovery using [`flattenSpec`](./data-formats.html#flattenspec). If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, all discovered dimensions will be ingested. | false |
@ -224,7 +224,7 @@ Dimension objects can have the following components:
| Field | Description | Default |
|-------|-------------|---------|
| type | Either `string`, `long`, `float`, `double`, or `json`. | `string` |
| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br /><br />Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
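
The three `multiValueHandling` modes described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Druid's implementation. The helper name `apply_multi_value_handling` is hypothetical, and plain lexicographic string ordering is assumed.

```python
def apply_multi_value_handling(values, mode="sorted_array"):
    """Illustrative only: show what each mode does to one multi-value field."""
    if mode == "sorted_array":
        return sorted(values)           # order values, keep duplicates
    if mode == "sorted_set":
        return sorted(set(values))      # order values, drop duplicates
    if mode == "array":
        return list(values)             # keep values exactly as ingested
    raise ValueError(f"unknown multiValueHandling: {mode}")


tags = ["b", "a", "b"]
for mode in ("sorted_array", "sorted_set", "array"):
    print(mode, apply_multi_value_handling(tags, mode))
```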
@ -300,9 +300,9 @@ A `granularitySpec` can have the following components:
|-------|-------------|---------|
| type |`uniform`| `uniform` |
| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.| `day` |
| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br /><br />Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
| rollup | Whether to use ingestion-time [rollup](./rollup.md) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Rows will be rolled up if they have exactly the same timestamp and dimension values. | `true` |
| intervals | A list of intervals defining time chunks for segments. Specify interval values using ISO8601 format. For example, `["2021-12-06T21:27:10+00:00/2021-12-07T00:00:00+00:00"]`. If you omit the time, the time defaults to "00:00:00".<br><br>Druid breaks the list up and rounds off the list values based on the `segmentGranularity`.<br><br>If `null` or not provided, batch ingestion tasks generally determine which time chunks to output based on the timestamps found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
| intervals | A list of intervals defining time chunks for segments. Specify interval values using ISO8601 format. For example, `["2021-12-06T21:27:10+00:00/2021-12-07T00:00:00+00:00"]`. If you omit the time, the time defaults to "00:00:00".<br /><br />Druid breaks the list up and rounds off the list values based on the `segmentGranularity`.<br /><br />If `null` or not provided, batch ingestion tasks generally determine which time chunks to output based on the timestamps found in the input data.<br /><br />If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks throw away any records with timestamps outside of the specified intervals.<br /><br />Ignored for any form of streaming ingestion. | `null` |
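
The interaction between `queryGranularity` and `rollup` can be illustrated with a small Python sketch. It is a simplified model and not Druid code. The function name `roll_up` is hypothetical, a single sum metric is assumed, and granularity is expressed as a fixed number of milliseconds measured from the epoch, so calendar-based granularities and time zones are out of scope.

```python
from collections import defaultdict


def roll_up(rows, granularity_ms=None):
    """Illustrative only.

    rows: iterable of (timestamp_ms, dimension_value, metric).
    granularity_ms: bucket size in milliseconds, or None for `none`.
    """
    totals = defaultdict(int)
    for timestamp_ms, dimension, metric in rows:
        if granularity_ms:
            # Truncate the timestamp to the start of its bucket.
            timestamp_ms -= timestamp_ms % granularity_ms
        # Rows with the same (timestamp, dimension) are combined.
        totals[(timestamp_ms, dimension)] += metric
    return sorted((ts, dim, total) for (ts, dim), total in totals.items())


rows = [(60_000, "a", 1), (61_000, "a", 2), (61_000, "a", 3), (120_000, "a", 4)]
print(roll_up(rows, granularity_ms=60_000))  # minute granularity
print(roll_up(rows))                         # `none`: timestamps stored as-is
```

With minute granularity, the first three rows collapse into one. With `none`, only the two rows that share an identical timestamp are combined, which matches the note that rollup still applies when `queryGranularity` is `none`.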
### `transformSpec`


@ -1,7 +1,7 @@
---
id: native-batch-firehose
title: "Native batch ingestion with firehose"
sidebar_label: "Firehose"
sidebar_label: "Firehose (deprecated)"
---
<!--


@ -1,7 +1,7 @@
---
id: native-batch-input-sources
title: "Native batch input sources"
sidebar_label: "Input sources"
sidebar_label: "Native batch: input sources"
---
<!--
@ -462,9 +462,9 @@ in `druid.ingestion.hdfs.allowedProtocols`. See [HDFS input source security conf
The HTTP input source supports reading files directly from remote sites via HTTP.
> **NOTE:** Ingestion tasks run under the operating system account that runs the Druid processes, for example the Indexer, Middle Manager, and Peon. This means any user who can submit an ingestion task can specify an `HTTPInputSource` at any location where the Druid process has permissions. For example, using `HTTPInputSource`, a console user has access to internal network locations where they would otherwise be denied access.
> **WARNING:** `HTTPInputSource` is not limited to the HTTP or HTTPS protocols. It uses the Java `URI` class that supports HTTP, HTTPS, FTP, file, and jar protocols by default. This means you should never run Druid under the `root` account, because a user can use the file protocol to access any files on the local disk.
> **Security notes:** Ingestion tasks run under the operating system account that runs the Druid processes, for example the Indexer, Middle Manager, and Peon. This means any user who can submit an ingestion task can specify an input source referring to any location that the Druid process can access. For example, using the `http` input source, users may have access to internal network servers.
>
> The `http` input source is not limited to the HTTP or HTTPS protocols. It uses the Java URI class that supports HTTP, HTTPS, FTP, file, and jar protocols by default.
For more information about security best practices, see [Security overview](../operations/security-overview.md#best-practices).
@ -632,9 +632,9 @@ want the output timestamp to be equivalent to the input timestamp. In this case,
and the format to `auto` or `millis`.
It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.
Alternatively, if your goals can be satisfied by [compaction](compaction.md), consider that instead as a simpler
previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a good
idea to test out your reindexing by writing to a separate datasource before overwriting your main one. Alternatively, if
your goals can be satisfied by [compaction](../data-management/compaction.md), consider that instead as a simpler
approach.
An example task spec is shown below. It reads from a hypothetical raw datasource `wikipedia_raw` and creates a new


@ -1,7 +1,7 @@
---
id: native-batch-simple-task
title: "Native batch simple task indexing"
sidebar_label: "Simple task indexing"
sidebar_label: "Native batch (simple)"
---
<!--
@ -23,7 +23,10 @@ sidebar_label: "Simple task indexing"
~ under the License.
-->
The simple task (type `index`) is designed to ingest small data sets into Apache Druid. The task executes within the indexing service. For general information on native batch indexing and parallel task indexing, see [Native batch ingestion](./native-batch.md).
> This page describes native batch ingestion using [ingestion specs](ingestion-spec.md). Refer to the [ingestion
> methods](../ingestion/index.md#batch) table to determine which ingestion method is right for you.
The simple task ([task type](tasks.md) `index`) executes single-threaded as a single task within the indexing service. For parallel, scalable options consider using [`index_parallel` tasks](./native-batch.md) or [SQL-based batch ingestion](../multi-stage-query/index.md).
## Simple task example
@ -143,7 +146,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|forceGuaranteedRollup|Forces [perfect rollup](rollup.md). Perfect rollup optimizes the total size of generated segments and querying time, but increases indexing time. If this is set to true, the index task reads the entire input data twice: once to find the optimal number of partitions per time chunk and once to generate segments. Note that the resulting segments are hash-partitioned. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the __Segment pushing modes__ section below.|false|no|
|reportParseExceptions|DEPRECATED. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1.|false|no|
|pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no|
|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no|
|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](native-batch.md#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no|
|logParseExceptions|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|false|no|
|maxParseExceptions|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|unlimited|no|
|maxSavedParseExceptions|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](tasks.md#task-reports). Overridden if `reportParseExceptions` is set.|0|no|
@ -170,12 +173,6 @@ For best-effort rollup, you should use `dynamic`.
|maxRowsPerSegment|Used in sharding. Determines how many rows are in each segment.|5000000|no|
|maxTotalRows|Total number of rows in segments waiting to be pushed.|20000000|no|
### `segmentWriteOutMediumFactory`
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../configuration/index.md#segmentwriteoutmediumfactory) for explanation and available options.|yes|
## Segment pushing modes
While ingesting data using simple task indexing, Druid creates segments from the input data and pushes them. For segment pushing,


@ -23,6 +23,8 @@ sidebar_label: "Native batch"
~ under the License.
-->
> This page describes native batch ingestion using [ingestion specs](ingestion-spec.md). Refer to the [ingestion
> methods](../ingestion/index.md#batch) table to determine which ingestion method is right for you.
Apache Druid supports the following types of native batch indexing tasks:
- Parallel task indexing (`index_parallel`) that can run multiple indexing tasks concurrently. Parallel task works well for production ingestion tasks.
@ -31,15 +33,15 @@ Apache Druid supports the following types of native batch indexing tasks:
This topic covers the configuration for `index_parallel` ingestion specs.
For related information on batch indexing, see:
- [Simple task indexing](./native-batch-simple-task.md) for `index` task configuration.
- [Native batch input sources](./native-batch-input-source.md) for a reference for `inputSource` configuration.
- [Hadoop-based vs. native batch comparison table](./index.md#batch) for a comparison of batch ingestion methods.
- [Loading a file](../tutorials/tutorial-batch.md) for a tutorial on native batch ingestion.
- [Batch ingestion method comparison table](./index.md#batch) for a comparison of batch ingestion methods.
- [Tutorial: Loading a file](../tutorials/tutorial-batch.md) for a tutorial on native batch ingestion.
- [Input sources](./native-batch-input-source.md) for possible input sources.
- [Input formats](./data-formats.md#input-format) for possible input formats.
## Submit an indexing task
To run either kind of native batch indexing task you can:
- Use the **Load Data** UI in the Druid console to define and submit an ingestion spec.
- Use the **Load Data** UI in the web console to define and submit an ingestion spec.
- Define an ingestion spec in JSON based upon the [examples](#parallel-indexing-example) and reference topics for batch indexing. Then POST the ingestion spec to the [Indexer API endpoint](../operations/api-reference.md#tasks),
`/druid/indexer/v1/task`, on the Overlord service. Alternatively, you can use the indexing script included with Druid at `bin/post-index-task`.
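
As a rough illustration of the POST option, the following Python sketch builds the HTTP request for `/druid/indexer/v1/task` using only the standard library. The helper name `build_task_request`, the Overlord address `http://localhost:8090`, the datasource name, and the input directory are all assumptions made for this example. The spec is a bare skeleton, not a complete, tuned ingestion spec.

```python
import json
import urllib.request


def build_task_request(overlord_url, ingestion_spec):
    """Illustrative only: wrap an ingestion spec in a POST request."""
    return urllib.request.Request(
        url=overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=json.dumps(ingestion_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Skeleton index_parallel spec; every value below is a placeholder.
spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "example_datasource",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            "dimensionsSpec": {"dimensions": []},
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "local",
                "baseDir": "/tmp/example",
                "filter": "*.json",
            },
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {
            "type": "index_parallel",
            "maxNumConcurrentSubTasks": 2,
        },
    },
}

request = build_task_request("http://localhost:8090", spec)
print(request.get_method(), request.full_url)
# To submit against a running Overlord:
#     urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before anything reaches the cluster.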
@ -86,7 +88,7 @@ You can set `dropExisting` flag in the `ioConfig` to true if you want the ingest
The following examples demonstrate when to set the `dropExisting` property to true in the `ioConfig`:
Consider an existing segment with an interval of 2020-01-01 to 2021-01-01 and `YEAR` `segmentGranularity`. You want to overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data using the finer segmentGranularity of `MONTH`. If the replacement data does not have a record within every month from 2020-01-01 to 2021-01-01, Druid cannot drop the original `YEAR` segment even if it does include all the replacement data. Set `dropExisting` to true in this case to replace the original segment at `YEAR` `segmentGranularity` since you no longer need it.<br><br>
Consider an existing segment with an interval of 2020-01-01 to 2021-01-01 and `YEAR` `segmentGranularity`. You want to overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data using the finer segmentGranularity of `MONTH`. If the replacement data does not have a record within every month from 2020-01-01 to 2021-01-01, Druid cannot drop the original `YEAR` segment even if it does include all the replacement data. Set `dropExisting` to true in this case to replace the original segment at `YEAR` `segmentGranularity` since you no longer need it.<br /><br />
Imagine you want to re-ingest or overwrite a datasource and the new data does not contain some time intervals that exist in the datasource. For example, a datasource contains the following data at `MONTH` segmentGranularity:
- **January**: 1 record
- **February**: 10 records
@ -235,7 +237,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|forceGuaranteedRollup|Forces [perfect rollup](rollup.md). The perfect rollup optimizes the total size of generated segments and querying time but increases indexing time. If true, specify `intervals` in the `granularitySpec` and use either `hashed` or `single_dim` for the `partitionsSpec`. You cannot use this flag in conjunction with `appendToExisting` of IOConfig. For more details, see [Segment pushing modes](#segment-pushing-modes).|false|no|
|reportParseExceptions|If true, Druid throws exceptions encountered during parsing and halts ingestion. If false, Druid skips unparseable rows and fields.|false|no|
|pushTimeout|Milliseconds to wait to push segments. Must be >= 0, where 0 means to wait forever.|0|no|
|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](./native-batch-simple-task.md#segmentwriteoutmediumfactory).|If not specified, uses the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` |no|
|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|If not specified, uses the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` |no|
|maxNumConcurrentSubTasks|Maximum number of worker tasks that can be run in parallel at the same time. The supervisor task spawns worker tasks up to `maxNumConcurrentSubTasks` regardless of the current available task slots. If this value is 1, the supervisor task processes data ingestion on its own instead of spawning worker tasks. If this value is too large, the supervisor may create too many worker tasks that block other ingestion tasks. See [Capacity planning](#capacity-planning) for more details.|1|no|
|maxRetry|Maximum number of retries on task failures.|3|no|
|maxNumSegmentsToMerge|Max limit for the number of segments that a single task can merge at the same time in the second phase. Used only when `forceGuaranteedRollup` is true.|100|no|
@ -711,11 +713,15 @@ For details on available input sources see:
- [Azure input source](./native-batch-input-source.md#azure-input-source) (`azure`) reads data from Azure Blob Storage and Azure Data Lake.
- [HDFS input source](./native-batch-input-source.md#hdfs-input-source) (`hdfs`) reads data from HDFS storage.
- [HTTP input Source](./native-batch-input-source.md#http-input-source) (`http`) reads data from HTTP servers.
- [Inline input Source](./native-batch-input-source.md#inline-input-source) reads data you paste into the Druid console.
- [Inline input Source](./native-batch-input-source.md#inline-input-source) reads data you paste into the web console.
- [Local input Source](./native-batch-input-source.md#local-input-source) (`local`) reads data from local storage.
- [Druid input Source](./native-batch-input-source.md#druid-input-source) (`druid`) reads data from a Druid datasource.
- [SQL input Source](./native-batch-input-source.md#sql-input-source) (`sql`) reads data from a RDBMS source.
For information on how to combine input sources, see [Combining input source](./native-batch-input-source.md#combining-input-source).
### `segmentWriteOutMediumFactory`
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../configuration/index.md#segmentwriteoutmediumfactory) for explanation and available options.|yes|


@ -34,6 +34,16 @@ This topic describes how to set up partitions within a single datasource. It doe
Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
Partitioning by time is important for three reasons:
1. Queries that filter by `__time` (SQL) or `intervals` (native) are able to use time partitioning to prune the set of segments to consider.
2. Certain data management operations, such as overwriting and compacting existing data, acquire exclusive write locks on time partitions.
3. Each segment file is wholly contained within a time partition. Too-fine-grained partitioning may cause a large number
of small segments, which leads to poor performance.
The most common choices to balance these considerations are `hour` and `day`. For streaming ingestion, `hour` is especially
common, because it allows compaction to follow ingestion with less of a time delay.
## Secondary partitioning
Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
@ -45,25 +55,26 @@ dimension that you often use as a filter when possible. Such partitioning often
Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
## How to configure partitioning
Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain time threshold while you continuously add new data from a stream.
Kafka, you can use [reindexing](../data-management/update.md#reindex) or [compaction](../data-management/compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain time threshold while you continuously add new data from a stream.
The following table shows how each ingestion method handles partitioning:
|Method|How it works|
|------|------------|
|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
|[SQL](../multi-stage-query/index.md)|Configured using [`PARTITIONED BY`](../multi-stage-query/concepts.md#partitioning) and [`CLUSTERED BY`](../multi-stage-query/concepts.md#clustering).|
|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how Druid partitions the datasource. You can also [reindex](../data-management/update.md#reindex) or [compact](../data-management/compaction.md) to repartition after initial ingestion.|
|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how Druid partitions the datasource. You can also [reindex](../data-management/update.md#reindex) or [compact](../data-management/compaction.md) to repartition after initial ingestion.|
## Learn more
See the following topics for more information:
* [`partitionsSpec`](native-batch.md#partitionsspec) for more detail on partitioning with Native Batch ingestion.
* [Reindexing](data-management.md#reingesting-data) and [Compaction](compaction.md) for information on how to repartition existing data in Druid.
* [Reindexing](../data-management/update.md#reindex) and [Compaction](../data-management/compaction.md) for information on how to repartition existing data in Druid.


@ -60,7 +60,7 @@ Tips for maximizing rollup:
When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
- Switch to a guaranteed perfect rollup option.
- [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
- [Reindex](../data-management/update.md#reindex) or [compact](../data-management/compaction.md) your data in the background after initial ingestion.
## Perfect rollup vs best-effort rollup
@ -80,6 +80,7 @@ The following table shows how each method handles rollup:
|Method|How it works|
|------|------------|
|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
|[SQL-based batch](../multi-stage-query/index.md)|Always perfect.|
|[Hadoop](hadoop.md)|Always perfect.|
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|


@ -45,11 +45,11 @@ the Overlord APIs.
A report containing information about the number of rows ingested and any parse exceptions that occurred is available for both completed tasks and running tasks.
The reporting feature is supported by the [simple native batch task](../ingestion/native-batch-simple-task.md), the Hadoop batch task, and Kafka and Kinesis ingestion tasks.
The reporting feature is supported by [native batch tasks](../ingestion/native-batch.md), the Hadoop batch task, and Kafka and Kinesis ingestion tasks.
### Completion report
After a task completes, a completion report can be retrieved at:
After a task completes, if it supports reports, its report can be retrieved at:
```
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
@ -104,12 +104,6 @@ When a task is running, a live report containing ingestion state, unparseable ev
http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports
```
and
```
http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/liveReports
```
An example output is shown below:
```json
@ -184,7 +178,7 @@ The `errorMsg` field shows a message describing the error that caused a task to
### Row stats
The non-parallel [simple native batch task](./native-batch-simple-task.md), the Hadoop batch task, and Kafka and Kinesis ingestion tasks support retrieval of row stats while the task is running.
The [native batch task](./native-batch.md), the Hadoop batch task, and Kafka and Kinesis ingestion tasks support retrieval of row stats while the task is running.
The live report can be accessed with a GET to the following URL on a Peon running a task:
@ -356,7 +350,7 @@ You can override the task priority by setting your priority in the task context
The task context is used for various individual task configuration.
Specify task context configurations in the `context` field of the ingestion spec.
When configuring [automatic compaction](../ingestion/automatic-compaction.md), set the task context configurations in `taskContext` rather than in `context`.
When configuring [automatic compaction](../data-management/automatic-compaction.md), set the task context configurations in `taskContext` rather than in `context`.
The settings get passed into the `context` field of the compaction tasks issued to MiddleManagers.
The following parameters apply to all task types.
@ -398,18 +392,10 @@ You can configure retention periods for logs in milliseconds by setting `druid.i
## All task types
### `index`
See [Native batch ingestion (simple task)](./native-batch-simple-task.md).
### `index_parallel`
See [Native batch ingestion (parallel task)](native-batch.md).
### `index_sub`
Submitted automatically, on your behalf, by an [`index_parallel`](#index_parallel) task.
### `index_hadoop`
See [Hadoop-based ingestion](hadoop.md).
@ -424,16 +410,12 @@ Submitted automatically, on your behalf, by a
Submitted automatically, on your behalf, by a
[Kinesis-based ingestion supervisor](../development/extensions-core/kinesis-ingestion.md).
### `index_realtime`
Submitted automatically, on your behalf, by [Tranquility](tranquility.md).
### `compact`
Compaction tasks merge all segments of the given interval. See the documentation on
[compaction](compaction.md) for details.
[compaction](../data-management/compaction.md) for details.
### `kill`
Kill tasks delete all metadata about certain segments and remove them from deep storage.
See the documentation on [deleting data](../ingestion/data-management.md#delete) for details.
See the documentation on [deleting data](../data-management/delete.md) for details.


@ -1,6 +1,6 @@
---
id: api
title: SQL-based ingestion APIs
title: SQL-based ingestion and multi-stage query task API
sidebar_label: API
---
@ -23,9 +23,13 @@ sidebar_label: API
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
The **Query** view in the Druid console provides the most stable experience for the multi-stage query task engine (MSQ task engine) and multi-stage query architecture. Use the UI if you do not need a programmatic interface.
The **Query** view in the web console provides a friendly experience for the multi-stage query task engine (MSQ task
engine) and multi-stage query architecture. We recommend using the web console if you do not need a programmatic
interface.
When using the API for the MSQ task engine, the action you want to take determines the endpoint you use:
@ -36,16 +40,17 @@ When using the API for the MSQ task engine, the action you want to take determin
You submit queries to the MSQ task engine using the `POST /druid/v2/sql/task/` endpoint.
### Request
#### Request
Currently, the MSQ task engine ignores the provided values of `resultFormat`, `header`,
`typesHeader`, and `sqlTypesHeader`. SQL SELECT queries write out their results into the task report (in the `multiStageQuery.payload.results.results` key) formatted as if `resultFormat` is an `array`.
For task queries similar to the [example queries](./msq-example-queries.md), you need to escape characters such as quotation marks (") if you use something like `curl`.
For task queries similar to the [example queries](./examples.md), you need to escape characters such as quotation marks (") if you use something like `curl`.
You don't need to escape characters if you use a method that can parse JSON seamlessly, such as Python.
The Python example in this topic escapes quotation marks although it's not required.
The following example is the same query that you submit when you complete [Convert a JSON ingestion spec](./msq-tutorial-convert-ingest-spec.md) where you insert data into a table named `wikipedia`.
The following example is the same query that you submit when you complete [Convert a JSON ingestion
spec](../tutorials/tutorial-msq-convert-spec.md) where you insert data into a table named `wikipedia`.
<!--DOCUSAURUS_CODE_TABS-->
@ -106,7 +111,7 @@ print(response.text)
<!--END_DOCUSAURUS_CODE_TABS-->
### Response
#### Response
```json
{
@ -127,7 +132,7 @@ print(response.text)
You can retrieve status of a query to see if it is still running, completed successfully, failed, or got canceled.
### Request
#### Request
<!--DOCUSAURUS_CODE_TABS-->
@ -162,7 +167,7 @@ print(response.text)
<!--END_DOCUSAURUS_CODE_TABS-->
### Response
#### Response
```
{
@ -200,7 +205,7 @@ Keep the following in mind when using the task API to view reports:
For an explanation of the fields in a report, see [Report response fields](#report-response-fields).
### Request
#### Request
<!--DOCUSAURUS_CODE_TABS-->
@ -236,7 +241,7 @@ print(response.text)
<!--END_DOCUSAURUS_CODE_TABS-->
### Response
#### Response
The response shows an example report for a query.
@ -535,7 +540,9 @@ The response shows an example report for a query.
}
```
### Report response fields
</details>
<a name="report-response-fields"></a>
The following table describes the response fields when you retrieve a report for an MSQ task engine query using the `/druid/indexer/v1/task/<taskId>/reports` endpoint:
@ -550,8 +557,8 @@ The following table describes the response fields when you retrieve a report for
|multiStageQuery.payload.status.errorReport.taskId|The task that reported the error, if known. May be a controller task or a worker task.|
|multiStageQuery.payload.status.errorReport.host|The hostname and port of the task that reported the error, if known.|
|multiStageQuery.payload.status.errorReport.stageNumber|The stage number that reported the error, if it happened during execution of a specific stage.|
|multiStageQuery.payload.status.errorReport.error|Error object. Contains `errorCode` at a minimum, and may contain other fields as described in the [error code table](./msq-concepts.md#error-codes). Always present if there is an error.|
|multiStageQuery.payload.status.errorReport.error.errorCode|One of the error codes from the [error code table](./msq-concepts.md#error-codes). Always present if there is an error.|
|multiStageQuery.payload.status.errorReport.error|Error object. Contains `errorCode` at a minimum, and may contain other fields as described in the [error code table](./reference.md#error-codes). Always present if there is an error.|
|multiStageQuery.payload.status.errorReport.error.errorCode|One of the error codes from the [error code table](./reference.md#error-codes). Always present if there is an error.|
|multiStageQuery.payload.status.errorReport.error.errorMessage|User-friendly error message. Not always present, even if there is an error.|
|multiStageQuery.payload.status.errorReport.exceptionStackTrace|Java stack trace in string form, if the error was due to a server-side exception.|
|multiStageQuery.payload.stages|Array of query stages.|
@ -571,7 +578,7 @@ The following table describes the response fields when you retrieve a report for
## Cancel a query task
### Request
#### Request
<!--DOCUSAURUS_CODE_TABS-->
@ -606,7 +613,7 @@ print(response.text)
<!--END_DOCUSAURUS_CODE_TABS-->
### Response
#### Response
```
{


@ -0,0 +1,281 @@
---
id: concepts
title: "SQL-based ingestion concepts"
sidebar_label: "Key concepts"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
## SQL task engine
The `druid-multi-stage-query` extension adds a multi-stage query (MSQ) task engine that executes SQL SELECT,
[INSERT](reference.md#insert), and [REPLACE](reference.md#replace) statements as batch tasks in the indexing service,
which execute on [Middle Managers](../design/architecture.md#druid-services). INSERT and REPLACE tasks publish
[segments](../design/architecture.md#datasources-and-segments) just like [all other forms of batch
ingestion](../ingestion/index.md#batch). Each query occupies at least two task slots while running: one controller task,
and at least one worker task.
You can execute queries using the MSQ task engine through the **Query** view in the [web
console](../operations/web-console.md) or through the [`/druid/v2/sql/task` API](api.md).
For more details on how SQL queries are executed using the MSQ task engine, see [multi-stage query
tasks](#multi-stage-query-tasks).
## SQL extensions
To support ingestion, additional SQL functionality is available through the MSQ task engine.
<a name="extern"></a>
### Read external data with EXTERN
Query tasks can access external data through the EXTERN function, using any native batch [input
source](../ingestion/native-batch-input-source.md) and [input format](../ingestion/data-formats.md#input-format).
EXTERN can read multiple files in parallel across different worker tasks. However, EXTERN does not split individual
files across multiple worker tasks. If you have a small number of very large input files, you can increase query
parallelism by splitting up your input files.
For more information about the syntax, see [EXTERN](./reference.md#extern).
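As a rough illustration, the following Python sketch assembles an `INSERT ... SELECT` statement that reads two files through EXTERN. The URLs, column names, and the `example_table` datasource are hypothetical. The sketch assumes the three-argument form of EXTERN (input source, input format, and row signature, each passed as a JSON string); check the reference page for the authoritative syntax.

```python
import json


def extern_clause(input_source: dict, input_format: dict, signature: list) -> str:
    """Render TABLE(EXTERN(...)) with each argument as a single-quoted JSON string."""

    def sql_string(value) -> str:
        # Serialize to JSON, then escape single quotes for SQL by doubling them.
        return "'" + json.dumps(value).replace("'", "''") + "'"

    args = ",\n    ".join(sql_string(v) for v in (input_source, input_format, signature))
    return f"TABLE(\n  EXTERN(\n    {args}\n  )\n)"


# Hypothetical input: two files, so two worker tasks could each read one of them.
input_source = {
    "type": "http",
    "uris": ["https://example.com/data-1.json.gz", "https://example.com/data-2.json.gz"],
}
input_format = {"type": "json"}
signature = [
    {"name": "timestamp", "type": "string"},
    {"name": "page", "type": "string"},
]

sql = (
    'INSERT INTO "example_table"\n'
    'SELECT TIME_PARSE("timestamp") AS __time, "page"\n'
    f"FROM {extern_clause(input_source, input_format, signature)}\n"
    "PARTITIONED BY DAY"
)
print(sql)
```

Because EXTERN does not split individual files, listing more, smaller files in `uris` is what gives the engine room to parallelize the read.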
<a name="insert"></a>
### Load data with INSERT
INSERT statements can create a new datasource or append to an existing datasource. In Druid SQL, unlike standard SQL,
there is no syntactical difference between creating a table and appending data to a table. Druid does not include a
`CREATE TABLE` statement.
Nearly all SELECT capabilities are available for `INSERT ... SELECT` queries. Certain exceptions are listed on the [Known
issues](./known-issues.md#select) page.
INSERT statements acquire a shared lock to the target datasource. Multiple INSERT statements can run at the same time,
for the same datasource, if your cluster has enough task slots.
Like all other forms of [batch ingestion](../ingestion/index.md#batch), each INSERT statement generates new segments and
publishes them at the end of its run. For this reason, it is best suited to loading data in larger batches. Do not use
INSERT statements to load data in a sequence of microbatches; for that, use [streaming
ingestion](../ingestion/index.md#streaming) instead.
For more information about the syntax, see [INSERT](./reference.md#insert).
<a name="replace"></a>
### Overwrite data with REPLACE
REPLACE statements can create a new datasource or overwrite data in an existing datasource. In Druid SQL, unlike
standard SQL, there is no syntactical difference between creating a table and overwriting data in a table. Druid does
not include a `CREATE TABLE` statement.
REPLACE uses an [OVERWRITE clause](reference.md#replace-specific-time-ranges) to determine which data to overwrite. You
can overwrite an entire table, or a specific time range of a table. When you overwrite a specific time range, that time
range must align with the granularity specified in the PARTITIONED BY clause.
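The alignment requirement can be checked before a statement is submitted. The sketch below is only an illustration: it assumes `PARTITIONED BY DAY`, UTC day boundaries, and the `OVERWRITE WHERE` form described on the reference page. The helper names and the `example_table` datasources are hypothetical.

```python
from datetime import datetime, timezone


def is_day_aligned(ts: datetime) -> bool:
    """Return True if the timestamp falls exactly on a UTC day boundary."""
    ts = ts.astimezone(timezone.utc)
    return (ts.hour, ts.minute, ts.second, ts.microsecond) == (0, 0, 0, 0)


def replace_statement(table: str, start: datetime, end: datetime, select_sql: str) -> str:
    """Build a REPLACE statement that overwrites [start, end) with DAY partitioning."""
    if not (is_day_aligned(start) and is_day_aligned(end)):
        raise ValueError("OVERWRITE range must align with PARTITIONED BY DAY")
    fmt = "%Y-%m-%d %H:%M:%S"
    start_utc = start.astimezone(timezone.utc).strftime(fmt)
    end_utc = end.astimezone(timezone.utc).strftime(fmt)
    return (
        f'REPLACE INTO "{table}"\n'
        f"OVERWRITE WHERE __time >= TIMESTAMP '{start_utc}' "
        f"AND __time < TIMESTAMP '{end_utc}'\n"
        f"{select_sql}\n"
        "PARTITIONED BY DAY"
    )


start = datetime(2022, 1, 1, tzinfo=timezone.utc)
end = datetime(2022, 1, 2, tzinfo=timezone.utc)
statement = replace_statement(
    "example_table", start, end, 'SELECT * FROM "example_table_staging"'
)
print(statement)
```

A range such as 2022-01-01 12:00 to 2022-01-02 00:00 is rejected by this helper because its start does not fall on a day boundary.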
REPLACE statements acquire an exclusive write lock to the target time range of the target datasource. No other ingestion
or compaction operations may proceed for that time range while the task is running. However, ingestion and compaction
operations may proceed for other time ranges.
Nearly all SELECT capabilities are available for `REPLACE ... SELECT` queries. Certain exceptions are listed on the [Known
issues](./known-issues.md#select) page.
For more information about the syntax, see [REPLACE](./reference.md#replace).
### Primary timestamp
Druid tables always include a primary timestamp named `__time`.
It is common to set a primary timestamp by using [date and time
functions](../querying/sql-scalar.md#date-and-time-functions); for example: `TIME_FORMAT("timestamp", 'yyyy-MM-dd
HH:mm:ss') AS __time`.
The `__time` column is used for [partitioning by time](#partitioning-by-time). If you use `PARTITIONED BY ALL` or
`PARTITIONED BY ALL TIME`, partitioning by time is disabled. In these cases, you do not need to include a `__time`
column in your INSERT statement. However, Druid still creates a `__time` column in your Druid table and sets all
timestamps to 1970-01-01 00:00:00.
For more information, see [Primary timestamp](../ingestion/data-model.md#primary-timestamp).
<a name="partitioning"></a>
### Partitioning by time
INSERT and REPLACE statements require the PARTITIONED BY clause, which determines how time-based partitioning is done.
In Druid, data is split into one or more segments per time chunk, defined by the PARTITIONED BY granularity.
Partitioning by time is important for three reasons:
1. Queries that filter by `__time` (SQL) or `intervals` (native) are able to use time partitioning to prune the set of
segments to consider.
2. Certain data management operations, such as overwriting and compacting existing data, acquire exclusive write locks
on time partitions. Finer-grained partitioning allows finer-grained exclusive write locks.
3. Each segment file is wholly contained within a time partition. Too-fine-grained partitioning may cause a large number
of small segments, which leads to poor performance.
`PARTITIONED BY HOUR` and `PARTITIONED BY DAY` are the most common choices to balance these considerations. `PARTITIONED
BY ALL` is suitable if your dataset does not have a [primary timestamp](#primary-timestamp).
For more information about the syntax, see [PARTITIONED BY](./reference.md#partitioned-by).
### Clustering
Within each time chunk defined by [time partitioning](#partitioning-by-time), data can be further split by the optional
[CLUSTERED BY](reference.md#clustered-by) clause.
For example, suppose you ingest 100 million rows per hour using `PARTITIONED BY HOUR` and `CLUSTERED BY hostName`. The
ingestion task will generate segments of roughly 3 million rows — the default value of
[`rowsPerSegment`](reference.md#context-parameters) — with lexicographic ranges of `hostName`s grouped into segments.
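The arithmetic behind this example is straightforward. The following sketch uses only the figures quoted above; actual segment counts can differ, because the row target is approximate.

```python
import math

rows_per_hour = 100_000_000   # rows ingested per hourly time chunk, from the example
rows_per_segment = 3_000_000  # default rowsPerSegment quoted above

# Each hourly time chunk is split into roughly this many segments.
segments_per_hour = math.ceil(rows_per_hour / rows_per_segment)
print(segments_per_hour)  # 34
```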
Clustering is important for two reasons:
1. Lower storage footprint due to improved locality, and therefore improved compressibility.
2. Better query performance due to dimension-based segment pruning, which removes segments from consideration when they
cannot possibly contain data matching a query's filter. This speeds up filters like `x = 'foo'` and `x IN ('foo',
'bar')`.
To activate dimension-based pruning, these requirements must be met:
- Segments were generated by a REPLACE statement, not an INSERT statement.
- All CLUSTERED BY columns are single-valued string columns.
If these requirements are _not_ met, Druid still clusters data during ingestion, but will not be able to perform
dimension-based segment pruning at query time. You can tell if dimension-based segment pruning is possible by using the
`sys.segments` table to inspect the `shard_spec` for the segments generated by an ingestion query. If they are of type
`range` or `single`, then dimension-based segment pruning is possible. Otherwise, it is not. The shard spec type is also
available in the **Segments** view under the **Partitioning** column.
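The check described above can be scripted. This sketch builds the `sys.segments` query and classifies a shard spec by its type. It assumes that the `shard_spec` column holds a JSON string with a `type` field; the sample values and the `example_table` datasource are abbreviated and illustrative, not real output.

```python
import json

# Shard spec types that allow dimension-based segment pruning, per the text above.
PRUNABLE_TYPES = {"range", "single"}


def shard_spec_query(datasource: str) -> str:
    """Build a query listing the shard spec of each segment in a datasource."""
    escaped = datasource.replace("'", "''")  # escape single quotes for SQL
    return (
        'SELECT "segment_id", "shard_spec"\n'
        "FROM sys.segments\n"
        f"WHERE \"datasource\" = '{escaped}'"
    )


def supports_dimension_pruning(shard_spec_json: str) -> bool:
    """Return True if the shard spec type is one that enables pruning."""
    return json.loads(shard_spec_json).get("type") in PRUNABLE_TYPES


print(shard_spec_query("example_table"))
print(supports_dimension_pruning('{"type": "range"}'))     # True
print(supports_dimension_pruning('{"type": "numbered"}'))  # False
```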
For more information about syntax, see [CLUSTERED BY](./reference.md#clustered-by).
### Rollup
[Rollup](../ingestion/rollup.md) is a technique that pre-aggregates data during ingestion to reduce the amount of data
stored. Intermediate aggregations are stored in the generated segments, and further aggregation is done at query time.
This reduces storage footprint and improves performance, often dramatically.
To perform ingestion with rollup:
1. Use GROUP BY. The columns in the GROUP BY clause become dimensions, and aggregation functions become metrics.
2. Set [`finalizeAggregations: false`](reference.md#context-parameters) in your context. This causes aggregation
functions to write their internal state to the generated segments, instead of the finalized end result, and enables
further aggregation at query time.
3. Wrap all multi-value strings in `MV_TO_ARRAY(...)` and set [`groupByEnableMultiValueUnnesting:
false`](reference.md#context-parameters) in your context. This ensures that multi-value strings are left alone and
remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the
GROUP BY operator.
When you do all of these things, Druid understands that you intend to do an ingestion with rollup, and it writes
rollup-related metadata into the generated segments. Other applications can then use [`segmentMetadata`
queries](../querying/segmentmetadataquery.md) to retrieve rollup-related information.
If you see the error "Encountered multi-value dimension `x` that cannot be processed with
groupByEnableMultiValueUnnesting set to false", then wrap that column in `MV_TO_ARRAY(x) AS x`.
The following [aggregation functions](../querying/sql-aggregations.md) are supported for rollup at ingestion time:
`COUNT` (but switch to `SUM` at query time), `SUM`, `MIN`, `MAX`, `EARLIEST` ([string only](known-issues.md#select)),
`LATEST` ([string only](known-issues.md#select)), `APPROX_COUNT_DISTINCT`, `APPROX_COUNT_DISTINCT_BUILTIN`,
`APPROX_COUNT_DISTINCT_DS_HLL`, `APPROX_COUNT_DISTINCT_DS_THETA`, and `DS_QUANTILES_SKETCH` (but switch to
`APPROX_QUANTILE_DS` at query time). Do not use `AVG`; instead, use `SUM` and `COUNT` at ingest time and compute the
quotient at query time.
For an example, see [INSERT with rollup example](examples.md#insert-with-rollup).
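As a minimal sketch of these steps, the following Python builds a request body for a rollup ingestion together with a matching query-time average. The datasource and column names (`example_raw`, `example_rollup`, `page`, `added`) are hypothetical, and the `query` and `context` field names assume the usual Druid SQL request format.

```python
import json

# Ingest-time statement: GROUP BY columns become dimensions, aggregations become metrics.
ingest_sql = """
INSERT INTO "example_rollup"
SELECT
  TIME_FLOOR(__time, 'PT1H') AS __time,
  "page",
  COUNT(*) AS "cnt",
  SUM("added") AS "sum_added"
FROM "example_raw"
GROUP BY 1, 2
PARTITIONED BY DAY
""".strip()

payload = {
    "query": ingest_sql,
    "context": {
        # Store intermediate aggregation state rather than finalized results.
        "finalizeAggregations": False,
        # Leave multi-value strings as lists instead of unnesting them in GROUP BY.
        "groupByEnableMultiValueUnnesting": False,
    },
}

# Query-time average: sum the stored sums and counts, then divide.
# CAST avoids integer division if both metrics are stored as integers.
average_sql = (
    'SELECT "page", CAST(SUM("sum_added") AS DOUBLE) / SUM("cnt") AS "avg_added"\n'
    'FROM "example_rollup"\n'
    "GROUP BY 1"
)

print(json.dumps(payload, indent=2))
print(average_sql)
```

Storing `SUM` and `COUNT` separately keeps the average exact after rollup, which is why `AVG` is avoided at ingestion time.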
## Multi-stage query tasks
### Execution flow
When you execute a SQL statement using the task endpoint [`/druid/v2/sql/task`](api.md#submit-a-query), the following
happens:
1. The Broker plans your SQL query into a native query, as usual.
2. The Broker wraps the native query into a task of type `query_controller`
and submits it to the indexing service.
3. The Broker returns the task ID to you and exits.
4. The controller task launches some number of worker tasks determined by
the `maxNumTasks` and `taskAssignment` [context parameters](./reference.md#context-parameters). You can set these parameters individually for each query.
5. Worker tasks of type `query_worker` execute the query.
6. If the query is a SELECT query, the worker tasks send the results
back to the controller task, which writes them into its task report.
If the query is an INSERT or REPLACE query, the worker tasks generate and
publish new Druid segments to the provided datasource.
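
As a rough sketch of the first step in this flow, the following Python builds (but does not send) a request for the task endpoint. The Router address `localhost:8888` and the `wikipedia` datasource are assumptions for illustration, and the `query` and `context` body fields follow the general shape of Druid's SQL API; check the API page for the authoritative format.

```python
import json
import urllib.request

# Assumption: a Router listening on localhost:8888; adjust for your cluster.
DRUID_URL = "http://localhost:8888"


def build_task_request(sql, max_num_tasks=2):
    """Build (but do not send) a POST request for the SQL task endpoint."""
    payload = {
        "query": sql,
        # maxNumTasks counts the controller task plus the worker tasks.
        "context": {"maxNumTasks": max_num_tasks},
    }
    return urllib.request.Request(
        url=DRUID_URL + "/druid/v2/sql/task",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "wikipedia" is a hypothetical datasource name.
request = build_task_request("SELECT COUNT(*) FROM wikipedia", max_num_tasks=3)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit it. As described in step 3, the response is expected to carry the task ID rather than the query results.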
### Parallelism
The [`maxNumTasks`](./reference.md#context-parameters) query parameter determines the maximum number of tasks your
query will use, including the one `query_controller` task. Generally, queries perform better with more workers. The
lowest possible value of `maxNumTasks` is two (one worker and one controller). Do not set this higher than the number of
free slots available in your cluster; doing so will result in a [TaskStartTimeout](reference.md#error-codes) error.
When [reading external data](#extern), EXTERN can read multiple files in parallel across
different worker tasks. However, EXTERN does not split individual files across multiple worker tasks. If you have a
small number of very large input files, you can increase query parallelism by splitting up your input files.
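
The interaction between `maxNumTasks` and the number of input files can be sketched as a simple upper bound. This is a simplified model with a hypothetical helper name, not Druid's actual task assignment logic, which may launch fewer workers depending on `taskAssignment`.

```python
def effective_read_parallelism(max_num_tasks, num_input_files):
    """Upper bound on the number of workers that can read EXTERN input at once."""
    if max_num_tasks < 2:
        raise ValueError("maxNumTasks must be at least 2 (one controller, one worker)")
    # One task slot is always taken by the query_controller task.
    max_workers = max_num_tasks - 1
    # EXTERN does not split a single file, so the file count caps useful workers.
    return min(max_workers, num_input_files)


print(effective_read_parallelism(5, 2))   # 2: only two files to read
print(effective_read_parallelism(5, 40))  # 4: limited by the worker count
```

In the first case, splitting the two large files into at least four smaller ones would let all four workers read input.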
The `druid.worker.capacity` server property on each [Middle Manager](../design/architecture.md#druid-services)
determines the maximum number of worker tasks that can run on each server at once. Worker tasks run single-threaded,
which also determines the maximum number of processors on the server that can contribute towards multi-stage queries.
### Memory usage
Increasing the amount of available memory can improve performance in certain cases:
- Segment generation becomes more efficient when data doesn't spill to disk as often.
- Sorting stage output data becomes more efficient since available memory affects the
number of required sorting passes.
Worker tasks use both JVM heap memory and off-heap ("direct") memory.
On Peons launched by Middle Managers, the bulk of the JVM heap (75%) is split up into two bundles of equal size: one
processor bundle and one worker bundle. Each one comprises 37.5% of the available JVM heap.
The processor memory bundle is used for query processing and segment generation. Each processor bundle must also
provide space to buffer I/O between stages. Specifically, each downstream stage requires 1 MB of buffer space for each
upstream worker. For example, if you have 100 workers running in stage 0, and stage 1 reads from stage 0, then each
worker in stage 1 requires 1 MB * 100 = 100 MB of memory for frame buffers.
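
The heap arithmetic above can be worked through in a short sketch. The 2048 MB heap is a hypothetical value, and subtracting the frame buffers from the processor bundle is a simplification for illustration, not an exact model of Druid's memory accounting.

```python
BUNDLE_FRACTION = 0.375     # each bundle is 37.5% of the JVM heap
FRAME_BUFFER_MB = 1         # buffer space per upstream worker, in MB


def processor_bundle_mb(jvm_heap_mb):
    """Size of the processor bundle for a given JVM heap size."""
    return jvm_heap_mb * BUNDLE_FRACTION


def frame_buffer_mb(upstream_workers):
    """Buffer space a downstream worker needs to read from all upstream workers."""
    return upstream_workers * FRAME_BUFFER_MB


heap_mb = 2048  # hypothetical Peon heap size
bundle = processor_bundle_mb(heap_mb)
buffers = frame_buffer_mb(100)
print(bundle)            # 768.0
print(buffers)           # 100
print(bundle - buffers)  # 668.0 left for processing under this simplified model
```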
The worker memory bundle is used for sorting stage output data prior to shuffle. Workers can sort more data than fits in
memory; in this case, they will switch to using disk.
Worker tasks also use off-heap ("direct") memory. Set the amount of direct memory available (`-XX:MaxDirectMemorySize`)
to at least `(druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes`. Increasing the amount of direct
memory available beyond the minimum does not speed up processing.
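
As a worked example of the direct memory formula, the following sketch computes the minimum value for `-XX:MaxDirectMemorySize`. The thread count and buffer size are hypothetical inputs, not Druid defaults.

```python
def min_direct_memory_bytes(num_threads, buffer_size_bytes):
    """(druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes"""
    return (num_threads + 1) * buffer_size_bytes


# Hypothetical settings: 2 processing threads and 100 MiB buffers.
minimum = min_direct_memory_bytes(num_threads=2, buffer_size_bytes=100 * 1024 * 1024)
print(minimum)  # 314572800 bytes, or 300 MiB
```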
### Disk usage
Worker tasks use local disk for four purposes:
- Temporary copies of input data. Each temporary file is deleted before the next one is read. You only need
enough temporary disk space to store one input file at a time per task.
- Temporary data related to segment generation. You only need enough temporary disk space to store one segment's worth
of data at a time per task. This is generally less than 2 GB per task.
- External sort of data prior to shuffle. Requires enough space to store a compressed copy of the entire output dataset
for a task.
- Storing stage output data during a shuffle. Requires enough space to store a compressed copy of the entire output
dataset for a task.
Workers use the task working directory, given by
[`druid.indexer.task.baseDir`](../configuration/index.md#additional-peon-configuration), for these items. It is
important that this directory has enough space available for these purposes.

View File

@ -23,9 +23,11 @@ sidebar_label: Examples
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
These example queries show you some of the things you can do when modifying queries for your use case. Copy the example queries into the **Query** view of the Druid console and run them to see what they do.
These example queries show you some of the things you can do when modifying queries for your use case. Copy the example queries into the **Query** view of the web console and run them to see what they do.
## INSERT with no rollup
@ -75,7 +77,7 @@ CLUSTERED BY channel
## INSERT with rollup
This example inserts data into a table named `kttm_data` and performs data rollup. This example implements the recommendations described in [multi-value dimensions](./index.md#multi-value-dimensions).
This example inserts data into a table named `kttm_data` and performs data rollup. This example implements the recommendations described in [Rollup](./concepts.md#rollup).
<details><summary>Show the query</summary>
@ -472,8 +474,3 @@ LIMIT 1000
```
</details>
## Next steps
* [Read Multi-stage queries](./msq-example-queries.md) to learn more about how multi-stage queries work.
* [Explore the Query view](../operations/druid-console.md) to learn about the UI tools to help you get started.

View File

@ -1,7 +1,7 @@
---
id: index
title: SQL-based ingestion overview and syntax
sidebar_label: Overview and syntax
title: SQL-based ingestion
sidebar_label: Overview
description: Introduces multi-stage query architecture and its task engine
---
@ -24,319 +24,50 @@ description: Introduces multi-stage query architecture and its task engine
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
SQL-based ingestion for Apache Druid uses a distributed multi-stage query architecture, which includes a query engine called the multi-stage query task engine (MSQ task engine). The MSQ task engine extends Druid's query capabilities, so you can write queries that reference [external data](#read-external-data) as well as perform ingestion with SQL [INSERT](#insert-data) and [REPLACE](#replace-data). Essentially, you can perform SQL-based ingestion instead of using JSON ingestion specs that Druid's native ingestion uses.
Apache Druid supports SQL-based ingestion using the bundled [`druid-multi-stage-query` extension](#load-the-extension).
This extension adds a [multi-stage query task engine for SQL](concepts.md#sql-task-engine) that allows running SQL
[INSERT](concepts.md#insert) and [REPLACE](concepts.md#replace) statements as batch tasks.
The MSQ task engine excels at executing queries that can get bottlenecked at the Broker when using Druid's native SQL engine. When you submit queries, the MSQ task engine splits them into stages and automatically exchanges data between stages. Each stage is parallelized to run across multiple data servers at once, simplifying performance.
Nearly all SELECT capabilities are available for `INSERT ... SELECT` and `REPLACE ... SELECT` queries, with certain
exceptions listed on the [Known issues](./known-issues.md#select) page. This allows great flexibility to apply
transformations, filters, JOINs, aggregations, and so on while ingesting data. This also allows in-database
transformation: creating new tables based on queries of other tables.
## Vocabulary
## MSQ task engine features
- **Controller**: An indexing service task of type `query_controller` that manages
the execution of a query. There is one controller task per query.
In its current state, the MSQ task engine enables you to do the following:
- **Worker**: Indexing service tasks of type `query_worker` that execute a
query. There can be multiple worker tasks per query. Internally,
the tasks process items in parallel using their processing pools (up to `druid.processing.numThreads` of execution parallelism
within a worker task).
- Read external data at query time using EXTERN.
- Execute batch ingestion jobs by writing SQL queries using INSERT and REPLACE. You no longer need to generate a JSON-based ingestion spec.
- Transform and rewrite existing tables using SQL.
- Perform multi-dimension range partitioning reliably, which leads to more evenly distributed segment sizes and better performance.
- **Stage**: A stage of query execution that is parallelized across
worker tasks. Workers exchange data with each other between stages.
The MSQ task engine has additional features that can be used as part of a proof of concept or demo, but don't use or rely on the following features for any meaningful use cases, especially production ones:
- Execute heavy-weight queries and return large numbers of rows.
- Execute queries that exchange large amounts of data between servers, like exact count distinct of high-cardinality fields.
- **Partition**: A slice of data output by worker tasks. In INSERT or REPLACE
queries, the partitions of the final stage become Druid segments.
- **Shuffle**: Workers exchange data between themselves on a per-partition basis in a process called
shuffling. During a shuffle, each output partition is sorted by a clustering key.
## Load the extension
For new clusters that use 24.0 or later, the multi-stage query extension is loaded by default. If you want to add the extension to an existing cluster, add the extension `druid-multi-stage-query` to `druid.extensions.loadlist` in your `common.runtime.properties` file.
To add the extension to an existing cluster, add `druid-multi-stage-query` to `druid.extensions.loadlist` in your
`common.runtime.properties` file.
For more information about how to load an extension, see [Loading extensions](../development/extensions.md#loading-extensions).
To use EXTERN, you need READ permission on the resource named "EXTERNAL" of the resource type "EXTERNAL". If you encounter a 403 error when trying to use EXTERN, verify that you have the correct permissions.
## MSQ task engine query syntax
You can submit queries to the MSQ task engine through the **Query** view in the Druid console or through the API. The Druid console is a good place to start because you can preview a query before you run it. You can also experiment with many of the [context parameters](./msq-reference.md#context-parameters) through the UI. Once you're comfortable with submitting queries through the Druid console, [explore using the API to submit a query](./msq-api.md#submit-a-query).
If you encounter an issue after you submit a query, you can learn more about what an error means from the [limits](./msq-concepts.md#limits) and [errors](./msq-concepts.md#error-codes).
Queries for the MSQ task engine involve three primary functions:
- EXTERN to query external data
- INSERT INTO ... SELECT to insert data, such as data from an external source
- REPLACE to replace existing datasources, partially or fully, with query results
For information about the syntax for queries, see [SQL syntax](./msq-reference.md#sql-syntax).
### Read external data
Query tasks can access external data through the EXTERN function. When using EXTERN, keep in mind that large files do not get split across different worker tasks. If you have fewer input files than worker tasks, you can increase query parallelism by splitting up your input files such that you have at least one input file per worker task.
You can use the EXTERN function anywhere a table is expected in the following form: `TABLE(EXTERN(...))`. You can use external data with SELECT, INSERT, and REPLACE queries.
The following query reads external data:
```sql
SELECT
*
FROM TABLE(
EXTERN(
'{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
)
LIMIT 100
```
For more information about the syntax, see [EXTERN](./msq-reference.md#extern).
### Insert data
With the MSQ task engine, Druid can use the results of a query task to create a new datasource or to append to an existing datasource. Syntactically, there is no difference between the two. These operations use the INSERT INTO ... SELECT syntax.
All SELECT capabilities are available for INSERT queries. However, the MSQ task engine does not include all the existing SQL query features of Druid. See [Known issues](./msq-known-issues.md) for a list of capabilities that aren't available.
The following example query inserts data from an external source into a table named `w000` and partitions it by day:
```sql
INSERT INTO w000
SELECT
TIME_PARSE("timestamp") AS __time,
"page",
"user"
FROM TABLE(
EXTERN(
'{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
)
PARTITIONED BY DAY
```
For more information about the syntax, see [INSERT](./msq-reference.md#insert).
### Replace data
The syntax for REPLACE is similar to INSERT. All SELECT functionality is available for REPLACE queries.
Note that the MSQ task engine does not yet implement all native Druid query features.
For details, see [Known issues](./msq-known-issues.md).
When working with REPLACE queries, keep the following in mind:
- The intervals generated as a result of the OVERWRITE WHERE query must align with the granularity specified in the PARTITIONED BY clause.
- OVERWRITE WHERE queries only support the `__time` column.
For more information about the syntax, see [REPLACE](./msq-reference.md#replace).
The following examples show how to replace data in a table.
#### REPLACE all data
You can replace all the data in a table by using REPLACE INTO ... OVERWRITE ALL SELECT:
```sql
REPLACE INTO w000
OVERWRITE ALL
SELECT
TIME_PARSE("timestamp") AS __time,
"page",
"user"
FROM TABLE(
EXTERN(
'{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
)
PARTITIONED BY DAY
```
#### REPLACE some data
You can replace some of the data in a table by using REPLACE INTO ... OVERWRITE WHERE ... SELECT:
```sql
REPLACE INTO w000
OVERWRITE WHERE __time >= TIMESTAMP '2019-08-25' AND __time < TIMESTAMP '2019-08-28'
SELECT
TIME_PARSE("timestamp") AS __time,
"page",
"user"
FROM TABLE(
EXTERN(
'{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
)
PARTITIONED BY DAY
```
## Adjust query behavior
In addition to the basic functions, you can further modify your query behavior to control how your queries run or what your results look like. You can control how your queries behave by changing the following:
### Primary timestamp
Druid tables always include a primary timestamp named `__time`, so your ingestion query should generally include a column named `__time`.
The following formats are supported for `__time` in the source data:
- ISO 8601 with 'T' separator, such as "2000-01-01T01:02:03.456"
- Milliseconds since Unix epoch (00:00:00 UTC on January 1, 1970)
The `__time` column is used for time-based partitioning, such as `PARTITIONED BY DAY`.
If you use `PARTITIONED BY ALL` or `PARTITIONED BY ALL TIME`, time-based
partitioning is disabled. In these cases, your ingestion query doesn't need
to include a `__time` column. However, Druid still creates a `__time` column
in your Druid table and sets all timestamps to 1970-01-01 00:00:00.
For more information, see [Primary timestamp](../ingestion/data-model.md#primary-timestamp).
### PARTITIONED BY
INSERT and REPLACE queries require the PARTITIONED BY clause, which determines how time-based partitioning is done. In Druid, data is split into segments, one or more per time chunk defined by the PARTITIONED BY granularity. A good general rule is to adjust the granularity so that each segment contains about five million rows. Choose a granularity based on your ingestion rate. For example, if you ingest a million rows per day, PARTITIONED BY DAY is good. If you ingest a million rows an hour, choose PARTITIONED BY HOUR instead.
Using the clause provides the following benefits:
- Better query performance due to time-based segment pruning, which removes segments from
consideration when they do not contain any data for a query's time filter.
- More efficient data management, as data can be rewritten for each time partition individually
rather than the whole table.
You can use the following arguments for PARTITIONED BY:
- Time unit: `HOUR`, `DAY`, `MONTH`, or `YEAR`. Equivalent to `FLOOR(__time TO TimeUnit)`.
- `TIME_FLOOR(__time, 'granularity_string')`, where granularity_string is an ISO 8601 period like
'PT1H'. The first argument must be `__time`.
- `FLOOR(__time TO TimeUnit)`, where `TimeUnit` is any unit supported by the [FLOOR function](../querying/sql-scalar.md#date-and-time-functions). The first
argument must be `__time`.
- `ALL` or `ALL TIME`, which effectively disables time partitioning by placing all data in a single
time chunk. To use LIMIT or OFFSET at the outer level of your INSERT or REPLACE query, you must set PARTITIONED BY to ALL or ALL TIME.
You can use the following ISO 8601 periods for `TIME_FLOOR`:
- PT1S
- PT1M
- PT5M
- PT10M
- PT15M
- PT30M
- PT1H
- PT6H
- P1D
- P1W
- P1M
- P3M
- P1Y
### CLUSTERED BY
Data is first divided by the PARTITIONED BY clause. Data can be further split by the CLUSTERED BY clause. For example, suppose you ingest 100 M rows per hour and use `PARTITIONED BY HOUR` as your time partition. You then divide up the data further by adding `CLUSTERED BY hostName`. The result is segments of about 5 million rows, with like `hostName`s grouped within the same segment.
Using CLUSTERED BY has the following benefits:
- Lower storage footprint due to combining similar data into the same segments, which improves
compressibility.
- Better query performance due to dimension-based segment pruning, which removes segments from
consideration when they cannot possibly contain data matching a query's filter.
For dimension-based segment pruning to be effective, your queries should meet the following conditions:
- All CLUSTERED BY columns are single-valued string columns
- Use a REPLACE query for ingestion
Druid still clusters data during ingestion if these conditions aren't met but won't perform dimension-based segment pruning at query time. That means if you use an INSERT query for ingestion or have numeric columns or multi-valued string columns, dimension-based segment pruning doesn't occur at query time.
You can tell if dimension-based segment pruning is possible by using the `sys.segments` table to
inspect the `shard_spec` for the segments generated by an ingestion query. If they are of type
`range` or `single`, then dimension-based segment pruning is possible. Otherwise, it is not. The
shard spec type is also available in the **Segments** view under the **Partitioning**
column.
You can use the following filters for dimension-based segment pruning:
- Equality to string literals, like `x = 'foo'` or `x IN ('foo', 'bar')`.
- Comparison to string literals, like `x < 'foo'` or other comparisons involving `<`, `>`, `<=`, or `>=`.
This differs from multi-dimension range based partitioning in classic batch ingestion where both
string and numeric columns support Broker-level pruning. With SQL-based batch ingestion,
only string columns support Broker-level pruning.
It is okay to mix time partitioning with secondary partitioning. For example, you can
combine `PARTITIONED BY HOUR` with `CLUSTERED BY channel` to perform
time partitioning by hour and secondary partitioning by channel within each hour.
### GROUP BY
A query's GROUP BY clause determines how data is rolled up. The expressions in the GROUP BY clause become
dimensions, and aggregation functions become metrics.
### Ingest-time aggregations
When performing rollup using aggregations, it is important to use aggregators
that return nonfinalized state. This allows you to perform further rollups
at query time. To achieve this, set `finalizeAggregations: false` in your
ingestion query context.
Check out the [INSERT with rollup example query](./msq-example-queries.md#insert-with-rollup) to see this feature in
action.
Druid needs information for aggregating measures of different segments to compact. For example, to aggregate `count("col") as example_measure`, Druid needs to sum the value of `example_measure`
across the segments. This information is stored inside the metadata of the segment. For the SQL-based ingestion, Druid only populates the
aggregator information of a column in the segment metadata when:
- The INSERT or REPLACE query has an outer GROUP BY clause.
- The following context parameters are set for the query context: `finalizeAggregations: false` and `groupByEnableMultiValueUnnesting: false`
The following table lists query-time aggregations for SQL-based ingestion:
|Query-time aggregation|Notes|
|----------------------|-----|
|SUM|Use unchanged at ingest time.|
|MIN|Use unchanged at ingest time.|
|MAX|Use unchanged at ingest time.|
|AVG|Use SUM and COUNT at ingest time. At query time, compute the quotient of the two SUMs.|
|COUNT|Use unchanged at ingest time, but switch to SUM at query time.|
|COUNT(DISTINCT expr)|If approximate, use APPROX_COUNT_DISTINCT at ingest time.<br /><br />If exact, you cannot use an ingest-time aggregation. Instead, `expr` must be stored as-is. Add it to the SELECT and GROUP BY lists.|
|EARLIEST(expr)<br /><br />(numeric form)|Not supported.|
|EARLIEST(expr, maxBytes)<br /><br />(string form)|Use unchanged at ingest time.|
|LATEST(expr)<br /><br />(numeric form)|Not supported.|
|LATEST(expr, maxBytes)<br /><br />(string form)|Use unchanged at ingest time.|
|APPROX_COUNT_DISTINCT|Use unchanged at ingest time.|
|APPROX_COUNT_DISTINCT_BUILTIN|Use unchanged at ingest time.|
|APPROX_COUNT_DISTINCT_DS_HLL|Use unchanged at ingest time.|
|APPROX_COUNT_DISTINCT_DS_THETA|Use unchanged at ingest time.|
|APPROX_QUANTILE|Not supported. Deprecated; use APPROX_QUANTILE_DS instead.|
|APPROX_QUANTILE_DS|Use DS_QUANTILES_SKETCH at ingest time. Continue using APPROX_QUANTILE_DS at query time.|
|APPROX_QUANTILE_FIXED_BUCKETS|Not supported.|
### Multi-value dimensions
By default, multi-value dimensions are not ingested as expected when rollup is enabled because the
GROUP BY operator unnests them instead of leaving them as arrays. This is [standard behavior](../querying/sql-data-types.md#multi-value-strings) for GROUP BY but it is generally not desirable behavior for ingestion.
To address this:
- When using GROUP BY with data from EXTERN, wrap any string type fields from EXTERN that may be
multi-valued in `MV_TO_ARRAY`.
- Set `groupByEnableMultiValueUnnesting: false` in your query context to ensure that all multi-value
strings are properly converted to arrays using `MV_TO_ARRAY`. If any strings aren't
wrapped in `MV_TO_ARRAY`, the query reports an error that includes the message "Encountered
multi-value dimension x that cannot be processed with groupByEnableMultiValueUnnesting set to false."
For an example, see [INSERT with rollup example query](./msq-example-queries.md#insert-with-rollup).
### Context parameters
Context parameters can control things such as how many tasks get launched or what happens if there's a malformed record.
For a full list of context parameters and how they affect a query, see [Context parameters](./msq-reference.md#context-parameters).
To use [EXTERN](reference.md#extern), you need READ permission on the resource named "EXTERNAL" of the resource type
"EXTERNAL". If you encounter a 403 error when trying to use EXTERN, verify that you have the correct permissions.
## Next steps
* [Understand how the multi-stage query architecture works](./msq-concepts.md) by reading about the concepts behind it and its processes.
* [Explore the Query view](../operations/druid-console.md) to learn about the UI tools that can help you get started.
* [Read about key concepts](./concepts.md) to learn more about how SQL-based ingestion and multi-stage queries work.
* [Check out the examples](./examples.md) to see SQL-based ingestion in action.
* [Explore the Query view](../operations/web-console.md) to get started in the web console.

View File

@ -23,41 +23,46 @@ sidebar_label: Known issues
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0.
> Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported.
> We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment
> before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not
> write to a datasource is experimental.
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
## Multi-stage query task runtime
- Fault tolerance is not implemented. If any task fails, the entire query fails.
- SELECT from a Druid datasource does not include unpublished real-time data.
- GROUPING SETS is not implemented. Queries that use GROUPING SETS fail.
- Worker task stage outputs are stored in the working directory given by `druid.indexer.task.baseDir`. Stages that
generate a large amount of output data may exhaust all available disk space. In this case, the query fails with
an [UnknownError](./msq-reference.md#error-codes) with a message including "No space left on device".
an [UnknownError](./reference.md#error-codes) with a message including "No space left on device".
## SELECT
- SELECT from a Druid datasource does not include unpublished real-time data.
- GROUPING SETS and UNION ALL are not implemented. Queries using these features return a
[QueryNotSupported](reference.md#error-codes) error.
- The numeric varieties of the EARLIEST and LATEST aggregators do not work properly. Attempting to use the numeric
varieties of these aggregators leads to an error like
`java.lang.ClassCastException: class java.lang.Double cannot be cast to class org.apache.druid.collections.SerializablePair`.
The string varieties, however, do work properly.
## INSERT and REPLACE
## INSERT and REPLACE
- INSERT with column lists, like `INSERT INTO tbl (a, b, c) SELECT ...`, is not implemented.
- INSERT and REPLACE with column lists, like `INSERT INTO tbl (a, b, c) SELECT ...`, are not implemented.
- `INSERT ... SELECT` inserts columns from the SELECT statement based on column name. This differs from SQL standard
behavior, where columns are inserted based on position.
- `INSERT ... SELECT` and `REPLACE ... SELECT` insert columns from the SELECT statement based on column name. This
differs from SQL standard behavior, where columns are inserted based on position.
- INSERT and REPLACE do not support all options available in [ingestion specs](../ingestion/ingestion-spec.md),
including the `createBitmapIndex` and `multiValueHandling` [dimension](../ingestion/ingestion-spec.md#dimension-objects)
properties, and the `indexSpec` [`tuningConfig`](../ingestion/ingestion-spec.md#tuningconfig) property.
## EXTERN
- The [schemaless dimensions](../ingestion/ingestion-spec.md#inclusions-and-exclusions)
feature is not available. All columns and their types must be specified explicitly using the `signature` parameter
of the [EXTERN function](msq-reference.md#extern).
of the [EXTERN function](reference.md#extern).
- EXTERN with input sources that match large numbers of files may exhaust available memory on the controller task.

View File

@ -1,168 +0,0 @@
---
id: concepts
title: SQL-based ingestion concepts
sidebar_label: Key concepts
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
This topic covers the main concepts and terminology of the multi-stage query architecture.
## Vocabulary
You might see the following terms in the documentation or while you're using the multi-stage query architecture and task engine, such as when you view the report for a query:
- **Controller**: An indexing service task of type `query_controller` that manages
the execution of a query. There is one controller task per query.
- **Worker**: Indexing service tasks of type `query_worker` that execute a
query. There can be multiple worker tasks per query. Internally,
the tasks process items in parallel using their processing pools (up to `druid.processing.numThreads` of execution parallelism
within a worker task).
- **Stage**: A stage of query execution that is parallelized across
worker tasks. Workers exchange data with each other between stages.
- **Partition**: A slice of data output by worker tasks. In INSERT or REPLACE
queries, the partitions of the final stage become Druid segments.
- **Shuffle**: Workers exchange data between themselves on a per-partition basis in a process called
shuffling. During a shuffle, each output partition is sorted by a clustering key.
## How the MSQ task engine works
Query tasks, specifically queries for INSERT, REPLACE, and SELECT, execute using indexing service tasks. Every query occupies at least two task slots while running.
When you submit a query task to the MSQ task engine, the following happens:
1. The Broker plans your SQL query into a native query, as usual.
2. The Broker wraps the native query into a task of type `query_controller`
and submits it to the indexing service.
3. The Broker returns the task ID to you and exits.
4. The controller task launches some number of worker tasks determined by
the `maxNumTasks` and `taskAssignment` [context parameters](./msq-reference.md#context-parameters). You can set these parameters individually for each query.
5. The worker tasks execute the query.
6. If the query is a SELECT query, the worker tasks send the results
back to the controller task, which writes them into its task report.
If the query is an INSERT or REPLACE query, the worker tasks generate and
publish new Druid segments to the provided datasource.
## Parallelism
Parallelism affects performance.
The [`maxNumTasks`](./msq-reference.md#context-parameters) query parameter determines the maximum number of tasks (workers and one controller) your query will use. Generally, queries perform better with more workers. The lowest possible value of `maxNumTasks` is two (one worker and one controller), and the highest possible value is equal to the number of free task slots in your cluster.
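The task arithmetic described above can be sketched in Python. This is only an illustration: the function name `plan_tasks` and the `free_task_slots` argument are hypothetical and not part of any Druid API.

```python
def plan_tasks(max_num_tasks: int, free_task_slots: int) -> dict:
    """Split maxNumTasks into one controller plus the maximum worker count."""
    if max_num_tasks < 2:
        # The lowest possible value is two: one controller and one worker.
        raise ValueError("maxNumTasks must be at least 2")
    if max_num_tasks > free_task_slots:
        # The highest possible value equals the number of free task slots.
        raise ValueError("maxNumTasks exceeds the number of free task slots")
    return {"controller": 1, "max_workers": max_num_tasks - 1}


print(plan_tasks(max_num_tasks=3, free_task_slots=10))
```

With `maxNumTasks` set to 3, the query can use at most two workers alongside its single controller.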
The `druid.worker.capacity` server property on each Middle Manager determines the maximum number
of worker tasks that can run on each server at once. Worker tasks run single-threaded, which
also determines the maximum number of processors on the server that can contribute towards
multi-stage queries. Since data servers are shared between Historicals and
Middle Managers, the default setting for `druid.worker.capacity` is lower than the number of
processors on the server. Advanced users may consider enhancing parallelism by increasing this
value to one less than the number of processors on the server. In most cases, this increase must
be accompanied by an adjustment of the memory allotment of the Historical process,
Middle-Manager-launched tasks, or both, to avoid memory overcommitment and server instability. If
you are not comfortable tuning these memory usage parameters to avoid overcommitment, it is best
to stick with the default `druid.worker.capacity`.
## Memory usage
Increasing the amount of available memory can improve performance as follows:
- Segment generation becomes more efficient when data doesn't spill to disk as often.
- Sorting stage output data becomes more efficient since available memory affects the
number of required sorting passes.
Worker tasks use both JVM heap memory and off-heap ("direct") memory.
On Peons launched by Middle Managers, the bulk of the JVM heap (75%) is split up into two bundles of equal size: one processor bundle and one worker bundle. Each one comprises 37.5% of the available JVM heap.
The processor memory bundle is used for query processing and segment generation. Each processor bundle must
also provide space to buffer I/O between stages. Specifically, each downstream stage requires 1 MB of buffer space for
each upstream worker. For example, if you have 100 workers running in stage 0, and stage 1 reads from stage 0,
then each worker in stage 1 requires 1 MB * 100 = 100 MB of memory for frame buffers.
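The heap split and the frame buffer requirement can be worked through with a small Python sketch. The constant and function names are hypothetical, and the heap size in the usage line is an arbitrary example value.

```python
BUNDLE_FRACTION = 0.375            # each bundle is 37.5% of the JVM heap
BUFFER_MB_PER_UPSTREAM_WORKER = 1  # 1 MB of buffer per upstream worker


def bundle_size_mb(heap_mb: float) -> float:
    """Size of one memory bundle (processor or worker) for a given heap."""
    return heap_mb * BUNDLE_FRACTION


def frame_buffer_mb(upstream_workers: int) -> int:
    """Frame buffer memory that one downstream worker needs."""
    return upstream_workers * BUFFER_MB_PER_UPSTREAM_WORKER


# Example: an 8 GiB heap (assumed value) and 100 upstream workers.
print(bundle_size_mb(8192), frame_buffer_mb(100))
```

Comparing the two numbers shows how much of a processor bundle the frame buffers consume before any query processing or segment generation takes place.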
The worker memory bundle is used for sorting stage output data prior to shuffle. Workers can sort
more data than fits in memory; in this case, they will switch to using disk.
Worker tasks also use off-heap ("direct") memory. Set the amount of direct
memory available (`-XX:MaxDirectMemorySize`) to at least
`(druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes`. Increasing the
amount of direct memory available beyond the minimum does not speed up processing.
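The minimum direct memory formula can be evaluated as follows. This is a sketch only: the function name is hypothetical, and the thread count and buffer size in the example are assumed values, not recommendations.

```python
def min_direct_memory_bytes(num_threads: int, buffer_size_bytes: int) -> int:
    """(druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes"""
    return (num_threads + 1) * buffer_size_bytes


# Example: 4 processing threads with 100,000,000-byte buffers (assumed values).
minimum = min_direct_memory_bytes(4, 100_000_000)
print(f"-XX:MaxDirectMemorySize={minimum}")
```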
It may be necessary to override one or more memory-related parameters if you run into one of the [known issues](./msq-known-issues.md) around memory usage.
## Limits
Knowing the limits for the MSQ task engine can help you troubleshoot any [errors](#error-codes) that you encounter. Many of the errors occur as a result of reaching a limit.
The following table lists query limits:
|Limit|Value|Error if exceeded|
|-----|-----|-----------------|
| Size of an individual row written to a frame. Row size when written to a frame may differ from the original row size. | 1 MB | `RowTooLarge` |
| Number of segment-granular time chunks encountered during ingestion. | 5,000 | `TooManyBuckets` |
| Number of input files/segments per worker. | 10,000 | `TooManyInputFiles` |
| Number of output partitions for any one stage. Number of segments generated during ingestion. |25,000 | `TooManyPartitions` |
| Number of output columns for any one stage. | 2,000 | `TooManyColumns` |
| Number of workers for any one stage. | Hard limit is 1,000. Memory-dependent soft limit may be lower. | `TooManyWorkers` |
| Maximum memory occupied by broadcasted tables. | 30% of each [processor memory bundle](#memory-usage). | `BroadcastTablesTooLarge` |
## Error codes
The following table describes error codes you may encounter in the `multiStageQuery.payload.status.errorReport.error.errorCode` field:
|Code|Meaning|Additional fields|
|----|-----------|----|
| BroadcastTablesTooLarge | The size of the broadcast tables, used on the right-hand side of joins, exceeded the memory reserved for them in a worker task. | `maxBroadcastTablesSize`: Memory reserved for the broadcast tables, measured in bytes. |
| Canceled | The query was canceled. Common reasons for cancellation:<br /><br /><ul><li>User-initiated shutdown of the controller task via the `/druid/indexer/v1/task/{taskId}/shutdown` API.</li><li>Restart or failure of the server process that was running the controller task.</li></ul>| |
| CannotParseExternalData | A worker task could not parse data from an external datasource. | |
| ColumnNameRestricted| The query uses a restricted column name. | |
| ColumnTypeNotSupported| Writing to or reading from a particular column type is not supported. | |
| ColumnTypeNotSupported | The query attempted to use a column type that is not supported by the frame format. This occurs with ARRAY types, which are not yet implemented for frames. | `columnName`<br /> <br />`columnType` |
| InsertCannotAllocateSegment | The controller task could not allocate a new segment ID due to conflict with existing segments or pending segments. Common reasons for such conflicts:<br /> <br /><ul><li>Attempting to mix different granularities in the same intervals of the same datasource.</li><li>Prior ingestions that used non-extendable shard specs.</li></ul>| `dataSource`<br /> <br />`interval`: The interval for the attempted new segment allocation. |
| InsertCannotBeEmpty | An INSERT or REPLACE query did not generate any output rows in a situation where output rows are required for success. This can happen for INSERT or REPLACE queries with `PARTITIONED BY` set to something other than `ALL` or `ALL TIME`. | `dataSource` |
| InsertCannotOrderByDescending | An INSERT query contained a `CLUSTERED BY` expression in descending order. Druid's segment generation code only supports ascending order. | `columnName` |
| InsertCannotReplaceExistingSegment | A REPLACE query cannot proceed because an existing segment partially overlaps the bounds of the OVERWRITE clause, and the portion within the bounds is not fully overshadowed by query results. <br /> <br />There are two ways to address this without modifying your query:<ul><li>Shrink the OVERWRITE filter to match the query results.</li><li>Expand the OVERWRITE filter to fully contain the existing segment.</li></ul>| `segmentId`: The existing segment. |
| InsertLockPreempted | An INSERT or REPLACE query was canceled by a higher-priority ingestion job, such as a real-time ingestion task. | |
| InsertTimeNull | An INSERT or REPLACE query encountered a null timestamp in the `__time` field.<br /><br />This can happen due to using an expression like `TIME_PARSE(timestamp) AS __time` with a timestamp that cannot be parsed. (TIME_PARSE returns null when it cannot parse a timestamp.) In this case, try parsing your timestamps using a different function or pattern.<br /><br />If your timestamps may genuinely be null, consider using COALESCE to provide a default value. One option is CURRENT_TIMESTAMP, which represents the start time of the job. | |
| InsertTimeOutOfBounds | A REPLACE query generated a timestamp outside the bounds of the TIMESTAMP parameter for your OVERWRITE WHERE clause.<br /> <br />To avoid this error, verify that the time range you specified is valid. | `interval`: The time chunk interval corresponding to the out-of-bounds timestamp. |
| InvalidNullByte | A string column included a null byte. Null bytes in strings are not permitted. | `column`: The column that included the null byte |
| QueryNotSupported | QueryKit could not translate the provided native query to a multi-stage query.<br /> <br />This can happen if the query uses features that aren't supported, like GROUPING SETS. | |
| RowTooLarge | The query tried to process a row that was too large to write to a single frame. See the [Limits](#limits) table for the specific limit on frame size. Note that the effective maximum row size is smaller than the maximum frame size due to alignment considerations during frame writing. | `maxFrameSize`: The limit on the frame size. |
| TaskStartTimeout | Unable to launch all the worker tasks in time. <br /> <br />There might be insufficient available slots to start all the worker tasks simultaneously.<br /> <br /> Try splitting up the query into smaller chunks with a lower `maxNumTasks` value. Another option is to increase capacity. | |
| TooManyBuckets | Exceeded the number of partition buckets for a stage. Partition buckets are only used for `segmentGranularity` during INSERT queries. The most common reason for this error is that your `segmentGranularity` is too narrow relative to the data. See the [Limits](./msq-concepts.md#limits) table for the specific limit. | `maxBuckets`: The limit on buckets. |
| TooManyInputFiles | Exceeded the number of input files/segments per worker. See the [Limits](./msq-concepts.md#limits) table for the specific limit. | `numInputFiles`: The total number of input files/segments for the stage.<br /><br />`maxInputFiles`: The maximum number of input files/segments per worker per stage.<br /><br />`minNumWorker`: The minimum number of workers required for a successful run. |
| TooManyPartitions | Exceeded the number of partitions for a stage. The most common reason for this is that the final stage of an INSERT or REPLACE query generated too many segments. See the [Limits](./msq-concepts.md#limits) table for the specific limit. | `maxPartitions`: The limit on partitions which was exceeded. |
| TooManyColumns | Exceeded the number of columns for a stage. See the [Limits](#limits) table for the specific limit. | `maxColumns`: The limit on columns which was exceeded. |
| TooManyWarnings | Exceeded the allowed number of warnings of a particular type. | `rootErrorCode`: The error code corresponding to the exception that exceeded the required limit. <br /><br />`maxWarnings`: Maximum number of warnings that are allowed for the corresponding `rootErrorCode`. |
| TooManyWorkers | Exceeded the supported number of workers running simultaneously. See the [Limits](#limits) table for the specific limit. | `workers`: The number of simultaneously running workers that exceeded a hard or soft limit. This may be larger than the number of workers in any one stage if multiple stages are running simultaneously. <br /><br />`maxWorkers`: The hard or soft limit on workers that was exceeded. |
| NotEnoughMemory | Insufficient memory to launch a stage. | `serverMemory`: The amount of memory available to a single process.<br /><br />`serverWorkers`: The number of workers running in a single process.<br /><br />`serverThreads`: The number of threads in a single process. |
| WorkerFailed | A worker task failed unexpectedly. | `workerTaskId`: The ID of the worker task. |
| WorkerRpcFailed | A remote procedure call to a worker task failed and could not recover. | `workerTaskId`: The ID of the worker task. |
| UnknownError | All other errors. | |
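Reading the error code out of a task report amounts to walking the field path named above. The Python sketch below assumes the report has already been parsed into a dictionary. The sample report is a made-up minimal structure for illustration, not real Druid output, and `extract_error_code` is a hypothetical helper.

```python
from typing import Any, Optional

# Field path taken from the text above.
ERROR_CODE_PATH = (
    "multiStageQuery", "payload", "status", "errorReport", "error", "errorCode",
)


def extract_error_code(report: dict) -> Optional[str]:
    """Return the error code from a parsed task report, or None if absent."""
    node: Any = report
    for key in ERROR_CODE_PATH:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical, heavily trimmed report used only to exercise the helper.
sample_report = {
    "multiStageQuery": {
        "payload": {
            "status": {"errorReport": {"error": {"errorCode": "TooManyBuckets"}}}
        }
    }
}
print(extract_error_code(sample_report))
```

Returning `None` for a missing path keeps the helper usable on reports from queries that succeeded and therefore carry no error report.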


@ -1,169 +0,0 @@
---
id: reference
title: SQL-based ingestion reference
sidebar_label: Reference
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
This topic is a reference guide for the multi-stage query architecture in Apache Druid.
## Context parameters
In addition to the Druid SQL [context parameters](../querying/sql-query-context.md), the multi-stage query task engine accepts certain context parameters that are specific to it.
Use context parameters alongside your queries to customize the behavior of the query. If you're using the API, include the context parameters in the query context when you submit a query:
```json
{
"query": "SELECT 1 + 1",
"context": {
"<key>": "<value>",
"maxNumTasks": 3
}
}
```
If you're using the Druid console, you can specify the context parameters through various UI options.
The following table lists the context parameters for the MSQ task engine:
|Parameter|Description|Default value|
|---------|-----------|-------------|
| maxNumTasks | SELECT, INSERT, REPLACE<br /><br />The maximum total number of tasks to launch, including the controller task. The lowest possible value for this setting is 2: one controller and one worker. All tasks must be able to launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` error code after approximately 10 minutes.<br /><br />May also be provided as `numTasks`. If both are present, `maxNumTasks` takes priority.| 2 |
| taskAssignment | SELECT, INSERT, REPLACE<br /><br />Determines how many tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be determined through directory listing (for example: local files, S3, GCS, HDFS) uses as few tasks as possible without exceeding 10 GiB or 10,000 files per task, unless exceeding these limits is necessary to stay within `maxNumTasks`. When file sizes cannot be determined through directory listing (for example: http), behaves the same as `max`.</li></ul> | `max` |
| finalizeAggregations | SELECT, INSERT, REPLACE<br /><br />Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | true |
| rowsInMemory | INSERT or REPLACE<br /><br />Maximum number of rows to store in memory at once before flushing to disk during the segment generation process. Ignored for non-INSERT queries. In most cases, use the default value. You may need to override the default if you run into one of the [known issues](./msq-known-issues.md) around memory usage. | 100,000 |
| segmentSortOrder | INSERT or REPLACE<br /><br />Normally, Druid sorts rows in individual segments using `__time` first, followed by the [CLUSTERED BY](./index.md#clustered-by) clause. When you set `segmentSortOrder`, Druid sorts rows in segments using this column list first, followed by the CLUSTERED BY order.<br /><br />You provide the column list as comma-separated values or as a JSON array in string form. If your query includes `__time`, then this list must begin with `__time`. For example, consider an INSERT query that uses `CLUSTERED BY country` and has `segmentSortOrder` set to `__time,city`. Within each time chunk, Druid assigns rows to segments based on `country`, and then within each of those segments, Druid sorts those rows by `__time` first, then `city`, then `country`. | empty list |
| maxParseExceptions| SELECT, INSERT, REPLACE<br /><br />Maximum number of parse exceptions that are ignored while executing the query before it stops with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value to -1.| 0 |
| rowsPerSegment | INSERT or REPLACE<br /><br />The number of rows per segment to target. The actual number of rows per segment may be somewhat higher or lower than this number. In most cases, use the default. For general information about sizing rows per segment, see [Segment Size Optimization](../operations/segment-optimization.md). | 3,000,000 |
| sqlTimeZone | Sets the time zone for this connection, which affects how time functions and timestamp literals behave. Use a time zone name like "America/Los_Angeles" or offset like "-08:00".| `druid.sql.planner.sqlTimeZone` on the Broker (default: UTC)|
| useApproximateCountDistinct | Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.| `druid.sql.planner.useApproximateCountDistinct` on the Broker (default: true)|
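The `auto` behavior of `taskAssignment` described in the table can be approximated with the following Python sketch. It is a simplification built only from the two limits stated above (10 GiB and 10,000 files per task); the engine's actual assignment logic may differ, and the function name is hypothetical.

```python
MAX_BYTES_PER_TASK = 10 * 1024**3  # 10 GiB per task
MAX_FILES_PER_TASK = 10_000        # 10,000 files per task


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def auto_worker_count(total_bytes: int, num_files: int, max_num_tasks: int) -> int:
    """Fewest workers that respect both limits, capped by maxNumTasks."""
    max_workers = max_num_tasks - 1  # one task is always the controller
    needed = max(
        1,
        ceil_div(total_bytes, MAX_BYTES_PER_TASK),
        ceil_div(num_files, MAX_FILES_PER_TASK),
    )
    # The limits may be exceeded when that is necessary to stay within maxNumTasks.
    return min(needed, max_workers)


print(auto_worker_count(total_bytes=25 * 1024**3, num_files=100, max_num_tasks=10))
```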
## Error codes
Error codes have corresponding human-readable messages that explain the error. For more information about the error codes, see [Error codes](./msq-concepts.md#error-codes).
## SQL syntax
The MSQ task engine has three primary SQL functions:
- EXTERN
- INSERT
- REPLACE
For information about using these functions and their corresponding examples, see [MSQ task engine query syntax](./index.md#msq-task-engine-query-syntax). For information about adjusting the shape of your data, see [Adjust query behavior](./index.md#adjust-query-behavior).
### EXTERN
Use the EXTERN function to read external data.
Function format:
```sql
SELECT
<column>
FROM TABLE(
EXTERN(
'<Druid input source>',
'<Druid input format>',
'<row signature>'
)
)
```
EXTERN consists of the following parts:
1. Any [Druid input source](../ingestion/native-batch-input-source.md) as a JSON-encoded string.
2. Any [Druid input format](../ingestion/data-formats.md) as a JSON-encoded string.
3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a `name` and a `type`. The type can be `string`, `long`, `double`, or `float`. This row signature is used to map the external data into the SQL layer.
### INSERT
Use the INSERT function to insert data.
Unlike standard SQL, INSERT inserts data according to column name and not positionally. This means that it is important for the output column names of subsequent INSERT queries to be the same as the table. Do not rely on their positions within the SELECT clause.
Function format:
```sql
INSERT INTO <table name>
SELECT
<column>
FROM <table>
PARTITIONED BY <time frame>
```
INSERT consists of the following parts:
1. Optional [context parameters](./msq-reference.md#context-parameters).
2. An `INSERT INTO <dataSource>` clause at the start of your query, such as `INSERT INTO your-table`.
3. A clause for the data you want to insert, such as `SELECT...FROM TABLE...`. You can use EXTERN to reference external tables using the following format: `TABLE(EXTERN(...))`.
4. A [PARTITIONED BY](./index.md#partitioned-by) clause for your INSERT statement. For example, use PARTITIONED BY DAY for daily partitioning or PARTITIONED BY ALL TIME to skip time partitioning completely.
5. An optional [CLUSTERED BY](./index.md#clustered-by) clause.
### REPLACE
You can use the REPLACE function to replace all or some of the data.
Unlike standard SQL, REPLACE inserts data according to column name and not positionally. This means that it is important for the output column names of subsequent REPLACE queries to be the same as the table. Do not rely on their positions within the SELECT clause.
#### REPLACE all data
Function format to replace all data:
```sql
REPLACE INTO <target table>
OVERWRITE ALL
SELECT
TIME_PARSE("timestamp") AS __time,
<column>
FROM <source table>
PARTITIONED BY <time>
```
#### REPLACE specific data
Function format to replace specific data:
```sql
REPLACE INTO <target table>
OVERWRITE WHERE __time >= TIMESTAMP '<lower bound>' AND __time < TIMESTAMP '<upper bound>'
SELECT
TIME_PARSE("timestamp") AS __time,
<column>
FROM <source table>
PARTITIONED BY <time>
```
REPLACE consists of the following parts:
1. Optional [context parameters](./msq-reference.md#context-parameters).
2. A `REPLACE INTO <dataSource>` clause at the start of your query, such as `REPLACE INTO your-table`.
3. An OVERWRITE clause after the datasource, either OVERWRITE ALL or OVERWRITE WHERE:
- OVERWRITE ALL replaces the entire existing datasource with the results of the query.
- OVERWRITE WHERE drops the time segments that match the condition you set. Conditions are based on the `__time` column and use the format `__time [< > = <= >=] TIMESTAMP`. Use them with AND, OR, and NOT between them, inclusive of the timestamps specified. For example, see [REPLACE INTO ... OVERWRITE WHERE ... SELECT](./index.md#replace-some-data).
4. A clause for the actual data you want to use for the replacement.
5. A [PARTITIONED BY](./index.md#partitioned-by) clause to your REPLACE statement. For example, use PARTITIONED BY DAY for daily partitioning, or PARTITIONED BY ALL TIME to skip time partitioning completely.
6. An optional [CLUSTERED BY](./index.md#clustered-by) clause.

View File

@ -1,43 +0,0 @@
---
id: security
title: SQL-based ingestion security
sidebar_label: Security
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
All authenticated users can use the multi-stage query task engine (MSQ task engine) through the UI and API if the extension is loaded. However, without additional permissions, users are not able to issue queries that read or write Druid datasources or external data. The permission you need depends on what you are trying to do.
The permission required to submit a query depends on the type of query:
- SELECT from a Druid datasource requires the READ DATASOURCE permission on that
datasource.
- INSERT or REPLACE into a Druid datasource requires the WRITE DATASOURCE permission on that
datasource.
- EXTERN references to external data require READ permission on the resource name "EXTERNAL" of the resource type "EXTERNAL". Users without the correct permission encounter a 403 error when trying to run queries that include EXTERN.
Query tasks that you submit to the MSQ task engine are Overlord tasks, so they follow the Overlord's (indexer) model. This means that users with access to the Overlord API can perform some actions even if they didn't submit the query. The actions include retrieving the status or canceling a query. For more information about the Overlord API and the task API, see [APIs for SQL-based ingestion](./msq-api.md).
To interact with a query through the Overlord API, you need the following permissions:
- INSERT or REPLACE queries: You must have READ DATASOURCE permission on the output datasource.
- SELECT queries: You must have read permissions on the `__query_select` datasource, which is a stub datasource that gets created.

View File

@ -0,0 +1,258 @@
---
id: reference
title: SQL-based ingestion reference
sidebar_label: Reference
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
## SQL reference
This topic is a reference guide for the multi-stage query architecture in Apache Druid. For examples of real-world
usage, refer to the [Examples](examples.md) page.
### EXTERN
Use the EXTERN function to read external data.
Function format:
```sql
SELECT
<column>
FROM TABLE(
EXTERN(
'<Druid input source>',
'<Druid input format>',
'<row signature>'
)
)
```
EXTERN consists of the following parts:
1. Any [Druid input source](../ingestion/native-batch-input-source.md) as a JSON-encoded string.
2. Any [Druid input format](../ingestion/data-formats.md) as a JSON-encoded string.
3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a `name` and a `type`. The type can be `string`, `long`, `double`, or `float`. This row signature is used to map the external data into the SQL layer.
For more information, see [Read external data with EXTERN](concepts.md#extern).
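Because all three EXTERN arguments are JSON-encoded strings, it can be convenient to generate them programmatically. The Python sketch below is one way to do so. The helper names, the example URL, and the column names are hypothetical; only the overall `TABLE(EXTERN(...))` shape comes from the format above.

```python
import json


def sql_literal(value) -> str:
    """JSON-encode a value and wrap it as a single-quoted SQL string literal."""
    # Double any single quotes so the SQL literal stays well-formed.
    return "'" + json.dumps(value).replace("'", "''") + "'"


def extern_sql(input_source: dict, input_format: dict, signature: list) -> str:
    """Build a TABLE(EXTERN(...)) expression from its three parts."""
    args = ",\n    ".join(
        sql_literal(part) for part in (input_source, input_format, signature)
    )
    return f"TABLE(\n  EXTERN(\n    {args}\n  )\n)"


print(extern_sql(
    {"type": "http", "uris": ["https://example.com/data.json"]},  # placeholder URL
    {"type": "json"},
    [{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}],
))
```

The output can be placed after `FROM` in a SELECT, INSERT, or REPLACE query.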
### INSERT
Use the INSERT statement to insert data.
Unlike standard SQL, INSERT loads data into the target table according to column name, not positionally. If necessary,
use `AS` in your SELECT column list to assign the correct names. Do not rely on their positions within the SELECT
clause.
Statement format:
```sql
INSERT INTO <table name>
< SELECT query >
PARTITIONED BY <time frame>
[ CLUSTERED BY <column list> ]
```
INSERT consists of the following parts:
1. Optional [context parameters](./reference.md#context-parameters).
2. An `INSERT INTO <dataSource>` clause at the start of your query, such as `INSERT INTO your-table`.
3. A clause for the data you want to insert, such as `SELECT ... FROM ...`. You can use [EXTERN](#extern) to reference external tables using `FROM TABLE(EXTERN(...))`.
4. A [PARTITIONED BY](#partitioned-by) clause, such as `PARTITIONED BY DAY`.
5. An optional [CLUSTERED BY](#clustered-by) clause.
For more information, see [Load data with INSERT](concepts.md#insert).
### REPLACE
Use the REPLACE statement to replace all or some of the data.
Unlike standard SQL, REPLACE loads data into the target table according to column name, not positionally. If necessary,
use `AS` in your SELECT column list to assign the correct names. Do not rely on their positions within the SELECT
clause.
#### REPLACE all data
Statement format to replace all data:
```sql
REPLACE INTO <target table>
OVERWRITE ALL
< SELECT query >
PARTITIONED BY <time granularity>
[ CLUSTERED BY <column list> ]
```
#### REPLACE specific time ranges
Statement format to replace specific time ranges:
```sql
REPLACE INTO <target table>
OVERWRITE WHERE __time >= TIMESTAMP '<lower bound>' AND __time < TIMESTAMP '<upper bound>'
< SELECT query >
PARTITIONED BY <time granularity>
[ CLUSTERED BY <column list> ]
```
REPLACE consists of the following parts:
1. Optional [context parameters](./reference.md#context-parameters).
2. A `REPLACE INTO <dataSource>` clause at the start of your query, such as `REPLACE INTO "your-table"`.
3. An OVERWRITE clause after the datasource, either OVERWRITE ALL or OVERWRITE WHERE:
- OVERWRITE ALL replaces the entire existing datasource with the results of the query.
- OVERWRITE WHERE drops the time segments that match the condition you set. Conditions are based on the `__time`
column and use the format `__time [< > = <= >=] TIMESTAMP`. Use them with AND, OR, and NOT between them, inclusive
of the timestamps specified. No other expressions or functions are valid in OVERWRITE.
4. A clause for the actual data you want to use for the replacement.
5. A [PARTITIONED BY](#partitioned-by) clause, such as `PARTITIONED BY DAY`.
6. An optional [CLUSTERED BY](#clustered-by) clause.
For more information, see [Overwrite data with REPLACE](concepts.md#replace).
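The two REPLACE formats differ only in the OVERWRITE clause, which the following Python sketch makes explicit. The function and its parameters are hypothetical, and the SELECT query and table name in the example are placeholders.

```python
from typing import Optional


def replace_statement(
    table: str,
    select_query: str,
    partitioned_by: str,
    lower: Optional[str] = None,
    upper: Optional[str] = None,
) -> str:
    """Assemble a REPLACE statement; omit both bounds to overwrite all data."""
    if (lower is None) != (upper is None):
        raise ValueError("provide both bounds or neither")
    if lower is None:
        overwrite = "OVERWRITE ALL"
    else:
        overwrite = (
            f"OVERWRITE WHERE __time >= TIMESTAMP '{lower}' "
            f"AND __time < TIMESTAMP '{upper}'"
        )
    return "\n".join([
        f'REPLACE INTO "{table}"',
        overwrite,
        select_query,
        f"PARTITIONED BY {partitioned_by}",
    ])


print(replace_statement(
    "your-table",
    'SELECT * FROM "your-table"',  # placeholder SELECT query
    "DAY",
    lower="2022-01-01",
    upper="2022-02-01",
))
```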
### PARTITIONED BY
The `PARTITIONED BY <time granularity>` clause is required for [INSERT](#insert) and [REPLACE](#replace). See
[Partitioning](concepts.md#partitioning) for details.
The following granularity arguments are accepted:
- Time unit: `HOUR`, `DAY`, `MONTH`, or `YEAR`. Equivalent to `FLOOR(__time TO TimeUnit)`.
- `TIME_FLOOR(__time, 'granularity_string')`, where granularity_string is one of the ISO 8601 periods listed below. The
first argument must be `__time`.
- `FLOOR(__time TO TimeUnit)`, where `TimeUnit` is any unit supported by the [FLOOR function](../querying/sql-scalar.md#date-and-time-functions). The first argument must be `__time`.
- `ALL` or `ALL TIME`, which effectively disables time partitioning by placing all data in a single time chunk. To use
LIMIT or OFFSET at the outer level of your INSERT or REPLACE query, you must set PARTITIONED BY to ALL or ALL TIME.
The following ISO 8601 periods are supported for `TIME_FLOOR`:
- PT1S
- PT1M
- PT5M
- PT10M
- PT15M
- PT30M
- PT1H
- PT6H
- P1D
- P1W
- P1M
- P3M
- P1Y
For more information about partitioning, see [Partitioning](concepts.md#partitioning).
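A client that generates queries can check a granularity against the lists above before submitting. The Python sketch below covers time units, `ALL`/`ALL TIME`, and the `TIME_FLOOR` periods; it does not cover the `FLOOR(__time TO TimeUnit)` form. The function name is hypothetical.

```python
TIME_UNITS = {"HOUR", "DAY", "MONTH", "YEAR"}
ALL_TIME = {"ALL", "ALL TIME"}
SUPPORTED_PERIODS = {
    "PT1S", "PT1M", "PT5M", "PT10M", "PT15M", "PT30M",
    "PT1H", "PT6H", "P1D", "P1W", "P1M", "P3M", "P1Y",
}


def partitioned_by(granularity: str) -> str:
    """Return a PARTITIONED BY clause, or raise if the granularity is not listed."""
    g = granularity.strip().upper()
    if g in TIME_UNITS or g in ALL_TIME:
        return f"PARTITIONED BY {g}"
    if g in SUPPORTED_PERIODS:
        return f"PARTITIONED BY TIME_FLOOR(__time, '{g}')"
    raise ValueError(f"unsupported granularity: {granularity!r}")


print(partitioned_by("day"))
print(partitioned_by("PT1H"))
```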
### CLUSTERED BY
The `CLUSTERED BY <column list>` clause is optional for [INSERT](#insert) and [REPLACE](#replace). It accepts a list of
column names or expressions.
For more information about clustering, see [Clustering](concepts.md#clustering).
<a name="context"></a>
## Context parameters
In addition to the Druid SQL [context parameters](../querying/sql-query-context.md), the multi-stage query task engine accepts certain context parameters that are specific to it.
Use context parameters alongside your queries to customize the behavior of the query. If you're using the API, include the context parameters in the query context when you submit a query:
```json
{
"query": "SELECT 1 + 1",
"context": {
"<key>": "<value>",
"maxNumTasks": 3
}
}
```
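
If you build the request body programmatically, a minimal Python sketch could look like the following. The function name `build_query_payload` is hypothetical; the sketch only assembles the JSON body shown above and does not send it anywhere.

```python
import json


def build_query_payload(sql: str, **context) -> str:
    """Return a JSON request body with the query and its context parameters."""
    return json.dumps({"query": sql, "context": context}, indent=2)


# Context parameters are passed as keyword arguments.
print(build_query_payload("SELECT 1 + 1", maxNumTasks=3))
```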
If you're using the web console, you can specify the context parameters through various UI options.
The following table lists the context parameters for the MSQ task engine:
|Parameter|Description|Default value|
|---------|-----------|-------------|
| maxNumTasks | SELECT, INSERT, REPLACE<br /><br />The maximum total number of tasks to launch, including the controller task. The lowest possible value for this setting is 2: one controller and one worker. All tasks must be able to launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` error code after approximately 10 minutes.<br /><br />May also be provided as `numTasks`. If both are present, `maxNumTasks` takes priority.| 2 |
| taskAssignment | SELECT, INSERT, REPLACE<br /><br />Determines how many tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be determined through directory listing (for example: local files, S3, GCS, HDFS) uses as few tasks as possible without exceeding 10 GiB or 10,000 files per task, unless exceeding these limits is necessary to stay within `maxNumTasks`. When file sizes cannot be determined through directory listing (for example: http), behaves the same as `max`.</li></ul> | `max` |
| finalizeAggregations | SELECT, INSERT, REPLACE<br /><br />Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | true |
| rowsInMemory | INSERT or REPLACE<br /><br />Maximum number of rows to store in memory at once before flushing to disk during the segment generation process. Ignored for non-INSERT queries. In most cases, use the default value. You may need to override the default if you run into one of the [known issues](./known-issues.md) around memory usage. | 100,000 |
| segmentSortOrder | INSERT or REPLACE<br /><br />Normally, Druid sorts rows in individual segments using `__time` first, followed by the [CLUSTERED BY](#clustered-by) clause. When you set `segmentSortOrder`, Druid sorts rows in segments using this column list first, followed by the CLUSTERED BY order.<br /><br />You provide the column list as comma-separated values or as a JSON array in string form. If your query includes `__time`, then this list must begin with `__time`. For example, consider an INSERT query that uses `CLUSTERED BY country` and has `segmentSortOrder` set to `__time,city`. Within each time chunk, Druid assigns rows to segments based on `country`, and then within each of those segments, Druid sorts those rows by `__time` first, then `city`, then `country`. | empty list |
| maxParseExceptions| SELECT, INSERT, REPLACE<br /><br />Maximum number of parse exceptions that are ignored while executing the query before it stops with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value to -1.| 0 |
| rowsPerSegment | INSERT or REPLACE<br /><br />The number of rows per segment to target. The actual number of rows per segment may be somewhat higher or lower than this number. In most cases, use the default. For general information about sizing rows per segment, see [Segment Size Optimization](../operations/segment-optimization.md). | 3,000,000 |
| sqlTimeZone | Sets the time zone for this connection, which affects how time functions and timestamp literals behave. Use a time zone name like "America/Los_Angeles" or offset like "-08:00".| `druid.sql.planner.sqlTimeZone` on the Broker (default: UTC)|
| useApproximateCountDistinct | Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.| `druid.sql.planner.useApproximateCountDistinct` on the Broker (default: true)|
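
To make the `segmentSortOrder` example concrete, the following Python sketch computes the resulting sort order from a `segmentSortOrder` value and a CLUSTERED BY column list. The function `effective_sort_order` is hypothetical and only mirrors the description in the table. In particular, skipping a CLUSTERED BY column that already appears in `segmentSortOrder` is an assumption of this sketch, not documented behavior.

```python
import json


def effective_sort_order(clustered_by, segment_sort_order=""):
    """Return the column order used to sort rows within a segment (illustrative only)."""
    text = segment_sort_order.strip()
    if not text:
        # Default behavior: __time first, followed by the CLUSTERED BY columns.
        leading = ["__time"]
    elif text.startswith("["):
        # JSON array in string form, for example '["__time", "city"]'.
        leading = json.loads(text)
    else:
        # Comma-separated values, for example "__time,city".
        leading = [col.strip() for col in text.split(",") if col.strip()]

    order = list(leading)
    for col in clustered_by:
        if col not in order:  # assumption: a column is not listed twice
            order.append(col)
    return order


print(effective_sort_order(["country"], "__time,city"))
```

With `CLUSTERED BY country` and `segmentSortOrder` set to `__time,city`, the sketch yields `__time`, `city`, `country`, which matches the example in the table.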
## Limits
Knowing the limits for the MSQ task engine can help you troubleshoot any [errors](#error-codes) that you encounter. Many of the errors occur as a result of reaching a limit.
The following table lists query limits:
|Limit|Value|Error if exceeded|
|-----|-----|-----------------|
| Size of an individual row written to a frame. Row size when written to a frame may differ from the original row size. | 1 MB | `RowTooLarge` |
| Number of segment-granular time chunks encountered during ingestion. | 5,000 | `TooManyBuckets` |
| Number of input files/segments per worker. | 10,000 | `TooManyInputFiles` |
| Number of output partitions for any one stage. Number of segments generated during ingestion. | 25,000 | `TooManyPartitions` |
| Number of output columns for any one stage. | 2,000 | `TooManyColumns` |
| Number of workers for any one stage. | Hard limit is 1,000. Memory-dependent soft limit may be lower. | `TooManyWorkers` |
| Maximum memory occupied by broadcasted tables. | 30% of each [processor memory bundle](concepts.md#memory-usage). | `BroadcastTablesTooLarge` |
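
As a worked example of how these limits interact with `maxNumTasks`, the following Python sketch estimates the smallest `maxNumTasks` value that keeps each worker within the input file limit. The function `min_num_tasks` is hypothetical, and the arithmetic is a simplification based only on the table above; Druid's own calculation may take other factors into account.

```python
import math

# Values taken from the limits table above.
MAX_INPUT_FILES_PER_WORKER = 10_000
MAX_WORKERS_HARD_LIMIT = 1_000


def min_num_tasks(num_input_files: int) -> int:
    """Estimate the smallest maxNumTasks that avoids TooManyInputFiles."""
    workers = max(1, math.ceil(num_input_files / MAX_INPUT_FILES_PER_WORKER))
    if workers > MAX_WORKERS_HARD_LIMIT:
        raise ValueError("Input needs more workers than the hard limit allows")
    # maxNumTasks counts the controller task in addition to the workers.
    return workers + 1


print(min_num_tasks(25_000))
```

For 25,000 input files, the estimate is three workers plus one controller, so `maxNumTasks` would need to be at least 4.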
<a name="errors"></a>
## Error codes
The following table describes error codes you may encounter in the `multiStageQuery.payload.status.errorReport.error.errorCode` field:
|Code|Meaning|Additional fields|
|----|-----------|----|
| BroadcastTablesTooLarge | The size of the broadcast tables, used on the right-hand side of joins, exceeded the memory reserved for them in a worker task. | `maxBroadcastTablesSize`: Memory reserved for the broadcast tables, measured in bytes. |
| Canceled | The query was canceled. Common reasons for cancellation:<br /><br /><ul><li>User-initiated shutdown of the controller task via the `/druid/indexer/v1/task/{taskId}/shutdown` API.</li><li>Restart or failure of the server process that was running the controller task.</li></ul>| |
| CannotParseExternalData | A worker task could not parse data from an external datasource. | |
| ColumnNameRestricted| The query uses a restricted column name. | |
| ColumnTypeNotSupported | Writing to or reading from a particular column type is not supported. | |
| ColumnTypeNotSupported | The query attempted to use a column type that is not supported by the frame format. This occurs with ARRAY types, which are not yet implemented for frames. | `columnName`<br /> <br />`columnType` |
| InsertCannotAllocateSegment | The controller task could not allocate a new segment ID due to conflict with existing segments or pending segments. Common reasons for such conflicts:<br /> <br /><ul><li>Attempting to mix different granularities in the same intervals of the same datasource.</li><li>Prior ingestions that used non-extendable shard specs.</li></ul>| `dataSource`<br /> <br />`interval`: The interval for the attempted new segment allocation. |
| InsertCannotBeEmpty | An INSERT or REPLACE query did not generate any output rows in a situation where output rows are required for success. This can happen for INSERT or REPLACE queries with `PARTITIONED BY` set to something other than `ALL` or `ALL TIME`. | `dataSource` |
| InsertCannotOrderByDescending | An INSERT query contained a `CLUSTERED BY` expression in descending order. Druid's segment generation code only supports ascending order. | `columnName` |
| InsertCannotReplaceExistingSegment | A REPLACE query cannot proceed because an existing segment partially overlaps those bounds, and the portion within the bounds is not fully overshadowed by query results. <br /> <br />There are two ways to address this without modifying your query:<ul><li>Shrink the OVERLAP filter to match the query results.</li><li>Expand the OVERLAP filter to fully contain the existing segment.</li></ul>| `segmentId`: The existing segment. |
| InsertLockPreempted | An INSERT or REPLACE query was canceled by a higher-priority ingestion job, such as a real-time ingestion task. | |
| InsertTimeNull | An INSERT or REPLACE query encountered a null timestamp in the `__time` field.<br /><br />This can happen due to using an expression like `TIME_PARSE(timestamp) AS __time` with a timestamp that cannot be parsed. (TIME_PARSE returns null when it cannot parse a timestamp.) In this case, try parsing your timestamps using a different function or pattern.<br /><br />If your timestamps may genuinely be null, consider using COALESCE to provide a default value. One option is CURRENT_TIMESTAMP, which represents the start time of the job. | |
| InsertTimeOutOfBounds | A REPLACE query generated a timestamp outside the bounds of the TIMESTAMP parameter for your OVERWRITE WHERE clause.<br /> <br />To avoid this error, verify that the OVERWRITE WHERE clause you specified is valid. | `interval`: The time chunk interval corresponding to the out-of-bounds timestamp. |
| InvalidNullByte | A string column included a null byte. Null bytes in strings are not permitted. | `column`: The column that included the null byte |
| QueryNotSupported | QueryKit could not translate the provided native query to a multi-stage query.<br /> <br />This can happen if the query uses features that aren't supported, like GROUPING SETS. | |
| RowTooLarge | The query tried to process a row that was too large to write to a single frame. See the [Limits](#limits) table for the specific limit on frame size. Note that the effective maximum row size is smaller than the maximum frame size due to alignment considerations during frame writing. | `maxFrameSize`: The limit on the frame size. |
| TaskStartTimeout | Unable to launch all the worker tasks in time. <br /> <br />There might be insufficient available slots to start all the worker tasks simultaneously.<br /> <br />Try splitting the query into smaller chunks with a lower `maxNumTasks` value. Another option is to increase capacity. | |
| TooManyBuckets | Exceeded the number of partition buckets for a stage. Partition buckets are only used for `segmentGranularity` during INSERT queries. The most common reason for this error is that your `segmentGranularity` is too narrow relative to the data. See the [Limits](#limits) table for the specific limit. | `maxBuckets`: The limit on buckets. |
| TooManyInputFiles | Exceeded the number of input files/segments per worker. See the [Limits](#limits) table for the specific limit. | `numInputFiles`: The total number of input files/segments for the stage.<br /><br />`maxInputFiles`: The maximum number of input files/segments per worker per stage.<br /><br />`minNumWorker`: The minimum number of workers required for a successful run. |
| TooManyPartitions | Exceeded the number of partitions for a stage. The most common reason for this is that the final stage of an INSERT or REPLACE query generated too many segments. See the [Limits](#limits) table for the specific limit. | `maxPartitions`: The limit on partitions that was exceeded. |
| TooManyColumns | Exceeded the number of columns for a stage. See the [Limits](#limits) table for the specific limit. | `maxColumns`: The limit on columns which was exceeded. |
| TooManyWarnings | Exceeded the allowed number of warnings of a particular type. | `rootErrorCode`: The error code corresponding to the exception that exceeded the required limit. <br /><br />`maxWarnings`: Maximum number of warnings that are allowed for the corresponding `rootErrorCode`. |
| TooManyWorkers | Exceeded the supported number of workers running simultaneously. See the [Limits](#limits) table for the specific limit. | `workers`: The number of simultaneously running workers that exceeded a hard or soft limit. This may be larger than the number of workers in any one stage if multiple stages are running simultaneously. <br /><br />`maxWorkers`: The hard or soft limit on workers that was exceeded. |
| NotEnoughMemory | Insufficient memory to launch a stage. | `serverMemory`: The amount of memory available to a single process.<br /><br />`serverWorkers`: The number of workers running in a single process.<br /><br />`serverThreads`: The number of threads in a single process. |
| WorkerFailed | A worker task failed unexpectedly. | `workerTaskId`: The ID of the worker task. |
| WorkerRpcFailed | A remote procedure call to a worker task failed and could not recover. | `workerTaskId`: The ID of the worker task. |
| UnknownError | All other errors. | |
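
When handling task reports in a script, you can read the error code by walking the field path given above. The following Python sketch assumes the report has already been parsed into a dictionary. The function `error_code` and the sample report are hypothetical and show only the fields needed for the lookup.

```python
ERROR_CODE_PATH = (
    "multiStageQuery", "payload", "status", "errorReport", "error", "errorCode",
)


def error_code(report: dict):
    """Return the MSQ error code from a parsed task report, or None if absent."""
    node = report
    for key in ERROR_CODE_PATH:
        if not isinstance(node, dict) or key not in node:
            return None  # no error reported, or the report has a different shape
        node = node[key]
    return node


# Hypothetical, heavily trimmed report for illustration.
sample_report = {
    "multiStageQuery": {
        "payload": {
            "status": {
                "errorReport": {
                    "error": {"errorCode": "TooManyBuckets", "maxBuckets": 5000}
                }
            }
        }
    }
}

print(error_code(sample_report))
```

Returning `None` instead of raising keeps the helper usable on reports from successful queries, which would not carry an error report.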

View File

@ -0,0 +1,49 @@
---
id: security
title: SQL-based ingestion security
sidebar_label: Security
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
All authenticated users can use the multi-stage query task engine (MSQ task engine) through the UI and API if the extension is loaded. However, without additional permissions, users are not able to issue queries that read or write Druid datasources or external data. The permission needed depends on what the user is trying to do.
To submit a query:
- SELECT from a Druid datasource requires the READ DATASOURCE permission on that datasource.
- [INSERT](reference.md#insert) or [REPLACE](reference.md#replace) into a Druid datasource requires the WRITE DATASOURCE
permission on that datasource.
- [EXTERN](reference.md#extern) requires READ permission on a resource named "EXTERNAL" with type "EXTERNAL". Users without the correct
permission encounter a 403 error when trying to run queries that include EXTERN.
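
The permissions listed above can be collected into a list of permission objects. The following Python sketch assumes the `resource` and `action` JSON shape used by Druid's basic security extension; the function `required_permissions` and the datasource name `wikipedia` are hypothetical.

```python
def required_permissions(reads=(), writes=(), uses_extern=False):
    """List the permissions needed to submit a query (illustrative sketch)."""
    perms = [
        {"resource": {"name": name, "type": "DATASOURCE"}, "action": "READ"}
        for name in reads
    ]
    perms += [
        {"resource": {"name": name, "type": "DATASOURCE"}, "action": "WRITE"}
        for name in writes
    ]
    if uses_extern:
        # EXTERN requires READ on the resource named "EXTERNAL" with type "EXTERNAL".
        perms.append(
            {"resource": {"name": "EXTERNAL", "type": "EXTERNAL"}, "action": "READ"}
        )
    return perms


# An INSERT into "wikipedia" that reads its input through EXTERN.
for perm in required_permissions(writes=["wikipedia"], uses_extern=True):
    print(perm)
```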
Once a query is submitted, it executes as a [`query_controller`](concepts.md#execution-flow) task. Query tasks that
users submit to the MSQ task engine are Overlord tasks, so they follow the Overlord's security model. This means that
users with access to the Overlord API can perform some actions even if they didn't submit the query, including
retrieving status or canceling a query. For more information about the Overlord API and the task API, see [APIs for
SQL-based ingestion](./api.md).
To interact with a query through the Overlord API, users need the following permissions:
- INSERT or REPLACE queries: Users must have READ DATASOURCE permission on the output datasource.
- SELECT queries: Users must have READ DATASOURCE permission on the `__query_select` datasource, which is a stub datasource that gets created.

View File

@ -465,7 +465,7 @@ Update overlord dynamic worker configuration.
* `/druid/coordinator/v1/compaction/progress?dataSource={dataSource}`
Returns the total size of segments awaiting compaction for the given dataSource.
The specified dataSource must have [automatic compaction](../ingestion/automatic-compaction.md) enabled.
The specified dataSource must have [automatic compaction](../data-management/automatic-compaction.md) enabled.
##### GET
@ -517,7 +517,7 @@ will be set for them.
* `/druid/coordinator/v1/config/compaction`
Creates or updates the [automatic compaction](../ingestion/automatic-compaction.md) config for a dataSource.
Creates or updates the [automatic compaction](../data-management/automatic-compaction.md) config for a dataSource.
See [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for configuration details.
@ -617,6 +617,8 @@ Retrieve information about the segments of a task.
Retrieve a [task completion report](../ingestion/tasks.md#task-reports) for a task. Only works for completed tasks.
<a name="task-submit"></a>
##### POST
* `/druid/indexer/v1/task`

View File

@ -130,7 +130,7 @@ The following command associates the permissions in the JSON file with the role
curl -i -v  -H "Content-Type: application/json" -u internal -X POST -d@perm.json  http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/roles/readRole/permissions
```
Note that the STATE and CONFIG permissions in `perm.json` are needed to see the data source in the Druid console. If only querying permissions are needed, the READ action is sufficient:
Note that the STATE and CONFIG permissions in `perm.json` are needed to see the data source in the web console. If only querying permissions are needed, the READ action is sufficient:
```
[{ "resource": { "name": "wikipedia", "type": "DATASOURCE" }, "action": "READ" }]

View File

@ -259,7 +259,7 @@ The total memory usage of the MiddleManager + Tasks:
If you use the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md) or [Kinesis Indexing Service](../development/extensions-core/kinesis-ingestion.md), the number of tasks required will depend on the number of partitions and your taskCount/replica settings.
On top of those requirements, allocating more task slots in your cluster is a good idea, so that you have free task
slots available for other tasks, such as [compaction tasks](../ingestion/compaction.md).
slots available for other tasks, such as [compaction tasks](../data-management/compaction.md).
###### Hadoop ingestion

View File

@ -23,7 +23,7 @@ title: "Retaining or automatically dropping data"
-->
In Apache Druid, Coordinator processes use rules to determine what data should be loaded to or dropped from the cluster. Rules are used for data retention and query execution, and are set via the [web console](./druid-console.md).
In Apache Druid, Coordinator processes use rules to determine what data should be loaded to or dropped from the cluster. Rules are used for data retention and query execution, and are set via the [web console](./web-console.md).
There are three types of rules, i.e., load rules, drop rules, and broadcast rules. Load rules indicate how segments should be assigned to different historical process tiers and how many replicas of a segment should exist in each tier.
Drop rules indicate when segments should be dropped entirely from the cluster. Finally, broadcast rules indicate how segments of different datasources should be co-located in Historical processes.

View File

@ -37,7 +37,7 @@ The following recommendations apply to the Druid cluster setup:
> **WARNING!** \
Druid administrators have the same OS permissions as the Unix user account running Druid. See [Authentication and authorization model](security-user-auth.md#authentication-and-authorization-model). If the Druid process is running under the OS root user account, then Druid administrators can read or write all files that the root account has access to, including sensitive files such as `/etc/passwd`.
* Enable authentication to the Druid cluster for production environments and other environments that can be accessed by untrusted networks.
* Enable authorization and do not expose the Druid console without authorization enabled. If authorization is not enabled, any user that has access to the web console has the same privileges as the operating system user that runs the Druid console process.
* Enable authorization and do not expose the web console without authorization enabled. If authorization is not enabled, any user that has access to the web console has the same privileges as the operating system user that runs the web console process.
* Grant users the minimum permissions necessary to perform their functions. For instance, do not allow users who only need to query data to write to data sources or view state.
* Do not provide plain-text passwords for production systems in configuration specs. For example, sensitive properties should not be in the `consumerProperties` field of `KafkaSupervisorIngestionSpec`. See [Environment variable dynamic config provider](./dynamic-config-provider.md#environment-variable-dynamic-config-provider) for more information.
* Disable JavaScript, as noted in the [Security section](https://druid.apache.org/docs/latest/development/javascript.html#security) of the JavaScript guide.
@ -51,7 +51,7 @@ The following recommendations apply to the network where Druid runs:
* When possible, use firewall and other network layer filtering to only expose Druid services and ports specifically required for your use case. For example, only expose Broker ports to downstream applications that execute queries. You can limit access to a specific IP address or IP range to further tighten and enhance security.
The following recommendation applies to Druid's authorization and authentication model:
* Only grant `WRITE` permissions to any `DATASOURCE` to trusted users. Druid's trust model assumes those users have the same privileges as the operating system user that runs the Druid console process. Additionally, users with `WRITE` permissions can make changes to datasources and they have access to both task and supervisor update (POST) APIs which may affect ingestion.
* Only grant `WRITE` permissions to any `DATASOURCE` to trusted users. Druid's trust model assumes those users have the same privileges as the operating system user that runs the web console process. Additionally, users with `WRITE` permissions can make changes to datasources and they have access to both task and supervisor update (POST) APIs which may affect ingestion.
* Only grant `STATE READ`, `STATE WRITE`, `CONFIG WRITE`, and `DATASOURCE WRITE` permissions to highly-trusted users. These permissions allow users to access resources on behalf of the Druid server process regardless of the datasource.
* If your Druid client application allows less-trusted users to control the input source or firehose of an ingestion task, validate the URLs from the users. It is possible to point unchecked URLs to other locations and resources within your network or local file system.

View File

@ -35,7 +35,7 @@ Druid uses the following resource types:
* DATASOURCE &ndash; Each Druid table (i.e., `tables` in the `druid` schema in SQL) is a resource.
* CONFIG &ndash; Configuration resources exposed by the cluster components.
* EXTERNAL &ndash; External data read through the [EXTERN function](../multi-stage-query/index.md#read-external-data) in SQL.
* EXTERNAL &ndash; External data read through the [EXTERN function](../multi-stage-query/concepts.md#extern) in SQL.
* STATE &ndash; Cluster-wide state resources.
* SYSTEM_TABLE &ndash; when the Broker property `druid.sql.planner.authorizeSystemTablesDirectly` is true, then Druid uses this resource type to authorize the system tables in the `sys` schema in SQL.
@ -105,7 +105,7 @@ There are two possible resource names for the "CONFIG" resource type, "CONFIG" a
The EXTERNAL resource type only accepts the resource name "EXTERNAL".
Granting a user access to EXTERNAL resources allows them to run queries that include
the [EXTERN function](../multi-stage-query/index.md#read-external-data) in SQL
the [EXTERN function](../multi-stage-query/concepts.md#extern) in SQL
to read external data.
### `STATE`
@ -149,7 +149,7 @@ For information on what HTTP methods are supported on a particular request endpo
Queries on Druid datasources require DATASOURCE READ permissions for the specified datasource.
Queries to access external data through the [EXTERN function](../multi-stage-query/index.md#read-external-data) require EXTERNAL READ permissions.
Queries to access external data through the [EXTERN function](../multi-stage-query/concepts.md#extern) require EXTERNAL READ permissions.
Queries on [INFORMATION_SCHEMA tables](../querying/sql-metadata-tables.md#information-schema) return information about datasources that the caller has DATASOURCE READ access to. Other
datasources are omitted.

View File

@ -38,7 +38,7 @@ In Apache Druid, it's important to optimize the segment size because
It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy
especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case,
you can create segments with a sub-optimized size first and optimize them later using [compaction](../ingestion/compaction.md).
you can create segments with a sub-optimized size first and optimize them later using [compaction](../data-management/compaction.md).
You may need to consider the followings to optimize your segments.
@ -90,13 +90,13 @@ Once you find your segments need compaction, you can consider the below two opti
- Turning on the [automatic compaction of Coordinators](../design/coordinator.md#automatic-compaction).
The Coordinator periodically submits [compaction tasks](../ingestion/tasks.md#compact) to re-index small segments.
To enable the automatic compaction, you need to configure it for each dataSource via Coordinator's dynamic configuration.
For more information, see [Automatic compaction](../ingestion/automatic-compaction.md).
For more information, see [Automatic compaction](../data-management/automatic-compaction.md).
- Running periodic Hadoop batch ingestion jobs and using a `dataSource`
inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel.
Details on how to do this can be found on the [Updating existing data](../ingestion/data-management.md#update) section
Details on how to do this can be found on the [Updating existing data](../data-management/update.md) section
of the data management page.
## Learn more
* For an overview of compaction and how to submit a manual compaction task, see [Compaction](../ingestion/compaction.md).
* To learn how to enable and configure automatic compaction, see [Automatic compaction](../ingestion/automatic-compaction.md).
* For an overview of compaction and how to submit a manual compaction task, see [Compaction](../data-management/compaction.md).
* To learn how to enable and configure automatic compaction, see [Automatic compaction](../data-management/automatic-compaction.md).

View File

@ -1,6 +1,6 @@
---
id: druid-console
title: "Druid console"
id: web-console
title: "Web console"
---
<!--
@ -25,21 +25,22 @@ title: "Druid console"
Druid includes a web console for loading data, managing datasources and tasks, and viewing server status and segment information.
You can also run SQL and native Druid queries in the console.
Enable the following cluster settings to use the Druid console. Note that these settings are enabled by default.
Enable the following cluster settings to use the web console. Note that these settings are enabled by default.
- Enable the Router's [management proxy](../design/router.md#enabling-the-management-proxy).
- Enable [Druid SQL](../configuration/index.md#sql) for the Broker processes in the cluster.
The [Router](../design/router.md) service hosts the Druid console.
Access the Druid console at the following address:
The [Router](../design/router.md) service hosts the web console.
Access the web console at the following address:
```
http://<ROUTER_IP>:<ROUTER_PORT>
```
> It is important to note that any Druid console user will have, effectively, the same file permissions as the user under which Druid runs. One way these permissions are surfaced is in the file browser dialog. The dialog
will show console users the files that the underlying user has permissions to. In general, avoid running Druid as
root user. Consider creating a dedicated user account for running Druid.
> **Security note:** Without [Druid user permissions](../operations/security-overview.md) configured, any user of the
API or web console has effectively the same level of access to local files and network services as the user under which
Druid runs. It is a best practice to avoid running Druid as the root user, and to use Druid permissions or network
firewalls to restrict which users have access to potentially sensitive resources.
This topic presents the high-level features and functionality of the Druid console.
This topic presents the high-level features and functionality of the web console.
## Home

View File

@ -103,8 +103,6 @@ For more details and examples, see [multi-value dimensions](multi-value-dimensio
### Lookup DimensionSpecs
> Lookups are an [experimental](../development/experimental.md) feature.
You can use lookup dimension specs to define a lookup implementation as a dimension spec directly.
Generally, there are two kinds of lookup implementations.
The first kind is passed at the query time like `map` implementation.

View File

@ -481,10 +481,3 @@ ex: `GET /druid/v1/lookups/introspect/nato-phonetic/values`
"Dash"
]
```
## Druid version 0.10.0 to 0.10.1 upgrade/downgrade
Overall druid cluster lookups configuration is persisted in metadata store and also individual lookup processes optionally persist a snapshot of loaded lookups on disk.
If upgrading from druid version 0.10.0 to 0.10.1, then migration for all persisted metadata is handled automatically.
If downgrading from 0.10.1 to 0.9.0 then lookups updates done via Coordinator while 0.10.1 was running, would be lost.

View File

@ -37,7 +37,7 @@ curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:applica
> Replace `<queryable_host>:<port>` with the appropriate address and port for your system. For example, if running the quickstart configuration, replace `<queryable_host>:<port>` with localhost:8888.
You can also enter them directly in the Druid console's Query view. Simply pasting a native query into the console switches the editor into JSON mode.
You can also enter them directly in the web console's Query view. Simply pasting a native query into the console switches the editor into JSON mode.
![Native query](../assets/native-queries-01.png "Native query")

View File

@ -66,12 +66,12 @@ In the aggregation functions supported by Druid, only `COUNT`, `ARRAY_AGG`, and
|Function|Notes|Default|
|--------|-----|-------|
|`COUNT(*)`|Counts the number of rows.|`0`|
|`COUNT(DISTINCT expr)`|Counts distinct values of `expr`.<br><br>When `useApproximateCountDistinct` is set to "true" (the default), this is an alias for `APPROX_COUNT_DISTINCT`. The specific algorithm depends on the value of [`druid.sql.approxCountDistinct.function`](../configuration/index.md#sql). In this mode, you can use strings, numbers, or prebuilt sketches. If counting prebuilt sketches, the prebuilt sketch type must match the selected algorithm.<br><br>When `useApproximateCountDistinct` is set to "false", the computation will be exact. In this case, `expr` must be string or numeric, since exact counts are not possible using prebuilt sketches. In exact mode, only one distinct count per query is permitted unless `useGroupingSetForExactDistinct` is enabled.|`0`|
|`COUNT(DISTINCT expr)`|Counts distinct values of `expr`.<br /><br />When `useApproximateCountDistinct` is set to "true" (the default), this is an alias for `APPROX_COUNT_DISTINCT`. The specific algorithm depends on the value of [`druid.sql.approxCountDistinct.function`](../configuration/index.md#sql). In this mode, you can use strings, numbers, or prebuilt sketches. If counting prebuilt sketches, the prebuilt sketch type must match the selected algorithm.<br /><br />When `useApproximateCountDistinct` is set to "false", the computation will be exact. In this case, `expr` must be string or numeric, since exact counts are not possible using prebuilt sketches. In exact mode, only one distinct count per query is permitted unless `useGroupingSetForExactDistinct` is enabled.|`0`|
|`SUM(expr)`|Sums numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`MIN(expr)`|Takes the minimum of numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `9223372036854775807` (maximum LONG value)|
|`MAX(expr)`|Takes the maximum of numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `-9223372036854775808` (minimum LONG value)|
|`AVG(expr)`|Averages numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of `expr` using an approximate algorithm. The `expr` can be a regular column or a prebuilt sketch column.<br><br>The specific algorithm depends on the value of [`druid.sql.approxCountDistinct.function`](../configuration/index.md#sql). By default, this is `APPROX_COUNT_DISTINCT_BUILTIN`. If the [DataSketches extension](../development/extensions-core/datasketches-extension.md) is loaded, you can set it to `APPROX_COUNT_DISTINCT_DS_HLL` or `APPROX_COUNT_DISTINCT_DS_THETA`.<br><br>When run on prebuilt sketch columns, the sketch column type must match the implementation of this function. For example: when `druid.sql.approxCountDistinct.function` is set to `APPROX_COUNT_DISTINCT_BUILTIN`, this function runs on prebuilt hyperUnique columns, but not on prebuilt HLLSketchBuild columns.|
|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of `expr` using an approximate algorithm. The `expr` can be a regular column or a prebuilt sketch column.<br /><br />The specific algorithm depends on the value of [`druid.sql.approxCountDistinct.function`](../configuration/index.md#sql). By default, this is `APPROX_COUNT_DISTINCT_BUILTIN`. If the [DataSketches extension](../development/extensions-core/datasketches-extension.md) is loaded, you can set it to `APPROX_COUNT_DISTINCT_DS_HLL` or `APPROX_COUNT_DISTINCT_DS_THETA`.<br /><br />When run on prebuilt sketch columns, the sketch column type must match the implementation of this function. For example: when `druid.sql.approxCountDistinct.function` is set to `APPROX_COUNT_DISTINCT_BUILTIN`, this function runs on prebuilt hyperUnique columns, but not on prebuilt HLLSketchBuild columns.|
|`APPROX_COUNT_DISTINCT_BUILTIN(expr)`|_Usage note:_ consider using `APPROX_COUNT_DISTINCT_DS_HLL` instead, which offers better accuracy in many cases.<br/><br/>Counts distinct values of `expr` using Druid's built-in "cardinality" or "hyperUnique" aggregators, which implement a variant of [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). The `expr` can be a string, a number, or a prebuilt hyperUnique column. Results are always approximate, regardless of the value of `useApproximateCountDistinct`.|
|`APPROX_QUANTILE(expr, probability, [resolution])`|_Deprecated._ Use `APPROX_QUANTILE_DS` instead, which provides a superior distribution-independent algorithm with formal error guarantees.<br/><br/>Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) expressions. `probability` should be between 0 and 1, exclusive. `resolution` is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. Load the [approximate histogram extension](../development/extensions-core/approximate-histograms.md) to use this function.|`NaN`|
|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram) expressions. `probability` should be between 0 and 1, exclusive. The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. Load the [approximate histogram extension](../development/extensions-core/approximate-histograms.md) to use this function.|`0.0`|
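
As a rough illustration of the `useApproximateCountDistinct` behavior described in the table, the following Python sketch builds a request body that asks for an exact `COUNT(DISTINCT ...)`. The helper name `sql_payload` is hypothetical, and the `wikipedia` datasource and `user` column are only assumed for the example. The body uses the `query` and `context` fields that Druid's SQL HTTP API accepts.

```python
import json


def sql_payload(query: str, approximate: bool = True) -> str:
    """Build a JSON body for a Druid SQL request with the distinct-count mode set.

    `sql_payload` is an illustrative helper, not part of Druid.
    """
    body = {
        "query": query,
        # When False, COUNT(DISTINCT expr) is computed exactly instead of
        # being treated as an alias for APPROX_COUNT_DISTINCT.
        "context": {"useApproximateCountDistinct": approximate},
    }
    return json.dumps(body)


# Assumed datasource and column names, for illustration only.
exact = sql_payload('SELECT COUNT(DISTINCT "user") FROM "wikipedia"', approximate=False)
print(exact)
```

In exact mode, keep to one distinct count per query unless `useGroupingSetForExactDistinct` is enabled, as noted in the table.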

View File

@ -62,11 +62,11 @@ Consider the following example input JSON:
{"x":1, "y":[1, 2, 3]}
```
- To return the entire JSON object:<br>
- To return the entire JSON object:<br />
`$` -> `{"x":1, "y":[1, 2, 3]}`
- To return the value of the key "x":<br>
- To return the value of the key "x":<br />
`$.x` -> `1`
- For a key that contains an array, to return the entire array:<br>
- For a key that contains an array, to return the entire array:<br />
`$['y']` -> `[1, 2, 3]`
- For a key that contains an array, to return an item in the array:<br>
- For a key that contains an array, to return an item in the array:<br />
`$.y[1]` -> `2`
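
To make the four path examples above concrete, here is a minimal Python sketch of an evaluator that handles only the syntax they use: `$`, `.key`, `['key']`, and `[index]`. It is not Druid's JSONPath implementation, and the function name `eval_path` is hypothetical. It only mirrors the lookups listed for the sample input.

```python
import re
from typing import Any

# One step of a path: .key, ['key'], or [index]
_STEP = re.compile(r"\.(\w+)|\['([^']+)'\]|\[(\d+)\]")


def eval_path(path: str, doc: Any) -> Any:
    """Evaluate a small subset of JSONPath against an already-parsed document."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    rest = path[1:]
    pos = 0
    current = doc
    while pos < len(rest):
        match = _STEP.match(rest, pos)
        if match is None:
            raise ValueError(f"unsupported path syntax at: {rest[pos:]!r}")
        dot_key, bracket_key, index = match.groups()
        if index is not None:
            current = current[int(index)]  # array element, zero-based
        else:
            current = current[dot_key if dot_key is not None else bracket_key]
        pos = match.end()
    return current


doc = {"x": 1, "y": [1, 2, 3]}
for p in ("$", "$.x", "$['y']", "$.y[1]"):
    print(p, "->", eval_path(p, doc))
```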

View File

@ -1,6 +1,6 @@
---
id: docker
title: "Docker"
title: "Tutorial: Run with Docker"
---
<!--
@ -110,9 +110,9 @@ Note that Druid uses port 8888 for the console. This port is also used by Jupyte
Run `docker-compose up` to launch the cluster with a shell attached, or `docker-compose up -d` to run the cluster in the background.
Once the cluster has started, you can navigate to the [Druid console](../operations/druid-console.md) at [http://localhost:8888](http://localhost:8888). The [Druid router process](../design/router.md) serves the UI.
Once the cluster has started, you can navigate to the [web console](../operations/web-console.md) at [http://localhost:8888](http://localhost:8888). The [Druid router process](../design/router.md) serves the UI.
![Druid console](../assets/tutorial-quickstart-01.png "Druid console")
![web console](../assets/tutorial-quickstart-01.png "web console")
It takes a few seconds for all the Druid processes to fully start up. If you open the console immediately after starting the services, you may see some errors that you can safely ignore.
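
If you script against the cluster rather than opening the console by hand, one way to deal with the startup delay is to poll before proceeding. The sketch below is an assumption-laden example, not part of the tutorial: it checks the `/status/health` endpoint on the router at `http://localhost:8888`, and the function names, attempt count, and delay are all illustrative choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def router_is_healthy(base_url: str = "http://localhost:8888") -> bool:
    """Return True if the process at base_url answers its health check."""
    try:
        with urllib.request.urlopen(f"{base_url}/status/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the service is probably still starting.
        return False


def wait_until_ready(
    check: Callable[[], bool] = router_is_healthy,
    attempts: int = 30,
    delay_seconds: float = 2.0,
) -> bool:
    """Call check() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Calling `wait_until_ready()` after `docker-compose up -d` returns `True` once the router responds, or `False` if it never does within the allotted attempts. A healthy router does not guarantee that every other service has finished starting.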

View File

@ -1,6 +1,6 @@
---
id: index
title: "Quickstart"
title: "Quickstart (local)"
---
<!--
@ -23,43 +23,37 @@ title: "Quickstart"
-->
This quickstart gets you started with Apache Druid using the `micro-quickstart` startup configuration and introduces you to some Druid features, including the MSQ task engine that's part of the [multi-stage query architecture](../multi-stage-query/index.md).
With the MSQ task engine, you can write query tasks that can reference [external data](../multi-stage-query/index.md#read-external-data) as well as perform ingestion with SQL [INSERT](../multi-stage-query/index.md#insert-data) and [REPLACE](../multi-stage-query/index.md#replace-data), eliminating the need to generate JSON-based ingestion specs.
This quickstart gets you started with Apache Druid using the [`micro-quickstart`](../operations/single-server.md#micro-quickstart-4-cpu-16gib-ram) configuration, and introduces you to Druid ingestion and query features.
In this quickstart, you'll do the following:
- install Druid
- start up Druid services
- use the MSQ task engine to ingest data
- use SQL to ingest and query data
Druid supports different ingestion engines. While we recommend SQL based ingestion, you can find tutorials for other modes of ingestion, such as [Load data with native batch ingestion](tutorial-batch-native.md).
Druid supports a variety of ingestion options. Once you're done with this tutorial, refer to the
[Ingestion](../ingestion/index.md) page to determine which ingestion method is right for you.
## Requirements
You can follow these steps on a relatively small machine, such as a laptop with around 4 CPU and 16 GiB of RAM.
You can follow these steps on a relatively modest machine, such as a workstation or virtual server with 16 GiB of RAM.
Druid comes equipped with several startup configuration profiles for a range of machine sizes.
The `micro-quickstart` configuration profile is suitable for evaluating Druid. If you want to
try out Druid's performance or scaling capabilities, you'll need a larger machine and configuration profile.
The configuration profiles included with Druid range from the even smaller _Nano-Quickstart_ configuration (1 CPU, 4GiB RAM)
to the _X-Large_ configuration (64 CPU, 512GiB RAM). For more information, see [Single server deployment](../operations/single-server.md).
For information on deploying Druid services across clustered machines, see [Clustered deployment](./cluster.md).
Druid comes equipped with several [startup configuration profiles](../operations/single-server.md) for a
range of machine sizes. These range from `nano` (1 CPU, 4GiB RAM) to `x-large` (64 CPU, 512GiB RAM). For more
information, see [Single server deployment](../operations/single-server.md). For information on deploying Druid services
across clustered machines, see [Clustered deployment](./cluster.md).
The software requirements for the installation machine are:
* Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
* Java 8, Update 92 or later (8u92+) or Java 11
* Linux, Mac OS X, or other Unix-like OS. (Windows is not supported.)
* Java 8u92+ or Java 11.
> Druid relies on the environment variables `JAVA_HOME` or `DRUID_JAVA_HOME` to find Java on the machine. You can set
`DRUID_JAVA_HOME` if there is more than one instance of Java. To verify Java requirements for your environment, run the
`bin/verify-java` script.
Before installing a production Druid instance, be sure to consider the user account on the operating system under
which Druid will run. This is important because any Druid console user will have, effectively, the same permissions as
that user. For example, the file browser UI will show console users the files that the underlying user can
access. In general, avoid running Druid as root user. Consider creating a dedicated user account for running Druid.
Before installing a production Druid instance, be sure to review the [security
overview](../operations/security-overview.md). In general, avoid running Druid as root user. Consider creating a
dedicated user account for running Druid.
## Install Druid
@ -111,16 +105,16 @@ At any time, you can revert Druid to its original, post-installation state by de
To stop Druid at any time, use CTRL+C in the terminal. This exits the `bin/start-micro-quickstart` script and terminates all Druid processes.
## Open the Druid console
## Open the web console
After the Druid services finish startup, open the [Druid console](../operations/druid-console.md) at [http://localhost:8888](http://localhost:8888).
After the Druid services finish startup, open the [web console](../operations/web-console.md) at [http://localhost:8888](http://localhost:8888).
![Druid console](../assets/tutorial-quickstart-01.png "Druid console")
![web console](../assets/tutorial-quickstart-01.png "web console")
It may take a few seconds for all Druid services to finish starting, including the [Druid router](../design/router.md), which serves the console. If you attempt to open the Druid console before startup is complete, you may see errors in the browser. Wait a few moments and try again.
It may take a few seconds for all Druid services to finish starting, including the [Druid router](../design/router.md), which serves the console. If you attempt to open the web console before startup is complete, you may see errors in the browser. Wait a few moments and try again.
In this quickstart, you use the the Druid console to perform ingestion. The MSQ task engine specifically uses the **Query** view to edit and run SQL queries.
For a complete walkthrough of the **Query** view as it relates to the multi-stage query architecture and the MSQ task engine, see [UI walkthrough](../operations/druid-console.md).
In this quickstart, you use the web console to perform ingestion. The MSQ task engine specifically uses the **Query** view to edit and run SQL queries.
For a complete walkthrough of the **Query** view as it relates to the multi-stage query architecture and the MSQ task engine, see [UI walkthrough](../operations/web-console.md).
## Load data
@ -220,13 +214,11 @@ Congratulations! You've gone from downloading Druid to querying data with the MS
See the following topics for more information:
* [Druid SQL overview](../querying/sql.md) to learn about how to query the data you just ingested.
* [Ingestion overview](../ingestion/index.md) to explore options for ingesting more data.
* [Tutorial: Load files using SQL](./tutorial-msq-extern.md) to learn how to generate a SQL query that loads external data into a Druid datasource.
* [Tutorial: Load data with native batch ingestion](tutorial-batch-native.md) to load and query data with Druid's native batch ingestion feature.
* [Tutorial: Load stream data from Apache Kafka](./tutorial-kafka.md) to load streaming data from a Kafka topic.
* [Extensions](../development/extensions.md) for details on Druid extensions.
* [MSQ task engine query syntax](../multi-stage-query/index.md#msq-task-engine-query-syntax) to further explore queries for SQL-based ingestion.
* [Druid SQL overview](../querying/sql.md) to learn about how to query data you ingest.
* [Load data with native batch ingestion](tutorial-batch-native.md) to load and query data with Druid's native batch ingestion feature.
* [Load stream data from Apache Kafka](./tutorial-kafka.md) to load streaming data from a Kafka topic.
* [API](../multi-stage-query/msq-api.md) to submit query tasks to the MSQ task engine programmatically.
* [Connect external data](../multi-stage-query/msq-tutorial-connect-external-data.md) to learn how to generate a query that references externally hosted data that the MSQ task engine can use to ingest data.
* [Convert ingestion spec](../multi-stage-query/msq-tutorial-convert-ingest-spec.md) to learn how to convert an existing JSON ingestion spec to a SQL query that the MSQ task engine can use to ingest data.
Remember that after stopping Druid services, you can start clean next time by deleting the `var` directory from the Druid root directory and running the `bin/start-micro-quickstart` script again. You may want to do this before taking other data ingestion tutorials, since they use the same Wikipedia datasource.
Remember that after stopping Druid services, you can start clean next time by deleting the `var` directory from the Druid root directory and running the `bin/start-micro-quickstart` script again. You may want to do this before using other data ingestion tutorials, since they use the same Wikipedia datasource.

View File

@ -27,7 +27,7 @@ This topic shows you how to load and query data files in Apache Druid using its
## Prerequisites
Install Druid, start up Druid services, and open the Druid console as described in the [Druid quickstart](index.md).
Install Druid, start up Druid services, and open the web console as described in the [Druid quickstart](index.md).
## Load data
@ -37,7 +37,7 @@ as we'll do here to perform batch file loading with Druid's native batch ingesti
The Druid distribution bundles sample data we can use. The sample data located in `quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz`
in the Druid root directory represents Wikipedia page edits for a given day.
1. Click **Load data** from the Druid console header (![Load data](../assets/tutorial-batch-data-loader-00.png)).
1. Click **Load data** from the web console header (![Load data](../assets/tutorial-batch-data-loader-00.png)).
2. Select the **Local disk** tile and then click **Connect data**.

View File

@ -1,7 +1,7 @@
---
id: tutorial-batch
title: "Tutorial: Loading a file"
sidebar_label: "Loading files natively"
sidebar_label: "Load files natively"
---
<!--
@ -27,13 +27,11 @@ sidebar_label: "Loading files natively"
This tutorial demonstrates how to load data into Apache Druid from a file using Apache Druid's native batch ingestion feature.
You initiate data loading in Druid by submitting an *ingestion task* spec to the Druid Overlord. You can write ingestion
specs by hand or using the _data loader_ built into the Druid console.
The [Quickstart](./index.md) shows you how to use the data loader to build an ingestion spec. For production environments, it's
likely that you'll want to automate data ingestion. This tutorial starts by showing you how to submit an ingestion spec
directly in the Druid console, and then introduces ways to ingest batch data that lend themselves to
automation&mdash;from the command line and from a script.
specs by hand or using the _data loader_ built into the web console.
For production environments, it's likely that you'll want to automate data ingestion. This tutorial starts by showing
you how to submit an ingestion spec directly in the web console, and then introduces ways to ingest batch data that
lend themselves to automation&mdash;from the command line and from a script.
## Loading data with a spec (via console)

View File

@ -53,7 +53,7 @@ bin/post-index-task --file quickstart/tutorial/compaction-init-index.json --url
> `maxRowsPerSegment` in the tutorial ingestion spec is set to 1000 to generate multiple segments per hour for demonstration purposes. Do not use this spec in production.
After the ingestion completes, navigate to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to see the new datasource in the Druid console.
After the ingestion completes, navigate to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to see the new datasource in the web console.
![compaction-tutorial datasource](../assets/tutorial-compaction-01.png "compaction-tutorial datasource")
@ -119,7 +119,7 @@ During that time, you may see 75 total segments comprised of the old segment set
![Compacted segments intermediate state 2](../assets/tutorial-compaction-04.png "Compacted segments intermediate state 2")
The new compacted segments have a more recent version than the original segments.
Even though the Druid console displays both sets of segments, queries only read from the new compacted segments.
Even though the web console displays both sets of segments, queries only read from the new compacted segments.
Run a COUNT query on `compaction-tutorial` again to verify the number of rows remains 39,244:
@ -184,5 +184,5 @@ It takes some time before the Coordinator marks the old input segments as unused
This tutorial demonstrated how to use a compaction task spec to manually compact segments and how to optionally change the segment granularity for segments.
- For more details, see [Compaction](../ingestion/compaction.md).
- For more details, see [Compaction](../data-management/compaction.md).
- To learn about the benefits of compaction, see [Segment optimization](../operations/segment-optimization.md).

View File

@ -259,7 +259,7 @@ If the supervisor was successfully created, you will get a response containing t
For more details about what's going on here, check out the
[Druid Kafka indexing service documentation](../development/extensions-core/kafka-ingestion.md).
You can view the current supervisors and tasks in the Druid console: [http://localhost:8888/unified-console.md#tasks](http://localhost:8888/unified-console.html#tasks).
You can view the current supervisors and tasks in the web console: [http://localhost:8888/unified-console.html#tasks](http://localhost:8888/unified-console.html#tasks).
## Querying your data

View File

@ -1,44 +0,0 @@
---
id: tutorial-msq-convert-json
title: "Convert ingestion spec to SQL"
sidebar_label: "Convert ingestion spec to SQL"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<!DOCTYPE html>
<!--This redirects to the Multi-Stage Query tutorial. This redirect file exists cause duplicate entries in the left nav aren't allowed-->
<html lang="en-US">
<head>
<meta charset="UTF-8" />
<meta
http-equiv="refresh"
content="0; url=/docs/multi-stage-query/convert-json-spec.html"
/>
<script type="text/javascript">
window.location.href = '/docs/multi-stage-query/convert-json-spec.html';
</script>
<title>About the Druid documentation</title>
</head>
<body>
If you are not redirected automatically, follow this
<a href="/docs/multi-stage-query/convert-json-spec.html">link</a>.
</body>
</html>

View File

@ -1,7 +1,8 @@
---
id: convert-json-spec
title: Tutorial - Convert an ingestion spec for SQL-based ingestion
description: How to convert an ingestion spec to a query for SQL-based ingestion in the Druid console.
id: tutorial-msq-convert-spec
title: "Tutorial: Convert an ingestion spec for SQL-based ingestion"
sidebar_label: "Convert ingestion spec to SQL"
description: How to convert an ingestion spec to a query for SQL-based ingestion in the web console.
---
<!--
@ -23,21 +24,23 @@ description: How to convert an ingestion spec to a query for SQL-based ingestion
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
If you're already ingesting data with Druid's native SQL engine, you can use the Druid console to convert the ingestion spec to a SQL query that the multi-stage query task engine can use to ingest data.
If you're already ingesting data with [native batch ingestion](../ingestion/native-batch.md), you can use the [web console](../operations/web-console.md) to convert the ingestion spec to a SQL query that the multi-stage query task engine can use to ingest data.
This tutorial demonstrates how to convert the ingestion spec to a query task in the Druid console.
This tutorial demonstrates how to convert the ingestion spec to a query task in the web console.
To convert the ingestion spec to a query task, do the following:
1. In the **Query** view of the Druid console, navigate to the menu bar that includes **Run**.
1. In the **Query** view of the web console, navigate to the menu bar that includes **Run**.
2. Click the ellipsis icon and select **Convert ingestion spec to SQL**.
![Convert ingestion spec to SQL](../assets/multi-stage-query/tutorial-msq-convert.png "Convert ingestion spec to SQL")
3. In the **Ingestion spec to convert** window, insert your ingestion spec. You can use your own spec or the sample ingestion spec provided in the tutorial. The sample spec uses data hosted at `https://druid.apache.org/data/wikipedia.json.gz` and loads it into a table named `wikipedia`:
<details><summary>Show the spec</summary>
```json
{
"type": "index_parallel",
@ -117,11 +120,11 @@ To convert the ingestion spec to a query task, do the following:
}
}
```
</details>
4. Click **Submit** to submit the spec. The Druid console uses the JSON-based ingestion spec to generate a SQL query that you can use instead. This is what the query looks like for the sample ingestion spec:
4. Click **Submit** to submit the spec. The web console uses the JSON-based ingestion spec to generate a SQL query that you can use instead. This is what the query looks like for the sample ingestion spec:
<details><summary>Show the query</summary>
```sql
@ -160,10 +163,10 @@ To convert the ingestion spec to a query task, do the following:
"countryIsoCode",
"regionName"
FROM source
PARTITIONED BY DAY
PARTITIONED BY DAY
```
</details>
4. Review the generated SQL query to make sure it matches your requirements and does what you expect.
5. Click **Run** to start the ingestion.
5. Click **Run** to start the ingestion.

View File

@ -1,6 +1,7 @@
---
id: connect-external-data
title: Tutorial - Load files with SQL-based ingestion
id: tutorial-msq-extern
title: "Tutorial: Load files with SQL-based ingestion"
sidebar_label: "Load files using SQL 🆕"
description: How to generate a query that references externally hosted data
---
@ -23,7 +24,9 @@ description: How to generate a query that references externally hosted data
~ under the License.
-->
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
> This page describes SQL-based batch ingestion using the [`druid-multi-stage-query`](../multi-stage-query/index.md)
> extension, new in Druid 24.0. Refer to the [ingestion methods](../ingestion/index.md#batch) table to determine which
> ingestion method is right for you.
This tutorial demonstrates how to generate a query that references externally hosted data using the **Connect external data** wizard.
@ -33,7 +36,7 @@ Although you can manually create a query in the UI, you can use Druid to generat
To generate a query from external data, do the following:
1. In the **Query** view of the Druid console, click **Connect external data**.
1. In the **Query** view of the web console, click **Connect external data**.
2. On the **Select input type** screen, choose **HTTP(s)** and enter the following value in the **URIs** field: `https://druid.apache.org/data/wikipedia.json.gz`. Leave the HTTP auth username and password blank.
3. Click **Connect data**.
4. On the **Parse** screen, you can perform additional actions before you load the data into Druid:
@ -86,7 +89,7 @@ To generate a query from external data, do the following:
6. Review and modify the query to meet your needs. For example, you can rename the table or change segment granularity. To partition by something other than ALL, include `TIME_PARSE("timestamp") AS __time` in your SELECT statement.
For example, to specify day-based segment granularity, change the partitioning to `PARTITIONED BY DAY`:
```sql
INSERT INTO ...
SELECT
@ -96,7 +99,7 @@ To generate a query from external data, do the following:
PARTITIONED BY DAY
```
1. Optionally, select **Preview** to review the data before you ingest it. A preview runs the query without the REPLACE INTO clause and with an added LIMIT.
1. Optionally, select **Preview** to review the data before you ingest it. A preview runs the query without the REPLACE INTO clause and with an added LIMIT.
You can see the general shape of the data before you commit to inserting it.
The LIMITs make the query run faster but can cause incomplete results.
2. Click **Run** to launch your query. The query returns information including its duration and the number of rows inserted into the table.
@ -140,5 +143,5 @@ ORDER BY COUNT(*) DESC
See the following topics to learn more:
* [MSQ task engine query syntax](./index.md#msq-task-engine-query-syntax) for information about the different query components.
* [Reference](./msq-reference.md) for reference on context parameters, functions, and error codes.
* [SQL-based ingestion overview](../multi-stage-query/index.md) to further explore SQL-based ingestion.
* [SQL-based ingestion reference](../multi-stage-query/reference.md) for reference on context parameters, functions, and error codes.

View File

@ -34,16 +34,16 @@ by following one of them:
* [Tutorial: Loading stream data from Kafka](../tutorials/tutorial-kafka.md)
* [Tutorial: Loading a file using Hadoop](../tutorials/tutorial-batch-hadoop.md)
There are various ways to run Druid SQL queries: from the Druid console, using a command line utility
There are various ways to run Druid SQL queries: from the web console, using a command line utility
and by posting the query by HTTP. We'll look at each of these.
## Query SQL from the Druid console
## Query SQL from the web console
The Druid console includes a view that makes it easier to build and test queries, and
The web console includes a view that makes it easier to build and test queries, and
view their results.
1. Start up the Druid cluster, if it's not already running, and open the Druid console in your web
1. Start up the Druid cluster, if it's not already running, and open the web console in your web
browser.
2. Click **Query** from the header to open the Query view:
@ -147,7 +147,7 @@ performance issues. For more information, see [Native queries](../querying/query
9. Finally, click `...` and **Edit context** to see how you can add parameters that control query execution. In the field, enter query context options as JSON key-value pairs, as described in [Context flags](../querying/query-context.md).
That's it! We've built a simple query using some of the query builder features built into the Druid console. The following
That's it! We've built a simple query using some of the query builder features built into the web console. The following
sections provide a few more example queries you can try. Also, see [Other ways to invoke SQL queries](#other-ways-to-invoke-sql-queries) to learn how
to run Druid SQL from the command line or over HTTP.
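
As a hedged sketch of the HTTP option, the Python below builds a POST request for the `/druid/v2/sql` endpoint. It assumes the quickstart router at `http://localhost:8888`. The helper names `build_sql_request` and `run_sql` are hypothetical and exist only for this example.

```python
import json
import urllib.request


def build_sql_request(query: str, base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Create (but do not send) a POST request carrying a Druid SQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(query: str, base_url: str = "http://localhost:8888"):
    """Send the query and return the decoded JSON response.

    Requires a running Druid cluster reachable at base_url.
    """
    with urllib.request.urlopen(build_sql_request(query, base_url)) as resp:
        return json.loads(resp.read())
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without a running cluster. With the cluster up, `run_sql('SELECT COUNT(*) AS "rows" FROM "wikipedia"')` would return the rows as decoded JSON, assuming the `wikipedia` datasource has been loaded.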

View File

@ -41,7 +41,7 @@ The ingestion spec can be found at `quickstart/tutorial/retention-index.json`. L
bin/post-index-task --file quickstart/tutorial/retention-index.json --url http://localhost:8081
```
After the ingestion completes, go to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to access the Druid console's datasource view.
After the ingestion completes, go to [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser to access the web console's datasource view.
This view shows the available datasources and a summary of the retention rules for each datasource:
@ -85,7 +85,7 @@ Now click `Save`. You can see the new rules in the datasources view:
![New rules](../assets/tutorial-retention-05.png "New rules")
Give the cluster a few minutes to apply the rule change, and go to the [segments view](http://localhost:8888/unified-console.html#segments) in the Druid console.
Give the cluster a few minutes to apply the rule change, and go to the [segments view](http://localhost:8888/unified-console.html#segments) in the web console.
The segments for the first 12 hours of 2015-09-12 are now gone:
![New segments](../assets/tutorial-retention-06.png "New segments")

View File

@ -95,7 +95,7 @@ date,uid,show,episode
## Ingest data using Theta sketches
1. Navigate to the **Load data** wizard in the Druid console.
1. Navigate to the **Load data** wizard in the web console.
2. Select `Paste data` as the data source and paste the given data:
![Load data view with pasted data](../assets/tutorial-theta-01.png)

View File

@ -345,6 +345,8 @@ memcached
mergeable
metadata
millis
microbatch
microbatches
misconfiguration
misconfigured
mostAvailableSize
@ -381,6 +383,7 @@ pre-computation
pre-compute
pre-computing
pre-configured
pre-existing
pre-filtered
pre-filtering
pre-generated
@ -477,6 +480,8 @@ unmergeable
unmerged
UNNEST
unnest
unnested
unnesting
unnests
unparseable
unparsed
@@ -1241,7 +1246,7 @@ timestampColumnName
timestampSpec
urls
valueFormat
- ../docs/ingestion/data-management.md
- ../docs/data-management/compaction.md
1GB
IOConfig
compactionTask
@@ -1717,13 +1722,10 @@ maxResults
orderby
orderbys
outputName
pre-existing
pushdown
row1
subtotalsSpec
tradeoff
unnested
unnesting
- ../docs/querying/having.md
HavingSpec
HavingSpecs

@@ -32,6 +32,25 @@
"configuration/logging": {
"title": "Logging"
},
"data-management/automatic-compaction": {
"title": "Automatic compaction"
},
"data-management/compaction": {
"title": "Compaction"
},
"data-management/delete": {
"title": "Data deletion"
},
"data-management/index": {
"title": "Data management",
"sidebar_label": "Overview"
},
"data-management/schema-changes": {
"title": "Schema changes"
},
"data-management/update": {
"title": "Data updates"
},
"dependencies/deep-storage": {
"title": "Deep storage"
},
@@ -104,6 +123,9 @@
"development/extensions-contrib/cloudfiles": {
"title": "Rackspace Cloud Files"
},
"development/extensions-contrib/compressed-big-decimal": {
"title": "Compressed Big Decimal"
},
"development/extensions-contrib/distinctcount": {
"title": "DistinctCount Aggregator"
},
@@ -173,6 +195,9 @@
"development/extensions-core/datasketches-hll": {
"title": "DataSketches HLL Sketch module"
},
"development/extensions-core/datasketches-kll": {
"title": "DataSketches KLL Sketch module"
},
"development/extensions-core/datasketches-quantiles": {
"title": "DataSketches Quantiles Sketch module"
},
@@ -281,18 +306,9 @@
"development/versioning": {
"title": "Versioning"
},
"ingestion/automatic-compaction": {
"title": "Automatic compaction"
},
"ingestion/compaction": {
"title": "Compaction"
},
"ingestion/data-formats": {
"title": "Data formats"
},
"ingestion/data-management": {
"title": "Data management"
},
"ingestion/data-model": {
"title": "Druid data model",
"sidebar_label": "Data model"
@@ -314,15 +330,15 @@
},
"ingestion/native-batch-firehose": {
"title": "Native batch ingestion with firehose",
"sidebar_label": "Firehose"
"sidebar_label": "Firehose (deprecated)"
},
"ingestion/native-batch-input-sources": {
"title": "Native batch input sources",
"sidebar_label": "Input sources"
"sidebar_label": "Native batch: input sources"
},
"ingestion/native-batch-simple-task": {
"title": "Native batch simple task indexing",
"sidebar_label": "Simple task indexing"
"sidebar_label": "Native batch (simple)"
},
"ingestion/native-batch": {
"title": "Native batch ingestion",
@@ -354,6 +370,34 @@
"misc/papers-and-talks": {
"title": "Papers"
},
"multi-stage-query/api": {
"title": "SQL-based ingestion and multi-stage query task API",
"sidebar_label": "API"
},
"multi-stage-query/concepts": {
"title": "SQL-based ingestion concepts",
"sidebar_label": "Key concepts"
},
"multi-stage-query/examples": {
"title": "SQL-based ingestion query examples",
"sidebar_label": "Examples"
},
"multi-stage-query/index": {
"title": "SQL-based ingestion",
"sidebar_label": "Overview"
},
"multi-stage-query/known-issues": {
"title": "SQL-based ingestion known issues",
"sidebar_label": "Known issues"
},
"multi-stage-query/reference": {
"title": "SQL-based ingestion reference",
"sidebar_label": "Reference"
},
"multi-stage-query/security": {
"title": "SQL-based ingestion security",
"sidebar_label": "Security"
},
"operations/alerts": {
"title": "Alerts"
},
@@ -373,9 +417,6 @@
"operations/deep-storage-migration": {
"title": "Deep storage migration"
},
"operations/druid-console": {
"title": "Druid console"
},
"operations/dump-segment": {
"title": "dump-segment tool"
},
@@ -403,9 +444,6 @@
"operations/kubernetes": {
"title": "kubernetes"
},
"operations/management-uis": {
"title": "Legacy Management UIs"
},
"operations/metadata-migration": {
"title": "Metadata Migration"
},
@@ -456,6 +494,9 @@
"operations/use_sbt_to_build_fat_jar": {
"title": "Content for build.sbt"
},
"operations/web-console": {
"title": "Web console"
},
"querying/aggregations": {
"title": "Aggregations"
},
@@ -507,6 +548,10 @@
"title": "Multitenancy considerations",
"sidebar_label": "Multitenancy"
},
"querying/nested-columns": {
"title": "Nested columns",
"sidebar_label": "Nested columns"
},
"querying/post-aggregations": {
"title": "Post-aggregations"
},
@@ -559,6 +604,10 @@
"title": "SQL JDBC driver API",
"sidebar_label": "JDBC driver API"
},
"querying/sql-json-functions": {
"title": "SQL JSON functions",
"sidebar_label": "JSON functions"
},
"querying/sql-metadata-tables": {
"title": "SQL metadata tables",
"sidebar_label": "SQL metadata tables"
@@ -616,18 +665,21 @@
"title": "Clustered deployment"
},
"tutorials/docker": {
"title": "Docker"
"title": "Tutorial: Run with Docker"
},
"tutorials/index": {
"title": "Quickstart"
"title": "Quickstart (local)"
},
"tutorials/tutorial-batch-hadoop": {
"title": "Tutorial: Load batch data using Apache Hadoop",
"sidebar_label": "Load from Apache Hadoop"
},
"tutorials/tutorial-batch-native": {
"title": "Load data with native batch ingestion"
},
"tutorials/tutorial-batch": {
"title": "Tutorial: Loading a file",
"sidebar_label": "Loading files natively"
"sidebar_label": "Load files natively"
},
"tutorials/tutorial-compaction": {
"title": "Tutorial: Compacting segments",
@@ -649,6 +701,14 @@
"title": "Configuring Apache Druid to use Kerberized Apache Hadoop as deep storage",
"sidebar_label": "Kerberized HDFS deep storage"
},
"tutorials/tutorial-msq-convert-spec": {
"title": "Tutorial: Convert an ingestion spec for SQL-based ingestion",
"sidebar_label": "Convert ingestion spec to SQL"
},
"tutorials/tutorial-msq-extern": {
"title": "Tutorial: Load files with SQL-based ingestion",
"sidebar_label": "Load files using SQL 🆕"
},
"tutorials/tutorial-query": {
"title": "Tutorial: Querying data",
"sidebar_label": "Querying data"
@@ -661,6 +721,10 @@
"title": "Tutorial: Roll-up",
"sidebar_label": "Roll-up"
},
"tutorials/tutorial-sketches-theta": {
"title": "Approximations with Theta sketches",
"sidebar_label": "Theta sketches"
},
"tutorials/tutorial-transform-spec": {
"title": "Tutorial: Transforming input data",
"sidebar_label": "Transforming input data"
@@ -684,6 +748,7 @@
"Tutorials": "Tutorials",
"Design": "Design",
"Ingestion": "Ingestion",
"Data management": "Data management",
"Querying": "Querying",
"Configuration": "Configuration",
"Operations": "Operations",

1646
website/package-lock.json generated

File diff suppressed because it is too large

@@ -22,7 +22,7 @@
"devDependencies": {
"docusaurus": "^1.14.4",
"markdown-spellcheck": "^1.3.1",
"node-sass": "^4.13.1"
"node-sass": "^7.0.0"
},
"dependencies": {
"fast-glob": "^3.2.2",

@@ -155,9 +155,12 @@
{"source": "development/router.html", "target": "../design/router.html"}
{"source": "development/select-query.html", "target": "../querying/select-query.html"}
{"source": "index.html", "target": "design/index.html"}
{"source": "ingestion/automatic-compaction.html", "target": "../data-management/automatic-compaction.html"}
{"source": "ingestion/batch-ingestion.html", "target": "index.html#batch"}
{"source": "ingestion/command-line-hadoop-indexer.html", "target": "hadoop.html#cli"}
{"source": "ingestion/delete-data.html", "target": "data-management.html#delete"}
{"source": "ingestion/compaction.html", "target": "../data-management/compaction.html"}
{"source": "ingestion/data-management.html", "target": "../data-management/index.html"}
{"source": "ingestion/delete-data.html", "target": "../data-management/delete.html"}
{"source": "ingestion/firehose.html", "target": "native-batch-firehose.html"}
{"source": "ingestion/flatten-json.html", "target": "ingestion-spec.html#flattenspec"}
{"source": "ingestion/hadoop-vs-native-batch.html", "target": "index.html#batch"}
@@ -165,16 +168,15 @@
{"source": "ingestion/locking-and-priority.html", "target": "tasks.html#locks"}
{"source": "ingestion/misc-tasks.html", "target": "tasks.html#all-task-types"}
{"source": "ingestion/native_tasks.html", "target": "native-batch.html"}
{"source": "ingestion/native_tasks.html", "target": "native-batch.html"}
{"source": "ingestion/overview.html", "target": "index.html"}
{"source": "ingestion/realtime-ingestion.html", "target": "index.html"}
{"source": "ingestion/reports.html", "target": "tasks.html#reports"}
{"source": "ingestion/schema-changes.html", "target": "design/segments.html#segments-with-different-schemas"}
{"source": "ingestion/schema-changes.html", "target": "../design/segments.html#segments-with-different-schemas"}
{"source": "ingestion/stream-ingestion.html", "target": "index.html#streaming"}
{"source": "ingestion/stream-pull.html", "target": "../ingestion/standalone-realtime.html"}
{"source": "ingestion/stream-push.html", "target": "tranquility.html"}
{"source": "ingestion/transform-spec.html", "target": "ingestion-spec.html#transformspec"}
{"source": "ingestion/update-existing-data.html", "target": "data-management.html#update"}
{"source": "ingestion/update-existing-data.html", "target": "../data-management/update.html"}
{"source": "misc/cluster-setup.html", "target": "../tutorials/cluster.html"}
{"source": "misc/evaluate.html", "target": "../tutorials/cluster.html"}
{"source": "misc/tasks.html", "target": "../ingestion/tasks.html"}
@@ -197,5 +199,7 @@
{"source": "tutorials/tutorial-tranquility.html", "target": "../ingestion/tranquility.html"}
{"source": "development/extensions-contrib/google.html", "target": "../extensions-core/google.html"}
{"source": "development/integrating-druid-with-other-technologies.html", "target": "../ingestion/index.html"}
{"source": "operations/druid-console.html", "target": "web-console.html"}
{"source": "operations/getting-started.html", "target": "../design/index.html"}
{"source": "operations/management-uis.html", "target": "web-console.html"}
{"source": "operations/recommendations.html", "target": "basic-cluster-tuning.html"}
{"source": "operations/management-uis.html", "target": "operations/druid-console.html"}

@@ -3,13 +3,12 @@
"Getting started": [
"design/index",
"tutorials/index",
"tutorials/docker",
"operations/single-server",
"tutorials/cluster"
],
"Tutorials": [
"tutorials/tutorial-batch",
"tutorials/tutorial-msq-external-data",
"tutorials/tutorial-msq-extern",
"tutorials/tutorial-kafka",
"tutorials/tutorial-batch-hadoop",
"tutorials/tutorial-query",
@@ -21,8 +20,9 @@
"tutorials/tutorial-delete-data",
"tutorials/tutorial-ingestion-spec",
"tutorials/tutorial-transform-spec",
"tutorials/docker",
"tutorials/tutorial-kerberos-hadoop",
"tutorials/tutorial-msq-convert-json"
"tutorials/tutorial-msq-convert-spec"
],
"Design": [
"design/architecture",
@@ -40,9 +40,6 @@
"ingestion/partitioning",
"ingestion/ingestion-spec",
"ingestion/schema-design",
"ingestion/data-management",
"ingestion/compaction",
"ingestion/automatic-compaction",
{
"type": "subcategory",
"label": "Stream ingestion",
@@ -59,25 +56,33 @@
"label": "Batch ingestion",
"ids": [
"ingestion/native-batch",
"ingestion/native-batch-simple-task",
"ingestion/native-batch-input-sources",
"ingestion/native-batch-firehose",
"ingestion/hadoop"
]
},
{
"type": "subcategory",
"label": "SQL-based ingestion \uD83C\uDD95",
"ids": [
"multi-stage-query/index",
"multi-stage-query/concepts",
"multi-stage-query/api",
"multi-stage-query/security",
"multi-stage-query/examples",
"multi-stage-query/reference",
"multi-stage-query/known-issues"
]
},
"ingestion/tasks",
"ingestion/faq"
],
"SQL-based ingestion": [
"multi-stage-query/index",
"multi-stage-query/concepts",
"multi-stage-query/connect-external-data",
"multi-stage-query/convert-json-spec",
"multi-stage-query/examples",
"multi-stage-query/api",
"multi-stage-query/security",
"multi-stage-query/reference",
"multi-stage-query/known-issues"
"Data management": [
"data-management/index",
"data-management/update",
"data-management/delete",
"data-management/schema-changes",
"data-management/compaction",
"data-management/automatic-compaction"
],
"Querying": [
{
@@ -90,8 +95,8 @@
"querying/sql-scalar",
"querying/sql-aggregations",
"querying/sql-multivalue-string-functions",
"querying/sql-functions",
"querying/sql-json-functions",
"querying/sql-functions",
"querying/sql-api",
"querying/sql-jdbc",
"querying/sql-query-context",
@@ -157,8 +162,7 @@
"configuration/logging"
],
"Operations": [
"operations/druid-console",
"operations/getting-started",
"operations/web-console",
"operations/java",
{
"type": "subcategory",
@@ -171,7 +175,6 @@
"operations/dynamic-config-provider",
"design/auth",
"operations/tls-support"
]
},
{
@@ -295,6 +298,8 @@
"operations/kubernetes",
"querying/hll-old",
"querying/select-query",
"ingestion/native-batch-firehose",
"ingestion/native-batch-simple-task",
"ingestion/standalone-realtime"
]
}