mirror of https://github.com/apache/druid.git
New doc for troubleshooting query execution (#12075)
* new doc for troubleshooting query execution
* add doc to sidebar
* Apply suggestions from code review
parent 60a3a802b6
commit acbeae23b8
@@ -0,0 +1,68 @@
---
id: troubleshooting
title: "Troubleshooting query execution in Druid"
sidebar_label: "Troubleshooting"
---

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

This topic describes issues that may affect query execution in Druid, how to identify those issues, and strategies to resolve them.

## Query fails due to internal communication timeout

In Druid's query processing, when the Broker sends a query to the data servers, the data servers process the query and push their intermediate results back to the Broker.
Because calls from the Broker to the data servers are synchronous, the Jetty server on a data server can time out in the following cases:

1. The data servers don't push any results to the Broker before the maximum idle time elapses.
2. The data servers start to push data but then pause for longer than the maximum idle time, for example because of [Broker backpressure](../operations/basic-cluster-tuning.md#broker-backpressure).

When such a timeout occurs, the server interrupts the connection between the Broker and the data servers, which causes the query to fail with a channel disconnection error. For example:

```json
{
  "error": {
    "error": "Unknown exception",
    "errorMessage": "Query[6eee73a6-a95f-4bdc-821d-981e99e39242] url[https://localhost:8283/druid/v2/] failed with exception msg [Channel disconnected] (through reference chain: org.apache.druid.query.scan.ScanResultValue[\"segmentId\"])",
    "errorClass": "com.fasterxml.jackson.databind.JsonMappingException",
    "host": "localhost:8283"
  }
}
```
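
The fields of the error response above can also be pulled out programmatically, which helps when triaging many failed queries. The following is a minimal Python sketch, not part of Druid: the helper name `summarize_error` and the regular expression are illustrative assumptions based only on the shape of the example payload.

```python
import json
import re

# Error payload copied from the example above.
ERROR_RESPONSE = r'''
{
  "error": {
    "error": "Unknown exception",
    "errorMessage": "Query[6eee73a6-a95f-4bdc-821d-981e99e39242] url[https://localhost:8283/druid/v2/] failed with exception msg [Channel disconnected] (through reference chain: org.apache.druid.query.scan.ScanResultValue[\"segmentId\"])",
    "errorClass": "com.fasterxml.jackson.databind.JsonMappingException",
    "host": "localhost:8283"
  }
}
'''


def summarize_error(payload: str) -> dict:
    """Return the query ID, host, and a channel-disconnect flag (hypothetical helper)."""
    error = json.loads(payload)["error"]
    message = error.get("errorMessage", "")
    # Assumes the message starts with "Query[<id>]", as in the example above.
    match = re.search(r"Query\[([0-9a-fA-F-]+)\]", message)
    return {
        "query_id": match.group(1) if match else None,
        "host": error.get("host"),
        "channel_disconnected": "Channel disconnected" in message,
    }


summary = summarize_error(ERROR_RESPONSE)
print(summary)
```

The extracted `query_id` and `host` values are the two pieces of information needed to locate the matching entries on the data server.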

Channel disconnection can occur for various reasons.
To verify that the error is due to a web server timeout, search for the query ID in the Historical logs.
The query ID in the example above is `6eee73a6-a95f-4bdc-821d-981e99e39242`.
The `"host"` field in the error message above indicates the host and port of the Historical in question.
In the Historical logs, look for an exception that reports `Idle timeout expired`:

```text
2021-09-14T19:52:27,685 ERROR [qtp475526834-85[scan_[test_large_table]_6eee73a6-a95f-4bdc-821d-981e99e39242]] org.apache.druid.server.QueryResource - Unable to send query response. (java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms)
2021-09-14T19:52:27,685 ERROR [qtp475526834-85] org.apache.druid.server.QueryLifecycle - Exception while processing queryId [6eee73a6-a95f-4bdc-821d-981e99e39242] (java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms)
2021-09-14T19:52:27,686 WARN [qtp475526834-85] org.eclipse.jetty.server.HttpChannel - handleException /druid/v2/ java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms
```
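
The log search described above can be scripted. The following Python sketch filters log lines by query ID and keeps those that report `Idle timeout expired`. The function name `find_idle_timeouts` is hypothetical, and the sketch assumes that the second number in `300000/300000 ms` is the configured idle timeout, which is how the Jetty message appears to be laid out.

```python
import re

# The three lines from the Historical log excerpt above.
SAMPLE_LOG = """\
2021-09-14T19:52:27,685 ERROR [qtp475526834-85[scan_[test_large_table]_6eee73a6-a95f-4bdc-821d-981e99e39242]] org.apache.druid.server.QueryResource - Unable to send query response. (java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms)
2021-09-14T19:52:27,685 ERROR [qtp475526834-85] org.apache.druid.server.QueryLifecycle - Exception while processing queryId [6eee73a6-a95f-4bdc-821d-981e99e39242] (java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms)
2021-09-14T19:52:27,686 WARN [qtp475526834-85] org.eclipse.jetty.server.HttpChannel - handleException /druid/v2/ java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms
"""

IDLE_TIMEOUT = re.compile(r"Idle timeout expired: (\d+)/(\d+) ms")


def find_idle_timeouts(lines, query_id):
    """Return (timestamp, timeout_ms) for lines that mention query_id and an idle timeout."""
    hits = []
    for line in lines:
        if query_id not in line:
            continue  # Line belongs to another query or carries no query ID.
        match = IDLE_TIMEOUT.search(line)
        if match:
            timestamp = line.split(" ", 1)[0]
            hits.append((timestamp, int(match.group(2))))
    return hits


hits = find_idle_timeouts(SAMPLE_LOG.splitlines(), "6eee73a6-a95f-4bdc-821d-981e99e39242")
for timestamp, timeout_ms in hits:
    print(f"{timestamp}: idle timeout of {timeout_ms} ms expired")
```

Only the first two sample lines match, because the Jetty `HttpChannel` warning does not include the query ID. To scan a real log, pass an open file object instead of `SAMPLE_LOG.splitlines()`; the log file location depends on your deployment.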

To mitigate query failure due to web server timeout:

* Increase the max idle time for the web server.
  Set the max idle time in the `druid.server.http.maxIdleTime` property in the `historical/runtime.properties` file.
  You must restart the Druid cluster for this change to take effect.
  See [Configuration reference](../configuration/index.md) for more information on configuring the server.
* If the timeout occurs because the data servers have not pushed any results to the Broker, consider optimizing data server performance. A significant slowdown in the data servers may result from spilling too much data to disk in [groupBy v2 queries](groupbyquery.md#performance-tuning-for-groupby-v2), from large [`IN` filters](filters.md#in-filter) in the query, or from an under-scaled cluster. Analyze your [Druid query metrics](../operations/metrics.md#query-metrics) to determine the bottleneck.
* If the timeout is caused by Broker backpressure, consider optimizing Broker performance. Check whether the connection is fast enough between the Broker and deep storage.
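
As an illustration of the first mitigation, the Python sketch below turns the timeout reported in the log (300000 ms) into a candidate property line. It assumes that `druid.server.http.maxIdleTime` takes an ISO 8601 duration, which is consistent with the `PT5M` default listed in the configuration reference. The helper `millis_to_iso8601` is hypothetical, and doubling the value is an arbitrary starting point, not a recommendation.

```python
def millis_to_iso8601(millis: int) -> str:
    """Format a positive, whole-second millisecond count as an ISO 8601 duration."""
    seconds_total, leftover = divmod(millis, 1000)
    if millis <= 0 or leftover:
        raise ValueError("expected a positive whole number of seconds")
    hours, rest = divmod(seconds_total, 3600)
    minutes, seconds = divmod(rest, 60)
    parts = [(hours, "H"), (minutes, "M"), (seconds, "S")]
    # Skip zero-valued components so 600000 ms becomes PT10M, not PT0H10M0S.
    return "PT" + "".join(f"{value}{unit}" for value, unit in parts if value)


current_ms = 300000            # Timeout reported in the Historical log above.
proposed_ms = current_ms * 2   # Arbitrary example increase (assumption).

property_line = f"druid.server.http.maxIdleTime={millis_to_iso8601(proposed_ms)}"
print(property_line)
```

The printed line is what you would place in `historical/runtime.properties` before restarting. Choose the actual value from your query metrics instead of a fixed multiplier.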

@@ -68,6 +68,7 @@
"querying/sql",
"querying/querying",
"querying/query-execution",
"querying/troubleshooting",
{
  "type": "subcategory",
  "label": "Concepts",