mirror of https://github.com/apache/druid.git
Doc fixes around msq (#13090)
* remove things that do not apply
* fix more things
* pin node to a working version
* fix
* fixes
* known issues tidy up
* revert auto formatting changes
* remove management-uis page which is 100% lies
* don't mention the Coordinator console (that no longer exists)
* goodies
* fix typo
This commit is contained in:
parent
5ece870634
commit
2493eb17bf
@@ -893,9 +893,10 @@ These Coordinator static configurations can be defined in the `coordinator/runti

#### Dynamic Configuration

- The Coordinator has dynamic configuration to change certain behavior on the fly. The Coordinator uses a JSON spec object from the Druid [metadata storage](../dependencies/metadata-storage.md) config table. This object is detailed below:
+ The Coordinator has dynamic configuration to change certain behavior on the fly.

- It is recommended that you use the Coordinator Console to configure these parameters. However, if you need to do it via HTTP, the JSON object can be submitted to the Coordinator via a POST request at:
+ It is recommended that you use the [web console](../operations/druid-console.md) to configure these parameters.
+ However, if you need to do it via HTTP, the JSON object can be submitted to the Coordinator via a POST request at:

```
http://<COORDINATOR_IP>:<PORT>/druid/coordinator/v1/config
@@ -145,10 +145,6 @@ For more information, see [Avoid conflicts with ingestion](../ingestion/automati
> and their total size exceeds [`inputSegmentSizeBytes`](../configuration/index.md#automatic-compaction-dynamic-configuration).
> If it finds such segments, it simply skips them.

- ### The Coordinator console
-
- The Druid Coordinator exposes a web GUI for displaying cluster information and rule configuration. For more details, see [Coordinator console](../operations/management-uis.md#coordinator-consoles).
-
### FAQ

1. **Do clients ever contact the Coordinator process?**
@@ -40,10 +40,6 @@ In local mode Overlord is also responsible for creating Peons for executing task
Local mode is typically used for simple workflows. In remote mode, the Overlord and MiddleManager are run in separate processes and you can run each on a different server.
This mode is recommended if you intend to use the indexing service as the single endpoint for all Druid indexing.

- ### Overlord console
-
- The Overlord provides a UI for managing tasks and workers. For more details, please see [overlord console](../operations/management-uis.md#overlord-console).
-
### Blacklisted workers

If a MiddleManager has task failures above a threshold, the Overlord will blacklist these MiddleManagers. No more than 20% of the MiddleManagers can be blacklisted. Blacklisted MiddleManagers will be periodically whitelisted.
@@ -144,11 +144,11 @@ T00:00:00.000Z/2015-04-14T02:41:09.484Z/0/index.zip] to [/opt/druid/zk_druid/dde

* DataSegmentKiller

- The easiest way of testing the segment killing is marking a segment as not used and then starting a killing task through the old Coordinator console.
+ The easiest way of testing the segment killing is marking a segment as not used and then starting a killing task in the [web console](../operations/druid-console.md).

To mark a segment as not used, you need to connect to your metadata storage and update the `used` column to `false` on the segment table rows.
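For example, a minimal sketch of that metadata update, assuming the default `druid_segments` metadata table and a hypothetical datasource named `wikipedia`:

```sql
-- Marks every segment of one datasource as not used.
-- druid_segments is the default metadata segment table name; adjust it and the
-- datasource name to match your deployment before running this against metadata storage.
UPDATE druid_segments
SET used = false
WHERE datasource = 'wikipedia';
```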
- To start a segment killing task, you need to access the old Coordinator console `http://<COODRINATOR_IP>:<COORDINATOR_PORT/old-console/kill.html` then select the appropriate datasource and then input a time range (e.g. `2000/3000`).
+ To start a segment killing task, you need to access the web console then select `issue kill task` for the appropriate datasource.

After the killing task ends, `index.zip` (`partitionNum_index.zip` for HDFS data storage) file should be deleted from the data storage.
@@ -75,7 +75,9 @@ parsing data is less efficient than writing a native Java parser or using an ext
## Input format

You can use the `inputFormat` field to specify the data format for your input data.

- > `inputFormat` doesn't support all data formats or ingestion methods supported by Druid yet.
+ > `inputFormat` doesn't support all data formats or ingestion methods supported by Druid.

Especially if you want to use the Hadoop ingestion, you still need to use the [Parser](#parser).
If your data is formatted in some format not listed in this section, please consider using the Parser instead.
@@ -47,7 +47,7 @@ Other common reasons that hand-off fails are as follows:

1) Druid is unable to write to the metadata storage. Make sure your configurations are correct.

- 2) Historical processes are out of capacity and cannot download any more segments. You'll see exceptions in the Coordinator logs if this occurs and the Coordinator console will show the Historicals are near capacity.
+ 2) Historical processes are out of capacity and cannot download any more segments. You'll see exceptions in the Coordinator logs if this occurs and the web console will show the Historicals are near capacity.

3) Segments are corrupt and cannot be downloaded. You'll see exceptions in your Historical processes if this occurs.
@@ -73,7 +73,7 @@ Note that this workflow only guarantees that the segments are available at the t

## I don't see my Druid segments on my Historical processes

- You can check the Coordinator console located at `<COORDINATOR_IP>:<PORT>`. Make sure that your segments have actually loaded on [Historical processes](../design/historical.md). If your segments are not present, check the Coordinator logs for messages about capacity of replication errors. One reason that segments are not downloaded is because Historical processes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example):
+ You can check the [web console](../operations/druid-console.md) to make sure that your segments have actually loaded on [Historical processes](../design/historical.md). If your segments are not present, check the Coordinator logs for messages about capacity of replication errors. One reason that segments are not downloaded is because Historical processes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example):

```
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]
@@ -81,7 +81,7 @@ SELECT
*
FROM TABLE(
EXTERN(
- '{"type": "http", "uris": ["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
@@ -107,7 +107,7 @@ SELECT
"user"
FROM TABLE(
EXTERN(
- '{"type": "http", "uris": ["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
@@ -145,7 +145,7 @@ SELECT
"user"
FROM TABLE(
EXTERN(
- '{"type": "http", "uris": ["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
@@ -166,7 +166,7 @@ SELECT
"user"
FROM TABLE(
EXTERN(
- '{"type": "http", "uris": ["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
)
@@ -197,7 +197,7 @@ For more information, see [Primary timestamp](../ingestion/data-model.md#primary

### PARTITIONED BY

- INSERT and REPLACE queries require the PARTITIONED BY clause, which determines how time-based partitioning is done. In Druid, data is split into segments, one or more per time chunk defined by the PARTITIONED BY granularity. A good general rule is to adjust the granularity so that each segment contains about five million rows. Choose a granularity based on your ingestion rate. For example, if you ingest a million rows per day, PARTITION BY DAY is good. If you ingest a million rows an hour, choose PARTITION BY HOUR instead.
+ INSERT and REPLACE queries require the PARTITIONED BY clause, which determines how time-based partitioning is done. In Druid, data is split into segments, one or more per time chunk defined by the PARTITIONED BY granularity. A good general rule is to adjust the granularity so that each segment contains about five million rows. Choose a granularity based on your ingestion rate. For example, if you ingest a million rows per day, PARTITIONED BY DAY is good. If you ingest a million rows an hour, choose PARTITION BY HOUR instead.

Using the clause provides the following benefits:
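To ground the corrected wording, here is a minimal sketch that reuses the wikipedia EXTERN call and the `w000` table from this commit's other examples; the exact combination is illustrative only:

```sql
-- Day granularity: each day of data becomes its own time chunk of segments.
INSERT INTO w000
SELECT
  TIME_PARSE("timestamp") AS __time,
  page,
  "user"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```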
@@ -235,7 +235,7 @@ You can use the following ISO 8601 periods for `TIME_FLOOR`:

### CLUSTERED BY

- Data is first divided by the PARTITION BY clause. Data can be further split by the CLUSTERED BY clause. For example, suppose you ingest 100 M rows per hour and use `PARTITIONED BY HOUR` as your time partition. You then divide up the data further by adding `CLUSTERED BY hostName`. The result is segments of about 5 million rows, with like `hostNames` grouped within the same segment.
+ Data is first divided by the PARTITIONED BY clause. Data can be further split by the CLUSTERED BY clause. For example, suppose you ingest 100 M rows per hour and use `PARTITIONED BY HOUR` as your time partition. You then divide up the data further by adding `CLUSTERED BY hostName`. The result is segments of about 5 million rows, with like `hostName`s grouped within the same segment.

Using CLUSTERED BY has the following benefits:
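A minimal sketch of the scenario that paragraph describes; the input URI, datasource name, and column list are hypothetical and only illustrate how CLUSTERED BY refines the time partitioning set by PARTITIONED BY:

```sql
-- Hypothetical input and names: hourly time chunks, with rows grouped by
-- hostName within each chunk so that like hostName values share segments.
INSERT INTO server_metrics
SELECT
  TIME_PARSE("timestamp") AS __time,
  hostName,
  bytesSent
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/metrics.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "hostName", "type": "string"}, {"name": "bytesSent", "type": "long"}]'
  )
)
PARTITIONED BY HOUR
CLUSTERED BY hostName
```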
File diff suppressed because it is too large
@@ -115,7 +115,7 @@ memory available (`-XX:MaxDirectMemorySize`) to at least
`(druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes`. Increasing the
amount of direct memory available beyond the minimum does not speed up processing.

- It may be necessary to override one or more memory-related parameters if you run into one of the [known issues around memory usage](./msq-known-issues.md#memory-usage).
+ It may be necessary to override one or more memory-related parameters if you run into one of the [known issues](./msq-known-issues.md) around memory usage.
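A worked instance of the direct-memory formula quoted above, using assumed example values rather than defaults:

```latex
% Assumed: druid.processing.numThreads = 7, druid.processing.buffer.sizeBytes = 500 MiB
(7 + 1) \times 500\,\mathrm{MiB} = 4000\,\mathrm{MiB} \approx 3.9\,\mathrm{GiB}\ \text{of direct memory, at minimum}
```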
## Limits
@@ -34,9 +34,6 @@ This example inserts data into a table named `w000` without performing any data
<details><summary>Show the query</summary>

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
INSERT INTO w000
SELECT
TIME_PARSE("timestamp") AS __time,
@@ -65,7 +62,7 @@ SELECT
regionName
FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type":"json"}',
'[{"name":"isRobot","type":"string"},{"name":"channel","type":"string"},{"name":"timestamp","type":"string"},{"name":"flags","type":"string"},{"name":"isUnpatrolled","type":"string"},{"name":"page","type":"string"},{"name":"diffUrl","type":"string"},{"name":"added","type":"long"},{"name":"comment","type":"string"},{"name":"commentLength","type":"long"},{"name":"isNew","type":"string"},{"name":"isMinor","type":"string"},{"name":"delta","type":"long"},{"name":"isAnonymous","type":"string"},{"name":"user","type":"string"},{"name":"deltaBucket","type":"long"},{"name":"deleted","type":"long"},{"name":"namespace","type":"string"},{"name":"cityName","type":"string"},{"name":"countryName","type":"string"},{"name":"regionIsoCode","type":"string"},{"name":"metroCode","type":"long"},{"name":"countryIsoCode","type":"string"},{"name":"regionName","type":"string"}]'
)
@@ -83,15 +80,12 @@ This example inserts data into a table named `kttm_data` and performs data rollu
<details><summary>Show the query</summary>

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
INSERT INTO "kttm_rollup"

WITH kttm_data AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/kttm/kttm-v2-2019-08-25.json.gz"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
'{"type":"json"}',
'[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"language","type":"string"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]'
)
@@ -129,9 +123,6 @@ This example aggregates data from a table named `w000` and inserts the result in
<details><summary>Show the query</summary>

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
INSERT INTO w002
SELECT
FLOOR(__time TO MINUTE) AS __time,
@@ -160,21 +151,18 @@ This example inserts data into a table named `w003` and joins data from two sour
<details><summary>Show the query</summary>

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
INSERT INTO w003
WITH
wikidata AS (SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type":"json"}',
'[{"name":"isRobot","type":"string"},{"name":"channel","type":"string"},{"name":"timestamp","type":"string"},{"name":"flags","type":"string"},{"name":"isUnpatrolled","type":"string"},{"name":"page","type":"string"},{"name":"diffUrl","type":"string"},{"name":"added","type":"long"},{"name":"comment","type":"string"},{"name":"commentLength","type":"long"},{"name":"isNew","type":"string"},{"name":"isMinor","type":"string"},{"name":"delta","type":"long"},{"name":"isAnonymous","type":"string"},{"name":"user","type":"string"},{"name":"deltaBucket","type":"long"},{"name":"deleted","type":"long"},{"name":"namespace","type":"string"},{"name":"cityName","type":"string"},{"name":"countryName","type":"string"},{"name":"regionIsoCode","type":"string"},{"name":"metroCode","type":"long"},{"name":"countryIsoCode","type":"string"},{"name":"regionName","type":"string"}]'
)
)),
countries AS (SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/lookup/country.tsv"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/lookup/countries.tsv"]}',
'{"type":"tsv","findColumnsFromHeader":true}',
'[{"name":"Country","type":"string"},{"name":"Capital","type":"string"},{"name":"ISO3","type":"string"},{"name":"ISO2","type":"string"}]'
)
@@ -219,9 +207,6 @@ This example replaces the entire datasource used in the table `w007` with the ne
<details><summary>Show the query</summary>

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
REPLACE INTO w007
OVERWRITE ALL
SELECT
@@ -251,7 +236,7 @@ SELECT
regionName
FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type":"json"}',
'[{"name":"isRobot","type":"string"},{"name":"channel","type":"string"},{"name":"timestamp","type":"string"},{"name":"flags","type":"string"},{"name":"isUnpatrolled","type":"string"},{"name":"page","type":"string"},{"name":"diffUrl","type":"string"},{"name":"added","type":"long"},{"name":"comment","type":"string"},{"name":"commentLength","type":"long"},{"name":"isNew","type":"string"},{"name":"isMinor","type":"string"},{"name":"delta","type":"long"},{"name":"isAnonymous","type":"string"},{"name":"user","type":"string"},{"name":"deltaBucket","type":"long"},{"name":"deleted","type":"long"},{"name":"namespace","type":"string"},{"name":"cityName","type":"string"},{"name":"countryName","type":"string"},{"name":"regionIsoCode","type":"string"},{"name":"metroCode","type":"long"},{"name":"countryIsoCode","type":"string"},{"name":"regionName","type":"string"}]'
)
@@ -269,9 +254,6 @@ This example replaces certain segments in a datasource with the new query data w
<details><summary>Show the query</summary>

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
REPLACE INTO w007
OVERWRITE WHERE __time >= TIMESTAMP '2019-08-25 02:00:00' AND __time < TIMESTAMP '2019-08-25 03:00:00'
SELECT
@@ -295,9 +277,6 @@ CLUSTERED BY page
<details><summary>Show the query</summary>

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
REPLACE INTO w000
OVERWRITE ALL
SELECT
@@ -326,13 +305,10 @@ CLUSTERED BY page

```sql
- --:context finalizeAggregations: false
- --:context groupByEnableMultiValueUnnesting: false
-
WITH flights AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/FlightCarrierOnTime/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
'{"type":"csv","findColumnsFromHeader":true}',
'[{"name":"depaturetime","type":"string"},{"name":"arrivalime","type":"string"},{"name":"Year","type":"long"},{"name":"Quarter","type":"long"},{"name":"Month","type":"long"},{"name":"DayofMonth","type":"long"},{"name":"DayOfWeek","type":"long"},{"name":"FlightDate","type":"string"},{"name":"Reporting_Airline","type":"string"},{"name":"DOT_ID_Reporting_Airline","type":"long"},{"name":"IATA_CODE_Reporting_Airline","type":"string"},{"name":"Tail_Number","type":"string"},{"name":"Flight_Number_Reporting_Airline","type":"long"},{"name":"OriginAirportID","type":"long"},{"name":"OriginAirportSeqID","type":"long"},{"name":"OriginCityMarketID","type":"long"},{"name":"Origin","type":"string"},{"name":"OriginCityName","type":"string"},{"name":"OriginState","type":"string"},{"name":"OriginStateFips","type":"long"},{"name":"OriginStateName","type":"string"},{"name":"OriginWac","type":"long"},{"name":"DestAirportID","type":"long"},{"name":"DestAirportSeqID","type":"long"},{"name":"DestCityMarketID","type":"long"},{"name":"Dest","type":"string"},{"name":"DestCityName","type":"string"},{"name":"DestState","type":"string"},{"name":"DestStateFips","type":"long"},{"name":"DestStateName","type":"string"},{"name":"DestWac","type":"long"},{"name":"CRSDepTime","type":"long"},{"name":"DepTime","type":"long"},{"name":"DepDelay","type":"long"},{"name":"DepDelayMinutes","type":"long"},{"name":"DepDel15","type":"long"},{"name":"DepartureDelayGroups","type":"long"},{"name":"DepTimeBlk","type":"string"},{"name":"TaxiOut","type":"long"},{"name":"WheelsOff","type":"long"},{"name":"WheelsOn","type":"long"},{"name":"TaxiIn","type":"long"},{"name":"CRSArrTime","type":"long"},{"name":"ArrTime","type":"long"},{"name":"ArrDelay","type":"long"},{"name":"ArrDelayMinutes","type":"long"},{"name":"ArrDel15","type":"long"},{"name":"ArrivalDelayGroups","type":"long"},{"name":"ArrTimeBlk","type":"string"},{"name":"Cancelled","type":"long"},{"name":"CancellationCode","type":"string"},{"name":"Diverted","type":"long"},{"name":"CRSElapsedTime","type":"long"},{"name":"ActualElapsedTime","type":"long"},{"name":"AirTime","type":"long"},{"name":"Flights","type":"long"},{"name":"Distance","type":"long"},{"name":"DistanceGroup","type":"long"},{"name":"CarrierDelay","type":"long"},{"name":"WeatherDelay","type":"long"},{"name":"NASDelay","type":"long"},{"name":"SecurityDelay","type":"long"},{"name":"LateAircraftDelay","type":"long"},{"name":"FirstDepTime","type":"string"},{"name":"TotalAddGTime","type":"string"},{"name":"LongestAddGTime","type":"string"},{"name":"DivAirportLandings","type":"string"},{"name":"DivReachedDest","type":"string"},{"name":"DivActualElapsedTime","type":"string"},{"name":"DivArrDelay","type":"string"},{"name":"DivDistance","type":"string"},{"name":"Div1Airport","type":"string"},{"name":"Div1AirportID","type":"string"},{"name":"Div1AirportSeqID","type":"string"},{"name":"Div1WheelsOn","type":"string"},{"name":"Div1TotalGTime","type":"string"},{"name":"Div1LongestGTime","type":"string"},{"name":"Div1WheelsOff","type":"string"},{"name":"Div1TailNum","type":"string"},{"name":"Div2Airport","type":"string"},{"name":"Div2AirportID","type":"string"},{"name":"Div2AirportSeqID","type":"string"},{"name":"Div2WheelsOn","type":"string"},{"name":"Div2TotalGTime","type":"string"},{"name":"Div2LongestGTime","type":"string"},{"name":"Div2WheelsOff","type":"string"},{"name":"Div2TailNum","type":"string"},{"name":"Div3Airport","type":"string"},{"name":"Div3AirportID","type":"string"},{"name":"Div3AirportSeqID","type":"string"},{"name":"Div3
WheelsOn","type":"string"},{"name":"Div3TotalGTime","type":"string"},{"name":"Div3LongestGTime","type":"string"},{"name":"Div3WheelsOff","type":"string"},{"name":"Div3TailNum","type":"string"},{"name":"Div4Airport","type":"string"},{"name":"Div4AirportID","type":"string"},{"name":"Div4AirportSeqID","type":"string"},{"name":"Div4WheelsOn","type":"string"},{"name":"Div4TotalGTime","type":"string"},{"name":"Div4LongestGTime","type":"string"},{"name":"Div4WheelsOff","type":"string"},{"name":"Div4TailNum","type":"string"},{"name":"Div5Airport","type":"string"},{"name":"Div5AirportID","type":"string"},{"name":"Div5AirportSeqID","type":"string"},{"name":"Div5WheelsOn","type":"string"},{"name":"Div5TotalGTime","type":"string"},{"name":"Div5LongestGTime","type":"string"},{"name":"Div5WheelsOff","type":"string"},{"name":"Div5TailNum","type":"string"},{"name":"Unnamed: 109","type":"string"}]'
)
@@ -340,7 +316,7 @@ WITH flights AS (
L_AIRPORT AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/FlightCarrierOnTime/dimensions/L_AIRPORT.csv"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/dimensions/L_AIRPORT.csv"]}',
'{"type":"csv","findColumnsFromHeader":true}',
'[{"name":"Code","type":"string"},{"name":"Description","type":"string"}]'
)
@@ -348,7 +324,7 @@ L_AIRPORT AS (
L_AIRPORT_ID AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/FlightCarrierOnTime/dimensions/L_AIRPORT_ID.csv"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/dimensions/L_AIRPORT_ID.csv"]}',
'{"type":"csv","findColumnsFromHeader":true}',
'[{"name":"Code","type":"long"},{"name":"Description","type":"string"}]'
)
@@ -356,7 +332,7 @@ L_AIRPORT_ID AS (
L_AIRLINE_ID AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/FlightCarrierOnTime/dimensions/L_AIRLINE_ID.csv"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/dimensions/L_AIRLINE_ID.csv"]}',
'{"type":"csv","findColumnsFromHeader":true}',
'[{"name":"Code","type":"long"},{"name":"Description","type":"string"}]'
)
@@ -364,7 +340,7 @@ L_AIRLINE_ID AS (
L_CITY_MARKET_ID AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/FlightCarrierOnTime/dimensions/L_CITY_MARKET_ID.csv"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/dimensions/L_CITY_MARKET_ID.csv"]}',
'{"type":"csv","findColumnsFromHeader":true}',
'[{"name":"Code","type":"long"},{"name":"Description","type":"string"}]'
)
@@ -372,7 +348,7 @@ L_CITY_MARKET_ID AS (
L_CANCELLATION AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/FlightCarrierOnTime/dimensions/L_CANCELLATION.csv"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/dimensions/L_CANCELLATION.csv"]}',
'{"type":"csv","findColumnsFromHeader":true}',
'[{"name":"Code","type":"string"},{"name":"Description","type":"string"}]'
)
@@ -380,7 +356,7 @@ L_CANCELLATION AS (
L_STATE_FIPS AS (
SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/FlightCarrierOnTime/dimensions/L_STATE_FIPS.csv"]}',
+ '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/dimensions/L_STATE_FIPS.csv"]}',
'{"type":"csv","findColumnsFromHeader":true}',
'[{"name":"Code","type":"long"},{"name":"Description","type":"string"}]'
)
@@ -23,96 +23,42 @@ sidebar_label: Known issues
~ under the License.
-->

> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0.
> Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported.
> We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment
> before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not
> write to a datasource is experimental.

## General query execution
## Multi-stage query task runtime

- There's no fault tolerance. If any task fails, the entire query fails.
- Fault tolerance is not implemented. If any task fails, the entire query fails.

- Only one local file system per server is used for stage output data during multi-stage query
execution. If your servers have multiple local file systems, this causes queries to exhaust
available disk space earlier than expected.
- SELECT from a Druid datasource does not include unpublished real-time data.

- When `msqMaxNumTasks` is higher than the total
capacity of the cluster, more tasks may be launched than can run at once. This leads to a
[TaskStartTimeout](./msq-reference.md#context-parameters) error code, as there is never enough capacity to run the query.
To avoid this, set `msqMaxNumTasks` to a number of tasks that can run simultaneously on your cluster.
- GROUPING SETS is not implemented. Queries that use GROUPING SETS fail.

- When `msqTaskAssignment` is set to `auto`, the system generates one task per input file for certain splittable
input sources where file sizes are not known ahead of time. This includes the `http` input source, where the system
generates one task per URI.
- Worker task stage outputs are stored in the working directory given by `druid.indexer.task.baseDir`. Stages that
generate a large amount of output data may exhaust all available disk space. In this case, the query fails with
an [UnknownError](./msq-reference.md#error-codes) with a message including "No space left on device".

## Memory usage
- The numeric varieties of the EARLIEST and LATEST aggregators do not work properly. Attempting to use the numeric
varieties of these aggregators lead to an error like
`java.lang.ClassCastException: class java.lang.Double cannot be cast to class org.apache.druid.collections.SerializablePair`.
The string varieties, however, do work properly.

- INSERT queries can consume excessive memory when using complex types due to inaccurate footprint
estimation. This can appear as an OutOfMemoryError during the SegmentGenerator stage when using
sketches. If you run into this issue, try manually lowering the value of the
[`msqRowsInMemory`](./msq-reference.md#context-parameters) parameter.
## INSERT and REPLACE

- EXTERN loads an entire row group into memory at once when reading from Parquet files. Row groups
can be up to 1 GB in size, which can lead to excessive heap usage when reading many files in
parallel. This can appear as an OutOfMemoryError during stages that read Parquet input files. If
you run into this issue, try using a smaller number of worker tasks or you can increase the heap
size of your Indexers or of your Middle Manager-launched indexing tasks.
- INSERT with column lists, like `INSERT INTO tbl (a, b, c) SELECT ...`, is not implemented.

- Ingesting a very long row may consume excessive memory and result in an OutOfMemoryError. If a row is read
which requires more memory than is available, the service might throw OutOfMemoryError. If you run into this
issue, allocate enough memory to be able to store the largest row to the indexer.
- `INSERT ... SELECT` inserts columns from the SELECT statement based on column name. This differs from SQL standard
behavior, where columns are inserted based on position.
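To illustrate the column-name matching behavior noted in the bullet directly above, here is a hedged sketch; the destination table `dst` and its columns are hypothetical, while `w000`, `cityName`, and `added` come from examples elsewhere in this commit:

```sql
-- MSQ matches SELECT output columns to the target table by name, not position,
-- so each expression is aliased to the destination column it should fill.
-- dst is a hypothetical table with columns (__time, city, clicks).
INSERT INTO dst
SELECT
  __time,
  cityName AS city,
  added AS clicks
FROM w000
PARTITIONED BY DAY
```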
## SELECT queries

- SELECT query results do not include real-time data until it has been published.

- TIMESTAMP types are formatted as numbers rather than ISO8601 timestamp
strings, which differs from Druid's standard result format.

- BOOLEAN types are formatted as numbers like `1` and `0` rather
than `true` or `false`, which differs from Druid's standard result
format.

- TopN is not implemented. The context parameter
`useApproximateTopN` is ignored and always treated as if it
were `false`. Therefore, topN-shaped queries will
always run using the groupBy engine. There is no loss of
functionality, but there may be a performance impact, since
these queries will run using an exact algorithm instead of an
approximate one.

- GROUPING SETS is not implemented. Queries that use GROUPING SETS
will fail.

- The numeric flavors of the EARLIEST and LATEST aggregators do not work properly. Attempting to use the numeric flavors of these aggregators will lead to an error like `java.lang.ClassCastException: class java.lang.Double cannot be cast to class org.apache.druid.collections.SerializablePair`. The string flavors, however, do work properly.

## INSERT queries
## EXTERN

- The [schemaless dimensions](../ingestion/ingestion-spec.md#inclusions-and-exclusions)
feature is not available. All columns and their types must be specified explicitly.
feature is not available. All columns and their types must be specified explicitly using the `signature` parameter
of the [EXTERN function](msq-reference.md#extern).

- [Segment metadata queries](../querying/segmentmetadataquery.md)
on datasources ingested with the Multi-Stage Query Engine will return values for`timestampSpec` that are not usable
for introspection.
- EXTERN with input sources that match large numbers of files may exhaust available memory on the controller task.

- When INSERT with GROUP BY does the match the criteria mentioned in [GROUP BY](./index.md#group-by), the multi-stage engine generates segments that Druid's compaction
functionality is not able to further roll up. This applies to automatic compaction as well as manually
issued `compact` tasks. Individual queries executed with the multi-stage engine always guarantee
perfect rollup for their output, so this only matters if you are performing a sequence of INSERT
queries that each append data to the same time chunk. If necessary, you can compact such data
using another SQL query instead of a `compact` task.

- When using INSERT with GROUP BY, splitting of large partitions is not currently
implemented. If a single partition key appears in a
very large number of rows, an oversized segment will be created.
You can mitigate this by adding additional columns to your
partition key. Note that partition splitting _does_ work properly
when performing INSERT without GROUP BY.

- INSERT with column lists, like
`INSERT INTO tbl (a, b, c) SELECT ...`, is not implemented.

## EXTERN queries

- EXTERN does not accept `druid` input sources.

## Missing guardrails

- Maximum number of input files. Since there's no limit, the controller can potentially run out of memory tracking all input files

- Maximum amount of local disk space to use for temporary data. No guardrail today means worker tasks may exhaust all available disk space. In this case, you will receive an [UnknownError](./msq-reference.md#error-codes)) with a message including "No space left on device".
- EXTERN does not accept `druid` input sources. Use FROM instead.
@@ -50,9 +50,9 @@ The following table lists the context parameters for the MSQ task engine:
|Parameter|Description|Default value|
|---------|-----------|-------------|
| maxNumTasks | SELECT, INSERT, REPLACE<br /><br />The maximum total number of tasks to launch, including the controller task. The lowest possible value for this setting is 2: one controller and one worker. All tasks must be able to launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` error code after approximately 10 minutes.<br /><br />May also be provided as `numTasks`. If both are present, `maxNumTasks` takes priority.| 2 |
| taskAssignment | SELECT, INSERT, REPLACE<br /><br />Determines how many tasks to use. Possible values include: <ul><li>`max`: Use as many tasks as possible, up to the maximum `maxNumTasks`.</li><li>`auto`: Use as few tasks as possible without exceeding 10 GiB or 10,000 files per task. Review the [limitations](./msq-known-issues.md#general-query-execution) of `auto` mode before using it.</li></ui>| `max` |
| taskAssignment | SELECT, INSERT, REPLACE<br /><br />Determines how many tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be determined through directory listing (for example: local files, S3, GCS, HDFS) uses as few tasks as possible without exceeding 10 GiB or 10,000 files per task, unless exceeding these limits is necessary to stay within `maxNumTasks`. When file sizes cannot be determined through directory listing (for example: http), behaves the same as `max`.</li></ui> | `max` |
| finalizeAggregations | SELECT, INSERT, REPLACE<br /><br />Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | true |
| rowsInMemory | INSERT or REPLACE<br /><br />Maximum number of rows to store in memory at once before flushing to disk during the segment generation process. Ignored for non-INSERT queries. In most cases, use the default value. You may need to override the default if you run into one of the [known issues around memory usage](./msq-known-issues.md#memory-usage)</a>. | 100,000 |
| rowsInMemory | INSERT or REPLACE<br /><br />Maximum number of rows to store in memory at once before flushing to disk during the segment generation process. Ignored for non-INSERT queries. In most cases, use the default value. You may need to override the default if you run into one of the [known issues](./msq-known-issues.md) around memory usage. | 100,000 |
| segmentSortOrder | INSERT or REPLACE<br /><br />Normally, Druid sorts rows in individual segments using `__time` first, followed by the [CLUSTERED BY](./index.md#clustered-by) clause. When you set `segmentSortOrder`, Druid sorts rows in segments using this column list first, followed by the CLUSTERED BY order.<br /><br />You provide the column list as comma-separated values or as a JSON array in string form. If your query includes `__time`, then this list must begin with `__time`. For example, consider an INSERT query that uses `CLUSTERED BY country` and has `segmentSortOrder` set to `__time,city`. Within each time chunk, Druid assigns rows to segments based on `country`, and then within each of those segments, Druid sorts those rows by `__time` first, then `city`, then `country`. | empty list |
| maxParseExceptions| SELECT, INSERT, REPLACE<br /><br />Maximum number of parse exceptions that are ignored while executing the query before it stops with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value to -1.| 0 |
| rowsPerSegment | INSERT or REPLACE<br /><br />The number of rows per segment to target. The actual number of rows per segment may be somewhat higher or lower than this number. In most cases, use the default. For general information about sizing rows per segment, see [Segment Size Optimization](../operations/segment-optimization.md). | 3,000,000 |
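As a hedged illustration of how a few of these parameters are used together during ingestion: the `--:context` comment form below is the web-console syntax that appears in this commit's examples, `APPROX_COUNT_DISTINCT_DS_HLL` assumes the DataSketches extension is loaded, and the tables reuse `w000`/`w002` from those examples:

```sql
--:context finalizeAggregations: false
--:context groupByEnableMultiValueUnnesting: false

-- With finalizeAggregations set to false, the HLL sketch is stored in its
-- intermediate form rather than being finalized to a plain number.
INSERT INTO w002
SELECT
  FLOOR(__time TO MINUTE) AS __time,
  page,
  APPROX_COUNT_DISTINCT_DS_HLL("user") AS unique_users
FROM w000
GROUP BY 1, 2
PARTITIONED BY DAY
```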
@@ -1,6 +1,6 @@
---
id: connect-external-data
- title: Tutorial - Connect external data for SQL-based ingestion
+ title: Tutorial - Load files with SQL-based ingestion
description: How to generate a query that references externally hosted data
---
@@ -27,22 +27,19 @@ description: How to generate a query that references externally hosted data

This tutorial demonstrates how to generate a query that references externally hosted data using the **Connect external data** wizard.

- The following example uses EXTERN to query a JSON file located at https://static.imply.io/data/wikipedia.json.gz.
+ The following example uses EXTERN to query a JSON file located at https://druid.apache.org/data/wikipedia.json.gz.

Although you can manually create a query in the UI, you can use Druid to generate a base query for you that you can modify to meet your requirements.

To generate a query from external data, do the following:

1. In the **Query** view of the Druid console, click **Connect external data**.
- 2. On the **Select input type** screen, choose **HTTP(s)** and enter the following value in the **URIs** field: `https://static.imply.io/data/wikipedia.json.gz`. Leave the HTTP auth username and password blank.
+ 2. On the **Select input type** screen, choose **HTTP(s)** and enter the following value in the **URIs** field: `https://druid.apache.org/data/wikipedia.json.gz`. Leave the HTTP auth username and password blank.
3. Click **Connect data**.
4. On the **Parse** screen, you can perform additional actions before you load the data into Druid:
   - Expand a row to see what data it corresponds to from the source.
   - Customize how Druid handles the data by selecting the **Input format** and its related options, such as adding **JSON parser features** for JSON files.
- 5. When you're ready, click **Done**. You're returned to the **Query** view where you can see the newly generated query:
- - The query inserts the data from the external source into a table named `wikipedia`.
- - Context parameters appear before the query in the syntax unique to the Druid console: `--: context {key}: {value}`. When submitting queries to Druid directly, set the `context` parameters in the context section of the SQL query object. For more information about context parameters, see [Context parameters](./msq-reference.md#context-parameters).
+ 5. When you're ready, click **Done**. You're returned to the **Query** view where you can see the starter query that will insert the data from the external source into a table named `wikipedia`.

<details><summary>Show the query</summary>
@@ -51,7 +48,7 @@ To generate a query from external data, do the following:
WITH ext AS (SELECT *
FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type":"json"}',
'[{"name":"isRobot","type":"string"},{"name":"channel","type":"string"},{"name":"timestamp","type":"string"},{"name":"flags","type":"string"},{"name":"isUnpatrolled","type":"string"},{"name":"page","type":"string"},{"name":"diffUrl","type":"string"},{"name":"added","type":"long"},{"name":"comment","type":"string"},{"name":"commentLength","type":"long"},{"name":"isNew","type":"string"},{"name":"isMinor","type":"string"},{"name":"delta","type":"long"},{"name":"isAnonymous","type":"string"},{"name":"user","type":"string"},{"name":"deltaBucket","type":"long"},{"name":"deleted","type":"long"},{"name":"namespace","type":"string"},{"name":"cityName","type":"string"},{"name":"countryName","type":"string"},{"name":"regionIsoCode","type":"string"},{"name":"metroCode","type":"long"},{"name":"countryIsoCode","type":"string"},{"name":"regionName","type":"string"}]'
)
@@ -91,7 +88,7 @@ To generate a query from external data, do the following:
For example, to specify day-based segment granularity, change the partitioning to `PARTITIONED BY DAY`:

```sql
...
INSERT INTO ...
SELECT
TIME_PARSE("timestamp") AS __time,
...
@@ -99,7 +96,9 @@ To generate a query from external data, do the following:
PARTITIONED BY DAY
```

- 1. Optionally, select **Preview** to review the data before you ingest it. A preview runs the query without the INSERT INTO clause and with an added LIMIT to the main query and to all helper queries. You can see the general shape of the data before you commit to inserting it. The LIMITs make the query run faster but can cause incomplete results.
+ 1. Optionally, select **Preview** to review the data before you ingest it. A preview runs the query without the REPLACE INTO clause and with an added LIMIT.
+ You can see the general shape of the data before you commit to inserting it.
+ The LIMITs make the query run faster but can cause incomplete results.
2. Click **Run** to launch your query. The query returns information including its duration and the number of rows inserted into the table.

## Query the data
@@ -126,7 +125,7 @@ SELECT
COUNT(*)
FROM TABLE(
EXTERN(
- '{"type": "http", "uris": ["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type": "http", "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type": "json"}',
'[{"name": "added", "type": "long"}, {"name": "channel", "type": "string"}, {"name": "cityName", "type": "string"}, {"name": "comment", "type": "string"}, {"name": "commentLength", "type": "long"}, {"name": "countryIsoCode", "type": "string"}, {"name": "countryName", "type": "string"}, {"name": "deleted", "type": "long"}, {"name": "delta", "type": "long"}, {"name": "deltaBucket", "type": "string"}, {"name": "diffUrl", "type": "string"}, {"name": "flags", "type": "string"}, {"name": "isAnonymous", "type": "string"}, {"name": "isMinor", "type": "string"}, {"name": "isNew", "type": "string"}, {"name": "isRobot", "type": "string"}, {"name": "isUnpatrolled", "type": "string"}, {"name": "metroCode", "type": "string"}, {"name": "namespace", "type": "string"}, {"name": "page", "type": "string"}, {"name": "regionIsoCode", "type": "string"}, {"name": "regionName", "type": "string"}, {"name": "timestamp", "type": "string"}, {"name": "user", "type": "string"}]'
)
@@ -34,7 +34,7 @@ To convert the ingestion spec to a query task, do the following:
1. In the **Query** view of the Druid console, navigate to the menu bar that includes **Run**.
2. Click the ellipsis icon and select **Convert ingestion spec to SQL**.
   ![Convert ingestion spec to SQL](../assets/multi-stage-query/tutorial-msq-convert.png "Convert ingestion spec to SQL")
- 3. In the **Ingestion spec to covert** window, insert your ingestion spec. You can use your own spec or the sample ingestion spec provided in the tutorial. The sample spec uses data hosted at `https://static.imply.io/data/wikipedia.json.gz` and loads it into a table named `wikipedia`:
+ 3. In the **Ingestion spec to covert** window, insert your ingestion spec. You can use your own spec or the sample ingestion spec provided in the tutorial. The sample spec uses data hosted at `https://druid.apache.org/data/wikipedia.json.gz` and loads it into a table named `wikipedia`:

<details><summary>Show the spec</summary>
@@ -47,7 +47,7 @@ To convert the ingestion spec to a query task, do the following:
"inputSource": {
"type": "http",
"uris": [
- "https://static.imply.io/data/wikipedia.json.gz"
+ "https://druid.apache.org/data/wikipedia.json.gz"
]
},
"inputFormat": {
@@ -129,7 +129,7 @@ To convert the ingestion spec to a query task, do the following:
REPLACE INTO wikipedia OVERWRITE ALL
WITH source AS (SELECT * FROM TABLE(
EXTERN(
- '{"type":"http","uris":["https://static.imply.io/data/wikipedia.json.gz"]}',
+ '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
'{"type":"json"}',
'[{"name":"timestamp","type":"string"},{"name":"isRobot","type":"string"},{"name":"channel","type":"string"},{"name":"flags","type":"string"},{"name":"isUnpatrolled","type":"string"},{"name":"page","type":"string"},{"name":"diffUrl","type":"string"},{"name":"added","type":"long"},{"name":"comment","type":"string"},{"name":"commentLength","type":"long"},{"name":"isNew","type":"string"},{"name":"isMinor","type":"string"},{"name":"delta","type":"long"},{"name":"isAnonymous","type":"string"},{"name":"user","type":"string"},{"name":"deltaBucket","type":"long"},{"name":"deleted","type":"long"},{"name":"namespace","type":"string"},{"name":"cityName","type":"string"},{"name":"countryName","type":"string"},{"name":"regionIsoCode","type":"string"},{"name":"metroCode","type":"string"},{"name":"countryIsoCode","type":"string"},{"name":"regionName","type":"string"}]'
)
@@ -1,64 +0,0 @@
- ---
- id: management-uis
- title: "Legacy Management UIs"
- ---
-
- <!--
- ~ Licensed to the Apache Software Foundation (ASF) under one
- ~ or more contributor license agreements. See the NOTICE file
- ~ distributed with this work for additional information
- ~ regarding copyright ownership. The ASF licenses this file
- ~ to you under the Apache License, Version 2.0 (the
- ~ "License"); you may not use this file except in compliance
- ~ with the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing,
- ~ software distributed under the License is distributed on an
- ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- ~ KIND, either express or implied. See the License for the
- ~ specific language governing permissions and limitations
- ~ under the License.
- -->
-
- ## Legacy consoles
-
- Druid provides a console for managing datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
-
- For more information on the Druid Console, have a look at the [Druid Console overview](./druid-console.md)
-
- The Druid Console contains all of the functionality provided by the older consoles described below, which are still available if needed. The legacy consoles may be replaced by the Druid Console in the future.
-
- These older consoles provide a subset of the functionality of the Druid Console. We recommend using the Druid Console if possible.
-
- ### Coordinator consoles
-
- #### Version 2
-
- The Druid Coordinator exposes a web console for displaying cluster information and rule configuration. After the Coordinator starts, the console can be accessed at:
-
- ```
- http://<COORDINATOR_IP>:<COORDINATOR_PORT>
- ```
-
- There exists a full cluster view (which shows indexing tasks and Historical processes), as well as views for individual Historical processes, datasources and segments themselves. Segment information can be displayed in raw JSON form or as part of a sortable and filterable table.
-
- The Coordinator console also exposes an interface to creating and editing rules. All valid datasources configured in the segment database, along with a default datasource, are available for configuration. Rules of different types can be added, deleted or edited.
-
- #### Version 1
-
- The oldest version of Druid's Coordinator console is still available for backwards compatibility at:
-
- ```
- http://<COORDINATOR_IP>:<COORDINATOR_PORT>/old-console
- ```
-
- ### Overlord console
-
- The Overlord console can be used to view pending tasks, running tasks, available workers, and recent worker creation and termination. The console can be accessed at:
-
- ```
- http://<OVERLORD_IP>:<OVERLORD_PORT>/console.html
- ```
@@ -23,7 +23,7 @@ title: "Retaining or automatically dropping data"
-->

- In Apache Druid, Coordinator processes use rules to determine what data should be loaded to or dropped from the cluster. Rules are used for data retention and query execution, and are set on the Coordinator console (http://coordinator_ip:port).
+ In Apache Druid, Coordinator processes use rules to determine what data should be loaded to or dropped from the cluster. Rules are used for data retention and query execution, and are set via the [web console](./druid-console.md).

There are three types of rules, i.e., load rules, drop rules, and broadcast rules. Load rules indicate how segments should be assigned to different historical process tiers and how many replicas of a segment should exist in each tier.
Drop rules indicate when segments should be dropped entirely from the cluster. Finally, broadcast rules indicate how segments of different datasources should be co-located in Historical processes.
@@ -33,7 +33,7 @@ default set of rules can be configured. Rules are read in order and hence the or
Coordinator will cycle through all used segments and match each segment with the first rule that applies. Each segment
may only match a single rule.

- Note: It is recommended that the Coordinator console is used to configure rules. However, the Coordinator process does have HTTP endpoints to programmatically configure rules.
+ Note: It is recommended that the web console is used to configure rules. However, the Coordinator process does have HTTP endpoints to programmatically configure rules.

## Load rules
@@ -230,5 +230,5 @@ submit a [kill task](../ingestion/tasks.md) to the [Overlord](../design/overlord

Data that has been dropped from a Druid cluster cannot be reloaded using only rules. To reload dropped data in Druid,
you must first set your retention period (i.e. changing the retention period from 1 month to 2 months), and then mark as
- used all segments belonging to the datasource in the Druid Coordinator console, or through the Druid Coordinator
+ used all segments belonging to the datasource in the web console, or through the Druid Coordinator
endpoints.
@@ -1,7 +1,7 @@
---
id: tutorial-msq-external-data
- title: "Connect external data"
- sidebar_label: "Connect external data"
+ title: "Loading files with SQL"
+ sidebar_label: "Loading files with SQL"
---

<!--
@@ -1,7 +1,7 @@
---
id: tutorial-msq-convert-json
- title: "Convert JSON ingestion spec to SQL"
- sidebar_label: "Convert JSON ingestion spec"
+ title: "Convert ingestion spec to SQL"
+ sidebar_label: "Convert ingestion spec to SQL"
---

<!--
@@ -13,6 +13,12 @@
"lint": "npm run link-lint",
"spellcheck": "mdspell --en-us --ignore-numbers --report '../docs/**/*.md'"
},
+ "engines": {
+   "node": ">=12"
+ },
+ "volta": {
+   "node": "12.22.12"
+ },
"devDependencies": {
"docusaurus": "^1.14.4",
"markdown-spellcheck": "^1.3.1",
@@ -198,3 +198,4 @@
{"source": "development/extensions-contrib/google.html", "target": "../extensions-core/google.html"}
{"source": "development/integrating-druid-with-other-technologies.html", "target": "../ingestion/index.html"}
{"source": "operations/recommendations.html", "target": "basic-cluster-tuning.html"}
+ {"source": "operations/management-uis.html", "target": "operations/druid-console.html"}
@@ -9,6 +9,7 @@
],
"Tutorials": [
"tutorials/tutorial-batch",
+ "tutorials/tutorial-msq-external-data",
"tutorials/tutorial-kafka",
"tutorials/tutorial-batch-hadoop",
"tutorials/tutorial-query",
@@ -21,7 +22,6 @@
"tutorials/tutorial-ingestion-spec",
"tutorials/tutorial-transform-spec",
"tutorials/tutorial-kerberos-hadoop",
- "tutorials/tutorial-msq-external-data",
"tutorials/tutorial-msq-convert-json"
],
"Design": [
@@ -203,7 +203,6 @@
"type": "subcategory",
"label": "Misc",
"ids": [
- "operations/management-uis",
"operations/dump-segment",
"operations/reset-cluster",
"operations/insert-segment-to-db",