druid

Commit Graph

Author	SHA1	Message	Date
317brian	ff577a69a5	doc: escape tags in markdown in preparation for docusaurus2 (#14379 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-06-08 11:26:18 -07:00
Kashif Faraz	12e8fa5c97	Prevent coordinator from getting stuck if leadership changes during coordinator run (#14385 ) Changes: - Add a timeout of 1 minute to resultFuture.get() in `CostBalancerStrategy.chooseBestServer`. 1 minute is the typical time for a full coordinator run and is more than enough time for cost computations of a single segment. - Raise an alert if an exception is encountered while computing costs and if the executor has not been shutdown. This is because a shutdown is intentional and does not require an alert.	2023-06-08 15:29:20 +05:30
Atul Mohan	6a4cbab4b8	Upgrade parquet-mr version (#14070 ) * Upgrade parquet version * Move parquet version to hadoop3 * Fix license * Exclude audience annotations	2023-06-07 08:54:54 -07:00
Gian Merlino	6370769cbf	Fix documentation for druid.query.scheduler.numThreads. (#14381 ) * Fix documentation for druid.query.scheduler.numThreads.	2023-06-07 14:48:08 +05:30
Soumyava	01b22ca022	Hll Sketch and Theta sketch estimate can now be used as an expression (#14312 ) * Hll Sketch estimate can now be used as an expression * Theta sketch estimate now can be used as an expression	2023-06-06 20:14:25 -07:00
Abhishek Radhakrishnan	2d258a95ad	Fix `EARLIEST_BY`/`LATEST_BY` signature and include function name in signature. (#14352 ) * Fix EarliestLatestBySqlAggregator signature; Include function name for all signatures. * Single quote function signatures, space between args and remove \n. * fixup UT assertion	2023-06-06 09:41:05 -07:00
Laksh Singla	5da601c47e	fix npe (#14369 )	2023-06-06 17:01:42 +05:30
John Gozde	cfc2a8d286	Switch to @blueprint/datetime2 (#14371 ) * Bump blueprint packages * Switch to datetime2 components * Update licenses * Update snapshots	2023-06-05 22:18:05 -07:00
Gian Merlino	a0d49baad6	MSQ: Fix issue with rollup ingestion and aggregators with multiple names. (#14367 ) The same aggregator can have two output names for a SQL like: INSERT INTO foo SELECT x, COUNT() AS y, COUNT() AS z FROM t GROUP BY 1 PARTITIONED BY ALL In this case, the SQL planner will create a query with a single "count" aggregator mapped to output names "y" and "z". The prior MSQ code did not properly handle this case, instead throwing an error like: Expected single output for query column[a0] but got [[1, 2]]	2023-06-06 10:28:41 +05:30
John Gozde	c14e54cf93	Remove context params from class component ctors (#14366 )	2023-06-05 11:15:28 -07:00
317brian	49c056af17	docs: add basic contributor guide for docs (#14365 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-06-05 10:53:17 -07:00
Tejaswini Bandlamudi	8e4f003f02	Fix flaky Revised ITs failures on GHA runners (#14348 ) * Fix read timed out failures and remove containers before test * remove containers before loading images * add labels to IT docker containers, download stable minio docker image release instead of latest	2023-06-05 18:58:54 +05:30
Abhishek Agarwal	139156cf6b	Reduce the spam in broker logs (#14368 )	2023-06-05 18:56:34 +05:30
Katya Macedo	7fd215b2e7	Document storeCompactionState (#14354 )	2023-06-02 11:09:04 -07:00
Harini Rajendran	4ff6026d30	Adding SegmentMetadataEvent and publishing them via KafkaEmitter (#14281 ) In this PR, we are enhancing KafkaEmitter, to emit metadata about published segments (SegmentMetadataEvent) into a Kafka topic. This segment metadata information that gets published into Kafka, can be used by any other downstream services to query Druid intelligently based on the segments published. The segment metadata gets published into kafka topic in json string format similar to other events.	2023-06-02 21:28:26 +05:30
Andreas Maechler	45014bd5b4	Handle all types of exceptions when initializing input source in sampler API (#14355 ) The sampler API returns a `400 bad request` response if it encounters a `SamplerException`. Otherwise, it returns a generic `500 Internal server error` response, with the message "The RuntimeException could not be mapped to a response, re-throwing to the HTTP container". This commit updates `RecordSupplierInputSource` to handle all types of exceptions instead of just `InterruptedException` and wrap them in a `SamplerException` so that the actual error is propagated back to the user.	2023-06-02 19:43:53 +05:30
Abhishek Agarwal	b482fda503	Ignore misc.xml (#14362 )	2023-06-02 12:00:52 +05:30
Andreas Maechler	55effd92cf	Docs: Typo and language cleanup in Kinesis ingestion docs (#14356 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-06-02 08:18:41 +05:30
317brian	70952c0977	docs: add sql array functions to nav (#14361 ) * docs: add sql array functions to nav * fix typo * add sql array functions to list * fix spelling errors	2023-06-01 16:45:27 -07:00
zachjsh	04a82da63d	Input source security fixes (#14266 ) It was found that several supported tasks / input sources did not have implementations for the methods used by the input source security feature, causing these tasks and input sources to fail when used with this feature. This pr adds the needed missing implementations. Also securing the sampling endpoint with input source security, when enabled.	2023-06-01 16:37:19 -07:00
zachjsh	e75fb8e8e3	Account for data format and compression in MSQ auto taskAssignment (#14307 ) ### Description This change allows for consideration of the input format and compression when computing how to split the input files among available tasks, in MSQ ingestion, when considering the value of the `maxInputBytesPerWorker` query context parameter. This query parameter allows users to control the maximum number of bytes, with granularity of input file / object, that ingestion tasks will be assigned to ingest. With this change, this context parameter now denotes the estimated weighted size in bytes of the input to split on, with consideration for input format and compression format, rather than the actual file size, reported by the file system. We assume uncompressed newline delimited json as a baseline, with scaling factor of `1`. This means that when computing the byte weight that a file has towards the input splitting, we take the file size as is, if uncompressed json, 1:1. It was found during testing that gzip compressed json, and parquet, has scale factors of `4` and `8` respectively, meaning that each byte of data is weighted 4x and 8x respectively, when computing input splits. This weighted byte scaling is only considered for MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource at the moment. The default value of the `maxInputBytesPerWorker` query context parameter has been updated from 10 GiB, to 512 MiB	2023-06-01 12:53:49 -07:00
Abhishek Radhakrishnan	d60290e76d	Remove extraneous apostrophe in the native batch docs (#14358 )	2023-06-01 08:57:41 -07:00
Katya Macedo	2da84de87f	docs: remove the note about segments (#14161 )	2023-05-31 16:37:19 -07:00
317brian	2012a6bd8e	Docs: fix broken link to Python API jupyter notebook (#14332 )	2023-05-31 08:12:27 +05:30
Charles Smith	37cb76d545	fixes dataSourceName variable ref (#14340 )	2023-05-30 13:15:27 -07:00
panhongan	c244c3de53	fix hdfs initialization issue (#14276 ) * fix hdfs initialization issue * add PR * remove conf settings * Improve comments * move hdfs storage validation to start handler * restore exception	2023-05-30 12:41:54 -07:00
Nhi Pham	70c06fc0e1	Advise against using WEEK granularity for Native Batch and MSQ (#14341 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-05-30 11:40:12 -07:00
Rishabh Singh	2086ff88bc	Add logging for task stop operations (#14192 ) Log more details when task cannot be stopped for various reasons	2023-05-30 18:50:52 +05:30
Pramod Immaneni	1ac5544da7	Updated default value of maxTotalRows to reflect the value in the code (#14298 )	2023-05-30 14:41:06 +05:30
Abhishek Radhakrishnan	5fd3e01ef0	More specific exclusions in the `examples` folder. (#14347 ) This PR changes how we skip java UT and ITs with changes in the examples folder. After this change, any Markdown files within the examples folder and jupyter-notebooks directory will be excluded. The rationale behind these more specific exclusions is that some ITs use json files checked in examples, so we want to trigger the full workflow for all other changes.	2023-05-30 12:01:45 +05:30
Kashif Faraz	d4cacebf79	Add tests for CostBalancerStrategy (#14230 ) Changes: - `CostBalancerStrategyTest` - Focus on verification of cost computations rather than choosing servers in this test - Add new tests `testComputeCost` and `testJointSegmentsCost` - Add tests to demonstrate that with a long enough interval gap, all costs become negligible - Retain `testIntervalCost` and `testIntervalCostAdditivity` - Remove redundant tests such as `testStrategyMultiThreaded`, `testStrategySingleThreaded` as verification of this behaviour is better suited to `BalancingStrategiesTest`. - `CostBalancerStrategyBenchmark` - Remove usage of static method from `CostBalancerStrategyTest` - Explicitly setup cluster and segments to use for benchmarking	2023-05-30 08:52:56 +05:30
Kashif Faraz	8091c6a547	Update default values in CoordinatorDynamicConfig (#14269 ) The defaults of the following config values in the `CoordinatorDynamicConfig` are being updated. 1. `maxSegmentsInNodeLoadingQueue = 500` (previous = 100) 2. `replicationThrottleLimit = 500` (previous = 10) Rationale: With round-robin segment assignment now being the default assignment technique, the Coordinator can assign a large number of under-replicated/unavailable segments very quickly, without getting stuck in `RunRules` duty due to very slow strategy-based cost computations. 3. `maxSegmentsToMove = 100` (previous = 5) Rationale: A very low value (say 5) is ineffective in balancing especially if there are many segments to balance. A very large value can cause excessive moves, which has these disadvantages: - Load of moving segments competing with load of unavailable/under-replicated segments - Unnecessary network costs due to constant download and delete of segments These defaults will be revisited after #13197 is merged.	2023-05-30 08:51:33 +05:30
Tejaswini Bandlamudi	0e51c2702a	update operations per run (#14325 )	2023-05-29 14:05:11 +05:30
Tejaswini Bandlamudi	914c006b8e	increase middlemanager heap server size in tests (#14345 )	2023-05-29 10:45:34 +05:30
Alexander Saydakov	4131c0df13	use the latest datasketches-java-4.0.0 (#14334 ) * use the latest datasketches-java-4.0.0 * updated versions of datasketches * adjusted expectation * fixed the expectations --------- Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2023-05-27 22:19:18 -07:00
Karan Kumar	8d256e35b4	MSQ ignores tombstone segments for downloads. (#14342 )	2023-05-27 14:21:52 +05:30
Kashif Faraz	0cde3a8b52	Fix regression in batch segment allocation (#14337 ) * Improve batch segment allocation logs * Fix batch seg alloc regression * Fix logs * Fix logs * Fix tests and logs	2023-05-25 22:34:54 -07:00
Vadim Ogievetsky	1873fca6c7	Web console: update DQT to latest version and fix bigint crash (#14318 ) * update dqt * don't crash on bigint values * better submit experience * bump to an even version	2023-05-24 17:40:45 -07:00
Charles Smith	88831b1dd0	Docs: Updates docker compose to turn off kraft which causes errors (#14335 )	2023-05-24 09:33:32 -07:00
Clint Wylie	4096f51f0b	add configurable ColumnTypeMergePolicy to SegmentMetadataCache (#14319 ) This PR adds a new interface to control how SegmentMetadataCache chooses ColumnType when faced with differences between segments for SQL schemas which are computed, exposed as druid.sql.planner.metadataColumnTypeMergePolicy and adds a new 'least restrictive type' mode to allow choosing the type that data across all segments can best be coerced into and sets this as the default behavior. This is a behavior change around when segment driven schema migrations take effect for the SQL schema. With latestInterval, the SQL schema will be updated as soon as the first job with the new schema has published segments, while using leastRestrictive, the schema will only be updated once all segments are reindexed to the new type. The benefit of leastRestrictive is that it eliminates a bunch of type coercion errors that can happen in SQL when types are varied across segments with latestInterval because the newest type is not able to correctly represent older data, such as if the segments have a mix of ARRAY and number types, or any other combinations that lead to odd query plans.	2023-05-24 20:32:51 +05:30
Soumyava	22ba457d29	Expr getCacheKey now delegates to children (#14287 ) * Expr getCacheKey now delegates to children * Removed the LOOKUP_EXPR_CACHE_KEY as we do not need it * Adding an unit test * Update processing/src/main/java/org/apache/druid/math/expr/Expr.java Co-authored-by: Clint Wylie <cjwylie@gmail.com> --------- Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-05-23 14:49:38 -07:00
Abhishek Radhakrishnan	338bdb35ea	Return `RESOURCES` in `EXPLAIN PLAN` as an ordered collection (#14323 ) * Make resources an ordered collection so it's deterministic. * test cleanup * fixup docs. * Replace deprecated ObjectNode#put() calls with ObjectNode#set().	2023-05-23 00:55:00 -05:00
Abhishek Radhakrishnan	a5e04d95a4	Add `TYPE_NAME` to the complex serde classes and replace the hardcoded names. (#14317 ) * Add TYPE_NAME to the serde classes and reuse them instead of hardcoded strings. * Static check fixes.	2023-05-23 00:54:47 -05:00
Victoria Lim	6b3a6113c4	Doc: List supported values for Kafka `headerFormat` (#14316 )	2023-05-22 15:41:07 -07:00
Nhi Pham	3f6610aaf1	fixed wording in OSS query laning doc (#14324 ) Co-authored-by: Nhi Pham <nhipham@Nhi-Pham.local>	2023-05-22 11:58:17 -07:00
George Shiqi Wu	cb65135b99	Fix log streaming (#14285 ) * Fix log streaming * Add watch log * Add unit tests * long running client * singleton client * Remove accidental close	2023-05-22 11:19:53 -07:00
Tejaswini Bandlamudi	36a084e021	Fix GHA workflows naming & Run ITs if UTs fail on coverage (#14158 ) Currently, there is no way to run ITs if unit-tests fail on coverage. This PR allows Revised, Standard ITs to run even when unit-tests fail on coverage errors, still failing the workflow. This PR also fixes existing GHA workflow naming.	2023-05-22 11:44:34 +05:30
317brian	9faf9ecf20	docs: add line about write datasource perm for overlord api (#14114 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-05-19 14:56:24 -07:00
Katya Macedo	269137c682	Update Ingestion section (#14023 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>	2023-05-19 09:42:27 -07:00
Vadim Ogievetsky	7f66fd049b	don't show merged stats until needed (#14311 )	2023-05-18 20:32:58 -07:00
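
Commit e75fb8e8e3 (#14307) in the table above describes weighting input file sizes by format and compression before comparing them against `maxInputBytesPerWorker`. The sketch below illustrates that arithmetic only. It is not Druid's implementation: the scale factors (1, 4, 8) and the 512 MiB default come from the commit message, while `InputFile`, `weighted_size`, `split_inputs`, the format labels, and the greedy in-order grouping are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass

MIB = 1024 * 1024

# New default for maxInputBytesPerWorker, as stated in the commit message.
MAX_INPUT_BYTES_PER_WORKER = 512 * MIB

# Scale factors quoted in the commit message. The keys are hypothetical labels,
# not identifiers used by Druid.
SCALE_FACTORS = {
    "json": 1,      # uncompressed newline-delimited JSON (baseline)
    "json.gz": 4,   # gzip-compressed JSON
    "parquet": 8,   # Parquet
}


@dataclass(frozen=True)
class InputFile:
    """Hypothetical description of one input file or object."""
    name: str
    size_bytes: int
    kind: str


def weighted_size(f: InputFile) -> int:
    """Estimated weighted size: the raw file size times the format's scale factor."""
    return f.size_bytes * SCALE_FACTORS[f.kind]


def split_inputs(files, limit=MAX_INPUT_BYTES_PER_WORKER):
    """Greedily group whole files, in order, so each group stays within the limit.

    Assumption: a simple in-order greedy grouping at file granularity. A single
    file whose weighted size exceeds the limit still gets a group of its own.
    """
    groups, current, current_weight = [], [], 0
    for f in files:
        w = weighted_size(f)
        if current and current_weight + w > limit:
            groups.append(current)
            current, current_weight = [], 0
        current.append(f)
        current_weight += w
    if current:
        groups.append(current)
    return groups


files = [
    InputFile("a.json", 300 * MIB, "json"),
    InputFile("b.json.gz", 100 * MIB, "json.gz"),
    InputFile("c.parquet", 32 * MIB, "parquet"),
]
groups = split_inputs(files)
```

By raw size the three files total 432 MiB and would fit under a single 512 MiB limit. With the weights applied they count as 300, 400, and 256 MiB, so this sketch places each file in its own group.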

12946 Commits

All Branches