druid

Commit Graph

Author	SHA1	Message	Date
Adarsh Sanjeev	afb3d91777	Add unit test for complex column grouping (#13650 ) * Add unit test for complex column grouping Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-01-12 15:25:01 +05:30
Maytas Monsereenusorn	7f54ebbf47	Fix Parquet Parser missing column when reading parquet file (#13612 ) * fix parquet reader * fix checkstyle * fix bug * fix inspection * refactor * fix checkstyle * fix checkstyle * fix checkstyle * fix checkstyle * add test * fix checkstyle * fix tests * add IT * add IT * add more tests * fix checkstyle * fix stuff * fix stuff * add more tests * add more tests	2023-01-11 20:08:48 -10:00
Karan Kumar	56076d33fb	Worker retry for MSQ task (#13353 ) * Initial commit. * Fixing error message in retry exceeded exception * Cleaning up some code * Adding some test cases. * Adding java docs. * Finishing up state test cases. * Adding some more java docs and fixing spot bugs, intellij inspections * Fixing intellij inspections and added tests * Documenting error codes * Migrate current integration batch tests to equivalent MSQ tests (#13374) * Migrate current integration batch tests to equivalent MSQ tests using new IT framework * Fix build issues * Trigger Build * Adding more tests and addressing comments * fixBuildIssues * fix dependency issues * Parameterized the test and addressed comments * Addressing comments * fixing checkstyle errors * Addressing comments * Adding ITTest which kills the worker abruptly * Review comments phase one * Adding doc changes * Adjusting for single threaded execution. * Adding Sequential Merge PR state handling * Merge things * Fixing checkstyle. * Adding new context param for fault tolerance. Adding stale task handling in sketchFetcher. Adding UT's. * Merge things * Merge things * Adding parameterized tests Created separate module for faultToleranceTests * Adding missed files * Review comments and fixing tests. * Documentation things. * Fixing IT * Controller impl fix. * Fixing racy WorkerSketchFetcherTest.java exception handling. Co-authored-by: abhagraw <99210446+abhagraw@users.noreply.github.com> Co-authored-by: Karan Kumar <cryptoe@karans-mbp.lan>	2023-01-11 07:38:29 +05:30
Abhishek Radhakrishnan	41fdf6eafb	Quote and escape literals in JDBC lookup to allow reserved identifiers. (#13632 ) * Quote and escape table, key and column names. * fix typo. * More select statements. * Derby lookup tests create quoted identifiers so it's compatible. * Use Stringutils.replace() utility. * quote the filter string. * Squish doubly quote usage into a single function. * Add parameterized test with reserved identifiers. * few changes.	2023-01-10 12:11:54 +05:30
imply-cheddar	f1821a7c18	Add Sort Operator for Window Functions (#13619 ) * Addition of NaiveSortMaker and Default implementation Add the NaiveSortMaker which makes a sorter object and a default implementation of the interface. This also allows us to plan multiple different window definitions on the same query.	2023-01-06 00:27:18 -08:00
imply-cheddar	a8ecc48ffe	Validate response headers and fix exception logging (#13609 ) * Validate response headers and fix exception logging A class of QueryException were throwing away their causes making it really hard to determine what's going wrong when something goes wrong in the SQL planner specifically. Fix that and adjust tests to do more validation of response headers as well. We allow 404s and 307s to be returned even without authorization validated, but others get converted to 403	2023-01-05 14:15:15 -08:00
imply-cheddar	7b92b85168	Unify DummyRequest with MockHttpServletRequest (#13602 ) We had 2 different classes both creating fake instances of an HttpServletRequest, this makes it so that we only have one in a common location	2022-12-21 20:15:08 -08:00
Kashif Faraz	c1e2656644	Fix scope of dependencies in protobuf-extensions pom (#13593 )	2022-12-19 13:56:55 +05:30
Clint Wylie	d9e5245ff0	allow string dimension indexer to handle byte[] as base64 strings (#13573 ) This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling java `toString`. This issue was uncovered by a regression of sorts introduced by #13519, which updated the protobuf extension to directly convert stuff to java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string which the previous JSON based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)` which does do this coercion (and why many tests on binary types appeared to be doing the expected thing). I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now, if we either update the parser to use the new flattener or just .. remove parsers we can remove this test stuff.	2022-12-16 14:50:17 +05:30
Kashif Faraz	d6949b1b79	Track input processedBytes with MSQ ingestion (#13559 ) Follow up to #13520 Bytes processed are currently tracked for intermediate stages in MSQ ingestion. This patch adds the capability to track the bytes processed by an MSQ controller task while reading from an external input source or a segment source. Changes: - Track `processedBytes` for every `InputSource` read in `ExternalInputSliceReader` - Update `ChannelCounters` with the above obtained `processedBytes` when incrementing the input file count. - Update task report structure in docs The total input processed bytes can be obtained by summing the `processedBytes` as follows: totalBytes = 0 for every root stage (i.e. a stage which does not have another stage as an input): for every worker in that stage: for every input channel: (i.e. channels with prefix "input", e.g. "input0", "input1", etc.) totalBytes += processedBytes	2022-12-16 02:20:01 +05:30
Adarsh Sanjeev	2b605aa9cf	Multiple fixes for the MSQ stats merging piece which (#13463 ) * Add validation checks to worker chat handler apis * Merge things and polishing the error messages. * Minor error message change * Fixing race and adding some tests * Fixing controller fetching stats from wrong workers. Fixing race Changing default mode to Parallel Adding logging. Fixing exceptions not propagated properly. * Changing to kernel worker count * Added a better logic to figure out assigned worker for a stage. * Nits * Moving to existing kernel methods * Adding more coverage Co-authored-by: cryptoe <karankumar1100@gmail.com>	2022-12-15 09:35:11 +05:30
Kashif Faraz	58a3acc2c4	Add InputStats to track bytes processed by a task (#13520 ) This commit adds a new class `InputStats` to track the total bytes processed by a task. The field `processedBytes` is published in task reports along with other row stats. Major changes: - Add class `InputStats` to track processed bytes - Add method `InputSourceReader.read(InputStats)` to read input rows while counting bytes. > Since we need to count the bytes, we could not just have a wrapper around `InputSourceReader` or `InputEntityReader` (the way `CountableInputSourceReader` does) because the `InputSourceReader` only deals with `InputRow`s and the byte information is already lost. - Classic batch: Use the new `InputSourceReader.read(inputStats)` in `AbstractBatchIndexTask` - Streaming: Increment `processedBytes` in `StreamChunkParser`. This does not use the new `InputSourceReader.read(inputStats)` method. - Extend `InputStats` with `RowIngestionMeters` so that bytes can be exposed in task reports Other changes: - Update tests to verify the value of `processedBytes` - Rename `MutableRowIngestionMeters` to `SimpleRowIngestionMeters` and remove duplicate class - Replace `CacheTestSegmentCacheManager` with `NoopSegmentCacheManager` - Refactor `KafkaIndexTaskTest` and `KinesisIndexTaskTest`	2022-12-13 18:54:42 +05:30
somu-imply	7682b0b6b1	Analysis refactor (#13501 ) Refactor DataSource to have a getAnalysis method() This removes various parts of the code where while loops and instanceof checks were being used to walk through the structure of DataSource objects in order to build a DataSourceAnalysis. Instead we just ask the DataSource for its analysis and allow the stack to rebuild whatever structure existed.	2022-12-12 17:35:44 -08:00
Gian Merlino	de5a4bafcb	Zero-copy local deep storage. (#13394 ) * Zero-copy local deep storage. This is useful for local deep storage, since it reduces disk usage and makes Historicals able to load segments instantaneously. Two changes: 1) Introduce "druid.storage.zip" parameter for local storage, which defaults to false. This changes default behavior from writing an index.zip to writing a regular directory. This is safe to do even during a rolling update, because the older code actually already handled unzipped directories being present on local deep storage. 2) In LocalDataSegmentPuller and LocalDataSegmentPusher, use hard links instead of copies when possible. (Generally this is possible when the source and destination directory are on the same filesystem.)	2022-12-12 17:28:24 -08:00
Karan Kumar	5a3d79a5d5	Removing unused exec service. (#13541 )	2022-12-12 14:39:42 +05:30
Clint Wylie	7002ecd303	add protobuf flattener, direct to plain java conversion for faster flattening (#13519 ) * add protobuf flattener, direct to plain java conversion for faster flattening, nested column tests	2022-12-09 12:24:21 -08:00
Gian Merlino	55814888f5	MSQ: Only look at sqlInsertSegmentGranularity on the outer query. (#13537 ) The planner sets sqlInsertSegmentGranularity in its context when using PARTITIONED BY, which sets it on every native query in the stack (as all native queries for a SQL query typically have the same context). QueryKit would interpret that as a request to configure bucketing for all native queries. This isn't useful, as bucketing is only used for the penultimate stage in INSERT / REPLACE. So, this patch modifies QueryKit to only look at sqlInsertSegmentGranularity on the outermost query. As an additional change, this patch switches the static ObjectMapper to use the processwide ObjectMapper for deserializing Granularities. Saves an ObjectMapper instance, and ensures that if there are any special serdes registered for Granularity, we'll pick them up.	2022-12-09 20:48:16 +05:30
Paul Rogers	013a12e86f	Enhanced MSQ table functions (#13360 ) * Enhanced MSQ table functions * HTTP, LOCALFILES and INLINE table functions powered by catalog metadata. * Documentation	2022-12-08 13:56:02 -08:00
Gian Merlino	91ef9872ec	MSQ: Improve TooManyBuckets error message, improve error docs. (#13525 ) 1) Edited the TooManyBuckets error message to mention PARTITIONED BY instead of segmentGranularity. 2) Added error-code-specific anchors in the docs. 3) Add information to various error codes in the docs about common causes and solutions.	2022-12-08 13:18:26 -08:00
Adarsh Sanjeev	fbf76ad8f5	Remove stray reference to fix OOM while merging sketches (#13475 ) * Remove stray reference to fix OOM while merging sketches * Update future to add result from executor service * Update tests and address review comments * Address review comments * Moved mock * Close threadpool on teardown * Remove worker task cancel	2022-12-08 07:17:55 +05:30
Abhishek Agarwal	b25cf216d5	Better error message when theta_sketch_intersect is used on scalar expression (#13508 )	2022-12-07 09:35:43 +05:30
Paul Rogers	b76ff16d00	SQL test framework extensions (#13426 ) SQL test framework extensions * Capture planner artifacts: logical plan, etc. * Planner test builder validates the logical plan * Validation for the SQL result schema (we already have validation for the Druid row signature) * Better Guice integration: properties, reuse Guice modules * Avoid need for hand-coded expr, macro tables * Retire some of the test-specific query component creation * Fix query log hook race condition	2022-12-02 09:11:59 -08:00
AmatyaAvadhanula	cc307e4c29	Fix needless task shutdown on leader switch (#13411 ) * Fix needless task shutdown on leader switch * Add unit test * Fix style * Fix UTs	2022-12-01 18:31:08 +05:30
Adarsh Sanjeev	8395273099	Add unit tests for MSQ ingestion faults (#13439 ) * Add unit tests for MSQ ingestion faults * Resolve build failure * Move test to MSQFaultTest * Rename test	2022-12-01 10:11:49 +05:30
xiaokang	6ba35f6d59	update org.bouncycastle:bcprov-jdk15on 1.68 to 1.69 (#13440 )	2022-11-30 21:57:38 +05:30
Adarsh Sanjeev	af164cbc10	Fix an issue with WorkerSketchFetcher not terminating on shutdown (#13459 ) * Fix an issue with WorkerSketchFetcher not terminating on shutdown * Change threadpool name	2022-11-30 21:02:48 +05:30
Kashif Faraz	8ff1b2d5d4	Revert "Add filter in cloud object input source for backward compatibility (#13437 )" (#13450 ) This reverts commit `b12e5f300e`.	2022-11-30 16:33:05 +05:30
Gian Merlino	50963edcae	Fix compile error in MSQSelectTest. (#13456 )	2022-11-29 15:51:03 -08:00
Laksh Singla	79df11c16c	Improve unit test coverage for MSQ (#13398 ) * add faults tests for the multi stage query * add too many partitions fault * add toomanyinputfilesfault * programmatically generate the file * refactor * Trigger Build	2022-11-29 17:27:04 +05:30
Laksh Singla	4ed6255bdf	Convert errors based on implicit type conversion in multi value arrays to parse exception in MSQ (#13366 ) * initial commit * fix test * push the json changes * reduce the area of the try..catch * Trigger Build * review	2022-11-29 17:19:57 +05:30
Clint Wylie	37b8d4861c	fix issues with nested data conversion (#13407 )	2022-11-28 12:29:43 -08:00
Clint Wylie	4b58f5f23c	fix KafkaInputFormat with nested columns by delegating to underlying inputRow map instead of eagerly copying (#13406 )	2022-11-28 12:28:07 -08:00
Tejaswini Bandlamudi	b12e5f300e	Add filter in cloud object input source for backward compatibility (#13437 ) https://github.com/apache/druid/pull/13027 PR replaces `filter` parameter with `objectGlob` in ingestion input source. However, this will cause existing ingestion jobs to fail if they are using a filter already. This PR adds old filter functionality alongside objectGlob to preserve backward compatibility.	2022-11-28 23:04:33 +05:30
Clint Wylie	f524c68f08	Add mechanism for 'safe' memory reads for complex types (#13361 ) * we can read where we want to we can leave your bounds behind 'cause if the memory is not there we really don't care and we'll crash this process of mine	2022-11-23 00:25:22 -08:00
Kashif Faraz	7cf761cee4	Prepare master branch for next release, 26.0.0 (#13401 ) * Prepare master branch for next release, 26.0.0 * Use docker image for druid 24.0.1 * Fix version in druid-it-cases pom.xml	2022-11-22 15:31:01 +05:30
Gian Merlino	c6054b7cb7	Attach IO error to parse error when we can't contact Avro schema registry. (#13403 ) * Attach IO error to parse error when we can't contact Avro schema registry. The change in #12080 lost the original exception context. This patch adds it back. * Add hamcrest-core. * Fix format string.	2022-11-21 22:20:26 -08:00
Adarsh Sanjeev	280a0f7158	Add sequential sketch merging to MSQ (#13205 ) * Add sketch fetching framework * Refactor code to support sequential merge * Update worker sketch fetcher * Refactor sketch fetcher * Refactor sketch fetcher * Add context parameter and threshold to trigger sequential merge * Fix test * Add integration test for non sequential merge * Address review comments * Address review comments * Address review comments * Resolve maxRetainedBytes * Add new classes * Renamed key statistics information class * Rename fetchStatisticsSnapshotForTimeChunk function * Address review comments * Address review comments * Update documentation and add comments * Resolve build issues * Resolve build issues * Change worker APIs to async * Address review comments * Resolve build issues * Add null time check * Update integration tests * Address review comments * Add log messages and comments * Resolve build issues * Add unit tests * Add unit tests * Fix timing issue in tests	2022-11-22 09:56:32 +05:30
Gian Merlino	bfffbabb56	Async task client for SeekableStreamSupervisors. (#13354 ) Main changes: 1) Convert SeekableStreamIndexTaskClient to an interface, move old code to SeekableStreamIndexTaskClientSyncImpl, and add new implementation SeekableStreamIndexTaskClientAsyncImpl that uses ServiceClient. 2) Add "chatAsync" parameter to seekable stream supervisors that causes the supervisor to use an async task client. 3) In SeekableStreamSupervisor.discoverTasks, adjust logic to avoid making blocking RPC calls in workerExec threads. 4) In SeekableStreamSupervisor generally, switch from Futures.successfulAsList to FutureUtils.coalesce, so we can better capture the errors that occurred with contacting individual tasks. Other, related changes: 1) Add ServiceRetryPolicy.retryNotAvailable, which controls whether ServiceClient retries unavailable services. Useful since we do not want to retry calls to unavailable tasks within the service client. (The supervisor does its own higher-level retries.) 2) Add FutureUtils.transformAsync, a more lambda friendly version of Futures.transform(f, AsyncFunction). 3) Add FutureUtils.coalesce. Similar to Futures.successfulAsList, but returns Either instead of using null on error. 4) Add JacksonUtils.readValue overloads for JavaType and TypeReference.	2022-11-21 19:20:26 +05:30
Gian Merlino	f037776fd8	MSQ: Launch initial tasks faster. (#13393 ) Notify the mainLoop thread to skip a sleep when the desired task count changes.	2022-11-21 19:11:18 +05:30
Rohan Garg	6ccf31490e	Allow injection of node-role set to all non base modules (#13371 )	2022-11-18 12:12:03 +05:30
Clint Wylie	8c9ffcfe37	nested column support for ORC (#13375 ) * nested column support for ORC * more test	2022-11-17 21:08:34 -08:00
Tejaswini Bandlamudi	bf10ff73a8	Fixes Kafka Supervisor Lag Report (#13380 ) Fixes inclusion of all stream partitions in all tasks. The PR (Adds Idle feature to `SeekableStreamSupervisor` for inactive stream) - https://github.com/apache/druid/pull/13144 updates the resulting lag calculation map in `KafkaSupervisor` to include all the latest partitions from the stream to set the idle state accordingly rather than the previous way of lag calculation only for the partitions actively being read from the stream. This led to an explosion of metrics in lag reports in cases where 1000s of tasks per supervisor are present. Changes: - Add a new method to generate lags for only those partitions a single task is actively reading from while updating the Supervisor reports.	2022-11-17 22:24:45 +05:30
Laksh Singla	9e938b5a6f	Add a limit to the number of columns in the CLUSTERED BY clause (#13352 ) * Add clustered by limit * change semantics, add docs * add fault class to the module * add test * unambiguate test	2022-11-15 22:05:15 +05:30
Clint Wylie	309cae7b65	nested column support for Parquet and Avro (#13325 ) * nested column support for Parquet and Avro * style	2022-11-14 16:09:05 -08:00
Adarsh Sanjeev	a3edda3b63	Modify quantile sketches to add byte[] directly (#13351 ) * Modify quantile sketches to add byte[] directly * Rename class and add test	2022-11-14 00:24:06 +05:30
Paul Rogers	81d005f267	Druid Catalog basics (#13165 ) Druid catalog basics Catalog object model for tables, columns Druid metadata DB storage (as an extension) REST API to update the catalog (as an extension) Integration tests Model only: no planner integration yet	2022-11-12 15:30:22 -08:00
Laksh Singla	3e172d44ab	Bind DurableStorageCleaner only on the Overlord nodes (#13355 )	2022-11-11 21:56:33 +05:30
Didip Kerabat	56d5c9780d	Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects (#13027 ) * Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects. Removed: import org.apache.commons.io.FilenameUtils; Add: import java.nio.file.FileSystems; import java.nio.file.PathMatcher; import java.nio.file.Paths; * Forgot to update CloudObjectInputSource as well. * Fix tests. * Removed unused exceptions. * Able to reduced user mistakes, by removing the protocol and the bucket on filter. * add 1 more test. * add comment on filterWithoutProtocolAndBucket * Fix lint issue. * Fix another lint issue. * Replace all mention of filter -> objectGlob per convo here: https://github.com/apache/druid/pull/13027#issuecomment-1266410707 * fix 1 bad constructor. * Fix the documentation. * Don’t do anything clever with the object path. * Remove unused imports. * Fix spelling error. * Fix incorrect search and replace. * Addressing Gian’s comment. * add filename on .spelling * Fix documentation. * fix documentation again Co-authored-by: Didip Kerabat <didip@apple.com>	2022-11-10 23:46:40 -08:00
Gian Merlino	77478f25fb	Add taskActionType dimension to task/action/run/time. (#13333 ) * Add taskActionType dimension to task/action/run/time. * Spelling.	2022-11-11 12:00:08 +05:30
AmatyaAvadhanula	fb23e38aa7	Fix messageGap emission (#13346 ) * Fix messageGap emission * Do not emit messageGap after stopping reading events * Refactoring * Fix tests	2022-11-10 17:50:19 +05:30
Clint Wylie	27215d1ff1	fix complex_decode_base64 function, add SQL bindings (#13332 ) * fix complex_decode_base64 function, add SQL bindings * more permissive	2022-11-09 23:40:25 -08:00
AmatyaAvadhanula	0512ae4922	Optimize metadata calls in SeekableStreamSupervisor (#13328 ) * Optimize metadata calls * Modify isTaskCurrent * Fix tests * Refactoring	2022-11-10 07:22:51 +05:30
Laksh Singla	b7a513fe09	Add a OverlordHelper that cleans up durable storage objects in MSQ (#13269 ) * scratch * s3 ls fix, add docs * add documentation, update method name * Add tests, address commits, change default value of the helper * fix test * update the default value of config, remove initial delay config * Trigger Build * update class * add more tests * docs update * spellcheck * remove ioe from the signature * add back dmmy constructor for initialization * fix guice bindings, intellij inspections	2022-11-09 17:23:35 +05:30
Paul Rogers	7e600d2c63	Enhancements to the Calcite test framework (#13283 ) * Enhancements to the Calcite test framework * Standardize "Unauthorized" messages * Additional test framework extension points * Resolved joinable factory dependency issue	2022-11-08 14:28:49 -08:00
Tejaswini Bandlamudi	594545da55	Adds cluster level idleConfig setting for supervisor (#13311 ) * adds cluster level idleConfig * updates docs * refactoring * spelling nit * nit * nit * refactoring	2022-11-08 14:54:14 +05:30
Adarsh Sanjeev	a28b8c2674	Improve rowkey object size estimate (#13319 ) * Improve rowkey object size estimate * Address review comments * Update comment * Fix test	2022-11-08 10:12:07 +05:30
Gian Merlino	48528a0c98	MSQ: Fix task lock checking during publish, fix lock priority. (#13282 ) * MSQ: Fix task lock checking during publish, fix lock priority. Fixes two issues: 1) ControllerImpl did not properly check the return value of SegmentTransactionalInsertAction when doing a REPLACE. This could cause it to not realize that its locks were preempted. 2) Task lock priority was the default of 0. It should be the higher batch default of 50. The low priority made it possible for MSQ tasks to be preempted by compaction tasks, which is not desired. * Restructuring, add docs. * Add performSegmentPublish tests. * Fix tests.	2022-11-08 09:27:34 +05:30
Abhishek Agarwal	b1eaf7a21f	MSQ should load even if node roles are not set (#13318 )	2022-11-07 21:11:16 +05:30
Gian Merlino	9423aa9163	MSQ: Consider PARTITION_STATS_MAX_BYTES in WorkerMemoryParameters. (#13274 ) * MSQ: Consider PARTITION_STATS_MAX_BYTES in WorkerMemoryParameters. This consideration is important, because otherwise we can run out of memory due to large statistics-tracking objects. * Improved calculations.	2022-11-07 14:27:18 +05:30
AmatyaAvadhanula	a17ffdfc5d	Fix flaky test method in KafkaSupervisorTest (#13315 )	2022-11-05 10:31:40 +05:30
Clint Wylie	e60e305ddb	fix issue with parquet list conversion of nullable lists with complex nullable elements (#13294 ) * fix issue with parquet list conversion of nullable lists with complex nullable elements * pom stuff * fix style * adjustments	2022-11-04 05:25:42 -07:00
Gian Merlino	8f90589ce5	Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247 ) * Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. These aggregation functions are documented as creating sketches. However, they are planned into native aggregators that include finalization logic to convert the sketch to a number of some sort. This creates an inconsistency: the functions sometimes return sketches, and sometimes return numbers, depending on where they lie in the native query plan. This patch changes these SQL aggregators to _never_ finalize, by using the "shouldFinalize" feature of the native aggregators. It already existed for theta sketches. This patch adds the feature for hll and quantiles sketches. As to impact, Druid finalizes aggregators in two cases: - When they appear in the outer level of a query (not a subquery). - When they are used as input to an expression or finalizing-field-access post-aggregator (not any other kind of post-aggregator). With this patch, the functions will no longer be finalized in these cases. The second item is not likely to matter much. The SQL functions all declare return type OTHER, which would be usable as an input to any other function that makes sense and that would be planned into an expression. So, the main effect of this patch is the first item. To provide backwards compatibility with anyone that was depending on the old behavior, the patch adds a "sqlFinalizeOuterSketches" query context parameter that restores the old behavior. Other changes: 1) Move various argument-checking logic from runtime to planning time in DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker. 2) Add various JsonIgnores to the sketches to simplify their JSON representations. 3) Allow chaining of ExpressionPostAggregators and other PostAggregators in the SQL layer. 4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer, now that expressions can operate on complex inputs. 5) Adjust return type to thetaSketch (instead of OTHER) in ThetaSketchSetBaseOperatorConversion. * Fix benchmark class. * Fix compilation error. * Fix ThetaSketchSqlAggregatorTest. * Hopefully fix ITAutoCompactionTest. * Adjustment to ITAutoCompactionTest.	2022-11-03 09:43:00 -07:00
Gian Merlino	d1877e41ec	Use lookup memory footprint in MSQ memory computations. (#13271 ) * Use lookup memory footprint in MSQ memory computations. Two main changes: 1) Add estimateHeapFootprint to LookupExtractor. 2) Use this in MSQ's IndexerWorkerContext when determining the total amount of available memory. It's taken off the top. This prevents MSQ tasks from running out of memory when there are lookups defined in the cluster. * Updates from code review.	2022-11-03 07:36:54 -07:00
Laksh Singla	ccc55ef899	Mask SQL String in the MSQTaskQueryMaker for secrets (#13231 ) * add test * add masking code * fix test * oops * refactor json usage * refactor, variable update * add test cases * Trigger Build * add comment to the regex * address review comment	2022-11-03 15:27:28 +05:30
Laksh Singla	7cb21cb968	Use worker number instead of task id in MSQ for communication to/from workers. (#13062 ) * Conversion from taskId to workerNumber in the workerClient * storage connector changes, suffix file when finish writing to it * Fix tests * Trigger Build * convert IntFunction to a dedicated interface * first review round * use a dummy file to indicate success * fetch the first filename from the list in case of multiple files * tests working, fix semantic issue with ls * change how the success flag works * comments, checkstyle, method rename * fix test * forbiddenapis fix * Trigger Build * change the writer * dead store fix * Review comments * revert changes * review * review comments * Update extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/shuffle/DurableStorageInputChannelFactory.java Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * Update extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/shuffle/DurableStorageInputChannelFactory.java Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * update error messages * better error messages * fix checkstyle Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2022-11-03 10:25:45 +05:30
Dr. Sizzles	e5ad24ff9f	Support for middle manager less druid, tasks launch as k8s jobs (#13156 ) * Support for middle manager less druid, tasks launch as k8s jobs * Fixing forking task runner test * Test cleanup, dependency cleanup, intellij inspections cleanup * Changes per PR review Add configuration option to disable http/https proxy for the k8s client Update the docs to provide more detail about sidecar support * Removing un-needed log lines * Small changes per PR review * Upon task completion we callback to the overlord to update the status / location, for slower k8s clusters, this reduces locking time significantly * Merge conflict fix * Fixing tests and docs * update tiny-cluster.yaml changed `enableTaskLevelLogPush` to `encapsulatedTask` * Apply suggestions from code review Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * Minor changes per PR request * Cleanup, adding test to AbstractTask * Add comment in peon.sh * Bumping code coverage * More tests to make code coverage happy * Doh a duplicate dependency * Integration test setup is weird for k8s, will do this in a different PR * Reverting back all integration test changes, will do in another PR * use StringUtils.base64 instead of Base64 * Jdk is nasty, if i compress in jdk 11 in jdk 17 the decompressed result is different Co-authored-by: Rahul Gidwani <r_gidwani@apple.com> Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>	2022-11-02 19:44:47 -07:00
Kashif Faraz	fd7864ae33	Improve run time of coordinator duty MarkAsUnusedOvershadowedSegments (#13287 ) In clusters with a large number of segments, the duty `MarkAsUnusedOvershadowedSegments` can take a very long time to finish. This is because of the costly invocation of `timeline.isOvershadowed` which is done for every used segment in every coordinator run. Changes - Use `DataSourceSnapshot.getOvershadowedSegments` to get all overshadowed segments - Iterate over this set instead of all used segments to identify segments that can be marked as unused - Mark segments as unused in the DB in batches rather than one at a time - Refactor: Add class `SegmentTimeline` for ease of use and readability while using a `VersionedIntervalTimeline` of segments.	2022-11-01 20:19:52 +05:30
Jason Koch	0d03ce435f	introduce a "tree" type to the flattenSpec (#12177 ) * introduce a "tree" type to the flattenSpec * feedback - rename exprs to nodes, use CollectionsUtils.isNullOrEmpty for guard * feedback - expand docs to more clearly capture limitations of "tree" flattenSpec * feedback - fix for typo on docs * introduce a comment to explain defensive copy, tweak null handling * fix: part of rebase * mark ObjectFlatteners.FlattenerMaker as an ExtensionPoint and provide default for new tree type * fix: objectflattener restore previous behavior to call getRootField for root type * docs: ingestion/data-formats add note that ORC only supports path expressions * chore: linter remove unused import * fix: use correct newer form for empty DimensionsSpec in FlattenJSONBenchmark	2022-11-01 14:49:30 +08:00
Adarsh Sanjeev	675fd982fb	Correct task status returned by controller (#13288 ) * Correct worker status returned by controller * Address review comments	2022-10-31 15:18:19 +05:30
AmatyaAvadhanula	e1ff3ca289	Resume streaming tasks on Overlord switch (#13223 ) * Resume streaming tasks on Overlord switch * Refactoring and better messages * Better docs * Add unit test * Fix tests' setup * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Better logs * Fix test again Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2022-10-29 09:38:49 +05:30
Gian Merlino	d851985cf5	MSQ: Add support for indexSpec. (#13275 )	2022-10-28 14:27:50 -07:00
Gian Merlino	4f0145fb85	MSQ: Use long instead of double for estimatedRetainedBytes. (#13272 ) Fixes a problem where, due to the inexactness of floating-point math, we would potentially drift while tracking retained byte counts and run into assertion failures in assertRetainedByteCountsAreTrackedCorrectly.	2022-10-28 08:31:52 -07:00
AmatyaAvadhanula	9cbda66d96	Remove skip ignorable shards (#13221 ) * Revert "Improve kinesis task assignment after resharding (#12235)" This reverts commit `1ec57cb935`.	2022-10-28 16:19:01 +05:30
Adarsh Sanjeev	4775427e2c	Add task start status to worker report (#13263 ) * Add task start status to worker report * Address review comments * Address review comments * Update documentation * Update spelling checks	2022-10-28 12:00:15 +05:30
somu-imply	affc522b9f	Refactoring the data source before unnest (#13085 ) * First set of changes for framework * Second set of changes to move segment map function to data source * Minor change to server manager * Removing the createSegmentMapFunction from JoinableFactoryWrapper and moving to JoinDataSource * Checkstyle fixes * Patching Eric's fix for injection * Checkstyle and fixing some CI issues * Fixing code inspections and some failed tests and one injector for test in avatica * Another set of changes for CI...almost there * Equals and hashcode part update * Fixing injector from Eric + refactoring for broadcastJoinHelper * Updating second injector. Might revert later if better way found * Fixing guice issue in JoinableFactory * Addressing review comments part 1 * Temp changes refactoring * Revert "Temp changes refactoring" This reverts commit `9da42a9ef0`. * temp * Temp discussions * Refactoring temp * Refactoring the query rewrite to refer to a datasource * Refactoring getCacheKey by moving it inside data source * Nullable annotation check in injector * Addressing some comments, removing 2 analysis.isJoin() checks and correcting the benchmark files * Minor changes for refactoring * Addressing reviews part 1 * Refactoring part 2 with new test cases for broadcast join * Set for nullables * removing instance of checks * Storing nullables in guice to avoid checking on reruns * Fixing a test case and removing an irrelevant line * Addressing the atomic reference review comments	2022-10-26 15:58:58 -07:00
Gian Merlino	d98c808d3f	Remove basePersistDirectory from tuning configs. (#13040 ) * Remove basePersistDirectory from tuning configs. Since the removal of CliRealtime, it serves no purpose, since it is always overridden in production using withBasePersistDirectory given some subdirectory of the task work directory. Removing this from the tuning config has a benefit beyond removing no-longer-needed logic: it also avoids the side effect of empty "druid-realtime-persist" directories getting created in the systemwide temp directory. * Test adjustments to appropriately set basePersistDirectory. * Remove unused import. * Fix RATC constructor.	2022-10-21 17:25:36 -07:00
Paul Rogers	86e6e61e88	Modular Calcite Test Framework (#12965 ) * Refactor Calcite test "framework" for planner tests Refactors the current Calcite tests to make it a bit easier to adjust the set of runtime objects used within a test. * Move data creation out of CalciteTests into TestDataBuilder * Move "framework" creation out of CalciteTests into a QueryFramework * Move injector-dependent functions from CalciteTests into QueryFrameworkUtils * Wrapper around the planner factory, etc. to allow customization. * Bulk of the "framework" created once per class rather than once per test. * Refactor tests to use a test builder * Change all testQuery() methods to use the test builder. Move test execution & verification into a test runner.	2022-10-20 15:45:44 -07:00
Laksh Singla	fc262dfbaf	MSQ: Report the warning directly as an error if none of it is allowed by the user (#13198 ) In MSQ, there can be an upper limit to the number of worker warnings. For example, for parseExceptions encountered while parsing the external data, the user can specify an upper limit to the number of parse exceptions that can be allowed before it throws an error of type TooManyWarnings. This PR makes it so that if the user disallows warnings of a certain type i.e. the limit is 0 (or is executing in strict mode), instead of throwing an error of type TooManyWarnings, we can directly surface the warning as the error, saving the user from the hassle of going through the warning reports.	2022-10-20 13:43:10 +05:30
Gian Merlino	6aca61763e	SQL: Use timestamp_floor when granularity is not safe. (#13206 ) * SQL: Use timestamp_floor when granularity is not safe. PR #12944 added a check at the execution layer to avoid materializing excessive amounts of time-granular buckets. This patch modifies the SQL planner to avoid generating queries that would throw such errors, by switching certain plans to use the timestamp_floor function instead of granularities. This applies both to the Timeseries query type, and the GroupBy timestampResultFieldGranularity feature. The patch also goes one step further: we switch to timestamp_floor not just in the ETERNITY + non-ALL case, but also if the estimated number of time-granular buckets exceeds 100,000. Finally, the patch modifies the timestampResultFieldGranularity field to consistently be a String rather than a Granularity. This ensures that it can be round-trip serialized and deserialized, which is useful when trying to execute the results of "EXPLAIN PLAN FOR" with GroupBy queries that use the timestampResultFieldGranularity feature. * Fix test, address PR comments. * Fix ControllerImpl. * Fix test. * Fix unused import.	2022-10-17 08:22:45 -07:00
Paul Rogers	f4dcc52dac	Redesign QueryContext class (#13071 ) We introduce two new configuration keys that refine the query context security model controlled by druid.auth.authorizeQueryContextParams. When that value is set to true then two other configuration options become available: druid.auth.unsecuredContextKeys: The set of query context keys that do not require a security check. Use this for the "white-list" of keys to allow. All other keys go through the existing context key security checks. druid.auth.securedContextKeys: The set of query context keys that do require a security check. Use this when you want to allow all but a specific set of keys: only these keys go through the existing context key security checks. Both are set using JSON list format: druid.auth.securedContextKeys=["secretKey1", "secretKey2"] You generally set one or the other value. If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys. In addition, Druid defines two query context keys which always bypass checks because Druid uses them internally: sqlQueryId sqlStringifyArrays	2022-10-15 11:02:11 +05:30
hnakamor	6332c571bd	Support to read task logs from some S3 compatible cloud storage (#13195 ) * follow RFC7232 * Only unquoted strings are processed according to RFC7232. * Add help method and test cases.	2022-10-15 10:44:23 +08:00
zachjsh	2f2fe20089	Improve global-cached-lookups metric reporting (#13219 ) It was found that the namespace/cache/heapSizeInBytes metric that tracks the total heap size in bytes of all lookup caches loaded on a service instance was being under reported. We were not accounting for the memory overhead of the String object, which I've found in testing to be ~40 bytes. While this overhead may be java version dependent, it should not vary much, and accounting for this provides a better estimate. Also fixed some logging, and reading bytes from the JDBI result set a little more efficient by saving hash table lookups. Also added some of the lookup metrics to the default statsD emitter metric whitelist.	2022-10-13 18:51:54 -04:00
Tejaswini Bandlamudi	3e13584e0e	Adds Idle feature to `SeekableStreamSupervisor` for inactive stream (#13144 ) * Idle Seekable stream supervisor changes. * nit * nit * nit * Adds unit tests * Supervisor decides its idle state instead of AutoScaler * docs update * nit * nit * docs update * Adds Kafka unit test * Adds Kafka Integration test. * Updates travis config. * Updates kafka-indexing-service dependencies. * updates previous offsets snapshot & doc * Doesn't act if supervisor is suspended. * Fixes highest current offsets fetch bug, adds new Kafka UT tests, doc changes. * Reverts Kinesis Supervisor idle behaviour changes. * nit * nit * Corrects SeekableStreamSupervisorSpec check on idle behaviour config, adds tests. * Fixes getHighestCurrentOffsets to fetch offsets of publishing tasks too * Adds Kafka Supervisor UT * Improves test coverage in druid-server * Corrects IT override config * Doc updates and Syntactic changes * nit * supervisorSpec.ioConfig.idleConfig changes	2022-10-12 18:31:08 +05:30
Jonathan Wei	9b8e69c99a	Add inline descriptor Protobuf bytes decoder (#13192 ) * Add inline descriptor Protobuf bytes decoder * PR comments * Update tests, check for IllegalArgumentException * Fix license, add equals test * Update extensions-core/protobuf-extensions/src/main/java/org/apache/druid/data/input/protobuf/InlineDescriptorProtobufBytesDecoder.java Co-authored-by: Frank Chen <frankchen@apache.org> Co-authored-by: Frank Chen <frankchen@apache.org>	2022-10-11 13:37:28 -05:00
Frank Chen	d30cf8c308	Dependency cleanup (#13194 ) * Clean up dependency in extensions * Bump protobuf/aws.sdk * Bump aws-sdk to 1.12.317 * Fix CI * Fix CI * Update license * Update license	2022-10-10 20:34:38 +08:00
Abhishek Agarwal	e3f9a0ed44	Lazy initialization of segment killers, movers and archivers (#13170 ) * Lazy initialization of segment killers, movers and archivers * Add test for lazy killer * Add more tests * Intellij fixes	2022-10-04 15:55:46 +05:30
Adarsh Sanjeev	92d2633ae6	Update ClusterByStatisticsCollectorImpl to use bytes instead of keys (#12998 ) * Update clusterByStatistics to use bytes instead of keys * Address review comments * Resolve checkstyle * Increase test coverage * Update test * Update thresholds * Update retained keys function * Update docs * Fix spelling	2022-10-03 12:08:23 +05:30
Jonathan Wei	1f1fced6d4	Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON (#13089 ) * Add JsonInputFormat option to assume newline delimited JSON, improve handling for non-NDJSON * Fix serde and docs * Add PR comment check	2022-09-26 19:51:04 -05:00
Jonathan Wei	331e6d707b	Add KafkaConfigOverrides extension point (#13122 ) * Add KafkaConfigOverrides extension point * X	2022-09-21 11:47:19 +05:30
Frank Chen	a3391693eb	Improve a MSQ planning error message (#13113 )	2022-09-19 23:11:54 +08:00
Paul Rogers	8ce03eb094	Convert the Druid planner to use statement handlers (#12905 ) * Converted Druid planner to use statement handlers Converts the large collection of if-statements for statement types into a set of classes: one per supported statement type. Cleans up a few error messages. * Revisions from review comments * Build fix * Build fix * Resolve merge conflict. * More merges with QueryResponse PR * More parameterized type cleanup Forces a rebuild due to a flaky test	2022-09-19 11:58:45 +05:30
Ellen Shen	da30c8070a	kafka consumer: custom serializer can't be configured after its instantiation (#12960 ) (#13097 ) * allow kafka custom serializer to be configured * add unit tests Co-authored-by: ellen shen <ellenshen@apple.com>	2022-09-17 20:42:21 +08:00
Atul Mohan	c153c2a712	Initialize NullValueHandlingConfig for failed tests (#13078 ) * Initialize null handling * Refactor nullhandlingconfig init	2022-09-15 20:47:10 +08:00
Frank Chen	fd6c05eee8	Avoid ClassCastException when getting values from `QueryContext` (#13022 ) * Use safe conversion methods * Rename method * Add getContextAsBoolean * Update test case * Remove generic from getContextValue * Update catch-handler * Add test * Resolve comments * Replace 'getContextXXX' to 'getQueryContext().getAsXXXX'	2022-09-13 18:00:09 +08:00
Gian Merlino	c00ad28ecc	Cleaner JSON for various input sources and formats. (#13064 ) * Cleaner JSON for various input sources and formats. Add JsonInclude to various properties, to avoid population of default values in serialized JSON. Also fixes a bug in OrcInputFormat: it was not writing binaryAsString, so the property would be lost on serde. * Additional test cases.	2022-09-12 10:29:31 -07:00
imply-cheddar	5ba0075c0c	Expose HTTP Response headers from SqlResource (#13052 ) * Expose HTTP Response headers from SqlResource This change makes the SqlResource expose HTTP response headers in the same way that the QueryResource exposes them. Fundamentally, the change is to pipe the QueryResponse object all the way through to the Resource so that it can populate response headers. There is also some code cleanup around DI, as there was a superfluous FactoryFactory class muddying things up.	2022-09-12 01:40:06 -07:00
Gian Merlino	f00f1f754d	MSQ extension: Fix over-capacity write in ScanQueryFrameProcessor. (#13036 ) * MSQ extension: Fix over-capacity write in ScanQueryFrameProcessor. Frame processors are meant to write only one output frame per cycle. The ScanQueryFrameProcessor would write two when reading from a channel if the input frame cursor cycled and then the output frame filled up while reading from the next frame. This patch fixes the bug, and adds a test. It also makes some adjustments to the processor code in order to make it easier to test. * Add license header.	2022-09-07 19:32:21 +05:30
Clint Wylie	a3a377e570	more consistent expression error messages (#12995 ) * more consistent expression error messages * review stuff * add NamedFunction for Function, ApplyFunction, and ExprMacro to share common stuff * fixes * add expression transform name to transformer failure, better parse_json error messaging	2022-09-06 23:21:38 -07:00
Abhishek Agarwal	618757352b	Bump up the version to 25.0.0 (#12975 ) * Bump up the version to 25.0.0 * Fix the version in console	2022-08-29 11:27:38 +05:30
Alexander Saydakov	7e2371bbde	KLL sketch (#12498 ) * KLL sketch * added documentation * direct static refs * direct static refs * fixed test * addressed review points * added KLL sketch related terms * return a copy from get * Copy unions when returning them from "get". * Remove redundant "final". Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com>	2022-08-26 21:19:24 -07:00

1137 Commits