druid

Commit Graph

Author	SHA1	Message	Date
somu-imply	74ff848ce5	Fixing incorrect filtering of nulls in an array when ingesting for JSON and Avro (#13712 )	2023-02-01 04:15:08 -08:00
Adarsh Sanjeev	51dfde0284	Add maxInputBytesPerWorker as query context parameter (#13707 ) * Add maxInputBytesPerWorker as query context parameter * Move documentation to msq specific docs * Update tests * Spacing * Address review comments * Fix test * Update docs/multi-stage-query/reference.md * Correct spelling mistake --------- Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-01-31 20:55:28 +05:30
Rohan Garg	f76acccff2	Allow using composed storage for SuperSorter intermediate data (#13368 )	2023-01-24 01:02:03 +05:30
Laksh Singla	a516eb1a41	Port Calcite's tests to run with MSQ (#13625 ) * SQL test framework extensions * Capture planner artifacts: logical plan, etc. * Planner test builder validates the logical plan * Validation for the SQL result schema (we already have validation for the Druid row signature) * Better Guice integration: properties, reuse Guice modules * Avoid need for hand-coded expr, macro tables * Retire some of the test-specific query component creation * Fix query log hook race condition Co-authored-by: Paul Rogers <progers@apache.org>	2023-01-19 08:51:11 -08:00
Clint Wylie	fb26a1093d	discover nested columns when using nested column indexer for schemaless ingestion (#13672 ) * discover nested columns when using nested column indexer for schemaless * move useNestedColumnIndexerForSchemaDiscovery from AppendableIndexSpec to DimensionsSpec	2023-01-18 12:57:28 -08:00
Paul Rogers	22630b0aab	Much improved table functions (#13627 ) Much improved table functions * Revises properties, definitions in the catalog * Adds a "table function" abstraction to model such functions * Specific functions for HTTP, inline, local and S3. * Extended SQL types in the catalog * Restructure external table definitions to use table functions * EXTEND syntax for Druid's extern table function * Support for array-valued table function parameters * Support for array-valued SQL query parameters * Much new documentation	2023-01-17 08:41:57 -08:00
Gian Merlino	182c4fad29	Kinesis: More robust default fetch settings. (#13539 ) * Kinesis: More robust default fetch settings. 1) Default recordsPerFetch and recordBufferSize based on available memory rather than using hardcoded numbers. For this, we need an estimate of record size. Use 10 KB for regular records and 1 MB for aggregated records. With 1 GB heaps, 2 processors per task, and nonaggregated records, recordBufferSize comes out to the same as the old default (10000), and recordsPerFetch comes out slightly lower (1250 instead of 4000). 2) Default maxRecordsPerPoll based on whether records are aggregated or not (100 if not aggregated, 1 if aggregated). Prior default was 100. 3) Default fetchThreads based on processors divided by task count on Indexers, rather than overall processor count. 4) Additionally clean up the serialized JSON a bit by adding various JsonInclude annotations. * Updates for tests. * Additional important verify.	2023-01-13 11:03:54 +05:30
Adarsh Sanjeev	cb16a7f6a9	Fix behaviour of downsampling buckets to a single key (#13663 )	2023-01-12 21:24:24 +05:30
Adarsh Sanjeev	afb3d91777	Add unit test for complex column grouping (#13650 ) * Add unit test for complex column grouping Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-01-12 15:25:01 +05:30
Maytas Monsereenusorn	7f54ebbf47	Fix Parquet Parser missing column when reading parquet file (#13612 ) * fix parquet reader * fix checkstyle * fix bug * fix inspection * refactor * fix checkstyle * fix checkstyle * fix checkstyle * fix checkstyle * add test * fix checkstyle * fix tests * add IT * add IT * add more tests * fix checkstyle * fix stuff * fix stuff * add more tests * add more tests	2023-01-11 20:08:48 -10:00
Karan Kumar	56076d33fb	Worker retry for MSQ task (#13353 ) * Initial commit. * Fixing error message in retry exceeded exception * Cleaning up some code * Adding some test cases. * Adding java docs. * Finishing up state test cases. * Adding some more java docs and fixing spot bugs, intellij inspections * Fixing intellij inspections and added tests * Documenting error codes * Migrate current integration batch tests to equivalent MSQ tests (#13374) * Migrate current integration batch tests to equivalent MSQ tests using new IT framework * Fix build issues * Trigger Build * Adding more tests and addressing comments * fixBuildIssues * fix dependency issues * Parameterized the test and addressed comments * Addressing comments * fixing checkstyle errors * Addressing comments * Adding ITTest which kills the worker abruptly * Review comments phase one * Adding doc changes * Adjusting for single threaded execution. * Adding Sequential Merge PR state handling * Merge things * Fixing checkstyle. * Adding new context param for fault tolerance. Adding stale task handling in sketchFetcher. Adding UT's. * Merge things * Merge things * Adding parameterized tests Created separate module for faultToleranceTests * Adding missed files * Review comments and fixing tests. * Documentation things. * Fixing IT * Controller impl fix. * Fixing racy WorkerSketchFetcherTest.java exception handling. Co-authored-by: abhagraw <99210446+abhagraw@users.noreply.github.com> Co-authored-by: Karan Kumar <cryptoe@karans-mbp.lan>	2023-01-11 07:38:29 +05:30
Abhishek Radhakrishnan	41fdf6eafb	Quote and escape literals in JDBC lookup to allow reserved identifiers. (#13632 ) * Quote and escape table, key and column names. * fix typo. * More select statements. * Derby lookup tests create quoted identifiers so it's compatible. * Use Stringutils.replace() utility. * quote the filter string. * Squish doubly quote usage into a single function. * Add parameterized test with reserved identifiers. * few changes.	2023-01-10 12:11:54 +05:30
imply-cheddar	f1821a7c18	Add Sort Operator for Window Functions (#13619 ) * Addition of NaiveSortMaker and Default implementation Add the NaiveSortMaker which makes a sorter object and a default implementation of the interface. This also allows us to plan multiple different window definitions on the same query.	2023-01-06 00:27:18 -08:00
imply-cheddar	a8ecc48ffe	Validate response headers and fix exception logging (#13609 ) * Validate response headers and fix exception logging A class of QueryException were throwing away their causes making it really hard to determine what's going wrong when something goes wrong in the SQL planner specifically. Fix that and adjust tests to do more validation of response headers as well. We allow 404s and 307s to be returned even without authorization validated, but others get converted to 403	2023-01-05 14:15:15 -08:00
imply-cheddar	7b92b85168	Unify DummyRequest with MockHttpServletRequest (#13602 ) We had 2 different classes both creating fake instances of an HttpServletRequest, this makes it to that we only have one in a common location	2022-12-21 20:15:08 -08:00
Kashif Faraz	c1e2656644	Fix scope of dependencies in protobuf-extensions pom (#13593 )	2022-12-19 13:56:55 +05:30
Clint Wylie	d9e5245ff0	allow string dimension indexer to handle byte[] as base64 strings (#13573 ) This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling java `toString`. This issue was uncovered by a regression of sorts introduced by #13519, which updated the protobuf extension to directly convert stuff to java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string which the previous JSON based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)` which does do this coercion (and why many tests on binary types appeared to be doing the expected thing). I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now, if we either update the parser to use the new flattener or just .. remove parsers we can remove this test stuff.	2022-12-16 14:50:17 +05:30
Kashif Faraz	d6949b1b79	Track input processedBytes with MSQ ingestion (#13559 ) Follow up to #13520 Bytes processed are currently tracked for intermediate stages in MSQ ingestion. This patch adds the capability to track the bytes processed by an MSQ controller task while reading from an external input source or a segment source. Changes: - Track `processedBytes` for every `InputSource` read in `ExternalInputSliceReader` - Update `ChannelCounters` with the above obtained `processedBytes` when incrementing the input file count. - Update task report structure in docs The total input processed bytes can be obtained by summing the `processedBytes` as follows: totalBytes = 0 for every root stage (i.e. a stage which does not have another stage as an input): for every worker in that stage: for every input channel: (i.e. channels with prefix "input", e.g. "input0", "input1", etc.) totalBytes += processedBytes	2022-12-16 02:20:01 +05:30
Adarsh Sanjeev	2b605aa9cf	Multiple fixes for the MSQ stats merging piece which (#13463 ) * Add validation checks to worker chat handler apis * Merge things and polishing the error messages. * Minor error message change * Fixing race and adding some tests * Fixing controller fetching stats from wrong workers. Fixing race Changing default mode to Parallel Adding logging. Fixing exceptions not propagated properly. * Changing to kernel worker count * Added a better logic to figure out assigned worker for a stage. * Nits * Moving to existing kernel methods * Adding more coverage Co-authored-by: cryptoe <karankumar1100@gmail.com>	2022-12-15 09:35:11 +05:30
Kashif Faraz	58a3acc2c4	Add InputStats to track bytes processed by a task (#13520 ) This commit adds a new class `InputStats` to track the total bytes processed by a task. The field `processedBytes` is published in task reports along with other row stats. Major changes: - Add class `InputStats` to track processed bytes - Add method `InputSourceReader.read(InputStats)` to read input rows while counting bytes. > Since we need to count the bytes, we could not just have a wrapper around `InputSourceReader` or `InputEntityReader` (the way `CountableInputSourceReader` does) because the `InputSourceReader` only deals with `InputRow`s and the byte information is already lost. - Classic batch: Use the new `InputSourceReader.read(inputStats)` in `AbstractBatchIndexTask` - Streaming: Increment `processedBytes` in `StreamChunkParser`. This does not use the new `InputSourceReader.read(inputStats)` method. - Extend `InputStats` with `RowIngestionMeters` so that bytes can be exposed in task reports Other changes: - Update tests to verify the value of `processedBytes` - Rename `MutableRowIngestionMeters` to `SimpleRowIngestionMeters` and remove duplicate class - Replace `CacheTestSegmentCacheManager` with `NoopSegmentCacheManager` - Refactor `KafkaIndexTaskTest` and `KinesisIndexTaskTest`	2022-12-13 18:54:42 +05:30
somu-imply	7682b0b6b1	Analysis refactor (#13501 ) Refactor DataSource to have a getAnalysis method() This removes various parts of the code where while loops and instanceof checks were being used to walk through the structure of DataSource objects in order to build a DataSourceAnalysis. Instead we just ask the DataSource for its analysis and allow the stack to rebuild whatever structure existed.	2022-12-12 17:35:44 -08:00
Gian Merlino	de5a4bafcb	Zero-copy local deep storage. (#13394 ) * Zero-copy local deep storage. This is useful for local deep storage, since it reduces disk usage and makes Historicals able to load segments instantaneously. Two changes: 1) Introduce "druid.storage.zip" parameter for local storage, which defaults to false. This changes default behavior from writing an index.zip to writing a regular directory. This is safe to do even during a rolling update, because the older code actually already handled unzipped directories being present on local deep storage. 2) In LocalDataSegmentPuller and LocalDataSegmentPusher, use hard links instead of copies when possible. (Generally this is possible when the source and destination directory are on the same filesystem.)	2022-12-12 17:28:24 -08:00
Karan Kumar	5a3d79a5d5	Removing unused exec service. (#13541 )	2022-12-12 14:39:42 +05:30
Clint Wylie	7002ecd303	add protobuf flattener, direct to plain java conversion for faster flattening (#13519 ) * add protobuf flattener, direct to plain java conversion for faster flattening, nested column tests	2022-12-09 12:24:21 -08:00
Gian Merlino	55814888f5	MSQ: Only look at sqlInsertSegmentGranularity on the outer query. (#13537 ) The planner sets sqlInsertSegmentGranularity in its context when using PARTITIONED BY, which sets it on every native query in the stack (as all native queries for a SQL query typically have the same context). QueryKit would interpret that as a request to configure bucketing for all native queries. This isn't useful, as bucketing is only used for the penultimate stage in INSERT / REPLACE. So, this patch modifies QueryKit to only look at sqlInsertSegmentGranularity on the outermost query. As an additional change, this patch switches the static ObjectMapper to use the processwide ObjectMapper for deserializing Granularities. Saves an ObjectMapper instance, and ensures that if there are any special serdes registered for Granularity, we'll pick them up.	2022-12-09 20:48:16 +05:30
Paul Rogers	013a12e86f	Enhanced MSQ table functions (#13360 ) * Enhanced MSQ table functions * HTTP, LOCALFILES and INLINE table functions powered by catalog metadata. * Documentation	2022-12-08 13:56:02 -08:00
Gian Merlino	91ef9872ec	MSQ: Improve TooManyBuckets error message, improve error docs. (#13525 ) 1) Edited the TooManyBuckets error message to mention PARTITIONED BY instead of segmentGranularity. 2) Added error-code-specific anchors in the docs. 3) Add information to various error codes in the docs about common causes and solutions.	2022-12-08 13:18:26 -08:00
Adarsh Sanjeev	fbf76ad8f5	Remove stray reference to fix OOM while merging sketches (#13475 ) * Remove stray reference to fix OOM while merging sketches * Update future to add result from executor service * Update tests and address review comments * Address review comments * Moved mock * Close threadpool on teardown * Remove worker task cancel	2022-12-08 07:17:55 +05:30
Abhishek Agarwal	b25cf216d5	Better error message when theta_sketch_intersect is used on scalar expression (#13508 )	2022-12-07 09:35:43 +05:30
Paul Rogers	b76ff16d00	SQL test framework extensions (#13426 ) SQL test framework extensions * Capture planner artifacts: logical plan, etc. * Planner test builder validates the logical plan * Validation for the SQL result schema (we already have validation for the Druid row signature) * Better Guice integration: properties, reuse Guice modules * Avoid need for hand-coded expr, macro tables * Retire some of the test-specific query component creation * Fix query log hook race condition	2022-12-02 09:11:59 -08:00
AmatyaAvadhanula	cc307e4c29	Fix needless task shutdown on leader switch (#13411 ) * Fix needless task shutdown on leader switch * Add unit test * Fix style * Fix UTs	2022-12-01 18:31:08 +05:30
Adarsh Sanjeev	8395273099	Add unit tests for MSQ ingestion faults (#13439 ) * Add unit tests for MSQ ingestion faults * Resolve build failure * Move test to MSQFaultTest * Rename test	2022-12-01 10:11:49 +05:30
xiaokang	6ba35f6d59	update org.bouncycastle:bcprov-jdk15on 1.68 to 1.69 (#13440 )	2022-11-30 21:57:38 +05:30
Adarsh Sanjeev	af164cbc10	Fix an issue with WorkerSketchFetcher not terminating on shutdown (#13459 ) * Fix an issue with WorkerSketchFetcher not terminating on shutdown * Change threadpool name	2022-11-30 21:02:48 +05:30
Kashif Faraz	8ff1b2d5d4	Revert "Add filter in cloud object input source for backward compatibility (#13437 )" (#13450 ) This reverts commit `b12e5f300e`.	2022-11-30 16:33:05 +05:30
Gian Merlino	50963edcae	Fix compile error in MSQSelectTest. (#13456 )	2022-11-29 15:51:03 -08:00
Laksh Singla	79df11c16c	Improve unit test coverage for MSQ (#13398 ) * add faults tests for the multi stage query * add too many partitions fault * add toomanyinputfilesfault * programmatically generate the file * refactor * Trigger Build	2022-11-29 17:27:04 +05:30
Laksh Singla	4ed6255bdf	Convert errors based on implicit type conversion in multi value arrays to parse exception in MSQ (#13366 ) * initial commit * fix test * push the json changes * reduce the area of the try..catch * Trigger Build * review	2022-11-29 17:19:57 +05:30
Clint Wylie	37b8d4861c	fix issues with nested data conversion (#13407 )	2022-11-28 12:29:43 -08:00
Clint Wylie	4b58f5f23c	fix KafkaInputFormat with nested columns by delegating to underlying inputRow map instead of eagerly copying (#13406 )	2022-11-28 12:28:07 -08:00
Tejaswini Bandlamudi	b12e5f300e	Add filter in cloud object input source for backward compatibility (#13437 ) https://github.com/apache/druid/pull/13027 PR replaces `filter` parameter with `objectGlob` in ingestion input source. However, this will cause existing ingestion jobs to fail if they are using a filter already. This PR adds old filter functionality alongside objectGlob to preserve backward compatibility.	2022-11-28 23:04:33 +05:30
Clint Wylie	f524c68f08	Add mechanism for 'safe' memory reads for complex types (#13361 ) * we can read where we want to we can leave your bounds behind 'cause if the memory is not there we really don't care and we'll crash this process of mine	2022-11-23 00:25:22 -08:00
Kashif Faraz	7cf761cee4	Prepare master branch for next release, 26.0.0 (#13401 ) * Prepare master branch for next release, 26.0.0 * Use docker image for druid 24.0.1 * Fix version in druid-it-cases pom.xml	2022-11-22 15:31:01 +05:30
Gian Merlino	c6054b7cb7	Attach IO error to parse error when we can't contact Avro schema registry. (#13403 ) * Attach IO error to parse error when we can't contact Avro schema registry. The change in #12080 lost the original exception context. This patch adds it back. * Add hamcrest-core. * Fix format string.	2022-11-21 22:20:26 -08:00
Adarsh Sanjeev	280a0f7158	Add sequential sketch merging to MSQ (#13205 ) * Add sketch fetching framework * Refactor code to support sequential merge * Update worker sketch fetcher * Refactor sketch fetcher * Refactor sketch fetcher * Add context parameter and threshold to trigger sequential merge * Fix test * Add integration test for non sequential merge * Address review comments * Address review comments * Address review comments * Resolve maxRetainedBytes * Add new classes * Renamed key statistics information class * Rename fetchStatisticsSnapshotForTimeChunk function * Address review comments * Address review comments * Update documentation and add comments * Resolve build issues * Resolve build issues * Change worker APIs to async * Address review comments * Resolve build issues * Add null time check * Update integration tests * Address review comments * Add log messages and comments * Resolve build issues * Add unit tests * Add unit tests * Fix timing issue in tests	2022-11-22 09:56:32 +05:30
Gian Merlino	bfffbabb56	Async task client for SeekableStreamSupervisors. (#13354 ) Main changes: 1) Convert SeekableStreamIndexTaskClient to an interface, move old code to SeekableStreamIndexTaskClientSyncImpl, and add new implementation SeekableStreamIndexTaskClientAsyncImpl that uses ServiceClient. 2) Add "chatAsync" parameter to seekable stream supervisors that causes the supervisor to use an async task client. 3) In SeekableStreamSupervisor.discoverTasks, adjust logic to avoid making blocking RPC calls in workerExec threads. 4) In SeekableStreamSupervisor generally, switch from Futures.successfulAsList to FutureUtils.coalesce, so we can better capture the errors that occurred with contacting individual tasks. Other, related changes: 1) Add ServiceRetryPolicy.retryNotAvailable, which controls whether ServiceClient retries unavailable services. Useful since we do not want to retry calls unavailable tasks within the service client. (The supervisor does its own higher-level retries.) 2) Add FutureUtils.transformAsync, a more lambda friendly version of Futures.transform(f, AsyncFunction). 3) Add FutureUtils.coalesce. Similar to Futures.successfulAsList, but returns Either instead of using null on error. 4) Add JacksonUtils.readValue overloads for JavaType and TypeReference.	2022-11-21 19:20:26 +05:30
Gian Merlino	f037776fd8	MSQ: Launch initial tasks faster. (#13393 ) Notify the mainLoop thread to skip a sleep when the desired task count changes.	2022-11-21 19:11:18 +05:30
Rohan Garg	6ccf31490e	Allow injection of node-role set to all non base modules (#13371 )	2022-11-18 12:12:03 +05:30
Clint Wylie	8c9ffcfe37	nested column support for ORC (#13375 ) * nested column support for ORC * more test	2022-11-17 21:08:34 -08:00
Tejaswini Bandlamudi	bf10ff73a8	Fixes Kafka Supervisor Lag Report (#13380 ) Fixes inclusion of all stream partitions in all tasks. The PR (Adds Idle feature to `SeekableStreamSupervisor` for inactive stream) - https://github.com/apache/druid/pull/13144 updates the resulting lag calculation map in `KafkaSupervisor` to include all the latest partitions from the stream to set the idle state accordingly rather than the previous way of lag calculation only for the partitions actively being read from the stream. This led to an explosion of metrics in lag reports in cases where 1000s of tasks per supervisor are present. Changes: - Add a new method to generate lags for only those partitions a single task is actively reading from while updating the Supervisor reports.	2022-11-17 22:24:45 +05:30
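
The summation described in the message for d6949b1b79 (#13559) can be sketched as below. This is a minimal Python sketch under stated assumptions: the report layout used here (a list of stages, each with an `inputStages` list and a `workers` mapping from worker id to per-channel counters) is hypothetical and simplified, and is not the actual MSQ task report schema. Only the loop logic (root stages only, channels whose name starts with "input", sum of `processedBytes`) comes from the commit message.

```python
from typing import Any, Dict, List


def total_processed_bytes(stages: List[Dict[str, Any]]) -> int:
    """Sum processedBytes over the input channels of all root stages.

    A root stage is one that does not read from another stage.
    The dictionary layout is an assumption made for illustration.
    """
    total = 0
    for stage in stages:
        # Skip stages that take another stage as input (not root stages).
        if stage.get("inputStages"):
            continue
        for channels in stage.get("workers", {}).values():
            for name, counters in channels.items():
                # Only channels named "input0", "input1", ... are counted.
                if name.startswith("input"):
                    total += counters.get("processedBytes", 0)
    return total


# Hypothetical report: stage 0 reads external input, stage 1 reads stage 0.
example_stages = [
    {
        "inputStages": [],
        "workers": {
            "0": {
                "input0": {"processedBytes": 100},
                "input1": {"processedBytes": 50},
                "output": {"processedBytes": 999},
            },
            "1": {"input0": {"processedBytes": 25}},
        },
    },
    {
        "inputStages": [0],
        "workers": {"0": {"input0": {"processedBytes": 175}}},
    },
]

print(total_processed_bytes(example_stages))
```

Only the root stage contributes here: the `output` channel and the downstream stage are ignored, so bytes are not double-counted as data moves between stages.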

1095 Commits