druid

mirror of https://github.com/apache/druid.git synced 2025-02-19 16:37:45 +00:00

Author	SHA1	Message	Date
Gian Merlino	256160aba6	MSQ: Validate that strings and string arrays are not mixed. (#15920 ) * MSQ: Validate that strings and string arrays are not mixed. When multi-value strings and string arrays coexist in the same column, it causes problems with "classic MVD" style queries such as: select * from wikipedia -- fails at runtime select count() from wikipedia where flags = 'B' -- fails at planning time select flags, count() from wikipedia group by 1 -- fails at runtime To avoid these problems, this patch adds type verification for INSERT and REPLACE. It is targeted: the only type changes that are blocked are string-to-array and array-to-string. There is also a way to exclude certain columns from the type checks, if the user really knows what they're doing. * Fixes. * Tests and docs and error messages. * More docs. * Adjustments. * Adjust message. * Fix tests. * Fix test in DV mode.	2024-03-13 15:37:27 -07:00
Sergio Ferragut	d38703281c	updated description of rowsPerPage in export operations (#16048 ) * updated description of rowsPerPage in export operations * Update docs/multi-stage-query/reference.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-03-05 15:42:12 -08:00
Abhishek Radhakrishnan	67a6224d91	Fix up incorrect `PARTITIONED BY` error messages (#15961 ) * Fix up typos, inaccuracies and clean up code related to PARTITIONED BY. * Remove wrapper function and update tests to use DruidExceptionMatcher. * Checkstyle and Intellij inspection fixes.	2024-02-26 14:17:53 -05:00
zachjsh	f9ee2c353b	Extend the PARTITION BY clause to accept string literals for the time partitioning (#15836 ) This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning	2024-02-09 11:45:38 -05:00
Adarsh Sanjeev	514b3b4d01	Add export capabilities to MSQ with SQL syntax (#15689 ) * Add test * Parser changes to support export statements * Fix builds * Address comments * Add frame processor * Address review comments * Fix builds * Update syntax * Webconsole workaround * Refactor * Refactor * Change export file path * Update docs * Remove webconsole changes * Fix spelling mistake * Parser changes, add tests * Parser changes, resolve build warnings * Fix failing test * Fix failing test * Fix IT tests * Add tests * Cleanup * Fix unparse * Fix forbidden API * Update docs * Update docs * Address review comments * Address review comments * Fix tests * Address review comments * Fix insert unparse * Add external write resource action * Fix tests * Add resource check to overlord resource * Fix tests * Add IT * Update syntax * Update tests * Update permission * Address review comments * Address review comments * Address review comments * Add tests * Add check for runtime parameter for bucket and path * Add check for runtime parameter for bucket and path * Add tests * Update docs * Fix NPE * Update docs, remove deadcode * Fix formatting	2024-02-07 22:08:50 +05:30
Laksh Singla	7d65caf0c5	Update the docs for EARLIEST_BY/LATEST_BY aggregators with the newly added numeric capabilities (#15670 )	2024-02-01 10:24:43 +05:30
Abhishek Radhakrishnan	dbdfae3011	Fix up typo </br /> -> <br /> and adjust interpolated exception msg in InvalidNullByteFault. (#15804 )	2024-01-30 12:44:51 -08:00
Victoria Lim	52313c51ac	docs: Anchor link checker (#15624 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-01-08 15:19:05 -08:00
317brian	141c214b46	docs: add note about finalizeaggregations for sql-based ingestion (#15631 )	2024-01-08 10:06:10 -08:00
Abhishek Radhakrishnan	9c7d7fc777	Allow empty inserts and replaces in MSQ. (#15495 ) * Allow empty inserts and replace. - Introduce a new query context failOnEmptyInsert which defaults to false. - When this context is false (default), MSQE will now allow empty inserts and replaces. - When this context is true, MSQE will throw the existing InsertCannotBeEmpty MSQ fault. - For REPLACE ALL over an ALL grain segment, the query will generate a tombstone spanning eternity which will be removed eventually be the coordinator. - Add unit tests in MSQInsertTest, MSQReplaceTest to test the new default behavior (i.e., when failOnEmptyInsert = false) - Update unit tests in MSQFaultsTest to test the non-default behavior (i.e., when failOnEmptyInsert = true) * Ignore test to see if it's the culprit for OOM * Add heap dump config * Bump up -Xmx from 1500 MB to 2048 MB * Add steps to tarball and collect hprof dump to GHA action * put back mx to 1500MB to trigger the failure * add the step to reusable unit test workflow as well * Revert the temp heap dump & @Ignore changes since max heap size is increased * Minor updates * Review comments 1. Doc suggestions 2. Add tests for empty insert and replace queries with ALL grain and limit in the default failOnEmptyInsert mode (=false). Add similar tests to MSQFaultsTest with failOnEmptyInsert = true, so the query does fail with an InsertCannotBeEmpty fault. 3. Nullable annotation and javadocs * Add comment replace_limit.patch	2024-01-02 13:05:51 -08:00
Vishesh Garg	e43bb74c3a	Add MSQ Durable Storage Connector for Google Cloud Storage and change current Google Cloud Storage client library (#15398 ) The PR addresses 2 things: Add MSQ durable storage connector for GCS Change GCS client library from the old Google API Client Library to the recommended Google Cloud Client Library. Ref: https://cloud.google.com/apis/docs/client-libraries-explained	2023-12-14 07:34:49 +05:30
Karan Kumar	5036af6fb3	Doc fixes for query from deep storage and MSQ (#15313 ) Minor updates to the documentation. Added prerequisites. Removed a known issue in MSQ since its no longer valid. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-03 10:52:20 +05:30
Clint Wylie	d261587f4a	explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds (#15245 ) * better documentation for the differences between arrays and mvds * add outputType to ExpressionPostAggregator to make docs true * add output coercion if outputType is defined on ExpressionPostAgg * updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables	2023-11-02 00:31:37 -07:00
317brian	436ded3d78	docs: durable storage azure cleanup (#15120 ) Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2023-10-31 15:20:38 -07:00
Adarsh Sanjeev	c5fa649ea5	Rename segment load wait parameter (#15251 )	2023-10-25 18:08:37 +05:30
Karan Kumar	61ea9e07c5	Limit pages size to a configurable limit (#14994 ) Adding the ability to limit the pages sizes of select queries. We piggyback on the same machinery that is used to control the numRowsPerSegment. This patch introduces a new context parameter rowsPerPage for which the default value is set to 100000 rows. This patch also optimizes adding the last selectResults stage only when the previous stages have sorted outputs. Currently for each select query with selectDestination=durableStorage, we used to add this extra selectResults stage.	2023-10-12 14:01:46 +05:30
Laksh Singla	95bf331c08	Rename the default setting of 'maxSubqueryBytes' from 'unlimited' to 'disabled' (#15108 ) The default setting of 'maxSubqueryBytes' is renamed from 'unlimited' to 'disabled'.	2023-10-10 02:03:29 +05:30
Adarsh Sanjeev	7a35ce886d	Add ability for MSQ tasks to query realtime tasks (#15024 ) This PR aims to add the capabilities to: 1. Fetch the realtime segment metadata from the coordinator server view, 2. Adds the ability for workers to query indexers, similar to how brokers do the same for native queries.	2023-10-09 15:14:03 +05:30
Adarsh Sanjeev	7e987e3d69	Add query context parameter for segment load wait (#15076 ) Add segmentLoadWait as a query context parameter. If this is true, the controller queries the broker and waits till the segments created (if any) have been loaded by the load rules. The controller also provides this information in the live reports and task reports. If this is false, the controller exits immediately after finishing the query.	2023-10-05 18:26:34 +05:30
317brian	6b4dda964d	Docusaurus2 upgrade for master (#14411 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-08-16 19:01:21 -07:00
Laksh Singla	8f102f9031	Introduce StorageConnector for Azure (#14660 ) The Azure connector is introduced and MSQ's fault tolerance and durable storage can now be used with Microsoft Azure's blob storage. Also, the results of newly introduced queries from deep storage can now store and fetch the results from Azure's blob storage.	2023-08-09 12:25:27 +00:00
Victoria Lim	7d7813372a	Docs: Include EARLIEST_BY and LATEST_BY as supported aggregation functions (#14280 )	2023-08-07 09:59:12 -07:00
Laksh Singla	d6c73ca6e5	Cleanup the documentation for deep storage	2023-08-04 10:20:01 +00:00
317brian	3b5b6c6a41	docs: query from deep storage (#14609 ) * cold tier wip * wip * copyedits * wip * copyedits * copyedits * wip * wip * update rules page * typo * typo * update sidebar * moves durable storage info to its own page in operations * update screenshots * add apache license * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * add query from deep storage tutorial stub * address some of the feedback * revert screenshot update. handled in separate pr * load rule update * wip tutorial * reformat deep storage endpoints * rest of tutorial * typo * cleanup * screenshot and sidebar for tutorial * add license * typos * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * rest of review comments * clarify where results are stored * update api reference for durablestorage context param * Apply suggestions from code review Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * comments * incorporate #14720 * address rest of comments * missed one * Update docs/api-reference/sql-api.md * Update docs/api-reference/sql-api.md --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: demo-kratia <56242907+demo-kratia@users.noreply.github.com> Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-08-04 11:10:08 +05:30
Karan Kumar	77e0c16bce	Sql statement api error messaging fixes. (#14629 ) * Error messaging fixes. * Static check fix * Review comments	2023-07-20 22:48:44 +05:30
Abhishek Radhakrishnan	1f6507dd60	Remove the deprecated `InsertCannotOrderByDescending` MSQ fault (#14588 ) The deprecated MSQ fault, InsertCannotOrderByDescending, is removed.	2023-07-17 09:23:39 +00:00
Laksh Singla	5ce536355e	Fix planning bug while using sort merge frame processor (#14450 ) sqlJoinAlgorithm is now a hint to the planner to execute the join in the specified manner. The planner can decide to ignore the hint if it deduces that the specified algorithm can be detrimental to the performance of the join beforehand.	2023-07-11 09:58:44 +00:00
Adarsh Sanjeev	0335aaa279	Add query results directory and prevent the auto cleaner from cleaning it (#14446 ) Adds support for automatic cleaning of a "query-results" directory in durable storage. This directory will be cleaned up only if the task id is not known to the overlord. This will allow the storage of query results after the task has finished running.	2023-06-28 10:14:04 +05:30
Laksh Singla	f546cd64a9	MSQ: Ensure that the allocated segment aligns with the requested granularity (#14475 ) Changes: - Throw an `InsertCannotAllocateSegmentFault` if the allocated segment is not aligned with the requested granularity. - Tests to verify new behaviour	2023-06-27 09:25:32 +05:30
Laksh Singla	114380749d	MSQ: Improve the parse exception errors and the handling of null UTF characters in Strings in Frames (#14398 )	2023-06-26 18:14:29 +05:30
Abhishek Radhakrishnan	04fb75719e	Fail query planning if a `CLUSTERED BY` column contains descending order (#14436 ) * Throw ValidationException if CLUSTERED BY column descending order is specified. - Fails query planning * Some more tests. * fixup existing comment * Update comment * checkstyle fix: remove unused imports * Remove InsertCannotOrderByDescendingFault and deprecate the fault in readme. * move deprecated field to the bottom	2023-06-16 18:10:12 -04:00
zachjsh	e75fb8e8e3	Account for data format and compression in MSQ auto taskAssignment (#14307 ) ### Description This change allows for consideration of the input format and compression when computing how to split the input files among available tasks, in MSQ ingestion, when considering the value of the `maxInputBytesPerWorker` query context parameter. This query parameter allows users to control the maximum number of bytes, with granularity of input file / object, that ingestion tasks will be assigned to ingest. With this change, this context parameter now denotes the estimated weighted size in bytes of the input to split on, with consideration for input format and compression format, rather than the actual file size, reported by the file system. We assume uncompressed newline delimited json as a baseline, with scaling factor of `1`. This means that when computing the byte weight that a file has towards the input splitting, we take the file size as is, if uncompressed json, 1:1. It was found during testing that gzip compressed json, and parquet, has scale factors of `4` and `8` respectively, meaning that each byte of data is weighted 4x and 8x respectively, when computing input splits. This weighted byte scaling is only considered for MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource at the moment. The default value of the `maxInputBytesPerWorker` query context parameter has been updated from 10 GiB, to 512 MiB	2023-06-01 12:53:49 -07:00
Nhi Pham	70c06fc0e1	Advise against using WEEK granularity for Native Batch and MSQ (#14341 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-05-30 11:40:12 -07:00
317brian	9faf9ecf20	docs: add line about write datasource perm for overlord api (#14114 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-05-19 14:56:24 -07:00
Katya Macedo	269137c682	Update Ingestion section (#14023 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>	2023-05-19 09:42:27 -07:00
Adarsh Sanjeev	e8ef31fe92	Fix condition for timeout in worker task launcher (#14270 ) * Fix condition for timeout in worker task launcher	2023-05-16 08:30:00 +05:30
imply-cheddar	f9861808bc	Be able to load segments on Peons (#14239 ) * Be able to load segments on Peons This change introduces a new config on WorkerConfig that indicates how many bytes of each storage location to use for storage of a task. Said config is divided up amongst the locations and slots and then used to set TaskConfig.tmpStorageBytesPerTask The Peons use their local task dir and tmpStorageBytesPerTask as their StorageLocations for the SegmentManager such that they can accept broadcast segments.	2023-05-12 16:51:00 -07:00
317brian	6254658f61	docs: fix links (#14111 )	2023-05-12 09:59:16 -07:00
317brian	cc37987dff	docs: copyedits for MSQ join algos (#14012 )	2023-05-11 14:21:09 -07:00
Karan Kumar	6f0cdd0c3f	`TaskStartTimeoutFault` now depends on the last successful worker launch time. (#14172 ) * `TaskStartTimeoutFault` now depends on the last successful worker launch time.	2023-05-03 00:05:15 +05:30
Gian Merlino	89e7948159	MSQ: Subclass CalciteJoinQueryTest, other supporting changes. (#14105 ) * MSQ: Subclass CalciteJoinQueryTest, other supporting changes. The main change is the new tests: we now subclass CalciteJoinQueryTest in CalciteSelectJoinQueryMSQTest twice, once for Broadcast and once for SortMerge. Two supporting production changes for default-value mode: 1) InputNumberDataSource is marked as concrete, to allow leftFilter to be pushed down to it. 2) In default-value mode, numeric frame field readers can now return nulls. This is necessary when stacking joins on top of joins: nulls must be preserved for semantics that match broadcast joins and native queries. 3) In default-value mode, StringFieldReader.isNull returns true on empty strings in addition to nulls. This is more consistent with the behavior of the selectors, which map empty strings to null as well in that mode. As an effect of change (2), the InsertTimeNull change from #14020 (to replace null timestamps with default timestamps) is reverted. IMO, this is fine, as either behavior is defensible, and the change from #14020 hasn't been released yet. * Adjust tests. * Style fix. * Additional tests.	2023-04-25 12:10:23 -07:00
Adarsh Sanjeev	a7d5c64aeb	Move MSQ temporary storage to a runtime parameter instead of being configured from query context (#14061 ) * Adds new run time parameter druid.indexer.task.tmpStorageBytesPerTask. This sets a limit for the amount of temporary storage disk space used by tasks. This limit is currently only respected by MSQ tasks. * Removes query context parameters intermediateSuperSorterStorageMaxLocalBytes and composedIntermediateSuperSorterStorageEnabled. Composed intermediate super sorter (which was enabled by composedIntermediateSuperSorterStorageEnabled) is now enabled automatically if durableShuffleStorage is set to true. intermediateSuperSorterStorageMaxLocalBytes is calculated from the limit set by the run time parameter druid.indexer.task.tmpStorageBytesPerTask.	2023-04-18 16:56:51 +05:30
Laksh Singla	8eb854c845	Remove maxResultsSize config property from S3OutputConfig (#14101 ) * "maxResultsSize" has been removed from the S3OutputConfig and a default "chunkSize" of 100MiB is now present. This change primarily affects users who wish to use durable storage for MSQ jobs.	2023-04-18 14:25:20 +05:30
317brian	6c9b7b6efd	msq: add durable storage info (#14035 ) * msq: add durable storage info * fix duplicate row * Apply suggestions from code review Co-authored-by: Karan Kumar <karankumar1100@gmail.com> --------- Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-04-14 13:28:23 +05:30
Gian Merlino	d52bc333aa	Frames: Ensure nulls are read as default values when appropriate. (#14020 ) * Frames: Ensure nulls are read as default values when appropriate. Fixes a bug where LongFieldWriter didn't write a properly transformed zero when writing out a null. This had no meaningful effect in SQL-compatible null handling mode, because the field would get treated as a null anyway. But it does have an effect in default-value mode: it would cause Long.MIN_VALUE to get read out instead of zero. Also adds NullHandling checks to the various frame-based column selectors, allowing reading of nullable frames by servers in default-value mode.	2023-04-10 05:28:46 +05:30
Suraj Sanjay Kadam	b4157e32ae	Update api.md (#13436 ) * Update api.md I have created changes in api call of python according to latest version of requests 2.28.1 library. Along with this there are some irregularities between use of <your-instance> and <hostname> so I have tried to fix that also. * Update api.md made some changes in declaring USER and PASSWORD	2023-04-06 15:05:36 -07:00
Paul Rogers	030ed911d4	Temporarily revert extended table functions for Druid 26 (#14019 )	2023-04-05 21:09:33 -07:00
Karan Kumar	e6a11707cb	Adding query stack fault to MSQ to capture native query errors. (#13926 ) * Add a new fault "QueryRuntimeError" to MSQ engine to capture native query errors. * Fixed bug in MSQ fault tolerance where worker were being retried if `UnexpectedMultiValueDimensionException` was thrown. * An exception from the query runtime with `org.apache.druid.query` as the package name is thrown as a QueryRuntimeError	2023-04-05 16:29:10 +05:30
Paul Rogers	76fe26d4ba	Fix typos, add tests for http() function (#13954 )	2023-03-28 14:41:06 -07:00
Gian Merlino	062d72b67e	Add timeout to TaskStartTimeoutFault. (#13970 ) * Add timeout to TaskStartTimeoutFault. Makes the error message a bit more useful. * Update docs.	2023-03-27 23:37:19 +05:30

1 2

85 Commits