druid

Commit Graph

Author	SHA1	Message	Date
Gian Merlino	786c959e9e	MSQ: Add limitHint to global-sort shuffles. (#16911 ) * MSQ: Add limitHint to global-sort shuffles. This allows pushing down limits into the SuperSorter. * Test fixes. * Add limitSpec to ScanQueryKit. Fix SuperSorter tracking.	2024-09-03 09:05:29 -07:00
AmatyaAvadhanula	70bad948e3	Add integration tests for concurrent append and replace (#16755 ) IT for streaming tasks with concurrent compaction	2024-09-03 14:58:15 +05:30
Sree Charan Manamala	619d8ef964	Window Functions : Numeric Arrays Frame Column Writers - fix class cast exception (#16983 ) Fix ClassCastException in ArrayFrameCoulmnWriters	2024-09-03 11:44:52 +05:30
Akshat Jain	ca17beb77b	Fix issues with window functions wrt virtual columns when using ARRAY_CONCAT_AGG aggregator (#16971 ) * Fix issues with window functions wrt virtual columns when using ARRAY_CONCAT_AGG aggregator * Address review comment * Address review comment	2024-09-03 11:41:40 +05:30
Kashif Faraz	a7349442e4	Fix computed value of maxSegmentsToMove when coordinator period is very low (#16984 ) Bug: When coordinator period is less than 30s, `maxSegmentsToMove` is always computed as 0 irrespective of number of available threads. Changes: - Fix lower bound condition and set minimum value to 100. - Add new test which fails without this fix	2024-09-02 14:35:29 +05:30
Zoltan Haindrich	32e8e074ae	Planning could have failed if UNION ALL operator was completely removed (#16946 )	2024-09-02 04:37:10 -04:00
Jonathan Wei	088c33759d	Allow druid.azure.account to be nullable (#16960 )	2024-09-02 12:05:51 +05:30
Kashif Faraz	fe3d589ff9	Run compaction as a supervisor on Overlord (#16768 ) Description ----------- Auto-compaction currently poses several challenges as it: 1. may get stuck on a failing interval. 2. may get stuck on the latest interval if more data keeps coming into it. 3. always picks the latest interval regardless of the level of compaction in it. 4. may never pick a datasource if its intervals are not very recent. 5. requires setting an explicit period which does not cater to the changing needs of a Druid cluster. This PR introduces various improvements to compaction scheduling to tackle the above problems. Change Summary -------------- 1. Run compaction for a datasource as a supervisor of type `autocompact` on Overlord. 2. Make compaction policy extensible and configurable. 3. Track status of recently submitted compaction tasks and pass this info to policy. 4. Add `/simulate` API on both Coordinator and Overlord to run compaction simulations. 5. Redirect compaction status APIs to the Overlord when compaction supervisors are enabled.	2024-09-02 07:53:13 +05:30
Parag Jain	6eb42e8d5a	fix extraction of timeseries results from result level cache (#16895 ) * fix extraction of timeseries results from result level cache * remove unneded import * add test	2024-09-01 00:25:55 +05:30
Virushade	0217c8c541	Change Inspection Profile to set "Method is identical to its super method" as error (#16976 ) * Make IntelliJ's MethodIsIdenticalToSuperMethod an error * Change codebase to follow new IntelliJ inspection * Restore non-short-circuit boolean expressions to pass tests	2024-08-31 09:37:34 +05:30
Gian Merlino	caf8ce3e0b	MSQ: Add CPU and thread usage counters. (#16914 ) * MSQ: Add CPU and thread usage counters. The main change adds "cpu" and "wall" counters. The "cpu" counter measures CPU time (using JvmUtils.getCurrentThreadCpuTime) taken up by processors in processing threads. The "wall" counter measures the amount of wall time taken up by processors in those same processing threads. Both counters are broken down by type of processor. This patch also includes changes to support adding new counters. Due to an oversight in the original design, older deserializers are not forwards-compatible; they throw errors when encountering an unknown counter type. To manage this, the following changes are made: 1) The defaultImpl NilQueryCounterSnapshot is added to QueryCounterSnapshot's deserialization configuration. This means that any unrecognized counter types will be read as "nil" by deserializers. Going forward, once all servers are on the latest code, this is enough to enable easily adding new counters. 2) A new context parameter "includeAllCounters" is added, which defaults to "false". When this parameter is set "false", only legacy counters are included. When set to "true", all counters are included. This is currently undocumented. In a future version, we should set the default to "true", and at that time, include a release note that people updating from versions prior to Druid 31 should set this to "false" until their upgrade is complete. * Style, coverage. * Fix.	2024-08-30 20:02:30 -07:00
Kashif Faraz	d5b64ba2e3	Improve exception handling in extension druid-pac4j (#16979 ) Changes: - Simplify exception handling in `CryptoService` by just catching a `Exception` - Throw a `DruidException` as the exception is user facing - Log the exception for easier debugging - Add a test to verify thrown exception	2024-08-30 12:32:49 +05:30
Vadim Ogievetsky	358d06abc1	Web console: expose forceSegmentSortByTime (#16967 ) * no force time * time UI * update menus * tweaks * dont use bp5 * nicer values * update snapshots * similar engine lables * update snaps	2024-08-29 09:58:15 -07:00
317brian	1d292c5a59	docs: fix the case of the s3 examples for msq (#16969 )	2024-08-28 11:13:15 -07:00
Akshat Jain	fbd305af0f	MSQ WF: Batch multiple PARTITION BY keys for processing (#16823 ) Currently, if we have a query with window function having PARTITION BY xyz, and we have a million unique values for xyz each having 1 row, we'd end up creating a million individual RACs for processing, each having a single row. This is unnecessary, and we can batch the PARTITION BY keys together for processing, and process them only when we can't batch further rows to adhere to maxRowsMaterialized config. The previous iteration of this PR was simplifying WindowOperatorQueryFrameProcessor to run all operators on all the rows instead of creating smaller RACs per partition by key. That approach was discarded in favor of the batching approach, and the details are summarized here: #16823 (comment).	2024-08-28 11:32:47 +05:30
Virushade	862ccda59b	Disable automatic search refresh every for keystroke (#16963 ) * Disable refreshing page everytime keystroke is detected * Lengthen input debounce time to 1s * Run Prettier to pass stylecheck	2024-08-27 20:28:43 -07:00
Charles Smith	e562dd3ac6	Docs: note on iceberg (#16955 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-08-27 14:27:23 -07:00
Jill Osborne	3e031b9dc2	Add dynamic query params example (#16964 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-08-27 14:27:13 -07:00
Vadim Ogievetsky	21dcf804eb	Web console: add ability to issue auxiliary queries to speed up data views (#16952 ) * Add ability to issue auxiliary queries * readonly supervisor * return * update snapshot * fix classes	2024-08-27 13:38:30 -07:00
Pranav	0caf383102	Fix buffer capacity race condition in spatial (#16931 )	2024-08-27 00:36:29 -07:00
Clint Wylie	f8301a314f	generic block compressed complex columns (#16863 ) changes: * Adds new `CompressedComplexColumn`, `CompressedComplexColumnSerializer`, `CompressedComplexColumnSupplier` based on `CompressedVariableSizedBlobColumn` used by JSON columns * Adds `IndexSpec.complexMetricCompression` which can be used to specify compression for the generic compressed complex column. Defaults to uncompressed because compressed columns are not backwards compatible. * Adds new definition of `ComplexMetricSerde.getSerializer` which accepts an `IndexSpec` argument when creating a serializer. The old signature has been marked `@Deprecated` and has a default implementation that returns `null`, but it will be used by the default implementation of the new version if it is implemented to return a non-null value. The default implementation of the new method will use a `CompressedComplexColumnSerializer` if `IndexSpec.complexMetricCompression` is not null/none/uncompressed, or will use `LargeColumnSupportedComplexColumnSerializer` otherwise. * Removed all duplicate generic implementations of `ComplexMetricSerde.getSerializer` and `ComplexMetricSerde.deserializeColumn` into default implementations `ComplexMetricSerde` instead of being copied all over the place. The default implementation of `deserializeColumn` will check if the first byte indicates that the new compression was used, otherwise will use the `GenericIndexed` based supplier. * Complex columns with custom serializers/deserializers are unaffected and may continue doing whatever it is they do, either with specialized compression or whatever else, this new stuff is just to provide generic implementations built around `ObjectStrategy`. * add ObjectStrategy.readRetainsBufferReference so CompressedComplexColumn only copies on read if required * add copyValueOnRead flag down to CompressedBlockReader to avoid buffer duplicate if the value needs copied anyway	2024-08-27 00:34:41 -07:00
Gian Merlino	ed3dbd6242	MSQ: Fix validation of time position in collations. (#16961 ) * MSQ: Fix validation of time position in collations. It is possible for the collation to refer to a field that isn't mapped, such as when the DML includes "CLUSTERED BY some_function(some_field)". In this case, the collation refers to a projected column that is not part of the field mappings. Prior to this patch, that would lead to an out of bounds list access on fieldMappings. This patch fixes the problem by identifying the position of __time in the fieldMappings first, rather than retrieving each collation field from fieldMappings. Fixes a bug introduced in #16849. * Fix test. Better warning message.	2024-08-27 00:02:32 -07:00
Adarsh Sanjeev	3b88b57d70	Add configurable final stages to MSQ ingestion queries (#16699 ) * Add a segmentMorphFactory to MSQ. * Add test * Make argument nullable * Fix Guice issues * Merge with master * Remove extra information * Fix tests * Create a utils class * Refactor segment generation * Fix javadoc * Refactor * Refactor * Fix injection	2024-08-27 11:35:48 +05:30
Adarsh Sanjeev	16517e348e	Suppress CVE (#16851 )	2024-08-27 10:47:57 +05:30
Gian Merlino	5d2ed33b89	Place __time in signatures according to sort order. (#16958 ) * Place __time in signatures according to sort order. Updates a variety of places to put __time in row signatures according to its position in the sort order, rather than always first, including: - InputSourceSampler. - ScanQueryEngine (in the default signature when "columns" is empty). - Various StorageAdapters, which also have the effect of reordering the column order in segmentMetadata queries, and therefore in SQL schemas as well. Follow-up to #16849. * Fix compilation. * Additional fixes. * Fix. * Fix style. * Omit nonexistent columns from the row signature. * Fix tests.	2024-08-26 21:45:51 -07:00
George Shiqi Wu	7ee7e194c4	Add supervisor log when task count is greater than partitions (#16948 ) * Add log message when task count is higher than partitions * newline * fix ordering * Add supervisor id * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2024-08-26 07:40:02 -07:00
Akshat Jain	72f8e79a42	Use multiple workers in MSQ WF drill test suite (#16949 )	2024-08-26 11:34:40 +05:30
317brian	418da92228	docs: update query from deepstorage segment requirement (#16842 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Rishabh Singh <6513075+findingrish@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2024-08-23 11:59:29 -07:00
Hugh Evans	60d4317968	Linked back to query granularity docs (#16883 ) * Linked back to query granularity docs * Update ingestion-spec.md clairfy about query granularities in the spec. * Update docs/design/storage.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/ingestion/ingestion-spec.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/granularities.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Apply suggestions from code review --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-08-23 08:44:19 -07:00
Gian Merlino	0603d5153d	Segments sorted by non-time columns. (#16849 ) * Segments primarily sorted by non-time columns. Currently, segments are always sorted by __time, followed by the sort order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting by __time enables efficient execution of queries involving time-ordering or granularity. Time-ordering is a simple matter of reading the rows in stored order, and granular cursors can be generated in streaming fashion. However, for various workloads, it's better for storage footprint and query performance to sort by arbitrary orders that do not start with __time. With this patch, users can sort segments by such orders. For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to dimensionsSpec. The "dimensions" list determines the sort order. To define a sort order that includes "__time", users explicitly include a dimension named "__time". For SQL-based ingestion, users set the context parameter "useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then used as the explicit segment sort order. In both cases, when the new "useExplicitSegmentSortOrder" parameter is false (the default), __time is implicitly prepended to the sort order, as it always was prior to this patch. The new parameter is experimental for two main reasons. First, such segments can cause errors when loaded by older servers, due to violating their expectations that timestamps are always monotonically increasing. Second, even on newer servers, not all queries can run on non-time-sorted segments. Scan queries involving time-ordering and any query involving granularity will not run. (To partially mitigate this, a currently-undocumented SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner avoids using "granularity".) Changes on the write path: 1) DimensionsSpec can now optionally contain a __time dimension, which controls the placement of __time in the sort order. If not present, __time is considered to be first in the sort order, as it has always been. 2) IncrementalIndex and IndexMerger are updated to sort facts more flexibly; not always by time first. 3) Metadata (stored in metadata.drd) gains a "sortOrder" field. 4) MSQ can generate range-based shard specs even when not all columns are singly-valued strings. It merely stops accepting new clustering key fields when it encounters the first one that isn't a singly-valued string. This is useful because it enables range shard specs on "someDim" to be created for clauses like "CLUSTERED BY someDim, __time". Changes on the read path: 1) Add StorageAdapter#getSortOrder so query engines can tell how a segment is sorted. 2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter, and VectorCursorGranularizer to throw errors when using granularities on non-time-ordered segments. 3) Update ScanQueryEngine to throw an error when using the time-ordering "order" parameter on non-time-ordered segments. 4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when running on a non-time-ordered segment. 5) Add "sqlUseGranularity" context parameter that causes the SQL planner to avoid using granularities other than ALL. Other changes: 1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions" and change the meaning subtly: it now returns true if the DimensionsSpec represents an unchanging list of dimensions, or false if there is some discovery happening. This is what call sites had expected anyway. * Fixups from CI. * Fixes. * Fix missing arg. * Additional changes. * Fix logic. * Fixes. * Fix test. * Adjust test. * Remove throws. * Fix styles. * Fix javadocs. * Cleanup. * Smoother handling of null ordering. * Fix tests. * Missed a spot on the merge. * Fixups. * Avoid needless Filters.and. * Add timeBoundaryInspector to test. * Fix tests. * Fix FrameStorageAdapterTest. * Fix various tests. * Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder. * Pom fix. * Fix doc.	2024-08-23 08:24:43 -07:00
AmatyaAvadhanula	8c8a4b2302	Remove references to chatAsync (#16950 ) Remove references to chatAsync from Rabbit stream supervisors	2024-08-23 13:21:07 +05:30
Adarsh Sanjeev	2abcb41559	Use controller id while reading from durable storage (#16943 )	2024-08-23 11:37:42 +05:30
Adarsh Sanjeev	e2516d9a67	WriteOutBytes improvements This PR generally improves the working of WriteOutBytes and WriteOutMedium. Some analysis of usage of TmpFileSegmentWriteOutMedium shows that they periodically get used for very small things. The overhead of creating a tmp file is actually very large. To improve the performance in these cases, this PR modifies TmpFileSegmentWriteOutMedium to return a heap-based WriteOutBytes that falls back to making a tmp file when it actually fills up. --------- Co-authored-by: imply-cheddar <eric.tschetter@imply.io>	2024-08-23 11:32:30 +05:30
Edgar Melendrez	c4981e34c4	[docs] Batch10 date and time functions (#16900 ) * just starting * TIME_PARSE and TIME_FORMAT remaining * fixing typo * adding last two functions * review sql-functions.md * Apply suggestions from code review Suggestions that were accepted as is Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/sql-functions.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/sql-functions.md needed to confirm that it did indeed return as a number Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * reviewing remaining suggestions * addressing review for time_format * Apply suggestions from code review Accepted as is Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * addressing final suggestion * time_zone -> timezone * timezone fix --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2024-08-22 20:25:27 -07:00
Hugh Evans	2f6722db63	Updated auth to use variables with default values (#16876 ) * Updated auth to use variables with default values * Update docs/api-reference/sql-ingestion-api.md * Remove Python auth entirely as its not being used --------- Co-authored-by: Benedict Jin <asdf2014@apache.org>	2024-08-22 18:09:09 -07:00
Clint Wylie	2aef6ac685	fix ipv4_parse function return type in SQL to be bigint instead of integer (#16942 ) * fix ipv4_parse function return type in SQL to be bigint instead of integer * fix default value mode	2024-08-22 13:36:43 -07:00
Clint Wylie	bce60b0674	fix flaky ParallelMergeCombiningSequenceTest.testTimeoutExceptionDueToStoppedReader when runner is very slow (#16932 )	2024-08-22 13:34:28 -07:00
Edgar Melendrez	fda2d19b88	[Docs] Batch09: only `lookup` (#16878 ) * [Docs] Batch09: only `lookup` * slight changes * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * applying suggestiontions * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * otherwise null -> otherwise returns null * updating definition in sql-scalar.md * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> * hoping to re-run web checks * change replaceMissingValueWith -> defaultValue * Update docs/querying/sql-scalar.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * acronym_to_name -> airportcode_to_name * shortens `airportcode_to_name` to `code_to_name` --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-08-22 11:11:16 -07:00
Gian Merlino	a83125e4a0	Track IngestionState more accurately in realtime tasks. (#16934 ) Previously, SeekableStreamIndexTaskRunner set ingestion state to COMPLETED when it finished reading data from Kafka. This is incorrect. After the changes in this patch, the transitions go: 1) The task stays in BUILD_SEGMENTS after it finishes reading from Kafka, while it is building its final set of segments to publish. 2) The task transitions to SEGMENT_AVAILABILITY_WAIT after publishing, while waiting for handoff. 3) The task transitions to COMPLETED immediately before exiting, when truly done.	2024-08-22 11:43:46 +05:30
Edgar Melendrez	725695342c	[Docs] Batch07: adding examples to string functions (#16862 ) * Lower,Upper,Lpad,Rpad,Parse_long * up to REGEXP_EXTRACT * batch 07 ready for review * updated definitions in scalar * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> * rpad and lpad * addressing comments * minor fixes * improving examples based on suggestions * matched -> matches * correcting typo * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-08-21 15:08:25 -07:00
Gian Merlino	338da67bc6	Add type coercion and null check to left, right, repeat exprs. (#16480 ) * Add type coercion and null check to left, right, repeat exprs. These exprs shouldn't validate types; they should coerce types. Coercion is typical behavior for functions because it enables schema evolution. The functions are also modified to check isNumericNull on the right-hand argument. This was missing previously, which would erroneously cause nulls to be treated as zeroes. * Fix tests.	2024-08-21 15:07:24 -07:00
Gian Merlino	090023609b	Loosen case in FrameFileWriterTest. (#16938 ) The specific error on a truncated file can vary based on how the final frame of the truncated file is written. This patch loosens the check so it passes regardless of how the truncated file is written.	2024-08-21 13:45:01 -07:00
Akshat Jain	97f9502ad2	Enable MSQ WF drill tests which were previously disabled (#16935 )	2024-08-21 15:47:50 +05:30
Gian Merlino	f6adacf5d6	SuperSorter: Store readOnly output channels. (#16928 ) Without the call to readOnly, each output channel retains a 1 MB allocator, leading to excessive memory use. Fixes regression from #16775.	2024-08-20 23:10:29 -07:00
Akshat Jain	0ce1b6b22f	MSQ window function: Take segment granularity into consideration to fix NPE issues with ingestion (#16854 ) This PR changes the logic for window functions to use the resultShuffleSpecFactory for the last window stage.	2024-08-21 10:06:04 +05:30
Gian Merlino	2bd31603de	FrameFile: Improve error messages. (#16912 ) * FrameFile: Improve error messages. 1) Include frame file path in error messages. 2) Adhere better to style (no space before brackets). * Fix test.	2024-08-20 11:56:30 -07:00
benkrug	7b8573ed3d	Update index.md - remove the extra word "does" from one sentence. (#16922 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-08-20 11:06:12 -07:00
Jakub Matyszewski	82d9ff9cc8	Add docs for log audit manager (#16927 ) * Add docs for log audit manager * Adjust descriptions	2024-08-20 15:58:31 +05:30
Rishabh Singh	bc4b3a2f91	Filter out tombstone segments from metadata cache (#16890 ) * Fix build * Support segment metadata queries for tombstones * Filter out tombstone segments from metadata cache * Revert some changes * checkstyle * Update docs	2024-08-20 11:35:02 +05:30
Clint Wylie	518f642028	remove isDescending from Query interface, move to TimeseriesQuery (#16917 ) * remove isDescending from Query interface, since it is only actually settable and usable by TimeseriesQuery	2024-08-19 23:02:45 -07:00

1 2 3 4 5 ...

14404 Commits All Branches Search

14404 Commits

All Branches