druid

Commit Graph

Author	SHA1	Message	Date
Adithya Chakilam	ec52f686c0	Fix compaction tasks reports getting overwritten (#15981 ) * Fix compaction tasks reports getting overwritten * only skip for compaction task * address comments * fix boolean * move boolean flag to task rather than spec * rename variable * add docs, fix missing case * Update docs/ingestion/tasks.md * rename var * add task report decode check in IT * change assert	2024-03-04 10:10:17 -05:00
317brian	b3015bd7ce	docs: mention acid-compliance for meta store (#16014 ) * docs: add mermaid diagram support * fix crash when parsing data in data loader that can not be parsed (#15983) * update jetty to address CVE (#16000) * Concurrent replace should work with supervisors using concurrent locks (#15995) * Concurrent replace should work with supervisors using concurrent locks * Ignore supervisors with useConcurrentLocks set to false * Apply feedback * Add pre-check for heavy debug logs (#15706) Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: Benedict Jin <asdf2014@apache.org> * Remove helm paths from CodeQL config (#16006) * docs: mention acid-compliance for metadb --------- Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com> Co-authored-by: Jan Werner <105367074+janjwerner-confluent@users.noreply.github.com> Co-authored-by: AmatyaAvadhanula <amatya.avadhanula@imply.io> Co-authored-by: Sensor <fectrain@outlook.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: Benedict Jin <asdf2014@apache.org>	2024-03-04 11:00:38 +08:00
Zoltan Haindrich	bf0995f846	Introduce dynamic table append (#15897 )	2024-03-01 04:31:57 -05:00
317brian	3df161f73c	docs: update security doc for hashing (#15970 ) * docs: add mermaid diagram support * docs: update druid-basic-security doc to mention caching * Update docs/development/extensions-core/druid-basic-security.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2024-02-28 09:48:37 +08:00
benkrug	0c601bf430	Update basic-cluster-tuning.md (#14964 ) * Update basic-cluster-tuning.md The sentence "When free system memory is greater than or equal to druid.segmentCache.locations, the more segment data the Historical can be held in the memory-mapped segment cache" didn't read well. Updated to clarify it. * Update docs/operations/basic-cluster-tuning.md * Update docs/operations/basic-cluster-tuning.md --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-02-28 09:48:20 +08:00
AlbericByte	e7d753d4b0	update the doc for dump-segment tool when using jdk11+ (#15971 ) * update the doc for dump-segment tool when using jdk11+ * update the style * fix spell check error	2024-02-28 09:40:10 +08:00
Abhishek Radhakrishnan	beccc401e1	Segments created in the same batch have the same `created_date` entry & rename metric (#15977 ) * All segments stored in the same batch have the same created_date entry. In the absence of a group_id column, this metadata would allow us to easily reason about and troubleshoot ingestion-related issues. * Rename metric name and code references to eligibleUnusedSegments. Address review comment from https://github.com/apache/druid/pull/15941#discussion_r1503631992	2024-02-27 17:28:43 +05:30
Karan Kumar	5bb5b41b18	Adding task pending time in MSQ reports (#15966 ) Added a new field pendingMs in MSQ task reports. This helps in figuring out the exact run time of the MSQ worker tasks. Fixed data races.	2024-02-27 14:41:28 +05:30
Abhishek Radhakrishnan	38ecf980d0	Refactor and add tests and metric to KillUnusedSegments duty (auto-kill) (#15941 ) * Kill duty and test improvements. Initial commit with: - Bug fixes - auto-kill can throw NPE when there are no datasources present and defaults mismatch. - Add new stat for candidate segment intervals killed. - Move a couple of debug logs to info logs for improved visibility (should only log once per kill period). - Remove redundant checks for code readability. - Updated tests from using mocks (also the mocks weren't using last updated timestamp) and add more test coverage for different config parameters. - Add a couple of unit tests that are ignored for the eternity case to prove that the kill duty doesn't clean up segments with ALL grain or that end in DateTimes.MAX. - Migrate Druid exception from user to operator persona. * Address review comments. * Remove unused methods. * fix up format specifier and validate bad config tests. * Consolidate the helpers a bit more and add another test. * Update test names. Add javadoc placeholders for slightly involved tests. * Add docs for metric kill/candidateUnusedSegments/count. Also, rename to disambiguate. * Comments. * Apply logging suggestions from code review Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Review comments - Clarify docs on eligibility. - Add test for multiple segments in the same interval. Clarify comment. - Remove log line from test. - Remove lastUpdatedDate = now.plus(10) from test. * minor cleanup. * Clarify javadocs for getUnusedSegmentIntervals(). --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2024-02-27 12:14:41 +05:30
Abhishek Radhakrishnan	67a6224d91	Fix up incorrect `PARTITIONED BY` error messages (#15961 ) * Fix up typos, inaccuracies and clean up code related to PARTITIONED BY. * Remove wrapper function and update tests to use DruidExceptionMatcher. * Checkstyle and Intellij inspection fixes.	2024-02-26 14:17:53 -05:00
Benjamin Hopp	ebb7190545	Docs: Change single-dim to hashed in example for index task (#15529 )	2024-02-26 09:16:10 +05:30
Gian Merlino	b69f89d9f8	Clarify where to set druid.monitoring.monitors. (#15729 )	2024-02-23 18:49:37 +05:30
Adithya Chakilam	1f443d218c	Enable partition stats on streaming task completion report (#15930 ) Changes: - Add visibility into number of records processed by each streaming task per partition - Add field `recordsProcessed` to `IngestionStatsAndErrorsTaskReportData` - Populate number of records processed per partition in `SeekableStreamIndexTaskRunner`	2024-02-23 16:29:03 +05:30
Jamie	80942d5754	Feature: add support for ingesting from rabbitmq super streams (#14137 ) * Add support for ingesting from Rabbit MQ Super Streams	2024-02-22 10:50:37 +05:30
George Shiqi Wu	59bb72a926	Fix parsing of env variables when properties have underscores (#15919 ) * Fix parsing of env variables when properties have underscores * Add documentation * Use a % sign instead	2024-02-21 13:18:21 -05:00
317brian	c98d54f3c4	docs: delete unused file that causes confusion (#15910 )	2024-02-14 16:42:02 -08:00
Peter Marshall	cae9cbd7d7	Update tasks.md (#15887 ) Remove erroneous white space causing render issues on this page.	2024-02-13 05:20:09 -08:00
Clint Wylie	dad8398a4d	start process of deprecating non-sql compatible legacy configurations (#15713 ) Starting the process to officially deprecate non-SQL-compatible modes by updating docs to aggressively call out that Druid's non-SQL-compliant modes are deprecated and will go away someday. There are no code or behavior changes in this PR.	2024-02-13 15:31:45 +05:30
Katya Macedo	0f29ece6a9	[Docs] Refactor streaming ingestion section (#15591 ) Merging the work so far. @ektravel , @vogievetsky if there are additional improvements, let's track them & make another pr. * Refactor streaming ingestion docs * Update property definition * Update after review * Update known issues * Move kinesis and kafka topics to ingestion, add redirects * Saving changes * Saving * Add input format text * Update after review * Minor text edit * Update example syntax * Revert back to colon * Fix merge conflicts * Fix broken links * Fix spelling error	2024-02-12 13:52:42 -08:00
Charles Smith	2a42b11660	remove legacy Jupyter tutorial files (#15834 ) * remove legacy files * redirection for the jupyter tutorial page * remove tutorial from sidebar * remove redirection	2024-02-12 13:45:47 -08:00
Gian Merlino	7fea34abdd	LOOKUP docs: clarify behavior of replaceMissingValueWith. (#15879 ) Clarify behavior when expr is null.	2024-02-11 13:11:00 -08:00
zachjsh	f9ee2c353b	Extend the PARTITION BY clause to accept string literals for the time partitioning (#15836 ) This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning	2024-02-09 11:45:38 -05:00
Tom	11a8624ef1	allow for kafka-emitter to have extra dimensions be set for each event it emits (#15845 ) * allow for kafka-emitter to have extra dimensions be set for each event it emits * fix checkstyle issue in kafkaemitterconfig * make changes to fix docs, and cleanup copy paste error in #toString() * undo formatting to markdown table * add more branches so test passes * fix checkstyle issue	2024-02-08 22:55:24 -08:00
Abhishek Radhakrishnan	1a5b57df84	Update `groupId` for delta-lake and iceberg extensions (#15843 ) * Update the group id to org.apache.druid.extensions.contrib for contrib exts. * Note iceberg and delta lake extensions in extensions.md * properties and shell backticks * Update groupId in distribution/pom.xml * remove delta-lake from dist. * Add note on downloading extension.	2024-02-07 23:54:06 -08:00
Adarsh Sanjeev	514b3b4d01	Add export capabilities to MSQ with SQL syntax (#15689 ) * Add test * Parser changes to support export statements * Fix builds * Address comments * Add frame processor * Address review comments * Fix builds * Update syntax * Webconsole workaround * Refactor * Refactor * Change export file path * Update docs * Remove webconsole changes * Fix spelling mistake * Parser changes, add tests * Parser changes, resolve build warnings * Fix failing test * Fix failing test * Fix IT tests * Add tests * Cleanup * Fix unparse * Fix forbidden API * Update docs * Update docs * Address review comments * Address review comments * Fix tests * Address review comments * Fix insert unparse * Add external write resource action * Fix tests * Add resource check to overlord resource * Fix tests * Add IT * Update syntax * Update tests * Update permission * Address review comments * Address review comments * Address review comments * Add tests * Add check for runtime parameter for bucket and path * Add check for runtime parameter for bucket and path * Add tests * Update docs * Fix NPE * Update docs, remove deadcode * Fix formatting	2024-02-07 22:08:50 +05:30
Vadim Ogievetsky	f2b242b6e6	update console to core Druid changes (#15854 )	2024-02-07 19:44:25 +05:30
Pramod Immaneni	59bca0951a	Parallelize storage of incremental segments (#13982 ) During ingestion, incremental segments are created in memory for the different time chunks and persisted to disk when certain thresholds are reached (max number of rows, max memory, incremental persist period etc). In the case where there are a lot of dimensions and metrics (1000+), it was observed that creating and serializing the incremental segment file format for persistence, and then persisting the file, took a while and blocked ingestion of new data. This affected real-time ingestion. This serialization and persistence can be parallelized across the different time chunks. This update aims to do that. The patch adds a simple configuration parameter to the ingestion tuning configuration to specify the number of persistence threads. The default value is 1 if it is not specified, which keeps the behavior the same as it is today.	2024-02-07 10:43:05 +05:30
317brian	2dc71c7874	docs: fix rendering (#15835 )	2024-02-06 07:18:43 -08:00
Gian Merlino	54b30646f3	Add sqlReverseLookupThreshold for ReverseLookupRule. (#15832 ) If lots of keys map to the same value, reversing a LOOKUP call can slow things down unacceptably. To protect against this, this patch introduces a parameter sqlReverseLookupThreshold representing the maximum size of an IN filter that will be created as part of lookup reversal. If inSubQueryThreshold is set to a smaller value than sqlReverseLookupThreshold, then inSubQueryThreshold will be used instead. This allows users to use that single parameter to control IN sizes if they wish.	2024-02-06 16:32:05 +05:30
Atul Mohan	2e46a98024	Add range filtering support for iceberg ingestion (#15782 ) * Add range filtering support for iceberg ingestion * Docs formatting * Spelling	2024-02-01 23:32:30 -08:00
Aru Raghuwanshi	223f29d64c	Update input-sources.md for fixing the warehouse path example under S3 (#15823 )	2024-02-01 23:32:05 -08:00
317brian	6d617c34d2	docs: revise concurrent append and replace (#15760 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-02-01 11:03:36 -08:00
Laksh Singla	7d65caf0c5	Update the docs for EARLIEST_BY/LATEST_BY aggregators with the newly added numeric capabilities (#15670 )	2024-02-01 10:24:43 +05:30
Abhishek Radhakrishnan	9f95a691f7	Extension to read and ingest Delta Lake tables (#15755 ) * something * test commit * compilation fix * more compilation fixes (fixme placeholders) * Comment out druid-kerberos build since it conflicts with newly added transitive deps from delta-lake. Will need to sort out the dependencies later. * checkpoint * remove snapshot schema since we can get schema from the row * iterator bug fix * json json json * sampler flow * empty impls for read(InputStats) and sample() * conversion? * conversion, without timestamp * Web console changes to show Delta Lake * Asset bug fix and tile load * Add missing pieces to input source info, etc. * fix stuff * Use a different delta lake asset * Delta lake extension dependencies * Cleanup * Add InputSource, module init and helper code to process delta files. * Test init * Checkpoint changes * Test resources and updates * some fixes * move to the correct package * More tests * Test cleanup * TODOs * Test updates * requirements and javadocs * Adjust dependencies * Update readme * Bump up version * fixup typo in deps * forbidden api and checkstyle checks * Trim down dependencies * new lines * Fixup Intellij inspections. * Add equals() and hashCode() * chain splits, intellij inspections * review comments and todo placeholder * fix up some docs * null table path and test dependencies. Fixup broken link. * run prettify * Different test; fixes * Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests * yank the old test resource. * add a couple of sad path tests * Updates to readme based on latest. * Version support * Extract Delta DateTime conversions to DeltaTimeUtils class and add test * More comprehensive split tests. * Some test renames. * Cleanup and update instructions. * add pruneSchema() optimization for table scans. * Oops, missed the parquet files. * Update default table and rename schema constants. * Test setup and misc changes. * Add class loader logic as the context class loader is unaware of extension classes * change some table client creation logic. * Add hadoop-aws, hadoop-common and related exclusions. * Remove org.apache.hadoop:hadoop-common * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Add entry to .spelling to fix docs static check --------- Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-01-30 21:53:50 -08:00
Benjamin Hopp	6177f6efd7	Fixing formatting of Iceberg Catalog Object (#15748 )	2024-01-30 20:17:38 -08:00
Abhishek Radhakrishnan	dbdfae3011	Fix up typo </br /> -> <br /> and adjust interpolated exception msg in InvalidNullByteFault. (#15804 )	2024-01-30 12:44:51 -08:00
317brian	ba07965580	docs: clean up some rolling updates stuff (#15762 )	2024-01-26 14:10:53 -08:00
George Shiqi Wu	3e512249e3	Azure multi read options (#15630 ) * Include new dependencies * Mostly implemented * More azure fixes * Tests passing * Unit tests running * Test running after removing storage exception * Happy with coverage now * Add more tests * fix client factory * cleanup from testing * Remove old client * update docs * Exclude from spellcheck * Add licenses * Fix identity version * Save work * Add azure clients * add licenses * typos * Add dependencies * Exception is not thrown * Fix intellij check * Don't need to override * specify length * urldecode * encode path * Fix checks * Revert urlencode changes * Urlencode with azure library * Update docs/development/extensions-core/azure.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * PR changes * Update docs/development/extensions-core/azure.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Add config for multiple storage accounts * Deprecate AzureTaskLogsConfig.maxRetries * Clean up azure retry block * logic update to reuse clients * fix comments * Create container conditionally * Fix key auth * save work * Fix unit tests * Revert old azure input type * Separate input source * save work * Add support for app registrations * Fix unit tests * clean up spacing * Add coverage * fixes from testing * cleanup some caching behavior * Add docs * Fix spelling issues * fix more spelling errors * Fix intellij inspections * add simple changes from pr * save work on fixing bug * Fix unit tests * Add more testing * Fix unit test * Add tests * Add annotation for azureStorage * Fix up docs * Add comment for list method * Fix tests * Remove unneeded toString * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * PR changes * fix injection of StorageConnector * Fix checkstyle * clean up unit tests * More pr fixes --------- Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-01-25 13:29:16 -05:00
Katya Macedo	867c636629	Document pivot and unpivot operators (#15669 )	2024-01-25 09:53:39 -08:00
Abhishek Agarwal	0ab2781a7f	Disable eager initialization for non-query connection requests (#15751 )	2024-01-25 14:38:50 +05:30
Hiroshi Fukada	3fe3a65344	New: Add DDSketch in extensions-contrib (#15049 ) * New: Add DDSketch-Druid extension - Based off of http://www.vldb.org/pvldb/vol12/p2195-masson.pdf and uses the corresponding https://github.com/DataDog/sketches-java library - contains tests for post building and using aggregation/post aggregation. - New aggregator: `ddSketch` - New post aggregators: `quantileFromDDSketch` and `quantilesFromDDSketch` * Fixing easy CodeQL warnings/errors * Fixing docs, and dependencies Also moved aggregator ids to AggregatorUtil and PostAggregatorIds * Adding more Docs and better null/empty handling for aggregators * Fixing docs, and pom version * DDSketch documentation format and wording	2024-01-23 20:17:07 +05:30
Pranav	45b30dc07d	Revert "Change default inSubQueryThreshold (#15336 )" (#15722 ) A low value of inSubQueryThreshold can cause queries with an IN filter to plan as joins more commonly. However, some of these join queries may not get planned as IN filters on data nodes, which causes a significant perf regression.	2024-01-22 11:34:39 +05:30
zachjsh	9d4e8053a4	Kinesis adaptive memory management (#15360 ) ### Description Our Kinesis consumer works by using the [GetRecords API](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html) in some number of `fetchThreads`, each fetching some number of records (`recordsPerFetch`) and each inserting into a shared buffer that can hold a `recordBufferSize` number of records. The logic is described in our documentation at: https://druid.apache.org/docs/27.0.0/development/extensions-core/kinesis-ingestion/#determine-fetch-settings There is a problem with the logic that this pr fixes: the memory limits rely on a hard-coded “estimated record size” that is `10 KB` if `deaggregate: false` and `1 MB` if `deaggregate: true`. There have been cases where a supervisor had `deaggregate: true` set even though it wasn’t needed, leading to under-utilization of memory and poor ingestion performance. Users don’t always know if their records are aggregated or not. Also, even if they could figure it out, it’s better to not have to. So we’d like to eliminate the `deaggregate` parameter, which means we need to do memory management more adaptively based on the actual record sizes. We take advantage of the fact that GetRecords doesn’t return more than 10MB (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html ): This pr: * eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set) * eliminate `deaggregate`, always have it true * cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments * add `recordBufferSizeBytes` as a bytes-based limit rather than records-based limit for the shared queue. We do know the byte size of kinesis records at this point. Default should be `100MB` or `10% of heap`, whichever is smaller. * add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes. * deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified * deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if `maxRecordsPerPoll` is specified * Fixed issue that when the record buffer is full, the fetchRecords logic throws away the rest of the GetRecords result after `recordBufferOfferTimeout` and starts a new shard iterator. This seems excessively churny. Instead, wait an unbounded amount of time for queue to stop being full. If the queue remains full, we’ll end up right back waiting for it after the restarted fetch. There was also a call to `newQ::offer` without check in `filterBufferAndResetBackgroundFetch`, which seemed like it could cause data loss. Now checking return value here, and failing if false. ### Release Note Kinesis ingestion memory tuning config has been greatly simplified, and a more adaptive approach is now taken for the configuration. Here is a summary of the changes made: * eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set) * eliminate `deaggregate`, always have it true * cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments * add `recordBufferSizeBytes` as a bytes-based limit rather than records-based limit for the shared queue. We do know the byte size of kinesis records at this point. Default should be `100MB` or `10% of heap`, whichever is smaller. * add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes. * deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified * deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if `maxRecordsPerPoll` is specified	2024-01-19 14:30:21 -05:00
Abhishek Radhakrishnan	38c1def95a	Kill tasks honor the buffer period of unused segments (#15710 ) * Kill tasks should honor the buffer period of unused segments. - The coordinator duty KillUnusedSegments determines an umbrella interval for each datasource to determine the kill interval. There can be multiple unused segments in an umbrella interval with different used_status_last_updated timestamps. For example, consider an unused segment that is 30 days old and one that is 1 hour old. Currently the kill task after the 30-day mark would kill both the unused segments and not retain the 1-hour old one. - However, when a kill task is instantiated with this umbrella interval, it’d kill all the unused segments regardless of the last updated timestamp. We need kill tasks and RetrieveUnusedSegmentsAction to honor the bufferPeriod to avoid killing unused segments in the kill interval prematurely. * Clarify default behavior in docs. * test comments * fix canDutyRun() * small updates. * checkstyle * forbidden api fix * doc fix, unused import, codeql scan error, and cleanup logs. * Address review comments * Rename maxUsedFlagLastUpdatedTime to maxUsedStatusLastUpdatedTime This is consistent with the column name `used_status_last_updated`. * Apply suggestions from code review Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Make period Duration type * Remove older variants of runKillTask() in OverlordClient interface * Test can now run without waiting for canDutyRun(). * Remove previous variants of retrieveUnusedSegments from internal metadata storage coordinator interface. Removes the following interface methods in favor of a new method added: - retrieveUnusedSegmentsForInterval(String, Interval) - retrieveUnusedSegmentsForInterval(String, Interval, Integer) * Chain stream operations * cleanup * Pass in the lastUpdatedTime to markUnused test function and remove sleep. --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2024-01-18 22:23:50 -08:00
Abhishek Radhakrishnan	f51f0e07e2	Remove documentation for unused segments retrieval API (#15721 ) * Undocument unused segments retrieval API. * Mark API deprecated and unstable. Note that it'll be removed. * Cleanup .spelling entries * Remove the Unstable annotation	2024-01-18 18:25:48 -08:00
Ben Sykes	e49a7bb3cd	Add SpectatorHistogram extension (#15340 ) * Add SpectatorHistogram extension * Clarify documentation Cleanup comments * Use ColumnValueSelector directly so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram. When queried as a Number, we're returning the count of entries in the histogram. * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Fix references * Fix spelling * Update docs/development/extensions-contrib/spectator-histogram.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-01-14 09:52:30 -08:00
Gian Merlino	cccf13ea82	Reverse, pull up lookups in the SQL planner. (#15626 ) * Reverse, pull up lookups in the SQL planner. Adds two new rules: 1) ReverseLookupRule, which eliminates calls to LOOKUP by doing reverse lookups. 2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above GROUP BY, when the lookup is injective. Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether these rules fire. Both are enabled by default. To minimize the chance of performance problems due to many keys mapping to the same value, ReverseLookupRule refrains from reversing a lookup if there are more keys than `inSubQueryThreshold`. The rationale for using this setting is that reversal works by generating an IN, and the `inSubQueryThreshold` describes the largest IN the user wants the planner to create. * Add additional line. * Style. * Remove commented-out lines. * Fix tests. * Add test. * Fix doc link. * Fix docs. * Add one more test. * Fix tests. * Logic, test updates. * - Make FilterDecomposeConcatRule more flexible. - Make CalciteRulesManager apply reduction rules til fixpoint. * Additional tests, simplify code.	2024-01-12 00:06:31 -08:00
Misha	ea6ba40ce1	Add support for Azure Government storage (#15523 ) Added support for Azure Government storage in Druid Azure-Extensions. This enhancement allows the Azure-Extensions to be compatible with different Azure storage types by updating the endpoint suffix from a hardcoded value to a configurable one.	2024-01-09 22:33:32 +05:30
Abhishek Agarwal	468b99e608	Enable query request queuing by default when total laning is turned on. (#15440 ) This PR enables the flag by default to queue excess query requests in the jetty queue. Still keeping the flag so that it can be turned off if necessary. But the flag will be removed in the future.	2024-01-09 07:54:26 +05:30
Victoria Lim	52313c51ac	docs: Anchor link checker (#15624 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-01-08 15:19:05 -08:00
317brian	141c214b46	docs: add note about finalizeaggregations for sql-based ingestion (#15631 )	2024-01-08 10:06:10 -08:00
Charles Smith	d8830b64fc	add style for table formatting to docs contribution (#15612 ) Co-authored-by: Benedict Jin <asdf2014@apache.org> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-01-04 14:02:32 -08:00
George Shiqi Wu	8e95cea8e5	Azure client upgrade to allow identity options (#15287 ) * Include new dependencies * Mostly implemented * More azure fixes * Tests passing * Unit tests running * Test running after removing storage exception * Happy with coverage now * Add more tests * fix client factory * cleanup from testing * Remove old client * update docs * Exclude from spellcheck * Add licenses * Fix identity version * Save work * Add azure clients * add licenses * typos * Add dependencies * Exception is not thrown * Fix intellij check * Don't need to override * specify length * urldecode * encode path * Fix checks * Revert urlencode changes * Urlencode with azure library * Update docs/development/extensions-core/azure.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * PR changes * Update docs/development/extensions-core/azure.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Deprecate AzureTaskLogsConfig.maxRetries * Clean up azure retry block * logic update to reuse clients * fix comments * Create container conditionally * Fix key auth * Remove container client logic * Add some more testing * Update comments * Add a comment explaining client reuse * Move logic to factory class * use bom for dependency management * fix license versions --------- Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-01-03 18:36:05 -05:00
Victoria Lim	b8060fc93f	docs: Fix broken anchor links (#15621 )	2024-01-03 15:28:27 -08:00
Gian Merlino	01eec4a55e	New handling for COALESCE, SEARCH, and filter optimization. (#15609 ) * New handling for COALESCE, SEARCH, and filter optimization. COALESCE is converted by Calcite's parser to CASE, which is largely counterproductive for us, because it ends up duplicating expressions. In the current code we end up un-doing it in our CaseOperatorConversion. This patch has a different approach: 1) Add CaseToCoalesceRule to convert CASE back to COALESCE earlier, before the Volcano planner runs, using CaseToCoalesceRule. 2) Add FilterDecomposeCoalesceRule to decompose calls like "f(COALESCE(x, y))" into "(x IS NOT NULL AND f(x)) OR (x IS NULL AND f(y))". This helps use indexes when available on x and y. 3) Add CoalesceLookupRule to push COALESCE into the third arg of LOOKUP. 4) Add a native "coalesce" function so we can convert 3+ arg COALESCE. The advantage of this approach is that by un-doing the CASE to COALESCE conversion earlier, we have flexibility to do more stuff with COALESCE (like decomposition and pushing into LOOKUP). SEARCH is an operator used internally by Calcite to represent matching an argument against some set of ranges. This patch improves our handling of SEARCH in two ways: 1) Expand NOT points (point "holes" in the range set) from SEARCH as `!(a \|\| b)` rather than `!a && !b`, which makes it possible to convert them to a "not" of "in" filter later. 2) Generate those nice conversions for NOT points even if the SEARCH is not composed of 100% NOT points. Without this change, a SEARCH for "x NOT IN ('a', 'b') AND x < 'm'" would get converted like "x < 'a' OR (x > 'a' AND x < 'b') OR (x > 'b' AND x < 'm')". One of the steps we take when generating Druid queries from Calcite plans is to optimize native filters. This patch improves this step: 1) Extract common ANDed predicates in ConvertSelectorsToIns, so we can convert "(a && x = 'b') \|\| (a && x = 'c')" into "a && x IN ('b', 'c')". 2) Speed up CombineAndSimplifyBounds and ConvertSelectorsToIns on ORs with lots of children by adjusting the logic to avoid calling "indexOf" and "remove" on an ArrayList. 3) Refactor ConvertSelectorsToIns to reduce duplicated code between the handling for "selector" and "equals" filters. * Not so final. * Fixes. * Fix test. * Fix test.	2024-01-03 08:56:22 -08:00
sensor	cfdea06857	Fix `used_flag_last_updated` to `used_status_last_updated` in upgrade-notes.md (#15601 ) * Fix `used_flag_last_updated` to `used_status_last_updated` in upgrade-notes.md * Update docs/release-info/upgrade-notes.md Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com> --------- Co-authored-by: Benedict Jin <asdf2014@apache.org> Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>	2024-01-03 11:48:07 +08:00
Abhishek Radhakrishnan	f0f428274a	Prometheus config property doc fixup (#15613 ) * Minor fixes * Update docs/development/extensions-contrib/prometheus.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-01-02 16:28:42 -08:00
Abhishek Radhakrishnan	9c7d7fc777	Allow empty inserts and replaces in MSQ. (#15495 ) * Allow empty inserts and replace. - Introduce a new query context failOnEmptyInsert which defaults to false. - When this context is false (default), MSQE will now allow empty inserts and replaces. - When this context is true, MSQE will throw the existing InsertCannotBeEmpty MSQ fault. - For REPLACE ALL over an ALL grain segment, the query will generate a tombstone spanning eternity which will eventually be removed by the coordinator. - Add unit tests in MSQInsertTest, MSQReplaceTest to test the new default behavior (i.e., when failOnEmptyInsert = false) - Update unit tests in MSQFaultsTest to test the non-default behavior (i.e., when failOnEmptyInsert = true) * Ignore test to see if it's the culprit for OOM * Add heap dump config * Bump up -Xmx from 1500 MB to 2048 MB * Add steps to tarball and collect hprof dump to GHA action * put back mx to 1500MB to trigger the failure * add the step to reusable unit test workflow as well * Revert the temp heap dump & @Ignore changes since max heap size is increased * Minor updates * Review comments 1. Doc suggestions 2. Add tests for empty insert and replace queries with ALL grain and limit in the default failOnEmptyInsert mode (=false). Add similar tests to MSQFaultsTest with failOnEmptyInsert = true, so the query does fail with an InsertCannotBeEmpty fault. 3. Nullable annotation and javadocs * Add comment replace_limit.patch	2024-01-02 13:05:51 -08:00
Vishesh Garg	e43bb74c3a	Add MSQ Durable Storage Connector for Google Cloud Storage and change current Google Cloud Storage client library (#15398 ) The PR addresses 2 things: Add MSQ durable storage connector for GCS Change GCS client library from the old Google API Client Library to the recommended Google Cloud Client Library. Ref: https://cloud.google.com/apis/docs/client-libraries-explained	2023-12-14 07:34:49 +05:30
Clint Wylie	e55f6b6202	remove search auto strategy, estimateSelectivity of BitmapColumnIndex (#15550 ) * remove search auto strategy, estimateSelectivity of BitmapColumnIndex * more cleanup	2023-12-13 16:30:01 -08:00
Bartosz Mikulski	4670a7650f	Optional removal of metrics from Prometheus PushGateway on shutdown (#14935 ) * Optional removal of metrics from Prometheus PushGateway on shutdown * Make pushGatewayDeleteOnShutdown property nullable * Add waitForShutdownDelay property * Fix unit test * Address PR comments * Address PR comments * Add explanation on why it is useful to have deletePushGatewayMetricsOnShutdown * Fix spelling error * Fix spelling error	2023-12-13 11:58:53 -05:00
Clint Wylie	e8fcf2cac8	minor doc adjustments (#15531 )	2023-12-11 18:22:44 -08:00
zachjsh	ab7d9bc6ec	Add api for Retrieving unused segments (#15415 ) ### Description This pr adds an api for retrieving unused segments for a particular datasource. The api supports pagination by the addition of `limit` and `lastSegmentId` parameters. The resulting unused segments are returned with optional `sortOrder`, `ASC` or `DESC` with respect to the matching segments `id`, `start time`, and `end time`, or not returned in any guaranteed order if `sortOrder` is not specified. `GET /druid/coordinator/v1/datasources/{dataSourceName}/unusedSegments?interval={interval}&limit={limit}&lastSegmentId={lastSegmentId}&sortOrder={sortOrder}` Returns a list of unused segments for a datasource in the cluster contained within an optionally specified interval. Optional parameters for limit and lastSegmentId can be given as well, to limit results and enable paginated results. The results may be sorted in either ASC, or DESC order depending on specifying the sortOrder parameter. `dataSourceName`: The name of the datasource `interval`: the specific interval to search for unused segments for. `limit`: the maximum number of unused segments to return information about. This property helps to support pagination `lastSegmentId`: the last segment id from which to search for results. All segments returned are > this segment lexicographically if sortOrder is null or ASC, or < this segment lexicographically if sortOrder is DESC. `sortOrder`: Specifies the order with which to return the matching segments by start time, end time. A null value indicates that order does not matter. This PR has: - [x] been self-reviewed. - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.) - [x] added documentation for new or modified features or behaviors. - [ ] a release note entry in the PR description. - [x] added Javadocs for most classes and all non-trivial methods. 
Linked related entities via Javadoc links. - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md) - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader. - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met. - [ ] added integration tests. - [x] been tested in a test Druid cluster.	2023-12-11 16:32:18 -05:00
Katya Macedo	fc222377ae	[Docs] Document decode_base64_complex and decode_base64_utf8 functions (#15444 )	2023-12-11 09:12:06 -08:00
Abhishek Radhakrishnan	96be82a3e6	Clean up duty for non-overlapping eternity tombstones (#15281 ) * Add initial draft of MarkDanglingTombstonesAsUnused duty. * Use overshadowed segments instead of all used segments. * Add unit test for MarkDanglingSegmentsAsUnused duty. * Add mock call * Simplify code. * Docs * shorter lines formatting * metric doc * More tests, refactor and fix up some logic. * update javadocs; other review comments. * Make numCorePartitions as 0 in the TombstoneShardSpec. * fix up test * Add tombstone core partition tests * Update docs/design/coordinator.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * review comment * Minor cleanup * Only consider tombstones with 0 core partitions * Need to register the test shard type to make jackson happy * test comments * checkstyle * fixup misc typos in comments * Update logic to use overshadowed segments * minor cleanup * Rename duty to eternity tombstone instead of dangling. Add test for full eternity tombstone. * Address review feedback. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-12-11 08:57:15 -08:00
Katya Macedo	099a9825d1	[Docs] Add a release notes template (#15333 ) * Add release notes template * Update spellcheck	2023-12-11 11:35:16 +05:30
Victoria Lim	e68979e03b	Docs: update SQL API reference (#15515 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-12-08 11:53:19 -08:00
Katya Macedo	355c800108	Revamp design page (#15486 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-12-08 11:40:24 -08:00
Clint Wylie	e64b92eb35	add JSON_QUERY_ARRAY function to pluck ARRAY<COMPLEX<json>> out of COMPLEX<json> (#15521 )	2023-12-08 05:28:46 -08:00
Clint Wylie	1eafe983ec	fix array presenting columns to not match single element arrays to scalars for equality (#15503 ) * fix array presenting columns to not match single element arrays to scalars for equality * update docs to clarify usage model of mixed type columns	2023-12-08 01:22:07 -08:00
sb89594	5fda8613ad	Feature: Add IPv6 Match Function (#15212 )	2023-12-07 23:09:06 -08:00
Charles Smith	db3a633250	update timeseries to reflect NULL filling (#15512 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-12-07 14:41:27 -08:00
Clint Wylie	82ac48786b	document arrayContainsElement filter (#15455 )	2023-12-07 00:14:00 -08:00
Benjamin Hopp	fea53c7084	Re-arranging sections for append and replace docs. (#15497 )	2023-12-06 13:13:05 -08:00
Abhishek Radhakrishnan	f4949afdd7	clarify and fixup typos related to unused segments in docs and javadocs. (#15498 )	2023-12-05 22:30:32 -08:00
Jill Osborne	0e14a2c77f	Update retention rules doc (#15439 )	2023-12-05 09:53:17 -08:00
Jan Werner	f4856bc1c1	ranger-security: exclude jackson-jaxrs from + fix outdated documentation (#15481 ) * Excluding jackson-jaxrs dependency from ranger-plugin-common to address CVE regression introduced by ranger-upgrade: CVE-2019-10202, CVE-2019-10172 * remove the reference to outdated ranger 2.0 from the docs --------- Co-authored-by: Xavier Léauté <xl+github@xvrl.net>	2023-12-05 08:24:37 -08:00
Rishabh Singh	d968bb3f43	Rename config for enabling CentralizedDatasourceSchema feature (#15476 ) * Rename property to druid.centralizedDatasourceSchema.enabled * Update config name in docker-compose	2023-12-05 16:57:25 +05:30
Pranav	74ab6024e1	Native doc update (#15456 ) Updating the native docs for #15434	2023-11-30 10:37:23 +05:30
Pranav	93cd638645	Enabling aggregateMultipleValues in all StringAnyAggregators (#15434 ) * Enabling aggregateMultipleValues in all StringAnyAggregators * Adding more tests * More validation * fix warning * updating asserts in decoupled mode * fix intellij inspection * Addressing comments * Addressing comments * Adding early validations and make aggregate consistent across all * fixing tests * fixing tests * Update docs/querying/sql-aggregations.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * fixing static check --------- Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-11-29 14:32:49 -08:00
Abhishek Agarwal	0a56c87e93	SQL: Plan non-equijoin conditions as cross join followed by filter (#15302 ) This PR revives #14978 with a few more bells and whistles. Instead of an unconditional cross-join, we will now split the join condition such that some conditions are now evaluated post-join. To decide what sub-condition goes where, I have refactored DruidJoinRule class to extract unsupported sub-conditions. We build a postJoinFilter out of these unsupported sub-conditions and push to the join.	2023-11-29 13:46:11 +05:30
Zoltan Haindrich	eb056e23b5	Fix dictionarySize overrides in tests (#15354 ) I think this is a problem as it discards the false return value when the putToKeyBuffer can't store the value because of the limit Not forwarding the return value at that point may lead to the normal continuation here regardless something was not added to the dictionary like here	2023-11-28 18:49:09 +05:30
Charles Smith	a929b9f16e	clarify DISTINCT is optional for COUNT() (#15394 )	2023-11-28 16:52:16 +05:30
Petrichor	b102667695	[Docs] Add example connection parameters for Java APIs (#15345 )	2023-11-28 15:09:41 +05:30
Jill Osborne	3fa856b3ff	Update Kinesis resharding doc (#15401 )	2023-11-20 15:40:59 -08:00
Jill Osborne	6ed343c047	Data management API doc refactor (#15087 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: George Shiqi Wu <george.wu@imply.io> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: ythorat2 <ythorat2@illinois.edu> Co-authored-by: Krishna Anandan <krishna1729atom@gmail.com> Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com> Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com> Co-authored-by: Karan Kumar <karankumar1100@gmail.com> Co-authored-by: Rishabh Singh <6513075+findingrish@users.noreply.github.com> Co-authored-by: Magnus Henoch <magnus@gameanalytics.com> Co-authored-by: AmatyaAvadhanula <amatya.avadhanula@imply.io> Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Yashdeep Thorat <yashdeep97@gmail.com> Co-authored-by: Atul Mohan <atulmohan.mec@gmail.com> Co-authored-by: Clint Wylie <cwylie@apache.org> Co-authored-by: Gian Merlino <gianmerlino@gmail.com>	2023-11-20 12:34:42 -08:00
317brian	dfc52994d4	docs: fix code tabs (#15403 )	2023-11-20 11:16:10 -08:00
Clint Wylie	a95c22ce70	support non-constant expressions for path arguments for json_value and json_query (#15320 ) * support dynamic expressions for path arguments for json_value and json_query	2023-11-17 01:12:05 -08:00
Atul Mohan	a2914789d7	Add support for ingesting older iceberg snapshots (#15348 ) This patch introduces a param snapshotTime in the iceberg inputsource spec that allows the user to ingest data files associated with the most recent snapshot as of the given time. This helps the user ingest data based on older snapshots by specifying the associated snapshot time. This patch also upgrades the iceberg core version to 1.4.1	2023-11-17 12:32:28 +05:30
Charles Smith	6a5da5a05e	fix redirect for api docs and misc array-related typos (#15387 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-16 13:29:19 -08:00
Karan Kumar	857b8de425	Query from deep storage doc fixes. (#15382 ) Fixing outdated query from deep storage docs.	2023-11-16 14:05:20 +05:30
Adarsh Sanjeev	a134cc30a6	Change default inSubQueryThreshold (#15336 )	2023-11-14 14:08:12 +05:30
YongGang	3a3d37ef40	Fix for segment/count Metric Not Emitting with Statsd-emitter (#15347 ) * fix segment/count metric in Statsd-emitter * update doc * Update docs/development/extensions-contrib/prometheus.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/statsd.md Co-authored-by: Suneet Saldanha <suneet@apache.org> --------- Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-11-10 08:08:58 -08:00
Charles Smith	e7d0429f5b	docs: suggest metadata store with instant ADD COLUMN semantics (#15334 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-09 12:56:30 -08:00
Pranav	e2fde8c516	Refactor lookups behavior while loading/dropping the containers (#14806 )	2023-11-07 10:07:28 -08:00
Charles Smith	0403e48266	window functions docs (#14739 ) * draft window functions * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address comments * remove default column * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * fix ntile * remove default header column * code tics to remove spelling errors * add known issues, add SUM example * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address spelling * remove extra chars * add to sidebar, fix admonition * Update sql-window-functions.md accept suggestion, change admonition style * update sidebar * Delete Untitled.ipynb rm unwanted file * Update docs/querying/sql-window-functions.md * Update docs/querying/sql-window-functions.md * update context param, accept suggestions * accept suggestions * Apply suggestions from code review * Fix known issues * require GROUP BY, explain order of operation * accept suggestions * fix spelling --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-11-06 11:34:42 -08:00
Rishabh Singh	8c802e4c9b	Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency (#14985 ) In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal. To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.	2023-11-04 19:33:25 +05:30
Tts-233	f39a778f7d	Fix 404 URL about native query (#15324 )	2023-11-03 08:39:59 -07:00
Karan Kumar	5036af6fb3	Doc fixes for query from deep storage and MSQ (#15313 ) Minor updates to the documentation. Added prerequisites. Removed a known issue in MSQ since it's no longer valid. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-03 10:52:20 +05:30
cristian-popa	fb260f3e41	docs: LDAP trust store property clarification (#15028 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-02 13:00:08 -07:00
Gian Merlino	d87d92bc43	Add system fields to input sources. (#15276 ) * Add system fields to input sources. Main changes: 1) The SystemField enum defines system fields "__file_uri", "__file_path", and "__file_bucket". They are associated with each input entity. 2) The SystemFieldInputSource interface can be added to any InputSource to make it system-field-capable. It sets up serialization of a list of configured "systemFields" in the JSON form of the input source, and provides a method getSystemFieldValue for computing the value of each system field. Cloud object, HDFS, HTTP, and Local now have this. * Fix various LocalInputSource calls. * Fix style stuff. * Fixups. * Fix tests and coverage.	2023-11-02 10:31:28 -07:00
Clint Wylie	d261587f4a	explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds (#15245 ) * better documentation for the differences between arrays and mvds * add outputType to ExpressionPostAggregator to make docs true * add output coercion if outputType is defined on ExpressionPostAgg * updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables	2023-11-02 00:31:37 -07:00
Charles Smith	de557a62ad	Suggest adoption of Google Style guide (#14905 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-11-01 13:31:03 -07:00
Charles Smith	3860052de0	remove references to Jupyter notebooks within the Druid repo (#15143 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-11-01 13:17:06 -07:00
Katya Macedo	935050bf43	docs: Dynamic config cleanup (#15265 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-01 11:22:33 -07:00
317brian	436ded3d78	docs: durable storage azure cleanup (#15120 ) Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2023-10-31 15:20:38 -07:00
Katya Macedo	a43ffbdf2b	[Docs] Improvements to JSON-based batch Ingestion page (#15286 )	2023-10-31 14:50:45 -07:00
317brian	87695410ac	docs: blurb about msq union all (#15223 )	2023-10-31 14:15:38 -07:00
Vishesh Garg	039b05585c	Add worker status and duration metrics in live and task reports (#15180 ) Add worker status and duration metrics in live and task reports for tracking.	2023-10-30 09:43:22 +05:30
317brian	737947754d	docs: add concurrent compaction docs (#15218 ) Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-10-27 10:29:34 -07:00
David Christle	fc0b940f78	Document the allowed range of announcer maxBytesPerNode (#15063 )	2023-10-26 14:51:01 -07:00
YongGang	7a25ee4fd9	Ability to send task types to k8s or worker task runner (#15196 ) * Ability to send task types to k8s or worker task runner * add more tests * use runnerStrategy to determine task runner * minor refine * refine runner strategy config * move workerType config to upper level * validate config when application start	2023-10-25 09:55:56 -07:00
Adarsh Sanjeev	c5fa649ea5	Rename segment load wait parameter (#15251 )	2023-10-25 18:08:37 +05:30
Karan Kumar	61ea9e07c5	Limit pages size to a configurable limit (#14994 ) Adding the ability to limit the pages sizes of select queries. We piggyback on the same machinery that is used to control the numRowsPerSegment. This patch introduces a new context parameter rowsPerPage for which the default value is set to 100000 rows. This patch also optimizes adding the last selectResults stage only when the previous stages have sorted outputs. Currently for each select query with selectDestination=durableStorage, we used to add this extra selectResults stage.	2023-10-12 14:01:46 +05:30
Clint Wylie	d0f64608eb	sql compatible three-valued logic native filters (#15058 ) * sql compatible tri-state native logical filters when druid.expressions.useStrictBooleans=true and druid.generic.useDefaultValueForNull=false, and new druid.generic.useThreeValueLogicForNativeFilters=true * log.warn if non-default configurations are used to guide operators towards SQL compliant behavior	2023-10-12 00:06:23 -07:00
317brian	265c811963	docs: remove experimental note from query from deep storage docs (#15132 )	2023-10-12 11:51:02 +05:30
Katya Macedo	10aab7506e	Dynamic configuration API documentation refactor (#15098 ) Co-authored-by: demo-kratia <56242907+demo-kratia@users.noreply.github.com>	2023-10-11 14:45:05 -07:00
317brian	263e106714	docs: remove experimental note from unnest docs (#15123 ) * docs: remove experimental note from unnest docs * remove flag needed to use unnest	2023-10-10 16:52:51 -07:00
Laksh Singla	95bf331c08	Rename the default setting of 'maxSubqueryBytes' from 'unlimited' to 'disabled' (#15108 ) The default setting of 'maxSubqueryBytes' is renamed from 'unlimited' to 'disabled'.	2023-10-10 02:03:29 +05:30
Adarsh Sanjeev	7a35ce886d	Add ability for MSQ tasks to query realtime tasks (#15024 ) This PR aims to add the capabilities to: 1. Fetch the realtime segment metadata from the coordinator server view, 2. Adds the ability for workers to query indexers, similar to how brokers do the same for native queries.	2023-10-09 15:14:03 +05:30
kaisun2000	e2cc1c4ad1	Add metric -- count of queries waiting for merge buffers (#15025 ) Add 'mergeBuffer/pendingRequests' metric that exposes the count of waiting queries (threads) blocking in the merge buffers pools.	2023-10-09 12:56:23 +05:30
Pranav	c7d0615af3	Fix the build for #15013.: Lookup jitter upstream build fix (#15103 ) Fix the build for #15013.	2023-10-09 09:35:39 +05:30
317brian	2164dafb99	docs: update unnest to use crossjoin instead of comma (#15074 )	2023-10-05 09:01:08 -07:00
Adarsh Sanjeev	7e987e3d69	Add query context parameter for segment load wait (#15076 ) Add segmentLoadWait as a query context parameter. If this is true, the controller queries the broker and waits till the segments created (if any) have been loaded by the load rules. The controller also provides this information in the live reports and task reports. If this is false, the controller exits immediately after finishing the query.	2023-10-05 18:26:34 +05:30
Pranav	f1edd671fb	Exposing optional replaceMissingValueWith in lookup function and macros (#14956 ) * Exposing optional replaceMissingValueWith in lookup function and macros * args range validation * Updating docs * Addressing comments * Update docs/querying/sql-scalar.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Update docs/querying/sql-functions.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Addressing comments --------- Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-10-02 17:09:23 -07:00
Parth Agrawal	d038237ece	memcached cache: switch to AWS elasticache-java-cluster-client and add TLS support (#14827 ) This PR updates the library used for Memcached client to AWS Elasticache Client : https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-java This enables us to use the option of encrypting data in transit: Amazon ElastiCache for Memcached now supports encryption of data in transit For clusters running the Memcached engine, ElastiCache supports Auto Discovery—the ability for client programs to automatically identify all of the nodes in a cache cluster, and to initiate and maintain connections to all of these nodes. Benefits of Auto Discovery - Amazon ElastiCache AWS has forked spymemcached 2.12.1, and has since added all the patches included in 2.12.2 and 2.12.3 as part of the 1.2.0 release. So, this can now be considered as an equivalent drop-in replacement. GitHub - awslabs/aws-elasticache-cluster-client-memcached-for-java: Amazon ElastiCache Cluster Client for Java - enhanced library to connect to ElastiCache clusters. https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticache/AmazonElastiCacheClient.html#AmazonElastiCacheClient-- How to enable TLS with Elasticache On server side: https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/in-transit-encryption-mc.html#in-transit-encryption-enable-existing-mc On client side: GitHub - awslabs/aws-elasticache-cluster-client-memcached-for-java: Amazon ElastiCache Cluster Client for Java - enhanced library to connect to ElastiCache clusters.	2023-10-02 12:51:05 -07:00
Karan Kumar	2f1bcd6717	Adding `"segment/scan/active"` metric for processing thread pool. (#15060 )	2023-09-29 12:34:28 -07:00
Soumyava	75af741a96	Revert "SQL: Plan non-equijoin conditions as cross join followed by filter. (#14978 )" (#15029 ) This reverts commit `4f498e6469`.	2023-09-25 11:35:44 -07:00
Gian Merlino	823f620ede	Add IS [NOT] DISTINCT FROM to SQL and join matchers. (#14976 ) * Add IS [NOT] DISTINCT FROM to SQL and join matchers. Changes: 1) Add "isdistinctfrom" and "notdistinctfrom" native expressions. 2) Add "IS [NOT] DISTINCT FROM" to SQL. It uses the new native expressions when generating expressions, and is treated the same as equals and not-equals when generating native filters on literals. 3) Update join matchers to have an "includeNull" parameter that determines whether we are operating in "equals" mode or "is not distinct from" mode. * Main changes: - Add ARRAY handling to "notdistinctfrom" and "isdistinctfrom". - Include null in pushed-down filters when using "notdistinctfrom" in a join. Other changes: - Adjust join filter analyzer to more explicitly use InDimFilter's ValuesSets, relying less on remembering to get it right to avoid copies. * Remove unused "wrap" method. * Fixes. * Remove methods we do not need. * Fix bug with INPUT_REF.	2023-09-20 10:44:32 -07:00
Gian Merlino	4f498e6469	SQL: Plan non-equijoin conditions as cross join followed by filter. (#14978 ) * SQL: Plan non-equijoin conditions as cross join followed by filter. Druid has previously refused to execute joins with non-equality-based conditions. This was well-intentioned: the idea was to push people to write their queries in a different, hopefully more performant way. But as we're moving towards fuller SQL support, it makes more sense to allow these conditions to go through with the best plan we can come up with: a cross join followed by a filter. In some cases this will allow the query to run, and people will be happy with that. In other cases, it will run into resource limits during execution. But we should at least give the query a chance. This patch also updates the documentation to explain how people can tell whether their queries are being planned this way. * cartesian is a word. * Adjust tests. * Update docs/querying/datasource.md Co-authored-by: Benedict Jin <asdf2014@apache.org> --------- Co-authored-by: Benedict Jin <asdf2014@apache.org>	2023-09-19 10:23:42 -07:00
George Shiqi Wu	f773d83914	Mixed task runner for migration to mm-less ingestion (#14918 ) * save work * Working * Fix runner constructor * Working runner * extra log lines * try using lifecycle for everything * clean up configs * cleanup /workers call * Use a single config * Allow selecting runner * debug changes * Work on composite task runner * Unit tests running * Add documentation * Add some javadocs * Fix spelling * Use standard libraries * code review * fix * fix * use taskRunner as string * checkstyl --------- Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-09-11 18:09:46 -07:00
317brian	3a453f7a3c	docs: add note about transparent_reconnection (#14953 ) * add note about transparent_reconnection * Update docs/api-reference/sql-jdbc.md	2023-09-11 11:58:39 -07:00
317brian	09f7dfe327	docs: update docusaurus 2 stuff (#14864 )	2023-09-08 14:19:15 -07:00
Kashif Faraz	647686aee2	Add test and metrics for KillStalePendingSegments duty (#14951 ) Changes: - Add new metric `kill/pendingSegments/count` with dimension `dataSource` - Add tests for `KillStalePendingSegments` - Reduce no-op logs that spit out for each datasource even when no pending segments have been deleted. This can get particularly noisy at low values of `indexingPeriod`. - Refactor the code in `KillStalePendingSegments` for readability and add javadocs	2023-09-08 10:33:47 +05:30
Hardik Bajaj	e100b18e86	Updated documentation for OshiSysMonitor (#14912 )	2023-09-07 16:54:33 +05:30
Laksh Singla	6ee0b06e38	Auto configuration for maxSubqueryBytes (#14808 ) A new monitor SubqueryCountStatsMonitor which emits the metrics corresponding to the subqueries and their execution is now introduced. Moreover, the user can now also use the auto mode to automatically set the number of bytes available per query for the inlining of its subquery's results.	2023-09-06 05:47:19 +00:00
Adarsh Sanjeev	959148ad37	Add code to wait for segments generated to be loaded on historicals (#14322 ) Currently, after an MSQ query, the web console is responsible for waiting for the segments to load. It does so by checking if there are any segments loading into the datasource ingested into, which can cause some issues, like in cases where the segments would never be loaded, or would end up waiting for other ingests as well. This PR shifts this responsibility to the controller, which would have the list of segments created.	2023-09-06 10:35:57 +05:30
Clint Wylie	706b57c0b2	fixup array and mvd sql docs (#14928 )	2023-09-05 16:17:00 -07:00
Jill Osborne	425ebaa387	Query tips doc (#14922 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-09-05 14:16:01 -07:00
Kashif Faraz	ec630e3671	Remove deprecated coordinator dynamic configs (#14923 ) Changes: [A] Remove config `decommissioningMaxPercentOfMaxSegmentsToMove` - It is a complicated config 😅 , - It is always desirable to prioritize move from decommissioning servers so that they can be terminated quickly, so this should always be 100% - It is already handled by `smartSegmentLoading` (enabled by default) [B] Remove config `maxNonPrimaryReplicantsToLoad` This was added in #11135 to address two requirements: - Prevent coordinator runs from getting stuck assigning too many segments to historicals - Prevent load of replicas from competing with load of unavailable segments Both of these requirements are now already met thanks to: - Round-robin segment assignment - Prioritization in the new coordinator - Modifications to `replicationThrottleLimit` - `smartSegmentLoading` (enabled by default)	2023-09-04 11:54:36 +05:30
John Gerassimou	d201ea0ece	prometheus-emitter: add extraLabels parameter (#14728 ) * prometheus-emitter: add extraLabels parameter * prometheus-emitter: update readme to include the extraLabels parameter * prometheus-emitter: remove nullable and surface label name issues * remove import to make linter happy	2023-08-29 12:02:22 -07:00
benkrug	8885805bb3	Update filters.md (#14917 )	2023-08-28 15:29:00 -07:00
Kashif Faraz	d6565f46b0	Increase the computed value of replicationThrottleLimit (#14913 ) Changes - Increase value of `replicationThrottleLimit` computed by `smartSegmentLoading` from 2% to 5% of total number of used segments. - Assign replicas to a tier even when some replicas are already being loaded in that tier - Limit the total number of replicas in load queue at start of run + replica assignments in the run to the `replicationThrottleLimit`. i.e. for every tier, num loading replicas at start of run + num replicas assigned in run <= replicationThrottleLimit	2023-08-28 18:20:22 +05:30
Victoria Lim	9142f4b8d7	docs: update note in automatic compaction doc (#14908 )	2023-08-25 14:14:29 -07:00
Kashif Faraz	e51181957c	Use num cores to determine balancerComputeThreads (#14902 ) Changes: - Determine the default value of balancerComputeThreads based on number of coordinator cpus rather than number of segments. Even if the number of segments is low and we create more balancer threads, it doesn't hurt the system as threads would mostly be idle. - Remove unused field from SegmentLoadQueueManager Expected values: - Clusters with ~1M segments typically work with Coordinators having 16 cores or more. This would give us 8 balancer threads, which is the same as the current maximum. - On small clusters, even a single thread is enough to do the required balancing work.	2023-08-25 08:15:27 +05:30
Abhishek Agarwal	3c7b237c22	Add docs for ingesting Kafka topic name (#14894 ) Add documentation on how to extract the Kafka topic name and ingest it into the data.	2023-08-24 19:19:59 +05:30
Clint Wylie	36e659a501	remove group-by v1 (#14866 ) * remove group-by v1 * docs * remove unused configs, fix test * fix test * adjustments * why not * adjust * review stuff	2023-08-23 12:44:06 -07:00
zachjsh	0c76df1c7d	Enable Continuous auto kill (#14831 ) ### Description This change enables the `KillUnusedSegments` coordinator duty to be scheduled continuously. Things that prevented this, or made this difficult before were the following: 1. If scheduled at a fast enough rate, the duty would find the same intervals to kill for the same datasources, while kill tasks submitted for those same datasources and intervals were already underway, thus wasting task slots on duplicated work. 2. The task resources used by auto kill were previously unbounded. Each duty run period, if unused segments were found for any datasource, a kill task would be submitted to kill them. This PR solves for both of these issues: 1. The duty keeps track of the end time of the last interval found when killing unused segments for each datasource, in an in-memory map. The end time for each datasource, if found, is used as the start time lower bound, when searching for unused intervals for that same datasource. Each duty run, we remove any datasource keys from this map that are no longer found to match datasources in the system, or in whitelist, and also remove a datasource entry, if there is found to be no unused segments for the datasource, which happens when we fail to find an interval which includes unused segments. Removing the datasource entry from the map, allows for searching for unusedSegments in the datasource from the beginning of time once again. 2. The unbounded task resource usage can be mitigated with coordinator dynamic config added as part of `ba957a9b97`. Operators can configure continuous auto kill by providing coordinator runtime properties similar to the following: ``` druid.coordinator.period.indexingPeriod=PT60S druid.coordinator.kill.period=PT60S ``` And providing sensible limits to the killTask usage via coordinator dynamic properties.	2023-08-23 09:23:08 -04:00
Adarsh Sanjeev	dfb5a98888	Add coordinator API for unused segments (#14846 ) There is a current issue due to inconsistent metadata between worker and controller in MSQ. A controller can receive one set of segments, which are then marked as unused by, say, a compaction job. The worker would be unable to get the segment information as MetadataResource.	2023-08-23 14:51:25 +05:30
Giulio Talarico	76e5048aab	fix supervisor spec api submission commands (#14877 )	2023-08-23 14:38:09 +05:30

3181 Commits