druid

Commit Graph

Author	SHA1	Message	Date
Clint Wylie	a95c22ce70	support non-constant expressions for path arguments for json_value and json_query (#15320 ) * support dynamic expressions for path arguments for json_value and json_query	2023-11-17 01:12:05 -08:00
Atul Mohan	a2914789d7	Add support for ingesting older iceberg snapshots (#15348 ) This patch introduces a param snapshotTime in the iceberg inputsource spec that allows the user to ingest data files associated with the most recent snapshot as of the given time. This helps the user ingest data based on older snapshots by specifying the associated snapshot time. This patch also upgrades the iceberg core version to 1.4.1	2023-11-17 12:32:28 +05:30
Charles Smith	6a5da5a05e	fix redirect for api docs and misc array-related typos (#15387 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-16 13:29:19 -08:00
Karan Kumar	857b8de425	Query from deep storage doc fixes. (#15382 ) Fixing outdated query from deep storage docs.	2023-11-16 14:05:20 +05:30
Adarsh Sanjeev	a134cc30a6	Change default inSubQueryThreshold (#15336 )	2023-11-14 14:08:12 +05:30
YongGang	3a3d37ef40	Fix for segment/count Metric Not Emitting with Statsd-emitter (#15347 ) * fix segment/count metric in Statsd-emitter * update doc * Update docs/development/extensions-contrib/prometheus.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/statsd.md Co-authored-by: Suneet Saldanha <suneet@apache.org> --------- Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-11-10 08:08:58 -08:00
Charles Smith	e7d0429f5b	docs: suggest metadata store with instant ADD COLUMN semantics (#15334 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-09 12:56:30 -08:00
Pranav	e2fde8c516	Refactor lookups behavior while loading/dropping the containers (#14806 )	2023-11-07 10:07:28 -08:00
Charles Smith	0403e48266	window functions docs (#14739 ) * draft window functions * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address comments * remove default column * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * fix ntile * remove default header column * code tics to remove spelling errors * add known issues, add SUM example * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address spelling * remove extra chars * add to sidebar, fix admonition * Update sql-window-functions.md accept suggestion, change admonition style * update sidebar * Delete Untitled.ipynb rm unwanted file * Update docs/querying/sql-window-functions.md * Update docs/querying/sql-window-functions.md * update context param, accept suggestions * accept suggestions * Apply suggestions from code review * Fix known issues * require GROUP BY, explain order of operation * accept suggestions * fix spelling --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-11-06 11:34:42 -08:00
Rishabh Singh	8c802e4c9b	Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency (#14985 ) In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal. To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.	2023-11-04 19:33:25 +05:30
Tts-233	f39a778f7d	Fix 404 URL about native query (#15324 )	2023-11-03 08:39:59 -07:00
Karan Kumar	5036af6fb3	Doc fixes for query from deep storage and MSQ (#15313 ) Minor updates to the documentation. Added prerequisites. Removed a known issue in MSQ since it's no longer valid. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-03 10:52:20 +05:30
cristian-popa	fb260f3e41	docs: LDAP trust store property clarification (#15028 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-02 13:00:08 -07:00
Gian Merlino	d87d92bc43	Add system fields to input sources. (#15276 ) * Add system fields to input sources. Main changes: 1) The SystemField enum defines system fields "__file_uri", "__file_path", and "__file_bucket". They are associated with each input entity. 2) The SystemFieldInputSource interface can be added to any InputSource to make it system-field-capable. It sets up serialization of a list of configured "systemFields" in the JSON form of the input source, and provides a method getSystemFieldValue for computing the value of each system field. Cloud object, HDFS, HTTP, and Local now have this. * Fix various LocalInputSource calls. * Fix style stuff. * Fixups. * Fix tests and coverage.	2023-11-02 10:31:28 -07:00
Clint Wylie	d261587f4a	explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds (#15245 ) * better documentation for the differences between arrays and mvds * add outputType to ExpressionPostAggregator to make docs true * add output coercion if outputType is defined on ExpressionPostAgg * updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables	2023-11-02 00:31:37 -07:00
Charles Smith	de557a62ad	Suggest adoption of Google Style guide (#14905 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-11-01 13:31:03 -07:00
Charles Smith	3860052de0	remove references to Jupyter notebooks within the Druid repo (#15143 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-11-01 13:17:06 -07:00
Katya Macedo	935050bf43	docs: Dynamic config cleanup (#15265 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-11-01 11:22:33 -07:00
317brian	436ded3d78	docs: durable storage azure cleanup (#15120 ) Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2023-10-31 15:20:38 -07:00
Katya Macedo	a43ffbdf2b	[Docs] Improvements to JSON-based batch Ingestion page (#15286 )	2023-10-31 14:50:45 -07:00
317brian	87695410ac	docs: blurb about msq union all (#15223 )	2023-10-31 14:15:38 -07:00
Vishesh Garg	039b05585c	Add worker status and duration metrics in live and task reports (#15180 ) Add worker status and duration metrics in live and task reports for tracking.	2023-10-30 09:43:22 +05:30
317brian	737947754d	docs: add concurrent compaction docs (#15218 ) Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-10-27 10:29:34 -07:00
David Christle	fc0b940f78	Document the allowed range of announcer maxBytesPerNode (#15063 )	2023-10-26 14:51:01 -07:00
YongGang	7a25ee4fd9	Ability to send task types to k8s or worker task runner (#15196 ) * Ability to send task types to k8s or worker task runner * add more tests * use runnerStrategy to determine task runner * minor refine * refine runner strategy config * move workerType config to upper level * validate config when application start	2023-10-25 09:55:56 -07:00
Adarsh Sanjeev	c5fa649ea5	Rename segment load wait parameter (#15251 )	2023-10-25 18:08:37 +05:30
Karan Kumar	61ea9e07c5	Limit pages size to a configurable limit (#14994 ) Adding the ability to limit the pages sizes of select queries. We piggyback on the same machinery that is used to control the numRowsPerSegment. This patch introduces a new context parameter rowsPerPage for which the default value is set to 100000 rows. This patch also optimizes adding the last selectResults stage only when the previous stages have sorted outputs. Currently for each select query with selectDestination=durableStorage, we used to add this extra selectResults stage.	2023-10-12 14:01:46 +05:30
Clint Wylie	d0f64608eb	sql compatible three-valued logic native filters (#15058 ) * sql compatible tri-state native logical filters when druid.expressions.useStrictBooleans=true and druid.generic.useDefaultValueForNull=false, and new druid.generic.useThreeValueLogicForNativeFilters=true * log.warn if non-default configurations are used to guide operators towards SQL compliant behavior	2023-10-12 00:06:23 -07:00
317brian	265c811963	docs: remove experimental note from query from deep storage docs (#15132 )	2023-10-12 11:51:02 +05:30
Katya Macedo	10aab7506e	Dynamic configuration API documentation refactor (#15098 ) Co-authored-by: demo-kratia <56242907+demo-kratia@users.noreply.github.com>	2023-10-11 14:45:05 -07:00
317brian	263e106714	docs: remove experimental note from unnest docs (#15123 ) * docs: remove experimental note from unnest docs * remove flag needed to use unnest	2023-10-10 16:52:51 -07:00
Laksh Singla	95bf331c08	Rename the default setting of 'maxSubqueryBytes' from 'unlimited' to 'disabled' (#15108 ) The default setting of 'maxSubqueryBytes' is renamed from 'unlimited' to 'disabled'.	2023-10-10 02:03:29 +05:30
Adarsh Sanjeev	7a35ce886d	Add ability for MSQ tasks to query realtime tasks (#15024 ) This PR aims to add the capabilities to: 1. Fetch the realtime segment metadata from the coordinator server view, 2. Adds the ability for workers to query indexers, similar to how brokers do the same for native queries.	2023-10-09 15:14:03 +05:30
kaisun2000	e2cc1c4ad1	Add metric -- count of queries waiting for merge buffers (#15025 ) Add 'mergeBuffer/pendingRequests' metric that exposes the count of waiting queries (threads) blocking in the merge buffers pools.	2023-10-09 12:56:23 +05:30
Pranav	c7d0615af3	Fix the build for #15013.: Lookup jitter upstream build fix (#15103 ) Fix the build for #15013.	2023-10-09 09:35:39 +05:30
317brian	2164dafb99	docs: update unnest to use crossjoin instead of comma (#15074 )	2023-10-05 09:01:08 -07:00
Adarsh Sanjeev	7e987e3d69	Add query context parameter for segment load wait (#15076 ) Add segmentLoadWait as a query context parameter. If this is true, the controller queries the broker and waits till the segments created (if any) have been loaded by the load rules. The controller also provides this information in the live reports and task reports. If this is false, the controller exits immediately after finishing the query.	2023-10-05 18:26:34 +05:30
Pranav	f1edd671fb	Exposing optional replaceMissingValueWith in lookup function and macros (#14956 ) * Exposing optional replaceMissingValueWith in lookup function and macros * args range validation * Updating docs * Addressing comments * Update docs/querying/sql-scalar.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Update docs/querying/sql-functions.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Addressing comments --------- Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-10-02 17:09:23 -07:00
Parth Agrawal	d038237ece	memcached cache: switch to AWS elasticache-java-cluster-client and add TLS support (#14827 ) This PR updates the library used for Memcached client to AWS Elasticache Client : https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-java This enables us to use the option of encrypting data in transit: Amazon ElastiCache for Memcached now supports encryption of data in transit For clusters running the Memcached engine, ElastiCache supports Auto Discovery—the ability for client programs to automatically identify all of the nodes in a cache cluster, and to initiate and maintain connections to all of these nodes. Benefits of Auto Discovery - Amazon ElastiCache AWS has forked spymemcached 2.12.1, and has since added all the patches included in 2.12.2 and 2.12.3 as part of the 1.2.0 release. So, this can now be considered as an equivalent drop-in replacement. GitHub - awslabs/aws-elasticache-cluster-client-memcached-for-java: Amazon ElastiCache Cluster Client for Java - enhanced library to connect to ElastiCache clusters. https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticache/AmazonElastiCacheClient.html#AmazonElastiCacheClient-- How to enable TLS with Elasticache On server side: https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/in-transit-encryption-mc.html#in-transit-encryption-enable-existing-mc On client side: GitHub - awslabs/aws-elasticache-cluster-client-memcached-for-java: Amazon ElastiCache Cluster Client for Java - enhanced library to connect to ElastiCache clusters.	2023-10-02 12:51:05 -07:00
Karan Kumar	2f1bcd6717	Adding `"segment/scan/active"` metric for processing thread pool. (#15060 )	2023-09-29 12:34:28 -07:00
Soumyava	75af741a96	Revert "SQL: Plan non-equijoin conditions as cross join followed by filter. (#14978 )" (#15029 ) This reverts commit `4f498e6469`.	2023-09-25 11:35:44 -07:00
Gian Merlino	823f620ede	Add IS [NOT] DISTINCT FROM to SQL and join matchers. (#14976 ) * Add IS [NOT] DISTINCT FROM to SQL and join matchers. Changes: 1) Add "isdistinctfrom" and "notdistinctfrom" native expressions. 2) Add "IS [NOT] DISTINCT FROM" to SQL. It uses the new native expressions when generating expressions, and is treated the same as equals and not-equals when generating native filters on literals. 3) Update join matchers to have an "includeNull" parameter that determines whether we are operating in "equals" mode or "is not distinct from" mode. * Main changes: - Add ARRAY handling to "notdistinctfrom" and "isdistinctfrom". - Include null in pushed-down filters when using "notdistinctfrom" in a join. Other changes: - Adjust join filter analyzer to more explicitly use InDimFilter's ValuesSets, relying less on remembering to get it right to avoid copies. * Remove unused "wrap" method. * Fixes. * Remove methods we do not need. * Fix bug with INPUT_REF.	2023-09-20 10:44:32 -07:00
Gian Merlino	4f498e6469	SQL: Plan non-equijoin conditions as cross join followed by filter. (#14978 ) * SQL: Plan non-equijoin conditions as cross join followed by filter. Druid has previously refused to execute joins with non-equality-based conditions. This was well-intentioned: the idea was to push people to write their queries in a different, hopefully more performant way. But as we're moving towards fuller SQL support, it makes more sense to allow these conditions to go through with the best plan we can come up with: a cross join followed by a filter. In some cases this will allow the query to run, and people will be happy with that. In other cases, it will run into resource limits during execution. But we should at least give the query a chance. This patch also updates the documentation to explain how people can tell whether their queries are being planned this way. * cartesian is a word. * Adjust tests. * Update docs/querying/datasource.md Co-authored-by: Benedict Jin <asdf2014@apache.org> --------- Co-authored-by: Benedict Jin <asdf2014@apache.org>	2023-09-19 10:23:42 -07:00
George Shiqi Wu	f773d83914	Mixed task runner for migration to mm-less ingestion (#14918 ) * save work * Working * Fix runner constructor * Working runner * extra log lines * try using lifecycle for everything * clean up configs * cleanup /workers call * Use a single config * Allow selecting runner * debug changes * Work on composite task runner * Unit tests running * Add documentation * Add some javadocs * Fix spelling * Use standard libraries * code review * fix * fix * use taskRunner as string * checkstyl --------- Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-09-11 18:09:46 -07:00
317brian	3a453f7a3c	docs: add note about transparent_reconnection (#14953 ) * add note about transparent_reconnection * Update docs/api-reference/sql-jdbc.md	2023-09-11 11:58:39 -07:00
317brian	09f7dfe327	docs: update docusaurus 2 stuff (#14864 )	2023-09-08 14:19:15 -07:00
Kashif Faraz	647686aee2	Add test and metrics for KillStalePendingSegments duty (#14951 ) Changes: - Add new metric `kill/pendingSegments/count` with dimension `dataSource` - Add tests for `KillStalePendingSegments` - Reduce no-op logs that spit out for each datasource even when no pending segments have been deleted. This can get particularly noisy at low values of `indexingPeriod`. - Refactor the code in `KillStalePendingSegments` for readability and add javadocs	2023-09-08 10:33:47 +05:30
Hardik Bajaj	e100b18e86	Updated documentation for OshiSysMonitor (#14912 )	2023-09-07 16:54:33 +05:30
Laksh Singla	6ee0b06e38	Auto configuration for maxSubqueryBytes (#14808 ) A new monitor SubqueryCountStatsMonitor which emits the metrics corresponding to the subqueries and their execution is now introduced. Moreover, the user can now also use the auto mode to automatically set the number of bytes available per query for the inlining of its subquery's results.	2023-09-06 05:47:19 +00:00
Adarsh Sanjeev	959148ad37	Add code to wait for segments generated to be loaded on historicals (#14322 ) Currently, after an MSQ query, the web console is responsible for waiting for the segments to load. It does so by checking if there are any segments loading into the datasource ingested into, which can cause some issues, like in cases where the segments would never be loaded, or would end up waiting for other ingests as well. This PR shifts this responsibility to the controller, which would have the list of segments created.	2023-09-06 10:35:57 +05:30
Clint Wylie	706b57c0b2	fixup array and mvd sql docs (#14928 )	2023-09-05 16:17:00 -07:00
Jill Osborne	425ebaa387	Query tips doc (#14922 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-09-05 14:16:01 -07:00
Kashif Faraz	ec630e3671	Remove deprecated coordinator dynamic configs (#14923 ) Changes: [A] Remove config `decommissioningMaxPercentOfMaxSegmentsToMove` - It is a complicated config 😅 , - It is always desirable to prioritize move from decommissioning servers so that they can be terminated quickly, so this should always be 100% - It is already handled by `smartSegmentLoading` (enabled by default) [B] Remove config `maxNonPrimaryReplicantsToLoad` This was added in #11135 to address two requirements: - Prevent coordinator runs from getting stuck assigning too many segments to historicals - Prevent load of replicas from competing with load of unavailable segments Both of these requirements are now already met thanks to: - Round-robin segment assignment - Prioritization in the new coordinator - Modifications to `replicationThrottleLimit` - `smartSegmentLoading` (enabled by default)	2023-09-04 11:54:36 +05:30
John Gerassimou	d201ea0ece	prometheus-emitter: add extraLabels parameter (#14728 ) * prometheus-emitter: add extraLabels parameter * prometheus-emitter: update readme to include the extraLabels parameter * prometheus-emitter: remove nullable and surface label name issues * remove import to make linter happy	2023-08-29 12:02:22 -07:00
benkrug	8885805bb3	Update filters.md (#14917 )	2023-08-28 15:29:00 -07:00
Kashif Faraz	d6565f46b0	Increase the computed value of replicationThrottleLimit (#14913 ) Changes - Increase value of `replicationThrottleLimit` computed by `smartSegmentLoading` from 2% to 5% of total number of used segments. - Assign replicas to a tier even when some replicas are already being loaded in that tier - Limit the total number of replicas in load queue at start of run + replica assignments in the run to the `replicationThrottleLimit`. i.e. for every tier, num loading replicas at start of run + num replicas assigned in run <= replicationThrottleLimit	2023-08-28 18:20:22 +05:30
Victoria Lim	9142f4b8d7	docs: update note in automatic compaction doc (#14908 )	2023-08-25 14:14:29 -07:00
Kashif Faraz	e51181957c	Use num cores to determine balancerComputeThreads (#14902 ) Changes: - Determine the default value of balancerComputeThreads based on number of coordinator cpus rather than number of segments. Even if the number of segments is low and we create more balancer threads, it doesn't hurt the system as threads would mostly be idle. - Remove unused field from SegmentLoadQueueManager Expected values: - Clusters with ~1M segments typically work with Coordinators having 16 cores or more. This would give us 8 balancer threads, which is the same as the current maximum. - On small clusters, even a single thread is enough to do the required balancing work.	2023-08-25 08:15:27 +05:30
Abhishek Agarwal	3c7b237c22	Add docs for ingesting Kafka topic name (#14894 ) Add documentation on how to extract the Kafka topic name and ingest it into the data.	2023-08-24 19:19:59 +05:30
Clint Wylie	36e659a501	remove group-by v1 (#14866 ) * remove group-by v1 * docs * remove unused configs, fix test * fix test * adjustments * why not * adjust * review stuff	2023-08-23 12:44:06 -07:00
zachjsh	0c76df1c7d	Enable Continuous auto kill (#14831 ) ### Description This change enables the `KillUnusedSegments` coordinator duty to be scheduled continuously. Things that prevented this, or made this difficult before were the following: 1. If scheduled at fast enough rate, the duty would find the same intervals to kill for the same datasources, while kill tasks submitted for those same datasources and intervals were already underway, thus wasting task slots on duplicated work. 2. The task resources used by auto kill were previously unbounded. Each duty run period, if unused segments were found for any datasource, a kill task would be submitted to kill them. This PR solves for both of these issues: 1. The duty keeps track of the end time of the last interval found when killing unused segments for each datasource, in an in-memory map. The end time for each datasource, if found, is used as the start time lower bound, when searching for unused intervals for that same datasource. Each duty run, we remove any datasource keys from this map that are no longer found to match datasources in the system, or in whitelist, and also remove a datasource entry, if there is found to be no unused segments for the datasource, which happens when we fail to find an interval which includes unused segments. Removing the datasource entry from the map, allows for searching for unusedSegments in the datasource from the beginning of time once again 2. The unbounded task resource usage can be mitigated with coordinator dynamic config added as part of `ba957a9b97` Operators can configure continuous auto kill by providing coordinator runtime properties similar to the following: ``` druid.coordinator.period.indexingPeriod=PT60S druid.coordinator.kill.period=PT60S ``` And providing sensible limits to the killTask usage via coordinator dynamic properties.	2023-08-23 09:23:08 -04:00
Adarsh Sanjeev	dfb5a98888	Add coordinator API for unused segments (#14846 ) There is a current issue due to inconsistent metadata between worker and controller in MSQ. A controller can receive one set of segments, which are then marked as unused by, say, a compaction job. The worker would be unable to get the segment information as MetadataResource.	2023-08-23 14:51:25 +05:30
Giulio Talarico	76e5048aab	fix supervisor spec api submission commands (#14877 )	2023-08-23 14:38:09 +05:30
Zoltan Haindrich	e806d09309	Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (#14848 )	2023-08-22 22:50:19 -07:00
Clint Wylie	fb053c399c	consolidate json and auto indexers, remove v4 nested column serializer (#14456 )	2023-08-22 18:50:11 -07:00
Soumyava	6817de9376	Doc changes for avatica transparent reconnection (#14896 )	2023-08-22 11:58:17 -07:00
Clint Wylie	194a9c9abc	set druid.expressions.useStrictBooleans to true by default (#14734 )	2023-08-22 00:19:56 -07:00
Benedict Jin	18f7cb6926	Fixed broken URL of python api tutorial (#14881 )	2023-08-22 09:53:41 +05:30
Clint Wylie	5d1412949e	enable sql compatible null handling mode by default (#14792 ) * enable sql compatible null handling mode by default * fix bug with string first/last aggs when druid.generic.useDefaultValueForNull=false	2023-08-21 20:07:13 -07:00
Katya Macedo	5f74ef56f1	Clean up Kafka supervisor topic (#14651 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-08-21 11:55:38 -07:00
Nhi Pham	9fe7c01c16	Automatic compaction API documentation refactor (#14740 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-08-21 11:34:41 -07:00
Kashif Faraz	097b645005	Clean up after add kill bufferPeriod (#14868 ) Follow up changes to #12599 Changes: - Rename column `used_flag_last_updated` to `used_status_last_updated` - Remove new CLI tool `UpdateTables`. - We already have a `CreateTables` with similar functionality, which should be able to handle update cases too. - Any user running the cluster for the first time should either just have `connector.createTables` enabled or run `CreateTables` which should create tables at the latest version. - For instance, the `UpdateTables` tool would be inadequate when a new metadata table has been added to Druid, and users would have to run `CreateTables` anyway. - Remove `upgrade-prep.md` and include that info in `metadata-init.md`. - Fix log messages to adhere to Druid style - Use lambdas	2023-08-19 00:00:04 +05:30
Lucas Capistrant	9c124f2cde	Add a configurable bufferPeriod between when a segment is marked unused and deleted by KillUnusedSegments duty (#12599 ) * Add new configurable buffer period to create gap between mark unused and kill of segment * Changes after testing * fixes and improvements * changes after initial self review * self review changes * update sql statement that was lacking last_used * shore up some code in SqlMetadataConnector after self review * fix derby compatibility and improve testing/docs * fix checkstyle violations * Fixes post merge with master * add some unit tests to improve coverage * ignore test coverage on new UpdateTools cli tool * another attempt to ignore UpdateTables in coverage check * change column name to used_flag_last_updated * fix a method signature after column name switch * update docs spelling * Update spelling dictionary * Fixing up docs/spelling and integrating altering tasks table with my alteration code * Update NULL values for used_flag_last_updated in the background * Remove logic to allow segs with null used_flag_last_updated to be killed regardless of bufferPeriod * remove unneeded things now that the new column is automatically updated * Test new background row updater method * fix broken tests * fix create table statement * cleanup DDL formatting * Revert adding columns to entry table by default * fix compilation issues after merge with master * discovered and fixed metastore inserts that were breaking integration tests * fixup forgotten insert by using pattern of sharing now timestamp across columns * fix issue introduced by merge * fixup after merge with master * add some directions to docs in the case of segment table validation issues	2023-08-17 19:32:51 -05:00
Abhishek Radhakrishnan	37db5d9b81	Reset offsets supervisor API (#14772 ) * Add supervisor /resetOffsets API. - Add a new endpoint /druid/indexer/v1/supervisor/<supervisorId>/resetOffsets which accepts DataSourceMetadata as a body parameter. - Update logs, unit tests and docs. * Add a new interface method for backwards compatibility. * Rename * Adjust tests and javadocs. * Use CoreInjectorBuilder instead of deprecated makeInjectorWithModules * UT fix * Doc updates. * remove extraneous debugging logs. * Remove the boolean setting; only ResetHandle() and resetInternal() * Relax constraints and add a new ResetOffsetsNotice; cleanup old logic. * A separate ResetOffsetsNotice and some cleanup. * Minor cleanup * Add a check & test to verify that sequence numbers are only of type SeekableStreamEndSequenceNumbers * Add unit tests for the no op implementations for test coverage * CodeQL fix * checkstyle from merge conflict * Doc changes * DOCUSAURUS code tabs fix. Thanks, Brian!	2023-08-17 14:13:10 -07:00
Abhishek Agarwal	b97cc45d81	Add clarification to the docs for multi-topic Kafka ingestion (#14847 ) Follow-up to #14828. Added some more clarification about how topicPattern is used.	2023-08-17 12:52:06 +05:30
317brian	6b4dda964d	Docusaurus2 upgrade for master (#14411 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-08-16 19:01:21 -07:00
YongGang	3954685aae	Report more metrics to monitor K8s task runner (#14771 ) * Report pod running metrics to monitor K8s task runner * refine method definition * fix checkstyle * implement task metrics * more comment * address comments * update doc for the new metrics reported * fix checkstyle * refine method definition * minor refine	2023-08-16 14:03:53 -04:00
Abhishek Agarwal	7911a04064	Refactoring of multi-topic kafka ingestion docs (#14828 ) In this PR, I have gotten rid of multiTopic parameter and instead added a topicPattern parameter. Kafka supervisor will pass topicPattern or topic as the stream name to the core ingestion engine. There is validation to ensure that only one of topic or topicPattern will be set. This new setting is easier to understand than overloading the topic field that earlier could be interpreted differently depending on the value of some other field.	2023-08-16 18:00:11 +05:30
Nhi Pham	8fa78594ea	Druid SQL API documentation refactor (#14711 )	2023-08-15 13:45:25 -07:00
Nhi Pham	a38579ab3c	Retention rules API documentation refactor (#14623 )	2023-08-15 13:44:44 -07:00
Abhishek Agarwal	30b5dd4ca7	Add support to read from multiple kafka topics in same supervisor (#14424 ) This PR adds support to read from multiple Kafka topics in the same supervisor. A multi-topic ingestion can be useful in scenarios where a cluster admin has no control over input streams. Different teams in an org may create different input topics that they can write the data to. However, the cluster admin wants all this data to be queryable in one data source.	2023-08-14 22:24:49 +05:30
Soumyava	afe22907a5	Calcite upgrade 1.35 (#14510 ) * Update to Calcite 1.35.0 * Update from.ftl for Calcite 1.35.0. * Fixed tests in Calcite upgrade by doing the following: 1. Added a new rule, CoreRules.PROJECT_FILTER_TRANSPOSE_WHOLE_PROJECT_EXPRESSIONS, to Base rules 2. Refactored the CorrelateUnnestRule 3. Updated CorrelateUnnestRel accordingly 4. Fixed a case with selector filters on the left where Calcite was eliding the virtual column 5. Additional test cases for fixes in 2,3,4 6. Update to StringListAggregator to fail a query if separators are not propagated appropriately * Refactored for testcases to pass after the upgrade, introduced 2 new data sources for handling filters and select projects * Added a literalSqlAggregator as the upgraded Calcite involved changes to subquery remove rule. This corrected plans for 2 queries with joins and subqueries by replacing a useless literal dimension with a post agg. Additionally a test with COUNT DISTINCT and FILTER which was failing with Calcite 1.21 is added here which passes with 1.35 * Updated to latest avatica and updated code as SqlUnknownTimeStamp is now used in Calcite which needs to be resolved to a timestamp literal * Added a wrapper segment ref to use for unnest and filter segment reference	2023-08-11 12:47:16 -07:00
zachjsh	82d82dfbd6	Add stats to KillUnusedSegments coordinator duty (#14782 ) ### Description Added the following metrics, which are calculated from the `KillUnusedSegments` coordinator duty: `"killTask/availableSlot/count"`: calculates the number of remaining task slots available for auto kill `"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill `"killTask/task/count"`: calculates the number of tasks submitted by auto kill. #### Release note NEW: metrics added for auto kill `"killTask/availableSlot/count"`: calculates the number of remaining task slots available for auto kill `"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill `"killTask/task/count"`: calculates the number of tasks submitted by auto kill.	2023-08-10 18:36:53 -04:00
Laksh Singla	8f102f9031	Introduce StorageConnector for Azure (#14660 ) The Azure connector is introduced and MSQ's fault tolerance and durable storage can now be used with Microsoft Azure's blob storage. Also, the results of newly introduced queries from deep storage can now store and fetch the results from Azure's blob storage.	2023-08-09 12:25:27 +00:00
Tejaswini Bandlamudi	a45b25fa1d	Removes support for Hadoop 2 (#14763 ) Removing Hadoop 2 support as discussed in https://lists.apache.org/list?dev@druid.apache.org:lte=1M:hadoop	2023-08-09 17:47:52 +05:30
Clint Wylie	e57f880020	document new filters and stuff (#14760 )	2023-08-08 16:01:06 -07:00
Clint Wylie	667e4dab5e	document expression aggregator (#14497 )	2023-08-08 15:49:29 -07:00
317brian	8a4dabc431	docs: remove experimental from schema auto-discovery (#14759 )	2023-08-08 12:45:44 -07:00
zachjsh	660e6cfa01	Allow for task limit on kill tasks spawned by auto kill coordinator duty (#14769 ) ### Description Previously, the `KillUnusedSegments` coordinator duty, in charge of periodically deleting unused segments, could spawn an unlimited number of kill tasks for unused segments. This change adds 2 new coordinator dynamic configs that can be used to control the limit of tasks spawned by this coordinator duty `killTaskSlotRatio`: Ratio of total available task slots, including autoscaling if applicable that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is 1, which allows all available tasks to be used, which is the existing behavior `maxKillTaskSlots`: Maximum number of tasks that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is INT.MAX, which essentially allows for unbounded number of tasks, which is the existing behavior. Realize that we can effectively get away with just the one `killTaskSlotRatio`, but following similarly to the compaction config, which has similar properties; I thought it was good to have some control of the upper limit regardless of ratio provided. #### Release note NEW: `killTaskSlotRatio` and `maxKillTaskSlots` coordinator dynamic config properties added that allow control of task resource usage spawned by `KillUnusedSegments` coordinator task (auto kill)	2023-08-08 08:40:55 -04:00
Suneet Saldanha	2af0ab2425	Metric to report time spent fetching and analyzing segments (#14752 ) * Metric to report time spent fetching and analyzing segments * fix test * spell check * fix tests * checkstyle * remove unused variable * Update docs/operations/metrics.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/metrics.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/metrics.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-08-07 18:32:48 -07:00
Abhishek Radhakrishnan	bff8f9e12e	Update kinesis docs (#14768 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-08-07 17:08:34 -07:00
Victoria Lim	7d7813372a	Docs: Include EARLIEST_BY and LATEST_BY as supported aggregation functions (#14280 )	2023-08-07 09:59:12 -07:00
Kashif Faraz	2d8e0f28f3	Refactor: Cleanup coordinator duties for metadata cleanup (#14631 ) Changes - Add abstract class `MetadataCleanupDuty` - Make `KillAuditLogs`, `KillCompactionConfig`, etc extend `MetadataCleanupDuty` - Improve log and error messages - Cleanup tests - No functional change	2023-08-05 13:08:23 +05:30
Suneet Saldanha	62ddeaf16f	Additional dimensions for service/heartbeat (#14743 ) * Additional dimensions for service/heartbeat * docs * review * review	2023-08-04 11:01:07 -07:00
Suneet Saldanha	590734b5eb	Update tutorial-kafka.md (#14749 )	2023-08-04 10:56:33 -07:00
Laksh Singla	d6c73ca6e5	Cleanup the documentation for deep storage	2023-08-04 10:20:01 +00:00
317brian	3b5b6c6a41	docs: query from deep storage (#14609 ) * cold tier wip * wip * copyedits * wip * copyedits * copyedits * wip * wip * update rules page * typo * typo * update sidebar * moves durable storage info to its own page in operations * update screenshots * add apache license * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * add query from deep storage tutorial stub * address some of the feedback * revert screenshot update. handled in separate pr * load rule update * wip tutorial * reformat deep storage endpoints * rest of tutorial * typo * cleanup * screenshot and sidebar for tutorial * add license * typos * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * rest of review comments * clarify where results are stored * update api reference for durablestorage context param * Apply suggestions from code review Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * comments * incorporate #14720 * address rest of comments * missed one * Update docs/api-reference/sql-api.md * Update docs/api-reference/sql-api.md --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: demo-kratia <56242907+demo-kratia@users.noreply.github.com> Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-08-04 11:10:08 +05:30
zachjsh	ba957a9b97	Add ability to limit the number of segments killed in kill task (#14662 ) ### Description Previously, the `maxSegments` configured for auto kill could be ignored if an interval of data for a given datasource had more than this number of unused segments, causing the kill task spawned with the task of deleting unused segments in that given interval of data to delete more than the `maxSegments` configured. Now each kill task spawned by the auto kill coordinator duty, will kill at most `limit` segments. This is done by adding a new config property to the `KillUnusedSegmentTask` which allows users to specify this limit.	2023-08-03 22:17:04 -04:00
George Shiqi Wu	174053f4fd	Add readme for kubernetes-overlord-extensions and update docs (#14674 ) * Add readme for kubernetes task scheduler * clean up unneeded stuff * Update extensions-contrib/kubernetes-overlord-extensions/README.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * Move documentation into main page * indentation * cleanup spellcheck errors * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update extensions-contrib/kubernetes-overlord-extensions/README.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * PR comments * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> --------- Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-08-01 13:29:44 -07:00
Kashif Faraz	10328c0743	Rename metadatacache and serverview metrics (#14716 )	2023-08-01 14:18:20 +05:30
Gian Merlino	5387f1bac0	Remove chatAsync parameter, so chat is always async. (#14692 ) * Remove chatAsync parameter, so chat is always async. chatAsync has been made default in Druid 26. I have seen good battle-testing of it in production, and am comfortable removing the older sync client. This was the last remaining usage of IndexTaskClient, so this patch deletes all that stuff too. * Remove unthrown exception. * Remove unthrown exception. * No more TimeoutException.	2023-07-31 19:42:51 -07:00
Jason Koch	44d5c1a15f	split KillUnusedSegmentsTask to processing in smaller chunks (#14642 ) split KillUnusedSegmentsTask to smaller batches Processing in smaller chunks allows the task execution to yield the TaskLockbox lock, which allows the overlord to continue being responsive to other tasks and users while this particular kill task is executing. * introduce KillUnusedSegmentsTask batchSize parameter to control size of batching * provide an explanation for kill task batchSize parameter * add logging details for kill batch progress	2023-07-31 12:56:27 -07:00
Nhi Pham	53733d2542	JSON-querying API documentation refactor (#14589 ) Co-authored-by: Jill Osborne <jill.osborne@imply.io> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-07-28 10:55:17 -07:00
Nhi Pham	ee9cfc7e18	Tasks API documentation spacing update (#14633 )	2023-07-27 14:24:55 -07:00
Nhi Pham	482def788f	Supervisor API documentation refactor (#14579 )	2023-07-27 12:58:37 -07:00
Katya Macedo	0b9e4af443	Clean up some of the descriptions (#14661 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-07-27 12:11:58 -07:00
Nhi Pham	dd204e596d	Refresh the OS Druid web console screenshots (#14397 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-07-26 16:32:03 -07:00
slfan1989	d69edb7723	Docs: Fix some typos. (#14663 ) --------- Co-authored-by: slfan1989 <louj1988@@>	2023-07-26 21:24:18 +05:30
Katya Macedo	4804630c78	Clean up Kinesis doc (#14529 )	2023-07-25 19:24:36 -07:00
Nhi Pham	2dc3e94a9a	Service status API documentation refactor (#14528 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-07-25 18:42:47 -07:00
Karan Kumar	77e0c16bce	Sql statement api error messaging fixes. (#14629 ) * Error messaging fixes. * Static check fix * Review comments	2023-07-20 22:48:44 +05:30
Gian Merlino	bac5ef347c	Add ingest/input/bytes metric and Kafka consumer metrics. (#14582 ) * Add ingest/input/bytes metric and Kafka consumer metrics. New metrics: 1) ingest/input/bytes. Equivalent to processedBytes in the task reports. 2) kafka/consumer/bytesConsumed: Equivalent to the Kafka consumer metric "bytes-consumed-total". Only emitted for Kafka tasks. 3) kafka/consumer/recordsConsumed: Equivalent to the Kafka consumer metric "records-consumed-total". Only emitted for Kafka tasks. * Fix anchor. * Fix KafkaConsumerMonitor. * Interface updates. * Doc changes. * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTask.java Co-authored-by: Benedict Jin <asdf2014@apache.org> --------- Co-authored-by: Benedict Jin <asdf2014@apache.org>	2023-07-20 10:56:22 +08:00
Jaehui Lee	1f4ee5e21b	Docs: Change default value of "maxRowsInMemory" in tuningConfig (#14618 ) Reflecting fixes from https://github.com/apache/druid/pull/13939	2023-07-19 23:14:15 +05:30
Abhishek Radhakrishnan	f4d0ea7bc8	Add support for earliest `aggregatorMergeStrategy` (#14598 ) * Add EARLIEST aggregator merge strategy. - More unit tests. - Include the aggregators analysis type by default in tests. * Docs. * Some comments and a test * Collapse into individual code blocks.	2023-07-18 12:37:10 -07:00
Kashif Faraz	cab93fb817	Docs: Minor change missed in #14590 (#14604 ) Changes: - Rephrased the description of `smartSegmentLoading` - Moved detail about value computation outside of quoted block as it is important.	2023-07-18 19:58:37 +05:30
Kashif Faraz	88dc330da2	Docs: Changes for coordinator improvements done in #13197 (#14590 )	2023-07-18 14:22:00 +05:30
Atul Mohan	03d6d395a0	Extension to read and ingest iceberg data files (#14329 ) This adds a new contrib extension: druid-iceberg-extensions which can be used to ingest data stored in Apache Iceberg format. It adds a new input source of type iceberg that connects to a catalog and retrieves the data files associated with an iceberg table and provides these data file paths to either an S3 or HDFS input source depending on the warehouse location. Two important dependencies associated with Apache Iceberg tables are: Catalog : This extension supports reading from either a Hive Metastore catalog or a Local file-based catalog. Support for AWS Glue is not available yet. Warehouse : This extension supports reading data files from either HDFS or S3. Adapters for other cloud object locations should be easy to add by extending the AbstractInputSourceAdapter.	2023-07-18 08:59:57 +05:30
Abhishek Radhakrishnan	1f6507dd60	Remove the deprecated `InsertCannotOrderByDescending` MSQ fault (#14588 ) The deprecated MSQ fault, InsertCannotOrderByDescending, is removed.	2023-07-17 09:23:39 +00:00
Gian Merlino	95ca43034f	Change default handoffConditionTimeout to 15 minutes. (#14539 ) * Change default handoffConditionTimeout to 15 minutes. Most of the time, when handoff is taking this long, it's because something is preventing Historicals from loading new data. In this case, we have two choices: 1) Stop making progress on ingestion, wait for Historicals to load stuff, and keep the waiting-for-handoff segments available on realtime tasks. (handoffConditionTimeout = 0, the current default) 2) Continue making progress on ingestion, by exiting the realtime tasks that were waiting for handoff. Once the Historicals get their act together, the segments will be loaded, as they are still there on deep storage. They will just not be continuously available. (handoffConditionTimeout > 0) I believe most users would prefer [2], because [1] risks ingestion falling behind the stream, which causes many other problems. It can cause data loss if the stream ages-out data before we have a chance to ingest it. Due to the way tuningConfigs are serialized -- defaults are baked into the serialized form that is written to the database -- this default change will not change anyone's existing supervisors. It will take effect for newly created supervisors. * Fix tests. * Update docs/development/extensions-core/kafka-supervisor-reference.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/development/extensions-core/kinesis-ingestion.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-07-13 13:17:14 -07:00
Abhishek Radhakrishnan	f4ee58eaa8	Add `aggregatorMergeStrategy` property in SegmentMetadata queries (#14560 ) * Add aggregatorMergeStrategy property to SegmentMetadataQuery. - Adds a new property aggregatorMergeStrategy to segmentMetadata query. aggregatorMergeStrategy currently supports three types of merge strategies - the legacy strict and lenient strategies, and the new latest strategy. - The latest strategy considers the latest aggregator from the latest segment by time order when there's a conflict when merging aggregators from different segments. - Deprecate lenientAggregatorMerge property; The API validates that both the new and old properties are not set, and returns an exception. - When merging segments as part of segmentMetadata query, the segments have a more elaborate id -- <datasource>_<interval>_merged_<partition_number> format, similar to the name format that segments usually contain. Previously it was simply "merged". - Adjust unit tests to test the latest strategy, to assert the returned complete SegmentAnalysis object instead of just the aggregators for completeness. * Don't explicitly set strict strategy in tests * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/segmentmetadataquery.md * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-07-13 12:37:36 -04:00
Katya Macedo	12ce187ae4	Update slack text (#14578 )	2023-07-12 12:08:48 -07:00
cristian-popa	d21c54fb73	Cross reference backpressure info (#14508 ) Co-authored-by: Jill Osborne <jill.osborne@imply.io> Co-authored-by: Cristian Popa <cristian.popa@imply.io>	2023-07-12 10:02:04 -07:00
Gian Merlino	3ff51487b7	Add ZooKeeper connection state alerts and metrics. (#14333 ) * Add ZooKeeper connection state alerts and metrics. - New metric "zk/connected" is an indicator showing 1 when connected, 0 when disconnected. - New metric "zk/disconnected/time" measures time spent disconnected. - New alert when Curator connection state enters LOST or SUSPENDED. * Use right GuardedBy. * Test fixes, coverage. * Adjustment. * Fix tests. * Fix ITs. * Improved injection. * Adjust metric name, add tests.	2023-07-12 09:34:28 -07:00
hqx871	7142b0c39e	Enable result level cache for GroupByStrategyV2 on broker (#11595 ) Cache is disabled for GroupByStrategyV2 on broker since the pr #3820 [groupBy v2: Results not fully merged when caching is enabled on the broker]. But we can enable the result-level cache on broker for GroupByStrategyV2 and keep the segment-level cache disabled.	2023-07-12 15:00:01 +05:30
Nhi Pham	d76903f10b	Tasks API documentation refactor (#14492 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-07-11 13:19:39 -07:00
Abhishek Radhakrishnan	854ef98235	Minor doc fixes. (#14565 ) Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-07-11 13:12:40 -07:00
Nhi Pham	a764ed7fde	Update Jupyter notebook tutorial instructions for ARM devices (#14459 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-07-11 10:01:20 -07:00
Laksh Singla	5ce536355e	Fix planning bug while using sort merge frame processor (#14450 ) sqlJoinAlgorithm is now a hint to the planner to execute the join in the specified manner. The planner can decide to ignore the hint if it deduces that the specified algorithm can be detrimental to the performance of the join beforehand.	2023-07-11 09:58:44 +00:00
Gian Merlino	63ee69b4e8	Claim full support for Java 17. (#14384 ) * Claim full support for Java 17. No production code has changed, except the startup scripts. Changes: 1) Allow Java 17 without DRUID_SKIP_JAVA_CHECK. 2) Include the full list of opens and exports on both Java 11 and 17. 3) Document that Java 17 is both supported and preferred. 4) Switch some tests from Java 11 to 17 to get better coverage on the preferred version. * Doc update. * Update errorprone. * Update docker_build_containers.sh. * Update errorprone in licenses.yaml. * Add some more run-javas. * Additional run-javas. * Update errorprone. * Suppress new errorprone error. * Add exports and opens in ForkingTaskRunner for Java 11+. Test, doc changes. * Additional errorprone updates. * Update for errorprone. * Restore old formatting in LdapCredentialsValidator. * Copy bin/ too. * Fix Java 15, 17 build line in docker_build_containers.sh. * Update busybox image. * One more java command. * Fix interpolation. * IT commandline refinements. * Switch to busybox 1.34.1-glibc. * POM adjustments, build and test one IT on 17. * Additional debugging. * Fix silly thing. * Adjust command line. * Add exports and opens one more place. * Additional harmonization of strong encapsulation parameters.	2023-07-07 12:52:35 -07:00
Katya Macedo	5f94a2a9c2	Add link to Slack channel (#14553 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-07-07 10:09:15 -07:00
Kashif Faraz	87bb1b9709	Fix bug during initialization of HttpServerInventoryView (#14517 ) If a server is removed during `HttpServerInventoryView.serverInventoryInitialized`, the initialization gets stuck as this server is never synced. The method eventually times out (default 250s). Fix: Mark a server as stopped if it is removed. `serverInventoryInitialized` only waits for non-stopped servers to sync. Other changes: - Add new metrics for better debugging of slow broker/coordinator startup - `segment/serverview/sync/healthy`: whether the server view is syncing properly with a server - `segment/serverview/sync/unstableTime`: time for which sync with a server has been unstable - Clean up logging in `HttpServerInventoryView` and `ChangeRequestHttpSyncer` - Minor refactor for readability - Add utility class `Stopwatch` - Add tests and stubs	2023-07-06 13:04:53 +05:30
Kashif Faraz	a6547febaf	Remove unused coordinator dynamic configs (#14524 ) After #13197 , several coordinator configs are now redundant as they are not being used anymore, neither with `smartSegmentLoading` nor otherwise. Changes: - Remove dynamic configs `emitBalancingStats`: balancer error stats are always emitted, debug stats can be logged by using `debugDimensions` - `useBatchedSegmentSampler`, `percentOfSegmentsToConsiderPerMove`: batched segment sampling is always used - Add test to verify deserialization with unknown properties - Update `CoordinatorRunStats` to always track stats, this can be optimized later.	2023-07-06 12:11:10 +05:30
Victoria Lim	50b7e5d20e	docs: fix links (#14504 )	2023-07-05 12:29:47 -07:00
Jakub Matyszewski	cc159f4317	docs: k8s-jobs role needs batch apigroup (#14343 )	2023-07-04 14:34:20 +05:30
Clint Wylie	277aaa5c57	remove druid.processing.columnCache.sizeBytes and CachingIndexed, combine string column implementations (#14500 ) * combine string column implementations changes: * generic indexed, front-coded, and auto string columns now all share the same column and index supplier implementations * remove CachingIndexed implementation, which I think is largely no longer needed by the switch of many things to directly using ByteBuffer, avoiding the cost of creating Strings * remove ColumnConfig.columnCacheSizeBytes since CachingIndexed was the only user	2023-07-02 19:37:15 -07:00
Gian Merlino	e10e35aa2c	Add REGEXP_REPLACE function. (#14460 ) * Add REGEXP_REPLACE function. Replaces all instances of a pattern with a replacement string. * Fixes. * Improve test coverage. * Adjust behavior.	2023-06-29 13:47:57 -07:00
Adarsh Sanjeev	0335aaa279	Add query results directory and prevent the auto cleaner from cleaning it (#14446 ) Adds support for automatic cleaning of a "query-results" directory in durable storage. This directory will be cleaned up only if the task id is not known to the overlord. This will allow the storage of query results after the task has finished running.	2023-06-28 10:14:04 +05:30
Laksh Singla	f546cd64a9	MSQ: Ensure that the allocated segment aligns with the requested granularity (#14475 ) Changes: - Throw an `InsertCannotAllocateSegmentFault` if the allocated segment is not aligned with the requested granularity. - Tests to verify new behaviour	2023-06-27 09:25:32 +05:30
Abhishek Radhakrishnan	79bff4bbf7	Improvements to `EXPLAIN PLAN` attributes (#14441 ) * Updates: use the target table directly, sanitized replace time chunks and clustered by cols. * Add DruidSqlParserUtil and tests. * minor refactor * Use SqlUtil.isLiteral * Throw ValidationException if CLUSTERED BY column descending order is specified. - Fails query planning * Some more tests. * fixup existing comment * Update comment * checkstyle fix: remove unused imports * Remove InsertCannotOrderByDescendingFault and deprecate the fault in readme. * minor naming * move deprecated field to the bottom * update docs. * add one more example. * Collapsible query and result * checkstyle fixes * Code cleanup * order by changes * conditionally set attributes only for explain queries. * Cleaner ordinal check. * Add limit test and update javadoc. * Commentary and minor adjustments. * Checkstyle fixes. * One more checkArg. * add unexpected kind to exception.	2023-06-26 23:01:11 -04:00
Nhi Pham	579b93f282	API reference refactor (#14372 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-06-26 15:48:54 -07:00
Katya Macedo	fc08617e9e	[Docs] Clean up druid.processing.intermediaryData.storage.type description (#14431 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-06-26 11:46:54 -07:00
YongGang	b7434be99e	Add ServiceStatusMonitor to monitor service health (#14443 ) * Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status * make the monitor more general * resolve conflict * use Supplier pattern to provide metrics * reformat code and doc * move service specific tag to dimension * minor refine * update doc * reformat code * address comments * remove declared exception * bind HeartbeatSupplier conditionally in Coordinator	2023-06-26 10:26:37 -07:00
Laksh Singla	114380749d	MSQ: Improve the parse exception errors and the handling of null UTF characters in Strings in Frames (#14398 )	2023-06-26 18:14:29 +05:30
Laksh Singla	1647d5f4a0	Limit the subquery results by memory usage (#13952 ) Users can now add a guardrail to prevent a subquery's results from exceeding a set number of bytes by setting druid.server.http.maxSubqueryBytes in the Broker's config or maxSubqueryBytes in the query context. This feature is experimental for now and falls back to row-based limiting if it cannot determine the accurate size of the results consumed by the query.	2023-06-26 18:12:28 +05:30
Gian Merlino	1d6c9657ec	Clarify compaction docs. (#14225 ) * Clarify compaction docs. The prior wording made it sound like segmentGranularity, queryGranularity, and rollup are always required for granularitySpec. They are not required, but they are strongly recommended. The adjusted wording hopefully does a better job of making that clear. * Fix link. * Wording adjustments.	2023-06-23 15:24:15 -07:00
Rishabh Singh	155fde33ff	Add metrics to SegmentMetadataCache refresh (#14453 ) New metrics: - `segment/metadatacache/refresh/time`: time taken to refresh segments per datasource - `segment/metadatacache/refresh/count`: number of segments being refreshed per datasource	2023-06-23 16:51:08 +05:30
Adarsh Sanjeev	90b8f850a5	Allow empty tiered replicants map for load rules (#14432 ) Changes: - Add property `useDefaultTierForNull` for all load rules. This property determines the default value of `tieredReplicants` if it is not specified. When true, the default is `_default_tier => 2 replicas`. When false, the default is empty, i.e. no replicas on any tier. - Fix validation to allow empty replicants map, so that the segment is used but not loaded anywhere.	2023-06-22 14:44:06 +05:30
Adarsh Sanjeev	128133fadc	Add replication_factor column to sys.segments table (#14403 ) Description: Druid allows a configuration of load rules that may cause a used segment to not be loaded on any historical. This status is not tracked in the sys.segments table on the broker, which makes it difficult to determine if the unavailability of a segment is expected and if we should not wait for it to be loaded on a server after ingestion has finished. Changes: - Track replication factor in `SegmentReplicantLookup` during evaluation of load rules - Update API `/druid/coordinator/v1/metadata/segments` to return replication factor - Add column `replication_factor` to the sys.segments virtual table and populate it in `MetadataSegmentView` - If this column is 0, the segment is not assigned to any historical and will not be loaded.	2023-06-18 10:02:21 +05:30
Abhishek Radhakrishnan	04fb75719e	Fail query planning if a `CLUSTERED BY` column contains descending order (#14436 ) * Throw ValidationException if CLUSTERED BY column descending order is specified. - Fails query planning * Some more tests. * fixup existing comment * Update comment * checkstyle fix: remove unused imports * Remove InsertCannotOrderByDescendingFault and deprecate the fault in readme. * move deprecated field to the bottom	2023-06-16 18:10:12 -04:00
George Shiqi Wu	64af9bfe5b	Add groupId to metrics (#14402 ) * Add group id as a dimension * Revert changes * Add to forking task runner * Add missing metrics * Fix indenting * revert metrics * Fix indentation	2023-06-16 09:28:16 -07:00
Maytas Monsereenusorn	5d76d0ea74	Fix segment/deleted/count metric not being emitted (#14433 ) * Fix segment/deleted/count metric * Fix segment/deleted/count metric * Fix segment/deleted/count metric	2023-06-15 14:08:19 -07:00
Laksh Singla	4935f2470a	Limit results generated by SELECT queries in MSQ (#14370 ) * Limit select results in MSQ * reduce number of files in test * add truncated flag * avoid materializing select results to list, use iterable instead * javadocs	2023-06-15 13:13:11 +05:30
Abhishek Radhakrishnan	b8495d45a1	Expose Druid functions in `INFORMATION_SCHEMA.ROUTINES` table. (#14378 ) * Add INFORMATION_SCHEMA.ROUTINES to expose Druid operators and functions. * checkstyle * remove IS_DETERMINISTIC. * test * cleanup test * remove logs and simplify * fixup unit test * Add docs for INFORMATION_SCHEMA.ROUTINES table. * Update test and add another SQL query. * add stuff to .spelling and checkstyle fix. * Add more tests for custom operators. * checkstyle and comment. * Some naming cleanup. * Add FUNCTION_ID * The different Calcite function syntax enums get translated to FUNCTION * Update docs. * Cleanup markdown table. * fixup test. * fixup intellij inspection * Review comment: nullable column; add a function to determine function syntax. * More tests; add non-function syntax operators. * More unit tests. Also add a separate test for DruidOperatorTable. * actually just validate non-zero count. * switch up the order * checkstyle fixes.	2023-06-13 15:44:04 -04:00
Abhishek Radhakrishnan	1c76ebad3b	Minor doc updates. (#14409 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-06-12 15:24:48 -07:00
Abhishek Radhakrishnan	326f2c5020	Add more statement attributes to explain plan result. (#14391 ) This PR adds the following to the ATTRIBUTES column in the explain plan output: - partitionedBy - clusteredBy - replaceTimeChunks This PR leverages the work done in #14074, which added a new column ATTRIBUTES to encapsulate all the statement-related attributes.	2023-06-12 19:18:02 +05:30
Abhishek Radhakrishnan	31c386ee1b	Fixup typo and java code snippets in JDBC docs. (#14399 )	2023-06-09 12:39:21 -07:00
317brian	ff577a69a5	doc: escape tags in markdown in preparation for docusaurus2 (#14379 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-06-08 11:26:18 -07:00
Gian Merlino	6370769cbf	Fix documentation for druid.query.scheduler.numThreads. (#14381 ) * Fix documentation for druid.query.scheduler.numThreads.	2023-06-07 14:48:08 +05:30
317brian	49c056af17	docs: add basic contributor guide for docs (#14365 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-06-05 10:53:17 -07:00
Katya Macedo	7fd215b2e7	Document storeCompactionState (#14354 )	2023-06-02 11:09:04 -07:00
Harini Rajendran	4ff6026d30	Adding SegmentMetadataEvent and publishing them via KafkaEmitter (#14281 ) In this PR, we are enhancing KafkaEmitter, to emit metadata about published segments (SegmentMetadataEvent) into a Kafka topic. This segment metadata information that gets published into Kafka, can be used by any other downstream services to query Druid intelligently based on the segments published. The segment metadata gets published into kafka topic in json string format similar to other events.	2023-06-02 21:28:26 +05:30
Andreas Maechler	55effd92cf	Docs: Typo and language cleanup in Kinesis ingestion docs (#14356 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-06-02 08:18:41 +05:30
317brian	70952c0977	docs: add sql array functions to nav (#14361 ) * docs: add sql array functions to nav * fix typo * add sql array functions to list * fix spelling errors	2023-06-01 16:45:27 -07:00
zachjsh	e75fb8e8e3	Account for data format and compression in MSQ auto taskAssignment (#14307 ) ### Description This change allows for consideration of the input format and compression when computing how to split the input files among available tasks, in MSQ ingestion, when considering the value of the `maxInputBytesPerWorker` query context parameter. This query parameter allows users to control the maximum number of bytes, with granularity of input file / object, that ingestion tasks will be assigned to ingest. With this change, this context parameter now denotes the estimated weighted size in bytes of the input to split on, with consideration for input format and compression format, rather than the actual file size, reported by the file system. We assume uncompressed newline delimited json as a baseline, with scaling factor of `1`. This means that when computing the byte weight that a file has towards the input splitting, we take the file size as is, if uncompressed json, 1:1. It was found during testing that gzip compressed json, and parquet, has scale factors of `4` and `8` respectively, meaning that each byte of data is weighted 4x and 8x respectively, when computing input splits. This weighted byte scaling is only considered for MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource at the moment. The default value of the `maxInputBytesPerWorker` query context parameter has been updated from 10 GiB, to 512 MiB	2023-06-01 12:53:49 -07:00
Abhishek Radhakrishnan	d60290e76d	Remove extraneous apostrophe in the native batch docs (#14358 )	2023-06-01 08:57:41 -07:00
Katya Macedo	2da84de87f	docs: remove the note about segments (#14161 )	2023-05-31 16:37:19 -07:00
317brian	2012a6bd8e	Docs: fix broken link to Python API jupyter notebook (#14332 )	2023-05-31 08:12:27 +05:30
Nhi Pham	70c06fc0e1	Advise against using WEEK granularity for Native Batch and MSQ (#14341 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-05-30 11:40:12 -07:00
Pramod Immaneni	1ac5544da7	Updated default value of maxTotalRows to reflect the value in the code (#14298 )	2023-05-30 14:41:06 +05:30
Kashif Faraz	8091c6a547	Update default values in CoordinatorDynamicConfig (#14269 ) The defaults of the following config values in the `CoordinatorDynamicConfig` are being updated. 1. `maxSegmentsInNodeLoadingQueue = 500` (previous = 100) 2. `replicationThrottleLimit = 500` (previous = 10) Rationale: With round-robin segment assignment now being the default assignment technique, the Coordinator can assign a large number of under-replicated/unavailable segments very quickly, without getting stuck in `RunRules` duty due to very slow strategy-based cost computations. 3. `maxSegmentsToMove = 100` (previous = 5) Rationale: A very low value (say 5) is ineffective in balancing especially if there are many segments to balance. A very large value can cause excessive moves, which has these disadvantages: - Load of moving segments competing with load of unavailable/under-replicated segments - Unnecessary network costs due to constant download and delete of segments These defaults will be revisited after #13197 is merged.	2023-05-30 08:51:33 +05:30
Clint Wylie	4096f51f0b	add configurable ColumnTypeMergePolicy to SegmentMetadataCache (#14319 ) This PR adds a new interface to control how SegmentMetadataCache chooses ColumnType when faced with differences between segments for SQL schemas which are computed, exposed as druid.sql.planner.metadataColumnTypeMergePolicy and adds a new 'least restrictive type' mode to allow choosing the type that data across all segments can best be coerced into and sets this as the default behavior. This is a behavior change around when segment driven schema migrations take effect for the SQL schema. With latestInterval, the SQL schema will be updated as soon as the first job with the new schema has published segments, while using leastRestrictive, the schema will only be updated once all segments are reindexed to the new type. The benefit of leastRestrictive is that it eliminates a bunch of type coercion errors that can happen in SQL when types are varied across segments with latestInterval because the newest type is not able to correctly represent older data, such as if the segments have a mix of ARRAY and number types, or any other combinations that lead to odd query plans.	2023-05-24 20:32:51 +05:30
Abhishek Radhakrishnan	338bdb35ea	Return `RESOURCES` in `EXPLAIN PLAN` as an ordered collection (#14323 ) * Make resources an ordered collection so it's deterministic. * test cleanup * fixup docs. * Replace deprecated ObjectNode#put() calls with ObjectNode#set().	2023-05-23 00:55:00 -05:00
Victoria Lim	6b3a6113c4	Doc: List supported values for Kafka `headerFormat` (#14316 )	2023-05-22 15:41:07 -07:00
Nhi Pham	3f6610aaf1	fixed wording in OSS query laning doc (#14324 ) Co-authored-by: Nhi Pham <nhipham@Nhi-Pham.local>	2023-05-22 11:58:17 -07:00
317brian	9faf9ecf20	docs: add line about write datasource perm for overlord api (#14114 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-05-19 14:56:24 -07:00
Katya Macedo	269137c682	Update Ingestion section (#14023 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>	2023-05-19 09:42:27 -07:00
Abhishek Radhakrishnan	7400ed3c93	Fixup data deletion tutorial docs (#14283 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-05-17 17:05:35 -07:00
317brian	ceda1e98b9	docs: add docs for schema auto-discovery (#14065 ) * wip schemaless * wip * more cleanup * update tuningconfig example * updates based on feedback from clint * remove errant comma * update dimension object to include auto * update to include string schemaless way * fix spelling errors * updates for type-aware and string-based changes * Update docs/ingestion/schema-design.md * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * update spelling file * Update docs/ingestion/schema-design.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * copyedits * fix anchor --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-05-17 01:36:02 -07:00
Adarsh Sanjeev	e8ef31fe92	Fix condition for timeout in worker task launcher (#14270 ) * Fix condition for timeout in worker task launcher	2023-05-16 08:30:00 +05:30
Victoria Lim	66d4ea014c	Docs: Tutorial for streaming ingestion using Kafka + Docker file to use with Jupyter tutorials (#13984 )	2023-05-15 15:20:52 -07:00
Peter Marshall	c4aa98953b	202304-docs-removeDF (#14132 )	2023-05-15 15:08:57 -07:00
imply-cheddar	f9861808bc	Be able to load segments on Peons (#14239 ) * Be able to load segments on Peons This change introduces a new config on WorkerConfig that indicates how many bytes of each storage location to use for storage of a task. Said config is divided up amongst the locations and slots and then used to set TaskConfig.tmpStorageBytesPerTask The Peons use their local task dir and tmpStorageBytesPerTask as their StorageLocations for the SegmentManager such that they can accept broadcast segments.	2023-05-12 16:51:00 -07:00
317brian	8bda7297e1	doc: fix unnest datasource syntax (#14272 )	2023-05-12 13:05:27 -07:00
317brian	6254658f61	docs: fix links (#14111 )	2023-05-12 09:59:16 -07:00
Kashif Faraz	47a70d03e8	Docs: Minor rephrase in indexing-service.md (#14231 ) * Fix language in indexing-service * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-05-12 08:22:02 +05:30
317brian	cc37987dff	docs: copyedits for MSQ join algos (#14012 )	2023-05-11 14:21:09 -07:00
Clint Wylie	a58cebe491	add array_to_mv function to convert arrays into mvds to assist with migration from mvds to arrays (#14236 )	2023-05-11 04:43:28 -07:00
Kashif Faraz	bd0080c4ce	Update default values in docs (#14233 )	2023-05-09 19:13:51 +05:30
Shingo Kitagawa	152e9375e2	update documentation about multiValueHandling (#14197 ) * update documentation about multiValueHandling * Update docs/ingestion/ingestion-spec.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/ingestion/ingestion-spec.md Co-authored-by: Gian Merlino <gianmerlino@gmail.com> * fix spelling --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com>	2023-05-08 16:16:54 -07:00
Abhishek Radhakrishnan	6ca3fb9b08	Remove the redundant ISO-8601 text in the readme. (#14210 )	2023-05-05 11:27:29 -07:00
zachjsh	48cde236c4	Add columnMappings to explain plan output (#14187 ) * Add columnMappings to explain plan output * * fix checkstyle * add tests * * improve test coverage * * temporarily remove unit-test need to run ITs * * depend on build * * temporarily lower unit test threshold * * add back dependency on unit-tests * * add license headers * * fix header order * * review comments * * fix intellij inspection errors * * revert code coverage change	2023-05-04 10:36:28 -07:00
Karan Kumar	6f0cdd0c3f	`TaskStartTimeoutFault` now depends on the last successful worker launch time. (#14172 ) * `TaskStartTimeoutFault` now depends on the last successful worker launch time.	2023-05-03 00:05:15 +05:30
Vadim Ogievetsky	32af570fb2	fix API doc formatting (#14167 )	2023-04-29 09:29:41 -07:00
Suneet Saldanha	84c11df980	Make LoggingEmitter more useful by using Markers (#14121 ) * Make LoggingEmitter more useful * Skip code coverage for facade classes * fix spellcheck * code review * fix dependency * logging.md * fix checkstyle * Add back jacoco version to main pom	2023-04-27 15:06:06 -07:00
Jill Osborne	d4e478c909	NVL function docs update (#14169 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-04-27 11:17:21 -07:00
TSFenwick	6c99fbea92	fix typo in s3 docs. add readme to s3 module. (#14135 ) * fix typo in s3 docs. add readme to s3 module. * Update extensions-core/s3-extensions/README.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * cleanup readme for s3 extension and link to repo markdown doc instead of web docs --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-04-26 14:03:11 -07:00
Tejaswini Bandlamudi	774073b2e7	Update Hadoop3 as default build version (#14005 ) Hadoop 2 often causes red security scans on Druid distribution because of the dependencies it brings. We want to move away from Hadoop 2 and provide Hadoop 3 distribution available. Switch druid to building with Hadoop 3 by default. Druid will still be compatible with Hadoop 2 and users can build hadoop-2 compatible distribution using hadoop2 profile.	2023-04-26 12:52:51 +05:30
Gian Merlino	a7d4162195	Compaction: Block input specs not aligned with segmentGranularity. (#14127 ) * Compaction: Block input specs not aligned with segmentGranularity. When input intervals are not aligned with segmentGranularity, data may be overshadowed if it lies in the space between the input intervals and the output segmentGranularity. In MSQ REPLACE, this is a validation error. IMO the same behavior makes sense for compaction tasks. In case anyone was depending on the ability to compact nonaligned intervals, a configuration parameter allowNonAlignedInterval is provided. I don't expect it to be used much. * Remove unused. * ITCompactionTaskTest uses non-aligned intervals.	2023-04-25 17:06:16 -07:00
Gian Merlino	89e7948159	MSQ: Subclass CalciteJoinQueryTest, other supporting changes. (#14105 ) * MSQ: Subclass CalciteJoinQueryTest, other supporting changes. The main change is the new tests: we now subclass CalciteJoinQueryTest in CalciteSelectJoinQueryMSQTest twice, once for Broadcast and once for SortMerge. Two supporting production changes for default-value mode: 1) InputNumberDataSource is marked as concrete, to allow leftFilter to be pushed down to it. 2) In default-value mode, numeric frame field readers can now return nulls. This is necessary when stacking joins on top of joins: nulls must be preserved for semantics that match broadcast joins and native queries. 3) In default-value mode, StringFieldReader.isNull returns true on empty strings in addition to nulls. This is more consistent with the behavior of the selectors, which map empty strings to null as well in that mode. As an effect of change (2), the InsertTimeNull change from #14020 (to replace null timestamps with default timestamps) is reverted. IMO, this is fine, as either behavior is defensible, and the change from #14020 hasn't been released yet. * Adjust tests. * Style fix. * Additional tests.	2023-04-25 12:10:23 -07:00
TSFenwick	accd5536df	Allow for Log4J to be configured for peons but still ensure console logging is enforced (#14094 ) * Allow for Log4J to be configured for peons but still ensure console logging is enforced This change will allow for log4j to be configured for peons but require console logging is still configured for them to ensure peon logs are saved to deep storage. Also fixed the test ConsoleLoggingEnforcementTest to use a valid appender for the non console Config as the previous config was incorrect and would never return a logger. * fix checkstyle * add warning to logger when it overwrites all loggers to be console * optimize calls for altering logging config for ConsoleLoggingEnforcementConfigurationFactory add getName to the druid logger class * update docs, and error message * edit docs to be more clear * fix checkstyle issues * CI fixes - LoggerTest code coverage and fix spelling issue for logging docs	2023-04-24 10:41:56 -07:00
Adarsh Sanjeev	a7d5c64aeb	Move MSQ temporary storage to a runtime parameter instead of being configured from query context (#14061 ) * Adds new run time parameter druid.indexer.task.tmpStorageBytesPerTask. This sets a limit for the amount of temporary storage disk space used by tasks. This limit is currently only respected by MSQ tasks. * Removes query context parameters intermediateSuperSorterStorageMaxLocalBytes and composedIntermediateSuperSorterStorageEnabled. Composed intermediate super sorter (which was enabled by composedIntermediateSuperSorterStorageEnabled) is now enabled automatically if durableShuffleStorage is set to true. intermediateSuperSorterStorageMaxLocalBytes is calculated from the limit set by the run time parameter druid.indexer.task.tmpStorageBytesPerTask.	2023-04-18 16:56:51 +05:30
Laksh Singla	8eb854c845	Remove maxResultsSize config property from S3OutputConfig (#14101 ) * "maxResultsSize" has been removed from the S3OutputConfig and a default "chunkSize" of 100MiB is now present. This change primarily affects users who wish to use durable storage for MSQ jobs.	2023-04-18 14:25:20 +05:30
Clint Wylie	f6a0888bc0	document arrays in sql (#12549 ) * document arrays in sql * adjustments * Update docs/querying/sql-array-functions.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/sql-data-types.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/sql-data-types.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/sql-array-functions.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/sql-array-functions.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update sql-array-functions.md * fix stuff * fix spelling --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-04-17 19:08:46 -07:00
Abhishek Radhakrishnan	c98c66558f	Include statement attributes in `EXPLAIN PLAN` output (#14074 ) This commit adds attributes that contain metadata information about the query in the EXPLAIN PLAN output. The attributes currently contain two items: - `statementType`: SELECT, INSERT or REPLACE - `targetDataSource`: provides the target datasource name for DML statements It is added to both the legacy and native query plan outputs.	2023-04-17 21:00:25 +05:30
Atul Mohan	e3c160f2f2	Add start_time column to sys.servers (#13358 ) Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.	2023-04-14 15:23:34 +05:30
317brian	6c9b7b6efd	msq: add durable storage info (#14035 ) * msq: add durable storage info * fix duplicate row * Apply suggestions from code review Co-authored-by: Karan Kumar <karankumar1100@gmail.com> --------- Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-04-14 13:28:23 +05:30
imply-cheddar	aaa6cc1883	Make the tasks run with only a single directory (#14063 ) * Make the tasks run with only a single directory There was a change that tried to get indexing to run on multiple disks It made a bunch of changes to how tasks run, effectively hiding the "safe" directory for tasks to write files into from the task code itself making it extremely difficult to do anything correctly inside of a task. This change reverts those changes inside of the tasks and makes it so that only the task runners are the ones that make decisions about which mount points should be used for storing task-related files. It adds the config druid.worker.baseTaskDirs which can be used by the task runners to know which directories they should schedule tasks inside of. The TaskConfig remains the authoritative source of configuration for where and how an individual task should be operating.	2023-04-13 00:45:02 -07:00
Vadim Ogievetsky	3a7e4efdd6	Docs: updating Kafka input format docs (#14049 ) * updating Kafka input format docs * typo * spellcheck * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2023-04-11 20:06:23 -07:00
Abhishek Radhakrishnan	5ce1b0903e	Add basic security functions to druidapi (follow up to #14009 ) (#14055 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Paul Rogers <progers@apache.org>	2023-04-11 10:55:27 -07:00
Gian Merlino	d52bc333aa	Frames: Ensure nulls are read as default values when appropriate. (#14020 ) * Frames: Ensure nulls are read as default values when appropriate. Fixes a bug where LongFieldWriter didn't write a properly transformed zero when writing out a null. This had no meaningful effect in SQL-compatible null handling mode, because the field would get treated as a null anyway. But it does have an effect in default-value mode: it would cause Long.MIN_VALUE to get read out instead of zero. Also adds NullHandling checks to the various frame-based column selectors, allowing reading of nullable frames by servers in default-value mode.	2023-04-10 05:28:46 +05:30
Charles Smith	166cb6203b	Remove unnecessary python topic. Style changes to quickstart. (#13647 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-04-07 09:55:52 -07:00
Vadim Ogievetsky	5ee4ecee62	Web console: use new sampler features (#14017 ) * use new sampler features * support kafka format * update DQT, fix tests * prefer non numeric formats * fix input format step * boost SQL data loader * delete dimension in auto discover mode * inline example specs * feedback updates * yeet the format into valueFormat when switching to kafka * kafka format is now a toggle * even better form layout * rename	2023-04-07 06:28:29 -07:00
Suraj Sanjay Kadam	b4157e32ae	Update api.md (#13436 ) * Update api.md I have created changes in api call of python according to latest version of requests 2.28.1 library. Along with this there are some irregularities between use of <your-instance> and <hostname> so I have tried to fix that also. * Update api.md made some changes in declaring USER and PASSWORD	2023-04-06 15:05:36 -07:00
Charles Smith	1c2744b31e	Fix querying sql (#14026 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-04-06 14:50:06 -07:00
Paul Rogers	030ed911d4	Temporarily revert extended table functions for Druid 26 (#14019 )	2023-04-05 21:09:33 -07:00
Nicholas Lippis	5810e650d4	K8s mm less fixes (#14028 ) Update Fabric8 version and allow metrics monitors to be overridden	2023-04-05 22:23:16 +05:30
Tejaswini Bandlamudi	ccf48245d7	Update documentation for Kafka Supervisor IdleConfig (#14032 )	2023-04-05 21:55:39 +05:30
Karan Kumar	e6a11707cb	Adding query stack fault to MSQ to capture native query errors. (#13926 ) * Add a new fault "QueryRuntimeError" to MSQ engine to capture native query errors. * Fixed bug in MSQ fault tolerance where workers were being retried if `UnexpectedMultiValueDimensionException` was thrown. * An exception from the query runtime with `org.apache.druid.query` as the package name is thrown as a QueryRuntimeError	2023-04-05 16:29:10 +05:30
317brian	7e572eef08	docs: sql unnest and cleanup unnest datasource (#13736 ) Co-authored-by: Elliott Freis <elliottfreis@Elliott-Freis.earth.dynamic.blacklight.net> Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Paul Rogers <paul-rogers@users.noreply.github.com> Co-authored-by: Jill Osborne <jill.osborne@imply.io> Co-authored-by: Anshu Makkar <83963638+anshu-makkar@users.noreply.github.com> Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Elliott Freis <108356317+imply-elliott@users.noreply.github.com> Co-authored-by: Nicholas Lippis <nick.lippis@imply.io> Co-authored-by: Rohan Garg <7731512+rohangarg@users.noreply.github.com> Co-authored-by: Karan Kumar <karankumar1100@gmail.com> Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Clint Wylie <cwylie@apache.org> Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2023-04-04 13:07:54 -07:00
Vadim Ogievetsky	981662e9f4	Web console: add a nice UI for overlord dynamic configs and improve the docs (#13993 ) * in progress * better form * doc updates * doc changes * add inline docs * fix tests * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/configuration/index.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * final fixes * fix case * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update docs/configuration/index.md Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * fix overflow * fix spelling --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-03-31 10:12:25 -07:00
Clint Wylie	e3211e3be0	actually backwards compatible frontCoded string encoding strategy (#13996 )	2023-03-31 02:24:12 -07:00
Clint Wylie	2219e68fa3	add backwards compat mode for frontCoded stringEncodingStrategy (#13988 )	2023-03-28 14:44:44 -07:00
Paul Rogers	76fe26d4ba	Fix typos, add tests for http() function (#13954 )	2023-03-28 14:41:06 -07:00
frankgrimes97	2f98675285	Tuple sketch SQL support (#13887 ) This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.	2023-03-28 18:47:12 +05:30
Rishabh Singh	e8e8082573	Update OIDCConfig with scope information (#13973 ) Allow users to provide custom scope through OIDC configuration	2023-03-28 14:50:00 +05:30
Gian Merlino	062d72b67e	Add timeout to TaskStartTimeoutFault. (#13970 ) * Add timeout to TaskStartTimeoutFault. Makes the error message a bit more useful. * Update docs.	2023-03-27 23:37:19 +05:30
Arnout Engelen	daff7fe73b	Document how to report security issues (#13886 ) Document how to report security issues on the security overview page, so we can link this page from the homepage. That should make all the other important security information easier to find as well.	2023-03-27 11:26:37 +05:30
Atul Mohan	19db32d6b4	Add JWT authenticator support for validating ID Tokens (#13242 ) Expands the OIDC based auth in Druid by adding a JWT Authenticator that validates ID Tokens associated with a request. The existing pac4j authenticator works for authenticating web users while accessing the console, whereas this authenticator is for validating Druid API requests made by Direct clients. Services already supporting OIDC can attach their ID tokens to the Druid requests under the Authorization request header.	2023-03-25 18:41:40 +05:30
Gian Merlino	549018d076	Revert "Update docs." This reverts commit `de27c7d3c1`.	2023-03-24 17:16:12 -07:00
Gian Merlino	de27c7d3c1	Update docs.	2023-03-24 17:15:27 -07:00
Nicholas Lippis	8a72544bd2	Hook up pod template adapter (#13966 ) * Hook up PodTemplateTaskAdapter * Make task adapter TYPE parameters final * Rename adapters types * Include specified adapter name in exception message * Documentation for sidecarSupport deprecation * Fix order * Set TASK_ID as environment variable in PodTemplateTaskAdapter (#13969) * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * Hook up PodTemplateTaskAdapter * Make task adapter TYPE parameters final * Rename adapters types * Include specified adapter name in exception message * Documentation for sidecarSupport deprecation * Fix order * fix spelling errors --------- Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>	2023-03-24 12:13:46 -06:00
Jill Osborne	976d39281f	Fix some broken links in docs (#13968 )	2023-03-24 10:48:23 -07:00
Paul Rogers	da42ee5bfa	Added TYPE(native) data type for external tables (#13958 )	2023-03-22 21:43:29 -07:00
Adarsh Sanjeev	7bab407495	Add segment generator counters to MSQ reports (#13909 ) * Add segment generator counters to reports * Remove unneeded annotation * Fix checkstyle and coverage * Add persist and merged as new metrics * Address review comments * Fix checkstyle * Create metrics class to handle updating counters * Address review comments * Add rowsPushed as a new metrics	2023-03-22 09:17:26 -07:00
Jill Osborne	4f95285406	Correct nested columns JSON example (#13953 )	2023-03-21 09:17:26 -07:00
Karan Kumar	67df1324ee	Undocumenting certain context parameter in MSQ. (#13928 ) * Removing intermediateSuperSorterStorageMaxLocalBytes, maxInputBytesPerWorker, composedIntermediateSuperSorterStorageEnabled, clusterStatisticsMergeMode from docs * Adding documentation in the context class.	2023-03-16 17:56:44 +05:30
317brian	65a663adbb	docs: clarify Java precision (#13671 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-03-15 11:43:41 -07:00
somu-imply	a7ba361666	Refactoring and bug fixes on top of unnest. The allowList now is not passed … (#13922 ) * Refactoring and bug fixes on top of unnest. The filter now is passed inside the unnest cursors. Added tests for scenarios such as 1. filter on unnested column which involves a left filter rewrite 2. filter on unnested virtual column which pushes the filter to the right only and involves no rewrite 3. not filters 4. SQL functions applied on top of unnested column 5. null present in first row of the column to be unnested	2023-03-14 16:05:56 -07:00
Suneet Saldanha	44547614ae	Report engine as a dimension for sqlQuery metrics (#13906 ) * Report engine as a dimension for sqlQuery metrics * docs	2023-03-10 11:23:57 -08:00
Gian Merlino	4b1ffbc452	Various changes and fixes to UNNEST. (#13892 ) * Various changes and fixes to UNNEST. Native changes: 1) UnnestDataSource: Replace "column" and "outputName" with "virtualColumn". This enables pushing expressions into the datasource. This in turn allows us to do the next thing... 2) UnnestStorageAdapter: Logically apply query-level filters and virtual columns after the unnest operation. (Physically, filters are pulled up, when possible.) This is beneficial because it allows filters and virtual columns to reference the unnested column, and because it is consistent with how the join datasource works. 3) Various documentation updates, including declaring "unnest" as an experimental feature for now. SQL changes: 1) Rename DruidUnnestRel (& Rule) to DruidUnnestRel (& Rule). The rel is simplified: it only handles the UNNEST part of a correlated join. Constant UNNESTs are handled with regular inline rels. 2) Rework DruidCorrelateUnnestRule to focus on pulling Projects from the left side up above the Correlate. New test testUnnestTwice verifies that this works even when two UNNESTs are stacked on the same table. 3) Include ProjectCorrelateTransposeRule from Calcite to encourage pushing mappings down below the left-hand side of the Correlate. 4) Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule to handle pulling Filters up above the Correlate. New tests testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify this behavior. 5) Require a context feature flag for SQL UNNEST, since it's undocumented. As part of this, also cleaned up how we handle feature flags in SQL. They're now hooked into EngineFeatures, which is useful because not all engines support all features.	2023-03-10 16:42:08 +05:30
Gian Merlino	fe9d0c46d5	Improve memory efficiency of WrappedRoaringBitmap. (#13889 ) * Improve memory efficiency of WrappedRoaringBitmap. Two changes: 1) Use an int[] for sizes 4 or below. 2) Remove the boolean compressRunOnSerialization. Doesn't save much space, but it does save a little, and it isn't adding a ton of value to have it be configurable. It was originally configurable in case anything broke when enabling it, but it's been a while and nothing has broken. * Slight adjustment. * Adjust for inspection. * Updates. * Update snaps. * Update test. * Adjust test. * Fix snaps.	2023-03-09 15:48:02 -08:00
Gian Merlino	82f7a56475	Sort-merge join and hash shuffles for MSQ. (#13506 ) * Sort-merge join and hash shuffles for MSQ. The main changes are in the processing, multi-stage-query, and sql modules. processing module: 1) Rename SortColumn to KeyColumn, replace boolean descending with KeyOrder. This makes it nicer to model hash keys, which use KeyOrder.NONE. 2) Add nullability checkers to the FieldReader interface, and an "isPartiallyNullKey" method to FrameComparisonWidget. The join processor uses this to detect null keys. 3) Add WritableFrameChannel.isClosed and OutputChannel.isReadableChannelReady so callers can tell which OutputChannels are ready for reading and which aren't. 4) Specialize FrameProcessors.makeCursor to return FrameCursor, a random-access implementation. The join processor uses this to rewind when it needs to replay a set of rows with a particular key. 5) Add MemoryAllocatorFactory, which is embedded inside FrameWriterFactory instead of a particular MemoryAllocator. This allows FrameWriterFactory to be shared in more scenarios. multi-stage-query module: 1) ShuffleSpec: Add hash-based shuffles. New enum ShuffleKind helps callers figure out what kind of shuffle is happening. The change from SortColumn to KeyColumn allows ClusterBy to be used for both hash-based and sort-based shuffling. 2) WorkerImpl: Add ability to handle hash-based shuffles. Refactor the logic to be more readable by moving the work-order-running code to the inner class RunWorkOrder, and the shuffle-pipeline-building code to the inner class ShufflePipelineBuilder. 3) Add SortMergeJoinFrameProcessor and factory. 4) WorkerMemoryParameters: Adjust logic to reserve space for output frames for hash partitioning. (We need one frame per partition.) sql module: 1) Add sqlJoinAlgorithm context parameter; can be "broadcast" or "sortMerge". With native, it must always be "broadcast", or it's a validation error. MSQ supports both. Default is "broadcast" in both engines. 
2) Validate that MSQs do not use broadcast join with RIGHT or FULL join, as results are not correct for broadcast join with those types. Allow this in native for two reasons: legacy (the docs caution against it, but it's always been allowed), and the fact that it actually does generate correct results in native when the join is processed on the Broker. It is much less likely that MSQ will plan in such a way that generates correct results. 3) Remove subquery penalty in DruidJoinQueryRel when using sort-merge join, because subqueries are always required, so there's no reason to penalize them. 4) Move previously-disabled join reordering and manipulation rules to FANCY_JOIN_RULES, and enable them when using sort-merge join. Helps get to better plans where projections and filters are pushed down. * Work around compiler problem. * Updates from static analysis. * Fix @param tag. * Fix declared exception. * Fix spelling. * Minor adjustments. * wip * Merge fixups * fixes * Fix CalciteSelectQueryMSQTest * Empty keys are sortable. * Address comments from code review. Rename mux -> mix. * Restore inspection config. * Restore original doc. * Reorder imports. * Adjustments * Fix. * Fix imports. * Adjustments from review. * Update header. * Adjust docs.	2023-03-08 14:19:39 -08:00
Abhishek Agarwal	52bd9e6adb	Improved error message when topic name changes within same supervisor (#13815 ) Improved error message when topic name changes within same supervisor Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-03-07 18:10:18 -08:00
Adarsh Sanjeev	ef82756176	Add validation for aggregations on __time (#13793 ) * Add validation for aggregations on __time	2023-03-07 17:16:36 -08:00
Karan Kumar	94cfabea18	Suggested memory calculation in case NOT_ENOUGH_MEMORY_FAULT is thrown. (#13846 ) * Suggested memory calculation in case NOT_ENOUGH_MEMORY_FAULT is thrown. Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-03-06 18:00:36 +05:30
Paul Rogers	a580aca551	Python Druid API for use in notebooks (#13787 ) Python Druid API for use in notebooks Revises existing notebooks and readme to reference the new API. Notebook to explain the new API. Split README into a console version and a notebook version to work around lack of a nice display for md files. Update the REST API notebook to use simpler Requests calls Converted the SQL tutorial to use the Python library README file, converted to using properties --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-03-04 18:25:19 -08:00
Anshu Makkar	a10e4150d5	Add Post Aggregators for Tuple Sketches (#13819 ) You can now do the following operations with TupleSketches in Post Aggregation Step Get the Sketch Output as Base64 String Provide a constant Tuple Sketch in post-aggregation step that can be used in Set Operations Get the Estimated Value(Sum) of Summary/Metrics Objects associated with Tuple Sketch	2023-03-03 09:32:09 +05:30
317brian	b4b354b658	docs: fix html nits (#13835 )	2023-03-02 11:19:32 -08:00
Jill Osborne	26c5cac41a	Fix a link problem (#13876 )	2023-03-02 09:09:51 -08:00
Tejaswini Bandlamudi	7103cb4b9d	Removes FiniteFirehoseFactory and its implementations (#12852 ) The FiniteFirehoseFactory and InputRowParser classes were deprecated in 0.17.0 (#8823) in favor of InputSource & InputFormat. This PR removes the FiniteFirehoseFactory and all its implementations along with classes solely used by them like Fetcher (Used by PrefetchableTextFilesFirehoseFactory). Refactors classes including tests using FiniteFirehoseFactory to use InputSource instead. Removing InputRowParser may not be as trivial as many classes that aren't deprecated depends on it (with no alternatives), like EventReceiverFirehoseFactory. Hence FirehoseFactory, EventReceiverFirehoseFactory, and Firehose are marked deprecated.	2023-03-02 18:07:17 +05:30
Apoorv Gupta	b26f1b4a5d	Update datasources.md: Fix Documentation. (#13865 ) Fixed documentation to clarify that union query cant be run over query datasources.	2023-03-01 20:29:15 +05:30
Laksh Singla	ca68fd93a6	Generate tombstones when running MSQ's replace (#13706 ) *When running REPLACE queries, the segments which contain no data are dropped (marked as unused). This PR aims to generate tombstones in place of segments which contain no data to mark their deletion, as is the behavior with the native ingestion. This will cause InsertCannotReplaceExistingSegmentFault to be removed since it was generated if the interval to be marked unused didn't fully overlap one of the existing segments to replace.	2023-03-01 12:01:30 +05:30
AdheipSingh	22e516fd53	Update kubernetes.md (#13858 )	2023-02-28 11:20:24 -08:00
Kashif Faraz	12f62e2c42	Clarify doc of ingest/handoff/time metric (#13856 )	2023-02-28 10:37:47 +05:30
Victoria Lim	e46379ba7a	Docs: Update name of the metadata tables (#13734 ) * Update name of the metadata tables * emend spelling file * fix spelling --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-02-23 13:57:59 -08:00
tejasparbat	d74d6824ec	update LDAP endpoint (#13839 ) Current DOC at step https://druid.apache.org/docs/latest/operations/auth-ldap.html#add-an-ldap-user-to-druid-and-assign-a-role Example request to add the LDAP user myuser to Druid: curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authentication/db/ldap/users/myuser Example request to assign the myuser user to the queryRole role: curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authentication/db/ldap/users/myuser/roles/queryRole Expected: Example request to add the LDAP user myuser to Druid: curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/users/myuser Example request to assign the myuser user to the queryRole role curl -i -v -H "Content-Type: application/json" -u internal -X POST http://localhost:8081/druid-ext/basic-security/authorization/db/ldapauth/users/myuser/roles/queryRole	2023-02-23 13:55:06 -08:00
Win Min Soe	70f9052f1d	docs: update correct config base on server spec (#13832 ) Co-authored-by: Winn Minn <winn.minn@grabtaxi.com>	2023-02-23 08:50:47 -08:00
Abhishek Radhakrishnan	17a3cd0b68	Remove the additional backtick that's causing a SA issue. (#13838 )	2023-02-23 09:01:08 +05:30
benkrug	66034dd8bc	Update default for finalize in query-context.md (#13763 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-02-22 12:35:36 -08:00
Katya Macedo	1595653e6f	docs: add a link for the Druid SQL tutorial (#13468 ) * docs: add juptyer API tutorial for API and jupyter tutorial index (#3) (cherry picked from commit aeb8d9e3390fa26d9c533dce0862295b80c58583) * update prereqs and fix jupyterlab name * Removing notebook since 13345 has it 13345 should be merged first * update contributing instructions * docs: link to the Druid SQL tutorial * Add link to partitioning * fix merge conflict * Saving * Update docs/tutorials/tutorial-jupyter-index.md * Remove partitioning --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: brian.le <brian.le@imply.io> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-02-22 09:36:13 -08:00
317brian	07883e311e	doc: fix unnecessary link (#13785 ) CI errors look unrelated to this change.	2023-02-21 17:34:46 -08:00
zachjsh	665dee43bf	Revert "Operator conversion deny list (#13766 )" (#13829 ) This reverts commit `38e620aa4c`.	2023-02-21 15:14:49 -08:00
Paul Rogers	5dadbdf4d0	Generate the IT docker-compose.yaml files (#13669 ) Generate IT docker-compose.sh files Generates test-specific docker-compose.sh files using a simple Python template script.	2023-02-21 15:03:02 -08:00
benkrug	c6b1576fc1	Update clean-metadata-store.md (#13131 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-02-21 12:53:54 -08:00
Paul Rogers	85d36be085	Information schema now uses numeric column types (#13777 ) Change to use SQL schemas to allow null numeric columns * Updated docs	2023-02-17 14:39:31 -08:00
Katya Macedo	bc8b710b7e	Fix broken link (#13767 )	2023-02-17 09:02:12 -08:00
Churro	c1f283fd31	Better sidecar support (#13655 ) * Better sidecar support * remove un-thrown exception from test * Druid you are such a stickler about spelling :) * Only require the primaryContainerName, no need to exclude containers	2023-02-14 10:56:15 +05:30
Guy ☀️ Moore	306997be87	Add Perl 5 to druid requirements (#13708 ) Without perl 5 I was unable to start druid using the instructions in the quickstart guide. I'm not certain what versions it might require, but the one that I got working was perl 5 > This is perl 5, version 36, subversion 0 (v5.36.0) built for x86_64-linux-thread-multi	2023-02-13 13:34:49 -08:00
zachjsh	38e620aa4c	Operator conversion deny list (#13766 ) ### Description This change adds a new config property `druid.sql.planner.operatorConversion.denyList`, which allows a user to specify any operator conversions that they wish to disallow. A user may want to do this for a number of reasons, including security concerns. The default value of this property is the empty list `[]`, which does not disallow any operator conversions. An example usage of this property is `druid.sql.planner.operatorConversion.denyList=["extern"]`, which disallows the usage of the `extern` operator conversion. If the property is configured this way, and a user of the Druid cluster tries to submit a query that uses the `extern` function, such as the example given [here](https://druid.apache.org/docs/latest/multi-stage-query/examples.html#insert-with-no-rollup), a response with http response code `400` is returned with en error body similar to the following: ``` { "taskId": "4ec5b0b6-fa9b-4c3a-827d-2308294e9985", "state": "FAILED", "error": { "error": "Plan validation failed", "errorMessage": "org.apache.calcite.runtime.CalciteContextException: From line 28, column 5 to line 32, column 5: No match found for function signature EXTERN(<CHARACTER>, <CHARACTER>, <CHARACTER>)", "errorClass": "org.apache.calcite.tools.ValidationException", "host": null } } ```	2023-02-10 09:59:26 -08:00
Anshu Makkar	d7b95988d7	Add missing documentation for constant post-aggregator (#13664 ) Thanks @anshu-makkar , I was waiting for CI to complete yesterday. Failures seem unrelated, so merging.	2023-02-09 08:53:45 -08:00
Suneet Saldanha	714ac07b52	Allow users to add additional metadata to ingestion metrics (#13760 ) * Allow users to add additional metadata to ingestion metrics When submitting an ingestion spec, users may pass a map of metadata in the ingestion spec config that will be added to ingestion metrics. This will make it possible for operators to tag metrics with other metadata that doesn't necessarily line up with the existing tags like taskId. Druid clusters that ingest these metrics can take advantage of the nested data columns feature to process this additional metadata. * rename to tags * docs * tests * fix test * make code cov happy * checkstyle	2023-02-08 18:07:23 -08:00
AmatyaAvadhanula	0cf1fc3d55	Indexing on multiple disks (#13476 ) * Initial commit * Simple UTs * Parameterize tests * Parameterized tests for k8s task runner * Fix restore bug * Refactor TaskStorageDirTracker * Change CliPeon args	2023-02-08 11:31:34 +05:30
AmatyaAvadhanula	dcdae84888	Add server view initialization metrics (#13716 ) * Add server view init metrics * Test coverage * Rename metrics	2023-02-07 20:02:00 +05:30
Suneet Saldanha	bea18dc9e4	Update basic auth examples (#13750 )	2023-02-03 14:45:48 -08:00
drudi-at-coffee	7580248770	Update api.md (#13727 ) Added missing '/status' in HTTP status request	2023-02-02 10:43:22 -08:00
Victoria Lim	33efd5ab1d	docs: Refresh the update data tutorial (#13641 ) Merging regardless of nit since topic is in better shape. * refresh the update data tutorial * Apply suggestions from code review Co-authored-by: Jill Osborne <jill.osborne@imply.io> --------- Co-authored-by: Jill Osborne <jill.osborne@imply.io>	2023-02-01 18:18:16 -08:00
Kashif Faraz	f629643c50	Fix value of lookup sync period in docs (#13695 ) * Fix lookup docs * Fix spelling * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-02-01 18:12:00 -08:00
Sergio Ferragut	7f830b20d7	fixed init commands for both mysql and postgresql (#13713 )	2023-02-01 18:07:31 -08:00
Suneet Saldanha	cfc3115a59	Compaction history returns empty list instead of 404 when not found (#13730 ) * Compaction history returns empty list instead of 404 when not found * checkstyle	2023-02-01 17:44:07 -08:00
Tijo Thomas	1beef30bb2	Support postaggregation function as in Math.pow() (#13703 ) (#13704 ) Support postaggregation function as in Math.pow()	2023-01-31 22:55:04 +05:30
Adarsh Sanjeev	51dfde0284	Add maxInputBytesPerWorker as query context parameter (#13707 ) * Add maxInputBytesPerWorker as query context parameter * Move documenation to msq specific docs * Update tests * Spacing * Address review comments * Fix test * Update docs/multi-stage-query/reference.md * Correct spelling mistake --------- Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2023-01-31 20:55:28 +05:30
Jill Osborne	356b0e37cf	Tutorial: Query view (#13565 ) * Tutorial: Query view * Removed duplicate file * Update tutorial-sql-query-view.md * Update tutorial-sql-query-view.md * Update tutorial-sql-query-view.md * Updated after review * Update docs/tutorials/tutorial-sql-query-view.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update tutorial-sql-query-view.md Update title * Update sidebars.json fix merge conflict w/ sidebar * address spelling ci --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-01-27 14:29:43 -08:00
sairam devarashetty	6164c420a1	Create update.md (#13451 ) * Create update.md Important Line highlighted * Update docs/data-management/update.md Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-01-25 16:23:40 -08:00
317brian	9021161c8c	doc: fix markdown spacing (#13683 ) * doc: fix markdown spacing * fix spacing	2023-01-25 16:22:49 -08:00
Victoria Lim	00cee329bd	pitfall when using combining input source (#13639 )	2023-01-25 12:50:19 -08:00
Suneet Saldanha	016c881795	Add API to return automatic compaction config history (#13699 ) Add a new API to return the history of changes to automatic compaction config history to make it easy for users to see what changes have been made to their auto-compaction config. The API is scoped per dataSource to allow users to triage issues with an individual dataSource. The API responds with a list of configs when there is a change to either the settings that impact all auto-compaction configs on a cluster or the dataSource in question.	2023-01-23 13:23:45 -08:00
Rohan Garg	f76acccff2	Allow using composed storage for SuperSorter intermediate data (#13368 )	2023-01-24 01:02:03 +05:30
Eyal Yurman	44374f91bc	Fix broken links to Oracle JDK docs (#13687 ) * Fix broken link for SSLContext java doc * Update tls-support.md * Update tls-support.md * Update tls-support.md * Update simple-client-sslcontext.md	2023-01-18 14:46:08 +05:30
Paul Rogers	22630b0aab	Much improved table functions (#13627 ) Much improved table functions * Revises properties, definitions in the catalog * Adds a "table function" abstraction to model such functions * Specific functions for HTTP, inline, local and S3. * Extended SQL types in the catalog * Restructure external table definitions to use table functions * EXTEND syntax for Druid's extern table function * Support for array-valued table function parameters * Support for array-valued SQL query parameters * Much new documentation	2023-01-17 08:41:57 -08:00
Gian Merlino	182c4fad29	Kinesis: More robust default fetch settings. (#13539 ) * Kinesis: More robust default fetch settings. 1) Default recordsPerFetch and recordBufferSize based on available memory rather than using hardcoded numbers. For this, we need an estimate of record size. Use 10 KB for regular records and 1 MB for aggregated records. With 1 GB heaps, 2 processors per task, and nonaggregated records, recordBufferSize comes out to the same as the old default (10000), and recordsPerFetch comes out slightly lower (1250 instead of 4000). 2) Default maxRecordsPerPoll based on whether records are aggregated or not (100 if not aggregated, 1 if aggregated). Prior default was 100. 3) Default fetchThreads based on processors divided by task count on Indexers, rather than overall processor count. 4) Additionally clean up the serialized JSON a bit by adding various JsonInclude annotations. * Updates for tests. * Additional important verify.	2023-01-13 11:03:54 +05:30
Vadim Ogievetsky	93dc01b6c5	fix broken table missing new line (#13666 )	2023-01-12 15:29:51 -08:00
Vadim Ogievetsky	f97bcc69d3	Docs: reword single server page (#13659 ) * reword single server page * fix typo * Update docs/operations/single-server.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * spelling Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-01-11 21:12:52 -08:00
Karan Kumar	56076d33fb	Worker retry for MSQ task (#13353 ) * Initial commit. * Fixing error message in retry exceeded exception * Cleaning up some code * Adding some test cases. * Adding java docs. * Finishing up state test cases. * Adding some more java docs and fixing spot bugs, intellij inspections * Fixing intellij inspections and added tests * Documenting error codes * Migrate current integration batch tests to equivalent MSQ tests (#13374) * Migrate current integration batch tests to equivalent MSQ tests using new IT framework * Fix build issues * Trigger Build * Adding more tests and addressing comments * fixBuildIssues * fix dependency issues * Parameterized the test and addressed comments * Addressing comments * fixing checkstyle errors * Adressing comments * Adding ITTest which kills the worker abruptly * Review comments phase one * Adding doc changes * Adjusting for single threaded execution. * Adding Sequential Merge PR state handling * Merge things * Fixing checkstyle. * Adding new context param for fault tolerance. Adding stale task handling in sketchFetcher. Adding UT's. * Merge things * Merge things * Adding parameterized tests Created separate module for faultToleranceTests * Adding missed files * Review comments and fixing tests. * Documentation things. * Fixing IT * Controller impl fix. * Fixing racy WorkerSketchFetcherTest.java exception handling. Co-authored-by: abhagraw <99210446+abhagraw@users.noreply.github.com> Co-authored-by: Karan Kumar <cryptoe@karans-mbp.lan>	2023-01-11 07:38:29 +05:30
Abhishek Agarwal	17936e2920	Add an option to enable HSTS in druid services (#13489 ) * Add an option to enable HSTS * Fix code and add docs * Deduplicate headers * unused import * Fix spelling	2023-01-10 22:31:51 +05:30
Victoria Lim	a800dae87a	doc: List Protobuf as a supported format (#13640 )	2023-01-06 15:09:37 -08:00
317brian	6bbf4266b2	docs: documentation for unnest datasource (#13479 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-01-06 11:41:11 -08:00
Kashif Faraz	0d97e658b2	Docs: Update quickstart instructions (#13611 ) Changes: - Remove specification of a Druid version in the quickstart, because the previous step instructs downloading the latest version anyway. - Mention usage of memory parameter in the quickstart	2022-12-22 11:51:08 +05:30
Vadim Ogievetsky	07597c687d	Docs: Remove large data file (#13595 )	2022-12-19 13:14:22 +05:30
Gian Merlino	ee890965f4	LocalInputSource: Serialize File paths without forcing resolution. (#13534 ) * LocalInputSource: Serialize File paths without forcing resolution. Fixes #13359. * Add one more javadoc.	2022-12-19 11:47:36 +05:30
Victoria Lim	09d8b16447	Document shouldFinalize for sketches that have the parameter (#13524 ) Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-12-17 10:48:06 -08:00

3244 Commits