druid

Commit Graph

Author	SHA1	Message	Date
Clint Wylie	c221a2634b	overhaul DruidPredicateFactory to better handle 3VL (#15629 ) * overhaul DruidPredicateFactory to better handle 3VL fixes some bugs caused by some limitations of the original design of how DruidPredicateFactory interacts with 3-value logic. The primary impacted area was with how filters on values transformed with expressions or extractionFn which turn non-null values into nulls, which were not possible to be modelled with the 'isNullInputUnknown' method changes: * adds DruidObjectPredicate to specialize string, array, and object based predicates instead of using guava Predicate * DruidPredicateFactory now uses DruidObjectPredicate * introduces DruidPredicateMatch enum, which all predicates returned from DruidPredicateFactory now use instead of booleans to indicate match. This means DruidLongPredicate, DruidFloatPredicate, DruidDoublePredicate, and the newly added DruidObjectPredicate apply methods all now return DruidPredicateMatch. This allows matchers and indexes * isNullInputUnknown has been removed from DruidPredicateFactory * rename, fix test * adjust * style * npe * more test * fix default value mode to not match new test	2024-01-05 19:08:02 -08:00
Gian Merlino	e40b96e026	Reverse lookup fixes and enhancements. (#15611 ) * Reverse lookup fixes and enhancements. 1) Add a "mayIncludeUnknown" parameter to DimFilter#optimize. This is important because otherwise the reverse-lookup optimization is done improperly when the "in" filter appears under a "not", and the lookup extractionFn may return null for some possible values of the filtered column. The "includeUnknown" test cases in InDimFilterTest illustrate the difference in behavior. 2) Enhance InDimFilter#optimizeLookup to handle "mayIncludeUnknown", and to be able to do a reverse lookup in a wider variety of cases. 3) Make "unapply" protected in LookupExtractor, and move callers to "unapplyAll". The main reason is that MapLookupExtractor, a common implementation, lacks a reverse mapping and therefore does a scan of the map for each call to "unapply". For performance sake these calls need to be batched. * Remove optimize call from BloomDimFilter. * Follow the law. * Fix tests. * Fix imports. * Switch function. * Fix tests. * More tests.	2024-01-03 13:28:44 -08:00
Jan Werner	fa2c8edb5d	unpin snakeyaml, add suppressions and licenses (#15549 ) * unpin snakeyaml globally, add suppressions and licenses * pin snakeyaml in the specific modules that require version 1.x, update licenses and owasp suppression This removes the pin of the Snakeyaml introduced in: https://github.com/apache/druid/pull/14519 After the updates of io.kubernetes.java-client and io.confluent.kafka-clients, the only uses of the Snakeyaml 1.x are: - in test scope, transitive dependency of jackson-dataformat-yaml:jar:2.12.7 - in compile scope in contrib extension druid-cassandra-storage - in compile scope in it-tests. With the dependency version un-pinned, io.kubernetes.java-client and io.confluent.kafka-clients bring Snakeyaml versions 2.0 and 2.2, consequently allowing to build a Druid distribution without the contrib-extension and free of vulnerable Snakeyaml versions.	2023-12-15 10:33:14 -08:00
Tom	901ebbb744	Allow for kafka emitter producer secrets to be masked in logs (#15485 ) * Allow for kafka emitter producer secrets to be masked in logs instead of being visible This change will allow for kafka producer config values that should be secrets to not show up in the logs. This will enhance the security of the people who use the kafka emitter to use this if they want to. This is opt in and will not affect prior configs for this emitter * fix checkstyle issue * change property name	2023-12-15 12:21:21 -05:00
sensor	c9be1cb4e8	Clean useless InterruptedException warn in ingestion task log (#15519 ) * Clean useless InterruptedException warn in ingestion task log * test coverage for the code change, manually close the scheduler thread to trigger Interrupt signal --------- Co-authored-by: Qiong Chen <qiong.chen@shopee.com>	2023-12-15 11:18:53 +08:00
Bartosz Mikulski	4670a7650f	Optional removal of metrics from Prometheus PushGateway on shutdown (#14935 ) * Optional removal of metrics from Prometheus PushGateway on shutdown * Make pushGatewayDeleteOnShutdown property nullable * Add waitForShutdownDelay property * Fix unit test * Address PR comments * Address PR comments * Add explanation on why it is useful to have deletePushGatewayMetricsOnShutdown * Fix spelling error * Fix spelling error	2023-12-13 11:58:53 -05:00
George Shiqi Wu	4152f1d147	Fix empty logs and status messages for mmless ingestion (#15527 ) * Fix empty logs and status messages for mmless ingestion * Add tests	2023-12-11 13:20:45 -05:00
Clint Wylie	0516d0dae4	simplify IncrementalIndex since group-by v1 has been removed (#15448 )	2023-11-29 14:46:16 -08:00
Kashif Faraz	58a724c7e4	Use StubServiceEmitter in tests (#15426 ) * Use StubServiceEmitter in tests * Remove unthrown exception from declaration	2023-11-28 09:43:09 +05:30
Karan Kumar	a0188192de	Fixing failing compaction/parallel index jobs during upgrade due to new actions being available on the overlord. (#15430 ) * Fixing failing compaction/parallel index jobs during upgrade due to new actions not available on the overlord. * Fixing build * Removing extra space. * Fixing json getter. * Review comments.	2023-11-25 13:50:29 +05:30
Tom	386cdb95e7	Fixes minor typo in kafka emitter (#15416 )	2023-11-23 14:19:39 +05:30
Atul Mohan	a2914789d7	Add support for ingesting older iceberg snapshots (#15348 ) This patch introduces a param snapshotTime in the iceberg inputsource spec that allows the user to ingest data files associated with the most recent snapshot as of the given time. This helps the user ingest data based on older snapshots by specifying the associated snapshot time. This patch also upgrades the iceberg core version to 1.4.1	2023-11-17 12:32:28 +05:30
YongGang	3a3d37ef40	Fix for segment/count Metric Not Emitting with Statsd-emitter (#15347 ) * fix segment/count metric in Statsd-emitter * update doc * Update docs/development/extensions-contrib/prometheus.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/statsd.md Co-authored-by: Suneet Saldanha <suneet@apache.org> --------- Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-11-10 08:08:58 -08:00
George Shiqi Wu	130bfbfc6d	Revert "Separate task lifecycle from kubernetes/location lifecycle (#15133 )" (#15346 ) This reverts commit `dc0b163e19`.	2023-11-08 13:12:30 -05:00
Kengo Seki	b7d7f84bce	Bump Jedis version to 5.0.2 (#15344 ) Currently, the redis-cache extension uses Jedis 2.9.0, which was released over seven years ago and is no longer listed in the official support matrix. This patch upgrades it to ensure the compatibility with the recent version of Redis and make future upgrades easier, including: Upgrade Jedis to v5.0.2, the latest version at this writing, and address the API changes and dependency version mismatch. Replace mock-jedis with jedis-mock, since the former has not been actively maintained any longer and not compatible with recent versions of Jedis.	2023-11-08 20:22:41 +05:30
Atul Mohan	ff7de49015	Consolidate and reduce dependency footprint for iceberg extension (#15280 ) * Consolidate and reduce dependency footprint * Fix dependency analysis	2023-11-06 12:17:32 +05:30
Rishabh Singh	8c802e4c9b	Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency (#14985 ) In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal. To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.	2023-11-04 19:33:25 +05:30
Laksh Singla	0cc8839a60	Allow casted literal values in SQL functions accepting literals (Part 2) (#15316 )	2023-11-03 21:22:19 +05:30
Gian Merlino	d87d92bc43	Add system fields to input sources. (#15276 ) * Add system fields to input sources. Main changes: 1) The SystemField enum defines system fields "__file_uri", "__file_path", and "__file_bucket". They are associated with each input entity. 2) The SystemFieldInputSource interface can be added to any InputSource to make it system-field-capable. It sets up serialization of a list of configured "systemFields" in the JSON form of the input source, and provides a method getSystemFieldValue for computing the value of each system field. Cloud object, HDFS, HTTP, and Local now have this. * Fix various LocalInputSource calls. * Fix style stuff. * Fixups. * Fix tests and coverage.	2023-11-02 10:31:28 -07:00
Gian Merlino	6b6d73b5d4	Use min of scheduler threads and server threads for subquery guardrails. (#15295 ) * Use min of scheduler threads and server threads for subquery guardrails. This allows more memory to be used for subqueries when the query scheduler is configured to limit queries below the number of server threads. The patch also refactors the code so SubqueryGuardrailHelper is provided by a Guice Provider rather than being created by ClientQuerySegmentWalker, to achieve better separation of concerns. * Exclude provider from coverage.	2023-11-01 22:34:53 -07:00
Laksh Singla	2ea7177f15	Allow casted literal values in SQL functions accepting literals (#15282 ) Functions that accept literals also allow casted literals. This shouldn't have an impact on the queries that the user writes. It enables the SQL functions to accept explicit cast, which is required with JDBC.	2023-11-01 10:38:48 +05:30
YongGang	7a25ee4fd9	Ability to send task types to k8s or worker task runner (#15196 ) * Ability to send task types to k8s or worker task runner * add more tests * use runnerStrategy to determine task runner * minor refine * refine runner strategy config * move workerType config to upper level * validate config when application start	2023-10-25 09:55:56 -07:00
Laksh Singla	207398a47d	Initialize null handling in CompressedBigDecimalAggregatorTimeseriesTestBase to fix failing test(#15252 )	2023-10-25 20:26:46 +05:30
AmatyaAvadhanula	65b69cded4	Filter pending segments upgraded with transactional replace (#15169 ) * Filter pending segments upgraded with transactional replace * Push sequence name filter to metadata query	2023-10-23 21:18:47 +05:30
AmatyaAvadhanula	a8febd457c	A Replacing task must read segments created before it acquired its lock (#15085 ) * Replacing tasks must read segments created before they acquired their locks	2023-10-19 11:13:07 +05:30
George Shiqi Wu	dc0b163e19	Separate task lifecycle from kubernetes/location lifecycle (#15133 ) * Separate k8s and druid task lifecycles * Remove extra log lines * Fix unit tests * fix unit tests * Fix unit tests * notify listeners on task completion * Fix unit test * unused var * PR changes * Fix unit tests * Fix checkstyle * PR changes	2023-10-17 08:17:43 -07:00
Clint Wylie	d0f64608eb	sql compatible three-valued logic native filters (#15058 ) * sql compatible tri-state native logical filters when druid.expressions.useStrictBooleans=true and druid.generic.useDefaultValueForNull=false, and new druid.generic.useThreeValueLogicForNativeFilters=true * log.warn if non-default configurations are used to guide operators towards SQL complaint behavior	2023-10-12 00:06:23 -07:00
Zoltan Haindrich	ae88f2c0b6	Fix non-sqlcompat validation in CalciteWindowQueryTest (#15086 ) * fixes * check for latest rewrite place * Revert "check for latest rewrite place" This reverts commit `5cf1e2c1ca`. * some stuff (cherry picked from commit ab346d4373ea888eb8ef6115e018e7fb0d27407f) * update test output * updates to test ouptuts * some stuff * move validator * cleanup * fix * change test slightly * add apidoc cleanup warnings * cleanup/etc * instead of telling the story; add a fail with some reason whats the issue * lead-lag fix * add test * remove unnecessary throw * druidexception-trial * Revert "druidexception-trial" This reverts commit `8fa06644bc`. * undo changes to no_grouping; add no_grouping2 * add missing assert on resultcount * rename method; update * introduce enum/etc * make resultmatchmode accessible from TestBuilder#expectedResults * fix dump results to use log * fix * handle null correctly * disable feature type based things for MSQ * fix varianssqlaggtest * use eps in other test * fix intellij error * add final * addrss review * update test/string/etc * write concat in 3 lines :D	2023-10-11 12:34:31 -07:00
Laksh Singla	5f86072456	Prepare master for Druid 29 (#15121 ) Prepare master for Druid 29	2023-10-11 10:33:45 +05:30
Zoltan Haindrich	b5a87fd89b	Support constant args in window functions (#15071 ) Instead of passing the constants around in a new parameter; InputAccessor was introduced to take care of transparently handling the constants - this new class started picking up some copy-paste debris around field accesses; and made them a little bit more readble.	2023-10-08 12:14:25 +05:30
Xavier Léauté	adef2069b1	Make unit tests pass with Java 21 (#15014 ) This change updates dependencies as needed and fixes tests to remove code incompatible with Java 21 As a result all unit tests now pass with Java 21. * update maven-shade-plugin to 3.5.0 and follow-up to #15042 * explain why we need to override configuration when specifying outputFile * remove configuration from dependency management in favor of explicit overrides in each module. * update to mockito to 5.5.0 for Java 21 support when running with Java 11+ * continue using latest mockito 4.x (4.11.0) when running with Java 8 * remove need to mock private fields * exclude incorrectly declared mockito dependency from pac4j-oidc * remove mocking of ByteBuffer, since sealed classes can no longer be mocked in Java 21 * add JVM options workaround for system-rules junit plugin not supporting Java 18+ * exclude older versions of byte-buddy from assertj-core * fix for Java 19 changes in floating point string representation * fix missing InitializedNullHandlingTest * update easymock to 5.2.0 for Java 21 compatibility * update animal-sniffer-plugin to 1.23 * update nl.jqno.equalsverifier to 3.15.1 * update exec-maven-plugin to 3.1.0	2023-10-03 22:41:21 -07:00
Soumyava	cb050282a0	Intervals are updated properly for Unnest queries (#15020 ) Fixes a bug where the unnest queries were not updated with the correct intervals.	2023-10-04 02:52:10 +05:30
George Shiqi Wu	64754b6799	Allow users to pass task payload via deep storage instead of environment variable (#14887 ) This change is meant to fix a issue where passing too large of a task payload to the mm-less task runner will cause the peon to fail to startup because the payload is passed (compressed) as a environment variable (TASK_JSON). In linux systems the limit for a environment variable is commonly 128KB, for windows systems less than this. Setting a env variable longer than this results in a bunch of "Argument list too long" errors.	2023-10-03 14:08:59 +05:30
Karan Kumar	2f1bcd6717	Adding "segment/scan/active" metric for processing thread pool. (#15060 )	2023-09-29 12:34:28 -07:00
Zoltan Haindrich	5f3b310115	Build reliablity fixes (#15048 ) * disable parallel builds; enable batch mode to get rid of transfer progress * restore .m2 from setup-java if not found * some change to sql * add ws * fix quote * fix quote * undo querytest change * nullhandling in mvtest * init more * skip commitid plugin * add-back 1.0C to build ; remove redundant skip-s from copy-resources; add comment	2023-09-28 12:27:52 -07:00
George Shiqi Wu	8e22a178cc	Support getTaskLocation for mixed task runner (#15033 ) The KubernetesAndWorkerTaskRunner currently doesn't implement getTaskLocation, so tasks run by it will show a unknown TaskLocation in the druid console after a task has completed. Fix bug in KubernetesAndWorkerTaskRunner that manifests as missing information in the druid Web Console.	2023-09-27 08:57:36 +05:30
YongGang	7301e60a9c	Add metrics for number of segments generated per task in MSQ (#14980 ) Add ingest/tombstones/count and ingest/segments/count metrics in MSQ.	2023-09-26 02:46:33 +05:30
Tejaswini Bandlamudi	48b6d2abf9	skip org.owasp:dependency-check on extensions-contrib modules and suppress false-positive gRPC CVEs (#15026 )	2023-09-25 12:14:42 +05:30
YongGang	be3f93e3cf	Restore tasks when lifecycle start (#14909 ) * K8s tasks restore should be from lifecycle start * add test * add more tests * fix test * wait tasks restore finish when start * fix style * revert previous change and add comment	2023-09-22 12:03:34 -07:00
Kashif Faraz	409bffe7f2	Rename IMSC.announceHistoricalSegments to commitSegments (#15021 ) This commit pulls out some changes from #14407 to simplify that PR. Changes: - Rename `IndexerMetadataStorageCoordinator.announceHistoricalSegments` to `commitSegments` - Rename the overloaded method to `commitSegmentsAndMetadata` - Fix some typos	2023-09-21 16:19:03 +05:30
Laksh Singla	82e809c8d0	fix (#15017 )	2023-09-20 15:48:26 -07:00
George Shiqi Wu	d459df8d6e	Fix log syntax (#15004 )	2023-09-18 10:40:02 -07:00
George Shiqi Wu	f773d83914	Mixed task runner for migration to mm-less ingestion (#14918 ) * save work * Working * Fix runner constructor * Working runner * extra log lines * try using lifecycle for everything * clean up configs * cleanup /workers call * Use a single config * Allow selecting runner * debug changes * Work on composite task runner * Unit tests running * Add documentation * Add some javadocs * Fix spelling * Use standard libraries * code review * fix * fix * use taskRunner as string * checkstyl --------- Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-09-11 18:09:46 -07:00
Laksh Singla	6ee0b06e38	Auto configuration for maxSubqueryBytes (#14808 ) A new monitor SubqueryCountStatsMonitor which emits the metrics corresponding to the subqueries and their execution is now introduced. Moreover, the user can now also use the auto mode to automatically set the number of bytes available per query for the inlining of its subquery's results.	2023-09-06 05:47:19 +00:00
Abhishek Radhakrishnan	9d6ca61ac1	Verify statsd mock client interaction in unit test (#14939 )	2023-09-05 07:34:22 -07:00
Kashif Faraz	289ee1e011	Refactor: Cleanup NoopTask (#14938 ) Changes: - Simplify static `create` methods for `NoopTask` - Remove `FirehoseFactory`, `IsReadyResult`, `readyTime` from `NoopTask` as these fields were not being used anywhere - Update tests	2023-09-05 09:15:41 +05:30
Kashif Faraz	7f26b80e21	Simplify ServiceMetricEvent.Builder (#14933 ) Changes: - Make ServiceMetricEvent.Builder extend ServiceEventBuilder<ServiceMetricEvent> and thus convert it to a plain builder rather than a builder of builder. - Add methods setCreatedTime , setMetricAndValue to the builder	2023-09-01 11:30:45 +05:30
John Gerassimou	d201ea0ece	prometheus-emitter: add extraLabels parameter (#14728 ) * prometheus-emitter: add extraLabels parameter * prometheus-emitter: update readme to include the extraLabels parameter * prometheus-emitter: remove nullable and surface label name issues * remove import to make linter happy	2023-08-29 12:02:22 -07:00
George Shiqi Wu	95b0de61d1	Move some lifecycle management from doTask -> shutdown for the mm-less task runner (#14895 ) * save work * Add syncronized * Don't shutdown in run * Adding unit tests * Cleanup lifecycle * Fix tests * remove newline	2023-08-25 10:50:38 -06:00
George Shiqi Wu	ad32f84586	Fix capacity response in mm-less ingestion (#14888 ) Changes: - Fix capacity response in mm-less ingestion. - Add field usedClusterCapacity to the GET /totalWorkerCapacity response. This API should be used to get the total ingestion capacity on the overlord. - Remove method `isK8sTaskRunner` from interface `TaskRunner`	2023-08-25 08:17:38 +05:30
Tejaswini Bandlamudi	388d5ecf78	Fix reported CVEs (#14882 ) Suppress CVEs from dependencies with no available fix or false positives hadoop-annotations: CVE-2022-25168, CVE-2021-33036 hadoop-client-runtime: CVE-2023-1370, CVE-2023-37475 okio: CVE-2023-3635 Upgrade grpc version to fix CVE-2023-33953	2023-08-24 19:28:55 +05:30
Clint Wylie	36e659a501	remove group-by v1 (#14866 ) * remove group-by v1 * docs * remove unused configs, fix test * fix test * adjustments * why not * adjust * review stuff	2023-08-23 12:44:06 -07:00
Tejaswini Bandlamudi	d87056e708	Upgrade guava version to 31.1-jre (#14767 ) Currently, Druid is using Guava 16.0.1 version. This upgrade to 31.1-jre fixes the following issues. CVE-2018-10237 (Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows remote attackers to conduct denial of service attacks against servers that depend on this library and deserialize attacker-provided data because the AtomicDoubleArray class (when serialized with Java serialization) and the CompoundOrdering class (when serialized with GWT serialization) perform eager allocation without appropriate checks on what a client has sent and whether the data size is reasonable). We don't use Java or GWT serializations. Despite being false positive they're causing red security scans on Druid distribution. Latest version of google-client-api is incompatible with the existing Guava version. This PR unblocks Update google client apis to latest version #14414	2023-08-22 12:09:53 +05:30
Abhishek Radhakrishnan	37db5d9b81	Reset offsets supervisor API (#14772 ) * Add supervisor /resetOffsets API. - Add a new endpoint /druid/indexer/v1/supervisor/<supervisorId>/resetOffsets which accepts DataSourceMetadata as a body parameter. - Update logs, unit tests and docs. * Add a new interface method for backwards compatibility. * Rename * Adjust tests and javadocs. * Use CoreInjectorBuilder instead of deprecated makeInjectorWithModules * UT fix * Doc updates. * remove extraneous debugging logs. * Remove the boolean setting; only ResetHandle() and resetInternal() * Relax constraints and add a new ResetOffsetsNotice; cleanup old logic. * A separate ResetOffsetsNotice and some cleanup. * Minor cleanup * Add a check & test to verify that sequence numbers are only of type SeekableStreamEndSequenceNumbers * Add unit tests for the no op implementations for test coverage * CodeQL fix * checkstyle from merge conflict * Doc changes * DOCUSAURUS code tabs fix. Thanks, Brian!	2023-08-17 14:13:10 -07:00
Clint Wylie	6b14dde50e	deprecate config-magic in favor of json configuration stuff (#14695 ) * json config based processing and broker merge configs to deprecate config-magic	2023-08-16 18:23:57 -07:00
dependabot[bot]	faf79470ae	Bump io.dropwizard.metrics:metrics-graphite from 3.1.2 to 4.2.19 (#14842 ) * Bump io.dropwizard.metrics:metrics-graphite from 3.1.2 to 4.2.19 Bumps [io.dropwizard.metrics:metrics-graphite](https://github.com/dropwizard/metrics) from 3.1.2 to 4.2.19. - [Release notes](https://github.com/dropwizard/metrics/releases) - [Commits](https://github.com/dropwizard/metrics/compare/v3.1.2...v4.2.19) --- updated-dependencies: - dependency-name: io.dropwizard.metrics:metrics-graphite dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * align graphite-emitter dropwizard version with core --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Xavier Léauté <xvrl@apache.org>	2023-08-16 13:58:35 -07:00
YongGang	3954685aae	Report more metrics to monitor K8s task runner (#14771 ) * Report pod running metrics to monitor K8s task runner * refine method definition * fix checkstyle * implement task metrics * more comment * address comments * update doc for the new metrics reported * fix checkstyle * refine method definition * minor refine	2023-08-16 14:03:53 -04:00
Rishabh Singh	0dc305f9e4	Upgrade hibernate validator version to fix CVE-2019-10219 (#14757 )	2023-08-14 11:50:51 +05:30
George Shiqi Wu	c8a11702db	Support broadcast segments (#14789 )	2023-08-11 11:14:05 -07:00
hqx871	a0234c4e13	Add sampling factor for DeterminePartitionsJob (#13840 ) There are two type of DeterminePartitionsJob: - When the input data is not assume grouped, there may be duplicate rows. In this case, two MR jobs are launched. The first one do group job to remove duplicate rows. And a second one to perform global sorting to find lower and upper bound for target segments. - When the input data is assume grouped, we only need to launch the global sorting MR job to find lower and upper bound for segments. Sampling strategy: - If the input data is assume grouped, sample by random at the mapper side of the global sort mr job. - If the input data is not assume grouped, sample at the mapper of the group job. Use hash on time and all dimensions and mod by sampling factor to sample, don't use random method because there may be duplicate rows.	2023-08-11 10:42:25 +05:30
zachjsh	82d82dfbd6	Add stats to KillUnusedSegments coordinator duty (#14782 ) ### Description Added the following metrics, which are calculated from the `KillUnusedSegments` coordinatorDuty `"killTask/availableSlot/count"`: calculates the number remaining task slots available for auto kill `"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill `"killTask/task/count"`: calculates the number of tasks submitted by auto kill. #### Release note NEW: metrics added for auto kill `"killTask/availableSlot/count"`: calculates the number remaining task slots available for auto kill `"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill `"killTask/task/count"`: calculates the number of tasks submitted by auto kill.	2023-08-10 18:36:53 -04:00
Xavier Léauté	37ed0f4a17	Bump jclouds.version from 1.9.1 to 2.0.3 (#14746 ) * Updates `org.apache.jclouds:` from 1.9.1 to 2.0.3 Pin jclouds to 2.0.x since 2.1.x requires Guava 18+ * replace easymock with mockito Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-08-10 06:24:01 -07:00
George Shiqi Wu	c8537dbeaf	Add lifecycle hooks to KubernetesTaskRunner (#14790 )	2023-08-09 21:16:44 -07:00
Tejaswini Bandlamudi	a45b25fa1d	Removes support for Hadoop 2 (#14763 ) Removing Hadoop 2 support as discussed in https://lists.apache.org/list?dev@druid.apache.org:lte=1M:hadoop	2023-08-09 17:47:52 +05:30
Tejaswini Bandlamudi	550a66d71e	Upgrade jackson-databind to 2.12.7 (#14770 ) The current version of jackson-databind is flagged for vulnerabilities CVE-2020-28491 (Although cbor format is not used in druid), CVE-2020-36518 (Seems genuine as deeply nested json in can cause resource exhaustion). Updating the dependency to the latest version 2.12.7 to fix these vulnerabilities.	2023-08-09 12:22:16 +05:30
George Shiqi Wu	14940dc3ed	Add pod name to TaskLocation for easier observability and debugging. (#14758 ) * Add pod name to location * Add log * fix style * Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java Co-authored-by: Suneet Saldanha <suneet@apache.org> * Fix unit tests --------- Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-08-07 12:33:35 -07:00
Suneet Saldanha	62ddeaf16f	Additional dimensions for service/heartbeat (#14743 ) * Additional dimensions for service/heartbeat * docs * review * review	2023-08-04 11:01:07 -07:00
YongGang	3335040b22	Report task/pending/time metrics for k8s based ingestion (#14698 ) Changes: * Add and invoke `StateListener` when state changes in `KubernetesPeonLifecycle` * Report `task/pending/time` metric in `KubernetesTaskRunner` when state moves to RUNNING	2023-08-04 09:07:11 +05:30
imply-cheddar	748874405c	Minimize PostAggregator computations (#14708 ) * Minimize PostAggregator computations Since a change back in 2014, the topN query has been computing all PostAggregators on all intermediate responses from leaf nodes to brokers. This generates significant slow downs for queries with relatively expensive PostAggregators. This change rewrites the query that is pushed down to only have the minimal set of PostAggregators such that it is impossible for downstream processing to do too much work. The final PostAggregators are applied at the very end.	2023-08-04 00:04:31 +05:30
George Shiqi Wu	174053f4fd	Add readme for kubernetes-overlord-extensions and update docs (#14674 ) * Add readme for kubernetes task scheduler * clean up uneeded stuff * Update extensions-contrib/kubernetes-overlord-extensions/README.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * Move documentation into main page * indentation * cleanup spellcheck errors * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update extensions-contrib/kubernetes-overlord-extensions/README.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * PR comments * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/development/extensions-contrib/k8s-jobs.md Co-authored-by: Suneet Saldanha <suneet@apache.org> --------- Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Suneet Saldanha <suneet@apache.org>	2023-08-01 13:29:44 -07:00
YongGang	9b88b78ba4	Fix race condition in KubernetesTaskRunner when task is added to the map (#14643 ) Changes: - Fix race condition in KubernetesTaskRunner introduced by #14435 - Perform addition and removal from map inside a synchronized block - Update tests	2023-07-27 12:34:36 +05:30
George Shiqi Wu	f742bb7376	Get task location should be stored on the lifecycle object (#14649 ) * Fix issue with long data source names * Use the regular library * Save location and tls enabled * Null out before running * add another comment	2023-07-24 18:36:19 -07:00
George Shiqi Wu	28914bbab8	Fix issue with long data source names (#14620 ) * Fix issue with long data source names * Use the regular library * fix overlord utils test	2023-07-24 08:45:10 -07:00
Abhishek Agarwal	efb32810c4	Clean up the core API required for Iceberg extension (#14614 ) Changes: - Replace `AbstractInputSourceBuilder` with `InputSourceFactory` - Move iceberg specific logic to `IcebergInputSource`	2023-07-21 13:01:33 +05:30
Clint Wylie	913416c669	add equality, null, and range filter (#14542 ) changes: * new filters that preserve match value typing to better handle filtering different column types * sql planner uses new filters by default in sql compatible null handling mode * remove isFilterable from column capabilities * proper handling of array filtering, add array processor to column processors * javadoc for sql test filter functions * range filter support for arrays, tons more tests, fixes * add dimension selector tests for mixed type roots * support json equality * rename semantic index maker thingys to mostly have plural names since they typically make many indexes, e.g. StringValueSetIndex -> StringValueSetIndexes * add cooler equality index maker, ValueIndexes * fix missing string utf8 index supplier * expression array comparator stuff	2023-07-18 12:15:22 -07:00
Kashif Faraz	993d8a9bf6	Bump up version in iceberg pom (#14605 )	2023-07-18 18:07:19 +05:30
AmatyaAvadhanula	0412f40d36	Prepare master branch for next release, 28.0.0 (#14595 ) * Prepare master branch for next release, 28.0.0	2023-07-18 09:22:30 +05:30
Atul Mohan	03d6d395a0	Extension to read and ingest iceberg data files (#14329 ) This adds a new contrib extension: druid-iceberg-extensions which can be used to ingest data stored in Apache Iceberg format. It adds a new input source of type iceberg that connects to a catalog and retrieves the data files associated with an iceberg table and provides these data file paths to either an S3 or HDFS input source depending on the warehouse location. Two important dependencies associated with Apache Iceberg tables are: Catalog : This extension supports reading from either a Hive Metastore catalog or a Local file-based catalog. Support for AWS Glue is not available yet. Warehouse : This extension supports reading data files from either HDFS or S3. Adapters for other cloud object locations should be easy to add by extending the AbstractInputSourceAdapter.	2023-07-18 08:59:57 +05:30
YongGang	214f7c3f65	Expose leader dimension in service/heartbeat metric into statsd-reporter (#14593 )	2023-07-17 10:38:24 +05:30
YongGang	0ca3ba0b30	Add service/heartbeat metric into statsd-reporter (#14564 )	2023-07-11 12:38:08 -07:00
Jan Werner	95115d722a	CVE fixes - update of multiple dependencies. (#14519 ) Apache Druid brings multiple direct and transitive dependencies that are affected by plethora of CVEs. This PR attempts to update all the dependencies that did not require code refactoring. This PR modifies pom files, license file and OWASP Dependency Check suppression file.	2023-07-07 20:27:30 +05:30
George Shiqi Wu	bd07c3dd43	Don't need to double synchronize on simple map operations (#14435 ) * Don't need to double syncronize on simple map operations * remove lock	2023-06-17 17:30:37 -07:00
George Shiqi Wu	76e70654ac	Fix issues when startup timeout is hit (#14425 )	2023-06-14 11:49:55 -07:00
Abhishek Radhakrishnan	1c76ebad3b	Minor doc updates. (#14409 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2023-06-12 15:24:48 -07:00
Abhishek Radhakrishnan	2d258a95ad	Fix `EARLIEST_BY`/`LATEST_BY` signature and include function name in signature. (#14352 ) * Fix EarliestLatestBySqlAggregator signature; Include function name for all signatures. * Single quote function signatures, space between args and remove \n. * fixup UT assertion	2023-06-06 09:41:05 -07:00
Harini Rajendran	4ff6026d30	Adding SegmentMetadataEvent and publishing them via KafkaEmitter (#14281 ) In this PR, we are enhancing KafkaEmitter, to emit metadata about published segments (SegmentMetadataEvent) into a Kafka topic. This segment metadata information that gets published into Kafka, can be used by any other downstream services to query Druid intelligently based on the segments published. The segment metadata gets published into kafka topic in json string format similar to other events.	2023-06-02 21:28:26 +05:30
zachjsh	04a82da63d	Input source security fixes (#14266 ) It was found that several supported tasks / input sources did not have implementations for the methods used by the input source security feature, causing these tasks and input sources to fail when used with this feature. This pr adds the needed missing implementations. Also securing the sampling endpoint with input source security, when enabled.	2023-06-01 16:37:19 -07:00
George Shiqi Wu	cb65135b99	Fix log streaming (#14285 ) * Fix log streaming * Add watch log * Add unit tests * long running client * singleton client * Remove accidental close	2023-05-22 11:19:53 -07:00
George Shiqi Wu	51f722b7f1	Fix labels (#14282 ) * Fix labels * move to a util function * style * PR comments * rename class	2023-05-18 11:51:58 -07:00
imply-cheddar	f9861808bc	Be able to load segments on Peons (#14239 ) * Be able to load segments on Peons This change introduces a new config on WorkerConfig that indicates how many bytes of each storage location to use for storage of a task. Said config is divided up amongst the locations and slots and then used to set TaskConfig.tmpStorageBytesPerTask The Peons use their local task dir and tmpStorageBytesPerTask as their StorageLocations for the SegmentManager such that they can accept broadcast segments.	2023-05-12 16:51:00 -07:00
Nicholas Lippis	58dcbf9399	queue tasks in kubernetes task runner if capacity is fully utilized (#14156 ) * queue tasks if all slots in use * Declare hamcrest-core dependency * Use AtomicBoolean for shutdown requested * Use AtomicReference for peon lifecycle state * fix uninitialized read error * fix indentations * Make tasks protected * fix KubernetesTaskRunnerConfig deserialization * ensure k8s task runner max capacity is Integer.MAX_VALUE * set job duration as task status duration * Address pr comments --------- Co-authored-by: George Shiqi Wu <george.wu@imply.io>	2023-05-12 09:41:44 -06:00
Clint Wylie	6db11bfc60	suppress some cves and fix javadoc build when using java 17 (#14241 )	2023-05-10 15:47:10 -07:00
George Shiqi Wu	161d12eb44	Fix unit tests for java 17 (#14207 ) Fix a unit test that fails in java 17	2023-05-09 20:02:31 +05:30
minseok	3c62c00d4c	Fix Typos in DruidToGraphiteEventConverter (#14219 )	2023-05-08 17:46:32 +05:30
George Shiqi Wu	eed5f4f291	Add labels to k8s jobs for the PodTemplateTaskAdapter (#14205 ) * Add labels * Add prefix * remove newline * fix syntax * Update prefix	2023-05-08 10:56:52 +08:00
Churro	123c4908c8	Ephemeral storage is respected from the overlod for peon tasks (#14201 )	2023-05-05 16:27:29 -07:00
Clint Wylie	90ea192d9c	fix bugs with auto encoded long vector deserializers (#14186 ) This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so impacts Druid versions 0.22.0+. While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.	2023-05-01 11:49:27 +05:30
Nicholas Lippis	6579c1c5b6	remove unneeded TaskLogStreamer binding override (#14176 )	2023-04-27 19:39:24 +05:30
Tejaswini Bandlamudi	774073b2e7	Update Hadoop3 as default build version (#14005 ) Hadoop 2 often causes red security scans on Druid distribution because of the dependencies it brings. We want to move away from Hadoop 2 and provide Hadoop 3 distribution available. Switch druid to building with Hadoop 3 by default. Druid will still be compatible with Hadoop 2 and users can build hadoop-2 compatible distribution using hadoop2 profile.	2023-04-26 12:52:51 +05:30
Nicholas Lippis	9d4cc501f7	return task status reported by peon (#14040 ) * return task status reported by peon * Write TaskStatus to file in AbstractTask.cleanUp * Get TaskStatus from task log * Fix merge conflicts in AbstractTaskTest * Add unit tests for TaskLogPusher, TaskLogStreamer, NoopTaskLogs to satisfy code coverage * Add license headerss * Fix style * Remove unknown exception declarations	2023-04-24 12:05:39 -07:00

671 Commits