Commit Graph

68 Commits

Author SHA1 Message Date
Jan Werner fa2c8edb5d
unpin snakeyaml, add suppressions and licenses (#15549)
* unpin snakeyaml globally, add suppressions and licenses
* pin snakeyaml in the specific modules that require version 1.x, update licenses and owasp suppression

This removes the pin of the Snakeyaml introduced in:  https://github.com/apache/druid/pull/14519
After the updates of io.kubernetes.java-client and io.confluent.kafka-clients, the only uses of the Snakeyaml 1.x are:
- in test scope, transitive dependency of jackson-dataformat-yaml🫙2.12.7
- in compile scope in contrib extension druid-cassandra-storage
- in compile scope in it-tests. 

With the dependency version un-pinned, io.kubernetes.java-client and io.confluent.kafka-clients bring Snakeyaml versions 2.0 and 2.2, consequently allowing to build a Druid distribution without the contrib-extension and free of vulnerable Snakeyaml versions.
2023-12-15 10:33:14 -08:00
George Shiqi Wu 4152f1d147
Fix empty logs and status messages for mmless ingestion (#15527)
* Fix empty logs and status messages for mmless ingestion

* Add tests
2023-12-11 13:20:45 -05:00
Kashif Faraz 58a724c7e4
Use StubServiceEmitter in tests (#15426)
* Use StubServiceEmitter in tests
* Remove unthrown exception from declaration
2023-11-28 09:43:09 +05:30
George Shiqi Wu 130bfbfc6d
Revert "Separate task lifecycle from kubernetes/location lifecycle (#15133)" (#15346)
This reverts commit dc0b163e19.
2023-11-08 13:12:30 -05:00
YongGang 7a25ee4fd9
Ability to send task types to k8s or worker task runner (#15196)
* Ability to send task types to k8s or worker task runner

* add more tests

* use runnerStrategy to determine task runner

* minor refine

* refine runner strategy config

* move workerType config to upper level

* validate config when application start
2023-10-25 09:55:56 -07:00
George Shiqi Wu dc0b163e19
Separate task lifecycle from kubernetes/location lifecycle (#15133)
* Separate k8s and druid task lifecycles

* Remove extra log lines

* Fix unit tests

* fix unit tests

* Fix unit tests

* notify listeners on task completion

* Fix unit test

* unused var

* PR changes

* Fix unit tests

* Fix checkstyle

* PR changes
2023-10-17 08:17:43 -07:00
Laksh Singla 5f86072456
Prepare master for Druid 29 (#15121)
Prepare master for Druid 29
2023-10-11 10:33:45 +05:30
Xavier Léauté adef2069b1
Make unit tests pass with Java 21 (#15014)
This change updates dependencies as needed and fixes tests to remove code incompatible with Java 21
As a result all unit tests now pass with Java 21.

* update maven-shade-plugin to 3.5.0 and follow-up to #15042
  * explain why we need to override configuration when specifying outputFile
  * remove configuration from dependency management in favor of explicit overrides in each module.
* update to mockito to 5.5.0 for Java 21 support when running with Java 11+
  * continue using latest mockito 4.x (4.11.0) when running with Java 8  
  * remove need to mock private fields
* exclude incorrectly declared mockito dependency from pac4j-oidc
* remove mocking of ByteBuffer, since sealed classes can no longer be mocked in Java 21
* add JVM options workaround for system-rules junit plugin not supporting Java 18+
* exclude older versions of byte-buddy from assertj-core
* fix for Java 19 changes in floating point string representation
* fix missing InitializedNullHandlingTest
* update easymock to 5.2.0 for Java 21 compatibility
* update animal-sniffer-plugin to 1.23
* update nl.jqno.equalsverifier to 3.15.1
* update exec-maven-plugin to 3.1.0
2023-10-03 22:41:21 -07:00
George Shiqi Wu 64754b6799
Allow users to pass task payload via deep storage instead of environment variable (#14887)
This change is meant to fix a issue where passing too large of a task payload to the mm-less task runner will cause the peon to fail to startup because the payload is passed (compressed) as a environment variable (TASK_JSON). In linux systems the limit for a environment variable is commonly 128KB, for windows systems less than this. Setting a env variable longer than this results in a bunch of "Argument list too long" errors.
2023-10-03 14:08:59 +05:30
George Shiqi Wu 8e22a178cc
Support getTaskLocation for mixed task runner (#15033)
The KubernetesAndWorkerTaskRunner currently doesn't implement getTaskLocation, so tasks run by it will show a unknown TaskLocation in the druid console after a task has completed.

Fix bug in KubernetesAndWorkerTaskRunner that manifests as missing information in the druid Web Console.
2023-09-27 08:57:36 +05:30
Tejaswini Bandlamudi 48b6d2abf9
skip org.owasp:dependency-check on extensions-contrib modules and suppress false-positive gRPC CVEs (#15026) 2023-09-25 12:14:42 +05:30
YongGang be3f93e3cf
Restore tasks when lifecycle start (#14909)
* K8s tasks restore should be from lifecycle start

* add test

* add more tests

* fix test

* wait tasks restore finish when start

* fix style

* revert previous change and add comment
2023-09-22 12:03:34 -07:00
George Shiqi Wu d459df8d6e
Fix log syntax (#15004) 2023-09-18 10:40:02 -07:00
George Shiqi Wu f773d83914
Mixed task runner for migration to mm-less ingestion (#14918)
* save work

* Working

* Fix runner constructor

* Working runner

* extra log lines

* try using lifecycle for everything

* clean up configs

* cleanup /workers call

* Use a single config

* Allow selecting runner

* debug changes

* Work on composite task runner

* Unit tests running

* Add documentation

* Add some javadocs

* Fix spelling

* Use standard libraries

* code review

* fix

* fix

* use taskRunner as string

* checkstyl

---------

Co-authored-by: Suneet Saldanha <suneet@apache.org>
2023-09-11 18:09:46 -07:00
Kashif Faraz 289ee1e011
Refactor: Cleanup NoopTask (#14938)
Changes:
- Simplify static `create` methods for `NoopTask`
- Remove `FirehoseFactory`, `IsReadyResult`, `readyTime` from `NoopTask`
as these fields were not being used anywhere
- Update tests
2023-09-05 09:15:41 +05:30
Kashif Faraz 7f26b80e21
Simplify ServiceMetricEvent.Builder (#14933)
Changes:
- Make ServiceMetricEvent.Builder extend ServiceEventBuilder<ServiceMetricEvent>
and thus convert it to a plain builder rather than a builder of builder.
- Add methods setCreatedTime , setMetricAndValue to the builder
2023-09-01 11:30:45 +05:30
George Shiqi Wu 95b0de61d1
Move some lifecycle management from doTask -> shutdown for the mm-less task runner (#14895)
* save work

* Add syncronized

* Don't shutdown in run

* Adding unit tests

* Cleanup lifecycle

* Fix tests

* remove newline
2023-08-25 10:50:38 -06:00
George Shiqi Wu ad32f84586
Fix capacity response in mm-less ingestion (#14888)
Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`
2023-08-25 08:17:38 +05:30
Tejaswini Bandlamudi d87056e708
Upgrade guava version to 31.1-jre (#14767)
Currently, Druid is using Guava 16.0.1 version. This upgrade to 31.1-jre fixes the following issues.

CVE-2018-10237 (Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows remote attackers to conduct denial of service attacks against servers that depend on this library and deserialize attacker-provided data because the AtomicDoubleArray class (when serialized with Java serialization) and the CompoundOrdering class (when serialized with GWT serialization) perform eager allocation without appropriate checks on what a client has sent and whether the data size is reasonable). We don't use Java or GWT serializations. Despite being false positive they're causing red security scans on Druid distribution.
Latest version of google-client-api is incompatible with the existing Guava version. This PR unblocks Update google client apis to latest version #14414
2023-08-22 12:09:53 +05:30
YongGang 3954685aae
Report more metrics to monitor K8s task runner (#14771)
* Report pod running metrics to monitor K8s task runner

* refine method definition

* fix checkstyle

* implement task metrics

* more comment

* address comments

* update doc for the new metrics reported

* fix checkstyle

* refine method definition

* minor refine
2023-08-16 14:03:53 -04:00
Rishabh Singh 0dc305f9e4
Upgrade hibernate validator version to fix CVE-2019-10219 (#14757) 2023-08-14 11:50:51 +05:30
George Shiqi Wu c8a11702db
Support broadcast segmetns (#14789) 2023-08-11 11:14:05 -07:00
George Shiqi Wu c8537dbeaf
Add lifecycle hooks to KubernetesTaskRunner (#14790) 2023-08-09 21:16:44 -07:00
George Shiqi Wu 14940dc3ed
Add pod name to TaskLocation for easier observability and debugging. (#14758)
* Add pod name to location

* Add log

* fix style

* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesPeonLifecycle.java

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Fix unit tests

---------

Co-authored-by: Suneet Saldanha <suneet@apache.org>
2023-08-07 12:33:35 -07:00
YongGang 3335040b22
Report task/pending/time metrics for k8s based ingestion (#14698)
Changes:
* Add and invoke `StateListener` when state changes in `KubernetesPeonLifecycle`
* Report `task/pending/time` metric in `KubernetesTaskRunner` when state moves to RUNNING
2023-08-04 09:07:11 +05:30
George Shiqi Wu 174053f4fd
Add readme for kubernetes-overlord-extensions and update docs (#14674)
* Add readme for kubernetes task scheduler

* clean up uneeded stuff

* Update extensions-contrib/kubernetes-overlord-extensions/README.md

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* Move documentation into main page

* indentation

* cleanup spellcheck errors

* Update docs/development/extensions-contrib/k8s-jobs.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update extensions-contrib/kubernetes-overlord-extensions/README.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/development/extensions-contrib/k8s-jobs.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* PR comments

* Update docs/development/extensions-contrib/k8s-jobs.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/development/extensions-contrib/k8s-jobs.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/development/extensions-contrib/k8s-jobs.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

---------

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Suneet Saldanha <suneet@apache.org>
2023-08-01 13:29:44 -07:00
YongGang 9b88b78ba4
Fix race condition in KubernetesTaskRunner when task is added to the map (#14643)
Changes:
- Fix race condition in KubernetesTaskRunner introduced by #14435 
- Perform addition and removal from map inside a synchronized block
- Update tests
2023-07-27 12:34:36 +05:30
George Shiqi Wu f742bb7376
Get task location should be stored on the lifecycle object (#14649)
* Fix issue with long data source names

* Use the regular library

* Save location and tls enabled

* Null out before running

* add another comment
2023-07-24 18:36:19 -07:00
George Shiqi Wu 28914bbab8
Fix issue with long data source names (#14620)
* Fix issue with long data source names

* Use the regular library

* fix overlord utils test
2023-07-24 08:45:10 -07:00
AmatyaAvadhanula 0412f40d36
Prepare master branch for next release, 28.0.0 (#14595)
* Prepare master branch for next release, 28.0.0
2023-07-18 09:22:30 +05:30
Jan Werner 95115d722a
CVE fixes - update of multiple dependencies. (#14519)
Apache Druid brings multiple direct and transitive dependencies that are affected by plethora of CVEs.
This PR attempts to update all the dependencies that did not require code refactoring.
This PR modifies pom files, license file and OWASP Dependency Check suppression file.
2023-07-07 20:27:30 +05:30
George Shiqi Wu bd07c3dd43
Don't need to double synchronize on simple map operations (#14435)
* Don't need to double syncronize on simple map operations

* remove lock
2023-06-17 17:30:37 -07:00
George Shiqi Wu 76e70654ac
Fix issues when startup timeout is hit (#14425) 2023-06-14 11:49:55 -07:00
George Shiqi Wu cb65135b99
Fix log streaming (#14285)
* Fix log streaming

* Add watch log

* Add unit tests

* long running client

* singleton client

* Remove accidental close
2023-05-22 11:19:53 -07:00
George Shiqi Wu 51f722b7f1
Fix labels (#14282)
* Fix labels

* move to a util function

* style

* PR comments

* rename class
2023-05-18 11:51:58 -07:00
Nicholas Lippis 58dcbf9399
queue tasks in kubernetes task runner if capacity is fully utilized (#14156)
* queue tasks if all slots in use

* Declare hamcrest-core dependency

* Use AtomicBoolean for shutdown requested

* Use AtomicReference for peon lifecycle state

* fix uninitialized read error

* fix indentations

* Make tasks protected

* fix KubernetesTaskRunnerConfig deserialization

* ensure k8s task runner max capacity is Integer.MAX_VALUE

* set job duration as task status duration

* Address pr comments

---------

Co-authored-by: George Shiqi Wu <george.wu@imply.io>
2023-05-12 09:41:44 -06:00
George Shiqi Wu 161d12eb44
Fix unit tests for java 17 (#14207)
Fix a unit test that fails in java 17
2023-05-09 20:02:31 +05:30
George Shiqi Wu eed5f4f291
Add labels to k8s jobs for the PodTemplateTaskAdapter (#14205)
* Add labels

* Add prefix

* remove newline

* fix syntax

* Update prefix
2023-05-08 10:56:52 +08:00
Churro 123c4908c8
Ephemeral storage is respected from the overlod for peon tasks (#14201) 2023-05-05 16:27:29 -07:00
Clint Wylie 90ea192d9c
fix bugs with auto encoded long vector deserializers (#14186)
This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so impacts Druid versions 0.22.0+.

While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.
2023-05-01 11:49:27 +05:30
Nicholas Lippis 6579c1c5b6
remove unneeded TaskLogStreamer binding override (#14176) 2023-04-27 19:39:24 +05:30
Nicholas Lippis 9d4cc501f7
return task status reported by peon (#14040)
* return task status reported by peon

* Write TaskStatus to file in AbstractTask.cleanUp

* Get TaskStatus from task log

* Fix merge conflicts in AbstractTaskTest

* Add unit tests for TaskLogPusher, TaskLogStreamer, NoopTaskLogs to satisfy code coverage

* Add license headerss

* Fix style

* Remove unknown exception declarations
2023-04-24 12:05:39 -07:00
imply-cheddar aaa6cc1883
Make the tasks run with only a single directory (#14063)
* Make the tasks run with only a single directory

There was a change that tried to get indexing to run on multiple disks
It made a bunch of changes to how tasks run, effectively hiding the
"safe" directory for tasks to write files into from the task code itself
making it extremely difficult to do anything correctly inside of a task.

This change reverts those changes inside of the tasks and makes it so that
only the task runners are the ones that make decisions about which
mount points should be used for storing task-related files.

It adds the config druid.worker.baseTaskDirs which can be used by the
task runners to know which directories they should schedule tasks inside of.
The TaskConfig remains the authoritative source of configuration for where
and how an individual task should be operating.
2023-04-13 00:45:02 -07:00
George Shiqi Wu 00d777d848
Fix race condition in KubernetesTaskRunner between shutdown and getKnownTasks (#14030)
* Fix issues with null pointers on jobResponse

* fix unit tests

* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* nullable

* fix error message

* Use jobs for known tasks instead of pods

* Remove log lines

* remove log lines

* PR change requests

* revert wait change

---------

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
2023-04-10 13:27:49 -07:00
Clint Wylie 1aef72aa7e
Bump up the version in pom to 27.0.0 in preparation of release (#14051) 2023-04-10 14:56:59 +05:30
Nicholas Lippis 5810e650d4
K8s mm less fixes (#14028)
Update Fabric8 version and allow metrics monitors to be overriden
2023-04-05 22:23:16 +05:30
George Shiqi Wu f60f377e5f
Fix issues with null pointers on jobResponse (#14010)
* Fix issues with null pointers on jobResponse

* fix unit tests

* Update extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/common/DruidKubernetesPeonClient.java

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* nullable

* fix error message

---------

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
2023-04-04 17:48:18 -07:00
George Shiqi Wu 4560b9d8aa
New error message for task deletion (#14008)
* New error message

* Add unit test
2023-04-03 14:26:09 -07:00
Nicholas Lippis 61a35262ec
Kubernetes task runner live reports (#13986)
Implement Live Reports for the KubernetesTaskRunner
2023-03-30 10:30:22 +05:30
George Shiqi Wu 44abe2b96f
Fix bug in k8s task runner in handling deleted jobs (#14001)
With the KubernetesTaskRunner, if a task is manually shutdown via the web console while running or the corresponding k8s job is manually deleted, the thread responsible for overseeing the task gets stuck in a loop because the fabric8 client sends one event to it that the job is null when the job is deleted, but this doesn't pass the condition.

This means that the thread is stuck waiting on a fabric8 event (the job being successful) that will never come up until maxTaskDuration (default 4 hours). If a user of the extension is trying to use a limited taskqueue maxSize, this can cause problems as the k8s executor pool is unable to pick up additional tasks (since threads are stuck waiting on the old tasks that have already been deleted).
2023-03-30 10:09:52 +05:30