Commit Graph

72 Commits

Author SHA1 Message Date
Kashif Faraz 647686aee2
Add test and metrics for KillStalePendingSegments duty (#14951)
Changes:
- Add new metric `kill/pendingSegments/count` with dimension `dataSource`
- Add tests for `KillStalePendingSegments`
- Reduce no-op logs that spit out for each datasource even when no pending
segments have been deleted. This can get particularly noisy at low values of `indexingPeriod`.
- Refactor the code in `KillStalePendingSegments` for readability and add javadocs
2023-09-08 10:33:47 +05:30
Hardik Bajaj e100b18e86
Updated documentation for OshiSysMonitor (#14912) 2023-09-07 16:54:33 +05:30
Laksh Singla 6ee0b06e38
Auto configuration for maxSubqueryBytes (#14808)
A new monitor SubqueryCountStatsMonitor which emits the metrics corresponding to the subqueries and their execution is now introduced. Moreover, the user can now also use the auto mode to automatically set the number of bytes available per query for the inlining of its subquery's results.
2023-09-06 05:47:19 +00:00
317brian 6b4dda964d
Docusaurus2 upgrade for master (#14411)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2023-08-16 19:01:21 -07:00
zachjsh 82d82dfbd6
Add stats to KillUnusedSegments coordinator duty (#14782)
### Description

Added the following metrics, which are calculated from the `KillUnusedSegments` coordinatorDuty

`"killTask/availableSlot/count"`: calculates the number remaining task slots available for auto kill
`"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill
`"killTask/task/count"`: calculates the number of tasks submitted by auto kill. 

#### Release note
NEW: metrics added for auto kill

`"killTask/availableSlot/count"`: calculates the number remaining task slots available for auto kill
`"killTask/maxSlot/count"`: calculates the maximum number of tasks available for auto kill
`"killTask/task/count"`: calculates the number of tasks submitted by auto kill.
2023-08-10 18:36:53 -04:00
Suneet Saldanha 2af0ab2425
Metric to report time spent fetching and analyzing segments (#14752)
* Metric to report time spent fetching and analyzing segments

* fix test

* spell check

* fix tests

* checkstyle

* remove unused variable

* Update docs/operations/metrics.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/metrics.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/metrics.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

---------

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
2023-08-07 18:32:48 -07:00
Suneet Saldanha 62ddeaf16f
Additional dimensions for service/heartbeat (#14743)
* Additional dimensions for service/heartbeat

* docs

* review

* review
2023-08-04 11:01:07 -07:00
Kashif Faraz 10328c0743
Rename metadatacache and serverview metrics (#14716) 2023-08-01 14:18:20 +05:30
Katya Macedo 0b9e4af443
Clean up some of the descriptions (#14661)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2023-07-27 12:11:58 -07:00
Gian Merlino bac5ef347c
Add ingest/input/bytes metric and Kafka consumer metrics. (#14582)
* Add ingest/input/bytes metric and Kafka consumer metrics.

New metrics:

1) ingest/input/bytes. Equivalent to processedBytes in the task reports.

2) kafka/consumer/bytesConsumed: Equivalent to the Kafka consumer
   metric "bytes-consumed-total". Only emitted for Kafka tasks.

3) kafka/consumer/recordsConsumed: Equivalent to the Kafka consumer
   metric "records-consumed-total". Only emitted for Kafka tasks.

* Fix anchor.

* Fix KafkaConsumerMonitor.

* Interface updates.

* Doc changes.

* Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTask.java

Co-authored-by: Benedict Jin <asdf2014@apache.org>

---------

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2023-07-20 10:56:22 +08:00
Kashif Faraz 88dc330da2
Docs: Changes for coordinator improvements done in #13197 (#14590) 2023-07-18 14:22:00 +05:30
Gian Merlino 3ff51487b7
Add ZooKeeper connection state alerts and metrics. (#14333)
* Add ZooKeeper connection state alerts and metrics.

- New metric "zk/connected" is an indicator showing 1 when connected,
  0 when disconnected.
- New metric "zk/disconnected/time" measures time spent disconnected.
- New alert when Curator connection state enters LOST or SUSPENDED.

* Use right GuardedBy.

* Test fixes, coverage.

* Adjustment.

* Fix tests.

* Fix ITs.

* Improved injection.

* Adjust metric name, add tests.
2023-07-12 09:34:28 -07:00
Abhishek Radhakrishnan 854ef98235
Minor doc fixes. (#14565)
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2023-07-11 13:12:40 -07:00
Kashif Faraz 87bb1b9709
Fix bug during initialization of HttpServerInventoryView (#14517)
If a server is removed during `HttpServerInventoryView.serverInventoryInitialized`,
the initialization gets stuck as this server is never synced. The method eventually times
out (default 250s).

Fix: Mark a server as stopped if it is removed. `serverInventoryInitialized` only waits for
non-stopped servers to sync.

Other changes:
- Add new metrics for better debugging of slow broker/coordinator startup
  - `segment/serverview/sync/healthy`: whether the server view is syncing properly with a server
  - `segment/serverview/sync/unstableTime`: time for which sync with a server has been unstable  
- Clean up logging in `HttpServerInventoryView` and `ChangeRequestHttpSyncer`
- Minor refactor for readability
- Add utility class `Stopwatch`
- Add tests and stubs
2023-07-06 13:04:53 +05:30
Kashif Faraz a6547febaf
Remove unused coordinator dynamic configs (#14524)
After #13197 , several coordinator configs are now redundant as they are not being
used anymore, neither with `smartSegmentLoading` nor otherwise.

Changes:
- Remove dynamic configs `emitBalancingStats`: balancer error stats are always
emitted, debug stats can be logged by using `debugDimensions`
- `useBatchedSegmentSampler`, `percentOfSegmentsToConsiderPerMove`:
batched segment sampling is always used
- Add test to verify deserialization with unknown properties
- Update `CoordinatorRunStats` to always track stats, this can be optimized later.
2023-07-06 12:11:10 +05:30
YongGang b7434be99e
Add ServiceStatusMonitor to monitor service health (#14443)
* Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status

* make the monitor more general

* resolve conflict

* use Supplier pattern to provide metrics

* reformat code and doc

* move service specific tag to dimension

* minor refine

* update doc

* reformat code

* address comments

* remove declared exception

* bind HeartbeatSupplier conditionally in Coordinator
2023-06-26 10:26:37 -07:00
Rishabh Singh 155fde33ff
Add metrics to SegmentMetadataCache refresh (#14453)
New metrics:
- `segment/metadatacache/refresh/time`: time taken to refresh segments per datasource
- `segment/metadatacache/refresh/count`: number of segments being refreshed per datasource
2023-06-23 16:51:08 +05:30
George Shiqi Wu 64af9bfe5b
Add groupId to metrics (#14402)
* Add group id as a dimension

* Revert changes

* Add to forking task runner

* Add missing metrics

* Fix indenting

* revert metrics

* Fix indentation
2023-06-16 09:28:16 -07:00
Maytas Monsereenusorn 5d76d0ea74
Fix segment/deleted/count metric not being emitted (#14433)
* Fix segment/deleted/count metric

* Fix segment/deleted/count metric

* Fix segment/deleted/count metric
2023-06-15 14:08:19 -07:00
Suneet Saldanha 44547614ae
Report engine as a dimension for sqlQuery metrics (#13906)
* Report engine as a dimension for sqlQuery metrics

* docs
2023-03-10 11:23:57 -08:00
Abhishek Agarwal 52bd9e6adb
Improved error message when topic name changes within same supervisor (#13815)
Improved error message when topic name changes within same supervisor

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2023-03-07 18:10:18 -08:00
Kashif Faraz 12f62e2c42
Clarify doc of ingest/handoff/time metric (#13856) 2023-02-28 10:37:47 +05:30
Suneet Saldanha 714ac07b52
Allow users to add additional metadata to ingestion metrics (#13760)
* Allow users to add additional metadata to ingestion metrics

When submitting an ingestion spec, users may pass a map of metadata
in the ingestion spec config that will be added to ingestion metrics.

This will make it possible for operators to tag metrics with other
metadata that doesn't necessarily line up with the existing tags
like taskId.

Druid clusters that ingest these metrics can take advantage of the
nested data columns feature to process this additional metadata.

* rename to tags

* docs

* tests

* fix test

* make code cov happy

* checkstyle
2023-02-08 18:07:23 -08:00
AmatyaAvadhanula dcdae84888
Add server view initialization metrics (#13716)
* Add server view init metrics

* Test coverage

* Rename metrics
2023-02-07 20:02:00 +05:30
Kashif Faraz c7229fc787
Limit max batch size for segment allocation, add docs (#13503)
Changes:
- Limit max batch size in `SegmentAllocationQueue` to 500
- Rename `batchAllocationMaxWaitTime` to `batchAllocationWaitTime` since the actual
wait time may exceed this configured value.
- Replace usage of `SegmentInsertAction` in `TaskToolbox` with `SegmentTransactionalInsertAction`
2022-12-07 10:07:14 +05:30
Katya Macedo fd239305d9
Update metrics doc (#13316)
Changes:
- used inline code-style to format dimension names
- removed unnecessary punctuation
2022-11-21 09:43:52 +05:30
Gian Merlino 77478f25fb
Add taskActionType dimension to task/action/run/time. (#13333)
* Add taskActionType dimension to task/action/run/time.

* Spelling.
2022-11-11 12:00:08 +05:30
AmatyaAvadhanula a2013e6566
Enhance streaming ingestion metrics (#13331)
Changes:
- Add a metric for partition-wise kafka/kinesis lag for streaming ingestion.
- Emit lag metrics for streaming ingestion when supervisor is not suspended and state is in {RUNNING, IDLE, UNHEALTHY_TASKS, UNHEALTHY_SUPERVISOR}
- Document metrics
2022-11-09 23:44:15 +05:30
Kashif Faraz ff8e0c3397
Fix issues with caching cost strategy (#13321)
`cachingCost` strategy has some discrepancies when compared to cost strategy.
This commit addresses two of these by retaining the same behaviour as the `cost` strategy
when computing the cost of moving a segment to a server:
- subtract the self cost of a segment if it is being served by the target server
- subtract the cost of segments that are marked to be dropped

Other changes:
- Add tests to verify fixed strategy. These tests would fail without the fixes made to `CachingCostStrategy.computeCost()`
- Fix the definition of the segment related metrics in the docs.
- Fix some docs issues introduced in #13181
2022-11-08 16:11:39 +05:30
AmatyaAvadhanula a738ac9ad7
Improve task pause logging and metrics for streaming ingestion (#13313)
* Improve task pause logging and metrics for streaming ingestion

* Add metrics doc

* Fix spelling
2022-11-07 21:33:54 +05:30
Gian Merlino 3bbb76f17b
Docs: Add query/cpu/time to real-time metrics. (#13229) 2022-10-15 18:26:44 +05:30
Rohan Garg 7aa8d7f987
Add query/time metric for SQL queries from router (#12867)
* Add query/time metric for SQL queries from router

* Fix query cancel bug when user has overriden native query-id in a SQL query
2022-09-07 13:54:46 +05:30
Rohan Garg 3c129f6728
Add sql planning time metric (#12923) 2022-08-22 11:09:44 +05:30
Gian Merlino ef6811ef88
Improved Java 17 support and Java runtime docs. (#12839)
* Improved Java 17 support and Java runtime docs.

1) Add a "Java runtime" doc page with information about supported
   Java versions, garbage collection, and strong encapsulation..

2) Update asm and equalsverifier to versions that support Java 17.

3) Add additional "--add-opens" lines to surefire configuration, so
   tests can pass successfully under Java 17.

4) Switch openjdk15 tests to openjdk17.

5) Update FrameFile to specifically mention Java runtime incompatibility
   as the cause of not being able to use Memory.map.

6) Update SegmentLoadDropHandler to log an error for Errors too, not
   just Exceptions. This is important because an IllegalAccessError is
   encountered when the correct "--add-opens" line is not provided,
   which would otherwise be silently ignored.

7) Update example configs to use druid.indexer.runner.javaOptsArray
   instead of druid.indexer.runner.javaOpts. (The latter is deprecated.)

* Adjustments.

* Use run-java in more places.

* Add run-java.

* Update .gitignore.

* Exclude hadoop-client-api.

Brought in when building on Java 17.

* Swap one more usage of java.

* Fix the run-java script.

* Fix flag.

* Include link to Temurin.

* Spelling.

* Update examples/bin/run-java

Co-authored-by: Xavier Léauté <xl+github@xvrl.net>

Co-authored-by: Xavier Léauté <xl+github@xvrl.net>
2022-08-03 23:16:05 -07:00
zachjsh c0380e7b0a
* fix duplicate dimension (#12778) 2022-07-14 10:39:03 +05:30
TSFenwick 8c02880d5f
Emit metrics for distribution of number of rows per segment (#12730)
* initial commit of bucket dimensions for metrics

return counts of segments that have rowcount in a bucket size for a datasource
return average value of rowcount per segment in a datasource
added unit test
naming could use a lot of work
buckets right now are not finalized
added javadocs
altered metrics.md

* fix checkstyle issues

* addressed review comments

add monitor test
move added functionality to new monitor
update docs

* address comments

renamed monitor
handle tombstones better
update docs
added javadocs

* Add support for tombstones in the segment distribution

* undo changes to tombstone segmentizer factory

* fix accidental whitespacing changes

* address comments regarding metrics documentation

and rename variable to be more accurate

* fix tests

* fix checkstyle issues

* fix broken test

* undo removal of timeout
2022-07-12 07:04:42 -07:00
Agustin Gonzalez 2f3d7a4c07
Emit state of replace and append for native batch tasks (#12488)
* Emit state of replace and append for native batch tasks

* Emit count of one depending on batch ingestion mode (APPEND, OVERWRITE, REPLACE)

* Add metric to compaction job

* Avoid null ptr exc when null emitter

* Coverage

* Emit tombstone & segment counts

* Tasks need a type

* Spelling

* Integrate BatchIngestionMode in batch ingestion tasks functionality

* Typos

* Remove batch ingestion type from metric since it is already in a dimension. Move IngestionMode to AbstractTask to facilitate having mode as a dimension. Add metrics to streaming. Add missing coverage.

* Avoid inner class referenced by sub-class inspection. Refactor computation of IngestionMode to make it more robust to null IOConfig and fix test.

* Spelling

* Avoid polluting the Task interface

* Rename computeCompaction methods to avoid ambiguous java compiler error if they are passed null. Other minor cleanup.
2022-05-23 12:32:47 -07:00
Rohan Garg 2dd073c2cd
Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation (#12484)
* Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* fixup! Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* Document vectorized dimension
2022-05-09 10:40:17 -07:00
Rocky Chen 770ad95169
Add a metric for task duration in the pending queue (#12492)
This PR is to measure how long a task stays in the pending queue and emits the value with the metric task/pending/time. The metric is measured in RemoteTaskRunner and HttpRemoteTaskRunner.

An example of the metric:

```
2022-04-26T21:59:09,488 INFO [rtr-pending-tasks-runner-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","timestamp":"2022-04-26T21:59:09.487Z","service":"druid/coordinator","host":"localhost:8081","version":"2022.02.0-iap-SNAPSHOT","metric":"task/pending/time","value":8,"dataSource":"wikipedia","taskId":"index_parallel_wikipedia_gecpcglg_2022-04-26T21:59:09.432Z","taskType":"index_parallel"}
```

------------------------------------------
Key changed/added classes in this PR

    Emit metric task/pending/time in classes RemoteTaskRunner and HttpRemoteTaskRunner.
    Update related factory classes and tests.
2022-05-02 23:47:25 -04:00
zachjsh 564d6defd4
Worker level task metrics (#12446)
* * fix metric name inconsistency

* * add task slot metrics for middle managers

* * add new WorkerTaskCountStatsMonitor to report task count metrics
  from worker

* * more stuff

* * remove unused variable

* * more stuff

* * add javadocs

* * fix checkstyle

* * fix hadoop test failure

* * cleanup

* * add more code coverage in tests

* * fix test failure

* * add docs

* * increase code coverage

* * fix spelling

* * fix failing tests

* * remove dead code

* * fix spelling
2022-04-26 11:44:44 -05:00
Victoria Lim 9ed7aa33ec
Docs for request logging (#12363)
* add docs for request logging

* remove stray character

* Update docs/operations/request-logging.md

Co-authored-by: TSFenwick <tsfenwick@gmail.com>

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: TSFenwick <tsfenwick@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-03-28 14:09:41 -07:00
Sandeep 61e1ffc7f7
add a new query laning metrics to visualize lane assignment (#12111)
* add a new query laning metrics to visualize lane assignment

* fixes :spotbugs check

* Update docs/operations/metrics.md

Co-authored-by: Benedict Jin <asdf2014@apache.org>

* Update server/src/main/java/org/apache/druid/server/QueryScheduler.java

Co-authored-by: Benedict Jin <asdf2014@apache.org>

* Update server/src/main/java/org/apache/druid/server/QueryScheduler.java

Co-authored-by: Benedict Jin <asdf2014@apache.org>

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2022-03-04 15:21:17 +08:00
Karan Kumar 377edff042
Ingestion metrics doc fix (#12066)
* Ingestion metrics doc fix.

* Fixing typo

* Adding missed keywords in ignore list
2021-12-15 12:51:53 +05:30
Lucas Capistrant 761fe9f144
Add new metric that quantifies how long batch ingest jobs waited for segment availability and whether or not that wait was successful (#12002)
* add a unit test that tests that new metric is emitted

* remove unused import

* clarify in doc that this is for batch tasks

* fix IndexTaskTest
2021-12-10 11:40:52 -06:00
Peter Marshall 0b3f0bbbd8
Docs - Metrics docs layout and info about query/bytes (#11481)
* Metrics docs layout and info about query/bytes

Knowledge transfer from https://groups.google.com/g/druid-user/c/8fiflmSEoTQ - updated the layout of the Metrics part, adding links between docs pages.

Update index.md

Amended typo

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/metrics.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/metrics.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/metrics.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Feedback applied

Http --> HTTP and moved content / removed >

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-12-07 09:45:24 -08:00
Nikhil Navadiya 3c51136098
Add worker category dimension (#11554)
* Add worker category as dimension in TaskSlotCountStatsMonitor

* Change description

* Add workerConfig as field

* Modify HttpRemoteTaskRunnerTest to test worker category in taskslot metrics

* Fixing tests

* Fixing alerts

* Adding unit test in SingleTaskBackgroundRunnerTest for task slot metrics APIs

* Resolving false positive spell check

* addressing comments

* throw UnsupportedOperationException for tasklotmetrics APIs in SingleTaskBackgroundRunner

Co-authored-by: Nikhil Navadiya <nnavadiya@twitter.com>
2021-11-18 22:59:07 -08:00
Jian Wang 8e7e679984
Add more metrics for Jetty server thread pool usage (#11113)
Add more metrics for jetty server thread pool usage so we know if we have allocated enough http threads to handle requests.
2021-11-07 16:51:44 +05:30
Arun Ramani b6b42d3936
Minor processor quota computation fix + docs (#11783)
* cpu/cpuset cgroup and procfs data gathering

* Renames and default values

* Formatting

* Trigger Build

* Add cgroup monitors

* Return 0 if no period

* Update

* Minor processor quota computation fix + docs

* Address comments

* Address comments

* Fix spellcheck

Co-authored-by: arunramani-imply <84351090+arunramani-imply@users.noreply.github.com>
2021-10-08 22:52:03 -05:00
Charles Smith 621e5ac63f
docs: clarify RealtimeMetricsMonitor, HistoricalMetricsMonitor (#11565)
* docs: clarify RealtimeMetricsMonitor, HistoricalMetricsMonitor

* Update docs/configuration/index.md
2021-10-05 17:38:23 -07:00
imply-jhan 332e68edb5
improve the metric definition (#11602) 2021-08-17 12:31:42 +07:00