Changes:
Add the following indexer level task metrics:
- `worker/task/running/count`
- `worker/task/assigned/count`
- `worker/task/completed/count`
These metrics will provide more visibility into the tasks distribution across indexers
(We often see a task skew issue across indexers and with this issue it would be easier
to catch the imbalance)
* Mark used and unused APIs by versions.
* remove the conditional invocations.
* isValid() and test updates.
* isValid() and tests.
* Remove warning logs for invalid user requests. Also, downgrade visibility.
* Update resp message, etc.
* tests and some cleanup.
* Docs draft
* Clarify docs
* Update server/src/main/java/org/apache/druid/server/http/DataSourcesResource.java
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Review comments
* Remove default interface methods only used in tests and update docs.
* Clarify javadocs and @Nullable.
* Add more tests.
* Parameterized versions.
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* MSQ: Validate that strings and string arrays are not mixed.
When multi-value strings and string arrays coexist in the same column,
it causes problems with "classic MVD" style queries such as:
select * from wikipedia -- fails at runtime
select count(*) from wikipedia where flags = 'B' -- fails at planning time
select flags, count(*) from wikipedia group by 1 -- fails at runtime
To avoid these problems, this patch adds type verification for INSERT
and REPLACE. It is targeted: the only type changes that are blocked are
string-to-array and array-to-string. There is also a way to exclude
certain columns from the type checks, if the user really knows what
they're doing.
* Fixes.
* Tests and docs and error messages.
* More docs.
* Adjustments.
* Adjust message.
* Fix tests.
* Fix test in DV mode.
* docs: clarify description of uri/uripath
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Kill task version support.
Kill tasks by default kill all versions of unused segments in the specified
interval. Users wanting to delete specific versions (for example, data compliance
reasons) and keep rest of the versions can specify the optional version in the
kill task payload.
* Formatting changes.
* Multi version tests in RetrieveSegmentsActionsTest
Sort of like method-level parameterized tests.
* Address review feedback
* Accept a list of versions instead of a single version.
Support multiple versions.
* Tests for multiple versions.
* Update docs
* Cleanup
* Address review comments.
Retain the old interface method and make it default and route it to
the method with nullable versions variant. Update usages to use the
default method where versions doesn't matter.
* Remove versions from retreive used segments action.
* Some updates.
* Apply suggestions from code review
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* /s/actual/observed/g
* minor test cleanup
* WIP: Test fixes and updates. Also add test for kill by version with used load spec.
Checkpoint.
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Changes:
- Add visibility into number of segments read/published by each parallel compaction
- Add new fields `segmentsRead`, `segmentsPublished` to `IngestionStatsAndErrorsTaskReportData`
- Update `ParallelIndexSupervisorTask` to populate the new stats
* updated description of rowsPerPage in export operations
* Update docs/multi-stage-query/reference.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Add support for AzureDNSZone enabled storage accounts used for deep storage
Added a new config to AzureAccountConfig
`storageAccountEndpointSuffix`
which allows the user to specify a storage account endpoint suffix where the underlying
storage account is enabled for AzureDNSZone. The previous config `endpointSuffix`, did not allow
support for such accounts. The previous config has been deprecated in favor of this new config. Also
fixed an issue where `managedIdentityClientId` was not being set properly
* * address review comments
* * add back azure government link and docs
* Move retries into DataSegmentPusher implementations.
The individual implementations know better when they should and should
not retry. They can also generate better error messages.
The inspiration for this patch was a situation where EntityTooLarge was
generated by the S3DataSegmentPusher, and retried uselessly by the
retry harness in PartialSegmentMergeTask.
* Fix missing var.
* Adjust imports.
* Tests, comments, style.
* Remove unused import.
* docs: add mermaid diagram support
* fix crash when parsing data in data loader that can not be parsed (#15983)
* update jetty to address CVE (#16000)
* Concurrent replace should work with supervisors using concurrent locks (#15995)
* Concurrent replace should work with supervisors using concurrent locks
* Ignore supervisors with useConcurrentLocks set to false
* Apply feedback
* Add pre-check for heavy debug logs (#15706)
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* Remove helm paths from CodeQL config (#16006)
* docs: mention acid-compliance for metadb
---------
Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
Co-authored-by: Jan Werner <105367074+janjwerner-confluent@users.noreply.github.com>
Co-authored-by: AmatyaAvadhanula <amatya.avadhanula@imply.io>
Co-authored-by: Sensor <fectrain@outlook.com>
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* Update basic-cluster-tuning.md
The sentence "When free system memory is greater than or equal to druid.segmentCache.locations, the more segment data the Historical can be held in the memory-mapped segment cache" didn't read well. Updated to clarify it.
* Update docs/operations/basic-cluster-tuning.md
* Update docs/operations/basic-cluster-tuning.md
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* All segments stored in the same batch have the same created_date entry.
In the absence of a group_id column, this metadata would allow us to easily
reason about and troubleshoot ingestion-related issues.
* Rename metric name and code references to eligibleUnusedSegments.
Address review comment from https://github.com/apache/druid/pull/15941#discussion_r1503631992
* Kill duty and test improvements.
Initial commit with:
- Bug fixes - auto-kill can throw NPE when there are no datasources present and defaults mismatch.
- Add new stat for candidate segment intervals killed.
- Move a couple of debug logs to info logs for improved visibility (should only log once per kill period).
- Remove redundant checks for code readability.
- Updated tests from using mocks (also the mocks weren't using last updated timestamp) and
add more test coverage for different config parameters.
- Add a couple of unit tests that are ignored for the eternity case to prove that
the kill duty doesn't clean up segments with ALL grain or that end in DateTimes.MAX.
- Migrate Druid exception from user to operator persona.
* Address review comments.
* Remove unused methods.
* fix up format specifier and validate bad config tests.
* Consolidate the helpers a bit more and add another test.
* Update test names. Add javadoc placeholders for slightly involved tests.
* Add docs for metric kill/candidateUnusedSegments/count.
Also, rename to disambiguate.
* Comments.
* Apply logging suggestions from code review
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Review comments
- Clarify docs on eligibility.
- Add test for multiple segments in the same interval. Clarify comment.
- Remove log line from test.
- Remove lastUpdatedDate = now.plus(10) from test.
* minor cleanup.
* Clarify javadocs for getUnusedSegmentIntervals().
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Fix up typos, inaccuracies and clean up code related to PARTITIONED BY.
* Remove wrapper function and update tests to use DruidExceptionMatcher.
* Checkstyle and Intellij inspection fixes.
Changes:
- Add visibility into number of records processed by each streaming task per partition
- Add field `recordsProcessed` to `IngestionStatsAndErrorsTaskReportData`
- Populate number of records processed per partition in `SeekableStreamIndexTaskRunner`
Starting the process to officially deprecate non SQL compatible modes by updating docs to aggressively call out that Druids non SQL compliant modes are deprecated and will go away someday. There are no code or behavior changes at this PR.
Merging the work so far. @ektravel , @vogievetsky if there are additional improvements, let's track them & make another pr.
* Refactor streaming ingestion docs
* Update property definition
* Update after review
* Update known issues
* Move kinesis and kafka topics to ingestion, add redirects
* Saving changes
* Saving
* Add input format text
* Update after review
* Minor text edit
* Update example syntax
* Revert back to colon
* Fix merge conflicts
* Fix broken links
* Fix spelling error
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning
* allow for kafka-emitter to have extra dimensions be set for each event it emits
* fix checktsyle issue in kafkaemitterconfig
* make changes to fix docs, and cleanup copy paste error in #toString()
* undo formatting to markdown table
* add more branches so test passes
* fix checkstyle issue
* Update the group id to org.apache.druid.extensions.contrib for contrib exts.
* Note iceberg and delta lake extensions in extensions.md
* properties and shell backticks
* Update groupId in distribution/pom.xml
* remove delta-lake from dist.
* Add note on downloading extension.
During ingestion, incremental segments are created in memory for the different time chunks and persisted to disk when certain thresholds are reached (max number of rows, max memory, incremental persist period etc). In the case where there are a lot of dimension and metrics (1000+) it was observed that the creation/serialization of incremental segment file format for persistence and persisting the file took a while and it was blocking ingestion of new data. This affected the real-time ingestion. This serialization and persistence can be parallelized across the different time chunks. This update aims to do that.
The patch adds a simple configuration parameter to the ingestion tuning configuration to specify number of persistence threads. The default value is 1 if it not specified which makes it the same as it is today.
If lots of keys map to the same value, reversing a LOOKUP call can slow
things down unacceptably. To protect against this, this patch introduces
a parameter sqlReverseLookupThreshold representing the maximum size of an
IN filter that will be created as part of lookup reversal.
If inSubQueryThreshold is set to a smaller value than
sqlReverseLookupThreshold, then inSubQueryThreshold will be used instead.
This allows users to use that single parameter to control IN sizes if they
wish.