Changes:
- Add visibility into number of records processed by each streaming task per partition
- Add field `recordsProcessed` to `IngestionStatsAndErrorsTaskReportData`
- Populate number of records processed per partition in `SeekableStreamIndexTaskRunner`
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, Refactoring the IngestHandler and subclasses to produce a validated SqlInsert instance node instead of the previous Insert source node. The SqlInsert node is then validated in the calcite validator. The validation that is implemented as part of this pr, is only that for the source node, and some of the validation that was previously done in the ingest handlers. As part of this change, the partitionedBy clause can be supplied by the table catalog metadata if it exists, and can be omitted from the ingest time query in this case.
Fixes a bug when the undocumented castToType parameter is set on 'auto' column schema, which should have been using the 'nullable' comparator to allow null values to be present when merging columns, but wasn't which would lead to null pointer exceptions. Also fixes an issue I noticed while adding tests that if 'FLOAT' type was specified for the castToType parameter it would be an exception because that type is not expected to be present, since 'auto' uses the native expressions to determine the input types and expressions don't have direct support for floats, only doubles.
In the future I should probably split this functionality out of the 'auto' schema (maybe even have a simpler version of the auto indexer dedicated to handling non-nested data) but still have the same results of writing out the newer 'nested common format' columns used by 'auto', but I haven't taken that on in this PR.
The helm chart was originally moved here in #11163 from
https://github.com/helm/charts/tree/master/incubator/druid after the
helm/charts repository was deprecated. However, it has been excluded
from releases since then, due to uncertainty around whether we need
IP clearance. We have not had volunteers willing to sort this out,
so this patch removes the code.
It can be re-added if a volunteer is available to sort out the
IP clearance process.
See thread at: https://lists.apache.org/thread/ygyzt23m06vc775nq5dsm349rf0j47dg
* Globally disable AUTO_CLOSE_JSON_CONTENT.
This JsonGenerator feature is on by default. It causes problems with code
like this:
try (JsonGenerator jg = ...) {
jg.writeStartArray();
for (x : xs) {
jg.writeObject(x);
}
jg.writeEndArray();
}
If a jg.writeObject call fails due to some problem with the data it's
reading, the JsonGenerator will write the end array marker automatically
when closed as part of the try-with-resources. If the generator is writing
to a stream where the reader does not have some other mechanism to realize
that an exception was thrown, this leads the reader to believe that the
array is complete when it actually isn't.
Prior to this patch, we disabled AUTO_CLOSE_JSON_CONTENT for JSON-wrapped
SQL result formats in #11685, which fixed an issue where such results
could be erroneously interpreted as complete. This patch fixes a similar
issue with task reports, and all similar issues that may exist elsewhere,
by disabling the feature globally.
* Update test.
Apache Druid brings the dependency json-path which is affected by CVE-2023-51074.
Its latest version 2.9.0 fixes the above CVE.
Append function has been added to json-path and so the unit test to check for the append function not present has been updated.
---------
Co-authored-by: Xavier Léauté <xvrl@apache.org>
* thrust of the fix to allow for the json values to be out of order
The existing problem is that toMap doesn't turn some values into json primitive
values, for example segmentMetadata just has DateTime objects for it's time in
the EventMap, but Alert event converts those into strings when calling toMap.
This creates an issue because when we check the emitted events the mapper
deserializing the string value for dateTime leaves it as a string in the
EventMap. So the question is do we alter the events toMap() to return string/map
version of objects or to make the expected events do a round trip of
eventMap -> string -> eventMap to turn everything into json primitives
* fix issue by making toMap events convert Objects into strings, or maps
* fix linting errors
* use method of using mapper to round trip expected data to make it have same type
as those of the events emitted
* remove unnecessary comment
* Address review comment: add test javadocs
* Fix flaky assertion failure.
Use ConcurrentHashMap instead of HashMap because the producer callback
can trigger concurrently and override the map initialization.
* fixup intellij inspection
Starting the process to officially deprecate non SQL compatible modes by updating docs to aggressively call out that Druids non SQL compliant modes are deprecated and will go away someday. There are no code or behavior changes at this PR.
* Rework ExprMacro base classes to simplify implementations.
This patch removes BaseScalarUnivariateMacroFunctionExpr, adds
BaseMacroFunctionExpr at the top of the hierarchy (a suitable base class
for ExprMacros that take either arrays or scalars), and adds an
implementation for "visit" to BaseMacroFunctionExpr.
The effect on implementations is generally cleaner code:
- Exprs no longer need to implement "visit".
- Exprs no longer need to implement "stringify", even if they don't
use all of their args at runtime, because BaseMacroFunctionExpr has
access to even unused args.
- Exprs that accept arrays can extend BaseMacroFunctionExpr and
inherit a bunch of useful methods. The only one they need to
implement themselves that scalar exprs don't is "supplyAnalyzeInputs".
* Make StringDecodeBase64UTFExpression a static class.
* Remove unused import.
* Formatting, annotation changes.
Merging the work so far. @ektravel , @vogievetsky if there are additional improvements, let's track them & make another pr.
* Refactor streaming ingestion docs
* Update property definition
* Update after review
* Update known issues
* Move kinesis and kafka topics to ingestion, add redirects
* Saving changes
* Saving
* Add input format text
* Update after review
* Minor text edit
* Update example syntax
* Revert back to colon
* Fix merge conflicts
* Fix broken links
* Fix spelling error
* Clean up kafka emitter tests a bit and add more validations.
The test wasn't validating what events were sent, but simply the dropped counters, which
aren't that useful.
Additionally, this module has fewer tests, so folks often run into code coverage issue
in this extension. Hopefully this change helps with that too.
* Change things to feed-based rather than topic-based.
* Another test for shared topic
* Switch to DruidException, add test dependencies and sad path config tests.
* missing test dependency
* minor renames.
* Add more tests - to test unknown events and drop when queue is full
This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner https://github.com/apache/druid/pull/13686 from @paul-rogers, extending the PARTITION BY clause to accept string literals for the time partitioning
* allow for kafka-emitter to have extra dimensions be set for each event it emits
* fix checktsyle issue in kafkaemitterconfig
* make changes to fix docs, and cleanup copy paste error in #toString()
* undo formatting to markdown table
* add more branches so test passes
* fix checkstyle issue
Executing single value correlated queries will throw an exception today since single_value function is not available in druid.
With these added classes, this provides druid, the capability to plan and run such queries.
* Update the group id to org.apache.druid.extensions.contrib for contrib exts.
* Note iceberg and delta lake extensions in extensions.md
* properties and shell backticks
* Update groupId in distribution/pom.xml
* remove delta-lake from dist.
* Add note on downloading extension.
* Fix HllSketchHolderObjectStrategy#isSafeToConvertToNullSketch.
The prior code from #15162 was reading only the low-order byte of an int
representing the size of a coupon set. As a result, it would erroneously
believe that a coupon set with a multiple of 256 elements was empty.