* Additional short circuiting knowledge in filter bundles.
Three updates:
1) The parameter "selectionRowCount" on "makeFilterBundle" is renamed
"applyRowCount", and redefined as an upper bound on rows remaining
after short-circuiting (rather than number of rows selected so far).
This definition works better for OR filters, which pass through the
FALSE set rather than the TRUE set to the next subfilter.
2) AndFilter uses min(applyRowCount, indexIntersectionSize) rather
than using selectionRowCount for the first subfilter and indexIntersectionSize
for each filter thereafter. This improves accuracy when the incoming
applyRowCount is smaller than the row count from the first few indexes.
3) OrFilter uses min(applyRowCount, totalRowCount - indexUnionSize) rather
than applyRowCount for subfilters. This allows an OR filter to pass
information about short-circuiting to its subfilters.
To help write tests for this, the patch also moves the sampled
wikiticker data file from sql to processing.
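As a rough illustration of how the bounds from (2) and (3) combine, here is a minimal sketch; the variable names mirror the description above and are not the actual makeFilterBundle API:

```java
// Illustrative sketch only; names mirror the description above, not the real API.
static int andApplyRowCount(int applyRowCount, int indexIntersectionSize)
{
  // AND: rows remaining after short-circuiting cannot exceed either the incoming
  // bound or the size of the index intersection computed so far.
  return Math.min(applyRowCount, indexIntersectionSize);
}

static int orApplyRowCount(int applyRowCount, int totalRowCount, int indexUnionSize)
{
  // OR: subfilters see the FALSE set, bounded by the rows not already matched
  // by the union of earlier indexes.
  return Math.min(applyRowCount, totalRowCount - indexUnionSize);
}
```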
* Forbidden APIs.
* Better comments.
* Fix inspection.
* Adjustments to tests.
Currently, export creates the files at the provided destination. The addition of a manifest file will provide a list of the files created as part of the export. This will allow easier consumption of the data exported from Druid, especially for automated data pipelines.
Follow up to #16217
Changes:
- Update `OverlordClient.getReportAsMap()` to return `TaskReport.ReportMap`
- Move the following classes to `org.apache.druid.indexer.report` in the `druid-processing` module
- `TaskReport`
- `KillTaskReport`
- `IngestionStatsAndErrorsTaskReport`
- `TaskContextReport`
- `TaskReportFileWriter`
- `SingleFileTaskReportFileWriter`
- `TaskReportSerdeTest`
- Remove `MsqOverlordResourceTestClient` as it had only one method
which is already present in `OverlordResourceTestClient` itself
* Fix ORDER BY on certain GROUPING SETS.
DefaultLimitSpec (part of native groupBy) had a bug where it would assume
that results are naturally ordered by dimensions even when subtotalsSpec
is present. However, this is not necessarily the case. For certain
combinations of ORDER BY and GROUPING SETS, this would cause the ORDER BY
to be ignored.
* Fix test testGroupByWithSubtotalsSpecWithOrderLimitForcePushdown. Resorting was necessary.
* SQL tests: avoid mixing skip and cannot vectorize.
skipVectorize switches off vectorization tests completely, and
cannotVectorize turns vectorization tests into negative tests. It doesn't
make sense to use them together, so this patch makes it an error to do so,
and cleans up cases where both are mentioned.
This patch also has the effect of changing various tests from skipVectorize
to cannotVectorize, because in the past when both were mentioned,
skipVectorize would take priority.
* Fix bug with StringAnyAggregatorFactory attempting to vectorize when it cannot.
* Fix tests.
Compaction in the native engine by default records the state of compaction for each segment in the lastCompactionState segment field. This PR adds support for doing the same in the MSQ engine, targeted for future cases such as REPLACE and compaction done via MSQ.
Note that this PR doesn't implicitly store the compaction state for MSQ replace tasks; the state is stored only when the flag `"storeCompactionState": true` is set in the query context.
Currently, runtime exceptions generated while writing frames include only the exception itself, without the name of the column in which they were encountered. This patch adds that information to the error and makes the error non-retryable.
* Allow typedIn to run in replace-with-default mode.
Useful when data servers, like Historicals, are running in replace-with-default
mode and the Broker is running in SQL-compatible mode, which can happen during
a rolling update that is applying a mode change.
The name of the combining filtered aggregator factory should be the same
as the name of the original factory. However, it wasn't the same in the
case where the original factory and the original delegate aggregator
had inconsistent names. In this scenario, we should use the name of
the original filtered aggregator, not the name of the original delegate
aggregator.
This PR creates an interface for ImmutableRTree and moves the existing implementation to a new class representing the 32-bit implementation (which stores coordinates as floats). This makes ImmutableRTree extendable, so higher-precision (64-bit) implementations can be created as well.
All spatial bound filters previously accepted floats as input, which might not be accurate in the case of a high-precision implementation of ImmutableRTree. This PR changes the bound filters to accept the query bounds as doubles instead of floats. It is a backward-compatible change, as it compares doubles to the existing float values in the RTree. Previously the input float was compared to the RTree floats, which can cause precision loss; now it is a little better, comparing double to float, which is still not 100% accurate.
There are no changes in the way we query spatial dimensions today, except for input bound parsing. There is a small improvement in the string filter predicate, which now parses double strings instead of floats and compares double to double, which is 100% accurate; however, the string predicate is only called when we don't have a spatial index.
By allowing the interface to be extended, we make it possible to create a high-precision (HP) implementation and define new search strategies to perform HP search, e.g. `Iterable<ImmutableBitmap> search(ImmutableDoubleNode node, Bound bound);`
With possible HP implementations, the radius bound filter cannot really focus on accuracy, since it calculates Euclidean distance for its comparison. As the Earth 🌍 is round and not flat, Euclidean distances are not accurate in a geo coordinate system. This PR adds a new parameter called 'radiusUnit' which allows you to specify units like meters, km, miles, etc. It uses the Haversine formula (https://en.wikipedia.org/wiki/Haversine_formula) to check whether a given geo point falls inside the circle or not. A test that generates a set of points inside and outside the circle was added in RadiusBoundTest.
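For reference, a minimal sketch of the haversine check described above; the method name, parameter handling, and Earth-radius constant are illustrative assumptions, not the PR's code:

```java
// Does (lat, lon) fall within radiusMeters of (centerLat, centerLon)?
// Uses the haversine formula on a spherical Earth model.
static boolean withinRadius(double lat, double lon, double centerLat, double centerLon, double radiusMeters)
{
  final double earthRadiusMeters = 6_371_000d; // mean Earth radius (assumption for this sketch)
  final double dLat = Math.toRadians(lat - centerLat);
  final double dLon = Math.toRadians(lon - centerLon);
  final double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
      + Math.cos(Math.toRadians(centerLat)) * Math.cos(Math.toRadians(lat))
        * Math.sin(dLon / 2) * Math.sin(dLon / 2);
  final double distanceMeters = 2 * earthRadiusMeters * Math.asin(Math.sqrt(a));
  return distanceMeters <= radiusMeters;
}
```

A 'radiusUnit' of km or miles would simply be converted to meters before a check like this.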
This PR aims to introduce Window functions on MSQ by doing the following:
Introduce a Window querykit for handling window queries along with its factory and a processor for window queries
If a window operator is present with a partition by clause, pushes the partition as a shuffle spec of the previous stage
In the presence of an empty OVER() clause, lets all operators loose on a single RAC
In the presence of a non-empty OVER() clause, breaks down each window into individual stages
Associated machinery to handle window functions in MSQ
Introduced a separate hidden engine feature WINDOW_LEAF_OPERATOR which is set only for the MSQ engine. In the presence of this feature, the planner plans without the leaf operators by creating a window query over an inner scan query. In the case of native, this is set to false and the planner generates the leafOperators
Guardrails around materialization
Comprehensive UTs
Changes:
- Handle exceptions in the API and map them to a `Response` object with the appropriate error code.
- Replace `AuthorizationUtils.filterAuthorizedResources()` with `DatasourceResourceFilter`.
The endpoint is annotated consistently with other usages.
- Update `DatasourceResourceFilter` to remove the lambda and update javadocs.
The usages information is self-evident with an IDE.
- Adjust the invalid interval exception message.
- Break up the large unit test `testGetUnusedSegmentsInDataSource()` into smaller unit tests
for each test case. Also, validate the error codes.
* Avoid conversion to String in JsonReader, JsonNodeReader.
These readers were running UTF-8 decode on the provided entity to
convert it to a String, then parsing the String as JSON. The patch
changes them to parse the provided entity's input stream directly.
In order to preserve the nice error messages that include parse errors,
the readers now need to open the entity again on the error path, to
re-read the data. To make this possible, the InputEntity#open contract
is tightened to require the ability to re-open entities, and existing
InputEntity implementations are updated to allow re-opening.
This patch also renames JsonLineReaderBenchmark to JsonInputFormatBenchmark,
updates it to benchmark all three JSON readers, and adds a case that reads
fields out of the parsed row (not just creates it).
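A hedged sketch of the stream-parsing approach and the re-open error path described above (the error handling details are illustrative, not the exact reader code):

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.druid.data.input.InputEntity;

static JsonNode readTree(InputEntity entity, ObjectMapper objectMapper) throws IOException
{
  // Parse the entity's stream directly rather than decoding it into a String first.
  try (InputStream in = entity.open()) {
    return objectMapper.readTree(in);
  }
  catch (JsonProcessingException e) {
    // Error path: re-open the entity (now legal under the tightened InputEntity#open
    // contract) so the raw data can be included in the parse error message.
    try (InputStream again = entity.open()) {
      final String raw = new String(again.readAllBytes(), StandardCharsets.UTF_8);
      throw new IOException("Unable to parse row [" + raw + "]", e);
    }
  }
}
```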
* Fixes for static analysis.
* Implement intermediateRowAsString in JsonReader.
* Enhanced JsonInputFormatBenchmark.
Renames JsonLineReaderBenchmark to JsonInputFormatBenchmark, and enhances it to
test various readers (JsonReader, JsonLineReader, JsonNodeReader) as well as
to test with/without field discovery.
This commit allows the MV_FILTER_ONLY & MV_FILTER_NONE functions
to be used with a non-literal argument.
Currently `select mv_filter_only('mvd_dim', 'array_dim') from 'table'`
returns an `Unhandled Query Planning Failure`.
This is being tackled here, also covering the cases where `array_dim`
has null & empty values.
Changed classes:
* `MultiValueStringOperatorConversions`
* `ApplyFunction`
* `CalciteMultiValueStringQueryTest`
Changes:
- Use error code `internalServerError` for failures of this type
- Remove the error code argument from `InternalServerError.exception()` methods
thus fixing a bug in the callers.
changes:
* adds TypedInFilter which preserves matching sets in the native match value type
* SQL planner uses new TypedInFilter when druid.generic.useDefaultValueForNull=false (the default)
* SortMerge join support for IS NOT DISTINCT FROM.
The patch adds a "requiredNonNullKeyParts" field to the sortMerge
processor, which has the list of key parts that must be nonnull for
an equijoin condition to match. Conditions with SQL "=" are present in
the list; conditions with SQL "IS NOT DISTINCT FROM" are absent from
the list.
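A hedged illustration of the matching rule (not the actual sortMerge processor code):

```java
// For a given key part: SQL "=" (the key part appears in requiredNonNullKeyParts)
// requires both sides to be non-null and equal; SQL "IS NOT DISTINCT FROM" also
// matches when both sides are null.
static boolean keyPartMatches(Object lhs, Object rhs, boolean requiredNonNull)
{
  if (lhs == null || rhs == null) {
    return !requiredNonNull && lhs == null && rhs == null;
  }
  return lhs.equals(rhs);
}
```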
* Fix test.
* Update javadoc.
* Update Calcite*Test to use junit5
* change the way temp dirs are handled
* add openrewrite workflow to safeguard upgrade
* replace junitparamrunner with standard junit5 parametered tests
* update a few rules to junit5 api
* lots of boring changes
* cleanup QueryLogHook
* cleanup
* fix compile error: ARRAYS_DATASOURCE
* fix test
* remove enclosed
* empty
+TEST:TDigestSketchSqlAggregatorTest,HllSketchSqlAggregatorTest,DoublesSketchSqlAggregatorTest,ThetaSketchSqlAggregatorTest,ArrayOfDoublesSketchSqlAggregatorTest,BloomFilterSqlAggregatorTest,BloomDimFilterSqlTest,CatalogIngestionTest,CatalogQueryTest,FixedBucketsHistogramQuantileSqlAggregatorTest,QuantileSqlAggregatorTest,MSQArraysTest,MSQDataSketchesTest,MSQExportTest,MSQFaultsTest,MSQInsertTest,MSQLoadedSegmentTests,MSQParseExceptionsTest,MSQReplaceTest,MSQSelectTest,InsertLockPreemptedFaultTest,MSQWarningsTest,SqlMSQStatementResourcePostTest,SqlStatementResourceTest,CalciteSelectJoinQueryMSQTest,CalciteSelectQueryMSQTest,CalciteUnionQueryMSQTest,MSQTestBase,VarianceSqlAggregatorTest,SleepSqlTest,SqlRowTransformerTest,DruidAvaticaHandlerTest,DruidStatementTest,BaseCalciteQueryTest,CalciteArraysQueryTest,CalciteCorrelatedQueryTest,CalciteExplainQueryTest,CalciteExportTest,CalciteIngestionDmlTest,CalciteInsertDmlTest,CalciteJoinQueryTest,CalciteLookupFunctionQueryTest,CalciteMultiValueStringQueryTest,CalciteNestedDataQueryTest,CalciteParameterQueryTest,CalciteQueryTest,CalciteReplaceDmlTest,CalciteScanSignatureTest,CalciteSelectQueryTest,CalciteSimpleQueryTest,CalciteSubqueryTest,CalciteSysQueryTest,CalciteTableAppendTest,CalciteTimeBoundaryQueryTest,CalciteUnionQueryTest,CalciteWindowQueryTest,DecoupledPlanningCalciteJoinQueryTest,DecoupledPlanningCalciteQueryTest,DecoupledPlanningCalciteUnionQueryTest,DrillWindowQueryTest,DruidPlannerResourceAnalyzeTest,IngestTableFunctionTest,QueryTestRunner,SqlTestFrameworkConfig,SqlAggregationModuleTest,ExpressionsTest,GreatestExpressionTest,IPv4AddressMatchExpressionTest,IPv4AddressParseExpressionTest,IPv4AddressStringifyExpressionTest,LeastExpressionTest,TimeFormatOperatorConversionTest,CombineAndSimplifyBoundsTest,FiltrationTest,SqlQueryTest,CalcitePlannerModuleTest,CalcitesTest,DruidCalciteSchemaModuleTest,DruidSchemaNoDataInitTest,InformationSchemaTest,NamedDruidSchemaTest,NamedLookupSchemaTest,NamedSystemSchemaTest,RootSchemaProviderTest,SystemSchemaTest,CalciteTestBase,SqlResourceTest
* use @Nested
* add rule to remove enclosed; upgrade surefire
* remove enclosed
* cleanup
* add comment about surefire exclude
Changes:
- Remove deprecated `DruidException` (old one) and `EntryExistsException`
- Use newly added comprehensive `DruidException` instead
- Update error message in `SqlMetadataStorageActionHandler` when max packet limit is violated.
- Factor out common code from several faults into `BaseFault`.
- Slightly update javadoc in `DruidException` to render it correctly
- Remove unused classes `SegmentToMove`, `SegmentToDrop`
- Move `ServletResourceUtils` from module `druid-processing` to `druid-server`
- Add utility method to build error Response from `DruidException`.
changes:
* fix issues with array_contains and array_overlap with null left side arguments
* modify the singleThreaded machinery to allow optimizing Function similar to how we do for ExprMacro - removed SingleThreadSpecializable in favor of a default impl of asSingleThreaded on Expr, with clear javadocs that most callers shouldn't call it directly and should instead use the Expr.singleThreaded static method, which uses a shuttle and delegates to asSingleThreaded
* add optimized 'singleThreaded' versions of array_contains and array_overlap
* add mv_harmonize_nulls native expression to use with MV_CONTAINS and MV_OVERLAP to allow them to behave consistently with filter rewrites, coercing null and [] into [null]
* fix bug with casting rhs argument for native array_contains and array_overlap expressions
ConcurrentGrouper kind of misuses ThreadLocal to hold a SpillingGrouper, and never calls remove() on it, which can result in large amounts of heap being retained as weak references even after grouping is finished. This PR calls keySerde.reset() in all of the Grouper.close() implementations that have a KeySerde, which should free up a bunch of space that is no longer needed.
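A minimal sketch of the idea, assuming a Grouper implementation that holds a KeySerde (not the exact code):

```java
@Override
public void close()
{
  // ...existing cleanup of buffers, spill files, etc....
  // Drop dictionaries and other state held by the key serde so the heap it retains
  // can be reclaimed even if the enclosing ThreadLocal is never removed.
  keySerde.reset();
}
```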
The previously used GCS API client library returned last update time for objects directly in milliseconds. The new library returns it in OffsetDateTime format which was being converted to seconds and stored against the object. This fix converts the time back to ms before storing it.
Changes:
Improve `SqlSegmentsMetadataManager`
- Break the loop in `populateUsedStatusLastUpdated` before going to sleep if there are no more segments to update
- Add comments and clean up logs
Refactor `SqlSegmentsMetadataManagerTest`
- Merge `SqlSegmentsMetadataManagerEmptyTest` into this test
- Add method `testPollEmpty`
- Shave a few seconds off of the tests by reducing poll duration
- Simplify creation of test segments
- Some renames here and there
- Remove unused methods
- Move `TestDerbyConnector.allowLastUsedFlagToBeNull` to this class
Other minor changes
- Add javadoc to `NoneShardSpec`
- Use lambda in `SqlSegmentMetadataPublisher`
While converting Sequence<ScanResultValue> to Sequence<Frames>, when maxSubqueryBytes is enabled, we batch the results to prevent creating a single frame per ScanResultValue. Batching requires peeking into the actual value, and checking if the row signature of the scan result’s value matches that of the previous value.
Since we can do this indefinitely (in the worst case all of them have the same signature), we keep fetching them and accumulating them in a list (on the heap). We don’t really know how much to batch before we actually write the value as frames.
The PR modifies the batching logic to not accumulate the results in an intermediate list.
* `Expr#singleThreaded` which creates a singleThreaded version of the actual expression (caching ExprEval is allowed)
* `Expr#makeSingleThreaded` to make a whole subtree of expressions 'singleThreaded' - uses `Shuttle` to create the new expression tree
* `ConstantExpr#singleThreaded` creates a specialized `ConstantExpr` which does cache the `ExprEval`
* some `@Immutable` annotations were added to make it more likely that someone notices something is off if a similar change is made around here for some reason
Since #15175, the javadoc for ReadableFieldPointer is somewhat out of date. It says that
the pointer only points to the beginning of the field, but this is no longer true. This
patch updates the javadoc to be more accurate.
allow a hashjoin result to be converted to RowsAndColumns
added StorageAdapterRowsAndColumns
fix incorrect isConcrete() return values during early phase of planning
* Move retries into DataSegmentPusher implementations.
The individual implementations know better when they should and should
not retry. They can also generate better error messages.
The inspiration for this patch was a situation where EntityTooLarge was
generated by the S3DataSegmentPusher, and retried uselessly by the
retry harness in PartialSegmentMergeTask.
* Fix missing var.
* Adjust imports.
* Tests, comments, style.
* Remove unused import.
* Rows.objectToNumber: Accept decimals with output type LONG.
PR #15615 added an optimization to avoid parsing numbers twice in cases
where we know that they should definitely be longs or
definitely be doubles. Rather than try parsing as long first, and then
try parsing as double, it would use only the parsing routine specific to
the requested outputType.
This caused a bug: previously, we would accept decimals like "1.0" or
"1.23" as longs, by truncating them to "1". After that patch, we would
treat such decimals as nulls when the outputType is set to LONG.
This patch retains the short-circuit for doubles: if outputType is
DOUBLE, we only parse the string as a double. But for outputType LONG,
this patch restores the old behavior: try to parse as long first,
then double.
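A hedged sketch of the restored LONG behavior (a hypothetical helper, not the actual Rows.objectToNumber code):

```java
// Try long first; fall back to double and truncate, so decimals like "1.0" or "1.23"
// are still accepted when the requested output type is LONG.
static Long parseAsLong(String value)
{
  try {
    return Long.parseLong(value);
  }
  catch (NumberFormatException ignored) {
    try {
      return (long) Double.parseDouble(value);
    }
    catch (NumberFormatException alsoIgnored) {
      return null; // not a number at all
    }
  }
}
```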
* cooler cursor filter processing allowing much smarter utilization of indexes by feeding selectivity forward, with implementations for range and predicate based filters
* added new method Filter.makeFilterBundle which cursors use to get indexes and matchers for building offsets
* AND filter partitioning is now pushed all the way down, even to nested AND filters
* vector engine now uses same indexed base value matcher strategy for OR filters which partially support indexes
The code in the groupBy engine and the topN engine assumes that the dimensions are comparable and can call dimA.compareTo(dimB) to sort the dimensions and group them together.
This works well for the primitive dimensions, because they are Comparable; however, it falls apart when the dimensions can be arrays (or, in future scenarios, complex columns). In cases where the dimensions are not comparable, Druid resorts to the wrapper types ComparableStringArray and ComparableList, which are Comparable based on the list comparator.
Fixes a bug when the undocumented castToType parameter is set on the 'auto' column schema: it should have been using the 'nullable' comparator to allow null values to be present when merging columns, but wasn't, which would lead to null pointer exceptions. Also fixes an issue I noticed while adding tests: if the 'FLOAT' type was specified for the castToType parameter it would result in an exception, because that type is not expected to be present, since 'auto' uses the native expressions to determine the input types and expressions only have direct support for doubles, not floats.
In the future I should probably split this functionality out of the 'auto' schema (maybe even have a simpler version of the auto indexer dedicated to handling non-nested data) but still have the same results of writing out the newer 'nested common format' columns used by 'auto', but I haven't taken that on in this PR.
* Globally disable AUTO_CLOSE_JSON_CONTENT.
This JsonGenerator feature is on by default. It causes problems with code
like this:
try (JsonGenerator jg = ...) {
  jg.writeStartArray();
  for (Object x : xs) {
    jg.writeObject(x);
  }
  jg.writeEndArray();
}
If a jg.writeObject call fails due to some problem with the data it's
reading, the JsonGenerator will write the end array marker automatically
when closed as part of the try-with-resources. If the generator is writing
to a stream where the reader does not have some other mechanism to realize
that an exception was thrown, this leads the reader to believe that the
array is complete when it actually isn't.
Prior to this patch, we disabled AUTO_CLOSE_JSON_CONTENT for JSON-wrapped
SQL result formats in #11685, which fixed an issue where such results
could be erroneously interpreted as complete. This patch fixes a similar
issue with task reports, and all similar issues that may exist elsewhere,
by disabling the feature globally.
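For reference, disabling the feature on a Jackson ObjectMapper looks roughly like this; exactly where Druid applies it globally is not shown here:

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;

static ObjectMapper configureMapper()
{
  final ObjectMapper mapper = new ObjectMapper();
  // Prevent JsonGenerator.close() from silently writing end-array/end-object markers
  // after a failed write, which could make truncated output look complete.
  mapper.getFactory().disable(JsonGenerator.Feature.AUTO_CLOSE_JSON_CONTENT);
  return mapper;
}
```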
* Update test.
* Rework ExprMacro base classes to simplify implementations.
This patch removes BaseScalarUnivariateMacroFunctionExpr, adds
BaseMacroFunctionExpr at the top of the hierarchy (a suitable base class
for ExprMacros that take either arrays or scalars), and adds an
implementation for "visit" to BaseMacroFunctionExpr.
The effect on implementations is generally cleaner code:
- Exprs no longer need to implement "visit".
- Exprs no longer need to implement "stringify", even if they don't
use all of their args at runtime, because BaseMacroFunctionExpr has
access to even unused args.
- Exprs that accept arrays can extend BaseMacroFunctionExpr and
inherit a bunch of useful methods. The only one they need to
implement themselves that scalar exprs don't is "supplyAnalyzeInputs".
* Make StringDecodeBase64UTFExpression a static class.
* Remove unused import.
* Formatting, annotation changes.
Executing single-value correlated queries will throw an exception today, since the single_value function is not available in Druid.
With these added classes, Druid gains the capability to plan and run such queries.
During ingestion, incremental segments are created in memory for the different time chunks and persisted to disk when certain thresholds are reached (max number of rows, max memory, incremental persist period etc). In the case where there are a lot of dimension and metrics (1000+) it was observed that the creation/serialization of incremental segment file format for persistence and persisting the file took a while and it was blocking ingestion of new data. This affected the real-time ingestion. This serialization and persistence can be parallelized across the different time chunks. This update aims to do that.
The patch adds a simple configuration parameter to the ingestion tuning configuration to specify the number of persistence threads. The default value is 1 if not specified, which makes behavior the same as it is today.
This PR wires up ValueIndexes and ArrayElementIndexes for nested arrays, ValueIndexes for nested long and double columns, and fixes a handful of bugs I found after adding nested columns to the filter test gauntlet.
introduce checks to ensure that window frame is supported
added check to ensure that no expressions are set as bounds
added logic to detect FOLLOWING/FOLLOWING-like cases, described in 'Window function fails to demarcate if 2 following are used' #15739
currently RANGE frames are only supported correctly if both endpoints are unbounded or the current row ('Offset based window range support' #15767)
added windowingStrictValidation context key to provide a way to override the check
Adds a set of benchmark queries for measuring the planning time with the IN operator. Current results indicate that with the recent optimizations, the IN planning time with 100K expressions in the IN clause is just 3s and with 1M is 46s. For IN clause paired with OR <col>=<val> expr, the numbers are 10s and 155s for 100K and 1M, resp.
Nested columns maintain a null value bitmap for which rows are nulls, however I forgot to wire up a ColumnIndexSupplier to nested columns when filtering the 'raw' data itself, so these were not able to be used. This PR fixes that by adding a supplier that can return NullValueIndex to be used by the NullFilter, which should speed up is null and is not null filters on json columns.
I haven't spent the time to measure the difference yet, but I imagine it should be a significant speed increase.
Note that I only wired this up if druid.generic.useDefaultValueForNull=false (sql compatible mode), the reason being that the SQL planner still uses selector filter, which is unable to properly handle any arrays or complex types (including json, even checking for nulls). The reason for this is so that the behavior is consistent between using the index and using the value matcher, otherwise we get into a situation where using the index has correct behavior but using the value matcher does not, which I was trying to avoid.
* Possibly stabilize intellij-inspections
* remove `integration-tests-ex/cases` from excluded projects from initial build
* enable ErrorProne's `CheckedExceptionNotThrown` to get earlier errors than intellij-inspections
* fix ddsketch pom.xml
* fix spellcheck
* New: Add DDSketch-Druid extension
- Based off of http://www.vldb.org/pvldb/vol12/p2195-masson.pdf and uses
the corresponding https://github.com/DataDog/sketches-java library
- contains tests for post building and using aggregation/post
aggregation.
- New aggregator: `ddSketch`
- New post aggregators: `quantileFromDDSketch` and
`quantilesFromDDSketch`
* Fixing easy CodeQL warnings/errors
* Fixing docs, and dependencies
Also moved aggregator ids to AggregatorUtil and PostAggregatorIds
* Adding more Docs and better null/empty handling for aggregators
* Fixing docs, and pom version
* DDSketch documentation format and wording
A low value of inSubQueryThreshold can cause queries with an IN filter to be planned as joins more commonly. However, some of these join queries may not get planned as an IN filter on data nodes, which causes significant perf regressions.
### Description
Our Kinesis consumer works by using the [GetRecords API](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html) in some number of `fetchThreads`, each fetching some number of records (`recordsPerFetch`) and each inserting into a shared buffer that can hold a `recordBufferSize` number of records. The logic is described in our documentation at: https://druid.apache.org/docs/27.0.0/development/extensions-core/kinesis-ingestion/#determine-fetch-settings
There is a problem with the logic that this pr fixes: the memory limits rely on a hard-coded “estimated record size” that is `10 KB` if `deaggregate: false` and `1 MB` if `deaggregate: true`. There have been cases where a supervisor had `deaggregate: true` set even though it wasn’t needed, leading to under-utilization of memory and poor ingestion performance.
Users don’t always know if their records are aggregated or not. Also, even if they could figure it out, it’s better to not have to. So we’d like to eliminate the `deaggregate` parameter, which means we need to do memory management more adaptively based on the actual record sizes.
We take advantage of the fact that GetRecords doesn’t return more than 10MB (https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html ):
This pr:
eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set)
eliminate `deaggregate`, always have it true
cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments
add `recordBufferSizeBytes` as a bytes-based limit rather than a records-based limit for the shared queue. We do know the byte size of kinesis records at this point. Default should be `100MB` or `10% of heap`, whichever is smaller.
add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes.
deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified
deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if `maxRecordsPerPoll` is specified
Fixed an issue where, when the record buffer is full, the fetchRecords logic throws away the rest of the GetRecords result after `recordBufferOfferTimeout` and starts a new shard iterator. This seems excessively churny. Instead, wait an unbounded amount of time for the queue to stop being full. If the queue remains full, we’ll end up right back waiting for it after the restarted fetch.
There was also a call to `newQ::offer` without check in `filterBufferAndResetBackgroundFetch`, which seemed like it could cause data loss. Now checking return value here, and failing if false.
### Release Note
Kinesis ingestion memory tuning config has been greatly simplified, and a more adaptive approach is now taken for the configuration. Here is a summary of the changes made:
eliminates `recordsPerFetch`, always use the max limit of 10000 records (the default limit if not set)
eliminate `deaggregate`, always have it true
cap `fetchThreads` to ensure that if each fetch returns the max (`10MB`) then we don't exceed our budget (`100MB` or `5% of heap`). In practice this means `fetchThreads` will never be more than `10`. Tasks usually don't have that many processors available to them anyway, so in practice I don't think this will change the number of threads for too many deployments
add `recordBufferSizeBytes` as a bytes-based limit rather than a records-based limit for the shared queue. We do know the byte size of kinesis records at this point. Default should be `100MB` or `10% of heap`, whichever is smaller.
add `maxBytesPerPoll` as a bytes-based limit for how much data we poll from shared buffer at a time. Default is `1000000` bytes.
deprecate `recordBufferSize`, use `recordBufferSizeBytes` instead. Warning is logged if `recordBufferSize` is specified
deprecate `maxRecordsPerPoll`, use `maxBytesPerPoll` instead. Warning is logged if `maxRecordsPerPoll` is specified
* IncrementalIndex#add is no longer thread-safe.
Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.
It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.
At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.
This patch also:
1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
which the similar datasketches versions already have. This is done to
help them adhere to the contract of Aggregator: concurrent calls to
"aggregate" and "get" must be thread-safe.
2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
druid-benchmarks module.
* Spelling.
* Changes from static analysis.
* Fix javadoc.
* Clear "lineSplittable" for JSON when using KafkaInputFormat.
JsonInputFormat has a "withLineSplittable" method that can be used to
control whether JSON is read line-by-line, or as a whole. The intent
is that in streaming ingestion, "lineSplittable" is false (although it
can be overridden by "assumeNewlineDelimited"), and in batch ingestion,
lineSplittable is true.
When a "json" format is wrapped by a "kafka" format, this isn't set
properly. This patch updates KafkaInputFormat to set this on an
underlying "json" format.
The tests for KafkaInputFormat were overriding the "lineSplittable"
parameter explicitly, which wasn't really fair, because that made them
unrealistic to what happens in production. Now they omit the parameter
and get the production behavior.
* Add test.
* Fix test coverage.
* Faster parsing: reduce String usage, list-based input rows.
Three changes:
1) Reworked FastLineIterator to optionally avoid generating Strings
entirely, and reduce copying somewhat. Benefits the line-oriented
JSON, CSV, delimited (TSV), and regex formats.
2) In the delimited (TSV) format, when the delimiter is a single byte,
split on UTF-8 bytes directly.
3) In CSV and delimited (TSV) formats, use list-based input rows when
the column list is provided upfront by the user.
* Fix style.
* Fix inspections.
* Restore validation.
* Remove fastutil-extra.
* Exception type.
* Fixes for error messages.
* Fixes for null handling.
This PR fixes the summary iterator to add aggregators in the correct position. The summary iterator is used when dims are not present, therefore the new change behaves identically to the old one, but seems more correct when reading the code.
* support groups windowing mode, which is a close relative of ranges (but not in the standard)
* all windows with range expressions will be executed with the groups mode
* it will be 100% correct in case it's true for both bounds that: isCurrentRow() || isUnBounded()
* this covers OVER ( ORDER BY COL )
* for other cases it will have some chances of getting correct results...
* Cache value selectors in RowBasedColumnSelectorFactory.
There was already caching for dimension selectors. This patch adds caching
for value (object and number) selectors. It's helpful when the same field is
read multiple times during processing of a single row (for example, by being
an input to both MIN and MAX aggregations).
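A hedged sketch of the caching pattern; the cache field and the createColumnValueSelector helper are assumptions for illustration:

```java
private final Map<String, ColumnValueSelector<?>> valueSelectorCache = new HashMap<>();

@Override
public ColumnValueSelector<?> makeColumnValueSelector(String columnName)
{
  // Reuse the selector when the same field is read multiple times while processing a
  // single row, e.g. as input to both MIN and MAX aggregations.
  return valueSelectorCache.computeIfAbsent(columnName, this::createColumnValueSelector);
}
```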
* Fix typing.
* Fix logic.
* Add SpectatorHistogram extension
* Clarify documentation
Cleanup comments
* Use ColumnValueSelector directly
so that we support being queried as a Number using longSum or doubleSum aggregators as well as a histogram.
When queried as a Number, we're returning the count of entries in the histogram.
* Apply suggestions from code review
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Fix references
* Fix spelling
* Update docs/development/extensions-contrib/spectator-histogram.md
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
---------
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Add ImmutableLookupMap for static lookups.
This patch adds a new ImmutableLookupMap, which comes with an
ImmutableLookupExtractor. It uses a fastutil open hashmap plus two
lists to store its data in such a way that forward and reverse
lookups can both be done quickly. I also observed footprint to be
somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million
row lookup.
The main advantage, though, is that reverse lookups can be done much
more quickly than MapLookupExtractor (which iterates the entire map
for each call to unapplyAll). This speeds up the recently added
ReverseLookupRule (#15626) during SQL planning with very large lookups.
* Use in one more test.
* Fix benchmark.
* Object2ObjectOpenHashMap
* Fixes, and LookupExtractor interface update to have asMap.
* Remove commented-out code.
* Fix style.
* Fix import order.
* Add fastutil.
* Avoid storing Map entries.
* Reverse, pull up lookups in the SQL planner.
Adds two new rules:
1) ReverseLookupRule, which eliminates calls to LOOKUP by doing
reverse lookups.
2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above
GROUP BY, when the lookup is injective.
Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether
these rules fire. Both are enabled by default.
To minimize the chance of performance problems due to many keys mapping to
the same value, ReverseLookupRule refrains from reversing a lookup if there
are more keys than `inSubQueryThreshold`. The rationale for using this setting
is that reversal works by generating an IN, and the `inSubQueryThreshold`
describes the largest IN the user wants the planner to create.
* Add additional line.
* Style.
* Remove commented-out lines.
* Fix tests.
* Add test.
* Fix doc link.
* Fix docs.
* Add one more test.
* Fix tests.
* Logic, test updates.
* - Make FilterDecomposeConcatRule more flexible.
- Make CalciteRulesManager apply reduction rules until fixpoint.
* Additional tests, simplify code.
* Faster k-way merging using tournament trees, 8-byte key strides.
Two speedups for FrameChannelMerger (which does k-way merging in MSQ):
1) Replace the priority queue with a tournament tree, which does fewer
comparisons.
2) Compare keys using 8-byte strides, rather than 1 byte at a time.
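A hedged sketch of the 8-byte stride comparison (not the actual FrameChannelMerger code); it relies on the fact that comparing big-endian longs as unsigned values is equivalent to comparing the underlying bytes lexicographically:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

static int compareKeys(byte[] a, byte[] b)
{
  final int common = Math.min(a.length, b.length);
  final ByteBuffer bufA = ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN);
  final ByteBuffer bufB = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN);
  int i = 0;
  // Compare 8 bytes per step instead of 1.
  for (; i + Long.BYTES <= common; i += Long.BYTES) {
    final int cmp = Long.compareUnsigned(bufA.getLong(i), bufB.getLong(i));
    if (cmp != 0) {
      return cmp;
    }
  }
  // Compare any trailing bytes one at a time.
  for (; i < common; i++) {
    final int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
    if (cmp != 0) {
      return cmp;
    }
  }
  return Integer.compare(a.length, b.length);
}
```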
* Adjust comments.
* Fix style.
* Adjust benchmark and test.
* Add eight-list test (power of two).
This PR fixes a bug with the long string pair serde where null and empty strings are treated equivalently, and the return value is always null. When 'useDefaultValueForNull' was set to true by default, this wasn't a commonly seen issue, because nulls were equivalent to empty strings. However, since the default has changed to false, this can create incorrect results when the long string pairs are serded, where the empty strings are incorrectly converted to nulls.
changes:
* ColumnIndexSelector now extends ColumnSelector. The only real implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector, already has a ColumnSelector, so this isn't very disruptive
* removed getColumnNames from ColumnSelector since it was not used
* VirtualColumns and VirtualColumn getIndexSupplier method now needs argument of ColumnIndexSelector instead of ColumnSelector, which allows expression virtual columns to correctly recognize other virtual columns, fixing an issue which would incorrectly handle other virtual columns as non-existent columns instead
* fixed a bug with sql planner incorrectly not using expression filter for equality filters on columns with extractionFn and no virtual column registry
* overhaul DruidPredicateFactory to better handle 3VL
fixes some bugs caused by limitations in the original design of how DruidPredicateFactory interacts with 3-value logic. The primary impacted area was filters on values transformed with expressions or extractionFns which turn non-null values into nulls, which could not be modelled with the 'isNullInputUnknown' method
changes:
* adds DruidObjectPredicate to specialize string, array, and object based predicates instead of using guava Predicate
* DruidPredicateFactory now uses DruidObjectPredicate
* introduces DruidPredicateMatch enum, which all predicates returned from DruidPredicateFactory now use instead of booleans to indicate a match. This means DruidLongPredicate, DruidFloatPredicate, DruidDoublePredicate, and the newly added DruidObjectPredicate apply methods all now return DruidPredicateMatch. This allows matchers and indexes to handle unknown (null) inputs consistently
* isNullInputUnknown has been removed from DruidPredicateFactory
* rename, fix test
* adjust
* style
* npe
* more test
* fix default value mode to not match new test
* Reverse lookup fixes and enhancements.
1) Add a "mayIncludeUnknown" parameter to DimFilter#optimize. This is important
because otherwise the reverse-lookup optimization is done improperly when
the "in" filter appears under a "not", and the lookup extractionFn may return
null for some possible values of the filtered column. The "includeUnknown" test
cases in InDimFilterTest illustrate the difference in behavior.
2) Enhance InDimFilter#optimizeLookup to handle "mayIncludeUnknown", and to be able
to do a reverse lookup in a wider variety of cases.
3) Make "unapply" protected in LookupExtractor, and move callers to "unapplyAll".
The main reason is that MapLookupExtractor, a common implementation, lacks a
reverse mapping and therefore does a scan of the map for each call to "unapply".
For performance sake these calls need to be batched.
* Remove optimize call from BloomDimFilter.
* Follow the law.
* Fix tests.
* Fix imports.
* Switch function.
* Fix tests.
* More tests.
* New handling for COALESCE, SEARCH, and filter optimization.
COALESCE is converted by Calcite's parser to CASE, which is largely
counterproductive for us, because it ends up duplicating expressions.
In the current code we end up un-doing it in our CaseOperatorConversion.
This patch has a different approach:
1) Add CaseToCoalesceRule to convert CASE back to COALESCE earlier, before
the Volcano planner runs, using CaseToCoalesceRule.
2) Add FilterDecomposeCoalesceRule to decompose calls like
"f(COALESCE(x, y))" into "(x IS NOT NULL AND f(x)) OR (x IS NULL AND f(y))".
This helps use indexes when available on x and y.
3) Add CoalesceLookupRule to push COALESCE into the third arg of LOOKUP.
4) Add a native "coalesce" function so we can convert 3+ arg COALESCE.
The advantage of this approach is that by un-doing the CASE to COALESCE
conversion earlier, we have flexibility to do more stuff with
COALESCE (like decomposition and pushing into LOOKUP).
SEARCH is an operator used internally by Calcite to represent matching
an argument against some set of ranges. This patch improves our handling
of SEARCH in two ways:
1) Expand NOT points (point "holes" in the range set) from SEARCH as
`!(a || b)` rather than `!a && !b`, which makes it possible to convert
them to a "not" of "in" filter later.
2) Generate those nice conversions for NOT points even if the SEARCH
is not composed of 100% NOT points. Without this change, a SEARCH
for "x NOT IN ('a', 'b') AND x < 'm'" would get converted like
"x < 'a' OR (x > 'a' AND x < 'b') OR (x > 'b' AND x < 'm')".
One of the steps we take when generating Druid queries from Calcite
plans is to optimize native filters. This patch improves this step:
1) Extract common ANDed predicates in ConvertSelectorsToIns, so we can
convert "(a && x = 'b') || (a && x = 'c')" into "a && x IN ('b', 'c')".
2) Speed up CombineAndSimplifyBounds and ConvertSelectorsToIns on
ORs with lots of children by adjusting the logic to avoid calling
"indexOf" and "remove" on an ArrayList.
3) Refactor ConvertSelectorsToIns to reduce duplicated code between the
handling for "selector" and "equals" filters.
* Not so final.
* Fixes.
* Fix test.
* Fix test.
* Fix ColumnSelectorColumnIndexSelector#getColumnCapabilities.
It was using virtualColumns.getColumnCapabilities, which only returns
capabilities for virtual columns, not regular columns. The effect of this
is that expression filters (and in some cases, arrayContainsElement filters)
would build value matchers rather than use indexes.
I think this has been like this since #12315, which added the
getColumnCapabilities method to BitmapIndexSelector, and included the same
implementation as exists in the code today.
This error is easy to make due to the design of virtualColumns.getColumnCapabilities,
so to help avoid it in the future, this patch renames the method to
getColumnCapabilitiesWithoutFallback to emphasize that it does not return
capabilities for regular columns.
* Make getColumnCapabilitiesWithoutFallback package-private.
* Fix expression filter bitmap usage.
The PR: #13947 introduced a function evalDimension() in the interface RowFunction.
There was no default implementation added for this method, so all existing implementations and custom transforms break and are required to implement their own version of the evalDimension method. This PR adds a default implementation in the interface, which makes evalDimension return the eval result as a singleton array.
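A hedged sketch of what such a default might look like (the exact implementation in the PR may differ):

```java
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import org.apache.druid.data.input.Row;

default List<String> evalDimension(Row row)
{
  // Fall back to the single-valued eval result, wrapped as a singleton list.
  return Collections.singletonList(Objects.toString(eval(row), null));
}
```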
Fixes #15072
Before this modification, the third parameter (timezone) was required to be a Literal; an error would be thrown when this parameter is a column Identifier.
Changes
- Add `log` implementation for `AuditManager` along with `SQLAuditManager`
- `LoggingAuditManager` simply logs the audit event. Thus, it returns empty for
all `fetchAuditHistory` calls.
- Add new config `druid.audit.manager.type` which can take values `log`, `sql` (default)
- Add new config `druid.audit.manager.logLevel` which can take values `DEBUG`, `INFO`, `WARN`.
This gets activated only if `type` is `log`.
- Remove usage of `ConfigSerde` from `AuditManager` as audit is not just limited to configs
- Add `AuditSerdeHelper` for a single implementation of serialization/deserialization of
audit payload and other utility methods.
* Allow for kafka emitter producer secrets to be masked in logs instead of being visible
This change will allow for kafka producer config values that should be secrets to not show up in the logs.
This will enhance the security of the people who use the kafka emitter to use this if they want to.
This is opt in and will not affect prior configs for this emitter
* fix checkstyle issue
* change property name
I was looking into a query which was performing a bit poorly because the case_searched was touching more than 1 column (if there is only 1 column, there is a cache-based evaluator).
While I was doing that I've noticed that there are a few simple things which could help a bit:
use a static TRUE/FALSE instead of creating a new object every time
create the ExprEval early for ConstantExprs (except the one for BigInteger, which seems to have some odd contract)
return early from type autodetection
These changes mostly reduce the amount of garbage the query creates during case_searched evaluation. ExpressionSelectorBenchmark shows some improvement (~15%), but my manual trials on the taxi dataset with 60M rows showed larger improvements, probably because these changes mostly just reduce GC pressure.
* Add initial draft of MarkDanglingTombstonesAsUnused duty.
* Use overshadowed segments instead of all used segments.
* Add unit test for MarkDanglingSegmentsAsUnused duty.
* Add mock call
* Simplify code.
* Docs
* shorter lines formatting
* metric doc
* More tests, refactor and fix up some logic.
* update javadocs; other review comments.
* Make numCorePartitions as 0 in the TombstoneShardSpec.
* fix up test
* Add tombstone core partition tests
* Update docs/design/coordinator.md
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
* review comment
* Minor cleanup
* Only consider tombstones with 0 core partitions
* Need to register the test shard type to make jackson happy
* test comments
* checkstyle
* fixup misc typos in comments
* Update logic to use overshadowed segments
* minor cleanup
* Rename duty to eternity tombstone instead of dangling. Add test for full eternity tombstone.
* Address review feedback.
---------
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
A query with lookups in a FilteredAggregator fails with this exception on the router:
Cannot construct instance of `org.apache.druid.query.aggregation.FilteredAggregatorFactory`, problem: Lookup [campaigns_lookup[campaignId][is_sold][autodsp]] not found at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 913] (through reference chain: org.apache.druid.query.groupby.GroupByQuery["aggregations"]->java.util.ArrayList[1])
The problem is that the constructor of FilteredAggregatorFactory is actually validating whether the lookup exists, in this statement: dimFilter.toFilter().
This fails on the router, which is to be expected, because the router isn't assigned any lookups.
The fix is to move to a lazy initialisation of the filter object in the constructor.
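A hedged sketch of the lazy initialisation (the field name and surrounding class details are assumptions):

```java
import com.google.common.base.Supplier;
import com.google.common.base.Suppliers;

// Inside FilteredAggregatorFactory: defer dimFilter.toFilter() until the filter is actually
// needed, so the router, which has no lookups loaded, never resolves the lookup at
// deserialization time.
private final Supplier<Filter> filterSupplier = Suppliers.memoize(() -> dimFilter.toFilter());

Filter getFilter()
{
  return filterSupplier.get();
}
```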
It wasn't checking the column name, so it would return a domain regardless
of the input column. This means that null filters on data sources with range
partitioning would lead to excessive pruning of segments, and therefore
missing results.
I think this is a problem, as it discards the false return value when putToKeyBuffer can't store the value because of the limit.
Not forwarding the return value at that point may lead to normal continuation even though something was not added to the dictionary.
* Make numCorePartitions as 0 in the TombstoneShardSpec.
* fix up test
* Add tombstone core partition tests
* review comment
* Need to register the test shard type to make jackson happy
Fixed the following flaky tests:
org.apache.druid.math.expr.ParserTest#testApplyFunctions
org.apache.druid.math.expr.ParserTest#testSimpleMultiplicativeOp1
org.apache.druid.math.expr.ParserTest#testFunctions
org.apache.druid.math.expr.ParserTest#testSimpleLogicalOps1
org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp1
org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp2
The above-mentioned tests have been reported as flaky (tests assuming a deterministic implementation of a non-deterministic specification) when run against the NonDex tool.
The tests contain assertions (Assertion 1 & Assertion 2) that compare an ArrayList created from a HashSet using the ArrayList() constructor with another List. However, HashSet does not guarantee the ordering of elements, resulting in flaky tests that assume a deterministic implementation of HashSet. Thus, when the NonDex tool shuffles the HashSet elements, the tests fail.
Co-authored-by: ythorat2 <ythorat2@illinois.edu>
* MSQ generates tombstones honoring the query's granularity.
This change tweaks MSQ to only account for the infinite-interval tombstones.
For finite-interval tombstones, the MSQ query granularity will be used,
which is consistent with how MSQ works.
* more tests and some cleanup.
* checkstyle
* comment edits
* Throw TooManyBuckets fault based on review; add more tests.
* Add javadocs for both methods on reconciling the methods.
* review: Move testReplaceTombstonesWithTooManyBucketsThrowsException to MsqFaultsTest
* remove unused imports.
* Move TooManyBucketsException to indexing package for shared exception handling.
* lower max bucket for tests and fixup count
* Advance and count the iterator.
* checkstyle
In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal.
To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.
* Use filters for pruning properly for hash-joins.
Native used them too aggressively: it might use filters for the RHS
to prune the LHS. MSQ used them not at all. Now, both use them properly,
pruning based on base (LHS) columns only.
* Fix tests.
* Fix style.
* Clear filterFields too.
* Update.
* Add system fields to input sources.
Main changes:
1) The SystemField enum defines system fields "__file_uri", "__file_path",
and "__file_bucket". They are associated with each input entity.
2) The SystemFieldInputSource interface can be added to any InputSource
to make it system-field-capable. It sets up serialization of a list
of configured "systemFields" in the JSON form of the input source, and
provides a method getSystemFieldValue for computing the value of each
system field. Cloud object, HDFS, HTTP, and Local now have this.
* Fix various LocalInputSource calls.
* Fix style stuff.
* Fixups.
* Fix tests and coverage.
* better documentation for the differences between arrays and mvds
* add outputType to ExpressionPostAggregator to make docs true
* add output coercion if outputType is defined on ExpressionPostAgg
* updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables
* Frames: consider writing singly-valued column when input column hasMultipleValues is UNKNOWN.
Prior to this patch, columnar frames would always write multi-valued columns if
the input column had hasMultipleValues = UNKNOWN. This had the effect of flipping
UNKNOWN to TRUE when copying data into frames, which is problematic because TRUE
causes expressions to assume that string inputs must be treated as arrays.
We now avoid this by flipping UNKNOWN to FALSE if no multi-valuedness
is encountered, and flipping it to TRUE if multi-valuedness is encountered.
* Add regression test case.
Currently advance function in postJoinCursor calls advanceUninterruptibly which in turn keeps calling baseCursor.advanceUninterruptibly until the post join condition matches, without checking for interrupts. This causes the CPU to hit 100% without getting a chance for query to be cancelled.
With this change, the call flow of advance and advanceUninterruptibly is separated out so that they call baseCursor.advance and baseCursor.advanceUninterruptibly in them, respectively, giving a chance for interrupts in the former case between successive calls to baseCursor.advance.
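A hedged sketch of the separated call flow (names are assumptions based on the description above):

```java
@Override
public void advance()
{
  // baseCursor.advance() checks for interrupts, so a long run of non-matching rows
  // can still be cancelled between successive calls.
  baseCursor.advance();
  while (!baseCursor.isDone() && !postJoinMatcher.matches(false)) {
    baseCursor.advance();
  }
}

@Override
public void advanceUninterruptibly()
{
  // Uninterruptible variant keeps the old behavior of never checking for interrupts.
  baseCursor.advanceUninterruptibly();
  while (!baseCursor.isDone() && !postJoinMatcher.matches(false)) {
    baseCursor.advanceUninterruptibly();
  }
}
```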
* Fix error assuming a Complex Type that is a Number is a double
In the case where a complex type is a number, it may not be castable to double. It can safely be cast as Number first to get to the doubleValue.
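A minimal sketch of the fix (selector and aggregate are placeholders for illustration):

```java
final Object complexValue = selector.getObject();
if (complexValue instanceof Number) {
  // Cast to Number first; a direct cast to Double would throw ClassCastException
  // for Long, Float, Integer, etc.
  final double value = ((Number) complexValue).doubleValue();
  aggregate(value); // placeholder for whatever consumes the double
}
```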