During ingestion, incremental segments are created in memory for the different time chunks and persisted to disk when certain thresholds are reached (max number of rows, max memory, incremental persist period etc). In the case where there are a lot of dimension and metrics (1000+) it was observed that the creation/serialization of incremental segment file format for persistence and persisting the file took a while and it was blocking ingestion of new data. This affected the real-time ingestion. This serialization and persistence can be parallelized across the different time chunks. This update aims to do that.
The patch adds a simple configuration parameter to the ingestion tuning configuration to specify number of persistence threads. The default value is 1 if it not specified which makes it the same as it is today.
This patch bumps Delta Lake Kernel dependency from 3.0.0 to 3.1.0, which released last week - please see https://github.com/delta-io/delta/releases/tag/v3.1.0 for release notes.
There were a few "breaking" API changes in 3.1.0, you can find the rationale for some of those changes here.
Next-up in this extension: add and expose filter predicates.
If lots of keys map to the same value, reversing a LOOKUP call can slow
things down unacceptably. To protect against this, this patch introduces
a parameter sqlReverseLookupThreshold representing the maximum size of an
IN filter that will be created as part of lookup reversal.
If inSubQueryThreshold is set to a smaller value than
sqlReverseLookupThreshold, then inSubQueryThreshold will be used instead.
This allows users to use that single parameter to control IN sizes if they
wish.
* Identify not range filters without negating subexpressions
Earlier betweenish (range/bounds) filters were identified thru
a process of negating the subexpressions which may have not performed that well.
(it could have dominated the runtime in some cases)
This patch makes that unnecessary as its able to create the negate expression directly.
* add test;fix for multiple intervals
This PR wires up ValueIndexes and ArrayElementIndexes for nested arrays, ValueIndexes for nested long and double columns, and fixes a handful of bugs I found after adding nested columns to the filter test gauntlet.
PassthroughAggregatorFactory overrides a deprecated method in the AggregatorFactory, on which it relies on for serializing one of its fields complexTypeName. This was accidentally removed, leading to a bug in the factory, where the type name doesn't get serialized properly, and places null in the type name. This PR revives that method with a different name and adds tests for the same.
Proposal #13469
Original PR #14024
A new method is being added in QueryLifecycle class to authorise a query based on authentication result.
This method is required since we authenticate the query by intercepting it in the grpc extension and pass down the authentication result.
introduce checks to ensure that window frame is supported
added check to ensure that no expressions are set as bounds
added logic to detect following/following like cases - described in Window function fails to demarcate if 2 following are used #15739
currently RANGE frames are only supported correctly if both endpoints are unbounded or current row Offset based window range support #15767
added windowingStrictValidation context key to provide a way to override the check
- After upgrading the pac4j version in: https://github.com/apache/druid/pull/15522. We were not able to access the druid ui.
- Upgraded the Nimbus libraries version to a compatible version to pac4j.
- In the older pac4j version, when we return RedirectAction there we also update the webcontext Response status code and add the authentication URL to the header. But in the newer pac4j version, we just simply return the RedirectAction. So that's why it was not getting redirected to the generated authentication URL.
- To fix the above, I have updated the NOOP_HTTP_ACTION_ADAPTER to JEE_HTTP_ACTION_ADAPTER and it updates the HTTP Response in context as per the HTTP Action.
Adds a set of benchmark queries for measuring the planning time with the IN operator. Current results indicate that with the recent optimizations, the IN planning time with 100K expressions in the IN clause is just 3s and with 1M is 46s. For IN clause paired with OR <col>=<val> expr, the numbers are 10s and 155s for 100K and 1M, resp.
The PR makes 2 change:
Correct the current logs directory tarred in GHA static checks to log
Make the steps of logs tar-ing and uploading conditional on web console test failures, which currently happens on any step failure in static checks workflow
Sample logs before this change for failed static checks: https://github.com/apache/druid/actions/runs/7719743853/job/21043502498
* something
* test commit
* compilation fix
* more compilation fixes (fixme placeholders)
* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake
Will need to sort out the dependencies later.
* checkpoint
* remove snapshot schema since we can get schema from the row
* iterator bug fix
* json json json
* sampler flow
* empty impls for read(InputStats) and sample()
* conversion?
* conversion, without timestamp
* Web console changes to show Delta Lake
* Asset bug fix and tile load
* Add missing pieces to input source info, etc.
* fix stuff
* Use a different delta lake asset
* Delta lake extension dependencies
* Cleanup
* Add InputSource, module init and helper code to process delta files.
* Test init
* Checkpoint changes
* Test resources and updates
* some fixes
* move to the correct package
* More tests
* Test cleanup
* TODOs
* Test updates
* requirements and javadocs
* Adjust dependencies
* Update readme
* Bump up version
* fixup typo in deps
* forbidden api and checkstyle checks
* Trim down dependencies
* new lines
* Fixup Intellij inspections.
* Add equals() and hashCode()
* chain splits, intellij inspections
* review comments and todo placeholder
* fix up some docs
* null table path and test dependencies. Fixup broken link.
* run prettify
* Different test; fixes
* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests
* yank the old test resource.
* add a couple of sad path tests
* Updates to readme based on latest.
* Version support
* Extract Delta DateTime converstions to DeltaTimeUtils class and add test
* More comprehensive split tests.
* Some test renames.
* Cleanup and update instructions.
* add pruneSchema() optimization for table scans.
* Oops, missed the parquet files.
* Update default table and rename schema constants.
* Test setup and misc changes.
* Add class loader logic as the context class loader is unaware about extension classes
* change some table client creation logic.
* Add hadoop-aws, hadoop-common and related exclusions.
* Remove org.apache.hadoop:hadoop-common
* Apply suggestions from code review
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Add entry to .spelling to fix docs static check
---------
Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Fixes a bug introduced in #15609, where queries involving filters on
TIME_FLOOR could encounter ClassCastException when comparing RangeValue
in CombineAndSimplifyBounds.
Prior to #15609, CombineAndSimplifyBounds would remove, rebuild, and
re-add all numeric range filters as part of consolidating numeric range
filters for the same column under the least restrictive type. #15609
included a change to only rebuild numeric range filters when a consolidation
opportunity actually arises. The bug was introduced because the unconditional
rebuild, as a side effect, masked the fact that in some cases range filters
would be created with string match values and a LONG match value type.
This patch changes the fixup to happen at the time the range filter is
initially created, rather than in CombineAndSimplifyBounds.
Nested columns maintain a null value bitmap for which rows are nulls, however I forgot to wire up a ColumnIndexSupplier to nested columns when filtering the 'raw' data itself, so these were not able to be used. This PR fixes that by adding a supplier that can return NullValueIndex to be used by the NullFilter, which should speed up is null and is not null filters on json columns.
I haven't spent the time to measure the difference yet, but I imagine it should be a significant speed increase.
Note that I only wired this up if druid.generic.useDefaultValueForNull=false (sql compatible mode), the reason being that the SQL planner still uses selector filter, which is unable to properly handle any arrays or complex types (including json, even checking for nulls). The reason for this is so that the behavior is consistent between using the index and using the value matcher, otherwise we get into a situation where using the index has correct behavior but using the value matcher does not, which I was trying to avoid.
### Description
The unusedSegment api response was extended to include the original DataSegment object with the creation time and last used update time added to it. A new object `DataSegmentPlus` was created for this purpose, and the metadata queries used were updated as needed.
example response:
```
[
{
"dataSegment": {
"dataSource": "inline_data",
"interval": "2023-01-02T00:00:00.000Z/2023-01-03T00:00:00.000Z",
"version": "2024-01-25T16:06:42.835Z",
"loadSpec": {
"type": "local",
"path": "/Users/zachsherman/projects/opensrc-druid/distribution/target/apache-druid-30.0.0-SNAPSHOT/var/druid/segments/inline_data/2023-01-02T00:00:00.000Z_2023-01-03T00:00:00.000Z/2024-01-25T16:06:42.835Z/0/index/"
},
"dimensions": "str_dim,double_measure1,double_measure2",
"metrics": "",
"shardSpec": {
"type": "numbered",
"partitionNum": 0,
"partitions": 1
},
"binaryVersion": 9,
"size": 1402,
"identifier": "inline_data_2023-01-02T00:00:00.000Z_2023-01-03T00:00:00.000Z_2024-01-25T16:06:42.835Z"
},
"createdDate": "2024-01-25T16:06:44.196Z",
"usedStatusLastUpdatedDate": "2024-01-25T16:07:34.909Z"
}
]
```
A more ideal/permanent fix would be to have status checks exposed by the services,
but that'll require more code changes. So temporarily bump it to unblock CI now.
As part of becoming FIPS compliance, we are seeing this error: salt must be at least 128 bits when we run the Druid code against FIPS Compliant cryptographic security providers.
This PR fixes the salt size used in Pac4jSessionStore.java