during segment publishing we do streaming operations on a collection not
safe for concurrent modification. To guarantee correct results we must
also guard any operations on the stream itself.
This may explain the issue seen in https://github.com/apache/druid/issues/9845
* Refactor JoinFilterAnalyzer
This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.
To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.
This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Fitlers with associated tests.
* fix tests
* Refactor JoinFilterAnalyzer - part 2
This change introduces the following components:
* RhsRewriteCandidates - a wrapper for a list of candidates and associated
functions to operate on the set of candidates.
* JoinableClauses - a wrapper for the list of JoinableClause that represent
a join condition and the associated functions to operate on the clauses.
* Equiconditions - a wrapper representing the equiconditions that are used
in the join condition.
And associated test changes.
This refactoring surfaced 2 bugs:
- Missing equals and hashcode implementation for RhsRewriteCandidate, thus
allowing potential duplicates in the rhs rewrite candidates
- Missing Filter#supportsRequiredColumnRewrite check in
analyzeJoinFilterClause, which could result in UnsupportedOperationException
being thrown by the filter
* fix compile error
* remove unused class
* Refactor JoinFilterAnalyzer - Correlations
Move the correlation related code out into it's own class so it's easier
to maintain.
Another patch should follow this one so that the query path uses the
correlation object instead of it's underlying maps.
* Optimize join queries where filter matches nothing
Fixes#9787
This PR changes the Joinable interface to return an Optional set of correlated
values for a column.
This allows the JoinFilterAnalyzer to differentiate between the case where the
column has no matching values and when the column could not find matching
values.
This PR chose not to distinguish between cases where correlated values could
not be computed because of a config that has this behavior disabled or because
of user error - like a column that could not be found. The reasoning was that
the latter is likely an error and the non filter pushdown path will surface the
error if it is.
* Refactor JoinFilterAnalyzer
This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.
To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.
This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Fitlers with associated tests.
* fix tests
* Refactor JoinFilterAnalyzer - part 2
This change introduces the following components:
* RhsRewriteCandidates - a wrapper for a list of candidates and associated
functions to operate on the set of candidates.
* JoinableClauses - a wrapper for the list of JoinableClause that represent
a join condition and the associated functions to operate on the clauses.
* Equiconditions - a wrapper representing the equiconditions that are used
in the join condition.
And associated test changes.
This refactoring surfaced 2 bugs:
- Missing equals and hashcode implementation for RhsRewriteCandidate, thus
allowing potential duplicates in the rhs rewrite candidates
- Missing Filter#supportsRequiredColumnRewrite check in
analyzeJoinFilterClause, which could result in UnsupportedOperationException
being thrown by the filter
* fix compile error
* remove unused class
* Update tutorial-query.md
* First full pass complete
* Smoothing over, a bit
* link and spell checking
* Update querying.md
* Review comments; screenshot fixes
* Making ports consistent, pending confirmation
Switching to the Router port, to make this be consistent with the tutorial ports, but can switch back here and there if it should be 8082 instead.
* Resizing screenshot
* Update querying.md
* Review feedback incorporated.
* Refactor JoinFilterAnalyzer
This patch attempts to make it easier to follow the join filter analysis code
with the hope of making it easier to add rewrite optimizations in the future.
To keep the patch small and easy to review, this is the first of at least 2
patches that are planned.
This patch adds a builder to the Pre-Analysis, so that it is easier to
instantiate the preAnalysis. It also moves some of the filter normalization
code out to Fitlers with associated tests.
* fix tests
* Add ingestion specs for CalciteQueryTests
This PR introduces ingestion specs that can be used for local testing
so that CalciteQueryTests can be built on a druid cluster.
* Add README
* Update sql/src/test/resources/calcite/tests/README.md
Motivation for this change is to not inadvertently identify binary
formats that contain uncompressed string data as TSV or CSV.
Moving detection of magic byte headers before heuristics should be more
robust in general.
* Enforce code coverage
Add an automated way of checking if new code has adequate unit tests,
since merging code coverage reports and check coverage thresholds via
coveralls or codecov is unreliable.
The following minimum unit test code coverage is now enforced:
- 80% functions
- 65% branch
- 65% line
Branch and line coverage thresholds are slightly lower for now as they
are harder to achieve.
After the code coverage check looks reliable, the thresholds can be
increased later if needed.
* Add comments
* Number based columns representing time in custom format cannot be used as timestamp column in Druid.
Prior to this fix, if an integer column in parquet is storing dateint in format yyyyMMdd, it cannot be used as timestamp column in Druid as the timestamp parser interprets it as a number storing UTC time instead of treating it as a number representing time in yyyyMMdd format. Data formats like TSV or CSV don't suffer from this problem as the timestamp is passed in an as string which the timestamp parser is able to parse correctly.
* refactor SeekableStreamSupervisor usage of RecordSupplier to reduce contention between background threads and main thread, refactor KinesisRecordSupplier, refactor Kinesis lag metric collection and emitting
* fix style and test
* cleanup, refactor, javadocs, test
* fixes
* keep collecting current offsets and lag if unhealthy in background reporting thread
* review stuffs
* add comment
* Add AvroOCFInputFormat
* Support supplying a reader schema in AvroOCFInputFormat
* Add docs for Avro OCF input format
* Address review comments
* Address second round of review
* Bad plan for table-lookup-lookup join with filter on first lookup and outer limit
* Bad plan for table-lookup-lookup join with filter on first lookup and outer limit
* Bad plan for table-lookup-lookup join with filter on first lookup and outer limit
* Bad plan for table-lookup-lookup join with filter on first lookup and outer limit
* Bad plan for table-lookup-lookup join with filter on first lookup and outer limit
* Bad plan for table-lookup-lookup join with filter on first lookup and outer limit
* address comments
* address comments
* fix checkstyle
* address comments
* address comments
* add flag to flattenSpec to keep null columns
* remove changes to inputFormat interface
* add comment
* change comment message
* update web console e2e test
* move keepNullColmns to JSONParseSpec
* fix merge conflicts
* fix tests
* set keepNullColumns to false by default
* fix lgtm
* change Boolean to boolean, add keepNullColumns to hash, add tests for keepKeepNullColumns false + true with no nuulul columns
* Add equals verifier tests
* postagg test coverage for druid-stats, druid-momentsketch, druid-tdigestsketch and fixes
* style fixes
* fix comparator for TDigestQuantilePostAggregator
* Update data-formats.md
Per Suneet, "Since you're editing this file can you also fix the json on line 177 please - it's missing a comma after the }"
* Light text cleanup
* Removing discussion of sample data, since it's repeated in the data loading tutorial, and not immediately relevant here.
* Clarifying accepted values for URI lookup
* Update index.md
* original quickstart full first pass
* original quickstart full first pass
* first pass all the way through
* straggler
* image touchups and finished old tutorial
* a bit of finishing up
* druid-caffeine-cache ext previously removed
* Sample MaxDirectMemorySize value unrealistic
* Review comments
* fixing links
* spell checking gymnastics
* workerThreads desc slightly expanded
* typo
* Typo
* Reversing Kafka config order
* Changing order of configs for Kinesis
* Trying this again: ioConfig then tuningConfig