* Move tools for indexing to TaskToolbox instead of injecting them in constructor
* oops, other changes
* fix test
* unnecessary new file
* fix test
* fix build
* Fix handling of 'join' on top of 'union' datasources.
The problem is that unions are typically rewritten into a series of
individual queries on the underlying tables, but this isn't done when
the union is wrapped in a join.
The main changes are in UnionQueryRunner:
1) Replace an instanceof UnionQueryRunner check with DataSourceAnalysis.
2) Replace a "query.withDataSource" call with a new function, "Queries.withBaseDataSource".
Together, these enable UnionQueryRunner to "see through" a join.
* Tests.
* Adjust heap sizes for integration tests.
* Different approach, more tests.
* Tweak.
* Styling.
* Add support for all partitioing schemes for auto compaction
* annotate last compaction state for multi phase parallel indexing
* fix build and tests
* test
* better home
* better type tracking: add typed postaggs, finalized types for agg factories
* more javadoc
* adjustments
* transition to getTypeName to be used exclusively for complex types
* remove unused fn
* adjust
* more better
* rename getTypeName to getComplexTypeName
* setup expression post agg for type inference existing
* more javadocs
* fixup
* oops
* more test
* more test
* more comments/javadoc
* nulls
* explicitly handle only numeric and complex aggregators for incremental index
* checkstyle
* more tests
* adjust
* more tests to showcase difference in behavior
* timeseries longsum array
* Add "offset" parameter to the Scan query.
It works by doing the query as normal and then throwing away the first
"offset" number of rows on the broker.
* Fix constructor call.
* Fix up JSONs.
* Fix call to ScanQuery.
* Doc update.
* Fix javadocs.
* Spotbugs, LGTM suppressions.
* Javadocs.
* Fix suppression.
* Stabilize Scan query result order, add tests.
* Update LGTM comment.
* Fixup.
* Test different batch sizes too.
* Nicer tests.
* Fix comment.
* support unit suffix on byte-related properties
* add doc
* change default value of byte-related properites in example files
* fix coding style
* fix doc
* fix CI
* suppress spelling errors
* improve code according to comments
* rename Bytes to HumanReadableBytes
* add getBytesInInt to get value safely
* improve doc
* fix problem reported by CI
* fix problem reported by CI
* resolve code review comments
* improve error message
* improve code & doc according to comments
* fix CI problem
* improve doc
* suppress spelling check errors
* Add validation for authorizer name
* fix deps
* add javadocs
* Do not use resource filters
* Fix BasicAuthenticatorResource as well
* Add integration tests
* fix test
* fix
* Add availability and consistency docs.
Describes transactional ingestion and atomic replacement. Also, this patch
deletes some bad advice from the javadocs for SegmentTransactionalInsertAction.
* Fix missing word.
* QueryCountStatsMonitor can be injected in the Peon
This change fixes a dependency injection bug where there is a circular
dependency on getting the MonitorScheduler when a user configures the
QueryCountStatsMonitor to be used.
* fix tests
* Actually fix the tests this time
* Fix native batch range partition segment sizing
Fixes#10057.
Native batch range partitioning was only considering the partition
dimension value when grouping rows instead of using all of the row's
partition values. Thus, for schemas with multiple dimensions, the rollup
was overestimated, which would cause too many dimension values to be
packed into the same range partition. The resulting segments would then
be overly large (and not honor the target or max partition sizes).
Main changes:
- PartialDimensionDistributionTask: Consider all dimension values when
grouping row
- RangePartitionMultiPhaseParallelIndexingTest: Regression test by
having input with rows that should roll up and rows that should not
roll up
* Use hadoop & native hash ingestion row group key
* Fix missing temp dir for native single_dim
Native single dim indexing throws a file not found exception from
InputEntityIteratingReader.java:81. This MR creates the required
temporary directory when setting up the
PartialDimensionDistributionTask. The change was tested on a Druid
cluster. After installing the change native single_dim indexing
completes successfully.
* Fix indentation
* Use SinglePhaseSubTask as example for creating the temp dir
* Move temporary indexing dir creation in to TaskToolbox
* Remove unused dependency
Co-authored-by: Morri Feldman <morri@appsflyer.com>
* Fill in the core partition set size properly for batch ingestion with
dynamic partitioning
* incomplete javadoc
* Address comments
* fix tests
* fix json serde, add tests
* checkstyle
* Set core partition set size for hash-partitioned segments properly in
batch ingestion
* test for both parallel and single-threaded task
* unused variables
* fix test
* unused imports
* add hash/range buckets
* some test adjustment and missing json serde
* centralized partition id allocation in parallel and simple tasks
* remove string partition chunk
* revive string partition chunk
* fill numCorePartitions for hadoop
* clean up hash stuffs
* resolved todos
* javadocs
* Fix tests
* add more tests
* doc
* unused imports
* Allow append to existing datasources when dynamic partitioing is used
* fix test
* checkstyle
* checkstyle
* fix test
* fix test
* fix other tests..
* checkstyle
* hansle unknown core partitions size in overlord segment allocation
* fail to append when numCorePartitions is unknown
* log
* fix comment; rename to be more intuitive
* double append test
* cleanup complete(); add tests
* fix build
* add tests
* address comments
* checkstyle
* Fill in the core partition set size properly for batch ingestion with
dynamic partitioning
* incomplete javadoc
* Address comments
* fix tests
* fix json serde, add tests
* checkstyle
* Set core partition set size for hash-partitioned segments properly in
batch ingestion
* test for both parallel and single-threaded task
* unused variables
* fix test
* unused imports
* add hash/range buckets
* some test adjustment and missing json serde
* centralized partition id allocation in parallel and simple tasks
* remove string partition chunk
* revive string partition chunk
* fill numCorePartitions for hadoop
* clean up hash stuffs
* resolved todos
* javadocs
* Fix tests
* add more tests
* doc
* unused imports
* IntelliJ inspection and checkstyle rule for "Collection.EMPTY_* field accesses replaceable with Collections.empty*()"
* Reverted checkstyle rule
* Added tests to pass CI
* Codestyle
* move benchmark data generator into druid-processing, add a GeneratorInputSource to fill up a cluster with data
* newlines
* make test coverage not fail maybe
* remove useless test
* Update pom.xml
* Update GeneratorInputSourceTest.java
* less passive aggressive test names
* Empty partitionDimension has less rollup compared to the case when it is explicitly specified
* Adding a unit test for the empty partitionDimension scenario. Fixing another test which was failing
* Fixing CI Build Inspection Issue
* Addressing all review comments
* Updating the javadocs for the hash method in HashBasedNumberedShardSpec
This change removes ListenableFutures.transformAsync in favor of the
existing Guava Futures.transform implementation. Our own implementation
had a bug which did not fail the future if the applied function threw an
exception, resulting in the future never completing.
An attempt was made to fix this bug, however when running againts Guava's own
tests, our version failed another half dozen tests, so it was decided to not
continue down that path and scrap our own implementation.
Explanation for how was this bug manifested itself:
An exception thrown in BaseAppenderatorDriver.publishInBackground when
invoked via transformAsync in StreamAppenderatorDriver.publish will
cause the resulting future to never complete.
This explains why when encountering https://github.com/apache/druid/issues/9845
the task will never complete, forever waiting for the publishFuture to
register the handoff. As a result, the corresponding "Error while
publishing segments ..." message only gets logged once the index task
times out and is forcefully shutdown when the future is force-cancelled
by the executor.
* refactor SeekableStreamSupervisor usage of RecordSupplier to reduce contention between background threads and main thread, refactor KinesisRecordSupplier, refactor Kinesis lag metric collection and emitting
* fix style and test
* cleanup, refactor, javadocs, test
* fixes
* keep collecting current offsets and lag if unhealthy in background reporting thread
* review stuffs
* add comment
* add flag to flattenSpec to keep null columns
* remove changes to inputFormat interface
* add comment
* change comment message
* update web console e2e test
* move keepNullColmns to JSONParseSpec
* fix merge conflicts
* fix tests
* set keepNullColumns to false by default
* fix lgtm
* change Boolean to boolean, add keepNullColumns to hash, add tests for keepKeepNullColumns false + true with no nuulul columns
* Add equals verifier tests
* IntelliJ inspections cleanup
* Standard Charset object can be used
* Redundant Collection.addAll() call
* String literal concatenation missing whitespace
* Statement with empty body
* Redundant Collection operation
* StringBuilder can be replaced with String
* Type parameter hides visible type
* fix warnings in test code
* more test fixes
* remove string concatenation inspection error
* fix extra curly brace
* cleanup AzureTestUtils
* fix charsets for RangerAdminClient
* review comments
* WIP integration tests
* Add integration test for ingestion with transformSpec
* WIP almost working tests
* Add ignored tests
* checkstyle stuff
* remove newPage from index task ingestion spec
* more test cleanup
* still not quite working
* Actually disable the tests
* working tests
* fix codestyle
* dont use junit in integration tests
* actually fix the bug
* fix checkstyle
* bring index tests closer to reindex tests
* fix nullhandling exceptions related to test ordering
Tests might get executed in different order depending on the maven
version and the test environment. This may lead to "NullHandling module
not initialized" errors for some tests where we do not initialize
null-handling explicitly.
* use InitializedNullHandlingTest
* DruidSegmentReader should work if timestamp is specified as a dimension
* Add integration tests
Tests for compaction and re-indexing a datasource with the timestamp column
* Instructions to run integration tests against quickstart
* address pr