* first pass
* more changes
* fix tests and formatting
* fix kinesis failing tests
* fix kafka tests
* add dimension name to float parse errors
* double and convertToType handling of dimensionName can report parse errors with dimension name
* fix checkstyle issue
* fix tests
* more cases to have better parse exception messages
* fix test
* fix tests
* partially address comments
* annotate method parameter with nullable
* address comments
* fix tests
* let float, double, long dimensionIndexer pass dimensionName down to dimensionHandlerUtils
* fix compilation error and clean up formatting
* clean up whitespace
* address feedback. undo change, pass down report parse exception for convertToType
* fix test
Issue: #14989
The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.
This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
* IncrementalIndex#add is no longer thread-safe.
Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.
It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.
At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.
This patch also:
1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
which the similar datasketches versions already have. This is done to
help them adhere to the contract of Aggregator: concurrent calls to
"aggregate" and "get" must be thread-safe.
2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
druid-benchmarks module.
* Spelling.
* Changes from static analysis.
* Fix javadoc.
This change updates dependencies as needed and fixes tests to remove code incompatible with Java 21
As a result all unit tests now pass with Java 21.
* update maven-shade-plugin to 3.5.0 and follow-up to #15042
* explain why we need to override configuration when specifying outputFile
* remove configuration from dependency management in favor of explicit overrides in each module.
* update to mockito to 5.5.0 for Java 21 support when running with Java 11+
* continue using latest mockito 4.x (4.11.0) when running with Java 8
* remove need to mock private fields
* exclude incorrectly declared mockito dependency from pac4j-oidc
* remove mocking of ByteBuffer, since sealed classes can no longer be mocked in Java 21
* add JVM options workaround for system-rules junit plugin not supporting Java 18+
* exclude older versions of byte-buddy from assertj-core
* fix for Java 19 changes in floating point string representation
* fix missing InitializedNullHandlingTest
* update easymock to 5.2.0 for Java 21 compatibility
* update animal-sniffer-plugin to 1.23
* update nl.jqno.equalsverifier to 3.15.1
* update exec-maven-plugin to 3.1.0
Changes:
- Add task context parameter `taskLockType`. This determines the type of lock used by a batch task.
- Add new task actions for transactional replace and append of segments
- Add methods StorageCoordinator.commitAppendSegments and commitReplaceSegments
- Upgrade segments to appropriate versions when performing replace and append
- Add new metadata table `upgradeSegments` to track segments that need to be upgraded
- Add tests
There are two type of DeterminePartitionsJob:
- When the input data is not assume grouped, there may be duplicate rows.
In this case, two MR jobs are launched. The first one do group job to remove duplicate rows.
And a second one to perform global sorting to find lower and upper bound for target segments.
- When the input data is assume grouped, we only need to launch the global sorting
MR job to find lower and upper bound for segments.
Sampling strategy:
- If the input data is assume grouped, sample by random at the mapper side of the global sort mr job.
- If the input data is not assume grouped, sample at the mapper of the group job. Use hash on time
and all dimensions and mod by sampling factor to sample, don't use random method because there
may be duplicate rows.
UniformGranularityTest's test to test a large number of intervals
runs through 10 years of 1 second intervals. This pushes a lot of
stuff through IntervalIterator and shows up in terms of test
runtime as one of the hottest tests. Most of the time is going to
constructing jodatime objects because it is doing things with
DateTime objects instead of millis. Change the calls to use
millis instead and things go faster.
This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so impacts Druid versions 0.22.0+.
While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.
Hadoop 2 often causes red security scans on Druid distribution because of the dependencies it brings. We want to move away from Hadoop 2 and provide Hadoop 3 distribution available. Switch druid to building with Hadoop 3 by default. Druid will still be compatible with Hadoop 2 and users can build hadoop-2 compatible distribution using hadoop2 profile.
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'auto' type indexer and merger which produces a new common nested format of columns, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
* Lower default maxRowsInMemory for realtime ingestion.
The thinking here is that for best ingestion throughput, we want
intermediate persists to be as big as possible without using up all
available memory. So, we rely mainly on maxBytesInMemory. The default
maxRowsInMemory (1 million) is really just a safety: in case we have
a large number of very small rows, we don't want to get overwhelmed
by per-row overheads.
However, maximum ingestion throughput isn't necessarily the primary
goal for realtime ingestion. Query performance is also important. And
because query performance is not as good on the in-memory dataset, it's
helpful to keep it from growing too large. 150k seems like a reasonable
balance here. It means that for a typical 5 million row segment, we
won't trigger more than 33 persists due to this limit, which is a
reasonable number of persists.
* Update tests.
* Update server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Fix test.
* Fix link.
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything
* fix poms and license stuff
* mockito is evil
* allow reset of JvmUtils RuntimeInfo if tests used static injection to override
* single typed "root" only nested columns now mimic "regular" columns of those types
* incremental index can now use nested column indexer instead of string indexer for discovered columns
In clusters with a large number of segments, the duty `MarkAsUnusedOvershadowedSegments`
can take a long very long time to finish. This is because of the costly invocation of
`timeline.isOvershadowed` which is done for every used segment in every coordinator run.
Changes
- Use `DataSourceSnapshot.getOvershadowedSegments` to get all overshadowed segments
- Iterate over this set instead of all used segments to identify segments that can be marked as unused
- Mark segments as unused in the DB in batches rather than one at a time
- Refactor: Add class `SegmentTimeline` for ease of use and readability while using a
`VersionedIntervalTimeline` of segments.
This PR enables ARM builds on Travis. I've ported over the changes from @martin-g on reducing heap requirements for some of the tests to ensure they run well on Travis arm instances.
* Store null columns in the segments
* fix test
* remove NullNumericColumn and unused dependency
* fix compile failure
* use guava instead of apache commons
* split new tests
* unused imports
* address comments
Mockito now supports all our needs and plays much better with recent Java versions.
Migrating to Mockito also simplifies running the kind of tests that required PowerMock in the past.
* replace all uses of powermock with mockito-inline
* upgrade mockito to 4.3.1 and fix use of deprecated methods
* import mockito bom to align all our mockito dependencies
* add powermock to forbidden-apis to avoid accidentally reintroducing it in the future
changes:
* IncrementalIndex is now a ColumnInspector
* fixes performance regression from using map of ColumnCapabilities from IncrementalIndex as a RowSignature
* Use intermediate-persist IndexSpec during multiphase merge.
The main change is the addition of an intermediate-persist IndexSpec
to the main "merge" method in IndexMerger. There are also a few minor
adjustments to the IndexMerger interface to encourage more harmonious
usage of its methods in the future.
* Additional changes inspired by the test coverage checker.
- Remove unused-in-production IndexMerger methods "append" and "convert".
- Add additional unit tests to UnifiedIndexerAppenderatorsManager.
* Additional adjustments.
* Even more additional adjustments.
* Test fixes.
* add back and deprecate aggregator factory methods so i can say i told you so when i delete these later
* rename to make less ambiguous, fix fill method
* adjust
* complex typed expressions
* add built-in hll collector expressions to get coverage on druid-processing, more types, more better
* rampage!!!
* more javadoc
* adjustments
* oops
* lol
* remove unused dependency
* contradiction?
* more test
Add support for hadoop 3 profiles . Most of the details are captured in #11791 .
We use a combination of maven profiles and resource filtering to achieve this. Hadoop2 is supported by default and a new maven profile with the name hadoop3 is created. This will allow the user to choose the profile which is best suited for the use case.
* better type system
* needle in a haystack
* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support
* fixup merge
* more test
* fixup
* intern
* fix
* oops
* oops again
* ...
* more test coverage
* fix error message
* adjust interning, more javadocs
* oops
* more docs more better