Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>
* Add ability to wait for segment availability for batch jobs
* IT updates
* fix queries in legacy hadoop IT
* Fix broken indexing integration tests
* address an lgtm flag
* spell checker still flagging for hadoop doc. adding under that file header too
* fix compaction IT
* Updates to wait for availability method
* improve unit testing for patch
* fix bad indentation
* refactor waitForSegmentAvailability
* Fixes based off of review comments
* cleanup to get compile after merging with master
* fix failing test after previous logic update
* add back code that must have gotten deleted during conflict resolution
* update some logging code
* fixes to get compilation working after merge with master
* reset interrupt flag in catch block after code review pointed it out
* small changes following self-review
* fixup some issues brought on by merge with master
* small changes after review
* cleanup a little bit after merge with master
* Fix potential resource leak in AbstractBatchIndexTask
* syntax fix
* Add a Compcation TuningConfig type
* add docs stipulating the lack of support by Compaction tasks for the new config
* Fixup compilation errors after merge with master
* Remove erreneous newline
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* fix checkstyle
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* fix test
* fix test
* add log
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* address comments
* fix checkstyle
* fix checkstyle
* add config to skip overhead memory calculation
* add test for the skipBytesInMemoryOverheadCheck config
* add docs
* fix checkstyle
* fix checkstyle
* fix spelling
* address comments
* fix travis
* address comments
* Store hash partition function in dataSegment and allow segment pruning only when hash partition function is provided
* query context
* fix tests; add more test
* javadoc
* docs and more tests
* remove default and hadoop tests
* consistent name and fix javadoc
* spelling and field name
* default function for partitionsSpec
* other comments
* address comments
* fix tests and spelling
* test
* doc
* fix docs error: google to azure and hdfs to http
* fix docs error: indexSpecForIntermediatePersists of tuningConfig in hadoop-based batch part
* fix docs error: logParseExceptions of tuningConfig in hadoop-based batch part
* fix docs error: maxParseExceptions of tuningConfig in hadoop-based batch part
* Doc update for new input source and input format.
- The input source and input format are promoted in all docs under docs/ingestion
- All input sources including core extension ones are located in docs/ingestion/native-batch.md
- All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md
- New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md
* parquet
* add warning for range partitioning with sequential mode
* hdfs + s3, gs
* add fs impl for gs
* address comments
* address comments
* gcs
* Parallel indexing single dim partitions
Implements single dimension range partitioning for native parallel batch
indexing as described in #8769. This initial version requires the
druid-datasketches extension to be loaded.
The algorithm has 5 phases that are orchestrated by the supervisor in
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`.
These phases and the main classes involved are described below:
1) In parallel, determine the distribution of dimension values for each
input source split.
`PartialDimensionDistributionTask` uses `StringSketch` to generate
the approximate distribution of dimension values for each input
source split. If the rows are ungrouped,
`PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter`
uses a Bloom filter to skip rows that would be grouped. The final
distribution is sent back to the supervisor via
`DimensionDistributionReport`.
2) The range partitions are determined.
In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the
supervisor uses `StringSketchMerger` to merge the individual
`StringSketch`es created in the preceding phase. The merged sketch is
then used to create the range partitions.
3) In parallel, generate partial range-partitioned segments.
`PartialRangeSegmentGenerateTask` uses the range partitions
determined in the preceding phase and
`RangePartitionCachingLocalSegmentAllocator` to generate
`SingleDimensionShardSpec`s. The partition information is sent back
to the supervisor via `GeneratedGenericPartitionsReport`.
4) The partial range segments are grouped.
In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`,
the supervisor creates the `PartialGenericSegmentMergeIOConfig`s
necessary for the next phase.
5) In parallel, merge partial range-partitioned segments.
`PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to
retrieve the partial range-partitioned segments generated earlier and
then merges and publishes them.
* Fix dependencies & forbidden apis
* Fixes for integration test
* Address review comments
* Fix docs, strict compile, sketch check, rollup check
* Fix first shard spec, partition serde, single subtask
* Fix first partition check in test
* Misc rewording/refactoring to address code review
* Fix doc link
* Split batch index integration test
* Do not run parallel-batch-index twice
* Adjust last partition
* Split ITParallelIndexTest to reduce runtime
* Rename test class
* Allow null values in range partitions
* Indicate which phase failed
* Improve asserts in tests
* Adjust defaults for hashed partitioning
If neither the partition size nor the number of shards are specified,
default to partitions of 5,000,000 rows (similar to the behavior of
dynamic partitions). Previously, both could be null and cause incorrect
behavior.
Specifying both a partition size and a number of shards now results in
an error instead of ignoring the partition size in favor of using the
number of shards. This is a behavior change that makes it more apparent
to the user that only one of the two properties will be honored
(previously, a message was just logged when the specified partition size
was ignored).
* Fix test
* Handle -1 as null
* Add -1 as null tests for single dim partitioning
* Simplify logic to handle -1 as null
* Address review comments
* Rename partition spec fields
Rename partition spec fields to be consistent across the various types
(hashed, single_dim, dynamic). Specifically, use targetNumRowsPerSegment
and maxRowsPerSegment in favor of targetPartitionSize and
maxSegmentSize. Consistent and clearer names are easier for users to
understand and use.
Also fix various IntelliJ inspection warnings and doc spelling mistakes.
* Fix test
* Improve docs
* Add targetRowsPerSegment to HashedPartitionsSpec