druid/processing
Gian Merlino 319f99db05
Always use file sizes when determining batch ingest splits (#13955)
* Always use file sizes when determining batch ingest splits.

Main changes:

1) Update CloudObjectInputSource and its subclasses (S3, GCS,
   Azure, Aliyun OSS) to use SplitHintSpecs in all cases. Previously, they
   were only used for prefixes, not uris or objects.

2) Update ExternalInputSpecSlicer (MSQ) to consider file size. Previously,
   file size was ignored; all files were treated as equal weight when
   determining splits.

A side effect of these changes is that we'll make additional network
calls to find the sizes of objects when users specify URIs or objects
as opposed to prefixes. IMO, this is worth it because it's the only way
to respect the user's split hint and task assignment settings.

Secondary changes:

1) S3, Aliyun OSS: Use getObjectMetadata instead of listObjects to get
   metadata for a single object. This is a simpler call that is also
   expected to be less expensive.

2) Azure: Fix a bug where getBlobLength did not populate blob
   reference attributes, and therefore would not actually retrieve the
   blob length.

3) MSQ: Align dynamic slicing logic between ExternalInputSpecSlicer and
   TableInputSpecSlicer.

4) MSQ: Adjust WorkerInputs to ensure there is always at least one
   worker, even if it has a nil slice.

* Add msqCompatible to testGroupByWithImpossibleTimeFilter.

* Fix tests.

* Add additional tests.

* Remove unused stuff.

* Remove more unused stuff.

* Adjust thresholds.

* Remove irrelevant test.

* Fix comments.

* Fix bug.

* Updates.
2023-04-05 08:54:01 -07:00
..
src Always use file sizes when determining batch ingest splits (#13955) 2023-04-05 08:54:01 -07:00
pom.xml Fixing security vulnerability check errors (#13956) 2023-03-23 11:10:06 +05:30