druid/processing/src
Gian Merlino 67fbd8e7fc
Add "stringEncoding" parameter to DataSketches HLL. (#11201)
* Add "stringEncoding" parameter to DataSketches HLL.

Builds on the concept from #11172 and adds a way to feed HLL sketches
with UTF-8 bytes.

This must be an option rather than always-on, because prior to this
patch, HLL sketches used UTF-16LE encoding when hashing strings. To
remain compatible with sketch images created prior to this patch -- which
matters during rolling updates and when reading sketches that have been
written to segments -- we must keep UTF-16LE as the default.

Not currently documented, because I'm not yet sure how best to expose
this functionality to users. I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.

* Fix benchmark.

* Fix style issues, improve test coverage.

* Put round back, to make IT updates easier.

* Fix test.

* Fix issue with filtered aggregators and add test.

* Use DS native update(ByteBuffer) method. Improve test coverage.

* Add another suppression.

* Fix ITAutoCompactionTest.

* Update benchmarks.

* Updates.

* Fix conflict.

* Adjustments.
2023-06-30 12:45:55 -07:00
..
main Add "stringEncoding" parameter to DataSketches HLL. (#11201) 2023-06-30 12:45:55 -07:00
test Add "stringEncoding" parameter to DataSketches HLL. (#11201) 2023-06-30 12:45:55 -07:00