mirror of https://github.com/apache/druid.git
67fbd8e7fc
* Add "stringEncoding" parameter to DataSketches HLL.

  Builds on the concept from #11172 and adds a way to feed HLL sketches with UTF-8 bytes.

  This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default.

  Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches.

* Fix benchmark.
* Fix style issues, improve test coverage.
* Put round back, to make IT updates easier.
* Fix test.
* Fix issue with filtered aggregators and add test.
* Use DS native update(ByteBuffer) method. Improve test coverage.
* Add another suppression.
* Fix ITAutoCompactionTest.
* Update benchmarks.
* Updates.
* Fix conflict.
* Adjustments.

Name |
---|
src |
pom.xml |
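
The compatibility concern in the commit message comes down to the fact that a sketch hashes bytes, and the same string produces different bytes under UTF-8 and UTF-16LE. The sketch below illustrates only that point with plain JDK calls. It does not use the DataSketches or Druid APIs, and the class and method names (`StringEncodingDemo`, `utf8Bytes`, `utf16leBytes`) are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Druid or DataSketches.
public class StringEncodingDemo {

    // Encode a string as UTF-8 bytes.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encode a string as UTF-16LE bytes (two bytes per BMP char, no byte-order mark).
    static byte[] utf16leBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String value = "druid";
        byte[] utf8 = utf8Bytes(value);
        byte[] utf16le = utf16leBytes(value);

        // The two encodings yield different byte sequences for the same string,
        // so any hash computed over them will generally differ as well.
        System.out.println("UTF-8:    " + Arrays.toString(utf8));
        System.out.println("UTF-16LE: " + Arrays.toString(utf16le));
        System.out.println("Same bytes? " + Arrays.equals(utf8, utf16le));
    }
}
```

Because the hashed input differs, sketches built with one encoding cannot be meaningfully merged with sketches built with the other, which is consistent with the commit keeping UTF-16LE as the default.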