druid/extensions-core
Gian Merlino c204d68376 Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834)
There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.

Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.

This patch has three goals:

1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast
   aggregators while fixing null-handling bugs. I thought about splitting this
   into its own patch, but it ended up being tough to split from the
   null-handling fixes.

For (1) the fixes are,

- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
  calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
  -1 as a sigil value for null, to differentiate nulls and empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
  eval.asBoolean() to avoid calling getLong on the selector after already
  calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
  on isNull. Also, refactored slightly to avoid the overhead of calling
  getObject followed by another getter (see BloomFilterAggregatorFactory for
  part of this).

For (2) the main changes are,

- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullbleNumericAggregatorFactory to emphasize
  that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.

For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,

- Fixed buffer overrun in the buffer aggregators when some characters in the string
  code into more than one byte (the old code used "substring" to apply a byte limit,
  which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
  behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
  xFold versions into aliases. The adaptiveness is similar to how other aggregators
  like hyperUnique work.
2019-11-07 17:46:59 -08:00
..
avro-extensions Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
datasketches Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
druid-basic-security Issue 8678 Non-coordinator services are repeatedly logging JsonMappingException when using druid-basic-security extension with an authenticator that has no users setup (#8692) 2019-10-18 11:09:53 -07:00
druid-bloom-filter Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834) 2019-11-07 17:46:59 -08:00
druid-kerberos Remove commons-httpclient (#8407) 2019-09-27 02:14:58 -07:00
ec2-extensions Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
google-extensions Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
hdfs-storage Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
histogram Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
kafka-extraction-namespace Removed 'if' condition. (#8768) 2019-10-28 13:40:03 -07:00
kafka-indexing-service Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
kinesis-indexing-service Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
lookups-cached-global Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
lookups-cached-single Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
mysql-metadata-storage Case sensitive comparison of nonbinary string in MySQL metadata storage (#8758) 2019-10-30 20:48:08 -07:00
orc-extensions Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
parquet-extensions Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
postgresql-metadata-storage Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
protobuf-extensions Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
s3-extensions Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
simple-client-sslcontext Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
stats Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00