Apache Druid: a high-performance real-time analytics database.
Gian Merlino c204d68376 Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834)
There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.

Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.
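
A minimal sketch of the distinction described above, using simplified, hypothetical stand-ins (`NumericSelectorSketch`, `ObjectSelectorSketch`) rather than the real selector interfaces: `isNull` guards only the primitive getter, while a null object is detected from the value `getObject` returns.

```java
// Hypothetical, simplified stand-ins for the selector interfaces discussed
// above; the real interfaces have more methods and a different hierarchy.
interface NumericSelectorSketch {
    boolean isNull();   // only meaningful for the primitive getter below
    long getLong();     // a primitive cannot be null, hence the isNull check
}

interface ObjectSelectorSketch {
    Object getObject(); // null is reported by the returned value itself
}

final class NullHandlingSketch {
    // Correct: consult isNull before trusting a primitive getter.
    static Long readLong(NumericSelectorSketch selector) {
        if (selector.isNull()) {
            return null;
        }
        return selector.getLong();
    }

    // Correct: no isNull guard; inspect the returned object directly.
    static String readString(ObjectSelectorSketch selector) {
        Object value = selector.getObject();
        return value == null ? null : value.toString();
    }
}
```

With this split, an empty string and a missing value stay distinguishable on the object path, and a numeric aggregator can skip null rows instead of adding zeroes.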

This patch has three goals:

1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast
   aggregators while fixing null-handling bugs. I thought about splitting this
   into its own patch, but it ended up being tough to split from the
   null-handling fixes.

For (1), the fixes are:

- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
  calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
  -1 as a sentinel value for null, to differentiate nulls from empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
  eval.asBoolean() to avoid calling getLong on the selector after already
  calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
  on isNull. Also, refactored slightly to avoid the overhead of calling
  getObject followed by another getter (see BloomFilterAggregatorFactory for
  part of this).
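
To illustrate the sentinel idea from the first bullet, here is a hedged sketch that assumes the value is stored as a length prefix followed by UTF-8 bytes. The class name, constant, and buffer layout are illustrative assumptions, not the aggregators' actual format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative layout (an assumption): [int length][UTF-8 bytes].
// A length of -1 marks null, so null and "" (length 0) stay distinct.
final class NullableStringBufferSketch {
    static final int NULL_LENGTH = -1;

    static void write(ByteBuffer buf, int position, String value) {
        if (value == null) {
            buf.putInt(position, NULL_LENGTH);
            return;
        }
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        buf.putInt(position, bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            buf.put(position + Integer.BYTES + i, bytes[i]);
        }
    }

    static String read(ByteBuffer buf, int position) {
        int length = buf.getInt(position);
        if (length == NULL_LENGTH) {
            return null; // sentinel: no string was stored
        }
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = buf.get(position + Integer.BYTES + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Because a valid length is never negative, -1 cannot collide with any real string, including the empty one.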

For (2), the main changes are:

- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullableNumericAggregatorFactory to emphasize
  that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.

For (3), the other fixes for StringFirst and StringLastAggregatorFactory are:

- Fixed buffer overrun in the buffer aggregators when some characters in the string
  encode into more than one byte (the old code used "substring" to apply a byte limit,
  which is wrong because substring counts chars, not bytes). I did this by introducing
  a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
  behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
  xFold versions into aliases. The adaptiveness is similar to how other aggregators
  like hyperUnique work.
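
The byte-limit problem from the first bullet can be sketched as follows. This is not the StringUtils.toUtf8WithLimit implementation; `Utf8LimitSketch.encodeWithByteLimit` is a hypothetical helper that shows one way to cap the encoded size without splitting a multi-byte character, by letting a `CharsetEncoder` stop when the output buffer is full.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encode to UTF-8, keeping at most maxBytes bytes and
// never emitting a partial multi-byte character.
final class Utf8LimitSketch {
    static byte[] encodeWithByteLimit(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        // The encoder writes whole characters only and reports overflow
        // once the next character no longer fits in the output buffer.
        encoder.encode(CharBuffer.wrap(s), out, true);
        out.flip();
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }
}
```

By contrast, taking `substring(0, n)` and then encoding limits the number of chars, so a string of n three-byte characters would still need 3n bytes and overrun an n-byte slot.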
2019-11-07 17:46:59 -08:00
.github add checkbox for licenses.yaml in PR template, mention it in CONTRIBUTING.md (#8367) 2019-08-22 14:14:24 -07:00
.idea Implementing dropwizard emitter for druid (#7363) 2019-10-01 14:59:30 -07:00
benchmarks parallel broker merges on fork join pool (#8578) 2019-11-07 11:58:46 -08:00
cloud Add credentials for ECS (#8651) 2019-10-12 09:12:14 -07:00
codestyle Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
core Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834) 2019-11-07 17:46:59 -08:00
dev Add an item to concurrency checklist about assertions in parall… (#8701) 2019-10-29 11:38:04 +03:00
distribution update how to release doc (#8590) 2019-10-02 08:51:25 -07:00
docs parallel broker merges on fork join pool (#8578) 2019-11-07 11:58:46 -08:00
examples Fix verify script. (#8798) 2019-10-30 23:30:01 -07:00
extendedset bump master version to 0.17.0-incubating-SNAPSHOT (#8421) 2019-08-28 01:58:36 -07:00
extensions-contrib parallel broker merges on fork join pool (#8578) 2019-11-07 11:58:46 -08:00
extensions-core Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834) 2019-11-07 17:46:59 -08:00
hll Fix dependency analyze warnings (#8230) 2019-09-09 14:37:21 -07:00
indexing-hadoop Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
indexing-service Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
integration-tests remove select query (#8739) 2019-10-30 19:29:56 -07:00
licenses add jaxb-runtime to fix exception with newer versions of java (#8409) 2019-08-27 14:25:05 -06:00
processing Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834) 2019-11-07 17:46:59 -08:00
publications [ImgBot] Optimize images (#7873) 2019-06-24 21:27:48 -07:00
server parallel broker merges on fork join pool (#8578) 2019-11-07 11:58:46 -08:00
services Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
sql Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564) 2019-11-06 11:07:04 -08:00
web-console Web console: Interval input component (#8777) 2019-11-07 13:07:17 -08:00
website parallel broker merges on fork join pool (#8578) 2019-11-07 11:58:46 -08:00
.codecov.yml Use Codecov (#8388) 2019-08-28 08:49:30 -07:00
.dockerignore Add docker container for druid (#6896) 2019-02-08 12:12:28 +00:00
.gitignore autogenerate NOTICE.BINARY from NOTICE and licenses.yaml (#8306) 2019-08-21 12:46:27 -07:00
.travis.yml Spellcheck docs (#8548) 2019-09-17 12:47:30 -07:00
CONTRIBUTING.md Fix incorrect build from source path in README.md and druid repo url. (#8531) 2019-09-12 19:48:01 -07:00
DISCLAIMER add missing license headers, in particular to MD files; clean up RAT … (#6563) 2018-11-13 09:38:37 -08:00
LABELS Add plain text README.txt, use relative link from README.md to build.md (#7611) 2019-05-09 21:29:26 -07:00
LICENSE Add missing license pointer for Porter Stemmer (#7941) 2019-06-24 12:21:40 -07:00
NOTICE add copyright info back to NOTICE and NOTICE.BINARY (#8298) 2019-08-14 19:42:47 -05:00
README.md Update README.md (#8829) 2019-11-06 08:59:00 -08:00
README.template switch links from druid.io to druid.apache.org (#7914) 2019-06-18 09:06:27 -07:00
licenses.yaml Upgrade joda-time to 2.10.5 (#8821) 2019-11-06 14:30:22 -08:00
pom.xml Upgrade joda-time to 2.10.5 (#8821) 2019-11-06 14:30:22 -08:00
upload.sh Adding licenses and enable apache-rat-plugin. (#6215) 2018-09-18 08:39:26 -07:00

README.md


Apache Druid (incubating)

Apache Druid (incubating) is a high-performance real-time analytics database.

Druid is a next-gen open source alternative to analytical databases such as Vertica, Greenplum, and Exadata, and data warehouses such as Snowflake, BigQuery, and Redshift.

Getting started

You can get started with Druid with our quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in console (shown below).
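
As a rough illustration of the HTTP route, the sketch below posts a SQL string as JSON to the `/druid/v2/sql` endpoint. The class and method names are hypothetical, and the base URL (for example `http://localhost:8888` on a local quickstart) is an assumption that depends on how your cluster is deployed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical client sketch; names and base URL are illustrative.
final class DruidSqlHttpSketch {
    // Wraps a SQL string in the {"query": "..."} JSON body.
    static String toJsonBody(String sql) {
        StringBuilder sb = new StringBuilder("{\"query\":\"");
        for (char c : sql.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append("\"}").toString();
    }

    // Sends the query and returns the raw response body as a string.
    static String postSql(String baseUrl, String sql) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(baseUrl + "/druid/v2/sql").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(toJsonBody(sql).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream is = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = is.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as `DruidSqlHttpSketch.postSql("http://localhost:8888", "SELECT 1")` would return the JSON result rows, assuming a Druid service is listening at that address.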

Load data

[Screenshot: data loader (Kafka)]

Load streaming and batch data using a point-and-click wizard that guides you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

[Screenshot: cluster management view]

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and servers from one convenient location. All views are powered by SQL system tables, so you can see the underlying query for each view.

Issue queries

[Screenshot: query view]

Use the built-in query workbench to prototype Druid SQL and native queries, or connect one of the many tools that help you make the most out of Druid.

Documentation

You can find the documentation for the latest Druid release on the project website.

If you would like to contribute documentation, please do so under /docs in this repository and submit a pull request.

Community

Community support is available on the druid-user mailing list, which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

Chat with Druid committers and users in real time on the #druid channel in the Apache Slack team. Please use this invitation link to join the ASF Slack and, once joined, go to the #druid channel.

Building from source

Please note that JDK 8 is required to build Druid.

For instructions on building Druid from source, see docs/development/build.md.

Contributing

Please follow the community guidelines for contributing.

License

Apache License, Version 2.0

Disclaimer: Apache Druid is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.