Apache Druid: a high performance real-time analytics database.
Gian Merlino ffa25b7832
Query vectorization. (#6794)
* Benchmarks: New SqlBenchmark, add caching & vectorization to some others.

- Introduce a new SqlBenchmark geared towards benchmarking a wide
  variety of SQL queries. Rename the old SqlBenchmark to
  SqlVsNativeBenchmark.
- Add (optional) caching to SegmentGenerator to enable easier
  benchmarking of larger segments.
- Add vectorization to FilteredAggregatorBenchmark and GroupByBenchmark.

* Query vectorization.

This patch includes vectorized timeseries and groupBy engines, as well
as some analogs of your favorite Druid classes:

- VectorCursor is like Cursor. (It comes from StorageAdapter.makeVectorCursor.)
- VectorColumnSelectorFactory is like ColumnSelectorFactory, and it has
  methods to create analogs of the column selectors you know and love.
- VectorOffset and ReadableVectorOffset are like Offset and ReadableOffset.
- VectorAggregator is like BufferAggregator.
- VectorValueMatcher is like ValueMatcher.

There are some noticeable differences between vectorized and regular
execution:

- Unlike regular cursors, vector cursors do not understand time
  granularity. They expect query engines to handle this on their own,
  which a new VectorCursorGranularizer class helps with. This is to
  avoid too much batch-splitting and to respect the fact that vector
  selectors are somewhat more heavyweight than regular selectors.
- Unlike FilteredOffset, FilteredVectorOffset does not leverage indexes
  for filters that might partially support them (like an OR of one
  filter that supports indexing and another that doesn't). I'm not sure
  that this behavior is desirable anyway (it is potentially too eager)
  but, at any rate, it'd be better to harmonize it between the two
  classes. Potentially they should both do some different thing that
  is smarter than what either of them is doing right now.
- When vector cursors are created by QueryableIndexCursorSequenceBuilder,
  they use a morphing binary-then-linear search to find their start and
  end rows, rather than linear search.
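
Since vector cursors leave time granularity to the query engine, the engine has to split each batch of rows into granularity buckets itself. The sketch below shows one way that splitting could look. It is an illustration under stated assumptions, not the VectorCursorGranularizer implementation: the class name, the `Range` type, the fixed-width buckets in milliseconds, and the requirement that timestamps are ascending are all assumptions of the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the actual VectorCursorGranularizer.
final class GranularizerSketch {
  // A contiguous run of rows [startOffset, endOffset) that falls in one time bucket.
  static final class Range {
    final long bucketStart;
    final int startOffset;
    final int endOffset;

    Range(long bucketStart, int startOffset, int endOffset) {
      this.bucketStart = bucketStart;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  // Splits one batch of ascending timestamps into per-bucket row ranges.
  static List<Range> splitByBucket(long[] timestamps, long bucketMillis) {
    List<Range> ranges = new ArrayList<>();
    int start = 0;
    while (start < timestamps.length) {
      // Floor the first timestamp of the run to the start of its bucket.
      long bucketStart = Math.floorDiv(timestamps[start], bucketMillis) * bucketMillis;
      long bucketEnd = bucketStart + bucketMillis;
      int end = start + 1;
      // Extend the run while rows still belong to the same bucket.
      while (end < timestamps.length && timestamps[end] < bucketEnd) {
        end++;
      }
      ranges.add(new Range(bucketStart, start, end));
      start = end;
    }
    return ranges;
  }

  public static void main(String[] args) {
    long[] timestamps = {0, 10, 59, 60, 61, 185};
    for (Range r : splitByBucket(timestamps, 60)) {
      System.out.println("bucket " + r.bucketStart + ": rows [" + r.startOffset + ", " + r.endOffset + ")");
    }
  }
}
```

An engine could then aggregate each range with a single vectorized call, so a batch is split only where a bucket boundary actually falls inside it.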

Limitations in this patch are:

- Only timeseries and groupBy have vectorized engines.
- GroupBy doesn't handle multi-value dimensions yet.
- Vector cursors cannot handle virtual columns or descending order.
- Only some filters have vectorized matchers: "selector", "bound", "in",
  "like", "regex", "search", "and", "or", and "not".
- Only some aggregators have vectorized implementations: "count",
  "doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
- Dimension specs other than "default" don't work yet (no extraction
  functions or filtered dimension specs).
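
The limitations above can be restated as a simple eligibility check. The sketch below is only a restatement of those lists in code, with hypothetical names throughout; it does not show how the patch itself decides whether a query can be vectorized, and it leaves out the multi-value dimension and dimension spec restrictions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; the type names are taken from the lists above,
// but the lookup-by-string approach is an assumption of this example.
final class VectorizationSupportSketch {
  static final Set<String> QUERY_TYPES =
      new HashSet<>(Arrays.asList("timeseries", "groupBy"));

  static final Set<String> FILTERS = new HashSet<>(Arrays.asList(
      "selector", "bound", "in", "like", "regex", "search", "and", "or", "not"));

  static final Set<String> AGGREGATORS = new HashSet<>(Arrays.asList(
      "count", "doubleSum", "floatSum", "longSum", "hyperUnique", "filtered"));

  // Returns true only if every part of the query appears in the supported lists.
  static boolean canVectorize(
      String queryType,
      List<String> filterTypes,
      List<String> aggregatorTypes,
      boolean hasVirtualColumns,
      boolean descending
  ) {
    return QUERY_TYPES.contains(queryType)
        && FILTERS.containsAll(filterTypes)
        && AGGREGATORS.containsAll(aggregatorTypes)
        && !hasVirtualColumns
        && !descending;
  }

  public static void main(String[] args) {
    System.out.println(canVectorize(
        "timeseries", Arrays.asList("and", "selector"), Arrays.asList("longSum"), false, false));
    System.out.println(canVectorize(
        "timeseries", Arrays.asList("selector"), Arrays.asList("longSum"), false, true));
  }
}
```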

Currently, the testing strategy includes adding vectorization-enabled
tests to TimeseriesQueryRunnerTest, GroupByQueryRunnerTest,
GroupByTimeseriesQueryRunnerTest, CalciteQueryTest, and all of the
filtering tests that extend BaseFilterTest. In all of those classes,
there are some test cases that don't support vectorization. They are
marked by special function calls like "cannotVectorize" or "skipVectorize"
that tell the test harness to either expect an exception or to skip the
test case.
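
The "expect an exception or skip" behavior described above could be sketched roughly as follows. This is a hypothetical harness written for illustration: the enum, the `QueryCase` interface, and the use of `UnsupportedOperationException` are assumptions, and the real tests' "cannotVectorize" and "skipVectorize" calls may work differently.

```java
// Hypothetical illustration only; not the test harness used in the patch.
final class VectorizeTestHarnessSketch {
  enum Expectation { CAN_VECTORIZE, CANNOT_VECTORIZE, SKIP_VECTORIZE }

  // A test case that runs a query with or without vectorization and returns a result.
  // In this sketch, an unsupported vectorized run throws UnsupportedOperationException.
  interface QueryCase {
    long run(boolean vectorize);
  }

  static String check(QueryCase queryCase, Expectation expectation) {
    // The non-vectorized run always executes and provides the reference result.
    long expected = queryCase.run(false);
    if (expectation == Expectation.SKIP_VECTORIZE) {
      return "skipped";
    }
    try {
      long actual = queryCase.run(true);
      if (expectation == Expectation.CANNOT_VECTORIZE) {
        throw new AssertionError("expected the vectorized run to fail");
      }
      if (actual != expected) {
        throw new AssertionError("vectorized result differs: " + actual + " vs " + expected);
      }
      return "matched";
    } catch (UnsupportedOperationException e) {
      if (expectation != Expectation.CANNOT_VECTORIZE) {
        throw new AssertionError("vectorized run failed unexpectedly", e);
      }
      return "failed as expected";
    }
  }

  public static void main(String[] args) {
    QueryCase supported = vectorize -> 42L;
    QueryCase unsupported = vectorize -> {
      if (vectorize) {
        throw new UnsupportedOperationException("cannot vectorize");
      }
      return 42L;
    };
    System.out.println(check(supported, Expectation.CAN_VECTORIZE));
    System.out.println(check(unsupported, Expectation.CANNOT_VECTORIZE));
    System.out.println(check(unsupported, Expectation.SKIP_VECTORIZE));
  }
}
```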

Testing should be expanded in the future -- a project in and of itself.

Related to #3011.

* WIP

* Adjustments for unused things.

* Adjust javadocs.

* DimensionDictionarySelector adjustments.

* Add "clone" to BatchIteratorAdapter.

* ValueMatcher javadocs.

* Fix benchmark.

* Fixups post-merge.

* Expect exception on testGroupByWithStringVirtualColumn for IncrementalIndex.

* BloomDimFilterSqlTest: Tag two non-vectorizable tests.

* Minor adjustments.

* Update surefire, bump up Xmx in Travis.

* Some more adjustments.

* Javadoc adjustments

* AggregatorAdapters adjustments.

* Additional comments.

* Remove switching search.

* Only missiles.
2019-07-12 12:54:07 -07:00
.github adjust PR template (#8016) 2019-07-03 08:31:31 -07:00
.idea Web-console: Add action column to segments view (#7954) 2019-06-25 20:14:06 -07:00
benchmarks Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
ci Move dev-related files and instructions to dev/ directory; add committer's instructions (#7279) 2019-04-17 15:27:14 +02:00
cloud Bump up snapshot version to 0.16.0 (#7802) 2019-05-30 17:17:33 -07:00
codestyle Enable Spotbugs: MS_OOI_PKGPROTECT (#8022) 2019-07-08 13:17:56 +05:30
core Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
dev Add the pull-request template (#7206) 2019-06-27 15:51:25 +03:00
distribution Fix license check in travis and make it optional (#8049) 2019-07-09 19:35:29 -07:00
docs Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
examples remove FirehoseV2 and realtime node extensions (#8020) 2019-07-04 15:40:22 -07:00
extendedset Bump up snapshot version to 0.16.0 (#7802) 2019-05-30 17:17:33 -07:00
extensions-contrib fail complex type 'serde' registration when registered type does not match expected type (#7985) 2019-07-11 23:03:15 -07:00
extensions-core Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
hll switch links from druid.io to druid.apache.org (#7914) 2019-06-18 09:06:27 -07:00
indexing-hadoop add config to optionally disable all compression in intermediate segment persists while ingestion (#7919) 2019-07-10 12:22:24 -07:00
indexing-service add config to optionally disable all compression in intermediate segment persists while ingestion (#7919) 2019-07-10 12:22:24 -07:00
integration-tests Add instruction about skipping up-to-date checks when running integration tests (#7843) 2019-07-08 13:44:32 +05:30
licenses Binary license management system (#7998) 2019-07-08 12:24:51 -07:00
processing Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
publications [ImgBot] Optimize images (#7873) 2019-06-24 21:27:48 -07:00
server Add inline firehose (#8056) 2019-07-11 21:43:46 -07:00
services write value of bitmap as field name (#8066) 2019-07-11 19:29:46 -07:00
sql Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
web-console added replicated size (#8043) 2019-07-10 08:29:05 -07:00
.dockerignore Add docker container for druid (#6896) 2019-02-08 12:12:28 +00:00
.gitignore Fix some problems reported by PVS-Studio (#7738) 2019-05-29 11:20:45 -07:00
.travis.yml Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
CONTRIBUTING.md CONTRIBUTING: Remove "keep the number of commits small" guidance. (#8004) 2019-07-03 11:53:41 -07:00
DISCLAIMER add missing license headers, in particular to MD files; clean up RAT … (#6563) 2018-11-13 09:38:37 -08:00
LABELS Add plain text README.txt, use relative link from README.md to build.md (#7611) 2019-05-09 21:29:26 -07:00
LICENSE Add missing license pointer for Porter Stemmer (#7941) 2019-06-24 12:21:40 -07:00
NOTICE Adjust NOTICE files (#7945) 2019-06-25 09:08:54 -07:00
NOTICE.BINARY remove FirehoseV2 and realtime node extensions (#8020) 2019-07-04 15:40:22 -07:00
README.md remove IRC badge from readme (#8052) 2019-07-10 08:29:19 -07:00
README.template switch links from druid.io to druid.apache.org (#7914) 2019-06-18 09:06:27 -07:00
build.sh Fix license check in travis and make it optional (#8049) 2019-07-09 19:35:29 -07:00
licenses.yaml Binary license management system (#7998) 2019-07-08 12:24:51 -07:00
pom.xml Query vectorization. (#6794) 2019-07-12 12:54:07 -07:00
upload.sh Adding licenses and enable apache-rat-plugin. (#6215) 2018-09-18 08:39:26 -07:00

README.md

Apache Druid (incubating)

Apache Druid (incubating) is a high-performance analytics data store for event-driven data.

Disclaimer: Apache Druid is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

License

Apache License, Version 2.0

More Information

More information about Druid can be found on https://druid.apache.org.

Documentation

You can find the documentation for the latest Druid release on the project website.

If you would like to contribute documentation, please do so under /docs/content in this repository and submit a pull request.

Getting Started

You can get started with Druid by following our quickstart.

Reporting Issues

If you find any bugs, please file a GitHub issue.

Community

Community support is available on the druid-user mailing list (druid-user@googlegroups.com), which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

We also have a couple of people hanging out on IRC in #druid-dev on irc.freenode.net.

Building From Source

Please note that JDK 8 is required to build Druid.

For instructions on building Druid from source, see docs/content/development/build.md.

Contributing

Please follow the guidelines listed in CONTRIBUTING.md.