Commit Graph

572 Commits

Author SHA1 Message Date
Gian Merlino ffa25b7832
Query vectorization. (#6794)
* Benchmarks: New SqlBenchmark, add caching & vectorization to some others.

- Introduce a new SqlBenchmark geared towards benchmarking a wide
  variety of SQL queries. Rename the old SqlBenchmark to
  SqlVsNativeBenchmark.
- Add (optional) caching to SegmentGenerator to enable easier
  benchmarking of larger segments.
- Add vectorization to FilteredAggregatorBenchmark and GroupByBenchmark.

* Query vectorization.

This patch includes vectorized timeseries and groupBy engines, as well
as some analogs of your favorite Druid classes:

- VectorCursor is like Cursor. (It comes from StorageAdapter.makeVectorCursor.)
- VectorColumnSelectorFactory is like ColumnSelectorFactory, and it has
  methods to create analogs of the column selectors you know and love.
- VectorOffset and ReadableVectorOffset are like Offset and ReadableOffset.
- VectorAggregator is like BufferAggregator.
- VectorValueMatcher is like ValueMatcher.

There are some noticeable differences between vectorized and regular
execution:

- Unlike regular cursors, vector cursors do not understand time
  granularity. They expect query engines to handle this on their own,
  which a new VectorCursorGranularizer class helps with. This is to
  avoid too much batch-splitting and to respect the fact that vector
  selectors are somewhat more heavyweight than regular selectors.
- Unlike FilteredOffset, FilteredVectorOffset does not leverage indexes
  for filters that might partially support them (like an OR of one
  filter that supports indexing and another that doesn't). I'm not sure
  that this behavior is desirable anyway (it is potentially too eager)
  but, at any rate, it'd be better to harmonize it between the two
  classes. Potentially they should both do some different thing that
  is smarter than what either of them is doing right now.
- When vector cursors are created by QueryableIndexCursorSequenceBuilder,
  they use a morphing binary-then-linear search to find their start and
  end rows, rather than linear search.

Limitations in this patch are:

- Only timeseries and groupBy have vectorized engines.
- GroupBy doesn't handle multi-value dimensions yet.
- Vector cursors cannot handle virtual columns or descending order.
- Only some filters have vectorized matchers: "selector", "bound", "in",
  "like", "regex", "search", "and", "or", and "not".
- Only some aggregators have vectorized implementations: "count",
  "doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
- Dimension specs other than "default" don't work yet (no extraction
  functions or filtered dimension specs).

Currently, the testing strategy includes adding vectorization-enabled
tests to TimeseriesQueryRunnerTest, GroupByQueryRunnerTest,
GroupByTimeseriesQueryRunnerTest, CalciteQueryTest, and all of the
filtering tests that extend BaseFilterTest. In all of those classes,
there are some test cases that don't support vectorization. They are
marked by special function calls like "cannotVectorize" or "skipVectorize"
that tell the test harness to either expect an exception or to skip the
test case.

Testing should be expanded in the future -- a project in and of itself.

Related to #3011.

* WIP

* Adjustments for unused things.

* Adjust javadocs.

* DimensionDictionarySelector adjustments.

* Add "clone" to BatchIteratorAdapter.

* ValueMatcher javadocs.

* Fix benchmark.

* Fixups post-merge.

* Expect exception on testGroupByWithStringVirtualColumn for IncrementalIndex.

* BloomDimFilterSqlTest: Tag two non-vectorizable tests.

* Minor adjustments.

* Update surefire, bump up Xmx in Travis.

* Some more adjustments.

* Javadoc adjustments

* AggregatorAdapters adjustments.

* Additional comments.

* Remove switching search.

* Only missiles.
2019-07-12 12:54:07 -07:00
Clint Wylie abf9843e2a fail complex type 'serde' registration when registered type does not match expected type (#7985)
* make ComplexMetrics.registerSerde type check on register, resolves #7959

* add test

* simplify

* unused imports :/

* simplify

* burned by imports yet again

* remove unused constructor

* switch to getName

* heh oops
2019-07-11 23:03:15 -07:00
Alexander Saydakov 4b176ad265 force native order when wrapping ByteBuffer since Druid can have it set (#8055)
incorrectly
2019-07-11 17:17:53 -07:00
Clint Wylie 349b743ce0
fix master branch build (#8057) 2019-07-10 14:58:10 -07:00
Himanshu 14aec7fcec
add config to optionally disable all compression in intermediate segment persists while ingestion (#7919)
* disable all compression in intermediate segment persists while ingestion

* more changes and build fix

* by default retain existing indexingSpec for intermediate persisted segments

* document indexSpecForIntermediatePersists index tuning config

* fix build issues

* update serde tests
2019-07-10 12:22:24 -07:00
Jihoon Son fcf56f2330 Add IS_INCREMENTAL_HANDOFF_SUPPORTED for KIS backward compatibility (#8050)
* Add IS_INCREMENTAL_HANDOFF_SUPPORTED for KIS backward compatibility

* do it for kafka only

* fix test
2019-07-10 08:29:37 -07:00
Chi Cao Minh 1166bbcb75 Remove static imports from tests (#8036)
Make static imports forbidden in tests and remove all occurrences to be
consistent with the non-test code.

Also, various changes to files affected by above:
- Reformat to adhere to druid style guide
- Fix various IntelliJ warnings
- Fix various SonarLint warnings (e.g., the expected/actual args to
  Assert.assertEquals() were flipped)
2019-07-06 09:33:12 -07:00
Chi Cao Minh 0ded0ce414 Add round support for DS-HLL (#8023)
* Add round support for DS-HLL

Since the Cardinality aggregator has a "round" option to round off estimated
values generated from the HyperLogLog algorithm, add the same "round" option to
the DataSketches HLL Sketch module aggregators to be consistent.

* Fix checkstyle errors

* Change HllSketchSqlAggregator to do rounding

* Fix test for standard-compliant null handling mode
2019-07-05 15:37:58 -07:00
Clint Wylie 42a7b8849a remove FirehoseV2 and realtime node extensions (#8020)
* remove firehosev2 and realtime node extensions

* revert intellij stuff

* rat exclusion
2019-07-04 15:40:22 -07:00
Alexander Saydakov f38a62e949 theta sketch to string post agg (#7937) 2019-06-27 15:09:57 -07:00
Xue Yu b9c6a26c0e Use ComplexMetrics.registerSerde() across the codebase (#7925)
* refactor complexmetric registerserde

* fix error

* feedback address
2019-06-25 11:39:04 +03:00
Sashidhar Thallam 6cc8802b8e #7875: Setting ACL on S3 task logs on similar lines as that of data segments pushed to S3 (#7907)
* #7875: Setting ACL on S3 task logs on similar lines as that of data segment pushed to S3

* #7875 1. Extracting a method (which uploads a file to S3 setting appropriate access control list to the file being uploaded) and moving it to utils class. 2. Adding S3TaskLogsTest.java file to test acl (permissions) on the task log files pushed to S3.

* fixing checkstyle errors

* #7875 Incorporating review comments
2019-06-24 17:25:59 -06:00
Fokko Driesprong 82b248cc17 Spotbugs: Enable MS_SHOULD_BE_FINAL (#7946) 2019-06-23 15:42:18 -07:00
Jonathan Wei 35601bb7a0 Add finalizeAsBase64Binary option to FixedBucketsHistogramAggregatorFactory (#7784)
* Add finalizeAsBase64Binary option to FixedBucketsHistogramAggregatorFactory

* Add finalizeAsBase64Binary option to ApproximateHistogramFactory

* Update approx histogram doc
2019-06-21 18:00:19 -07:00
Clint Wylie 494b8ebe56 multi-value string column support for expressions (#7588)
* array support for expression language for multi-value string columns

* fix tests?

* fixes

* more tests

* fixes

* cleanup

* more better, more test

* ignore inspection

* license

* license fix

* inspection

* remove dumb import

* more better

* some comments

* add expr rewrite for arrayfn args for more magic, tests

* test stuff

* more tests

* fix test

* fix test

* castfunc can deal with arrays

* needs more empty array

* more tests, make cast to long array more forgiving

* refactor

* simplify ExprMacro Expr implementations with base classes in core

* oops

* more test

* use Shuttle for Parser.flatten, javadoc, cleanup

* fixes and more tests

* unused import

* fixes

* javadocs, cleanup, refactors

* fix imports

* more javadoc

* more javadoc

* more

* more javadocs, nonnullbydefault, minor refactor

* markdown fix

* adjustments

* more doc

* move initial filter out

* docs

* map empty arg lambda, apply function argument validation

* check function args at parse time instead of eval time

* more immutable

* more more immutable

* clarify grammar

* fix docs

* empty array is string test, we need a way to make arrays better maybe in the future, or define empty arrays as other types..
2019-06-19 13:57:37 -07:00
Fokko Driesprong 0a6fbbbb80 Bump Apache Avro to 1.9.0 (#7772)
* Bump Apache Avro to 1.9.0

Apache Avro 1.9.0 brings a lot of new features:
* Deprecate Joda-Time in favor of Java8 JSR310 and setting it as default
* Remove support for Hadoop 1.x
* Move from Jackson 1.x to 2.9
* Add ZStandard Codec
* Lots of updates on the dependencies to fix CVE's
* Remove Jackson classes from public API
* Apache Avro is built by default with Java 8
* Apache Avro is compiled and tested with Java 11 to guarantee compatibility
* Apache Avro MapReduce is compiled and tested with Hadoop 3
* Apache Avro is now leaner, multiple dependencies were removed: guava, paranamer, commons-codec, and commons-logging
* Introduce JMH Performance Testing Framework
* Add Snappy support for C++ DataFile
* and many, many more!

* Add exclusions for Jackson
2019-06-19 03:31:18 -07:00
Clint Wylie 71997c16a2 switch links from druid.io to druid.apache.org (#7914)
* switch links from druid.io to druid.apache.org

* fix it
2019-06-18 09:06:27 -07:00
Fokko Driesprong f581118f05 Remove Apache Pig from the tests (#7810)
* Remove Apache Pig from the tests

* Remove the Pig specific part

* Fix the Checkstyle issues

* Cleanup a bit

* Add an additional test

* Revert the abstract class
2019-06-14 14:18:58 -07:00
Sashidhar Thallam 3bee6adcf7 Use map.putIfAbsent() or map.computeIfAbsent() as appropriate instead of containsKey() + put() (#7764)
* https://github.com/apache/incubator-druid/issues/7316 Use Map.putIfAbsent() instead of containsKey() + put()

* fixing indentation

* Using map.computeIfAbsent() instead of map.putIfAbsent() where appropriate

* fixing checkstyle

* Changing the recommendation text

* Reverting auto changes made by IDE

* Implementing recommendation: A ConcurrentHashMap on which computeIfAbsent() is called should be assigned into variables of ConcurrentHashMap type, not ConcurrentMap

* Removing unused import
2019-06-14 17:59:36 +02:00
Xue Yu ce591d1457 Support var_pop, var_samp, stddev_pop and stddev_samp etc in sql (#7801)
* support var_pop, stddev_pop etc in sql

* fix sql compatible

* rebase on master

* update doc
2019-06-10 09:40:09 -07:00
Clint Wylie 3fbb0a5e00 Supervisor list api with states and health (#7839)
* allow optionally listing all supervisors with their state and health

* docs

* add state to full

* clean

* casing

* format

* spelling
2019-06-07 16:26:33 -07:00
Clint Wylie ee0d4ea589 add bloom filter fallback aggregator when types are unknown (#7719) 2019-06-06 14:39:32 -07:00
Alexander Saydakov 4dd446bfdd sketches-core-0.13.4 (#7666) 2019-06-06 14:36:52 -07:00
Eugene Sevastyanov 080270283a Druid basic authentication class composition config (#7789)
* Druid basic authentication class composition config.

* Added comments

* Reduced nulls

* Used noop implementations to get rid of null

* Added docs for no-op metadata storage updaters

* Fixed BasicAuthClassCompositionConfig javadoc

* Removed incorrect comments
2019-06-06 15:51:37 +02:00
Gian Merlino 1de1a02e49 Kinesis: Fix getPartitionIds, should be checking isHasMoreShards. (#7830) 2019-06-04 16:26:22 -07:00
Nishant Bangarwa fdc03bd336 [druid-kerberos] Fix checking of host URI when reading cookies from cookie store (#7825)
Reading of auth cookie was not checking URI of the server where request was being sent.  This was causing cookie set for one server to be sent to another one and extra authentication round trips between internal druid services.
2019-06-03 19:32:50 -07:00
Justin Borromeo 8032c4add8 Add errors and state to stream supervisor status API endpoint (#7428)
* Add state and error tracking for seekable stream supervisors

* Fixed nits in docs

* Made inner class static and updated spec test with jackson inject

* Review changes

* Remove redundant config param in supervisor

* Style

* Applied some of Jon's recommendations

* Add transience field

* write test

* implement code review changes except for reconsidering logic of markRunFinishedAndEvaluateHealth()

* remove transience reporting and fix SeekableStreamSupervisorStateManager impl

* move call to stateManager.markRunFinished() from RunNotice to runInternal() for tests

* remove stateHistory because it wasn't adding much value, some fixes, and add more tests

* fix tests

* code review changes and add HTTP health check status

* fix test failure

* refactor to split into a generic SupervisorStateManager and a specific SeekableStreamSupervisorStateManager

* fixup after merge

* code review changes - add additional docs

* cleanup KafkaIndexTaskTest

* add additional documentation for Kinesis indexing

* remove unused throws class
2019-05-31 17:16:01 -07:00
Jihoon Son 7abfbb066a Bump up snapshot version to 0.16.0 (#7802) 2019-05-30 17:17:33 -07:00
Gian Merlino 8649b8ab4c
SQL: Allow select-sort-project query shapes. (#7769)
* SQL: Allow select-sort-project query shapes.

Fixes #7768.

Design changes:

- In PartialDruidQuery, allow projection after select + sort by removing
  the SELECT_SORT query stage and instead allowing the SORT and
  SORT_PROJECT stages to apply either after aggregation or after a plain
  non-aggregating select. This is different from prior behavior, where
  SORT and SORT_PROJECT were only considered valid after aggregation
  stages. This logic change is in the "canAccept" method.
- In DruidQuery, represent either kind of sorting with a single "Sorting"
  class (instead of DefaultLimitSpec). The Sorting class is still
  convertible into a DefaultLimitSpec, but is also convertible into the
  sorting parameters accepted by a Scan query.
- In DruidQuery, represent post-select and post-sorting projections with
  a single "Projection" class. This obsoletes the SortProject and
  SelectProjection classes, and simplifies the DruidQuery by allowing us
  to move virtual-column and post-aggregator-creation logic into the
  new Projection class.
- Split "DruidQuerySignature" into RowSignature and VirtualColumnRegistry.
  This effectively means that instead of having mutable and immutable
  versions of DruidQuerySignature, we instead of RowSignature (always
  immutable) and VirtualColumnRegistry (always mutable, but sometimes
  null). This change wasn't required, but IMO it this makes the logic
  involving them easier to follow, and makes it more clear when the
  virtual column registry is active and when it's not.

Other changes:

- ConvertBoundsToSelectors now just accepts a RowSignature, but we
  use the VirtualColumnRegistry.getFullRowSignature() method to get
  a signature that includes all columns, and therefore allows us to
  simplify the logic (no need to special-case virtual columns).
- Add `__time` to the Scan column list if the query is ordering by time.

* Remove unused import.
2019-05-30 12:56:29 -07:00
Roman Leventov 782863ed0f Fix some problems reported by PVS-Studio (#7738)
* Fix some problems reported by PVS-Studio

* Address comments
2019-05-29 11:20:45 -07:00
Fokko Driesprong e46bdf082e Remove Codehaus references from the tests (#7773) 2019-05-27 10:51:14 -07:00
Clint Wylie eef69619d3 add support for multi-value string dimensions for HllSketch build aggregator (#7730) 2019-05-23 17:07:32 -07:00
Jonathan Wei ec4d09a02f Remove obsolete isExcluded config from Kerberos authenticator (#7745) 2019-05-23 16:00:05 -07:00
Jonathan Wei 54b3f363c4
Remove unnecessary principal handling in KerberosAuthenticator (#7685) 2019-05-23 13:15:44 -07:00
Clint Wylie 23e96d15d4 allow quantiles merge aggregator to also accept doubles (#7718)
* allow quantiles merge aggregator to also accept doubles

* consolidate dupe

* import
2019-05-23 11:13:41 -07:00
Merlin Lee 26fad7e06a Add checkstyle for "Local variable names shouldn't start with capital" (#7681)
* Add checkstyle for "Local variable names shouldn't start with capital"

* Adjust some local variables to constants

* Replace StringUtils.LINE_SEPARATOR with System.lineSeparator()
2019-05-23 18:40:28 +02:00
Jihoon Son eff2be4f8f Remove LegacyKafkaIndexTaskRunner (#7735) 2019-05-23 09:25:35 -07:00
Gian Merlino 53b6467fc8 SeekableStreamIndexTaskRunner: Lazy init of runner. (#7729)
The main motivation is that this fixes #7724, by making it so the overlord
doesn't try to create a task runner and parser when all it really wants to
do is create a task object and serialize it.
2019-05-22 21:13:57 -07:00
Clint Wylie ffc2397bcd fix AggregatorFactory.finalizeComputation implementations to be ok with null inputs (#7731)
* AggregatorFactory finalizeComputation is nullable with nullable input, make implementations honor this

* fixes
2019-05-22 21:13:09 -07:00
Gian Merlino b6941551ae Upgrade various build and doc links to https. (#7722)
* Upgrade various build and doc links to https.

Where it wasn't possible to upgrade build-time dependencies to https,
I kept http in place but used hardcoded checksums or GPG keys to ensure
that artifacts fetched over http are verified properly.

* Switch to https://apache.org.
2019-05-21 11:30:14 -07:00
Merlin Lee 5f08b0b474 Add checkstyle for "Prohibit @author tags in Javadoc" (#7682)
* Add checkstyle for "Prohibit @author tags in Javadoc"

* Add "Do not use author tags/information in the code" back to CONTRIBUTING.md
2019-05-20 00:09:51 -07:00
David Lim d38457933f Data loader (sampler component) - Kafka/Kinesis samplers (#7566)
* implement Kafka/Kinesis sampler

* add KafkaSamplerSpecTest and KinesisSamplerSpecTest

* code review changes
2019-05-16 20:26:23 -07:00
Jonathan Wei 7d63c295cc Fix compilation error in CoordinatorBasicAuthorizerResourceTest (#7667)
* Fix compilation error in CoordinatorBasicAuthorizerResourceTest

* Don't use simplifyPermissions
2019-05-15 17:47:38 -07:00
Jonathan Wei 6901123a53 Fix compareAndSwap() in SQLMetadataConnector (#7661)
* Fix compareAndSwap() in SQLMetadataConnector

* Catch serialization_failure and retry for Postgres
2019-05-15 14:53:04 -07:00
Jonathan Wei e874da7cea
Add simpler permissions option to BasicAuthorizer GET APIs (#7635)
* Add simpler permissions option to BasicAuthorizer GET APIs

* Adjust log message

Co-Authored-By: Himanshu <g.himanshu@gmail.com>

* Adjust log message

Co-Authored-By: Himanshu <g.himanshu@gmail.com>
2019-05-15 12:59:32 -07:00
Fokko Driesprong 2aa9613bed Bump Checkstyle to 8.20 (#7651)
* Bump Checkstyle to 8.20

Moderate severity vulnerability that affects:
com.puppycrawl.tools:checkstyle

Checkstyle prior to 8.18 loads external DTDs by default,
which can potentially lead to denial of service attacks
or the leaking of confidential information.

Affected versions: < 8.18

* Oops, missed one

* Oops, missed a few
2019-05-14 11:53:37 -07:00
Fokko Driesprong 4c709ddbc1 Bump Apache Parquet to 1.10.1 (#7645)
https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1101
2019-05-12 14:38:33 -07:00
Alexander Saydakov ca1a6649f6 Datasketches quantiles more post-aggs (#7550)
* rank and CDF post-aggs

* added post-aggs to the module

* added new post-aggs

* moved post-agg IDs

* moved post-agg IDs
2019-05-10 11:46:54 -07:00
Alexander Saydakov 59f9ff38c7 fix issue #7607 (#7619)
* fix issue #7607

* exclude com.google.code.findbugs:annotations
2019-05-09 17:33:29 -07:00
Jinseon Lee 0ef435a16c add postgresql meta db table schema configuration property (#7137) (#7183)
* add postgresql meta db table schema configuration property (#7137)

If the postgresql db schema changes, you must set the configuration
values.
You do not need to set it if there is no change from the default schema
'public'.
druid.metadata.postgres.dbTableSchema=public

* create postgresql metadb table schema configuration property (#7137)
If the postgresql db schema changes, you must set the configuration
values.
You do not need to set it if there is no change from the default schema
'public'.
druid.metadata.postgres.dbTableSchema=public
check PostgreSQLTablesConfig.java

* modify postgresql readme file. - metadb table schema (#7137)
If the postgresql db schema changes, you must set the configuration
values.
You do not need to set it if there is no change from the default schema
'public'.
druid.metadata.postgres.dbTableSchema=public
check PostgreSQLTablesConfig.java
2019-05-08 12:56:30 -07:00