Commit Graph

11939 Commits

Author SHA1 Message Date
Gian Merlino 69aac6c8dd
Direct UTF-8 access for "in" filters. (#12517)
* Direct UTF-8 access for "in" filters.

Directly related:

1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8
   ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If
   necessary, the input set is copied into a ValuesSet. Much logic is
   simplified, because we always know what type the values set will be.
   I think that there won't even be an efficiency loss in most cases.
   InDimFilter is most frequently created by deserialization, and this
   patch updates the JsonCreator constructor to deserialize
   directly into a ValuesSet.

2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes
   during index lookups.

3) Add unsigned comparator to ByteBufferUtils and use it in
   GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8
   bytes can be compared as bytes if, and only if, the comparison
   is unsigned.

4) Add specialization to GenericIndexed.singleThreaded().indexOf that
   avoids needless ByteBuffer allocations.

5) Clarify that objects returned by ColumnIndexSupplier.as are not
   thread-safe. DictionaryEncodedStringIndexSupplier now calls
   singleThreaded() on all relevant GenericIndexed objects, saving
   a ByteBuffer allocation per access.

Also:

1) Fix performance regression in LikeFilter: since #12315, it applied
   the suffix matcher to all values in range even for type MATCH_ALL.

2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark,
   which was broken due to calls to strategy.compare in
   GenericIndexed.fromIterable.

* Add like-filter implementation tests.

* Add in-filter implementation tests.

* Add tests, fix issues.

* Fix style.

* Adjustments from review.
2022-05-20 01:51:28 -07:00
Vadim Ogievetsky a235aca2b3
Web console: fix go to segments not working (#12541)
* use correct filter syntax

* fix tests
2022-05-19 14:34:03 -07:00
Gian Merlino 5f95cc61fe
RemoteTaskRunner: Fix NPE in streamTaskReports. (#12006)
* RemoteTaskRunner: Fix NPE in streamTaskReports.

It is possible for a work item to drop out of runningTasks after the
ZkWorker is retrieved. In this case, the current code would throw
an NPE.

* Additional tests and additional fixes.

* Fix import.
2022-05-19 14:23:55 -07:00
Gian Merlino 65a1375b67
SQL: Add is_active to sys.segments, update examples and docs. (#11550)
* SQL: Add is_active to sys.segments, update examples and docs.

is_active is short for:

  (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1

It's important because this represents "all the segments that should
be queryable, whether or not they actually are right now". Most of the
time, this is the set of segments that people will want to look at.

The web console already adds this filter to a lot of its queries,
proving its usefulness.

This patch also reworks the caveat at the bottom of the sys.segments
section, so its information is mixed into the description of each result
field. This should make it more likely for people to see the information.

* Wording updates.

* Adjustments for spellcheck.

* Adjust IT.
2022-05-19 14:23:28 -07:00
machine424 90531fd53f
Do not alter query timeout in ScanQueryEngine (#12271)
Add test to detect timeout mutability
2022-05-19 09:24:42 -07:00
Xavier Léauté ec41dfb535
upgrade core Apache Kafka dependencies to 3.2.0 (#12538)
Announcement: https://blogs.apache.org/kafka/entry/what-s-new-in-apache8
Release notes: https://downloads.apache.org/kafka/3.2.0/RELEASE_NOTES.html
2022-05-19 09:04:52 -07:00
Gian Merlino 1d258d2108
Slightly improve RTR log messages. (#12540)
1) Align "Assigning task" log messages between RTR and HRTR.

2) Remove confusing reference to "Coordinator".

3) Move "Not assigning task" message from INFO to DEBUG. It's not super
   important to see this message: we mainly want to see what _does_ get
   assigned.

4) Reword "Task switched from pending to running" message to better
   match the structure of the  "Assigning task" message from the same
   method.
2022-05-19 07:43:55 -07:00
Gian Merlino 485de6a14a
Add builder for TaskToolbox. (#12539)
* Add builder for TaskToolbox.

The main purpose of this change is to make it easier to create
TaskToolboxes in tests. However, the builder is used in production
too, by TaskToolboxFactory.

* Fix imports, adjust formatting.

* Fix import.
2022-05-19 07:43:50 -07:00
Gian Merlino 4631cff2a9
Free ByteBuffers in tests and fix some bugs. (#12521)
* Ensure ByteBuffers allocated in tests get freed.

Many tests had problems where a direct ByteBuffer would be allocated
and then not freed. This is bad because it causes flaky tests.

To fix this:

1) Add ByteBufferUtils.allocateDirect(size), which returns a ResourceHolder.
   This makes it easy to free the direct buffer. Currently, it's only used
   in tests, because production code seems OK.

2) Update all usages of ByteBuffer.allocateDirect (off-heap) in tests either
   to ByteBuffer.allocate (on-heap, which are garbaged collected), or to
   ByteBufferUtils.allocateDirect (wherever it seemed like there was a good
   reason for the buffer to be off-heap). Make sure to close all direct
   holders when done.

* Changes based on CI results.

* A different approach.

* Roll back BitmapOperationTest stuff.

* Try additional surefire memory.

* Revert "Roll back BitmapOperationTest stuff."

This reverts commit 49f846d9e3.

* Add TestBufferPool.

* Revert Xmx change in tests.

* Better behaved NestedQueryPushDownTest. Exit tests on OOME.

* Fix TestBufferPool.

* Remove T1C from ARM tests.

* Somewhat safer.

* Fix tests.

* Fix style stuff.

* Additional debugging.

* Reset null / expr configs better.

* ExpressionLambdaAggregatorFactory thread-safety.

* Alter forkNode to try to get better info when a JVM crashes.

* Fix buffer retention in ExpressionLambdaAggregatorFactory.

* Remove unused import.
2022-05-19 07:42:29 -07:00
Tejaswini Bandlamudi c877d8a981
Updates default inputSegmentSizeBytes in Compaction config (#12534)
Fixes Cannot serialize BigInt value as JSON error while loading compaction config in console.
2022-05-19 14:43:34 +05:30
AmatyaAvadhanula 215b90d1a4
CVE suppression (#12535) 2022-05-19 11:21:48 +05:30
Charles Smith 3e8d7a6d9f
Sql docs items (#12530)
* touch up sql refactor

* brush up SQL refactor

* incorporate feedback

* reorder sql

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-05-17 16:56:31 -07:00
Katya Macedo 177638f171
Fix typo, add comma (#12529) 2022-05-17 16:42:47 -07:00
Adarsh Sanjeev fcb1c0b7bf
Add cluster by support for replace syntax (#12524)
* Add cluster by support for replace syntax

* Add unit test for with list
2022-05-17 15:15:29 +05:30
Clint Wylie b23ddc5939
print replication levels in coordinator segment logs (#12511)
* print replication levels in coordinator segment logs

* add served segment count to stats

* also for drops
2022-05-17 02:24:13 -07:00
Adarsh Sanjeev 0fd4f1e386
Improve error messages from SQL REPLACE syntax (#12523)
- Add user friendly error messages for missing or incorrect OVERWRITE clause for REPLACE SQL query
- Move validation of missing OVERWRITE clause at code level instead of parser for custom error message
2022-05-17 09:55:58 +05:30
Gian Merlino fdfecfd996
Improved docs for range partitioning. (#12350)
* Improved docs for range partitioning.

1) Clarify the benefits of range partitioning.
2) Clarify which filters support pruning.
3) Include the fact that multi-value dimensions cannot be used for partitioning.

* Additional clarification.

* Update other section.

* Another adjustment.

* Updates from review.
2022-05-16 09:42:31 -07:00
Hellmar Becker 985640f103
Clarify the use of the Lookup API (#12088)
* Update lookups.md

* Update docs/querying/lookups.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/querying/lookups.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2022-05-16 07:50:24 -07:00
317brian 351e57bdb6
docs(fix): clarify how worker.version and minWorkerVersion comparison works (#12459)
* docs(fix): clarify how worker.version and minWorkerVersion comparison works

* Revert "docs(fix): clarify how worker.version and minWorkerVersion comparison works"

This reverts commit cadd1fdc60.

* docs(fix): clarify how worker.version and minWorkerVersion comparison works

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/configuration/index.md

fix spelling

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-05-16 07:48:33 -07:00
Gian Merlino 5b6727f319
Enable vectorized virtual column processing by default. (#12520)
In the majority of cases, this improves performance.

There's only one case I'm aware of where this may be a net negative: for time_floor(__time, <period>) where there are many repeated __time values. In nonvectorized processing, SingleLongInputCachingExpressionColumnValueSelector implements an optimization to avoid computing the time_floor function on every row. There is no such optimization in vectorized processing.

IMO, we shouldn't mention this in the docs. Rationale: It's too fiddly of a thing: it's not guaranteed that nonvectorized processing will be faster due to the optimization, because it would have to overcome the inherent speed advantage of vectorization. So it'd always require testing to determine the best setting for a specific dataset. It would be bad if users disabled vectorization thinking it would speed up their queries, and it actually slowed them down. And even if users do their own testing, at some point in the future we'll implement the optimization for vectorized processing too, and it's likely that users that explicitly disabled vectorization will continue to have it disabled. I'd like to avoid this outcome by encouraging all users to enable vectorization at all times. Really advanced users would be following development activity anyway, and can read this issue
2022-05-16 15:43:53 +05:30
Frank Chen c33ff1c745
Enforce console logging for peon process (#12067)
Currently all Druid processes share the same log4j2 configuration file located in _common directory. Since peon processes are spawned by middle manager process, they derivate the environment variables from the middle manager. These variables include those in the log4j2.xml controlling to which file the logger writes the log.

But current task logging mechanism requires the peon processes to output the log to console so that the middle manager can redirect the console output to a file and upload this file to task log storage.

So, this PR imposes this requirement to peon processes, whatever the configuration is in the shared log4j2.xml, peon processes always write the log to console.
2022-05-16 15:07:21 +05:30
Gian Merlino ff253fd8a3
Add setProcessingThreadNames context parameter. (#12514)
setting thread names takes a measurable amount of time in the case where segment scans are very quick. In high-QPS testing we found a slight performance boost from turning off processing thread renaming. This option makes that possible.
2022-05-16 13:42:00 +05:30
Jason Koch bb1a6def9d
Task queue unblock (#12099)
* concurrency: introduce GuardedBy to TaskQueue

* perf: Introduce TaskQueueScaleTest to test performance of TaskQueue with large task counts

This introduces a test case to confirm how long it will take to launch and manage (aka shutdown)
a large number of threads in the TaskQueue.

h/t to @gianm for main implementation.

* perf: improve scalability of TaskQueue with large task counts

* linter fixes, expand test coverage

* pr feedback suggestion; swap to different linter

* swap to use SuppressWarnings

* Fix TaskQueueScaleTest.

Co-authored-by: Gian Merlino <gian@imply.io>
2022-05-14 16:44:29 -07:00
Kashif Faraz 7ab2170802
Use datasketches version 3.2.0 (#12509)
Changes:
- Use apache datasketches version 3.2.0.
- Remove unsafe reflection-based usage of datasketch internals added in #12022
2022-05-13 11:28:15 +05:30
Adarsh Sanjeev 39b3487aa9
Add replace statement to sql parser (#12386)
Relevant Issue: #11929

- Add custom replace statement to Druid SQL parser.
- Edit DruidPlanner to convert relevant fields to Query Context.
- Refactor common code with INSERT statements to reuse them for REPLACE where possible.
2022-05-13 10:56:40 +05:30
Abhishek Radhakrishnan 9177515be2
Add IPAddress java library as dependency and migrate IPv4 functions to use the new library. (#11634)
* Add ipaddress library as dependency.

* IPv4 functions to use the inet.ipaddr package.

* Remove unused imports.

* Add new function.

* Minor rename.

* Add more unit tests.

* IPv4 address expr utils unit tests and address options.

* Adjust the IPv4Util functions.

* Move the UTs a bit around.

* Javadoc comments.

* Add license info for IPAddress.

* Fix groupId, artifact and version in license.yaml.

* Remove redundant subnet in messages - fixes UT.

* Remove unused commons-net dependency for /processing project.

* Make class and methods public so it can be accessed.

* Add initial version of benchmark

* Add subnetutils package for benchmarks.

* Auto generate ip addresses.

* Add more v4 address representations in setup to avoid bias.

* Use ThreadLocalRandom to avoid forbidden API usage.

* Adjust IPv4AddressBenchmark to adhere to codestyle rules.

* Update ipaddress library to latest 5.3.4

* Add ipaddress package dependency to benchmarks project.
2022-05-11 22:06:20 -07:00
Clint Wylie 9e5a940cf1
remake column indexes and query processing of filters (#12388)
Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filter consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use indexing in the current set filters that Druid provides.
2022-05-11 11:57:08 +05:30
Lucas Capistrant deb69d1bc0
Allow coordinator to be configured to kill segments in future (#10877)
Allow a Druid cluster to kill segments whose interval_end is a date in the future. This can be done by setting druid.coordinator.kill.durationToRetain to a negative period. For example PT-24H would allow segments to be killed if their interval_end date was 24 hours or less into the future at the time that the kill task is generated by the system.

A cluster operator can also disregard the druid.coordinator.kill.durationToRetain entirely by setting a new configuration, druid.coordinator.kill.ignoreDurationToRetain=true. This ignores interval_end date when looking for segments to kill, and instead is capable of killing any segment marked unused. This new configuration is off by default, and a cluster operator should fully understand and accept the risks if they enable it.
2022-05-11 07:35:15 +05:30
Kashif Faraz 60b4fa0f75
Docs: Fix column name in ingestion rollup doc (#12036)
Fix the referred column name from "count" to "num_rows" as "count" vs. "COUNT(*)" might be a little confusing in this example.
2022-05-10 17:35:59 +05:30
Rohan Garg 75836a5a06
Add feature flag for sql planning of TimeBoundary queries (#12491)
* Add feature flag for sql planning of TimeBoundary queries

* fixup! Add feature flag for sql planning of TimeBoundary queries

* Add documentation for enableTimeBoundaryPlanning

* fixup! Add documentation for enableTimeBoundaryPlanning
2022-05-10 15:23:42 +05:30
somu-imply c68388ebcd
Vectorized version of string last aggregator (#12493)
* Vectorized version of string last aggregator

* Updating string last and adding testcases

* Updating code and adding testcases for serializable pairs

* Addressing review comments
2022-05-09 17:02:38 -07:00
Rohan Garg 2dd073c2cd
Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation (#12484)
* Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* fixup! Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* Document vectorized dimension
2022-05-09 10:40:17 -07:00
Atul Mohan eb6de94e1f
Add daily stats to console (#12329) 2022-05-05 15:31:21 -07:00
Vadim Ogievetsky 2d8eb117c0
Web console: add a button to get out of restricted mode, make capability detection more robust (#12503)
* allow unrestrict

* update tests
2022-05-05 15:06:59 -07:00
Victoria Lim 0206a2da5c
Update automatic compaction docs with consistent terminology (#12416)
* specify automatic compaction where applicable

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* update for style and consistency

* implement suggested feedback

* remove duplicate example

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/compaction.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/api-reference.md

* update .spelling

* Adopt review suggestions

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2022-05-03 16:22:25 -07:00
Naya Chen 35a7d863b7
add aws-java-sdk-sts to aws-common classpath (#12482)
Fixes #11303

WebIdentityTokenProvider in the defaultAWSCredentialsProviderChain can not actually be used because the aws-java-sdk-sts jar is not in the classpath of S3 extension at runtime, since each extension has its own classpath. This results in the inability to assume STS role before generating authentication token.
The error message from getCredentials() is:

"Unable to load credentials from WebIdentityTokenCredentialsProvider: To use assume role profiles the aws-java-sdk-sts module must be on the class path"

This PR will fix multiple authentication modules that are dependent on the WebIdentityTokenProvider, including AWS IAM based RDS authentication and S3 authentication.
2022-05-03 12:25:51 -07:00
Vadim Ogievetsky fb08bac01a
Web console: Misc table fixes (#12489)
* Misc table fixes

* extract default className

* table spacing updates

* fix e2e action selector

* try more times

* make the web console exist again
2022-05-03 12:08:08 -07:00
zachjsh de14f511d6
Fix broken ForkingTaskRunnerTest (#12499)
A recent commit broke this test. This pr fixes the test.
2022-05-03 04:00:36 -04:00
Rocky Chen 770ad95169
Add a metric for task duration in the pending queue (#12492)
This PR is to measure how long a task stays in the pending queue and emits the value with the metric task/pending/time. The metric is measured in RemoteTaskRunner and HttpRemoteTaskRunner.

An example of the metric:

```
2022-04-26T21:59:09,488 INFO [rtr-pending-tasks-runner-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","timestamp":"2022-04-26T21:59:09.487Z","service":"druid/coordinator","host":"localhost:8081","version":"2022.02.0-iap-SNAPSHOT","metric":"task/pending/time","value":8,"dataSource":"wikipedia","taskId":"index_parallel_wikipedia_gecpcglg_2022-04-26T21:59:09.432Z","taskType":"index_parallel"}
```

------------------------------------------
Key changed/added classes in this PR

    Emit metric task/pending/time in classes RemoteTaskRunner and HttpRemoteTaskRunner.
    Update related factory classes and tests.
2022-05-02 23:47:25 -04:00
Nishant Bangarwa 785a1eeb9f
Update maven assembly plugin for druid-benchmarks (#12487) 2022-05-02 09:43:19 -07:00
Lucas Capistrant 39e7191f03
Add authentication call before cleaning up intermediate files in hadoop ingestions (#12030)
* Add authentication call before cleaning up intermediate files in hadoop ingestions

* fix checkstyle

* remove debug log
2022-05-02 08:40:44 -05:00
aggarwalakshay dd8781f5b0
Upgrade dependency-check-maven to 7.0.4 (#12441) 2022-05-01 22:45:58 +08:00
317brian b97f273d5a
docs: fix typo (#12494) 2022-05-01 22:44:31 +08:00
MC-JY bb080693a9
Improve build performance of modules (#12486)
* improve build performance of modules

* improve build performance of modules

* Update pom.xml

* improve build performance of modules
2022-05-01 22:43:11 +08:00
Tejaswini Bandlamudi 1d1f53e7d5
Improve error messages when URI points to a file that doesn't exist (#12490) 2022-05-01 11:26:16 +05:30
Gian Merlino 529b983ad0
GroupBy: Reduce allocations by reusing entry and key holders. (#12474)
* GroupBy: Reduce allocations by reusing entry and key holders.

Two main changes:

1) Reuse Entry objects returned by various implementations of
   Grouper.iterator.

2) Reuse key objects contained within those Entry objects.

This is allowed by the contract, which states that entries must be
processed and immediately discarded. However, not all call sites
respected this, so this patch also updates those call sites.

One particularly sneaky way that the old code retained entries too long
is due to Guava's MergingIterator and CombiningIterator. Internally,
these both advance to the next value prior to returning the current
value. So, this patch addresses that in two ways:

1) For merging, we have our own implementation MergeIterator already,
   although it had the same problem. So, this patch updates our
   implementation to return the current item prior to advancing to the
   next item. It also adds a forbidden-api entry to ensure that this
   safer implementation is used instead of Guava's.

2) For combining, we address the problem in a different way: by copying
   the key when creating the new, combined entry.

* Attempt to fix test.

* Remove unused import.
2022-04-28 23:21:13 -07:00
Charles Smith 42fa5c26e1
remove arbitrary granularity spec from docs (#12460)
* remove arbitrary granularity spec from docs

* Update docs/ingestion/ingestion-spec.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-04-28 16:36:54 -07:00
Frank Chen df074f2f96
Improve exception message for native binary operators (#12335)
* Improve exception message

* Update message
2022-04-28 10:20:16 +08:00
Gian Merlino 7b89682bbe
DimensionRangeShardSpec speed boost. (#12477)
* DimensionRangeShardSpec speed boost.

Calling isEmpty() and equals() on RangeSets is expensive, because these
fall back on default implementations that call size(). And size() is
_also_ a default implementation that iterates the entire collection.

* Fix and test from code review.
2022-04-27 14:20:35 -07:00
Gian Merlino a2bad0b3a2
Reduce allocations due to Jackson serialization. (#12468)
* Reduce allocations due to Jackson serialization.

This patch attacks two sources of allocations during Jackson
serialization:

1) ObjectMapper.writeValue and JsonGenerator.writeObject create a new
   DefaultSerializerProvider instance for each call. It has lots of
   fields and creates pressure on the garbage collector. So, this patch
   adds helper functions in JacksonUtils that enable reuse of
   SerializerProvider objects and updates various call sites to make
   use of this.

2) GroupByQueryToolChest copies the ObjectMapper for every query to
   install a special module that supports backwards compatibility with
   map-based rows. This isn't needed if resultAsArray is set and
   all servers are running Druid 0.16.0 or later. This release was a
   while ago. So, this patch disables backwards compatibility by default,
   which eliminates the need to copy the heavyweight ObjectMapper. The
   patch also introduces a configuration option that allows admins to
   explicitly enable backwards compatibility.

* Add test.

* Update additional call sites and add to forbidden APIs.
2022-04-27 14:17:26 -07:00