Commit Graph

1004 Commits

Author SHA1 Message Date
Dongjoon Hyun 79f86a0511
Upgrade ORC to 1.7.4 (#12572)
This commit upgrades Apache ORC library from 1.7.2 to 1.7.4.
Apache ORC 1.7.4 is the maintenance release with the following bug fixes.

https://orc.apache.org/news/2022/04/15/ORC-1.7.4/
https://github.com/apache/orc/releases/tag/v1.7.4
2022-05-28 17:44:36 +05:30
Gian Merlino 4631cff2a9
Free ByteBuffers in tests and fix some bugs. (#12521)
* Ensure ByteBuffers allocated in tests get freed.

Many tests had problems where a direct ByteBuffer would be allocated
and then not freed. This is bad because it causes flaky tests.

To fix this:

1) Add ByteBufferUtils.allocateDirect(size), which returns a ResourceHolder.
   This makes it easy to free the direct buffer. Currently, it's only used
   in tests, because production code seems OK.

2) Update all usages of ByteBuffer.allocateDirect (off-heap) in tests either
   to ByteBuffer.allocate (on-heap, which are garbaged collected), or to
   ByteBufferUtils.allocateDirect (wherever it seemed like there was a good
   reason for the buffer to be off-heap). Make sure to close all direct
   holders when done.

* Changes based on CI results.

* A different approach.

* Roll back BitmapOperationTest stuff.

* Try additional surefire memory.

* Revert "Roll back BitmapOperationTest stuff."

This reverts commit 49f846d9e3.

* Add TestBufferPool.

* Revert Xmx change in tests.

* Better behaved NestedQueryPushDownTest. Exit tests on OOME.

* Fix TestBufferPool.

* Remove T1C from ARM tests.

* Somewhat safer.

* Fix tests.

* Fix style stuff.

* Additional debugging.

* Reset null / expr configs better.

* ExpressionLambdaAggregatorFactory thread-safety.

* Alter forkNode to try to get better info when a JVM crashes.

* Fix buffer retention in ExpressionLambdaAggregatorFactory.

* Remove unused import.
2022-05-19 07:42:29 -07:00
Kashif Faraz 7ab2170802
Use datasketches version 3.2.0 (#12509)
Changes:
- Use apache datasketches version 3.2.0.
- Remove unsafe reflection-based usage of datasketch internals added in #12022
2022-05-13 11:28:15 +05:30
Lucas Capistrant 39e7191f03
Add authentication call before cleaning up intermediate files in hadoop ingestions (#12030)
* Add authentication call before cleaning up intermediate files in hadoop ingestions

* fix checkstyle

* remove debug log
2022-05-02 08:40:44 -05:00
MC-JY bb080693a9
Improve build performance of modules (#12486)
* improve build performance of modules

* improve build performance of modules

* Update pom.xml

* improve build performance of modules
2022-05-01 22:43:11 +08:00
Gian Merlino 529b983ad0
GroupBy: Reduce allocations by reusing entry and key holders. (#12474)
* GroupBy: Reduce allocations by reusing entry and key holders.

Two main changes:

1) Reuse Entry objects returned by various implementations of
   Grouper.iterator.

2) Reuse key objects contained within those Entry objects.

This is allowed by the contract, which states that entries must be
processed and immediately discarded. However, not all call sites
respected this, so this patch also updates those call sites.

One particularly sneaky way that the old code retained entries too long
is due to Guava's MergingIterator and CombiningIterator. Internally,
these both advance to the next value prior to returning the current
value. So, this patch addresses that in two ways:

1) For merging, we have our own implementation MergeIterator already,
   although it had the same problem. So, this patch updates our
   implementation to return the current item prior to advancing to the
   next item. It also adds a forbidden-api entry to ensure that this
   safer implementation is used instead of Guava's.

2) For combining, we address the problem in a different way: by copying
   the key when creating the new, combined entry.

* Attempt to fix test.

* Remove unused import.
2022-04-28 23:21:13 -07:00
Abhishek Agarwal 2fe053c5cb
Bump up the versions (#12480) 2022-04-27 14:28:20 +05:30
Jihoon Son 73ce5df22d
Add support for authorizing query context params (#12396)
The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below.

Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params.
User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters.
System context params. They are set by the Druid query engine during query processing. These params override other context params.
Today, any context params are allowed to users. This can cause 
1) a bad UX if the context param is not matured yet or 
2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows.

This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission.

{
  "resourceAction" : {
    "resource" : {
      "name" : "maxSubqueryRows",
      "type" : "QUERY_CONTEXT"
    },
    "action" : "WRITE"
  },
  "resourceNamePattern" : "maxSubqueryRows"
}
Each role can have multiple permissions for context params. Each permission should be set for different context params.

When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case,

HTTP endpoints will return 403 response code.
JDBC will throw ForbiddenException.
Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService.

The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.
2022-04-21 14:21:16 +05:30
PJ Fanning 341c65738d
issue-12426 upgrade k8s client due to cve (#12427)
* issue-12426 upgrade k8s client due to cve

* compile issues

* try to fix license check
2022-04-21 10:11:55 +08:00
Rohan Garg de9f12b5c6
Fail fast incase a lookup load fails (#12397)
Currently while loading a lookup for the first time, loading threads blocks
for `waitForFirstRunMs` incase the lookup failed to load. If the `waitForFirstRunMs`
is long (like 10 minutes), such blocking can slow down the loading of other lookups.

This commit allows the thread to progress as soon as the loading of the lookup fails.
2022-04-18 13:14:02 +05:30
AmatyaAvadhanula 067254b778
Package kinesis client jar within the extension (#12370)
amazon-kinesis-client was not covered undered the apache license and required separate insertion in the kinesis extension.
This can now be avoided since it is covered, and including it within druid helps prevent incompatibilities.

Allows enabling of deaggregation out of the box by packaging amazon-kinesis-client (1.14.4) with druid for kinesis ingestion.
2022-04-04 21:31:18 +05:30
AmatyaAvadhanula c5531be553
Add feature flag for Kinesis listShards API usage (#12383)
listShards API was used to get all the shards for kinesis ingestion to improve its resiliency as part of #12161.

However, this may require additional permissions in the IAM policy where the stream is present. (Please refer to: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_ListShards.html).

A dynamic configuration useListShards has been added to KinesisSupervisorTuningConfig to control the usage of this API and prevent issues upon upgrade. It can be safely turned on (and is recommended when using kinesis ingestion) by setting this configuration to true.
2022-04-04 14:58:10 +05:30
Jihoon Son b6eeef31e5
Store null columns in the segments (#12279)
* Store null columns in the segments

* fix test

* remove NullNumericColumn and unused dependency

* fix compile failure

* use guava instead of apache commons

* split new tests

* unused imports

* address comments
2022-03-23 16:54:04 -07:00
Kyle Larose db91961af7
kubernetes: restart watch on null response (#12233)
* kubernetes: restart watch on null response

Kubernetes watches allow a client to efficiently processes changes to
resources. However, they have some idiosyncrasies. In particular, they
can error out for various reasons leading to what would normally be seen
as an invalid result.

The Druid kubernetes node discovery subsystem does not handle a certain
case properly. The watch can return an item with a null object.  These
leads to a null pointer exception. When this happens, the provider needs
to restart the watch, because rerunning the watch from the same resource
version leads to the same result: yet another null pointer exception.

This commit changes the provider to handle null objects by restarting
the watch.

* review: add more coverage

This adds a bit more coverage to the K8sDruidNodeDiscoveryProvider watch
loop, and removes an unnecessay return.

* kubernetes: reduce logging verbosity

The log messages about items being NULL don't really deserve to be at a
level other than DEBUG since they are not actionable, particularly since
we automatically recover now. Move them to the DEBUG level.
2022-03-10 12:56:40 -08:00
Parag Jain 2efb74ff1e
fix supervisor auto scaler config serde bug (#12317) 2022-03-09 16:17:12 -08:00
Abhishek Agarwal 6346b9561d
Reuse the InputEntityReader in SettableByteEntityReader (#12269)
* Reuse the InputEntityReader in SettableByteEntityReader

* Fix logic

* Fix kafka streaming ingestion

* Add Tests for kafka input format change

* Address review comments
2022-03-09 14:38:31 -08:00
Clint Wylie 0600772cce
use a non-concurrent map for lookups-cached-global unless incremental updates are actually required (#12293)
* use a non-concurrent map for lookups-cached-global unless incremental updates are actually required
* adjustments
* fix test
2022-03-08 21:54:25 -08:00
Gian Merlino 28f8bcce9b
Always reopen stream in FileUtils.copyLarge, RetryingInputStream. (#12307)
* Always reopen stream in FileUtils.copyLarge, RetryingInputStream.

When an InputStream throws an exception from one of its read methods,
we should assume it's bad and reopen it.

The main changes here are:

- In FileUtils.copyLarge, replace InputStream with InputStreamSupplier.
- In RetryingInputStream, collapse retryCondition and resetCondition
  into a single condition. Also, make it required, since every usage
  is passing in a specific condition anyway.

* Test fixes.

* Fix read impl.
2022-03-05 14:39:14 -08:00
Frank Chen 36bc41855d
Set Content-Type for String based response (#12295) 2022-03-04 15:17:03 +08:00
Alexander Saydakov 50038d9344
latest datasketches-java-3.1.0 (#12224)
These changes are to use the latest datasketches-java-3.1.0 and also to restore support for quantile and HLL4 sketches to be able to grow larger than a given buffer in a buffer aggregator and move to heap in rare cases. This was discussed in #11544.

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2022-03-01 17:14:42 -08:00
Laksh Singla 3f709db173
Make ParseExceptions more informative (#12259)
This PR aims to make the ParseExceptions in Druid more informative, by adding additional information (metadata) to the ParseException, which can contain additional information about the exception. For example - the path of the file generating the issue, the line number (where it can be easily fetched - like CsvReader)

Following changes are addressed in this PR:

A new class CloseableIteratorWithMetadata has been created which is like CloseableIterator but also has a metadata method that returns a context Map<String, Object> about the current element returned by next().
IntermediateRowParsingReader#read() now attaches the InputEntity and the "record number" which created the exception (while parsing them), and IntermediateRowParsingReader#sample attaches the InputEntity (but not the "record number").
TextReader (and its subclasses), which is a specific implementation of the IntermediateRowParsingReader also include the line number which caused the generation of the error.
This will also help in triaging the issues when InputSourceReader generates ParseException because it can point to the specific InputEntity which caused the exception (while trying to read it).
2022-02-28 22:31:15 +05:30
Xavier Léauté d105519558
Replace use of PowerMock with Mockito (#12282)
Mockito now supports all our needs and plays much better with recent Java versions.
Migrating to Mockito also simplifies running the kind of tests that required PowerMock in the past. 

* replace all uses of powermock with mockito-inline
* upgrade mockito to 4.3.1 and fix use of deprecated methods
* import mockito bom to align all our mockito dependencies
* add powermock to forbidden-apis to avoid accidentally reintroducing it in the future
2022-02-27 22:47:09 -08:00
Xavier Léauté 4c61878f9c
Reduce use of mocking and simplify some tests (#12283)
* remove use of mocks for ServiceMetricEvent
* simplify KafkaEmitterTests by moving to Mockito
* speed up KafkaEmitterTest by adjusting reporting frequency in tests
* remove unnecessary easymock and JUnitParams dependencies
2022-02-26 17:23:09 -08:00
Jihoon Son e5ad862665
A new includeAllDimension flag for dimensionsSpec (#12276)
* includeAllDimensions in dimensionsSpec

* doc

* address comments

* unused import and doc spelling
2022-02-25 18:27:48 -08:00
Karan Kumar b86f2d4c2e
Performance fixes in proto readers (#12267) 2022-02-24 23:21:48 +05:30
Xavier Léauté 009dd9e09a
upgrade core Apache Kafka dependencies to 3.1.0 (#12203)
Announcement: https://blogs.apache.org/kafka/entry/what-s-new-in-apache7
Release notes: https://dist.apache.org/repos/dist/release/kafka/3.1.0/RELEASE_NOTES.html

* upgrade core Apache Kafka dependencies to 3.1.0
* fix use of private Kafka APIs
* remove deprecated test rules
* remove mock calls that weren't verified in the first place
* remove the need for powermock in KafkaLookupExtractorFactoryTest
* align curator-test version with curator itself
* update easymock to 4.3.0
2022-02-23 18:42:51 -08:00
Karan Kumar b94390ba33
Adding Shared Access resource support for azure (#12266)
Azure Blob storage has multiple modes of authentication. One of them is Shared access resource
. This is very useful in cases when we do not want to add the account key in the druid properties .
2022-02-22 18:27:43 +05:30
AmatyaAvadhanula 1ec57cb935
Improve kinesis task assignment after resharding (#12235)
Problem:
- When a kinesis stream is resharded, the original shards are closed.
   Any intermediate shard created in the process is eventually closed as well.
- If a shard is closed before any record is put into it, it can be safely ignored for ingestion.
- It is expensive to determine if a closed shard is empty, since it requires a call to the Kinesis cluster.

Changes:
- Maintain a cache of closed empty and closed non-empty shards in `KinesisSupervisor`
- Add config `skipIngorableShards` to `KinesisSupervisorTuningConfig`
- The caches are used and updated only when `skipIgnorableShards = true`
2022-02-18 12:37:06 +05:30
William Hyun 34bc361953
Update ORC to 1.7.2 (#12084) 2022-02-15 10:04:12 -08:00
Clint Wylie 3ee66bb492
allow optimizing sql expressions and virtual columns (#12241)
* rework sql planner expression and virtual column handling

* simplify a bit

* add back and deprecate old methods, more tests, fix multi-value string coercion bug and associated tests

* spotbugs

* fix bugs with multi-value string array expression handling

* javadocs and adjust test

* better

* fix tests
2022-02-09 14:55:50 -08:00
Jihoon Son ab3d994a17
Lazy instantiation for segmentKillers, segmentMovers, and segmentArchivers (#12207)
* working

* Lazily load segmentKillers, segmentMovers, and segmentArchivers

* more tests

* test-jar plugin

* more coverage

* lazy client

* clean up changes

* checkstyle

* i did not change the branch condition

* adjust failure rate to run tests faster

* javadocs

* checkstyle
2022-02-08 13:02:06 -08:00
Gian Merlino de82c611de
Harmonize implementations of "visit" for Exprs from ExprMacros. (#12230)
* Harmonize implementations of "visit" for Exprs from ExprMacros.

Many of them had bugs where they would not visit all of the original
arguments. I don't think this has user-visible consequences right now,
but it's possible it would in a future world where "visit" is used
for more stuff than it is today.

So, this patch all updates all implementations to a more consistent
style that emphasizes reapplying the macro to the shuttled args.

* Test fixes, test coverage, PR review comments.
2022-02-04 08:08:54 -08:00
Kashif Faraz e648b01afb
Improve memory estimates in Aggregator and DimensionIndexer (#12073)
Fixes #12022  

### Description
The current implementations of memory estimation in `OnHeapIncrementalIndex` and `StringDimensionIndexer` tend to over-estimate which leads to more persistence cycles than necessary.

This PR replaces the max estimation mechanism with getting the incremental memory used by the aggregator or indexer at each invocation of `aggregate` or `encode` respectively.

### Changes
- Add new flag `useMaxMemoryEstimates` in the task context. This overrides the same flag in DefaultTaskConfig i.e. `druid.indexer.task.default.context` map
- Add method `AggregatorFactory.factorizeWithSize()` that returns an `AggregatorAndSize` which contains
  the aggregator instance and the estimated initial size of the aggregator
- Add method `Aggregator.aggregateWithSize()` which returns the incremental memory used by this aggregation step
- Update the method `DimensionIndexer.processRowValsToKeyComponent()` to return the encoded key component as well as its effective size in bytes
- Update `OnHeapIncrementalIndex` to use the new estimations only if `useMaxMemoryEstimates = false`
2022-02-03 10:34:02 +05:30
Clint Wylie 978b8f7dde
do not explode if mysql transient exception class does not exist (#12213)
Follow up to #12205 to allow druid-mysql-extensions to work with mysql connector/j 8.x again, which does not contain MySQLTransientException, and while would have had the same problem as mariadb if a transient exception was checked, the new check eagerly loads the class when starting up, causing immediate failure.
2022-02-01 09:06:24 +05:30
Clint Wylie 5d2291991e
use reflection to check for mysql transient exception type (#12205)
* use reflection to check for mysql transient exception type

* better

* oops
2022-01-27 13:13:16 -08:00
AmatyaAvadhanula 1f63b447c4
Mitigate Kinesis stream LimitExceededException by using listShards API (#12161)
Makes kinesis ingestion resilient to `LimitExceededException` caused by resharding.
Replace `describeStream` with `listShards` (recommended) to get shard related info.
`describeStream` has a limit (100) to the number of shards returned per call and a low default TPS limit of 10.
`listShards` returns the info for at most 1000 shards and has a higher TPS limit of 100 as well.

Key changed/added classes in this PR
 * `KinesisRecordSupplier`
 * `KinesisAdminClient`
2022-01-21 10:15:51 +05:30
Abhishek Agarwal 53c0e489c2
Fix infinite retrying during task pausing (#12167)
This fixes a bug that causes TaskClient in overlord to continuously retry to pause tasks. This can happen when a task is not responding to the pause command. Ideally, in such a case when the task is unresponsive, the overlord would have given up after a few retries and would have killed the task. However, due to this bug, retries go on forever.
2022-01-19 09:03:36 +05:30
Jonathan Wei 74c876e578
Throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder (#12080)
* Add option to throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder

* Remove option
2022-01-13 12:36:51 -06:00
imply-cheddar b153cb2342
Add a small LRU cache and use utf8 bytes in ArrayOfDoubles (#12130)
* Add a small LRU cache and use utf8 bytes in ArrayOfDoubles

* Add tests for extra branches

* Even more tests for branch coverage

* Fix Style
2022-01-11 13:04:11 -08:00
somu-imply 08fea7a46a
input type validation for datasketches hll "build" aggregator factory (#12131)
* Ingestion will fail for HLLSketchBuild instead of creating with incorrect values

* Addressing review comments for HLL< updated error message introduced test case
2022-01-11 12:00:14 -08:00
Suneet Saldanha 25ac04e067
MySqlFirehoseDatabaseConnector uses configured driver class name (#12049) 2021-12-09 20:58:55 -08:00
Frank Chen 58245b4617
Support JsonPath functions in JsonPath expressions (#11722)
* Add jsonPath functions support

* Add jsonPath function test for Avro

* Add jsonPath function length() to Orc

* Add jsonPath function length() to Parquet

* Add more tests to ORC format

* update doc

* Fix exception during ingestion

* Add IT test case

* Revert "Fix exception during ingestion"

This reverts commit 5a5484b9ea.

* update IT test case

* Add 'keys()'

* Commit IT test case

* Fix UT
2021-12-10 10:53:23 +08:00
Jonathan Wei 229f82a6f0
Add parse error list API for stream supervisors, use structured object for parse exceptions, simplify parse exception message (#11961)
* Add parse error list API for stream supervisors, simplify parse exception message

* Add input string to parse exception

* Use structured ParseExceptionReport

* Fix tests

* Add test

* PR comments, add ParseExceptionReport equals verifier

* Fix test
2021-12-09 15:42:55 -06:00
zachjsh 65cadbe42a
Fix bad lookup config fails task (#12021)
This PR fixes an issue in which if a lookup is configured incorreclty; does not serialize properly when being pulled by peon node, it causes the task to fail. The failure occurs because the peon and other leaf nodes (broker, historical), have retry logic that continues to retry the lookup loading for 3 minutes by default. The http listener thread on the peon task is not started until lookup loading completes, by default, the overlord waits 1 minute by default, to communicate with the peon task to get the task status, after which is orders the task to shut down, causing the ingestion task to fail.

To fix the issue, we catch the exception serialization error, and do not retry. Also fixed an issue in which a bad lookup config interferes with any other good lookup configs from being loaded.
2021-12-07 00:55:34 -05:00
Gian Merlino e0e05aad99
Enhancements to IndexTaskClient. (#12011)
* Enhancements to IndexTaskClient.

1) Ability to use handlers other than StringFullResponseHandler. This
   functionality is not used in production code yet, but is useful
   because it will allow tasks to communicate with each other in
   non-string-based formats and in streaming fashion. In the future,
   we'll be able to use this to make task-to-task communication
   more efficient.

2) Truncate server errors at 1KB, so long errors do not pollute logs.

3) Change error log level for retryable errors from WARN to INFO. (The
   final error is still WARN.)

4) Harmonize log and exception messages to have a more consistent format.

* Additional tests and improvements.
2021-12-03 09:14:32 -08:00
Frank Chen c2cea25a6b
Improve exception message when loading data from web-console (#11723)
* Improve exception handling

* Revert some changes

* Resolve comments

* Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/sampler/SamplerExceptionMapper.java

Co-authored-by: Karan Kumar <karankumar1100@gmail.com>

* Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/sampler/SamplerExceptionMapper.java

Co-authored-by: Karan Kumar <karankumar1100@gmail.com>

* Address review comments

Co-authored-by: Karan Kumar <karankumar1100@gmail.com>
2021-12-03 21:33:49 +08:00
Abhishek Agarwal 503384569a
Fix classNotFoundException when connecting to secure LDAP (#11978)
This PR fixes a problem where the com.sun.jndi.ldap.Connection tries to build BasicSecuritySSLSocketFactory when calling LDAPCredentialsValidator.validateCredentials since BasicSecuritySSLSocketFactory is in extension class loader and not visible to system classloader.
2021-12-03 12:08:19 +05:30
Clint Wylie 84b4bf56d8
vectorize logical operators and boolean functions (#11184)
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
2021-12-02 16:40:23 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Gian Merlino 93aeaf4801
Improve on-heap aggregator footprint estimates. (#11950)
Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that
mitigates #6743 by enabling heap footprint estimates based on a specific
number of rows. The idea is that at ingestion time, the number of rows
that go into an aggregator will be 1 (if rollup is off) or will likely
be a small number (if rollup is on).

It's a heuristic, because of course nothing guarantees that the rollup
ratio is a small number. But it's a common case, and I expect this logic
to go wrong much less often than the current logic. Also, when it does
go wrong, users can fix it by lowering maxRowsInMemory or
maxBytesInMemory. The current situation is unintuitive: when the
estimation goes wrong, users get an OOME, but actually they need to
*raise* these limits to fix it.
2021-11-28 13:21:24 +05:30
Agustin Gonzalez 8eff6334f7
AWS "Data read has a different length than the expected" error should reset stream and try again (#11941)
* Add support for custom reset condition & support for other args to have defaults to make the method api consistent

* Add support for custom reset condition to InputEntity

* Fix test names

* Clarifying comments to why we need to read the message's content to identify S3's resettable exception

* Add unit test to verify custom resettable condition for S3Entity

* Provide a way to customize retries since they are expensive to test
2021-11-26 12:45:34 -07:00
Gian Merlino cb0a2af644
TestKafkaExtractionCluster: Shut down Kafka, ZK in @After. (#11963) 2021-11-20 15:17:05 -08:00
Clint Wylie f260bbed23
restore and deprecate AggregatorFactory methods (#11917)
* add back and deprecate aggregator factory methods so i can say i told you so when i delete these later

* rename to make less ambiguous, fix fill method

* adjust
2021-11-19 15:59:35 -08:00
William Hyun 3abca73ee8
Upgrade ORC to 1.7.1 (#11919) 2021-11-15 09:13:03 -08:00
Atul Mohan f9941c12c3
Reduce list operation calls when pulling segments from S3 (#11899)
* Lazy lists

* Fix objectsummary init
2021-11-10 19:13:46 -08:00
Clint Wylie a8805ab60d
add missing json type for ListFilteredVirtualColumn (#11887)
* add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again

* fixes

* ugly, but maybe this

* oops

* too many mappers
2021-11-09 17:25:12 -08:00
Gian Merlino babf00f8e3
Migrate File.mkdirs to FileUtils.mkdirp. (#11879)
* Migrate File.mkdirs to FileUtils.mkdirp.

* Remove unused imports.

* Fix LookupReferencesManager.

* Simplify.

* Also migrate usages of forceMkdir.

* Fix var name.

* Fix incorrect call.

* Update test.
2021-11-09 11:10:49 -08:00
Clint Wylie 7237dc837c
complex typed expressions (#11853)
* complex typed expressions

* add built-in hll collector expressions to get coverage on druid-processing, more types, more better

* rampage!!!

* more javadoc

* adjustments

* oops

* lol

* remove unused dependency

* contradiction?

* more test
2021-11-08 00:33:06 -08:00
zachjsh 1d6df48145
Warn if cache size of lookup is beyond max size (#11863)
Enhanced the ExtractionNamespace interface in lookups-cached-global core extension with the ability to set a maxHeapPercentage for the cache of the respective namespace. The reason for adding this functionality, is make it easier to detect when a lookup table grows to a size that the underlying service cannot handle, because it does not have enough memory. The default value of maxHeap for the interface is -1, which indicates that no maxHeapPercentage has been set. For the JdbcExtractionNamespace and UriExtractionNamespace implementations, the default value is null, which will cause the respective service that the lookup is loaded in, to warn when its cache is beyond mxHeapPercentage of the service's configured max heap size. If a positive non-null value is set for the namespace's maxHeapPercentage config, this value will be honored for all services that the respective lookup is loaded onto, and consequently log warning messages when the cache of the respective lookup grows beyond this respective percentage of the services configured max heap size. Warnings are logged every time that either Uri based or Jdbc based lookups are regenerated, if the maxHeapPercentage constraint is violated. No other implementations will log warnings at this time. No error is thrown when the size exceeds the maxHeapPercentage at this time, as doing so could break functionality for existing users. Previously the JdbcCacheGenerator generated its cache by materializing all rows of the underling table in memory at once; this made it difficult to log warning messages in the case that the results from the jdbc query were very large and caused the service to run out of memory. To help with this, this pr makes it so that the jdbc query results are instead streamed through an iterator.
2021-11-03 21:32:22 -04:00
Karan Kumar 90640bb316
Support for hadoop 3 via maven profiles (#11794)
Add support for hadoop 3 profiles . Most of the details are captured in #11791 .
We use a combination of maven profiles and resource filtering to achieve this. Hadoop2 is supported by default and a new maven profile with the name hadoop3 is created. This will allow the user to choose the profile which is best suited for the use case.
2021-10-30 22:46:24 +05:30
Gian Merlino fc95c92806
Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124)
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.

This patch does the following:

- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
  required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
  had it. (Not all of them did, and that's OK, because it wasn't necessary
  anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.

Rationale for removing the offheap incremental index:

- It is only used in one rare scenario: groupBy v1 (which is non-default)
  in "useOffheap" mode (also non-default). So you have to go pretty deep
  into the wilderness to get this code to activate in production. It is
  never used during ingestion.
- Its existence complicates developer efforts to reason about how
  aggregators get used, because the way it uses buffer aggregators is so
  different from how every other query engine uses them.
- It doesn't have meaningful testing.

By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.

* Remove things that are now unused.

* Revert removal of getFloat, getLong, getDouble from BufferAggregator.

* OAK-related warnings, suppressions.

* Unused item suppressions.
2021-10-26 08:05:56 -07:00
Gian Merlino 8276c031c5
Add druid.sql.approxCountDistinct.function property. (#11181)
* Add druid.sql.approxCountDistinct.function property.

The new property allows admins to configure the implementation for
APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode.

The motivation for adding this setting is to enable site admins to
switch the default HLL implementation to DataSketches.

For example, an admin can set:

  druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL

* Fixes

* Fix tests.

* Remove erroneous cannotVectorize.

* Remove unused import.

* Remove unused test imports.
2021-10-25 12:16:21 -07:00
Gian Merlino d4cace385f
SQL: Allow Scans to be used as outer queries. (#11831)
* SQL: Allow Scans to be used as outer queries.

This has been possible in the native query system for a while, but the capability
hasn't yet propagated into the SQL layer. One example of where this is useful is
a query like:

  SELECT * FROM (... LIMIT X) WHERE <filter>

Because this expands the kinds of subquery structures the SQL layer will consider,
it was also necessary to improve the cost calculations. These changes appear in
PartialDruidQuery and DruidOuterQueryRel. The ideas are:

- Attach per-column penalties to the output signature of each query, instead of to
  the initial projection that starts a query. This encourages moving projections
  into subqueries instead of leaving them on outer queries.
- Only attach penalties to projections if there are actually expressions happening.
  So, now, projections that simply reorder or remove fields are free.
- Attach a constant penalty to every outer query. This discourages creating them
  when they are not needed.

The changes are generally beneficial to the test cases we have in CalciteQueryTest.
Most plans are unchanged, or are changed in purely cosmetic ways. Two have changed
for the better:

- testUsingSubqueryWithLimit now returns a constant from the subquery, instead of
  returning every column.
- testJoinOuterGroupByAndSubqueryHasLimit returns a minimal set of columns from
  the innermost subquery; two unnecessary columns are no longer there.

* Fix various DS operator conversions.

These were all implemented as direct conversions, which isn't appropriate
because they do not actually map onto native functions. These are only
usable as post-aggregations.

* Test case adjustment.
2021-10-23 17:18:43 -07:00
Gian Merlino 98ecbb21cd
Remove CloseQuietly and migrate its usages to other methods. (#10247)
* Remove CloseQuietly and migrate its usages to other methods.

These other methods include:

1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions
   in RuntimeExceptions for callers that just want to avoid dealing with
   checked exceptions. Most usages were migrated to this method, because it
   looks like they were mainly attempts to avoid declaring a throws clause,
   and perhaps were unintentionally suppressing IOExceptions.
2) New method CloseableUtils.closeInCatch, designed to properly close something
   in a catch block without losing exceptions. Some usages from catch blocks
   were migrated here, when it seemed that they were intended to avoid checked
   exception handling, and did not really intend to also suppress IOExceptions.
3) New method CloseableUtils.closeAndSuppressExceptions, which sends all
   exceptions to a "chomper" that consumes them. Nothing is thrown or returned.
   The behavior is slightly different: with this method, _all_ exceptions are
   suppressed, not just IOExceptions. Calls that seemed like they had good
   reason to suppress exceptions were migrated here.
4) Some calls were migrated to try-with-resources, in cases where it appeared
   that CloseQuietly was being used to avoid throwing an exception in a finally
   block.

🎵 You don't have to go home, but you can't stay here... 🎵

* Remove unused import.

* Fix up various issues.

* Adjustments to tests.

* Fix null handling.

* Additional test.

* Adjustments from review.

* Fixup style stuff.

* Fix NPE caused by holder starting out null.

* Fix spelling.

* Chomp Throwables too.
2021-10-23 17:03:21 -07:00
Gian Merlino b7a4c79314
Null handling fixes for DS HLL and Theta sketches. (#11830)
* Null handling fixes for DS HLL and Theta sketches.

For HLL, this fixes an NPE when processing a null in a multi-value dimension.

For both, empty strings are now properly treated as nulls (and ignored) in
replace-with-default mode. Behavior in SQL-compatible mode is unchanged.

* Fix expectation.
2021-10-22 19:09:00 -07:00
Clint Wylie 741b4ed516
add output type information to ExpressionPostAggregator (#11818)
* add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types

* add test for test for coverage

* simplify

* Remove unused imports.

Co-authored-by: Gian Merlino <gian@imply.io>
2021-10-22 13:52:51 -07:00
Alexander Saydakov 8cf1cbc4a9
latest datasketches-java and datasketches-memory (#11773)
* latest datasketches-java and datasketches-memory

* updated versions of datasketches-java and datasketches-memory

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
2021-10-19 23:42:30 -07:00
Clint Wylie 187df58e30
better types (#11713)
* better type system

* needle in a haystack

* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support

* fixup merge

* more test

* fixup

* intern

* fix

* oops

* oops again

* ...

* more test coverage

* fix error message

* adjust interning, more javadocs

* oops

* more docs more better
2021-10-19 01:47:25 -07:00
lokesh-lingarajan ad6609a606
Kafka Input Format for headers, key and payload parsing (#11630)
### Description

Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events is another class of data that provides useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside kafka payload, but nonetheless there are some very useful metadata that is carried in kafka headers that can serve as useful dimension for aggregation and in turn bringing better insights.

PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats.

We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format(provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the kafka record. We also need a way to expose the data present in the key as druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload.

This PR is designed to solve the above by providing wrapper around any existing input formats and merging the data into a single unified Druid row.

Lets look at a sample input format from the above discussion

"inputFormat":
{
    "type": "kafka",     // New input format type
    "headerLabelPrefix": "kafka.header.",   // Label prefix for header columns, this will avoid collusions while merging columns
    "recordTimestampLabelPrefix": "kafka.",  // Kafka record's timestamp is made available in case payload does not carry timestamp
    "headerFormat":  // Header parser specifying that values are of type string
    {
        "type": "string"
    },
    "valueFormat": // Value parser from json parsing
    {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [...]
        }
    },
    "keyFormat":  // Key parser also from json parsing
    {
        "type": "json"
    }
}

Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, eg., headers coming in as string and payload as json. 

KafkaInputFormat will be the uber class extending inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blend the data resolving conflicts in columns and generating a single unified InputRow for Druid ingestion. 

"headerFormat" will allow users to plug parser type for the header values and will add default header prefix as "kafka.header."(can be overridden) for attributes to avoid collision while merging attributes with payload.

Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plugin existing parser. One thing to note here is that if batching is performed, then the code is augmenting header and key values to every record in the batch.

Kafka key parser will handle parsing Key portion of the Kafka record and will ingest the Key with dimension name as "kafka.key".

## KafkaInputFormat Class: 
This is the class that orchestrates sending the consumerRecord to each parser, retrieve rows, merge the columns into one final row for Druid consumption. KafkaInputformat should make sure to release the resources that gets allocated as a part of reader in CloseableIterator<InputRow> during normal and exception cases.

During conflicts in dimension/metrics names, the code will prefer dimension names from payload and ignore the dimension either from headers/key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.
2021-10-07 08:56:27 -07:00
Xavier Léauté bc3b038712
Update Apache Kafka client libraries to 3.0.0 (#11735)
Release notes:
https://downloads.apache.org/kafka/3.0.0/RELEASE_NOTES.html
https://blogs.apache.org/kafka/entry/what-s-new-in-apache6
2021-10-05 10:23:19 -07:00
William Hyun 9bff6bd70e
Upgrade ORC to 1.7.0 (#11726)
* Upgrade ORC to 1.7.0

* address comments

* address comments

* Add import
2021-09-27 13:20:09 -07:00
Clint Wylie 392f0ca1b5
refactor sql authorization to get resource type from schema, resource type to be string (#11692)
* refactor sql authorization to get resource type from schema, refactor resource type from enum to string

* information schema auth filtering adjustments

* refactor

* minor stuff

* Update SqlResourceCollectorShuttle.java
2021-09-17 09:53:25 -07:00
Kashif Faraz 757720fae5
Suppress stacktrace of InterruptedException in CommonCacheNotifier (#11715)
When CommonCachedNotifier is being stopped while the thread is waiting on updateQueue.take(),
an InterruptedException is thrown. The stack trace from this exception gives the wrong idea that something went wrong with the shutdown.
2021-09-16 22:27:08 +05:30
Clint Wylie fe1d8c206a
bump version to 0.23.0-SNAPSHOT (#11670) 2021-09-08 15:56:04 -07:00
Agustin Gonzalez 9efa6cc9c8
Make persists concurrent with adding rows in batch ingestion (#11536)
* Make persists concurrent with ingestion

* Remove semaphore but keep concurrent persists (with add) and add push in the backround as well

* Go back to documented default persists (zero)

* Move to debug

* Remove unnecessary Atomics

* Comments on synchronization (or not) for sinks & sinkMetadata

* Some cleanup for unit tests but they still need further work

* Shutdown & wait for persists and push on close

* Provide support for three existing batch appenderators using batchProcessingMode flag

* Fix reference to wrong appenderator

* Fix doc typos

* Add BatchAppenderators class test coverage

* Add log message to batchProcessingMode final value, fix typo in enum name

* Another typo and minor fix to log message

* LEGACY->OPEN_SEGMENTS, Edit docs

* Minor update legacy->open segments log message

* More code comments, mostly small adjustments to naming etc

* fix spelling

* Exclude BtachAppenderators from Jacoco since it is fully tested but Jacoco still refuses to ack coverage

* Coverage for Appenderators & BatchAppenderators, name change of a method that was still using "legacy" rather than "openSegments"

Co-authored-by: Clint Wylie <cjwylie@gmail.com>
2021-09-08 13:31:52 -07:00
Jihoon Son 7e90d00cc0
Configurable maxStreamLength for doubles sketches (#11574)
* Configurable maxStreamLength for doubles sketches

* fix equals/hashcode and it test failure

* fix test

* fix it test

* benchmark

* doc

* grouping key

* fix comment

* dependency check

* Update docs/development/extensions-core/datasketches-quantiles.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-08-31 14:56:37 -07:00
zhangyue19921010 6d14ea2d14
Dynamic auto scale Kinesis-Stream ingest tasks (#10985)
* ready to test

* revert misc.xml

* document kinesis md

* Update docs/development/extensions-core/kafka-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update docs/development/extensions-core/kinesis-ingestion.md

* Update kafka-ingestion.md

remove leading `

* Update kinesis-ingestion.md

add missing `

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-08-30 15:44:29 -07:00
Jihoon Son 2a658acad4
Put sleep in an extension (#11632)
* Put sleep in an extension

* dependency
2021-08-25 01:27:45 -07:00
Maytas Monsereenusorn b36242b404
Fix bug in Variance Buffer Aggregator resulting in intermittent NaN when druid.generic.useDefaultValueForNull=false (#11617)
* Fix bug in Variance Aggregator resulting in intermittent NaN when druid.generic.useDefaultValueForNull=false

* fix checkstyle

* address comments
2021-08-20 09:13:51 -07:00
Clint Wylie ec334a641b
MySQL extension with MariaDB connector docs (#11608)
* add docs for mariadb support via mysql extensions

* add logging so you know what druid knows

* homogenize

* spelling

* missed a couple
2021-08-19 01:52:26 -07:00
dependabot[bot] 776ddf76f4
Bump parquet.version from 1.11.1 to 1.12.0 (#11346)
* Bump parquet.version from 1.11.1 to 1.12.0

Bumps `parquet.version` from 1.11.1 to 1.12.0.

Updates `parquet-column` from 1.11.1 to 1.12.0
- [Release notes](https://github.com/apache/parquet-mr/releases)
- [Changelog](https://github.com/apache/parquet-mr/blob/master/CHANGES.md)
- [Commits](https://github.com/apache/parquet-mr/compare/apache-parquet-1.11.1...apache-parquet-1.12.0)

Updates `parquet-avro` from 1.11.1 to 1.12.0
- [Release notes](https://github.com/apache/parquet-mr/releases)
- [Changelog](https://github.com/apache/parquet-mr/blob/master/CHANGES.md)
- [Commits](https://github.com/apache/parquet-mr/compare/apache-parquet-1.11.1...apache-parquet-1.12.0)

Updates `parquet-hadoop` from 1.11.1 to 1.12.0
- [Release notes](https://github.com/apache/parquet-mr/releases)
- [Changelog](https://github.com/apache/parquet-mr/blob/master/CHANGES.md)
- [Commits](https://github.com/apache/parquet-mr/compare/apache-parquet-1.11.1...apache-parquet-1.12.0)

---
updated-dependencies:
- dependency-name: org.apache.parquet:parquet-column
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.parquet:parquet-avro
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: org.apache.parquet:parquet-hadoop
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update license

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Suneet Saldanha <suneet@apache.org>
2021-08-13 19:17:57 -07:00
Parag Jain c7b46671b3
option to use deep storage for storing shuffle data (#11507)
Fixes #11297.
Description

Description and design in the proposal #11297
Key changed/added classes in this PR

    *DataSegmentPusher
    *ShuffleClient
    *PartitionStat
    *PartitionLocation
    *IntermediaryDataManager
2021-08-13 16:40:25 -04:00
Kashif Faraz aaf0aaad8f
Enable routing of SQL queries at Router (#11566)
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.

This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.

To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
2021-08-13 18:44:39 +05:30
Rohan Garg 2004a94675
Cleanup test dependencies in hdfs-storage extension (#11563)
* Cleanup test dependencies in hdfs-storage extension

* Fix working directory in LocalFileSystem in indexing-hadoop test
2021-08-10 07:52:32 -07:00
Suneet Saldanha 361bfdcaa5
Better logging for lookups (#11539)
* Better logging for lookups

The default pollPeriod of 0 means that lookups are loaded once only at startup

Add a warning message to warn operators about this. I suspect that most
operators using jdbc or uri would expect eventual consistency with the source
of the lookups if using jdbc or uri. So make this a warning to make it easier
to debug if an operator notices a data inconsistency issue.

* oops
2021-08-04 16:44:54 -07:00
Yi Yuan aa7cb50f24
Add DynamicConfigProvider for Schema Registry (#11362)
* add_DynamicConfigProvider_for_schema_registry

* bug fixed

* add document

* fix document

* fix spot bug

* fix document

* inject ObjectMapper

* add DynamicConfigProviderUtils

* add UT

* bug fixed

Co-authored-by: yuanyi <yuanyi@freewheel.tv>
2021-08-03 13:24:52 -07:00
Agustin Gonzalez a2da407b70
Add error msg to parallel task's TaskStatus (#11486)
* Add error msg to parallel task's TaskStatus

* Consolidate failure block

* Add failure test

* Make it fail

* Add fail while stopped

* Simplify hash task test using a runner that fails after so many runs (parameter)

* Remove unthrown exception

* Use runner names to identify phase

* Added range partition kill test & fixed a timing bug with the custom runner

* Forbidden api

* Style

* Unit test code cleanup

* Added message to invalid state exception and improved readability  of the phase error messages for the parallel task failure unit tests
2021-08-02 12:11:28 -07:00
dependabot[bot] cf674c833c
Bump maven-resources-plugin from 3.1.0 to 3.2.0 (#11525)
Bumps [maven-resources-plugin](https://github.com/apache/maven-resources-plugin) from 3.1.0 to 3.2.0.
- [Release notes](https://github.com/apache/maven-resources-plugin/releases)
- [Commits](https://github.com/apache/maven-resources-plugin/compare/maven-resources-plugin-3.1.0...maven-resources-plugin-3.2.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-resources-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-08-02 09:38:34 -07:00
Dongjoon Hyun dbed4424b5
Upgrade ORC to 1.6.9 (#11518) 2021-07-31 23:33:03 -07:00
Xavier Léauté 4bca7f014e
update error-prone to 2.8.0 with fix for crashing check (#11494)
* error-prone 2.8.0 fixes https://github.com/google/error-prone/issues/2396
* fix for a few ignored return values
* fix unknown args in sub-modules
2021-07-29 09:13:46 -07:00
zachjsh a2538d264d
Add back missing unit test coverage in AvroFlattenerMakerTest (#11451)
* Add back missing unit test coverage in AvroFlattenerMakerTest

Adds back test coverage for Avro flattener that was mistakenly removed in https://github.com/apache/druid/pull/10505. Recfactored the tests a bit too.

* resolve checkstyle warnings
2021-07-20 18:27:00 -07:00
Abhishek Agarwal 94c1671eaf
Split SegmentLoader into SegmentLoader and SegmentCacheManager (#11466)
This PR splits current SegmentLoader into SegmentLoader and SegmentCacheManager.

SegmentLoader - this class is responsible for building the segment object but does not expose any methods for downloading, cache space management, etc. Default implementation delegates the download operations to SegmentCacheManager and only contains the logic for building segments once downloaded. . This class will be used in SegmentManager to construct Segment objects.

SegmentCacheManager - this class manages the segment cache on the local disk. It fetches the segment files to the local disk, can clean up the cache, and in the future, support reserve and release on cache space. [See https://github.com/Make SegmentLoader extensible and customizable #11398]. This class will be used in ingestion tasks such as compaction, re-indexing where segment files need to be downloaded locally.
2021-07-21 00:14:19 +05:30
Dongjoon Hyun 5037493e45
Bump commons-io to 2.11.0 (#11460)
* Bump commons-io to 2.11.0

* Address comments

* Remove try catch

* Fix checkstyle
2021-07-19 15:47:14 -07:00
Clint Wylie 2705fe98fa
Fix avro json serde issues (#11455) 2021-07-20 00:32:05 +08:00
frank chen 2236cf2234
eliminate extra object instantiation (#11345) 2021-07-12 18:31:39 -07:00
Sandeep 18b8ac5349
removes unnecessary checks (#11431)
* removes unnecessary checks

* removes unnecessary checks
2021-07-12 18:21:16 -07:00
Clint Wylie d0b4e55a6f
fix pom version reference (#11435) 2021-07-13 08:37:57 +08:00
Clint Wylie 63fcd77c38
support using mariadb connector with mysql extensions (#11402)
* support using mariadb connector with mysql extensions

* cleanup and more tests

* fix test

* javadocs, more tests, etc

* style and more test

* more test more better

* missing pom

* more pom
2021-07-08 12:25:37 -07:00
Joseph Glanville d5e8d4d680
Avro union support (#10505)
* Avro union support

* Document new union support

* Add support for AvroStreamInputFormat and fix checkstyle

* Extend multi-member union test schema and format

* Some additional docs and add Enums to spelling

* Rename explodeUnions -> extractUnions

* explode -> extract

* ByType

* Correct spelling error
2021-07-06 22:05:41 -07:00
Clint Wylie 17efa6f556
add single input string expression dimension vector selector and better expression planning (#11213)
* add single input string expression dimension vector selector and better expression planning

* better

* fixes

* oops

* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing

* oops

* javadocs, renaming

* more javadocs

* benchmarks

* use string expression vector processor with vector size 1 instead of expr.eval

* better logging

* javadocs, surprising number of the the

* more

* simplify
2021-07-06 11:20:49 -07:00