druid

Commit Graph

Author	SHA1	Message	Date
Dongjoon Hyun	79f86a0511	Upgrade ORC to 1.7.4 (#12572 ) This commit upgrades Apache ORC library from 1.7.2 to 1.7.4. Apache ORC 1.7.4 is the maintenance release with the following bug fixes. https://orc.apache.org/news/2022/04/15/ORC-1.7.4/ https://github.com/apache/orc/releases/tag/v1.7.4	2022-05-28 17:44:36 +05:30
Gian Merlino	4631cff2a9	Free ByteBuffers in tests and fix some bugs. (#12521 ) * Ensure ByteBuffers allocated in tests get freed. Many tests had problems where a direct ByteBuffer would be allocated and then not freed. This is bad because it causes flaky tests. To fix this: 1) Add ByteBufferUtils.allocateDirect(size), which returns a ResourceHolder. This makes it easy to free the direct buffer. Currently, it's only used in tests, because production code seems OK. 2) Update all usages of ByteBuffer.allocateDirect (off-heap) in tests either to ByteBuffer.allocate (on-heap, which are garbage collected), or to ByteBufferUtils.allocateDirect (wherever it seemed like there was a good reason for the buffer to be off-heap). Make sure to close all direct holders when done. * Changes based on CI results. * A different approach. * Roll back BitmapOperationTest stuff. * Try additional surefire memory. * Revert "Roll back BitmapOperationTest stuff." This reverts commit `49f846d9e3`. * Add TestBufferPool. * Revert Xmx change in tests. * Better behaved NestedQueryPushDownTest. Exit tests on OOME. * Fix TestBufferPool. * Remove T1C from ARM tests. * Somewhat safer. * Fix tests. * Fix style stuff. * Additional debugging. * Reset null / expr configs better. * ExpressionLambdaAggregatorFactory thread-safety. * Alter forkNode to try to get better info when a JVM crashes. * Fix buffer retention in ExpressionLambdaAggregatorFactory. * Remove unused import.	2022-05-19 07:42:29 -07:00
Kashif Faraz	7ab2170802	Use datasketches version 3.2.0 (#12509 ) Changes: - Use apache datasketches version 3.2.0. - Remove unsafe reflection-based usage of datasketch internals added in #12022	2022-05-13 11:28:15 +05:30
Lucas Capistrant	39e7191f03	Add authentication call before cleaning up intermediate files in hadoop ingestions (#12030 ) * Add authentication call before cleaning up intermediate files in hadoop ingestions * fix checkstyle * remove debug log	2022-05-02 08:40:44 -05:00
MC-JY	bb080693a9	Improve build performance of modules (#12486 ) * improve build performance of modules * improve build performance of modules * Update pom.xml * improve build performance of modules	2022-05-01 22:43:11 +08:00
Gian Merlino	529b983ad0	GroupBy: Reduce allocations by reusing entry and key holders. (#12474 ) * GroupBy: Reduce allocations by reusing entry and key holders. Two main changes: 1) Reuse Entry objects returned by various implementations of Grouper.iterator. 2) Reuse key objects contained within those Entry objects. This is allowed by the contract, which states that entries must be processed and immediately discarded. However, not all call sites respected this, so this patch also updates those call sites. One particularly sneaky way that the old code retained entries too long is due to Guava's MergingIterator and CombiningIterator. Internally, these both advance to the next value prior to returning the current value. So, this patch addresses that in two ways: 1) For merging, we have our own implementation MergeIterator already, although it had the same problem. So, this patch updates our implementation to return the current item prior to advancing to the next item. It also adds a forbidden-api entry to ensure that this safer implementation is used instead of Guava's. 2) For combining, we address the problem in a different way: by copying the key when creating the new, combined entry. * Attempt to fix test. * Remove unused import.	2022-04-28 23:21:13 -07:00
Abhishek Agarwal	2fe053c5cb	Bump up the versions (#12480 )	2022-04-27 14:28:20 +05:30
Jihoon Son	73ce5df22d	Add support for authorizing query context params (#12396 ) The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below. Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params. User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters. System context params. They are set by the Druid query engine during query processing. These params override other context params. Today, any context params are allowed to users. This can cause 1) a bad UX if the context param is not matured yet or 2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows. This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission. { "resourceAction" : { "resource" : { "name" : "maxSubqueryRows", "type" : "QUERY_CONTEXT" }, "action" : "WRITE" }, "resourceNamePattern" : "maxSubqueryRows" } Each role can have multiple permissions for context params. Each permission should be set for different context params. When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case, HTTP endpoints will return 403 response code. JDBC will throw ForbiddenException. Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService. The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.	2022-04-21 14:21:16 +05:30
PJ Fanning	341c65738d	issue-12426 upgrade k8s client due to cve (#12427 ) * issue-12426 upgrade k8s client due to cve * compile issues * try to fix license check	2022-04-21 10:11:55 +08:00
Rohan Garg	de9f12b5c6	Fail fast in case a lookup load fails (#12397 ) Currently, while loading a lookup for the first time, the loading thread blocks for `waitForFirstRunMs` in case the lookup failed to load. If the `waitForFirstRunMs` is long (like 10 minutes), such blocking can slow down the loading of other lookups. This commit allows the thread to progress as soon as the loading of the lookup fails.	2022-04-18 13:14:02 +05:30
AmatyaAvadhanula	067254b778	Package kinesis client jar within the extension (#12370 ) amazon-kinesis-client was not covered under the Apache license and required separate insertion in the kinesis extension. This can now be avoided since it is covered, and including it within druid helps prevent incompatibilities. Allows enabling of deaggregation out of the box by packaging amazon-kinesis-client (1.14.4) with druid for kinesis ingestion.	2022-04-04 21:31:18 +05:30
AmatyaAvadhanula	c5531be553	Add feature flag for Kinesis listShards API usage (#12383 ) listShards API was used to get all the shards for kinesis ingestion to improve its resiliency as part of #12161. However, this may require additional permissions in the IAM policy where the stream is present. (Please refer to: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_ListShards.html). A dynamic configuration useListShards has been added to KinesisSupervisorTuningConfig to control the usage of this API and prevent issues upon upgrade. It can be safely turned on (and is recommended when using kinesis ingestion) by setting this configuration to true.	2022-04-04 14:58:10 +05:30
Jihoon Son	b6eeef31e5	Store null columns in the segments (#12279 ) * Store null columns in the segments * fix test * remove NullNumericColumn and unused dependency * fix compile failure * use guava instead of apache commons * split new tests * unused imports * address comments	2022-03-23 16:54:04 -07:00
Kyle Larose	db91961af7	kubernetes: restart watch on null response (#12233 ) * kubernetes: restart watch on null response Kubernetes watches allow a client to efficiently process changes to resources. However, they have some idiosyncrasies. In particular, they can error out for various reasons leading to what would normally be seen as an invalid result. The Druid kubernetes node discovery subsystem does not handle a certain case properly. The watch can return an item with a null object. This leads to a null pointer exception. When this happens, the provider needs to restart the watch, because rerunning the watch from the same resource version leads to the same result: yet another null pointer exception. This commit changes the provider to handle null objects by restarting the watch. * review: add more coverage This adds a bit more coverage to the K8sDruidNodeDiscoveryProvider watch loop, and removes an unnecessary return. * kubernetes: reduce logging verbosity The log messages about items being NULL don't really deserve to be at a level other than DEBUG since they are not actionable, particularly since we automatically recover now. Move them to the DEBUG level.	2022-03-10 12:56:40 -08:00
Parag Jain	2efb74ff1e	fix supervisor auto scaler config serde bug (#12317 )	2022-03-09 16:17:12 -08:00
Abhishek Agarwal	6346b9561d	Reuse the InputEntityReader in SettableByteEntityReader (#12269 ) * Reuse the InputEntityReader in SettableByteEntityReader * Fix logic * Fix kafka streaming ingestion * Add Tests for kafka input format change * Address review comments	2022-03-09 14:38:31 -08:00
Clint Wylie	0600772cce	use a non-concurrent map for lookups-cached-global unless incremental updates are actually required (#12293 ) * use a non-concurrent map for lookups-cached-global unless incremental updates are actually required * adjustments * fix test	2022-03-08 21:54:25 -08:00
Gian Merlino	28f8bcce9b	Always reopen stream in FileUtils.copyLarge, RetryingInputStream. (#12307 ) * Always reopen stream in FileUtils.copyLarge, RetryingInputStream. When an InputStream throws an exception from one of its read methods, we should assume it's bad and reopen it. The main changes here are: - In FileUtils.copyLarge, replace InputStream with InputStreamSupplier. - In RetryingInputStream, collapse retryCondition and resetCondition into a single condition. Also, make it required, since every usage is passing in a specific condition anyway. * Test fixes. * Fix read impl.	2022-03-05 14:39:14 -08:00
Frank Chen	36bc41855d	Set Content-Type for String based response (#12295 )	2022-03-04 15:17:03 +08:00
Alexander Saydakov	50038d9344	latest datasketches-java-3.1.0 (#12224 ) These changes are to use the latest datasketches-java-3.1.0 and also to restore support for quantile and HLL4 sketches to be able to grow larger than a given buffer in a buffer aggregator and move to heap in rare cases. This was discussed in #11544. Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2022-03-01 17:14:42 -08:00
Laksh Singla	3f709db173	Make ParseExceptions more informative (#12259 ) This PR aims to make the ParseExceptions in Druid more informative by attaching metadata to the ParseException that describes where the error occurred. For example: the path of the file generating the issue, or the line number (where it can be easily fetched, as in CsvReader). The following changes are addressed in this PR: A new class CloseableIteratorWithMetadata has been created, which is like CloseableIterator but also has a metadata method that returns a context Map<String, Object> about the current element returned by next(). IntermediateRowParsingReader#read() now attaches the InputEntity and the "record number" which created the exception (while parsing them), and IntermediateRowParsingReader#sample attaches the InputEntity (but not the "record number"). TextReader (and its subclasses), which is a specific implementation of the IntermediateRowParsingReader, also includes the line number which caused the generation of the error. This will also help in triaging the issues when InputSourceReader generates ParseException because it can point to the specific InputEntity which caused the exception (while trying to read it).	2022-02-28 22:31:15 +05:30
Xavier Léauté	d105519558	Replace use of PowerMock with Mockito (#12282 ) Mockito now supports all our needs and plays much better with recent Java versions. Migrating to Mockito also simplifies running the kind of tests that required PowerMock in the past. * replace all uses of powermock with mockito-inline * upgrade mockito to 4.3.1 and fix use of deprecated methods * import mockito bom to align all our mockito dependencies * add powermock to forbidden-apis to avoid accidentally reintroducing it in the future	2022-02-27 22:47:09 -08:00
Xavier Léauté	4c61878f9c	Reduce use of mocking and simplify some tests (#12283 ) * remove use of mocks for ServiceMetricEvent * simplify KafkaEmitterTests by moving to Mockito * speed up KafkaEmitterTest by adjusting reporting frequency in tests * remove unnecessary easymock and JUnitParams dependencies	2022-02-26 17:23:09 -08:00
Jihoon Son	e5ad862665	A new includeAllDimension flag for dimensionsSpec (#12276 ) * includeAllDimensions in dimensionsSpec * doc * address comments * unused import and doc spelling	2022-02-25 18:27:48 -08:00
Karan Kumar	b86f2d4c2e	Performance fixes in proto readers (#12267 )	2022-02-24 23:21:48 +05:30
Xavier Léauté	009dd9e09a	upgrade core Apache Kafka dependencies to 3.1.0 (#12203 ) Announcement: https://blogs.apache.org/kafka/entry/what-s-new-in-apache7 Release notes: https://dist.apache.org/repos/dist/release/kafka/3.1.0/RELEASE_NOTES.html * upgrade core Apache Kafka dependencies to 3.1.0 * fix use of private Kafka APIs * remove deprecated test rules * remove mock calls that weren't verified in the first place * remove the need for powermock in KafkaLookupExtractorFactoryTest * align curator-test version with curator itself * update easymock to 4.3.0	2022-02-23 18:42:51 -08:00
Karan Kumar	b94390ba33	Adding Shared Access resource support for azure (#12266 ) Azure Blob storage has multiple modes of authentication. One of them is Shared access resource. This is very useful in cases when we do not want to add the account key in the druid properties.	2022-02-22 18:27:43 +05:30
AmatyaAvadhanula	1ec57cb935	Improve kinesis task assignment after resharding (#12235 ) Problem: - When a kinesis stream is resharded, the original shards are closed. Any intermediate shard created in the process is eventually closed as well. - If a shard is closed before any record is put into it, it can be safely ignored for ingestion. - It is expensive to determine if a closed shard is empty, since it requires a call to the Kinesis cluster. Changes: - Maintain a cache of closed empty and closed non-empty shards in `KinesisSupervisor` - Add config `skipIgnorableShards` to `KinesisSupervisorTuningConfig` - The caches are used and updated only when `skipIgnorableShards = true`	2022-02-18 12:37:06 +05:30
William Hyun	34bc361953	Update ORC to 1.7.2 (#12084 )	2022-02-15 10:04:12 -08:00
Clint Wylie	3ee66bb492	allow optimizing sql expressions and virtual columns (#12241 ) * rework sql planner expression and virtual column handling * simplify a bit * add back and deprecate old methods, more tests, fix multi-value string coercion bug and associated tests * spotbugs * fix bugs with multi-value string array expression handling * javadocs and adjust test * better * fix tests	2022-02-09 14:55:50 -08:00
Jihoon Son	ab3d994a17	Lazy instantiation for segmentKillers, segmentMovers, and segmentArchivers (#12207 ) * working * Lazily load segmentKillers, segmentMovers, and segmentArchivers * more tests * test-jar plugin * more coverage * lazy client * clean up changes * checkstyle * i did not change the branch condition * adjust failure rate to run tests faster * javadocs * checkstyle	2022-02-08 13:02:06 -08:00
Gian Merlino	de82c611de	Harmonize implementations of "visit" for Exprs from ExprMacros. (#12230 ) * Harmonize implementations of "visit" for Exprs from ExprMacros. Many of them had bugs where they would not visit all of the original arguments. I don't think this has user-visible consequences right now, but it's possible it would in a future world where "visit" is used for more stuff than it is today. So, this patch updates all implementations to a more consistent style that emphasizes reapplying the macro to the shuttled args. * Test fixes, test coverage, PR review comments.	2022-02-04 08:08:54 -08:00
Kashif Faraz	e648b01afb	Improve memory estimates in Aggregator and DimensionIndexer (#12073 ) Fixes #12022 ### Description The current implementations of memory estimation in `OnHeapIncrementalIndex` and `StringDimensionIndexer` tend to over-estimate which leads to more persistence cycles than necessary. This PR replaces the max estimation mechanism with getting the incremental memory used by the aggregator or indexer at each invocation of `aggregate` or `encode` respectively. ### Changes - Add new flag `useMaxMemoryEstimates` in the task context. This overrides the same flag in DefaultTaskConfig i.e. `druid.indexer.task.default.context` map - Add method `AggregatorFactory.factorizeWithSize()` that returns an `AggregatorAndSize` which contains the aggregator instance and the estimated initial size of the aggregator - Add method `Aggregator.aggregateWithSize()` which returns the incremental memory used by this aggregation step - Update the method `DimensionIndexer.processRowValsToKeyComponent()` to return the encoded key component as well as its effective size in bytes - Update `OnHeapIncrementalIndex` to use the new estimations only if `useMaxMemoryEstimates = false`	2022-02-03 10:34:02 +05:30
Clint Wylie	978b8f7dde	do not explode if mysql transient exception class does not exist (#12213 ) Follow up to #12205 to allow druid-mysql-extensions to work with mysql connector/j 8.x again, which does not contain MySQLTransientException. While it would have had the same problem as mariadb if a transient exception was checked, the new check eagerly loads the class when starting up, causing immediate failure.	2022-02-01 09:06:24 +05:30
Clint Wylie	5d2291991e	use reflection to check for mysql transient exception type (#12205 ) * use reflection to check for mysql transient exception type * better * oops	2022-01-27 13:13:16 -08:00
AmatyaAvadhanula	1f63b447c4	Mitigate Kinesis stream LimitExceededException by using listShards API (#12161 ) Makes kinesis ingestion resilient to `LimitExceededException` caused by resharding. Replace `describeStream` with `listShards` (recommended) to get shard related info. `describeStream` has a limit (100) to the number of shards returned per call and a low default TPS limit of 10. `listShards` returns the info for at most 1000 shards and has a higher TPS limit of 100 as well. Key changed/added classes in this PR * `KinesisRecordSupplier` * `KinesisAdminClient`	2022-01-21 10:15:51 +05:30
Abhishek Agarwal	53c0e489c2	Fix infinite retrying during task pausing (#12167 ) This fixes a bug that causes TaskClient in overlord to continuously retry to pause tasks. This can happen when a task is not responding to the pause command. Ideally, in such a case when the task is unresponsive, the overlord would have given up after a few retries and would have killed the task. However, due to this bug, retries go on forever.	2022-01-19 09:03:36 +05:30
Jonathan Wei	74c876e578	Throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder (#12080 ) * Add option to throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder * Remove option	2022-01-13 12:36:51 -06:00
imply-cheddar	b153cb2342	Add a small LRU cache and use utf8 bytes in ArrayOfDoubles (#12130 ) * Add a small LRU cache and use utf8 bytes in ArrayOfDoubles * Add tests for extra branches * Even more tests for branch coverage * Fix Style	2022-01-11 13:04:11 -08:00
somu-imply	08fea7a46a	input type validation for datasketches hll "build" aggregator factory (#12131 ) * Ingestion will fail for HLLSketchBuild instead of creating with incorrect values * Addressing review comments for HLL, updated error message, introduced test case	2022-01-11 12:00:14 -08:00
Suneet Saldanha	25ac04e067	MySqlFirehoseDatabaseConnector uses configured driver class name (#12049 )	2021-12-09 20:58:55 -08:00
Frank Chen	58245b4617	Support JsonPath functions in JsonPath expressions (#11722 ) * Add jsonPath functions support * Add jsonPath function test for Avro * Add jsonPath function length() to Orc * Add jsonPath function length() to Parquet * Add more tests to ORC format * update doc * Fix exception during ingestion * Add IT test case * Revert "Fix exception during ingestion" This reverts commit `5a5484b9ea`. * update IT test case * Add 'keys()' * Commit IT test case * Fix UT	2021-12-10 10:53:23 +08:00
Jonathan Wei	229f82a6f0	Add parse error list API for stream supervisors, use structured object for parse exceptions, simplify parse exception message (#11961 ) * Add parse error list API for stream supervisors, simplify parse exception message * Add input string to parse exception * Use structured ParseExceptionReport * Fix tests * Add test * PR comments, add ParseExceptionReport equals verifier * Fix test	2021-12-09 15:42:55 -06:00
zachjsh	65cadbe42a	Fix bad lookup config fails task (#12021 ) This PR fixes an issue in which, if a lookup is configured incorrectly and does not serialize properly when being pulled by a peon node, it causes the task to fail. The failure occurs because the peon and other leaf nodes (broker, historical) have retry logic that continues to retry the lookup loading for 3 minutes by default. The http listener thread on the peon task is not started until lookup loading completes. By default, the overlord waits 1 minute to communicate with the peon task to get the task status, after which it orders the task to shut down, causing the ingestion task to fail. To fix the issue, we catch the serialization exception and do not retry. Also fixed an issue in which a bad lookup config prevents any other good lookup configs from being loaded.	2021-12-07 00:55:34 -05:00
Gian Merlino	e0e05aad99	Enhancements to IndexTaskClient. (#12011 ) * Enhancements to IndexTaskClient. 1) Ability to use handlers other than StringFullResponseHandler. This functionality is not used in production code yet, but is useful because it will allow tasks to communicate with each other in non-string-based formats and in streaming fashion. In the future, we'll be able to use this to make task-to-task communication more efficient. 2) Truncate server errors at 1KB, so long errors do not pollute logs. 3) Change error log level for retryable errors from WARN to INFO. (The final error is still WARN.) 4) Harmonize log and exception messages to have a more consistent format. * Additional tests and improvements.	2021-12-03 09:14:32 -08:00
Frank Chen	c2cea25a6b	Improve exception message when loading data from web-console (#11723 ) * Improve exception handling * Revert some changes * Resolve comments * Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/sampler/SamplerExceptionMapper.java Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/sampler/SamplerExceptionMapper.java Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * Address review comments Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2021-12-03 21:33:49 +08:00
Abhishek Agarwal	503384569a	Fix classNotFoundException when connecting to secure LDAP (#11978 ) This PR fixes a problem where the com.sun.jndi.ldap.Connection tries to build BasicSecuritySSLSocketFactory when calling LDAPCredentialsValidator.validateCredentials since BasicSecuritySSLSocketFactory is in extension class loader and not visible to system classloader.	2021-12-03 12:08:19 +05:30
Clint Wylie	84b4bf56d8	vectorize logical operators and boolean functions (#11184 ) changes: * adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions * vectorize logical operators and boolean functions, some only if useStrictBooleans is true	2021-12-02 16:40:23 -08:00
Paul Rogers	a66f10eea1	Code cleanup from query profile project (#11822 ) * Code cleanup from query profile project * Fix spelling errors * Fix Javadoc formatting * Abstract out repeated test code * Reuse constants in place of some string literals * Fix up some parameterized types * Reduce warnings reported by Eclipse * Reverted change due to lack of tests	2021-11-30 11:35:38 -08:00
Gian Merlino	93aeaf4801	Improve on-heap aggregator footprint estimates. (#11950 ) Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that mitigates #6743 by enabling heap footprint estimates based on a specific number of rows. The idea is that at ingestion time, the number of rows that go into an aggregator will be 1 (if rollup is off) or will likely be a small number (if rollup is on). It's a heuristic, because of course nothing guarantees that the rollup ratio is a small number. But it's a common case, and I expect this logic to go wrong much less often than the current logic. Also, when it does go wrong, users can fix it by lowering maxRowsInMemory or maxBytesInMemory. The current situation is unintuitive: when the estimation goes wrong, users get an OOME, but actually they need to raise these limits to fix it.	2021-11-28 13:21:24 +05:30
Agustin Gonzalez	8eff6334f7	AWS "Data read has a different length than the expected" error should reset stream and try again (#11941 ) * Add support for custom reset condition & support for other args to have defaults to make the method api consistent * Add support for custom reset condition to InputEntity * Fix test names * Clarifying comments to why we need to read the message's content to identify S3's resettable exception * Add unit test to verify custom resettable condition for S3Entity * Provide a way to customize retries since they are expensive to test	2021-11-26 12:45:34 -07:00
Gian Merlino	cb0a2af644	TestKafkaExtractionCluster: Shut down Kafka, ZK in @After. (#11963 )	2021-11-20 15:17:05 -08:00
Clint Wylie	f260bbed23	restore and deprecate AggregatorFactory methods (#11917 ) * add back and deprecate aggregator factory methods so i can say i told you so when i delete these later * rename to make less ambiguous, fix fill method * adjust	2021-11-19 15:59:35 -08:00
William Hyun	3abca73ee8	Upgrade ORC to 1.7.1 (#11919 )	2021-11-15 09:13:03 -08:00
Atul Mohan	f9941c12c3	Reduce list operation calls when pulling segments from S3 (#11899 ) * Lazy lists * Fix objectsummary init	2021-11-10 19:13:46 -08:00
Clint Wylie	a8805ab60d	add missing json type for ListFilteredVirtualColumn (#11887 ) * add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again * fixes * ugly, but maybe this * oops * too many mappers	2021-11-09 17:25:12 -08:00
Gian Merlino	babf00f8e3	Migrate File.mkdirs to FileUtils.mkdirp. (#11879 ) * Migrate File.mkdirs to FileUtils.mkdirp. * Remove unused imports. * Fix LookupReferencesManager. * Simplify. * Also migrate usages of forceMkdir. * Fix var name. * Fix incorrect call. * Update test.	2021-11-09 11:10:49 -08:00
Clint Wylie	7237dc837c	complex typed expressions (#11853 ) * complex typed expressions * add built-in hll collector expressions to get coverage on druid-processing, more types, more better * rampage!!! * more javadoc * adjustments * oops * lol * remove unused dependency * contradiction? * more test	2021-11-08 00:33:06 -08:00
zachjsh	1d6df48145	Warn if cache size of lookup is beyond max size (#11863 ) Enhanced the ExtractionNamespace interface in lookups-cached-global core extension with the ability to set a maxHeapPercentage for the cache of the respective namespace. The reason for adding this functionality is to make it easier to detect when a lookup table grows to a size that the underlying service cannot handle, because it does not have enough memory. The default value of maxHeap for the interface is -1, which indicates that no maxHeapPercentage has been set. For the JdbcExtractionNamespace and UriExtractionNamespace implementations, the default value is null, which will cause the respective service that the lookup is loaded in to warn when its cache is beyond maxHeapPercentage of the service's configured max heap size. If a positive non-null value is set for the namespace's maxHeapPercentage config, this value will be honored for all services that the respective lookup is loaded onto, and they will consequently log warning messages when the cache of the respective lookup grows beyond this percentage of the service's configured max heap size. Warnings are logged every time that either Uri based or Jdbc based lookups are regenerated, if the maxHeapPercentage constraint is violated. No other implementations will log warnings at this time. No error is thrown when the size exceeds the maxHeapPercentage at this time, as doing so could break functionality for existing users. Previously the JdbcCacheGenerator generated its cache by materializing all rows of the underlying table in memory at once; this made it difficult to log warning messages in the case that the results from the jdbc query were very large and caused the service to run out of memory. To help with this, this PR makes it so that the jdbc query results are instead streamed through an iterator.	2021-11-03 21:32:22 -04:00
Karan Kumar	90640bb316	Support for hadoop 3 via maven profiles (#11794 ) Add support for hadoop 3 profiles. Most of the details are captured in #11791. We use a combination of maven profiles and resource filtering to achieve this. Hadoop2 is supported by default and a new maven profile with the name hadoop3 is created. This will allow the user to choose the profile which is best suited for the use case.	2021-10-30 22:46:24 +05:30
Gian Merlino	fc95c92806	Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124 ) * Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. This patch does the following: - Removes OffheapIncrementalIndex. - Clarifies that Aggregators are required to be thread safe. - Clarifies that BufferAggregators and VectorAggregators are not required to be thread safe. - Removes thread safety code from some DataSketches aggregators that had it. (Not all of them did, and that's OK, because it wasn't necessary anyway.) - Makes enabling "useOffheap" with groupBy v1 an error. Rationale for removing the offheap incremental index: - It is only used in one rare scenario: groupBy v1 (which is non-default) in "useOffheap" mode (also non-default). So you have to go pretty deep into the wilderness to get this code to activate in production. It is never used during ingestion. - Its existence complicates developer efforts to reason about how aggregators get used, because the way it uses buffer aggregators is so different from how every other query engine uses them. - It doesn't have meaningful testing. By the way, I do believe that, given the way the offheap incremental index works, it actually didn't require buffer aggregators to be thread-safe. It synchronizes on "aggregate" and doesn't call "get" until it has stopped calling "aggregate". Nevertheless, this is a bother to think about, and for the above reasons I think it makes sense to remove the code anyway. * Remove things that are now unused. * Revert removal of getFloat, getLong, getDouble from BufferAggregator. * OAK-related warnings, suppressions. * Unused item suppressions.	2021-10-26 08:05:56 -07:00
Gian Merlino	8276c031c5	Add druid.sql.approxCountDistinct.function property. (#11181 ) * Add druid.sql.approxCountDistinct.function property. The new property allows admins to configure the implementation for APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode. The motivation for adding this setting is to enable site admins to switch the default HLL implementation to DataSketches. For example, an admin can set: druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL * Fixes * Fix tests. * Remove erroneous cannotVectorize. * Remove unused import. * Remove unused test imports.	2021-10-25 12:16:21 -07:00
Gian Merlino	d4cace385f	SQL: Allow Scans to be used as outer queries. (#11831 ) * SQL: Allow Scans to be used as outer queries. This has been possible in the native query system for a while, but the capability hasn't yet propagated into the SQL layer. One example of where this is useful is a query like: SELECT * FROM (... LIMIT X) WHERE <filter> Because this expands the kinds of subquery structures the SQL layer will consider, it was also necessary to improve the cost calculations. These changes appear in PartialDruidQuery and DruidOuterQueryRel. The ideas are: - Attach per-column penalties to the output signature of each query, instead of to the initial projection that starts a query. This encourages moving projections into subqueries instead of leaving them on outer queries. - Only attach penalties to projections if there are actually expressions happening. So, now, projections that simply reorder or remove fields are free. - Attach a constant penalty to every outer query. This discourages creating them when they are not needed. The changes are generally beneficial to the test cases we have in CalciteQueryTest. Most plans are unchanged, or are changed in purely cosmetic ways. Two have changed for the better: - testUsingSubqueryWithLimit now returns a constant from the subquery, instead of returning every column. - testJoinOuterGroupByAndSubqueryHasLimit returns a minimal set of columns from the innermost subquery; two unnecessary columns are no longer there. * Fix various DS operator conversions. These were all implemented as direct conversions, which isn't appropriate because they do not actually map onto native functions. These are only usable as post-aggregations. * Test case adjustment.	2021-10-23 17:18:43 -07:00
Gian Merlino	98ecbb21cd	Remove CloseQuietly and migrate its usages to other methods. (#10247 ) * Remove CloseQuietly and migrate its usages to other methods. These other methods include: 1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions in RuntimeExceptions for callers that just want to avoid dealing with checked exceptions. Most usages were migrated to this method, because it looks like they were mainly attempts to avoid declaring a throws clause, and perhaps were unintentionally suppressing IOExceptions. 2) New method CloseableUtils.closeInCatch, designed to properly close something in a catch block without losing exceptions. Some usages from catch blocks were migrated here, when it seemed that they were intended to avoid checked exception handling, and did not really intend to also suppress IOExceptions. 3) New method CloseableUtils.closeAndSuppressExceptions, which sends all exceptions to a "chomper" that consumes them. Nothing is thrown or returned. The behavior is slightly different: with this method, _all_ exceptions are suppressed, not just IOExceptions. Calls that seemed like they had good reason to suppress exceptions were migrated here. 4) Some calls were migrated to try-with-resources, in cases where it appeared that CloseQuietly was being used to avoid throwing an exception in a finally block. 🎵 You don't have to go home, but you can't stay here... 🎵 * Remove unused import. * Fix up various issues. * Adjustments to tests. * Fix null handling. * Additional test. * Adjustments from review. * Fixup style stuff. * Fix NPE caused by holder starting out null. * Fix spelling. * Chomp Throwables too.	2021-10-23 17:03:21 -07:00
Gian Merlino	b7a4c79314	Null handling fixes for DS HLL and Theta sketches. (#11830 ) * Null handling fixes for DS HLL and Theta sketches. For HLL, this fixes an NPE when processing a null in a multi-value dimension. For both, empty strings are now properly treated as nulls (and ignored) in replace-with-default mode. Behavior in SQL-compatible mode is unchanged. * Fix expectation.	2021-10-22 19:09:00 -07:00
Clint Wylie	741b4ed516	add output type information to ExpressionPostAggregator (#11818 ) * add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types * add test for coverage * simplify * Remove unused imports. Co-authored-by: Gian Merlino <gian@imply.io>	2021-10-22 13:52:51 -07:00
Alexander Saydakov	8cf1cbc4a9	latest datasketches-java and datasketches-memory (#11773 ) * latest datasketches-java and datasketches-memory * updated versions of datasketches-java and datasketches-memory Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2021-10-19 23:42:30 -07:00
Clint Wylie	187df58e30	better types (#11713 ) * better type system * needle in a haystack * ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support * fixup merge * more test * fixup * intern * fix * oops * oops again * ... * more test coverage * fix error message * adjust interning, more javadocs * oops * more docs more better	2021-10-19 01:47:25 -07:00
lokesh-lingarajan	ad6609a606	Kafka Input Format for headers, key and payload parsing (#11630 ) ### Description Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events are another class of data that provide useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside the Kafka payload, but there is nonetheless some very useful metadata carried in Kafka headers that can serve as useful dimensions for aggregation, in turn bringing better insights. PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats. We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format (provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the Kafka record. We also need a way to expose the data present in the key as Druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into Druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload. This PR is designed to solve the above by providing a wrapper around any existing input format and merging the data into a single unified Druid row. 
Let's look at a sample input format from the above discussion "inputFormat": { "type": "kafka", // New input format type "headerLabelPrefix": "kafka.header.", // Label prefix for header columns, this will avoid collisions while merging columns "recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made available in case payload does not carry timestamp "headerFormat": // Header parser specifying that values are of type string { "type": "string" }, "valueFormat": // Value parser from json parsing { "type": "json", "flattenSpec": { "useFieldDiscovery": true, "fields": [...] } }, "keyFormat": // Key parser also from json parsing { "type": "json" } } Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, e.g., headers coming in as string and payload as json. KafkaInputFormat will be the uber class extending the inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blending the data, resolving conflicts in columns and generating a single unified InputRow for Druid ingestion. "headerFormat" will allow users to plug in a parser type for the header values and will add the default header prefix "kafka.header." (can be overridden) to attributes to avoid collisions while merging attributes with the payload. The Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plug in an existing parser. One thing to note here is that if batching is performed, then the code augments every record in the batch with the header and key values. The Kafka key parser will handle parsing the Key portion of the Kafka record and will ingest the Key with dimension name "kafka.key". ## KafkaInputFormat Class: This is the class that orchestrates sending the consumerRecord to each parser, retrieving rows, and merging the columns into one final row for Druid consumption. 
KafkaInputFormat should make sure to release the resources that get allocated as part of the reader in CloseableIterator<InputRow> during normal and exception cases. During conflicts in dimension/metric names, the code will prefer dimension names from the payload and ignore the dimension from headers/key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.	2021-10-07 08:56:27 -07:00
Xavier Léauté	bc3b038712	Update Apache Kafka client libraries to 3.0.0 (#11735 ) Release notes: https://downloads.apache.org/kafka/3.0.0/RELEASE_NOTES.html https://blogs.apache.org/kafka/entry/what-s-new-in-apache6	2021-10-05 10:23:19 -07:00
William Hyun	9bff6bd70e	Upgrade ORC to 1.7.0 (#11726 ) * Upgrade ORC to 1.7.0 * address comments * address comments * Add import	2021-09-27 13:20:09 -07:00
Clint Wylie	392f0ca1b5	refactor sql authorization to get resource type from schema, resource type to be string (#11692 ) * refactor sql authorization to get resource type from schema, refactor resource type from enum to string * information schema auth filtering adjustments * refactor * minor stuff * Update SqlResourceCollectorShuttle.java	2021-09-17 09:53:25 -07:00
Kashif Faraz	757720fae5	Suppress stacktrace of InterruptedException in CommonCacheNotifier (#11715 ) When CommonCacheNotifier is being stopped while the thread is waiting on updateQueue.take(), an InterruptedException is thrown. The stack trace from this exception gives the wrong idea that something went wrong with the shutdown.	2021-09-16 22:27:08 +05:30
Clint Wylie	fe1d8c206a	bump version to 0.23.0-SNAPSHOT (#11670 )	2021-09-08 15:56:04 -07:00
Agustin Gonzalez	9efa6cc9c8	Make persists concurrent with adding rows in batch ingestion (#11536 ) * Make persists concurrent with ingestion * Remove semaphore but keep concurrent persists (with add) and add push in the background as well * Go back to documented default persists (zero) * Move to debug * Remove unnecessary Atomics * Comments on synchronization (or not) for sinks & sinkMetadata * Some cleanup for unit tests but they still need further work * Shutdown & wait for persists and push on close * Provide support for three existing batch appenderators using batchProcessingMode flag * Fix reference to wrong appenderator * Fix doc typos * Add BatchAppenderators class test coverage * Add log message to batchProcessingMode final value, fix typo in enum name * Another typo and minor fix to log message * LEGACY->OPEN_SEGMENTS, Edit docs * Minor update legacy->open segments log message * More code comments, mostly small adjustments to naming etc * fix spelling * Exclude BatchAppenderators from Jacoco since it is fully tested but Jacoco still refuses to ack coverage * Coverage for Appenderators & BatchAppenderators, name change of a method that was still using "legacy" rather than "openSegments" Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2021-09-08 13:31:52 -07:00
Jihoon Son	7e90d00cc0	Configurable maxStreamLength for doubles sketches (#11574 ) * Configurable maxStreamLength for doubles sketches * fix equals/hashcode and it test failure * fix test * fix it test * benchmark * doc * grouping key * fix comment * dependency check * Update docs/development/extensions-core/datasketches-quantiles.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2021-08-31 14:56:37 -07:00
zhangyue19921010	6d14ea2d14	Dynamic auto scale Kinesis-Stream ingest tasks (#10985 ) * ready to test * revert misc.xml * document kinesis md * Update docs/development/extensions-core/kafka-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update docs/development/extensions-core/kinesis-ingestion.md * Update kafka-ingestion.md remove leading ` * Update kinesis-ingestion.md add missing ` Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2021-08-30 15:44:29 -07:00
Jihoon Son	2a658acad4	Put sleep in an extension (#11632 ) * Put sleep in an extension * dependency	2021-08-25 01:27:45 -07:00
Maytas Monsereenusorn	b36242b404	Fix bug in Variance Buffer Aggregator resulting in intermittent NaN when druid.generic.useDefaultValueForNull=false (#11617 ) * Fix bug in Variance Aggregator resulting in intermittent NaN when druid.generic.useDefaultValueForNull=false * fix checkstyle * address comments	2021-08-20 09:13:51 -07:00
Clint Wylie	ec334a641b	MySQL extension with MariaDB connector docs (#11608 ) * add docs for mariadb support via mysql extensions * add logging so you know what druid knows * homogenize * spelling * missed a couple	2021-08-19 01:52:26 -07:00
dependabot[bot]	776ddf76f4	Bump parquet.version from 1.11.1 to 1.12.0 (#11346 ) * Bump parquet.version from 1.11.1 to 1.12.0 Bumps `parquet.version` from 1.11.1 to 1.12.0. Updates `parquet-column` from 1.11.1 to 1.12.0 - [Release notes](https://github.com/apache/parquet-mr/releases) - [Changelog](https://github.com/apache/parquet-mr/blob/master/CHANGES.md) - [Commits](https://github.com/apache/parquet-mr/compare/apache-parquet-1.11.1...apache-parquet-1.12.0) Updates `parquet-avro` from 1.11.1 to 1.12.0 - [Release notes](https://github.com/apache/parquet-mr/releases) - [Changelog](https://github.com/apache/parquet-mr/blob/master/CHANGES.md) - [Commits](https://github.com/apache/parquet-mr/compare/apache-parquet-1.11.1...apache-parquet-1.12.0) Updates `parquet-hadoop` from 1.11.1 to 1.12.0 - [Release notes](https://github.com/apache/parquet-mr/releases) - [Changelog](https://github.com/apache/parquet-mr/blob/master/CHANGES.md) - [Commits](https://github.com/apache/parquet-mr/compare/apache-parquet-1.11.1...apache-parquet-1.12.0) --- updated-dependencies: - dependency-name: org.apache.parquet:parquet-column dependency-type: direct:production update-type: version-update:semver-minor - dependency-name: org.apache.parquet:parquet-avro dependency-type: direct:production update-type: version-update:semver-minor - dependency-name: org.apache.parquet:parquet-hadoop dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update license Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Suneet Saldanha <suneet@apache.org>	2021-08-13 19:17:57 -07:00
Parag Jain	c7b46671b3	option to use deep storage for storing shuffle data (#11507 ) Fixes #11297. Description Description and design in the proposal #11297 Key changed/added classes in this PR DataSegmentPusher ShuffleClient PartitionStat PartitionLocation *IntermediaryDataManager	2021-08-13 16:40:25 -04:00
Kashif Faraz	aaf0aaad8f	Enable routing of SQL queries at Router (#11566 ) This PR adds a new property druid.router.sql.enable which allows the Router to handle SQL queries when set to true. This change does not affect Avatica JDBC requests and they are still routed by hashing the Connection ID. To allow parsing of the request object as a SqlQuery (contained in module druid-sql), some classes have been moved from druid-server to druid-services with the same package name.	2021-08-13 18:44:39 +05:30
Rohan Garg	2004a94675	Cleanup test dependencies in hdfs-storage extension (#11563 ) * Cleanup test dependencies in hdfs-storage extension * Fix working directory in LocalFileSystem in indexing-hadoop test	2021-08-10 07:52:32 -07:00
Suneet Saldanha	361bfdcaa5	Better logging for lookups (#11539 ) * Better logging for lookups The default pollPeriod of 0 means that lookups are loaded only once, at startup. Add a warning message to alert operators about this. I suspect that most operators using jdbc or uri would expect eventual consistency with the source of the lookups. So make this a warning to make it easier to debug if an operator notices a data inconsistency issue. * oops	2021-08-04 16:44:54 -07:00
Yi Yuan	aa7cb50f24	Add DynamicConfigProvider for Schema Registry (#11362 ) * add_DynamicConfigProvider_for_schema_registry * bug fixed * add document * fix document * fix spot bug * fix document * inject ObjectMapper * add DynamicConfigProviderUtils * add UT * bug fixed Co-authored-by: yuanyi <yuanyi@freewheel.tv>	2021-08-03 13:24:52 -07:00
Agustin Gonzalez	a2da407b70	Add error msg to parallel task's TaskStatus (#11486 ) * Add error msg to parallel task's TaskStatus * Consolidate failure block * Add failure test * Make it fail * Add fail while stopped * Simplify hash task test using a runner that fails after so many runs (parameter) * Remove unthrown exception * Use runner names to identify phase * Added range partition kill test & fixed a timing bug with the custom runner * Forbidden api * Style * Unit test code cleanup * Added message to invalid state exception and improved readability of the phase error messages for the parallel task failure unit tests	2021-08-02 12:11:28 -07:00
dependabot[bot]	cf674c833c	Bump maven-resources-plugin from 3.1.0 to 3.2.0 (#11525 ) Bumps [maven-resources-plugin](https://github.com/apache/maven-resources-plugin) from 3.1.0 to 3.2.0. - [Release notes](https://github.com/apache/maven-resources-plugin/releases) - [Commits](https://github.com/apache/maven-resources-plugin/compare/maven-resources-plugin-3.1.0...maven-resources-plugin-3.2.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-resources-plugin dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2021-08-02 09:38:34 -07:00
Dongjoon Hyun	dbed4424b5	Upgrade ORC to 1.6.9 (#11518 )	2021-07-31 23:33:03 -07:00
Xavier Léauté	4bca7f014e	update error-prone to 2.8.0 with fix for crashing check (#11494 ) * error-prone 2.8.0 fixes https://github.com/google/error-prone/issues/2396 * fix for a few ignored return values * fix unknown args in sub-modules	2021-07-29 09:13:46 -07:00
zachjsh	a2538d264d	Add back missing unit test coverage in AvroFlattenerMakerTest (#11451 ) * Add back missing unit test coverage in AvroFlattenerMakerTest Adds back test coverage for Avro flattener that was mistakenly removed in https://github.com/apache/druid/pull/10505. Refactored the tests a bit too. * resolve checkstyle warnings	2021-07-20 18:27:00 -07:00
Abhishek Agarwal	94c1671eaf	Split SegmentLoader into SegmentLoader and SegmentCacheManager (#11466 ) This PR splits the current SegmentLoader into SegmentLoader and SegmentCacheManager. SegmentLoader - this class is responsible for building the segment object but does not expose any methods for downloading, cache space management, etc. The default implementation delegates the download operations to SegmentCacheManager and only contains the logic for building segments once downloaded. This class will be used in SegmentManager to construct Segment objects. SegmentCacheManager - this class manages the segment cache on the local disk. It fetches the segment files to the local disk, can clean up the cache, and in the future will support reserve and release on cache space. [See "Make SegmentLoader extensible and customizable", #11398]. This class will be used in ingestion tasks such as compaction and re-indexing, where segment files need to be downloaded locally.	2021-07-21 00:14:19 +05:30
Dongjoon Hyun	5037493e45	Bump commons-io to 2.11.0 (#11460 ) * Bump commons-io to 2.11.0 * Address comments * Remove try catch * Fix checkstyle	2021-07-19 15:47:14 -07:00
Clint Wylie	2705fe98fa	Fix avro json serde issues (#11455 )	2021-07-20 00:32:05 +08:00
frank chen	2236cf2234	eliminate extra object instantiation (#11345 )	2021-07-12 18:31:39 -07:00
Sandeep	18b8ac5349	removes unnecessary checks (#11431 ) * removes unnecessary checks * removes unnecessary checks	2021-07-12 18:21:16 -07:00
Clint Wylie	d0b4e55a6f	fix pom version reference (#11435 )	2021-07-13 08:37:57 +08:00
Clint Wylie	63fcd77c38	support using mariadb connector with mysql extensions (#11402 ) * support using mariadb connector with mysql extensions * cleanup and more tests * fix test * javadocs, more tests, etc * style and more test * more test more better * missing pom * more pom	2021-07-08 12:25:37 -07:00
Joseph Glanville	d5e8d4d680	Avro union support (#10505 ) * Avro union support * Document new union support * Add support for AvroStreamInputFormat and fix checkstyle * Extend multi-member union test schema and format * Some additional docs and add Enums to spelling * Rename explodeUnions -> extractUnions * explode -> extract * ByType * Correct spelling error	2021-07-06 22:05:41 -07:00
Clint Wylie	17efa6f556	add single input string expression dimension vector selector and better expression planning (#11213 ) * add single input string expression dimension vector selector and better expression planning * better * fixes * oops * rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing * oops * javadocs, renaming * more javadocs * benchmarks * use string expression vector processor with vector size 1 instead of expr.eval * better logging * javadocs, surprising number of the the * more * simplify	2021-07-06 11:20:49 -07:00

1004 Commits