druid

Commit Graph

Author	SHA1	Message	Date
Clint Wylie	84b4bf56d8	vectorize logical operators and boolean functions (#11184 ) changes: * adds new config, druid.expressions.useStrictBooleans which makes longs the official boolean type of all expressions * vectorize logical operators and boolean functions, some only if useStrictBooleans is true	2021-12-02 16:40:23 -08:00
Paul Rogers	a66f10eea1	Code cleanup from query profile project (#11822 ) * Code cleanup from query profile project * Fix spelling errors * Fix Javadoc formatting * Abstract out repeated test code * Reuse constants in place of some string literals * Fix up some parameterized types * Reduce warnings reported by Eclipse * Reverted change due to lack of tests	2021-11-30 11:35:38 -08:00
Gian Merlino	f6e6ca2893	Use intermediate-persist IndexSpec during multiphase merge. (#11940 ) * Use intermediate-persist IndexSpec during multiphase merge. The main change is the addition of an intermediate-persist IndexSpec to the main "merge" method in IndexMerger. There are also a few minor adjustments to the IndexMerger interface to encourage more harmonious usage of its methods in the future. * Additional changes inspired by the test coverage checker. - Remove unused-in-production IndexMerger methods "append" and "convert". - Add additional unit tests to UnifiedIndexerAppenderatorsManager. * Additional adjustments. * Even more additional adjustments. * Test fixes.	2021-11-29 15:08:49 -08:00
Gian Merlino	93aeaf4801	Improve on-heap aggregator footprint estimates. (#11950 ) Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that mitigates #6743 by enabling heap footprint estimates based on a specific number of rows. The idea is that at ingestion time, the number of rows that go into an aggregator will be 1 (if rollup is off) or will likely be a small number (if rollup is on). It's a heuristic, because of course nothing guarantees that the rollup ratio is a small number. But it's a common case, and I expect this logic to go wrong much less often than the current logic. Also, when it does go wrong, users can fix it by lowering maxRowsInMemory or maxBytesInMemory. The current situation is unintuitive: when the estimation goes wrong, users get an OOME, but actually they need to raise these limits to fix it.	2021-11-28 13:21:24 +05:30
Rohan Garg	2c08055962	Specify time column for first/last aggregators (#11949 ) Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.	2021-11-25 09:44:14 +05:30
Gian Merlino	12e2228510	RowBasedGrouperHelper: Set hasMultipleValues = false in capabilities. (#11954 ) Useful because it enables anything that consumes groupBy results to potentially operate more efficiently.	2021-11-24 13:14:58 -08:00
Gian Merlino	5e168b861a	StorageAdapter: Add getRowSignature method. (#11953 ) Simplifies logic for callers that only want to get a list of all the column names, or column names and types. Updated callers SegmentAnalyzer, HashJoinSegmentStorageAdapter, and DruidSegmentReader.	2021-11-24 13:14:25 -08:00
Gian Merlino	0354407655	SQL INSERT planner support. (#11959 ) * SQL INSERT planner support. The main changes are: 1) DruidPlanner is able to validate and authorize INSERT queries. They require WRITE permission on the target datasource. 2) QueryMaker is now an interface, and there is a QueryMakerFactory that creates instances of it. There is only one production implementation of each (NativeQueryMaker and NativeQueryMakerFactory), which together behave the same way as the former QueryMaker class. But this opens the door to executing queries in ways other than the Druid query stack, and is used by unit tests (CalciteInsertDmlTest) to test the INSERT planning functionality. 3) Adds an EXTERN table macro that allows references to external data using InputSource and InputFormat from Druid's batch ingestion API. This is not exposed in production yet, but is used by unit tests. 4) Adds a QueryFeature concept that enables the planner to change its behavior slightly depending on the capabilities of the execution system. 5) Adds an "AuthorizableOperator" concept that enables SqlOperators to require additional permissions. This is used by the EXTERN table macro. Related odds and ends: - Add equals, hashCode, toString methods to InlineInputSource. Aids in the "from external" tests in CalciteInsertDmlTest. - Add JSON-serializability to RowSignature. - Move the SQL string inside PlannerContext so it is "baked into" the planner when the planner is created. Cleans up the code a bit, since in practice, the same query is passed in every time to the same planner anyway. * Fix up calls to CalciteTests.createMockQueryLifecycleFactory. * Fix checkstyle issues. * Adjustments for CI. * Adjust DruidAvaticaHandlerTest for stricter test authorizations.	2021-11-24 12:14:04 -08:00
Gian Merlino	35b610ada7	QueryableIndexColumnSelectorFactory: Double-check cached column class. (#11957 ) Important because an earlier call to getCachedColumn may have been done with a different class, leading to a ClassCastException on the second call. In the prior code, this could happen if a complex column had makeDimensionSelector called on it after makeColumnValueSelector had already been called.	2021-11-22 11:31:24 -08:00
Gian Merlino	d6507c9428	PrioritizedExecutorService: Properly wrap on direct calls to "execute". (#11956 ) Usually, "execute" is called by methods defined in the superclass AbstractExecutorService, and the passed-in Runnable has been wrapped by newTaskFor inside a PrioritizedListenableFutureTask. But this method can also be called directly, and if so, the same wrapping is necessary for the delegate to get a Runnable that can be entered into a priority queue with the others.	2021-11-22 10:30:12 -08:00
Clint Wylie	f260bbed23	restore and deprecate AggregatorFactory methods (#11917 ) * add back and deprecate aggregator factory methods so i can say i told you so when i delete these later * rename to make less ambiguous, fix fill method * adjust	2021-11-19 15:59:35 -08:00
Gian Merlino	36ee0367ff	Scan: Add "orderBy" parameter. (#11930 ) * Scan: Add "orderBy" parameter. This patch adds an API for requesting non-time orderings, although it does not actually add the ability to execute such queries. The changes are done in such a way that no matter how Scan query objects are constructed, they will have a correct "getOrderBy". This will enable us to switch the execution to exclusively use "getOrderBy" later on when it's implemented. Scan queries are serialized such that they only include "order" (time order) if the ordering is time-based, and they only include "orderBy" if the ordering is non-time-based. This maximizes compatibility with the existing API while also providing a clean look for formatted queries. Because this patch does not include execution logic, if someone actually tries to run a query with non-time ordering, then they will get an error like "Cannot execute query with orderBy [quality ASC]". * SQL module fixes. * Add spotbugs-exclude. * Remove unused method.	2021-11-19 08:19:12 -08:00
Clint Wylie	7f0bede878	autocompaction support for complex dimensions (#11924 ) * autocompaction support for complex dimensions * more test	2021-11-16 15:57:44 -08:00
Clint Wylie	00c976a3fe	only get bitmap index for string dictionary encoded columns (#11925 )	2021-11-16 15:50:02 -08:00
Kashif Faraz	223c5692a8	Add dimension partitioningType to metrics to track usage of different partitioning schemes (#11902 ) * Add method ShardSpec.getType() to get name of shard spec type * List all names of shard spec types in the interface ShardSpec itself for easy reference and maintenance * Add dimension partitioningType to metric segment/added/bytes	2021-11-11 18:34:27 +05:30
Gian Merlino	fe2f7742f7	Fix incorrect comparison in RowSignature. (#11905 ) PR #11882 introduced a type comparison using ==, but while it was in flight, another PR #11713 changed the type enum to a class. So the comparison should properly be done with "equals".	2021-11-11 04:30:42 -08:00
Laksh Singla	57ed5127a7	Make subquery IDs more comprehensive (#11809 ) There are 3 types of query IDs - id, subQueryId, sqlQueryId. Currently, whenever a query generates subqueries, the subquery's subQueryId is populated randomly. Also, the subquery's id is not set to the parent query's id. Therefore there is no way of linking the subqueries to the parent query, and one loses the ability to look at an end-to-end view of the query. This PR aims to implement the following: * Populate each subquery with its parent's id (and sqlQueryId if present) * Populate the subQueryId such that the subqueries form a hierarchical relationship among themselves. For example, if there is a query which launches a subquery, which in turn launches a couple of subqueries, then the ids and subQueryIds should have following structure.	2021-11-11 16:31:56 +05:30
Clint Wylie	5baa22148e	revert ColumnAnalysis type, add typeSignature and use it for DruidSchema (#11895 ) * revert ColumnAnalysis type, add typeSignature and use it for DruidSchema * review stuffs * maybe null * better maybe null * Update docs/querying/segmentmetadataquery.md * Update docs/querying/segmentmetadataquery.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * fix null right * sad * oops * Update batch_hadoop_queries.json Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2021-11-10 18:46:29 -08:00
Gian Merlino	14b0b4aee2	RowBasedSegment: Use Sequence instead of Iterable. (#11886 ) * RowBasedSegment: Use Sequence instead of Iterable. The main reason this is good is that Sequences can include baggage that must be closed after iteration is finished. This enables creating RowBasedSegments on top of closeable sequences of rows. To preserve the optimization that allows reversing a List without copying it, this patch also makes SimpleSequence its own class and allows extracting the Iterable that was used to create it. * Fix tests.	2021-11-10 06:06:52 -08:00
Gian Merlino	db4d157be6	Add Finalization option to RowSignature.addAggregators. (#11882 ) * Add Finalization option to RowSignature.addAggregators. This makes type signatures more useful when the caller knows whether it will be reading aggregation results in their finalized or intermediate types. * Fix call site.	2021-11-10 06:05:29 -08:00
Clint Wylie	a8805ab60d	add missing json type for ListFilteredVirtualColumn (#11887 ) * add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again * fixes * ugly, but maybe this * oops * too many mappers	2021-11-09 17:25:12 -08:00
Gian Merlino	6c196a5ea2	Remove StorageAdapter.getColumnTypeName. (#11893 ) * Remove StorageAdapter.getColumnTypeName. It was only used by SegmentAnalyzer, and isn't necessary anymore due to the recent improvements to ColumnCapabilities. Also: tidy ColumnDescriptor.read slightly by removing an instanceof check, and moving the relevant logic into ComplexColumnPartSerde. * Fix spellings.	2021-11-09 15:18:07 -08:00
Gian Merlino	324d4374f6	HashJoinEngine: Fix extraneous advance of left cursor. (#11890 ) This could happen for right or full outer joins in certain cases. Tests weren't catching this because existing Cursor implementations generally ignore extraneous calls to "advance". So, to help catch this in tests, extra state validations are also added to RowWalker, which is used by RowBasedSegment.	2021-11-09 11:34:11 -08:00
Gian Merlino	babf00f8e3	Migrate File.mkdirs to FileUtils.mkdirp. (#11879 ) * Migrate File.mkdirs to FileUtils.mkdirp. * Remove unused imports. * Fix LookupReferencesManager. * Simplify. * Also migrate usages of forceMkdir. * Fix var name. * Fix incorrect call. * Update test.	2021-11-09 11:10:49 -08:00
Gian Merlino	945a341acd	RowBasedCursor: Add column-value-reuse optimization. (#11884 ) * RowBasedCursor: Add column-value-reuse optimization. Most of the logic is in RowBasedColumnSelectorFactory, although in this patch its only user is RowBasedCursor. This improves performance of features that use RowBasedSegment, like lookup and inline datasources. It's especially helpful for inline datasources that contain lengthy arrays, due to the fact that the transformed array can be reused. * Changes from code review. * Fixes for ColumnCapabilitiesImplTest.	2021-11-09 07:18:09 -08:00
Gian Merlino	a5bd0b8cc0	RowAdapter: Add a default implementation for timestampFunction. (#11885 ) Enables simpler implementations for adapters that want to treat the timestamp as "just another column".	2021-11-08 10:25:13 -08:00
Clint Wylie	7237dc837c	complex typed expressions (#11853 ) * complex typed expressions * add built-in hll collector expressions to get coverage on druid-processing, more types, more better * rampage!!! * more javadoc * adjustments * oops * lol * remove unused dependency * contradiction? * more test	2021-11-08 00:33:06 -08:00
Clint Wylie	907e4ca0c5	use correct DimensionSpec for column value selectors created from dictionary encoded column indexers (#11873 ) * use correct dimension spec for column value selectors of dictionary encoded column indexers	2021-11-05 01:51:15 -07:00
Liran Funaro	9ca8f1ec97	Remove IncrementalIndex template modifier (#11160 ) Co-authored-by: Liran Funaro <liran.funaro@verizonmedia.com>	2021-10-27 13:10:37 -07:00
Gian Merlino	fc95c92806	Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124 ) * Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. This patch does the following: - Removes OffheapIncrementalIndex. - Clarifies that Aggregators are required to be thread safe. - Clarifies that BufferAggregators and VectorAggregators are not required to be thread safe. - Removes thread safety code from some DataSketches aggregators that had it. (Not all of them did, and that's OK, because it wasn't necessary anyway.) - Makes enabling "useOffheap" with groupBy v1 an error. Rationale for removing the offheap incremental index: - It is only used in one rare scenario: groupBy v1 (which is non-default) in "useOffheap" mode (also non-default). So you have to go pretty deep into the wilderness to get this code to activate in production. It is never used during ingestion. - Its existence complicates developer efforts to reason about how aggregators get used, because the way it uses buffer aggregators is so different from how every other query engine uses them. - It doesn't have meaningful testing. By the way, I do believe that the given way the offheap incremental index works, it actually didn't require buffer aggregators to be thread-safe. It synchronizes on "aggregate" and doesn't call "get" until it has stopped calling "aggregate". Nevertheless, this is a bother to think about, and for the above reasons I think it makes sense to remove the code anyway. * Remove things that are now unused. * Revert removal of getFloat, getLong, getDouble from BufferAggregator. * OAK-related warnings, suppressions. * Unused item suppressions.	2021-10-26 08:05:56 -07:00
Gian Merlino	98ecbb21cd	Remove CloseQuietly and migrate its usages to other methods. (#10247 ) * Remove CloseQuietly and migrate its usages to other methods. These other methods include: 1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions in RuntimeExceptions for callers that just want to avoid dealing with checked exceptions. Most usages were migrated to this method, because it looks like they were mainly attempts to avoid declaring a throws clause, and perhaps were unintentionally suppressing IOExceptions. 2) New method CloseableUtils.closeInCatch, designed to properly close something in a catch block without losing exceptions. Some usages from catch blocks were migrated here, when it seemed that they were intended to avoid checked exception handling, and did not really intend to also suppress IOExceptions. 3) New method CloseableUtils.closeAndSuppressExceptions, which sends all exceptions to a "chomper" that consumes them. Nothing is thrown or returned. The behavior is slightly different: with this method, _all_ exceptions are suppressed, not just IOExceptions. Calls that seemed like they had good reason to suppress exceptions were migrated here. 4) Some calls were migrated to try-with-resources, in cases where it appeared that CloseQuietly was being used to avoid throwing an exception in a finally block. 🎵 You don't have to go home, but you can't stay here... 🎵 * Remove unused import. * Fix up various issues. * Adjustments to tests. * Fix null handling. * Additional test. * Adjustments from review. * Fixup style stuff. * Fix NPE caused by holder starting out null. * Fix spelling. * Chomp Throwables too.	2021-10-23 17:03:21 -07:00
Clint Wylie	02b2057371	extract generic dictionary encoded column indexing and merging stuffs (#11829 ) * extract generic dictionary encoded column indexing and merging stuffs to pave the path towards supporting other types of dictionary encoded columns * spotbugs and inspections fixes * friendlier * javadoc * better name * adjust	2021-10-22 17:31:22 -07:00
Clint Wylie	741b4ed516	add output type information to ExpressionPostAggregator (#11818 ) * add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types * add test for coverage * simplify * Remove unused imports. Co-authored-by: Gian Merlino <gian@imply.io>	2021-10-22 13:52:51 -07:00
Alexander Saydakov	8cf1cbc4a9	latest datasketches-java and datasketches-memory (#11773 ) * latest datasketches-java and datasketches-memory * updated versions of datasketches-java and datasketches-memory Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2021-10-19 23:42:30 -07:00
Clint Wylie	187df58e30	better types (#11713 ) * better type system * needle in a haystack * ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support * fixup merge * more test * fixup * intern * fix * oops * oops again * ... * more test coverage * fix error message * adjust interning, more javadocs * oops * more docs more better	2021-10-19 01:47:25 -07:00
Jonathan Wei	22b41ddbbf	Task reports for parallel task: single phase and sequential mode (#11688 ) * Task reports for parallel task: single phase and sequential mode * Address comments * Add null check for currentSubTaskHolder	2021-09-16 13:58:11 -05:00
Clint Wylie	5e092ccb9b	add MV_FILTER_ONLY, MV_FILTER_NONE, ListFilteredVirtualColumn (#11650 ) * add MV_FILTER_ONLY SQL function, and list filter virtual column * MV_FILTER_NONE and more tests * formatting * o yeah, forgot can do easy thing * style * hmm why was that there * test filtering on virtual column * style * meh * do it right * good bot	2021-09-16 09:31:53 -07:00
Clint Wylie	bbb86c8731	more tests for LimitedBufferHashGrouper (#11654 ) * more tests for LimitedBufferHashGrouper * fix style	2021-09-08 16:31:34 -07:00
Clint Wylie	fe1d8c206a	bump version to 0.23.0-SNAPSHOT (#11670 )	2021-09-08 15:56:04 -07:00
Clint Wylie	59d257816b	fix goldilocks bug with HashVectorGrouper improperly initializing memory (#11649 ) * fix goldilocks bug with HashVectorGrouper improperly initializing memory that causes failure when there exists room to only grow one time * fix unintended change * cleanup	2021-09-02 02:25:26 -07:00
Jian Wang	3ff1c2b8ce	Fix bug which produces vastly inaccurate query results when forceLimitPushDown is enabled and order by clause has non grouping fields (#11097 )	2021-09-01 21:19:38 -07:00
Jihoon Son	2a658acad4	Put sleep in an extension (#11632 ) * Put sleep in an extension * dependency	2021-08-25 01:27:45 -07:00
Kashif Faraz	aaf0aaad8f	Enable routing of SQL queries at Router (#11566 ) This PR adds a new property druid.router.sql.enable which allows the Router to handle SQL queries when set to true. This change does not affect Avatica JDBC requests and they are still routed by hashing the Connection ID. To allow parsing of the request object as a SqlQuery (contained in module druid-sql), some classes have been moved from druid-server to druid-services with the same package name.	2021-08-13 18:44:39 +05:30
Clint Wylie	9af7ba9d2a	STRING_AGG SQL aggregator function (#11241 ) * add string_agg * oops * style and fix test * spelling * fixup * review stuffs	2021-08-10 13:47:09 -07:00
Maytas Monsereenusorn	3257913737	Improve query error logging (#11519 ) * Improve query error logging * add docs * address comments * address comments	2021-08-05 22:51:09 +07:00
Jihoon Son	8ba7f6a48c	Fix incorrect result of exact topN on an inner join with limit (#11517 )	2021-07-31 15:55:49 -07:00
Xavier Léauté	4bca7f014e	update error-prone to 2.8.0 with fix for crashing check (#11494 ) * error-prone 2.8.0 fixes https://github.com/google/error-prone/issues/2396 * fix for a few ignored return values * fix unknown args in sub-modules	2021-07-29 09:13:46 -07:00
Kashif Faraz	8a4e27f51d	Select broker based on query context parameter `brokerService` (#11495 ) This change allows the selection of a specific broker service (or broker tier) by the Router. The newly added ManualTieredBrokerSelectorStrategy works as follows: 1) Check for the parameter brokerService in the query context. If this is a valid broker service, use it. 2) Check if the field defaultManualBrokerService has been set in the strategy. If this is a valid broker service, use it. 3) Move on to the next strategy.	2021-07-27 20:56:05 +05:30
Lucas Capistrant	9767b42e85	Add a new metric query/segments/count that is not emitted by default (#11394 ) * Add a new metric query/segments/count that is not emitted by default * docs * test the default implementation of the metric * fix spelling error in docs * document the fact that query retries will result in additional metric emissions * update using recommended text from @jihoonson	2021-07-22 17:57:35 -07:00
Abhishek Agarwal	ce1faa5635	Make SegmentLoader extensible and customizable (#11398 ) This PR refactors the code related to segment loading, specifically SegmentLoader and SegmentLoaderLocalCacheManager. SegmentLoader is marked UnstableAPI, which means it can be extended outside core Druid in custom extensions. Here is a summary of changes: * SegmentLoader returns an instance of ReferenceCountingSegment instead of Segment. Earlier, SegmentManager was wrapping Segment objects inside ReferenceCountingSegment. That is now moved to SegmentLoader. With this, a custom implementation can track the references of segments. It also allows them to create custom ReferenceCountingSegment implementations. For this reason, the constructor visibility in ReferenceCountingSegment is changed from private to protected. * SegmentCacheManager has two additional methods, reserve(DataSegment) and release(DataSegment). These methods let the caller reserve or release space without calling SegmentLoader#getSegment. We already had similar methods in StorageLocation and now they are available in SegmentCacheManager too, which wraps multiple locations. * Refactoring to simplify the code in SegmentCacheManager wherever possible. There is no change in the functionality.	2021-07-22 18:00:49 +05:30
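
The fallback order described in commit 8a4e27f51d (#11495) can be sketched roughly as below. This is an illustration only, not the Router's actual code: the class name `ManualBrokerSelectionSketch`, the `select` method, and the use of a plain `Set<String>` for the known broker services are all assumptions made for the example. Only the `brokerService` context key and the `defaultManualBrokerService` field name come from the commit message.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the selection order from the commit message;
// it does not reproduce Druid's ManualTieredBrokerSelectorStrategy.
final class ManualBrokerSelectionSketch {

    // Returns the chosen broker service, or empty to signal
    // "move on to the next strategy".
    static Optional<String> select(
            Map<String, Object> queryContext,
            String defaultManualBrokerService,
            Set<String> validBrokerServices) {
        // Step 1: prefer the brokerService named in the query context, if valid.
        Object requested = queryContext.get("brokerService");
        if (requested instanceof String && validBrokerServices.contains(requested)) {
            return Optional.of((String) requested);
        }
        // Step 2: fall back to the strategy's configured default, if valid.
        if (defaultManualBrokerService != null
                && validBrokerServices.contains(defaultManualBrokerService)) {
            return Optional.of(defaultManualBrokerService);
        }
        // Step 3: no decision here; the caller tries the next strategy.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The service names below are made up for the demonstration.
        Set<String> valid = Set.of("druid/broker-hot", "druid/broker-cold");
        System.out.println(select(Map.of("brokerService", "druid/broker-hot"), null, valid));
        System.out.println(select(Map.of(), "druid/broker-cold", valid));
        System.out.println(select(Map.of("brokerService", "unknown"), null, valid));
    }
}
```

Returning an empty `Optional` rather than throwing keeps the "move on to the next strategy" step explicit for whatever chains the strategies together.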


2486 Commits
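
The class of bug fixed in commit fe2f7742f7 (#11905), where a type that used to be an enum became a regular class while a `==` comparison was still in flight, can be illustrated with a small sketch. `SketchType` here is a hypothetical stand-in, not Druid's actual type class, and the helper method names are invented for the example.

```java
import java.util.Objects;

// Hypothetical illustration of why == stops working once an enum becomes a class.
final class TypeComparisonSketch {

    // Stand-in for a type descriptor that is now a class with value equality.
    static final class SketchType {
        private final String name;

        SketchType(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof SketchType)) {
                return false;
            }
            return name.equals(((SketchType) o).name);
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    // Identity comparison: safe for enum constants, wrong for separate instances.
    static boolean sameByIdentity(SketchType a, SketchType b) {
        return a == b;
    }

    // Value comparison: correct for classes that define equals.
    static boolean sameByEquals(SketchType a, SketchType b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        SketchType a = new SketchType("LONG");
        SketchType b = new SketchType("LONG");
        System.out.println(sameByIdentity(a, b)); // prints false
        System.out.println(sameByEquals(a, b));   // prints true
    }
}
```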