druid

Commit Graph

Author	SHA1	Message	Date
Clint Wylie	e0c4c568cb	fix incorrect ColumnInspector in IncrementalIndex.makeColumnSelectorFactory (#12155 )	2022-01-13 18:09:06 -08:00
Clint Wylie	f2ce76966c	add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures less ambiguous (#12145 ) * add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous * switcheroo * EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs * revert unintended change * fix docs * fix docs better	2022-01-12 03:48:53 -08:00
Rohan Garg	81f0aba6cb	Use ListFilteredVirtualColumn for left/fact table expression in join condition (#12127 ) * Pass VirtualColumnRegistry in PlannerContext for join expression planning * Allow for including VCs from join fact table expression * Optmize MV_FILTER functions to use a VC when in join fact table expression * fixup! Allow for including VCs from join fact table expression * Address review comments	2022-01-11 14:47:13 -08:00
imply-cheddar	eb0bae49ec	Update PostAggregator to be backwards compat (#12138 ) This change mimics what was done in PR #11917 to fix the incompatibilities produced by #11713. #11917 fixed it with AggregatorFactory by creating default methods to allow for extensions built against old jars to still work. This does the same for PostAggregator	2022-01-11 02:18:14 -08:00
Clint Wylie	7cf9192765	fix delegated smoosh writer and some new facilities for segment writeout medium (#12132 ) * fix delegated smoosh writer and some new facilities for segment writeout medium changes: * fixed issue with delegated `SmooshedWriter` when writing files that look like paths, causing `NoSuchFileException` exceptions when attempting to open a channel to the file * `FileSmoosher.addWithSmooshedWriter` when _not_ delegating now checks that it is still open when closing, making it a no-op if already closed (allowing column serializers to add additional files and avoid delegated mode if they are finished writing out their own content and ned to add additional files) * add `makeChildWriteOutMedium` to `SegmentWriteOutMedium` interface, which allows users of a shared medium to clean up `WriteOutBytes` if they fully control the lifecycle. there are no callers of this yet, adding for future functionality * `OnHeapByteBufferWriteOutBytes` now can be marked as not open so it `OnHeapMemorySegmentWriteOutMedium` can now behave identically to other medium implementations * fix to address nit - use AtomicLong	2022-01-10 22:25:19 -08:00
Clint Wylie	e583033231	add 'TypeStrategy' to types (#11888 ) * add TypeStrategy - value comparators and binary serialization for any TypeSignature	2022-01-10 17:12:14 -08:00
somu-imply	c267b65f97	Removing unused processing threadpool on broker (#12070 ) * Thread pool for broker * Updating two tests to improve coverage for new method added * Updating druidProcessingConfigTest to cover coverage * Adding missed spelling errors caused in doc * Adding test to cover lines of new function added	2021-12-21 13:07:53 -08:00
Abhishek Agarwal	5d043cefbc	Fix test in ResponseContextTest (#12077 )	2021-12-16 22:51:51 -08:00
Clint Wylie	244c2559e9	fix IncrementalIndex performance regression (#12048 ) changes: * IncrementalIndex is now a ColumnInspector * fixes performance regression from using map of ColumnCapabilities from IncrementalIndex as a RowSignature	2021-12-09 22:04:32 -08:00
Jonathan Wei	229f82a6f0	Add parse error list API for stream supervisors, use structured object for parse exceptions, simplify parse exception message (#11961 ) * Add parse error list API for stream supervisors, simplify parse exception message * Add input string to parse exception * Use structured ParseExceptionReport * Fix tests * Add test * PR comments, add ParseExceptionReport equals verifier * Fix test	2021-12-09 15:42:55 -06:00
Laksh Singla	ca260dfef6	Intern RowSignature in DruidSchema to reduce its memory footprint (#12001 ) DruidSchema consists of a concurrent HashMap of DataSource -> Segement -> AvailableSegmentMetadata. AvailableSegmentMetadata contains RowSignature of the segment, and for each segment, a new object is getting created. RowSignature is an immutable class, and hence it can be interned, and this can lead to huge savings of memory being used in broker, since a lot of the segments of a table would potentially have same RowSignature.	2021-12-08 15:11:13 +05:30
Clint Wylie	45be2be368	fix issues with multi-value string constant expressions (#12025 ) * add specialized constant selector for multi-valued string constants	2021-12-08 00:10:26 -08:00
Clint Wylie	a8815f671e	Fix druid client timeout zero (#12023 ) * fix bug where queries fail immediately when timeout is 0 instead of using default timeout * fix to use serverside max * more better * less flaky test * oops	2021-12-07 12:41:01 -08:00
Paul Rogers	34a3d45737	Refactor ResponseContext (#11828 ) * Refactor ResponseContext Fixes a number of issues in preparation for request trailers and the query profile. * Converts keys from an enum to classes for smaller code * Wraps stored values in functions for easier capture for other uses * Reworks the "header squeezer" to handle types other than arrays. * Uses metadata for visibility, and ability to compress, to replace ad-hoc code. * Cleans up JSON serialization for the response context. * Other miscellaneous cleanup. * Handle unknown keys in deserialization Also, make "Visibility" into a boolean. * Revised comment * Renamd variable	2021-12-06 17:03:12 -08:00
Clint Wylie	84b4bf56d8	vectorize logical operators and boolean functions (#11184 ) changes: * adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions * vectorize logical operators and boolean functions, some only if useStrictBooleans is true	2021-12-02 16:40:23 -08:00
Paul Rogers	a66f10eea1	Code cleanup from query profile project (#11822 ) * Code cleanup from query profile project * Fix spelling errors * Fix Javadoc formatting * Abstract out repeated test code * Reuse constants in place of some string literals * Fix up some parameterized types * Reduce warnings reported by Eclipse * Reverted change due to lack of tests	2021-11-30 11:35:38 -08:00
Gian Merlino	f6e6ca2893	Use intermediate-persist IndexSpec during multiphase merge. (#11940 ) * Use intermediate-persist IndexSpec during multiphase merge. The main change is the addition of an intermediate-persist IndexSpec to the main "merge" method in IndexMerger. There are also a few minor adjustments to the IndexMerger interface to encourage more harmonious usage of its methods in the future. * Additional changes inspired by the test coverage checker. - Remove unused-in-production IndexMerger methods "append" and "convert". - Add additional unit tests to UnifiedIndexerAppenderatorsManager. * Additional adjustments. * Even more additional adjustments. * Test fixes.	2021-11-29 15:08:49 -08:00
Gian Merlino	93aeaf4801	Improve on-heap aggregator footprint estimates. (#11950 ) Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that mitigates #6743 by enabling heap footprint estimates based on a specific number of rows. The idea is that at ingestion time, the number of rows that go into an aggregator will be 1 (if rollup is off) or will likely be a small number (if rollup is on). It's a heuristic, because of course nothing guarantees that the rollup ratio is a small number. But it's a common case, and I expect this logic to go wrong much less often than the current logic. Also, when it does go wrong, users can fix it by lowering maxRowsInMemory or maxBytesInMemory. The current situation is unintuitive: when the estimation goes wrong, users get an OOME, but actually they need to raise these limits to fix it.	2021-11-28 13:21:24 +05:30
Rohan Garg	2c08055962	Specify time column for first/last aggregators (#11949 ) Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.	2021-11-25 09:44:14 +05:30
Gian Merlino	12e2228510	RowBasedGrouperHelper: Set hasMultipleValues = false in capabilities. (#11954 ) Useful because it enables anything that consumes groupBy results to potentially operate more efficiently.	2021-11-24 13:14:58 -08:00
Gian Merlino	5e168b861a	StorageAdapter: Add getRowSignature method. (#11953 ) Simplifies logic for callers that only want to get a list of all the column names, or column names and types. Updated callers SegmentAnalyzer, HashJoinSegmentStorageAdapter, and DruidSegmentReader.	2021-11-24 13:14:25 -08:00
Gian Merlino	0354407655	SQL INSERT planner support. (#11959 ) * SQL INSERT planner support. The main changes are: 1) DruidPlanner is able to validate and authorize INSERT queries. They require WRITE permission on the target datasource. 2) QueryMaker is now an interface, and there is a QueryMakerFactory that creates instances of it. There is only one production implementation of each (NativeQueryMaker and NativeQueryMakerFactory), which together behave the same way as the former QueryMaker class. But this opens the door to executing queries in ways other than the Druid query stack, and is used by unit tests (CalciteInsertDmlTest) to test the INSERT planning functionality. 3) Adds an EXTERN table macro that allows references external data using InputSource and InputFormat from Druid's batch ingestion API. This is not exposed in production yet, but is used by unit tests. 4) Adds a QueryFeature concept that enables the planner to change its behavior slightly depending on the capabilities of the execution system. 5) Adds an "AuthorizableOperator" concept that enables SqlOperators to require additional permissions. This is used by the EXTERN table macro. Related odds and ends: - Add equals, hashCode, toString methods to InlineInputSource. Aids in the "from external" tests in CalciteInsertDmlTest. - Add JSON-serializability to RowSignature. - Move the SQL string inside PlannerContext so it is "baked into" the planner when the planner is created. Cleans up the code a bit, since in practice, the same query is passed in every time to the same planner anyway. * Fix up calls to CalciteTests.createMockQueryLifecycleFactory. * Fix checkstyle issues. * Adjustments for CI. * Adjust DruidAvaticaHandlerTest for stricter test authorizations.	2021-11-24 12:14:04 -08:00
Gian Merlino	35b610ada7	QueryableIndexColumnSelectorFactory: Double-check cached column class. (#11957 ) Important because an earlier call to getCachedColumn may have been done with a different class, leading to a ClassCastException on the second call. In the prior code, this could happen if a complex column had makeDimensionSelector called on it after makeColumnValueSelector had already been called.	2021-11-22 11:31:24 -08:00
Gian Merlino	d6507c9428	PrioritizedExecutorService: Properly wrap on direct calls to "execute". (#11956 ) Usually, "execute" is called by methods defined in the superclass AbstractExecutorService, and the passed-in Runnable has been wrapped by newTaskFor inside a PrioritizedListenableFutureTask. But this method can also be called directly, and if so, the same wrapping is necessary for the delegate to get a Runnable that can be entered into a priority queue with the others.	2021-11-22 10:30:12 -08:00
Clint Wylie	f260bbed23	restore and deprecate AggregatorFactory methods (#11917 ) * add back and deprecate aggregator factory methods so i can say i told you so when i delete these later * rename to make less ambiguous, fix fill method * adjust	2021-11-19 15:59:35 -08:00
Gian Merlino	36ee0367ff	Scan: Add "orderBy" parameter. (#11930 ) * Scan: Add "orderBy" parameter. This patch adds an API for requesting non-time orderings, although it does not actually add the ability to execute such queries. The changes are done in such a way that no matter how Scan query objects are constructed, they will have a correct "getOrderBy". This will enable us to switch the execution to exclusively use "getOrderBy" later on when it's implemented. Scan queries are serialized such that they only include "order" (time order) if the ordering is time-based, and they only include "orderBy" if the ordering is non-time-based. This maximizes compatibility with the existing API while also providing a clean look for formatted queries. Because this patch does not include execution logic, if someone actually tries to run a query with non-time ordering, then they will get an error like "Cannot execute query with orderBy [quality ASC]". * SQL module fixes. * Add spotbugs-exclude. * Remove unused method.	2021-11-19 08:19:12 -08:00
Clint Wylie	7f0bede878	autocompaction support for complex dimensions (#11924 ) * autocompaction support for complex dimensions * more test	2021-11-16 15:57:44 -08:00
Clint Wylie	00c976a3fe	only get bitmap index for string dictionary encoded columns (#11925 )	2021-11-16 15:50:02 -08:00
Kashif Faraz	223c5692a8	Add dimension partitioningType to metrics to track usage of different partitioning schemes (#11902 ) Add method ShardSpec.getType() to get name of shard spec type List all names of shard spec types in the interface ShardSpec itself for easy reference and maintenance Add dimension partitioningType to metric segment/added/bytes	2021-11-11 18:34:27 +05:30
Gian Merlino	fe2f7742f7	Fix incorrect comparison in RowSignature. (#11905 ) PR #11882 introduced a type comparison using ==, but while it was in flight, another PR #11713 changed the type enum to a class. So the comparison should properly be done with "equals".	2021-11-11 04:30:42 -08:00
Laksh Singla	57ed5127a7	Make subquery IDs more comprehensive (#11809 ) There are 3 types of query IDs - id, subQueryId, sqlQueryId. Currently, whenever a query generates subqueries, the subquery's subQueryId is populated randomly. Also, subquery's Id is not set to the parent query Id. Therefore there is no way of linking the subqueries to the parent query, and one loses the ability to look at end to end view of the query. This PR aims to implement following couple of things: Populate the subqueries with it's parent's id (and sqlQueryId if present) Populate the subqueryId such that it forms a hierarchical relationship amongs themselves. For example, if there is a query which launches a subquery, which in turn launches a couple of subqueries, then the ids and subQueryIds should have following structure.	2021-11-11 16:31:56 +05:30
Clint Wylie	5baa22148e	revert ColumnAnalysis type, add typeSignature and use it for DruidSchema (#11895 ) * revert ColumnAnalysis type, add typeSignature and use it for DruidSchema * review stuffs * maybe null * better maybe null * Update docs/querying/segmentmetadataquery.md * Update docs/querying/segmentmetadataquery.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * fix null right * sad * oops * Update batch_hadoop_queries.json Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2021-11-10 18:46:29 -08:00
Gian Merlino	14b0b4aee2	RowBasedSegment: Use Sequence instead of Iterable. (#11886 ) * RowBasedSegment: Use Sequence instead of Iterable. The main reason this is good is that Sequences can include baggage that must be closed after iteration is finished. This enables creating RowBasedSegments on top of closeable sequences of rows. To preserve the optimization that allows reversing a List without copying it, this patch also makes SimpleSequence its own class and allows extracting the Iterable that was used to create it. * Fix tests.	2021-11-10 06:06:52 -08:00
Gian Merlino	db4d157be6	Add Finalization option to RowSignature.addAggregators. (#11882 ) * Add Finalization option to RowSignature.addAggregators. This make type signatures more useful when the caller knows whether it will be reading aggregation results in their finalized or intermediate types. * Fix call site.	2021-11-10 06:05:29 -08:00
Clint Wylie	a8805ab60d	add missing json type for ListFilteredVirtualColumn (#11887 ) * add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again * fixes * ugly, but maybe this * oops * too many mappers	2021-11-09 17:25:12 -08:00
Gian Merlino	6c196a5ea2	Remove StorageAdapter.getColumnTypeName. (#11893 ) * Remove StorageAdapter.getColumnTypeName. It was only used by SegmentAnalyzer, and isn't necessary anymore due to the recent improvements to ColumnCapabilities. Also: tidy ColumnDescriptor.read slightly by removing an instanceof check, and moving the relevant logic into ComplexColumnPartSerde. * Fix spellings.	2021-11-09 15:18:07 -08:00
Gian Merlino	324d4374f6	HashJoinEngine: Fix extraneous advance of left cursor. (#11890 ) This could happen for right or full outer joins in certain cases. Tests weren't catching this because existing Cursor implementations generally ignore extraneous calls to "advance". So, to help catch this in tests, extra state validations are also added to RowWalker, which is used by RowBasedSegment.	2021-11-09 11:34:11 -08:00
Gian Merlino	babf00f8e3	Migrate File.mkdirs to FileUtils.mkdirp. (#11879 ) * Migrate File.mkdirs to FileUtils.mkdirp. * Remove unused imports. * Fix LookupReferencesManager. * Simplify. * Also migrate usages of forceMkdir. * Fix var name. * Fix incorrect call. * Update test.	2021-11-09 11:10:49 -08:00
Gian Merlino	945a341acd	RowBasedCursor: Add column-value-reuse optimization. (#11884 ) * RowBasedCursor: Add column-value-reuse optimization. Most of the logic is in RowBasedColumnSelectorFactory, although in this patch its only user is RowBasedCursor. This improves performance of features that use RowBasedSegment, like lookup and inline datasources. It's especially helpful for inline datasources that contain lengthy arrays, due to the fact that the transformed array can be reused. * Changes from code review. * Fixes for ColumnCapabilitiesImplTest.	2021-11-09 07:18:09 -08:00
Gian Merlino	a5bd0b8cc0	RowAdapter: Add a default implementation for timestampFunction. (#11885 ) Enables simpler implementations for adapters that want to treat the timestamp as "just another column".	2021-11-08 10:25:13 -08:00
Clint Wylie	7237dc837c	complex typed expressions (#11853 ) * complex typed expressions * add built-in hll collector expressions to get coverage on druid-processing, more types, more better * rampage!!! * more javadoc * adjustments * oops * lol * remove unused dependency * contradiction? * more test	2021-11-08 00:33:06 -08:00
Clint Wylie	907e4ca0c5	use correct DimensionSpec with for column value selectors created from dictionary encoded column indexers (#11873 ) * use correct dimension spec for column value selectors of dictionary encoded column indexers	2021-11-05 01:51:15 -07:00
Liran Funaro	9ca8f1ec97	Remove IncrementalIndex template modifier (#11160 ) Co-authored-by: Liran Funaro <liran.funaro@verizonmedia.com>	2021-10-27 13:10:37 -07:00
Gian Merlino	fc95c92806	Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124 ) * Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. This patch does the following: - Removes OffheapIncrementalIndex. - Clarifies that Aggregators are required to be thread safe. - Clarifies that BufferAggregators and VectorAggregators are not required to be thread safe. - Removes thread safety code from some DataSketches aggregators that had it. (Not all of them did, and that's OK, because it wasn't necessary anyway.) - Makes enabling "useOffheap" with groupBy v1 an error. Rationale for removing the offheap incremental index: - It is only used in one rare scenario: groupBy v1 (which is non-default) in "useOffheap" mode (also non-default). So you have to go pretty deep into the wilderness to get this code to activate in production. It is never used during ingestion. - Its existence complicates developer efforts to reason about how aggregators get used, because the way it uses buffer aggregators is so different from how every other query engine uses them. - It doesn't have meaningful testing. By the way, I do believe that the given way the offheap incremental index works, it actually didn't require buffer aggregators to be thread-safe. It synchronizes on "aggregate" and doesn't call "get" until it has stopped calling "aggregate". Nevertheless, this is a bother to think about, and for the above reasons I think it makes sense to remove the code anyway. * Remove things that are now unused. * Revert removal of getFloat, getLong, getDouble from BufferAggregator. * OAK-related warnings, suppressions. * Unused item suppressions.	2021-10-26 08:05:56 -07:00
Gian Merlino	98ecbb21cd	Remove CloseQuietly and migrate its usages to other methods. (#10247 ) * Remove CloseQuietly and migrate its usages to other methods. These other methods include: 1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions in RuntimeExceptions for callers that just want to avoid dealing with checked exceptions. Most usages were migrated to this method, because it looks like they were mainly attempts to avoid declaring a throws clause, and perhaps were unintentionally suppressing IOExceptions. 2) New method CloseableUtils.closeInCatch, designed to properly close something in a catch block without losing exceptions. Some usages from catch blocks were migrated here, when it seemed that they were intended to avoid checked exception handling, and did not really intend to also suppress IOExceptions. 3) New method CloseableUtils.closeAndSuppressExceptions, which sends all exceptions to a "chomper" that consumes them. Nothing is thrown or returned. The behavior is slightly different: with this method, _all_ exceptions are suppressed, not just IOExceptions. Calls that seemed like they had good reason to suppress exceptions were migrated here. 4) Some calls were migrated to try-with-resources, in cases where it appeared that CloseQuietly was being used to avoid throwing an exception in a finally block. 🎵 You don't have to go home, but you can't stay here... 🎵 * Remove unused import. * Fix up various issues. * Adjustments to tests. * Fix null handling. * Additional test. * Adjustments from review. * Fixup style stuff. * Fix NPE caused by holder starting out null. * Fix spelling. * Chomp Throwables too.	2021-10-23 17:03:21 -07:00
Clint Wylie	02b2057371	extract generic dictionary encoded column indexing and merging stuffs (#11829 ) * extract generic dictionary encoded column indexing and merging stuffs to pave the path towards supporting other types of dictionary encoded columns * spotbugs and inspections fixes * friendlier * javadoc * better name * adjust	2021-10-22 17:31:22 -07:00
Clint Wylie	741b4ed516	add output type information to ExpressionPostAggregator (#11818 ) * add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types * add test for test for coverage * simplify * Remove unused imports. Co-authored-by: Gian Merlino <gian@imply.io>	2021-10-22 13:52:51 -07:00
Alexander Saydakov	8cf1cbc4a9	latest datasketches-java and datasketches-memory (#11773 ) * latest datasketches-java and datasketches-memory * updated versions of datasketches-java and datasketches-memory Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>	2021-10-19 23:42:30 -07:00
Clint Wylie	187df58e30	better types (#11713 ) * better type system * needle in a haystack * ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support * fixup merge * more test * fixup * intern * fix * oops * oops again * ... * more test coverage * fix error message * adjust interning, more javadocs * oops * more docs more better	2021-10-19 01:47:25 -07:00
Jonathan Wei	22b41ddbbf	Task reports for parallel task: single phase and sequential mode (#11688 ) * Task reports for parallel task: single phase and sequential mode * Address comments * Add null check for currentSubTaskHolder	2021-09-16 13:58:11 -05:00

1 2 3 4 5 ...

2500 Commits