druid

Commit Graph

Author	SHA1	Message	Date
Gian Merlino	6d2ff796a3	Add RowIdSupplier to ColumnSelectorFactory. (#12577 ) * Add RowIdSupplier to ColumnSelectorFactory. This enables virtual columns to cache their outputs in case they are called multiple times on the same underlying row. This is common for numeric selectors, where the common pattern is to call isNull() and then follow with getLong(), getFloat(), or getDouble(). Here, output caching reduces the number of expression evals by half. * Fix tests.	2022-05-31 11:38:03 -07:00
Clint Wylie	b746bf9129	fix virtual column cycle bug, sql virtual column optimize bug (#12576 ) * fix virtual column cycle bug, sql virtual column optimize bug * more test	2022-05-30 23:51:21 -07:00
Dr. Sizzles	7291c92f4f	Adding zstandard compression library (#12408 ) * Adding zstandard compression library * 1. Took @clintropolis's advice to have ZStandard decompressor use the byte array when the buffers are not direct. 2. Cleaned up checkstyle issues. * Fixing zstandard version to latest stable version in poms and updating license files * Removing zstd from benchmarks and adding to processing (poms) * fix the intellij inspection issue * Removing the prefix v for the version in the license check for zstd * Fixing license checks Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>	2022-05-28 17:01:44 -07:00
Dongjoon Hyun	79f86a0511	Upgrade ORC to 1.7.4 (#12572 ) This commit upgrades Apache ORC library from 1.7.2 to 1.7.4. Apache ORC 1.7.4 is the maintenance release with the following bug fixes. https://orc.apache.org/news/2022/04/15/ORC-1.7.4/ https://github.com/apache/orc/releases/tag/v1.7.4	2022-05-28 17:44:36 +05:30
Clint Wylie	d0c9c37e35	make query context changes backwards compatible (#12564 ) Adds a default implementation of getQueryContext, which was added to the Query interface in #12396. Query is marked with @ExtensionPoint, and lately we have been trying to be less volatile on these interfaces by providing default implementations to be more chill for extension writers. The way this default implementation is done in this PR is a bit strange due to the way that getQueryContext is used (mutated with system default and system generated keys); the default implementation has a specific object that it returns, and I added another temporary default method isLegacyContext that checks if the getQueryContext returns that object or not. If not, callers fall back to using getContext and withOverriddenContext to set these default and system values. I am open to other ideas as well, but this way should work at least without exploding, and added some tests to ensure that it is wired up correctly for QueryLifecycle, including the context authorization stuff. The added test shows the strange behavior if query context authorization is enabled, mainly that the system default and system generated query context keys also need to be granted as permissions for things to function correctly. This is not great, so I mentioned it in the javadocs as well. Not sure if it needs to be called out anywhere else.	2022-05-25 15:24:41 +05:30
Karan Kumar	9f9faeec81	object[] handling for DimensionHandlers for arrays (#12552 ) Description Fixes a bug when running q's like SELECT cntarray, Count() FROM (SELECT dim1, dim2, Array_agg(cnt) AS cntarray FROM (SELECT dim1, dim2, dim3, Count() AS cnt FROM foo GROUP BY 1, 2, 3) GROUP BY 1, 2) GROUP BY 1 This generates an error: org.apache.druid.java.util.common.ISE: Unable to convert type [Ljava.lang.Object; to org.apache.druid.segment.data.ComparableList at org.apache.druid.segment.DimensionHandlerUtils.convertToList(DimensionHandlerUtils.java:405) ~[druid-xx] Because it's an array of numbers it looks like it does the convertToList call, which looks like: @Nullable public static ComparableList convertToList(Object obj) { if (obj == null) { return null; } if (obj instanceof List) { return new ComparableList((List) obj); } if (obj instanceof ComparableList) { return (ComparableList) obj; } throw new ISE("Unable to convert type %s to %s", obj.getClass().getName(), ComparableList.class.getName()); } I.e. it doesn't know about arrays. Added the array handling as part of this PR.	2022-05-25 15:24:18 +05:30
Abhishek Agarwal	b10eb4cbd4	Suppress false CVE on druid-indexing-hadoop artifact (#12562 )	2022-05-24 16:00:58 +05:30
Abhishek Agarwal	32fe4d1324	Use a different repository to download sigar artifacts. (#12561 )	2022-05-24 14:42:51 +05:30
Agustin Gonzalez	2f3d7a4c07	Emit state of replace and append for native batch tasks (#12488 ) * Emit state of replace and append for native batch tasks * Emit count of one depending on batch ingestion mode (APPEND, OVERWRITE, REPLACE) * Add metric to compaction job * Avoid null ptr exc when null emitter * Coverage * Emit tombstone & segment counts * Tasks need a type * Spelling * Integrate BatchIngestionMode in batch ingestion tasks functionality * Typos * Remove batch ingestion type from metric since it is already in a dimension. Move IngestionMode to AbstractTask to facilitate having mode as a dimension. Add metrics to streaming. Add missing coverage. * Avoid inner class referenced by sub-class inspection. Refactor computation of IngestionMode to make it more robust to null IOConfig and fix test. * Spelling * Avoid polluting the Task interface * Rename computeCompaction methods to avoid ambiguous java compiler error if they are passed null. Other minor cleanup.	2022-05-23 12:32:47 -07:00
Adarsh Sanjeev	5063eca5b9	Add error message for incorrectly ordered clause in sql (#12558 ) In the case that the clustered by is before the partitioned by for an sql query, the error message is a bit confusing. insert into foo select * from bar clustered by dim1 partitioned by all Error: SQL parse failed Encountered "PARTITIONED" at line 1, column 88. Was expecting one of: <EOF> "," ... "ASC" ... "DESC" ... "NULLS" ... "." ... "NOT" ... "IN" ... "<" ... "<=" ... ">" ... ">=" ... "=" ... "<>" ... "!=" ... "BETWEEN" ... "LIKE" ... "SIMILAR" ... "+" ... "-" ... "*" ... "/" ... "%" ... "||" ... "AND" ... "OR" ... "IS" ... "MEMBER" ... "SUBMULTISET" ... "CONTAINS" ... "OVERLAPS" ... "EQUALS" ... "PRECEDES" ... "SUCCEEDS" ... "IMMEDIATELY" ... "MULTISET" ... "[" ... "FORMAT" ... "(" ... Less... org.apache.calcite.sql.parser.SqlParseException This is a bit confusing; a check could be added to throw a more user-friendly message stating that the order should be reversed. Add error message for incorrectly ordered clause in sql.	2022-05-23 12:41:18 +05:30
AmatyaAvadhanula	6d85ba4c00	Suppress CVEs (#12553 )	2022-05-23 12:35:23 +05:30
Gian Merlino	37853f8de4	ConcurrentGrouper: Add mergeThreadLocal option, fix bug around the switch to spilling. (#12513 ) * ConcurrentGrouper: Add option to always slice up merge buffers thread-locally. Normally, the ConcurrentGrouper shares merge buffers across processing threads until spilling starts, and then switches to a thread-local model. This minimizes memory use and reduces likelihood of spilling, which is good, but it creates thread contention. The new mergeThreadLocal option causes a query to start in thread-local mode immediately, and allows us to experiment with the relative performance of the two modes. * Fix grammar in docs. * Fix race in ConcurrentGrouper. * Fix issue with timeouts. * Remove unused import. * Add "tradeoff" to dictionary.	2022-05-21 10:28:54 -07:00
Katya Macedo	5073cee73f	Fix zookeeper spelling (#12556 )	2022-05-21 16:14:02 +08:00
Clint Wylie	2d8dbb53e0	update to latest lz4 1.8.0 (#12557 )	2022-05-21 16:02:20 +08:00
Agustin Gonzalez	c236227905	Deal with potential cardinality estimate being negative and add logging to hash determine partitions phase (#12443 ) * Deal with potential cardinality estimate being negative and add logging * Fix typo in name * Refine and minimize logging * Make it info based on code review * Create a named constant for the magic number	2022-05-20 10:51:06 -07:00
superivaj	f9bdb3b236	Fix usage of maxColumnsToMerge in auto-compaction tuning config (#12551 ) Issue: Even though `CompactionTuningConfig` allows a `maxColumnsToMerge` config (to optimize memory usage, particularly for datasources with many dimensions), the corresponding client object `ClientCompactionTaskQueryTuningConfig` (used by the coordinator duty `CompactSegments` to trigger auto-compaction) does not contain this field. Thus, the value of `maxColumnsToMerge` specified in any datasource compaction config is ignored. Changes: - Add field `maxColumnsToMerge` in `ClientCompactionTaskQueryTuningConfig` and `UserCompactionTaskQueryTuningConfig` - Fix tests	2022-05-20 22:23:08 +05:30
Gian Merlino	69aac6c8dd	Direct UTF-8 access for "in" filters. (#12517 ) * Direct UTF-8 access for "in" filters. Directly related: 1) InDimFilter: Store stored Strings (in ValuesSet) plus sorted UTF-8 ByteBuffers (in valuesUtf8). Use valuesUtf8 whenever possible. If necessary, the input set is copied into a ValuesSet. Much logic is simplified, because we always know what type the values set will be. I think that there won't even be an efficiency loss in most cases. InDimFilter is most frequently created by deserialization, and this patch updates the JsonCreator constructor to deserialize directly into a ValuesSet. 2) Add Utf8ValueSetIndex, which InDimFilter uses to avoid UTF-8 decodes during index lookups. 3) Add unsigned comparator to ByteBufferUtils and use it in GenericIndexed.BYTE_BUFFER_STRATEGY. This is important because UTF-8 bytes can be compared as bytes if, and only if, the comparison is unsigned. 4) Add specialization to GenericIndexed.singleThreaded().indexOf that avoids needless ByteBuffer allocations. 5) Clarify that objects returned by ColumnIndexSupplier.as are not thread-safe. DictionaryEncodedStringIndexSupplier now calls singleThreaded() on all relevant GenericIndexed objects, saving a ByteBuffer allocation per access. Also: 1) Fix performance regression in LikeFilter: since #12315, it applied the suffix matcher to all values in range even for type MATCH_ALL. 2) Add ObjectStrategy.canCompare() method. This fixes LikeFilterBenchmark, which was broken due to calls to strategy.compare in GenericIndexed.fromIterable. * Add like-filter implementation tests. * Add in-filter implementation tests. * Add tests, fix issues. * Fix style. * Adjustments from review.	2022-05-20 01:51:28 -07:00
Vadim Ogievetsky	a235aca2b3	Web console: fix go to segments not working (#12541 ) * use correct filter syntax * fix tests	2022-05-19 14:34:03 -07:00
Gian Merlino	5f95cc61fe	RemoteTaskRunner: Fix NPE in streamTaskReports. (#12006 ) * RemoteTaskRunner: Fix NPE in streamTaskReports. It is possible for a work item to drop out of runningTasks after the ZkWorker is retrieved. In this case, the current code would throw an NPE. * Additional tests and additional fixes. * Fix import.	2022-05-19 14:23:55 -07:00
Gian Merlino	65a1375b67	SQL: Add is_active to sys.segments, update examples and docs. (#11550 ) * SQL: Add is_active to sys.segments, update examples and docs. is_active is short for: (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1 It's important because this represents "all the segments that should be queryable, whether or not they actually are right now". Most of the time, this is the set of segments that people will want to look at. The web console already adds this filter to a lot of its queries, proving its usefulness. This patch also reworks the caveat at the bottom of the sys.segments section, so its information is mixed into the description of each result field. This should make it more likely for people to see the information. * Wording updates. * Adjustments for spellcheck. * Adjust IT.	2022-05-19 14:23:28 -07:00
machine424	90531fd53f	Do not alter query timeout in ScanQueryEngine (#12271 ) Add test to detect timeout mutability	2022-05-19 09:24:42 -07:00
Xavier Léauté	ec41dfb535	upgrade core Apache Kafka dependencies to 3.2.0 (#12538 ) Announcement: https://blogs.apache.org/kafka/entry/what-s-new-in-apache8 Release notes: https://downloads.apache.org/kafka/3.2.0/RELEASE_NOTES.html	2022-05-19 09:04:52 -07:00
Gian Merlino	1d258d2108	Slightly improve RTR log messages. (#12540 ) 1) Align "Assigning task" log messages between RTR and HRTR. 2) Remove confusing reference to "Coordinator". 3) Move "Not assigning task" message from INFO to DEBUG. It's not super important to see this message: we mainly want to see what _does_ get assigned. 4) Reword "Task switched from pending to running" message to better match the structure of the "Assigning task" message from the same method.	2022-05-19 07:43:55 -07:00
Gian Merlino	485de6a14a	Add builder for TaskToolbox. (#12539 ) * Add builder for TaskToolbox. The main purpose of this change is to make it easier to create TaskToolboxes in tests. However, the builder is used in production too, by TaskToolboxFactory. * Fix imports, adjust formatting. * Fix import.	2022-05-19 07:43:50 -07:00
Gian Merlino	4631cff2a9	Free ByteBuffers in tests and fix some bugs. (#12521 ) * Ensure ByteBuffers allocated in tests get freed. Many tests had problems where a direct ByteBuffer would be allocated and then not freed. This is bad because it causes flaky tests. To fix this: 1) Add ByteBufferUtils.allocateDirect(size), which returns a ResourceHolder. This makes it easy to free the direct buffer. Currently, it's only used in tests, because production code seems OK. 2) Update all usages of ByteBuffer.allocateDirect (off-heap) in tests either to ByteBuffer.allocate (on-heap, which is garbage collected), or to ByteBufferUtils.allocateDirect (wherever it seemed like there was a good reason for the buffer to be off-heap). Make sure to close all direct holders when done. * Changes based on CI results. * A different approach. * Roll back BitmapOperationTest stuff. * Try additional surefire memory. * Revert "Roll back BitmapOperationTest stuff." This reverts commit `49f846d9e3`. * Add TestBufferPool. * Revert Xmx change in tests. * Better behaved NestedQueryPushDownTest. Exit tests on OOME. * Fix TestBufferPool. * Remove T1C from ARM tests. * Somewhat safer. * Fix tests. * Fix style stuff. * Additional debugging. * Reset null / expr configs better. * ExpressionLambdaAggregatorFactory thread-safety. * Alter forkNode to try to get better info when a JVM crashes. * Fix buffer retention in ExpressionLambdaAggregatorFactory. * Remove unused import.	2022-05-19 07:42:29 -07:00
Tejaswini Bandlamudi	c877d8a981	Updates default inputSegmentSizeBytes in Compaction config (#12534 ) Fixes Cannot serialize BigInt value as JSON error while loading compaction config in console.	2022-05-19 14:43:34 +05:30
AmatyaAvadhanula	215b90d1a4	CVE suppression (#12535 )	2022-05-19 11:21:48 +05:30
Charles Smith	3e8d7a6d9f	Sql docs items (#12530 ) * touch up sql refactor * brush up SQL refactor * incorporate feedback * reorder sql * Update docs/querying/sql.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2022-05-17 16:56:31 -07:00
Katya Macedo	177638f171	Fix typo, add comma (#12529 )	2022-05-17 16:42:47 -07:00
Adarsh Sanjeev	fcb1c0b7bf	Add cluster by support for replace syntax (#12524 ) * Add cluster by support for replace syntax * Add unit test for with list	2022-05-17 15:15:29 +05:30
Clint Wylie	b23ddc5939	print replication levels in coordinator segment logs (#12511 ) * print replication levels in coordinator segment logs * add served segment count to stats * also for drops	2022-05-17 02:24:13 -07:00
Adarsh Sanjeev	0fd4f1e386	Improve error messages from SQL REPLACE syntax (#12523 ) - Add user friendly error messages for missing or incorrect OVERWRITE clause for REPLACE SQL query - Move validation of missing OVERWRITE clause at code level instead of parser for custom error message	2022-05-17 09:55:58 +05:30
Gian Merlino	fdfecfd996	Improved docs for range partitioning. (#12350 ) * Improved docs for range partitioning. 1) Clarify the benefits of range partitioning. 2) Clarify which filters support pruning. 3) Include the fact that multi-value dimensions cannot be used for partitioning. * Additional clarification. * Update other section. * Another adjustment. * Updates from review.	2022-05-16 09:42:31 -07:00
Hellmar Becker	985640f103	Clarify the use of the Lookup API (#12088 ) * Update lookups.md * Update docs/querying/lookups.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/querying/lookups.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2022-05-16 07:50:24 -07:00
317brian	351e57bdb6	docs(fix): clarify how worker.version and minWorkerVersion comparison works (#12459 ) * docs(fix): clarify how worker.version and minWorkerVersion comparison works * Revert "docs(fix): clarify how worker.version and minWorkerVersion comparison works" This reverts commit `cadd1fdc60`. * docs(fix): clarify how worker.version and minWorkerVersion comparison works * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/configuration/index.md fix spelling Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-05-16 07:48:33 -07:00
Gian Merlino	5b6727f319	Enable vectorized virtual column processing by default. (#12520 ) In the majority of cases, this improves performance. There's only one case I'm aware of where this may be a net negative: for time_floor(__time, <period>) where there are many repeated __time values. In nonvectorized processing, SingleLongInputCachingExpressionColumnValueSelector implements an optimization to avoid computing the time_floor function on every row. There is no such optimization in vectorized processing. IMO, we shouldn't mention this in the docs. Rationale: It's too fiddly of a thing: it's not guaranteed that nonvectorized processing will be faster due to the optimization, because it would have to overcome the inherent speed advantage of vectorization. So it'd always require testing to determine the best setting for a specific dataset. It would be bad if users disabled vectorization thinking it would speed up their queries, and it actually slowed them down. And even if users do their own testing, at some point in the future we'll implement the optimization for vectorized processing too, and it's likely that users that explicitly disabled vectorization will continue to have it disabled. I'd like to avoid this outcome by encouraging all users to enable vectorization at all times. Really advanced users would be following development activity anyway, and can read this issue	2022-05-16 15:43:53 +05:30
Frank Chen	c33ff1c745	Enforce console logging for peon process (#12067 ) Currently, all Druid processes share the same log4j2 configuration file located in the _common directory. Since peon processes are spawned by the middle manager process, they inherit the environment variables from the middle manager. These variables include those in the log4j2.xml controlling to which file the logger writes the log. But the current task logging mechanism requires the peon processes to output the log to console so that the middle manager can redirect the console output to a file and upload this file to task log storage. So, this PR imposes this requirement on peon processes: whatever the configuration is in the shared log4j2.xml, peon processes always write the log to console.	2022-05-16 15:07:21 +05:30
Gian Merlino	ff253fd8a3	Add setProcessingThreadNames context parameter. (#12514 ) setting thread names takes a measurable amount of time in the case where segment scans are very quick. In high-QPS testing we found a slight performance boost from turning off processing thread renaming. This option makes that possible.	2022-05-16 13:42:00 +05:30
Jason Koch	bb1a6def9d	Task queue unblock (#12099 ) * concurrency: introduce GuardedBy to TaskQueue * perf: Introduce TaskQueueScaleTest to test performance of TaskQueue with large task counts This introduces a test case to confirm how long it will take to launch and manage (aka shutdown) a large number of threads in the TaskQueue. h/t to @gianm for main implementation. * perf: improve scalability of TaskQueue with large task counts * linter fixes, expand test coverage * pr feedback suggestion; swap to different linter * swap to use SuppressWarnings * Fix TaskQueueScaleTest. Co-authored-by: Gian Merlino <gian@imply.io>	2022-05-14 16:44:29 -07:00
Kashif Faraz	7ab2170802	Use datasketches version 3.2.0 (#12509 ) Changes: - Use apache datasketches version 3.2.0. - Remove unsafe reflection-based usage of datasketch internals added in #12022	2022-05-13 11:28:15 +05:30
Adarsh Sanjeev	39b3487aa9	Add replace statement to sql parser (#12386 ) Relevant Issue: #11929 - Add custom replace statement to Druid SQL parser. - Edit DruidPlanner to convert relevant fields to Query Context. - Refactor common code with INSERT statements to reuse them for REPLACE where possible.	2022-05-13 10:56:40 +05:30
Abhishek Radhakrishnan	9177515be2	Add IPAddress java library as dependency and migrate IPv4 functions to use the new library. (#11634 ) * Add ipaddress library as dependency. * IPv4 functions to use the inet.ipaddr package. * Remove unused imports. * Add new function. * Minor rename. * Add more unit tests. * IPv4 address expr utils unit tests and address options. * Adjust the IPv4Util functions. * Move the UTs a bit around. * Javadoc comments. * Add license info for IPAddress. * Fix groupId, artifact and version in license.yaml. * Remove redundant subnet in messages - fixes UT. * Remove unused commons-net dependency for /processing project. * Make class and methods public so it can be accessed. * Add initial version of benchmark * Add subnetutils package for benchmarks. * Auto generate ip addresses. * Add more v4 address representations in setup to avoid bias. * Use ThreadLocalRandom to avoid forbidden API usage. * Adjust IPv4AddressBenchmark to adhere to codestyle rules. * Update ipaddress library to latest 5.3.4 * Add ipaddress package dependency to benchmarks project.	2022-05-11 22:06:20 -07:00
Clint Wylie	9e5a940cf1	remake column indexes and query processing of filters (#12388 ) Following up on #12315, which pushed most of the logic of building ImmutableBitmap into BitmapIndex in order to hide the details of how column indexes are implemented from the Filter implementations, this PR totally refashions how Filters consume indexes. The end result, while a rather dramatic reshuffling of the existing code, should be extraordinarily flexible, eventually allowing us to model any type of index we can imagine, and providing the machinery to build the filters that use them, while also allowing for other column implementations to implement the built-in index types to provide adapters to make use of indexing in the current set of filters that Druid provides.	2022-05-11 11:57:08 +05:30
Lucas Capistrant	deb69d1bc0	Allow coordinator to be configured to kill segments in future (#10877 ) Allow a Druid cluster to kill segments whose interval_end is a date in the future. This can be done by setting druid.coordinator.kill.durationToRetain to a negative period. For example PT-24H would allow segments to be killed if their interval_end date was 24 hours or less into the future at the time that the kill task is generated by the system. A cluster operator can also disregard the druid.coordinator.kill.durationToRetain entirely by setting a new configuration, druid.coordinator.kill.ignoreDurationToRetain=true. This ignores interval_end date when looking for segments to kill, and instead is capable of killing any segment marked unused. This new configuration is off by default, and a cluster operator should fully understand and accept the risks if they enable it.	2022-05-11 07:35:15 +05:30
Kashif Faraz	60b4fa0f75	Docs: Fix column name in ingestion rollup doc (#12036 ) Fix the referred column name from "count" to "num_rows" as "count" vs. "COUNT(*)" might be a little confusing in this example.	2022-05-10 17:35:59 +05:30
Rohan Garg	75836a5a06	Add feature flag for sql planning of TimeBoundary queries (#12491 ) * Add feature flag for sql planning of TimeBoundary queries * fixup! Add feature flag for sql planning of TimeBoundary queries * Add documentation for enableTimeBoundaryPlanning * fixup! Add documentation for enableTimeBoundaryPlanning	2022-05-10 15:23:42 +05:30
somu-imply	c68388ebcd	Vectorized version of string last aggregator (#12493 ) * Vectorized version of string last aggregator * Updating string last and adding testcases * Updating code and adding testcases for serializable pairs * Addressing review comments	2022-05-09 17:02:38 -07:00
Rohan Garg	2dd073c2cd	Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation (#12484 ) * Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation * fixup! Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation * Document vectorized dimension	2022-05-09 10:40:17 -07:00
Atul Mohan	eb6de94e1f	Add daily stats to console (#12329 )	2022-05-05 15:31:21 -07:00
Vadim Ogievetsky	2d8eb117c0	Web console: add a button to get out of restricted mode, make capability detection more robust (#12503 ) * allow unrestrict * update tests	2022-05-05 15:06:59 -07:00
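
The array handling described in 9f9faeec81 (#12552) can be illustrated with a minimal sketch. This is not the code from the PR: the class name `ConvertToListSketch` is hypothetical, a plain `java.util.List` stands in for Druid's `ComparableList`, and `IllegalStateException` stands in for `ISE`. It only shows the idea of adding an `Object[]` branch next to the existing `List` branch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the conversion described in #12552.
// A plain List replaces Druid's ComparableList so the sketch is self-contained.
class ConvertToListSketch
{
  @SuppressWarnings("unchecked")
  static List<Object> convertToList(Object obj)
  {
    if (obj == null) {
      return null;
    }
    if (obj instanceof List) {
      return (List<Object>) obj;
    }
    // The branch the original method lacked: object arrays are wrapped as a list view.
    if (obj instanceof Object[]) {
      return Arrays.asList((Object[]) obj);
    }
    throw new IllegalStateException(
        String.format("Unable to convert type %s to %s", obj.getClass().getName(), List.class.getName())
    );
  }

  public static void main(String[] args)
  {
    // An Object[] of boxed longs, similar in shape to an ARRAY_AGG(cnt) result.
    Object arrayValue = new Object[]{1L, 2L, 3L};
    System.out.println(convertToList(arrayValue));
  }
}
```

Note that primitive arrays such as `long[]` are not instances of `Object[]`, so in this sketch they would still reach the exception.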

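Commit 69aac6c8dd (#12517) notes that UTF-8 bytes can be compared as bytes only if the comparison is unsigned. The sketch below shows why, under stated assumptions: the class and method names are hypothetical and are not the `ByteBufferUtils` comparator added in that PR. Java's `byte` is signed, so any multi-byte UTF-8 sequence (lead byte 0x80 or above) looks negative and would sort before ASCII unless each byte is masked with `0xFF`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of signed vs. unsigned comparison of UTF-8 bytes.
class UnsignedUtf8Compare
{
  // Compares the remaining bytes lexicographically, treating each byte as 0..255.
  static int compareUnsigned(ByteBuffer a, ByteBuffer b)
  {
    int n = Math.min(a.remaining(), b.remaining());
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a.get(a.position() + i) & 0xFF, b.get(b.position() + i) & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.remaining(), b.remaining());
  }

  // Same loop, but treating each byte as -128..127 (the wrong order for UTF-8).
  static int compareSigned(ByteBuffer a, ByteBuffer b)
  {
    int n = Math.min(a.remaining(), b.remaining());
    for (int i = 0; i < n; i++) {
      int cmp = Byte.compare(a.get(a.position() + i), b.get(b.position() + i));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.remaining(), b.remaining());
  }

  static ByteBuffer utf8(String s)
  {
    return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args)
  {
    String z = "z";          // encoded as 0x7A
    String e = "\u00e9";     // e with acute accent, encoded as 0xC3 0xA9
    System.out.println("String order:   " + Integer.signum(z.compareTo(e)));
    System.out.println("unsigned bytes: " + Integer.signum(compareUnsigned(utf8(z), utf8(e))));
    System.out.println("signed bytes:   " + Integer.signum(compareSigned(utf8(z), utf8(e))));
  }
}
```

For this pair, the unsigned comparison agrees with the string order while the signed comparison reverses it.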

11755 Commits