druid

Commit Graph

Author	SHA1	Message	Date
Clint Wylie	05b2e967ed	druid nested data column type (#12753 ) * add new druid nested data column type * fixes and such * fixes * adjustments, more tests * self review * oops * fix and test * more better * style	2022-07-14 12:07:23 -07:00
Frank Chen	a544aff761	Document missed simple granularities (#12768 ) * Document missed simple granularities * Update docs/querying/granularities.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/querying/granularities.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2022-07-14 14:02:28 +08:00
zachjsh	c0380e7b0a	* fix duplicate dimension (#12778 )	2022-07-14 10:39:03 +05:30
Victoria Lim	d8f8c56f94	Docs: Index page with all SQL functions (#12771 ) * list of all functions * add function names to spelling file	2022-07-14 09:59:55 +08:00
Clint Wylie	8c33508eaf	run web-console e2e tests for java changes too (#12776 ) * run web-console e2e tests for java changes too, fix travis stages for web e2e and docs jobs * run the script test on script changes	2022-07-13 16:12:57 -07:00
Vadim Ogievetsky	c1c2104bd6	fix ordering in e2e test (#12775 )	2022-07-13 15:08:00 -07:00
Abhishek Agarwal	2ab20c9fc9	Surface more information about task status in tests (#12759 ) I see some test runs failing because task status is not as expected. It will be helpful to know what error the task has.	2022-07-13 14:53:53 +05:30
TSFenwick	8c02880d5f	Emit metrics for distribution of number of rows per segment (#12730 ) * initial commit of bucket dimensions for metrics return counts of segments that have rowcount in a bucket size for a datasource return average value of rowcount per segment in a datasource added unit test naming could use a lot of work buckets right now are not finalized added javadocs altered metrics.md * fix checkstyle issues * addressed review comments add monitor test move added functionality to new monitor update docs * address comments renamed monitor handle tombstones better update docs added javadocs * Add support for tombstones in the segment distribution * undo changes to tombstone segmentizer factory * fix accidental whitespacing changes * address comments regarding metrics documentation and rename variable to be more accurate * fix tests * fix checkstyle issues * fix broken test * undo removal of timeout	2022-07-12 07:04:42 -07:00
Rohan Garg	bb953be09b	Refactor usage of JoinableFactoryWrapper + more test coverage (#12767 ) Refactor usage of JoinableFactoryWrapper to add e2e test for createSegmentMapFn with joinToFilter feature enabled	2022-07-12 06:25:36 -07:00
Karan Kumar	cebf2ba9c7	[Flaky unit test] Adding file based uri. (#12671 ) * Adding file based uri. * Adding the HTTP entity test back	2022-07-11 20:57:22 +05:30
Gian Merlino	97207cdcc7	Automatic sizing for GroupBy dictionaries. (#12763 ) * Automatic sizing for GroupBy dictionary sizes. Merging and selector dictionary sizes currently both default to 100MB. This is not optimal, because it can lead to OOM on small servers and insufficient resource utilization on larger servers. It also invites end users to try to tune it when queries run out of dictionary space, which can make things worse if the end user sets it to too high. So, this patch: - Adds automatic tuning for selector and merge dictionaries. Selectors use up to 15% of the heap and merge buffers use up to 30% of the heap (aggregate across all queries). - Updates out-of-memory error messages to emphasize enabling disk spilling vs. increasing memory parameters. With the memory parameters automatically sized, it is more likely that an end user will get benefit from enabling disk spilling. - Removes the query context parameters that allow lowering of configured dictionary sizes. These complicate the calculation, and I don't see a reasonable use case for them. * Adjust tests. * Review adjustments. * Additional comment. * Remove unused import.	2022-07-11 08:20:50 -07:00
Gian Merlino	d2576584a0	Consolidate the two TaskStatus classes. (#12765 ) * Consolidate the two TaskStatus classes. There are two, but we don't need more than one. * Fix import order.	2022-07-11 07:25:22 -07:00
Tejaswini Bandlamudi	32946216d0	Debugs Flaky License dependency Reports generation (#12744 ) * Surfaces mvn command output in case of failure. * formats output * nit	2022-07-11 14:35:34 +05:30
Gian Merlino	864b77e91a	SpillingGrouper: Make DISK_FULL sticky. (#12764 ) When we return DISK_FULL to a processing thread, it skips the rest of the segment and the query is canceled. However, it's possible that the next segment starts processing before cancellation can kick in. We want that one, if it occurs, to see DISK_FULL too.	2022-07-09 06:45:38 -07:00
Kashif Faraz	8dc4a155c7	Fix flaky IT: ITPerfectRollupParallelBatchIndexTest (#12737 ) * Increase worker.intermediaryPartitionTimeout in ITs to 30 mins * Update timeout to 60 mins * Remove timeout change from indexer	2022-07-09 17:15:51 +05:30
Maytas Monsereenusorn	1558ef471c	Add some debug tips for debugging peons (#12697 ) * add some debug tips * address comments * fix typo	2022-07-09 01:47:25 -07:00
Didip Kerabat	48fd2e6400	Add missing metrics into statsd-reporter. (#12762 )	2022-07-08 23:13:06 -07:00
Gian Merlino	edfbcc8455	Preserve column order in DruidSchema, SegmentMetadataQuery. (#12754 ) * Preserve column order in DruidSchema, SegmentMetadataQuery. Instead of putting columns in alphabetical order. This is helpful because it makes query order better match ingestion order. It also allows tools, like the reindexing flow in the web console, to more easily do follow-on ingestions using a column order that matches the pre-existing column order. We prefer the order from the latest segments. The logic takes all columns from the latest segments in the order they appear, then adds on columns from older segments after those. * Additional test adjustments. * Adjust imports.	2022-07-08 22:04:11 -07:00
Gian Merlino	9c925b4f09	Frame format for data transfer and short-term storage. (#12745 ) * Frame format for data transfer and short-term storage. As we move towards query execution plans that involve more transfer of data between servers, it's important to have a data format that provides for doing this more efficiently than the options available to us today. This patch adds: - Columnar frames, which support fast querying. - Row-based frames, which support fast sorting via memory comparison and fast whole-row copies via memory copying. - Frame files, a container format that can be stored on disk or transferred between servers. The idea is we should use row-based frames when data is expected to be sorted, and columnar frames when data is expected to be queried. The code in this patch is not used in production yet. Therefore, the patch involves minimal changes outside of the org.apache.druid.frame package. The main ones are adjustments to SqlBenchmark to add benchmarks for queries on frames, and the addition of a "forEach" method to Sequence. * Fixes based on tests, static analysis. * Additional fixes. * Skip DS mapping tests on JDK 14+ * Better JDK checking in tests. * Fix imports. * Additional comment. * Adjustments from code review. * Update test case.	2022-07-08 20:42:06 -07:00
Rohan Garg	bcff35f798	Pushdown join filter with right side referencing columns (#12749 )	2022-07-08 19:59:41 +05:30
Gian Merlino	378fea9517	Retain CSP configuration in ServerConfig constructor. (#12755 ) Without this change, CliIndexer would not apply custom CSP headers and would revert to the default.	2022-07-08 19:19:14 +05:30
Jianhuan Liu	4574dea5e9	Use MXBeans to get GC metrics #12476 (#12481 ) * jvm gc to mxbeans * add zgc and shenandoah #12476 * remove tryCreateGcCounter * separate the space collector * blend GcGenerationCollector into GcCollector * add jdk surefire argLine	2022-07-08 14:32:06 +08:00
Gian Merlino	e82890fde4	Mark specific nimbus.lang.tag.version. (#12751 ) * Mark specific nimbus.lang.tag.version. * Add ignoredUnusedDeclaredDependencies.	2022-07-07 09:58:35 +05:30
PJ Fanning	059aba781a	issue-12628: upgrade jetty to 9.4.41.v20210516 due to CVE (#12629 ) * upgrade jetty to 9.4.41.v20210516 due to cve * Update licenses.yaml	2022-07-07 00:20:01 +08:00
Rohan Garg	d732de9948	Allow adding calcite rules from extensions (#12715 ) * Allow adding calcite rules from extensions * fixup! Allow adding calcite rules from extensions * Move Rules to CalciteRulesManager * fixup! Move Rules to CalciteRulesManager	2022-07-06 19:32:35 +05:30
Gian Merlino	49feffff1b	Add comment about double-close in ColumnSelectorColumnIndexSelector. (#12735 )	2022-07-06 00:50:35 -07:00
Jill Osborne	682ea7f32d	IMPLY-12348: Update description of UNION ALL in SQL syntax doc (#12710 ) * IMPLY-12348: Updated description of UNION ALL * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/sql.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update sql.md * Update docs/querying/sql.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2022-07-05 13:08:01 -07:00
Didip Kerabat	06251c5d2a	Add EIGHT_HOUR into possible list of Granularities. (#12717 ) * Add EIGHT_HOUR into possible list of Granularities. * Add the missing definition. * fix test. * Fix another test. * Stylecheck finally passed. Co-authored-by: Didip Kerabat <didip@apple.com>	2022-07-05 11:05:37 -07:00
Gian Merlino	2b330186e2	Mid-level service client and updated high-level clients. (#12696 ) * Mid-level service client and updated high-level clients. Our servers talk to each other over HTTP. We have a low-level HTTP client (HttpClient) that is super-asynchronous and super-customizable through its handlers. It's also proven to be quite robust: we use it for Broker -> Historical communication over the wide variety of query types and workloads we support. But the low-level client has no facilities for service location or retries, which means we have a variety of high-level clients that implement these in their own ways. Some high-level clients do a better job than others. This patch adds a mid-level ServiceClient that makes it easier for high-level clients to be built correctly and harmoniously, and migrates some of the high-level logic to use ServiceClients. Main changes: 1) Add ServiceClient org.apache.druid.rpc package. That package also contains supporting stuff like ServiceLocator and RetryPolicy interfaces, and a DiscoveryServiceLocator based on DruidNodeDiscoveryProvider. 2) Add high-level OverlordClient in org.apache.druid.rpc.indexing. 3) Indexing task client creator in TaskServiceClients. It uses SpecificTaskServiceLocator to find the tasks. This improves on ClientInfoTaskProvider by caching task locations for up to 30 seconds across calls, reducing load on the Overlord. 4) Rework ParallelIndexSupervisorTaskClient to use a ServiceClient instead of extending IndexTaskClient. 5) Rework RemoteTaskActionClient to use a ServiceClient instead of DruidLeaderClient. 6) Rework LocalIntermediaryDataManager, TaskMonitor, and ParallelIndexSupervisorTask. As a result, MiddleManager, Peon, and Overlord no longer need IndexingServiceClient (which internally used DruidLeaderClient). There are some concrete benefits over the prior logic, namely: - DruidLeaderClient does retries in its "go" method, but only retries exactly 5 times, does not sleep between retries, and does not retry retryable HTTP codes like 502, 503, 504. (It only retries IOExceptions.) ServiceClient handles retries in a more reasonable way. - DruidLeaderClient's methods are all synchronous, whereas ServiceClient methods are asynchronous. This is used in one place so far: the SpecificTaskServiceLocator, so we don't need to block a thread trying to locate a task. It can be used in other places in the future. - HttpIndexingServiceClient does not properly handle all server errors. In some cases, it tries to parse a server error as a successful response (for example: in getTaskStatus). - IndexTaskClient currently makes an Overlord call on every task-to-task HTTP request, as a way to find where the target task is. ServiceClient, through SpecificTaskServiceLocator, caches these target locations for a period of time. * Style adjustments. * For the coverage. * Adjustments. * Better behaviors. * Fixes.	2022-07-05 09:43:26 -07:00
Clint Wylie	36e38b319b	add virtual column support to search query (#12720 )	2022-07-04 21:58:10 -07:00
Rohan Garg	97a926fb29	Suppress CVE-2022-33915 (#12740 )	2022-07-04 22:48:08 +05:30
Tejaswini Bandlamudi	d559773a0e	sets Hadoop conf ClassLoader (#12738 )	2022-07-04 17:07:39 +05:30
imply-cheddar	e3128e3fa3	Poison stupid pool (#12646 ) * Poison StupidPool and fix resource leaks There are various resource leaks from test setup as well as some corners in query processing. We poison the StupidPool to start failing tests when the leaks come and fix any issues uncovered from that so that we can start from a clean baseline. Unfortunately, because of how poisoning works, we can only fail future checkouts from the same pool, which means that there is a natural race between a leak happening -> GC occurs -> leak detected -> pool poisoned. This race means that, depending on interleaving of tests, if the very last time that an object is checked out from the pool leaks, then it won't get caught. At some point in the future, something will catch it, however and from that point on it will be deterministic. * Remove various things left over from iterations * Clean up FilterAnalysis and add javadoc on StupidPool * Revert changes to .idea/misc.xml that accidentally got pushed * Style and test branches * Stylistic woes	2022-07-03 14:36:22 -07:00
Clint Wylie	bbbb6e1c3f	fix DruidSchema issue where datasources with no segments can become stuck in tables list indefinitely (#12727 )	2022-07-01 18:54:01 -07:00
Kashif Faraz	f5b5cb93ea	Fix expiry timeout bug in LocalIntermediateDataManager (#12722 ) The expiry timeout is compared against the current time but the condition is reversed. This means that as soon as a supervisor task finishes, its partitions are cleaned up, irrespective of the specified `intermediaryPartitionTimeout` period. After these changes, the `intermediaryPartitionTimeout` will start getting honored. Changes * Fix the condition * Add tests to verify the new correct behaviour * Reduce the default expiry timeout from P1D to PT5M to retain current behaviour in case of default configs.	2022-07-01 16:29:22 +05:30
Clint Wylie	48731710fb	precursor changes for nested columns to minimize files changed (#12714 ) * precursor changes for nested columns to minimize files changed * inspection fix * visibility * adjustment * unecessary change	2022-07-01 02:27:19 -07:00
Clint Wylie	d30efb1c1e	fix bug when rewriting sql virtual column registry (#12718 )	2022-07-01 02:24:00 -07:00
Rohan Garg	c09b5a2294	Fix skipTests build flag (#12716 ) * fix skipTests * Skip console UTs with skipTests * Use skipTests in skip-tests profile	2022-06-29 21:59:26 -07:00
Rui Chen	068bea6334	deps: upgrade mysql-connector-java to v5.1.49 (#12704 )	2022-06-29 23:15:46 +08:00
Abhishek Agarwal	dbd45daf33	Flakiness and exceptions during tests (#12705 )	2022-06-28 10:36:23 +05:30
Paul Rogers	f83fab699e	Add IT-related changes pulled out of PR #12368 (#12673 ) This commit contains changes made to the existing ITs to support the new ITs. Changes: - Make the "custom node role" code usable by the new ITs. - Use flag `-DskipITs` to skips the integration tests but runs unit tests. - Use flag `-DskipUTs` skips unit tests but runs the "new" integration tests. - Expand the existing Druid profile, `-P skip-tests` to skip both ITs and UTs.	2022-06-26 02:13:59 +05:30
Paul Rogers	f7caee3b25	Revert changes from #12672 (#12703 ) * Revert changes from #12672 * Reverted more conflicting changes Changes are not needed given previous reversions.	2022-06-25 09:10:44 +05:30
Gian Merlino	679ccffe0f	Revert "SqlSegmentsMetadataQuery: Fix OVERLAPS for wide target segments. (#12600 )" (#12679 ) This reverts commit `8fbf92e047`.	2022-06-25 09:08:26 +05:30
William Hyun	2aadd69f54	Update ORC to 1.7.5 (#12667 )	2022-06-24 16:08:42 -07:00
Gian Merlino	d5abd06b96	Fix flaky KafkaIndexTaskTest. (#12657 ) * Fix flaky KafkaIndexTaskTest. The testRunTransactionModeRollback case had many race conditions. Most notably, it would commit a transaction and then immediately check to see that the results were not indexed. This is racey because it relied on the indexing thread being slower than the test thread. Now, the case waits for the transaction to be processed by the indexing thread before checking the results. * Changes from review.	2022-06-24 13:53:51 -07:00
Didip Kerabat	6ddb828c7a	Able to filter Cloud objects with glob notation. (#12659 ) In a heterogeneous environment, sometimes you don't have control over the input folder. Upstream can put any folder they want. In this situation the S3InputSource.java is unusable. Most people like me solved it by using Airflow to fetch the full list of parquet files and pass it over to Druid. But doing this explodes the JSON spec. We had a situation where 1 of the JSON spec is 16MB and that's simply too much for Overlord. This patch allows users to pass {"filter": "*.parquet"} and let Druid performs the filtering of the input files. I am using the glob notation to be consistent with the LocalFirehose syntax.	2022-06-24 11:40:08 +05:30
Tejaswini Bandlamudi	1fc2f6e4b0	Throw BadQueryContextException if context params cannot be parsed (#12680 )	2022-06-24 09:21:25 +05:30
Gian Merlino	d29343cbe3	Disable autokill of segments by default. (#12693 ) Also add clarifying commentary to the documentation about how durationToRetain works.	2022-06-23 17:17:11 -07:00
Paul Rogers	ffcb996468	Cleanup changes pulled out of PR #12368 (#12672 ) This commit contains the cleanup needed for the new integration test framework. Changes: - Fix log lines, misspellings, docs, etc. - Allow the use of some of Druid's "JSON config" objects in tests - Fix minor bug in `BaseNodeRoleWatcher`	2022-06-23 23:19:50 +05:30
Jihoon Son	3d9e3dbad9	Fix hadoop library location for integration tests (#12497 )	2022-06-23 10:39:54 -05:00
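
Commit edfbcc8455 (#12754) describes its column-ordering rule in prose: take all columns from the latest segments in the order they appear, then append columns from older segments. A minimal sketch of that rule follows. It is not the DruidSchema code; the class name `ColumnOrderSketch`, the method `mergeColumns`, and the newest-first input ordering are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the ordering rule from commit edfbcc8455; not Druid code.
final class ColumnOrderSketch {
    /**
     * Merges per-segment column lists into one ordered list.
     * Assumption: the outer list is ordered newest segment first.
     */
    static List<String> mergeColumns(List<List<String>> segmentsNewestFirst) {
        // LinkedHashSet keeps first-insertion order and ignores repeats, so a
        // column keeps the position it has in the newest segment that contains it.
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> columns : segmentsNewestFirst) {
            merged.addAll(columns);
        }
        return new ArrayList<>(merged);
    }
}
```

With a newest segment `[__time, page, added]` and an older segment `[__time, user, page]`, the merged order is `[__time, page, added, user]` rather than an alphabetical list.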
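
Commit f5b5cb93ea (#12722) fixes a reversed comparison between an expiry timeout and the current time. The sketch below shows the direction such a check would normally take. It is an assumption-laden illustration, not the LocalIntermediateDataManager code: the class `PartitionExpirySketch`, the method `isExpired`, and the choice of the supervisor finish time as the reference point are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of a timeout check; not Druid code.
final class PartitionExpirySketch {
    /**
     * Returns true only once the timeout has fully elapsed since the reference time.
     * A reversed comparison (now.isBefore(...)) would report "expired" right away,
     * which matches the symptom the commit message describes.
     */
    static boolean isExpired(Instant supervisorFinishedAt, Duration timeout, Instant now) {
        return now.isAfter(supervisorFinishedAt.plus(timeout));
    }
}
```

With a timeout of `PT5M` (the new default named in the commit message), a check one minute after the reference time returns false and a check six minutes after returns true.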
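
Commit 6ddb828c7a (#12659) lets users pass a glob such as `*.parquet` so that Druid filters cloud input files itself. The sketch below shows glob filtering with the JDK's `PathMatcher`. It is not the S3InputSource implementation: the class `GlobFilterSketch`, the method `filterByGlob`, and the decision to match the pattern against the file-name part of each key only are assumptions for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of glob-based filtering; not Druid code.
final class GlobFilterSketch {
    /** Keeps only the object keys whose last path element matches the glob. */
    static List<String> filterByGlob(List<String> objectKeys, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return objectKeys.stream()
            .filter(key -> {
                // Assumption: match on the file name, so "*.parquet" applies at any depth.
                Path name = Paths.get(key).getFileName();
                return name != null && matcher.matches(name);
            })
            .collect(Collectors.toList());
    }
}
```

Given the keys `data/2022/a.parquet`, `data/2022/_SUCCESS`, `data/b.parquet`, and `data/c.json`, the pattern `*.parquet` keeps only the two Parquet keys, so the ingestion spec does not need to list every file.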
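
Commit 2b330186e2 (#12696) notes that the older client did not sleep between retries and did not retry HTTP 502, 503, or 504. The sketch below shows a retry loop that covers both points. It is not the ServiceClient or RetryPolicy from `org.apache.druid.rpc`: the class `RetrySketch`, the `Call` interface, and the fixed sleep interval are hypothetical.

```java
import java.io.IOException;

// Hypothetical illustration of retrying on IOException and on 502/503/504; not Druid code.
final class RetrySketch {
    /** A call that returns an HTTP status code or fails with an IOException. */
    interface Call {
        int execute() throws IOException;
    }

    static boolean isRetryableStatus(int status) {
        return status == 502 || status == 503 || status == 504;
    }

    /** Runs the call up to maxAttempts times, sleeping between attempts. */
    static int callWithRetries(Call call, int maxAttempts, long sleepMillis)
        throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastError = null;
        int lastStatus = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                lastStatus = call.execute();
                lastError = null;
                if (!isRetryableStatus(lastStatus)) {
                    return lastStatus; // success or a non-retryable error
                }
            } catch (IOException e) {
                lastError = e; // connection-level failure, retry
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis); // pause instead of retrying immediately
            }
        }
        if (lastError != null) {
            throw lastError;
        }
        return lastStatus; // still a retryable status after the final attempt
    }
}
```

A fixed sleep keeps the sketch short; a real client would more likely use a growing backoff and an asynchronous API, as the commit message says ServiceClient does.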

12053 Commits

All Branches