druid

Commit Graph

Author	SHA1	Message	Date
imply-cheddar	d2f82f8dd6	Make GCP initialization truly lazy (#14077 ) The GCP initialization pulls credentials for talking to GCP. We want that to only happen when fully required and thus want the GCP-related objects lazily instantiated.	2023-04-12 23:10:50 -07:00
Gian Merlino	81074411a9	MSQ: Support multiple result columns with the same name. (#14025 ) * MSQ: Support multiple result columns with the same name. This is allowed in SQL, and is supported by the regular SQL endpoint. We retain a validation that INSERT ... SELECT does not allow multiple columns with the same name, because column names in segments must be unique.	2023-04-13 11:09:39 +05:30
zachjsh	89bdbdc3ed	Input source security feature should work for MSQ tasks (#14056 ) ### Description Previously msq controller and worker tasks did not have implementations for the `getInputSourceResources()` method. This causes the submission of these tasks to fail if the following auth config is enabled: `druid.auth.enableInputSourceSecurity=true` Added implementations of this method for these tasks that return an empty set of input sources. This means that for these task types, if `druid.auth.enableInputSourceSecurity=true` config is used, the input source types will be properly computed and authorized in the SQL layer, but not if the equivalent controller / worker tasks are submitted to the task endpoint.	2023-04-11 11:36:15 -04:00
zachjsh	2e87b5a901	Input source security sql layer can handle input source with multiple types (#14050 ) ### Description This change allows for input sources used during MSQ ingestion to be authorized for multiple input source types, instead of just 1. Such an input source that allows for multiple types is the CombiningInputSource. Also fixed bug that caused some input source specific functions to be authorized against the permissions ` [ new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ), new ResourceAction(new Resource(ResourceType.EXTERNAL, {input_source_type}), Action.READ) ] ` when the inputSource based authorization feature is enabled, when it should instead be authorized against ` [ new ResourceAction(new Resource(ResourceType.EXTERNAL, {input_source_type}), Action.READ) ] `	2023-04-10 09:48:57 -04:00
Clint Wylie	1aef72aa7e	Bump up the version in pom to 27.0.0 in preparation of release (#14051 )	2023-04-10 14:56:59 +05:30
Gian Merlino	d52bc333aa	Frames: Ensure nulls are read as default values when appropriate. (#14020 ) * Frames: Ensure nulls are read as default values when appropriate. Fixes a bug where LongFieldWriter didn't write a properly transformed zero when writing out a null. This had no meaningful effect in SQL-compatible null handling mode, because the field would get treated as a null anyway. But it does have an effect in default-value mode: it would cause Long.MIN_VALUE to get read out instead of zero. Also adds NullHandling checks to the various frame-based column selectors, allowing reading of nullable frames by servers in default-value mode.	2023-04-10 05:28:46 +05:30
Clint Wylie	a769f14652	fix compile with java 8 (#14045 )	2023-04-07 07:01:38 -07:00
Abhishek Radhakrishnan	f47b05a98c	Hyphenate multi value string for consistency. Fixup extra space in javadoc. (#14043 )	2023-04-07 11:46:07 +05:30
zachjsh	5c0221375c	Allow for Input source security in native task layer (#14003 ) Fixes #13837. ### Description This change allows for input source type security in the native task layer. To enable this feature, the user must set the following property to true: `druid.auth.enableInputSourceSecurity=true` The default value for this property is false, which will continue the existing functionality of needing authorization to write to the respective datasource. When this config is enabled, the users will be required to be authorized for the following resource action, in addition to write permission on the respective datasource. `new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}), Action.READ)` where `{INPUT_SOURCE_TYPE}` is the type of the input source being used: http, inline, s3, etc. Only tasks that provide a non-default implementation of the `getInputSourceResources` method can be submitted when config `druid.auth.enableInputSourceSecurity=true` is set. Otherwise, a 400 error will be thrown.	2023-04-06 13:13:09 -04:00
Paul Rogers	030ed911d4	Temporarily revert extended table functions for Druid 26 (#14019 )	2023-04-05 21:09:33 -07:00
Abhishek Radhakrishnan	b98eed8fb8	Revert quoting lookup fix. (#14034 ) * Revert "Add ANSI_QUOTES propety to DBI init in lookups. (#13826)" This reverts commit `9e9976001c`. * Revert "Quote and escape literals in JDBC lookup to allow reserved identifiers. (#13632)" This reverts commit `41fdf6eafb`. * fix typo.	2023-04-05 20:52:36 -07:00
Gian Merlino	319f99db05	Always use file sizes when determining batch ingest splits (#13955 ) * Always use file sizes when determining batch ingest splits. Main changes: 1) Update CloudObjectInputSource and its subclasses (S3, GCS, Azure, Aliyun OSS) to use SplitHintSpecs in all cases. Previously, they were only used for prefixes, not uris or objects. 2) Update ExternalInputSpecSlicer (MSQ) to consider file size. Previously, file size was ignored; all files were treated as equal weight when determining splits. A side effect of these changes is that we'll make additional network calls to find the sizes of objects when users specify URIs or objects as opposed to prefixes. IMO, this is worth it because it's the only way to respect the user's split hint and task assignment settings. Secondary changes: 1) S3, Aliyun OSS: Use getObjectMetadata instead of listObjects to get metadata for a single object. This is a simpler call that is also expected to be less expensive. 2) Azure: Fix a bug where getBlobLength did not populate blob reference attributes, and therefore would not actually retrieve the blob length. 3) MSQ: Align dynamic slicing logic between ExternalInputSpecSlicer and TableInputSpecSlicer. 4) MSQ: Adjust WorkerInputs to ensure there is always at least one worker, even if it has a nil slice. * Add msqCompatible to testGroupByWithImpossibleTimeFilter. * Fix tests. * Add additional tests. * Remove unused stuff. * Remove more unused stuff. * Adjust thresholds. * Remove irrelevant test. * Fix comments. * Fix bug. * Updates.	2023-04-05 08:54:01 -07:00
Karan Kumar	e6a11707cb	Adding query stack fault to MSQ to capture native query errors. (#13926 ) * Add a new fault "QueryRuntimeError" to MSQ engine to capture native query errors. * Fixed bug in MSQ fault tolerance where worker were being retried if `UnexpectedMultiValueDimensionException` was thrown. * An exception from the query runtime with `org.apache.druid.query` as the package name is thrown as a QueryRuntimeError	2023-04-05 16:29:10 +05:30
Laksh Singla	012b49d5e5	Fix the order of aggregator finalization in GroupByPostShuffleFrameProcessor (MSQ) (#14022 ) * fix the order in which finalization is done * add comment explaining the change * null handling case	2023-04-05 11:04:06 +05:30
Clint Wylie	d21babc5b8	remix nested columns (#14014 ) changes: * introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation * introduce new 'auto' type indexer and merger which produces a new common nested format of columns, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>. * revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility * fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)	2023-04-04 17:51:59 -07:00
Karan Kumar	217b0f6832	Eagerly fetching remote s3 files leading to out of disk (OOD) (#13981 ) * Eagerly fetching remote s3 files leading to OOD.	2023-04-03 14:10:37 +05:30
Clint Wylie	e3211e3be0	actually backwards compatible frontCoded string encoding strategy (#13996 )	2023-03-31 02:24:12 -07:00
zachjsh	3bb67721f7	Allow for Input source security in SQL layer (#13989 ) This change introduces the concept of input source type security model, proposed in #13837. With this change, this feature is only available at the SQL layer, but we will expand to native layer in a follow up PR. To enable this feature, the user must set the following property to true: druid.auth.enableInputSourceSecurity=true The default value for this property is false, which will continue the existing functionality of having the usage of all external sources being authorized against the hardcoded resource action new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ) When this config is enabled, the users will be required to be authorized for the following resource action new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}), Action.READ) where {INPUT_SOURCE_TYPE} is the type of the input source being used: http, inline, s3, etc. Documentation has not been added for the feature as it is not complete at the moment, as we still need to enable this for the native layer in a follow up PR.	2023-03-29 22:15:33 -04:00
frankgrimes97	2f98675285	Tuple sketch SQL support (#13887 ) This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.	2023-03-28 18:47:12 +05:30
Karan Kumar	c2fe6a4956	Reworking s3 connector with various improvements (#13960 ) * Reworking s3 connector with 1. Adding retries 2. Adding max fetch size 3. Using s3Utils for most of the api's 4. Fixing bugs in DurableStorageCleaner 5. Moving to Iterator for listDir call	2023-03-28 17:05:16 +05:30
Rishabh Singh	e8e8082573	Update OIDCConfig with scope information (#13973 ) Allow users to provide custom scope through OIDC configuration	2023-03-28 14:50:00 +05:30
Clint Wylie	d5b1b5bc8e	nested columns + arrays = array columns! (#13803 ) array columns! changes: * add support for storing nested arrays of string, long, and double values as specialized nested columns instead of breaking them into separate element columns * nested column type mimic behavior means that columns ingested with only root arrays of primitive values will be ARRAY typed columns * neat test refactor stuff * add v4 segment test * add array element indexes * add tests for unnest and array columns * fix unnest column value selector cursor handling of null and empty arrays	2023-03-27 12:42:35 -07:00
Gian Merlino	062d72b67e	Add timeout to TaskStartTimeoutFault. (#13970 ) * Add timeout to TaskStartTimeoutFault. Makes the error message a bit more useful. * Update docs.	2023-03-27 23:37:19 +05:30
Atul Mohan	19db32d6b4	Add JWT authenticator support for validating ID Tokens (#13242 ) Expands the OIDC based auth in Druid by adding a JWT Authenticator that validates ID Tokens associated with a request. The existing pac4j authenticator works for authenticating web users while accessing the console, whereas this authenticator is for validating Druid API requests made by Direct clients. Services already supporting OIDC can attach their ID tokens to the Druid requests under the Authorization request header.	2023-03-25 18:41:40 +05:30
Adarsh Sanjeev	7bab407495	Add segment generator counters to MSQ reports (#13909 ) * Add segment generator counters to reports * Remove unneeded annotation * Fix checkstyle and coverage * Add persist and merged as new metrics * Address review comments * Fix checkstyle * Create metrics class to handle updating counters * Address review comments * Add rowsPushed as a new metrics	2023-03-22 09:17:26 -07:00
Clint Wylie	f4392a3155	expression transform improvements and fixes (#13947 ) changes: * fixes inconsistent handling of byte[] values between ExprEval.bestEffortOf and ExprEval.ofType, which could cause byte[] values to end up as java toString values instead of base64 encoded strings in ingest time transforms * improved ExpressionTransform binding to re-use ExprEval.bestEffortOf when evaluating a binding instead of throwing it away * improved ExpressionTransform array handling, added RowFunction.evalDimension that returns List<String> to back Row.getDimension and remove the automatic coercing of array types that would typically happen to expression transforms unless using Row.getDimension * added some tests for ExpressionTransform with array inputs * improved ExpressionPostAggregator to use partial type information from decoration * migrate some test uses of InputBindings.forMap to use other methods	2023-03-21 23:26:53 -07:00
Gian Merlino	1c7a03a47b	Lower default maxRowsInMemory for realtime ingestion. (#13939 ) * Lower default maxRowsInMemory for realtime ingestion. The thinking here is that for best ingestion throughput, we want intermediate persists to be as big as possible without using up all available memory. So, we rely mainly on maxBytesInMemory. The default maxRowsInMemory (1 million) is really just a safety: in case we have a large number of very small rows, we don't want to get overwhelmed by per-row overheads. However, maximum ingestion throughput isn't necessarily the primary goal for realtime ingestion. Query performance is also important. And because query performance is not as good on the in-memory dataset, it's helpful to keep it from growing too large. 150k seems like a reasonable balance here. It means that for a typical 5 million row segment, we won't trigger more than 33 persists due to this limit, which is a reasonable number of persists. * Update tests. * Update server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Fix test. * Fix link. --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-03-21 10:36:36 -07:00
Adarsh Sanjeev	143fdcfacf	Change test name so it triggers in CI (#13844 ) As the name of the class did not end or start with "Test", CalciteSelectQueryMSQTest was not triggered in CI. This PR renames the test.	2023-03-20 15:55:52 +05:30
Karan Kumar	bf13156b55	Regression bug fix where ever LimitFrameProcessor's were used. (#13941 )	2023-03-16 09:18:18 -07:00
Karan Kumar	67df1324ee	Undocumenting certain context parameter in MSQ. (#13928 ) * Removing intermediateSuperSorterStorageMaxLocalBytes, maxInputBytesPerWorker, composedIntermediateSuperSorterStorageEnabled, clusterStatisticsMergeMode from docs * Adding documentation in the context class.	2023-03-16 17:56:44 +05:30
Tejaswini Bandlamudi	6837289cb0	Fixes parquet uint_32 datatype conversion (#13935 ) After parquet ingestion, uint_32 parquet datatypes are stored as null values in the dataSource. This PR fixes this conversion bug.	2023-03-16 15:27:38 +05:30
Gian Merlino	4b1ffbc452	Various changes and fixes to UNNEST. (#13892 ) * Various changes and fixes to UNNEST. Native changes: 1) UnnestDataSource: Replace "column" and "outputName" with "virtualColumn". This enables pushing expressions into the datasource. This in turn allows us to do the next thing... 2) UnnestStorageAdapter: Logically apply query-level filters and virtual columns after the unnest operation. (Physically, filters are pulled up, when possible.) This is beneficial because it allows filters and virtual columns to reference the unnested column, and because it is consistent with how the join datasource works. 3) Various documentation updates, including declaring "unnest" as an experimental feature for now. SQL changes: 1) Rename DruidUnnestRel (& Rule) to DruidUnnestRel (& Rule). The rel is simplified: it only handles the UNNEST part of a correlated join. Constant UNNESTs are handled with regular inline rels. 2) Rework DruidCorrelateUnnestRule to focus on pulling Projects from the left side up above the Correlate. New test testUnnestTwice verifies that this works even when two UNNESTs are stacked on the same table. 3) Include ProjectCorrelateTransposeRule from Calcite to encourage pushing mappings down below the left-hand side of the Correlate. 4) Add a new CorrelateFilterLTransposeRule and CorrelateFilterRTransposeRule to handle pulling Filters up above the Correlate. New tests testUnnestWithFiltersOutside and testUnnestTwiceWithFilters verify this behavior. 5) Require a context feature flag for SQL UNNEST, since it's undocumented. As part of this, also cleaned up how we handle feature flags in SQL. They're now hooked into EngineFeatures, which is useful because not all engines support all features.	2023-03-10 16:42:08 +05:30
Gian Merlino	90d8f67e3d	Avoid creating new RelDataTypeFactory during SQL planning. (#13904 ) * Avoid creating new RelDataTypeFactory during SQL planning. Reduces unnecessary CPU cycles. * Fix.	2023-03-08 21:55:49 -08:00
Laksh Singla	dc67296e9d	Fix for OOM in the Tombstone generating logic in MSQ (#13893 ) fix OOMs using a different logic for generating tombstones --------- Co-authored-by: Paul Rogers <paul-rogers@users.noreply.github.com>	2023-03-08 21:38:08 -08:00
Clint Wylie	c7f4bb5056	fix KafkaInputFormat when used with Sampler API (#13900 ) * fix KafkaInputFormat when used with Sampler API * handle key format sampling the same as value format sampling	2023-03-08 16:23:24 -08:00
Gian Merlino	82f7a56475	Sort-merge join and hash shuffles for MSQ. (#13506 ) * Sort-merge join and hash shuffles for MSQ. The main changes are in the processing, multi-stage-query, and sql modules. processing module: 1) Rename SortColumn to KeyColumn, replace boolean descending with KeyOrder. This makes it nicer to model hash keys, which use KeyOrder.NONE. 2) Add nullability checkers to the FieldReader interface, and an "isPartiallyNullKey" method to FrameComparisonWidget. The join processor uses this to detect null keys. 3) Add WritableFrameChannel.isClosed and OutputChannel.isReadableChannelReady so callers can tell which OutputChannels are ready for reading and which aren't. 4) Specialize FrameProcessors.makeCursor to return FrameCursor, a random-access implementation. The join processor uses this to rewind when it needs to replay a set of rows with a particular key. 5) Add MemoryAllocatorFactory, which is embedded inside FrameWriterFactory instead of a particular MemoryAllocator. This allows FrameWriterFactory to be shared in more scenarios. multi-stage-query module: 1) ShuffleSpec: Add hash-based shuffles. New enum ShuffleKind helps callers figure out what kind of shuffle is happening. The change from SortColumn to KeyColumn allows ClusterBy to be used for both hash-based and sort-based shuffling. 2) WorkerImpl: Add ability to handle hash-based shuffles. Refactor the logic to be more readable by moving the work-order-running code to the inner class RunWorkOrder, and the shuffle-pipeline-building code to the inner class ShufflePipelineBuilder. 3) Add SortMergeJoinFrameProcessor and factory. 4) WorkerMemoryParameters: Adjust logic to reserve space for output frames for hash partitioning. (We need one frame per partition.) sql module: 1) Add sqlJoinAlgorithm context parameter; can be "broadcast" or "sortMerge". With native, it must always be "broadcast", or it's a validation error. MSQ supports both. Default is "broadcast" in both engines. 2) Validate that MSQs do not use broadcast join with RIGHT or FULL join, as results are not correct for broadcast join with those types. Allow this in native for two reasons: legacy (the docs caution against it, but it's always been allowed), and the fact that it actually does generate correct results in native when the join is processed on the Broker. It is much less likely that MSQ will plan in such a way that generates correct results. 3) Remove subquery penalty in DruidJoinQueryRel when using sort-merge join, because subqueries are always required, so there's no reason to penalize them. 4) Move previously-disabled join reordering and manipulation rules to FANCY_JOIN_RULES, and enable them when using sort-merge join. Helps get to better plans where projections and filters are pushed down. * Work around compiler problem. * Updates from static analysis. * Fix @param tag. * Fix declared exception. * Fix spelling. * Minor adjustments. * wip * Merge fixups * fixes * Fix CalciteSelectQueryMSQTest * Empty keys are sortable. * Address comments from code review. Rename mux -> mix. * Restore inspection config. * Restore original doc. * Reorder imports. * Adjustments * Fix. * Fix imports. * Adjustments from review. * Update header. * Adjust docs.	2023-03-08 14:19:39 -08:00
Adarsh Sanjeev	ef82756176	Add validation for aggregations on __time (#13793 ) * Add validation for aggregations on __time	2023-03-07 17:16:36 -08:00
Karan Kumar	94cfabea18	Suggested memory calculation in case NOT_ENOUGH_MEMORY_FAULT is thrown. (#13846 ) * Suggested memory calculation in case NOT_ENOUGH_MEMORY_FAULT is thrown. Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-03-06 18:00:36 +05:30
Karan Kumar	65c3954942	Adding forbidden api for Properties#get() and Properties#getOrDefault() (#13882 ) The Properties#getOrDefault method does not check the default map for values, whereas Properties#getProperty() does.	2023-03-06 10:42:04 +05:30
Rohan Garg	f33898ed6d	Fix durable storage cleanup (#13853 )	2023-03-06 09:49:14 +05:30
Nicholas Lippis	b68180fc44	use getProperty in MSQDurableStorageModule (#13881 )	2023-03-04 11:56:43 -08:00
Anshu Makkar	a10e4150d5	Add Post Aggregators for Tuple Sketches (#13819 ) You can now do the following operations with TupleSketches in Post Aggregation Step Get the Sketch Output as Base64 String Provide a constant Tuple Sketch in post-aggregation step that can be used in Set Operations Get the Estimated Value(Sum) of Summary/Metrics Objects associated with Tuple Sketch	2023-03-03 09:32:09 +05:30
Tejaswini Bandlamudi	7103cb4b9d	Removes FiniteFirehoseFactory and its implementations (#12852 ) The FiniteFirehoseFactory and InputRowParser classes were deprecated in 0.17.0 (#8823) in favor of InputSource & InputFormat. This PR removes the FiniteFirehoseFactory and all its implementations along with classes solely used by them like Fetcher (Used by PrefetchableTextFilesFirehoseFactory). Refactors classes including tests using FiniteFirehoseFactory to use InputSource instead. Removing InputRowParser may not be as trivial as many classes that aren't deprecated depends on it (with no alternatives), like EventReceiverFirehoseFactory. Hence FirehoseFactory, EventReceiverFirehoseFactory, and Firehose are marked deprecated.	2023-03-02 18:07:17 +05:30
Laksh Singla	ca68fd93a6	Generate tombstones when running MSQ's replace (#13706 ) When running REPLACE queries, the segments which contain no data are dropped (marked as unused). This PR aims to generate tombstones in place of segments which contain no data to mark their deletion, as is the behavior with the native ingestion. This will cause InsertCannotReplaceExistingSegmentFault to be removed since it was generated if the interval to be marked unused didn't fully overlap one of the existing segments to replace.	2023-03-01 12:01:30 +05:30
Clint Wylie	1d8fff4096	sampler + type detection = bff (#13711 ) * sampler + type detection = bff * split logical and physical dimensions, tidy up	2023-02-28 04:14:30 -08:00
Gian Merlino	aeb1187a7d	Fix NPE in KinesisSupervisor#setupRecordSupplier. (#13859 ) * Fix NPE in KinesisSupervisor#setupRecordSupplier. PR #13539 refactored record supplier creation and introduced a bug: this method would throw NPE when recordsPerFetch was not provided by the user. recordsPerFetch isn't needed in this context at all, since the supervisor-side supplier doesn't fetch records. So this patch sets it to zero. * Remove unused imports.	2023-02-27 19:55:28 -08:00
Karan Kumar	6bb5effa7b	Better logging for MSQ worker task (#13790 ) * Adding more logs to MSQ worker implementation which makes it easier to debug.	2023-02-26 03:24:24 +05:30
Paul Rogers	914eebb4b7	Wire up the catalog resolver (#13788 ) Introduces the catalog resolver interface Wires the resolver up to the planner factory Refactors planner factory	2023-02-22 11:42:32 -08:00
Abhishek Agarwal	d2dbb8b2c0	Fix infinite checkpointing between tasks and overlord (#13825 ) If the intermediate handoff period is less than the task duration and there is no new data in the input topic, task will continuously checkpoint the same offsets again and again. This PR fixes that bug by resetting the checkpoint time even when the task receives the same end offset request again.	2023-02-22 19:25:59 +05:30
Abhishek Radhakrishnan	9e9976001c	Add ANSI_QUOTES propety to DBI init in lookups. (#13826 )	2023-02-21 15:13:22 -08:00
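
Commit 65c3954942 (#13882) in the table above forbids Properties#get() and Properties#getOrDefault() because they skip the defaults layer of a java.util.Properties object. The sketch below illustrates that JDK behavior in isolation. It is not code from the Druid repository: the class name and the key "example.storage.type" are hypothetical and chosen only for the demonstration.

```java
import java.util.Properties;

public class PropertiesDefaultsDemo {
    // Hypothetical key, used only for illustration.
    static final String KEY = "example.storage.type";

    /** Builds a Properties object whose only value lives in the defaults layer. */
    static Properties layered() {
        Properties defaults = new Properties();
        defaults.setProperty(KEY, "s3");
        // The outer object has no entry of its own; it only wraps the defaults.
        return new Properties(defaults);
    }

    public static void main(String[] args) {
        Properties props = layered();

        // getProperty() falls back to the defaults layer.
        System.out.println(props.getProperty(KEY));              // s3

        // get() and getOrDefault() are inherited map-style lookups
        // and look only at the outer object's own entries.
        System.out.println(props.get(KEY));                      // null
        System.out.println(props.getOrDefault(KEY, "fallback")); // fallback

        // The two-argument getProperty() still consults the defaults first.
        System.out.println(props.getProperty(KEY, "fallback"));  // s3
    }
}
```

This is presumably why commit b68180fc44 (#13881) switches MSQDurableStorageModule to getProperty: a value supplied through a defaults layer would otherwise be silently missed.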


1154 Commits