druid

Commit Graph

Author	SHA1	Message	Date
317brian	9932f2e70a	docs: concurrent append and replace is gA (#17269 )	2024-10-08 07:55:55 +05:30
Atul Mohan	c1f8ae25b5	Support Iceberg ingestion from REST based catalogs (#17124 ) Adds support to the iceberg input source to read from Iceberg REST Catalogs.	2024-09-23 22:13:24 -07:00
Abhishek Radhakrishnan	635e418131	Support to parse numbers in text-based input formats (#17082 ) Text-based input formats like csv and tsv currently parse inputs only as strings, following the RFC4180Parser spec). To workaround this, the web-console and other tools need to further inspect the sample data returned to sample data returned by the Druid sampler API to parse them as numbers. This patch introduces a new optional config, tryParseNumbers, for the csv and tsv input formats. If enabled, any numbers present in the input will be parsed in the following manner -- long data type for integer types and double for floating-point numbers, and if parsing fails for whatever reason, the input is treated as a string. By default, this configuration is set to false, so numeric strings will be treated as strings.	2024-09-19 13:21:18 -07:00
Abhishek Radhakrishnan	39723e5401	Update note about `sys.tasks` table (#17096 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-09-18 11:02:45 -07:00
Katya Macedo	490211f2b1	Docs - update streaming ingestion terminology for Kafka and Kinesis (#17003 )	2024-09-17 09:49:24 -07:00
Victoria Lim	2e2f3cf66a	docs: Refresh docs for SQL input source (#17031 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-09-16 15:52:37 -07:00
George Shiqi Wu	428f58cf15	Support maxColumnsToMerge in supervisor tuningConfig (#17030 ) * support maxColumnsToMerge in supervisor specs * remove log line * fix style * add docs * fix unit tests	2024-09-11 18:00:13 -04:00
Abhishek Radhakrishnan	aa833a711c	Support for reading Delta Lake table snapshots (#17004 ) Problem Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation. Description Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid. In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently.	2024-09-09 14:12:48 +05:30
Virushade	476b205efa	Docs: Fix language in Schema Design docs (#17010 )	2024-09-06 08:48:00 +05:30
Jill Osborne	b4d83a86c2	Middle Manager wording update in docs (#17005 )	2024-09-05 10:25:30 -07:00
Clint Wylie	f8301a314f	generic block compressed complex columns (#16863 ) changes: * Adds new `CompressedComplexColumn`, `CompressedComplexColumnSerializer`, `CompressedComplexColumnSupplier` based on `CompressedVariableSizedBlobColumn` used by JSON columns * Adds `IndexSpec.complexMetricCompression` which can be used to specify compression for the generic compressed complex column. Defaults to uncompressed because compressed columns are not backwards compatible. * Adds new definition of `ComplexMetricSerde.getSerializer` which accepts an `IndexSpec` argument when creating a serializer. The old signature has been marked `@Deprecated` and has a default implementation that returns `null`, but it will be used by the default implementation of the new version if it is implemented to return a non-null value. The default implementation of the new method will use a `CompressedComplexColumnSerializer` if `IndexSpec.complexMetricCompression` is not null/none/uncompressed, or will use `LargeColumnSupportedComplexColumnSerializer` otherwise. * Removed all duplicate generic implementations of `ComplexMetricSerde.getSerializer` and `ComplexMetricSerde.deserializeColumn` into default implementations `ComplexMetricSerde` instead of being copied all over the place. The default implementation of `deserializeColumn` will check if the first byte indicates that the new compression was used, otherwise will use the `GenericIndexed` based supplier. * Complex columns with custom serializers/deserializers are unaffected and may continue doing whatever it is they do, either with specialized compression or whatever else, this new stuff is just to provide generic implementations built around `ObjectStrategy`. * add ObjectStrategy.readRetainsBufferReference so CompressedComplexColumn only copies on read if required * add copyValueOnRead flag down to CompressedBlockReader to avoid buffer duplicate if the value needs copied anyway	2024-08-27 00:34:41 -07:00
Hugh Evans	60d4317968	Linked back to query granularity docs (#16883 ) * Linked back to query granularity docs * Update ingestion-spec.md clairfy about query granularities in the spec. * Update docs/design/storage.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/ingestion/ingestion-spec.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/granularities.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Apply suggestions from code review --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-08-23 08:44:19 -07:00
Gian Merlino	0603d5153d	Segments sorted by non-time columns. (#16849 ) * Segments primarily sorted by non-time columns. Currently, segments are always sorted by __time, followed by the sort order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting by __time enables efficient execution of queries involving time-ordering or granularity. Time-ordering is a simple matter of reading the rows in stored order, and granular cursors can be generated in streaming fashion. However, for various workloads, it's better for storage footprint and query performance to sort by arbitrary orders that do not start with __time. With this patch, users can sort segments by such orders. For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to dimensionsSpec. The "dimensions" list determines the sort order. To define a sort order that includes "__time", users explicitly include a dimension named "__time". For SQL-based ingestion, users set the context parameter "useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then used as the explicit segment sort order. In both cases, when the new "useExplicitSegmentSortOrder" parameter is false (the default), __time is implicitly prepended to the sort order, as it always was prior to this patch. The new parameter is experimental for two main reasons. First, such segments can cause errors when loaded by older servers, due to violating their expectations that timestamps are always monotonically increasing. Second, even on newer servers, not all queries can run on non-time-sorted segments. Scan queries involving time-ordering and any query involving granularity will not run. (To partially mitigate this, a currently-undocumented SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner avoids using "granularity".) Changes on the write path: 1) DimensionsSpec can now optionally contain a __time dimension, which controls the placement of __time in the sort order. If not present, __time is considered to be first in the sort order, as it has always been. 2) IncrementalIndex and IndexMerger are updated to sort facts more flexibly; not always by time first. 3) Metadata (stored in metadata.drd) gains a "sortOrder" field. 4) MSQ can generate range-based shard specs even when not all columns are singly-valued strings. It merely stops accepting new clustering key fields when it encounters the first one that isn't a singly-valued string. This is useful because it enables range shard specs on "someDim" to be created for clauses like "CLUSTERED BY someDim, __time". Changes on the read path: 1) Add StorageAdapter#getSortOrder so query engines can tell how a segment is sorted. 2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter, and VectorCursorGranularizer to throw errors when using granularities on non-time-ordered segments. 3) Update ScanQueryEngine to throw an error when using the time-ordering "order" parameter on non-time-ordered segments. 4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when running on a non-time-ordered segment. 5) Add "sqlUseGranularity" context parameter that causes the SQL planner to avoid using granularities other than ALL. Other changes: 1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions" and change the meaning subtly: it now returns true if the DimensionsSpec represents an unchanging list of dimensions, or false if there is some discovery happening. This is what call sites had expected anyway. * Fixups from CI. * Fixes. * Fix missing arg. * Additional changes. * Fix logic. * Fixes. * Fix test. * Adjust test. * Remove throws. * Fix styles. * Fix javadocs. * Cleanup. * Smoother handling of null ordering. * Fix tests. * Missed a spot on the merge. * Fixups. * Avoid needless Filters.and. * Add timeBoundaryInspector to test. * Fix tests. * Fix FrameStorageAdapterTest. * Fix various tests. * Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder. * Pom fix. * Fix doc.	2024-08-23 08:24:43 -07:00
AmatyaAvadhanula	8c8a4b2302	Remove references to chatAsync (#16950 ) Remove references to chatAsync from Rabbit stream supervisors	2024-08-23 13:21:07 +05:30
Gian Merlino	a83125e4a0	Track IngestionState more accurately in realtime tasks. (#16934 ) Previously, SeekableStreamIndexTaskRunner set ingestion state to COMPLETED when it finished reading data from Kafka. This is incorrect. After the changes in this patch, the transitions go: 1) The task stays in BUILD_SEGMENTS after it finishes reading from Kafka, while it is building its final set of segments to publish. 2) The task transitions to SEGMENT_AVAILABILITY_WAIT after publishing, while waiting for handoff. 3) The task transitions to COMPLETED immediately before exiting, when truly done.	2024-08-22 11:43:46 +05:30
aaronm-bi	ceed4a0634	Docs: Update list of ingestion types that support concurrent append and replace (#16852 )	2024-08-08 08:06:22 +05:30
zachjsh	c324f09108	Kinesis input format docs (#16840 ) * SQL syntax error should target USER persona * * revert change to queryHandler and related tests, based on review comments * * add test * Docs for Kinesis input format * * remove reference to kafka * * fix spellcheck error * Apply suggestions from code review Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-08-06 18:53:10 -04:00
Charles Smith	ed48cb82e9	[Docs} Remove avro_ocf support from Kafka & Kinesis streaming sources (Revert changes from #11865 ) (#16807 )	2024-07-26 13:06:22 -07:00
Clint Wylie	a34a06e192	remove Firehose and FirehoseFactory (#16758 ) changes: * removed `Firehose` and `FirehoseFactory` and remaining implementations which were mostly no longer used after #16602 * Moved `IngestSegmentFirehose` which was still used internally by Hadoop ingestion to `DatasourceRecordReader.SegmentReader` * Rename `SQLFirehoseFactoryDatabaseConnector` to `SQLInputSourceDatabaseConnector` and similar renames for sub-classes * Moved anything remaining in a 'firehose' package somewhere else * Clean up docs on firehose stuff	2024-07-19 14:37:21 -07:00
Vadim Ogievetsky	307b8849de	Web console: better sql data loader reset (#16696 ) * better sql data loader reset * snapshot * fix destination pane sizing * clean doc links * update doc links * more doc links * extract getClusterCapacity * update snapsohts * allow submit suspended * some renaming * diff with current * Do delta	2024-07-11 14:45:04 -07:00
Clint Wylie	37a50e6803	Remove index_realtime and index_realtime_appenderator tasks (#16602 ) index_realtime tasks were removed from the documentation in #13107. Even at that time, they weren't really documented per se— just mentioned. They existed solely to support Tranquility, which is an obsolete ingestion method that predates migration of Druid to ASF and is no longer being maintained. Tranquility docs were also de-linked from the sidebars and the other doc pages in #11134. Only a stub remains, so people with links to the page can see that it's no longer recommended. index_realtime_appenderator tasks existed in the code base, but were never documented, nor as far as I am aware were they used for any purpose. This patch removes both task types completely, as well as removes all supporting code that was otherwise unused. It also updates the stub doc for Tranquility to be firmer that it is not compatible. (Previously, the stub doc said it wasn't recommended, and pointed out that it is built against an ancient 0.9.2 version of Druid.) ITUnionQueryTest has been migrated to the new integration tests framework and updated to use Kafka ingestion. Co-authored-by: Gian Merlino <gianmerlino@gmail.com>	2024-06-24 20:13:33 -07:00
Andreas Maechler	ae70e18bc8	docs: Update Azure extension (#16585 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-06-20 09:31:29 -07:00
317brian	e1926e2549	docs: fix redirect (#16548 ) * doc: cleanup unnecessary redirect (cherry picked from commit d86aaadbc78cc51345f768ee66c9a8b2cbf13f27) * restore redirect file entry. delete md file	2024-06-14 09:54:16 +08:00
Katya Macedo	7aecc09230	Docs: Remove circular link (#16553 )	2024-06-05 11:07:36 -07:00
Charles Smith	b1568fb95b	docs: Adds a redirect for flatten-json which was removed (#16263 )	2024-05-31 16:16:12 -07:00
Katya Macedo	f70ef1f434	Update front coding text (#16491 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-05-31 15:13:10 -07:00
Atul Mohan	b53d75758f	IcebergInputSource : Add option to toggle case sensitivity while reading columns from iceberg catalog (#16496 ) * Toggle case sensitivity while reading columns from iceberg * Fix tests * Drop case check and set unconditionally	2024-05-31 10:18:52 -07:00
George Shiqi Wu	b3b62ac431	Update azure input source docs (#16508 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-05-29 10:00:46 -07:00
Abhishek Radhakrishnan	1d7595f3f7	Support for filters in the Druid Delta Lake connector (#16288 ) * Delta Lake support for filters. * Updates * cleanup comments * Docs * Remmove Enclosed runner * Rename * Cleanup test * Serde test for the Delta input source and fix jackson annotation. * Updates and docs. * Update error messages to be clearer * Fixes * Handle NumberFormatException to provide a nicer error message. * Apply suggestions from code review Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Doc fixes based on feedback * Yes -> yes in docs; reword slightly. * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Documentation, javadoc and more updates. * Not with an or expression end-to-end test. * Break up =, >, >=, <, <= into its own types instead of sub-classing. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2024-04-29 11:31:36 -07:00
Adithya Chakilam	f8015eb02a	Add config lagAggregate to LagBasedAutoScalerConfig (#16334 ) Changes: - Add new config `lagAggregate` to `LagBasedAutoScalerConfig` - Add field `aggregateForScaling` to `LagStats` - Use the new field/config to determine which aggregate to use to compute lag - Remove method `Supervisor.computeLagForAutoScaler()`	2024-04-29 22:20:41 +05:30
Katya Macedo	ceb6646dec	Add supervisor actions (#16276 ) * Add supervisor actions * Update text * Update text * Update after review * Update after review	2024-04-24 13:14:01 -07:00
Katya Macedo	7f06a53cb1	[Docs] Fix API placeholder formatting (#16240 )	2024-04-12 09:19:13 -07:00
YongGang	da9feb4430	Introduce TaskContextReport for reporting task context (#16041 ) Changes: - Add `TaskContextEnricher` interface to improve task management and monitoring - Invoke `enrichContext` in `TaskQueue.add()` whenever a new task is submitted to the Overlord - Add `TaskContextReport` to write out task context information in reports	2024-04-12 08:57:49 +05:30
Adithya Chakilam	463010bb29	Populate segment stats for non-parallel compaction jobs (#16171 ) * Populate segment stats for non-parallel compaction jobs * fix * add-tests * comments * update-test * comments	2024-03-29 09:40:55 -04:00
Katya Macedo	6f6f86c325	Update `maxRowsInMemory` and `maxBytesInMemory` description (#16104 )	2024-03-12 14:40:15 -07:00
Jill Osborne	67ae0ff450	Update docs for rabbit community extension (#16069 ) * Updated docs for rabbit community extension * Updated after review	2024-03-07 11:29:53 -08:00
Adithya Chakilam	564c44ed85	Add stats segmentsRead and segmentsPublished to compaction task reports (#15947 ) Changes: - Add visibility into number of segments read/published by each parallel compaction - Add new fields `segmentsRead`, `segmentsPublished` to `IngestionStatsAndErrorsTaskReportData` - Update `ParallelIndexSupervisorTask` to populate the new stats	2024-03-07 09:37:23 +05:30
Adithya Chakilam	ec52f686c0	Fix compaction tasks reports getting overwritten (#15981 ) * Fix compaction tasks reports geting overwrittened * only skip for compactiont task * address comments * fix boolean * move boolean flag to task rather than spec * rename variable * add docs, fix missing case * Update docs/ingestion/tasks.md * rename var * add task report decode check in IT * change assert	2024-03-04 10:10:17 -05:00
Benjamin Hopp	ebb7190545	Docs: Change single-dim to hashed in example for index task (#15529 )	2024-02-26 09:16:10 +05:30
Adithya Chakilam	1f443d218c	Enable partition stats on streaming task completion report (#15930 ) Changes: - Add visibility into number of records processed by each streaming task per partition - Add field `recordsProcessed` to `IngestionStatsAndErrorsTaskReportData` - Populate number of records processed per partition in `SeekableStreamIndexTaskRunner`	2024-02-23 16:29:03 +05:30
Jamie	80942d5754	Feature: add support for ingesting from rabbitmq super streams (#14137 ) * Add support for ingesting from Rabbit MQ Super Streams	2024-02-22 10:50:37 +05:30
Peter Marshall	cae9cbd7d7	Update tasks.md (#15887 ) Remove erroneous white space causing render issues on this page.	2024-02-13 05:20:09 -08:00
Katya Macedo	0f29ece6a9	[Docs] Refactor streaming ingestion section (#15591 ) Merging the work so far. @ektravel , @vogievetsky if there are additional improvements, let's track them & make another pr. * Refactor streaming ingestion docs * Update property definition * Update after review * Update known issues * Move kinesis and kafka topics to ingestion, add redirects * Saving changes * Saving * Add input format text * Update after review * Minor text edit * Update example syntax * Revert back to colon * Fix merge conflicts * Fix broken links * Fix spelling error	2024-02-12 13:52:42 -08:00
Vadim Ogievetsky	f2b242b6e6	update console to core Druid changes (#15854 )	2024-02-07 19:44:25 +05:30
Atul Mohan	2e46a98024	Add range filtering support for iceberg ingestion (#15782 ) * Add range filtering support for iceberg ingestion * Docs formatting * Spelling	2024-02-01 23:32:30 -08:00
Aru Raghuwanshi	223f29d64c	Update input-sources.md for fixing the warehouse path example under S3 (#15823 )	2024-02-01 23:32:05 -08:00
317brian	6d617c34d2	docs: revise concurrent append and replace (#15760 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-02-01 11:03:36 -08:00
Abhishek Radhakrishnan	9f95a691f7	Extension to read and ingest Delta Lake tables (#15755 ) * something * test commit * compilation fix * more compilation fixes (fixme placeholders) * Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake Will need to sort out the dependencies later. * checkpoint * remove snapshot schema since we can get schema from the row * iterator bug fix * json json json * sampler flow * empty impls for read(InputStats) and sample() * conversion? * conversion, without timestamp * Web console changes to show Delta Lake * Asset bug fix and tile load * Add missing pieces to input source info, etc. * fix stuff * Use a different delta lake asset * Delta lake extension dependencies * Cleanup * Add InputSource, module init and helper code to process delta files. * Test init * Checkpoint changes * Test resources and updates * some fixes * move to the correct package * More tests * Test cleanup * TODOs * Test updates * requirements and javadocs * Adjust dependencies * Update readme * Bump up version * fixup typo in deps * forbidden api and checkstyle checks * Trim down dependencies * new lines * Fixup Intellij inspections. * Add equals() and hashCode() * chain splits, intellij inspections * review comments and todo placeholder * fix up some docs * null table path and test dependencies. Fixup broken link. * run prettify * Different test; fixes * Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests * yank the old test resource. * add a couple of sad path tests * Updates to readme based on latest. * Version support * Extract Delta DateTime converstions to DeltaTimeUtils class and add test * More comprehensive split tests. * Some test renames. * Cleanup and update instructions. * add pruneSchema() optimization for table scans. * Oops, missed the parquet files. * Update default table and rename schema constants. * Test setup and misc changes. * Add class loader logic as the context class loader is unaware about extension classes * change some table client creation logic. * Add hadoop-aws, hadoop-common and related exclusions. * Remove org.apache.hadoop:hadoop-common * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Add entry to .spelling to fix docs static check --------- Co-authored-by: abhishekagarwal87 <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-01-30 21:53:50 -08:00
Benjamin Hopp	6177f6efd7	Fixing formatting of Iceberg Catalog Object (#15748 )	2024-01-30 20:17:38 -08:00
George Shiqi Wu	3e512249e3	Azure multi read options (#15630 ) * Include new dependencies * Mostly implemented * More azure fixes * Tests passing * Unit tests running * Test running after removing storage exception * Happy with coverage now * Add more tests * fix client factory * cleanup from testing * Remove old client * update docs * Exclude from spellcheck * Add licenses * Fix identity version * Save work * Add azure clients * add licenses * typos * Add dependencies * Exception is not thrown * Fix intellij check * Don't need to override * specify length * urldecode * encode path * Fix checks * Revert urlencode changes * Urlencode with azure library * Update docs/development/extensions-core/azure.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * PR changes * Update docs/development/extensions-core/azure.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Add config for multiple storage accounts * Deprecate AzureTaskLogsConfig.maxRetries * Clean up azure retry block * logic update to reuse clients * fix comments * Create container conditionally * Fix key auth * save work * Fix unit tests * Revert old azure input type * Separate input source * save work * Add support for app registrations * Fix unit tests * clean up spacing * Add coverage * fixes from testing * cleanup some caching behavior * Add docs * Fix spelling issues * fix more spelling errors' * Fix intellij inspections * add simple changes from pr * save work on fixing bug * Fix unit tests * Add more testing * Fix unit test * Add tests * Add annotation for azureStorage * Fix up docs * Add comment for list method * Fix tests * Remove uneeded toString * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Update docs/ingestion/input-sources.md Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * PR changes * fix injection of StorageConnector * Fix checkstyle * clean up unit tests * More pr fixes --------- Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-01-25 13:29:16 -05:00

1 2 3 4 5 ...

253 Commits