druid

Commit Graph

Author	SHA1	Message	Date
Akshat Jain	17215cd677	Remove support for Java 8 (#17466 ) All JDK 8 based CI checks have been removed. Images used in Dockerfile(s) have been updated to Java 17 based images. Documentation has been updated accordingly.	2024-11-21 15:33:08 +05:30
Adithya Chakilam	6f436301be	supervisor: make rejection periods work with stopTasksCount (#17442 ) * kafka-indexing: Report consumer io time * commit * backward * tests * remove unwanted changes * comments * comments * coverage * change name * fixes * fixes * comments	2024-11-18 13:12:24 -08:00
Shekhar Prasad Rajak	ae049a4bab	AWS Glue Catalog for Iceberg ingest extension (#17392 ) * iceberg glue catalog dependencies added * GlueIcebergCatalog added in druid module * default version of iceberg glue catalog implementation - basics * basic tests added * removed dependecy iceberg-aws-bundle * glue catalog support - docs update for iceberg * Update IcebergDruidModule.java * Update IcebergDruidModule.java * updates in dependencies and warehousePath must be under catalogProp * removed some dependencies - which not required * only glue sdk added * update license * avro exclusion removed * doc update * doc update * set the type to glue * minor change * minor change * fixing codestyle * checkstyle fixes * checkstyle fixes * checkstyle fixes * dependency check fixes * update pom for ignore warning for glue catalog * compile scope needed - iceberg-aws and awssdk * updates pom with comment * minor change * mvn dependency check in iceberg extension * revert pom.xml changes * aws sdk sts and s3 for gluecatalog initialize * dependency check - ignore aws sdk s3 and sts --------- Co-authored-by: SHEKHAR PRASAD RAJAK <shekhar_rajak@apple.com>	2024-11-10 18:43:55 -08:00
317brian	d1b81f312a	docs: msq autocompaction (#16681 ) Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: Vishesh Garg <vishesh.garg@imply.io> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-10-17 10:40:53 -07:00
317brian	9932f2e70a	docs: concurrent append and replace is gA (#17269 )	2024-10-08 07:55:55 +05:30
Atul Mohan	c1f8ae25b5	Support Iceberg ingestion from REST based catalogs (#17124 ) Adds support to the iceberg input source to read from Iceberg REST Catalogs.	2024-09-23 22:13:24 -07:00
Abhishek Radhakrishnan	635e418131	Support to parse numbers in text-based input formats (#17082 ) Text-based input formats like csv and tsv currently parse inputs only as strings, following the RFC4180Parser spec). To workaround this, the web-console and other tools need to further inspect the sample data returned to sample data returned by the Druid sampler API to parse them as numbers. This patch introduces a new optional config, tryParseNumbers, for the csv and tsv input formats. If enabled, any numbers present in the input will be parsed in the following manner -- long data type for integer types and double for floating-point numbers, and if parsing fails for whatever reason, the input is treated as a string. By default, this configuration is set to false, so numeric strings will be treated as strings.	2024-09-19 13:21:18 -07:00
Abhishek Radhakrishnan	39723e5401	Update note about `sys.tasks` table (#17096 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-09-18 11:02:45 -07:00
Katya Macedo	490211f2b1	Docs - update streaming ingestion terminology for Kafka and Kinesis (#17003 )	2024-09-17 09:49:24 -07:00
Victoria Lim	2e2f3cf66a	docs: Refresh docs for SQL input source (#17031 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-09-16 15:52:37 -07:00
George Shiqi Wu	428f58cf15	Support maxColumnsToMerge in supervisor tuningConfig (#17030 ) * support maxColumnsToMerge in supervisor specs * remove log line * fix style * add docs * fix unit tests	2024-09-11 18:00:13 -04:00
Abhishek Radhakrishnan	aa833a711c	Support for reading Delta Lake table snapshots (#17004 ) Problem Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation. Description Add support for reading Delta snapshot. By default, the Druid-Delta connector reads the latest snapshot of the Delta table in order to preserve compatibility. Users can specify a snapshotVersion to ingest change data events from Delta tables into Druid. In the future, we can also add support for time-based snapshot reads. The Delta API to read time-based snapshots is not clear currently.	2024-09-09 14:12:48 +05:30
Virushade	476b205efa	Docs: Fix language in Schema Design docs (#17010 )	2024-09-06 08:48:00 +05:30
Jill Osborne	b4d83a86c2	Middle Manager wording update in docs (#17005 )	2024-09-05 10:25:30 -07:00
Clint Wylie	f8301a314f	generic block compressed complex columns (#16863 ) changes: * Adds new `CompressedComplexColumn`, `CompressedComplexColumnSerializer`, `CompressedComplexColumnSupplier` based on `CompressedVariableSizedBlobColumn` used by JSON columns * Adds `IndexSpec.complexMetricCompression` which can be used to specify compression for the generic compressed complex column. Defaults to uncompressed because compressed columns are not backwards compatible. * Adds new definition of `ComplexMetricSerde.getSerializer` which accepts an `IndexSpec` argument when creating a serializer. The old signature has been marked `@Deprecated` and has a default implementation that returns `null`, but it will be used by the default implementation of the new version if it is implemented to return a non-null value. The default implementation of the new method will use a `CompressedComplexColumnSerializer` if `IndexSpec.complexMetricCompression` is not null/none/uncompressed, or will use `LargeColumnSupportedComplexColumnSerializer` otherwise. * Removed all duplicate generic implementations of `ComplexMetricSerde.getSerializer` and `ComplexMetricSerde.deserializeColumn` into default implementations `ComplexMetricSerde` instead of being copied all over the place. The default implementation of `deserializeColumn` will check if the first byte indicates that the new compression was used, otherwise will use the `GenericIndexed` based supplier. * Complex columns with custom serializers/deserializers are unaffected and may continue doing whatever it is they do, either with specialized compression or whatever else, this new stuff is just to provide generic implementations built around `ObjectStrategy`. * add ObjectStrategy.readRetainsBufferReference so CompressedComplexColumn only copies on read if required * add copyValueOnRead flag down to CompressedBlockReader to avoid buffer duplicate if the value needs copied anyway	2024-08-27 00:34:41 -07:00
Hugh Evans	60d4317968	Linked back to query granularity docs (#16883 ) * Linked back to query granularity docs * Update ingestion-spec.md clairfy about query granularities in the spec. * Update docs/design/storage.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/ingestion/ingestion-spec.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/querying/granularities.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Apply suggestions from code review --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2024-08-23 08:44:19 -07:00
Gian Merlino	0603d5153d	Segments sorted by non-time columns. (#16849 ) * Segments primarily sorted by non-time columns. Currently, segments are always sorted by __time, followed by the sort order provided by the user via dimensionsSpec or CLUSTERED BY. Sorting by __time enables efficient execution of queries involving time-ordering or granularity. Time-ordering is a simple matter of reading the rows in stored order, and granular cursors can be generated in streaming fashion. However, for various workloads, it's better for storage footprint and query performance to sort by arbitrary orders that do not start with __time. With this patch, users can sort segments by such orders. For spec-based ingestion, users add "useExplicitSegmentSortOrder: true" to dimensionsSpec. The "dimensions" list determines the sort order. To define a sort order that includes "__time", users explicitly include a dimension named "__time". For SQL-based ingestion, users set the context parameter "useExplicitSegmentSortOrder: true". The CLUSTERED BY clause is then used as the explicit segment sort order. In both cases, when the new "useExplicitSegmentSortOrder" parameter is false (the default), __time is implicitly prepended to the sort order, as it always was prior to this patch. The new parameter is experimental for two main reasons. First, such segments can cause errors when loaded by older servers, due to violating their expectations that timestamps are always monotonically increasing. Second, even on newer servers, not all queries can run on non-time-sorted segments. Scan queries involving time-ordering and any query involving granularity will not run. (To partially mitigate this, a currently-undocumented SQL feature "sqlUseGranularity" is provided. When set to false the SQL planner avoids using "granularity".) Changes on the write path: 1) DimensionsSpec can now optionally contain a __time dimension, which controls the placement of __time in the sort order. If not present, __time is considered to be first in the sort order, as it has always been. 2) IncrementalIndex and IndexMerger are updated to sort facts more flexibly; not always by time first. 3) Metadata (stored in metadata.drd) gains a "sortOrder" field. 4) MSQ can generate range-based shard specs even when not all columns are singly-valued strings. It merely stops accepting new clustering key fields when it encounters the first one that isn't a singly-valued string. This is useful because it enables range shard specs on "someDim" to be created for clauses like "CLUSTERED BY someDim, __time". Changes on the read path: 1) Add StorageAdapter#getSortOrder so query engines can tell how a segment is sorted. 2) Update QueryableIndexStorageAdapter, IncrementalIndexStorageAdapter, and VectorCursorGranularizer to throw errors when using granularities on non-time-ordered segments. 3) Update ScanQueryEngine to throw an error when using the time-ordering "order" parameter on non-time-ordered segments. 4) Update TimeBoundaryQueryRunnerFactory to perform a segment scan when running on a non-time-ordered segment. 5) Add "sqlUseGranularity" context parameter that causes the SQL planner to avoid using granularities other than ALL. Other changes: 1) Rename DimensionsSpec "hasCustomDimensions" to "hasFixedDimensions" and change the meaning subtly: it now returns true if the DimensionsSpec represents an unchanging list of dimensions, or false if there is some discovery happening. This is what call sites had expected anyway. * Fixups from CI. * Fixes. * Fix missing arg. * Additional changes. * Fix logic. * Fixes. * Fix test. * Adjust test. * Remove throws. * Fix styles. * Fix javadocs. * Cleanup. * Smoother handling of null ordering. * Fix tests. * Missed a spot on the merge. * Fixups. * Avoid needless Filters.and. * Add timeBoundaryInspector to test. * Fix tests. * Fix FrameStorageAdapterTest. * Fix various tests. * Use forceSegmentSortByTime instead of useExplicitSegmentSortOrder. * Pom fix. * Fix doc.	2024-08-23 08:24:43 -07:00
AmatyaAvadhanula	8c8a4b2302	Remove references to chatAsync (#16950 ) Remove references to chatAsync from Rabbit stream supervisors	2024-08-23 13:21:07 +05:30
Gian Merlino	a83125e4a0	Track IngestionState more accurately in realtime tasks. (#16934 ) Previously, SeekableStreamIndexTaskRunner set ingestion state to COMPLETED when it finished reading data from Kafka. This is incorrect. After the changes in this patch, the transitions go: 1) The task stays in BUILD_SEGMENTS after it finishes reading from Kafka, while it is building its final set of segments to publish. 2) The task transitions to SEGMENT_AVAILABILITY_WAIT after publishing, while waiting for handoff. 3) The task transitions to COMPLETED immediately before exiting, when truly done.	2024-08-22 11:43:46 +05:30
aaronm-bi	ceed4a0634	Docs: Update list of ingestion types that support concurrent append and replace (#16852 )	2024-08-08 08:06:22 +05:30
zachjsh	c324f09108	Kinesis input format docs (#16840 ) * SQL syntax error should target USER persona * * revert change to queryHandler and related tests, based on review comments * * add test * Docs for Kinesis input format * * remove reference to kafka * * fix spellcheck error * Apply suggestions from code review Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-08-06 18:53:10 -04:00
Charles Smith	ed48cb82e9	[Docs} Remove avro_ocf support from Kafka & Kinesis streaming sources (Revert changes from #11865 ) (#16807 )	2024-07-26 13:06:22 -07:00
Clint Wylie	a34a06e192	remove Firehose and FirehoseFactory (#16758 ) changes: * removed `Firehose` and `FirehoseFactory` and remaining implementations which were mostly no longer used after #16602 * Moved `IngestSegmentFirehose` which was still used internally by Hadoop ingestion to `DatasourceRecordReader.SegmentReader` * Rename `SQLFirehoseFactoryDatabaseConnector` to `SQLInputSourceDatabaseConnector` and similar renames for sub-classes * Moved anything remaining in a 'firehose' package somewhere else * Clean up docs on firehose stuff	2024-07-19 14:37:21 -07:00
Vadim Ogievetsky	307b8849de	Web console: better sql data loader reset (#16696 ) * better sql data loader reset * snapshot * fix destination pane sizing * clean doc links * update doc links * more doc links * extract getClusterCapacity * update snapsohts * allow submit suspended * some renaming * diff with current * Do delta	2024-07-11 14:45:04 -07:00
Clint Wylie	37a50e6803	Remove index_realtime and index_realtime_appenderator tasks (#16602 ) index_realtime tasks were removed from the documentation in #13107. Even at that time, they weren't really documented per se— just mentioned. They existed solely to support Tranquility, which is an obsolete ingestion method that predates migration of Druid to ASF and is no longer being maintained. Tranquility docs were also de-linked from the sidebars and the other doc pages in #11134. Only a stub remains, so people with links to the page can see that it's no longer recommended. index_realtime_appenderator tasks existed in the code base, but were never documented, nor as far as I am aware were they used for any purpose. This patch removes both task types completely, as well as removes all supporting code that was otherwise unused. It also updates the stub doc for Tranquility to be firmer that it is not compatible. (Previously, the stub doc said it wasn't recommended, and pointed out that it is built against an ancient 0.9.2 version of Druid.) ITUnionQueryTest has been migrated to the new integration tests framework and updated to use Kafka ingestion. Co-authored-by: Gian Merlino <gianmerlino@gmail.com>	2024-06-24 20:13:33 -07:00
Andreas Maechler	ae70e18bc8	docs: Update Azure extension (#16585 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-06-20 09:31:29 -07:00
317brian	e1926e2549	docs: fix redirect (#16548 ) * doc: cleanup unnecessary redirect (cherry picked from commit d86aaadbc78cc51345f768ee66c9a8b2cbf13f27) * restore redirect file entry. delete md file	2024-06-14 09:54:16 +08:00
Katya Macedo	7aecc09230	Docs: Remove circular link (#16553 )	2024-06-05 11:07:36 -07:00
Charles Smith	b1568fb95b	docs: Adds a redirect for flatten-json which was removed (#16263 )	2024-05-31 16:16:12 -07:00
Katya Macedo	f70ef1f434	Update front coding text (#16491 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2024-05-31 15:13:10 -07:00
Atul Mohan	b53d75758f	IcebergInputSource : Add option to toggle case sensitivity while reading columns from iceberg catalog (#16496 ) * Toggle case sensitivity while reading columns from iceberg * Fix tests * Drop case check and set unconditionally	2024-05-31 10:18:52 -07:00
George Shiqi Wu	b3b62ac431	Update azure input source docs (#16508 ) Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>	2024-05-29 10:00:46 -07:00
Abhishek Radhakrishnan	1d7595f3f7	Support for filters in the Druid Delta Lake connector (#16288 ) * Delta Lake support for filters. * Updates * cleanup comments * Docs * Remmove Enclosed runner * Rename * Cleanup test * Serde test for the Delta input source and fix jackson annotation. * Updates and docs. * Update error messages to be clearer * Fixes * Handle NumberFormatException to provide a nicer error message. * Apply suggestions from code review Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> * Doc fixes based on feedback * Yes -> yes in docs; reword slightly. * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Update docs/ingestion/input-sources.md Co-authored-by: Laksh Singla <lakshsingla@gmail.com> * Documentation, javadoc and more updates. * Not with an or expression end-to-end test. * Break up =, >, >=, <, <= into its own types instead of sub-classing. --------- Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2024-04-29 11:31:36 -07:00
Adithya Chakilam	f8015eb02a	Add config lagAggregate to LagBasedAutoScalerConfig (#16334 ) Changes: - Add new config `lagAggregate` to `LagBasedAutoScalerConfig` - Add field `aggregateForScaling` to `LagStats` - Use the new field/config to determine which aggregate to use to compute lag - Remove method `Supervisor.computeLagForAutoScaler()`	2024-04-29 22:20:41 +05:30
Katya Macedo	ceb6646dec	Add supervisor actions (#16276 ) * Add supervisor actions * Update text * Update text * Update after review * Update after review	2024-04-24 13:14:01 -07:00
Katya Macedo	7f06a53cb1	[Docs] Fix API placeholder formatting (#16240 )	2024-04-12 09:19:13 -07:00
YongGang	da9feb4430	Introduce TaskContextReport for reporting task context (#16041 ) Changes: - Add `TaskContextEnricher` interface to improve task management and monitoring - Invoke `enrichContext` in `TaskQueue.add()` whenever a new task is submitted to the Overlord - Add `TaskContextReport` to write out task context information in reports	2024-04-12 08:57:49 +05:30
Adithya Chakilam	463010bb29	Populate segment stats for non-parallel compaction jobs (#16171 ) * Populate segment stats for non-parallel compaction jobs * fix * add-tests * comments * update-test * comments	2024-03-29 09:40:55 -04:00
Katya Macedo	6f6f86c325	Update `maxRowsInMemory` and `maxBytesInMemory` description (#16104 )	2024-03-12 14:40:15 -07:00
Jill Osborne	67ae0ff450	Update docs for rabbit community extension (#16069 ) * Updated docs for rabbit community extension * Updated after review	2024-03-07 11:29:53 -08:00
Adithya Chakilam	564c44ed85	Add stats segmentsRead and segmentsPublished to compaction task reports (#15947 ) Changes: - Add visibility into number of segments read/published by each parallel compaction - Add new fields `segmentsRead`, `segmentsPublished` to `IngestionStatsAndErrorsTaskReportData` - Update `ParallelIndexSupervisorTask` to populate the new stats	2024-03-07 09:37:23 +05:30
Adithya Chakilam	ec52f686c0	Fix compaction tasks reports getting overwritten (#15981 ) * Fix compaction tasks reports geting overwrittened * only skip for compactiont task * address comments * fix boolean * move boolean flag to task rather than spec * rename variable * add docs, fix missing case * Update docs/ingestion/tasks.md * rename var * add task report decode check in IT * change assert	2024-03-04 10:10:17 -05:00
Benjamin Hopp	ebb7190545	Docs: Change single-dim to hashed in example for index task (#15529 )	2024-02-26 09:16:10 +05:30
Adithya Chakilam	1f443d218c	Enable partition stats on streaming task completion report (#15930 ) Changes: - Add visibility into number of records processed by each streaming task per partition - Add field `recordsProcessed` to `IngestionStatsAndErrorsTaskReportData` - Populate number of records processed per partition in `SeekableStreamIndexTaskRunner`	2024-02-23 16:29:03 +05:30
Jamie	80942d5754	Feature: add support for ingesting from rabbitmq super streams (#14137 ) * Add support for ingesting from Rabbit MQ Super Streams	2024-02-22 10:50:37 +05:30
Peter Marshall	cae9cbd7d7	Update tasks.md (#15887 ) Remove erroneous white space causing render issues on this page.	2024-02-13 05:20:09 -08:00
Katya Macedo	0f29ece6a9	[Docs] Refactor streaming ingestion section (#15591 ) Merging the work so far. @ektravel , @vogievetsky if there are additional improvements, let's track them & make another pr. * Refactor streaming ingestion docs * Update property definition * Update after review * Update known issues * Move kinesis and kafka topics to ingestion, add redirects * Saving changes * Saving * Add input format text * Update after review * Minor text edit * Update example syntax * Revert back to colon * Fix merge conflicts * Fix broken links * Fix spelling error	2024-02-12 13:52:42 -08:00
Vadim Ogievetsky	f2b242b6e6	update console to core Druid changes (#15854 )	2024-02-07 19:44:25 +05:30
Atul Mohan	2e46a98024	Add range filtering support for iceberg ingestion (#15782 ) * Add range filtering support for iceberg ingestion * Docs formatting * Spelling	2024-02-01 23:32:30 -08:00
Aru Raghuwanshi	223f29d64c	Update input-sources.md for fixing the warehouse path example under S3 (#15823 )	2024-02-01 23:32:05 -08:00

1 2 3 4 5 ...

257 Commits