druid

Commit Graph

Author	SHA1	Message	Date
Gian Merlino	1c7a03a47b	Lower default maxRowsInMemory for realtime ingestion. (#13939 ) * Lower default maxRowsInMemory for realtime ingestion. The thinking here is that for best ingestion throughput, we want intermediate persists to be as big as possible without using up all available memory. So, we rely mainly on maxBytesInMemory. The default maxRowsInMemory (1 million) is really just a safety: in case we have a large number of very small rows, we don't want to get overwhelmed by per-row overheads. However, maximum ingestion throughput isn't necessarily the primary goal for realtime ingestion. Query performance is also important. And because query performance is not as good on the in-memory dataset, it's helpful to keep it from growing too large. 150k seems like a reasonable balance here. It means that for a typical 5 million row segment, we won't trigger more than 33 persists due to this limit, which is a reasonable number of persists. * Update tests. * Update server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Fix test. * Fix link. --------- Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2023-03-21 10:36:36 -07:00
Clint Wylie	c7f4bb5056	fix KafkaInputFormat when used with Sampler API (#13900 ) * fix KafkaInputFormat when used with Sampler API * handle key format sampling the same as value format sampling	2023-03-08 16:23:24 -08:00
Karan Kumar	65c3954942	Adding forbidden api for Properties#get() and Properties#getOrDefault() (#13882 ) Properties#getOrDefault method does not check the default map for values whereas Properties#getProperty() does.	2023-03-06 10:42:04 +05:30
Tejaswini Bandlamudi	7103cb4b9d	Removes FiniteFirehoseFactory and its implementations (#12852 ) The FiniteFirehoseFactory and InputRowParser classes were deprecated in 0.17.0 (#8823) in favor of InputSource & InputFormat. This PR removes the FiniteFirehoseFactory and all its implementations along with classes solely used by them like Fetcher (Used by PrefetchableTextFilesFirehoseFactory). Refactors classes including tests using FiniteFirehoseFactory to use InputSource instead. Removing InputRowParser may not be as trivial as many classes that aren't deprecated depend on it (with no alternatives), like EventReceiverFirehoseFactory. Hence FirehoseFactory, EventReceiverFirehoseFactory, and Firehose are marked deprecated.	2023-03-02 18:07:17 +05:30
Abhishek Agarwal	d2dbb8b2c0	Fix infinite checkpointing between tasks and overlord (#13825 ) If the intermediate handoff period is less than the task duration and there is no new data in the input topic, task will continuously checkpoint the same offsets again and again. This PR fixes that bug by resetting the checkpoint time even when the task receives the same end offset request again.	2023-02-22 19:25:59 +05:30
Clint Wylie	fb26a1093d	discover nested columns when using nested column indexer for schemaless ingestion (#13672 ) * discover nested columns when using nested column indexer for schemaless * move useNestedColumnIndexerForSchemaDiscovery from AppendableIndexSpec to DimensionsSpec	2023-01-18 12:57:28 -08:00
Kashif Faraz	58a3acc2c4	Add InputStats to track bytes processed by a task (#13520 ) This commit adds a new class `InputStats` to track the total bytes processed by a task. The field `processedBytes` is published in task reports along with other row stats. Major changes: - Add class `InputStats` to track processed bytes - Add method `InputSourceReader.read(InputStats)` to read input rows while counting bytes. > Since we need to count the bytes, we could not just have a wrapper around `InputSourceReader` or `InputEntityReader` (the way `CountableInputSourceReader` does) because the `InputSourceReader` only deals with `InputRow`s and the byte information is already lost. - Classic batch: Use the new `InputSourceReader.read(inputStats)` in `AbstractBatchIndexTask` - Streaming: Increment `processedBytes` in `StreamChunkParser`. This does not use the new `InputSourceReader.read(inputStats)` method. - Extend `InputStats` with `RowIngestionMeters` so that bytes can be exposed in task reports Other changes: - Update tests to verify the value of `processedBytes` - Rename `MutableRowIngestionMeters` to `SimpleRowIngestionMeters` and remove duplicate class - Replace `CacheTestSegmentCacheManager` with `NoopSegmentCacheManager` - Refactor `KafkaIndexTaskTest` and `KinesisIndexTaskTest`	2022-12-13 18:54:42 +05:30
Gian Merlino	de5a4bafcb	Zero-copy local deep storage. (#13394 ) * Zero-copy local deep storage. This is useful for local deep storage, since it reduces disk usage and makes Historicals able to load segments instantaneously. Two changes: 1) Introduce "druid.storage.zip" parameter for local storage, which defaults to false. This changes default behavior from writing an index.zip to writing a regular directory. This is safe to do even during a rolling update, because the older code actually already handled unzipped directories being present on local deep storage. 2) In LocalDataSegmentPuller and LocalDataSegmentPusher, use hard links instead of copies when possible. (Generally this is possible when the source and destination directory are on the same filesystem.)	2022-12-12 17:28:24 -08:00
AmatyaAvadhanula	cc307e4c29	Fix needless task shutdown on leader switch (#13411 ) * Fix needless task shutdown on leader switch * Add unit test * Fix style * Fix UTs	2022-12-01 18:31:08 +05:30
Clint Wylie	4b58f5f23c	fix KafkaInputFormat with nested columns by delegating to underlying inputRow map instead of eagerly copying (#13406 )	2022-11-28 12:28:07 -08:00
Gian Merlino	bfffbabb56	Async task client for SeekableStreamSupervisors. (#13354 ) Main changes: 1) Convert SeekableStreamIndexTaskClient to an interface, move old code to SeekableStreamIndexTaskClientSyncImpl, and add new implementation SeekableStreamIndexTaskClientAsyncImpl that uses ServiceClient. 2) Add "chatAsync" parameter to seekable stream supervisors that causes the supervisor to use an async task client. 3) In SeekableStreamSupervisor.discoverTasks, adjust logic to avoid making blocking RPC calls in workerExec threads. 4) In SeekableStreamSupervisor generally, switch from Futures.successfulAsList to FutureUtils.coalesce, so we can better capture the errors that occurred with contacting individual tasks. Other, related changes: 1) Add ServiceRetryPolicy.retryNotAvailable, which controls whether ServiceClient retries unavailable services. Useful since we do not want to retry calls unavailable tasks within the service client. (The supervisor does its own higher-level retries.) 2) Add FutureUtils.transformAsync, a more lambda friendly version of Futures.transform(f, AsyncFunction). 3) Add FutureUtils.coalesce. Similar to Futures.successfulAsList, but returns Either instead of using null on error. 4) Add JacksonUtils.readValue overloads for JavaType and TypeReference.	2022-11-21 19:20:26 +05:30
Tejaswini Bandlamudi	bf10ff73a8	Fixes Kafka Supervisor Lag Report (#13380 ) Fixes inclusion of all stream partitions in all tasks. The PR (Adds Idle feature to `SeekableStreamSupervisor` for inactive stream) - https://github.com/apache/druid/pull/13144 updates the resulting lag calculation map in `KafkaSupervisor` to include all the latest partitions from the stream to set the idle state accordingly rather than the previous way of lag calculation only for the partitions actively being read from the stream. This led to an explosion of metrics in lag reports in cases where 1000s of tasks per supervisor are present. Changes: - Add a new method to generate lags for only those partitions a single task is actively reading from while updating the Supervisor reports.	2022-11-17 22:24:45 +05:30
Gian Merlino	77478f25fb	Add taskActionType dimension to task/action/run/time. (#13333 ) * Add taskActionType dimension to task/action/run/time. * Spelling.	2022-11-11 12:00:08 +05:30
AmatyaAvadhanula	fb23e38aa7	Fix messageGap emission (#13346 ) * Fix messageGap emission * Do not emit messageGap after stopping reading events * Refactoring * Fix tests	2022-11-10 17:50:19 +05:30
AmatyaAvadhanula	0512ae4922	Optimize metadata calls in SeekableStreamSupervisor (#13328 ) * Optimize metadata calls * Modify isTaskCurrent * Fix tests * Refactoring	2022-11-10 07:22:51 +05:30
Tejaswini Bandlamudi	594545da55	Adds cluster level idleConfig setting for supervisor (#13311 ) * adds cluster level idleConfig * updates docs * refactoring * spelling nit * nit * nit * refactoring	2022-11-08 14:54:14 +05:30
AmatyaAvadhanula	a17ffdfc5d	Fix flaky test method in KafkaSupervisorTest (#13315 )	2022-11-05 10:31:40 +05:30
Dr. Sizzles	e5ad24ff9f	Support for middle manager less druid, tasks launch as k8s jobs (#13156 ) * Support for middle manager less druid, tasks launch as k8s jobs * Fixing forking task runner test * Test cleanup, dependency cleanup, intellij inspections cleanup * Changes per PR review Add configuration option to disable http/https proxy for the k8s client Update the docs to provide more detail about sidecar support * Removing un-needed log lines * Small changes per PR review * Upon task completion we callback to the overlord to update the status / location, for slower k8s clusters, this reduces locking time significantly * Merge conflict fix * Fixing tests and docs * update tiny-cluster.yaml changed `enableTaskLevelLogPush` to `encapsulatedTask` * Apply suggestions from code review Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * Minor changes per PR request * Cleanup, adding test to AbstractTask * Add comment in peon.sh * Bumping code coverage * More tests to make code coverage happy * Doh a duplicate dependency * Integration test setup is weird for k8s, will do this in a different PR * Reverting back all integration test changes, will do in another PR * use StringUtils.base64 instead of Base64 * Jdk is nasty, if i compress in jdk 11 in jdk 17 the decompressed result is different Co-authored-by: Rahul Gidwani <r_gidwani@apple.com> Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>	2022-11-02 19:44:47 -07:00
AmatyaAvadhanula	e1ff3ca289	Resume streaming tasks on Overlord switch (#13223 ) * Resume streaming tasks on Overlord switch * Refactoring and better messages * Better docs * Add unit test * Fix tests' setup * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> * Better logs * Fix test again Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2022-10-29 09:38:49 +05:30
Gian Merlino	d98c808d3f	Remove basePersistDirectory from tuning configs. (#13040 ) * Remove basePersistDirectory from tuning configs. Since the removal of CliRealtime, it serves no purpose, since it is always overridden in production using withBasePersistDirectory given some subdirectory of the task work directory. Removing this from the tuning config has a benefit beyond removing no-longer-needed logic: it also avoids the side effect of empty "druid-realtime-persist" directories getting created in the systemwide temp directory. * Test adjustments to appropriately set basePersistDirectory. * Remove unused import. * Fix RATC constructor.	2022-10-21 17:25:36 -07:00
Tejaswini Bandlamudi	3e13584e0e	Adds Idle feature to `SeekableStreamSupervisor` for inactive stream (#13144 ) * Idle Seekable stream supervisor changes. * nit * nit * nit * Adds unit tests * Supervisor decides it's idle state instead of AutoScaler * docs update * nit * nit * docs update * Adds Kafka unit test * Adds Kafka Integration test. * Updates travis config. * Updates kafka-indexing-service dependencies. * updates previous offsets snapshot & doc * Doesn't act if supervisor is suspended. * Fixes highest current offsets fetch bug, adds new Kafka UT tests, doc changes. * Reverts Kinesis Supervisor idle behaviour changes. * nit * nit * Corrects SeekableStreamSupervisorSpec check on idle behaviour config, adds tests. * Fixes getHighestCurrentOffsets to fetch offsets of publishing tasks too * Adds Kafka Supervisor UT * Improves test coverage in druid-server * Corrects IT override config * Doc updates and Syntactic changes * nit * supervisorSpec.ioConfig.idleConfig changes	2022-10-12 18:31:08 +05:30
Jonathan Wei	1f1fced6d4	Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON (#13089 ) * Add JsonInputFormat option to assume newline delimited JSON, improve handling for non-NDJSON * Fix serde and docs * Add PR comment check	2022-09-26 19:51:04 -05:00
Jonathan Wei	331e6d707b	Add KafkaConfigOverrides extension point (#13122 ) * Add KafkaConfigOverrides extension point * X	2022-09-21 11:47:19 +05:30
Ellen Shen	da30c8070a	kafka consumer: custom serializer can't be configured after its instantiation (#12960 ) (#13097 ) * allow kafka custom serializer to be configured * add unit tests Co-authored-by: ellen shen <ellenshen@apple.com>	2022-09-17 20:42:21 +08:00
Clint Wylie	8ee8786d3c	add maxBytesInMemory and maxClientResponseBytes to SamplerConfig (#12947 ) * add maxBytesInMemory and maxClientResponseBytes to SamplerConfig	2022-08-25 00:50:41 -07:00
Adarsh Sanjeev	3b58a01c7c	Correct spelling in messages and variable names. (#12932 )	2022-08-24 11:06:31 +05:30
AmatyaAvadhanula	d294404924	Kinesis ingestion with empty shards (#12792 ) Kinesis ingestion requires all shards to have at least 1 record at the required position in druid. Even if this is satisfied initially, resharding the stream can lead to empty intermediate shards. A significant delay in writing to newly created shards was also problematic. Kinesis shard sequence numbers are big integers. Introduce two more custom sequence tokens UNREAD_TRIM_HORIZON and UNREAD_LATEST to indicate that a shard has not been read from and that it needs to be read from the start or the end respectively. These values can be used to avoid the need to read at least one record to obtain a sequence number for ingesting a newly discovered shard. If a record cannot be obtained immediately, use a marker to obtain the relevant shardIterator and use this shardIterator to obtain a valid sequence number. As long as a valid sequence number is not obtained, continue storing the token as the offset. These tokens (UNREAD_TRIM_HORIZON and UNREAD_LATEST) are logically ordered to be earlier than any valid sequence number. However, the ordering requires a few subtle changes to the existing mechanism for record sequence validation: The sequence availability check ensures that the current offset is before the earliest available sequence in the shard. However, current token being an UNREAD token indicates that any sequence number in the shard is valid (despite the ordering). Kinesis sequence numbers are inclusive, i.e., if current sequence == end sequence, there are more records left to read. However, the equality check is exclusive when dealing with UNREAD tokens.	2022-08-05 22:38:58 +05:30
Gian Merlino	2b330186e2	Mid-level service client and updated high-level clients. (#12696 ) * Mid-level service client and updated high-level clients. Our servers talk to each other over HTTP. We have a low-level HTTP client (HttpClient) that is super-asynchronous and super-customizable through its handlers. It's also proven to be quite robust: we use it for Broker -> Historical communication over the wide variety of query types and workloads we support. But the low-level client has no facilities for service location or retries, which means we have a variety of high-level clients that implement these in their own ways. Some high-level clients do a better job than others. This patch adds a mid-level ServiceClient that makes it easier for high-level clients to be built correctly and harmoniously, and migrates some of the high-level logic to use ServiceClients. Main changes: 1) Add ServiceClient org.apache.druid.rpc package. That package also contains supporting stuff like ServiceLocator and RetryPolicy interfaces, and a DiscoveryServiceLocator based on DruidNodeDiscoveryProvider. 2) Add high-level OverlordClient in org.apache.druid.rpc.indexing. 3) Indexing task client creator in TaskServiceClients. It uses SpecificTaskServiceLocator to find the tasks. This improves on ClientInfoTaskProvider by caching task locations for up to 30 seconds across calls, reducing load on the Overlord. 4) Rework ParallelIndexSupervisorTaskClient to use a ServiceClient instead of extending IndexTaskClient. 5) Rework RemoteTaskActionClient to use a ServiceClient instead of DruidLeaderClient. 6) Rework LocalIntermediaryDataManager, TaskMonitor, and ParallelIndexSupervisorTask. As a result, MiddleManager, Peon, and Overlord no longer need IndexingServiceClient (which internally used DruidLeaderClient). 
There are some concrete benefits over the prior logic, namely: - DruidLeaderClient does retries in its "go" method, but only retries exactly 5 times, does not sleep between retries, and does not retry retryable HTTP codes like 502, 503, 504. (It only retries IOExceptions.) ServiceClient handles retries in a more reasonable way. - DruidLeaderClient's methods are all synchronous, whereas ServiceClient methods are asynchronous. This is used in one place so far: the SpecificTaskServiceLocator, so we don't need to block a thread trying to locate a task. It can be used in other places in the future. - HttpIndexingServiceClient does not properly handle all server errors. In some cases, it tries to parse a server error as a successful response (for example: in getTaskStatus). - IndexTaskClient currently makes an Overlord call on every task-to-task HTTP request, as a way to find where the target task is. ServiceClient, through SpecificTaskServiceLocator, caches these target locations for a period of time. * Style adjustments. * For the coverage. * Adjustments. * Better behaviors. * Fixes.	2022-07-05 09:43:26 -07:00
Gian Merlino	d5abd06b96	Fix flaky KafkaIndexTaskTest. (#12657 ) * Fix flaky KafkaIndexTaskTest. The testRunTransactionModeRollback case had many race conditions. Most notably, it would commit a transaction and then immediately check to see that the results were not indexed. This is racy because it relied on the indexing thread being slower than the test thread. Now, the case waits for the transaction to be processed by the indexing thread before checking the results. * Changes from review.	2022-06-24 13:53:51 -07:00
Jihoon Son	b6eeef31e5	Store null columns in the segments (#12279 ) * Store null columns in the segments * fix test * remove NullNumericColumn and unused dependency * fix compile failure * use guava instead of apache commons * split new tests * unused imports * address comments	2022-03-23 16:54:04 -07:00
Parag Jain	2efb74ff1e	fix supervisor auto scaler config serde bug (#12317 )	2022-03-09 16:17:12 -08:00
Abhishek Agarwal	6346b9561d	Reuse the InputEntityReader in SettableByteEntityReader (#12269 ) * Reuse the InputEntityReader in SettableByteEntityReader * Fix logic * Fix kafka streaming ingestion * Add Tests for kafka input format change * Address review comments	2022-03-09 14:38:31 -08:00
Laksh Singla	3f709db173	Make ParseExceptions more informative (#12259 ) This PR aims to make the ParseExceptions in Druid more informative, by adding additional information (metadata) to the ParseException, which can contain additional information about the exception. For example - the path of the file generating the issue, the line number (where it can be easily fetched - like CsvReader) Following changes are addressed in this PR: A new class CloseableIteratorWithMetadata has been created which is like CloseableIterator but also has a metadata method that returns a context Map<String, Object> about the current element returned by next(). IntermediateRowParsingReader#read() now attaches the InputEntity and the "record number" which created the exception (while parsing them), and IntermediateRowParsingReader#sample attaches the InputEntity (but not the "record number"). TextReader (and its subclasses), which is a specific implementation of the IntermediateRowParsingReader also include the line number which caused the generation of the error. This will also help in triaging the issues when InputSourceReader generates ParseException because it can point to the specific InputEntity which caused the exception (while trying to read it).	2022-02-28 22:31:15 +05:30
Jihoon Son	e5ad862665	A new includeAllDimension flag for dimensionsSpec (#12276 ) * includeAllDimensions in dimensionsSpec * doc * address comments * unused import and doc spelling	2022-02-25 18:27:48 -08:00
Xavier Léauté	009dd9e09a	upgrade core Apache Kafka dependencies to 3.1.0 (#12203 ) Announcement: https://blogs.apache.org/kafka/entry/what-s-new-in-apache7 Release notes: https://dist.apache.org/repos/dist/release/kafka/3.1.0/RELEASE_NOTES.html * upgrade core Apache Kafka dependencies to 3.1.0 * fix use of private Kafka APIs * remove deprecated test rules * remove mock calls that weren't verified in the first place * remove the need for powermock in KafkaLookupExtractorFactoryTest * align curator-test version with curator itself * update easymock to 4.3.0	2022-02-23 18:42:51 -08:00
Abhishek Agarwal	53c0e489c2	Fix infinite retrying during task pausing (#12167 ) This fixes a bug that causes TaskClient in overlord to continuously retry to pause tasks. This can happen when a task is not responding to the pause command. Ideally, in such a case when the task is unresponsive, the overlord would have given up after a few retries and would have killed the task. However, due to this bug, retries go on forever.	2022-01-19 09:03:36 +05:30
Jonathan Wei	229f82a6f0	Add parse error list API for stream supervisors, use structured object for parse exceptions, simplify parse exception message (#11961 ) * Add parse error list API for stream supervisors, simplify parse exception message * Add input string to parse exception * Use structured ParseExceptionReport * Fix tests * Add test * PR comments, add ParseExceptionReport equals verifier * Fix test	2021-12-09 15:42:55 -06:00
Gian Merlino	e0e05aad99	Enhancements to IndexTaskClient. (#12011 ) * Enhancements to IndexTaskClient. 1) Ability to use handlers other than StringFullResponseHandler. This functionality is not used in production code yet, but is useful because it will allow tasks to communicate with each other in non-string-based formats and in streaming fashion. In the future, we'll be able to use this to make task-to-task communication more efficient. 2) Truncate server errors at 1KB, so long errors do not pollute logs. 3) Change error log level for retryable errors from WARN to INFO. (The final error is still WARN.) 4) Harmonize log and exception messages to have a more consistent format. * Additional tests and improvements.	2021-12-03 09:14:32 -08:00
Frank Chen	c2cea25a6b	Improve exception message when loading data from web-console (#11723 ) * Improve exception handling * Revert some changes * Resolve comments * Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/sampler/SamplerExceptionMapper.java Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * Update indexing-service/src/main/java/org/apache/druid/indexing/overlord/sampler/SamplerExceptionMapper.java Co-authored-by: Karan Kumar <karankumar1100@gmail.com> * Address review comments Co-authored-by: Karan Kumar <karankumar1100@gmail.com>	2021-12-03 21:33:49 +08:00
lokesh-lingarajan	ad6609a606	Kafka Input Format for headers, key and payload parsing (#11630 ) ### Description Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events is another class of data that provides useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside kafka payload, but nonetheless there are some very useful metadata that is carried in kafka headers that can serve as useful dimension for aggregation and in turn bringing better insights. PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats. We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format(provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the kafka record. We also need a way to expose the data present in the key as druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload. This PR is designed to solve the above by providing wrapper around any existing input formats and merging the data into a single unified Druid row. 
Let's look at a sample input format from the above discussion "inputFormat": { "type": "kafka", // New input format type "headerLabelPrefix": "kafka.header.", // Label prefix for header columns, this will avoid collisions while merging columns "recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made available in case payload does not carry timestamp "headerFormat": // Header parser specifying that values are of type string { "type": "string" }, "valueFormat": // Value parser from json parsing { "type": "json", "flattenSpec": { "useFieldDiscovery": true, "fields": [...] } }, "keyFormat": // Key parser also from json parsing { "type": "json" } } Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, eg., headers coming in as string and payload as json. KafkaInputFormat will be the uber class extending inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blend the data resolving conflicts in columns and generating a single unified InputRow for Druid ingestion. "headerFormat" will allow users to plug parser type for the header values and will add default header prefix as "kafka.header."(can be overridden) for attributes to avoid collision while merging attributes with payload. Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plugin existing parser. One thing to note here is that if batching is performed, then the code is augmenting header and key values to every record in the batch. Kafka key parser will handle parsing Key portion of the Kafka record and will ingest the Key with dimension name as "kafka.key". ## KafkaInputFormat Class: This is the class that orchestrates sending the consumerRecord to each parser, retrieve rows, merge the columns into one final row for Druid consumption. 
KafkaInputformat should make sure to release the resources that gets allocated as a part of reader in CloseableIterator<InputRow> during normal and exception cases. During conflicts in dimension/metrics names, the code will prefer dimension names from payload and ignore the dimension either from headers/key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.	2021-10-07 08:56:27 -07:00
Xavier Léauté	bc3b038712	Update Apache Kafka client libraries to 3.0.0 (#11735 ) Release notes: https://downloads.apache.org/kafka/3.0.0/RELEASE_NOTES.html https://blogs.apache.org/kafka/entry/what-s-new-in-apache6	2021-10-05 10:23:19 -07:00
Agustin Gonzalez	9efa6cc9c8	Make persists concurrent with adding rows in batch ingestion (#11536 ) * Make persists concurrent with ingestion * Remove semaphore but keep concurrent persists (with add) and add push in the background as well * Go back to documented default persists (zero) * Move to debug * Remove unnecessary Atomics * Comments on synchronization (or not) for sinks & sinkMetadata * Some cleanup for unit tests but they still need further work * Shutdown & wait for persists and push on close * Provide support for three existing batch appenderators using batchProcessingMode flag * Fix reference to wrong appenderator * Fix doc typos * Add BatchAppenderators class test coverage * Add log message to batchProcessingMode final value, fix typo in enum name * Another typo and minor fix to log message * LEGACY->OPEN_SEGMENTS, Edit docs * Minor update legacy->open segments log message * More code comments, mostly small adjustments to naming etc * fix spelling * Exclude BatchAppenderators from Jacoco since it is fully tested but Jacoco still refuses to ack coverage * Coverage for Appenderators & BatchAppenderators, name change of a method that was still using "legacy" rather than "openSegments" Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2021-09-08 13:31:52 -07:00
Agustin Gonzalez	a2da407b70	Add error msg to parallel task's TaskStatus (#11486 ) * Add error msg to parallel task's TaskStatus * Consolidate failure block * Add failure test * Make it fail * Add fail while stopped * Simplify hash task test using a runner that fails after so many runs (parameter) * Remove unthrown exception * Use runner names to identify phase * Added range partition kill test & fixed a timing bug with the custom runner * Forbidden api * Style * Unit test code cleanup * Added message to invalid state exception and improved readability of the phase error messages for the parallel task failure unit tests	2021-08-02 12:11:28 -07:00
Abhishek Agarwal	94c1671eaf	Split SegmentLoader into SegmentLoader and SegmentCacheManager (#11466 ) This PR splits current SegmentLoader into SegmentLoader and SegmentCacheManager. SegmentLoader - this class is responsible for building the segment object but does not expose any methods for downloading, cache space management, etc. Default implementation delegates the download operations to SegmentCacheManager and only contains the logic for building segments once downloaded. This class will be used in SegmentManager to construct Segment objects. SegmentCacheManager - this class manages the segment cache on the local disk. It fetches the segment files to the local disk, can clean up the cache, and in the future, support reserve and release on cache space. [See https://github.com/Make SegmentLoader extensible and customizable #11398]. This class will be used in ingestion tasks such as compaction, re-indexing where segment files need to be downloaded locally.	2021-07-21 00:14:19 +05:30
frank chen	2236cf2234	eliminate extra object instantiation (#11345 )	2021-07-12 18:31:39 -07:00
Abhishek Agarwal	03a6a6d6e1	Replace Processing ExecutorService with QueryProcessingPool (#11382 ) This PR refactors the code for QueryRunnerFactory#mergeRunners to accept a new interface called QueryProcessingPool instead of ExecutorService for concurrent execution of query runners. This interface will let custom extensions inject their own implementation for deciding which query-runner to prioritize first. The default implementation is the same as today that takes the priority of query into account. QueryProcessingPool can also be used as a regular executor service. It has a dedicated method for accepting query execution work so implementations can differentiate between regular async tasks and query execution tasks. This dedicated method also passes the QueryRunner object as part of the task information. This hook will let custom extensions carry any state from QuerySegmentWalker to QueryProcessingPool#mergeRunners which is not possible currently.	2021-07-01 16:03:08 +05:30
Yi Yuan	de8daf8139	Delete buildV9Directly in Kafka and Kinesis Indexing Service (#11351 ) * delete_buildV9Directly_in_kafka_and_kinesis_indexing_service * delete * delete them from server * delete buildV9Directly from hadoop indexing * bug fixed Co-authored-by: yuanyi <yuanyi@freewheel.tv>	2021-06-23 16:36:46 -07:00
Agustin Gonzalez	8e5048e643	Avoid memory mapping hydrants after they are persisted & after they are merged for native batch ingestion (#11123 ) * Avoid mapping hydrants in create segments phase for native ingestion * Drop queriable indices after a given sink is fully merged * Do not drop memory mappings for realtime ingestion * Style fixes * Renamed to match use case better * Rollback memoization code and use the real time flag instead * Null ptr fix in FireHydrant toString plus adjustments to memory pressure tracking calculations * Style * Log some count stats * Make sure sinks size is obtained at the right time * BatchAppenderator unit test * Fix comment typos * Renamed methods to make them more readable * Move persisted metadata from FireHydrant class to AppenderatorImpl. Removed superfluous differences and fix comment typo. Removed custom comparator * Missing dependency * Make persisted hydrant metadata map concurrent and better reflect the fact that keys are Java references. Maintain persisted metadata when dropping/closing segments. * Replaced concurrent variables with normal ones * Added batchMemoryMappedIndex "fallback" flag with default "false". Set this to "true" make code fallback to previous code path. * Style fix. * Added note to new setting in doc, using Iterables.size (and removing a dependency), and fixing a typo in a comment. * Forgot to commit this edited documentation message	2021-05-11 14:34:26 -07:00
Yuanli Han	8647040f4d	Allow user to set group.id for Kafka ingestion task (#11147 ) * allow user to set group.id for Kafka ingestion task * fix test coverage by removing deprecated code and add doc * fix typo * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: frank chen <frankchen@apache.org> Co-authored-by: frank chen <frankchen@apache.org>	2021-05-09 11:56:19 +08:00
Xavier Léauté	0296f20551	upgrade Apache Kafka to 2.8.0 (#11139 ) * upgrade to Apache Kafka 2.8.0 (release notes: https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html) * pass Kafka version as a Docker argument in integration tests to keep in sync with maven version * fix use of internal Kafka APIs in integration tests	2021-04-24 08:27:07 -07:00


320 Commits