* Use the whole frame when writing rows.
This patch makes the following adjustments to enable writing larger
single rows to frames:
1) RowBasedFrameWriter: Max out allocation size on the final doubling.
i.e., if the final allocation "naturally" would be 1 MiB but the
max frame size is 900 KiB, use 900 KiB rather than failing the 1 MiB
allocation.
2) AppendableMemory: In reserveAdditional, release the last block if it
is empty. This eliminates waste when a frame writer uses a
successive-doubling approach to find the right allocation size.
3) ArenaMemoryAllocator: Reclaim memory from the last allocation when
the last allocation is closed.
Prior to these changes, a single row could be much smaller than the
frame size and still fail to be added to the frame.
* Style.
* Fix test.
Revives #14024 and additionally supports,
Native queries
gRPC health check endpoint
This PR doesn't have the shaded module for packaging gRPC and Guava libraries since grpc-query module uses the same Guava version as that of Druid.
The response is gRPC-specific. It provides the result schema along with the results as a binary "blob". Results can be in CSV, JSON array lines or as an array of Protobuf objects. If using Protobuf, the corresponding class must be installed along with the gRPC query extension so it is available to the Broker at runtime.
The "suggested server memory" figure needs to take into account
maxConcurrentStages. The fix here does not affect the main memory
calculations, but it does affect the accuracy of error messages.
Parent issue: #14989
It is possible for the order of columns to vary across segments especially during realtime ingestion.
Since, the schema fingerprint is sensitive to column order this leads to creation of a large number of segment schema in the metadata database for essentially the same set of columns.
This is wasteful, this patch fixes this problem by computing schema fingerprint on lexicographically sorted columns. This would result in creation of a single schema in the metadata database with the first observed column order for a given signature.
Currently, TaskDataSegmentProvider fetches the DataSegment from the Coordinator while loading the segment, but just discards it later. This PR refactors this to also return the DataSegment so that it can be used by workers without a separate fetch.
There were some problematic cases
join branches are run with finalize=false instead of finalize=true like normal subqueries
this inconsistency is not good - but fixing it is a bigger thing
ensure that right hand sides of joins are always subqueries - or accessible globally
To achieve the above:
operand indexes were needed for the upstream reltree nodes in the generator
source unwrapping now takes the join situation into account as well
* Reduce the number of ITs in CDS groups
The two CDS IT groups cds-task-schema-publish-disabled & cds-coordinator-metadata-query-disabled
runs a lot of ITs serially, leading to increased CI runtime.
Reducing the number of ITs running in the two groups to,
ITAppendBatchIndexTest (append-ingestion)
ITIndexerTest (batch-index)
ITOverwriteBatchIndexTest (batch-index)
ITCompactionSparseColumnIndexTest (compaction)
ITCompactionTaskTest (compaction)
ITKafkaIndexingServiceDataFormatTest (kafka-data-format)
Both these groups will be removed once CDS is enabled by default.
* MSQ: Add QueryKitSpec to encapsulate QueryKit params.
This patch introduces QueryKitSpec, an object that encapsulates the
parameters to makeQueryDefinition that are consistent from call to
call. This simplifies things because we avoid passing around all the
components individually.
This patch also splits "maxWorkerCount" into "maxLeafWorkerCount" and
"maxNonLeafWorkerCount", which apply to leaf stages (no other stages as
inputs) and nonleaf stages respectively.
Finally, this patch also rovides a way for ControllerContext to supply a
QueryKitSpec to its liking. It is expected that this will be used by
controllers of quick interactive queries to set maxNonLeafWorkerCount = 1,
which will generate fanning-in query plans.
* Fix javadoc.
Previously, the processor used "remainingChannels" to track the number of
non-null entries of currentFrame. Now, "remainingChannels" tracks the
number of channels that are unfinished.
The difference is subtle. In the previous code, when an input channel
was blocked upon exiting nextFrame(), the "currentFrames" entry would be
null, and therefore the "remainingChannels" variable would be decremented.
After the next await and call to populateCurrentFramesAndTournamentTree(),
"remainingChannels" would be incremented if the channel had become
unblocked after awaiting.
This means that finished(), which returned true if remainingChannels was
zero, would not be reliable if called between nextFrame() and the
next await + populateCurrentFramesAndTournamentTree().
This patch changes things such that finished() is always reliable. This
fixes a regression introduced in PR #16911, which added a call to
finished() that was, at that time, unsafe.
* MSQ: Properly report errors that occur when starting up RunWorkOrder.
In #17046, an exception thrown by RunWorkOrder#startAsync would be ignored
and replaced with a generic CanceledFault. This patch fixes it by retaining
the original error.
* TableInputSpecSlicer changes to support running on Brokers.
Changes:
1) Rename TableInputSpecSlicer to IndexerTableInputSpecSlicer, in anticipation
of a new implementation being added for controllers running on Brokers.
2) Allow the context to use the WorkerManager to build the TableInputSpecSlicer,
in anticipation of Brokers wanting to use this to assign segments to servers
that are already serving those segments.
3) Remove unused DataSegmentTimelineView interface.
4) Add additional javadoc to DataSegmentProvider.
* Style.
* MSQ: Include worker context maps in WorkOrders.
This provides a mechanism to send contexts to workers in long-lived,
shared JVMs that are not part of the task system.
* Style, coverage.
Register a Ser-De for RowsAndColumns so that the window operator query running on leaf operators would be transferred properly on the wire. Would fix the empty response given by window queries without group by on the native engine.
This isn't necessary when using MSQWorkerTaskLauncher as the WorkerManager
implementation, because in that case, task failure also wakes up the
main thread. However, when using workers that are not task-based, we don't
want to rely on the WorkerManager for this.
Fixed vulnerabilities
CVE-2021-26291 : Apache Maven is vulnerable to Man-in-the-Middle (MitM) attacks. Various
functions across several files, mentioned below, allow for custom repositories to use the
insecure HTTP protocol. An attacker can exploit this as part of a Man-in-the-Middle (MitM)
attack, taking over or impersonating a repository using the insecure HTTP protocol.
Unsuspecting users may then have the compromised repository defined as a dependency in
their Project Object Model (pom) file and download potentially malicious files from it.
Was fixed by removing outdated tesla-aether library containing vulnerable maven-settings (v3.1.1) package, pull-deps utility updated to use maven resolver instead.
sonatype-2020-0244 : The joni package is vulnerable to Man-in-the-Middle (MitM) attacks.
This project downloads dependencies over HTTP due to an insecure repository configuration
within the .pom file. Consequently, a MitM could intercept requests to the specified
repository and replace the requested dependencies with malicious versions, which can execute
arbitrary code from the application that was built with them.
Was fixed by upgrading joni package to recommended 2.1.34 version
The existing tests are moved into a "WithMaximalBuffering" subclass,
and a new "WithMinimalBuffering" subclass is added to test cases
where only a single frame is buffered.
* Move TerminalStageSpecFactory packages.
These packages are moved from the "guice" package to the "indexing.destination"
package. They make more sense here, since "guice" is reserved for Guice modules,
annotations, and providers.
* Rearrange imports.
* Speed up FrameFileTest, SuperSorterTest.
These are two heavily parameterized tests that, together, account for
about 60% of runtime in the test suite.
FrameFileTest changes:
1) Cache frame files in a static, rather than building the frame file
for each parameterization of the test.
2) Adjust TestArrayCursorFactory to cache the signature, rather than
re-creating it on each call to getColumnCapabilities.
SuperSorterTest changes:
1) Dramatically reduce the number of tests that run with
"maxRowsPerFrame" = 1. These are particularly slow due to writing so
many small files. Some still run, since it's useful to test edge cases,
but much fewer than before.
2) Reduce the "maxActiveProcessors" axis of the test from [1, 2, 4] to
[1, 3]. The aim is to reduce the number of cases while still getting
good coverage of the feature.
3) Reduce the "maxChannelsPerProcessor" axis of the test from [2, 3, 8]
to [2, 7]. The aim is to reduce the number of cases while still getting
good coverage of the feature.
4) Use in-memory input channels rather than file channels.
5) Defer formatting of assertion failure messages until they are needed.
6) Cache the cursor factory and its signature in a static.
7) Cache sorted test rows (used for verification) in a static.
* It helps to include the file.
* Style.
* abstract `IncrementalIndex` cursor stuff to prepare to allow for possibility of using different "views" of the data based on the cursor build spec
changes:
* introduce `IncrementalIndexRowSelector` interface to capture how `IncrementalIndexCursor` and `IncrementalIndexColumnSelectorFactory` read data
* `IncrementalIndex` implements `IncrementalIndexRowSelector`
* move `FactsHolder` interface to separate file
* other minor refactorings
* BaseWorkerClientImpl: Don't attempt to recover from a closed channel.
This patch introduces an exception type "ChannelClosedForWritesException",
which allows the BaseWorkerClientImpl to avoid retrying when the local
channel has been closed. This can happen in cases of cancellation.
* Add some test coverage.
* wip
* Add test coverage.
* Style.
* MSQ: Improved worker cancellation.
Four changes:
1) FrameProcessorExecutor now requires that cancellationIds be registered
with "registerCancellationId" prior to being used in "runFully" or "runAllFully".
2) FrameProcessorExecutor gains an "asExecutor" method, which allows that
executor to be used as an executor for future callbacks in such a way
that respects cancellationId.
3) RunWorkOrder gains a "stop" method, which cancels the current
cancellationId and closes the current FrameContext. It blocks until
both operations are complete.
4) Fixes a bug in RunAllFullyWidget where "processorManager.result()" was
called outside "runAllFullyLock", which could cause it to be called
out-of-order with "cleanup()" in case of cancellation or other error.
Together, these changes help ensure cancellation does not have races.
Once "cancel" is called for a given cancellationId, all existing processors
and running callbacks are canceled and exit in an orderly manner. Future
processors and callbacks with the same cancellationId are rejected
before being executed.
* Fix test.
* Use execute, which doesn't return, to avoid errorprone complaints.
* Fix some style stuff.
* Further enhancements.
* Fix style.
* MSQ: Rework memory management.
This patch reworks memory management to better support multi-threaded
workers running in shared JVMs. There are two main changes.
First, processing buffers and threads are moved from a per-JVM model to
a per-worker model. This enables queries to hold processing buffers
without blocking other concurrently-running queries. Changes:
- Introduce ProcessingBuffersSet and ProcessingBuffers to hold the
per-worker and per-work-order processing buffers (respectively). On Peons,
this is the JVM-wide processing pool. On Indexers, this is a per-worker
pool of on-heap buffers. (This change fixes a bug on Indexers where
excessive processing buffers could be used if MSQ tasks ran concurrently
with realtime tasks.)
- Add "bufferPool" argument to GroupingEngine#process so a per-worker pool
can be passed in.
- Add "druid.msq.task.memory.maxThreads" property, which controls the
maximum number of processing threads to use per task. This allows usage of
multiple processing buffers per task if admins desire.
- IndexerWorkerContext acquires processingBuffers when creating the FrameContext
for a work order, and releases them when closing the FrameContext.
- Add "usesProcessingBuffers()" to FrameProcessorFactory so workers know
how many sets of processing buffers are needed to run a given query.
Second, adjustments to how WorkerMemoryParameters slices up bundles, to
favor more memory for sorting and segment generation. Changes:
- Instead of using same-sized bundles for processing and for sorting,
workers now use minimally-sized processing bundles (just enough to read
inputs plus a little overhead). The rest is devoted to broadcast data
buffering, sorting, and segment-building.
- Segment-building is now limited to 1 concurrent segment per work order.
This allows each segment-building action to use more memory. Note that
segment-building is internally multi-threaded to a degree. (Build and
persist can run concurrently.)
- Simplify frame size calculations by removing the distinction between
"standard" and "large" frames. The new default frame size is the same
as the old "standard" frames, 1 MB. The original goal of of the large
frames was to reduce the number of temporary files during sorting, but
I think we can achieve the same thing by simply merging a larger number
of standard frames at once.
- Remove the small worker adjustment that was added in #14117 to account
for an extra frame involved in writing to durable storage. Instead,
account for the extra frame whenever we are actually using durable storage.
- Cap super-sorter parallelism using the number of output partitions, rather
than using a hard coded cap at 4. Note that in practice, so far, this cap
has not been relevant for tasks because they have only been using a single
processing thread anyway.
* Remove unused import.
* Fix errorprone annotation.
* Fixes for javadocs and inspections.
* Additional test coverage.
* Fix test.
* QueryResource: Don't close JSON content on error.
Following similar issues fixed in #11685 and #15880, this patch fixes
a bug where QueryResource would write a closing array marker if it
encountered an exception after starting to push results. This makes it
difficult for callers to detect errors.
The prior patches didn't catch this problem because QueryResource uses
the ObjectMapper in a unique way, through writeValuesAsArray, which
doesn't respect the global AUTO_CLOSE_JSON_CONTENT setting.
* Fix usage of customized ObjectMappers.
As we move towards multi-threaded MSQ workers, it helps for parallelism
to generate more than one partition per worker. That way, we can fully
utilize all worker threads throughout all stages.
The default value is the number of processing threads. Currently, this
is hard-coded to 1 for peons, but that is expected to change in the future.
1) ControllerQueryKernel: Update readyToReadResults to acknowledge that sorting stages can
go directly from READING_INPUT to RESULTS_READY.
2) WorkerStageKernel: Ignore RESULTS_COMPLETE if work is already finished, which can happen
if the transition to FINISHED comes early due to a downstream LIMIT.
changes:
* CursorHolder.isPreAggregated method indicates that a cursor has pre-aggregated data for all AggregatorFactory specified in a CursorBuildSpec. If true, engines should rewrite the query to use AggregatorFactory.getCombiningAggreggator, and column selector factories will provide selectors with the aggregator interediate type for the aggregator factory name
* Added groupby, timeseries, and topN support for CursorHolder.isPreAggregated
* Added synthetic test since no CursorHolder implementations support isPreAggregated at this point in time
This PR #16890 introduced a change to skip adding tombstone segments to the cache.
It turns out that as a side effect tombstone segments appear unavailable in the console. This happens because availability of a segment in Broker is determined from the metadata cache.
The fix is to keep the segment in the metadata cache but skip them from refresh.
This doesn't affect any functionality as metadata query for tombstone returns empty causing continuous refresh of those segments.