Sysmonitor stats (mem, fs, disk, net, cpu, swap, sys, tcp) are reported by all Druid processes, including Peons that are ephemeral in nature. Since Peons always run on the same host as the MiddleManager that spawned them and is unlikely to change, the SyMonitor metrics emitted by Peon are merely duplicates. This is often not a problem except when machines are super-beefy. Imagine a 64-core machine and 32 workers running on this machine. now you will have each Peon reporting metrics for each core. that's an increase of (32 * 64)x in the number of metrics. This leads to a metric explosion.
This PR updates MetricsModule to check node role running while registering SysMonitor and not to load any existing SysMonitor$Stats.
Few indexing tasks register RealtimeMetricsMonitor or TaskRealtimeMetricsMonitor with the process’s MonitorScheduler when they start. These monitors never unregister themselves (they always return true, they'd need to return false to unregister). Each of these monitors emits a set of metrics once every druid.monitoring.emissionPeriod.
As a result, after executing several tasks for a while, Indexer emits metrics of these tasks even after they're long gone.
Proposed Solution
Since one should be able to obtain the last round of ingestion metrics after the task unregisters the monitor, introducing lastRoundMetricsToBePushed variable to keep track of the same and overriding the AbstractMonitor.monitor method in RealtimeMetricsMonitor, TaskRealtimeMetricsMonitor to implement the new logic.
* initial commit of bucket dimensions for metrics
return counts of segments that have rowcount in a bucket size for a datasource
return average value of rowcount per segment in a datasource
added unit test
naming could use a lot of work
buckets right now are not finalized
added javadocs
altered metrics.md
* fix checkstyle issues
* addressed review comments
add monitor test
move added functionality to new monitor
update docs
* address comments
renamed monitor
handle tombstones better
update docs
added javadocs
* Add support for tombstones in the segment distribution
* undo changes to tombstone segmentizer factory
* fix accidental whitespacing changes
* address comments regarding metrics documentation
and rename variable to be more accurate
* fix tests
* fix checkstyle issues
* fix broken test
* undo removal of timeout
* Mid-level service client and updated high-level clients.
Our servers talk to each other over HTTP. We have a low-level HTTP
client (HttpClient) that is super-asynchronous and super-customizable
through its handlers. It's also proven to be quite robust: we use it
for Broker -> Historical communication over the wide variety of query
types and workloads we support.
But the low-level client has no facilities for service location or
retries, which means we have a variety of high-level clients that
implement these in their own ways. Some high-level clients do a better
job than others. This patch adds a mid-level ServiceClient that makes
it easier for high-level clients to be built correctly and harmoniously,
and migrates some of the high-level logic to use ServiceClients.
Main changes:
1) Add ServiceClient org.apache.druid.rpc package. That package also
contains supporting stuff like ServiceLocator and RetryPolicy
interfaces, and a DiscoveryServiceLocator based on
DruidNodeDiscoveryProvider.
2) Add high-level OverlordClient in org.apache.druid.rpc.indexing.
3) Indexing task client creator in TaskServiceClients. It uses
SpecificTaskServiceLocator to find the tasks. This improves on
ClientInfoTaskProvider by caching task locations for up to 30 seconds
across calls, reducing load on the Overlord.
4) Rework ParallelIndexSupervisorTaskClient to use a ServiceClient
instead of extending IndexTaskClient.
5) Rework RemoteTaskActionClient to use a ServiceClient instead of
DruidLeaderClient.
6) Rework LocalIntermediaryDataManager, TaskMonitor, and
ParallelIndexSupervisorTask. As a result, MiddleManager, Peon, and
Overlord no longer need IndexingServiceClient (which internally used
DruidLeaderClient).
There are some concrete benefits over the prior logic, namely:
- DruidLeaderClient does retries in its "go" method, but only retries
exactly 5 times, does not sleep between retries, and does not retry
retryable HTTP codes like 502, 503, 504. (It only retries IOExceptions.)
ServiceClient handles retries in a more reasonable way.
- DruidLeaderClient's methods are all synchronous, whereas ServiceClient
methods are asynchronous. This is used in one place so far: the
SpecificTaskServiceLocator, so we don't need to block a thread trying
to locate a task. It can be used in other places in the future.
- HttpIndexingServiceClient does not properly handle all server errors.
In some cases, it tries to parse a server error as a successful
response (for example: in getTaskStatus).
- IndexTaskClient currently makes an Overlord call on every task-to-task
HTTP request, as a way to find where the target task is. ServiceClient,
through SpecificTaskServiceLocator, caches these target locations
for a period of time.
* Style adjustments.
* For the coverage.
* Adjustments.
* Better behaviors.
* Fixes.
The expiry timeout is compared against the current time but the condition is reversed.
This means that as soon as a supervisor task finishes, its partitions are cleaned up,
irrespective of the specified `intermediaryPartitionTimeout` period.
After these changes, the `intermediaryPartitionTimeout` will start getting honored.
Changes
* Fix the condition
* Add tests to verify the new correct behaviour
* Reduce the default expiry timeout from P1D to PT5M
to retain current behaviour in case of default configs.
This commit contains the cleanup needed for the new integration test framework.
Changes:
- Fix log lines, misspellings, docs, etc.
- Allow the use of some of Druid's "JSON config" objects in tests
- Fix minor bug in `BaseNodeRoleWatcher`
* Remove null and empty fields from native queries
* Test fixes
* Attempted IT fix.
* Revisions from review comments
* Build fixes resulting from changes suggested by reviews
* IT fix for changed segment size
The web-console (indirectly) calls the Overlord’s GET tasks API to fetch the tasks' summary which in turn queries the metadata tasks table. This query tries to fetch several columns, including payload, of all the rows at once. This introduces a significant memory overhead and can cause unresponsiveness or overlord failure when the ingestion tab is opened multiple times (due to several parallel calls to this API)
Another thing to note is that the task table (the payload column in particular) can be very large. Extracting large payloads from such tables can be very slow, leading to slow UI. While we are fixing the memory pressure in the overlord, we can also fix the slowness in UI caused by fetching large payloads from the table. Fetching large payloads also puts pressure on the metadata store as reported in the community (Metadata store query performance degrades as the tasks in druid_tasks table grows · Issue #12318 · apache/druid )
The task summaries returned as a response for the API are several times smaller and can fit comfortably in memory. So, there is an opportunity here to fix the memory usage, slow ingestion, and under-pressure metadata store by removing the need to handle large payloads in every layer we can. Of course, the solution becomes complex as we try to fix more layers. With that in mind, this page captures two approaches. They vary in complexity and also in the degree to which they fix the aforementioned problems.
* Clean up query contexts
Uses constants in place of literal strings for context keys.
Moves some QueryContext methods to QueryContexts for reuse.
* Revisions from review comments
* Add QoSFilters first in the chain.
When a request is suspended and later resumed due to QoS constraints,
its filter chain is restarted. Placing QoSFilters first in the chain
avoids double-execution of other filters.
Fixes an issue where requests deferred by QoS would report 403 Forbidden
due to double-execution of SecuritySanityCheckFilter.
* Smaller changes.
* Add QoS filters in BaseJettyTest.
* Remove unused parameter.
* SqlSegmentsMetadataQuery: Fix OVERLAPS for wide target segments.
Segments with endpoints prior to year 0 or after year 9999 may overlap
the search intervals but not match the generated SQL conditions. So, we
need to add an additional OR condition to catch these.
I checked a real, live MySQL metadata store to confirm that the query
still uses metadata store indexes. It does.
* Add comments.
Often users are submitting queries, and ingestion specs that work only if the relevant extension is not loaded. However, the error is too technical for the users and doesn't suggest them to check for missing extensions. This PR modifies the error message so users can at least check their settings before assuming that the error is because of a bug.
Adds a default implementation of getQueryContext, which was added to the Query interface in #12396. Query is marked with @ExtensionPoint, and lately we have been trying to be less volatile on these interfaces by providing default implementations to be more chill for extension writers.
The way this default implementation is done in this PR is a bit strange due to the way that getQueryContext is used (mutated with system default and system generated keys); the default implementation has a specific object that it returns, and I added another temporary default method isLegacyContext that checks if the getQueryContext returns that object or not. If not, callers fall back to using getContext and withOverriddenContext to set these default and system values.
I am open to other ideas as well, but this way should work at least without exploding, and added some tests to ensure that it is wired up correctly for QueryLifecycle, including the context authorization stuff.
The added test shows the strange behavior if query context authorization is enabled, mainly that the system default and system generated query context keys also need to be granted as permissions for things to function correctly. This is not great, so I mentioned it in the javadocs as well. Not sure if it needs to be called out anywhere else.
* Emit state of replace and append for native batch tasks
* Emit count of one depending on batch ingestion mode (APPEND, OVERWRITE, REPLACE)
* Add metric to compaction job
* Avoid null ptr exc when null emitter
* Coverage
* Emit tombstone & segment counts
* Tasks need a type
* Spelling
* Integrate BatchIngestionMode in batch ingestion tasks functionality
* Typos
* Remove batch ingestion type from metric since it is already in a dimension. Move IngestionMode to AbstractTask to facilitate having mode as a dimension. Add metrics to streaming. Add missing coverage.
* Avoid inner class referenced by sub-class inspection. Refactor computation of IngestionMode to make it more robust to null IOConfig and fix test.
* Spelling
* Avoid polluting the Task interface
* Rename computeCompaction methods to avoid ambiguous java compiler error if they are passed null. Other minor cleanup.
Issue:
Even though `CompactionTuningConfig` allows a `maxColumnsToMerge` config
(to optimize memory usage, particulary for datasources with many dimensions),
the corresponding client object `ClientCompactionTaskQueryTuningConfig`
(used by the coordinator duty `CompactSegments` to trigger auto-compaction)
does not contain this field. Thus, the value of `maxColumnsToMerge` specified
in any datasource compaction config is ignored.
Changes:
- Add field `maxColumnsToMerge` in `ClientCompactionTaskQueryTuningConfig`
and `UserCompactionTaskQueryTuningConfig`
- Fix tests
* Ensure ByteBuffers allocated in tests get freed.
Many tests had problems where a direct ByteBuffer would be allocated
and then not freed. This is bad because it causes flaky tests.
To fix this:
1) Add ByteBufferUtils.allocateDirect(size), which returns a ResourceHolder.
This makes it easy to free the direct buffer. Currently, it's only used
in tests, because production code seems OK.
2) Update all usages of ByteBuffer.allocateDirect (off-heap) in tests either
to ByteBuffer.allocate (on-heap, which are garbaged collected), or to
ByteBufferUtils.allocateDirect (wherever it seemed like there was a good
reason for the buffer to be off-heap). Make sure to close all direct
holders when done.
* Changes based on CI results.
* A different approach.
* Roll back BitmapOperationTest stuff.
* Try additional surefire memory.
* Revert "Roll back BitmapOperationTest stuff."
This reverts commit 49f846d9e3.
* Add TestBufferPool.
* Revert Xmx change in tests.
* Better behaved NestedQueryPushDownTest. Exit tests on OOME.
* Fix TestBufferPool.
* Remove T1C from ARM tests.
* Somewhat safer.
* Fix tests.
* Fix style stuff.
* Additional debugging.
* Reset null / expr configs better.
* ExpressionLambdaAggregatorFactory thread-safety.
* Alter forkNode to try to get better info when a JVM crashes.
* Fix buffer retention in ExpressionLambdaAggregatorFactory.
* Remove unused import.
Allow a Druid cluster to kill segments whose interval_end is a date in the future. This can be done by setting druid.coordinator.kill.durationToRetain to a negative period. For example PT-24H would allow segments to be killed if their interval_end date was 24 hours or less into the future at the time that the kill task is generated by the system.
A cluster operator can also disregard the druid.coordinator.kill.durationToRetain entirely by setting a new configuration, druid.coordinator.kill.ignoreDurationToRetain=true. This ignores interval_end date when looking for segments to kill, and instead is capable of killing any segment marked unused. This new configuration is off by default, and a cluster operator should fully understand and accept the risks if they enable it.
* GroupBy: Reduce allocations by reusing entry and key holders.
Two main changes:
1) Reuse Entry objects returned by various implementations of
Grouper.iterator.
2) Reuse key objects contained within those Entry objects.
This is allowed by the contract, which states that entries must be
processed and immediately discarded. However, not all call sites
respected this, so this patch also updates those call sites.
One particularly sneaky way that the old code retained entries too long
is due to Guava's MergingIterator and CombiningIterator. Internally,
these both advance to the next value prior to returning the current
value. So, this patch addresses that in two ways:
1) For merging, we have our own implementation MergeIterator already,
although it had the same problem. So, this patch updates our
implementation to return the current item prior to advancing to the
next item. It also adds a forbidden-api entry to ensure that this
safer implementation is used instead of Guava's.
2) For combining, we address the problem in a different way: by copying
the key when creating the new, combined entry.
* Attempt to fix test.
* Remove unused import.
* Reduce allocations due to Jackson serialization.
This patch attacks two sources of allocations during Jackson
serialization:
1) ObjectMapper.writeValue and JsonGenerator.writeObject create a new
DefaultSerializerProvider instance for each call. It has lots of
fields and creates pressure on the garbage collector. So, this patch
adds helper functions in JacksonUtils that enable reuse of
SerializerProvider objects and updates various call sites to make
use of this.
2) GroupByQueryToolChest copies the ObjectMapper for every query to
install a special module that supports backwards compatibility with
map-based rows. This isn't needed if resultAsArray is set and
all servers are running Druid 0.16.0 or later. This release was a
while ago. So, this patch disables backwards compatibility by default,
which eliminates the need to copy the heavyweight ObjectMapper. The
patch also introduces a configuration option that allows admins to
explicitly enable backwards compatibility.
* Add test.
* Update additional call sites and add to forbidden APIs.
* Support array based results in timeBoundary query
* Fix bug with query interval in timeBoundary
* Convert min(__time) and max(__time) SQL queries to timeBoundary
* Add tests for timeBoundary backed SQL queries
* Fix query plans for existing tests
* fixup! Convert min(__time) and max(__time) SQL queries to timeBoundary
* fixup! Add tests for timeBoundary backed SQL queries
* fixup! Fix bug with query interval in timeBoundary
The query context is a way that the user gives a hint to the Druid query engine, so that they enforce a certain behavior or at least let the query engine prefer a certain plan during query planning. Today, there are 3 types of query context params as below.
Default context params. They are set via druid.query.default.context in runtime properties. Any user context params can be default params.
User context params. They are set in the user query request. See https://druid.apache.org/docs/latest/querying/query-context.html for parameters.
System context params. They are set by the Druid query engine during query processing. These params override other context params.
Today, any context params are allowed to users. This can cause
1) a bad UX if the context param is not matured yet or
2) even query failure or system fault in the worst case if a sensitive param is abused, ex) maxSubqueryRows.
This PR adds an ability to limit context params per user role. That means, a query will fail if you have a context param set in the query that is not allowed to you. To do that, this PR adds a new built-in resource type, QUERY_CONTEXT. The resource to authorize has a name of the context param (such as maxSubqueryRows) and the type of QUERY_CONTEXT. To allow a certain context param for a user, the user should be granted WRITE permission on the context param resource. Here is an example of the permission.
{
"resourceAction" : {
"resource" : {
"name" : "maxSubqueryRows",
"type" : "QUERY_CONTEXT"
},
"action" : "WRITE"
},
"resourceNamePattern" : "maxSubqueryRows"
}
Each role can have multiple permissions for context params. Each permission should be set for different context params.
When a query is issued with a query context X, the query will fail if the user who issued the query does not have WRITE permission on the query context X. In this case,
HTTP endpoints will return 403 response code.
JDBC will throw ForbiddenException.
Note: there is a context param called brokerService that is used only by the router. This param is used to pin your query to run it in a specific broker. Because the authorization is done not in the router, but in the broker, if you have brokerService set in your query without a proper permission, your query will fail in the broker after routing is done. Technically, this is not right because the authorization is checked after the context param takes effect. However, this should not cause any user-facing issue and thus should be OK. The query will still fail if the user doesn’t have permission for brokerService.
The context param authorization can be enabled using druid.auth.authorizeQueryContextParams. This is disabled by default to avoid any hassle when someone upgrades his cluster blindly without reading release notes.
* Make tombstones ingestible by having them return an empty result set.
* Spotbug
* Coverage
* Coverage
* Remove unnecessary exception (checkstyle)
* Fix integration test and add one more to test dropExisting set to false over tombstones
* Force dropExisting to true in auto-compaction when the interval contains only tombstones
* Checkstyle, fix unit test
* Changed flag by mistake, fixing it
* Remove method from interface since this method is specific to only DruidSegmentInputentity
* Fix typo
* Adapt to latest code
* Update comments when only tombstones to compact
* Move empty iterator to a new DruidTombstoneSegmentReader
* Code review feedback
* Checkstyle
* Review feedback
* Coverage
* Optionally load segment index files into page cache on bootstrap and new segment download
* Fix unit test failure
* Fix test case
* fix spelling
* fix spelling
* fix test and test coverage issues
Co-authored-by: Jian Wang <wjhypo@gmail.com>
* add impl
* add impl
* fix checkstyle
* add impl
* add unit test
* fix stuff
* fix stuff
* fix stuff
* add unit test
* add more unit tests
* add more unit tests
* add IT
* add IT
* add IT
* add IT
* add ITs
* address comments
* fix test
* fix test
* fix test
* address comments
* address comments
* address comments
* fix conflict
* fix checkstyle
* address comments
* fix test
* fix checkstyle
* fix test
* fix test
* fix IT
The current default value of inputSegmentSizeBytes is 400MB, which is pretty
low for most compaction use cases. Thus most users are forced to override the
default.
The default value is now increased to Long.MAX_VALUE.
* Store null columns in the segments
* fix test
* remove NullNumericColumn and unused dependency
* fix compile failure
* use guava instead of apache commons
* split new tests
* unused imports
* address comments
* finds complete and active tasks from the same snapshot
* overlord resource
* unit test
* integration test
* javadoc and cleanup
* more cleanup
* fix test and add more
Add config for eager / lazy connection initialization in ResourcePool
Description
Currently, when multiple tasks are launched, each of them eagerly initializes a full pool's worth of connections to the coordinator.
While this is acceptable when the parameter for number of eagerConnections (== maxSize) is small, this can be problematic in environments where it's a large value (say 1000) and multiple tasks are launched simultaneously, which can cause a large number of connections to be created to the coordinator, thereby overwhelming it.
Patch
Nodes like the broker may require eager initialization of resources and do not create connections with the Coordinator.
It is unnecessary to do this with other types of nodes.
A config parameter eagerInitialization is added, which when set to true, initializes the max permissible connections when ResourcePool is initialized.
If set to false, lazy initialization of connection resources takes place.
NOTE: All nodes except the broker have this new parameter set to false in the quickstart as part of this PR
Algorithm
The current implementation relies on the creation of maxSize resources eagerly.
The new implementation's behaviour is as follows:
If a resource has been previously created and is available, lend it.
Else if the number of created resources is less than the allowed parameter, create and lend it.
Else, wait for one of the lent resources to be returned.
* Tombstone support for replace functionality
* A used segment interval is the interval of a current used segment that overlaps any of the input intervals for the spec
* Update compaction test to match replace behavior
* Adapt ITAutoCompactionTest to work with tombstones rather than dropping segments. Add support for tombstones in the broker.
* Style plus simple queriableindex test
* Add segment cache loader tombstone test
* Add more tests
* Add a method to the LogicalSegment to test whether it has any data
* Test filter with some empty logical segments
* Refactor more compaction/dropexisting tests
* Code coverage
* Support for all empty segments
* Skip tombstones when looking-up broker's timeline. Discard changes made to tool chest to avoid empty segments since they will no longer have empty segments after lookup because we are skipping over them.
* Fix null ptr when segment does not have a queriable index
* Add support for empty replace interval (all input data has been filtered out)
* Fixed coverage & style
* Find tombstone versions from lock versions
* Test failures & style
* Interner was making this fail since the two segments were consider equal due to their id's being equal
* Cleanup tombstone version code
* Force timeChunkLock whenever replace (i.e. dropExisting=true) is being used
* Reject replace spec when input intervals are empty
* Documentation
* Style and unit test
* Restore test code deleted by mistake
* Allocate forces TIME_CHUNK locking and uses lock versions. TombstoneShardSpec added.
* Unused imports. Dead code. Test coverage.
* Coverage.
* Prevent killer from throwing an exception for tombstones. This is the killer used in the peon for killing segments.
* Fix OmniKiller + more test coverage.
* Tombstones are now marked using a shard spec
* Drop a segment factory.json in the segment cache for tombstones
* Style
* Style + coverage
* style
* Add TombstoneLoadSpec.class to mapper in test
* Update core/src/main/java/org/apache/druid/segment/loading/TombstoneLoadSpec.java
Typo
Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>
* Update docs/configuration/index.md
Missing
Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>
* Typo
* Integrated replace with an existing test since the replace part was redundant and more importantly, the test file was very close or exceeding the 10 min default "no output" CI Travis threshold.
* Range does not work with multi-dim
Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>
* Always reopen stream in FileUtils.copyLarge, RetryingInputStream.
When an InputStream throws an exception from one of its read methods,
we should assume it's bad and reopen it.
The main changes here are:
- In FileUtils.copyLarge, replace InputStream with InputStreamSupplier.
- In RetryingInputStream, collapse retryCondition and resetCondition
into a single condition. Also, make it required, since every usage
is passing in a specific condition anyway.
* Test fixes.
* Fix read impl.
* add a new query laning metrics to visualize lane assignment
* fixes :spotbugs check
* Update docs/operations/metrics.md
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* Update server/src/main/java/org/apache/druid/server/QueryScheduler.java
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* Update server/src/main/java/org/apache/druid/server/QueryScheduler.java
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
This PR aims to make the ParseExceptions in Druid more informative, by adding additional information (metadata) to the ParseException, which can contain additional information about the exception. For example - the path of the file generating the issue, the line number (where it can be easily fetched - like CsvReader)
Following changes are addressed in this PR:
A new class CloseableIteratorWithMetadata has been created which is like CloseableIterator but also has a metadata method that returns a context Map<String, Object> about the current element returned by next().
IntermediateRowParsingReader#read() now attaches the InputEntity and the "record number" which created the exception (while parsing them), and IntermediateRowParsingReader#sample attaches the InputEntity (but not the "record number").
TextReader (and its subclasses), which is a specific implementation of the IntermediateRowParsingReader also include the line number which caused the generation of the error.
This will also help in triaging the issues when InputSourceReader generates ParseException because it can point to the specific InputEntity which caused the exception (while trying to read it).
Mockito now supports all our needs and plays much better with recent Java versions.
Migrating to Mockito also simplifies running the kind of tests that required PowerMock in the past.
* replace all uses of powermock with mockito-inline
* upgrade mockito to 4.3.1 and fix use of deprecated methods
* import mockito bom to align all our mockito dependencies
* add powermock to forbidden-apis to avoid accidentally reintroducing it in the future
* upgrade Airline to Airline 2
https://github.com/airlift/airline is no longer maintained, updating to
https://github.com/rvesse/airline (Airline 2) to use an actively
maintained version, while minimizing breaking changes.
Note, this is a backwards incompatible change, and extensions relying on
the CliCommandCreator extension point will also need to be updated.
* fix dependency checks where jakarta.inject is now resolved first instead
of javax.inject, due to Airline 2 using jakarta
Problem:
When using a `CachingCostBalancerStrategy` with segments of granularity ALL,
no segment gets loaded.
- With granularity ALL, segments of eternity interval are created which have
`start = Long.MIN_VALUE / 2` and `end = Long.MAX_VALUE / 2`.
- For cost calculation in the balancer strategy, `toLocalInterval()` method is invoked where
`Long.MIN_VALUE / 2` or `Long.MAX_VALUE / 2` cause an overflow thus resulting in no overlap.
- The strategy is unable to find any eligible server for loading a given segment.
Fix:
- Reverse order of operations to divide by `MILLIS_FACTOR` (~10^8) first,
then do the subtraction to prevent Long overflow.
* working
* Lazily load segmentKillers, segmentMovers, and segmentArchivers
* more tests
* test-jar plugin
* more coverage
* lazy client
* clean up changes
* checkstyle
* i did not change the branch condition
* adjust failure rate to run tests faster
* javadocs
* checkstyle
* changes:
* remove SystemSchema duplicate ServerInventoryView in broker
* suppress duplicate segment added/removed warnings in HttpServerInventoryView when doing a full sync
* fixes
Fixes#12022
### Description
The current implementations of memory estimation in `OnHeapIncrementalIndex` and `StringDimensionIndexer` tend to over-estimate which leads to more persistence cycles than necessary.
This PR replaces the max estimation mechanism with getting the incremental memory used by the aggregator or indexer at each invocation of `aggregate` or `encode` respectively.
### Changes
- Add new flag `useMaxMemoryEstimates` in the task context. This overrides the same flag in DefaultTaskConfig i.e. `druid.indexer.task.default.context` map
- Add method `AggregatorFactory.factorizeWithSize()` that returns an `AggregatorAndSize` which contains
the aggregator instance and the estimated initial size of the aggregator
- Add method `Aggregator.aggregateWithSize()` which returns the incremental memory used by this aggregation step
- Update the method `DimensionIndexer.processRowValsToKeyComponent()` to return the encoded key component as well as its effective size in bytes
- Update `OnHeapIncrementalIndex` to use the new estimations only if `useMaxMemoryEstimates = false`
Fixes an issue where a load-drop-load sequence for a segment and historical doesn't work correctly for http based load queue peon. The first cycle of load-drop works fine - the problem comes when there is an attempt to reload the segment. The historical caches load success for some recent segments and makes the reload as a no-op. But it doesn't consider that fact that the segment was also dropped in between the load requests.
This change invalidates the cache after a client tries to fetch a success result.
* init multiValue column group by
* Changing sorting to Lexicographic as default
* Adding initial tests
* 1.Fixing test cases adding
2.Optimized inmem structs
* Linking SQL layer to native layer
* Adding multiDimension support to group by column strategy
* 1. Removing array coercion in Calcite layer
2. Removing ResultRowDeserializer
* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector
* 1. Supporting all primitive array types
2. Removing dimension spec as part of columnSelector
* 1. Checkstyle things
2. Removing flag
* Minor naming things
* CheckStyle Things
* Fixing test case
* Fixing hashing
* 1. Adding the MV function
2. Added few test cases
* 1. Adding MV function test cases
* Adding Selector strategy function test cases
* Fixing ClientQuerySegmentWalkerTest
* Adding GroupByQueryRunnerTest test cases
* Fixing test cases
* Adding few more test cases
* Fixing Exception asset statement and intellij inspection
* Adding null compatibility tests
* Review comments
* Fixing few failing tests
* Fixing few failing tests
* Do no convert to topN Q incase of group by on array
* Fixing checkstyle
* Fixing differences between jdk's class cast exception message
* 1. Fixing ordering if the grouping key is an array
* Fixing DefaultLimitSpec
* Fixing CalciteArraysQueryTest
* Dummy commit for LGTM
* changes:
* only coerce multi-value string null values when `ExpressionPlan.Trait.NEEDS_APPLIED` is set
* correct return type inference for ARRAY_APPEND,ARRAY_PREPEND,ARRAY_SLICE,ARRAY_CONCAT
* fix bug with ExprEval.ofType when actual type of object from binding doesn't match its claimed type
* Review comments
* Fixing test cases
* Fixing spot bugs
* Fixing strict compile
Co-authored-by: Clint Wylie <cwylie@apache.org>
* Add http response status code to org.eclipse.jetty.server.RequestLog
* http response code is expressed as an int. Set log msg interpolation based on digit
* trying to add an unit test to verify if the logger.debug method is called
* trying to add an unit test to verify if the logger.debug method is called
* fix compilation issues
* remove test
* Thread pool for broker
* Updating two tests to improve coverage for new method added
* Updating druidProcessingConfigTest to cover coverage
* Adding missed spelling errors caused in doc
* Adding test to cover lines of new function added
This index helps in faster query results during kill task's query on interval based unused segment listing. This can become a bottleneck in some production loads causing coordinator to wait longer for metadata db replies and impacting Kafka ingestion. The modified index has helped reduce the query times for such queries.
* clean up the balancing code around the batched vs deprecated way of sampling segments to balance
* fix docs, clarify comments, add deprecated annotations to legacy code
* remove unused variable
* update dynamic config dialog in console to state percentOfSegmentsToConsiderPerMove deprecated
* fix dynamic config text for percentOfSegmentsToConsiderPerMove
* run prettier to cleanup coordinator-dynamic-config.tsx changes
* update jest snapshot
* update documentation per review feedback
* fix bug where queries fail immediately when timeout is 0 instead of using default timeout
* fix to use serverside max
* more better
* less flaky test
* oops
This PR fixes an issue in which if a lookup is configured incorreclty; does not serialize properly when being pulled by peon node, it causes the task to fail. The failure occurs because the peon and other leaf nodes (broker, historical), have retry logic that continues to retry the lookup loading for 3 minutes by default. The http listener thread on the peon task is not started until lookup loading completes, by default, the overlord waits 1 minute by default, to communicate with the peon task to get the task status, after which is orders the task to shut down, causing the ingestion task to fail.
To fix the issue, we catch the exception serialization error, and do not retry. Also fixed an issue in which a bad lookup config interferes with any other good lookup configs from being loaded.
This PR does two things
1. It adds the capability to surface missing features in SQL to users - The calcite planner will explore through multiple rules to convert a logical SQL query to a druid native query. Some rules change the shape of the query itself, optimize it and some rules are responsible for translating the query into a druid native query. These are DruidQueryRule, DruidOuterQueryRule, DruidJoinRule, DruidUnionDataSourceRule, DruidUnionRule etc. These rules will look at SQL and will do the necessary transformation. But if the rule can't transform the query, it returns back the control to the calcite planner without recording why was it not able to transform. E.g. there is a join query with a non-equal join condition. DruidJoinRule will look at the condition, see that it is not supported, and return back the control. The reason can be that a query can be planned in many different ways so if one rule can't parse it, the query may still be parseable by other rules. In this PR, we are intercepting these gaps and passing them back to the user if the query could not be planned at all.
2. The said capability has been used to generate actionable errors for some common unsupported SQL features. However, not all possible errors are covered and we can keep adding more in the future.
* Refactor ResponseContext
Fixes a number of issues in preparation for request trailers
and the query profile.
* Converts keys from an enum to classes for smaller code
* Wraps stored values in functions for easier capture for other uses
* Reworks the "header squeezer" to handle types other than arrays.
* Uses metadata for visibility, and ability to compress,
to replace ad-hoc code.
* Cleans up JSON serialization for the response context.
* Other miscellaneous cleanup.
* Handle unknown keys in deserialization
Also, make "Visibility" into a boolean.
* Revised comment
* Renamd variable
Druid currently has 2 serverViews, regular serverView and filtered serverView. The regular serverView is used to monitor all segment announcements from all data nodes (historicals, tasks, indexers). The filtered serverView is used when you want to watch segment announcements from particular tiers. Since these server views keep track of different sets of druidServers and segments in memory, they should be maintained separately. However, they currently share the same name for their executorService, which can cause confusion and make debugging harder especially in the broker since it is using both serverViews, the filtered view for normal query processing and the regular view to serve the servers table (I'm unsure whether this is intended or whether this is a good behavior). This PR changes it to a more obvious name.
This PR also removes SingleServerInventoryView. This view was deprecated a long time ago and has not been documented at least since 0.13 (#6127). I also don't think this can be better in any case than BatchServerInventoryView. Finally, I merged AbstractCuratorServerInventoryView and BatchServerInventoryView as we no longer need AbstractCuratorServerInventoryView after SingleServerInventoryView is removed.
* Make nodeRole available during binding; add support for dynamic registration of DruidService
* fix checkstyle and test
* fix customRole test
* address comments
* add more javadoc
* Enhancements to IndexTaskClient.
1) Ability to use handlers other than StringFullResponseHandler. This
functionality is not used in production code yet, but is useful
because it will allow tasks to communicate with each other in
non-string-based formats and in streaming fashion. In the future,
we'll be able to use this to make task-to-task communication
more efficient.
2) Truncate server errors at 1KB, so long errors do not pollute logs.
3) Change error log level for retryable errors from WARN to INFO. (The
final error is still WARN.)
4) Harmonize log and exception messages to have a more consistent format.
* Additional tests and improvements.
* Code cleanup from query profile project
* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse
* Reverted change due to lack of tests
* Use intermediate-persist IndexSpec during multiphase merge.
The main change is the addition of an intermediate-persist IndexSpec
to the main "merge" method in IndexMerger. There are also a few minor
adjustments to the IndexMerger interface to encourage more harmonious
usage of its methods in the future.
* Additional changes inspired by the test coverage checker.
- Remove unused-in-production IndexMerger methods "append" and "convert".
- Add additional unit tests to UnifiedIndexerAppenderatorsManager.
* Additional adjustments.
* Even more additional adjustments.
* Test fixes.
* Consolidate a bunch of ad-hoc segments metadata SQL; fix some bugs.
This patch gathers together a variety of SQL from SqlSegmentsMetadataManager
and IndexerSQLMetadataStorageCoordinator into a new class SqlSegmentsMetadataQuery.
It focuses on SQL related to retrieving segment payloads and marking
segments used and unused.
In addition to cleaning up the code a bit, this patch also fixes a bug
with years before 0 or after 9999. The prior SQL did not work properly
because dates outside this range cannot be compared as strings. The new
code does work for these far-past and far-future years.
So, if you're ever interested in using Druid to analyze things from
ancient Babylon, you better apply this patch first!
* Fix test compiling.
* Fixes and improvements.
* Fix forbidden API.
* Additional fixes.
* add impl
* fix checkstyle
* add test
* add test
* add unit tests
* fix unit tests
* fix unit tests
* fix unit tests
* add IT
* add IT
* add comments
* fix spelling
* Make LocalInputSource.files a List instead of Set and adjust wikipedia_index_task to use file list.
Rationale: the behavior of wikipedia_index_task.json is order-dependent with regard to its input
files; some orders produce 4 segments and some produce 5 segments. Some integration tests, like
ITSystemTableBatchIndexTaskTest and ITAutoCompactionTest, are written assuming that the
4-segment case will always happen. Providing the file list in a specific order ensures that this
will happen as expected by the tests.
I didn't see a specific reason why the LocalInputSource.files parameter needed to be a Set, so
changing it to a List was the simplest way to achieve the consistent ordering. I think it will
also make the behavior make more sense if someone does specify the same input file multiple
times in a spec: I think they'd expect it to be loaded multiple times instead of deduped. This
is consistent with the behavior of other input sources like S3, GCS, HTTP.
* Sort files in LocalFirehoseFactory.
* Add worker category as dimension in TaskSlotCountStatsMonitor
* Change description
* Add workerConfig as field
* Modify HttpRemoteTaskRunnerTest to test worker category in taskslot metrics
* Fixing tests
* Fixing alerts
* Adding unit test in SingleTaskBackgroundRunnerTest for task slot metrics APIs
* Resolving false positive spell check
* addressing comments
* throw UnsupportedOperationException for tasklotmetrics APIs in SingleTaskBackgroundRunner
Co-authored-by: Nikhil Navadiya <nnavadiya@twitter.com>
* Fix bug Unrecognized token 'No': was expecting (JSON String,...) when calling the API /druid/indexer/v1/task/taskId/reports and the report is not found
* Also log other non-OK statuses
* Use a simple class to sanitize sanitizable errors and log them
The purpose of this is to sanitize JDBC errors, but can sanitize other errors
if they implement SanitizableError Interface
add a class to log errors and sanitize them
added a simple test that tests out that the error gets sanitized
add @NonNull annotation to serverconfig's ErrorResponseTransfromStrategy
* return less information as part of too many connections, and instead only log specific details
This is so an end user gets relevant information but not too much info since they might now how
many brokers they have
* return only runtime exceptions
added new error types that need to be sanitized
also sanitize deprecated and unsupported exceptions.
* dont reqrewite exceptions unless necessary for checked exceptions
add docs
avoid blanket turning all exceptions into runtime exceptions
* address comments, to fix up docs.
add more javadocs
add support UOE sanitization
* use try catch instead and sanitize at public methods
* checkstyle fixes
* throw noSuchStatement and NoSuchConnection as Avatica is affected by those
* address comments. move log error back to druid meta
clean up bad formatting and commented code. add missed catch for NoSuchStatementException
clean up comments for error handler and add comment explainging not wanting to santize avatica exceptions
* alter test to reflect new error message
There are 3 types of query IDs - id, subQueryId, sqlQueryId. Currently, whenever a query generates subqueries, the subquery's subQueryId is populated randomly. Also, subquery's Id is not set to the parent query Id. Therefore there is no way of linking the subqueries to the parent query, and one loses the ability to look at end to end view of the query.
This PR aims to implement following couple of things:
Populate the subqueries with it's parent's id (and sqlQueryId if present)
Populate the subqueryId such that it forms a hierarchical relationship amongs themselves. For example, if there is a query which launches a subquery, which in turn launches a couple of subqueries, then the ids and subQueryIds should have following structure.
* RowBasedSegment: Use Sequence instead of Iterable.
The main reason this is good is that Sequences can include baggage that
must be closed after iteration is finished. This enables creating
RowBasedSegments on top of closeable sequences of rows.
To preserve the optimization that allows reversing a List without
copying it, this patch also makes SimpleSequence its own class and allows
extracting the Iterable that was used to create it.
* Fix tests.
* Remove StorageAdapter.getColumnTypeName.
It was only used by SegmentAnalyzer, and isn't necessary anymore due to
the recent improvements to ColumnCapabilities.
Also: tidy ColumnDescriptor.read slightly by removing an instanceof
check, and moving the relevant logic into ComplexColumnPartSerde.
* Fix spellings.
* complex typed expressions
* add built-in hll collector expressions to get coverage on druid-processing, more types, more better
* rampage!!!
* more javadoc
* adjustments
* oops
* lol
* remove unused dependency
* contradiction?
* more test
This PR adds support for range partitioning on multiple dimensions. It extends on the
concept and implementation of single dimension range partitioning.
The new partition type added is range which corresponds to a set of Dimension Range Partition classes. single_dim is now treated as a range type partition with a single partition dimension.
The start and end values of a DimensionRangeShardSpec are represented
by StringTuples, where each String in the tuple is the value of a partition dimension.
The new config is an extension of the concept of "watchedTiers" where
the Broker can choose to add the info of only the specified tiers to its timeline.
Similarly, with this config, Broker can choose to skip the realtime nodes and
thus it would query only Historical processes for any given segment.
* remove useless call to balanceServers for move from decom servers when there are no decom servers
* refactor approach to this PR but accomplish the same thing
* Remove CloseQuietly and migrate its usages to other methods.
These other methods include:
1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions
in RuntimeExceptions for callers that just want to avoid dealing with
checked exceptions. Most usages were migrated to this method, because it
looks like they were mainly attempts to avoid declaring a throws clause,
and perhaps were unintentionally suppressing IOExceptions.
2) New method CloseableUtils.closeInCatch, designed to properly close something
in a catch block without losing exceptions. Some usages from catch blocks
were migrated here, when it seemed that they were intended to avoid checked
exception handling, and did not really intend to also suppress IOExceptions.
3) New method CloseableUtils.closeAndSuppressExceptions, which sends all
exceptions to a "chomper" that consumes them. Nothing is thrown or returned.
The behavior is slightly different: with this method, _all_ exceptions are
suppressed, not just IOExceptions. Calls that seemed like they had good
reason to suppress exceptions were migrated here.
4) Some calls were migrated to try-with-resources, in cases where it appeared
that CloseQuietly was being used to avoid throwing an exception in a finally
block.
🎵 You don't have to go home, but you can't stay here... 🎵
* Remove unused import.
* Fix up various issues.
* Adjustments to tests.
* Fix null handling.
* Additional test.
* Adjustments from review.
* Fixup style stuff.
* Fix NPE caused by holder starting out null.
* Fix spelling.
* Chomp Throwables too.
* better type system
* needle in a haystack
* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support
* fixup merge
* more test
* fixup
* intern
* fix
* oops
* oops again
* ...
* more test coverage
* fix error message
* adjust interning, more javadocs
* oops
* more docs more better
this change ensures that JettyTest is setting the properties it needs in case some other test overwrites them
this also changes up the ordering of the call for setProperties to call super's first in case super is setting the same property
* Add the ability to add a context to internally generated druid broker queries
* fix docs
* changes after first CI failure
* cleanup after merge with master
* change default to empty map and improve unit tests
* add doc info and fix checkstyle
* refactor DruidSchema#runSegmentMetadataQuery and add a unit test
The new config is an extension of the concept of "watchedTiers" where
the Broker can choose to add the info of only the specified tiers to its timeline.
Similarly, with this config, Broker can choose to ignore the segments being served
by the specified historical tiers. By default, no tier is ignored.
This config is useful when you want a completely isolated tier amongst many other tiers.
Say there are several tiers of historicals Tier T1, Tier T2 ... Tier Tn
and there are several brokers Broker B1, Broker B2 .... Broker Bm
If we want only Broker B1 to query Tier T1, instead of setting a long list of watchedTiers
on each of the other Brokers B2 ... Bm, we could just set druid.broker.segment.ignoredTiers=["T1"]
for these Brokers, while Broker B1 could have druid.broker.segment.watchedTiers=["T1"]
* Fix issue of duplicate key under certain conditions when loading late data in streaming. Also fixes a documentation issue with skipSegmentLineageCheck.
* maxId may be null at this point, need to check for that
* Remove hypothetical case (it cannot happen)
* Revert compaction is simply "killing" the compacted segment and previously, used, overshadowed segments are visible again
* Add comments
* refactor sql authorization to get resource type from schema, refactor resource type from enum to string
* information schema auth filtering adjustments
* refactor
* minor stuff
* Update SqlResourceCollectorShuttle.java
* Make persists concurrent with ingestion
* Remove semaphore but keep concurrent persists (with add) and add push in the backround as well
* Go back to documented default persists (zero)
* Move to debug
* Remove unnecessary Atomics
* Comments on synchronization (or not) for sinks & sinkMetadata
* Some cleanup for unit tests but they still need further work
* Shutdown & wait for persists and push on close
* Provide support for three existing batch appenderators using batchProcessingMode flag
* Fix reference to wrong appenderator
* Fix doc typos
* Add BatchAppenderators class test coverage
* Add log message to batchProcessingMode final value, fix typo in enum name
* Another typo and minor fix to log message
* LEGACY->OPEN_SEGMENTS, Edit docs
* Minor update legacy->open segments log message
* More code comments, mostly small adjustments to naming etc
* fix spelling
* Exclude BtachAppenderators from Jacoco since it is fully tested but Jacoco still refuses to ack coverage
* Coverage for Appenderators & BatchAppenderators, name change of a method that was still using "legacy" rather than "openSegments"
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
* initial work
* reduce lock in sqlLifecycle
* Integration test for sql canceling
* javadoc, cleanup, more tests
* log level to debug
* fix test
* checkstyle
* fix flaky test; address comments
* rowTransformer
* cancelled state
* use lock
* explode instead of noop
* oops
* unused import
* less aggressive with state
* fix calcite charset
* don't emit metrics when you are not authorized
Fixes#11297.
Description
Description and design in the proposal #11297
Key changed/added classes in this PR
*DataSegmentPusher
*ShuffleClient
*PartitionStat
*PartitionLocation
*IntermediaryDataManager
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.
This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.
To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
* Improve concurrency between DruidSchema and BrokerServerView
* unused imports and workaround for error prone faiure
* count only known segments
* add comments
* Update default maxSegmentsInNodeLoadingQueue
Update the default maxSegmentsInNodeLoadingQueue from 0 (unbounded) to 100.
An unbounded maxSegmentsInNodeLoadingQueue can cause cluster instability.
Since this is the default druid operators need to run into this instability
and then look through the docs to see that the recommended value for a large
cluster is 1000. This change makes it so the default will prevent clusters
from falling over as they grow over time.
* update tests
* codestyle
* Allow kill task to mark segments as unused
* Add IndexerSQLMetadataStorageCoordinator test
* Update docs/ingestion/data-management.md
Co-authored-by: Jihoon Son <jihoonson@apache.org>
* Add warning to kill task doc
Co-authored-by: Jihoon Son <jihoonson@apache.org>
This change allows the selection of a specific broker service (or broker tier) by the Router.
The newly added ManualTieredBrokerSelectorStrategy works as follows:
Check for the parameter brokerService in the query context. If this is a valid broker service, use it.
Check if the field defaultManualBrokerService has been set in the strategy. If this is a valid broker service, use it.
Move on to the next strategy
* Add a new metric query/segments/count that is not emitted by default
* docs
* test the default implementation of the metric
* fix spelling error in docs
* document the fact that query retries will result in additional metric emissions
* update using recommended text from @jihoonson
This PR refactors the code related to segment loading specifically SegmentLoader and SegmentLoaderLocalCacheManager. SegmentLoader is marked UnstableAPI which means, it can be extended outside core druid in custom extensions. Here is a summary of changes
SegmentLoader returns an instance of ReferenceCountingSegment instead of Segment. Earlier, SegmentManager was wrapping Segment objects inside ReferenceCountingSegment. That is now moved to SegmentLoader. With this, a custom implementation can track the references of segments. It also allows them to create custom ReferenceCountingSegment implementations. For this reason, the constructor visibility in ReferenceCountingSegment is changed from private to protected.
SegmentCacheManager has two additional methods called - reserve(DataSegment) and release(DataSegment). These methods let the caller reserve or release space without calling SegmentLoader#getSegment. We already had similar methods in StorageLocation and now they are available in SegmentCacheManager too which wraps multiple locations.
Refactoring to simplify the code in SegmentCacheManager wherever possible. There is no change in the functionality.