* Add ZooKeeper connection state alerts and metrics.
- New metric "zk/connected" is an indicator showing 1 when connected,
0 when disconnected.
- New metric "zk/disconnected/time" measures time spent disconnected.
- New alert when Curator connection state enters LOST or SUSPENDED.
* Use right GuardedBy.
* Test fixes, coverage.
* Adjustment.
* Fix tests.
* Fix ITs.
* Improved injection.
* Adjust metric name, add tests.
Cache is disabled for GroupByStrategyV2 on broker since the pr #3820 [groupBy v2: Results not fully merged when caching is enabled on the broker]. But we can enable the result-level cache on broker for GroupByStrategyV2 and keep the segment-level cache disabled.
Description:
`TaskQueue.notifyStatus` is often a heavy call as it performs the following operations:
- Update task status in metadata DB
- Update task locks in metadata DB
- Request (synchronously) the task runner to shutdown the completed task
- Clean up in-memory data structures
This method can often be slow and can cause worker sync / task runners to slow down.
Main changes:
- Run task completion callbacks in a separate executor to handle task completion updates
- Add new config `druid.indexer.queue.taskCompleteHandlerNumThreads`
- Add metrics to monitor number of processed and queued items
- There are still other paths that can invoke `notifyStatus`, but those need not be moved to
the new executor as they are synchronous on purpose.
Other changes:
- Add new metrics `task/status/queue/count`, `task/status/handled/count`
- Add `TaskCountStatsProvider.getStats()` which deprecates the other `getXXXTaskCount` methods.
- Use `CoordinatorRunStats` to collect and report metrics. This class has been used as is
for now but will later be renamed and repurposed to use across all Druid services.
The wait doesn't seem to serve a purpose, other than causing delays
when checking isInitialized() for a large number of things that have
not yet been initialized.
UniformGranularityTest's test to test a large number of intervals
runs through 10 years of 1 second intervals. This pushes a lot of
stuff through IntervalIterator and shows up in terms of test
runtime as one of the hottest tests. Most of the time is going to
constructing jodatime objects because it is doing things with
DateTime objects instead of millis. Change the calls to use
millis instead and things go faster.
If a server is removed during `HttpServerInventoryView.serverInventoryInitialized`,
the initialization gets stuck as this server is never synced. The method eventually times
out (default 250s).
Fix: Mark a server as stopped if it is removed. `serverInventoryInitialized` only waits for
non-stopped servers to sync.
Other changes:
- Add new metrics for better debugging of slow broker/coordinator startup
- `segment/serverview/sync/healthy`: whether the server view is syncing properly with a server
- `segment/serverview/sync/unstableTime`: time for which sync with a server has been unstable
- Clean up logging in `HttpServerInventoryView` and `ChangeRequestHttpSyncer`
- Minor refactor for readability
- Add utility class `Stopwatch`
- Add tests and stubs
After #13197 , several coordinator configs are now redundant as they are not being
used anymore, neither with `smartSegmentLoading` nor otherwise.
Changes:
- Remove dynamic configs `emitBalancingStats`: balancer error stats are always
emitted, debug stats can be logged by using `debugDimensions`
- `useBatchedSegmentSampler`, `percentOfSegmentsToConsiderPerMove`:
batched segment sampling is always used
- Add test to verify deserialization with unknown properties
- Update `CoordinatorRunStats` to always track stats, this can be optimized later.
* combine string column implementations
changes:
* generic indexed, front-coded, and auto string columns now all share the same column and index supplier implementations
* remove CachingIndexed implementation, which I think is largely no longer needed by the switch of many things to directly using ByteBuffer, avoiding the cost of creating Strings
* remove ColumnConfig.columnCacheSizeBytes since CachingIndexed was the only user
In these other cases, stick to plain "filter". This simplifies lots of
logic downstream, and doesn't hurt since we don't have intervals-specific
optimizations outside of tables.
Fixes an issue where we couldn't properly filter on a column from an
external datasource if it was named __time.
This PR aims to expose a new API called
"@path("/druid/v2/sql/statements/")" which takes the same payload as the current "/druid/v2/sql" endpoint and allows users to fetch results in an async manner.
Changes to `cost` strategy:
- In every `ServerHolder`, track the number of segments per datasource per interval
- Perform cost computations for a given interval just once, and then multiply by a constant
factor to account for the total segment count in that interval
- Do not perform joint cost computations with segments that are outside the compute interval
(± 45 days) for the segment being considered for move
- Remove metrics `segment/cost/*` as they were coordinator killers! Turning on these metrics
(by setting `emitBalancingStats` to true) has often caused the coordinator to be stuck for hours.
Moreover, they are too complicated to decipher and do not provide any meaningful insight into
a Druid cluster.
- Add new simpler metrics `segment/balancer/compute/*` to track cost computation time,
count and errors.
Other changes:
- Remove flaky test from `CostBalancerStrategyTest`.
- Add tests to verify that computed cost has remained unchanged
- Remove usages of mock `BalancerStrategy` from `LoadRuleTest`, `BalanceSegmentsTest`
- Clean up `BalancerStrategy` interface
* Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status
* make the monitor more general
* resolve conflict
* use Supplier pattern to provide metrics
* reformat code and doc
* move service specific tag to dimension
* minor refine
* update doc
* reformat code
* address comments
* remove declared exception
* bind HeartbeatSupplier conditionally in Coordinator
Users can now add a guardrail to prevent subquery’s results from exceeding the set number of bytes by setting druid.server.http.maxSubqueryRows in Broker's config or maxSubqueryRows in the query context. This feature is experimental for now and would default back to row-based limiting in case it fails to get the accurate size of the results consumed by the query.
Changes:
- Add property `useDefaultTierForNull` for all load rules. This property determines the default
value of `tieredReplicants` if it is not specified. When true, the default is `_default_tier => 2 replicas`.
When false, the default is empty, i.e. no replicas on any tier.
- Fix validation to allow empty replicants map, so that the segment is used but not loaded anywhere.
Added a new monitor SysMonitorOshi to replace SysMonitor. The new monitor has a wider support for different machine architectures including ARM instances. Please switch to SysMonitorOshi as SysMonitor is now deprecated and will be removed in future releases.
This commit does a complete revamp of the coordinator to address problem areas:
- Stability: Fix several bugs, add capabilities to prioritize and cancel load queue items
- Visibility: Add new metrics, improve logs, revamp `CoordinatorRunStats`
- Configuration: Add dynamic config `smartSegmentLoading` to automatically set
optimal values for all segment loading configs such as `maxSegmentsToMove`,
`replicationThrottleLimit` and `maxSegmentsInNodeLoadingQueue`.
Changed classes:
- Add `StrategicSegmentAssigner` to make assignment decisions for load, replicate and move
- Add `SegmentAction` to distinguish between load, replicate, drop and move operations
- Add `SegmentReplicationStatus` to capture current state of replication of all used segments
- Add `SegmentLoadingConfig` to contain recomputed dynamic config values
- Simplify classes `LoadRule`, `BroadcastRule`
- Simplify the `BalancerStrategy` and `CostBalancerStrategy`
- Add several new methods to `ServerHolder` to track loaded and queued segments
- Refactor `DruidCoordinator`
Impact:
- Enable `smartSegmentLoading` by default. With this enabled, none of the following
dynamic configs need to be set: `maxSegmentsToMove`, `replicationThrottleLimit`,
`maxSegmentsInNodeLoadingQueue`, `useRoundRobinSegmentAssignment`,
`emitBalancingStats` and `replicantLifetime`.
- Coordinator reports richer metrics and produces cleaner and more informative logs
- Coordinator uses an unlimited load queue for all serves, and makes better assignment decisions
Introduce DruidException, an exception whose goal in life is to be delivered to a user.
DruidException itself has javadoc on it to describe how it should be used. This commit both introduces the Exception and adjusts some of the places that are generating exceptions to generate DruidException objects instead, as a way to show how the Exception should be used.
This work was a 3rd iteration on top of work that was started by Paul Rogers. I don't know if his name will survive the squash-and-merge, so I'm calling it out here and thanking him for starting on this.
Description:
Druid allows a configuration of load rules that may cause a used segment to not be loaded
on any historical. This status is not tracked in the sys.segments table on the broker, which
makes it difficult to determine if the unavailability of a segment is expected and if we should
not wait for it to be loaded on a server after ingestion has finished.
Changes:
- Track replication factor in `SegmentReplicantLookup` during evaluation of load rules
- Update API `/druid/coordinator/v1metadata/segments` to return replication factor
- Add column `replication_factor` to the sys.segments virtual table and populate it in
`MetadataSegmentView`
- If this column is 0, the segment is not assigned to any historical and will not be loaded.
Changes
- Add a `DruidException` which contains a user-facing error message, HTTP response code
- Make `EntryExistsException` extend `DruidException`
- If metadata store max_allowed_packet limit is violated while inserting a new task, throw
`DruidException` with response code 400 (bad request) to prevent retries
- Add `SQLMetadataConnector.isRootCausePacketTooBigException` with impl for MySQL
Changes:
- Add a timeout of 1 minute to resultFuture.get() in `CostBalancerStrategy.chooseBestServer`.
1 minute is the typical time for a full coordinator run and is more than enough time for cost
computations of a single segment.
- Raise an alert if an exception is encountered while computing costs and if the executor has
not been shutdown. This is because a shutdown is intentional and does not require an alert.
It was found that several supported tasks / input sources did not have implementations for the methods used by the input source security feature, causing these tasks and input sources to fail when used with this feature. This pr adds the needed missing implementations. Also securing the sampling endpoint with input source security, when enabled.
Changes:
- `CostBalancerStrategyTest`
- Focus on verification of cost computations rather than choosing servers in this test
- Add new tests `testComputeCost` and `testJointSegmentsCost`
- Add tests to demonstrate that with a long enough interval gap, all costs become negligible
- Retain `testIntervalCost` and `testIntervalCostAdditivity`
- Remove redundant tests such as `testStrategyMultiThreaded`, `testStrategySingleThreaded`as
verification of this behaviour is better suited to `BalancingStrategiesTest`.
- `CostBalancerStrategyBenchmark`
- Remove usage of static method from `CostBalancerStrategyTest`
- Explicitly setup cluster and segments to use for benchmarking
The defaults of the following config values in the `CoordinatorDynamicConfig` are being updated.
1. `maxSegmentsInNodeLoadingQueue = 500` (previous = 100)
2. `replicationThrottleLimit = 500` (previous = 10)
Rationale: With round-robin segment assignment now being the default assignment technique,
the Coordinator can assign a large number of under-replicated/unavailable segments very quickly,
without getting stuck in `RunRules` duty due to very slow strategy-based cost computations.
3. `maxSegmentsToMove = 100` (previous = 5)
Rationale: A very low value (say 5) is ineffective in balancing especially if there are many segments
to balance. A very large value can cause excessive moves, which has these disadvantages:
- Load of moving segments competing with load of unavailable/under-replicated segments
- Unnecessary network costs due to constant download and delete of segments
These defaults will be revisited after #13197 is merged.
* Expr getCacheKey now delegates to children
* Removed the LOOKUP_EXPR_CACHE_KEY as we do not need it
* Adding an unit test
* Update processing/src/main/java/org/apache/druid/math/expr/Expr.java
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
---------
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
The "new" IT framework provides a convenient way to package and run integration tests (ITs), but only for core modules. We have a use case to run an IT for a contrib extension: the proposed gRPC query extension. This PR provides the IT framework functionality to allow non-core ITs.
* Be able to load segments on Peons
This change introduces a new config on WorkerConfig
that indicates how many bytes of each storage
location to use for storage of a task. Said config
is divided up amongst the locations and slots
and then used to set TaskConfig.tmpStorageBytesPerTask
The Peons use their local task dir and
tmpStorageBytesPerTask as their StorageLocations for
the SegmentManager such that they can accept broadcast
segments.
Changes:
- Replace `OverlordHelper` with `OverlordDuty` to align with `CoordinatorDuty`
- Each duty has a `run()` method and defines a `Schedule` with an initial delay and period.
- Update existing duties `TaskLogAutoCleaner` and `DurableStorageCleaner`
- Add utility class `Configs`
- Update log, error messages and javadocs
- Other minor style improvements
Changes:
- Do not allow retention rules for any datasource or cluster to be null
- Allow empty rules at the datasource level but not at the cluster level
- Add validation to ensure that `druid.manager.rules.defaultRule` is always set correctly
- Minor style refactors
This PR fixes an issue that could occur if druid.query.scheduler.numThreads is configured and any exception occurs after QueryScheduler.run has been called to create a Sequence. This would result in total and/or lane specific locks being acquired, but because the sequence was not actually being evaluated, the "baggage" which typically releases these locks was not being executed. An example of how this can happen is if a group-by having filter, which wraps and transforms this sequence happens to explode while wrapping the sequence. The end result is that the locks are acquired, but never released, eventually halting the ability to execute any queries.
This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so impacts Druid versions 0.22.0+.
While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.
* Make LoggingEmitter more useful
* Skip code coverage for facade classes
* fix spellcheck
* code review
* fix dependency
* logging.md
* fix checkstyle
* Add back jacoco version to main pom
* Fix two concurrency issues with segment fetching.
1) SegmentLocalCacheManager: Fix a concurrency issue where certain directory
cleanup happened outside of directoryWriteRemoveLock. This created the
possibility that segments would be deleted by one thread, while being
actively downloaded by another thread.
2) TaskDataSegmentProcessor (MSQ): Fix a concurrency issue when two stages
in the same process both use the same segment. For example: a self-join
using distributed sort-merge. Prior to this change, the two stages could
delete each others' segments.
3) ReferenceCountingResourceHolder: increment() returns a new ResourceHolder,
rather than a Releaser. This allows it to be passed to callers without them
having to hold on to both the original ResourceHolder *and* a Releaser.
4) Simplify various interfaces and implementations by using ResourceHolder
instead of Pair and instead of split-up fields.
* Add test.
* Fix style.
* Remove Releaser.
* Updates from master.
* Add some GuardedBys.
* Use the correct GuardedBy.
* Adjustments.
* Refresh DruidLeaderClient cache for non-200 responses
* Change local variable name to avoid confusion
* Implicit retries for 503 and 504
* Remove unused imports
* Use argumentmatcher instead of Mockito for #any in test
* Remove flag to disable retry for 503/504
* Remove unused import from test
* Add log line for internal retry
---------
Co-authored-by: Abhishek Singh Chouhan <abhishek.chouhan@salesforce.com>
### Description
This pr fixes a few bugs found with the inputSource security feature.
1. `KillUnusedSegmentsTask` previously had no definition for the `getInputSourceResources`, which caused an unsupportedOperationException to be thrown when this task type was submitted with the inputSource security feature enabled. This task type should not require any input source specific resources, so returning an empty set for this task type now.
2. Fixed a bug where when the input source type security feature is enabled, all of the input source type specific resources used where authenticated against:
`{"resource": {"name": "EXTERNAL", "type": "{INPUT_SOURCE_TYPE}"}, "action": "READ"}`
When they should be instead authenticated against:
`{"resource": {"name": "{INPUT_SOURCE_TYPE}", "type": "EXTERNAL"}, "action": "READ"}`
3. fixed bug where supervisor tasks were not authenticated against the specific input source types used, if input source security feature was enabled.
* Make the tasks run with only a single directory
There was a change that tried to get indexing to run on multiple disks
It made a bunch of changes to how tasks run, effectively hiding the
"safe" directory for tasks to write files into from the task code itself
making it extremely difficult to do anything correctly inside of a task.
This change reverts those changes inside of the tasks and makes it so that
only the task runners are the ones that make decisions about which
mount points should be used for storing task-related files.
It adds the config druid.worker.baseTaskDirs which can be used by the
task runners to know which directories they should schedule tasks inside of.
The TaskConfig remains the authoritative source of configuration for where
and how an individual task should be operating.
### Description
This change allows for input sources used during MSQ ingestion to be authorized for multiple input source types, instead of just 1. Such an input source that allows for multiple types is the CombiningInputSource.
Also fixed bug that caused some input source specific functions to be authorized against the permissions
`
[
new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ),
new ResourceAction(new Resource(ResourceType.EXTERNAL, {input_source_type}), Action.READ)
]
`
when the inputSource based authorization feature is enabled, when it should instead be authorized against
`
[
new ResourceAction(new Resource(ResourceType.EXTERNAL, {input_source_type}), Action.READ)
]
`
Fixes#13837.
### Description
This change allows for input source type security in the native task layer.
To enable this feature, the user must set the following property to true:
`druid.auth.enableInputSourceSecurity=true`
The default value for this property is false, which will continue the existing functionality of needing authorization to write to the respective datasource.
When this config is enabled, the users will be required to be authorized for the following resource action, in addition to write permission on the respective datasource.
`new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}, Action.READ`
where `{INPUT_SOURCE_TYPE}` is the type of the input source being used;, http, inline, s3, etc..
Only tasks that provide a non-default implementation of the `getInputSourceResources` method can be submitted when config `druid.auth.enableInputSourceSecurity=true` is set. Otherwise, a 400 error will be thrown.
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'auto' type indexer and merger which produces a new common nested format of columns, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
Due to race conditions, the BrokerServerView may sometimes try to add a segment to a server which has already been removed from the inventory. This results in an NPE and keeps the BrokerServerView from processing all change requests.
This change introduces the concept of input source type security model, proposed in #13837.. With this change, this feature is only available at the SQL layer, but we will expand to native layer in a follow up PR.
To enable this feature, the user must set the following property to true:
druid.auth.enableInputSourceSecurity=true
The default value for this property is false, which will continue the existing functionality of having the usage all external sources being authorized against the hardcoded resource action
new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ
When this config is enabled, the users will be required to be authorized for the following resource action
new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}, Action.READ
where {INPUT_SOURCE_TYPE} is the type of the input source being used;, http, inline, s3, etc..
Documentation has not been added for the feature as it is not complete at the moment, as we still need to enable this for the native layer in a follow up pr.
* Add segment generator counters to reports
* Remove unneeded annotation
* Fix checkstyle and coverage
* Add persist and merged as new metrics
* Address review comments
* Fix checkstyle
* Create metrics class to handle updating counters
* Address review comments
* Add rowsPushed as a new metrics
changes:
* fixes inconsistent handling of byte[] values between ExprEval.bestEffortOf and ExprEval.ofType, which could cause byte[] values to end up as java toString values instead of base64 encoded strings in ingest time transforms
* improved ExpressionTransform binding to re-use ExprEval.bestEffortOf when evaluating a binding instead of throwing it away
* improved ExpressionTransform array handling, added RowFunction.evalDimension that returns List<String> to back Row.getDimension and remove the automatic coercing of array types that would typically happen to expression transforms unless using Row.getDimension
* added some tests for ExpressionTransform with array inputs
* improved ExpressionPostAggregator to use partial type information from decoration
* migrate some test uses of InputBindings.forMap to use other methods
Changes:
- Set `useRoundRobinSegmentAssignment` in coordinator dynamic config to `true` by default.
- Set `batchSegmentAllocation` in `TaskLockConfig` (used in Overlord runtime properties) to `true` by default.
* Lower default maxRowsInMemory for realtime ingestion.
The thinking here is that for best ingestion throughput, we want
intermediate persists to be as big as possible without using up all
available memory. So, we rely mainly on maxBytesInMemory. The default
maxRowsInMemory (1 million) is really just a safety: in case we have
a large number of very small rows, we don't want to get overwhelmed
by per-row overheads.
However, maximum ingestion throughput isn't necessarily the primary
goal for realtime ingestion. Query performance is also important. And
because query performance is not as good on the in-memory dataset, it's
helpful to keep it from growing too large. 150k seems like a reasonable
balance here. It means that for a typical 5 million row segment, we
won't trigger more than 33 persists due to this limit, which is a
reasonable number of persists.
* Update tests.
* Update server/src/main/java/org/apache/druid/segment/indexing/RealtimeTuningConfig.java
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Fix test.
* Fix link.
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* This makes the zookeeper connection retry count configurable. This is presently hardcoded to 29 tries which ends up taking a long time for the druid node to shutdown in case of ZK connectivity loss.
Having a shorter retry count helps k8s deployments to fail fast. In situations where the underlying k8s node loses network connectivity or is no longer able to talk to zookeeper, failing fast can trigger pod restarts which can then reassign the pod to a healthy k8s node.
Existing behavior is preserved, but users can override this property if needed.
* Improve memory efficiency of WrappedRoaringBitmap.
Two changes:
1) Use an int[] for sizes 4 or below.
2) Remove the boolean compressRunOnSerialization. Doesn't save much
space, but it does save a little, and it isn't adding a ton of value
to have it be configurable. It was originally configurable in case
anything broke when enabling it, but it's been a while and nothing
has broken.
* Slight adjustment.
* Adjust for inspection.
* Updates.
* Update snaps.
* Update test.
* Adjust test.
* Fix snaps.
The FiniteFirehoseFactory and InputRowParser classes were deprecated in 0.17.0 (#8823) in favor of InputSource & InputFormat. This PR removes the FiniteFirehoseFactory and all its implementations along with classes solely used by them like Fetcher (Used by PrefetchableTextFilesFirehoseFactory). Refactors classes including tests using FiniteFirehoseFactory to use InputSource instead.
Removing InputRowParser may not be as trivial as many classes that aren't deprecated depends on it (with no alternatives), like EventReceiverFirehoseFactory. Hence FirehoseFactory, EventReceiverFirehoseFactory, and Firehose are marked deprecated.
* Make CompactionSearchPolicy injectable
A small refactoring that makes the search policy for compaction injectable.
Future changes can introduce new search policies that can be configured and
injected so that operators can choose which search policy is best suited for
their cluster.
This will also allow us to de-couple the scheduling of compaction jobs from
the CompactSegments duty, allowing the co-ordinator to schedule compaction
jobs faster than the duty lifecycle.
This PR is made so that it easy to review the future changes.
* fix tests
When both plainText and TLS ports are set in druid, the redirection to a different leader node can fail. This is caused by how we compare a redirect path and the leader locations registered with a druid node. While the registered location has both plainText and TLS port set, the redirect path only has one port since it's a URI.
* merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything
* fix poms and license stuff
* mockito is evil
* allow reset of JvmUtils RuntimeInfo if tests used static injection to override
* Use an HllSketchHolder object to enable optimized merge
HllSketchAggregatorFactory.combine had been implemented using a
pure pair-wise, "make a union -> add 2 things to union -> get sketch"
algorithm. This algorithm does 2 things that was CPU
1) The Union object always builds an HLL_8 sketch regardless of the
target type. This means that when the target type is not HLL_8, we
spent CPU cycles converting to HLL_8 and back over and over again
2) By throwing away the Union object and converting back to the
HllSketch only to build another Union object, we do lots and lots
of copy+conversions of the HllSketch
This change introduces an HllSketchHolder object which can hold onto
a Union object and delay conversion back into an HllSketch until
it is actually needed. This follows the same pattern as the
SketchHolder object for theta sketches.
* Fallback virtual column
This virtual columns enables falling back to another column if
the original column doesn't exist. This is useful when doing
column migrations and you have some old data with column X,
new data with column Y and you want to use Y if it exists, X
otherwise so that you can run a consistent query against all of
the data.
Add a new API to return the history of changes to automatic compaction config history to make it easy for users to see what changes have been made to their auto-compaction config.
The API is scoped per dataSource to allow users to triage issues with an individual dataSource. The API responds with a list of configs when there is a change to either the settings that impact all auto-compaction configs on a cluster or the dataSource in question.
* discover nested columns when using nested column indexer for schemaless
* move useNestedColumnIndexerForSchemaDiscovery from AppendableIndexSpec to DimensionsSpec
Much improved table functions
* Revises properties, definitions in the catalog
* Adds a "table function" abstraction to model such functions
* Specific functions for HTTP, inline, local and S3.
* Extended SQL types in the catalog
* Restructure external table definitions to use table functions
* EXTEND syntax for Druid's extern table function
* Support for array-valued table function parameters
* Support for array-valued SQL query parameters
* Much new documentation
* Kinesis: More robust default fetch settings.
1) Default recordsPerFetch and recordBufferSize based on available memory
rather than using hardcoded numbers. For this, we need an estimate
of record size. Use 10 KB for regular records and 1 MB for aggregated
records. With 1 GB heaps, 2 processors per task, and nonaggregated
records, recordBufferSize comes out to the same as the old
default (10000), and recordsPerFetch comes out slightly lower (1250
instead of 4000).
2) Default maxRecordsPerPoll based on whether records are aggregated
or not (100 if not aggregated, 1 if aggregated). Prior default was 100.
3) Default fetchThreads based on processors divided by task count on
Indexers, rather than overall processor count.
4) Additionally clean up the serialized JSON a bit by adding various
JsonInclude annotations.
* Updates for tests.
* Additional important verify.
* single typed "root" only nested columns now mimic "regular" columns of those types
* incremental index can now use nested column indexer instead of string indexer for discovered columns
* Validate response headers and fix exception logging
A class of QueryException were throwing away their
causes making it really hard to determine what's
going wrong when something goes wrong in the SQL
planner specifically. Fix that and adjust tests
to do more validation of response headers as well.
We allow 404s and 307s to be returned even without
authorization validated, but others get converted to 403
* Unify the handling of HTTP between SQL and Native
The SqlResource and QueryResource have been
using independent logic for things like error
handling and response context stuff. This
became abundantly clear and painful during a
change I was making for Window Functions, so
I unified them into using the same code for
walking the response and serializing it.
Things are still not perfectly unified (it would
be the absolute best if the SqlResource just
took SQL, planned it and then delegated the
query run entirely to the QueryResource), but
this refactor doesn't take that fully on.
The new code leverages async query processing
from our jetty container, the different
interaction model with the Resource means that
a lot of tests had to be adjusted to align with
the async query model. The semantics of the
tests remain the same with one exception: the
SqlResource used to not log requests that failed
authorization checks, now it does.
This commit adds a new class `InputStats` to track the total bytes processed by a task.
The field `processedBytes` is published in task reports along with other row stats.
Major changes:
- Add class `InputStats` to track processed bytes
- Add method `InputSourceReader.read(InputStats)` to read input rows while counting bytes.
> Since we need to count the bytes, we could not just have a wrapper around `InputSourceReader` or `InputEntityReader` (the way `CountableInputSourceReader` does) because the `InputSourceReader` only deals with `InputRow`s and the byte information is already lost.
- Classic batch: Use the new `InputSourceReader.read(inputStats)` in `AbstractBatchIndexTask`
- Streaming: Increment `processedBytes` in `StreamChunkParser`. This does not use the new `InputSourceReader.read(inputStats)` method.
- Extend `InputStats` with `RowIngestionMeters` so that bytes can be exposed in task reports
Other changes:
- Update tests to verify the value of `processedBytes`
- Rename `MutableRowIngestionMeters` to `SimpleRowIngestionMeters` and remove duplicate class
- Replace `CacheTestSegmentCacheManager` with `NoopSegmentCacheManager`
- Refactor `KafkaIndexTaskTest` and `KinesisIndexTaskTest`
Refactor DataSource to have a getAnalysis method()
This removes various parts of the code where while loops and instanceof
checks were being used to walk through the structure of DataSource objects
in order to build a DataSourceAnalysis. Instead we just ask the DataSource
for its analysis and allow the stack to rebuild whatever structure existed.
* Zero-copy local deep storage.
This is useful for local deep storage, since it reduces disk usage and
makes Historicals able to load segments instantaneously.
Two changes:
1) Introduce "druid.storage.zip" parameter for local storage, which defaults
to false. This changes default behavior from writing an index.zip to writing
a regular directory. This is safe to do even during a rolling update, because
the older code actually already handled unzipped directories being present
on local deep storage.
2) In LocalDataSegmentPuller and LocalDataSegmentPusher, use hard links
instead of copies when possible. (Generally this is possible when the
source and destination directory are on the same filesystem.)