* Always use file sizes when determining batch ingest splits.
Main changes:
1) Update CloudObjectInputSource and its subclasses (S3, GCS,
Azure, Aliyun OSS) to use SplitHintSpecs in all cases. Previously, they
were only used for prefixes, not uris or objects.
2) Update ExternalInputSpecSlicer (MSQ) to consider file size. Previously,
file size was ignored; all files were treated as equal weight when
determining splits.
A side effect of these changes is that we'll make additional network
calls to find the sizes of objects when users specify URIs or objects
as opposed to prefixes. IMO, this is worth it because it's the only way
to respect the user's split hint and task assignment settings.
Secondary changes:
1) S3, Aliyun OSS: Use getObjectMetadata instead of listObjects to get
metadata for a single object. This is a simpler call that is also
expected to be less expensive.
2) Azure: Fix a bug where getBlobLength did not populate blob
reference attributes, and therefore would not actually retrieve the
blob length.
3) MSQ: Align dynamic slicing logic between ExternalInputSpecSlicer and
TableInputSpecSlicer.
4) MSQ: Adjust WorkerInputs to ensure there is always at least one
worker, even if it has a nil slice.
* Add msqCompatible to testGroupByWithImpossibleTimeFilter.
* Fix tests.
* Add additional tests.
* Remove unused stuff.
* Remove more unused stuff.
* Adjust thresholds.
* Remove irrelevant test.
* Fix comments.
* Fix bug.
* Updates.
With the KubernetesTaskRunner, if a task is manually shutdown via the web console while running or the corresponding k8s job is manually deleted, the thread responsible for overseeing the task gets stuck in a loop because the fabric8 client sends one event to it that the job is null when the job is deleted, but this doesn't pass the condition.
This means that the thread is stuck waiting on a fabric8 event (the job being successful) that will never come up until maxTaskDuration (default 4 hours). If a user of the extension is trying to use a limited taskqueue maxSize, this can cause problems as the k8s executor pool is unable to pick up additional tasks (since threads are stuck waiting on the old tasks that have already been deleted).
* Hook up PodTemplateTaskAdapter
* Make task adapter TYPE parameters final
* Rename adapters types
* Include specified adapter name in exception message
* Documentation for sidecarSupport deprecation
* Fix order
* Set TASK_ID as environment variable in PodTemplateTaskAdapter (#13969)
* Update docs/development/extensions-contrib/k8s-jobs.md
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Hook up PodTemplateTaskAdapter
* Make task adapter TYPE parameters final
* Rename adapters types
* Include specified adapter name in exception message
* Documentation for sidecarSupport deprecation
* Fix order
* fix spelling errors
---------
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Pod template task adapter
* Use getBaseTaskDirPaths
* Remove unused task from getEnv
* Use Optional.ifPresent() instead of Optional.map()
* Pass absolute path
* Don't pass task to getEnv
* Assert the correct adapter is created
* Javadocs and Comments
* Add exception message to assertions
* Use TaskConfig to get task dir in KubernetesTaskRunner
* Use the first path specified in baseTaskDirPaths instead of deprecated baseTaskDirPath
* Use getBaseTaskDirPaths in generate command
The FiniteFirehoseFactory and InputRowParser classes were deprecated in 0.17.0 (#8823) in favor of InputSource & InputFormat. This PR removes the FiniteFirehoseFactory and all its implementations along with classes solely used by them like Fetcher (Used by PrefetchableTextFilesFirehoseFactory). Refactors classes including tests using FiniteFirehoseFactory to use InputSource instead.
Removing InputRowParser may not be as trivial as many classes that aren't deprecated depends on it (with no alternatives), like EventReceiverFirehoseFactory. Hence FirehoseFactory, EventReceiverFirehoseFactory, and Firehose are marked deprecated.
* merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything
* fix poms and license stuff
* mockito is evil
* allow reset of JvmUtils RuntimeInfo if tests used static injection to override
* Better sidecar support
* remove un-thrown exception from test
* Druid you are such a stickler about spelling :)
* Only require the primaryContainerName, no need to exclude containers
changes:
* modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json>
* NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not
* fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true
* fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for complex<json>
* added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time
* fix inconsistency between nested column indexer and serializer in handling values (coerce non primitive and non arrays of primitives using asString)
* ExprEval best effort mode now handles byte[] as string
* added test for ExprEval.bestEffortOf, and add missing conversion cases that tests uncovered
* more tests more better
This commit adds a new class `InputStats` to track the total bytes processed by a task.
The field `processedBytes` is published in task reports along with other row stats.
Major changes:
- Add class `InputStats` to track processed bytes
- Add method `InputSourceReader.read(InputStats)` to read input rows while counting bytes.
> Since we need to count the bytes, we could not just have a wrapper around `InputSourceReader` or `InputEntityReader` (the way `CountableInputSourceReader` does) because the `InputSourceReader` only deals with `InputRow`s and the byte information is already lost.
- Classic batch: Use the new `InputSourceReader.read(inputStats)` in `AbstractBatchIndexTask`
- Streaming: Increment `processedBytes` in `StreamChunkParser`. This does not use the new `InputSourceReader.read(inputStats)` method.
- Extend `InputStats` with `RowIngestionMeters` so that bytes can be exposed in task reports
Other changes:
- Update tests to verify the value of `processedBytes`
- Rename `MutableRowIngestionMeters` to `SimpleRowIngestionMeters` and remove duplicate class
- Replace `CacheTestSegmentCacheManager` with `NoopSegmentCacheManager`
- Refactor `KafkaIndexTaskTest` and `KinesisIndexTaskTest`
Refactor DataSource to have a getAnalysis method()
This removes various parts of the code where while loops and instanceof
checks were being used to walk through the structure of DataSource objects
in order to build a DataSourceAnalysis. Instead we just ask the DataSource
for its analysis and allow the stack to rebuild whatever structure existed.
SQL test framework extensions
* Capture planner artifacts: logical plan, etc.
* Planner test builder validates the logical plan
* Validation for the SQL resut schema (we already have
validation for the Druid row signature)
* Better Guice integration: properties, reuse Guice modules
* Avoid need for hand-coded expr, macro tables
* Retire some of the test-specific query component creation
* Fix query log hook race condition
https://github.com/apache/druid/pull/13027 PR replaces `filter` parameter with
`objectGlob` in ingestion input source. However, this will cause existing ingestion
jobs to fail if they are using a filter already. This PR adds old filter functionality
alongside objectGlob to preserve backward compatibility.
We added compression to the latest/first pair storage, but
the code change was forcing new things to be persisted
with the new format, meaning that any segment created with
the new code cannot be read by the old code. Instead, we
need to default to creating the old format and then remove that default in a future version.
* Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects.
Removed:
import org.apache.commons.io.FilenameUtils;
Add:
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
* Forgot to update CloudObjectInputSource as well.
* Fix tests.
* Removed unused exceptions.
* Able to reduced user mistakes, by removing the protocol and the bucket on filter.
* add 1 more test.
* add comment on filterWithoutProtocolAndBucket
* Fix lint issue.
* Fix another lint issue.
* Replace all mention of filter -> objectGlob per convo here:
https://github.com/apache/druid/pull/13027#issuecomment-1266410707
* fix 1 bad constructor.
* Fix the documentation.
* Don’t do anything clever with the object path.
* Remove unused imports.
* Fix spelling error.
* Fix incorrect search and replace.
* Addressing Gian’s comment.
* add filename on .spelling
* Fix documentation.
* fix documentation again
Co-authored-by: Didip Kerabat <didip@apple.com>
Changes:
- Add a metric for partition-wise kafka/kinesis lag for streaming ingestion.
- Emit lag metrics for streaming ingestion when supervisor is not suspended and state is in {RUNNING, IDLE, UNHEALTHY_TASKS, UNHEALTHY_SUPERVISOR}
- Document metrics
* Support for middle manager less druid, tasks launch as k8s jobs
* Fixing forking task runner test
* Test cleanup, dependency cleanup, intellij inspections cleanup
* Changes per PR review
Add configuration option to disable http/https proxy for the k8s client
Update the docs to provide more detail about sidecar support
* Removing un-needed log lines
* Small changes per PR review
* Upon task completion we callback to the overlord to update the status / locaiton, for slower k8s clusters, this reduces locking time significantly
* Merge conflict fix
* Fixing tests and docs
* update tiny-cluster.yaml
changed `enableTaskLevelLogPush` to `encapsulatedTask`
* Apply suggestions from code review
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Minor changes per PR request
* Cleanup, adding test to AbstractTask
* Add comment in peon.sh
* Bumping code coverage
* More tests to make code coverage happy
* Doh a duplicate dependnecy
* Integration test setup is weird for k8s, will do this in a different PR
* Reverting back all integration test changes, will do in anotbher PR
* use StringUtils.base64 instead of Base64
* Jdk is nasty, if i compress in jdk 11 in jdk 17 the decompressed result is different
Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Refactor Calcite test "framework" for planner tests
Refactors the current Calcite tests to make it a bit easier
to adjust the set of runtime objects used within a test.
* Move data creation out of CalciteTests into TestDataBuilder
* Move "framework" creation out of CalciteTests into
a QueryFramework
* Move injector-dependent functions from CalciteTests
into QueryFrameworkUtils
* Wrapper around the planner factory, etc. to allow
customization.
* Bulk of the "framework" created once per class rather
than once per test.
* Refactor tests to use a test builder
* Change all testQuery() methods to use the test builder.
Move test execution & verification into a test runner.
* SQL: Use timestamp_floor when granularity is not safe.
PR #12944 added a check at the execution layer to avoid materializing
excessive amounts of time-granular buckets. This patch modifies the SQL
planner to avoid generating queries that would throw such errors, by
switching certain plans to use the timestamp_floor function instead of
granularities. This applies both to the Timeseries query type, and the
GroupBy timestampResultFieldGranularity feature.
The patch also goes one step further: we switch to timestamp_floor
not just in the ETERNITY + non-ALL case, but also if the estimated
number of time-granular buckets exceeds 100,000.
Finally, the patch modifies the timestampResultFieldGranularity
field to consistently be a String rather than a Granularity. This
ensures that it can be round-trip serialized and deserialized, which is
useful when trying to execute the results of "EXPLAIN PLAN FOR" with
GroupBy queries that use the timestampResultFieldGranularity feature.
* Fix test, address PR comments.
* Fix ControllerImpl.
* Fix test.
* Fix unused import.
We introduce two new configuration keys that refine the query context security model controlled by druid.auth.authorizeQueryContextParams. When that value is set to true then two other configuration options become available:
druid.auth.unsecuredContextKeys: The set of query context keys that do not require a security check. Use this for the "white-list" of key to allow. All other keys go through the existing context key security checks.
druid.auth.securedContextKeys: The set of query context keys that do require a security check. Use this when you want to allow all but a specific set of keys: only these keys go through the existing context key security checks.
Both are set using JSON list format:
druid.auth.securedContextKeys=["secretKey1", "secretKey2"]
You generally set one or the other values. If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys.
In addition, Druid defines two query context keys which always bypass checks because Druid uses them internally:
sqlQueryId
sqlStringifyArrays
It was found that the namespace/cache/heapSizeInBytes metric that tracks the total heap size in bytes of all lookup caches loaded on a service instance was being under reported. We were not accounting for the memory overhead of the String object, which I've found in testing to be ~40 bytes. While this overhead may be java version dependent, it should not vary much, and accounting for this provides a better estimate. Also fixed some logging, and reading bytes from the JDBI result set a little more efficient by saving hash table lookups. Also added some of the lookup metrics to the default statsD emitter metric whitelist.
This adds min/max functions for CompressedBigDecimal. It exposes these
functions via sql (BIG_MAX, BIG_MIN--see the SqlAggFunction
implementations).
It also includes various bug fixes and cleanup to the original
CompressedBigDecimal code include the AggregatorFactories. Various null
handling was improved.
Additional test cases were added for both new and existing code
including a base test case for AggregationFactories. Other tests common
across sum,min,max may be refactored also to share the varoius cases in
the future.
This adds a sql function, "BIG_SUM", that uses
CompressedBigDecimal to do a sum. Other misc changes:
1. handle NumberFormatExceptions when parsing a string (default to set
to 0, configurable in agg factory to be strict and throw on error)
2. format pom file (whitespace) + add dependency
3. scaleUp -> scale and always require scale as a parameter
Optimizes the compareTo() function in
CompressedBigDecimal. It directly compares the int[] rather than
creating BigDecimal objects and using its compareTo.
It handles unequal sized CBDs, but does require
the scales to match.
1. remove unnecessary generic type from CompressedBigDecimal
2. support Number input types
3. support aggregator reading supported input types directly (uningested
data)
4. fix scaling bug in buffer aggregator