* Speed up SQL IN using SCALAR_IN_ARRAY.
Main changes:
1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
the IN is above inFunctionThreshold. The default value of inFunctionThreshold
is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
it to Integer.MAX_VALUE.
3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
greater than inFunctionThreshold.
* Revert test.
* Additional coverage.
* Update docs/querying/sql-query-context.md
Co-authored-by: Benedict Jin <asdf2014@apache.org>
* New test.
---------
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Add validation for reindex with realtime sources.
With the addition of concurrent compaction, it is possible to ingest data while querying from realtime sources with MSQ into the same datasource. This could potentially lead to issues if the interval that is ingested into is replaced by an MSQ job, which has queried only some of the data from the realtime task.
This PR adds validation to check that the datasource being ingested into is not being queried from, if the query includes realtime sources.
Follow up to #15705
Changes:
- Remove references to ZK-based segment loading in the docs
- Fix doc for existing config `druid.coordinator.loadqueuepeon.http.repeatDelay`
* Delta Lake support for filters.
* Updates
* cleanup comments
* Docs
* Remmove Enclosed runner
* Rename
* Cleanup test
* Serde test for the Delta input source and fix jackson annotation.
* Updates and docs.
* Update error messages to be clearer
* Fixes
* Handle NumberFormatException to provide a nicer error message.
* Apply suggestions from code review
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
* Doc fixes based on feedback
* Yes -> yes in docs; reword slightly.
* Update docs/ingestion/input-sources.md
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
* Update docs/ingestion/input-sources.md
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
* Documentation, javadoc and more updates.
* Not with an or expression end-to-end test.
* Break up =, >, >=, <, <= into its own types instead of sub-classing.
---------
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
Changes:
- Add new config `lagAggregate` to `LagBasedAutoScalerConfig`
- Add field `aggregateForScaling` to `LagStats`
- Use the new field/config to determine which aggregate to use to compute lag
- Remove method `Supervisor.computeLagForAutoScaler()`
* Four changes to scalar_in_array as follow-ups to #16306:
1) Align behavior for `null` scalars to the behavior of the native `in` and `inType` filters: return `true` if the array itself contains null, else return `null`.
2) Rename the class to more closely match the function name.
3) Add a specialization for constant arrays, where we build a `HashSet`.
4) Use `castForEqualityComparison` to properly handle cross-type comparisons.
Additional tests verify comparisons between LONG and DOUBLE are now
handled properly.
* Fix spelling.
* Adjustments from review.
Issue: #14989
The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.
This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
Statsd client sometimes drops metrics when this queueSize of statsd client with max unprocessed messages is completely full. This causes some high cardinality metrics like per partition lag being droppped.
There are multiple parameters of statsdclient that can be initialized and can help increase the load/capacity of client to not to drop metrics more frequently.
Properties like queueSize, poolSize, processorWorkers and senderWorkers will now be configurable at runtime
* Adds Druid SQL query examples for the Timeseries and GroupBy Native queries in the stats aggregator docs page
* Updates intervals in Native Query to remove excess Time part in timestamp
* Moves Druid SQL section above Native query because sql used more often by users
* removes old Druid SQL sections
* Adds TopN Druid SQL query using ORDER BY and LIMIT
* Adds table for Druid SQL variance and standard deviation functions
* Update docs/development/extensions-core/stats.md
Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
---------
Co-authored-by: Karan Kumar <karankumar1100@gmail.com>
Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
Currently, export creates the files at the provided destination. The addition of the manifest file will provide a list of files created as part of the manifest. This will allow easier consumption of the data exported from Druid, especially for automated data pipelines
The default value for druid.coordinator.kill.period (if unspecified) has changed from P1D to the value of druid.coordinator.period.indexingPeriod. Operators can choose to override druid.coordinator.kill.period and that will take precedence over the default behavior.
The default value for the coordinator dynamic config killTaskSlotRatio is updated from 1.0 to 0.1. This ensures that that kill tasks take up only 1 task slot right out-of-the-box instead of taking up all the task slots.
* Remove stale comment and inline canDutyRun()
* druid.coordinator.kill.period defaults to druid.coordinator.period.indexingPeriod if not set.
- Remove the default P1D value for druid.coordinator.kill.period. Instead default
druid.coordinator.kill.period to whatever value druid.coordinator.period.indexingPeriod is set
to if the former config isn't specified.
- If druid.coordinator.kill.period is set, the value will take precedence over
druid.coordinator.period.indexingPeriod
* Update server/src/test/java/org/apache/druid/server/coordinator/DruidCoordinatorConfigTest.java
* Fix checkstyle error
* Clarify comment
* Update server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinatorConfig.java
* Put back canDutyRun()
* Default killTaskSlotsRatio to 0.1 instead of 1.0 (all slots)
* Fix typo DEFAULT_MAX_COMPACTION_TASK_SLOTS
* Remove unused test method.
* Update default value of killTaskSlotsRatio in docs and web-console default mock
* Move initDuty() after params and config setup.
Changes:
- Add `TaskContextEnricher` interface to improve task management and monitoring
- Invoke `enrichContext` in `TaskQueue.add()` whenever a new task is submitted to the Overlord
- Add `TaskContextReport` to write out task context information in reports
Compaction in the native engine by default records the state of compaction for each segment in the lastCompactionState segment field. This PR adds support for doing the same in the MSQ engine, targeted for future cases such as REPLACE and compaction done via MSQ.
Note that this PR doesn't implicitly store the compaction state for MSQ replace tasks; it is stored with flag "storeCompactionState": true in the query context.
Current Runtime Exceptions generated while writing frames only include the exception itself without including the name of the column they were encountered in. This patch introduces the further information in the error and makes it non-retryable.
This PR logs the segment type and reason chosen. It also adds it to the query report, to be displayed in the UI.
This PR adds a new section to the reports, segmentReport. This contains the segment type created, if the query is an ingestion, and null otherwise.
Support for exporting msq results to gcs bucket. This is essentially copying the logic of s3 export for gs, originally done by @adarshsanjeev in this PR - #15689
This PR creates an interface for ImmutableRTree and moved the existing implementation to new class which represent 32 bit implementation (stores coordinate as floats). This PR makes the ImmutableRTree extendable to create higher precision implementation as well (64 bit).
In all spatial bound filters, we accept float as input which might not be accurate in the case of high precision implementation of ImmutableRTree. This PR changed the bound filters to accepts the query bounds as double instead of float and it is backward compatible change as it compares double to existing float values in RTree. Previously it was comparing input float to RTree floats which can cause precision loss, now it is little better as it compares double to float which is still not 100% accurate.
There are no changes in the way that we query spatial dimension today except input bound parsing. There is little improvement in string filter predicate which now parse double strings instead of float and compares double to double which is 100% accurate but string predicate is only called when we dont have spatial index.
With allowing the interface to extend ImmutableRTree, we allow to create high precision (HP) implementation and defines new search strategies to perform HP search Iterable<ImmutableBitmap> search(ImmutableDoubleNode node, Bound bound);
With possible HP implementations, Radius bound filter can not really focus on accuracy, it is calculating Euclidean distance in comparing. As EARTH 🌍 is round and not flat, Euclidean distances are not accurate in geo system. This PR adds new param called 'radiusUnit' which allows you to specify units like meters, km, miles etc. It uses https://en.wikipedia.org/wiki/Haversine_formula to check if given geo point falls inside circle or not. Added a test that generates set of points inside and outside in RadiusBoundTest.
This PR aims to introduce Window functions on MSQ by doing the following:
Introduce a Window querykit for handling window queries along with its factory and a processor for window queries
If a window operator is present with a partition by clause, pushes the partition as a shuffle spec of the previous stage
In presence of empty OVER() clause lets all operators loose on a single rac
In presence of no empty OVER() clause, breaks down each window into individual stages
Associated machinery to handle window functions in MSQ
Introduced a separate hidden engine feature WINDOW_LEAF_OPERATOR which is set only for MSQ engine. In presence of this feature, the planner plans without the leaf operators by creating a window query over an inner scan query. In case of native this is set to false and the planner generates the leafOperators
Guardrails around materialization
Comprehensive UTs
Changes:
Add the following indexer level task metrics:
- `worker/task/running/count`
- `worker/task/assigned/count`
- `worker/task/completed/count`
These metrics will provide more visibility into the tasks distribution across indexers
(We often see a task skew issue across indexers and with this issue it would be easier
to catch the imbalance)
* Mark used and unused APIs by versions.
* remove the conditional invocations.
* isValid() and test updates.
* isValid() and tests.
* Remove warning logs for invalid user requests. Also, downgrade visibility.
* Update resp message, etc.
* tests and some cleanup.
* Docs draft
* Clarify docs
* Update server/src/main/java/org/apache/druid/server/http/DataSourcesResource.java
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* Review comments
* Remove default interface methods only used in tests and update docs.
* Clarify javadocs and @Nullable.
* Add more tests.
* Parameterized versions.
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* MSQ: Validate that strings and string arrays are not mixed.
When multi-value strings and string arrays coexist in the same column,
it causes problems with "classic MVD" style queries such as:
select * from wikipedia -- fails at runtime
select count(*) from wikipedia where flags = 'B' -- fails at planning time
select flags, count(*) from wikipedia group by 1 -- fails at runtime
To avoid these problems, this patch adds type verification for INSERT
and REPLACE. It is targeted: the only type changes that are blocked are
string-to-array and array-to-string. There is also a way to exclude
certain columns from the type checks, if the user really knows what
they're doing.
* Fixes.
* Tests and docs and error messages.
* More docs.
* Adjustments.
* Adjust message.
* Fix tests.
* Fix test in DV mode.
* docs: clarify description of uri/uripath
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Kill task version support.
Kill tasks by default kill all versions of unused segments in the specified
interval. Users wanting to delete specific versions (for example, data compliance
reasons) and keep rest of the versions can specify the optional version in the
kill task payload.
* Formatting changes.
* Multi version tests in RetrieveSegmentsActionsTest
Sort of like method-level parameterized tests.
* Address review feedback
* Accept a list of versions instead of a single version.
Support multiple versions.
* Tests for multiple versions.
* Update docs
* Cleanup
* Address review comments.
Retain the old interface method and make it default and route it to
the method with nullable versions variant. Update usages to use the
default method where versions doesn't matter.
* Remove versions from retreive used segments action.
* Some updates.
* Apply suggestions from code review
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
* /s/actual/observed/g
* minor test cleanup
* WIP: Test fixes and updates. Also add test for kill by version with used load spec.
Checkpoint.
---------
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>