2664 Commits

Author SHA1 Message Date
Jill Osborne 47dd4ed2e7
Added experimental feature text for front coding feature (#13349) 2022-11-11 02:06:13 -08:00
Didip Kerabat 56d5c9780d
Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects (#13027)
* Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects.

Removed:

import org.apache.commons.io.FilenameUtils;

Add:

import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

* Forgot to update CloudObjectInputSource as well.

* Fix tests.

* Removed unused exceptions.

* Reduce user mistakes by removing the protocol and the bucket from the filter.

* add 1 more test.

* add comment on filterWithoutProtocolAndBucket

* Fix lint issue.

* Fix another lint issue.

* Replace all mentions of filter -> objectGlob per convo here:

https://github.com/apache/druid/pull/13027#issuecomment-1266410707

* fix 1 bad constructor.

* Fix the documentation.

* Don’t do anything clever with the object path.

* Remove unused imports.

* Fix spelling error.

* Fix incorrect search and replace.

* Addressing Gian’s comment.

* add filename on .spelling

* Fix documentation.

* fix documentation again

Co-authored-by: Didip Kerabat <didip@apple.com>
2022-11-10 23:46:40 -08:00
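The glob matching that #13027 moves to the standard library can be sketched roughly as below. This is a minimal illustration, not the code from the pull request: `GlobSketch` and `matchesGlob` are hypothetical names, and the object path is assumed to have already had its protocol and bucket removed. It shows that `*` stops at a folder boundary while `**` crosses it.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not a Druid class.
final class GlobSketch {
    // Returns true when objectPath (assumed to carry no protocol or bucket) matches the glob.
    static boolean matchesGlob(String glob, String objectPath) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(objectPath));
    }

    public static void main(String[] args) {
        // "*" matches within a single path component only.
        System.out.println(matchesGlob("*.json", "file.json"));
        System.out.println(matchesGlob("*.json", "folder/file.json"));
        // "**" may span folder boundaries.
        System.out.println(matchesGlob("**.json", "folder/sub/file.json"));
    }
}
```

Building the matcher once and reusing it for every listed object would avoid recompiling the pattern per object; the sketch rebuilds it only to stay short.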
Gian Merlino 77478f25fb
Add taskActionType dimension to task/action/run/time. (#13333)
* Add taskActionType dimension to task/action/run/time.

* Spelling.
2022-11-11 12:00:08 +05:30
Andreas Maechler 03175a2b8d
Add missing MSQ error code fields to docs (#13308)
* Fix typo

* Fix some spacing

* Add missing fields

* Cleanup table spacing

* Remove durable storage docs again

Thanks Brian for pointing out previous discussions.

* Update docs/multi-stage-query/reference.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Mark codes as code

* And even more codes as code

* Another set of spaces

* Combine `ColumnTypeNotSupported`

Thanks Karan.

* More whitespaces and typos

* Add spelling and fix links

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-11-10 21:03:04 +05:30
Jill Osborne c2210c4e09
Update ingestion spec doc (#13329)
* Update ingestion spec doc

* Updated

* Updated

* Update docs/ingestion/ingestion-spec.md

Co-authored-by: Clint Wylie <cjwylie@gmail.com>

* Updated

* Updated

Co-authored-by: Clint Wylie <cjwylie@gmail.com>
2022-11-10 02:54:35 -08:00
Jill Osborne 965e41538e
Update nested columns doc (#13314)
* Updated nested columns doc

* Update nested-columns.md

* Update nested-columns.md
2022-11-10 09:53:28 +08:00
AmatyaAvadhanula a2013e6566
Enhance streaming ingestion metrics (#13331)
Changes:
- Add a metric for partition-wise kafka/kinesis lag for streaming ingestion.
- Emit lag metrics for streaming ingestion when supervisor is not suspended and state is in {RUNNING, IDLE, UNHEALTHY_TASKS, UNHEALTHY_SUPERVISOR}
- Document metrics
2022-11-09 23:44:15 +05:30
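The emission rule described in #13331 amounts to a simple gate. The sketch below assumes a hypothetical `State` enum: only the four emitting states are taken from the commit message, and the remaining values and all class and method names are placeholders.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names for illustration; not the supervisor classes in Druid.
final class LagMetricGate {
    // PENDING and STOPPING are placeholder states added only to make the example non-trivial.
    enum State { PENDING, RUNNING, IDLE, UNHEALTHY_TASKS, UNHEALTHY_SUPERVISOR, STOPPING }

    // The states listed in the commit message as eligible for lag metrics.
    static final Set<State> EMITTING_STATES =
        EnumSet.of(State.RUNNING, State.IDLE, State.UNHEALTHY_TASKS, State.UNHEALTHY_SUPERVISOR);

    // Lag is emitted only when the supervisor is not suspended and is in an eligible state.
    static boolean shouldEmitLag(boolean suspended, State state) {
        return !suspended && EMITTING_STATES.contains(state);
    }

    public static void main(String[] args) {
        System.out.println(shouldEmitLag(false, State.IDLE));
        System.out.println(shouldEmitLag(true, State.RUNNING));
    }
}
```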
Laksh Singla b7a513fe09
Add an OverlordHelper that cleans up durable storage objects in MSQ (#13269)
* scratch

* s3 ls fix, add docs

* add documentation, update method name

* Add tests, address comments, change default value of the helper

* fix test

* update the default value of config, remove initial delay config

* Trigger Build

* update class

* add more tests

* docs update

* spellcheck

* remove ioe from the signature

* add back dummy constructor for initialization

* fix guice bindings, intellij inspections
2022-11-09 17:23:35 +05:30
Kashif Faraz ff8e0c3397
Fix issues with caching cost strategy (#13321)
`cachingCost` strategy has some discrepancies when compared to cost strategy.
This commit addresses two of these by retaining the same behaviour as the `cost` strategy
when computing the cost of moving a segment to a server:
- subtract the self cost of a segment if it is being served by the target server
- subtract the cost of segments that are marked to be dropped

Other changes:
- Add tests to verify fixed strategy. These tests would fail without the fixes made to `CachingCostStrategy.computeCost()`
- Fix the definition of the segment related metrics in the docs.
- Fix some docs issues introduced in #13181
2022-11-08 16:11:39 +05:30
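The two adjustments listed in #13321 can be illustrated with a small sketch. It is not `CachingCostStrategy.computeCost()`: the class name, the generic segment type, and the pluggable `jointCost` function are assumptions, and segments marked to be dropped are assumed to still appear in the server's served list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch; the real cost function and data structures differ.
final class PlacementCostSketch {
    // Cost of placing `proposal` on a server, given what it serves and what it is about to drop.
    static <S> double computeCost(S proposal, List<S> served, List<S> markedToDrop,
                                  ToDoubleBiFunction<S, S> jointCost) {
        double cost = 0;
        for (S segment : served) {
            cost += jointCost.applyAsDouble(proposal, segment);
        }
        // Adjustment 1: do not charge a segment for sitting next to itself.
        if (served.contains(proposal)) {
            cost -= jointCost.applyAsDouble(proposal, proposal);
        }
        // Adjustment 2: segments that are about to be dropped should not count.
        for (S segment : markedToDrop) {
            cost -= jointCost.applyAsDouble(proposal, segment);
        }
        return cost;
    }

    public static void main(String[] args) {
        // A constant pairwise cost of 1.0 keeps the arithmetic easy to follow.
        double cost = computeCost("a", Arrays.asList("a", "b", "c"), Arrays.asList("c"), (x, y) -> 1.0);
        System.out.println(cost);
    }
}
```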
Tejaswini Bandlamudi 594545da55
Adds cluster level idleConfig setting for supervisor (#13311)
* adds cluster level idleConfig

* updates docs

* refactoring

* spelling nit

* nit

* nit

* refactoring
2022-11-08 14:54:14 +05:30
Gian Merlino 48528a0c98
MSQ: Fix task lock checking during publish, fix lock priority. (#13282)
* MSQ: Fix task lock checking during publish, fix lock priority.

Fixes two issues:

1) ControllerImpl did not properly check the return value of
   SegmentTransactionalInsertAction when doing a REPLACE. This could cause
   it to not realize that its locks were preempted.

2) Task lock priority was the default of 0. It should be the higher
   batch default of 50. The low priority made it possible for MSQ tasks
   to be preempted by compaction tasks, which is not desired.

* Restructuring, add docs.

* Add performSegmentPublish tests.

* Fix tests.
2022-11-08 09:27:34 +05:30
Jill Osborne d1a4de022a
Update retention rules doc (#13181)
* Update retention rules doc

* Update rule-configuration.md

* Updated

* Updated

* Updated

* Updated

* Update rule-configuration.md

* Update rule-configuration.md
2022-11-07 14:47:33 -08:00
AmatyaAvadhanula a738ac9ad7
Improve task pause logging and metrics for streaming ingestion (#13313)
* Improve task pause logging and metrics for streaming ingestion

* Add metrics doc

* Fix spelling
2022-11-07 21:33:54 +05:30
AmatyaAvadhanula 47c32a9d92
Skip ALL granularity compaction (#13304)
* Skip autocompaction for datasources with ETERNITY segments
2022-11-07 17:55:03 +05:30
Gian Merlino 227b57dd8e
Compaction: Fetch segments one at a time on main task; skip when possible. (#13280)
* Compaction: Fetch segments one at a time on main task; skip when possible.

Compact tasks include the ability to fetch existing segments and determine
reasonable defaults for granularitySpec, dimensionsSpec, and metricsSpec.
This is a useful feature that makes compact tasks work well even when the
user running the compaction does not have a clear idea of what they want
the compacted segments to be like.

However, this comes at a cost: it takes time, and disk space, to do all
of these fetches. This patch improves the situation in two ways:

1) When segments do need to be fetched, download them one at a time and
   delete them when we're done. This still takes time, but minimizes the
   required disk space.

2) Don't fetch segments on the main compact task when they aren't needed.
   If the user provides a full granularitySpec, dimensionsSpec, and
   metricsSpec, we can skip it.

* Adjustments.

* Changes from code review.

* Fix logic for determining rollup.
2022-11-07 14:50:14 +05:30
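The two improvements described in #13280 follow a common pattern, sketched below under stated assumptions. `Fetcher`, `inspectAll`, and `needsFetch` are hypothetical names, and a segment is represented by a single local file, which is a simplification.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; not the compaction task code.
final class OneAtATimeFetchSketch {
    // Stand-in for whatever downloads a segment to local disk.
    interface Fetcher {
        Path fetch(String segmentId) throws IOException;
    }

    // Fetching is needed only when at least one spec was left for the task to infer.
    static boolean needsFetch(boolean hasGranularitySpec, boolean hasDimensionsSpec, boolean hasMetricsSpec) {
        return !(hasGranularitySpec && hasDimensionsSpec && hasMetricsSpec);
    }

    // Downloads, inspects, and deletes each segment before moving on to the next one,
    // so at most one segment occupies local disk at any time.
    static <R> List<R> inspectAll(List<String> segmentIds, Fetcher fetcher, Function<Path, R> inspect) {
        List<R> results = new ArrayList<>();
        for (String id : segmentIds) {
            try {
                Path local = fetcher.fetch(id);
                try {
                    results.add(inspect.apply(local));
                } finally {
                    Files.deleteIfExists(local);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(needsFetch(true, true, true));
        System.out.println(needsFetch(true, false, true));
    }
}
```

The trade-off is the one the commit message names: total download time is unchanged, but peak disk usage drops to roughly the size of the largest single segment.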
Gian Merlino 9423aa9163
MSQ: Consider PARTITION_STATS_MAX_BYTES in WorkerMemoryParameters. (#13274)
* MSQ: Consider PARTITION_STATS_MAX_BYTES in WorkerMemoryParameters.

This consideration is important, because otherwise we can run out of
memory due to large statistics-tracking objects.

* Improved calculations.
2022-11-07 14:27:18 +05:30
Gian Merlino 8f90589ce5
Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247)
* Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH.

These aggregation functions are documented as creating sketches. However,
they are planned into native aggregators that include finalization logic
to convert the sketch to a number of some sort. This creates an
inconsistency: the functions sometimes return sketches, and sometimes
return numbers, depending on where they lie in the native query plan.

This patch changes these SQL aggregators to _never_ finalize, by using
the "shouldFinalize" feature of the native aggregators. It already
existed for theta sketches. This patch adds the feature for hll and
quantiles sketches.

As to impact, Druid finalizes aggregators in two cases:

- When they appear in the outer level of a query (not a subquery).
- When they are used as input to an expression or finalizing-field-access
  post-aggregator (not any other kind of post-aggregator).

With this patch, the functions will no longer be finalized in these cases.

The second item is not likely to matter much. The SQL functions all declare
return type OTHER, which would be usable as an input to any other function
that makes sense and that would be planned into an expression.

So, the main effect of this patch is the first item. To provide backwards
compatibility with anyone that was depending on the old behavior, the
patch adds a "sqlFinalizeOuterSketches" query context parameter that
restores the old behavior.

Other changes:

1) Move various argument-checking logic from runtime to planning time in
   DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker.

2) Add various JsonIgnores to the sketches to simplify their JSON representations.

3) Allow chaining of ExpressionPostAggregators and other PostAggregators
   in the SQL layer.

4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer,
   now that expressions can operate on complex inputs.

5) Adjust return type to thetaSketch (instead of OTHER) in
   ThetaSketchSetBaseOperatorConversion.

* Fix benchmark class.

* Fix compilation error.

* Fix ThetaSketchSqlAggregatorTest.

* Hopefully fix ITAutoCompactionTest.

* Adjustment to ITAutoCompactionTest.
2022-11-03 09:43:00 -07:00
Gian Merlino d1877e41ec
Use lookup memory footprint in MSQ memory computations. (#13271)
* Use lookup memory footprint in MSQ memory computations.

Two main changes:

1) Add estimateHeapFootprint to LookupExtractor.

2) Use this in MSQ's IndexerWorkerContext when determining the total
   amount of available memory. It's taken off the top.

This prevents MSQ tasks from running out of memory when there are lookups
defined in the cluster.

* Updates from code review.
2022-11-03 07:36:54 -07:00
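"Taken off the top" in #13271 can be read as plain subtraction before any other memory budgeting happens. The sketch below is an assumption about the arithmetic only: the class and method names are hypothetical, and the per-lookup footprints are taken as already-computed byte counts rather than results of `estimateHeapFootprint`.

```java
// Hypothetical sketch of the arithmetic; not IndexerWorkerContext.
final class LookupAwareMemorySketch {
    // Memory left for the task after reserving the estimated footprint of every lookup.
    static long usableMemory(long totalBytes, long[] lookupFootprintBytes) {
        long reserved = 0;
        for (long footprint : lookupFootprintBytes) {
            reserved += footprint;
        }
        // Clamp at zero so oversized lookups never yield a negative budget.
        return Math.max(0L, totalBytes - reserved);
    }

    public static void main(String[] args) {
        System.out.println(usableMemory(1_000_000L, new long[] {200_000L, 50_000L}));
    }
}
```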
317brian ae638e338c
docs(msq): update insert vs replace for dimension-based segment pruning (#13228)
* docs(msq): update insert vs replace to mention dimension-based segment pruning

* make suggested changes
2022-11-03 14:17:44 +05:30
Dr. Sizzles e5ad24ff9f
Support for middle manager less druid, tasks launch as k8s jobs (#13156)
* Support for middle manager less druid, tasks launch as k8s jobs

* Fixing forking task runner test

* Test cleanup, dependency cleanup, intellij inspections cleanup

* Changes per PR review

Add configuration option to disable http/https proxy for the k8s client
Update the docs to provide more detail about sidecar support

* Removing un-needed log lines

* Small changes per PR review

* Upon task completion we call back to the overlord to update the status / location; for slower k8s clusters, this reduces locking time significantly

* Merge conflict fix

* Fixing tests and docs

* update tiny-cluster.yaml 

changed `enableTaskLevelLogPush` to `encapsulatedTask`

* Apply suggestions from code review

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* Minor changes per PR request

* Cleanup, adding test to AbstractTask

* Add comment in peon.sh

* Bumping code coverage

* More tests to make code coverage happy

* Doh a duplicate dependency

* Integration test setup is weird for k8s, will do this in a different PR

* Reverting back all integration test changes, will do in another PR

* use StringUtils.base64 instead of Base64

* Jdk is nasty, if i compress in jdk 11 in jdk 17 the decompressed result is different

Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
2022-11-02 19:44:47 -07:00
Jason Koch 0d03ce435f
introduce a "tree" type to the flattenSpec (#12177)
* introduce a "tree" type to the flattenSpec

* feedback - rename exprs to nodes, use CollectionsUtils.isNullOrEmpty for guard

* feedback - expand docs to more clearly capture limitations of "tree" flattenSpec

* feedback - fix for typo on docs

* introduce a comment to explain defensive copy, tweak null handling

* fix: part of rebase

* mark ObjectFlatteners.FlattenerMaker as an ExtensionPoint and provide default for new tree type

* fix: objectflattener restore previous behavior to call getRootField for root type

* docs: ingestion/data-formats add note that ORC only supports path expressions

* chore: linter remove unused import

* fix: use correct newer form for empty DimensionsSpec in FlattenJSONBenchmark
2022-11-01 14:49:30 +08:00
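The "tree" flattenSpec type in #12177 is configured with a list of `nodes`. Assuming those nodes are keys walked one after another from the root of a nested object, the lookup could be sketched as follows. `TreeFlattenSketch` and `resolve` are hypothetical names, and the input is modeled as nested `Map`s rather than any particular parsed JSON type.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the ObjectFlatteners implementation.
final class TreeFlattenSketch {
    // Walks `nodes` from the root and returns the value found, or null if the path breaks.
    static Object resolve(Map<String, Object> root, List<String> nodes) {
        Object current = root;
        for (String node : nodes) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(node);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = Collections.singletonMap("baz", (Object) "value");
        Map<String, Object> root = Collections.singletonMap("foo", (Object) inner);
        System.out.println(resolve(root, Arrays.asList("foo", "baz")));
    }
}
```

A walk like this can only follow exact keys, which is consistent with the commit's note that the docs were expanded to capture the limitations of the "tree" type.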
Gian Merlino d851985cf5
MSQ: Add support for indexSpec. (#13275) 2022-10-28 14:27:50 -07:00
Adarsh Sanjeev 4775427e2c
Add task start status to worker report (#13263)
* Add task start status to worker report

* Address review comments

* Address review comments

* Update documentation

* Update spelling checks
2022-10-28 12:00:15 +05:30
Tejaswini Bandlamudi 49e54a0ec6
Docs: Update inputSegmentSizeBytes description (#13266) 2022-10-28 09:33:52 +05:30
Clint Wylie 77e4246598
add support for 'front coded' string dictionaries for smaller string columns (#12277)
* add FrontCodedIndexed for delta string encoding

* now for actual segments

* fix indexOf

* fixes and thread safety

* add bucket size 4, which seems generally better

* fixes

* fixes maybe

* update indexes to latest interfaces

* utf8 support

* adjust

* oops

* oops

* refactor, better, faster

* more test

* fixes

* revert

* adjustments

* fix prefixing

* more chill

* sql nested benchmark too

* refactor

* more comments and javadocs

* better get

* remove base class

* fix

* hot rod

* adjust comments

* faster still

* minor adjustments

* spatial index support

* spotbugs

* add isSorted to Indexed to strengthen indexOf contract if set, improve javadocs, add docs

* fix docs

* push into constructor

* use base buffer instead of copy

* oops
2022-10-25 18:05:38 -07:00
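For readers unfamiliar with front coding, the idea behind #12277 can be sketched with the textbook variant below. This is not the `FrontCodedIndexed` layout: the class and its methods are hypothetical, each value is encoded against the previous value in its bucket (an assumption), and prefixes are measured in Java `char`s rather than UTF-8 bytes. The bucket size of 4 echoes the commit message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of front coding over a sorted dictionary.
final class FrontCodingSketch {
    static final int BUCKET_SIZE = 4;

    // One encoded value: how many leading chars it shares with its predecessor, plus the rest.
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    // The first value of each bucket is stored whole; the others store only their differing tail.
    static List<Entry> encode(List<String> sortedValues) {
        List<Entry> encoded = new ArrayList<>();
        for (int i = 0; i < sortedValues.size(); i++) {
            String value = sortedValues.get(i);
            if (i % BUCKET_SIZE == 0) {
                encoded.add(new Entry(0, value));
            } else {
                int shared = commonPrefixLength(sortedValues.get(i - 1), value);
                encoded.add(new Entry(shared, value.substring(shared)));
            }
        }
        return encoded;
    }

    // Rebuilds the value at `index` by replaying its bucket from the bucket's first entry.
    static String get(List<Entry> encoded, int index) {
        int bucketStart = index - (index % BUCKET_SIZE);
        String current = encoded.get(bucketStart).suffix;
        for (int i = bucketStart + 1; i <= index; i++) {
            Entry entry = encoded.get(i);
            current = current.substring(0, entry.prefixLength) + entry.suffix;
        }
        return current;
    }

    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("druid", "druidical", "drum", "drummer", "dry");
        List<Entry> encoded = encode(values);
        for (int i = 0; i < values.size(); i++) {
            System.out.println(encoded.get(i).prefixLength + " + \"" + encoded.get(i).suffix + "\" = " + get(encoded, i));
        }
    }
}
```

Smaller buckets make random access cheaper because fewer entries are replayed, while larger buckets save more space; that is the trade-off behind choosing a bucket size.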
317brian c83115e4e1
api: change API page formatting (#13213)
Tracking additional improvements requested by @paul-rogers: #13239

* api: refactor page so that indented bullet is child and unindented portion is parent

* get rid of post etc headings and combine them with the endpoint

* Update docs/operations/api-reference.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* fix broken links

* fix typo

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2022-10-18 13:22:26 -07:00
Paul Rogers b34b4353f4
Async reads for JDBC (#13196)
Async reads for JDBC:
Prevents JDBC timeouts on long queries by returning empty batches
when a batch fetch takes too long. Uses an async model to run the
result fetch concurrently with JDBC requests.

Fixed race condition in Druid's Avatica server-side handler
Fixed issue with no-user connections
2022-10-18 11:40:57 -07:00
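The behavior described in #13196, returning an empty batch when a fetch takes too long while the fetch keeps running, can be sketched as below. This is an illustration under assumptions, not the Avatica handler code: `AsyncBatchSketch`, `nextBatch`, and the supplier-based constructor are hypothetical, and a real protocol would need a separate end-of-results signal, since an empty batch here only means "not ready yet".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch of an async batch reader with a per-request wait limit.
final class AsyncBatchSketch<T> {
    private final Supplier<CompletableFuture<List<T>>> startFetch;
    private final long timeoutMillis;
    private CompletableFuture<List<T>> inFlight;

    AsyncBatchSketch(Supplier<CompletableFuture<List<T>>> startFetch, long timeoutMillis) {
        this.startFetch = startFetch;
        this.timeoutMillis = timeoutMillis;
    }

    // Waits up to timeoutMillis for the current fetch; on timeout returns an empty batch
    // and keeps the fetch in flight so the next request can pick up its result.
    synchronized List<T> nextBatch() {
        if (inFlight == null) {
            inFlight = startFetch.get();
        }
        try {
            List<T> batch = inFlight.get(timeoutMillis, TimeUnit.MILLISECONDS);
            inFlight = null;
            return batch;
        } catch (TimeoutException e) {
            return Collections.emptyList();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            inFlight = null;
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        CompletableFuture<List<String>> slowFetch = new CompletableFuture<>();
        AsyncBatchSketch<String> reader = new AsyncBatchSketch<>(() -> slowFetch, 10);
        System.out.println(reader.nextBatch()); // fetch not finished yet
        slowFetch.complete(Arrays.asList("row1", "row2"));
        System.out.println(reader.nextBatch()); // same fetch, now finished
    }
}
```

Because each request returns within the timeout, the client's own request timeout is not hit even when a single batch takes much longer to produce.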
cristian-popa cc10350870
Collocated processes instructions (#13224)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Frank Chen <frankchen@apache.org>
2022-10-17 11:56:00 -07:00
Gian Merlino 3bbb76f17b
Docs: Add query/cpu/time to real-time metrics. (#13229) 2022-10-15 18:26:44 +05:30
arvindanugula 42384d85e7
Update nested-columns.md (#13227)
typo error corrected.
2022-10-14 16:15:46 -07:00
Victoria Lim 02ad62a08c
Docs: update description of query priority default value (#13191)
* update description of default for query priority

* update order

* update terms

* standardize to query context parameters
2022-10-14 14:28:04 -07:00
Karan Kumar 9d51e466b1
Minor doc update for BroadcastTablesTooLarge (#13218)
Minor doc update for `BroadcastTablesTooLarge`. Now the user will know what to do
in case this fault is encountered.
2022-10-14 09:06:55 +05:30
Tejaswini Bandlamudi 3e13584e0e
Adds Idle feature to `SeekableStreamSupervisor` for inactive stream (#13144)
* Idle Seekable stream supervisor changes.

* nit

* nit

* nit

* Adds unit tests

* Supervisor decides its idle state instead of AutoScaler

* docs update

* nit

* nit

* docs update

* Adds Kafka unit test

* Adds Kafka Integration test.

* Updates travis config.

* Updates kafka-indexing-service dependencies.

* updates previous offsets snapshot & doc

* Doesn't act if supervisor is suspended.

* Fixes highest current offsets fetch bug, adds new Kafka UT tests, doc changes.

* Reverts Kinesis Supervisor idle behaviour changes.

* nit

* nit

* Corrects SeekableStreamSupervisorSpec check on idle behaviour config, adds tests.

* Fixes getHighestCurrentOffsets to fetch offsets of publishing tasks too

* Adds Kafka Supervisor UT

* Improves test coverage in druid-server

* Corrects IT override config

* Doc updates and Syntactic changes

* nit

* supervisorSpec.ioConfig.idleConfig changes
2022-10-12 18:31:08 +05:30
Jonathan Wei 9b8e69c99a
Add inline descriptor Protobuf bytes decoder (#13192)
* Add inline descriptor Protobuf bytes decoder

* PR comments

* Update tests, check for IllegalArgumentException

* Fix license, add equals test

* Update extensions-core/protobuf-extensions/src/main/java/org/apache/druid/data/input/protobuf/InlineDescriptorProtobufBytesDecoder.java

Co-authored-by: Frank Chen <frankchen@apache.org>

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-10-11 13:37:28 -05:00
Charles Smith 25c1d55dd6
Clarify behavior when decommissioningMaxPercentOfMaxSegmentsToMove = 0 (#13157)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-10-07 09:01:32 -07:00
317brian 0edceead80
msq: update known issue about GROUPING SETS and COUNT DISTINCT (#13185)
* msq: update known issue about GROUPING SETS and COUNT DISTINCT

* address feedback from Gian
2022-10-05 19:47:03 -07:00
AmatyaAvadhanula 41e51b21c3
Make http options the default configurations (#13092)
Druid currently uses Zookeeper dependent options as the default.
This commit updates the following to use HTTP as the default instead.
- task runner. `druid.indexer.runner.type=remote -> httpRemote`
- load queue peon. `druid.coordinator.loadqueuepeon.type=curator -> http`
- server inventory view. `druid.serverview.type=curator -> http`
2022-10-05 05:35:17 +05:30
Adarsh Sanjeev 92d2633ae6
Update ClusterByStatisticsCollectorImpl to use bytes instead of keys (#12998)
* Update clusterByStatistics to use bytes instead of keys

* Address review comments

* Resolve checkstyle

* Increase test coverage

* Update test

* Update thresholds

* Update retained keys function

* Update docs

* Fix spelling
2022-10-03 12:08:23 +05:30
Jill Osborne 548d810baa
Correct nested columns example (#13150) 2022-09-28 10:39:56 +05:30
David Palmer 0d7bf66578
Add a note to the documentation about pre-built HLLSketches (#13088)
* add a note to the documentation about pre-built HLLSketches

Druid actually supports ingesting a pre-generated sketch column by using
the HLLSketchMerge aggregator. However, this functionality was
previously not made clear in the documentation.

* copyedit from the King's English to American English

* add suggested style changes

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-09-27 10:29:39 +08:00
Apoorv Gupta c8f4d72fb1
Fix documentation bug about injective lookups (#13147)
replace mapping to `unique keys` with mapping to `unique values`.
2022-09-27 10:16:48 +08:00
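The distinction corrected in #13147 is easy to check in code: the keys of a map are always unique, so what makes a lookup injective is that its values are unique too. The sketch below uses a plain `Map` as a stand-in for a lookup; `InjectiveCheckSketch` and `isInjective` are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical sketch; a lookup is modeled as a simple key-to-value map.
final class InjectiveCheckSketch {
    // A mapping is injective when no two keys share the same value.
    static boolean isInjective(Map<String, String> lookup) {
        return new HashSet<>(lookup.values()).size() == lookup.size();
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("1", "Foo");
        lookup.put("2", "Bar");
        System.out.println(isInjective(lookup));
        lookup.put("3", "Bar"); // two keys now map to the same value
        System.out.println(isInjective(lookup));
    }
}
```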
Jonathan Wei 1f1fced6d4
Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON (#13089)
* Add JsonInputFormat option to assume newline delimited JSON, improve handling for non-NDJSON

* Fix serde and docs

* Add PR comment check
2022-09-26 19:51:04 -05:00
Charles Smith eb760c3d1d
update log4j example (#13095)
* update log4j example

* fix some style issues

* Update docs/configuration/logging.md

Co-authored-by: Frank Chen <frankchen@apache.org>

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-09-22 09:46:49 +08:00
317brian 12f12a13a9
fix: fix broken postgres link (#13135) 2022-09-22 09:46:20 +08:00
317brian 7fa35839c0
fix: follow naming convention for msq task engine (#13127)
* fix: follow naming convention for msq task engine

* more fixes

* add back in experimental

* fix anchor
2022-09-21 18:46:06 -07:00
Gian Merlino 2f731f356e
Update pull-deps docs with correct repo list. (#13134)
There is only one default remote repo at this time.
2022-09-21 12:16:57 -07:00
Katya Macedo 90d14f629a
spatial-filters (#13124) 2022-09-20 22:48:36 -07:00
hosswald 5ed5c83aab
Clarified the behaviour of SQL COUNT(DISTINCT dim) on multi-value dimensions (#13128)
* Clarified the behaviour of COUNT(DISTINCT column) on multi-value columns

* Update docs/querying/sql-aggregations.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-09-20 18:03:34 -07:00
Vadim Ogievetsky edc444a4bc
fix quickstart (#13126) 2022-09-20 17:44:21 -07:00
Vadim Ogievetsky b9edfe34a4
be consistent about referring to the web console by its name (#13118) 2022-09-19 15:02:17 -07:00