Commit Graph

2941 Commits

Author SHA1 Message Date
Adarsh Sanjeev 2b605aa9cf
Multiple fixes for the MSQ stats merging piece which (#13463)
* Add validation checks to worker chat handler apis

* Merge things and polishing the error messages.

* Minor error message change

* Fixing race and adding some tests

* Fixing controller fetching stats from wrong workers.
Fixing race
Changing default mode to Parallel
Adding logging.
Fixing exceptions not propagated properly.

* Changing to kernel worker count

* Added a better logic to figure out assigned worker for a stage.

* Nits

* Moving to existing kernel methods

* Adding more coverage

Co-authored-by: cryptoe <karankumar1100@gmail.com>
2022-12-15 09:35:11 +05:30
Vadim Ogievetsky 2729e25295
Link to java docs (#13478)
* add link to page about selecting a JRE

* add link to script also

* simplify text
2022-12-14 11:45:23 -08:00
Gian Merlino de5a4bafcb
Zero-copy local deep storage. (#13394)
* Zero-copy local deep storage.

This is useful for local deep storage, since it reduces disk usage and
makes Historicals able to load segments instantaneously.

Two changes:

1) Introduce "druid.storage.zip" parameter for local storage, which defaults
   to false. This changes default behavior from writing an index.zip to writing
   a regular directory. This is safe to do even during a rolling update, because
   the older code actually already handled unzipped directories being present
   on local deep storage.

2) In LocalDataSegmentPuller and LocalDataSegmentPusher, use hard links
   instead of copies when possible. (Generally this is possible when the
   source and destination directory are on the same filesystem.)
2022-12-12 17:28:24 -08:00
Rishabh Singh 4ebdfe226d
Druid automated quickstart (#13365)
* Druid automated quickstart

* remove conf/druid/single-server/quickstart/_common/historical/jvm.config

* Minor changes in python script

* Add lower bound memory for some services

* Additional runtime properties for services

* Update supervise script to accept command arguments, corresponding changes in druid-quickstart.py

* File end newline

* Limit the ability to start multiple instances of a service, documentation changes

* simplify script arguments

* restore changes in medium profile

* run-druid refactor

* compute and pass middle manager runtime properties to run-druid
supervise script changes to process java opts array
use argparse, leave free memory, logging

* Remove extra quotes from mm task javaopts array

* Update logic to compute minimum memory

* simplify run-druid

* remove debug options from run-druid

* resolve the config_path provided

* comment out service specific runtime properties which are computed in the code

* simplify run-druid

* clean up docs, naming changes

* Throw ValueError exception on illegal state

* update docs

* rename args, compute_only -> compute, run_zk -> zk

* update help documentation

* update help documentation

* move task memory computation into separate method

* Add validation checks

* remove print

* Add validations

* remove start-druid bash script, rename start-druid-main

* Include tasks in lower bound memory calculation

* Fix test

* 256m instead of 256g

* caffeine cache uses 5% of heap

* ensure min task count is 2, task count is monotonic

* update configs and documentation for runtime props in conf/druid/single-server/quickstart

* Update docs

* Specify memory argument for each profile in single-server.md

* Update middleManager runtime.properties

* Move quickstart configs to conf/druid/base, add bash launch script, support python2

* Update supervise script

* rename base config directory to auto

* rename python script, changes to pass repeated args to supervise

* remove exmaples/conf/druid/base dir

* add docs

* restore changes in conf dir

* update start-druid-auto

* remove hashref for commands in supervise script

* start-druid-main java_opts array is comma separated

* update entry point script name in python script

* Update help docs

* documentation changes

* docs changes

* update docs

* add support for running indexer

* update supported services list

* update help

* Update python.md

* remove dir

* update .spelling

* Remove dependency on psutil and pathlib

* update docs

* Update get_physical_memory method

* Update help docs

* update docs

* update method to get physical memory on python

* udpate spelling

* update .spelling

* minor change

* Minor change

* memory comptuation for indexer

* update start-druid

* Update python.md

* Update single-server.md

* Update python.md

* run python3 --version to check if python is installed

* Update supervise script

* start-druid: echo message if python not found

* update anchor text

* minor change

* Update condition in supervise script

* JVM not jvm in docs
2022-12-09 11:04:02 -08:00
Paul Rogers 013a12e86f
Enhanced MSQ table functions (#13360)
* Enhanced MSQ table functions
* HTTP, LOCALFILES and INLINE table functions powered by
catalog metadata.
* Documentation
2022-12-08 13:56:02 -08:00
Gian Merlino 91ef9872ec
MSQ: Improve TooManyBuckets error message, improve error docs. (#13525)
1) Edited the TooManyBuckets error message to mention PARTITIONED BY
   instead of segmentGranularity.

2) Added error-code-specific anchors in the docs.

3) Add information to various error codes in the docs about common
   causes and solutions.
2022-12-08 13:18:26 -08:00
Jill Osborne b56855b837
Update to native ingestion doc (#13482)
* Update to native ingestion doc

* Update docs/ingestion/native-batch.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Update native-batch.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2022-12-07 15:08:19 +05:30
Vadim Ogievetsky 9679f6a9b5
Web console: add arrayOfDoublesSketch and other small fixes (#13486)
* add padding and keywords

* add arrayOfDoubles

* Update docs/development/extensions-core/datasketches-tuple.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/datasketches-tuple.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/datasketches-tuple.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/datasketches-tuple.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/datasketches-tuple.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* partiton int

* fix docs

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-12-06 21:21:49 -08:00
Kashif Faraz c7229fc787
Limit max batch size for segment allocation, add docs (#13503)
Changes:
- Limit max batch size in `SegmentAllocationQueue` to 500
- Rename `batchAllocationMaxWaitTime` to `batchAllocationWaitTime` since the actual
wait time may exceed this configured value.
- Replace usage of `SegmentInsertAction` in `TaskToolbox` with `SegmentTransactionalInsertAction`
2022-12-07 10:07:14 +05:30
Gian Merlino fda0a1aadd
Set chatAsync default to true. (#13491)
This functionality was originally added in #13354.
2022-12-05 20:53:59 -08:00
Kashif Faraz 65945a686f
Docs: Update docs for coordinator dynamic config (#13494)
* Update docs for useBatchedSegmentSampler

* Update docs for round robin assigment
2022-12-05 16:53:10 +05:30
TSFenwick 10bec54acc
Switching emitter. This will allow for a per feed emitter designation. (#13363)
* Switching emitter. This will allow for a per feed emitter designation.

This will work by looking at an event's feed and direct it to a specific emitter. If no specific feed is specified for a feed.
The emitter can direct the event to a default emitter.

* fix checkstyle issues and make docs for switching emitter use basic event feeds

* fix broken docs, add test, and guard against misconfigurations

* add module test
add switching emitter module test

* fix broken SwitchingEmitterModuleTest

* add apache license to top of test

* fix checkstyle issues

* address comments by adding javadocs, removing a todo, and making druid docs more clear
2022-12-05 16:04:34 +05:30
Katya Macedo 78c1a2bd66
Remove limit from timeseries (#13457)
CI build failures seem unrelated to docs
2022-12-02 12:19:59 -08:00
Jill Osborne 138a6de507
Update nested columns docs (#13461)
* Update nested columns docs

(cherry picked from commit 04206c5179)

* Update nested-columns.md

(cherry picked from commit 8085ee7217)
2022-12-01 10:47:32 -08:00
317brian cc2e4a80ff
doc: add a basic JDBC tutorial (#13343)
* initial commit for jdbc tutorial

(cherry picked from commit 04c4adad71e5436b76c3425fe369df03aaaf0acb)

* add commentary

* address comments from charles

* add query context to example

* fix typo

* add links

* Apply suggestions from code review

Co-authored-by: Frank Chen <frankchen@apache.org>

* fix datatype

* address feedback

* add parameterize to spelling file. the past tense version was already there

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-11-30 16:25:35 -08:00
Jill Osborne 291ded22d5
Update experimental features doc (#13452) 2022-11-30 16:14:43 +05:30
Jill Osborne 5c520e0cf9
Update LDAP configuration docs (#13245)
* Update LDAP configuration docs

* Updated after review

* Update auth-ldap.md

Updated.

* Update auth-ldap.md

* Updated spelling file

* Update docs/operations/auth-ldap.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/auth-ldap.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/auth-ldap.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update auth-ldap.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-11-29 09:26:32 -08:00
Jill Osborne 100a2aa4a2
Update and document experimental features (#13348)
* Update and document experimental features
* Updated
* Update experimental-features.md
* Update docs/development/experimental-features.md
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Updated after review
* Updated
* Update materialized-view.md
* Update experimental-features.md
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
2022-11-29 08:01:28 +05:30
Jill Osborne db7c29c6f9
Correction to firehose migration doc (#13423) 2022-11-28 10:24:27 +05:30
Adarsh Sanjeev 280a0f7158
Add sequential sketch merging to MSQ (#13205)
* Add sketch fetching framework

* Refactor code to support sequential merge

* Update worker sketch fetcher

* Refactor sketch fetcher

* Refactor sketch fetcher

* Add context parameter and threshold to trigger sequential merge

* Fix test

* Add integration test for non sequential merge

* Address review comments

* Address review comments

* Address review comments

* Resolve maxRetainedBytes

* Add new classes

* Renamed key statistics information class

* Rename fetchStatisticsSnapshotForTimeChunk function

* Address review comments

* Address review comments

* Update documentation and add comments

* Resolve build issues

* Resolve build issues

* Change worker APIs to async

* Address review comments

* Resolve build issues

* Add null time check

* Update integration tests

* Address review comments

* Add log messages and comments

* Resolve build issues

* Add unit tests

* Add unit tests

* Fix timing issue in tests
2022-11-22 09:56:32 +05:30
Jill Osborne 68018a808f
Firehose migration doc (#12981)
* Firehose migration doc

* Update migrate-from-firehose-ingestion.md

* Updated with review comments and suggestions

* Update migrate-from-firehose-ingestion.md

* Update migrate-from-firehose-ingestion.md

* Update migrate-from-firehose-ingestion.md
2022-11-21 11:17:12 -08:00
Gian Merlino bfffbabb56
Async task client for SeekableStreamSupervisors. (#13354)
Main changes:
1) Convert SeekableStreamIndexTaskClient to an interface, move old code
   to SeekableStreamIndexTaskClientSyncImpl, and add new implementation
   SeekableStreamIndexTaskClientAsyncImpl that uses ServiceClient.
2) Add "chatAsync" parameter to seekable stream supervisors that causes
   the supervisor to use an async task client.
3) In SeekableStreamSupervisor.discoverTasks, adjust logic to avoid making
   blocking RPC calls in workerExec threads.
4) In SeekableStreamSupervisor generally, switch from Futures.successfulAsList
   to FutureUtils.coalesce, so we can better capture the errors that occurred
   with contacting individual tasks.

Other, related changes:
1) Add ServiceRetryPolicy.retryNotAvailable, which controls whether
   ServiceClient retries unavailable services. Useful since we do not
   want to retry calls unavailable tasks within the service client. (The
   supervisor does its own higher-level retries.)
2) Add FutureUtils.transformAsync, a more lambda friendly version of
   Futures.transform(f, AsyncFunction).
3) Add FutureUtils.coalesce. Similar to Futures.successfulAsList, but
   returns Either instead of using null on error.
4) Add JacksonUtils.readValue overloads for JavaType and TypeReference.
2022-11-21 19:20:26 +05:30
Katya Macedo fd239305d9
Update metrics doc (#13316)
Changes:
- used inline code-style to format dimension names
- removed unnecessary punctuation
2022-11-21 09:43:52 +05:30
Jill Osborne a860baf496
Updated docs on front coding (#13387) 2022-11-19 00:01:04 -08:00
Laksh Singla 9e938b5a6f
Add a limit to the number of columns in the CLUSTERED BY clause (#13352)
* Add clustered by limit

* change semantics, add docs

* add fault class to the module

* add test

* unambiguate test
2022-11-15 22:05:15 +05:30
Clint Wylie 1231ce3b75
dump-segment tool support for examining nested columns (#13356)
* add nested mode to dump segment tool to dump nested columns

* docs

* more test

* fix it
2022-11-14 16:08:47 -08:00
Jill Osborne b0db2a87d8
Update Kafka ingestion tutorial (#13261)
* Update Kafka ingestion tutorial

* Update tutorial-kafka.md

Updated location of sample data file

* Added sample data file

* Update tutorial-kafka.md

* Add sample data file

* Update tutorial-kafka.md

Updated sample file location in curl commands

* Update and reuploading sample data files

* Updated spelling file

* Delete .spelling

* Added spelling file

* Update docs/tutorials/tutorial-kafka.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-kafka.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Updated after review

* Update tutorial-kafka.md

* Updated

* Update tutorial-kafka.md

* Update tutorial-kafka.md

* Update tutorial-kafka.md

* Updated sample data file and command

* Add files via upload

* Delete kttm-nested-data.json.tgz

* Delete kttm-nested-data.json.tgz

* Add files via upload

* Update tutorial-kafka.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-11-11 14:47:54 -08:00
Jill Osborne 47dd4ed2e7
Added experimental feature text for front coding feature (#13349) 2022-11-11 02:06:13 -08:00
Didip Kerabat 56d5c9780d
Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects (#13027)
* Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects.

Removed:

import org.apache.commons.io.FilenameUtils;

Add:

import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

* Forgot to update CloudObjectInputSource as well.

* Fix tests.

* Removed unused exceptions.

* Able to reduced user mistakes, by removing the protocol and the bucket on filter.

* add 1 more test.

* add comment on filterWithoutProtocolAndBucket

* Fix lint issue.

* Fix another lint issue.

* Replace all mention of filter -> objectGlob per convo here:

https://github.com/apache/druid/pull/13027#issuecomment-1266410707

* fix 1 bad constructor.

* Fix the documentation.

* Don’t do anything clever with the object path.

* Remove unused imports.

* Fix spelling error.

* Fix incorrect search and replace.

* Addressing Gian’s comment.

* add filename on .spelling

* Fix documentation.

* fix documentation again

Co-authored-by: Didip Kerabat <didip@apple.com>
2022-11-10 23:46:40 -08:00
Gian Merlino 77478f25fb
Add taskActionType dimension to task/action/run/time. (#13333)
* Add taskActionType dimension to task/action/run/time.

* Spelling.
2022-11-11 12:00:08 +05:30
Andreas Maechler 03175a2b8d
Add missing MSQ error code fields to docs (#13308)
* Fix typo

* Fix some spacing

* Add missing fields

* Cleanup table spacing

* Remove durable storage docs again

Thanks Brian for pointing out previous discussions.

* Update docs/multi-stage-query/reference.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Mark codes as code

* And even more codes as code

* Another set of spaces

* Combine `ColumnTypeNotSupported`

Thanks Karan.

* More whitespaces and typos

* Add spelling and fix links

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-11-10 21:03:04 +05:30
Jill Osborne c2210c4e09
Update ingestion spec doc (#13329)
* Update ingestion spec doc

* Updated

* Updated

* Update docs/ingestion/ingestion-spec.md

Co-authored-by: Clint Wylie <cjwylie@gmail.com>

* Updated

* Updated

Co-authored-by: Clint Wylie <cjwylie@gmail.com>
2022-11-10 02:54:35 -08:00
Jill Osborne 965e41538e
Update nested columns doc (#13314)
* Updated nested columns doc

* Update nested-columns.md

* Update nested-columns.md
2022-11-10 09:53:28 +08:00
AmatyaAvadhanula a2013e6566
Enhance streaming ingestion metrics (#13331)
Changes:
- Add a metric for partition-wise kafka/kinesis lag for streaming ingestion.
- Emit lag metrics for streaming ingestion when supervisor is not suspended and state is in {RUNNING, IDLE, UNHEALTHY_TASKS, UNHEALTHY_SUPERVISOR}
- Document metrics
2022-11-09 23:44:15 +05:30
Laksh Singla b7a513fe09
Add a OverlordHelper that cleans up durable storage objects in MSQ (#13269)
* scratch

* s3 ls fix, add docs

* add documentation, update method name

* Add tests, address commits, change default value of the helper

* fix test

* update the default value of config, remove initial delay config

* Trigger Build

* update class

* add more tests

* docs update

* spellcheck

* remove ioe from the signature

* add back dmmy constructor for initialization

* fix guice bindings, intellij inspections
2022-11-09 17:23:35 +05:30
Kashif Faraz ff8e0c3397
Fix issues with caching cost strategy (#13321)
`cachingCost` strategy has some discrepancies when compared to cost strategy.
This commit addresses two of these by retaining the same behaviour as the `cost` strategy
when computing the cost of moving a segment to a server:
- subtract the self cost of a segment if it is being served by the target server
- subtract the cost of segments that are marked to be dropped

Other changes:
- Add tests to verify fixed strategy. These tests would fail without the fixes made to `CachingCostStrategy.computeCost()`
- Fix the definition of the segment related metrics in the docs.
- Fix some docs issues introduced in #13181
2022-11-08 16:11:39 +05:30
Tejaswini Bandlamudi 594545da55
Adds cluster level idleConfig setting for supervisor (#13311)
* adds cluster level idleConfig

* updates docs

* refactoring

* spelling nit

* nit

* nit

* refactoring
2022-11-08 14:54:14 +05:30
Gian Merlino 48528a0c98
MSQ: Fix task lock checking during publish, fix lock priority. (#13282)
* MSQ: Fix task lock checking during publish, fix lock priority.

Fixes two issues:

1) ControllerImpl did not properly check the return value of
   SegmentTransactionalInsertAction when doing a REPLACE. This could cause
   it to not realize that its locks were preempted.

2) Task lock priority was the default of 0. It should be the higher
   batch default of 50. The low priority made it possible for MSQ tasks
   to be preempted by compaction tasks, which is not desired.

* Restructuring, add docs.

* Add performSegmentPublish tests.

* Fix tests.
2022-11-08 09:27:34 +05:30
Jill Osborne d1a4de022a
Update retention rules doc (#13181)
* Update retention rules doc

* Update rule-configuration.md

* Updated

* Updated

* Updated

* Updated

* Update rule-configuration.md

* Update rule-configuration.md
2022-11-07 14:47:33 -08:00
AmatyaAvadhanula a738ac9ad7
Improve task pause logging and metrics for streaming ingestion (#13313)
* Improve task pause logging and metrics for streaming ingestion

* Add metrics doc

* Fix spelling
2022-11-07 21:33:54 +05:30
AmatyaAvadhanula 47c32a9d92
Skip ALL granularity compaction (#13304)
* Skip autocompaction for datasources with ETERNITY segments
2022-11-07 17:55:03 +05:30
Gian Merlino 227b57dd8e
Compaction: Fetch segments one at a time on main task; skip when possible. (#13280)
* Compaction: Fetch segments one at a time on main task; skip when possible.

Compact tasks include the ability to fetch existing segments and determine
reasonable defaults for granularitySpec, dimensionsSpec, and metricsSpec.
This is a useful feature that makes compact tasks work well even when the
user running the compaction does not have a clear idea of what they want
the compacted segments to be like.

However, this comes at a cost: it takes time, and disk space, to do all
of these fetches. This patch improves the situation in two ways:

1) When segments do need to be fetched, download them one at a time and
   delete them when we're done. This still takes time, but minimizes the
   required disk space.

2) Don't fetch segments on the main compact task when they aren't needed.
   If the user provides a full granularitySpec, dimensionsSpec, and
   metricsSpec, we can skip it.

* Adjustments.

* Changes from code review.

* Fix logic for determining rollup.
2022-11-07 14:50:14 +05:30
Gian Merlino 9423aa9163
MSQ: Consider PARTITION_STATS_MAX_BYTES in WorkerMemoryParameters. (#13274)
* MSQ: Consider PARTITION_STATS_MAX_BYTES in WorkerMemoryParameters.

This consideration is important, because otherwise we can run out of
memory due to large statistics-tracking objects.

* Improved calculations.
2022-11-07 14:27:18 +05:30
Gian Merlino 8f90589ce5
Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH. (#13247)
* Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH.

These aggregation functions are documented as creating sketches. However,
they are planned into native aggregators that include finalization logic
to convert the sketch to a number of some sort. This creates an
inconsistency: the functions sometimes return sketches, and sometimes
return numbers, depending on where they lie in the native query plan.

This patch changes these SQL aggregators to _never_ finalize, by using
the "shouldFinalize" feature of the native aggregators. It already
existed for theta sketches. This patch adds the feature for hll and
quantiles sketches.

As to impact, Druid finalizes aggregators in two cases:

- When they appear in the outer level of a query (not a subquery).
- When they are used as input to an expression or finalizing-field-access
  post-aggregator (not any other kind of post-aggregator).

With this patch, the functions will no longer be finalized in these cases.

The second item is not likely to matter much. The SQL functions all declare
return type OTHER, which would be usable as an input to any other function
that makes sense and that would be planned into an expression.

So, the main effect of this patch is the first item. To provide backwards
compatibility with anyone that was depending on the old behavior, the
patch adds a "sqlFinalizeOuterSketches" query context parameter that
restores the old behavior.

Other changes:

1) Move various argument-checking logic from runtime to planning time in
   DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker.

2) Add various JsonIgnores to the sketches to simplify their JSON representations.

3) Allow chaining of ExpressionPostAggregators and other PostAggregators
   in the SQL layer.

4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer,
   now that expressions can operate on complex inputs.

5) Adjust return type to thetaSketch (instead of OTHER) in
   ThetaSketchSetBaseOperatorConversion.

* Fix benchmark class.

* Fix compilation error.

* Fix ThetaSketchSqlAggregatorTest.

* Hopefully fix ITAutoCompactionTest.

* Adjustment to ITAutoCompactionTest.
2022-11-03 09:43:00 -07:00
Gian Merlino d1877e41ec
Use lookup memory footprint in MSQ memory computations. (#13271)
* Use lookup memory footprint in MSQ memory computations.

Two main changes:

1) Add estimateHeapFootprint to LookupExtractor.

2) Use this in MSQ's IndexerWorkerContext when determining the total
   amount of available memory. It's taken off the top.

This prevents MSQ tasks from running out of memory when there are lookups
defined in the cluster.

* Updates from code review.
2022-11-03 07:36:54 -07:00
317brian ae638e338c
docs(msq): update insert vs replace for dimension-based segment pruning (#13228)
* docs(msq): update insert vs replace to mention dimension-based segment pruning

* make suggested changes
2022-11-03 14:17:44 +05:30
Dr. Sizzles e5ad24ff9f
Support for middle manager less druid, tasks launch as k8s jobs (#13156)
* Support for middle manager less druid, tasks launch as k8s jobs

* Fixing forking task runner test

* Test cleanup, dependency cleanup, intellij inspections cleanup

* Changes per PR review

Add configuration option to disable http/https proxy for the k8s client
Update the docs to provide more detail about sidecar support

* Removing un-needed log lines

* Small changes per PR review

* Upon task completion we callback to the overlord to update the status / locaiton, for slower k8s clusters, this reduces locking time significantly

* Merge conflict fix

* Fixing tests and docs

* update tiny-cluster.yaml 

changed `enableTaskLevelLogPush` to `encapsulatedTask`

* Apply suggestions from code review

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* Minor changes per PR request

* Cleanup, adding test to AbstractTask

* Add comment in peon.sh

* Bumping code coverage

* More tests to make code coverage happy

* Doh a duplicate dependnecy

* Integration test setup is weird for k8s, will do this in a different PR

* Reverting back all integration test changes, will do in anotbher PR

* use StringUtils.base64 instead of Base64

* Jdk is nasty, if i compress in jdk 11 in jdk 17 the decompressed result is different

Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
2022-11-02 19:44:47 -07:00
Jason Koch 0d03ce435f
introduce a "tree" type to the flattenSpec (#12177)
* introduce a "tree" type to the flattenSpec

* feedback - rename exprs to nodes, use CollectionsUtils.isNullOrEmpty for guard

* feedback - expand docs to more clearly capture limitations of "tree" flattenSpec

* feedback - fix for typo on docs

* introduce a comment to explain defensive copy, tweak null handling

* fix: part of rebase

* mark ObjectFlatteners.FlattenerMaker as an ExtensionPoint and provide default for new tree type

* fix: objectflattener restore previous behavior to call getRootField for root type

* docs: ingestion/data-formats add note that ORC only supports path expressions

* chore: linter remove unused import

* fix: use correct newer form for empty DimensionsSpec in FlattenJSONBenchmark
2022-11-01 14:49:30 +08:00
Gian Merlino d851985cf5
MSQ: Add support for indexSpec. (#13275) 2022-10-28 14:27:50 -07:00
Adarsh Sanjeev 4775427e2c
Add task start status to worker report (#13263)
* Add task start status to worker report

* Address review comments

* Address review comments

* Update documentation

* Update spelling checks
2022-10-28 12:00:15 +05:30
Tejaswini Bandlamudi 49e54a0ec6
Docs: Update inputSegmentSizeBytes description (#13266) 2022-10-28 09:33:52 +05:30
Clint Wylie 77e4246598
add support for 'front coded' string dictionaries for smaller string columns (#12277)
* add FrontCodedIndexed for delta string encoding

* now for actual segments

* fix indexOf

* fixes and thread safety

* add bucket size 4, which seems generally better

* fixes

* fixes maybe

* update indexes to latest interfaces

* utf8 support

* adjust

* oops

* oops

* refactor, better, faster

* more test

* fixes

* revert

* adjustments

* fix prefixing

* more chill

* sql nested benchmark too

* refactor

* more comments and javadocs

* better get

* remove base class

* fix

* hot rod

* adjust comments

* faster still

* minor adjustments

* spatial index support

* spotbugs

* add isSorted to Indexed to strengthen indexOf contract if set, improve javadocs, add docs

* fix docs

* push into constructor

* use base buffer instead of copy

* oops
2022-10-25 18:05:38 -07:00
317brian c83115e4e1
api: change API page formatting (#13213)
Tracking additional improvements requested by @paul-rogers: #13239

* api: refactor page so that indented bullet is child and unindented portion is parent

* get rid of post etc headings and combine them with the endpoint

* Update docs/operations/api-reference.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* fix broken links

* fix typo

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2022-10-18 13:22:26 -07:00
Paul Rogers b34b4353f4
Async reads for JDBC (#13196)
Async reads for JDBC:
Prevents JDBC timeouts on long queries by returning empty batches
when a batch fetch takes too long. Uses an async model to run the
result fetch concurrently with JDBC requests.

Fixed race condition in Druid's Avatica server-side handler
Fixed issue with no-user connections
2022-10-18 11:40:57 -07:00
cristian-popa cc10350870
Collocated processes instructions (#13224)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Frank Chen <frankchen@apache.org>
2022-10-17 11:56:00 -07:00
Gian Merlino 3bbb76f17b
Docs: Add query/cpu/time to real-time metrics. (#13229) 2022-10-15 18:26:44 +05:30
arvindanugula 42384d85e7
Update nested-columns.md (#13227)
typo error corrected.
2022-10-14 16:15:46 -07:00
Victoria Lim 02ad62a08c
Docs: update description of query priority default value (#13191)
* update description of default for query priority

* update order

* update terms

* standardize to query context parameters
2022-10-14 14:28:04 -07:00
Karan Kumar 9d51e466b1
Minor doc update for BroadcastTablesTooLarge (#13218)
Minor doc update for `BroadcastTablesTooLarge`. Now the user will know what to do
in case this fault is encountered.
2022-10-14 09:06:55 +05:30
Tejaswini Bandlamudi 3e13584e0e
Adds Idle feature to `SeekableStreamSupervisor` for inactive stream (#13144)
* Idle Seekable stream supervisor changes.

* nit

* nit

* nit

* Adds unit tests

* Supervisor decides it's idle state instead of AutoScaler

* docs update

* nit

* nit

* docs update

* Adds Kafka unit test

* Adds Kafka Integration test.

* Updates travis config.

* Updates kafka-indexing-service dependencies.

* updates previous offsets snapshot & doc

* Doesn't act if supervisor is suspended.

* Fixes highest current offsets fetch bug, adds new Kafka UT tests, doc changes.

* Reverts Kinesis Supervisor idle behaviour changes.

* nit

* nit

* Corrects SeekableStreamSupervisorSpec check on idle behaviour config, adds tests.

* Fixes getHighestCurrentOffsets to fetch offsets of publishing tasks too

* Adds Kafka Supervisor UT

* Improves test coverage in druid-server

* Corrects IT override config

* Doc updates and Syntactic changes

* nit

* supervisorSpec.ioConfig.idleConfig changes
2022-10-12 18:31:08 +05:30
Jonathan Wei 9b8e69c99a
Add inline descriptor Protobuf bytes decoder (#13192)
* Add inline descriptor Protobuf bytes decoder

* PR comments

* Update tests, check for IllegalArgumentException

* Fix license, add equals test

* Update extensions-core/protobuf-extensions/src/main/java/org/apache/druid/data/input/protobuf/InlineDescriptorProtobufBytesDecoder.java

Co-authored-by: Frank Chen <frankchen@apache.org>

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-10-11 13:37:28 -05:00
Charles Smith 25c1d55dd6
Clarify behavior when decommissioningMaxPercentOfMaxSegmentsToMove = 0 (#13157)
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-10-07 09:01:32 -07:00
317brian 0edceead80
msq: update known issue about GROUPING SETS and COUNT DISTINCT (#13185)
* msq: update known issue about GROUPING SETS and COUNT DISTINCT

* address feedback from Gian
2022-10-05 19:47:03 -07:00
AmatyaAvadhanula 41e51b21c3
Make http options the default configurations (#13092)
Druid currently uses Zookeeper dependent options as the default.
This commit updates the following to use HTTP as the default instead.
- task runner. `druid.indexer.runner.type=remote -> httpRemote`
- load queue peon. `druid.coordinator.loadqueuepeon.type=curator -> http`
- server inventory view. `druid.serverview.type=curator -> http`
2022-10-05 05:35:17 +05:30
Adarsh Sanjeev 92d2633ae6
Update ClusterByStatisticsCollectorImpl to use bytes instead of keys (#12998)
* Update clusterByStatistics to use bytes instead of keys

* Address review comments

* Resolve checkstyle

* Increase test coverage

* Update test

* Update thresholds

* Update retained keys function

* Update docs

* Fix spelling
2022-10-03 12:08:23 +05:30
Jill Osborne 548d810baa
Correct nested columns example (#13150) 2022-09-28 10:39:56 +05:30
David Palmer 0d7bf66578
Add a note to the documentation about pre-built HLLSketches (#13088)
* add a note to the documentation about pre-built HLLSketches

Druid actually supports ingesting a pre-generated sketch column by using
the HLLSketchMerge aggregator. However, this functionality was
previously not made clear in the documentation.

* copyedit from the King's English to American English

* add suggested style changes

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-09-27 10:29:39 +08:00
Apoorv Gupta c8f4d72fb1
Fix documentation bug about injective lookups (#13147)
replace mapping to `unique keys` with mapping to `unique values`.
2022-09-27 10:16:48 +08:00
Jonathan Wei 1f1fced6d4
Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON (#13089)
* Add JsonInputFormat option to assume newline delimited JSON, improve handling for non-NDJSON

* Fix serde and docs

* Add PR comment check
2022-09-26 19:51:04 -05:00
Charles Smith eb760c3d1d
update log4j example (#13095)
* update log4j example

* fix some style issues

* Update docs/configuration/logging.md

Co-authored-by: Frank Chen <frankchen@apache.org>

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-09-22 09:46:49 +08:00
317brian 12f12a13a9
fix: fix broken postgres link (#13135) 2022-09-22 09:46:20 +08:00
317brian 7fa35839c0
fix: follow naming convention for msq task engine (#13127)
* fix: follow naming convention for msq task engine

* more fixes

* add back in experimental

* fix anchor
2022-09-21 18:46:06 -07:00
Gian Merlino 2f731f356e
Update pull-deps docs with correct repo list. (#13134)
There is only one default remote repo at this time.
2022-09-21 12:16:57 -07:00
Katya Macedo 90d14f629a
spatial-filters (#13124) 2022-09-20 22:48:36 -07:00
hosswald 5ed5c83aab
Clarified the behaviour of SQL COUNT(DISTINCT dim) on multi-value dimensions (#13128)
* Clarified the behaviour of COUNT(DISTINCT column) on multi-value columns

* Update docs/querying/sql-aggregations.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-09-20 18:03:34 -07:00
Vadim Ogievetsky edc444a4bc
fix quickstart (#13126) 2022-09-20 17:44:21 -07:00
Vadim Ogievetsky b9edfe34a4
be consistent about referring to the web console by its name (#13118) 2022-09-19 15:02:17 -07:00
Vadim Ogievetsky bb0b810b1d
fix html tags in docs (#13117)
* fix html tags in docs

* revert not null
2022-09-18 19:40:33 -07:00
Gian Merlino d9b2968edb
Docs: Clarify the situation with SELECT. (#13109) 2022-09-17 10:47:57 -07:00
Charles Smith b366a6c5a4
Add clarification around docker environment #8926 (#13084)
* Add clarification around docker environment #8926

* fix spelling

* Update docs/tutorials/docker.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/tutorials/docker.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* fix nano quickstart

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-09-17 20:44:24 +08:00
Gian Merlino d4967c38f8
Various documentation updates. (#13107)
* Various documentation updates.

1) Split out "data management" from "ingestion". Break it into thematic pages.

2) Move "SQL-based ingestion" into the Ingestion category. Adjust content so
   all conceptual content is in concepts.md and all syntax content is in reference.md.
   Shorten the known issues page to the most interesting ones.

3) Add SQL-based ingestion to the ingestion method comparison page. Remove the
   index task, since index_parallel is just as good when maxNumConcurrentSubTasks: 1.

4) Rename various mentions of "Druid console" to "web console".

5) Add additional information to ingestion/partitioning.md.

6) Remove a mention of Tranquility.

7) Remove a note about upgrading to Druid 0.10.1.

8) Remove no-longer-relevant task types from ingestion/tasks.md.

9) Move ingestion/native-batch-firehose.md to the hidden section. It was previously deprecated.

10) Move ingestion/native-batch-simple-task.md to the hidden section. It is still linked in some
    places, but it isn't very useful compared to index_parallel, so it shouldn't take up space
    in the sidebar.

11) Make all br tags self-closing.

12) Certain other cosmetic changes.

13) Update to node-sass 7.

* make travis use node12 for docs

Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
2022-09-16 21:58:11 -07:00
Vadim Ogievetsky 2493eb17bf
Doc fixes around msq (#13090)
* remove things that do not apply

* fix more things

* pin node to a working version

* fix

* fixes

* known issues tidy up

* revert auto formatting changes

* remove management-uis page which is 100% lies

* don't mention the Coordinator console (that no longer exits)

* goodies

* fix typo
2022-09-16 02:15:26 -07:00
Katya Macedo 2218c8d23c
Documentation: Update spatial indexing example (#12555)
* fix spatial indexing example

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update text and example

* Format JSON example

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/development/geo.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Accept review suggestions

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Frank Chen <frankchen@apache.org>
2022-09-16 10:32:19 +08:00
Atul Mohan a8fd3a9077
Provide service specific log4j overrides in containerized deployments (#13020)
* Provide service specific log4j overrides

* Clarify comments

* Add docs
2022-09-14 11:47:11 +08:00
Benedict Jin 4bde50e683
Bump the version of Druid docker image from 0.16.0-incubating to latest (#13058) 2022-09-10 14:06:00 +05:30
Vadim Ogievetsky 4fc43670e5
adjust docs and images (#13067) 2022-09-10 14:05:19 +05:30
DENNIS dced61645f
prometheus-emitter supports sending metrics to pushgateway regularly … (#13034)
* prometheus-emitter supports sending metrics to pushgateway regularly and continuously

* spell check fix

* Optimization variable name and related documents

* Update docs/development/extensions-contrib/prometheus.md

OK, it looks more conspicuous

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update doc

* Update docs/development/extensions-contrib/prometheus.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* When PrometheusEmitter is closed, close the scheduler

* Ensure that registeredMetrics is thread safe.

* Local variable name optimization

* Remove unnecessary white space characters

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-09-09 20:46:14 +08:00
sachidananda007 48c99054d0
Update tutorial-kafka.md (#13056)
* Update tutorial-kafka.md

Added missing command to the doc for zookeeper before starting kafka

* Update docs/tutorials/tutorial-kafka.md

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-09-09 10:06:19 +08:00
Frank Chen d57557d51d
Improve doc and configuration of prometheus emitter (#13028)
* Improve doc and validation

* Add configuration for peon tasks

* Update doc

* Update test case

* Fix typo

* Update docs/development/extensions-contrib/prometheus.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Update docs/development/extensions-contrib/prometheus.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2022-09-09 02:20:34 +08:00
Rohan Garg 7aa8d7f987
Add query/time metric for SQL queries from router (#12867)
* Add query/time metric for SQL queries from router

* Fix query cancel bug when user has overriden native query-id in a SQL query
2022-09-07 13:54:46 +05:30
Adam Peck ee22663dd3
Add interpolation to JsonConfigurator (#13023)
* Add interpolation to JsonConfigurator

* Fix checkstyle

* Fix tests by removing common-text override

* Add back commons-text without version

* Remove unused hadoopDir configs

* Move some stuff to hopefully pass coverage
2022-09-07 12:48:01 +05:30
Clint Wylie a3a377e570
more consistent expression error messages (#12995)
* more consistent expression error messages

* review stuff

* add NamedFunction for Function, ApplyFunction, and ExprMacro to share common stuff

* fixes

* add expression transform name to transformer failure, better parse_json error messaging
2022-09-06 23:21:38 -07:00
Jill Osborne 1f69140623
Nested columns documentation (#12946)
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: brian.le <brian.le@imply.io>
2022-09-06 14:42:18 -07:00
Vadim Ogievetsky 897689c03b
remove mentions of DruidQueryRel from docs (#13033)
* remove mentions of DruidQueryRel

* Update docs/querying/sql-translation.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql-translation.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-09-06 13:37:27 -07:00
317brian d4233ef2a1
msq: add multi-stage-query docs (#12983)
* msq: add multi-stage-query docs

* add screenshots

add back theta sketches tutoria

change filename

fix filename

fix link

fix headings

* fixes

* fixes

* fix spelling issues and update spell file

* address feedback from karan

* add missing guardrail to known issues

* update blurb

* fix typo

* remove durable storage info

* update titles

* Restore en.json

* Update query view

* address comments from vad

* Update docs/multi-stage-query/msq-known-issues.md

finish sentence

* add apache license to docs

* add apache license to docs

Co-authored-by: Katya Macedo <katya.macedo@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-09-06 23:06:09 +05:30
senthilkv 3d9aef225d
compressed big decimal - module (#10705)
Compressed Big Decimal is an extension which provides support for 
Mutable big decimal value that can be used to accumulate values 
without losing precision or reallocating memory. This type helps in 
absolute precision arithmetic on large numbers in applications, 
where greater level of accuracy is required, such as financial 
applications, currency based transactions. This helps avoid rounding 
issues where in potentially large amount of money can be lost.

Accumulation requires that the two numbers have the same scale, 
but does not require that they are of the same size. If the value 
being accumulated has a larger underlying array than this value 
(the result), then the higher order bits are dropped, similar to what 
happens when adding a long to an int and storing the result in an 
int. A compressed big decimal that holds its data with an embedded 
array.

Compressed big decimal is an absolute number based complex type 
based on big decimal in Java. This supports all the functionalities 
supported by Java Big Decimal. Java Big Decimal is not mutable in 
order to avoid big garbage collection issues. Compressed big decimal 
is needed to mutate the value in the accumulator.
2022-09-06 00:06:57 -07:00
zemin 6805a7f9c2
Ease of hidding sensitive properties from /status/proper… (#12950)
* apache#12063 Ease of hidding sensitive properties from /status/properties endpoint

* apache#12063 Ease of hidding sensitive properties from /status/properties endpoint

* apache#12063 Ease of hidding sensitive properties from /status/properties endpoint

using one property for hiding properties, updated the index.md to document hiddenProperties

* apache#12063 Ease of hidding sensitive properties from /status/properties endpoint

Added java docs

* apache#12063 Ease of hidding sensitive properties from /status/properties endpoint

Add "password", "key", "token", "pwd" as default druid.server.hiddenProperties

fixed typo and removed redundant space

Co-authored-by: zemin <zemin.piao@adyen.com>
2022-09-02 08:51:25 -05:00
Gian Merlino 85d2a6d879
Improve range partitioning docs. (#13016)
Two improvements:

- Use a realistic targetRowsPerSegment, so if people copy and paste
  the example from the docs, it will generate reasonable segments.
- Spell "countryName" correctly.
2022-09-01 15:21:30 -07:00
Gian Merlino 48ceab2153
Add Java 17 information to documentation. (#12990)
The docs say Java 17 support is experimental, and give tips on running
successfully with Java 17.

This patch also removes java.base/jdk.internal.perf and
jdk.management/com.sun.management.internal from the list of required
exports and opens, because they were formerly needed for JvmMonitor,
which was rewritten in #12481 to use MXBeans instead.
2022-08-30 12:32:49 -07:00
Clint Wylie 16f5ac5bd5
json_value adjustments (#12968)
* json_value adjustments
changes:
* native json_value expression now has optional 3rd argument to specify type, which will cast all values to the specified type
* rework how JSON_VALUE is wired up in SQL. Now we are using a custom convertlet to translate JSON_VALUE(... RETURNING type) into dedicated JSON_VALUE_BIGINT, JSON_VALUE_DOUBLE, JSON_VALUE_VARCHAR, JSON_VALUE_ANY instead of using the calcite StandardConvertletTable that wraps JSON_VALUE_ANY in a CAST, so that we preserve the typing of JSON_VALUE to pass down to the native expression as the 3rd argument

* fix json_value_any to be usable by humans too, coverage

* fix bug

* checkstyle

* checkstyle

* review stuff

* validate that options to json_value are the supported options rather than ignore them

* remove more legacy undocumented functions
2022-08-27 07:15:47 -07:00
Alexander Saydakov 7e2371bbde
KLL sketch (#12498)
* KLL sketch

* added documentation

* direct static refs

* direct static refs

* fixed test

* addressed review points

* added KLL sketch related terms

* return a copy from get

* Copy unions when returning them from "get".

* Remove redundant "final".

Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
2022-08-26 21:19:24 -07:00
Adam Peck 21b73bde20
Update Curator to 5.3.0 (#12939)
* Update Curator to 5.3.0

* Update licenses.yaml

* Fix inspections + add tests.

* Fix checkstyle

* Another intellij inspection fix

* Update curator exclusions

* Cleanup new exhibitor references

* Remove unused dep and checkstyle fix
2022-08-26 18:23:40 -07:00
Jill Osborne 7a1e1f88bb
Remove experimental note from stable features (#12973)
* Removed experimental note for features that are no longer experimental

* Updated native batch doc
2022-08-25 09:26:46 -07:00
Santosh Pingale 31dc9004bd
Auto-reload TLS certs for druid endpoints (#12933)
* #12064 Auto-reload tls certs for druid endpoints

* #12064 Add missing toString param

* #12064 Add tests and new jks
Co-authored-by: zemin-piao <pzm6391@gmail.com>

* #12064 Refine tests

* #12064 Add documentation

* Apply suggestions from code review

Co-authored-by: Frank Chen <frankchen@apache.org>

Co-authored-by: santosh <santosh.pingale@adyen.com>
Co-authored-by: Frank Chen <frankchen@apache.org>
2022-08-25 20:12:43 +08:00
Victoria Lim 02914c17b9
Tutorial on ingesting and querying Theta sketches (#12723)
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-08-24 09:23:22 -07:00
Karan Kumar f7c6316992
Setting useNativeQueryExplain to true (#12936)
* Setting useNativeQueryExplain to true

* Update docs/querying/sql-query-context.md

Co-authored-by: Santosh Pingale <pingalesantosh@gmail.com>

* Fixing tests

* Fixing broken tests

Co-authored-by: Santosh Pingale <pingalesantosh@gmail.com>
2022-08-24 17:39:55 +05:30
Petar Petrov 6fec1d4c95
Add useNativeQueryExplain in sql query context documentation (#12924) (#12934)
Co-authored-by: Petar Petrov <petar.petrov@system73.com>
2022-08-22 16:31:15 +05:30
AmatyaAvadhanula 379df5f103
Kinesis docs and logs improvements (#12886)
Going ahead with the merge. CI is failing because of a code coverage change in the log line.
2022-08-22 14:49:42 +05:30
Rohan Garg 3c129f6728
Add sql planning time metric (#12923) 2022-08-22 11:09:44 +05:30
Clint Wylie f8097ccfaa
basic docs for nested column query functions (#12922)
* basic docs for nested column query functions
2022-08-19 17:12:19 -07:00
Clint Wylie 69fe1f04e5
document virtualColumns in native query documentation, fix some redirects (#12917)
* document virtualColumns in native query documentation, fix some redirects

* after all that, forgot to run spellcheck locally

* review stuff
2022-08-18 20:49:23 -07:00
Ian Roberts 770358dc34
Update tls-support.md (#12916)
Fixing " lists all possible values for the configs belong" in TLS section
2022-08-18 09:46:30 +08:00
Peter Marshall f665a0c077
Docs - Add links to Basic Tuning guide in process pages (#12741)
Added link to the relevant section of the Basic Cluster Tuning page on each process page.

This is in order to improve access to this information, which is not easy to find through search or nav.
2022-08-16 18:42:44 +05:30
Lucas Capistrant ec8bdeb9f6
Document missing property - druid.announcer.skipSegmentAnnouncementOnZk (#12891)
* document missing config related to segment announcement

* improve wording

* improve wording

* update docs
2022-08-16 12:32:56 +05:30
Rohan Garg 5394838030
Enable conversion of join to filter by default (#12868) 2022-08-13 20:37:43 +05:30
Lucas Capistrant 3a3271eddc
Introduce defaultOnDiskStorage config for Group By (#12833)
* Introduce defaultOnDiskStorage config for groupBy

* add debug log to groupby query config

* Apply config change suggestion from review

* Remove accidental new lines

* update default value of new default disk storage config

* update debug log to have more descriptive text

* Make maxOnDiskStorage and defaultOnDiskStorage HumanRedadableBytes

* improve test coverage

* Provide default implementation to new default method on advice of reviewer
2022-08-12 09:40:21 -07:00
David Palmer 2855fb6ff8
Change Kafka Lookup Extractor to not register consumer group (#12842)
* change kafka lookups module to not commit offsets

The current behaviour of the Kafka lookup extractor is to not commit
offsets by assigning a unique ID to the consumer group and setting
auto.offset.reset to earliest. This does the job but also pollutes the
Kafka broker with a bunch of "ghost" consumer groups that will never again be
used.

To fix this, we now set enable.auto.commit to false, which prevents the
ghost consumer groups being created in the first place.

* update docs to include new enable.auto.commit setting behaviour

* update kafka-lookup-extractor documentation

Provide some additional detail on functionality and configuration.
Hopefully this will make it clearer how the extractor works for
developers who aren't so familiar with Kafka.

* add comments better explaining the logic of the code

* add spelling exceptions for kafka lookup docs
2022-08-09 16:14:22 +05:30
絵空事スピリット ebe783dbdc
Correct minor format issue (#12882) 2022-08-09 18:15:41 +08:00
Hamish Ball abd7a9748d
Remove kafka lookup records when a record is tombstoned (#12819)
* remove kafka lookup records from factory when record tombstoned

* update kafka lookup docs to include tombstone behaviour

* change test wait time down to 10ms

Co-authored-by: David Palmer <david.palmer@adscale.co.nz>
2022-08-09 10:42:51 +05:30
David Hergenroeder 533c39f35a
Fix rollup docs bullet formatting (#12876) 2022-08-09 10:10:07 +08:00
Suneet Saldanha 267b32c2e2
Set druid.processing.fifo to true by default (#12571) 2022-08-08 10:18:24 -07:00
Gian Merlino 01d555e47b
Adjust "in" filter null behavior to match "selector". (#12863)
* Adjust "in" filter null behavior to match "selector".

Now, both of them match numeric nulls if constructed with a "null" value.

This is consistent as far as native execution goes, but doesn't match
the behavior of SQL = and IN. So, to address that, this patch also
updates the docs to clarify that the native filters do match nulls.

This patch also updates the SQL docs to describe how Boolean logic is
handled in addition to how NULL values are handled.

Fixes #12856.

* Fix test.
2022-08-08 09:08:36 -07:00
AmatyaAvadhanula d294404924
Kinesis ingestion with empty shards (#12792)
Kinesis ingestion requires all shards to have at least 1 record at the required position in druid.
Even if this is satisified initially, resharding the stream can lead to empty intermediate shards. A significant delay in writing to newly created shards was also problematic.

Kinesis shard sequence numbers are big integers. Introduce two more custom sequence tokens UNREAD_TRIM_HORIZON and UNREAD_LATEST to indicate that a shard has not been read from and that it needs to be read from the start or the end respectively.
These values can be used to avoid the need to read at least one record to obtain a sequence number for ingesting a newly discovered shard.

If a record cannot be obtained immediately, use a marker to obtain the relevant shardIterator and use this shardIterator to obtain a valid sequence number. As long as a valid sequence number is not obtained, continue storing the token as the offset.

These tokens (UNREAD_TRIM_HORIZON and UNREAD_LATEST) are logically ordered to be earlier than any valid sequence number.

However, the ordering requires a few subtle changes to the existing mechanism for record sequence validation:

The sequence availability check ensures that the current offset is before the earliest available sequence in the shard. However, current token being an UNREAD token indicates that any sequence number in the shard is valid (despite the ordering)

Kinesis sequence numbers are inclusive i.e if current sequence == end sequence, there are more records left to read.
However, the equality check is exclusive when dealing with UNREAD tokens.
2022-08-05 22:38:58 +05:30
Katya Macedo c6dd9dd4af
Fix typo in compaction.md (#12774) 2022-08-04 14:47:22 -07:00
Gian Merlino ef6811ef88
Improved Java 17 support and Java runtime docs. (#12839)
* Improved Java 17 support and Java runtime docs.

1) Add a "Java runtime" doc page with information about supported
   Java versions, garbage collection, and strong encapsulation..

2) Update asm and equalsverifier to versions that support Java 17.

3) Add additional "--add-opens" lines to surefire configuration, so
   tests can pass successfully under Java 17.

4) Switch openjdk15 tests to openjdk17.

5) Update FrameFile to specifically mention Java runtime incompatibility
   as the cause of not being able to use Memory.map.

6) Update SegmentLoadDropHandler to log an error for Errors too, not
   just Exceptions. This is important because an IllegalAccessError is
   encountered when the correct "--add-opens" line is not provided,
   which would otherwise be silently ignored.

7) Update example configs to use druid.indexer.runner.javaOptsArray
   instead of druid.indexer.runner.javaOpts. (The latter is deprecated.)

* Adjustments.

* Use run-java in more places.

* Add run-java.

* Update .gitignore.

* Exclude hadoop-client-api.

Brought in when building on Java 17.

* Swap one more usage of java.

* Fix the run-java script.

* Fix flag.

* Include link to Temurin.

* Spelling.

* Update examples/bin/run-java

Co-authored-by: Xavier Léauté <xl+github@xvrl.net>

Co-authored-by: Xavier Léauté <xl+github@xvrl.net>
2022-08-03 23:16:05 -07:00
Gian Merlino 2912a36a20
Use nonzero default value of maxQueuedBytes. (#12840)
* Use nonzero default value of maxQueuedBytes.

The purpose of this parameter is to prevent the Broker from running out
of memory. The prior default is unlimited; this patch changes it to a
relatively conservative 25MB.

This may be too low for larger clusters. The risk is that throughput
can decrease for queries with large resultsets or large amounts of intermediate
data. However, I think this is better than the risk of the prior default, which
is that these queries can cause the Broker to go OOM.

* Alter calculation.
2022-08-02 17:57:27 -07:00
317brian 553ff47616
fix: fix broken link to Class TTest (#12836) 2022-07-31 10:18:14 +08:00
Charles Smith efbb58e90e
docs: remove maxRowsPerSegment where appropriate (#12071)
* remove maxRowsPerSegment where appropriate

* fix tutorial, accept suggestions

* Update docs/design/coordinator.md

* additional tutorial file

* fix initial index spec

* accept comments

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* add back comment on maxrows per segment

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* rm duplicate entry

* Update native-batch-simple-task.md

remove ref to `maxrowspersegment`

* Update native-batch.md

remove ref to `maxrowspersegment`

* final tenticles

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-07-28 16:52:13 +05:30
Atul Mohan 93a9a4b1c5
Add retention for file request logs (#12559)
* Add retention for file request logs

* Spelling
2022-07-27 08:17:02 -07:00
Charles Smith d7d4314367
remove ref to plywood repo (#12809) 2022-07-26 10:12:13 +08:00
Victoria Lim 6394ecfd21
update figure and reference (#12813) 2022-07-22 15:54:25 -07:00
Katya Macedo a2be685824
Remove the time bit, fix headings (#12808)
* Remove the time bit, fix headings

* Adopt review suggestions

* Edits

* Update smoosh file description

* Adopt review suggestions

* Update spelling
2022-07-20 15:37:57 -07:00
Katya Macedo 809bf161ce
Add a note about setting the value of maxNumConcurrentSubTasks (#12772)
* Add clarification for combining input source

* Update inputFormat note

* Update maxNumConcurrentSubTasks note

* Fix broken link

* Update docs/ingestion/native-batch-input-source.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-07-19 15:34:21 -07:00
Atul Mohan 75045970cd
S3 Ingestion from non-default endpoints (#11798)
* Add endpoint support for s3inputsource

* Changes to tests

* Fix docs

* Fix config

* Fix inspections

* Fix spelling

* Remove password from toString
2022-07-15 11:03:34 -07:00
Jianhuan Liu d4403c15aa
Upgrade prometheus version, add more labels to PrometheusEmitter (#12769)
Changes:
- Upgrade prometheus to version 0.16.0
- Add optional labels `druid_service` and `host_name` to `PrometheusEmitter`
2022-07-15 14:43:12 +05:30
Frank Chen a544aff761
Document missed simple granularities (#12768)
* Document missed simple granularities

* Update docs/querying/granularities.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/granularities.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-07-14 14:02:28 +08:00
zachjsh c0380e7b0a
* fix duplicate dimension (#12778) 2022-07-14 10:39:03 +05:30
Victoria Lim d8f8c56f94
Docs: Index page with all SQL functions (#12771)
* list of all functions

* add function names to spelling file
2022-07-14 09:59:55 +08:00
TSFenwick 8c02880d5f
Emit metrics for distribution of number of rows per segment (#12730)
* initial commit of bucket dimensions for metrics

return counts of segments that have rowcount in a bucket size for a datasource
return average value of rowcount per segment in a datasource
added unit test
naming could use a lot of work
buckets right now are not finalized
added javadocs
altered metrics.md

* fix checkstyle issues

* addressed review comments

add monitor test
move added functionality to new monitor
update docs

* address comments

renamed monitor
handle tombstones better
update docs
added javadocs

* Add support for tombstones in the segment distribution

* undo changes to tombstone segmentizer factory

* fix accidental whitespacing changes

* address comments regarding metrics documentation

and rename variable to be more accurate

* fix tests

* fix checkstyle issues

* fix broken test

* undo removal of timeout
2022-07-12 07:04:42 -07:00
Gian Merlino 97207cdcc7
Automatic sizing for GroupBy dictionaries. (#12763)
* Automatic sizing for GroupBy dictionary sizes.

Merging and selector dictionary sizes currently both default to 100MB.
This is not optimal, because it can lead to OOM on small servers and
insufficient resource utilization on larger servers. It also invites
end users to try to tune it when queries run out of dictionary space,
which can make things worse if the end user sets it to too high.

So, this patch:

- Adds automatic tuning for selector and merge dictionaries. Selectors
  use up to 15% of the heap and merge buffers use up to 30% of the heap
  (aggregate across all queries).

- Updates out-of-memory error messages to emphasize enabling disk
  spilling vs. increasing memory parameters. With the memory parameters
  automatically sized, it is more likely that an end user will get
  benefit from enabling disk spilling.

- Removes the query context parameters that allow lowering of configured
  dictionary sizes. These complicate the calculation, and I don't see a
  reasonable use case for them.

* Adjust tests.

* Review adjustments.

* Additional comment.

* Remove unused import.
2022-07-11 08:20:50 -07:00
Jill Osborne 682ea7f32d
IMPLY-12348: Update description of UNION ALL in SQL syntax doc (#12710)
* IMPLY-12348: Updated description of UNION ALL

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update sql.md

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-07-05 13:08:01 -07:00
Rui Chen 068bea6334
deps: upgrade mysql-connector-java to v5.1.49 (#12704) 2022-06-29 23:15:46 +08:00
Didip Kerabat 6ddb828c7a
Able to filter Cloud objects with glob notation. (#12659)
In a heterogeneous environment, sometimes you don't have control over the input folder. Upstream can put any folder they want. In this situation the S3InputSource.java is unusable.

Most people like me solved it by using Airflow to fetch the full list of parquet files and pass it over to Druid. But doing this explodes the JSON spec. We had a situation where 1 of the JSON spec is 16MB and that's simply too much for Overlord.

This patch allows users to pass {"filter": "*.parquet"} and let Druid performs the filtering of the input files.

I am using the glob notation to be consistent with the LocalFirehose syntax.
2022-06-24 11:40:08 +05:30
Gian Merlino d29343cbe3
Disable autokill of segments by default. (#12693)
Also add clarifying commentary to the documentation about how durationToRetain works.
2022-06-23 17:17:11 -07:00
Tejaswini Bandlamudi 99e1b4efee
Update default value of `inputSegmentSizeBytes` in configuration docs (#12678) 2022-06-22 09:05:03 +05:30
Gian Merlino 0099940808
Add TIME_IN_INTERVAL SQL operator. (#12662)
* Add TIME_IN_INTERVAL SQL operator.

The operator is implemented as a convertlet rather than an
OperatorConversion, because this allows it to be equivalent to using
the >= and < operators directly.

* SqlParserPos cannot be null here.

* Remove unused import.

* Doc updates.

* Add words to dictionary.
2022-06-21 13:05:37 -07:00
Jill Osborne f050069767
Segments doc update (#12344)
* Corrected heading levels in segments doc

* IMPLY-18394: Updated Segments doc

* Update docs/design/segments.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/segments.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/segments.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/segments.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/segments.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/segments.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update segments.md

* Updated links to changed headings in Segments doc

* Corrected spelling error

* Update segments.md

Incorporated suggestions from Paul Rogers.

* Update index.md

* Update segments.md

* Update segments.md

* Update segments.md

* Update compaction.md

* Update docs/design/segments.md

fix typo

* Update docs/ingestion/compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/design/segments.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-06-16 13:25:17 -07:00
Victoria Lim 94564b6ce6
Update screenshots for Druid console doc (#12593)
* druid console doc updates

* remove extra image

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* updated screenshot labels

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-06-15 16:42:20 -07:00
Victoria Lim 353475bd36
Docs for automatic compaction (#12569)
* docs for auto-compaction

* fix broken links

* another link

* Apply suggestions from code review

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Apply suggestions from code review

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Suneet Saldanha <suneet@apache.org>

* reorg content for skipOffset

* Update docs/ingestion/automatic-compaction.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Apply suggestions from code review

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2022-06-09 14:55:12 -07:00
Gian Merlino a503683a4a
Add caching and CSP response headers. (#12609)
* Add caching and CSP response headers.

* Fix tests.

* Fix checkstyle issues

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
2022-06-04 21:46:49 +05:30
Victoria Lim 1506b26ce4
fix typo (#12607) 2022-06-04 13:14:18 +08:00
Gian Merlino a27f4f5740
Service stdout log files, move logs to log/. (#12570)
* Service stdout log files, move logs to log/.

Two changes that make log behavior cleaner:

1) Redirect messages from the Java runtime to their own log files.
   Otherwise, they would get jumbled up in the output of the all-in-one
   start command.

2) Use log/ instead of bin/log/ for the default log directory. Makes them
   easier to find.

Additionally, add documentation about how to avoid the reflective
access warnings in Java 11.

* Spelling.

* See if code formatting affects spelling.
2022-06-03 10:44:29 +05:30
Jill Osborne 9c8e6bb000
Addition to Multitenancy considerations doc (#12567)
* Small addition to Multitenancy considerations doc

* Update docs/querying/multitenancy.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update multitenancy.md

Edit suggested by @kfaraz

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-06-02 10:32:14 -07:00
Dr. Sizzles 7291c92f4f
Adding zstandard compression library (#12408)
* Adding zstandard compression library

* 1. Took @clintropolis's advice to have ZStandard decompressor use the byte array when the buffers are not direct.
2. Cleaned up checkstyle issues.

* Fixing zstandard version to latest stable version in pom's and updating license files

* Removing zstd from benchmarks and adding to processing (poms)

* fix the intellij inspection issue

* Removing the prefix v for the version in the license check for ztsd

* Fixing license checks

Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
2022-05-28 17:01:44 -07:00
Agustin Gonzalez 2f3d7a4c07
Emit state of replace and append for native batch tasks (#12488)
* Emit state of replace and append for native batch tasks

* Emit count of one depending on batch ingestion mode (APPEND, OVERWRITE, REPLACE)

* Add metric to compaction job

* Avoid null ptr exc when null emitter

* Coverage

* Emit tombstone & segment counts

* Tasks need a type

* Spelling

* Integrate BatchIngestionMode in batch ingestion tasks functionality

* Typos

* Remove batch ingestion type from metric since it is already in a dimension. Move IngestionMode to AbstractTask to facilitate having mode as a dimension. Add metrics to streaming. Add missing coverage.

* Avoid inner class referenced by sub-class inspection. Refactor computation of IngestionMode to make it more robust to null IOConfig and fix test.

* Spelling

* Avoid polluting the Task interface

* Rename computeCompaction methods to avoid ambiguous java compiler error if they are passed null. Other minor cleanup.
2022-05-23 12:32:47 -07:00
Gian Merlino 37853f8de4
ConcurrentGrouper: Add mergeThreadLocal option, fix bug around the switch to spilling. (#12513)
* ConcurrentGrouper: Add option to always slice up merge buffers thread-locally.

Normally, the ConcurrentGrouper shares merge buffers across processing
threads until spilling starts, and then switches to a thread-local model.
This minimizes memory use and reduces likelihood of spilling, which is
good, but it creates thread contention. The new mergeThreadLocal option
causes a query to start in thread-local mode immediately, and allows us
to experiment with the relative performance of the two modes.

* Fix grammar in docs.

* Fix race in ConcurrentGrouper.

* Fix issue with timeouts.

* Remove unused import.

* Add "tradeoff" to dictionary.
2022-05-21 10:28:54 -07:00
Katya Macedo 5073cee73f
Fix zookeeper spelling (#12556) 2022-05-21 16:14:02 +08:00
Gian Merlino 65a1375b67
SQL: Add is_active to sys.segments, update examples and docs. (#11550)
* SQL: Add is_active to sys.segments, update examples and docs.

is_active is short for:

  (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1

It's important because this represents "all the segments that should
be queryable, whether or not they actually are right now". Most of the
time, this is the set of segments that people will want to look at.

The web console already adds this filter to a lot of its queries,
proving its usefulness.

This patch also reworks the caveat at the bottom of the sys.segments
section, so its information is mixed into the description of each result
field. This should make it more likely for people to see the information.

* Wording updates.

* Adjustments for spellcheck.

* Adjust IT.
2022-05-19 14:23:28 -07:00
Charles Smith 3e8d7a6d9f
Sql docs items (#12530)
* touch up sql refactor

* brush up SQL refactor

* incorporate feedback

* reorder sql

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-05-17 16:56:31 -07:00
Katya Macedo 177638f171
Fix typo, add comma (#12529) 2022-05-17 16:42:47 -07:00
Gian Merlino fdfecfd996
Improved docs for range partitioning. (#12350)
* Improved docs for range partitioning.

1) Clarify the benefits of range partitioning.
2) Clarify which filters support pruning.
3) Include the fact that multi-value dimensions cannot be used for partitioning.

* Additional clarification.

* Update other section.

* Another adjustment.

* Updates from review.
2022-05-16 09:42:31 -07:00
Hellmar Becker 985640f103
Clarify the use of the Lookup API (#12088)
* Update lookups.md

* Update docs/querying/lookups.md

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>

* Update docs/querying/lookups.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
2022-05-16 07:50:24 -07:00
317brian 351e57bdb6
docs(fix): clarify how worker.version and minWorkerVersion comparison works (#12459)
* docs(fix): clarify how worker.version and minWorkerVersion comparison works

* Revert "docs(fix): clarify how worker.version and minWorkerVersion comparison works"

This reverts commit cadd1fdc60.

* docs(fix): clarify how worker.version and minWorkerVersion comparison works

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/configuration/index.md

fix spelling

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-05-16 07:48:33 -07:00
Gian Merlino 5b6727f319
Enable vectorized virtual column processing by default. (#12520)
In the majority of cases, this improves performance.

There's only one case I'm aware of where this may be a net negative: for time_floor(__time, <period>) where there are many repeated __time values. In nonvectorized processing, SingleLongInputCachingExpressionColumnValueSelector implements an optimization to avoid computing the time_floor function on every row. There is no such optimization in vectorized processing.

IMO, we shouldn't mention this in the docs. Rationale: It's too fiddly of a thing: it's not guaranteed that nonvectorized processing will be faster due to the optimization, because it would have to overcome the inherent speed advantage of vectorization. So it'd always require testing to determine the best setting for a specific dataset. It would be bad if users disabled vectorization thinking it would speed up their queries, and it actually slowed them down. And even if users do their own testing, at some point in the future we'll implement the optimization for vectorized processing too, and it's likely that users that explicitly disabled vectorization will continue to have it disabled. I'd like to avoid this outcome by encouraging all users to enable vectorization at all times. Really advanced users would be following development activity anyway, and can read this issue
2022-05-16 15:43:53 +05:30
Frank Chen c33ff1c745
Enforce console logging for peon process (#12067)
Currently all Druid processes share the same log4j2 configuration file located in _common directory. Since peon processes are spawned by middle manager process, they derivate the environment variables from the middle manager. These variables include those in the log4j2.xml controlling to which file the logger writes the log.

But current task logging mechanism requires the peon processes to output the log to console so that the middle manager can redirect the console output to a file and upload this file to task log storage.

So, this PR imposes this requirement to peon processes, whatever the configuration is in the shared log4j2.xml, peon processes always write the log to console.
2022-05-16 15:07:21 +05:30
Gian Merlino ff253fd8a3
Add setProcessingThreadNames context parameter. (#12514)
setting thread names takes a measurable amount of time in the case where segment scans are very quick. In high-QPS testing we found a slight performance boost from turning off processing thread renaming. This option makes that possible.
2022-05-16 13:42:00 +05:30
Lucas Capistrant deb69d1bc0
Allow coordinator to be configured to kill segments in future (#10877)
Allow a Druid cluster to kill segments whose interval_end is a date in the future. This can be done by setting druid.coordinator.kill.durationToRetain to a negative period. For example PT-24H would allow segments to be killed if their interval_end date was 24 hours or less into the future at the time that the kill task is generated by the system.

A cluster operator can also disregard the druid.coordinator.kill.durationToRetain entirely by setting a new configuration, druid.coordinator.kill.ignoreDurationToRetain=true. This ignores interval_end date when looking for segments to kill, and instead is capable of killing any segment marked unused. This new configuration is off by default, and a cluster operator should fully understand and accept the risks if they enable it.
2022-05-11 07:35:15 +05:30
Kashif Faraz 60b4fa0f75
Docs: Fix column name in ingestion rollup doc (#12036)
Fix the referred column name from "count" to "num_rows" as "count" vs. "COUNT(*)" might be a little confusing in this example.
2022-05-10 17:35:59 +05:30
Rohan Garg 75836a5a06
Add feature flag for sql planning of TimeBoundary queries (#12491)
* Add feature flag for sql planning of TimeBoundary queries

* fixup! Add feature flag for sql planning of TimeBoundary queries

* Add documentation for enableTimeBoundaryPlanning

* fixup! Add documentation for enableTimeBoundaryPlanning
2022-05-10 15:23:42 +05:30
Rohan Garg 2dd073c2cd
Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation (#12484)
* Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* fixup! Pass metrics object for Scan, Timeseries and GroupBy queries during cursor creation

* Document vectorized dimension
2022-05-09 10:40:17 -07:00
Victoria Lim 0206a2da5c
Update automatic compaction docs with consistent terminology (#12416)
* specify automatic compaction where applicable

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* update for style and consistency

* implement suggested feedback

* remove duplicate example

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/compaction.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/api-reference.md

* update .spelling

* Adopt review suggestions

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2022-05-03 16:22:25 -07:00
Rocky Chen 770ad95169
Add a metric for task duration in the pending queue (#12492)
This PR is to measure how long a task stays in the pending queue and emits the value with the metric task/pending/time. The metric is measured in RemoteTaskRunner and HttpRemoteTaskRunner.

An example of the metric:

```
2022-04-26T21:59:09,488 INFO [rtr-pending-tasks-runner-0] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"feed":"metrics","timestamp":"2022-04-26T21:59:09.487Z","service":"druid/coordinator","host":"localhost:8081","version":"2022.02.0-iap-SNAPSHOT","metric":"task/pending/time","value":8,"dataSource":"wikipedia","taskId":"index_parallel_wikipedia_gecpcglg_2022-04-26T21:59:09.432Z","taskType":"index_parallel"}
```

------------------------------------------
Key changed/added classes in this PR

    Emit metric task/pending/time in classes RemoteTaskRunner and HttpRemoteTaskRunner.
    Update related factory classes and tests.
2022-05-02 23:47:25 -04:00
317brian b97f273d5a
docs: fix typo (#12494) 2022-05-01 22:44:31 +08:00
Charles Smith 42fa5c26e1
remove arbitrary granularity spec from docs (#12460)
* remove arbitrary granularity spec from docs

* Update docs/ingestion/ingestion-spec.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-04-28 16:36:54 -07:00
Gian Merlino a2bad0b3a2
Reduce allocations due to Jackson serialization. (#12468)
* Reduce allocations due to Jackson serialization.

This patch attacks two sources of allocations during Jackson
serialization:

1) ObjectMapper.writeValue and JsonGenerator.writeObject create a new
   DefaultSerializerProvider instance for each call. It has lots of
   fields and creates pressure on the garbage collector. So, this patch
   adds helper functions in JacksonUtils that enable reuse of
   SerializerProvider objects and updates various call sites to make
   use of this.

2) GroupByQueryToolChest copies the ObjectMapper for every query to
   install a special module that supports backwards compatibility with
   map-based rows. This isn't needed if resultAsArray is set and
   all servers are running Druid 0.16.0 or later. This release was a
   while ago. So, this patch disables backwards compatibility by default,
   which eliminates the need to copy the heavyweight ObjectMapper. The
   patch also introduces a configuration option that allows admins to
   explicitly enable backwards compatibility.

* Add test.

* Update additional call sites and add to forbidden APIs.
2022-04-27 14:17:26 -07:00
zachjsh 564d6defd4
Worker level task metrics (#12446)
* * fix metric name inconsistency

* * add task slot metrics for middle managers

* * add new WorkerTaskCountStatsMonitor to report task count metrics
  from worker

* * more stuff

* * remove unused variable

* * more stuff

* * add javadocs

* * fix checkstyle

* * fix hadoop test failure

* * cleanup

* * add more code coverage in tests

* * fix test failure

* * add docs

* * increase code coverage

* * fix spelling

* * fix failing tests

* * remove dead code

* * fix spelling
2022-04-26 11:44:44 -05:00
Peter Marshall b47316b844
Update native-batch.md (#12478)
Fixed indent on the Granularity Spec section and removed some superfluous tabbings.
2022-04-25 21:44:17 +08:00
Apoorv Gupta 4781af9921
Fix formatting in stats.md (#12470)
* Fix formatting in stats.md

* Update stats.md

* Update docs/development/extensions-core/stats.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/development/extensions-core/stats.md

Co-authored-by: Frank Chen <frankchen@apache.org>

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-04-23 11:35:08 +08:00
Victoria Lim 63a993c33a
stringFirst and stringLast supported in ingestion (#12466) 2022-04-22 10:28:49 +08:00
Victoria Lim f95447070e
updated docs for sql query context (#12406) 2022-04-21 11:19:39 -07:00
jacobtolar 0edc22179c
Document expression post-aggregators (#11896)
* Document expression post-aggregators

* Update docs/querying/post-aggregations.md

Co-authored-by: Frank Chen <frankchen@apache.org>

Co-authored-by: Frank Chen <frankchen@apache.org>
2022-04-19 10:36:19 +08:00
Victoria Lim c86c48203e
recommendation for comparing strings and numbers (#12442) 2022-04-18 09:28:32 -07:00
Peter Marshall 5167d328b1
Docs - query caching (#11584)
* Update caching.md

Knowledge from https://the-asf.slack.com/archives/CJ8D1JTB8/p1597781107153900

Update caching.md

A few additional updates OTBO https://the-asf.slack.com/archives/CJ8D1JTB8/p1608669046041300

* Update caching.md

Typos

* Amendments on the segment cache

Significant updates on content around the segment cache, pull process, and in-memory cache

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/basic-cluster-tuning.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/basic-cluster-tuning.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update basic-cluster-tuning.md

typo

* Update docs/querying/caching.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Whole-query caching update

Made more succinct and removed specific config to change.

* Update docs/design/historical.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-04-18 17:00:21 +08:00
Charles Smith 408b46ae9f
Fixes a small typo in ingestion spec doc (#12143)
* small typo

* Update docs/ingestion/ingestion-spec.md

Co-authored-by: sthetland <steve.hetland@imply.io>

Co-authored-by: sthetland <steve.hetland@imply.io>
2022-04-18 16:53:50 +08:00
Peter Marshall 1201c9b2e5
Docs - added another common config property to tuningConfig (#11935)
* Update ingestion-spec.md

Added indexSpecForIntermediatePersists as a common configuration property.

* Update ingestion-spec.md

Amended to remove "below" and add link to the table.

* Update ingestion-spec.md

Removed passive.
2022-04-18 13:41:39 +08:00
Alexandre BERTHIOT 9f2b37f250
Update tutorial-compaction.md to change an unclear statement (#11988)
* Update tutorial-compaction.md

Unclear statement on the explanation of tuningConfig section.

* Update docs/tutorials/tutorial-compaction.md

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2022-04-18 13:25:09 +08:00
Maytas Monsereenusorn 5d37d9f9d8
Add docs to metric spec for auto compaction (#12415)
* add docs

* Update docs/configuration/index.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update index.md

* Update docs/configuration/index.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-04-13 13:27:00 -07:00
Katya Macedo f24e9c6862
Add Kinesis ListShards permission (#12387)
* add Kinesis permission

* List Kinesis IAM permissions

* Adopt review suggestions

* Fix merge conflicts
2022-04-13 15:29:56 +05:30
Parag Jain 2c79d28bb7
Copy of #11309 with fixes (#12402)
* Optionally load segment index files into page cache on bootstrap and new segment download

* Fix unit test failure

* Fix test case

* fix spelling

* fix spelling

* fix test and test coverage issues

Co-authored-by: Jian Wang <wjhypo@gmail.com>
2022-04-11 21:05:24 +05:30
mark-imply bf96ddf5ba
Update index.md (#12390)
Added guidance on when to increase druid.indexer.storage.recentlyFinishedThreshold.
2022-04-08 18:01:54 +05:30
mark-imply d98cbd90f0
Update basic-cluster-tuning.md (#12412)
Changed "Other useful JVM flags" to "Other generally useful JVM flags" in order to align with the introduction to the doc.
2022-04-08 15:29:55 +05:30
317brian d82a8185d1
fix(docs): clarify what s3 permissions are needed based on the access management type (#12405)
* fix(docs): clarify what s3 permissions are needed based on the permissions model

* fix typo

* Update docs/development/extensions-core/s3.md

Co-authored-by: Jihoon Son <jihoonson@apache.org>

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2022-04-07 16:22:56 -07:00
Victoria Lim e6229b76a6
Document data format and example for featureSpec (#12394)
* add data format and example for featureSpec

* add second feature in example

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-04-06 15:17:15 -07:00
317brian ac6c24793e
docs(fix): add clarity around granularitySpec (#12362)
* fix: add clarify around granularitySpec

* fix spacing

* Update docs/ingestion/compaction.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-04-06 09:24:37 -07:00
Victoria Lim d326c681c1
Document config for ingesting null columns (#12389)
* config for ingesting null columns

* add link

* edit .spelling

* what happens if storeEmptyColumns is disabled
2022-04-05 09:15:42 -07:00
AmatyaAvadhanula 067254b778
Package kinesis client jar within the extension (#12370)
amazon-kinesis-client was not covered undered the apache license and required separate insertion in the kinesis extension.
This can now be avoided since it is covered, and including it within druid helps prevent incompatibilities.

Allows enabling of deaggregation out of the box by packaging amazon-kinesis-client (1.14.4) with druid for kinesis ingestion.
2022-04-04 21:31:18 +05:30
Tejaswini Bandlamudi 984904779b
Increase default DatasourceCompactionConfig.inputSegmentSizeBytes to Long.MAX_VALUE (#12381)
The current default value of inputSegmentSizeBytes is 400MB, which is pretty
low for most compaction use cases. Thus most users are forced to override the
default.

The default value is now increased to Long.MAX_VALUE.
2022-04-04 16:28:53 +05:30
AmatyaAvadhanula c5531be553
Add feature flag for Kinesis listShards API usage (#12383)
listShards API was used to get all the shards for kinesis ingestion to improve its resiliency as part of #12161.

However, this may require additional permissions in the IAM policy where the stream is present. (Please refer to: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_ListShards.html).

A dynamic configuration useListShards has been added to KinesisSupervisorTuningConfig to control the usage of this API and prevent issues upon upgrade. It can be safely turned on (and is recommended when using kinesis ingestion) by setting this configuration to true.
2022-04-04 14:58:10 +05:30
somu-imply a1ea658115
Introducing a new config to ignore nulls while computing String Cardinality (#12345)
* Counting nulls in String cardinality with a config

* Adding tests for the new config

* Wrapping the vectorize part to allow backward compatibility

* Adding different tests, cleaning the code and putting the check at the proper position, handling hasRow() and hasValue() changes

* Updating testcase and code

* Adding null handling test to improve coverage

* Checkstyle fix

* Adding 1 more change in docs

* Making docs clearer
2022-03-29 14:31:36 -07:00
Peter Marshall f1841c6444
Docs - S3 masking and nav update to S3 page (#11490)
* Docs: Masking S3 creds and some rewording

Knowledge transfer from https://groups.google.com/g/druid-user/c/FydcpFrA688

* Removed bold in one of the quote sections

* Update s3.md

* Update s3.md

Quick grammar change

* Update docs/development/extensions-core/s3.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/s3.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/s3.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/s3.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/development/extensions-core/s3.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update s3.md

Typo

* Update docs/development/extensions-core/s3.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update s3.md

Active lang

* Update s3.md

LAng nit

* Update native-batch.md

LAng nit

* Update docs/ingestion/native-batch.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Grammar tidy-up and link fix

Corrected 2 x links to old page H2s, resolved the question around precedence, and some other grammatical changes.

* Update docs/development/extensions-core/s3.md

* Update s3.md

Removed an Erroneous E

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-03-29 09:13:05 -07:00
Peter Marshall b9a968e7ff
Docs – expressions link back and timestamp hint (#11674)
* Update math-expr.md

Link back to transformSpec

* Update ingestion-spec.md

Moved info about using the timestamp inside transforms into the actual timestamp section.

* Update ingestion-spec.md

Active language.
2022-03-29 09:12:30 -07:00
mark-imply 3c55565398
Update ingestion-spec.md (#12371)
* Update ingestion-spec.md

Added best practice point to dimensions description.

* Update docs/ingestion/ingestion-spec.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-03-29 09:12:02 -07:00
Victoria Lim 9ed7aa33ec
Docs for request logging (#12363)
* add docs for request logging

* remove stray character

* Update docs/operations/request-logging.md

Co-authored-by: TSFenwick <tsfenwick@gmail.com>

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: TSFenwick <tsfenwick@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2022-03-28 14:09:41 -07:00
Adarsh Sanjeev ef45a1551e
Convert inQueryThreshold into query context parameter. (#12357)
Added Calcites InQueryThreshold as a query context parameter. Setting this parameter appropriately reduces the time taken for queries with large number of values in their IN conditions.
2022-03-22 18:33:57 +05:30
Frank Chen d745d0b338
Add JDK 11 (#12333) 2022-03-16 15:03:04 -07:00
Dr. Sizzles 69f928f50e
Adding k8s support for human readable parsing (#12316)
* Adding k8s support for human readable parsing

* Update docs/configuration/human-readable-byte.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/configuration/human-readable-byte.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update core/src/main/java/org/apache/druid/java/util/common/HumanReadableBytes.java

Co-authored-by: Frank Chen <frankchen@apache.org>

* Changes per review

Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
Co-authored-by: Frank Chen <frankchen@apache.org>
2022-03-16 11:18:47 +08:00
AmatyaAvadhanula 7bf1d8c5c0
Facilitate lazy initialization of connections to mitigate overwhelming of Coordinator (#12298)
Add config for eager / lazy connection initialization in ResourcePool

Description
Currently, when multiple tasks are launched, each of them eagerly initializes a full pool's worth of connections to the coordinator.

While this is acceptable when the parameter for number of eagerConnections (== maxSize) is small, this can be problematic in environments where it's a large value (say 1000) and multiple tasks are launched simultaneously, which can cause a large number of connections to be created to the coordinator, thereby overwhelming it.

Patch
Nodes like the broker may require eager initialization of resources and do not create connections with the Coordinator.
It is unnecessary to do this with other types of nodes.

A config parameter eagerInitialization is added, which when set to true, initializes the max permissible connections when ResourcePool is initialized.

If set to false, lazy initialization of connection resources takes place.

NOTE: All nodes except the broker have this new parameter set to false in the quickstart as part of this PR

Algorithm
The current implementation relies on the creation of maxSize resources eagerly.

The new implementation's behaviour is as follows:

If a resource has been previously created and is available, lend it.
Else if the number of created resources is less than the allowed parameter, create and lend it.
Else, wait for one of the lent resources to be returned.
2022-03-09 23:17:43 +05:30
Agustin Gonzalez abe76ccb90
Batch ingestion replace (#12137)
* Tombstone support for replace functionality

* A used segment interval is the interval of a current used segment that overlaps any of the input intervals for the spec

* Update compaction test to match replace behavior

* Adapt ITAutoCompactionTest to work with tombstones rather than dropping segments. Add support for tombstones in the broker.

* Style plus simple queriableindex test

* Add segment cache loader tombstone test

* Add more tests

* Add a method to the LogicalSegment to test whether it has any data

* Test filter with some empty logical segments

* Refactor more compaction/dropexisting tests

* Code coverage

* Support for all empty segments

* Skip tombstones when looking-up broker's timeline. Discard changes made to tool chest to avoid empty segments since they will no longer have empty segments after lookup because we are skipping over them.

* Fix null ptr when segment does not have a queriable index

* Add support for empty replace interval (all input data has been filtered out)

* Fixed coverage & style

* Find tombstone versions from lock versions

* Test failures & style

* Interner was making this fail since the two segments were consider equal due to their id's being equal

* Cleanup tombstone version code

* Force timeChunkLock whenever replace (i.e. dropExisting=true) is being used

* Reject replace spec when input intervals are empty

* Documentation

* Style and unit test

* Restore test code deleted by mistake

* Allocate forces TIME_CHUNK locking and uses lock versions. TombstoneShardSpec added.

* Unused imports. Dead code. Test coverage.

* Coverage.

* Prevent killer from throwing an exception for tombstones. This is the killer used in the peon for killing segments.

* Fix OmniKiller + more test coverage.

* Tombstones are now marked using a shard spec

* Drop a segment factory.json in the segment cache for tombstones

* Style

* Style + coverage

* style

* Add TombstoneLoadSpec.class to mapper in test

* Update core/src/main/java/org/apache/druid/segment/loading/TombstoneLoadSpec.java

Typo

Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>

* Update docs/configuration/index.md

Missing

Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>

* Typo

* Integrated replace with an existing test since the replace part was redundant and more importantly, the test file was very close or exceeding the 10 min default "no output" CI Travis threshold.

* Range does not work with multi-dim

Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>
2022-03-08 20:07:02 -07:00
Gian Merlino 875e0696e0
GroupBy: Cap dictionary-building selector memory usage. (#12309)
* GroupBy: Cap dictionary-building selector memory usage.

New context parameter "maxSelectorDictionarySize" controls when the
per-segment processing code should return early and trigger a trip
to the merge buffer.

Includes:

- Vectorized and nonvectorized implementations.
- Adjustments to GroupByQueryRunnerTest to exercise this code in
  the v2SmallDictionary suite. (Both the selector dictionary and
  the merging dictionary will be small in that suite.)
- Tests for the new config parameter.

* Fix issues from tests.

* Add "pre-existing" to dictionary.

* Simplify GroupByColumnSelectorStrategy interface by removing one of the writeToKeyBuffer methods.

* Adjustments from review comments.
2022-03-08 13:13:11 -08:00
Victoria Lim 903174de20
correct errors on compaction doc (#12308) 2022-03-04 15:33:35 -08:00
Gian Merlino 3b373114dc
Officially support Java 11. (#12232)
There aren't any changes in this patch that improve Java 11
compatibility; these changes have already been done separately. This
patch merely updates documentation and explicit Java version checks.

The log message adjustments in DruidProcessingConfig are there to make
things a little nicer when running in Java 11, where we can't measure
direct memory _directly_, and so we may auto-size processing buffers
incorrectly.
2022-03-04 14:15:45 -08:00
Sandeep 61e1ffc7f7
add a new query laning metrics to visualize lane assignment (#12111)
* add a new query laning metrics to visualize lane assignment

* fixes :spotbugs check

* Update docs/operations/metrics.md

Co-authored-by: Benedict Jin <asdf2014@apache.org>

* Update server/src/main/java/org/apache/druid/server/QueryScheduler.java

Co-authored-by: Benedict Jin <asdf2014@apache.org>

* Update server/src/main/java/org/apache/druid/server/QueryScheduler.java

Co-authored-by: Benedict Jin <asdf2014@apache.org>

Co-authored-by: Benedict Jin <asdf2014@apache.org>
2022-03-04 15:21:17 +08:00
Jihoon Son e5ad862665
A new includeAllDimension flag for dimensionsSpec (#12276)
* includeAllDimensions in dimensionsSpec

* doc

* address comments

* unused import and doc spelling
2022-02-25 18:27:48 -08:00
Karan Kumar b94390ba33
Adding Shared Access resource support for azure (#12266)
Azure Blob storage has multiple modes of authentication. One of them is Shared access resource
. This is very useful in cases when we do not want to add the account key in the druid properties .
2022-02-22 18:27:43 +05:30
Maytas Monsereenusorn 6e2eded277
Allow coordinator run auto compaction duty period to be configured separately from other indexing duties (#12263)
* add impl

* add impl

* add unit tests

* add impl

* add impl

* add serde test

* add tests

* add docs

* fix test

* fix test

* fix docs

* fix docs

* fix spelling
2022-02-18 23:02:57 -08:00
Karan Kumar 5794331eb1
Adding new config for disabling group by on multiValue column (#12253)
As part of #12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key.
Added a config groupByEnableMultiValueUnnesting which can be set in the query context.

The default value of groupByEnableMultiValueUnnesting is true, therefore it does not change the current engine behavior.
If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.
2022-02-16 20:53:26 +05:30
somu-imply eae163a797
Moving in filter check to broker (#12195)
* Moving in filter check to broker

* Adding more unit tests, making error message meaningful

* Spelling and doc changes

* Updating default to -1 and making this feature hide by default. The number of IN filters can grow upto a max limit of 100

* Removing upper limit of 100, updated docs

* Making documentation more meaningful

* Moving check outside to PlannerConfig, updating test cases and adding back max limit

* Updated with some additional code comments

* Missed removing one line during the checkin

* Addressing doc changes and one forbidden API correction

* Final doc change

* Adding a speling exception, correcting a testcase

* Reading entire filter tree to address combinations of ANDs and ORs

* Specifying in docs that, this case works only for ORs

* Revert "Reading entire filter tree to address combinations of ANDs and ORs"

This reverts commit 81ca8f8496.

* Covering a class cast exception and updating docs

* Counting changed

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2022-02-15 20:45:07 -08:00
AmatyaAvadhanula 393e9b68a8
Add config to limit task slots for parallel indexing tasks (#12221)
In extreme cases where many parallel indexing jobs are submitted together, it is possible
that the `ParallelIndexSupervisorTasks` take up all slots leaving no slot to schedule
their own sub-tasks thus stalling progress of all the indexing jobs.

Key changes:
- Add config `druid.indexer.runner.parallelIndexTaskSlotRatio` to limit the task slots
  for `ParallelIndexSupervisorTasks` per worker
- `ratio = 1` implies supervisor tasks can use all slots on a worker if needed (default behavior)
- `ratio = 0` implies supervisor tasks can not use any slot on a worker
   (actually, at least 1 slot is always available to ensure progress of parallel indexing jobs)
- `ImmutableWorkerInfo.canRunTask()`
- `WorkerHolder`, `ZkWorker`, `WorkerSelectUtils`
2022-02-15 23:15:09 +05:30
Victoria Lim c61b19d443
Refactor SQL docs (#12239)
* refactor and link fixes

* add sql docs to left nav

* code format for needle

* updated web console script

* link fixes

* update earliest/latest functions

* edits for grammar and style

* more link fixes

* another link

* update with #12226

* update .spelling file
2022-02-11 14:43:30 -08:00
Clint Wylie ae71e05fc5
array_concat_agg and array_agg support for array inputs (#12226)
* array_concat_agg and array_agg support for array inputs
changes:
* added array_concat_agg to aggregate arrays into a single array
* added array_agg support for array inputs to make nested array
* added 'shouldAggregateNullInputs' and 'shouldCombineAggregateNullInputs' to fix a correctness issue with STRING_AGG and ARRAY_AGG when merging results, with dual purpose of being an optimization for aggregating

* fix test

* tie capabilities type to legacy mode flag about coercing arrays to strings

* oops

* better javadoc
2022-02-07 19:59:30 -08:00
Suneet Saldanha ced1389d4c
Enable auto kill segments by default (#12187)
* Enable auto-kill by default

* tests

* wip

* test

* fix IT

* fix it

* remove from docs

* make coverage bot happy
2022-02-07 06:57:54 -08:00
Suneet Saldanha 159f97dcb0
Update docs for druid.processing.numThreads in brokers (#12231)
* Update docs for druid.processing.numThreads

* error msg

* one more reference
2022-02-04 17:34:21 -08:00
Victoria Lim 24716bfedc
Doc updates for metadata cleanup and storage (#12190)
* doc updates for metadata storage/cleanup

* Add comments for disabling cleanup

* Apply suggestions from code review

* updated for https://github.com/apache/druid/pull/12201

* Apply suggestions from code review

Co-authored-by: Maytas Monsereenusorn <maytasm@apache.org>

* move retention period line earlier; more concise text

* fix typo

Co-authored-by: Maytas Monsereenusorn <maytasm@apache.org>
2022-01-27 11:40:54 -08:00
Maytas Monsereenusorn fac6a48a8f
add impl (#12201) 2022-01-27 11:39:59 -08:00
Suneet Saldanha 2b32d86f3b
Enable automatic metdata cleanup by default (#12188) 2022-01-24 20:04:17 -08:00
Victoria Lim d2ac146365
Docs for cluster tiering to improve query concurrency (#12128)
* add new doc

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* reorder query laning properties

* rename doc

* new name in doc header

* organize material into "service tiering" section

* text edits and update sidebars.json

* update query laning

* how queries get assigned to lanes

* add more details to intro; use more consistent terminology

* more content

* Apply suggestions from code review

Co-authored-by: Jihoon Son <jihoonson@apache.org>

* Update docs/operations/mixed-workloads.md

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* typo

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Jihoon Son <jihoonson@apache.org>
2022-01-15 12:22:08 +08:00
Jonathan Wei 74c876e578
Throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder (#12080)
* Add option to throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder

* Remove option
2022-01-13 12:36:51 -06:00
Clint Wylie f2ce76966c
add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures less ambiguous (#12145)
* add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous

* switcheroo

* EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs

* revert unintended change

* fix docs

* fix docs better
2022-01-12 03:48:53 -08:00
Vadim Ogievetsky 2299eb321e
Standardizing SQL function docs (#12091)
* fix typos in SQL function docs

* more code

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* a few more expr, fixes

* more fixes

* quote TIME_SHIFT

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* undo header change

Co-authored-by: Frank Chen <frankchen@apache.org>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-01-06 23:57:03 -08:00
Jihoon Son 4a74c5adcc
Use Druid's extension loading for integration test instead of maven (#12095)
* Use Druid's extension loading for integration test instead of maven

* fix maven command

* override config path

* load input format extensions and kafka by default; add prepopulated-data group

* all docker-composes are overridable

* fix s3 configs

* override config for all

* fix docker_compose_args

* fix security tests

* turn off debug logs for overlord api calls

* clean up stuff

* revert docker-compose.yml

* fix override config for query error test; fix circular dependency in docker compose

* add back some dependencies in docker compose

* new maven profile for integration test

* example file filter
2022-01-05 23:33:04 -08:00
Victoria Lim 6846622080
Docs: add FILTER to sql query syntax (#12093)
* docs: add FILTER to sql query syntax

* Update docs/querying/sql.md

* Update docs/querying/sql.md

* Update docs/querying/sql.md

* Update docs/querying/sql.md

* move and update FILTER section
2022-01-05 12:59:41 -08:00
somu-imply c267b65f97
Removing unused processing threadpool on broker (#12070)
* Thread pool for broker

* Updating two tests to improve coverage for new method added

* Updating druidProcessingConfigTest to cover coverage

* Adding missed spelling errors caused in doc

* Adding test to cover lines of new function added
2021-12-21 13:07:53 -08:00
Victoria Lim acbeae23b8
New doc for troubleshooting query execution (#12075)
* new doc for troubleshooting query execution

* add doc to sidebar

* Apply suggestions from code review
2021-12-16 17:34:34 -08:00
Karan Kumar 377edff042
Ingestion metrics doc fix (#12066)
* Ingestion metrics doc fix.

* Fixing typo

* Adding missed keywords in ignore list
2021-12-15 12:51:53 +05:30
Victoria Lim 4ede3bbff6
Docs updates (#12069)
* minor updates to docs

* remove en.json
2021-12-14 14:38:18 -08:00
Victoria Lim e77bdfa70d
Document query context parameters related to join filters (#12057)
* docs update for query context and filters

* updates from review

* Update docs/querying/filters.md
2021-12-13 17:47:21 -08:00
Lucas Capistrant 761fe9f144
Add new metric that quantifies how long batch ingest jobs waited for segment availability and whether or not that wait was successful (#12002)
* add a unit test that tests that new metric is emitted

* remove unused import

* clarify in doc that this is for batch tasks

* fix IndexTaskTest
2021-12-10 11:40:52 -06:00
Frank Chen 58245b4617
Support JsonPath functions in JsonPath expressions (#11722)
* Add jsonPath functions support

* Add jsonPath function test for Avro

* Add jsonPath function length() to Orc

* Add jsonPath function length() to Parquet

* Add more tests to ORC format

* update doc

* Fix exception during ingestion

* Add IT test case

* Revert "Fix exception during ingestion"

This reverts commit 5a5484b9ea.

* update IT test case

* Add 'keys()'

* Commit IT test case

* Fix UT
2021-12-10 10:53:23 +08:00
shallada 25c9eba2f7
clarify time format for intervals (#12035) 2021-12-08 08:31:21 -08:00
Lucas Capistrant 150902b95c
clean up the balancing code around the batched vs deprecated way of sampling segments to balance (#11960)
* clean up the balancing code around the batched vs deprecated way of sampling segments to balance

* fix docs, clarify comments, add deprecated annotations to legacy code

* remove unused variable

* update dynamic config dialog in console to state percentOfSegmentsToConsiderPerMove deprecated

* fix dynamic config text for percentOfSegmentsToConsiderPerMove

* run prettier to cleanup coordinator-dynamic-config.tsx changes

* update jest snapshot

* update documentation per review feedback
2021-12-07 14:47:46 -08:00
Clint Wylie a8815f671e
Fix druid client timeout zero (#12023)
* fix bug where queries fail immediately when timeout is 0 instead of using default timeout

* fix to use serverside max

* more better

* less flaky test

* oops
2021-12-07 12:41:01 -08:00
Peter Marshall 0b3f0bbbd8
Docs - Metrics docs layout and info about query/bytes (#11481)
* Metrics docs layout and info about query/bytes

Knowledge transfer from https://groups.google.com/g/druid-user/c/8fiflmSEoTQ - updated the layout of the Metrics part, adding links between docs pages.

Update index.md

Amended typo

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/metrics.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/metrics.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/operations/metrics.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Feedback applied

Http --> HTTP and moved content / removed >

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-12-07 09:45:24 -08:00
Peter Marshall c209db3a1d
Docs - roll-up tip (#11677)
* Update rollup.md

Added SE tip around roll-up.

* Update docs/ingestion/rollup.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-12-07 09:17:36 -08:00
Peter Marshall d7463c99e9
Docs - Task ref logs correction (#11746)
* Update tasks.md

Removed confusing backreference

* Update tasks.md

Changed silly grammar.
2021-12-07 09:15:19 -08:00
Jihoon Son fc9513b6cd
Make NodeRole available during binding; add support for dynamic registration of DruidService (#12012)
* Make nodeRole available during binding; add support for dynamic registration of DruidService

* fix checkstyle and test

* fix customRole test

* address comments

* add more javadoc
2021-12-03 11:59:00 -08:00
jacobtolar f7f5505631
Add avro_ocf to supported Kafka/Kinesis InputFormats (#11865)
* Update docs - Kinesis InputFormat ingestion

* Add avro_ocf to list of supported Kafka InputFormats

* Remove extra whitespace.

* Update kafka-supervisor-reference.md

* Delete extra whitespace.
2021-12-03 07:57:26 -08:00
Frank Chen 4631a66723
Support rolling log files (#10147)
* apply log file rolling strategy

* fix doc

Signed-off-by: frank chen <frank.chen021@outlook.com>

* Use absolute log path and allow spaces in log path

* Update log4j2 configuration

* apply FileAppender to ZooKeeper

* DO NOT redirect application's console log to file in supervisor
2021-12-03 21:32:01 +08:00
Charles Smith 7ed46800c3
Docs: Add multi-dimension partitioning doc; refactor native batch and separate into smaller topics. (#11983)
Adds documentation for multi-dimension partitioning. cc: @kfaraz
Refactors the native batch partitioning topic as follows:

Native batch ingestion covers parallel-index
Native batch simple task indexing covers index
Native batch input sources covers ioSource
Native batch ingestion with firehose covers deprecated firehose
2021-12-03 16:37:14 +05:30
Clint Wylie 84b4bf56d8
vectorize logical operators and boolean functions (#11184)
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
2021-12-02 16:40:23 -08:00
benkrug 11746b8536
Update datasketches-hll.md (#12010)
under "Aggregators", about the lgK setting, it said "Must be a power of 2 from 4 to 21 inclusively."  21 is not a power of 2, nor is 12, the given default.  I think there may have been confusion because lgK represents log2 of K.  We could say "K must be a power of 2...", or just say lgK must be between 4 and 21.
2021-11-30 18:52:00 -08:00
Charles Smith f536f31229
clarify avro support & general style improvements (#11975)
* clarify avro support & general style improvements

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update avro.md

remove redundancy

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2021-11-28 16:10:18 +08:00
Laksh Singla c381cae51b
Improve the output of SQL explain message (#11908)
Currently, when we try to do EXPLAIN PLAN FOR, it returns the structure of the SQL parsed (via Calcite's internal planner util), which is verbose (since it tries to explain about the nodes in the SQL, instead of the Druid Query), and not representative of the native Druid query which will get executed on the broker side.

This PR aims to change the format when user tries to EXPLAIN PLAN FOR for queries which are executed by converting them into Druid's native queries (i.e. not sys schemas).
2021-11-25 21:08:33 +05:30
Rohan Garg 2c08055962
Specify time column for first/last aggregators (#11949)
Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.
2021-11-25 09:44:14 +05:30
Maytas Monsereenusorn bb3d2a433a
Support filtering data in Auto Compaction (#11922)
* add impl

* fix checkstyle

* add test

* add test

* add unit tests

* fix unit tests

* fix unit tests

* fix unit tests

* add IT

* add IT

* add comments

* fix spelling
2021-11-24 10:56:38 -08:00
Kashif Faraz 6607e4cc75
Docs: Remove reference to deprecated field `targetPartitionSize` (#11974)
* Remove reference to deprecated field `targetPartitionSize`

* Fix spelling of LeaderLatch
2021-11-23 15:32:16 +05:30
Peter Marshall ed0606db69
Docs - Corrected admonition issue (#11926)
* Corrected admonition issue

* Update data-formats.md

Removed all admonition bits, and took out sf linebreaks.

* Update data-formats.md

Changed the shocker line into something a little more practical.
2021-11-22 12:14:30 -08:00
Katya Macedo 706d057ccc
corrected leaderlatch name (#11966) 2021-11-22 11:58:42 -08:00
jacobtolar 0a9a908031
Add inline native query example to tutorial (#11642)
* Add inline native query example to tutorial

Minor change to the tutorial that adds an example of a native HTTP query request body, and adds a link to the more detailed "native query over HTTP" documentation.

* cleanup

* Apply suggestions from code review.

Co-authored-by: sthetland <steve.hetland@imply.io>

Co-authored-by: sthetland <steve.hetland@imply.io>
2021-11-22 21:35:05 +08:00
Peter Marshall 0c0001579d
Update compaction.md (#11937)
Removed superfluous tabs that caused issues in rendering
Added nav to the `inputSpec`
2021-11-22 21:33:47 +08:00
jacobtolar 3aee5d9ec3
Fix: invalid JSON in ingestion spec doc example (#11880)
* Fix: invalid JSON in ingestion spec doc example

* Update ingestion-spec.md
2021-11-22 21:33:26 +08:00
Nikhil Navadiya 3c51136098
Add worker category dimension (#11554)
* Add worker category as dimension in TaskSlotCountStatsMonitor

* Change description

* Add workerConfig as field

* Modify HttpRemoteTaskRunnerTest to test worker category in taskslot metrics

* Fixing tests

* Fixing alerts

* Adding unit test in SingleTaskBackgroundRunnerTest for task slot metrics APIs

* Resolving false positive spell check

* addressing comments

* throw UnsupportedOperationException for tasklotmetrics APIs in SingleTaskBackgroundRunner

Co-authored-by: Nikhil Navadiya <nnavadiya@twitter.com>
2021-11-18 22:59:07 -08:00
somu-imply 29710789a4
Adding safe divide function (#11904)
* IMPLY-4344: Adding safe divide function along with testcases and documentation updates

* Changing based on review comments

* Addressing review comments, fixing coding style, docs and spelling

* Checkstyle passes for all code

* Fixing expected results for infinity

* Revert "Fixing expected results for infinity"

This reverts commit 5fd5cd480d.

* Updating test result and a space in docs
2021-11-17 08:22:41 -08:00
TSFenwick 1487f558b1
Use a simple class to sanitize JDBC exceptions and also log them (#11843)
* Use a simple class to sanitize sanitizable errors and log them

The purpose of this is to sanitize JDBC errors, but can sanitize other errors
if they implement SanitizableError Interface

add a class to log errors and sanitize them
added a simple test that tests out that the error gets sanitized
add @NonNull annotation to serverconfig's ErrorResponseTransfromStrategy

* return less information as part of too many connections, and instead only log specific details

This is so an end user gets relevant information but not too much info since they might now how
many brokers they have

* return only runtime exceptions

added new error types that need to be sanitized
also sanitize deprecated and unsupported exceptions.

* dont reqrewite exceptions unless necessary for checked exceptions

add docs
avoid blanket turning all exceptions into runtime exceptions

* address comments, to fix up docs.

add more javadocs
add support UOE sanitization

* use try catch instead and sanitize at public methods

* checkstyle fixes

* throw noSuchStatement and NoSuchConnection as Avatica is affected by those

* address comments. move log error back to druid meta

clean up bad formatting and commented code. add missed catch for NoSuchStatementException
clean up comments for error handler and add comment explainging not wanting to santize avatica exceptions

* alter test to reflect new error message
2021-11-16 13:13:03 -08:00
sthetland 02b578a3dd
Fixing a few typos and style issues (#11883)
* grammar and format work

* light writing touchup

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-11-16 10:13:35 -08:00
Gian Merlino 6f6e88e02e
SQL: Add type headers to response formats. (#11914)
This allows clients to interpret the results of SQL queries without having to guess types.
2021-11-13 11:30:57 +05:30
Jihoon Son f91868602d
Remove stale warning for HTTP inputSource (#11907) 2021-11-13 10:27:14 +08:00
Charles Smith 33a5cda061
Docs: Splits Kafka topic. Adds detailed example for kafka inputFormat (#11912)
* Splits Kafka topic according to function. Adds detailed example for kafka inputFormat

* Apply suggestions from code review

accept suggestions from review

Co-authored-by: sthetland <steve.hetland@imply.io>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Apply suggestions from code review

accept suggestions

Co-authored-by: sthetland <steve.hetland@imply.io>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* accept suggestions

* accept suggestions

* final typos and clarifications

* bringing forward some syntax fixes

Co-authored-by: sthetland <steve.hetland@imply.io>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2021-11-12 13:02:23 -08:00
Clint Wylie 5baa22148e
revert ColumnAnalysis type, add typeSignature and use it for DruidSchema (#11895)
* revert ColumnAnalysis type, add typeSignature and use it for DruidSchema

* review stuffs

* maybe null

* better maybe null

* Update docs/querying/segmentmetadataquery.md

* Update docs/querying/segmentmetadataquery.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* fix null right

* sad

* oops

* Update batch_hadoop_queries.json

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-11-10 18:46:29 -08:00
Maytas Monsereenusorn a36a41da73
Support routing data through an HTTP proxy (#11891)
* Support routing data through an HTTP proxy

* Support routing data through an HTTP proxy

This adds the ability for the HttpClient to connect through an HTTP proxy.  We
augment the channel factory to check if it is supposed to be proxied and, if so,
we connect to the proxy host first, issue a CONNECT command through to the final
recipient host and *then* give the channel to the normal http client for usage.

* add docs

* address comments

Co-authored-by: imply-cheddar <86940447+imply-cheddar@users.noreply.github.com>
2021-11-09 17:24:06 -08:00
Maytas Monsereenusorn ddc68c6a81
Support changing dimension schema in Auto Compaction (#11874)
* add impl

* add unit tests

* fix checkstyle

* add impl

* add impl

* add impl

* add impl

* add impl

* add impl

* fix test

* add IT

* add IT

* fix docs

* add test

* address comments

* fix conflict
2021-11-08 21:17:08 -08:00
Jian Wang 8e7e679984
Add more metrics for Jetty server thread pool usage (#11113)
Add more metrics for jetty server thread pool usage so we know if we have allocated enough http threads to handle requests.
2021-11-07 16:51:44 +05:30
zachjsh 1d6df48145
Warn if cache size of lookup is beyond max size (#11863)
Enhanced the ExtractionNamespace interface in lookups-cached-global core extension with the ability to set a maxHeapPercentage for the cache of the respective namespace. The reason for adding this functionality, is make it easier to detect when a lookup table grows to a size that the underlying service cannot handle, because it does not have enough memory. The default value of maxHeap for the interface is -1, which indicates that no maxHeapPercentage has been set. For the JdbcExtractionNamespace and UriExtractionNamespace implementations, the default value is null, which will cause the respective service that the lookup is loaded in, to warn when its cache is beyond mxHeapPercentage of the service's configured max heap size. If a positive non-null value is set for the namespace's maxHeapPercentage config, this value will be honored for all services that the respective lookup is loaded onto, and consequently log warning messages when the cache of the respective lookup grows beyond this respective percentage of the services configured max heap size. Warnings are logged every time that either Uri based or Jdbc based lookups are regenerated, if the maxHeapPercentage constraint is violated. No other implementations will log warnings at this time. No error is thrown when the size exceeds the maxHeapPercentage at this time, as doing so could break functionality for existing users. Previously the JdbcCacheGenerator generated its cache by materializing all rows of the underling table in memory at once; this made it difficult to log warning messages in the case that the results from the jdbc query were very large and caused the service to run out of memory. To help with this, this pr makes it so that the jdbc query results are instead streamed through an iterator.
2021-11-03 21:32:22 -04:00
Kashif Faraz a22687ecbe
Add Broker config `druid.broker.segment.watchRealtimeNodes` (#11732)
The new config is an extension of the concept of "watchedTiers" where
the Broker can choose to add the info of only the specified tiers to its timeline.
Similarly, with this config, Broker can choose to skip the realtime nodes and
thus it would query only Historical processes for any given segment.
2021-11-02 12:38:42 +05:30
Katya Macedo 5e1dc843d1
Fix quickstart link (#11864) 2021-11-02 13:27:53 +08:00
Maytas Monsereenusorn ba2874ee1f
Support changing query granularity in Auto Compaction (#11856)
* add queryGranularity

* fix checkstyle

* fix test
2021-11-01 15:18:44 -07:00
Karan Kumar 90640bb316
Support for hadoop 3 via maven profiles (#11794)
Add support for hadoop 3 profiles . Most of the details are captured in #11791 .
We use a combination of maven profiles and resource filtering to achieve this. Hadoop2 is supported by default and a new maven profile with the name hadoop3 is created. This will allow the user to choose the profile which is best suited for the use case.
2021-10-30 22:46:24 +05:30
Maytas Monsereenusorn 33d9d9bd74
Add rollup config to auto and manual compaction (#11850)
* add rollup to auto and manual compaction

* add unit tests

* add unit tests

* add IT

* fix checkstyle
2021-10-29 10:22:25 -07:00
Gian Merlino fc95c92806
Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs. (#11124)
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.

This patch does the following:

- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
  required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
  had it. (Not all of them did, and that's OK, because it wasn't necessary
  anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.

Rationale for removing the offheap incremental index:

- It is only used in one rare scenario: groupBy v1 (which is non-default)
  in "useOffheap" mode (also non-default). So you have to go pretty deep
  into the wilderness to get this code to activate in production. It is
  never used during ingestion.
- Its existence complicates developer efforts to reason about how
  aggregators get used, because the way it uses buffer aggregators is so
  different from how every other query engine uses them.
- It doesn't have meaningful testing.

By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.

* Remove things that are now unused.

* Revert removal of getFloat, getLong, getDouble from BufferAggregator.

* OAK-related warnings, suppressions.

* Unused item suppressions.
2021-10-26 08:05:56 -07:00
Sergio Ferragut 000a5551fa
docker mem reqs (#11827)
* docker mem reqs

* Update docs/tutorials/docker.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Sergio Ferragut <sergio.ferragut@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-10-25 12:23:25 -07:00
Gian Merlino 8276c031c5
Add druid.sql.approxCountDistinct.function property. (#11181)
* Add druid.sql.approxCountDistinct.function property.

The new property allows admins to configure the implementation for
APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode.

The motivation for adding this setting is to enable site admins to
switch the default HLL implementation to DataSketches.

For example, an admin can set:

  druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL

* Fixes

* Fix tests.

* Remove erroneous cannotVectorize.

* Remove unused import.

* Remove unused test imports.
2021-10-25 12:16:21 -07:00
Kashif Faraz abac9e39ed
Revert permission changes to Supervisor and Task APIs (#11819)
* Revert "Require Datasource WRITE authorization for Supervisor and Task access (#11718)"

This reverts commit f2d6100124.

* Revert "Require DATASOURCE WRITE access in SupervisorResourceFilter and TaskResourceFilter (#11680)"

This reverts commit 6779c4652d.

* Fix docs for the reverted commits

* Fix and restore deleted tests

* Fix and restore SystemSchemaTest
2021-10-25 14:50:38 +05:30
Charles Smith 10c5fa93f1
remove dupe sentence (#11821) 2021-10-25 14:48:20 +05:30
Victoria Lim 43103632fb
Docs - add description on time origin (#11826)
* add description on time origin

* reorder parameter descriptions

* add example of origin value
2021-10-22 14:57:13 -07:00
Charles Smith 938c1493e5
edits to kafka inputFormat (#11796)
* edits to kafka inputFormat

* revise conflict resolution description

* tweak for clarity

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* style fixes

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/data-formats.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2021-10-15 14:01:10 -07:00
Charles Smith 6089a168ea
Docs - update dynamic config provider topic (#11795)
* update dynamic config provider

* update topic

* add examples for dynamic config provider:

* Update docs/development/extensions-core/kafka-ingestion.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/kafka-ingestion.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/kafka-ingestion.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/dynamic-config-provider.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/dynamic-config-provider.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/dynamic-config-provider.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/dynamic-config-provider.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/kafka-ingestion.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/dynamic-config-provider.md

Co-authored-by: Clint Wylie <cjwylie@gmail.com>

* Update docs/operations/dynamic-config-provider.md

Co-authored-by: Clint Wylie <cjwylie@gmail.com>

* Update kafka-ingestion.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
2021-10-14 17:51:32 -07:00
Arun Ramani b6b42d3936
Minor processor quota computation fix + docs (#11783)
* cpu/cpuset cgroup and procfs data gathering

* Renames and default values

* Formatting

* Trigger Build

* Add cgroup monitors

* Return 0 if no period

* Update

* Minor processor quota computation fix + docs

* Address comments

* Address comments

* Fix spellcheck

Co-authored-by: arunramani-imply <84351090+arunramani-imply@users.noreply.github.com>
2021-10-08 22:52:03 -05:00
Victoria Lim 42e44269be
Docs update for druid-basic-security (#11782)
* update druid-basic-security

* typo

* revisions from review
2021-10-08 14:45:09 -07:00
Kashif Faraz c2c724c065
Fix docs to explain that WRITE permissions do not include READ (#11785)
* Fix docs to explain that WRITE and READ are exclusive

* Fix indentation

* Use suggested doc style
2021-10-08 14:10:20 -07:00
Charles Smith 3ecbd3aec4
docs for changes to authorization in #11718 and #11720 (#11779)
* security recommendation

* Update docs/operations/security-overview.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/operations/security-user-auth.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/operations/security-user-auth.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update security-user-auth.md

add newline

* Update docs/operations/security-overview.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update security-overview.md

add suggestion for environment variable dynamic config provider

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Clint Wylie <cwylie@apache.org>
2021-10-08 14:04:04 -07:00
Kashif Faraz f2d6100124
Require Datasource WRITE authorization for Supervisor and Task access (#11718)
Follow up PR for #11680

Description
Supervisor and Task APIs are related to ingestion and must always require Datasource WRITE
authorization even if they are purely informative.

Changes
Check Datasource WRITE in SystemSchema for tables "supervisors" and "tasks"
Check Datasource WRITE for APIs /supervisor/history and /supervisor/{id}/history
Check Datasource for all Indexing Task APIs
2021-10-08 10:39:48 +05:30
Katya Macedo 45d0ecbefb
clarify hadoop input paths (#11781)
Co-authored-by: Katya Macedo <katya.macedo@imply.io>
2021-10-07 20:22:51 -07:00
lokesh-lingarajan ad6609a606
Kafka Input Format for headers, key and payload parsing (#11630)
### Description

Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events is another class of data that provides useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside kafka payload, but nonetheless there are some very useful metadata that is carried in kafka headers that can serve as useful dimension for aggregation and in turn bringing better insights.

PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats.

We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format(provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the kafka record. We also need a way to expose the data present in the key as druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload.

This PR is designed to solve the above by providing wrapper around any existing input formats and merging the data into a single unified Druid row.

Lets look at a sample input format from the above discussion

"inputFormat":
{
    "type": "kafka",     // New input format type
    "headerLabelPrefix": "kafka.header.",   // Label prefix for header columns, this will avoid collusions while merging columns
    "recordTimestampLabelPrefix": "kafka.",  // Kafka record's timestamp is made available in case payload does not carry timestamp
    "headerFormat":  // Header parser specifying that values are of type string
    {
        "type": "string"
    },
    "valueFormat": // Value parser from json parsing
    {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [...]
        }
    },
    "keyFormat":  // Key parser also from json parsing
    {
        "type": "json"
    }
}

Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, eg., headers coming in as string and payload as json. 

KafkaInputFormat will be the uber class extending inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blend the data resolving conflicts in columns and generating a single unified InputRow for Druid ingestion. 

"headerFormat" will allow users to plug parser type for the header values and will add default header prefix as "kafka.header."(can be overridden) for attributes to avoid collision while merging attributes with payload.

Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plugin existing parser. One thing to note here is that if batching is performed, then the code is augmenting header and key values to every record in the batch.

Kafka key parser will handle parsing Key portion of the Kafka record and will ingest the Key with dimension name as "kafka.key".

## KafkaInputFormat Class: 
This is the class that orchestrates sending the consumerRecord to each parser, retrieve rows, merge the columns into one final row for Druid consumption. KafkaInputformat should make sure to release the resources that gets allocated as a part of reader in CloseableIterator<InputRow> during normal and exception cases.

During conflicts in dimension/metrics names, the code will prefer dimension names from payload and ignore the dimension either from headers/key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.
2021-10-07 08:56:27 -07:00
Charles Smith 8fd17fe0af
fix a few typos in Kinesis doc (#11776) 2021-10-06 19:43:20 -07:00
Lucas Capistrant 1930ad1f47
Implement configurable internally generated query context (#11429)
* Add the ability to add a context to internally generated druid broker queries

* fix docs

* changes after first CI failure

* cleanup after merge with master

* change default to empty map and improve unit tests

* add doc info and fix checkstyle

* refactor DruidSchema#runSegmentMetadataQuery and add a unit test
2021-10-06 09:02:41 -07:00
Kashif Faraz b688db790b
Add Broker config `druid.broker.segment.ignoredTiers` (#11766)
The new config is an extension of the concept of "watchedTiers" where
the Broker can choose to add the info of only the specified tiers to its timeline.
Similarly, with this config, Broker can choose to ignore the segments being served
by the specified historical tiers. By default, no tier is ignored.

This config is useful when you want a completely isolated tier amongst many other tiers.

Say there are several tiers of historicals Tier T1, Tier T2 ... Tier Tn
and there are several brokers Broker B1, Broker B2 .... Broker Bm

If we want only Broker B1 to query Tier T1, instead of setting a long list of watchedTiers
on each of the other Brokers B2 ... Bm, we could just set druid.broker.segment.ignoredTiers=["T1"]
for these Brokers, while Broker B1 could have druid.broker.segment.watchedTiers=["T1"]
2021-10-06 10:06:32 +05:30
Frank Chen 104c9a07f0
Fix broken anchor and heading levels in Kafka/Kinesis ingestion (#11748)
* Fix broken anchor and heading levels

* Fix CI
2021-10-05 19:30:50 -07:00
Charles Smith 621e5ac63f
docs: clarify RealtimeMetricsMonitor, HistoricalMetricsMonitor (#11565)
* docs: clarify RealtimeMetricsMonitor, HistoricalMetricsMonitor

* Update docs/configuration/index.md
2021-10-05 17:38:23 -07:00
Maytas Monsereenusorn f60b3b3bab
fix doc (#11772) 2021-10-05 15:42:11 -07:00
Victoria Lim a31d99fb37
update docs with X-Druid-SQL-Query-Id (#11761)
* update docs with X-Druid-SQL-Query-Id

* review comments

* update header description

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* Update docs/querying/sql.md

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2021-10-06 00:15:05 +07:00
Caroline1000 ffbe303828
Update balancer strategy recommendations (#11759)
* Update balancer strategy recommendations

* Update docs/configuration/index.md

* Update docs/configuration/index.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>
2021-10-05 09:47:37 -07:00