Commit Graph

104 Commits

Author SHA1 Message Date
Parag Jain c7b46671b3
option to use deep storage for storing shuffle data (#11507)
Fixes #11297.
Description

Description and design in the proposal #11297
Key changed/added classes in this PR

    *DataSegmentPusher
    *ShuffleClient
    *PartitionStat
    *PartitionLocation
    *IntermediaryDataManager
2021-08-13 16:40:25 -04:00
Charles Smith 6524d838d7
Docs refactor of ingestion. Carries #11541 (#11576)
* Docs refactor of ingestion. Carries #11541

* Update docs/misc/math-expr.md

* add Apache license

* fix header, add topics to sidebar

* Update docs/ingestion/partitioning.md

* pick up changes to  and  md from c7fdf1d, #11479

Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-08-13 08:42:03 -07:00
Kashif Faraz aaf0aaad8f
Enable routing of SQL queries at Router (#11566)
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.

This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.

To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
2021-08-13 18:44:39 +05:30
Charles Smith 941c5ffb05
clarify JVM tmp dir requires execute on files (#11542)
* clarify JVM tmp dir requires execute on files

* code SysMonitor for spellcheck
2021-08-09 17:25:10 -07:00
Suneet Saldanha e423e99997
Update default maxSegmentsInNodeLoadingQueue (#11540)
* Update default maxSegmentsInNodeLoadingQueue

Update the default maxSegmentsInNodeLoadingQueue from 0 (unbounded) to 100.

An unbounded maxSegmentsInNodeLoadingQueue can cause cluster instability.
Since this is the default druid operators need to run into this instability
and then look through the docs to see that the recommended value for a large
cluster is 1000. This change makes it so the default will prevent clusters
from falling over as they grow over time.

* update tests

* codestyle
2021-08-05 11:26:58 -07:00
frank chen 55a01a030a
Clarify that Broker caching for groupBy v2 queries does not work (#11370)
* Add a note

* Update docs/configuration/index.md

Co-authored-by: sthetland <steve.hetland@imply.io>

* clarify that both of non-result level cache and result level cache are not supported

Co-authored-by: sthetland <steve.hetland@imply.io>
2021-08-03 10:01:15 -07:00
Yuanli Han b83742179a
Reduce method invocation of reservoir sampling (#11257)
* reduce method invocation of reservoir sampling

* add a dynamic parameter and add benchmark

* rebase
2021-07-30 22:09:50 +08:00
Maytas Monsereenusorn c068906fca
Make intermediate store for shuffle tasks an extension point (#11492)
* add interface

* add docs

* fix errors

* fix injection

* fix injection

* update javadoc
2021-07-27 11:29:43 +07:00
Maytas Monsereenusorn 6ce3b6ca2d
Improve documentation for druid.indexer.autoscale.workerCapacityHint config (#11444)
* fix doc

* address comments

* Update docs/configuration/index.md

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
2021-07-21 12:48:56 +07:00
Maytas Monsereenusorn 8d7d60d18e
Improve Auto scaler pendingTaskBased provisioning strategy to handle when there are no currently running worker node better (#11440)
* fix pendingTaskBased

* fix doc

* address comments

* address comments

* address comments

* address comments

* address comments

* address comments

* address comments
2021-07-15 06:52:25 +07:00
Agustin Gonzalez 7e61042794
Bound memory utilization for dynamic partitioning (i.e. memory growth is constant) (#11294)
* Bound memory in native batch ingest create segments

* Move BatchAppenderatorDriverTest to indexing service... note that we had to put the sink back in sinks in mergeandpush since the persistent data needs to be dropped and the sink is required for that

* Remove sinks from memory and clean up intermediate persists dirs manually after sink has been merged

* Changed name from RealtimeAppenderator to StreamAppenderator

* Style

* Incorporating tests from StreamAppenderatorTest

* Keep totalRows and cleanup code

* Added missing dep

* Fix unit test

* Checkstyle

* allowIncrementalPersists should always be true for batch

* Added sinks metadata

* clear sinks metadata when closing appenderator

* Style + minor edits to log msgs

* Update sinks metadata & totalRows when dropping a sink (segment)

* Remove max

* Intelli-j check

* Keep a count of hydrants persisted by sink for sanity check before merge

* Move out sanity

* Add previous hydrant count to sink metadata

* Remove redundant field from SinkMetadata

* Remove unneeded functions

* Cleanup unused code

* Removed unused code

* Remove unused field

* Exclude it from jacoco because it is very hard to get branch coverage

* Remove segment announcement and some other minor cleanup

* Add fallback flag

* Minor code cleanup

* Checkstyle

* Code review changes

* Update batchMemoryMappedIndex name

* Code review comments

* Exclude class from coverage, will include again when packaging gets fixed

* Moved test classes to server module

* More BatchAppenderator cleanup

* Fix bug in wrong counting of totalHydrants plus minor cleanup in add

* Removed left over comments

* Have BatchAppenderator follow the Appenderator contract for push & getSegments

* Fix LGTM violations

* Review comments

* Add stats after push is done

* Code review comments (cleanup, remove rest of synchronization constructs in batch appenderator, reneame feature flag,
remove real time flag stuff from stream appenderator, etc.)

* Update javadocs

* Add thread safety notice to BatchAppenderator

* Further cleanup config

* More config cleanup
2021-07-09 00:10:29 -07:00
frank chen 906a704c55
Eliminate ambiguities of KB/MB/GB in the doc (#11333)
* GB ---> GiB

* suppress spelling check

* MB --> MiB, KB --> KiB

* Use IEC binary prefix

* Add reference link

* Fix doc style
2021-06-30 13:42:45 -07:00
frank chen 60843bd11f
Add configuration suggestion to `druid.indexer.storage.type` (#11304) 2021-05-27 06:44:47 -07:00
Maytas Monsereenusorn 3455352241
Add feature to automatically remove compaction configurations for inactive datasources (#11232)
* add auto cleanup

* add auto cleanup

* add auto cleanup

* add tests

* add tests

* use retryutils

* use retryutils

* use retryutils

* address comments
2021-05-11 18:49:18 -07:00
Agustin Gonzalez 8e5048e643
Avoid memory mapping hydrants after they are persisted & after they are merged for native batch ingestion (#11123)
* Avoid mapping hydrants in create segments phase for native ingestion

* Drop queriable indices after a given sink is fully merged

* Do not drop memory mappings for realtime ingestion

* Style fixes

* Renamed to match use case better

* Rollback memoization code and use the real time flag instead

* Null ptr fix in FireHydrant toString plus adjustments to memory pressure tracking calculations

* Style

* Log some count stats

* Make sure sinks size is obtained at the right time

* BatchAppenderator unit test

* Fix comment typos

* Renamed methods to make them more readable

* Move persisted metadata from FireHydrant class to AppenderatorImpl. Removed superfluous differences and fix comment typo. Removed custom comparator

* Missing dependency

* Make persisted hydrant metadata map concurrent and better reflect the fact that keys are Java references. Maintain persisted metadata when dropping/closing segments.

* Replaced concurrent variables with normal ones

* Added   batchMemoryMappedIndex "fallback" flag with default "false". Set this to "true" make code fallback to previous code path.

* Style fix.

* Added note to new setting in doc, using Iterables.size (and removing a dependency), and fixing a typo in a comment.

* Forgot to commit this edited documentation message
2021-05-11 14:34:26 -07:00
Maytas Monsereenusorn 4326e699bd
Add feature to automatically remove datasource metadata based on retention period (#11227)
* add auto clean up datasource metadata

* add test

* fix checkstyle

* add comments

* fix error

* address comments

* Address comments

* fix test

* fix test

* fix typo

* add comment

* fix test

* fix test
2021-05-11 01:22:33 -07:00
Charles Smith fae7ebf489
change errant 'none' configuration to 'manual': (#11218) 2021-05-10 22:04:18 -07:00
frank chen fa113fb4a9
Fix default value (#11220) 2021-05-10 10:11:26 -07:00
Jihoon Son 2df42143ae
Fix idempotence of segment allocation and task report apis in native batch ingestion (#11189)
* Fix idempotence of segment allocation and task report apis in native
batch ingestion

* better error and javadoc

* checkstyle and dependency

* fix tests and add more tests

* task config instead of context; add doc

* unused import and dependency

* typo in doc

* fix unintended changes

* fix wrong import

* remove unnecessary error handling

* add task context back

* default task context

* fix test and doc

* address comments

* unused imports
2021-05-07 14:29:48 -07:00
Maytas Monsereenusorn d73f72e508
Add feature to automatically remove supervisor based on retention period (#11200)
* add auto clean up

* add test

* add test

* fix test

* Address comments

* Address comments
2021-05-06 22:25:23 -07:00
Lucas Capistrant bb3c810b36
Create dynamic config that can limit number of non-primary replicants loaded per coordination cycle (#11135)
* lay the groundwork for throttling replicant loads per RunRules execution

* Add dynamic coordinator config to control new replicant threshold.

* remove redundant line

* add some unit tests

* fix checkstyle error

* add documentation for new dynamic config

* improve docs and logs

* Alter how null is handled for new config. If null, manually set as default
2021-05-05 07:39:36 -05:00
Maytas Monsereenusorn 84aac4832d
Add feature to automatically remove rules based on retention period (#11164)
* Add feature to automatically remove rules based on retention period

* Add feature to automatically remove rules based on retention period

* address comments
2021-05-03 11:50:45 -07:00
benkrug fdab95ea99
Update index.md (#11174)
tiny change for readability
2021-04-30 09:40:19 -07:00
Maytas Monsereenusorn 6d2b5cdd7e
Add feature to automatically remove audit logs based on retention period (#11084)
* add docs

* add impl

* fix checkstyle

* fix test

* add test

* fix checkstyle

* fix checkstyle

* fix test

* Address comments

* Address comments

* fix spelling

* fix docs
2021-04-20 17:10:43 -07:00
Maytas Monsereenusorn f968400170
Introduce a new configuration that skip storing audit payload if payload size exceed limit and skip storing null fields for audit payload (#11078)
* Add config to skip storing audit payload if exceed limit

* fix checkstyle

* change config name

* skip null fields for audit payload

* fix checkstyle

* address comments

* fix guice

* fix test

* add tests

* address comments

* address comments

* address comments

* fix checkstyle

* address comments

* fix test

* fix test

* address comments

* Address comments

Co-authored-by: Jihoon Son <jihoonson@apache.org>

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-04-13 20:18:28 -07:00
bergmt2000 f60d8ea1c3
Update index.md (#11105)
Fix json typo in readme for granularitySpec in compaction config example
2021-04-13 16:26:36 +08:00
Maytas Monsereenusorn 4576152e4a
Make dropExisting flag for Compaction configurable and add warning documentations (#11070)
* Make dropExisting flag for Compaction configurable

* fix checkstyle

* fix checkstyle

* fix test

* add tests

* fix spelling

* fix docs

* add IT

* fix test

* fix doc

* fix doc
2021-04-09 00:12:28 -07:00
sthetland fb6751fa45
Fix old broken link (#11048)
* link check fixes

* updated link target

* Update aggregations.md

* spelling error
2021-04-07 20:40:50 -07:00
Abhishek Agarwal 0df0bff44b
Enable multiple distinct aggregators in same query (#11014)
* Enable multiple distinct count

* Add more tests

* fix sql test

* docs fix

* Address nits
2021-04-07 00:52:19 -07:00
Jihoon Son cc12a57034
Enforce allow list for JDBC properties by default (#11063)
* Enforce allow list for JDBC properties by default

* fix tests
2021-04-06 19:46:19 -07:00
Jihoon Son cfcebc40f6
Allow list for JDBC connection properties to address CVE-2021-26919 (#11047)
* Allow list for JDBC connection properties to address CVE-2021-26919

* fix tests for java 11
2021-04-01 17:30:47 -07:00
Gian Merlino bf20f9e979
DruidInputSource: Fix issues in column projection, timestamp handling. (#10267)
* DruidInputSource: Fix issues in column projection, timestamp handling.

DruidInputSource, DruidSegmentReader changes:

1) Remove "dimensions" and "metrics". They are not necessary, because we
   can compute which columns we need to read based on what is going to
   be used by the timestamp, transform, dimensions, and metrics.
2) Start using ColumnsFilter (see below) to decide which columns we need
   to read.
3) Actually respect the "timestampSpec". Previously, it was ignored, and
   the timestamp of the returned InputRows was set to the `__time` column
   of the input datasource.

(1) and (2) together fix a bug in which the DruidInputSource would not
properly read columns that are used as inputs to a transformSpec.

(3) fixes a bug where the timestampSpec would be ignored if you attempted
to set the column to something other than `__time`.

(1) and (3) are breaking changes.

Web console changes:

1) Remove "Dimensions" and "Metrics" from the Druid input source.
2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for
   compatibility with the new behavior.

Other changes:

1) Add ColumnsFilter, a new class that allows input readers to determine
   which columns they need to read. Currently, it's only used by the
   DruidInputSource, but it could be used by other columnar input sources
   in the future.
2) Add a ColumnsFilter to InputRowSchema.
3) Remove the metric names from InputRowSchema (they were unused).
4) Add InputRowSchemas.fromDataSchema method that computes the proper
   ColumnsFilter for given timestamp, dimensions, transform, and metrics.
5) Add "getRequiredColumns" method to TransformSpec to support the above.

* Various fixups.

* Uncomment incorrectly commented lines.

* Move TransformSpecTest to the proper module.

* Add druid.indexer.task.ignoreTimestampSpecForDruidInputSource setting.

* Fix.

* Fix build.

* Checkstyle.

* Misc fixes.

* Fix test.

* Move config.

* Fix imports.

* Fixup.

* Fix ShuffleResourceTest.

* Add import.

* Smarter exclusions.

* Fixes based on tests.

Also, add TIME_COLUMN constant in the web console.

* Adjustments for tests.

* Reorder test data.

* Update docs.

* Update docs to say Druid 0.22.0 instead of 0.21.0.

* Fix test.

* Fix ITAutoCompactionTest.

* Changes from review & from merging.
2021-03-25 10:32:21 -07:00
Charles Smith d69533dbd9
First refactor of compaction (#10935)
* first pass compaction refactor. includes updated behavior for queryGranularity. removes duplicated doc

* fix links, typos, some reorganization

* fix spelling. TBD still there for work in progress

* updates tutorial examples, adds more clarification around compaction use cases

* add granularity spec to automatic compaction config

* final edits

* spelling fixes

* apply suggestions from review

* upadtes from review

* last edits

* move note

* clarify null

* fix links & spelling

* latest review

* edits to auto-compaction config

* add back rollup

* fix links & spelling

* Update compaction.md

add granularityspec to example
2021-03-24 11:41:44 -07:00
Atul Mohan 3d7e7c2c83
Avoid deletion of load/drop entry from CuratorLoadQueuePeon in case of load timeout (#10213)
* Skip queue removal on timeout

* Clarify error

* Add new config to control replication

Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>
2021-03-17 11:34:05 -07:00
Jihoon Son 9946306d4b
Add configurations for allowed protocols for HTTP and HDFS inputSources/firehoses (#10830)
* Allow only HTTP and HTTPS protocols for the HTTP inputSource

* rename

* Update core/src/main/java/org/apache/druid/data/input/impl/HttpInputSource.java

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* fix http firehose and update doc

* HDFS inputSource

* add configs for allowed protocols

* fix checkstyle and doc

* more checkstyle

* remove stale doc

* remove more doc

* Apply doc suggestions from code review

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>

* update hdfs address in docs

* fix test

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
2021-03-06 11:43:00 -08:00
Clint Wylie f34c6eb3c0
add druid jdbc handler config for minimum number of rows per frame (#10880)
* add druid jdbc handler config for minimum number of rows per frame

* javadocs and docs adjustments

* spelling

* adjust docs per review with minor tweaks

* adjust more
2021-02-23 02:11:04 -08:00
Jihoon Son 1ec3f0bd73
Revert "Add support for Blacklisting some domains for HTTPInputSource (#10535)" (#10871)
This reverts commit 6b14bdb3a5.
2021-02-09 17:51:26 -08:00
zhangyue19921010 bf1d1d583b
modify (#10778)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-01-22 09:20:13 -08:00
zhangyue19921010 2837a9b62f
[Minor Doc Fix] Correct the default value of `druid.server.http.gracefulShutdownTimeout` (#10661)
* done

* done

* done

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-01-08 15:23:08 -08:00
Lucas Capistrant 58ce2e55d8
Add dynamic coordinator config that allows control over how many segments are considered when picking a segment to move. (#10284)
* dynamic coord config adding more balancing control

add new dynamic coordinator config, maxSegmentsToConsiderPerMove. This
config caps the number of segments that are iterated over when selecting
a segment to move. The default value combined with current balancing
strategies will still iterate over all provided segments. However,
setting this value to something > 0 will cap the number of segments
visited. This could make sense in cases where a cluster has a very large
number of segments and the admins prefer less iterations vs a thorough
consideration of all segments provided.

* fix checkstyle failure

* Make doc more detailed for admin to understand when/why to use new config

* refactor PR to use a % of segments instead of raw number

* update the docs

* remove bad doc line

* fix typo in name of new dynamic config

* update RservoirSegmentSampler to gracefully deal with values > 100%

* add handler for <= 0 in ReservoirSegmentSampler

* fixup CoordinatorDynamicConfigTest naming and argument ordering

* fix items in docs after spellcheck flags

* Fix lgtm flag on missing space in string literal

* improve documentation for new config

* Add default value to config docs and add advice in cluster tuning doc

* Add percentOfSegmentsToConsiderPerMove to web console coord config dialog

* update jest snapshot after console change

* fix spell checker errors

* Improve debug logging in getRandomSegmentBalancerHolder to cover all bad inputs for % of segments to consider

* add new config back to web console module after merge with master

* fix ReservoirSegmentSamplerTest

* fix line breaks in coordinator console dialog

* Add a test that helps ensure not regressions for percentOfSegmentsToConsiderPerMove

* Make improvements based off of feedback in review

* additional cleanup coming from review

* Add a warning log if limit on segments to consider for move can't be calcluated

* remove unused import

* fix tests for CoordinatorDynamicConfig

* remove precondition test that is redundant in CoordinatorDynamicConfig Builder class
2020-12-22 08:27:55 -08:00
sthetland 6ae8059c09
cleaning up and fixing links (#10528)
* cleaning up and fixing links

* reverting local link

* Update indexer.md

* link checking

* Fixing one more stale link for PostgreSQL
2020-12-17 13:37:43 -08:00
frank chen c410648630
fix injection failure of StorageLocationSelectorStrategy objects (#10363)
* fix to allow customer storage location selector strategy

* add test cases to check instance of selector strategy

* update doc

* code format

* resolve code review comments

* inject StorageLocation

* fix CI

* fix mismatched license item reported by CI

* change property path from druid.segmentCache.locationSelectorStrategy.type to druid.segmentCache.locationSelector.strategy

* using a helper method to bind to correct property path
2020-12-08 09:48:31 -08:00
zhangyue19921010 e7e07eab11
[Improve Doc] : Modify the disadvantages of the lazyLoadOnStart feature. (#10608)
* modify docs

* modify docs

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2020-12-01 18:33:22 -08:00
frank chen fe693a4f01
Improve doc and exception message for invalid user configurations (#10598)
* improve doc and exception message

* add spelling check rules and remove unused import

* add a test to improve test coverage
2020-11-23 15:03:13 -08:00
zhangyue19921010 1272fb17e5
modify druid.historical.cache.maxEntrySize property in Unified format (#10590)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2020-11-17 16:36:50 -06:00
Mainak Ghosh d8e5a159e8
Update index.md (#10549)
Removing the extra `_` in the default for middlemanager category
2020-11-03 13:44:47 +05:30
Husky Zeng 9286153145
doc wrong description of configuration (#10546) 2020-11-02 17:57:16 -08:00
Nishant Bangarwa 6b14bdb3a5
Add support for Blacklisting some domains for HTTPInputSource (#10535)
fix inspections

refactor class name

change name

 add allowList as well

distinguish between empty and null list

Fix CI
2020-11-02 21:47:25 +05:30
Mainak Ghosh 14072d3ab0
Adding more dimensions to the audit log entry (#10373)
* Adding more dimensions to the audit log entry

* Making adding payload in audit metric optional

* Changing the name of the parameter to includePayloadAsDimensionInMetric. Adding a unit test

* Fixing the intellij code introspection issues
2020-09-17 18:36:28 -07:00
Jihoon Son 8657b23ab2
Integration tests and docs for auto compaction with different partitioning (#10354)
* Working

* add test

* doc

* fix test

* split other integration test

* exclude other-index from other tests

* doc anchor fix

* adjust task slots and number of merge tasks

* spell check

* reduce maxNumConcurrentSubTasks to 1

* maxNumConcurrentSubtasks for range partitinoing

* reduce memory for historical

* change group name
2020-09-15 11:28:09 -07:00