3715 Commits

Author SHA1 Message Date
Maytas Monsereenusorn 3455352241
Add feature to automatically remove compaction configurations for inactive datasources (#11232)
* add auto cleanup

* add auto cleanup

* add auto cleanup

* add tests

* add tests

* use retryutils

* use retryutils

* use retryutils

* address comments
2021-05-11 18:49:18 -07:00
Agustin Gonzalez 8e5048e643
Avoid memory mapping hydrants after they are persisted & after they are merged for native batch ingestion (#11123)
* Avoid mapping hydrants in create segments phase for native ingestion

* Drop queriable indices after a given sink is fully merged

* Do not drop memory mappings for realtime ingestion

* Style fixes

* Renamed to match use case better

* Rollback memoization code and use the real time flag instead

* Null ptr fix in FireHydrant toString plus adjustments to memory pressure tracking calculations

* Style

* Log some count stats

* Make sure sinks size is obtained at the right time

* BatchAppenderator unit test

* Fix comment typos

* Renamed methods to make them more readable

* Move persisted metadata from FireHydrant class to AppenderatorImpl. Removed superfluous differences and fix comment typo. Removed custom comparator

* Missing dependency

* Make persisted hydrant metadata map concurrent and better reflect the fact that keys are Java references. Maintain persisted metadata when dropping/closing segments.

* Replaced concurrent variables with normal ones

* Added batchMemoryMappedIndex "fallback" flag with default "false". Set this to "true" to make the code fall back to the previous code path.

* Style fix.

* Added note to new setting in doc, using Iterables.size (and removing a dependency), and fixing a typo in a comment.

* Forgot to commit this edited documentation message
2021-05-11 14:34:26 -07:00
Maytas Monsereenusorn 4326e699bd
Add feature to automatically remove datasource metadata based on retention period (#11227)
* add auto clean up datasource metadata

* add test

* fix checkstyle

* add comments

* fix error

* address comments

* Address comments

* fix test

* fix test

* fix typo

* add comment

* fix test

* fix test
2021-05-11 01:22:33 -07:00
Maytas Monsereenusorn 3a660bc6ee
Make sure updating coordinator config is protected against race condition (#11144)
* Make sure changing coordinator config is protected against concurrent updates

* Make sure updating coordinator config is protected against race condition

* add retry

* fix checkstyle

* add tests

* add tests

* add more tests

* add tests

* fix

* fix checkstyle
2021-05-10 13:58:08 -07:00
Jihoon Son 2df42143ae
Fix idempotence of segment allocation and task report apis in native batch ingestion (#11189)
* Fix idempotence of segment allocation and task report apis in native
batch ingestion

* better error and javadoc

* checkstyle and dependency

* fix tests and add more tests

* task config instead of context; add doc

* unused import and dependency

* typo in doc

* fix unintended changes

* fix wrong import

* remove unnecessary error handling

* add task context back

* default task context

* fix test and doc

* address comments

* unused imports
2021-05-07 14:29:48 -07:00
Maytas Monsereenusorn d73f72e508
Add feature to automatically remove supervisor based on retention period (#11200)
* add auto clean up

* add test

* add test

* fix test

* Address comments

* Address comments
2021-05-06 22:25:23 -07:00
Lucas Capistrant bb3c810b36
Create dynamic config that can limit number of non-primary replicants loaded per coordination cycle (#11135)
* lay the groundwork for throttling replicant loads per RunRules execution

* Add dynamic coordinator config to control new replicant threshold.

* remove redundant line

* add some unit tests

* fix checkstyle error

* add documentation for new dynamic config

* improve docs and logs

* Alter how null is handled for new config. If null, manually set as default
2021-05-05 07:39:36 -05:00
Clint Wylie 6f0b4d90d8
clarify that appenderator persist byte sizes are estimations (#11198)
2021-05-05 17:44:04 +08:00
frank chen 204901a602
Fix Smile encoding for HTTP response (#10980)
* fix Smile encoding bug

Signed-off-by: frank chen <frank.chen021@outlook.com>

* Add unit tests

* Add IT for smile encoding

* Fix cases

* Update javadoc

Co-authored-by: Jihoon Son <jihoonson@apache.org>

* resolve comments

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-05-03 22:43:47 -07:00
Clint Wylie 554f1ffeee
ARRAY_AGG sql aggregator function (#11157)
* ARRAY_AGG sql aggregator function

* add javadoc

* spelling

* review stuff, return null instead of empty when nil input

* review stuff

* Update sql.md

* use type inference for finalize, refactor some things
2021-05-03 22:17:10 -07:00
Maytas Monsereenusorn 84aac4832d
Add feature to automatically remove rules based on retention period (#11164)
* Add feature to automatically remove rules based on retention period

* Add feature to automatically remove rules based on retention period

* address comments
2021-05-03 11:50:45 -07:00
Maytas Monsereenusorn 6d2b5cdd7e
Add feature to automatically remove audit logs based on retention period (#11084)
* add docs

* add impl

* fix checkstyle

* fix test

* add test

* fix checkstyle

* fix checkstyle

* fix test

* Address comments

* Address comments

* fix spelling

* fix docs
2021-04-20 17:10:43 -07:00
Maytas Monsereenusorn f968400170
Introduce a new configuration that skips storing the audit payload if the payload size exceeds the limit and skips storing null fields for the audit payload (#11078)
* Add config to skip storing audit payload if exceed limit

* fix checkstyle

* change config name

* skip null fields for audit payload

* fix checkstyle

* address comments

* fix guice

* fix test

* add tests

* address comments

* address comments

* address comments

* fix checkstyle

* address comments

* fix test

* fix test

* address comments

* Address comments

Co-authored-by: Jihoon Son <jihoonson@apache.org>

Co-authored-by: Jihoon Son <jihoonson@apache.org>
2021-04-13 20:18:28 -07:00
Gian Merlino c3faa24f26
DataSchema: Improve duplicate-column error message. (#11082)
* DataSchema: Improve duplicate-column error message.

Now, when duplicate columns are specified, the error message will include
information about where those duplicate columns were seen. Also, if there
are multiple duplicate columns, all will be listed in the error message
instead of just the first one encountered.

* Fix style for checkstyle.

* Further improve error message.
2021-04-12 19:03:15 -07:00
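The behavior described in the DataSchema commit above (report every duplicate column together with where it was seen) can be sketched roughly as follows. This is an illustration only, not Druid's implementation: the function names, the location labels, and the message wording are all assumptions.

```python
from collections import defaultdict


def find_duplicate_columns(sources):
    """Map each column seen more than once to the places it was seen.

    `sources` is a hypothetical mapping of location label -> column names,
    for example {"dimensions": [...], "metricsSpec": [...]}.
    """
    seen = defaultdict(list)
    for location, columns in sources.items():
        for column in columns:
            seen[column].append(location)
    return {col: locs for col, locs in seen.items() if len(locs) > 1}


def duplicate_column_message(sources):
    """Build one message listing all duplicates, or None if there are none."""
    duplicates = find_duplicate_columns(sources)
    if not duplicates:
        return None
    parts = [
        f"[{col}] seen in {', '.join(locs)}"
        for col, locs in sorted(duplicates.items())
    ]
    # Illustrative wording; the real error text may differ.
    return "Cannot specify a column more than once: " + "; ".join(parts)
```

Collecting all occurrences before building the message is what allows every duplicate to be listed instead of only the first one encountered.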
Jihoon Son a6a2758095
More unit tests for JsonParserIterator; Integration tests for query errors (#11091)
* unit tests for timeout exception in init

* integration tests

* run integration test on travis

* fix inspection
2021-04-12 15:08:50 -07:00
Maytas Monsereenusorn 4576152e4a
Make dropExisting flag for Compaction configurable and add warning documentations (#11070)
* Make dropExisting flag for Compaction configurable

* fix checkstyle

* fix checkstyle

* fix test

* add tests

* fix spelling

* fix docs

* add IT

* fix test

* fix doc

* fix doc
2021-04-09 00:12:28 -07:00
Lucas Capistrant 8264203cee
Allow client to configure batch ingestion task to wait to complete until segments are confirmed to be available by other (#10676)
* Add ability to wait for segment availability for batch jobs

* IT updates

* fix queries in legacy hadoop IT

* Fix broken indexing integration tests

* address an lgtm flag

* spell checker still flagging for hadoop doc. adding under that file header too

* fix compaction IT

* Updates to wait for availability method

* improve unit testing for patch

* fix bad indentation

* refactor waitForSegmentAvailability

* Fixes based off of review comments

* cleanup to get compile after merging with master

* fix failing test after previous logic update

* add back code that must have gotten deleted during conflict resolution

* update some logging code

* fixes to get compilation working after merge with master

* reset interrupt flag in catch block after code review pointed it out

* small changes following self-review

* fixup some issues brought on by merge with master

* small changes after review

* cleanup a little bit after merge with master

* Fix potential resource leak in AbstractBatchIndexTask

* syntax fix

* Add a Compaction TuningConfig type

* add docs stipulating the lack of support by Compaction tasks for the new config

* Fixup compilation errors after merge with master

* Remove erroneous newline
2021-04-08 21:03:00 -07:00
Xavier Léauté 15bdd6bc2f
Fix unit tests and GC settings for Java 15 (#11074)
* JavaScript script engine support was removed in JDK 15: skip those tests for JDKs without it
* Fix flaky HTTP client tests with Java 15
* Switch from CMS to G1GC in integration tests, since CMS is no longer available in JDK 15
2021-04-08 10:33:37 -07:00
Abhishek Agarwal 0df0bff44b
Enable multiple distinct aggregators in same query (#11014)
* Enable multiple distinct count

* Add more tests

* fix sql test

* docs fix

* Address nits
2021-04-07 00:52:19 -07:00
Jihoon Son cc12a57034
Enforce allow list for JDBC properties by default (#11063)
* Enforce allow list for JDBC properties by default

* fix tests
2021-04-06 19:46:19 -07:00
zachjsh 8cf1e83543
Add parameter to loadstatus API to compute under-replication against cluster view (#11056)
* Add parameter to loadstatus API to compute under-replication against cluster view

This change adds a query parameter `computeUsingClusterView` to the loadstatus APIs.
If specified, the coordinator computes under-replication for segments based on the
number of services available within the cluster that the segment can be replicated
on, instead of the replication count configured in the load rule. A default load rule
is created in all clusters that specifies that all segments should be replicated
2 times. As replicas are forced to be on separate nodes in the cluster, this causes
the loadstatus API to report that there are under-replicated segments when there is
only 1 data server in the cluster. In this case, calling the loadstatus API without
this new query parameter will always result in a response indicating
under-replication of segments.

* fix exception mapper

* Address review comments

* update external API docs

* Apply suggestions from code review

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>

* update more external docs

* update javadoc

* Apply suggestions from code review

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
2021-04-05 00:02:43 -04:00
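The difference between the two ways of computing under-replication in the loadstatus commit above can be illustrated with a small sketch. It is not the coordinator's code: the function and parameter names are hypothetical, and it assumes the cluster-view mode simply caps the replica target at the number of servers that could hold the segment.

```python
def under_replicated(configured_replicas, loaded_replicas, eligible_servers=None):
    """Return how many replicas of one segment are still missing.

    configured_replicas: replica count from the load rule.
    loaded_replicas: replicas currently loaded.
    eligible_servers: if given, models the cluster-view mode. Replicas must
        live on separate servers, so the reachable target is capped at the
        number of servers (an assumption made for this sketch).
    """
    target = configured_replicas
    if eligible_servers is not None:
        target = min(configured_replicas, eligible_servers)
    return max(target - loaded_replicas, 0)
```

With the default rule of 2 replicas and a single data server, `under_replicated(2, 1)` reports one missing replica forever, while `under_replicated(2, 1, eligible_servers=1)` reports none.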
Jihoon Son cfcebc40f6
Allow list for JDBC connection properties to address CVE-2021-26919 (#11047)
* Allow list for JDBC connection properties to address CVE-2021-26919

* fix tests for java 11
2021-04-01 17:30:47 -07:00
Maytas Monsereenusorn d7f5293364
Add an option for ingestion task to drop (mark unused) all existing segments that are contained by interval in the ingestionSpec (#11025)
* Auto-Compaction can run indefinitely when segmentGranularity is changed from coarser to finer.

* Add option to drop segments after ingestion

* fix checkstyle

* add tests

* add tests

* add tests

* fix test

* add tests

* fix checkstyle

* fix checkstyle

* add docs

* fix docs

* address comments

* address comments

* fix spelling
2021-04-01 12:29:36 -07:00
Parag Jain b35486fa81
request logs through kafka emitter (#11036)
* request logs through kafka emitter

* travis fixes

* review comments

* kafka emitter unit test

* new line

* travis checks

* checkstyle fix

* count request lost when request topic is null
2021-04-01 11:31:32 +05:30
Lasse Krogh Mammen 782a1d4e6c
Add Calcite Avatica protobuf handler (#10543)
2021-03-31 12:46:25 -07:00
Gian Merlino bf20f9e979
DruidInputSource: Fix issues in column projection, timestamp handling. (#10267)
* DruidInputSource: Fix issues in column projection, timestamp handling.

DruidInputSource, DruidSegmentReader changes:

1) Remove "dimensions" and "metrics". They are not necessary, because we
   can compute which columns we need to read based on what is going to
   be used by the timestamp, transform, dimensions, and metrics.
2) Start using ColumnsFilter (see below) to decide which columns we need
   to read.
3) Actually respect the "timestampSpec". Previously, it was ignored, and
   the timestamp of the returned InputRows was set to the `__time` column
   of the input datasource.

(1) and (2) together fix a bug in which the DruidInputSource would not
properly read columns that are used as inputs to a transformSpec.

(3) fixes a bug where the timestampSpec would be ignored if you attempted
to set the column to something other than `__time`.

(1) and (3) are breaking changes.

Web console changes:

1) Remove "Dimensions" and "Metrics" from the Druid input source.
2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for
   compatibility with the new behavior.

Other changes:

1) Add ColumnsFilter, a new class that allows input readers to determine
   which columns they need to read. Currently, it's only used by the
   DruidInputSource, but it could be used by other columnar input sources
   in the future.
2) Add a ColumnsFilter to InputRowSchema.
3) Remove the metric names from InputRowSchema (they were unused).
4) Add InputRowSchemas.fromDataSchema method that computes the proper
   ColumnsFilter for given timestamp, dimensions, transform, and metrics.
5) Add "getRequiredColumns" method to TransformSpec to support the above.

* Various fixups.

* Uncomment incorrectly commented lines.

* Move TransformSpecTest to the proper module.

* Add druid.indexer.task.ignoreTimestampSpecForDruidInputSource setting.

* Fix.

* Fix build.

* Checkstyle.

* Misc fixes.

* Fix test.

* Move config.

* Fix imports.

* Fixup.

* Fix ShuffleResourceTest.

* Add import.

* Smarter exclusions.

* Fixes based on tests.

Also, add TIME_COLUMN constant in the web console.

* Adjustments for tests.

* Reorder test data.

* Update docs.

* Update docs to say Druid 0.22.0 instead of 0.21.0.

* Fix test.

* Fix ITAutoCompactionTest.

* Changes from review & from merging.
2021-03-25 10:32:21 -07:00
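The idea behind the column computation in the DruidInputSource commit above (derive the columns to read from the timestamp, dimensions, transforms, and metrics) can be sketched like this. The function below is a hypothetical stand-in, not `InputRowSchemas.fromDataSchema` or `ColumnsFilter`; its name and argument shapes are assumptions.

```python
def required_columns(timestamp_column, dimensions, metric_fields, transform_inputs):
    """Compute the set of input columns an ingestion spec needs to read.

    timestamp_column: column named by the timestampSpec.
    dimensions: dimension column names.
    metric_fields: input fields read by the metric aggregators.
    transform_inputs: hypothetical mapping of transform output name ->
        the input columns that transform reads.
    """
    required = {timestamp_column}
    required.update(dimensions)
    required.update(metric_fields)
    # Columns used only as transform inputs must be read too; leaving them
    # out is the kind of bug the commit message describes.
    for inputs in transform_inputs.values():
        required.update(inputs)
    return required
```

For example, `required_columns("__time", ["page"], ["added"], {"page": ["title", "namespace"]})` includes `title` and `namespace` even though neither is listed as a dimension or metric.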
Maytas Monsereenusorn c87ac0823f
Auto-compaction with segment granularity retrieve incomplete segments from timeline when interval overlap (#11019)
* Fix Auto-compaction with segment granularity retrieve incomplete segments from timeline when interval overlap

* Fix Auto-compaction with segment granularity retrieve incomplete segments from timeline when interval overlap

* Fix Auto-compaction with segment granularity retrieve incomplete segments from timeline when interval overlap

* Fix Auto-compaction with segment granularity retrieve incomplete segments from timeline when interval overlap

* address comments
2021-03-24 11:37:29 -07:00
Jihoon Son a041933017
Allow overlapping intervals for the compaction task (#10912)
* Allow overlapping intervals for the compaction task

* unused import

* line indentation

Co-authored-by: Maytas Monsereenusorn <maytasm@apache.org>
2021-03-23 11:21:54 -07:00
Maytas Monsereenusorn 51d2c61f1c
Auto-compaction with segment granularity should skip segments that already have the configured segmentGranularity (#11009)
* Auto-compaction with segment granularity should skip segments that already have the configured segmentGranularity

* Auto-compaction with segment granularity should skip segments that already have the configured segmentGranularity

* Auto-compaction with segment granularity should skip segments that already have the configured segmentGranularity

* address comments

* address comments

* address comments

* address comments

* address comments
2021-03-19 17:38:28 -07:00
Samarth Jain 5fae7dfcf2
Fix regression introduced by #11008 (#11013)
* Fix regression introduced by #11008

* Add back and tweak the check to not inspect resources for authorization when AllowAllAuthorizer is configured.
Add a unit test to validate that the change doesn't introduce new behavior.
2021-03-19 17:15:03 -07:00
Maytas Monsereenusorn f19c2e9ce4
If ingested data has sparse columns, the ingested data with forceGuaranteedRollup=true can result in imperfect rollup and final dimension ordering can be different from dimensionSpec ordering in the ingestionSpec (#10948)
* add IT

* add IT

* add the fix

* fix checkstyle

* fix compile

* fix compile

* fix test

* fix test

* address comments
2021-03-18 17:04:28 -07:00
Samarth Jain 83fcab1d0f
Improve performance of queries against SYSTEM.SEGMENT table. (#11008)
Size HashMap and HashSet appropriately. Perf analysis of the queries
revealed that over 25% of the query time was spent in resizing HashMap and HashSet
collections. Also, prevent the need to examine and authorize all resources when
AllowAllAuthorizer is the configured authorizer.
2021-03-17 22:24:02 -07:00
Atul Mohan 3d7e7c2c83
Avoid deletion of load/drop entry from CuratorLoadQueuePeon in case of load timeout (#10213)
* Skip queue removal on timeout

* Clarify error

* Add new config to control replication

Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>
2021-03-17 11:34:05 -07:00
Xavier Léauté 1061faa6ba
prefer string concatenation over String.format in performance sensitive code (#10997)
String.format relies on regex parsing, which makes these calls expensive
at higher request volumes.
2021-03-16 22:06:26 -07:00
Maytas Monsereenusorn f37713dc6d
Fix auto compaction with mixed versions in the same time chunk based on new segment granularity (#11000)
2021-03-16 12:48:19 -07:00
zhangyue19921010 3277479ff7
[Minor]Add metadata-related logs and missing UT for kill tasks. (#10956)
* logs more info when delete segments && add deleteSegments-related UT

* revert misc.xml

* code review

* use log.debugSegments

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-03-11 18:00:52 -08:00
Xavier Léauté d26e1bc70d
update code check plugins for Java 15 support (#10978)
* update maven-forbidden-api plugin to 3.1
* update maven-pmd-plugin to 3.14
* update spotbugs to 4.2.2
* fixes validation failures newly caught by those updates
  - fix SpotBugs NP_NONNULL_PARAM_VIOLATION
  - fix PMD UnnecessaryFullyQualifiedName
2021-03-11 07:31:41 -08:00
Abhishek Agarwal c66951a59e
Add flag in SQL to disable left base filter optimization for joins (#10947)
* Add flag to disable left base filter

* code coverage

* Draft

* Review comments

* code coverage

* add docs

* Add old tests
2021-03-09 13:07:34 -08:00
Abhishek Agarwal 489f5b1a03
Avoid expensive findEntry call in segment metadata query (#10892)
* Avoid expensive findEntry call in segment metadata query

* other places

* Remove findEntry

* Fix add cost

* Refactor a bit

* Add performance test

* Add comment

* Review comments

* intellij
2021-03-08 22:08:33 -08:00
Jihoon Son 9946306d4b
Add configurations for allowed protocols for HTTP and HDFS inputSources/firehoses (#10830)
* Allow only HTTP and HTTPS protocols for the HTTP inputSource

* rename

* Update core/src/main/java/org/apache/druid/data/input/impl/HttpInputSource.java

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* fix http firehose and update doc

* HDFS inputSource

* add configs for allowed protocols

* fix checkstyle and doc

* more checkstyle

* remove stale doc

* remove more doc

* Apply doc suggestions from code review

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>

* update hdfs address in docs

* fix test

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
2021-03-06 11:43:00 -08:00
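A protocol allow list of the kind described in the inputSource commit above can be sketched as a scheme check on each URI. This is a minimal illustration, not Druid's code: the function name and the default of `http` and `https` are assumptions drawn from the commit message.

```python
from urllib.parse import urlparse


def check_allowed_protocol(uri, allowed_protocols=("http", "https")):
    """Return the URI unchanged if its scheme is allowed, else raise ValueError."""
    scheme = urlparse(uri).scheme.lower()
    allowed = {protocol.lower() for protocol in allowed_protocols}
    if scheme not in allowed:
        raise ValueError(
            f"Protocol [{scheme}] is not allowed; allowed protocols: {sorted(allowed)}"
        )
    return uri
```

Checking against an explicit allow list, instead of blocking known-bad schemes, means a URI such as `file:///etc/passwd` is rejected by default without having to be anticipated.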
zhangyue19921010 bddacbb1c3
Dynamic auto scale Kafka-Stream ingest tasks (#10524)
* druid task auto scale based on kafka lag

* fix kafkaSupervisorIOConfig and KinesisSupervisorIOConfig

* druid task auto scale based on kafka lag

* fix kafkaSupervisorIOConfig and KinesisSupervisorIOConfig

* test dynamic auto scale done

* auto scale tasks tested on prd cluster

* auto scale tasks tested on prd cluster

* modify code style to solve 29055.10 29055.9 29055.17 29055.18 29055.19 29055.20

* rename test fiel function

* change codes and add docs based on capistrant reviewed

* modify test docs

* modify docs

* modify docs

* modify docs

* merge from master

* Extract the autoScale logic out of SeekableStreamSupervisor to minimize putting more stuff inside there &&  Make autoscaling algorithm configurable and scalable.

* fix ci failed

* revert misc.xml

* add uts to test autoscaler create && scale out/in and kafka ingest with scale enable

* add more uts

* fix inner class check

* add IT for kafka ingestion with autoscaler

* add new IT in groups=kafka-index named testKafkaIndexDataWithWithAutoscaler

* review change

* code review

* remove unused imports

* fix NLP

* fix docs and UTs

* revert misc.xml

* use jackson to build autoScaleConfig with default values

* add uts

* use jackson to init AutoScalerConfig in IOConfig instead of Map<>

* autoscalerConfig interface and provide a defaultAutoScalerConfig

* modify uts

* modify docs

* fix checkstyle

* revert misc.xml

* modify uts

* reviewed code change

* reviewed code change

* code reviewed

* code review

* log changed

* do StringUtils.encodeForFormat when create allocationExec

* code review && limit taskCountMax to partitionNumbers

* modify docs

* code review

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-03-06 14:36:52 +05:30
Abhishek Agarwal 1a15987432
Supporting filters in the left base table for join datasources (#10697)
* where filter left first draft

* Revert changes in calcite test

* Refactor a bit

* Fixing the Tests

* Changes

* Adding tests

* Add tests for correlated queries

* Add comment

* Fix typos
2021-03-04 10:39:21 -08:00
Maytas Monsereenusorn 23333914c7
add javadoc and test (#10938)
2021-03-03 11:34:00 +08:00
Maytas Monsereenusorn b7b0ee8362
Add query granularity to compaction task (#10900)
* add query granularity to compaction task

* fix checkstyle

* fix checkstyle

* fix test

* fix test

* add tests

* fix test

* fix test

* cleanup

* rename class

* fix test

* fix test

* add test

* fix test
2021-03-02 11:23:52 -08:00
Gian Merlino 07902f607b
Granularity: Introduce primitive-typed bucketStart, increment methods. (#10904)
* Granularity: Introduce primitive-typed bucketStart, increment methods.

Saves creation of unnecessary DateTime objects in timestamp_floor and
timestamp_ceil expressions.

* Fix style.

* Amp up the test coverage.
2021-02-25 07:59:20 -08:00
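The primitive-typed approach in the Granularity commit above amounts to doing bucket arithmetic directly on epoch milliseconds instead of on date-time objects. The sketch below assumes a fixed-duration, UTC-aligned granularity (such as one hour); calendar-based or time-zone-aware granularities would need more than this. The function names mirror the commit title, but the bodies are assumptions, not Druid's code.

```python
def bucket_start(timestamp_millis, period_millis):
    """Floor an epoch-millisecond timestamp to the start of its bucket."""
    # Python's % is non-negative for a positive modulus, so this also
    # floors correctly for timestamps before the epoch.
    return timestamp_millis - (timestamp_millis % period_millis)


def increment(bucket_start_millis, period_millis):
    """Return the start of the next bucket."""
    return bucket_start_millis + period_millis


def timestamp_ceil(timestamp_millis, period_millis):
    """Ceil a timestamp: unchanged if already on a boundary, else next bucket."""
    start = bucket_start(timestamp_millis, period_millis)
    if start == timestamp_millis:
        return timestamp_millis
    return increment(start, period_millis)
```

Everything here stays in plain integers, which is the property the commit relies on to avoid creating intermediate objects in `timestamp_floor` and `timestamp_ceil`.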
Abhishek Agarwal 3a0a0c033f
Reload segment usage when starting the process (#10884)
* Reload segment usage when starting the process

* doc

* Add more tests

* remove forbidden method

* Add alert
2021-02-22 00:08:44 -08:00
Maytas Monsereenusorn f5bfccc720
Fix maxBytesInMemory for heap overhead of all sinks and hydrants check (#10891)
* fix maxBytesInMemory

* fix maxBytesInMemory check

* fix maxBytesInMemory check

* fix test
2021-02-18 21:48:57 -08:00
Jonathan Wei 8ad68135c8
Filter unauthorized views in InformationSchema (#10874)
* Filter unauthorized views in InformationSchema

* Use fixed name for view schema

* Remove unused string
2021-02-16 17:36:45 -08:00
Maytas Monsereenusorn 6541178c21
Support segmentGranularity for auto-compaction (#10843)
* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* resolve conflict

* Support segmentGranularity for auto-compaction

* Support segmentGranularity for auto-compaction

* fix tests

* fix more tests

* fix checkstyle

* add unit tests

* fix checkstyle

* fix checkstyle

* fix checkstyle

* add unit tests

* add integration tests

* fix checkstyle

* fix checkstyle

* fix failing tests

* address comments

* address comments

* fix tests

* fix tests

* fix test

* fix test

* fix test

* fix test

* fix test

* fix test

* fix test

* fix test
2021-02-12 03:03:20 -08:00
Jihoon Son 1ec3f0bd73
Revert "Add support for Blacklisting some domains for HTTPInputSource (#10535)" (#10871)
This reverts commit 6b14bdb3a5.
2021-02-09 17:51:26 -08:00