Commit Graph

11486 Commits

Author SHA1 Message Date
Jonathan Wei 74c876e578
Throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder (#12080)
* Add option to throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder

* Remove option
2022-01-13 12:36:51 -06:00
Clint Wylie 1dba089a62
fix array type strategy write size tracking (#12150)
* fix array type strategy write size tracking

* fix checkstyle
2022-01-13 10:22:40 -08:00
Xavier Léauté e56ea31697
follow-up to fix formatting broken in #12147 (#12148)
follow-up to #12147 to fix the build
2022-01-12 20:59:32 -08:00
Jihoon Son 58378aa967
Move gcs-connector from lib to hadoop-dependencies for integration test (#12144) 2022-01-12 16:47:34 -08:00
Xavier Léauté 168187e6df
avoid unnecessary String.format calls in IdUtils.validateId (#12147)
Based on profiling data, about 25% of the time de-serializing DataSchema
is spent on formatting strings in validateId.

This can add up quickly, especially when de-serializing task information
in the overlord, where in can consume almost 2% of CPU if there are many
tasks.

Since the formatting is unnecessary unless the checks fail, we can
leverage the built-in formatting of Preconditions.checkArgument instead
to avoid the cost.
2022-01-12 16:34:40 -08:00
Vadim Ogievetsky 9cd52ed914
Web console: make range partitioning a first class citizen of the console (#12146)
* first class support for range partitioning

* update e2e tests
2022-01-12 03:50:10 -08:00
Clint Wylie f2ce76966c
add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures less ambiguous (#12145)
* add EARLIEST_BY/LATEST_BY to make EARLIEST/LATEST function signatures unambiguous

* switcheroo

* EARLIEST_BY/LATEST_BY use timestamp instead of numeric types, update docs

* revert unintended change

* fix docs

* fix docs better
2022-01-12 03:48:53 -08:00
Laksh Singla fae73800a7
Set plannerContext error when cannot query external datasources and when insert is not supported. (#12136)
This PR aims to add plannerContext.setPlanningError whenever external table scan rule is invoked, without the queryMaker having the ability to do so.
2022-01-12 15:11:17 +05:30
Rohan Garg 81f0aba6cb
Use ListFilteredVirtualColumn for left/fact table expression in join condition (#12127)
* Pass VirtualColumnRegistry in PlannerContext for join expression planning

* Allow for including VCs from join fact table expression

* Optmize MV_FILTER functions to use a VC when in join fact table expression

* fixup! Allow for including VCs from join fact table expression

* Address review comments
2022-01-11 14:47:13 -08:00
imply-cheddar b153cb2342
Add a small LRU cache and use utf8 bytes in ArrayOfDoubles (#12130)
* Add a small LRU cache and use utf8 bytes in ArrayOfDoubles

* Add tests for extra branches

* Even more tests for branch coverage

* Fix Style
2022-01-11 13:04:11 -08:00
somu-imply 08fea7a46a
input type validation for datasketches hll "build" aggregator factory (#12131)
* Ingestion will fail for HLLSketchBuild instead of creating with incorrect values

* Addressing review comments for HLL< updated error message introduced test case
2022-01-11 12:00:14 -08:00
imply-cheddar eb0bae49ec
Update PostAggregator to be backwards compat (#12138)
This change mimics what was done in PR #11917 to
fix the incompatibilities produced by #11713. #11917
fixed it with AggregatorFactory by creating default
methods to allow for extensions built against old
jars to still work.  This does the same for PostAggregator
2022-01-11 02:18:14 -08:00
Clint Wylie 7cf9192765
fix delegated smoosh writer and some new facilities for segment writeout medium (#12132)
* fix delegated smoosh writer and some new facilities for segment writeout medium
changes:
* fixed issue with delegated `SmooshedWriter` when writing files that look like paths, causing `NoSuchFileException` exceptions when attempting to open a channel to the file
* `FileSmoosher.addWithSmooshedWriter` when _not_ delegating now checks that it is still open when closing, making it a no-op if already closed (allowing column serializers to add additional files and avoid delegated mode if they are finished writing out their own content and ned to add additional files)
* add `makeChildWriteOutMedium` to `SegmentWriteOutMedium` interface, which allows users of a shared medium to clean up `WriteOutBytes` if they fully control the lifecycle. there are no callers of this yet, adding for future functionality
* `OnHeapByteBufferWriteOutBytes` now can be marked as not open so it `OnHeapMemorySegmentWriteOutMedium` can now behave identically to other medium implementations

* fix to address nit - use AtomicLong
2022-01-10 22:25:19 -08:00
Frank Chen c8ddf60851
Upgrade RSA Key from 1024 bit to 4096 to eliminate warnings (#11743)
* eliminate warnings

* Change the keyStore type to PKCS12
2022-01-11 13:24:09 +08:00
Clint Wylie e583033231
add 'TypeStrategy' to types (#11888)
* add TypeStrategy - value comparators and binary serialization for any TypeSignature
2022-01-10 17:12:14 -08:00
Vadim Ogievetsky 2a41b7bffa
Web console: correctly cancel JSON shaped SQL queries (#12134)
* misc fixes

* type typo
2022-01-10 14:24:05 -08:00
Laksh Singla 7c17341caa
Return empty result when a group by gets optimized to a timeseries query (#12065)
Related to #11188

The above mentioned PR allowed timeseries queries to return a default result, when queries of type: select count(*) from table where dim1="_not_present_dim_" were executed. Before the PR, it returned no row, after the PR, it would return a row with value of count(*) as 0 (as expected by SQL standards of different dbs).

In Grouping#applyProject, we can sometimes perform optimization of a groupBy query to a timeseries query if possible (when the keys of the groupBy are constants, as generated by automated tools). For example, in select count(*) from table where dim1="_present_dim_" group by "dummy_key", the groupBy clause can be removed. However, in the case when the filter doesn't return anything, i.e. select count(*) from table where dim1="_not_present_dim_" group by "dummy_key", the behavior of general databases would be to return nothing, while druid (due to above change) returns an empty row. This PR aims to fix this divergence of behavior.

Example cases:

select count(*) from table where dim1="_not_present_dim_" group by "dummy_key".
CURRENT: Returns a row with count(*) = 0
EXPECTED: Return no row

select 'A', dim1 from foo where m1 = 123123 and dim1 = '_not_present_again_' group by dim1
CURRENT: Returns a row with ('A', 'wat')
EXPECTED: Return no row

To do this, a boolean droppedDimensionsWhileApplyingProject has been added to Grouping which is true whenever we make changes to the original shape with optimization. Hence if a timeseries query has a grouping with this set to true, we set skipEmptyBuckets=true in the query context (i.e. donot return any row).
2022-01-07 21:53:48 +05:30
Vadim Ogievetsky 2299eb321e
Standardizing SQL function docs (#12091)
* fix typos in SQL function docs

* more code

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* Update docs/querying/sql.md

Co-authored-by: Frank Chen <frankchen@apache.org>

* a few more expr, fixes

* more fixes

* quote TIME_SHIFT

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* undo header change

Co-authored-by: Frank Chen <frankchen@apache.org>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2022-01-06 23:57:03 -08:00
Marcelo R Costa c28b2834a1
Add http response status code to org.eclipse.jetty.server.RequestLog (#12116)
* Add http response status code to org.eclipse.jetty.server.RequestLog

* http response code is expressed as an int. Set log msg interpolation based on digit

* trying to add an unit test to verify if the logger.debug method is called

* trying to add an unit test to verify if the logger.debug method is called

* fix compilation issues

* remove test
2022-01-06 20:10:01 +08:00
Jihoon Son 4a74c5adcc
Use Druid's extension loading for integration test instead of maven (#12095)
* Use Druid's extension loading for integration test instead of maven

* fix maven command

* override config path

* load input format extensions and kafka by default; add prepopulated-data group

* all docker-composes are overridable

* fix s3 configs

* override config for all

* fix docker_compose_args

* fix security tests

* turn off debug logs for overlord api calls

* clean up stuff

* revert docker-compose.yml

* fix override config for query error test; fix circular dependency in docker compose

* add back some dependencies in docker compose

* new maven profile for integration test

* example file filter
2022-01-05 23:33:04 -08:00
Victoria Lim 6846622080
Docs: add FILTER to sql query syntax (#12093)
* docs: add FILTER to sql query syntax

* Update docs/querying/sql.md

* Update docs/querying/sql.md

* Update docs/querying/sql.md

* Update docs/querying/sql.md

* move and update FILTER section
2022-01-05 12:59:41 -08:00
Maytas Monsereenusorn b53e7f4d12
Support overlapping segment intervals in auto compaction (#12062)
* add impl

* add impl

* fix more bugs

* add tests

* fix checkstyle

* address comments

* address comments

* fix test
2022-01-04 11:47:38 -08:00
Frank Chen fe71fc414f
Update log4j2 to 2.17.1 (#12106)
Signed-off-by: frank chen <frank.chen021@outlook.com>
2021-12-30 19:18:16 -06:00
Vadim Ogievetsky 476d0bf4be
Web console: remove console.log (#12094)
* rm console.log

* force path-parse to 1.0.7
2021-12-22 19:31:23 -08:00
Vadim Ogievetsky 37112d24e2
Web console: new Ace, diff view, and cleanup. Decorating the console for the holidays 🎁 (#12085)
* Add diff view and upgrade AceEditor

* fix test

* function doc parsing fixes

* escape args

* allowKeys

* everyone gets a diff

* update snapshot
2021-12-22 16:31:17 -08:00
Jonathan Wei 9b598407c1
Add interface for external schema provider to Druid SQL (#12043)
* Add interfce for external schema provider to Druid SQL

* Add annotations
2021-12-22 22:17:57 +05:30
somu-imply 1871a1ab18
ARRAY_AGG and STRING_AGG will through errors if invoked on a complex datatype (#12089) 2021-12-21 17:41:04 -08:00
somu-imply c267b65f97
Removing unused processing threadpool on broker (#12070)
* Thread pool for broker

* Updating two tests to improve coverage for new method added

* Updating druidProcessingConfigTest to cover coverage

* Adding missed spelling errors caused in doc

* Adding test to cover lines of new function added
2021-12-21 13:07:53 -08:00
Frank Chen f345759360
Update to 2.17.0 (#12081) 2021-12-19 20:27:08 -08:00
AmatyaAvadhanula c0b1514177
Segment pruning for multi-dim partitioning given query domain (#12046)
Segment pruning for multi-dim partitioning for a given query

DimensionRangeShardSpec#possibleInDomain has been modified to enhance pruning when multi-dim partitioning is used.

Idea
While iterating through each dimension,

If query domain doesn't overlap with the set of permissible values in the segment, the segment is pruned.
If the overlap happens on a boundary, consider the next dimensions.
If there is an overlap within the segment boundaries, the segment cannot be pruned.
2021-12-17 12:44:43 +05:30
Abhishek Agarwal 5d043cefbc
Fix test in ResponseContextTest (#12077) 2021-12-16 22:51:51 -08:00
Victoria Lim acbeae23b8
New doc for troubleshooting query execution (#12075)
* new doc for troubleshooting query execution

* add doc to sidebar

* Apply suggestions from code review
2021-12-16 17:34:34 -08:00
lokesh-lingarajan 60a3a802b6
Modifying index from druid_segments(datasource, used, end) to druid_segments(datasource, used, end, start) to support kill task (#11894)
This index helps in faster query results during kill task's query on interval based unused segment listing. This can become a bottleneck in some production loads causing coordinator to wait longer for metadata db replies and impacting  Kafka ingestion. The modified index has helped reduce the query times for such queries.
2021-12-16 10:28:20 -08:00
Vadim Ogievetsky 0cc998d8a1
improve spec upgrading (#12072) 2021-12-15 10:28:21 -08:00
Jonathan Wei 3f79453506
Lock count guardrail for parallel single phase/sequential task (#12052)
* Lock count guardrail for parallel single phase/sequential task

* PR comments
2021-12-15 11:12:21 -06:00
Karan Kumar 377edff042
Ingestion metrics doc fix (#12066)
* Ingestion metrics doc fix.

* Fixing typo

* Adding missed keywords in ignore list
2021-12-15 12:51:53 +05:30
Laksh Singla 16642fb278
Fix incorrect type conversion in DruidLogicalValueRule (#11923)
DruidLogicalValuesRule while transforming to DruidRel can return incorrect values, if during the creation of the literal it was created from a float value. The BigDecimal representation stores 123.0, and it seems that using RexLiteral's method while conversion returns the inflated value (which is 1230). I am unsure if this is intentional from Calcite's perspective, and the actual change should be done somewhere else.

Extract the values of INT/LONG from the RexLiteral in the DruidLogicalValuesRule, via BigDecimal.longValue() method.
2021-12-15 10:44:35 +05:30
Victoria Lim 4ede3bbff6
Docs updates (#12069)
* minor updates to docs

* remove en.json
2021-12-14 14:38:18 -08:00
Gian Merlino d917e0433e
Update to log4j 2.16.0. (#12061)
* Update to log4j 2.16.0.

* Update licenses.yaml
2021-12-13 19:06:00 -08:00
Victoria Lim e77bdfa70d
Document query context parameters related to join filters (#12057)
* docs update for query context and filters

* updates from review

* Update docs/querying/filters.md
2021-12-13 17:47:21 -08:00
Clint Wylie e53c3e80ca
set log4j2.is.webapp to false if not set so that shutdown hooks are run (#12056)
* set log4j2.is.webapp to false if not set so that shutdown hooks are run
2021-12-10 21:55:35 -08:00
Lucas Capistrant 761fe9f144
Add new metric that quantifies how long batch ingest jobs waited for segment availability and whether or not that wait was successful (#12002)
* add a unit test that tests that new metric is emitted

* remove unused import

* clarify in doc that this is for batch tasks

* fix IndexTaskTest
2021-12-10 11:40:52 -06:00
Suneet Saldanha ffa4783ce8
Adjust log4j version in licenses.yaml. (#12053)
Co-authored-by: Gian Merlino <gian@imply.io>
2021-12-10 08:12:54 -08:00
Xavier Léauté 19316018b8
update log4j to 2.15.0 to address security vulnerabilities (#12051) 2021-12-09 22:34:54 -08:00
Clint Wylie 244c2559e9
fix IncrementalIndex performance regression (#12048)
changes:
* IncrementalIndex is now a ColumnInspector
* fixes performance regression from using map of ColumnCapabilities from IncrementalIndex as a RowSignature
2021-12-09 22:04:32 -08:00
Suneet Saldanha 25ac04e067
MySqlFirehoseDatabaseConnector uses configured driver class name (#12049) 2021-12-09 20:58:55 -08:00
Frank Chen 58245b4617
Support JsonPath functions in JsonPath expressions (#11722)
* Add jsonPath functions support

* Add jsonPath function test for Avro

* Add jsonPath function length() to Orc

* Add jsonPath function length() to Parquet

* Add more tests to ORC format

* update doc

* Fix exception during ingestion

* Add IT test case

* Revert "Fix exception during ingestion"

This reverts commit 5a5484b9ea.

* update IT test case

* Add 'keys()'

* Commit IT test case

* Fix UT
2021-12-10 10:53:23 +08:00
imply-cheddar a8b916576d
Allow for appending tasks to co-exist with each other. (#12041)
* Allow for appending tasks to co-exist with each other.

Add a config parameter for appending tasks to allow them to
use a SHARED lock.  This will allow multiple appending tasks
to add segments to the same datasource at the same time.

This config should actually be the default, but it is added
as a config to enable a smooth transition/validation in
production settings before forcing it as the default
behavior going forward.

This change leverages the TaskLockType.SHARED that existed
previously, this used to carry the semantics of a READ lock,
which was "escalated" when the task wanted to actually
persist the segment.  As of many moons before this diff, the
SHARED lock had stopped being used but was still piped into
the code.  It turns out that with a few tweaks, it can be
adjusted to be a shared lock for append tasks to allow them
all to write to the same datasource, so that is what this does.

* Can only reuse the shared lock if using the same groupId

* Need to serialize out the task lock type

* Adjust Unit tests to expect new field in JSON
2021-12-09 16:46:40 -08:00
Jonathan Wei 229f82a6f0
Add parse error list API for stream supervisors, use structured object for parse exceptions, simplify parse exception message (#11961)
* Add parse error list API for stream supervisors, simplify parse exception message

* Add input string to parse exception

* Use structured ParseExceptionReport

* Fix tests

* Add test

* PR comments, add ParseExceptionReport equals verifier

* Fix test
2021-12-09 15:42:55 -06:00
Xavier Léauté ffc5ade506
Remove use of deprecated PMD ruleset (#12044)
* Remove use of deprecated PMD ruleset

This fixes annoying warnings we were getting during build.

- Use a custom PMD ruleset, since the built-in one uses deprecated rules.
- UnnecessaryImport replaces most of the deprecated rules
- Update maven-pmd-plugin to 3.15
- Exclude ancient asm version from caliper, since this was causing
  incompatibility warnings with PMD and could also affect our tests runs
  in unexpected ways
2021-12-09 13:04:27 -08:00