Commit Graph

409 Commits

Author SHA1 Message Date
Gian Merlino 792e5c58e4
IncrementalIndex#add is no longer thread-safe. (#15697)
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
2024-01-18 03:45:22 -08:00
Gian Merlino d3d0c1c91e
Faster parsing: reduce String usage, list-based input rows. (#15681)
* Faster parsing: reduce String usage, list-based input rows.

Three changes:

1) Reworked FastLineIterator to optionally avoid generating Strings
   entirely, and reduce copying somewhat. Benefits the line-oriented
   JSON, CSV, delimited (TSV), and regex formats.

2) In the delimited (TSV) format, when the delimiter is a single byte,
   split on UTF-8 bytes directly.

3) In CSV and delimited (TSV) formats, use list-based input rows when
   the column list is provided upfront by the user.

* Fix style.

* Fix inspections.

* Restore validation.

* Remove fastutil-extra.

* Exception type.

* Fixes for error messages.

* Fixes for null handling.
2024-01-18 19:18:46 +08:00
Gian Merlino 500681d0cb
Add ImmutableLookupMap for static lookups. (#15675)
* Add ImmutableLookupMap for static lookups.

This patch adds a new ImmutableLookupMap, which comes with an
ImmutableLookupExtractor. It uses a fastutil open hashmap plus two
lists to store its data in such a way that forward and reverse
lookups can both be done quickly. I also observed footprint to be
somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million
row lookup.

The main advantage, though, is that reverse lookups can be done much
more quickly than MapLookupExtractor (which iterates the entire map
for each call to unapplyAll). This speeds up the recently added
ReverseLookupRule (#15626) during SQL planning with very large lookups.

* Use in one more test.

* Fix benchmark.

* Object2ObjectOpenHashMap

* Fixes, and LookupExtractor interface update to have asMap.

* Remove commented-out code.

* Fix style.

* Fix import order.

* Add fastutil.

* Avoid storing Map entries.
2024-01-13 13:14:01 -08:00
Gian Merlino cccf13ea82
Reverse, pull up lookups in the SQL planner. (#15626)
* Reverse, pull up lookups in the SQL planner.

Adds two new rules:

1) ReverseLookupRule, which eliminates calls to LOOKUP by doing
   reverse lookups.

2) AggregatePullUpLookupRule, which pulls up calls to LOOKUP above
   GROUP BY, when the lookup is injective.

Adds configs `sqlReverseLookup` and `sqlPullUpLookup` to control whether
these rules fire. Both are enabled by default.

To minimize the chance of performance problems due to many keys mapping to
the same value, ReverseLookupRule refrains from reversing a lookup if there
are more keys than `inSubQueryThreshold`. The rationale for using this setting
is that reversal works by generating an IN, and the `inSubQueryThreshold`
describes the largest IN the user wants the planner to create.

* Add additional line.

* Style.

* Remove commented-out lines.

* Fix tests.

* Add test.

* Fix doc link.

* Fix docs.

* Add one more test.

* Fix tests.

* Logic, test updates.

* - Make FilterDecomposeConcatRule more flexible.

- Make CalciteRulesManager apply reduction rules til fixpoint.

* Additional tests, simplify code.
2024-01-12 00:06:31 -08:00
Gian Merlino 2231cb30a4
Faster k-way merging using tournament trees, 8-byte key strides. (#15661)
* Faster k-way merging using tournament trees, 8-byte key strides.

Two speedups for FrameChannelMerger (which does k-way merging in MSQ):

1) Replace the priority queue with a tournament tree, which does fewer
   comparisons.

2) Compare keys using 8-byte strides, rather than 1 byte at a time.

* Adjust comments.

* Fix style.

* Adjust benchmark and test.

* Add eight-list test (power of two).
2024-01-11 08:36:22 -08:00
Clint Wylie df5bcd1367
fix bugs with expression virtual column indexes for expression virtual columns which refer to other virtual columns (#15633)
changes:
* ColumnIndexSelector now extends ColumnSelector. The only real implementation of ColumnIndexSelector, ColumnSelectorColumnIndexSelector, already has a ColumnSelector, so this isn't very disruptive
* removed getColumnNames from ColumnSelector since it was not used
* VirtualColumns and VirtualColumn getIndexSupplier method now needs argument of ColumnIndexSelector instead of ColumnSelector, which allows expression virtual columns to correctly recognize other virtual columns, fixing an issue which would incorrectly handle other virtual columns as non-existent columns instead
* fixed a bug with sql planner incorrectly not using expression filter for equality filters on columns with extractionFn and no virtual column registry
2024-01-08 13:10:11 -08:00
Clint Wylie c221a2634b
overhaul DruidPredicateFactory to better handle 3VL (#15629)
* overhaul DruidPredicateFactory to better handle 3VL

fixes some bugs caused by some limitations of the original design of how DruidPredicateFactory interacts with 3-value logic. The primary impacted area was with how filters on values transformed with expressions or extractionFn which turn non-null values into nulls, which were not possible to be modelled with the 'isNullInputUnknown' method

changes:
* adds DruidObjectPredicate to specialize string, array, and object based predicates instead of using guava Predicate
* DruidPredicateFactory now uses DruidObjectPredicate
* introduces DruidPredicateMatch enum, which all predicates returned from DruidPredicateFactory now use instead of booleans to indicate match. This means DruidLongPredicate, DruidFloatPredicate, DruidDoublePredicate, and the newly added DruidObjectPredicate apply methods all now return DruidPredicateMatch. This allows matchers and indexes
* isNullInputUnknown has been removed from DruidPredicateFactory

* rename, fix test

* adjust

* style

* npe

* more test

* fix default value mode to not match new test
2024-01-05 19:08:02 -08:00
Clint Wylie f19ece146f
expression virtual column indexes (#15585)
* ExpressionVirtualColumn + indexes = bff. Expression virtual columns can now use indexes of the underlying columns similar to how expression filters
2024-01-03 21:00:39 -08:00
Zoltan Haindrich 7552dc49fb
Reduce amount of expression objects created during evaluations (#15552)
I was looking into a query which was performing a bit poorly because the case_searched was touching more than 1 columns (if there is only 1 column there is a cache based evaluator).
While I was doing that I've noticed that there are a few simple things which could help a bit:

use a static TRUE/FALSE instead of creating a new object every time
create the ExprEval early for ConstantExpr -s (except the one for BigInteger which seem to have some odd contract)
return early from type autodetection
these changes mostly reduce the amount of garbage the query creates during case_searched evaluation; although ExpressionSelectorBenchmark shows some improvements ~15% - but my manual trials on the taxi dataset with 60M rows showed more improvements - probably due to the fact that these changes mostly only reduce gc pressure.
2023-12-15 16:11:59 +05:30
Clint Wylie e55f6b6202
remove search auto strategy, estimateSelectivity of BitmapColumnIndex (#15550)
* remove search auto strategy, estimateSelectivity of BitmapColumnIndex

* more cleanup
2023-12-13 16:30:01 -08:00
Clint Wylie 5ce4aab3b8
update ARRAY_OVERLAP to plan with ArrayContainsElement for ARRAY columns (#15451)
Updates ARRAY_OVERLAP to use the same ArrayContainsElement filter added in #15366 when filtering ARRAY typed columns so that it can also use indexes like ARRAY_CONTAINS.
2023-11-30 10:05:20 +05:30
Clint Wylie 0516d0dae4
simplify IncrementalIndex since group-by v1 has been removed (#15448) 2023-11-29 14:46:16 -08:00
Clint Wylie 64fcb32bcf
add native 'array contains element' filter (#15366)
* add native arrayContainsElement filter to use array column element indexes
2023-11-29 03:33:00 -08:00
Clint Wylie 97623b408c
add optional 'castToType' parameter to 'auto' column schema (#15417)
* auto but.. with an expected type
2023-11-28 17:19:23 -08:00
Rishabh Singh 8c802e4c9b
Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency (#14985)
In the current design, brokers query both data nodes and tasks to fetch the schema of the segments they serve. The table schema is then constructed by combining the schemas of all segments within a datasource. However, this approach leads to a high number of segment metadata queries during broker startup, resulting in slow startup times and various issues outlined in the design proposal.

To address these challenges, we propose centralizing the table schema management process within the coordinator. This change is the first step in that direction. In the new arrangement, the coordinator will take on the responsibility of querying both data nodes and tasks to fetch segment schema and subsequently building the table schema. Brokers will now simply query the Coordinator to fetch table schema. Importantly, brokers will still retain the capability to build table schemas if the need arises, ensuring both flexibility and resilience.
2023-11-04 19:33:25 +05:30
Laksh Singla 5f86072456
Prepare master for Druid 29 (#15121)
Prepare master for Druid 29
2023-10-11 10:33:45 +05:30
Xavier Léauté adef2069b1
Make unit tests pass with Java 21 (#15014)
This change updates dependencies as needed and fixes tests to remove code incompatible with Java 21
As a result all unit tests now pass with Java 21.

* update maven-shade-plugin to 3.5.0 and follow-up to #15042
  * explain why we need to override configuration when specifying outputFile
  * remove configuration from dependency management in favor of explicit overrides in each module.
* update to mockito to 5.5.0 for Java 21 support when running with Java 11+
  * continue using latest mockito 4.x (4.11.0) when running with Java 8  
  * remove need to mock private fields
* exclude incorrectly declared mockito dependency from pac4j-oidc
* remove mocking of ByteBuffer, since sealed classes can no longer be mocked in Java 21
* add JVM options workaround for system-rules junit plugin not supporting Java 18+
* exclude older versions of byte-buddy from assertj-core
* fix for Java 19 changes in floating point string representation
* fix missing InitializedNullHandlingTest
* update easymock to 5.2.0 for Java 21 compatibility
* update animal-sniffer-plugin to 1.23
* update nl.jqno.equalsverifier to 3.15.1
* update exec-maven-plugin to 3.1.0
2023-10-03 22:41:21 -07:00
Soumyava 8088a763a6
Vectorize earliest aggregator for both numeric and string types (#14408)
* Vectorizing earliest for numeric

* Vectorizing earliest string aggregator

* checkstyle fix

* Removing unnecessary exceptions

* Ignoring tests in MSQ as earliest is not supported for numeric there

* Fixing benchmarks

* Updating tests as MSQ does not support earliest for some cases

* Addressing review comments by adding the following:
1. Checking capabilities first before creating selectors
2. Removing mockito in tests for numeric first aggs
3. Removing unnecessary tests

* Addressing issues for dictionary encoded single string columns where we can use the dictionary ids instead of the entire string

* Adding a flag for multi value dimension selector

* Addressing comments

* 1 more change

* Handling review comments part 1

* Handling review comments and correctness fix for latest_by when the time expression need not be in sorted order

* Updating numeric first vector agg

* Revert "Updating numeric first vector agg"

This reverts commit 4291709901.

* Updating code for correctness issues

* fixing an issue with latest agg

* Adding more comments and removing an unnecessary check

* Addressing null checks for tie selector and only vectorize false for quantile sketches
2023-09-05 08:41:42 -07:00
Clint Wylie 36e659a501
remove group-by v1 (#14866)
* remove group-by v1

* docs

* remove unused configs, fix test

* fix test

* adjustments

* why not

* adjust

* review stuff
2023-08-23 12:44:06 -07:00
Clint Wylie fb053c399c
consolidate json and auto indexers, remove v4 nested column serializer (#14456) 2023-08-22 18:50:11 -07:00
Clint Wylie 194a9c9abc
set druid.expressions.useStrictBooleans to true by default (#14734) 2023-08-22 00:19:56 -07:00
Kashif Faraz c211dcc4b3
Clean up compaction logs on coordinator (#14875)
Changes:
- Move logic of `NewestSegmentFirstIterator.needsCompaction` to `CompactionStatus`
to improve testability and readability
- Capture the list of checks performed to determine if compaction is needed in a readable
manner in `CompactionStatus.CHECKS`
- Make `CompactionSegmentIterator` iterate over instances of `SegmentsToCompact`
instead of `List<DataSegment>`. This allows use of the `umbrellaInterval` later.
- Replace usages of `QueueEntry` with `SegmentsToCompact`
- Move `SegmentsToCompact` out of `NewestSegmentFirstIterator`
- Simplify `CompactionStatistics`
- Reduce level of less important logs to debug
- No change made to tests to ensure correctness
2023-08-21 17:30:41 +05:30
Clint Wylie 6b14dde50e
deprecate config-magic in favor of json configuration stuff (#14695)
* json config based processing and broker merge configs to deprecate config-magic
2023-08-16 18:23:57 -07:00
dependabot[bot] f0a79fa0e4
Bump org.apache.maven.plugins:maven-source-plugin from 2.2.1 to 3.3.0 (#14812)
Bumps [org.apache.maven.plugins:maven-source-plugin](https://github.com/apache/maven-source-plugin) from 2.2.1 to 3.3.0.
- [Commits](https://github.com/apache/maven-source-plugin/compare/maven-source-plugin-2.2.1...maven-source-plugin-3.3.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-source-plugin
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-08-15 07:39:19 -07:00
Clint Wylie e5661a394c
refactor front-coded into static classes instead of using functional interfaces (#14572)
* refactor front-coded into static classes instead of using functional interfaces

* shared v0 static method instead of copy
2023-08-04 10:52:36 -07:00
imply-cheddar 748874405c
Minimize PostAggregator computations (#14708)
* Minimize PostAggregator computations

Since a change back in 2014, the topN query has been computing
all PostAggregators on all intermediate responses from leaf nodes
to brokers.  This generates significant slow downs for queries
with relatively expensive PostAggregators.  This change rewrites
the query that is pushed down to only have the minimal set of
PostAggregators such that it is impossible for downstream
processing to do too much work.  The final PostAggregators are
applied at the very end.
2023-08-04 00:04:31 +05:30
Kashif Faraz 22290fd632
Test: Simplify test impl of LoadQueuePeon (#14684)
Changes
- Rename `LoadQueuePeonTester` to `TestLoadQueuePeon`
- Simplify `TestLoadQueuePeon` by removing dependency on `CuratorLoadQueuePeon`
- Remove usages of mock peons in `LoadRuleTest` and use `TestLoadQueuePeon` instead
2023-07-28 16:14:23 +05:30
Clint Wylie 913416c669
add equality, null, and range filter (#14542)
changes:
* new filters that preserve match value typing to better handle filtering different column types
* sql planner uses new filters by default in sql compatible null handling mode
* remove isFilterable from column capabilities
* proper handling of array filtering, add array processor to column processors
* javadoc for sql test filter functions
* range filter support for arrays, tons more tests, fixes
* add dimension selector tests for mixed type roots
* support json equality
* rename semantic index maker thingys to mostly have plural names since they typically make many indexes, e.g. StringValueSetIndex -> StringValueSetIndexes
* add cooler equality index maker, ValueIndexes 
* fix missing string utf8 index supplier
* expression array comparator stuff
2023-07-18 12:15:22 -07:00
AmatyaAvadhanula 0412f40d36
Prepare master branch for next release, 28.0.0 (#14595)
* Prepare master branch for next release, 28.0.0
2023-07-18 09:22:30 +05:30
Kashif Faraz a6547febaf
Remove unused coordinator dynamic configs (#14524)
After #13197 , several coordinator configs are now redundant as they are not being
used anymore, neither with `smartSegmentLoading` nor otherwise.

Changes:
- Remove dynamic configs `emitBalancingStats`: balancer error stats are always
emitted, debug stats can be logged by using `debugDimensions`
- `useBatchedSegmentSampler`, `percentOfSegmentsToConsiderPerMove`:
batched segment sampling is always used
- Add test to verify deserialization with unknown properties
- Update `CoordinatorRunStats` to always track stats, this can be optimized later.
2023-07-06 12:11:10 +05:30
Clint Wylie 277aaa5c57
remove druid.processing.columnCache.sizeBytes and CachingIndexed, combine string column implementations (#14500)
* combine string column implementations
changes:
* generic indexed, front-coded, and auto string columns now all share the same column and index supplier implementations
* remove CachingIndexed implementation, which I think is largely no longer needed by the switch of many things to directly using ByteBuffer, avoiding the cost of creating Strings
* remove ColumnConfig.columnCacheSizeBytes since CachingIndexed was the only user
2023-07-02 19:37:15 -07:00
Gian Merlino 67fbd8e7fc
Add "stringEncoding" parameter to DataSketches HLL. (#11201)
* Add "stringEncoding" parameter to DataSketches HLL.

Builds on the concept from #11172 and adds a way to feed HLL sketches
with UTF-8 bytes.

This must be an option rather than always-on, because prior to this
patch, HLL sketches used UTF-16LE encoding when hashing strings. To
remain compatible with sketch images created prior to this patch -- which
matters during rolling updates and when reading sketches that have been
written to segments -- we must keep UTF-16LE as the default.

Not currently documented, because I'm not yet sure how best to expose
this functionality to users. I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.

* Fix benchmark.

* Fix style issues, improve test coverage.

* Put round back, to make IT updates easier.

* Fix test.

* Fix issue with filtered aggregators and add test.

* Use DS native update(ByteBuffer) method. Improve test coverage.

* Add another suppression.

* Fix ITAutoCompactionTest.

* Update benchmarks.

* Updates.

* Fix conflict.

* Adjustments.
2023-06-30 12:45:55 -07:00
Kashif Faraz 50461c3bd5
Enable smartSegmentLoading on the Coordinator (#13197)
This commit does a complete revamp of the coordinator to address problem areas:
- Stability: Fix several bugs, add capabilities to prioritize and cancel load queue items
- Visibility: Add new metrics, improve logs, revamp `CoordinatorRunStats`
- Configuration: Add dynamic config `smartSegmentLoading` to automatically set
optimal values for all segment loading configs such as `maxSegmentsToMove`,
`replicationThrottleLimit` and `maxSegmentsInNodeLoadingQueue`.

Changed classes:
- Add `StrategicSegmentAssigner` to make assignment decisions for load, replicate and move
- Add `SegmentAction` to distinguish between load, replicate, drop and move operations
- Add `SegmentReplicationStatus` to capture current state of replication of all used segments
- Add `SegmentLoadingConfig` to contain recomputed dynamic config values
- Simplify classes `LoadRule`, `BroadcastRule`
- Simplify the `BalancerStrategy` and `CostBalancerStrategy`
- Add several new methods to `ServerHolder` to track loaded and queued segments
- Refactor `DruidCoordinator`

Impact:
- Enable `smartSegmentLoading` by default. With this enabled, none of the following
dynamic configs need to be set: `maxSegmentsToMove`, `replicationThrottleLimit`,
`maxSegmentsInNodeLoadingQueue`, `useRoundRobinSegmentAssignment`,
`emitBalancingStats` and `replicantLifetime`.
- Coordinator reports richer metrics and produces cleaner and more informative logs
- Coordinator uses an unlimited load queue for all serves, and makes better assignment decisions
2023-06-19 14:27:35 +05:30
imply-cheddar cfd07a95b7
Errors take 3 (#14004)
Introduce DruidException, an exception whose goal in life is to be delivered to a user.

DruidException itself has javadoc on it to describe how it should be used.  This commit both introduces the Exception and adjusts some of the places that are generating exceptions to generate DruidException objects instead, as a way to show how the Exception should be used.

This work was a 3rd iteration on top of work that was started by Paul Rogers.  I don't know if his name will survive the squash-and-merge, so I'm calling it out here and thanking him for starting on this.
2023-06-19 01:11:13 -07:00
Abhishek Radhakrishnan a5e04d95a4
Add `TYPE_NAME` to the complex serde classes and replace the hardcoded names. (#14317)
* Add TYPE_NAME to the serde classes and reuse them instead of hardcoded strings.

* Static check fixes.
2023-05-23 00:54:47 -05:00
Clint Wylie 90ea192d9c
fix bugs with auto encoded long vector deserializers (#14186)
This PR fixes an issue when using 'auto' encoded LONG typed columns and the 'vectorized' query engine. These columns use a delta based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so impacts Druid versions 0.22.0+.

While writing the test I finally got sad enough about IndexSpec not having a "builder", so I made one, and switched all the things to use it. Apologies for the noise in this bug fix PR, the only real changes are in VSizeLongSerde, and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.
2023-05-01 11:49:27 +05:30
Clint Wylie 1aef72aa7e
Bump up the version in pom to 27.0.0 in preparation of release (#14051) 2023-04-10 14:56:59 +05:30
Clint Wylie b11c0bc249
smarter nested column index utilization (#13977)
* smarter nested column index utilization
changes:
* adds skipValueRangeIndexScale and skipValuePredicateIndexScale to ColumnConfig (e.g. DruidProcessingConfig) available as system config via druid.processing.indexes.skipValueRangeIndexScale and druid.processing.indexes.skipValuePredicateIndexScale
* NestedColumnIndexSupplier uses skipValueRangeIndexScale and skipValuePredicateIndexScale to multiply by the total number of rows to be processed to determine the threshold at which we should no longer consider using bitmap indexes because it will be too many operations
* Default values for skipValueRangeIndexScale and skipValuePredicateIndexScale have been initially set to 0.08, but are separate to allow independent tuning
* these are not documented on purpose yet because they are kind of hard to explain, the mainly exist to help conduct larger scale experiments than the jmh benchmarks used to derive the initial set of values
* these changes provide a pretty sweet performance boost for filter processing on nested columns
2023-04-06 04:09:24 -07:00
Clint Wylie e3211e3be0
actually backwards compatible frontCoded string encoding strategy (#13996) 2023-03-31 02:24:12 -07:00
zachjsh 3bb67721f7
Allow for Input source security in SQL layer (#13989)
This change introduces the concept of input source type security model, proposed in #13837.. With this change, this feature is only available at the SQL layer, but we will expand to native layer in a follow up PR.

To enable this feature, the user must set the following property to true:

druid.auth.enableInputSourceSecurity=true

The default value for this property is false, which will continue the existing functionality of having the usage all external sources being authorized against the hardcoded resource action

new ResourceAction(new Resource(ResourceType.EXTERNAL, ResourceType.EXTERNAL), Action.READ

When this config is enabled, the users will be required to be authorized for the following resource action

new ResourceAction(new Resource(ResourceType.EXTERNAL, {INPUT_SOURCE_TYPE}, Action.READ

where {INPUT_SOURCE_TYPE} is the type of the input source being used;, http, inline, s3, etc..

Documentation has not been added for the feature as it is not complete at the moment, as we still need to enable this for the native layer in a follow up pr.
2023-03-29 22:15:33 -04:00
Clint Wylie 2219e68fa3
add backwards compat mode for frontCoded stringEncodingStrategy (#13988) 2023-03-28 14:44:44 -07:00
Clint Wylie ed57c5c853
better FrontCodedIndexed (#13854)
* Adds new implementation of 'frontCoded' string encoding strategy, which writes out a v1 FrontCodedIndexed which stores buckets on a prefix of the previous value instead of the first value in the bucket
2023-03-14 18:14:11 -07:00
Gian Merlino fe9d0c46d5
Improve memory efficiency of WrappedRoaringBitmap. (#13889)
* Improve memory efficiency of WrappedRoaringBitmap.

Two changes:

1) Use an int[] for sizes 4 or below.
2) Remove the boolean compressRunOnSerialization. Doesn't save much
   space, but it does save a little, and it isn't adding a ton of value
   to have it be configurable. It was originally configurable in case
   anything broke when enabling it, but it's been a while and nothing
   has broken.

* Slight adjustment.

* Adjust for inspection.

* Updates.

* Update snaps.

* Update test.

* Adjust test.

* Fix snaps.
2023-03-09 15:48:02 -08:00
Paul Rogers 914eebb4b7
Wire up the catalog resolver (#13788)
Introduces the catalog resolver interface
Wires the resolver up to the planner factory
Refactors planner factory
2023-02-22 11:42:32 -08:00
zachjsh 665dee43bf
Revert "Operator conversion deny list (#13766)" (#13829)
This reverts commit 38e620aa4c.
2023-02-21 15:14:49 -08:00
Clint Wylie 08b5951cc5
merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything (#13698)
* merge druid-core, extendedset, and druid-hll into druid-processing to simplify everything
* fix poms and license stuff
* mockito is evil
* allow reset of JvmUtils RuntimeInfo if tests used static injection to override
2023-02-17 14:27:41 -08:00
zachjsh 38e620aa4c
Operator conversion deny list (#13766)
### Description

This change adds a new config property `druid.sql.planner.operatorConversion.denyList`, which allows a user to specify
any operator conversions that they wish to disallow. A user may want to do this for a number of reasons, including security concerns. The default value of this property is the empty list `[]`, which does not disallow any operator conversions.

An example usage of this property is `druid.sql.planner.operatorConversion.denyList=["extern"]`, which disallows the usage of the `extern` operator conversion. If the property is configured this way, and a user of the Druid cluster tries to submit a query that uses the `extern` function, such as the example given [here](https://druid.apache.org/docs/latest/multi-stage-query/examples.html#insert-with-no-rollup), a response with http response code `400` is returned with en error body similar to the following:

```
{
  "taskId": "4ec5b0b6-fa9b-4c3a-827d-2308294e9985",
  "state": "FAILED",
  "error": {
    "error": "Plan validation failed",
    "errorMessage": "org.apache.calcite.runtime.CalciteContextException: From line 28, column 5 to line 32, column 5: No match found for function signature EXTERN(<CHARACTER>, <CHARACTER>, <CHARACTER>)",
    "errorClass": "org.apache.calcite.tools.ValidationException",
    "host": null
  }
}
```
2023-02-10 09:59:26 -08:00
AmatyaAvadhanula dcdae84888
Add server view initialization metrics (#13716)
* Add server view init metrics

* Test coverage

* Rename metrics
2023-02-07 20:02:00 +05:30
Jason Koch 7a3bd89a85
Dimension dictionary reduce locking (#13710)
* perf: introduce benchmark for StringDimensionIndexer

jdk11 -- Benchmark                                                       Mode  Cnt      Score     Error  Units
StringDimensionIndexerProcessBenchmark.parallelReadWrite                 avgt   10  30471.552 ±  456.716  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader  avgt   10  18069.863 ±  327.923  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter  avgt   10  67676.617 ± 2351.311  us/op
StringDimensionIndexerProcessBenchmark.soloReader                        avgt   10   1048.079 ±    1.120  us/op
StringDimensionIndexerProcessBenchmark.soloWriter                        avgt   10   4629.769 ±   29.353  us/op

* perf: switch DimensionDictionary to StampedLock

jdk11 - Benchmark                                                        Mode  Cnt      Score      Error  Units
StringDimensionIndexerProcessBenchmark.parallelReadWrite                 avgt   10  37958.372 ± 1685.206  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader  avgt   10  31192.232 ± 2755.365  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter  avgt   10  58256.791 ± 1998.220  us/op
StringDimensionIndexerProcessBenchmark.soloReader                        avgt   10   1079.440 ±    1.753  us/op
StringDimensionIndexerProcessBenchmark.soloWriter                        avgt   10   4585.690 ±   13.225  us/op

* perf: use optimistic locking in DimensionDictionary

jdk11 - Benchmark                                                        Mode  Cnt      Score     Error  Units
StringDimensionIndexerProcessBenchmark.parallelReadWrite                 avgt   10   6212.366 ± 162.684  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader  avgt   10   1807.235 ± 109.339  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter  avgt   10  19427.759 ± 611.692  us/op
StringDimensionIndexerProcessBenchmark.soloReader                        avgt   10    194.370 ±   1.050  us/op
StringDimensionIndexerProcessBenchmark.soloWriter                        avgt   10   2871.423 ±  14.426  us/op

* perf: refactor DimensionDictionary null handling to need less locks

jdk11 - Benchmark                                                        Mode  Cnt      Score      Error  Units
StringDimensionIndexerProcessBenchmark.parallelReadWrite                 avgt   10   6591.619 ±  470.497  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader  avgt   10   1387.338 ±  144.587  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter  avgt   10  22204.462 ± 1620.806  us/op
StringDimensionIndexerProcessBenchmark.soloReader                        avgt   10    204.911 ±    0.459  us/op
StringDimensionIndexerProcessBenchmark.soloWriter                        avgt   10   2935.376 ±   12.639  us/op

* perf: refactor DimensionDictionary add handling to do a little less work

jdk11 - Benchmark                                                        Mode  Cnt      Score    Error  Units
StringDimensionIndexerProcessBenchmark.parallelReadWrite                 avgt   10   2914.859 ± 22.519  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelReader  avgt   10    508.010 ± 14.675  us/op
StringDimensionIndexerProcessBenchmark.parallelReadWrite:parallelWriter  avgt   10  10135.408 ± 82.745  us/op
StringDimensionIndexerProcessBenchmark.soloReader                        avgt   10    205.415 ±  0.158  us/op
StringDimensionIndexerProcessBenchmark.soloWriter                        avgt   10   3098.743 ± 23.603  us/op
2023-02-01 02:59:12 -08:00
Adarsh Sanjeev 0a486c3bcf
Update forbidden apis with fixed executor (#13633)
* Update forbidden apis with fixed executor
2023-01-12 15:34:36 +05:30