Simplifies logic for callers that only want to get a list of all the
column names, or column names and types. Updated callers SegmentAnalyzer,
HashJoinSegmentStorageAdapter, and DruidSegmentReader.
* SQL INSERT planner support.
The main changes are:
1) DruidPlanner is able to validate and authorize INSERT queries. They
require WRITE permission on the target datasource.
2) QueryMaker is now an interface, and there is a QueryMakerFactory that
creates instances of it. There is only one production implementation
of each (NativeQueryMaker and NativeQueryMakerFactory), which
together behave the same way as the former QueryMaker class. But this
opens the door to executing queries in ways other than the Druid
query stack, and is used by unit tests (CalciteInsertDmlTest) to
test the INSERT planning functionality.
3) Adds an EXTERN table macro that allows references external data using
InputSource and InputFormat from Druid's batch ingestion API. This is
not exposed in production yet, but is used by unit tests.
4) Adds a QueryFeature concept that enables the planner to change its
behavior slightly depending on the capabilities of the execution
system.
5) Adds an "AuthorizableOperator" concept that enables SqlOperators
to require additional permissions. This is used by the EXTERN table
macro.
Related odds and ends:
- Add equals, hashCode, toString methods to InlineInputSource. Aids in
the "from external" tests in CalciteInsertDmlTest.
- Add JSON-serializability to RowSignature.
- Move the SQL string inside PlannerContext so it is "baked into" the
planner when the planner is created. Cleans up the code a bit, since
in practice, the same query is passed in every time to the
same planner anyway.
* Fix up calls to CalciteTests.createMockQueryLifecycleFactory.
* Fix checkstyle issues.
* Adjustments for CI.
* Adjust DruidAvaticaHandlerTest for stricter test authorizations.
Important because an earlier call to getCachedColumn may have been
done with a different class, leading to a ClassCastException on the
second call. In the prior code, this could happen if a complex column
had makeDimensionSelector called on it after makeColumnValueSelector had
already been called.
Usually, "execute" is called by methods defined in the superclass
AbstractExecutorService, and the passed-in Runnable has been wrapped
by newTaskFor inside a PrioritizedListenableFutureTask. But this method
can also be called directly, and if so, the same wrapping is necessary
for the delegate to get a Runnable that can be entered into a priority
queue with the others.
* add back and deprecate aggregator factory methods so i can say i told you so when i delete these later
* rename to make less ambiguous, fix fill method
* adjust
* Scan: Add "orderBy" parameter.
This patch adds an API for requesting non-time orderings, although it
does not actually add the ability to execute such queries.
The changes are done in such a way that no matter how Scan query objects
are constructed, they will have a correct "getOrderBy". This will enable
us to switch the execution to exclusively use "getOrderBy" later on when
it's implemented.
Scan queries are serialized such that they only include "order" (time
order) if the ordering is time-based, and they only include "orderBy" if
the ordering is non-time-based. This maximizes compatibility with
the existing API while also providing a clean look for formatted queries.
Because this patch does not include execution logic, if someone actually
tries to run a query with non-time ordering, then they will get an error
like "Cannot execute query with orderBy [quality ASC]".
* SQL module fixes.
* Add spotbugs-exclude.
* Remove unused method.
Add method ShardSpec.getType() to get name of shard spec type
List all names of shard spec types in the interface ShardSpec itself
for easy reference and maintenance
Add dimension partitioningType to metric segment/added/bytes
PR #11882 introduced a type comparison using ==, but while it was in flight,
another PR #11713 changed the type enum to a class. So the comparison should
properly be done with "equals".
There are 3 types of query IDs - id, subQueryId, sqlQueryId. Currently, whenever a query generates subqueries, the subquery's subQueryId is populated randomly. Also, subquery's Id is not set to the parent query Id. Therefore there is no way of linking the subqueries to the parent query, and one loses the ability to look at end to end view of the query.
This PR aims to implement following couple of things:
Populate the subqueries with it's parent's id (and sqlQueryId if present)
Populate the subqueryId such that it forms a hierarchical relationship amongs themselves. For example, if there is a query which launches a subquery, which in turn launches a couple of subqueries, then the ids and subQueryIds should have following structure.
* revert ColumnAnalysis type, add typeSignature and use it for DruidSchema
* review stuffs
* maybe null
* better maybe null
* Update docs/querying/segmentmetadataquery.md
* Update docs/querying/segmentmetadataquery.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* fix null right
* sad
* oops
* Update batch_hadoop_queries.json
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* RowBasedSegment: Use Sequence instead of Iterable.
The main reason this is good is that Sequences can include baggage that
must be closed after iteration is finished. This enables creating
RowBasedSegments on top of closeable sequences of rows.
To preserve the optimization that allows reversing a List without
copying it, this patch also makes SimpleSequence its own class and allows
extracting the Iterable that was used to create it.
* Fix tests.
* Add Finalization option to RowSignature.addAggregators.
This make type signatures more useful when the caller knows whether it will
be reading aggregation results in their finalized or intermediate types.
* Fix call site.
* add missing json type for ListFilteredVirtualColumn, and tests to try to avoid this happening again
* fixes
* ugly, but maybe this
* oops
* too many mappers
* Remove StorageAdapter.getColumnTypeName.
It was only used by SegmentAnalyzer, and isn't necessary anymore due to
the recent improvements to ColumnCapabilities.
Also: tidy ColumnDescriptor.read slightly by removing an instanceof
check, and moving the relevant logic into ComplexColumnPartSerde.
* Fix spellings.
This could happen for right or full outer joins in certain cases. Tests
weren't catching this because existing Cursor implementations generally
ignore extraneous calls to "advance". So, to help catch this in tests,
extra state validations are also added to RowWalker, which is used by
RowBasedSegment.
* RowBasedCursor: Add column-value-reuse optimization.
Most of the logic is in RowBasedColumnSelectorFactory, although in this
patch its only user is RowBasedCursor. This improves performance of
features that use RowBasedSegment, like lookup and inline datasources.
It's especially helpful for inline datasources that contain lengthy
arrays, due to the fact that the transformed array can be reused.
* Changes from code review.
* Fixes for ColumnCapabilitiesImplTest.
* complex typed expressions
* add built-in hll collector expressions to get coverage on druid-processing, more types, more better
* rampage!!!
* more javadoc
* adjustments
* oops
* lol
* remove unused dependency
* contradiction?
* more test
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.
This patch does the following:
- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
had it. (Not all of them did, and that's OK, because it wasn't necessary
anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.
Rationale for removing the offheap incremental index:
- It is only used in one rare scenario: groupBy v1 (which is non-default)
in "useOffheap" mode (also non-default). So you have to go pretty deep
into the wilderness to get this code to activate in production. It is
never used during ingestion.
- Its existence complicates developer efforts to reason about how
aggregators get used, because the way it uses buffer aggregators is so
different from how every other query engine uses them.
- It doesn't have meaningful testing.
By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.
* Remove things that are now unused.
* Revert removal of getFloat, getLong, getDouble from BufferAggregator.
* OAK-related warnings, suppressions.
* Unused item suppressions.
* Remove CloseQuietly and migrate its usages to other methods.
These other methods include:
1) New method CloseableUtils.closeAndWrapExceptions, which wraps IOExceptions
in RuntimeExceptions for callers that just want to avoid dealing with
checked exceptions. Most usages were migrated to this method, because it
looks like they were mainly attempts to avoid declaring a throws clause,
and perhaps were unintentionally suppressing IOExceptions.
2) New method CloseableUtils.closeInCatch, designed to properly close something
in a catch block without losing exceptions. Some usages from catch blocks
were migrated here, when it seemed that they were intended to avoid checked
exception handling, and did not really intend to also suppress IOExceptions.
3) New method CloseableUtils.closeAndSuppressExceptions, which sends all
exceptions to a "chomper" that consumes them. Nothing is thrown or returned.
The behavior is slightly different: with this method, _all_ exceptions are
suppressed, not just IOExceptions. Calls that seemed like they had good
reason to suppress exceptions were migrated here.
4) Some calls were migrated to try-with-resources, in cases where it appeared
that CloseQuietly was being used to avoid throwing an exception in a finally
block.
🎵 You don't have to go home, but you can't stay here... 🎵
* Remove unused import.
* Fix up various issues.
* Adjustments to tests.
* Fix null handling.
* Additional test.
* Adjustments from review.
* Fixup style stuff.
* Fix NPE caused by holder starting out null.
* Fix spelling.
* Chomp Throwables too.
* extract generic dictionary encoded column indexing and merging stuffs to pave the path towards supporting other types of dictionary encoded columns
* spotbugs and inspections fixes
* friendlier
* javadoc
* better name
* adjust
* add ColumnInspector argument to PostAggregator.getType to allow post-aggs to compute their output type based on input types
* add test for test for coverage
* simplify
* Remove unused imports.
Co-authored-by: Gian Merlino <gian@imply.io>
* latest datasketches-java and datasketches-memory
* updated versions of datasketches-java and datasketches-memory
Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
* better type system
* needle in a haystack
* ColumnCapabilities is a TypeSignature instead of having one, INFORMATION_SCHEMA support
* fixup merge
* more test
* fixup
* intern
* fix
* oops
* oops again
* ...
* more test coverage
* fix error message
* adjust interning, more javadocs
* oops
* more docs more better
* add MV_FILTER_ONLY SQL function, and list filter virtual column
* MV_FILTER_NONE and more tests
* formatting
* o yeah, forgot can do easy thing
* style
* hmm why was that there
* test filtering on virtual column
* style
* meh
* do it right
* good bot
* fix goldilocks bug with HashVectorGrouper improperly initializing memory that causes failure when there exists room to only grow one time
* fix unintended change
* cleanup
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.
This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.
To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
This change allows the selection of a specific broker service (or broker tier) by the Router.
The newly added ManualTieredBrokerSelectorStrategy works as follows:
Check for the parameter brokerService in the query context. If this is a valid broker service, use it.
Check if the field defaultManualBrokerService has been set in the strategy. If this is a valid broker service, use it.
Move on to the next strategy
* Add a new metric query/segments/count that is not emitted by default
* docs
* test the default implementation of the metric
* fix spelling error in docs
* document the fact that query retries will result in additional metric emissions
* update using recommended text from @jihoonson
This PR refactors the code related to segment loading specifically SegmentLoader and SegmentLoaderLocalCacheManager. SegmentLoader is marked UnstableAPI which means, it can be extended outside core druid in custom extensions. Here is a summary of changes
SegmentLoader returns an instance of ReferenceCountingSegment instead of Segment. Earlier, SegmentManager was wrapping Segment objects inside ReferenceCountingSegment. That is now moved to SegmentLoader. With this, a custom implementation can track the references of segments. It also allows them to create custom ReferenceCountingSegment implementations. For this reason, the constructor visibility in ReferenceCountingSegment is changed from private to protected.
SegmentCacheManager has two additional methods called - reserve(DataSegment) and release(DataSegment). These methods let the caller reserve or release space without calling SegmentLoader#getSegment. We already had similar methods in StorageLocation and now they are available in SegmentCacheManager too which wraps multiple locations.
Refactoring to simplify the code in SegmentCacheManager wherever possible. There is no change in the functionality.
* improve groupBy query granularity translation when issued from sql layer
* fix style
* use virtual column to determine timestampResult granularity
* dont' apply postaggregators on compute nodes
* relocate constants
* fix order by correctness issue
* fix ut
* use more easier understanding code in DefaultLimitSpec
* address comment
* rollback use virtual column to determine timestampResult granularity
* fix style
* fix style
* address the comment
* add more detail document to explain the tradeoff
* address the comment
* address the comment
* add single input string expression dimension vector selector and better expression planning
* better
* fixes
* oops
* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing
* oops
* javadocs, renaming
* more javadocs
* benchmarks
* use string expression vector processor with vector size 1 instead of expr.eval
* better logging
* javadocs, surprising number of the the
* more
* simplify
This PR refactors the code for QueryRunnerFactory#mergeRunners to accept a new interface called QueryProcessingPool instead of ExecutorService for concurrent execution of query runners. This interface will let custom extensions inject their own implementation for deciding which query-runner to prioritize first. The default implementation is the same as today that takes the priority of query into account. QueryProcessingPool can also be used as a regular executor service. It has a dedicated method for accepting query execution work so implementations can differentiate between regular async tasks and query execution tasks. This dedicated method also passes the QueryRunner object as part of the task information. This hook will let custom extensions carry any state from QuerySegmentWalker to QueryProcessingPool#mergeRunners which is not possible currently.