* SQL: Add resolution parameter to quantile agg, rename to APPROX_QUANTILE.
* Fix bug with re-use of filtered approximate histogram aggregators.
Also add APPROX_QUANTILE tests for filtering and running on complex columns.
Includes some slight refactoring to allow tests to make DruidTables that
include complex columns.
* Remove unused import
* SQL: Ditch CalciteConnection layer and add DruidMeta, extension aggregators.
Switched from CalciteConnection to Planner, bringing benefits:
- CalciteConnection's JDBC interface no longer sits between the SQL server
(HTTP/Avatica) and Druid's query layer. Instead, the SQL servers can use
Druid Sequence objects directly, reducing overhead in the query return path.
- Implemented our own Planner-based Avatica Meta, letting us control
connection timeouts and connection / statement limits. The previous
CalciteConnection-based implementation didn't have any limits or timeouts.
- The Planner interface lets us override the operator table, opening up
SQL language extensions. This patch includes two: APPROX_COUNT_DISTINCT
in core, and a QUANTILE aggregator in the druid-histogram extension.
Also:
- Added INFORMATION_SCHEMA metadata schema.
- Added tests for Unicode literals and escapes.
* Verify statement is actually open before closing it.
* More detailed INFORMATION_SCHEMA docs.
* SQL support for nested groupBys.
Allows, for example, doing exact count distinct by writing:
SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)
Contrast with approximate count distinct, which is:
SELECT COUNT(DISTINCT col) FROM druid.foo
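
The nested form above can be tried out on any SQL engine. The following is a minimal sketch using SQLite from Python, not Druid; the table name `foo` and the helper `nested_exact_count` are made up for illustration. SQLite computes COUNT(DISTINCT ...) exactly as well, so this only shows the shape of the nested query (inner query groups, outer query counts), not the approximation.

```python
import sqlite3


def nested_exact_count(values):
    """Count distinct values with a nested query: the inner SELECT DISTINCT
    collapses duplicates, the outer COUNT(*) counts the remaining rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE foo (col TEXT)")
        conn.executemany("INSERT INTO foo (col) VALUES (?)", [(v,) for v in values])
        row = conn.execute(
            "SELECT COUNT(*) FROM (SELECT DISTINCT col FROM foo)"
        ).fetchone()
        return row[0]
    finally:
        conn.close()
```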
* Add deeply-nested groupBy docs, tests, and maxQueryCount config.
* Extract magic constants into statics.
* Rework rules to put preconditions in the "matches" method.
* Add an option to SearchQuery to choose a search query execution strategy.
Supported strategies are:
1) Index-only query execution
2) Cursor-based scan
3) Auto: choose an efficient strategy for a given query
* Add SearchStrategy and SearchQueryExecutor
* Address comments
* Rename strategies and set UseIndexesStrategy as the default strategy
* Add a cost-based planner for auto strategy
* Add documentation
* Fix code style
* apply code style
* apply comments
* add first and last aggregator
* add test and fix
* moving around
* separate aggregator valueType
* address PR comment
* add finalize inner query and adjust v1 inner indexing
* better test and fixes
* java-util import fixes
* PR comments
* Add first/last aggs to ITWikipediaQueryTest
* sortByDimsFirst flag for groupBy query
* Remove need for KeyType in Grouper<KeyType> to be Comparable<KeyType>
* fix review comments
* fix review comments regarding removing code duplication of dim/time comparison
* move comparator for KeyType object to KeySerdeFactory so that creation of comparator does not need KeySerde
* remove unnecessary System.out.println
* access static var NATURAL_NULLS_FIRST directly
* address further review comments
* Add "like" filter.
* Addressed some PR comments.
* Slight simplifications to LikeFilter.
* Additional simplifications.
* Fix comment in LikeFilter.
* Clarify comment in LikeFilter.
* Simplify LikeMatcher a bit.
* No use going through the optimized path if prefix is empty.
* Add more tests.
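
For readers unfamiliar with the idea behind the "like" filter commits above, here is a rough sketch of LIKE matching with a literal-prefix shortcut. It is not the LikeFilter or LikeMatcher code from this patch: the function names, the backslash escape character, and the list-based "index" are assumptions made for illustration. It shows why the prefix path is skipped when the prefix is empty: with no literal prefix, there is nothing to narrow the candidates by.

```python
import re


def like_to_regex(pattern, escape="\\"):
    """Translate a SQL LIKE pattern into (literal_prefix, compiled_regex).

    '%' matches any run of characters, '_' matches exactly one character,
    and the escape character makes the next character literal.
    """
    parts = []
    prefix = []
    in_prefix = True  # still reading the leading run of literal characters
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            i += 1
            parts.append(re.escape(pattern[i]))
            if in_prefix:
                prefix.append(pattern[i])
        elif ch == "%":
            parts.append(".*")
            in_prefix = False
        elif ch == "_":
            parts.append(".")
            in_prefix = False
        else:
            parts.append(re.escape(ch))
            if in_prefix:
                prefix.append(ch)
        i += 1
    return "".join(prefix), re.compile("".join(parts), re.DOTALL)


def like_filter(values, pattern):
    """Return the values matching the LIKE pattern, preserving input order."""
    prefix, regex = like_to_regex(pattern)
    if prefix:
        # Cheap narrowing step; a real index could do a range scan here.
        candidates = [v for v in values if v.startswith(prefix)]
    else:
        # Empty prefix: the shortcut cannot exclude anything, so skip it.
        candidates = values
    return [v for v in candidates if regex.fullmatch(v)]
```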
Also change defaults:
- bufferGrouperMaxLoadFactor from 0.75 to 0.7.
- maxMergingDictionarySize to 100MB from 25MB, should be more appropriate
for most heaps.
Follow-up to #1773, which meant to add more useful query errors but
did not actually do so. Since that patch, any error other than
interrupt/cancel/timeout was reported as `{"error":"Unknown exception"}`.
With this patch, the error fields are:
- error, one of the specific strings "Query interrupted", "Query timeout",
"Query cancelled", or "Unknown exception" (same behavior as before).
- errorMessage, the message of the topmost non-QueryInterruptedException
in the causality chain.
- errorClass, the class of the topmost non-QueryInterruptedException
in the causality chain.
- host, the host that failed the query.
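
As an illustration of the four fields listed above, here is a small sketch of how such a response could be assembled by walking the cause chain. It is written in Python rather than taken from the patch: the `QueryInterruptedException` class is a stand-in, the mapping from exception types to the `error` strings is an assumption, and `errorClass` here is a bare class name rather than a fully-qualified Java class name.

```python
from concurrent.futures import CancelledError


class QueryInterruptedException(Exception):
    """Stand-in for the wrapper exception named in the text above."""


# Assumed mapping from root-cause type to the fixed "error" strings.
ERROR_BY_TYPE = [
    (InterruptedError, "Query interrupted"),
    (TimeoutError, "Query timeout"),
    (CancelledError, "Query cancelled"),
]


def error_response(exc, host):
    """Build the error fields from the topmost non-wrapper exception."""
    cause = exc
    # Skip wrapper exceptions until we reach something more specific,
    # or run out of causes (then the wrapper itself is reported).
    while isinstance(cause, QueryInterruptedException) and cause.__cause__ is not None:
        cause = cause.__cause__

    error = "Unknown exception"
    for exc_type, label in ERROR_BY_TYPE:
        if isinstance(cause, exc_type):
            error = label
            break

    return {
        "error": error,
        "errorMessage": str(cause),
        "errorClass": type(cause).__name__,
        "host": host,
    }
```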
* Add time interval dim filter and retention analysis example
* Use closed-open matching for intervals, update cache key generation
* Fix time filtering tests for interval boundary change
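
A quick sketch of what closed-open interval matching means in practice: the start instant is included and the end instant is excluded, so adjacent intervals never match the same timestamp twice. The function names below are illustrative and not taken from the patch.

```python
from datetime import datetime


def in_interval(ts, start, end):
    """Closed-open check: start is inclusive, end is exclusive."""
    return start <= ts < end


def matches_any(ts, intervals):
    """True if the timestamp falls inside at least one [start, end) interval."""
    return any(in_interval(ts, start, end) for start, end in intervals)
```

With intervals [Jan 1, Jan 2) and [Jan 2, Jan 3), midnight on Jan 2 matches only the second interval.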
* ability to not rollup at index time, make pre aggregation an option
* rename getRowIndexForRollup to getPriorIndex
* fix doc misspelling
* test query using no-rollup indexes
* fix benchmark failure due to JMH bug
* Add numeric StringComparator
* Only use direct long comparison for numeric ordering in BoundFilter, add time filtering benchmark query
* Address PR comments, add multithreaded BoundDimFilter test
* Add comment on strlen tie handling
* Add timeseries interval filter benchmark
* Adjust docs
* Use jackson for StringComparator, address PR comments
* Add new TopNMetricSpec and SearchSortSpec with tests (WIP)
* More TopNMetricSpec and SearchSortSpec tests
* Fix NewSearchSortSpec serde
* Update docs for new DimensionTopNMetricSpec
* Delete NumericDimensionTopNMetricSpec
* Delete old SearchSortSpec
* Rename NewSearchSortSpec to SearchSortSpec
* Add TopN numeric comparator benchmark, address PR comments
* Refactor OrderByColumnSpec
* Add null checks to NumericComparator and String->BigDecimal conversion function
* Add more OrderByColumnSpec serde tests
This fixes a potential issue where groupBy resources could be allocated to
create a Sequence, but then the Sequence is never used, and thus the resources
are never freed.
Also simplifies how groupBy handles config overrides (this made the new
unit test easier to write).
* Support filtering on __time column
* Rename DruidPredicate
* Add docs for ValueMatcherFactory, add comment on getColumnCapabilities
* Combine ValueMatcherFactory predicate methods to accept DruidCompositePredicate
* Address PR comments (support filter on all long columns)
* Use predicate factory instead of composite predicate
* Address PR comments
* Lazily initialize long handling in selector/in filter
* Move long value parsing from InFilter to InDimFilter, make long value parsing thread-safe
* Add multithreaded selector/in filter test
* Fix non-final lock object in SelectorDimFilter
This is actually reasonable for a groupBy or lexicographic topN that is
being used to do a "COUNT DISTINCT" kind of query. No aggregators are
needed for that query, and including a dummy aggregator wastes 8 bytes
per row.
It's kind of silly for timeseries, but why not.
* support alphanumeric sort in search query
* address a comment about handling equals() and hashCode()
* address comments
* add unit tests for string comparators
* address a comment about space indentations.
This patch introduces a GroupByStrategy concept and two strategies: "v1"
is the current groupBy strategy and "v2" is a new one. It also introduces
a merge buffers concept in DruidProcessingModule, to try to better
manage memory used for merging.
Both of these are described in more detail in #2987.
There are two goals of this patch:
1. Make it possible for historical/realtime nodes to return larger groupBy
result sets, faster, with better memory management.
2. Make it possible for brokers to merge streams when there are no order-by
columns, avoiding materialization.
This patch does not do anything to help with memory management on the broker
when there are order-by columns or when there are nested queries. That could
potentially be done in a future patch.
* docs: replace OR by AND inside topnquery docs about multi value dimensions
* docs: replace OR by AND inside groupby docs about multi value dimensions
* Cleanup the base lookup cluster wide config docs
* Add better examples in lookups-cached-global.md
* Use actual valid stock lookups
* Fixed maps with :
* Add mix of lookups
* Better examples in extension
* Remove unneeded namespace requirement
* Add extra line space
* Add link to lookup tiers
* Renamed header
* Async lookups-cached-global by default
* Also better lookup docs
* Fix test timeouts
* Fix timing of deserialized test
* Fix problem with 0 wait failing immediately
There is no such thing as a "Java aggregator" in Druid from a user's point of view; there are just native aggregators that happen to be implemented in Java.
* Datasource as lookup tier
* Adds an option to let indexing service tasks pull their lookup tier from the datasource they are working for.
* Fix bad docs for lookups lookupTier
* Add Datasource name holder
* Move task and datasource to be pulled from Task file
* Make LookupModule pull from bound dataSource
* Fix test
* Fix code style on imports
* Fix formatting
* Make naming better
* Address code comments about naming