Without this transformation, distribution of hash % X is poor in general.
It is catastrophically poor when X is a multiple of 31 (many slots would
be empty).
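A rough illustration of the effect described above, not the transformation from this patch. Java's String.hashCode is a base-31 polynomial, so for short keys (no int overflow) hash % 31 depends only on the last character. The mix() step below is the MurmurHash3 32-bit finalizer, used here only as an assumed stand-in for a mixing transformation; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class HashMixDemo {
    // MurmurHash3 32-bit finalizer; an assumed stand-in for "this transformation".
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Counts how many of the `slots` buckets receive at least one key.
    static int usedSlots(int slots, boolean mixed) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            // Four-character keys that all end in 'a'; short enough that
            // String.hashCode does not overflow an int.
            String key = String.format("%03d", i) + "a";
            int h = mixed ? mix(key.hashCode()) : key.hashCode();
            used.add(Math.floorMod(h, slots));
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("slots used, raw hash:   " + usedSlots(31, false));
        System.out.println("slots used, mixed hash: " + usedSlots(31, true));
    }
}
```

With the raw hash every key lands in the same slot, because 'a' % 31 is the only term that survives the modulus.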
* introducing lists of existing columns in the fields of select queries' output
* rebase master
* address the comment. add test code for select query caching
* change the cache code in SelectQueryQueryToolChest to 0x16
Follow-up to #1773, which meant to add more useful query errors but
did not actually do so. Since that patch, any error other than
interrupt/cancel/timeout was reported as `{"error":"Unknown exception"}`.
With this patch, the error fields are:
- error, one of the specific strings "Query interrupted", "Query timeout",
"Query cancelled", or "Unknown exception" (same behavior as before).
- errorMessage, the message of the topmost non-QueryInterruptedException
in the causality chain.
- errorClass, the class of the topmost non-QueryInterruptedException
in the causality chain.
- host, the host that failed the query.
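A minimal sketch of how the fields above could be derived, assuming a stand-in QueryInterruptedException class and hypothetical helper names (topmostNonQie, errorFields). The mapping of interrupt/cancel/timeout to the specific "error" strings is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryErrorFields {
    // Hypothetical stand-in for the real QueryInterruptedException.
    static class QueryInterruptedException extends RuntimeException {
        QueryInterruptedException(Throwable cause) {
            super(cause);
        }
    }

    // Walks the causality chain and returns the first throwable that is not a
    // QueryInterruptedException, or null if the chain contains nothing else.
    static Throwable topmostNonQie(Throwable t) {
        while (t instanceof QueryInterruptedException) {
            t = t.getCause();
        }
        return t;
    }

    static Map<String, String> errorFields(Throwable t, String host) {
        Throwable cause = topmostNonQie(t);
        Map<String, String> fields = new LinkedHashMap<>();
        // Interrupt/cancel/timeout detection is left out of this sketch.
        fields.put("error", "Unknown exception");
        fields.put("errorMessage", cause == null ? null : cause.getMessage());
        fields.put("errorClass", cause == null ? null : cause.getClass().getName());
        fields.put("host", host);
        return fields;
    }

    public static void main(String[] args) {
        Throwable e = new QueryInterruptedException(new IllegalStateException("boom"));
        System.out.println(errorFields(e, "example-host:8083"));
    }
}
```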
1. Wrap temporaryStorage in a resource holder, to avoid spurious "Closed"
errors from already-running processing tasks.
2. Exit early from the merging accumulator if the query is cancelled.
* Add time interval dim filter and retention analysis example
* Use closed-open matching for intervals, update cache key generation
* Fix time filtering tests for interval boundary change
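For reference, closed-open matching means the interval start is inclusive and the end is exclusive. A minimal sketch, with hypothetical names, that works on epoch milliseconds:

```java
public class ClosedOpenInterval {
    // Returns true when startMillis <= t < endMillis.
    static boolean contains(long startMillis, long endMillis, long t) {
        return t >= startMillis && t < endMillis;
    }

    public static void main(String[] args) {
        // A timestamp on the boundary belongs to the later interval only,
        // so adjacent intervals never both match the same row.
        System.out.println(contains(0L, 10L, 10L));
        System.out.println(contains(10L, 20L, 10L));
    }
}
```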
- HLLC.fold avoids duplicating the other buffer by saving and restoring its position.
- HLLC.makeCollector(buffer) no longer duplicates incoming BBs.
- Updated call sites where appropriate to duplicate BBs passed to HLLC.
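The save-and-restore pattern can be sketched as follows. This is not the HLLC code; sumRemaining is a hypothetical reader that consumes a ByteBuffer without calling duplicate() and leaves the caller's position unchanged. Note that, unlike duplicate(), this is not safe if another thread reads the same buffer concurrently.

```java
import java.nio.ByteBuffer;

public class FoldWithoutDuplicate {
    // Reads all remaining bytes of `other` without duplicating it. The
    // position is saved first and restored in a finally block, so the
    // caller sees the buffer exactly as it passed it in.
    static int sumRemaining(ByteBuffer other) {
        final int savedPosition = other.position();
        try {
            int sum = 0;
            while (other.hasRemaining()) {
                sum += other.get();
            }
            return sum;
        } finally {
            other.position(savedPosition);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[]{1, 2, 3, 4});
        buf.position(1);
        System.out.println(sumRemaining(buf) + " at position " + buf.position());
    }
}
```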
The common theme between the two is they both create "fake" DimensionSelectors
that work on top of Rows. They both do it because there isn't really any
dictionary for the underlying Rows; they're just a stream of data. The fix for
both is to allow a DimensionSelector to tell callers that it has no dictionary
by returning CARDINALITY_UNKNOWN from getValueCardinality. The callers, in
turn, can avoid using it in ways that assume it has a dictionary.
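A minimal sketch of the caller-side check, assuming a simplified selector interface. DimSelector, the helper factories, and the sentinel value -1 are all hypothetical; only the name CARDINALITY_UNKNOWN and the method getValueCardinality come from the description above.

```java
public class CardinalityDemo {
    // Sentinel value is an assumption for this sketch.
    static final int CARDINALITY_UNKNOWN = -1;

    // Simplified, hypothetical stand-in for DimensionSelector.
    interface DimSelector {
        int getValueCardinality();
        String lookupName(int id);
    }

    // Selector backed by a real dictionary.
    static DimSelector dictionaryBacked(final String... values) {
        return new DimSelector() {
            public int getValueCardinality() { return values.length; }
            public String lookupName(int id) { return values[id]; }
        };
    }

    // "Fake" selector over a stream of rows: no dictionary to report.
    static DimSelector rowBacked() {
        return new DimSelector() {
            public int getValueCardinality() { return CARDINALITY_UNKNOWN; }
            public String lookupName(int id) { return "row-value-" + id; }
        };
    }

    // Caller: only pre-computes an id -> name table when a dictionary exists.
    static String[] buildLookupCacheOrNull(DimSelector selector) {
        int cardinality = selector.getValueCardinality();
        if (cardinality == CARDINALITY_UNKNOWN) {
            return null; // no dictionary; look names up per row instead
        }
        String[] cache = new String[cardinality];
        for (int id = 0; id < cardinality; id++) {
            cache[id] = selector.lookupName(id);
        }
        return cache;
    }

    public static void main(String[] args) {
        System.out.println(buildLookupCacheOrNull(rowBacked()) == null);
        System.out.println(buildLookupCacheOrNull(dictionaryBacked("a", "b")).length);
    }
}
```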
Fixes #3311.
Add tests for the CCE and for a bunch of other groupBy stuff.
Also avoids setting the interrupted flag when InterruptedExceptions
happen, since this might interfere with resource closing, no other
query does it, and it is probably pointless anyway since the thread
is likely to be a Jetty thread that we don't actually want to set
an interrupt flag on.
Also fixes toString on OrderByColumnSpec.
* ability to not rollup at index time, make pre aggregation an option
* rename getRowIndexForRollup to getPriorIndex
* fix doc misspelling
* test query using no-rollup indexes
* fix benchmark fail due to jmh bug
* Add numeric StringComparator
* Only use direct long comparison for numeric ordering in BoundFilter, add time filtering benchmark query
* Address PR comments, add multithreaded BoundDimFilter test
* Add comment on strlen tie handling
* Add timeseries interval filter benchmark
* Adjust docs
* Use jackson for StringComparator, address PR comments
* Add new TopNMetricSpec and SearchSortSpec with tests (WIP)
* More TopNMetricSpec and SearchSortSpec tests
* Fix NewSearchSortSpec serde
* Update docs for new DimensionTopNMetricSpec
* Delete NumericDimensionTopNMetricSpec
* Delete old SearchSortSpec
* Rename NewSearchSortSpec to SearchSortSpec
* Add TopN numeric comparator benchmark, address PR comments
* Refactor OrderByColumnSpec
* Add null checks to NumericComparator and String->BigDecimal conversion function
* Add more OrderByColumnSpec serde tests
This fixes a potential issue where groupBy resources could be allocated to
create a Sequence, but then the Sequence is never used, and thus the resources
are never freed.
Also simplifies how groupBy handles config overrides (this made the new
unit test easier to write).
Refcounting prevents releasing the merge buffer, or closing the concurrent
grouper, before the processing threads have all finished. The better
error handling prevents an avalanche of per-runner exceptions when grouping
resources are exhausted, by grouping those all up into a single merged
exception.
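The refcounting idea can be sketched roughly as below. This is an illustrative class with hypothetical names, not the code in this patch: the owner holds one reference, each processing thread takes one before touching the resource, and the releaser runs only when the last reference is dropped.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedResource {
    // Starts at 1: the owner's own reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final Runnable releaser;

    RefCountedResource(Runnable releaser) {
        this.releaser = releaser;
    }

    // Called by a processing thread before it uses the resource.
    // Returns false if the resource has already been released.
    boolean retain() {
        while (true) {
            int n = refs.get();
            if (n == 0) {
                return false;
            }
            if (refs.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    // Called once by the owner and once per successful retain().
    // The underlying resource is freed only when the count reaches zero.
    void release() {
        if (refs.decrementAndGet() == 0) {
            releaser.run();
        }
    }

    public static void main(String[] args) {
        RefCountedResource r = new RefCountedResource(() -> System.out.println("released"));
        r.retain();   // a processing thread starts
        r.release();  // owner closes early: nothing is freed yet
        r.release();  // the thread finishes: releaser runs now
    }
}
```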
* Support filtering on __time column
* Rename DruidPredicate
* Add docs for ValueMatcherFactory, add comment on getColumnCapabilities
* Combine ValueMatcherFactory predicate methods to accept DruidCompositePredicate
* Address PR comments (support filter on all long columns)
* Use predicate factory instead of composite predicate
* Address PR comments
* Lazily initialize long handling in selector/in filter
* Move long value parsing from InFilter to InDimFilter, make long value parsing thread-safe
* Add multithreaded selector/in filter test
* Fix non-final lock object in SelectorDimFilter
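A sketch of the lazy-initialization pattern the last few items refer to, under assumptions: LazyLongValues and its members are hypothetical names, and silently skipping values that do not parse as longs is an assumption of this sketch rather than a statement about the filters. The lock object is final so every thread synchronizes on the same monitor.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LazyLongValues {
    private final List<String> values;
    // Final lock: a non-final field could be reassigned, letting two threads
    // synchronize on different objects.
    private final Object initLock = new Object();
    // Null until the first caller needs long handling; volatile for safe publication.
    private volatile Set<Long> longValues;

    LazyLongValues(List<String> values) {
        this.values = values;
    }

    Set<Long> getLongValues() {
        Set<Long> result = longValues;
        if (result == null) {
            synchronized (initLock) {
                result = longValues;
                if (result == null) {
                    result = new HashSet<>();
                    for (String v : values) {
                        try {
                            result.add(Long.parseLong(v));
                        } catch (NumberFormatException e) {
                            // Assumption: values that are not longs are skipped.
                        }
                    }
                    longValues = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LazyLongValues lazy = new LazyLongValues(List.of("1", "x", "42"));
        System.out.println(lazy.getLongValues().size());
    }
}
```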
Fixes inconsistent metric handling between the two implementations. Formerly,
RealtimePlumber only emitted query/segmentAndCache/time and query/wait and
Appenderator only emitted query/partial/time and query/wait (all per sink).
Now they both do the same thing:
- query/segmentAndCache/time, query/segment/time are the time spent per sink.
- query/cpu/time is the CPU time spent per query.
- query/wait/time is the executor waiting time per sink.
These generally match historical metrics, except segmentAndCache & segment
mean the same thing here, because one Sink may be partially cached and
partially uncached and we aren't splitting that out.
Constantly timing out on one of the slow build machines; increasing the
timeout fixed it.
Running io.druid.granularity.QueryGranularityTest
Tests run: 33, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.776
sec - in io.druid.granularity.QueryGranularityTest
All query metrics now start with toolChest.makeMetricBuilder, and all of
*those* now start with DruidMetrics.makePartialQueryTimeMetric. Also, "id"
moved to common code, since all query metrics added it anyway.
In particular this will add query-type specific dimensions like "threshold"
and "numDimensions" to servlet-originated metrics like query/time.
This is actually reasonable for a groupBy or lexicographic topN that is
being used to do a "COUNT DISTINCT" kind of query. No aggregators are
needed for that query, and including a dummy aggregator wastes 8 bytes
per row.
It's kind of silly for timeseries, but why not.
* support alphanumeric sort in search query
* address a comment about handling equals() and hashCode()
* address comments
* add unit tests for string comparators
* address a comment about space indentations.
This patch introduces a GroupByStrategy concept and two strategies: "v1"
is the current groupBy strategy and "v2" is a new one. It also introduces
a merge buffers concept in DruidProcessingModule, to try to better
manage memory used for merging.
Both of these are described in more detail in #2987.
There are two goals of this patch:
1. Make it possible for historical/realtime nodes to return larger groupBy
result sets, faster, with better memory management.
2. Make it possible for brokers to merge streams when there are no order-by
columns, avoiding materialization.
This patch does not do anything to help with memory management on the broker
when there are order-by columns or when there are nested queries. That could
potentially be done in a future patch.
* add get dimension rangeset to filters
* add get domain to ShardSpec and add chunk filter in caching clustered client
* add null check and modify not filter, start on unit test
* add filter test with caching
* refactor and some comments
* extract filtershard to helper function
* fixup
* minor changes
* update javadoc
* fix caching for search results
properly read count when reading from cache.
* fix NPE during merging search count and add test
* Update cache key to invalidate prev results