* Adjust defaults for hashed partitioning
If neither the partition size nor the number of shards are specified,
default to partitions of 5,000,000 rows (similar to the behavior of
dynamic partitions). Previously, both could be null and cause incorrect
behavior.
Specifying both a partition size and a number of shards now results in
an error instead of ignoring the partition size in favor of using the
number of shards. This is a behavior change that makes it more apparent
to the user that only one of the two properties will be honored
(previously, a message was just logged when the specified partition size
was ignored).
* Fix test
* Handle -1 as null
* Add -1 as null tests for single dim partitioning
* Simplify logic to handle -1 as null
* Address review comments
* Rename partition spec fields
Rename partition spec fields to be consistent across the various types
(hashed, single_dim, dynamic). Specifically, use targetNumRowsPerSegment
and maxRowsPerSegment in favor of targetPartitionSize and
maxSegmentSize. Consistent and clearer names are easier for users to
understand and use.
Also fix various IntelliJ inspection warnings and doc spelling mistakes.
* Fix test
* Improve docs
* Add targetRowsPerSegment to HashedPartitionsSpec
* move google ext docs from contrib to core
* fix links
* revert unintended change
* more links, add note to example ext doc that it was removed, unlink from sidebar
* Exit JVM on curator unhandled errors
If an unhandled error occurs when curator is talking to ZooKeeper, exit
the JVM in addition to stopping the lifecycle to prevent the process
from being left in a zombie state. With this change,
BoundedExponentialBackoffRetryWithQuit is no longer needed as when
curator exceeds the configured retries, it triggers its unhandled error
listeners. A new "connectionTimeoutMs" CuratorConfig setting is added
mostly to facilitate testing curator unhandled errors, but it may be
useful for users as well.
* Address review comments
* Add realization for updating version of derived segments in MaterializedView
* add unit test, and change code style for the sake of ease of understanding
* fix document's mistake of expression
* Zookeeper version is updated.
* Zookeeper version is updated at licenses.yaml
* licenses.yaml is updated and dependencies are fixed to make the project successfully build.
* Zookeeper versions are fixed at licenses.yaml
* Add group_id to overlord tasks API and sys.tasks table
* adjust test
* modify docs
* Make groupId nullable
* fix integration test
* fix toString
* Remove groupId from TaskInfo
* Modify docs and tests
* modify TaskMonitorTest
* migrate binary notice entries to live in licenses.yaml, use licenses.yaml and NOTICE to generate NOTICE.BINARY at distribution time
* +x
* move release scripts to distribution/bin, fixup notice script, trim dependencies for avro and kerberos in licenses.yaml
* add missing hdfs-storage dependencies
* revert to old syntax, fixes
* formatting
* update notices for recently updated dependencies
* Add TaskResourceCleaner; fix a couple of concurrency bugs in batch tasks
* kill runner when it's ready
* add comment
* kill run thread
* fix test
* Take closeable out of Appenderator
* add javadoc
* fix test
* fix test
* update javadoc
* add javadoc about killed task
* address comment
* Add support for parallel native indexing with shuffle for perfect rollup.
* Add comment about volatiles
* fix test
* fix test
* handling missing exceptions
* more clear javadoc for stopGracefully
* unused import
* update javadoc
* Add missing statement in javadoc
* address comments; fix doc
* add javadoc for isGuaranteedRollup
* Rename confusing variable name and fix typos
* fix typos; move fetch() to a better home; fix the expiration time
* add support https
* Enable ability to toggle SegmentMetadata request logging on/off
* Move SegmentMetadata query log filter to FilteredRequestLogger
* Update documentation to reflect the segment metadata flag moving to the filtered request logger
* Modify patch to allow blacklist of query types to not log to request logger
* Address styling and naming requests following latest code review
* Fix indentation on multiple locations per Druid style rules
* Add a cluster-wide configuration to force timeChunk lock and add a doc for segment locking
* add more test
* javadoc for missingIntervalsInOverwriteMode
* Fix test
* Address comments
* avoid spotbugs
* Add IPv4 SQL functions
New SQL functions for filtering IPv4 addresses:
- IPV4_MATCH: Check if IP address belongs to a subnet
- IPV4_PARSE: Convert string IP address to integer
- IPV4_STRINGIFY: Convert integer IP address to string
These are the SQL analogs of the druid expressions with the same name.
Filtering is more efficient when operating on IP addresses as integers
instead of strings.
* Refactor operator conversions into named constants
* Add IPv4 druid expressions
New druid expressions for filtering IPv4 addresses:
- ipv4address_match: Check if IP address belongs to a subnet
- ipv4address_parse: Convert string IP address to long
- ipv4address_stringify: Convert long IP address to string
These expressions operate on IP addresses represented as either strings
or longs, so that they can be applied to dimensions with mixed
representation of IP addresses. The filtering is more efficient when
operating on IP addresses as longs. In other words, the intended use
case is:
1) Use ipv4address_parse to convert to long at ingestion time
2) Use ipv4address_match to filter (on longs) at query time
3) Use ipv4adress_stringify to convert to (readable) string at query
time
* Fix licenses and null handling
* Simplify IPv4 expressions
* Fix tests
* Fix check for valid ipv4 address string
* Use partitionsSpec for all task types
* fix doc
* fix typos and revert to use isPushRequired
* address comments
* move partitionsSpec to core
* remove hadoopPartitionsSpec
* firehose doc adjustments
* fix typo
* additional information on parser types in ingestion docs
* clarify ingest segment firehose docs, add sql firehose examples to sql extension pages
* fixit
* make sql firehose more forgiving my always constructing a MapInputRowParser from the parseSpec of whatever actual InputRowParser impl is provided, remove doc references to map based parsers
* transforms
* fix tests
* remove unecessary lock in ForegroundCachePopulator leading to a lot of contention
* mutableboolean, javadocs,document some cache configs that were missing
* more doc stuff
* adjustments
* remove background documentation
* 1. Added TimestampExtractExprMacro.Unit for MILLISECOND 2. expr eval for MILLISECOND 3. Added a test case to test extracting millisecond from expression. #7935
* 1. Adding DATASOURCE4 in tests. 2. Adding test TimeExtractWithMilliseconds
* Fixing testInformationSchemaTables test
* Fixing failing tests in DruidAvaticaHandlerTest
* Adding cannotVectorize() call before the test
* Extract time function - Adding support for MICROSECOND, ISODOW, ISOYEAR and CENTURY time units, documentation changes.
* Adding MILLISECOND in test case
* Adding support DECADE and MILLENNIUM, updating test case and documentation
* Fixing expression eval for DECADE and MILLENIUM
The Markdown dialect used when publishing the documentation to the web
site is much more sensitive than Github-flavoured Markdown. In
particular, it requires an empty line before code blocks (unless the
code block starts right after a heading), otherwise the code block
gets formatted in-line with the previous paragraph. Likewise for
bullet-point lists.
* Benchmarks: New SqlBenchmark, add caching & vectorization to some others.
- Introduce a new SqlBenchmark geared towards benchmarking a wide
variety of SQL queries. Rename the old SqlBenchmark to
SqlVsNativeBenchmark.
- Add (optional) caching to SegmentGenerator to enable easier
benchmarking of larger segments.
- Add vectorization to FilteredAggregatorBenchmark and GroupByBenchmark.
* Query vectorization.
This patch includes vectorized timeseries and groupBy engines, as well
as some analogs of your favorite Druid classes:
- VectorCursor is like Cursor. (It comes from StorageAdapter.makeVectorCursor.)
- VectorColumnSelectorFactory is like ColumnSelectorFactory, and it has
methods to create analogs of the column selectors you know and love.
- VectorOffset and ReadableVectorOffset are like Offset and ReadableOffset.
- VectorAggregator is like BufferAggregator.
- VectorValueMatcher is like ValueMatcher.
There are some noticeable differences between vectorized and regular
execution:
- Unlike regular cursors, vector cursors do not understand time
granularity. They expect query engines to handle this on their own,
which a new VectorCursorGranularizer class helps with. This is to
avoid too much batch-splitting and to respect the fact that vector
selectors are somewhat more heavyweight than regular selectors.
- Unlike FilteredOffset, FilteredVectorOffset does not leverage indexes
for filters that might partially support them (like an OR of one
filter that supports indexing and another that doesn't). I'm not sure
that this behavior is desirable anyway (it is potentially too eager)
but, at any rate, it'd be better to harmonize it between the two
classes. Potentially they should both do some different thing that
is smarter than what either of them is doing right now.
- When vector cursors are created by QueryableIndexCursorSequenceBuilder,
they use a morphing binary-then-linear search to find their start and
end rows, rather than linear search.
Limitations in this patch are:
- Only timeseries and groupBy have vectorized engines.
- GroupBy doesn't handle multi-value dimensions yet.
- Vector cursors cannot handle virtual columns or descending order.
- Only some filters have vectorized matchers: "selector", "bound", "in",
"like", "regex", "search", "and", "or", and "not".
- Only some aggregators have vectorized implementations: "count",
"doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
- Dimension specs other than "default" don't work yet (no extraction
functions or filtered dimension specs).
Currently, the testing strategy includes adding vectorization-enabled
tests to TimeseriesQueryRunnerTest, GroupByQueryRunnerTest,
GroupByTimeseriesQueryRunnerTest, CalciteQueryTest, and all of the
filtering tests that extend BaseFilterTest. In all of those classes,
there are some test cases that don't support vectorization. They are
marked by special function calls like "cannotVectorize" or "skipVectorize"
that tell the test harness to either expect an exception or to skip the
test case.
Testing should be expanded in the future -- a project in and of itself.
Related to #3011.
* WIP
* Adjustments for unused things.
* Adjust javadocs.
* DimensionDictionarySelector adjustments.
* Add "clone" to BatchIteratorAdapter.
* ValueMatcher javadocs.
* Fix benchmark.
* Fixups post-merge.
* Expect exception on testGroupByWithStringVirtualColumn for IncrementalIndex.
* BloomDimFilterSqlTest: Tag two non-vectorizable tests.
* Minor adjustments.
* Update surefire, bump up Xmx in Travis.
* Some more adjustments.
* Javadoc adjustments
* AggregatorAdapters adjustments.
* Additional comments.
* Remove switching search.
* Only missiles.