Commit Graph

2152 Commits

Author SHA1 Message Date
Gian Merlino c44452f0c1 Tidy up lifecycle, query, and ingestion logging. (#8889)
* Tidy up lifecycle, query, and ingestion logging.

The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.

Concretely, this patch does the following:

- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
  goal of reducing redundancy and improving clarity by avoiding
  showing rarely-useful log messages. This includes most "starting"
  and "stopping" messages, and most messages related to individual
  columns.
- Adds new log4j2 templates that show operators how to enabled DEBUG
  logging for certain important packages.
- Eliminate stack traces for query errors, unless log level is DEBUG
  or more. This is useful because query errors often indicate user
  error rather than system error, but dumping stack trace often gave
  operators the impression that there was a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
  the default log4j2 configuration, this will put them in log lines
  as well. It's very useful if a user is using the Indexer, where
  multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
  segments that are handed-off together by Kafka ingestion) and
  "offsets" (cursors in partitions). These terms had been confused in
  some log messages due to the fact that Kinesis calls offsets
  "sequence numbers".
- Replaces some ugly toString calls with either the JSONification or
  something more operator-accessible (like a URL or segment identifier,
  instead of JSON object representing the same).

* Adjustments.

* Adjust integration test.
2019-11-19 13:57:58 -08:00
Jihoon Son 1611792855
Add InputSource and InputFormat interfaces (#8823)
* Add InputSource and InputFormat interfaces

* revert orc dependency

* fix dimension exclusions and failing unit tests

* fix tests

* fix test

* fix test

* fix firehose and inputSource for parallel indexing task

* fix tc

* fix tc: remove unused method

* Formattable

* add needsFormat(); renamed to ObjectSource; pass metricsName for reader

* address comments

* fix closing resource

* fix checkstyle

* fix tests

* remove verify from csv

* Revert "remove verify from csv"

This reverts commit 1ea7758489.

* address comments

* fix import order and javadoc

* flatMap

* sampleLine

* Add IntermediateRowParsingReader

* Address comments

* move csv reader test

* remove test for verify

* adjust comments

* Fix InputEntityIteratingReader

* rename source -> entity

* address comments
2019-11-15 09:22:09 -08:00
Gian Merlino ce4ee42459 Fix LIKE filter wildcards to match newlines. (#8863) 2019-11-13 23:00:54 -08:00
Clint Wylie cc54b2a9df support for array expressions in TransformSpec with ExpressionTransform (#8744)
* transformSpec + array expressions

changes:
* added array expression support to transformSpec
* removed ParseSpec.verify since its only use afaict was preventing transform expr that did not replace their input from functioning
* hijacked index task test to test changes

* remove docs about being unsupported

* re-arrange test assert

* unused imports

* imports

* fix tests

* preserve types

* suppress warning, fixes, add test

* formatting

* cleanup

* better list to array type conversion and tests

* fix oops
2019-11-13 11:04:37 -08:00
Clint Wylie 9ed9a80b9d optimize numeric column null value checking for low filter selectivity (more rows) (#8822)
* use peekable iterator for numeric column selector null checking instead of bitmap.get for those sweet sweet nanoseconds

* remove unused method

* slight optimization i think

* remove clone from wrappers since we do not use and is confusing

* fixes and tests

* int instead of Integer

* fix it

* fixes, more tests

* fix
2019-11-13 10:53:46 -08:00
Gian Merlino 0e8c3f74d0 SQL: EARLIEST, LATEST aggregators. (#8815)
* SQL: EARLIEST, LATEST aggregators.

I chose these names instead of FIRST, LAST because those are already
reserved functions in Calcite that mean something different. I think
these are also better names anyway.

* Finalify.

* SQL updates.

* Adjust aggregator calls.

* Validations, test updates.

* Review docs.
2019-11-08 16:29:25 -08:00
Gian Merlino c204d68376 Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834)
There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.

Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.

This patch has three goals:

1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast
   aggregators while fixing null-handling bugs. I thought about splitting this
   into its own patch, but it ended up being tough to split from the
   null-handling fixes.

For (1) the fixes are,

- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
  calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
  -1 as a sigil value for null, to differentiate nulls and empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
  eval.asBoolean() to avoid calling getLong on the selector after already
  calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
  on isNull. Also, refactored slightly to avoid the overhead of calling
  getObject followed by another getter (see BloomFilterAggregatorFactory for
  part of this).

For (2) the main changes are,

- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullbleNumericAggregatorFactory to emphasize
  that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.

For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,

- Fixed buffer overrun in the buffer aggregators when some characters in the string
  code into more than one byte (the old code used "substring" to apply a byte limit,
  which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
  behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
  xFold versions into aliases. The adaptiveness is similar to how other aggregators
  like hyperUnique work.
2019-11-07 17:46:59 -08:00
Clint Wylie 7aafcf8bca parallel broker merges on fork join pool (#8578)
* sketch of broker parallel merges done in small batches on fork join pool

* fix non-terminating sequences, auto compute parallelism

* adjust benches

* adjust benchmarks

* now hella more faster, fixed dumb

* fix

* remove comments

* log.info for debug

* javadoc

* safer block for sequence to yielder conversion

* refactor LifecycleForkJoinPool into LifecycleForkJoinPoolProvider which wraps a ForkJoinPool

* smooth yield rate adjustment, more logs to help tune

* cleanup, less logs

* error handling, bug fixes, on by default, more parallel, more tests

* remove unused var

* comments

* timeboundary mergeFn

* simplify, more javadoc

* formatting

* pushdown config

* use nanos consistently, move logs back to debug level, bit more javadoc

* static terminal result batch

* javadoc for nullability of createMergeFn

* cleanup

* oops

* fix race, add docs

* spelling, remove todo, add unhandled exception log

* cleanup, revert unintended change

* another unintended change

* review stuff

* add ParallelMergeCombiningSequenceBenchmark, fixes

* hyper-threading is the enemy

* fix initial start delay, lol

* parallelism computer now balances partition sizes to partition counts using sqrt of sequence count instead of sequence count by 2

* fix those important style issues with the benchmarks code

* lazy sequence creation for benchmarks

* more benchmark comments

* stable sequence generation time

* update defaults to use 100ms target time, 4096 batch size, 16384 initial yield, also update user docs

* add jmh thread based benchmarks, cleanup some stuff

* oops

* style

* add spread to jmh thread benchmark start range, more comments to benchmarks parameters and purpose

* retool benchmark to allow modeling more typical heterogenous heavy workloads

* spelling

* fix

* refactor benchmarks

* formatting

* docs

* add maxThreadStartDelay parameter to threaded benchmark

* why does catch need to be on its own line but else doesnt
2019-11-07 11:58:46 -08:00
Clint Wylie 3ff5e02237 remove select query (#8739)
* remove select query

* thanks teamcity

* oops

* oops

* add back a SelectQuery class that throws RuntimeExceptions linking to docs

* adjust text

* update docs per review

* deprecated
2019-10-30 19:29:56 -07:00
Jihoon Son f5b9bf5525 Cluster-wide configuration for query vectorization (#8657)
* Cluster-wide configuration for query vectorization

* add doc

* fix build

* fix doc

* rename to QueryConfig and add javadoc

* fix checkstyle

* fix variable names
2019-10-23 21:44:28 +08:00
Jonathan Wei d88075237a
Add initial SQL support for non-expression sketch postaggs (#8487)
* Add initial SQL support for non-expression sketch postaggs

* Checkstyle, spotbugs

* checkstyle

* imports

* Update SQL docs

* Checkstyle

* Fix theta sketch operator docs

* PR comments

* Checkstyle fixes

* Add missing entries for HLL sketch module

* PR comments, add round param to HLL estimate operator, fix optional HLL param
2019-10-18 14:59:44 -07:00
Jihoon Son 4046c86d62
Stateful auto compaction (#8573)
* Stateful auto compaction

* javaodc

* add removed test back

* fix test

* adding indexSpec to compactionState

* fix build

* add lastCompactionState

* address comments

* extract CompactionState

* fix doc

* fix build and test

* Add a task context to store compaction state; add javadoc

* fix it test
2019-10-15 22:57:42 -07:00
Himanshu 46ddaf3aa1 fix sorting for resultRow object when numeric dimension not in limitSpec (#8645) 2019-10-08 16:37:15 -07:00
Himanshu c078ed40fd
groupBy query: optional limit push down to segment scan (#8426)
* groupBy query: optional limit push down to segment scan

* make segment level limit push down configurable

* fix teamcity errors

* fix segment limit pushdown flag handling on query level config override

* use equals for comparator check

* fix sql and null handling

* fix unused imports

* handle null offset in NullableValueGroupByColumnSelectorStrategy for buffer comparator similar to RowBasedGrouperHelper.NullableRowBasedKeySerdeHelper
2019-10-08 15:35:07 -07:00
Clint Wylie 7781820dea JsonParserIterator.init future timeout (#8550)
* add timeout support for JsonParserIterator init future

* add queryId

* should be less than 1

* fix

* fix npe

* fix lgtm

* adjust exception, nullable

* fix test

* refactor

* revert queryId change

* add log.warn to tie exception to json parser iterator
2019-09-27 09:13:37 +09:00
Himanshu 9f1f5e115c
doubleMean aggregator to be used at query time (#8459)
* doubleMean aggregator for computing mean

* make docs

* build fixes

* address review comment: handle null args
2019-09-26 08:04:33 -07:00
Gian Merlino d96ca9bd61 Fix serde of FilterTuning maxCardinalityToUseBitmapIndex. (#8551) 2019-09-17 12:46:46 -07:00
Chi Cao Minh baec3a06e9 Fix IntelliJ inspection error (#8553)
Change by #8535 causes TeamCity inspection error in CI (although it does
not show the error in the local IDE).
2019-09-17 12:45:25 -07:00
Benedict Jin c6f4f09557 Fix missing space in string literal and spurious Javadoc @param tags from LGTM (#8491)
* Fix missing space in string literal

* Fix spurious Javadoc @param tags
2019-09-16 14:37:47 +05:30
Clint Wylie df14e5d696
fix caching bug with multi-column group-by (#8535)
* fix caching bug with multi-column group-by

* review
2019-09-13 17:41:23 -07:00
Chi Cao Minh 5f61374cb3 Fix dependency analyze warnings (#8230)
* Fix dependency analyze warnings

Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports and
updated druid-forbidden-apis to prevent regressions.

* Address review comments

* Adjust scope for org.glassfish.jaxb:jaxb-runtime

* Fix dependencies for hdfs-storage

* Consolidate netty4 versions
2019-09-09 14:37:21 -07:00
Benedict Jin 9fa3407596 Suppress index-out-of-bounds warning from LGTM about loop unrolling (#8380)
* Suppress index-out-of-bounds warning from LGTM about loop unrolling

* Remove space

* Patch comments
2019-09-06 14:46:33 -07:00
Himanshu 1fe4ecf17a
StringDictionaryEncodedColumn dimSelector to return CARDINALITY_UNKNOWN with extractionFn (#8433)
* update DimensionDictionarySelector.getValueCardinality() javadoc

* unknown cardinality in StringDictionaryEncodedColumn dim selector

* revert StringDictionaryEncodedColumn change as that fails GroupBy-v1 execution for many working queries

* fix/add more comments
2019-09-06 14:19:25 -07:00
Jonathan Wei f36fd73f60 Speed up StringDimensionIndexer.estimateEncodedKeyComponentSize (#8466)
* Speed up StringDimensionIndexer.estimateEncodedKeyComponentSize

* Remove print

* Move benchmark, add header
2019-09-04 20:26:04 -07:00
Benedict Jin de18840412
Fix inconsistent equals and hashCode (#8381)
* Fix inconsistent equals and hashCode

* Patch comments

* Remove equals and hashCode from InsensitiveContainsSearchQuerySpec
2019-09-04 13:48:08 +08:00
Himanshu ee4ebb496a
make single/multi value string column handling official in aggregation (#8428) 2019-09-03 13:47:09 -07:00
Clint Wylie c73a489335
bump master version to 0.17.0-incubating-SNAPSHOT (#8421) 2019-08-28 01:58:36 -07:00
Himanshu 5c3db41c2b
string column handling for long/float min/max/sum aggregators (#8319)
* string column handling for long min/max/sum aggregators

* add apache license to new files

* use 'L' as suffix for long literal instead of 'l'

* return null in ParallelCombiner.SettableColumnSelectorFactory.getColumnCapabilities(String) as is required by contract of ColumnSelectorFactory interface

* fix more tests
2019-08-27 16:10:59 -07:00
Himanshu d5d170f866
skip unnecessary aggregate(..) calls with LimitedBufferHashGrouper (#8412)
* skip unnecessary aggregate(..) calls with LimitedBufferHashGrouper

* remove unused bucketWasUsed arg from canSkipAggregate(..)
2019-08-27 15:01:07 -07:00
Himanshu 4d87a19547
Logging emitter to publish query and other metric events as valid json objects (#8359)
* LoggingEmitter: print event as json

* use DefaultRequestLogEventBuilderFactory in emitting request logger by default

* print context in query metric as json

* removed unused jsonMapper from DefaultQueryMetrics

* add comment

* remove change to DefaultRequestLogEventBuilderFactory.java
2019-08-27 15:00:23 -07:00
Jihoon Son e5ef5ddafa Fix the shuffle with TLS enabled for parallel indexing; add an integration test; improve unit tests (#8350)
* Fix shuffle with tls enabled; add an integration test; improve unit tests

* remove debug log

* fix tests

* unused import

* add javadoc

* rename to getContent
2019-08-26 19:27:41 -07:00
Xavier Léauté 8e0c307e54
Do not assume system classloader is URLClassLoader in Java 9+ (#8392)
* Fallback to parsing classpath for hadoop task in Java 9+
In Java 9 and above we cannot assume that the system classloader is an
instance of URLClassLoader. This change adds a fallback method to parse
the system classpath in that case, and adds a unit test to validate it matches
what JDK8 would do.

Note: This has not been tested in an actual hadoop setup, so this is mostly
to help us pass unit tests.

* Remove granularity test of dubious value
One of our granularity tests relies on system classloader being a URLClassLoaders to
catch a bug related to class initialization and static initializers using a subclass (see
#2979)
This test was added to catch a potential regression, but it assumes we would add back
the same type of static initializers to this specific class, so it seems to be of dubious value
as a unit test and mostly serves to illustrate the bug.

relates to #5589
2019-08-24 20:47:54 -04:00
Xavier Léauté 20f7db5d22
Fix ConcurrentModificationException in JDK11 (#8391)
When building column/dimension selectors, calling computeIfAbsent can
cause the applied function to modify the same cache through virtual
column references. The JDK11 map implementation detects this change and
will throw an exception.

This fix – while not as elegant – breaks the single call into two
steps to avoid this problem.
2019-08-24 18:24:50 -04:00
Jonathan Wei 368ace4e87
Fix ClassCastException for TopN with long-type dimension (#8349)
* Fix ClassCastException for TopN with long-type dimension

* Add DimValHolderTest
2019-08-23 14:55:31 -05:00
SandishKumarHN 33f0753a70 Add Checkstyle for constant name static final (#8060)
* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* check ctyle for constant field name

* merging with upstream

* review-1

* unknow changes

* unknow changes

* review-2

* merging with master

* review-2 1 changes

* review changes-2 2

* bug fix
2019-08-23 13:13:54 +03:00
Clint Wylie c87b68d2a4 use Number instead of long for response context (#8342)
* use Number instead of long for response context to be forgiving of json serde to int or long

* test that encounters issue without fix

* now with more test

* is ints
2019-08-20 19:05:49 -07:00
Chi Cao Minh 6fa22f6939 Enable code coverage (#8303)
* Enable code coverage

Code coverage was disabled via
https://github.com/apache/incubator-druid/pull/3122 due to an issue with
cobertura in Travis CI. Switch code coverage tool from cobertura to
jacoco to avoid issue and re-enable coveralls for Travis CI.

* Exclude non-production code

* Exclude benchmark generated code

* Exclude DruidTestRunnerFactory
2019-08-20 15:36:19 -07:00
Jonathan Wei e2a25fb51e
Add logging for LZ4Factory instance type (#8341) 2019-08-20 15:24:53 -05:00
Fokko Driesprong 818bf4990c Enable Spotbugs NP_NONNULL_RETURN_VIOLATION (#8234) 2019-08-20 17:23:46 +03:00
Benedict Jin 781873ba53 Fix resource leak (#8337)
* Fix resource leak

* Patch comments
2019-08-20 12:55:41 +03:00
Himanshu 176da53996
make double sum/min/max agg work on string columns (#8243)
* make double sum/min/max agg work on string columns

* style and compilation fixes

* fix tests

* address review comments

* add comment on SimpleDoubleAggregatorFactory

* make checkstyle happy
2019-08-13 15:55:14 -07:00
Clint Wylie 1054d85171
add mechanism to control filter optimization in historical query processing (#8209)
* add support for mechanism to control filter optimization in historical query processing

* oops

* adjust

* woo

* javadoc

* review comments

* fix

* default

* oops

* oof

* this will fix it

* more nullable, refactor DimFilter.getRequiredColumns to use Set, formatting

* extract class DimFilterToStringBuilder with common code from custom DimFilter toString implementations

* adjust variable naming

* missing nullable

* more nullable

* fix javadocs

* nullable

* address review comments

* javadocs, precondition

* nullable

* rename method to be consistent

* review comments

* remove tuning from ColumnComparisonFilter/ColumnComparisonDimFilter
2019-08-09 16:36:18 -07:00
Jihoon Son 8fa114c349 Fix bugs in overshadowableManager and add unit tests (#8222)
* Fix bugs in overshadowableManager and add unit tests

* Fix SegmentManager

* add segment manager test

* Address comments

* Address comments
2019-08-07 15:51:21 -05:00
Fokko Driesprong 7702005f8f Use Closer instead of List<Closeable> (#8235)
* Use Closer instead of List<Closeable>

* Process comments

* Catch an Exception instead

* Removed unused import
2019-08-07 14:29:03 +08:00
Himanshu 4507a4f8f1 fix merging of groupBy subtotal spec results (#8109)
* fix merging of groupBy subtotal spec results

* add post agg to subtotals spec ut

* add comment

* remove unnecessary agg transformation

* fix build

* fix test

* ignore unknown columns in ordering spec

* change variable names based on comment for easy read

* formatting

* don't ignore unknown columns in DefaultLimitSpec to not change existing behavior

* handle limit spec columns correctly

* uncomment inadvertantly commented lines

* GroupByStrategyV2 changes

* test changes wip

* more fixes to handle merge buffer closing and limit spec

* uncomment line commented accidentally
2019-08-06 07:06:28 -07:00
Samarth Jain 93cf9d4ad4 SQL support for t-digest based sketch aggregators (#8100)
* SQL support for t-digest based sketch aggregators

* Fix teamcity errors

* Add missing dependencies

* Remove unused dependency

* Address code review comments

* Add checks for compression param
2019-08-05 12:01:42 -07:00
Eugene Sevastianov 3f3162b85e Enum of ResponseContext keys (#8157)
* Refactored ResponseContext and aggregated its keys into Enum

* Added unit tests for ResponseContext and refactored the serialization

* Removed unused methods

* Fixed code style

* Fixed code style

* Fixed code style

* Made SerializationResult static

* Updated according to the PR discussion:

Renamed an argument

Updated comparator

Replaced Pair usage with Map.Entry

Added a comment about quadratic complexity

Removed boolean field with an expression

Renamed SerializationResult field

Renamed the method merge to add and renamed several context keys

Renamed field and method related to scanRowsLimit

Updated a comment

Simplified a block of code

Renamed a variable

* Added JsonProperty annotation to renamed ScanQuery field

* Extension-friendly context key implementation

* Refactored ResponseContext: updated delegate type, comments and exceptions

Reducing serialized context length by removing some of its'
collection elements

* Fixed tests

* Simplified response context truncation during serialization

* Extracted a method of removing elements from a response context and
added some comments

* Fixed typos and updated comments
2019-08-03 12:05:21 +03:00
Clint Wylie e7c6deac76 optimize single input column multi-value expressions (#8047)
* optimize single input column multi-value expressions

* javadocs

* merge fixup

* vectorization fixup

* more fixes

* more docs

* more links

* empty

* javadocs are hard

* suppress javadoc refs issue

* fix it
2019-08-02 13:21:25 -07:00
Fokko Driesprong 91743eeebe Spotbugs: NP_NONNULL_PARAM_VIOLATION (#8129) 2019-08-02 19:20:22 +03:00
Chi Cao Minh 7783b31846 Add IPv4 druid expressions (#8197)
* Add IPv4 druid expressions

New druid expressions for filtering IPv4 addresses:
- ipv4address_match: Check if IP address belongs to a subnet
- ipv4address_parse: Convert string IP address to long
- ipv4address_stringify: Convert long IP address to string

These expressions operate on IP addresses represented as either strings
or longs, so that they can be applied to dimensions with mixed
representation of IP addresses. The filtering is more efficient when
operating on IP addresses as longs. In other words, the intended use
case is:

1) Use ipv4address_parse to convert to long at ingestion time
2) Use ipv4address_match to filter (on longs) at query time
3) Use ipv4adress_stringify to convert to (readable) string at query
time

* Fix licenses and null handling

* Simplify IPv4 expressions

* Fix tests

* Fix check for valid ipv4 address string
2019-08-01 11:45:04 -07:00