druid

Commit Graph

Author	SHA1	Message	Date
Jihoon Son	ac6d703814	Support inputFormat and inputSource for sampler (#8901 ) * Support inputFormat and inputSource for sampler * Cleanup javadocs and names * fix style * fix timed shutoff input source reader * fix timed shutoff input source reader again * tidy up timed shutoff reader * unused imports * fix tc	2019-11-20 14:51:25 -08:00
Clint Wylie	3fcaa1a61b	fix sql compatible null handling config work with runtime.properties (#8876 ) * fix sql compatible null handling config work with runtime.properties * fix npe * fix tests * add friendly error * comment, and friendlier still * fix compile * fix from merges	2019-11-20 03:55:29 -08:00
Atul Mohan	f5fbd0bea0	Handle missing values for delimited text files when Nullhandling is enabled (#8779 ) * Handle missing values * Fix multi value tests * Fix firehose tests * Fix conflicts	2019-11-19 22:35:22 -08:00
Chi Cao Minh	4ae6466ae2	HDFS input source (#8899 ) * HDFS input source Add support for using HDFS as an input source. In this version, commas or globs are not supported in HDFS paths. * Fix forbidden api * Address review comments	2019-11-19 22:19:39 -08:00
Clint Wylie	074a45219d	add google cloud storage InputSource for native batch (#8907 ) * add google cloud storage InputSource for native batch * rename * checkstyle * fix * fix spelling * review comments	2019-11-19 19:49:43 -08:00
Gian Merlino	c44452f0c1	Tidy up lifecycle, query, and ingestion logging. (#8889 ) * Tidy up lifecycle, query, and ingestion logging. The goal of this patch is to improve the clarity and usefulness of Druid's logging for cluster operators. For more information, see https://twitter.com/cowtowncoder/status/1195469299814555648. Concretely, this patch does the following: - Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the goal of reducing redundancy and improving clarity by avoiding showing rarely-useful log messages. This includes most "starting" and "stopping" messages, and most messages related to individual columns. - Adds new log4j2 templates that show operators how to enabled DEBUG logging for certain important packages. - Eliminate stack traces for query errors, unless log level is DEBUG or more. This is useful because query errors often indicate user error rather than system error, but dumping stack trace often gave operators the impression that there was a system failure. - Adds task id to Appenderator, AppenderatorDriver thread names. In the default log4j2 configuration, this will put them in log lines as well. It's very useful if a user is using the Indexer, where multiple tasks run in the same JVM. - More consistent terminology when it comes to "sequences" (sets of segments that are handed-off together by Kafka ingestion) and "offsets" (cursors in partitions). These terms had been confused in some log messages due to the fact that Kinesis calls offsets "sequence numbers". - Replaces some ugly toString calls with either the JSONification or something more operator-accessible (like a URL or segment identifier, instead of JSON object representing the same). * Adjustments. * Adjust integration test.	2019-11-19 13:57:58 -08:00
Clint Wylie	7fa3182fe5	refactor InputFormat and InputEntityReader implementations (#8875 ) * refactor InputFormat and InputReader to supply InputEntity and temp dir to constructors instead of read/sample * fix style	2019-11-15 17:08:26 -08:00
Jihoon Son	1611792855	Add InputSource and InputFormat interfaces (#8823 ) * Add InputSource and InputFormat interfaces * revert orc dependency * fix dimension exclusions and failing unit tests * fix tests * fix test * fix test * fix firehose and inputSource for parallel indexing task * fix tc * fix tc: remove unused method * Formattable * add needsFormat(); renamed to ObjectSource; pass metricsName for reader * address comments * fix closing resource * fix checkstyle * fix tests * remove verify from csv * Revert "remove verify from csv" This reverts commit `1ea7758489`. * address comments * fix import order and javadoc * flatMap * sampleLine * Add IntermediateRowParsingReader * Address comments * move csv reader test * remove test for verify * adjust comments * Fix InputEntityIteratingReader * rename source -> entity * address comments	2019-11-15 09:22:09 -08:00
Rye	00f6a56370	Use RFC4180Parser as CSVParser (#8803 ) * Use RFC4180Parser as CSVParser, add unit test * change test file location, use assertEquals	2019-11-13 12:44:37 -08:00
Clint Wylie	cc54b2a9df	support for array expressions in TransformSpec with ExpressionTransform (#8744 ) * transformSpec + array expressions changes: * added array expression support to transformSpec * removed ParseSpec.verify since its only use afaict was preventing transform expr that did not replace their input from functioning * hijacked index task test to test changes * remove docs about being unsupported * re-arrange test assert * unused imports * imports * fix tests * preserve types * suppress warning, fixes, add test * formatting * cleanup * better list to array type conversion and tests * fix oops	2019-11-13 11:04:37 -08:00
Gian Merlino	c204d68376	Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834 ) There is a class of bugs due to the fact that BaseObjectColumnValueSelector has both "getObject" and "isNull" methods, but in most selector implementations and most call sites, it is clear that the intent of "isNull" is only to apply to the primitive getters, not the object getter. This makes sense, because the purpose of isNull is to enable detection of nulls in otherwise-primitive columns. Imagine a string column with a numeric selector built on top of it. You would want it to return isNull = true, so numeric aggregators don't treat it as all zeroes. Sometimes this design leads people to accidentally guard non-primitive get methods with "selector.isNull" checks, which is improper. This patch has three goals: 1) Fix null-handling bugs that already exist in this class. 2) Make interface and doc changes that reduce the probability of future bugs. 3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast aggregators while fixing null-handling bugs. I thought about splitting this into its own patch, but it ended up being tough to split from the null-handling fixes. For (1) the fixes are, - Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject calls on isNull, by no longer extending NullableAggregatorFactory. Now uses -1 as a sigil value for null, to differentiate nulls and empty strings. - Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use eval.asBoolean() to avoid calling getLong on the selector after already calling getObject. - Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls on isNull. Also, refactored slightly to avoid the overhead of calling getObject followed by another getter (see BloomFilterAggregatorFactory for part of this). For (2) the main changes are, - Remove the "isNull" method from BaseObjectColumnValueSelector. - Clarify "isNull" doc on BaseNullableColumnValueSelector. - Rename NullableAggregatorFactory -> NullbleNumericAggregatorFactory to emphasize that it only works on aggregators that take numbers as input. - Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner. - Similar naming changes to helper methods for groupBy, ValueMatchers, etc. For (3) the other fixes for StringFirst and StringLastAggregatorFactory are, - Fixed buffer overrun in the buffer aggregators when some characters in the string code into more than one byte (the old code used "substring" to apply a byte limit, which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method. - Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp. - Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex behavior. - Refactored to share code between the four aggregators. - Improved test coverage. - Made the base stringFirst, stringLast aggregators adaptive, and streamlined the xFold versions into aliases. The adaptiveness is similar to how other aggregators like hyperUnique work.	2019-11-07 17:46:59 -08:00
Clint Wylie	7aafcf8bca	parallel broker merges on fork join pool (#8578 ) * sketch of broker parallel merges done in small batches on fork join pool * fix non-terminating sequences, auto compute parallelism * adjust benches * adjust benchmarks * now hella more faster, fixed dumb * fix * remove comments * log.info for debug * javadoc * safer block for sequence to yielder conversion * refactor LifecycleForkJoinPool into LifecycleForkJoinPoolProvider which wraps a ForkJoinPool * smooth yield rate adjustment, more logs to help tune * cleanup, less logs * error handling, bug fixes, on by default, more parallel, more tests * remove unused var * comments * timeboundary mergeFn * simplify, more javadoc * formatting * pushdown config * use nanos consistently, move logs back to debug level, bit more javadoc * static terminal result batch * javadoc for nullability of createMergeFn * cleanup * oops * fix race, add docs * spelling, remove todo, add unhandled exception log * cleanup, revert unintended change * another unintended change * review stuff * add ParallelMergeCombiningSequenceBenchmark, fixes * hyper-threading is the enemy * fix initial start delay, lol * parallelism computer now balances partition sizes to partition counts using sqrt of sequence count instead of sequence count by 2 * fix those important style issues with the benchmarks code * lazy sequence creation for benchmarks * more benchmark comments * stable sequence generation time * update defaults to use 100ms target time, 4096 batch size, 16384 initial yield, also update user docs * add jmh thread based benchmarks, cleanup some stuff * oops * style * add spread to jmh thread benchmark start range, more comments to benchmarks parameters and purpose * retool benchmark to allow modeling more typical heterogenous heavy workloads * spelling * fix * refactor benchmarks * formatting * docs * add maxThreadStartDelay parameter to threaded benchmark * why does catch need to be on its own line but else doesnt	2019-11-07 11:58:46 -08:00
Zhenxiao Luo	fca23d0c32	use copy-on-write list in InMemoryAppender (#8808 ) * use copy-on-write synchronized list in InMemoryAppender * use copy-on-write list in InMemoryAppender * Fix comment	2019-11-07 21:11:40 +03:00
Roman Leventov	5c0fc0a13a	Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments (#8564 ) * IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle() don't fetch abutting intervals; simplify getUsedSegmentsForIntervals() * Add VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() method; Propagate the decision about whether only visible segmetns or visible and overshadowed segments should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmetnListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever * Fix tests * More fixes * Add javadoc notes about returning Collection instead of Set. Add JacksonUtils.readValue() to reduce boilerplate code * Fix KinesisIndexTaskTest, factor out common parts from KinesisIndexTaskTest and KafkaIndexTaskTest into SeekableStreamIndexTaskTestBase * More test fixes * More test fixes * Add a comment to VersionedIntervalTimelineTestBase * Fix tests * Set DataSegment.size(0) in more tests * Specify DataSegment.size(0) in more places in tests * Fix more tests * Fix DruidSchemaTest * Set DataSegment's size in more tests and benchmarks * Fix HdfsDataSegmentPusherTest * Doc changes addressing comments * Extended doc for visibility * Typo * Typo 2 * Address comment	2019-11-06 11:07:04 -08:00
Jihoon Son	511fa74fa2	Move maxFetchRetry to FetchConfig; rename OpenObject (#8776 )	2019-11-04 08:26:33 -08:00
Clint Wylie	3ff5e02237	remove select query (#8739 ) * remove select query * thanks teamcity * oops * oops * add back a SelectQuery class that throws RuntimeExceptions linking to docs * adjust text * update docs per review * deprecated	2019-10-30 19:29:56 -07:00
Jihoon Son	094936ca03	Remove commit() method Firehose (#8688 ) * Remove commit() method Firehose * fix javadoc	2019-10-23 16:52:02 -07:00
Gian Merlino	bb4368baec	Fix PeriodGranularity.bucketStart for days that don't start with midnight. (#8711 ) * Fix PeriodGranularity.bucketStart for days that don't start with midnight. * Use more newspeak method.	2019-10-22 10:00:52 -07:00
Jihoon Son	30c15900be	Auto compaction based on parallel indexing (#8570 ) * Auto compaction based on parallel indexing * javadoc and doc * typo * update spell * addressing comments * address comments * fix log * fix build * fix test * increase default max input segment bytes per task * fix test	2019-10-18 13:24:14 -07:00
Chi Cao Minh	8b2afa5c49	Use targetRowsPerSegment for single-dim partitions (#8624 ) When using single-dimension partitioning, use targetRowsPerSegment (if specified) to size segments. Previously, single-dimension partitioning would always size segments as close to the max size as possible. Also, change single-dimension partitioning to allow partitions that have a size equal to the target or max size. Previously, it would create partitions up to 1 less than those limits. Also, fix some IntelliJ inspection warnings in HadoopDruidIndexerConfig.	2019-10-17 15:55:12 -07:00
Jihoon Son	4046c86d62	Stateful auto compaction (#8573 ) * Stateful auto compaction * javaodc * add removed test back * fix test * adding indexSpec to compactionState * fix build * add lastCompactionState * address comments * extract CompactionState * fix doc * fix build and test * Add a task context to store compaction state; add javadoc * fix it test	2019-10-15 22:57:42 -07:00
Benedict Jin	bba262a4c5	Fix resource leaks and suppress an incorrect LGTM alert (#8589 ) * Fix resource leaks and suppress an incorrect alert * Replace Guava's Files	2019-10-10 22:40:45 +03:00
Jihoon Son	96d8523ecb	Use hash of Segment IDs instead of a list of explicit segments in auto compaction (#8571 ) * IOConfig for compaction task * add javadoc, doc, unit test * fix webconsole test * add spelling * address comments * fix build and test * address comments	2019-10-09 11:12:00 -07:00
Parag Jain	f0d74b240d	password provider for basic authentication of HttpEmitterConfig (#8618 )	2019-10-02 15:59:17 -07:00
Fokko Driesprong	82bfe86d0c	Make more package EverythingIsNonnullByDefault by default (#8198 ) * Make more package EverythingIsNonnullByDefault by default * Fixed additional voilations after pulling in master * Change iterator to list.addAll * Fix annotations	2019-09-30 18:53:18 -06:00
Himanshu	9f1f5e115c	doubleMean aggregator to be used at query time (#8459 ) * doubleMean aggregator for computing mean * make docs * build fixes * address review comment: handle null args	2019-09-26 08:04:33 -07:00
SandishKumarHN	ade8d1922d	#8156 : StructuralSearchInspection, Prohibit check on Thread.ge… (#8394 ) * StructuralSearchInspection, Prohibit check on Thread.getState() * review changes - 1 * review changes 2 * review changes 3 * test fix * review changes-2 * review changes-3	2019-09-22 14:12:05 +03:00
Chi Cao Minh	aeac0d4fd3	Adjust defaults for hashed partitioning (#8565 ) * Adjust defaults for hashed partitioning If neither the partition size nor the number of shards are specified, default to partitions of 5,000,000 rows (similar to the behavior of dynamic partitions). Previously, both could be null and cause incorrect behavior. Specifying both a partition size and a number of shards now results in an error instead of ignoring the partition size in favor of using the number of shards. This is a behavior change that makes it more apparent to the user that only one of the two properties will be honored (previously, a message was just logged when the specified partition size was ignored). * Fix test * Handle -1 as null * Add -1 as null tests for single dim partitioning * Simplify logic to handle -1 as null * Address review comments	2019-09-21 20:57:40 -07:00
Chi Cao Minh	99b6eedab5	Rename partition spec fields (#8507 ) * Rename partition spec fields Rename partition spec fields to be consistent across the various types (hashed, single_dim, dynamic). Specifically, use targetNumRowsPerSegment and maxRowsPerSegment in favor of targetPartitionSize and maxSegmentSize. Consistent and clearer names are easier for users to understand and use. Also fix various IntelliJ inspection warnings and doc spelling mistakes. * Fix test * Improve docs * Add targetRowsPerSegment to HashedPartitionsSpec	2019-09-20 14:59:18 -06:00
Benedict Jin	c6f4f09557	Fix missing space in string literal and spurious Javadoc @param tags from LGTM (#8491 ) * Fix missing space in string literal * Fix spurious Javadoc @param tags	2019-09-16 14:37:47 +05:30
Chi Cao Minh	5f61374cb3	Fix dependency analyze warnings (#8230 ) * Fix dependency analyze warnings Update the maven dependency plugin to the latest version and fix all warnings for unused declared and used undeclared dependencies in the compile scope. Added new travis job to add the check to CI. Also fixed some source code files to use the correct packages for their imports and updated druid-forbidden-apis to prevent regressions. * Address review comments * Adjust scope for org.glassfish.jaxb:jaxb-runtime * Fix dependencies for hdfs-storage * Consolidate netty4 versions	2019-09-09 14:37:21 -07:00
Chi Cao Minh	14a8613d69	Exit JVM on curator unhandled errors (#8458 ) * Exit JVM on curator unhandled errors If an unhandled error occurs when curator is talking to ZooKeeper, exit the JVM in addition to stopping the lifecycle to prevent the process from being left in a zombie state. With this change, BoundedExponentialBackoffRetryWithQuit is no longer needed as when curator exceeds the configured retries, it triggers its unhandled error listeners. A new "connectionTimeoutMs" CuratorConfig setting is added mostly to facilitate testing curator unhandled errors, but it may be useful for users as well. * Address review comments	2019-09-06 16:43:59 -07:00
Clint Wylie	c73a489335	bump master version to 0.17.0-incubating-SNAPSHOT (#8421 )	2019-08-28 01:58:36 -07:00
Himanshu	5c3db41c2b	string column handling for long/float min/max/sum aggregators (#8319 ) * string column handling for long min/max/sum aggregators * add apache license to new files * use 'L' as suffix for long literal instead of 'l' * return null in ParallelCombiner.SettableColumnSelectorFactory.getColumnCapabilities(String) as is required by contract of ColumnSelectorFactory interface * fix more tests	2019-08-27 16:10:59 -07:00
Himanshu	4d87a19547	Logging emitter to publish query and other metric events as valid json objects (#8359 ) * LoggingEmitter: print event as json * use DefaultRequestLogEventBuilderFactory in emitting request logger by default * print context in query metric as json * removed unused jsonMapper from DefaultQueryMetrics * add comment * remove change to DefaultRequestLogEventBuilderFactory.java	2019-08-27 15:00:23 -07:00
Jihoon Son	e5ef5ddafa	Fix the shuffle with TLS enabled for parallel indexing; add an integration test; improve unit tests (#8350 ) * Fix shuffle with tls enabled; add an integration test; improve unit tests * remove debug log * fix tests * unused import * add javadoc * rename to getContent	2019-08-26 19:27:41 -07:00
Chi Cao Minh	a4b842ac2e	Speed up ExpressionSelectors.makeExprEvalSelector (#8373 ) * Speed up ExpressionSelectors.makeExprEvalSelector Addresses performance bottlenecks observed when running a search query with an expression filter and granularity set to none by caching and changing streams to for-loops. After changes, the query was observed to run 2-3x faster. Also fixes various IntelliJ inspection warnings. * Fix static analysis errors	2019-08-26 14:34:16 -07:00
Xavier Léauté	496dfa3b15	fix JVMMonitor initialization with JDK11 (#8397 ) JVMMonitor requires access to jdk.internal.perf.Perf to enable GC counters, which requires additional JVM arguments with JDK11. This change adds a fallback in case GC counters cannot be initialized, and logs a warning message explaining how GC counters can be enabled.	2019-08-25 08:26:58 -04:00
Xavier Léauté	8e0c307e54	Do not assume system classloader is URLClassLoader in Java 9+ (#8392 ) * Fallback to parsing classpath for hadoop task in Java 9+ In Java 9 and above we cannot assume that the system classloader is an instance of URLClassLoader. This change adds a fallback method to parse the system classpath in that case, and adds a unit test to validate it matches what JDK8 would do. Note: This has not been tested in an actual hadoop setup, so this is mostly to help us pass unit tests. * Remove granularity test of dubious value One of our granularity tests relies on system classloader being a URLClassLoaders to catch a bug related to class initialization and static initializers using a subclass (see #2979) This test was added to catch a potential regression, but it assumes we would add back the same type of static initializers to this specific class, so it seems to be of dubious value as a unit test and mostly serves to illustrate the bug. relates to #5589	2019-08-24 20:47:54 -04:00
SandishKumarHN	33f0753a70	Add Checkstyle for constant name static final (#8060 ) * check ctyle for constant field name * check ctyle for constant field name * check ctyle for constant field name * check ctyle for constant field name * check ctyle for constant field name * check ctyle for constant field name * check ctyle for constant field name * check ctyle for constant field name * check ctyle for constant field name * merging with upstream * review-1 * unknow changes * unknow changes * review-2 * merging with master * review-2 1 changes * review changes-2 2 * bug fix	2019-08-23 13:13:54 +03:00
Surekha	cf2a2dd917	Add group_id to the sys.tasks table (#8304 ) * Add group_id to overlord tasks API and sys.tasks table * adjust test * modify docs * Make groupId nullable * fix integration test * fix toString * Remove groupId from TaskInfo * Modify docs and tests * modify TaskMonitorTest	2019-08-22 15:28:23 -07:00
Clint Wylie	e3e85e8f1a	use alternative public domain text instead of lorem ipsum text in compression test (#8299 ) * use apache license text instead of lorem ipsum text to be unambiguous with regards to licensing * feed your head	2019-08-22 12:53:38 -07:00
Chi Cao Minh	6fa22f6939	Enable code coverage (#8303 ) * Enable code coverage Code coverage was disabled via https://github.com/apache/incubator-druid/pull/3122 due to an issue with cobertura in Travis CI. Switch code coverage tool from cobertura to jacoco to avoid issue and re-enable coveralls for Travis CI. * Exclude non-production code * Exclude benchmark generated code * Exclude DruidTestRunnerFactory	2019-08-20 15:36:19 -07:00
Benedict Jin	5e35500e0e	Fix unused format argument (#8345 )	2019-08-20 12:45:46 -07:00
Benedict Jin	781873ba53	Fix resource leak (#8337 ) * Fix resource leak * Patch comments	2019-08-20 12:55:41 +03:00
Jihoon Son	5dac6375f3	Add support for parallel native indexing with shuffle for perfect rollup (#8257 ) * Add TaskResourceCleaner; fix a couple of concurrency bugs in batch tasks * kill runner when it's ready * add comment * kill run thread * fix test * Take closeable out of Appenderator * add javadoc * fix test * fix test * update javadoc * add javadoc about killed task * address comment * Add support for parallel native indexing with shuffle for perfect rollup. * Add comment about volatiles * fix test * fix test * handling missing exceptions * more clear javadoc for stopGracefully * unused import * update javadoc * Add missing statement in javadoc * address comments; fix doc * add javadoc for isGuaranteedRollup * Rename confusing variable name and fix typos * fix typos; move fetch() to a better home; fix the expiration time * add support https	2019-08-15 17:43:35 -07:00
Jonathan Wei	ef7b9606f2	Keep track of task location for completed tasks (#8286 ) * Keep track of task location for completed tasks * Add TaskLifecycleTest location checks	2019-08-15 16:57:02 -05:00
Fokko Driesprong	1a3aa1cfc0	Bump commons-io from 2.5 to 2.6 (#8006 ) * Bump commons-io from 2.5 to 2.6 * Update licenses.yaml * Address comments	2019-08-13 17:10:37 -07:00
Himanshu	176da53996	make double sum/min/max agg work on string columns (#8243 ) * make double sum/min/max agg work on string columns * style and compilation fixes * fix tests * address review comments * add comment on SimpleDoubleAggregatorFactory * make checkstyle happy	2019-08-13 15:55:14 -07:00
Chi Cao Minh	b359c5b3d9	Fix SIGAR dependency connection timeout (#8258 ) After enabling parallel builds for "mvn install", the sigar dependency would sometimes resolve to the incorrect artifact repo for some of the maven modules. This issue seems to be fixed by moving the definition of the sigar dependency's artifact repo to the root POM. Also, depending on network speeds, "mvn -q install" may take longer than the default 10 minute timeout to print any output. Use travis_wait to extend the timeout to 15 minutes.	2019-08-08 20:13:18 -05:00

1 2 3 4

165 Commits