druid

mirror of https://github.com/apache/druid.git synced 2025-02-10 12:05:00 +00:00

Author	SHA1	Message	Date
Gian Merlino	a87db7f353	Add HashJoinSegment, a virtual segment for joins. (#9111 ) * Add HashJoinSegment, a virtual segment for joins. An initial step towards #8728. This patch adds enough functionality to implement a joining cursor on top of a normal datasource. It does not include enough to actually do a query. For that, future patches will need to wire this low-level functionality into the query language. * Fixups. * Fix missing format argument. * Various tests and minor improvements. * Changes. * Remove or add tests for unused stuff. * Fix up package locations.	2020-01-16 13:14:20 -08:00
Chi Cao Minh	1fd05bef9a	Add jackson-mapper-asl for hdfs-storage extension (#9178 ) Previously jackson-mapper-asl was excluded to remove a security vulnerability; however, it is required for functionality (e.g., org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator).	2020-01-14 09:50:45 -08:00
Atul Mohan	ea51bc45bf	Fix nullhandling in tests (#9119 )	2020-01-12 20:19:12 -08:00
Clint Wylie	85219ece13	fix null handling for arithmetic post aggregator comparator (#9159 ) * fix null handling for arithmetic postagg comparator, add test for comparator for min/max/quantile postaggs in histogram ext * fix	2020-01-10 13:49:19 -08:00
Jihoon Son	e27a1e8604	Fix handling nullable writableComparable in OrcStructConverter (#9138 ) * Handle nullable writableComparable in OrcStructConverter * add missing dependency	2020-01-08 13:40:24 -08:00
Clint Wylie	f540216931	fix InputFormat serde issue with SeekableStream based supervisors (#9136 )	2020-01-07 16:18:54 -06:00
Clint Wylie	7af85250cb	null handling for doubles sketch and array of doubles sketch aggs (#9112 ) * doubles sketch and array of doubles sketch aggs now skip rows with nulls in sql compatible null handling mode * formatting	2020-01-07 14:15:32 -06:00
Suneet Saldanha	bdd0d0d8a5	Add avro dependency to parquet extension (#9124 ) * Add avro dependency to parquet extension If the parquet extension is loaded and an ingestionSpec uses the older format specifying a 'parser' instead of using an 'inputFormat' the job fails with the following error java.lang.TypeNotPresentException: Type org.apache.avro.generic.GenericRecord not present This change removes the exclusion of the avro package so that the missing class can be found. * Address review comments and add dependency version	2020-01-03 20:11:13 -06:00
Jonathan Wei	aa539177ec	De-incubation cleanup in code, docs, packaging (#9108 ) * De-incubation cleanup in code, docs, packaging * remove unused docs script	2020-01-03 12:33:19 -05:00
Jonathan Wei	4e8368a5d9	Set version to 0.18.0-SNAPSHOT (#9109 )	2020-01-02 17:55:10 -05:00
Gian Merlino	18eb456fe6	S3: Improvements to prefix listing (including fix for an infinite loop) (#9098 ) * S3: Improvements to prefix listing (including fix for an infinite loop) 1) Fixes #9097, an infinite loop that occurs when more than one batch of objects is retrieved during a prefix listing. 2) Removes the Access Denied fallback code added in #4444. I don't think the behavior is reasonable: its purpose is to fall back from a prefix listing to a single-object access, but it's only activated when the end user supplied a prefix, so it would be better to simply fail, so the end user knows that their request for a prefix-based load is not going to work. Presumably the end user can switch from supplying 'prefixes' to supplying 'uris' if desired. 3) Filters out directory placeholders when walking prefixes. 4) Splits LazyObjectSummariesIterator into its own class and adds tests. * Adjust S3InputSourceTest. * Changes from review. * Include hamcrest-core.	2019-12-31 19:06:49 -05:00
Chi Cao Minh	513bb1f6da	Get proper Kinesis index task AWS credentials (#9082 ) Previously, the configured S3 credentials would be used instead of the ones configured for Kinesis for Kinesis index tasks.	2019-12-20 19:35:05 -08:00
Jihoon Son	66056b2826	Using annotation to distinguish Hadoop Configuration in each module (#9013 ) * Multibinding for NodeRole * Fix endpoints * fix doc * fix test * Using annotation to distinguish Hadoop Configuration in each module	2019-12-11 17:30:44 -08:00
Jonathan Wei	8af41d7cd0	Update version to 0.18.0-incubating-SNAPSHOT (#9009 )	2019-12-11 14:04:03 -08:00
Chi Cao Minh	3de7ab8523	DataSketches jars in core (#9003 ) Having DataSketches jars in core will allow potential improvements, for example: - Provide an alternative implementation of HLL: https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html - Range partitioning for native parallel batch indexing without having the user load extensions on the classpath Dev mailing list discussion: https://lists.apache.org/thread.html/301410d71ff799cf616bf17c4ebcf9999fc30829f5fa62909f403e6c%40%3Cdev.druid.apache.org%3E	2019-12-10 14:02:34 -08:00
Chi Cao Minh	bab78fc80e	Parallel indexing single dim partitions (#8925 ) * Parallel indexing single dim partitions Implements single dimension range partitioning for native parallel batch indexing as described in #8769. This initial version requires the druid-datasketches extension to be loaded. The algorithm has 5 phases that are orchestrated by the supervisor in `ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`. These phases and the main classes involved are described below: 1) In parallel, determine the distribution of dimension values for each input source split. `PartialDimensionDistributionTask` uses `StringSketch` to generate the approximate distribution of dimension values for each input source split. If the rows are ungrouped, `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter` uses a Bloom filter to skip rows that would be grouped. The final distribution is sent back to the supervisor via `DimensionDistributionReport`. 2) The range partitions are determined. In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the supervisor uses `StringSketchMerger` to merge the individual `StringSketch`es created in the preceding phase. The merged sketch is then used to create the range partitions. 3) In parallel, generate partial range-partitioned segments. `PartialRangeSegmentGenerateTask` uses the range partitions determined in the preceding phase and `RangePartitionCachingLocalSegmentAllocator` to generate `SingleDimensionShardSpec`s. The partition information is sent back to the supervisor via `GeneratedGenericPartitionsReport`. 4) The partial range segments are grouped. In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`, the supervisor creates the `PartialGenericSegmentMergeIOConfig`s necessary for the next phase. 5) In parallel, merge partial range-partitioned segments. `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to retrieve the partial range-partitioned segments generated earlier and then merges and publishes them. * Fix dependencies & forbidden apis * Fixes for integration test * Address review comments * Fix docs, strict compile, sketch check, rollup check * Fix first shard spec, partition serde, single subtask * Fix first partition check in test * Misc rewording/refactoring to address code review * Fix doc link * Split batch index integration test * Do not run parallel-batch-index twice * Adjust last partition * Split ITParallelIndexTest to reduce runtime * Rename test class * Allow null values in range partitions * Indicate which phase failed * Improve asserts in tests	2019-12-09 23:05:49 -08:00
Roman Leventov	1c62987783	Add SelfDiscoveryResource; rename org.apache.druid.discovery.No… (#6702 ) * Add SelfDiscoveryResource * Rename org.apache.druid.discovery.NodeType to NodeRole. Refactor CuratorDruidNodeDiscoveryProvider. Make SelfDiscoveryResource to listen to updates only about a single node (itself). * Extended docs * Fix brace * Remove redundant throws in Lifecycle.Handler.stop() * Import order * Remove unresolvable link * Address comments * tmp * tmp * Rollback docker changes * Remove extra .sh files * Move filter * Fix SecurityResourceFilterTest	2019-12-08 18:47:58 +03:00
Chi Cao Minh	af74acaa85	Address security vulnerabilities CVSS >= 7 (#8980 ) * Address security vulnerabilities CVSS >= 7 Update dependencies to address security vulnerabilities with CVSS scores of 7 or higher. A new Travis CI job is added to prevent new high/critical security vulnerabilities from being added. Updated dependencies: - api-util 1.0.0 -> 1.0.3 - jackson 2.9.10 -> 2.10.1 - kafka 2.1.0 -> 2.1.1 - libthrift 0.10.0 -> 0.13.0 - protobuf 3.2.0 -> 3.11.0 The following high/critical security vulnerabilities are currently suppressed (so that the new Travis CI job can be added now) and are left as future work to fix: - hibernate-validator:5.2.5 - jackson-mapper-asl:1.9.13 - libthrift:0.6.1 - netty:3.10.6 - nimbus-jose-jwt:4.41.1 * Rename EDL1 license file * Fix inspection errors	2019-12-05 14:34:35 -08:00
Clint Wylie	5ecdf94d83	add 'prefixes' support to google input source (#8930 ) * add prefixes support to google input source, making it symmetrical-ish with s3 * docs * more better, and tests * unused * formatting * javadoc * dependencies * oops * review comments * better javadoc	2019-12-04 21:01:10 -08:00
Clint Wylie	b4efaa698b	unexclude necessary jackson mapper-asl jars (#8977 )	2019-12-02 17:01:11 -08:00
Chi Cao Minh	4b7e79a4e6	Exclude unneeded hadoop transitive dependencies (#8962 ) * Exclude unneeded hadoop transitive dependencies These dependencies are provided by core: - com.squareup.okhttp:okhttp - commons-beanutils:commons-beanutils - org.apache.commons:commons-compress - org.apache.zookeeper:zookeeper These dependencies are not needed and are excluded because they contain security vulnerabilities: - commons-beanutils:commons-beanutils-core - org.codehaus.jackson:jackson-mapper-asl * Simplify exclusions + separate unneeded/vulnerable * Do not exclude jackson-mapper-asl	2019-12-02 16:08:21 -08:00
Clint Wylie	6997b167b1	add hdfs client dependency for native batch parquet when using hdfs (#8964 )	2019-11-28 13:12:45 -08:00
Jonathan Wei	00ce18a0ea	Additional Kinesis resharding fixes (#8870 ) * Additional Kinesis resharding fixes * Address PR comments * Remove unused method * Adjust SegmentTransactionalInsertAction null handling * Check for unchanged metadata on empty publish * Add logs for empty publish * Fix javadoc * Clear offset when invalid endOffsets are seen * Fix LGTM alert * Fix build * Add resharding note to Kinesis docs * Checkstyle * Spelling * Address PR comments * Checkstyle	2019-11-28 12:59:01 -08:00
Jihoon Son	86e8903523	Support orc format for native batch ingestion (#8950 ) * Support orc format for native batch ingestion * fix pom and remove wrong comment * fix unnecessary condition check * use flatMap back to handle exception properly * move exceptionThrowingIterator to intermediateRowParsingReader * runtime	2019-11-28 12:45:24 -08:00
jon-wei	dfbc066163	Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1" This reverts commit a0f21d9b07bb4b4efd8ef98d0effd0c28f2f7d43.	2019-11-27 23:22:43 -08:00
jon-wei	0402ff85b8	Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 8ffa71e7e6ed140446acaa94baf47b779e6f24a3.	2019-11-27 23:22:32 -08:00
jon-wei	8ffa71e7e6	[maven-release-plugin] prepare for next development iteration	2019-11-27 23:18:48 -08:00
jon-wei	a0f21d9b07	[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1	2019-11-27 23:18:37 -08:00
Clint Wylie	4458113375	S3 input source (#8903 ) * add s3 input source for native batch ingestion * add docs * fixes * checkstyle * lazy splits * fixes and hella tests * fix it * re-use better iterator * use key * javadoc and checkstyle * exception * oops * refactor to use S3Coords instead of URI * remove unused code, add retrying stream to handle s3 stream * remove unused parameter * update to latest master * use list of objects instead of object * serde test * refactor and such * now with the ability to compile * fix signature and javadocs * fix conflicts yet again, fix S3 uri stuffs * more tests, enforce uri for bucket * javadoc * oops * abstract class instead of interface * null or empty * better error	2019-11-25 22:31:19 -08:00
Alexander Saydakov	4a9da3f3fc	use the latest release of datasketches (#8647 ) * use the latest release of datasketches * added datasketches-memory dependency * updated datasketches entries * use datasketches-memory-1.2.0 * updated dependencies * fixed tests	2019-11-25 19:45:51 -08:00
Clint Wylie	cd31bcc093	un-exclude necessary parquet jackson dependencies instead of relying on curator (#8939 )	2019-11-25 15:57:34 -08:00
Jihoon Son	a2e6de4b16	Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor (#8924 ) * Fix the potential race SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor * Fix docs and javadoc * Add unit tests for large or small estimated num splits * add override	2019-11-23 01:38:08 -08:00
Gian Merlino	e0eb85ace7	Add FileUtils.createTempDir() and enforce its usage. (#8932 ) * Add FileUtils.createTempDir() and enforce its usage. The purpose of this is to improve error messages. Previously, the error message on a nonexistent or unwritable temp directory would be "Failed to create directory within 10,000 attempts". * Further updates. * Another update. * Remove commons-io from benchmark. * Fix tests.	2019-11-22 19:48:49 -08:00
Clint Wylie	7250010388	add parquet support to native batch (#8883 ) * add parquet support to native batch * cleanup * implement toJson for sampler support * better binaryAsString test * docs * i hate spellcheck * refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest * add comment, fix some stuff * adjustments * fix accident * tweaks	2019-11-22 10:49:16 -08:00
Jihoon Son	934547a215	RetryingInputEntity to retry on transient errors (#8923 ) * RetryingInputEntity to retry on transient errors * fix some javadoc and httpEntity * Make it interface * Javadoc for offset	2019-11-21 21:32:18 -08:00
Jonathan Wei	dc6178d1f2	Upgrade Calcite to 1.21 (#8566 ) * Upgrade Calcite to 1.21 * Checkstyle, test fix' * Exclude calcite yaml deps, update license.yaml * Add method for exception chain handling * Checkstyle * PR comments, Add outer limit context flag * Revert project settings change * Update subquery test comment * Checkstyle fix * Fix test in sql compat mode * Fix test * Fix dependency analysis * Address PR comments * Checkstyle * Adjust testSelectStarFromSelectSingleColumnWithLimitDescending	2019-11-20 21:22:55 -08:00
Jihoon Son	ac6d703814	Support inputFormat and inputSource for sampler (#8901 ) * Support inputFormat and inputSource for sampler * Cleanup javadocs and names * fix style * fix timed shutoff input source reader * fix timed shutoff input source reader again * tidy up timed shutoff reader * unused imports * fix tc	2019-11-20 14:51:25 -08:00
Surekha	d628bebbd7	Make supervisor API similar to submit task API (#8810 ) * accept spec or dataSchema, tuningConfig, ioConfig while submitting task json * fix test * update docs * lgtm warning * Add original constructor back to IndexTask to minimize changes * fix indentation in docs * Allow spec to be specified in supervisor schema * undo IndexTask spec changes * update docs * Add Nullable and deprecated annotations * remove deprecated configs from SeekableStreamSupervisorSpec * remove nullable annotation	2019-11-20 10:04:41 -08:00
Clint Wylie	3fcaa1a61b	fix sql compatible null handling config work with runtime.properties (#8876 ) * fix sql compatible null handling config work with runtime.properties * fix npe * fix tests * add friendly error * comment, and friendlier still * fix compile * fix from merges	2019-11-20 03:55:29 -08:00
Chi Cao Minh	4ae6466ae2	HDFS input source (#8899 ) * HDFS input source Add support for using HDFS as an input source. In this version, commas or globs are not supported in HDFS paths. * Fix forbidden api * Address review comments	2019-11-19 22:19:39 -08:00
Clint Wylie	074a45219d	add google cloud storage InputSource for native batch (#8907 ) * add google cloud storage InputSource for native batch * rename * checkstyle * fix * fix spelling * review comments	2019-11-19 19:49:43 -08:00
Rye	d0913475b7	sampler returns nulls in CSV (#8871 ) * sampler returns nulls in CSV * fixed kafka sampler test * fix Kinesis test * sql compatibility fix * remove null to empty string conversion, use null * fix sql compatibility	2019-11-19 13:59:44 -08:00
Gian Merlino	c44452f0c1	Tidy up lifecycle, query, and ingestion logging. (#8889 ) * Tidy up lifecycle, query, and ingestion logging. The goal of this patch is to improve the clarity and usefulness of Druid's logging for cluster operators. For more information, see https://twitter.com/cowtowncoder/status/1195469299814555648. Concretely, this patch does the following: - Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the goal of reducing redundancy and improving clarity by avoiding showing rarely-useful log messages. This includes most "starting" and "stopping" messages, and most messages related to individual columns. - Adds new log4j2 templates that show operators how to enable DEBUG logging for certain important packages. - Eliminates stack traces for query errors, unless log level is DEBUG or more. This is useful because query errors often indicate user error rather than system error, but dumping stack trace often gave operators the impression that there was a system failure. - Adds task id to Appenderator, AppenderatorDriver thread names. In the default log4j2 configuration, this will put them in log lines as well. It's very useful if a user is using the Indexer, where multiple tasks run in the same JVM. - More consistent terminology when it comes to "sequences" (sets of segments that are handed-off together by Kafka ingestion) and "offsets" (cursors in partitions). These terms had been confused in some log messages due to the fact that Kinesis calls offsets "sequence numbers". - Replaces some ugly toString calls with either the JSONification or something more operator-accessible (like a URL or segment identifier, instead of JSON object representing the same). * Adjustments. * Adjust integration test.	2019-11-19 13:57:58 -08:00
Surekha	cf6643eb9a	add sequenceName and currentCheckPoint for backwards compatibility (#8864 ) * add sequenceName and currentCheckPoint for backwards compatibility * Add serde unit test in kafka * fix checkstyle * add hashcode * update javadoc	2019-11-19 13:11:31 -08:00
Chi Cao Minh	8365bdf62a	Address security vulnerabilities (#8878 ) * Address security vulnerabilities Security vulnerabilities addressed by upgrading 3rd party libs: - Upgrade avro-ipc to 1.9.1 - sonatype-2019-0115 - Upgrade caffeine to 2.8.0 - sonatype-2019-0282 - Upgrade commons-beanutils to 1.9.4 - CVE-2014-0114 - Upgrade commons-codec to 1.13 - sonatype-2012-0050 - Upgrade commons-compress to 1.19 - CVE-2019-12402 - sonatype-2018-0293 - Upgrade hadoop-common to 2.8.5 - CVE-2018-11767 - Upgrade hadoop-mapreduce-client-core to 2.8.5 - CVE-2017-3166 - Upgrade hibernate-validator to 5.2.5 - CVE-2017-7536 - Upgrade httpclient to 4.5.10 - sonatype-2017-0359 - Upgrade icu4j to 55.1 - CVE-2014-8147 - Upgrade jackson-databind to 2.6.7.3: - CVE-2017-7525 - Upgrade jetty-http to 9.4.12: - CVE-2017-7657 - CVE-2017-7658 - CVE-2017-7656 - CVE-2018-12545 - Upgrade log4j-core to 2.8.2 - CVE-2017-5645: - Upgrade netty to 3.10.6 - CVE-2015-2156 - Upgrade netty-common to 4.1.42 - CVE-2019-9518 - Upgrade netty-codec-http to 4.1.42 - CVE-2019-16869 - Upgrade nimbus-jose-jwt to 4.41.1 - CVE-2017-12972 - CVE-2017-12974 - Upgrade plexus-utils to 3.0.24 - CVE-2017-1000487 - sonatype-2015-0173 - sonatype-2016-0398 - Upgrade postgresql to 42.2.8 - CVE-2018-10936 Note that if users are using JDBC lookups with postgres, they may need to update the JDBC jar used by the lookup extension. * Fix license for postgresql	2019-11-19 09:14:33 -08:00
Atul Mohan	8515a03c6b	Modify batch index task naming to accommodate simultaneous tasks (#8612 ) * Change hadoop task naming * Remove unused * Add timestamp * Fix build	2019-11-18 15:07:16 -08:00
Chi Cao Minh	d60978343a	Improve missing JDBC driver error for lookups (#8872 ) If the JDBC drivers are missing from the lookup extensions, throw an exception that directs the user how to resolve the issue. This change is a follow up to #8825.	2019-11-18 11:42:38 -08:00
Rye	ea8e4066f6	Use earliest offset on kafka newly discovered partitions (#8748 ) * Use earliest offset on kafka newly discovered partitions * resolve conflicts * remove redundant check cases * simplified unit tests * change test case * rewrite comments * add regression test * add junit ignore annotation * minor modifications * indent * override testableKafkaSupervisor and KafkaRecordSupplier to make the test runable * modified test constructor of kafkaRecordSupplier * simplify * delegated constructor	2019-11-18 11:05:31 -08:00
Jihoon Son	1611792855	Add InputSource and InputFormat interfaces (#8823 ) * Add InputSource and InputFormat interfaces * revert orc dependency * fix dimension exclusions and failing unit tests * fix tests * fix test * fix test * fix firehose and inputSource for parallel indexing task * fix tc * fix tc: remove unused method * Formattable * add needsFormat(); renamed to ObjectSource; pass metricsName for reader * address comments * fix closing resource * fix checkstyle * fix tests * remove verify from csv * Revert "remove verify from csv" This reverts commit 1ea7758489cc8c9d708bd691fd48e62085fd9455. * address comments * fix import order and javadoc * flatMap * sampleLine * Add IntermediateRowParsingReader * Address comments * move csv reader test * remove test for verify * adjust comments * Fix InputEntityIteratingReader * rename source -> entity * address comments	2019-11-15 09:22:09 -08:00
Jonathan Wei	75ea0d592a	Add more datasketches doubles sketch SQL functions (#8843 ) * Add more datasketches doubles sketch SQL postaggs * style and lgtm	2019-11-08 18:05:06 -08:00


679 Commits