druid

Commit Graph

Author	SHA1	Message	Date
Maytas Monsereenusorn	a46d561bd7	Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead (#10740 ) * Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead * Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead * Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead * Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead * fix checkstyle * Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead * Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead * fix test * fix test * add log * Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead * address comments * fix checkstyle * fix checkstyle * add config to skip overhead memory calculation * add test for the skipBytesInMemoryOverheadCheck config * add docs * fix checkstyle * fix checkstyle * fix spelling * address comments * fix travis * address comments	2021-01-27 00:34:56 -08:00
Charles Smith	99494e3d16	suggest index parallel for native batch reindexing > 1GB (#10788 )	2021-01-22 21:54:28 -08:00
Jonathan Wei	68bb038b31	Multiphase segment merge for IndexMergerV9 (#10689 ) * Multiphase merge for IndexMergerV9 * JSON fix * Cleanup temp files * Docs * Address logging and add IT * Fix spelling and test unloader datasource name	2021-01-05 22:19:09 -08:00
sthetland	6ae8059c09	cleaning up and fixing links (#10528 ) * cleaning up and fixing links * reverting local link * Update indexer.md * link checking * Fixing one more stale link for PostgreSQL	2020-12-17 13:37:43 -08:00
Atul Mohan	44df05b8b2	Clarify split hint spec behavior (#10656 )	2020-12-09 08:24:32 -06:00
Pierre Carrier	835b328851	docs/: use tuningConfig (#10540 )	2020-10-30 09:39:21 -05:00
Joseph Glanville	7ce9ac4548	Fix Avro support in Web Console (#10232 ) * Fix Avro OCF detection prefix and run formation detection on raw input * Support Avro Fixed and Enum types correctly * Check Avro version byte in format detection * Add test for AvroOCFReader.sample Ensures that the Sampler doesn't receive raw input that it can't serialize into JSON. * Document Avro type handling * Add TS unit tests for guessInputFormat	2020-10-07 21:08:22 -07:00
Jihoon Son	0cc9eb4903	Store hash partition function in dataSegment and allow segment pruning only when hash partition function is provided (#10288 ) * Store hash partition function in dataSegment and allow segment pruning only when hash partition function is provided * query context * fix tests; add more test * javadoc * docs and more tests * remove default and hadoop tests * consistent name and fix javadoc * spelling and field name * default function for partitionsSpec * other comments * address comments * fix tests and spelling * test * doc	2020-09-24 16:32:56 -07:00
Jonathan Wei	cb30b1fe23	Automatically determine numShards for parallel ingestion hash partitioning (#10419 ) * Automatically determine numShards for parallel ingestion hash partitioning * Fix inspection, tests, coverage * Docs and some PR comments * Adjust locking * Use HllSketch instead of HyperLogLogCollector * Fix tests * Address some PR comments * Fix granularity bug * Small doc fix	2020-09-24 13:47:53 -07:00
Atul Mohan	b6ad790dc7	Support combining inputsource for parallel ingestion (#10387 ) * Add combining inputsource * Fix documentation Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>	2020-09-15 16:25:35 -07:00
LightGHLi	a3bb6ee4a6	Add missing comma between JSON members in data-formats.md (#10343 )	2020-09-03 20:03:06 -07:00
Fernando	69d8645425	Adding supported compression formats for native batch ingestion (#10306 ) * Adding supported compression formats for native batch ingestion * Update docs/ingestion/native-batch.md Co-authored-by: sthetland <steve.hetland@imply.io> * fix spellcheck Co-authored-by: Suneet Saldanha <suneet@apache.org> Co-authored-by: sthetland <steve.hetland@imply.io>	2020-08-26 12:39:48 -07:00
Jihoon Son	b5b3e6ecce	Add maxNumFiles to splitHintSpec (#10243 ) * Add maxNumFiles to splitHintSpec * missing link * fix build failure; use maxNumFiles for integration tests * spelling * lower default * Update docs/ingestion/native-batch.md Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * address comments; change default maxSplitSize * spelling * typos and doc * same change for segments splitHintSpec * fix build * fix build Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>	2020-08-21 09:43:58 -07:00
Gian Merlino	d36a0f61da	Clarify documentation on dimensions, dimensionExclusions. (#10265 ) In particular: exclusions are ignored if dimensions are set.	2020-08-12 08:06:53 -07:00
Atul Mohan	06539bc828	Set default server.maxsize to the sum of segment cache (#10255 ) * Default server.maxsize * Remove maxsize refs from config Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>	2020-08-10 09:21:22 -07:00
Jian Wang	271f90f205	Add segment pruning for hash based shard spec (#9810 ) * Add segment pruning for hash based partitioning * Update doc * Add additional test * Address comments * Fix unit test failure Co-authored-by: Jian Wang <jwang@pinterest.com>	2020-07-30 18:44:26 -07:00
mans2singh	d4bd6e5207	ingestion and tutorial doc update (#10202 )	2020-07-21 17:52:23 -07:00
Fullstop000	bcf41922ce	Remove unsupported task types in doc (#10111 )	2020-07-04 18:13:53 -07:00
Maytas Monsereenusorn	ec46d82c71	Add integration tests for SqlInputSource (#10080 ) * Add integration tests for SqlInputSource * make it faster	2020-06-26 10:32:42 -10:00
Jihoon Son	aaee72c781	Allow append to existing datasources when dynamic partitioning is used (#10033 ) * Fill in the core partition set size properly for batch ingestion with dynamic partitioning * incomplete javadoc * Address comments * fix tests * fix json serde, add tests * checkstyle * Set core partition set size for hash-partitioned segments properly in batch ingestion * test for both parallel and single-threaded task * unused variables * fix test * unused imports * add hash/range buckets * some test adjustment and missing json serde * centralized partition id allocation in parallel and simple tasks * remove string partition chunk * revive string partition chunk * fill numCorePartitions for hadoop * clean up hash stuffs * resolved todos * javadocs * Fix tests * add more tests * doc * unused imports * Allow append to existing datasources when dynamic partitioing is used * fix test * checkstyle * checkstyle * fix test * fix test * fix other tests.. * checkstyle * hansle unknown core partitions size in overlord segment allocation * fail to append when numCorePartitions is unknown * log * fix comment; rename to be more intuitive * double append test * cleanup complete(); add tests * fix build * add tests * address comments * checkstyle	2020-06-25 13:37:31 -07:00
Jianhuan Liu	5600e1c204	fix docs error in hadoop-based part (#9907 ) * fix docs error: google to azure and hdfs to http * fix docs error: indexSpecForIntermediatePersists of tuningConfig in hadoop-based batch part * fix docs error: logParseExceptions of tuningConfig in hadoop-based batch part * fix docs error: maxParseExceptions of tuningConfig in hadoop-based batch part	2020-06-19 23:14:54 -10:00
Jihoon Son	d644a27f1a	Create packed core partitions for hash/range-partitioned segments in native batch ingestion (#10025 ) * Fill in the core partition set size properly for batch ingestion with dynamic partitioning * incomplete javadoc * Address comments * fix tests * fix json serde, add tests * checkstyle * Set core partition set size for hash-partitioned segments properly in batch ingestion * test for both parallel and single-threaded task * unused variables * fix test * unused imports * add hash/range buckets * some test adjustment and missing json serde * centralized partition id allocation in parallel and simple tasks * remove string partition chunk * revive string partition chunk * fill numCorePartitions for hadoop * clean up hash stuffs * resolved todos * javadocs * Fix tests * add more tests * doc * unused imports	2020-06-18 18:40:43 -07:00
Maytas Monsereenusorn	1a2620606d	API to verify a datasource has the latest ingested data (#9965 ) * API to verify a datasource has the latest ingested data * API to verify a datasource has the latest ingested data * API to verify a datasource has the latest ingested data * API to verify a datasource has the latest ingested data * API to verify a datasource has the latest ingested data * fix checksyle * API to verify a datasource has the latest ingested data * API to verify a datasource has the latest ingested data * API to verify a datasource has the latest ingested data * API to verify a datasource has the latest ingested data * fix spelling * address comments * fix checkstyle * update docs * fix tests * fix doc * address comments * fix typo * fix spelling * address comments * address comments * fix typo in docs	2020-06-16 20:48:30 -10:00
Atul Mohan	17cf8ea8f2	Add Sql InputSource (#9449 ) * Add Sql InputSource * Add spelling * Use separate DruidModule * Change module name * Fix docs * Use sqltestutils for tests * Add additional tests * Fix inspection * Add module test * Fix md in docs * Remove annotation Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>	2020-06-09 12:55:20 -07:00
Jianhuan Liu	2050f2b00a	fix docs error: google to azure and hdfs to http (#9881 )	2020-05-20 10:17:39 -07:00
Joseph Glanville	793f386d6a	Add support for Avro OCF using InputFormat (#9671 ) * Add AvroOCFInputFormat * Support supplying a reader schema in AvroOCFInputFormat * Add docs for Avro OCF input format * Address review comments * Address second round of review	2020-05-16 14:09:12 -07:00
Jian Wang	85dfbb64cb	Update documention for metricCompression (#9811 )	2020-05-03 12:56:48 -07:00
sthetland	c61365c1e0	Druid Quickstart refactor and update (#9766 ) * Update data-formats.md Per Suneet, "Since you're editing this file can you also fix the json on line 177 please - it's missing a comma after the }" * Light text cleanup * Removing discussion of sample data, since it's repeated in the data loading tutorial, and not immediately relevant here. * Update index.md * original quickstart full first pass * original quickstart full first pass * first pass all the way through * straggler * image touchups and finished old tutorial * a bit of finishing up * Review comments * fixing links * spell checking gymnastics	2020-04-30 12:07:28 -07:00
Clint Wylie	bf85ea19b2	roaring bitmaps by default (#9548 ) * it is finally time * fix it * more docs * fix doc	2020-03-23 18:15:57 -07:00
Maytas Monsereenusorn	814f5a9717	add password provider reference to s3 optional cred docs (#9439 )	2020-03-09 17:56:42 -07:00
Jihoon Son	9466ac7c9b	Skip empty files for local, hdfs, and cloud input sources (#9450 ) * Skip empty files for local, hdfs, and cloud input sources * split hint spec doc * doc for skipping empty files * fix typo; adjust tests * unnecessary fluent iterable * address comments * fix test * use the right lists * fix test * fix test	2020-03-03 20:51:06 -08:00
Maytas Monsereenusorn	92fb83726b	Add support for optional aws credentials for s3 for ingestion (#9375 ) * Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion * Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion * Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion * fix build failure * fix failing build * fix failing build * Code cleanup * fix failing test * Removed CloudConfigProperties and make specific class for each cloudInputSource * Removed CloudConfigProperties and make specific class for each cloudInputSource * pass s3ConfigProperties for split * lazy init s3client * update docs * fix docs check * address comments * add ServerSideEncryptingAmazonS3.Builder * fix failing checkstyle * fix typo * wrap the ServerSideEncryptingAmazonS3.Builder in a provider * added java docs for S3InputSource constructor * added java docs for S3InputSource constructor * remove wrap the ServerSideEncryptingAmazonS3.Builder in a provider	2020-02-25 20:59:53 -08:00
zachjsh	d771b42ed1	Move Azure extension into Core (#9394 ) * Move Azure extension into Core Moving the azure extension into Core. * * Fix build failure * * Add The MIT License (MIT) to list of compatible licenses * * Address review comments * * change reference to contrib azure to core azure * * Fix spelling mistakes.	2020-02-25 17:49:16 -08:00
Jihoon Son	3bc7ae782c	Create splits of multiple files for parallel indexing (#9360 ) * Create splits of multiple files for parallel indexing * fix wrong import and npe in test * use the single file split in tests * rename * import order * Remove specific local input source * Update docs/ingestion/native-batch.md Co-Authored-By: sthetland <steve.hetland@imply.io> * Update docs/ingestion/native-batch.md Co-Authored-By: sthetland <steve.hetland@imply.io> * doc and error msg * fix build * fix a test and address comments Co-authored-by: sthetland <steve.hetland@imply.io>	2020-02-24 17:34:39 -08:00
zachjsh	f707064bed	Add Azure config options for segment prefix and max listing length (#9356 ) * Add Azure config options for segment prefix and max listing length Added configuration options to allow the user to specify the prefix within the segment container to store the segment files. Also added a configuration option to allow the user to specify the maximum number of input files to stream for each iteration. * * Fix test failures * * Address review comments * * add dependency explicitly to pom * * update docs * * Address review comments * * Address review comments	2020-02-21 14:12:03 -08:00
Atul Mohan	7968524b01	Add Pig-specific file handling to Avro parser (#9258 ) * Add processing for data files from AvroStorage * Add words to spellings file	2020-02-10 21:53:11 -08:00
sthetland	83ddc8de1e	Update data-formats.md (#9238 ) * Update data-formats.md Field error and light rewording of new Avro material (and working through the doc authoring process). * Update data-formats.md Make default statements consistent. Future change: s/=/is.	2020-01-22 15:00:53 -08:00
Jihoon Son	153495068b	Doc update for the new input source and the new input format (#9171 ) * Doc update for new input source and input format. - The input source and input format are promoted in all docs under docs/ingestion - All input sources including core extension ones are located in docs/ingestion/native-batch.md - All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md - New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md * parquet * add warning for range partitioning with sequential mode * hdfs + s3, gs * add fs impl for gs * address comments * address comments * gcs	2020-01-17 15:52:05 -08:00
Jonathan Wei	aa539177ec	De-incubation cleanup in code, docs, packaging (#9108 ) * De-incubation cleanup in code, docs, packaging * remove unused docs script	2020-01-03 12:33:19 -05:00
Chi Cao Minh	6178f05da6	Fail superbatch range partition multi dim values (#9058 ) * Fail superbatch range partition multi dim values Change the behavior of parallel indexing range partitioning to fail ingestion if any row had multiple values for the partition dimension. After this change, the behavior matches that of hadoop indexing. (Previously, rows with multiple dimension values would be skipped.) * Improve err msg, rename method, rename test class	2019-12-18 10:14:03 -08:00
Chi Cao Minh	3de7ab8523	DataSketches jars in core (#9003 ) Having DataSketches jars in core will allow potential improvements, for example: - Provide an alternative implementation of HLL: https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html - Range partitioning for native parallel batch indexing without having the user load extensions on the classpath Dev mailing list discussion: https://lists.apache.org/thread.html/301410d71ff799cf616bf17c4ebcf9999fc30829f5fa62909f403e6c%40%3Cdev.druid.apache.org%3E	2019-12-10 14:02:34 -08:00
Chi Cao Minh	bab78fc80e	Parallel indexing single dim partitions (#8925 ) * Parallel indexing single dim partitions Implements single dimension range partitioning for native parallel batch indexing as described in #8769. This initial version requires the druid-datasketches extension to be loaded. The algorithm has 5 phases that are orchestrated by the supervisor in `ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`. These phases and the main classes involved are described below: 1) In parallel, determine the distribution of dimension values for each input source split. `PartialDimensionDistributionTask` uses `StringSketch` to generate the approximate distribution of dimension values for each input source split. If the rows are ungrouped, `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter` uses a Bloom filter to skip rows that would be grouped. The final distribution is sent back to the supervisor via `DimensionDistributionReport`. 2) The range partitions are determined. In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the supervisor uses `StringSketchMerger` to merge the individual `StringSketch`es created in the preceding phase. The merged sketch is then used to create the range partitions. 3) In parallel, generate partial range-partitioned segments. `PartialRangeSegmentGenerateTask` uses the range partitions determined in the preceding phase and `RangePartitionCachingLocalSegmentAllocator` to generate `SingleDimensionShardSpec`s. The partition information is sent back to the supervisor via `GeneratedGenericPartitionsReport`. 4) The partial range segments are grouped. In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`, the supervisor creates the `PartialGenericSegmentMergeIOConfig`s necessary for the next phase. 5) In parallel, merge partial range-partitioned segments. `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to retrieve the partial range-partitioned segments generated earlier and then merges and publishes them. * Fix dependencies & forbidden apis * Fixes for integration test * Address review comments * Fix docs, strict compile, sketch check, rollup check * Fix first shard spec, partition serde, single subtask * Fix first partition check in test * Misc rewording/refactoring to address code review * Fix doc link * Split batch index integration test * Do not run parallel-batch-index twice * Adjust last partition * Split ITParallelIndexTest to reduce runtime * Rename test class * Allow null values in range partitions * Indicate which phase failed * Improve asserts in tests	2019-12-09 23:05:49 -08:00
Jonathan Wei	c949a25210	Add DruidInputSource (replacement for IngestSegmentFirehose) (#8982 ) * Add Druid input source and format * Inherit dims/metrics from segment * Add ingest segment firehose reindexing test * Remove unnecessary module * Fix unit tests, checkstyle * Add doc entry * Fix dimensionExclusions handling, add parallel index integration test * Add spelling exclusion * Address some PR comments * Checkstyle * wip * Address rest of PR comments * Address PR comments	2019-12-05 16:50:00 -08:00
Jihoon Son	a2e6de4b16	Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor (#8924 ) * Fix the potential race SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor * Fix docs and javadoc * Add unit tests for large or small estimated num splits * add override	2019-11-23 01:38:08 -08:00
Jihoon Son	1611792855	Add InputSource and InputFormat interfaces (#8823 ) * Add InputSource and InputFormat interfaces * revert orc dependency * fix dimension exclusions and failing unit tests * fix tests * fix test * fix test * fix firehose and inputSource for parallel indexing task * fix tc * fix tc: remove unused method * Formattable * add needsFormat(); renamed to ObjectSource; pass metricsName for reader * address comments * fix closing resource * fix checkstyle * fix tests * remove verify from csv * Revert "remove verify from csv" This reverts commit `1ea7758489`. * address comments * fix import order and javadoc * flatMap * sampleLine * Add IntermediateRowParsingReader * Address comments * move csv reader test * remove test for verify * adjust comments * Fix InputEntityIteratingReader * rename source -> entity * address comments	2019-11-15 09:22:09 -08:00
Gian Merlino	7605c23354	Remove Tranquility configs and certain doc references. (#8793 ) Since it hasn't received updates or community interest in a while, it makes sense to de-emphasize it in the distribution and most documentation (outside of simple mentions of its existence).	2019-10-30 16:30:16 -07:00
Gian Merlino	b65d2ac648	Add HDFS firehose (#8754 ) * Add HDFS firehose. * Tests, support for lists of paths. * Fixups. * Update list of firehoses. * Wildcards is a word.	2019-10-28 08:07:38 -07:00
Vadim Ogievetsky	f9b94a5db1	Docs: remove self link (#8760 ) This section links to itself in the description. I tried to follow that link and spit hot tea all over my monitor from laughter.	2019-10-27 22:33:22 -07:00
David Glasser	b453fda251	docs: clarify native batch ingestion w/ overlapping segments (#8720 ) I was confused by a paragraph in the docs that I myself wrote!	2019-10-22 21:01:56 -07:00
Jihoon Son	30c15900be	Auto compaction based on parallel indexing (#8570 ) * Auto compaction based on parallel indexing * javadoc and doc * typo * update spell * addressing comments * address comments * fix log * fix build * fix test * increase default max input segment bytes per task * fix test	2019-10-18 13:24:14 -07:00

60 Commits