druid

Commit Graph

Author	SHA1	Message	Date
Vadim Ogievetsky	bb0b810b1d	fix html tags in docs (#13117 ) * fix html tags in docs * revert not null	2022-09-18 19:40:33 -07:00
Gian Merlino	d4967c38f8	Various documentation updates. (#13107 ) * Various documentation updates. 1) Split out "data management" from "ingestion". Break it into thematic pages. 2) Move "SQL-based ingestion" into the Ingestion category. Adjust content so all conceptual content is in concepts.md and all syntax content is in reference.md. Shorten the known issues page to the most interesting ones. 3) Add SQL-based ingestion to the ingestion method comparison page. Remove the index task, since index_parallel is just as good when maxNumConcurrentSubTasks: 1. 4) Rename various mentions of "Druid console" to "web console". 5) Add additional information to ingestion/partitioning.md. 6) Remove a mention of Tranquility. 7) Remove a note about upgrading to Druid 0.10.1. 8) Remove no-longer-relevant task types from ingestion/tasks.md. 9) Move ingestion/native-batch-firehose.md to the hidden section. It was previously deprecated. 10) Move ingestion/native-batch-simple-task.md to the hidden section. It is still linked in some places, but it isn't very useful compared to index_parallel, so it shouldn't take up space in the sidebar. 11) Make all br tags self-closing. 12) Certain other cosmetic changes. 13) Update to node-sass 7. * make travis use node12 for docs Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>	2022-09-16 21:58:11 -07:00
Vadim Ogievetsky	2493eb17bf	Doc fixes around msq (#13090 ) * remove things that do not apply * fix more things * pin node to a working version * fix * fixes * known issues tidy up * revert auto formatting changes * remove management-uis page which is 100% lies * don't mention the Coordinator console (that no longer exists) * goodies * fix typo	2022-09-16 02:15:26 -07:00
Jill Osborne	1f69140623	Nested columns documentation (#12946 ) Co-authored-by: Clint Wylie <cjwylie@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: brian.le <brian.le@imply.io>	2022-09-06 14:42:18 -07:00
Gian Merlino	85d2a6d879	Improve range partitioning docs. (#13016 ) Two improvements: - Use a realistic targetRowsPerSegment, so if people copy and paste the example from the docs, it will generate reasonable segments. - Spell "countryName" correctly.	2022-09-01 15:21:30 -07:00
Jill Osborne	7a1e1f88bb	Remove experimental note from stable features (#12973 ) * Removed experimental note for features that are no longer experimental * Updated native batch doc	2022-08-25 09:26:46 -07:00
Victoria Lim	02914c17b9	Tutorial on ingesting and querying Theta sketches (#12723 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-08-24 09:23:22 -07:00
David Hergenroeder	533c39f35a	Fix rollup docs bullet formatting (#12876 )	2022-08-09 10:10:07 +08:00
Katya Macedo	c6dd9dd4af	Fix typo in compaction.md (#12774 )	2022-08-04 14:47:22 -07:00
Charles Smith	efbb58e90e	docs: remove maxRowsPerSegment where appropriate (#12071 ) * remove maxRowsPerSegment where appropriate * fix tutorial, accept suggestions * Update docs/design/coordinator.md * additional tutorial file * fix initial index spec * accept comments * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * add back comment on maxrows per segment * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/tutorials/tutorial-compaction.md 
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * rm duplicate entry * Update native-batch-simple-task.md remove ref to `maxrowspersegment` * Update native-batch.md remove ref to `maxrowspersegment` * final tenticles * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2022-07-28 16:52:13 +05:30
Victoria Lim	6394ecfd21	update figure and reference (#12813 )	2022-07-22 15:54:25 -07:00
Katya Macedo	809bf161ce	Add a note about setting the value of maxNumConcurrentSubTasks (#12772 ) * Add clarification for combining input source * Update inputFormat note * Update maxNumConcurrentSubTasks note * Fix broken link * Update docs/ingestion/native-batch-input-source.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-07-19 15:34:21 -07:00
Atul Mohan	75045970cd	S3 Ingestion from non-default endpoints (#11798 ) * Add endpoint support for s3inputsource * Changes to tests * Fix docs * Fix config * Fix inspections * Fix spelling * Remove password from toString	2022-07-15 11:03:34 -07:00
Didip Kerabat	6ddb828c7a	Able to filter Cloud objects with glob notation. (#12659 ) In a heterogeneous environment, sometimes you don't have control over the input folder. Upstream can put any folder they want. In this situation the S3InputSource.java is unusable. Most people like me solved it by using Airflow to fetch the full list of parquet files and pass it over to Druid. But doing this explodes the JSON spec. We had a situation where 1 of the JSON spec is 16MB and that's simply too much for Overlord. This patch allows users to pass {"filter": "*.parquet"} and let Druid performs the filtering of the input files. I am using the glob notation to be consistent with the LocalFirehose syntax.	2022-06-24 11:40:08 +05:30
Jill Osborne	f050069767	Segments doc update (#12344 ) * Corrected heading levels in segments doc * IMPLY-18394: Updated Segments doc * Update docs/design/segments.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/design/segments.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/design/segments.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/design/segments.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/design/segments.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/design/segments.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update segments.md * Updated links to changed headings in Segments doc * Corrected spelling error * Update segments.md Incorporated suggestions from Paul Rogers. * Update index.md * Update segments.md * Update segments.md * Update segments.md * Update compaction.md * Update docs/design/segments.md fix typo * Update docs/ingestion/compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim 
<vtlim@users.noreply.github.com> * Update docs/design/segments.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2022-06-16 13:25:17 -07:00
Victoria Lim	353475bd36	Docs for automatic compaction (#12569 ) * docs for auto-compaction * fix broken links * another link * Apply suggestions from code review Co-authored-by: Suneet Saldanha <suneet@apache.org> * Apply suggestions from code review Co-authored-by: Suneet Saldanha <suneet@apache.org> * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Suneet Saldanha <suneet@apache.org> * reorg content for skipOffset * Update docs/ingestion/automatic-compaction.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Apply suggestions from code review Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com> Co-authored-by: Suneet Saldanha <suneet@apache.org> Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>	2022-06-09 14:55:12 -07:00
Gian Merlino	fdfecfd996	Improved docs for range partitioning. (#12350 ) * Improved docs for range partitioning. 1) Clarify the benefits of range partitioning. 2) Clarify which filters support pruning. 3) Include the fact that multi-value dimensions cannot be used for partitioning. * Additional clarification. * Update other section. * Another adjustment. * Updates from review.	2022-05-16 09:42:31 -07:00
Kashif Faraz	60b4fa0f75	Docs: Fix column name in ingestion rollup doc (#12036 ) Fix the referred column name from "count" to "num_rows" as "count" vs. "COUNT(*)" might be a little confusing in this example.	2022-05-10 17:35:59 +05:30
Victoria Lim	0206a2da5c	Update automatic compaction docs with consistent terminology (#12416 ) * specify automatic compaction where applicable * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * update for style and consistency * implement suggested feedback * remove duplicate example * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/compaction.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/api-reference.md * update .spelling * Adopt review suggestions Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2022-05-03 16:22:25 -07:00
Charles Smith	42fa5c26e1	remove arbitrary granularity spec from docs (#12460 ) * remove arbitrary granularity spec from docs * Update docs/ingestion/ingestion-spec.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2022-04-28 16:36:54 -07:00
Peter Marshall	b47316b844	Update native-batch.md (#12478 ) Fixed indent on the Granularity Spec section and removed some superfluous tabbings.	2022-04-25 21:44:17 +08:00
Charles Smith	408b46ae9f	Fixes a small typo in ingestion spec doc (#12143 ) * small typo * Update docs/ingestion/ingestion-spec.md Co-authored-by: sthetland <steve.hetland@imply.io> Co-authored-by: sthetland <steve.hetland@imply.io>	2022-04-18 16:53:50 +08:00
Peter Marshall	1201c9b2e5	Docs - added another common config property to tuningConfig (#11935 ) * Update ingestion-spec.md Added indexSpecForIntermediatePersists as a common configuration property. * Update ingestion-spec.md Amended to remove "below" and add link to the table. * Update ingestion-spec.md Removed passive.	2022-04-18 13:41:39 +08:00
Victoria Lim	e6229b76a6	Document data format and example for featureSpec (#12394 ) * add data format and example for featureSpec * add second feature in example * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-04-06 15:17:15 -07:00
317brian	ac6c24793e	docs(fix): add clarity around granularitySpec (#12362 ) * fix: add clarify around granularitySpec * fix spacing * Update docs/ingestion/compaction.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>	2022-04-06 09:24:37 -07:00
Victoria Lim	d326c681c1	Document config for ingesting null columns (#12389 ) * config for ingesting null columns * add link * edit .spelling * what happens if storeEmptyColumns is disabled	2022-04-05 09:15:42 -07:00
Peter Marshall	b9a968e7ff	Docs – expressions link back and timestamp hint (#11674 ) * Update math-expr.md Link back to transformSpec * Update ingestion-spec.md Moved info about using the timestamp inside transforms into the actual timestamp section. * Update ingestion-spec.md Active language.	2022-03-29 09:12:30 -07:00
mark-imply	3c55565398	Update ingestion-spec.md (#12371 ) * Update ingestion-spec.md Added best practice point to dimensions description. * Update docs/ingestion/ingestion-spec.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-03-29 09:12:02 -07:00
Agustin Gonzalez	abe76ccb90	Batch ingestion replace (#12137 ) * Tombstone support for replace functionality * A used segment interval is the interval of a current used segment that overlaps any of the input intervals for the spec * Update compaction test to match replace behavior * Adapt ITAutoCompactionTest to work with tombstones rather than dropping segments. Add support for tombstones in the broker. * Style plus simple queriableindex test * Add segment cache loader tombstone test * Add more tests * Add a method to the LogicalSegment to test whether it has any data * Test filter with some empty logical segments * Refactor more compaction/dropexisting tests * Code coverage * Support for all empty segments * Skip tombstones when looking-up broker's timeline. Discard changes made to tool chest to avoid empty segments since they will no longer have empty segments after lookup because we are skipping over them. * Fix null ptr when segment does not have a queriable index * Add support for empty replace interval (all input data has been filtered out) * Fixed coverage & style * Find tombstone versions from lock versions * Test failures & style * Interner was making this fail since the two segments were consider equal due to their id's being equal * Cleanup tombstone version code * Force timeChunkLock whenever replace (i.e. dropExisting=true) is being used * Reject replace spec when input intervals are empty * Documentation * Style and unit test * Restore test code deleted by mistake * Allocate forces TIME_CHUNK locking and uses lock versions. TombstoneShardSpec added. * Unused imports. Dead code. Test coverage. * Coverage. * Prevent killer from throwing an exception for tombstones. This is the killer used in the peon for killing segments. * Fix OmniKiller + more test coverage. 
* Tombstones are now marked using a shard spec * Drop a segment factory.json in the segment cache for tombstones * Style * Style + coverage * style * Add TombstoneLoadSpec.class to mapper in test * Update core/src/main/java/org/apache/druid/segment/loading/TombstoneLoadSpec.java Typo Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com> * Update docs/configuration/index.md Missing Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com> * Typo * Integrated replace with an existing test since the replace part was redundant and more importantly, the test file was very close or exceeding the 10 min default "no output" CI Travis threshold. * Range does not work with multi-dim Co-authored-by: Jonathan Wei <jon-wei@users.noreply.github.com>	2022-03-08 20:07:02 -07:00
Victoria Lim	903174de20	correct errors on compaction doc (#12308 )	2022-03-04 15:33:35 -08:00
Jihoon Son	e5ad862665	A new includeAllDimension flag for dimensionsSpec (#12276 ) * includeAllDimensions in dimensionsSpec * doc * address comments * unused import and doc spelling	2022-02-25 18:27:48 -08:00
Victoria Lim	c61b19d443	Refactor SQL docs (#12239 ) * refactor and link fixes * add sql docs to left nav * code format for needle * updated web console script * link fixes * update earliest/latest functions * edits for grammar and style * more link fixes * another link * update with #12226 * update .spelling file	2022-02-11 14:43:30 -08:00
Jonathan Wei	74c876e578	Throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder (#12080 ) * Add option to throw parse exceptions on schema get errors for SchemaRegistryBasedAvroBytesDecoder * Remove option	2022-01-13 12:36:51 -06:00
Frank Chen	58245b4617	Support JsonPath functions in JsonPath expressions (#11722 ) * Add jsonPath functions support * Add jsonPath function test for Avro * Add jsonPath function length() to Orc * Add jsonPath function length() to Parquet * Add more tests to ORC format * update doc * Fix exception during ingestion * Add IT test case * Revert "Fix exception during ingestion" This reverts commit `5a5484b9ea`. * update IT test case * Add 'keys()' * Commit IT test case * Fix UT	2021-12-10 10:53:23 +08:00
shallada	25c9eba2f7	clarify time format for intervals (#12035 )	2021-12-08 08:31:21 -08:00
Peter Marshall	c209db3a1d	Docs - roll-up tip (#11677 ) * Update rollup.md Added SE tip around roll-up. * Update docs/ingestion/rollup.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2021-12-07 09:17:36 -08:00
Peter Marshall	d7463c99e9	Docs - Task ref logs correction (#11746 ) * Update tasks.md Removed confusing backreference * Update tasks.md Changed silly grammar.	2021-12-07 09:15:19 -08:00
Charles Smith	7ed46800c3	Docs: Add multi-dimension partitioning doc; refactor native batch and separate into smaller topics. (#11983 ) Adds documentation for multi-dimension partitioning. cc: @kfaraz Refactors the native batch partitioning topic as follows: Native batch ingestion covers parallel-index Native batch simple task indexing covers index Native batch input sources covers ioSource Native batch ingestion with firehose covers deprecated firehose	2021-12-03 16:37:14 +05:30
Maytas Monsereenusorn	bb3d2a433a	Support filtering data in Auto Compaction (#11922 ) * add impl * fix checkstyle * add test * add test * add unit tests * fix unit tests * fix unit tests * fix unit tests * add IT * add IT * add comments * fix spelling	2021-11-24 10:56:38 -08:00
Peter Marshall	ed0606db69	Docs - Corrected admonition issue (#11926 ) * Corrected admonition issue * Update data-formats.md Removed all admonition bits, and took out sf linebreaks. * Update data-formats.md Changed the shocker line into something a little more practical.	2021-11-22 12:14:30 -08:00
Peter Marshall	0c0001579d	Update compaction.md (#11937 ) Removed superfluous tabs that caused issues in rendering Added nav to the `inputSpec`	2021-11-22 21:33:47 +08:00
jacobtolar	3aee5d9ec3	Fix: invalid JSON in ingestion spec doc example (#11880 ) * Fix: invalid JSON in ingestion spec doc example * Update ingestion-spec.md	2021-11-22 21:33:26 +08:00
Jihoon Son	f91868602d	Remove stale warning for HTTP inputSource (#11907 )	2021-11-13 10:27:14 +08:00
Charles Smith	33a5cda061	Docs: Splits Kafka topic. Adds detailed example for kafka inputFormat (#11912 ) * Splits Kafka topic according to function. Adds detailed example for kafka inputFormat * Apply suggestions from code review accept suggestions from review Co-authored-by: sthetland <steve.hetland@imply.io> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review accept suggestions Co-authored-by: sthetland <steve.hetland@imply.io> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * accept suggestions * accept suggestions * final typos and clarifications * bringing forward some syntax fixes Co-authored-by: sthetland <steve.hetland@imply.io> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2021-11-12 13:02:23 -08:00
Maytas Monsereenusorn	ddc68c6a81	Support changing dimension schema in Auto Compaction (#11874 ) * add impl * add unit tests * fix checkstyle * add impl * add impl * add impl * add impl * add impl * add impl * fix test * add IT * add IT * fix docs * add test * address comments * fix conflict	2021-11-08 21:17:08 -08:00
Maytas Monsereenusorn	ba2874ee1f	Support changing query granularity in Auto Compaction (#11856 ) * add queryGranularity * fix checkstyle * fix test	2021-11-01 15:18:44 -07:00
Maytas Monsereenusorn	33d9d9bd74	Add rollup config to auto and manual compaction (#11850 ) * add rollup to auto and manual compaction * add unit tests * add unit tests * add IT * fix checkstyle	2021-10-29 10:22:25 -07:00
Charles Smith	938c1493e5	edits to kafka inputFormat (#11796 ) * edits to kafka inputFormat * revise conflict resolution description * tweak for clarity * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update 
docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * style fixes * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/data-formats.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2021-10-15 14:01:10 -07:00
Katya Macedo	45d0ecbefb	clarify hadoop input paths (#11781 ) Co-authored-by: Katya Macedo <katya.macedo@imply.io>	2021-10-07 20:22:51 -07:00
lokesh-lingarajan	ad6609a606	Kafka Input Format for headers, key and payload parsing (#11630 ) ### Description Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events is another class of data that provides useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside kafka payload, but nonetheless there are some very useful metadata that is carried in kafka headers that can serve as useful dimension for aggregation and in turn bringing better insights. PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats. We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format(provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the kafka record. We also need a way to expose the data present in the key as druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload. This PR is designed to solve the above by providing wrapper around any existing input formats and merging the data into a single unified Druid row. 
Lets look at a sample input format from the above discussion "inputFormat": { "type": "kafka", // New input format type "headerLabelPrefix": "kafka.header.", // Label prefix for header columns, this will avoid collusions while merging columns "recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made available in case payload does not carry timestamp "headerFormat": // Header parser specifying that values are of type string { "type": "string" }, "valueFormat": // Value parser from json parsing { "type": "json", "flattenSpec": { "useFieldDiscovery": true, "fields": [...] } }, "keyFormat": // Key parser also from json parsing { "type": "json" } } Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, eg., headers coming in as string and payload as json. KafkaInputFormat will be the uber class extending inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blend the data resolving conflicts in columns and generating a single unified InputRow for Druid ingestion. "headerFormat" will allow users to plug parser type for the header values and will add default header prefix as "kafka.header."(can be overridden) for attributes to avoid collision while merging attributes with payload. Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plugin existing parser. One thing to note here is that if batching is performed, then the code is augmenting header and key values to every record in the batch. Kafka key parser will handle parsing Key portion of the Kafka record and will ingest the Key with dimension name as "kafka.key". ## KafkaInputFormat Class: This is the class that orchestrates sending the consumerRecord to each parser, retrieve rows, merge the columns into one final row for Druid consumption. 
KafkaInputformat should make sure to release the resources that gets allocated as a part of reader in CloseableIterator<InputRow> during normal and exception cases. During conflicts in dimension/metrics names, the code will prefer dimension names from payload and ignore the dimension either from headers/key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.	2021-10-07 08:56:27 -07:00
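
The merge rule described in the last commit above (#11630) can be illustrated with a small sketch. This is not Druid's Java implementation: the function name `merge_kafka_record` and its parameters are hypothetical, and only the `kafka.header.` prefix, the `kafka.key` column name, and the "payload wins on conflict" rule come from the commit message.

```python
from typing import Any, Dict, Optional


def merge_kafka_record(
    headers: Dict[str, Any],
    key: Optional[Any],
    payload: Dict[str, Any],
    header_prefix: str = "kafka.header.",  # default prefix named in the commit message
    key_column: str = "kafka.key",         # key column name named in the commit message
) -> Dict[str, Any]:
    """Blend header, key, and payload fields into one flat row (hypothetical helper)."""
    row: Dict[str, Any] = {}
    # Header fields get a prefix so they are unlikely to clash with payload columns.
    for name, value in headers.items():
        row[header_prefix + name] = value
    # The record key, when present, is stored under a single column.
    if key is not None:
        row[key_column] = key
    # The payload is applied last, so on a name clash the payload value is kept
    # and the header/key value is dropped.
    row.update(payload)
    return row


if __name__ == "__main__":
    merged = merge_kafka_record(
        headers={"env": "prod"},
        key="client-7",
        payload={"bytes": 512, "kafka.key": "from-payload"},
    )
    print(merged)
```

Applying the payload last is one simple way to get the stated precedence; it also means an existing payload-only spec yields the same columns as before, which matches the migration goal the commit message gives.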


145 Commits