druid

Commit Graph

Author	SHA1	Message	Date
Katya Macedo	4804630c78	Clean up Kinesis doc (#14529 )	2023-07-25 19:24:36 -07:00
Nhi Pham	2dc3e94a9a	Service status API documentation refactor (#14528 ) Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-07-25 18:42:47 -07:00
Atul Mohan	03d6d395a0	Extension to read and ingest iceberg data files (#14329) This adds a new contrib extension: druid-iceberg-extensions which can be used to ingest data stored in Apache Iceberg format. It adds a new input source of type iceberg that connects to a catalog and retrieves the data files associated with an iceberg table and provides these data file paths to either an S3 or HDFS input source depending on the warehouse location. Two important dependencies associated with Apache Iceberg tables are: Catalog: This extension supports reading from either a Hive Metastore catalog or a Local file-based catalog. Support for AWS Glue is not available yet. Warehouse: This extension supports reading data files from either HDFS or S3. Adapters for other cloud object locations should be easy to add by extending the AbstractInputSourceAdapter.	2023-07-18 08:59:57 +05:30
Abhishek Radhakrishnan	1f6507dd60	Remove the deprecated `InsertCannotOrderByDescending` MSQ fault (#14588 ) The deprecated MSQ fault, InsertCannotOrderByDescending, is removed.	2023-07-17 09:23:39 +00:00
Abhishek Radhakrishnan	f4ee58eaa8	Add `aggregatorMergeStrategy` property in SegmentMetadata queries (#14560) * Add aggregatorMergeStrategy property to SegmentMetadataQuery. - Adds a new property aggregatorMergeStrategy to segmentMetadata query. aggregatorMergeStrategy currently supports three types of merge strategies - the legacy strict and lenient strategies, and the new latest strategy. - The latest strategy considers the latest aggregator from the latest segment by time order when there's a conflict when merging aggregators from different segments. - Deprecate lenientAggregatorMerge property; The API validates that both the new and old properties are not set, and returns an exception. - When merging segments as part of segmentMetadata query, the segments have a more elaborate id -- <datasource>_<interval>_merged_<partition_number> format, similar to the name format that segments usually contain. Previously it was simply "merged". - Adjust unit tests to test the latest strategy, to assert the returned complete SegmentAnalysis object instead of just the aggregators for completeness. * Don't explicitly set strict strategy in tests * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/querying/segmentmetadataquery.md * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2023-07-13 12:37:36 -04:00
Gian Merlino	e10e35aa2c	Add REGEXP_REPLACE function. (#14460 ) * Add REGEXP_REPLACE function. Replaces all instances of a pattern with a replacement string. * Fixes. * Improve test coverage. * Adjust behavior.	2023-06-29 13:47:57 -07:00
Adarsh Sanjeev	128133fadc	Add replication_factor column to sys.segments table (#14403) Description: Druid allows a configuration of load rules that may cause a used segment to not be loaded on any historical. This status is not tracked in the sys.segments table on the broker, which makes it difficult to determine if the unavailability of a segment is expected and if we should not wait for it to be loaded on a server after ingestion has finished. Changes: - Track replication factor in `SegmentReplicantLookup` during evaluation of load rules - Update API `/druid/coordinator/v1/metadata/segments` to return replication factor - Add column `replication_factor` to the sys.segments virtual table and populate it in `MetadataSegmentView` - If this column is 0, the segment is not assigned to any historical and will not be loaded.	2023-06-18 10:02:21 +05:30
Abhishek Radhakrishnan	b8495d45a1	Expose Druid functions in `INFORMATION_SCHEMA.ROUTINES` table. (#14378) * Add INFORMATION_SCHEMA.ROUTINES to expose Druid operators and functions. * checkstyle * remove IS_DETERMINISTIC. * test * cleanup test * remove logs and simplify * fixup unit test * Add docs for INFORMATION_SCHEMA.ROUTINES table. * Update test and add another SQL query. * add stuff to .spelling and checkstyle fix. * Add more tests for custom operators. * checkstyle and comment. * Some naming cleanup. * Add FUNCTION_ID * The different Calcite function syntax enums get translated to FUNCTION * Update docs. * Cleanup markdown table. * fixup test. * fixup intellij inspection * Review comment: nullable column; add a function to determine function syntax. * More tests; add non-function syntax operators. * More unit tests. Also add a separate test for DruidOperatorTable. * actually just validate non-zero count. * switch up the order * checkstyle fixes.	2023-06-13 15:44:04 -04:00
317brian	49c056af17	docs: add basic contributor guide for docs (#14365 ) Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-06-05 10:53:17 -07:00
317brian	70952c0977	docs: add sql array functions to nav (#14361 ) * docs: add sql array functions to nav * fix typo * add sql array functions to list * fix spelling errors	2023-06-01 16:45:27 -07:00
Victoria Lim	6b3a6113c4	Doc: List supported values for Kafka `headerFormat` (#14316 )	2023-05-22 15:41:07 -07:00
Katya Macedo	269137c682	Update Ingestion section (#14023 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Victoria Lim <lim.t.victoria@gmail.com>	2023-05-19 09:42:27 -07:00
317brian	ceda1e98b9	docs: add docs for schema auto-discovery (#14065 ) * wip schemaless * wip * more cleanup * update tuningconfig example * updates based on feedback from clint * remove errant comma * update dimension object to include auto * update to include string schemaless way * fix spelling errors * updates for type-aware and string-based changes * Update docs/ingestion/schema-design.md * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * update spelling file * Update docs/ingestion/schema-design.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * copyedits * fix anchor --------- Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2023-05-17 01:36:02 -07:00
imply-cheddar	f9861808bc	Be able to load segments on Peons (#14239) * Be able to load segments on Peons This change introduces a new config on WorkerConfig that indicates how many bytes of each storage location to use for storage of a task. Said config is divided up amongst the locations and slots and then used to set TaskConfig.tmpStorageBytesPerTask. The Peons use their local task dir and tmpStorageBytesPerTask as their StorageLocations for the SegmentManager such that they can accept broadcast segments.	2023-05-12 16:51:00 -07:00
TSFenwick	accd5536df	Allow for Log4J to be configured for peons but still ensure console logging is enforced (#14094 ) * Allow for Log4J to be configured for peons but still ensure console logging is enforced This change will allow for log4j to be configured for peons but require console logging is still configured for them to ensure peon logs are saved to deep storage. Also fixed the test ConsoleLoggingEnforcementTest to use a valid appender for the non console Config as the previous config was incorrect and would never return a logger. * fix checkstyle * add warning to logger when it overwrites all loggers to be console * optimize calls for altering logging config for ConsoleLoggingEnforcementConfigurationFactory add getName to the druid logger class * update docs, and error message * edit docs to be more clear * fix checkstyle issues * CI fixes - LoggerTest code coverage and fix spelling issue for logging docs	2023-04-24 10:41:56 -07:00
Laksh Singla	8eb854c845	Remove maxResultsSize config property from S3OutputConfig (#14101 ) * "maxResultsSize" has been removed from the S3OutputConfig and a default "chunkSize" of 100MiB is now present. This change primarily affects users who wish to use durable storage for MSQ jobs.	2023-04-18 14:25:20 +05:30
Atul Mohan	e3c160f2f2	Add start_time column to sys.servers (#13358 ) Adds a new column start_time to sys.servers that captures the time at which the server was added to the cluster.	2023-04-14 15:23:34 +05:30
317brian	7e572eef08	docs: sql unnest and cleanup unnest datasource (#13736 ) Co-authored-by: Elliott Freis <elliottfreis@Elliott-Freis.earth.dynamic.blacklight.net> Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Paul Rogers <paul-rogers@users.noreply.github.com> Co-authored-by: Jill Osborne <jill.osborne@imply.io> Co-authored-by: Anshu Makkar <83963638+anshu-makkar@users.noreply.github.com> Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> Co-authored-by: Elliott Freis <108356317+imply-elliott@users.noreply.github.com> Co-authored-by: Nicholas Lippis <nick.lippis@imply.io> Co-authored-by: Rohan Garg <7731512+rohangarg@users.noreply.github.com> Co-authored-by: Karan Kumar <karankumar1100@gmail.com> Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Clint Wylie <cwylie@apache.org> Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com> Co-authored-by: Laksh Singla <lakshsingla@gmail.com>	2023-04-04 13:07:54 -07:00
frankgrimes97	2f98675285	Tuple sketch SQL support (#13887 ) This PR is a follow-up to #13819 so that the Tuple sketch functionality can be used in SQL for both ingestion using Multi-Stage Queries (MSQ) and also for analytic queries against Tuple sketch columns.	2023-03-28 18:47:12 +05:30
Arnout Engelen	daff7fe73b	Document how to report security issues (#13886 ) Document how to report security issues on the security overview page, so we can link this page from the homepage. That should make all the other important security information easier to find as well.	2023-03-27 11:26:37 +05:30
Atul Mohan	19db32d6b4	Add JWT authenticator support for validating ID Tokens (#13242 ) Expands the OIDC based auth in Druid by adding a JWT Authenticator that validates ID Tokens associated with a request. The existing pac4j authenticator works for authenticating web users while accessing the console, whereas this authenticator is for validating Druid API requests made by Direct clients. Services already supporting OIDC can attach their ID tokens to the Druid requests under the Authorization request header.	2023-03-25 18:41:40 +05:30
Gian Merlino	fe9d0c46d5	Improve memory efficiency of WrappedRoaringBitmap. (#13889 ) * Improve memory efficiency of WrappedRoaringBitmap. Two changes: 1) Use an int[] for sizes 4 or below. 2) Remove the boolean compressRunOnSerialization. Doesn't save much space, but it does save a little, and it isn't adding a ton of value to have it be configurable. It was originally configurable in case anything broke when enabling it, but it's been a while and nothing has broken. * Slight adjustment. * Adjust for inspection. * Updates. * Update snaps. * Update test. * Adjust test. * Fix snaps.	2023-03-09 15:48:02 -08:00
Victoria Lim	e46379ba7a	Docs: Update name of the metadata tables (#13734 ) * Update name of the metadata tables * emend spelling file * fix spelling --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-02-23 13:57:59 -08:00
AmatyaAvadhanula	0cf1fc3d55	Indexing on multiple disks (#13476 ) * Initial commit * Simple UTs * Parameterize tests * Parameterized tests for k8s task runner * Fix restore bug * Refactor TaskStorageDirTracker * Change CliPeon args	2023-02-08 11:31:34 +05:30
Kashif Faraz	f629643c50	Fix value of lookup sync period in docs (#13695 ) * Fix lookup docs * Fix spelling * Apply suggestions from code review Co-authored-by: Charles Smith <techdocsmith@gmail.com> --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-02-01 18:12:00 -08:00
Jill Osborne	356b0e37cf	Tutorial: Query view (#13565 ) * Tutorial: Query view * Removed duplicate file * Update tutorial-sql-query-view.md * Update tutorial-sql-query-view.md * Update tutorial-sql-query-view.md * Updated after review * Update docs/tutorials/tutorial-sql-query-view.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update tutorial-sql-query-view.md Update title * Update sidebars.json fix merge conflict w/ sidebar * address spelling ci --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2023-01-27 14:29:43 -08:00
Rohan Garg	f76acccff2	Allow using composed storage for SuperSorter intermediate data (#13368 )	2023-01-24 01:02:03 +05:30
Paul Rogers	22630b0aab	Much improved table functions (#13627 ) Much improved table functions * Revises properties, definitions in the catalog * Adds a "table function" abstraction to model such functions * Specific functions for HTTP, inline, local and S3. * Extended SQL types in the catalog * Restructure external table definitions to use table functions * EXTEND syntax for Druid's extern table function * Support for array-valued table function parameters * Support for array-valued SQL query parameters * Much new documentation	2023-01-17 08:41:57 -08:00
317brian	d9c27d6102	docs: add index page and related stuff for jupyter tutorials (#13342 )	2022-12-16 13:33:50 -08:00
Gian Merlino	de5a4bafcb	Zero-copy local deep storage. (#13394 ) * Zero-copy local deep storage. This is useful for local deep storage, since it reduces disk usage and makes Historicals able to load segments instantaneously. Two changes: 1) Introduce "druid.storage.zip" parameter for local storage, which defaults to false. This changes default behavior from writing an index.zip to writing a regular directory. This is safe to do even during a rolling update, because the older code actually already handled unzipped directories being present on local deep storage. 2) In LocalDataSegmentPuller and LocalDataSegmentPusher, use hard links instead of copies when possible. (Generally this is possible when the source and destination directory are on the same filesystem.)	2022-12-12 17:28:24 -08:00
Rishabh Singh	4ebdfe226d	Druid automated quickstart (#13365) * Druid automated quickstart * remove conf/druid/single-server/quickstart/_common/historical/jvm.config * Minor changes in python script * Add lower bound memory for some services * Additional runtime properties for services * Update supervise script to accept command arguments, corresponding changes in druid-quickstart.py * File end newline * Limit the ability to start multiple instances of a service, documentation changes * simplify script arguments * restore changes in medium profile * run-druid refactor * compute and pass middle manager runtime properties to run-druid supervise script changes to process java opts array use argparse, leave free memory, logging * Remove extra quotes from mm task javaopts array * Update logic to compute minimum memory * simplify run-druid * remove debug options from run-druid * resolve the config_path provided * comment out service specific runtime properties which are computed in the code * simplify run-druid * clean up docs, naming changes * Throw ValueError exception on illegal state * update docs * rename args, compute_only -> compute, run_zk -> zk * update help documentation * update help documentation * move task memory computation into separate method * Add validation checks * remove print * Add validations * remove start-druid bash script, rename start-druid-main * Include tasks in lower bound memory calculation * Fix test * 256m instead of 256g * caffeine cache uses 5% of heap * ensure min task count is 2, task count is monotonic * update configs and documentation for runtime props in conf/druid/single-server/quickstart * Update docs * Specify memory argument for each profile in single-server.md * Update middleManager runtime.properties * Move quickstart configs to conf/druid/base, add bash launch script, support python2 * Update supervise script * rename base config directory to auto * rename python script, changes to pass repeated args to supervise * remove examples/conf/druid/base dir * add docs * restore changes in conf dir * update start-druid-auto * remove hashref for commands in supervise script * start-druid-main java_opts array is comma separated * update entry point script name in python script * Update help docs * documentation changes * docs changes * update docs * add support for running indexer * update supported services list * update help * Update python.md * remove dir * update .spelling * Remove dependency on psutil and pathlib * update docs * Update get_physical_memory method * Update help docs * update docs * update method to get physical memory on python * update spelling * update .spelling * minor change * Minor change * memory computation for indexer * update start-druid * Update python.md * Update single-server.md * Update python.md * run python3 --version to check if python is installed * Update supervise script * start-druid: echo message if python not found * update anchor text * minor change * Update condition in supervise script * JVM not jvm in docs	2022-12-09 11:04:02 -08:00
317brian	cc2e4a80ff	doc: add a basic JDBC tutorial (#13343 ) * initial commit for jdbc tutorial (cherry picked from commit 04c4adad71e5436b76c3425fe369df03aaaf0acb) * add commentary * address comments from charles * add query context to example * fix typo * add links * Apply suggestions from code review Co-authored-by: Frank Chen <frankchen@apache.org> * fix datatype * address feedback * add parameterize to spelling file. the past tense version was already there Co-authored-by: Frank Chen <frankchen@apache.org>	2022-11-30 16:25:35 -08:00
Jill Osborne	5c520e0cf9	Update LDAP configuration docs (#13245 ) * Update LDAP configuration docs * Updated after review * Update auth-ldap.md Updated. * Update auth-ldap.md * Updated spelling file * Update docs/operations/auth-ldap.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/operations/auth-ldap.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update docs/operations/auth-ldap.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Update auth-ldap.md Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-11-29 09:26:32 -08:00
Adarsh Sanjeev	280a0f7158	Add sequential sketch merging to MSQ (#13205 ) * Add sketch fetching framework * Refactor code to support sequential merge * Update worker sketch fetcher * Refactor sketch fetcher * Refactor sketch fetcher * Add context parameter and threshold to trigger sequential merge * Fix test * Add integration test for non sequential merge * Address review comments * Address review comments * Address review comments * Resolve maxRetainedBytes * Add new classes * Renamed key statistics information class * Rename fetchStatisticsSnapshotForTimeChunk function * Address review comments * Address review comments * Update documentation and add comments * Resolve build issues * Resolve build issues * Change worker APIs to async * Address review comments * Resolve build issues * Add null time check * Update integration tests * Address review comments * Add log messages and comments * Resolve build issues * Add unit tests * Add unit tests * Fix timing issue in tests	2022-11-22 09:56:32 +05:30
Didip Kerabat	56d5c9780d	Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects (#13027) * Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects. Removed: import org.apache.commons.io.FilenameUtils; Add: import java.nio.file.FileSystems; import java.nio.file.PathMatcher; import java.nio.file.Paths; * Forgot to update CloudObjectInputSource as well. * Fix tests. * Removed unused exceptions. * Able to reduce user mistakes, by removing the protocol and the bucket on filter. * add 1 more test. * add comment on filterWithoutProtocolAndBucket * Fix lint issue. * Fix another lint issue. * Replace all mention of filter -> objectGlob per convo here: https://github.com/apache/druid/pull/13027#issuecomment-1266410707 * fix 1 bad constructor. * Fix the documentation. * Don’t do anything clever with the object path. * Remove unused imports. * Fix spelling error. * Fix incorrect search and replace. * Addressing Gian’s comment. * add filename on .spelling * Fix documentation. * fix documentation again Co-authored-by: Didip Kerabat <didip@apple.com>	2022-11-10 23:46:40 -08:00
Gian Merlino	77478f25fb	Add taskActionType dimension to task/action/run/time. (#13333 ) * Add taskActionType dimension to task/action/run/time. * Spelling.	2022-11-11 12:00:08 +05:30
Andreas Maechler	03175a2b8d	Add missing MSQ error code fields to docs (#13308 ) * Fix typo * Fix some spacing * Add missing fields * Cleanup table spacing * Remove durable storage docs again Thanks Brian for pointing out previous discussions. * Update docs/multi-stage-query/reference.md Co-authored-by: Charles Smith <techdocsmith@gmail.com> * Mark codes as code * And even more codes as code * Another set of spaces * Combine `ColumnTypeNotSupported` Thanks Karan. * More whitespaces and typos * Add spelling and fix links Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-11-10 21:03:04 +05:30
Gian Merlino	227b57dd8e	Compaction: Fetch segments one at a time on main task; skip when possible. (#13280 ) * Compaction: Fetch segments one at a time on main task; skip when possible. Compact tasks include the ability to fetch existing segments and determine reasonable defaults for granularitySpec, dimensionsSpec, and metricsSpec. This is a useful feature that makes compact tasks work well even when the user running the compaction does not have a clear idea of what they want the compacted segments to be like. However, this comes at a cost: it takes time, and disk space, to do all of these fetches. This patch improves the situation in two ways: 1) When segments do need to be fetched, download them one at a time and delete them when we're done. This still takes time, but minimizes the required disk space. 2) Don't fetch segments on the main compact task when they aren't needed. If the user provides a full granularitySpec, dimensionsSpec, and metricsSpec, we can skip it. * Adjustments. * Changes from code review. * Fix logic for determining rollup.	2022-11-07 14:50:14 +05:30
Dr. Sizzles	e5ad24ff9f	Support for middle manager less druid, tasks launch as k8s jobs (#13156) * Support for middle manager less druid, tasks launch as k8s jobs * Fixing forking task runner test * Test cleanup, dependency cleanup, intellij inspections cleanup * Changes per PR review Add configuration option to disable http/https proxy for the k8s client Update the docs to provide more detail about sidecar support * Removing un-needed log lines * Small changes per PR review * Upon task completion we callback to the overlord to update the status / location, for slower k8s clusters, this reduces locking time significantly * Merge conflict fix * Fixing tests and docs * update tiny-cluster.yaml changed `enableTaskLevelLogPush` to `encapsulatedTask` * Apply suggestions from code review Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com> * Minor changes per PR request * Cleanup, adding test to AbstractTask * Add comment in peon.sh * Bumping code coverage * More tests to make code coverage happy * Doh a duplicate dependency * Integration test setup is weird for k8s, will do this in a different PR * Reverting back all integration test changes, will do in another PR * use StringUtils.base64 instead of Base64 * Jdk is nasty, if i compress in jdk 11 in jdk 17 the decompressed result is different Co-authored-by: Rahul Gidwani <r_gidwani@apple.com> Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>	2022-11-02 19:44:47 -07:00
Gian Merlino	d851985cf5	MSQ: Add support for indexSpec. (#13275 )	2022-10-28 14:27:50 -07:00
Adarsh Sanjeev	4775427e2c	Add task start status to worker report (#13263 ) * Add task start status to worker report * Address review comments * Address review comments * Update documentation * Update spelling checks	2022-10-28 12:00:15 +05:30
Clint Wylie	77e4246598	add support for 'front coded' string dictionaries for smaller string columns (#12277 ) * add FrontCodedIndexed for delta string encoding * now for actual segments * fix indexOf * fixes and thread safety * add bucket size 4, which seems generally better * fixes * fixes maybe * update indexes to latest interfaces * utf8 support * adjust * oops * oops * refactor, better, faster * more test * fixes * revert * adjustments * fix prefixing * more chill * sql nested benchmark too * refactor * more comments and javadocs * better get * remove base class * fix * hot rod * adjust comments * faster still * minor adjustments * spatial index support * spotbugs * add isSorted to Indexed to strengthen indexOf contract if set, improve javadocs, add docs * fix docs * push into constructor * use base buffer instead of copy * oops	2022-10-25 18:05:38 -07:00
Jonathan Wei	9b8e69c99a	Add inline descriptor Protobuf bytes decoder (#13192 ) * Add inline descriptor Protobuf bytes decoder * PR comments * Update tests, check for IllegalArgumentException * Fix license, add equals test * Update extensions-core/protobuf-extensions/src/main/java/org/apache/druid/data/input/protobuf/InlineDescriptorProtobufBytesDecoder.java Co-authored-by: Frank Chen <frankchen@apache.org> Co-authored-by: Frank Chen <frankchen@apache.org>	2022-10-11 13:37:28 -05:00
Jonathan Wei	1f1fced6d4	Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON (#13089 ) * Add JsonInputFormat option to assume newline delimited JSON, improve handling for non-NDJSON * Fix serde and docs * Add PR comment check	2022-09-26 19:51:04 -05:00
Vadim Ogievetsky	b9edfe34a4	be consistent about referring to the web console by its name (#13118 )	2022-09-19 15:02:17 -07:00
Charles Smith	b366a6c5a4	Add clarification around docker environment #8926 (#13084 ) * Add clarification around docker environment #8926 * fix spelling * Update docs/tutorials/docker.md Co-authored-by: Frank Chen <frankchen@apache.org> * Update docs/tutorials/docker.md Co-authored-by: Frank Chen <frankchen@apache.org> * fix nano quickstart Co-authored-by: Frank Chen <frankchen@apache.org>	2022-09-17 20:44:24 +08:00
Gian Merlino	d4967c38f8	Various documentation updates. (#13107 ) * Various documentation updates. 1) Split out "data management" from "ingestion". Break it into thematic pages. 2) Move "SQL-based ingestion" into the Ingestion category. Adjust content so all conceptual content is in concepts.md and all syntax content is in reference.md. Shorten the known issues page to the most interesting ones. 3) Add SQL-based ingestion to the ingestion method comparison page. Remove the index task, since index_parallel is just as good when maxNumConcurrentSubTasks: 1. 4) Rename various mentions of "Druid console" to "web console". 5) Add additional information to ingestion/partitioning.md. 6) Remove a mention of Tranquility. 7) Remove a note about upgrading to Druid 0.10.1. 8) Remove no-longer-relevant task types from ingestion/tasks.md. 9) Move ingestion/native-batch-firehose.md to the hidden section. It was previously deprecated. 10) Move ingestion/native-batch-simple-task.md to the hidden section. It is still linked in some places, but it isn't very useful compared to index_parallel, so it shouldn't take up space in the sidebar. 11) Make all br tags self-closing. 12) Certain other cosmetic changes. 13) Update to node-sass 7. * make travis use node12 for docs Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>	2022-09-16 21:58:11 -07:00
DENNIS	dced61645f	prometheus-emitter supports sending metrics to pushgateway regularly … (#13034 ) * prometheus-emitter supports sending metrics to pushgateway regularly and continuously * spell check fix * Optimization variable name and related documents * Update docs/development/extensions-contrib/prometheus.md OK, it looks more conspicuous Co-authored-by: Frank Chen <frankchen@apache.org> * Update doc * Update docs/development/extensions-contrib/prometheus.md Co-authored-by: Frank Chen <frankchen@apache.org> * When PrometheusEmitter is closed, close the scheduler * Ensure that registeredMetrics is thread safe. * Local variable name optimization * Remove unnecessary white space characters Co-authored-by: Frank Chen <frankchen@apache.org>	2022-09-09 20:46:14 +08:00
317brian	d4233ef2a1	msq: add multi-stage-query docs (#12983) * msq: add multi-stage-query docs * add screenshots add back theta sketches tutorial change filename fix filename fix link fix headings * fixes * fixes * fix spelling issues and update spell file * address feedback from karan * add missing guardrail to known issues * update blurb * fix typo * remove durable storage info * update titles * Restore en.json * Update query view * address comments from vad * Update docs/multi-stage-query/msq-known-issues.md finish sentence * add apache license to docs * add apache license to docs Co-authored-by: Katya Macedo <katya.macedo@imply.io> Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-09-06 23:06:09 +05:30
senthilkv	3d9aef225d	compressed big decimal - module (#10705) Compressed Big Decimal is an extension that provides a mutable big decimal value that can be used to accumulate values without losing precision or reallocating memory. This type supports absolute-precision arithmetic on large numbers in applications that require a high level of accuracy, such as financial applications and currency-based transactions. This helps avoid rounding issues in which potentially large amounts of money can be lost. Accumulation requires that the two numbers have the same scale, but does not require that they are of the same size. If the value being accumulated has a larger underlying array than this value (the result), then the higher order bits are dropped, similar to what happens when adding a long to an int and storing the result in an int. A compressed big decimal that holds its data with an embedded array. Compressed big decimal is an absolute number based complex type based on big decimal in Java. This supports all the functionalities supported by Java Big Decimal. Java Big Decimal is not mutable in order to avoid big garbage collection issues. Compressed big decimal is needed to mutate the value in the accumulator.	2022-09-06 00:06:57 -07:00
Alexander Saydakov	7e2371bbde	KLL sketch (#12498 ) * KLL sketch * added documentation * direct static refs * direct static refs * fixed test * addressed review points * added KLL sketch related terms * return a copy from get * Copy unions when returning them from "get". * Remove redundant "final". Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com>	2022-08-26 21:19:24 -07:00
Victoria Lim	02914c17b9	Tutorial on ingesting and querying Theta sketches (#12723 ) Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2022-08-24 09:23:22 -07:00
Clint Wylie	0c56b22a39	Update .spelling (#12940 )	2022-08-22 18:47:40 -07:00
Clint Wylie	f8097ccfaa	basic docs for nested column query functions (#12922 ) * basic docs for nested column query functions	2022-08-19 17:12:19 -07:00
Clint Wylie	69fe1f04e5	document virtualColumns in native query documentation, fix some redirects (#12917 ) * document virtualColumns in native query documentation, fix some redirects * after all that, forgot to run spellcheck locally * review stuff	2022-08-18 20:49:23 -07:00
Rohan Garg	5394838030	Enable conversion of join to filter by default (#12868 )	2022-08-13 20:37:43 +05:30
David Palmer	2855fb6ff8	Change Kafka Lookup Extractor to not register consumer group (#12842 ) * change kafka lookups module to not commit offsets The current behaviour of the Kafka lookup extractor is to not commit offsets by assigning a unique ID to the consumer group and setting auto.offset.reset to earliest. This does the job but also pollutes the Kafka broker with a bunch of "ghost" consumer groups that will never again be used. To fix this, we now set enable.auto.commit to false, which prevents the ghost consumer groups being created in the first place. * update docs to include new enable.auto.commit setting behaviour * update kafka-lookup-extractor documentation Provide some additional detail on functionality and configuration. Hopefully this will make it clearer how the extractor works for developers who aren't so familiar with Kafka. * add comments better explaining the logic of the code * add spelling exceptions for kafka lookup docs	2022-08-09 16:14:22 +05:30
Gian Merlino	ef6811ef88	Improved Java 17 support and Java runtime docs. (#12839 ) * Improved Java 17 support and Java runtime docs. 1) Add a "Java runtime" doc page with information about supported Java versions, garbage collection, and strong encapsulation.. 2) Update asm and equalsverifier to versions that support Java 17. 3) Add additional "--add-opens" lines to surefire configuration, so tests can pass successfully under Java 17. 4) Switch openjdk15 tests to openjdk17. 5) Update FrameFile to specifically mention Java runtime incompatibility as the cause of not being able to use Memory.map. 6) Update SegmentLoadDropHandler to log an error for Errors too, not just Exceptions. This is important because an IllegalAccessError is encountered when the correct "--add-opens" line is not provided, which would otherwise be silently ignored. 7) Update example configs to use druid.indexer.runner.javaOptsArray instead of druid.indexer.runner.javaOpts. (The latter is deprecated.) * Adjustments. * Use run-java in more places. * Add run-java. * Update .gitignore. * Exclude hadoop-client-api. Brought in when building on Java 17. * Swap one more usage of java. * Fix the run-java script. * Fix flag. * Include link to Temurin. * Spelling. * Update examples/bin/run-java Co-authored-by: Xavier Léauté <xl+github@xvrl.net> Co-authored-by: Xavier Léauté <xl+github@xvrl.net>	2022-08-03 23:16:05 -07:00
Katya Macedo	a2be685824	Remove the time bit, fix headings (#12808 ) * Remove the time bit, fix headings * Adopt review suggestions * Edits * Update smoosh file description * Adopt review suggestions * Update spelling	2022-07-20 15:37:57 -07:00
Atul Mohan	75045970cd	S3 Ingestion from non-default endpoints (#11798 ) * Add endpoint support for s3inputsource * Changes to tests * Fix docs * Fix config * Fix inspections * Fix spelling * Remove password from toString	2022-07-15 11:03:34 -07:00
zachjsh	c0380e7b0a	* fix duplicate dimension (#12778 )	2022-07-14 10:39:03 +05:30
Victoria Lim	d8f8c56f94	Docs: Index page with all SQL functions (#12771 ) * list of all functions * add function names to spelling file	2022-07-14 09:59:55 +08:00
Tejaswini Bandlamudi	99e1b4efee	Update default value of `inputSegmentSizeBytes` in configuration docs (#12678 )	2022-06-22 09:05:03 +05:30
Gian Merlino	0099940808	Add TIME_IN_INTERVAL SQL operator. (#12662 ) * Add TIME_IN_INTERVAL SQL operator. The operator is implemented as a convertlet rather than an OperatorConversion, because this allows it to be equivalent to using the >= and < operators directly. * SqlParserPos cannot be null here. * Remove unused import. * Doc updates. * Add words to dictionary.	2022-06-21 13:05:37 -07:00
Agustin Gonzalez	2f3d7a4c07	Emit state of replace and append for native batch tasks (#12488 ) * Emit state of replace and append for native batch tasks * Emit count of one depending on batch ingestion mode (APPEND, OVERWRITE, REPLACE) * Add metric to compaction job * Avoid null ptr exc when null emitter * Coverage * Emit tombstone & segment counts * Tasks need a type * Spelling * Integrate BatchIngestionMode in batch ingestion tasks functionality * Typos * Remove batch ingestion type from metric since it is already in a dimension. Move IngestionMode to AbstractTask to facilitate having mode as a dimension. Add metrics to streaming. Add missing coverage. * Avoid inner class referenced by sub-class inspection. Refactor computation of IngestionMode to make it more robust to null IOConfig and fix test. * Spelling * Avoid polluting the Task interface * Rename computeCompaction methods to avoid ambiguous java compiler error if they are passed null. Other minor cleanup.	2022-05-23 12:32:47 -07:00
Gian Merlino	37853f8de4	ConcurrentGrouper: Add mergeThreadLocal option, fix bug around the switch to spilling. (#12513 ) * ConcurrentGrouper: Add option to always slice up merge buffers thread-locally. Normally, the ConcurrentGrouper shares merge buffers across processing threads until spilling starts, and then switches to a thread-local model. This minimizes memory use and reduces likelihood of spilling, which is good, but it creates thread contention. The new mergeThreadLocal option causes a query to start in thread-local mode immediately, and allows us to experiment with the relative performance of the two modes. * Fix grammar in docs. * Fix race in ConcurrentGrouper. * Fix issue with timeouts. * Remove unused import. * Add "tradeoff" to dictionary.	2022-05-21 10:28:54 -07:00
Gian Merlino	65a1375b67	SQL: Add is_active to sys.segments, update examples and docs. (#11550 ) * SQL: Add is_active to sys.segments, update examples and docs. is_active is short for: (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1 It's important because this represents "all the segments that should be queryable, whether or not they actually are right now". Most of the time, this is the set of segments that people will want to look at. The web console already adds this filter to a lot of its queries, proving its usefulness. This patch also reworks the caveat at the bottom of the sys.segments section, so its information is mixed into the description of each result field. This should make it more likely for people to see the information. * Wording updates. * Adjustments for spellcheck. * Adjust IT.	2022-05-19 14:23:28 -07:00
Frank Chen	c33ff1c745	Enforce console logging for peon process (#12067 ) Currently all Druid processes share the same log4j2 configuration file located in _common directory. Since peon processes are spawned by middle manager process, they derivate the environment variables from the middle manager. These variables include those in the log4j2.xml controlling to which file the logger writes the log. But current task logging mechanism requires the peon processes to output the log to console so that the middle manager can redirect the console output to a file and upload this file to task log storage. So, this PR imposes this requirement to peon processes, whatever the configuration is in the shared log4j2.xml, peon processes always write the log to console.	2022-05-16 15:07:21 +05:30
Gian Merlino	ff253fd8a3	Add setProcessingThreadNames context parameter. (#12514 ) setting thread names takes a measurable amount of time in the case where segment scans are very quick. In high-QPS testing we found a slight performance boost from turning off processing thread renaming. This option makes that possible.	2022-05-16 13:42:00 +05:30
Lucas Capistrant	deb69d1bc0	Allow coordinator to be configured to kill segments in future (#10877 ) Allow a Druid cluster to kill segments whose interval_end is a date in the future. This can be done by setting druid.coordinator.kill.durationToRetain to a negative period. For example PT-24H would allow segments to be killed if their interval_end date was 24 hours or less into the future at the time that the kill task is generated by the system. A cluster operator can also disregard the druid.coordinator.kill.durationToRetain entirely by setting a new configuration, druid.coordinator.kill.ignoreDurationToRetain=true. This ignores interval_end date when looking for segments to kill, and instead is capable of killing any segment marked unused. This new configuration is off by default, and a cluster operator should fully understand and accept the risks if they enable it.	2022-05-11 07:35:15 +05:30
Rohan Garg	75836a5a06	Add feature flag for sql planning of TimeBoundary queries (#12491 ) * Add feature flag for sql planning of TimeBoundary queries * fixup! Add feature flag for sql planning of TimeBoundary queries * Add documentation for enableTimeBoundaryPlanning * fixup! Add documentation for enableTimeBoundaryPlanning	2022-05-10 15:23:42 +05:30
Victoria Lim	0206a2da5c	Update automatic compaction docs with consistent terminology (#12416 ) * specify automatic compaction where applicable * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * update for style and consistency * implement suggested feedback * remove duplicate example * Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/ingestion/compaction.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/api-reference.md * update .spelling * Adopt review suggestions Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>	2022-05-03 16:22:25 -07:00
zachjsh	564d6defd4	Worker level task metrics (#12446 ) * * fix metric name inconsistency * * add task slot metrics for middle managers * * add new WorkerTaskCountStatsMonitor to report task count metrics from worker * * more stuff * * remove unused variable * * more stuff * * add javadocs * * fix checkstyle * * fix hadoop test failure * * cleanup * * add more code coverage in tests * * fix test failure * * add docs * * increase code coverage * * fix spelling * * fix failing tests * * remove dead code * * fix spelling	2022-04-26 11:44:44 -05:00
Victoria Lim	d326c681c1	Document config for ingesting null columns (#12389 ) * config for ingesting null columns * add link * edit .spelling * what happens if storeEmptyColumns is disabled	2022-04-05 09:15:42 -07:00
somu-imply	a1ea658115	Introducing a new config to ignore nulls while computing String Cardinality (#12345 ) * Counting nulls in String cardinality with a config * Adding tests for the new config * Wrapping the vectorize part to allow backward compatibility * Adding different tests, cleaning the code and putting the check at the proper position, handling hasRow() and hasValue() changes * Updating testcase and code * Adding null handling test to improve coverage * Checkstyle fix * Adding 1 more change in docs * Making docs clearer	2022-03-29 14:31:36 -07:00
Adarsh Sanjeev	ef45a1551e	Convert inQueryThreshold into query context parameter. (#12357 ) Added Calcites InQueryThreshold as a query context parameter. Setting this parameter appropriately reduces the time taken for queries with large number of values in their IN conditions.	2022-03-22 18:33:57 +05:30
Gian Merlino	875e0696e0	GroupBy: Cap dictionary-building selector memory usage. (#12309 ) * GroupBy: Cap dictionary-building selector memory usage. New context parameter "maxSelectorDictionarySize" controls when the per-segment processing code should return early and trigger a trip to the merge buffer. Includes: - Vectorized and nonvectorized implementations. - Adjustments to GroupByQueryRunnerTest to exercise this code in the v2SmallDictionary suite. (Both the selector dictionary and the merging dictionary will be small in that suite.) - Tests for the new config parameter. * Fix issues from tests. * Add "pre-existing" to dictionary. * Simplify GroupByColumnSelectorStrategy interface by removing one of the writeToKeyBuffer methods. * Adjustments from review comments.	2022-03-08 13:13:11 -08:00
Karan Kumar	b94390ba33	Adding Shared Access resource support for azure (#12266 ) Azure Blob storage has multiple modes of authentication. One of them is Shared access resource . This is very useful in cases when we do not want to add the account key in the druid properties .	2022-02-22 18:27:43 +05:30
Karan Kumar	5794331eb1	Adding new config for disabling group by on multiValue column (#12253 ) As part of #12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key. Added a config groupByEnableMultiValueUnnesting which can be set in the query context. The default value of groupByEnableMultiValueUnnesting is true, therefore it does not change the current engine behavior. If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.	2022-02-16 20:53:26 +05:30
somu-imply	eae163a797	Moving in filter check to broker (#12195 ) * Moving in filter check to broker * Adding more unit tests, making error message meaningful * Spelling and doc changes * Updating default to -1 and making this feature hide by default. The number of IN filters can grow upto a max limit of 100 * Removing upper limit of 100, updated docs * Making documentation more meaningful * Moving check outside to PlannerConfig, updating test cases and adding back max limit * Updated with some additional code comments * Missed removing one line during the checkin * Addressing doc changes and one forbidden API correction * Final doc change * Adding a speling exception, correcting a testcase * Reading entire filter tree to address combinations of ANDs and ORs * Specifying in docs that, this case works only for ORs * Revert "Reading entire filter tree to address combinations of ANDs and ORs" This reverts commit `81ca8f8496`. * Covering a class cast exception and updating docs * Counting changed Co-authored-by: Jihoon Son <jihoonson@apache.org>	2022-02-15 20:45:07 -08:00
Victoria Lim	c61b19d443	Refactor SQL docs (#12239 ) * refactor and link fixes * add sql docs to left nav * code format for needle * updated web console script * link fixes * update earliest/latest functions * edits for grammar and style * more link fixes * another link * update with #12226 * update .spelling file	2022-02-11 14:43:30 -08:00
somu-imply	c267b65f97	Removing unused processing threadpool on broker (#12070 ) * Thread pool for broker * Updating two tests to improve coverage for new method added * Updating druidProcessingConfigTest to cover coverage * Adding missed spelling errors caused in doc * Adding test to cover lines of new function added	2021-12-21 13:07:53 -08:00
Karan Kumar	377edff042	Ingestion metrics doc fix (#12066 ) * Ingestion metrics doc fix. * Fixing typo * Adding missed keywords in ignore list	2021-12-15 12:51:53 +05:30
Lucas Capistrant	761fe9f144	Add new metric that quantifies how long batch ingest jobs waited for segment availability and whether or not that wait was successful (#12002 ) * add a unit test that tests that new metric is emitted * remove unused import * clarify in doc that this is for batch tasks * fix IndexTaskTest	2021-12-10 11:40:52 -06:00
Frank Chen	58245b4617	Support JsonPath functions in JsonPath expressions (#11722 ) * Add jsonPath functions support * Add jsonPath function test for Avro * Add jsonPath function length() to Orc * Add jsonPath function length() to Parquet * Add more tests to ORC format * update doc * Fix exception during ingestion * Add IT test case * Revert "Fix exception during ingestion" This reverts commit `5a5484b9ea`. * update IT test case * Add 'keys()' * Commit IT test case * Fix UT	2021-12-10 10:53:23 +08:00
Charles Smith	7ed46800c3	Docs: Add multi-dimension partitioning doc; refactor native batch and separate into smaller topics. (#11983 ) Adds documentation for multi-dimension partitioning. cc: @kfaraz Refactors the native batch partitioning topic as follows: Native batch ingestion covers parallel-index Native batch simple task indexing covers index Native batch input sources covers ioSource Native batch ingestion with firehose covers deprecated firehose	2021-12-03 16:37:14 +05:30
Clint Wylie	84b4bf56d8	vectorize logical operators and boolean functions (#11184 ) changes: * adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions * vectorize logical operators and boolean functions, some only if useStrictBooleans is true	2021-12-02 16:40:23 -08:00
Maytas Monsereenusorn	bb3d2a433a	Support filtering data in Auto Compaction (#11922 ) * add impl * fix checkstyle * add test * add test * add unit tests * fix unit tests * fix unit tests * fix unit tests * add IT * add IT * add comments * fix spelling	2021-11-24 10:56:38 -08:00
Kashif Faraz	6607e4cc75	Docs: Remove reference to deprecated field `targetPartitionSize` (#11974 ) * Remove reference to deprecated field `targetPartitionSize` * Fix spelling of LeaderLatch	2021-11-23 15:32:16 +05:30
Clint Wylie	e22abb68b0	Update .spelling (#11977 )	2021-11-22 22:28:51 -08:00
somu-imply	29710789a4	Adding safe divide function (#11904 ) * IMPLY-4344: Adding safe divide function along with testcases and documentation updates * Changing based on review comments * Addressing review comments, fixing coding style, docs and spelling * Checkstyle passes for all code * Fixing expected results for infinity * Revert "Fixing expected results for infinity" This reverts commit `5fd5cd480d`. * Updating test result and a space in docs	2021-11-17 08:22:41 -08:00
sthetland	02b578a3dd	Fixing a few typos and style issues (#11883 ) * grammar and format work * light writing touchup Co-authored-by: Charles Smith <techdocsmith@gmail.com>	2021-11-16 10:13:35 -08:00
Maytas Monsereenusorn	ddc68c6a81	Support changing dimension schema in Auto Compaction (#11874 ) * add impl * add unit tests * fix checkstyle * add impl * add impl * add impl * add impl * add impl * add impl * fix test * add IT * add IT * fix docs * add test * address comments * fix conflict	2021-11-08 21:17:08 -08:00
Jian Wang	8e7e679984	Add more metrics for Jetty server thread pool usage (#11113 ) Add more metrics for jetty server thread pool usage so we know if we have allocated enough http threads to handle requests.	2021-11-07 16:51:44 +05:30
Karan Kumar	90640bb316	Support for hadoop 3 via maven profiles (#11794 ) Add support for hadoop 3 profiles . Most of the details are captured in #11791 . We use a combination of maven profiles and resource filtering to achieve this. Hadoop2 is supported by default and a new maven profile with the name hadoop3 is created. This will allow the user to choose the profile which is best suited for the use case.	2021-10-30 22:46:24 +05:30
Gian Merlino	8276c031c5	Add druid.sql.approxCountDistinct.function property. (#11181 ) * Add druid.sql.approxCountDistinct.function property. The new property allows admins to configure the implementation for APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode. The motivation for adding this setting is to enable site admins to switch the default HLL implementation to DataSketches. For example, an admin can set: druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL * Fixes * Fix tests. * Remove erroneous cannotVectorize. * Remove unused import. * Remove unused test imports.	2021-10-25 12:16:21 -07:00
Charles Smith	6089a168ea	Docs - update dynamic config provider topic (#11795 ) * update dynamic config provider * update topic * add examples for dynamic config provider: * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/dynamic-config-provider.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/dynamic-config-provider.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/dynamic-config-provider.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/dynamic-config-provider.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/development/extensions-core/kafka-ingestion.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> * Update docs/operations/dynamic-config-provider.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Update docs/operations/dynamic-config-provider.md Co-authored-by: Clint Wylie <cjwylie@gmail.com> * Update kafka-ingestion.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Clint Wylie <cjwylie@gmail.com>	2021-10-14 17:51:32 -07:00
Arun Ramani	b6b42d3936	Minor processor quota computation fix + docs (#11783 ) * cpu/cpuset cgroup and procfs data gathering * Renames and default values * Formatting * Trigger Build * Add cgroup monitors * Return 0 if no period * Update * Minor processor quota computation fix + docs * Address comments * Address comments * Fix spellcheck Co-authored-by: arunramani-imply <84351090+arunramani-imply@users.noreply.github.com>	2021-10-08 22:52:03 -05:00
lokesh-lingarajan	ad6609a606	Kafka Input Format for headers, key and payload parsing (#11630 ) ### Description Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events is another class of data that provides useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside kafka payload, but nonetheless there are some very useful metadata that is carried in kafka headers that can serve as useful dimension for aggregation and in turn bringing better insights. PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats. We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format(provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the kafka record. We also need a way to expose the data present in the key as druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload. This PR is designed to solve the above by providing wrapper around any existing input formats and merging the data into a single unified Druid row. 
Let's look at a sample input format from the above discussion: "inputFormat": { "type": "kafka", // New input format type "headerLabelPrefix": "kafka.header.", // Label prefix for header columns, this will avoid collisions while merging columns "recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made available in case payload does not carry timestamp "headerFormat": // Header parser specifying that values are of type string { "type": "string" }, "valueFormat": // Value parser from json parsing { "type": "json", "flattenSpec": { "useFieldDiscovery": true, "fields": [...] } }, "keyFormat": // Key parser also from json parsing { "type": "json" } } Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, e.g., headers coming in as string and payload as json. KafkaInputFormat will be the uber class extending inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blending the data, resolving conflicts in columns and generating a single unified InputRow for Druid ingestion. "headerFormat" will allow users to plug in a parser type for the header values and will add default header prefix as "kafka.header." (can be overridden) for attributes to avoid collision while merging attributes with payload. Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plug in an existing parser. One thing to note here is that if batching is performed, then the code is augmenting header and key values to every record in the batch. Kafka key parser will handle parsing the Key portion of the Kafka record and will ingest the Key with dimension name as "kafka.key". ## KafkaInputFormat Class: This is the class that orchestrates sending the consumerRecord to each parser, retrieving rows, and merging the columns into one final row for Druid consumption. 
KafkaInputFormat should make sure to release the resources that get allocated as part of the reader in CloseableIterator<InputRow> in both normal and exception cases. When dimension/metric names conflict, the code will prefer the dimension from the payload and ignore the one from the headers or key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.	2021-10-07 08:56:27 -07:00
Clint Wylie	5de26cf6d9	add optional system schema authorization (#11720 ) * add optional system schema authorization * remove unused * adjust docs * doc fixes, missing ldap config change for integration tests * style	2021-09-21 13:28:26 -07:00
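
The Kafka Input Format commit (#11630) listed above describes merging header, key and payload fields into one row, with payload columns winning on name conflicts. The following is a minimal Python sketch of that merge rule only, not the extension's actual Java code. The function name `merge_kafka_record` and the timestamp column name `kafka.timestamp` are assumptions made for illustration; the `kafka.header.` prefix, the `kafka.` timestamp prefix and the `kafka.key` column name come from the commit message.

```python
def merge_kafka_record(headers, key, payload, timestamp_ms,
                       header_prefix="kafka.header.",
                       timestamp_prefix="kafka."):
    """Merge parsed header, key and payload fields into one flat row.

    Hypothetical helper: it mirrors the precedence rule described in the
    commit message, not the real KafkaInputFormat implementation.
    """
    row = {}

    # Header columns get a prefix so they are less likely to clash with
    # payload columns.
    for name, value in headers.items():
        row[header_prefix + name] = value

    # The record key is exposed as a single column named "kafka.key".
    if key is not None:
        row["kafka.key"] = key

    # The record timestamp is made available in case the payload has none.
    # The suffix "timestamp" is an assumption for this sketch.
    row[timestamp_prefix + "timestamp"] = timestamp_ms

    # The payload is applied last, so on a name conflict the payload value
    # replaces the header or key value.
    row.update(payload)
    return row


example = merge_kafka_record(
    headers={"env": "prod"},
    key="k1",
    payload={"kafka.header.env": "from-payload", "count": 1},
    timestamp_ms=1000,
)
```

Applying the payload last is the simplest way to express "prefer the payload on conflict"; a real implementation would also need to handle batching and resource cleanup, which the commit message mentions but this sketch omits.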


262 Commits