druid

mirror of https://github.com/apache/druid.git synced 2025-02-28 14:29:17 +00:00

Author	SHA1	Message	Date
Parth Agrawal	495e66f2e7	CVE Fix: Update json-path version (#15772 ) Apache Druid brings the dependency json-path which is affected by CVE-2023-51074. Its latest version 2.9.0 fixes the above CVE. Append function has been added to json-path and so the unit test to check for the append function not present has been updated. --------- Co-authored-by: Xavier Léauté <xvrl@apache.org>	2024-02-14 20:58:27 -08:00
Clint Wylie	97623b408c	add optional 'castToType' parameter to 'auto' column schema (#15417 ) * auto but.. with an expected type	2023-11-28 17:19:23 -08:00
Clint Wylie	fb053c399c	consolidate json and auto indexers, remove v4 nested column serializer (#14456 )	2023-08-22 18:50:11 -07:00
zachjsh	e75fb8e8e3	Account for data format and compression in MSQ auto taskAssignment (#14307 ) ### Description This change allows for consideration of the input format and compression when computing how to split the input files among available tasks, in MSQ ingestion, when considering the value of the `maxInputBytesPerWorker` query context parameter. This query parameter allows users to control the maximum number of bytes, with granularity of input file / object, that ingestion tasks will be assigned to ingest. With this change, this context parameter now denotes the estimated weighted size in bytes of the input to split on, with consideration for input format and compression format, rather than the actual file size, reported by the file system. We assume uncompressed newline delimited json as a baseline, with scaling factor of `1`. This means that when computing the byte weight that a file has towards the input splitting, we take the file size as is, if uncompressed json, 1:1. It was found during testing that gzip compressed json, and parquet, has scale factors of `4` and `8` respectively, meaning that each byte of data is weighted 4x and 8x respectively, when computing input splits. This weighted byte scaling is only considered for MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource at the moment. The default value of the `maxInputBytesPerWorker` query context parameter has been updated from 10 GiB, to 512 MiB	2023-06-01 12:53:49 -07:00
Clint Wylie	d5b1b5bc8e	nested columns + arrays = array columns! (#13803 ) array columns! changes: * add support for storing nested arrays of string, long, and double values as specialized nested columns instead of breaking them into separate element columns * nested column type mimic behavior means that columns ingested with only root arrays of primitive values will be ARRAY typed columns * neat test refactor stuff * add v4 segment test * add array element indexes * add tests for unnest and array columns * fix unnest column value selector cursor handling of null and empty arrays	2023-03-27 12:42:35 -07:00
Clint Wylie	1d8fff4096	sampler + type detection = bff (#13711 ) * sampler + type detection = bff * split logical and physical dimensions, tidy up	2023-02-28 04:14:30 -08:00
Clint Wylie	fb26a1093d	discover nested columns when using nested column indexer for schemaless ingestion (#13672 ) * discover nested columns when using nested column indexer for schemaless * move useNestedColumnIndexerForSchemaDiscovery from AppendableIndexSpec to DimensionsSpec	2023-01-18 12:57:28 -08:00
Clint Wylie	7002ecd303	add protobuf flattener, direct to plain java conversion for faster flattening (#13519 ) * add protobuf flattener, direct to plain java conversion for faster flattening, nested column tests	2022-12-09 12:24:21 -08:00
Clint Wylie	37b8d4861c	fix issues with nested data conversion (#13407 )	2022-11-28 12:29:43 -08:00
Clint Wylie	8c9ffcfe37	nested column support for ORC (#13375 ) * nested column support for ORC * more test	2022-11-17 21:08:34 -08:00
Jason Koch	0d03ce435f	introduce a "tree" type to the flattenSpec (#12177 ) * introduce a "tree" type to the flattenSpec * feedback - rename exprs to nodes, use CollectionsUtils.isNullOrEmpty for guard * feedback - expand docs to more clearly capture limitations of "tree" flattenSpec * feedback - fix for typo on docs * introduce a comment to explain defensive copy, tweak null handling * fix: part of rebase * mark ObjectFlatteners.FlattenerMaker as an ExtensionPoint and provide default for new tree type * fix: objectflattener restore previous behavior to call getRootField for root type * docs: ingestion/data-formats add note that ORC only supports path expressions * chore: linter remove unused import * fix: use correct newer form for empty DimensionsSpec in FlattenJSONBenchmark	2022-11-01 14:49:30 +08:00
Gian Merlino	c00ad28ecc	Cleaner JSON for various input sources and formats. (#13064 ) * Cleaner JSON for various input sources and formats. Add JsonInclude to various properties, to avoid population of default values in serialized JSON. Also fixes a bug in OrcInputFormat: it was not writing binaryAsString, so the property would be lost on serde. * Additonal test cases.	2022-09-12 10:29:31 -07:00
Tejaswini Bandlamudi	d559773a0e	sets Hadoop conf ClassLoader (#12738 )	2022-07-04 17:07:39 +05:30
Tejaswini Bandlamudi	a85b1d8985	Lazy Initialisation of Orc extensions module (#12663 ) * Lazy initialization of Orc extension * nit * moving intialize method to OrcInputFormat	2022-06-21 11:13:10 +05:30
Laksh Singla	3f709db173	Make ParseExceptions more informative (#12259 ) This PR aims to make the ParseExceptions in Druid more informative, by adding additional information (metadata) to the ParseException, which can contain additional information about the exception. For example - the path of the file generating the issue, the line number (where it can be easily fetched - like CsvReader) Following changes are addressed in this PR: A new class CloseableIteratorWithMetadata has been created which is like CloseableIterator but also has a metadata method that returns a context Map<String, Object> about the current element returned by next(). IntermediateRowParsingReader#read() now attaches the InputEntity and the "record number" which created the exception (while parsing them), and IntermediateRowParsingReader#sample attaches the InputEntity (but not the "record number"). TextReader (and its subclasses), which is a specific implementation of the IntermediateRowParsingReader also include the line number which caused the generation of the error. This will also help in triaging the issues when InputSourceReader generates ParseException because it can point to the specific InputEntity which caused the exception (while trying to read it).	2022-02-28 22:31:15 +05:30
Jihoon Son	e5ad862665	A new includeAllDimension flag for dimensionsSpec (#12276 ) * includeAllDimensions in dimensionsSpec * doc * address comments * unused import and doc spelling	2022-02-25 18:27:48 -08:00
Karan Kumar	b86f2d4c2e	Performance fixes in proto readers (#12267 )	2022-02-24 23:21:48 +05:30
Frank Chen	58245b4617	Support JsonPath functions in JsonPath expressions (#11722 ) * Add jsonPath functions support * Add jsonPath function test for Avro * Add jsonPath function length() to Orc * Add jsonPath function length() to Parquet * Add more tests to ORC format * update doc * Fix exception during ingestion * Add IT test case * Revert "Fix exception during ingestion" This reverts commit 5a5484b9ea9d984622149c8113a566269cc10842. * update IT test case * Add 'keys()' * Commit IT test case * Fix UT	2021-12-10 10:53:23 +08:00
Gian Merlino	bf20f9e979	DruidInputSource: Fix issues in column projection, timestamp handling. (#10267 ) * DruidInputSource: Fix issues in column projection, timestamp handling. DruidInputSource, DruidSegmentReader changes: 1) Remove "dimensions" and "metrics". They are not necessary, because we can compute which columns we need to read based on what is going to be used by the timestamp, transform, dimensions, and metrics. 2) Start using ColumnsFilter (see below) to decide which columns we need to read. 3) Actually respect the "timestampSpec". Previously, it was ignored, and the timestamp of the returned InputRows was set to the `__time` column of the input datasource. (1) and (2) together fix a bug in which the DruidInputSource would not properly read columns that are used as inputs to a transformSpec. (3) fixes a bug where the timestampSpec would be ignored if you attempted to set the column to something other than `__time`. (1) and (3) are breaking changes. Web console changes: 1) Remove "Dimensions" and "Metrics" from the Druid input source. 2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for compatibility with the new behavior. Other changes: 1) Add ColumnsFilter, a new class that allows input readers to determine which columns they need to read. Currently, it's only used by the DruidInputSource, but it could be used by other columnar input sources in the future. 2) Add a ColumnsFilter to InputRowSchema. 3) Remove the metric names from InputRowSchema (they were unused). 4) Add InputRowSchemas.fromDataSchema method that computes the proper ColumnsFilter for given timestamp, dimensions, transform, and metrics. 5) Add "getRequiredColumns" method to TransformSpec to support the above. * Various fixups. * Uncomment incorrectly commented lines. * Move TransformSpecTest to the proper module. * Add druid.indexer.task.ignoreTimestampSpecForDruidInputSource setting. * Fix. * Fix build. * Checkstyle. * Misc fixes. * Fix test. * Move config. * Fix imports. * Fixup. * Fix ShuffleResourceTest. * Add import. * Smarter exclusions. * Fixes based on tests. Also, add TIME_COLUMN constant in the web console. * Adjustments for tests. * Reorder test data. * Update docs. * Update docs to say Druid 0.22.0 instead of 0.21.0. * Fix test. * Fix ITAutoCompactionTest. * Changes from review & from merging.	2021-03-25 10:32:21 -07:00
frank chen	e83d5cb59e	Fix ingestion failure of pretty-formatted JSON message (#10383 ) * support multi-line text * add test cases * split json text into lines case by case * improve exception handle * fix CI * use IntermediateRowParsingReader as base of JsonReader * update doc * ignore the non-immutable field in test case * add more test cases * mark `lineSplittable` as final * fix testcases * fix doc * add a test case for SqlReader * return all raw columns when exception occurs * fix CI * fix test cases * resolve review comments * handle ParseException returned by index.add * apply Iterables.getOnlyElement * fix CI * fix test cases * improve code in more graceful way * fix test cases * fix test cases * add a test case to check multiple json string in one text block * fix inspection check	2020-11-13 13:59:23 -08:00
Jihoon Son	8f14ac814e	More structured way to handle parse exceptions (#10336 ) * More structured way to handle parse exceptions * checkstyle; add more tests * forbidden api; test * address comment; new test * address review comments * javadoc for parseException; remove redundant parseException in streaming ingestion * fix tests * unnecessary catch * unused imports * appenderator test * unused import	2020-09-11 16:31:10 -07:00
Jihoon Son	e27a1e8604	Fix handling nullable writableComparable in OrcStructConverter (#9138 ) * Handle nullable writableComparable in OrcStructConverter * add missing dependency	2020-01-08 13:40:24 -08:00
Jonathan Wei	aa539177ec	De-incubation cleanup in code, docs, packaging (#9108 ) * De-incubation cleanup in code, docs, packaging * remove unused docs script	2020-01-03 12:33:19 -05:00
Jihoon Son	66056b2826	Using annotation to distinguish Hadoop Configuration in each module (#9013 ) * Multibinding for NodeRole * Fix endpoints * fix doc * fix test * Using annotation to distinguish Hadoop Configuration in each module	2019-12-11 17:30:44 -08:00
Jihoon Son	86e8903523	Support orc format for native batch ingestion (#8950 ) * Support orc format for native batch ingestion * fix pom and remove wrong comment * fix unnecessary condition check * use flatMap back to handle exception properly * move exceptionThrowingIterator to intermediateRowParsingReader * runtime	2019-11-28 12:45:24 -08:00
Clint Wylie	7250010388	add parquet support to native batch (#8883 ) * add parquet support to native batch * cleanup * implement toJson for sampler support * better binaryAsString test * docs * i hate spellcheck * refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest * add comment, fix some stuff * adjustments * fix accident * tweaks	2019-11-22 10:49:16 -08:00
Clint Wylie	1054d85171	add mechanism to control filter optimization in historical query processing (#8209 ) * add support for mechanism to control filter optimization in historical query processing * oops * adjust * woo * javadoc * review comments * fix * default * oops * oof * this will fix it * more nullable, refactor DimFilter.getRequiredColumns to use Set, formatting * extract class DimFilterToStringBuilder with common code from custom DimFilter toString implementations * adjust variable naming * missing nullable * more nullable * fix javadocs * nullable * address review comments * javadocs, precondition * nullable * rename method to be consistent * review comments * remove tuning from ColumnComparisonFilter/ColumnComparisonDimFilter	2019-08-09 16:36:18 -07:00
Chi Cao Minh	1166bbcb75	Remove static imports from tests (#8036 ) Make static imports forbidden in tests and remove all occurrences to be consistent with the non-test code. Also, various changes to files affected by above: - Reformat to adhere to druid style guide - Fix various IntelliJ warnings - Fix various SonarLint warnings (e.g., the expected/actual args to Assert.assertEquals() were flipped)	2019-07-06 09:33:12 -07:00
Roman Leventov	782863ed0f	Fix some problems reported by PVS-Studio (#7738 ) * Fix some problems reported by PVS-Studio * Address comments	2019-05-29 11:20:45 -07:00
Clint Wylie	89bb43f382	'core' ORC extension (#7138 ) * orc extension reworked to use apache orc map-reduce lib, moved to core extensions, support for flattenSpec, tests, docs * change binary handling to be compatible with avro and parquet, Rows.objectToStrings now converts byte[] to base64, change date handling * better docs and tests * fix it * formatting * doc fix * fix it * exclude redundant dependencies * use latest orc-mapreduce, add hadoop jobProperties recommendations to docs * doc fix * review stuff and fix binaryAsString * cache for root level fields * more better	2019-04-09 09:03:26 -07:00

30 Commits