This PR:
adds a flag to JsonToParquet to do the fix during conversion
updates the json files to more correct conents
some resultset mismatches were fixed by this
updates parquet to 1.13.1
This commit borrows some test definitions from Drill's test suite
and tries to use them to flesh out the full validation of window
function capbilities.
In order to be able to run these tests, we also add the ability to
run a Scan operation against segments, which also meant an
implementation of RowsAndColumns for frames.
### Description
This change allows for consideration of the input format and compression when computing how to split the input files among available tasks, in MSQ ingestion, when considering the value of the `maxInputBytesPerWorker` query context parameter. This query parameter allows users to control the maximum number of bytes, with granularity of input file / object, that ingestion tasks will be assigned to ingest. With this change, this context parameter now denotes the estimated weighted size in bytes of the input to split on, with consideration for input format and compression format, rather than the actual file size, reported by the file system. We assume uncompressed newline delimited json as a baseline, with scaling factor of `1`. This means that when computing the byte weight that a file has towards the input splitting, we take the file size as is, if uncompressed json, 1:1. It was found during testing that gzip compressed json, and parquet, has scale factors of `4` and `8` respectively, meaning that each byte of data is weighted 4x and 8x respectively, when computing input splits. This weighted byte scaling is only considered for MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource at the moment. The default value of the `maxInputBytesPerWorker` query context parameter has been updated from 10 GiB, to 512 MiB
* discover nested columns when using nested column indexer for schemaless
* move useNestedColumnIndexerForSchemaDiscovery from AppendableIndexSpec to DimensionsSpec
* introduce a "tree" type to the flattenSpec
* feedback - rename exprs to nodes, use CollectionsUtils.isNullOrEmpty for guard
* feedback - expand docs to more clearly capture limitations of "tree" flattenSpec
* feedback - fix for typo on docs
* introduce a comment to explain defensive copy, tweak null handling
* fix: part of rebase
* mark ObjectFlatteners.FlattenerMaker as an ExtensionPoint and provide default for new tree type
* fix: objectflattener restore previous behavior to call getRootField for root type
* docs: ingestion/data-formats add note that ORC only supports path expressions
* chore: linter remove unused import
* fix: use correct newer form for empty DimensionsSpec in FlattenJSONBenchmark
* Cleaner JSON for various input sources and formats.
Add JsonInclude to various properties, to avoid population of default
values in serialized JSON.
Also fixes a bug in OrcInputFormat: it was not writing binaryAsString,
so the property would be lost on serde.
* Additonal test cases.
Historicals and middle managers crash with an `UnknownHostException` on trying
to load `druid-parquet-extensions` with an ephemeral Hadoop cluster. This happens
because the `fs.defaultFS` URI value cannot be resolved at start up time as the
hadoop cluster may not exist at startup time.
This commit fixes the error by performing initialization of the filesystem in
`ParquetInputFormat.createReader()` whenever a new reader is requested.
This PR aims to make the ParseExceptions in Druid more informative, by adding additional information (metadata) to the ParseException, which can contain additional information about the exception. For example - the path of the file generating the issue, the line number (where it can be easily fetched - like CsvReader)
Following changes are addressed in this PR:
A new class CloseableIteratorWithMetadata has been created which is like CloseableIterator but also has a metadata method that returns a context Map<String, Object> about the current element returned by next().
IntermediateRowParsingReader#read() now attaches the InputEntity and the "record number" which created the exception (while parsing them), and IntermediateRowParsingReader#sample attaches the InputEntity (but not the "record number").
TextReader (and its subclasses), which is a specific implementation of the IntermediateRowParsingReader also include the line number which caused the generation of the error.
This will also help in triaging the issues when InputSourceReader generates ParseException because it can point to the specific InputEntity which caused the exception (while trying to read it).
* Add jsonPath functions support
* Add jsonPath function test for Avro
* Add jsonPath function length() to Orc
* Add jsonPath function length() to Parquet
* Add more tests to ORC format
* update doc
* Fix exception during ingestion
* Add IT test case
* Revert "Fix exception during ingestion"
This reverts commit 5a5484b9ea.
* update IT test case
* Add 'keys()'
* Commit IT test case
* Fix UT
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
* Avro union support
* Document new union support
* Add support for AvroStreamInputFormat and fix checkstyle
* Extend multi-member union test schema and format
* Some additional docs and add Enums to spelling
* Rename explodeUnions -> extractUnions
* explode -> extract
* ByType
* Correct spelling error
* support multi-line text
* add test cases
* split json text into lines case by case
* improve exception handle
* fix CI
* use IntermediateRowParsingReader as base of JsonReader
* update doc
* ignore the non-immutable field in test case
* add more test cases
* mark `lineSplittable` as final
* fix testcases
* fix doc
* add a test case for SqlReader
* return all raw columns when exception occurs
* fix CI
* fix test cases
* resolve review comments
* handle ParseException returned by index.add
* apply Iterables.getOnlyElement
* fix CI
* fix test cases
* improve code in more graceful way
* fix test cases
* fix test cases
* add a test case to check multiple json string in one text block
* fix inspection check
* add parquet support to native batch
* cleanup
* implement toJson for sampler support
* better binaryAsString test
* docs
* i hate spellcheck
* refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest
* add comment, fix some stuff
* adjustments
* fix accident
* tweaks
* Remove Apache Pig from the tests
* Remove the Pig specific part
* Fix the Checkstyle issues
* Cleanup a bit
* Add an additional test
* Revert the abstract class
* Add checkstyle rules about imports and empty lines between members
* Add suppressions
* Update Eclipse import order
* Add empty line
* Fix StatsDEmitter
* move parquet-extensions from contrib to core, adds new hadoop parquet parser that does not convert to avro first and supports flattenSpec and int96 columns, add support for flattenSpec for parquet-avro conversion parser, much test with a bunch of files lifted from spark-sql
* fix avro flattener to support nullable primitives for auto discovery and now only supports primitive arrays instead of all arrays
* remove leftover print
* convert micro timestamp to millis
* checkstyle
* add ignore for .parquet and .parq to rat exclude
* fix legit test failure from avro flattern behavior change
* fix rebase
* add exclusions to pom to cut down on redundant jars
* refactor tests, add support for unwrapping lists for parquet-avro, review comments
* more comment
* fix oops
* tweak parquet-avro list handling
* more docs
* fix style
* grr styles