* Doc update for new input source and input format.
- The input source and input format are promoted in all docs under docs/ingestion
- All input sources including core extension ones are located in docs/ingestion/native-batch.md
- All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md
- New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md
* parquet
* add warning for range partitioning with sequential mode
* hdfs + s3, gs
* add fs impl for gs
* address comments
* address comments
* gcs
* working
* - support multi-char delimiter for tsv
- respect "delimiter" property for tsv
* default value check for findColumnsFromHeader
* remove CSVParser to have a true and only CSVParser
* fix tests
* fix another test
* Speed up String first/last aggregators when folding isn't needed.
Examines the value column, and disables fold checking via a needsFoldCheck
flag if that column can't possibly contain SerializableLongStringPairs. This
is helpful because it avoids calling getObject on the value selector when
unnecessary; say, because the time selector didn't yield an earlier or later
value.
* PR comments.
* Move fastLooseChop to StringUtils.
* Add HashJoinSegment, a virtual segment for joins.
An initial step towards #8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.
* Fixups.
* Fix missing format argument.
* Various tests and minor improvements.
* Changes.
* Remove or add tests for unused stuff.
* Fix up package locations.
* Tutorials use new ingestion spec where possible
There are 2 main changes
* Use task type index_parallel instead of index
* Remove the use of parser + firehose in favor of inputFormat + inputSource
index_parallel is the preferred method starting in 0.17. Setting the job to
index_parallel with the default maxNumConcurrentSubTasks(1) is the equivalent
of an index task
Instead of using a parserSpec, dimensionSpec and timestampSpec have been
promoted to the dataSchema. The format is described in the ioConfig as the
inputFormat.
There are a few cases where the new format is not supported
* Hadoop must use firehoses instead of the inputSource and inputFormat
* There is no equivalent of a combining firehose as an inputSource
* A Combining firehose does not support index_parallel
* fix typo
CVE-2019-20330 was updated on 14 Jan 2020, which now gets flagged by the
security vulnerability scan. Since the CVE is for jackson-databind, via
htrace-core-4.0.1, it can be added to the existing list of security
vulnerability suppressions for that dependency.
Previously jackson-mapper-asl was excluded to remove a security
vulnerability; however, it is required for functionality (e.g.,
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator).
AggregateCaseToFilterRule was added to Calcite in https://issues.apache.org/jira/browse/CALCITE-3144,
and was originally copied from Druid's CaseFilteredAggregatorRule. So there isn't a good reason to
keep using our version.
* Add avro dependency to parquet extension
If the parquet extension is loaded and an ingestionSpec uses the older format
specifying a 'parser' instead of using an 'inputFormat' the job fails
with the following error
java.lang.TypeNotPresentException: Type org.apache.avro.generic.GenericRecord not present
This change removes the exclusion of the avro package so that the missing
class can be found.
* Address review comments and add dependency version
* S3: Improvements to prefix listing (including fix for an infinite loop)
1) Fixes#9097, an infinite loop that occurs when more than one batch
of objects is retrieved during a prefix listing.
2) Removes the Access Denied fallback code added in #4444. I don't think
the behavior is reasonable: its purpose is to fall back from a prefix
listing to a single-object access, but it's only activated when the
end user supplied a prefix, so it would be better to simply fail, so
the end user knows that their request for a prefix-based load is not
going to work. Presumably the end user can switch from supplying
'prefixes' to supplying 'uris' if desired.
3) Filters out directory placeholders when walking prefixes.
4) Splits LazyObjectSummariesIterator into its own class and adds tests.
* Adjust S3InputSourceTest.
* Changes from review.
* Include hamcrest-core.
* Optimize CachingLocalSegmentAllocator#getSequenceName
Replace StringUtils#format with string addition to generate the sequence
name for an interval and partition. This is faster because format uses a
Matcher under the covers to replace the string format with the variables.
* fix imports and add test
* Add comment about optimization
* Use renamed function for TaskToolbox
* Move tests after refactor
* Rename tests