* Forbid easily misused HashSet and HashMap constructors
* Add two LinkedHashMap constructors to forbidden-apis and create utility method as replacement for them
* Fix visibility of constant in CollectionUtils.java
* Make an exception for an instance of LinkedHashMap#<init>(int) because proper sizing is used
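To illustrate what "proper sizing" means here, below is a minimal sketch of the kind of replacement utility these commits describe; the class and method names are hypothetical, not the actual CollectionUtils API.

```java
import java.util.LinkedHashMap;

// Hypothetical sizing helper: chooses an initial capacity large enough that the
// map never rehashes while holding expectedSize entries at the default 0.75 load
// factor, which the raw LinkedHashMap(int) constructor does not guarantee.
public final class MapSizingExample
{
  private MapSizingExample() {}

  public static <K, V> LinkedHashMap<K, V> newLinkedHashMapWithExpectedSize(int expectedSize)
  {
    int initialCapacity = (int) Math.ceil(expectedSize / 0.75) + 1;
    return new LinkedHashMap<>(initialCapacity);
  }
}
```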
* revert changes to sql module tests that should be in separate PR
* Finish reverting changes to sql module tests that were flagged in checkstyle during CI
* Add netty dependency resulting from SuppressForbidden
* Add MemoryOpenHashTable, a table similar to ByteBufferHashTable.
With some key differences to improve speed and design simplicity:
1) Uses Memory rather than ByteBuffer for its backing storage.
2) Uses faster hashing and comparison routines (see HashTableUtils).
3) Capacity is always a power of two, allowing a simpler design and a more
efficient implementation of findBucket (see the sketch after this list).
4) Does not implement growability; instead, leaves that to its callers.
The idea is this removes the need for subclasses, while still giving
callers flexibility in how to handle table-full scenarios.
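As a rough illustration of point 3, here is a minimal open-addressing sketch (backed by plain arrays rather than Memory, and not the actual MemoryOpenHashTable code) showing why a power-of-two capacity simplifies findBucket:

```java
// Minimal sketch only: with a power-of-two capacity, "hash % capacity" reduces to
// a bitmask, and linear probing wraps around with the same mask instead of a
// modulo or bounds check.
class OpenHashTableSketch
{
  private final int[] keys;
  private final boolean[] used;

  OpenHashTableSketch(int capacity)       // capacity must be a power of two
  {
    this.keys = new int[capacity];
    this.used = new boolean[capacity];
  }

  /** Returns the bucket holding "key", or the free bucket where it would be inserted. */
  int findBucket(int key)
  {
    int mask = keys.length - 1;
    int bucket = Integer.hashCode(key) & mask;
    while (used[bucket] && keys[bucket] != key) {
      bucket = (bucket + 1) & mask;       // wrap around without division
    }
    return bucket;
  }
}
```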
* Fix LGTM warnings.
* Adjust dependencies.
* Remove easymock from druid-benchmarks.
* Adjustments from review.
* Fix datasketches unit tests.
* Fix checkstyle.
By default, native batch ingestion was only getting a batch of 10
files at a time when used with Google Cloud. The default for other
cloud providers is 1024, and Google Cloud should be similar. The low
batch size was caused by a typo. This change updates the batch size
to 1024 when using Google Cloud.
* Guicify druid sql module
Break up the SQLModule in to smaller modules and provide a binding that
modules can use to register schemas with druid sql.
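A minimal sketch of the registration pattern this describes, assuming a standard Guice set binding; the schema interface and implementation class below are placeholders, not the actual Druid types:

```java
import com.google.inject.Binder;
import com.google.inject.Module;
import com.google.inject.multibindings.Multibinder;

// Placeholder types standing in for the real schema interface and implementation.
interface SqlSchema {}
class LookupSqlSchema implements SqlSchema {}

// Each module contributes its schema to the set binding; the SQL layer can then
// inject Set<SqlSchema> and pick up every registered schema.
public class ExampleSqlSchemaModule implements Module
{
  @Override
  public void configure(Binder binder)
  {
    Multibinder.newSetBinder(binder, SqlSchema.class)
               .addBinding()
               .to(LookupSqlSchema.class);
  }
}
```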
* fix some tests
* address code review
* tests compile
* Working tests
* Add all the tests
* fix up licenses and dependencies
* add calcite dependency to druid-benchmarks
* tests pass
* rename the schemas
* SQL join support for lookups.
1) Add LookupSchema to SQL, so lookups show up in the catalog.
2) Add join-related rels and rules to SQL, allowing joins to be planned into
native Druid queries.
* Add two missing LookupSchema calls in tests.
* Fix tests.
* Fix typo.
This is important because if a user has the hdfs extension loaded, but is not
using hdfs deep storage, then they will not have storageDirectory set and will
get the following error:
IllegalArgumentException: Can not create a Path from an empty string
at io.druid.storage.hdfs.HdfsDataSegmentKiller.<init>(HdfsDataSegmentKiller.java:47)
This scenario is realistic: it comes up when someone has the hdfs extension
loaded because they want to use HdfsInputSource, but don't want to use hdfs for
deep storage.
Fixes #4694.
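One hypothetical way to avoid this class of failure, sketched here purely for illustration (not necessarily the fix that was applied), is to defer Path construction until the storage directory is actually needed:

```java
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: loading the extension no longer constructs a Path from an
// unset storageDirectory; a failure can only happen if HDFS deep storage is used.
class LazyStorageDirectory
{
  private final String storageDirectory;   // may be null/empty when HDFS deep storage is unused

  LazyStorageDirectory(String storageDirectory)
  {
    this.storageDirectory = storageDirectory;   // note: no Path created here
  }

  Path getStoragePath()
  {
    if (storageDirectory == null || storageDirectory.isEmpty()) {
      throw new IllegalStateException(
          "storageDirectory is not set; configure it to use HDFS deep storage"
      );
    }
    return new Path(storageDirectory);
  }
}
```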
* Add LookupJoinableFactory.
Enables joins where the right-hand side is a lookup. Includes an
integration test.
Also, includes changes to LookupExtractorFactoryContainerProvider:
1) Add "getAllLookupNames", which will be needed to eventually connect
lookups to Druid's SQL catalog.
2) Convert "get" from nullable to Optional return.
3) Swap out most usages of LookupReferencesManager in favor of the
simpler LookupExtractorFactoryContainerProvider interface.
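The shape of the provider interface after changes (1) and (2) is roughly as below; this is a simplified sketch, with the container type reduced to a placeholder:

```java
import java.util.Optional;
import java.util.Set;

// Simplified sketch of LookupExtractorFactoryContainerProvider after the changes above.
interface LookupProviderSketch
{
  /** Names of all known lookups; used to expose lookups in Druid's SQL catalog. */
  Set<String> getAllLookupNames();

  /** Previously returned null for a missing lookup; now returns an empty Optional instead. */
  Optional<LookupContainer> get(String lookupName);
}

class LookupContainer {}   // placeholder for LookupExtractorFactoryContainer
```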
* Fixes for tests.
* Fix another test.
* Java 11 message fix.
* Fixups.
* Fixup benchmark class.
* intelliJ inspections cleanup
- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static
Most of these changes are aesthetic, however, they will allow inspections to
be enabled as part of CI checks going forward
The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop (see the sketch after this list)
indexing-hadoop/.../Utils.java
processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
processing/src/.../ScanQueryLimitRowIteratorTest.java
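For the StringBuilder item above, the before/after pattern looks roughly like this (simplified, not the actual Utils.java code):

```java
import java.util.List;

class StringBuildingExample
{
  // Before: each "+=" copies the whole accumulated string, so the loop is O(n^2).
  static String joinSlow(List<String> parts)
  {
    String out = "";
    for (String part : parts) {
      out += part + ",";
    }
    return out;
  }

  // After: StringBuilder appends in amortized constant time, so the loop is O(n).
  static String joinFast(List<String> parts)
  {
    StringBuilder out = new StringBuilder();
    for (String part : parts) {
      out.append(part).append(',');
    }
    return out.toString();
  }
}
```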
* Add intelliJ inspection warnings as errors to druid profile
* one more static inner class
* Reconcile terminology and method naming to 'used/unused segments'; Don't use terms 'enable/disable data source'; Rename MetadataSegmentManager to MetadataSegments; Make REST API methods which mark segments as used/unused return a server error instead of an empty response in case of error
* Fix brace
* Import order
* Rename withKillDataSourceWhitelist to withSpecificDataSourcesToKill
* Fix tests
* Fix tests by adding proper methods without interval parameters to IndexerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY
* More aligned names of DruidCoordinatorHelpers, rename several CoordinatorDynamicConfig parameters
* Rename ClientCompactTaskQuery to ClientCompactionTaskQuery for consistency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig
* More variable and method renames
* Rename MetadataSegments to SegmentsMetadata
* Javadoc update
* Simplify SegmentsMetadata.getUnusedSegmentIntervals(), more javadocs
* Update Javadoc of VersionedIntervalTimeline.iterateAllObjects()
* Reorder imports
* Rename SegmentsMetadata.tryMark... methods to mark..., make them return booleans and the numbers of segments changed, and relay exceptions to callers
* Complete merge
* Add CollectionUtils.newTreeSet(); Refactor DruidCoordinatorRuntimeParams creation in tests
* Remove MetadataSegmentManager
* Rename millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments to leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments
* Fix tests, refactor DruidCluster creation in tests into DruidClusterBuilder
* Fix inspections
* Fix SQLMetadataSegmentManagerEmptyTest and rename it to SqlSegmentsMetadataEmptyTest
* Rename SegmentsAndMetadata to SegmentsAndCommitMetadata to reduce the similarity with SegmentsMetadata; Rename some methods
* Rename DruidCoordinatorHelper to CoordinatorDuty, refactor DruidCoordinator
* Unused import
* Optimize imports
* Rename IndexerSQLMetadataStorageCoordinator.getDataSourceMetadata() to retrieveDataSourceMetadata()
* Unused import
* Update terminology in datasource-view.tsx
* Fix label in datasource-view.spec.tsx.snap
* Fix lint errors in datasource-view.tsx
* Doc improvements
* Another attempt to please TSLint
* Another attempt to please TSLint
* Style fixes
* Fix IndexerSQLMetadataStorageCoordinator.createUsedSegmentsSqlQueryForIntervals() (wrong merge)
* Try to fix docs build issue
* Javadoc and spelling fixes
* Rename SegmentsMetadata to SegmentsMetadataManager, address other comments
* Address more comments
* Add JoinableFactory interface and use it in the query stack.
Also includes InlineJoinableFactory, which enables joining against
inline datasources. This is the first patch where a basic join query
actually works. It includes integration tests.
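A rough sketch of the concept (simplified placeholders, not the exact Druid interface): a factory is asked to build a Joinable for the right-hand side of a join and may decline, which lets multiple factories coexist (inline datasources here, lookups in a later patch):

```java
import java.util.Optional;

// Placeholder types for the sketch.
interface Joinable {}
interface DataSource {}
interface JoinCondition {}

// Simplified sketch of the factory idea described above.
interface JoinableFactorySketch
{
  /** Returns a Joinable for the given right-hand datasource, or empty if this factory can't handle it. */
  Optional<Joinable> build(DataSource dataSource, JoinCondition condition);
}
```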
* Fix test issues.
* Adjustments from code review.
* Add HashJoinSegment, a virtual segment for joins.
An initial step towards #8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.
* Fixups.
* Fix missing format argument.
* Various tests and minor improvements.
* Changes.
* Remove or add tests for unused stuff.
* Fix up package locations.
Previously jackson-mapper-asl was excluded to remove a security
vulnerability; however, it is required for functionality (e.g.,
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator).
* Add avro dependency to parquet extension
If the parquet extension is loaded and an ingestionSpec uses the older format,
specifying a 'parser' instead of an 'inputFormat', the job fails
with the following error:
java.lang.TypeNotPresentException: Type org.apache.avro.generic.GenericRecord not present
This change removes the exclusion of the avro package so that the missing
class can be found.
* Address review comments and add dependency version
* S3: Improvements to prefix listing (including fix for an infinite loop)
1) Fixes #9097, an infinite loop that occurs when more than one batch
of objects is retrieved during a prefix listing (see the sketch after this list).
2) Removes the Access Denied fallback code added in #4444. I don't think
the behavior is reasonable: its purpose is to fall back from a prefix
listing to a single-object access, but it's only activated when the
end user supplied a prefix. In that case it is better to simply fail, so
the end user knows that their request for a prefix-based load is not
going to work. Presumably the end user can switch from supplying
'prefixes' to supplying 'uris' if desired.
3) Filters out directory placeholders when walking prefixes.
4) Splits LazyObjectSummariesIterator into its own class and adds tests.
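For points 1 and 3, the essential shape of a correct multi-batch prefix listing is sketched below (simplified, using the AWS SDK directly rather than the actual LazyObjectSummariesIterator); the loop terminates only because each request carries forward the continuation token from the previous result:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.ArrayList;
import java.util.List;

class PrefixListingSketch
{
  static List<S3ObjectSummary> listPrefix(AmazonS3 s3, String bucket, String prefix)
  {
    List<S3ObjectSummary> summaries = new ArrayList<>();
    ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName(bucket)
        .withPrefix(prefix);
    ListObjectsV2Result result;
    do {
      result = s3.listObjectsV2(request);
      for (S3ObjectSummary summary : result.getObjectSummaries()) {
        if (summary.getSize() > 0) {
          // Skip zero-byte directory placeholders (point 3); the real check may differ.
          summaries.add(summary);
        }
      }
      // Advance to the next batch; forgetting this is what causes the infinite loop.
      request.setContinuationToken(result.getNextContinuationToken());
    } while (result.isTruncated());
    return summaries;
  }
}
```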
* Adjust S3InputSourceTest.
* Changes from review.
* Include hamcrest-core.
* Parallel indexing single dim partitions
Implements single dimension range partitioning for native parallel batch
indexing as described in #8769. This initial version requires the
druid-datasketches extension to be loaded.
The algorithm has 5 phases that are orchestrated by the supervisor in
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`.
These phases and the main classes involved are described below:
1) In parallel, determine the distribution of dimension values for each
input source split.
`PartialDimensionDistributionTask` uses `StringSketch` to generate
the approximate distribution of dimension values for each input
source split. If the rows are ungrouped,
`PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter`
uses a Bloom filter to skip rows that would be grouped. The final
distribution is sent back to the supervisor via
`DimensionDistributionReport`.
2) The range partitions are determined.
In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the
supervisor uses `StringSketchMerger` to merge the individual
`StringSketch`es created in the preceding phase. The merged sketch is
then used to create the range partitions.
3) In parallel, generate partial range-partitioned segments.
`PartialRangeSegmentGenerateTask` uses the range partitions
determined in the preceding phase and
`RangePartitionCachingLocalSegmentAllocator` to generate
`SingleDimensionShardSpec`s. The partition information is sent back
to the supervisor via `GeneratedGenericPartitionsReport`.
4) The partial range segments are grouped.
In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`,
the supervisor creates the `PartialGenericSegmentMergeIOConfig`s
necessary for the next phase.
5) In parallel, merge partial range-partitioned segments.
`PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to
retrieve the partial range-partitioned segments generated earlier and
then merges and publishes them.
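The overall control flow of the five phases can be sketched as below; every type and method in this snippet is a placeholder standing in for the Druid classes named above, not a real API:

```java
import java.util.List;

class RangePartitionOrchestrationSketch
{
  static class DistributionReport {}     // stands in for DimensionDistributionReport
  static class PartitionBoundaries {}    // derived from the merged StringSketch
  static class PartialSegmentsReport {}  // stands in for GeneratedGenericPartitionsReport
  static class MergeIOConfig {}          // stands in for PartialGenericSegmentMergeIOConfig

  void run()
  {
    // Phase 1 (parallel): sketch the dimension-value distribution of each input split.
    List<DistributionReport> distributions = runPartialDimensionDistributionTasks();
    // Phase 2: merge the sketches and derive the range partition boundaries.
    PartitionBoundaries boundaries = determineAllRangePartitions(distributions);
    // Phase 3 (parallel): generate partial range-partitioned segments using the boundaries.
    List<PartialSegmentsReport> partials = runPartialRangeSegmentGenerateTasks(boundaries);
    // Phase 4: group the partial segments by partition, one group per merge task.
    List<MergeIOConfig> mergeConfigs = groupPartitionsPerMergeTask(partials);
    // Phase 5 (parallel): merge and publish each partition's partial segments.
    runPartialSegmentMergeTasks(mergeConfigs);
  }

  // Placeholder bodies so the sketch compiles.
  List<DistributionReport> runPartialDimensionDistributionTasks() { return List.of(); }
  PartitionBoundaries determineAllRangePartitions(List<DistributionReport> reports) { return new PartitionBoundaries(); }
  List<PartialSegmentsReport> runPartialRangeSegmentGenerateTasks(PartitionBoundaries boundaries) { return List.of(); }
  List<MergeIOConfig> groupPartitionsPerMergeTask(List<PartialSegmentsReport> reports) { return List.of(); }
  void runPartialSegmentMergeTasks(List<MergeIOConfig> configs) {}
}
```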
* Fix dependencies & forbidden apis
* Fixes for integration test
* Address review comments
* Fix docs, strict compile, sketch check, rollup check
* Fix first shard spec, partition serde, single subtask
* Fix first partition check in test
* Misc rewording/refactoring to address code review
* Fix doc link
* Split batch index integration test
* Do not run parallel-batch-index twice
* Adjust last partition
* Split ITParallelIndexTest to reduce runtime
* Rename test class
* Allow null values in range partitions
* Indicate which phase failed
* Improve asserts in tests
* Address security vulnerabilities CVSS >= 7
Update dependencies to address security vulnerabilities with CVSS scores
of 7 or higher. A new Travis CI job is added to prevent new
high/critical security vulnerabilities from being added.
Updated dependencies:
- api-util 1.0.0 -> 1.0.3
- jackson 2.9.10 -> 2.10.1
- kafka 2.1.0 -> 2.1.1
- libthrift 0.10.0 -> 0.13.0
- protobuf 3.2.0 -> 3.11.0
The following high/critical security vulnerabilities are currently
suppressed (so that the new Travis CI job can be added now) and are left
as future work to fix:
- hibernate-validator:5.2.5
- jackson-mapper-asl:1.9.13
- libthrift:0.6.1
- netty:3.10.6
- nimbus-jose-jwt:4.41.1
* Rename EDL1 license file
* Fix inspection errors
* add prefixes support to google input source, making it symmetrical-ish with s3
* docs
* more better, and tests
* unused
* formatting
* javadoc
* dependencies
* oops
* review comments
* better javadoc
* Exclude unneeded hadoop transitive dependencies
These dependencies are provided by core:
- com.squareup.okhttp:okhttp
- commons-beanutils:commons-beanutils
- org.apache.commons:commons-compress
- org.apache.zookeeper:zookeeper
These dependencies are not needed and are excluded because they contain
security vulnerabilities:
- commons-beanutils:commons-beanutils-core
- org.codehaus.jackson:jackson-mapper-asl
* Simplify exclusions + separate unneeded/vulnerable
* Do not exclude jackson-mapper-asl
* Support orc format for native batch ingestion
* fix pom and remove wrong comment
* fix unnecessary condition check
* go back to using flatMap to handle exceptions properly
* move exceptionThrowingIterator to intermediateRowParsingReader
* runtime
* add s3 input source for native batch ingestion
* add docs
* fixes
* checkstyle
* lazy splits
* fixes and hella tests
* fix it
* re-use better iterator
* use key
* javadoc and checkstyle
* exception
* oops
* refactor to use S3Coords instead of URI
* remove unused code, add retrying stream to handle s3 stream
* remove unused parameter
* update to latest master
* use list of objects instead of object
* serde test
* refactor and such
* now with the ability to compile
* fix signature and javadocs
* fix conflicts yet again, fix S3 uri stuffs
* more tests, enforce uri for bucket
* javadoc
* oops
* abstract class instead of interface
* null or empty
* better error
* Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor
* Fix docs and javadoc
* Add unit tests for large or small estimated num splits
* add override
* Add FileUtils.createTempDir() and enforce its usage.
The purpose of this is to improve error messages. Previously, the error
message on a nonexistent or unwritable temp directory would be
"Failed to create directory within 10,000 attempts".
* Further updates.
* Another update.
* Remove commons-io from benchmark.
* Fix tests.
* add parquet support to native batch
* cleanup
* implement toJson for sampler support
* better binaryAsString test
* docs
* i hate spellcheck
* refactor toMap conversion so it can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest
* add comment, fix some stuff
* adjustments
* fix accident
* tweaks
* HDFS input source
Add support for using HDFS as an input source. In this version, commas
or globs are not supported in HDFS paths.
* Fix forbidden api
* Address review comments
* Tidy up lifecycle, query, and ingestion logging.
The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.
Concretely, this patch does the following:
- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
goal of reducing redundancy and improving clarity by avoiding
showing rarely-useful log messages. This includes most "starting"
and "stopping" messages, and most messages related to individual
columns.
- Adds new log4j2 templates that show operators how to enable DEBUG
logging for certain important packages.
- Eliminates stack traces for query errors unless the log level is DEBUG
or more verbose (see the sketch after this list). This is useful because
query errors often indicate user error rather than system error, but
dumping a stack trace often gave operators the impression that there was
a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
the default log4j2 configuration, this will put them in log lines
as well. It's very useful if a user is using the Indexer, where
multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
segments that are handed-off together by Kafka ingestion) and
"offsets" (cursors in partitions). These terms had been confused in
some log messages due to the fact that Kinesis calls offsets
"sequence numbers".
- Replaces some ugly toString calls with either JSONification or
something more operator-accessible (like a URL or a segment identifier,
instead of a JSON object representing the same).
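The stack-trace behavior for query errors amounts to something like the following (an illustrative sketch using SLF4J, not the actual Druid logging code):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class QueryErrorLoggingSketch
{
  private static final Logger log = LoggerFactory.getLogger(QueryErrorLoggingSketch.class);

  void logQueryError(String queryId, Exception e)
  {
    if (log.isDebugEnabled()) {
      // Operators who have turned on DEBUG get the full stack trace.
      log.debug("Query [" + queryId + "] failed", e);
    } else {
      // Default: a one-line message, since query errors are usually user errors and a
      // stack trace makes them look like system failures.
      log.warn("Query [{}] failed: {}", queryId, e.toString());
    }
  }
}
```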
* Adjustments.
* Adjust integration test.
If the JDBC drivers are missing from the lookup extensions, throw an
exception that directs the user how to resolve the issue. This change is
a follow up to #8825.
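Illustrative sketch of the kind of check described (class and message are examples, not the actual code): detect the missing driver up front and replace the bare ClassNotFoundException with actionable guidance.

```java
class JdbcDriverCheckSketch
{
  static void verifyDriverPresent(String driverClassName)
  {
    try {
      Class.forName(driverClassName);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(
          "JDBC driver [" + driverClassName + "] not found on the classpath. "
          + "Add the driver jar to the lookup extension's directory and restart the service.",
          e
      );
    }
  }
}
```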
* Use earliest offset on newly discovered Kafka partitions
* resolve conflicts
* remove redundant check cases
* simplified unit tests
* change test case
* rewrite comments
* add regression test
* add junit ignore annotation
* minor modifications
* indent
* override testableKafkaSupervisor and KafkaRecordSupplier to make the test runnable
* modified test constructor of kafkaRecordSupplier
* simplify
* delegated constructor
There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.
Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.
This patch has three goals:
1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast
aggregators while fixing null-handling bugs. I thought about splitting this
into its own patch, but it ended up being tough to split from the
null-handling fixes.
For (1) the fixes are,
- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
-1 as a sentinel value for null, to differentiate nulls from empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
eval.asBoolean() to avoid calling getLong on the selector after already
calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
on isNull. Also, refactored slightly to avoid the overhead of calling
getObject followed by another getter (see BloomFilterAggregatorFactory for
part of this).
For (2) the main changes are,
- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullableNumericAggregatorFactory to emphasize
that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.
For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,
- Fixed buffer overrun in the buffer aggregators when some characters in the string
encode into more than one byte (the old code used "substring" to apply a byte limit,
which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
xFold versions into aliases. The adaptiveness is similar to how other aggregators
like hyperUnique work.
* IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle(): don't fetch abutting intervals; simplify getUsedSegmentsForIntervals()
* Add VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() method; Propagate the decision about whether only visible segments, or visible and overshadowed segments, should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmentListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever
* Fix tests
* More fixes
* Add javadoc notes about returning Collection instead of Set. Add JacksonUtils.readValue() to reduce boilerplate code
* Fix KinesisIndexTaskTest, factor out common parts from KinesisIndexTaskTest and KafkaIndexTaskTest into SeekableStreamIndexTaskTestBase
* More test fixes
* More test fixes
* Add a comment to VersionedIntervalTimelineTestBase
* Fix tests
* Set DataSegment.size(0) in more tests
* Specify DataSegment.size(0) in more places in tests
* Fix more tests
* Fix DruidSchemaTest
* Set DataSegment's size in more tests and benchmarks
* Fix HdfsDataSegmentPusherTest
* Doc changes addressing comments
* Extended doc for visibility
* Typo
* Typo 2
* Address comment
* Add option lateMessageRejectionStartDate
* Use option lateMessageRejectionStartDate
* Fix tests
* Add lateMessageRejectionStartDate to kafka indexing service
* Update tests kafka indexing service
* Fix tests for KafkaSupervisorTest
* Add lateMessageRejectionStartDate to KinesisSupervisorIOConfig
* Fix var name
* Update documentation
* Add a check that lateMessageRejectionStartDateTime and lateMessageRejectionPeriod are not both set; fail if both were specified.
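A minimal sketch of such a check (field names come from the commit message; the surrounding class is illustrative):

```java
import org.joda.time.DateTime;
import org.joda.time.Period;

class LateMessageRejectionConfigCheck
{
  static void validate(DateTime lateMessageRejectionStartDateTime, Period lateMessageRejectionPeriod)
  {
    // The two options describe the same cutoff in different ways, so at most one may be set.
    if (lateMessageRejectionStartDateTime != null && lateMessageRejectionPeriod != null) {
      throw new IllegalArgumentException(
          "lateMessageRejectionStartDateTime and lateMessageRejectionPeriod are mutually exclusive; specify at most one"
      );
    }
  }
}
```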
* remove select query
* thanks teamcity
* oops
* oops
* add back a SelectQuery class that throws RuntimeExceptions linking to docs
* adjust text
* update docs per review
* deprecated
* Stateful auto compaction
* javadoc
* add removed test back
* fix test
* adding indexSpec to compactionState
* fix build
* add lastCompactionState
* address comments
* extract CompactionState
* fix doc
* fix build and test
* Add a task context to store compaction state; add javadoc
* fix it test
* Fix missing jackson jars for hadoop ingestion
* PR comments
* pom ordering
* New approach
* Remove all jackson-core/mapper-asl exclusions from hdfs storage
* Support LDAP authentication/authorization
* fixed integration-tests
* fixed Travis CI build errors related to druid-security module
* fixed failing test
* fixed failing test header
* added comments, force build
* fixes for strict compilation spotbugs checks
* removed authenticator rolling credential update feature
* removed escalator rolling credential update feature
* fixed teamcity inspection deprecated API usage error
* fixed checkstyle execution error, removed unused import
* removed cached config as part of removing authenticator rolling credential update feature
* removed config bundle entity as part of removing authenticator rolling credential update feature
* refactored LDAP configuration
* added support for SSLContext configuration and TLSCertificateChecker
* removed check to return authentication failure when user has no group assigned, will be checked and handled by the authorizer
* Separate out authorizer checks between metadata-backed store user and LDAP user/groups
* refactored BasicSecuritySSLSocketFactory usage to fix strict compilation spotbugs checks
* fixes build issue
* final review comments updates
* final review comments updates
* fixed LGTM and spellcheck alerts
* Fixed Avatica auth failure error message check
* Updated metadata credentials validator exception message string, replaced DB with metadata store
* get active tasks by datasource when supervisor discovers tasks
* fix ut
* fix ut
* fix ut
* remove unnecessary condition check
* fix ut
* remove stream in hot loop
* Added live reports for Kafka and native batch tasks
* Removed unused local variables
* Added the missing unit test
* Refine unit test logic, add implementation for HttpRemoteTaskRunner
* checkstyle fixes
* Update doc descriptions for updated API
* remove unnecessary files
* Fix spellcheck complaints
* More details for api descriptions
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports and
updated druid-forbidden-apis to prevent regressions.
* Address review comments
* Adjust scope for org.glassfish.jaxb:jaxb-runtime
* Fix dependencies for hdfs-storage
* Consolidate netty4 versions
* LoggingEmitter: print event as json
* use DefaultRequestLogEventBuilderFactory in emitting request logger by default
* print context in query metric as json
* removed unused jsonMapper from DefaultQueryMetrics
* add comment
* remove change to DefaultRequestLogEventBuilderFactory.java
* checkstyle fix for constant field names
* merging with upstream
* review-1
* unknown changes
* review-2
* merging with master
* review-2 1 changes
* review changes-2 2
* bug fix