* SQL join support for lookups.
1) Add LookupSchema to SQL, so lookups show up in the catalog.
2) Add join-related rels and rules to SQL, allowing joins to be planned into
native Druid queries.
* Add two missing LookupSchema calls in tests.
* Fix tests.
* Fix typo.
* Add LookupJoinableFactory.
Enables joins where the right-hand side is a lookup. Includes an
integration test.
Also, includes changes to LookupExtractorFactoryContainerProvider:
1) Add "getAllLookupNames", which will be needed to eventually connect
lookups to Druid's SQL catalog.
2) Convert "get" from nullable to Optional return.
3) Swap out most usages of LookupReferencesManager in favor of the
simpler LookupExtractorFactoryContainerProvider interface.
* Fixes for tests.
* Fix another test.
* Java 11 message fix.
* Fixups.
* Fixup benchmark class.
* Add getRightColumns to JoinConditionAnalysis
This change other implementations of JoinableFactory to ask the analysis
for the right key columns instead of having to calculate it themselves.
* Address some review comments
* more code review stuff
* intelliJ inspections cleanup
- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static
Most of these changes are aesthetic, however, they will allow inspections to
be enabled as part of CI checks going forward
The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
indexing-hadoop/.../Utils.java
processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
processing/src/.../ScanQueryLimitRowIteratorTest.java
* Add intelliJ inspection warnings as errors to druid profile
* one more static inner class
* Make JoinableFactory an extension point
This change makes it so that extensions can register a JoinableFactory that
should be used for a DataSource.
Extensions can provide the factories via DruidBinders#joinableFactoryBinder
Known DataSources - like InlineDataSource are provided in the
JoinableFactoryModule. This module installs a FactoryWarehouse that is
used to decide which factory should be used to generate the Joinable for
the provided DataSource.
The ExtensionPoint is marked as Beta since it is not yet clear if this
needs to remain available to other extensions or if the best way to
register a factory is by using the datasource class.
* Add module test
* remove useless bindings in test
* remove ExtensionPoint annotation
* Make LifecycleLock not final to help with testing
* Reconcile terminology and method naming to 'used/unused segments'; Don't use terms 'enable/disable data source'; Rename MetadataSegmentManager to MetadataSegments; Make REST API methods which mark segments as used/unused to return server error instead of an empty response in case of error
* Fix brace
* Import order
* Rename withKillDataSourceWhitelist to withSpecificDataSourcesToKill
* Fix tests
* Fix tests by adding proper methods without interval parameters to IndexerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY
* More aligned names of DruidCoordinatorHelpers, rename several CoordinatorDynamicConfig parameters
* Rename ClientCompactTaskQuery to ClientCompactionTaskQuery for consistency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig
* More variable and method renames
* Rename MetadataSegments to SegmentsMetadata
* Javadoc update
* Simplify SegmentsMetadata.getUnusedSegmentIntervals(), more javadocs
* Update Javadoc of VersionedIntervalTimeline.iterateAllObjects()
* Reorder imports
* Rename SegmentsMetadata.tryMark... methods to mark... and make them to return boolean and the numbers of segments changed and relay exceptions to callers
* Complete merge
* Add CollectionUtils.newTreeSet(); Refactor DruidCoordinatorRuntimeParams creation in tests
* Remove MetadataSegmentManager
* Rename millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments to leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments
* Fix tests, refactor DruidCluster creation in tests into DruidClusterBuilder
* Fix inspections
* Fix SQLMetadataSegmentManagerEmptyTest and rename it to SqlSegmentsMetadataEmptyTest
* Rename SegmentsAndMetadata to SegmentsAndCommitMetadata to reduce the similarity with SegmentsMetadata; Rename some methods
* Rename DruidCoordinatorHelper to CoordinatorDuty, refactor DruidCoordinator
* Unused import
* Optimize imports
* Rename IndexerSQLMetadataStorageCoordinator.getDataSourceMetadata() to retrieveDataSourceMetadata()
* Unused import
* Update terminology in datasource-view.tsx
* Fix label in datasource-view.spec.tsx.snap
* Fix lint errors in datasource-view.tsx
* Doc improvements
* Another attempt to please TSLint
* Another attempt to please TSLint
* Style fixes
* Fix IndexerSQLMetadataStorageCoordinator.createUsedSegmentsSqlQueryForIntervals() (wrong merge)
* Try to fix docs build issue
* Javadoc and spelling fixes
* Rename SegmentsMetadata to SegmentsMetadataManager, address other comments
* Address more comments
* Add JoinableFactory interface and use it in the query stack.
Also includes InlineJoinableFactory, which enables joining against
inline datasources. This is the first patch where a basic join query
actually works. It includes integration tests.
* Fix test issues.
* Adjustments from code review.
Builds on #9235, using the datasource analysis functionality to replace various ad-hoc
approaches. The most interesting changes are in ClientQuerySegmentWalker (brokers),
ServerManager (historicals), and SinkQuerySegmentWalker (indexing tasks).
Other changes related to improving how we analyze queries:
1) Changes TimelineServerView to return an Optional timeline, which I thought made
the analysis changes cleaner to implement.
2) Added QueryToolChest#canPerformSubquery, which is now used by query entry points to
determine whether it is safe to pass a subquery dataSource to the query toolchest.
Fixes an issue introduced in #5471 where subqueries under non-groupBy-typed queries
were silently ignored, since neither the query entry point nor the toolchest did
anything special with them.
3) Removes the QueryPlus.withQuerySegmentSpec method, which was mostly being used in
error-prone ways (ignoring any potential subqueries, and not verifying that the
underlying data source is actually a table). Replaces with a new function,
Queries.withSpecificSegments, that includes sanity checks.
* Add join-related DataSource types, and analysis functionality.
Builds on #9111 and implements the datasource analysis mentioned in #8728. Still can't
handle join datasources, but we're a step closer.
Join-related DataSource types:
1) Add "join", "lookup", and "inline" datasources.
2) Add "getChildren" and "withChildren" methods to DataSource, which will be used
in the future for query rewriting (e.g. inlining of subqueries).
DataSource analysis functionality:
1) Add DataSourceAnalysis class, which breaks down datasources into three components:
outer queries, a base datasource (left-most of the highest level left-leaning join
tree), and other joined-in leaf datasources (the right-hand branches of the
left-leaning join tree).
2) Add "isConcrete", "isGlobal", and "isCacheable" methods to DataSource in order to
support analysis.
Other notes:
1) Renamed DataSource#getNames to DataSource#getTableNames, which I think is clearer.
Also, made it a Set, so implementations don't need to worry about duplicates.
2) The addition of "isCacheable" should work around #8713, since UnionDataSource now
returns false for cacheability.
* Remove javadoc comment.
* Updates reflecting code review.
* Add comments.
* Add more comments.
* Fix equalsAndHashCode in ClientCompactQueryTuningConfig
This change introduces a dependency to EqualsVerifier for the test scope.
The dependency is licensed under Apache 2. The library makes it trivial
to add equals and hashCode checks to prevent bugs like this from happening
in the future
* fix checkstyle
* fix test name
* Address security vulnerabilities CVSS >= 7
Update dependencies to address security vulnerabilities with CVSS scores
of 7 or higher. A new Travis CI job is added to prevent new
high/critical security vulnerabilities from being added.
Updated dependencies:
- api-util 1.0.0 -> 1.0.3
- jackson 2.9.10 -> 2.10.1
- kafka 2.1.0 -> 2.1.1
- libthrift 0.10.0 -> 0.13.0
- protobuf 3.2.0 -> 3.11.0
The following high/critical security vulnerabilities are currently
suppressed (so that the new Travis CI job can be added now) and are left
as future work to fix:
- hibernate-validator:5.2.5
- jackson-mapper-asl:1.9.13
- libthrift:0.6.1
- netty:3.10.6
- nimbus-jose-jwt:4.41.1
* Rename EDL1 license file
* Fix inspection errors
* Add FileUtils.createTempDir() and enforce its usage.
The purpose of this is to improve error messages. Previously, the error
message on a nonexistent or unwritable temp directory would be
"Failed to create directory within 10,000 attempts".
* Further updates.
* Another update.
* Remove commons-io from benchmark.
* Fix tests.
* Refactor parallel indexing perfect rollup partitioning
Refactoring to make it easier to later add range partitioning for
perfect rollup parallel indexing. This is accomplished by adding several
new base classes (e.g., PerfectRollupWorkerTask) and new classes for
encapsulating logic that needs to be changed for different partitioning
strategies (e.g., IndexTaskInputRowIteratorBuilder).
The code is functionally equivalent to before except for the following
small behavior changes:
1) PartialSegmentMergeTask: Previously, this task had a priority of
DEFAULT_TASK_PRIORITY. It now has a priority of
DEFAULT_BATCH_INDEX_TASK_PRIORITY (via the new PerfectRollupWorkerTask
base class), since it is a batch index task.
2) ParallelIndexPhaseRunner: A decorator was added to
subTaskSpecIterator to ensure the subtasks are generated with unique
ids. Previously, only tests (i.e., MultiPhaseParallelIndexingTest)
would have this decorator, but this behavior is desired for non-test
code as well.
* Fix forbidden apis and pmd warnings
* Fix analyze dependencies warnings
* Fix IndexTask json and add IT diags
* Fix parallel index supervisor<->worker serde
* Fix TeamCity inspection errors/warnings
* Fix TeamCity inspection errors/warnings again
* Integrate changes with those from #8823
* Address review comments
* Address more review comments
* Fix forbidden apis
* Address more review comments
* Tidy up lifecycle, query, and ingestion logging.
The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.
Concretely, this patch does the following:
- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
goal of reducing redundancy and improving clarity by avoiding
showing rarely-useful log messages. This includes most "starting"
and "stopping" messages, and most messages related to individual
columns.
- Adds new log4j2 templates that show operators how to enabled DEBUG
logging for certain important packages.
- Eliminate stack traces for query errors, unless log level is DEBUG
or more. This is useful because query errors often indicate user
error rather than system error, but dumping stack trace often gave
operators the impression that there was a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
the default log4j2 configuration, this will put them in log lines
as well. It's very useful if a user is using the Indexer, where
multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
segments that are handed-off together by Kafka ingestion) and
"offsets" (cursors in partitions). These terms had been confused in
some log messages due to the fact that Kinesis calls offsets
"sequence numbers".
- Replaces some ugly toString calls with either the JSONification or
something more operator-accessible (like a URL or segment identifier,
instead of JSON object representing the same).
* Adjustments.
* Adjust integration test.
* first steps
* clean licenses
* fix capabilities
* fix specs
* more tests
* new web console on coordinator and overlord, remove setup for old consoles, old configs
* better message
* update licenses
* sync license files
* more button
* fix tslint issue
* jetty-rewrite dependency to add redirects for old console paths
* put dependency in the right place
* fix overlord detection
* fix notices, dedupe licenses
* make segment timeline work in no SQL mode
* update license
* revert hard coded coordinator mode from testing
* update restricted mode copy
* sketch of broker parallel merges done in small batches on fork join pool
* fix non-terminating sequences, auto compute parallelism
* adjust benches
* adjust benchmarks
* now hella more faster, fixed dumb
* fix
* remove comments
* log.info for debug
* javadoc
* safer block for sequence to yielder conversion
* refactor LifecycleForkJoinPool into LifecycleForkJoinPoolProvider which wraps a ForkJoinPool
* smooth yield rate adjustment, more logs to help tune
* cleanup, less logs
* error handling, bug fixes, on by default, more parallel, more tests
* remove unused var
* comments
* timeboundary mergeFn
* simplify, more javadoc
* formatting
* pushdown config
* use nanos consistently, move logs back to debug level, bit more javadoc
* static terminal result batch
* javadoc for nullability of createMergeFn
* cleanup
* oops
* fix race, add docs
* spelling, remove todo, add unhandled exception log
* cleanup, revert unintended change
* another unintended change
* review stuff
* add ParallelMergeCombiningSequenceBenchmark, fixes
* hyper-threading is the enemy
* fix initial start delay, lol
* parallelism computer now balances partition sizes to partition counts using sqrt of sequence count instead of sequence count by 2
* fix those important style issues with the benchmarks code
* lazy sequence creation for benchmarks
* more benchmark comments
* stable sequence generation time
* update defaults to use 100ms target time, 4096 batch size, 16384 initial yield, also update user docs
* add jmh thread based benchmarks, cleanup some stuff
* oops
* style
* add spread to jmh thread benchmark start range, more comments to benchmarks parameters and purpose
* retool benchmark to allow modeling more typical heterogenous heavy workloads
* spelling
* fix
* refactor benchmarks
* formatting
* docs
* add maxThreadStartDelay parameter to threaded benchmark
* why does catch need to be on its own line but else doesnt
* In DirectDruidClient, don't run Future cancellation listener in HTTP library executor
* extract cancelQuery as a method of DirectDruidClient
* Fix testCancel
* Add exception as the first argument to log.error
* IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle() don't fetch abutting intervals; simplify getUsedSegmentsForIntervals()
* Add VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() method; Propagate the decision about whether only visible segmetns or visible and overshadowed segments should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmetnListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever
* Fix tests
* More fixes
* Add javadoc notes about returning Collection instead of Set. Add JacksonUtils.readValue() to reduce boilerplate code
* Fix KinesisIndexTaskTest, factor out common parts from KinesisIndexTaskTest and KafkaIndexTaskTest into SeekableStreamIndexTaskTestBase
* More test fixes
* More test fixes
* Add a comment to VersionedIntervalTimelineTestBase
* Fix tests
* Set DataSegment.size(0) in more tests
* Specify DataSegment.size(0) in more places in tests
* Fix more tests
* Fix DruidSchemaTest
* Set DataSegment's size in more tests and benchmarks
* Fix HdfsDataSegmentPusherTest
* Doc changes addressing comments
* Extended doc for visibility
* Typo
* Typo 2
* Address comment
* remove select query
* thanks teamcity
* oops
* oops
* add back a SelectQuery class that throws RuntimeExceptions linking to docs
* adjust text
* update docs per review
* deprecated
* Support assign tasks to run on different tiers of MiddleManagers
* address comments
* address comments
* rename tier to category and docs
* doc
* fix doc
* fix spelling errors
* docs
* Stateful auto compaction
* javaodc
* add removed test back
* fix test
* adding indexSpec to compactionState
* fix build
* add lastCompactionState
* address comments
* extract CompactionState
* fix doc
* fix build and test
* Add a task context to store compaction state; add javadoc
* fix it test
* IOConfig for compaction task
* add javadoc, doc, unit test
* fix webconsole test
* add spelling
* address comments
* fix build and test
* address comments
* Add tier based usage metrics for historical nodes to help with druid historical autoscaling
Add tier based usage metrics for historical nodes to help druid cluster orchestration systems understand the historical node usage and requirements. Following metrics would be helpful -
tier/required/capacity- total capacity in bytes required in each tier. Dimensions - tier
tier/total/capacity - total capacity in bytes available in a given tier. Dimension - tier
tier/historical/count - no. of historical nodes available in each tier. Dimension - tier
tier/replication/factor - configured maximum replication factor in given tier. Dimension - tier
* fix unit test failures
* #7641 - Changing segment distribution algorithm to distribute segments to multiple segment cache locations
* Fixing indentation
* WIP
* Adding interface for location strategy selection, least bytes used strategy impl, round-robin strategy impl, locationSelectorStrategy config with least bytes used strategy as the default strategy
* fixing code style
* Fixing test
* Adding a method visible only for testing, fixing tests
* 1. Changing the method contract to return an iterator of locations instead of a single best location. 2. Check style fixes
* fixing the conditional statement
* Added testSegmentDistributionUsingLeastBytesUsedStrategy, fixed testSegmentDistributionUsingRoundRobinStrategy
* to trigger CI build
* Add documentation for the selection strategy configuration
* to re trigger CI build
* updated docs as per review comments, made LeastBytesUsedStorageLocationSelectorStrategy.getLocations a synchronzied method, other minor fixes
* In checkLocationConfigForNull method, using getLocations() to check for null instead of directly referring to the locations variable so that tests overriding getLocations() method do not fail
* Implementing review comments. Added tests for StorageLocationSelectorStrategy
* Checkstyle fixes
* Adding java doc comments for StorageLocationSelectorStrategy interface
* checkstyle
* empty commit to retrigger build
* Empty commit
* Adding suppressions for words leastBytesUsed and roundRobin of ../docs/configuration/index.md file
* Impl review comments including updating docs as suggested
* Removing checkLocationConfigForNull(), @NotEmpty annotation serves the purpose
* Round robin iterator to keep track of the no. of iterations, impl review comments, added tests for round robin strategy
* Fixing the round robin iterator
* Removed numLocationsToTry, updated java docs
* changing property attribute value from tier to type
* Fixing assert messages
* add timeout support for JsonParserIterator init future
* add queryId
* should be less than 1
* fix
* fix npe
* fix lgtm
* adjust exception, nullable
* fix test
* refactor
* revert queryId change
* add log.warn to tie exception to json parser iterator
For CuratorModuleTest. exitsJvmWhenMaxRetriesExceeded(), the expected
log message is intermittently not the first one in the list of captured
log messages. For example, it is the second one in
https://travis-ci.org/apache/incubator-druid/jobs/586792178#L754.
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports and
updated druid-forbidden-apis to prevent regressions.
* Address review comments
* Adjust scope for org.glassfish.jaxb:jaxb-runtime
* Fix dependencies for hdfs-storage
* Consolidate netty4 versions
* Exit JVM on curator unhandled errors
If an unhandled error occurs when curator is talking to ZooKeeper, exit
the JVM in addition to stopping the lifecycle to prevent the process
from being left in a zombie state. With this change,
BoundedExponentialBackoffRetryWithQuit is no longer needed as when
curator exceeds the configured retries, it triggers its unhandled error
listeners. A new "connectionTimeoutMs" CuratorConfig setting is added
mostly to facilitate testing curator unhandled errors, but it may be
useful for users as well.
* Address review comments
* disallow whitespace characters in data source names
* wrapped preconditions in a function, and simplify unit tests code
* Fixed regex to allow space, simplified repeat logic
* Fixed import style against mvn checkstyle
* Add msg in case test fails, use emptyMap(), improved naming
* Changes on assertion functions
* change wording of "whitespace" to "whitespace except space" to avoid misleading
* LoggingEmitter: print event as json
* use DefaultRequestLogEventBuilderFactory in emitting request logger by default
* print context in query metric as json
* removed unused jsonMapper from DefaultQueryMetrics
* add comment
* remove change to DefaultRequestLogEventBuilderFactory.java
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* merging with upstream
* review-1
* unknow changes
* unknow changes
* review-2
* merging with master
* review-2 1 changes
* review changes-2 2
* bug fix
* fix issue8291. Make sure ReferenceCountingSegment.decrement() is invoked correctly
* add some comments in SinkQuerySegmentWalker.java
* extracting perSegmentRunners and perHydrantRunners to reduce the level of nesting
* Add TaskResourceCleaner; fix a couple of concurrency bugs in batch tasks
* kill runner when it's ready
* add comment
* kill run thread
* fix test
* Take closeable out of Appenderator
* add javadoc
* fix test
* fix test
* update javadoc
* add javadoc about killed task
* address comment
* Add support for parallel native indexing with shuffle for perfect rollup.
* Add comment about volatiles
* fix test
* fix test
* handling missing exceptions
* more clear javadoc for stopGracefully
* unused import
* update javadoc
* Add missing statement in javadoc
* address comments; fix doc
* add javadoc for isGuaranteedRollup
* Rename confusing variable name and fix typos
* fix typos; move fetch() to a better home; fix the expiration time
* add support https
* Enable ability to toggle SegmentMetadata request logging on/off
* Move SegmentMetadata query log filter to FilteredRequestLogger
* Update documentation to reflect the segment metadata flag moving to the filtered request logger
* Modify patch to allow blacklist of query types to not log to request logger
* Address styling and naming requests following latest code review
* Fix indentation on multiple locations per Druid style rules
* Fix race between canHandle() and addSegment() in StorageLocation
* add comment
* Add shuffleSegmentPusher which is a dataSegmentPusher used for writing shuffle data in local storage.
* add comments
* unused import
* add comments
* fix test
* address comments
* remove <p> tag from javadoc
* address comments
* comparingLong
* Address comments
* fix test
* Refactored ResponseContext and aggregated its keys into Enum
* Added unit tests for ResponseContext and refactored the serialization
* Removed unused methods
* Fixed code style
* Fixed code style
* Fixed code style
* Made SerializationResult static
* Updated according to the PR discussion:
Renamed an argument
Updated comparator
Replaced Pair usage with Map.Entry
Added a comment about quadratic complexity
Removed boolean field with an expression
Renamed SerializationResult field
Renamed the method merge to add and renamed several context keys
Renamed field and method related to scanRowsLimit
Updated a comment
Simplified a block of code
Renamed a variable
* Added JsonProperty annotation to renamed ScanQuery field
* Extension-friendly context key implementation
* Refactored ResponseContext: updated delegate type, comments and exceptions
Reducing serialized context length by removing some of its'
collection elements
* Fixed tests
* Simplified response context truncation during serialization
* Extracted a method of removing elements from a response context and
added some comments
* Fixed typos and updated comments
* Add IPv4 druid expressions
New druid expressions for filtering IPv4 addresses:
- ipv4address_match: Check if IP address belongs to a subnet
- ipv4address_parse: Convert string IP address to long
- ipv4address_stringify: Convert long IP address to string
These expressions operate on IP addresses represented as either strings
or longs, so that they can be applied to dimensions with mixed
representation of IP addresses. The filtering is more efficient when
operating on IP addresses as longs. In other words, the intended use
case is:
1) Use ipv4address_parse to convert to long at ingestion time
2) Use ipv4address_match to filter (on longs) at query time
3) Use ipv4adress_stringify to convert to (readable) string at query
time
* Fix licenses and null handling
* Simplify IPv4 expressions
* Fix tests
* Fix check for valid ipv4 address string