* Broker: Add ability to inline subqueries.
The main changes:
- ClientQuerySegmentWalker: Add ability to inline queries.
- Query: Add "getSubQueryId" and "withSubQueryId" methods.
- QueryMetrics: Add "subQueryId" dimension.
- ServerConfig: Add new "maxSubqueryRows" parameter, which is used by
ClientQuerySegmentWalker to limit how many rows can be inlined per
query.
- IndexedTableJoinMatcher: Allow creating keys on top of unknown types,
by assuming they are strings. This is useful because not all types are
known for fields in query results.
- InlineDataSource: Store RowSignature rather than component parts. Add
more zealous "equals" and "hashCode" methods to ease testing.
- Moved QuerySegmentWalker test code from CalciteTests and
SpecificSegmentsQueryWalker in druid-sql to QueryStackTests in
druid-server. Use this to spin up a new ClientQuerySegmentWalkerTest.
* Adjustments from CI.
* Fix integration test.
* Link up row-based datasources to serving layer.
- Add SegmentWrangler interface that allows linking of DataSources to Segments.
- Add LocalQuerySegmentWalker that uses SegmentWranglers to compute queries on
data that is available locally.
- Modify ClientQuerySegmentWalker to use LocalQuerySegmentWalker when the base
datasource is concrete and not a table.
- Add SegmentWranglerModule to the Broker so it has them available and can
properly instantiate . LocalQuerySegmentWalkers.
- Set InlineDataSource and LookupDataSource to concrete, since they can be
directly queried now.
* Fix tests.
* Use the iterator instead of higherKey(); use the iterator API instead of stream
* Fix tests; fix a concurrency bug in timeline
* fix test
* add tests for findNonOvershadowedObjectsInInterval
* fix test
* add missing tests; fix a bug in QueueEntry
* equals tests
* fix test
* integration test refactor
* integration test refactor
* refactor integration test
* refactor integration test
* refactor integration test
* refactor integration test
* refactor integration test
* refactor integration test
* refactor integration test
* refactor integration test
* address comments
* Improves on the fix for 8918
* factorize constants for ITRetryUtil.retryUntil call
* increasing retries and sleep in HttpUtil to cope with 401s in testing
* adding retries in EventReceiverFirehoseTestClient
* adding missing space
* Create splits of multiple files for parallel indexing
* fix wrong import and npe in test
* use the single file split in tests
* rename
* import order
* Remove specific local input source
* Update docs/ingestion/native-batch.md
Co-Authored-By: sthetland <steve.hetland@imply.io>
* Update docs/ingestion/native-batch.md
Co-Authored-By: sthetland <steve.hetland@imply.io>
* doc and error msg
* fix build
* fix a test and address comments
Co-authored-by: sthetland <steve.hetland@imply.io>
* Remove references to Docker Machine
Removing a broken link to an obsolete repo.
While at it, removing references to Docker Machine, which was obsolete as of Docker v1.12 (avail. 2016). This version introduced Docker as native MacOS and Windows apps.
* Update README.md
Wording nit.
* Create new dynamic config to pause coordinator helpers when needed
* Fix spelling mistakes flagged in Travis build
* Add an integration test for coordinator pause dynamic config
* Improve documentation for new dynamic coordinator config and remove un-needed info logs in favor of debug
* address naming convention of 'deep store' vs 'deep storage' in new configs doc line
* Fix newline at end of configuration index.md
* Last try to resolve newline issue in configuration readme
* fix spell checks from travis build
* Fix another flagges spelling error from Travis
* Add LookupJoinableFactory.
Enables joins where the right-hand side is a lookup. Includes an
integration test.
Also, includes changes to LookupExtractorFactoryContainerProvider:
1) Add "getAllLookupNames", which will be needed to eventually connect
lookups to Druid's SQL catalog.
2) Convert "get" from nullable to Optional return.
3) Swap out most usages of LookupReferencesManager in favor of the
simpler LookupExtractorFactoryContainerProvider interface.
* Fixes for tests.
* Fix another test.
* Java 11 message fix.
* Fixups.
* Fixup benchmark class.
* Reconcile terminology and method naming to 'used/unused segments'; Don't use terms 'enable/disable data source'; Rename MetadataSegmentManager to MetadataSegments; Make REST API methods which mark segments as used/unused to return server error instead of an empty response in case of error
* Fix brace
* Import order
* Rename withKillDataSourceWhitelist to withSpecificDataSourcesToKill
* Fix tests
* Fix tests by adding proper methods without interval parameters to IndexerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY
* More aligned names of DruidCoordinatorHelpers, rename several CoordinatorDynamicConfig parameters
* Rename ClientCompactTaskQuery to ClientCompactionTaskQuery for consistency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig
* More variable and method renames
* Rename MetadataSegments to SegmentsMetadata
* Javadoc update
* Simplify SegmentsMetadata.getUnusedSegmentIntervals(), more javadocs
* Update Javadoc of VersionedIntervalTimeline.iterateAllObjects()
* Reorder imports
* Rename SegmentsMetadata.tryMark... methods to mark... and make them to return boolean and the numbers of segments changed and relay exceptions to callers
* Complete merge
* Add CollectionUtils.newTreeSet(); Refactor DruidCoordinatorRuntimeParams creation in tests
* Remove MetadataSegmentManager
* Rename millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments to leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments
* Fix tests, refactor DruidCluster creation in tests into DruidClusterBuilder
* Fix inspections
* Fix SQLMetadataSegmentManagerEmptyTest and rename it to SqlSegmentsMetadataEmptyTest
* Rename SegmentsAndMetadata to SegmentsAndCommitMetadata to reduce the similarity with SegmentsMetadata; Rename some methods
* Rename DruidCoordinatorHelper to CoordinatorDuty, refactor DruidCoordinator
* Unused import
* Optimize imports
* Rename IndexerSQLMetadataStorageCoordinator.getDataSourceMetadata() to retrieveDataSourceMetadata()
* Unused import
* Update terminology in datasource-view.tsx
* Fix label in datasource-view.spec.tsx.snap
* Fix lint errors in datasource-view.tsx
* Doc improvements
* Another attempt to please TSLint
* Another attempt to please TSLint
* Style fixes
* Fix IndexerSQLMetadataStorageCoordinator.createUsedSegmentsSqlQueryForIntervals() (wrong merge)
* Try to fix docs build issue
* Javadoc and spelling fixes
* Rename SegmentsMetadata to SegmentsMetadataManager, address other comments
* Address more comments
* Add JoinableFactory interface and use it in the query stack.
Also includes InlineJoinableFactory, which enables joining against
inline datasources. This is the first patch where a basic join query
actually works. It includes integration tests.
* Fix test issues.
* Adjustments from code review.
By picking one. Otherwise, when a machine has multiple IP addresses, DOCKER_HOST_IP
would have a newline in the middle, causing havoc in configuration files.
This test was failing to authenticate using the admin credentials. These
should be available by default in the metadata store. This indicates that
the credentials are not successfully being syncd before the test is run.
This change increases the number of retries to 20 so that the services
are syncd before the test runs
* Parallel indexing single dim partitions
Implements single dimension range partitioning for native parallel batch
indexing as described in #8769. This initial version requires the
druid-datasketches extension to be loaded.
The algorithm has 5 phases that are orchestrated by the supervisor in
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`.
These phases and the main classes involved are described below:
1) In parallel, determine the distribution of dimension values for each
input source split.
`PartialDimensionDistributionTask` uses `StringSketch` to generate
the approximate distribution of dimension values for each input
source split. If the rows are ungrouped,
`PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter`
uses a Bloom filter to skip rows that would be grouped. The final
distribution is sent back to the supervisor via
`DimensionDistributionReport`.
2) The range partitions are determined.
In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the
supervisor uses `StringSketchMerger` to merge the individual
`StringSketch`es created in the preceding phase. The merged sketch is
then used to create the range partitions.
3) In parallel, generate partial range-partitioned segments.
`PartialRangeSegmentGenerateTask` uses the range partitions
determined in the preceding phase and
`RangePartitionCachingLocalSegmentAllocator` to generate
`SingleDimensionShardSpec`s. The partition information is sent back
to the supervisor via `GeneratedGenericPartitionsReport`.
4) The partial range segments are grouped.
In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`,
the supervisor creates the `PartialGenericSegmentMergeIOConfig`s
necessary for the next phase.
5) In parallel, merge partial range-partitioned segments.
`PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to
retrieve the partial range-partitioned segments generated earlier and
then merges and publishes them.
* Fix dependencies & forbidden apis
* Fixes for integration test
* Address review comments
* Fix docs, strict compile, sketch check, rollup check
* Fix first shard spec, partition serde, single subtask
* Fix first partition check in test
* Misc rewording/refactoring to address code review
* Fix doc link
* Split batch index integration test
* Do not run parallel-batch-index twice
* Adjust last partition
* Split ITParallelIndexTest to reduce runtime
* Rename test class
* Allow null values in range partitions
* Indicate which phase failed
* Improve asserts in tests
* Address security vulnerabilities CVSS >= 7
Update dependencies to address security vulnerabilities with CVSS scores
of 7 or higher. A new Travis CI job is added to prevent new
high/critical security vulnerabilities from being added.
Updated dependencies:
- api-util 1.0.0 -> 1.0.3
- jackson 2.9.10 -> 2.10.1
- kafka 2.1.0 -> 2.1.1
- libthrift 0.10.0 -> 0.13.0
- protobuf 3.2.0 -> 3.11.0
The following high/critical security vulnerabilities are currently
suppressed (so that the new Travis CI job can be added now) and are left
as future work to fix:
- hibernate-validator:5.2.5
- jackson-mapper-asl:1.9.13
- libthrift:0.6.1
- netty:3.10.6
- nimbus-jose-jwt:4.41.1
* Rename EDL1 license file
* Fix inspection errors
* Refactor parallel indexing perfect rollup partitioning
Refactoring to make it easier to later add range partitioning for
perfect rollup parallel indexing. This is accomplished by adding several
new base classes (e.g., PerfectRollupWorkerTask) and new classes for
encapsulating logic that needs to be changed for different partitioning
strategies (e.g., IndexTaskInputRowIteratorBuilder).
The code is functionally equivalent to before except for the following
small behavior changes:
1) PartialSegmentMergeTask: Previously, this task had a priority of
DEFAULT_TASK_PRIORITY. It now has a priority of
DEFAULT_BATCH_INDEX_TASK_PRIORITY (via the new PerfectRollupWorkerTask
base class), since it is a batch index task.
2) ParallelIndexPhaseRunner: A decorator was added to
subTaskSpecIterator to ensure the subtasks are generated with unique
ids. Previously, only tests (i.e., MultiPhaseParallelIndexingTest)
would have this decorator, but this behavior is desired for non-test
code as well.
* Fix forbidden apis and pmd warnings
* Fix analyze dependencies warnings
* Fix IndexTask json and add IT diags
* Fix parallel index supervisor<->worker serde
* Fix TeamCity inspection errors/warnings
* Fix TeamCity inspection errors/warnings again
* Integrate changes with those from #8823
* Address review comments
* Address more review comments
* Fix forbidden apis
* Address more review comments
* Tidy up lifecycle, query, and ingestion logging.
The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.
Concretely, this patch does the following:
- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
goal of reducing redundancy and improving clarity by avoiding
showing rarely-useful log messages. This includes most "starting"
and "stopping" messages, and most messages related to individual
columns.
- Adds new log4j2 templates that show operators how to enabled DEBUG
logging for certain important packages.
- Eliminate stack traces for query errors, unless log level is DEBUG
or more. This is useful because query errors often indicate user
error rather than system error, but dumping stack trace often gave
operators the impression that there was a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
the default log4j2 configuration, this will put them in log lines
as well. It's very useful if a user is using the Indexer, where
multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
segments that are handed-off together by Kafka ingestion) and
"offsets" (cursors in partitions). These terms had been confused in
some log messages due to the fact that Kinesis calls offsets
"sequence numbers".
- Replaces some ugly toString calls with either the JSONification or
something more operator-accessible (like a URL or segment identifier,
instead of JSON object representing the same).
* Adjustments.
* Adjust integration test.
* remove select query
* thanks teamcity
* oops
* oops
* add back a SelectQuery class that throws RuntimeExceptions linking to docs
* adjust text
* update docs per review
* deprecated
* Stateful auto compaction
* javaodc
* add removed test back
* fix test
* adding indexSpec to compactionState
* fix build
* add lastCompactionState
* address comments
* extract CompactionState
* fix doc
* fix build and test
* Add a task context to store compaction state; add javadoc
* fix it test
* Support LDAP authentication/authorization
* fixed integration-tests
* fixed Travis CI build errors related to druid-security module
* fixed failing test
* fixed failing test header
* added comments, force build
* fixes for strict compilation spotbugs checks
* removed authenticator rolling credential update feature
* removed escalator rolling credential update feature
* fixed teamcity inspection deprecated API usage error
* fixed checkstyle execution error, removed unused import
* removed cached config as part of removing authenticator rolling credential update feature
* removed config bundle entity as part of removing authenticator rolling credential update feature
* refactored ldao configuration
* added support for SSLContext configuration and TLSCertificateChecker
* removed check to return authentication failure when user has no group assigned, will be checked and handled by the authorizer
* Separate out authorizer checks between metadata-backed store user and LDAP user/groups
* refactored BasicSecuritySSLSocketFactory usage to fix strict compilation spotbugs checks
* fixes build issue
* final review comments updates
* final review comments updates
* fixed LGTM and spellcheck alerts
* Fixed Avatica auth failure error message check
* Updated metadata credentials validator exception message string, replaced DB with metadata store
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports and
updated druid-forbidden-apis to prevent regressions.
* Address review comments
* Adjust scope for org.glassfish.jaxb:jaxb-runtime
* Fix dependencies for hdfs-storage
* Consolidate netty4 versions
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* merging with upstream
* review-1
* unknow changes
* unknow changes
* review-2
* merging with master
* review-2 1 changes
* review changes-2 2
* bug fix
* Add group_id to overlord tasks API and sys.tasks table
* adjust test
* modify docs
* Make groupId nullable
* fix integration test
* fix toString
* Remove groupId from TaskInfo
* Modify docs and tests
* modify TaskMonitorTest
* Add TaskResourceCleaner; fix a couple of concurrency bugs in batch tasks
* kill runner when it's ready
* add comment
* kill run thread
* fix test
* Take closeable out of Appenderator
* add javadoc
* fix test
* fix test
* update javadoc
* add javadoc about killed task
* address comment
* Add support for parallel native indexing with shuffle for perfect rollup.
* Add comment about volatiles
* fix test
* fix test
* handling missing exceptions
* more clear javadoc for stopGracefully
* unused import
* update javadoc
* Add missing statement in javadoc
* address comments; fix doc
* add javadoc for isGuaranteedRollup
* Rename confusing variable name and fix typos
* fix typos; move fetch() to a better home; fix the expiration time
* add support https
Reorganize Travis CI jobs into smaller faster (and more) jobs. Add
various maven options to skip unnecessary work and refactored Travis CI
job definitions to follow DRY.
Detailed changes:
.travis.yml
- Refactor build logic to get rid of copy-and-paste logic
- Skip static checks and enable parallelism for maven install
- Split static analysis into different jobs to ease triage
- Use "name" attribute instead of NAME environment variable
- Split "indexing" and "web console" out of "other modules test"
- Split 2 integration test jobs into multiple smaller jobs
build.sh
- Enable parallelism
- Disable more static checks
travis_script_integration.sh
travis_script_integration_part2.sh
integration-tests/README.md
- Use TestNG groups instead of shell scripts and move definition of jobs
into Travis CI yaml
integration-tests/pom.xml
- Show elapsed time of individual tests to aid in future rebalancing of
Travis CI integration test jobs run time
TestNGGroup.java
- Use TestNG groups to make it easy to have multiple Travis CI
integration test jobs. TestNG groups also make it easier to have an
"other" integration test group and make it less likely a test will
accidentally not be included in a CI job.
IT*Test.java
AbstractITBatchIndexTest.java
AbstractKafkaIndexerTest.java
- Add TestNG group
- Fix various IntelliJ inspection warnings
- Reduce scope of helper methods since the TestNG group annotation on
the class makes TestNG consider all public methods as test methods
pom.xml
- Allow enforce plugin to be run from command-line
- Bump resources plugin version so that "[debug] execute contextualize"
output is correctly suppressed by "mvn -q"
- Bump exec plugin version so that skip property is renamed from "skip"
to "exec.skip"
web-console/pom.xml
- Add property to allow disabling javascript-related work. This property
is overridden in Travis CI to speed up the jobs.
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports.
* Fix licenses and dependencies
* Fix licenses and dependencies again
* Fix integration test dependency
* Address review comments
* Fix unit test dependencies
* Fix integration test dependency
* Fix integration test dependency again
* Fix integration test dependency third time
* Fix integration test dependency fourth time
* Fix compile error
* Fix assert package
* Upgrade various build and doc links to https.
Where it wasn't possible to upgrade build-time dependencies to https,
I kept http in place but used hardcoded checksums or GPG keys to ensure
that artifacts fetched over http are verified properly.
* Switch to https://apache.org.
* update sys.servers table to show all servers
* update docs
* Fix integration test
* modify test query for batch integration test
* fix case in test queries
* make the server_type lowercase
* Apply suggestions from code review
Co-Authored-By: Himanshu <g.himanshu@gmail.com>
* Fix compilation from git suggestion
* fix unit test
* Bump Checkstyle to 8.20
Moderate severity vulnerability that affects:
com.puppycrawl.tools:checkstyle
Checkstyle prior to 8.18 loads external DTDs by default,
which can potentially lead to denial of service attacks
or the leaking of confidential information.
Affected versions: < 8.18
* Oops, missed one
* Oops, missed a few
* refactor lookups to be more chill to router
* remove accidental change
* fix and combine LookupIntrospectionResourceTest
* fix inspection
* rename RouterLookupModule to LookupSerdeModule and RouterLookupExtractorFactoryContainerProvider to NoopLookupExtractorFactoryContainerProvider
* make comment generic
* use ConfigResourceFilter instead of StateResourceFilter
* fix indentation
* unused import
* another unused import
* refactor some stuff into processing module, split up LookupModule.java classes into their own files
* Make IngestSegmentFirehoseFactory splittable for parallel ingestion
* Code review feedback
- Get rid of WindowedSegment
- Don't document 'segments' parameter or support splitting firehoses that use it
- Require 'intervals' in WindowedSegmentId (since it won't be written by hand)
* Add missing @JsonProperty
* Integration test passes
* Add unit test
* Remove two FIXME comments from CompactionTask
I'd like to leave this PR in a potentially mergeable state, but I still would
appreciate reviewer eyes on the questions I'm removing here.
* Updates from code review
* maxTotalRows should be checked in DataSourceCompactionConfig before setting targetCompactionSizeBytes
* remove unnecessary default values
* remove flacky test
* fix build
* Add comments
* Add integration test for transactional kafka
* Add true for transactions enabled for transactional test
* Add new test to travis_script_integration.sh, use version 0.2 of druid docker image
* Use different datasource name for ITKafkaIndexingServiceTest and ITKafkaIndexingServiceTransactionalTest
* use KafkaConsumerConfigs to get common consumer properties
* Remove double line breaks
* remove extra space
* Consolidate kafka consumer configs
* change the order of adding properties
* Add consumer properties to fix test
it seems kafka consumer does not reveive any message without these configs
* Use KafkaConsumerConfigs in integration test
* Update zookeeper and kafka versions in the setup.sh for the base druid image
* use version 0.2 of base druid image
* Try to fix tests in KafkaRecordSupplierTest
* unused import
* Fix tests in KafkaSupervisorTest
* Throw caught exception.
* Throw caught exceptions.
* Related checkstyle rule is added to prevent further bugs.
* RuntimeException() is used instead of Throwables.propagate().
* Missing import is added.
* Throwables are propogated if possible.
* Throwables are propogated if possible.
* Throwables are propogated if possible.
* Throwables are propogated if possible.
* * Checkstyle definition is improved.
* Throwables.propagate() usages are removed.
* Checkstyle pattern is changed for only scanning "Throwables.propagate(" instead of checking lookbehind.
* Throwable is kept before firing a Runtime Exception.
* Fix unused assignments.
* integration-tests: make ITParallelIndexTest still work in parallel
Follow-up to #7181, which made the default behavior for index_parallel tasks
non-parallel.
* Validate that parallel index subtasks were run
Previously, the test validated that the data source that we ingested from still
had the same query responses that it did before the second ingestion. This is
less useful than validating queries against the newly created data source.
The new queries file differs from the old one in that its maxTime is earlier due
to the interval selected by the reindex, and in that it does not query for the
dropped metric "count".
* index_parallel: support !appendToExisting with no explicit intervals
This enables ParallelIndexSupervisorTask to dynamically request locks at runtime
if it is run without explicit intervals in the granularity spec and with
appendToExisting set to false. Previously, it behaved as if appendToExisting
was set to true, which was undocumented and inconsistent with IndexTask and
Hadoop indexing.
Also, when ParallelIndexSupervisorTask allocates segments in the explicit
interval case, fail if its locks on the interval have been revoked.
Also make a few other additions/clarifications to native ingestion docs.
Fixes#6989.
* Review feedback.
PR description on GitHub updated to match.
* Make native batch ingestion partitions start at 0
* Fix to previous commit
* Unit test. Verified to fail without the other commits on this branch.
* Another round of review
* Slightly scarier warning
* Support kafka transactional topics
* update kafka to version 2.0.0
* Remove the skipOffsetGaps option since it's not used anymore
* Adjust kafka consumer to use transactional semantics
* Update tests
* Remove unused import from test
* Fix compilation
* Invoke transaction api to fix a unit test
* temporary modification of travis.yml for debugging
* another attempt to get travis tasklogs
* update kafka to 2.0.1 at all places
* Remove druid-kafka-eight dependency from integration-tests, remove the kafka firehose test and deprecate kafka-eight classes
* Add deprecated in docs for kafka-eight and kafka-simple extensions
* Remove skipOffsetGaps and code changes for transaction support
* Fix indentation
* remove skipOffsetGaps from kinesis
* Add transaction api to KafkaRecordSupplierTest
* Fix indent
* Fix test
* update kafka version to 2.1.0
* Add checkstyle rules about imports and empty lines between members
* Add suppressions
* Update Eclipse import order
* Add empty line
* Fix StatsDEmitter
* Prohibit some guava collection APIs and use JDK APIs directly
* reset files that changed by accident
* sort codestyle/druid-forbidden-apis.txt alphabetically
* include mysql-metadata-storage extension in distribution, but without the GPL-licensed connector library
* Install mysql connector package
* use symlinks to avoid versioning issues
* add documentation for fetching the mysql connector