With this change, Druid will only support ZooKeeper 3.5.x and later.
In order to support Java 15 we need to switch to ZK 3.5.x client libraries and drop support for ZK 3.4.x
(see #10780 for the detailed reasons)
* remove ZooKeeper 3.4.x compatibility
* exclude additional ZK 3.5.x netty dependencies to ensure we use our version
* keep ZooKeeper version used for integration tests in sync with client library version
* remove the need to specify ZK version at runtime for docker
* add support to run integration tests with JDK 15
* build and run unit tests with Java 15 in travis
* Avoid mapping hydrants in create segments phase for native ingestion
* Drop queriable indices after a given sink is fully merged
* Do not drop memory mappings for realtime ingestion
* Style fixes
* Renamed to match use case better
* Rollback memoization code and use the real time flag instead
* Null ptr fix in FireHydrant toString plus adjustments to memory pressure tracking calculations
* Style
* Log some count stats
* Make sure sinks size is obtained at the right time
* BatchAppenderator unit test
* Fix comment typos
* Renamed methods to make them more readable
* Move persisted metadata from FireHydrant class to AppenderatorImpl. Removed superfluous differences and fix comment typo. Removed custom comparator
* Missing dependency
* Make persisted hydrant metadata map concurrent and better reflect the fact that keys are Java references. Maintain persisted metadata when dropping/closing segments.
* Replaced concurrent variables with normal ones
* Added batchMemoryMappedIndex "fallback" flag with default "false". Set this to "true" make code fallback to previous code path.
* Style fix.
* Added note to new setting in doc, using Iterables.size (and removing a dependency), and fixing a typo in a comment.
* Forgot to commit this edited documentation message
* fix count and average SQL aggregators on constant virtual columns
* style
* even better, why are we tracking virtual columns in aggregations at all if we have a virtual column registry
* oops missed a few
* remove unused
* this will fix it
* SQL timeseries no longer skip empty buckets with all granularity
* add comment, fix tests
* the ol switcheroo
* revert unintended change
* docs and more tests
* style
* make checkstyle happy
* docs fixes and more tests
* add docs, tests for array_agg
* fixes
* oops
* doc stuffs
* fix compile, match doc style
* allow user to set group.id for Kafka ingestion task
* fix test coverage by removing deprecated code and add doc
* fix typo
* Update docs/development/extensions-core/kafka-ingestion.md
Co-authored-by: frank chen <frankchen@apache.org>
Co-authored-by: frank chen <frankchen@apache.org>
* Vectorize the DataSketches quantiles aggregator.
Also removes synchronization for the BufferAggregator and VectorAggregator
implementations, since it is not necessary (similar to #11115).
Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest
to run all test cases in vectorized mode.
* Style fix.
* upgrade to Apache Kafka 2.8.0 (release notes:
https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html)
* pass Kafka version as a Docker argument in integration tests
to keep in sync with maven version
* fix use of internal Kafka APIs in integration tests
* Vectorized versions of HllSketch aggregators.
The patch uses the same "helper" approach as #10767 and #10304, and
extends the tests to run in both vectorized and non-vectorized modes.
Also includes some minor changes to the theta sketch vector aggregator:
- Cosmetic changes to make the hll and theta implementations look
more similar.
- Extends the theta SQL tests to run in vectorized mode.
* Updates post-code-review.
* Fix javadoc.
* Enable rewriting certain inner joins as filters.
The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:
- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
duplicate values. (If they did, the inner join would not be guaranteed
to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
the join condition itself.
HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.
In support of this goal:
- Add Query getRequiredColumns() method to help us figure out whether
the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
figure out if the right-hand side of a join is being used by later
join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
JoinFilterRewriteConfig. The default is disabled.
* Test improvements.
* Test fixes.
* Avoid slow size() call.
* Remove invalid test.
* Fix style.
* Fix mistaken default.
* Small fixes.
* Fix logic error.
* add protobuf inputformat
* repair pom
* alter intermediateRow to type of Dynamicmessage
* add document
* refine test
* fix document
* add protoBytesDecoder
* refine document and add ser test
* add hash
* add schema registry ser test
Co-authored-by: yuanyi <yuanyi@freewheel.tv>
* Add ability to wait for segment availability for batch jobs
* IT updates
* fix queries in legacy hadoop IT
* Fix broken indexing integration tests
* address an lgtm flag
* spell checker still flagging for hadoop doc. adding under that file header too
* fix compaction IT
* Updates to wait for availability method
* improve unit testing for patch
* fix bad indentation
* refactor waitForSegmentAvailability
* Fixes based off of review comments
* cleanup to get compile after merging with master
* fix failing test after previous logic update
* add back code that must have gotten deleted during conflict resolution
* update some logging code
* fixes to get compilation working after merge with master
* reset interrupt flag in catch block after code review pointed it out
* small changes following self-review
* fixup some issues brought on by merge with master
* small changes after review
* cleanup a little bit after merge with master
* Fix potential resource leak in AbstractBatchIndexTask
* syntax fix
* Add a Compcation TuningConfig type
* add docs stipulating the lack of support by Compaction tasks for the new config
* Fixup compilation errors after merge with master
* Remove erreneous newline
* DruidInputSource: Fix issues in column projection, timestamp handling.
DruidInputSource, DruidSegmentReader changes:
1) Remove "dimensions" and "metrics". They are not necessary, because we
can compute which columns we need to read based on what is going to
be used by the timestamp, transform, dimensions, and metrics.
2) Start using ColumnsFilter (see below) to decide which columns we need
to read.
3) Actually respect the "timestampSpec". Previously, it was ignored, and
the timestamp of the returned InputRows was set to the `__time` column
of the input datasource.
(1) and (2) together fix a bug in which the DruidInputSource would not
properly read columns that are used as inputs to a transformSpec.
(3) fixes a bug where the timestampSpec would be ignored if you attempted
to set the column to something other than `__time`.
(1) and (3) are breaking changes.
Web console changes:
1) Remove "Dimensions" and "Metrics" from the Druid input source.
2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for
compatibility with the new behavior.
Other changes:
1) Add ColumnsFilter, a new class that allows input readers to determine
which columns they need to read. Currently, it's only used by the
DruidInputSource, but it could be used by other columnar input sources
in the future.
2) Add a ColumnsFilter to InputRowSchema.
3) Remove the metric names from InputRowSchema (they were unused).
4) Add InputRowSchemas.fromDataSchema method that computes the proper
ColumnsFilter for given timestamp, dimensions, transform, and metrics.
5) Add "getRequiredColumns" method to TransformSpec to support the above.
* Various fixups.
* Uncomment incorrectly commented lines.
* Move TransformSpecTest to the proper module.
* Add druid.indexer.task.ignoreTimestampSpecForDruidInputSource setting.
* Fix.
* Fix build.
* Checkstyle.
* Misc fixes.
* Fix test.
* Move config.
* Fix imports.
* Fixup.
* Fix ShuffleResourceTest.
* Add import.
* Smarter exclusions.
* Fixes based on tests.
Also, add TIME_COLUMN constant in the web console.
* Adjustments for tests.
* Reorder test data.
* Update docs.
* Update docs to say Druid 0.22.0 instead of 0.21.0.
* Fix test.
* Fix ITAutoCompactionTest.
* Changes from review & from merging.
* Allow only HTTP and HTTPS protocols for the HTTP inputSource
* rename
* Update core/src/main/java/org/apache/druid/data/input/impl/HttpInputSource.java
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* fix http firehose and update doc
* HDFS inputSource
* add configs for allowed protocols
* fix checkstyle and doc
* more checkstyle
* remove stale doc
* remove more doc
* Apply doc suggestions from code review
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* update hdfs address in docs
* fix test
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* druid task auto scale based on kafka lag
* fix kafkaSupervisorIOConfig and KinesisSupervisorIOConfig
* druid task auto scale based on kafka lag
* fix kafkaSupervisorIOConfig and KinesisSupervisorIOConfig
* test dynamic auto scale done
* auto scale tasks tested on prd cluster
* auto scale tasks tested on prd cluster
* modify code style to solve 29055.10 29055.9 29055.17 29055.18 29055.19 29055.20
* rename test fiel function
* change codes and add docs based on capistrant reviewed
* midify test docs
* modify docs
* modify docs
* modify docs
* merge from master
* Extract the autoScale logic out of SeekableStreamSupervisor to minimize putting more stuff inside there && Make autoscaling algorithm configurable and scalable.
* fix ci failed
* revert msic.xml
* add uts to test autoscaler create && scale out/in and kafka ingest with scale enable
* add more uts
* fix inner class check
* add IT for kafka ingestion with autoscaler
* add new IT in groups=kafka-index named testKafkaIndexDataWithWithAutoscaler
* review change
* code review
* remove unused imports
* fix NLP
* fix docs and UTs
* revert misc.xml
* use jackson to build autoScaleConfig with default values
* add uts
* use jackson to init AutoScalerConfig in IOConfig instead of Map<>
* autoscalerConfig interface and provide a defaultAutoScalerConfig
* modify uts
* modify docs
* fix checkstyle
* revert misc.xml
* modify uts
* reviewed code change
* reviewed code change
* code reviewed
* code review
* log changed
* do StringUtils.encodeForFormat when create allocationExec
* code review && limit taskCountMax to partitionNumbers
* modify docs
* code review
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
* Add config and header support for confluent schema registry. (porting code from https://github.com/apache/druid/pull/9096)
* Add Eclipse Public License 2.0 to license check
* Update licenses.yaml, revert changes to check-licenses.py and dependencies for integration-tests
* Add spelling exception and remove unused dependency
* Use non-deprecated getSchemaById() and remove duplicated license entry
* Update docs/ingestion/data-formats.md
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
* Added check for schema being null, as per Confluent code
* Missing imports and whitespace
* Updated unit tests with AvroSchema
Co-authored-by: Sergio Spinatelli <sergio.spinatelli.extern@7-tv.de>
Co-authored-by: Sergio Spinatelli <sergio.spinatelli.extern@joyn.de>
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
* use the latest Apache DataSketches release 2.0.0
* updated datasketches version
Co-authored-by: AlexanderSaydakov <AlexanderSaydakov@users.noreply.github.com>
* before i leaped i should've seen, the view from halfway down
* fixes
* fixes, more test
* rename
* fix style
* further refactoring
* review stuffs
* rename
* more javadoc and comments
* add offsetFetchPeriod to kinesis ingestion doc
* Remove jackson dependencies from extensions
* Use fixed delay for lag collection
* Metrics reset after finishing processing
* comments
* Broaden the list of exceptions to retry for
* Unit tests
* Add more tests
* Refactoring
* re-order metrics
* Doc suggestions
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* Add tests
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* Vectorized theta sketch aggregator.
Also a refactoring of BufferAggregator and VectorAggregator such that
they share a common interface, BaseBufferAggregator. This allows
implementing both in the same file with an abstract + dual subclass
structure.
* Rework implementation to use composition instead of inheritance.
* Rework things to enable working properly for both complex types and
regular types.
Involved finally moving makeVectorProcessor from DimensionHandlerUtils
into ColumnProcessors and harmonizing the two things.
* Add missing method.
* Style and name changes.
* Fix issues from inspections.
* Fix style issue.
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* fix checkstyle
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* fix test
* fix test
* add log
* Fix byte calculation for maxBytesInMemory to take into account of Sink/Hydrant Object overhead
* address comments
* fix checkstyle
* fix checkstyle
* add config to skip overhead memory calculation
* add test for the skipBytesInMemoryOverheadCheck config
* add docs
* fix checkstyle
* fix checkstyle
* fix spelling
* address comments
* fix travis
* address comments
Today Kafka message support in streaming indexing tasks is limited to
message values, and does not provide a way to expose Kafka headers,
timestamps, or keys, which may be of interest to more specialized
Druid input formats. For instance, Kafka headers may be used to indicate
payload format/encoding or additional metadata, and timestamps are often
omitted from values in Kafka streams applications, since they are
included in the record.
This change proposes to introduce KafkaRecordEntity as InputEntity,
which would give input formats full access to the underlying Kafka record,
including headers, key, timestamps. It would also open access to low-level
information such as topic, partition, offset if needed.
KafkaEntity is a subclass of ByteEntity for backwards compatibility with
existing input formats, and to avoid introducing unnecessary complexity
for Kinesis indexing tasks.
* integration test for coordinator and overlord leadership, added sys.servers is_leader column
* docs
* remove not needed
* fix comments
* fix compile heh
* oof
* revert unintended
* fix tests, split out docker-compose file selection from starting cluster, use docker-compose down to stop cluster
* fixes
* style
* dang
* heh
* scripts are hard
* fix spelling
* fix thing that must not matter since was already wrong ip, log when test fails
* needs more heap
* fix merge
* less aggro
* Two fixes related to encoding of % symbols.
1) TaskResourceFilter: Don't double-decode task ids. request.getPathSegments()
returns already-decoded strings. Applying StringUtils.urlDecode on
top of that causes erroneous behavior with '%' characters.
2) Update various ThreadFactoryBuilder name formats to escape '%'
characters. This fixes situations where substrings starting with '%'
are erroneously treated as format specifiers.
ITs are updated to include a '%' in extra.datasource.name.suffix.
* Avoid String.replace.
* Work around surefire bug.
* Fix xml encoding.
* Another try at the proper encoding.
* Give up on the emojis.
* Less ambitious testing.
* Fix an additional problem.
* Adjust encodeForFormat to return null if the input is null.
* support multi-line text
* add test cases
* split json text into lines case by case
* improve exception handle
* fix CI
* use IntermediateRowParsingReader as base of JsonReader
* update doc
* ignore the non-immutable field in test case
* add more test cases
* mark `lineSplittable` as final
* fix testcases
* fix doc
* add a test case for SqlReader
* return all raw columns when exception occurs
* fix CI
* fix test cases
* resolve review comments
* handle ParseException returned by index.add
* apply Iterables.getOnlyElement
* fix CI
* fix test cases
* improve code in more graceful way
* fix test cases
* fix test cases
* add a test case to check multiple json string in one text block
* fix inspection check
* support for vectorizing expressions with non-existent inputs, more consistent type handling for non-vectorized expressions
* inspector
* changes
* more test
* clean
* Introduce a Configurable Index Type
* Change to @UnstableApi
* Add AppendableIndexSpecTest
* Update doc
* Add spelling exception
* Add tests coverage
* Revert some of the changes to reduce diff
* Minor fixes
* Update getMaxBytesInMemoryOrDefault() comment
* Fix typo, remove redundant interface
* Remove off-heap spec (postponed to a later PR)
* Add javadocs to AppendableIndexSpec
* Describe testCreateTask()
* Add tests for AppendableIndexSpec within TuningConfig
* Modify hashCode() to conform with equals()
* Add comment where building incremental-index
* Add "EqualsVerifier" tests
* Revert some of the API back to AppenderatorConfig
* Don't use multi-line comments
* Remove knob documentation (deferred)
* Fix Avro OCF detection prefix and run formation detection on raw input
* Support Avro Fixed and Enum types correctly
* Check Avro version byte in format detection
* Add test for AvroOCFReader.sample
Ensures that the Sampler doesn't receive raw input that it can't
serialize into JSON.
* Document Avro type handling
* Add TS unit tests for guessInputFormat
* push down ValueType to ExprType conversion, tidy up
* determine expr output type for given input types
* revert unintended name change
* add nullable
* tidy up
* fixup
* more better
* fix signatures
* naming things is hard
* fix inspection
* javadoc
* make default implementation of Expr.getOutputType that returns null
* rename method
* more test
* add output for contains expr macro, split operation and function auto conversion
* Fix VARIANCE aggregator comparator
The comparator for the variance aggregator used to compare values using the
count. This is now fixed to compare values using the variance. If the variance
is equal, the count and sum are used as tie breakers.
* fix tests + sql compatible mode
* code review
* more tests
* fix last test
* Move tools for indexing to TaskToolbox instead of injecting them in constructor
* oops, other changes
* fix test
* unnecessary new file
* fix test
* fix build
* better type tracking: add typed postaggs, finalized types for agg factories
* more javadoc
* adjustments
* transition to getTypeName to be used exclusively for complex types
* remove unused fn
* adjust
* more better
* rename getTypeName to getComplexTypeName
* setup expression post agg for type inference existing
* more javadocs
* fixup
* oops
* more test
* more test
* more comments/javadoc
* nulls
* explicitly handle only numeric and complex aggregators for incremental index
* checkstyle
* more tests
* adjust
* more tests to showcase difference in behavior
* timeseries longsum array
* Add validation for authorizer name
* fix deps
* add javadocs
* Do not use resource filters
* Fix BasicAuthenticatorResource as well
* Add integration tests
* fix test
* fix
* Do not echo back username on auth failure
* use bad username
* Remove username from exception messages
* fix tests
* fix the tests
* hopefully this time
* this time the tests work
* fixed this time
* fix
* upgrade to Jetty 9.4.30
* Unknown users echo back Unauthorized
* fix
* new average aggregator
* method to create count aggregator factory
* test everything
* update other usages
* fix style
* fix more tests
* fix datasketches tests
* Ensure that join filter pre-analysis operates on optimized filters, add DimFilter.toOptimizedFilter
* Remove aggressive equality check that was used for testing
* Use Suppliers.memoize
* Checkstyle
* QueryCountStatsMonitor can be injected in the Peon
This change fixes a dependency injection bug where there is a circular
dependency on getting the MonitorScheduler when a user configures the
QueryCountStatsMonitor to be used.
* fix tests
* Actually fix the tests this time
* Filter http requests by http method
Add a config that allows a user which http methods to allow against their
Druid server.
Druid will only accept http requests with the method: GET, PUT, POST, DELETE
and OPTIONS.
If a Druid admin wants to allow other methods, they can do so by using the
ServerConfig#allowedHttpMethods config.
If a Druid user would like to disallow OPTIONS, this can be done by changing
the AuthConfig#allowUnauthenticatedHttpOptions config
* Exclude OPTIONS from always supported HTTP methods
Add HEAD as an allowed method for web console e2e tests
* fix docs
* fix security IT
* Actually fix the web console e2e tests
* Ignore icode coverage for nitialization classes
* code review
* optimize for protobuf parsing
* fix import error and maven dependency
* add unit test in protobufInputrowParserTest for flatten data
* solve code duplication (remove the log and main())
* rename 'flatten' to 'flat' to make it clearer
Co-authored-by: xionghuilin <xionghuilin@bytedance.com>
* retry 500 and 503 errors against kinesis
* add test that exercises retry logic
* more branch coverage
* retry 500 and 503 on getRecords request when fetching sequence numberu
Co-authored-by: Harshpreet Singh <hrshpr@twitch.tv>
* IntelliJ inspection and checkstyle rule for "Collection.EMPTY_* field accesses replaceable with Collections.empty*()"
* Reverted checkstyle rule
* Added tests to pass CI
* Codestyle
* Fix join
* Fix Subquery could not be converted to groupBy query
* Fix Subquery could not be converted to groupBy query
* Fix Subquery could not be converted to groupBy query
* Fix Subquery could not be converted to groupBy query
* Fix Subquery could not be converted to groupBy query
* Fix Subquery could not be converted to groupBy query
* Fix Subquery could not be converted to groupBy query
* Fix Subquery could not be converted to groupBy query
* add tests
* address comments
* fix failing tests
* Add REGEXP_LIKE, fix empty-pattern bug in REGEXP_EXTRACT.
- Add REGEXP_LIKE function that returns a boolean, and is useful in
WHERE clauses.
- Fix REGEXP_EXTRACT return type (should be nullable; causes incorrect
filter elision).
- Fix REGEXP_EXTRACT behavior for empty patterns: should always match
(previously, they threw errors).
- Improve error behavior when REGEXP_EXTRACT and REGEXP_LIKE are passed
non-literal patterns.
- Improve documentation of REGEXP_EXTRACT.
* Changes based on PR review.
* Fix arg check.
* Important fixes!
* Add speller.
* wip
* Additional tests.
* Fix up tests.
* Add validation error tests.
* Additional tests.
* Remove useless call.
This change removes ListenableFutures.transformAsync in favor of the
existing Guava Futures.transform implementation. Our own implementation
had a bug which did not fail the future if the applied function threw an
exception, resulting in the future never completing.
An attempt was made to fix this bug, however when running againts Guava's own
tests, our version failed another half dozen tests, so it was decided to not
continue down that path and scrap our own implementation.
Explanation for how was this bug manifested itself:
An exception thrown in BaseAppenderatorDriver.publishInBackground when
invoked via transformAsync in StreamAppenderatorDriver.publish will
cause the resulting future to never complete.
This explains why when encountering https://github.com/apache/druid/issues/9845
the task will never complete, forever waiting for the publishFuture to
register the handoff. As a result, the corresponding "Error while
publishing segments ..." message only gets logged once the index task
times out and is forcefully shutdown when the future is force-cancelled
by the executor.
* refactor SeekableStreamSupervisor usage of RecordSupplier to reduce contention between background threads and main thread, refactor KinesisRecordSupplier, refactor Kinesis lag metric collection and emitting
* fix style and test
* cleanup, refactor, javadocs, test
* fixes
* keep collecting current offsets and lag if unhealthy in background reporting thread
* review stuffs
* add comment
* Add AvroOCFInputFormat
* Support supplying a reader schema in AvroOCFInputFormat
* Add docs for Avro OCF input format
* Address review comments
* Address second round of review
* add flag to flattenSpec to keep null columns
* remove changes to inputFormat interface
* add comment
* change comment message
* update web console e2e test
* move keepNullColmns to JSONParseSpec
* fix merge conflicts
* fix tests
* set keepNullColumns to false by default
* fix lgtm
* change Boolean to boolean, add keepNullColumns to hash, add tests for keepKeepNullColumns false + true with no nuulul columns
* Add equals verifier tests
* postagg test coverage for druid-stats, druid-momentsketch, druid-tdigestsketch and fixes
* style fixes
* fix comparator for TDigestQuantilePostAggregator
* Adding support for autoscaling in GCE
* adding extra google deps also in gce pom
* fix link in doc
* remove unused deps
* adding terms to spelling file
* version in pom 0.17.0-incubating-SNAPSHOT --> 0.18.0-SNAPSHOT
* GCEXyz -> GceXyz in naming for consistency
* add preconditions
* add VisibleForTesting annotation
* typos in comments
* use StringUtils.format instead of String.format
* use custom exception instead of exit
* factorize interval time between retries
* making literal value a constant
* iter all network interfaces
* use provided on google (non api) deps
* adding missing dep
* removing unneded this and use Objects methods instead o 3-way if in hash and comparison
* adding import
* adding retries around getRunningInstances and adding limit for operation end waiting
* refactor GceEnvironmentConfig.hashCode
* 0.18.0-SNAPSHOT -> 0.19.0-SNAPSHOT
* removing unused config
* adding tests to hash and equals
* adding nullable to waitForOperationEnd
* adding testTerminate
* adding unit tests for createComputeService
* increasing retries in unrelated integration-test to prevent sporadic failure (hopefully)
* reverting queryResponseTemplate change
* adding comment for Compute.Builder.build() returning null
ApproximateHistogram - seems unlikely
SegmentAnalyzer - unclear if this is an actual issue
GenericIndexedWriter - unclear if this is an actual issue
IncrementalIndexRow and OnheapIncrementalIndex are non-issues becaus it's very
unlikely for the number of dims to be large enough to hit the overflow
condition
* IntelliJ inspections cleanup
* Standard Charset object can be used
* Redundant Collection.addAll() call
* String literal concatenation missing whitespace
* Statement with empty body
* Redundant Collection operation
* StringBuilder can be replaced with String
* Type parameter hides visible type
* fix warnings in test code
* more test fixes
* remove string concatenation inspection error
* fix extra curly brace
* cleanup AzureTestUtils
* fix charsets for RangerAdminClient
* review comments
* Allow Cloud SegmentKillers to be instantiated without segment bucket or path
This change fixes a bug that was introduced that causes ingestion
to fail if data is ingested from one of the supported cloud storages
(Azure, Google, S3), and the user is using another type of storage
for deep storage. In this case the all segment killer implementations
are instantiated. A change recently made forced a dependency between
the supported cloud storage type SegmentKiller classes and the
deep storage configuration for that storage type being set, which
forced the deep storage bucket and prefix to be non-null. This caused
a NullPointerException to be thrown when instantiating the
SegmentKiller classes during ingestion.
To fix this issue, the respective deep storage segment configs for the
cloud storage types supported in druid are now allowed to have nullable
bucket and prefix configurations
* * Allow google deep storage bucket to be null
Fixes an issue where splitting an HDFS input source for use in native
parallel batch ingestion would cause the subtasks to get a split with an
invalid HDFS path.
* fix nullhandling exceptions related to test ordering
Tests might get executed in different order depending on the maven
version and the test environment. This may lead to "NullHandling module
not initialized" errors for some tests where we do not initialize
null-handling explicitly.
* use InitializedNullHandlingTest
* druid pac4j security extension for OpenID Connect OAuth 2.0 authentication
* update version in druid-pac4j pom
* introducing unauthorized resource filter
* authenticated but authorized /unified-webconsole.html
* use httpReq.getRequestURI() for matching callback path
* add documentation
* minor doc addition
* licesne file updates
* make dependency analyze succeed
* fix doc build
* hopefully fixes doc build
* hopefully fixes license check build
* yet another try on fixing license build
* revert unintentional changes to website folder
* update version to 0.18.0-SNAPSHOT
* check session and its expiry on each request
* add crypto service
* code for encrypting the cookie
* update doc with cookiePassphrase
* update license yaml
* make sessionstore in Pac4jFilter private non static
* make Pac4jFilter fields final
* okta: use sha256 for hmac
* remove incubating
* add UTs for crypto util and session store impl
* use standard charsets
* add license header
* remove unused file
* add org.objenesis.objenesis to license.yaml
* a bit of nit changes in CryptoService and embedding EncryptionResult for clarity
* rename alg to cipherAlgName
* take cipher alg name, mode and padding as input
* add java doc for CryptoService and make it more understandable
* another UT for CryptoService
* cache pac4j Config
* use generics clearly in Pac4jSessionStore
* update cookiePassphrase doc to mention PasswordProvider
* mark stuff Nullable where appropriate in Pac4jSessionStore
* update doc to mention jdbc
* add error log on reaching callback resource
* javadoc for Pac4jCallbackResource
* introduce NOOP_HTTP_ACTION_ADAPTER
* add correct module name in license file
* correct extensions folder name in licenses.yaml
* replace druid-kubernetes-extensions to druid-pac4j
* cache SecureRandom instance
* rename UnauthorizedResourceFilter to AuthenticationOnlyResourceFilter
* Azure deep storage does not work with datasource name containing non-ASCII chars
Fixed a bug where recording the segment file location fails when
using Azure Deep Storage, if the datasource has any special
characters
* * update jacoco thresholds
* * resolve merge conflicts
* address review comments
* Ability to Delete task logs and segments from Google Storage
* implement ability to delete all tasks logs or all task logs
written before a particular date when written to Google storage
* implement ability to delete all segments from Google deep storage
* * Address review comments
* Ability to Delete task logs and segments from Azure Storage
* implement ability to delete all tasks logs or all task logs
written before a particular date when written to Azure storage
* implement ability to delete all segments from Azure deep storage
* * Address review comments
* Broker: Add ability to inline subqueries.
The main changes:
- ClientQuerySegmentWalker: Add ability to inline queries.
- Query: Add "getSubQueryId" and "withSubQueryId" methods.
- QueryMetrics: Add "subQueryId" dimension.
- ServerConfig: Add new "maxSubqueryRows" parameter, which is used by
ClientQuerySegmentWalker to limit how many rows can be inlined per
query.
- IndexedTableJoinMatcher: Allow creating keys on top of unknown types,
by assuming they are strings. This is useful because not all types are
known for fields in query results.
- InlineDataSource: Store RowSignature rather than component parts. Add
more zealous "equals" and "hashCode" methods to ease testing.
- Moved QuerySegmentWalker test code from CalciteTests and
SpecificSegmentsQueryWalker in druid-sql to QueryStackTests in
druid-server. Use this to spin up a new ClientQuerySegmentWalkerTest.
* Adjustments from CI.
* Fix integration test.
* add kinesis lag metric
* fixes
* heh
* do it right this time
* more test
* split out supervisor report lags into lagMillis, remove latest offsets from kinesis supervisor report since always null, review stuffs
* Move RowSignature from druid-sql to druid-processing and make use of it.
1) Moved (most of) RowSignature from sql to processing. Left behind the SQL-specific
stuff in a RowSignatures utility class. It also picked up some new convenience
methods along the way.
2) There were a lot of places in the code where Map<String, ValueType> was used to
associate columns with type info. These are now all replaced with RowSignature.
3) QueryToolChest's resultArrayFields method is replaced with resultArraySignature,
and it now provides type info.
* Fix up extensions.
* Various fixes
* Ability to Delete task logs and segments from S3
* implement ability to delete all tasks logs or all task logs
written before a particular date when written to S3
* implement ability to delete all segments from S3 deep storage
* upgrade version of aws SDK in use
* * update licenses for updated AWS SDK version
* * fix bug in iterating through results from S3
* revert back to original version of AWS SDK
* * Address review comments
* * Fix failing dependency check
* Harmonization and bug-fixing for selector and filter behavior on unknown types.
- Migrate ValueMatcherColumnSelectorStrategy to newer ColumnProcessorFactory
system, and set defaultType COMPLEX so unknown types can be dynamically matched.
- Remove ValueGetters in favor of ColumnComparisonFilter doing its own thing.
- Switch various methods to use convertObjectToX when casting to numbers, rather
than ad-hoc and inconsistent logic.
- Fix bug in RowBasedExpressionColumnValueSelector: isBindingArray should return
true even for 0- or 1- element arrays.
- Adjust various javadocs.
* Add throwParseExceptions option to Rows.objectToNumber, switch back to that.
* Update tests.
* Adjust moment sketch tests.
* Skip empty files for local, hdfs, and cloud input sources
* split hint spec doc
* doc for skipping empty files
* fix typo; adjust tests
* unnecessary fluent iterable
* address comments
* fix test
* use the right lists
* fix test
* fix test
* Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
* Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
* Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
* fix build failure
* fix failing build
* fix failing build
* Code cleanup
* fix failing test
* Removed CloudConfigProperties and make specific class for each cloudInputSource
* Removed CloudConfigProperties and make specific class for each cloudInputSource
* pass s3ConfigProperties for split
* lazy init s3client
* update docs
* fix docs check
* address comments
* add ServerSideEncryptingAmazonS3.Builder
* fix failing checkstyle
* fix typo
* wrap the ServerSideEncryptingAmazonS3.Builder in a provider
* added java docs for S3InputSource constructor
* added java docs for S3InputSource constructor
* remove wrap the ServerSideEncryptingAmazonS3.Builder in a provider
* Move Azure extension into Core
Moving the azure extension into Core.
* * Fix build failure
* * Add The MIT License (MIT) to list of compatible licenses
* * Address review comments
* * change reference to contrib azure to core azure
* * Fix spelling mistakes.
* Add common optional dependencies for extensions
Include hadoop-aws and postgres JDBC connector jar to improve
out-of-the-box experience for extensions. The mysql JDBC connector jar
is not bundled as it is GPL.
* Update docs
* Fix typo
* Create splits of multiple files for parallel indexing
* fix wrong import and npe in test
* use the single file split in tests
* rename
* import order
* Remove specific local input source
* Update docs/ingestion/native-batch.md
Co-Authored-By: sthetland <steve.hetland@imply.io>
* Update docs/ingestion/native-batch.md
Co-Authored-By: sthetland <steve.hetland@imply.io>
* doc and error msg
* fix build
* fix a test and address comments
Co-authored-by: sthetland <steve.hetland@imply.io>
* add Expr.stringify which produces parseable expression strings, parser support for null values in arrays, and parser support for empty numeric arrays
* oops, macros are expressions too
* style
* spotbugs
* qualified type arrays
* review stuffs
* simplify grammar
* more permissive array parsing
* reuse expr joiner
* fix it
* Add Azure config options for segment prefix and max listing length
Added configuration options to allow the user to specify the prefix
within the segment container to store the segment files. Also
added a configuration option to allow the user to specify the
maximum number of input files to stream for each iteration.
* * Fix test failures
* * Address review comments
* * add dependency explicitly to pom
* * update docs
* * Address review comments
* * Address review comments
* Run IntelliJ inspections on Travis
Running IntelliJ inspections currently takes about 90 minutes, but they
can be run in about 30 minutes on Travis.
* Restore assert statements
* Use ExecutorService instead of ScheduledExecutorService where necessary - #9286
* Added inspection rule to prohibit ScheduledExecutorService assignment to ExecutorService
* IMPLY-1946: Improve code quality and unit test coverage of the Azure extension
* Update unit tests to increase test coverage for the extension
* Clean up any messy code
* Enfore code coverage as part of tests.
* * Update azure extension pom to remove unnecessary things
* update jacoco thresholds
* * updgrade version of azure-storage library version uses to
most upto-date version
* implement Azure InputSource reader and deprecate Azure FireHose
* implement azure InputSource reader
* deprecate Azure FireHose implementation
* * exclude common libraries that are included from druid core
* Implement more of Azure input source.
* * Add tests
* * Add more tests
* * deprecate azure firehose
* * added more tests
* * rollback fix for google cloud batch ingestion bug. Will be
fixed in another PR.
* * Added javadocs for all azure related classes
* Addressed review comments
* * Remove dependency on org.apache.commons:commons-collections4
* Fix LGTM warnings
* Add com.google.inject.extensions:guice-assistedinject to licenses
* * rename classes as suggested in review comments
* * Address review comments
* * Address review comments
* * Address review comments
* Codestyle - use java style array declaration
Replaced C-style array declarations with java style declarations and marked
the intelliJ inspection as an error
* cleanup test code
* Forbid easily misused HashSet and HashMap constructors
* Add two LinkedHashMap constructors to forbidden-apis and create utility method as replacement for them
* Fix visibility of constant in CollectionUtils.java
* Make an exception for an instance of LinkedHashMap#<init>(int) because proper sizing is used
* revert changes to sql module tests that should be in separate PR
* Finish reverting changes to sql module tests that were flagged in checkstyle during CI
* Add netty dependency resulting from SupressForbidden
* Add MemoryOpenHashTable, a table similar to ByteBufferHashTable.
With some key differences to improve speed and design simplicity:
1) Uses Memory rather than ByteBuffer for its backing storage.
2) Uses faster hashing and comparison routines (see HashTableUtils).
3) Capacity is always a power of two, allowing simpler design and more
efficient implementation of findBucket.
4) Does not implement growability; instead, leaves that to its callers.
The idea is this removes the need for subclasses, while still giving
callers flexibility in how to handle table-full scenarios.
* Fix LGTM warnings.
* Adjust dependencies.
* Remove easymock from druid-benchmarks.
* Adjustments from review.
* Fix datasketches unit tests.
* Fix checkstyle.
By default native batch ingestion was only getting a batch of 10
files at a time when used with google cloud. The Default for other
cloud providers is 1024, and should be similar for google cloud.
The low batch size was caused by mistype. This change updates the
batch size to 1024 when using google cloud.
* Guicify druid sql module
Break up the SQLModule in to smaller modules and provide a binding that
modules can use to register schemas with druid sql.
* fix some tests
* address code review
* tests compile
* Working tests
* Add all the tests
* fix up licenses and dependencies
* add calcite dependency to druid-benchmarks
* tests pass
* rename the schemas
* SQL join support for lookups.
1) Add LookupSchema to SQL, so lookups show up in the catalog.
2) Add join-related rels and rules to SQL, allowing joins to be planned into
native Druid queries.
* Add two missing LookupSchema calls in tests.
* Fix tests.
* Fix typo.
This is important because if a user has the hdfs extension loaded, but is not
using hdfs deep storage, then they will not have storageDirectory set and will
get the following error:
IllegalArgumentException: Can not create a Path from an empty string
at io.druid.storage.hdfs.HdfsDataSegmentKiller.<init>(HdfsDataSegmentKiller.java:47)
This scenario is realistic: it comes up when someone has the hdfs extension
loaded because they want to use HdfsInputSource, but don't want to use hdfs for
deep storage.
Fixes#4694.
* Add LookupJoinableFactory.
Enables joins where the right-hand side is a lookup. Includes an
integration test.
Also, includes changes to LookupExtractorFactoryContainerProvider:
1) Add "getAllLookupNames", which will be needed to eventually connect
lookups to Druid's SQL catalog.
2) Convert "get" from nullable to Optional return.
3) Swap out most usages of LookupReferencesManager in favor of the
simpler LookupExtractorFactoryContainerProvider interface.
* Fixes for tests.
* Fix another test.
* Java 11 message fix.
* Fixups.
* Fixup benchmark class.
* intelliJ inspections cleanup
- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static
Most of these changes are aesthetic, however, they will allow inspections to
be enabled as part of CI checks going forward
The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
indexing-hadoop/.../Utils.java
processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
processing/src/.../ScanQueryLimitRowIteratorTest.java
* Add intelliJ inspection warnings as errors to druid profile
* one more static inner class
* Reconcile terminology and method naming to 'used/unused segments'; Don't use terms 'enable/disable data source'; Rename MetadataSegmentManager to MetadataSegments; Make REST API methods which mark segments as used/unused to return server error instead of an empty response in case of error
* Fix brace
* Import order
* Rename withKillDataSourceWhitelist to withSpecificDataSourcesToKill
* Fix tests
* Fix tests by adding proper methods without interval parameters to IndexerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY
* More aligned names of DruidCoordinatorHelpers, rename several CoordinatorDynamicConfig parameters
* Rename ClientCompactTaskQuery to ClientCompactionTaskQuery for consistency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig
* More variable and method renames
* Rename MetadataSegments to SegmentsMetadata
* Javadoc update
* Simplify SegmentsMetadata.getUnusedSegmentIntervals(), more javadocs
* Update Javadoc of VersionedIntervalTimeline.iterateAllObjects()
* Reorder imports
* Rename SegmentsMetadata.tryMark... methods to mark... and make them to return boolean and the numbers of segments changed and relay exceptions to callers
* Complete merge
* Add CollectionUtils.newTreeSet(); Refactor DruidCoordinatorRuntimeParams creation in tests
* Remove MetadataSegmentManager
* Rename millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments to leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments
* Fix tests, refactor DruidCluster creation in tests into DruidClusterBuilder
* Fix inspections
* Fix SQLMetadataSegmentManagerEmptyTest and rename it to SqlSegmentsMetadataEmptyTest
* Rename SegmentsAndMetadata to SegmentsAndCommitMetadata to reduce the similarity with SegmentsMetadata; Rename some methods
* Rename DruidCoordinatorHelper to CoordinatorDuty, refactor DruidCoordinator
* Unused import
* Optimize imports
* Rename IndexerSQLMetadataStorageCoordinator.getDataSourceMetadata() to retrieveDataSourceMetadata()
* Unused import
* Update terminology in datasource-view.tsx
* Fix label in datasource-view.spec.tsx.snap
* Fix lint errors in datasource-view.tsx
* Doc improvements
* Another attempt to please TSLint
* Another attempt to please TSLint
* Style fixes
* Fix IndexerSQLMetadataStorageCoordinator.createUsedSegmentsSqlQueryForIntervals() (wrong merge)
* Try to fix docs build issue
* Javadoc and spelling fixes
* Rename SegmentsMetadata to SegmentsMetadataManager, address other comments
* Address more comments
* Add JoinableFactory interface and use it in the query stack.
Also includes InlineJoinableFactory, which enables joining against
inline datasources. This is the first patch where a basic join query
actually works. It includes integration tests.
* Fix test issues.
* Adjustments from code review.
* Add HashJoinSegment, a virtual segment for joins.
An initial step towards #8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.
* Fixups.
* Fix missing format argument.
* Various tests and minor improvements.
* Changes.
* Remove or add tests for unused stuff.
* Fix up package locations.
Previously jackson-mapper-asl was excluded to remove a security
vulnerability; however, it is required for functionality (e.g.,
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator).
* Add avro dependency to parquet extension
If the parquet extension is loaded and an ingestionSpec uses the older format
specifying a 'parser' instead of using an 'inputFormat' the job fails
with the following error
java.lang.TypeNotPresentException: Type org.apache.avro.generic.GenericRecord not present
This change removes the exclusion of the avro package so that the missing
class can be found.
* Address review comments and add dependency version
* S3: Improvements to prefix listing (including fix for an infinite loop)
1) Fixes#9097, an infinite loop that occurs when more than one batch
of objects is retrieved during a prefix listing.
2) Removes the Access Denied fallback code added in #4444. I don't think
the behavior is reasonable: its purpose is to fall back from a prefix
listing to a single-object access, but it's only activated when the
end user supplied a prefix, so it would be better to simply fail, so
the end user knows that their request for a prefix-based load is not
going to work. Presumably the end user can switch from supplying
'prefixes' to supplying 'uris' if desired.
3) Filters out directory placeholders when walking prefixes.
4) Splits LazyObjectSummariesIterator into its own class and adds tests.
* Adjust S3InputSourceTest.
* Changes from review.
* Include hamcrest-core.
* Parallel indexing single dim partitions
Implements single dimension range partitioning for native parallel batch
indexing as described in #8769. This initial version requires the
druid-datasketches extension to be loaded.
The algorithm has 5 phases that are orchestrated by the supervisor in
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`.
These phases and the main classes involved are described below:
1) In parallel, determine the distribution of dimension values for each
input source split.
`PartialDimensionDistributionTask` uses `StringSketch` to generate
the approximate distribution of dimension values for each input
source split. If the rows are ungrouped,
`PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter`
uses a Bloom filter to skip rows that would be grouped. The final
distribution is sent back to the supervisor via
`DimensionDistributionReport`.
2) The range partitions are determined.
In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the
supervisor uses `StringSketchMerger` to merge the individual
`StringSketch`es created in the preceding phase. The merged sketch is
then used to create the range partitions.
3) In parallel, generate partial range-partitioned segments.
`PartialRangeSegmentGenerateTask` uses the range partitions
determined in the preceding phase and
`RangePartitionCachingLocalSegmentAllocator` to generate
`SingleDimensionShardSpec`s. The partition information is sent back
to the supervisor via `GeneratedGenericPartitionsReport`.
4) The partial range segments are grouped.
In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`,
the supervisor creates the `PartialGenericSegmentMergeIOConfig`s
necessary for the next phase.
5) In parallel, merge partial range-partitioned segments.
`PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to
retrieve the partial range-partitioned segments generated earlier and
then merges and publishes them.
* Fix dependencies & forbidden apis
* Fixes for integration test
* Address review comments
* Fix docs, strict compile, sketch check, rollup check
* Fix first shard spec, partition serde, single subtask
* Fix first partition check in test
* Misc rewording/refactoring to address code review
* Fix doc link
* Split batch index integration test
* Do not run parallel-batch-index twice
* Adjust last partition
* Split ITParallelIndexTest to reduce runtime
* Rename test class
* Allow null values in range partitions
* Indicate which phase failed
* Improve asserts in tests
* Address security vulnerabilities CVSS >= 7
Update dependencies to address security vulnerabilities with CVSS scores
of 7 or higher. A new Travis CI job is added to prevent new
high/critical security vulnerabilities from being added.
Updated dependencies:
- api-util 1.0.0 -> 1.0.3
- jackson 2.9.10 -> 2.10.1
- kafka 2.1.0 -> 2.1.1
- libthrift 0.10.0 -> 0.13.0
- protobuf 3.2.0 -> 3.11.0
The following high/critical security vulnerabilities are currently
suppressed (so that the new Travis CI job can be added now) and are left
as future work to fix:
- hibernate-validator:5.2.5
- jackson-mapper-asl:1.9.13
- libthrift:0.6.1
- netty:3.10.6
- nimbus-jose-jwt:4.41.1
* Rename EDL1 license file
* Fix inspection errors
* add prefixes support to google input source, making it symmetrical-ish with s3
* docs
* more better, and tests
* unused
* formatting
* javadoc
* dependencies
* oops
* review comments
* better javadoc
* Exclude unneeded hadoop transitive dependencies
These dependencies are provided by core:
- com.squareup.okhttp:okhttp
- commons-beanutils:commons-beanutils
- org.apache.commons:commons-compress
- org.apache.zookepper:zookeeper
These dependencies are not needed and are excluded because they contain
security vulnerabilities:
- commons-beanutils:commons-beanutils-core
- org.codehaus.jackson:jackson-mapper-asl
* Simplify exclusions + separate unneeded/vulnerable
* Do not exclude jackson-mapper-asl
* Support orc format for native batch ingestion
* fix pom and remove wrong comment
* fix unnecessary condition check
* use flatMap back to handle exception properly
* move exceptionThrowingIterator to intermediateRowParsingReader
* runtime
* add s3 input source for native batch ingestion
* add docs
* fixes
* checkstyle
* lazy splits
* fixes and hella tests
* fix it
* re-use better iterator
* use key
* javadoc and checkstyle
* exception
* oops
* refactor to use S3Coords instead of URI
* remove unused code, add retrying stream to handle s3 stream
* remove unused parameter
* update to latest master
* use list of objects instead of object
* serde test
* refactor and such
* now with the ability to compile
* fix signature and javadocs
* fix conflicts yet again, fix S3 uri stuffs
* more tests, enforce uri for bucket
* javadoc
* oops
* abstract class instead of interface
* null or empty
* better error
* Fix the potential race SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor
* Fix docs and javadoc
* Add unit tests for large or small estimated num splits
* add override
* Add FileUtils.createTempDir() and enforce its usage.
The purpose of this is to improve error messages. Previously, the error
message on a nonexistent or unwritable temp directory would be
"Failed to create directory within 10,000 attempts".
* Further updates.
* Another update.
* Remove commons-io from benchmark.
* Fix tests.
* add parquet support to native batch
* cleanup
* implement toJson for sampler support
* better binaryAsString test
* docs
* i hate spellcheck
* refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest
* add comment, fix some stuff
* adjustments
* fix accident
* tweaks
* HDFS input source
Add support for using HDFS as an input source. In this version, commas
or globs are not supported in HDFS paths.
* Fix forbidden api
* Address review comments
* Tidy up lifecycle, query, and ingestion logging.
The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.
Concretely, this patch does the following:
- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
goal of reducing redundancy and improving clarity by avoiding
showing rarely-useful log messages. This includes most "starting"
and "stopping" messages, and most messages related to individual
columns.
- Adds new log4j2 templates that show operators how to enabled DEBUG
logging for certain important packages.
- Eliminate stack traces for query errors, unless log level is DEBUG
or more. This is useful because query errors often indicate user
error rather than system error, but dumping stack trace often gave
operators the impression that there was a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
the default log4j2 configuration, this will put them in log lines
as well. It's very useful if a user is using the Indexer, where
multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
segments that are handed-off together by Kafka ingestion) and
"offsets" (cursors in partitions). These terms had been confused in
some log messages due to the fact that Kinesis calls offsets
"sequence numbers".
- Replaces some ugly toString calls with either the JSONification or
something more operator-accessible (like a URL or segment identifier,
instead of JSON object representing the same).
* Adjustments.
* Adjust integration test.
If the JDBC drivers are missing from the lookup extensions, throw an
exception that directs the user how to resolve the issue. This change is
a follow up to #8825.
* Use earliest offset on kafka newly discovered partitions
* resolve conflicts
* remove redundant check cases
* simplified unit tests
* change test case
* rewrite comments
* add regression test
* add junit ignore annotation
* minor modifications
* indent
* override testableKafkaSupervisor and KafkaRecordSupplier to make the test runable
* modified test constructor of kafkaRecordSupplier
* simplify
* delegated constructor
There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.
Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.
This patch has three goals:
1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast
aggregators while fixing null-handling bugs. I thought about splitting this
into its own patch, but it ended up being tough to split from the
null-handling fixes.
For (1) the fixes are,
- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
-1 as a sigil value for null, to differentiate nulls and empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
eval.asBoolean() to avoid calling getLong on the selector after already
calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
on isNull. Also, refactored slightly to avoid the overhead of calling
getObject followed by another getter (see BloomFilterAggregatorFactory for
part of this).
For (2) the main changes are,
- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullbleNumericAggregatorFactory to emphasize
that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.
For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,
- Fixed buffer overrun in the buffer aggregators when some characters in the string
code into more than one byte (the old code used "substring" to apply a byte limit,
which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
xFold versions into aliases. The adaptiveness is similar to how other aggregators
like hyperUnique work.
* IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle() don't fetch abutting intervals; simplify getUsedSegmentsForIntervals()
* Add VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() method; Propagate the decision about whether only visible segmetns or visible and overshadowed segments should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmetnListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever
* Fix tests
* More fixes
* Add javadoc notes about returning Collection instead of Set. Add JacksonUtils.readValue() to reduce boilerplate code
* Fix KinesisIndexTaskTest, factor out common parts from KinesisIndexTaskTest and KafkaIndexTaskTest into SeekableStreamIndexTaskTestBase
* More test fixes
* More test fixes
* Add a comment to VersionedIntervalTimelineTestBase
* Fix tests
* Set DataSegment.size(0) in more tests
* Specify DataSegment.size(0) in more places in tests
* Fix more tests
* Fix DruidSchemaTest
* Set DataSegment's size in more tests and benchmarks
* Fix HdfsDataSegmentPusherTest
* Doc changes addressing comments
* Extended doc for visibility
* Typo
* Typo 2
* Address comment
* Add option lateMessageRejectionStartDate
* Use option lateMessageRejectionStartDate
* Fix tests
* Add lateMessageRejectionStartDate to kafka indexing service
* Update tests kafka indexing service
* Fix tests for KafkaSupervisorTest
* Add lateMessageRejectionStartDate to KinesisSupervisorIOConfig
* Fix var name
* Update documentation
* Add check lateMessageRejectionStartDateTime and lateMessageRejectionPeriod, fails if both were specified.
* remove select query
* thanks teamcity
* oops
* oops
* add back a SelectQuery class that throws RuntimeExceptions linking to docs
* adjust text
* update docs per review
* deprecated
* Stateful auto compaction
* javaodc
* add removed test back
* fix test
* adding indexSpec to compactionState
* fix build
* add lastCompactionState
* address comments
* extract CompactionState
* fix doc
* fix build and test
* Add a task context to store compaction state; add javadoc
* fix it test
* Fix missing jackson jars for hadoop ingestion
* PR comments
* pom ordering
* New approach
* Remove all jackson-core/mapper-asl exclusions from hdfs storage
* Support LDAP authentication/authorization
* fixed integration-tests
* fixed Travis CI build errors related to druid-security module
* fixed failing test
* fixed failing test header
* added comments, force build
* fixes for strict compilation spotbugs checks
* removed authenticator rolling credential update feature
* removed escalator rolling credential update feature
* fixed teamcity inspection deprecated API usage error
* fixed checkstyle execution error, removed unused import
* removed cached config as part of removing authenticator rolling credential update feature
* removed config bundle entity as part of removing authenticator rolling credential update feature
* refactored ldao configuration
* added support for SSLContext configuration and TLSCertificateChecker
* removed check to return authentication failure when user has no group assigned, will be checked and handled by the authorizer
* Separate out authorizer checks between metadata-backed store user and LDAP user/groups
* refactored BasicSecuritySSLSocketFactory usage to fix strict compilation spotbugs checks
* fixes build issue
* final review comments updates
* final review comments updates
* fixed LGTM and spellcheck alerts
* Fixed Avatica auth failure error message check
* Updated metadata credentials validator exception message string, replaced DB with metadata store
* get active task by datasource when supervisor discover tasks
* fix ut
* fix ut
* fix ut
* remove unnecessary condition check
* fix ut
* remove stream in hot loop
* Added live reports for Kafka and Native batch task
* Removed unused local variables
* Added the missing unit test
* Refine unit test logic, add implementation for HttpRemoteTaskRunner
* checksytle fixes
* Update doc descriptions for updated API
* remove unnecessary files
* Fix spellcheck complaints
* More details for api descriptions
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports and
updated druid-forbidden-apis to prevent regressions.
* Address review comments
* Adjust scope for org.glassfish.jaxb:jaxb-runtime
* Fix dependencies for hdfs-storage
* Consolidate netty4 versions
* LoggingEmitter: print event as json
* use DefaultRequestLogEventBuilderFactory in emitting request logger by default
* print context in query metric as json
* removed unused jsonMapper from DefaultQueryMetrics
* add comment
* remove change to DefaultRequestLogEventBuilderFactory.java
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* merging with upstream
* review-1
* unknow changes
* unknow changes
* review-2
* merging with master
* review-2 1 changes
* review changes-2 2
* bug fix
* exclude kerberos extension dependencies that are already included in druid libs
* missing net
* exclude json-smart
* eh might as well go aggro and remove all the ones it looks like we do not need
* guess we actually need this one
* Add TaskResourceCleaner; fix a couple of concurrency bugs in batch tasks
* kill runner when it's ready
* add comment
* kill run thread
* fix test
* Take closeable out of Appenderator
* add javadoc
* fix test
* fix test
* update javadoc
* add javadoc about killed task
* address comment
* Add support for parallel native indexing with shuffle for perfect rollup.
* Add comment about volatiles
* fix test
* fix test
* handling missing exceptions
* more clear javadoc for stopGracefully
* unused import
* update javadoc
* Add missing statement in javadoc
* address comments; fix doc
* add javadoc for isGuaranteedRollup
* Rename confusing variable name and fix typos
* fix typos; move fetch() to a better home; fix the expiration time
* add support https
* GroupBy array-based result rows.
Fixes#8118; see that proposal for details.
Other than the GroupBy changes, the main other "interesting" classes are:
- ResultRow: The array-based result type.
- BaseQuery: T is no longer required to be Comparable.
- QueryToolChest: Adds "decorateObjectMapper" to enable query-aware serialization
and deserialization of result rows (necessary due to their positional nature).
- QueryResource: Uses the new decoration functionality.
- DirectDruidClient: Also uses the new decoration functionality.
- QueryMaker (in Druid SQL): Modifications to read ResultRows.
These classes weren't changed, but got some new javadocs:
- BySegmentQueryRunner
- FinalizeResultsQueryRunner
- Query
* Adjustments for TC stuff.
* Use partitionsSpec for all task types
* fix doc
* fix typos and revert to use isPushRequired
* address comments
* move partitionsSpec to core
* remove hadoopPartitionsSpec
* removed hard-coded Kafka key and value deserializer, leaving default deserializer as org.apache.kafka.common.serialization.ByteArrayDeserializer. Also added checks to ensure that any provided deserializer class extends org.apache.kafka.serialization.Deserializer and outputs a byte array.
* Addressed all comments from original pull request and also added a
unit test.
* Added additional test that uses "poll" to ensure that custom deserializer
works properly.
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports.
* Fix licenses and dependencies
* Fix licenses and dependencies again
* Fix integration test dependency
* Address review comments
* Fix unit test dependencies
* Fix integration test dependency
* Fix integration test dependency again
* Fix integration test dependency third time
* Fix integration test dependency fourth time
* Fix compile error
* Fix assert package
* add CachingClusteredClient benchmark, refactor some stuff
* revert WeightedServerSelectorStrategy to ConnectionCountServerSelectorStrategy and remove getWeight since felt artificial, default mergeResults in toolchest implementation for topn, search, select
* adjust javadoc
* adjustments
* oops
* use it
* use BinaryOperator, remove CombiningFunction, use Comparator instead of Ordering, other review adjustments
* rename createComparator to createResultComparator, fix typo, firstNonNull nullable parameters
* doc updates and changes to use the CollectionUtils.mapValues utility method
* Add Structural Search patterns to intelliJ
* refactoring from PR comments
* put -> putIfAbsent
* do single key lookup
* Benchmarks: New SqlBenchmark, add caching & vectorization to some others.
- Introduce a new SqlBenchmark geared towards benchmarking a wide
variety of SQL queries. Rename the old SqlBenchmark to
SqlVsNativeBenchmark.
- Add (optional) caching to SegmentGenerator to enable easier
benchmarking of larger segments.
- Add vectorization to FilteredAggregatorBenchmark and GroupByBenchmark.
* Query vectorization.
This patch includes vectorized timeseries and groupBy engines, as well
as some analogs of your favorite Druid classes:
- VectorCursor is like Cursor. (It comes from StorageAdapter.makeVectorCursor.)
- VectorColumnSelectorFactory is like ColumnSelectorFactory, and it has
methods to create analogs of the column selectors you know and love.
- VectorOffset and ReadableVectorOffset are like Offset and ReadableOffset.
- VectorAggregator is like BufferAggregator.
- VectorValueMatcher is like ValueMatcher.
There are some noticeable differences between vectorized and regular
execution:
- Unlike regular cursors, vector cursors do not understand time
granularity. They expect query engines to handle this on their own,
which a new VectorCursorGranularizer class helps with. This is to
avoid too much batch-splitting and to respect the fact that vector
selectors are somewhat more heavyweight than regular selectors.
- Unlike FilteredOffset, FilteredVectorOffset does not leverage indexes
for filters that might partially support them (like an OR of one
filter that supports indexing and another that doesn't). I'm not sure
that this behavior is desirable anyway (it is potentially too eager)
but, at any rate, it'd be better to harmonize it between the two
classes. Potentially they should both do some different thing that
is smarter than what either of them is doing right now.
- When vector cursors are created by QueryableIndexCursorSequenceBuilder,
they use a morphing binary-then-linear search to find their start and
end rows, rather than linear search.
Limitations in this patch are:
- Only timeseries and groupBy have vectorized engines.
- GroupBy doesn't handle multi-value dimensions yet.
- Vector cursors cannot handle virtual columns or descending order.
- Only some filters have vectorized matchers: "selector", "bound", "in",
"like", "regex", "search", "and", "or", and "not".
- Only some aggregators have vectorized implementations: "count",
"doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
- Dimension specs other than "default" don't work yet (no extraction
functions or filtered dimension specs).
Currently, the testing strategy includes adding vectorization-enabled
tests to TimeseriesQueryRunnerTest, GroupByQueryRunnerTest,
GroupByTimeseriesQueryRunnerTest, CalciteQueryTest, and all of the
filtering tests that extend BaseFilterTest. In all of those classes,
there are some test cases that don't support vectorization. They are
marked by special function calls like "cannotVectorize" or "skipVectorize"
that tell the test harness to either expect an exception or to skip the
test case.
Testing should be expanded in the future -- a project in and of itself.
Related to #3011.
* WIP
* Adjustments for unused things.
* Adjust javadocs.
* DimensionDictionarySelector adjustments.
* Add "clone" to BatchIteratorAdapter.
* ValueMatcher javadocs.
* Fix benchmark.
* Fixups post-merge.
* Expect exception on testGroupByWithStringVirtualColumn for IncrementalIndex.
* BloomDimFilterSqlTest: Tag two non-vectorizable tests.
* Minor adjustments.
* Update surefire, bump up Xmx in Travis.
* Some more adjustments.
* Javadoc adjustments
* AggregatorAdapters adjustments.
* Additional comments.
* Remove switching search.
* Only missiles.
* disable all compression in intermediate segment persists while ingestion
* more changes and build fix
* by default retain existing indexingSpec for intermediate persisted segments
* document indexSpecForIntermediatePersists index tuning config
* fix build issues
* update serde tests
Make static imports forbidden in tests and remove all occurrences to be
consistent with the non-test code.
Also, various changes to files affected by above:
- Reformat to adhere to druid style guide
- Fix various IntelliJ warnings
- Fix various SonarLint warnings (e.g., the expected/actual args to
Assert.assertEquals() were flipped)
* Add round support for DS-HLL
Since the Cardinality aggregator has a "round" option to round off estimated
values generated from the HyperLogLog algorithm, add the same "round" option to
the DataSketches HLL Sketch module aggregators to be consistent.
* Fix checkstyle errors
* Change HllSketchSqlAggregator to do rounding
* Fix test for standard-compliant null handling mode
* #7875: Setting ACL on S3 task logs on similar lines as that of data segment pushed to S3
* #7875 1. Extracting a method (which uploads a file to S3 setting appropriate access control list to the file being uploaded) and moving it to utils class. 2. Adding S3TaskLogsTest.java file to test acl (permissions) on the task log files pushed to S3.
* fixing checkstyle errors
* #7875 Incorporating review comments
* array support for expression language for multi-value string columns
* fix tests?
* fixes
* more tests
* fixes
* cleanup
* more better, more test
* ignore inspection
* license
* license fix
* inspection
* remove dumb import
* more better
* some comments
* add expr rewrite for arrayfn args for more magic, tests
* test stuff
* more tests
* fix test
* fix test
* castfunc can deal with arrays
* needs more empty array
* more tests, make cast to long array more forgiving
* refactor
* simplify ExprMacro Expr implementations with base classes in core
* oops
* more test
* use Shuttle for Parser.flatten, javadoc, cleanup
* fixes and more tests
* unused import
* fixes
* javadocs, cleanup, refactors
* fix imports
* more javadoc
* more javadoc
* more
* more javadocs, nonnullbydefault, minor refactor
* markdown fix
* adjustments
* more doc
* move initial filter out
* docs
* map empty arg lambda, apply function argument validation
* check function args at parse time instead of eval time
* more immutable
* more more immutable
* clarify grammar
* fix docs
* empty array is string test, we need a way to make arrays better maybe in the future, or define empty arrays as other types..
* Bump Apache Avro to 1.9.0
Apache Avro 1.9.0 brings a lot of new features:
* Deprecate Joda-Time in favor of Java8 JSR310 and setting it as default
* Remove support for Hadoop 1.x
* Move from Jackson 1.x to 2.9
* Add ZStandard Codec
* Lots of updates on the dependencies to fix CVE's
* Remove Jackson classes from public API
* Apache Avro is built by default with Java 8
* Apache Avro is compiled and tested with Java 11 to guarantee compatibility
* Apache Avro MapReduce is compiled and tested with Hadoop 3
* Apache Avro is now leaner, multiple dependencies were removed: guava, paranamer, commons-codec, and commons-logging
* Introduce JMH Performance Testing Framework
* Add Snappy support for C++ DataFile
* and many, many more!
* Add exclusions for Jackson
* Remove Apache Pig from the tests
* Remove the Pig specific part
* Fix the Checkstyle issues
* Cleanup a bit
* Add an additional test
* Revert the abstract class
* https://github.com/apache/incubator-druid/issues/7316 Use Map.putIfAbsent() instead of containsKey() + put()
* fixing indentation
* Using map.computeIfAbsent() instead of map.putIfAbsent() where appropriate
* fixing checkstyle
* Changing the recommendation text
* Reverting auto changes made by IDE
* Implementing recommendation: A ConcurrentHashMap on which computeIfAbsent() is called should be assigned into variables of ConcurrentHashMap type, not ConcurrentMap
* Removing unused import
Reading of auth cookie was not checking URI of the server where request was being sent. This was causing cookie set for one server to be sent to another one and extra authentication round trips between internal druid services.
* Add state and error tracking for seekable stream supervisors
* Fixed nits in docs
* Made inner class static and updated spec test with jackson inject
* Review changes
* Remove redundant config param in supervisor
* Style
* Applied some of Jon's recommendations
* Add transience field
* write test
* implement code review changes except for reconsidering logic of markRunFinishedAndEvaluateHealth()
* remove transience reporting and fix SeekableStreamSupervisorStateManager impl
* move call to stateManager.markRunFinished() from RunNotice to runInternal() for tests
* remove stateHistory because it wasn't adding much value, some fixes, and add more tests
* fix tests
* code review changes and add HTTP health check status
* fix test failure
* refactor to split into a generic SupervisorStateManager and a specific SeekableStreamSupervisorStateManager
* fixup after merge
* code review changes - add additional docs
* cleanup KafkaIndexTaskTest
* add additional documentation for Kinesis indexing
* remove unused throws class