* Replace StatsD client library
The [Datadog package][1] is a StatsD compatible drop-in replacement for the
client library, but it seems to be [better maintained][2] and has support for
Datadog DogStatsD specific features, which will be made use of in a subsequent
commit.
The `count`, `time`, and `gauge` methods are actually exactly compatible with
the previous library and the modifications shouldn't be required, but EasyMock
seems to have a hard time dealing with the variable arguments added by the
DogStatsD library and causes tests to fail if no arguments are provided for the
last String vararg. Passing an empty array fixes the test failures.
[1]: https://github.com/DataDog/java-dogstatsd-client
[2]: https://github.com/tim-group/java-statsd-client/issues/37#issuecomment-248698856
* Retain dimension key information for StatsD metrics
This doesn't change behavior, but allows separating dimensions from the metric
name in subsequent commits.
There is a possible order change for values from
`dimsBuilder.build().values()`, but from the tests it looks like it doesn't
affect actual behavior and the order of user dimensions is also retained.
* Support DogStatsD style tags in statsd-emitter
Datadog [doesn't support name-encoded dimensions and uses a concept of _tags_
instead.][1] This change allows Datadog users to send the metrics without
having to encode the various dimensions in the metric names. This enables
building graphs and monitors with and without aggregation across various
dimensions from the same data.
As tests in this commit verify, the behavior remains the same for users who
don't enable the `druid.emitter.statsd.dogstatsd` configuration flag.
[1]: https://www.datadoghq.com/blog/the-power-of-tagged-metrics/#tags-decouple-collection-and-reporting
* Disable convertRange behavior for DogStatsD users
DogStatsD, unlike regular StatsD, supports floating-point values, so this
behavior is unnecessary. It would be possible to still support `convertRange`,
even with `dogstatsd` enabled, but that would mean that people using the
default mapping would have some of the gauges unnecessarily converted.
`time` is in milliseconds and doesn't support floating-point values.
* move parquet-extensions from contrib to core, adds new hadoop parquet parser that does not convert to avro first and supports flattenSpec and int96 columns, add support for flattenSpec for parquet-avro conversion parser, much test with a bunch of files lifted from spark-sql
* fix avro flattener to support nullable primitives for auto discovery and now only supports primitive arrays instead of all arrays
* remove leftover print
* convert micro timestamp to millis
* checkstyle
* add ignore for .parquet and .parq to rat exclude
* fix legit test failure from avro flattern behavior change
* fix rebase
* add exclusions to pom to cut down on redundant jars
* refactor tests, add support for unwrapping lists for parquet-avro, review comments
* more comment
* fix oops
* tweak parquet-avro list handling
* more docs
* fix style
* grr styles
* include mysql-metadata-storage extension in distribution, but without the GPL-licensed connector library
* Install mysql connector package
* use symlinks to avoid versioning issues
* add documentation for fetching the mysql connector
* fix opentsdb emitter always be running
* check if emitter started
* add more details about consumeDelay in doc
* fix possible thread unsafe
* fix fail sending tags whose value contain colon
* 'suspend' and 'resume' support for kafka indexing service
changes:
* introduces `SuspendableSupervisorSpec` interface to describe supervisors which support suspend/resume functionality controlled through the `SupervisorManager`, which will gracefully shutdown the supervisor and it's tasks, update it's `SupervisorSpec` with either a suspended or running state, and update with the toggled spec. Spec updates are provided by `SuspendableSupervisorSpec.createSuspendedSpec` and `SuspendableSupervisorSpec.createRunningSpec` respectively.
* `KafkaSupervisorSpec` extends `SuspendableSupervisorSpec` and now supports suspend/resume functionality. The difference in behavior between 'running' and 'suspended' state is whether the supervisor will attempt to ensure that indexing tasks are or are not running respectively. Behavior is identical otherwise.
* `SupervisorResource` now provides `/druid/indexer/v1/supervisor/{id}/suspend` and `/druid/indexer/v1/supervisor/{id}/resume` which are used to suspend/resume suspendable supervisors
* Deprecated `/druid/indexer/v1/supervisor/{id}/shutdown` and moved it's functionality to `/druid/indexer/v1/supervisor/{id}/terminate` since 'shutdown' is ambiguous verbage for something that effectively stops a supervisor forever
* Added ability to get all supervisor specs from `/druid/indexer/v1/supervisor` by supplying the 'full' query parameter `/druid/indexer/v1/supervisor?full` which will return a list of json objects of the form `{"id":<id>, "spec":<SupervisorSpec>}`
* Updated overlord console ui to enable suspend/resume, and changed 'shutdown' to 'terminate'
* move overlord console status to own column in supervisor table so does not look like garbage
* spacing
* padding
* other kind of spacing
* fix rebase fail
* fix more better
* all supervisors now suspendable, updated materialized view supervisor to support suspend, more tests
* fix log
* resolves#5898 by adding maxTotalRows to incremental publishing kafka index task and appenderator based realtime indexing task, as available in IndexTask
* address review comments
* changes due to review
* merge fail
* Rename io.druid to org.apache.druid.
* Fix META-INF files and remove some benchmark results.
* MonitorsConfig update for metrics package migration.
* Reorder some dimensions in inner queries for some reason.
* Fix protobuf tests.
* Add PostgreSQLConnectorConfig to expose SSL configuration options for the Postgres Metadata Storage module.
* Fix checkstyle violations and add license header
* Convert properties in the postgres docs to be the full property path and fix typo
* Fix grammar in sslFactory docs
* Native parallel indexing without shuffle
* fix build
* fix ci
* fix ingestion without intervals
* fix retry
* fix retry
* add it test
* use chat handler
* fix build
* add docs
* fix ITUnionQueryTest
* fix failures
* disable metrics reporting
* working
* Fix split of static-s3 firehose
* Add endpoints to supervisor task and a unit test for endpoints
* increase timeout in test
* Added doc
* Address comments
* Fix overlapping locks
* address comments
* Fix static s3 firehose
* Fix test
* fix build
* fix test
* fix typo in docs
* add missing maxBytesInMemory to doc
* address comments
* fix race in test
* fix test
* Rename to ParallelIndexSupervisorTask
* fix teamcity
* address comments
* Fix license
* addressing comments
* addressing comments
* indexTaskClient-based segmentAllocator instead of CountingActionBasedSegmentAllocator
* Fix race in TaskMonitor and move HTTP endpoints to supervisorTask from runner
* Add more javadocs
* use StringUtils.nonStrictFormat for logging
* fix typo and remove unused class
* fix tests
* change package
* fix strict build
* tmp
* Fix overlord api according to the recent change in master
* Fix it test
* implement materialized view
* modify code according to jihoonson's comments
* modify code according to jihoonson's comments - 2
* add documentation about materialized view
* use new HadoopTuningConfig in pr 5583
* add minDataLag and fix optimizer bug
* correct value of DEFAULT_MIN_DATA_LAG_MS
* modify code according to jihoonson's comments - 3
* use the boolean expression instead of if-else
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing.If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
will be respected i.e. the first one to go above threshold will trigger persist
* Fix check style and remove a comment
* Add overlord unsecured paths to coordinator when using combined service (#5579)
* Add overlord unsecured paths to coordinator when using combined service
* PR comment
* More error reporting and stats for ingestion tasks (#5418)
* Add more indexing task status and error reporting
* PR comments, add support in AppenderatorDriverRealtimeIndexTask
* Use TaskReport instead of metrics/context
* Fix tests
* Use TaskReport uploads
* Refactor fire department metrics retrieval
* Refactor input row serde in hadoop task
* Refactor hadoop task loader names
* Truncate error message in TaskStatus, add errorMsg to task report
* PR comments
* Allow getDomain to return disjointed intervals (#5570)
* Allow getDomain to return disjointed intervals
* Indentation issues
* Adding feature thetaSketchConstant to do some set operation in PostAgg (#5551)
* Adding feature thetaSketchConstant to do some set operation in PostAggregator
* Updated review comments for PR #5551 - Adding thetaSketchConstant
* Fixed CI build issue
* Updated review comments 2 for PR #5551 - Adding thetaSketchConstant
* Fix taskDuration docs for KafkaIndexingService (#5572)
* With incremental handoff the changed line is no longer true.
* Add doc for automatic pendingSegments (#5565)
* Add missing doc for automatic pendingSegments
* address comments
* Fix indexTask to respect forceExtendableShardSpecs (#5509)
* Fix indexTask to respect forceExtendableShardSpecs
* add comments
* Deprecate spark2 profile in pom.xml (#5581)
Deprecated due to https://github.com/druid-io/druid/pull/5382
* CompressionUtils: Add support for decompressing xz, bz2, zip. (#5586)
Also switch various firehoses to the new method.
Fixes#5585.
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing.If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
will be respected i.e. the first one to go above threshold will trigger persist
* Address code review comments
* Fix the coding style according to druid conventions
* Add more javadocs
* Rename some variables/methods
* Other minor issues
* Address more code review comments
* Some refactoring to put defaults in IndexTaskUtils
* Added check for maxBytesInMemory in AppenderatorImpl
* Decrement bytes in abandonSegment
* Test unit test for multiple sinks in single appenderator
* Fix some merge conflicts after rebase
* Fix some style checks
* Merge conflicts
* Fix failing tests
Add back check for 0 maxBytesInMemory in OnHeapIncrementalIndex
* Address PR comments
* Put defaults for maxRows and maxBytes in TuningConfig
* Change/add javadocs
* Refactoring and renaming some variables/methods
* Fix TeamCity inspection warnings
* Added maxBytesInMemory config to HadoopTuningConfig
* Updated the docs and examples
* Added maxBytesInMemory config in docs
* Removed references to maxRowsInMemory under tuningConfig in examples
* Set maxBytesInMemory to 0 until used
Set the maxBytesInMemory to 0 if user does not set it as part of tuningConfing
and set to part of max jvm memory when ingestion task starts
* Update toString in KafkaSupervisorTuningConfig
* Use correct maxBytesInMemory value in AppenderatorImpl
* Update DEFAULT_MAX_BYTES_IN_MEMORY to 1/6 max jvm memory
Experimenting with various defaults, 1/3 jvm memory causes OOM
* Update docs to correct maxBytesInMemory default value
* Minor to rename and add comment
* Add more details in docs
* Address new PR comments
* Address PR comments
* Fix spelling typo
* Fix Kerberos Authentication failing requests without cookies.
KerberosAuthenticator was failing `First` request from the clients.
After authentication we were setting the cookie properly but not
setting the the authenticated flag in the request. This PR fixed that.
Additional Fixes -
* Removing of Unused SpnegoFilterConfig - replaced by
KerberosAuthenticator
* Unused internalClientKeytab and principal from KerberosAuthenticator
* Fix docs accordingly and add docs for configuring an escalated
client.
* Fix excluded path config behavior
* spelling correction
* Revert "spelling correction"
This reverts commit fb754b43d8.
* Revert "Fix excluded path config behavior"
This reverts commit 3901047769.
Originally written by @AlexanderSaydakov in druid-io/druid-io.github.io#448.
I also added redirects and updated links to point to the new
datasketches-extension.html landing page for the extension, rather than to
the old page about theta sketches.