* Refactoring codes around ingestion:
- Parallel index task and simple task now use the same segment allocator implementation. This is reusable for the future implementation as well.
- Added PartitionAnalysis to store the analysis of the partitioning
- Move some util methods to SegmentLockHelper and rename it to TaskLockHelper
* fix build
* fix SingleDimensionShardSpecFactory
* optimize SingledimensionShardSpecFactory
* fix test
* shard spec builder
* import order
* shardSpecBuilder -> partialShardSpec
* build -> complete
* fix comment; add unit tests for partitionBoundaries
* add more tests and fix javadoc
* fix toString(); add serde tests for HashBasedNumberedPartialShardSpec and SegmentAllocateAction
* fix test
* add equality test for hash and range partial shard specs
* intelliJ inspections cleanup
- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static
Most of these changes are aesthetic, however, they will allow inspections to
be enabled as part of CI checks going forward
The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
indexing-hadoop/.../Utils.java
processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
processing/src/.../ScanQueryLimitRowIteratorTest.java
* Add intelliJ inspection warnings as errors to druid profile
* one more static inner class
* Reconcile terminology and method naming to 'used/unused segments'; Don't use terms 'enable/disable data source'; Rename MetadataSegmentManager to MetadataSegments; Make REST API methods which mark segments as used/unused to return server error instead of an empty response in case of error
* Fix brace
* Import order
* Rename withKillDataSourceWhitelist to withSpecificDataSourcesToKill
* Fix tests
* Fix tests by adding proper methods without interval parameters to IndexerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY
* More aligned names of DruidCoordinatorHelpers, rename several CoordinatorDynamicConfig parameters
* Rename ClientCompactTaskQuery to ClientCompactionTaskQuery for consistency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig
* More variable and method renames
* Rename MetadataSegments to SegmentsMetadata
* Javadoc update
* Simplify SegmentsMetadata.getUnusedSegmentIntervals(), more javadocs
* Update Javadoc of VersionedIntervalTimeline.iterateAllObjects()
* Reorder imports
* Rename SegmentsMetadata.tryMark... methods to mark... and make them to return boolean and the numbers of segments changed and relay exceptions to callers
* Complete merge
* Add CollectionUtils.newTreeSet(); Refactor DruidCoordinatorRuntimeParams creation in tests
* Remove MetadataSegmentManager
* Rename millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments to leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments
* Fix tests, refactor DruidCluster creation in tests into DruidClusterBuilder
* Fix inspections
* Fix SQLMetadataSegmentManagerEmptyTest and rename it to SqlSegmentsMetadataEmptyTest
* Rename SegmentsAndMetadata to SegmentsAndCommitMetadata to reduce the similarity with SegmentsMetadata; Rename some methods
* Rename DruidCoordinatorHelper to CoordinatorDuty, refactor DruidCoordinator
* Unused import
* Optimize imports
* Rename IndexerSQLMetadataStorageCoordinator.getDataSourceMetadata() to retrieveDataSourceMetadata()
* Unused import
* Update terminology in datasource-view.tsx
* Fix label in datasource-view.spec.tsx.snap
* Fix lint errors in datasource-view.tsx
* Doc improvements
* Another attempt to please TSLint
* Another attempt to please TSLint
* Style fixes
* Fix IndexerSQLMetadataStorageCoordinator.createUsedSegmentsSqlQueryForIntervals() (wrong merge)
* Try to fix docs build issue
* Javadoc and spelling fixes
* Rename SegmentsMetadata to SegmentsMetadataManager, address other comments
* Address more comments
* Add FileUtils.createTempDir() and enforce its usage.
The purpose of this is to improve error messages. Previously, the error
message on a nonexistent or unwritable temp directory would be
"Failed to create directory within 10,000 attempts".
* Further updates.
* Another update.
* Remove commons-io from benchmark.
* Fix tests.
* IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle() don't fetch abutting intervals; simplify getUsedSegmentsForIntervals()
* Add VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() method; Propagate the decision about whether only visible segmetns or visible and overshadowed segments should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmetnListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever
* Fix tests
* More fixes
* Add javadoc notes about returning Collection instead of Set. Add JacksonUtils.readValue() to reduce boilerplate code
* Fix KinesisIndexTaskTest, factor out common parts from KinesisIndexTaskTest and KafkaIndexTaskTest into SeekableStreamIndexTaskTestBase
* More test fixes
* More test fixes
* Add a comment to VersionedIntervalTimelineTestBase
* Fix tests
* Set DataSegment.size(0) in more tests
* Specify DataSegment.size(0) in more places in tests
* Fix more tests
* Fix DruidSchemaTest
* Set DataSegment's size in more tests and benchmarks
* Fix HdfsDataSegmentPusherTest
* Doc changes addressing comments
* Extended doc for visibility
* Typo
* Typo 2
* Address comment
When using single-dimension partitioning, use targetRowsPerSegment (if
specified) to size segments. Previously, single-dimension partitioning
would always size segments as close to the max size as possible.
Also, change single-dimension partitioning to allow partitions that have
a size equal to the target or max size. Previously, it would create
partitions up to 1 less than those limits.
Also, fix some IntelliJ inspection warnings in HadoopDruidIndexerConfig.
* Stateful auto compaction
* javaodc
* add removed test back
* fix test
* adding indexSpec to compactionState
* fix build
* add lastCompactionState
* address comments
* extract CompactionState
* fix doc
* fix build and test
* Add a task context to store compaction state; add javadoc
* fix it test
* IOConfig for compaction task
* add javadoc, doc, unit test
* fix webconsole test
* add spelling
* address comments
* fix build and test
* address comments
* Make more package EverythingIsNonnullByDefault by default
* Fixed additional voilations after pulling in master
* Change iterator to list.addAll
* Fix annotations
* Adjust defaults for hashed partitioning
If neither the partition size nor the number of shards are specified,
default to partitions of 5,000,000 rows (similar to the behavior of
dynamic partitions). Previously, both could be null and cause incorrect
behavior.
Specifying both a partition size and a number of shards now results in
an error instead of ignoring the partition size in favor of using the
number of shards. This is a behavior change that makes it more apparent
to the user that only one of the two properties will be honored
(previously, a message was just logged when the specified partition size
was ignored).
* Fix test
* Handle -1 as null
* Add -1 as null tests for single dim partitioning
* Simplify logic to handle -1 as null
* Address review comments
* Rename partition spec fields
Rename partition spec fields to be consistent across the various types
(hashed, single_dim, dynamic). Specifically, use targetNumRowsPerSegment
and maxRowsPerSegment in favor of targetPartitionSize and
maxSegmentSize. Consistent and clearer names are easier for users to
understand and use.
Also fix various IntelliJ inspection warnings and doc spelling mistakes.
* Fix test
* Improve docs
* Add targetRowsPerSegment to HashedPartitionsSpec
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports and
updated druid-forbidden-apis to prevent regressions.
* Address review comments
* Adjust scope for org.glassfish.jaxb:jaxb-runtime
* Fix dependencies for hdfs-storage
* Consolidate netty4 versions
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* check ctyle for constant field name
* merging with upstream
* review-1
* unknow changes
* unknow changes
* review-2
* merging with master
* review-2 1 changes
* review changes-2 2
* bug fix
* Use partitionsSpec for all task types
* fix doc
* fix typos and revert to use isPushRequired
* address comments
* move partitionsSpec to core
* remove hadoopPartitionsSpec
* Fix dependency analyze warnings
Update the maven dependency plugin to the latest version and fix all
warnings for unused declared and used undeclared dependencies in the
compile scope. Added new travis job to add the check to CI. Also fixed
some source code files to use the correct packages for their imports.
* Fix licenses and dependencies
* Fix licenses and dependencies again
* Fix integration test dependency
* Address review comments
* Fix unit test dependencies
* Fix integration test dependency
* Fix integration test dependency again
* Fix integration test dependency third time
* Fix integration test dependency fourth time
* Fix compile error
* Fix assert package
* add CachingClusteredClient benchmark, refactor some stuff
* revert WeightedServerSelectorStrategy to ConnectionCountServerSelectorStrategy and remove getWeight since felt artificial, default mergeResults in toolchest implementation for topn, search, select
* adjust javadoc
* adjustments
* oops
* use it
* use BinaryOperator, remove CombiningFunction, use Comparator instead of Ordering, other review adjustments
* rename createComparator to createResultComparator, fix typo, firstNonNull nullable parameters
* disable all compression in intermediate segment persists while ingestion
* more changes and build fix
* by default retain existing indexingSpec for intermediate persisted segments
* document indexSpecForIntermediatePersists index tuning config
* fix build issues
* update serde tests
* https://github.com/apache/incubator-druid/issues/7316 Use Map.putIfAbsent() instead of containsKey() + put()
* fixing indentation
* Using map.computeIfAbsent() instead of map.putIfAbsent() where appropriate
* fixing checkstyle
* Changing the recommendation text
* Reverting auto changes made by IDE
* Implementing recommendation: A ConcurrentHashMap on which computeIfAbsent() is called should be assigned into variables of ConcurrentHashMap type, not ConcurrentMap
* Removing unused import
* Throw caught exception.
* Throw caught exceptions.
* Related checkstyle rule is added to prevent further bugs.
* RuntimeException() is used instead of Throwables.propagate().
* Missing import is added.
* Throwables are propogated if possible.
* Throwables are propogated if possible.
* Throwables are propogated if possible.
* Throwables are propogated if possible.
* * Checkstyle definition is improved.
* Throwables.propagate() usages are removed.
* Checkstyle pattern is changed for only scanning "Throwables.propagate(" instead of checking lookbehind.
* Throwable is kept before firing a Runtime Exception.
* Fix unused assignments.
* write null byte in hadoop indexing for numeric dimensions
* Add test case to check output serializing null numeric dimensions
* Remove extra line
* Add @Nullable annotations
* Remove DataSegmentFinder, InsertSegmentToDb, and descriptor.json file
* delete descriptor.file when killing segments
* fix test
* Add doc for ha
* improve warning
* Fix:
1. hadoop-common dependency for druid-hdfs and druid-kerberos extensions
Refactoring:
2. Hadoop config call in the inner static class to avoid class path conflicts for stopGracefully kill
* Fix:
1. hadoop-common test dependency
* Fix:
1. Avoid issue of kill command once the job is actually completed
* KillTask from overlord UI now makes sure that it terminates the underlying MR job, thus saving unnecessary compute
Run in jobby is now split into 2
1. submitAndGetHadoopJobId followed by 2. run
submitAndGetHadoopJobId is responsible for submitting the job and returning the jobId as a string, run monitors this job for completion
JobHelper writes this jobId in the path provided by HadoopIndexTask which in turn is provided by the ForkingTaskRunner
HadoopIndexTask reads this path when kill task is clicked to get hte jobId and fire the kill command via the yarn api. This is taken care in the stopGracefully method which is called in SingleTaskBackgroundRunner. Have enabled `canRestore` method to return `true` for HadoopIndexTask in order for the stopGracefully method to be called
Hadoop*Job files have been changed to incorporate the changes to jobby
* Addressing PR comments
* Addressing PR comments - Fix taskDir
* Addressing PR comments - For changing the contract of Task.stopGracefully()
`SingleTaskBackgroundRunner` calls stopGracefully in stop() and then checks for canRestore condition to return the status of the task
* Addressing PR comments
1. Formatting
2. Removing `submitAndGetHadoopJobId` from `Jobby` and calling writeJobIdToFile in the job itself
* Addressing PR comments
1. POM change. Moving hadoop dependency to indexing-hadoop
* Addressing PR comments
1. stopGracefully now accepts TaskConfig as a param
Handling isRestoreOnRestart in stopGracefully for `AppenderatorDriverRealtimeIndexTask, RealtimeIndexTask, SeekableStreamIndexTask`
Changing tests to make TaskConfig param isRestoreOnRestart to true
* Use multi-guava version friendly direct executor implementation
* Don't use a singleton
* Fix strict compliation complaints
* Copy Guava's DirectExecutor
* Fix javadoc
* Imports are the devil
* Add checkstyle rules about imports and empty lines between members
* Add suppressions
* Update Eclipse import order
* Add empty line
* Fix StatsDEmitter
* Prohibit some guava collection APIs and use JDK APIs directly
* reset files that changed by accident
* sort codestyle/druid-forbidden-apis.txt alphabetically
This PR accumulates many refactorings and small improvements that I did while preparing the next change set of https://github.com/druid-io/druid/projects/2. I finally decided to make them a separate PR to minimize the volume of the main PR.
Some of the changes:
- Renamed confusing "Generic Column" term to "Numeric Column" (what it actually implies) in many class names.
- Generified `ComplexMetricExtractor`
* Rename io.druid to org.apache.druid.
* Fix META-INF files and remove some benchmark results.
* MonitorsConfig update for metrics package migration.
* Reorder some dimensions in inner queries for some reason.
* Fix protobuf tests.
* Various changes about druid-services module
* Patch improvements from reviewer
* Add ToArrayCallWithZeroLengthArrayArgument & ArraysAsListWithZeroOrOneArgument into inspection profile
* Fix ArraysAsListWithZeroOrOneArgument
* Fix conflict
* Fix ToArrayCallWithZeroLengthArrayArgument
* Fix AliEqualsAvoidNull
* Remove blank line
* Remove unused import clauses
* Fix code style in TopNQueryRunnerTest
* Fix conflict
* Don't use Collections.singletonList when converting the type of array type
* Add argLine into maven-surefire-plugin in druid-process module & increase the timeout value for testMoveSegment testcase
* Roll back the latest commit
* Add java.io.File#toURL() into druid-forbidden-apis
* Using Boolean.parseBoolean instead of Boolean.valueOf for CliCoordinator#isOverlord
* Add a new regexp element into stylecode xml file
* Fix style error for new regexp
* Set the level of ArraysAsListWithZeroOrOneArgument as WARNING
* Fix style error for new regexp
* Add option BY_LEVEL for ToArrayCallWithZeroLengthArrayArgument in inspection profile
* Roll back the level as ToArrayCallWithZeroLengthArrayArgument as ERROR
* Add toArray(new Object[0]) regexp into checkstyle config file & fix them
* Set the level of ArraysAsListWithZeroOrOneArgument as ERROR & Roll back the level of ToArrayCallWithZeroLengthArrayArgument as WARNING until Youtrack fix it
* Add a comment for string equals regexp in checkstyle config
* Fix code format
* Add RedundantTypeArguments as ERROR level inspection
* Fix cannot resolve symbol datasource
* VersionedIntervalTimeline: Optimize construction with heavily populated holders.
Each time a segment is "add"ed to a timeline, "isComplete" is called on the holder
that it is added to. "isComplete" is an O(segments per chunk) operation, meaning
that adding N segments to a chunk is an O(N^2) operation. This blows up badly if
we have thousands of segments per chunk.
The patch defers the "isComplete" check until after all segments have been
inserted.
* Fix imports.
* Remove setParitionerClass call from SortableBytes since callers override the paritioner class themselves
* Get rid of SortableBytesPartitioner class
* Make partitioner class a parameter
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing.If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
will be respected i.e. the first one to go above threshold will trigger persist
* Fix check style and remove a comment
* Add overlord unsecured paths to coordinator when using combined service (#5579)
* Add overlord unsecured paths to coordinator when using combined service
* PR comment
* More error reporting and stats for ingestion tasks (#5418)
* Add more indexing task status and error reporting
* PR comments, add support in AppenderatorDriverRealtimeIndexTask
* Use TaskReport instead of metrics/context
* Fix tests
* Use TaskReport uploads
* Refactor fire department metrics retrieval
* Refactor input row serde in hadoop task
* Refactor hadoop task loader names
* Truncate error message in TaskStatus, add errorMsg to task report
* PR comments
* Allow getDomain to return disjointed intervals (#5570)
* Allow getDomain to return disjointed intervals
* Indentation issues
* Adding feature thetaSketchConstant to do some set operation in PostAgg (#5551)
* Adding feature thetaSketchConstant to do some set operation in PostAggregator
* Updated review comments for PR #5551 - Adding thetaSketchConstant
* Fixed CI build issue
* Updated review comments 2 for PR #5551 - Adding thetaSketchConstant
* Fix taskDuration docs for KafkaIndexingService (#5572)
* With incremental handoff the changed line is no longer true.
* Add doc for automatic pendingSegments (#5565)
* Add missing doc for automatic pendingSegments
* address comments
* Fix indexTask to respect forceExtendableShardSpecs (#5509)
* Fix indexTask to respect forceExtendableShardSpecs
* add comments
* Deprecate spark2 profile in pom.xml (#5581)
Deprecated due to https://github.com/druid-io/druid/pull/5382
* CompressionUtils: Add support for decompressing xz, bz2, zip. (#5586)
Also switch various firehoses to the new method.
Fixes#5585.
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing.If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
will be respected i.e. the first one to go above threshold will trigger persist
* Address code review comments
* Fix the coding style according to druid conventions
* Add more javadocs
* Rename some variables/methods
* Other minor issues
* Address more code review comments
* Some refactoring to put defaults in IndexTaskUtils
* Added check for maxBytesInMemory in AppenderatorImpl
* Decrement bytes in abandonSegment
* Test unit test for multiple sinks in single appenderator
* Fix some merge conflicts after rebase
* Fix some style checks
* Merge conflicts
* Fix failing tests
Add back check for 0 maxBytesInMemory in OnHeapIncrementalIndex
* Address PR comments
* Put defaults for maxRows and maxBytes in TuningConfig
* Change/add javadocs
* Refactoring and renaming some variables/methods
* Fix TeamCity inspection warnings
* Added maxBytesInMemory config to HadoopTuningConfig
* Updated the docs and examples
* Added maxBytesInMemory config in docs
* Removed references to maxRowsInMemory under tuningConfig in examples
* Set maxBytesInMemory to 0 until used
Set the maxBytesInMemory to 0 if user does not set it as part of tuningConfing
and set to part of max jvm memory when ingestion task starts
* Update toString in KafkaSupervisorTuningConfig
* Use correct maxBytesInMemory value in AppenderatorImpl
* Update DEFAULT_MAX_BYTES_IN_MEMORY to 1/6 max jvm memory
Experimenting with various defaults, 1/3 jvm memory causes OOM
* Update docs to correct maxBytesInMemory default value
* Minor to rename and add comment
* Add more details in docs
* Address new PR comments
* Address PR comments
* Fix spelling typo
In DeterminePartitonsJob -
config.get("mapred.job.tracker").equals("local") throws NPE as the
property name is changed in hadoop 3.0 to mapreduce.jobtracker.address
This patch extracts the logic to fetch jobTrackerAddress in JobHelper
and reuses it when needed.
* Add more indexing task status and error reporting
* PR comments, add support in AppenderatorDriverRealtimeIndexTask
* Use TaskReport instead of metrics/context
* Fix tests
* Use TaskReport uploads
* Refactor fire department metrics retrieval
* Refactor input row serde in hadoop task
* Refactor hadoop task loader names
* Truncate error message in TaskStatus, add errorMsg to task report
* PR comments
* Use the official aws-sdk instead of jet3t
* fix compile and serde tests
* address comments and fix test
* add http version string
* remove redundant dependencies, fix potential NPE, and fix test
* resolve TODOs
* fix build
* downgrade jackson version to 2.6.7
* fix test
* resolve the last TODO
* support proxy and endpoint configurations
* fix build
* remove debugging log
* downgrade hadoop version to 2.8.3
* fix tests
* remove unused log
* fix it test
* revert KerberosAuthenticator change
* change hadoop-aws scope to provided in hdfs-storage
* address comments
* address comments
* Future-proof some Guava usage
* Use a java-util EmptyIterator instead of Guava's
* Change some of the guava future handling to do manual async
transforms. Guava changes transform into transformAsync by deprecating
transform in ONLY Guava 19. Then its gone in 20
* Use `Collections.emptyIterator()`
* Pretty formatting
* Make listenable future transforms a thing in default druid
* Format fix
* Add forbidden guava apis
* Make the ListenableFutrues.transformAsync have comments
* Undo intellij bad pattern matching in comments
* Futrues --> Futures
* Add empty iterators forbidding
* Fix extra `A`
* Correct method signature
* Address review comments
* Finish Gian review comments
* Proper syntax from https://github.com/policeman-tools/forbidden-apis/wiki/SignaturesSyntax
* Use both Joad Ids and Java IDs as Timezone to string readers
Change-Id: Ieb5c18559879f3f3a0104912ce2f0a354ad0aac3
* move the function to DateTimes and add org.joda.time.DateTimeZone#forID as part of forbidden api
Change-Id: Iff97fa044758019ed0c231587d10e31a9cc18da0
* exclude class and remove other usage
Change-Id: Ib458c2caaa1865535767e1009fbf017a92c8f615
* remove it from test classes
Change-Id: I9b576324f6c7e17a74bd8b13879232c9a8cd40b4
* remove unused
Change-Id: If1c5b70c26c2b7c83c20434cb72b2060653f5052
* Deduplicate DataSegments contents (loadSpec's keys, dimensions and metrics lists as a whole) more aggressively; use ArrayMap instead of default LinkedHashMap for DataSegment.loadSpec, because they have only 3 entries on average; prune DataSegment.loadSpec on brokers
* Fix DataSegmentTest
* Refinements
* Try to fix
* Fix the second DataSegmentTest
* Nullability
* Fix tests
* Fix tests, unify to use TestHelper.getJsonMapper()
* Revert TestUtil as ServerTestHelper, fix tests
* Add newline
* Fix indexing tests
* Fix s3 tests
* Try to fix tests, remove lazy caching of ObjectMapper in TestHelper, rename TestHelper.getJsonMapper() to makeJsonMapper()
* Fix HDFS tests
* Fix HdfsDataSegmentPusherTest
* Capitalize constant names