221 Commits

Author SHA1 Message Date
Jihoon Son 9466ac7c9b
Skip empty files for local, hdfs, and cloud input sources (#9450)
* Skip empty files for local, hdfs, and cloud input sources

* split hint spec doc

* doc for skipping empty files

* fix typo; adjust tests

* unnecessary fluent iterable

* address comments

* fix test

* use the right lists

* fix test

* fix test
2020-03-03 20:51:06 -08:00
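The empty-file skipping described in #9450 can be illustrated with a minimal sketch. It is not Druid's implementation: the class `EmptyObjectFilterSketch`, the method `skipEmpty`, and the size callback are hypothetical names chosen for illustration, and strings stand in for files so that the example needs no file system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper for illustration only; not taken from the Druid code base.
final class EmptyObjectFilterSketch {
  /** Returns the objects whose reported size is greater than zero, in input order. */
  static <T> List<T> skipEmpty(List<T> objects, ToLongFunction<T> sizeOf) {
    List<T> nonEmpty = new ArrayList<>();
    for (T object : objects) {
      // An object that reports zero bytes contributes no rows, so it is dropped.
      if (sizeOf.applyAsLong(object) > 0) {
        nonEmpty.add(object);
      }
    }
    return nonEmpty;
  }

  public static void main(String[] args) {
    // Strings stand in for files; their length stands in for the file size.
    List<String> files = Arrays.asList("abc", "", "de");
    System.out.println(skipEmpty(files, String::length));
  }
}
```

Passing the size lookup as a function keeps the filter independent of where the objects live, which is one plausible way to share it across local, HDFS, and cloud listings.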
Gian Merlino ae617bf5dd
Clarify InputSource.isSplittable usage. (#9424)
Also removes TimedShutoffInputSource, which had a bug in isSplittable (it
improperly returned true, even though it didn't implement SplittableInputSource).
This bug had no user-visible impact, since the code wasn't used.
2020-02-26 22:05:46 -08:00
Jihoon Son 3bc7ae782c
Create splits of multiple files for parallel indexing (#9360)
* Create splits of multiple files for parallel indexing

* fix wrong import and npe in test

* use the single file split in tests

* rename

* import order

* Remove specific local input source

* Update docs/ingestion/native-batch.md

Co-Authored-By: sthetland <steve.hetland@imply.io>

* Update docs/ingestion/native-batch.md

Co-Authored-By: sthetland <steve.hetland@imply.io>

* doc and error msg

* fix build

* fix a test and address comments

Co-authored-by: sthetland <steve.hetland@imply.io>
2020-02-24 17:34:39 -08:00
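As a rough sketch of what "splits of multiple files" (#9360) could look like, the following groups files greedily under a byte budget. The grouping criterion is an assumption; the commit message does not state how files are grouped. `MultiFileSplitSketch`, `groupBySize`, and `maxSplitBytes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the byte-budget criterion is an assumption, not Druid's documented behavior.
final class MultiFileSplitSketch {
  /** Groups file sizes, in order, into splits whose total size stays within maxSplitBytes. */
  static List<List<Long>> groupBySize(List<Long> fileSizes, long maxSplitBytes) {
    List<List<Long>> splits = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : fileSizes) {
      // Close the current split if adding this file would exceed the budget.
      if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
        splits.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      // A single file larger than the budget still gets a split of its own.
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      splits.add(current);
    }
    return splits;
  }

  public static void main(String[] args) {
    System.out.println(groupBySize(Arrays.asList(3L, 4L, 5L, 10L, 1L), 8L));
  }
}
```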
Clint Wylie 6d8dd5ec10
string -> expression -> string -> expression (#9367)
* add Expr.stringify which produces parseable expression strings, parser support for null values in arrays, and parser support for empty numeric arrays

* oops, macros are expressions too

* style

* spotbugs

* qualified type arrays

* review stuffs

* simplify grammar

* more permissive array parsing

* reuse expr joiner

* fix it
2020-02-21 15:43:02 -08:00
zachjsh f707064bed
Add Azure config options for segment prefix and max listing length (#9356)
* Add Azure config options for segment prefix and max listing length

Added configuration options to allow the user to specify the prefix
within the segment container to store the segment files. Also
added a configuration option to allow the user to specify the
maximum number of input files to stream for each iteration.

* * Fix test failures

* * Address review comments

* * add dependency explicitly to pom

* * update docs

* * Address review comments

* * Address review comments
2020-02-21 14:12:03 -08:00
Jihoon Son 141d8dd875
Enable druid.coordinator.kill.pendingSegments.on by default (#9385)
* Enable druid.coordinator.kill.pendingSegments.on by default

* checkstyle
2020-02-21 13:13:49 -08:00
Chi Cao Minh e7eb45e648
Run IntelliJ inspections on Travis (#9179)
* Run IntelliJ inspections on Travis

Running IntelliJ inspections currently takes about 90 minutes, but they
can be run in about 30 minutes on Travis.

* Restore assert statements
2020-02-19 11:34:19 +03:00
Clint Wylie b1be88d79c
fix Expressions.toQueryGranularity to be more correct, improve javadocs of Expr.getIdentifierIfIdentifier and Expr.getBindingIfIdentifier (#9363) 2020-02-16 08:36:40 -08:00
zachjsh 5c202343c9
implement Azure InputSource reader and deprecate Azure FireHose (#9306)
* IMPLY-1946: Improve code quality and unit test coverage of the Azure extension

* Update unit tests to increase test coverage for the extension
* Clean up any messy code
* Enforce code coverage as part of tests.

* * Update azure extension pom to remove unnecessary things
* update jacoco thresholds

* * upgrade the azure-storage library to the most up-to-date version

* implement Azure InputSource reader and deprecate Azure FireHose

* implement azure InputSource reader
* deprecate Azure FireHose implementation

* * exclude common libraries that are included from druid core

* Implement more of Azure input source.

* * Add tests

* * Add more tests

* * deprecate azure firehose

* * added more tests

* * rollback fix for google cloud batch ingestion bug. Will be
  fixed in another PR.

* * Added javadocs for all azure related classes
* Addressed review comments

* * Remove dependency on org.apache.commons:commons-collections4
* Fix LGTM warnings
* Add com.google.inject.extensions:guice-assistedinject to licenses

* * rename classes as suggested in review comments

* * Address review comments

* * Address review comments

* * Address review comments
2020-02-11 17:41:58 -08:00
Chi Cao Minh e8146d5914
More superbatch range partitioning tests (#9266)
More functional tests to cover handling of input data that has a
partition dimension that contains:

1) Null values: Should be in first partition

2) Multi values: Should cause superbatch task to abort
2020-02-10 15:17:53 -08:00
Suneet Saldanha 51d7864935
Codestyle - use java style array declaration (#9338)
* Codestyle - use java style array declaration

Replaced C-style array declarations with java style declarations and marked
the intelliJ inspection as an error

* cleanup test code
2020-02-10 14:25:26 -08:00
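For readers unfamiliar with the distinction in #9338, the two declaration styles are shown side by side below. Both compile to the same type; the class and method names are illustrative only.

```java
import java.util.Arrays;

// Illustrative only: both methods return an int[]; they differ in declaration style.
final class ArrayDeclarationSketch {
  static int[] javaStyle() {
    int[] values = {1, 2, 3}; // Java style: the brackets belong to the type
    return values;
  }

  static int[] cStyle() {
    int values[] = {1, 2, 3}; // C style: legal Java, but the brackets trail the variable name
    return values;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.equals(javaStyle(), cStyle()));
  }
}
```

Keeping the brackets on the type avoids confusing mixed declarations such as `int[] a, b[];`, where `b` is a two-dimensional array.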
Clint Wylie 831ec172f1
Logging large segment list handling (#9312)
* better handling of large segment lists in logs

* more

* adjust

* exceptions

* fixes

* refactor

* debug

* heh

* dang
2020-02-07 21:42:45 -08:00
Jihoon Son e81230f9ab
Refactoring some codes around ingestion (#9274)
* Refactoring codes around ingestion:

- Parallel index task and simple task now use the same segment allocator implementation. This is reusable for the future implementation as well.
- Added PartitionAnalysis to store the analysis of the partitioning
- Move some util methods to SegmentLockHelper and rename it to TaskLockHelper

* fix build

* fix SingleDimensionShardSpecFactory

* optimize SingleDimensionShardSpecFactory

* fix test

* shard spec builder

* import order

* shardSpecBuilder -> partialShardSpec

* build -> complete

* fix comment; add unit tests for partitionBoundaries

* add more tests and fix javadoc

* fix toString(); add serde tests for HashBasedNumberedPartialShardSpec and SegmentAllocateAction

* fix test

* add equality test for hash and range partial shard specs
2020-02-07 16:23:07 -08:00
Lucas Capistrant 53bb45fc9a
Forbid easily misused HashSet and HashMap constructors (#9165)
* Forbid easily misused HashSet and HashMap constructors

* Add two LinkedHashMap constructors to forbidden-apis and create utility method as replacement for them

* Fix visibility of constant in CollectionUtils.java

* Make an exception for an instance of LinkedHashMap#<init>(int) because proper sizing is used

* revert changes to sql module tests that should be in separate PR

* Finish reverting changes to sql module tests that were flagged in checkstyle during CI

* Add netty dependency resulting from SuppressForbidden
2020-02-07 10:44:09 +03:00
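A likely reason the int-argument constructors in #9165 count as "easily misused" is that the argument is an initial capacity, not an expected element count: with the default load factor of 0.75, a `HashMap` created with `new HashMap<>(n)` resizes before it holds `n` entries. The sketch below shows one way a sizing utility might compensate. The names `ExpectedSizeSketch`, `capacityFor`, and `newHashMapWithExpectedSize` are hypothetical here and are not claimed to match the utility method the commit adds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sizing helper; assumes the default HashMap load factor of 0.75.
final class ExpectedSizeSketch {
  /** Smallest initial capacity that holds expectedSize entries without a resize. */
  static int capacityFor(int expectedSize) {
    if (expectedSize < 0) {
      throw new IllegalArgumentException("expectedSize must be non-negative: " + expectedSize);
    }
    // ceil(expectedSize / 0.75), clamped so the result fits in an int.
    return (int) Math.min(Integer.MAX_VALUE, (long) Math.ceil(expectedSize / 0.75d));
  }

  static <K, V> Map<K, V> newHashMapWithExpectedSize(int expectedSize) {
    return new HashMap<>(capacityFor(expectedSize));
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = newHashMapWithExpectedSize(100);
    counts.put("rows", 1);
    System.out.println(capacityFor(100) + " " + counts);
  }
}
```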
Gian Merlino 0f0554f8fa
LimitedSequence: Improve suppression comment. (#9298) 2020-01-31 16:21:08 -08:00
Gian Merlino 7d91b8f281
Suppress false-alarm inspection. (#9297)
I think a mid-air collision between #9260 and #9293 has led to
master being unable to pass inspections in TeamCity. Hopefully
this fixes it.
2020-01-31 09:24:21 -08:00
Gian Merlino 07a91f9022
Fix early return from YieldingSequenceBase#accumulate. (#9293)
Fixes #9291.
2020-01-30 12:01:18 -08:00
Suneet Saldanha 303b02eba1
intelliJ inspections cleanup (#9260)
* intelliJ inspections cleanup

- remove redundant escapes
- performance warnings
- access static member via instance reference
- static method declared final
- inner class may be static

Most of these changes are aesthetic; however, they will allow inspections to
be enabled as part of CI checks going forward

The valuable changes in this delta are:
- using StringBuilder instead of string addition in a loop
    indexing-hadoop/.../Utils.java
    processing/.../ByteBufferMinMaxOffsetHeap.java
- Use class variables instead of static variables for parameterized test
    processing/src/.../ScanQueryLimitRowIteratorTest.java

* Add intelliJ inspection warnings as errors to druid profile

* one more static inner class
2020-01-29 11:50:52 -08:00
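The "StringBuilder instead of string addition in a loop" change listed in #9260 follows a common Java pattern, sketched below. The class and method are illustrative and do not reproduce the code in the files named by the commit.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: builds the result in one buffer instead of allocating a new String per iteration.
final class LoopConcatSketch {
  static String join(List<String> parts, String separator) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < parts.size(); i++) {
      if (i > 0) {
        sb.append(separator);
      }
      sb.append(parts.get(i));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(join(Arrays.asList("a", "b", "c"), ","));
  }
}
```

With `result += part` inside the loop, each iteration copies the whole accumulated string, so the total work grows quadratically with the output length; the builder appends in place.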
Suneet Saldanha 0ccfe5ca89 Expose JoinableFactory through Guice Bindings (#9271)
* Make JoinableFactory an extension point

This change makes it so that extensions can register a JoinableFactory that
should be used for a DataSource.

Extensions can provide the factories via DruidBinders#joinableFactoryBinder
Known DataSources - like InlineDataSource are provided in the
JoinableFactoryModule. This module installs a FactoryWarehouse that is
used to decide which factory should be used to generate the Joinable for
the provided DataSource.

The ExtensionPoint is marked as Beta since it is not yet clear if this
needs to remain available to other extensions or if the best way to
register a factory is by using the datasource class.

* Add module test

* remove useless bindings in test

* remove ExtensionPoint annotation

* Make LifecycleLock not final to help with testing
2020-01-28 13:59:06 -08:00
Roman Leventov b9186f8f9f Reconcile terminology and method naming to 'used/unused segments'; Rename MetadataSegmentManager to MetadataSegmentsManager (#7306)
* Reconcile terminology and method naming to 'used/unused segments'; Don't use terms 'enable/disable data source'; Rename MetadataSegmentManager to MetadataSegments; Make REST API methods which mark segments as used/unused to return server error instead of an empty response in case of error

* Fix brace

* Import order

* Rename withKillDataSourceWhitelist to withSpecificDataSourcesToKill

* Fix tests

* Fix tests by adding proper methods without interval parameters to IndexerMetadataStorageCoordinator instead of hacking with Intervals.ETERNITY

* More aligned names of DruidCoordinatorHelpers, rename several CoordinatorDynamicConfig parameters

* Rename ClientCompactTaskQuery to ClientCompactionTaskQuery for consistency with CompactionTask; ClientCompactQueryTuningConfig to ClientCompactionTaskQueryTuningConfig

* More variable and method renames

* Rename MetadataSegments to SegmentsMetadata

* Javadoc update

* Simplify SegmentsMetadata.getUnusedSegmentIntervals(), more javadocs

* Update Javadoc of VersionedIntervalTimeline.iterateAllObjects()

* Reorder imports

* Rename SegmentsMetadata.tryMark... methods to mark... and make them to return boolean and the numbers of segments changed and relay exceptions to callers

* Complete merge

* Add CollectionUtils.newTreeSet(); Refactor DruidCoordinatorRuntimeParams creation in tests

* Remove MetadataSegmentManager

* Rename millisLagSinceCoordinatorBecomesLeaderBeforeCanMarkAsUnusedOvershadowedSegments to leadingTimeMillisBeforeCanMarkAsUnusedOvershadowedSegments

* Fix tests, refactor DruidCluster creation in tests into DruidClusterBuilder

* Fix inspections

* Fix SQLMetadataSegmentManagerEmptyTest and rename it to SqlSegmentsMetadataEmptyTest

* Rename SegmentsAndMetadata to SegmentsAndCommitMetadata to reduce the similarity with SegmentsMetadata; Rename some methods

* Rename DruidCoordinatorHelper to CoordinatorDuty, refactor DruidCoordinator

* Unused import

* Optimize imports

* Rename IndexerSQLMetadataStorageCoordinator.getDataSourceMetadata() to retrieveDataSourceMetadata()

* Unused import

* Update terminology in datasource-view.tsx

* Fix label in datasource-view.spec.tsx.snap

* Fix lint errors in datasource-view.tsx

* Doc improvements

* Another attempt to please TSLint

* Another attempt to please TSLint

* Style fixes

* Fix IndexerSQLMetadataStorageCoordinator.createUsedSegmentsSqlQueryForIntervals() (wrong merge)

* Try to fix docs build issue

* Javadoc and spelling fixes

* Rename SegmentsMetadata to SegmentsMetadataManager, address other comments

* Address more comments
2020-01-27 11:24:29 -08:00
Gian Merlino 19b427e8f3
Add JoinableFactory interface and use it in the query stack. (#9247)
* Add JoinableFactory interface and use it in the query stack.

Also includes InlineJoinableFactory, which enables joining against
inline datasources. This is the first patch where a basic join query
actually works. It includes integration tests.

* Fix test issues.

* Adjustments from code review.
2020-01-24 13:10:01 -08:00
Clint Wylie 8011211a0c first/last aggregators and nulls (#9161)
* null handling for numeric first/last aggregators, refactor to not extend nullable numeric agg since they are complex typed aggs

* initially null or not based on config

* review stuff, make string first/last consistent with null handling of numeric columns, more tests

* docs

* handle nil selectors, revert to primitive first/last types so groupby v1 works...
2020-01-20 11:51:54 -08:00
Jihoon Son 84ff0d2352
Fix TSV bugs (#9199)
* working

* - support multi-char delimiter for tsv
- respect "delimiter" property for tsv

* default value check for findColumnsFromHeader

* remove CSVParser to have a true and only CSVParser

* fix tests

* fix another test
2020-01-17 15:35:14 -08:00
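To make the "multi-char delimiter" item in #9199 concrete, here is a minimal sketch of splitting a line on a literal delimiter of any length. It ignores quoting and escaping, and `DelimitedLineSketch` and `splitLine` are hypothetical names rather than Druid's parser API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; does not handle quoted fields or escapes.
final class DelimitedLineSketch {
  /** Splits a line on a literal delimiter, keeping empty trailing fields. */
  static List<String> splitLine(String line, String delimiter) {
    if (delimiter.isEmpty()) {
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    // Pattern.quote treats the delimiter as plain text, so "||" is not read as regex alternation.
    // The limit of -1 keeps trailing empty strings, so column counts stay stable.
    return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
  }

  public static void main(String[] args) {
    System.out.println(splitLine("a||b||", "||"));
  }
}
```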
Gian Merlino 448da78765 Speed up String first/last aggregators when folding isn't needed. (#9181)
* Speed up String first/last aggregators when folding isn't needed.

Examines the value column, and disables fold checking via a needsFoldCheck
flag if that column can't possibly contain SerializableLongStringPairs. This
is helpful because it avoids calling getObject on the value selector when
unnecessary; say, because the time selector didn't yield an earlier or later
value.

* PR comments.

* Move fastLooseChop to StringUtils.
2020-01-16 21:02:02 -08:00
Maytas Monsereenusorn 42359c93dd Implement ANY aggregator (#9187)
* Implement ANY aggregator

* Add copyright headers

* Add unit tests

* fix BufferAggregator

* Fix bug in BufferAggregator

* hook up the SQL command

* add check for buffer aggregator

* Address comment

* address comments

* add docs

* Address comments

* add more tests for numeric columns that have null values when run in sql compatible null mode

* fix checkstyle errors

* fix failing tests

* fix failing tests
2020-01-16 14:40:32 -08:00
Gian Merlino a87db7f353
Add HashJoinSegment, a virtual segment for joins. (#9111)
* Add HashJoinSegment, a virtual segment for joins.

An initial step towards #8728. This patch adds enough functionality to implement a joining
cursor on top of a normal datasource. It does not include enough to actually do a query. For
that, future patches will need to wire this low-level functionality into the query language.

* Fixups.

* Fix missing format argument.

* Various tests and minor improvements.

* Changes.

* Remove or add tests for unused stuff.

* Fix up package locations.
2020-01-16 13:14:20 -08:00
Jonathan Wei aa539177ec De-incubation cleanup in code, docs, packaging (#9108)
* De-incubation cleanup in code, docs, packaging

* remove unused docs script
2020-01-03 12:33:19 -05:00
Jonathan Wei 4e8368a5d9 Set version to 0.18.0-SNAPSHOT (#9109) 2020-01-02 17:55:10 -05:00
Jihoon Son 298425a33a
Fix handling interruptedException in resource pool (#9044) 2019-12-16 09:41:13 -08:00
Himanshu 45101183bc
HRTR: make pending task execution handling to go through all tasks on not finding worker slots (#8697)
* HRTR: make pending task execution handling to go through all tasks on
not finding worker slots

* make HRTR methods package private that are meant to be used only in HttpRemoteTaskRunnerResource

* mark HttpRemoteTaskRunnerWorkItem.State global variables final

* hrtr: move immutableWorker null check outside of try-catch; otherwise the finally block could throw an NPE

* add some explanatory comments

* add comment explaining the mechanics of handing off pending tasks, from submission to being picked up by a task execution thread

* fix spelling
2019-12-12 14:58:52 -08:00
Jonathan Wei 8af41d7cd0 Update version to 0.18.0-incubating-SNAPSHOT (#9009) 2019-12-11 14:04:03 -08:00
Chi Cao Minh 3de7ab8523 DataSketches jars in core (#9003)
Having DataSketches jars in core will allow potential improvements, for
example:
- Provide an alternative implementation of HLL:
  https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html
- Range partitioning for native parallel batch indexing without having
  the user load extensions on the classpath

Dev mailing list discussion:
https://lists.apache.org/thread.html/301410d71ff799cf616bf17c4ebcf9999fc30829f5fa62909f403e6c%40%3Cdev.druid.apache.org%3E
2019-12-10 14:02:34 -08:00
Chi Cao Minh bab78fc80e Parallel indexing single dim partitions (#8925)
* Parallel indexing single dim partitions

Implements single dimension range partitioning for native parallel batch
indexing as described in #8769. This initial version requires the
druid-datasketches extension to be loaded.

The algorithm has 5 phases that are orchestrated by the supervisor in
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`.
These phases and the main classes involved are described below:

1) In parallel, determine the distribution of dimension values for each
   input source split.

   `PartialDimensionDistributionTask` uses `StringSketch` to generate
   the approximate distribution of dimension values for each input
   source split. If the rows are ungrouped,
   `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter`
   uses a Bloom filter to skip rows that would be grouped. The final
   distribution is sent back to the supervisor via
   `DimensionDistributionReport`.

2) The range partitions are determined.

   In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the
   supervisor uses `StringSketchMerger` to merge the individual
   `StringSketch`es created in the preceding phase. The merged sketch is
   then used to create the range partitions.

3) In parallel, generate partial range-partitioned segments.

   `PartialRangeSegmentGenerateTask` uses the range partitions
   determined in the preceding phase and
   `RangePartitionCachingLocalSegmentAllocator` to generate
   `SingleDimensionShardSpec`s.  The partition information is sent back
   to the supervisor via `GeneratedGenericPartitionsReport`.

4) The partial range segments are grouped.

   In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`,
   the supervisor creates the `PartialGenericSegmentMergeIOConfig`s
   necessary for the next phase.

5) In parallel, merge partial range-partitioned segments.

   `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to
   retrieve the partial range-partitioned segments generated earlier and
   then merges and publishes them.

* Fix dependencies & forbidden apis

* Fixes for integration test

* Address review comments

* Fix docs, strict compile, sketch check, rollup check

* Fix first shard spec, partition serde, single subtask

* Fix first partition check in test

* Misc rewording/refactoring to address code review

* Fix doc link

* Split batch index integration test

* Do not run parallel-batch-index twice

* Adjust last partition

* Split ITParallelIndexTest to reduce runtime

* Rename test class

* Allow null values in range partitions

* Indicate which phase failed

* Improve asserts in tests
2019-12-09 23:05:49 -08:00
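Phase 2 of the algorithm in #8925 (turning a value distribution into range partitions) can be sketched as follows. This is a simplification under stated assumptions: an exact, already sorted list of values replaces the approximate `StringSketch` quantiles, and placing null in the first partition follows the "Allow null values in range partitions" item. `RangePartitionSketch`, `boundaries`, and `partitionOf` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: exact sorted values stand in for the sketch-based distribution.
final class RangePartitionSketch {
  /** Picks up to numPartitions - 1 cut points at evenly spaced ranks of the sorted values. */
  static List<String> boundaries(List<String> sortedValues, int numPartitions) {
    if (numPartitions < 1) {
      throw new IllegalArgumentException("numPartitions must be at least 1");
    }
    List<String> cuts = new ArrayList<>();
    if (sortedValues.isEmpty()) {
      return cuts;
    }
    for (int i = 1; i < numPartitions; i++) {
      int rank = (int) ((long) i * sortedValues.size() / numPartitions);
      String candidate = sortedValues.get(rank);
      // Skip repeated cut points so that no partition has an empty range.
      if (cuts.isEmpty() || !cuts.get(cuts.size() - 1).equals(candidate)) {
        cuts.add(candidate);
      }
    }
    return cuts;
  }

  /** Partition i covers [cuts[i-1], cuts[i]); null is assigned to partition 0. */
  static int partitionOf(String value, List<String> cuts) {
    if (value == null) {
      return 0;
    }
    int partition = 0;
    while (partition < cuts.size() && cuts.get(partition).compareTo(value) <= 0) {
      partition++;
    }
    return partition;
  }

  public static void main(String[] args) {
    List<String> cuts = boundaries(Arrays.asList("a", "b", "c", "d", "e", "f"), 3);
    System.out.println(cuts + " " + partitionOf("d", cuts));
  }
}
```

In the commit's design the per-split distributions are merged by the supervisor before boundaries are chosen; the sketch starts from the merged result.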
Clint Wylie 4327892b84 modify multi-value expression transformation behavior to not treat re-use of the same input as a candidate for cartesian mapping (#8957) 2019-12-09 20:38:15 -08:00
Rye ca77d576c6 add customize separator for TSV inputFormat (#8993)
* add customize separator for TSV inputFormat

* fix spotbug

* code refactor

* code refactor

* add argument check for delimiter

* refine null check

* add check that delimiter and listDelimiter cannot be the same

* add unit tests
2019-12-09 11:24:09 -08:00
Roman Leventov 1c62987783
Add SelfDiscoveryResource; rename org.apache.druid.discovery.No… (#6702)
* Add SelfDiscoveryResource

* Rename org.apache.druid.discovery.NodeType to NodeRole. Refactor CuratorDruidNodeDiscoveryProvider. Make SelfDiscoveryResource to listen to updates only about a single node (itself).

* Extended docs

* Fix brace

* Remove redundant throws in Lifecycle.Handler.stop()

* Import order

* Remove unresolvable link

* Address comments

* tmp

* tmp

* Rollback docker changes

* Remove extra .sh files

* Move filter

* Fix SecurityResourceFilterTest
2019-12-08 18:47:58 +03:00
Clint Wylie 06cd30460e
add query metrics for broker parallel merges, off by default (#8981)
* add a bunch of metrics for broker parallel merges, off by default, and tests

* fix tests

* review stuffs

* propagateIfPossible
2019-12-06 13:42:53 -08:00
Clint Wylie ca2a7a1f08 more flush timeout for emitter tests (#8991)
* more flush timeout for emitter tests

* share constant
2019-12-05 16:52:35 -08:00
Chi Cao Minh af74acaa85 Address security vulnerabilities CVSS >= 7 (#8980)
* Address security vulnerabilities CVSS >= 7

Update dependencies to address security vulnerabilities with CVSS scores
of 7 or higher. A new Travis CI job is added to prevent new
high/critical security vulnerabilities from being added.

Updated dependencies:
- api-util 1.0.0 -> 1.0.3
- jackson 2.9.10 -> 2.10.1
- kafka 2.1.0 -> 2.1.1
- libthrift 0.10.0 -> 0.13.0
- protobuf 3.2.0 -> 3.11.0

The following high/critical security vulnerabilities are currently
suppressed (so that the new Travis CI job can be added now) and are left
as future work to fix:
- hibernate-validator:5.2.5
- jackson-mapper-asl:1.9.13
- libthrift:0.6.1
- netty:3.10.6
- nimbus-jose-jwt:4.41.1

* Rename EDL1 license file

* Fix inspection errors
2019-12-05 14:34:35 -08:00
Clint Wylie 5ecdf94d83
add 'prefixes' support to google input source (#8930)
* add prefixes support to google input source, making it symmetrical-ish with s3

* docs

* more better, and tests

* unused

* formatting

* javadoc

* dependencies

* oops

* review comments

* better javadoc
2019-12-04 21:01:10 -08:00
Jihoon Son 86e8903523
Support orc format for native batch ingestion (#8950)
* Support orc format for native batch ingestion

* fix pom and remove wrong comment

* fix unnecessary condition check

* use flatMap back to handle exception properly

* move exceptionThrowingIterator to intermediateRowParsingReader

* runtime
2019-11-28 12:45:24 -08:00
jon-wei dfbc066163 Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1"
This reverts commit a0f21d9b07.
2019-11-27 23:22:43 -08:00
jon-wei 0402ff85b8 Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit 8ffa71e7e6.
2019-11-27 23:22:32 -08:00
jon-wei 8ffa71e7e6 [maven-release-plugin] prepare for next development iteration 2019-11-27 23:18:48 -08:00
jon-wei a0f21d9b07 [maven-release-plugin] prepare release druid-0.16.1-incubating-rc1 2019-11-27 23:18:37 -08:00
Clint Wylie 923c003213 add flush timeout to emitter test (#8963) 2019-11-27 19:30:09 -08:00
Atul Mohan a5b40a6099 Remove null handling check (#8960) 2019-11-27 12:09:33 -08:00
Chi Cao Minh fba876b607 Update jackson to 2.9.10 (#8940)
Addresses security vulnerabilities:

- sonatype-2016-0397:
  https://github.com/FasterXML/jackson-core/issues/315

- sonatype-2017-0355:
  https://github.com/FasterXML/jackson-core/pull/322
2019-11-26 21:41:14 -08:00
Clint Wylie 4458113375
S3 input source (#8903)
* add s3 input source for native batch ingestion

* add docs

* fixes

* checkstyle

* lazy splits

* fixes and hella tests

* fix it

* re-use better iterator

* use key

* javadoc and checkstyle

* exception

* oops

* refactor to use S3Coords instead of URI

* remove unused code, add retrying stream to handle s3 stream

* remove unused parameter

* update to latest master

* use list of objects instead of object

* serde test

* refactor and such

* now with the ability to compile

* fix signature and javadocs

* fix conflicts yet again, fix S3 uri stuffs

* more tests, enforce uri for bucket

* javadoc

* oops

* abstract class instead of interface

* null or empty

* better error
2019-11-25 22:31:19 -08:00
Jihoon Son a2e6de4b16 Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor (#8924)
* Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor

* Fix docs and javadoc

* Add unit tests for large or small estimated num splits

* add override
2019-11-23 01:38:08 -08:00