Commit Graph

10270 Commits

Author SHA1 Message Date
Clint Wylie cd31bcc093 un-exclude necessary parquet jackson dependencies instead of relying on curator (#8939) 2019-11-25 15:57:34 -08:00
Jihoon Son a2e6de4b16 Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor (#8924)
* Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor

* Fix docs and javadoc

* Add unit tests for large or small estimated num splits

* add override
2019-11-23 01:38:08 -08:00
Gian Merlino e0eb85ace7 Add FileUtils.createTempDir() and enforce its usage. (#8932)
* Add FileUtils.createTempDir() and enforce its usage.

The purpose of this is to improve error messages. Previously, the error
message on a nonexistent or unwritable temp directory would be
"Failed to create directory within 10,000 attempts".

* Further updates.

* Another update.

* Remove commons-io from benchmark.

* Fix tests.
2019-11-22 19:48:49 -08:00
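For the FileUtils.createTempDir() change above (#8932), the idea of failing early with a descriptive message can be sketched roughly as follows. This is a minimal illustration, not Druid's actual implementation: the class name `TempDirSketch`, the method signature, and the exact messages are assumptions made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch; names and messages are illustrative, not taken from Druid.
final class TempDirSketch {
    static File createTempDir(String prefix) {
        String parent = System.getProperty("java.io.tmpdir");
        if (parent == null) {
            throw new IllegalStateException("System property java.io.tmpdir is not set");
        }
        Path parentPath = Paths.get(parent);
        // Check the two common failure causes up front so the error names them.
        if (!Files.isDirectory(parentPath)) {
            throw new IllegalStateException("java.io.tmpdir (" + parent + ") does not exist");
        }
        if (!Files.isWritable(parentPath)) {
            throw new IllegalStateException("java.io.tmpdir (" + parent + ") is not writable");
        }
        try {
            return Files.createTempDirectory(parentPath, prefix).toFile();
        } catch (IOException e) {
            throw new IllegalStateException("Failed to create temporary directory in " + parent, e);
        }
    }
}
```

A caller that hits a misconfigured temp directory then sees which path is at fault instead of a generic retry-count message.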
Rye 0514e5686e add TsvInputFormat (#8915)
* add TsvInputFormat

* refactor code

* fix grammar

* use enum replace string literal

* code refactor

* code refactor

* mark abstract for base class meant not to be instantiated

* remove constructor for test
2019-11-22 18:01:40 -08:00
Clint Wylie 7250010388 add parquet support to native batch (#8883)
* add parquet support to native batch

* cleanup

* implement toJson for sampler support

* better binaryAsString test

* docs

* i hate spellcheck

* refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest

* add comment, fix some stuff

* adjustments

* fix accident

* tweaks
2019-11-22 10:49:16 -08:00
SeKing 9955107e8e RandomLocationSelectorStrategy to Choose an available disk(location) to store a segment. With unit tests. (#8461) 2019-11-22 03:46:54 -08:00
Jihoon Son 934547a215 RetryingInputEntity to retry on transient errors (#8923)
* RetryingInputEntity to retry on transient errors

* fix some javadoc and httpEntity

* Make it interface

* Javadoc for offset
2019-11-21 21:32:18 -08:00
Jonathan Wei dc6178d1f2 Upgrade Calcite to 1.21 (#8566)
* Upgrade Calcite to 1.21

* Checkstyle, test fix

* Exclude calcite yaml deps, update license.yaml

* Add method for exception chain handling

* Checkstyle

* PR comments, Add outer limit context flag

* Revert project settings change

* Update subquery test comment

* Checkstyle fix

* Fix test in sql compat mode

* Fix test

* Fix dependency analysis

* Address PR comments

* Checkstyle

* Adjust testSelectStarFromSelectSingleColumnWithLimitDescending
2019-11-20 21:22:55 -08:00
Chi Cao Minh ff6217365b Refactor parallel indexing perfect rollup partitioning (#8852)
* Refactor parallel indexing perfect rollup partitioning

Refactoring to make it easier to later add range partitioning for
perfect rollup parallel indexing. This is accomplished by adding several
new base classes (e.g., PerfectRollupWorkerTask) and new classes for
encapsulating logic that needs to be changed for different partitioning
strategies (e.g., IndexTaskInputRowIteratorBuilder).

The code is functionally equivalent to before except for the following
small behavior changes:

1) PartialSegmentMergeTask: Previously, this task had a priority of
   DEFAULT_TASK_PRIORITY. It now has a priority of
   DEFAULT_BATCH_INDEX_TASK_PRIORITY (via the new PerfectRollupWorkerTask
   base class), since it is a batch index task.

2) ParallelIndexPhaseRunner: A decorator was added to
   subTaskSpecIterator to ensure the subtasks are generated with unique
   ids. Previously, only tests (i.e., MultiPhaseParallelIndexingTest)
   would have this decorator, but this behavior is desired for non-test
   code as well.

* Fix forbidden apis and pmd warnings

* Fix analyze dependencies warnings

* Fix IndexTask json and add IT diags

* Fix parallel index supervisor<->worker serde

* Fix TeamCity inspection errors/warnings

* Fix TeamCity inspection errors/warnings again

* Integrate changes with those from #8823

* Address review comments

* Address more review comments

* Fix forbidden apis

* Address more review comments
2019-11-20 17:24:12 -08:00
Jihoon Son ac6d703814 Support inputFormat and inputSource for sampler (#8901)
* Support inputFormat and inputSource for sampler

* Cleanup javadocs and names

* fix style

* fix timed shutoff input source reader

* fix timed shutoff input source reader again

* tidy up timed shutoff reader

* unused imports

* fix tc
2019-11-20 14:51:25 -08:00
Surekha d628bebbd7 Make supervisor API similar to submit task API (#8810)
* accept spec or dataSchema, tuningConfig, ioConfig while submitting task json

* fix test

* update docs

* lgtm warning

* Add original constructor back to IndexTask to minimize changes

* fix indentation in docs

* Allow spec to be specified in supervisor schema

* undo IndexTask spec changes

* update docs

* Add Nullable and deprecated annotations

* remove deprecated configs from SeekableStreamSupervisorSpec

* remove nullable annotation
2019-11-20 10:04:41 -08:00
Vadim Ogievetsky ee8f048381 Web console: rename Tasks to Ingestion (#8896)
* rename Tasks to Ingestion

* rename local storage key also

* align ordering
2019-11-20 06:53:40 -08:00
Clint Wylie d67c3c7aed document SQL compatible null handling mode (#8894)
* document SQL compatible null handling mode

* adjustments

* fix docs

* review changes
2019-11-20 06:52:20 -08:00
Clint Wylie 3fcaa1a61b fix sql compatible null handling config work with runtime.properties (#8876)
* fix sql compatible null handling config work with runtime.properties

* fix npe

* fix tests

* add friendly error

* comment, and friendlier still

* fix compile

* fix from merges
2019-11-20 03:55:29 -08:00
Atul Mohan f5fbd0bea0 Handle missing values for delimited text files when Nullhandling is enabled (#8779)
* Handle missing values

* Fix multi value tests

* Fix firehose tests

* Fix conflicts
2019-11-19 22:35:22 -08:00
Chi Cao Minh 4ae6466ae2 HDFS input source (#8899)
* HDFS input source

Add support for using HDFS as an input source. In this version, commas
or globs are not supported in HDFS paths.

* Fix forbidden api

* Address review comments
2019-11-19 22:19:39 -08:00
Clint Wylie 074a45219d add google cloud storage InputSource for native batch (#8907)
* add google cloud storage InputSource for native batch

* rename

* checkstyle

* fix

* fix spelling

* review comments
2019-11-19 19:49:43 -08:00
Jihoon Son baefc65f80 Retrying with a backward compatible task type on unknown task type error in parallel indexing (#8905)
* Retrying with a backward compatible task type on unknown task type error in parallel indexing

* Register legacy class; add a serde test
2019-11-19 19:29:25 -08:00
Rye d0913475b7 sampler returns nulls in CSV (#8871)
* sampler returns nulls in CSV

* fixed kafka sampler test

* fix Kinesis test

* sql compatibility fix

* remove null to empty string conversion, use null

* fix sql compatibility
2019-11-19 13:59:44 -08:00
Gian Merlino c44452f0c1 Tidy up lifecycle, query, and ingestion logging. (#8889)
* Tidy up lifecycle, query, and ingestion logging.

The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.

Concretely, this patch does the following:

- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
  goal of reducing redundancy and improving clarity by avoiding
  showing rarely-useful log messages. This includes most "starting"
  and "stopping" messages, and most messages related to individual
  columns.
- Adds new log4j2 templates that show operators how to enable DEBUG
  logging for certain important packages.
- Eliminates stack traces for query errors, unless log level is DEBUG
  or more. This is useful because query errors often indicate user
  error rather than system error, but dumping stack traces often gave
  operators the impression that there was a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
  the default log4j2 configuration, this will put them in log lines
  as well. It's very useful if a user is using the Indexer, where
  multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
  segments that are handed-off together by Kafka ingestion) and
  "offsets" (cursors in partitions). These terms had been confused in
  some log messages due to the fact that Kinesis calls offsets
  "sequence numbers".
- Replaces some ugly toString calls with either the JSONification or
  something more operator-accessible (like a URL or segment identifier,
  instead of JSON object representing the same).

* Adjustments.

* Adjust integration test.
2019-11-19 13:57:58 -08:00
Surekha cf6643eb9a add sequenceName and currentCheckPoint for backwards compatibility (#8864)
* add sequenceName and currentCheckPoint for backwards compatibility

* Add serde unit test in kafka

* fix checkstyle

* add hashcode

* update javadoc
2019-11-19 13:11:31 -08:00
Chi Cao Minh 8365bdf62a Address security vulnerabilities (#8878)
* Address security vulnerabilities

Security vulnerabilities addressed by upgrading 3rd party libs:

- Upgrade avro-ipc to 1.9.1
  - sonatype-2019-0115
- Upgrade caffeine to 2.8.0
  - sonatype-2019-0282
- Upgrade commons-beanutils to 1.9.4
  - CVE-2014-0114
- Upgrade commons-codec to 1.13
  - sonatype-2012-0050
- Upgrade commons-compress to 1.19
  - CVE-2019-12402
  - sonatype-2018-0293
- Upgrade hadoop-common to 2.8.5
  - CVE-2018-11767
- Upgrade hadoop-mapreduce-client-core to 2.8.5
  - CVE-2017-3166
- Upgrade hibernate-validator to 5.2.5
  - CVE-2017-7536
- Upgrade httpclient to 4.5.10
  - sonatype-2017-0359
- Upgrade icu4j to 55.1
  - CVE-2014-8147
- Upgrade jackson-databind to 2.6.7.3
  - CVE-2017-7525
- Upgrade jetty-http to 9.4.12
  - CVE-2017-7657
  - CVE-2017-7658
  - CVE-2017-7656
  - CVE-2018-12545
- Upgrade log4j-core to 2.8.2
  - CVE-2017-5645
- Upgrade netty to 3.10.6
  - CVE-2015-2156
- Upgrade netty-common to 4.1.42
  - CVE-2019-9518
- Upgrade netty-codec-http to 4.1.42
  - CVE-2019-16869
- Upgrade nimbus-jose-jwt to 4.41.1
  - CVE-2017-12972
  - CVE-2017-12974
- Upgrade plexus-utils to 3.0.24
  - CVE-2017-1000487
  - sonatype-2015-0173
  - sonatype-2016-0398
- Upgrade postgresql to 42.2.8
  - CVE-2018-10936

Note that if users are using JDBC lookups with postgres, they may need
to update the JDBC jar used by the lookup extension.

* Fix license for postgresql
2019-11-19 09:14:33 -08:00
Vadim Ogievetsky 98580ffe71 adding search to docs (#8906) 2019-11-19 07:04:40 -08:00
Jonathan Wei 4bdb890f1b Add license header for LGTM yaml config file (#8902) 2019-11-18 18:26:45 -08:00
Gian Merlino f1399037a6 LGTM: Skip console during Java build. (#8900)
* LGTM: Skip console during Java build.

* More nesting.

* One-line command.

* Avoid using semmle_data.
2019-11-18 15:42:13 -08:00
Atul Mohan 8515a03c6b Modify batch index task naming to accommodate simultaneous tasks (#8612)
* Change hadoop task naming

* Remove unused

* Add timestamp

* Fix build
2019-11-18 15:07:16 -08:00
Chi Cao Minh d60978343a Improve missing JDBC driver error for lookups (#8872)
If the JDBC drivers are missing from the lookup extensions, throw an
exception that directs the user how to resolve the issue. This change is
a follow up to #8825.
2019-11-18 11:42:38 -08:00
Rye ea8e4066f6 Use earliest offset on kafka newly discovered partitions (#8748)
* Use earliest offset on kafka newly discovered partitions

* resolve conflicts

* remove redundant check cases

* simplified unit tests

* change test case

* rewrite comments

* add regression test

* add junit ignore annotation

* minor modifications

* indent

* override testableKafkaSupervisor and KafkaRecordSupplier to make the test runnable

* modified test constructor of kafkaRecordSupplier

* simplify

* delegated constructor
2019-11-18 11:05:31 -08:00
Vadim Ogievetsky 80fc04be71 bump typescript (#8890) 2019-11-17 16:23:47 -08:00
Vadim Ogievetsky 17d773dca2 Web console: replace (and remove) old consoles (#8838)
* first steps

* clean licenses

* fix capabilities

* fix specs

* more tests

* new web console on coordinator and overlord, remove setup for old consoles, old configs

* better message

* update licenses

* sync license files

* more button

* fix tslint issue

* jetty-rewrite dependency to add redirects for old console paths

* put dependency in the right place

* fix overlord detection

* fix notices, dedupe licenses

* make segment timeline work in no SQL mode

* update license

* revert hard coded coordinator mode from testing

* update restricted mode copy
2019-11-15 19:45:14 -08:00
Clint Wylie 7fa3182fe5 refactor InputFormat and InputEntityReader implementations (#8875)
* refactor InputFormat and InputReader to supply InputEntity and temp dir to constructors instead of read/sample

* fix style
2019-11-15 17:08:26 -08:00
Jihoon Son 1611792855 Add InputSource and InputFormat interfaces (#8823)
* Add InputSource and InputFormat interfaces

* revert orc dependency

* fix dimension exclusions and failing unit tests

* fix tests

* fix test

* fix test

* fix firehose and inputSource for parallel indexing task

* fix tc

* fix tc: remove unused method

* Formattable

* add needsFormat(); renamed to ObjectSource; pass metricsName for reader

* address comments

* fix closing resource

* fix checkstyle

* fix tests

* remove verify from csv

* Revert "remove verify from csv"

This reverts commit 1ea7758489.

* address comments

* fix import order and javadoc

* flatMap

* sampleLine

* Add IntermediateRowParsingReader

* Address comments

* move csv reader test

* remove test for verify

* adjust comments

* Fix InputEntityIteratingReader

* rename source -> entity

* address comments
2019-11-15 09:22:09 -08:00
Gian Merlino ce4ee42459 Fix LIKE filter wildcards to match newlines. (#8863) 2019-11-13 23:00:54 -08:00
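The LIKE fix above (#8863) can be illustrated with a small sketch. It assumes the LIKE pattern is evaluated by translating it to a `java.util.regex.Pattern`, which the commit title does not state; the class `LikeNewlineSketch` and its `likeToRegex` helper are hypothetical and ignore escape characters. In Java regexes, `.` skips line terminators unless `Pattern.DOTALL` is set, which is the kind of behavior the title describes.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: '%' becomes ".*", '_' becomes ".", everything else is quoted.
final class LikeNewlineSketch {
    static Pattern likeToRegex(String like, boolean dotAll) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Without DOTALL, '.' does not match '\n', so wildcards stop at newlines.
        return Pattern.compile(regex.toString(), dotAll ? Pattern.DOTALL : 0);
    }
}
```

With this sketch, `likeToRegex("a%z", false)` does not match the string "a\nz", while `likeToRegex("a%z", true)` does.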
Rye 00f6a56370 Use RFC4180Parser as CSVParser (#8803)
* Use RFC4180Parser as CSVParser, add unit test

* change test file location, use assertEquals
2019-11-13 12:44:37 -08:00
Clint Wylie cc54b2a9df support for array expressions in TransformSpec with ExpressionTransform (#8744)
* transformSpec + array expressions

changes:
* added array expression support to transformSpec
* removed ParseSpec.verify since its only use afaict was preventing transform expr that did not replace their input from functioning
* hijacked index task test to test changes

* remove docs about being unsupported

* re-arrange test assert

* unused imports

* imports

* fix tests

* preserve types

* suppress warning, fixes, add test

* formatting

* cleanup

* better list to array type conversion and tests

* fix oops
2019-11-13 11:04:37 -08:00
Clint Wylie 9ed9a80b9d optimize numeric column null value checking for low filter selectivity (more rows) (#8822)
* use peekable iterator for numeric column selector null checking instead of bitmap.get for those sweet sweet nanoseconds

* remove unused method

* slight optimization i think

* remove clone from wrappers since we do not use and is confusing

* fixes and tests

* int instead of Integer

* fix it

* fixes, more tests

* fix
2019-11-13 10:53:46 -08:00
fst0 80dbf44fca Add reference to druid.storage.type (#8857)
* Add reference to `druid.storage.type`

This should be in here. Without setting storage type to S3 globally it will obviously not be used, even if all other parameters are correct.

* Update s3.md

Add global storage parameter to knob table.

* Update s3.md
2019-11-13 10:03:41 -08:00
chencb cc2bdb5f51 Fix hadoop task jdk11 compatible (#8799)
* Fix hadoop task jdk11 compatible

* Fix HadoopTaskTest
2019-11-13 02:32:46 -08:00
Evan Ren 8cb213aa9f Web console: Fix missing include future flag for byPeriod rules (#8859)
* Add missing button for include future, and handle logic for default true case

* Remove duplicate go to tasks button

* Fix lgtm issue

* Revert changes on old console

* Made changes based on PR comments
2019-11-12 20:34:30 -08:00
Lucas Capistrant a066cc5648 Fix groupMapping endpoint URIs in druid-basic-security doc (#8847) 2019-11-12 21:12:34 +05:30
Vadim Ogievetsky df2f77c58d Web console: better json-input feedback (#8851)
* better json-input feedback

* seamless Hjson

* fix tests
2019-11-11 17:06:03 -08:00
Vadim Ogievetsky e9e1625e96 Docs: Add docsearch version (#8850)
* Add docsearch version

* remove open sans stylesheet
2019-11-09 19:51:06 -08:00
Jonathan Wei 75ea0d592a Add more datasketches doubles sketch SQL functions (#8843)
* Add more datasketches doubles sketch SQL postaggs

* style and lgtm
2019-11-08 18:05:06 -08:00
Gian Merlino 0e8c3f74d0 SQL: EARLIEST, LATEST aggregators. (#8815)
* SQL: EARLIEST, LATEST aggregators.

I chose these names instead of FIRST, LAST because those are already
reserved functions in Calcite that mean something different. I think
these are also better names anyway.

* Finalify.

* SQL updates.

* Adjust aggregator calls.

* Validations, test updates.

* Review docs.
2019-11-08 16:29:25 -08:00
Vadim Ogievetsky 6eacaf446f Use more efficient tasks API (#8844) 2019-11-08 08:21:53 -08:00
Gian Merlino c204d68376 Fixes, adjustments to numeric null handling and string first/last aggregators. (#8834)
There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.

Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.

This patch has three goals:

1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast
   aggregators while fixing null-handling bugs. I thought about splitting this
   into its own patch, but it ended up being tough to split from the
   null-handling fixes.

For (1) the fixes are,

- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
  calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
  -1 as a sigil value for null, to differentiate nulls and empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
  eval.asBoolean() to avoid calling getLong on the selector after already
  calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
  on isNull. Also, refactored slightly to avoid the overhead of calling
  getObject followed by another getter (see BloomFilterAggregatorFactory for
  part of this).

For (2) the main changes are,

- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullableNumericAggregatorFactory to emphasize
  that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.

For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,

- Fixed buffer overrun in the buffer aggregators when some characters in the string
  code into more than one byte (the old code used "substring" to apply a byte limit,
  which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
  behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
  xFold versions into aliases. The adaptiveness is similar to how other aggregators
  like hyperUnique work.
2019-11-07 17:46:59 -08:00
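The buffer overrun described in item (3) of the commit above (#8834) comes from treating a character count as a byte count. The sketch below contrasts the two approaches under stated assumptions: `Utf8LimitSketch` and both method names are hypothetical, and the encoder-based version is one plausible way to enforce a byte limit, not necessarily how StringUtils.toUtf8WithLimit is written.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not Druid's actual StringUtils code.
final class Utf8LimitSketch {
    // Encodes as many whole characters as fit into maxBytes and drops the rest.
    static byte[] encodeWithByteLimit(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        // The encoder stops with an overflow result instead of writing a partial character.
        encoder.encode(CharBuffer.wrap(s), out, true);
        out.flip();
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    // The problematic pattern: substring limits characters, not bytes.
    static byte[] naiveSubstring(String s, int maxBytes) {
        return s.substring(0, Math.min(s.length(), maxBytes)).getBytes(StandardCharsets.UTF_8);
    }
}
```

For the three-character string "\u00e9\u00e9\u00e9" and a limit of 3, the naive version yields 6 bytes (each character encodes to two bytes), while the encoder-based version yields 2 bytes and stays within the limit.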
Evan Ren b03aa060bd Web console: Interval input component (#8777)
* Created temporary interval input component

* Make reusable interval component

* Fixed errors with typing invalid dates

* Fix interval input styling and place into autoform

* Fix styling of popover calendar that opens off the page

* Add snapshot test and change interval to required props

* Add functionality to enter hours minutes second

* Fix min date limit

* Remove console log

* Fix difference in timezone

* Update snapshot test

* Fixed snapshot test without changing min max date

* Made changes based on discussion before converting to hooks

* Rewrote using hooks and deleted duplicate states

* Remove unused states

* Change sql query view numbers to monospace

* Made changes based on discussion

* Removed duplicate state
2019-11-07 13:07:17 -08:00
Clint Wylie 7aafcf8bca parallel broker merges on fork join pool (#8578)
* sketch of broker parallel merges done in small batches on fork join pool

* fix non-terminating sequences, auto compute parallelism

* adjust benches

* adjust benchmarks

* now hella more faster, fixed dumb

* fix

* remove comments

* log.info for debug

* javadoc

* safer block for sequence to yielder conversion

* refactor LifecycleForkJoinPool into LifecycleForkJoinPoolProvider which wraps a ForkJoinPool

* smooth yield rate adjustment, more logs to help tune

* cleanup, less logs

* error handling, bug fixes, on by default, more parallel, more tests

* remove unused var

* comments

* timeboundary mergeFn

* simplify, more javadoc

* formatting

* pushdown config

* use nanos consistently, move logs back to debug level, bit more javadoc

* static terminal result batch

* javadoc for nullability of createMergeFn

* cleanup

* oops

* fix race, add docs

* spelling, remove todo, add unhandled exception log

* cleanup, revert unintended change

* another unintended change

* review stuff

* add ParallelMergeCombiningSequenceBenchmark, fixes

* hyper-threading is the enemy

* fix initial start delay, lol

* parallelism computer now balances partition sizes to partition counts using sqrt of sequence count instead of sequence count by 2

* fix those important style issues with the benchmarks code

* lazy sequence creation for benchmarks

* more benchmark comments

* stable sequence generation time

* update defaults to use 100ms target time, 4096 batch size, 16384 initial yield, also update user docs

* add jmh thread based benchmarks, cleanup some stuff

* oops

* style

* add spread to jmh thread benchmark start range, more comments to benchmarks parameters and purpose

* retool benchmark to allow modeling more typical heterogeneous heavy workloads

* spelling

* fix

* refactor benchmarks

* formatting

* docs

* add maxThreadStartDelay parameter to threaded benchmark

* why does catch need to be on its own line but else doesnt
2019-11-07 11:58:46 -08:00
Zhenxiao Luo a9aa416c3d In DirectDruidClient, don't run Future cancellation listener in… (#8700)
* In DirectDruidClient, don't run Future cancellation listener in HTTP library executor

* extract cancelQuery as a method of DirectDruidClient

* Fix testCancel

* Add exception as the first argument to log.error
2019-11-07 21:12:18 +03:00
Zhenxiao Luo fca23d0c32 use copy-on-write list in InMemoryAppender (#8808)
* use copy-on-write synchronized list in InMemoryAppender

* use copy-on-write list in InMemoryAppender

* Fix comment
2019-11-07 21:11:40 +03:00
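The copy-on-write change above (#8808) can be sketched as follows. The class `InMemoryLogSketch` is a hypothetical stand-in for an in-memory appender and does not reproduce InMemoryAppender itself; it only shows why `java.util.concurrent.CopyOnWriteArrayList` is convenient when one thread appends log messages while another reads them.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of an in-memory message store; not the real InMemoryAppender.
final class InMemoryLogSketch {
    // Every write copies the backing array, so readers iterate over a stable snapshot.
    private final List<String> messages = new CopyOnWriteArrayList<>();

    void append(String message) {
        messages.add(message);
    }

    List<String> getMessages() {
        return messages;
    }
}
```

Iterating over `getMessages()` while calling `append` does not throw ConcurrentModificationException, because the iterator sees the list as it was when iteration started. The trade-off is that each write copies the array, which suits a store that is read far more often than it is written.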