10007 Commits

Author SHA1 Message Date
Gian Merlino 66657012bf Replace CaseFilteredAggregatorRule with Calcite equivalent. (#9113)
AggregateCaseToFilterRule was added to Calcite in https://issues.apache.org/jira/browse/CALCITE-3144,
and was originally copied from Druid's CaseFilteredAggregatorRule. So there isn't a good reason to
keep using our version.
2020-01-04 19:11:18 -08:00
Suneet Saldanha bdd0d0d8a5 Add avro dependency to parquet extension (#9124)
* Add avro dependency to parquet extension

If the parquet extension is loaded and an ingestionSpec uses the older format,
specifying a 'parser' instead of an 'inputFormat', the job fails
with the following error:

java.lang.TypeNotPresentException: Type org.apache.avro.generic.GenericRecord not present

This change removes the exclusion of the avro package so that the missing
class can be found.

* Address review comments and add dependency version
2020-01-03 20:11:13 -06:00
Jonathan Wei aa539177ec De-incubation cleanup in code, docs, packaging (#9108)
* De-incubation cleanup in code, docs, packaging

* remove unused docs script
2020-01-03 12:33:19 -05:00
Gian Merlino eb124a3068 Fix DistinctCountGroupByQueryTest Y2020 bug. (#9120)
It used data with the current timestamp alongside a query that had an end
instant of 2020-01-01.
2020-01-02 21:10:32 -05:00
Jonathan Wei 4e8368a5d9 Set version to 0.18.0-SNAPSHOT (#9109) 2020-01-02 17:55:10 -05:00
Gian Merlino 18eb456fe6 S3: Improvements to prefix listing (including fix for an infinite loop) (#9098)
* S3: Improvements to prefix listing (including fix for an infinite loop)

1) Fixes #9097, an infinite loop that occurs when more than one batch
   of objects is retrieved during a prefix listing.

2) Removes the Access Denied fallback code added in #4444. I don't think
   the behavior is reasonable: its purpose is to fall back from a prefix
   listing to a single-object access, but it's only activated when the
   end user supplied a prefix, so it would be better to simply fail, so
   the end user knows that their request for a prefix-based load is not
   going to work. Presumably the end user can switch from supplying
   'prefixes' to supplying 'uris' if desired.

3) Filters out directory placeholders when walking prefixes.

4) Splits LazyObjectSummariesIterator into its own class and adds tests.

* Adjust S3InputSourceTest.

* Changes from review.

* Include hamcrest-core.
2019-12-31 19:06:49 -05:00
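Note on #9098: the batched prefix listing described above can be pictured as a lazy iterator that fetches the next batch only once the current one is used up, and that advances its continuation token on every fetch. The following is a minimal sketch under stated assumptions, not Druid's `LazyObjectSummariesIterator` and not the AWS SDK: `Page` and `LazyListingIterator` are hypothetical names, the fetcher is a plain function from token to page, and directory placeholders are assumed to be keys ending in `/`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** One batch of keys plus the token for the next batch (null when done). Hypothetical type. */
final class Page {
    final List<String> keys;
    final String nextToken;

    Page(List<String> keys, String nextToken) {
        this.keys = keys;
        this.nextToken = nextToken;
    }
}

/** Lazily walks a paginated listing and skips directory placeholders. Hypothetical type. */
final class LazyListingIterator implements Iterator<String> {
    // Maps a continuation token (null for the first batch) to a page.
    private final Function<String, Page> fetcher;
    private final Deque<String> buffer = new ArrayDeque<>();
    private String token = null;
    private boolean exhausted = false;

    LazyListingIterator(Function<String, Page> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public boolean hasNext() {
        // Keep fetching until a real object shows up or the listing ends.
        while (buffer.isEmpty() && !exhausted) {
            Page page = fetcher.apply(token);
            for (String key : page.keys) {
                if (!key.endsWith("/")) { // assumed placeholder convention
                    buffer.add(key);
                }
            }
            // Advance the token; re-using the old one would fetch the same batch forever.
            token = page.nextToken;
            exhausted = (token == null);
        }
        return !buffer.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.poll();
    }
}
```

Keeping the token update next to the fetch makes the multi-batch case easy to unit test with a fake fetcher.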
Suneet Saldanha dec619ebf4 Optimize CachingLocalSegmentAllocator#getSequenceName (#8909)
* Optimize CachingLocalSegmentAllocator#getSequenceName

Replace StringUtils#format with string addition to generate the sequence
name for an interval and partition. This is faster because format uses a
Matcher under the covers to replace the string format with the variables.

* fix imports and add test

* Add comment about optimization

* Use renamed function for TaskToolbox

* Move tests after refactor

* Rename tests
2019-12-23 18:33:22 -08:00
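Note on #8909: the optimization above swaps a format call for plain concatenation. A small sketch of the two styles follows; `String.format` stands in for Druid's `StringUtils#format`, and the `prefix_interval_partition` layout and the class name `SequenceNames` are assumptions made for illustration only.

```java
import java.util.Locale;

final class SequenceNames {
    /** Format-based variant: the format string has to be parsed on every call. */
    static String withFormat(String prefix, String interval, int partition) {
        return String.format(Locale.ENGLISH, "%s_%s_%d", prefix, interval, partition);
    }

    /** Concatenation variant: same result, no format-string parsing. */
    static String withConcat(String prefix, String interval, int partition) {
        return prefix + "_" + interval + "_" + partition;
    }
}
```

Both methods are meant to return the same string, so the swap is safe as long as the format string contains only simple `%s` and `%d` placeholders.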
Vadim Ogievetsky 320c50d24a Web console: fix spec reset (#9081)
* extract spec type

* better text

* better copy

* de incubate the console

* fix status dialog scss
2019-12-23 18:23:14 -08:00
Samarth Jain 9ec9619143 Handle null values for metrics in TDigest aggregators. (#9073)
Add support for rollup during ingestion.
2019-12-23 17:49:06 -08:00
Vadim Ogievetsky a24e2f347f make supervisor statistics dialog more robust (#9089) 2019-12-23 17:43:08 -08:00
Benedict Jin 7a7c948595 Exclude .asf.yaml from the configuration of the rat plugin (#9088) 2019-12-23 13:08:23 -08:00
Fangjin Yang 2231e69b7f Update README.md 2019-12-20 20:56:53 -08:00
Chi Cao Minh 513bb1f6da Get proper Kinesis index task AWS credentials (#9082)
Previously, the configured S3 credentials would be used instead of the
ones configured for Kinesis for Kinesis index tasks.
2019-12-20 19:35:05 -08:00
Gian Merlino 342107b4c2 Add .asf.yaml. (#9083)
Based on the docs at https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories.
2019-12-20 16:45:38 -08:00
Clint Wylie 8ccce9857a fix vectorized query engine numeric filter matchers against null values (#9063)
* fix druid-sql issue with filtering numeric columns by null values

* fix vector numeric column matchers to check null vector for null matches
2019-12-20 13:15:48 -08:00
Fangjin Yang 60d896a67c Update README.md 2019-12-19 22:32:08 -08:00
Clint Wylie c2e9ab8100 benchmark schema with numeric dimensions and null column values (#9036)
* benchmark schema with null column values

* oops

* adjustments

* rename again, different null percentage so rows more variety

* more schema
2019-12-19 17:45:19 -08:00
Jihoon Son 3c31493772 Add missing docs for http client configurations (#9054)
* Add missing docs for http client configurations

* fix typo

* backticks
2019-12-19 17:41:04 -08:00
Suneet Saldanha 3c13444167 Fix flaky ITBasicAuthConfigurationTest (#9072)
This test was failing to authenticate using the admin credentials. These
should be available by default in the metadata store. This indicates that
the credentials are not successfully being synced before the test is run.

This change increases the number of retries to 20 so that the services
are synced before the test runs.
2019-12-19 17:38:55 -08:00
Suneet Saldanha 176bc8fd97 Remove resolve-ip dependency for integration-tests (#9065)
* Remove resolve-ip dependency for integration-tests

* use host hostname and fallback to dscacheutil

* better shell script comparisons
2019-12-19 14:53:36 -08:00
Fangjin Yang 256b8f69b6 Update README.md (#9078) 2019-12-19 13:00:27 -08:00
Fangjin Yang d20d2ff71d Update README.md (#9077) 2019-12-19 11:54:14 -08:00
Fangjin Yang de18f76c8b Update README.md (#9074)
Updates to readme
2019-12-19 11:39:27 -08:00
Clint Wylie 84ef8b819e fix druid-sql issue with filtering numeric columns by null values (#9061)
* fix druid-sql issue with filtering numeric columns by null values

* fix tests

* fix tests for reals
2019-12-18 13:30:34 -08:00
Jihoon Son 94a23fb17e Fix flaky realtime index task tests (#8999)
* Fix flaky realtime index task tests

* fix ITAppenderatorDriverRealtimeIndexTaskTest

* fix comment

* address comments
2019-12-18 13:25:00 -08:00
Jonathan Wei 15884f6d10 Fix hadoop ingestion property handling when using indexers (#9059) 2019-12-18 12:13:19 -08:00
Jonathan Wei b1547a76b1 Update GPG key instructions for ASF release guide (#9006) 2019-12-18 12:12:48 -08:00
Suneet Saldanha 1fb93d56c3 Add instructions to backport a PR (#9052)
* Add instructions to backport a PR

* Clearer image

* Add period in backport instructions
2019-12-18 11:57:01 -08:00
Chi Cao Minh 6178f05da6 Fail superbatch range partition multi dim values (#9058)
* Fail superbatch range partition multi dim values

Change the behavior of parallel indexing range partitioning to fail
ingestion if any row had multiple values for the partition dimension.
After this change, the behavior matches that of hadoop indexing.
(Previously, rows with multiple dimension values would be skipped.)

* Improve err msg, rename method, rename test class
2019-12-18 10:14:03 -08:00
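Note on #9058: the behavior change above (fail instead of skip) can be sketched as a small check on each row. This is an illustration only: the row is modeled as a map from dimension name to a list of values, `PartitionDimensionCheck` is a hypothetical name, and returning null for a missing dimension is an assumption rather than something this commit states.

```java
import java.util.List;
import java.util.Map;

final class PartitionDimensionCheck {
    /**
     * Returns the single value of the partition dimension, or null if the row
     * has no value for it (assumed behavior). Throws if the row carries more
     * than one value, so the ingestion fails rather than silently skipping it.
     */
    static String singleValueOrFail(Map<String, List<String>> row, String dimension) {
        List<String> values = row.get(dimension);
        if (values == null || values.isEmpty()) {
            return null;
        }
        if (values.size() > 1) {
            throw new IllegalArgumentException(
                "Cannot partition on multi-value dimension [" + dimension + "]: " + values);
        }
        return values.get(0);
    }
}
```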
Jonathan Wei 131b3f13be Skip non-Apache repo PRs in milestone tagging script (#9064) 2019-12-17 18:28:11 -08:00
Vadim Ogievetsky e7b1653d88 add button to reapply retention rules (#9055) 2019-12-17 18:08:57 -08:00
Benedict Jin 24be558347 Fix NPE for subquery with limit (#8775)
* Fix NPE for subquery with limit

* Mark it as unplannable by returning null

* Migrate testcases from SqlResourceTest to CalciteQueryTest

* Throw CannotBuildQueryException

* Fix typo

* Patch comments
2019-12-17 10:21:12 -08:00
Suneet Saldanha 301c0649a7 Fix equalsAndHashCode in ClientCompactQueryTuningConfig (#9035)
* Fix equalsAndHashCode in ClientCompactQueryTuningConfig

This change introduces a dependency to EqualsVerifier for the test scope.
The dependency is licensed under Apache 2. The library makes it trivial
to add equals and hashCode checks to prevent bugs like this from happening
in the future

* fix checkstyle

* fix test name
2019-12-16 14:33:00 -08:00
Jihoon Son 298425a33a Fix handling interruptedException in resource pool (#9044) 2019-12-16 09:41:13 -08:00
Clint Wylie bc16ff5e7c sql auto limit wrapping fix (#9043)
* sql auto limit wrapping fix

* fix tests and style

* remove setImportance
2019-12-16 01:38:24 -08:00
Clint Wylie 6881535b48 docs - clarify cache parameters (#9020) 2019-12-13 16:53:45 -08:00
Gian Merlino d452cbbb82 GenericIndexedWriter: Fix issue when writing large values to large columns. (#9029) 2019-12-13 15:33:14 -08:00
Suneet Saldanha 3325da1718 Allow startup scripts to specify java home (#9021)
* Allow startup scripts to specify java home

The startup scripts now look for java in 3 locations, ordered from the
most Druid-specific to the least, i.e.
    ${DRUID_JAVA_HOME}
    ${JAVA_HOME}
    ${PATH}

* Update fn names and clean up code

* final round of fixes

* fix spellcheck
2019-12-12 21:36:00 -08:00
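Note on #9021: the lookup order above is implemented in shell in the startup scripts; the sketch below restates the same precedence in Java purely for illustration. `JavaLocator` is a hypothetical name, the `bin/java` suffix assumes the usual JDK layout, and the sketch does not check that the file exists or is executable, which a real script would likely do.

```java
import java.util.Map;

final class JavaLocator {
    /** Returns the java command to run, preferring the most Druid-specific setting. */
    static String resolve(Map<String, String> env) {
        for (String var : new String[] {"DRUID_JAVA_HOME", "JAVA_HOME"}) {
            String home = env.get(var);
            if (home != null && !home.isEmpty()) {
                return home + "/bin/java";
            }
        }
        // Neither variable is set: leave it to the OS to find java on the PATH.
        return "java";
    }
}
```

Passing the environment in as a map (for example `System.getenv()`) keeps the precedence logic testable without touching the real environment.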
Fangyuan Deng 41f30e53a6 [bugfix]fix getAvgSizePerGranularity logic in DerivativeDataSourceManager(materializedview) (#8929)
* fix getAvgSizePerGranularity in DerivativeDataSourceManager

* revert

* redo
2019-12-12 17:27:02 -08:00
Himanshu 9236dd9467 optionally enable Jetty ForwardedRequestCustomizer (#9010)
* optionally enable Jetty ForwardedRequestCustomizer

* fix doc build
2019-12-12 17:00:08 -08:00
Himanshu 45101183bc HRTR: make pending task execution handling to go through all tasks on not finding worker slots (#8697)
* HRTR: make pending task execution handling to go through all tasks on
not finding worker slots

* make HRTR methods package private that are meant to be used only in HttpRemoteTaskRunnerResource

* mark HttpRemoteTaskRunnerWorkItem.State global variables final

* hrtr: move immutableWorker NULL check outside of try-catch or finally block could have NPE

* add some explanatory comments

* add comment explaining the mechanics of handing off pending tasks, from submission until they are picked up by a task execution thread

* fix spelling
2019-12-12 14:58:52 -08:00
Xavier Léauté 810b85a352 allow druid.host to be undefined to use canonical hostname (#9019)
It is currently not possible to unset the druid.host property in the docker image to let Druid default to the canonical hostname. It always gets set to the container's IP address. Passing the override environment variable druid_host= unfortunately does not solve the problem, as this gets interpreted as an empty string and does not let the default kick in.

This change adds the option to pass DRUID_SET_HOST=0 as an environment variable to disable the default behavior, and allows passing a common runtime.properties file without druid.host.
2019-12-12 13:51:57 -08:00
Benjamin Hopp 13c33c1766 Update architecture.md (#9015) 2019-12-11 19:05:50 -08:00
Jihoon Son 66056b2826 Using annotation to distinguish Hadoop Configuration in each module (#9013)
* Multibinding for NodeRole

* Fix endpoints

* fix doc

* fix test

* Using annotation to distinguish Hadoop Configuration in each module
2019-12-11 17:30:44 -08:00
Jihoon Son e5e1e9c4ee Fix broken master (#9005)
* Multibinding for NodeRole

* Fix endpoints

* fix doc

* fix test
2019-12-11 15:56:36 -08:00
Jonathan Wei 8af41d7cd0 Update version to 0.18.0-incubating-SNAPSHOT (#9009) 2019-12-11 14:04:03 -08:00
Parag Jain 24fe824055 add readiness endpoints to processes having initialization delays (#8841) 2019-12-10 17:26:13 -08:00
Chi Cao Minh 3de7ab8523 DataSketches jars in core (#9003)
Having DataSketches jars in core will allow potential improvements, for
example:
- Provide an alternative implementation of HLL:
  https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html
- Range partitioning for native parallel batch indexing without having
  the user load extensions on the classpath

Dev mailing list discussion:
https://lists.apache.org/thread.html/301410d71ff799cf616bf17c4ebcf9999fc30829f5fa62909f403e6c%40%3Cdev.druid.apache.org%3E
2019-12-10 14:02:34 -08:00
Chi Cao Minh bab78fc80e Parallel indexing single dim partitions (#8925)
* Parallel indexing single dim partitions

Implements single dimension range partitioning for native parallel batch
indexing as described in #8769. This initial version requires the
druid-datasketches extension to be loaded.

The algorithm has 5 phases that are orchestrated by the supervisor in
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`.
These phases and the main classes involved are described below:

1) In parallel, determine the distribution of dimension values for each
   input source split.

   `PartialDimensionDistributionTask` uses `StringSketch` to generate
   the approximate distribution of dimension values for each input
   source split. If the rows are ungrouped,
   `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter`
   uses a Bloom filter to skip rows that would be grouped. The final
   distribution is sent back to the supervisor via
   `DimensionDistributionReport`.

2) The range partitions are determined.

   In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the
   supervisor uses `StringSketchMerger` to merge the individual
   `StringSketch`es created in the preceding phase. The merged sketch is
   then used to create the range partitions.

3) In parallel, generate partial range-partitioned segments.

   `PartialRangeSegmentGenerateTask` uses the range partitions
   determined in the preceding phase and
   `RangePartitionCachingLocalSegmentAllocator` to generate
   `SingleDimensionShardSpec`s.  The partition information is sent back
   to the supervisor via `GeneratedGenericPartitionsReport`.

4) The partial range segments are grouped.

   In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`,
   the supervisor creates the `PartialGenericSegmentMergeIOConfig`s
   necessary for the next phase.

5) In parallel, merge partial range-partitioned segments.

   `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to
   retrieve the partial range-partitioned segments generated earlier and
   then merges and publishes them.

* Fix dependencies & forbidden apis

* Fixes for integration test

* Address review comments

* Fix docs, strict compile, sketch check, rollup check

* Fix first shard spec, partition serde, single subtask

* Fix first partition check in test

* Misc rewording/refactoring to address code review

* Fix doc link

* Split batch index integration test

* Do not run parallel-batch-index twice

* Adjust last partition

* Split ITParallelIndexTest to reduce runtime

* Rename test class

* Allow null values in range partitions

* Indicate which phase failed

* Improve asserts in tests
2019-12-09 23:05:49 -08:00
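Note on #8925: phases 2 and 3 above (determine the range partitions, then route each row to one) can be illustrated with a much simpler stand-in. The sketch below sorts the exact dimension values instead of merging approximate `StringSketch`es, so it shows the idea of range boundaries only, not the commit's algorithm. `RangePartitions`, `boundaries`, `partitionFor`, and `targetRowsPerPartition` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RangePartitions {
    /**
     * Picks partition boundaries so that each partition holds roughly
     * targetRowsPerPartition values. Partition i covers [boundary i-1, boundary i);
     * the first partition has no lower bound and the last has no upper bound.
     */
    static List<String> boundaries(List<String> values, int targetRowsPerPartition) {
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted); // exact sort; stands in for the sketch-based quantiles
        List<String> result = new ArrayList<>();
        for (int i = targetRowsPerPartition; i < sorted.size(); i += targetRowsPerPartition) {
            String candidate = sorted.get(i);
            // Skip repeats so that boundaries stay strictly increasing.
            if (result.isEmpty() || !result.get(result.size() - 1).equals(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Returns the index of the partition whose range contains the value. */
    static int partitionFor(List<String> boundaries, String value) {
        int idx = Collections.binarySearch(boundaries, value);
        // A value equal to a boundary starts the next partition; otherwise use the insertion point.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }
}
```

With values `a` through `f` and a target of two rows per partition, the boundaries come out as `c` and `e`, giving three partitions.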
Vadim Ogievetsky a6dcc99962 better input format detection (#9007) 2019-12-09 22:31:28 -08:00