Commit Graph

11704 Commits

Author SHA1 Message Date
Karan Kumar ffa553593f
Use one factory in json reader (#11999) 2021-12-01 16:17:48 +05:30
benkrug 11746b8536
Update datasketches-hll.md (#12010)
under "Aggregators", about the lgK setting, it said "Must be a power of 2 from 4 to 21 inclusively."  21 is not a power of 2, nor is 12, the given default.  I think there may have been confusion because lgK represents log2 of K.  We could say "K must be a power of 2...", or just say lgK must be between 4 and 21.
2021-11-30 18:52:00 -08:00
Paul Rogers a66f10eea1
Code cleanup from query profile project (#11822)
* Code cleanup from query profile project

* Fix spelling errors
* Fix Javadoc formatting
* Abstract out repeated test code
* Reuse constants in place of some string literals
* Fix up some parameterized types
* Reduce warnings reported by Eclipse

* Reverted change due to lack of tests
2021-11-30 11:35:38 -08:00
Gian Merlino f6e6ca2893
Use intermediate-persist IndexSpec during multiphase merge. (#11940)
* Use intermediate-persist IndexSpec during multiphase merge.

The main change is the addition of an intermediate-persist IndexSpec
to the main "merge" method in IndexMerger. There are also a few minor
adjustments to the IndexMerger interface to encourage more harmonious
usage of its methods in the future.

* Additional changes inspired by the test coverage checker.

- Remove unused-in-production IndexMerger methods "append" and "convert".
- Add additional unit tests to UnifiedIndexerAppenderatorsManager.

* Additional adjustments.

* Even more additional adjustments.

* Test fixes.
2021-11-29 15:08:49 -08:00
Charles Smith f536f31229
clarify avro support & general style improvements (#11975)
* clarify avro support & general style improvements

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/development/extensions-core/avro.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update avro.md

remove redundancy

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
2021-11-28 16:10:18 +08:00
Gian Merlino 93aeaf4801
Improve on-heap aggregator footprint estimates. (#11950)
Add a "guessAggregatorHeapFootprint" method to AggregatorFactory that
mitigates #6743 by enabling heap footprint estimates based on a specific
number of rows. The idea is that at ingestion time, the number of rows
that go into an aggregator will be 1 (if rollup is off) or will likely
be a small number (if rollup is on).

It's a heuristic, because of course nothing guarantees that the rollup
ratio is a small number. But it's a common case, and I expect this logic
to go wrong much less often than the current logic. Also, when it does
go wrong, users can fix it by lowering maxRowsInMemory or
maxBytesInMemory. The current situation is unintuitive: when the
estimation goes wrong, users get an OOME, but actually they need to
*raise* these limits to fix it.
2021-11-28 13:21:24 +05:30
Agustin Gonzalez 8eff6334f7
AWS "Data read has a different length than the expected" error should reset stream and try again (#11941)
* Add support for custom reset condition & support for other args to have defaults to make the method api consistent

* Add support for custom reset condition to InputEntity

* Fix test names

* Clarifying comments to why we need to read the message's content to identify S3's resettable exception

* Add unit test to verify custom resettable condition for S3Entity

* Provide a way to customize retries since they are expensive to test
2021-11-26 12:45:34 -07:00
Sandeep 9bc18a93a2
warn when segment cannot be loaded by Historical nodes (#11849) 2021-11-26 17:27:17 +08:00
Kashif Faraz b48f5a576b
Fix: Do not require time condition on InlineDataSource (#11982)
For queries on logical values, e.g. SELECT 1337, we need not check for
a filter on __time column even if requireTimeCondition is true.
2021-11-25 21:10:06 +05:30
Laksh Singla c381cae51b
Improve the output of SQL explain message (#11908)
Currently, when we try to do EXPLAIN PLAN FOR, it returns the structure of the SQL parsed (via Calcite's internal planner util), which is verbose (since it tries to explain about the nodes in the SQL, instead of the Druid Query), and not representative of the native Druid query which will get executed on the broker side.

This PR aims to change the format when user tries to EXPLAIN PLAN FOR for queries which are executed by converting them into Druid's native queries (i.e. not sys schemas).
2021-11-25 21:08:33 +05:30
Frank Chen 98957be044
Return HTTP 404 instead of 400 for supervisor/task endpoints (#11724)
* Use 404 instead of 400

* Use 404 instead of 400

* Add UT test cases

* Add IT testcases

* add UT for task resource filter

Signed-off-by: frank chen <frank.chen021@outlook.com>

* Using org.testing.Assert instead of org.junit.Assert

* Resolve comments and fix test

* Fix test

* Fix tests

* Resolve comments
2021-11-25 13:09:47 +08:00
Rohan Garg 2c08055962
Specify time column for first/last aggregators (#11949)
Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.
2021-11-25 09:44:14 +05:30
Gian Merlino 3d72e66f56
Consolidate a bunch of ad-hoc segments metadata SQL; fix some bugs. (#11582)
* Consolidate a bunch of ad-hoc segments metadata SQL; fix some bugs.

This patch gathers together a variety of SQL from SqlSegmentsMetadataManager
and IndexerSQLMetadataStorageCoordinator into a new class SqlSegmentsMetadataQuery.
It focuses on SQL related to retrieving segment payloads and marking
segments used and unused.

In addition to cleaning up the code a bit, this patch also fixes a bug
with years before 0 or after 9999. The prior SQL did not work properly
because dates outside this range cannot be compared as strings. The new
code does work for these far-past and far-future years.

So, if you're ever interested in using Druid to analyze things from
ancient Babylon, you better apply this patch first!

* Fix test compiling.

* Fixes and improvements.

* Fix forbidden API.

* Additional fixes.
2021-11-24 14:51:53 -08:00
Gian Merlino 12e2228510
RowBasedGrouperHelper: Set hasMultipleValues = false in capabilities. (#11954)
Useful because it enables anything that consumes groupBy results to
potentially operate more efficiently.
2021-11-24 13:14:58 -08:00
Gian Merlino 5e168b861a
StorageAdapter: Add getRowSignature method. (#11953)
Simplifies logic for callers that only want to get a list of all the
column names, or column names and types. Updated callers SegmentAnalyzer,
HashJoinSegmentStorageAdapter, and DruidSegmentReader.
2021-11-24 13:14:25 -08:00
Gian Merlino 0354407655
SQL INSERT planner support. (#11959)
* SQL INSERT planner support.

The main changes are:

1) DruidPlanner is able to validate and authorize INSERT queries. They
   require WRITE permission on the target datasource.

2) QueryMaker is now an interface, and there is a QueryMakerFactory that
   creates instances of it. There is only one production implementation
   of each (NativeQueryMaker and NativeQueryMakerFactory), which
   together behave the same way as the former QueryMaker class. But this
   opens the door to executing queries in ways other than the Druid
   query stack, and is used by unit tests (CalciteInsertDmlTest) to
   test the INSERT planning functionality.

3) Adds an EXTERN table macro that allows references external data using
   InputSource and InputFormat from Druid's batch ingestion API. This is
   not exposed in production yet, but is used by unit tests.

4) Adds a QueryFeature concept that enables the planner to change its
   behavior slightly depending on the capabilities of the execution
   system.

5) Adds an "AuthorizableOperator" concept that enables SqlOperators
   to require additional permissions. This is used by the EXTERN table
   macro.

Related odds and ends:

- Add equals, hashCode, toString methods to InlineInputSource. Aids in
  the "from external" tests in CalciteInsertDmlTest.
- Add JSON-serializability to RowSignature.
- Move the SQL string inside PlannerContext so it is "baked into" the
  planner when the planner is created. Cleans up the code a bit, since
  in practice, the same query is passed in every time to the
  same planner anyway.

* Fix up calls to CalciteTests.createMockQueryLifecycleFactory.

* Fix checkstyle issues.

* Adjustments for CI.

* Adjust DruidAvaticaHandlerTest for stricter test authorizations.
2021-11-24 12:14:04 -08:00
Maytas Monsereenusorn bb3d2a433a
Support filtering data in Auto Compaction (#11922)
* add impl

* fix checkstyle

* add test

* add test

* add unit tests

* fix unit tests

* fix unit tests

* fix unit tests

* add IT

* add IT

* add comments

* fix spelling
2021-11-24 10:56:38 -08:00
Kashif Faraz 48dbe0ea45
Handle null values in Range Partition dimension distribution (#11973)
This PR adds support for handling null dimension values while creating partition boundaries
in range partitioning.

This means that we can now have partition boundaries like [null, "abc"] or ["abc", null, "def"].
2021-11-24 14:30:02 +05:30
cheddar e6570cadc4
Update LifecycleModule.java (#11972)
Update the javadoc on LifecycleModule to be more clear about why the register methods exist and why they should always be used instead of Guice's eager instantiation.
2021-11-23 17:03:37 -08:00
Abhishek Agarwal b6a0fbc8b6
Break down CalciteQueryTest (#11979)
* Refactor calciteQueryTest

* Move more tests to CalciteJoinQueryTest
2021-11-24 00:15:42 +05:30
Agustin Gonzalez 311d9a2370
Log correct hydrant count (#11976) 2021-11-23 08:22:17 -08:00
Kashif Faraz 6607e4cc75
Docs: Remove reference to deprecated field `targetPartitionSize` (#11974)
* Remove reference to deprecated field `targetPartitionSize`

* Fix spelling of LeaderLatch
2021-11-23 15:32:16 +05:30
Clint Wylie e22abb68b0
Update .spelling (#11977) 2021-11-22 22:28:51 -08:00
Laksh Singla b5a25f24f2
Improve the DruidRexExecutor w.r.t handling of numeric arrays (#11968)
DruidRexExecutor while reducing Arrays, specially numeric arrays, doesn't convert the value from ExprResult's type to BigDecimal, which causes makeLiteral to cast the values. Also, if NaN or Infinite values are present in the array, the error is a generic NumberFormatException. For example:

SELECT ARRAY[1.11, 2.22] returns [1, 2]
SELECT SQRT(-1) throws a generic NumberFormatException instead of IAE

This PR introduces change to cast the numeric values to BigDecimal since Calcite's library understands that easily, and doesn't perform casts.
2021-11-23 11:40:59 +05:30
Peter Marshall ed0606db69
Docs - Corrected admonition issue (#11926)
* Corrected admonition issue

* Update data-formats.md

Removed all admonition bits, and took out sf linebreaks.

* Update data-formats.md

Changed the shocker line into something a little more practical.
2021-11-22 12:14:30 -08:00
Katya Macedo 706d057ccc
corrected leaderlatch name (#11966) 2021-11-22 11:58:42 -08:00
Gian Merlino 35b610ada7
QueryableIndexColumnSelectorFactory: Double-check cached column class. (#11957)
Important because an earlier call to getCachedColumn may have been
done with a different class, leading to a ClassCastException on the
second call. In the prior code, this could happen if a complex column
had makeDimensionSelector called on it after makeColumnValueSelector had
already been called.
2021-11-22 11:31:24 -08:00
Gian Merlino d6507c9428
PrioritizedExecutorService: Properly wrap on direct calls to "execute". (#11956)
Usually, "execute" is called by methods defined in the superclass
AbstractExecutorService, and the passed-in Runnable has been wrapped
by newTaskFor inside a PrioritizedListenableFutureTask. But this method
can also be called directly, and if so, the same wrapping is necessary
for the delegate to get a Runnable that can be entered into a priority
queue with the others.
2021-11-22 10:30:12 -08:00
TSFenwick a4cb1de87a
get rid of class cast exception and add a new testcase for that issue (#11951) 2021-11-22 08:44:20 -08:00
jacobtolar 0a9a908031
Add inline native query example to tutorial (#11642)
* Add inline native query example to tutorial

Minor change to the tutorial that adds an example of a native HTTP query request body, and adds a link to the more detailed "native query over HTTP" documentation.

* cleanup

* Apply suggestions from code review.

Co-authored-by: sthetland <steve.hetland@imply.io>

Co-authored-by: sthetland <steve.hetland@imply.io>
2021-11-22 21:35:05 +08:00
Sandeep b1de56a3be
update Druid Chart README doc and removes unnecessary lock file (#11945)
* update Druid Chart README doc and removes unnecessary lock file

* update Druid Chart README doc and removes unnecessary lock file
2021-11-22 21:34:26 +08:00
XIAO WANG f1cf1c8f39
update count distinct tests (#11927)
Co-authored-by: wangxiao060 <wangxiao060@ke.com>
2021-11-22 21:34:00 +08:00
Peter Marshall 0c0001579d
Update compaction.md (#11937)
Removed superfluous tabs that caused issues in rendering
Added nav to the `inputSpec`
2021-11-22 21:33:47 +08:00
jacobtolar 3aee5d9ec3
Fix: invalid JSON in ingestion spec doc example (#11880)
* Fix: invalid JSON in ingestion spec doc example

* Update ingestion-spec.md
2021-11-22 21:33:26 +08:00
Frank Chen e77938b205
Add thread count to pre-push hook to speed up checking (#11808)
* Add thread count to accelerate checking

* add comment
2021-11-22 21:33:01 +08:00
Frank Chen cfd60f1222
Improve README for integration test (#11860)
* Optimize IT readme

* Resolve comments
2021-11-22 21:32:36 +08:00
Gian Merlino b13f07a057
Harmonize local input sources; fix batch index integration test. (#11965)
* Make LocalInputSource.files a List instead of Set and adjust wikipedia_index_task to use file list.

Rationale: the behavior of wikipedia_index_task.json is order-dependent with regard to its input
files; some orders produce 4 segments and some produce 5 segments. Some integration tests, like
ITSystemTableBatchIndexTaskTest and ITAutoCompactionTest, are written assuming that the
4-segment case will always happen. Providing the file list in a specific order ensures that this
will happen as expected by the tests.

I didn't see a specific reason why the LocalInputSource.files parameter needed to be a Set, so
changing it to a List was the simplest way to achieve the consistent ordering. I think it will
also make the behavior make more sense if someone does specify the same input file multiple
times in a spec: I think they'd expect it to be loaded multiple times instead of deduped. This
is consistent with the behavior of other input sources like S3, GCS, HTTP.

* Sort files in LocalFirehoseFactory.
2021-11-21 22:26:31 -08:00
Gian Merlino cb0a2af644
TestKafkaExtractionCluster: Shut down Kafka, ZK in @After. (#11963) 2021-11-20 15:17:05 -08:00
Frank Chen 2e3767bef0
Use the last ip as docker host ip (#11742) 2021-11-20 13:31:39 +08:00
Gian Merlino b3502c3e50
DruidViewMacro: Remove unused escalator field. (#11931)
* DruidViewMacro: Remove unused escalator field.

* Remove additional unused fields.
2021-11-19 16:06:29 -08:00
Clint Wylie f260bbed23
restore and deprecate AggregatorFactory methods (#11917)
* add back and deprecate aggregator factory methods so i can say i told you so when i delete these later

* rename to make less ambiguous, fix fill method

* adjust
2021-11-19 15:59:35 -08:00
Gian Merlino 36ee0367ff
Scan: Add "orderBy" parameter. (#11930)
* Scan: Add "orderBy" parameter.

This patch adds an API for requesting non-time orderings, although it
does not actually add the ability to execute such queries.

The changes are done in such a way that no matter how Scan query objects
are constructed, they will have a correct "getOrderBy". This will enable
us to switch the execution to exclusively use "getOrderBy" later on when
it's implemented.

Scan queries are serialized such that they only include "order" (time
order) if the ordering is time-based, and they only include "orderBy" if
the ordering is non-time-based. This maximizes compatibility with
the existing API while also providing a clean look for formatted queries.

Because this patch does not include execution logic, if someone actually
tries to run a query with non-time ordering, then they will get an error
like "Cannot execute query with orderBy [quality ASC]".

* SQL module fixes.

* Add spotbugs-exclude.

* Remove unused method.
2021-11-19 08:19:12 -08:00
Nikhil Navadiya 3c51136098
Add worker category dimension (#11554)
* Add worker category as dimension in TaskSlotCountStatsMonitor

* Change description

* Add workerConfig as field

* Modify HttpRemoteTaskRunnerTest to test worker category in taskslot metrics

* Fixing tests

* Fixing alerts

* Adding unit test in SingleTaskBackgroundRunnerTest for task slot metrics APIs

* Resolving false positive spell check

* addressing comments

* throw UnsupportedOperationException for tasklotmetrics APIs in SingleTaskBackgroundRunner

Co-authored-by: Nikhil Navadiya <nnavadiya@twitter.com>
2021-11-18 22:59:07 -08:00
Agustin Gonzalez a4353aa1f4
Fix bug Unrecognized token 'No': was expecting (JSON String,...) when… (#11934)
* Fix bug Unrecognized token 'No': was expecting (JSON String,...) when calling the API /druid/indexer/v1/task/taskId/reports and the report is not found

* Also log other non-OK statuses
2021-11-18 10:29:28 -07:00
Gian Merlino a04f99a950
Indexer: Demote WARN to DEBUG for tasks that don't register Appenderators. (#11939) 2021-11-18 07:54:43 -08:00
somu-imply 29710789a4
Adding safe divide function (#11904)
* IMPLY-4344: Adding safe divide function along with testcases and documentation updates

* Changing based on review comments

* Addressing review comments, fixing coding style, docs and spelling

* Checkstyle passes for all code

* Fixing expected results for infinity

* Revert "Fixing expected results for infinity"

This reverts commit 5fd5cd480d.

* Updating test result and a space in docs
2021-11-17 08:22:41 -08:00
Gian Merlino d76e646700
Fix TestServerInventoryView behavioral discrepancy. (#11932)
Unlike a real one, TestServerInventoryView would call segmentRemoved
any time _any_ segment was removed. It should only be called when _all_
segments have been removed.
2021-11-16 18:08:35 -08:00
Clint Wylie 7f0bede878
autocompaction support for complex dimensions (#11924)
* autocompaction support for complex dimensions

* more test
2021-11-16 15:57:44 -08:00
Clint Wylie 00c976a3fe
only get bitmap index for string dictionary encoded columns (#11925) 2021-11-16 15:50:02 -08:00
Clint Wylie 54fead3546
sql skip reduce of complex literal expressions (#11928) 2021-11-16 15:40:42 -08:00