* Add jsonPath functions support
* Add jsonPath function test for Avro
* Add jsonPath function length() to Orc
* Add jsonPath function length() to Parquet
* Add more tests to ORC format
* update doc
* Fix exception during ingestion
* Add IT test case
* Revert "Fix exception during ingestion"
This reverts commit 5a5484b9ea.
* update IT test case
* Add 'keys()'
* Commit IT test case
* Fix UT
* clean up the balancing code around the batched vs deprecated way of sampling segments to balance
* fix docs, clarify comments, add deprecated annotations to legacy code
* remove unused variable
* update dynamic config dialog in console to state percentOfSegmentsToConsiderPerMove deprecated
* fix dynamic config text for percentOfSegmentsToConsiderPerMove
* run prettier to cleanup coordinator-dynamic-config.tsx changes
* update jest snapshot
* update documentation per review feedback
* fix bug where queries fail immediately when timeout is 0 instead of using default timeout
* fix to use serverside max
* more better
* less flaky test
* oops
* Metrics docs layout and info about query/bytes
Knowledge transfer from https://groups.google.com/g/druid-user/c/8fiflmSEoTQ - updated the layout of the Metrics part, adding links between docs pages.
Update index.md
Amended typo
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/operations/metrics.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/operations/metrics.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/operations/metrics.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Feedback applied
Http --> HTTP and moved content / removed >
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update rollup.md
Added SE tip around roll-up.
* Update docs/ingestion/rollup.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Make nodeRole available during binding; add support for dynamic registration of DruidService
* fix checkstyle and test
* fix customRole test
* address comments
* add more javadoc
* apply log file rolling strategy
* fix doc
Signed-off-by: frank chen <frank.chen021@outlook.com>
* Use absolute log path and allow spaces in log path
* Update log4j2 configuration
* apply FileAppender to ZooKeeper
* DO NOT redirect application's console log to file in supervisor
changes:
* adds new config, druid.expressions.useStrictBooleans which make longs the official boolean type of all expressions
* vectorize logical operators and boolean functions, some only if useStrictBooleans is true
under "Aggregators", about the lgK setting, it said "Must be a power of 2 from 4 to 21 inclusively." 21 is not a power of 2, nor is 12, the given default. I think there may have been confusion because lgK represents log2 of K. We could say "K must be a power of 2...", or just say lgK must be between 4 and 21.
Currently, when we try to do EXPLAIN PLAN FOR, it returns the structure of the SQL parsed (via Calcite's internal planner util), which is verbose (since it tries to explain about the nodes in the SQL, instead of the Druid Query), and not representative of the native Druid query which will get executed on the broker side.
This PR aims to change the format when user tries to EXPLAIN PLAN FOR for queries which are executed by converting them into Druid's native queries (i.e. not sys schemas).
Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than __time. Also, some other logical time column can be specified.
* add impl
* fix checkstyle
* add test
* add test
* add unit tests
* fix unit tests
* fix unit tests
* fix unit tests
* add IT
* add IT
* add comments
* fix spelling
* Corrected admonition issue
* Update data-formats.md
Removed all admonition bits, and took out sf linebreaks.
* Update data-formats.md
Changed the shocker line into something a little more practical.
* Add inline native query example to tutorial
Minor change to the tutorial that adds an example of a native HTTP query request body, and adds a link to the more detailed "native query over HTTP" documentation.
* cleanup
* Apply suggestions from code review.
Co-authored-by: sthetland <steve.hetland@imply.io>
Co-authored-by: sthetland <steve.hetland@imply.io>
* Add worker category as dimension in TaskSlotCountStatsMonitor
* Change description
* Add workerConfig as field
* Modify HttpRemoteTaskRunnerTest to test worker category in taskslot metrics
* Fixing tests
* Fixing alerts
* Adding unit test in SingleTaskBackgroundRunnerTest for task slot metrics APIs
* Resolving false positive spell check
* addressing comments
* throw UnsupportedOperationException for tasklotmetrics APIs in SingleTaskBackgroundRunner
Co-authored-by: Nikhil Navadiya <nnavadiya@twitter.com>
* IMPLY-4344: Adding safe divide function along with testcases and documentation updates
* Changing based on review comments
* Addressing review comments, fixing coding style, docs and spelling
* Checkstyle passes for all code
* Fixing expected results for infinity
* Revert "Fixing expected results for infinity"
This reverts commit 5fd5cd480d.
* Updating test result and a space in docs
* Use a simple class to sanitize sanitizable errors and log them
The purpose of this is to sanitize JDBC errors, but can sanitize other errors
if they implement SanitizableError Interface
add a class to log errors and sanitize them
added a simple test that tests out that the error gets sanitized
add @NonNull annotation to serverconfig's ErrorResponseTransfromStrategy
* return less information as part of too many connections, and instead only log specific details
This is so an end user gets relevant information but not too much info since they might now how
many brokers they have
* return only runtime exceptions
added new error types that need to be sanitized
also sanitize deprecated and unsupported exceptions.
* dont reqrewite exceptions unless necessary for checked exceptions
add docs
avoid blanket turning all exceptions into runtime exceptions
* address comments, to fix up docs.
add more javadocs
add support UOE sanitization
* use try catch instead and sanitize at public methods
* checkstyle fixes
* throw noSuchStatement and NoSuchConnection as Avatica is affected by those
* address comments. move log error back to druid meta
clean up bad formatting and commented code. add missed catch for NoSuchStatementException
clean up comments for error handler and add comment explainging not wanting to santize avatica exceptions
* alter test to reflect new error message
* revert ColumnAnalysis type, add typeSignature and use it for DruidSchema
* review stuffs
* maybe null
* better maybe null
* Update docs/querying/segmentmetadataquery.md
* Update docs/querying/segmentmetadataquery.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* fix null right
* sad
* oops
* Update batch_hadoop_queries.json
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Support routing data through an HTTP proxy
* Support routing data through an HTTP proxy
This adds the ability for the HttpClient to connect through an HTTP proxy. We
augment the channel factory to check if it is supposed to be proxied and, if so,
we connect to the proxy host first, issue a CONNECT command through to the final
recipient host and *then* give the channel to the normal http client for usage.
* add docs
* address comments
Co-authored-by: imply-cheddar <86940447+imply-cheddar@users.noreply.github.com>
Enhanced the ExtractionNamespace interface in lookups-cached-global core extension with the ability to set a maxHeapPercentage for the cache of the respective namespace. The reason for adding this functionality, is make it easier to detect when a lookup table grows to a size that the underlying service cannot handle, because it does not have enough memory. The default value of maxHeap for the interface is -1, which indicates that no maxHeapPercentage has been set. For the JdbcExtractionNamespace and UriExtractionNamespace implementations, the default value is null, which will cause the respective service that the lookup is loaded in, to warn when its cache is beyond mxHeapPercentage of the service's configured max heap size. If a positive non-null value is set for the namespace's maxHeapPercentage config, this value will be honored for all services that the respective lookup is loaded onto, and consequently log warning messages when the cache of the respective lookup grows beyond this respective percentage of the services configured max heap size. Warnings are logged every time that either Uri based or Jdbc based lookups are regenerated, if the maxHeapPercentage constraint is violated. No other implementations will log warnings at this time. No error is thrown when the size exceeds the maxHeapPercentage at this time, as doing so could break functionality for existing users. Previously the JdbcCacheGenerator generated its cache by materializing all rows of the underling table in memory at once; this made it difficult to log warning messages in the case that the results from the jdbc query were very large and caused the service to run out of memory. To help with this, this pr makes it so that the jdbc query results are instead streamed through an iterator.
The new config is an extension of the concept of "watchedTiers" where
the Broker can choose to add the info of only the specified tiers to its timeline.
Similarly, with this config, Broker can choose to skip the realtime nodes and
thus it would query only Historical processes for any given segment.
Add support for hadoop 3 profiles . Most of the details are captured in #11791 .
We use a combination of maven profiles and resource filtering to achieve this. Hadoop2 is supported by default and a new maven profile with the name hadoop3 is created. This will allow the user to choose the profile which is best suited for the use case.
* Remove OffheapIncrementalIndex and clarify aggregator thread-safety needs.
This patch does the following:
- Removes OffheapIncrementalIndex.
- Clarifies that Aggregators are required to be thread safe.
- Clarifies that BufferAggregators and VectorAggregators are not
required to be thread safe.
- Removes thread safety code from some DataSketches aggregators that
had it. (Not all of them did, and that's OK, because it wasn't necessary
anyway.)
- Makes enabling "useOffheap" with groupBy v1 an error.
Rationale for removing the offheap incremental index:
- It is only used in one rare scenario: groupBy v1 (which is non-default)
in "useOffheap" mode (also non-default). So you have to go pretty deep
into the wilderness to get this code to activate in production. It is
never used during ingestion.
- Its existence complicates developer efforts to reason about how
aggregators get used, because the way it uses buffer aggregators is so
different from how every other query engine uses them.
- It doesn't have meaningful testing.
By the way, I do believe that the given way the offheap incremental index
works, it actually didn't require buffer aggregators to be thread-safe.
It synchronizes on "aggregate" and doesn't call "get" until it has
stopped calling "aggregate". Nevertheless, this is a bother to think about,
and for the above reasons I think it makes sense to remove the code anyway.
* Remove things that are now unused.
* Revert removal of getFloat, getLong, getDouble from BufferAggregator.
* OAK-related warnings, suppressions.
* Unused item suppressions.
* docker mem reqs
* Update docs/tutorials/docker.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Sergio Ferragut <sergio.ferragut@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Add druid.sql.approxCountDistinct.function property.
The new property allows admins to configure the implementation for
APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode.
The motivation for adding this setting is to enable site admins to
switch the default HLL implementation to DataSketches.
For example, an admin can set:
druid.sql.approxCountDistinct.function = APPROX_COUNT_DISTINCT_DS_HLL
* Fixes
* Fix tests.
* Remove erroneous cannotVectorize.
* Remove unused import.
* Remove unused test imports.
* Revert "Require Datasource WRITE authorization for Supervisor and Task access (#11718)"
This reverts commit f2d6100124.
* Revert "Require DATASOURCE WRITE access in SupervisorResourceFilter and TaskResourceFilter (#11680)"
This reverts commit 6779c4652d.
* Fix docs for the reverted commits
* Fix and restore deleted tests
* Fix and restore SystemSchemaTest
Follow up PR for #11680
Description
Supervisor and Task APIs are related to ingestion and must always require Datasource WRITE
authorization even if they are purely informative.
Changes
Check Datasource WRITE in SystemSchema for tables "supervisors" and "tasks"
Check Datasource WRITE for APIs /supervisor/history and /supervisor/{id}/history
Check Datasource for all Indexing Task APIs
### Description
Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events is another class of data that provides useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside kafka payload, but nonetheless there are some very useful metadata that is carried in kafka headers that can serve as useful dimension for aggregation and in turn bringing better insights.
PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats.
We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format(provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the kafka record. We also need a way to expose the data present in the key as druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload.
This PR is designed to solve the above by providing wrapper around any existing input formats and merging the data into a single unified Druid row.
Lets look at a sample input format from the above discussion
"inputFormat":
{
"type": "kafka", // New input format type
"headerLabelPrefix": "kafka.header.", // Label prefix for header columns, this will avoid collusions while merging columns
"recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made available in case payload does not carry timestamp
"headerFormat": // Header parser specifying that values are of type string
{
"type": "string"
},
"valueFormat": // Value parser from json parsing
{
"type": "json",
"flattenSpec": {
"useFieldDiscovery": true,
"fields": [...]
}
},
"keyFormat": // Key parser also from json parsing
{
"type": "json"
}
}
Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, eg., headers coming in as string and payload as json.
KafkaInputFormat will be the uber class extending inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blend the data resolving conflicts in columns and generating a single unified InputRow for Druid ingestion.
"headerFormat" will allow users to plug parser type for the header values and will add default header prefix as "kafka.header."(can be overridden) for attributes to avoid collision while merging attributes with payload.
Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plugin existing parser. One thing to note here is that if batching is performed, then the code is augmenting header and key values to every record in the batch.
Kafka key parser will handle parsing Key portion of the Kafka record and will ingest the Key with dimension name as "kafka.key".
## KafkaInputFormat Class:
This is the class that orchestrates sending the consumerRecord to each parser, retrieve rows, merge the columns into one final row for Druid consumption. KafkaInputformat should make sure to release the resources that gets allocated as a part of reader in CloseableIterator<InputRow> during normal and exception cases.
During conflicts in dimension/metrics names, the code will prefer dimension names from payload and ignore the dimension either from headers/key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.
* Add the ability to add a context to internally generated druid broker queries
* fix docs
* changes after first CI failure
* cleanup after merge with master
* change default to empty map and improve unit tests
* add doc info and fix checkstyle
* refactor DruidSchema#runSegmentMetadataQuery and add a unit test
The new config is an extension of the concept of "watchedTiers" where
the Broker can choose to add the info of only the specified tiers to its timeline.
Similarly, with this config, Broker can choose to ignore the segments being served
by the specified historical tiers. By default, no tier is ignored.
This config is useful when you want a completely isolated tier amongst many other tiers.
Say there are several tiers of historicals Tier T1, Tier T2 ... Tier Tn
and there are several brokers Broker B1, Broker B2 .... Broker Bm
If we want only Broker B1 to query Tier T1, instead of setting a long list of watchedTiers
on each of the other Brokers B2 ... Bm, we could just set druid.broker.segment.ignoredTiers=["T1"]
for these Brokers, while Broker B1 could have druid.broker.segment.watchedTiers=["T1"]
* update docs with X-Druid-SQL-Query-Id
* review comments
* update header description
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Rename field, fix router documentation
* Add more lines to doc
* Apply doc suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Add handoff wait time to ingestion stats report. Refactor some code for batch handoff
* fix checkstyle
* Add assertion to AbstractITBatchIndexTask to make sure report reflects wait for segments happened
* add docs to the task reports section of doc
* Update sql.md
Added two example queries and adjusted formatting of one that was already there
* Update docs/querying/sql.md
Co-authored-by: Frank Chen <frankchen@apache.org>
* Update docs/querying/sql.md
Co-authored-by: Frank Chen <frankchen@apache.org>
* Update docs/querying/sql.md
Co-authored-by: Frank Chen <frankchen@apache.org>
* Update docs/querying/sql.md
Co-authored-by: Frank Chen <frankchen@apache.org>
* Update sql.md
Co-authored-by: Frank Chen <frankchen@apache.org>
* add MV_FILTER_ONLY SQL function, and list filter virtual column
* MV_FILTER_NONE and more tests
* formatting
* o yeah, forgot can do easy thing
* style
* hmm why was that there
* test filtering on virtual column
* style
* meh
* do it right
* good bot
* Update index.md
Moved H4s underneath the H3 for the task log location and added hyperlinks.
* Update tasks.md
Added process information around log file generation, and subsumed text from the configuration guide into this explanatory text instead.
* Update tasks.md
.html > .md
* Update docs/ingestion/tasks.md
Co-authored-by: Frank Chen <frankchen@apache.org>
Co-authored-by: Frank Chen <frankchen@apache.org>
* Update description of batchProcessingMode
Update the description to explicitly mention a released version of Druid that the original version was referencing
* Update docs/configuration/index.md
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update granularities.md
Link-back to the ingestion spec as well as Native queries plus examples.
* Update docs/querying/granularities.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/granularities.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Make persists concurrent with ingestion
* Remove semaphore but keep concurrent persists (with add) and add push in the backround as well
* Go back to documented default persists (zero)
* Move to debug
* Remove unnecessary Atomics
* Comments on synchronization (or not) for sinks & sinkMetadata
* Some cleanup for unit tests but they still need further work
* Shutdown & wait for persists and push on close
* Provide support for three existing batch appenderators using batchProcessingMode flag
* Fix reference to wrong appenderator
* Fix doc typos
* Add BatchAppenderators class test coverage
* Add log message to batchProcessingMode final value, fix typo in enum name
* Another typo and minor fix to log message
* LEGACY->OPEN_SEGMENTS, Edit docs
* Minor update legacy->open segments log message
* More code comments, mostly small adjustments to naming etc
* fix spelling
* Exclude BtachAppenderators from Jacoco since it is fully tested but Jacoco still refuses to ack coverage
* Coverage for Appenderators & BatchAppenderators, name change of a method that was still using "legacy" rather than "openSegments"
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
* Configurable maxStreamLength for doubles sketches
* fix equals/hashcode and it test failure
* fix test
* fix it test
* benchmark
* doc
* grouping key
* fix comment
* dependency check
* Update docs/development/extensions-core/datasketches-quantiles.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
LGTM
* Update native-batch.md
Knowledge from https://the-asf.slack.com/archives/CJ8D1JTB8/p1595434977062400
* Update native-batch.md
* Fixed broken link + some grammar
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* Update native-batch.md
Some grammatical wizardry.
* Update native-batch.md
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/ingestion/native-batch.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Apply suggestions from code review
remove orphaned links
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Add details to the Docker tutorial
Added links, explanations and other details to the Docker
tutorial to make it easier for first-time users.
* Fix spelling error
And add "Jupyter" to the spelling dictionary.
* Update docs/tutorials/docker.md
* Update docs/tutorials/docker.md
Co-authored-by: sthetland <steve.hetland@imply.io>
* Update docs/tutorials/docker.md
Co-authored-by: sthetland <steve.hetland@imply.io>
* Update docs/tutorials/docker.md
* Update docs/tutorials/docker.md
Co-authored-by: sthetland <steve.hetland@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: sthetland <steve.hetland@imply.io>
* Add prometheus-emitter docs
* Update docs/development/extensions.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Affinity "strong" definition
Reworded "strong" to emphasise meaning and consequences - OTBO https://the-asf.slack.com/archives/CJ8D1JTB8/p1609558156092800
* Spelling corrections
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
This change updates doc to clarify when and how a change to druid.auth.authenticator.basic.credentialIterations takes effect: changes apply only to new users or existing users upon changing their password via the credentials API, which may not be the expectation.
Fixes#11297.
Description
Description and design in the proposal #11297
Key changed/added classes in this PR
*DataSegmentPusher
*ShuffleClient
*PartitionStat
*PartitionLocation
*IntermediaryDataManager
* add binary_byte_format/decimal_byte_format/decimal_format
* clean code
* fix doc
* fix review comments
* add spelling check rules
* remove extra param
* improve type handling and null handling
* remove extra zeros
* fix tests and add space between unit suffix and number as most size-format functions do
* fix tests
* add examples
* change function names according to review comments
* fix merge
Signed-off-by: frank chen <frank.chen021@outlook.com>
* no need to configure NullHandling explicitly for tests
Signed-off-by: frank chen <frank.chen021@outlook.com>
* fix tests in SQL-Compatible mode
Signed-off-by: frank chen <frank.chen021@outlook.com>
* Resolve review comments
* Update SQL test case to check null handling
* Fix intellij inspections
* Add more examples
* Fix example
This PR adds a new property druid.router.sql.enable which allows the
Router to handle SQL queries when set to true.
This change does not affect Avatica JDBC requests and they are still routed
by hashing the Connection ID.
To allow parsing of the request object as a SqlQuery (contained in module druid-sql),
some classes have been moved from druid-server to druid-services with
the same package name.
* fix little typo
* Update docs/misc/math-expr.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update default maxSegmentsInNodeLoadingQueue
Update the default maxSegmentsInNodeLoadingQueue from 0 (unbounded) to 100.
An unbounded maxSegmentsInNodeLoadingQueue can cause cluster instability.
Since this is the default druid operators need to run into this instability
and then look through the docs to see that the recommended value for a large
cluster is 1000. This change makes it so the default will prevent clusters
from falling over as they grow over time.
* update tests
* codestyle
* Add a note
* Update docs/configuration/index.md
Co-authored-by: sthetland <steve.hetland@imply.io>
* clarify that both of non-result level cache and result level cache are not supported
Co-authored-by: sthetland <steve.hetland@imply.io>
* Allow kill task to mark segments as unused
* Add IndexerSQLMetadataStorageCoordinator test
* Update docs/ingestion/data-management.md
Co-authored-by: Jihoon Son <jihoonson@apache.org>
* Add warning to kill task doc
Co-authored-by: Jihoon Son <jihoonson@apache.org>
This change allows the selection of a specific broker service (or broker tier) by the Router.
The newly added ManualTieredBrokerSelectorStrategy works as follows:
Check for the parameter brokerService in the query context. If this is a valid broker service, use it.
Check if the field defaultManualBrokerService has been set in the strategy. If this is a valid broker service, use it.
Move on to the next strategy
* HLL lgK and a tip
Knowledge transfer from https://the-asf.slack.com/archives/CJ8D1JTB8/p1600699967024200. Attempted to make a connection between the SQL HLL function and the HLL underneath without getting too complicated. Also added a note about using K over 16 being pretty much pointless.
* Corrected spelling
* Create datasketches-hll.md
Put roll-up back to rollup
* Update docs/development/extensions-core/datasketches-hll.md
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Add a new metric query/segments/count that is not emitted by default
* docs
* test the default implementation of the metric
* fix spelling error in docs
* document the fact that query retries will result in additional metric emissions
* update using recommended text from @jihoonson
* fix doc
* address comments
* Update docs/configuration/index.md
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
This PR splits current SegmentLoader into SegmentLoader and SegmentCacheManager.
SegmentLoader - this class is responsible for building the segment object but does not expose any methods for downloading, cache space management, etc. Default implementation delegates the download operations to SegmentCacheManager and only contains the logic for building segments once downloaded. . This class will be used in SegmentManager to construct Segment objects.
SegmentCacheManager - this class manages the segment cache on the local disk. It fetches the segment files to the local disk, can clean up the cache, and in the future, support reserve and release on cache space. [See https://github.com/Make SegmentLoader extensible and customizable #11398]. This class will be used in ingestion tasks such as compaction, re-indexing where segment files need to be downloaded locally.
* compaction/status API retains status for datasources that no longer existed causing in-memory used to grow unbounded
* compaction/status API retains status for datasources that no longer existed causing in-memory used to grow unbounded
* compaction/status API retains status for datasources that no longer existed causing in-memory used to grow unbounded
* fix test
* fix test
* Bound memory in native batch ingest create segments
* Move BatchAppenderatorDriverTest to indexing service... note that we had to put the sink back in sinks in mergeandpush since the persistent data needs to be dropped and the sink is required for that
* Remove sinks from memory and clean up intermediate persists dirs manually after sink has been merged
* Changed name from RealtimeAppenderator to StreamAppenderator
* Style
* Incorporating tests from StreamAppenderatorTest
* Keep totalRows and cleanup code
* Added missing dep
* Fix unit test
* Checkstyle
* allowIncrementalPersists should always be true for batch
* Added sinks metadata
* clear sinks metadata when closing appenderator
* Style + minor edits to log msgs
* Update sinks metadata & totalRows when dropping a sink (segment)
* Remove max
* Intelli-j check
* Keep a count of hydrants persisted by sink for sanity check before merge
* Move out sanity
* Add previous hydrant count to sink metadata
* Remove redundant field from SinkMetadata
* Remove unneeded functions
* Cleanup unused code
* Removed unused code
* Remove unused field
* Exclude it from jacoco because it is very hard to get branch coverage
* Remove segment announcement and some other minor cleanup
* Add fallback flag
* Minor code cleanup
* Checkstyle
* Code review changes
* Update batchMemoryMappedIndex name
* Code review comments
* Exclude class from coverage, will include again when packaging gets fixed
* Moved test classes to server module
* More BatchAppenderator cleanup
* Fix bug in wrong counting of totalHydrants plus minor cleanup in add
* Removed left over comments
* Have BatchAppenderator follow the Appenderator contract for push & getSegments
* Fix LGTM violations
* Review comments
* Add stats after push is done
* Code review comments (cleanup, remove rest of synchronization constructs in batch appenderator, reneame feature flag,
remove real time flag stuff from stream appenderator, etc.)
* Update javadocs
* Add thread safety notice to BatchAppenderator
* Further cleanup config
* More config cleanup
* Avro union support
* Document new union support
* Add support for AvroStreamInputFormat and fix checkstyle
* Extend multi-member union test schema and format
* Some additional docs and add Enums to spelling
* Rename explodeUnions -> extractUnions
* explode -> extract
* ByType
* Correct spelling error
* add single input string expression dimension vector selector and better expression planning
* better
* fixes
* oops
* rework how vector processor factories choose string processors, fix to be less aggressive about vectorizing
* oops
* javadocs, renaming
* more javadocs
* benchmarks
* use string expression vector processor with vector size 1 instead of expr.eval
* better logging
* javadocs, surprising number of the the
* more
* simplify
* Create /opt/data to fix permission problem
* eliminate symlink to avoid compatibility problem on AWS Fargate
* Add a workaround section
* Update instruction for named volume
* Use named volume in docker-compose
* Revert some doc change
* Resolve review comments
With this change, Druid will only support ZooKeeper 3.5.x and later.
In order to support Java 15 we need to switch to ZK 3.5.x client libraries and drop support for ZK 3.4.x
(see #10780 for the detailed reasons)
* remove ZooKeeper 3.4.x compatibility
* exclude additional ZK 3.5.x netty dependencies to ensure we use our version
* keep ZooKeeper version used for integration tests in sync with client library version
* remove the need to specify ZK version at runtime for docker
* add support to run integration tests with JDK 15
* build and run unit tests with Java 15 in travis
* Avoid mapping hydrants in create segments phase for native ingestion
* Drop queriable indices after a given sink is fully merged
* Do not drop memory mappings for realtime ingestion
* Style fixes
* Renamed to match use case better
* Rollback memoization code and use the real time flag instead
* Null ptr fix in FireHydrant toString plus adjustments to memory pressure tracking calculations
* Style
* Log some count stats
* Make sure sinks size is obtained at the right time
* BatchAppenderator unit test
* Fix comment typos
* Renamed methods to make them more readable
* Move persisted metadata from FireHydrant class to AppenderatorImpl. Removed superfluous differences and fix comment typo. Removed custom comparator
* Missing dependency
* Make persisted hydrant metadata map concurrent and better reflect the fact that keys are Java references. Maintain persisted metadata when dropping/closing segments.
* Replaced concurrent variables with normal ones
* Added batchMemoryMappedIndex "fallback" flag with default "false". Set this to "true" make code fallback to previous code path.
* Style fix.
* Added note to new setting in doc, using Iterables.size (and removing a dependency), and fixing a typo in a comment.
* Forgot to commit this edited documentation message
* SQL timeseries no longer skip empty buckets with all granularity
* add comment, fix tests
* the ol switcheroo
* revert unintended change
* docs and more tests
* style
* make checkstyle happy
* docs fixes and more tests
* add docs, tests for array_agg
* fixes
* oops
* doc stuffs
* fix compile, match doc style
* allow user to set group.id for Kafka ingestion task
* fix test coverage by removing deprecated code and add doc
* fix typo
* Update docs/development/extensions-core/kafka-ingestion.md
Co-authored-by: frank chen <frankchen@apache.org>
Co-authored-by: frank chen <frankchen@apache.org>
* Update datasource.md
Change "table" to "datasource" in join discussion: This means that all datasources
other than the leftmost "base" table must fit in memory.
According to docs on datasources, "datasource" is the more general term, and a table is a kind of datasource. In the context here, then, "datasource" is applicable.
* left-hand table -> left-hand datasource
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Co-authored-by: sthetland <steve.hetland@imply.io>
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Fixed a syntax error in "prefix" lines in docs/ingestion/native-batch.md
S3 requires a trailing slash for directory like structures, so this updates the examples to include the trailing slashes.
* lay the groundwork for throttling replicant loads per RunRules execution
* Add dynamic coordinator config to control new replicant threshold.
* remove redundant line
* add some unit tests
* fix checkstyle error
* add documentation for new dynamic config
* improve docs and logs
* Alter how null is handled for new config. If null, manually set as default
* ARRAY_AGG sql aggregator function
* add javadoc
* spelling
* review stuff, return null instead of empty when nil input
* review stuff
* Update sql.md
* use type inference for finalize, refactor some things
* Add feature to automatically remove rules based on retention period
* Add feature to automatically remove rules based on retention period
* address comments
* add experimental expression aggregator
* add test
* fix lgtm
* fix test
* adjust test
* use not null constant
* array_set_concat docs
* add equals and hashcode and tostring
* fix it
* spelling
* do multi-value magic for expression agg, more javadocs, tests
* formatting
* fix inspection
* more better
* nullable
The main one is updating datasources.md to talk about SQL. (It still said
that table unions are not supported in SQL.) Also, this doc update adds
some clarifying details on limitations.
* add protobuf inputformat
* repair pom
* alter intermediateRow to type of Dynamicmessage
* add document
* refine test
* fix document
* add protoBytesDecoder
* refine document and add ser test
* add hash
* add schema registry ser test
Co-authored-by: yuanyi <yuanyi@freewheel.tv>
* add miss API references for coordinator
* add miss API references for coordinator
* add miss API references for coordinator
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
* Add ability to wait for segment availability for batch jobs
* IT updates
* fix queries in legacy hadoop IT
* Fix broken indexing integration tests
* address an lgtm flag
* spell checker still flagging for hadoop doc. adding under that file header too
* fix compaction IT
* Updates to wait for availability method
* improve unit testing for patch
* fix bad indentation
* refactor waitForSegmentAvailability
* Fixes based off of review comments
* cleanup to get compile after merging with master
* fix failing test after previous logic update
* add back code that must have gotten deleted during conflict resolution
* update some logging code
* fixes to get compilation working after merge with master
* reset interrupt flag in catch block after code review pointed it out
* small changes following self-review
* fixup some issues brought on by merge with master
* small changes after review
* cleanup a little bit after merge with master
* Fix potential resource leak in AbstractBatchIndexTask
* syntax fix
* Add a Compcation TuningConfig type
* add docs stipulating the lack of support by Compaction tasks for the new config
* Fixup compilation errors after merge with master
* Remove erreneous newline