* change kafka lookups module to not commit offsets
The current behaviour of the Kafka lookup extractor is to not commit
offsets by assigning a unique ID to the consumer group and setting
auto.offset.reset to earliest. This does the job but also pollutes the
Kafka broker with a bunch of "ghost" consumer groups that will never again be
used.
To fix this, we now set enable.auto.commit to false, which prevents the
ghost consumer groups being created in the first place.
* update docs to include new enable.auto.commit setting behaviour
* update kafka-lookup-extractor documentation
Provide some additional detail on functionality and configuration.
Hopefully this will make it clearer how the extractor works for
developers who aren't so familiar with Kafka.
* add comments better explaining the logic of the code
* add spelling exceptions for kafka lookup docs
* remove kafka lookup records from factory when record tombstoned
* update kafka lookup docs to include tombstone behaviour
* change test wait time down to 10ms
Co-authored-by: David Palmer <david.palmer@adscale.co.nz>
* Adjust "in" filter null behavior to match "selector".
Now, both of them match numeric nulls if constructed with a "null" value.
This is consistent as far as native execution goes, but doesn't match
the behavior of SQL = and IN. So, to address that, this patch also
updates the docs to clarify that the native filters do match nulls.
This patch also updates the SQL docs to describe how Boolean logic is
handled in addition to how NULL values are handled.
Fixes#12856.
* Fix test.
Kinesis ingestion requires all shards to have at least 1 record at the required position in druid.
Even if this is satisified initially, resharding the stream can lead to empty intermediate shards. A significant delay in writing to newly created shards was also problematic.
Kinesis shard sequence numbers are big integers. Introduce two more custom sequence tokens UNREAD_TRIM_HORIZON and UNREAD_LATEST to indicate that a shard has not been read from and that it needs to be read from the start or the end respectively.
These values can be used to avoid the need to read at least one record to obtain a sequence number for ingesting a newly discovered shard.
If a record cannot be obtained immediately, use a marker to obtain the relevant shardIterator and use this shardIterator to obtain a valid sequence number. As long as a valid sequence number is not obtained, continue storing the token as the offset.
These tokens (UNREAD_TRIM_HORIZON and UNREAD_LATEST) are logically ordered to be earlier than any valid sequence number.
However, the ordering requires a few subtle changes to the existing mechanism for record sequence validation:
The sequence availability check ensures that the current offset is before the earliest available sequence in the shard. However, current token being an UNREAD token indicates that any sequence number in the shard is valid (despite the ordering)
Kinesis sequence numbers are inclusive i.e if current sequence == end sequence, there are more records left to read.
However, the equality check is exclusive when dealing with UNREAD tokens.
* Improved Java 17 support and Java runtime docs.
1) Add a "Java runtime" doc page with information about supported
Java versions, garbage collection, and strong encapsulation..
2) Update asm and equalsverifier to versions that support Java 17.
3) Add additional "--add-opens" lines to surefire configuration, so
tests can pass successfully under Java 17.
4) Switch openjdk15 tests to openjdk17.
5) Update FrameFile to specifically mention Java runtime incompatibility
as the cause of not being able to use Memory.map.
6) Update SegmentLoadDropHandler to log an error for Errors too, not
just Exceptions. This is important because an IllegalAccessError is
encountered when the correct "--add-opens" line is not provided,
which would otherwise be silently ignored.
7) Update example configs to use druid.indexer.runner.javaOptsArray
instead of druid.indexer.runner.javaOpts. (The latter is deprecated.)
* Adjustments.
* Use run-java in more places.
* Add run-java.
* Update .gitignore.
* Exclude hadoop-client-api.
Brought in when building on Java 17.
* Swap one more usage of java.
* Fix the run-java script.
* Fix flag.
* Include link to Temurin.
* Spelling.
* Update examples/bin/run-java
Co-authored-by: Xavier Léauté <xl+github@xvrl.net>
Co-authored-by: Xavier Léauté <xl+github@xvrl.net>
* Use nonzero default value of maxQueuedBytes.
The purpose of this parameter is to prevent the Broker from running out
of memory. The prior default is unlimited; this patch changes it to a
relatively conservative 25MB.
This may be too low for larger clusters. The risk is that throughput
can decrease for queries with large resultsets or large amounts of intermediate
data. However, I think this is better than the risk of the prior default, which
is that these queries can cause the Broker to go OOM.
* Alter calculation.
* Add clarification for combining input source
* Update inputFormat note
* Update maxNumConcurrentSubTasks note
* Fix broken link
* Update docs/ingestion/native-batch-input-source.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* initial commit of bucket dimensions for metrics
return counts of segments that have rowcount in a bucket size for a datasource
return average value of rowcount per segment in a datasource
added unit test
naming could use a lot of work
buckets right now are not finalized
added javadocs
altered metrics.md
* fix checkstyle issues
* addressed review comments
add monitor test
move added functionality to new monitor
update docs
* address comments
renamed monitor
handle tombstones better
update docs
added javadocs
* Add support for tombstones in the segment distribution
* undo changes to tombstone segmentizer factory
* fix accidental whitespacing changes
* address comments regarding metrics documentation
and rename variable to be more accurate
* fix tests
* fix checkstyle issues
* fix broken test
* undo removal of timeout
* Automatic sizing for GroupBy dictionary sizes.
Merging and selector dictionary sizes currently both default to 100MB.
This is not optimal, because it can lead to OOM on small servers and
insufficient resource utilization on larger servers. It also invites
end users to try to tune it when queries run out of dictionary space,
which can make things worse if the end user sets it to too high.
So, this patch:
- Adds automatic tuning for selector and merge dictionaries. Selectors
use up to 15% of the heap and merge buffers use up to 30% of the heap
(aggregate across all queries).
- Updates out-of-memory error messages to emphasize enabling disk
spilling vs. increasing memory parameters. With the memory parameters
automatically sized, it is more likely that an end user will get
benefit from enabling disk spilling.
- Removes the query context parameters that allow lowering of configured
dictionary sizes. These complicate the calculation, and I don't see a
reasonable use case for them.
* Adjust tests.
* Review adjustments.
* Additional comment.
* Remove unused import.
* IMPLY-12348: Updated description of UNION ALL
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update sql.md
* Update docs/querying/sql.md
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
In a heterogeneous environment, sometimes you don't have control over the input folder. Upstream can put any folder they want. In this situation the S3InputSource.java is unusable.
Most people like me solved it by using Airflow to fetch the full list of parquet files and pass it over to Druid. But doing this explodes the JSON spec. We had a situation where 1 of the JSON spec is 16MB and that's simply too much for Overlord.
This patch allows users to pass {"filter": "*.parquet"} and let Druid performs the filtering of the input files.
I am using the glob notation to be consistent with the LocalFirehose syntax.
* Add TIME_IN_INTERVAL SQL operator.
The operator is implemented as a convertlet rather than an
OperatorConversion, because this allows it to be equivalent to using
the >= and < operators directly.
* SqlParserPos cannot be null here.
* Remove unused import.
* Doc updates.
* Add words to dictionary.
* Service stdout log files, move logs to log/.
Two changes that make log behavior cleaner:
1) Redirect messages from the Java runtime to their own log files.
Otherwise, they would get jumbled up in the output of the all-in-one
start command.
2) Use log/ instead of bin/log/ for the default log directory. Makes them
easier to find.
Additionally, add documentation about how to avoid the reflective
access warnings in Java 11.
* Spelling.
* See if code formatting affects spelling.
* Small addition to Multitenancy considerations doc
* Update docs/querying/multitenancy.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update multitenancy.md
Edit suggested by @kfaraz
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Adding zstandard compression library
* 1. Took @clintropolis's advice to have ZStandard decompressor use the byte array when the buffers are not direct.
2. Cleaned up checkstyle issues.
* Fixing zstandard version to latest stable version in pom's and updating license files
* Removing zstd from benchmarks and adding to processing (poms)
* fix the intellij inspection issue
* Removing the prefix v for the version in the license check for ztsd
* Fixing license checks
Co-authored-by: Rahul Gidwani <r_gidwani@apple.com>
* Emit state of replace and append for native batch tasks
* Emit count of one depending on batch ingestion mode (APPEND, OVERWRITE, REPLACE)
* Add metric to compaction job
* Avoid null ptr exc when null emitter
* Coverage
* Emit tombstone & segment counts
* Tasks need a type
* Spelling
* Integrate BatchIngestionMode in batch ingestion tasks functionality
* Typos
* Remove batch ingestion type from metric since it is already in a dimension. Move IngestionMode to AbstractTask to facilitate having mode as a dimension. Add metrics to streaming. Add missing coverage.
* Avoid inner class referenced by sub-class inspection. Refactor computation of IngestionMode to make it more robust to null IOConfig and fix test.
* Spelling
* Avoid polluting the Task interface
* Rename computeCompaction methods to avoid ambiguous java compiler error if they are passed null. Other minor cleanup.
* ConcurrentGrouper: Add option to always slice up merge buffers thread-locally.
Normally, the ConcurrentGrouper shares merge buffers across processing
threads until spilling starts, and then switches to a thread-local model.
This minimizes memory use and reduces likelihood of spilling, which is
good, but it creates thread contention. The new mergeThreadLocal option
causes a query to start in thread-local mode immediately, and allows us
to experiment with the relative performance of the two modes.
* Fix grammar in docs.
* Fix race in ConcurrentGrouper.
* Fix issue with timeouts.
* Remove unused import.
* Add "tradeoff" to dictionary.
* SQL: Add is_active to sys.segments, update examples and docs.
is_active is short for:
(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1
It's important because this represents "all the segments that should
be queryable, whether or not they actually are right now". Most of the
time, this is the set of segments that people will want to look at.
The web console already adds this filter to a lot of its queries,
proving its usefulness.
This patch also reworks the caveat at the bottom of the sys.segments
section, so its information is mixed into the description of each result
field. This should make it more likely for people to see the information.
* Wording updates.
* Adjustments for spellcheck.
* Adjust IT.
* Improved docs for range partitioning.
1) Clarify the benefits of range partitioning.
2) Clarify which filters support pruning.
3) Include the fact that multi-value dimensions cannot be used for partitioning.
* Additional clarification.
* Update other section.
* Another adjustment.
* Updates from review.
* docs(fix): clarify how worker.version and minWorkerVersion comparison works
* Revert "docs(fix): clarify how worker.version and minWorkerVersion comparison works"
This reverts commit cadd1fdc60.
* docs(fix): clarify how worker.version and minWorkerVersion comparison works
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/configuration/index.md
fix spelling
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
In the majority of cases, this improves performance.
There's only one case I'm aware of where this may be a net negative: for time_floor(__time, <period>) where there are many repeated __time values. In nonvectorized processing, SingleLongInputCachingExpressionColumnValueSelector implements an optimization to avoid computing the time_floor function on every row. There is no such optimization in vectorized processing.
IMO, we shouldn't mention this in the docs. Rationale: It's too fiddly of a thing: it's not guaranteed that nonvectorized processing will be faster due to the optimization, because it would have to overcome the inherent speed advantage of vectorization. So it'd always require testing to determine the best setting for a specific dataset. It would be bad if users disabled vectorization thinking it would speed up their queries, and it actually slowed them down. And even if users do their own testing, at some point in the future we'll implement the optimization for vectorized processing too, and it's likely that users that explicitly disabled vectorization will continue to have it disabled. I'd like to avoid this outcome by encouraging all users to enable vectorization at all times. Really advanced users would be following development activity anyway, and can read this issue
Currently all Druid processes share the same log4j2 configuration file located in _common directory. Since peon processes are spawned by middle manager process, they derivate the environment variables from the middle manager. These variables include those in the log4j2.xml controlling to which file the logger writes the log.
But current task logging mechanism requires the peon processes to output the log to console so that the middle manager can redirect the console output to a file and upload this file to task log storage.
So, this PR imposes this requirement to peon processes, whatever the configuration is in the shared log4j2.xml, peon processes always write the log to console.
setting thread names takes a measurable amount of time in the case where segment scans are very quick. In high-QPS testing we found a slight performance boost from turning off processing thread renaming. This option makes that possible.