under "Aggregators", about the lgK setting, it said "Must be a power of 2 from 4 to 21 inclusively." 21 is not a power of 2, nor is 12, the given default. I think there may have been confusion because lgK represents log2 of K. We could say "K must be a power of 2...", or just say lgK must be between 4 and 21.
Enhanced the ExtractionNamespace interface in lookups-cached-global core extension with the ability to set a maxHeapPercentage for the cache of the respective namespace. The reason for adding this functionality, is make it easier to detect when a lookup table grows to a size that the underlying service cannot handle, because it does not have enough memory. The default value of maxHeap for the interface is -1, which indicates that no maxHeapPercentage has been set. For the JdbcExtractionNamespace and UriExtractionNamespace implementations, the default value is null, which will cause the respective service that the lookup is loaded in, to warn when its cache is beyond mxHeapPercentage of the service's configured max heap size. If a positive non-null value is set for the namespace's maxHeapPercentage config, this value will be honored for all services that the respective lookup is loaded onto, and consequently log warning messages when the cache of the respective lookup grows beyond this respective percentage of the services configured max heap size. Warnings are logged every time that either Uri based or Jdbc based lookups are regenerated, if the maxHeapPercentage constraint is violated. No other implementations will log warnings at this time. No error is thrown when the size exceeds the maxHeapPercentage at this time, as doing so could break functionality for existing users. Previously the JdbcCacheGenerator generated its cache by materializing all rows of the underling table in memory at once; this made it difficult to log warning messages in the case that the results from the jdbc query were very large and caused the service to run out of memory. To help with this, this pr makes it so that the jdbc query results are instead streamed through an iterator.
### Description
Today we ingest a number of high cardinality metrics into Druid across dimensions. These metrics are rolled up on a per minute basis, and are very useful when looking at metrics on a partition or client basis. Events is another class of data that provides useful information about a particular incident/scenario inside a Kafka cluster. Events themselves are carried inside kafka payload, but nonetheless there are some very useful metadata that is carried in kafka headers that can serve as useful dimension for aggregation and in turn bringing better insights.
PR(https://github.com/apache/druid/pull/10730) introduced support of Kafka headers in InputFormats.
We still need an input format to parse out the headers and translate those into relevant columns in Druid. Until that’s implemented, none of the information available in the Kafka message headers would be exposed. So first there is a need to write an input format that can parse headers in any given format(provided we support the format) like we parse payloads today. Apart from headers there is also some useful information present in the key portion of the kafka record. We also need a way to expose the data present in the key as druid columns. We need a generic way to express at configuration time what attributes from headers, key and payload need to be ingested into druid. We need to keep the design generic enough so that users can specify different parsers for headers, key and payload.
This PR is designed to solve the above by providing wrapper around any existing input formats and merging the data into a single unified Druid row.
Lets look at a sample input format from the above discussion
"inputFormat":
{
"type": "kafka", // New input format type
"headerLabelPrefix": "kafka.header.", // Label prefix for header columns, this will avoid collusions while merging columns
"recordTimestampLabelPrefix": "kafka.", // Kafka record's timestamp is made available in case payload does not carry timestamp
"headerFormat": // Header parser specifying that values are of type string
{
"type": "string"
},
"valueFormat": // Value parser from json parsing
{
"type": "json",
"flattenSpec": {
"useFieldDiscovery": true,
"fields": [...]
}
},
"keyFormat": // Key parser also from json parsing
{
"type": "json"
}
}
Since we have independent sections for header, key and payload, it will enable parsing each section with its own parser, eg., headers coming in as string and payload as json.
KafkaInputFormat will be the uber class extending inputFormat interface and will be responsible for creating individual parsers for header, key and payload, blend the data resolving conflicts in columns and generating a single unified InputRow for Druid ingestion.
"headerFormat" will allow users to plug parser type for the header values and will add default header prefix as "kafka.header."(can be overridden) for attributes to avoid collision while merging attributes with payload.
Kafka payload parser will be responsible for parsing the Value portion of the Kafka record. This is where most of the data will come from and we should be able to plugin existing parser. One thing to note here is that if batching is performed, then the code is augmenting header and key values to every record in the batch.
Kafka key parser will handle parsing Key portion of the Kafka record and will ingest the Key with dimension name as "kafka.key".
## KafkaInputFormat Class:
This is the class that orchestrates sending the consumerRecord to each parser, retrieve rows, merge the columns into one final row for Druid consumption. KafkaInputformat should make sure to release the resources that gets allocated as a part of reader in CloseableIterator<InputRow> during normal and exception cases.
During conflicts in dimension/metrics names, the code will prefer dimension names from payload and ignore the dimension either from headers/key. This is done so that existing input formats can be easily migrated to this new format without worrying about losing information.
* Configurable maxStreamLength for doubles sketches
* fix equals/hashcode and it test failure
* fix test
* fix it test
* benchmark
* doc
* grouping key
* fix comment
* dependency check
* Update docs/development/extensions-core/datasketches-quantiles.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
This change updates doc to clarify when and how a change to druid.auth.authenticator.basic.credentialIterations takes effect: changes apply only to new users or existing users upon changing their password via the credentials API, which may not be the expectation.
* HLL lgK and a tip
Knowledge transfer from https://the-asf.slack.com/archives/CJ8D1JTB8/p1600699967024200. Attempted to make a connection between the SQL HLL function and the HLL underneath without getting too complicated. Also added a note about using K over 16 being pretty much pointless.
* Corrected spelling
* Create datasketches-hll.md
Put roll-up back to rollup
* Update docs/development/extensions-core/datasketches-hll.md
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
* Avro union support
* Document new union support
* Add support for AvroStreamInputFormat and fix checkstyle
* Extend multi-member union test schema and format
* Some additional docs and add Enums to spelling
* Rename explodeUnions -> extractUnions
* explode -> extract
* ByType
* Correct spelling error
* allow user to set group.id for Kafka ingestion task
* fix test coverage by removing deprecated code and add doc
* fix typo
* Update docs/development/extensions-core/kafka-ingestion.md
Co-authored-by: frank chen <frankchen@apache.org>
Co-authored-by: frank chen <frankchen@apache.org>
* druid task auto scale based on kafka lag
* fix kafkaSupervisorIOConfig and KinesisSupervisorIOConfig
* druid task auto scale based on kafka lag
* fix kafkaSupervisorIOConfig and KinesisSupervisorIOConfig
* test dynamic auto scale done
* auto scale tasks tested on prd cluster
* auto scale tasks tested on prd cluster
* modify code style to solve 29055.10 29055.9 29055.17 29055.18 29055.19 29055.20
* rename test fiel function
* change codes and add docs based on capistrant reviewed
* midify test docs
* modify docs
* modify docs
* modify docs
* merge from master
* Extract the autoScale logic out of SeekableStreamSupervisor to minimize putting more stuff inside there && Make autoscaling algorithm configurable and scalable.
* fix ci failed
* revert msic.xml
* add uts to test autoscaler create && scale out/in and kafka ingest with scale enable
* add more uts
* fix inner class check
* add IT for kafka ingestion with autoscaler
* add new IT in groups=kafka-index named testKafkaIndexDataWithWithAutoscaler
* review change
* code review
* remove unused imports
* fix NLP
* fix docs and UTs
* revert misc.xml
* use jackson to build autoScaleConfig with default values
* add uts
* use jackson to init AutoScalerConfig in IOConfig instead of Map<>
* autoscalerConfig interface and provide a defaultAutoScalerConfig
* modify uts
* modify docs
* fix checkstyle
* revert misc.xml
* modify uts
* reviewed code change
* reviewed code change
* code reviewed
* code review
* log changed
* do StringUtils.encodeForFormat when create allocationExec
* code review && limit taskCountMax to partitionNumbers
* modify docs
* code review
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
* add offsetFetchPeriod to kinesis ingestion doc
* Remove jackson dependencies from extensions
* Use fixed delay for lag collection
* Metrics reset after finishing processing
* comments
* Broaden the list of exceptions to retry for
* Unit tests
* Add more tests
* Refactoring
* re-order metrics
* Doc suggestions
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* Add tests
Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
* zk-less druid cluster in k8s build
* attempt to fix build and use http based remote task management
* mm/router logs for debugging
* add default account k8s role and binding for pod, configMap access
* fix issue
* change router port to 8088 for common readinessProbe
* break build_run_k8s_cluster.sh into separate scripts
* revert changes to K8sDruidNodeAnnouncer.java
* k8s extension doc update
* add license to new file
* address review comments
* do not try to load lookups at startup to improve cluster startup time
* Fix Avro OCF detection prefix and run formation detection on raw input
* Support Avro Fixed and Enum types correctly
* Check Avro version byte in format detection
* Add test for AvroOCFReader.sample
Ensures that the Sampler doesn't receive raw input that it can't
serialize into JSON.
* Document Avro type handling
* Add TS unit tests for guessInputFormat
* Add note about aggreations on floats
Floating point math is known to be unstable. Due to the way aggregators work
across segments it's possible for the same query operating on the same data to
produce slightly different results.
The same problem exists with any aggregators that are not commutative since
the merge order across segments is not guaranteed.
* Also talk about doubles
* Apply suggestions from code review
* Update data-formats.md
Per Suneet, "Since you're editing this file can you also fix the json on line 177 please - it's missing a comma after the }"
* Light text cleanup
* Removing discussion of sample data, since it's repeated in the data loading tutorial, and not immediately relevant here.
* Clarifying accepted values for URI lookup
* Update index.md
* original quickstart full first pass
* original quickstart full first pass
* first pass all the way through
* straggler
* image touchups and finished old tutorial
* a bit of finishing up
* druid-caffeine-cache ext previously removed
* Sample MaxDirectMemorySize value unrealistic
* Review comments
* fixing links
* spell checking gymnastics
* workerThreads desc slightly expanded
* typo
* Typo
* Reversing Kafka config order
* Changing order of configs for Kinesis
* Trying this again: ioConfig then tuningConfig
* Document possible vulnerabilities for the druid-ranger-security
In certain configurations the ranger plugin can expose vulnerabilities due
to some of its dependencies having CVEs.
* Spelling checker is a bit tight
* kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* Kinesis IT
* fix kinesis timeout
* Kinesis IT
* Kinesis IT
* fix checkstyle
* Kinesis IT
* address comments
* fix checkstyle
* druid pac4j security extension for OpenID Connect OAuth 2.0 authentication
* update version in druid-pac4j pom
* introducing unauthorized resource filter
* authenticated but authorized /unified-webconsole.html
* use httpReq.getRequestURI() for matching callback path
* add documentation
* minor doc addition
* licesne file updates
* make dependency analyze succeed
* fix doc build
* hopefully fixes doc build
* hopefully fixes license check build
* yet another try on fixing license build
* revert unintentional changes to website folder
* update version to 0.18.0-SNAPSHOT
* check session and its expiry on each request
* add crypto service
* code for encrypting the cookie
* update doc with cookiePassphrase
* update license yaml
* make sessionstore in Pac4jFilter private non static
* make Pac4jFilter fields final
* okta: use sha256 for hmac
* remove incubating
* add UTs for crypto util and session store impl
* use standard charsets
* add license header
* remove unused file
* add org.objenesis.objenesis to license.yaml
* a bit of nit changes in CryptoService and embedding EncryptionResult for clarity
* rename alg to cipherAlgName
* take cipher alg name, mode and padding as input
* add java doc for CryptoService and make it more understandable
* another UT for CryptoService
* cache pac4j Config
* use generics clearly in Pac4jSessionStore
* update cookiePassphrase doc to mention PasswordProvider
* mark stuff Nullable where appropriate in Pac4jSessionStore
* update doc to mention jdbc
* add error log on reaching callback resource
* javadoc for Pac4jCallbackResource
* introduce NOOP_HTTP_ACTION_ADAPTER
* add correct module name in license file
* correct extensions folder name in licenses.yaml
* replace druid-kubernetes-extensions to druid-pac4j
* cache SecureRandom instance
* rename UnauthorizedResourceFilter to AuthenticationOnlyResourceFilter
* Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
* Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
* Add support for optional cloud (aws, gcs, etc.) credentials for s3 for ingestion
* fix build failure
* fix failing build
* fix failing build
* Code cleanup
* fix failing test
* Removed CloudConfigProperties and make specific class for each cloudInputSource
* Removed CloudConfigProperties and make specific class for each cloudInputSource
* pass s3ConfigProperties for split
* lazy init s3client
* update docs
* fix docs check
* address comments
* add ServerSideEncryptingAmazonS3.Builder
* fix failing checkstyle
* fix typo
* wrap the ServerSideEncryptingAmazonS3.Builder in a provider
* added java docs for S3InputSource constructor
* added java docs for S3InputSource constructor
* remove wrap the ServerSideEncryptingAmazonS3.Builder in a provider
* Move Azure extension into Core
Moving the azure extension into Core.
* * Fix build failure
* * Add The MIT License (MIT) to list of compatible licenses
* * Address review comments
* * change reference to contrib azure to core azure
* * Fix spelling mistakes.
* Add common optional dependencies for extensions
Include hadoop-aws and postgres JDBC connector jar to improve
out-of-the-box experience for extensions. The mysql JDBC connector jar
is not bundled as it is GPL.
* Update docs
* Fix typo
* Add Azure config options for segment prefix and max listing length
Added configuration options to allow the user to specify the prefix
within the segment container to store the segment files. Also
added a configuration option to allow the user to specify the
maximum number of input files to stream for each iteration.
* * Fix test failures
* * Address review comments
* * add dependency explicitly to pom
* * update docs
* * Address review comments
* * Address review comments
* Doc update for new input source and input format.
- The input source and input format are promoted in all docs under docs/ingestion
- All input sources including core extension ones are located in docs/ingestion/native-batch.md
- All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md
- New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md
* parquet
* add warning for range partitioning with sequential mode
* hdfs + s3, gs
* add fs impl for gs
* address comments
* address comments
* gcs
* Tutorials use new ingestion spec where possible
There are 2 main changes
* Use task type index_parallel instead of index
* Remove the use of parser + firehose in favor of inputFormat + inputSource
index_parallel is the preferred method starting in 0.17. Setting the job to
index_parallel with the default maxNumConcurrentSubTasks(1) is the equivalent
of an index task
Instead of using a parserSpec, dimensionSpec and timestampSpec have been
promoted to the dataSchema. The format is described in the ioConfig as the
inputFormat.
There are a few cases where the new format is not supported
* Hadoop must use firehoses instead of the inputSource and inputFormat
* There is no equivalent of a combining firehose as an inputSource
* A Combining firehose does not support index_parallel
* fix typo
* add prefixes support to google input source, making it symmetrical-ish with s3
* docs
* more better, and tests
* unused
* formatting
* javadoc
* dependencies
* oops
* review comments
* better javadoc
* add s3 input source for native batch ingestion
* add docs
* fixes
* checkstyle
* lazy splits
* fixes and hella tests
* fix it
* re-use better iterator
* use key
* javadoc and checkstyle
* exception
* oops
* refactor to use S3Coords instead of URI
* remove unused code, add retrying stream to handle s3 stream
* remove unused parameter
* update to latest master
* use list of objects instead of object
* serde test
* refactor and such
* now with the ability to compile
* fix signature and javadocs
* fix conflicts yet again, fix S3 uri stuffs
* more tests, enforce uri for bucket
* javadoc
* oops
* abstract class instead of interface
* null or empty
* better error
* add parquet support to native batch
* cleanup
* implement toJson for sampler support
* better binaryAsString test
* docs
* i hate spellcheck
* refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest
* add comment, fix some stuff
* adjustments
* fix accident
* tweaks
If the JDBC drivers are missing from the lookup extensions, throw an
exception that directs the user how to resolve the issue. This change is
a follow up to #8825.
* Add reference to `druid.storage.type`
This should be in here. Without setting storage type to S3 globally it will obviously not be used, even if all other parameters are correct.
* Update s3.md
Add global storage parameter to knob table.
* Update s3.md
* Add option lateMessageRejectionStartDate
* Use option lateMessageRejectionStartDate
* Fix tests
* Add lateMessageRejectionStartDate to kafka indexing service
* Update tests kafka indexing service
* Fix tests for KafkaSupervisorTest
* Add lateMessageRejectionStartDate to KinesisSupervisorIOConfig
* Fix var name
* Update documentation
* Add check lateMessageRejectionStartDateTime and lateMessageRejectionPeriod, fails if both were specified.
Since it hasn't received updates or community interest in a while, it makes sense
to de-emphasize it in the distribution and most documentation (outside of simple
mentions of its existence).