* Native parallel indexing without shuffle
* fix build
* fix ci
* fix ingestion without intervals
* fix retry
* fix retry
* add it test
* use chat handler
* fix build
* add docs
* fix ITUnionQueryTest
* fix failures
* disable metrics reporting
* working
* Fix split of static-s3 firehose
* Add endpoints to supervisor task and a unit test for endpoints
* increase timeout in test
* Added doc
* Address comments
* Fix overlapping locks
* address comments
* Fix static s3 firehose
* Fix test
* fix build
* fix test
* fix typo in docs
* add missing maxBytesInMemory to doc
* address comments
* fix race in test
* fix test
* Rename to ParallelIndexSupervisorTask
* fix teamcity
* address comments
* Fix license
* addressing comments
* addressing comments
* indexTaskClient-based segmentAllocator instead of CountingActionBasedSegmentAllocator
* Fix race in TaskMonitor and move HTTP endpoints to supervisorTask from runner
* Add more javadocs
* use StringUtils.nonStrictFormat for logging
* fix typo and remove unused class
* fix tests
* change package
* fix strict build
* tmp
* Fix overlord api according to the recent change in master
* Fix it test
* implement materialized view
* modify code according to jihoonson's comments
* modify code according to jihoonson's comments - 2
* add documentation about materialized view
* use new HadoopTuningConfig in pr 5583
* add minDataLag and fix optimizer bug
* correct value of DEFAULT_MIN_DATA_LAG_MS
* modify code according to jihoonson's comments - 3
* use the boolean expression instead of if-else
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing.If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
will be respected i.e. the first one to go above threshold will trigger persist
* Fix check style and remove a comment
* Add overlord unsecured paths to coordinator when using combined service (#5579)
* Add overlord unsecured paths to coordinator when using combined service
* PR comment
* More error reporting and stats for ingestion tasks (#5418)
* Add more indexing task status and error reporting
* PR comments, add support in AppenderatorDriverRealtimeIndexTask
* Use TaskReport instead of metrics/context
* Fix tests
* Use TaskReport uploads
* Refactor fire department metrics retrieval
* Refactor input row serde in hadoop task
* Refactor hadoop task loader names
* Truncate error message in TaskStatus, add errorMsg to task report
* PR comments
* Allow getDomain to return disjointed intervals (#5570)
* Allow getDomain to return disjointed intervals
* Indentation issues
* Adding feature thetaSketchConstant to do some set operation in PostAgg (#5551)
* Adding feature thetaSketchConstant to do some set operation in PostAggregator
* Updated review comments for PR #5551 - Adding thetaSketchConstant
* Fixed CI build issue
* Updated review comments 2 for PR #5551 - Adding thetaSketchConstant
* Fix taskDuration docs for KafkaIndexingService (#5572)
* With incremental handoff the changed line is no longer true.
* Add doc for automatic pendingSegments (#5565)
* Add missing doc for automatic pendingSegments
* address comments
* Fix indexTask to respect forceExtendableShardSpecs (#5509)
* Fix indexTask to respect forceExtendableShardSpecs
* add comments
* Deprecate spark2 profile in pom.xml (#5581)
Deprecated due to https://github.com/druid-io/druid/pull/5382
* CompressionUtils: Add support for decompressing xz, bz2, zip. (#5586)
Also switch various firehoses to the new method.
Fixes#5585.
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing.If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
will be respected i.e. the first one to go above threshold will trigger persist
* Address code review comments
* Fix the coding style according to druid conventions
* Add more javadocs
* Rename some variables/methods
* Other minor issues
* Address more code review comments
* Some refactoring to put defaults in IndexTaskUtils
* Added check for maxBytesInMemory in AppenderatorImpl
* Decrement bytes in abandonSegment
* Test unit test for multiple sinks in single appenderator
* Fix some merge conflicts after rebase
* Fix some style checks
* Merge conflicts
* Fix failing tests
Add back check for 0 maxBytesInMemory in OnHeapIncrementalIndex
* Address PR comments
* Put defaults for maxRows and maxBytes in TuningConfig
* Change/add javadocs
* Refactoring and renaming some variables/methods
* Fix TeamCity inspection warnings
* Added maxBytesInMemory config to HadoopTuningConfig
* Updated the docs and examples
* Added maxBytesInMemory config in docs
* Removed references to maxRowsInMemory under tuningConfig in examples
* Set maxBytesInMemory to 0 until used
Set the maxBytesInMemory to 0 if user does not set it as part of tuningConfing
and set to part of max jvm memory when ingestion task starts
* Update toString in KafkaSupervisorTuningConfig
* Use correct maxBytesInMemory value in AppenderatorImpl
* Update DEFAULT_MAX_BYTES_IN_MEMORY to 1/6 max jvm memory
Experimenting with various defaults, 1/3 jvm memory causes OOM
* Update docs to correct maxBytesInMemory default value
* Minor to rename and add comment
* Add more details in docs
* Address new PR comments
* Address PR comments
* Fix spelling typo
* Fix Kerberos Authentication failing requests without cookies.
KerberosAuthenticator was failing `First` request from the clients.
After authentication we were setting the cookie properly but not
setting the the authenticated flag in the request. This PR fixed that.
Additional Fixes -
* Removing of Unused SpnegoFilterConfig - replaced by
KerberosAuthenticator
* Unused internalClientKeytab and principal from KerberosAuthenticator
* Fix docs accordingly and add docs for configuring an escalated
client.
* Fix excluded path config behavior
* spelling correction
* Revert "spelling correction"
This reverts commit fb754b43d8.
* Revert "Fix excluded path config behavior"
This reverts commit 3901047769.
Originally written by @AlexanderSaydakov in druid-io/druid-io.github.io#448.
I also added redirects and updated links to point to the new
datasketches-extension.html landing page for the extension, rather than to
the old page about theta sketches.
* Use the official aws-sdk instead of jet3t
* fix compile and serde tests
* address comments and fix test
* add http version string
* remove redundant dependencies, fix potential NPE, and fix test
* resolve TODOs
* fix build
* downgrade jackson version to 2.6.7
* fix test
* resolve the last TODO
* support proxy and endpoint configurations
* fix build
* remove debugging log
* downgrade hadoop version to 2.8.3
* fix tests
* remove unused log
* fix it test
* revert KerberosAuthenticator change
* change hadoop-aws scope to provided in hdfs-storage
* address comments
* address comments
* Support map type in orc extension.
Added getMapObject in OrcHadoopInputRowParser
Updated parse tests to parse map-type field in OrcHadoopInputRowParserTest
* changed from for-loop to foreach
* added resolution of column names when map types are exploded to several
columns. updated the document as well -- orc.md.
* Update orc.md
change from review
* opentsdb emitter extension
* doc for opentsdb emitter extension
* update opentsdb emitter doc
* add the ms unit to the constant name
* add a configurable event limit
* fix version to 0.13.0-SNAPSHOT
* using a thread to consume metric event
* rename method and parameter
Code changes:
- In the lookup-based extractionFns, inherit injective property from
the lookup itself if not specified.
Doc changes:
- Add a "Query execution" section to the lookups doc explaining how
injective lookups and their optimizations work.
- Remove scary warnings against using registeredLookup extractionFns.
They are necessary and important since they work with filters and
function cascades -- two things that the dimension specs do not do.
They deserve to be first class citizens.
- Move the "registeredLookup" fn above the "lookup" fn. It's probably
more commonly used, so the docs read better this way.
* Encrypting MySQL connections
* Update docs
* Make verifyServerCertificate a configurable parameter
* Change password parameter and doc update
* Make server certificate verification disabled by default
* Update tostring
* Update docs
* Add check for trust store passwords
* Add warning for null password
* Kafka Index Task that supports Incremental handoffs
- Incrementally handoff segments when they hit maxRowsPerSegment limit
- Decouple segment partitioning from Kafka partitioning, all records from consumed partitions go to a single druid segment
- Support for restoring task on middle manager restarts by check pointing end offsets for segments
* take care of review comments
* make getCurrentOffsets call async, keep track of publishing sequence, review comments
* fix setEndoffset duplicate request handling, formatting
* fix unit test
* backward compatibility
* make AppenderatorDriverMetadata backwards compatible
* add unit test
* fix deadlock between persist and push executors in AppenderatorImpl
* fix formatting
* use persist dir instead of work dir
* review comments
* fix deadlock
* actually fix deadlock
* adding new post aggregators of test stats to druid-stats extension
* changes to address code review comments
* fix checkstyle violations using druid_intellij_formatting.xml after merge upstream/master
* add @Override annotation per CI log
* make changes per review comments/discussions
* remove some blocks per review comments
* Add flattenSpec support to the Avro parser.
Also:
- Refactor the JSONPathParser a bit so it can share flattening code
with Avro (see ObjectFlatteners).
- Remove the JSONParser. It was only used in two places: by
UriNamespaceExtractor, and as a base for JSONToLowerParser. Migrated
the former to JSONPathParser and made the latter a standalone.
- Move GenericRecordAsMap to the Parquet extension, since the Avro
extension no longer uses it.
* Fix indentation.
* Fix equals/hashCode.
* Move caffeine out of extension.
* Remove `JsonTypeName` from the class itself
* Fix bad docs
* Fix distribution pom
* Fix unused import
* Make caffeine default
* Address code comments
* Add more description around the jre version in the readme
* Add suggested comments
* Move scan-query from a contrib extension into core.
Based on a proposal at: https://groups.google.com/d/topic/druid-development/ME_OatUDnbk/discussion
This patch also adds support for virtual columns to the Scan query,
and updates Druid SQL to use Scan instead of Select.
This patch also makes some behavioral changes to handling of the __time
column. In particular, it is now is returned as "__time" rather than
"timestamp"; it is no longer included if you do not specifically ask for
it in your "columns"; and it is returned as a long rather than a string.
Users can revert time handling to the legacy extension behavior by
setting "legacy" : true in their queries, or setting the property
druid.query.scan.legacy = true. This is meant to provide a migration
path for users that were formerly using the contrib extension.
* Adjustments from review.
* Add back Select query.
* Adjust SQL docs.
* Restore SelectQuery link.
* Add @ExtensionPoint and @PublicApi annotations.
* Clean up wording.
* Remove unused import.
* Remove unused imports.
* Only types can be extension points.
* Adjust annotations some more.
* Remove unused import.
* Make ServletFilterHolder an extension point.
* Add a couple extension points, and update docs.
* Implement "earlyMessageRejectionPeriod" config discussed in issue #4599
* implement the logics of this param
* Added doc of this config
* Added unit tests of it
* Update KafkaSupervisor.java
ameliorate comment
* fix format
* fix bug when rebasing
* Added support for where clauses to filter lookup values on ingestion.
Added a filter field to the JDBC lookups that is used to generate a
where clause so that only rows matching the filter value will be
brought into Druid. Example being filter="SOMECOLUMN=1"
* Required changes based on code review.
* Required changes based on code review.
* Added support for where clauses to filter lookup values on ingestion.
Added a filter field to the JDBC lookups that is used to generate a
where clause so that only rows matching the filter value will be
brought into Druid. Example being filter="SOMECOLUMN=1"
* Updates based on code review, mainly formatting and small refactor of
the buildLookupQuery method.
* Fixed broken buildLookupQuery method
* Removed empty line.
* Updates per review comments
* Remove DruidProcessingModule, QueryableModule and QueryRunnerFactoryModule from DI for coordinator, overlord, middle-manager. Add RouterDruidProcessing not to allocate processing resources on router
* Fix examples
* Fixes
* Revert Peon configs and add comments
* Remove qualifier
* Remove ability to create segments in v8 format
* Fix IndexGeneratorJobTest
* Fix parameterized test name in IndexMergerTest
* Remove extra legacy merging stuff
* Remove legacy serializer builders
* Remove ConciseBitmapIndexMergerTest and RoaringBitmapIndexMergerTest
* Add Date support to the parquet reader
Add support for the Date logical type. Currently this is not supported. Since the parquet
date is number of days since epoch gets interpreted as seconds since epoch, it will fails
on indexing the data because it will not map to the appriopriate bucket.
* Cleaned up code and tests
Got rid of unused json files in the examples, cleaned up the tests by
using try-with-resources. Now get the filenames from the json file
instead of hard coding them and integrated general improvements from
the feedback provided by leventov.
* Got rid of the caching
Remove the caching of the logical type of the time dimension column
and cleaned up the code a bit.
* refactor lag reporting and report lag at status endpoint
* refactor offset reporting logic to fetch offsets periodically vs. at request time
* remove JavaCompatUtils
* code review changes
* code review changes
* Refactoring Appenderator
1) Added publishExecutor and handoffExecutor for background publishing and handing segments off
2) Change add() to not move segments out in it
* Address comments
1) Remove publishTimeout for KafkaIndexTask
2) Simplifying registerHandoff()
3) Add increamental handoff test
* Remove unused variable
* Add persist() to Appenderator and more tests for AppenderatorDriver
* Remove unused imports
* Fix strict build
* Address comments
* move ProtoBufInputRowParser from processing module to protobuf extensions
* Ported PR #3509
* add DynamicMessage
* fix local test stuff that slipped in
* add license header
* removed redundant type name
* removed commented code
* fix code style
* rename ProtoBuf -> Protobuf
* pom.xml: shade protobuf classes, handle .desc resource file as binary file
* clean up error messages
* pick first message type from descriptor if not specified
* fix protoMessageType null check. add test case
* move protobuf-extension from contrib to core
* document: add new configuration keys, and descriptions
* update document. add examples
* move protobuf-extension from contrib to core (2nd try)
* touch
* include protobuf extensions in the distribution
* fix whitespace
* include protobuf example in the distribution
* example: create new pb obj everytime
* document: use properly quoted json
* fix whitespace
* bump parent version to 0.10.1-SNAPSHOT
* ignore Override check
* touch
* Fix lz4 library incompatibility in kafka-indexing-service extension #3266
* Bumped Kafka version to 0.10.2.0 for : Fix lz4 library incompatibility in kafka-indexing-service extension #3266
* Replaced Lists.newArrayList() with Collections.singletonList() For Fix lz4 library incompatibility in kafka-indexing-service extension #4115
* Initial commit
* Apply another config: clustername
* Rename variable
* Fix bug
* Add retry logic
* Edit retry logic
* Upgrade kafka-clients version to the most recent release
* Make callback single object
* Write documentation
* Rewrite error message and emit logic
* Handling AlertEvent
* Override toString()
* make clusterName more optional
* bump up druid version
* add producer.config option which make user can apply another optional config value of kafka producer
* remove potential blocking in emit()
* using MemoryBoundLinkedBlockingQueue
* Fixing coding convention
* Remove logging every exception and just increment counting
* refactoring
* trivial modification
* logging when callback has exception
* Replace kafka-clients 0.10.1.1 with 0.10.2.0
* Resolve the problem related of classloader
* adopt try statement
* code reformatting
* make variables final
* rewrite toString
Get rid of the metadataUpdateSpec section in the json example to
ingest parquet into druid. When this element is present, it will
fail start an indexing job.
* NN optimization for hdfs data segments.
* HdfsDataSegmentKiller, HdfsDataSegment finder changes to use new storage
format.Docs update.
* Common utility function in DataSegmentPusherUtil.
* new static method `makeSegmentOutputPathUptoVersionForHdfs` in JobHelper
* reuse getHdfsStorageDirUptoVersion in
DataSegmentPusherUtil.getHdfsStorageDir()
* Addressed comments.
* Review comments.
* HdfsDataSegmentKiller requested changes.
* extra newline
* Add maprfs.
* Add extension for supporting kerberos security
- This PR adds an extension for supporting druid authentication via
Kerberos.
- Working on the docs.
* Add docs
* review comments
* more review comments
* Block all paths by default
* more review comments - use proper Oid
* Allow extensions to override httpclient for integration tests
* Add kerberos lock to prevent multithreaded issues.
* review comment - remove enabled flag and fix router injection
* Add Cookie Handling and more detailed docs
* review comment - rename DruidKerberosConfig -> AuthKerberosConfig
* review comments
* fix travis failure on jdk7
* streaming version of select query
* use columns instead of dimensions and metrics;prepare for valueVector;remove granularity
* respect query limit within historical
* use constant
* fix thread name corrupted bug when using jetty qtp thread rather than processing thread while working with SpecificSegmentQueryRunner
* add some test for scan query
* add scan query document
* fix merge conflicts
* add compactedList resultFormat, this format is better for json ser/der
* respect query timeout
* respect query limit on broker
* use static consts and remove unused code
* Thrift ingestion plugin
1. thrift binary is platform dependent, use scrooge to generate java files to avoid style check failure
2. stream and hadoop ingesion are both supported, input format can be sequence file and lzo thrift block file.
3. base64 and protocol aware
change header
* fix conlicts in pom
* option to reset offset automatically in case of OffsetOutOfRangeException
if the next offset is less than the earliest available offset for that partition
* review comments
* refactoring
* refactor
* review comments
* URIExtractionNamespace: Treat null values in lookup maps as missing entries.
This is useful when many logical lookups are derived from the same base JSON file,
and some lookups' values may be unknown sometimes.
* Add test, logging message, and address other comments.
* Update docs.