Apache Druid: a high-performance, real-time analytics database.
Surekha 13c616ba24 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583)
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks

Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing. If this value is not optimal for your JVM heap size, it can sometimes
lead to an OutOfMemoryError. A lower value leads to frequent persists, which can hurt
query performance, while a higher value limits the number of persists but requires more
JVM heap space and can still lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem: it limits the total number of bytes
kept in memory before persisting.

 * The default value is one third of Runtime.maxMemory() (later in this PR lowered to one sixth)
 * To maintain the current behaviour, set 'maxBytesInMemory' to -1
 * If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
   will be respected, i.e. the first one to exceed its threshold will trigger a persist (see the sketch below)
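
Since the interaction of the two limits is only described in prose above, here is a minimal sketch of the "first limit crossed triggers a persist" rule. The class name, field names, and numeric values are hypothetical and are not Druid's actual Appenderator/IncrementalIndex code.

```java
/**
 * Minimal, hypothetical sketch of the dual-limit persist rule described above.
 * This is an illustration only, not Druid's actual indexing code.
 */
public class PersistThresholds
{
  private final long maxRowsInMemory;
  private final long maxBytesInMemory; // -1 disables the byte-based limit

  public PersistThresholds(long maxRowsInMemory, long maxBytesInMemory)
  {
    this.maxRowsInMemory = maxRowsInMemory;
    this.maxBytesInMemory = maxBytesInMemory;
  }

  /** Persist as soon as either limit is exceeded, whichever happens first. */
  public boolean shouldPersist(long rowsInMemory, long bytesInMemory)
  {
    return rowsInMemory >= maxRowsInMemory
           || (maxBytesInMemory > 0 && bytesInMemory >= maxBytesInMemory);
  }

  public static void main(String[] args)
  {
    PersistThresholds thresholds = new PersistThresholds(75_000, 500_000_000);
    System.out.println(thresholds.shouldPersist(80_000, 100_000_000)); // true: row limit crossed first
    System.out.println(thresholds.shouldPersist(10_000, 600_000_000)); // true: byte limit crossed first
  }
}
```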

* Fix check style and remove a comment

* Add overlord unsecured paths to coordinator when using combined service (#5579)

* Add overlord unsecured paths to coordinator when using combined service

* PR comment

* More error reporting and stats for ingestion tasks (#5418)

* Add more indexing task status and error reporting

* PR comments, add support in AppenderatorDriverRealtimeIndexTask

* Use TaskReport instead of metrics/context

* Fix tests

* Use TaskReport uploads

* Refactor fire department metrics retrieval

* Refactor input row serde in hadoop task

* Refactor hadoop task loader names

* Truncate error message in TaskStatus, add errorMsg to task report

* PR comments

* Allow getDomain to return disjointed intervals (#5570)

* Allow getDomain to return disjointed intervals

* Indentation issues

* Adding feature thetaSketchConstant to do some set operation in PostAgg (#5551)

* Adding feature thetaSketchConstant to do some set operation in PostAggregator

* Updated review comments for PR #5551 - Adding thetaSketchConstant

* Fixed CI build issue

* Updated review comments 2 for PR #5551 - Adding thetaSketchConstant

* Fix taskDuration docs for KafkaIndexingService (#5572)

* With incremental handoff the changed line is no longer true.

* Add doc for automatic pendingSegments (#5565)

* Add missing doc for automatic pendingSegments

* address comments

* Fix indexTask to respect forceExtendableShardSpecs (#5509)

* Fix indexTask to respect forceExtendableShardSpecs

* add comments

* Deprecate spark2 profile in pom.xml (#5581)

Deprecated due to https://github.com/druid-io/druid/pull/5382

* CompressionUtils: Add support for decompressing xz, bz2, zip. (#5586)

Also switch various firehoses to the new method.

Fixes #5585.
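
The commit above only names the newly supported formats. As a rough, hedged illustration of extension-based decompression (this is not CompressionUtils' actual API, and Apache Commons Compress is an assumption about the underlying library), one approach looks like this:

```java
import org.apache.commons.compress.compressors.CompressorException;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

public class DecompressSketch
{
  /**
   * Returns an InputStream over the decompressed contents of the given file.
   * xz, bz2, and gz are auto-detected by Commons Compress; zip is handled
   * separately because it is an archive format (only the first entry is read).
   */
  static InputStream open(String path) throws IOException, CompressorException
  {
    BufferedInputStream in = new BufferedInputStream(new FileInputStream(path));
    if (path.endsWith(".zip")) {
      ZipInputStream zip = new ZipInputStream(in);
      zip.getNextEntry(); // position the stream at the first entry
      return zip;
    }
    return new CompressorStreamFactory().createCompressorInputStream(in);
  }
}
```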

* Address code review comments

* Fix the coding style according to druid conventions
* Add more javadocs
* Rename some variables/methods
* Other minor issues

* Address more code review comments

* Some refactoring to put defaults in IndexTaskUtils
* Added check for maxBytesInMemory in AppenderatorImpl
* Decrement bytes in abandonSegment
* Add a unit test for multiple sinks in a single appenderator
* Fix some merge conflicts after rebase

* Fix some style checks

* Merge conflicts

* Fix failing tests

Add back check for 0 maxBytesInMemory in OnHeapIncrementalIndex

* Address PR comments

* Put defaults for maxRows and maxBytes in TuningConfig
* Change/add javadocs
* Refactoring and renaming some variables/methods

* Fix TeamCity inspection warnings

* Added maxBytesInMemory config to HadoopTuningConfig

* Updated the docs and examples

* Added maxBytesInMemory config in docs
* Removed references to maxRowsInMemory under tuningConfig in examples

* Set maxBytesInMemory to 0 until used

Set maxBytesInMemory to 0 if the user does not set it as part of the tuningConfig,
and resolve it to a fraction of the maximum JVM heap when the ingestion task starts

* Update toString in KafkaSupervisorTuningConfig

* Use correct maxBytesInMemory value in AppenderatorImpl

* Update DEFAULT_MAX_BYTES_IN_MEMORY to 1/6 of the maximum JVM heap

Experimenting with various defaults showed that 1/3 of the JVM heap causes OOM (see the sketch below)
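
A small sketch of the default resolution described in the commits above: 0 means the user did not set maxBytesInMemory and it falls back to one sixth of the maximum JVM heap at task start, while -1 disables the byte limit. The class and method names here are illustrative, not Druid's actual TuningConfig code.

```java
/**
 * Illustrative sketch (not Druid's actual TuningConfig code) of resolving
 * maxBytesInMemory when an ingestion task starts.
 */
public class MaxBytesDefaults
{
  // 0 means "not set by the user"; -1 means "byte limit disabled".
  static long resolveMaxBytesInMemory(long configuredMaxBytesInMemory)
  {
    if (configuredMaxBytesInMemory == 0) {
      // Default per the commit above: one sixth of the maximum JVM heap.
      return Runtime.getRuntime().maxMemory() / 6;
    }
    return configuredMaxBytesInMemory;
  }

  public static void main(String[] args)
  {
    System.out.println(resolveMaxBytesInMemory(0));            // heap-derived default
    System.out.println(resolveMaxBytesInMemory(-1));           // -1: byte limit disabled
    System.out.println(resolveMaxBytesInMemory(250_000_000L)); // explicit user value
  }
}
```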

* Update docs to correct maxBytesInMemory default value

* Minor to rename and add comment

* Add more details in docs

* Address new PR comments

* Address PR comments

* Fix spelling typo
2018-05-03 16:25:58 -07:00
.idea Remove unused code and exception declarations (#5461) 2018-03-16 22:11:12 +01:00
api Use unique segment paths for Kafka indexing (#5692) 2018-04-29 21:59:48 -07:00
aws-common Support enablePathStyleAccess, disableChunkedEncoding, and forceGlobalBucketAccessEnabled for aws client (#5702) 2018-05-02 10:45:38 -07:00
benchmarks Refactor index merging, replace Rowboats with RowIterators and RowPointers (#5335) 2018-04-27 17:34:32 -07:00
ci Add TeamCity instructions (#5379) 2018-02-10 13:13:33 -08:00
codestyle Add GenericWhitespace checkstyle check (#5668) 2018-04-24 01:09:14 +05:30
common Use mergeBuffer instead of processingBuffer in parallelCombiner (#5634) 2018-04-27 18:14:37 -07:00
distribution Opentsdb emitter extension (#5380) 2018-02-13 13:10:22 -08:00
docs 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
examples 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
extendedset Remove unused code and exception declarations (#5461) 2018-03-16 22:11:12 +01:00
extensions-contrib 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
extensions-core 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
hll Remove unused code and exception declarations (#5461) 2018-03-16 22:11:12 +01:00
indexing-hadoop 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
indexing-service 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
integration-tests Support enablePathStyleAccess, disableChunkedEncoding, and forceGlobalBucketAccessEnabled for aws client (#5702) 2018-05-02 10:45:38 -07:00
java-util fix NPE when buffersList contains null in SmooshedFileMapper (#5689) 2018-04-27 18:15:04 -07:00
processing 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
publications Changes to lambda architecture paper required for HICSS (#3382) 2016-09-06 21:32:21 -07:00
server 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
services 'maxBytesInMemory' tuningConfig introduced for ingestion tasks (#5583) 2018-05-03 16:25:58 -07:00
sql SQL: Remove some unused code. (#5690) 2018-04-24 11:42:16 -07:00
.gitignore git ignore dependency-reduced-pom.xml (#4711) 2017-08-23 10:10:50 -07:00
.travis.yml Use the official aws-sdk instead of jet3t (#5382) 2018-03-21 15:36:54 -07:00
CONTRIBUTING.md Replace dev list references in docs. (#5723) 2018-04-30 11:25:45 -07:00
DruidCorporateCLA.pdf fix CLA email / mailing address 2014-04-17 15:26:28 -07:00
DruidIndividualCLA.pdf fix CLA email / mailing address 2014-04-17 15:26:28 -07:00
INTELLIJ_SETUP.md Prohibit and remove unused declarations in the processing module (#4930) 2017-11-09 09:27:27 -08:00
LICENSE Clean up README and license 2015-02-18 23:09:28 -08:00
NOTICE Extension points for authentication/authorization (#4271) 2017-09-15 23:45:48 -07:00
README.md Replace dev list references in docs. (#5723) 2018-04-30 11:25:45 -07:00
druid_intellij_formatting.xml Make formatting IntelliJ 2016 friendly (#2978) 2016-05-18 12:42:21 -07:00
eclipse.importorder Merge pull request #2905 from javasoze/eclipse_formatting 2016-04-29 18:42:03 -07:00
eclipse_formatting.xml Merge pull request #2905 from javasoze/eclipse_formatting 2016-04-29 18:42:03 -07:00
intellij-sdk-config.jpg Prohibit and remove unused declarations in the processing module (#4930) 2017-11-09 09:27:27 -08:00
pom.xml CompressionUtils: Add support for decompressing xz, bz2, zip. (#5586) 2018-04-06 08:06:45 -07:00
upload.sh upload.sh: Use awscli if s3cmd is not available. (#3114) 2016-06-08 17:01:46 -07:00

README.md


Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments.

Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Druid can load both streaming and batch data and integrates with Samza, Kafka, Storm, Spark, and Hadoop.

License

Apache License, Version 2.0

More Information

More information about Druid can be found on http://www.druid.io.

Documentation

You can find the documentation for the latest Druid release on the project website.

If you would like to contribute documentation, please do so under /docs/content in this repository and submit a pull request.

Getting Started

You can get started with Druid with our quickstart.

Reporting Issues

If you find any bugs, please file a GitHub issue.

Community

The Druid community is in the process of migrating to Apache by way of the Apache Incubator. Eventually, as we proceed along this path, our site will move from http://druid.io/ to https://druid.apache.org/.

Community support is available on the druid-user mailing list (druid-user@googlegroups.com), which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

We also have a couple of people hanging out on IRC in #druid-dev on irc.freenode.net.

Contributing

Please follow the guidelines listed here.