* Lookup cache refactoring (the main part of druid-io/druid#3667)
* Use PowerMock's static methods in NamespaceLookupExtractorFactoryTest
* Fix KafkaLookupExtractorFactoryTest
* Use VisibleForTesting annotation instead of Javadoc comment
* Create a NamespaceExtractionCacheManager separately for each test in NamespaceExtractionCacheManagersTest
* Rename CacheScheduler.NoCache.ENTRY_DISPOSED to ENTRY_CLOSED
* Reduce visibility of NamespaceExtractionCacheManager.cacheCount() and monitor() implementations, and don't run NamespaceExtractionCacheManagerExecutorsTest with off-heap cache (it didn't before)
* In NamespaceLookupExtractorFactory, use safer idiom to check if CacheState is NoCache or VersionedCache
* More logging in CacheHandler constructor and close(), VersionedCache.close()
* PR comments addressed
* Make CacheScheduler.EntryImpl AutoCloseable, avoid 'dispose' verb in comments, logging and naming in CacheScheduler in favor of 'close'
* More Javadoc comments to CacheScheduler
* Fix NPE
* Remove logging in OnHeapNamespaceExtractionCacheManager.expungeCollectedCaches()
* Make NamespaceExtractionCacheManagersTest.testRacyCreation() have a load similar to what it was before the refactoring
* Unwrap NamespaceExtractionCacheManager.scheduledExecutorService from unneeded MoreExecutors.listeningDecorator() and specify that this is ScheduledThreadPoolExecutor, which ensures happens-before between periodic runs of the tasks
* More comments on MapDbCacheDisposer.disposed
* Replace concat with Long.toString()
* Comment on why NamespaceExtractionCacheManager.scheduledExecutorService() returns ScheduledThreadPoolExecutor
* Place logging statements in VersionedCache.close() and CacheHandler.close() after actual closing logic, because logging may fail
* Make JDBCExtractionNamespaceCacheFactory and StaticMapExtractionNamespaceCacheFactory try to close the newly created VersionedCache if population has failed, as is already done in URIExtractionNamespaceCacheFactory
* Don't close the whole CacheScheduler.Entry, if the cache update task failed
* Replace AtomicLong updateCounter and firstRunLatch with Phaser-based UpdateCounter in CacheScheduler.EntryImpl
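A minimal sketch of the Phaser-based counter idea mentioned above, assuming a single registered party so that each arrival advances the phase. The class and method names (`UpdateCounterSketch`, `update()`, `updates()`, `awaitFirstUpdate()`) are hypothetical and are not taken from CacheScheduler.EntryImpl; the real UpdateCounter may differ, for example in how it handles interruption and phase wrap-around.

```java
import java.util.concurrent.Phaser;

// Hypothetical illustration only; not the actual CacheScheduler code.
public class UpdateCounterSketch {
    // One registered party: every arrive() completes the current phase,
    // so the phase number equals the number of updates recorded so far
    // (ignoring wrap-around at Integer.MAX_VALUE).
    private final Phaser phaser = new Phaser(1);

    /** Records that one cache update has finished. */
    public void update() {
        phaser.arrive();
    }

    /** Returns the number of updates recorded so far. */
    public int updates() {
        return phaser.getPhase();
    }

    /** Blocks until at least one update has happened (the role of a first-run latch). */
    public void awaitFirstUpdate() {
        // Returns immediately if the phaser has already moved past phase 0.
        phaser.awaitAdvance(0);
    }
}
```

A single Phaser covers both the counting role of an AtomicLong and the waiting role of a latch, which is presumably why one object can replace the two fields.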
* Fix #3795 (Java 7 compatibility).
Also introduce Animal Sniffer checks during build, which would
have caught the original problems.
* Add Animal Sniffer on caffeine-cache for JDK8.
* option to reset offset automatically in case of OffsetOutOfRangeException
if the next offset is less than the earliest available offset for that partition
* review comments
* refactoring
* refactor
* review comments
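The automatic offset reset option described above can be sketched roughly as follows. This is an illustration under assumptions: the class name, method name, and parameters are hypothetical, and resetting to the earliest available offset is assumed here rather than confirmed by the commit message.

```java
// Hypothetical illustration only; names and the reset target are assumptions.
public class OffsetResetSketch {
    /**
     * Decides which offset to resume from after an out-of-range condition.
     * If the next offset is still available, it is used unchanged. If it is
     * older than the earliest available offset, it is reset only when the
     * option is enabled; otherwise the problem is surfaced to the caller.
     */
    public static long resolveNextOffset(long nextOffset, long earliestAvailable, boolean resetAutomatically) {
        if (nextOffset >= earliestAvailable) {
            return nextOffset;
        }
        if (resetAutomatically) {
            // Assumed behavior: skip ahead to the earliest offset that still exists.
            return earliestAvailable;
        }
        throw new IllegalStateException(
            "Offset " + nextOffset + " is below the earliest available offset " + earliestAvailable
        );
    }
}
```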
This also involved some other test changes:
- Added a factory.mergeRunners step to AggregationTestHelper's groupBy chain, since the v2
engine does merging there.
- Changed test byteBuffer pools from on-heap to off-heap to work around
https://github.com/DataSketches/sketches-core/pull/116 for datasketches tests.
Excludes tests from AvoidStaticImport, since those are used often there and
I didn't want to make this changeset too large. Production code use was minimal
and I switched those to non-static imports.
* Rename ExtractionNamespaceCacheFactory.getCachePopulator() to populateCache() and make it populate the cache itself instead of returning a Callable which populates the cache, because this "callback style" is not actually needed.
ExtractionNamespaceCacheFactory isn't a "factory" so it should be renamed, but renaming right in this commit would tear the git history for files, because ExtractionNamespaceCacheFactory implementations have too many changed lines. Going to rename ExtractionNamespaceCacheFactory to something like "CachePopulator" in one of subsequent PRs.
This commit is a part of a bigger refactoring of the lookup cache subsystem.
* Remove unused line and imports
* URIExtractionNamespace: Treat null values in lookup maps as missing entries.
This is useful when many logical lookups are derived from the same base JSON file,
and some lookups' values may be unknown sometimes.
* Add test, logging message, and address other comments.
* Update docs.
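As an illustration of treating null values in lookup maps as missing entries, a minimal sketch might look like this. The class and method names are hypothetical and this is not the URIExtractionNamespace parsing code; it only shows the filtering idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual URIExtractionNamespace parser.
public class NullSkippingLoaderSketch {
    /** Copies the parsed entries into a new map, dropping any entry whose value is null. */
    public static Map<String, String> withoutNullValues(Map<String, String> parsed) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : parsed.entrySet()) {
            if (entry.getValue() != null) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

With this approach a key whose value is unknown in the base file simply does not appear in the derived lookup, so a query for it behaves the same as for any other missing key.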
* Return better lastVersion from JDBCExtractionNamespaceCacheFactory's cache populator callable
* Return the lastVersion if URI lookup last modified date is not later than the last cached, from URIExtractionNamespaceCacheFactory's cache populator callable
* Fix a race condition in NamespaceExtractionCacheManager.cancelFuture()
* Don't delete cache from NamespaceExtractionCacheManager if the ExtractionNamespaceCacheFactory returned the same version as the last; better exception treatment in the scheduled cache updater runnable in NamespaceExtractionCacheManager (in particular, don't consume Errors); throw AssertionError in StaticMapExtractionNamespaceCacheFactory if lastVersion != null
* In NamespaceExtractionCacheManager, put NamespaceImplData.latestVersion update in the same synchronized() block with swapAndClearCache(id, cacheId); Turn getPostRunnable which returns a callback into a simple updateNamespace() method
* In StaticMapExtractionNamespaceCacheFactory.getCachePopulator(), check the input directly, not inside a callback
* In URIExtractionNamespaceCacheFactory, allow URI last modified time to go backwards
* Better logging in NamespaceExtractionCacheManager
* Add comment on lastVersion nullability in URIExtractionNamespaceCacheFactory
* support finding segments from AWS S3 storage.
* add more UTs
* address comments and add a document for the feature.
* update docs indentation
* update docs indentation
* address comments.
1. add a UT for JSON ser/deser of the config object.
2. more informative error message in a UT.
* address comments.
1. use @Min to validate the configuration object
2. change updateDescriptor to a string as it does not take an argument otherwise
* fix a UT failure - delete a UT for testing default max length.
* fix ReleaseException when the path is being written by multiple tasks
* Do not throw IOException if another replica wins the race for segment creation
fix if check
* handle logging comments
* fix test
* Add time interval dim filter and retention analysis example
* Use closed-open matching for intervals, update cache key generation
* Fix time filtering tests for interval boundary change
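Closed-open matching means the interval start is inclusive and the end is exclusive. A minimal sketch, assuming timestamps in epoch milliseconds (the class name is hypothetical and is not the filter class added by this change):

```java
// Hypothetical illustration only; not the actual interval dim filter.
public class ClosedOpenIntervalSketch {
    private final long startMillis;
    private final long endMillis;

    public ClosedOpenIntervalSketch(long startMillis, long endMillis) {
        if (endMillis < startMillis) {
            throw new IllegalArgumentException("end must not be before start");
        }
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    /** Returns true for start <= timestamp < end: start is included, end is excluded. */
    public boolean contains(long timestampMillis) {
        return timestampMillis >= startMillis && timestampMillis < endMillis;
    }
}
```

Because the end is exclusive, adjacent intervals such as [0, 10) and [10, 20) never match the same timestamp, which is why tests that relied on the previous boundary behavior needed updating.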
* Initial commit of caffeine cache
* Address code comments
* Move and fixup README.md a bit
* Improve caffeine readme information
* Cleanup caffeine pom
* Address review comments
* Bump caffeine to 2.3.1
* Bump druid version to 0.9.2-SNAPSHOT
* Make test not fail randomly.
See https://github.com/ben-manes/caffeine/pull/93#issuecomment-227617998 for an explanation
* Fix distribution and documentation
* Add caffeine to extensions.md
* Fix links in extensions.md
* Lexicographic
This patch introduces a GroupByStrategy concept and two strategies: "v1"
is the current groupBy strategy and "v2" is a new one. It also introduces
a merge buffers concept in DruidProcessingModule, to try to better
manage memory used for merging.
Both of these are described in more detail in #2987.
There are two goals of this patch:
1. Make it possible for historical/realtime nodes to return larger groupBy
result sets, faster, with better memory management.
2. Make it possible for brokers to merge streams when there are no order-by
columns, avoiding materialization.
This patch does not do anything to help with memory management on the broker
when there are order-by columns or when there are nested queries. That could
potentially be done in a future patch.
* Allow S3 version finder to search entire s3 object key
* Previously it was only able to search the immediate "directory"
* Update method javadoc
* Expand docs a bit better
* Async lookups-cached-global by default
* Also better lookup docs
* Fix test timeouts
* Fix timing of deserialized test
* Fix problem with 0 wait failing immediately
* validate X-Druid-Task-Id header in request and add header to response
* modify KafkaIndexTaskClient to take a TaskLocationProvider as the TaskLocation may not remain constant
* support LookupReferencesManager registration of namespaced lookup and eliminate static configurations for lookup from namespaced lookup extensions
- druid-namespace-lookup and druid-kafka-extraction-namespace are modified
- However, druid-namespace-lookup still has configuration for on-heap/off-heap
cache manager selection, which is a node-wide rather than namespace-wide
configuration, since multiple namespaces share
the same cache manager
* update KafkaExtractionNamespaceTest to reflect argument signature changes
* Add more synchronization functionality to NamespaceLookupExtractorFactory
* Remove old way of using extraction namespaces
* resolve compile error by supporting LookupIntrospectHandler
* Remove kafka lookups
* Remove unused stuff
* Fix start and stop behavior to be consistent with new javadocs
* Remove unused strings
* Add timeout option
* Address comments on configurations and improve docs
* Add more options and update hash key and replaces
* Move monitoring to the overriding classes
* Add better start/stop logging
* Remove old docs about namespace names
* Fix bad comma
* Add `@JsonIgnore` to lookup factory
* Address code review comments
* Remove ExtractionNamespace from module json registration
* Fix problems with naming and initialization. Add tests
* Optimize imports / reformat
* Fix future not being properly cancelled on failed initial scheduling
* Fix delete returns
* Add more docs about whole introspection
* Add `/version` introspection point for lookups
* Add more tests and address comments
* Add StaticMap extraction namespace for testing. Also add a bunch of tests
* Move cache system property to `druid.lookup.namespace.cache.type`
* Make VERSION lower case
* Change poll period to 0ms for StaticMap
* Move cache key to bytebuffer
* Change hashCode and equals on static map extraction fn
* Add more comments on StaticMap
* Address comments
* Make scheduleAndWait use a latch
* Sanity renames and fix imports
* Remove extra info in docs
* Fix review comments
* Strengthen failure on start from warn to error
* Address comments
* Rename namespace-lookup to lookups-cached-global
* Fix injective mis-naming
* Also add serde test
* Make URIExtraction not require FileSystem impls for URIs it understands
* Fixes #2928
* Preserve URI information
* Simplify case for exact matching
* Move unused variable
* Make S3DataSegmentPuller do GET requests less often
* Fixes #2894
* Run intellij formatting on S3Utils
* Remove forced stream fetching on getVersion
* Remove unneeded finalize
* Allow initial object fetching to fail and be retried
* Make URI Extraction Namespace take more sane arguments
* Fixes https://github.com/druid-io/druid/issues/2669
* Update docs
* Rename error message
* Undo overzealous deletion of docs
* Explain caching mechanism a bit more in docs
* Move kafka-extraction-namespace to the Lookup framework.
* Address comments
* Fix missing kafka introspection
* Fix tests to be less racy
* Make testing a bit more lenient
* Make tests even more forgiving
* Add comments to kafka lookup cache method
* Move startStopLock to just use started
* Make start() and stop() idempotent
* Forgot to update test after last change, test now accounts for idempotency
* Add extra idempotency on stop check
* Add more descriptive docs of behavior
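One common way to make start() and stop() idempotent with a single `started` flag is an atomic compare-and-set, sketched below. This is an assumption about the general approach, not the kafka lookup factory code; the class name and the counters (which exist only to make the effect observable) are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the actual lookup factory implementation.
public class IdempotentLifecycleSketch {
    private final AtomicBoolean started = new AtomicBoolean(false);
    // Counters stand in for real resource setup and teardown.
    private final AtomicInteger initCount = new AtomicInteger();
    private final AtomicInteger closeCount = new AtomicInteger();

    /** Starts once; repeated calls are no-ops that still report success. */
    public boolean start() {
        if (started.compareAndSet(false, true)) {
            initCount.incrementAndGet();
        }
        return true;
    }

    /** Stops once; repeated calls, or calls before start(), are no-ops that still report success. */
    public boolean stop() {
        if (started.compareAndSet(true, false)) {
            closeCount.incrementAndGet();
        }
        return true;
    }

    public int initCount() {
        return initCount.get();
    }

    public int closeCount() {
        return closeCount.get();
    }
}
```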
- Renumbered ApproximateHistogramAggregatorFactory from 8 to 12,
8 was taken by CardinalityAggregatorFactory
- Renumbered ApproximateHistogramFoldingAggregatorFactory from 9 to 13,
9 was taken by FilteredAggregatorFactory
* Avoids fetching all segment records into heap by JDBC driver
* Set connection to read-only to help database optimize queries
* Update JDBC drivers (MySQL has fixes for streaming results)
segment creation deterministic.
This means that each segment will contain data from just one Kafka
partition. So, users will probably not want to have a super high number
of Kafka partitions...
Fixes #2703.
Reverts "Update com.maxmind.geoip2 to 2.6.0" and exclude the google http client
from com.maxmind.geoip2. This should satisfy the original need from #2646 (wanting
to run Druid along with an upgraded com.google.http-client) while preventing
Jackson conflicts pointed out in #2717.
Fixes #2717.
This reverts commit 21b7572533.
Reads a specific offset range from specific partitions, and can use dataSource metadata
transactions to guarantee exactly-once ingestion.
Each task has a finite lifecycle, so it is expected that some process will be supervising
existing tasks and creating new ones when needed.
Geared towards supporting transactional inserts of new segments. This involves an
interface "DataSourceMetadata" that allows combining of partially specified metadata
(useful for partitioned ingestion).
DataSource metadata is stored in a new "dataSource" table.
* Eliminate exclusion groups from pull-deps
* Only consider dependency nodes in pull-deps if they are not in the following scopes
* provided
* test
* system
* Fix a bunch of missing `<scope>provided</scope>` tags
* Better exclusions for a couple of problematic libs
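The scope rule for pull-deps listed above (skip dependency nodes in the provided, test, and system scopes) can be sketched as a simple predicate. The class and method names are hypothetical, and treating a missing scope as included is an assumption based on Maven's default compile scope; this is not the pull-deps code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the actual pull-deps implementation.
public class ScopeFilterSketch {
    private static final Set<String> EXCLUDED_SCOPES =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("provided", "test", "system")));

    /**
     * Returns true if a dependency node with the given scope should be considered.
     * A null scope is assumed to mean Maven's default (compile) scope and is included.
     */
    public static boolean shouldInclude(String scope) {
        return scope == null || !EXCLUDED_SCOPES.contains(scope);
    }
}
```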