* InputRowParser to decode OrcStruct from OrcNewInputFormat
* add unit test for orc hadoop indexing
* update docs and fix test code bug
* doc updated
* resolve maven dependency conflict
* remove unused imports
* fix returning array type from Object[] to correct primitive array type
* fix to support getDimension() of MapBasedRow: change return type of orc list from array to list
* rebase and updated based on comments
* updated based on comments
* reflect review comments
* fix bug in typeStringFromParseSpec() and add unit test
* add license header
1) Modify CliHadoopIndexer to share constant from `TaskConfig.DEFAULT_DEFAULT_HADOOP_COORDINATES`
2) add comment to pom.xml as discussed in
https://github.com/druid-io/druid/pull/3044
fix name
* Initial commit of caffeine cache
* Address code comments
* Move and fixup README.md a bit
* Improve caffeine readme information
* Cleanup caffeine pom
* Address review comments
* Bump caffeine to 2.3.1
* Bump druid version to 0.9.2-SNAPSHOT
* Make test not fail randomly.
See https://github.com/ben-manes/caffeine/pull/93#issuecomment-227617998 for an explanation
* Fix distribution and documentation
* Add caffeine to extensions.md
* Fix links in extensions.md
* Lexicographic
* support LookupReferencesManager registration of namespaced lookups and eliminate static lookup configuration from namespaced lookup extensions
- druid-namespace-lookup and druid-kafka-extraction-namespace are modified
- However, druid-namespace-lookup still has a configuration for on-heap/off-heap
cache manager selection, which is a node-wide rather than namespace-wide
configuration, because multiple namespaces share
the same cache manager
* update KafkaExtractionNamespaceTest to reflect argument signature changes
* Add more synchronization functionality to NamespaceLookupExtractorFactory
* Remove old way of using extraction namespaces
* resolve compile error by supporting LookupIntrospectHandler
* Remove kafka lookups
* Remove unused stuff
* Fix start and stop behavior to be consistent with new javadocs
* Remove unused strings
* Add timeout option
* Address comments on configurations and improve docs
* Add more options and update hash key and replaces
* Move monitoring to the overriding classes
* Add better start/stop logging
* Remove old docs about namespace names
* Fix bad comma
* Add `@JsonIgnore` to lookup factory
* Address code review comments
* Remove ExtractionNamespace from module json registration
* Fix problems with naming and initialization. Add tests
* Optimize imports / reformat
* Fix future not being properly cancelled on failed initial scheduling
* Fix delete returns
* Add more docs about whole introspection
* Add `/version` introspection point for lookups
* Add more tests and address comments
* Add StaticMap extraction namespace for testing. Also add a bunch of tests
* Move cache system property to `druid.lookup.namespace.cache.type`
* Make VERSION lower case
* Change poll period to 0ms for StaticMap
* Move cache key to bytebuffer
* Change hashCode and equals on static map extraction fn
* Add more comments on StaticMap
* Address comments
* Make scheduleAndWait use a latch
* Sanity renames and fix imports
* Remove extra info in docs
* Fix review comments
* Strengthen failure on start from warn to error
* Address comments
* Rename namespace-lookup to lookups-cached-global
* Fix injective mis-naming
* Also add serde test
* new interval based cost function
Addresses issues with balancing of segments in the existing cost function
- `gapPenalty` led to clusters of segments ~30 days apart
- `recencyPenalty` caused imbalance among recent segments
- size-based cost could be skewed by compression
New cost function is purely based on segment intervals:
- assumes each time-slice of a partition is a constant cost
- cost is additive, i.e. cost(A, B union C) = cost(A, B) + cost(A, C)
- cost decays exponentially based on distance between time-slices
* comments and formatting
* add more comments to explain the calculation
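The additive, exponentially decaying behaviour described in this commit can be illustrated with a minimal sketch. It is not the balancer's actual implementation: the function names, the unit slice width (`step`), and the `decay` rate are assumptions made for illustration, and intervals are plain `(start, end)` number pairs.

```python
import math


def slices(start, end, step=1.0):
    """Return midpoints of the fixed-width time-slices covering [start, end)."""
    count = int(round((end - start) / step))
    return [start + (i + 0.5) * step for i in range(count)]


def interval_cost(a, b, decay=0.1, step=1.0):
    """Sum a constant per-slice cost, decayed by the distance between slices.

    `a` and `b` are (start, end) pairs. `decay` and `step` are illustrative
    assumptions, not values taken from the commit.
    """
    return sum(
        math.exp(-decay * abs(x - y))
        for x in slices(a[0], a[1], step)
        for y in slices(b[0], b[1], step)
    )
```

Because the cost is a plain sum over pairs of slices, splitting one interval into two adjacent pieces leaves the total unchanged, which is the additivity property listed above, and a more distant interval contributes less than a nearby one.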
Reverts "Update com.maxmind.geoip2 to 2.6.0" and excludes the google http client
from com.maxmind.geoip2. This should satisfy the original need from #2646 (wanting
to run Druid along with an upgraded com.google.http-client) while preventing
Jackson conflicts pointed out in #2717.
Fixes #2717.
This reverts commit 21b7572533.
com.maxmind.geoip2 2.6.0 depends on com.google.http-client 1.15.0-rc (3 years old).
When trying to include other libraries in Druid that require an up-to-date version of com.google.http-client, this causes a problem.
Reads a specific offset range from specific partitions, and can use dataSource metadata
transactions to guarantee exactly-once ingestion.
Each task has a finite lifecycle, so it is expected that some process will be supervising
existing tasks and creating new ones when needed.
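As a rough sketch of how committing offsets together with data can give exactly-once behaviour, consider the following. The `MetadataStore` class, its in-memory state, and `run_task` are hypothetical stand-ins for illustration, not the task's real classes or the actual metadata schema.

```python
class MetadataStore:
    """In-memory stand-in for a transactional dataSource metadata store."""

    def __init__(self):
        self.offsets = {}      # partition -> next offset to read
        self.published = []    # records committed together with the offsets

    def publish(self, partition, start, end, records):
        """Commit records and advance the offset only if `start` is still current."""
        if self.offsets.get(partition, 0) != start:
            return False       # this range was already published, or a gap exists
        self.published.extend(records)
        self.offsets[partition] = end
        return True


def run_task(store, log, partition, start, end):
    """Read the half-open offset range [start, end) and publish it atomically."""
    return store.publish(partition, start, end, log[start:end])
```

In this sketch, a second task that replays the same offset range is rejected rather than ingested twice, because the stored offset no longer matches its starting offset.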
* Eliminate exclusion groups from pull-deps
* Only consider dependency nodes in pull-deps if they are not in the following scopes:
* provided
* test
* system
* Fix a bunch of missing `<scope>provided</scope>` tags
* Better exclusions for a couple of problematic libs
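The scope rule in this commit amounts to a simple filter. The sketch below assumes a hypothetical representation of dependency nodes as dicts with `artifact` and `scope` keys, and treats a missing scope as `compile`; it does not reflect the actual pull-deps code.

```python
# Scopes named in the commit message as excluded from pull-deps.
EXCLUDED_SCOPES = {"provided", "test", "system"}


def keep_dependency(node):
    """Return True if a dependency node should be pulled.

    `node` is a hypothetical dict with "artifact" and "scope" keys.
    """
    return node.get("scope", "compile") not in EXCLUDED_SCOPES


def filter_dependencies(nodes):
    """Return the artifacts of all nodes that survive the scope filter."""
    return [node["artifact"] for node in nodes if keep_dependency(node)]
```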
1) Remove maven client from downloading extensions at runtime.
2) Provide a way to load Druid extensions and hadoop dependencies through file system.
3) Refactor pull-deps so that it can download extensions into extension directories.
4) Add documents on how to use this new extension loading mechanism.
5) Change how the Druid tarball is generated. Now all the extensions + hadoop-client 2.3.0
are packaged within the Druid tarball.
- move assembly out of druid-services into a 'distribution' module
- create separate 'extensions-distribution' module and assembly to
package extensions and their dependencies into a local maven
repository
- include this extensions maven repository in the binaries tarball
* Adds kafka, URI, and JDBC namespace definitions
* Add the ability to explicitly rename using a "namespace", which is a particular data collection loaded on all realtime nodes, historical nodes, and brokers. If any of these nodes has the namespace extension, ALL nodes have the namespace extension.
* Add namespace caching and populating (can be on heap or off heap)
* Add NamespaceExtractionCacheManager for handling caches
* Added ExtractionNamespace for handling metadata on the extraction namespaces
* Added ExtractionNamespaceUpdate for handling metadata related to updates
* Add extension which caches renames from a kafka stream (requires kafka8)
* Added README.md for the namespace kafka extension
* Added docs
* Added namespace/size, namespace/count, namespace/deltaTasksStarted metrics
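The idea of renaming through a cached namespace can be pictured with a minimal sketch. `NamespaceCache` and its methods are hypothetical names, not the extension's classes, and returning the input unchanged when no mapping exists is a choice made for this sketch rather than documented behaviour.

```python
class NamespaceCache:
    """Hypothetical per-node cache: namespace name -> {original value: renamed value}."""

    def __init__(self):
        self._namespaces = {}

    def populate(self, namespace, mapping):
        """Replace a namespace's contents in one step, as a periodic update might."""
        self._namespaces[namespace] = dict(mapping)

    def rename(self, namespace, value):
        """Return the renamed value, or the input unchanged when no mapping exists."""
        return self._namespaces.get(namespace, {}).get(value, value)
```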
Add static config for namespaces via `druid.query.extraction.namespace`
* This is a rebase of https://github.com/b-slim/druid/tree/static_config_only
With the existing hash function, some nodes could end up with 3 times as many
keys as others. The following changes reduce that to differences of roughly
less than 5% across nodes.
- switch from fnv-1a to murmur3_128 hash
- increase repetitions for ketama algorithm
- test to analyze distribution
Also updates spymemcached for recent bugfixes
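A distribution test of the kind mentioned above could look roughly like the following sketch of a ketama-style ring. It is an illustration under stated assumptions: `blake2b` from the standard library stands in for murmur3_128, and the node names, key names, and function names are made up for the example.

```python
import bisect
import hashlib
from collections import Counter


def hash64(value):
    """64-bit hash of a string; blake2b stands in for murmur3_128 here."""
    return int.from_bytes(hashlib.blake2b(value.encode(), digest_size=8).digest(), "big")


def build_ring(nodes, repetitions):
    """Place each node on the ring `repetitions` times (ketama-style virtual nodes)."""
    return sorted((hash64(f"{node}-{i}"), node) for node in nodes for i in range(repetitions))


def node_for(ring, points, key):
    """Return the node owning the first ring point at or after the key's hash."""
    index = bisect.bisect_left(points, hash64(key)) % len(ring)
    return ring[index][1]


def key_distribution(nodes, repetitions, key_count=20000):
    """Count how many of `key_count` synthetic keys land on each node."""
    ring = build_ring(nodes, repetitions)
    points = [point for point, _ in ring]
    return Counter(node_for(ring, points, f"key-{i}") for i in range(key_count))
```

Comparing `key_distribution(nodes, 10)` with `key_distribution(nodes, 1000)` for the same node list is one way to see how the repetition count affects the spread of keys per node.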
* Requires https://github.com/druid-io/druid-api/pull/37
* Requires https://github.com/metamx/java-util/pull/22
* Moves the puller logic to use a more standard workflow going through java-util helpers instead of re-writing the handlers for each impl
* The general workflow is: 1) The LoadSpec makes sure the correct Puller is called with the correct parameters. 2) The Puller sets up general information, such as how to make an InputStream, how to find a file name (for .gz files, for example), and when to retry. 3) CompressionUtils does most of the heavy lifting where it can.
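The division of labour in that workflow can be sketched as follows. This is a simplified illustration, not the java-util or Druid API: `pull`, `output_name`, the `open_stream` callable, and the retry parameters are all hypothetical names, and gzip is the only compression handled.

```python
import gzip
import io
import time


def output_name(file_name):
    """Strip a trailing .gz so the decompressed file gets its real name."""
    return file_name[:-3] if file_name.endswith(".gz") else file_name


def pull(open_stream, file_name, max_tries=3,
         should_retry=lambda error: isinstance(error, IOError), delay=0.0):
    """Fetch `file_name` via `open_stream`, retrying and decompressing as needed."""
    for attempt in range(1, max_tries + 1):
        try:
            with open_stream() as stream:
                data = stream.read()
            if file_name.endswith(".gz"):
                data = gzip.decompress(data)
            return output_name(file_name), data
        except Exception as error:
            # Give up on the last attempt or when the error is not retryable.
            if attempt == max_tries or not should_retry(error):
                raise
            time.sleep(delay)


if __name__ == "__main__":
    payload = gzip.compress(b"segment bytes")
    print(pull(lambda: io.BytesIO(payload), "index.zip.gz"))
```

Here the caller only supplies how to open the stream and when to retry, while the shared helper handles naming and decompression, which mirrors the split between the Puller and the compression utilities described above.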