* Filters: Use ColumnSelectorFactory directly for building row-based matchers.
* Adjustments based on code review.
- BoundDimFilter: fewer volatiles, rename matchesAnything to !matchesNothing.
- HavingSpecs: Clarify that they are not thread-safe, and make DimFilterHavingSpec
not thread safe.
- Renamed rowType to rowSignature.
- Added specializations for time-based vs non-time-based DimensionSelector in RBCSF.
- Added convenience method DimensionHanderUtils.createColumnSelectorPlus.
- Added singleton ZeroIndexedInts.
- Added test cases for DimFilterHavingSpec.
* Make ValueMatcherColumnSelectorStrategy actually use the associated selector.
* Add RangeIndexedInts.
* DimFilterHavingSpec: Fix concurrent usage guard on jdk7.
* Add assertion to ZeroIndexedInts.
* Rename no-longer-volatile members.
* add first and last aggregator
* add test and fix
* moving around
* separate aggregator valueType
* address PR comment
* add finalize inner query and adjust v1 inner indexing
* better test and fixes
* java-util import fixes
* PR comments
* Add first/last aggs to ITWikipediaQueryTest
* Enable parallel test
* Remove unnecessary NotThreadSafe annocation
* Randomize the start port when finding available ports
* Fix test failure
* Change to handle all negatives
Excludes tests from AvoidStaticImport, since those are used often there and
I didn't want to make this changeset too large. Production code use was minimal
and I switched those to non-static imports.
* Support string type in math expression
addressed comments
addressed comments
Addressed comments
* Updated math function document
* Addressed comments
* Supports expression-paramed aggregator (squashed and rebased on master) also includes math post aggregator (was #2820)
* Addressed comments
* addressed comments
This is useful for the insert-segment-to-db tool, which would otherwise
potentially insert a lot of overshadowed segments as "used", causing
load and drop churn in the cluster.
1. Wrap temporaryStorage in a resource holder, to avoid spurious "Closed"
errors from already-running processing tasks.
2. Exit early from the merging accumulator if the query is cancelled.
* fix ConcurrentModificationException in CachingClusteredClient.run()
* obtain new copy of PartitionHolder to avoid potential multi-threads read/write issue
Refcounting prevents releasing the merge buffer, or closing the concurrent
grouper, before the processing threads have all finished. The better
error handling prevents an avalanche of per-runner exceptions when grouping
resources are exhausted, by grouping those all up into a single merged
exception.
This patch introduces a GroupByStrategy concept and two strategies: "v1"
is the current groupBy strategy and "v2" is a new one. It also introduces
a merge buffers concept in DruidProcessingModule, to try to better
manage memory used for merging.
Both of these are described in more detail in #2987.
There are two goals of this patch:
1. Make it possible for historical/realtime nodes to return larger groupBy
result sets, faster, with better memory management.
2. Make it possible for brokers to merge streams when there are no order-by
columns, avoiding materialization.
This patch does not do anything to help with memory management on the broker
when there are order-by columns or when there are nested queries. That could
potentially be done in a future patch.
* Fix parsing fail of segment id with underscored datasource (Fix for #2786)
* addressed comment
* renamed and moved code into api. added log4 dependency for tests
* addressed comments
* fixed test fails
* Make URI Exctraction Namespace take more sane arguments
* Fixes https://github.com/druid-io/druid/issues/2669
* Update docs
* Rename error message
* Undo overzealous deletion of docs
* Explain caching mechanism a bit more in docs
Geared towards supporting transactional inserts of new segments. This involves an
interface "DataSourceMetadata" that allows combining of partially specified metadata
(useful for partitioned ingestion).
DataSource metadata is stored in a new "dataSource" table.
* Defaults the thread priority to java.util.Thread.NORM_PRIORITY in io.druid.indexing.common.task.AbstractTask
* Each exec service has its own Task Factory which is assigned a priority for spawned task. Therefore each priority class has a unique exec service
* Added priority to tasks as taskPriority in the task context. <0 means low, 0 means take default, >0 means high. It is up to any particular implementation to determine how to handle these numbers
* Add options to ForkingTaskRunner
* Add "-XX:+UseThreadPriorities" default option
* Add "-XX:ThreadPriorityPolicy=42" default option
* AbstractTask - Removed unneded @JsonIgnore on priority
* Added priority to RealtimePlumber executors. All sub-executors (non query runners) get Thread.MIN_PRIORITY
* Add persistThreadPriority and mergeThreadPriority to realtime tuning config
- Fix merging when the INTERVALS analysisType is disabled, and add a test.
- Remove transformFn from CombiningSequence, use MappingSequence instead. transformFn did
not work for "accumulate" anyway, which made the tests wrong (the intervals should have
been condensed, but were not).
- Add analysisTypes to the Druids segmentMetadataQuery builder to make testing simpler.
- fixes#1970
- extracted out segment handoff callbacks in SegmentHandoffNotifier
which is responsible for tracking segment handoffs and doing callbacks
when handoff is complete.
- Coordinator now maintains a view of segments in the cluster, this
will affect the jam heap requirements for the overlord for large
clusters.
realtime index task and nodes now use HTTP end points exposed by the
coordinator to get serverView
review comment
fix realtime node guide injection
review comments
make test not rely on scheduled exec
fix compilation
fix import
review comment
introduce immutableSegmentLoadInfo
fix son reading
remove unnecessary logging
This is a feature meant to allow realtime tasks to work without being told upfront
what shardSpec they should use (so we can potentially publish a variable number
of segments per interval).
The idea is that there is a "pendingSegments" table in the metadata store that
tracks allocated segments. Each one has a segment id (the same segment id we know
and love) and is also part of a sequence.
The sequences are an idea from @cheddar that offers a way of doing replication.
If there are N tasks reading exactly the same data with exactly the same logic
(think Kafka tasks reading a fixed range of offsets) then you can place them
in the same sequence, and they will generate the same sequence of segments.
Fixes#1727.
revert to doing merging for results for union queries on broker.
revert unrelated changes
Add test for union query runner
Add test
remove unused imports
fix imports
fix renamed file
fix test
update docs.
* Adds kafka, URI, and JDBC namespace defintions
* Add ability to explicitly rename using a "namespace" which is a particular data collection that is loaded on all realtime, historic nodes, and brokers. If any of these nodes has the namespace extension, ALL nodes have the namespace extension.
* Add namespace caching and populating (can be on heap or off heap)
* Add NamespaceExtractionCacheManager for handling caches
* Added ExtractionNamespace for handling metadata on the extraction namespaces
* Added ExtractionNamespaceUpdate for handling metadata related to updates
* Add extension which caches renames from a kafka stream (requires kafka8)
* Added README.md for the namespace kafka extension
* Added docs
* Added namespace/size, namespace/count, namespace/deltaTasksStarted metrics
Add static config for namespaces via `druid.query.extraction.namespace`
* This is a rebase of https://github.com/b-slim/druid/tree/static_config_only
fixes#1330,
Avoid creating Period instance as creating a Period from Long.MAX_VALUE
throws arithmetic exception.
After this query metric will emit duration in seconds instead of
minutes.
* Requires https://github.com/druid-io/druid-api/pull/37
* Requires https://github.com/metamx/java-util/pull/22
* Moves the puller logic to use a more standard workflow going through java-util helpers instead of re-writing the handlers for each impl
* General workflow goes like this: 1) LoadSpec makes sure the correct Puller is called with the correct parameters. 2) The Puller sets up general information like how to make an InputStream, how to find a file name (for .gz files for example), and when to retry. 3) CompressionUtils does most of the heavy lifting when it can
review comments
more refactoring and cleaning of redundant code
add UT + docs + more refactoring
fixes + review comments
more cleanup
end points to fetch history
review comments
remove unnecessary changes
review comments rename header name
review comments + add test for MetadataRulesManager
review comments docs
This commit also includes
1) the addition of a context parameter on timeseries queries that allows it to ignore empty buckets instead of generating results for them
2) A cleanup of an unused method on an interface