* SQL: Full TRIM support.
- Support trimming arbitrary characters
- Support BOTH, LEADING, and TRAILING
* Remove unused import.
* Fix tests, add RTRIM / LTRIM.
* Remove unused imports.
* BTRIM and docs.
* Replace for with foreach.
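A hedged sketch of the forms added above (TRIM with BOTH/LEADING/TRAILING plus BTRIM/LTRIM/RTRIM), as they might look over Druid SQL via Avatica JDBC; the broker URL, table, and column names are illustrative assumptions:
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrimExample
{
  public static void main(String[] args) throws Exception
  {
    // Assumed broker SQL endpoint; "foo" and "dim1" are placeholder names.
    String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT TRIM(LEADING 'x' FROM dim1), BTRIM(dim1, 'x'), LTRIM(dim1), RTRIM(dim1) FROM foo"
         )) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " | " + rs.getString(2));
      }
    }
  }
}
```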
* Collapse worker select strategies, change default, add strong affinity.
- Change default worker select strategy to equalDistribution. It is
more generally useful than fillCapacity.
- Collapse the *WithAffinity strategies into the regular ones. The
*WithAffinity strategies are retained for backwards compatibility.
- Change WorkerSelectStrategy to return nullable instead of Optional.
- Fix a couple of errors in the docs.
* Fix test.
* Review adjustments.
* Remove unused imports.
* Switch to DateTimes.nowUtc.
* Simplify code.
* Fix tests (worker assignment started off on a different foot)
* DruidLeaderSelector interface for leader election and Curator based impl. DruidCoordinator/TaskMaster are updated to use the new interface.
* add fake DruidNode binding in integration-tests module
* add docs on DruidLeaderSelector interface
* remove start/stop and keep register/unregister Listener in DruidLeaderSelector interface
* updated comments on DruidLeaderSelector
* cache the listener executor in CuratorDruidLeaderSelector
* use same latch owner name that was used before
* remove stuff related to druid.zk.paths.indexer.leaderLatchPath config
* randomize the delay when giving up leadership and restarting leader latch
* Add "round" option to cardinality and hyperUnique aggregators.
Also turn it on by default in SQL, to make math on distinct counts
work more as expected.
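A hedged sketch of where the option lands, assuming it is a boolean field named "round" on the aggregator spec (the aggregator and field names here are illustrative):
```java
public class RoundOptionSketch
{
  public static void main(String[] args)
  {
    // With "round": true, the distinct-count estimate comes back as a whole
    // number instead of a fractional estimate.
    String agg = "{\"type\": \"hyperUnique\", \"name\": \"distinct_users\","
        + " \"fieldName\": \"user_hll\", \"round\": true}";
    System.out.println(agg);
  }
}
```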
* Fix some compile errors.
* Fix test.
* Formatting.
* Add @ExtensionPoint and @PublicApi annotations.
* Clean up wording.
* Remove unused import.
* Remove unused imports.
* Only types can be extension points.
* Adjust annotations some more.
* Remove unused import.
* Make ServletFilterHolder an extension point.
* Add a couple extension points, and update docs.
* Implement "earlyMessageRejectionPeriod" config discussed in issue #4599
* implement the logic of this param
* Added doc of this config
* Added unit tests of it
* Update KafkaSupervisor.java
ameliorate comment
* fix format
* fix bug when rebasing
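A hedged sketch of the new key in a Kafka supervisor ioConfig, assuming it takes an ISO-8601 period like the existing lateMessageRejectionPeriod; topic and value are illustrative:
```java
public class EarlyRejectionSketch
{
  public static void main(String[] args)
  {
    // Events stamped too far past the task's window get rejected.
    String ioConfig = "{\"topic\": \"metrics\","
        + " \"earlyMessageRejectionPeriod\": \"PT1H\"}";
    System.out.println(ioConfig);
  }
}
```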
* Added support for where clauses to filter lookup values on ingestion.
Added a filter field to the JDBC lookups that is used to generate a
where clause so that only rows matching the filter value will be
brought into Druid. Example being filter="SOMECOLUMN=1"
* Required changes based on code review.
* Required changes based on code review.
* Updates based on code review, mainly formatting and small refactor of
the buildLookupQuery method.
* Fixed broken buildLookupQuery method
* Removed empty line.
* Updates per review comments
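A minimal sketch of how the buildLookupQuery method mentioned above might fold the filter into a WHERE clause; this body is an illustration, not the actual Druid implementation:
```java
public class LookupQuerySketch
{
  static String buildLookupQuery(String table, String keyColumn, String valueColumn, String filter)
  {
    String query = String.format("SELECT %s, %s FROM %s", keyColumn, valueColumn, table);
    if (filter != null && !filter.isEmpty()) {
      query += " WHERE " + filter; // e.g. filter = "SOMECOLUMN=1"
    }
    return query;
  }

  public static void main(String[] args)
  {
    // Prints: SELECT k, v FROM lookup_table WHERE SOMECOLUMN=1
    System.out.println(buildLookupQuery("lookup_table", "k", "v", "SOMECOLUMN=1"));
  }
}
```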
* Improved SQL support for floats and doubles.
- Use Druid FLOAT for SQL FLOAT, and Druid DOUBLE for SQL DOUBLE, REAL,
and DECIMAL.
- Use float* aggregators when appropriate.
- Add tests involving both float and double columns.
- Adjust documentation accordingly.
* CR comments.
* Fix braces.
* rolling upgrade order change to bring coordinator and overlord together
* mentioned merged Coordinator-Overlord in upgrade order doc
* revert autoscaling doc change
* auto scaling doc fix
* Early publishing segments in the middle of data ingestion
* Remove unnecessary logs
* Address comments
* Refactoring the patch according to #4292 and address comments
* Set the total shard number of NumberedShardSpec to 0
* refactoring
* Address comments
* Fix tests
* Address comments
* Fix sync problem of committer and retry push only
* Fix doc
* Fix build failure
* Address comments
* Fix compilation failure
* Fix transient test failure
* SQL + Expressions = Best friends forever.
- Use expressions as a projection layer for anything that can't be
expressed using traditional Druid extractionFns. Sometimes they're
embedded directly (like "expression" filters, builtin aggregators,
or "expression" post-aggregators). Sometimes they're referenced
through virtual columns (like dimensionSpecs, which can't innately
reference functions of more than one column without the virtual
column layer).
- Add many new functions and operators, taking advantage of the
expression capability (see the querying/sql.md doc).
- Improve consistency of constant reduction and of casting by
using Druid expressions for this instead of Calcite's RexExecutor.
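Hedged sketches of two of the JSON shapes described above, an "expression" virtual column and an "expression" filter; the expression functions shown are from the querying docs, but the exact field sets are assumptions:
```java
public class ExpressionSpecSketch
{
  public static void main(String[] args)
  {
    // A virtual column computing a function of two columns, something a
    // plain extractionFn cannot do.
    String virtualColumn = "{\"type\": \"expression\", \"name\": \"v0\","
        + " \"expression\": \"concat(dim1, dim2)\"}";
    // An expression filter embedding the expression directly.
    String filter = "{\"type\": \"expression\", \"expression\": \"strlen(dim1) > 3\"}";
    System.out.println(virtualColumn);
    System.out.println(filter);
  }
}
```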
* Fix casting bug, and other code review comments.
* Fix docs.
* Add some new expression functions and macros.
See misc/math-expr.md for the list of added functions, except for
"like", which previously existed but was not documented.
* Add easymock to datasketches tests.
* Add easymock to distinctcount tests.
* Add easymock to virtual-columns tests.
* Code review comments.
* Clean up code a bit.
* Add easymock to scan-query tests.
* Rework ExprMacros that have multiple impls.
* Improve test coverage.
* Remove DruidProcessingModule, QueryableModule and QueryRunnerFactoryModule from DI for coordinator, overlord, middle-manager. Add RouterDruidProcessing so that processing resources are not allocated on the router
* Fix examples
* Fixes
* Revert Peon configs and add comments
* Remove qualifier
* Add maxSegmentsInQueue parameter to CoordinatorDynamicConfig and use it in LoadRule to improve segment loading and replication time
* Rename maxSegmentsInQueue to maxSegmentsInNodeLoadingQueue
* Make CoordinatorDynamicConfig constructor private; add/fix tests; set default maxSegmentsInNodeLoadingQueue to 0 (unbounded)
* Docs added for maxSegmentsInNodeLoadingQueue parameter in CoordinatorDynamicConfig
* More docs for maxSegmentsInNodeLoadingQueue and style fixes
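A hedged sketch of the key as it would appear in the coordinator dynamic config; 0, the default, leaves the per-node loading queue unbounded:
```java
public class DynamicConfigSketch
{
  public static void main(String[] args)
  {
    // Cap each node's loading queue at 100 segments (value illustrative).
    String dynamicConfig = "{\"maxSegmentsInNodeLoadingQueue\": 100}";
    System.out.println(dynamicConfig);
  }
}
```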
* Remove ability to create segments in v8 format
* Fix IndexGeneratorJobTest
* Fix parameterized test name in IndexMergerTest
* Remove extra legacy merging stuff
* Remove legacy serializer builders
* Remove ConciseBitmapIndexMergerTest and RoaringBitmapIndexMergerTest
* Add Date support to the parquet reader
Add support for the Date logical type, which was previously unsupported. Since the Parquet
date is a number of days since the epoch but gets interpreted as seconds since the epoch, indexing
fails because the data does not map to the appropriate time bucket.
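A short worked example of the mismatch described above: a Parquet DATE is days since the epoch, so reading it as seconds lands decades away from the intended bucket.
```java
import java.util.concurrent.TimeUnit;

public class ParquetDateExample
{
  public static void main(String[] args)
  {
    int parquetDate = 17000; // Parquet DATE: days since 1970-01-01

    // Correct interpretation: days -> millis (2016-07-18T00:00:00Z).
    long correctMillis = TimeUnit.DAYS.toMillis(parquetDate);

    // The buggy interpretation: days read as seconds -> 1970-01-01T04:43:20Z,
    // which maps the row to entirely the wrong time bucket.
    long wrongMillis = TimeUnit.SECONDS.toMillis(parquetDate);

    System.out.println(correctMillis + " vs " + wrongMillis);
  }
}
```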
* Cleaned up code and tests
Got rid of unused json files in the examples, cleaned up the tests by
using try-with-resources. Now get the filenames from the json file
instead of hard coding them and integrated general improvements from
the feedback provided by leventov.
* Got rid of the caching
Remove the caching of the logical type of the time dimension column
and cleaned up the code a bit.
* Expressions: Add ExprMacros, which have the same syntax as functions, but
can convert themselves to any kind of Expr at parse-time.
ExprMacroTable is an extension point for adding new ExprMacros. Anything
that might need to parse expressions needs an ExprMacroTable, which can
be injected through Guice.
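A hedged sketch of the ExprMacro idea; the Expr and ExprMacro types below are stand-ins for illustration, not Druid's real interfaces (see ExprMacroTable in the source for the actual contract):
```java
import java.util.List;

// Stand-in for Druid's Expr: something evaluable at query time.
interface Expr
{
  Object eval();
}

// Stand-in for an ExprMacro: looks like a function in the expression
// language, but expands itself into a concrete Expr at parse time.
interface ExprMacro
{
  String name();
  Expr apply(List<Expr> args);
}

class TimestampFloorMacro implements ExprMacro
{
  @Override
  public String name()
  {
    return "timestamp_floor";
  }

  @Override
  public Expr apply(List<Expr> args)
  {
    // A real macro would validate args and could constant-fold here.
    return () -> "floored(" + args.get(0).eval() + ")";
  }
}
```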
* Address code review comments.
* refactor lag reporting and report lag at status endpoint
* refactor offset reporting logic to fetch offsets periodically vs. at request time
* remove JavaCompatUtils
* code review changes
* code review changes
* Refactoring Appenderator
1) Added publishExecutor and handoffExecutor for background publishing and handing segments off
2) Change add() so that it does not move segments out
* Address comments
1) Remove publishTimeout for KafkaIndexTask
2) Simplifying registerHandoff()
3) Add incremental handoff test
* Remove unused variable
* Add persist() to Appenderator and more tests for AppenderatorDriver
* Remove unused imports
* Fix strict build
* Address comments
The result would be {"error":"Unknown exception",
"errorMessage":null, "errorClass":"java.lang.NullPointerException",
"host":null} when the JSON lacks "granularity".
* move ProtoBufInputRowParser from processing module to protobuf extensions
* Ported PR #3509
* add DynamicMessage
* fix local test stuff that slipped in
* add license header
* removed redundant type name
* removed commented code
* fix code style
* rename ProtoBuf -> Protobuf
* pom.xml: shade protobuf classes, handle .desc resource file as binary file
* clean up error messages
* pick first message type from descriptor if not specified
* fix protoMessageType null check. add test case
* move protobuf-extension from contrib to core
* document: add new configuration keys, and descriptions
* update document. add examples
* move protobuf-extension from contrib to core (2nd try)
* touch
* include protobuf extensions in the distribution
* fix whitespace
* include protobuf example in the distribution
* example: create a new pb obj every time
* document: use properly quoted json
* fix whitespace
* bump parent version to 0.10.1-SNAPSHOT
* ignore Override check
* touch
* Fixed (#4216)
Modify the default value of `druid.server.http.numThreads` to `Math.max(10, (Runtime.getRuntime().availableProcessors() * 17) / 16 + 2) + 30`
* Fixed(#4216)
Modify the default value of `druid.server.http.numThreads` to `max(10, (Number of cores * 17) / 16 + 2) + 30`
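A worked example of the formula; note the integer division:
```java
public class NumThreadsDefault
{
  public static void main(String[] args)
  {
    int cores = 8; // e.g. Runtime.getRuntime().availableProcessors()
    int numThreads = Math.max(10, (cores * 17) / 16 + 2) + 30;
    // (8 * 17) / 16 = 8, + 2 = 10, max(10, 10) + 30 = 40
    System.out.println(numThreads); // 40
  }
}
```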
This is useful for putting them behind load balancers or proxies, as it lets
the load balancer know which server is currently active through an http health
check.
Also makes the method naming a little more consistent between coordinator and
overlord code.
* optionally add extensions to explicitly specified hadoopContainerClassPath
* note that extensions are always pushed into the hadoop container when druid.extensions.hadoopContainerDruidClasspath is not provided explicitly
* coordinator lookups mgmt improvements
* revert replaces removal, deprecate it instead
* convert and use older specs stored in db
* more tests and updates
* review comments
* add behavior for 0.10.0 to 0.9.2 downgrade
* incorporating more review comments
* remove explicit lock and use LifecycleLock in LookupReferencesManager. use LifecycleLock in LookupCoordinatorManager as well
* wip on LookupCoordinatorManager
* lifecycle lock
* refactor thread creation into utility method
* more review comments addressed
* support smooth roll back of lookup snapshots from 0.10.0 to 0.9.2
* correctly use LifecycleLock in LookupCoordinatorManager and remove synchronization from start/stop
* run lookup mgmt on leader coordinator only
* wip: changes to do multiple start() and stop() on LookupCoordinatorManager
* lifecycleLock fix usage in LookupReferencesManagerTest
* add LifecycleLock back
* fix license hdr
* some fixes
* make LookupReferencesManager.getAllLookupsState() consistent while still being lockless
* address review comments
* addressing leventov's comments
* address charle's comments
* add IOE.java
* for safety in LookupReferencesManager mainThread check for lifecycle started state on each loop in addition to interrupt
* move thread creation utility method to Execs
* fix names
* add tests for LookupCoordinatorManager.lookupManagementLoop()
* add further tests for figuring out toBeLoaded and toBeDropped on LookupCoordinatorManager
* address leventov comments
* remove LookupsStateWithMap and parameterize LookupsState
* address review comments
* address more review comments
* misc fixes
Updating the description of useCache
Updating query-context doc based on Gian's comment
* Fix lz4 library incompatibility in kafka-indexing-service extension #3266
* Bumped Kafka version to 0.10.2.0 for : Fix lz4 library incompatibility in kafka-indexing-service extension #3266
* Replaced Lists.newArrayList() with Collections.singletonList() For Fix lz4 library incompatibility in kafka-indexing-service extension #4115
* Make timeout behavior consistent with the documentation
* Refactoring BlockingPool and add more methods to QueryContexts
* remove unused imports
* Addressed comments
* Address comments
* remove unused method
* Make default query timeout configurable
* Fix test failure
* Change timeout from period to millis
* Initial commit
* Apply another config: clustername
* Rename variable
* Fix bug
* Add retry logic
* Edit retry logic
* Upgrade kafka-clients version to the most recent release
* Make callback single object
* Write documentation
* Rewrite error message and emit logic
* Handling AlertEvent
* Override toString()
* make clusterName more optional
* bump up druid version
* add producer.config option, which lets users apply additional optional config values to the Kafka producer
* remove potential blocking in emit()
* using MemoryBoundLinkedBlockingQueue
* Fixing coding convention
* Remove logging every exception and just increment counting
* refactoring
* trivial modification
* logging when callback has exception
* Replace kafka-clients 0.10.1.1 with 0.10.2.0
* Resolve the classloader-related problem
* adopt try statement
* code reformatting
* make variables final
* rewrite toString
* RealtimeIndexTask to support alertTimeout in context and raise an alert if the task process still exists after the timeout
* move alertTimeout config to tuningConfig and document
It wasn't doing anything useful (the sequences were being concatted, and
cursor.getTime() wasn't being called) and it defaulted to Granularities.NONE.
Changing it to Granularities.ALL gave me a 700x+ performance boost on a
small dataset I was reindexing (2m27s to 365ms). Most of that was from avoiding
making a lot of unnecessary column selectors.
* Ignore chunkPeriod for groupBy v2, fix chunkPeriod for irregular periods.
Includes two fixes:
- groupBy v2 now ignores chunkPeriod, since it wouldn't have helped anyway (its mergeResults
returns a lazy sequence) and it generates incorrect results.
- Fix chunkPeriod handling for periods of irregular length, like "P1M" or "P1Y".
Also includes doc and test fixes:
- groupBy v1 was no longer being tested by GroupByQueryRunnerTest since #3953, now it
is once again.
- chunkPeriod documentation was misleading due to its checkered past. Updated it to
be more accurate.
* Remove unused import.
* Restore buffer size.
* Add SameIntervalMergeTask for easier usage of MergeTask
* fix a bug and add a unit test
* remove same_interval_merge_sub from Task.java and remove other unneeded code
Get rid of the metadataUpdateSpec section in the json example to
ingest parquet into druid. When this element is present, it will
fail to start an indexing job.
* NN optimization for hdfs data segments.
* HdfsDataSegmentKiller, HdfsDataSegment finder changes to use new storage
format. Docs updated.
* Common utility function in DataSegmentPusherUtil.
* new static method `makeSegmentOutputPathUptoVersionForHdfs` in JobHelper
* reuse getHdfsStorageDirUptoVersion in
DataSegmentPusherUtil.getHdfsStorageDir()
* Addressed comments.
* Review comments.
* HdfsDataSegmentKiller requested changes.
* extra newline
* Add maprfs.
* Refactor Segment Granularity
* Beginning of one granularity
* Copy the fix for custom periods in segment-granularity over here.
* Remove the custom serialization for now.
* Compilation cleanup
* Reformat code
* Fixing unit tests
* Unify to use a single iterable
* Backward compatibility for rolling upgrade
* Minor check style. Cosmetic changes.
* Rename length and millis to duration
* CR feedback
* Minor changes.
This puts all the SQL stuff in one place. It also makes life easier by
pointing out that configs can be made in either common.runtime.properties
or the broker runtime.properties.
* SQL: Resolve column type conflicts in favor of newer segments.
Helps with schema evolution from e.g. long -> float, which is supported
on the query side.
* Take columns from highest timestamp instead of max segment id.
* Fixes and docs.
* SQL: Add context and contextual functions to planner.
Added support for context parameters specified as JDBC connection properties
or a JSON object for SQL-over-JSON-over-HTTP.
Also added features that depend on context functionality:
- Added CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP functions.
- Added support for time zones other than UTC via a "timeZone" context.
- Pass down query context to Druid queries too.
Also some bug fixes:
- Fix DATE handling, it was largely done incorrectly before.
- Fix CAST(__time TO DATE) which should do a floor-to-day.
- Fix non-equality comparisons to FLOOR(__time TO X).
- Fix maxQueryCount property.
* Pass down context to nested queries too.
* Require Java 8 and include some Java 8 dependencies.
- Upgrade Jetty to 9.3.16.v20170120.
- Upgrade DataSketches to 0.8.4.
- Bundle caffeine-cache by default.
- Still target Java 7 when compiling base Druid classes.
* Update cluster, quickstart docs.
* Remove oraclejdk7 from travis.yml.
* Add extension for supporting kerberos security
- This PR adds an extension for supporting druid authentication via
Kerberos.
- Working on the docs.
* Add docs
* review comments
* more review comments
* Block all paths by default
* more review comments - use proper Oid
* Allow extensions to override httpclient for integration tests
* Add kerberos lock to prevent multithreaded issues.
* review comment - remove enabled flag and fix router injection
* Add Cookie Handling and more detailed docs
* review comment - rename DruidKerberosConfig -> AuthKerberosConfig
* review comments
* fix travis failure on jdk7
* SQL: Add resolution parameter to quantile agg, rename to APPROX_QUANTILE.
* Fix bug with re-use of filtered approximate histogram aggregators.
Also add APPROX_QUANTILE tests for filtering and running on complex columns.
Includes some slight refactoring to allow tests to make DruidTables that
include complex columns.
* Remove unused import
* SQL: Ditch CalciteConnection layer and add DruidMeta, extension aggregators.
Switched from CalciteConnection to Planner, bringing benefits:
- CalciteConnection's JDBC interface no longer sits between the SQL server
(HTTP/Avatica) and Druid's query layer. Instead, the SQL servers can use
Druid Sequence objects directly, reducing overhead in the query return path.
- Implemented our own Planner-based Avatica Meta, letting us control
connection timeouts and connection / statement limits. The previous
CalciteConnection-based implementation didn't have any limits or timeouts.
- The Planner interface lets us override the operator table, opening up
SQL language extensions. This patch includes two: APPROX_COUNT_DISTINCT
in core, and a QUANTILE aggregator in the druid-histogram extension.
Also:
- Added INFORMATION_SCHEMA metadata schema.
- Added tests for Unicode literals and escapes.
* Verify statement is actually open before closing it.
* More detailed INFORMATION_SCHEMA docs.
* streaming version of select query
* use columns instead of dimensions and metrics; prepare for valueVector; remove granularity
* respect query limit within historical
* use constant
* fix thread-name corruption bug when using a Jetty QTP thread rather than a processing thread while working with SpecificSegmentQueryRunner
* add some test for scan query
* add scan query document
* fix merge conflicts
* add compactedList resultFormat, this format is better for JSON ser/deser
* respect query timeout
* respect query limit on broker
* use static consts and remove unused code
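A hedged sketch of a scan query using the compactedList result format added above; the dataSource, columns, and interval are illustrative:
```java
public class ScanQuerySketch
{
  public static void main(String[] args)
  {
    String scanQuery = "{"
        + "\"queryType\": \"scan\","
        + " \"dataSource\": \"wikipedia\","
        + " \"columns\": [\"__time\", \"page\", \"added\"],"
        + " \"intervals\": [\"2016-01-01/2016-01-02\"],"
        + " \"resultFormat\": \"compactedList\","
        + " \"limit\": 100"
        + "}";
    System.out.println(scanQuery);
  }
}
```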
* SQL support for nested groupBys.
Allows, for example, doing exact count distinct by writing:
SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)
Contrast with approximate count distinct, which is:
SELECT COUNT(DISTINCT col) FROM druid.foo
* Add deeply-nested groupBy docs, tests, and maxQueryCount config.
* Extract magic constants into statics.
* Rework rules to put preconditions in the "matches" method.
* Add an option to SearchQuery to choose a search query execution strategy.
Supported strategies are
1) Index-only query execution
2) Cursor-based scan
3) Auto: choose an efficient strategy for a given query
* Add SearchStrategy and SearchQueryExecutor
* Address comments
* Rename strategies and set UseIndexesStrategy as the default strategy
* Add a cost-based planner for auto strategy
* Add document
* Fix code style
* apply code style
* apply comments
* allow JsonConfigTesterBase to handle fields that are collections
* [Feature] Exhibitor Support (#3664)
This patch provides the integration of Druid & Netflix Exhibitor. Druid
currently uses Apache Curator as its ZooKeeper client. Curator can be
integrated with Exhibitor to achieve a live/updating list of the
ZooKeeper ensemble. This patch enables Druid to use these features.
* Add metrics for Query Count statistics
This PR adds a new metrics monitor “QueryCountStatsMonitor” which emits
three new metrics -
1) query/success/count - number of successful queries
2) query/failed/count - number of failed queries
3) query/interrupted/count - number of interrupted/timedout queries
fix bindings
* make fields final
* fix imports
* AsyncQueryForwardingServlet implement QueryStatsProvider
* remove unused import
* add first and last aggregator
* add test and fix
* moving around
* separate aggregator valueType
* address PR comment
* add finalize inner query and adjust v1 inner indexing
* better test and fixes
* java-util import fixes
* PR comments
* Add first/last aggs to ITWikipediaQueryTest
* Thrift ingestion plugin
1. thrift binary is platform dependent, use scrooge to generate java files to avoid style check failure
2. stream and hadoop ingestion are both supported; the input format can be a sequence file or an LZO thrift block file.
3. base64 and protocol aware
change header
* fix conflicts in pom
* report message gap, source gap and sink count in RealtimePlumber
* report message gap, sink count in Appenderator
* add ingest/events/sourceGap in metrics.md
* remove source gap
* sortByDimsFirst flag for groupBy query
* Remove need for KeyType in Grouper<KeyType> to be Comparable<KeyType>
* fix review comments
* fix review comments regarding removing code duplication of dim/time comparison
* move comparator for KeyType object to KeySerdeFactory so that creation of comparator does not need KeySerde
* remove unnecessary system.out.println
* make access static var NATURAL_NULLS_FIRST directly
* further review comments addressing
* Normalized Cost Balancer
* Adding documentation and renaming to use diskNormalizedCostBalancer
* Remove balancer from the strings
* Update docs and include random cost balancer
* Fix checkstyle issues
* Blacklist workers if they fail for too many times
* Adding documentation
* Changing timeout to period and updating docs
* 1. Add configurable maxPercentageBlacklistWorkers
2. Rename variable
* Change maxPercentageBlacklistWorkers to double
* Remove thread.sleep
* option to reset offset automatically in case of OffsetOutOfRangeException
if the next offset is less than the earliest available offset for that partition
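A hedged sketch of the option as a Kafka tuningConfig flag; the key name follows the description above but should be checked against the docs:
```java
public class ResetOffsetSketch
{
  public static void main(String[] args)
  {
    // When true, a task hitting OffsetOutOfRangeException resets to the
    // earliest available offset for the partition instead of failing.
    String tuningConfig = "{\"type\": \"kafka\","
        + " \"resetOffsetAutomatically\": true}";
    System.out.println(tuningConfig);
  }
}
```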
* review comments
* refactoring
* refactor
* review comments
* Use Long timestamp as key instead of DateTime.
The DateTime representation breaks down when you store a map entry with one
DateTime object and read it with a different DateTime object for the same
instant. For example, the code below fails when you use DateTime as the key:
```java
import java.util.HashMap;
import org.joda.time.DateTime;
import org.joda.time.DateTimeUtils;
import org.joda.time.DateTimeZone;

DateTime odt = DateTime.now(DateTimeUtils.getZone(DateTimeZone.forID("America/Los_Angeles")));
HashMap<DateTime, String> map = new HashMap<>();
map.put(odt, "abc");
// Same instant, but created with the default time zone: DateTime.equals()
// and hashCode() also compare the chronology, so the lookup misses.
DateTime dt = new DateTime(odt.getMillis());
System.out.println(map.get(dt)); // prints "null", not "abc"
```
* Respect timezone when creating the file.
* Update docs with timezone caveat in granularity spec
* Remove unused imports
* Add "like" filter.
* Addressed some PR comments.
* Slight simplifications to LikeFilter.
* Additional simplifications.
* Fix comment in LikeFilter.
* Clarify comment in LikeFilter.
* Simplify LikeMatcher a bit.
* No use going through the optimized path if prefix is empty.
* Add more tests.
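A minimal sketch of the matching semantics the filter implements (SQL LIKE: '%' matches any sequence, '_' matches exactly one character); this is an illustration, not the LikeMatcher code:
```java
import java.util.regex.Pattern;

public class LikeSketch
{
  static boolean like(String value, String pattern)
  {
    StringBuilder regex = new StringBuilder();
    for (char c : pattern.toCharArray()) {
      if (c == '%') {
        regex.append(".*");       // any sequence of characters
      } else if (c == '_') {
        regex.append('.');        // exactly one character
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return value != null && value.matches(regex.toString());
  }

  public static void main(String[] args)
  {
    System.out.println(like("druid-0.10.1", "druid%")); // true
    System.out.println(like("druid", "dru_d"));         // true
  }
}
```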
* URIExtractionNamespace: Treat null values in lookup maps as missing entries.
This is useful when many logical lookups are derived from the same base JSON file,
and some lookups' values may be unknown sometimes.
* Add test, logging message, and address other comments.
* Update docs.
* Support string type in math expression
addressed comments
* Updated math function document
* Addressed comments
* support finding segments from AWS S3 storage.
* add more unit tests
* address comments and add a document for the feature.
* update docs indentation
* update docs indentation
* address comments.
1. add a unit test for JSON ser/deser of the config object.
2. more informative error message in a unit test.
* address comments.
1. use @Min to validate the configuration object
2. change updateDescriptor to a string as it does not take an argument otherwise
* fix a unit test failure - delete a unit test for testing default max length.
* Add support for timezone in segment granularity
* CR feedback. Handle null timezone during equals check.
* Include timezone in docs.
Add timezone for ArbitraryGranularitySpec.
* Show candidate hosts for the given query
* Added test cases & minor changes to address comments
* Changed path-param to query-param for intervals/numCandidates
Also change defaults:
- bufferGrouperMaxLoadFactor from 0.75 to 0.7.
- maxMergingDictionarySize to 100MB from 25MB, should be more appropriate
for most heaps.
Follow-up to #1773, which meant to add more useful query errors but
did not actually do so. Since that patch, any error other than
interrupt/cancel/timeout was reported as `{"error":"Unknown exception"}`.
With this patch, the error fields are:
- error, one of the specific strings "Query interrupted", "Query timeout",
"Query cancelled", or "Unknown exception" (same behavior as before).
- errorMessage, the message of the topmost non-QueryInterruptedException
in the causality chain.
- errorClass, the class of the topmost non-QueryInterruptedException
in the causality chain.
- host, the host that failed the query.
* Add time interval dim filter and retention analysis example
* Use closed-open matching for intervals, update cache key generation
* Fix time filtering tests for interval boundary change
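A hedged sketch of the filter's JSON shape with the closed-open interval matching noted above (start inclusive, end exclusive); the dimension and interval are illustrative:
```java
public class IntervalFilterSketch
{
  public static void main(String[] args)
  {
    // Matches rows whose __time falls in [2014-10-01, 2014-10-07).
    String filter = "{\"type\": \"interval\", \"dimension\": \"__time\","
        + " \"intervals\": [\"2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z\"]}";
    System.out.println(filter);
  }
}
```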
* ability to not rollup at index time, make pre aggregation an option
* rename getRowIndexForRollup to getPriorIndex
* fix doc misspelling
* test query using no-rollup indexes
* fix benchmark failure due to jmh bug
* Add numeric StringComparator
* Only use direct long comparison for numeric ordering in BoundFilter, add time filtering benchmark query
* Address PR comments, add multithreaded BoundDimFilter test
* Add comment on strlen tie handling
* Add timeseries interval filter benchmark
* Adjust docs
* Use jackson for StringComparator, address PR comments
* Add new TopNMetricSpec and SearchSortSpec with tests (WIP)
* More TopNMetricSpec and SearchSortSpec tests
* Fix NewSearchSortSpec serde
* Update docs for new DimensionTopNMetricSpec
* Delete NumericDimensionTopNMetricSpec
* Delete old SearchSortSpec
* Rename NewSearchSortSpec to SearchSortSpec
* Add TopN numeric comparator benchmark, address PR comments
* Refactor OrderByColumnSpec
* Add null checks to NumericComparator and String->BigDecimal conversion function
* Add more OrderByColumnSpec serde tests
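A minimal sketch of a numeric string comparator in the spirit of the change above, using the String-to-BigDecimal conversion it mentions; the null handling shown is an assumption:
```java
import java.math.BigDecimal;
import java.util.Comparator;

public class NumericComparatorSketch
{
  static BigDecimal toNumber(String s)
  {
    try {
      return s == null ? null : new BigDecimal(s);
    }
    catch (NumberFormatException e) {
      return null;
    }
  }

  static final Comparator<String> NUMERIC = (a, b) -> {
    BigDecimal x = toNumber(a);
    BigDecimal y = toNumber(b);
    if (x == null) {
      return y == null ? 0 : -1; // unparseable/null values sort first (assumed)
    }
    return y == null ? 1 : x.compareTo(y);
  };

  public static void main(String[] args)
  {
    System.out.println(NUMERIC.compare("9", "10"));  // negative: 9 < 10 numerically
    System.out.println("9".compareTo("10"));         // positive: "9" > "10" lexicographically
  }
}
```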
This fixes a potential issue where groupBy resources could be allocated to
create a Sequence, but then the Sequence is never used, and thus the resources
are never freed.
Also simplifies how groupBy handles config overrides (this made the new
unit test easier to write).
* InputRowParser to decode OrcStruct from OrcNewInputFormat
* add unit test for orc hadoop indexing
* update docs and fix test code bug
* doc updated
* resolve maven dependency conflict
* remove unused imports
* fix returning array type from Object[] to correct primitive array type
* fix to support getDimension() of MapBasedRow: change return type of ORC list from array to list
* rebase and updated based on comments
* updated based on comments
* on reflecting review comments
* fix bug in typeStringFromParseSpec() and add unit test
* add license header
* Support filtering on __time column
* Rename DruidPredicate
* Add docs for ValueMatcherFactory, add comment on getColumnCapabilities
* Combine ValueMatcherFactory predicate methods to accept DruidCompositePredicate
* Address PR comments (support filter on all long columns)
* Use predicate factory instead of composite predicate
* Address PR comments
* Lazily initialize long handling in selector/in filter
* Move long value parsing from InFilter to InDimFilter, make long value parsing thread-safe
* Add multithreaded selector/in filter test
* Fix non-final lock object in SelectorDimFilter
- Attempt to make things clearer in general
- Point out that HDFS deep storage and MR jobs don't use the same loading mechanism
- Recommend using mapreduce.job.classloader = true when possible
* Initial commit of caffeine cache
* Address code comments
* Move and fixup README.md a bit
* Improve caffeine readme information
* Cleanup caffeine pom
* Address review comments
* Bump caffeine to 2.3.1
* Bump druid version to 0.9.2-SNAPSHOT
* Make test not fail randomly.
See https://github.com/ben-manes/caffeine/pull/93#issuecomment-227617998 for an explanation
* Fix distribution and documentation
* Add caffeine to extensions.md
* Fix links in extensions.md
* Lexicographic
This is actually reasonable for groupBys or lexicographic topNs that are
being used to do a "COUNT DISTINCT" kind of query. No aggregators are
needed for that query, and including a dummy aggregator wastes 8 bytes
per row.
It's kind of silly for timeseries, but why not.
* support alphanumeric sort in search query
* address a comment about handling equals() and hashCode()
* address comments
* add unit tests for string comparators
* address a comment about space indentations.
This patch introduces a GroupByStrategy concept and two strategies: "v1"
is the current groupBy strategy and "v2" is a new one. It also introduces
a merge buffers concept in DruidProcessingModule, to try to better
manage memory used for merging.
Both of these are described in more detail in #2987.
There are two goals of this patch:
1. Make it possible for historical/realtime nodes to return larger groupBy
result sets, faster, with better memory management.
2. Make it possible for brokers to merge streams when there are no order-by
columns, avoiding materialization.
This patch does not do anything to help with memory management on the broker
when there are order-by columns or when there are nested queries. That could
potentially be done in a future patch.
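A hedged sketch of selecting a strategy per query via the query context; the key name ("groupByStrategy") is assumed from this era of Druid:
```java
public class GroupByStrategySketch
{
  public static void main(String[] args)
  {
    // Per-query override; a service-wide default can be configured too.
    String context = "{\"groupByStrategy\": \"v2\"}";
    System.out.println(context);
  }
}
```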
* docs: replace OR by AND inside topnquery docs about multi value dimensions
* docs: replace OR by AND inside groupby docs about multi value dimensions
* Cleanup the base lookup cluster wide config docs
* Add better examples in lookups-cached-global.md
* Use actual valid stock lookups
* Fixed maps with :
* Add mix of lookups
* Better examples in extension
* Remove unneeded namespace requirement
* Add extra line space
* Add link to lookup tiers
* Renamed header
* Async lookups-cached-global by default
* Also better lookup docs
* Fix test timeouts
* Fix timing of deserialized test
* Fix problem with 0 wait failing immediately
There is no such thing as a "Java aggregator" in Druid from a user's point of view; there are just native aggregators that happen to be implemented in Java.
- Make redirects for old links based on _redirects.json
- Replace #{DRUIDVERSION} tokens in docs with current version
- Allow origins named something other than "origin"
- Can use either s3cmd or awscli, depending on availability
* support LookupReferencesManager registration of namespaced lookup and eliminate static configurations for lookup from namespaced lookup extensions
- druid-namespace-lookup and druid-kafka-extraction-namespace are modified
- However, druid-namespace-lookup still has configuration about ON/OFF
HEAP cache manager selection, which is not namespace-wide
configuration but node-wide configuration, as multiple namespaces share
the same cache manager
* update KafkaExtractionNamespaceTest to reflect argument signature changes
* Add more synchronization functionality to NamespaceLookupExtractorFactory
* Remove old way of using extraction namespaces
* resolve compile error by supporting LookupIntrospectHandler
* Remove kafka lookups
* Remove unused stuff
* Fix start and stop behavior to be consistent with new javadocs
* Remove unused strings
* Add timeout option
* Address comments on configurations and improve docs
* Add more options and update hash key and replaces
* Move monitoring to the overriding classes
* Add better start/stop logging
* Remove old docs about namespace names
* Fix bad comma
* Add `@JsonIgnore` to lookup factory
* Address code review comments
* Remove ExtractionNamespace from module json registration
* Fix problems with naming and initialization. Add tests
* Optimize imports / reformat
* Fix future not being properly cancelled on failed initial scheduling
* Fix delete returns
* Add more docs about whole introspection
* Add `/version` introspection point for lookups
* Add more tests and address comments
* Add StaticMap extraction namespace for testing. Also add a bunch of tests
* Move cache system property to `druid.lookup.namespace.cache.type`
* Make VERSION lower case
* Change poll period to 0ms for StaticMap
* Move cache key to bytebuffer
* Change hashCode and equals on static map extraction fn
* Add more comments on StaticMap
* Address comments
* Make scheduleAndWait use a latch
* Sanity renames and fix imports
* Remove extra info in docs
* Fix review comments
* Strengthen failure on start from warn to error
* Address comments
* Rename namespace-lookup to lookups-cached-global
* Fix injective mis-naming
* Also add serde test
* Allow dynamically setting shutoffTime for EventReceiverFirehose
review comments and tests
* shut down exec on close
* Datasource as lookup tier
* Adds an option to let indexing service tasks pull their lookup tier from the datasource they are working for.
* Fix bad docs for lookups lookupTier
* Add Datasource name holder
* Move task and datasource to be pulled from Task file
* Make LookupModule pull from bound dataSource
* Fix test
* Fix code style on imports
* Fix formatting
* Make naming better
* Address code comments about naming
* Make URI Extraction Namespace take more sane arguments
* Fixes https://github.com/druid-io/druid/issues/2669
* Update docs
* Rename error message
* Undo overzealous deletion of docs
* Explain caching mechanism a bit more in docs
* Move kafka-extraction-namespace to the Lookup framework.
* Address comments
* Fix missing kafka introspection
* Fix tests to be less racy
* Make testing a bit more lenient
* Make tests even more forgiving
* Add comments to kafka lookup cache method
* Move startStopLock to just use started
* Make start() and stop() idempotent
* Forgot to update test after last change, test now accounts for idempotency
* Add extra idempotency on stop check
* Add more descriptive docs of behavior
* make isSingleThreaded groupBy query processing overridable at query time
* refactor code in GroupByMergedQueryRunner to make processing of single threaded and parallel merging of runners consistent
* Add back FilteredServerView removed in a32906c7fd to reduce memory usage using watched tiers.
* Add functionality to specify "druid.broker.segment.watchedDataSources"
* Document how to use roaring bitmaps
This fixes #2408.
While not all indexSpec properties are explained, it does explain how roaring bitmaps can be turned on.
* fix
* fix
* fix
* fix
The behavior is now that filters on "null" will match rows with no
values. The behavior in the past was inconsistent; sometimes these
filters would match and sometimes they wouldn't.
Adds tests for this behavior to SelectorFilterTest and
BoundFilterTest, for query-level filters and filtered aggregates.
Fixes #2750.
The reference to io.druid.extensions:kafka-extraction-namespace is wrong (should
be druid-kafka-extraction-namespace) and unnecessary (the extension id is written
at the top of the doc file).
This removes Filter.makeMatcher(ColumnSelectorFactory) and adds a
ValueMatcherFactory implementation to FilteredAggregatorFactory so it can
take advantage of existing makeMatcher(ValueMatcherFactory) implementations.
This patch also removes the Bound-based method from ValueMatcherFactory. Its
only user was the SpatialFilter, which could use the Predicate-based method.
Fixes #2604.
- Add central doc for multi-value dimensions, with some content from other docs.
- Link to multi-value dimension doc from topN and groupBy docs.
- Fixes a broken link from dimensionspecs.md, which was presciently already
linking to this nonexistent doc.
- Resolve inconsistent naming in docs & code (sometimes "multi-valued", sometimes
"multi-value") in favor of "multi-value".
This PR changes the retry of task actions to be a bit more aggressive
by reducing the maxWait. Current defaults were 1 min to 10 mins, which
led to a very delayed recovery in case there are any transient network
issues between the overlord and the peons.
doc changes.
To bring consistency to docs and source, this commit changes the default
values for maxRowsInMemory and rowFlushBoundary to 75000 after
discussion in PR https://github.com/druid-io/druid/pull/2457.
The previous default was 500000 and it's lower now on the grounds that
it's better for a default to be somewhat less efficient, and work,
than to reach for the stars and possibly result in
"OutOfMemoryError: java heap space" errors.
Two changes:
- Allow IncrementalIndex to suppress ParseExceptions on "aggregate".
- Add "reportParseExceptions" option to realtime tuning configs. By default this is "false".
Behavior of the counters should now be:
- processed: Number of rows indexed, including rows where some fields could be parsed and some could not.
- thrownAway: Number of rows thrown away due to rejection policy.
- unparseable: Number of rows thrown away due to being completely unparseable (no fields salvageable at all).
If "reportParseExceptions" is true then "unparseable" will always be zero (because a parse error would
cause an exception to be thrown). In addition, "processed" will only include fully parseable rows
(because even partial parse failures will cause exceptions to be thrown).
Fixes #2510.
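A hedged sketch of the new option in a realtime tuningConfig; it defaults to false, which keeps the lenient counter behavior described above:
```java
public class ParseExceptionSketch
{
  public static void main(String[] args)
  {
    // With true, any unparseable or partially-parseable row throws instead of
    // being counted under "unparseable"/"processed".
    String tuningConfig = "{\"type\": \"realtime\","
        + " \"reportParseExceptions\": true}";
    System.out.println(tuningConfig);
  }
}
```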
- Add druid.indexer.server.maxChatRequests, which sets up a QoSFilter on the main Jetty server.
- Deprecate druid.indexer.runner.separateIngestionEndpoint
- Deprecate druid.indexer.server.chathandler.*